# CommonLit - Simple SE and Excerpt Length EDA

In this notebook, we explore the distributions of the target (*Reading Ease*), *Standard Error*, *Excerpt Length* and *Word Count*. We also examine how the latter three variables relate to the target.

# Load Libraries

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Load Data

In [None]:
data_dir = '../input/commonlitreadabilityprize/'
train = pd.read_csv(data_dir + 'train.csv')
print(train.shape)
train.head()

# Explore Data

In [None]:
# Percentage of non-null values per column
print('% populated:')
train.notnull().mean() * 100

# Target (Reading Ease)

In [None]:
col = 'target'
display(pd.DataFrame(train[col].describe()))
sns.histplot(train[col])
plt.show()

*Reading Ease* averages around -1 and ranges from about -3.7 to 1.7.

# Standard Error

In [None]:
col = 'standard_error'
display(pd.DataFrame(train[col].describe()))
sns.histplot(train[col], bins=30)
plt.show()

In [None]:
print('correlation:')
display(train[['target', 'standard_error']].corr())

plt.figure(figsize=(8, 6))
sns.scatterplot(x=train['standard_error'], y=train['target'])
plt.axhline(train['target'].mean(), color='tab:red')
plt.text(0.03, 
         train['target'].mean() + 0.1, 
         'reading ease mean = {}'.format(round(train['target'].mean(), 2)),
         weight='bold')
plt.ylabel('reading ease')
plt.show()

The *Standard Error* generally increases as *Reading Ease* moves farther from its mean value.

According to the Data tab, *Standard Error* is a "measure of spread of scores among multiple raters for each excerpt." Looking at the above plot, we can say that the __scores from multiple raters are more consistent for excerpts with average *Reading Ease*__.
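The visual impression above can be checked numerically. The sketch below is only an illustration: the helper name `distance_from_mean_corr` is hypothetical and not part of the original notebook. It correlates each excerpt's absolute distance from the mean *Reading Ease* with its *Standard Error*. A clearly positive value would be consistent with the reading of the scatter plot.

```python
import pandas as pd


def distance_from_mean_corr(df, target_col='target', se_col='standard_error'):
    """Pearson correlation between |target - mean(target)| and the standard error.

    Hypothetical helper for illustration; column names default to those in train.csv.
    """
    # Absolute distance of each excerpt's score from the overall mean score
    distance = (df[target_col] - df[target_col].mean()).abs()
    # Positive result: standard error tends to grow away from the mean
    return distance.corr(df[se_col])
```

In the notebook this would be called as `distance_from_mean_corr(train)`.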

# Excerpt Length

In [None]:
train['excerpt_length'] = train['excerpt'].str.len()  # characters, including spaces
col = 'excerpt_length'
display(pd.DataFrame(train[col].describe()))
sns.histplot(train[col])
plt.show()

Excerpts range from ~670 to ~1340 characters (including spaces), with an average of ~970 characters.

In [None]:
print('correlation:')
display(train[['target', 'excerpt_length']].corr())

plt.figure(figsize=(8, 6))
sns.scatterplot(x=train['excerpt_length'], y=train['target'])
sns.regplot(x=train['excerpt_length'], y=train['target'], scatter=False, ci=None, color='tab:blue')
plt.ylabel('reading ease')
plt.show()

- __There is some negative correlation between *Reading Ease* and *Excerpt Length*__. Correlation Coefficient = -0.36. This makes sense as longer excerpts may be harder to read.

# Word Count

In [None]:
train['word_count'] = train['excerpt'].str.split().str.len()  # whitespace-separated tokens
col = 'word_count'
display(pd.DataFrame(train[col].describe()))
sns.histplot(train[col])
plt.show()

Excerpts range from ~130 to ~200 words (separated by whitespace), with an average of ~170 words.

In [None]:
print('correlation:')
display(train[['target', 'word_count']].corr())

plt.figure(figsize=(8, 6))
sns.scatterplot(x=train['word_count'], y=train['target'])
plt.ylabel('reading ease')
plt.show()

The negative correlation of *Reading Ease* with *Word Count* is quite small. I expected it to be stronger than the correlation with *Excerpt Length*, but that is not the case.
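One possible explanation, offered as an assumption rather than something tested in this notebook, is that *Excerpt Length* is roughly *Word Count* times the average word length, so the character count may partly reflect longer words. The sketch below adds an average word length feature so this can be examined. The helper name `add_avg_word_length` and the column name `avg_word_length` are hypothetical.

```python
import pandas as pd


def add_avg_word_length(df, text_col='excerpt'):
    """Return a copy of df with an 'avg_word_length' column (characters per word).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = df.copy()
    # Split each excerpt on whitespace, as in the word count above
    words = out[text_col].str.split()
    # Count characters inside words only (whitespace excluded)
    char_count = words.apply(lambda ws: sum(len(w) for w in ws))
    out['avg_word_length'] = char_count / words.str.len()
    return out
```

In the notebook, `add_avg_word_length(train)[['target', 'avg_word_length']].corr()` would show whether word length tracks *Reading Ease* more closely than *Word Count* does.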