<a href="https://www.kaggle.com/code/mikedelong/python-eda-mostly-bars-and-scatters?scriptVersionId=143036023" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import pandas as pd

df = pd.read_csv(filepath_or_buffer='/kaggle/input/commonlit-texts/commonlit_texts.csv')
# Word count of each description (whitespace-separated tokens).
df['description_length'] = df['description'].apply(func=lambda x: len(x.split()))
df.head()

In [None]:
df.info()

In [None]:
df.nunique()

In [None]:
from plotly.express import histogram
histogram(data_frame=df, x='grade')

In [None]:
from plotly.express import bar
bar(data_frame=df['author'].value_counts().nlargest(n=20).to_frame().reset_index(), x='author', y='count')

Weird how the top twenty is dominated by authors who are not exactly household names.

In [None]:
bar(data_frame=df.sort_values(by='grade'), x='genre', category_orders={'genre': sorted(df['genre'].unique().tolist())}, hover_name='title', color='grade')

This gives us a sense of how the corpus is dominated by Informational Text, Poem, and Short Story documents, and the color gradient gives us a vague sense of the proportions of the grades in each vertical. Maybe lexile sorting would be more helpful?
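
If the color gradient is too vague, a cross-tabulation gives the grade proportions within each genre as actual numbers. This is only a sketch on made-up rows (the `sample` frame is hypothetical and stands in for `df[['genre', 'grade']]`):

```python
import pandas as pd

# Hypothetical rows standing in for df[['genre', 'grade']].
sample = pd.DataFrame({
    'genre': ['Poem', 'Poem', 'Poem', 'Short Story', 'Short Story', 'Informational Text'],
    'grade': [5, 5, 8, 6, 8, 8],
})

# normalize='index' makes each row sum to 1: the share of each grade within a genre.
grade_share = pd.crosstab(index=sample['genre'], columns=sample['grade'], normalize='index')
print(grade_share.round(2))
```

On the real data the same call would take `df['genre']` and `df['grade']` directly.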

In [None]:
bar(data_frame=df.sort_values(by='lexile'), x='genre', category_orders={'genre': sorted(df['genre'].unique().tolist())}, hover_name='title', color='lexile')

This tells us we have no lexile scores for most poems and for some other documents (if I remember correctly, Lexile is computed from sentence length and word frequency, which is arguably inappropriate for some kinds of poetry). Also, it looks like our lexile data has a few outliers on the high end, which eat up a big chunk of our colorbar here.
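
One way to keep a few high scores from stretching the colorbar is to clip the color range to a percentile band. The sketch below uses made-up scores (the `lexile` series is hypothetical), and the 5th to 95th percentile band is an arbitrary choice, not something taken from the data:

```python
import pandas as pd

# Hypothetical lexile scores standing in for df['lexile']; None marks a missing score.
lexile = pd.Series([400, 650, 800, 900, 1000, 1100, 1200, None, 2500], dtype='float64')

# quantile() skips missing values, so the band is computed from scored texts only.
low, high = lexile.quantile([0.05, 0.95]).tolist()
range_color = [low, high]
print(range_color)
```

Passing `range_color=range_color` to the `bar` call should then pin the color scale to that band, with scores outside it shown in the end colors.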

Maybe volumetric plots would be helpful. Let's try.

In [None]:
from plotly.express import treemap
treemap(data_frame=df[['genre', 'grade']].groupby(by=['genre', 'grade']).size().reset_index(name='count'), names='genre', values='count', path=['genre'])

In [None]:
treemap(data_frame=df[['genre', 'grade']].groupby(by=['genre', 'grade']).size().reset_index(name='count'), names='grade', values='count', path=['grade'])

In [None]:
treemap(data_frame=df[['author', 'genre']].groupby(by=['author', 'genre']).size().reset_index(name='count'),
        names='author', values='count', path=['author'], color='genre')

Most authors contribute only a few texts each, so the corpus is spread across many low-count authors.
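
To put a number on that impression, we could count texts per author and see what share of authors appear only once. A minimal sketch on a made-up author column (the `authors` series is hypothetical and stands in for `df['author']`):

```python
import pandas as pd

# Hypothetical author column standing in for df['author'].
authors = pd.Series(['A', 'A', 'A', 'B', 'C', 'D', 'E', 'E'])

texts_per_author = authors.value_counts()
# Share of authors who contribute exactly one text.
single_text_share = (texts_per_author == 1).mean()
print(texts_per_author.describe())
print(f'{single_text_share:.0%} of authors appear once')
```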

In [None]:
from plotly.express import scatter
scatter(data_frame=df, x='grade', y='lexile', color='genre', hover_name='title', trendline='ols', trendline_scope='overall')

Lexile is a measure of reading difficulty; we expect it to rise linearly with grade.

In [None]:
from plotly.express import scatter_matrix
scatter_matrix(data_frame=df, dimensions=['grade', 'lexile', 'description_length'], color='genre', hover_name='title')

These are more fun than informative. Look at all the pretty colors. 
Also see how the lexile outliers really stand out once we know to look for them. 

In [None]:
from plotly.express import violin
violin(data_frame=df, x='grade', y='lexile', hover_name='title')

This graph looks weird, but it shows how the bulk of the lexile distribution rises slowly with the grade. Maybe a ridge/joy plot would be helpful here.

In [None]:
scatter(data_frame=df, x='grade', y='description_length', color='genre', hover_name='title', trendline='ols', trendline_scope='overall')

We would like description length to be a proxy for something, but it doesn't seem to be. Descriptions get longer for higher-grade texts, but not by much. Really the value here is in the slope of the OLS trendline.
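
If the slope is the interesting part, we can compute it directly instead of reading it off the chart. This sketch fits a straight line with `numpy.polyfit` on made-up rows (the `sample` frame is hypothetical); for a single predictor, a degree-1 least-squares fit should give the same line as an ordinary least squares regression with an intercept:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df[['grade', 'description_length']].
sample = pd.DataFrame({
    'grade': [3, 4, 5, 6, 7, 8, 9, 10],
    'description_length': [30, 34, 31, 36, 35, 39, 38, 42],
})

# Degree-1 least-squares fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(sample['grade'], sample['description_length'], deg=1)
print(f'slope: {slope:.2f} words per grade, intercept: {intercept:.2f}')
```

On the real data, rows with missing values in either column would need to be dropped before fitting.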

In [None]:
scatter(data_frame=df, y='lexile', x='description_length', color='genre', hover_name='title', trendline='ols', trendline_scope='overall')

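Description length and lexile score are positively correlated for the corpus overall. The slope of the OLS trendline is larger here, though the two slopes are in different units (lexile points per word here, words per grade above), so they are not directly comparable.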

In [None]:
from plotly.express import imshow
imshow(img=df[['grade', 'lexile', 'description_length']].dropna().corr(numeric_only=True), color_continuous_scale='Blues')

We've worked really hard in the charts above to get three numbers: the pairwise correlations among these three quantities. We're not surprised that the lexile scores are somewhat highly correlated with the grade, since they're meant to capture reading difficulty; and the different positive correlations between the description length and the other two measures are what we've been looking for in the scatter plots (at a more granular level, maybe?) above.
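
Those three numbers can also be pulled out of the correlation matrix as plain values, one per pair of columns. A minimal sketch on made-up rows (the `sample` frame is hypothetical and stands in for the three numeric columns of `df`):

```python
from itertools import combinations

import pandas as pd

# Hypothetical rows standing in for the three numeric columns of df.
sample = pd.DataFrame({
    'grade': [3, 5, 6, 8, 9, 11],
    'lexile': [520, 700, 810, 900, 1010, 1150],
    'description_length': [28, 35, 31, 40, 37, 44],
})

corr = sample.corr()
# One entry per unordered pair of columns; the diagonal and mirrored half are skipped.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
for (a, b), r in pairs.items():
    print(f'{a} vs {b}: {r:.2f}')
```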

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Use the plt namespace so matplotlib's imshow does not shadow plotly's imshow imported above.
plt.subplots(figsize=(12, 12))
plt.axis('off')
plt.imshow(WordCloud().generate(' '.join(df['description'].values.tolist())))

Even after WordCloud's default stopword removal, the descriptions are dominated by filler words. Words like discuss, describe, speaker, poem, etc. are more form than content in a corpus like this.
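
A possible next step is to treat those filler words as extra stopwords and see what is left. The sketch below counts words with the standard library on two made-up descriptions; both word lists are illustrative only, and `describes` and `students` are additions for this toy example rather than words taken from the corpus:

```python
from collections import Counter
import re

# Hypothetical descriptions standing in for df['description'].
descriptions = [
    'In this poem, the speaker describes a storm at sea.',
    'Students discuss how the speaker in the poem views the storm.',
]

# A tiny generic stopword list plus corpus-specific filler words (both illustrative).
generic_stopwords = {'a', 'at', 'how', 'in', 'the', 'this'}
filler_words = {'discuss', 'describe', 'describes', 'speaker', 'poem', 'students'}
stopwords = generic_stopwords | filler_words

# Lowercase, split into alphabetic tokens, and drop anything in the stopword set.
tokens = re.findall(r"[a-z']+", ' '.join(descriptions).lower())
counts = Counter(token for token in tokens if token not in stopwords)
print(counts.most_common(3))
```

The same idea should carry over to the word cloud itself: `WordCloud` accepts a `stopwords` set, and the package exports a default `STOPWORDS` set that could be unioned with the filler words.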