# Exploratory Data Analysis

In this notebook, we explored whether it would be feasible to train models on the scores of specific raters. Ultimately, we decided that no single rater provided enough data to train an effective model.

In [1]:
# Import libraries for dataframes, numerical arrays, and rater-agreement metrics
import pandas as pd
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Enables inline display of plots within the Python Notebook (instead of having them pop up on new windows)
%matplotlib inline

# Display figures the same way they will be saved.
%config InlineBackend.print_figure_kwargs = {'bbox_inches': 'tight'}

# Import Python libraries for plotting
import seaborn as sns
sns.set_theme(style="white")
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.rcParams.update({
    'figure.dpi': 300,
    "font.family": "serif",
})

In [2]:
df = pd.read_csv('../data/All_adjudicated_ELL_data_1022.csv')

Check whether `Overall_1` is simply the rounded average of the six analytic scores (`Cohesion_1` through `Conventions_1`). It should instead be a holistic score assigned by the human rater.

In [3]:
(
    pd.DataFrame()
    .assign(
        Overall_1 = df['Overall_1'],
        Average_1 = (
            df[['Cohesion_1', 'Syntax_1', 'Vocabulary_1', 'Phraseology_1', 'Grammar_1', 'Conventions_1']]
            .mean(axis=1)
            .round(0)
            .astype(int)
        )
    )
    # Note: this keeps every row except those where Overall_1 is exactly one below Average_1
    .query('Overall_1 != Average_1-1')
)

Unnamed: 0,Overall_1,Average_1
1,3,3
2,3,3
3,3,3
4,3,3
5,3,3
...,...,...
8875,3,3
8876,4,4
8877,4,4
8878,4,4
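
The filtered table above still contains rows where the two columns agree, so a single summary number may be easier to read. The following is a minimal sketch, assuming the same column names as the cell above. The small `toy` frame and the helper name `share_matching_average` are hypothetical and stand in for `df`. It reports the share of rows in which the holistic score equals the rounded mean of the six analytic scores. Note that `round` in pandas rounds exact halves to the nearest even value.

```python
import pandas as pd

ANALYTIC_COLS = ['Cohesion_1', 'Syntax_1', 'Vocabulary_1',
                 'Phraseology_1', 'Grammar_1', 'Conventions_1']

def share_matching_average(frame: pd.DataFrame) -> float:
    """Fraction of rows where Overall_1 equals the rounded analytic mean."""
    rounded_mean = frame[ANALYTIC_COLS].mean(axis=1).round(0).astype(int)
    return float((frame['Overall_1'] == rounded_mean).mean())

# Hypothetical toy data standing in for df; not taken from the real dataset.
toy = pd.DataFrame({
    'Overall_1':     [3, 4, 2],
    'Cohesion_1':    [3, 4, 3],
    'Syntax_1':      [3, 4, 3],
    'Vocabulary_1':  [3, 4, 3],
    'Phraseology_1': [3, 4, 3],
    'Grammar_1':     [3, 4, 3],
    'Conventions_1': [3, 4, 3],
})

# Two of the three toy rows match their rounded average.
print(share_matching_average(toy))
```

On the real data, a value clearly below 1.0 would be consistent with `Overall_1` being scored independently of the analytic average.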


In [46]:
# Count how often each rater appears in either rater column
pd.concat([df["Rater_1"], df["Rater_2"]]).value_counts()

alorapruitt        1605
brittnybyrom       1370
rswab10            1338
Joselyn            1222
sullrich1          1205
skonuk1            1062
sulynnn            1043
MelodyYates         991
dhughes23           890
hannah-page         860
jfu7                780
JonathanPutnam      778
jrose18             748
Gmaldonado2         605
ACohen              587
jbarton8            447
SavladorW2          360
MelanieKnezevic     359
emilyatkins28       330
Maurice             260
CristinaL423        241
jessicawillis       225
soniyairfat         200
SallyRen            119
jcarver              75
deborahsanta         31
nnixon               27
jfu7                  2
Name: count, dtype: int64
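
`jfu7` appears twice in the counts above, which suggests that the rater columns hold two distinct strings that display identically. One possible cause, which is an assumption here and not confirmed by the output shown, is stray leading or trailing whitespace. A minimal sketch of stripping whitespace before counting, using a hypothetical toy frame in place of `df`:

```python
import pandas as pd

# Hypothetical toy data; the trailing space in "jfu7 " is an assumed cause.
toy = pd.DataFrame({
    "Rater_1": ["jfu7", "jfu7 ", "ACohen"],
    "Rater_2": ["ACohen", "Joselyn", "jfu7"],
})

# Stack both rater columns, strip surrounding whitespace, then count.
counts = (
    pd.concat([toy["Rater_1"], toy["Rater_2"]])
    .str.strip()
    .value_counts()
)
print(counts)
```

Inspecting the raw values, for example with `df["Rater_1"].unique()`, would show whether whitespace is really the difference before applying this to the real data.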

In [62]:
# Count essays per unordered rater pair (a set ignores which rater was listed first)
with pd.option_context("display.max_rows", None):
    display(pd.Series(df[["Rater_1", "Rater_2"]].values.tolist()).apply(set).value_counts())

{brittnybyrom, jrose18}              519
{alorapruitt, hannah-page}           446
{alorapruitt, sullrich1}             436
{jbarton8, sulynnn}                  389
{JonathanPutnam, jfu7}               375
{hannah-page, rswab10}               300
{skonuk1, Joselyn}                   284
{dhughes23, Joselyn}                 278
{Gmaldonado2, sullrich1}             276
{MelodyYates, skonuk1}               269
{alorapruitt, rswab10}               230
{MelodyYates, Joselyn}               226
{emilyatkins28, rswab10}             218
{SavladorW2, brittnybyrom}           216
{JonathanPutnam, alorapruitt}        214
{brittnybyrom, jfu7}                 208
{dhughes23, MelodyYates}             200
{dhughes23, skonuk1}                 200
{brittnybyrom, rswab10}              200
{ACohen, rswab10}                    167
{Gmaldonado2, alorapruitt}           167
{ACohen, dhughes23}                  150
{sulynnn, skonuk1}                   149
{MelanieKnezevic, brittnybyrom}      144
{sulynnn, Melody
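
Python sets are unhashable, so counting them with `value_counts` may not behave the same way across pandas versions. A hedged alternative, sketched here with a hypothetical toy frame in place of `df`, is to represent each pair as a sorted tuple, which is hashable and still ignores rater order:

```python
import pandas as pd

# Hypothetical toy data standing in for df.
toy = pd.DataFrame({
    "Rater_1": ["a", "b", "a", "c", "a"],
    "Rater_2": ["b", "a", "c", "a", "b"],
})

# Sort each pair so that (a, b) and (b, a) map to the same key.
pairs = pd.Series(
    [tuple(sorted(pair)) for pair in zip(toy["Rater_1"], toy["Rater_2"])]
)
pair_counts = pairs.value_counts()
print(pair_counts)
```

Sorting assumes that no row has a missing rater name; rows with missing values would need to be dropped or filled first.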

In [44]:
# For each first rater, count how often each second rater was paired with them
with pd.option_context("display.max_rows", None):
    display(df.groupby(["Rater_1"])["Rater_2"].value_counts())

Rater_1          Rater_2        
ACohen           dhughes23           35
                 soniyairfat         12
                 Joselyn              2
CristinaL423     skonuk1             20
                 Joselyn             15
                 ACohen              14
                 nnixon               4
                 MelodyYates          1
Gmaldonado2      alorapruitt        139
                 emilyatkins28       55
                 sullrich1           54
                 rswab10              5
JonathanPutnam   jfu7               315
                 alorapruitt        174
                 sullrich1           47
                 brittnybyrom        26
                 jrose18              4
                 MelanieKnezevic      2
Joselyn          dhughes23          256
                 CristinaL423        78
                 Maurice             72
                 ACohen              48
                 MelodyYates         31
                 skonuk1             24
       

In [23]:
# Essays where neither rater is alorapruitt or brittnybyrom
df[~df["Rater_1"].isin(["alorapruitt", "brittnybyrom"]) & ~df["Rater_2"].isin(["alorapruitt", "brittnybyrom"])]

Unnamed: 0,Filename,ID,Text,Rater_1,Overall_1,Cohesion_1,Syntax_1,Vocabulary_1,Phraseology_1,Grammar_1,...,Identifying_Info_1,Rater_2,Overall_2,Cohesion_2,Syntax_2,Vocabulary_2,Phraseology_2,Grammar_2,Conventions_2,Identifying_Info_2
56,2081005746.txt,2081005746,Dear principle\r\n\r\nI think that it is a goo...,Gmaldonado2,3,3,3,4,3,4,...,0,emilyatkins28,3,3,3,3,3,3,2,0
57,2081005797.txt,2081005797,Dear: Principal\r\n\r\nI think that all studen...,Gmaldonado2,3,3,3,3,3,3,...,0,emilyatkins28,3,2,3,3,2,3,3,0
58,2081006041.txt,2081006041,Dear principal.\r\n\r\nWe like to clean litter...,Gmaldonado2,3,3,3,3,3,3,...,0,emilyatkins28,3,3,3,3,2,3,3,0
59,2081006815.txt,2081006815,Some of your friends perform community service...,Gmaldonado2,0,0,0,0,0,0,...,0,emilyatkins28,0,0,0,0,0,0,0,0
60,2081006840.txt,2081006840,dear principal\r\n\r\ni think some of student ...,Gmaldonado2,3,2,2,3,2,2,...,0,emilyatkins28,2,3,2,3,2,2,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8874,AAAXMP138200002211022133_OR.txt,AAAXMP138200002211022133_OR,The decision regarding extracurricular involve...,skonuk1,4,4,4,3,4,4,...,0,Joselyn,4,4,4,4,4,4,5,0
8875,AAAXMP138200002211062115_OR.txt,AAAXMP138200002211062115_OR,"The school plans to change to a new, healthier...",skonuk1,3,3,3,3,3,3,...,0,Joselyn,4,4,3,4,3,4,4,0
8876,AAAXMP138200002211752151_OR.txt,AAAXMP138200002211752151_OR,I raised by my grandparents and they always to...,skonuk1,4,4,3,4,4,3,...,0,Joselyn,4,4,3,4,4,3,4,0
8877,AAAXMP138200002213552114_OR.txt,AAAXMP138200002213552114_OR,Imagine a world where people would not be affe...,emilyatkins28,4,4,4,4,4,4,...,0,rswab10,3,3,3,3,3,4,4,0


In [15]:
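# Essays whose first rater is alorapruitt or brittnybyrom and whose second rater is
# alorapruitt, brittnybyrom, or hannah-page
df[df["Rater_1"].isin(["alorapruitt", "brittnybyrom"]) & df["Rater_2"].isin(["alorapruitt", "brittnybyrom", "hannah-page"])]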

Unnamed: 0,Filename,ID,Text,Rater_1,Overall_1,Cohesion_1,Syntax_1,Vocabulary_1,Phraseology_1,Grammar_1,...,Identifying_Info_1,Rater_2,Overall_2,Cohesion_2,Syntax_2,Vocabulary_2,Phraseology_2,Grammar_2,Conventions_2,Identifying_Info_2
392,AAAUUP138180000009642117_OR.txt,AAAUUP138180000009642117_OR,What would be a life without art? Just with th...,alorapruitt,4,4,4,5,5,4,...,0,hannah-page,3,3,3,4,4,3,3,0
438,AAAUUP138180000032062126_OR.txt,AAAUUP138180000032062126_OR,On my opinion Schools should be allow or follo...,alorapruitt,2,3,2,3,3,2,...,0,hannah-page,2,3,2,2,2,2,1,0
443,AAAUUP138180000035192140_OR.txt,AAAUUP138180000035192140_OR,My dad usually told me that failure is the mot...,alorapruitt,4,4,3,4,4,3,...,0,hannah-page,3,3,3,4,4,3,3,0
503,AAAUUP138180000045812140_OR.txt,AAAUUP138180000045812140_OR,''Success consists of going from failure to fa...,alorapruitt,3,3,3,3,4,3,...,0,hannah-page,3,3,3,4,3,3,2,0
512,AAAUUP138180000046432140_OR.txt,AAAUUP138180000046432140_OR,I think that yes people should try to get thei...,alorapruitt,3,3,3,3,3,2,...,0,hannah-page,2,2,2,3,3,3,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8859,AAAXMP138200002200562850_OR.txt,AAAXMP138200002200562850_OR,I disagree about what the principal has decide...,alorapruitt,3,2,2,3,3,2,...,0,hannah-page,3,3,3,3,3,3,3,0
8860,AAAXMP138200002200972850_OR.txt,AAAXMP138200002200972850_OR,Your principal has decided all students must p...,alorapruitt,3,4,3,4,3,4,...,0,hannah-page,3,3,4,3,4,4,3,0
8866,AAAXMP138200002205252144_OR.txt,AAAXMP138200002205252144_OR,Should these summer projects be teacher-design...,alorapruitt,3,2,3,3,2,3,...,0,hannah-page,3,3,3,3,3,3,3,0
8869,AAAXMP138200002207732810_OR.txt,AAAXMP138200002207732810_OR,Asking more than one person for and advice hel...,alorapruitt,3,3,3,4,2,3,...,0,hannah-page,3,3,3,3,3,3,3,0
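
The `cohen_kappa_score` import at the top of the notebook is not used in the cells shown. As one possible follow-up, and purely as a sketch, agreement between the two raters on a filtered subset such as the one above could be summarised with a weighted kappa. The toy scores and the choice of quadratic weights are assumptions, not something the notebook specifies.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical toy scores standing in for a filtered subset of df.
toy = pd.DataFrame({
    "Overall_1": [3, 4, 2, 3, 4, 3],
    "Overall_2": [3, 3, 2, 3, 4, 2],
})

# Quadratic weights penalise large score gaps more than adjacent-score gaps.
kappa = cohen_kappa_score(toy["Overall_1"], toy["Overall_2"], weights="quadratic")
print(round(kappa, 3))
```

With very few essays per rater pair, the estimate would be noisy, which fits the notebook's conclusion about limited per-rater data.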
