# Paper Grading Assistant

## Modeling

Data comes from this link:
- https://www.kaggle.com/c/asap-aes/data

Heavy inspiration drawn from:
- https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45

(Use incognito window when opening that link)

## About this notebook

This notebook is the part of the grading process where a teacher might categorize his or her students' papers by letter grade.

The idea here is that the teacher will only need to adjust a few grades instead of having to grade an entire stack of papers.

In [1]:
# !pip install gensim
import os, sys
from gensim import corpora, models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re


In [2]:
# Run the utility functions from a separate notebook
%run topic_model_utils.ipynb

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
data = pd.read_csv("D:\\Kaggle\\asap-aes\\training_set_rel3.tsv", sep='\t')
# data.head()

In [4]:
data['tokenized_essay'] = data.essay.apply(process_text)

In [5]:
# replace NaN w/ 0
data = data.fillna(0)

# add a max_score column to use later 
# for standardizing scores, as all the 
# different essays sets have different 
# scales on which they were scored
data['max_score'] = 0
data.head()

| | essay_id | essay_set | essay | rater1_domain1 | rater2_domain1 | rater3_domain1 | domain1_score | rater1_domain2 | rater2_domain2 | domain2_score | ... | rater2_trait5 | rater2_trait6 | rater3_trait1 | rater3_trait2 | rater3_trait3 | rater3_trait4 | rater3_trait5 | rater3_trait6 | tokenized_essay | max_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Dear local newspaper, I think effects computer... | 4 | 4 | 0.0 | 8 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [dear, local, newspaper, think, effect, comput... | 0 |
| 1 | 2 | 1 | Dear @CAPS1 @CAPS2, I believe that using compu... | 5 | 4 | 0.0 | 9 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [dear, believe, using, computer, benefit, way,... | 0 |
| 2 | 3 | 1 | Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... | 4 | 3 | 0.0 | 7 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [dear, people, use, computer, agrees, benefit,... | 0 |
| 3 | 4 | 1 | Dear Local Newspaper, @CAPS1 I have found that... | 5 | 5 | 0.0 | 10 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [dear, local, newspaper, expert, computer, ben... | 0 |
| 4 | 5 | 1 | Dear @LOCATION1, I know having computers has a... | 4 | 4 | 0.0 | 8 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [dear, location, know, having, computer, posit... | 0 |


In [6]:
# change max score col based on essay set
# max vals:
# set 1: 12
# set 2: 10 or 24, needs some experimenting
# set 3: 3
# set 4: 3
# set 5: 4
# set 6: 4
# set 7: 30
# set 8: 60

essay_sets = data.essay_set.unique()


In [7]:
for set_ in essay_sets:
    if set_ == 1:
        data.loc[data.essay_set == set_, 'max_score'] = 12
    if set_ == 2:
        data.loc[data.essay_set == set_, 'max_score'] = 10
    if set_ == 3 or set_ == 4:
        data.loc[data.essay_set == set_, 'max_score'] = 3
    if set_ == 5 or set_ == 6:
        data.loc[data.essay_set == set_, 'max_score'] = 4
    if set_ == 7:
        data.loc[data.essay_set == set_, 'max_score'] = 30
    if set_ == 8:
        data.loc[data.essay_set == set_, 'max_score'] = 60
# spot checking some of the data
print(data.loc[data.essay_set == 1, 'max_score'])
print(data.loc[data.essay_set == 4, 'max_score'])
print(data.loc[data.essay_set == 7, 'max_score'])
print(data.loc[data.essay_set == 8, 'max_score'])

0       12
1       12
2       12
3       12
4       12
        ..
1778    12
1779    12
1780    12
1781    12
1782    12
Name: max_score, Length: 1783, dtype: int64
5309    3
5310    3
5311    3
5312    3
5313    3
       ..
7074    3
7075    3
7076    3
7077    3
7078    3
Name: max_score, Length: 1770, dtype: int64
10684    30
10685    30
10686    30
10687    30
10688    30
         ..
12248    30
12249    30
12250    30
12251    30
12252    30
Name: max_score, Length: 1569, dtype: int64
12253    60
12254    60
12255    60
12256    60
12257    60
         ..
12971    60
12972    60
12973    60
12974    60
12975    60
Name: max_score, Length: 723, dtype: int64


In [8]:
# create temp column holding the normalized score
# (fraction of the maximum), used for the
# model's later internal classes
data['temp'] = 0.0  # float, so the fractions below keep their dtype
for set_ in essay_sets:
    in_set = data.essay_set == set_
    if set_ == 2:
        # set 2 has two domain scores, which are summed
        data.loc[in_set, 'temp'] = (data.loc[in_set, 'domain1_score']
                                    + data.loc[in_set, 'domain2_score']) \
                                   / data.loc[in_set, 'max_score']
    else:
        data.loc[in_set, 'temp'] = data.loc[in_set, 'domain1_score'] \
                                   / data.loc[in_set, 'max_score']
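
The per-set maximum and the normalized score can also be computed without loops. Here is a minimal sketch on a toy frame, assuming the same column names as above. `MAX_SCORES`, `toy` and `normalized` are illustrative names and not part of the notebook.

```python
import pandas as pd

# Maximum attainable score per essay set, taken from the comments above.
# Set 2 uses 10 here, matching the cell above where its two domain scores are summed.
MAX_SCORES = {1: 12, 2: 10, 3: 3, 4: 3, 5: 4, 6: 4, 7: 30, 8: 60}

# Toy stand-in for the real data, for illustration only.
toy = pd.DataFrame({
    "essay_set":     [1, 2, 3, 8],
    "domain1_score": [9, 4, 3, 30],
    "domain2_score": [0, 3, 0, 0],
})

# Look up the maximum for each row's essay set.
toy["max_score"] = toy["essay_set"].map(MAX_SCORES)

# Only set 2 has a second domain score to add in; elsewhere it counts as 0.
raw = toy["domain1_score"] + toy["domain2_score"].where(toy["essay_set"] == 2, 0)
toy["normalized"] = raw / toy["max_score"]
print(toy)
```

A dictionary keeps the scale for each set in one place, which makes it easier to experiment with the alternative maximum for set 2.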

In [9]:
# re-classify each paper on a scale of 1-5,
# with 5 being a high score (like an A on an 
# ABCDF scale)
# .loc with a boolean mask writes to the DataFrame itself;
# chained indexing such as data['class'][x] = 5 may write to a copy.
data['class'] = 1
data.loc[data.temp >= .6, 'class'] = 2
data.loc[data.temp >= .7, 'class'] = 3
data.loc[data.temp >= .8, 'class'] = 4
data.loc[data.temp >= .9, 'class'] = 5
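
The same thresholds can also be expressed as bins. The sketch below uses `pd.cut` on a toy series of normalized scores; `edges` and `classes` are illustrative names, and the values are made up to hit every bin.

```python
import numpy as np
import pandas as pd

# Toy normalized scores (fraction of the maximum), for illustration only.
temp = pd.Series([0.95, 0.9, 0.85, 0.75, 0.65, 0.3])

# Left-closed bins: [0.9, inf) -> 5, [0.8, 0.9) -> 4, ..., below 0.6 -> 1.
edges = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
classes = pd.cut(temp, bins=edges, labels=[1, 2, 3, 4, 5], right=False).astype(int)
print(classes.tolist())
```

With `right=False` each bin includes its lower edge, so a score of exactly 0.9 lands in class 5, the same as the `>= .9` comparison.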


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

no_features = 1000

# Initialize tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, 
                                   min_df=3, 
                                   max_features=no_features, 
                                   stop_words='english', 
                                   preprocessor=' '.join)
tfidf = tfidf_vectorizer.fit_transform(data['tokenized_essay'])
# get_feature_names() was removed in scikit-learn 1.2
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Bag of words
tf_vectorizer = CountVectorizer(max_df=0.85, 
                                min_df=3, 
                                max_features=no_features, 
                                stop_words='english', 
                                preprocessor=' '.join)
tf = tf_vectorizer.fit_transform(data['tokenized_essay'])
# get_feature_names() was removed in scikit-learn 1.2
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Word2Vec
word2vec = WordEmbeddingsService()
word2vec_model = word2vec.train_w2v_model(tokenized_text=data['tokenized_essay'])
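
To see what `preprocessor=' '.join` and `max_df` do in the cell above, here is a small sketch on three hand-made token lists. The documents are invented for illustration, and `stop_words` is left out so the result depends only on the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pre-tokenized documents, standing in for data['tokenized_essay'].
docs = [
    ["dear", "local", "newspaper", "library"],
    ["dear", "teacher", "library", "book"],
    ["book", "shelf", "library"],
]

# preprocessor=' '.join turns each token list back into one string,
# which the vectorizer's default tokenizer then splits again.
vectorizer = TfidfVectorizer(preprocessor=" ".join, max_df=0.85)
matrix = vectorizer.fit_transform(docs)

# "library" occurs in all 3 documents (document frequency 1.0 > 0.85),
# so max_df removes it from the vocabulary.
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

The result is one row per document and one column per surviving term, the same layout the classifiers later receive as `X_tfidf`.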

In [11]:
# create a few different vectorizations of the data
# to see which version does the best

X_tfidf = tfidf
X_tf = tf
X_w2v = word2vec.create_word_embeddings(data['tokenized_essay'], word2vec_model)
y = data['class']
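
`create_word_embeddings` lives in `topic_model_utils.ipynb` and is not shown here. A common way to turn per-word vectors into one fixed-length document vector is to average them, and the sketch below illustrates that idea with a hand-made lookup table. This is an assumption about the general technique, not a description of what the helper actually does; `toy_vectors` and `average_vector` are hypothetical names.

```python
import numpy as np

# Hand-made 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "dear":     np.array([0.2, -0.1, 0.4]),
    "computer": np.array([0.6, -0.3, -0.2]),
    "benefit":  np.array([0.1, 0.5, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Return the mean of the known tokens' vectors, or zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Tokens missing from the lookup table are simply skipped.
doc_vec = average_vector(["dear", "computer", "unknownword"], toy_vectors)
print(doc_vec)
```

Dense vectors like this can contain negative components, which is why `MultinomialNB` is skipped for the word2vec features later in the notebook.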

In [12]:
# import all the different classifiers 
# to test with the paper scores
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBClassifier

In [13]:
def make_classification(classifier, X, y, rs=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = rs)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score, prec_score, rec_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    return cm, acc_score, f1, prec_score, rec_score

def make_confusion_matrix(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    prec_score = precision_score(y_test, y_pred, average='weighted')
    rec_score = recall_score(y_test, y_pred, average='weighted')
    return cm, acc_score, prec_score, rec_score
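
To make `average='weighted'` concrete, the sketch below recomputes accuracy, weighted precision and weighted recall by hand. It uses the log_reg tf-idf confusion matrix that is printed further down in this notebook; the variable names are illustrative.

```python
import numpy as np

# log_reg / tf-idf confusion matrix as printed later in this notebook
# (rows = true class, columns = predicted class).
cm = np.array([
    [813, 120,  53,   2,   9],
    [170, 366,  23,  40,  38],
    [ 94,  95, 221,  41,  26],
    [  7,  52,  24,  99,   6],
    [ 14,  54,  92,  34, 103],
])

support = cm.sum(axis=1)      # number of true examples per class
predicted = cm.sum(axis=0)    # number of predictions per class
correct = np.diag(cm)         # correctly classified examples per class

accuracy = correct.sum() / cm.sum()
per_class_precision = correct / predicted
per_class_recall = correct / support

# average='weighted' weights each class's score by its support.
weighted_precision = np.average(per_class_precision, weights=support)
weighted_recall = np.average(per_class_recall, weights=support)

print(round(accuracy, 4), round(weighted_precision, 4), round(weighted_recall, 4))
```

Support-weighted recall reduces to correct predictions over all examples, so it always equals accuracy. That is why the `rec` and `acc` values match in every result below.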

In [14]:
# create a dictionary of all the different classifiers
# to loop through.
# All of these are supervised models; the simpler ones
# (knn, nb) are included just for comparison.
classifiers = {
    "knn": KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2),
    "nb" : MultinomialNB(), 
    "log_reg": LogisticRegression(random_state=0),
    "lin_svm" : SVC(kernel = 'linear', random_state = 0), # took too long with word2vec (more than 5000 secs)
    "rbf_svm" : SVC(kernel = 'rbf', random_state = 0),
    "tree" : DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
    "rf" : RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0),
    "ada" : AdaBoostClassifier(random_state = 0),
    "gb" : GradientBoostingClassifier(random_state = 0),
    "xgb" : XGBClassifier(random_state = 0),
}
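
The `log_reg` runs in the cells that follow stop with "TOTAL NO. of ITERATIONS REACHED LIMIT", and the message itself suggests raising `max_iter`. Below is a minimal sketch of that change on synthetic data. The value 1000 is an arbitrary choice for illustration, not a tuned setting for the essay data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic 3-class problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = np.repeat([0, 1, 2], 50)
X[y == 1] += 1.5   # shift the classes apart so they are learnable
X[y == 2] -= 1.5

# max_iter defaults to 100; raising it gives the solver more room to converge.
log_reg = LogisticRegression(max_iter=1000, random_state=0)
log_reg.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(log_reg.n_iter_, log_reg.score(X, y))
```

If `n_iter_` comes out well below `max_iter`, the solver converged before hitting the limit.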

In [15]:
# tfidf vectors first, 3 min
tfidf_res = {}
for key in classifiers.keys():
    print(key)
    cm, acc, f1, prec, rec = make_classification(classifiers[key], X_tfidf, y)
    tfidf_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }
    print("==============")

knn
nb
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lin_svm
rbf_svm
tree
rf
ada
gb
xgb






In [16]:
# repeat classification with bag of words models, 2.5 min
tf_res = {}
for key in classifiers.keys():
    print(key)
    cm, acc, f1, prec, rec = make_classification(classifiers[key], X_tf, y)
    print("==============")
    tf_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }

knn
nb
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lin_svm
rbf_svm
tree
rf
ada
gb
xgb






In [17]:
# repeat classification with word2vec models, 5 min
w2v_res = {}
for key in classifiers.keys():
    # lin_svm takes more than 1 hour on its own.
    # nb doesn't accept negative numbers from the vectors.
    if key == 'lin_svm' or key == 'nb': 
        continue
    print(key)
    try:
        cm, acc, f1, prec, rec = make_classification(classifiers[key], X_w2v, y)
    except Exception as err:
        # record zeros for a model that fails, but report why
        print("failed:", err)
        cm, acc, f1, prec, rec = 0,0,0,0,0
    print("==============")
    w2v_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }

knn
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


rbf_svm


  _warn_prf(average, modifier, msg_start, len(result))


tree
rf
ada
gb
xgb






In [18]:
# everything else being equal,
# we want the model with the highest precision
# (precision is lowered by false positives, which
# include overestimating the grade of a paper)

for key in classifiers.keys():
    try:
        print(key)
        print("==================")
        print("tfidf acc: ", tfidf_res[key]['acc'])
        print("tfidf f1: ", tfidf_res[key]['f1'])
        print("tfidf precision: ", tfidf_res[key]['prec'])
        print("tfidf recall: ", tfidf_res[key]['rec'])
        print("==================")
        print("tf acc: ", tf_res[key]['acc'])
        print("tf f1: ", tf_res[key]['f1'])
        print("tf precision: ", tf_res[key]['prec'])
        print("tf recall: ", tf_res[key]['rec'])
        print("==================")
        print("w2v acc: ", w2v_res[key]['acc'])
        print("w2v f1: ", w2v_res[key]['f1'])
        print("w2v precision: ", w2v_res[key]['prec'])
        print("w2v recall: ", w2v_res[key]['rec'])
        print("==================")
    except KeyError:
        # nb and lin_svm have no word2vec results
        pass

knn
tfidf acc:  0.4518489984591679
tfidf f1:  0.3845584550763674
tfidf precision:  0.4528836182166853
tfidf recall:  0.4518489984591679
tf acc:  0.41756548536209553
tf f1:  0.2887254767695843
tf precision:  0.4608517797915959
tf recall:  0.41756548536209553
w2v acc:  0.40331278890600925
w2v f1:  0.389688959015477
w2v precision:  0.3850056213809215
w2v recall:  0.40331278890600925
nb
tfidf acc:  0.4872881355932203
tfidf f1:  0.4727566874766083
tfidf precision:  0.5062737992494357
tfidf recall:  0.4872881355932203
tf acc:  0.4476117103235747
tf f1:  0.4522116160194951
tf precision:  0.4892058007112515
tf recall:  0.4476117103235747
log_reg
tfidf acc:  0.6171032357473035
tfidf f1:  0.6096852266087265
tfidf precision:  0.6113537557873977
tfidf recall:  0.6171032357473035
tf acc:  0.5970724191063174
tf f1:  0.5930907314611129
tf precision:  0.5908905542533972
tf recall:  0.5970724191063174
w2v acc:  0.45454545454545453
w2v f1:  0.4256449606103841
w2v precision:  0.4129149380725789
w2v recal

Here are the best results from the training above. 

*Note:* I left out knn and nb. I generally include models like these only as a "shot in the dark" when searching for the optimal model. I attribute this to a short stint as a marketer, where testing EVERYTHING was an important part of the puzzle.

### log_reg
- tfidf acc:  0.6171032357473035
- tfidf f1:  0.6096852266087265
- tfidf precision:  0.6113537557873977
- tfidf recall:  0.6171032357473035

### lin_svm
- tfidf acc:  0.613251155624037
- tfidf f1:  0.608159930006525
- tfidf precision:  0.6131331303612533
- tfidf recall:  0.613251155624037

### rbf_svm
- tf acc:  0.6475346687211094
- tf f1:  0.6415366136770412
- tf precision:  0.6535206117830535
- tf recall:  0.6475346687211094

### tree
- tfidf acc:  0.5520030816640986
- tfidf f1:  0.5528056512924708
- tfidf precision:  0.5539368422055277
- tfidf recall:  0.5520030816640986

### rf
- tfidf acc:  0.6147919876733436
- tfidf f1:  0.605882006540668
- tfidf precision:  0.6049429246696592
- tfidf recall:  0.6147919876733436

### ada
- tf acc:  0.49768875192604006
- tf f1:  0.4845876911067036
- tf precision:  0.48788715673893546
- tf recall:  0.49768875192604006

### gb
- tfidf acc:  0.6348228043143297
- tfidf f1:  0.6312471077545355
- tfidf precision:  0.6348880983202261
- tfidf recall:  0.6348228043143297

### best word2vec model results (gradient boost)
- w2v acc:  0.5963020030816641
- w2v f1:  0.594721033739939
- w2v precision:  0.5999014604841247
- w2v recall:  0.5963020030816641

### xgb
- tfidf acc:  0.6432973805855162
- tfidf f1:  0.6412741241938258
- tfidf precision:  0.642001655993716
- tfidf recall:  0.6432973805855162
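
Weighted precision treats every wrong prediction alike, whether the model graded too high or too low. Because the classes are ordered, the confusion matrix can split the errors by direction. The sketch below does this for the log_reg tf-idf matrix printed in the next cell's output; the variable names are illustrative.

```python
import numpy as np

# log_reg / tf-idf confusion matrix from this notebook
# (rows = true class 1-5, columns = predicted class 1-5).
cm = np.array([
    [813, 120,  53,   2,   9],
    [170, 366,  23,  40,  38],
    [ 94,  95, 221,  41,  26],
    [  7,  52,  24,  99,   6],
    [ 14,  54,  92,  34, 103],
])

total = cm.sum()
over_graded = np.triu(cm, k=1).sum()     # predicted class above the true class
under_graded = np.tril(cm, k=-1).sum()   # predicted class below the true class

# Predictions that miss by two or more classes.
within_one = sum(np.diag(cm, k).sum() for k in (-1, 0, 1))
off_by_more_than_one = total - within_one

print("over-graded: ", over_graded, round(over_graded / total, 3))
print("under-graded:", under_graded, round(under_graded / total, 3))
print("off by 2+:   ", off_by_more_than_one, round(off_by_more_than_one / total, 3))
```

For this matrix the model under-grades more often than it over-grades, which is the direction a teacher reviewing the results would probably prefer to correct.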


In [19]:
for key in classifiers.keys():
    try:
        print(key)
        print("==================")
        print("tfidf cm: \n", tfidf_res[key]['cm'])
        print("==================")
        print("tf cm: \n", tf_res[key]['cm'])
        print("==================")
        print("w2v cm: \n", w2v_res[key]['cm'])
    except KeyError:
        # nb and lin_svm have no word2vec results
        pass

knn
tfidf cm: 
 [[897  62  34   1   3]
 [499  98  11  16  13]
 [295  29  99  17  37]
 [133  17   7  31   0]
 [166  26  48   9  48]]
tf cm: 
 [[986   6   5   0   0]
 [608  27   1   0   1]
 [401  12  58   0   6]
 [155  25   6   2   0]
 [210  21  55   0  11]]
w2v cm: 
 [[593 186 137  25  56]
 [270 231  42  55  39]
 [201  70 145  28  33]
 [ 47  68  38  34   1]
 [126  65  56   6  44]]
nb
tfidf cm: 
 [[598 204 136  48  11]
 [180 308   0 136  13]
 [125  70 197  85   0]
 [ 20  29   0 139   0]
 [ 68  73  89  44  23]]
tf cm: 
 [[530 168 147  88  64]
 [138 190   0 219  90]
 [104  45 201 115  12]
 [ 16  14   1 157   0]
 [ 43  35  88  47  84]]
log_reg
tfidf cm: 
 [[813 120  53   2   9]
 [170 366  23  40  38]
 [ 94  95 221  41  26]
 [  7  52  24  99   6]
 [ 14  54  92  34 103]]
tf cm: 
 [[802 117  58   8  12]
 [163 315  58  47  54]
 [ 90  74 210  49  54]
 [  7  40  32  86  23]
 [ 10  54  73  23 137]]
w2v cm: 
 [[649 214  88  36  10]
 [274 274   4  85   0]
 [161  80 184  50   2]
 [ 27  65  24  72   0

## Cross Validation

In [20]:
# Confusion matrices just for fun. The best models look to be
# SVM with rbf kernel and gradient boosting. Now for some cross validation.

# rbf svm uses tf (bag of words)
# gb uses tfidf
from sklearn.model_selection import cross_val_score, KFold

X_tfidf = tfidf
X_tf = tf
y = data['class']

svm_X_train, svm_X_test, svm_y_train, svm_y_test = train_test_split(X_tf, 
                                                                    y, 
                                                                    test_size = 0.2, 
                                                                    random_state = 42)

gb_X_train, gb_X_test, gb_y_train, gb_y_test = train_test_split(X_tfidf, 
                                                                y, 
                                                                test_size = 0.2, 
                                                                random_state = 42)

svm_accuracies = cross_val_score(estimator = classifiers['rbf_svm'], 
                                 X = svm_X_train, 
                                 y = svm_y_train, 
                                 cv = KFold(shuffle=True))

gb_accuracies = cross_val_score(estimator = classifiers['gb'], 
                                 X = gb_X_train, 
                                 y = gb_y_train, 
                                 cv = KFold(shuffle=True))

In [None]:
print("svm Accuracies: ", svm_accuracies)
print("svm Accuracies mean: ", svm_accuracies.mean())
print("GB Accuracies: ", gb_accuracies)
print("GB Accuracies mean: ", gb_accuracies.mean())

svm Accuracies:  0.646917148362235
GB Accuracies:  0.6466281310211945


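Both SVM and gradient boosting have cross-validation accuracies (about 0.647 each) that are in line with the initial values (0.648 and 0.635).

SVM has a slightly higher cross-validation accuracy (0.6469 vs. 0.6466) and better precision on the initial split (0.654 vs. 0.635), so it's the winner.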
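
The two cross-validation means differ only in the fourth decimal place, and `KFold(shuffle=True)` without a `random_state` reshuffles on every run, so the ranking could change from one run to the next. Here is a sketch of a repeatable, stratified variant on synthetic data. `StratifiedKFold` is swapped in as a suggestion and is not what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Small synthetic, imbalanced 3-class problem, for illustration only.
rng = np.random.default_rng(42)
y = np.repeat([0, 1, 2], [60, 30, 15])
X = rng.normal(size=(len(y), 4)) + y[:, None]

# Stratified folds keep the class mix similar in each fold;
# a fixed random_state makes the shuffling repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")

# The spread across folds helps judge whether a small gap in means matters.
print(scores, scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean shows whether a gap of 0.0003 between two models is larger than the fold-to-fold noise.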

## Optimization

In [22]:
# Now that we have a "best" model, it's time to make sure we
# are getting the best we can out of it.
from sklearn.model_selection import GridSearchCV

gamma_range = [1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,'scale','auto']
c_range = [1e-2,1e0,1e2,1e5]
svr_param_grid = {
    'kernel' : ('rbf', 'sigmoid'),
    'C' : c_range,
    'gamma' : gamma_range
}

gs = GridSearchCV(classifiers['rbf_svm'],svr_param_grid,cv=3,n_jobs=2)
gs.fit(X_tf,y)

In [None]:
print('gs.best_score_: \n')
print(gs.best_score_)
print('gs.best_params_: \n')
print(gs.best_params_)
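
The grid search above is fit on all of `X_tf`, so its `best_score_` is not directly comparable with the held-out numbers from the earlier split. A sketch of one alternative on synthetic data follows: tune on the training split only, then score the tuned model once on the test split. The grid here is deliberately tiny and the names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic 2-class problem, for illustration only.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(len(y), 4)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

param_grid = {"C": [1e-2, 1e0, 1e2], "gamma": [1e-3, 1e-2, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)   # tune on the training split only

print(search.best_params_)
test_score = search.score(X_test, y_test)   # held-out estimate for the tuned model
print(test_score)
```

`GridSearchCV` refits the best parameter combination on the whole training split by default, so `search.score` evaluates that refit model.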