# Paper Grading Assistant

## Modeling

Data comes from this link:
- https://www.kaggle.com/c/asap-aes/data

Heavy inspiration drawn from:
- https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45

(Use an incognito window when opening that link.)

In [1]:
# !pip install gensim
import os, sys
from gensim import corpora, models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re


In [2]:
# Run the utility functions from a separate notebook
%run topic_model_utils.ipynb

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
data = pd.read_csv("D:\\Kaggle\\asap-aes\\training_set_rel3.tsv", sep='\t')
# data.head()

In [4]:
data['tokenized_essay'] = data.essay.apply(process_text)
data['max_score'] = 0

In [5]:
# replace NaN w/ 0
data = data.fillna(0)

# add a max_score column to use later
# for standardizing scores, as the
# different essay sets were each
# scored on a different scale
data['max_score'] = 0
data.head()

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6,tokenized_essay,max_score
0,1,1,"Dear local newspaper, I think effects computer...",4,4,0.0,8,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[dear, local, newspaper, think, effect, comput...",0
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,0.0,9,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[dear, cap, believe, using, computer, benefit,...",0
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4,3,0.0,7,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[dear, cap, people, use, computer, agrees, ben...",0
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5,5,0.0,10,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[dear, local, newspaper, cap, expert, computer...",0
4,5,1,"Dear @LOCATION1, I know having computers has a...",4,4,0.0,8,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[dear, location, know, having, computer, posit...",0


In [6]:
# change max score col based on essay set
# max vals:
# set 1: 12
# set 2: 10 or 24, needs some experimenting
# set 3: 3
# set 4: 3
# set 5: 4
# set 6: 4
# set 7: 30
# set 8: 60

essay_sets = data.essay_set.unique()


In [7]:
for set_ in essay_sets:
    if set_ == 1:
        data.loc[data.essay_set == set_, 'max_score'] = 12
    if set_ == 2:
        data.loc[data.essay_set == set_, 'max_score'] = 10
    if set_ == 3 or set_ == 4:
        data.loc[data.essay_set == set_, 'max_score'] = 3
    if set_ == 5 or set_ == 6:
        data.loc[data.essay_set == set_, 'max_score'] = 4
    if set_ == 7:
        data.loc[data.essay_set == set_, 'max_score'] = 30
    if set_ == 8:
        data.loc[data.essay_set == set_, 'max_score'] = 60
# spot checking some of the data
print(data.loc[data.essay_set == 1, 'max_score'])
print(data.loc[data.essay_set == 4, 'max_score'])
print(data.loc[data.essay_set == 7, 'max_score'])
print(data.loc[data.essay_set == 8, 'max_score'])

0       12
1       12
2       12
3       12
4       12
        ..
1778    12
1779    12
1780    12
1781    12
1782    12
Name: max_score, Length: 1783, dtype: int64
5309    3
5310    3
5311    3
5312    3
5313    3
       ..
7074    3
7075    3
7076    3
7077    3
7078    3
Name: max_score, Length: 1770, dtype: int64
10684    30
10685    30
10686    30
10687    30
10688    30
         ..
12248    30
12249    30
12250    30
12251    30
12252    30
Name: max_score, Length: 1569, dtype: int64
12253    60
12254    60
12255    60
12256    60
12257    60
         ..
12971    60
12972    60
12973    60
12974    60
12975    60
Name: max_score, Length: 723, dtype: int64
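The per-set maximums can also be assigned with a dictionary lookup instead of a chain of `if` statements. This is a minimal sketch on a made-up frame; `MAX_SCORE_BY_SET` and `toy` are hypothetical names, and the values are the ones listed in the comments of cell 6 (with set 2 at 10, as in the loop).

```python
import pandas as pd

# Maximum score per essay set, taken from the comments in cell 6.
# Set 2 uses 10 here, matching the loop in cell 7.
MAX_SCORE_BY_SET = {1: 12, 2: 10, 3: 3, 4: 3, 5: 4, 6: 4, 7: 30, 8: 60}

# Toy frame standing in for `data` (hypothetical rows, not from the dataset).
toy = pd.DataFrame({"essay_set": [1, 2, 4, 7, 8]})

# Series.map looks up every essay_set value in the dict in one pass.
toy["max_score"] = toy["essay_set"].map(MAX_SCORE_BY_SET)
print(toy)
```

Any essay set missing from the dictionary would come back as `NaN`, which makes an unmapped set easy to spot.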


In [8]:
# create temp column for
# model's later internal classes:
# each score as a fraction of its set's maximum
data['temp'] = 0.0  # float, since the column will hold fractions
for set_ in essay_sets:
    in_set = data.essay_set == set_
    score = data.loc[in_set, 'domain1_score']
    if set_ == 2:
        # set 2 is scored on two domains, so add them together
        score = score + data.loc[in_set, 'domain2_score']
    data.loc[in_set, 'temp'] = score / data.loc[in_set, 'max_score']

In [9]:
# re-classify each paper on a scale of 1-5,
# with 5 being a high score (like an A on an 
# ABCDF scale)
# data.loc[row, col] writes to the frame itself; chained
# indexing (data['class'][x] = ...) can write to a copy and
# triggers pandas' SettingWithCopyWarning
data['class'] = 1
for x in range(len(data)):
    score = data.loc[x, 'temp']
    if score >= .9:
        data.loc[x, 'class'] = 5
    elif score >= .8:
        data.loc[x, 'class'] = 4
    elif score >= .7:
        data.loc[x, 'class'] = 3
    elif score >= .6:
        data.loc[x, 'class'] = 2
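The same banding can be done without a Python loop. This is a minimal sketch using `pandas.cut` on made-up normalized scores; the `temp` series here is hypothetical and only stands in for `data['temp']`.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized scores (score / max_score), not real data.
temp = pd.Series([0.95, 0.9, 0.85, 0.7, 0.65, 0.59, 0.0])

# Left-closed bins: [0.6, 0.7) -> 2, [0.7, 0.8) -> 3, [0.8, 0.9) -> 4,
# [0.9, inf) -> 5, and everything below 0.6 -> 1.
bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
classes = pd.cut(temp, bins=bins, labels=[1, 2, 3, 4, 5], right=False).astype(int)
print(classes.tolist())
```

`right=False` is what keeps a score of exactly 0.9 in the top band, matching the `>=` comparisons used in the cell.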


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

no_features = 1000

# Initialize tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, 
                                   min_df=3, 
                                   max_features=no_features, 
                                   stop_words='english', 
                                   preprocessor=' '.join)
tfidf = tfidf_vectorizer.fit_transform(data['tokenized_essay'])
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() replaces it (available from 1.0 on)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Bag of words
tf_vectorizer = CountVectorizer(max_df=0.85, 
                                min_df=3, 
                                max_features=no_features, 
                                stop_words='english', 
                                preprocessor=' '.join)
tf = tf_vectorizer.fit_transform(data['tokenized_essay'])
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Word2Vec
word2vec = WordEmbeddingsService()
word2vec_model = word2vec.train_w2v_model(tokenized_text=data['tokenized_essay'])
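Because the essays are already tokenized, the vectorizers receive lists of tokens instead of strings, and `preprocessor=' '.join` is what makes that work. The sketch below shows the idea on three made-up documents; `docs`, `tfidf_demo` and `bow_demo` are hypothetical names, and the `max_df`/`min_df`/`max_features` settings from the cell are left out so the tiny corpus is not filtered away.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical pre-tokenized essays, standing in for data['tokenized_essay'].
docs = [
    ["dear", "local", "newspaper", "computer"],
    ["computer", "benefit", "people"],
    ["people", "use", "computer"],
]

# preprocessor=' '.join turns each token list back into one string,
# which the vectorizer then splits with its default token pattern.
tfidf_demo = TfidfVectorizer(preprocessor=" ".join)
X_tfidf_demo = tfidf_demo.fit_transform(docs)

bow_demo = CountVectorizer(preprocessor=" ".join)
X_bow_demo = bow_demo.fit_transform(docs)

print(sorted(tfidf_demo.vocabulary_))  # the learned vocabulary
print(X_bow_demo.toarray())            # raw counts, one row per document
```

Note that a custom `preprocessor` replaces the built-in lowercasing and accent-stripping step, so the tokens need to be normalized before they reach the vectorizer.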

In [11]:
# create a few different vectorizations of the data
# to see which version does the best

X_tfidf = tfidf
X_tf = tf
X_w2v = word2vec.create_word_embeddings(data['tokenized_essay'], word2vec_model)
y = data['class']
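`create_word_embeddings` lives in the utility notebook, which is not shown here, so its exact behavior is unknown. One common way to turn per-word vectors into a single fixed-length vector per essay is to average them; the sketch below illustrates that idea only, with a hypothetical `mean_embedding` helper and made-up 3-dimensional vectors.

```python
import numpy as np

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional word vectors (a trained model would supply these).
toy_vectors = {
    "computer": np.array([1.0, 0.0, -1.0]),
    "people": np.array([0.0, 2.0, 1.0]),
}

# "use" has no vector here, so only the other two tokens are averaged.
doc_vec = mean_embedding(["people", "use", "computer"], toy_vectors, dim=3)
print(doc_vec)
```

Averaged embeddings can contain negative components, which matters for classifiers that require non-negative features.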

In [12]:
# import all the different classifiers 
# to test with the paper scores
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBClassifier

In [13]:
def make_classification(classifier, X, y, rs=42):
    """Fit `classifier` on an 80/20 split and return its test-set metrics."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = rs)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score, prec_score, rec_score = make_confusion_matrix(y_test, y_pred)
    # 'weighted' averages the per-class scores by each class's support
    f1 = f1_score(y_test, y_pred, average='weighted')
    return cm, acc_score, f1, prec_score, rec_score

def make_confusion_matrix(y_test, y_pred):
    """Return the confusion matrix plus accuracy and weighted precision/recall."""
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    prec_score = precision_score(y_test, y_pred, average='weighted')
    rec_score = recall_score(y_test, y_pred, average='weighted')
    return cm, acc_score, prec_score, rec_score

In [14]:
# create a dictionary of all the different classifiers
# to loop through.
# All of these are supervised classifiers; the simpler
# ones (knn, nb) are included just for comparison.
classifiers = {
    "knn": KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2),
    "nb" : MultinomialNB(), 
    "log_reg": LogisticRegression(random_state=0),
    "lin_svm" : SVC(kernel = 'linear', random_state = 0), # took too long with word2vec (more than 5000 secs)
    "rbf_svm" : SVC(kernel = 'rbf', random_state = 0),
    "tree" : DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
    "rf" : RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0),
    "ada" : AdaBoostClassifier(random_state = 0),
    "gb" : GradientBoostingClassifier(random_state = 0),
    "xgb" : XGBClassifier(random_state = 0),
}

In [15]:
# tfidf vectors first, 3 min
tfidf_res = {}
for key in classifiers.keys():
    print(key)
    cm, acc, f1, prec, rec = make_classification(classifiers[key], X_tfidf, y)
    tfidf_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }
    print("==============")

knn
nb
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lin_svm
rbf_svm
tree
rf
ada
gb
xgb
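The convergence warning in the output above means the logistic regression solver hit its iteration limit (the default `max_iter` is 100) before converging. A minimal sketch of the first remedy the warning suggests, raising `max_iter`, is shown below on synthetic data; `X_demo`, `y_demo` and the value 1000 are arbitrary choices for illustration, not settings taken from this notebook.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical), only to make the sketch runnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Raising max_iter above the default of 100 gives the solver more room.
clf = LogisticRegression(max_iter=1000, random_state=0)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot be missed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf.fit(X_demo, y_demo)

print(clf.n_iter_)  # iterations actually used
```

If `n_iter_` still equals `max_iter` on the real data, scaling the features (the warning's other suggestion) is the next thing to try.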






In [16]:
# repeat classification with bag of words models, 2.5 min
tf_res = {}
for key in classifiers.keys():
    print(key)
    cm, acc, f1, prec, rec = make_classification(classifiers[key], X_tf, y)
    print("==============")
    tf_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }

knn
nb
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


lin_svm
rbf_svm
tree
rf
ada
gb
xgb






In [17]:
# repeat classification with word2vec models, 5 min
w2v_res = {}
for key in classifiers.keys():
    # lin_svm takes more than 1 hour on its own.
    # nb doesn't accept negative numbers from the vectors.
    if key == 'lin_svm' or key == 'nb': 
        continue
    print(key)
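    try:
        cm, acc, f1, prec, rec = make_classification(classifiers[key], X_w2v, y)
    except Exception as err:
        # record zeros, but report the failure instead of hiding it
        print(f"{key} failed: {err}")
        cm, acc, f1, prec, rec = 0,0,0,0,0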
    print("==============")
    w2v_res[key] = {
        'cm' : cm,
        'acc' : acc,
        'f1' : f1,
        'prec' : prec,
        'rec' : rec
    }

knn
log_reg


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


rbf_svm


  _warn_prf(average, modifier, msg_start, len(result))


tree
rf
ada
gb
xgb
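The word2vec loop skips `nb` because `MultinomialNB` rejects negative feature values. One untested option, should `nb` be wanted in this comparison, is to rescale each embedding dimension to the range [0, 1] first. The sketch below does that with a hypothetical `min_max_scale` helper on made-up arrays; scikit-learn's `MinMaxScaler` performs the same kind of scaling, and `GaussianNB` is an alternative that accepts negative inputs directly.

```python
import numpy as np

def min_max_scale(train, test):
    """Scale features to [0, 1] using statistics from the training split only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant features

    def scale(a):
        # Clip so test values outside the training range stay in [0, 1].
        return np.clip((a - lo) / span, 0.0, 1.0)

    return scale(train), scale(test)

# Hypothetical embeddings with negative components.
train = np.array([[-1.0, 0.5], [1.0, -0.5], [0.0, 0.0]])
test = np.array([[-2.0, 1.0]])

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Whether a multinomial model is a sensible fit for rescaled embeddings is a separate question; this only removes the hard error.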






In [18]:
# everything else being equal,
# we want the model with the highest precision
# (precision is lowered by false positives, which
# here include overestimating the grade of a paper)

for key in classifiers.keys():
    try:
        print(key)
        print("==================")
        print("tfidf acc: ", tfidf_res[key]['acc'])
        print("tfidf f1: ", tfidf_res[key]['f1'])
        print("tfidf precision: ", tfidf_res[key]['prec'])
        print("tfidf recall: ", tfidf_res[key]['rec'])
        print("==================")
        print("tf acc: ", tf_res[key]['acc'])
        print("tf f1: ", tf_res[key]['f1'])
        print("tf precision: ", tf_res[key]['prec'])
        print("tf recall: ", tf_res[key]['rec'])
        print("==================")
        print("w2v acc: ", w2v_res[key]['acc'])
        print("w2v f1: ", w2v_res[key]['f1'])
        print("w2v precision: ", w2v_res[key]['prec'])
        print("w2v recall: ", w2v_res[key]['rec'])
        print("==================")
    except KeyError:
        # nb and lin_svm were skipped for word2vec, so they have no w2v results
        pass

knn
tfidf acc:  0.45377503852080125
tfidf f1:  0.38660270480340003
tfidf precision:  0.45451067082238295
tfidf recall:  0.45377503852080125
tf acc:  0.41640986132511554
tf f1:  0.2857625663744916
tf precision:  0.45449679052202635
tf recall:  0.41640986132511554
w2v acc:  0.3983050847457627
w2v f1:  0.3853732858118746
w2v precision:  0.3804922404417706
w2v recall:  0.3983050847457627
nb
tfidf acc:  0.487673343605547
tfidf f1:  0.4733505994306289
tfidf precision:  0.5085628883422912
tfidf recall:  0.487673343605547
tf acc:  0.4464560862865948
tf f1:  0.450944452802403
tf precision:  0.48771433931539104
tf recall:  0.4464560862865948
log_reg
tfidf acc:  0.6167180277349769
tfidf f1:  0.6092357866617898
tfidf precision:  0.6108999346788971
tfidf recall:  0.6167180277349769
tf acc:  0.5982280431432974
tf f1:  0.5946366613761039
tf precision:  0.5926540993034797
tf recall:  0.5982280431432974
w2v acc:  0.4530046224961479
w2v f1:  0.42431693929298675
w2v precision:  0.4109848284751718
w2v rec
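Long printouts like the one above are easier to compare as a table. A minimal sketch, assuming result dictionaries shaped like `tfidf_res` but without the `'cm'` entry; the values are the tf-idf numbers from the output above, rounded to four places, and `tfidf_demo_res` is a hypothetical name.

```python
import pandas as pd

# Values copied from the tf-idf output above (rounded to 4 places).
tfidf_demo_res = {
    "knn": {"acc": 0.4538, "f1": 0.3866, "prec": 0.4545, "rec": 0.4538},
    "nb": {"acc": 0.4877, "f1": 0.4734, "prec": 0.5086, "rec": 0.4877},
    "log_reg": {"acc": 0.6167, "f1": 0.6092, "prec": 0.6109, "rec": 0.6167},
}

# One row per classifier, one column per metric, best precision first.
table = pd.DataFrame.from_dict(tfidf_demo_res, orient="index")
table = table.sort_values("prec", ascending=False)
print(table)
```

With the real result dictionaries, the `'cm'` arrays would need to be dropped first so that every cell holds a single number.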

Here are the best results from the training above. 

*Note:* I left out knn and nb, the simple baseline models, because I generally test them only as a "shot in the dark" look while searching for the optimal model. I attribute this to a short stint as a marketer, where testing EVERYTHING was an important part of the puzzle.

### log_reg
- tfidf acc:  0.6879815100154083
- tfidf f1:  0.6826731058258083
- tfidf precision:  0.6840742515662983
- tfidf recall:  0.6879815100154083

### lin_svm
- tfidf acc:  0.6798921417565486
- tfidf f1:  0.6796612464208107
- tfidf precision:  0.6851511607249138
- tfidf recall:  0.6798921417565486

### rbf_svm
- tf acc:  0.714175654853621
- tf f1:  0.7137245090277426
- tf precision:  0.7245950249260565
- tf recall:  0.714175654853621

### tree
- tfidf acc:  0.6302003081664098
- tfidf f1:  0.6289259312408813
- tfidf precision:  0.6278195501435767
- tfidf recall:  0.6302003081664098

### rf
- tfidf acc:  0.687211093990755
- tfidf f1:  0.6795152273421196
- tfidf precision:  0.6789466187123883
- tfidf recall:  0.687211093990755

### ada
- tf acc:  0.5670261941448382
- tf f1:  0.5388070054332165
- tf precision:  0.5626726131804644
- tf recall:  0.5670261941448382


### gb
- tfidf acc:  0.7126348228043143
- tfidf f1:  0.7112178825280995
- tfidf precision:  0.7124145728645078
- tfidf recall:  0.7126348228043143

### gb with word2vec (the best of the word2vec models)
- w2v acc:  0.6771956856702619
- w2v f1:  0.6786444062419466
- w2v precision:  0.6813704223813889
- w2v recall:  0.6771956856702619

### xgb
- tfidf acc:  0.7245762711864406
- tfidf f1:  0.7223852160700287
- tfidf precision:  0.7212926330136676
- tfidf recall:  0.7245762711864406
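The weighted metrics can be reproduced by hand from a confusion matrix, which also shows how often a model over- or underestimates a grade. The sketch below uses the log_reg tf-idf matrix from the cell 19 output and follows scikit-learn's layout, where rows are true classes and columns are predicted classes. It recovers the accuracy and weighted precision printed for log_reg in the cell 18 output (a different run from the summary figures listed above).

```python
# log_reg tf-idf confusion matrix as printed in the cell 19 output
# (rows = true class 1..5, columns = predicted class 1..5).
cm = [
    [814, 118, 53, 3, 9],
    [169, 367, 22, 41, 38],
    [97, 94, 219, 41, 26],
    [7, 53, 25, 97, 6],
    [14, 54, 91, 34, 104],
]
n = len(cm)
total = sum(map(sum, cm))
support = [sum(row) for row in cm]  # true count per class
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums

accuracy = sum(cm[i][i] for i in range(n)) / total
# Per-class precision weighted by support, as average='weighted' does.
weighted_precision = sum(
    support[j] * cm[j][j] / predicted[j] for j in range(n)
) / total

# Cells above the diagonal: predicted class is higher than the true class.
overestimated = sum(cm[i][j] for i in range(n) for j in range(i + 1, n))
# Cells below the diagonal: predicted class is lower than the true class.
underestimated = sum(cm[i][j] for i in range(n) for j in range(i))

print(accuracy, weighted_precision, overestimated, underestimated)
```

Counting the cells above and below the diagonal separates overestimated grades from underestimated ones, which a single weighted precision figure does not do.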




In [19]:
for key in classifiers.keys():
    try:
        print(key)
        print("==================")
        print("tfidf cm: \n", tfidf_res[key]['cm'])
        print("==================")
        print("tf cm: \n", tf_res[key]['cm'])
        print("==================")
        print("w2v cm: \n", w2v_res[key]['cm'])
    except KeyError:
        # nb and lin_svm were skipped for word2vec, so they have no w2v results
        pass

knn
tfidf cm: 
 [[900  63  31   0   3]
 [496  99  11  18  13]
 [296  27  97  19  38]
 [132  17   7  32   0]
 [162  27  48  10  50]]
tf cm: 
 [[988   7   2   0   0]
 [609  26   1   0   1]
 [404  11  56   0   6]
 [154  28   5   1   0]
 [210  22  55   0  10]]
w2v cm: 
 [[583 189 139  25  61]
 [270 228  41  57  41]
 [202  69 146  28  32]
 [ 47  69  37  34   1]
 [126  65  57   6  43]]
nb
tfidf cm: 
 [[599 204 136  48  10]
 [181 307   0 136  13]
 [125  70 197  85   0]
 [ 20  29   0 139   0]
 [ 69  72  88  44  24]]
tf cm: 
 [[530 168 147  88  64]
 [137 192   0 219  89]
 [107  45 198 115  12]
 [ 16  14   1 157   0]
 [ 43  35  90  47  82]]
log_reg
tfidf cm: 
 [[814 118  53   3   9]
 [169 367  22  41  38]
 [ 97  94 219  41  26]
 [  7  53  25  97   6]
 [ 14  54  91  34 104]]
tf cm: 
 [[800 113  63   9  12]
 [164 315  59  47  52]
 [ 90  71 216  48  52]
 [  7  39  35  84  23]
 [  8  54  73  24 138]]
w2v cm: 
 [[646 214  89  37  11]
 [273 273   3  88   0]
 [164  77 181  52   3]
 [ 27  60  26  75   0