# M - Automated Essay Scoring
_School of Information Technology_<br>
_Monash University Malaysia_<br>
(c) Copyright 2020, Ian Tan & Jun Qing Lim

Steps

- Read dataset (ASAP)
- Extract features (into file) using EASE
- Conduct machine learning (scikit-learn libraries)
    - Naive Bayes
    - SVR
    - BLRR (later)
- Evaluate (QWK)

## Import Libraries

In [None]:
import time
start = time.time()

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm #SVR is in SVM
from sklearn.metrics import accuracy_score, confusion_matrix

### Import the EASE functions, which are located in the `ease` folder.

In [None]:
import sys
sys.path.insert(1, 'ease')
import create
import grade 
import model_creator 
import predictor_extractor 
import predictor_set 
import util_functions
import essay_set
import feature_extractor

from essay_set import EssaySet
from feature_extractor import FeatureExtractor

## Read Dataset

The ASAP-AES dataset (the Hewlett Foundation dataset from Kaggle) is in the folder `asap-aes`. We use `training_set_rel3` for both training and testing. Note that `test_set` and `valid_set` cannot be used, as they do not contain the scores; they were meant for the competition to score the entries.

In [None]:
data_set = pd.read_csv("asap-aes/training_set_rel3.tsv", sep='\t', encoding="latin-1")

In [None]:
data_set['essay'] = [entry.lower() for entry in data_set['essay']] # lower case for all words in essay

There are 8 different essay sets. As an overview:
- Sets 1 & 2 are persuasive/narrative essays in the form of letters
- Sets 3, 4, 5 & 6 are source-dependent responses to a given essay
- Sets 7 & 8 are persuasive/narrative essays in the form of story writing

These shared formats make the dataset suitable for transfer learning.

In [None]:
data_set_1 = data_set[data_set['essay_set'] == 1]
data_set_2 = data_set[data_set['essay_set'] == 2]
#data_set_3 = data_set[data_set['essay_set'] == 3]
#data_set_4 = data_set[data_set['essay_set'] == 4]
#data_set_5 = data_set[data_set['essay_set'] == 5]
#data_set_6 = data_set[data_set['essay_set'] == 6]
#data_set_7 = data_set[data_set['essay_set'] == 7]
#data_set_8 = data_set[data_set['essay_set'] == 8]

Each subset retains the original row index of the full dataset, so we reset the index of each one. This makes it easier to match each essay with its score.

In [None]:
data_set_1 = data_set_1.reset_index() # resets index
data_set_2 = data_set_2.reset_index()
#data_set_3 = data_set_3.reset_index()
#data_set_4 = data_set_4.reset_index()
#data_set_5 = data_set_5.reset_index()
#data_set_6 = data_set_6.reset_index()
#data_set_7 = data_set_7.reset_index()
#data_set_8 = data_set_8.reset_index()

We use only the `essay` text and its score (`domain1_score`).

In [None]:
# If you want for the whole dataset.
# Commented out as we will work on individual datasets
#essays = data_set['essay']
#scores = data_set['domain1_score']

In [None]:
essays_1 = data_set_1['essay']
scores_1 = data_set_1['domain1_score']
essays_2 = data_set_2['essay']
scores_2 = data_set_2['domain1_score']
#essays_3 = data_set_3['essay']
#scores_3 = data_set_3['domain1_score']
#essays_4 = data_set_4['essay']
#scores_4 = data_set_4['domain1_score']
#essays_5 = data_set_5['essay']
#scores_5 = data_set_5['domain1_score']
#essays_6 = data_set_6['essay']
#scores_6 = data_set_6['domain1_score']
#essays_7 = data_set_7['essay']
#scores_7 = data_set_7['domain1_score']
#essays_8 = data_set_8['essay']
#scores_8 = data_set_8['domain1_score']

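Rename the scores from `domain1_score` to `score`. Each of these is a pandas Series, so the name is changed with `rename`; assigning to `.columns` would only attach an unused attribute and leave the name unchanged.

In [None]:
scores_1 = scores_1.rename("score")
scores_2 = scores_2.rename("score")
#scores_3 = scores_3.rename("score")
#scores_4 = scores_4.rename("score")
#scores_5 = scores_5.rename("score")
#scores_6 = scores_6.rename("score")
#scores_7 = scores_7.rename("score")
#scores_8 = scores_8.rename("score")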

The above could be put into a loop, but it is left as is so that you can easily pick and choose the sets you need.
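A minimal sketch of such a loop, assuming `data_set` has the `essay_set`, `essay` and `domain1_score` columns read above. `split_by_essay_set` is a hypothetical helper, not part of the original notebook; it uses `reset_index(drop=True)` so that the old index is not kept as an extra column, and `rename` to set the Series name.

```python
import pandas as pd


def split_by_essay_set(data_set, set_ids):
    """Return {set_id: (essays, scores)} with a fresh 0-based index per set."""
    splits = {}
    for set_id in set_ids:
        # Select one essay set and drop the original row labels,
        # so that essays[i] and scores[i] refer to the same essay.
        subset = data_set[data_set["essay_set"] == set_id].reset_index(drop=True)
        essays = subset["essay"]
        # A Series has a name rather than columns, so rename it directly.
        scores = subset["domain1_score"].rename("score")
        splits[set_id] = (essays, scores)
    return splits


# Usage with a tiny made-up frame standing in for the ASAP data.
demo = pd.DataFrame({
    "essay_set": [1, 2, 1, 2],
    "essay": ["a", "b", "c", "d"],
    "domain1_score": [8, 3, 9, 4],
})
splits = split_by_essay_set(demo, set_ids=[1, 2])
essays_1, scores_1 = splits[1]
print(essays_1.tolist(), scores_1.tolist(), scores_1.name)  # ['a', 'c'] [8, 9] score
```

For all eight sets, `split_by_essay_set(data_set, range(1, 9))` would return every pair at once, and individual sets can still be picked by key.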

## Prepare Data

### Create the essay sets

Again, these could be looped, but they are kept separate for readability and so that the sets we don't need can be commented out. Each set takes a long time to process, so please be patient with this part.

In [None]:
e_set_1 = EssaySet()
e_set_2 = EssaySet()
#e_set_3 = EssaySet()
#e_set_4 = EssaySet()
#e_set_5 = EssaySet()
#e_set_6 = EssaySet()
#e_set_7 = EssaySet()
#e_set_8 = EssaySet()

In [None]:
for i in range(len(essays_1)):
    e_set_1.add_essay(essays_1[i], scores_1[i])

In [None]:
for i in range(len(essays_2)):
    e_set_2.add_essay(essays_2[i], scores_2[i])

Sets 3 to 6 are left out for now.

In [None]:
"""
for i in range(len(essays_7)):
    e_set_7.add_essay(essays_7[i], scores_7[i])
"""

In [None]:
"""
for i in range(len(essays_8)):
    e_set_8.add_essay(essays_8[i], scores_8[i])
"""

## Extract Features

In [None]:
f_extractor = FeatureExtractor()

Change the next two variable assignments to select which essay set is evaluated.

It would be better to do this earlier in the notebook.

**SETUP HERE**

In [None]:
e_set = e_set_1
score = scores_1

In [None]:
length = f_extractor.gen_length_feats(e_set)
length_df = pd.DataFrame(
    length, 
    columns = [
        'chars', 
        'words', 
        'commas', 
        'apostrophes', 
        'punctuations', 
        'avg_word_length',
        # new stuff, will need to compare original with new and separate punctuations
        'sentences',
        'questions',
        'avg_word_sentence',
        'POS', 
        'POS/total_words'
    ]
)

_*Exclude the prompts for the time being*_

To be included next.

In [None]:
# Merge this with the score based on the index
# We use the shallow features first
features = length_df
dataset = features.merge(score, left_index=True, right_index=True)
dataset.columns = ['chars', 'words', 'commas', 'apostrophes', 'punctuations',
                   'avg_word_length', 'sentences', 'questions', 'avg_word_sentence',
                   'POS', 'POS/total_words', 'score']
#X_1 = dataset.iloc[:,0:10].values.astype(float)
#y_1 = dataset.iloc[:,11].values.astype(float)

## Determine Essay Prompts

In [None]:
essay_prompts = []

#for i in range(1, 9): # all 8 essay sets
# Only the prompt for set 1 is loaded for now
for i in range(1, 2):
    file = "prompts/set" + str(i) + ".txt"
    with open(file, "r", encoding="latin-1") as f: # there are some 0x9x characters, hence need to specify encoding
        essay_prompts.append(f.read())

def get_essay_prompt(essay_set):
    return essay_prompts[essay_set-1]

**SETUP HERE**

In [None]:
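# Attach the prompt text for set 1 to the essay set so EASE can compare essays against it
e_set.update_prompt(get_essay_prompt(1))

# Per essay: words that also appear in the prompt and words matching prompt synonyms,
# as raw counts and as proportions of the essay's total words (see EASE for details)
prompts = f_extractor.gen_prompt_feats(e_set)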
prompts_df = pd.DataFrame(prompts, columns = [
    'prompt_words', 'prompt_words/total_words', 'synonym_words', 'synonym_words/total_words'
])
e_set # To check
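Judging by the column names, the prompt features count how many essay words also occur in the prompt (and, for the synonym columns, in an expanded list of prompt-word synonyms), both as a raw count and as a proportion of the essay's words. The stdlib sketch below covers only the first two columns and assumes lower-cased whitespace tokenisation; it is not EASE's implementation, which tokenises with NLTK and presumably uses WordNet for the synonym expansion. `prompt_overlap_feats` is a hypothetical name.

```python
def prompt_overlap_feats(essay, prompt):
    """Return (prompt_words, prompt_words/total_words) for one essay.

    Simplified sketch: lower-cased whitespace tokens, no synonym expansion.
    """
    prompt_tokens = set(prompt.lower().split())
    essay_tokens = essay.lower().split()
    total = max(len(essay_tokens), 1)  # avoid dividing by zero for empty essays
    # Every essay token that also occurs in the prompt is counted (repeats included).
    overlap = sum(1 for tok in essay_tokens if tok in prompt_tokens)
    return overlap, overlap / total


prompt = "Write a letter to your local newspaper about computers"
essay = "Dear newspaper computers help people and computers are fun"
print(prompt_overlap_feats(essay, prompt))  # (3, 0.3333333333333333)
```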

In [None]:
# Another step that takes some time to process
unstemmed = util_functions.get_vocab_essays_count(e_set._text, e_set._score)
stemmed = util_functions.get_vocab_essays_count(e_set._clean_stem_text, e_set._score)

bow = list(map(lambda a,b:[a,b], unstemmed, stemmed))
bow_df = pd.DataFrame(bow, columns = ['unstemmed', 'stemmed'])

In [None]:
features = pd.concat([length_df, prompts_df, bow_df], axis=1, sort=False)
features.head()

In [None]:
# Attach the scores to the features (the optional CSV export is in the next cell)
dataset = features.merge(score, left_index=True, right_index=True)
dataset.head()

In [None]:
"""
dataset.columns = ['chars', 'words', 'commas', 'apostrophes', 'punctuations',
                   'avg_word_length', 'sentences', 'questions', 'avg_word_sentence',
                   'POS', 'POS/total_words',
                   'score']
"""

dataset.columns = ['chars', 'words', 'commas', 'apostrophes', 'punctuations',
                   'avg_word_length', 'sentences', 'questions', 'avg_word_sentence',
                   'POS', 'POS/total_words',
                   'prompt_words', 'prompt_words/total_words', 'synonym_words',
                   'synonym_words/total_words', 'unstemmed', 'stemmed',
                   'score']
dataset.head()
dataset.to_csv('maes_features.csv')

We could use `features` and `score` directly as X and y, but reading them back from the CSV file above keeps the same convention whether or not the extraction steps were run in this session.
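Because `to_csv` also writes the DataFrame index, the CSV read back has an extra first column, which is why the positional slices below start at 1. A sketch of an alternative (an assumption, not the original workflow) is to read the index back with `index_col=0` and select columns by name, so that adding or reordering feature columns cannot silently shift `X` and `y`. The small in-memory CSV stands in for `maes_features.csv`.

```python
import io

import pandas as pd

# Stand-in for maes_features.csv: index column first, then features, then score.
csv_text = io.StringIO(
    ",chars,words,unstemmed,stemmed,score\n"
    "0,120,25,3,4,8\n"
    "1,300,61,7,9,10\n"
)
dataset = pd.read_csv(csv_text, index_col=0)  # first column becomes the index

# Mirror the notebook's choice of leaving out the bag-of-words columns.
feature_cols = [c for c in dataset.columns if c not in ("unstemmed", "stemmed", "score")]
X = dataset[feature_cols].to_numpy(dtype=float)
y = dataset["score"].to_numpy(dtype=float).reshape(-1, 1)
print(feature_cols, X.shape, y.shape)  # ['chars', 'words'] (2, 2) (2, 1)
```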

**YOU CAN RUN FROM HERE ONWARDS BY READING THE SAVED FEATURES FOR TRAINING**

In [177]:
dataset = pd.read_csv('maes_features.csv')

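Select the features and the label. Column 0 of the CSV is the saved index, columns 1 to 15 are the 11 length features and the 4 prompt features (the `unstemmed` and `stemmed` columns 16 and 17 are not used), and column 18 is the score.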

In [178]:
X = dataset.iloc[:,1:16].values.astype(float)
y = dataset.iloc[:,18].values.astype(float)
y

array([ 8.,  9.,  7., 10.,  8.,  8., 10., 10.,  9.,  9.,  8.,  8.,  7.,
        6.,  6., 12.,  8.,  8.,  4.,  6.,  8.,  3., 10., 11.,  8.,  9.,
        4.,  9.,  9.,  8., 10., 10.,  6.,  8.,  9., 10., 12.,  8., 10.,
        7.,  2.,  8.,  6.,  8.,  8.,  8.,  8., 11.,  6.,  5.,  9.,  7.,
        8., 10.,  8., 10.,  9.,  7.,  8.,  4.,  8.,  8.,  8.,  7.,  9.,
        9.,  8.,  9.,  7., 12., 10., 10.,  8.,  7.,  8.,  8., 10., 10.,
       10.,  8.,  8.,  8.,  7.,  6., 10.,  8., 10.,  9.,  6.,  7.,  8.,
       11., 11.,  8., 10.,  7.,  8., 11.,  8.,  7., 10.,  8.,  9., 10.,
        9., 11., 10.,  8.,  8.,  6., 11.,  9.,  8.,  8.,  9.,  4.,  8.,
       12., 10.,  8.,  8.,  9., 10.,  7.,  5.,  8.,  9.,  9., 10., 10.,
       10., 11., 10., 10.,  7.,  9.,  7., 10.,  7.,  9., 10., 10.,  7.,
        9., 10.,  6., 12.,  9.,  8.,  8.,  8.,  6.,  9., 12., 10.,  8.,
        9.,  9., 10.,  8., 10., 12.,  8.,  8.,  9., 10.,  8.,  9.,  6.,
        8.,  8.,  8.,  8.,  8.,  8., 10.,  8.,  8.,  8., 10., 10

In [179]:
X.shape

(1783, 15)

In [180]:
y = np.array(y).reshape(-1,1)
y.shape

(1783, 1)

In [181]:
### Split the train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Have a look at the first few lines
print(y_test[:5, :])

[[ 7.]
 [ 8.]
 [10.]
 [ 7.]
 [10.]]


## Model Training

In [182]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

### Naive Bayes Training

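No scaling is used for Naive Bayes: `MultinomialNB` requires non-negative, count-like features, and standardised values would include negatives.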

In [183]:
X_trainNB = X_train
y_trainNB = y_train
X_testNB = X_test
y_testNB = y_test

In [184]:
model_nb = naive_bayes.MultinomialNB()
model_nb.fit(X_trainNB, y_trainNB.ravel())

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

At this stage, the Naive Bayes model is called `model_nb`

### SVM Training

Standardise both the features and the target with `StandardScaler`, fitting the scalers on the training split only.

In [185]:
from sklearn.preprocessing import StandardScaler
sc_Xsvm = StandardScaler()
sc_ysvm = StandardScaler()
X_trainSVM = sc_Xsvm.fit_transform(X_train)
y_trainSVM = sc_ysvm.fit_transform(y_train)
X_testSVM = sc_Xsvm.transform(X_test)
y_testSVM = sc_ysvm.transform(y_test)
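To make the scaling step concrete, here is a stdlib-only sketch of what `StandardScaler` does to a single column: subtract the training mean and divide by the training (population) standard deviation. The function names are hypothetical and the numbers are made up. The inverse step is why the SVR and BLRR predictions are passed through `inverse_transform` before they are compared with the true scores.

```python
from statistics import fmean, pstdev


def fit_standardiser(values):
    """Return (mean, std) estimated from the training values only."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return mean, std or 1.0  # a constant column gets scale 1, as in StandardScaler


def transform(values, mean, std):
    """Map raw values to standardised units."""
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    """Map standardised values back to the original scale."""
    return [v * std + mean for v in values]


y_train = [8.0, 10.0, 6.0, 8.0]
mean, std = fit_standardiser(y_train)
z = transform(y_train, mean, std)
print(z)                                # roughly [0.0, 1.414, -1.414, 0.0]
print(inverse_transform(z, mean, std))  # back to (approximately) the original scores
```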

In [186]:
from sklearn.svm import SVR
# The most important SVR parameter is the kernel type: linear, polynomial or Gaussian.
# The relationship is non-linear, so polynomial or Gaussian would fit; here we select RBF (a Gaussian kernel).
# kernel in {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
# Maybe use 'poly' and increase the degree
model_svm = SVR(kernel='rbf', gamma='auto', verbose=True)
#regressor = SVR(kernel='poly', degree=5, gamma='auto', verbose=True)
model_svm.fit(X_trainSVM,y_trainSVM.ravel())

[LibSVM]

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=True)
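For reference, scikit-learn's RBF kernel is k(x, z) = exp(-gamma * ||x - z||^2), and `gamma='auto'` sets gamma to 1 / n_features (1/15 for the 15 features used here). The stdlib sketch below only evaluates that kernel for two feature vectors to show how similarity decays with distance; `rbf_kernel` is a hypothetical local function, not `sklearn.metrics.pairwise.rbf_kernel`, and it plays no part in training.

```python
from math import exp


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) similarity between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * sq_dist)


n_features = 3
gamma_auto = 1.0 / n_features  # what gamma='auto' means in scikit-learn's SVR
print(rbf_kernel([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], gamma_auto))  # 1.0 (identical points)
print(rbf_kernel([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], gamma_auto))  # exp(-1), about 0.3679
```

A larger gamma makes the similarity fall off faster, so each training essay influences a smaller neighbourhood of feature space.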

At this stage, the Support Vector Machine (SVM) model is called `model_svm`

### BLRR

In [187]:
from sklearn.preprocessing import StandardScaler
sc_Xblrr = StandardScaler()
sc_yblrr = StandardScaler()
X_trainBLRR = sc_Xblrr.fit_transform(X_train)
y_trainBLRR = sc_yblrr.fit_transform(y_train)
X_testBLRR = sc_Xblrr.transform(X_test)
y_testBLRR = sc_yblrr.transform(y_test)

In [188]:
from sklearn import linear_model
model_blrr = linear_model.BayesianRidge()
model_blrr.fit(X_trainBLRR, y_trainBLRR.ravel())

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)

At this stage, the Bayesian Linear Ridge Regression (BLRR) model is called `model_blrr`.

## Prediction

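We use the held-out test split created above, which has already been pre-processed in the same way as the training data. For SVR and BLRR, the predictions are in standardised units, so they are inverse-transformed and rounded back to whole scores.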
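The BLRR confusion matrix further down has an extra row and column because a rounded prediction falls outside the scores present in the test labels (13 in its classification report). A sketch of one possible extra step, clipping to the valid range after undoing the scaling and rounding, is shown below. This is an assumption, not something the notebook does; `to_score` is a hypothetical helper, the 2 to 12 range is taken from the set 1 test labels in the reports, and the mean and standard deviation are made up.

```python
def to_score(z, mean, std, low=2, high=12):
    """Map a standardised prediction back to an integer score in [low, high]."""
    raw = z * std + mean  # undo StandardScaler on the target
    rounded = round(raw)  # ties go to the even integer, like NumPy's .round()
    return min(max(rounded, low), high)


mean, std = 8.6, 1.5  # made-up target statistics for illustration
print([to_score(z, mean, std) for z in (-5.0, 0.0, 0.3, 3.2)])  # [2, 9, 9, 12]
```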

### Naive Bayes

In [189]:
y_predNB = model_nb.predict(X_testNB)
#y_1_predNB = sc_ynb.inverse_transform(y_1_predNB).round()

cm = confusion_matrix(y_test, y_predNB)
print(cm)

[[ 4  0  0  0  0  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 1  0  0  1  0  0  0  0  0  0]
 [ 0  3  2  7  4  4  0  0  0  0]
 [ 0  0  0  7  4 12  0  1  0  0]
 [ 0  0  1  4 11 95 18 15  0  0]
 [ 0  0  0  0  2 19 13 19  1  1]
 [ 0  0  0  1  1  5 21 36  8  6]
 [ 0  0  0  0  0  0  4  7  2  7]
 [ 0  0  0  0  0  0  0  2  1  5]]


### SVM

In [190]:
y_predSVM = model_svm.predict(X_testSVM)
y_predSVM = sc_ysvm.inverse_transform(y_predSVM.reshape(-1, 1)).round() # the scaler expects a 2D array

cm = confusion_matrix(y_test, y_predSVM)
#np.set_printoptions(threshold=np.inf)
print(cm)

[[ 0  1  3  0  0  0  0  0  0  0]
 [ 0  1  0  1  0  0  0  0  0  0]
 [ 0  0  1  0  0  1  0  0  0  0]
 [ 0  0  1  8  6  5  0  0  0  0]
 [ 0  0  0  0 14  8  2  0  0  0]
 [ 0  0  0  3 16 79 40  6  0  0]
 [ 0  0  0  0  1 13 29 11  1  0]
 [ 0  0  0  0  0  3 26 42  7  0]
 [ 0  0  0  0  0  0  4 11  5  0]
 [ 0  0  0  0  0  0  0  3  5  0]]


In [191]:
# y_predSVM

### BLRR

In [192]:
y_predBLRR = model_blrr.predict(X_testBLRR)
y_predBLRR = sc_yblrr.inverse_transform(y_predBLRR.reshape(-1, 1)).round() # the scaler expects a 2D array

cm = confusion_matrix(y_test, y_predBLRR)
print(cm)

[[ 0  0  3  1  0  0  0  0  0  0  0]
 [ 0  0  0  2  0  0  0  0  0  0  0]
 [ 0  0  0  2  0  0  0  0  0  0  0]
 [ 0  0  0  7 13  0  0  0  0  0  0]
 [ 0  0  0  1 14  7  2  0  0  0  0]
 [ 0  0  0  3 25 73 38  5  0  0  0]
 [ 0  0  0  0  3 10 32  9  0  1  0]
 [ 0  0  0  0  0  3 28 35 12  0  0]
 [ 0  0  0  0  0  1  2  9  6  2  0]
 [ 0  0  0  0  0  0  0  2  4  1  1]
 [ 0  0  0  0  0  0  0  0  0  0  0]]


## Evaluation using QWK

QWK scores for NB, SVR and BLRR

In [193]:
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score

### Naive Bayes

In [194]:
rpt = classification_report(y_test, y_predNB)
print(rpt)

              precision    recall  f1-score   support

         2.0       0.80      1.00      0.89         4
         4.0       0.40      1.00      0.57         2
         5.0       0.00      0.00      0.00         2
         6.0       0.35      0.35      0.35        20
         7.0       0.18      0.17      0.17        24
         8.0       0.70      0.66      0.68       144
         9.0       0.23      0.24      0.23        55
        10.0       0.45      0.46      0.46        78
        11.0       0.17      0.10      0.12        20
        12.0       0.26      0.62      0.37         8

   micro avg       0.47      0.47      0.47       357
   macro avg       0.35      0.46      0.39       357
weighted avg       0.48      0.47      0.47       357



In [195]:
print(cohen_kappa_score(y_test, y_predNB, weights="quadratic"))

0.7914276903145422


### SVM

In [196]:
rpt = classification_report(y_test, y_predSVM)
print(rpt)

              precision    recall  f1-score   support

         2.0       0.00      0.00      0.00         4
         4.0       0.50      0.50      0.50         2
         5.0       0.20      0.50      0.29         2
         6.0       0.67      0.40      0.50        20
         7.0       0.38      0.58      0.46        24
         8.0       0.72      0.55      0.62       144
         9.0       0.29      0.53      0.37        55
        10.0       0.58      0.54      0.56        78
        11.0       0.28      0.25      0.26        20
        12.0       0.00      0.00      0.00         8

   micro avg       0.50      0.50      0.50       357
   macro avg       0.36      0.38      0.36       357
weighted avg       0.54      0.50      0.51       357





In [197]:
print(cohen_kappa_score(y_test, y_predSVM, weights="quadratic"))

0.7981076983547645


### BLRR

In [198]:
rpt = classification_report(y_test, y_predBLRR)
print(rpt)

              precision    recall  f1-score   support

         2.0       0.00      0.00      0.00         4
         4.0       0.00      0.00      0.00         2
         5.0       0.00      0.00      0.00         2
         6.0       0.44      0.35      0.39        20
         7.0       0.25      0.58      0.35        24
         8.0       0.78      0.51      0.61       144
         9.0       0.31      0.58      0.41        55
        10.0       0.58      0.45      0.51        78
        11.0       0.27      0.30      0.29        20
        12.0       0.25      0.12      0.17         8
        13.0       0.00      0.00      0.00         0

   micro avg       0.47      0.47      0.47       357
   macro avg       0.26      0.26      0.25       357
weighted avg       0.55      0.47      0.49       357





In [199]:
print(cohen_kappa_score(y_test, y_predBLRR, weights="quadratic"))

0.8026556723826195


In [200]:
end = time.time()
print("Total time to execute the notebook is " + str(end - start))

Total time to execute the notebook is 584.8888618946075


QWK scores range from -1 to 1: 1 means perfect agreement between the predicted and actual scores, 0 means agreement no better than chance, and negative values mean agreement worse than chance. The aim is to get as close as possible to 1, with a score of 0.6 or above generally accepted as good.
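For reference, the quantity computed by `cohen_kappa_score(..., weights="quadratic")` and by the manual code in the appendix can be written as below. Here $i, j$ index the $N$ score categories in sorted order, $O$ is the confusion matrix of actual against predicted scores, and $E$ is the outer product of the actual and predicted score histograms, both normalised to sum to 1.

$$
w_{ij} = \frac{(i-j)^2}{(N-1)^2}, \qquad
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
$$

scikit-learn omits the $(N-1)^2$ factor, which cancels in the ratio, so both versions give the same $\kappa$.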

Judged by the QWK values above, the agreement is only "moderate". The next step is to achieve substantial agreement.

https://www.statisticshowto.com/cohens-kappa-statistic/

In short, BLRR scores slightly higher than SVR (QWK 0.803 vs 0.798), and both score higher than NB (0.791); all the margins are small.

## Appendix

### QWK Scores (Manual Code)

In [None]:
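# Recompute the NB confusion matrix here: `cm` above was last assigned for BLRR,
# while the histograms below use the NB predictions.
labels = np.unique(np.concatenate([y_test.ravel(), y_predNB.ravel()])) # sorted score values present
cm = confusion_matrix(y_test.ravel(), y_predNB, labels=labels).astype(float)
N = len(labels) # number of score categories, same size as cm
w = np.zeros((N,N)) # create a matrix of N by N
d = (N-1)**2 # normalising constant for the quadratic weights
for i in range(N):
    for j in range(N):
        w[i][j] = float(((i-j)**2)/d)
w # The weighted matrix

In [None]:
N

In [None]:
np.unique(y_test)

In [None]:
np.unique(y_predNB)

In [None]:
# Map each score value to its row/column position in cm (scores do not start at 1)
label_index = {label: k for k, label in enumerate(labels)}
act_hist = np.zeros([N])
for item in y_test.ravel():
    act_hist[label_index[item]] += 1

In [None]:
pred_hist = np.zeros([N])
for item in y_predNB.ravel():
    pred_hist[label_index[item]] += 1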

In [None]:
E = np.outer(act_hist, pred_hist)
E

In [None]:
E = E/E.sum()
E.sum()

In [None]:
cm = cm/cm.sum()
cm.sum()

In [None]:
num=0
den=0
for i in range(len(w)):
    for j in range(len(w)):
        num+=w[i][j]*cm[i][j]
        den+=w[i][j]*E[i][j]
            
weighted_kappa = (1 - (num/den))
weighted_kappa

# END