# M - Automated Essay Scoring
_School of Information Technology_<br>
_Monash University Malaysia_<br>
(c) Copyright 2020, Ian Tan & Jun Qing Lim

Steps

- Read dataset (ASAP)
- Extract features (into file) using EASE
- Conduct machine learning (scikit-learn libraries)
    - Naive Bayes
    - SVR
    - BLRR (later)
- Evaluate (QWK)

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm #SVR is in SVM
from sklearn.metrics import accuracy_score, confusion_matrix

### Import the EASE functions, which are located in the `ease` folder.

In [2]:
import sys
sys.path.insert(1, 'ease')
import create
import grade 
import model_creator 
import predictor_extractor 
import predictor_set 
import util_functions
import essay_set
import feature_extractor

from essay_set import EssaySet
from feature_extractor import FeatureExtractor

## Read Dataset

The ASAP-AES data (the Hewlett Foundation dataset from Kaggle) is in the folder `asap-aes`.  We use `training_set_rel3.tsv` for training and `test_set.tsv` for testing.

In [3]:
train_set = pd.read_csv("asap-aes/training_set_rel3.tsv", sep='\t', encoding="latin-1")
test_set = pd.read_csv("asap-aes/test_set.tsv", sep='\t', encoding="latin-1")

In [4]:
train_set['essay'] = [entry.lower() for entry in train_set['essay']] # lower case for all words in essay
test_set['essay'] = [entry.lower() for entry in test_set['essay']] # lower case for all words in essay

There are 8 different essay sets.  As an overview:
- Sets 1 & 2 are persuasive/narrative essays in the form of letters
- Sets 3, 4, 5 & 6 are source-dependent responses to a given text
- Sets 7 & 8 are persuasive/narrative essays in the form of story writing

This mix of formats makes the dataset well suited to transfer learning.

In [5]:
train_set_1 = train_set[train_set['essay_set'] == 1]
train_set_2 = train_set[train_set['essay_set'] == 2]
#train_set_3 = train_set[train_set['essay_set'] == 3]
#train_set_4 = train_set[train_set['essay_set'] == 4]
#train_set_5 = train_set[train_set['essay_set'] == 5]
#train_set_6 = train_set[train_set['essay_set'] == 6]
train_set_7 = train_set[train_set['essay_set'] == 7]
train_set_8 = train_set[train_set['essay_set'] == 8]

We do similarly for the test sets.

In [6]:
test_set_1 = test_set[test_set['essay_set'] == 1]
test_set_2 = test_set[test_set['essay_set'] == 2]
#test_set_3 = test_set[test_set['essay_set'] == 3]
#test_set_4 = test_set[test_set['essay_set'] == 4]
#test_set_5 = test_set[test_set['essay_set'] == 5]
#test_set_6 = test_set[test_set['essay_set'] == 6]
test_set_7 = test_set[test_set['essay_set'] == 7]
test_set_8 = test_set[test_set['essay_set'] == 8]

Each subset retains the original index, so we reset it to give each subset its own index. This makes it easier to match the essays with their scores.

In [7]:
train_set_1 = train_set_1.reset_index() # resets index
train_set_2 = train_set_2.reset_index()
#train_set_3 = train_set_3.reset_index()
#train_set_4 = train_set_4.reset_index()
#train_set_5 = train_set_5.reset_index()
#train_set_6 = train_set_6.reset_index()
train_set_7 = train_set_7.reset_index()
train_set_8 = train_set_8.reset_index()

In [8]:
test_set_1 = test_set_1.reset_index() # resets index
test_set_2 = test_set_2.reset_index()
#test_set_3 = test_set_3.reset_index()
#test_set_4 = test_set_4.reset_index()
#test_set_5 = test_set_5.reset_index()
#test_set_6 = test_set_6.reset_index()
test_set_7 = test_set_7.reset_index()
test_set_8 = test_set_8.reset_index()

We use just the `essay` content and the respective `scores`.

In [9]:
# If you want for the whole dataset.
# Commented out as we will work on individual datasets
#essays = train_set['essay']
#scores = train_set['domain1_score']

In [10]:
essays_1 = train_set_1['essay']
scores_1 = train_set_1['domain1_score']

In [11]:
essays_2 = train_set_2['essay']
scores_2 = train_set_2['domain1_score']

In [12]:
#essays_3 = train_set_3['essay']
#scores_3 = train_set_3['domain1_score']

In [13]:
#essays_4 = train_set_4['essay']
#scores_4 = train_set_4['domain1_score']

In [14]:
#essays_5 = train_set_5['essay']
#scores_5 = train_set_5['domain1_score']

In [15]:
#essays_6 = train_set_6['essay']
#scores_6 = train_set_6['domain1_score']

In [16]:
essays_7 = train_set_7['essay']
scores_7 = train_set_7['domain1_score']

In [17]:
essays_8 = train_set_8['essay']
scores_8 = train_set_8['domain1_score']

Rename the `domain1_score` column to `score`.

In [18]:
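# Each of these is a Series, so we rename it; assigning to .columns would not rename anything
scores_1 = scores_1.rename("score")
scores_2 = scores_2.rename("score")
#scores_3 = scores_3.rename("score")
#scores_4 = scores_4.rename("score")
#scores_5 = scores_5.rename("score")
#scores_6 = scores_6.rename("score")
scores_7 = scores_7.rename("score")
scores_8 = scores_8.rename("score")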

The steps above could be put into a loop, but they are left as is so that individual sets can easily be picked or commented out.
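
If a loop is preferred, one possible shape is sketched below. It uses a tiny stand-in DataFrame with the same `essay_set`, `essay` and `domain1_score` columns as the ASAP file. The names `wanted_sets` and `per_set` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Stand-in for train_set (assumption); the real data comes from training_set_rel3.tsv.
train_set = pd.DataFrame({
    "essay_set": [1, 2, 1, 7, 8, 2],
    "essay": ["a", "b", "c", "d", "e", "f"],
    "domain1_score": [8, 3, 10, 15, 40, 4],
})

# Pick the sets to work on here instead of commenting lines in and out.
wanted_sets = [1, 2, 7, 8]

per_set = {}
for set_id in wanted_sets:
    # Filter one essay set and renumber its rows from 0.
    subset = train_set[train_set["essay_set"] == set_id].reset_index(drop=True)
    per_set[set_id] = {
        "essays": subset["essay"],
        "scores": subset["domain1_score"].rename("score"),
    }

print(per_set[1]["scores"].tolist())  # [8, 10]
```

The trade-off is the one noted above: a dictionary keyed by set number is shorter, but the explicit per-set variables are easier to toggle one at a time.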

### Create the essay sets

Again, these could be looped, but they are kept separate for readability and so that sets we don't need are easy to comment out.  Each set takes a long time to process, so please be patient with this part.

In [19]:
e_set_1 = EssaySet()
e_set_2 = EssaySet()
#e_set_3 = EssaySet()
#e_set_4 = EssaySet()
#e_set_5 = EssaySet()
#e_set_6 = EssaySet()
e_set_7 = EssaySet()
e_set_8 = EssaySet()

In [20]:
for i in range(len(essays_1)):
    e_set_1.add_essay(essays_1[i], scores_1[i])

In [21]:
for i in range(len(essays_2)):
    e_set_2.add_essay(essays_2[i], scores_2[i])

Sets 3 to 6 are left out for now.

In [22]:
for i in range(len(essays_7)):
    e_set_7.add_essay(essays_7[i], scores_7[i])

In [23]:
for i in range(len(essays_8)):
    e_set_8.add_essay(essays_8[i], scores_8[i])

## Extract Features

Currently this is done for Set 1 only.

In [24]:
f_extractor = FeatureExtractor()

In [25]:
length = f_extractor.gen_length_feats(e_set_1)
length_df_1 = pd.DataFrame(
    length, 
    columns = [
        'chars', 
        'words', 
        'commas', 
        'apostrophes', 
        'punctuations', 
        'avg_word_length',
        # new stuff
        'paragraphs',
        #'avg_word_sentence',
        #'avg_sentence_para',
        'POS', 
        'POS/total_words'
    ]
)

In [26]:
length_df_1.head()

| | chars | words | commas | apostrophes | punctuations | avg_word_length | paragraphs | POS | POS/total_words |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1908.0 | 389.0 | 18.0 | 7.0 | 16.0 | 4.904884 | 0.0 | 385.324675 | 0.990552 |
| 1 | 2309.0 | 459.0 | 12.0 | 5.0 | 20.0 | 5.030501 | 0.0 | 454.991209 | 0.991266 |
| 2 | 1560.0 | 310.0 | 9.0 | 6.0 | 14.0 | 5.032258 | 0.0 | 306.993464 | 0.990301 |
| 3 | 3130.0 | 574.0 | 13.0 | 7.0 | 27.0 | 5.452962 | 0.0 | 569.992982 | 0.993019 |
| 4 | 2615.0 | 520.0 | 13.0 | 7.0 | 30.0 | 5.028846 | 0.0 | 509.634367 | 0.980066 |


In [27]:
length_df_1.tail()

| | chars | words | commas | apostrophes | punctuations | avg_word_length | paragraphs | POS | POS/total_words |
|---|---|---|---|---|---|---|---|---|---|
| 1778 | 2614.0 | 541.0 | 7.0 | 13.0 | 22.0 | 4.831793 | 0.0 | 533.988785 | 0.98704 |
| 1779 | 1130.0 | 238.0 | 3.0 | 14.0 | 19.0 | 4.747899 | 0.0 | 234.991453 | 0.987359 |
| 1780 | 1671.0 | 318.0 | 0.0 | 7.0 | 18.0 | 5.254717 | 0.0 | 314.993631 | 0.990546 |
| 1781 | 78.0 | 18.0 | 1.0 | 2.0 | 0.0 | 4.333333 | 0.0 | 17.0 | 0.944444 |
| 1782 | 1139.0 | 236.0 | 0.0 | 2.0 | 18.0 | 4.826271 | 0.0 | 232.318966 | 0.984402 |


_*Exclude the prompts for the time being*_

In [None]:
# Merge this with the score based on the index
# We use the shallow features first
features = length_df_1
dataset = features.merge(scores_1, left_index=True, right_index=True)
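# 9 feature columns (including 'paragraphs') plus the score
dataset.columns = ['chars', 'words', 'commas', 'apostrophes', 'punctuations',
       'avg_word_length', 'paragraphs', 'POS', 'POS/total_words', 'score']
# Select by name so that no feature column is dropped by an off-by-one slice
X_1 = dataset.drop(columns='score').values.astype(float)
y_1 = dataset['score'].values.astype(float)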

Reshape the data and label

In [None]:
X_1.shape

In [None]:
y_1 = np.array(y_1).reshape(-1,1)
y_1.shape

## Model Training

### Naive Bayes Training

In [None]:
model_nb_1 = naive_bayes.MultinomialNB()
model_nb_1.fit(X_1, y_1.ravel())

At this stage, the trained Naive Bayes model is available as `model_nb_1`.

### SVM Training

Use standard scaler for the data

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_1 = sc_X.fit_transform(X_1)
y_1 = sc_y.fit_transform(y_1)

In [None]:
from sklearn.svm import SVR
# The most important SVR parameter is the kernel type. It can be linear,
# polynomial or Gaussian. We have a non-linear condition, so we can select
# polynomial or Gaussian; here we select RBF (a Gaussian-type kernel).
# kernel{'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
# maybe use poly and increase the degree
regressor = SVR(kernel='rbf', gamma='auto', verbose=True)
#regressor = SVR(kernel='poly', degree=5, gamma='auto', verbose=True)
regressor.fit(X_1,y_1.ravel())
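
As an alternative to scaling `X` and `y` by hand, the scaling can be bundled with the regressor so that predictions come back on the original score scale. The sketch below is not the notebook's method; it uses scikit-learn's `TransformedTargetRegressor` and `make_pipeline` on synthetic stand-in data (`X_demo`, `y_demo` and `model` are illustrative names).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 3 features, integer-like scores.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(60, 3))
y_demo = np.round(2 + 0.1 * X_demo[:, 0] + rng.normal(0, 0.5, size=60))

# Features are standardized inside the pipeline; the target is standardized
# before fitting and mapped back to the original scale by predict().
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma="auto")),
    transformer=StandardScaler(),
)
model.fit(X_demo, y_demo)

pred = model.predict(X_demo[:5])
print(pred.round())
```

With this arrangement there is no separate `sc_y.inverse_transform` step to remember at prediction time.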

### BLRR (Later)

## Prediction

We will be using the respective validation set and will have to also pre-process the data.
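
One detail worth keeping in mind when pre-processing the validation data: the scaler fitted on the training features should be reused, not refitted. A minimal sketch with made-up numbers (the arrays and the name `scaler` are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrices (assumption); the real ones would come from EASE.
X_train_demo = np.array([[100.0, 5.0], [200.0, 7.0], [300.0, 9.0]])
X_valid_demo = np.array([[200.0, 7.0], [400.0, 11.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on training data
X_valid_scaled = scaler.transform(X_valid_demo)      # reuse them; do not refit

print(X_valid_scaled)
```

The first validation row equals the training mean, so it maps to zeros; calling `fit_transform` on the validation data would shift it to a different scale from the one the model was trained on.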

### Naive Bayes

### SVM

### BLRR (Later)

## Evaluation using QWK

QWK scores for NB, SVR and BLRR
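
For reference, the quadratic weighted kappa computed by the manual code further down in this notebook can be written as follows. Here $O$ is the normalised confusion matrix, $E$ is the normalised outer product of the actual and predicted score histograms, and $N$ is the number of score categories; the symbols $O$ and $\kappa$ are notation chosen for this note.

```latex
w_{ij} = \frac{(i-j)^2}{(N-1)^2},
\qquad
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
```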

# END

#### Collate the essay prompts
This consists of one essay prompt from each set.

In [None]:
essay_prompts = []

# Takes a bit of time also :)
for i in range(1,9):
    file = "prompts/set" + str(i) + ".txt"
    # there are some 0x9x characters, hence need to specify encoding
    with open(file, "r", encoding="latin-1") as f:  # closes the file automatically
        essay_prompts.append(f.read())
    
def get_essay_prompt(essay_set):
    return essay_prompts[essay_set-1]

In [None]:
# Unsure how this works
e_set.update_prompt(get_essay_prompt(2))

# Need more explanation on how this works - look into EASE

prompts = f_extractor.gen_prompt_feats(e_set)
prompts_df = pd.DataFrame(prompts, columns = ['prompt_words', 'prompt_words/total_words', 'synonym_words', 'synonym_words/total_words'])

In [None]:
e_set

In [None]:
# Another step that takes some time to process
unstemmed = util_functions.get_vocab_essays_count(e_set._text, e_set._score)
stemmed = util_functions.get_vocab_essays_count(e_set._clean_stem_text, e_set._score)

bow = list(map(lambda a,b:[a,b], unstemmed, stemmed))
bow_df = pd.DataFrame(bow, columns = ['unstemmed', 'stemmed'])

In [None]:
features = pd.concat([length_df, prompts_df, bow_df], axis=1, sort=False)

In [None]:
features.head()

In [None]:
# Export features to a file for next stage (optional)
dataset = features.merge(scores, left_index=True, right_index=True)

In [None]:
dataset.head()

In [None]:
dataset.columns = ['chars', 'words', 'commas', 'apostrophes', 'punctuations',
       'avg_word_length', 'POS', 'POS/total_words', 'prompt_words',
       'prompt_words/total_words', 'synonym_words',
       'synonym_words/total_words', 'unstemmed', 'stemmed', 'score']

In [None]:
dataset.head()

In [None]:
dataset.to_csv('maes_features.csv')

We could use the features and scores directly as X and y, but we slice them out of `dataset` to follow the same convention as when reading back from the CSV file above.


In [None]:
# Columns 0-13 are the 14 features (the slice end is exclusive); column 14 is the score
X = dataset.iloc[:,0:14].values.astype(float)
y = dataset.iloc[:,14].values.astype(float)

In [None]:
y

In [None]:
X.shape

In [None]:
y = np.array(y).reshape(-1,1)
y.shape

#### Conduct Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

In [None]:
len(X)

In [None]:
len(y)

#### Split the train and test sets

In [None]:
# To split the train / test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Have a look at the first few lines
print(y_test[:5, :])

### Training

#### Support Vector Regression

In [None]:
from sklearn.svm import SVR
# The most important SVR parameter is the kernel type. It can be linear,
# polynomial or Gaussian. We have a non-linear condition, so we can select
# polynomial or Gaussian; here we select RBF (a Gaussian-type kernel).
# kernel{'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
# maybe use poly and increase the degree
regressor = SVR(kernel='rbf', gamma='auto', verbose=True)
#regressor = SVR(kernel='poly', degree=5, gamma='auto', verbose=True)
regressor.fit(X_train,y_train.ravel())

#### Test / Predict the fit

In [None]:
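# SVR predicts on the standardized scale; map back to the original scores.
# inverse_transform expects a 2D array, hence the reshape.
y_pred = regressor.predict(X_test)
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel().round()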

In [None]:
df = pd.DataFrame(
    {
        'Real Values':sc_y.inverse_transform(y_test).ravel(),
        'Predicted Values':y_pred
    }
)
df.head()

#### Accuracy Score

In [None]:
# y_pred

In [None]:
# y_test = sc_y.inverse_transform(y_test).round()
# y_test.ravel()

In [None]:
# For a regressor, score() calls predict() internally and returns R^2
# (the coefficient of determination), not classification accuracy.

print("R^2 score:", regressor.score(X_test, y_test))

In [None]:
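# Round and cast to int so that both columns are treated as discrete labels
print("accuracy score:", accuracy_score(df['Real Values'].round().astype(int), df['Predicted Values'].round().astype(int)))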

In [None]:
from sklearn.metrics import cohen_kappa_score

In [None]:
print(cohen_kappa_score(sc_y.inverse_transform(y_test).ravel().round().astype(int), y_pred.astype(int), weights="quadratic"))

### Naive Bayes

In [None]:
X_train

In [None]:
X_train_test = sc_X.inverse_transform(X_train)
X_train_test = X_train_test.astype(int)
X_train_test

In [None]:
y_train_test = sc_y.inverse_transform(y_train).ravel()
y_train_test = y_train_test.round().astype(int) # round first so that e.g. 7.999 becomes 8, not 7
y_train_test

In [None]:
nbclassifier = naive_bayes.MultinomialNB()
nbclassifier.fit(X_train_test, y_train_test)

In [None]:
X_test_test = sc_X.inverse_transform(X_test)
X_test_test = X_test_test.astype(int)
X_test_test

In [None]:
y_test_test = sc_y.inverse_transform(y_test).ravel()
y_test_test = y_test_test.round().astype(int) # round first so that e.g. 7.999 becomes 8, not 7
y_test_test

In [None]:
y_predNB = nbclassifier.predict(X_test_test)

cm = confusion_matrix(y_test_test, y_predNB)
print(cm)

In [None]:
from sklearn.metrics import classification_report

rpt = classification_report(y_test_test, y_predNB)
print(rpt)

### QWK Scores (Manual Code)

In [None]:
N = len(cm) # Just to get the same size as the confusion matrix from above
w = np.zeros((N,N)) # create a matrix of N by N
d = (N-1)**2 # the weighted portion
for i in range(len(w)):
    for j in range(len(w)):
        w[i][j] = float(((i-j)**2)/d) 
w # The weighted matrix

In [None]:
N

In [None]:
np.unique(y_test_test)

In [None]:
np.unique(y_predNB)

In [None]:
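# Actual score histogram: row sums of the confusion matrix.
# (Indexing by item-1 would assume the labels are exactly 1..N.)
act_hist = cm.sum(axis=1)

In [None]:
# Predicted score histogram: column sums of the confusion matrix
pred_hist = cm.sum(axis=0)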

In [None]:
E = np.outer(act_hist, pred_hist)
E

In [None]:
E = E/E.sum()
E.sum()

In [None]:
cm = cm/cm.sum()
cm.sum()

In [None]:
num=0
den=0
for i in range(len(w)):
    for j in range(len(w)):
        num+=w[i][j]*cm[i][j]
        den+=w[i][j]*E[i][j]
            
weighted_kappa = (1 - (num/den))
weighted_kappa

QWK values range from -1 to 1: 1 means perfect agreement between the predicted and actual scores, 0 means agreement no better than chance, and negative values mean agreement worse than chance.  The aim is to get as close as possible to 1, with a score of 0.6 generally accepted as good.
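
The same computation can be packaged as one small function. The version below is a standard-library sketch, not the notebook's code: the function name and the `min_rating`/`max_rating` parameters are assumptions, and the weights use the full rating range, which can differ from a confusion-matrix-based computation when some ratings never occur.

```python
def quadratic_weighted_kappa(actual, predicted, min_rating, max_rating):
    """QWK for integer ratings in [min_rating, max_rating] (needs at least 2 levels)."""
    n = max_rating - min_rating + 1
    # Observed counts: rows are actual ratings, columns are predicted ratings.
    observed = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        observed[a - min_rating][p - min_rating] += 1

    total = len(actual)
    act_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = act_hist[i] * pred_hist[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    # den is zero if all ratings fall in a single category; not handled here.
    return 1.0 - num / den


human = [1, 2, 3, 4, 4, 2]
machine = [1, 2, 3, 4, 3, 3]
print(round(quadratic_weighted_kappa(human, machine, 1, 4), 3))  # 0.842
```

In this toy example two of six predictions are off by one rating, which still leaves a high kappa because quadratic weights penalise near misses only lightly.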

### QWK for Naive Bayes

The code above computes the QWK manually. We later found that it is already available in sklearn as the [Cohen Kappa Score](https://journals.sagepub.com/doi/10.1177/001316446002000104), with `weights` set to `'quadratic'`.  Since the manual version has already been written, we use `sklearn.metrics.cohen_kappa_score` to validate it.

In [None]:
y_test_test

In [None]:
y_predNB

In [None]:
print(cohen_kappa_score(y_test_test, y_predNB))
print(cohen_kappa_score(y_test_test, y_predNB, weights="quadratic"))

Judged by the QWK, the score only reaches "moderate agreement".  The next step is to achieve "substantial agreement".

https://www.statisticshowto.com/cohens-kappa-statistic/
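
The agreement labels used above follow the commonly cited Landis and Koch (1977) bands. A small helper that maps a kappa value to its label might look like this; the function name is illustrative, and the band edges are a convention, not a strict rule.

```python
def agreement_label(kappa):
    """Return the Landis and Koch (1977) label for a kappa value."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(agreement_label(0.55))  # moderate
print(agreement_label(0.65))  # substantial
```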

In short, SVM works a little better than Naive Bayes for AES.

# End