# AUTOMATED ESSAY SCORER 
------
*21-09-2021 22:04:00*
#### **Objective**: 
*To design a system that reads an essay and grades it without any human assistance*
#### Dataset: 
*Presented by Kaggle*

## 0. Prepare and Import

This phase sets up the notebook by:
- Importing necessary libraries
- Setting up Constants and pre-trained assets (if any)
- Importing dataset

In [2]:
!pip install pyspellchecker
# import all necessary libraries

# For dataframes
import pandas as pd 

# For numerical arrays
import numpy as np 

# For stemming/Lemmatisation/POS tagging
import spacy

# For getting stopwords
from spacy.lang.en.stop_words import STOP_WORDS

# For K-Fold cross validation
from sklearn.model_selection import KFold

# For visualizations
import matplotlib.pyplot as plt

# For regular expressions
import re

# For handling string
import string

# For all torch-supported actions
import torch

# For spell-check
from spellchecker import SpellChecker

# For performing mathematical operations
import math

# For dictionary related activities
from collections import defaultdict

# For counting actions (EDA)
from collections import  Counter

# For count vectorisation (EDA)
from sklearn.feature_extraction.text import CountVectorizer

# For one-hot encoding
from tensorflow.keras.utils import to_categorical

# For DL model
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from tensorflow.keras.models import Model, Sequential

# For generating random integers
from random import randint

# For TF-IDF vectorisation
from sklearn.feature_extraction.text import TfidfVectorizer

# For padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

# For tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

# For plotting
import seaborn as sns

print("Necessary libraries imported")

# Constant variables 

# spaCy language lemmatiser model
sp=spacy.load('en_core_web_sm')
spell = SpellChecker()

print("Constant variables ready")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [3]:
valid=pd.read_csv('../input/asap-aes/valid_set.tsv', sep='\t', encoding='ISO-8859-1')
test=pd.read_csv('../input/asap-aes/test_set.tsv', sep='\t', encoding='ISO-8859-1')
df=pd.read_csv('../input/asap-aes/training_set_rel3.tsv', sep='\t', encoding='ISO-8859-1')
df=df.drop_duplicates(ignore_index=True)
print('total rows in train: ',len(df),'\ntotal rows in valid: ',len(valid),'\ntotal rows in test: ',len(test))
df.head()

## 1. Exploratory Data Analysis
This phase builds a complete understanding of what the dataset contains and the key nuances that need to be understood before framing the input-output ML pipeline. We perform the following:

- Dataset description (to know what's presented and what's not available)


### 1.1 Dataset Description
Studying the basic statistics of the dataset, which covers the following aspects:
- Analysing columns
- Null-Value statistics
- Overall column-wise stats
- Essay prompt types and frequency

In [4]:
print('Dataset columns: ')
print(df.columns)
print('\nNull Statistics (in %): ')
print(df.isnull().sum()* 100 / len(df))
print('\nDataset description: ')
print(df.describe())
print('\nEssay prompt frequency: ')
print(df.essay_set.value_counts())

#### Inference

- There are a lot of null values (over 80 percent in most columns), so null-value treatment is a necessity
- Essays are unevenly spread across prompts, ranging from 500+ to 1500+ essays per prompt
- There are numerous columns, most of which might not contribute to training a scoring system
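
One possible way to act on the null-value observation is to drop columns whose share of nulls exceeds a cutoff. The sketch below is illustrative only: the helper name `drop_sparse_columns`, the 80% threshold, the `sparse_col` column, and the toy values are assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_null_pct=80.0):
    """Return a copy of `frame` without columns whose null share exceeds `max_null_pct`."""
    null_pct = frame.isnull().mean() * 100  # per-column percentage of nulls
    keep = null_pct[null_pct <= max_null_pct].index
    return frame[keep].copy()


# Toy frame standing in for the training set (all values are made up)
toy = pd.DataFrame({
    "essay_set": [1, 1, 2, 2, 2],
    "domain1_score": [8, 9, 3, 4, 2],
    "domain2_score": [np.nan, np.nan, 3.0, 4.0, np.nan],       # 60% null, kept
    "sparse_col": [np.nan, np.nan, np.nan, np.nan, np.nan],    # 100% null, dropped
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Whether dropping, imputing, or simply ignoring such columns is the right treatment depends on how the columns are used later in the pipeline.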


### 1.2 Score Distribution
Here, we look at how scores are distributed within each essay set, capturing the scores available for each domain and for each prompt

In [5]:
fig = plt.figure(figsize=(20,20))
for prompt in range(0,8):
  curdf=df[df['essay_set']==prompt+1]
  labels = []
  sizes = []
  for x, y in curdf.domain1_score.value_counts().items():
    labels.append(x)
    sizes.append(y)
  ax1 = plt.subplot2grid((3,3),(math.floor((prompt)/3),math.floor((prompt)%3)))
  plt.pie(sizes,labels=labels,autopct='%1.1f%%')
  plt.title('Score distribution (Domain 1): Prompt '+str(prompt+1))

# Ninth panel: domain 2 score distribution for essay set 2
prompt=1
curdf=df[df['essay_set']==prompt+1]
labels = []
sizes = []
for x, y in curdf.domain2_score.value_counts().items():
    labels.append(x)
    sizes.append(y)
ax1 = plt.subplot2grid((3,3),(2,2))
plt.pie(sizes,labels=labels,autopct='%1.1f%%')
plt.title('Score distribution (Domain 2): Prompt '+str(prompt+1))
plt.show()


#### Inference
From these pie charts, we observe the following:
- Distribution of scores within an essay set is uneven
- The range of scores (i.e., the minimum and maximum attainable mark) varies widely from prompt to prompt
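
Because the score ranges differ between prompts, scores are often put on a common scale before being compared or modelled together. The following is a minimal sketch of per-prompt min-max scaling, assuming the `essay_set` and `domain1_score` columns seen above; the helper `normalise_scores` and the toy values are hypothetical, and the sketch uses the observed minimum and maximum in the data rather than any official rubric range.

```python
import pandas as pd


def normalise_scores(frame, score_col="domain1_score", group_col="essay_set"):
    """Min-max scale `score_col` to [0, 1] separately within each `group_col` value."""
    grouped = frame.groupby(group_col)[score_col]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = (hi - lo).replace(0, 1)  # avoid division by zero when a group is constant
    return (frame[score_col] - lo) / span


# Toy data: two prompts with very different score ranges (values are made up)
toy = pd.DataFrame({
    "essay_set": [1, 1, 1, 2, 2, 2],
    "domain1_score": [2, 7, 12, 1, 3, 6],
})
toy["score_norm"] = normalise_scores(toy)
print(toy)
```

After scaling, the lowest observed score in every prompt maps to 0 and the highest to 1, so a single model output range can serve all prompts.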

### 1.3 Effect of essay-word-lengths over score
Here, we observe how essay length and word length relate to scores across the different essay sets, capturing the following trends:
- Total words vs score
- Average word length vs score

In [6]:
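def get_avg_length(essay):
    # Mean number of characters per whitespace-separated word
    words=essay.split()
    if not words:
        return 0.0  # guard against empty essays (avoids division by zero)
    return round(sum(len(word) for word in words)/len(words),2)

df['avg_word_length']=df.essay.apply(get_avg_length)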
print("Graph for score vs average word length")

fig, axes = plt.subplots(3, 3, figsize=(24,24)) # creating a figure with 3 rows and 3 columns of plots
fig.suptitle('Scores versus average word length')

for prompt in range(0,8):
  curdf=df[df['essay_set']==prompt+1]  
  sns.stripplot(ax=axes[prompt%3,math.floor(prompt/3)],
    data=curdf,
    x="domain1_score", y="avg_word_length")
  axes[prompt%3,math.floor(prompt/3)].set_title("Prompt "+str(prompt+1))

prompt=1
curdf=df[df['essay_set']==prompt+1]  
sns.stripplot(ax=axes[2,2],
    data=curdf,
    x="domain2_score", y="avg_word_length")
axes[2,2].set_title("Prompt "+str(prompt+1)+" (Domain 2)")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.3, 
                    hspace=0.3)
fig.show()

In [7]:
print("Graph for score vs total words")


df['total_words']=df.essay.apply(lambda x: len(x.split()))
fig, axes = plt.subplots(3, 3, figsize=(24,24)) # creating a figure with 3 rows and 3 columns of plots
fig.suptitle('Scores versus total words')

for prompt in range(0,8):
  curdf=df[df['essay_set']==prompt+1]  
  sns.boxplot(ax=axes[prompt%3,math.floor(prompt/3)],
    data=curdf,
    x="domain1_score", y="total_words")
  axes[prompt%3,math.floor(prompt/3)].set_title("Prompt "+str(prompt+1))

prompt=1
curdf=df[df['essay_set']==prompt+1]  
sns.boxplot(ax=axes[2,2],
    data=curdf,
    x="domain2_score", y="total_words")
axes[2,2].set_title("Prompt "+str(prompt+1)+" (Domain 2)")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.3, 
                    hspace=0.3)
fig.show()

#### Inference

We get the following information from these two seaborn figures:
- Average word length is tightly packed, densely concentrated in the 3-5 range for almost all prompts, with a few outliers
- The box range of total words per essay moves up the graph as the domain score increases, indicating that longer essays tend to score higher, with a few outliers
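
The length-versus-score trend read off the box plots can also be checked numerically. The sketch below computes a Pearson correlation between `total_words` and `domain1_score` separately for each prompt; the toy values are made up for illustration, and Pearson correlation is an assumed choice here, not something the notebook itself computes.

```python
import pandas as pd

# Toy data standing in for the training set (values are made up)
toy = pd.DataFrame({
    "essay_set": [1, 1, 1, 1, 2, 2, 2, 2],
    "total_words": [120, 250, 310, 400, 80, 150, 90, 200],
    "domain1_score": [4, 7, 8, 10, 1, 3, 2, 4],
})

# Pearson correlation between essay length and score, one value per prompt
length_score_corr = {
    prompt: group["total_words"].corr(group["domain1_score"])
    for prompt, group in toy.groupby("essay_set")
}
print(length_score_corr)
```

A value close to 1 for a prompt would support the visual impression that longer essays receive higher scores there; it says nothing about whether length causes the higher score.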

### 1.4 Unigram analysis
The primary goal here is to see what words are most frequently used. This will be done in the following ways:
- Frequency of stop-words used
- Most commonly occurring non-stop-words in each essay set

In [8]:
print("Stop-word frequency")

fig, axes = plt.subplots(3, 3, figsize=(24,24))
fig.suptitle('Stop-word frequency')

for prompt in range(0,8):
  dct=defaultdict(int) 
  curdf=df[df['essay_set']==prompt+1]  
  counter=Counter((" ".join(curdf.essay)).split())
  most=counter.most_common()
  x=[]
  y=[]
  for word,count in most[:10]:
      if (word.lower() in STOP_WORDS):
          x.append(word)
          y.append(count)
  sns.barplot(ax=axes[prompt%3,math.floor(prompt/3)],x=y,y=x)
  axes[prompt%3,math.floor(prompt/3)].set_title("Prompt "+str(prompt+1))

axes[2,2].set_title("NaN")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.3, 
                    hspace=0.3)
fig.show()

In [9]:
print("Most commonly occurring words in all essay prompts")

fig, axes = plt.subplots(3, 3, figsize=(24,24))
fig.suptitle('Common-word frequency')

for prompt in range(0,8):
  dct=defaultdict(int) 
  curdf=df[df['essay_set']==prompt+1]  
  counter=Counter((" ".join(curdf.essay)).split())
  most=counter.most_common()
  x=[]
  y=[]
  for word,count in most[:50]:
      if (word.lower() not in STOP_WORDS):
          x.append(word)
          y.append(count)
  sns.barplot(ax=axes[prompt%3,math.floor(prompt/3)],x=y,y=x)
  axes[prompt%3,math.floor(prompt/3)].set_title("Prompt "+str(prompt+1))

axes[2,2].set_title("NaN")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.3, 
                    hspace=0.3)
fig.show()

#### Inference

We get the following information from these two series of bar charts:
- Stop-words make up a large share of the most frequent words in each essay set, raising the need for stop-word removal during data cleaning and feature formatting
- The most commonly occurring non-stop-words are closely tied to the question asked in each prompt/essay_set
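
The stop-word removal suggested above can be sketched with the standard library alone. The stop-word set here is a tiny illustrative stand-in (an assumption); in the notebook, spaCy's `STOP_WORDS` would take its place. The helper name `content_word_counts` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; spaCy's STOP_WORDS would be used in practice
DEMO_STOP_WORDS = frozenset({"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"})


def content_word_counts(text, stop_words=DEMO_STOP_WORDS):
    """Lower-case `text`, keep alphabetic tokens, and count those not in `stop_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tok for tok in tokens if tok not in stop_words)


sample = "The computer is a tool, and the computer helps people to learn."
counts = content_word_counts(sample)
print(counts.most_common(3))
```

Lower-casing before the lookup matters: without it, capitalised forms such as "The" would slip past the filter.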

### 1.5 Bigram analysis
The primary goal here is to analyse the trend of bigrams used in each essay

In [10]:
def get_top_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

print("Bigram analysis")

fig, axes = plt.subplots(3,3, figsize=(12,12))
fig.suptitle('Bigram analysis')

for prompt in range(0,8):
  dct=defaultdict(int) 
  top_bigrams=get_top_bigrams(df[df['essay_set']==prompt+1].essay)[:10]
  x,y=map(list,zip(*top_bigrams))
  sns.barplot(ax=axes[prompt%3,math.floor(prompt/3)],x=y,y=x)
  axes[prompt%3,math.floor(prompt/3)].set_title("Prompt "+str(prompt+1))

axes[2,2].set_title("NaN")  # ninth panel is unused, as in the earlier figures
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.3, 
                    hspace=0.3)
fig.show()

#### Inference

We get the following information from this series of bar charts:
- We need to treat stop-words before creating any model that relies on preserving the order of the words
- The most frequently occurring bigrams appear to be related to the scoring of each prompt
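
To illustrate the first point, the sketch below counts bigrams after stop-words have been dropped. It is a simplified standard-library stand-in for the `CountVectorizer` approach used above, with an assumed toy stop-word list and made-up essays; the helper name `top_content_bigrams` is hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word set; spaCy's STOP_WORDS would be used in practice
DEMO_STOP_WORDS = frozenset({"the", "a", "of", "to", "in", "and", "is"})


def top_content_bigrams(texts, n=3, stop_words=DEMO_STOP_WORDS):
    """Count adjacent word pairs after dropping stop words from each text."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        counts.update(zip(tokens, tokens[1:]))  # consecutive pairs within one text
    return counts.most_common(n)


essays = [
    "the effects of computers in the library",
    "computers in the library help people",
]
print(top_content_bigrams(essays))
```

Note that removing stop-words first pairs up words that were not adjacent in the original sentence (for example "computers" and "library"), which is a trade-off to keep in mind for any sequence-based model.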

### 