# Automated Essay Scoring (AES)

Essays are important testing tools for assessing academic achievement, the integration of ideas, and ability, but they are expensive and time-consuming to grade manually. Automated essay scoring (AES) reduces the effort required of human graders and hence the cost and time of grading. In some high-stakes examinations, AES replaces the second human grader who would otherwise verify or compare scores; in some low-stakes evaluations, AES is the only grading scheme.

Previous studies have used the following baseline features:
1. Bag of Words (BOW) counts (the 10,000 most frequent words)
2. Number of characters 
3. Number of words 
4. Number of sentences 
5. Average word length
6. Number of lemmas
7. Number of spelling errors
8. Number of nouns
9. Number of adjectives 
10. Number of verbs 
11. Number of adverbs 

In addition to those features, this project tries to extract:
1. sentiment features
2. content features
3. grammar features 

## Goal

1. Which kinds of features can improve AES models so that their grades are as close as possible to those of human graders?

2. Different AES models will also be compared. 
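
One common way to quantify how close machine scores are to human scores is quadratic weighted kappa (QWK). The notebook does not state which agreement metric it uses, so treating QWK as the target is an assumption here, and the function name `quadratic_weighted_kappa` is hypothetical. A minimal NumPy sketch for integer scores:

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Agreement between two integer score lists; 1.0 means perfect agreement."""
    human = np.asarray(human, dtype=int) - min_score
    machine = np.asarray(machine, dtype=int) - min_score
    n_cat = max_score - min_score + 1

    # Observed confusion matrix: rows = human score, columns = machine score.
    observed = np.zeros((n_cat, n_cat))
    for h, m in zip(human, machine):
        observed[h, m] += 1

    # Quadratic penalty grows with the squared distance between scores.
    idx = np.arange(n_cat)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_cat - 1) ** 2

    # Expected matrix if the two raters were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(human)

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
```

Values near 1 indicate that the model grades almost like the human grader; values near 0 indicate agreement no better than chance.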

# Data

## Dataset from kaggle.com, by the William and Flora Hewlett Foundation.

|Type of essay|Grade level|Training essays|Validation essays|
|----------|-------------------|--------------|------|
|narrative/persuasive/expository|8|1785|592|
|narrative/persuasive/expository|10|1800|600| 
|source dependent|10|1726|575|
|source dependent|10|1772|589|
|source dependent|8|1805|601|
|source dependent|10|1800|600|
|narrative/persuasive/expository|7|1730|576|
|narrative/persuasive/expository|10|918|305|

## Example: Essay set 1 prompt


More and more people use computers, but not everyone agrees that this benefits society. Those who support advances in technology believe that computers have a positive effect on people. They teach hand-eye coordination, give people the ability to learn about faraway places and people, and even allow people to talk online with other people. Others have different ideas. Some experts are concerned that people are spending too much time on their computers and less time exercising, enjoying nature, and interacting with family and friends. 

Write a letter to your local newspaper in which you state your opinion on the effects computers have on people. Persuade the readers to agree with you.


# Descriptive Analysis

In [4]:
%matplotlib inline
import numpy as np
import pandas as pd
import xlrd
import matplotlib.pyplot as plt
import re, collections
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import chain
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from collections import defaultdict



In [42]:
#read in training data
df = pd.read_excel("training_set_rel3.xls")

print(df)

       essay_id  essay_set                                              essay  \
0             1          1  Dear local newspaper, I think effects computer...   
1             2          1  Dear @CAPS1 @CAPS2, I believe that using compu...   
2             3          1  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   
3             4          1  Dear Local Newspaper, @CAPS1 I have found that...   
4             5          1  Dear @LOCATION1, I know having computers has a...   
5             6          1  Dear @LOCATION1, I think that computers have a...   
6             7          1  Did you know that more and more people these d...   
7             8          1  @PERCENT1 of people agree that computers make ...   
8             9          1  Dear reader, @ORGANIZATION1 has had a dramatic...   
9            10          1  In the @LOCATION1 we have the technology of a ...   
10           11          1  Dear @LOCATION1, @CAPS1 people acknowledge the...   
11           12          1  

In [6]:
# Group the essays and their (halved) domain1 scores by essay set
train_data = {}
for index, row in df.iterrows():
    essay = row["essay"].strip()
    essay_set = row['essay_set']
    domain1_score = row['domain1_score']/2  # the raw score is halved here
    if essay_set not in train_data:
        train_data[essay_set] = {"essays":[], "score":[]}

    train_data[essay_set]["essays"].append(essay)
    train_data[essay_set]['score'].append(domain1_score)

## Preparation
Clean and tokenize the essays before extracting features.
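
A minimal sketch of this step using only the standard library. It splits an essay into sentences with a naive regular expression (a simplified stand-in for the NLTK Punkt tokenizer the notebook uses) and then strips non-alphanumeric characters, as `sentence_to_wordlist` does. The name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(essay):
    """Split an essay into sentences, then each sentence into word tokens."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    # Assumption: good enough for illustration; Punkt handles abbreviations better.
    raw_sentences = re.split(r"(?<=[.!?])\s+", essay.strip())

    tokenized = []
    for raw in raw_sentences:
        # Replace anything that is not a letter or digit with a space.
        clean = re.sub(r"[^a-zA-Z0-9]", " ", raw)
        tokens = clean.split()
        if tokens:
            tokenized.append(tokens)
    return tokenized

example = "Dear @CAPS1, computers help people. Do you agree?"
print(simple_tokenize(example))
```

Note that anonymization placeholders such as `@CAPS1` survive cleaning as ordinary tokens (`CAPS1`), so they are counted as words by the features that follow.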

## Feature Extraction
The extracted features are the total word count, the number of sentences, word features (average word length, average word count per sentence, SD of word count across sentences), spelling errors, the number of lemmas, and the numbers of nouns, adjectives, adverbs, and verbs.


|Type of features|Feature|Details|
|----------|-------------------|--------------|
|Essay features|Total word count||
||Number of sentences||
|Sentence features|Average sentence length||
||SD of sentence length||
|Word features|Number of lemmas|verb, noun, adj., adv.|
||Average word length (in characters)||
||BOW (TfidfVectorizer)|1-grams and 2-grams|
||Spelling errors||
||Count of different words|verb, noun, adj., adv.|
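
To make the sentence and word features in the table concrete, here is a small sketch that computes the average word length, the average sentence length, and the SD of sentence length from already tokenized sentences. The function name `sentence_stats` and the toy input are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def sentence_stats(tokenized_sentences):
    """Return (avg word length, avg sentence length, SD of sentence length)."""
    word_lengths = [len(w) for sent in tokenized_sentences for w in sent]
    sentence_lengths = [len(sent) for sent in tokenized_sentences]
    return (
        float(np.mean(word_lengths)),
        float(np.mean(sentence_lengths)),
        float(np.std(sentence_lengths)),  # population SD (ddof=0), NumPy's default
    )

toy = [
    ["I", "like", "computers"],
    ["They", "help", "people", "learn", "new", "things", "daily"],
]
avg_word_len, avg_sent_len, sd_sent_len = sentence_stats(toy)
print(avg_word_len, avg_sent_len, sd_sent_len)
```

`np.std` uses the population formula by default; passing `ddof=1` would give the sample SD instead, which differs noticeably for very short essays.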



In [7]:
def sentence_to_wordlist(raw_sentence):
    clean_sentence = re.sub("[^a-zA-Z0-9]", " ", raw_sentence)
    tokens = nltk.word_tokenize(clean_sentence)

    return tokens


def tokenize(essay):
    stripped_essay = essay.strip()

    # sent_tokenize uses the same English Punkt model without loading the pickle by path
    raw_sentences = nltk.sent_tokenize(stripped_essay)

    tokenized_sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            tokenized_sentences.append(sentence_to_wordlist(raw_sentence))

    return tokenized_sentences


def get_clean_essay(essay):
    clean_essay = re.sub(r'\W', ' ', essay)
    return clean_essay

#get list of sentences:
def get_sent_list(essay):
    sent_list = []
    sentences = nltk.sent_tokenize(essay)
    for sentence in sentences:
        clean_sentence = re.sub(r'\W', ' ', str(sentence).lower())
        clean_sentence = re.sub(r'[0-9]', '', clean_sentence)

        sent_list.append(clean_sentence)
    return sent_list

def get_big_dict():
    # big.txt: reference text whose words are treated as correctly spelled
    with open('big.txt') as f:
        big = f.read()

    words_ = re.findall('[a-z]+', big.lower())

    # word -> frequency in the reference text
    big_dict = collections.defaultdict(lambda: 0)
    for word in words_:
        big_dict[word] += 1
    return big_dict
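
The dictionary-lookup idea behind `get_big_dict` and the later `count_spell_error` can be sketched without `big.txt`. This is a simplified illustration: it swaps in `collections.Counter` for the `defaultdict`, uses a one-sentence reference text as a stand-in for the corpus, and the helper names are hypothetical.

```python
import re
from collections import Counter

def build_word_counts(text):
    """Count lowercase alphabetic words in a reference text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def count_unknown_words(essay, known_words):
    """Count essay tokens that never occur in the reference vocabulary."""
    tokens = re.findall(r"[a-z]+", essay.lower())
    return sum(1 for token in tokens if token not in known_words)

# Tiny stand-in for the reference corpus (assumption: the real one is much larger).
reference = "Computers help people learn about faraway places."
known = build_word_counts(reference)

unknown = count_unknown_words("Computrs help peple learn", known)
print(unknown)
```

Because any word missing from the reference text counts as an error, proper nouns and rare but correct words inflate the count; a larger reference corpus reduces that effect.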

In [8]:
#get word count discarding punctuations:
def total_word_count(essay): #excluding punctuations
    list_of_word_list = tokenize(essay)
    flat_list_of_word = [w for l in list_of_word_list for w in l]
    return len(flat_list_of_word)

# get sentence count
def sent_num(essay):  #number of sentences in an essay
    sentences_num = len(sent_tokenize(essay))
    return sentences_num

def word_feature(essay): #average word count and std of word count in sentence, avg word length throughout an essay
    word_len =[]
    words_in_sent = tokenize(essay)
    for sent in words_in_sent:
        word_len.extend([len(word) for word in sent])
    avg_word_len = np.mean(word_len)
    word_count_per_sentence = [len(s) for s in words_in_sent]
    avg_wordcount = np.mean(word_count_per_sentence)
    std_word_count =  np.std(word_count_per_sentence) #by sentence
    return [avg_word_len, avg_wordcount, std_word_count]


##number of lemmas:
def count_lemmas(essay):
    tokenized_sentences = tokenize(essay)

    lemmas = []
    wordnet_lemmatizer = WordNetLemmatizer()

    for sentence in tokenized_sentences:
        tagged_tokens = nltk.pos_tag(sentence)

        for token_tuple in tagged_tokens:

            pos_tag = token_tuple[1]

            # Map the Penn Treebank tag to a WordNet POS; nouns and all other tags use NOUN
            if pos_tag.startswith('J'):
                pos = wordnet.ADJ
            elif pos_tag.startswith('V'):
                pos = wordnet.VERB
            elif pos_tag.startswith('R'):
                pos = wordnet.ADV
            else:
                pos = wordnet.NOUN
            lemmas.append(wordnet_lemmatizer.lemmatize(token_tuple[0], pos))

    lemma_count = len(set(lemmas))

    return lemma_count


## BOW (TF-IDF weighted 1-grams and 2-grams)
def BOW(essay):  # essay: a collection of essays, e.g. df[df['essay_set'] == 1]['essay']
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=1.0, min_df=1, max_features=10000, stop_words='english')
    feature_matrix = vectorizer.fit_transform(essay)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    return feature_names, feature_matrix

## Spelling errors
def count_spell_error(essay):
    clean_essay = re.sub(r'\W', ' ', str(essay).lower())
    clean_essay = re.sub(r'[0-9]', '', clean_essay)

    mispell_count = 0

    words = clean_essay.split()

    for word in words:
        if word not in big_dict:
            mispell_count += 1

    return mispell_count

###Number of Nouns, Verbs, adj, adv. in an essay
def count_pos(essay):
    tokenized_sentences = tokenize(essay)

    noun_count = 0
    adj_count = 0
    verb_count = 0
    adv_count = 0

    for sentence in tokenized_sentences:
        tagged_tokens = nltk.pos_tag(sentence)

        for token_tuple in tagged_tokens:
            pos_tag = token_tuple[1]

            if pos_tag.startswith('N'):
                noun_count += 1
            elif pos_tag.startswith('J'):
                adj_count += 1
            elif pos_tag.startswith('V'):
                verb_count += 1
            elif pos_tag.startswith('R'):
                adv_count += 1

    return noun_count, adj_count, verb_count, adv_count
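
The prefix matching in `count_pos` can be illustrated on tokens that are already tagged, which avoids needing the NLTK tagger models. The tags below are written by hand in Penn Treebank style (an assumption about what `nltk.pos_tag` would return), and `count_coarse_pos` is a hypothetical helper.

```python
from collections import Counter

# Map the first letter of a Penn Treebank tag to a coarse word class,
# mirroring the startswith() checks in count_pos.
COARSE_CLASS = {"N": "noun", "J": "adj", "V": "verb", "R": "adv"}

def count_coarse_pos(tagged_tokens):
    """Count nouns, adjectives, verbs and adverbs in (token, tag) pairs."""
    counts = Counter()
    for _token, tag in tagged_tokens:
        word_class = COARSE_CLASS.get(tag[:1])
        if word_class is not None:
            counts[word_class] += 1
    return counts

# Hand-tagged example sentence.
tagged = [
    ("Computers", "NNS"), ("really", "RB"), ("help", "VBP"),
    ("young", "JJ"), ("people", "NNS"), (".", "."),
]
counts = count_coarse_pos(tagged)
print(dict(counts))
```

One caveat of matching on the first letter only: the tag `RP` (particle) also starts with `R`, so particles are counted together with adverbs.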

In [9]:
def extract_features(essays, feature_functions):
    return [[f(es) for f in feature_functions] for es in essays] ##list of list of features for each essay

feature_functions = [total_word_count, sent_num, word_feature, count_lemmas, count_spell_error, count_pos]

keys = [1,2,3,4,5,6,7,8]
features = {key: [] for key in keys}
big_dict = get_big_dict()
for es_set in keys:
    #if es_set not in BOW_dict:
     #   BOW_dict={'es_set':[]}
    #BOW_dict['es_set'].append([get_clean_essay(essay) for essay in train_data[es_set]['essays']])

    print("Extracting Features for Essay Set %s" % es_set)
    #if es_set not in features:
        #features={es_set :[]}
    features[es_set].extend(extract_features(train_data[es_set]["essays"], feature_functions))
    


Extracting Features for Essay Set 1
Extracting Features for Essay Set 2
Extracting Features for Essay Set 3
Extracting Features for Essay Set 4
Extracting Features for Essay Set 5
Extracting Features for Essay Set 6
Extracting Features for Essay Set 7
Extracting Features for Essay Set 8


In [10]:
new_keys = ["total word_count","sentence_number","word_features","count_lemma","spelling_error","count_pos"]
Dict = {key: defaultdict(list) for key in new_keys}
for key, value in features.items():
    for v in value:
        Dict['total word_count']['essay_set %s' % key].append(v[0])
        Dict['sentence_number']['essay_set %s' % key].append(v[1])
        Dict['word_features']['essay_set %s' % key].append(v[2])
        Dict['count_lemma']['essay_set %s' % key].append(v[3])
        Dict['spelling_error']['essay_set %s' % key].append(v[4])
        Dict['count_pos']['essay_set %s' % key].append(v[5])
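
Each entry of `features[es_set]` mixes scalars with a list (from `word_feature`) and a tuple (from `count_pos`). Before such rows can be passed to a regressor they need to be flattened into a numeric vector. A minimal sketch, where `flatten_feature_row` is a hypothetical helper and the row values are invented for illustration:

```python
import numpy as np

def flatten_feature_row(row):
    """Flatten one essay's mixed scalars and sequences into a flat list of floats."""
    flat = []
    for value in row:
        if isinstance(value, (list, tuple, np.ndarray)):
            flat.extend(float(v) for v in value)
        else:
            flat.append(float(value))
    return flat

# Invented row in the order of feature_functions:
# total_word_count, sent_num, word_feature, count_lemmas, count_spell_error, count_pos
row = [350, 18, [4.2, 19.4, 6.1], 160, 7, (90, 25, 60, 20)]

X = np.array([flatten_feature_row(row)])
print(X.shape)  # (1, 11)
```

The resulting column order is fixed by the order of `feature_functions`, so the same flattening must be applied to training and validation essays.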

In [14]:
import dill

dill.dump(Dict, open('featurs_dict.pkd', 'wb'))
dill.dump(features, open('features.pkd', 'wb'))

In [15]:
Dict = dill.load(open('featurs_dict.pkd', 'rb'))
features = dill.load(open('features.pkd', 'rb'))

In [None]:
from numpy import linspace
from scipy.stats import gaussian_kde

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure

import colorcet as cc

word_count_dict = Dict['total word_count']
sentence_number_dict = Dict['sentence_number']
word_features_dict = Dict['word_features']
count_lemma_dict = Dict['count_lemma']
spelling_error_dict = Dict['spelling_error']
count_pos_dict = Dict['count_pos']

# word_count=[]
# Essay_sets=[]
# for key in word_count_dict:
#     n = len(word_count_dict[key])
#     word_count.extend(word_count_dict[key])
#     Essay_sets.extend(np.repeat(key, n))

# df_word_count = pd.DataFrame({'Essay sets': Essay_sets, 'word count': word_count})

def joy(category, data, scale=100):
    return list(zip([category]*len(data), scale*data))

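# Essay_sets only exists in the commented-out block above, so take the categories from the dictionary keys
cats = sorted(word_count_dict)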

palette = [cc.rainbow[i*15] for i in range(17)]

x = linspace(0,70, 500)

Data = {'x': x}
source = ColumnDataSource(Data)

p1 = figure(y_range=cats, plot_width=900, x_range=(-5, 70), toolbar_location=None)

for i, cat in enumerate(cats):
    pdf = gaussian_kde(sentence_number_dict[cat])  # sentence counts, matching the plot title and the 0-70 x range
    y = joy(cat, pdf(x))
    source.add(y, cat)
    p1.patch('x', cat, color=palette[i], alpha=0.7, line_color="black", source=source)

p1.title.text = "The Number of Sentences per Essay Category"

p1.title.align = 'center'
p1.title.text_font_size = '20pt'
p1.title.text_font = 'serif'

# Axis titles
p1.xaxis.axis_label_text_font_size = '14pt'
p1.xaxis.axis_label_text_font_style = 'bold'
p1.yaxis.axis_label_text_font_size = '14pt'
p1.yaxis.axis_label_text_font_style = 'bold'

# Tick labels
p1.xaxis.major_label_text_font_size = '12pt'
p1.yaxis.major_label_text_font_size = '12pt'
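
The `joy` helper is what turns a density curve into a ridge for one category: it pairs every scaled density value with the category label, and Bokeh, to my understanding, interprets such `(category, offset)` tuples on a categorical axis as positions offset from that category. A small sketch of its output, using a hand-made array as a stand-in for `gaussian_kde(values)(x)`:

```python
import numpy as np

def joy(category, data, scale=100):
    """Pair each scaled density value with its category label."""
    return list(zip([category] * len(data), scale * data))

# Stand-in density values (assumption: in the notebook these come from a KDE).
density = np.array([0.0, 0.25, 0.5])
points = joy("essay_set 1", density, scale=100)
print(points)
```

`data` has to be a NumPy array: with a plain Python list, `scale * data` would repeat the list instead of scaling its values.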

In [1]:
from bokeh.io import output_notebook
output_notebook()



ModuleNotFoundError: No module named 'bokeh.models'; 'bokeh' is not a package

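In [1]:
# `show` must be imported in the current kernel session,
# and the cell that builds p1 must have been run first
from bokeh.io import show
show(p1)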

In [3]:
# Imports repeated here so the cell also works after a kernel restart;
# word_count_dict and palette still come from the earlier plotting cell
from numpy import linspace
from scipy.stats import gaussian_kde
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

def joy(category, data, scale=200):
    return list(zip([category]*len(data), scale*data))

cats = sorted(word_count_dict)

x = linspace(0,900, 500)

Data = {'x': x}
source = ColumnDataSource(Data)

p2 = figure(y_range=cats, plot_width=900, x_range=(-5, 900), toolbar_location=None)

for i, cat in enumerate(cats):
    pdf = gaussian_kde(word_count_dict[cat])
    y = joy(cat, pdf(x))
    source.add(y, cat)
    p2.patch('x', cat, color=palette[i], alpha=0.7, line_color="black", source=source)

p2.title.text = "The Total Word Count"

p2.title.align = 'center'
p2.title.text_font_size = '20pt'
p2.title.text_font = 'serif'

# Axis titles
p2.xaxis.axis_label_text_font_size = '14pt'
p2.xaxis.axis_label_text_font_style = 'bold'
p2.yaxis.axis_label_text_font_size = '14pt'
p2.yaxis.axis_label_text_font_style = 'bold'

# Tick labels
p2.xaxis.major_label_text_font_size = '12pt'
p2.yaxis.major_label_text_font_size = '12pt'

show(p2)