## W266 Project - Essay Scoring

### Data Description

The data comes from the Automated Student Assessment Prize (ASAP) AES dataset (https://www.kaggle.com/c/asap-aes/data), which contains essays written by students in Grade 7 through Grade 10. The dataset consists of 8 essay sets, each with a different topic or prompt, for a total of 12,978 scored essays.

Each essay set was generated from a single prompt. Across the sets, the average essay length ranges from 150 to 550 words per response. Some of the essays depend on source information and others do not. All essays were hand graded and double-scored. Each of the eight data sets has its own unique characteristics; this variability is intended to test the limits of a scoring engine's capabilities.

The training data is provided in three formats: a tab-separated value (TSV) file, a Microsoft Excel 2010 spreadsheet, and a Microsoft Excel 2003 spreadsheet. According to the original release notes, the first release of the training data contained essay sets 1-6, with sets 7-8 scheduled for release on February 10, 2012. Each of these files contains 28 columns:

    essay_id: A unique identifier for each individual student essay
    essay_set: 1-8, an id for each set of essays
    essay: The ASCII text of a student's response
    rater1_domain1: Rater 1's domain 1 score; all essays have this
    rater2_domain1: Rater 2's domain 1 score; all essays have this
    rater3_domain1: Rater 3's domain 1 score; only some essays in set 8 have this.
    domain1_score: Resolved score between the raters; all essays have this
    rater1_domain2: Rater 1's domain 2 score; only essays in set 2 have this
    rater2_domain2: Rater 2's domain 2 score; only essays in set 2 have this
    domain2_score: Resolved score between the raters; only essays in set 2 have this
    rater1_trait1 score - rater3_trait6 score: trait scores for sets 7-8
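
Because several of these columns are filled in only for particular essay sets, it can help to count the non-missing values per set before choosing a label column. The following is a minimal sketch on a toy frame: the column names come from the list above, but the rows and the helper name `non_null_counts_by_set` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "essay_set": [1, 1, 2, 2],
    "domain1_score": [8, 9, 3, 4],
    "domain2_score": [np.nan, np.nan, 3, 4],  # present only for set 2
})

def non_null_counts_by_set(df):
    """Count non-missing values in every column, grouped by essay set."""
    return df.groupby("essay_set").count()

counts = non_null_counts_by_set(toy_df)
print(counts)
```

Applied to the real training frame, the same call would show at a glance which score columns are usable for which sets.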


### Setting up ML libraries

We import the relevant NLP and TensorFlow libraries.

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
# This notebook targets the TensorFlow 1.x API.
assert tf.__version__.startswith("1."), "TensorFlow 1.x is required"

# Pandas and SKLearn
import pandas as pd
from sklearn.model_selection import train_test_split

# Helper libraries
#from w266_common import utils, vocabulary, tf_embed_viz



### Loading in the data

Data from the AES dataset is stored in the `data/` folder. We begin by loading the training dataset, read here from `data/training_set_rel3.csv`, and partitioning it into train, dev, and test splits.

In [2]:
training_set_rel3_df = pd.read_csv("data/training_set_rel3.csv")
#training_set_rel3_df.head()
print("No. of rows in full data set:", len(training_set_rel3_df))

No. of rows in full data set: 12978
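
The competition also distributes the training data as a tab-separated file, as noted in the data description. If only that file is available, a loader along the following lines might work. This is a sketch rather than the loading code used in this notebook: the helper name `load_asap_tsv` is hypothetical, and the `latin-1` default encoding is an assumption that may need adjusting for your copy of the file.

```python
import pandas as pd

def load_asap_tsv(path, encoding="latin-1"):
    """Read a tab-separated ASAP file into a DataFrame.

    The default encoding is an assumption, not something stated by the
    dataset documentation; change it if decoding fails.
    """
    return pd.read_csv(path, sep="\t", encoding=encoding)

# Hypothetical usage, assuming the TSV sits next to the CSV copy:
# training_set_rel3_df = load_asap_tsv("data/training_set_rel3.tsv")
```

Splitting on tabs means commas inside an essay are left untouched.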


In [3]:
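# Create train, dev and test sets (roughly 75% / 15% / 10% of the data).
# First hold out 10% of all rows as the test set.
train_set, test_set = train_test_split(training_set_rel3_df, test_size=0.1, random_state=0)
# Then take 15/90 of the remaining 90% as the dev set, i.e. 15% of the full data.
train_set, dev_set = train_test_split(train_set, test_size=15/90, random_state=0)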

In [4]:
print("Train Set:", len(train_set))
print("Dev Set:", len(dev_set))
print("Test Set:", len(test_set))

Train Set: 9733
Dev Set: 1947
Test Set: 1298
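
The two-stage split amounts to roughly 75% train, 15% dev, and 10% test. The arithmetic can be checked without the data. The sketch below assumes each held-out portion is rounded up to a whole row, which reproduces the counts printed above; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_total, test_frac, dev_frac_of_rest):
    """Return (train, dev, test) sizes for a two-stage hold-out split.

    Assumes each held-out portion is rounded up to a whole number of rows.
    """
    n_test = math.ceil(n_total * test_frac)
    n_rest = n_total - n_test
    n_dev = math.ceil(n_rest * dev_frac_of_rest)
    n_train = n_rest - n_dev
    return n_train, n_dev, n_test

train_n, dev_n, test_n = split_sizes(12978, 0.1, 15 / 90)
print(train_n, dev_n, test_n)  # 9733 1947 1298
```

Taking 15/90 of the rows left after the first split yields a dev set of about 15% of the full dataset, not 15% of the remainder.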


In [5]:
train_set_essays = np.array(train_set["essay"])
train_set_labels = np.array(train_set["domain1_score"])
dev_set_essays = np.array(dev_set["essay"])
dev_set_labels = np.array(dev_set["domain1_score"])
test_set_essays = np.array(test_set["essay"])
test_set_labels = np.array(test_set["domain1_score"])

### Baseline

### Setting up the LSTM with Attention

### Two-headed Creativity 

### BERT Embeddings

For the word embeddings, we utilized the pre-trained BERT word embeddings 