## Introduction
Essay grading is costly and time-consuming for humans. In standardized tests (e.g., the SAT and GRE), multiple graders often must grade each essay, greatly amplifying the cost of scoring the tests. Our aim is to develop an automatic scoring algorithm that delivers scores close to those of expert human graders, so that it can replace some or all of the human grading.

Some of the potential challenges that we may face are the following:
- How to convert language into a math- and computer-friendly format.
- How to deal with high dimensionality.
- How to take into account the different prompts and question types that each essay has.

Below, we provide a summary of the approaches we took. Clicking each title will bring you to a page explaining that component in more detail.

<h2 style="color:#0000EE; text-decoration:underline; display:inline" onclick="includeContent('data_exploration.html'); document.getElementById('topnv').selectedIndex=1; document.getElementById('botnav').selectedIndex=1;"> Chapter 1 - Data Exploration</h2>

We worked with a publicly available dataset from the Hewlett Foundation containing 12,976 essays from eight different sets. Each essay set came from a different exam, and all essays within a given set responded to the same prompt. Some essay prompts were just a short question requesting a response, while others required reading a passage before responding.

We extracted some basic statistics from the data and plotted histograms and scatter plots of selected characteristics to gain insight into the data's general makeup. The image below presents the distribution of essay lengths in the eight sets. While some, such as Set 1, appear to be normally distributed, others, such as Set 8, are decidedly non-normal. This is not a problem for modeling in itself, but it does slightly restrict the classes of model we can use. For example, models like linear discriminant analysis (LDA), which assumes multivariate normality among the predictors, will probably not be a good choice.

<img src='figures/length_hist_overview.png' title='Histogram for essay length'>

The image below shows the distribution of the grades assigned by human readers. Again, some essay sets can reasonably be approximated with a normal distribution, while others cannot.

<img src='figures/essay_score_hist.png' title='Histogram for essay score'>

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('baseline.html'); document.getElementById('topnv').selectedIndex=2; document.getElementById('botnav').selectedIndex=2;"> Chapter 2 - Data Cleaning: tf-idf </h2>
To turn essays (text data) into a computer-friendly format, we used a technique called *tf-idf vectorizing*. This method takes a set of documents as input and outputs one vector of numbers per document, with one entry for each word in the vocabulary. Tf-idf balances two competing factors for each word: words that appear often in a given document are considered important and receive a higher weight, while words that appear often across all documents are considered common and less important, which lowers their weight. This prevents frequently used but relatively unimportant words like *the* and *we* from dominating the results. Because this method assigns a feature to every distinct word across all the essays in a set, it can produce models with tens of thousands of predictors.
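
The weighting described above can be illustrated with a minimal sketch in plain Python. It uses the textbook definition (term frequency times the log of the inverse document frequency); the function name `tfidf` and the toy sentences are hypothetical, and library implementations typically add smoothing and normalization, so this is not necessarily the exact variant used in this project.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf weight
    of vocabulary[j] in documents[i]. Assumes every document is non-empty."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(tokens)          # frequent in this document -> larger
            idf = math.log(n_docs / doc_freq[word])  # frequent everywhere -> smaller
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix

# Toy "essays" (made up for illustration).
essays = [
    "the dog chased the ball",
    "the cat watched the dog",
    "the library lends books",
]
vocab, weights = tfidf(essays)
for word in ("the", "dog", "library"):
    column = vocab.index(word)
    print(word, [round(row[column], 3) for row in weights])
```

In this toy example *the* occurs in every document, so its inverse document frequency is $\log 1 = 0$ and its weight vanishes, while a word unique to one document keeps a positive weight.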

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('baseline.html'); document.getElementById('topnv').selectedIndex=3; document.getElementById('botnav').selectedIndex=3;"> Chapter 3 - Baseline Model </h2>


Using the matrices produced by *tf-idf*, we created a baseline model using multiple linear regression, the results of which are shown below. In the following plots, the predictions from a perfect model would fall entirely on the grey line, something that is decidedly *not* true for some of the essay sets. The baseline resulted in a poor fit. In the next few chapters, we worked on improving the model by reducing dimensionality, adding meta-features, and applying regularization.


<img src='figures/baseline_overview.png'>


<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('lsa.html'); document.getElementById('topnv').selectedIndex=4; document.getElementById('botnav').selectedIndex=4;"> Chapter 4 - LSA </h2>

After *tf-idf*, we have a sparse matrix of vectorized text with something on the order of 15 million elements, and this large number of features may have caused overfitting in our baseline model. We therefore performed dimensionality reduction to reduce the total number of features, using a method called *latent semantic analysis (LSA)*. *LSA* is essentially PCA for textual data: it applies a truncated singular value decomposition to the tf-idf matrix.
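
As a rough illustration of the reduction step, the sketch below projects a small document-term matrix onto its top $d$ singular directions with NumPy. The helper name `lsa_reduce` and the toy matrix are hypothetical. Unlike PCA, no mean-centering is applied, which is what allows LSA to operate on a sparse matrix without making it dense.

```python
import numpy as np

def lsa_reduce(X, d):
    """Project the rows (documents) of X onto the top-d latent dimensions
    using a truncated singular value decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Document coordinates in the latent space: U_d * diag(s_d).
    return U[:, :d] * s[:d]

# Toy document-term matrix: 4 documents x 5 terms (made-up weights).
X = np.array([
    [0.5, 0.4, 0.0, 0.0, 0.1],
    [0.4, 0.5, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.5, 0.0],
    [0.0, 0.1, 0.5, 0.6, 0.0],
])
X_reduced = lsa_reduce(X, d=2)
print(X_reduced.shape)  # (4, 2)
```

A dense SVD like this is only practical for small matrices; for a matrix with millions of elements, a sparse truncated SVD routine would be the more realistic choice.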

We performed cross-validation to choose the optimal number of dimensions ($d$) for each essay set.

<img src='figures/d_tuning.png'>
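
The tuning of $d$ could be sketched as a $k$-fold loop like the one below. It is only an illustration of the mechanics on random toy data: the helper names, the use of mean squared error as the selection criterion, and the candidate values of $d$ are all assumptions, since the summary does not state which metric or grid was used.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cv_error_for_d(X, y, d, k=5):
    """Mean held-out squared error of least squares on d latent dimensions."""
    folds = kfold_indices(len(y), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit the SVD on the training fold only, so no test data leaks in.
        _, _, Vt = np.linalg.svd(X[train_idx], full_matrices=False)
        components = Vt[:d].T
        Z_train = X[train_idx] @ components
        Z_test = X[test_idx] @ components
        coef, *_ = np.linalg.lstsq(Z_train, y[train_idx], rcond=None)
        errors.append(np.mean((Z_test @ coef - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Random toy data standing in for a tf-idf matrix and essay scores.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

scores = {d: cv_error_for_d(X, y, d) for d in (1, 2, 5, 10)}
best_d = min(scores, key=scores.get)
print(scores, best_d)
```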

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('meta_features.html'); document.getElementById('topnv').selectedIndex=5; document.getElementById('botnav').selectedIndex=5;"> Chapter 5 - Meta Features </h2>

To improve the model even further, we also added several meta-features:
- Similarity of the essay to the prompt
- Essay length in words
- Mean word length
- Mean sentence length
- Number of unique words
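
A minimal sketch of how the features listed above might be computed is shown below. The tokenization rules, the function names, and the choice of bag-of-words cosine similarity for the essay-to-prompt similarity are assumptions made for illustration; the summary does not specify how each feature was actually defined.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def prompt_similarity(essay, prompt):
    """Cosine similarity between bag-of-words counts (an assumed definition)."""
    a, b = Counter(tokenize(essay)), Counter(tokenize(prompt))
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def meta_features(essay, prompt):
    """Compute the five meta-features for one non-empty essay."""
    words = tokenize(essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "prompt_similarity": prompt_similarity(essay, prompt),
        "n_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "n_unique_words": len(set(words)),
    }

features = meta_features("The cat sat. The cat ran away!", "Describe a cat.")
print(features)
```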

We checked that some of the meta-features were correlated with the score, so we can expect them to be useful in the final model.

<img src='figures/corr_length_overview.png'>

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('lasso.html'); document.getElementById('topnv').selectedIndex=6; document.getElementById('botnav').selectedIndex=6;"> Chapter 6 - Regularization: Lasso </h2>
Even after dimension reduction using LSA, our model still had a large number of predictors. To keep only the most important predictors, we applied lasso regression, which helps avoid overfitting and further improves prediction accuracy. In general, regularization penalizes large coefficients in the linear regression model, with the strength of the penalty controlled by a parameter $\lambda$; this shrinks the coefficients toward zero and reduces the likelihood of overfitting. Lasso, the particular flavor of regularized regression we chose, actually brings some coefficients exactly to zero, eliminating them from the model altogether. However, choosing to regularize our model left us with the decision of exactly *which* $\lambda$ to use. Therefore, we performed cross-validation to find the optimal value.
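
To make the "exactly zero" behavior concrete, here is a small sketch of lasso fitted by coordinate descent with soft-thresholding, minimizing $\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$. The function names, the synthetic data, and the omission of an intercept are assumptions for illustration; this is one standard way to solve the lasso problem, not necessarily the solver used in this project.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it is within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Synthetic data: only the first two of five predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

beta_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=0.5)
print(np.round(beta_unpenalized, 3))
print(np.round(beta_lasso, 3))
```

With $\lambda = 0$ the procedure reduces to ordinary least squares, while a positive $\lambda$ both shrinks the informative coefficients and removes the uninformative ones.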

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('final.html'); document.getElementById('topnv').selectedIndex=7; document.getElementById('botnav').selectedIndex=7;"> Chapter 7 - Final Model </h2>

LSA and regularization left our model with between 10 and 205 parameters, depending on the essay set, a marked reduction from the original 15,000. Even better, the model's predictions consistently track the human graders' scores more closely than the baseline's did, as shown in the following plots.

![Final model scores](figures/final.png)

The chief metric we use to evaluate this model is Spearman's rank correlation, $\rho$, which measures how well the ranking of the predicted scores matches the ranking of the actual scores. Our final model has correlations between 0.75 and 0.90, a striking improvement over the baseline models' 0.5. With these scores, our model is ready to take over at least some of the work human graders do now.
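
For readers unfamiliar with the metric, the sketch below computes Spearman's $\rho$ in plain Python as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the example scores are made up for illustration and are not taken from the project's data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human scores and model predictions for six essays.
human = [2, 3, 3, 4, 5, 1]
model = [2.1, 2.9, 3.4, 3.8, 4.6, 1.5]
print(round(spearman_rho(human, model), 3))
```

Because only the ordering matters, a model can reach a high $\rho$ even if its predictions sit on a different scale from the human scores.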

<h2 style="color:#0000EE; text-decoration:underline" onclick="includeContent('futre.html'); document.getElementById('topnv').selectedIndex=8; document.getElementById('botnav').selectedIndex=8;"> Chapter 8 - Future Work </h2>

Although our model has improved significantly from the baseline, we believe there is room for further improvement.

Specifically, we propose performing higher-level text analysis and incorporating the results as meta-features. Here are some examples:
- Syntax complexity
    - We could extract features that represent syntactic complexity, such as length of clauses, amount of embedding or subordination, amount of coordination, and range of surface syntactic structures. Some tools are available for this ([L2 Syntactic Complexity Analyzer](http://www.personal.psu.edu/xxl13/downloads/l2sca.html), [TAASSC](http://www.kristopherkyle.com/taassc.html)).

- Syllable count

- Misspelling count

- Part-of-speech tags

For statistical models, we also tested logistic regression and a linear discriminant analysis (LDA) classifier for classifying the score. However, their performance (Spearman's rank correlation coefficient $\rho$ and $R^2$) was worse than that of our baseline multiple linear regression.
We would also like to try other models, such as random forest classifiers/regressors and neural networks.