# Milestone #4

**Prompt:**

For the upcoming milestone, we will build some baseline models with basic features. You are free to explore on your own to investigate which features are useful. Here are some examples to get you started:

1. "meta" features (e.g. as most of you discovered, essay length seems to be a good indicator of score)
2. Essay "content" features, where the content is given by the bag-of-words, i.e. use CountVectorizer or TfidfVectorizer (I recommend TfidfVectorizer) to get a feature vector for each essay.
3. Prompt content features.
4. Combinations of the above.

Always be sure to report train/test scores, and feel free to experiment with regularization, ngram size for vectorizers, etc.

Note: feature vectors you get from (2) and (3) will be sparse. Therefore in order to combine (2) with (1), you will need to make (1) sparse as well.
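
A minimal sketch of the note above, assuming toy essays in place of `essay_df["essay"]`: the dense essay-length column is wrapped in a SciPy sparse matrix and stacked next to the TF-IDF matrix with `scipy.sparse.hstack`. The names `essays`, `content`, `meta` and `combined` are illustrative and do not appear in the notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy essays standing in for essay_df["essay"] (hypothetical data).
essays = [
    "computers help people learn new things",
    "computers take time away from family and friends",
    "people learn from books and from computers",
]

# (2) Sparse bag-of-words features, one row per essay.
tfidf = TfidfVectorizer()
content = tfidf.fit_transform(essays)

# (1) A dense "meta" feature: essay length in characters, as a single column.
lengths = np.array([[len(e)] for e in essays], dtype=float)

# Make (1) sparse, then stack it next to (2) column-wise.
meta = sparse.csr_matrix(lengths)
combined = sparse.hstack([content, meta], format="csr")

print(content.shape, meta.shape, combined.shape)
```

Keeping everything sparse avoids materializing the full dense matrix, and scikit-learn's linear models accept the stacked result directly.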

Let me know if your group wants to meet with me to discuss more in person.

In [31]:
import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
%matplotlib inline

In [6]:
essay_df = pd.read_csv("datasets/training_set_rel3.tsv", delimiter="\t")
essay_df.head(2)

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
0,1,1,"Dear local newspaper, I think effects computer...",4,4,,8,,,,...,,,,,,,,,,
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,,9,,,,...,,,,,,,,,,


In [7]:
essay_df.shape

(12976, 28)

In [12]:
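# TF-IDF over the essays: drop English stop words and terms that appear in fewer than 4 essays
vectorizer = TfidfVectorizer(stop_words="english", min_df=4, decode_error="ignore")
corpus = essay_df["essay"].values
essay_vector = vectorizer.fit_transform(corpus)
# densify the sparse matrix; at 12976 x 10017 float64 values this takes about 1 GB
essay_vector = essay_vector.toarray()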

In [14]:
essay_vector.shape

(12976, 10017)

Each essay vector has 10,017 dimensions, one per vocabulary term kept by the vectorizer.

In [15]:
# adding an essay length column (number of characters) to the dataframe
essay_df["essay_length"] = essay_df["essay"].str.len()

In [16]:
essay_df.head(2)

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6,essay_length
0,1,1,"Dear local newspaper, I think effects computer...",4,4,,8,,,,...,,,,,,,,,,1875
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,,9,,,,...,,,,,,,,,,2288


In [25]:
feature_df = pd.DataFrame(essay_vector)

### We should also make sure to turn essay prompts into features.
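
One simple way to start, sketched here under the assumption that the prompt is identified only by `essay_set`: one-hot encode the set ID so that a linear model does not treat prompt numbers as an ordered quantity. This encodes which prompt an essay answers, not the prompt's text; `demo_df` and its values are hypothetical stand-ins for `essay_df`.

```python
import pandas as pd

# Hypothetical miniature of essay_df; only essay_set matters for the encoding.
demo_df = pd.DataFrame({
    "essay_set": [1, 1, 2, 3],
    "essay_length": [1875, 2288, 900, 640],
})

# One indicator column per prompt: set_1, set_2, set_3.
set_dummies = pd.get_dummies(demo_df["essay_set"], prefix="set")

# Replace the raw ID column with its indicator columns.
demo_features = pd.concat([demo_df.drop("essay_set", axis=1), set_dummies], axis=1)
print(demo_features.columns.tolist())
```

True prompt content features would need the prompt texts themselves, which are not part of this table.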

In [34]:
def get_essay_x_y(data):
    y = data["domain1_score"]
    X = data.drop("domain1_score", axis=1)
    return X, y

In [27]:
feature_df["essay_set"] = essay_df["essay_set"]
feature_df["essay_length"] = essay_df["essay_length"]
feature_df["domain1_score"] = essay_df["domain1_score"]

In [30]:
print(feature_df.shape)
feature_df.head(2)

(12976, 10020)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,10010,10011,10012,10013,10014,10015,10016,essay_set,essay_length,domain1_score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1875,8
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,2288,9


In [35]:
X, y = get_essay_x_y(feature_df)

In [None]:
LinReg = LinearRegression(n_jobs=-1)

In [None]:
LinReg.fit(X, y)
# score() needs both features and targets; this is R^2 on the training data
LinReg.score(X, y)
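
The milestone prompt asks for train/test scores and suggests trying regularization. A hedged sketch of what that could look like, using synthetic data in place of the notebook's `X` and `y`, and `Ridge` as one possible regularized model (the notebook itself uses plain `LinearRegression`). The names `X_demo` and `y_demo` and the choice `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X, y); the notebook would pass its own features.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 20)
true_coef = rng.rand(20)
y_demo = X_demo.dot(true_coef) + 0.1 * rng.randn(200)

# Hold out a quarter of the rows so the test score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# alpha is the L2 regularization strength; 1.0 is an arbitrary starting point.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("train R^2: %.3f, test R^2: %.3f" % (train_r2, test_r2))
```

With about 10k TF-IDF columns and 13k essays, a large gap between the two scores would point to overfitting, which is where tuning `alpha` becomes useful.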