After vectorizing our essays with `sklearn`'s tf-idf vectorizer, we have a sparse matrix of vectorized text, often with *more than 16,000* different features. Feature sets this large are prone to overfitting, as our baseline model showed, so we chose to perform **dimensionality reduction** to cut the number of features going into our model. A powerful tool at our disposal was **latent semantic analysis** (LSA). Essentially, LSA takes the large feature set of individual words and finds combinations of those words that explain most of the variability in the original text. This means that after performing LSA on a tf-idf matrix with thousands of features, we can fit a much smaller model with substantially fewer predictors, avoiding some of the risk of overfitting.
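
As a rough sketch of this step, tf-idf vectorization followed by `TruncatedSVD` might look like the following. The `essays` list is an invented stand-in corpus (not our data), and the choice of two components is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in corpus; the real essays are not reproduced here
essays = [
    "Computers help students learn new skills.",
    "Students spend too much time on computers.",
    "Libraries should keep every book on the shelf.",
    "Some books do not belong in a school library.",
]

# Sparse tf-idf matrix: one row per essay, one column per distinct term
tfidf = TfidfVectorizer().fit_transform(essays)

# Reduce to two LSA components (an arbitrary choice for this sketch)
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(tfidf)

print(tfidf.shape)    # (number of essays, number of distinct terms)
print(reduced.shape)  # (number of essays, 2)
```

The reduced matrix keeps one row per essay but only as many columns as there are LSA components, which is what shrinks the predictor set passed to a downstream model.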

To explain how LSA works, consider the two one-sentence documents below:
1. "Here is one document."
2. "This is another."

As before, we could vectorize these two documents using a tf-idf algorithm. For the sake of explanation, we will instead demonstrate LSA using a counting vectorizer, which simply counts the occurrences of each term; the LSA part of the analysis is exactly the same no matter how we vectorize the text. As shown in the code below, after vectorization, the two documents become:

| |another|document|here|is|one|this|
|---|---|---|---|---|---|---|
|**Here is one document.** |0|1|1|1|1|0|
|**This is another.** |1|0|0|1|0|1|

A single LSA component gives each term in this tf-idf matrix a weight, multiplies a document's tf-idf (row) vector by those weights, then sums across all the terms. Effectively, it transforms the high-dimensional tf-idf matrix into a much lower-dimensional space. To make these weights as useful as possible, LSA orders the components so that the first explains the most variance in the original feature set, with subsequent components explaining progressively less. LSA components must also be orthogonal (forming right angles with each other, so they do not duplicate one another's information) and of unit magnitude (so they don't rescale the original feature set). It is possible to create as many LSA components as there are original features, but doing so would defeat the purpose of performing LSA. As shown below, `sklearn` implements LSA via `TruncatedSVD`. In the simple case of one LSA component, the calculation looks like this:

|  |another|document|here|is|one|this| **Sum**|
|---|---|---|---|---|---|---|---|
|**LSA weight:** |0.24|0.4|0.4|0.64|0.4|0.24|--|
|**Weighted doc. 1** |0.0|0.4|0.4|0.64|0.4|0.0| **1.83**|
|**Weighted doc. 2** |0.24|0.0|0.0|0.64|0.0|0.24| **1.13**|

Recall that while this example uses a counting vectorizer, our analysis used a tf-idf vectorizer. Either way, the $2 \times 6$ vectorized set of documents has become a $2 \times 1$ matrix of much lower dimensionality.
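
The weighted sums in the table above can be reproduced with plain NumPy. This is a minimal sketch that takes the first right singular vector of the count matrix as the weights; the variable names are ours for illustration, and `np.linalg.svd` stands in for `TruncatedSVD` here.

```python
import numpy as np

# Count matrix from the table above; columns are
# another, document, here, is, one, this
word_mat = np.array([[0, 1, 1, 1, 1, 0],
                     [1, 0, 0, 1, 0, 1]], dtype=float)

# Rows of vt are the right singular vectors, ordered by singular value
_, _, vt = np.linalg.svd(word_mat, full_matrices=False)

# The sign of a singular vector is arbitrary; flip it so the weights are positive
weights = vt[0] if vt[0].sum() > 0 else -vt[0]

# Weight each term count, then sum across terms for each document
weighted = word_mat * weights
scores = weighted.sum(axis=1)

print(np.round(weights, 2))
print(np.round(scores, 2))

# The components have unit length and are orthogonal to each other
print(np.allclose(vt @ vt.T, np.eye(2)))
```

Rounded to two decimals, the weights and the per-document sums match the table, and the final check confirms the orthogonality and unit-magnitude properties described earlier.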

Extending this concept to the vastly higher-dimensional tf-idf matrix is straightforward, but it leaves open the choice of how many LSA components to use in the final model, a hyperparameter we call $d$. Again, the above example uses $d = 1$, but it is possible to choose $d$ up to the original number of features. In our case, we used cross-validation to tune $d$: we varied it, trained a new model with $d$ components on a training set, and evaluated that model's $R^2$ on a testing set. The results for each essay set are below.

![Tuning d](figures/d_tuning.png "Tuning d")
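
The tuning procedure described above can be sketched roughly as follows. The essays and scores are invented for illustration, `LinearRegression` is an assumed stand-in for the downstream model, and a single train/test split is used for brevity; none of these reflect the actual data or model behind the figure.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical essays and scores, invented purely for illustration
essays = [
    "Computers help students learn new skills every day.",
    "Students spend too much time staring at screens.",
    "Libraries should keep every book on the shelf.",
    "Some books do not belong in a school library.",
    "Technology connects people across the whole world.",
    "Reading outside builds patience and imagination.",
    "Online games distract children from their homework.",
    "Censorship removes ideas that readers deserve to see.",
    "Typing skills matter more than handwriting today.",
    "Parents should decide what their children read.",
    "Exercise suffers when people sit at computers all day.",
    "A good library offers something for every reader.",
]
scores = np.array([8.0, 5.0, 9.0, 6.0, 7.5, 6.5, 4.0, 10.0, 5.5, 7.0, 4.5, 8.5])

train_text, test_text, train_y, test_y = train_test_split(
    essays, scores, test_size=0.25, random_state=0)

# Fit one pipeline per candidate d and record its R^2 on the held-out essays
r2_by_d = {}
for d in (1, 2, 3):
    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=d, random_state=0),
        LinearRegression(),
    )
    model.fit(train_text, train_y)
    r2_by_d[d] = r2_score(test_y, model.predict(test_text))

best_d = max(r2_by_d, key=r2_by_d.get)
print(r2_by_d, best_d)
```

Wrapping the vectorizer and `TruncatedSVD` in one pipeline means both are fit on the training essays only, so the held-out $R^2$ is not contaminated by the test text. With a corpus this small the scores themselves are not meaningful; only the loop structure is the point.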

In [150]:
# Import modules
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def toTableRow(array):
    # Convert an array to a markdown table row, e.g. [1, 2] -> '|1|2|'
    return ''.join('|%s' % str(element) for element in array) + '|'
        

sentences = ['Here is one document.',
             'This is another.']

# Vectorize the text
vectorizer = CountVectorizer()
word_mat = np.array(vectorizer.fit_transform(sentences).todense())


# Perform LSA
n_components = 5
lsa_model = TruncatedSVD(n_components=n_components)
lsa_transformed = lsa_model.fit_transform(word_mat)

# Print out a markdown table of the vectorized text
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
feature_names = vectorizer.get_feature_names_out()
n_features = len(feature_names)

print('| ', toTableRow(feature_names))
print(toTableRow(np.tile('---', (n_features + 1,))))
print('|**%s**' % sentences[0], toTableRow(word_mat[0, :]))
print('|**%s**' % sentences[1], toTableRow(word_mat[1, :]))

print('\n')


# Print out a markdown table of the LSA calculation
# print('| ', toTableRow(feature_names), '**Sum**|')
# print(toTableRow(np.tile('---', (n_features + 2,))))
# print('|**LSA weight:**', toTableRow(np.round(lsa_model.components_[0, :], 2)), ' |')
# print('|**Weighted doc. 1**',
#       toTableRow(np.round(lsa_model.components_[0, :] * word_mat[0, :], 2)),
#       '**%.2f**|' % lsa_transformed[0, 0])
# print('|**Weighted doc. 2**',
#       toTableRow(np.round(lsa_model.components_[0, :] * word_mat[1, :], 2)),
#       '**%.2f**|' % lsa_transformed[1, 0])

# Share of variance explained by each returned component
print(lsa_model.explained_variance_ratio_)

|  |another|document|here|is|one|this|
|---|---|---|---|---|---|---|
|**Here is one document.** |0|1|1|1|1|0|
|**This is another.** |1|0|0|1|0|1|


[ 0.09750776  0.90249224]


In this case, the two documents produced six features. LSA will weight each of these features according to the formula

$$
v^{LSA}_{i, d} = \phi_{d,1} \cdot tf_{1} + \cdots + \phi_{d,j} \cdot tf_{j} + \cdots + \phi_{d,n} \cdot tf_{n}
$$

where $v^{LSA}_{i,d}$ is the $i^{th}$ element of the $d^{th}$ LSA component, $[tf_1, \cdots, tf_n]$