Note that we're still talking about the feature engineering step of the data preprocessing pipeline as shown below:

![Feature Engineering](assets/feature_engineering.png)



# Semantics

In the BoW approach, for all the information we were able to pull out of the text, one thing we didn't really use was semantics: the *meaning* of the words and sentences. The models we built in the previous checkpoint may know that Jane Austen tends to use the word *lady* a lot in her writing, but they don't know what a lady is. There is nothing in our work on NLP so far that would allow a model to say whether 'queen' or 'car' is more similar to 'lady.'

In the absence of semantic information, models can get tripped up on things like synonyms ('milady' and 'lady'). We could modify the spaCy dictionary to include 'lady' as the lemma of 'milady,' then use lemmas for all our analyses, but for this to be an effective approach we would have to go through our entire corpus and identify all synonyms for all words by hand. This approach would also discard subtle differences in connotation (the concepts, ideas, or emotions associated with a word) between 'lady' (which elicits thoughts of formal manners and England) and 'milady' (which elicits thoughts of the Middle Ages and Renaissance Faires).

Basically, language is complicated, and trying to explicitly model all the information encoded in a language is nearly impossible. Fortunately, we have some methods that get around this to some extent. In general, these methods work on a corpus of text and learn the rules of the language by identifying recurring patterns within the corpus. All of these methods produce a numerical representation of the words as their output.

In this checkpoint, we are going to introduce the **tf-idf** method, a modification of the BoW approach of the previous checkpoint, as well as **Latent Semantic Analysis**.


# BoW revisited: tf-idf

The BoW approach rests upon counting the occurrences of the words in the documents (in our case the sentences). However, there is other information we can exploit from the occurrences of the words apart from their counts in the sentences. In the following, we'll go step by step in introducing **tf-idf** vectorization which takes some clever actions when counting the words. 

Consider the following sentences:

1. "The best Monty Python sketch is the one about the dead parrot,  I laughed so hard."
2. "I laugh when I think about Python's Ministry of Silly Walks sketch, it is funny, funny, funny, the best!"
3. "Chocolate is the best ice cream dessert topping, with a great taste."
4. "The Lumberjack Song is the funniest Monty Python bit: I can't think of it without laughing."
5. "I would rather put strawberries on my ice cream for dessert, they have the best taste."
6. "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

As a human being, it's easy to see that the sentences involve two topics, comedy and ice cream. One way to represent the sentences is in a **term-document matrix**, with a column for each sentence and a row for each word. Ignoring the stop words 'the', 'is', 'and', 'a', 'of', 'I', and 'about', discarding words that occur only once, and reducing words like 'laughing' to their root form ('laugh'), the term-document matrix for these sentences would be:

|           | 1 | 2 | 3 | 4 | 5 | 6 |
|-----------|---|---|---|---|---|---|
| Monty     | 1 | 0 | 0 | 1 | 0 | 0 |
| Python    | 1 | 1 | 0 | 1 | 0 | 0 |
| sketch    | 1 | 1 | 0 | 0 | 0 | 0 |
| laugh     | 1 | 1 | 0 | 1 | 0 | 0 |
| funny     | 0 | 3 | 0 | 1 | 0 | 0 |
| best      | 1 | 1 | 1 | 0 | 1 | 0 |
| ice cream | 0 | 0 | 1 | 0 | 1 | 1 |
| dessert   | 0 | 0 | 1 | 0 | 1 | 0 |
| taste     | 0 | 0 | 1 | 0 | 1 | 2 |



Note that we use the term 'document' to refer to the individual text chunks we are working with. It can sometimes mean sentences, sometimes paragraphs, and sometimes whole text files. In our case, each sentence is a document. Also note that, contrary to how we usually operate, a term-document matrix has words as rows and documents as columns.

The comedy sentences use the words (number of sentences in parentheses): Python (3), laugh (3), Monty (2), sketch (2), funny (2), and best (2).
The ice cream sentences use the words: ice cream (3), taste (3), dessert (2), and best (2).

The word 'best' stands out here: it appears in more sentences than any other word (4 of 6), and it is used equally to describe Monty Python and ice cream. If we were to use this term-document matrix as-is to teach a computer to parse sentences, 'best' would end up as a significant identifier for both topics, and every time we gave the model a new sentence to identify that included 'best,' it would bring up both topics. Not very useful. To avoid this, we want to weight the matrix so that words that occur in many different sentences have lower weights than words that occur in fewer sentences. We do want to put a floor on this, though: words that only occur once are useless for finding associations between sentences.

Another word that stands out is 'funny', which appears more often in comedy sentences than any other word.  This suggests that 'funny' is a very important word for defining the 'comedy' topic.  

## Quantifying documents: collection and document frequencies

**Document frequency (df)** counts how many sentences a word appears in. **Collection frequency (cf)** counts how often a word appears, total, over all sentences. Let's calculate the df and cf for our sentence set:

|           |df |cf| 
|-----------|---|---|
| Monty     | 2 | 2 | 
| Python    | 3 | 3 | 
| sketch    | 2 | 2 | 
| laugh     | 3 | 3 | 
| funny     | 2 | 4 | 
| best      | 4 | 4 | 
| ice cream | 3 | 3 | 
| dessert   | 2 | 2 | 
| taste     | 3 | 4 | 
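
As a minimal sketch, both quantities can be read directly off the term-document matrix. The `counts` dictionary below simply hard-codes the matrix from the example; the variable names are illustrative and do not come from any library:

```python
# Term-document matrix from the example:
# one entry per term, one count per sentence (sentences 1 to 6).
counts = {
    "Monty":     [1, 0, 0, 1, 0, 0],
    "Python":    [1, 1, 0, 1, 0, 0],
    "sketch":    [1, 1, 0, 0, 0, 0],
    "laugh":     [1, 1, 0, 1, 0, 0],
    "funny":     [0, 3, 0, 1, 0, 0],
    "best":      [1, 1, 1, 0, 1, 0],
    "ice cream": [0, 0, 1, 0, 1, 1],
    "dessert":   [0, 0, 1, 0, 1, 0],
    "taste":     [0, 0, 1, 0, 1, 2],
}

# df: number of sentences in which the term occurs at least once
df = {term: sum(1 for c in row if c > 0) for term, row in counts.items()}
# cf: total number of occurrences across all sentences
cf = {term: sum(row) for term, row in counts.items()}

for term in counts:
    print(f"{term:10s} df={df[term]} cf={cf[term]}")
```

The two numbers differ only for terms that are repeated within a sentence, such as 'funny' and 'taste'.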



## Penalizing indiscriminate words: inverse document frequency

Now let's weight the document frequency so that words that occur less often (like 'sketch' and 'dessert') are more influential than words that occur a lot (like 'best').  We will calculate the ratio of total documents (N) divided by df, then take the log (base 2) of the ratio, to get our **inverse document frequency (idf)** number for each term (t):

$$idf_t = \log_2 \dfrac{N}{df_t}$$


|           | df | cf | idf |
|-----------|---|---|---|
| Monty     | 2 | 2 | 1.585 |
| Python    | 3 | 3 | 1 |
| sketch    | 2 | 2 | 1.585 |
| laugh     | 3 | 3 | 1 |
| funny     | 2 | 4 | 1.585 |
| best      | 4 | 4 | .585 |
| ice cream | 3 | 3 | 1 |
| dessert   | 2 | 2 | 1.585 |
| taste     | 3 | 4 | 1 |

The idf weights tell the model to consider 'best' as less important than other terms.  
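
The idf column can be reproduced in a few lines of plain Python. This is only a sketch of the textbook formula with a base-2 logarithm; the `df` values are copied from the table, and `N` is the number of example sentences:

```python
import math

N = 6  # number of sentences (documents) in the example
df = {"Monty": 2, "Python": 3, "sketch": 2, "laugh": 3, "funny": 2,
      "best": 4, "ice cream": 3, "dessert": 2, "taste": 3}

# idf_t = log2(N / df_t): the more sentences a term appears in, the lower its idf
idf = {term: math.log2(N / d) for term, d in df.items()}

for term, value in idf.items():
    print(f"{term:10s} {value:.3f}")
```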

## Term-frequency weights

The next piece of information to consider for our weights is how frequently a term appears within a sentence. The word 'funny' appears three times in one sentence, and it would be good if the weights reflected that. We can accomplish this by creating unique weights for each sentence that combine the **term frequency** (how often a word appears within an individual document) with the idf, like so:

$$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$

Now the term 'funny' in sentence 2, where it occurs three times, will be weighted more heavily than the term 'funny' in sentence 4, where it only occurs once (as 'funniest'). If 'best' had appeared multiple times in one sentence, it would also have a higher weight for that sentence, but the weight would be reduced by the idf term that takes into account that 'best' is a pretty common word in our collection of sentences.

The **tf-idf** score will be highest for a term that occurs a lot within a small number of sentences, and lowest for a word that occurs in most or all sentences.

Now we can represent each sentence as a vector made up of the tf-idf scores for each word:

|           | 1 | 2 | 3 | 
|-----------|---|---|---|
| Monty     | 1.585 | 0 | 0 |
| Python    | 1 | 1 | 0 | 
| sketch    | 1.585| 1.585 | 0 | 
| laugh     | 1 | 1 | 0 | 
| funny     | 0 | 4.755 | 0 | 
| best      | .585 | .585 | .585 | 
| ice cream | 0 | 0 | 1 | 
| dessert   | 0 | 0 | 1.585 | 
| taste     | 0 | 0 | 1 |
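
The same calculation can be sketched in plain Python for all six sentences. The `tf` matrix below hard-codes the term-document counts from the example, and the weights follow the textbook formula with a base-2 logarithm (scikit-learn, used later, weights terms somewhat differently):

```python
import math

terms = ["Monty", "Python", "sketch", "laugh", "funny",
         "best", "ice cream", "dessert", "taste"]
# Rows follow `terms`; columns are sentences 1 to 6.
tf = [
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 3, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 2],
]
N = len(tf[0])  # number of sentences

tfidf = []
for row in tf:
    df_t = sum(1 for c in row if c > 0)   # document frequency of the term
    idf_t = math.log2(N / df_t)           # inverse document frequency
    tfidf.append([c * idf_t for c in row])

# Column j of `tfidf` is the vector for sentence j + 1.
sentence_2 = {term: round(row[1], 3) for term, row in zip(terms, tfidf)}
print(sentence_2)
```

Reading the matrix by column gives one vector per sentence, which is the form the models below expect.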


## Things to consider in tf-idf

As with any feature generation technique for text data, tf-idf vectors are a kind of 'translation' from human-readable language to computer-usable numeric form. Some information is inevitably lost in translation, and the usefulness of any model we build from here on out depends on the decisions we made during the translation step. Possible decision points include:

* Which stop words to include or exclude.
* Whether to use phrases ('Monty Python' instead of 'Monty' and 'Python') as terms.
* The threshold for infrequent words: In our example, we excluded words that only occurred once. In longer documents, it may be a good idea to set a higher threshold.
* How many terms to keep. We kept all the terms that fit our criteria (not a stop word, occurred more than once), but for bigger document collections or longer documents, this may create unfeasibly long vectors. We may want to decide to only keep the 10,000 words with the highest collection frequency scores, for example.
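
In scikit-learn's `TfidfVectorizer` (the class used in the next section), each of these decisions corresponds to a constructor parameter. The sketch below is purely illustrative: the toy corpus and the specific parameter values are assumptions, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The best Monty Python sketch is the one about the dead parrot",
    "The Lumberjack Song is the funniest Monty Python bit",
    "Chocolate is the best ice cream dessert topping",
    "I would rather put strawberries on my ice cream for dessert",
]

vectorizer = TfidfVectorizer(
    stop_words="english",   # which stop words to drop (built-in English list)
    ngram_range=(1, 2),     # keep single words and two-word phrases
    min_df=2,               # drop terms found in fewer than 2 documents
    max_features=10000,     # cap on vocabulary size, ranked by corpus frequency
)
X = vectorizer.fit_transform(corpus)

# The surviving vocabulary: terms and phrases shared by at least two documents
print(sorted(vectorizer.vocabulary_))
```

Changing any one of these values changes the vocabulary, and with it every vector the models see.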


## Implementing tf-idf

We're all set to implement tf-idf vectorization. As we did for BoW in the previous checkpoint, we'll use the scikit-learn library to generate tf-idf vectors for two novels: *Persuasion* by Jane Austen and *Alice's Adventures in Wonderland* by Lewis Carroll.

Before jumping into the vectorization, we apply the same data cleaning steps as in the previous checkpoint:

In [1]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
from nltk.corpus import gutenberg
import nltk
import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/mladmin/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/mladmin/miniconda3/envs/datascience/lib/python3.6/site-packages/en_core_web_sm
-->
/Users/mladmin/miniconda3/envs/datascience/lib/python3.6/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [2]:
# utility function for standard text cleaning
def text_cleaner(text):
    # visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # remove anything enclosed in square brackets, such as the title block
    text = re.sub(r"\[.*?\]", "", text)
    # remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text

In [3]:
# load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# the chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

In [4]:
# parse the cleaned novels. this can take a bit
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [5]:
# group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# combine the sentences from the two novels into one data frame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

Unnamed: 0,text,author
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(Oh, dear, !)",Carroll


In [6]:
# get rid of stop words and punctuation
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])

Now we can start using the `TfidfVectorizer` class of scikit-learn. Note that below we set some of its parameters. Specifically, we set:

* **max_df=0.5**: This drops words that occur in more than half the documents.
* **min_df=2**: This makes the vectorizer only use words that appear in at least two documents.
* **use_idf=True**: This makes the vectorizer use inverse document frequencies in weighting.
* **norm=u'l2'**: This scales each document vector to unit (Euclidean) length, so that longer and shorter documents get treated equally.
* **smooth_idf=True**: This adds 1 to all document frequencies, as if an extra document existed that used every word once. This prevents divide-by-zero errors.

There are other parameters of `TfidfVectorizer` that we can set. For more information, you can look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 
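
One consequence of these settings is worth checking: with `smooth_idf=True`, scikit-learn documents its idf as `ln((1 + n) / (1 + df)) + 1`, using the natural logarithm, rather than the base-2 `log(N / df)` of the worked example. The weights are therefore not directly comparable to the tables above, and a value of 0 simply means the term is absent from the document. The sketch below verifies this on a made-up four-document corpus (the corpus is an assumption for illustration only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["monty python sketch", "python laugh",
        "ice cream taste", "ice cream dessert"]

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, norm="l2")
dense = vectorizer.fit_transform(docs).toarray()

n_docs = dense.shape[0]
# document frequency of each term: number of documents with a nonzero weight
df = (dense > 0).sum(axis=0)

# smoothed idf as documented by scikit-learn
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1
print(np.allclose(vectorizer.idf_, expected_idf))

# norm="l2" scales every document vector to unit Euclidean length
print(np.linalg.norm(dense, axis=1))
```

Because of the added 1, the smoothed idf never drops below 1, so a term present in a document always receives a strictly positive weight.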

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)


# applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

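# get_feature_names_out() requires scikit-learn >= 1.0;
# older releases used get_feature_names() instead
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())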
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# a tf-idf score of 0 means that the term does not occur in that sentence;
# most entries are 0 because each sentence uses only a few of the terms.
sentences.head()

Unnamed: 0,abide,ability,able,abominate,abroad,absence,absent,absolute,absolutely,absurd,...,yer,yes,yesterday,yield,young,youth,zeal,zealous,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll


As you can see, we now have a dataset in the form that we like: each row is an observation and the columns are numeric features. From now on, we can use this dataset as input to machine learning models. So, we're moving on to the modeling phase of the data preprocessing pipeline as shown below:

![Modeling](assets/modeling.png)

## Tf-idf in action

As we did in the previous checkpoint, we'll build some machine learning models using our dataset. Since our features are all numerical now, we can use them directly in our models. As in the previous checkpoint, our task is to predict the author of a sentence:

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9003054707025826

Test set score: 0.8775510204081632
----------------------Random Forest Scores----------------------
Training set score: 0.964454318244932

Test set score: 0.8750520616409829
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8617050819216884

Test set score: 0.8467305289462724


Compared to the BoW scores of the previous checkpoint, our scores are a little bit lower this time. But at least, tf-idf features seem to reduce the overfitting of the logistic regression.

## 2-grams example

We can also make use of n-grams in `TfidfVectorizer`. To do that, we need to set the `ngram_range` parameter. Below, we use 2-grams as our features and apply tf-idf vectorization. Then we apply the same models as above to get our predictions:

In [9]:
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm=u'l2', smooth_idf=True, ngram_range=(2,2))


# applying the vectorizer
X = vectorizer.fit_transform(sentences["text"])

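# get_feature_names_out() requires scikit-learn >= 1.0;
# older releases used get_feature_names() instead
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())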
sentences = pd.concat([tfidf_df, sentences[["text", "author"]]], axis=1)

# a tf-idf score of 0 means that the 2-gram does not occur in that sentence;
# most entries are 0 because each sentence contains only a few of the 2-grams.
sentences.head()

Unnamed: 0,able bear,able persuade,absence home,absolute necessity,absolutely hopeless,accident lyme,accommodation man,account louisa,account small,acquaint captain,...,young friend,young lady,young man,young people,young person,young sister,young woman,youth say,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll


Notice that we have 3381 features now (excluding the text and author columns). If you remember, the 2-gram BoW vectorizer from the previous checkpoint produced more than 30,000 features! This is a huge reduction in the number of features, and it is due to the values we set for `TfidfVectorizer` parameters such as `max_df` and `min_df`.

Now, let's run our models using the 2-grams features:

In [10]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.8236600944182172

Test set score: 0.7896709704289879
----------------------Random Forest Scores----------------------
Training set score: 0.8980838655928909

Test set score: 0.8263223656809663
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8019994445987225

Test set score: 0.7913369429404414


The results are again slightly lower than the BoW scores of the previous checkpoint. However, this time the overfitting of the logistic regression and random forest models is reduced substantially. This is basically because of the reduction in the number of features!

# Some applications of tf-idf

So far, we applied classification models using the tf-idf vectors to predict the authors of the sentences. Here, we briefly touch upon some popular NLP applications that make use of tf-idf vectorization. In the first one, we briefly review how we can measure the similarities of the documents and in the second one we talk about **topic modeling** which refers to deriving the fundamental topics of a collection of documents.

## Vector Space Model

By now, you've had some practice thinking of data as existing in multi-dimensional space. Our sentences exist in an n-dimensional space where n is equal to the number of terms in our term-document matrix. This vector representation of text is referred to as the **Vector Space Model** (VSM). We can use this representation to compute the similarity between the sentences and a new phrase or sentence. This method is often used by search engines to match a query to possible results.

To compute the similarity of sentences to a new sentence, we transform the new sentence into a vector and place it in the space. We can then calculate how different the angles are for the original vectors and the new vector, and identify the vector whose angle is closest to the new vector. Typically this is done by calculating the cosine of the angle between the vectors. If the two vectors are identical, the angle between them will be 0° and the cosine will be 1. If the two vectors are orthogonal, with an angle of 90°, the cosine will be 0.  

If we were running a search query, then, we would return sentences that were most similar to the query sentence, ordered from the highest similarity score (cosine) to the lowest. Pretty handy!
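
A minimal sketch of such a query, assuming scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The four documents and the query string are made up for illustration; the key point is that the query is transformed with the vectorizer already fitted on the documents, so it lands in the same space:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The best Monty Python sketch is the one about the dead parrot",
    "The Lumberjack Song is the funniest Monty Python bit",
    "Chocolate is the best ice cream dessert topping",
    "The taste of caramel is a fantastic accompaniment to mint ice cream",
]
query = "Monty Python parrot sketch"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)
# reuse the fitted vocabulary and idf weights for the query
query_vector = vectorizer.transform([query])

# cosine of the angle between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# rank documents from most to least similar
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

Documents that share no terms with the query are orthogonal to it and score exactly 0, which is also the root of the limitations discussed next.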

Cool as this is, there are limitations to the VSM. In particular, because it treats each word as distinct from every other word, it can run aground on *synonyms* (treating words that mean the same thing as though they are different, like 'big' and 'large'). Also, because it treats all occurrences of a word as the same regardless of context, it can run aground on *polysemy*, where there are different meanings attached to the same word: 'I need a break' vs 'I break things.' In addition, the vector space model has difficulty with very large documents, because the more words a document has, the more opportunities it has to diverge from other documents in the space, making it difficult to see similarities.

## Latent Semantic Analysis

A solution to these problems is to reduce our tf-idf-weighted term-document matrix into a lower-dimensional space, that is, to express the information in the matrix using fewer rows by combining the information from multiple terms into one new row/dimension. This is the same idea as Principal Components Analysis (PCA). **Latent Semantic Analysis** (LSA for short, also called Latent Semantic Indexing) is the process of applying this kind of PCA-style dimension reduction to a tf-idf matrix. What we get, in the end, is clusters of terms that presumably reflect a topic. Each document will get a score for each topic, with higher scores indicating that the document is relevant to the topic. Documents can pertain to more than one topic.

LSA is handy when your corpus is too large to topically annotate by hand, or when you don't know what topics characterize your documents. It is also useful as a way of creating features to be used in other models.
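
To make the idea concrete before turning to a real corpus, here is a toy sketch on the six example sentences from earlier in this checkpoint. It makes several simplifying assumptions: it uses raw counts rather than tf-idf weights, stores sentences as rows (the transpose of the term-document matrix, which is the layout scikit-learn uses), and calls NumPy's SVD directly instead of a library LSA routine.

```python
import numpy as np

# Document-term counts for the six example sentences (rows = sentences).
# Columns: Monty, Python, sketch, laugh, funny, best, ice cream, dessert, taste
A = np.array([
    [1, 1, 1, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 3, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 2],
], dtype=float)

# Singular value decomposition; keep only the two strongest components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_lsa = U[:, :k] * s[:k]   # each sentence as a 2-dimensional "topic" vector

# Cosine similarity between sentences in the reduced space.
unit = docs_lsa / np.linalg.norm(docs_lsa, axis=1, keepdims=True)
sim = unit @ unit.T
print(np.round(sim, 2))
```

In the reduced space, sentences on the same topic should end up close together even when they share few exact words, which is what the similarity matrix lets you inspect.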

Let's try it out! Once again, we'll use the Gutenberg corpus. This time, we'll focus on comparing paragraphs within Emma, another novel by Jane Austen.

In [11]:
# reading in the data, this time in the form of paragraphs
emma = gutenberg.paras('austen-emma.txt')

# processing
emma_paras = []
for paragraph in emma:
    # each paragraph is a list of sentences, and each sentence is a list
    # of words; we keep only the first sentence of each paragraph
    para = paragraph[0]
    # removing the double-dash from all words
    para = [re.sub(r'--', '', word) for word in para]
    # joining the words into one string and adding it to the list of strings
    emma_paras.append(' '.join(para))

print(emma_paras[0:4])

['[ Emma by Jane Austen 1816 ]', 'VOLUME I', 'CHAPTER I', 'Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her .']


In [12]:
X_train, X_test = train_test_split(emma_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear in at least two paragraphs
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case so that capitalized and lowercase forms count as the same word
                             use_idf=True, # we definitely want to use inverse document frequencies in our weighting
                             norm='l2', # scales each paragraph vector to unit length so that longer and shorter paragraphs get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )


#Applying the vectorizer
emma_paras_tfidf=vectorizer.fit_transform(emma_paras)
print("Number of features: %d" % emma_paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(emma_paras_tfidf, test_size=0.4, random_state=0)

#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
#List of features
terms = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older releases used get_feature_names()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

#A term that does not occur in a paragraph has a tf-idf score of 0, so only the nonzero scores are stored and printed here.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

Number of features: 1948
Original sentence: A very few minutes more , however , completed the present trial .
Tf_idf vector: {'minutes': 0.7127450310382584, 'present': 0.701423210857947}


### Dimension reduction
Okay, now we have our vectors, with one vector per paragraph.  It's time to do some dimension reduction.  We use the Singular Value Decomposition (SVD) function from sklearn rather than PCA because we don't want to mean-center our variables (and thus lose sparsity):

In [13]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from 1948 to 130.
svd= TruncatedSVD(130)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first five identified topics
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('-------COMPONENT {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])
    print("-------------------------------------------------------------")

Percent variance captured by all components: 45.20802645340931
-------COMPONENT 0:
" Oh !     0.99929
" Oh !"    0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !"    0.99929
" Oh !     0.99929
Name: 0, dtype: float64
-------------------------------------------------------------
-------COMPONENT 1:
" You have made her too tall , Emma ," said Mr . Knightley .                                                                                                                0.633275
" You get upon delicate subjects , Emma ," said Mrs . Weston smiling ; " remember that I am here . Mr .                                                                     0.594526
" I do not know what your opinion may be , Mrs . Weston ," said Mr . Knightley , " of this great intimacy between Emma and Harriet Smith , but I think it a bad thing ."    0.562642
" You are right , Mrs . Weston ," said Mr . Knightley warmly , " Miss Fairfax 

Judging from the most representative sample paragraphs, component 0 appears to target the exclamation 'Oh!', component 1 seems to largely involve critical dialogue directed at or about the main character Emma, component 2 is chapter headings, component 3 is exclamations involving 'Ah!', and component 4 involves actions by or directly related to Emma. What fun!

LSA is one of many unsupervised methods that can be applied to text data. While it is good for dealing with synonyms, it cannot handle polysemy. For that, we need other kinds of approaches, one of which we'll see in the next checkpoint.

Although we have presented LSA as an unsupervised method, it can also be used to prepare text data for classification in supervised learning. In that case, the goal would be to use LSA to arrive at a smaller set of features that can be used to build a supervised model that will classify text into pre-labeled categories.
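
A minimal sketch of that supervised use, assuming scikit-learn's `make_pipeline`, `TruncatedSVD`, and `LogisticRegression`. The six labeled sentences are the toy examples from the start of this checkpoint, and both the labels and the choice of two components are assumptions made for illustration:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The best Monty Python sketch is the one about the dead parrot, I laughed so hard.",
    "I laugh when I think about Python's Ministry of Silly Walks sketch, it is funny, funny, funny, the best!",
    "Chocolate is the best ice cream dessert topping, with a great taste.",
    "The Lumberjack Song is the funniest Monty Python bit: I can't think of it without laughing.",
    "I would rather put strawberries on my ice cream for dessert, they have the best taste.",
    "The taste of caramel is a fantastic accompaniment to tasty mint ice cream.",
]
labels = ["comedy", "comedy", "ice cream", "comedy", "ice cream", "ice cream"]

# tf-idf vectors -> 2 LSA components -> classifier, fitted as one unit
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
model.fit(texts, labels)

predictions = model.predict(["That parrot sketch always makes me laugh"])
print(predictions)
```

Wrapping the steps in a single pipeline means the vectorizer and the SVD are fitted only on the training texts and then reused unchanged on new text.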