# Document Term Matrix and TF-IDF


### In this notebook we will:
* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM in Python
* Learn basic functionality of Python's package scikit-learn
* Understand tf-idf scores
* Learn a simple way to identify distinctive words
* In the process, gain more familiarity and comfort with the Pandas package and manipulating data


### Key Jargon
* *Document Term Matrix*:
    * a matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    * short for term frequency–inverse document frequency: a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
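
To make the DTM definition concrete, here is a minimal sketch that builds one by hand with the standard library only. The two-sentence corpus is invented for illustration, and splitting on whitespace is a simplifying assumption, not how scikit-learn tokenizes.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = ["the cat sat", "the dog sat on the cat"]

# Tokenize naively on whitespace and count the terms in each document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary (the columns of the DTM) is every distinct term, sorted.
vocab = sorted(set(term for c in counts for term in c))

# Each row is one document; each entry is that term's count in the document.
toy_dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in toy_dtm:
    print(row)
```

Each row has one entry per vocabulary term, including zeros for terms the document never uses. With a real vocabulary most entries are zero.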


## DTM/TF-IDF <a id='dtm'></a>

In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [None]:
import os
import numpy as np
import pandas as pd

DATA_DIR = 'data'
music_fname = 'music_reviews.csv'
music_fname = os.path.join(DATA_DIR, music_fname)

### First attempt at reading in file

In [None]:
reviews = pd.read_csv(music_fname, sep='\t')
reviews.head()

Print the text of the first review.

In [None]:
print(reviews['body'][0])

### Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

In [None]:
# We can count this using the value_counts() method
reviews['genre'].value_counts()

The first thing most people do is to `describe` their data. (This is the `summary` command in R, or the `sum` command in Stata).
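
A minimal sketch of what `describe` does, using a small made-up table so it runs on its own. The column names mirror `genre` and `score` from this dataset, but the values are invented. On the notebook's data the equivalent call would be `reviews.describe()`.

```python
import pandas as pd

# Hypothetical stand-in for the reviews table; the values are invented.
toy_reviews = pd.DataFrame({
    "genre": ["Rap", "Jazz", "Indie", "Jazz"],
    "score": [70.0, 85.0, 60.0, 75.0],
})

# describe() summarizes numeric columns: count, mean, std, min, quartiles, max.
summary = toy_reviews.describe()
print(summary)

# include="all" also summarizes non-numeric columns (unique, top, freq).
print(toy_reviews.describe(include="all"))
```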

Who were the reviewers?

In [None]:
reviews['critic'].value_counts().head(10)

And the artists?

In [None]:
reviews['artist'].value_counts().head(10)

We can get the average score as follows:

In [None]:
reviews['score'].mean()

Now we want to know the average score for each genre. To do this, we use the Pandas `groupby` method. You'll want to get very familiar with `groupby`. It's quite powerful. (Similar to `collapse` in Stata.)

In [None]:
reviews_grouped_by_genre = reviews.groupby("genre")
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)

### Creating the DTM using scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.

In [None]:
def remove_digits(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

reviews['body_without_digits'] = reviews['body'].apply(remove_digits)
reviews.head()

### CountVectorizer Function

Our next step is to turn the text into a document term matrix using the scikit-learn function called `CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(reviews['body_without_digits'])

Great! We made a DTM! Let's look at it.

In [None]:
sparse_dtm

This format is called Compressed Sparse Row (CSR) format. It saves a lot of memory to store the DTM this way, but it is difficult for a human to read. To illustrate the techniques in this lesson, we will first convert this matrix to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to keep the matrix in its sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!
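
To see why a sparse format saves memory, here is a rough sketch of the compressed sparse row (CSR) idea in plain Python: only the nonzero values are stored, along with their column positions and a pointer to where each row begins. The helper `to_csr` and the small matrix are hypothetical and for illustration only; SciPy's own implementation is far more efficient.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# Hypothetical 3 x 4 matrix with only three nonzero entries.
dense = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]
data, indices, indptr = to_csr(dense)
print(data)     # [2, 1, 3]
print(indices)  # [1, 0, 3]
print(indptr)   # [0, 1, 2, 3]
```

The twelve cells of the dense matrix shrink to three stored values plus bookkeeping. For a DTM with thousands of columns, the savings are much larger.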

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out(), index=reviews.index)
dtm.head()

### What can we do with a DTM?

We can quickly identify the most frequent words:

In [None]:
dtm.sum().sort_values(ascending=False).head(10)

Is it surprising that **"the"** is the most frequent word? 

In fact, there's a well-known empirical regularity called [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law), which states that "the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc: the rank-frequency distribution is an inverse relation." Our word counts show a roughly similar pattern.
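
One rough way to check this: if frequency is inversely proportional to rank, then rank times frequency should stay about constant. The helper below is a hypothetical sketch, and the counts are invented to be perfectly Zipfian; real word counts will only follow the pattern approximately.

```python
def rank_times_frequency(frequencies):
    """Return rank * frequency for counts sorted from most to least frequent."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]

# Hypothetical, perfectly Zipfian counts (invented): frequency = 1200 / rank.
ideal_counts = [1200, 600, 400, 300, 240]
print(rank_times_frequency(ideal_counts))  # [1200, 1200, 1200, 1200, 1200]
```

On the notebook's data, something like `rank_times_frequency(dtm.sum().tolist())[:10]` would show how close the top ten words come to this ideal.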

### Challenge

* Print out the most infrequent words rather than the most frequent words. You can look at the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats) for more information.
* Print the average number of times each word is used in a review.
* Print this out sorted from highest to lowest.

In [None]:
dtm.sum().sort_values().head()

In [None]:
dtm.mean().sort_values(ascending=False).head()

## [TF-IDF](http://tfidf.com/) scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is the `tf-idf score`. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the *inverse document frequency* of word $j$ is calculated as:

$idf_{j} = \log\left(\frac{\#docs}{\#docs\,with\,j}\right)$ 

and the *term frequency - inverse document frequency* is 

$tfidf_{ij} = f_{ij}\times{idf_j}$ where $f_{ij}$ is the number of occurrences of word $j$ in document $i$.

You can, and often should, normalize the word frequency: 

$tfidf_{ij} = \frac{f_{ij}}{\#words\,in\,doc\,i}\times{idf_{j}}$
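
The normalized formula can be sketched in a few lines of Pandas. The count matrix below is invented for illustration, and the natural log is an arbitrary choice (the formula does not fix a base); this is a sketch of the textbook definition, not of scikit-learn's exact calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy count matrix (rows = documents, columns = terms); values invented.
counts = pd.DataFrame(
    {"the": [2, 1, 3], "cat": [1, 0, 0], "jazz": [0, 3, 0]},
    index=["doc0", "doc1", "doc2"],
)

n_docs = len(counts)
docs_with_term = (counts > 0).sum(axis=0)    # document frequency of each term
idf = np.log(n_docs / docs_with_term)        # natural log; the base is a free choice

tf = counts.div(counts.sum(axis=1), axis=0)  # f_ij / (number of words in doc i)
manual_tfidf = tf * idf                      # idf is aligned to the term columns

print(manual_tfidf.round(3))
```

Because "the" appears in every document, its idf is log(1) = 0 and its tf-idf score is zero everywhere, no matter how often it occurs.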

We could calculate this manually, but scikit-learn has a built-in class that does it for us. By default, scikit-learn smooths the idf, computing $\ln\left(\frac{1 + \#docs}{1 + \#docs\,with\,j}\right) + 1$, and rescales each document's vector to unit length, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use Pandas to calculate this manually. 

### Example
Consider a document containing 100 words wherein the word cat appears 3 times. 
- The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. 

Now, assume we have 10 million documents and the word cat appears in one thousand of these. 
- Then, the inverse document frequency (i.e., idf), using a base-10 logarithm, is calculated as log10(10,000,000 / 1,000) = 4. 

Thus, the Tf-idf weight is the product of these quantities: **0.03 * 4 = 0.12.**
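
The arithmetic above can be checked directly. Note that the example only works out to 4 with a base-10 logarithm, so the sketch uses `math.log10`.

```python
import math

tf = 3 / 100                           # "cat" appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # 10 million documents, 1,000 contain "cat"
weight = tf * idf

print(tf, idf, round(weight, 2))
```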

### TfidfVectorizer Function

To compute tf-idf scores with scikit-learn, we do the same thing we did above with `CountVectorizer`, but use `TfidfVectorizer` instead.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(reviews['body_without_digits'])
sparse_tfidf

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out(), index=reviews.index)
tfidf.head()

Let's look at the 20 words with highest tf-idf weights.

In [None]:
tfidf.max().sort_values(ascending=False).head(20)

Ok! We have successfully identified content words, without removing stop words.

### Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre.

In [None]:
tfidf['genre_'] = reviews['genre']
tfidf.head()

Now let's compare the words with the highest tf-idf weight for each genre. 

In [None]:
rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']

rap.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
indie.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
jazz.max(numeric_only=True).sort_values(ascending=False).head()

There we go! A method of identifying distinctive words.
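
The per-genre comparison can also be done for every genre at once with `groupby`. This is an alternative sketch, not the approach used in the cells above; the toy table, its words, and its scores are all invented, and only the `genre_` column name follows the notebook.

```python
import pandas as pd

# Hypothetical toy tf-idf table; the words and scores are invented.
toy_tfidf = pd.DataFrame(
    {
        "beats": [0.9, 0.1, 0.0, 0.0],
        "guitar": [0.0, 0.2, 0.8, 0.1],
        "sax": [0.0, 0.0, 0.1, 0.7],
        "genre_": ["Rap", "Rap", "Indie", "Jazz"],
    }
)

# Maximum tf-idf score of every word within each genre.
max_by_genre = toy_tfidf.groupby("genre_").max()

# For each genre (row), the word (column) with the highest score.
top_word = max_by_genre.idxmax(axis=1)
print(top_word)
```

On the full table, `tfidf.groupby('genre_').max()` would give one row per genre, which can then be sorted row by row to list each genre's top words.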

### Further resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail.

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn
