## The Document Term Matrix and Discriminating Words

The Document Term Matrix (DTM) is the bread and butter of most computational text analysis techniques, from the simplest to the more sophisticated. In this lesson we will use Python's scikit-learn package to make a document term matrix from the Music Reviews dataset, which is stored as a .csv file. We will then use the DTM and a word weighting technique called tf-idf (term frequency-inverse document frequency) to identify important and discriminating words within this dataset (using the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? Theoretical exercise: what can we learn from these words?

Note: Python's scikit-learn package is an enormous package with a lot of functionality. Knowing this package will enable you to do some very sophisticated analyses, including almost all machine learning techniques. (It looks great on your CV too!) We'll get back to this package later in the workshop.

### Learning Goals
* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM from a .csv file
* Learn basic functionality of Python's package scikit-learn (we'll return to scikit-learn in lesson 06)
* Understand tf-idf scores, and word scores in general
* Learn a simple way to identify distinctive words
* In the process, gain more familiarity and comfort with the Pandas package and manipulating data

### Outline
* The Pandas Dataframe: Music Reviews
* Explore the Data using Pandas
    * Basic descriptive statistics
* Creating the DTM: scikit-learn
    * CountVectorizer function
* Tf-idf scores
    * TfidfVectorizer
* Identifying Distinctive Words
    * Identify distinctive words by genre


### Key Jargon
* *Document Term Matrix*:
    * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    * short for term frequency–inverse document frequency; a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
    
### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn
    
### 0. The Pandas Dataframe: Music Reviews

First, we read our music reviews corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe.

In [3]:
import pandas

#create a dataframe called "df"
df = pandas.read_csv("BDHSI2016_music_reviews.csv", sep = '\t')

#view the dataframe
print(df)

#notice the metadata. The column "body" contains our text of interest.


                                    album  \
0                             Don't Panic   
1                 Fear and Saturday Night   
2                      The Way I'm Livin'   
3                                   Doris   
4                                 Giraffe   
5                            Weathervanes   
6                    Build a Rocket Boys!   
7                      Ambivalence Avenue   
8                                 Wavvves   
9                          Peachtree Road   
10                               Heritage   
11                            White Chalk   
12                    Tyrannosaurus Hives   
13                             JackInABox   
14                            Liquid Love   
15                  The  Truth About Love   
16                            The Monitor   
17                         Ones and Sixes   
18        In Search Of... [First Version]   
19                            Tarot Sport   
20                             July Flame   
21        
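
Printing a whole dataframe can be unwieldy. As a hedged sketch, the snippet below reads a tiny made-up tab-separated sample (standing in for the real file, which is not reproduced here) and inspects it with `shape`, `columns`, and `head()`. The sample rows and the name `sample_df` are invented for illustration; only the column names mirror ones used in this lesson.

```python
import io
import pandas

# Made-up tab-separated sample; only the column names mirror this lesson's data
sample_tsv = (
    "album\tartist\tgenre\tscore\tbody\n"
    "Album A\tArtist One\tRap\t80\tA tight, confident record.\n"
    "Album B\tArtist Two\tJazz\t75\tWarm and inventive playing.\n"
    "Album C\tArtist Three\tIndie\t60\tPleasant but forgettable.\n"
)

# sep="\t" tells Pandas the fields are separated by tabs, not commas
sample_df = pandas.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)             # (number of rows, number of columns)
print(sample_df.columns.tolist())  # the column names
print(sample_df.head(2))           # only the first two rows
```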

### 1. Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

In [4]:
#We can count this using the value_counts() function
print(df['genre'].value_counts())

Pop/Rock                  1486
Indie                     1115
Rock                       932
Electronic                 513
Rap                        363
Pop                        149
Country                    140
R&B;                       112
Folk                        70
Alternative/Indie Rock      42
Dance                       41
Jazz                        38
Name: genre, dtype: int64


Who are the reviewers?

In [5]:
print(df['critic'].value_counts())

AllMusic                     282
PopMatters                   228
Pitchfork                    207
Q Magazine                   178
Uncut                        171
Mojo                         137
Drowned In Sound             132
New Musical Express (NME)    127
The A.V. Club                121
Rolling Stone                112
Under The Radar              100
Spin                          97
The Guardian                  96
musicOMH.com                  88
Entertainment Weekly          87
Slant Magazine                83
Paste Magazine                72
Consequence of Sound          69
Alternative Press             69
Prefix Magazine               68
NOW Magazine                  66
Tiny Mix Tapes                64
Blender                       57
Dusted Magazine               56
Dot Music                     56
Stylus Magazine               55
No Ripcord                    53
Boston Globe                  52
Austin Chronicle              52
Filter                        50
          

And the artists?

In [6]:
print(df['artist'].value_counts())

Various Artists            22
R.E.M.                     16
Arcade Fire                14
Sigur Rós                  13
Belle & Sebastian          12
Brian Eno                  11
The Raveonettes            10
Weezer                     10
Radiohead                  10
Low                        10
Mogwai                     10
Bob Dylan                  10
LCD Soundsystem            10
Kings of Leon              10
Los Campesinos!             9
Sun Kil Moon                9
Franz Ferdinand             9
Wilco                       9
Ghostface Killah            9
M. Ward                     9
Eels                        9
Beck                        8
Elbow                       8
Of Montreal                 8
The Decemberists            8
Britney Spears              8
Daft Punk                   8
Usher                       8
Kanye West                  8
Ryan Adams                  8
                           ..
Fabolous                    1
Eve                         1
Small Blac

What is the average score given?

In [7]:
print(df['score'].mean())

72.6842231554


Slightly more complicated to code: what is the average score for each genre? To do this, we use Pandas' *groupby* function. Note: If you are planning on doing any sort of statistics, including basic statistics, you'll want to get very familiar with the groupby function. It's quite powerful.

In [8]:
#create a groupby dataframe grouped by genre
df_genres = df.groupby("genre")

#calculate the mean score by genre, print out the results
print(df_genres['score'].mean().sort_values(ascending=False))

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B;                      72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64
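
As one illustration of why groupby is worth learning, the sketch below computes several statistics per group in a single step using `agg`. It runs on a small made-up table (`toy_scores`), not on the Music Reviews data, so the numbers are purely illustrative.

```python
import pandas

# Made-up scores; not taken from the Music Reviews dataset
toy_scores = pandas.DataFrame({
    "genre": ["Rap", "Rap", "Jazz", "Jazz", "Jazz", "Pop"],
    "score": [70, 80, 90, 60, 78, 50],
})

# Group by genre, then compute several statistics of "score" at once
genre_summary = (
    toy_scores.groupby("genre")["score"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(genre_summary)
```

Reporting the count next to the mean is a useful habit: in the real output above, the Jazz average rests on far fewer reviews than the Pop/Rock average.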


### 2. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column.

Our first step is to turn the text into a document term matrix using scikit-learn's CountVectorizer. There are two ways to hold the result. The first is to keep it as a sparse matrix, which can be used within scikit-learn for further analyses; the second, shown further below, is to convert it into a Pandas dataframe.

In [9]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

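#create a CountVectorizer object with default settings
countvec = CountVectorizer()

#learn the vocabulary and build the sparse document term matrix
sklearn_dtm = countvec.fit_transform(df.body)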
print(sklearn_dtm)

  (0, 16029)	1
  (0, 5755)	1
  (0, 1290)	1
  (0, 11209)	1
  (0, 14577)	2
  (0, 2231)	1
  (0, 13770)	1
  (0, 16272)	1
  (0, 6359)	1
  (0, 1302)	1
  (0, 16017)	1
  (0, 11320)	1
  (0, 14534)	2
  (0, 9218)	1
  (0, 14771)	2
  (0, 7669)	1
  (0, 14818)	1
  (0, 8812)	1
  (0, 13339)	1
  (0, 940)	1
  (0, 3931)	1
  (0, 6688)	1
  (0, 9554)	1
  (0, 15075)	1
  (0, 6629)	1
  :	:
  (5000, 9998)	1
  (5000, 14520)	1
  (5000, 9939)	1
  (5000, 1013)	1
  (5000, 749)	3
  (5000, 16217)	1
  (5000, 14514)	1
  (5000, 1097)	1
  (5000, 14776)	1
  (5000, 16159)	1
  (5000, 9411)	1
  (5000, 13317)	1
  (5000, 9538)	1
  (5000, 1470)	1
  (5000, 14387)	1
  (5000, 5439)	1
  (5000, 13195)	1
  (5000, 10097)	1
  (5000, 11122)	1
  (5000, 13658)	1
  (5000, 11691)	1
  (5000, 4995)	1
  (5000, 12345)	1
  (5000, 5074)	1
  (5000, 4169)	1


This is a compressed sparse format (scikit-learn returns a SciPy Compressed Sparse Row matrix), which stores only the non-zero cells. It saves a lot of memory to store the DTM this way, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix to a Pandas dataframe, a format we're more familiar with. For larger datasets, you will have to stay with the sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!

In [10]:
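#we do the same as we did above, but convert it into a Pandas dataframe
#note: get_feature_names_out() needs scikit-learn 1.0 or later; older versions used get_feature_names()
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)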

#view the dtm dataframe
print(dtm_df)

      00  000  00s  01  03  039  06  08  09  10  ...   zone  zones  zoo  \
0      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
1      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
2      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
3      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
4      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
5      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
6      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
7      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
8      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
9      0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
10     0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
11     0    0    0   0   0    0   0   0   0   0  ...      0      0    0   
12     0    0    0   0   
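
The trade-off between the sparse and the dense representation can be checked directly. The sketch below, on three made-up sentences rather than the reviews, counts how many cells of a DTM are actually non-zero. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; not taken from the Music Reviews dataset
toy_docs = [
    "great album great songs",
    "boring album",
    "great jazz",
]

toy_sparse = CountVectorizer().fit_transform(toy_docs)

n_docs, n_terms = toy_sparse.shape
n_cells = n_docs * n_terms
n_nonzero = toy_sparse.nnz  # number of cells the sparse matrix actually stores

print("shape:", toy_sparse.shape)
print("non-zero cells:", n_nonzero, "of", n_cells)
print("share of zeros:", 1 - n_nonzero / n_cells)
```

On a real corpus with thousands of documents and a vocabulary in the tens of thousands, the share of zeros is typically far higher than in this toy case, which is why calling `.toarray()` can exhaust memory on larger datasets.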

### 3. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 1, where we found the most frequent words using NLTK).

In [11]:
print(dtm_df.sum().sort_values(ascending=False))

the              7406
and              4557
of               4400
to               3175
is               2914
it               2608
that             2039
in               1775
album            1719
this             1518
but              1439
with             1367
as               1310
on               1139
for              1073
are               812
their             775
you               775
an                751
his               743
more              712
be                691
like              681
from              676
not               650
songs             640
they              580
one               580
its               575
all               574
                 ... 
football            1
foothold            1
footnote            1
footprints          1
footsteps           1
footwork            1
footy               1
foppish             1
reveling            1
revelling           1
forbears            1
revels              1
stefani             1
steeple             1
steep     

We'll see further things we can do with a DTM in the days to come. Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But what do we lose when we represent text in this format?

Today, we will use variations on the DTM to find distinctive words in this dataset.
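
As a small taste of the vector-space idea mentioned above, the sketch below treats each row of a made-up DTM as a vector and computes the cosine similarity between every pair of documents. Cosine similarity is chosen here only as a common example of such a calculation; it is not part of this lesson's method.

```python
import numpy as np

# Made-up count vectors over the vocabulary [album, boring, great, jazz, songs]
toy_dtm = np.array([
    [1, 0, 2, 0, 1],  # document 0
    [1, 1, 0, 0, 0],  # document 1
    [0, 0, 1, 1, 0],  # document 2
], dtype=float)

# Scale every row to unit length, then take dot products between all rows
row_lengths = np.linalg.norm(toy_dtm, axis=1, keepdims=True)
unit_rows = toy_dtm / row_lengths
cosine_sim = unit_rows @ unit_rows.T

print(np.round(cosine_sim, 3))
```

Documents 1 and 2 share no words, so their similarity is zero, while every document has similarity 1 with itself.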

### 4. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw two approaches to doing this in lesson 1 (removing stop words and identifying nouns, verbs, and adjectives). Today, we'll learn one more approach: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tf_idf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)
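
As a quick worked instance of this formula, the sketch below applies it in plain Python to a made-up corpus of three tiny documents. The function name `simple_tf_idf` is hypothetical, and the function assumes the word occurs in at least one document (otherwise the division would fail).

```python
# Made-up corpus of three tokenized documents
toy_docs = [
    ["the", "album", "is", "great"],
    ["the", "album", "is", "boring"],
    ["the", "jazz", "is", "great", "great"],
]

def simple_tf_idf(word, document, documents):
    """Count of word in one document times (N / number of documents containing it)."""
    word_frequency = document.count(word)
    number_documents_with_word = sum(1 for doc in documents if word in doc)
    return word_frequency * (len(documents) / number_documents_with_word)

# Score three words in the third document
for word in ["the", "great", "jazz"]:
    print(word, simple_tf_idf(word, toy_docs[2], toy_docs))
```

"the" appears once in the third document but in all three documents, so its score is 1 × 3/3 = 1. "jazz" also appears once, but only in this document, so its score is 1 × 3/1 = 3.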

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [12]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
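#create the dtm, but with cells weighted by the tf-idf score.
#note: get_feature_names_out() needs scikit-learn 1.0 or later; older versions used get_feature_names()
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names_out(), index = df.index)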

#view results
print(dtm_tfidf_df)

       00  000  00s   01   03  039   06   08   09        10  ...   zone  \
0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
1     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
2     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
3     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
4     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
5     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
6     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
7     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
8     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
9     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
10    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
11    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  ...    0.0   
12    0.0  0.0  0.0  0.0 

It's still mostly zeros. Let's look at the 20 words with highest tf-idf weights.

In [13]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

perfect       1.000000
pppperfect    1.000000
meh           1.000000
awesome       1.000000
brill         1.000000
yummy         1.000000
wonderfull    1.000000
subpar        0.959257
ga            0.908259
masterful     0.898620
grower        0.888624
likable       0.867803
acirc         0.867003
great         0.864253
infectious    0.859996
blank         0.854475
smart         0.847852
8217          0.843505
stuff         0.834479
impeccable    0.828662
dtype: float64


Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?
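
One thing worth knowing when reading these scores: with its default settings, scikit-learn's TfidfVectorizer does not use the raw ratio given earlier. It uses a smoothed, log-scaled inverse document frequency and then scales each document's row to unit length. The sketch below reproduces that default weighting by hand on three made-up documents and compares it with the library's output. The variable names are illustrative, and the formula reflects the defaults (`smooth_idf=True`, `norm='l2'`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents; the last one consists of a single word
toy_docs = [
    "great album great songs",
    "boring album",
    "great",
]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Default scikit-learn weighting: smoothed, log-scaled idf ...
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
# ... followed by scaling each row to unit (L2) length
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_scores = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sklearn_scores))
print(manual[2].max())
```

Because of the row scaling, a document containing only one word gives that word the maximum possible score of 1. That is one likely reading of the words scoring exactly 1.000000 in the list above: they may come from very short, possibly one-word, reviews.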

### 5. Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we merge the genre of the document into our dtm weighted by tf-idf scores, and then compare genres.

In [14]:
#create a dataset with document index and genre
df_genre = df['genre'].to_frame()
print(df_genre)

           genre
0       Pop/Rock
1        Country
2        Country
3            Rap
4           Rock
5          Indie
6       Pop/Rock
7          Indie
8          Indie
9           Rock
10    Electronic
11          Rock
12          Rock
13         Indie
14         Indie
15           Pop
16         Indie
17      Pop/Rock
18           Rap
19          Rock
20         Indie
21    Electronic
22          Rock
23          Rock
24           Rap
25         Indie
26         Indie
27      Pop/Rock
28          Rock
29    Electronic
...          ...
4971    Pop/Rock
4972       Indie
4973  Electronic
4974       Indie
4975        Rock
4976        Rock
4977        Rock
4978     Country
4979    Pop/Rock
4980     Country
4981  Electronic
4982    Pop/Rock
4983     Country
4984    Pop/Rock
4985    Pop/Rock
4986       Indie
4987    Pop/Rock
4988  Electronic
4989        Rock
4990    Pop/Rock
4991         Rap
4992  Electronic
4993        Rock
4994        Rock
4995         Rap
4996       Indie
4997        Ro

In [15]:
#merge this into the dtm_tfidf_df
merged_df = df_genre.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#view result
print(merged_df)

         genre_x   00  000  00s   01   03  039   06   08   09  ...   zone  \
0       Pop/Rock  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
1        Country  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
2        Country  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
3            Rap  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
4           Rock  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
5          Indie  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
6       Pop/Rock  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
7          Indie  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
8          Indie  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
9           Rock  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
10    Electronic  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   
11          Rock  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0   

Now let's compare the words with the highest tf-idf weight for each genre.

Note: there are other ways to do this. Challenge: what is a different approach to identifying rows from a certain genre in our dtm?

In [16]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Rap"]
dtm_indie = merged_df[merged_df['genre_x']=="Alternative/Indie Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Jazz"]

#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

Rap Words
blank             0.854475
039               0.797595
waste             0.755918
amiable           0.730963
awesomely         0.717079
same              0.672391
sucker            0.663760
tight             0.653993
beastie           0.650603
lamest            0.639377
derivativeness    0.636271
authentic         0.627192
diverse           0.623373
sermon            0.621175
mastermind        0.609213
neat              0.608922
we                0.600755
lift              0.591821
supreme           0.590431
overwhelms        0.586293
dtype: float64

Indie Words
underplayed    0.516717
prisoner       0.512087
jezabels       0.512087
careworn       0.509386
folk           0.476719
victory        0.449289
exhausted      0.445969
bigger         0.441849
heyday         0.438114
babelfished    0.431543
bet            0.426091
worn           0.416482
93             0.416137
try            0.415525
triumph        0.413976
silhouette     0.413374
icelandic      0.411715
fourth        

There we go! A method of identifying distinctive words. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?
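
The steps used for each genre (select the rows, take the maximum score per word, sort) repeat three times, so they can be wrapped in a small helper. The sketch below shows one hedged way to do that on made-up scores. `top_words`, `toy_tfidf`, and `toy_genres` are hypothetical names, and the genre labels are kept in a separate Series rather than merged into the score table.

```python
import pandas

# Made-up tf-idf-like scores for four documents and three words
toy_tfidf = pandas.DataFrame({
    "flow": [0.9, 0.1, 0.0, 0.7],
    "solo": [0.0, 0.8, 0.6, 0.0],
    "album": [0.2, 0.3, 0.1, 0.2],
})
# One genre label per document, sharing the same index as toy_tfidf
toy_genres = pandas.Series(["Rap", "Jazz", "Jazz", "Rap"], name="genre")

def top_words(scores, labels, label, n=20):
    """Return the n highest per-word maximum scores among rows carrying `label`."""
    rows = scores[labels == label]
    return rows.max().sort_values(ascending=False).head(n)

for genre in ["Rap", "Jazz"]:
    print(genre, "Words")
    print(top_words(toy_tfidf, toy_genres, genre, n=2))
    print()
```

Keeping the labels separate means every column of the score table is numeric, so no suffix handling or `numeric_only` argument is needed.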

Tf-idf scores are just one way to identify distinctive or discriminating words. See Monroe, Colaresi, and Quinn (2008) for more ideas for finding distinctive words. (Warning: this paper is a bit outdated. No one has taken up their recommendation to use a Dirichlet prior).

Exercise: 
* Compare words from different genres of your choice. Any interesting findings?
* Instead of outputting the highest weighted words, output the lowest weighted words. How should we interpret these words? 
* Super challenge: apply this technique to your own corpus.