## The Document Term Matrix and Finding Distinctive Words

We have been dealing with texts as strings, or as lists of strings. Another way to represent text, which opens up a variety of other possibilities for analysis, is the Document Term Matrix (DTM).

The best Python library for this, along with the subsequent analyses we can perform on a DTM, is scikit-learn. It's a powerful library, and one you will continually return to as you advance in text analysis (and it looks great on your CV!). At its core, this library allows us to implement a variety of machine learning algorithms on our text.

Because scikit-learn is such a large and powerful library, the goal today is not to become experts, but instead to learn the basic functions in the library and gain an intuition about how you might use it to do text analysis. To give an overview, here are some of the things you can do using scikit-learn:
* word weighting
* feature extraction
* text classification / supervised machine learning
    * L2 regression
    * classification algorithms such as nearest neighbors, SVM, and random forest
* clustering / unsupervised machine learning
    * k-means
    * PCA
    * cosine similarity
    * LDA

Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of most computational text analysis techniques, both simple and more sophisticated methods. In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will visualize the DTM in a pandas dataframe. We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset. The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? Finally, we'll use the DTM to implement a difference of proportions calculation on two novels in our data folder.

  

### Learning Goals
* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM from a .csv file
* Learn basic functionality of Python's package scikit-learn
* Understand tf-idf scores, and word scores in general
* Learn a simple way to identify distinctive words


### Outline
<ol start="0">
  <li>The Pandas Dataframe: Music Reviews</li>
  <li>Explore the Data using Pandas
    <ul><li>Basic descriptive statistics</li></ul>
  </li>
  <li>Creating the DTM: scikit-learn
    <ul><li>CountVectorizer function</li></ul>
  </li>
  <li>What can we do with a DTM?</li>
  <li>Tf-idf scores
    <ul><li>TfidfVectorizer function</li></ul>
  </li>
  <li>Identifying Distinctive Words
    <ul><li>Application: Identify distinctive words by genre</li></ul>
  </li>
  <li>Identifying Distinctive Words
    <ul><li>Difference of Proportions</li></ul>
  </li>
</ol>

### Key Terms
* *Document Term Matrix*:
    * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    * short for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
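
To make the DTM definition concrete, here is a minimal sketch that builds one by hand for three made-up documents, without scikit-learn. The documents and the names `toy_docs` and `toy_dtm` are illustrative assumptions, not part of the music reviews data.

```python
from collections import Counter
import pandas

# Three tiny made-up "documents"
toy_docs = ["the album is great", "the songs are great great", "jazz album"]

# Count the words in each document separately
rows = [dict(Counter(doc.split())) for doc in toy_docs]

# One row per document, one column per term; a missing term means a count of 0
toy_dtm = pandas.DataFrame(rows).fillna(0).astype(int)
toy_dtm = toy_dtm.sort_index(axis=1)  # alphabetical column order
print(toy_dtm)
```

Each row is a document and each column is a term, so the cell in row 1, column `great` holds the number of times "great" appears in the second document.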

    
### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail.

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

    
### 0. The Pandas Dataframe: Music Reviews

First, we read our music reviews corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

In [1]:
import pandas
#import numpy as np #new library! We'll discuss it as it comes up.

#create a dataframe called "df"
df = pandas.read_csv("../data/BDHSI2016_music_reviews.csv", sep = '\t', encoding = 'utf-8')

#view the dataframe
#The column "body" contains our text of interest.
df

Unnamed: 0,album,artist,genre,release_date,critic,score,body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...
5,Weathervanes,Freelance Whales,Indie,2010-04-13 00:00:00,Q Magazine,68.0,Fans of Owl City and The Postal Service will r...
6,Build a Rocket Boys!,Elbow,Pop/Rock,2011-04-12 00:00:00,Delusions of Adequacy,82.0,"Whereas previous Elbow records set a mood, Bui..."
7,Ambivalence Avenue,Bibio,Indie,2009-06-23 00:00:00,Q Magazine,78.0,His remarkable Warp debut follows a series of ...
8,Wavvves,Wavves,Indie,2009-03-17 00:00:00,PopMatters,68.0,"There’s an energy coursing through this, and r..."
9,Peachtree Road,Elton John,Rock,2004-11-09 00:00:00,MelD.,70.0,Classic. Songs filled with soul. Lyrics refres...


In [2]:
#print the first review from the column 'body'
df.loc[0,'body']

'While For Baltimore proves they can still write a grade A banger when they put their mind to it, too many songs are destined to have "must try harder" stamped on their report card. [13 Oct 2012, p.52]'

### 1. Explore the Data using Pandas

You folks are experts at this now. Write Python code using pandas to do the following exploration of the data:

1. What different genres are in the data?
2. Who are the reviewers?
3. Who are the artists?
4. What is the average score given?
5. What is the average score by genre? What is the genre with the highest average score?

In [8]:
#Write your code here
#print(df['genre'].value_counts())
#print(df['critic'].value_counts())
#print(df['artist'].value_counts())
df['score'].mean()
grouped_genre = df.groupby('genre')
grouped_genre['score'].mean().sort_values(ascending=False)

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B                       72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64

### 2. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers. To do this we will use a lambda function.

In [9]:
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
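
To see what this preprocessing step does, here is a small sketch on a made-up Series standing in for the 'body' column. The regular-expression version is an assumed alternative, not the lesson's method; on plain text like this the two agree, though they can differ on unusual Unicode digit characters.

```python
import pandas

# Toy stand-in for the 'body' column (made-up text, not from the dataset)
toy_body = pandas.Series([
    "Track 7 is a grade A banger [13 Oct 2012, p.52]",
    "No digits here",
])

# The lesson's approach: keep every character that is not a digit
via_lambda = toy_body.apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

# An assumed alternative: a regular expression through the .str accessor
via_regex = toy_body.str.replace(r'\d', '', regex=True)

print(via_lambda[0])
print(via_regex[0])
```

Note that only the digits are removed; the surrounding punctuation and spaces stay where they were.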

Our next step is to turn the text into a document term matrix using the scikit-learn function called CountVectorizer. There are two ways to hold the result. We can keep it as a sparse matrix, which can be used within scikit-learn for further analyses, or we can convert it into a Pandas dataframe so we can look at it. We start with the sparse matrix, which we get from CountVectorizer's fit_transform() method.

[Let's first look at the documentation for CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [10]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

sklearn_dtm = countvec.fit_transform(df.body)
print(sklearn_dtm)

  (0, 9643)	1
  (0, 2011)	1
  (0, 11604)	1
  (0, 9722)	1
  (0, 13369)	1
  (0, 6358)	1
  (0, 14799)	1
  (0, 9277)	1
  (0, 6417)	1
  (0, 3662)	1
  (0, 671)	1
  (0, 13062)	1
  (0, 8536)	1
  (0, 14542)	1
  (0, 7398)	1
  (0, 14495)	2
  (0, 8941)	1
  (0, 14257)	2
  (0, 11042)	1
  (0, 15740)	1
  (0, 1034)	1
  (0, 6088)	1
  (0, 15995)	1
  (0, 13493)	1
  (0, 1963)	1
  :	:
  (5000, 4803)	1
  (5000, 12068)	1
  (5000, 4724)	1
  (5000, 11414)	1
  (5000, 13381)	1
  (5000, 10844)	1
  (5000, 9821)	1
  (5000, 12918)	1
  (5000, 5168)	1
  (5000, 14110)	1
  (5000, 1202)	1
  (5000, 9261)	1
  (5000, 13040)	1
  (5000, 9134)	1
  (5000, 15882)	1
  (5000, 14500)	1
  (5000, 828)	1
  (5000, 14237)	1
  (5000, 15940)	1
  (5000, 480)	3
  (5000, 744)	1
  (5000, 9663)	1
  (5000, 14243)	1
  (5000, 9722)	1
  (5000, 14257)	1


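How do we know what each number indicates? We can access the words themselves through the CountVectorizer method get_feature_names_out(). In older scikit-learn releases this method was called get_feature_names(); that name was removed in version 1.2.

In [11]:
print(countvec.get_feature_names_out()[:10].tolist())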

['aa', 'aaaa', 'aahs', 'aaliyah', 'aaron', 'ab', 'abandon', 'abandoned', 'abandoning', 'abc']


The sklearn_dtm output above is stored in compressed sparse row (CSR) format. Storing the DTM this way saves a lot of memory, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas dataframe, a format we're more familiar with.

Note: This is a case of do as I say, not as I do. As we continue we will rarely transform a DTM into a Pandas dataframe, because of memory issues. I'm doing it today so we can understand the intuition behind the DTM, word scores, and distinctive words.
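
To get a feel for why the sparse format saves memory, the sketch below inspects a sparse DTM for three made-up documents. The names `toy_docs`, `toy_sparse`, and `toy_dense` are illustrative assumptions; the real music reviews matrix is far larger and far emptier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny made-up documents
toy_docs = ["the album is great", "the songs are great great", "jazz album"]
toy_sparse = CountVectorizer().fit_transform(toy_docs)

n_docs, n_terms = toy_sparse.shape
stored = toy_sparse.nnz                  # number of non-zero cells actually stored
density = stored / (n_docs * n_terms)    # share of cells that are non-zero

print(toy_sparse.format)                 # storage scheme used by SciPy
print(n_docs, n_terms, stored, round(density, 2))

# Converting to a dense array stores every cell, zeros included
toy_dense = toy_sparse.toarray()
print(toy_dense)
```

The sparse matrix only records the non-zero cells, while the dense array (and any dataframe built from it) has to hold every zero as well.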

In [12]:
#we do the same as we did above, but convert it into a Pandas dataframe. Note this takes quite a bit more memory, so will not be good for bigger data.
#don't understand this code? we'll go through it, but don't worry about understanding it.
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)

#view the dtm dataframe
dtm_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took using NLTK).

In [13]:
print(dtm_df.sum().sort_values(ascending=False))

the             7406
and             4557
of              4400
to              3175
is              2914
it              2608
that            2039
in              1775
album           1719
this            1518
but             1439
with            1367
as              1310
on              1139
for             1073
are              812
you              775
their            775
an               751
his              743
more             712
be               691
like             681
from             676
not              650
songs            640
one              580
they             580
its              575
all              574
                ... 
glimmering         1
glimmers           1
gliss              1
glisten            1
glistening         1
glitch             1
respond            1
glitchier          1
glitter            1
glittering         1
glittery           1
glitz              1
glo                1
gloating           1
respectively       1
globular           1
respectfully 

In [14]:
##Ex: print the average number of times each word is used in a review
#Print this out sorted from highest to lowest.

print(dtm_df.mean().sort_values(ascending=False))

the             1.480904
and             0.911218
of              0.879824
to              0.634873
is              0.582683
it              0.521496
that            0.407718
in              0.354929
album           0.343731
this            0.303539
but             0.287742
with            0.273345
as              0.261948
on              0.227754
for             0.214557
are             0.162368
you             0.154969
their           0.154969
an              0.150170
his             0.148570
more            0.142372
be              0.138172
like            0.136173
from            0.135173
not             0.129974
songs           0.127974
one             0.115977
they            0.115977
its             0.114977
all             0.114777
                  ...   
glimmering      0.000200
glimmers        0.000200
gliss           0.000200
glisten         0.000200
glistening      0.000200
glitch          0.000200
respond         0.000200
glitchier       0.000200
glitter         0.000200


Question: What does this tell us about our data?

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But what do we lose when we represent text in this format?

Today, we will use variations on the DTM to find distinctive words in this dataset.
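
As one small illustration of the vector-space idea, the sketch below treats each row of a toy DTM as a vector and compares the rows with cosine similarity. The three documents are made up, and cosine similarity is just one of many possible measures.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents: the first two use the same words in a different order
toy_docs = ["dog bites man", "man bites dog", "great jazz album"]
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Each row of the DTM is a vector; cosine similarity compares their directions
sims = cosine_similarity(toy_counts)
print(sims.round(2))
```

The first two documents come out as identical vectors even though they mean different things, which hints at one answer to the question of what we lose: word order.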

### 4. Tf-idf scores

How to find content words in a corpus is a long-standing question in text analysis. We have seen a few ways of doing this: removing stop words and identifying and counting only nouns, verbs, and adjectives. Today, we'll learn one more simple approach to this: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be indicative of the content of that document. We want to instead identify frequent words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and': what we have been calling stop words.

More precisely, the inverse document frequency is calculated as:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)

You can, and often should, normalize the term frequency by the length of the document:

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)
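
The normalized formula above can be sketched in Pandas on a toy DTM. The counts below are made up, and the variable names simply mirror the formula; treat this as one possible starting point rather than a definitive implementation.

```python
import pandas

# Toy DTM: rows are documents, columns are terms (made-up counts)
toy_dtm = pandas.DataFrame(
    {"the": [2, 1, 3], "album": [1, 0, 1], "jazz": [0, 0, 2]},
    index=["doc0", "doc1", "doc2"],
)

number_of_documents = len(toy_dtm)
# Number of documents in which each term appears at least once
number_documents_with_term = (toy_dtm > 0).sum(axis=0)
idf = number_of_documents / number_documents_with_term

# Term frequency normalized by the word count of each document
tf = toy_dtm.div(toy_dtm.sum(axis=1), axis=0)

# Multiply every row of tf by the per-term idf
toy_tfidf = tf.mul(idf, axis=1)
print(toy_tfidf.round(3))
```

Here "the" appears in every document, so its idf is 1 and it gets no boost, while "jazz" appears in only one document and is weighted up. With its default settings, scikit-learn's TfidfVectorizer uses a smoothed, log-scaled idf and rescales each row to unit length, so its numbers will not match this sketch.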

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [15]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

#create the dtm, but with cells weighted by the tf-idf score.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names_out(), index = df.index)

#view results
dtm_tfidf_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's look at the 20 words with highest tf-idf weights.

In [16]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[:20])

brill         1.000000
perfect       1.000000
yummy         1.000000
pppperfect    1.000000
awesome       1.000000
wonderfull    1.000000
meh           1.000000
stars         1.000000
subpar        0.959257
ga            0.908259
masterful     0.898620
grower        0.888624
likable       0.867803
acirc         0.867003
great         0.864253
infectious    0.859996
blank         0.854475
thrilling     0.848810
smart         0.847852
stuff         0.834479
dtype: float64


Ok! We have successfully identified content words, without removing stop words and without part-of-speech tagging. What else do you notice about this list?
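
One thing you may notice is how many words score exactly 1.0. A plausible explanation, assuming TfidfVectorizer's default settings (which rescale each document's row to unit length), is that these come from reviews made up of a single distinct word. The sketch below checks the idea on three made-up reviews; it does not use the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews: two are a single word, one is a short sentence
toy_reviews = ["perfect", "a perfect little pop album", "meh"]
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_reviews).toarray()

# Print the highest tf-idf weight found in each review
for review, row in zip(toy_reviews, toy_tfidf):
    print(repr(review), row.max().round(3))
```

If this holds for the music reviews too, the top of the list says as much about very short reviews as it does about distinctive vocabulary.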

### 5. Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we merge the genre of the document into our dtm weighted by tf-idf scores, and then compare genres.

In [17]:
#create dataset with document index and genre
df_genre = df['genre'].to_frame()
print(df_genre)

           genre
0       Pop/Rock
1        Country
2        Country
3            Rap
4           Rock
5          Indie
6       Pop/Rock
7          Indie
8          Indie
9           Rock
10    Electronic
11          Rock
12          Rock
13         Indie
14         Indie
15           Pop
16         Indie
17      Pop/Rock
18           Rap
19          Rock
20         Indie
21    Electronic
22          Rock
23          Rock
24           Rap
25         Indie
26         Indie
27      Pop/Rock
28          Rock
29    Electronic
...          ...
4971    Pop/Rock
4972       Indie
4973  Electronic
4974       Indie
4975        Rock
4976        Rock
4977        Rock
4978     Country
4979    Pop/Rock
4980     Country
4981  Electronic
4982    Pop/Rock
4983     Country
4984    Pop/Rock
4985    Pop/Rock
4986       Indie
4987    Pop/Rock
4988  Electronic
4989        Rock
4990    Pop/Rock
4991         Rap
4992  Electronic
4993        Rock
4994        Rock
4995         Rap
4996       Indie
4997        Ro

In [18]:
#merge this into the dtm_tfidf_df
merged_df = df_genre.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#view result
merged_df

Unnamed: 0,genre_x,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,Pop/Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Rap,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Pop/Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now let's compare the words with the highest tf-idf weight for each genre.

Note: there are other ways to do this. Challenge: what is a different approach to identifying rows from a certain genre in our dtm?
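
As a hint toward that challenge, here is a sketch of two alternatives that skip the merge step. The Series `toy_genre` and the dataframe `toy_scores` are made-up stand-ins for `df['genre']` and `dtm_tfidf_df`; the only assumption is that both share the same row index.

```python
import pandas

# Toy stand-ins for df['genre'] and dtm_tfidf_df (made-up values, same row index)
toy_genre = pandas.Series(["Rap", "Jazz", "Rap", "Jazz"], name="genre")
toy_scores = pandas.DataFrame(
    {"beat": [0.9, 0.0, 0.4, 0.1], "sax": [0.0, 0.8, 0.0, 0.6]}
)

# Option 1: a boolean mask built from the genre column, no merge needed
rap_rows = toy_scores[toy_genre == "Rap"]

# Option 2: group the score rows by genre and take the per-genre maximum
max_by_genre = toy_scores.groupby(toy_genre).max()

print(rap_rows)
print(max_by_genre)
```

The second option produces the maximum score of every word for every genre in one step, at the cost of a larger intermediate table.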

In [19]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Rap"]
dtm_indie = merged_df[merged_df['genre_x']=="Alternative/Indie Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Jazz"]

#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

Rap Words
blank             0.854475
waste             0.755918
amiable           0.730963
awesomely         0.717079
joyless           0.687687
beastie           0.672439
same              0.672392
sucker            0.663760
vanguard          0.661978
tight             0.653993
lamest            0.639377
derivativeness    0.636271
authentic         0.627192
diverse           0.623373
sermon            0.621175
pushin            0.617699
mastermind        0.609213
neat              0.608922
we                0.600755
lift              0.591821
dtype: float64

Indie Words
underplayed    0.516717
prisoner       0.512087
jezabels       0.512087
careworn       0.509386
folk           0.509321
fourth         0.480502
heyday         0.469035
their          0.458950
riffed         0.458182
bet            0.456164
victory        0.449289
exhausted      0.445969
bigger         0.441849
babelfished    0.431543
lightweight    0.428857
exercised      0.428857
powerhouse     0.422192
worn          

There we go! A method of identifying content words, and distinctive words based on groups of texts. You'll notice there are some proper nouns in there. How might we remove those if we're not interested in them?

### Ex: Compare the distinctive words for two artists in the data

Note: the artists should have a number of reviews, so check your frequency counts to identify artists.

HINT: Copy and paste the above code and modify it as needed.

In [21]:
##Write your code here
df_artist = df['artist'].to_frame()
merged_df_artist = df_artist.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#view result

dtm1 = merged_df_artist[merged_df_artist['artist_x']=="R.E.M."]
dtm2 = merged_df_artist[merged_df_artist['artist_x']=="Arcade Fire"]
print("REM")
print(dtm1.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Arcade Fire")
print(dtm2.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()

REM
reliably        0.579442
staid           0.550549
every           0.530261
isn             0.523994
unfussy         0.513744
crucially       0.459618
committed       0.434459
convincing      0.434459
fast            0.424265
collapse        0.421777
habit           0.410508
accelerate      0.410508
stun            0.391646
forming         0.391646
dec             0.376505
noncommittal    0.368986
beautiful       0.358367
mostly          0.352703
stutter         0.352486
stipe           0.352032
dtype: float64

Arcade Fire
disc           0.459815
reflektor      0.431429
jumping        0.423503
patterns       0.409032
features       0.408639
bitterness     0.408519
shorter        0.397541
radiates       0.389749
affection      0.389749
suburbs        0.377718
beguiling      0.374164
detox          0.373836
components     0.364664
divergence     0.363223
redeem         0.356659
paced          0.352743
letter         0.350524
divergent      0.345035
double         0.336293
proposition 

### 6. Difference of proportions

Another simple way to find distinctive words in two texts is to identify the words with the highest and lowest difference of proportions. In theory, frequent words like 'the' and 'of' will have a small difference. In practice this doesn't happen.

To demonstrate this we will run a difference of proportion calculation on *Pride and Prejudice* and *A Garland for Girls*.

To get the text in shape for scikit-learn we need to create a list object with each novel as an element in the list. We'll use the append method to do this.


In [29]:
import nltk

text_list = []
#open and read the novels, save them as variables
austen_string = open('../data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)
print(text_list[0][:100])

PRIDE AND PREJUDICE:

A NOVEL.

IN THREE VOLUMES.

BY THE AUTHOR OF "SENSE AND SENSIBILITY."

VOL. I


Create a DTM from these two novels, force it into a pandas DF, and inspect the output:

In [30]:
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,youngest,youngsters,your,yours,yourself,yourselves,youth,youthful,youths,zip
0,0,0,1,2,0,1,0,0,1,0,...,14,0,466,12,50,2,9,0,1,0
1,1,1,1,0,2,0,1,1,0,1,...,2,1,110,9,7,1,9,1,3,1


Notice the number of rows and columns.

Question: What does this mean?

Next, we need to get a word frequency count for each novel, which we can do by summing across the entire row. Note how the syntax is different here compared to when we summed one column across all rows.

In [32]:
novels_df['word_count'] = novels_df.sum(axis=1)
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,youngsters,your,yours,yourself,yourselves,youth,youthful,youths,zip,word_count
0,0,0,1,2,0,1,0,0,1,0,...,0,466,12,50,2,9,0,1,0,118609
1,1,1,1,0,2,0,1,1,0,1,...,1,110,9,7,1,9,1,3,1,71953
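
The difference between summing down the columns and summing across a row is easier to see on a small table. The sketch below uses made-up counts for two hypothetical novels; `toy_counts` is an illustrative name, not part of the lesson's data.

```python
import pandas

# Made-up word counts for two hypothetical novels
toy_counts = pandas.DataFrame(
    {"darcy": [3, 0], "girls": [1, 4], "the": [10, 8]},
    index=["novel0", "novel1"],
)

# Default (axis=0): collapse the rows, giving one total per column (per word)
per_word = toy_counts.sum()

# axis=1: collapse the columns, giving one total per row (per novel)
per_novel = toy_counts.sum(axis=1)

print(per_word)
print(per_novel)
```

The first result has one entry per word and the second has one entry per novel, which is the word count we need here.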


Next we divide each frequency cell by the word count. This syntax gets a bit tricky, so let's walk through it.

In [33]:
novels_df = novels_df.iloc[:,:].div(novels_df.word_count, axis=0)
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,youngsters,your,yours,yourself,yourselves,youth,youthful,youths,zip,word_count
0,0.0,0.0,8e-06,1.7e-05,0.0,8e-06,0.0,0.0,8e-06,0.0,...,0.0,0.003929,0.000101,0.000422,1.7e-05,7.6e-05,0.0,8e-06,0.0,1.0
1,1.4e-05,1.4e-05,1.4e-05,0.0,2.8e-05,0.0,1.4e-05,1.4e-05,0.0,1.4e-05,...,1.4e-05,0.001529,0.000125,9.7e-05,1.4e-05,0.000125,1.4e-05,4.2e-05,1.4e-05,1.0
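
Why `.div(..., axis=0)` rather than a plain `/`? The sketch below, on made-up counts, suggests the reason: a plain division lines the Series up with the column labels, while `axis=0` lines it up with the row labels. The names here are illustrative assumptions.

```python
import pandas

# Made-up word counts for two hypothetical novels
toy_counts = pandas.DataFrame(
    {"darcy": [3, 0], "girls": [1, 4], "the": [6, 4]},
    index=["novel0", "novel1"],
)
word_count = toy_counts.sum(axis=1)

# axis=0 matches the Series to the row labels: each row is divided by its own total
proportions = toy_counts.div(word_count, axis=0)

# A plain "/" matches the Series to the column labels instead; none match here
misaligned = toy_counts / word_count

print(proportions)
print(misaligned)
```

After the `axis=0` division every row sums to 1, so each cell is the share of that novel's words taken up by that term.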


Finally, we subtract one row from another, and add the output as a third row.

In [34]:
novels_df.loc[2] = novels_df.loc[0] - novels_df.loc[1]
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,youngsters,your,yours,yourself,yourselves,youth,youthful,youths,zip,word_count
0,0.0,0.0,8e-06,1.7e-05,0.0,8e-06,0.0,0.0,8e-06,0.0,...,0.0,0.003929,0.000101,0.000422,1.7e-05,7.6e-05,0.0,8e-06,0.0,1.0
1,1.4e-05,1.4e-05,1.4e-05,0.0,2.8e-05,0.0,1.4e-05,1.4e-05,0.0,1.4e-05,...,1.4e-05,0.001529,0.000125,9.7e-05,1.4e-05,0.000125,1.4e-05,4.2e-05,1.4e-05,1.0
2,-1.4e-05,-1.4e-05,-5e-06,1.7e-05,-2.8e-05,8e-06,-1.4e-05,-1.4e-05,8e-06,-1.4e-05,...,-1.4e-05,0.0024,-2.4e-05,0.000324,3e-06,-4.9e-05,-1.4e-05,-3.3e-05,-1.4e-05,0.0


We can sort based on the values of this row:

In [35]:
novels_df.loc[2].sort_values(ascending=False)

not          0.008240
of           0.007876
he           0.007589
his          0.007422
elizabeth    0.005290
mr           0.005259
be           0.005204
had          0.004862
to           0.004779
was          0.004662
him          0.004615
have         0.004159
darcy        0.003524
that         0.003218
been         0.003077
bennet       0.002732
bingley      0.002588
is           0.002584
but          0.002544
your         0.002400
which        0.002160
could        0.002116
by           0.002054
jane         0.002017
from         0.001950
am           0.001929
what         0.001868
such         0.001751
you          0.001738
must         0.001707
               ...   
ll          -0.001145
becky       -0.001154
project     -0.001159
jenny       -0.001167
things      -0.001192
rosy        -0.001223
pretty      -0.001229
poor        -0.001236
went        -0.001242
gutenberg   -0.001293
out         -0.001302
about       -0.001451
over        -0.001518
don         -0.001534
girls     

Stop words are still in there. Why?

We can, of course, manually remove stop words. This does successfully identify distinctive content words. 

We can do this in the CountVectorizer step, by setting the correct option.

In [36]:
#change stop_words option to 'english'
countvec_sw = CountVectorizer(stop_words="english")

#same as code above
novels_df_sw = pandas.DataFrame(countvec_sw.fit_transform(text_list).toarray(), columns=countvec_sw.get_feature_names_out())
novels_df_sw['word_count'] = novels_df_sw.sum(axis=1)
novels_df_sw = novels_df_sw.iloc[:,0:].div(novels_df_sw.word_count, axis=0)
novels_df_sw.loc[2] = novels_df_sw.loc[0] - novels_df_sw.loc[1]
novels_df_sw.loc[2].sort_values(axis=0, ascending=False)

mr           0.013737
elizabeth    0.013346
darcy        0.008878
bennet       0.006881
bingley      0.006520
jane         0.005245
wickham      0.004120
mrs          0.004063
collins      0.003712
lydia        0.003602
sister       0.003434
family       0.002929
soon         0.002865
catherine    0.002676
did          0.002337
think        0.002238
father       0.002179
replied      0.002128
thing        0.002124
gardiner     0.002060
lizzy        0.002060
letter       0.002015
said         0.001927
longbourn    0.001869
feelings     0.001809
charlotte    0.001805
say          0.001783
room         0.001780
manner       0.001731
brother      0.001618
               ...   
tm          -0.001705
come        -0.001713
hard        -0.001720
eyes        -0.001789
head        -0.001821
ruth        -0.001825
new         -0.001892
help        -0.001911
face        -0.001939
emily       -0.002064
child       -0.002066
jessie      -0.002094
ethel       -0.002363
ll          -0.002462
went      

We can also do this by setting the max_df option (maximum document frequency) to either an integer or a decimal between 0 and 1. An integer is read as an absolute count: if a word occurs in more documents than the stated value, that word **will not** be included in the DTM. A decimal does the same, but is read as a proportion of documents. Note that the integer 1 and the decimal 1.0 therefore mean different things.

Question: In the case of this corpus, what does setting the max_df value to 1 do? What output do you expect?

In [37]:
#Change max_df option to 1
countvec_freq = CountVectorizer(max_df=1)

#same as the code above
novels_df_freq = pandas.DataFrame(countvec_freq.fit_transform(text_list).toarray(), columns=countvec_freq.get_feature_names_out())
novels_df_freq['word_count'] = novels_df_freq.sum(axis=1)
novels_df_freq = novels_df_freq.iloc[:,0:].div(novels_df_freq.word_count, axis=0)
novels_df_freq.loc[2] = novels_df_freq.loc[0] - novels_df_freq.loc[1]
novels_df_freq.loc[2].sort_values(axis=0, ascending=False)

darcy            0.034440
bennet           0.026695
bingley          0.025295
wickham          0.015984
catherine        0.010381
gardiner         0.007992
lizzy            0.007992
longbourn        0.007251
charlotte        0.007003
netherfield      0.006015
lucas            0.005850
attention        0.005603
immediately      0.005026
chapter          0.005026
meryton          0.004696
behaviour        0.004532
pemberley        0.004367
rosings          0.003955
scarcely         0.003708
honour           0.003460
ladyship         0.003460
hertfordshire    0.003378
convinced        0.003296
bourgh           0.003213
forster          0.003213
situation        0.003131
fitzwilliam      0.003049
entered          0.002966
advantage        0.002801
philips          0.002719
                   ...   
sun             -0.002214
arms            -0.002214
ellery          -0.002314
carrie          -0.002314
enjoyed         -0.002415
brave           -0.002516
warburton       -0.002616
electronic  

Question: What would happen if we set max_df to 2, in this case?

Question: What might we do for the music reviews dataset?
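
To check your reasoning on a smaller scale, the sketch below fits CountVectorizer to three made-up documents under different max_df settings and prints the vocabulary that survives. The helper function `vocabulary` and the toy corpus are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents; "the" is in all three, "album" in two
toy_docs = ["the jazz album", "the rap album", "the folk record"]

def vocabulary(**options):
    """Return the sorted vocabulary CountVectorizer keeps under the given options."""
    vec = CountVectorizer(**options)
    vec.fit(toy_docs)
    return sorted(vec.vocabulary_)

vocab_int_1 = vocabulary(max_df=1)     # integer: keep words found in at most 1 document
vocab_int_2 = vocabulary(max_df=2)     # integer: keep words found in at most 2 documents
vocab_float = vocabulary(max_df=1.0)   # float: keep words found in at most 100% of documents

print(vocab_int_1)
print(vocab_int_2)
print(vocab_float)
```

With a larger corpus such as the music reviews, a proportion (for example a value somewhere below 1.0) is usually easier to reason about than an absolute count, though the right cutoff is a judgment call.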

### Exercise: 

Use the difference of proportions calculation to compare two genres, or two artists, in the music reviews dataset. There are many ways you can do this. Think through the problem in steps. We'll go over a solution the week after next.