## Dictionary Method

For the final two days we'll move to measuring the prevalence of themes in a corpus. We'll cover three ways of doing this: the dictionary method, supervised classification, and unsupervised machine learning. Today: the dictionary method.

This is the simplest way to measure the prevalence of a theme in a corpus, and is used for many purposes, including sentiment analysis. It is one of the longest-standing and most ubiquitous methods in automated text analysis, so it's important both to understand the method and to be able to implement it.

The method is simple: it involves grouping words into categories or themes, and then counting the number of words from each theme in your corpus. We will use this method to do sentiment analysis, a popular text analysis task, on our Music Review corpus, using a standard sentiment analysis dictionary.

### Learning Goals
* Understand the intuition behind the dictionary method
* Learn how to implement it with Python's Pandas and NLTK
* Get more comfortable combining Python packages for more analytic power
    * Today, we'll combine Pandas and NLTK
* Implement a rudimentary sentiment analysis tool


### Outline
* Introduction to the Dictionary Method
* Pre-Processing
    * Create Pandas dataframe
    * Lowercase, remove punctuation, tokenize
    * Create column for token count
* Sentiment Analysis using the Dictionary Method


### Key Jargon

* *dictionary method*:
    * text analysis method that uses the frequency of key words, grouped into themes, to determine the prevalence of each theme throughout a corpus.
* *standard dictionary*:
    * otherwise known as a general dictionary: a dictionary created by experts and meant to measure general phenomena.
* *custom dictionary*:
    * dictionaries tailored to a specific domain or question. Usually created by the researcher based on the research question.
* *sentiment analysis*:
    * the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
* *lambda function*:
    * A small, unnamed function that you write yourself, inline, using Python's lambda keyword. This is different from the built-in functions we have been using.
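
To make the last definition concrete, here is a small sketch comparing a named function with the equivalent lambda. The names `count_tokens`, `count_tokens_lambda`, and `sample` are made up for illustration and do not appear elsewhere in the lesson.

```python
# A named function defined with def
def count_tokens(tokens):
    return len(tokens)

# The same logic as a lambda: an unnamed function written inline.
# It is bound to a name here only so the two can be compared side by side.
count_tokens_lambda = lambda tokens: len(tokens)

sample = ["a", "remarkable", "debut"]  # hypothetical token list
print(count_tokens(sample))
print(count_tokens_lambda(sample))
```

In practice a lambda is usually passed directly to another function rather than given a name.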

### Further Resources

[A Novel Method for Detecting Plot](http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/), Matt Jockers

Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. [“Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-)Response to Economic Inequality.”](http://cdn.equitablegrowth.org/wp-content/uploads/2016/06/29155322/enns-kelly-morgan-witko-econinterests-policyagenda.pdf) Paper presented at the APSA Annual Meetings, San Francisco.
* Outlines the process of creating your own dictionary

[Neal Caren has a tutorial using MPQA](http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-3/), which implements the dictionary method in Python but in a much different way 

**__________________________________**


### 0. Introduction to the Dictionary Method

The dictionary method is based on the assumption that themes or categories consist of a group of words, and texts that cover that theme will have a higher percentage of that group of words compared to other texts. Dictionary methods are used for many purposes. A few possibilities:
* classify text into themes
* measure the *tone* of text
* measure sentiment
* measure psychological processes
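
The core assumption can be sketched in a few lines of plain Python. This is only an illustration: the two word lists, the example sentence, and the helper `theme_proportions` are invented here, and the whitespace tokenizer is a stand-in for the NLTK tokenizer used later in the lesson.

```python
# Hypothetical two-theme dictionary; these word lists are invented for illustration
themes = {
    "positive": {"remarkable", "classic", "catchy"},
    "negative": {"shame", "disastrous", "fake"},
}

def theme_proportions(text, themes):
    """Return, for each theme, the share of tokens that belong to that theme."""
    tokens = text.lower().split()  # crude whitespace tokenizer, for brevity
    total = len(tokens)
    return {
        name: (sum(token in words for token in tokens) / total) if total else 0.0
        for name, words in themes.items()
    }

scores = theme_proportions(
    "a remarkable and catchy debut with one disastrous song", themes
)
print(scores)
```

A text that leans on one theme's vocabulary gets a higher proportion for that theme, which is all the method measures.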

There are two forms of dictionaries: standard or general dictionaries, and custom dictionaries.

#### Standard Dictionaries

There are a number of standard dictionaries that have been created by field experts. The benefit of standardized dictionaries is that they're developed by experts and have been thoroughly validated. Others have likely published using these dictionaries, so reviewers are more likely to accept them as valid. Because of this, they are good options if they fit your research question.

Here are a few:

* [DICTION](http://www.dictionsoftware.com/): a computer-aided text analysis program for determining the tone of a text. It was created by and for organization scholars and political scientists.
    * Main five categories: Certainty, Activity, Optimism, Realism, Commonality
    * 35 sub-categories
    * Allows you to create your own dictionary
    * Proprietary software
* [Linguistic Inquiry and Word Count (LIWC)](http://liwc.wpengine.com/): Created by psychologists, it's meant to capture psychological processes around feelings, personality, and motivations. It's also proprietary.
* [Multi-Perspective Question Answering (MPQA)](http://mpqa.cs.pitt.edu/): A free sentiment lexicon that labels words as positive or negative, and a no-cost alternative to LIWC for sentiment work. We will use this dictionary today.
* [Harvard General Inquirer](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm): Multiple categories, including abstract and concrete words. It's free and available online.

#### Custom Dictionaries

Many research questions or data are domain specific, however, and will thus require you to create your own dictionary based on your own knowledge of the domain and question. Creating your own dictionary requires a lot of thought, and the result must be validated. These dictionaries are typically created in an iterative fashion, and are modified as they are validated. See Enns et al. (2015) for an example of how they constructed their own dictionary.

Today we will use the free and standard sentiment dictionary from MPQA to measure positive and negative sentiment in the music reviews.

Our first step, as with any technique, is pre-processing, to get the data ready for analysis.

### 1. Pre-Processing

First, read in our Music Reviews corpus as a Pandas dataframe.

In [190]:
#import the necessary packages
import pandas
import nltk
from nltk import word_tokenize
import string

#read the Music Reviews corpus into a Pandas dataframe
df = pandas.read_csv("BDHSI2016_music_reviews.csv", sep = '\t')

#view the dataframe
print(df)

                                    album  \
0                             Don't Panic   
1                 Fear and Saturday Night   
2                      The Way I'm Livin'   
3                                   Doris   
4                                 Giraffe   
5                            Weathervanes   
6                    Build a Rocket Boys!   
7                      Ambivalence Avenue   
8                                 Wavvves   
9                          Peachtree Road   
10                               Heritage   
11                            White Chalk   
12                    Tyrannosaurus Hives   
13                             JackInABox   
14                            Liquid Love   
15                  The  Truth About Love   
16                            The Monitor   
17                         Ones and Sixes   
18        In Search Of... [First Version]   
19                            Tarot Sport   
20                             July Flame   
21        

The next step is to create a new column in our dataset that contains tokenized words with all the pre-processing steps.

The code here will look slightly different from lesson 1, as we're applying these functions to every row in our dataframe.

In [191]:
#first create a new column called "body_tokens" and transform to lowercase by applying the string function str.lower()
df['body_tokens'] = df['body'].str.lower()

#make sure it worked
print(df[['body','body_tokens']])

                                                   body  \
0     While For Baltimore proves they can still writ...   
1     There's nothing fake about the purgatorial nar...   
2     All life's disastrous lows are here on a caree...   
3     With Doris, Odd Future’s Odysseus is finally b...   
4     Though Giraffe is definitely Echoboy's most im...   
5     Fans of Owl City and The Postal Service will r...   
6     Whereas previous Elbow records set a mood, Bui...   
7     His remarkable Warp debut follows a series of ...   
8     There’s an energy coursing through this, and r...   
9     Classic. Songs filled with soul. Lyrics refres...   
10    It’s by no means perfect and it does feel slig...   
11    Put in context, White Chalk serves her purpose...   
12    Although pretty catchy, this album is a tad to...   
13     Talk about a fall from grace. [4 Jun 2005, p.58]   
14    It's unusual to find a band equally at home wi...   
15    It's just a shame she gave album space to Mari... 

Next we tokenize the text. To do this on a Pandas dataframe we need the apply function. This simply tells the computer to take the function in the parentheses, apply it to each row of the column, and hand back the results, which we then assign to a column.

There are two ways to do this. If you're applying an existing function to the entire field, such as nltk.word_tokenize, you can simply put the function's name in the parentheses. In other cases, you need to write your own small function inline, called a lambda function. This is the case if you're applying something to each element of a list (Pandas does not deal with list objects well. Hopefully someone smart will fix that). We'll get to that case below.
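
Both cases can be seen side by side on a toy dataframe. This is a sketch under assumptions: the two review texts are invented, `len` and a whitespace split stand in for the NLTK tokenizer, and the column names `n_chars`, `tokens`, and `long_tokens` are hypothetical.

```python
import pandas

# Toy dataframe; the column name "body" mirrors the lesson, the text is invented
toy = pandas.DataFrame(
    {"body": ["Classic. Songs filled with soul.", "Talk about a fall from grace."]}
)

# Case 1: pass an existing function by name; apply calls it once per row
toy["n_chars"] = toy["body"].apply(len)

# Case 2: pass a lambda when the per-row logic has to be written out
toy["tokens"] = toy["body"].apply(lambda text: text.lower().split())
toy["long_tokens"] = toy["tokens"].apply(
    lambda tokens: [t for t in tokens if len(t) > 4]
)

print(toy[["n_chars", "long_tokens"]])
```

Note that the function is passed without parentheses in the first case: apply does the calling.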

In [192]:
#tokenize
df['body_tokens'] = df['body_tokens'].apply(nltk.word_tokenize)

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, ,, odd, future’s, odysseus, is, ...
4       [though, giraffe, is, definitely, echoboy, 's,...
5       [fans, of, owl, city, and, the, postal, servic...
6       [whereas, previous, elbow, records, set, a, mo...
7       [his, remarkable, warp, debut, follows, a, ser...
8       [there’s, an, energy, coursing, through, this,...
9       [classic, ., songs, filled, with, soul, ., lyr...
10      [it’s, by, no, means, perfect, and, it, does, ...
11      [put, in, context, ,, white, chalk, serves, he...
12      [although, pretty, catchy, ,, this, album, is,...
13      [talk, about, a, fall, from, grace, ., [, 4, j...
14      [it, 's, unusual, to, find, a, band, equally, ...
15      [it, 's, just, a, shame, she, gave, album, spa...
16      [the, fundamental, difference, between, the, m...
17      [it, j

In [193]:
punctuations = list(string.punctuation)

#remove punctuation. Let's talk about that lambda x.
df['body_tokens'] = df['body_tokens'].apply(lambda x: [word for word in x if word not in punctuations])

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, odd, future’s, odysseus, is, fin...
4       [though, giraffe, is, definitely, echoboy, 's,...
5       [fans, of, owl, city, and, the, postal, servic...
6       [whereas, previous, elbow, records, set, a, mo...
7       [his, remarkable, warp, debut, follows, a, ser...
8       [there’s, an, energy, coursing, through, this,...
9       [classic, songs, filled, with, soul, lyrics, r...
10      [it’s, by, no, means, perfect, and, it, does, ...
11      [put, in, context, white, chalk, serves, her, ...
12      [although, pretty, catchy, this, album, is, a,...
13      [talk, about, a, fall, from, grace, 4, jun, 20...
14      [it, 's, unusual, to, find, a, band, equally, ...
15      [it, 's, just, a, shame, she, gave, album, spa...
16      [the, fundamental, difference, between, the, m...
17      [it, j

Pre-processing is done. What other pre-processing steps might we use?

One more step before getting to the dictionary method. We want a total token count for each row, so we can normalize the dictionary counts. To do this we simply create a new column that contains the length of the token list in each row.

In [194]:
df['token_count'] = df['body_tokens'].apply(lambda x: len(x))

print(df[['body_tokens','token_count']])

                                            body_tokens  token_count
0     [while, for, baltimore, proves, they, can, sti...           40
1     [there, 's, nothing, fake, about, the, purgato...           29
2     [all, life, 's, disastrous, lows, are, here, o...           14
3     [with, doris, odd, future’s, odysseus, is, fin...           16
4     [though, giraffe, is, definitely, echoboy, 's,...           51
5     [fans, of, owl, city, and, the, postal, servic...           34
6     [whereas, previous, elbow, records, set, a, mo...           34
7     [his, remarkable, warp, debut, follows, a, ser...           21
8     [there’s, an, energy, coursing, through, this,...           38
9     [classic, songs, filled, with, soul, lyrics, r...           60
10    [it’s, by, no, means, perfect, and, it, does, ...           20
11    [put, in, context, white, chalk, serves, her, ...           47
12    [although, pretty, catchy, this, album, is, a,...           10
13    [talk, about, a, fall, from,

### 2. Creating Dictionary Counts

I created two text files: one is a list of positive words from the MPQA dictionary, the other a list of negative words, with one word per line. Our goal here is to count the number of positive and negative words in each row of our dataframe, and add two columns to our dataset holding those counts.

First, read in the positive and negative words and create list variables for each.

In [195]:
with open("positive_words.txt") as pos_file:
    pos_sent = pos_file.read()
with open("negative_words.txt") as neg_file:
    neg_sent = neg_file.read()

#view part of the pos_sent variable, to see how it's formatted.
print(pos_sent[:101])

abidance
abidance
abilities
ability
able
above
above-average
abundant
abundance
acceptance
acceptable


In [196]:
#remember the split function? We'll split on the newline character (\n) to create a list
positive_words=pos_sent.split('\n')
negative_words=neg_sent.split('\n')

#view the first elements in the lists
print(positive_words[:10])
print(negative_words[:10])

['abidance', 'abidance', 'abilities', 'ability', 'able', 'above', 'above-average', 'abundant', 'abundance', 'acceptance']
['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred', 'abhorrence', 'abhorrent', 'abhorrently', 'abhors', 'abhors']


In [197]:
#count number of words in each list
print(len(positive_words))
print(len(negative_words))

2231
3906
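
The first entries printed above include repeats ('abidance' appears twice), and splitting on '\n' keeps an empty string at the end of the list whenever a file ends with a newline. A possible alternative, sketched here under the assumption of one word per line, is to load each file into a set. The helper `load_word_set` is hypothetical, and the demo writes a throwaway file rather than touching the lesson's word lists.

```python
import os
import tempfile

def load_word_set(path):
    """Read one word per line into a set, skipping blank lines and duplicates."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

# Demo on a temporary file that mimics the one-word-per-line format
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("abidance\nabidance\nabilities\nability\n")
    demo_path = tmp.name

demo_words = load_word_set(demo_path)
os.remove(demo_path)
print(sorted(demo_words))
```

Loaded this way, the sizes of the two collections would differ from the list lengths printed above, since repeats and blanks are dropped.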


Great! Now we can create two more columns that contain the number of positive words and negative words in the review tokens. I'm going to get creative with this, as we need to do this step in one line of code for positive and negative words, each. Your challenges:

* Can you parse the code? We'll walk through it together.
* Think of other ways you could do this same thing.

In [198]:
#create columns with the number of positive and negative words
df['positive_tokens'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in positive_words]))
df['negative_tokens'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in negative_words]))

print(df[['token_count', 'positive_tokens', 'negative_tokens']])

      token_count  positive_tokens  negative_tokens
0              40                1                0
1              29                0                3
2              14                0                1
3              16                0                1
4              51                2                4
5              34                4                0
6              34                2                0
7              21                3                0
8              38                1                0
9              60                8                0
10             20                3                1
11             47                1                1
12             10                2                1
13             10                1                0
14             27                2                1
15             25                2                2
16             31                2                2
17             21                2                1
18          

That's the dictionary method! You can do this with any dictionary you want, either a standard one or one you create yourself.
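
As one possible answer to the second challenge, the same count can be written with a set and a small helper. Membership tests against a set are hash lookups, whereas testing against a list of several thousand words scans the list for every token. Everything named here (`count_matches`, `positive_set`, `negative_set`, the token list) is invented for illustration.

```python
def count_matches(tokens, vocabulary):
    """Count how many tokens appear in vocabulary (repeated tokens count each time)."""
    return sum(1 for token in tokens if token in vocabulary)

positive_set = {"remarkable", "classic", "perfect"}  # stand-in for set(positive_words)
negative_set = {"shame", "disastrous", "fake"}       # stand-in for set(negative_words)

tokens = ["a", "classic", "classic", "record", "with", "one", "fake", "note"]
print(count_matches(tokens, positive_set))
print(count_matches(tokens, negative_set))
```

On the lesson's dataframe this helper could be plugged into apply, for example `df['body_tokens'].apply(lambda x: count_matches(x, positive_set))`, assuming `positive_set` was built from the positive word list.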

### 3. Sentiment Analysis using the Dictionary Method

What can we do with this?

First, let's compare the overall sentiment of the reviews by genre.

In [199]:
#use groupby function
df_genres = df.groupby('genre')

print("Proportion Positive Words")
print((df_genres['positive_tokens'].sum()/df_genres['token_count'].sum()).sort_values(ascending=False))
print()
print("Proportion Negative Words")
print((df_genres['negative_tokens'].sum()/df_genres['token_count'].sum()).sort_values())

Proportion Positive Words
genre
Jazz                      0.085170
Folk                      0.084559
Alternative/Indie Rock    0.073557
Indie                     0.073287
Rock                      0.071613
Electronic                0.070994
Pop/Rock                  0.070922
Pop                       0.069388
R&B;                      0.069345
Country                   0.061577
Dance                     0.061299
Rap                       0.060397
dtype: float64

Proportion Negative Words
genre
R&B;                      0.023390
Jazz                      0.025050
Folk                      0.026961
Pop                       0.028912
Country                   0.031204
Pop/Rock                  0.031931
Alternative/Indie Rock    0.032588
Rock                      0.033084
Rap                       0.033481
Electronic                0.033514
Indie                     0.033866
Dance                     0.034767
dtype: float64


Notice the position of Rap and R&B; in both lists. These lists are not inverses. This suggests positive and negative emotion words are in some cases orthogonal. 
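
One detail worth knowing about the calculation above: it pools all tokens within a genre (sum of counts divided by sum of token totals), which is not the same as averaging each review's own proportion. The sketch below uses invented counts to show the difference; the column `positive_share` and the variables `pooled` and `per_review` are hypothetical names.

```python
import pandas

# Invented counts for three reviews in two genres
toy = pandas.DataFrame({
    "genre": ["Jazz", "Jazz", "Rap"],
    "token_count": [10, 90, 50],
    "positive_tokens": [5, 9, 5],
})

grouped = toy.groupby("genre")

# Pooled proportion: total positive tokens over total tokens per genre,
# so long reviews carry more weight
pooled = grouped["positive_tokens"].sum() / grouped["token_count"].sum()

# Mean of per-review proportions: every review weighs the same
toy["positive_share"] = toy["positive_tokens"] / toy["token_count"]
per_review = toy.groupby("genre")["positive_share"].mean()

print(pooled)
print(per_review)
```

Which version is appropriate depends on whether the unit of interest is the genre's text as a whole or the typical review.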

Compare these lists to the average score by genre.

In [200]:
print(df_genres['score'].mean().sort_values(ascending=False))

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B;                      72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64


Notice the position of Country in both lists. What might you conclude from this?

As another validation check, let's split the reviews into high-scoring and low-scoring groups and see if the comparison matches our intuition about the emotion proportions.

In [201]:
df_highscore = df[df['score']>=75]
df_lowscore = df[df['score']<65]
print("Proportion of Positive Words")
print("High Score:")
print((df_highscore['positive_tokens'].sum()/df_highscore['token_count'].sum()))
print("Low Score:")
print((df_lowscore['positive_tokens'].sum()/df_lowscore['token_count'].sum()))
print()
print("Proportion of Negative Words")
print("High Score:")
print((df_highscore['negative_tokens'].sum()/df_highscore['token_count'].sum()))
print("Low Score:")
print((df_lowscore['negative_tokens'].sum()/df_lowscore['token_count'].sum()))

Proportion of Positive Words
High Score:
0.0729579698652
Low Score:
0.0664526603203

Proportion of Negative Words
High Score:
0.0317625944322
Low Score:
0.0362703859494


Not bad. But this also illustrates potential problems with sentiment analysis, and the dictionary method in general.
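
One such problem is that counting words ignores context. The sketch below, which uses an invented four-word dictionary and a hypothetical helper `net_sentiment`, scores a sentence and its negation identically because "not" is not in either word list.

```python
positive_set = {"good", "great"}   # invented mini-dictionary
negative_set = {"bad", "boring"}

def net_sentiment(text):
    """Positive count minus negative count, ignoring word order and context."""
    tokens = text.lower().split()
    positives = sum(token in positive_set for token in tokens)
    negatives = sum(token in negative_set for token in tokens)
    return positives - negatives

plain = net_sentiment("this album is good")
negated = net_sentiment("this album is not good")
print(plain, negated)
```

Sarcasm, domain-specific usage, and words with several senses raise similar difficulties.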

Exercise:
* Make your own dictionary of terms as a text file, with each word on a separate line. Reproduce this analysis using your own dictionary.
* Reproduce this on your own corpus, or on one of our other corpora.