## Dictionary Method

This is the simplest way to measure the prevalence of a theme in a corpus, and it is used for many purposes, including sentiment analysis. It is one of the most long-standing and ubiquitous methods in automated text analysis, so it's important both to understand the method and to be able to implement it.

The method is simple: it involves grouping words into categories or themes, and then counting the number of words from each theme in your corpus. We will use this method to do sentiment analysis, a popular text analysis task, on our Music Review corpus, using a standard sentiment analysis dictionary.

### Learning Goals
* Understand the intuition behind the dictionary method
* Learn how to implement it via Python Pandas and NLTK
* Get more comfortable combining Python packages together
* Implement a rudimentary sentiment analysis tool and test it on our Music Reviews data.


### Outline
* Introduction to the Dictionary Method
* Pre-Processing
* Sentiment Analysis using the Dictionary Method


### Key Jargon

* *dictionary method*:
    * text analysis method that uses the frequency of key words, grouped into themes, to determine the prevalence of each theme throughout a corpus.
* *standard dictionary*:
    * otherwise known as general dictionaries: dictionaries created by experts and meant to measure general phenomena.
* *custom dictionary*:
    * dictionaries tailored to a specific domain or question. Usually created by the researcher based on the research question.
* *sentiment analysis*:
    * the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
    
### Further Resources

[A Novel Method for Detecting Plot](http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/), Matt Jockers

Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. [“Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-)Response to Economic Inequality.”](http://cdn.equitablegrowth.org/wp-content/uploads/2016/06/29155322/enns-kelly-morgan-witko-econinterests-policyagenda.pdf) Paper presented at the APSA Annual Meetings, San Francisco.
* Outlines the process of creating your own dictionary

[Neal Caren has a tutorial using MPQA](http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-3/), which implements the dictionary method in Python but in a very different way.

**__________________________________**


### 0. Introduction to the Dictionary Method

The dictionary method is based on the assumption that themes or categories consist of a group of words, and texts that cover that theme will have a higher percentage of that group of words compared to other texts. Dictionary methods are used for many purposes. A few possibilities:
* classify text into themes
* measure the *tone* of text
* measure sentiment
* measure psychological processes
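
The assumption above can be made concrete with a minimal sketch. The `themes` dictionary, the word lists in it, and the sample review are all hypothetical and chosen only for illustration; the tokenization is a naive whitespace split rather than the NLTK tokenizer used later.

```python
# Hypothetical toy dictionary: theme name -> set of key words (illustrative only).
themes = {
    "positive": {"good", "great", "remarkable"},
    "negative": {"bad", "dull", "disastrous"},
}

def theme_proportions(text, themes):
    """Return the share of tokens in `text` that belong to each theme."""
    tokens = text.lower().split()  # naive whitespace tokenization
    total = len(tokens)
    return {
        name: (sum(token in words for token in tokens) / total if total else 0.0)
        for name, words in themes.items()
    }

review = "A remarkable debut with great songs and one dull ballad"
props = theme_proportions(review, themes)
print(props)
```

A text that covers a theme more heavily gets a higher proportion for that theme, which is the quantity the rest of this lesson computes at scale.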

There are two forms of dictionaries: standard or general dictionaries, and custom dictionaries.

#### Standard Dictionaries

There are a number of standard dictionaries that have been created by field experts. The benefit of standardized dictionaries is that they're developed by experts and have been thoroughly validated. Others have likely published using these dictionaries, so reviewers are more likely to accept them as valid. Because of this, they are good options if they fit your research question.

Here are a few:

* [DICTION](http://www.dictionsoftware.com/): a computer-aided text analysis program for determining the tone of a text. It was created by and for organization scholars and political scientists.
    * Main five categories: Certainty, Activity, Optimism, Realism, Commonality
    * 35 sub-categories
    * Allows you to create your own dictionary
    * Proprietary software
* [Linguistic Inquiry and Word Count (LIWC)](http://liwc.wpengine.com/): Created by psychologists, it's meant to capture psychological processes around feelings, personality, and motivations. It's also proprietary.
* [Multi-Perspective Question Answering (MPQA)](http://mpqa.cs.pitt.edu/): A freely available subjectivity lexicon, used here as a free alternative to proprietary dictionaries such as LIWC. We will use this dictionary today.
* [Harvard General Inquirer](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm). Multiple categories, including abstract and concrete words. It's free and available online.

#### Custom Dictionaries

Many research questions or data are domain specific, however, and will thus require you to create your own dictionary based on your own knowledge of the domain and question. Creating your own dictionary requires a lot of thought, and the dictionary must be validated. These dictionaries are typically created in an iterative fashion and are modified as they are validated. See Enns et al. (2015) for an example of how they constructed their own dictionary.

Today we will use the free and standard sentiment dictionary from MPQA to measure positive and negative sentiment in the music reviews.

Our first step, as with any technique, is pre-processing, to get the data ready for analysis.

### 1. Pre-Processing

First, read in our Music Reviews corpus as a Pandas dataframe.

In [1]:
#import the necessary packages
import pandas
import nltk
from nltk import word_tokenize
import string

#read the Music Reviews corpus into a Pandas dataframe
df = pandas.read_csv("../data/BDHSI2016_music_reviews.csv", encoding='utf-8', sep = '\t')

#view the dataframe
df

Unnamed: 0,album,artist,genre,release_date,critic,score,body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...
5,Weathervanes,Freelance Whales,Indie,2010-04-13 00:00:00,Q Magazine,68.0,Fans of Owl City and The Postal Service will r...
6,Build a Rocket Boys!,Elbow,Pop/Rock,2011-04-12 00:00:00,Delusions of Adequacy,82.0,"Whereas previous Elbow records set a mood, Bui..."
7,Ambivalence Avenue,Bibio,Indie,2009-06-23 00:00:00,Q Magazine,78.0,His remarkable Warp debut follows a series of ...
8,Wavvves,Wavves,Indie,2009-03-17 00:00:00,PopMatters,68.0,"There’s an energy coursing through this, and r..."
9,Peachtree Road,Elton John,Rock,2004-11-09 00:00:00,MelD.,70.0,Classic. Songs filled with soul. Lyrics refres...


The next step is to create a new column in our dataset that contains tokenized words with all the pre-processing steps.

In [3]:
#first remove digits from the review text
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
#then create a new column called "body_tokens" and transform to lowercase by applying the string function str.lower()
df['body_tokens'] = df['body'].str.lower()

In [4]:
#tokenize
df['body_tokens'] = df['body_tokens'].apply(nltk.word_tokenize)

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, ,, odd, future’s, odysseus, is, ...
4       [though, giraffe, is, definitely, echoboy, 's,...
5       [fans, of, owl, city, and, the, postal, servic...
6       [whereas, previous, elbow, records, set, a, mo...
7       [his, remarkable, warp, debut, follows, a, ser...
8       [there’s, an, energy, coursing, through, this,...
9       [classic, ., songs, filled, with, soul, ., lyr...
10      [it’s, by, no, means, perfect, and, it, does, ...
11      [put, in, context, ,, white, chalk, serves, he...
12      [although, pretty, catchy, ,, this, album, is,...
13      [talk, about, a, fall, from, grace, ., [, jun,...
14      [it, 's, unusual, to, find, a, band, equally, ...
15      [it, 's, just, a, shame, she, gave, album, spa...
16      [the, fundamental, difference, between, the, m...
17      [it, j

In [5]:
punctuations = list(string.punctuation)

#remove punctuation. Let's talk about that lambda x.
df['body_tokens'] = df['body_tokens'].apply(lambda x: [word for word in x if word not in punctuations])

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, odd, future’s, odysseus, is, fin...
4       [though, giraffe, is, definitely, echoboy, 's,...
5       [fans, of, owl, city, and, the, postal, servic...
6       [whereas, previous, elbow, records, set, a, mo...
7       [his, remarkable, warp, debut, follows, a, ser...
8       [there’s, an, energy, coursing, through, this,...
9       [classic, songs, filled, with, soul, lyrics, r...
10      [it’s, by, no, means, perfect, and, it, does, ...
11      [put, in, context, white, chalk, serves, her, ...
12      [although, pretty, catchy, this, album, is, a,...
13            [talk, about, a, fall, from, grace, jun, p]
14      [it, 's, unusual, to, find, a, band, equally, ...
15      [it, 's, just, a, shame, she, gave, album, spa...
16      [the, fundamental, difference, between, the, m...
17      [it, j

Pre-processing is done. What other pre-processing steps might we use?
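
The `lambda x` in the punctuation step deserves a closer look. A lambda is an anonymous function, and `apply` calls it once per row with that row's token list as `x`. The sketch below shows that a named function does the same job; the two-row Series and the function name `remove_punctuation` are hypothetical stand-ins, not part of the Music Reviews data.

```python
import string
import pandas

punctuations = list(string.punctuation)

def remove_punctuation(tokens):
    """Return the tokens that are not single punctuation marks."""
    return [word for word in tokens if word not in punctuations]

# Hypothetical two-row example standing in for df['body_tokens'].
tokens = pandas.Series([["classic", ".", "songs"], ["put", "in", "context", ","]])

# The lambda form used above and the named-function form give the same result.
with_lambda = tokens.apply(lambda x: [word for word in x if word not in punctuations])
with_named = tokens.apply(remove_punctuation)

print(with_named.tolist())
```

A lambda is convenient for a one-off transformation; a named function is easier to reuse and test when the same step appears in several places.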

One more step before getting to the dictionary method. We want a total token count for each row, so we can normalize the dictionary counts. To do this we simply create a new column that contains the length of the token list in each row.

In [6]:
df['token_count'] = df['body_tokens'].apply(len)

print(df[['body_tokens','token_count']])

                                            body_tokens  token_count
0     [while, for, baltimore, proves, they, can, sti...           38
1     [there, 's, nothing, fake, about, the, purgato...           28
2     [all, life, 's, disastrous, lows, are, here, o...           13
3     [with, doris, odd, future’s, odysseus, is, fin...           16
4     [though, giraffe, is, definitely, echoboy, 's,...           51
5     [fans, of, owl, city, and, the, postal, servic...           33
6     [whereas, previous, elbow, records, set, a, mo...           34
7     [his, remarkable, warp, debut, follows, a, ser...           20
8     [there’s, an, energy, coursing, through, this,...           38
9     [classic, songs, filled, with, soul, lyrics, r...           60
10    [it’s, by, no, means, perfect, and, it, does, ...           20
11    [put, in, context, white, chalk, serves, her, ...           47
12    [although, pretty, catchy, this, album, is, a,...           10
13          [talk, about, a, fall,

### 2. Creating Dictionary Counts

I created two text files: one is a list of positive words from the MPQA dictionary, the other a list of negative words, with one word per line. Our goal here is to count the number of positive and negative words in each row of our dataframe, and add two columns to our dataset with the count of positive and negative words.

First, read in the positive and negative words and create list variables for each.

In [7]:
pos_sent = open("../data/positive_words.txt", encoding='utf-8').read()
neg_sent = open("../data/negative_words.txt", encoding='utf-8').read()

#view part of the pos_sent variable, to see how it's formatted.
print(pos_sent[:101])

abidance
abidance
abilities
ability
able
above
above-average
abundant
abundance
acceptance
acceptable


In [8]:
#remember the split function? We'll split on the newline character (\n) to create a list
positive_words=pos_sent.split('\n')
negative_words=neg_sent.split('\n')

#view the first elements in the lists
print(positive_words[:10])
print(negative_words[:10])
positive_words

['abidance', 'abidance', 'abilities', 'ability', 'able', 'above', 'above-average', 'abundant', 'abundance', 'acceptance']
['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred', 'abhorrence', 'abhorrent', 'abhorrently', 'abhors', 'abhors']


['abidance',
 'abidance',
 'abilities',
 'ability',
 'able',
 'above',
 'above-average',
 'abundant',
 'abundance',
 'acceptance',
 'acceptable',
 'accessible',
 'acclaim',
 'acclaimed',
 'accolade',
 'accolades',
 'accommodative',
 'accomplishment',
 'accomplishments',
 'accordance',
 'accordantly',
 'accurate',
 'accurately',
 'achievable',
 'achievement',
 'achievements',
 'acknowledgement',
 'active',
 'acumen',
 'adaptable',
 'adaptability',
 'adaptive',
 'adept',
 'adeptly',
 'adequate',
 'adherence',
 'adherent',
 'adhesion',
 'admirable',
 'admirer',
 'admirable',
 'admirably',
 'admiration',
 'admiring',
 'admiringly',
 'admission',
 'admission',
 'adorable',
 'adored',
 'adorer',
 'adoring',
 'adoringly',
 'adroit',
 'adroitly',
 'adulatory',
 'advanced',
 'advantage',
 'advantage',
 'advantageous',
 'advantages',
 'advantages',
 'adventure',
 'adventure',
 'adventuresome',
 'adventurism',
 'adventurous',
 'advice',
 'advice',
 'advisable',
 'advocacy',
 'affable',
 'affabili

In [9]:
#count number of words in each list
print(len(positive_words))
print(len(negative_words))

2231
3906
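
The printed lists above contain some words more than once (for example `'abidance'`), so `len()` overstates the number of distinct dictionary words. Splitting on `'\n'` can also leave an empty string at the end if the file ends with a newline, which is an assumption here since the file itself is not shown. A minimal sketch of one way to clean the list, using a hypothetical stand-in string rather than the real file:

```python
# Hypothetical stand-in for the contents of positive_words.txt
# (note the repeated word and the trailing newline).
pos_sent = "abidance\nabidance\nabilities\nability\nable\n"

raw_words = pos_sent.split('\n')
print(raw_words)

# Strip whitespace, drop empty strings, and deduplicate by building a set.
positive_set = {word.strip() for word in raw_words if word.strip()}
print(len(raw_words), len(positive_set))

# Membership tests against a set give the same counts as against the list.
tokens = ["an", "able", "band", "with", "real", "ability"]
pos_num = sum(1 for word in tokens if word in positive_set)
print(pos_num)
```

Duplicates do not change the per-review counts, because each token is only checked for membership. A set also makes each `word in ...` check faster than scanning a list of several thousand words.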


Great! You know what to do now.

Exercise:
1. Create a column with the number of positive words, and another with the proportion of positive words
2. Create a column with the number of negative words, and another with the proportion of negative words
3. Print the average proportion of negative and positive words by genre
4. Compare this to the average score by genre

In [19]:
#exercise code here
df['pos_num'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in positive_words]))
df['neg_num'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in negative_words]))

df['pos_prop'] = df['pos_num']/df['token_count']
df['neg_prop'] = df['neg_num']/df['token_count']
df.drop('release_date', axis=1, inplace=True)
df

Unnamed: 0,album,artist,genre,release_date,critic,score,body,body_tokens,token_count,pos_num,neg_num,pos_prop,neg_prop
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...,"[while, for, baltimore, proves, they, can, sti...",38,1,0,0.026316,0.000000
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...,"[there, 's, nothing, fake, about, the, purgato...",28,0,3,0.000000,0.107143
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...,"[all, life, 's, disastrous, lows, are, here, o...",13,0,1,0.000000,0.076923
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b...","[with, doris, odd, future’s, odysseus, is, fin...",16,0,1,0.000000,0.062500
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...,"[though, giraffe, is, definitely, echoboy, 's,...",51,2,4,0.039216,0.078431
5,Weathervanes,Freelance Whales,Indie,2010-04-13 00:00:00,Q Magazine,68.0,Fans of Owl City and The Postal Service will r...,"[fans, of, owl, city, and, the, postal, servic...",33,4,0,0.121212,0.000000
6,Build a Rocket Boys!,Elbow,Pop/Rock,2011-04-12 00:00:00,Delusions of Adequacy,82.0,"Whereas previous Elbow records set a mood, Bui...","[whereas, previous, elbow, records, set, a, mo...",34,2,0,0.058824,0.000000
7,Ambivalence Avenue,Bibio,Indie,2009-06-23 00:00:00,Q Magazine,78.0,His remarkable Warp debut follows a series of ...,"[his, remarkable, warp, debut, follows, a, ser...",20,3,0,0.150000,0.000000
8,Wavvves,Wavves,Indie,2009-03-17 00:00:00,PopMatters,68.0,"There’s an energy coursing through this, and r...","[there’s, an, energy, coursing, through, this,...",38,1,0,0.026316,0.000000
9,Peachtree Road,Elton John,Rock,2004-11-09 00:00:00,MelD.,70.0,Classic. Songs filled with soul. Lyrics refres...,"[classic, songs, filled, with, soul, lyrics, r...",60,8,0,0.133333,0.000000


In [21]:
grouped = df.groupby('genre')
grouped['pos_prop'].mean().sort_values(ascending=False)

genre
Folk                      0.098155
Jazz                      0.089603
Indie                     0.086019
Rock                      0.080848
Alternative/Indie Rock    0.080572
Electronic                0.079320
Pop/Rock                  0.078621
Dance                     0.078194
R&B                       0.074830
Country                   0.073167
Rap                       0.071569
Pop                       0.070163
Name: pos_prop, dtype: float64

In [23]:
grouped['score'].mean().sort_values(ascending=False)

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B                       72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64
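
For the last step of the exercise, it can help to put the genre averages side by side in one table instead of comparing two separate printouts. The sketch below assumes a dataframe with the `genre`, `score`, `pos_prop`, and `neg_prop` columns created above; the four rows in `toy` are made-up values, not results from the corpus.

```python
import pandas

# Hypothetical miniature version of df with the columns created above.
toy = pandas.DataFrame({
    "genre": ["Folk", "Folk", "Pop", "Pop"],
    "score": [80.0, 72.0, 60.0, 68.0],
    "pos_prop": [0.10, 0.08, 0.06, 0.08],
    "neg_prop": [0.02, 0.04, 0.05, 0.03],
})

# Mean of each measure by genre, in a single table sorted by score.
by_genre = toy.groupby("genre")[["pos_prop", "neg_prop", "score"]].mean()
by_genre = by_genre.sort_values("score", ascending=False)
print(by_genre)
```

On the real data, the same two lines applied to `df` would show whether genres with higher average scores also tend to have more positive and fewer negative words.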

That's the dictionary method! You can do this with any dictionary you want, whether a standard one or one you create yourself.

### 3. Dictionary Method using Scikit-learn

We can also do this using the document term matrix. We'll again do this in pandas, to make it conceptually clear. As you get more comfortable with programming, you may eventually want to work with the sparse matrix format instead.

In [24]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

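#create our document term matrix as a pandas dataframe
#(on scikit-learn versions before 1.0, use get_feature_names() instead of get_feature_names_out())
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)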
dtm_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
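
Because the real matrix is wide and mostly zeros, a toy version may make its structure easier to see. The sketch below builds a document term matrix by hand for a hypothetical two-document corpus, using only the standard library, and then sums the columns that belong to a hypothetical one-word positive dictionary.

```python
from collections import Counter

docs = ["great songs great band", "dull songs"]  # hypothetical mini-corpus

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word, cells hold counts.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)

# Dictionary counts are row sums over the columns that are dictionary words.
positive_words = ["great"]  # hypothetical one-word dictionary
pos_idx = [i for i, word in enumerate(vocabulary) if word in positive_words]
pos_count = [sum(row[i] for i in pos_idx) for row in dtm]
print(pos_count)
```

`CountVectorizer` does the same bookkeeping, with its own tokenization rules, and stores the result in a sparse format.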


Now we can keep only those *columns* that occur in our positive words list. To do this, we'll first save a list of the column names as a variable, and then only keep the elements of the list that occur in our positive words list. We'll then create a new dataframe keeping only those selected columns.

In [25]:
#create a columns variable that is a list of all column names
columns = list(dtm_df)
columns

['aa',
 'aaaa',
 'aahs',
 'aaliyah',
 'aaron',
 'ab',
 'abandon',
 'abandoned',
 'abandoning',
 'abc',
 'abdullah',
 'abe',
 'aberrant',
 'abhorrent',
 'abides',
 'abilities',
 'ability',
 'ablaze',
 'able',
 'ably',
 'abortively',
 'abound',
 'abounds',
 'about',
 'aboutsonic',
 'above',
 'abovei',
 'abrasive',
 'abrupt',
 'absence',
 'absenceit',
 'absent',
 'absolute',
 'absolutely',
 'absolution',
 'absorb',
 'absorbed',
 'absorbing',
 'absorbs',
 'abstract',
 'abstruse',
 'absurd',
 'absurdities',
 'absurdity',
 'abundance',
 'abundant',
 'abuse',
 'abusers',
 'abusing',
 'abysmal',
 'abyss',
 'ac',
 'accelerate',
 'accelerating',
 'accent',
 'accented',
 'accents',
 'accept',
 'acceptable',
 'accepting',
 'access',
 'accessibility',
 'accessible',
 'accident',
 'accidental',
 'accidentally',
 'acclaim',
 'acclaimed',
 'acclimatise',
 'accompanied',
 'accompaniment',
 'accompanying',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishment',
 'accomplishments',
 'accord',

In [26]:
#create a new variable that contains only column names that are in our positive words list
pos_columns = [word for word in columns if word in positive_words]
pos_columns

['abilities',
 'ability',
 'able',
 'above',
 'abundance',
 'abundant',
 'acceptable',
 'accessible',
 'acclaim',
 'acclaimed',
 'accomplishment',
 'accomplishments',
 'accurately',
 'achievement',
 'achievements',
 'adaptable',
 'adept',
 'adeptly',
 'adherence',
 'admirable',
 'admirably',
 'admiration',
 'adored',
 'advanced',
 'advantage',
 'adventure',
 'adventurous',
 'advice',
 'affection',
 'affectionate',
 'affirmation',
 'affirmative',
 'affordable',
 'afloat',
 'agile',
 'agreeable',
 'allure',
 'alluring',
 'almighty',
 'amazed',
 'amazement',
 'amazing',
 'amazingly',
 'ambitious',
 'amenable',
 'amiable',
 'amicable',
 'amour',
 'ample',
 'amply',
 'amusement',
 'angel',
 'animated',
 'appeal',
 'appealing',
 'appreciation',
 'appropriate',
 'apt',
 'aptitude',
 'aptly',
 'ardent',
 'arresting',
 'articulate',
 'aspiration',
 'aspirations',
 'assertions',
 'assertive',
 'asset',
 'assurance',
 'astonishing',
 'astonishingly',
 'astounded',
 'astounding',
 'astoundingly',


In [27]:
#create a dtm from our dtm_df that keeps only positive sentiment columns
dtm_pos = dtm_df[pos_columns]
dtm_pos

Unnamed: 0,abilities,ability,able,above,abundance,abundant,acceptable,accessible,acclaim,acclaimed,...,wonderous,worth,worthwhile,worthy,wow,wry,yearning,youthful,zeal,zest
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
#count the number of positive words for each document
dtm_pos['pos_count'] = dtm_pos.sum(axis=1)
#dtm_pos.drop('pos_count',axis=1, inplace=True)
dtm_pos['pos_count']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


0        1
1        0
2        0
3        0
4        2
5        4
6        2
7        3
8        1
9        8
10       3
11       1
12       2
13       1
14       2
15       2
16       2
17       2
18       1
19       2
20       1
21       2
22       2
23       2
24       1
25       2
26       3
27       3
28       2
29       0
        ..
4971     3
4972     1
4973     0
4974     3
4975     2
4976     1
4977    11
4978     4
4979     0
4980     3
4981     1
4982     1
4983     2
4984     4
4985     1
4986     2
4987     4
4988     4
4989     4
4990     1
4991     0
4992     1
4993     4
4994     2
4995     1
4996     0
4997     3
4998     2
4999     1
5000     5
Name: pos_count, dtype: int64
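
The message at the top of the output above is pandas's `SettingWithCopyWarning`. It appears because `dtm_pos` was created by selecting columns from `dtm_df`, and pandas cannot tell whether the new column is meant to land in the selection or in the original. One common way to avoid it, sketched here on a hypothetical three-column matrix, is to take an explicit copy before adding columns.

```python
import pandas

# Hypothetical miniature document-term matrix standing in for dtm_df.
dtm_df = pandas.DataFrame({"able": [1, 0], "dull": [0, 2], "great": [2, 0]})
pos_columns = ["able", "great"]

# .copy() makes dtm_pos an independent dataframe, so adding a column is unambiguous.
dtm_pos = dtm_df[pos_columns].copy()
dtm_pos["pos_count"] = dtm_pos.sum(axis=1)

print(dtm_pos)
```

Note that once `pos_count` is a column of `dtm_pos`, calling `dtm_pos.sum(axis=1)` again would include it in the total, so the sum should be computed only once or the column dropped first.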

EX: Do the same for negative words.  
EX: Calculate the proportion of negative and positive words for each document.

In [31]:
neg_columns = [word for word in columns if word in negative_words]
dtm_neg = dtm_df[neg_columns]

dtm_neg['neg_count'] = dtm_neg.sum(axis=1)
dtm_neg['neg_count']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0       0
1       3
2       1
3       1
4       4
5       0
6       0
7       0
8       0
9       0
10      1
11      1
12      1
13      0
14      1
15      2
16      2
17      1
18      1
19      2
20      0
21      0
22      0
23      2
24      1
25      2
26      0
27      1
28      0
29      0
       ..
4971    1
4972    1
4973    2
4974    0
4975    0
4976    2
4977    0
4978    0
4979    0
4980    2
4981    0
4982    0
4983    2
4984    3
4985    0
4986    0
4987    1
4988    2
4989    0
4990    1
4991    0
4992    1
4993    1
4994    2
4995    0
4996    3
4997    0
4998    0
4999    0
5000    0
Name: neg_count, dtype: int64

In [34]:
dtm_pos['pos_proportion'] = dtm_pos['pos_count']/dtm_df.sum(axis=1)
print(dtm_pos['pos_proportion'])
df['pos_prop']

0       0.030303
1       0.000000
2       0.000000
3       0.000000
4       0.046512
5       0.137931
6       0.066667
7       0.187500
8       0.027778
9       0.140351
10      0.142857
11      0.023810
12      0.222222
13      0.166667
14      0.080000
15      0.083333
16      0.066667
17      0.100000
18      0.045455
19      0.040000
20      0.034483
21      0.074074
22      0.071429
23      0.051282
24      0.090909
25      0.066667
26      0.100000
27      0.214286
28      0.095238
29      0.000000
          ...   
4971    0.111111
4972    0.071429
4973    0.000000
4974    0.130435
4975    0.153846
4976    0.040000
4977    0.106796
4978    0.129032
4979    0.000000
4980    0.103448
4981    0.043478
4982    0.100000
4983    0.025974
4984    0.074074
4985    0.066667
4986    0.117647
4987    0.250000
4988    0.065574
4989    0.078431
4990    0.083333
4991    0.000000
4992    0.032258
4993    0.153846
4994    0.083333
4995    0.033333
4996    0.000000
4997    0.187500
4998    0.0952

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


0       0.026316
1       0.000000
2       0.000000
3       0.000000
4       0.039216
5       0.121212
6       0.058824
7       0.150000
8       0.026316
9       0.133333
10      0.150000
11      0.021277
12      0.200000
13      0.125000
14      0.074074
15      0.080000
16      0.064516
17      0.095238
18      0.045455
19      0.037736
20      0.035714
21      0.071429
22      0.066667
23      0.046512
24      0.000000
25      0.058824
26      0.096774
27      0.176471
28      0.095238
29      0.000000
          ...   
4971    0.111111
4972    0.055556
4973    0.000000
4974    0.090909
4975    0.142857
4976    0.035714
4977    0.095238
4978    0.121212
4979    0.000000
4980    0.103448
4981    0.047619
4982    0.090909
4983    0.023529
4984    0.070175
4985    0.062500
4986    0.105263
4987    0.250000
4988    0.061538
4989    0.075472
4990    0.076923
4991    0.000000
4992    0.030303
4993    0.111111
4994    0.083333
4995    0.031250
4996    0.000000
4997    0.176471
4998    0.0800
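
The two sets of proportions above are close but not identical. One likely reason is tokenization: `CountVectorizer`'s default token pattern keeps only runs of two or more word characters, so one-letter tokens such as `a` and fragments such as `'s` are dropped, while the NLTK pipeline keeps them. The denominator is therefore smaller in the document term matrix. The sketch below imitates the default pattern with the standard `re` module; the sample sentence, the hand-written NLTK-style token list, and the assumption of one positive word are all hypothetical.

```python
import re

text = "it's a classic. songs filled with soul"

# Same regular expression as scikit-learn's documented default token_pattern:
# runs of two or more word characters.
sklearn_like = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Hand-written list mirroring the NLTK-style output earlier in this lesson
# (after punctuation removal); an assumption, not produced by NLTK here.
nltk_like = ["it", "'s", "a", "classic", "songs", "filled", "with", "soul"]

pos_num = 1  # suppose exactly one token is a positive word (hypothetical)

print(sklearn_like, len(sklearn_like))
print(pos_num / len(sklearn_like), pos_num / len(nltk_like))
```

With the same numerator, the shorter token list gives the larger proportion, which matches the direction of most of the differences in the two columns above.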