# Dictionary Methods

Dictionary methods are the simplest way to measure the prevalence of a theme in a corpus, and they are used for many purposes, including sentiment analysis. They are among the longest-standing and most widely used methods in automated text analysis, so it's important both to understand the method and to be able to implement it.

The method is simple: it involves grouping words into categories or themes, and then counting the number of words from each theme's word list (or dictionary) in your corpus. We will use this method to do sentiment analysis, a popular text analysis task, on a corpus of Music Reviews, using a standard sentiment analysis dictionary.
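
Before we bring in Pandas and NLTK, the core idea can be sketched in a few lines of plain Python. This is only an illustration: the two word lists are made up (they are not the MPQA lists), the tokenizer is a naive whitespace split, and `theme_proportions` is a hypothetical helper name.

```python
# Toy theme dictionary: both word lists are hypothetical, not MPQA.
theme_words = {
    "positive": {"good", "great", "remarkable"},
    "negative": {"bad", "dull", "disastrous"},
}

def theme_proportions(text, themes):
    """Return, for each theme, the share of tokens that appear in its word list."""
    tokens = text.lower().split()  # naive whitespace tokenizer, for illustration only
    total = len(tokens)
    return {
        theme: sum(token in words for token in tokens) / total
        for theme, words in themes.items()
    }

review = "a remarkable album with one dull track"
proportions = theme_proportions(review, theme_words)
print(proportions)
```

The rest of this notebook does the same thing at scale: tokenize, count dictionary hits, and divide by the number of tokens.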

## Learning Goals
* Understand the intuition behind dictionary methods
* Learn how to implement it via Python Pandas and NLTK
* Get more comfortable combining Python packages together
* Implement a rudimentary sentiment analysis tool and test it on sample data
* Practice applying a weighted dictionary


## Outline
* Part 0: Basic dictionary method
    0. Introduction to dictionary methods
        * Standard dictionaries
        * Custom dictionaries
    1. Pre-processing
    2. Creating dictionary counts
    3. Sentiment analysis using scikit-learn
* Part 1: Weighted dictionary
    0. Read concreteness score dictionary
    1. Merging a DTM with a weighted dictionary
    2. Weight term frequencies by their concreteness score
    3. Calculating an average concreteness score for each text
    4. Assess the difference

## Vocabulary

* *dictionary method*:
    * text analysis method that uses the frequency of key words, grouped into themes, to determine the prevalence of each theme throughout a corpus.
* *standard dictionary*:
    * otherwise known as general dictionaries, dictionaries created by experts and meant to measure general phenomena.
* *custom dictionary*:
    * dictionaries tailored to a specific domain or question. Usually created by the researcher based on the research question.
* *sentiment analysis*:
    * the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
    
## Further Resources


[A Novel Method for Detecting Plot](http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/), Matt Jockers

Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. [“Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-)Response to Economic Inequality.”](http://cdn.equitablegrowth.org/wp-content/uploads/2016/06/29155322/enns-kelly-morgan-witko-econinterests-policyagenda.pdf) Paper presented at the APSA Annual Meetings, San Francisco.
* Outlines the process of creating your own dictionary

[Neal Caren has a tutorial using MPQA](http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-3/), which implements the dictionary method in Python, but in a very different way.

**__________________________________**


# Part 0: Basic dictionary methods

## 0.0 Introduction to dictionary methods

The basic idea behind dictionary methods is that language reflects social position and culture: the cognitive categories through which individuals attend to the world are embedded in the words they use.  Words that are used frequently are cognitively central and reflect what is most on the speaker’s (or writer’s) mind.  Words that are used infrequently or not at all are at the speaker’s cognitive periphery, perhaps even representing uncomfortable or alien concepts.

More practically, dictionary methods are based on the assumption that themes or categories consist of a group of words, and texts that cover that theme will have a higher percentage of that group of words compared to other texts. Dictionary-based analysis is appropriate when the categories and the textual features (words and/or phrases associated with each category) are known and fixed--based on expert knowledge, crowd-sourcing, etc. 

Dictionary methods are used for many purposes. A few possibilities:
* classify text into themes
* measure the *tone* of text
* measure sentiment
* measure psychological processes

There are two forms of dictionaries: standard or general dictionaries, and custom dictionaries.

### Standard dictionaries

There are a number of standard dictionaries that have been created by field experts. The benefit of standardized dictionaries is that they're developed by experts and have been thoroughly validated. Others have likely published using these dictionaries, so reviewers are more likely to accept them as valid. Because of this, they are good options if they fit your research question.

Here are a few:

* [DICTION](http://www.dictionsoftware.com/): a computer-aided text analysis program for determining the tone of a text. It was created by and for organization scholars and political scientists.
    * Main five categories: Certainty, Activity, Optimism, Realism, Commonality
    * 35 sub-categories
    * Allows you to create your own dictionary
    * Proprietary software
* [Linguistic Inquiry and Word Count (LIWC)](http://liwc.wpengine.com/): Created by psychologists, it's meant to capture psychological processes around feelings, personality, and motivations. It's also proprietary.
* [Multi-Perspective Question Answering (MPQA)](http://mpqa.cs.pitt.edu/): A freely available lexicon of positive and negative words, often used as a free alternative to LIWC for sentiment analysis. We will use this dictionary today.
* [Valence Aware Dictionary and sEntiment Reasoner (VADER)](https://github.com/cjhutto/vaderSentiment): a popular and free rule-based sentiment analysis tool tuned to capture sentiment in social media data.
* [Harvard General Inquirer](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm). Multiple categories, including abstract and concrete words. It's free and available online.

However, dictionaries are context-specific, and applying a dictionary outside its original target domain can produce misleading results. A classic example is using the Harvard General Inquirer to classify negative tone in corporate earnings reports, which [leads to serious problems](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2010.01625.x?casa_token=Vl0Os0QtDIQAAAAA%3AaxNRNAj4ajoocjLv_DCJMsS8GEA_jAqb_Z26h5I8g1yW0muShiPs_lglFbpHR-3AU1etsuyjhdN_ahZy). In that setting the term "vice" usually characterizes an executive (as in "vice president") rather than immorality, and "tire" is not negative in automobile industry reports. Some negative words are not captured at all: the terms "litigation" and "unanticipated" are missing from the Harvard list. Enough misclassifications like these can make the results of an analysis wrong, so it is sometimes worth the effort to create a custom dictionary.
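
To make the domain problem concrete, here is a small sketch. Both word lists and the sentence are hypothetical and written only for illustration; they are not taken from the Harvard General Inquirer or from any finance dictionary.

```python
# Hypothetical word lists, written only to illustrate the problem described above.
general_negative = {"vice", "tire", "crude", "loss"}
finance_negative = {"loss", "litigation", "unanticipated"}

sentence = "the vice president reported unanticipated litigation costs at the tire division"
tokens = sentence.split()

# Tokens flagged as negative under each list.
general_hits = [t for t in tokens if t in general_negative]
finance_hits = [t for t in tokens if t in finance_negative]

print(general_hits)
print(finance_hits)
```

The general list flags two words that are neutral in this context and misses the two that carry the negative tone, while the domain list does the reverse.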

### Custom dictionaries

Many research questions or data are domain specific, however, and will thus require you to create your own dictionary based on your own knowledge of the domain and question. Creating your own dictionary requires a lot of thought, and the result must be validated. These dictionaries are typically created in an iterative fashion, and are modified as they are validated. See Enns et al. (2015) or Haber (2020) for examples of how to construct a domain-specific dictionary. 

Today we will use the free and standard sentiment dictionary from MPQA to measure positive and negative sentiment in the music reviews.

Our first step, as with any technique, is pre-processing, to get the data ready for analysis.

## 0.1 Pre-processing

First, read in our Music Reviews corpus as a Pandas dataframe.

In [1]:
#import the necessary packages
import pandas
import nltk
from nltk import word_tokenize
import string

#read the Music Reviews corpus into a Pandas dataframe
df = pandas.read_csv("../day-2/data/BDHSI2016_music_reviews.csv", encoding='utf-8', sep = '\t')

#view the dataframe
df

Unnamed: 0,album,artist,genre,release_date,critic,score,body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...
...,...,...,...,...,...,...,...
4996,Outer South,Conor Oberst And The Mystic Valley Band,Indie,2009-05-05 00:00:00,Slant Magazine,67.0,The result is an album that's unfortunately ba...
4997,On An Island,David Gilmour,Rock,2006-03-07 00:00:00,E! Online,67.0,"In the end, Island makes Dave sound like he's ..."
4998,Movement,Gossip,Indie,2003-05-06 00:00:00,Uncut,81.0,Beth Ditto's remarkable gospel holler and ferv...
4999,Locked Down,Dr. John,Pop/Rock,2012-04-03 00:00:00,PopMatters,86.0,"Dr. John is Dr. John. He's a star, and is on f..."


The next step is to create a new column in our dataset that contains tokenized words with all the pre-processing steps.

In [2]:
#first remove digits from the review text
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
#then create a new column called "body_tokens" and transform to lowercase by applying the string function str.lower()
df['body_tokens'] = df['body'].str.lower()

In [3]:
#tokenize
df['body_tokens'] = df['body_tokens'].apply(nltk.word_tokenize)

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, ,, odd, future, ’, s, odysseus, ...
4       [though, giraffe, is, definitely, echoboy, 's,...
                              ...                        
4996    [the, result, is, an, album, that, 's, unfortu...
4997    [in, the, end, ,, island, makes, dave, sound, ...
4998    [beth, ditto, 's, remarkable, gospel, holler, ...
4999    [dr., john, is, dr., john, ., he, 's, a, star,...
5000    [their, work, ,, especially, that, displayed, ...
Name: body_tokens, Length: 5001, dtype: object


In [4]:
punctuations = list(string.punctuation)

#remove punctuation. Let's talk about that lambda x.
df['body_tokens'] = df['body_tokens'].apply(lambda x: [word for word in x if word not in punctuations])

#view output
print(df['body_tokens'])

0       [while, for, baltimore, proves, they, can, sti...
1       [there, 's, nothing, fake, about, the, purgato...
2       [all, life, 's, disastrous, lows, are, here, o...
3       [with, doris, odd, future, ’, s, odysseus, is,...
4       [though, giraffe, is, definitely, echoboy, 's,...
                              ...                        
4996    [the, result, is, an, album, that, 's, unfortu...
4997    [in, the, end, island, makes, dave, sound, lik...
4998    [beth, ditto, 's, remarkable, gospel, holler, ...
4999    [dr., john, is, dr., john, he, 's, a, star, an...
5000    [their, work, especially, that, displayed, on,...
Name: body_tokens, Length: 5001, dtype: object
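
About that `lambda x`: it is just an unnamed function that `.apply()` calls once per row. The sketch below, on a made-up two-row dataframe, shows that the lambda and an equivalent named function (`drop_punctuation`, a hypothetical name) give the same result.

```python
import string
import pandas

# Toy token lists standing in for df['body_tokens'].
toy = pandas.DataFrame({"body_tokens": [["great", ",", "album", "!"], ["dull", "."]]})
punctuations = set(string.punctuation)

def drop_punctuation(tokens):
    """Named-function equivalent of the lambda used above."""
    return [word for word in tokens if word not in punctuations]

# Both calls run the same filter once per row.
with_lambda = toy["body_tokens"].apply(lambda x: [word for word in x if word not in punctuations])
with_function = toy["body_tokens"].apply(drop_punctuation)

print(with_lambda.tolist())
print(with_function.tolist())
```

A lambda is convenient for one-off transformations; a named function is easier to test and reuse.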


Pre-processing is done. What other pre-processing steps might we use?
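
One candidate is suggested by the output above: the typographic apostrophe `’` survives the punctuation filter, because `string.punctuation` only contains ASCII characters. A possible alternative, sketched here on a made-up token list, is to keep only tokens that contain at least one letter or digit. Other common steps, not shown here, include removing stop words and stemming.

```python
import string

tokens = ["with", "doris", "odd", "future", "’", "s", "odysseus", "“", "”"]
punctuations = set(string.punctuation)

# string.punctuation is ASCII-only, so this filter keeps the curly quotes.
ascii_filtered = [t for t in tokens if t not in punctuations]

# Alternative: keep only tokens that contain at least one letter or digit.
alnum_filtered = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(len(ascii_filtered), len(alnum_filtered))
print(alnum_filtered)
```

Note that this rule keeps tokens such as `'s` and `dr.`, since they contain letters; whether that is desirable depends on the dictionary you plan to apply.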

One more step before getting to the dictionary method. We want a total token count for each row, so we can normalize the dictionary counts. To do this we simply create a new column that contains the length of the token list in each row.

In [5]:
df['token_count'] = df['body_tokens'].apply(len)

print(df[['body_tokens','token_count']])

                                            body_tokens  token_count
0     [while, for, baltimore, proves, they, can, sti...           38
1     [there, 's, nothing, fake, about, the, purgato...           28
2     [all, life, 's, disastrous, lows, are, here, o...           13
3     [with, doris, odd, future, ’, s, odysseus, is,...           18
4     [though, giraffe, is, definitely, echoboy, 's,...           51
...                                                 ...          ...
4996  [the, result, is, an, album, that, 's, unfortu...           27
4997  [in, the, end, island, makes, dave, sound, lik...           17
4998  [beth, ditto, 's, remarkable, gospel, holler, ...           25
4999  [dr., john, is, dr., john, he, 's, a, star, an...           18
5000  [their, work, especially, that, displayed, on,...           28

[5001 rows x 2 columns]


## 0.2. Creating dictionary counts

I created two text files: one is a list of positive words from the MPQA dictionary, the other is a list of negative words, with one word per line. Our goal here is to count the number of positive and negative words in each row of our dataframe, and add two columns to our dataset with the count of positive and negative words.

First, read in the positive and negative words and create list variables for each.

In [6]:
pos_sent = open("../day-2/data/positive_words.txt", encoding='utf-8').read()
neg_sent = open("../day-2/data/negative_words.txt", encoding='utf-8').read()

#view part of the pos_sent variable, to see how it's formatted.
print(pos_sent[:101])

abidance
abidance
abilities
ability
able
above
above-average
abundant
abundance
acceptance
acceptable


In [7]:
#remember the split function? We'll split on the newline character (\n) to create a list
positive_words=pos_sent.split('\n')
negative_words=neg_sent.split('\n')

#view the first elements in the lists
print(positive_words[:10])
print(negative_words[:10])
positive_words

['abidance', 'abidance', 'abilities', 'ability', 'able', 'above', 'above-average', 'abundant', 'abundance', 'acceptance']
['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred', 'abhorrence', 'abhorrent', 'abhorrently', 'abhors', 'abhors']


['abidance',
 'abidance',
 'abilities',
 'ability',
 'able',
 'above',
 'above-average',
 'abundant',
 'abundance',
 'acceptance',
 'acceptable',
 'accessible',
 'acclaim',
 'acclaimed',
 'accolade',
 'accolades',
 'accommodative',
 'accomplishment',
 'accomplishments',
 'accordance',
 'accordantly',
 'accurate',
 'accurately',
 'achievable',
 'achievement',
 'achievements',
 'acknowledgement',
 'active',
 'acumen',
 'adaptable',
 'adaptability',
 'adaptive',
 'adept',
 'adeptly',
 'adequate',
 'adherence',
 'adherent',
 'adhesion',
 'admirable',
 'admirer',
 'admirable',
 'admirably',
 'admiration',
 'admiring',
 'admiringly',
 'admission',
 'admission',
 'adorable',
 'adored',
 'adorer',
 'adoring',
 'adoringly',
 'adroit',
 'adroitly',
 'adulatory',
 'advanced',
 'advantage',
 'advantage',
 'advantageous',
 'advantages',
 'advantages',
 'adventure',
 'adventure',
 'adventuresome',
 'adventurism',
 'adventurous',
 'advice',
 'advice',
 'advisable',
 'advocacy',
 'affable',
 'affabili

In [8]:
#count number of words in each list
print(len(positive_words))
print(len(negative_words))

2231
3906
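
Two details are worth noticing in these lists: some words appear more than once (for example `abidance`), and splitting on `\n` leaves an empty string at the end if the file ends with a newline. Neither changes the token counts below, but both inflate the list lengths, and looking words up in a long list is slow. A possible clean-up, sketched on a short sample of the file contents shown above, is to convert each list to a set.

```python
# A short stand-in for the start of positive_words.txt, assuming a trailing newline.
pos_sent = "abidance\nabidance\nabilities\nability\nable\n"

as_list = pos_sent.split("\n")
# Strip whitespace, drop empty strings, and remove duplicates in one step.
as_set = {word.strip() for word in pos_sent.splitlines() if word.strip()}

print(len(as_list))
print(len(as_set))
```

Membership tests such as `word in as_set` take roughly constant time, whereas `word in as_list` scans the list from the start each time.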


Great! You know what to do now.

### Challenge
1. Create a column with the number of positive words, and another with the proportion of positive words
2. Create a column with the number of negative words, and another with the proportion of negative words
3. Print the average proportion of negative and positive words by genre
4. Compare this to the average score by genre

In [10]:
df['pos_num'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in positive_words]))
df['neg_num'] = df['body_tokens'].apply(lambda x: len([word for word in x if word in negative_words]))

df['pos_prop'] = df['pos_num']/df['token_count']
df['neg_prop'] = df['neg_num']/df['token_count']
df.drop('release_date', axis=1, inplace=True)
df

Unnamed: 0,album,artist,genre,critic,score,body,body_tokens,token_count,pos_num,neg_num,pos_prop,neg_prop
0,Don't Panic,All Time Low,Pop/Rock,Kerrang!,74.0,While For Baltimore proves they can still writ...,"[while, for, baltimore, proves, they, can, sti...",38,1,0,0.026316,0.000000
1,Fear and Saturday Night,Ryan Bingham,Country,Uncut,70.0,There's nothing fake about the purgatorial nar...,"[there, 's, nothing, fake, about, the, purgato...",28,0,3,0.000000,0.107143
2,The Way I'm Livin',Lee Ann Womack,Country,Q Magazine,84.0,All life's disastrous lows are here on a caree...,"[all, life, 's, disastrous, lows, are, here, o...",13,0,1,0.000000,0.076923
3,Doris,Earl Sweatshirt,Rap,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b...","[with, doris, odd, future, ’, s, odysseus, is,...",18,0,1,0.000000,0.055556
4,Giraffe,Echoboy,Rock,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...,"[though, giraffe, is, definitely, echoboy, 's,...",51,2,4,0.039216,0.078431
...,...,...,...,...,...,...,...,...,...,...,...,...
4996,Outer South,Conor Oberst And The Mystic Valley Band,Indie,Slant Magazine,67.0,The result is an album that's unfortunately ba...,"[the, result, is, an, album, that, 's, unfortu...",27,0,3,0.000000,0.111111
4997,On An Island,David Gilmour,Rock,E! Online,67.0,"In the end, Island makes Dave sound like he's ...","[in, the, end, island, makes, dave, sound, lik...",17,3,0,0.176471,0.000000
4998,Movement,Gossip,Indie,Uncut,81.0,Beth Ditto's remarkable gospel holler and ferv...,"[beth, ditto, 's, remarkable, gospel, holler, ...",25,2,0,0.080000,0.000000
4999,Locked Down,Dr. John,Pop/Rock,PopMatters,86.0,"Dr. John is Dr. John. He's a star, and is on f...","[dr., john, is, dr., john, he, 's, a, star, an...",18,1,0,0.055556,0.000000


In [11]:
grouped = df.groupby('genre')
grouped['pos_prop'].mean().sort_values(ascending=False)

genre
Folk                      0.096497
Jazz                      0.087290
Indie                     0.085293
Alternative/Indie Rock    0.080572
Rock                      0.080200
Electronic                0.078628
Dance                     0.078059
Pop/Rock                  0.077627
R&B;                      0.074498
Country                   0.072140
Rap                       0.070954
Pop                       0.069679
Name: pos_prop, dtype: float64

In [12]:
grouped['score'].mean().sort_values(ascending=False)

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B;                      72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64

That's the dictionary method! You can do this with any dictionary you want, whether a standard one or one you create yourself.
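
To compare the dictionary measure with the critics' scores more directly, one option is to put the genre averages side by side and compute a rank correlation. The sketch below uses made-up numbers with the same column names as our dataframe; the choice of a rank correlation is an assumption, not part of the original analysis.

```python
import pandas

# Toy data with the same column names as the dataframe above; the values are made up.
toy = pandas.DataFrame({
    "genre": ["Folk", "Folk", "Jazz", "Jazz", "Pop", "Pop"],
    "pos_prop": [0.12, 0.10, 0.10, 0.08, 0.06, 0.07],
    "neg_prop": [0.02, 0.02, 0.02, 0.03, 0.05, 0.04],
    "score": [70.0, 74.0, 80.0, 76.0, 65.0, 63.0],
})

# One row per genre, with the mean of each measure side by side.
by_genre = toy.groupby("genre")[["pos_prop", "neg_prop", "score"]].mean()

# Rank correlation: rank both columns, then take the ordinary correlation of the ranks.
rank_corr = by_genre["pos_prop"].rank().corr(by_genre["score"].rank())

print(by_genre)
print(rank_corr)
```

A value near 1 would mean genres with more positive vocabulary also tend to get higher scores; with only a dozen genres, any such number should be read cautiously.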

## 0.3. Sentiment analysis using scikit-learn

We can also do this using the document term matrix. We'll do this in pandas, to make it conceptually clear. As you get more comfortable with programming, you may want to shift to working with the sparse matrix format, which uses far less memory for large corpora.
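
For reference, here is a rough sketch of what the sparse route might look like. The documents and the word list are toy examples, and this is one possible approach rather than the method used in the rest of this section: it selects the dictionary columns by index and sums them without ever building a dense dataframe.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents and a toy positive word list (both hypothetical).
docs = ["a remarkable and great album", "a dull album", "great great songs"]
positive_words = {"great", "remarkable", "wonderful"}

countvec = CountVectorizer()
dtm = countvec.fit_transform(docs)  # scipy sparse matrix, documents x vocabulary

# Column indices of the vocabulary terms that are in the positive word list.
pos_idx = [idx for word, idx in countvec.vocabulary_.items() if word in positive_words]

# Row sums over the selected columns and over all columns, flattened to 1-D arrays.
pos_count = np.asarray(dtm[:, pos_idx].sum(axis=1)).ravel()
total_count = np.asarray(dtm.sum(axis=1)).ravel()
pos_prop = pos_count / total_count

print(pos_count)
print(pos_prop)
```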

In [13]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

#create our document term matrix as a pandas dataframe
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)
dtm_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we can keep only those *columns* that occur in our positive words list. To do this, we'll first save a list of the column names as a variable, and then only keep the elements of the list that occur in our positive words list. We'll then create a new dataframe keeping only those selected columns.

In [14]:
#create a columns variable that is a list of all column names
columns = list(dtm_df)
columns

['aa',
 'aaaa',
 'aahs',
 'aaliyah',
 'aaron',
 'ab',
 'abandon',
 'abandoned',
 'abandoning',
 'abc',
 'abdullah',
 'abe',
 'aberrant',
 'abhorrent',
 'abides',
 'abilities',
 'ability',
 'ablaze',
 'able',
 'ably',
 'abortively',
 'abound',
 'abounds',
 'about',
 'aboutsonic',
 'above',
 'abovei',
 'abrasive',
 'abrupt',
 'absence',
 'absenceit',
 'absent',
 'absolute',
 'absolutely',
 'absolution',
 'absorb',
 'absorbed',
 'absorbing',
 'absorbs',
 'abstract',
 'abstruse',
 'absurd',
 'absurdities',
 'absurdity',
 'abundance',
 'abundant',
 'abuse',
 'abusers',
 'abusing',
 'abysmal',
 'abyss',
 'ac',
 'accelerate',
 'accelerating',
 'accent',
 'accented',
 'accents',
 'accept',
 'acceptable',
 'accepting',
 'access',
 'accessibility',
 'accessible',
 'accident',
 'accidental',
 'accidentally',
 'acclaim',
 'acclaimed',
 'acclimatise',
 'accompanied',
 'accompaniment',
 'accompanying',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishment',
 'accomplishments',
 'accord',

In [15]:
#create a new variable that contains only column names that are in our positive words list
pos_columns = [word for word in columns if word in positive_words]
pos_columns

['abilities',
 'ability',
 'able',
 'above',
 'abundance',
 'abundant',
 'acceptable',
 'accessible',
 'acclaim',
 'acclaimed',
 'accomplishment',
 'accomplishments',
 'accurately',
 'achievement',
 'achievements',
 'adaptable',
 'adept',
 'adeptly',
 'adherence',
 'admirable',
 'admirably',
 'admiration',
 'adored',
 'advanced',
 'advantage',
 'adventure',
 'adventurous',
 'advice',
 'affection',
 'affectionate',
 'affirmation',
 'affirmative',
 'affordable',
 'afloat',
 'agile',
 'agreeable',
 'allure',
 'alluring',
 'almighty',
 'amazed',
 'amazement',
 'amazing',
 'amazingly',
 'ambitious',
 'amenable',
 'amiable',
 'amicable',
 'amour',
 'ample',
 'amply',
 'amusement',
 'angel',
 'animated',
 'appeal',
 'appealing',
 'appreciation',
 'appropriate',
 'apt',
 'aptitude',
 'aptly',
 'ardent',
 'arresting',
 'articulate',
 'aspiration',
 'aspirations',
 'assertions',
 'assertive',
 'asset',
 'assurance',
 'astonishing',
 'astonishingly',
 'astounded',
 'astounding',
 'astoundingly',


In [16]:
#create a dtm from our dtm_df that keeps only positive sentiment columns
dtm_pos = dtm_df[pos_columns]
dtm_pos

Unnamed: 0,abilities,ability,able,above,abundance,abundant,acceptable,accessible,acclaim,acclaimed,...,wonderous,worth,worthwhile,worthy,wow,wry,yearning,youthful,zeal,zest
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
#count the number of positive words for each document
#assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
dtm_pos = dtm_pos.assign(pos_count=dtm_pos.sum(axis=1))
dtm_pos['pos_count']

0       1
1       0
2       0
3       0
4       2
       ..
4996    0
4997    3
4998    2
4999    1
5000    5
Name: pos_count, Length: 5001, dtype: int64

### Challenge
1. Do the same for negative words.  
2. Calculate the proportion of negative and positive words for each document.

In [19]:
neg_columns = [word for word in columns if word in negative_words]
dtm_neg = dtm_df[neg_columns]

#assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
dtm_neg = dtm_neg.assign(neg_count=dtm_neg.sum(axis=1))
dtm_neg['neg_count']

0       0
1       3
2       1
3       1
4       4
       ..
4996    3
4997    0
4998    0
4999    0
5000    0
Name: neg_count, Length: 5001, dtype: int64

In [20]:
#assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
dtm_pos = dtm_pos.assign(pos_proportion=dtm_pos['pos_count']/dtm_df.sum(axis=1))
print(dtm_pos['pos_proportion'])
#compare with the proportions calculated in section 0.2
df['pos_prop']

0       0.030303
1       0.000000
2       0.000000
3       0.000000
4       0.046512
          ...   
4996    0.000000
4997    0.187500
4998    0.095238
4999    0.062500
5000    0.178571
Name: pos_proportion, Length: 5001, dtype: float64

0       0.026316
1       0.000000
2       0.000000
3       0.000000
4       0.039216
          ...   
4996    0.000000
4997    0.176471
4998    0.080000
4999    0.055556
5000    0.178571
Name: pos_prop, Length: 5001, dtype: float64
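
The two sets of proportions are close but not identical (for document 4999, 0.062500 versus 0.055556). The positive counts match, so the difference comes from the denominator: by default `CountVectorizer` lowercases the text and keeps only runs of two or more word characters, so one-character tokens such as `a` or the `s` in `he's` are not counted. The sketch below reproduces that default token pattern with the standard library on a made-up sentence; the whitespace split is only a rough stand-in for a tokenizer that keeps short tokens.

```python
import re

text = "He's a star, and it's a hit."

# Default CountVectorizer behaviour, reproduced with the standard library:
# lowercase the text, then keep runs of two or more word characters.
sklearn_style = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Whitespace tokens, as a rough comparison that keeps one-character words.
whitespace_style = text.lower().split()

print(sklearn_style)
print(len(sklearn_style), len(whitespace_style))
```

Whichever tokenizer you choose, use the same one for the numerator and the denominator, and report it alongside your results.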

# Part 1: Weighting dictionaries

Next we'll use a weighted dictionary to compare the relative average concreteness of the words used in Austen's *Pride and Prejudice* versus Alcott's *A Garland for Girls*. A weighted dictionary indicates not only whether a phrase is associated with a category, but *how strongly* it is associated with that category. In this approach, a dictionary is a list of weighted words.

This could be done using a regular dictionary: a list of concrete and abstract words. Instead, we'll use a crowdsourced dictionary that provides an average "concreteness score" for a large number of English words.
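
As a preview of the idea, here is a minimal sketch in plain Python. The weights are hypothetical values on a 1 to 5 scale, and `mean_weight` is an illustrative helper, not part of any library: it averages the dictionary scores over the tokens the dictionary covers, counting each word as often as it occurs.

```python
from collections import Counter

# Hypothetical weights on a 1 (abstract) to 5 (concrete) scale, for illustration only.
concreteness = {"umbrella": 5.0, "basket": 5.0, "walk": 4.0, "hope": 1.5}

def mean_weight(tokens, weights):
    """Frequency-weighted mean score over the tokens that appear in the dictionary."""
    counts = Counter(t for t in tokens if t in weights)
    total = sum(counts.values())
    if total == 0:
        return None  # no token is covered by the dictionary
    return sum(weights[word] * n for word, n in counts.items()) / total

tokens = "hope hope umbrella walk the".split()
score = mean_weight(tokens, concreteness)
print(score)
```

The sections below do the same calculation with a document term matrix and a merge, which scales better to whole novels.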

## 1.1 Read concreteness score dictionary

First we'll create a pandas dataframe from the concreteness score dictionary, saved on our hard drive in the form of a .csv file.

This dictionary comes from work by [Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman.](https://link.springer.com/article/10.3758/s13428-013-0403-5) In summary:

    The authors obtained concreteness ratings for 37,058 English words and 2,896 two-word expressions (such as zebra crossing and zoom in), by means of a norming study using Internet crowdsourcing for data collection. Over 4,000 participants rated words on a 5-point concreteness scale, from 1 (very abstract) to 5 (very concrete). They define concrete words as words you can experience through the senses, and abstract words as words that you cannot experience through the senses. They provide the average concreteness score and the standard deviation for each word.

Let's read in the data.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

con_score = pandas.read_csv('../day-2/data/Concreteness_ratings_Brysbaert_et_al.csv')
con_score

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos
0,a,0,1.46,1.14,2,30,0.93,1041179,Article
1,a cappella,1,2.92,1.44,3,29,0.90,0,Err:512
2,aardvark,0,4.68,0.86,0,28,1.00,21,Noun
3,aback,0,1.65,1.07,4,27,0.85,15,Adverb
4,abacus,0,4.52,1.12,2,29,0.93,12,Noun
...,...,...,...,...,...,...,...,...,...
39949,zoom,0,3.10,1.49,0,30,1.00,181,Verb
39950,zoom in,1,3.57,1.40,0,28,1.00,0,Err:512
39951,zoom lens,1,4.81,0.49,1,27,0.96,0,Err:512
39952,zoophobia,0,2.04,1.02,2,25,0.92,0,Err:512


We can see the most concrete and most abstract words by sorting on `Conc.M`.

In [23]:
con_score[['Word','Conc.M']].sort_values(by='Conc.M',ascending=False)

Unnamed: 0,Word,Conc.M
2547,bat,5.00
10689,eagle,5.00
30740,shawl,5.00
36046,umbrella,5.00
2526,basket,5.00
...,...,...
39703,would,1.12
32378,spirituality,1.07
941,although,1.07
10905,eh,1.04


In [24]:
con_score[['Word','Conc.M']].sort_values(by='Conc.M',ascending=True)

Unnamed: 0,Word,Conc.M
10905,eh,1.04
11618,essentialness,1.04
32378,spirituality,1.07
941,although,1.07
39703,would,1.12
...,...,...
25452,pick-up truck,5.00
6160,comb,5.00
6476,computer mouse,5.00
7132,cookie,5.00


## 1.2. Merging a DTM with a weighted dictionary

The goal is to merge this score with our document term matrix, so we can calculate the average concreteness score for our texts.

To do this, we'll first create the DTM from our two novels, transpose this matrix, and merge it with the dataframe created above. We'll merge on the column 'Word'.

First, create the DTM.

In [25]:
text_list = []
#open and read the novels, save them as variables
austen_string = open('../day-2/data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../day-2/data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)

countvec = CountVectorizer(stop_words="english")

novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,york,young,younge,younger,youngest,youngsters,youth,youthful,youths,zip
0,0,0,1,2,0,1,0,0,1,0,...,1,129,4,29,14,0,9,0,1,0
1,1,1,1,0,2,0,1,1,0,1,...,2,109,0,7,2,1,9,1,3,1


Next, we'll take a subset of the DTM, keeping only the intersection between the words in our corpus and the words in the dictionary.

In [None]:
columns=list(novels_df)
#build a set of the dictionary words once, so each membership test is fast
con_words = set(con_score['Word'])
columns_con = [word for word in columns if word in con_words]
columns_con[:10]

In [None]:
novels_df_con = novels_df[columns_con]
novels_df_con 

Next, transpose the matrix, rename the column, and merge with the dictionary dataframe.

In [None]:
df = novels_df_con.transpose()
df

In [None]:
df.rename(columns={0: 'Austen', 1: 'Alcott'}, inplace=True)
df

In [None]:
#Rename the index 'Word', and reset the index, so the words become a column in our dataframe and we get a new index.
df.index.names = ['Word']
df.reset_index(inplace=True)

df

In [None]:
#merge with our dictionary dataframe, called 'con_score'
df = df.merge(con_score, on = 'Word')
df
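
Only words that appear in the dictionary survive this step, so it can be useful to check how much of each text the dictionary actually covers. The sketch below shows one way to do that on toy data: the counts, the scores, and the word `pemberley` are made-up stand-ins. It uses a left merge with `indicator=True`, which keeps every corpus word and records whether a match was found.

```python
import pandas

# Toy stand-ins: word counts per novel (already transposed) and a tiny dictionary.
counts = pandas.DataFrame({"Word": ["basket", "hope", "pemberley"],
                           "Austen": [2, 10, 30],
                           "Alcott": [5, 8, 0]})
con_score = pandas.DataFrame({"Word": ["basket", "hope"], "Conc.M": [5.0, 1.5]})

# A left merge keeps every corpus word; indicator=True adds a '_merge' column
# recording whether each row was found in the dictionary.
merged = counts.merge(con_score, on="Word", how="left", indicator=True)
covered = merged["_merge"] == "both"

# Share of each novel's word occurrences that the dictionary covers.
coverage = merged.loc[covered, ["Austen", "Alcott"]].sum() / merged[["Austen", "Alcott"]].sum()
print(coverage)
```

If coverage differs a lot between two texts, a difference in their average scores may partly reflect which words were scored rather than how concrete the writing is.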

## 1.3. Weighting term frequencies by the concreteness score

Now we can weight the term frequency cells by the concreteness score, by multiplying the frequency count column by the concreteness score column.

In [None]:
df['austen_con_score'] = df['Austen'] * df['Conc.M']
df

In [None]:
df['alcott_con_score'] = df['Alcott'] * df['Conc.M']
df

### Challenge

Calculate and print the average concreteness score for each text. Careful! Think through this before you implement it. You want the average score, normalized over all the words in the text. 

In [None]:
#we'll divide the sum of the concreteness score by the total word count for each novel
print("Mean Concreteness for Austen's 'Pride and Prejudice'")
print(df['austen_con_score'].sum()/df['Austen'].sum())
print()
print("Mean Concreteness for Alcott's 'A Garland for Girls'")
print(df['alcott_con_score'].sum()/df['Alcott'].sum())

## 1.4. Assessing the difference

So there is a difference, but what does it mean? What is the magnitude of the difference?

We can look at the difference between the two means as a percent difference based on the scale range. We can calculate this using simple math.

In [None]:
#first find the difference between the means by subtracting one from the other
3.1534507874-2.78328905828

In [None]:
#Find the range of concreteness scores
print(df['Conc.M'].min())
print(df['Conc.M'].max())

In [None]:
#The scale range
df['Conc.M'].max() - df['Conc.M'].min()

In [None]:
#Calculate the difference of means as a percent of this range
(0.37/3.83)* 100
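
The cells above type the numbers in by hand. The same arithmetic can be wrapped in a small function so that nothing needs to be copied and rounded manually. `percent_of_range` is a hypothetical helper name, and the three inputs are simply the values hard-coded in the cells above.

```python
def percent_of_range(mean_a, mean_b, scale_range):
    """Absolute difference between two means, as a percentage of the scale range."""
    return abs(mean_a - mean_b) / scale_range * 100

# The two means and the range below are the values hard-coded in the cells above.
pct = percent_of_range(3.1534507874, 2.78328905828, 3.83)
print(round(pct, 2))
```

In the notebook itself you could pass the computed means and `df['Conc.M'].max() - df['Conc.M'].min()` instead of literal numbers.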

### Challenge
Print the most concrete and abstract terms in Austen and in Alcott.  
*Hint:* You can't simply sort on the column `austen_con_score` and so on. Why not? What are your next steps?

In [None]:
#Create a new dataframe that keeps only words that have a non-zero value in Alcott
df_alcott = df[df['Alcott']>0]
#Sort on 'Conc.M' and print in descending order for most concrete words
df_alcott[['Word', 'Conc.M', 'Alcott']].sort_values(by=['Conc.M', 'Alcott'], ascending = False)

In [None]:
#Create a new dataframe that keeps only words that have a non-zero value in Austen
df_austen = df[df['Austen']>0]
df_austen[['Word', 'Conc.M', 'Austen']].sort_values(by=['Conc.M', 'Austen'], ascending = False)
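
On the hint above: the weighted columns multiply frequency by concreteness, so sorting on them mostly ranks words by how often they occur. The sketch below illustrates this with made-up counts and scores, using the same column names as our dataframe.

```python
import pandas

# Toy rows with made-up counts and scores, using the column names from above.
toy = pandas.DataFrame({
    "Word": ["said", "umbrella", "hope"],
    "Austen": [400, 1, 50],
    "Conc.M": [2.0, 5.0, 1.5],
})
toy["austen_con_score"] = toy["Austen"] * toy["Conc.M"]

# Sorting on the weighted column mostly ranks by frequency...
by_weighted = toy.sort_values("austen_con_score", ascending=False)["Word"].tolist()
# ...while sorting on the score itself ranks by concreteness.
by_score = toy.sort_values("Conc.M", ascending=False)["Word"].tolist()

print(by_weighted)
print(by_score)
```

This is why the solution filters to words with a non-zero count and then sorts on `Conc.M`, using the count only to break ties.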