# Using Weighted Dictionaries

On Monday we applied dictionaries that consisted of lists of words associated with a category. We worked through the example of counting positive emotion words and negative emotion words in our music reviews dataset. This served as a rudimentary sentiment analysis tool.

In today's reading, the authors use lists of phrases associated with categories (in their case, Democrats and Republicans). Rather than simply indicating whether a phrase was associated with a category, they recorded *how strongly* it was associated with that category. In other words, their dictionary was a list of weighted words.

Today we'll use a weighted dictionary to compare the relative average concreteness of the words used in Austen's *Pride and Prejudice* versus Alcott's *A Garland for Girls*.

This could be done using a regular dictionary: a list of concrete and abstract words. Instead, we'll use a crowdsourced dictionary that provides an average "concreteness score" for a large number of English words.
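
To make the distinction concrete, here is a minimal sketch in plain Python. The counts and scores are toy values chosen for illustration (they are not taken from the real dictionary), and the names `toy_counts`, `concrete_words`, and `toy_scores` are hypothetical.

```python
# Toy word counts for a single short text (illustrative values only).
toy_counts = {"house": 3, "idea": 1, "water": 2}

# Regular dictionary: a word is either in the category or it is not.
concrete_words = {"house", "water"}
n_concrete = sum(count for word, count in toy_counts.items() if word in concrete_words)

# Weighted dictionary: every word carries a score (made-up scores on a 1-5 scale).
toy_scores = {"house": 5.0, "idea": 1.5, "water": 5.0}
weighted_sum = sum(count * toy_scores[word] for word, count in toy_counts.items())
total_words = sum(toy_counts.values())
mean_score = weighted_sum / total_words

print(n_concrete)   # number of tokens that are "concrete" words
print(mean_score)   # frequency-weighted average score for the text
```

The regular dictionary can only tell us how many concrete words appear. The weighted dictionary lets every word contribute in proportion to both its frequency and its score.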

### Outline

1. Weighted dictionary
2. Merging a DTM with a weighted dictionary
    * Term-Document Matrix (transposed DTM)
    * Merging dataframes
3. Weight the term frequencies by their concreteness score
4. Calculating an average concreteness score for each text
5. Assessing the difference

### 1. Read a concreteness score dictionary

First we'll create a pandas dataframe from the concreteness score dictionary, saved on our hard drive in the form of a .csv file.

This dictionary comes from work by [Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman.](https://link.springer.com/article/10.3758/s13428-013-0403-5) In summary:

The authors obtained concreteness ratings for 37,058 English words and 2,896 two-word expressions (such as "zebra crossing" and "zoom in") by means of a norming study that used Internet crowdsourcing for data collection. Over 4,000 participants rated words on a 5-point concreteness scale, from 1 (very abstract) to 5 (very concrete). The authors define concrete words as words you can experience through the senses, and abstract words as words you cannot experience through the senses. They provide the average concreteness score and the standard deviation for each word.

Let's read in the data.

In [223]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer

con_score = pandas.read_csv('../data/Concreteness_ratings_Brysbaert_et_al.csv')
con_score

Unnamed: 0,Word,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos
0,a,0,1.46,1.14,2,30,0.93,1041179,Article
1,a cappella,1,2.92,1.44,3,29,0.90,0,Err:512
2,aardvark,0,4.68,0.86,0,28,1.00,21,Noun
3,aback,0,1.65,1.07,4,27,0.85,15,Adverb
4,abacus,0,4.52,1.12,2,29,0.93,12,Noun
5,abandon,0,2.54,1.45,1,27,0.96,413,Verb
6,abandoned,0,2.52,1.27,0,29,1.00,678,Verb
7,abandonee,0,2.92,1.28,4,28,0.86,0,Err:512
8,abandoner,0,2.50,1.50,2,30,0.93,1,Noun
9,abandonment,0,2.54,1.29,0,28,1.00,49,Noun


We can see the most concrete and most abstract words by sorting on 'Conc.M'.

In [224]:
con_score[['Word','Conc.M']].sort_values(by='Conc.M',ascending=False)

Unnamed: 0,Word,Conc.M
2547,bat,5.00
10689,eagle,5.00
30740,shawl,5.00
36046,umbrella,5.00
2526,basket,5.00
22561,nail polish,5.00
22562,nail scissors,5.00
22563,nailbrush,5.00
30604,sewing machine,5.00
22701,neck,5.00


In [225]:
con_score[['Word','Conc.M']].sort_values(by='Conc.M',ascending=True)

Unnamed: 0,Word,Conc.M
10905,eh,1.04
11618,essentialness,1.04
32378,spirituality,1.07
941,although,1.07
39703,would,1.12
32381,spiritually,1.14
39087,whatsoever,1.17
6520,conceptualistic,1.18
7075,conventionalism,1.18
16971,if,1.19


### 2. Merging a DTM with a weighted dictionary

The goal is to merge this score with our document term matrix, so we can calculate the average concreteness score for our texts.

To do this, we'll first create the DTM from our two novels, transpose this matrix, and merge it with the dataframe created above. We'll merge on the column 'Word'.
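
Before working with the real data, the plan above can be sketched on a toy example. Everything here is an assumption made for illustration: the frames `toy_dtm` and `toy_dict`, the column names `TextA` and `TextB`, and all counts and scores are made up.

```python
import pandas

# Toy document-term matrix: one row per text, one column per word (made-up counts).
toy_dtm = pandas.DataFrame({"house": [2, 1], "idea": [0, 3], "zzz": [1, 0]})

# Toy weighted dictionary with made-up scores; 'zzz' is deliberately missing.
toy_dict = pandas.DataFrame({"Word": ["house", "idea", "tree"], "Conc.M": [5.0, 1.5, 5.0]})

# Transpose so each word is a row, and give the text columns readable names.
toy_tdm = toy_dtm.transpose().rename(columns={0: "TextA", 1: "TextB"})

# Move the words from the index into a column called 'Word'.
toy_tdm.index.name = "Word"
toy_tdm = toy_tdm.reset_index()

# merge() defaults to an inner join: only words present in both frames survive.
toy_merged = toy_tdm.merge(toy_dict, on="Word")
print(toy_merged)
```

Because the default merge is an inner join, 'zzz' (not in the dictionary) and 'tree' (not in the texts) both drop out of the result.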

First, create the DTM.

In [226]:
text_list = []
#open and read the novels, save them as variables
austen_string = open('../data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)

countvec = CountVectorizer(stop_words="english")

#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Unnamed: 0,000,1500,15th,1813,1887,18th,20,2001,26th,30,...,york,young,younge,younger,youngest,youngsters,youth,youthful,youths,zip
0,0,0,1,2,0,1,0,0,1,0,...,1,129,4,29,14,0,9,0,1,0
1,1,1,1,0,2,0,1,1,0,1,...,2,109,0,7,2,1,9,1,3,1


As on Monday, we'll next take a subset of the DTM, keeping only the intersection between the words in our corpus and the words in the dictionary.

In [227]:
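columns=list(novels_df)
#build a set of the dictionary words once, so that each membership test is fast
con_words = set(con_score['Word'])
columns_con = [word for word in columns if word in con_words]
columns_con[:10]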

['aback',
 'abatement',
 'abhorrence',
 'abide',
 'abiding',
 'able',
 'aboard',
 'abode',
 'abominable',
 'abominably']

In [228]:
novels_df_con = novels_df[columns_con]
novels_df_con 

Unnamed: 0,aback,abatement,abhorrence,abide,abiding,able,aboard,abode,abominable,abominably,...,yes,yesterday,yield,yielding,yonder,young,younger,youth,youthful,zip
0,0,1,6,1,1,54,0,8,6,4,...,76,13,4,3,0,129,29,9,0,0
1,1,0,0,2,0,26,2,0,0,0,...,30,1,0,0,1,109,7,9,1,1


Next, transpose the matrix, rename the column, and merge with the dictionary dataframe.

In [229]:
df = novels_df_con.transpose()
df

Unnamed: 0,0,1
aback,0,1
abatement,1,0
abhorrence,6,0
abide,1,2
abiding,1,0
able,54,26
aboard,0,2
abode,8,0
abominable,6,0
abominably,4,0


In [230]:
df.rename(columns={0: 'Austen', 1: 'Alcott'}, inplace=True)
df

Unnamed: 0,Austen,Alcott
aback,0,1
abatement,1,0
abhorrence,6,0
abide,1,2
abiding,1,0
able,54,26
aboard,0,2
abode,8,0
abominable,6,0
abominably,4,0


In [231]:
#Rename the index 'Word', and reset the index, so the words become a column in our dataframe and we get a new index.
df.index.names = ['Word']
df.reset_index(inplace=True)

df

Unnamed: 0,Word,Austen,Alcott
0,aback,0,1
1,abatement,1,0
2,abhorrence,6,0
3,abide,1,2
4,abiding,1,0
5,able,54,26
6,aboard,0,2
7,abode,8,0
8,abominable,6,0
9,abominably,4,0


In [232]:
#merge with our dictionary dataframe, called 'con_score'
df = df.merge(con_score, on = 'Word')
df

Unnamed: 0,Word,Austen,Alcott,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos
0,aback,0,1,0,1.65,1.07,4,27,0.85,15,Adverb
1,abatement,1,0,0,1.92,1.29,4,30,0.87,3,Noun
2,abhorrence,6,0,0,2.15,1.26,1,28,0.96,0,Err:512
3,abide,1,2,0,1.68,0.86,0,28,1.00,138,Verb
4,abiding,1,0,0,2.07,1.13,0,29,1.00,25,Adjective
5,able,54,26,0,2.38,1.42,1,27,0.96,8155,Adjective
6,aboard,0,2,0,3.97,1.30,0,30,1.00,1358,Adverb
7,abode,8,0,0,3.92,1.38,2,26,0.92,33,Noun
8,abominable,6,0,0,1.89,0.99,2,30,0.93,32,Adjective
9,abominably,4,0,0,2.14,1.13,1,30,0.97,8,Adverb


### 3. Weighting the term frequencies by the concreteness score

Now we can weight the term frequencies by the concreteness score, by multiplying each novel's frequency count column by the concreteness score column.

In [233]:
df['austen_con_score'] = df['Austen'] * df['Conc.M']
df

Unnamed: 0,Word,Austen,Alcott,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos,austen_con_score
0,aback,0,1,0,1.65,1.07,4,27,0.85,15,Adverb,0.00
1,abatement,1,0,0,1.92,1.29,4,30,0.87,3,Noun,1.92
2,abhorrence,6,0,0,2.15,1.26,1,28,0.96,0,Err:512,12.90
3,abide,1,2,0,1.68,0.86,0,28,1.00,138,Verb,1.68
4,abiding,1,0,0,2.07,1.13,0,29,1.00,25,Adjective,2.07
5,able,54,26,0,2.38,1.42,1,27,0.96,8155,Adjective,128.52
6,aboard,0,2,0,3.97,1.30,0,30,1.00,1358,Adverb,0.00
7,abode,8,0,0,3.92,1.38,2,26,0.92,33,Noun,31.36
8,abominable,6,0,0,1.89,0.99,2,30,0.93,32,Adjective,11.34
9,abominably,4,0,0,2.14,1.13,1,30,0.97,8,Adverb,8.56


In [234]:
df['alcott_con_score'] = df['Alcott'] * df['Conc.M']
df

Unnamed: 0,Word,Austen,Alcott,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos,austen_con_score,alcott_con_score
0,aback,0,1,0,1.65,1.07,4,27,0.85,15,Adverb,0.00,1.65
1,abatement,1,0,0,1.92,1.29,4,30,0.87,3,Noun,1.92,0.00
2,abhorrence,6,0,0,2.15,1.26,1,28,0.96,0,Err:512,12.90,0.00
3,abide,1,2,0,1.68,0.86,0,28,1.00,138,Verb,1.68,3.36
4,abiding,1,0,0,2.07,1.13,0,29,1.00,25,Adjective,2.07,0.00
5,able,54,26,0,2.38,1.42,1,27,0.96,8155,Adjective,128.52,61.88
6,aboard,0,2,0,3.97,1.30,0,30,1.00,1358,Adverb,0.00,7.94
7,abode,8,0,0,3.92,1.38,2,26,0.92,33,Noun,31.36,0.00
8,abominable,6,0,0,1.89,0.99,2,30,0.93,32,Adjective,11.34,0.00
9,abominably,4,0,0,2.14,1.13,1,30,0.97,8,Adverb,8.56,0.00


### 4. Calculating the Average Concreteness Score

Exercise: Calculate and print the average concreteness score for each text. Careful! Think through this before you implement it. You want the average score, normalized over all the words in the text.

In [235]:
#code here
#we'll divide the sum of the weighted concreteness scores by the total count of dictionary words in each novel
print("Mean Concreteness for Austen's 'Pride and Prejudice'")
print(df['austen_con_score'].sum()/df['Austen'].sum())
print()
print("Mean Concreteness for Alcott's 'A Garland for Girls'")
print(df['alcott_con_score'].sum()/df['Alcott'].sum())

Mean Concreteness for Austen's 'Pride and Prejudice'
2.78328905828

Mean Concreteness for Alcott's 'A Garland for Girls'
3.1534507874


### 5. Assessing the difference

So there is a difference, but what does it mean? What is the magnitude of the difference?

We can look at the difference between the two means as a percent difference based on the scale range. We can calculate this using simple math.

In [236]:
#first find the difference between the means by subtracting one from the other
3.1534507874-2.78328905828

0.37016172912000034

In [237]:
#Find the range of concreteness scores
print(df['Conc.M'].min())
print(df['Conc.M'].max())

1.17
5.0


In [238]:
#The range of scores observed in our merged dataframe (the full rating scale runs from 1 to 5)
df['Conc.M'].max() - df['Conc.M'].min()

3.8300000000000001

In [239]:
#Calculate the difference of means as a percent of this range
(0.37/3.83)* 100

9.660574412532636
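
The same calculation can be wrapped in a small function so that no values are rounded by hand. This is only a sketch: `percent_of_range` is a hypothetical helper, and the two means are copied from the output printed earlier. It also shows the result against the full 1-5 rating scale, as an alternative to the range observed in our corpus.

```python
def percent_of_range(mean_a, mean_b, low, high):
    """Return the absolute difference of two means as a percent of (high - low)."""
    return abs(mean_a - mean_b) / (high - low) * 100

austen_mean = 2.78328905828   # mean concreteness printed earlier for Austen
alcott_mean = 3.1534507874    # mean concreteness printed earlier for Alcott

# Relative to the scores observed in our merged dataframe (1.17 to 5.0).
observed = percent_of_range(austen_mean, alcott_mean, 1.17, 5.0)

# Relative to the full 1-5 rating scale used in the norming study.
full_scale = percent_of_range(austen_mean, alcott_mean, 1.0, 5.0)

print(round(observed, 2), round(full_scale, 2))
```

Which denominator is appropriate depends on the question being asked. The observed range describes this corpus, while the full scale describes the instrument itself.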

Ex: Print the most concrete and abstract terms in Austen and in Alcott.  
HINT: You can't simply sort on the column 'austen_con_score' and so on. Why not? What are your next steps?

In [244]:
#Create a new dataframe that keeps only words that have a non-zero value in Alcott
df_alcott = df[df['Alcott']>0]
#Sort on 'Conc.M' and print in descending order for most concrete words
df_alcott[['Word', 'Conc.M', 'Alcott']].sort_values(by=['Conc.M', 'Alcott'], ascending = False)

Unnamed: 0,Word,Conc.M,Alcott
2692,house,5.00,65
6033,water,5.00,32
470,bed,5.00,25
590,boots,5.00,17
2139,fish,5.00,17
438,basket,5.00,16
400,baby,5.00,10
3577,neck,5.00,9
413,ball,5.00,8
921,clock,5.00,8


In [243]:
#Create a new dataframe that keeps only words that have a non-zero value in Austen
df_austen = df[df['Austen']>0]
df_austen[['Word', 'Conc.M', 'Austen']].sort_values(by=['Conc.M', 'Austen'], ascending = False)

Unnamed: 0,Word,Conc.M,Austen
2692,house,5.00,108
413,ball,5.00,36
5198,stairs,5.00,24
470,bed,5.00,6
921,clock,5.00,6
2139,fish,5.00,5
2681,horse,5.00,4
6033,water,5.00,4
2457,gravel,5.00,3
862,chimney,5.00,1
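
The cells above show only the most concrete words. The most abstract words can be found by sorting in ascending order instead. Here is a minimal sketch on a toy stand-in for the merged dataframe. The name `toy_df` and the counts are made up for illustration, so the output does not reflect the real novels.

```python
import pandas

# Toy stand-in for the merged dataframe (made-up counts).
toy_df = pandas.DataFrame({
    "Word": ["house", "whatsoever", "spirituality", "water"],
    "Conc.M": [5.0, 1.17, 1.07, 5.0],
    "Austen": [108, 3, 0, 4],
})

# Keep only words that actually occur in the text.
toy_austen = toy_df[toy_df["Austen"] > 0]

# Sort by score ascending (most abstract first), breaking ties by frequency descending.
most_abstract = toy_austen.sort_values(by=["Conc.M", "Austen"], ascending=[True, False])
print(most_abstract[["Word", "Conc.M", "Austen"]])
```

Filtering first matters: without it, the lowest-scoring word here ('spirituality') would top the list even though it never appears in the text.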
