# Dictionary Method: Additional Exercises - SOLUTIONS

## Challenge:

1. Read in the `childrens_lit.csv.bz2` file from the `data` folder.
2. Come up with a hypothesis about what the overall sentiment rating of children's literature is likely to be.
3. Do a sentiment analysis on a subset of children's literature using the dictionary method from lecture.
    - Use the positive and negative words from lecture

### Question 1

In [5]:
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt


#read in our data
df = pd.read_csv("../day-2/data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df = df.dropna(subset=["text"])
df

Unnamed: 0,title,author gender,year,text
0,A Dog with a Bad Name,Male,1886,A DOG WITH A BAD NAME BY TALBOT BAINES REED ...
1,A Final Reckoning,Male,1887,A Final Reckoning: A Tale of Bush Life in Aust...
2,"A House Party, Don Gesualdo, and A Rainy June",Female,1887,A HOUSE-PARTY Don Gesualdo and A Rainy June...
3,A Houseful of Girls,Female,1889,"A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ..."
4,A Little Country Girl,Female,1885,"LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ..."
...,...,...,...,...
127,Up the River,Male,1881,UP THE RIVER OR YACHTING ON THE MISSISSIPPI ...
128,What Katy Did Next,Female,1886,WHAT KATY DID NEXT BY SUSAN COOLIDGE This...
129,Winning His Spurs,Male,1882,WINNING HIS SPURS ...
130,With Clive in India,Male,1884,"WITH CLIVE IN INDIA: Or, The Beginnings of an..."


Since the corpus contains too many books to analyze quickly, we'll randomly select 5 of them and run a sentiment analysis on that sample using the dictionary method.

*Note*: In case you're not familiar with seeds: `seed()` initializes the random number generator to a fixed state. If everyone passes the same number to `seed()`, everyone gets the same result from the random draws that follow.
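
As a minimal sketch of that behavior, the snippet below draws from a hypothetical toy frame (`toy` stands in for the real corpus) and checks that re-seeding reproduces the same rows. It also shows the `random_state` argument of `DataFrame.sample`, which is an alternative to seeding NumPy's global generator and is not what the cell below uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the full corpus
toy = pd.DataFrame({"title": [f"book {i}" for i in range(20)]})

# Re-seeding NumPy's global generator before each draw gives the same rows
np.random.seed(1)
first = toy.sample(5)
np.random.seed(1)
second = toy.sample(5)

# Passing random_state ties the seed to this one call instead of global state
third = toy.sample(5, random_state=1)
fourth = toy.sample(5, random_state=1)

print(first.index.tolist() == second.index.tolist())
print(third.index.tolist() == fourth.index.tolist())
```

Both comparisons should print `True`. The per-call `random_state` form has the advantage that other code drawing random numbers in between cannot change which rows are selected.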

In [6]:
import numpy as np
np.random.seed(1)
df = df.sample(5)
df

Unnamed: 0,title,author gender,year,text
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...


### Question 2

Since these books were written for children, the overall sentiment rating is probably positive.

### Question 3

In [7]:
# Lowercase the text, tokenize it, and drop punctuation tokens
df['text_lc'] = df['text'].str.lower()
df['text_split'] = df['text_lc'].apply(nltk.word_tokenize)
df['text_split_clean'] = df['text_split'].apply(lambda x : [word for word in x if word not in string.punctuation])
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa..."
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure..."
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo..."
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla..."
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,..."
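
One caveat about the cleaning step: `word not in string.punctuation` is a substring test against a single string, so only tokens that appear inside that string are dropped. Multi-character tokens, such as the doubled quote marks that `nltk.word_tokenize` emits for double quotes, pass through. The sketch below uses a hypothetical token list and compares the filter above with an alternative (an assumption, not the lecture's method) that keeps only tokens containing at least one letter or digit.

```python
import string

# Hypothetical tokens of the kind a tokenizer can produce
tokens = ["``", "hello", ",", "world", "''", "--", "'s", "!"]

# Filter used in the cell above: `in` on a string is a substring test
substring_filter = [w for w in tokens if w not in string.punctuation]

# Alternative: keep tokens that contain at least one letter or digit
alnum_filter = [w for w in tokens if any(ch.isalnum() for ch in w)]

print(substring_filter)
print(alnum_filter)
```

Leftover punctuation tokens do not match any sentiment word, but they do inflate `text_length` slightly and therefore lower both proportions a little.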


In [8]:
df['text_length'] = df['text_split_clean'].apply(len)
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86351
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57774
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77673


In [11]:
# Read each word list and close the file afterwards; sets make the `in` lookups below fast
with open("../day-2/data/positive_words.txt", encoding='utf-8') as f:
    positive_words = set(f.read().split('\n'))
with open("../day-2/data/negative_words.txt", encoding='utf-8') as f:
    negative_words = set(f.read().split('\n'))

In [12]:
df['num_pos_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in positive_words]))
df['num_neg_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in negative_words]))
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length,num_pos_words,num_neg_words
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331,4363,3531
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86351,3047,2770
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57774,2325,1321
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870,2557,2247
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77673,2905,2643


In [13]:
df['prop_pos_words'] = df['num_pos_words']/df['text_length']
df['prop_neg_words'] = df['num_neg_words']/df['text_length']
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length,num_pos_words,num_neg_words,prop_pos_words,prop_neg_words
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331,4363,3531,0.039545,0.032004
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86351,3047,2770,0.035286,0.032078
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57774,2325,1321,0.040243,0.022865
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870,2557,2247,0.040034,0.035181
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77673,2905,2643,0.0374,0.034027
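
To compare these proportions against the hypothesis from Question 2, one option is a net score: the proportion of positive words minus the proportion of negative words. The sketch below copies the counts from the table above; `net_sentiment` is a hypothetical name introduced here, not a column from the lecture.

```python
# (title, num_pos_words, num_neg_words, text_length), copied from the table above
counts = [
    ("Under Drake's Flag", 4363, 3531, 110331),
    ("Kidnapped", 3047, 2770, 86351),
    ("The Bee-Man of Orn", 2325, 1321, 57774),
    ("The Island Queen", 2557, 2247, 63870),
    ("The Red Man's Revenge", 2905, 2643, 77673),
]

# Net score: proportion of positive words minus proportion of negative words
net_sentiment = {title: (pos - neg) / n for title, pos, neg, n in counts}

for title, score in net_sentiment.items():
    print(f"{title}: {score:+.4f}")
```

All five scores come out positive, which is consistent with the hypothesis, although five books is a small sample and the word lists were not built for nineteenth-century fiction.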


<br>

# Weighted Dictionary: Additional Exercises - SOLUTIONS

<br><br>

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

con_score = pd.read_csv('../day-2/data/Concreteness_ratings_Brysbaert_et_al.csv')
print(con_score)

             Word  Bigram  Conc.M  Conc.SD  Unknown  Total  Percent_known  \
0               a       0    1.46     1.14        2     30           0.93   
1      a cappella       1    2.92     1.44        3     29           0.90   
2        aardvark       0    4.68     0.86        0     28           1.00   
3           aback       0    1.65     1.07        4     27           0.85   
4          abacus       0    4.52     1.12        2     29           0.93   
...           ...     ...     ...      ...      ...    ...            ...   
39949        zoom       0    3.10     1.49        0     30           1.00   
39950     zoom in       1    3.57     1.40        0     28           1.00   
39951   zoom lens       1    4.81     0.49        1     27           0.96   
39952   zoophobia       0    2.04     1.02        2     25           0.92   
39953    zucchini       0    4.87     0.57        0     30           1.00   

       SUBTLEX  Dom_Pos  
0      1041179  Article  
1            0  Err:512

## Question 1
* Open **Machiavelli_ThePrince.txt** and **Marx_CommunistManifesto.txt**.
* Make a data frame that contains both of them.

In [18]:
text_list = []
#open and read the texts, save them as variables
machiavelli_string = open('../day-2/data/Machiavelli_ThePrince.txt', encoding='utf-8').read()
marx_string = open('../day-2/data/Marx_CommunistManifesto.txt', encoding='utf-8').read()

#append each text to the list
text_list.append(machiavelli_string)
text_list.append(marx_string)

countvec = CountVectorizer(stop_words="english")

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
novels_df = pd.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Unnamed: 0,000,10,12,1232,1284,1300,1320,1328,1383,1390,...,yielded,yoke,young,youth,zanobi,zeal,zenith,zerezzanello,zip,zones
0,1,3,1,3,1,1,1,3,1,1,...,5,1,6,9,2,1,1,1,1,0
1,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,1,1


## Question 2
* Take a subset of the DTM, keeping only the words that appear in both our corpus and the concreteness dictionary.
* Transpose the result and rename the columns.
* Finally, merge our data frame with the `con_score` data frame.

In [20]:
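columns = list(novels_df)
# Build the lookup set once; rebuilding a list inside the comprehension is very slow
con_words = set(con_score['Word'])
columns_con = [word for word in columns if word in con_words]
novels_df_con = novels_df[columns_con]
novels_df_con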

Unnamed: 0,abandon,abandoned,abandonment,abide,abiding,ability,abject,ablaze,able,abolish,...,wrote,year,yearly,yield,yoke,young,youth,zeal,zenith,zip
0,2,2,1,1,1,24,2,0,54,0,...,4,15,1,10,1,6,9,1,1,1
1,0,0,0,1,0,0,0,1,2,3,...,4,0,0,1,1,1,0,0,0,1


In [21]:
df = novels_df_con.transpose()
df.rename(columns={0: 'Machiavelli', 1: 'Marx'}, inplace=True)
df.index.names = ['Word']
df.reset_index(inplace=True)
df

Unnamed: 0,Word,Machiavelli,Marx
0,abandon,2,0
1,abandoned,2,0
2,abandonment,1,0
3,abide,1,1
4,abiding,1,0
...,...,...,...
3634,young,6,1
3635,youth,9,0
3636,zeal,1,0
3637,zenith,1,0


In [22]:
df = df.merge(con_score, on = 'Word')
df

Unnamed: 0,Word,Machiavelli,Marx,Bigram,Conc.M,Conc.SD,Unknown,Total,Percent_known,SUBTLEX,Dom_Pos
0,abandon,2,0,0,2.54,1.45,1,27,0.96,413,Verb
1,abandoned,2,0,0,2.52,1.27,0,29,1.00,678,Verb
2,abandonment,1,0,0,2.54,1.29,0,28,1.00,49,Noun
3,abide,1,1,0,1.68,0.86,0,28,1.00,138,Verb
4,abiding,1,0,0,2.07,1.13,0,29,1.00,25,Adjective
...,...,...,...,...,...,...,...,...,...,...,...
3634,young,6,1,0,3.16,1.46,0,25,1.00,12402,Adjective
3635,youth,9,0,0,3.28,1.34,0,25,1.00,858,Noun
3636,zeal,1,0,0,2.33,1.33,2,29,0.93,31,Noun
3637,zenith,1,0,0,2.83,1.61,2,25,0.92,22,Noun


## Question 3
* Calculate and print the **average concreteness score** for each text.
* What is the magnitude of the difference?

In [23]:
df['machiavelli_con_score'] = df['Machiavelli'] * df['Conc.M']
df['marx_con_score'] = df['Marx'] * df['Conc.M']

print("machiavelli: " + str(df['machiavelli_con_score'].sum()/df['Machiavelli'].sum()))
print("marx: " + str(df['marx_con_score'].sum()/df['Marx'].sum()))

machiavelli: 2.890535517520018
marx: 2.8289717978848414


In [24]:
avg_mach = df['machiavelli_con_score'].sum()/df['Machiavelli'].sum()
avg_marx = df['marx_con_score'].sum()/df['Marx'].sum()
abs(avg_mach-avg_marx)/(df['Conc.M'].max() - df['Conc.M'].min()) * 100

1.6074078233727511