## Additional Exercises for 02.27: Dictionary Method

Ex. 

1. Read in the `childrens_lit.csv.bz2` file from the `data` folder.
2. Come up with a hypothesis about what you think the overall sentiment of children's literature is.
3. Do a sentiment analysis on a subset of children's literature using the dictionary method from lecture.
    - Use the positive and negative words from lecture
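
As a reminder, the dictionary method counts how many tokens in a text appear in a list of positive words and how many appear in a list of negative words, then divides each count by the text length. Here is a minimal sketch on a single sentence. The word lists and the sentence are made up for illustration, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
import string

# Hypothetical mini word lists, for illustration only; the exercise uses
# the full positive/negative word lists from lecture.
positive_words = {"happy", "good", "kind"}
negative_words = {"sad", "bad", "cruel"}

text = "The kind old dog was happy, but the winter was bad."

# Lowercase, split on whitespace, and strip punctuation from token edges.
tokens = [w.strip(string.punctuation) for w in text.lower().split()]
tokens = [w for w in tokens if w]  # drop tokens that were only punctuation

# Count dictionary hits.
num_pos = sum(1 for w in tokens if w in positive_words)
num_neg = sum(1 for w in tokens if w in negative_words)

# Normalize by text length so texts of different sizes are comparable.
prop_pos = num_pos / len(tokens)
prop_neg = num_neg / len(tokens)

print(num_pos, num_neg, len(tokens))
```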

### Question 1

In [1]:
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt


#read in our data
df = pd.read_csv("../data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df = df.dropna(subset=["text"])
df

Unnamed: 0,title,author gender,year,text
0,A Dog with a Bad Name,Male,1886,A DOG WITH A BAD NAME BY TALBOT BAINES REED ...
1,A Final Reckoning,Male,1887,A Final Reckoning: A Tale of Bush Life in Aust...
2,"A House Party, Don Gesualdo, and A Rainy June",Female,1887,A HOUSE-PARTY Don Gesualdo and A Rainy June...
3,A Houseful of Girls,Female,1889,"A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ..."
4,A Little Country Girl,Female,1885,"LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ..."
5,A Round Dozen,Female,1883,\n A ROUND DOZEN. [Illustration: TOINETTE AND...
6,A Sailor's Lass,Female,1886,"A SAILOR'S LASS by EMMA LESLIE, Author of ""..."
7,A World of Girls,Female,1886,A WORLD OF GIRLS: THE STORY OF A SCHOOL. By ...
8,Adrift in the Wild,Male,1887,Adrift in the Wilds; ...
9,Adventures in Africa,Male,1883,"ADVENTURES IN AFRICA, BY W.H.G. KINGSTON. C..."


Since the full collection of children's literature is too large to analyze here, we'll randomly select 5 books and run a sentiment analysis on them using the dictionary method.

*Note*: In case you're not familiar with seeds: `seed()` initializes the random number generator to a fixed state. If everyone passes the same number to `seed()`, everyone gets the same results from the random draws that follow, so we all end up with the same 5 books.
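
To see the effect of seeding in isolation, the short sketch below draws a random permutation twice after resetting the generator to the same seed. The variable names `first` and `second` are introduced here only for illustration.

```python
import numpy as np

np.random.seed(1)                   # fix the generator state
first = np.random.permutation(10)   # a random ordering of 0..9

np.random.seed(1)                   # reset to the same state
second = np.random.permutation(10)  # the same "random" ordering again

# Both draws are identical because they started from the same seed.
print((first == second).all())
```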

In [2]:
import numpy as np
np.random.seed(1)
df = df.sample(5)
df

Unnamed: 0,title,author gender,year,text
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...


### Question 2

Since these books were written for children, the overall sentiment is probably positive; that is, we expect a larger share of positive words than negative words.

### Question 3

In [4]:
# Lowercase the text, tokenize it, and drop punctuation tokens
df['text_lc'] = df['text'].str.lower()
df['text_split'] = df['text_lc'].apply(nltk.word_tokenize)
df['text_split_clean'] = df['text_split'].apply(lambda x : [word for word in x if word not in string.punctuation])
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa..."
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure..."
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo..."
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla..."
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,..."
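
One detail of the cleaning step is worth knowing. `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. Single punctuation characters are removed, but multi-character tokens, such as the two-character opening and closing quote tokens that `nltk.word_tokenize` produces or a double hyphen, are kept and later counted in the text length. The sketch below uses a hand-written token list (hypothetical, not taken from the books) and compares the filter above with a stricter alternative, which is only one possible option and not the method from lecture.

```python
import string

# Hand-written tokens resembling tokenizer output (illustrative only).
tokens = ["``", "hello", ",", "world", "!", "''", "--", "'s"]

# The filter used in the solution: a substring test against one long string.
solution_clean = [w for w in tokens if w not in string.punctuation]

# A stricter alternative: keep only tokens with at least one letter or digit.
strict_clean = [w for w in tokens if any(ch.isalnum() for ch in w)]

print(solution_clean)
print(strict_clean)
```

Whether the leftover tokens matter depends on the text; they never match a sentiment word, but they do inflate the token count slightly.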


In [6]:
df['text_length'] = df['text_split_clean'].apply(len)
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86353
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57839
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77672


In [10]:
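# Load the positive and negative word lists from lecture (one word per line).
# Sets make the membership tests in the next cell much faster than lists.
with open("../data/positive_words.txt", encoding='utf-8') as f:
    positive_words = set(f.read().split('\n'))
with open("../data/negative_words.txt", encoding='utf-8') as f:
    negative_words = set(f.read().split('\n'))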

In [11]:
df['num_pos_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in positive_words]))
df['num_neg_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in negative_words]))
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length,num_pos_words,num_neg_words
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331,4363,3531
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86353,3047,2770
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57839,2325,1321
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870,2556,2247
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77672,2905,2642


In [12]:
df['prop_pos_words'] = df['num_pos_words']/df['text_length']
df['prop_neg_words'] = df['num_neg_words']/df['text_length']
df

Unnamed: 0,title,author gender,year,text,text_lc,text_split,text_split_clean,text_length,num_pos_words,num_neg_words,prop_pos_words,prop_neg_words
126,Under Drake's Flag,Male,1883,Under Drake's Flag: A Tale of the Spanish Mai...,under drake's flag: a tale of the spanish mai...,"[under, drake, 's, flag, :, a, tale, of, the, ...","[under, drake, 's, flag, a, tale, of, the, spa...",110331,4363,3531,0.039545,0.032004
47,Kidnapped,Male,1886,KIDNAPPED BEING MEMOIRS OF THE ADVEN...,kidnapped being memoirs of the adven...,"[kidnapped, being, memoirs, of, the, adventure...","[kidnapped, being, memoirs, of, the, adventure...",86353,3047,2770,0.035285,0.032078
75,The Bee-Man of Orn,Male,1887,FRANK R. STOCKTON'S WRITINGS. * ...,frank r. stockton's writings. * ...,"[frank, r., stockton, 's, writings, ., *, *, *...","[frank, r., stockton, 's, writings, new, unifo...",57839,2325,1321,0.040198,0.022839
95,The Island Queen,Male,1885,The Project Gutenberg EBook of The Island Quee...,the project gutenberg ebook of the island quee...,"[the, project, gutenberg, ebook, of, the, isla...","[the, project, gutenberg, ebook, of, the, isla...",63870,2556,2247,0.040019,0.035181
108,The Red Man's Revenge A Tale of the Red River ...,Male,1880,The Project Gutenberg EBook of The Red Man's R...,the project gutenberg ebook of the red man's r...,"[the, project, gutenberg, ebook, of, the, red,...","[the, project, gutenberg, ebook, of, the, red,...",77672,2905,2642,0.037401,0.034015
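
To relate these proportions back to the hypothesis in Question 2, one option is to subtract the negative proportion from the positive one for each book. The sketch below does this with the values copied from the table above. The `net` score is an illustrative measure introduced here, not something defined in lecture, and the last title is shortened for readability.

```python
# (prop_pos_words, prop_neg_words) copied from the output table above.
books = {
    "Under Drake's Flag": (0.039545, 0.032004),
    "Kidnapped": (0.035285, 0.032078),
    "The Bee-Man of Orn": (0.040198, 0.022839),
    "The Island Queen": (0.040019, 0.035181),
    "The Red Man's Revenge": (0.037401, 0.034015),
}

# Net score: share of positive words minus share of negative words.
net = {title: round(pos - neg, 6) for title, (pos, neg) in books.items()}

for title, diff in net.items():
    print(f"{title}: {diff:+.6f}")

# True if every sampled book has more positive than negative words.
all_positive = all(diff > 0 for diff in net.values())
most_positive = max(net, key=net.get)
print(all_positive, most_positive)
```

In this sample every book has a higher share of positive than negative words, which is consistent with the hypothesis. Keep in mind that five books, all by male authors and all from the 1880s, are a small and non-representative subset.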
