# Problem set 8: Mini-project

We've put some effort into building our collection (see problem set 7 for details and for links to texts and to metadata). Now it's time to learn something about it. You already have lots of excellent ideas for how to apply the tools we've learned about so far. It's also a good time of the semester to review what we have learned and practice applying it in less structured settings.

**You will work by yourself or in a group of up to three people** to complete a short project applying methods from the previous weeks to this collection. You will turn in the completed project as a single notebook (one submission per group) with the following sections:

1. **Question(s).** Describe what you wanted to learn. Suggest several possible answers or hypotheses, and describe in general terms what you might expect to see if each of these answers were true (save specific measurements for the next section). For example, many students want to know the difference between horror and non-horror, or between detective stories and horror fiction, but there are many ways to operationalize this question. You do not need to limit yourself to questions of genre. **Note that your question should be interesting! If the answer is obvious before you begin, or if you cannot explain why it matters, your grade will suffer (a lot).** (10 points)

1. **Methods.** Describe how you will use computational methods presented so far in this class to answer your question. What do the computational tools do, and how does their output relate to your question? Describe how you will process the collection into a form suitable for a model or algorithm and why you have processed it the way you have. (10 points)

1. **Code.** Carry out your experiments. Code should be correct (no errors) and focused (unneeded code from examples is removed). Use the notebook format effectively: code may be incorporated into multiple sections. (20 points)

1. **Results and discussion.** Use sorted lists, tables, and visual presentations to make your argument. Excellent projects will provide multiple views of results, and follow up on any apparent outliers or strange cases, including through careful reading of the original documents. (40 points)

1. **Reflection.** Describe your experience in this process. What was harder or easier than you expected? What compromises or negotiations did you have to accept to match the collection, the question, and the methods? What would you try next? (10 points)

1. **Responsibility and resources consulted.** Credit any online sources (Stack Overflow, blog posts, documentation) that you found helpful. (0 points, but -10 if missing)
    * **If you worked in a group**, set up a group submission in CMS. Each group member should submit (via CMS) a separate text file in which they describe each member's (including their own) contributions to the project.
    * Most people will turn in *either* a completed notebook for their solo project *or* a responsibility statement. The only people who will submit both files are those who are the designated submitter for their group. Don't worry if CMS warns you about a missing file (unless you're the group submitter).

Note that 10 points will be carried over from problem set 7.

**We will grade this work based on accuracy, thoroughness, creativity, reflectiveness, and quality of presentation.**

**Scope:** this is a *mini*-project, with a short deadline. We are expecting work that is consistent with that timeframe, but that is serious, thoughtful, and rigorous. This problem set will almost certainly require more time and effort than many of the others. **For group work, the expected scope grows linearly with the number of participants.**

# 0. Project team

List here the members of your project team, including yourself.

# 1. Question(s)

# 2. Methods

# 3. Code

In [2]:
# Imports (all of them!)
import pandas as pd
from pathlib import Path
import numpy as np
from glob import glob
import os

In [3]:
#path = Path('/Users/francesw/desktop/info 6350','hw','Info6350_MiniP_data_v2.csv')
path = os.path.join('..', '..', 'data', 'Info6350_MiniP_data_v2.csv')
data = pd.read_csv(path)

# generate dummy variables for English, PovFirst, female, horror, detective and adaptation (1 if "True")

data["dHorror"] = data["horror"].astype(int)
data["dDetective"] = data["detective"].astype(int)
data["dAdaptation"] = data["adaptation"].astype(int)
data['dEnglish'] = np.where(data['language'] == 'en', 1, 0)
data['dPovFirst'] = np.where(data['pov'] == 'first', 1, 0)
data['dFemale'] = np.where(data['gender'] == 'female', 1, 0)

# convert all 'gb' to  'uk' 
data['country'] = data['country'].replace(['gb'],'uk')

# generate dummy variables for Romantic, Victorian, Modern and PostModern literature period (1 if "True")
data['dRomanticP'] = np.where((data['year'] >=  1790) & (data['year'] <= 1830), 1, 0)
data['dVictorianP'] = np.where((data['year'] >=  1832) & (data['year'] <= 1901), 1, 0)
data['dModernP'] = np.where((data['year'] >=  1914) & (data['year'] <= 1945), 1, 0)
data['dPostModernP'] = np.where((data['year'] >=  1945), 1, 0)

# generate dummy variables for FranceWar, UKWar, USWar, GermanyWar and War (1 if the work was written while the author's home country was at war)
data['dFranceWar'] = np.where((data['year'] >=  1830) & (data['year'] <= 1848) & (data['country'] == 'fr'), 1, 0)
data['dUKWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945))) & (data['country'] == 'uk'), 1, 0)
data['dUSWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945)) | ((data['year'] >=  1861) & (data['year'] <= 1865))) & (data['country'] == 'us'), 1, 0)
data['dGermanyWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945))) & ((data['country'] == 'de') | (data['country'] == 'cz')), 1, 0)
data['dWar'] = np.where((data['dFranceWar'] ==  1) | (data['dUKWar'] ==  1)| (data['dUSWar'] ==  1)| (data['dGermanyWar'] ==  1), 1, 0)

# summary statistics for key variables
# 'language' is a text column, so describe() would drop it; only numeric columns are listed
data[['wordcount', 'age', 'dScience', 'dHorror', 'dDetective', 'dAdaptation', 'dEnglish', 'dPovFirst', 'dFemale', 'dRomanticP', 'dVictorianP', 'dModernP', 'dPostModernP', 'dWar']].describe()


| statistic | wordcount | age | dScience | dHorror | dDetective | dAdaptation | dEnglish | dPovFirst | dFemale | dRomanticP | dVictorianP | dModernP | dPostModernP | dWar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 | 132.0 |
| mean | 85428.007576 | 42.166667 | 0.219697 | 0.272727 | 0.25 | 0.386364 | 0.992424 | 0.454545 | 0.537879 | 0.083333 | 0.340909 | 0.204545 | 0.159091 | 0.090909 |
| std | 77588.786826 | 10.958051 | 0.415619 | 0.447058 | 0.434662 | 0.48877 | 0.087039 | 0.499826 | 0.500462 | 0.277438 | 0.475821 | 0.404906 | 0.367154 | 0.288575 |
| min | 2764.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48454.25 | 33.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 61016.5 | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 104441.75 | 49.0 | 0.0 | 1.0 | 0.25 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 667000.0 | 71.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
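
The period dummies above are built from independent year ranges, so a year can match two periods (1945 satisfies both the Modern and PostModern conditions) or none (1831, and 1902 through 1913). If mutually exclusive labels are wanted, one option is a single categorical column. The following is a minimal sketch, not the notebook's method: the `assign_period` helper, the `PERIODS` table, the decision to place 1945 in Modern, and the toy years are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical period table: (label, first year, last year), inclusive and non-overlapping.
# Assumption: 1945 is assigned to Modern, so PostModern starts in 1946.
PERIODS = [
    ('Romantic', 1790, 1830),
    ('Victorian', 1832, 1901),
    ('Modern', 1914, 1945),
    ('PostModern', 1946, 9999),
]

def assign_period(year):
    '''Return the single period label containing year, or 'Other' if none does.'''
    for label, start, end in PERIODS:
        if start <= year <= end:
            return label
    return 'Other'

# Toy data standing in for the 'year' column of the metadata
toy = pd.DataFrame({'year': [1818, 1831, 1897, 1910, 1945, 1960]})
toy['period'] = toy['year'].apply(assign_period)

# One 0/1 column per period; each row has exactly one 1
dummies = pd.get_dummies(toy['period'], prefix='d', prefix_sep='').astype(int)
print(toy.join(dummies))
```

Years that fall in a gap are labeled `Other` rather than silently receiving zeros everywhere, which makes them easy to count and inspect.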


In [4]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from collections import defaultdict

In [5]:
novel_files = glob(os.path.join('..', '..', 'data','novels', '*.txt'))
emolex_file = os.path.join('..', '..','data','lexicons','emolex.txt')

In [57]:
def tokenize_text(text, stopwords=None):
    '''
    Takes a string and an optional collection of stopwords to drop.
    Returns a list of sentences, each a list of lowercased tokens.
    '''
    tokenized_text = []
    for sent in sent_tokenize(text):
        tokens = word_tokenize(sent.lower())
        if stopwords is not None:
            tokens = [token for token in tokens if token not in stopwords]
        tokenized_text.append(tokens)
    return tokenized_text

# Build stopword list with punctuation.
# Named custom_stopwords so it does not shadow the nltk.corpus stopwords import.
custom_stopwords = set(['and','but','am','is','are','was','were','be','being','been','the','a','an','of',
                'on','under','above','out','in','at','with','have','has','had', "'s"])
custom_stopwords = custom_stopwords.union(set(string.punctuation))


In [58]:
# read and parse the emolex file
def read_emolex(filepath=None):
    '''
    Takes a file path to the emolex lexicon file.
    Returns a dictionary of emolex sentiment values: {word: {emotion: value}}.
    '''
    if filepath is None: # Try to find the emolex file
        filepath = emolex_file
        if os.path.isfile(filepath):
            pass
        elif os.path.isfile('emolex.txt'):
            filepath = 'emolex.txt'
        else:
            raise FileNotFoundError('No EmoLex file found')
    emolex = defaultdict(dict) # Like Counter(), defaultdict eases dictionary creation
    with open(filepath, 'r') as f:
        # emolex file format is: word emotion value
        for line in f:
            if not line.strip(): # skip blank lines, which would fail to unpack
                continue
            word, emotion, value = line.strip().split()
            emolex[word][emotion] = int(value)
    return emolex

# Get EmoLex data
emolex = read_emolex(emolex_file)

FileNotFoundError: [Errno 2] No such file or directory: '../../data/lexicons/emolex.txt'
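
The parser above expects one `word emotion value` triple per line. A minimal sketch of that parsing step on an in-memory sample follows. The sample entries are invented for illustration and are not taken from the EmoLex file, and `parse_lexicon_lines` is a hypothetical helper name.

```python
import io
from collections import defaultdict

# Invented sample in the 'word emotion value' layout (not real EmoLex entries)
sample = io.StringIO(
    "storm\tfear\t1\n"
    "storm\tjoy\t0\n"
    "\n"
    "garden\tjoy\t1\n"
)

def parse_lexicon_lines(lines):
    '''Build {word: {emotion: value}} from whitespace-separated triples.'''
    lexicon = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # skip blank or malformed lines
            continue
        word, emotion, value = fields
        lexicon[word][emotion] = int(value)
    return lexicon

lex = parse_lexicon_lines(sample)
print(dict(lex))
```

Testing the parser on a few lines like this separates parsing problems from path problems such as the missing-file error above.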

In [56]:
# Sentiment scoring function
def sentiment_score(token_list, lex=None):
    '''
    Takes a list of tokenized sentences (as returned by tokenize_text).
    Returns a tuple: a dictionary of EmoLex scores averaged per sentence,
    and the number of sentences scored.
    '''
    if lex is None: # fall back to reading the lexicon from disk
        lex = read_emolex()
    emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy',
                'negative', 'positive', 'sadness', 'surprise', 'trust']
    sent_score = {emotion: 0.0 for emotion in emotions} # running total per emotion
    count = 0 # number of sentences seen
    for sent in token_list: # each sent is a list of tokens
        for token in sent:
            if token in lex: # only tokens found in the lexicon contribute
                for emotion in sent_score:
                    sent_score[emotion] += lex[token].get(emotion, 0)
        count += 1 # counts sentences; move into the inner loop to count tokens instead
    if count > 0: # avoid dividing by zero on empty input
        for emotion in sent_score:
            sent_score[emotion] = sent_score[emotion] / count
    return sent_score, count
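
The function above divides each emotion total by the number of sentences, so its scores are average lexicon hits per sentence. Dividing by the number of tokens instead gives a per-word rate, which is less sensitive to how long an author's sentences are. A small sketch contrasting the two follows; `toy_lex`, `toy_sents`, and `emotion_totals` are invented for illustration and use only two emotions.

```python
# Invented mini-lexicon in the {word: {emotion: value}} shape used above
toy_lex = {
    'dark': {'fear': 1, 'joy': 0},
    'sunny': {'fear': 0, 'joy': 1},
}

# Two tokenized sentences: 4 tokens and 2 tokens
toy_sents = [['the', 'dark', 'dark', 'house'], ['sunny', 'day']]

def emotion_totals(sentences, lex):
    '''Sum lexicon values per emotion; also count sentences and tokens.'''
    totals = {'fear': 0.0, 'joy': 0.0}
    n_tokens = 0
    for sent in sentences:
        for token in sent:
            n_tokens += 1
            for emotion in totals:
                totals[emotion] += lex.get(token, {}).get(emotion, 0)
    return totals, len(sentences), n_tokens

totals, n_sents, n_tokens = emotion_totals(toy_sents, toy_lex)
per_sentence = {e: v / n_sents for e, v in totals.items()}
per_token = {e: v / n_tokens for e, v in totals.items()}
print(per_sentence)  # fear: 2 hits over 2 sentences
print(per_token)     # fear: 2 hits over 6 tokens
```

Whichever denominator is chosen, it should be the same for every novel being compared.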

In [36]:
corpus_scores = {} # Dictionary to hold results
for novel in novel_files: # Iterate over novels
    # Skip undecodable bytes rather than fail on a file that is not valid UTF-8
    with open(novel, encoding="utf8", errors='ignore') as f:
        novel_text = f.read() # Read a novel as a string
    # Filename without its extension (splitext removes only the suffix)
    novel_label = os.path.splitext(os.path.basename(novel))[0]
    tokens = tokenize_text(novel_text) # Tokenize
    scores, n_sentences = sentiment_score(tokens, lex=emolex) # Returns (scores, sentence count)
    corpus_scores[novel_label] = scores # Record scores
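
When a label is derived from a filename, the extension should be removed as a suffix. `str.rstrip` takes a set of characters rather than a suffix, so it can also remove trailing letters of the title itself. A short illustration, using hypothetical file paths:

```python
import os

# Hypothetical paths; the second title ends in a letter that also appears in '.txt'
paths = [os.path.join('novels', 'Mathilda.txt'),
         os.path.join('novels', 'the_hobbit.txt')]

# rstrip('.txt') strips any trailing '.', 't', or 'x' characters
labels_rstrip = [os.path.basename(p).rstrip('.txt') for p in paths]

# splitext removes only the final extension
labels_stem = [os.path.splitext(os.path.basename(p))[0] for p in paths]

print(labels_rstrip)
print(labels_stem)
```

The two approaches agree on most titles, which is why the difference is easy to miss until labels fail to match the metadata.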

In [55]:
ggp_path = os.path.join('..','..','data','novels','The_Great_God_Pan.txt') # Example: The Great God Pan
with open(ggp_path, 'r', encoding='utf8', errors='ignore') as f: # same decoding as the corpus loop
    ggp_text = f.read()
ggp = tokenize_text(ggp_text)
emo_score = sentiment_score(ggp,lex=emolex)
display(emo_score)

({'anger': 0.2702702702702703,
  'anticipation': 0.47157502329916123,
  'disgust': 0.25442684063373716,
  'fear': 0.43895619757688725,
  'joy': 0.3699906803355079,
  'negative': 0.7306616961789375,
  'positive': 0.8108108108108109,
  'sadness': 0.42684063373718545,
  'surprise': 0.26654240447343897,
  'trust': 0.5638397017707363},
 1073)

In [46]:
corpus_scores

{'a_sweet_little_maid': {'anger': 0.006179154560707124,
  'anticipation': 0.01929451241969145,
  'disgust': 0.004828743299095634,
  'fear': 0.00890043786062119,
  'joy': 0.017575807177640464,
  'negative': 0.01761672873102263,
  'positive': 0.03105945901706429,
  'sadness': 0.010332692228997013,
  'surprise': 0.00859352621025494,
  'trust': 0.020338012030936693},
 'Mathilda': {'anger': 0.022185246810870772,
  'anticipation': 0.028629532789266568,
  'disgust': 0.016718168132477618,
  'fear': 0.02789002456223754,
  'joy': 0.028233367667643873,
  'negative': 0.04814726778121121,
  'positive': 0.05055066951905554,
  'sadness': 0.03158756569738267,
  'surprise': 0.015133507645986847,
  'trust': 0.029791617146026465},
 'the_island_of_doctor_moreau': {'anger': 0.015412214034952356,
  'anticipation': 0.0148335326569191,
  'disgust': 0.009991898460707534,
  'fear': 0.023783804637166776,
  'joy': 0.012133019559430578,
  'negative': 0.04064272211720227,
  'positive': 0.02850970255777169,
  'sadne
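
A nested dictionary like `corpus_scores` converts directly into a table, which makes sorting and comparison easier. A minimal sketch, assuming the `{label: {emotion: score}}` shape shown above; the two entries and their values here are made up for illustration.

```python
import pandas as pd

# Made-up scores in the same {label: {emotion: score}} shape
toy_scores = {
    'novel_a': {'fear': 0.03, 'joy': 0.01},
    'novel_b': {'fear': 0.01, 'joy': 0.02},
}

# orient='index' makes one row per novel and one column per emotion
scores_df = pd.DataFrame.from_dict(toy_scores, orient='index')

# Rank novels from most to least fearful
by_fear = scores_df.sort_values('fear', ascending=False)
print(by_fear)
```
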

# 4. Results and discussion

# 5. Reflection

# 6. Responsibility and resources consulted