<a href="https://colab.research.google.com/github/lbiester/AI4All-UM-NLP/blob/master/3_Exploring_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

    import nltk
    nltk.download('punkt')

In [0]:
from collections import Counter
import lib
import pandas as pd
import nltk
import itertools
import spacy

## Is happiness seasonal?
Determine whether or not people mention different seasons more in relation to their happiness. We provide the list of seasons. Which season makes people happiest?

Write the `get_all_tokens` function with `nltk.word_tokenize` to get all of the tokens from `joined_data`. Remember, you wrote code to do this before in section 2. Feel free to copy it over!

In [0]:
joined_data = lib.load_joined_data()

def get_all_tokens(joined_data):
    tokens = []
    for hm in joined_data:
        tokens.extend(nltk.word_tokenize(hm['hm_text'].lower()))
    return tokens

all_tokens = get_all_tokens(joined_data)

In [0]:
seasons = ['spring', 'summer', 'fall', 'winter']

In [0]:
def count_seasons(all_tokens):
    counts = Counter()
    for token in all_tokens:
        if token in seasons:
            counts[token] += 1
    print(counts)
count_seasons(all_tokens)

### The magic of preprocessing
nltk's `word_tokenize` is designed to handle special cases like punctuation. A much simpler "tokenizer" is Python's `str.split` method, which splits a string on whitespace.

Write a new function, `get_all_tokens_split`, where you create your list of tokens using `str.split` instead of `nltk.word_tokenize`. Do you get different results? Look at the CSV files - can you see why?
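
To see why the two approaches can disagree, here is a small sketch on a made-up sentence. The regular expression is only a rough stand-in for a punctuation-aware tokenizer like `nltk.word_tokenize` (it is not what nltk does internally), and `toy_text` is invented for illustration.

```python
import re
from collections import Counter

toy_text = "I love summer! Last summer, we went hiking. Summer is great."
seasons = ['spring', 'summer', 'fall', 'winter']

# whitespace splitting leaves punctuation glued to words ("summer!", "summer,")
split_tokens = toy_text.lower().split()

# rough stand-in for a punctuation-aware tokenizer:
# runs of word characters and single punctuation marks become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", toy_text.lower())

split_counts = Counter(t for t in split_tokens if t in seasons)
regex_counts = Counter(t for t in regex_tokens if t in seasons)

print(split_counts['summer'])  # 1
print(regex_counts['summer'])  # 3
```

Only the last "summer" is surrounded by whitespace, so splitting on whitespace misses the other two mentions.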

In [0]:
def get_all_tokens_split(joined_data):
    tokens = []
    for hm in joined_data:
        tokens.extend(hm['hm_text'].lower().split())
    return tokens

all_tokens_split = get_all_tokens_split(joined_data)

count_seasons(all_tokens_split)

What happens if you don't use `lower` to ignore case? Write one more function, `get_all_tokens_split_case_sensitive`, which does the same thing as `get_all_tokens_split` but doesn't ignore case. Check the results of `count_seasons` again. This time, we'll capitalize the season names.

In [0]:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']

def get_all_tokens_split_case_sensitive(joined_data):
    tokens = []
    for hm in joined_data:
        tokens.extend(hm['hm_text'].split())
    return tokens

all_tokens_split_case_sensitive = get_all_tokens_split_case_sensitive(joined_data)

count_seasons(all_tokens_split_case_sensitive)

Which results seem most "correct" to you? Which pre-processing method is the most robust?

## What kinds of things are people happy about?
To go beyond the simple example with seasons, we will explore what kinds of things people are happy about. In particular, we will explore two areas:
1. What purchases are people happy with (i.e. can money buy happiness)?
1. Who are people happy to do things with?

In the end, we will see whether people mention other people or purchases more often in their happy moments.

### Parsing sentences

To find out what people have bought, you'll be using spaCy's sentence parser. If you aren't familiar, here is an example parse tree, generated using [this site](http://nlpviz.bpodgursky.com/):

<img src="img/parse2.png" width="500">

This is the parse tree for the sentence "I went to the store to buy some blue jeans." Different parse trees will have slightly different structures - sometimes more specific tags like NNS (plural noun) are used, while other times only more general tags like N (noun) are used.

To begin, we'll play around with the spaCy parser to find the NP (noun phrase) that is associated with the word "buy" in our example sentence. Make sure that you actually use the parse tree structure, as it will become important later on.

You'll want to loop through the noun chunks to find the one whose "head" is "buy." [The noun chunks documentation](https://spacy.io/usage/linguistic-features#noun-chunks) should give you a good idea of how to do this. Return all noun chunks for which the head is in the passed in list `buy_list`.

It should be clear from the example how to get the text attribute from a noun chunk - do note that noun chunks are _spans_ and individual words are _tokens_ in spaCy. This means that you should always make sure to get the text if that's what you want!

Here's the documentation for span and token objects:
* [Spans](https://spacy.io/api/span)
* [Tokens](https://spacy.io/api/token)
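
If the idea of a chunk's root and that root's head is unfamiliar, the sketch below walks a hand-written dependency parse of the example sentence. The `toy_parse` structure and the `find_objects_of` helper are hypothetical and only illustrate the relationship; a real spaCy parse exposes the same information through `noun_chunk.root` and `token.head`, and its actual output may differ from this hand-made one.

```python
# each entry: (word, index of its head word, words of the noun chunk it roots or None)
toy_parse = [
    ('I', 1, ['I']),                          # subject of "went"
    ('went', 1, None),                        # root of the sentence (its own head)
    ('to', 1, None),
    ('the', 4, None),
    ('store', 2, ['the', 'store']),           # object of the preposition "to"
    ('to', 6, None),
    ('buy', 1, None),
    ('some', 9, None),
    ('blue', 9, None),
    ('jeans', 6, ['some', 'blue', 'jeans']),  # object of "buy"
]

def find_objects_of(parse, verbs):
    """Return the text of every noun chunk whose root word is attached to one of `verbs`."""
    found = []
    for word, head_index, chunk in parse:
        head_word = parse[head_index][0]
        if chunk is not None and head_word in verbs:
            found.append(' '.join(chunk))
    return found

print(find_objects_of(toy_parse, ['buy']))  # ['some blue jeans']
```

Note that asking for `['went']` instead returns `['I']`: the subject of a verb hangs off that verb in the same way its object does.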

In [0]:
sentence = 'I went to the store to buy some blue jeans'
simple_buy_list = ['buy']

def get_things_bought(document, buy_list):
    things_bought = []
    
    # load spacy, and parse your query with spacy
    ### YOUR WORK HERE
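    # the 'en' shortcut was removed in spaCy 3, so load the model by its full name
    nlp = spacy.load('en_core_web_sm')
    # raise the character limit so the whole joined document can be parsed at once
    nlp.max_length = 2000000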
    parsed = nlp(document)
    ### END YOUR WORK
    
    # now, find all of the relevant noun chunks
    # make sure you return their text!
    ### YOUR WORK HERE
    for noun_chunk in parsed.noun_chunks:
        root = noun_chunk.root
        head = root.head
        
        if head.pos_ == 'VERB' and head.text in buy_list:
            things_bought.append(noun_chunk.text)
            
    ### END YOUR WORK
    return things_bought

# This should print ['some blue jeans']
print(get_things_bought(sentence, simple_buy_list))

### Counting Purchases

Now that you've written the `get_things_bought` function, let's put everything together on our actual dataset. Create a dictionary to count every purchase that has been mentioned, and then print the results in descending order. Be sure to print both the count and the noun chunk.

You will also want to extend your buy list. You will want to think of synonyms, in addition to different forms of the verb "to buy".

Hint: try calling your function with the full document instead of individual sentences. Individual sentences are not needed by spaCy, and this will make your code run much faster.
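
One possible way to do the counting and sorting is `collections.Counter`, sketched here on made-up data. The `toy_things` list is invented and simply stands in for whatever your function returns; a plain dictionary plus `sorted` works just as well.

```python
from collections import Counter

# made-up noun chunks standing in for the output of get_things_bought
toy_things = ['a new car', 'a book', 'a new car', 'lunch', 'a book', 'a new car']

purchase_counts = Counter(toy_things)

# most_common() returns (item, count) pairs sorted from highest to lowest count
for thing, count in purchase_counts.most_common():
    print(count, thing)
```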

In [0]:
# load happy moments
# because parsing is slow, you should define a few variables here
# 1. full_hms: this is a list of all happy moment texts in the dataset
# 2. eigth_hms: this is 1/8 of the happy moments in the dataset. we will use this when our code is resource-intensive
# 3. hms_text: this is eigth_hms, but we will join it together in a string! combine all strings with the \n (newline)
#    character.
### YOUR WORK HERE
hms = lib.load_happy_moments()
full_hms = [hm['cleaned_hm'] for hm in hms]
eigth_hms = [hm['cleaned_hm'] for i, hm in enumerate(hms) if i < len(hms) / 8]
hms_text = '\n'.join(eigth_hms)
### END YOUR WORK

In [0]:
# define your buy list
### YOUR WORK HERE
buy_list = ['buy', 'bought', 'purchase', 'purchased', 'purchasing', 'buying', 'order', 'ordering']
### END YOUR WORK

In [0]:
# find purchases
# this might take a few minutes to run, don't worry if it does!
### YOUR WORK HERE
things_bought = {}
things = get_things_bought(hms_text, buy_list)
for thing in things:
    if thing not in things_bought:
        things_bought[thing] = 0
    things_bought[thing] += 1
things_sorted = sorted(things_bought.items(), key=lambda x: x[1], reverse=True)

for thing, count in things_sorted:
    print(thing, count)
### END YOUR WORK

You might notice that some of the most common words here are not in fact things that people have bought. An obvious example is the common word "I".

One sentence where "I" is pulled out of the parse tree is the following:
```
I bought a new TV
```

In this specific case, "I" is the subject of "bought", so a dependency parse attaches it directly to the verb even though it is not the thing that was bought. We can simply filter it out. A harder case is the sentence "I bought my father a bicycle", in which "father" really does belong under "bought" in the tree, but is not the thing that was bought.

Modify your code to take a list of blacklist words that you define. Make sure you think about case as you work on this. Once you've added the blacklist, be creative! Add anything else that you think will help with your performance.
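
One pitfall to watch for when filtering is the difference between checking for a blacklisted substring and checking for a blacklisted word. The sketch below uses a shortened, made-up blacklist and two hypothetical helper functions to show the difference; it splits on whitespace for simplicity, whereas with spaCy you could compare the tokens of the noun chunk instead.

```python
blacklist = ['i', 'me', 'he', 'my']

def contains_blacklisted_substring(chunk_text, blacklist):
    """Naive check: flags a chunk if any blacklist entry appears anywhere inside it."""
    return any(word in chunk_text.lower() for word in blacklist)

def contains_blacklisted_word(chunk_text, blacklist):
    """Whole-word check: flags a chunk only if one of its words is in the blacklist."""
    chunk_words = set(chunk_text.lower().split())
    return bool(chunk_words & set(blacklist))

print(contains_blacklisted_substring('some blue jeans', blacklist))  # True ("me" is inside "some")
print(contains_blacklisted_word('some blue jeans', blacklist))       # False
print(contains_blacklisted_word('My father', blacklist))             # True
```

The substring version would throw away "some blue jeans", a perfectly good purchase, so the whole-word check is usually the safer choice.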

You can copy your code down here to work on this.

In [0]:
# define blacklist
### YOUR WORK HERE
things_blacklisted = ['my', 'I', 'me', 'i', 'it', 'his', 'her', 'their', 'we', 'us', 'he', 'him', 'she', 'they', 'them']
### END YOUR WORK

def get_things_bought_blacklist(document, buy_list, blacklist):
    things_bought = []
    
    # load spacy, and parse your query with spacy
    ### YOUR WORK HERE
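    # the 'en' shortcut was removed in spaCy 3, so load the model by its full name
    nlp = spacy.load('en_core_web_sm')
    # raise the character limit so the whole joined document can be parsed at once
    nlp.max_length = 2000000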
    parsed = nlp(document)
    ### END YOUR WORK
    
    # now, find all of the relevant noun chunks
    # make sure you return their text!
    ### YOUR WORK HERE
    for noun_chunk in parsed.noun_chunks:
        root = noun_chunk.root
        head = root.head
        
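        if head.pos_ == 'VERB' and head.text in buy_list:
            # compare whole lowercased tokens rather than substrings,
            # so that 'he' does not match 'the' and 'me' does not match 'some'
            chunk_words = {token.text.lower() for token in noun_chunk}
            if not chunk_words.intersection(word.lower() for word in blacklist):
                things_bought.append(noun_chunk.text)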
            
    ### END YOUR WORK
    return things_bought

### YOUR WORK HERE
things_bought = {}
things = get_things_bought_blacklist(hms_text, buy_list, things_blacklisted)
for thing in things:
    if thing not in things_bought:
        things_bought[thing] = 0
    things_bought[thing] += 1
things_sorted = sorted(things_bought.items(), key=lambda x: x[1], reverse=True)

for thing, count in things_sorted:
    print(thing, count)
### END YOUR WORK

What purchases seem to make people the most happy?

Finally, count up and print the total number of purchases mentioned in this chunk of the dataset.

In [0]:
### YOUR WORK HERE
total_purchases = sum(things_bought.values())
### END YOUR WORK
print(total_purchases)

### Counting Personal Interactions
In addition to purchases, we want to count other people who are mentioned in the dataset. This will be a fairly simple pattern-matching exercise, like what we did for seasons. However, we will do a little bit of (our own) parsing to get some ideas!

Much of the time, people are mentioned using a possessive like *my*. Go through all of the sentences, searching for the word *my*. Count up occurrences of words that appear after *my*. This might give you some ideas about what to look for! Print the list out in order.
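
As a warm-up, here is a minimal sketch of the idea on a few made-up sentences. It splits on whitespace instead of using spaCy, and `toy_moments` and `after_my_toy` are invented names for illustration only.

```python
from collections import Counter

toy_moments = [
    'I had dinner with my wife',
    'My dog learned a new trick',
    'I called my mom and my sister',
    'my mom baked a cake',
]

after_my_toy = Counter()
for moment in toy_moments:
    words = moment.lower().split()
    # pair each word with the word that follows it
    for current_word, next_word in zip(words, words[1:]):
        if current_word == 'my':
            after_my_toy[next_word] += 1

print(after_my_toy.most_common())
```

Lowercasing first means that "My dog" at the start of a sentence is counted too.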

In [0]:
# TODO: save tokenized sentences for speed for students!
# the 'en' shortcut was removed in spaCy 3, so load the model by its full name
nlp = spacy.load('en_core_web_sm')

### YOUR WORK HERE
from collections import Counter
after_my = Counter()
for sentence in eigth_hms:
    # cheap substring check so we only parse sentences that could contain "my"
    if 'my' in sentence.lower():
        tokens = nlp(sentence)
        for i, token in enumerate(tokens):
            # match both "my" and "My"; skip the last token, which has nothing after it
            if token.text.lower() == 'my' and i != len(tokens) - 1:
                next_token = tokens[i + 1].text
                after_my[next_token] += 1

# print all words that occur with "my"
### END YOUR WORK

In [0]:
after_my.most_common(1000)

Now that you have some ideas, build your list of personal relationships (call it `relationships`), and write a function `count_relationships`.

Make sure that you do not double count a happy moment with multiple relationships mentioned.
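
A minimal sketch of one way to avoid double counting, on made-up data: turn each moment into a set of words and check whether it shares anything with the relationship list. The function name, the toy lists, and the simple `[a-z]+` word pattern are all assumptions for illustration.

```python
import re

toy_relationships = ['son', 'friend', 'wife']
toy_moments = [
    'My son and my wife threw me a party',   # two relationships, still one moment
    'I met a nice person at the park',       # "person" contains "son" but is not a match
    'A friend called me today',
    'I finished a good book',
]

def count_moments_with_relationship(moments, relationships):
    """Count moments that mention at least one relationship as a whole word."""
    relationship_set = set(relationships)
    count = 0
    for moment in moments:
        words = set(re.findall(r"[a-z]+", moment.lower()))
        # a moment counts once, however many relationships it mentions
        if words & relationship_set:
            count += 1
    return count

print(count_moments_with_relationship(toy_moments, toy_relationships))  # 2
```

Whole-word matching avoids false hits like "person", but it also misses plurals such as "friends" unless you add them to the list, so there is a trade-off either way.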

In [0]:
relationships = ['family', 'wife', 'husband', 'son', 'daughter', 'girlfriend', 'friend', 'parents', 'mother', 'sister',
                'mom', 'boyfriend', 'brother', 'kids', 'boss', 'dad', 'grandpa', 'nephew', 'uncle', 'partner',
                'niece', 'baby', 'children', 'child', 'spouse', 'cousin', 'ex', 'neighbor', 'fiance', 'daughters',
                'granddaughter', 'students', 'aunt', 'roommate', 'coworkers']

In [0]:
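def count_relationships(hms, relationships):
    count = 0
    ### YOUR WORK HERE
    # hms is a list of happy moment strings
    for sentence in hms:
        for relationship in relationships:
            # substring match: 'friend' also matches 'friends', but 'son' also matches 'person'
            if relationship in sentence:
                count += 1
                # stop at the first match so each happy moment is counted only once
                break
    ### END YOUR WORK
    return count

In [0]:
# pass the list of strings (1/8 of the data), not the raw hms records
count_relationships(eigth_hms, relationships)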

Do people mention purchases or other people more often in their happy moments? What does this tell us about the general causes of happiness?