In [0]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

    import nltk
    nltk.download('punkt')

In [0]:
from collections import Counter
import lib
import pandas as pd
import nltk
import itertools
import spacy

# Is happiness seasonal?
Determine whether people mention some seasons more than others when they describe what made them happy. We provide the list of seasons. Which season makes people happiest?

Before we figure this out, we're going to explore an important NLP tool.

## Tokenization
We used tokenization in the last notebook, but we didn't really learn much about it. Before we continue to look at seasons, let's learn a bit more about tokenization.

Tokenization is one of the first pre-processing steps in NLP. It is the process of splitting a string of text into individual tokens. You can think of a token as an individual word or a piece of punctuation.

You may have seen the method `split`, which is used for strings in Python. It takes a string and splits it into words, based on whitespace.


NLTK, a Python toolkit for NLP, has its own function, `nltk.word_tokenize`. This is a "smarter" version of tokenization. Let's see why. Tokenize the sentence below using both methods. What is the main difference?


* To call `split` on a string, write `string.split()`.
* To call `word_tokenize` on a string, write `nltk.word_tokenize(string)`.

In [0]:
sentence = 'When I went to the store today, I bought apples, bananas, and oranges!'

# print out the list returned using `split`
### YOUR WORK HERE!!!


# print out the list returned using `word_tokenize`
### YOUR WORK HERE!!!


What major differences do you see?

Now, create a list of all tokens in the dataset. If you get stuck, you can look at what we did in the last notebook. Make sure to use `nltk.word_tokenize`, _and_ make each token lowercase.

In [1]:
joined_data = lib.load_joined_data()

# create your list of tokens!
all_tokens = []
### YOUR WORK HERE!!!



In [0]:
seasons = ['spring', 'summer', 'fall', 'winter']

Now, write a function `count_seasons` that takes your tokens as input and prints the count for each season. You can use the `seasons` list provided above!
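
If you have not counted specific words before, here is a minimal sketch of one possible approach, shown on made-up tokens and a list of colors rather than the seasons. The names `count_targets` and `toy_tokens` are hypothetical and exist only for this illustration.

```python
from collections import Counter

def count_targets(tokens, targets):
    """Return how often each target word appears in tokens."""
    frequencies = Counter(tokens)  # tallies every distinct token
    # a Counter gives 0 for missing keys, so absent targets are still reported
    return {target: frequencies[target] for target in targets}

# made-up tokens, already lowercased
toy_tokens = ['the', 'red', 'car', 'passed', 'a', 'blue', 'car', 'and', 'a', 'red', 'bus']
toy_counts = count_targets(toy_tokens, ['red', 'blue', 'green'])
print(toy_counts)
```

A plain dictionary with a loop works just as well; `Counter` simply saves you from checking whether a key already exists.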

In [0]:
def count_seasons(all_tokens):
    counts = {}
    ### YOUR WORK HERE!!!

    print(counts)

In [None]:
# call your function on the tokens
### YOUR WORK HERE!!!

### The magic of preprocessing
NLTK's `word_tokenize` is designed to handle special cases like punctuation. However, we saw that an even simpler "tokenizer" is Python's `split` method, which only splits a string on whitespace.
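
To see why this matters for counting, here is a small sketch on a made-up sentence. The regular expression is only a rough stand-in for a punctuation-aware tokenizer; it is an assumption for illustration and is not the algorithm NLTK uses.

```python
import re

text = "I like summer, but fall is nice too!"

# whitespace splitting leaves punctuation glued to the neighboring word
whitespace_tokens = text.lower().split()

# rough stand-in for a punctuation-aware tokenizer (NOT NLTK's algorithm):
# match runs of word characters, or any single non-space symbol
regex_tokens = re.findall(r"\w+|[^\w\s]", text.lower())

print(whitespace_tokens)
print(regex_tokens)

# 'summer,' is not the same string as 'summer', so the counts disagree
print(whitespace_tokens.count('summer'), regex_tokens.count('summer'))
```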

Write a new function, `get_all_tokens_split`, where you create your list of tokens using `string.split` instead of `nltk.word_tokenize`. Make sure to make all of your tokens **lower case** again! Do you get different results?

In [0]:
def get_all_tokens_split(joined_data):
    tokens = []
    ### YOUR WORK HERE!!!
    return tokens

# call your function on the data


# count seasons - are your results the same? different?



How about if you don't use `lower` to ignore case? Write one more function, `get_all_tokens_split_case_sensitive`, which does the same thing as `get_all_tokens_split` but doesn't ignore case. Check the results of `count_seasons` again. This time, we'll use capitalized names for our seasons.

In [0]:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']

def get_all_tokens_split_case_sensitive(joined_data):
    tokens = []
    ### YOUR WORK HERE!!!
    return tokens

# call your new function on the data


# count seasons - are your results the same? different?



Which results seem most "correct" to you? Which pre-processing method is the most robust?

## What kinds of things are people happy about?
To go beyond the simple example with seasons, we will explore what kinds of things people are happy about. In particular, we will explore two areas:
1. What purchases are people happy with (i.e. can money buy happiness)?
1. Who are people happy to do things with?

In the end, we will see whether people mention other people or purchases more often in their happy moments.

### Parsing sentences

To find out what people have bought, you'll be using spaCy's sentence parser. If you aren't familiar, here is an example parse tree, generated using [this site](http://nlpviz.bpodgursky.com/):

<img src="img/parse2.png" width="500">

This is the parse tree for the sentence "I went to the store to buy some blue jeans." Different parse trees will have slightly different structures - sometimes more specific tags, like NNS (plural noun) are used, while sometimes only more general tags like N (noun) will be used.

To begin, we'll play around with the spaCy parser to find the noun phrase (NP) that is associated with the word "buy" in our example sentence. Make sure that you actually use the parse tree structure, as it will become important later on.

You'll want to loop through the noun chunks to find the one whose "head" is "buy." [The noun chunks documentation](https://spacy.io/usage/linguistic-features#noun-chunks) should give you a good idea of how to do this. Return all noun chunks for which the head is in the passed in list `buy_list`.

It should be clear from the example how to get the text attribute from a noun chunk. Do note that noun chunks are _spans_ and individual words are _tokens_ in spaCy. This means that you should always make sure to get the text if that's what you want!

Here's the documentation for span and token objects:
* [Spans](https://spacy.io/api/span)
* [Tokens](https://spacy.io/api/token)
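
The filtering idea itself can be pictured without running a parser. The sketch below uses hand-written pairs of chunk text and head word as a stand-in for what a parse gives you. `ToyChunk`, `chunks_with_head`, and the head values are assumptions made up for illustration; they are not spaCy objects or spaCy output. In spaCy itself, the linked noun chunks documentation reads the head's text with `chunk.root.head.text`.

```python
from collections import namedtuple

# hand-written stand-in for parsed noun chunks: the chunk's text plus the
# word it attaches to (its head). These values are illustrative only.
ToyChunk = namedtuple('ToyChunk', ['text', 'head'])

toy_chunks = [
    ToyChunk(text='I', head='went'),
    ToyChunk(text='the store', head='to'),
    ToyChunk(text='some blue jeans', head='buy'),
]

def chunks_with_head(chunks, head_words):
    """Keep the text of every chunk whose head is one of head_words."""
    return [chunk.text for chunk in chunks if chunk.head in head_words]

toy_bought = chunks_with_head(toy_chunks, ['buy'])
print(toy_bought)
```

Your real function will get its chunks from a parsed document instead of a hand-written list, but the comparison against `buy_list` has the same shape.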

In [None]:
# use this cell to play around with a spacy parse, based on the documentation.
# this may help you to understand what is happening
# also feel free to ask questions - this is definitely the hardest thing we have done so far.



In [0]:
sentence = 'I went to the store to buy some blue jeans'
simple_buy_list = ['buy']

def get_things_bought(document, buy_list):
    things_bought = []
    
    # load spacy, and parse your document with spacy
    ### YOUR WORK HERE

    ### END YOUR WORK
    
    # now, find all of the relevant noun chunks
    # make sure you return their text!
    # refer to the documentation on noun chunks for this
    ### YOUR WORK HERE
            
    ### END YOUR WORK
    return things_bought

# This should print ['some blue jeans']
print(get_things_bought(sentence, simple_buy_list))

### Counting Purchases

Now that you've written the `get_things_bought` function, let's put everything together on our actual dataset. Create a dictionary to count every purchase that has been mentioned, and then print the results **in descending order.** Be sure to print the count and the noun chunk.
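
If sorting a dictionary by its values is new to you, here is a minimal sketch of one way to do it. The counts are made up and `toy_purchase_counts` is a hypothetical name standing in for the dictionary you will build.

```python
# made-up purchase counts, standing in for the dictionary you will build
toy_purchase_counts = {'a new car': 3, 'a book': 7, 'shoes': 5}

# sort (phrase, count) pairs by count, largest first
ranked = sorted(toy_purchase_counts.items(), key=lambda pair: pair[1], reverse=True)

for phrase, count in ranked:
    print(count, phrase)
```

If you count with `collections.Counter` instead of a plain dictionary, its `most_common()` method returns the pairs already sorted this way.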

You will also want to extend your buy list. You will want to think of synonyms, in addition to different forms of the verb "to buy".

Hint: try calling your function with the full document instead of individual sentences. Individual sentences are not needed by spaCy, and this will make your code run much faster.
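
Building one document out of many strings takes two steps: slice the list, then join it. Here is a small sketch on generated placeholder strings; `toy_moments` and the other names are hypothetical and do not come from the dataset.

```python
# sixteen placeholder strings standing in for happy moment texts
toy_moments = [f'moment {i}' for i in range(16)]

# integer division gives the size of one eighth of the list
eighth_size = len(toy_moments) // 8
toy_eighth = toy_moments[:eighth_size]

# put a newline between consecutive strings to form a single document
toy_text = '\n'.join(toy_eighth)
print(repr(toy_text))
```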

In [0]:
# load happy moments
# because parsing is slow, you should define a few variables here
# 1. full_hms: this is a list of all happy moment texts in the dataset
# 2. eigth_hms: this is 1/8 of the happy moments in the dataset. we will use this when our code is resource-intensive
# 3. hms_text: this is eigth_hms, but we will join it together in a string! combine all strings with the \n (newline)
#    character. Use the string join function: https://www.programiz.com/python-programming/methods/string/join
### YOUR WORK HERE
hms = lib.load_happy_moments()
full_hms = ???
eigth_hms = ???
hms_text = ???
### END YOUR WORK

In [0]:
# define your buy list
### YOUR WORK HERE
buy_list = ???
### END YOUR WORK

In [0]:
# find purchases
# this might take a few minutes to run, don't worry if it does!
# hint: you will want to sort the list to print in descending order. 
# Ask for help if you don't know how to do this, or look it up online!
### YOUR WORK HERE

# count things



# sort


# print


### END YOUR WORK

You might notice that some of the most common words here are not in fact things that people have bought. An obvious example is the common word "I".

One sentence where "I" is pulled out of the parse tree is the following:
```
I bought a new TV
```

In this specific case, I think that the spaCy parser is doing something wrong by marking "I" as a child of "bought". However, we can still filter it out. A more legitimate failure comes from the sentence "I bought my father a bicycle", in which "father" really does belong under "bought" in the tree, but is not the thing that was bought.

Modify your code to take a list of blacklist words that you define. Make sure you think about case as you work on this. Once you've added the blacklist, be creative! Add anything else that you think will help with your performance.
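
One way to think about case here, as a minimal sketch on made-up candidates: keep the blacklist in lowercase and lowercase each candidate only for the comparison. The names `toy_candidates` and `toy_blacklist` are hypothetical.

```python
# made-up noun chunk texts, including two casings of the same word
toy_candidates = ['I', 'a new TV', 'my father', 'a bicycle', 'i']

# store blacklist entries in lowercase so one entry covers every casing
toy_blacklist = {'i', 'my father'}

# compare the lowercased candidate, but keep the original text in the result
kept = [text for text in toy_candidates if text.lower() not in toy_blacklist]
print(kept)
```

Using a set for the blacklist keeps each membership check fast, which helps when there are many noun chunks to filter.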

You can copy your code down here to work on this.

In [0]:
# define blacklist
### YOUR WORK HERE


### END YOUR WORK

def get_things_bought_blacklist(document, buy_list, blacklist):
    things_bought = []
    
    # load spacy, and parse your document with spacy
    ### YOUR WORK HERE

    
    ### END YOUR WORK
    
    # now, find all of the relevant noun chunks
    # make sure you return their text!
    ### YOUR WORK HERE

    
    
    ### END YOUR WORK
    return things_bought

### YOUR WORK HERE
# this should be similar to your previous code


### END YOUR WORK

What purchases seem to make people the most happy?

Finally, count up and print the total number of purchases mentioned in this chunk of the dataset.

In [0]:
### YOUR WORK HERE
total_purchases = ???


### END YOUR WORK
print(total_purchases)

### Counting Personal Interactions
In addition to purchases, we want to count other people who are mentioned in the dataset. This will be a fairly simple pattern-matching exercise, like what we did for seasons. However, we will do a little bit of (our own) parsing to get some ideas!

Much of the time, people are mentioned using a possessive like *my*. Go through all of the sentences, searching for the word *my*. Count up occurrences of words that appear after *my*. This might give you some ideas about what to look for! Print the list out in order.
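
The core trick is to look at each word together with the word that follows it. Here is a minimal sketch on made-up sentences. It splits on whitespace only to stay short, which is an assumption of this example; `toy_sentences` and `after_my` are hypothetical names.

```python
from collections import Counter

toy_sentences = [
    'my sister visited me',
    'i walked my dog',
    'my dog is happy',
]

after_my = Counter()
for sentence in toy_sentences:
    words = sentence.lower().split()
    # pair each word with the word that follows it
    for current_word, next_word in zip(words, words[1:]):
        if current_word == 'my':
            after_my[next_word] += 1

# most_common() lists the words from most to least frequent
print(after_my.most_common())
```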

In [0]:
nlp = spacy.load('en')  # in spaCy 3+, the 'en' shortcut was removed; load 'en_core_web_sm' there instead

### YOUR WORK HERE
# count words that occur after my



### END YOUR WORK

In [0]:
### YOUR WORK HERE
# print words that commonly occur after my, in order



### END YOUR WORK

Now that you have some ideas, build your list of personal relationships (call it `relationships`), and write a function `count_relationships`.

Make sure that you do not double count a happy moment with multiple relationships mentioned.
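
To see what avoiding double counting looks like, here is a minimal sketch on made-up moments. The idea is to ask a yes/no question per moment instead of adding up every match. All names here are hypothetical, and the whitespace split is a simplification.

```python
toy_hms = [
    'my mom and my sister took me to lunch',  # two relationship words, one moment
    'i finished a long run',                  # no relationship words
    'my friend called',                       # one relationship word
]
toy_relationships = {'mom', 'sister', 'friend'}

moments_with_people = 0
for moment in toy_hms:
    words = moment.lower().split()
    # any() answers True or False, so each moment adds at most 1
    if any(word in toy_relationships for word in words):
        moments_with_people += 1

print(moments_with_people)
```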

In [0]:
relationships = ???

In [0]:
def count_relationships(hms, relationships):
    count = 0
    ### YOUR WORK HERE

    
    ### END YOUR WORK
    return count

In [0]:
count_relationships(hms, relationships)

Do people mention purchases or other people more often in their happy moments? What does this tell us about the general causes of happiness?