In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

    import nltk
    nltk.download('punkt')

In [None]:
import lib
import nltk
import spacy
from IPython.display import Image

# Is happiness seasonal?
Determine whether people mention some seasons more than others when they describe their happy moments. We provide the list of seasons. Which season makes people happiest?

Before we figure this out, we're going to explore an important NLP tool.

## Tokenization
We used tokenization in the last notebook, but we didn't really learn much about it. Before we continue to look at seasons, let's learn a bit more about tokenization.

Tokenization is one of the first pre-processing steps in NLP. It is the process of splitting a string of text into individual tokens. You can think of a token as an individual word or a piece of punctuation.

You may have seen the method `split`, which is used for strings in Python. It takes a string and splits it into words, based on whitespace.


NLTK, a Python toolkit for NLP, has its own function, `nltk.word_tokenize`. This is a "smarter" version of tokenization. Let's see why. We show how to tokenize the sentence below using both methods. What is the main difference?

In [None]:
sentence = 'When I went to the store today, I bought apples, bananas, and oranges!'

# print out the list returned using `split`
print(sentence.split())


# print out the list returned using `word_tokenize`
print(nltk.word_tokenize(sentence))


What major differences do you see?
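
To see why this matters when we start counting words, here is a small sketch on a made-up sentence. The regular expression is only a rough stand-in for a smarter tokenizer (it is an assumption for illustration, not what NLTK does internally).

```python
import re

text = 'Wow, sunny today!'

# str.split only breaks on whitespace, so punctuation stays attached to words
whitespace_tokens = text.split()
print(whitespace_tokens)

# rough stand-in for a smarter tokenizer (NOT NLTK's actual algorithm):
# grab runs of word characters, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", text)
print(rough_tokens)

# an exact-match search for 'today' only succeeds on the second list
print('today' in whitespace_tokens)
print('today' in rough_tokens)
```

If we later look for an exact word in a list of tokens, punctuation that is still attached to the word will hide it from us.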

Now, we will create a list of all tokens in the dataset using `nltk.word_tokenize`. We will make each token lowercase. If you don't know how to do that, there is an example below!

In [None]:
# example: this is how you make a token lowercase!

token = 'Sunday'

print(token.lower())

In [None]:
joined_data = lib.load_joined_data()

# create your list of tokens!
all_tokens = []
for happy_moment in joined_data:
  for word in nltk.word_tokenize(happy_moment['cleaned_hm']):
    # append the word to all_tokens and make it lowercase!
    ???

Now, write a function `count_seasons` that takes in your tokens and a list of seasons as input and prints out the count for each season. You can use the `seasons` list provided in the code cell below!

There are two options for how you can count:
1. Count using a dictionary. If you don't remember how, look at the python worksheet.
2. Use individual variables to count each season, then make a dictionary at the end!

The code is currently "set up" for the first option, but you should feel free to do the second.
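
As a point of comparison (not the approach this exercise asks you to write), Python's standard library has a ready-made counting tool, `collections.Counter`. The sketch below uses made-up food words instead of seasons; the variable names are only for illustration.

```python
from collections import Counter

tokens = ['i', 'ate', 'pizza', 'and', 'more', 'pizza', 'with', 'salad']
foods = ['pizza', 'salad', 'soup']

# Counter tallies how many times every token appears
token_counts = Counter(tokens)

# look up only the words we care about; a missing word counts as 0
food_counts = {food: token_counts[food] for food in foods}
print(food_counts)
```

Writing the loop yourself, as in the cell below, is still the best way to understand what a tool like this does for you.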

In [None]:
def count_seasons(all_tokens, seasons):
  counts = {}
  for token in all_tokens:
    for season in seasons:
      # check if the token and the season are equal
      # if they are, increase counts for this season
      # this will require two ifs, one else, and 5 lines of code
      ???
  print(counts)
  
# here we call the function on all_tokens
seasons = ['spring', 'summer', 'fall', 'winter']
count_seasons(all_tokens, seasons)

### The magic of preprocessing
NLTK's `word_tokenize` is built to handle special cases like punctuation. However, we saw that a simple "tokenizer" is Python's `str.split` method, which splits a string on whitespace.

Instead of using `nltk.word_tokenize`, we will use `str.split` this time, and run our `count_seasons` function on the resulting set of tokens. Make sure to make all of your tokens **lowercase** again! Do you get different results?

In [None]:
split_tokens = []
for happy_moment in joined_data:
  for word in happy_moment['cleaned_hm'].split():
    # append the word to split_tokens and make it lowercase!
    ???

    
# count seasons - are your results the same? different?
seasons = ['spring', 'summer', 'fall', 'winter']
count_seasons(split_tokens, seasons)

How about if you don't use `lower` to ignore case? Write out the loop one more time, but don't use `lower` to ignore case. Do your results change?

In [None]:
split_tokens_cases = []
for happy_moment in joined_data:
  for word in happy_moment['cleaned_hm'].split():
    # append the word to split_tokens_cases, but don't make it lowercase!
    ???

    
# count seasons - are your results the same? different?
# this time, we will use capitalized season names, since our data is likely to contain capitalized seasons
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
count_seasons(split_tokens_cases, seasons)

Which results seem most "correct" to you? Which pre-processing method is the most robust?

# What kinds of things are people happy about?
To go beyond the simple example with seasons, we will explore what kinds of things people are happy about. In particular, we will explore two areas:
1. What purchases are people happy with (i.e. can money buy happiness)?
1. Who are people happy to do things with?

In the end, we will see whether people mention other people or purchases more in their happy moments.

### Parsing sentences

To find out what people have bought, you'll be using spaCy's sentence parser. If you aren't familiar, here is an example parse tree, generated using [this site](http://nlpviz.bpodgursky.com/):

In [None]:
Image('img/parse2.png', width=500)

This is the parse tree for the sentence "I went to the store to buy some blue jeans." Different parse trees will have slightly different structures - sometimes more specific tags, like NNS (plural noun) are used, while sometimes only more general tags like N (noun) will be used.

To begin, we'll play around with the spaCy parser to find the noun phrase (NP) that is associated with the word "buy" in our example sentence. Make sure that you actually use the parse tree structure, as it will become important later on.

We will loop through the noun chunks to find the one whose "head" is "buy." [The noun chunks documentation](https://spacy.io/usage/linguistic-features#noun-chunks) shows how to do this. Return all noun chunks for which the head is in the passed-in list `buy_list`. The contents of this list should be verbs that have something to do with buying things!

One thing to note to understand the code below is that noun chunks are _spans_ and individual words are _tokens_ in spaCy. They aren't strings, which is why we need to use the `text` attribute!

Here's the documentation for span and token objects:
* [Spans](https://spacy.io/api/span)
* [Tokens](https://spacy.io/api/token)
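
To get a feel for what "head" means, here is a toy version of the example sentence in plain Python. The head indices and noun chunks are written by hand as an assumption for illustration; they are not spaCy output, and spaCy's real objects work differently.

```python
# hand-written toy parse (NOT produced by spaCy)
words = ['I', 'went', 'to', 'the', 'store', 'to', 'buy', 'some', 'blue', 'jeans']

# heads[i] is the index of the word that word i attaches to
heads = [1, 1, 1, 4, 2, 6, 1, 9, 9, 6]

# each noun chunk: (start index, end index, index of its root word)
chunks = [(0, 1, 0), (3, 5, 4), (7, 10, 9)]

chunk_heads = {}
for start, end, root in chunks:
  chunk_text = ' '.join(words[start:end])
  # the head of a chunk is the head of its root word
  head_word = words[heads[root]]
  chunk_heads[chunk_text] = head_word
  print(chunk_text, '->', head_word)
```

In this toy parse, only one noun chunk has "buy" as its head, and that is the kind of chunk your function should return.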

#### Checking if something is in a list
The following code is how you check if a string (or any type of data!) is in a list. This will be useful when you write your function.

We can also check if something is _not_ in a list using `not in`, as shown below.

Think about what you think the output should be before running this cell!

In [None]:
x = 'monday'

y = 'saturday'

weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday']

print(x in weekdays)

print(y in weekdays)

print(y not in weekdays)

Now, fill in the rest of the function, writing out the necessary if statement.

In [None]:
sentence = 'I went to the store to buy some blue jeans'
simple_buy_list = ['buy']

def get_things_bought(document, buy_list):
  things_bought = []
    
  # load spacy, and parse your document with spacy
  nlp = spacy.load('en', max_length=2000000)
  parsed = nlp(document)
    
  # now, we find all of the relevant noun chunks
  for noun_chunk in parsed.noun_chunks:
    root = noun_chunk.root
    head = root.head

    # check that part-of-speech of head is 'VERB', and that the text of the head is in the buy_list
    # to get POS: head.pos_
    # to get text: head.text
    # to check if something is in buy_list: x in buy_list
    
    ### YOUR WORK HERE
    if ???:
      # if this condition is true, we will add the text from the noun chunk to things_bought
      things_bought.append(noun_chunk.text)
    ### END YOUR WORK
            
  return things_bought

# This should print ['some blue jeans']
print(get_things_bought(sentence, simple_buy_list))

## Counting Purchases

Now that you've finished the get_things_bought function, let's put everything together on our actual dataset.

We'll start by loading our data.

In [None]:
# load happy moments and pre-process data
# for this, we will use only 1/8 of the data, because spaCy parsing is very slow.
eighth_index = len(joined_data) // 8
hms = []
for hm in joined_data[:eighth_index]:
  hms.append(hm['cleaned_hm'])

#### Joining Text
In addition to `hms`, you will want to have a version of the happy moments that includes all of the sentences in a single string. Between the sentences, you can use newlines (`\n`). Here's an example of how `join` works:

In [None]:
some_things = ['chocolate', 'skittles', 'starburst', 'm&m']

joined = '\n'.join(some_things)

print(joined)

Join the text from `hms` together to create `hms_text`.

In [None]:
hms_text = ???

Next, create an expanded `buy_list`. Remember, the contents of this list should be verbs that have something to do with buying things!

Consider different verb forms (like "bought") in addition to synonyms (like "purchase"). Try to have at least 6 words in your list!

In [None]:
# define your buy list
### YOUR WORK HERE
buy_list = ???
### END YOUR WORK

Create a dictionary to count every purchase that has been mentioned. Code is provided to print the dictionary sorted **in descending order.**
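
If the sorting code looks mysterious, this small sketch shows the same pattern on a made-up dictionary (the snack counts are invented for illustration). `items()` gives `(key, value)` pairs, the `key` function picks out the value (position 1 of each pair), and `reverse=True` puts the largest counts first.

```python
snack_counts = {'chips': 3, 'candy': 7, 'fruit': 5}

# sort the (snack, count) pairs by count, largest first
snack_counts_sorted = sorted(snack_counts.items(), key=lambda x: x[1], reverse=True)

for snack, count in snack_counts_sorted:
  print(snack, count)
```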

Hint: try calling your function with the full document instead of individual sentences. Individual sentences are not needed by spaCy, and this will make your code run much faster.

In [None]:
things_purchased = get_things_bought(hms_text, buy_list)

In [None]:
# count things
# things_purchased will be a list of strings. Count the number of occurrences of each string using a dictionary
# if you don't remember how, look at the python worksheet
thing_counts = {}

### YOUR WORK HERE!


### END YOUR WORK!


# sort
thing_counts_sorted = sorted(thing_counts.items(), key=lambda x: x[1], reverse=True)


# print
for thing, count in thing_counts_sorted:
  print(thing, count)

You might notice that some of the most common words here are not in fact things that people have bought. An obvious example is the common word "I".

One sentence where "I" is pulled out of the parse tree is the following:
```
I bought a new TV
```

We want to filter out "I" since it is not really the thing that the worker bought.

Modify your code to take a list of blacklist words that you define. Make sure you think about case as you work on this. Once you've added the blacklist, be creative! Add anything else that you think will help with your performance, like checking for lowercase!
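
One thing to keep in mind while thinking about case: when `in` is used on a string, it looks for a match anywhere inside the string, not just for whole words. The sketch below uses a made-up noun chunk to show the difference. Whether this matters depends on the words you put in your blacklist.

```python
chunk_text = 'a nice shirt'
blocked_word = 'i'

# substring check: True, because 'nice' and 'shirt' contain the letter 'i'
substring_match = blocked_word in chunk_text.lower()
print(substring_match)

# word-level check: False, because no whole word is equal to 'i'
word_match = blocked_word in chunk_text.lower().split()
print(word_match)
```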

The skeleton of the code you wrote above is copied here for you to update.

In [None]:
# define blacklist
### YOUR WORK HERE
blacklist = ???
### END YOUR WORK

In [None]:
def get_things_bought_blacklist(document, buy_list, blacklist):
  things_bought = []
    
  # load spacy, and parse your document with spacy
  nlp = spacy.load('en', max_length=2000000)
  parsed = nlp(document)
    
  # now, we find all of the relevant noun chunks
  for noun_chunk in parsed.noun_chunks:
    root = noun_chunk.root
    head = root.head

    # check that part-of-speech of head is 'VERB', and that the text of the head is in the buy_list
    # to get POS: head.pos_
    # to get text: head.text
    # to check if something is in buy_list: x in buy_list
    
    ### YOUR WORK HERE
    # first if is same as before: POS and buy_list
    if ???:
      # this time, we need to check that NOTHING in the noun_chunk is in the blacklist
      # this is a bit complex because we need to check for all blacklisted words
      bad = False
      for word in blacklist:
        # we want bad to be true if the word is in the noun chunk
        # to do this, use the `in` operator, and make sure to check with noun_chunk.text
        if ???:
          bad = True
      if not bad:
        things_bought.append(noun_chunk.text)
    ### END YOUR WORK
            
  return things_bought
  
things_purchased_blacklist = get_things_bought_blacklist(hms_text, buy_list, blacklist)

In [None]:
# do the same thing you did before... count things, and run code to sort and print
# things_purchased_blacklist will be a list of strings. Count the number of occurrences of each string using a dictionary
# if you don't remember how, look at the python worksheet
thing_counts_blacklist = {}

### YOUR WORK HERE!


### END YOUR WORK!


# sort
thing_counts_blacklist_sorted = sorted(thing_counts_blacklist.items(), key=lambda x: x[1], reverse=True)


# print
for thing, count in thing_counts_blacklist_sorted:
  print(thing, count)

What purchases seem to make people the most happy?

Finally, count up and print the total number of purchases mentioned in this chunk of the dataset.

In [None]:
### YOUR WORK HERE
total_purchases = ???


### END YOUR WORK
print(total_purchases)

### Counting Personal Interactions
In addition to purchases, we want to count other people who are mentioned in the dataset. This will be a fairly simple pattern-matching exercise, like what we did for seasons. However, we will do a little bit of (our own) parsing to get some ideas!

Much of the time, people are mentioned using a possessive like *my*. Go through all of the sentences, searching for the word *my*. Count up occurrences of words that appear after *my*. This might give you some ideas about what to look for! We will print out the top 200 words in the list in order!
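
The idea of looking at each word together with the word after it can be sketched on a made-up token list. This version uses `zip` instead of the index-based loop in the cell below, just to show what the pairs look like.

```python
tokens = ['i', 'walked', 'my', 'dog', 'today']

# pair every token with the token that follows it
pairs = list(zip(tokens, tokens[1:]))

for current_token, next_token in pairs:
  print(current_token, next_token)
```

Notice that there is one fewer pair than there are tokens.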

In [None]:
# count words that occur after my
after_my = {}
for sentence in hms:
  if 'my' in sentence:
    tokens = nltk.word_tokenize(sentence)
    # only look at len(tokens) - 1.
    # question: why would we do this? think about your answer.
    for i in range(len(tokens) - 1):
      current_token = tokens[i]
      next_token = tokens[i + 1]
      # in the if statement, check if the current token is 'my'
      if ???:
        # increment the count of the next token in the if statement.
        if next_token in after_my:
          ???
        else:
          ???

print(after_my)

In [None]:
# print words that commonly occur after my, in order
# sort
after_my_sorted = sorted(after_my.items(), key=lambda x: x[1], reverse=True)

# print
for word, count in after_my_sorted[:200]:
  print(word, count)


Now that you have some ideas, build your list of personal relationships (call it `relationships`), and write a function `count_relationships`. This will be a lot like `count_seasons`, but you don't need a separate count for each relationship. You only need a single total across all of the relationships!

In [None]:
relationships = ???

In [None]:
def count_relationships(hms, relationships):
  count = 0
  ### YOUR WORK HERE
  for sentence in hms:
    for relationship in relationships:
      # check if the relationship is mentioned in the sentence
      # you can use 'in' just like you would for a list.
      if ???:
        # increase the count
        ???

  ### END YOUR WORK
  return count
  
count_relationships(hms, relationships)

Do people mention relationships more, or purchases? What does this tell us about the general causes of happiness?

Your answer:

#### If you have extra time
Consider updating your `count_relationships` to see which types of relationships people mention most often.

Try to prevent double-counting, so that one relationship is only counted a single time in `count_relationships`.