In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All_3'

    import nltk
    nltk.download('punkt')

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import lib
import spacy
import itertools
from collections import Counter
import statistics
import pandas as pd
import nltk

# Basic Language Processing
Now that we know a little bit about the demographics of the workers who helped to produce the dataset, let's start looking at the language! There are a few questions that we want to answer:
* What makes people happy?
* Do the things that cause happiness differ between groups?

We'll start by using some simple techniques to answer the first question!

First, we'll load the data. If you don't remember, look back at the first notebook to see how to load the joined data.

Save it as `joined_data`

In [None]:
# load the *joined* data!

## Getting Sentences
First, go through `joined_data`, and create a list of happy moments. You will need to create a list of happy moments with various properties a few times, so make sure that you are very clear on how to do this using `joined_data`!
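
As a minimal sketch of the pattern, assuming each happy moment is a dictionary with a `'cleaned_hm'` key: the `toy_data` list and `toy_sentences` name below are made up for illustration and stand in for `joined_data` and `all_sentences`.

```python
# Hypothetical stand-in for joined_data: a list of dictionaries.
toy_data = [
    {'cleaned_hm': 'I went for a walk with my dog.'},
    {'cleaned_hm': 'My friend called me after a long time.'},
]

toy_sentences = []
for hm in toy_data:
    # pull the text of the happy moment out of the dictionary
    toy_sentences.append(hm['cleaned_hm'])

print(toy_sentences)
```

The same loop shape should carry over to the real data, as long as `joined_data` can be iterated one happy moment at a time.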

In [None]:
all_sentences = []
# add the happy moment text for each happy moment to all_sentences
# each happy moment is a dictionary
# the happy moment text is stored in the property 'cleaned_hm'

### YOUR WORK HERE!!!

## Wordclouds

One way to visualize text data is with a word cloud, which shows which words appear frequently in the text. Luckily, we don't need to write a bunch of code to display a word cloud - libraries exist to do it already! Our `lib` module provides a function for creating word clouds, `lib.create_word_cloud`.

Warning: The word cloud library is a bit slow, so if it takes a minute or two to load, don't worry - it probably doesn't have anything to do with your code!

This is how you would call it on a list of sentences, `s`: `lib.create_word_cloud(s)`

In [None]:
# create your word cloud here
### YOUR WORK HERE!!!



Now let's do something a bit more interesting: later on, we will classify happy moments by whether they were written by a man or a woman. Let's create two word clouds, one of posts made by men and one of posts made by women, and see how they differ. This will require two steps:

1. Separate out happy moments into entries written by women and entries written by men
1. Create word clouds of each
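
The first step can be sketched on made-up data as follows. This assumes the `'gender'` property holds `'m'` for men, as in the token-counting cell later in this notebook; the value `'f'` for women is an assumption, so check what your data actually contains. All `toy_` names are hypothetical.

```python
# Made-up records standing in for joined_data.
toy_data = [
    {'cleaned_hm': 'I fixed my bike.', 'gender': 'm'},
    {'cleaned_hm': 'I finished a great book.', 'gender': 'f'},
    {'cleaned_hm': 'My team won the game.', 'gender': 'm'},
]

toy_man_sentences = []
toy_woman_sentences = []
for hm in toy_data:
    if hm['gender'] == 'm':
        toy_man_sentences.append(hm['cleaned_hm'])
    elif hm['gender'] == 'f':  # assumed value for women
        toy_woman_sentences.append(hm['cleaned_hm'])

print(len(toy_man_sentences), len(toy_woman_sentences))
```

Using `elif` rather than `else` keeps any record with a missing or different gender value out of both lists.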

In [None]:
man_sentences = []
woman_sentences = []
# complete the lists of sentences
### YOUR WORK HERE!!!


In [None]:
# create the word cloud for women
### YOUR WORK HERE!!!


In [None]:
# create the word cloud for men
### YOUR WORK HERE!!!


You probably notice a few differences between the word clouds - take a minute to jot some of them down.

Even though there are some differences, you'll probably notice that the word clouds look quite similar overall. Words that don't seem too meaningful, like "made happy" and "got", are large in both word clouds.

There are many ways that we can remove words that aren't meaningful. One typical approach is to use a "stopwords" list, which will include function words like "the", "a", "an", etc.

The wordcloud library actually has a built-in list of stopwords, but we also should filter out some words that are common in happy moments even if they aren't common in written text overall.

### Adding Domain-Specific Stopwords
Try creating a domain-specific stopwords list, using words that you see frequently for _both_ men and women. Because these words are so common overall, they don't have much meaning for happy moments. There are two ways that you can do this; feel free to try both:

1. Manually create a list
1. Create a list of all words that appear in the top 100 for men _and_ women. Those words are likely very common overall. To do so, you'll need to figure out all of the words in the happy moments. This is a process called **tokenization**, which we will explore more later. For now, we've tokenized for you, creating the lists `man_tokens` and `woman_tokens` below.
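
One possible way to approach the second option is to take the top words for each group as sets and intersect them. The sketch below uses `collections.Counter` on tiny made-up token lists; `top_words` is a hypothetical helper, not part of `lib`, and `n` is set to 2 only because the toy lists are so short.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the set of the n most frequent tokens (hypothetical helper)."""
    return {word for word, _ in Counter(tokens).most_common(n)}

# Made-up token lists standing in for man_tokens and woman_tokens.
toy_man_tokens = ['i', 'got', 'a', 'new', 'car', 'i', 'got', 'promoted']
toy_woman_tokens = ['i', 'got', 'flowers', 'i', 'saw', 'my', 'sister', 'got']

# Words that are in the top n for BOTH groups.
shared = top_words(toy_man_tokens, 2) & top_words(toy_woman_tokens, 2)
print(sorted(shared))  # ['got', 'i']
```

With the real token lists you would use a larger `n`, such as 100, and the resulting set could serve as a starting point for a stopwords list.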

Something you may notice is that the word cloud library uses n-grams (short phrases), not just individual words. It might be valuable to include multiple-word phrases like "made happy" in your stopwords list, even though they aren't really individual words!

You may also want to use a special function defined in `lib` to get the most common words in a list of words.

Its usage is `lib.get_most_common_words(word_list, n)`, where `word_list` is a list of words, and `n` is the number of words to return.

Here's an example of the usage:

In [None]:
happy_words = ['happy', 'happy', 'family', 'happy', 'friends', 'work', 'friends']

print(lib.get_most_common_words(happy_words, 2))

If you want to make this more challenging, you can also write your own code to get the most common words from a list. 
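
If you try the challenge, here is one rough sketch of how it could look, using only a plain dictionary. The function name `most_common_words` is made up, and returning `(word, count)` pairs is an assumption; `lib.get_most_common_words` may format its output differently.

```python
def most_common_words(word_list, n):
    """Return the n most frequent words as (word, count) pairs.

    Hypothetical helper; lib.get_most_common_words may return
    its results in a different format.
    """
    counts = {}
    for word in word_list:
        # start at 0 the first time we see a word, then add 1
        counts[word] = counts.get(word, 0) + 1
    # Sort by count, largest first; tied words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

happy_words = ['happy', 'happy', 'family', 'happy', 'friends', 'work', 'friends']
print(most_common_words(happy_words, 2))  # [('happy', 3), ('friends', 2)]
```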

In [None]:
def get_all_tokens(sentences):
    tokens = []
    for sentence in sentences:
        tokens.extend(get_tokens(sentence))
    return tokens

def get_tokens(sentence):
    tokens = []
    for token in nltk.word_tokenize(sentence):
        tokens.append(token.lower())
    return tokens

In [None]:
man_tokens = get_all_tokens(man_sentences)
woman_tokens = get_all_tokens(woman_sentences)

In [None]:
# create your list of stopwords!
### YOUR WORK HERE!!!


After you have your personal stopwords list, pass it to the word cloud function like this: `lib.create_word_cloud(sentences, stop=stop)` (`stop` is an optional parameter).

In [None]:
# create the word cloud for women
### YOUR WORK HERE!!!


In [None]:
# create the word cloud for men
### YOUR WORK HERE!!!


Hopefully, you now see more noticeable differences between the word clouds.

Feel free to play around with the word clouds with different attributes, like age and country!

## Word Count
Something else that might differ between men and women is the number of words they write. Collect the overall average word count, as well as the averages for men and for women. What do you find?

First, you'll need to count the tokens in each happy moment a) overall, b) for women, and c) for men. Create a list that has the number of tokens for each happy moment for each of the groups. To help you out, we've already done it for men. You will need to count for women and overall. You can use the same loop that we already started, or create a new loop!

Then, you'll need to calculate the average values. To do this, you can call `statistics.mean` on the list. We'll have an example of how to do that below.

In [None]:
overall_token_counts = []
woman_token_counts = []
man_token_counts = []

# count tokens for each group
# already written: men
joined_data = lib.load_joined_data()
for hm in joined_data:
    if hm['gender'] == 'm':
        happy_text = hm['cleaned_hm']
        man_token_counts.append(len(get_tokens(happy_text)))

### YOUR WORK HERE!!!

In [None]:
# This is how `statistics.mean` is used:
numbers = [1, 2, 3, 4, 5]
print(statistics.mean(numbers))

In [None]:
# print out the means overall, for women, and for men
### YOUR WORK HERE!!!

Who tends to write more? What about parental status? Do parents write more or less than non-parents? Do the same thing that you just did for men and women: count tokens for each group, then calculate the average.

You can use what you did for men and women as an example here; just look at the `'parenthood'` property instead of `'gender'`.
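
Because the same count-then-average steps apply to any property, one option is to group the counts by property value in a single pass. The sketch below runs on made-up records: the `'y'` and `'n'` values for `'parenthood'` are assumptions, and `str.split()` is used as a simplified stand-in for `get_tokens` so that the example runs without NLTK.

```python
import statistics
from collections import defaultdict

# Made-up records; 'y' and 'n' are assumed values for 'parenthood'.
toy_data = [
    {'cleaned_hm': 'My son took his first steps today', 'parenthood': 'y'},
    {'cleaned_hm': 'I slept in', 'parenthood': 'n'},
    {'cleaned_hm': 'We baked cookies together', 'parenthood': 'y'},
    {'cleaned_hm': 'I passed my exam', 'parenthood': 'n'},
]

# Map each property value to a list of token counts.
counts_by_group = defaultdict(list)
for hm in toy_data:
    # whitespace splitting is a simplified stand-in for get_tokens
    token_count = len(hm['cleaned_hm'].split())
    counts_by_group[hm['parenthood']].append(token_count)

for group, counts in counts_by_group.items():
    print(group, statistics.mean(counts))
```

Swapping `'parenthood'` for another property name would give the averages for that attribute instead.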

In [None]:
parent_token_counts = []
non_parent_token_counts = []

# count up tokens
### YOUR WORK HERE!!!


In [None]:
# print out the means for parents and non-parents
### YOUR WORK HERE!!!


It seems as though parents write a bit more than non-parents! Why do you think that could be?

Ultimately, word count does not tell us much about what makes different groups happy. However, it could be a useful tool when predicting who a happy moment description comes from. We will explore this more later. The word clouds, which show which words are used most frequently when people talk about what makes them happy, do seem to be more informative on that question.