<a href="https://colab.research.google.com/github/lbiester/AI4All-UM-NLP/blob/master/2_Basic_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

    import nltk
    nltk.download('punkt')

In [0]:
%matplotlib inline

import lib
import spacy
import itertools
from collections import Counter
import statistics
import pandas as pd
import nltk

# Basic Language Processing
Now that we know a little bit about the demographics of the workers who helped to produce the dataset, let's start looking at the language! There are a few questions that we want to answer:
* What makes people happy?
* Do the things that cause happiness differ between groups?

We'll start by using some simple techniques to answer the first question!

First, though, we need to pre-process the data. We will build a dictionary that maps each `hmid` to a list of "tokens" from the cleaned happy moment text. You can think of a token as an individual word, although the tokenizer also treats punctuation marks as tokens. To tokenize text, use the function `nltk.word_tokenize` from the NLTK library, and make each token lowercase.

You should also create a list of all tokens in the dataset.

In [0]:
joined_data = lib.load_joined_data()

In [0]:
def get_hm_tokens(joined_data):
    """Map each happy moment's hmid to its list of lowercased tokens."""
    hm_tokens = {}
    for hm in joined_data:
        # Tokenize the text, then lowercase every token
        hm_tokens[hm['hmid']] = [token.lower() for token in nltk.word_tokenize(hm['hm_text'])]
    return hm_tokens

In [0]:
hm_tokens = get_hm_tokens(joined_data)
all_tokens = list(itertools.chain.from_iterable(hm_tokens.values()))
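
To see what these two structures look like, here is a minimal sketch on made-up data. The `simple_tokenize` function is a hypothetical regex-based stand-in for `nltk.word_tokenize` (it is not how NLTK works internally and will split some text differently), and the toy records and their `hmid` values are invented for illustration.

```python
import itertools
import re


def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Made-up happy moments, using the same field names as the cells above
toy_moments = [
    {'hmid': 101, 'hm_text': 'I baked a cake.'},
    {'hmid': 102, 'hm_text': 'My dog learned a new trick!'},
]

# Dictionary from hmid to that moment's tokens
toy_hm_tokens = {hm['hmid']: simple_tokenize(hm['hm_text']) for hm in toy_moments}

# One flat list containing every token from every moment
toy_all_tokens = list(itertools.chain.from_iterable(toy_hm_tokens.values()))

print(toy_hm_tokens[101])   # ['i', 'baked', 'a', 'cake', '.']
print(len(toy_all_tokens))  # 12
```

Notice that the period and the exclamation mark become tokens of their own, so token counts are usually a little higher than word counts.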

## Wordclouds

One way to visualize text data is with a word cloud, which shows us which words appear frequently in the text. Luckily, we don't need to write a bunch of code to display a word cloud - libraries exist to do it already! Our `lib` module has a function that creates word clouds, `lib.create_word_cloud`.

In [0]:
lib.create_word_cloud(all_tokens)
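
A word cloud draws frequent words larger, so it is a picture of word counts. If you want to see the numbers behind the picture, a `Counter` gives you the same information as a table. This is a small sketch on a made-up token list, not the code inside `lib.create_word_cloud`.

```python
from collections import Counter

# Made-up tokens standing in for all_tokens
toy_tokens = ['i', 'went', 'to', 'the', 'park', 'and', 'i', 'saw',
              'my', 'friend', 'at', 'the', 'park', 'i']

# Count how often each token appears
counts = Counter(toy_tokens)

# The three most frequent tokens, as (token, count) pairs
top_three = counts.most_common(3)

print(counts['i'])  # 3
print(top_three)
```

Try `Counter(all_tokens).most_common(20)` on the real data and compare the result with the largest words in the cloud.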

Now let's do something a bit more interesting: later on, we will classify happy moments by whether they were written by a man or a woman. Let's create two word clouds, one from posts written by men and one from posts written by women, and see how they differ. This will require two steps:

1. Separate out tokens in entries written by women and entries written by men
1. Create word clouds of each

In [0]:
man_tokens = []
woman_tokens = []
for hm in joined_data:
    if hm['gender'] == 'm':
        man_tokens.extend(hm_tokens[hm['hmid']])
    elif hm['gender'] == 'f':
        woman_tokens.extend(hm_tokens[hm['hmid']])

In [0]:
lib.create_word_cloud(man_tokens)

In [0]:
lib.create_word_cloud(woman_tokens)

You probably notice a few differences between the word clouds - take a minute to jot some of them down.

Even though there are some differences, you'll probably notice that the word clouds look quite similar overall. Words that don't seem very meaningful, like "made happy" and "got", are large in both word clouds.

There are many ways that we can remove words that aren't meaningful. One typical approach is to use a "stopwords" list, which will include function words like "the", "a", "an", etc.

The wordcloud library actually has a built-in list of stopwords, but we also should filter out some words that are common in happy moments even if they aren't common in written text overall.

In [0]:
# TODO: I'm currently doing this programmatically. I could ask the students to do it this way, or ask them to come up 
# with their own list. There are probably also smarter ways to do this, this is just a starting point

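# The 100 most frequent tokens in each group
common_female_words = set(k for k, v in Counter(woman_tokens).most_common(100))
common_male_words = set(k for k, v in Counter(man_tokens).most_common(100))
# Words that are very common in BOTH groups don't help us tell the groups apart
stop = common_female_words.intersection(common_male_words)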
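
Here is the same idea on a tiny made-up example, so you can check the result by hand. The helper name `shared_common_words` and the two toy token lists are hypothetical. The last step filters the tokens ourselves, just to show what a stopwords list is for; it is an illustration and not necessarily how `lib.create_word_cloud` handles its `stop` argument.

```python
from collections import Counter


def shared_common_words(tokens_a, tokens_b, n):
    """Return the words that are among the n most frequent in both token lists."""
    top_a = {word for word, _ in Counter(tokens_a).most_common(n)}
    top_b = {word for word, _ in Counter(tokens_b).most_common(n)}
    return top_a & top_b


# Made-up tokens for two groups
group_a = ['i', 'got', 'a', 'dog', 'i', 'got', 'a', 'raise', 'i', 'got']
group_b = ['i', 'got', 'a', 'hug', 'i', 'got', 'a', 'book', 'i', 'got']

toy_stop = shared_common_words(group_a, group_b, 3)

# Keep only the tokens that are not in the stopwords set
filtered_a = [token for token in group_a if token not in toy_stop]

print(sorted(toy_stop))  # ['a', 'got', 'i']
print(filtered_a)        # ['dog', 'raise']
```

After filtering, only the words that set the group apart are left, which is what we hope to see in the new word clouds.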

After you have your own stopwords list, pass it to the word cloud function like this: `lib.create_word_cloud(tokens, stop=stop)` (`stop` is an optional parameter).

In [0]:
lib.create_word_cloud(man_tokens, stop=stop)

In [0]:
lib.create_word_cloud(woman_tokens, stop=stop)

Hopefully, you now see more noticeable differences between the word clouds.

Feel free to play around with the word clouds with different attributes, like age and country!
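
If you want to try other attributes, a small helper can save you from rewriting the loop each time. This is one possible sketch: the function name `group_tokens_by` is made up, and the `'country'` field name and its values in the toy records are assumptions, so check what the keys in `joined_data` are actually called before using it.

```python
from collections import defaultdict


def group_tokens_by(records, tokens_by_id, attribute):
    """Collect the tokens of all records that share the same value of `attribute`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[attribute]].extend(tokens_by_id[record['hmid']])
    return dict(grouped)


# Made-up records; the 'country' key and its values are assumptions
toy_records = [
    {'hmid': 1, 'country': 'USA'},
    {'hmid': 2, 'country': 'IND'},
    {'hmid': 3, 'country': 'USA'},
]
toy_tokens = {1: ['i', 'ate', 'cake'], 2: ['my', 'son', 'smiled'], 3: ['we', 'won']}

tokens_by_country = group_tokens_by(toy_records, toy_tokens, 'country')

print(tokens_by_country['USA'])  # ['i', 'ate', 'cake', 'we', 'won']
```

With the real data, you could then pass one group's tokens to the word cloud function, for example `lib.create_word_cloud(tokens_by_country['USA'], stop=stop)`.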

## Word Count
Something else that might differ between men and women is the number of words they write. Compute the overall average word count (the number of tokens per happy moment), as well as the average for men and the average for women. What do you find?

In [0]:
print('Overall average:', statistics.mean(len(tokens) for tokens in hm_tokens.values()))
print('Male average:', statistics.mean(len(hm_tokens[hm['hmid']]) for hm in joined_data if hm['gender'] == 'm'))
print('Female average:', statistics.mean(len(hm_tokens[hm['hmid']]) for hm in joined_data if hm['gender'] == 'f'))

Who tends to write more? What about parental status? Do parents write more or less than non-parents?

In [0]:
print('Parent average:', statistics.mean(len(hm_tokens[hm['hmid']]) for hm in joined_data if hm['parenthood'] == 'y'))
print('Non-parent average:', statistics.mean(len(hm_tokens[hm['hmid']]) for hm in joined_data if hm['parenthood'] == 'n'))

It seems as though parents write a bit more than non-parents! Why do you think that could be?
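
The averages above all follow the same pattern, so you could also compute them for every value of an attribute at once. Here is a minimal sketch on made-up data. The helper name `mean_length_by` is hypothetical, and the toy records reuse the `'parenthood'` field from the cell above.

```python
import statistics
from collections import defaultdict


def mean_length_by(records, tokens_by_id, attribute):
    """Average number of tokens per record, for each value of `attribute`."""
    lengths = defaultdict(list)
    for record in records:
        lengths[record[attribute]].append(len(tokens_by_id[record['hmid']]))
    return {value: statistics.mean(counts) for value, counts in lengths.items()}


# Made-up records and token lists
toy_records = [
    {'hmid': 1, 'parenthood': 'y'},
    {'hmid': 2, 'parenthood': 'n'},
    {'hmid': 3, 'parenthood': 'y'},
]
toy_tokens = {1: ['a'] * 6, 2: ['b'] * 3, 3: ['c'] * 4}

averages = mean_length_by(toy_records, toy_tokens, 'parenthood')

print(averages['y'])  # 5
print(averages['n'])  # 3
```

Because a group is only created when at least one record belongs to it, `statistics.mean` never receives an empty list here. It would raise an error if it did.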

Ultimately, word count does not tell us much about what makes different groups happy. However, it could be a useful tool when predicting who wrote a happy moment description, and we will explore this more later. The word clouds, which show the words people use most often when they describe what makes them happy, seem to tell us more about the content itself.