
In [1]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'
    import nltk
    nltk.download('punkt')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/AI4All-UM-NLP
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
%load_ext autoreload
%autoreload 2
import lib
from collections import Counter
import numpy as np

# Loading the Data and Checking Demographics
We have two data files: demographic.csv and cleaned_hm.csv.

* demographic.csv contains demographic information about the individuals who are represented in the dataset.  
  The dataset can be loaded as a list of dictionaries by calling `lib.load_demographics()`. The key in the dictionary is the worker ID (wid).

* cleaned_hm.csv contains 100,000 crowd-sourced happy moments. Worker IDs listed correspond to `wid` in demographic.csv.  
  The dataset can be loaded as a list of dictionaries by calling `lib.load_happy_moments()`. The key in the dictionary is the happy moment ID (hmid).
  
You may find `lib.load_joined_data()` to be particularly useful, as it will load all of the data you will need without you needing to combine the two tables together! The format is a list of dictionaries.
  
We will load the data files here using the pandas library so that you can see what each file looks like. However, you aren't expected to learn how to work with pandas, so we provide functions that load the data as lists of dictionaries.
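To make the idea of combining the two tables concrete, here is a minimal sketch of a join on `wid`, using a few rows copied from the previews below. The helper `join_on_wid` and the toy lists are hypothetical; this is not how `lib.load_joined_data()` is implemented, and the keys in the real dictionaries are assumed to match the column names.

```python
def join_on_wid(demographics, happy_moments):
    """Attach each worker's demographic fields to that worker's happy moments."""
    # index the workers by ID so each lookup is a single dictionary access
    workers_by_id = {worker['wid']: worker for worker in demographics}
    joined = []
    for moment in happy_moments:
        worker = workers_by_id.get(moment['wid'])
        if worker is None:
            continue  # no demographic record for this worker, so skip the moment
        joined.append({**worker, **moment})
    return joined


# toy rows (hypothetical variables; values taken from the preview tables)
toy_demographics = [
    {'wid': 2, 'age': 29.0, 'country': 'IND'},
    {'wid': 3, 'age': 25.0, 'country': 'IND'},
]
toy_moments = [
    {'hmid': 27674, 'wid': 2, 'predicted_category': 'affection'},
    {'hmid': 27681, 'wid': 3, 'predicted_category': 'enjoy_the_moment'},
    {'hmid': 27675, 'wid': 1936, 'predicted_category': 'exercise'},
]

joined = join_on_wid(toy_demographics, toy_moments)
for row in joined:
    print(row)
```

The third toy moment is dropped because worker 1936 is not in the toy demographics list.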

In [3]:
lib.view_demographics()

Unnamed: 0,wid,age,country,gender,marital,parenthood
0,1,37.0,USA,m,married,y
1,2,29.0,IND,m,married,y
2,3,25.0,IND,m,single,n
3,4,32.0,USA,m,married,y
4,5,29.0,USA,m,married,y
5,6,35.0,IND,m,married,y
6,7,34.0,USA,m,married,y
7,8,29.0,VNM,m,single,n
8,9,61.0,USA,f,married,y
9,10,27.0,USA,m,single,n


In [4]:
lib.view_happy_moments()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection
5,27678,45,24h,I meditated last night.,I meditated last night.,True,1,leisure,leisure
6,27679,195,24h,"I made a new recipe for peasant bread, and it ...","I made a new recipe for peasant bread, and it ...",True,1,,achievement
7,27680,740,24h,I got gift from my elder brother which was rea...,I got gift from my elder brother which was rea...,True,1,,affection
8,27681,3,24h,YESTERDAY MY MOMS BIRTHDAY SO I ENJOYED,YESTERDAY MY MOMS BIRTHDAY SO I ENJOYED,True,1,,enjoy_the_moment
9,27682,4833,24h,Watching cupcake wars with my three teen children,Watching cupcake wars with my three teen children,True,1,,affection


Right now, we will only be working with the demographics file. Load it with the function that is mentioned above.

In [0]:
demographics = lib.load_demographics()

## Aggregating Demographic Information
This data was crowdsourced, and demographic information was collected about workers' ages, countries, genders, marital status, and parenthood. To better understand the dataset, fill in the function called `get_distribution` to calculate the percentage of workers in each category for a given property.

In [0]:
def get_distribution(demographics, worker_property):
    counts = Counter([worker[worker_property] for worker in demographics])
    total = sum(counts.values())
    distribution = {category: (100 * count / total) for category, count in counts.items()}
    # return a dictionary that maps a property to a percent
    return distribution
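As a quick check of the counting logic, the sketch below applies the same steps to the marital status of the ten workers shown in the preview table above. The names `percent_distribution` and `toy_marital` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def percent_distribution(values):
    """Map each distinct value to the percentage of the list it makes up."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


# marital status of the first ten workers in the preview table
toy_marital = ['married', 'married', 'single', 'married', 'married',
               'married', 'married', 'single', 'married', 'single']

toy_distribution = percent_distribution(toy_marital)
print(toy_distribution)  # {'married': 70.0, 'single': 30.0}
```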

After writing your function, run the cell below to save each distribution.

In [0]:
age_distribution = get_distribution(demographics, 'age')
country_distribution = get_distribution(demographics, 'country')
gender_distribution = get_distribution(demographics, 'gender')
marital_distribution = get_distribution(demographics, 'marital')
parenthood_distribution = get_distribution(demographics, 'parenthood')

Now, print out the distributions for marital status, country, and age. You can pass your dictionary to the function `lib.print_as_table` to print a table containing the distribution for better readability. The function takes two arguments: the dictionary and the title for the table.

In [0]:
lib.print_as_table(age_distribution, 'Age Distribution')
lib.print_as_table(country_distribution, 'Country Distribution')
lib.print_as_table(marital_distribution, 'Marital Distribution')

If you wrote a fairly simple function to create your dictionary, you may notice some issues with these tables that make them less informative than we would like:
1. There are a lot of ages! Furthermore, some are represented as floats and some as ints, but 25 and 25.0 should mean the same thing!
2. There are also a lot of countries, but only two of them (USA and India) are very prevalent.
3. There are some _weird_ unwanted values like nan (not a number, which means that the value was not filled in in the table) and "prefer not to say" for age.
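The sketch below shows one way to normalize such values before counting them. The helper `parse_age` and the sample list `raw_ages` are hypothetical, and the exact types in the loaded data are an assumption. Note that nan is never equal to itself, so it has to be detected with `math.isnan` rather than `==`.

```python
import math


def parse_age(raw):
    """Return the age as an int, or None if the value is missing or not a number."""
    try:
        number = float(raw)
    except (TypeError, ValueError):
        # e.g. None or 'prefer not to say'
        return None
    if math.isnan(number) or math.isinf(number):
        return None
    return int(number)


nan = float('nan')
print(nan == nan)       # False: nan cannot be found with an equality check
print(math.isnan(nan))  # True

# hypothetical raw values covering the cases described above
raw_ages = [25, 25.0, '25', '25.0', nan, 'prefer not to say', None]
parsed = [parse_age(raw) for raw in raw_ages]
print(parsed)  # [25, 25, 25, 25, None, None, None]
```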

To solve these problems, we will write three more functions:
1. `get_age_distribution`  
   This function will get the distribution of ages using ranges instead of single ages. You can use the buckets 10-20, 20-30, ..., 80-90. If you come across a value that does not fit in one of the ranges, skip it!
2. `get_country_distribution`  
   This function will get the distribution of countries, but will group together all countries with fewer than 0.4% of the overall workers into one group that you should call "OTHER". You should exclude nan values.
3. `get_distribution_new`  
   This function should be the same as your original `get_distribution` function, but should ignore nan values!
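The bucketing step in the first function can be sketched with integer division. The helper `age_bucket` and its parameters are hypothetical, and the sketch assumes that a boundary age such as 20 belongs to the bucket that starts at it (20-30).

```python
def age_bucket(age, low=10, high=90, width=10):
    """Return a label such as '30-40' for an integer age, or None if out of range."""
    if not low <= age < high:
        return None  # the value does not fit in any bucket, so the caller skips it
    start = (age // width) * width  # round down to the nearest multiple of width
    return f'{start}-{start + width}'


for age in [37, 20, 89, 5, 90]:
    print(age, age_bucket(age))
# 37 30-40
# 20 20-30
# 89 80-90
# 5 None
# 90 None
```

Checking the lower bound explicitly matters: without it, an age below 10 would produce a negative index, and Python would silently count it in the last bucket.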

In [0]:
def get_age_distribution(demographics):
    buckets = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90']
    counts = Counter()
    for worker in demographics:
        try:
            # handles ints, floats, and numeric strings such as '25' or '25.0'
            age = int(float(worker['age']))
        except (TypeError, ValueError, OverflowError):
            # nan, "prefer not to say", and other bad input cannot be converted
            continue
        # ages 10-19 map to index 0, ages 20-29 to index 1, and so on
        bucket_id = age // 10 - 1
        # skip ages below 10 (a negative index would wrap around) and ages of 90 or above
        if 0 <= bucket_id < len(buckets):
            counts[buckets[bucket_id]] += 1
    total = sum(counts.values())
    distribution = {category: (100 * count / total) for category, count in counts.items()}
    return distribution


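def get_country_distribution(demographics):
    # exclude missing values: nan is a float, while real country codes are strings
    known = [worker for worker in demographics if isinstance(worker['country'], str)]
    distribution = get_distribution(known, 'country')
    modified_distribution = {}
    other = 0
    for country, percent in distribution.items():
        # group countries with fewer than 0.4% of the workers into "OTHER"
        if percent < 0.4:
            other += percent
        else:
            modified_distribution[country] = percent
    modified_distribution['OTHER'] = other
    return modified_distribution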

def get_distribution_new(demographics, worker_property):
    # drop workers whose value for this property is missing (nan); strings are always kept
    demographics = [worker for worker in demographics
                    if isinstance(worker[worker_property], str) or not np.isnan(worker[worker_property])]
    # if this is called with one of the properties that we have a special function for, call that function instead!
    if worker_property == 'age':
        return get_age_distribution(demographics)
    elif worker_property == 'country':
        return get_country_distribution(demographics)
    return get_distribution(demographics, worker_property)

In [0]:
age_distribution = get_distribution_new(demographics, 'age')
country_distribution = get_distribution_new(demographics, 'country')
gender_distribution = get_distribution_new(demographics, 'gender')
marital_distribution = get_distribution_new(demographics, 'marital')
parenthood_distribution = get_distribution_new(demographics, 'parenthood')

In [0]:
lib.print_as_table(age_distribution, 'Age Distribution')
lib.print_as_table(country_distribution, 'Country Distribution')
lib.print_as_table(marital_distribution, 'Marital Distribution')
lib.print_as_table(gender_distribution, 'Gender Distribution')
lib.print_as_table(parenthood_distribution, 'Parenthood Distribution')

## Distribution Visualizations

Finally, using the library functions `lib.create_histogram` and `lib.create_pie`, create histograms and pie charts for our properties. Create a histogram for age; for the other properties, create pie charts. Both functions take two parameters: the distribution and a title.

In [0]:
lib.create_histogram(age_distribution, 'Age Distribution')
lib.create_pie(country_distribution, 'Country Distribution')
lib.create_pie(marital_distribution, 'Marital Distribution')
lib.create_pie(gender_distribution, 'Gender Distribution')
lib.create_pie(parenthood_distribution, 'Parenthood Distribution')
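For a quick look at a distribution without plotting, a text bar chart is enough. The helper `text_bar_chart` below is a hypothetical sketch, not part of `lib`, and it assumes the distribution maps categories to percentages as in the functions above.

```python
def text_bar_chart(distribution, title, width=40):
    """Render a {category: percent} dictionary as a plain-text bar chart."""
    lines = [title]
    # sort by label so that age buckets appear in increasing order
    for category, percent in sorted(distribution.items(), key=lambda item: str(item[0])):
        # a bar of `width` characters would represent 100%
        bar = '#' * round(percent * width / 100)
        lines.append(f'{str(category):>8} | {bar} {percent:.1f}%')
    return '\n'.join(lines)


# toy distribution (hypothetical values for illustration)
chart = text_bar_chart({'single': 30.0, 'married': 70.0}, 'Marital Distribution')
print(chart)
```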