In [0]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd '/content/drive/My Drive/AI4All-UM-NLP'

In [0]:
%load_ext autoreload
%autoreload 2
import lib
from collections import Counter
import numpy as np

## Using Lib
In addition to outside libraries, we have provided a library called `lib` that contains a few functions we have defined. It will be useful for loading data and for some visualizations. If you want to understand the code, you can find it in the Google Drive directory under `lib.py`, but you don't need to!

# Loading the Data and Checking Demographics
We have two data files: demographic.csv and cleaned_hm.csv.

* demographic.csv contains demographic information about the individuals who are represented in the dataset.  
  The dataset can be loaded as a list of dictionaries by calling `lib.load_demographics()`. The keys are the columns of the table: `wid`, `age`, `country`, `gender`, `marital`, and `parenthood`.

* cleaned_hm.csv contains 100,000 crowd-sourced happy moments. The worker IDs listed correspond to `wid` in demographic.csv.  
  The dataset can be loaded as a list of dictionaries by calling `lib.load_happy_moments()`. The keys are the columns of the table: `hmid`, `wid`, `reflection_period`, `original_hm`, `cleaned_hm`, `modified`, `num_sentence`, `ground_truth_category`, and `predicted_category`.
  
You may find `lib.load_joined_data()` to be particularly useful, as it will load all of the data you will need without you needing to combine the two tables together! The format is a list of dictionaries. The keys are the "core" keys from the other tables that you will use later on: `cleaned_hm`, `age`, `country`, `wid`, `gender`, `parenthood`, `marital`, and `hmid`.
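
To make the "list of dictionaries" format concrete, here is a minimal sketch using two made-up rows. The values (and the short codes used for gender, parenthood, and country) are assumptions for illustration only; the real data may look different.

```python
# Hypothetical rows that mimic the shape described above; all values are made up.
joined_sample = [
    {"hmid": 1, "wid": 10, "cleaned_hm": "I went for a walk.", "age": "29",
     "country": "USA", "gender": "f", "marital": "single", "parenthood": "n"},
    {"hmid": 2, "wid": 11, "cleaned_hm": "I cooked dinner with my family.", "age": "41",
     "country": "IND", "gender": "m", "marital": "married", "parenthood": "y"},
]

# Each element of the list is one row; a column is looked up by its key.
first_row = joined_sample[0]
print(first_row["country"])

# Looping over the list visits every row in order.
countries = [row["country"] for row in joined_sample]
print(countries)
```

The same indexing and looping pattern should apply to whatever the `lib` loading functions return, since they are described as returning lists of dictionaries.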
  
We will load the data files here using the pandas library so that you can see what each file looks like. However, you aren't expected to learn how to work with pandas, so we provide functions that load the data as a list of dictionaries.

In [0]:
lib.view_demographics()

In [0]:
lib.view_happy_moments()

Right now, we will only be working with the demographics file. Load it with the function that is mentioned above.

In [0]:
demographics = ???

## Aggregating Demographic Information
This data was crowdsourced, and demographic information was collected about workers' ages, countries, genders, marital statuses, and parenthood. To better understand the dataset, fill in the function called `get_distribution` to calculate the distribution of workers in each category for a given property.

Hint (highlight text to see): <font color='white'>Count each value, then calculate a percentage afterwards.</font>
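
`Counter` was imported at the top of the notebook. As a rough sketch of what it does, here it is applied to a made-up list that has nothing to do with the dataset. Whether you use `Counter` or a plain dictionary in your function is up to you.

```python
from collections import Counter

# Made-up values, unrelated to the dataset.
colors = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(colors)        # maps each distinct value to how many times it appears
total = sum(counts.values())    # total number of items that were counted

print(counts["red"], total)
print(counts.most_common(1))    # the single most frequent value with its count
```

A `Counter` behaves like a dictionary, so you can loop over its keys and values the same way.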

In [0]:
def get_distribution(demographics, worker_property):
    ???
    # return a dictionary that maps each value of the property to a percent
    return distribution

After writing your function, run the cell below to save each distribution. Add calls to save all of the distributions!

In [0]:
age_distribution = get_distribution(demographics, 'age')
# save country, gender, and marital distributions too!
# YOUR CODE HERE!!


Now, print out the distributions for marital status, country, and age. You can pass your dictionary to the function `lib.print_as_table` to print a table containing the distribution for better readability. The function takes two arguments: the dictionary and the title for the table. For example, to call it for age, you could write:

`lib.print_as_table(age_distribution, 'Age Distribution')`

In [0]:
# print out distributions here!

If you wrote a fairly simple function to create your dictionary, you may notice some issues with these tables that make them less informative than we would like:
1. There are a lot of ages! Furthermore, some are represented as floats and some as ints, but 25 and 25.0 should mean the same thing!
1. There are also a lot of countries, but only two of them (USA and India) are very prevalent.
1. There are some _weird_ unwanted values, like nan (not a number, which means that the field was left blank in the table) and "prefer not to say" for age.

To solve these problems, we will write three more functions:
1. `get_age_distribution`  
   This function will get the distribution of ages using ranges instead of single ages. You can use the buckets 10–20, 20–30, ..., 80–90. If you come across a value that does not fit in one of the ranges, skip it!
2. `get_country_distribution`  
   This function will get the distribution of countries, but will group together all countries with fewer than 0.4% of the overall workers into one group that you should call "OTHER". You should exclude nan values.
3. `get_distribution_new`  
   This function should be the same as your original `get_distribution` function, but should ignore nan values!
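
Two of the problems above come down to cleaning individual values before counting them. The sketch below shows one possible way to do that on a made-up list. The helper names `to_int_or_none` and `is_present` are hypothetical, and `math.isnan` is used as a standard-library stand-in for `np.isnan`.

```python
import math

def to_int_or_none(value):
    """Hypothetical helper: return value as an int, or None if it cannot be converted."""
    try:
        # float() first, so that both "25" and "25.0" end up as 25
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        # text such as "prefer not to say", as well as nan, ends up here
        return None

def is_present(value):
    """Hypothetical helper: False only when value is a float nan."""
    # strings are never nan, so they are checked first and isnan is skipped for them
    return type(value) == str or not math.isnan(value)

# Made-up values that mimic the kinds of entries described above.
raw_values = ["25", 25.0, "prefer not to say", float("nan"), 31]

cleaned = [to_int_or_none(v) for v in raw_values]
print(cleaned)

present = [is_present(v) for v in raw_values]
print(present)
```

Note that `is_present` only filters out nan; a string like "prefer not to say" still counts as present, which is why the age function needs the extra conversion step.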

In [0]:
def get_age_distribution(demographics):
    # these are the buckets that you should use. Use any method you want to see if an age falls within a bucket.
    buckets = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90']
    # use a dictionary to count workers
    # you will want to convert values to integers. There is at least one value that is not a number and is invalid
    # try to see if you can ignore this value. if you don't know how, talk to a TA!
    return distribution


def get_country_distribution(demographics):
    # create a distribution, but group every country with less than 0.4% of workers into "OTHER"
    modified_distribution = {}
    return modified_distribution

def get_distribution_new(demographics, worker_property):
    # hint: this expression is True when a value is NOT nan: type(val) == str or not np.isnan(val)
    # make this return the distribution
    return {}

In [0]:
# call the new get_distribution functions to save your age, country, marital, gender, and parenthood distributions

In [0]:
# call `lib.print_as_table` on the new distributions, to see if your new functions helped

## Distribution Visualizations

Finally, using the library functions `lib.create_histogram` and `lib.create_pie`, create histograms and pie charts for our properties. Create a histogram for age; for the other properties, create pie charts. Both functions take two parameters: a distribution and a title. For example, to create the histogram for age, you could write:

`lib.create_histogram(age_distribution, 'Age Distribution')`


In [0]:
# create a histogram for age and pie charts for the other properties