In [None]:
%load_ext autoreload
%autoreload 2
import lib
from collections import Counter

# Loading the Data and Checking Demographics
We have two data files: demographic.csv and cleaned_hm.csv.

* demographic.csv contains demographic information about the individuals who are represented in the dataset.  
  The dataset can be loaded into a pandas dataframe by calling `lib.load_demographics()`. The key in the dictionary is the worker ID (wid).

* cleaned_hm.csv contains 100,000 crowd-sourced happy moments. Worker IDs listed correspond to `wid` in demographic.csv.  
  The dataset can be loaded into a pandas dataframe by calling `lib.load_happy_moments()`. The key in the dictionary is the happy moment ID (hmid).
  
We will load the data files here using the pandas library so that you can see what each file looks like. You aren't expected to learn how to work with pandas, so we provide functions to load the data as a dictionary.

In [None]:
lib.view_demographics()

In [None]:
lib.view_happy_moments()

So that you can process them, load the two data files using the functions mentioned above.

In [None]:
demographics = lib.load_demographics()
happy_moments = lib.load_happy_moments()

## Aggregating Demographic Information
This data was crowdsourced, and demographic information was collected about workers' ages, countries, genders, marital statuses, and parenthood. To better understand the dataset, fill in the function called `get_distribution` to calculate, for a given property, the percentage of workers in each category.
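
As a minimal illustration of the calculation, here is the same idea applied to a toy list. The values in `marital` are made up for this sketch and do not come from demographic.csv.

```python
from collections import Counter

# Hypothetical toy values standing in for one demographic column.
marital = ['single', 'married', 'single', 'divorced']

counts = Counter(marital)       # number of workers in each category
total = sum(counts.values())    # number of workers counted overall
toy_distribution = {category: 100 * count / total for category, count in counts.items()}

print(toy_distribution)
```

The percentages in a distribution built this way always add up to 100, which is a quick sanity check for your own function.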

In [None]:
def get_distribution(demographics, worker_property):
    counts = Counter(demographics[worker_property])
    total = sum(counts.values())
    distribution = {category: (100 * count / total) for category, count in counts.items()}
    # return a dictionary that maps each category of the property to its percent of workers
    return distribution

After writing your function, run the cell below to save each distribution.

In [None]:
age_distribution = get_distribution(demographics, 'age')
country_distribution = get_distribution(demographics, 'country')
gender_distribution = get_distribution(demographics, 'gender')
marital_distribution = get_distribution(demographics, 'marital')
parenthood_distribution = get_distribution(demographics, 'parenthood')

Now, print out the distributions for marital status, country, and age. You can pass your dictionary to the function `lib.print_as_table` to print a table containing the distribution for better readability. The function takes two arguments: the dictionary and the title for the table.

In [None]:
lib.print_as_table(age_distribution, 'Age Distribution')
lib.print_as_table(country_distribution, 'Country Distribution')
lib.print_as_table(marital_distribution, 'Marital Distribution')

If you wrote a fairly simple function to create your dictionary, you may notice some issues with these tables that make them less informative than we would like:
1. There are a lot of ages! Furthermore, some are represented as floats and some are ints, but 25 and 25.0 should mean the same thing!
1. There are also a lot of countries, but only two of them (USA and India) are very prevalent.
1. There are some _weird_ unwanted values like nan (not a number, which means that the field was left blank in the table) and "prefer not to say" for age.
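
The first and third issues can be reproduced on a toy list. This is a sketch under the assumption that ages arrive as strings, with a float nan for blank fields; `raw_ages` and `normalize_age` are hypothetical names introduced here, not part of `lib`.

```python
from collections import Counter

# Hypothetical raw values: strings for filled-in fields, float nan for a blank one.
raw_ages = ['25', '25.0', '31', 'prefer not to say', float('nan')]

# Counting the raw values keeps '25' and '25.0' as two separate keys.
raw_counts = Counter(raw_ages)

def normalize_age(value):
    """Return the age as an int, or None if the value is not a usable number."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None

# Drop unusable values, then count again on the normalized ages.
clean_ages = [age for age in map(normalize_age, raw_ages) if age is not None]
clean_counts = Counter(clean_ages)
print(clean_counts)
```

After normalization, '25' and '25.0' fall under the same key, and both nan and "prefer not to say" are dropped.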

To solve these problems, we will write three more functions:
1. `get_age_distribution`  
   This function will get the distribution of ages using a range instead of using single ages. You can use the buckets 10-20, 20-30, ..., 80-90. If you come across a value that does not fit in one of the ranges, skip it!
2. `get_country_distribution`  
   This function will get the distribution of countries, but will group together all countries with fewer than 0.4% of the overall workers into one group that you should call "OTHER". You should exclude nan values.
3. `get_distribution_new`  
   This function should be the same as your original `get_distribution` function, but should ignore nan values!
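
Because the bucket labels share their endpoints (20 appears in both 10-20 and 20-30), it helps to settle where a boundary age goes. The sketch below assumes each bucket includes its lower bound and excludes its upper bound; `age_bucket` is a hypothetical helper, not something `lib` provides.

```python
def age_bucket(age):
    """Return a label such as '20-30' for integer ages from 10 to 89, else None."""
    if not 10 <= age < 90:
        return None                 # outside every bucket, so the caller should skip it
    lower = (age // 10) * 10        # round down to the nearest multiple of ten
    return f'{lower}-{lower + 10}'

for age in (9, 10, 20, 25, 89, 90):
    print(age, age_bucket(age))
```

Under this convention an age of 20 lands in 20-30, and anything below 10 or at 90 and above is skipped.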

In [None]:
def get_age_distribution(demographics):
    buckets = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90']
    counts = Counter()
    for age in demographics['age']:
        try:
            age = int(float(age))
        except (TypeError, ValueError, OverflowError):
            # the value is missing or not a number (e.g. nan or "prefer not to say"), so skip it
            continue
        # ages 10-19 map to index 0, ages 20-29 to index 1, and so on
        bucket_id = age // 10 - 1
        # the lower bound matters: a negative index would silently wrap around to the last bucket
        if 0 <= bucket_id < len(buckets):
            counts[buckets[bucket_id]] += 1
    total = sum(counts.values())
    distribution = {category: (100 * count / total) for category, count in counts.items()}
    return distribution


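def get_country_distribution(demographics):
    # exclude nan values so that this also works when called directly
    demographics = demographics[demographics['country'].notnull()]
    distribution = get_distribution(demographics, 'country')
    modified_distribution = {}
    other = 0
    for country, percent in distribution.items():
        # countries below 0.4% of the workers are pooled into "OTHER"
        if percent < 0.4:
            other += percent
        else:
            modified_distribution[country] = percent
    modified_distribution['OTHER'] = other
    return modified_distribution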

def get_distribution_new(demographics, worker_property):
    demographics = demographics[demographics[worker_property].notnull()]
    # if this is called with one of the properties that we have a special function for, call that function instead!
    if worker_property == 'age':
        return get_age_distribution(demographics)
    elif worker_property == 'country':
        return get_country_distribution(demographics)
    return get_distribution(demographics, worker_property)

In [None]:
age_distribution = get_distribution_new(demographics, 'age')
country_distribution = get_distribution_new(demographics, 'country')
gender_distribution = get_distribution_new(demographics, 'gender')
marital_distribution = get_distribution_new(demographics, 'marital')
parenthood_distribution = get_distribution_new(demographics, 'parenthood')

In [None]:
lib.print_as_table(age_distribution, 'Age Distribution')
lib.print_as_table(country_distribution, 'Country Distribution')
lib.print_as_table(marital_distribution, 'Marital Distribution')
lib.print_as_table(gender_distribution, 'Gender Distribution')
lib.print_as_table(parenthood_distribution, 'Parenthood Distribution')

<span style="color:red">TODO:</span> add histograms!
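
Until proper plots are added, one possible plain-text placeholder is sketched below. It assumes a distribution dictionary that maps each category to a percent, as produced above; `text_histogram` and the `sample` values are hypothetical and not part of `lib`.

```python
def text_histogram(distribution, title, width=40):
    """Build a text bar chart; each bar's length is proportional to its percent."""
    lines = [title]
    label_width = max((len(str(category)) for category in distribution), default=0)
    # Largest categories first, so the chart reads from most to least common.
    for category, percent in sorted(distribution.items(), key=lambda item: -item[1]):
        bar = '#' * round(width * percent / 100)
        lines.append(f'{str(category):<{label_width}} | {bar} {percent:.1f}%')
    return '\n'.join(lines)

# Made-up percentages, used only to show the output format.
sample = {'married': 40.0, 'single': 55.0, 'divorced': 5.0}
print(text_histogram(sample, 'Marital Distribution'))
```

A plotting library would give a nicer result; this version only needs the standard library.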

In [None]:
int(float(demographics['age'][0]))

In [None]:
int(float('445'))