# Measuring Diversity

This notebook leads you through a simple analysis of real student demographic data from UC Berkeley. Because the data is straightforward, we will focus on data manipulation and presentation.

A simple question to ask is: how does the ethnic and gender makeup of an academic department compare to the makeup of Berkeley as a whole (or California, or the US...)? Using data from Cal Answers, let's try to answer that question.

We have data going back to 2005 with gender and (coarse) ethnicity broken down by various academic units. A skeleton for comparing one department/unit to Berkeley has been created.

In groups, you'll need to:
- choose to look at gender, ethnicity or a combination across units,
- automate .csv loads from a folder,
- run the analysis for every department/group,
- make one or more summary plots.

As a reminder, IPython notebooks are organized by "cells." Each cell can have its own code and can be run independently and in any order (although cells are usually run top to bottom). To run a cell and move to the next cell, press ```Shift+Enter```. To run a cell and stay on that cell, press ```Control+Enter```.


Questions to be discussed in groups are highlighted in <font color='green'>green</font>. If you don't understand a function that is used, try googling something like "python function-name".

## Homework
Your homework will be to post a picture of your results on Piazza along with a brief description of the analysis you have done. You might also comment on the implications of your analysis on diversity, equity, etc.

In [None]:
import os # library that deals with operating system
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Loading data

In [None]:
folder = 'data'
filename = 'CensusEquityComparisonData-LettersAndSciences.csv'
baseline = 'CensusEquityComparisonData-CampusTotal.csv'
df = pd.read_csv(os.path.join(folder, filename))
df_baseline = pd.read_csv(os.path.join(folder, baseline))

## Inspect the data

In [None]:
cols = df.columns
print(cols)
df

In [None]:
semesters = df[cols[1]]

In [None]:
semesters = set(semesters)

In [None]:
semesters = list(semesters)

In [None]:
semesters = sorted(semesters)

In [None]:
semesters

<font color='green'>
1) What does each line above do to 'semesters'?<br>
You can run them individually and print semesters between lines.</font>

In [None]:
genders = list(set(df[cols[3]]))

# Example Analysis
We'll first look at how the College of Letters and Science (L&S) compares to the campus as a whole. We'll restrict this analysis to gender for now.

In [None]:
def headcount_percent(df, semester, col, attrs):
    """
    Extract headcount percentages for specific attributes from a column.
    
    Parameters
    ----------
    df : dataframe
        Dataframe containing data.
    semester : str
        String for semester
    col : str
        Column to check attributes from.
    attrs : list of str
        List of attributes to select and count.
        
    Return
    ------
    Array of percentages for each attribute in attrs.
    """
    indxs = []
    # What does this loop do?
    for attr in attrs:
        indxs.append((df[col] == attr) & (df['Semester Year Letter Cd Concat'] == semester))
    counts = [df['Student Headcount'].loc[indx].sum() for indx in indxs]
    total = sum(counts)
    return 100.*np.array([float(count)/total for count in counts])

def distance(data, baseline):
    """
    Compute the euclidean distance between two data arrays.
    
    Parameters
    ----------
    data : array
        Array of data.
    baseline : array
        Array of baseline data, should be the same shape as data.
        
    Return
    ------
    Euclidean distance between the arrays.
    """
    return np.linalg.norm(data-baseline)

campus_vs_LS = {} # dictionary for data
for semester in semesters:
    data = headcount_percent(df, semester, cols[3], genders)
    # Use a new name so the `baseline` filename defined earlier is not overwritten.
    baseline_pct = headcount_percent(df_baseline, semester, cols[3], genders)
    campus_vs_LS[semester] = distance(data, baseline_pct)

In [None]:
campus_vs_LS

<font color='green'>
Read through the functions ```headcount_percent``` and ```distance```.<br>
1) What do they do?<br>
2) How are they documented?<br>
3) Is this the most reasonable way to calculate distance given our question? What drawbacks does it have?
</font>
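To get a feel for the scale of the numbers that ```distance``` returns, here is a small worked example. The percentages are made up for illustration and are not taken from the Cal Answers data. It applies the same ```np.linalg.norm``` call to two hypothetical two-category breakdowns.

```python
import numpy as np

# Made-up percentages for illustration only (not real Cal Answers values).
unit_pct = np.array([55.0, 45.0])    # hypothetical unit: 55% / 45%
campus_pct = np.array([50.0, 50.0])  # hypothetical campus: 50% / 50%

# Same calculation as distance(): square root of the summed squared differences.
example_distance = np.linalg.norm(unit_pct - campus_pct)
print(round(example_distance, 2))
```

Here a 5-point gap in each of the two categories gives sqrt(5² + 5²) ≈ 7.07, which is worth keeping in mind when you discuss question 3.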

## Plotting the Results

In [None]:
labels = sorted(campus_vs_LS.keys())
vals = [campus_vs_LS[key] for key in labels]
ticks = range(len(labels))

In [None]:
plt.plot(vals)
plt.ylabel('L&S Distance from Campus (percent)')
plt.xlabel('Semester')
p = plt.xticks(ticks, labels, rotation='vertical')

<font color='green'>
1) What does this plot show?<br>
2) How could this plot be improved?<br>
</font>

# Your Analysis
In your group, choose a question you'd like to investigate. The data comes from a number of academic units and is broken down by semester, gender, and ethnicity. You may have to restrict the scope of your question given the limitations of the dataset (you'll have access to more data for final projects!).

The rest of the notebook is broken down into the probable steps you'll need to take. Feel free to copy code from above and look things up online.

## Finding files
Google 'python listdir' and use it to get a list of all of the csv files in the data folder.
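If you get stuck, the following is one possible sketch rather than the only way to do it. The helper name ```list_csv_files``` is hypothetical. It uses ```os.listdir``` to list a folder and keeps only the names ending in ```.csv```.

```python
import os

def list_csv_files(folder):
    """Return the sorted names of the .csv files in `folder` (hypothetical helper)."""
    # os.listdir returns bare file names, not full paths.
    return sorted(name for name in os.listdir(folder)
                  if name.lower().endswith('.csv'))
```

Calling ```list_csv_files(folder)``` with the ```folder``` defined earlier should give you names that you can pass to ```os.path.join(folder, name)``` and then to ```pd.read_csv```, just like the loading cell at the top of the notebook.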

## Select Data
Depending on your question, you may need to make a modified version of the ```headcount_percent``` function to select the data you want. Or you might just need to pass it different values. You may also wish to use a different distance function.
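As one example of a different distance function, here is a minimal sketch of the total variation distance (half the sum of the absolute differences). This is an alternative chosen for illustration, not the method used earlier in the notebook, and the function name ```total_variation``` and the input values are made up.

```python
import numpy as np

def total_variation(data, baseline):
    """Half the sum of absolute differences between two percentage arrays."""
    data = np.asarray(data, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    return 0.5 * np.abs(data - baseline).sum()

# Made-up percentages for illustration only.
print(total_variation([55.0, 45.0], [50.0, 50.0]))  # prints 5.0
```

When both arrays sum to 100, the result can be read as the percentage points of students that would have to move between categories for the unit to match the baseline. Whether that is a better fit for your question is something to discuss in your group.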

## Plot the Results
Make a plot (or series of plots) that conveys your result as clearly as possible. Google 'matplotlib something' for ideas on how to make certain plot types. If you are plotting different datasets on one plot, you may want to use a legend!
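A minimal sketch of plotting several datasets on one set of axes with a legend is shown here. The unit names, semester labels, and values are all made up for illustration; replace them with your own results.

```python
import matplotlib.pyplot as plt

# Made-up labels and values for illustration only; replace with your own results.
semester_labels = ['Fall 2005', 'Spring 2006', 'Fall 2006']
results = {
    'Unit A': [4.0, 3.5, 3.0],
    'Unit B': [8.0, 7.0, 7.5],
}

fig, ax = plt.subplots()
for unit, values in results.items():
    # The label passed here is what appears in the legend.
    ax.plot(values, marker='o', label=unit)

ax.set_xticks(range(len(semester_labels)))
ax.set_xticklabels(semester_labels, rotation='vertical')
ax.set_xlabel('Semester')
ax.set_ylabel('Distance from campus (percent)')
ax.legend()
```

Each call to ```ax.plot``` adds one line, and ```ax.legend()``` collects the labels, so adding another unit only means adding another entry to the dictionary.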