# Measuring Diversity

A simple first question is how the ethnic and gender makeup of an academic department compares to that of Berkeley as a whole (or California, or the US). Using data from Cal Answers, let's try to answer that question.

We have data going back to 2005 with gender and (coarse) ethnicity. A skeleton for comparing one department or unit to the Berkeley campus as a whole is provided below.

You'll need to:
- automate .csv loads from a folder,
- choose, in groups, whether to look at gender, ethnicity, or a combination of the two,
- run the analysis for every department/group,
- make a summary plot.

As a reminder, IPython notebooks are organized by "cells." Each cell can have its own code and can be run independently and in any order, although cells are usually run top to bottom. To run a cell and move to the next cell, press ```Shift+Enter```. To run a cell and stay on that cell, press ```Control+Enter```.


Questions to be discussed in groups are highlighted in <font color='green'>green</font>. If you don't understand a function that is used, try googling something like "python function-name".

In [1]:
import os # library that deals with operating system
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading data

In [2]:
folder = 'data'
filename = 'CensusEquityComparisonData-LettersAndSciences.csv'
baseline = 'CensusEquityComparisonData-CampusTotal.csv'
df = pd.read_csv(os.path.join(folder, filename))
df_baseline = pd.read_csv(os.path.join(folder, baseline))

## Inspect the data

In [4]:
cols = df.columns
print(cols)
df

Index([u'Semester Year Name Concat', u'Semester Year Letter Cd Concat',
       u'Ungrad Grad Cd', u'Gender Desc', u'Ucb Level1 Ethnic Rollup Cd',
       u'Ucb Level1 Ethnic Rollup Desc', u'Student Headcount',
       u'Prorated Student Major Cnt Sum'],
      dtype='object')


| | Semester Year Name Concat | Semester Year Letter Cd Concat | Ungrad Grad Cd | Gender Desc | Ucb Level1 Ethnic Rollup Cd | Ucb Level1 Ethnic Rollup Desc | Student Headcount | Prorated Student Major Cnt Sum |
|---|---|---|---|---|---|---|---|---|
| 0 | 2005 Fall | 2005 D | U | Female | 1 | Underrepresented Minority | 1818 | 1815.500000 |
| 1 | 2005 Fall | 2005 D | U | Female | 2 | Asian/Pacific Islander | 4172 | 4127.000000 |
| 2 | 2005 Fall | 2005 D | U | Female | 3 | White/Other | 4255 | 4240.166667 |
| 3 | 2005 Fall | 2005 D | U | Female | 4 | International | 176 | 173.500000 |
| 4 | 2005 Fall | 2005 D | U | Male | 1 | Underrepresented Minority | 1082 | 1080.000000 |
| 5 | 2005 Fall | 2005 D | U | Male | 2 | Asian/Pacific Islander | 2658 | 2616.666667 |
| 6 | 2005 Fall | 2005 D | U | Male | 3 | White/Other | 3355 | 3339.666667 |
| 7 | 2005 Fall | 2005 D | U | Male | 4 | International | 189 | 186.000000 |
| 8 | 2006 Spring | 2006 B | U | Female | 1 | Underrepresented Minority | 1755 | 1752.000000 |
| 9 | 2006 Spring | 2006 B | U | Female | 2 | Asian/Pacific Islander | 4164 | 4116.000000 |


In [5]:
semesters = df[cols[1]]

In [6]:
semesters = set(semesters)

In [7]:
semesters = list(semesters)

In [9]:
semesters = sorted(semesters)

In [12]:
semesters

['2005 D',
 '2006 B',
 '2006 D',
 '2007 B',
 '2007 D',
 '2008 B',
 '2008 D',
 '2009 B',
 '2009 D',
 '2010 B',
 '2010 D',
 '2011 B',
 '2011 D',
 '2012 B',
 '2012 D',
 '2013 B',
 '2013 D',
 '2014 B',
 '2014 D',
 '2015 B']

<font color='green'>
1) What does each line above do to 'semesters'?<br>
You can run them individually and print semesters between lines.</font>
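
As a hint, the same three steps can be tried on a small made-up list. The values below are illustrative only and are not taken from the Cal Answers data; the variable names are placeholders.

```python
# Toy stand-in for the semester column; values are made up for illustration.
codes = ['2006 B', '2005 D', '2006 B', '2005 D', '2006 D']

unique_codes = set(codes)      # drops duplicates; a set has no defined order
as_list = list(unique_codes)   # back to a list so it can be indexed
in_order = sorted(as_list)     # sorts the strings alphabetically

print(in_order)
```

Alphabetical order matches chronological order here only because the year comes first in each code and 'B' sorts before 'D'.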

In [17]:
genders = list(set(df[cols[3]]))

# Example Analysis
We'll first look at how the College of Letters and Science compares to the campus as a whole. We'll restrict this analysis to gender for now.

In [19]:
# function for extracting data from df
def headcount_frac(df, semester, col, attrs):
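    """
    Extract fractional headcount data for specific attributes from a column.

    Parameters
    ----------
    df : dataframe
        Dataframe containing data.
    semester : str
        Semester code, e.g. '2005 D'.
    col : str
        Column in which to look for the attributes.
    attrs : list of str
        Attribute values whose fractions are calculated.

    Returns
    -------
    list of (str, float)
        One (attribute, fraction) pair per entry in attrs.
    """
    # Build one boolean row mask per attribute for the requested semester.
    in_semester = df['Semester Year Letter Cd Concat'] == semester
    masks = [(df[col] == attr) & in_semester for attr in attrs]
    # Boolean masks are label-aligned, so index with .loc
    # (recent pandas versions reject a boolean Series in .iloc).
    counts = [df.loc[mask, 'Student Headcount'].sum() for mask in masks]
    total = sum(counts)
    return [(attr, float(count) / total) for attr, count in zip(attrs, counts)]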

# function for computing 'distance' between two units
def distance(data, baseline):
    pass

campus_vs_LS = {}  # maps 'semester_gender' keys to distances
for semester in semesters:
    for gender in genders:
        data = df
        baseline = df_baseline
        campus_vs_LS[semester + '_' + gender] = distance(data, baseline)
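
The `distance` function is left as a stub for you to fill in. One possible choice, offered only as a sketch and not as the intended solution, is the total variation distance: half the sum of the absolute differences between the two sets of fractions. The name `total_variation` and the toy fractions below are assumptions made for illustration; the inputs are lists of `(attribute, fraction)` pairs in the form returned by `headcount_frac`.

```python
def total_variation(data_fracs, baseline_fracs):
    """Half the sum of absolute differences between two sets of fractions.

    Both arguments are lists of (attribute, fraction) pairs. The result is
    0 for identical makeups and 1 for completely disjoint ones.
    """
    data = dict(data_fracs)
    base = dict(baseline_fracs)
    attrs = set(data) | set(base)
    # An attribute missing from one side counts as a fraction of zero.
    return 0.5 * sum(abs(data.get(a, 0.0) - base.get(a, 0.0)) for a in attrs)

# Made-up fractions, not real Cal Answers numbers.
unit = [('Female', 0.6), ('Male', 0.4)]
campus = [('Female', 0.5), ('Male', 0.5)]
print(total_variation(unit, campus))
```

Other measures (for example, a simple difference in the fraction of one group) may suit your question better; the point is only that the function takes two makeups and returns a single number.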

In [21]:
df['Student Headcount'][(df['Gender Desc'] == gender) & (df['Semester Year Letter Cd Concat'] == semester)]

153    2370
154    4230
155    3177
156    1313
Name: Student Headcount, dtype: int64
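
The cell above selects rows with a boolean mask: each comparison yields a column of True/False values, and `&` combines them row by row. The sketch below repeats the idea on a small made-up table (the numbers are illustrative, and `toy`, `mask` and `selected` are placeholder names), using `.loc` to pick the rows and the column in one step.

```python
import pandas as pd

# Made-up rows using the same column names as the real data.
toy = pd.DataFrame({
    'Semester Year Letter Cd Concat': ['2005 D', '2005 D', '2006 B', '2006 B'],
    'Gender Desc': ['Female', 'Male', 'Female', 'Male'],
    'Student Headcount': [10, 20, 30, 40],
})

# True only where both conditions hold for a row.
mask = (toy['Gender Desc'] == 'Male') & (toy['Semester Year Letter Cd Concat'] == '2006 B')
selected = toy.loc[mask, 'Student Headcount']
print(selected.sum())
```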

# Your Analysis
In your group, choose a question you'd like to investigate.
## Finding files
Google os.listdir() and use it to get a list of all of the csv files.
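
One way this could look is sketched below. It assumes the files sit directly inside the `data` folder used earlier and that their names end in `.csv`; the helper name `list_csv_files` is hypothetical.

```python
import os

def list_csv_files(folder):
    """Return sorted paths of the .csv files directly inside folder."""
    names = [name for name in os.listdir(folder) if name.lower().endswith('.csv')]
    return [os.path.join(folder, name) for name in sorted(names)]

folder = 'data'  # folder name from the loading cell; adjust if yours differs
if os.path.isdir(folder):
    for path in list_csv_files(folder):
        print(path)
```

Each returned path can then be passed to `pd.read_csv`, as in the loading cell at the top of the notebook.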