# Measuring Diversity

A simple first question to ask is: how does the ethnic and gender makeup of an academic department compare to the makeup of Berkeley as a whole (or California, or the US...)? Using data from Cal Answers, let's try to answer that question.

We have data going back to 2005 with gender and (coarse) ethnicity. A skeleton for comparing one department/unit to Berkeley has been created.

You'll need to:
- automate .csv loads from a folder,
- choose, in your groups, whether to look at gender, ethnicity, or a combination of the two,
- run the analysis for every department/group,
- make a summary plot.

As a reminder, IPython notebooks are organized by "cells." Each cell can have its own code and can be run independently and in any order (although cells are usually run top to bottom in a notebook). To run a cell and move to the next cell, press ```Shift+Enter```. To run a cell and stay on that cell, press ```Control+Enter```.


Questions to be discussed in groups are highlighted in <font color='green'>green</font>. If you don't understand a function that is used, try googling something like "python function-name".

In [33]:
import os # library that deals with operating system
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading data

In [35]:
folder = 'data'
filename = 'CensusEquityComparisonData-LettersAndSciences.csv'
baseline = 'CensusEquityComparisonData-CampusTotal.csv'
df = pd.read_csv(os.path.join(folder, filename))
df_baseline = pd.read_csv(os.path.join(folder, baseline))
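
The cell above loads two files by name. To automate the loads for every unit, one option is to read each `.csv` in the folder into a dictionary. This is a minimal sketch, not the course's reference solution: the helper name `load_units` and the assumption that all files share the `CensusEquityComparisonData-` prefix are illustrative only.

```python
import glob
import os

import pandas as pd


def load_units(folder, prefix='CensusEquityComparisonData-'):
    """Read every .csv in `folder` into a dict keyed by a short unit name.

    `load_units` and `prefix` are hypothetical names chosen for this sketch.
    """
    units = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        # File name without directory or extension
        name = os.path.splitext(os.path.basename(path))[0]
        # Drop the shared prefix, if present, so keys look like 'CampusTotal'
        if name.startswith(prefix):
            name = name[len(prefix):]
        units[name] = pd.read_csv(path)
    return units
```

With the files above in place, `units = load_units('data')` would give `units['CampusTotal']` as the baseline and one entry per department file.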

## Inspect the data

In [17]:
cols = df.columns
print(cols)
df

Index([u'﻿Semester Year Name Concat', u'Semester Year Letter Cd Concat',
       u'Ungrad Grad Cd', u'Gender Desc', u'Ucb Level1 Ethnic Rollup Cd',
       u'Ucb Level1 Ethnic Rollup Desc', u'Student Headcount',
       u'Prorated Student Major Cnt Sum'],
      dtype='object')


| | Semester Year Name Concat | Semester Year Letter Cd Concat | Ungrad Grad Cd | Gender Desc | Ucb Level1 Ethnic Rollup Cd | Ucb Level1 Ethnic Rollup Desc | Student Headcount | Prorated Student Major Cnt Sum |
|---|---|---|---|---|---|---|---|---|
| 0 | 2005 Fall | 2005 D | U | Female | 1 | Underrepresented Minority | 181 | 165.000000 |
| 1 | 2005 Fall | 2005 D | U | Female | 2 | Asian/Pacific Islander | 335 | 287.000000 |
| 2 | 2005 Fall | 2005 D | U | Female | 3 | White/Other | 591 | 543.333333 |
| 3 | 2005 Fall | 2005 D | U | Female | 4 | International | 18 | 16.500000 |
| 4 | 2005 Fall | 2005 D | U | Male | 1 | Underrepresented Minority | 133 | 121.500000 |
| 5 | 2005 Fall | 2005 D | U | Male | 2 | Asian/Pacific Islander | 161 | 140.833333 |
| 6 | 2005 Fall | 2005 D | U | Male | 3 | White/Other | 396 | 362.666667 |
| 7 | 2005 Fall | 2005 D | U | Male | 4 | International | 10 | 10.000000 |
| 8 | 2006 Spring | 2006 B | U | Female | 1 | Underrepresented Minority | 206 | 184.833333 |
| 9 | 2006 Spring | 2006 B | U | Female | 2 | Asian/Pacific Islander | 363 | 312.500000 |


In [27]:
semesters = df[cols[0]]

In [28]:
semesters = set(semesters)

In [29]:
semesters = list(semesters)

In [31]:
genders = list(set(df[cols[3]]))
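
Note that `set` does not preserve order, so `semesters` may come back in a scrambled, non-chronological sequence. If order matters (for a time-series plot, say), `drop_duplicates` keeps values in order of first appearance. The sketch below uses a made-up toy table rather than the Cal Answers data; in the notebook the same call could be applied to `df[cols[0]]` and `df[cols[3]]`.

```python
import pandas as pd

# Toy stand-in for the real table (rows are invented for illustration)
toy = pd.DataFrame({
    'Semester Year Name Concat': ['2005 Fall', '2005 Fall', '2006 Spring', '2006 Fall'],
    'Gender Desc': ['Female', 'Male', 'Female', 'Male'],
})

# drop_duplicates() keeps the first occurrence of each value, in row order
semesters_ordered = toy['Semester Year Name Concat'].drop_duplicates().tolist()
genders_ordered = toy['Gender Desc'].drop_duplicates().tolist()

print(semesters_ordered)
print(genders_ordered)
```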

# Example Analysis
We'll first look at how the College of Letters and Science (L&S) compares to the campus as a whole. We'll restrict this analysis to gender.

In [None]:
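# function for computing 'distance' between two units
# (placeholder assumption: absolute difference between two shares;
#  replace with whatever measure your group chooses)
def distance(share_a, share_b):
    return abs(share_a - share_b)

campus_vs_LS = {} # dictionary for data
for semester in semesters:
    for gender in genders:
        # TODO: compute this gender's share of headcount in df and in
        # df_baseline for this semester, then store distance(...) here
        campus_vs_LS[semester+'_'+gender] = None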
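
One way the skeleton above might be filled in is sketched below. It is not the intended solution, only an illustration under stated assumptions: `gender_shares` is a hypothetical helper, the distance is assumed to be the absolute difference between a gender's share of headcount in the unit and on campus, and the two small tables contain made-up numbers. In the real data the first column name printed in the `Index` output appears to start with an invisible byte-order mark, so passing `semester_col=cols[0]` is likely safer than typing the name.

```python
import pandas as pd


def gender_shares(frame, semester_col='Semester Year Name Concat'):
    """Fraction of total headcount held by each gender within each semester."""
    counts = frame.groupby([semester_col, 'Gender Desc'])['Student Headcount'].sum()
    # Total headcount per semester, aligned to the (semester, gender) index
    totals = counts.groupby(level=0).transform('sum')
    return counts / totals


def distance(share_a, share_b):
    """Assumed measure: absolute difference between two shares."""
    return abs(share_a - share_b)


# Made-up toy numbers, not Cal Answers data
unit = pd.DataFrame({
    'Semester Year Name Concat': ['2005 Fall', '2005 Fall'],
    'Gender Desc': ['Female', 'Male'],
    'Student Headcount': [60, 40],
})
campus = pd.DataFrame({
    'Semester Year Name Concat': ['2005 Fall', '2005 Fall'],
    'Gender Desc': ['Female', 'Male'],
    'Student Headcount': [50, 50],
})

unit_share = gender_shares(unit)
campus_share = gender_shares(campus)

campus_vs_unit = {}
for semester, gender in unit_share.index:
    key = semester + '_' + gender
    campus_vs_unit[key] = distance(unit_share[(semester, gender)],
                                   campus_share[(semester, gender)])

print(campus_vs_unit)
```

Working with shares rather than raw headcounts keeps the comparison independent of unit size; other measures (ratios, or a sum over all groups) would fit the same structure.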

In [43]:
df['Student Headcount'][df['Gender Desc'] == genders[0]]

0      1818
1      4172
2      4255
3       176
8      1755
9      4164
10     4137
11      176
16     1854
17     4213
18     4320
19      180
24     1843
25     4266
26     4235
27      171
32     1941
33     4365
34     4371
35      187
40     1910
41     4412
42     4213
43      168
48     1999
49     4381
50     4222
51      280
56     2002
57     4357
       ... 
98     3930
99      855
104    2142
105    4143
106    3807
107     826
112    2190
113    4055
114    3636
115     996
121    2160
122    4161
123    3575
124     935
129    2207
130    4026
131    3299
132    1306
137    2238
138    4143
139    3304
140    1250
145    2329
146    4118
147    3204
148    1389
153    2370
154    4230
155    3177
156    1313
Name: Student Headcount, dtype: int64
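
The output above lists one headcount per (semester, ethnicity) row for a single gender. To get one number per semester, the filtered rows can be grouped and summed. The sketch below uses an invented toy table and the `.loc` form of the same boolean filter, which selects rows and column in one step; treat it as one possible approach rather than the required one.

```python
import pandas as pd

# Toy stand-in for the real table (rows are invented for illustration)
toy = pd.DataFrame({
    'Semester Year Name Concat': ['2005 Fall', '2005 Fall', '2005 Fall', '2006 Spring'],
    'Gender Desc': ['Female', 'Female', 'Male', 'Female'],
    'Student Headcount': [10, 20, 5, 7],
})

is_female = toy['Gender Desc'] == 'Female'

# Same selection as toy['Student Headcount'][is_female], written with .loc
female = toy.loc[is_female, 'Student Headcount']

# Total female headcount per semester
female_by_semester = (
    toy.loc[is_female]
    .groupby('Semester Year Name Concat')['Student Headcount']
    .sum()
)

print(female_by_semester)
```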