In [1]:
import pandas as pd
import numpy as np
import pylab as plt
import csv
import sqlite3
%matplotlib inline

from ipywidgets import FloatProgress
from IPython.display import display

## Plans for Cleaning up the GPA data

For this data analysis task, we have two raw data files regarding the admission information of the universities of California. The first file is dealing with various kinds of GPAs and the second file contains valuable information regarding applicant numbers. Therefore, the plan for our data cleaning task is to merge the two data files together. The main purpose for the data merging is that, other than the two variables mentioned above (GPA and student number), the other variables are consistent. It would be redundant and wasteful of space to keep two files separately. 

Before we proceed to the merging, we have to clean up individual file first. The first step of the clean-up is to replace some confusing column names with the ones that could be easily manipulated, such as, replacing "Uad Uc Ethn 6 Cat" with "ethnicity" and "Measure Name" with "type" etc. Then some preliminary files merging could be initiated. Before the merging was happening, we have the liberty to pick up columns that are useful to our analysis. This cleaning process will also reduce the size of the combined data sets. As the two files getting merged, we could do some further cleaning on the newly-created file, for example, we could delete rows that have NaN in the GPA column. 

As we are getting satisfied with dropping the rows that are useless, we would be able to notice that there are some more cleaning work we could do by looking at the existing data file. First of all, instead of having three different types - enrolled ('enr'), admitted ('adm') and applied ('app'), we could explicitly unpack this column and make seperate columns for specific variables that are relevant to 'type', for example, we have columns like 'adm_gpa' to stand for GPA for admission, and 'enr_num' for the number of students that enroll into any university etc. Next, as there are only nine universities in University of California system, it becomes natural to regroup the data by campus. By doing this, 'campus' becomes a categorical field. 

## Parsing the school name

Both of our datasets (GPA and counts data) contain a `Calculation1` column that specifies the high school. For California schools, that column contains a `cds` number - however it is formatted differently in those datasets, which makes it problematic to merge them.

In the next cell we will devise a method for splitting the `Calculation1` column into two:
 - the cds number
 - the name of the school


In [2]:
def split_school_field(df, field_name):
    df['school'] = df[field_name].str.replace('\d+$', '')
    df['school_num'] = pd.to_numeric(df[field_name].str.extract('(\d+$)', expand=False))

## Open up GPA data

In [3]:
gpa_data_raw = pd.read_csv("data/FR_GPA_by_Inst_data_converted.csv")
gpa_data_raw.head()

Unnamed: 0,Calculation1,Campus,City,County,Fall Term,Measure Names,School,Measure Values
0,21ST CENTURY EXPERIMENTAL SCH694223,Santa Cruz,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
1,21ST CENTURY EXPERIMENTAL SCH694223,Santa Barbara,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
2,21ST CENTURY EXPERIMENTAL SCH694223,San Diego,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
3,21ST CENTURY EXPERIMENTAL SCH694223,Los Angeles,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,
4,21ST CENTURY EXPERIMENTAL SCH694223,Irvine,,Not Applicable,2017,Enrl GPA,21ST CENTURY EXPERIMENTAL SCH,


Clean up the GPA data.
Rename many of the columns, replace string NaNs with literal NaN values, unify the "applied", "admitted", and "enrolled" codes.

In [4]:
gpa_data = gpa_data_raw.drop(columns="School")
renaming = {"Uad Uc Ethn 6 Cat": "Ethnicity", 
             "Calculation1": "SchoolField",
             "Campus": "campus",
             "City": "city",
             "County": "county",
             "Fall Term":"year",
             "Measure Names": "type",
             'Uad Uc Ethn 6 Cat':"ethnicity",
             "Measure Values": "gpa"}
gpa_data = gpa_data.rename(index=str, columns=renaming)
gpa_data['city'].replace('n/a ', np.nan, inplace=True) #TODO maybe not use nan since not a number
gpa_data['county'].replace('Not Applicable', np.nan, inplace=True)
renaming = {"Enrl GPA":"enr",
           "Adm GPA":"adm",
           "App GPA":"app"}
gpa_data['type'].replace(renaming, inplace=True)
gpa_data.head()

Unnamed: 0,SchoolField,campus,city,county,year,type,gpa
0,21ST CENTURY EXPERIMENTAL SCH694223,Santa Cruz,,,2017,enr,
1,21ST CENTURY EXPERIMENTAL SCH694223,Santa Barbara,,,2017,enr,
2,21ST CENTURY EXPERIMENTAL SCH694223,San Diego,,,2017,enr,
3,21ST CENTURY EXPERIMENTAL SCH694223,Los Angeles,,,2017,enr,
4,21ST CENTURY EXPERIMENTAL SCH694223,Irvine,,,2017,enr,


The "school" represents a high school; "campus" is which UC campus ("Univeristywide" represents total of all campuses); "city" is which city the high school is in, or NaN if HS is outside California; "county" is which county the high school is in, or NaN if HS is outside California; "year" is the year of the fall term that students started; "type" is which metric we are measuring, either the GPA of applied, admitted, or enrolled students; "gpa" is the actual GPA of this group of students.

Next, we will parse the school name:

In [5]:
split_school_field(gpa_data, 'SchoolField')
gpa_data = gpa_data.drop(columns='SchoolField')
gpa_data.head()

Unnamed: 0,campus,city,county,year,type,gpa,school,school_num
0,Santa Cruz,,,2017,enr,,21ST CENTURY EXPERIMENTAL SCH,694223
1,Santa Barbara,,,2017,enr,,21ST CENTURY EXPERIMENTAL SCH,694223
2,San Diego,,,2017,enr,,21ST CENTURY EXPERIMENTAL SCH,694223
3,Los Angeles,,,2017,enr,,21ST CENTURY EXPERIMENTAL SCH,694223
4,Irvine,,,2017,enr,,21ST CENTURY EXPERIMENTAL SCH,694223


## Open up count data

In [6]:
hs_data_raw = pd.read_csv("data/HS_by_Year_data_converted.csv")
print(hs_data_raw.columns)
hs_data_raw.head()

Index(['Calculation1', 'Campus', 'City', 'County/State/ Territory',
       'Fall Term', 'Measure Names', 'Uad Uc Ethn 6 Cat', 'Measure Values'],
      dtype='object')


Unnamed: 0,Calculation1,Campus,City,County/State/ Territory,Fall Term,Measure Names,Uad Uc Ethn 6 Cat,Measure Values
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0


Clean up this data. Make the column names consistent with the other table so we can merge them

In [7]:
# hs_data = hs_data_raw.drop(columns="School")
renaming = {"Uad Uc Ethn 6 Cat": "Ethnicity", 
             "Calculation1": "SchoolField",
             "Campus": "campus",
             "City": "city",
             "County/State/ Territory": "region",
             "Fall Term":"year",
             "Measure Names": "type",
             'Uad Uc Ethn 6 Cat':"ethnicity",
             "Measure Values": "num"}
hs_data = hs_data_raw.rename(index=str, columns=renaming)
hs_data.head()

Unnamed: 0,SchoolField,campus,city,region,year,type,ethnicity,num
0,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0
1,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0
2,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0
3,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0
4,21ST CENTURY EXPERIMENTAL SCH694223,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0


The "school" represents a high school; "campus" is which UC campus ("Univeristywide" represents total of all campuses); "city" is which city the high school is in, or NaN if HS is outside California; "region" is either the name of a US state, or the name of the country that HS is in; "year" is the year of the fall term that students started; "type" is which metric we are measuring, either the GPA of applied, admitted, or enrolled students; "ethnicity" is the ethnicity of thie group of students, or "All" for the union of these groups; "num" is how many students belong to this group.

Now, we will parse the school name field

In [8]:
split_school_field(hs_data, 'SchoolField')
hs_data = hs_data.drop(columns='SchoolField')
hs_data.head()

Unnamed: 0,campus,city,region,year,type,ethnicity,num,school,school_num
0,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223
1,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,21ST CENTURY EXPERIMENTAL SCH,694223
2,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,21ST CENTURY EXPERIMENTAL SCH,694223
3,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,21ST CENTURY EXPERIMENTAL SCH,694223
4,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0,21ST CENTURY EXPERIMENTAL SCH,694223


## Merge the two datasets
When any two rows match on every one of the columns ['school', 'campus', 'city', 'year', 'type'], then merge these two rows. The GPA data contains two columns that the HS data does not: "county" and "gpa". The HS data contains two columns that the GPA data does not: "region" and "num". Therefore, these columns will be filled with NaNs in the merged. For instance, if a record exists in the HS data but not in the GPA data, then the corresponding record in the merged table will have NaN in the "region" column, since we don't know how to fill this in.

In [9]:
merged = pd.merge(hs_data, gpa_data, on=['school', 'school_num', 'city', 'campus', 'year', 'type'])
merged

Unnamed: 0,campus,city,region,year,type,ethnicity,num,school,school_num,county,gpa
0,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.986667
1,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,21ST CENTURY EXPERIMENTAL SCH,694223,,4.020417
2,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.864324
3,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.920000
4,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.920000
5,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.755833
6,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.755833
7,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,Inter- national,18.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.680000
8,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,All,19.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.680000
9,Universitywide,,"CHINA, PEOPLES REPUBLIC",2015,enr,All,3.0,21ST CENTURY EXPERIMENTAL SCH,694223,,


## A few other cleaning steps
Some of the rows have NaN in both the GPA and num fields, and thus are useless, so delete them.

In [15]:
print("number or rows before dropping NaN rows:", merged.shape[0])
merged = merged[   pd.notnull(merged['num']) & pd.notnull(merged['gpa'])  ]
print("number or rows after dropping NaN rows:", merged.shape[0])
merged.head()

number or rows before dropping NaN rows: 1025352
number or rows after dropping NaN rows: 668593


Unnamed: 0,campus,city,region,year,type,ethnicity,num,school,school_num,county,gpa,country,state
0,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.986667,"CHINA, PEOPLES REPUBLIC",
1,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,21ST CENTURY EXPERIMENTAL SCH,694223,,4.020417,"CHINA, PEOPLES REPUBLIC",
2,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.864324,"CHINA, PEOPLES REPUBLIC",
3,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.92,"CHINA, PEOPLES REPUBLIC",
4,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.92,"CHINA, PEOPLES REPUBLIC",


### High school location

The dataset uses the field `region` to specify the location of the high school. However, since the same row is used to specify either a country, a state, or a county, it is difficult to query the field. We will try to put the high schools in three categories:

 - in California
 - in the US but outside California
 - outside of the US
 
In order to perform this task, we notice the following conjecture about the format of the `County/State/Territory` column in the `counts` dataset:

 - If the school is located in California, the column contains the county name
 - If the school is located in the US, the column contains the name of the state
 - If the school is located outside of the US, the column contains the name of the country (in all caps)

First we will validate our data:

In [16]:
# We extracted the list of California counties, and US teritories from the list of unique locations
ca_counties = ['Alameda', 'Alpine', 'Amador', 'Butte', 'Calaveras', 'Colusa', 'Contra Costa', 'Del Norte', 'El Dorado', 'Fresno', 'Glenn', 'Humboldt', 'Imperial', 'Inyo', 'Kern', 'Kings', 'Lake', 'Lassen', 'Los Angeles', 'Madera', 'Marin', 'Mariposa', 'Mendocino', 'Merced', 'Modoc', 'Mono', 'Monterey', 'Napa', 'Nevada', 'Orange', 'Placer', 'Plumas', 'Riverside', 'Sacramento', 'San Benito', 'San Bernardino', 'San Diego', 'San Francisco', 'San Joaquin', 'San Luis Obispo', 'San Mateo', 'Santa Barbara', 'Santa Clara', 'Santa Cruz', 'Shasta', 'Sierra', 'Siskiyou', 'Solano', 'Sonoma', 'Stanislaus', 'Sutter', 'Tehama', 'Trinity', 'Tulare', 'Tuolumne', 'Ventura', 'Yolo', 'Yuba']
us_states_and_territories = ['American Samoa', 'Northern Mariana Islands', 'U.S. Armed Forces –\xa0Pacific', 'U.S. Armed Forces –\xa0Europe', 'Puerto Rico', 'Guam', 'District of Columbia', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

all_locations = list(hs_data_raw['County/State/ Territory'].unique())
country_names = [l for l in all_locations
                 if l not in ca_counties and
                    l not in us_states_and_territories and
                    l is not np.nan]

# Sanity check - our contry_names should be in all caps:
for country_name in country_names:
    assert(country_name == country_name.upper())

Next, we will actually group the schools into their respective locations:

In [17]:
ca_filter = merged['region'].isin(ca_counties)
us_non_ca_filter = merged['region'].isin(us_states_and_territories)
foreign_filter = merged['region'].isin(country_names)

merged['country'] = np.nan
merged.loc[ca_filter, 'country'] = 'USA'
merged.loc[foreign_filter, 'country'] = merged.loc[foreign_filter, 'region']
merged.loc[us_non_ca_filter, 'country'] = 'USA'

merged['state'] = np.nan
merged.loc[ca_filter, 'state'] = 'California'
merged.loc[foreign_filter, 'state'] = np.nan
merged.loc[us_non_ca_filter, 'state'] = merged.loc[us_non_ca_filter, 'region']

merged['county'] = np.nan
merged.loc[ca_filter, 'county'] = merged.loc[ca_filter, 'region']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydat

In [18]:
merged

Unnamed: 0,campus,city,region,year,type,ethnicity,num,school,school_num,county,gpa,country,state
0,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,enr,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.986667,"CHINA, PEOPLES REPUBLIC",
1,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,adm,All,30.0,21ST CENTURY EXPERIMENTAL SCH,694223,,4.020417,"CHINA, PEOPLES REPUBLIC",
2,Universitywide,,"CHINA, PEOPLES REPUBLIC",2017,app,All,43.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.864324,"CHINA, PEOPLES REPUBLIC",
3,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,Inter- national,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.920000,"CHINA, PEOPLES REPUBLIC",
4,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,enr,All,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.920000,"CHINA, PEOPLES REPUBLIC",
5,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,Inter- national,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.755833,"CHINA, PEOPLES REPUBLIC",
6,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,adm,All,12.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.755833,"CHINA, PEOPLES REPUBLIC",
7,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,Inter- national,18.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.680000,"CHINA, PEOPLES REPUBLIC",
8,Universitywide,,"CHINA, PEOPLES REPUBLIC",2016,app,All,19.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.680000,"CHINA, PEOPLES REPUBLIC",
11,Universitywide,,"CHINA, PEOPLES REPUBLIC",2015,app,All,5.0,21ST CENTURY EXPERIMENTAL SCH,694223,,3.704000,"CHINA, PEOPLES REPUBLIC",


## Flattening multiple measurements onto a single row

The files we have downloaded from the University of California database used an unusual format for storing the data. Each row represented a single measurement, that is, it contained the following:

- Information about the UC campus, high school, county, etc.
- Type of the measurement (for enrolled students, for admitted students, for applying students)
- Value of the measurement (average GPA, number of students)

That means, that the data about applicants from a given high school to a given UC campus for a given year was split into 6 rows.

However, we wanted our rows to represent applicants from a single high school, to a single UC campus in a given year. That is why we decided to merge the underlying 6 rows into a single row.

In [19]:
type_column = 'type'
readout_columns = ['gpa', 'num']

def pack_readout_rows(df, type_column, readout_columns):
    # Pandas gives warning when doing all this fancy pivoting with NaNs,
    # so replace them with a temporary placeholder for now
    # https://github.com/pandas-dev/pandas/issues/3729
    df = df.replace(np.nan, 'NaN_placeholder', regex=True)
    
    # get all the columns we wish to group by
    # these are all of them besides type, gpa, and num
    group_by_columns = list(df)
    for to_remove in [type_column] + readout_columns:
        group_by_columns.remove(to_remove)
    
    def group_to_row(g):
        result = dict(g[group_by_columns].iloc[0])

        for t in group['type'].unique():
            for readout_column in readout_columns:
                column_name = t + '_' + readout_column
                result[column_name] = group[group[type_column] == t][readout_column].iloc[0]
        return result
    
    rows = []
    group_by_result = df.groupby(group_by_columns)
    row_count = (len(group_by_result))
    current_row_index = 0
    
    f = FloatProgress(min=0, max=row_count)
    display(f)
    
    for name, group in group_by_result:
        # ex: group is a selection of the 6 matching rows that we wish to merge
        # and name is the values of the school, campus, etc. columns that we grouped by
        rows.append(group_to_row(group))
        current_row_index += 1
        if current_row_index % 100 == 0:
            f.value = current_row_index
    f.value = current_row_index
    return pd.DataFrame(rows).replace('NaN_placeholder', np.nan, regex=True)

packed = pack_readout_rows(merged, type_column, readout_columns)
packed = packed.sort_values(['campus', 'year', 'school'])
better_col_order = ['campus', 
                    'year', 
                    'school', 
                    'school_num', 
                    'city', 
                    'county',
                    'state',
                    'country',
                    'region', 
                    'ethnicity',
                    'app_num',
                    'adm_num',
                    'enr_num',
                    'app_gpa',
                    'adm_gpa',
                    'enr_gpa']
packed = packed[better_col_order]

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa
21130,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,
21152,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,
21165,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,
34276,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571
34286,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571
35153,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,
35167,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,
33524,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,
19529,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846
19530,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,


In [21]:
packed

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa
21130,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,
21152,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,
21165,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,
34276,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571
34286,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571
35153,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,
35167,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,
33524,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,
19529,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846
19530,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,


Write out the processed data so we don't have to wait for that again! Don't bother writing the row indices.

In [22]:
packed.to_csv('data/processed.csv', sep=',', index=False)