In [1]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Distance from high school to college campuses

**Install the required Python dependency with `easy_install googlemaps`. (`pip` apparently does not work.)**

While exploring our dataset we realized we need additional data to perform meaningful yield prediction. One of the features we think might be meaningful in predicting admission yield is the distance from a student's high school to the college campus.

We will use the Google Maps API to find the locations of high schools and the distance from each school to each UC campus.

In [2]:
import googlemaps

# Keys left as a courtesy to the instructor.
# We had to rotate through all three whenever one key hit its usage limit.
# key = 'AIzaSyAfMIzlBeHc_rJo1n1OgnRVGhvgWxY_MiE' #Michal's
# key = 'AIzaSyA84f9q_9o6LCnsqaLnpSyFkj7tS0rU0to' #Nick's
key = 'AIzaSyBF-P9gMxVzvV0O2jjrDa853DtXCn4yTL8' #Nelson's
gmaps = googlemaps.Client(key=key)
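
Hard-coding keys in a notebook exposes them to anyone who can read it. A minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name `GOOGLE_MAPS_API_KEY` and the helper `read_api_key` are hypothetical and not part of the `googlemaps` package:

```python
import os

def read_api_key(env_var='GOOGLE_MAPS_API_KEY'):
    """Return the API key stored in the given environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError('Set the {} environment variable first'.format(env_var))
    return key

# Usage (not run here, since it needs a real key and the googlemaps package):
# gmaps = googlemaps.Client(key=read_api_key())
```

With this approach the key never appears in the saved notebook, and each team member can supply their own.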

## Finding location and distance using Google Maps API

Next, we will define functions to find the following for each high school/UC campus combo:
 - Location of the high school as described in a string
 - Distance between the high school and the campus

In [3]:
def get_school_loc_str(df):
    loc = df['school'].values.copy()
    loc += np.where(df['city'].notnull(),  ', '+df['city'], '' )
    loc += np.where(df['state'].notnull(), ', '+df['state'], '' )
    loc += np.where(df['country'].notnull(),  ', '+df['country'], '' )
    return loc

import time
def get_distance(campus_abbr, school_strings):
    '''Given a campus name and a list of school location strings,
    return the distance in meters from the campus to each high school
    (NaN where Google Maps cannot find a route).'''
    if isinstance(school_strings, str):
        # deal with the case of only passing in one school
        school_strings = [school_strings]
    if campus_abbr == 'Universitywide':
        raise ValueError("Can't get the distance to the entire university system")
    campus_str = 'University of California, {}'.format(campus_abbr)
    
    # There is a maximum of 25 destinations per request, so split them up
    N = 25
    chunks = [school_strings[i:i+N] for i in range(0, len(school_strings), N)]
    results = []
    for c in chunks:
        time.sleep(1)  # ensure we don't go over the 100 elements/sec limit
        try:
            response = gmaps.distance_matrix(origins=campus_str, destinations=c)
            by_hs = response['rows'][0]['elements']
        except Exception as e:
            # Keep the original exception attached so the real cause stays visible
            raise RuntimeError("Distance Matrix request failed") from e
        for entry in by_hs:
            if 'distance' in entry:
                results.append(entry['distance']['value'])
            else:
                # Google Maps couldn't look up that distance
                results.append(np.nan)
    return results
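
To make the parsing step in `get_distance` easier to follow without calling the API, here is a minimal offline sketch. The `sample_response` dictionary is hand-written to mimic the shape of a Distance Matrix response (one row per origin, one element per destination); it is not real API output, and `extract_distances` is a hypothetical helper name:

```python
import math

def extract_distances(response):
    """Return distances in meters for the first origin; NaN where no distance is given."""
    elements = response['rows'][0]['elements']
    return [e['distance']['value'] if 'distance' in e else math.nan
            for e in elements]

# Hand-written stand-in for an API response: one resolved school, one not found
sample_response = {
    'rows': [{'elements': [
        {'status': 'OK', 'distance': {'text': '602 km', 'value': 601648}},
        {'status': 'NOT_FOUND'},
    ]}]
}
print(extract_distances(sample_response))
```

Elements without a `distance` entry become NaN, which keeps the result list aligned with the list of schools that was passed in.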

Check that the *get_distance* function is working...

In [4]:
get_distance("Berkeley", ["ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles", "LAWRENCEVILLE SCHOOL, Lawrenceville, New jersey"])

[601648, 4656150]
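
The values are distances in meters along the route Google Maps returns (driving is the API's default travel mode). A small conversion, using the two values from the output above, makes them easier to sanity-check:

```python
METERS_PER_KM = 1000.0
METERS_PER_MILE = 1609.344

distances_m = [601648, 4656150]  # output of the check above, in meters
for d in distances_m:
    print('{:.0f} km / {:.0f} miles'.format(d / METERS_PER_KM, d / METERS_PER_MILE))
```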

## Open and Clean Data
Next, we will use the above functions on our main dataset.

In [5]:
data = pd.read_csv('data/processed.csv')
data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,
1,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,
2,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571
4,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,
6,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846
9,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,


Filter the data down to rows with ethnicity=="All", California schools, and non-Universitywide entries, and remove duplicate rows that share a campus and high school, since we only need to calculate the distance for each pair once. Also, add a column with a string description of each school's location.

In [6]:
no_dups = data[data['ethnicity'] == 'All']
no_dups = no_dups.drop_duplicates(subset=['campus', 'school_num'])  # dedupe the filtered rows, not the full data
no_dups['school_loc_str'] = get_school_loc_str(no_dups)
no_dups = no_dups[no_dups['campus'] != 'Universitywide']
no_dups = no_dups[no_dups['state'] == 'California']
no_dups.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,school_loc_str
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.62,,,"ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles, Cali..."
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.12125,4.088571,"ABRAHAM LINCOLN HIGH SCHOOL, San Francisco, Ca..."
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,"ABRAHAM LINCOLN HIGH SCHOOL, San Jose, Califor..."
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786,,,"ACADEMY OUR LADY OF PEACE, San Diego, Californ..."
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,"ACALANES HIGH SCHOOL, Lafayette, California, USA"
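
The warning above appears because a column is assigned to a filtered slice of `data`. A minimal sketch of the same filtering steps on a toy frame, taking an explicit `.copy()` before the assignment. The rows below are made up for illustration, and the location string is simplified to school plus state:

```python
import pandas as pd

toy = pd.DataFrame({
    'campus': ['Berkeley', 'Berkeley', 'Davis', 'Universitywide'],
    'school_num': [51520, 51520, 51520, 51520],
    'school': ['ABRAHAM LINCOLN HIGH SCHOOL'] * 4,
    'state': ['California'] * 4,
    'ethnicity': ['All', 'Asian', 'All', 'All'],
})

keep = (
    (toy['ethnicity'] == 'All')
    & (toy['campus'] != 'Universitywide')
    & (toy['state'] == 'California')
)
# .copy() gives an independent frame, so adding a column does not touch `toy`
pairs = toy[keep].drop_duplicates(subset=['campus', 'school_num']).copy()
pairs['school_loc_str'] = pairs['school'] + ', ' + pairs['state']
print(pairs[['campus', 'school_num', 'school_loc_str']])
```

Combining the three conditions into one boolean mask also makes it harder to accidentally discard one of the filters.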


## Finding the Distances for our data

Because the Google Maps API crashed repeatedly, we decided to save our results between calls so that we could pick up where we left off whenever we needed to. We store the results in a doubly nested dictionary. The first level of the dictionary corresponds to each UC campus, and the second level corresponds to each high school (indexed by ID number). Thus we can look up the distance from a campus to a school with `campus_distances[campus][school]`.

First, define a function that loads the existing distance dictionary, and then load up the cached results...

In [7]:
def load_distances():
    import json
    with open('data/distances.json') as fp:
        return json.load(fp)

In [8]:
campus_distances = load_distances()
campus_distances

{'Berkeley': {'190179': 3522215,
  '210250': 4499324,
  '210730': 4529619,
  '220445': 4966968,
  '260645': 3279005,
  '30216': 1225337,
  '30265': 1202233,
  '30303': 1210402,
  '30397': 1223534,
  '310283': 4657624,
  '320003': 1746371,
  '330357': 4687730,
  '330630': 4677259,
  '333325': 4683857,
  '333480': 4674059,
  '440557': 3091296,
  '480070': 1288563,
  '50003': 607116,
  '50005': 15660,
  '50011': 12543,
  '50013': 16934,
  '50015': 13645,
  '50029': 12053,
  '50035': 5348,
  '50050': 609390,
  '50063': 689175,
  '50068': 570883,
  '50077': 654941,
  '50081': 648744,
  '50082': 640146,
  '50086': 639141,
  '50090': 641611,
  '50093': 657480,
  '50095': 315621,
  '50100': 200427,
  '50103': 638336,
  '50109': 62700,
  '50115': 56408,
  '50118': 674007,
  '50119': 134456,
  '50126': 662924,
  '50130': 615382,
  '50134': 673300,
  '50135': 449969,
  '50140': 682552,
  '50144': 613315,
  '50150': 395154,
  '50155': 633940,
  '50160': 466771,
  '50162': 600433,
  '50165': 342098
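
One detail worth keeping in mind with this cache: JSON object keys are always strings, so school IDs that were integers when saved come back as strings such as `'50035'`. A minimal sketch of the round trip, using a tiny hand-made dictionary rather than the real cache file (`lookup` is a hypothetical helper):

```python
import json

cache = {'Berkeley': {50035: 5348, 50011: 12543}}  # integer school IDs

# Same round trip as saving the cache to disk and loading it again
restored = json.loads(json.dumps(cache))
print(restored['Berkeley']['50035'])  # keys are now strings

def lookup(distances, campus, school_num):
    """Look up a cached distance, accepting the school ID as int or str."""
    return distances[campus][str(school_num)]

print(lookup(restored, 'Berkeley', 50035))
```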

Now, go through all of the campuses, and for all the high schools that we haven't found yet, calculate the distance to them. Do them in batches of 100 at a time so that if we overload the API we don't lose too much progress. After finding each batch of distances, update the existing dictionary to save it. After going through all campuses, save our progress.

In [9]:
for campus, group in no_dups.groupby('campus'):
    # Keys loaded from JSON are strings; cast to int before comparing with school_num
    found_distances = [int(k) for k in campus_distances[campus]]
    not_found_schools = ~group['school_num'].isin(found_distances)
    not_found = group[not_found_schools]
    to_do = not_found[:100]  # at most 100 schools per campus per run

    print("getting the distance from UC " + campus + " to " + str(len(to_do)) + " schools out of " + str(len(not_found)))
    schools = to_do['school_loc_str'].values
    distances = get_distance(campus, schools)
    # Store new keys as strings so they match the keys already in the cache
    new_distances = dict(zip(to_do['school_num'].astype(str), distances))

    campus_distances[campus].update(new_distances)
    
print('saving...')
import json
with open('data/distances.json', 'w') as fp:
    json.dump(campus_distances, fp)
print("DONE")

getting the distance from UC Berkeley to 0 schools out of 0
getting the distance from UC Davis to 0 schools out of 0
getting the distance from UC Irvine to 0 schools out of 0
getting the distance from UC Los Angeles to 0 schools out of 0
getting the distance from UC Merced to 0 schools out of 0
getting the distance from UC Riverside to 0 schools out of 0
getting the distance from UC San Diego to 0 schools out of 0
getting the distance from UC Santa Barbara to 0 schools out of 0
getting the distance from UC Santa Cruz to 0 schools out of 0
saving...
DONE
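
The loop above writes the cache once, after all campuses have been processed, so an API failure partway through loses that run's results. A minimal sketch of saving after every batch instead. `fetch_batch` is a hypothetical stand-in for `get_distance`, and the file is written to a temporary path first so that a crash mid-write cannot corrupt the existing cache:

```python
import json
import os
import tempfile

def save_cache(cache, path):
    """Write the cache to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w') as fp:
        json.dump(cache, fp)
    os.replace(tmp_path, path)

def fill_cache(cache, todo, fetch_batch, path, batch_size=100):
    """Fetch missing distances batch by batch, saving after each batch."""
    for campus, school_ids in todo.items():
        for i in range(0, len(school_ids), batch_size):
            batch = school_ids[i:i + batch_size]
            distances = fetch_batch(campus, batch)
            cache.setdefault(campus, {}).update(zip(map(str, batch), distances))
            save_cache(cache, path)  # progress survives a later failure

# Demo with a fake fetcher that returns a constant distance for every school
demo_path = os.path.join(tempfile.mkdtemp(), 'distances.json')
demo_cache = {}
fill_cache(demo_cache, {'Berkeley': [50035, 50011, 50013]},
           lambda campus, ids: [1000] * len(ids), demo_path, batch_size=2)
with open(demo_path) as fp:
    print(json.load(fp))
```

The trade-off is one extra file write per batch, which is negligible next to the one-second pause between API requests.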


Explore the data we have collected so far...

In [10]:
for campus, dict_ in campus_distances.items():
    print(campus, len(dict_))
    for i, (school_id, distance) in enumerate(dict_.items()):
        if i == 5:
            print('...')
            break
        print(school_id, distance)
#         if distance is np.nan:
#             print(no_dups[no_dups['school_num'] == school_id])

Santa Barbara 801
51520 172024
52910 534268
53075 460956
52820 362720
51315 528187
...
Santa Cruz 780
51520 552122
52910 120536
53075 52354
52820 740919
51315 127583
...
Los Angeles 831
50944 110570
51520 33275
52910 616195
53075 542882
52820 211675
...
Merced 694
51520 447598
52910 228469
53075 212163
52820 636395
51315 190935
...
San Diego 812
50944 183419
51520 176715
52910 799366
53075 726054
52820 23173
...
Riverside 738
50944 35796
51520 88780
52910 710606
51315 672645
50438 172181
...
Davis 800
51520 637455
52910 126819
53075 166247
52820 826251
51315 93385
...
Irvine 788
50944 81503
51520 69439
52910 696308
53075 622996
51315 658347
...
Berkeley 819
51520 601648
52910 33037
53075 76043
52820 790444
51315 21980
...
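
Failed lookups are stored as NaN, and after a JSON round trip those values are ordinary Python floats, so an identity test such as `distance is np.nan` does not reliably detect them. A minimal sketch that counts missing distances per campus with `math.isnan`. The dictionary is hand-made for illustration, and `count_missing` is a hypothetical helper:

```python
import json
import math

sample = {'Berkeley': {'50035': 5348, '99999': float('nan')},
          'Davis': {'50035': 110000}}
sample = json.loads(json.dumps(sample))  # NaN survives as a plain float

def count_missing(distances):
    """Return {campus: number of schools whose distance is NaN}."""
    return {campus: sum(1 for d in by_school.values() if math.isnan(d))
            for campus, by_school in distances.items()}

print(count_missing(sample))
```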


## Add the Distance Data to our Dataframe
We add the distances to a copy of the original dataframe so that we keep all of the original information too.

In [11]:
final_data = data.copy()
final_data['distance'] = np.nan #fill with NaNs to start
final_data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,,
1,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,,
2,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,,
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571,
4,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571,
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,
6,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,,
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,,
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,
9,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,,


In [12]:
final_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,341784.0,2008.04411,6.690691,1994.0,2003.0,2009.0,2014.0,2017.0
school_num,341784.0,119628.939514,783239.456241,4019.0,50829.0,51966.0,53425.0,108142900.0
app_num,341784.0,23.285888,48.548044,5.0,7.0,12.0,25.0,4973.0
adm_num,238259.0,15.95677,30.873295,3.0,5.0,9.0,17.0,3274.0
enr_num,88550.0,12.169848,20.896512,3.0,5.0,7.0,12.0,1371.0
app_gpa,341784.0,3.688469,0.209092,1.362,3.556406,3.700309,3.832566,4.516
adm_gpa,238259.0,3.891925,0.224982,2.548333,3.732807,3.904118,4.071429,4.495
enr_gpa,88550.0,3.830525,0.227511,2.598,3.688776,3.842308,3.993077,4.43
distance,0.0,,,,,,,


Go through the dataset and fill in the proper distance value for each row...

In [13]:
campus_distances = load_distances()
for campus, dict_ in campus_distances.items():
    print(campus, len(dict_))
    campus_matches = final_data['campus']==campus
    for i, (num, dist) in enumerate(dict_.items()):
        print("\r{}/{}".format(i, len(dict_)), end='', flush=True)
        school_matches = final_data['school_num']==int(num)
        final_data.loc[school_matches & campus_matches, 'distance'] = dist
    print()

Santa Barbara 801
800/801
Santa Cruz 780
779/780
Los Angeles 831
830/831
Merced 694
693/694
San Diego 812
811/812
Riverside 738
737/738
Davis 800
799/800
Irvine 788
787/788
Berkeley 819
818/819
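
The loop above scans the whole dataframe once per cached school. A possible alternative, sketched here on toy data, is to flatten the nested dictionary into a lookup table and attach it with a single `merge`. The frame and dictionary below are made up for illustration, with a few values borrowed from the outputs above:

```python
import pandas as pd

toy_distances = {'Berkeley': {'51520': 601648, '52910': 33037},
                 'Davis': {'51520': 637455}}
toy_data = pd.DataFrame({
    'campus': ['Berkeley', 'Berkeley', 'Davis', 'Davis'],
    'school_num': [51520, 52910, 51520, 99999],
})

# One row per (campus, school_num) pair; JSON keys are strings, so cast to int
lookup = pd.DataFrame(
    [(campus, int(num), dist)
     for campus, by_school in toy_distances.items()
     for num, dist in by_school.items()],
    columns=['campus', 'school_num', 'distance'],
)

# A left merge keeps every original row; pairs without a cached distance get NaN
merged = toy_data.merge(lookup, on=['campus', 'school_num'], how='left')
print(merged)
```

Note that this sketch produces a new frame rather than filling a pre-existing `distance` column in place.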


In [14]:
final_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,341784.0,2008.04411,6.690691,1994.0,2003.0,2009.0,2014.0,2017.0
school_num,341784.0,119628.939514,783239.456241,4019.0,50829.0,51966.0,53425.0,108142900.0
app_num,341784.0,23.285888,48.548044,5.0,7.0,12.0,25.0,4973.0
adm_num,238259.0,15.95677,30.873295,3.0,5.0,9.0,17.0,3274.0
enr_num,88550.0,12.169848,20.896512,3.0,5.0,7.0,12.0,1371.0
app_gpa,341784.0,3.688469,0.209092,1.362,3.556406,3.700309,3.832566,4.516
adm_gpa,238259.0,3.891925,0.224982,2.548333,3.732807,3.904118,4.071429,4.495
enr_gpa,88550.0,3.830525,0.227511,2.598,3.688776,3.842308,3.993077,4.43
distance,243578.0,350520.043276,345014.599624,662.0,101178.0,224885.0,599219.0,5083067.0


In [15]:
final_data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,,601648.0
1,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,,601648.0
2,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,,601648.0
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571,33037.0
4,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571,33037.0
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,76043.0
6,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,,76043.0
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,,790444.0
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,21980.0
9,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,,21980.0


Save the finished dataframe to disk!

In [17]:
final_data.to_csv('data/distances.csv', sep=',', index=False)