In [3]:
import pandas as pd
%matplotlib inline
import pylab as plt
import numpy as np

# Distance from high school to college campuses

**Run `easy_install googlemaps` to install the required Python dependency. (`pip` did not appear to work for us.)**

**Please note, this notebook is not finished yet. We are still trying to fix the issues we encountered while using the Google Maps API. This notebook is meant to show our work and progress**.

While exploring our dataset we realized we need additional data to perform meaningful yield prediction. One of the features we think might be meaningful in predicting admission yield is the distance from a student's high school to the college campus.

We will use the Google Maps API to find the locations of high schools and the distance from each school to each UC campus.

In [2]:
import googlemaps

# Key left as a courtesy to the instructor.
# key = 'AIzaSyAfMIzlBeHc_rJo1n1OgnRVGhvgWxY_MiE' #Michal's
# key = 'AIzaSyA84f9q_9o6LCnsqaLnpSyFkj7tS0rU0to' #Nick's
key = 'AIzaSyBF-P9gMxVzvV0O2jjrDa853DtXCn4yTL8' #Nelson's
gmaps = googlemaps.Client(key=key)

## Finding location and distance using Google Maps API

Next, we will define functions to find the following for each high school/UC campus combo:
 - Location of the high school
 - Distance between the high school and the campus

In [3]:
import time
def get_distance(campus_abbr, school_strings):
    if isinstance(school_strings, str):
        school_strings = [school_strings]
    if campus_abbr == 'Universitywide':
        raise ValueError("Can't get the distance to the entire university system")
    campus_str = 'University of California, {}'.format(campus_abbr)
    
    # There is a max of 25 destinations per request, so split them up.
    N = 25
    chunks = [school_strings[i:i+N] for i in range(0, len(school_strings), N)]
    results = []
    for c in chunks:
        time.sleep(1)  # ensure we don't go over the 100 elements/sec limit
        try:
            response = gmaps.distance_matrix(origins=campus_str, destinations=c)
            by_hs = response['rows'][0]['elements']
        except Exception as e:
            # Not every failure is a timeout, so keep the original error attached.
            raise RuntimeError("Google Maps API request failed") from e
        for entry in by_hs:
            if 'distance' in entry:
                results.append(entry['distance']['value'])
            else:
                # Google Maps couldn't look up that distance.
                results.append(np.nan)
    return results

def get_school_loc_str(df):
    loc = df['school'].values.copy()
    loc += np.where(df['city'].notnull(),  ', '+df['city'], '' )
    loc += np.where(df['state'].notnull(), ', '+df['state'], '' )
    loc += np.where(df['country'].notnull(),  ', '+df['country'], '' )
    return loc
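
The batching step inside `get_distance` can be looked at in isolation. The sketch below uses a hypothetical helper, `chunked`, that is not part of this notebook; it only illustrates how a list of destinations is split into groups of at most 25, the per-request limit mentioned in the code comment.

```python
def chunked(items, size=25):
    """Split a list into consecutive pieces of at most `size` items.

    `chunked` is a hypothetical helper, used here for illustration only.
    """
    return [items[i:i + size] for i in range(0, len(items), size)]


# 60 made-up destination strings -> batches of 25, 25 and 10
destinations = ["school {}".format(i) for i in range(60)]
batches = chunked(destinations)
print([len(b) for b in batches])  # [25, 25, 10]
```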

In [4]:
get_distance("Berkeley", ["ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles", "LAWRENCEVILLE SCHOOL, Lawrenceville, New jersey"])

[601648, 4656150]
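
As far as we know, the `value` field of a Distance Matrix `distance` entry is expressed in meters, and the request defaults to driving routes, so the numbers above should be road distances rather than straight-line ones. A minimal sketch for converting them to more readable units follows; `meters_to_km` and `meters_to_miles` are hypothetical helpers, not part of this notebook.

```python
METERS_PER_KM = 1000.0
METERS_PER_MILE = 1609.344  # international mile


def meters_to_km(meters):
    """Convert a distance in meters to kilometers."""
    return meters / METERS_PER_KM


def meters_to_miles(meters):
    """Convert a distance in meters to miles."""
    return meters / METERS_PER_MILE


# The two values returned by the get_distance call above
for meters in [601648, 4656150]:
    print("{:.1f} km / {:.1f} mi".format(meters_to_km(meters),
                                         meters_to_miles(meters)))
```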

Next, we will use the above functions on our main dataset.

In [5]:
data = pd.read_csv('../data/processed.csv')
data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,
1,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,
2,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571
4,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,
6,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846
9,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,


## Issues

We encountered multiple issues while trying to collect our location data. The main problems were:

 - The Google Maps API only allows for a small number of API calls per day
 - The API crashed repeatedly
 
Below, one can find different attempts we made to query the API and deduplicate the location results.
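
One common way to cope with intermittent failures, which this notebook does not use, is to retry a failed call after a growing delay. The sketch below assumes a hypothetical helper, `call_with_retry`; its name, parameters, and defaults are illustrative and are not part of the `googlemaps` client.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff if it raises.

    Hypothetical helper: names and defaults are assumptions made for
    this sketch, not part of the googlemaps client.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `call_with_retry(lambda: get_distance("Berkeley", schools))` would then tolerate a couple of transient errors. Retrying does not help once the daily quota is exhausted, so it only addresses the second problem listed above.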

In [6]:
no_dups = data[data['ethnicity'] == 'All']
no_dups = no_dups.drop_duplicates(subset=['campus', 'school_num'])
no_dups['school_loc_str'] = get_school_loc_str(no_dups)
no_dups = no_dups[no_dups['campus'] != 'Universitywide']
no_dups = no_dups[no_dups['state'] == 'California']
no_dups.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,school_loc_str
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.62,,,"ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles, Cali..."
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.12125,4.088571,"ABRAHAM LINCOLN HIGH SCHOOL, San Francisco, Ca..."
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,"ABRAHAM LINCOLN HIGH SCHOOL, San Jose, Califor..."
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786,,,"ACADEMY OUR LADY OF PEACE, San Diego, Californ..."
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,"ACALANES HIGH SCHOOL, Lafayette, California, USA"


In [7]:
# no_dups[no_dups['school_num'] == '']

## Saving results

Unfortunately, the Google Maps API crashed repeatedly, so we decided to save the results in a persistent dictionary stored as a JSON file. This way we kept the distances already fetched even when a later API call crashed.
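
A minimal sketch of the saving side is shown here, assuming a hypothetical `save_distances` function as the counterpart of `load_distances`. Writing to a temporary file and renaming it is our own assumption about how to keep the old file intact if the process dies mid-write.

```python
import json
import os
import tempfile


def save_distances(distances, path):
    """Write the nested {campus: {school_num: meters}} dict to `path` as JSON.

    Hypothetical counterpart to load_distances. The data goes to a temporary
    file first and is then renamed over `path`, so an interrupted write does
    not leave a half-written file behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fp:
        json.dump(distances, fp)
    os.replace(tmp_path, path)


# Round trip with a made-up entry in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "distances.json")
    save_distances({"Berkeley": {51520: 601648}}, demo_path)
    with open(demo_path) as fp:
        print(json.load(fp))  # {'Berkeley': {'51520': 601648}}
```

Note that JSON stores dictionary keys as strings, so an integer `school_num` comes back as `'51520'` and has to be converted with `int(...)` before it is compared with the dataframe column.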

In [6]:
def load_distances():
    import json
    with open('../data/distances.json') as fp:
        return json.load(fp)

In [9]:
gb = no_dups.groupby('campus')
to_be_done = gb.groups.keys() - {'Universitywide'}
campus_distances = load_distances()
print(campus_distances)

{'Santa Barbara': {'51520': 172024, '52910': 534268, '53075': 460956, '52820': 362720, '51315': 528187, '50438': 89051, '53276': 467769, '50003': 116869, '50005': 515350, '50035': 533094, '320003': 1435994, '51525': 162476, '50050': 179483, '51915': 544336, '52742': 359641, '50077': 225634, '53378': 182414, '52495': 495494, '50974': 484047, '680400': nan, '53345': 614552, '230350': 3830126, '53077': 447144, '51355': 219159, '50115': 558772, '50118': 298004, '50119': 407730, '53163': 502908, '50130': 186075, '50135': 974767, '53078': 465850, '52658': 257514, '50910': 575855, '50470': 312163, '50150': 133708, '50830': 192883, '53125': 501076, '50155': 206151, '50160': 240285, '50165': 187142, '50724': 519281, '50172': 482860, '50188': 197526, '50205': 251533, '50225': 199630, '50235': 342750, '53436': 564618, '51092': 710576, '50265': 188116, '50245': 188116, '50912': 664469, '53080': 461141, '50380': 159543, '51540': 166743, '50280': 550858, '51550': 170730, '50290': 526714, '221900': 4

In [10]:
# for campus in to_be_done:
#     group = gb.get_group(campus)
#     found_distances = campus_distances[campus].keys()
#     not_found_schools = ~group['school_num'].isin(found_distances) 
#     not_found = group[not_found_schools]
#     to_do = not_found[:100]
    
#     print("getting the distance from UC " + campus + " to " + str(len(to_do)) + " schools out of " + str(len(not_found)))
#     schools = to_do['school_loc_str'].values
#     distances = get_distance(campus, schools)
#     new_distances = dict(   zip(to_do['school_num'], distances)   )
    
#     campus_distances[campus].update(new_distances)
# print('saving...')
# import json
# with open('../data/distances.json', 'w') as fp:
#     json.dump(campus_distances, fp)
# print("DONE")

In [11]:
for campus, dict_ in campus_distances.items():
    print(campus, len(dict_))
    for i, (school_id, distance) in enumerate(dict_.items()):
        if i == 5:
            print('...')
            break
        print(school_id, distance)
#         if distance is np.nan:
#             print(no_dups[no_dups['school_num'] == school_id])

Santa Barbara 801
51520 172024
52910 534268
53075 460956
52820 362720
51315 528187
...
Santa Cruz 780
51520 552122
52910 120536
53075 52354
52820 740919
51315 127583
...
Los Angeles 831
50944 110570
51520 33275
52910 616195
53075 542882
52820 211675
...
Merced 694
51520 447598
52910 228469
53075 212163
52820 636395
51315 190935
...
San Diego 812
50944 183419
51520 176715
52910 799366
53075 726054
52820 23173
...
Riverside 738
50944 35796
51520 88780
52910 710606
51315 672645
50438 172181
...
Davis 800
51520 637455
52910 126819
53075 166247
52820 826251
51315 93385
...
Irvine 788
50944 81503
51520 69439
52910 696308
53075 622996
51315 658347
...
Berkeley 819
51520 601648
52910 33037
53075 76043
52820 790444
51315 21980
...


## Add the distance data to our dataframe

In [12]:
# campuses = [campus for school in dict_ for dict_ in distances]

NameError: name 'distances' is not defined

In [7]:
final_data = data.copy()
final_data['distance'] = np.nan #fill with NaNs to start
final_data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
0,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,,
1,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,,
2,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,51520,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,,
3,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571,
4,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,52910,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571,
5,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,
6,Berkeley,1994,ABRAHAM LINCOLN HIGH SCHOOL,53075,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,,
7,Berkeley,1994,ACADEMY OUR LADY OF PEACE,52820,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,,
8,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,
9,Berkeley,1994,ACALANES HIGH SCHOOL,51315,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,,


In [8]:
final_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,341784.0,2008.04411,6.690691,1994.0,2003.0,2009.0,2014.0,2017.0
school_num,341784.0,119628.939514,783239.456241,4019.0,50829.0,51966.0,53425.0,108142900.0
app_num,341784.0,23.285888,48.548044,5.0,7.0,12.0,25.0,4973.0
adm_num,238259.0,15.95677,30.873295,3.0,5.0,9.0,17.0,3274.0
enr_num,88550.0,12.169848,20.896512,3.0,5.0,7.0,12.0,1371.0
app_gpa,341784.0,3.688469,0.209092,1.362,3.556406,3.700309,3.832566,4.516
adm_gpa,238259.0,3.891925,0.224982,2.548333,3.732807,3.904118,4.071429,4.495
enr_gpa,88550.0,3.830525,0.227511,2.598,3.688776,3.842308,3.993077,4.43
distance,0.0,,,,,,,


Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
8.0,Berkeley,1994.0,ACALANES HIGH SCHOOL,51315.0,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,
8.0,Berkeley,1994.0,ACALANES HIGH SCHOOL,51315.0,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,
13.0,Berkeley,1994.0,ADOLFO CAMARILLO HIGH SCHOOL,50438.0,Camarillo,Ventura,California,USA,Ventura,Asian,11.0,4.0,,4.008438,4.143333,,
13.0,Berkeley,1994.0,ADOLFO CAMARILLO HIGH SCHOOL,50438.0,Camarillo,Ventura,California,USA,Ventura,Asian,11.0,4.0,,4.008438,4.143333,,
15.0,Berkeley,1994.0,ADRIAN C WILCOX HIGH SCHOOL,53276.0,Santa Clara,Santa Clara,California,USA,Santa Clara,All,30.0,9.0,,3.876,4.211111,,
,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,
19.0,Berkeley,1994.0,AGOURA HIGH SCHOOL,50003.0,Agoura Hills,Los Angeles,California,USA,Los Angeles,Asian,15.0,7.0,,3.881042,4.037391,,
59.0,Berkeley,1994.0,APPLE VALLEY HIGH SCHOOL,50118.0,Apple Valley,San Bernardino,California,USA,San Bernardino,White,7.0,3.0,,3.721538,3.868571,,
77.0,Berkeley,1994.0,ARMIJO HIGH SCHOOL,50910.0,Fairfield,Solano,California,USA,Solano,All,12.0,,,3.718333,,,


In [63]:
# final_data[final_data['school_num'].isnull()]
campus_distances = load_distances()
for campus, dict_ in campus_distances.items():
    print(campus, len(dict_))
#     group = final_data.loc[final_data['campus']==campus]
#     print(group)
#     ids, distances = zip(*dict_.items())
#     ids = [str(e) for e in ids]
#     distances = [float(e) for e in distances]
#     print(ids)
#     print(distances)
    campus_matches = final_data['campus']==campus
    for i, (num, dist) in enumerate(dict_.items()):
        print("\r{}/{}".format(i, len(dict_)), end='', flush=True)
#         if i >= 200:
#             break
        
        school_matches = final_data['school_num']==int(num)
#         print(campus_matches & school_matches)
#         display(data[campus_matches & school_matches])
#         display(final_data[campus_matches&school_matches])
#         print(group.loc[group['school_num']==int(num)])
        final_data.loc[school_matches & campus_matches, 'distance'] = dist
    print()
#     group.loc[group['school_num']==ids]['distances'] = distances
#     break

Santa Barbara 801
800/801
Santa Cruz 780
779/780
Los Angeles 831
830/831
Merced 694
693/694
San Diego 812
811/812
Riverside 738
737/738
Davis 800
799/800
Irvine 788
787/788
Berkeley 819
818/819
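
The loop above rescans `final_data` once for every (campus, school) pair. A possible alternative, sketched under assumptions, is to flatten the nested dictionary into one record per pair so that the distances can be joined in a single step. `flatten_distances` is a hypothetical helper and the sample values are a made-up subset shaped like the dictionary printed earlier.

```python
def flatten_distances(campus_distances):
    """Turn {campus: {school_num_str: meters}} into a list of flat records.

    Hypothetical helper. School numbers are converted back to int because
    the JSON file stores dictionary keys as strings.
    """
    records = []
    for campus, by_school in campus_distances.items():
        for school_num, meters in by_school.items():
            records.append({
                "campus": campus,
                "school_num": int(school_num),
                "distance": float(meters),  # NaN stays NaN
            })
    return records


sample = {
    "Berkeley": {"51520": 601648, "52910": 33037},
    "Davis": {"680400": float("nan")},
}
for record in flatten_distances(sample):
    print(record)
```

Passing these records to `pd.DataFrame` and joining with `DataFrame.merge(..., on=['campus', 'school_num'], how='left')` should attach a `distance` column in one pass; this has not been benchmarked against the loop.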


In [64]:
final_data.describe().T

Unnamed: 0,year,school_num,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
count,341744.0,341744.0,341744.0,239364.0,94713.0,341744.0,239364.0,94713.0,235080.0
mean,10948.45,127241.4,9012.887,12851.01,32450.03,8993.89,12839.34,32442.31,350121.6
std,76631.51,785821.4,76857.63,91566.89,143367.6,76859.84,91568.52,143369.3,345475.5
min,1994.0,4019.0,5.0,3.0,3.0,1.362,2.548333,2.598,662.0
25%,2003.0,50867.0,7.0,5.0,5.0,3.562,3.739143,3.710521,98535.0
50%,2009.0,52058.0,12.0,9.0,8.0,3.708667,3.915263,3.871667,223339.0
75%,2014.0,53644.0,26.0,19.0,17.0,3.8464,4.092667,4.050667,603068.0
max,4791196.0,108142900.0,4791196.0,4791196.0,4791196.0,4791196.0,4791196.0,4791196.0,5083067.0


In [65]:
final_data

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
0,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,51520.0,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,,601648.0
1,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,51520.0,Los Angeles,Los Angeles,California,USA,Los Angeles,Asian,8.0,,,3.620000,,,601648.0
2,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,51520.0,Los Angeles,Los Angeles,California,USA,Los Angeles,Hispanic/ Latino,5.0,,,3.620000,,,601648.0
3,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,52910.0,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571,33037.0
4,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,52910.0,San Francisco,San Francisco,California,USA,San Francisco,Asian,50.0,8.0,7.0,3.682931,4.121250,4.088571,33037.0
5,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,53075.0,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,76043.0
6,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,53075.0,San Jose,Santa Clara,California,USA,Santa Clara,Hispanic/ Latino,6.0,,,3.640714,,,76043.0
7,Berkeley,1994.0,ACADEMY OUR LADY OF PEACE,52820.0,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,,790444.0
8,Berkeley,1994.0,ACALANES HIGH SCHOOL,51315.0,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,21980.0
9,Berkeley,1994.0,ACALANES HIGH SCHOOL,51315.0,Lafayette,Contra Costa,California,USA,Contra Costa,Asian,16.0,4.0,,3.557869,3.828333,,21980.0


In [66]:
final_data.to_csv('../data/distances.csv', sep=',', index=False)

In [70]:
final = final_data[final_data['campus'] != 'Universitywide']
final = final[final['state'] == 'California']
final = final[final['ethnicity'] == 'All']
final = final[final['distance'].notnull()]
final['desc'] = get_school_loc_str(final)
final

Unnamed: 0,campus,year,school,school_num,city,county,state,country,region,ethnicity,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance,desc
0,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,51520.0,Los Angeles,Los Angeles,California,USA,Los Angeles,All,14.0,,,3.620000,,,601648.0,"ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles, Cali..."
3,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,52910.0,San Francisco,San Francisco,California,USA,San Francisco,All,58.0,8.0,7.0,3.682931,4.121250,4.088571,33037.0,"ABRAHAM LINCOLN HIGH SCHOOL, San Francisco, Ca..."
5,Berkeley,1994.0,ABRAHAM LINCOLN HIGH SCHOOL,53075.0,San Jose,Santa Clara,California,USA,Santa Clara,All,14.0,,,3.640714,,,76043.0,"ABRAHAM LINCOLN HIGH SCHOOL, San Jose, Califor..."
7,Berkeley,1994.0,ACADEMY OUR LADY OF PEACE,52820.0,San Diego,San Diego,California,USA,San Diego,All,5.0,,,3.786000,,,790444.0,"ACADEMY OUR LADY OF PEACE, San Diego, Californ..."
8,Berkeley,1994.0,ACALANES HIGH SCHOOL,51315.0,Lafayette,Contra Costa,California,USA,Contra Costa,All,61.0,30.0,13.0,3.557869,3.828333,3.563846,21980.0,"ACALANES HIGH SCHOOL, Lafayette, California, USA"
12,Berkeley,1994.0,ADOLFO CAMARILLO HIGH SCHOOL,50438.0,Camarillo,Ventura,California,USA,Ventura,All,32.0,15.0,6.0,4.008438,4.143333,3.966667,592504.0,"ADOLFO CAMARILLO HIGH SCHOOL, Camarillo, Calif..."
15,Berkeley,1994.0,ADRIAN C WILCOX HIGH SCHOOL,53276.0,Santa Clara,Santa Clara,California,USA,Santa Clara,All,30.0,9.0,,3.876000,4.211111,,79299.0,"ADRIAN C WILCOX HIGH SCHOOL, Santa Clara, Cali..."
18,Berkeley,1994.0,AGOURA HIGH SCHOOL,50003.0,Agoura Hills,Los Angeles,California,USA,Los Angeles,All,48.0,23.0,8.0,3.881042,4.037391,3.922500,607116.0,"AGOURA HIGH SCHOOL, Agoura Hills, California, USA"
21,Berkeley,1994.0,ALAMEDA HIGH SCHOOL,50005.0,Alameda,Alameda,California,USA,Alameda,All,58.0,22.0,10.0,3.913448,4.115455,4.186000,15660.0,"ALAMEDA HIGH SCHOOL, Alameda, California, USA"
24,Berkeley,1994.0,ALBANY HIGH SCHOOL,50035.0,Albany,Alameda,California,USA,Alameda,All,30.0,15.0,6.0,3.659333,4.011333,3.788333,5348.0,"ALBANY HIGH SCHOOL, Albany, California, USA"


In [71]:
final.describe()

Unnamed: 0,year,school_num,app_num,adm_num,enr_num,app_gpa,adm_gpa,enr_gpa,distance
count,91138.0,91138.0,91138.0,66016.0,21718.0,91138.0,66016.0,21718.0,91138.0
mean,2007.449253,52297.768132,31.349514,18.362594,10.05599,3.655526,3.896422,3.836858,354684.0
std,6.690267,21338.921186,35.121057,18.354647,7.10987,0.213344,0.246369,0.270573,261496.0
min,1994.0,50003.0,5.0,5.0,5.0,2.468333,2.798462,2.782,662.0
25%,2002.0,50750.0,10.0,7.0,6.0,3.515149,3.71,3.64375,113398.0
50%,2008.0,51472.0,19.0,12.0,8.0,3.66875,3.926,3.865774,261633.0
75%,2013.0,52850.0,38.0,22.0,12.0,3.809333,4.100769,4.054,609581.0
max,2017.0,998463.0,445.0,310.0,142.0,4.356667,4.495,4.43,1348288.0


In [74]:
final['desc'].values

array(['ABRAHAM LINCOLN HIGH SCHOOL, Los Angeles, California, USA',
       'ABRAHAM LINCOLN HIGH SCHOOL, San Francisco, California, USA',
       'ABRAHAM LINCOLN HIGH SCHOOL, San Jose, California, USA', ...,
       'LE GRAND UNION HIGH SCHOOL, Le Grand, California, USA',
       'LE LYCEE FRANCAIS DE LOS ANGELES, Los Angeles, California, USA',
       'LEADERSHIP HIGH SCHOOL, San Francisco, California, USA'],
      dtype=object)