In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading the Data

### Playgrounds Dataset

In [2]:
playgrounds = pd.read_json('../data/playgrounds/DPR_Playgrounds_001.json')
parks = pd.read_json('../data/parks/DPR_Parks_001.json')

### Parks Dataset

# Joining the Playgrounds and Parks datasets

In [4]:
playgrounds.rename(columns={'Name':'Playground_Name', 'Location':'Playground_Location'}, inplace=True)
parks.rename(columns={'Name':'Park_Name', 'Location':'Park_Location'}, inplace=True)
nyc_playgrounds = pd.merge(playgrounds, parks, how='left', on='Prop_ID')


# Test 1: zipcode_finder.py file. Fuzzy address search

There were about 9 edge cases with neither a zipcode nor coordinates in the dataset. I wrote code that returned the correct zipcode for 8 of the 9 entries (I could not get Geopy to return the proper address of Classon Playground with the data available). I am now testing that code on a subset of entries and comparing the results to the ground truth to see whether it scales.

In [6]:
import os
os.chdir('../')
from src.zipcode_finder import *

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="new_app")

In [7]:
df = nyc_playgrounds[(nyc_playgrounds.Zip.notnull())
                     & (nyc_playgrounds.lat.notnull())
                     & (nyc_playgrounds.Playground_Location.notnull())
                     & (nyc_playgrounds.Playground_Name.notnull())
                     & (nyc_playgrounds.Zip.str.len() == 5)]

#### Pseudocode
 - If Playground_Name isn't null -> search Geopy for Playground_Name with borough
 - elif Park_Name isn't null -> search Geopy for Park_Name with borough
 - else search Geopy for Playground_Location with borough
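
The fallback order above can be sketched as a small helper that only builds the query string. This is an illustration, not the code in `src/zipcode_finder.py` (which is not shown here): the name `build_search_query` and the `Borough` column name are assumptions.

```python
def _is_missing(value):
    """True for None and for NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def build_search_query(row, borough_col='Borough'):
    """Pick the first non-missing field in the fallback order, then append the borough.

    Hypothetical helper: `borough_col` is an assumed column name.
    `row` can be a pandas Series or a plain dict (both support .get).
    """
    for col in ('Playground_Name', 'Park_Name', 'Playground_Location'):
        value = row.get(col)
        if not _is_missing(value):
            borough = row.get(borough_col)
            if _is_missing(borough):
                return str(value)
            return '{}, {}'.format(value, borough)
    return None  # nothing usable to search with
```

Keeping the query construction separate from the geocoder call makes the fallback order easy to check without spending any Geopy requests.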

#### Testing location search

In [8]:
tf = df.sample(n=20)
test_series = tf.apply(test,axis=1,args=(geolocator,))
tf['zip2'] = test_series.values

In [9]:
np.mean(tf['Zip']==tf['zip2'])

0.55

While this method worked for 8 of the 9 edge cases I originally built and tested it on, it matched only 55% of this sample, so it does not scale to the larger dataframe and shouldn't be used.

# Test 2: playground coordinate search vs. park zipcode


Question: Do the playground coordinates map to a different zipcode than the one recorded for the park?
 - Maybe the park crosses between multiple zipcodes and the playground is only in one of them
 - Maybe there are data entry issues

In [9]:
# Making sure I only test valid entries (1 zipcode, coordinates present).
# Named valid_rows so it does not shadow the test function applied in Test 1.
valid_rows = nyc_playgrounds[(nyc_playgrounds.lat.notnull()) &
                             (nyc_playgrounds.Zip.str.len() == 5)]

tf2 = valid_rows.sample(n=20, random_state=101)

In [10]:
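def _coord_to_zip(row, geolocator):
    """Reverse-geocode a row's lat/lon and return the 5-digit zipcode, or NaN."""
    coordinate_query = '{}, {}'.format(row['lat'], row['lon'])
    try:
        # Query once and reuse the result instead of calling the API twice
        location = geolocator.reverse(coordinate_query)
        return location.raw['address']['postcode'][:5]
    except (AttributeError, KeyError, TypeError):
        # No match (location is None) or no postcode in the response
        return np.nan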
    

In [11]:
tf2_series = tf2.apply(_coord_to_zip, args=(geolocator,),axis=1)
tf2['zip2'] = tf2_series.values
np.mean(tf2['Zip']==tf2['zip2'])

0.85

In [12]:
tf2[tf2['Zip'] != tf2['zip2']]

Unnamed: 0,Accessible,Adaptive_Swing,Level,Playground_Location,Playground_Name,Playground_ID,Prop_ID,School_ID,Status,lat,lon,Park_Location,Park_Name,Zip,zip2
849,Y,Y,2.0,E 164 ST & RIVER AVE,Mullaly Park (1),X034-01,X034,,,40.8314,-73.9254,Jerome Av to River Av bet. E 164 St and McClel...,Mullaly Park,10452,
353,Y,N,2.0,"Henry St, Market St, E Broadway",Loeb Playground,M067,M067,,,40.7132,-73.9943,"Henry St., Market St., E. Broadway",Sophie Irene Loeb,10002,11201.0
615,Y,N,4.0,"Van Wyck Exwy, 106 Ave, 142 St, 104 Ave",Norelli-hargreaves Playground,Q220B,Q220B,,,40.6902,-73.8088,"Van Wyck Exwy. Sr. Rd. E., 142 St., 106 Ave.",Norelli-Hargreaves Playground,11435,11436.0


There were only 3 cases in this small subset where the method did not work. In all 3, the original zipcode was correct and Geopy failed to return the proper zipcode (manually checked via Google Maps).

I have therefore concluded that the original park zipcodes are reasonably representative of the true playground location. Geopy's coordinate lookup is also fairly accurate (it matched 17 of the 20 sampled rows), but it is clearly not perfect, and its accuracy should be explored further later on. For now, it is useful.
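
Because the 0.85 match rate comes from only 20 rows, it helps to see how wide the plausible range is. The sketch below computes a Wilson score interval for a proportion. This is a standard technique that I am adding for illustration, not something the tests above used, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(matches, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly a 95% level)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = matches / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 17 of the 20 sampled coordinate lookups matched the park zipcode
low, high = wilson_interval(17, 20)
print('coordinate lookup match rate: {:.2f} to {:.2f}'.format(low, high))
```

For 17 of 20 the interval runs from roughly 0.64 to 0.95, which is wide enough that a larger sample would be needed before relying on the 0.85 figure.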

__A note about the small testing samples:__ geopy has a request limit that I do not want to exceed. It would be better to test this on a larger sample size, or even with a train/test approach.
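
One way to test a larger sample without exceeding the request limit is to space the calls out. Below is a minimal stdlib sketch of that idea. `throttled` is a hypothetical helper, and the one-second default is an assumption that should be checked against the geocoding service's usage policy. Geopy also ships its own `geopy.extra.rate_limiter.RateLimiter`, which may be the better choice in practice.

```python
import time


def throttled(func, min_delay_seconds=1.0):
    """Wrap func so that consecutive calls start at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell so the wrapper can update it

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


# Assumed usage with the geolocator defined earlier (not executed here):
# reverse = throttled(geolocator.reverse, min_delay_seconds=1.0)
# location = reverse('40.8314, -73.9254')
```

With a one-second delay, a 500-row sample would take a little over eight minutes, so it is worth saving the results to disk rather than repeating the lookups.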