### Capstone Week Five Data Collection Notebook

This notebook contains the code for harvesting venue data from FourSquare and storing it in a dictionary. To reuse this code, you may have to run the search routine multiple times on separate days to accommodate the query limit (quota) imposed by FourSquare.

In [1]:
#!/Users/john/anaconda3/bin/conda install -c conda-forge folium=0.10.0 --yes
import folium

print('Folium installed and imported!')

Folium installed and imported!


In [2]:
# The code was removed by Watson Studio for sharing.

Your credentials:


## Retrieve Previously Defined Neighborhoods
Neighborhoods are defined, selected, and stored in a separate notebook available on [GitHub](https://github.com/jshubin-ghub/coursera-capstone-project/blob/master/Capstone_Week_Five_Neighborhoods.ipynb).

In [3]:
import pandas as pd

newyork_geo_df = pd.read_pickle("new_york_n_geodata.pkl") 
toronto_geo_df = pd.read_pickle("toronto_n_geodata.pkl")

# a dataframe for testing
newyork_geo_test_df = newyork_geo_df.head()

## Getting Venues From FourSquare

After some experimentation (documentation for FourSquare is sparse), I found that the FourSquare search endpoint with the 'browse' intent produced the broadest results, with the slight drawback that it returned some venues outside the specified radius.
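
The request described above can be sketched with the standard library. This is a minimal illustration, not the notebook's exact call: `build_search_url` is a hypothetical helper, the credentials, version string, and limit are placeholders, and the parameter names are the ones that appear in the query cells of this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Hypothetical helper: assemble a search URL that uses intent=browse."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the search center
        'radius': radius,                # search radius in meters
        'limit': limit,
        'intent': 'browse',
    }
    # urlencode escapes each value, so no manual string formatting is needed
    return SEARCH_ENDPOINT + '?' + urlencode(params)

# placeholder credentials and version, for illustration only
url = build_search_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20200101',
                       40.7128, -74.0060, 500, 50)
query = parse_qs(urlsplit(url).query)
print(query['intent'], query['ll'])
```

Building the query from a dict keeps the parameters readable and avoids escaping mistakes in a long format string.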

In [None]:
# Known bug: search_browse_intent_json_to_dataframe stores the whole category dict
# in 'category_id' instead of just the category id; see the Data Repair section
def search_browse_intent_json_to_dataframe(current_result, geo_dataframe, index, current_df):
    """
    Data from current_result json is parsed and added to current_df dataframe.

    Parameters
    ----------
    current_result : json 
        json returned from foursquare search endpoint with intent = browse modifier 
    geo_dataframe : DataFrame 
        dataframe of data linked to geolocations
    index : int
        current record of interest in the geo_dataframe data
    current_df : DataFrame
        information collected from previously parsed foursquare queries

    Returns
    -------
    DataFrame   
    """

    # New York data is keyed by 'ZIPCode'; Toronto data is keyed by 'PostalCode'
    try:
        neighborhood_label = geo_dataframe.at[index, 'ZIPCode']
    except KeyError:
        neighborhood_label = geo_dataframe.at[index, 'PostalCode']
    for item in range(len(current_result['response']['venues'])):
        usable_row = False
        
        try:
            new_row_list = [neighborhood_label,
                           current_result['response']['venues'][item]['name'],
                           current_result['response']['venues'][item]['id'],
                           current_result['response']['venues'][item]['location']['lat'],
                           current_result['response']['venues'][item]['location']['lng']]
        except KeyError:
            print("error creating new row list at index {} in result list {} items long".format(
                item, len(current_result['response']['venues'])))
            continue  # skip this venue; new_row_list would be missing or stale
        try:
            if len(current_result['response']['venues'][item]['categories'][0])>0:
                new_row_list.append(current_result['response']['venues'][item]['categories'][0])
                distance_from_center = distance((geo_dataframe.at[index, 'Latitude'],
                                                 geo_dataframe.at[index, 'Longitude']),
                                                (new_row_list[3],new_row_list[4]))
                    
                if distance_from_center <1.0:
                    usable_row = True
                else:
                    print("point too far away", distance_from_center)
            else:
                print("no category found for:{}".format(new_row_list))                
        except IndexError as error:
            print (error)
            print ("IndexError exception thrown for:{}".format(new_row_list))

        # add the venue if it isn't already in the dataframe, based on venue_id
        if usable_row:
            new_row_df = pd.DataFrame([new_row_list], columns = ['ZIPCode', 'venue_name', 'venue_id', 'latitude', 'longitude', 'category_id'])
            # compare against the column's values; `in` on a Series tests its index
            if new_row_df['venue_id'][0] not in current_df['venue_id'].values:
                # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
                current_df = pd.concat([current_df, new_row_df], ignore_index = True)
    return current_df
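
The bug noted in the comment above the function comes from appending the whole `categories[0]` dict. A minimal sketch of extracting just the id follows. `extract_category_id` is a hypothetical helper and the sample venues are made up; the only assumption is that each category dict carries an 'id' key, which is what the Data Repair cells rely on.

```python
def extract_category_id(venue):
    """Return the id of the venue's first category, or None if it has none."""
    categories = venue.get('categories') or []
    if not categories:
        return None
    return categories[0].get('id')

# made-up venue records, shaped like the fields the function above reads
sample_venue = {'name': 'Example Cafe',
                'categories': [{'id': 'abc123', 'name': 'Cafe'}]}
uncategorized_venue = {'name': 'Unlabeled Spot', 'categories': []}

print(extract_category_id(sample_venue))
print(extract_category_id(uncategorized_venue))
```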

In [None]:
# here are a couple of helper functions
# distance computes distance between two geocoordinates (in kilometers)
# spread_out computes a rosette of n points around a central geocoordinate.
import math

def distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # radius of the earth in km, returned distance will be in kilometers

    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d


def spread_out(latitude, longitude, num_points, distance):
    # create a rosette of points at a distance from a central point at latitude, longitude
    # the number of points is the number of rosette petals around the central point
    # the center point is returned as the first value of the rosette list
    # distance is in meters; this function assumes spherical geometry for changes in latitude spacing
    # note: the distance parameter shadows the distance() helper inside this function
    radius = 6371 # radius of the earth in km
    km_distance = distance/1000
    
    # number_degrees is the latitude displacement, in degrees, for the passed distance
    number_degrees = 90*(km_distance/((1/4)* 2*math.pi*radius))
    # the longitude displacement is divided by cos(latitude)
    a = math.cos((abs(latitude)/360)*2*math.pi)
    
    rotation = 2*math.pi/num_points
    
    rosette = [(latitude, longitude)]
    for spot in range(num_points):
        spot_rotation = spot*rotation
        spot_lat = latitude + (math.sin(spot_rotation)*number_degrees)
        spot_lon = longitude + (math.cos(spot_rotation)*number_degrees)/a
        rosette.append((spot_lat, spot_lon))
    return rosette
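
For reference, the two helpers above correspond to the following formulas, written here from the code rather than taken from any external source. The symbols are labels chosen for this note: $\varphi$ is latitude, $\lambda$ is longitude, $R = 6371$ km is the Earth radius, $s$ is the rosette distance in km, and $n$ is the number of petals.

The `distance` function is the haversine formula:

$$a = \sin^2\frac{\Delta\varphi}{2} + \cos\varphi_1\cos\varphi_2\,\sin^2\frac{\Delta\lambda}{2},\qquad d = 2R\,\operatorname{atan2}\left(\sqrt{a},\sqrt{1-a}\right)$$

The `spread_out` function places petal $k = 0,\dots,n-1$ at angle $\theta_k = 2\pi k/n$:

$$\delta = 90\cdot\frac{s}{\tfrac{1}{4}\cdot 2\pi R} = \frac{180}{\pi}\cdot\frac{s}{R}\ \text{degrees},\qquad \varphi_k = \varphi + \delta\sin\theta_k,\qquad \lambda_k = \lambda + \frac{\delta\cos\theta_k}{\cos|\varphi|}$$

With $s = 0.5$ km, the value used in the query cells, $\delta \approx 0.0045$ degrees.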


In [None]:
import pickle

try:
    with open('ny_results.pkl', 'rb') as handle:
        ny_results_dict = pickle.load(handle)
        print('loading ny_results_dict from filesystem')
except Exception as e: 
    print("Exception caught: {}".format(e))
    print("creating new ny_results_dict")
    ny_results_dict = {}  # without this, the loop below raises a NameError
ny_index_series = newyork_geo_df.index
# for index in ny_index_series:
#     if (newyork_geo_df.at(index,'ZIPCode')) in ny_results_dict.keys:
#         print('found {} in results dict'.format(newyork_geo_df[index]))
for index in ny_index_series:
    if newyork_geo_df.at[index,'ZIPCode'] in ny_results_dict:
        print('skipping: {}'.format(newyork_geo_df.at[index,'ZIPCode']))
    else:
        print("new entry: {}".format(newyork_geo_df.at[index,'ZIPCode']))


## New York Queries

In [None]:
# Harvest New York foursquare venues at a geocoordinate, using the search endpoint with 'browse' intent

# Duplicate venues, venues outside geographical limits, and records with incomplete 
#  category information are removed.
# 
# Because the number of searches required exceeds the quota imposed by Foursquare, intermediate 
#  results are stored and reloaded as the query quota is refreshed.
#
# Results are a dict of dataframes with ZIPCode keys.

import json
import pickle
import requests
import pandas as pd
import time

try:
    with open('ny_results.pkl', 'rb') as handle:
        ny_results_dict = pickle.load(handle)
        print('loading ny_results_dict from filesystem')
except FileNotFoundError:
    print("creating new ny_results_dict")
    ny_results_dict = {} # will be used to gather venue dataframes for each ZIPCode
###(Changed for Testing)
ny_index_series = newyork_geo_df.index 
#ny_index_series = newyork_geo_test_df.index

for index in ny_index_series:
    print('*****************************')
    print('***** Running queries for zip code: {}  *****'.format(newyork_geo_df.at[index, 'ZIPCode']))
    print('*****************************')
    
    # set up a dataframe to gather unique venue information from multiple searches
    if newyork_geo_df.at[index,'ZIPCode'] in ny_results_dict:
        print('skipping: {}'.format(newyork_geo_df.at[index,'ZIPCode']))
    else:
        search_results_df = pd.DataFrame(columns = ['ZIPCode', 
                                                    'venue_name', 
                                                    'venue_id', 
                                                    'latitude', 
                                                    'longitude', 
                                                    'category_id'])


        ###(Changed for testing)
        lat, lng = newyork_geo_df.at[index, 'Latitude'], newyork_geo_df.at[index, 'Longitude'] 
        #lat, lng = newyork_geo_test_df.at[index, 'Latitude'], newyork_geo_test_df.at[index, 'Longitude']

        rosette = spread_out(lat, lng, 6, 500)
        for spot in rosette:
            time.sleep(2)
            lat, lng = spot[0], spot[1]
            radius = 500
            LIMIT = 1000

            url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent=browse'.format(
                    CLIENT_ID, 
                    CLIENT_SECRET, 
                    VERSION, 
                    lat, 
                    lng, 
                    radius, 
                    LIMIT)
            print('*****************************')
            print('***** Running query at lat:{}, lon:{} *****'.format(lat,lng))
            print('*****************************')
            this_result = requests.get(url).json()
            
            search_results_df = search_browse_intent_json_to_dataframe(this_result, 
                                                                       newyork_geo_df,
                                                                       index,
                                                                       search_results_df)

        ny_results_dict[newyork_geo_df.at[index, 'ZIPCode']] = search_results_df

print("successfully completed data harvest for new york")

In [None]:
# save the collected data
import pickle

with open('ny_results.pkl', 'wb') as handle:
    pickle.dump(ny_results_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
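
Because a harvest can stop partway through when the quota runs out, it may help to save after every ZIP code rather than once at the end. The sketch below is one way to do that, not what this notebook does. `save_checkpoint` and `load_checkpoint` are hypothetical helpers; the first writes to a temporary file and then renames it, so an interrupted write does not corrupt the existing pickle.

```python
import os
import pickle
import tempfile

def save_checkpoint(results, path):
    """Hypothetical helper: pickle `results` to `path` via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as handle:
            pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)
        # the rename is atomic when both paths are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # clean up the partial temp file, then re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path):
    """Return the saved results, or an empty dict if nothing was saved yet."""
    try:
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return {}
```

In the harvest loop, `save_checkpoint` would be called right after each ZIP code's dataframe is stored in the results dict.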

## Toronto Queries

In [None]:
# Harvest toronto foursquare venues at a geocoordinate, using the search endpoint with 'browse' intent
# Code is from 'Harvest New York foursquare venues', above

import json
import pickle
import requests
import pandas as pd
import time

try:
    with open('toronto_results.pkl', 'rb') as handle:
        toronto_results_dict = pickle.load(handle)
        print('loading toronto_results_dict from filesystem')
except FileNotFoundError:
    print("creating new toronto_results_dict")
    toronto_results_dict = {} # will be used to gather venue dataframes for each PostalCode
toronto_index_series = toronto_geo_df.index 


for index in toronto_index_series:
    print('*****************************')
    print('***** Running queries for postal code: {}  *****'.format(toronto_geo_df.at[index, 'PostalCode']))
    print('*****************************')
    
    # set up a dataframe to gather unique venue information from multiple searches
    if toronto_geo_df.at[index,'PostalCode'] in toronto_results_dict:
        print('skipping: {}'.format(toronto_geo_df.at[index,'PostalCode']))
    else:
        search_results_df = pd.DataFrame(columns = ['PostalCode', 
                                                    'venue_name', 
                                                    'venue_id', 
                                                    'latitude', 
                                                    'longitude', 
                                                    'category_id'])

        lat, lng = toronto_geo_df.at[index, 'Latitude'], toronto_geo_df.at[index, 'Longitude'] 

        rosette = spread_out(lat, lng, 6, 500)
        for spot in rosette:
            time.sleep(2)
            lat, lng = spot[0], spot[1]
            radius = 500
            LIMIT = 1000
   
            url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent=browse'.format(
                    CLIENT_ID, 
                    CLIENT_SECRET, 
                    VERSION, 
                    lat, 
                    lng, 
                    radius, 
                    LIMIT)
            print('*****************************')
            print('***** Running query at lat:{}, lon:{} *****'.format(lat,lng))
            print('*****************************')
            this_result = requests.get(url).json()
            
            search_results_df = search_browse_intent_json_to_dataframe(this_result, 
                                                                       toronto_geo_df,
                                                                       index,
                                                                       search_results_df)

        


        # displaced searches completed and unique results have been collected


        toronto_results_dict[toronto_geo_df.at[index, 'PostalCode']] = search_results_df

print("successfully completed data harvest for toronto")

In [None]:
# save the collected data
import pickle

with open('toronto_results.pkl', 'wb') as handle:
    pickle.dump(toronto_results_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [4]:
import pickle
import pandas as pd
# get the ny data back
try:
    with open('ny_results.pkl', 'rb') as handle:
        ny_results_dict = pickle.load(handle)
        print('loading ny_results_dict from filesystem')
except Exception as e: 
    print("Exception caught: {}".format(e))
# get the toronto data back
try:
    with open('toronto_results.pkl', 'rb') as handle:
        toronto_results_dict = pickle.load(handle)
        print('loading toronto_results_dict from filesystem')
except Exception as e: 
    print("Exception caught: {}".format(e))

loading ny_results_dict from filesystem
loading toronto_results_dict from filesystem


## Data Repair

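Data in the 'category_id' field needs to be repaired: the harvesting function stored each venue's whole category dict there instead of just the category id.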
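
The repair can also be written as a column-wise operation. The sketch below uses made-up rows and is an alternative to, not a copy of, the row-by-row loops in the following cells. It assumes every 'category_id' value is a dict with an 'id' key.

```python
import pandas as pd

# made-up rows shaped like the harvested data: 'category_id' holds a dict
df = pd.DataFrame({
    'venue_name': ['Example Cafe', 'Example Park'],
    'category_id': [{'id': 'abc123', 'name': 'Cafe'},
                    {'id': 'def456', 'name': 'Park'}],
})

# pull the 'id' out of each dict, then drop the original column
df['new_category'] = df['category_id'].map(lambda category: category['id'])
repaired_df = df.drop(columns=['category_id'])
print(repaired_df)
```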

In [7]:
# repair New York neighborhood venues, extracting the category id from the dict of values 
# mistakenly stored in column 'category_id'

ny_cleaned_results = {}
for postcode in ny_results_dict:
    this_df = pd.DataFrame(ny_results_dict[postcode])

    for index in this_df.index:
        this_df.at[index,'new_category']=this_df.at[index,'category_id']['id']

    new_df = this_df.drop(columns =['category_id'])
    ny_cleaned_results[postcode]=new_df
        

In [8]:
# repair Toronto neighborhood venues, extracting the category id from the dict of values 
# mistakenly stored in column 'category_id', and fixing the mislabeled postal code column

toronto_cleaned_results = {}
for postcode in toronto_results_dict:
    this_df = pd.DataFrame(toronto_results_dict[postcode])
    for index in this_df.index:
        this_df.at[index,'new_category']=this_df.at[index,'category_id']['id']    
    new_df = this_df.drop(columns =['category_id', 'PostalCode'])
    new_df.rename(columns = {'ZIPCode':'PostalCode'}, inplace = True)
    toronto_cleaned_results[postcode]=new_df

In [9]:
# save the corrected data

with open('ny_cleaned_results.pkl', 'wb') as handle:
    pickle.dump(ny_cleaned_results, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('toronto_cleaned_results.pkl', 'wb') as handle:
    pickle.dump(toronto_cleaned_results, handle, protocol=pickle.HIGHEST_PROTOCOL)