## Part 2: data acquisition and processing of restaurant data from the Foursquare API

Now I'll get the data on the types of restaurants and likes in each neighborhood. 

In [1]:
import pandas as pd
import geopandas as gpd
import folium
from shapely.geometry import Point
import wget
import requests

In [2]:
centroids = gpd.read_file('processed/la_times_centroids.shp')
polys = gpd.read_file('processed/la_times_polygons.shp')
centroids.columns = ['name', 'r_list', 'centroids']

First I'll set my Foursquare credentials, the category for our searches (food) and a 100 venue limit in each search.

In [3]:
with open(r'C:\Users\ianfi\.FS_apikey') as keyfile:  # raw string avoids the invalid '\i' escape
    CLIENT_ID = keyfile.readline().strip()
    CLIENT_SECRET = keyfile.readline().strip()
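The key file is assumed to hold the client ID on the first line and the client secret on the second. Here is a minimal sketch of a more portable way to read it. The helper name `read_credentials` and the home-directory location are my own assumptions, not part of the original notebook.

```python
from pathlib import Path


def read_credentials(path):
    """Return (client_id, client_secret) from a two-line key file."""
    lines = Path(path).read_text().splitlines()
    if len(lines) < 2:
        raise ValueError('expected two lines: client id, then client secret')
    return lines[0].strip(), lines[1].strip()


# Hypothetical location; adjust to wherever the key file actually lives.
default_key_path = Path.home() / '.FS_apikey'
```

Building the path from `Path.home()` avoids hard-coding a Windows user directory, so the same cell would run on another machine.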

In [4]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
categoryId = '4d4b7105d754a06374d81259' #category ID for food

Let's do a test run on a single neighborhood (Alhambra), which sits at index 1 of the centroids and polys data frames.

In [5]:
#define the name, centroid, radius for Alhambra
name = centroids.loc[1]['name']
centroid = centroids.loc[1]['centroids']
radius = centroids.loc[1]['r_list']
lat = centroid.y
lng = centroid.x

Now we'll call Foursquare to get restaurants in Alhambra:

In [6]:
url = 'https://api.foursquare.com/v2/venues/explore?&categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    categoryId, 
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lng, 
    radius*1000.00, 
    LIMIT)
results = requests.get(url).json()["response"]['groups'][0]['items']
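The query string above is assembled by hand with `format`. As a sketch, the same URL can be built from a dictionary with the standard library, which takes care of escaping. The function name `build_explore_url` and the placeholder credentials are assumptions for illustration, and the radius is assumed to be in kilometers because the notebook multiplies it by 1000.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, radius_km, client_id, client_secret,
                      category_id='4d4b7105d754a06374d81259',
                      version='20180605', limit=100):
    """Build the explore URL; radius_km is converted to meters."""
    params = {
        'categoryId': category_id,
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius_km * 1000.0,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials and coordinates, for illustration only.
example_url = build_explore_url(34.08, -118.13, 2.5, 'MY_ID', 'MY_SECRET')
```

Keeping the parameters in a dictionary makes it easier to see at a glance which values go into each request.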

The results list now contains the restaurants within our given radius of the center of Alhambra, one dict per venue. Let's examine an entry:

In [7]:
results[1]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '56298249498edac0ff05a775',
  'name': '85C Bakery Cafe',
  'location': {'address': '300 W Main St #101',
   'crossStreet': 'at S 3rd St',
   'lat': 34.093411,
   'lng': -118.130214,
   'labeledLatLngs': [{'label': 'display',
     'lat': 34.093411,
     'lng': -118.130214}],
   'distance': 1159,
   'postalCode': '91801',
   'cc': 'US',
   'city': 'Alhambra',
   'state': 'CA',
   'country': 'United States',
   'formattedAddress': ['300 W Main St #101 (at S 3rd St)',
    'Alhambra, CA 91801',
    'United States']},
  'categories': [{'id': '4bf58dd8d48988d16a941735',
    'name': 'Bakery',
    'pluralName': 'Bakeries',
    'shortName': 'Bakery',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
     'suffix': '.png'},
    'primary': True}],
  'photos': {'count': 0, 'groups': []}},
 'referralId': 'e-0-5629
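Each entry is a nested dict, so it helps to pull out just the fields we need. The sketch below uses an abridged copy of the entry shown above. The helper name `summarize_item` is hypothetical, and the fallback for a missing category list is an assumption about what a response might look like, not something observed in this data.

```python
def summarize_item(item):
    """Flatten one explore result into the fields used later on."""
    venue = item['venue']
    location = venue['location']
    categories = venue.get('categories', [])
    # Fall back to None if a venue has no category list (assumed case).
    category = categories[0]['name'] if categories else None
    return {
        'name': venue['name'],
        'id': venue['id'],
        'lat': location['lat'],
        'lng': location['lng'],
        'category': category,
    }


# Abridged from the entry printed above.
sample_item = {
    'venue': {
        'id': '56298249498edac0ff05a775',
        'name': '85C Bakery Cafe',
        'location': {'lat': 34.093411, 'lng': -118.130214},
        'categories': [{'name': 'Bakery', 'primary': True}],
    }
}
summary = summarize_item(sample_item)
```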

Unfortunately, this call doesn't give us the total likes for a venue. To get them, we need a separate call to Foursquare for each venue. Let's loop through the venues in results and request the total likes for each one. Some responses don't contain a 'likes' key, and indexing into them raises an error, so we'll wrap the lookup in a try/except block, record zero likes for those venues, and print a message to keep track of them. Then we'll load all of the restaurant data into a dataframe so it is easier to analyze:

In [8]:
venues_list = []
for v in results:
    current_loc = Point(v['venue']['location']['lng'],v['venue']['location']['lat'])
    VENUE_ID = v['venue']['id']
    url = 'https://api.foursquare.com/v2/venues/{0}/likes?client_id={1}&client_secret={2}&v={3}'.format(
        VENUE_ID,
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION)
    try:
        total_likes = int(requests.get(url).json()['response']['likes']['count'])
    except:
        total_likes = 0
        print('ERROR: NO LIKES KEY FOUND FOR VENUE')
    venues_list.append([( 
        v['venue']['name'],
        VENUE_ID,
        current_loc,
        total_likes,
        v['venue']['categories'][0]['name'])])
nearby_venues = gpd.GeoDataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = [
              'Venue',
              'Venue_ID',                      
              'Venue LonLat',
              'Total Likes',
              'Venue Category']
nearby_venues.head(5)

| | Venue | Venue_ID | Venue LonLat | Total Likes | Venue Category |
| --- | --- | --- | --- | --- | --- |
| 0 | The Hat | 4a36fdd1f964a5201e9e1fe3 | POINT (-118.1231599198082 34.07866657962181) | 98 | Sandwich Place |
| 1 | 85C Bakery Cafe | 56298249498edac0ff05a775 | POINT (-118.130214 34.093411) | 80 | Bakery |
| 2 | Pepe's Mexican Food | 4b3818a4f964a5208c4b25e3 | POINT (-118.1284653643477 34.07852553536646) | 48 | Mexican Restaurant |
| 3 | Blaze Pizza | 560b1a93498e26781b72d4c9 | POINT (-118.125894 34.09542) | 39 | Pizza Place |
| 4 | Chengdu Taste | 51d49380498e9f26e5a0936b | POINT (-118.1324607364275 34.0779432701684) | 106 | Szechuan Restaurant |
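The loop above records 0 likes whenever the lookup fails, which makes a venue with no likes indistinguishable from a failed request. As a sketch of one alternative, the helper below returns `None` when the count is absent. The name `extract_like_count` is hypothetical, and the example payloads are hand-made dicts that only mimic the `response -> likes -> count` nesting used in the loop.

```python
def extract_like_count(payload):
    """Return the like count as an int, or None if the payload has none."""
    likes = payload.get('response', {}).get('likes')
    if not isinstance(likes, dict) or 'count' not in likes:
        return None
    return int(likes['count'])


# Hand-made payloads for illustration; not real API responses.
with_likes = {'response': {'likes': {'count': 80}}}
without_likes = {'response': {}}

found = extract_like_count(with_likes)
missing = extract_like_count(without_likes)
```

Storing `None` (which pandas shows as NaN) would let later steps decide whether to drop those venues or treat them as zero.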


Next, we'll use the 'within' method of the shapely points stored in the data frame to separate venues that are inside Alhambra from those that are outside it:

In [9]:
alhambra_poly = polys.loc[1]['geometry']
# True for venues whose point lies inside the Alhambra polygon
nearby_venues['within_neigh'] = [
    bool(point.within(alhambra_poly)) for point in nearby_venues['Venue LonLat']]
within_df = nearby_venues[nearby_venues['within_neigh']]
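To make the inside/outside test less of a black box, here is a minimal pure-Python sketch of a point-in-polygon check based on ray casting. It only illustrates the idea on a simple polygon and is not a claim about how shapely implements `within`. The function name and the toy square are assumptions.

```python
def point_in_polygon(x, y, vertices):
    """Ray casting: count how many polygon edges a ray going right crosses."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy square standing in for a neighborhood polygon.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
center_inside = point_in_polygon(2, 2, square)
right_outside = point_in_polygon(5, 2, square)
```

An odd number of crossings means the point is inside. Points exactly on the boundary are not handled carefully here, which is one reason to rely on shapely for the real data.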

Finally, let's visualize the results from these last few steps. In the following folium map, the search area that we queried from the Foursquare API is shown as a large, semi-transparent blue circle. All of the food venues that Foursquare returned are shown as dots, and the Alhambra polygon is highlighted in green. The food venues within the Alhambra polygon are drawn as green dots, and those outside it as blue dots. Click on a dot to see the name of the venue.

In [10]:
alhambra_coords = [centroids.loc[1]['centroids'].y, centroids.loc[1]['centroids'].x]  # coordinates of the Alhambra centroid
alham_map = folium.Map(location=alhambra_coords,zoom_start=13)
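# Map.choropleth is deprecated in later folium releases;
# GeoJson with a style function draws the same green polygon
folium.GeoJson(
    gpd.GeoDataFrame(polys.loc[[1]][['geometry']]).to_json(),
    style_function=lambda feature: {
        'fillColor': 'green',
        'fillOpacity': 0.3,
        'color': 'black',
        'weight': 1,
        'opacity': 0.5}).add_to(alham_map)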
folium.Circle(
    alhambra_coords,
    fill=True,
    fill_opacity=0.1,
    radius = centroids.loc[1]['r_list']*1000.0).add_to(alham_map)
for latlon, name in zip(nearby_venues['Venue LonLat'], nearby_venues['Venue']):
    label = folium.Popup(str(name),parse_html=True)
    folium.CircleMarker(
        [latlon.y,latlon.x],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.8).add_to(alham_map)   
for latlon, name in zip(within_df['Venue LonLat'], within_df['Venue']):
    label = folium.Popup(str(name),parse_html=True)
    folium.CircleMarker(
        [latlon.y,latlon.x],
        radius=5,
        popup=label,
        fill=True,
        fill_color='green',
        color='green',
        fill_opacity=0.5).add_to(alham_map)  
alham_map

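Now let's define a function get_venues that does this whole procedure for many neighborhoods at once. The inputs are sequences of neighborhood names, centroids, polygons, and search radii. The function keeps only the venues that fall inside each neighborhood polygon.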

In [11]:
def get_venues(names, centroids, areas, radii):
    venues_list = []
    for name, centroid ,area, radius in zip(names, centroids, areas, radii):
        print(name)
        radius = radius*1000.00
        lat = centroid.y
        lng = centroid.x
        url = 'https://api.foursquare.com/v2/venues/explore?&categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            categoryId, 
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        for v in results:
            current_loc = Point(v['venue']['location']['lng'],v['venue']['location']['lat'])
            if current_loc.within(area): # the within statement checks if restaurant is inside neighborhood polygon
                VENUE_ID = v['venue']['id']
                url = 'https://api.foursquare.com/v2/venues/{0}/likes?client_id={1}&client_secret={2}&v={3}'.format(
                    VENUE_ID,
                    CLIENT_ID, 
                    CLIENT_SECRET, 
                    VERSION)
                try:
                    total_likes = int(requests.get(url).json()['response']['likes']['count'])
                except:
                    total_likes = 0
                    print('ERROR: NO LIKES KEY FOUND FOR VENUE')
                venues_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['venue']['name'],
                    VENUE_ID,
                    current_loc,
                    total_likes,
                    v['venue']['categories'][0]['name'])])
    # build the dataframe once, after every neighborhood has been processed;
    # passing columns here also works when no venues were found at all
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue_ID',
                 'Venue LonLat',
                 'Total Likes',
                 'Venue Category'])
    return nearby_venues

Querying every neighborhood takes time. The slowest step is the separate likes request for each venue in a neighborhood. Foursquare stops returning values once a free account hits its hourly rate limit, which happens about halfway through the neighborhoods. I added a try-except structure in the get_venues function so that it alerts the user instead of crashing when it stops getting results. To preserve progress, the neighborhoods are split into four parts that are later combined into a single dataframe. I'll get through parts 1 and 2, then wait about an hour and do parts 3 and 4. Note from the pandas documentation for loc: "contrary to usual python slices, both the start and the stop are included".
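Because loc slices include both endpoints, the start and stop labels of each part have to be chosen so that no row is queried twice. Here is a small sketch that generates such inclusive ranges. The helper name `inclusive_ranges` is hypothetical, and the row count of 195 is inferred from the last stop label (194) used in the cells that follow.

```python
def inclusive_ranges(n_rows, chunk_size):
    """Return (start, stop) label pairs where stop is included, as in .loc."""
    ranges = []
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows) - 1
        ranges.append((start, stop))
    return ranges


# 195 rows assumed from the final stop label of 194.
ranges = inclusive_ranges(195, 50)
# ranges == [(0, 49), (50, 99), (100, 149), (150, 194)]
```

These boundaries differ slightly from the hand-picked ones in this notebook (0 to 50, 51 to 100, and so on), but both cover every row exactly once.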

In [12]:
start = 0
stop = 50
part1 = get_venues(centroids.loc[start:stop]['name'],
                   centroids.loc[start:stop]['centroids'],
                   polys.loc[start:stop]['geometry'],
                   centroids.loc[start:stop]['r_list'])

Adams-Normandie
Alhambra
Alondra Park
Altadena
Arcadia
Arleta
Arlington Heights
Artesia
Athens
Atwater Village
Avocado Heights
Baldwin Hills/Crenshaw
Baldwin Park
Bel-Air
Bell
Bellflower
Bell Gardens
Beverly Crest
Beverly Grove
Beverly Hills
Beverlywood
Boyle Heights
Bradbury
Brentwood
Broadway-Manchester
Burbank
Carson
Carthay
Central-Alameda
Century City
Cerritos
Chesterfield Square
Cheviot Hills
Chinatown
Commerce
Compton
Cudahy
Culver City
Cypress Park
Del Aire
Del Rey
Downey
Downtown
Eagle Rock
East Compton
East Hollywood
East La Mirada
East Los Angeles
East Pasadena
East San Gabriel
Echo Park


In [13]:
start = 51
stop = 100
part2 = get_venues(centroids.loc[start:stop]['name'],
                   centroids.loc[start:stop]['centroids'],
                   polys.loc[start:stop]['geometry'],
                   centroids.loc[start:stop]['r_list'])

El Monte
El Segundo
El Sereno
Elysian Park
Elysian Valley
Encino
Exposition Park
Fairfax
Florence
Florence-Firestone
Gardena
Glassell Park
Glendale
Gramercy Park
Green Meadows
Griffith Park
Hacienda Heights
Hancock Park
Hansen Dam
Harbor City
Harbor Gateway
Harvard Heights
Harvard Park
Hawaiian Gardens
Hawthorne
Hermosa Beach
Highland Park
Historic South-Central
Hollywood
Hollywood Hills
Hollywood Hills West
Huntington Park
Hyde Park
Industry
Inglewood
Irwindale
Jefferson Park
Koreatown
La Cañada Flintridge
La Crescenta-Montrose
Ladera Heights
La Habra Heights
Lake View Terrace
Lakewood
La Mirada
La Puente
Larchmont
Lawndale
Leimert Park
Lennox


In [22]:
start = 101
stop = 150
part3 = get_venues(centroids.loc[start:stop]['name'],
                   centroids.loc[start:stop]['centroids'],
                   polys.loc[start:stop]['geometry'],
                   centroids.loc[start:stop]['r_list'])

Lincoln Heights
Lomita
Long Beach
Los Feliz
Lynwood
Manchester Square
Manhattan Beach
Marina del Rey
Mar Vista
Mayflower Village
Maywood
Mid-City
Mid-Wilshire
Monrovia
Montebello
Montecito Heights
Monterey Park
Mount Washington
North El Monte
North Hollywood
North Whittier
Norwalk
Pacific Palisades
Pacoima
Palms
Panorama City
Paramount
Pasadena
Pico Rivera
Pico-Robertson
Pico-Union
Playa del Rey
Playa Vista
Rancho Dominguez
Rancho Park
Redondo Beach
Rosemead
San Gabriel
San Marino
San Pasqual
Santa Fe Springs
Santa Monica
Sawtelle
Sepulveda Basin
Shadow Hills
Sherman Oaks
Sierra Madre
Signal Hill
Silver Lake
South El Monte


In [23]:
start = 151
stop = 194
part4 = get_venues(centroids.loc[start:stop]['name'],
                   centroids.loc[start:stop]['centroids'],
                   polys.loc[start:stop]['geometry'],
                   centroids.loc[start:stop]['r_list'])

South Gate
South Park
South Pasadena
South San Gabriel
South Whittier
Studio City
Sunland
Sun Valley
Temple City
Toluca Lake
Torrance
Tujunga
Universal City
University Park
Valinda
Valley Glen
Valley Village
Van Nuys
Venice
Vermont Knolls
Vermont-Slauson
Vermont Square
Vermont Vista
Vernon
Veterans Administration
View Park-Windsor Hills
Walnut Park
Watts
West Adams
West Carson
Westchester
West Compton
West Hollywood
Westlake
West Los Angeles
Westmont
West Puente Valley
West Whittier-Los Nietos
Westwood
Whittier
Whittier Narrows
Willowbrook
Wilmington
Windsor Square
ERROR: NO LIKES KEY FOUND FOR VENUE


Put it all together:

In [24]:
frames = [part1, part2, part3, part4]
neighbor_df = pd.concat(frames)
neighbor_df.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue_ID | Venue LonLat | Total Likes | Venue Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adams-Normandie | 34.031398 | -118.300301 | Ignatius Cafe | 4fa1ba91e4b0e852a4072ac2 | POINT (-118.2930062234196 34.03177152787079) | 17 | Café |
| 1 | Adams-Normandie | 34.031398 | -118.300301 | Bird's Nest Cafe | 57dd8298498ee69b670f8bb9 | POINT (-118.2916379696774 34.03443378381787) | 10 | Restaurant |
| 2 | Adams-Normandie | 34.031398 | -118.300301 | Orange Door Sushi | 5498d200498e8153c17c751a | POINT (-118.299540608008 34.03226985807376) | 6 | Sushi Restaurant |
| 3 | Adams-Normandie | 34.031398 | -118.300301 | Himalayan House | 579d85f4498ef5970ff133f7 | POINT (-118.2942092848373 34.02578528424605) | 11 | Himalayan Restaurant |
| 4 | Adams-Normandie | 34.031398 | -118.300301 | Caveman Kitchen | 4b8bb46cf964a520b6a732e3 | POINT (-118.291991746519 34.03582259149761) | 27 | South American Restaurant |


Then the results are saved to a csv file:

In [25]:
neighbor_df.to_csv('processed/restaurant_data.csv')

In [27]:
neighbor_df.shape

(6438, 8)

In total, the dataframe contains 6438 food venues, with their associated likes, coordinates, category, and neighborhood. Since we'll be using this data for k-means clustering and PCA, we want enough venues in each neighborhood to associate it with a cluster. We'll remove neighborhoods that have fewer than 10 venues:

In [28]:
# inner join to remove neighborhoods with fewer than number_of_venues restaurants

number_of_venues = 10
counts_df = neighbor_df.groupby('Neighborhood').count()[['Venue']]
counts_df.columns = ['counts']
counts_df = counts_df[counts_df['counts']>=number_of_venues]
neighbor_df.set_index('Neighborhood', inplace=True)

neighbor_df_filter = pd.merge(neighbor_df, counts_df, left_index=True, right_index=True, how='inner')
neighbor_df_filter.reset_index(inplace=True)
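The count-then-merge filter above can also be written without a join. This is a sketch on toy data, using a threshold of 2 only so the effect is visible on four rows. The toy frame and variable names are assumptions; the real analysis uses 10.

```python
import pandas as pd

# Toy stand-in for neighbor_df: neighborhood A has 3 venues, B has 1.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue': ['v1', 'v2', 'v3', 'v4'],
})

min_venues = 2  # toy threshold; the notebook uses 10

# transform('count') repeats each neighborhood's venue count on every row,
# so it can be compared row by row against the threshold.
venue_counts = toy.groupby('Neighborhood')['Venue'].transform('count')
filtered = toy[venue_counts >= min_venues]
```

The merge version has the side effect of attaching a counts column to every row, which this variant skips.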

Next we one-hot encode the Venue Category column using the get_dummies function of pandas. Then we group by neighborhood and take the mean of each column, which gives the share of each neighborhood's venues that fall in each restaurant category:

In [29]:
# one hot encoding
la_onehot = pd.get_dummies(neighbor_df_filter[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
la_onehot['Neighborhood'] = neighbor_df_filter['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [la_onehot.columns[-1]] + list(la_onehot.columns[:-1])
la_onehot = la_onehot[fixed_columns]
la_grouped = la_onehot.groupby('Neighborhood').mean().reset_index()
la_grouped.head(5)

| | Neighborhood | African Restaurant | American Restaurant | Argentinian Restaurant | Asian Restaurant | Australian Restaurant | BBQ Joint | Bagel Shop | Bakery | Bistro | ... | Taco Place | Taiwanese Restaurant | Tapas Restaurant | Tex-Mex Restaurant | Thai Restaurant | Theme Restaurant | Udon Restaurant | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Wings Joint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adams-Normandie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.074074 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Alhambra | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 |
| 2 | Altadena | 0.0 | 0.09375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.09375 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Arcadia | 0.0 | 0.085106 | 0.0 | 0.021277 | 0.0 | 0.021277 | 0.021277 | 0.106383 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.021277 | 0.0 | 0.0 | 0.021277 | 0.021277 | 0.0 |
| 4 | Arleta | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
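Each value in this table is the fraction of a neighborhood's venues that belong to a category, so every row sums to 1. The toy example below shows why: the mean of a one-hot column within a group is the category's share of that group. The toy data is made up for illustration and is not taken from the Foursquare results.

```python
import pandas as pd

# Made-up venues: neighborhood A has 4, neighborhood B has 2.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'Bakery', 'Taco Place', 'Pizza Place',
                       'Bakery', 'Taco Place'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Averaging the indicators per neighborhood yields category shares.
shares = onehot.groupby('Neighborhood').mean()
bakery_share_a = shares.loc['A', 'Bakery']  # 2 of A's 4 venues are bakeries
```

Working with shares rather than raw counts keeps neighborhoods with many venues from dominating the clustering.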


Finally, save the data in this model-ready format:

In [31]:
la_grouped.to_csv('processed/restaurant_model_input.csv', index=False)
