# Introduction

Atlanta, my hometown, is a highly diverse city with poorly defined neighborhoods. While the neighborhoods are often defined by road boundaries, they are characterized by a number of other factors. In this project, I will use data from Foursquare to examine the fluctuating popularity of restaurants of various kinds and use that information to classify locations into neighborhoods. I will then compare that classification with Google-assigned neighborhoods and determine whether the two are similar.

There are a number of possible uses for this neighborhood clustering information, from consumers who would like to know where to go for a particular cuisine, to developers looking to fit a new restaurant into the neighborhood. Additionally, this information could be used by city planning officials in the permitting process.

# Data

This project will utilize data from 2 sources:
1. Trending restaurant data from Foursquare. This will be pulled between 6-7 PM on a Friday (peak restaurant selection time) and saved to a file, since repeated calls to the Foursquare API would result in different trending information. An extension of this project could be to examine the fluctuation of neighborhood boundaries over the course of a week, but that is beyond the scope of this project. 
2. Neighborhood Planning Unit (NPU) data from the City of Atlanta's GIS database (https://dcp-coaplangis.opendata.arcgis.com/datasets/npu/geoservice). The city provides an API to interact with the GIS data, which I will use to extract the geometric boundaries of each neighborhood. The output of this call contains the following information:
     
     "attributes": {
            "OBJECTID": 260,
            "LOCALID": null,
            "NAME": "K",
            "GEOTYPE": "NPU",
            "FULLFIPS": null,
            "LEGALAREA": null,
            "ACRES": 1528.29,
            "SQMILES": 2.39,
            "OLDNAME": null,
            "NPU": null
         },
         "geometry": {
            "rings": [
               [
                  [
                     -84.4173772073577,
                     33.772197013770004
                  ],
                  
The "NAME" attribute is the neighborhood planning unit name, a proxy for a traditional neighborhood name like "Midtown". The geometry information will be used to build a boundary for each unit, and each trending restaurant will be placed within those boundaries and classified.
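As an illustration of placing a point within a boundary, the sketch below applies the standard ray-casting test to a single ring of `[longitude, latitude]` pairs, the same ordering as the `rings` geometry above. The function name `point_in_ring` and the square test ring are hypothetical, and this is not the method used later in the notebook, which relies on the nearest boundary point instead.

```python
def point_in_ring(lon, lat, ring):
    """Return True if (lon, lat) lies inside a closed ring of [lon, lat] pairs."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings that lie to the east of the point
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square ring near downtown Atlanta, for illustration only
square = [[-84.40, 33.74], [-84.38, 33.74], [-84.38, 33.76],
          [-84.40, 33.76], [-84.40, 33.74]]

print(point_in_ring(-84.39, 33.75, square))  # point inside the square
print(point_in_ring(-84.30, 33.75, square))  # point east of the square
```

An odd number of crossings means the point is inside the ring. Polygons with holes or multiple rings would need additional handling that this sketch omits.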

# Initializations

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import folium
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from sklearn.model_selection import ParameterGrid
import math

# Methodology
## Connecting with the City of Atlanta's GIS system
### Exploring GIS results
The GIS system includes both the neighborhood planning unit (NPU) name and the coordinates of its boundaries. In this section, we will plot these boundaries as an initial exploration. The URL below was created using the City of Atlanta GIS API Explorer system listed in the Data section.

In [20]:
url = 'https://gis.atlantaga.gov/dpcd/rest/services/OpenDataService/FeatureServer/4/query?where=1%3D1&outFields=NAME&outSR=4326&f=json'

We will process the json file into a readable dataframe.

In [21]:
results = requests.get(url).json()
dataframe = json_normalize(results)
data = dataframe['features'][0]
cleandata = json_normalize(data)
cleandata.head()

| | attributes.NAME | geometry.rings |
|---|---|---|
| 0 | T | [[[-84.41391130598113, 33.75469930680354], [-8... |
| 1 | K | [[[-84.4173772073577, 33.772197013770004], [-8... |
| 2 | C | [[[-84.4175773783347, 33.83996741007558], [-84... |
| 3 | S | [[[-84.45199196698579, 33.73370062523282], [-8... |
| 4 | R | [[[-84.45466114171202, 33.721230458664635], [-... |


In order to visualize the structure of the official neighborhoods listed in the city's documentation, we will plot their borders using folium and the geometry in the above dataframe.

In [22]:
venues_map = folium.Map(tiles='Stamen Toner',location=[33.755845, -84.38902], zoom_start=10)

for index, row in cleandata.iterrows():
    coord = row['geometry.rings'][0][:]
    for ll in coord:
        lat = ll[1]
        long = ll[0]
        folium.Circle(
            [lat, long],
            radius=3,
            fill=True
            ).add_to(venues_map)
venues_map

### Create GIS information dataframe
GIS systems provide extensive data and can be used for a variety of geospatial calculations. However, most of these calculations are beyond the scope of this project. Consequently, the GIS data here will be used to classify venues into neighborhoods based on the closest boundary point, measured by Euclidean distance. While this is not a rigorous methodology, additional computations in the GIS system are complex and would require a great deal of work that is not focused on data science. In this project, we will assume that a venue belongs to the NPU that owns the boundary point closest to it.

#### Define a function to calculate the Euclidean distance between 2 coordinates
This function will be called for each venue coordinate and each NPU boundary point. It will return the distance from each boundary point to the venue coordinates. Though this is a 2D problem and the distance could be simply calculated, we will generalize the function to allow for n-dimensional inputs.

In [12]:
def calc_dist(point1, point2):
    distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(point1, point2)]))
    return distance

We will test our calc_dist function using a known example: the Euclidean distance from (0,0) to (1,1) should be $\sqrt2$, approximately 1.414.

In [17]:
point1 = [0,0]
point2 = [1,1]
calc_dist(point1, point2)

1.4142135623730951
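Because the coordinates are in degrees, a Euclidean distance treats one degree of longitude as equal to one degree of latitude, which is only approximately true at Atlanta's latitude. The sketch below shows the haversine great-circle distance as one possible alternative. The function name `haversine_m` and the mean Earth radius of 6,371,000 m are assumptions for illustration, and the notebook itself continues to use `calc_dist`.

```python
import math

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in meters


def haversine_m(point1, point2):
    """Great-circle distance in meters between two [lat, lon] points in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Map center used earlier in this notebook, shifted by 0.01 degrees each way
center = [33.755845, -84.38902]
north = [33.765845, -84.38902]
east = [33.755845, -84.37902]

print(haversine_m(center, north))  # 0.01 degrees of latitude
print(haversine_m(center, east))   # 0.01 degrees of longitude (shorter)
```

The same 0.01 degree step covers less ground east-west than north-south, so nearest-point results based on raw degrees can differ slightly from results based on true distance.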

#### Iterate through GIS data and create a simplified dataframe.
This dataframe will contain the coordinates of each boundary point and the NPU name that is assigned to that boundary point.

In [25]:
rows = []
for index, row in cleandata.iterrows():
    coord = row['geometry.rings'][0]
    name = row['attributes.NAME']
    for ll in coord:
        # GIS points are stored as [longitude, latitude]
        rows.append({"NPU": name, "Latitude": ll[1], "Longitude": ll[0]})

# Build the dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
gis_df = pd.DataFrame(rows, columns=['NPU', 'Latitude', 'Longitude'])
gis_df.head()

| | NPU | Latitude | Longitude |
|---|---|---|---|
| 0 | T | 33.754699 | -84.413911 |
| 1 | T | 33.754696 | -84.413739 |
| 2 | T | 33.754697 | -84.413553 |
| 3 | T | 33.754698 | -84.413455 |
| 4 | T | 33.754701 | -84.412775 |


## Connect with Foursquare

This hidden cell provides the credentials for interacting with the Foursquare API.

In [40]:
# @hidden_cell
client_secret = 'YOUR_CLIENT_SECRET'  # replace with your Foursquare client secret
client_id = 'YOUR_CLIENT_ID'  # replace with your Foursquare client ID

### Create latitude/longitude grid to collect data all over Atlanta
Since the number of API calls and results are limited, we will generate a search grid that overlays the Atlanta area and submit the API call at each node on the grid. The width and height of the space between nodes is approximately 4500 meters, so the radius for the API call will be set to that value. Note that the boundaries of the city are poorly defined and non-orthogonal, so a linear approximation is used.

#### Define functions used in this section and set visible API parameters

In [41]:
# Define a function to make the search ranges
def make_range(minval, maxval, steps):
    step = (maxval - minval)/steps
    val = minval
    outlist = list()
    outlist.append(minval)
    while val <= maxval:
        val += step
        outlist.append(val)
    return outlist
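Note that `make_range` appends after incrementing, so depending on floating-point rounding it may return one node beyond `maxval`. If exactly `steps + 1` nodes from `minval` to `maxval` are wanted, a variant along the following lines could be used. The name `make_range_exact` is hypothetical, and the results reported in this notebook were produced with `make_range` as defined above.

```python
def make_range_exact(minval, maxval, steps):
    """Return steps + 1 evenly spaced values from minval to maxval inclusive."""
    step = (maxval - minval) / steps
    # Multiply instead of accumulating so rounding error does not build up
    values = [minval + i * step for i in range(steps)]
    # Pin the final node to maxval exactly
    values.append(maxval)
    return values


# Latitude bounds taken from the search grid in this notebook
lat_nodes = make_range_exact(33.619816, 33.919599, 10)
print(len(lat_nodes), lat_nodes[0], lat_nodes[-1])
```

A grid built this way has fewer nodes than one that overshoots, which means fewer API calls but also slightly less coverage beyond the stated bounds.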

#### Build search grid

In [42]:
# Set the boundaries for the search grid using Google Maps coordinates
min_lat = 33.619816
max_lat = 33.919599
min_long = -84.495203
max_long = -84.241616

# Create the lists for latitude and longitude
lat_list = make_range(min_lat, max_lat, 10)
long_list = make_range(min_long, max_long, 10)

# Generate the search grid
grid = {'latitude':lat_list, 'longitude':long_list}
grid = ParameterGrid(grid)

### Iterate through grid search and call API
With each API call, the json object must be converted into a dataframe, and that data will be appended to a larger analysis frame.
Note that each call will not have the same number of output columns. In testing, it was determined that the most frequently missing column is the neighborhood value. When present, this will be collected. When not present, it will be skipped. Data points with an assigned neighborhood value will be used for validation of the clustering model.
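One way to tolerate the missing neighborhood field is to read each item with `dict.get`, which returns `None` when a key is absent. The sketch below assumes items nested as `venue` → `location`, mirroring the `venue.location.neighborhood` column name referenced in this notebook. The helper `extract_venue` and the two sample items are hypothetical.

```python
def extract_venue(item):
    """Flatten one result item into the five columns used in this notebook."""
    venue = item.get('venue', {})
    location = venue.get('location', {})
    categories = venue.get('categories', [])
    return {
        'name': venue.get('name'),
        # Keep only the first category name, or None when the list is empty
        'categories': categories[0]['name'] if categories else None,
        'latitude': location.get('lat'),
        'longitude': location.get('lng'),
        # The neighborhood key is often absent; None marks it as missing
        'neighborhood': location.get('neighborhood'),
    }


# Hypothetical items shaped like the fields referenced in this notebook
items = [
    {'venue': {'name': 'Venue A', 'categories': [{'name': 'Café'}],
               'location': {'lat': 33.75, 'lng': -84.39,
                            'neighborhood': 'Downtown'}}},
    {'venue': {'name': 'Venue B', 'categories': [],
               'location': {'lat': 33.76, 'lng': -84.38}}},
]

rows = [extract_venue(item) for item in items]
print(rows)
```

Every row then has the same five keys, so the rows from all API calls can be combined without checking which columns each call returned.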

#### Set API parameters and define useful functions

In [43]:
# Set the API parameters that are non-user identifying
radius = 4500
limit=50
ver = '20180604'

In [44]:
# Define a function to extract the venue category information from the json results
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # Fall back to the nested column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Perform grid search and store data

In [48]:
# Generate an empty dataframe to contain all results
fsq_df = pd.DataFrame(columns=['name', 'categories', 'latitude', 'longitude', 'neighborhood'])

# Iterate over the search grid
for node in grid:  # "node" avoids shadowing the built-in set type
    # Extract latitude and longitude values
    lat = node['latitude']
    long = node['longitude']
    
    # Build API URL
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&limit={}&radius={}'.format(client_id, client_secret, lat, long, ver, limit,radius)
    
    # API call
    results = requests.get(url).json()
    
    # Extract useful data from json
    items = results['response']['groups'][0]['items']
    dataframe = json_normalize(items)
    
    # Create mini working dataframe to contain only the columns of interest
    data = pd.DataFrame()
    data['name'] = dataframe['venue.name']
    data['categories'] = dataframe['venue.categories']
    
    # Apply the get categories function to further strip structure in that field
    data['categories'] = data.apply(get_category_type, axis=1)
    data['latitude'] = dataframe['venue.location.lat']
    data['longitude'] = dataframe['venue.location.lng']
    
    # Check to see if neighborhood data is present, and store it if so
    if 'venue.location.neighborhood' in list(dataframe.columns.values):
        data['neighborhood'] = dataframe['venue.location.neighborhood']
    else: 
        data['neighborhood'] = np.nan
        
    # Append the mini working dataframe to the results dataframe
    fsq_df = pd.concat([fsq_df, data])
    
# Check shape and columns of results dataframe    
print('Size of dataframe: ', fsq_df.shape)
fsq_df.head()

Size of dataframe:  (7091, 5)


| | name | categories | latitude | longitude | neighborhood |
|---|---|---|---|---|---|
| 0 | CFA Cafe | Café | 33.613347 | -84.489282 | |
| 1 | Tropical Cuisine | Caribbean Restaurant | 33.622083 | -84.476598 | |
| 2 | College Park Crab Pot | Seafood Restaurant | 33.615771 | -84.476075 | |
| 3 | Bojangles' Famous Chicken 'n Biscuits | Fast Food Restaurant | 33.615125 | -84.472216 | |
| 4 | Big Daddy's Dish | Southern / Soul Food Restaurant | 33.603378 | -84.472936 | |


Because this search is conducted in circles around points on a square grid, there may be duplicate entries. We will delete these duplicates to simplify analysis.

In [49]:
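fsq_df.drop_duplicates(inplace=True)
# Each API call restarts its index at 0, so rebuild a unique index for later row-wise updates
fsq_df.reset_index(drop=True, inplace=True)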
print('Reduced size of dataframe: ',fsq_df.shape)

Reduced size of dataframe:  (2542, 5)


## Classify each restaurant into a neighborhood planning unit
To make this assignment, we will use our Euclidean distance function from above. For each venue, we compute the Euclidean distance to every NPU boundary point, and the venue is classified into the NPU that owns the closest boundary point.

As a first step, we will add an empty column to the fsq_df dataframe in which to store the NPU classification.

In [62]:
fsq_df['NPU'] = ''
print(fsq_df.shape)

(2542, 6)


We will iterate over the Foursquare dataframe and the boundary dataframe to determine the NPU for each venue.

In [None]:
for index1, row in fsq_df.iterrows():
    # Select the coordinates of each venue
    venue_coord = [row['latitude'], row['longitude']]
    name = row['name']

    # Track the closest boundary point found so far for this venue
    closest_npu = None
    min_dist = math.inf

    # Iterate through the gis dataframe
    for index2, boundary_row in gis_df.iterrows():

        # Get the coordinates of the boundary point
        boundary_coord = [boundary_row['Latitude'], boundary_row['Longitude']]

        # Use the Euclidean distance function on the venue coordinates and the boundary coordinates
        dist = calc_dist(venue_coord, boundary_coord)

        # Keep the NPU name of the nearest boundary point seen so far
        if dist < min_dist:
            min_dist = dist
            closest_npu = boundary_row['NPU']

    # Store this NPU data in the fsq_df dataframe
    fsq_df.at[index1,'NPU'] = closest_npu

    # Print to monitor progress
    print('Row: {}, Venue Name: {}, NPU: {}           '.format(index1,name, closest_npu), end='\r')

Row: 5, Venue Name: One Flew South Restaurant & Sushi Bar, NPU: Z                
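The nested loop compares every venue with every boundary point one pair at a time, which can be slow. As a possible speed-up, the same nearest-boundary-point rule can be expressed with NumPy broadcasting. This is a sketch only: the function name `nearest_npu` is hypothetical, the two boundary points are the first `T` and `K` coordinates shown earlier, and the venue coordinates are made up for illustration.

```python
import numpy as np


def nearest_npu(venue_coords, boundary_coords, boundary_names):
    """Return, for each venue, the name attached to its nearest boundary point."""
    venues = np.asarray(venue_coords, dtype=float)       # shape (n_venues, 2)
    boundary = np.asarray(boundary_coords, dtype=float)  # shape (n_points, 2)
    # Broadcasting yields every venue-to-point difference at once:
    # (n_venues, 1, 2) - (1, n_points, 2) -> (n_venues, n_points, 2)
    diff = venues[:, None, :] - boundary[None, :, :]
    # Squared distances preserve the ordering, so the square root is not needed
    sq_dist = (diff ** 2).sum(axis=2)
    closest = sq_dist.argmin(axis=1)
    return [boundary_names[i] for i in closest]


# First boundary points of NPU T and NPU K, as [latitude, longitude]
boundary_coords = [[33.754699, -84.413911], [33.772197, -84.417377]]
boundary_names = ['T', 'K']

# Hypothetical venue coordinates for illustration
venue_coords = [[33.7550, -84.4130], [33.7720, -84.4170]]

print(nearest_npu(venue_coords, boundary_coords, boundary_names))
```

With the notebook's dataframes, the inputs would be `fsq_df[['latitude', 'longitude']]`, `gis_df[['Latitude', 'Longitude']]`, and `gis_df['NPU'].tolist()`. For very large inputs the intermediate array may need to be processed in chunks.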

In [None]:
fsq_df.head()


# Clustering Analysis
## Visualize Raw Data

In [13]:
# Center on Atlanta, matching the NPU boundary map above
venues_map = folium.Map(location=[33.755845, -84.38902], zoom_start=11)

for index, row in fsq_df.iterrows():
    lat = row['latitude']
    long = row['longitude']
    label = row['name']
    category = row['categories']
    
    folium.Circle(
        [lat, long],
        popup=label,
        radius=10,
        fill=True
        ).add_to(venues_map)

# display map
venues_map

## Perform clustering analysis on restaurant data

### Convert neighborhood strings into integer labels
We will use the LabelEncoder class from scikit-learn to create an integer representing each given neighborhood.

In [18]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = data['neighborhood'].to_list()
neighborhood_labels = le.fit_transform(y)
data['neighborhood'] = neighborhood_labels
data.head()

| | name | categories | latitude | longitude | neighborhood |
|---|---|---|---|---|---|
| 0 | Fish Bowl Poké | Poke Place | 33.755727 | -84.3894 | 1 |
| 1 | Aviva by Kameel | Mediterranean Restaurant | 33.760597 | -84.386548 | 7 |
| 2 | The Tabernacle | Music Venue | 33.758719 | -84.391455 | 7 |
| 3 | Centennial Olympic Park | Park | 33.760356 | -84.393507 | 7 |
| 4 | College Football Hall of Fame | Museum | 33.760184 | -84.395134 | 7 |


### Split dataframe into X and y arrays

In [None]:
# Use a two-column feature matrix with one row per venue. A tuple of two Series would be
# read as 2 samples and raise "Found input variables with inconsistent numbers of samples: [2, 100]"
X_train = data[['latitude', 'longitude']]
y_train = data['neighborhood']

### Cross Validation Tuning

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
neighbors = list(range(1,11,1))
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
    
plt.plot(neighbors, cv_scores)
plt.xlabel('K Value')
plt.ylabel('Score')
plt.show()

### Perform K Nearest Neighbors classification with the optimal number of neighbors
We will use the graph above to select the best k value.

In [None]:
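k = 5
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Hold out part of the data for evaluation; the 80/20 split and random_state=0 are assumptions
X = data[['latitude', 'longitude']]
y = data['neighborhood']
X_fit, X_test, y_fit, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_fit, y_fit)
pred = knn.predict(X_test)
# score() takes (X, y); a third positional argument would be read as sample_weight
print(knn.score(X_test, y_test))
print(pred)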

## Compare results of analysis to City of Atlanta Neighborhood data