# Peer-Graded Assignment: Capstone Project - The Battle of Neighborhoods (Code)

## Greek Restaurant in New York City

## Anastasios-Petros Kazamias

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Importing and Exploring the Datasets</a>

2. <a href="#item2">Data Analysis</a>

3. <a href="#item3">Clustering Neighborhoods Based on Greek Restaurant Suitability</a>

4. <a href="#item4">Results</a>
  
</font>
</div>

Before we get the data and start exploring it, let's import all the necessary libraries:

In [1]:
import numpy as np # library to handle data in a vectorized manner (mathematics library)

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

#!conda install -c conda-forge geocoder --yes
import geocoder

print('Libraries imported.')

Libraries imported.


## 1. Importing and Exploring the Datasets

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

This dataset exists for free on the web and can be downloaded from this link: https://geo.nyu.edu/catalog/nyu_2451_34572

In [2]:
with open("C:/Users/Petros/Desktop/nyu-2451-34572-geojson.json") as json_data:
    newyork_data = json.load(json_data)

In [3]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Notice how all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [4]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a pandas dataframe

The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. So let's start by creating an empty dataframe.

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data, collect one row per neighborhood, and build the dataframe from the collected rows.

In [7]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)
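
As an alternative to the explicit loop, the same table can be sketched with `pd.json_normalize`, which flattens nested dictionaries into dotted column names. The two-record `sample_features` list below is a trimmed copy of the records shown earlier, and the names `flat` and `sample_neighborhoods` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Trimmed copy of two records from newyork_data['features'] (illustrative sample)
sample_features = [
    {'geometry': {'type': 'Point',
                  'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'type': 'Point',
                  'coordinates': [-73.82993910812398, 40.87429419303012]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

# json_normalize flattens nested keys into columns such as 'properties.name'
flat = pd.json_normalize(sample_features)

sample_neighborhoods = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON stores coordinates as [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(sample_neighborhoods)
```

Applied to the full `neighborhoods_data` list, this approach would avoid growing the table row by row.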

In [8]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.

In [None]:
address = 'New York City, NY'

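# recent geopy versions require a user_agent; 'ny_explorer' is an arbitrary placeholder name
geolocator = Nominatim(user_agent='ny_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))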

#### Create a map of New York neighborhoods.

In [11]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Next, we are going to start utilizing the Foursquare API in order to acquire the datasets of general type and Greek type restaurants in the city of New York.

#### Define Foursquare Credentials and Version

In [18]:
import getpass # hiding credentials

CI = getpass.getpass('Enter CLIENT_ID:')
CS = getpass.getpass('Enter CLIENT_SECRET:')
CLIENT_ID = CI   # my Foursquare ID
CLIENT_SECRET = CS  # my Foursquare Secret

VERSION = '20180605' # Foursquare API version

print('Your credentials are defined')

Your credentials are defined


Let's define a function that extracts the category of a venue.

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now let's create a function that queries Foursquare for the venues around every neighborhood of New York and records the category of each venue.

In [30]:
#define limit of venues and search radius from center in meters
LIMIT=100
radius=500

def getNearbyVenues(names, bors, latitudes, longitudes, search_query):

    venues_list = []
    for name, bor, lat, lng in zip(names, bors, latitudes, longitudes):

        # build the API request; passing params lets requests URL-encode the values
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'query': search_query,
            'radius': radius,
            'limit': LIMIT}

        # make the GET request
        response = requests.get(url, params=params).json()
        results = response['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue;
        # a venue without categories gets None instead of raising an IndexError
        venues_list.append([(
            bor,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None)
            for v in results])

    # flatten the per-neighborhood lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough',
                             'Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
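
To make the nested response structure concrete, the sketch below applies the same extraction to a hand-written payload instead of a live request. The `mock_response` dictionary, the venue names, the coordinates, and the helper name `extract_venue_rows` are all hypothetical; the payload only mimics the keys that `getNearbyVenues` reads and is not real Foursquare output. The helper returns `None` for a venue with an empty category list, a case that plain `categories[0]` indexing would fail on.

```python
# Hypothetical payload that mimics only the keys read by getNearbyVenues
mock_response = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Taverna',
                   'location': {'lat': 40.77, 'lng': -73.92},
                   'categories': [{'name': 'Greek Restaurant'}]}},
        {'venue': {'name': 'Example Kiosk',
                   'location': {'lat': 40.76, 'lng': -73.93},
                   'categories': []}},
    ]}]}
}


def extract_venue_rows(payload, borough, neighborhood, lat, lng):
    """Return one tuple per venue, in the column order of nearby_venues."""
    items = payload['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue['categories']
        # guard against venues that carry no category at all
        category = categories[0]['name'] if categories else None
        rows.append((borough, neighborhood, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# illustrative neighborhood coordinates, not taken from the dataset
rows = extract_venue_rows(mock_response, 'Queens', 'Astoria', 40.77, -73.92)
for row in rows:
    print(row)
```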

We can now create our dataframes of New York general type Restaurants and Greek Restaurants.

General type:

In [32]:
new_york_resta_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   bors=neighborhoods['Borough'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'],
                                   search_query='Restaurant'
                                  )

Let's take a look at our dataframe.

In [33]:
print(new_york_resta_venues.shape)
new_york_resta_venues.head()

(8213, 8)


| | Borough | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 | Cooler Runnings Jamaican Restaurant Inc | 40.898283 | -73.850478 | Caribbean Restaurant |
| 1 | Bronx | Wakefield | 40.894705 | -73.847201 | Dunkin Donuts | 40.890631 | -73.849027 | Donut Shop |
| 2 | Bronx | Wakefield | 40.894705 | -73.847201 | SUBWAY | 40.890656 | -73.849192 | Sandwich Place |
| 3 | Bronx | Wakefield | 40.894705 | -73.847201 | Pitman Deli | 40.894149 | -73.845748 | Food |
| 4 | Bronx | Wakefield | 40.894705 | -73.847201 | Central Deli | 40.896846 | -73.844415 | Deli / Bodega |


Greek Restaurants:

In [34]:
new_york_greek_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   bors=neighborhoods['Borough'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'],
                                   search_query='Greek'
                                  )

In [37]:
print(new_york_greek_venues.shape)
new_york_greek_venues.head()

(350, 8)


| | Borough | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Bronx | Kingsbridge | 40.881687 | -73.902818 | Greek Express | 40.883703 | -73.904788 | Greek Restaurant |
| 1 | Bronx | Kingsbridge | 40.881687 | -73.902818 | Cold Cut City | 40.879174 | -73.905753 | Sandwich Place |
| 2 | Manhattan | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner |
| 3 | Bronx | Pelham Parkway | 40.857413 | -73.854756 | Liberty Donut & Coffee Shop | 40.855339 | -73.855333 | Coffee Shop |
| 4 | Bronx | Bedford Park | 40.870185 | -73.885512 | House pizza | 40.874132 | -73.884652 | Pizza Place |


## 2. Data Analysis

We can check how many venues were returned for each Neighborhood.

In [40]:
new_york_resta_venues.groupby('Neighborhood').count().head()

| Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| Allerton | 24 | 24 | 24 | 24 | 24 | 24 | 24 |
| Annadale | 14 | 14 | 14 | 14 | 14 | 14 | 14 |
| Arden Heights | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Arlington | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Arrochar | 14 | 14 | 14 | 14 | 14 | 14 | 14 |


Looking at the *new_york_greek_venues* dataframe, we can see venues that are not related to Greek cuisine or are not restaurants at all. This is because our search query was simply the word 'Greek'. To correct this, we keep only the venues whose Venue Category is "Greek Restaurant".

In [41]:
gg = new_york_greek_venues[new_york_greek_venues['Venue Category']=='Greek Restaurant']
print(gg.shape)

(142, 8)


We count the venues in each neighborhood, separately for Greek restaurants and for restaurants in general.

In [43]:
gg_grouped = gg[['Neighborhood','Venue']].groupby('Neighborhood').count().reset_index()
ggr_grouped = new_york_resta_venues[['Neighborhood','Venue']].groupby('Neighborhood').count().reset_index()

In [47]:
gg_grouped.rename(columns={'Venue':'Greek Venues'},inplace=True)
ggr_grouped.rename(columns={'Venue':'Venues'},inplace=True)

Take a look at a sample of general type venues.

In [48]:
ggr_grouped.head()

| | Neighborhood | Venues |
|---|---|---|
| 0 | Allerton | 24 |
| 1 | Annadale | 14 |
| 2 | Arden Heights | 2 |
| 3 | Arlington | 2 |
| 4 | Arrochar | 14 |


Now we join both dataframes.

In [50]:
gg_final = gg_grouped.join(ggr_grouped.set_index('Neighborhood'), on='Neighborhood', how= 'inner')
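
Because the join above is an inner join, neighborhoods with no Greek restaurant drop out of `gg_final` entirely. The sketch below contrasts that with a left join on toy counts. The frame names `greek_counts` and `all_counts` and the 'Example Heights' row are made up for illustration, while the Astoria and Bay Ridge counts are copied from this notebook's output.

```python
import pandas as pd

greek_counts = pd.DataFrame({'Neighborhood': ['Astoria', 'Bay Ridge'],
                             'Greek Venues': [9, 5]})
all_counts = pd.DataFrame({'Neighborhood': ['Astoria', 'Bay Ridge', 'Example Heights'],
                           'Venues': [87, 79, 30]})

# Inner join: only neighborhoods present in both frames survive
inner = greek_counts.merge(all_counts, on='Neighborhood', how='inner')

# Left join from the full list keeps every neighborhood;
# a missing Greek count becomes 0 instead of removing the row
left = all_counts.merge(greek_counts, on='Neighborhood', how='left')
left['Greek Venues'] = left['Greek Venues'].fillna(0).astype(int)

print(inner)
print(left)
```

The notebook keeps the inner join, so neighborhoods without any Greek restaurant are simply absent from the later clustering and map.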

Next we create a value called *pointer*: the larger it is, the more suitable the neighborhood is for opening a Greek restaurant. It is the number of Greek restaurants in the neighborhood expressed as a percentage of all restaurants found there, and it is computed as follows:

In [51]:
s=[]
for i in range(gg_final.shape[0]):
    s.append(100*gg_final.iloc[i,1]/gg_final.iloc[i,2])
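
The loop above computes, for each neighborhood, pointer = 100 × (Greek venues) / (all restaurant venues). A vectorized sketch of the same calculation is given below, using three rows copied from this notebook's output; the frame name `sample` is an assumption for illustration.

```python
import pandas as pd

sample = pd.DataFrame({'Neighborhood': ['Astoria', 'Bay Terrace', 'Chinatown'],
                       'Greek Venues': [9, 1, 2],
                       'Venues': [87, 20, 100]})

# Share of Greek restaurants among all restaurant venues, as a percentage
sample['pointer'] = 100 * sample['Greek Venues'] / sample['Venues']
print(sample)
```

Referring to the columns by name also avoids depending on their position, which the `iloc[i, 1]` and `iloc[i, 2]` lookups do.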

Now we are close to the final form of our dataframe.

In [53]:
gg_final['pointer']=s
gg_final.head()

| | Neighborhood | Greek Venues | Venues | pointer |
|---|---|---|---|---|
| 0 | Astoria | 9 | 87 | 10.344828 |
| 1 | Astoria Heights | 1 | 11 | 9.090909 |
| 2 | Bay Ridge | 5 | 79 | 6.329114 |
| 3 | Bay Terrace | 1 | 20 | 5.0 |
| 4 | Bayside | 5 | 55 | 9.090909 |


To get a more reliable sample and limit outliers, we restrict the model to neighborhoods with more than 10 venues.

In [57]:
# take a copy so that a column can be added later without a SettingWithCopyWarning
gg_final_limit = gg_final[gg_final['Venues'] > 10].copy()

In [63]:
gg_final_limit

| | Neighborhood | Greek Venues | Venues | pointer |
|---|---|---|---|---|
| 0 | Astoria | 9 | 87 | 10.344828 |
| 1 | Astoria Heights | 1 | 11 | 9.090909 |
| 2 | Bay Ridge | 5 | 79 | 6.329114 |
| 3 | Bay Terrace | 1 | 20 | 5.0 |
| 4 | Bayside | 5 | 55 | 9.090909 |
| 6 | Boerum Hill | 1 | 74 | 1.351351 |
| 7 | Carnegie Hill | 1 | 69 | 1.449275 |
| 8 | Carroll Gardens | 1 | 69 | 1.449275 |
| 9 | Chinatown | 2 | 100 | 2.0 |
| 10 | Civic Center | 1 | 87 | 1.149425 |


## 3. Clustering Neighborhoods Based on Greek Restaurant Suitability

In this section we cluster the neighborhoods by their suitability for opening a Greek restaurant and visualize the clusters on a map of New York. After some research we settled on 4 clusters. We then fit the model on the *pointer* column of the final dataframe from the analysis above.

In [68]:
kclusters=4
r_state=0

new_york_clustering = gg_final_limit.drop(['Neighborhood','Greek Venues','Venues'],axis=1)

# run k-means clustering on the remaining pointer column
kmeans = KMeans(n_clusters=kclusters, random_state=r_state).fit(new_york_clustering)
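
One common way to support a choice of k is to compare the k-means inertia (within-cluster sum of squares) across several values of k and look for the point where the improvement flattens. The sketch below does this on the ten `pointer` values displayed above. It illustrates the technique only and is not the research the text refers to; `n_init=10` is set explicitly as an assumption to keep the behavior comparable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# pointer values of the ten neighborhoods displayed in gg_final_limit above
pointer = np.array([10.344828, 9.090909, 6.329114, 5.0, 9.090909,
                    1.351351, 1.449275, 1.449275, 2.0, 1.149425]).reshape(-1, 1)

# fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(pointer)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Run on the full `new_york_clustering` frame rather than this ten-row sample, the same loop would show how much each additional cluster reduces the inertia.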

Let's take a look at the cluster centers.

In [69]:
kmeans.cluster_centers_

array([[ 9.46178276],
       [ 2.20529361],
       [ 5.04543072],
       [17.64705882]])

Two clusters are of particular interest. In neighborhoods with cluster label 0, about 9.5% of the restaurants are Greek on average. Neighborhoods with cluster label 3 are either clearly the best result or an outlier (something the model rates highly but that does not hold up in reality).
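
K-means assigns label numbers arbitrarily, so the fact that label 3 has the highest center is specific to this run. A small sketch of ranking the labels by their center, using the centers printed above, follows; the names `centers`, `labels_best_first` and `rank_of_label` are illustrative.

```python
import numpy as np

# cluster centers as printed by kmeans.cluster_centers_ above
centers = np.array([[9.46178276], [2.20529361], [5.04543072], [17.64705882]])

# labels ordered from the highest to the lowest average pointer
labels_best_first = np.argsort(centers.ravel())[::-1]

# rank 1 = cluster with the highest share of Greek restaurants
rank_of_label = {int(label): rank
                 for rank, label in enumerate(labels_best_first, start=1)}
print(rank_of_label)  # {3: 1, 0: 2, 2: 3, 1: 4}
```

For these centers the order is label 3, then 0, then 2, then 1.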

We continue the process by adding cluster labels, longitude and latitude to the dataframe.

In [None]:
new_york_merged = neighborhoods

# add clustering labels to the gg_final_limit dataframe
gg_final_limit['Cluster Labels'] = kmeans.labels_

# merge neighborhoods with gg_final_limit to add latitude/longitude for each neighborhood
new_york_merged = new_york_merged.join(gg_final_limit.set_index('Neighborhood'), on='Neighborhood', how='inner')

new_york_merged.head(2)

We can now visualize our clustering results in a New York neighborhood map.

In [72]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(new_york_merged['Latitude'], new_york_merged['Longitude'], new_york_merged['Neighborhood'], new_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # the -1 offset means cluster label 0 takes the last color in the list
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
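
Because the marker color is taken from `rainbow[int(cluster)-1]`, label 0 wraps around to the last color in the list rather than the first. The sketch below rebuilds the palette for four clusters and prints which hex color each label receives; the dictionary name `color_of_label` is an assumption for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# one evenly spaced rainbow color per cluster, converted to hex strings
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# same indexing as the map code: label 0 -> rainbow[-1], label 1 -> rainbow[0], ...
color_of_label = {label: rainbow[label - 1] for label in range(kclusters)}
print(color_of_label)
```

The rainbow colormap runs from purple at one end to red at the other, so label 0 is drawn in red and label 1 in purple, which is consistent with the color legend used in the results.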

## 4. Results

Looking at the map above, there are four marker colors. According to our model, anyone who wants to open a Greek restaurant should consider the following:

* <font color=green> Green </font> marks by far the most suitable neighborhoods (possibly an outlier).
* Red marks very good results.
* Light blue marks good results too.
* Purple marks neighborhoods that are less promising.
* Finally, neighborhoods that do not appear on the map above are the worst choice.

To conclude, in _Manhattan_, the __Midtown__ and __South__ neighborhoods are more suitable places to open a Greek restaurant, but competition is strong because there are so many types of restaurants in these neighborhoods.

We would not recommend the _Bronx_ for starting such a business.

In _Brooklyn_ a good choice is __Gowanus__ and nearby neighborhoods, but a better one is __Bay Ridge__.

In the Southeast part of _Staten Island_, __Dongan Hills__, __Grant City__, __Bay Terrace__ and especially __Old Town__ are very good neighborhoods to start a Greek restaurant business.

Finally, _Queens_ seems to be the most suitable borough. There is an obvious concentration of Greek restaurants in __Astoria__ and the neighborhoods near it. A simple web search showed that Astoria is the center of Greek culture in New York City, since it was the home of the first Greek immigrants. So the most suitable neighborhoods to open a Greek restaurant in Queens, and arguably in all of New York City, are by far those close to __Astoria__, followed by __Bayside__, __Bay Terrace__ and __Douglaston__.