## FINAL CAPSTONE PROJECT

#### Case study description

One of the oldest pizza restaurants in Naples, Italy, is considering its first investment outside Italy.

The owners' first choice is New York City. The city is highly attractive for foreign investment, its pizza culture is extremely well developed, and an opening there could give the restaurant's brand an international boost.
However, competition in both the Italian restaurant and pizza sectors is intense and the real estate market is expensive, so the company has to study its peers in the city carefully.

The owner has no precise idea of where to locate the new restaurant, so it will be useful for him to look at a map that shows the most attractive and competitive pizza venues in the city.

I will start by collecting the NYC data, clean it if necessary, use Foursquare to get the venues I need and, finally, build a clustering model to segment the pizza restaurants in the city.

#### Data section

To achieve the purpose of this project, I will use a dataset that contains NYC's boroughs and neighborhoods, together with the latitude and longitude of each neighborhood.
This dataset is essential for finding all the neighborhoods in the city and their coordinates, which I will then pass to Foursquare to collect the venues in each neighborhood.

In summary, my two main sources of data will be the NYC dataset described above and the venue data retrieved from Foursquare.

In [None]:
! pip install folium==0.5.0

In [2]:
# import libraries
import pandas as pd
import numpy as np
import requests
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

#### Now I download the New York dataset, which contains the data needed to locate the city's neighborhoods.

In [3]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


In [4]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Select only the data contained within the 'features' key

In [5]:
neighborhoods_data = newyork_data['features']

In [6]:
neighborhoods_data[0:3]

[{'type': 'Feature',
  'id': 'nyu_2451_34572.1',
  'geometry': {'type': 'Point',
   'coordinates': [-73.84720052054902, 40.89470517661]},
  'geometry_name': 'geom',
  'properties': {'name': 'Wakefield',
   'stacked': 1,
   'annoline1': 'Wakefield',
   'annoline2': None,
   'annoline3': None,
   'annoangle': 0.0,
   'borough': 'Bronx',
   'bbox': [-73.84720052054902,
    40.89470517661,
    -73.84720052054902,
    40.89470517661]}},
 {'type': 'Feature',
  'id': 'nyu_2451_34572.2',
  'geometry': {'type': 'Point',
   'coordinates': [-73.82993910812398, 40.87429419303012]},
  'geometry_name': 'geom',
  'properties': {'name': 'Co-op City',
   'stacked': 2,
   'annoline1': 'Co-op',
   'annoline2': 'City',
   'annoline3': None,
   'annoangle': 0.0,
   'borough': 'Bronx',
   'bbox': [-73.82993910812398,
    40.87429419303012,
    -73.82993910812398,
    40.87429419303012]}},
 {'type': 'Feature',
  'id': 'nyu_2451_34572.3',
  'geometry': {'type': 'Point',
   'coordinates': [-73.82780644716412, 

Now let's transform the data into a pandas dataframe

In [7]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [8]:
neighborhoods.head(30)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
8,Bronx,Norwood,40.877224,-73.879391
9,Bronx,Williamsbridge,40.881039,-73.857446


In [9]:
neighborhoods.shape

(306, 4)
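Before mapping the neighborhoods, it can help to check how they are distributed across boroughs. The following is a minimal sketch in plain Python that uses only the ten sample rows shown in the table above (not the full dataset); the helper name `count_by_borough` is introduced here for illustration. On the real dataframe, `neighborhoods['Borough'].value_counts()` should give the equivalent result.

```python
from collections import Counter

# Sample rows copied from the table above (first ten neighborhoods only)
sample_rows = [
    ("Bronx", "Wakefield"),
    ("Bronx", "Co-op City"),
    ("Bronx", "Eastchester"),
    ("Bronx", "Fieldston"),
    ("Bronx", "Riverdale"),
    ("Bronx", "Kingsbridge"),
    ("Manhattan", "Marble Hill"),
    ("Bronx", "Woodlawn"),
    ("Bronx", "Norwood"),
    ("Bronx", "Williamsbridge"),
]


def count_by_borough(rows):
    """Return a Counter mapping borough name to number of neighborhoods."""
    return Counter(borough for borough, _ in rows)


borough_counts = count_by_borough(sample_rows)
for borough, n in borough_counts.most_common():
    print(f"{borough}: {n}")
```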

Let's create a map of all neighborhoods in New York City

In [10]:
# I need NYC's coordinates
!pip install geopy
from geopy.geocoders import Nominatim
address ='New York City, NY'

geolocator = Nominatim(user_agent = 'ny_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinate of New York City are 40.7127281, -74.0060152.


In [11]:
newyork_map = folium.Map(location=[latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(newyork_map)
newyork_map

#### Now we have found all the neighborhoods in NYC where the Italian pizza restaurant could be opened

Let's explore the pizza restaurants around the first neighborhood

In [12]:
# @hidden_cell
CLIENT_ID = 'X4RIMZIEGYH5MNOZL35JOZIBSX5RIOZY3UPITSZRWCQHN3LC' # your Foursquare ID
CLIENT_SECRET = '3ED1BGNBZ0P5FBCSNQOFCF00JDSFX5GHQIEQM3YNNAS5TR51' # your Foursquare Secret
ACCESS_TOKEN = 'T1F4FPCZT1SMN3LICGOB1IXVQZD3TRH0JQFQ3K2AVYR515ST' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: X4RIMZIEGYH5MNOZL35JOZIBSX5RIOZY3UPITSZRWCQHN3LC
CLIENT_SECRET:3ED1BGNBZ0P5FBCSNQOFCF00JDSFX5GHQIEQM3YNNAS5TR51
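Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. As a hedged alternative, the sketch below reads them from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read credentials from a mapping of environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + "*" * (len(secret) - visible)


# Demo with a stand-in mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "ABCD1234",
    "FOURSQUARE_CLIENT_SECRET": "WXYZ5678",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(client_id))
print("CLIENT_SECRET: " + mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` without arguments would read from the real environment.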


In [13]:
# First neighborhood
neighborhoods.loc[0, 'Neighborhood']

'Wakefield'

In [14]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = neighborhoods.loc[0, 'Neighborhood']
print(neighborhood_name, neighborhood_latitude, neighborhood_longitude)

Wakefield 40.89470517661 -73.84720052054902


In [15]:
# Let's define a query that finds pizza venues within 800 meters and limits the search to 100 venues
search_query = 'Pizza'
radius=800
limit = 100
print(search_query + '.... ok!')

Pizza.... ok!


In [16]:
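# Define the URL
# Note: the results below include non-pizza venues, which suggests the API ignores
# the 'search_query' parameter; pizza places are filtered by category later on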
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&search_query={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    limit,
    search_query)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=X4RIMZIEGYH5MNOZL35JOZIBSX5RIOZY3UPITSZRWCQHN3LC&client_secret=3ED1BGNBZ0P5FBCSNQOFCF00JDSFX5GHQIEQM3YNNAS5TR51&v=20180604&ll=40.89470517661,-73.84720052054902&radius=800&limit=100&search_query=Pizza'

In [None]:
# Send a GET request
results = requests.get(url).json()
results
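The request URL above is assembled with string formatting, which is easy to get wrong when values need escaping. The sketch below builds the same kind of URL with the standard library's `urlencode`. It assumes that `query` is the keyword parameter the explore endpoint expects (this should be checked against the Foursquare documentation); `build_explore_url` and the placeholder credentials are made up for this example.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius, limit, query):
    """Build an explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
        "query": query,  # assumption: keyword parameter name used by the endpoint
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials; the coordinates are Wakefield's, as used above
example_url = build_explore_url("ID", "SECRET", "20180604",
                                40.89470517661, -73.84720052054902,
                                800, 100, "Pizza")
print(example_url)
```

With `requests`, the same dictionary can be passed directly as `requests.get(EXPLORE_URL, params=params, timeout=10)`, which also avoids an unanswered request hanging the notebook.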

#### Now we have to repeat the operation for all the neighborhoods in NYC. So let's define a new function.

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened dataframe stores the categories under 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we can clean the JSON response and transform it into a pandas dataframe

In [19]:
venues = results['response']['groups'][0]['items']
# flatten JSON (pd.json_normalize replaces the deprecated pandas.io.json import from pandas 1.0 on)
nyc_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nyc_venues = nyc_venues.loc[:, filtered_columns]

# filter the category for each row
nyc_venues['venue.categories'] = nyc_venues.apply(get_category_type, axis=1)

# clean columns
nyc_venues.columns = [col.split(".")[-1] for col in nyc_venues.columns]

nyc_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Lollipops Gelato,Dessert Shop,40.894123,-73.845892
1,Ripe Kitchen & Bar,Caribbean Restaurant,40.898152,-73.838875
2,Jackie's West Indian Bakery,Caribbean Restaurant,40.889283,-73.84331
3,Carvel Ice Cream,Ice Cream Shop,40.890487,-73.848568
4,Walgreens,Pharmacy,40.896528,-73.8447
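To sanity-check that the returned venues really fall within the requested radius, the great-circle distance between the neighborhood centre and each venue can be computed with the haversine formula. This is a minimal sketch in plain Python using the coordinates from the table above; `haversine_m` is a helper written for this example and the Earth radius is the usual mean value.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    earth_radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Wakefield centre and some of the venues listed above
wakefield = (40.894705, -73.847201)
venues_sample = {
    "Lollipops Gelato": (40.894123, -73.845892),
    "Jackie's West Indian Bakery": (40.889283, -73.84331),
    "Carvel Ice Cream": (40.890487, -73.848568),
    "Walgreens": (40.896528, -73.8447),
}

distances = {
    name: haversine_m(wakefield[0], wakefield[1], lat, lng)
    for name, (lat, lng) in venues_sample.items()
}
for name, d in distances.items():
    print(f"{name}: {d:.0f} m")
```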


In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&search_query={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            search_query)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nyc_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nyc_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nyc_venues
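The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. The sketch below shows one defensive way to flatten the items. The function `flatten_items` and the second sample venue are hypothetical, and the payload shape is assumed to match the one used above.

```python
def flatten_items(neighborhood, lat, lng, items):
    """Turn explore-style items into flat tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of failing
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Items shaped like the response used above; the second one is made up
sample_items = [
    {"venue": {"name": "Capri II Pizza",
               "location": {"lat": 40.876374, "lng": -73.82994},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 40.87, "lng": -73.83},
               "categories": []}},
]

flat_rows = flatten_items("Co-op City", 40.874294, -73.829939, sample_items)
for row in flat_rows:
    print(row)
```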

In [21]:
# Now run the function above to create a new dataframe called nyc_venues
nyc_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [29]:
nyc_venues.head(20)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop
5,Wakefield,40.894705,-73.847201,Subway,40.890468,-73.849152,Sandwich Place
6,Wakefield,40.894705,-73.847201,Pitman Deli,40.896744,-73.844398,Food
7,Wakefield,40.894705,-73.847201,Central Deli,40.896728,-73.844387,Deli / Bodega
8,Wakefield,40.894705,-73.847201,Koss Quick Wash,40.891281,-73.849904,Laundromat
9,Co-op City,40.874294,-73.829939,Rite Aid,40.870345,-73.828302,Pharmacy


In [23]:
nyc_venues.shape

(6152, 7)

Let's filter the dataframe to find all the pizza places in New York

In [24]:
df_mask = nyc_venues['Venue Category'] == 'Pizza Place'

In [30]:
nyc_venues_filtered = nyc_venues[df_mask]
nyc_venues_filtered.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
10,Co-op City,40.874294,-73.829939,Capri II Pizza,40.876374,-73.82994,Pizza Place
25,Eastchester,40.887556,-73.827806,Mario's Pizza,40.888628,-73.83126,Pizza Place
64,Kingsbridge,40.881687,-73.902818,Kingsbridge Social Club,40.884581,-73.901999,Pizza Place
66,Kingsbridge,40.881687,-73.902818,Sam's Pizza,40.879435,-73.905859,Pizza Place
90,Kingsbridge,40.881687,-73.902818,Broadway Pizza & Pasta,40.878822,-73.904494,Pizza Place
93,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
114,Woodlawn,40.898273,-73.867315,Katonah Pizza and Pasta,40.898784,-73.867457,Pizza Place
127,Woodlawn,40.898273,-73.867315,Bella Napoli 2,40.896781,-73.867338,Pizza Place
140,Woodlawn,40.898273,-73.867315,Bella Napoli 2,40.89673,-73.86232,Pizza Place
144,Norwood,40.877224,-73.879391,Sal's Pizzeria,40.875269,-73.879563,Pizza Place


In [26]:
nyc_venues_filtered.shape

(310, 7)
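Because each neighborhood is queried with its own radius, the same venue may be returned for two neighboring areas, which would inflate the count. The sketch below removes such repeats in plain Python by keying on the venue name and rounded coordinates. `dedupe_venues` is a helper written for this example and the last sample row is a made-up repeat. On the dataframe, `drop_duplicates(subset=['Venue', 'Venue Latitude', 'Venue Longitude'])` should behave similarly.

```python
def dedupe_venues(rows, precision=5):
    """Keep the first occurrence of each (name, lat, lng) combination."""
    seen = set()
    unique = []
    for neighborhood, venue, lat, lng in rows:
        key = (venue, round(lat, precision), round(lng, precision))
        if key not in seen:
            seen.add(key)
            unique.append((neighborhood, venue, lat, lng))
    return unique


sample = [
    ("Woodlawn", "Bella Napoli 2", 40.896781, -73.867338),
    # same name, different coordinates: treated as a separate venue
    ("Woodlawn", "Bella Napoli 2", 40.89673, -73.86232),
    ("Norwood", "Sal's Pizzeria", 40.875269, -73.879563),
    # hypothetical repeat of the same venue found from another neighborhood
    ("Bedford Park", "Sal's Pizzeria", 40.875269, -73.879563),
]

unique_venues = dedupe_venues(sample)
print(len(sample), "rows before,", len(unique_venues), "rows after")
```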

In [27]:
# Now it might be useful to see all the pizza restaurants in NYC
newyork_pizza = folium.Map(location=[latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, venue, neighborhood in zip(nyc_venues_filtered['Venue Latitude'], nyc_venues_filtered['Venue Longitude'], nyc_venues_filtered['Venue'], nyc_venues_filtered['Neighborhood']):
    label = '{}, {}'.format(neighborhood, venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(newyork_pizza)
newyork_pizza

### Now that we have all the pizza restaurants in NYC, we are going to cluster them.

In [31]:
# Let's count venues for each neighborhood
nyc_venues_filtered.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Allerton,3,3,3,3,3,3
Annadale,1,1,1,1,1,1
Arden Heights,1,1,1,1,1,1
Arrochar,2,2,2,2,2,2
Arverne,1,1,1,1,1,1
...,...,...,...,...,...,...
Willowbrook,1,1,1,1,1,1
Woodhaven,1,1,1,1,1,1
Woodlawn,3,3,3,3,3,3
Woodrow,1,1,1,1,1,1


I am going to use the K-Means method to cluster the pizza restaurants

In [41]:
kclusters = 5 # define the number of clusters

In [36]:
nyc_venues_clus = nyc_venues_filtered.drop(columns='Neighborhood')

In [38]:
nyc_venues_clus = nyc_venues_clus.drop(columns='Venue')

In [39]:
nyc_venues_clus = nyc_venues_clus.drop(columns='Venue Category')

In [40]:
nyc_venues_clus.head()

Unnamed: 0,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude
10,40.874294,-73.829939,40.876374,-73.82994
25,40.887556,-73.827806,40.888628,-73.83126
64,40.881687,-73.902818,40.884581,-73.901999
66,40.881687,-73.902818,40.879435,-73.905859
90,40.881687,-73.902818,40.878822,-73.904494


In [42]:
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyc_venues_clus)
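The number of clusters is fixed at five without a stated reason. A common heuristic for checking it is to compare the within-cluster sum of squared distances (scikit-learn exposes this as `kmeans.inertia_`) for several values of k and look for the point where it stops dropping quickly. The sketch below computes that quantity by hand for given labels, in plain Python, on four toy points that are not taken from the dataset; both helper functions are written for this example.

```python
def centroids(points, labels):
    """Return the mean point of each cluster as {label: (x, y)}."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    centers = centroids(points, labels)
    return sum((x - centers[label][0]) ** 2 + (y - centers[label][1]) ** 2
               for (x, y), label in zip(points, labels))


# Toy points forming two obvious groups (hypothetical data)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = inertia(points, [0, 0, 0, 0])
two_clusters = inertia(points, [0, 0, 1, 1])
print("k=1:", one_cluster, "k=2:", two_clusters)
```

In the notebook, the same comparison could be made by fitting `KMeans` for a range of k values and collecting `inertia_` from each fit.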

In [47]:
kmeans.labels_

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0,
       2, 2, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 4, 4, 2, 2, 2, 2,
       2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 3, 3, 3, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 3, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4,
       4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 4, 4, 2, 2, 4, 4, 0, 3, 3, 3, 3,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 4, 1, 1, 4, 2,

In [None]:
nyc_venues_clus.insert(0, 'Cluster Labels', kmeans.labels_)

In [51]:
nyc_venues_clus.head()

Unnamed: 0,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude
10,3,40.874294,-73.829939,40.876374,-73.82994
25,3,40.887556,-73.827806,40.888628,-73.83126
64,3,40.881687,-73.902818,40.884581,-73.901999
66,3,40.881687,-73.902818,40.879435,-73.905859
90,3,40.881687,-73.902818,40.878822,-73.904494


In [57]:
nyc_venues_merg = nyc_venues_clus.copy()  # copy so that nyc_venues_clus is left unchanged
nyc_venues_merg.insert(5, 'Neighborhood', nyc_venues_filtered['Neighborhood'])

In [58]:
nyc_venues_merg.head()

Unnamed: 0,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude,Neighborhood
10,3,40.874294,-73.829939,40.876374,-73.82994,Co-op City
25,3,40.887556,-73.827806,40.888628,-73.83126,Eastchester
64,3,40.881687,-73.902818,40.884581,-73.901999,Kingsbridge
66,3,40.881687,-73.902818,40.879435,-73.905859,Kingsbridge
90,3,40.881687,-73.902818,40.878822,-73.904494,Kingsbridge


In [60]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
# one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyc_venues_merg['Venue Latitude'], nyc_venues_merg['Venue Longitude'],nyc_venues_merg['Neighborhood'], nyc_venues_merg['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Now the new pizza restaurant has a map of all pizza restaurants in New York, segmented by area

The owners can explore the data further in many ways, for example by measuring the dispersion of venues in each neighborhood or by building a new dataset restricted to a single cluster in order to study that area in detail.
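As one possible starting point for that follow-up analysis, the sketch below counts the venues in each cluster, computes each cluster's mean coordinates, and extracts the rows of a single cluster. It is plain Python over a few sample rows: the first three are taken from the clustered table above and the last one is made up. `summarize_clusters` is a helper written for this example. On the dataframe, `nyc_venues_merg.groupby('Cluster Labels')` would be the natural equivalent.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """Return {cluster: {'count': n, 'center': (mean_lat, mean_lng)}}."""
    grouped = defaultdict(list)
    for cluster, lat, lng in rows:
        grouped[cluster].append((lat, lng))
    summary = {}
    for cluster, coords in grouped.items():
        n = len(coords)
        summary[cluster] = {
            "count": n,
            "center": (sum(c[0] for c in coords) / n,
                       sum(c[1] for c in coords) / n),
        }
    return summary


# (cluster label, venue latitude, venue longitude)
cluster_rows = [
    (3, 40.876374, -73.82994),
    (3, 40.888628, -73.83126),
    (3, 40.884581, -73.901999),
    (1, 40.60, -74.10),  # hypothetical row for a second cluster
]

summary = summarize_clusters(cluster_rows)
only_cluster_3 = [row for row in cluster_rows if row[0] == 3]
for cluster, info in sorted(summary.items()):
    print(cluster, info["count"], info["center"])
```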