<h1> Capstone Project </h1>

#### Business Problem
<i>A restaurateur wants to open an Italian restaurant in Manhattan and Queens Borough in the New York City. He needs to find a suitable neighborhood in these boroughs where an Italian restaurant would flourish and earn him profits.</i> 

### Methodolgy

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.18.1               |             py_0          51 KB  conda-forge
    openssl-1.0.2p             |    h14c3975_1002         3.1 MB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    conda-4.6.0                |        py36_1000         878 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0         conda-forge
    geopy:         1.18.1-py_0       conda-forge

The following packages will be UPDATED:

    conda:         4.5.12-py36_1000  conda-forge --> 4.6.0-py36_1000      conda-forge
    o

<a id='item1'></a>

### 1. Download and Explore Dataset of New York

Run a `wget` command and get access to the New York data.

In [2]:
!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
print('Data downloaded!')

Data downloaded!


Next, let's load the data.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Define a new variable that includes the neighborhood data.

In [4]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

Tranform the data into a *pandas* dataframe

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the empty dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time to add the neighborhoods.

In [6]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [7]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


The dataset has all 5 boroughs and 306 neighborhoods. Let's confirm it.

In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


The restauranteur wants to know the hoods in Manhattan and Queens boroughs where he can set up the restaurant. So, let's extract those boroughs' data.

In [9]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [10]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


Let's get the geographical coordinates of Manhattan and Queens.

In [11]:
print(manhattan_data.shape)
print(queens_data.shape)

def getLL(address):
    geolocator = Nominatim()
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinate of {} are {}, {}.'.format(address,latitude, longitude))
    return (latitude,longitude)

manhattan_coordinates=getLL('Manhattan, NY')
queens_coordinates=getLL('Queens, NY')

(40, 4)
(81, 4)


  """


The geograpical coordinate of Manhattan, NY are 40.7900869, -73.9598295.
The geograpical coordinate of Queens, NY are 40.6524927, -73.7914214158161.


### 2. Use Foursquare API to explore the hoods of Manhattan and Queens

Define Foursquare Credentials and Version

In [1]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 
CLIENT_SECRET:


Before we proceed, let's borrow the <i>get_category_type</i> function from the Foursquare lab.

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Create a function to get the nearby venues of all the neighborhoods in a given borough

In [14]:
def getNearbyVenues(names, latitudes, longitudes, query, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            query)
            
        # make the GET request
        # print(requests.get(url).json())
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Run the above function on each neighborhood to find the nearby <b>restaurants</b> and create a new dataframe for each borough.

In [15]:
LIMIT=100
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude'],
                                   query='food'
                                  )
queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude'],
                                   query='food'
                                  )


Save the data into _csv_ files for future use.

In [18]:
manhattan_venues.to_csv('manhattan_venues.csv',index=False)
queens_venues.to_csv('queens_venues.csv',index=False)

In [16]:
manhattan_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
2,Marble Hill,40.876551,-73.91066,Sam's Pizza,40.879435,-73.905859,Pizza Place
3,Marble Hill,40.876551,-73.91066,Loeser's Delicatessen,40.879242,-73.905471,Sandwich Place
4,Marble Hill,40.876551,-73.91066,El Malecon,40.879338,-73.904457,Caribbean Restaurant


In [17]:
queens_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,40.768509,-73.915654,Al-sham Sweets and Pastries,40.768077,-73.911561,Middle Eastern Restaurant
2,Astoria,40.768509,-73.915654,Brooklyn Bagel & Coffee Co.,40.764895,-73.916954,Bagel Shop
3,Astoria,40.768509,-73.915654,Seva Indian Cuisine,40.765521,-73.919157,Indian Restaurant
4,Astoria,40.768509,-73.915654,Duzan,40.76873,-73.911013,Falafel Restaurant


In [18]:
#read files when the API request rate for Foursquare exceeds
#manhattan_venues=pd.read_csv('manhattan_venues.csv')
#queens_venues=pd.read_csv('queens_venues.csv')

Check the size of the resulting dataframe.

In [19]:
print(manhattan_venues.shape)
print(queens_venues.shape)

(4000, 7)
(6714, 7)


Let's check how many venues were returned for each neighborhood.

In [20]:
print(manhattan_venues.groupby('Neighborhood').count().shape)
print(queens_venues.groupby('Neighborhood').count().shape)

(40, 6)
(81, 6)


Let's find out how many unique categories can be curated from all the returned venues.

In [21]:
print('There are {} uniques categories in Manhattan.'.format(len(manhattan_venues['Venue Category'].unique())))
print('There are {} uniques categories in Queens.'.format(len(queens_venues['Venue Category'].unique())))

There are 102 uniques categories in Manhattan.
There are 104 uniques categories in Queens.


<a id='item3'></a>

### 3. Analyze Each Neighborhood

Perform one hot encoding tranforming categorical variables to numeric ones for each dataframe.

In [22]:
def oneHotEncoding(data):
    # one hot encoding
    data_onehot = pd.get_dummies(data[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    data_onehot['Neighborhood'] = data['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [data_onehot.columns[-1]] + list(data_onehot.columns[:-1])
    data_onehot = data_onehot[fixed_columns]
    
    #group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
    data_grouped = data_onehot.groupby('Neighborhood').mean().reset_index()
    return data_grouped

manhattan_grouped=oneHotEncoding(manhattan_venues)
queens_grouped=oneHotEncoding(queens_venues)

In [23]:
manhattan_grouped.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,Austrian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bistro,Brazilian Restaurant,Breakfast Spot,Burger Joint,Burrito Place,Café,Cambodian Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Churrascaria,Comfort Food Restaurant,Creperie,Cuban Restaurant,Czech Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Empanada Restaurant,English Restaurant,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,Filipino Restaurant,Food,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Gastropub,German Restaurant,Greek Restaurant,Halal Restaurant,Hawaiian Restaurant,Hot Dog Joint,Indian Restaurant,Israeli Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jewish Restaurant,Korean Restaurant,Kosher Restaurant,Latin American Restaurant,Lebanese Restaurant,Malay Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Molecular Gastronomy Restaurant,Moroccan Restaurant,New American Restaurant,Noodle House,North Indian Restaurant,Paella Restaurant,Persian Restaurant,Peruvian Restaurant,Pet Café,Pizza Place,Ramen Restaurant,Restaurant,Russian Restaurant,Salad Place,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,Soba Restaurant,Soup Place,South American Restaurant,South Indian Restaurant,Southern / Soul Food Restaurant,Souvlaki Shop,Spanish Restaurant,Steakhouse,Sushi Restaurant,Szechuan Restaurant,Taco Place,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Tibetan Restaurant,Turkish Restaurant,Udon Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wings Joint
0,Battery Park City,0.0,0.09,0.0,0.0,0.02,0.01,0.0,0.02,0.02,0.04,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.0,0.02,0.02,0.05,0.02,0.01,0.0,0.03,0.0,0.0,0.0,0.03,0.0,0.08,0.01,0.03,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.03,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.03,0.06,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01
1,Carnegie Hill,0.0,0.03,0.0,0.0,0.01,0.0,0.01,0.0,0.04,0.05,0.0,0.0,0.01,0.01,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.01,0.01,0.0,0.0,0.0,0.02,0.01,0.0,0.14,0.0,0.05,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.09,0.02,0.01,0.0,0.01,0.03,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.0,0.05,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.02,0.0
2,Central Harlem,0.05,0.05,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.04,0.01,0.0,0.0,0.02,0.0,0.1,0.0,0.0,0.05,0.03,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.01,0.0,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.08,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.07,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.01,0.0,0.0,0.01,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.01,0.01,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
3,Chelsea,0.0,0.12,0.0,0.0,0.02,0.0,0.0,0.02,0.01,0.04,0.01,0.0,0.0,0.02,0.0,0.05,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.07,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.02,0.03,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.02,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0
4,Chinatown,0.0,0.07,0.0,0.0,0.03,0.03,0.01,0.0,0.0,0.05,0.0,0.0,0.01,0.0,0.0,0.06,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.03,0.04,0.0,0.01,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.01,0.0,0.01,0.04,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.03,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0


In [24]:
queens_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bistro,Brazilian Restaurant,Breakfast Spot,Buffet,Burger Joint,Burrito Place,Cafeteria,Café,Cajun / Creole Restaurant,Cambodian Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,Creperie,Cuban Restaurant,Czech Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Empanada Restaurant,English Restaurant,Falafel Restaurant,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Food,Food Court,Food Stand,Food Truck,French Restaurant,Fried Chicken Joint,Gastropub,German Restaurant,Gluten-free Restaurant,Greek Restaurant,Halal Restaurant,Himalayan Restaurant,Hot Dog Joint,Hotpot Restaurant,Indian Restaurant,Indonesian Restaurant,Irish Pub,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Kosher Restaurant,Latin American Restaurant,Malay Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern Greek Restaurant,New American Restaurant,Noodle House,Paella Restaurant,Pakistani Restaurant,Peruvian Restaurant,Pizza Place,Poke Place,Polish Restaurant,Portuguese Restaurant,Ramen Restaurant,Restaurant,Salad Place,Sandwich Place,Seafood Restaurant,Shanghai Restaurant,Snack Place,Soba Restaurant,South American Restaurant,Southern / Soul Food Restaurant,Souvlaki Shop,Spanish Restaurant,Sri Lankan Restaurant,Steakhouse,Sushi Restaurant,Szechuan Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tex-Mex Restaurant,Thai Restaurant,Tibetan Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wings Joint
0,Arverne,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.060606,0.0,0.0,0.0,0.0,0.060606,0.0,0.0,0.030303,0.0,0.0,0.0,0.030303,0.151515,0.0,0.0,0.0,0.0,0.0,0.121212,0.0,0.030303,0.121212,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.121212,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.060606,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0
1,Astoria,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.01,0.05,0.04,0.0,0.04,0.0,0.0,0.04,0.0,0.0,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.04,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.0,0.0,0.0,0.01,0.0,0.14,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.06,0.0,0.0,0.0,0.01,0.0,0.02,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.01,0.02,0.04,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.01,0.0,0.0,0.0
2,Astoria Heights,0.0,0.0,0.02,0.0,0.01,0.01,0.01,0.01,0.04,0.07,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.03,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.03,0.04,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.08,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.07,0.0,0.0,0.0,0.01,0.0,0.02,0.02,0.04,0.0,0.0,0.0,0.0,0.0,0.02,0.06,0.0,0.0,0.0,0.01,0.02,0.01,0.01,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.04,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.0
3,Auburndale,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.03,0.05,0.01,0.0,0.01,0.0,0.03,0.0,0.0,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.04,0.02,0.3,0.0,0.01,0.0,0.01,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.01,0.01,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
4,Bay Terrace,0.0,0.0,0.018182,0.0,0.0,0.036364,0.0,0.0,0.054545,0.054545,0.018182,0.0,0.0,0.0,0.036364,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.072727,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.036364,0.072727,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.018182,0.018182,0.0,0.0,0.0,0.018182,0.054545,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.018182,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.127273,0.0,0.0,0.0,0.0,0.036364,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.054545,0.018182,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Get the top 5 venues for each hood
First, let's write a function to sort the venues in descending order.

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [26]:
def getTopVenues(data,num_top_venues = 10):

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except:
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = data['Neighborhood']

    for ind in np.arange(data.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(data.iloc[ind, :], num_top_venues)
    
    return neighborhoods_venues_sorted

manhattan_neighborhoods_venues_sorted=getTopVenues(manhattan_grouped,5)
queens_neighborhoods_venues_sorted=getTopVenues(queens_grouped,5)


In [27]:
manhattan_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Battery Park City,American Restaurant,Italian Restaurant,Steakhouse,Sandwich Place,French Restaurant
1,Carnegie Hill,Italian Restaurant,Pizza Place,Mexican Restaurant,Bakery,Café
2,Central Harlem,Café,Italian Restaurant,Southern / Soul Food Restaurant,Mexican Restaurant,Seafood Restaurant
3,Chelsea,Italian Restaurant,American Restaurant,New American Restaurant,Café,Tapas Restaurant
4,Chinatown,Italian Restaurant,American Restaurant,French Restaurant,Café,Bakery


In [28]:
queens_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Arverne,Chinese Restaurant,Donut Shop,Deli / Bodega,Pizza Place,Taco Place
1,Astoria,Greek Restaurant,Italian Restaurant,Bagel Shop,Pizza Place,Café
2,Astoria Heights,Greek Restaurant,Italian Restaurant,Bakery,Pizza Place,Donut Shop
3,Auburndale,Korean Restaurant,Greek Restaurant,Pizza Place,Bakery,Italian Restaurant
4,Bay Terrace,Pizza Place,Italian Restaurant,Chinese Restaurant,Donut Shop,Bakery


<a id='item4'></a>

### 4. Cluster Neighborhoods for Manhattan and Queens

Run *k*-means to cluster the neighborhood into 4 clusters.

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood for each Borough.

In [29]:
def clusterHoods(data,data_grouped,data_neighborhoods_venues_sorted,kclusters = 4):
    
    data_grouped_clustering = data_grouped.drop('Neighborhood', 1)
    #print(manhattan_grouped_clustering.head())

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(data_grouped_clustering)

    # check cluster labels generated for each row in the dataframe
    kmeans.labels_[0:10]

    data_merged = data

    # add clustering labels
    data_merged['Cluster Labels'] = kmeans.labels_

    # merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
    data_merged = data_merged.join(data_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
    
    return data_merged

manhattan_merged=clusterHoods(manhattan_data,manhattan_grouped,manhattan_neighborhoods_venues_sorted)
queens_merged=clusterHoods(queens_data,queens_grouped,queens_neighborhoods_venues_sorted)


Finally, let's visualize the resulting clusters

In [30]:
def visualizeClusters(coords,data_merged,kclusters=4):
    # create map
    map_clusters = folium.Map(location=list(coords), zoom_start=11)

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i+x+(i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(data_merged['Latitude'], data_merged['Longitude'], data_merged['Neighborhood'], data_merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters
    

In [31]:
visualizeClusters(manhattan_coordinates,manhattan_merged)

In [32]:
visualizeClusters(queens_coordinates,queens_merged)

### 5. Examine Clusters

Clusters are formed based on the venues to separate similar ones with dissimilar ones.
Since the restaurateur wants to open an _Italian Restaurant_, determine those clusters that have the maximum Italian restaurants in it.

#### Cluster 1

In [33]:
manhattan_merged[manhattan_merged.eq('Italian Restaurant').any(1)].groupby('Cluster Labels').count()

Unnamed: 0_level_0,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2,2,2,2,2,2,2,2,2
1,14,14,14,14,14,14,14,14,14
2,11,11,11,11,11,11,11,11,11
3,5,5,5,5,5,5,5,5,5


We observe that Cluster 1 and Cluster 2 have the maximum count of Italian restaurants for Manhattan.

In [34]:
queens_merged[queens_merged.eq('Italian Restaurant').any(1)].groupby('Cluster Labels').count()

Unnamed: 0_level_0,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2,2,2,2,2,2,2,2,2
1,16,16,16,16,16,16,16,16,16
2,9,9,9,9,9,9,9,9,9


We observe that most of the Italian restaurants are in Cluster 1 for Queens.

##### Get those hoods in Manhattan that belong to Cluster 1 and Cluster 2, but don't have _Italian Restaurant_ in the top venues

In [35]:
manhattan_hoods1=manhattan_merged[~manhattan_merged.eq('Italian Restaurant').any(1)].loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1]]]
manhattan_hoods2=manhattan_merged[~manhattan_merged.eq('Italian Restaurant').any(1)].loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1]]]

##### Get those hoods in Queens that belong to Cluster 1, but don't have _Italian Restaurant_ in the top venues

In [36]:
queens_hoods=queens_merged[~queens_merged.eq('Italian Restaurant').any(1)].loc[queens_merged['Cluster Labels'] == 1, queens_merged.columns[[1]]]


### 6. Recommend Hoods

#### Hoods where an Italian restaurant can be opened are:

In [37]:
manhattan_selected=pd.concat([manhattan_hoods1,manhattan_hoods2]).reset_index(drop=True)
manhattan_selected

Unnamed: 0,Neighborhood
0,Inwood
1,Murray Hill
2,Gramercy
3,Tudor City
4,Marble Hill
5,Clinton


In [38]:
queens_selected=queens_hoods.reset_index(drop=True)
queens_selected

Unnamed: 0,Neighborhood
0,Kew Gardens
1,Richmond Hill
2,Flushing
3,Maspeth
4,Glendale
5,Woodhaven
6,South Ozone Park
7,Kew Gardens Hills
8,Briarwood
9,Jamaica Center


### Thank you