![alt text](https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/header.jpg "Logo Title Text 1")

# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

Very often, people need to move from one place to another. A new job, raising a family and college admission, for example, are some of the most common reasons for moving around. A change can also take place due to external problems: armed conflicts, poverty, natural disasters, political persecution, etc. But, when arriving at the new location, a question that may eventually arise is: where is the best place to live in this city?

Each individual has unique preferences and needs, which may vary over time, but it's reasonable and legitimate to think that everyone would like to live in a neighborhood that best suits their current expectations. A married person with children may prefer to live closer to where there are more schools and parks, for example. A young single person may prefer somewhere better served by public transport. A couple with no children may prefer to live near restaurants. If the couple is Italian, probably their favourite restaurants would be Italian, not Indian, for instance. The desired combinations of amenities are virtually endless.

Living close to places that are most compatible with current needs and preferences means maximizing personal and family happiness. From a philosophical point of view, the pursuit of happiness has always been a subject of study for great philosophers, from Aristotle through Kant to John Stuart Mill, just to name a few.

Beyond the philosophical aspect, and taking a more practical approach, moving from one place to another can cause personal and family problems. Lower work productivity (or even unemployment), emotional instability and disagreements with a spouse or children, among other problems, can be related to a failure to adapt to a location that lacks the essential structures needed by individuals or families. On the other hand, good adaptation means greater personal and family fulfillment.

It is true, however, that happiness depends on many other aspects that are not related to living in a good neighborhood, but all these aspects are beyond the scope of this work.
Having said that, the question to be answered by this project is: **Considering a person's needs, which New York City neighborhoods would be most compatible with him/her?**

### People who might be interested in this kind of information
The answer to this question may be of interest not only to those who seek a place for themselves or their families to live in, but also to public offices working with the settlement of immigrants or refugees, as well as to private businesses such as real estate offices or companies recruiting professionals from abroad. Any entity that is responsible for advising someone on finding a residence in New York City may be interested in this project.


## Data <a name="data"></a>

### Required Data
For the execution of this project, we will need:
- Data on the New York boroughs and their neighborhoods, which have already been made available throughout the course;
- **Foursquare** venues category list, which can be obtained by simply calling an endpoint (https://api.foursquare.com/v2/venues/categories);
- List of places mapped by **Foursquare** in each New York City neighborhood, obtained through **Foursquare API** calls;
- List of priorities within **Foursquare** categories, informed by the user and which will be used to generate a score for each neighborhood.

### How data will be used

- The user will specify the categories he or she thinks are important in a neighborhood, indicating their priority:

   3: Very important;
   2: Important;
   1: Not so important
 

- He or she can choose as many categories as he or she likes, indicating a priority for each
- The system will obtain the list of boroughs of the city and, for each borough, its neighborhoods;
- For each neighborhood, we will get the list of places within the categories that the user chose
- After gathering all the necessary data, the system will group the data by neighborhood, summing up the number of venues of each category, normalizing the data and applying the priorities given by the user to calculate the final rating for each neighborhood
- With this, it will be possible to create clusters using **KMeans clustering** and create groups of neighborhoods based on their calculated ratings
- Two maps will be plotted:
 1. Clusters based on the similarities of the neighborhoods
 2. Groups based on the final ratings of the neighborhoods 

### Important notice about Foursquare venue categories

- Foursquare advises that the list of categories may change slightly over time, so it can't be a fixed list and we need to call the API every time we run the application in order to keep all the categories up to date;
- Each venue in Foursquare has at least a main category and a subcategory. For example:
 - "Travel & Transport" = main category
     - "Metro Station" = subcategory
- However, there are some subcategories that are subdivided into narrower categories:
 - "Food" = main category 
     - "Italian Restaurant" = subcategory 
         - "Calabria Restaurant" = subcategory within "Italian Restaurant"
         - "Veneto Restaurant" = subcategory within "Italian Restaurant"
         - "Puglia Restaurant" = subcategory within "Italian Restaurant"
         - etc.;
- In an extreme case, categories can reach up to four levels: 
 - "Outdoors & Recreation" = main category
     - "Athletics & Sports" = subcategory 
         - "Gym / Fitness Center" = subcategory within "Athletics & Sports" 
             - "Boxing Gym" = subcategory within "Gym / Fitness Center"
             - "Climbing Gym" = subcategory within "Gym / Fitness Center"
             - "Cycle Studio" = subcategory within "Gym / Fitness Center"
             - "Gym Pool" = subcategory within "Gym / Fitness Center"
             - etc.;
- It's not a farfetched idea that Foursquare can come up with a five level category or even a higher level category
- When we use Foursquare to search for venues around a certain location, filtering by category, it may return categories that were not specified but are related to one of the selected ones. For instance, if you ask Foursquare for all "Italian Restaurants", it will return all of them and all "Pizza Places" as well, even though "Pizza Place" does not belong to the "Italian Restaurant" category. It is a separate subcategory, yet it is somehow related to "Italian Restaurant" (**IT'S NOT CLEAR FROM THE FOURSQUARE ENDPOINT RESPONSE HOW SOME CATEGORIES ARE LINKED**);
- In this project, we assume that the user can select only subcategories (level = 2) of main categories (level = 1), but not subcategories (level = 3) of subcategories (level = 2). However, the system will handle the entire category chain returned by Foursquare by assigning each subcategory (second, third and fourth levels) to its second-level subcategory.
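
The assignment rule in the last bullet can be illustrated on a toy category tree. This is only a sketch: the ids, the tree contents and the helper name `map_to_second_level` are made up for illustration, and the nested `categories` layout is assumed to mirror the shape of the Foursquare response.

```python
# Toy tree with invented ids; the real ids come from the Foursquare endpoint.
toy_tree = {
    "categories": [
        {"id": "food", "name": "Food", "categories": [
            {"id": "italian", "name": "Italian Restaurant", "categories": [
                {"id": "calabria", "name": "Calabria Restaurant", "categories": []},
            ]},
            {"id": "pizza", "name": "Pizza Place", "categories": []},
        ]},
        {"id": "outdoors", "name": "Outdoors & Recreation", "categories": [
            {"id": "sports", "name": "Athletics & Sports", "categories": [
                {"id": "gym", "name": "Gym / Fitness Center", "categories": [
                    {"id": "boxing", "name": "Boxing Gym", "categories": []},
                ]},
            ]},
        ]},
    ]
}


def map_to_second_level(node, level=0, second_level_id=None, mapping=None):
    """Map every category of level 2 or deeper to the id of its level-2 ancestor."""
    if mapping is None:
        mapping = {}
    for cat in node.get("categories", []):
        current = second_level_id
        if level == 1:
            # children of a main category are the level-2 subcategories
            current = cat["id"]
        if level >= 1:
            mapping[cat["id"]] = current
        map_to_second_level(cat, level + 1, current, mapping)
    return mapping


second_level = map_to_second_level(toy_tree)
# "boxing" (4th level) and "gym" (3rd level) both resolve to "sports" (2nd level);
# main categories such as "food" are not keys of the mapping.
```

With such a mapping, a venue tagged with any deep category can be counted under the second-level category the user actually selected.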

#### Import libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import requests # library to handle requests

import geopy.geocoders
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

#### Download JSON file with New York information

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


### Manipulate the downloaded file and create a dataframe

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']

In [5]:
# First element of the list, just to show what attributes are in the dictionary
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
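
As a small illustration of how one such feature turns into a dataframe row, the sketch below extracts the four fields of interest from a trimmed copy of the element shown above. The helper name `feature_to_row` is hypothetical; the point to notice is that the coordinates are stored as longitude first, latitude second.

```python
# Trimmed copy of the first feature shown above (only the fields we need).
feature = {
    "type": "Feature",
    "geometry": {"type": "Point",
                 "coordinates": [-73.84720052054902, 40.89470517661]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}


def feature_to_row(feature):
    """Return the borough, neighborhood name and coordinates of one feature."""
    lon, lat = feature["geometry"]["coordinates"]  # stored as [longitude, latitude]
    props = feature["properties"]
    return {"Borough": props["borough"],
            "Neighborhood": props["name"],
            "Latitude": lat,
            "Longitude": lon}


row = feature_to_row(feature)
```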

### Create the base dataframe with the basic columns

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

### Populate the dataframe with the data from the downloaded file

In [7]:
# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


In [9]:
address = 'New York City, NY'

# the default timeout is read when the geocoder is created, so set it first
geopy.geocoders.options.default_timeout = 7
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


### Plot a map with all neighborhoods in New York

#### Here we can see all the neighborhoods available in New York. Now we have to group them according to the user's needs and priorities

At this point we don't know yet how the venues are distributed across the city

In [10]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  

map_newyork

## Methodology <a name="methodology"></a>

In this project we will find all the venues of certain kinds in the selected neighborhoods and try to figure out which neighborhoods are more attractive in terms of the wanted amenities. The scope of this project is New York City only.

In the first step we collect the required **data: location and category of every venue within the selected boroughs**.

The second step is to prepare the dataframe so that we can create clusters of locations that meet the requirements defined by the user, and group locations by their final ratings, which are calculated by applying the weight of each category. The higher the priority (weight), the greater the importance of the category when the ratings are summed.

In order to better analyse the data, we need to normalize it by dividing each cell (neighborhood x category) by the maximum value of its category, so that each cell is between 0 and 1 **before** we apply the priorities (weights). **After** the priorities have been applied, each cell will be between 0 and 3, as the highest priority is 3.

A category with a large number of venues but a lower priority may be less important than a category with fewer venues but a higher priority. The number of venues is important, but it has to be considered together with the priority of its category.

It is important to notice that changing priorities, boroughs or categories may lead to a change in the clusters and groups, reflecting individual needs.
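
The normalization and weighting described above can be sketched in plain Python. The venue counts, neighborhood names and the helper name `weighted_ratings` below are invented for illustration; only the rule itself (divide by the category maximum, multiply by the priority, sum per neighborhood) comes from the text.

```python
# Invented venue counts per neighborhood and category.
counts = {
    "Neighborhood A": {"Metro Station": 10, "Pharmacy": 2},
    "Neighborhood B": {"Metro Station": 5, "Pharmacy": 8},
}
# Priorities chosen by a hypothetical user (3 = very important, 1 = not so important).
weights = {"Metro Station": 3, "Pharmacy": 1}


def weighted_ratings(counts, weights):
    """Normalize each category by its maximum, apply the weight, sum per neighborhood."""
    col_max = {c: max(row.get(c, 0) for row in counts.values()) for c in weights}
    ratings = {}
    for name, row in counts.items():
        total = 0.0
        for category, weight in weights.items():
            if col_max[category] > 0:  # skip categories with no venues at all
                total += weight * row.get(category, 0) / col_max[category]
        ratings[name] = total
    return ratings


ratings = weighted_ratings(counts, weights)
# Neighborhood A: 3 * 10/10 + 1 * 2/8 = 3.25
# Neighborhood B: 3 * 5/10  + 1 * 8/8 = 2.50
```

Neighborhood B has four times as many pharmacies, yet Neighborhood A scores higher because metro stations carry three times the weight.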

## Foursquare Section

##### Foursquare parameters that will be used by the application

In [11]:
# setup some parameters used throughout the project
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (do not publish real credentials)
VERSION = '20180605' # Foursquare API version
LIMIT = 200
RADIUS = 1000

##### Endpoint that returns all the categories available at Foursquare

In [12]:
# Foursquare endpoint to get all the categories available
url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
      CLIENT_ID, 
      CLIENT_SECRET, 
      VERSION)

results = requests.get(url).json()

In [13]:
# get only the data that matters
results = results['response']

#### This recursive function populates two dictionaries:
 1. cMap = {subcategory Id : 2nd-level subcategory Id} -> given a category of 2nd level or deeper, this dictionary returns its 2nd-level category. It is used to map the categories returned by Foursquare to the ones selected by the user
 2. cDic = {2nd-level subcategory Id : [2nd-level subcategory name, [main category Id, main category name]]} -> used only to show the user which categories are available. In a real application, this dictionary could be used to let the user select the categories and subcategories that filter the venues returned by the Foursquare API
 
 
 The function works for any depth, although Foursquare has used at most 4 levels so far

In [14]:
# catList -> the list with the categories
#   level -> level of the category (0 = main category, 1 = sub category, 2 = sub sub category, 3 = ...)
#    cMap -> explained above
#  subCat -> 2nd-level category of this "branch"
#    cDic -> explained above
# mainCat -> 1st level category (main category) of this "branch"
def subCategoryTable(catList, level, cMap, subCat, cDic, mainCat):

    if catList is None: # stop clause
        return
    else:
        valor = None
        for cat in catList['categories']: # for each category in the list
            if level > 0:
                if level == 1: # if True (this is a 2nd level category list) add new key/value to cDic
                    valor = cat['id']
                    cDic[valor]=[cat['name'], mainCat]
                else: # Else (this is a higher than 2nd level category list) just adjust the value of the current 2nd level category
                    valor = subCat
                cMap[cat['id']]=valor # add a new key/value to cMap
            else: # this is the first level (main) category 
                mainCat = [cat['id'], cat['name']] # set main category for this "branch"
            
            subCategoryTable(cat, level+1, cMap, valor, cDic, mainCat)  # calls recursively the same function      

In [15]:
catMap={}  # -> cMap
catDic={}  # -> cDic

# calls the function with the 1st level category list
subCategoryTable(results, 0, catMap, None, catDic, None)

Let's take a look at the subcategory Id - subcategory Name : category Name

In [16]:
# shows all subcategories and their main categories
# User can pick one or more of these categories
for k in catDic:
    print(k, ' - ', catDic.get(k)[0], ' : ', catDic.get(k)[1][1])

56aa371be4b08b9a8d5734db  -  Amphitheater  :  Arts & Entertainment
4fceea171983d5d06c3e9823  -  Aquarium  :  Arts & Entertainment
4bf58dd8d48988d1e1931735  -  Arcade  :  Arts & Entertainment
4bf58dd8d48988d1e2931735  -  Art Gallery  :  Arts & Entertainment
4bf58dd8d48988d1e4931735  -  Bowling Alley  :  Arts & Entertainment
4bf58dd8d48988d17c941735  -  Casino  :  Arts & Entertainment
52e81612bcbc57f1066b79e7  -  Circus  :  Arts & Entertainment
4bf58dd8d48988d18e941735  -  Comedy Club  :  Arts & Entertainment
5032792091d4c4b30a586d5c  -  Concert Hall  :  Arts & Entertainment
52e81612bcbc57f1066b79ef  -  Country Dance Club  :  Arts & Entertainment
52e81612bcbc57f1066b79e8  -  Disc Golf  :  Arts & Entertainment
56aa371be4b08b9a8d573532  -  Exhibit  :  Arts & Entertainment
4bf58dd8d48988d1f1931735  -  General Entertainment  :  Arts & Entertainment
52e81612bcbc57f1066b79ea  -  Go Kart Track  :  Arts & Entertainment
4deefb944765f83613cdba6e  -  Historic Site  :  Arts & Entertainment
5744ccdfe

52f2ab2ebcbc57f1066b8b34  -  Pawn Shop  :  Shop & Service
52f2ab2ebcbc57f1066b8b23  -  Perfume Shop  :  Shop & Service
5032897c91d4c4b30a586d69  -  Pet Service  :  Shop & Service
4bf58dd8d48988d100951735  -  Pet Store  :  Shop & Service
4bf58dd8d48988d10f951735  -  Pharmacy  :  Shop & Service
4eb1bdde3b7b55596b4a7490  -  Photography Lab  :  Shop & Service
554a5e17498efabeda6cc559  -  Photography Studio  :  Shop & Service
52f2ab2ebcbc57f1066b8b20  -  Piercing Parlor  :  Shop & Service
52f2ab2ebcbc57f1066b8b3d  -  Pop-Up Shop  :  Shop & Service
52f2ab2ebcbc57f1066b8b28  -  Print Shop  :  Shop & Service
5744ccdfe4b0c0459246b4c4  -  Public Bathroom  :  Shop & Service
5032885091d4c4b30a586d66  -  Real Estate Office  :  Shop & Service
4bf58dd8d48988d10d951735  -  Record Shop  :  Shop & Service
52f2ab2ebcbc57f1066b8b37  -  Recording Studio  :  Shop & Service
4f4531084b9074f6e4fb0101  -  Recycling Facility  :  Shop & Service
56aa371be4b08b9a8d573552  -  Rental Service  :  Shop & Service
4bf58d

#### In the real application, there would be a front-end interface so the user could select categories and give them a priority

In [2]:
# In this example, the user took the following categories: 
# Market, Fruit and Vegetable stores, Drugstores, Laundry Services, 
# Athletics and Sports, Metro station, bus stop, Italian restaurant, Pizza places
# Priorities for each category:
# 3 = Very important
# 2 = Important
# 1 = Not so important
categories=[ ['52f2ab2ebcbc57f1066b8b1c', 'Fruit & Vegetable Store', 3],
             ['50be8ee891d4fa8dcc7199a7', 'Market', 3],
             ['4bf58dd8d48988d1fc941735', 'Laundry Service', 2],
             ['5745c2e4498e11e7bccabdbd', 'Drugstore', 1],
             ['4f4528bc4b90abdf24c9de85', 'Athletics & Sports', 2],
             ['4bf58dd8d48988d1fd931735', 'Metro Station', 3],
             ['52f2ab2ebcbc57f1066b8b4f', 'Bus Stop', 3],
             ['4bf58dd8d48988d10f951735', 'Pharmacy', 1],
             ['4bf58dd8d48988d110941735', 'Italian Restaurant', 2],
             ['4bf58dd8d48988d1ca941735', 'Pizza Place', 2]]

# this will be used by Foursquare API engine
categoryIds = ','.join(x[0] for x in categories)

# this will be used to calculate the neighborhood rating based on the quantity of venues of each kind and its respective weight (priority)
weights = {}
for c in categories:
    weights[c[1]] = c[2]

In [18]:
weights

{'Fruit & Vegetable Store': 3,
 'Market': 3,
 'Laundry Service': 2,
 'Drugstore': 1,
 'Athletics & Sports': 2,
 'Metro Station': 3,
 'Bus Stop': 3,
 'Pharmacy': 1,
 'Italian Restaurant': 2,
 'Pizza Place': 2}

#### Again, in a real application, there would be a front-end interface so the user could select the boroughs he/she is interested in living in

In [19]:
# The user can also select the boroughs he/she wants
# ['Bronx' 'Manhattan' 'Brooklyn' 'Queens' 'Staten Island']

boroughs=['Manhattan'] 

#### For each neighborhood within the selected boroughs Foursquare will return the venues of the wanted categories in that area

This function returns all the venues around each neighborhood of the selected boroughs, filtered by the selected categories

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius, categorias):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent=browse&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            categorias)
        
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(name, 
                             lat, 
                             lng, 
                             v['name'], # ['venue']
                             v['location']['lat'], 
                             v['location']['lng'],  
                             v['categories'][0]['name'],
                             v['categories'][0]['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category',
                             'Category Id']
    
    return(nearby_venues)
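
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch only: the helper name `build_search_url` and the placeholder credentials are assumptions, while the endpoint and parameter names are the ones used in the function above. Passing the same dictionary as `params=` to `requests.get` should have the same effect.

```python
from urllib.parse import urlencode


def build_search_url(lat, lng, radius, limit, category_ids,
                     client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     version="20180605"):
    """Build the venues/search URL; credentials here are placeholders."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "intent": "browse",          # plain value, no surrounding quotes
        "categoryId": category_ids,  # comma-separated category ids
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


url = build_search_url(40.876551, -73.91066, 1000, 200,
                       "4bf58dd8d48988d1fd931735,4bf58dd8d48988d1ca941735")
```

Encoding the values this way avoids stray quotes or separators ending up in the request.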

In [21]:
# List with the neighborhoods within selected boroughs
boroughs_data = neighborhoods[neighborhoods.Borough.isin(boroughs)]

# Get all the venues from Foursquare according to the parameters
ny_venues = getNearbyVenues(boroughs_data['Neighborhood'],
                            boroughs_data['Latitude'],
                            boroughs_data['Longitude'],
                            RADIUS,
                            categoryIds)

#### We map the categories Foursquare returned to the ones the user selected
If Foursquare returns any category outside the selection, we delete it (just in case)

In [22]:
# Some lines have to be deleted because Foursquare returns other categories than the ones user has selected
ny_venues['Subcategory']=ny_venues['Category Id'].map(catMap)
ny_venues.drop(index=ny_venues.loc[~ny_venues.Subcategory.isin(categoryIds.split(','))].index, inplace=True)

Add 'Subcategory Name' column

In [23]:
# Insert column with Subcategory Name
ny_venues['Subcategory Name'] = ny_venues['Subcategory'].map(catDic).apply(lambda a: a[0])

## Analysis <a name="analysis"></a>

At this point, we have a dataframe with all the neighborhoods and venues within each of them

Let's perform some basic exploratory data analysis with the data we have gathered from Foursquare

In [24]:
ny_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Category Id | Subcategory | Subcategory Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | MTA Subway - 225th St/Marble Hill (1) | 40.874486 | -73.909589 | Metro Station | 4bf58dd8d48988d1fd931735 | 4bf58dd8d48988d1fd931735 | Metro Station |
| 1 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place | 4bf58dd8d48988d1ca941735 | 4bf58dd8d48988d1ca941735 | Pizza Place |
| 2 | Marble Hill | 40.876551 | -73.91066 | Bronx Boxing | 40.875671 | -73.908355 | Boxing Gym | 52f2ab2ebcbc57f1066b8b47 | 4f4528bc4b90abdf24c9de85 | Athletics & Sports |
| 3 | Marble Hill | 40.876551 | -73.91066 | marble hill pharmacy | 40.87505 | -73.909195 | Pharmacy | 4bf58dd8d48988d10f951735 | 4bf58dd8d48988d10f951735 | Pharmacy |
| 4 | Marble Hill | 40.876551 | -73.91066 | 24 Hour Fitness | 40.880592 | -73.908255 | Gym / Fitness Center | 4bf58dd8d48988d175941735 | 4f4528bc4b90abdf24c9de85 | Athletics & Sports |


Now, we create a dataframe with subcategory names and Neighborhood columns

In [25]:
# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Subcategory Name']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

In [27]:
print(ny_onehot.shape)
ny_onehot.head()

(1875, 11)


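| | Neighborhood | Athletics & Sports | Bus Stop | Drugstore | Fruit & Vegetable Store | Italian Restaurant | Laundry Service | Market | Metro Station | Pharmacy | Pizza Place |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | Marble Hill | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | Marble Hill | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Marble Hill | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | Marble Hill | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |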


#### We create a new dataframe grouping by neighborhood and summing up the number of venues of each category

As we can see, we now have the number of venues of each category in each neighborhood

In [28]:
ny_grouped = ny_onehot.groupby('Neighborhood').sum().reset_index()
ny_grouped.head()

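| | Neighborhood | Athletics & Sports | Bus Stop | Drugstore | Fruit & Vegetable Store | Italian Restaurant | Laundry Service | Market | Metro Station | Pharmacy | Pizza Place |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | 14 | 0 | 0 | 0 | 4 | 0 | 1 | 18 | 7 | 2 |
| 1 | Carnegie Hill | 17 | 0 | 0 | 0 | 4 | 0 | 0 | 6 | 14 | 7 |
| 2 | Central Harlem | 15 | 3 | 0 | 0 | 3 | 4 | 1 | 8 | 6 | 7 |
| 3 | Chelsea | 20 | 0 | 0 | 0 | 5 | 0 | 1 | 13 | 4 | 4 |
| 4 | Chinatown | 12 | 0 | 0 | 0 | 9 | 0 | 0 | 12 | 8 | 5 |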
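
To make explicit what the one-hot encoding followed by the grouped sum computes, here is a plain-Python equivalent: it simply counts venues per neighborhood and subcategory. The five input pairs are the Marble Hill rows shown in the `ny_venues.head()` output earlier; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# (neighborhood, subcategory name) pairs taken from the first five venue rows
venues = [
    ("Marble Hill", "Metro Station"),
    ("Marble Hill", "Pizza Place"),
    ("Marble Hill", "Athletics & Sports"),
    ("Marble Hill", "Pharmacy"),
    ("Marble Hill", "Athletics & Sports"),
]

# one Counter per neighborhood: subcategory name -> number of venues
counts = defaultdict(Counter)
for neighborhood, subcategory in venues:
    counts[neighborhood][subcategory] += 1

marble_hill = counts["Marble Hill"]
# categories with no venue simply count as 0, like the zero cells in the table
```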


Just checking the columns related to the category names

In [29]:
columns=ny_grouped.columns[1:]
columns

Index(['Athletics & Sports', 'Bus Stop', 'Drugstore',
       'Fruit & Vegetable Store', 'Italian Restaurant', 'Laundry Service',
       'Market', 'Metro Station', 'Pharmacy', 'Pizza Place'],
      dtype='object')

Since we are going to apply the weights to each category, **we need to normalize** the data to avoid distorting the results. This can be done by dividing each cell by the maximum value found in its column

In [30]:
# Normalize and apply respective weights for every column
for col in columns:
    print(col, ' Max value = ', str(ny_grouped[col].max()))
    ny_grouped[col] = (ny_grouped[col] * weights.get(col))/ ny_grouped[col].max()
    
ny_grouped['Total Rating']=ny_grouped.iloc[:,1:].sum(axis=1)

Athletics & Sports  Max value =  25
Bus Stop  Max value =  6
Drugstore  Max value =  1
Fruit & Vegetable Store  Max value =  1
Italian Restaurant  Max value =  11
Laundry Service  Max value =  10
Market  Max value =  2
Metro Station  Max value =  20
Pharmacy  Max value =  14
Pizza Place  Max value =  13


As we can see, "Fruit & Vegetable Store", "Drugstore" and "Market" venues are scarce: the maximum count per neighborhood is only 1 or 2. This may suggest that the borough chosen by the user lacks some amenities.

However, the borough seems to have plenty of gyms, metro stations and Italian/pizza restaurants.

Let's take a look at our dataframe after normalizing and applying the weights on it

In [31]:
ny_grouped.head()

| | Neighborhood | Athletics & Sports | Bus Stop | Drugstore | Fruit & Vegetable Store | Italian Restaurant | Laundry Service | Market | Metro Station | Pharmacy | Pizza Place | Total Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | 1.12 | 0.0 | 0.0 | 0.0 | 0.727273 | 0.0 | 1.5 | 2.7 | 0.5 | 0.307692 | 6.854965 |
| 1 | Carnegie Hill | 1.36 | 0.0 | 0.0 | 0.0 | 0.727273 | 0.0 | 0.0 | 0.9 | 1.0 | 1.076923 | 5.064196 |
| 2 | Central Harlem | 1.2 | 1.5 | 0.0 | 0.0 | 0.545455 | 0.8 | 1.5 | 1.2 | 0.428571 | 1.076923 | 8.250949 |
| 3 | Chelsea | 1.6 | 0.0 | 0.0 | 0.0 | 0.909091 | 0.0 | 1.5 | 1.95 | 0.285714 | 0.615385 | 6.86019 |
| 4 | Chinatown | 0.96 | 0.0 | 0.0 | 0.0 | 1.636364 | 0.0 | 0.0 | 1.8 | 0.571429 | 0.769231 | 5.737023 |


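As we are handling 10 categories in this example, and each cell (neighborhood x category) lies between 0 and the weight of its category (at most 3), the "Total Rating" cannot exceed the sum of the weights, which is 22 here. The closer "Total Rating" is to that bound, the better.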
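
One way to read a rating is as a share of the best score a neighborhood could possibly reach. Since every weighted cell is capped at the weight of its category, that best score is the sum of the weights. The sketch below uses the example weights from this notebook; the helper name `rating_share` is hypothetical.

```python
# Example priorities used in this notebook
weights = {'Fruit & Vegetable Store': 3,
           'Market': 3,
           'Laundry Service': 2,
           'Drugstore': 1,
           'Athletics & Sports': 2,
           'Metro Station': 3,
           'Bus Stop': 3,
           'Pharmacy': 1,
           'Italian Restaurant': 2,
           'Pizza Place': 2}

# a neighborhood with the maximum count in every category would score this much
max_rating = sum(weights.values())  # 22


def rating_share(total_rating):
    """Fraction of the maximum attainable rating reached by a neighborhood."""
    return total_rating / max_rating
```

Expressing ratings this way makes results comparable across runs in which the user picks different categories or priorities.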

In [33]:
print ('Min Rating = ', "{:4.2f}".format(ny_grouped['Total Rating'].min()))
print ('Max Rating = ', "{:4.2f}".format(ny_grouped['Total Rating'].max()))

Min Rating =  4.63
Max Rating =  10.10


Let's just rearrange our dataframe

In [34]:
fixed_columns = [ny_grouped.columns[0]] + [ny_grouped.columns[-1]] + list(ny_grouped.columns[1:-1])
ny_grouped = ny_grouped[fixed_columns]

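The function below sorts the categories in each row to find the top categories in each neighborhood. Because the counts have already been normalized and weighted, "most common" here means highest weighted score rather than highest raw count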

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
num_top_venues = ny_grouped.shape[1]-2

indicators = ['st', 'nd', 'rd', 'th']

# create columns according to number of top venues
columns = ['Neighborhood', 'Total Rating']
for ind in np.arange(num_top_venues):
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[min(ind, len(indicators)-1)]))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']
neighborhoods_venues_sorted['Total Rating'] = ny_grouped['Total Rating']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighborhood | Total Rating | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | 6.854965 | Metro Station | Market | Athletics & Sports | Italian Restaurant | Pharmacy | Pizza Place | Laundry Service | Fruit & Vegetable Store | Drugstore | Bus Stop |
| 1 | Carnegie Hill | 5.064196 | Athletics & Sports | Pizza Place | Pharmacy | Metro Station | Italian Restaurant | Market | Laundry Service | Fruit & Vegetable Store | Drugstore | Bus Stop |
| 2 | Central Harlem | 8.250949 | Market | Bus Stop | Metro Station | Athletics & Sports | Pizza Place | Laundry Service | Italian Restaurant | Pharmacy | Fruit & Vegetable Store | Drugstore |
| 3 | Chelsea | 6.86019 | Metro Station | Athletics & Sports | Market | Italian Restaurant | Pizza Place | Pharmacy | Laundry Service | Fruit & Vegetable Store | Drugstore | Bus Stop |
| 4 | Chinatown | 5.737023 | Metro Station | Italian Restaurant | Athletics & Sports | Pizza Place | Pharmacy | Market | Laundry Service | Fruit & Vegetable Store | Drugstore | Bus Stop |
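
The ranking in each row is a descending sort of the weighted scores. As a small check, the sketch below takes the weighted scores of Battery Park City from the normalized table and ranks them in plain Python; the variable names are illustrative.

```python
# Weighted scores of Battery Park City, copied from the normalized dataframe
scores = {
    "Athletics & Sports": 1.12,
    "Bus Stop": 0.0,
    "Drugstore": 0.0,
    "Fruit & Vegetable Store": 0.0,
    "Italian Restaurant": 0.727273,
    "Laundry Service": 0.0,
    "Market": 1.5,
    "Metro Station": 2.7,
    "Pharmacy": 0.5,
    "Pizza Place": 0.307692,
}

# category names ordered from highest to lowest weighted score
ranked = sorted(scores, key=scores.get, reverse=True)
top_three = ranked[:3]
```

"Market" ranks above "Athletics & Sports" here even though the neighborhood has 14 sports venues and a single market, because the ranking uses the normalized, weighted scores. The order among categories tied at zero depends on the sorting routine and carries no meaning.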


#### Let's plot some clusters using KMeans function

We drop two columns before we fit our model:

- 'Neighborhood' is a string
- 'Total Rating' is a combination of the other columns

In [37]:
# set number of clusters
kclusters = 5

# drop by keyword: the positional axis argument was removed in pandas 2.0
ny_grouped_clustering = ny_grouped.drop(columns=['Neighborhood', 'Total Rating'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 3, 1, 0, 3, 0, 4, 1, 3, 0, 0, 3, 0, 1, 4, 1, 2, 3, 0, 3, 3, 1,
       1, 0, 0, 3, 0, 0, 2, 0, 4, 2, 0, 3, 2, 3, 3, 1, 0, 3])

The model has generated the cluster labels. Now we want them as a column in our dataframe. Besides, we want other information back to the dataframe: Borough, Neighborhood, Latitude, Longitude, Total Rating.

In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = neighborhoods[neighborhoods.Borough.isin(boroughs)]

# merge ny_grouped with ny_data to add latitude/longitude for each neighborhood
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ny_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Manhattan,Marble Hill,40.876551,-73.91066,1,8.319291,Bus Stop,Market,Pizza Place,Athletics & Sports,Laundry Service,Pharmacy,Metro Station,Italian Restaurant,Fruit & Vegetable Store,Drugstore
100,Manhattan,Chinatown,40.715618,-73.994279,3,5.737023,Metro Station,Italian Restaurant,Athletics & Sports,Pizza Place,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
101,Manhattan,Washington Heights,40.851903,-73.9369,1,10.09963,Bus Stop,Laundry Service,Pizza Place,Market,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Fruit & Vegetable Store,Drugstore
102,Manhattan,Inwood,40.867684,-73.92121,1,9.503576,Laundry Service,Bus Stop,Market,Pizza Place,Metro Station,Athletics & Sports,Pharmacy,Italian Restaurant,Fruit & Vegetable Store,Drugstore
103,Manhattan,Hamilton Heights,40.823604,-73.949688,1,7.071548,Laundry Service,Bus Stop,Pizza Place,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Market,Fruit & Vegetable Store,Drugstore
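
The label-and-join step depends on the labels being in the same row order as the table they are attached to. A minimal sketch of the same pattern on three rows taken from the output above; the frame names `coords` and `venues` are placeholders, and `assign` is used instead of `insert` only so the snippet can be re-run without raising on an existing column.

```python
import pandas as pd

# Coordinates per neighborhood (values copied from the table above)
coords = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Manhattan"],
    "Neighborhood": ["Chinatown", "Inwood", "Marble Hill"],
    "Latitude": [40.715618, 40.867684, 40.876551],
    "Longitude": [-73.994279, -73.92121, -73.91066],
})

# Rating table, in the same row order as the data the model was fit on
venues = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Inwood", "Marble Hill"],
    "Total Rating": [5.737023, 9.503576, 8.319291],
})

# Labels are positional: the i-th label belongs to the i-th row of `venues`
labels = [3, 1, 1]
labeled = venues.assign(**{"Cluster Labels": labels})

# Join on the neighborhood name to bring the labels next to the coordinates
merged = coords.join(labeled.set_index("Neighborhood"), on="Neighborhood")
print(merged)
```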


Well, let's see where the clusters are on a map of New York City

In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, tw in zip(ny_merged['Latitude'], 
                                      ny_merged['Longitude'], 
                                      ny_merged['Neighborhood'], 
                                      ny_merged['Cluster Labels'], 
                                      ny_merged['Total Rating']):
    label = folium.Popup(str(poi) + ' | Cluster ' + str(cluster) + ' | Rating : ' + "{:4.2f}".format(tw), parse_html=True)

    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now, let's look at the neighborhoods belonging to **cluster 0**

This cluster groups neighborhoods where metro stations are the most common venue

In [41]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values(by='Total Rating', ascending=False)


Unnamed: 0,Neighborhood,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
120,Tribeca,7.385854,Metro Station,Market,Italian Restaurant,Athletics & Sports,Pizza Place,Pharmacy,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
249,Civic Center,7.292448,Metro Station,Market,Italian Restaurant,Athletics & Sports,Pharmacy,Pizza Place,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
123,West Village,6.995255,Metro Station,Athletics & Sports,Market,Italian Restaurant,Pizza Place,Pharmacy,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
128,Financial District,6.863536,Metro Station,Market,Athletics & Sports,Italian Restaurant,Pharmacy,Pizza Place,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
116,Chelsea,6.86019,Metro Station,Athletics & Sports,Market,Italian Restaurant,Pizza Place,Pharmacy,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
127,Battery Park City,6.854965,Metro Station,Market,Athletics & Sports,Italian Restaurant,Pharmacy,Pizza Place,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
122,Soho,5.800919,Metro Station,Italian Restaurant,Athletics & Sports,Pizza Place,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
121,Little Italy,5.72959,Metro Station,Athletics & Sports,Italian Restaurant,Pizza Place,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
276,Flatiron,5.615405,Metro Station,Athletics & Sports,Italian Restaurant,Pharmacy,Pizza Place,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
117,Greenwich Village,5.601229,Metro Station,Athletics & Sports,Italian Restaurant,Pizza Place,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop


Now, let's look at the neighborhoods belonging to **cluster 1**

This cluster groups neighborhoods where bus stops rank among the two most common venues

In [42]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values(by='Total Rating', ascending=False)


Unnamed: 0,Neighborhood,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
101,Washington Heights,10.09963,Bus Stop,Laundry Service,Pizza Place,Market,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Fruit & Vegetable Store,Drugstore
102,Inwood,9.503576,Laundry Service,Bus Stop,Market,Pizza Place,Metro Station,Athletics & Sports,Pharmacy,Italian Restaurant,Fruit & Vegetable Store,Drugstore
6,Marble Hill,8.319291,Bus Stop,Market,Pizza Place,Athletics & Sports,Laundry Service,Pharmacy,Metro Station,Italian Restaurant,Fruit & Vegetable Store,Drugstore
105,Central Harlem,8.250949,Market,Bus Stop,Metro Station,Athletics & Sports,Pizza Place,Laundry Service,Italian Restaurant,Pharmacy,Fruit & Vegetable Store,Drugstore
106,East Harlem,7.174296,Bus Stop,Athletics & Sports,Metro Station,Pizza Place,Pharmacy,Laundry Service,Italian Restaurant,Market,Fruit & Vegetable Store,Drugstore
103,Hamilton Heights,7.071548,Laundry Service,Bus Stop,Pizza Place,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Market,Fruit & Vegetable Store,Drugstore
104,Manhattanville,6.876893,Bus Stop,Laundry Service,Athletics & Sports,Italian Restaurant,Pizza Place,Metro Station,Pharmacy,Market,Fruit & Vegetable Store,Drugstore


Now, let's look at the neighborhoods belonging to **cluster 2**

This cluster groups neighborhoods where fruit and vegetable stores are the most common venue

In [43]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values(by='Total Rating', ascending=False)


Unnamed: 0,Neighborhood,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
271,Sutton Place,8.318052,Fruit & Vegetable Store,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Bus Stop,Pizza Place,Market,Laundry Service,Drugstore
273,Turtle Bay,8.215455,Fruit & Vegetable Store,Athletics & Sports,Pharmacy,Drugstore,Metro Station,Italian Restaurant,Pizza Place,Market,Laundry Service,Bus Stop
110,Roosevelt Island,8.135415,Fruit & Vegetable Store,Athletics & Sports,Italian Restaurant,Pizza Place,Pharmacy,Bus Stop,Metro Station,Market,Laundry Service,Drugstore
109,Lenox Hill,7.908262,Fruit & Vegetable Store,Athletics & Sports,Pharmacy,Metro Station,Italian Restaurant,Bus Stop,Pizza Place,Market,Laundry Service,Drugstore


Now, let's look at the neighborhoods belonging to **cluster 3**

This cluster groups neighborhoods where Italian restaurants rank among the five most common venues

In [44]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values(by='Total Rating', ascending=False)


Unnamed: 0,Neighborhood,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
125,Morningside Heights,6.697133,Italian Restaurant,Pizza Place,Metro Station,Bus Stop,Athletics & Sports,Pharmacy,Laundry Service,Market,Fruit & Vegetable Store,Drugstore
111,Upper West Side,6.514006,Italian Restaurant,Athletics & Sports,Drugstore,Metro Station,Pharmacy,Bus Stop,Pizza Place,Market,Laundry Service,Fruit & Vegetable Store
119,Lower East Side,5.759341,Italian Restaurant,Pizza Place,Athletics & Sports,Metro Station,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
100,Chinatown,5.737023,Metro Station,Italian Restaurant,Athletics & Sports,Pizza Place,Pharmacy,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
124,Manhattan Valley,5.666613,Pizza Place,Metro Station,Pharmacy,Athletics & Sports,Italian Restaurant,Bus Stop,Laundry Service,Market,Fruit & Vegetable Store,Drugstore
107,Upper East Side,5.294476,Athletics & Sports,Pharmacy,Italian Restaurant,Metro Station,Pizza Place,Bus Stop,Market,Laundry Service,Fruit & Vegetable Store,Drugstore
118,East Village,5.201019,Athletics & Sports,Metro Station,Italian Restaurant,Pharmacy,Pizza Place,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
274,Tudor City,5.124625,Athletics & Sports,Drugstore,Pharmacy,Metro Station,Italian Restaurant,Pizza Place,Market,Laundry Service,Fruit & Vegetable Store,Bus Stop
247,Carnegie Hill,5.064196,Athletics & Sports,Pizza Place,Pharmacy,Metro Station,Italian Restaurant,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop
108,Yorkville,4.964585,Athletics & Sports,Pizza Place,Pharmacy,Italian Restaurant,Metro Station,Market,Laundry Service,Fruit & Vegetable Store,Drugstore,Bus Stop


Now, let's look at the neighborhoods belonging to **cluster 4**

Finally, this cluster groups neighborhoods where markets rank among the two most common venues

In [45]:
ny_merged.loc[ny_merged['Cluster Labels'] == 4, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values(by='Total Rating', ascending=False)

Unnamed: 0,Neighborhood,Total Rating,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
301,Hudson Yards,9.016613,Market,Bus Stop,Pizza Place,Pharmacy,Athletics & Sports,Metro Station,Italian Restaurant,Laundry Service,Fruit & Vegetable Store,Drugstore
113,Clinton,8.7498,Market,Metro Station,Athletics & Sports,Bus Stop,Pharmacy,Italian Restaurant,Pizza Place,Laundry Service,Fruit & Vegetable Store,Drugstore
275,Stuyvesant Town,7.030519,Pizza Place,Market,Athletics & Sports,Italian Restaurant,Pharmacy,Bus Stop,Laundry Service,Metro Station,Fruit & Vegetable Store,Drugstore
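
The one-line descriptions of the clusters can be sanity-checked by averaging the feature columns within each cluster and looking at which category dominates. The sketch below assumes a table with one frequency column per category plus a 'Cluster Labels' column; the `toy` frame and its numbers are hypothetical, not the notebook's data.

```python
import pandas as pd

# Hypothetical per-neighborhood category frequencies with cluster labels
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1],
    "Metro Station": [0.40, 0.36, 0.10, 0.12],
    "Bus Stop": [0.02, 0.04, 0.30, 0.34],
    "Market": [0.20, 0.18, 0.22, 0.20],
})

# Mean frequency of every category inside each cluster
profile = toy.groupby("Cluster Labels").mean()

# Category with the highest mean frequency per cluster
dominant = profile.idxmax(axis=1)
print(dominant)
```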


#### Ratings

In the previous section, we created 5 clusters and saw how each cluster was formed. Now we want to use a more objective criterion to create groups: the ratings we have calculated for each neighborhood, considering the user's expectations.

In [46]:
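# Given the rating of a neighborhood, this function returns the number of the group it belongs to.
# 'ranges' holds the ascending group boundaries; group k covers [ranges[k], ranges[k+1]).
# A rating outside the boundaries yields None.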
def findGroup(weight, ranges):
    ranInit = ranges[0]
    cont = 0
    for i in ranges:
        if ranInit != i:
            if (weight>=ranInit) and (weight<i):
                return cont
            else:
                cont += 1
                ranInit = i

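Let's see the minimum and maximum ratings. Then we will create 5 groups equally spaced between the minimum and maximum values. The boundaries are widened by 0.1 on each side so that the minimum and maximum ratings fall strictly inside the first and last groups.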

In [47]:
wMin = ny_merged['Total Rating'].min()
wMax = ny_merged['Total Rating'].max()
groups = np.linspace(wMin-0.1, wMax+0.1, 6)
print(wMin, ' - ', wMax, ' - ', groups)

4.632317682317683  -  10.09963036963037  -  [ 4.53231768  5.66578022  6.79924276  7.93270529  9.06616783 10.19963037]


We need to create a 'Group' column that contains the **number of the group (0-4)** of each neighborhood

In [48]:
ny_merged['Group']=ny_merged['Total Rating'].apply(lambda x : findGroup(x, groups))
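
The binning can be cross-checked with a sorted search over the boundaries. This is a sketch of an equivalent lookup, not the notebook's own function: `find_group_sorted` is a hypothetical name, and the boundaries are rebuilt from the minimum and maximum printed above.

```python
import numpy as np

# Boundaries as printed by the notebook: 6 edges delimit 5 equally spaced groups
w_min, w_max = 4.632317682317683, 10.09963036963037
edges = np.linspace(w_min - 0.1, w_max + 0.1, 6)

def find_group_sorted(rating, edges):
    """Return k such that edges[k] <= rating < edges[k+1], or None if out of range."""
    k = int(np.searchsorted(edges, rating, side="right")) - 1
    return k if 0 <= k < len(edges) - 1 else None

# Ratings taken from the tables in this notebook
samples = [("Yorkville", 4.964585), ("Chinatown", 5.737023),
           ("Hamilton Heights", 7.071548), ("Marble Hill", 8.319291),
           ("Washington Heights", 10.09963)]
for name, rating in samples:
    print(name, find_group_sorted(rating, edges))
```

For these five neighborhoods the lookup gives groups 0 to 4 in order, which agrees with the 'Group' values listed for them in the comparison table of the Results section.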

Let's see a map with the groups based on the ratings

#### Legend

<table align="left">
    <tr>
        <th><p align="center">Color</p></th>
        <th><p align="center">Group</p></th>
        <th><p align="left">Meaning</p></th>
    </tr>
    <tr>
        <td><p align="center">
             <img src="https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/Red1.jpg" style="width:20px;height:20px;"></p> 
        </td>
        <td><p align="center"> 0 </p></td>
        <td><p align="left">I don't think I'm gonna like this neighborhood</p></td>
    </tr>
    <tr>
        <td><p align="center">
             <img src="https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/Orange1.jpg" style="width:20px;height:20px;"></p> 
        </td>
        <td><p align="center"> 1 </p></td>
        <td><p align="left">I still have doubts about the neighborhood</p></td>
    </tr>
    <tr>
        <td><p align="center">
             <img src="https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/Yellow1.jpg" style="width:20px;height:20px;"></p>  
        </td>
        <td><p align="center"> 2 </p></td>
        <td><p align="left">Well, it might be worthy to give it a chance</p></td>
    </tr>
    <tr>
        <td><p align="center">
             <img src="https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/Blue1.jpg" style="width:20px;height:20px;"></p>  
        </td>
        <td><p align="center"> 3 </p></td>
        <td><p align="left">This might be a good address</p></td>
    </tr>
    <tr>
        <td><p align="center">
             <img src="https://raw.githubusercontent.com/icsouza68/Coursera_Capstone/master/Green1.jpg" style="width:20px;height:20px;"></p>  
        </td>
        <td><p align="center"> 4 </p></td>
        <td><p align="left">Oh, boy, I have just found paradise</p></td>
    </tr>
</table>

In [54]:
# create map
map_weight = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the groups (indexed below with grp-1, so group 0 takes the last color)
rainbow = ['#ff8000', '#ffff00', '#0000ff', '#04b404', '#df0101' ]

# add markers to the map
markers_colors = []
for lat, lon, poi, grp, tw in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Group'], ny_merged['Total Rating']):
    label = folium.Popup(str(poi) + ' | Group ' + str(grp) + ' | Rating : ' + "{:4.2f}".format(tw), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[grp-1],
        fill=True,
        fill_color=rainbow[grp-1],
        fill_opacity=0.7).add_to(map_weight)
       
map_weight
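
The color list in the cell above lines up with the legend only because `rainbow[grp-1]` evaluates to `rainbow[-1]`, the last entry, for group 0. An explicit mapping makes that correspondence visible. A small sketch, where `group_colors` is a hypothetical name and the hex codes are the ones used in the cell:

```python
# Color list exactly as defined in the map cell
rainbow = ['#ff8000', '#ffff00', '#0000ff', '#04b404', '#df0101']

# Explicit group -> color mapping, following the legend (red, orange, yellow, blue, green)
group_colors = {
    0: '#df0101',  # red
    1: '#ff8000',  # orange
    2: '#ffff00',  # yellow
    3: '#0000ff',  # blue
    4: '#04b404',  # green
}

# Both lookups give the same color for every group
for grp in range(5):
    print(grp, rainbow[grp - 1], group_colors[grp])
```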

## Results and Discussion <a name="results"></a>

Depending on the user's requirements, any neighborhood can be of interest. The tool we just developed is capable of suggesting neighborhoods that fit the user's expectations for a new home in New York City. However, some observations are due.

The tool is highly dependent on the quality of the data provided by Foursquare:

- If a venue is not registered on Foursquare, it will not show up in our search;
- Some venues may have been shut down by the time we run the application;
- The tool only assesses the quantitative aspect of the categories, not the qualitative one.


Each metropolitan area may have its own subdivision mechanism, so further programming is needed to adapt the tool to another area.

It would be very interesting to bring in other types of data such as crime statistics, demographics, cost per square meter for renting or buying, etc. These data would add more value to the tool.

### Differences between the two algorithms used

We have used two different algorithms:
- **KMeans**
- **Grouping based on the overall rating of the neighborhood**

The idea was to compare both mechanisms. The former relies only on which categories are most common in a neighborhood, without direct interference from user-given priorities. The latter computes an indicator for each location from the aspects the user considers most important, and groups neighborhoods according to their ratings.

The KMeans algorithm does not seem appropriate for the purpose of this project, as it ignores important premises given by the user. In the table below, for example, the neighborhoods of cluster 1 are spread across groups 2, 3 and 4.

In [72]:
ny_merged.loc[:,['Neighborhood', 'Cluster Labels', 'Group']]

Unnamed: 0,Neighborhood,Cluster Labels,Group
6,Marble Hill,1,3
100,Chinatown,3,1
101,Washington Heights,1,4
102,Inwood,1,4
103,Hamilton Heights,1,2
104,Manhattanville,1,2
105,Central Harlem,1,3
106,East Harlem,1,2
107,Upper East Side,3,0
108,Yorkville,3,0
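
A cross-tabulation gives a compact view of how the two groupings relate. The sketch below rebuilds only the ten rows shown above, so the counts describe that excerpt and not the full dataframe; `sample` is a placeholder name.

```python
import pandas as pd

# The ten rows displayed above
sample = pd.DataFrame({
    "Neighborhood": ["Marble Hill", "Chinatown", "Washington Heights", "Inwood",
                     "Hamilton Heights", "Manhattanville", "Central Harlem",
                     "East Harlem", "Upper East Side", "Yorkville"],
    "Cluster Labels": [1, 3, 1, 1, 1, 1, 1, 1, 3, 3],
    "Group": [3, 1, 4, 4, 2, 2, 3, 2, 0, 0],
})

# Rows: KMeans cluster, columns: rating group, cells: number of neighborhoods
table = pd.crosstab(sample["Cluster Labels"], sample["Group"])
print(table)
```

If the two methods agreed, each cluster row would concentrate in a single group column; in this excerpt cluster 1 spans three groups.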


## Conclusion <a name="conclusion"></a>

The purpose of this project was to develop a tool to help people moving to New York City find a neighborhood that best meets their specific needs in terms of venues and amenities.

With some additional programming it could be extended to other metropolitan areas.

It is flexible enough to let users filter results by a huge number of combinations of categories and boroughs, and it groups locations according to their ratings, so users can get an idea of which neighborhoods are closer to or farther from their needs.

Although the system recommends some locations, the user still needs to check each of the indicated places and consider other factors not covered by the tool, such as noise levels, presence of illegal activities, property prices, proximity to work, etc.