# IBM Coursera Data Science Professional Certificate
# The Battle of the Neighborhoods

# Comparing Neighborhoods in New York City and Toronto
## Marcelo Guimarães

# 1 - Introduction

Finding the right neighborhood to live in is always difficult when you have to move.
Throughout my life I have moved between neighborhoods in the same city, from one city to another within the same state, from one state to another, and even from one country to another, including countries on different continents. Every time I moved, the same question arose: where in this new city will I find the right place to live?

This problem becomes easier if we can compare neighborhoods in different cities and make a list of 
the best candidates, or at least of the neighborhoods that are similar to the one we like.

What if we could create a recommendation system for neighbourhoods? We will gather information about the current neighbourhood using the Foursquare API, then we will create a recommendation system based on our preferred venues and lastly we will create a list of possible candidates in New York City.

It is not a complete solution, but it is a start.

In this project we will consider a client that lives in Toronto, specifically in the neighbourhood called Little Portugal. The client will move to New York City and would like to know which neighbourhoods would be similar to the current one. 
****

# 2 - Data

In order to understand the distribution of venues in New York City and Toronto, and start to search for good areas to live, we will use data from Foursquare. 

We will use the Foursquare API to retrieve relevant data for New York City and Toronto and organize it into pandas DataFrames.

We will also use geolocation data for Toronto and New York City, available in previous modules of this Capstone Project.

## We begin by importing the libraries required in this project

In [1]:
import pandas as pd
import numpy as np
import json

#Geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

#Importing wikipedia to read the page
import wikipedia as wp

# these will be used to print the maps!
import os
import time
from selenium import webdriver

print('Libraries imported successfully!')

Libraries imported successfully!


## Importing and Preparing the New York Dataset

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighbourhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude'] 

# collect one record per neighbourhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0, and row-by-row appends are slow)
rows = []
for data in neighbourhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    neighbourhood_lon, neighbourhood_lat = data['geometry']['coordinates'][:2]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighbourhood': data['properties']['name'],
                 'Latitude': neighbourhood_lat,
                 'Longitude': neighbourhood_lon})

neighbourhoods = pd.DataFrame(rows, columns=column_names)
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(neighbourhoods['Borough'].unique()),
        neighbourhoods.shape[0]
    )
)
newyork = neighbourhoods.copy()
newyork.head()

Data downloaded!
The dataframe has 5 boroughs and 306 neighbourhoods.


Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


## Importing and Preparing the Toronto Dataset

In [3]:
html = wp.page("List_of_postal_codes_of_Canada:_M").html().encode("UTF-8")
df = pd.read_html(html)[0]

# keep only rows with an assigned borough; .copy() avoids pandas' SettingWithCopyWarning
table = df[df['Borough'] != 'Not assigned'].copy()

# combine neighbourhoods that share a postcode into one comma-separated string
table['Neighbourhood'] = table.groupby('Postcode')['Neighbourhood'].transform(lambda neigh: ', '.join(neigh))

table = table.drop_duplicates()

# where the neighbourhood is not assigned, use the borough name instead
table['Neighbourhood'] = table['Neighbourhood'].mask(table['Neighbourhood'] == 'Not assigned', table['Borough'])

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(table['Borough'].unique()),
        table.shape[0]
    )
)
table.head()

The dataframe has 10 boroughs and 103 neighborhoods.


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,"Lawrence Heights, Lawrence Manor"
7,M7A,Downtown Toronto,Queen's Park


## We still need the latitude and longitude for each neighborhood in Toronto.

In [4]:
geo_df = pd.read_csv("Geospatial_Coordinates.csv")
geo_df.columns = ["Postcode", "Latitude", "Longitude"]
toronto = table.join(geo_df.set_index('Postcode'),on='Postcode')
toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
5,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
7,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
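Because `join` silently leaves `NaN` wherever a postcode has no match, it can be worth checking the result before moving on. The sketch below shows one possible check using `merge` with its `validate` and `indicator` options. The small frames and the postcode `M9Z` are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy stand-ins for `table` and `geo_df`; 'M9Z' is a hypothetical postcode
# with no coordinates, included only to show what a failed match looks like.
table_demo = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
geo_demo = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate="one_to_one" raises an error if a postcode is duplicated on either side;
# indicator=True adds a `_merge` column saying where each row came from.
merged = table_demo.merge(geo_demo, on="Postcode", how="left",
                          validate="one_to_one", indicator=True)

# Rows marked 'left_only' had no coordinates to join.
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", missing)
```

If the list is empty, every postcode received a latitude and longitude.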


## Geolocation for New York City

In [5]:
address_NY = 'New York City, NY'

geolocator_NY = Nominatim(user_agent="ny_explorer", timeout=5)
location_NY = geolocator_NY.geocode(address_NY)
latitude_NY = location_NY.latitude
longitude_NY = location_NY.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude_NY, longitude_NY))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


## Geolocation for Toronto

In [6]:
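# Note: 'CN' is not Canada's country code ('CA' is); the coordinates returned below
# still fall in downtown Toronto, which is close enough for centring the map.
address_TO = 'Toronto, CN'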

geolocator_TO = Nominatim(user_agent="toronto_explorer", timeout=5)
location_TO = geolocator_TO.geocode(address_TO)
latitude_TO = location_TO.latitude
longitude_TO = location_TO.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_TO, longitude_TO))

The geographical coordinates of Toronto are 43.6425637, -79.38708718320467.
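Both geolocation cells assume that the geocoder finds the address. geopy's `geocode` returns `None` when there is no match, in which case reading `.latitude` fails with an `AttributeError`. A minimal sketch of a guard follows; the helper name `geocode_or_fail` is our own and is not part of geopy.

```python
def geocode_or_fail(geolocator, address):
    """Return (latitude, longitude) for `address`, raising if nothing is found.

    `geolocator` is any object with a geopy-style `geocode(address)` method,
    for example the Nominatim instances created above.
    """
    location = geolocator.geocode(address)
    if location is None:
        # No match: fail with a clear message rather than an AttributeError later.
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude

# Possible usage with the objects defined in this notebook:
# latitude_TO, longitude_TO = geocode_or_fail(geolocator_TO, address_TO)
```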


## Setting the Foursquare API

In [7]:
### Setting the API
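# Credentials are read from environment variables (the variable names are our own choice)
# so that secrets are not published together with the notebook.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret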
VERSION = '20200226' # Foursquare API version
LIMIT = 100
radius=500

## Defining a function to collect the data using the Foursquare API

In [8]:
#Defining a function to make the process automatic

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
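One caveat: the function assumes that every response contains `response -> groups` and that every venue has at least one category. An error response has no `groups` key, so the lookup raises a `KeyError`. The sketch below shows a more defensive extraction step under those assumptions. The helper names `extract_items` and `venue_row` are ours, and the shape of the error payload in the example is an assumption made for illustration.

```python
def extract_items(payload):
    """Return the venue items of an explore-style payload, or [] if there are none."""
    groups = payload.get("response", {}).get("groups") or []
    return groups[0].get("items", []) if groups else []

def venue_row(name, lat, lng, item):
    """Build one output row; the category is None when the venue has no category."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category)

# A well-formed payload (values taken from the Wakefield rows shown later in this notebook)
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Rite Aid",
               "location": {"lat": 40.896649, "lng": -73.844846},
               "categories": [{"name": "Pharmacy"}]}}]}]}}
# An assumed error payload: no 'groups' key inside 'response'
error_payload = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_payload)))     # → 1
print(len(extract_items(error_payload)))  # → 0
```

Inside the loop, a neighbourhood with no items could then be skipped or retried later instead of stopping the whole run.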

## Getting the data for Toronto

In [9]:
toronto_venues = getNearbyVenues(names=toronto['Neighbourhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

Parkwoods
Victoria Village
Harbourfront
Lawrence Heights, Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The Danforth West,

## Getting the data for New York City

In [18]:
newyork_venues = getNearbyVenues(names=newyork['Neighbourhood'],
                                   latitudes=newyork['Latitude'],
                                   longitudes=newyork['Longitude']
                                  )

Wakefield


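KeyError: 'groups'

### The request failed: the response contained no 'groups' key, which is what happens when the API returns an error (for example, when the request quota is exhausted) instead of venue data. We therefore load the New York venues from a previously saved CSV file.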

In [19]:
newyork_venues = pd.read_csv("newyork_300.csv")

In [20]:
print("There are {} venues in Toronto.".format(toronto_venues.shape[0]))
print("There are {} venues in New York.".format(newyork_venues.shape[0]))

There are 2222 venues in Toronto.
There are 10278 venues in New York.


### All the data was loaded and pre-processed into dataframes. We can proceed with the analysis.

In [21]:
newyork_venues.head()
#newyork_venues.head().to_excel("newyork_venues.xlsx")

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Wakefield,40.894705,-73.847201,Walgreens,40.896687,-73.84485,Pharmacy
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


****

# 3 - Methodology

## Visualizing the maps of New York and Toronto, together with their neighborhoods.

## MESSAGE TO GRADERS!
### If you cannot see the maps, it may be because you are viewing the Jupyter Notebook directly on GitHub.
### This is a known limitation: GitHub's notebook preview does NOT render Folium maps.
### To view the maps properly, open the notebook through nbviewer:
https://nbviewer.jupyter.org/

### Paste the notebook's GitHub URL into the main field and the maps will be rendered properly.

## Map of Toronto

In [22]:
map_toronto = folium.Map(location=[latitude_TO+0.04, longitude_TO], zoom_start=10.5)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [415]:
delay=10
 
#Save the map as an HTML file
fn='map_toronto.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
map_toronto.save(fn)
 
#Open a browser window...
browser = webdriver.Firefox()
#..that displays the map...
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
#Grab the screenshot
browser.save_screenshot('map_toronto.png')
#Close the browser
browser.quit()
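The same save-and-screenshot steps are repeated for every map in this notebook, so they could be wrapped in a helper. The sketch below is one possible version. The function name `save_map_screenshot` and its `browser_factory` parameter are our own additions. It builds the `file://` URL with `pathlib` and closes the browser even if the screenshot fails.

```python
import time
from pathlib import Path

def save_map_screenshot(folium_map, basename, delay=5, browser_factory=None):
    """Save a map as <basename>.html and a screenshot as <basename>.png.

    `folium_map` needs a `save(path)` method (as folium maps have).
    `browser_factory` is a callable returning a Selenium-style browser;
    it defaults to Firefox, as used elsewhere in this notebook.
    """
    html_path = Path(basename + ".html").resolve()
    png_path = html_path.with_suffix(".png")
    folium_map.save(str(html_path))

    if browser_factory is None:
        from selenium import webdriver  # imported lazily, only when a real browser is needed
        browser_factory = webdriver.Firefox

    browser = browser_factory()
    try:
        browser.get(html_path.as_uri())   # a properly formed file:// URL
        time.sleep(delay)                 # give the map tiles some time to load
        browser.save_screenshot(str(png_path))
    finally:
        browser.quit()                    # always close the browser
    return png_path

# Possible usage:
# save_map_screenshot(map_toronto, 'map_toronto', delay=10)
```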

## Map of New York

In [23]:
map_newyork = folium.Map(location=[latitude_NY, longitude_NY], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(newyork['Latitude'], newyork['Longitude'], newyork['Borough'], newyork['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [418]:
delay=5
 
#Save the map as an HTML file
fn='map_newyork.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
map_newyork.save(fn)
 
#Open a browser window...
browser = webdriver.Firefox()
#..that displays the map...
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
#Grab the screenshot
browser.save_screenshot('map_newyork.png')
#Close the browser
browser.quit()

### One important piece of information is the number of unique venue categories in our dataframes.

In [24]:
print('There are {} unique categories in Toronto.'.format(len(toronto_venues['Venue Category'].unique())))

There are 268 unique categories in Toronto.


In [25]:
print('There are {} unique categories in New York City.'.format(len(newyork_venues['Venue Category'].unique())))

There are 429 unique categories in New York City.


****

### Now it is time to start analyzing the data. 
### We will create a new dataframe, listing all the unique categories for each neighborhood.
### Our intention is to obtain a list of most frequent venues per neighborhood. 
### We will then use this information to characterize the neighborhoods.

### First let's do it for Toronto.

### We will use one-hot encoding.

In [26]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print("Shape of the dataframe:", toronto_onehot.shape)

# populate the dataframe toronto_grouped using group-by and mean
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Shape of the dataframe: (2222, 269)


Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
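To make the encoding step concrete, here is the same idea on a toy table with two made-up neighbourhoods, "A" and "B". Each venue becomes a row of 0s and 1s, and the mean per neighbourhood gives the share of venues in each category. The data are illustrative only.

```python
import pandas as pd

# Toy venue list: neighbourhood "A" has 4 venues, "B" has 2 (illustrative data)
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Coffee Shop", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]],
                            prefix="", prefix_sep="", dtype=float)
toy_onehot["Neighbourhood"] = toy_venues["Neighbourhood"]

# Mean of the indicator columns = relative frequency of each category
toy_grouped = toy_onehot.groupby("Neighbourhood").mean()

print(toy_grouped.loc["A", "Bar"])   # → 0.5  (2 of A's 4 venues are bars)
print(toy_grouped.loc["B", "Park"])  # → 1.0  (both of B's venues are parks)
```

Using `sum()` instead of `mean()` in the last step would give raw counts rather than frequencies.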


### Now let's sort the venues

In [28]:
#Function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # past 'st', 'nd', 'rd', every ordinal used here ends in 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
toronto_neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

toronto_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Restaurant,Bar,Café,Thai Restaurant,Sushi Restaurant,Seafood Restaurant,Bakery,Steakhouse,Pizza Place,Burger Joint,Asian Restaurant,Hotel,Clothing Store,Vegetarian / Vegan Restaurant
1,Agincourt,Latin American Restaurant,Skating Rink,Lounge,Breakfast Spot,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Dumpling Restaurant,Drugstore,College Rec Center,Eastern European Restaurant,Electronics Store
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Yoga Studio,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Deli / Bodega,Drugstore,Dumpling Restaurant,Eastern European Restaurant
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fried Chicken Joint,Beer Store,Sandwich Place,Pizza Place,Pharmacy,Fast Food Restaurant,Donut Shop,Drugstore,Deli / Bodega,Doner Restaurant,Dog Run,Dumpling Restaurant,Distribution Center,Discount Store
4,"Alderwood, Long Branch",Pizza Place,Pub,Sandwich Place,Skating Rink,Pool,Pharmacy,Coffee Shop,Gym,Doner Restaurant,Dog Run,Distribution Center,Donut Shop,Curling Ice,Drugstore,Discount Store
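Note that a neighbourhood with only a few venues still gets 15 entries in the table above, because categories with zero frequency are used to fill the remaining columns. A variant that keeps only categories actually present is sketched below. The function name `top_present_venues` and the frequencies in the example are made up for illustration.

```python
import pandas as pd

def top_present_venues(row, num_top_venues):
    """Return up to `num_top_venues` category names with non-zero frequency,
    most frequent first. `row` holds one frequency per category."""
    present = row[row > 0]
    return present.sort_values(ascending=False).head(num_top_venues).index.tolist()

# Illustrative frequencies for a neighbourhood with only three kinds of venue
demo_row = pd.Series({
    "Park": 0.5,
    "Playground": 0.3,
    "Yoga Studio": 0.2,
    "Diner": 0.0,
    "Dog Run": 0.0,
})

print(top_present_venues(demo_row, 15))
# → ['Park', 'Playground', 'Yoga Studio']
```

With a function like this, the remaining columns would have to be filled with a placeholder such as `None`, since the list can be shorter than the number of columns.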


In [442]:
toronto_neighborhoods_venues_sorted.head().to_excel("toronto_venues_sorted.xlsx")

In [443]:
toronto_neighborhoods_venues_sorted.head().to_html("toronto_venues_sorted.html")

### Now let's do the same process for New York City

In [29]:
# one hot encoding
newyork_onehot = pd.get_dummies(newyork_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighbourhood column back to dataframe
newyork_onehot['Neighbourhood'] = newyork_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns2 = [newyork_onehot.columns[-1]] + list(newyork_onehot.columns[:-1])
newyork_onehot = newyork_onehot[fixed_columns2]
print("Shape of the dataframe:", newyork_onehot.shape)

Shape of the dataframe: (10278, 430)


In [30]:
# here we group with sum instead of mean: the recommendation score computed later needs raw venue counts per neighbourhood
newyork_grouped = newyork_onehot.groupby('Neighbourhood').sum().reset_index()
newyork_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Allerton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Annadale,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Arden Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Arlington,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Arrochar,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Time to sort the most common venues

In [31]:
#Function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns2 = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns2.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # past 'st', 'nd', 'rd', every ordinal used here ends in 'th'
        columns2.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
newyork_neighborhoods_venues_sorted = pd.DataFrame(columns=columns2)
newyork_neighborhoods_venues_sorted['Neighbourhood'] = newyork_grouped['Neighbourhood']

for ind in np.arange(newyork_grouped.shape[0]):
    newyork_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(newyork_grouped.iloc[ind, :], num_top_venues)

newyork_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Allerton,Pizza Place,Bakery,Deli / Bodega,Cosmetics Shop,Supermarket,Mexican Restaurant,Fast Food Restaurant,Bus Station,Martial Arts Dojo,Electronics Store,Pharmacy,Gas Station,Breakfast Spot,Grocery Store,Donut Shop
1,Annadale,Pizza Place,Dance Studio,Liquor Store,American Restaurant,Sports Bar,Train Station,Restaurant,Bakery,Diner,Farm,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Filipino Restaurant,Factory
2,Arden Heights,Lawyer,Pharmacy,Coffee Shop,Bus Stop,Pizza Place,Yoga Studio,Event Space,Exhibit,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
3,Arlington,Bus Stop,Deli / Bodega,Coffee Shop,Boat or Ferry,Grocery Store,Yoga Studio,Fish & Chips Shop,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio
4,Arrochar,Italian Restaurant,Pizza Place,Deli / Bodega,Bus Stop,Food Truck,Supermarket,Taco Place,Outdoors & Recreation,Middle Eastern Restaurant,Mediterranean Restaurant,Bagel Shop,Sandwich Place,Athletics & Sports,Nail Salon,Hotel


****

## Recommendation system.

### Now we will try to come up with a recommendation system. It will use information from the "client" together with information from the neighbourhoods to create a list of best candidates.

### In order to do that we need to explore and get to know our current neighbourhood in Toronto: Little Portugal.

In [34]:
toronto_neighborhoods_venues_sorted[toronto_neighborhoods_venues_sorted['Neighbourhood'] == 'Little Portugal, Trinity']

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
65,"Little Portugal, Trinity",Bar,Coffee Shop,Asian Restaurant,Restaurant,Pizza Place,Bakery,Men's Store,Café,Vietnamese Restaurant,Wine Bar,Record Shop,Park,Brewery,New American Restaurant,Music Venue


In [42]:
toronto_neighborhoods_venues_sorted[toronto_neighborhoods_venues_sorted['Neighbourhood'] == 'Little Portugal, Trinity'].transpose()

Unnamed: 0,65
Neighbourhood,"Little Portugal, Trinity"
1st Most Common Venue,Bar
2nd Most Common Venue,Coffee Shop
3rd Most Common Venue,Asian Restaurant
4th Most Common Venue,Restaurant
5th Most Common Venue,Pizza Place
6th Most Common Venue,Bakery
7th Most Common Venue,Men's Store
8th Most Common Venue,Café
9th Most Common Venue,Vietnamese Restaurant


In [33]:
selected_TO_venue = toronto_venues[toronto_venues['Neighbourhood'] == 'Little Portugal, Trinity']

# centre the map on the neighbourhood coordinates (identical on every row, so take the first)
map_littleportugal = folium.Map(location=[selected_TO_venue['Neighbourhood Latitude'].iloc[0], selected_TO_venue['Neighbourhood Longitude'].iloc[0]], zoom_start=15.5)

# add markers to map
for lat, lng, name, category in zip(selected_TO_venue['Venue Latitude'], selected_TO_venue['Venue Longitude'], selected_TO_venue['Venue'], selected_TO_venue['Venue Category']):
    label = '{}, {}'.format(name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_littleportugal)  
    
map_littleportugal

In [419]:
delay=5
 
#Save the map as an HTML file
fn='map_littleportugal.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
map_littleportugal.save(fn)
 
#Open a browser window...
browser = webdriver.Firefox()
#..that displays the map...
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
#Grab the screenshot
browser.save_screenshot('map_littleportugal.png')
#Close the browser
browser.quit()

### We have a list of the 15 most common venues in the neighbourhood, and we can see on the map that there is a big park in the area. 

### The first step is to score the venue categories: we create a vector with the client's rating (0 to 10) for each of 15 categories. These are 14 of the most common venues listed above, with Cuban Restaurant taking the place of Music Venue.


In [43]:
userInput = [
            {'venue':'Bar', 'rating':9.0},
            {'venue':'Coffee Shop', 'rating':9.5},
            {'venue':'Asian Restaurant', 'rating':9.5},
            {'venue':'Restaurant', 'rating':9.0},
            {'venue':"Pizza Place", 'rating':7.0},
            {'venue':'Bakery', 'rating':10.0},
            {'venue':"Men's Store", 'rating':4.5},
            {'venue':'Vietnamese Restaurant', 'rating':8.5},
            {'venue':'Wine Bar', 'rating':5.0},
            {'venue':'Café', 'rating':10.0},
            {'venue':'Record Shop', 'rating':7.5},
            {'venue':'Cuban Restaurant', 'rating':6.5},
            {'venue':'Park', 'rating':10.0},
            {'venue':'Brewery', 'rating':6.5},
            {'venue':'New American Restaurant', 'rating':4.5}
         ] 
inputVenues = pd.DataFrame(userInput).sort_values(by=['venue']).reset_index(drop=True).set_index('venue')
inputVenues2 = pd.DataFrame(userInput).sort_values(by=['venue']).reset_index(drop=True)
inputVenues2

Unnamed: 0,venue,rating
0,Asian Restaurant,9.5
1,Bakery,10.0
2,Bar,9.0
3,Brewery,6.5
4,Café,10.0
5,Coffee Shop,9.5
6,Cuban Restaurant,6.5
7,Men's Store,4.5
8,New American Restaurant,4.5
9,Park,10.0


### Now we will start to prepare the New York data.
### We need to create a dataframe containing only the venues listed as most common in Little Portugal.

In [44]:
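# drop the neighbourhood names, leaving one column per venue category
step1 = newyork_grouped.drop(columns='Neighbourhood')
# transpose so that each row is a venue category
step2 = step1.transpose().reset_index().rename(columns={'index':'venue'})
# keep only the categories rated by the client
subset1 = step2['venue'].isin(inputVenues2['venue'].tolist())
# transpose back: one row per neighbourhood, one column per rated category
subset2 = step2[subset1].reset_index(drop=True).set_index('venue').transpose().reset_index(drop=True)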
subset2

venue,Asian Restaurant,Bakery,Bar,Brewery,Café,Coffee Shop,Cuban Restaurant,Men's Store,New American Restaurant,Park,Pizza Place,Record Shop,Restaurant,Vietnamese Restaurant,Wine Bar
0,0,3,0,0,0,0,0,0,0,0,4,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,3,0,1,0,0
2,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0,0,0,0,0,0,0,0,0,2,1,0,1,0,0
297,0,1,1,0,0,0,0,0,0,1,4,0,0,0,0
298,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0
299,0,4,3,0,2,1,0,0,0,0,3,0,1,0,0
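The same selection can also be written with `reindex`, sketched here on a toy frame. Unlike the `isin` filter, this keeps a zero-filled column for any rated category that never occurs in New York, and it keeps the neighbourhood names as the index. The frame below uses a few values from the table above plus a made-up `Zoo` column.

```python
import pandas as pd

# Toy version of `newyork_grouped` (counts per category; 'Zoo' is illustrative)
grouped_demo = pd.DataFrame({
    "Neighbourhood": ["Allerton", "Annadale"],
    "Bakery": [3, 1],
    "Pizza Place": [4, 3],
    "Zoo": [0, 0],
})

# Categories rated by the client; 'Cuban Restaurant' is absent from the toy frame
rated = ["Bakery", "Cuban Restaurant", "Pizza Place"]

# Select the rated categories in one step; missing ones become columns of zeros
subset_demo = (grouped_demo
               .set_index("Neighbourhood")
               .reindex(columns=rated, fill_value=0))

print(subset_demo.loc["Allerton", "Pizza Place"])       # → 4
print(subset_demo.loc["Allerton", "Cuban Restaurant"])  # → 0
```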


In [45]:
inputVenues

Unnamed: 0_level_0,rating
venue,Unnamed: 1_level_1
Asian Restaurant,9.5
Bakery,10.0
Bar,9.0
Brewery,6.5
Café,10.0
Coffee Shop,9.5
Cuban Restaurant,6.5
Men's Store,4.5
New American Restaurant,4.5
Park,10.0


### Now we will multiply the vector containing the client rating by the venues in each neighbourhood in New York. 

### We will do it using the dot product, so that in the end we have a total score for each neighbourhood.

In [46]:
# Checking if the dimensions of the matrices to be multiplied are correct
print('Number of columns in the first matrix:',subset2.shape[1])
print('Number of rows in the second matrix:',inputVenues.shape[0])

Number of columns in the first matrix: 15
Number of rows in the second matrix: 15


In [47]:
userProfile = subset2.dot(inputVenues['rating'])
result = pd.DataFrame(data=userProfile,columns=['Score'])
print('The shape of the dataframe is:', result.shape)
result.head(10)

The shape of the dataframe is: (301, 1)


Unnamed: 0,Score
0,58.0
1,40.0
2,16.5
3,9.5
4,14.0
5,16.5
6,143.0
7,17.0
8,0.0
9,62.0


### Now that we have the final score for each neighbourhood it's time to put things back together and add the score as a column in the New York neighbourhood dataframe.

In [48]:
# merging the result to the subset used previously.
merge1= pd.merge(subset2,result,left_index=True,right_index=True)
# adding the name of the neighbourhoods
newyork_grouped_score = pd.merge(newyork_grouped['Neighbourhood'],merge1,left_index=True,right_index=True)
newyork_grouped_score

Unnamed: 0,Neighbourhood,Asian Restaurant,Bakery,Bar,Brewery,Café,Coffee Shop,Cuban Restaurant,Men's Store,New American Restaurant,Park,Pizza Place,Record Shop,Restaurant,Vietnamese Restaurant,Wine Bar,Score
0,Allerton,0,3,0,0,0,0,0,0,0,0,4,0,0,0,0,58.0
1,Annadale,0,1,0,0,0,0,0,0,0,0,3,0,1,0,0,40.0
2,Arden Heights,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,16.5
3,Arlington,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,9.5
4,Arrochar,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,14.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,Woodhaven,0,0,0,0,0,0,0,0,0,2,1,0,1,0,0,36.0
297,Woodlawn,0,1,1,0,0,0,0,0,0,1,4,0,0,0,0,57.0
298,Woodrow,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,26.5
299,Woodside,0,4,3,0,2,1,0,0,0,0,3,0,1,0,0,126.5
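The scores in the table above can be checked by hand: each score is the sum, over the rated categories, of the venue count multiplied by the client's rating. The short sketch below reproduces two rows in plain Python. The function name `neighbourhood_score` is ours, and the counts and ratings are copied from the tables above.

```python
# Client ratings, as defined in `userInput`
ratings = {
    "Bar": 9.0, "Coffee Shop": 9.5, "Asian Restaurant": 9.5, "Restaurant": 9.0,
    "Pizza Place": 7.0, "Bakery": 10.0, "Men's Store": 4.5,
    "Vietnamese Restaurant": 8.5, "Wine Bar": 5.0, "Café": 10.0,
    "Record Shop": 7.5, "Cuban Restaurant": 6.5, "Park": 10.0,
    "Brewery": 6.5, "New American Restaurant": 4.5,
}

def neighbourhood_score(counts, ratings):
    """Dot product of venue counts and ratings (categories with count 0 are omitted)."""
    return sum(count * ratings[category] for category, count in counts.items())

# Non-zero counts copied from the table above
allerton = {"Bakery": 3, "Pizza Place": 4}
woodside = {"Bakery": 4, "Bar": 3, "Café": 2, "Coffee Shop": 1,
            "Pizza Place": 3, "Restaurant": 1}

print(neighbourhood_score(allerton, ratings))  # → 58.0  (3 x 10.0 + 4 x 7.0)
print(neighbourhood_score(woodside, ratings))  # → 126.5
```

Both values match the `Score` column. Because the score is a sum of raw counts, neighbourhoods with many venues tend to score higher.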


# 4 - Results

### Now we can sort the dataframe using the score! 
### Let's see which 10 neighbourhoods have the highest scores.

In [49]:
newyork_grouped_score.sort_values(by=['Score'],ascending=False).head(10)

Unnamed: 0,Neighbourhood,Asian Restaurant,Bakery,Bar,Brewery,Café,Coffee Shop,Cuban Restaurant,Men's Store,New American Restaurant,Park,Pizza Place,Record Shop,Restaurant,Vietnamese Restaurant,Wine Bar,Score
120,Greenpoint,0,2,9,0,3,6,0,0,2,0,7,3,2,1,1,300.0
254,South Side,0,2,7,0,1,5,0,0,1,1,6,0,2,0,3,230.0
217,Prospect Heights,0,3,9,1,3,2,0,0,2,0,2,0,2,1,2,226.0
300,Yorkville,1,2,6,0,1,6,0,0,1,2,4,0,0,2,1,225.0
44,Carroll Gardens,0,4,3,1,1,6,1,0,0,1,5,1,1,0,2,218.5
43,Carnegie Hill,0,3,2,0,4,7,0,0,1,0,4,0,1,2,1,218.0
278,Upper West Side,1,3,4,0,3,4,0,0,0,0,2,0,2,1,4,204.0
273,Tudor City,2,0,1,0,5,3,0,0,0,5,3,0,2,1,0,204.0
186,Murray Hill,1,2,4,0,1,6,2,0,1,0,3,0,2,1,1,202.5
39,Bushwick,0,3,7,0,2,5,1,0,1,0,3,0,1,0,0,201.5


### Looking at the venues in these neighbourhoods, we can see that only 4 have parks. Remember that there is a big park in Little Portugal and that our client wants a neighbourhood very similar to it.

### Let's restrict our search by requiring at least one park.

In [50]:
selected_neigh = ['South Side', 'Yorkville', 'Carroll Gardens', 'Financial District']
newyork_grouped_score[newyork_grouped_score['Park'] != 0].sort_values(by=['Score'],ascending=False).head(10)

Unnamed: 0,Neighbourhood,Asian Restaurant,Bakery,Bar,Brewery,Café,Coffee Shop,Cuban Restaurant,Men's Store,New American Restaurant,Park,Pizza Place,Record Shop,Restaurant,Vietnamese Restaurant,Wine Bar,Score
254,South Side,0,2,7,0,1,5,0,0,1,1,6,0,2,0,3,230.0
300,Yorkville,1,2,6,0,1,6,0,0,1,2,4,0,0,2,1,225.0
44,Carroll Gardens,0,4,3,1,1,6,1,0,0,1,5,1,1,0,2,218.5
273,Tudor City,2,0,1,0,5,3,0,0,0,5,3,0,2,1,0,204.0
95,Financial District,0,0,4,0,1,9,1,0,1,2,4,0,1,0,0,199.5
81,East Village,0,2,7,0,0,3,0,0,0,1,4,1,0,2,4,194.0
272,Tribeca,0,2,1,0,4,3,0,3,1,5,0,0,0,1,3,189.0
74,Dumbo,1,3,1,0,2,5,0,1,1,4,2,0,1,0,0,188.0
82,East Williamsburg,0,4,6,1,2,4,0,0,0,1,0,1,1,0,0,185.0
274,Turtle Bay,2,1,1,0,3,5,0,0,0,3,1,0,1,0,4,181.5


### Let's look at these neighbourhoods on the map.

In [51]:
selected_group = newyork_grouped_score[newyork_grouped_score['Park'] != 0].sort_values(by=['Score'],ascending=False).head(10)

selec = newyork['Neighbourhood'].isin(selected_group['Neighbourhood'].tolist())
newyork_selected = newyork[selec]

map_selected = folium.Map(location=[latitude_NY, longitude_NY], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(newyork_selected['Latitude'], newyork_selected['Longitude'], newyork_selected['Borough'], newyork_selected['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_selected)  
    
map_selected


In [52]:
delay=5
 
#Save the map as an HTML file
fn='map_selected.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
map_selected.save(fn)
 
#Open a browser window...
browser = webdriver.Firefox()
#..that displays the map...
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
#Grab the screenshot
browser.save_screenshot('map_selected.png')
#Close the browser
browser.quit()

****

# 5 - Discussion

The methodology applied here is very simple compared with what is really needed to choose a new neighborhood in a different city.
However, it is a start. We would need more information, such as rental and sale prices, public transportation, schools, etc.
Unfortunately, Foursquare does not provide that information.

This project can be improved over time by allowing more constraints to be used when selecting similar neighborhoods to live in.

# 6 - Conclusion

In conclusion, the Foursquare API is a powerful tool for solving problems that involve comparing venues across different locations. 

A simple recommendation system worked fine, but there is room for improvement.

Combining it with an API that retrieves real estate data on sale and rental prices would be very interesting.

In the present case, visualizing the data with Folium also helps a lot when deciding among candidate neighborhoods.