# Introduction

The aim of this project is to build a tool that finds neighbourhoods with specific characteristics. When people move to a different city, there are usually some essential venues and services they want in their new neighbourhood. For example, some people may want a train station or schools near their new home, or they may simply want a neighbourhood similar to their current one. In this project we use the Foursquare API to explore the city of Toronto and find the most suitable neighbourhood for someone moving from a specific place.

# Data

The data used to investigate Toronto's neighbourhoods is extracted from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The data was extracted from the web page using the BeautifulSoup library.

In [1]:
pip install bs4


In [2]:
# BeautifulSoup, for web scraping
from bs4 import BeautifulSoup
# library to handle requests
import requests
import pandas as pd
import numpy as np

wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_page = requests.get(wikipedia_link)

# Parse the HTML of the page
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# Extract the "tbody" of the table whose class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody
# Extract all "tr" (table rows) within the table above
rows = table.find_all('tr')
# Extract the column headers from the "th" tags, removing any '\n'
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
# Create an empty DataFrame with those column headers
df = pd.DataFrame(columns = columns)


In [3]:
# Extract every data row and append its values to the DataFrame "df".
# The first row (rows[0]) is skipped because it is the header.

for i in range(1, len(rows)):
    tds = rows[i].find_all('td')
    # Skip rows whose number of cells does not match the header
    if len(tds) != len(columns):
        continue
    # Remove newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    # Add the row at the end (DataFrame.append was removed in pandas 2.0)
    df.loc[len(df)] = values
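Cells scraped from the Wikipedia table usually end with a newline and may contain non-breaking spaces (`\xa0`). As a minimal sketch, the clean-up can be isolated in a small helper; the name `clean_cell` is hypothetical and not part of the original notebook.

```python
def clean_cell(text: str) -> str:
    """Remove newlines and non-breaking spaces, then trim outer whitespace."""
    # Chain the calls: each replace() works on the result of the previous one.
    return text.replace('\n', '').replace('\xa0', '').strip()


print(clean_cell('North York\n'))   # North York
print(clean_cell('M5A\xa0\n'))      # M5A
```

Keeping the clean-up in one place makes it easy to apply the same rule to every cell of every row.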

The final data frame is shown below:

In [4]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Let's drop the rows whose borough is not assigned so that the data can be used.

In [5]:
toronto_df=df.drop(df[df.Borough == 'Not assigned'].index)
toronto_df = toronto_df.reset_index(drop=True)
toronto_df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## Longitude and latitude for every Neighbourhood

To query the Foursquare API we first need the latitude and longitude of every neighbourhood.

To obtain them I will use the Geocoder library.

In [6]:
!pip -q install geopy
from geopy.geocoders import Nominatim

# install the Geocoder
!pip -q install geocoder
import geocoder

# import time
import time

In [7]:
# Geocoding starts here
# Define a helper function, get_latlng(), that takes a postal code
def get_latlng(arcgis_geocoder):
    
    # Initialize the location (latitude and longitude) to None
    lat_lng_coords = None
    
    # Keep querying until the geocoder returns coordinates for this postal code
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(arcgis_geocoder))
        lat_lng_coords = g.latlng
    return lat_lng_coords
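The `while` loop above never ends if the geocoder keeps returning `None` for a postal code. A hedged alternative is to cap the number of attempts. The sketch below assumes a hypothetical helper `retry_until_result` and uses a stand-in lookup function so that it runs without network access.

```python
import time


def retry_until_result(lookup, query, max_attempts=5, pause_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(pause_seconds)  # wait before trying again
    return None  # give up instead of looping forever


# Stand-in for a geocoder call: fails twice, then returns coordinates.
calls = {'count': 0}


def fake_lookup(query):
    calls['count'] += 1
    return [43.75188, -79.33036] if calls['count'] >= 3 else None


coords = retry_until_result(fake_lookup, 'M3A, Toronto, Ontario')
print(coords)
```

With the notebook's geocoder, `lookup` could be `lambda q: geocoder.arcgis(q).latlng`; postal codes that still return `None` can then be reported rather than blocking the run.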

In [8]:
start = time.time()
postal_codes = toronto_df['Postal Code']    
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]
end = time.time()
print("Time of execution: ", end - start, "seconds")

Time of execution:  60.36322498321533 seconds


In [9]:
# Work on a copy so that adding the coordinate columns does not modify toronto_df
toronto_df_loc = toronto_df.copy()

In [10]:
# The obtained coordinates (latitude and longitude) are joined with the dataframe as shown
toronto_df_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
toronto_df_loc['Latitude'] = toronto_df_coordinates['Latitude']
toronto_df_loc['Longitude'] = toronto_df_coordinates['Longitude']
toronto_df_loc.head(5)

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75188 | -79.33036 |
| 1 | M4A | North York | Victoria Village | 43.73042 | -79.31282 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65514 | -79.36265 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72321 | -79.45141 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66449 | -79.39302 |


## Foursquare API

In [11]:
CLIENT_ID = 'YSJNVPY5QDYMNVGV4CLQSYSODUYDTQ54SGTFR5ZWUUTAWF0B' # your Foursquare ID
CLIENT_SECRET = 'RRKMWVYNK3RZOT0YYPF3DANVOQ2Z1RT14PGUZRBB0BHXN4EH' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YSJNVPY5QDYMNVGV4CLQSYSODUYDTQ54SGTFR5ZWUUTAWF0B
CLIENT_SECRET:RRKMWVYNK3RZOT0YYPF3DANVOQ2Z1RT14PGUZRBB0BHXN4EH


#### Function to obtain venues for every neighbourhood

In [12]:
def getNearbyVenues(names, latitudes, longitudes, LIMIT=100, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
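The function above reads `categories[0]` for every venue, which raises an `IndexError` if a venue comes back without any category. As a minimal sketch, the parsing step can be separated and made tolerant of that case. The helper name `extract_venue_rows` is hypothetical, and the sample items are made up to mimic only the fields the notebook reads; they are not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up sample shaped like the fields read above (not real API output).
sample_items = [
    {'venue': {'name': 'Example Park', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]

venue_rows = extract_venue_rows('Parkwoods', 43.75188, -79.33036, sample_items)
for row in venue_rows:
    print(row)
```

Rows with a `None` category can later be dropped or grouped under a label of the user's choice before counting.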

Get the venues

In [13]:
toronto_venues = getNearbyVenues(names=toronto_df_loc['Neighbourhood'],
                                   latitudes=toronto_df_loc['Latitude'],
                                   longitudes=toronto_df_loc['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Now we have the information needed to investigate the neighbourhoods of Toronto, look for specific services and compare them with neighbourhoods in other cities.

In [14]:
toronto_venues.head(5)

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.75188 | -79.33036 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.75188 | -79.33036 | PetSmart | 43.748639 | -79.333488 | Pet Store |
| 2 | Parkwoods | 43.75188 | -79.33036 | Brookbanks Pool | 43.751389 | -79.332184 | Pool |
| 3 | Parkwoods | 43.75188 | -79.33036 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 4 | Parkwoods | 43.75188 | -79.33036 | The Bing Suites | 43.747816 | -79.33219 | Bed & Breakfast |


To find the venues in each neighbourhood we reshape the information, using only the neighbourhood and venue category columns. The aim is to know which venue categories are present in each neighbourhood and how many venues of each category there are.

In [15]:
#Group by Venue and Neighbourhood and count 
df_venues=toronto_venues.groupby(['Neighborhood','Venue Category']).size()

In [16]:
#Reshape the information in order to get a simple table
df_venues=df_venues.unstack()

In [17]:
df_venues.head()

| Neighborhood | Accessories Store | Airport | American Restaurant | Antique Shop | Aquarium | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Athletics & Sports | ... | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agincourt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alderwood, Long Branch | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Bathurst Manor, Wilson Heights, Downsview North | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Bayview Village | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Bedford Park, Lawrence Manor East | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
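To make the reshaping step concrete, here is a plain-Python sketch of the same idea: count the (neighbourhood, category) pairs, then spread the counts into one row per neighbourhood. It uses the five Parkwoods venues listed earlier plus one made-up pair for a second neighbourhood; `count_table` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# (neighbourhood, venue category) pairs; the last pair is made up for illustration.
pairs = [
    ('Parkwoods', 'Park'),
    ('Parkwoods', 'Pet Store'),
    ('Parkwoods', 'Pool'),
    ('Parkwoods', 'Food & Drink Shop'),
    ('Parkwoods', 'Bed & Breakfast'),
    ('Example Village', 'Park'),
]


def count_table(pairs):
    """Return {neighbourhood: {category: count}}, with 0 for absent categories."""
    counts = Counter(pairs)                      # same role as groupby(...).size()
    categories = sorted({cat for _, cat in pairs})
    neighbourhoods = sorted({ngh for ngh, _ in pairs})
    # Same role as unstack(), except that missing cells become 0 rather than NaN.
    return {ngh: {cat: counts[(ngh, cat)] for cat in categories}
            for ngh in neighbourhoods}


table = count_table(pairs)
print(table['Parkwoods']['Park'])        # 1
print(table['Example Village']['Pool'])  # 0
```

The pandas version leaves absent combinations as `NaN`, which is why a later cell fills them with zero before any arithmetic.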


# Methodology

The methodology presented here finds a neighbourhood that offers a given set of services. Suppose we live in a particular neighbourhood in New York and are moving to Toronto: the method identifies the Toronto neighbourhoods most similar to our current one.

As an example, let's assume that we live in the Riverdale neighbourhood of the Bronx, New York, and that we want to find a similar neighbourhood in Toronto.
First of all, we need to get the complete information about our current neighbourhood from the Foursquare API.

In [18]:
latitude=40.890834
longitude=-73.912585
radius=500
LIMIT=100

url='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)

results = requests.get(url).json()['response']['groups'][0]['items']


In [19]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [20]:
from pandas import json_normalize

dataframe = json_normalize(results) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# filter the category for each row
dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean columns
dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

dataframe_filtered.head(10)

| | name | categories |
|---|---|---|
| 0 | Riverdale Ave | Plaza |
| 1 | Bell Tower Park | Park |
| 2 | Chase Bank | Bank |
| 3 | Seton Park | Park |
| 4 | JHS Riverdale Baseball Field | Baseball Field |
| 5 | Park Lunch | Food Truck |
| 6 | MTA BX7 238th Stop | Bus Station |
| 7 | Hayden On Hudson Gym | Gym |
| 8 | MTA MaBSTOA Bus Bx7 / Bx10 / Bx20 / BxM1 / BxM... | Bus Station |


These are the services in Riverdale, counted by category:

In [21]:

target_services=dataframe_filtered.groupby(['categories']).count()
target_services

| categories | name |
|---|---|
| Bank | 1 |
| Baseball Field | 1 |
| Bus Station | 2 |
| Food Truck | 1 |
| Gym | 1 |
| Park | 2 |
| Plaza | 1 |
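The same tally can be reproduced without pandas, which gives a quick check of the counts above. The sketch below simply counts the categories of the nine Riverdale venues listed earlier.

```python
from collections import Counter

# Categories of the nine Riverdale venues returned above.
riverdale_categories = [
    'Plaza', 'Park', 'Bank', 'Park', 'Baseball Field',
    'Food Truck', 'Bus Station', 'Gym', 'Bus Station',
]

target_counts = Counter(riverdale_categories)

# Print in alphabetical order, as groupby does.
for category in sorted(target_counts):
    print(category, target_counts[category])
```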


Now we only need to check which neighbourhoods in Toronto have these services.

We create a subset of the Toronto data that keeps only the service categories found in Riverdale.

In [95]:
# Unique Riverdale categories, sorted alphabetically
categories=dataframe_filtered['categories'].drop_duplicates()
categories=categories.sort_values().reset_index(drop=True)
categories

0              Bank
1    Baseball Field
2       Bus Station
3        Food Truck
4               Gym
5              Park
6             Plaza
Name: categories, dtype: object

We select only these venue categories from the full Toronto data:

In [72]:
small_venues= df_venues[df_venues.columns.intersection(categories)]
small_venues

| Neighborhood | Bank | Baseball Field | Bus Station | Food Truck | Gym | Park | Plaza |
|---|---|---|---|---|---|---|---|
| Agincourt | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Alderwood, Long Branch | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
| Bathurst Manor, Wilson Heights, Downsview North | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Bayview Village | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Bedford Park, Lawrence Manor East | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Willowdale, Willowdale West | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Woburn | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| Woodbine Heights | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| York Mills West | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |


We replace the missing values with zero, so that every neighbourhood has a count for each target category, and list the neighbourhood names in the same alphabetical order as the rows of the table.

In [99]:
small_venues=small_venues.fillna(0)

In [122]:
neighborhood=toronto_venues['Neighborhood'].drop_duplicates()
neighborhood=neighborhood.sort_values().reset_index(drop=True)
neighborhood



0                                           Agincourt
1                              Alderwood, Long Branch
2     Bathurst Manor, Wilson Heights, Downsview North
3                                     Bayview Village
4                   Bedford Park, Lawrence Manor East
                           ...                       
92                        Willowdale, Willowdale West
93                                             Woburn
94                                   Woodbine Heights
95                                    York Mills West
96                           York Mills, Silver Hills
Name: Neighborhood, Length: 97, dtype: object

We compute the distance between the venue counts of our New York neighbourhood and those of every neighbourhood in Toronto. To do this we use the Euclidean distance.

In [26]:
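# Convert both tables to NumPy arrays. Both are expected to list the
# categories in the same alphabetical order.
venues_np=small_venues.to_numpy()
old_np=target_services.to_numpy()
# Flatten the single-column target counts into a 1-D vector
old_np=np.transpose(old_np)[0,:]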
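The comparison that follows is only meaningful if both vectors list the categories in the same order, and if every Riverdale category also appears as a column in the Toronto table. A hedged sketch of an explicit alignment, with a hypothetical helper `align_counts`, could look like this:

```python
def align_counts(counts, categories):
    """Return the counts as a list ordered like `categories` (0 if absent)."""
    return [counts.get(category, 0) for category in categories]


categories = ['Bank', 'Baseball Field', 'Bus Station', 'Food Truck',
              'Gym', 'Park', 'Plaza']

# Riverdale counts from the table above; dict order is deliberately shuffled.
riverdale = {'Park': 2, 'Bus Station': 2, 'Plaza': 1, 'Gym': 1,
             'Food Truck': 1, 'Baseball Field': 1, 'Bank': 1}
# A Toronto row that has only a park (as in the Agincourt row).
agincourt = {'Park': 1}

print(align_counts(riverdale, categories))   # [1, 1, 2, 1, 1, 2, 1]
print(align_counts(agincourt, categories))   # [0, 0, 0, 0, 0, 1, 0]
```

Aligning by category name rather than by position avoids silent errors when a target category has no venue anywhere in the destination city.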

In [70]:
distance=np.zeros([len(venues_np)])
print(old_np)
for ngh in range(0,len(venues_np)):
    new_np=venues_np[ngh,:]
    distance[ngh]=np.linalg.norm(old_np-new_np)
    

[1 1 2 1 1 2 1]


In [140]:
small_venues['Distance']=distance
small_venues

| Neighborhood | Bank | Baseball Field | Bus Station | Food Truck | Gym | Park | Plaza | Distance |
|---|---|---|---|---|---|---|---|---|
| Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.162278 |
| Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.464102 |
| Bathurst Manor, Wilson Heights, Downsview North | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.162278 |
| Bayview Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.162278 |
| Bedford Park, Lawrence Manor East | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.605551 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Willowdale, Willowdale West | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.000000 |
| Woburn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.162278 |
| Woodbine Heights | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.605551 |
| York Mills West | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.162278 |
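The values in the Distance column can be checked by hand. The sketch below computes the Euclidean distance, that is, the square root of the summed squared differences, between the Riverdale counts and two of the Toronto rows shown above. The function name `euclidean_distance` is only illustrative.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


riverdale = [1, 1, 2, 1, 1, 2, 1]    # target counts printed above
agincourt = [0, 0, 0, 0, 0, 1, 0]    # Agincourt row: one park only
willowdale = [1, 0, 0, 0, 0, 1, 0]   # Willowdale, Willowdale West row

print(round(euclidean_distance(riverdale, agincourt), 6))   # 3.162278
print(round(euclidean_distance(riverdale, willowdale), 6))  # 3.0
```

Because the counts are not normalised, categories with larger counts weigh more in the result; whether that is desirable depends on the user's priorities.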


# Results

Using the distance between every Toronto neighbourhood and the original New York neighbourhood, we can select the neighbourhoods most similar to the original one (those with the lowest distance).

In [83]:
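ngh=[]

# NOTE: "pos" is not defined in the cells shown here. As an assumption, it is
# taken to hold the positions of the neighbourhoods with the smallest distance.
pos=np.where(np.isclose(distance, distance.min()))

for idx in range(0,6):
    ngh.append(neighborhood[pos[0][idx]])
    
ngh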

['CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport',
 'Central Bay Street',
 'Dufferin, Dovercourt Village',
 'India Bazaar, The Beaches West',
 'St. James Town',
 'St. James Town, Cabbagetown']
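One way to pick the closest neighbourhoods is to rank them by distance and keep the first few. The sketch below does this in plain Python on four distances copied from the Distance column shown earlier; the helper name `closest` is hypothetical, and this is not necessarily how the list above was produced.

```python
import heapq

# A few neighbourhood distances copied from the Distance column shown earlier.
distances = {
    'Agincourt': 3.162278,
    'Alderwood, Long Branch': 3.464102,
    'Bedford Park, Lawrence Manor East': 3.605551,
    'Willowdale, Willowdale West': 3.000000,
}


def closest(distances, k):
    """Return the k neighbourhood names with the smallest distance."""
    return heapq.nsmallest(k, distances, key=distances.get)


print(closest(distances, 2))
# ['Willowdale, Willowdale West', 'Agincourt']
```

Ranking by distance avoids hard-coding how many neighbourhoods share the minimum value.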

The neighbourhoods most similar to the original one are presented below:

In [137]:
findata = small_venues.loc[ ngh , : ]
findata

| Neighborhood | Bank | Baseball Field | Bus Station | Food Truck | Gym | Park | Plaza |
|---|---|---|---|---|---|---|---|
| CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 |
| Central Bay Street | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| Dufferin, Dovercourt Village | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| India Bazaar, The Beaches West | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 |
| St. James Town | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 0.0 |
| St. James Town, Cabbagetown | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |


Comparing these rows with the target venues in New York, the user can select the most suitable neighbourhood:

In [139]:
target_services

| categories | name |
|---|---|
| Bank | 1 |
| Baseball Field | 1 |
| Bus Station | 2 |
| Food Truck | 1 |
| Gym | 1 |
| Park | 2 |
| Plaza | 1 |


The user can select a neighbourhood from the list above according to their personal preferences. For example, looking at the venues in each neighbourhood, the most suitable one seems to be Central Bay Street, as it covers the most target categories in the list (four of the seven).

The user can refine the search by requiring a specific characteristic from the list. For example, none of the neighbourhoods above has a bus station nearby. If the user considers this venue essential, the search can be repeated keeping only the neighbourhoods with a non-zero value in the Bus Station column.

In [148]:
finaldata= small_venues[small_venues['Bus Station']>0]
finaldata

| Neighborhood | Bank | Baseball Field | Bus Station | Food Truck | Gym | Park | Plaza | Distance |
|---|---|---|---|---|---|---|---|---|
| Kennedy Park, Ionview, East Birchmount Park | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.162278 |


In this case only one neighbourhood has a bus station, and it has none of the other target venues.
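The refinement step generalises to several required categories at once: keep the neighbourhoods that have at least one venue in each required category, then order them by distance. The sketch below assumes a hypothetical helper `filter_and_rank` and uses three rows copied from the tables above.

```python
def filter_and_rank(rows, required):
    """Keep rows with at least one venue in every required category,
    then order them from most to least similar (smallest distance first)."""
    kept = [row for row in rows
            if all(row['counts'].get(cat, 0) > 0 for cat in required)]
    return sorted(kept, key=lambda row: row['distance'])


# Rows copied from the tables above (zero counts omitted).
rows = [
    {'name': 'Kennedy Park, Ionview, East Birchmount Park',
     'counts': {'Bus Station': 1}, 'distance': 3.162278},
    {'name': 'Willowdale, Willowdale West',
     'counts': {'Bank': 1, 'Park': 1}, 'distance': 3.000000},
    {'name': 'Agincourt',
     'counts': {'Park': 1}, 'distance': 3.162278},
]

print([row['name'] for row in filter_and_rank(rows, ['Park'])])
# ['Willowdale, Willowdale West', 'Agincourt']
print([row['name'] for row in filter_and_rank(rows, ['Bus Station'])])
# ['Kennedy Park, Ionview, East Birchmount Park']
```

An empty result for a combination of requirements tells the user that no neighbourhood within the search radius satisfies all of them.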

# Conclusions

This project presents a tool for finding similar neighbourhoods in different cities. It uses the Foursquare API and the Euclidean distance to select the neighbourhoods with the most similar venues. This allows the user to explore the new city and to search for different options according to their personal preferences.