# The Battle of Neighborhoods

## 1. Introduction

People relocate to new cities all the time for reasons such as work, studies, family, or weather. Relocating, whether within your country of residence or to a new one, can be both exciting and challenging. One of the main challenges is finding an apartment to buy or rent in a nice neighborhood. In many cases we are unfamiliar with the destination city and need to search extensively to find reliable information. Many people would also like to move to a neighborhood that resembles their old one. This is where data science can help, by finding similar neighborhoods in the city we are relocating to. In this assignment, I introduce a fictional character called "John Anderson" who is struggling to find a good neighborhood in the city he is relocating to.

John is a 35-year-old business consultant working at a Fortune 500 consulting firm. At the moment, John lives in a rental apartment in the "Central Bay Street" neighborhood of Toronto, Canada. He recently got a promotion that requires him to relocate to Helsinki, Finland. John is very excited about the promotion, but he is a bit worried because he does not know much about Helsinki. He enjoys living in "Central Bay Street" and hopes to rent an apartment in a similar neighborhood in Helsinki. John's friend Mark is a data scientist, and when Mark learned of John's problem, he offered to help. John told Mark that he has two main criteria for selecting a neighborhood in Helsinki:
1. **He wants a neighborhood with characteristics very similar to "Central Bay Street", especially in terms of the venues available in the area.**
2. **He wants a neighborhood that is at most 3 km from Helsinki city center, so that the apartment is close to the center and he can explore the city easily in his free time.**

Mark promised to give John a list of neighborhoods in Helsinki that meet his criteria.

From now on, I will take the role of Mark and use the data science knowledge acquired throughout this course to come up with a good solution to this challenge. To make the report easier to follow, I have divided it into the following sections:

+ Introduction
+ Data
+ Methodology
+ Results
+ Discussion
+ Conclusion

Let us dive into the challenge. I hope you find it interesting.

## 2. Data

Now that we understand the challenge and the selection criteria, it is time to decide which datasets are required and how to obtain them. To solve this challenge we need neighborhood data for both Helsinki and Toronto, the coordinates of those neighborhoods, and venue data.

Neighborhood data is easily accessible from the following Wikipedia pages:

+ Subdivisions of Helsinki: https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki
+ List of postal codes of Canada: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To fetch the neighborhoods' geographical coordinates, we use the **GeoPy** library, and to get venue data for each neighborhood, we use the **Foursquare API**.

### 2.1 Import Libraries

In this section all the libraries required for this analysis are imported.

In [1]:
# data analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

%matplotlib inline

# to handle requests
import requests

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# transforming a JSON file into a pandas dataframe
# (json_normalize is exposed at the top level of pandas since version 1.0)
from pandas import json_normalize

# ! pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from geopy import distance

# scraping data from website
! pip install beautifulsoup4
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### 2.2 Toronto Neighborhoods

As mentioned in the introduction section, John lives in Toronto, Canada in a neighborhood called "Central Bay Street".

The first step is to get Toronto neighborhoods' data from Wikipedia.

In [2]:
# assign the Toronto wiki url to a variable

url_wiki_toronto = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


# fetch the tables from wikipedia and keep the first one as a dataframe

dfs_toronto = pd.read_html(url_wiki_toronto, header=0)
df_wiki_toronto = dfs_toronto[0]
df_wiki_toronto.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [3]:
# rename Neighbourhood to Neighborhood

df_wiki_toronto = df_wiki_toronto.rename(columns={'Neighbourhood': 'Neighborhood'})
df_wiki_toronto.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Now that we have the Toronto neighborhood data, we filter it to keep only "Central Bay Street".

In [11]:
# Remove all the rows except for Central Bay Street neighborhood

df_toronto_filtered = df_wiki_toronto[df_wiki_toronto['Neighborhood'] == 'Central Bay Street'].reset_index(drop=True)
df_toronto_filtered

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street |
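
The exact-match filter works here because the Wikipedia table lists "Central Bay Street" alone in its cell. Other rows hold several comma-separated names (for example "Regent Park, Harbourfront"), which an equality test would miss. Below is a minimal sketch of a more tolerant lookup, run on a small hand-made frame rather than the scraped table; the helper name `find_neighborhood` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the scraped Wikipedia table (rows copied from the notebook output)
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M5G"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront", "Central Bay Street"],
})

def find_neighborhood(df, name):
    """Return rows whose comma-separated Neighborhood cell contains `name` (hypothetical helper)."""
    # Split each cell on commas, strip whitespace, and test for an exact member
    mask = df["Neighborhood"].apply(
        lambda cell: name in [part.strip() for part in str(cell).split(",")]
    )
    return df[mask].reset_index(drop=True)

bay = find_neighborhood(toy, "Central Bay Street")
harbourfront = find_neighborhood(toy, "Harbourfront")
```

Matching whole list members rather than substrings avoids false hits such as "Park" matching "Parkwoods".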


In [5]:
# fetch latitude and longitude for the Central Bay Street neighborhood from the geolocator

bay_toronto = df_toronto_filtered['Neighborhood'][0]
bay_query = bay_toronto + ", Toronto"  # geocode expects a query string, not a list

geolocator = Nominatim(user_agent="toronto_explorer")
bay_location = geolocator.geocode(bay_query)

# single-element lists so they can be assigned as dataframe columns
bay_latitude = [bay_location.latitude]
bay_longitude = [bay_location.longitude]

In [12]:
# add neighborhood coordinate data to the data frame and remove unnecessary columns

df_toronto_filtered['Latitude'] = bay_latitude
df_toronto_filtered['Longitude'] = bay_longitude
df_toronto_filtered = df_toronto_filtered[['Neighborhood', 'Latitude', 'Longitude']]
df_toronto_filtered

|   | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Central Bay Street | 43.653779 | -79.382944 |


### 2.3 Helsinki Neighborhoods

Besides Toronto neighborhoods, we also need to get data for Helsinki neighborhoods from Wikipedia.

In [7]:
# fetch Helsinki neighborhood data from wikipedia

wiki_data_helsinki = requests.get('https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki').text
soup = BeautifulSoup(wiki_data_helsinki, 'html.parser')

# filter html divs with a specific class

wiki_divs_helsinki = soup.find_all("div", {"class": "div-col columns column-width"})

# select neighborhoods div

helsinki_neighborhoods_div = wiki_divs_helsinki[0]

# neighborhoods numbering system

helsinki_neighborhoods_split = helsinki_neighborhoods_div.text.split()

# create a list of Helsinki neighborhoods: each name follows a one- or two-character number token

helsinki_neighborhoods = []

for i, item in enumerate(helsinki_neighborhoods_split[:-1]):
    if len(item) <= 2:
        helsinki_neighborhoods.append(helsinki_neighborhoods_split[i + 1])

helsinki_neighborhoods

['Kruununhaka',
 'Kluuvi',
 'Kaartinkaupunki',
 'Kamppi',
 'Punavuori',
 'Eira',
 'Ullanlinna',
 'Katajanokka',
 'Kaivopuisto',
 'Sörnäinen',
 'Kallio',
 'Alppiharju',
 'Etu-Töölö',
 'Taka-Töölö',
 'Meilahti',
 'Ruskeasuo',
 'Pasila',
 'Laakso',
 'Mustikkamaa-Korkeasaari',
 'Länsisatama',
 'Hermanni',
 'Vallila',
 'Toukola',
 'Kumpula',
 'Käpylä',
 'Koskela',
 'Vanhakaupunki',
 'Oulunkylä',
 'Haaga',
 'Munkkiniemi',
 'Lauttasaari',
 'Konala',
 'Kaarela',
 'Pakila',
 'Tuomarinkylä',
 'Viikki',
 'Pukinmäki',
 'Malmi',
 'Tapaninkylä',
 'Suutarila',
 'Suurmetsä',
 'Kulosaari',
 'Herttoniemi',
 'Tammisalo',
 'Vartiokylä',
 'Pitäjänmäki',
 'Mellunkylä',
 'Vartiosaari',
 'Laajasalo',
 'Villinki',
 'Santahamina',
 'Suomenlinna',
 'Ulkosaaret',
 'Vuosaari',
 'Östersundom',
 'Salmenkallio',
 'Talosaari',
 'Karhusaari',
 'Ultuna']
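
The extraction step relies on the scraped text alternating between a short district number and a neighborhood name. Below is a minimal sketch of that pairing on a hand-written sample string; both the sample and the helper name `names_after_numbers` are illustrative assumptions, not the actual page text or notebook code.

```python
def names_after_numbers(text):
    """Collect the token that follows each short (<= 2 character) token (hypothetical helper)."""
    tokens = text.split()
    names = []
    # zip pairs every token with its successor, so no index lookup is needed
    for current, following in zip(tokens, tokens[1:]):
        if len(current) <= 2:
            names.append(following)
    return names

# Hand-written sample in the assumed "number name number name" layout
sample = "01 Kruununhaka 02 Kluuvi 03 Kaartinkaupunki"
parsed = names_after_numbers(sample)
```

Because the text is split on whitespace, a name written as several words would contribute only its first word, so this approach assumes single-token names.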

The next step is to make a dataframe for the Helsinki neighborhoods and fetch their geographical coordinates.

In [8]:
# make a data frame for Helsinki neighborhoods

df_helsinki = pd.DataFrame(columns = ['Neighborhood', 'Latitude', 'Longitude'])
df_helsinki

|   | Neighborhood | Latitude | Longitude |
|---|---|---|---|


In [9]:
# fetching latitude and longitude for all the helsinki neighborhoods from geolocator

helsinki_geo = []
hel_latitude = []
hel_longitude = []

for neighborhood in helsinki_neighborhoods:
    helsinki_geo.append(neighborhood + ", Helsinki")

# create the geolocator once and reuse it for every query
geolocator = Nominatim(user_agent="hel_explorer")

for item in helsinki_geo:
    hel_location = geolocator.geocode(item)
    hel_latitude.append(hel_location.latitude)
    hel_longitude.append(hel_location.longitude)
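
Nominatim returns `None` when it cannot match a query, in which case reading `.latitude` raises an error. Below is a minimal sketch of a lookup helper that tolerates missing matches. `geocode_all` is a hypothetical helper, and the stub geocoder stands in for `geolocator.geocode` so that the example runs offline.

```python
from collections import namedtuple

# Minimal stand-in for geopy's Location object (only the two attributes used here)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def geocode_all(names, geocode, suffix=", Helsinki"):
    """Geocode each name; return (name, lat, lon) rows, with None for misses."""
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # Keep the row so the result stays aligned with the input list
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows

# Offline stub: knows a single neighborhood (coordinates taken from the notebook output)
def stub_geocode(query):
    known = {"Kluuvi, Helsinki": FakeLocation(60.170778, 24.947329)}
    return known.get(query)

rows = geocode_all(["Kluuvi", "Nowhere"], stub_geocode)
```

Against the real service, the `geocode` argument could be `geolocator.geocode` wrapped in `RateLimiter(geolocator.geocode, min_delay_seconds=1)` from `geopy.extra.rate_limiter`, which spaces out requests in line with Nominatim's usage policy.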

In [10]:
# add helsinki neighborhood data to the data frame

df_helsinki['Neighborhood'] = helsinki_neighborhoods
df_helsinki['Latitude'] = hel_latitude
df_helsinki['Longitude'] = hel_longitude

df_helsinki.head()

|   | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Kruununhaka | 60.17287 | 24.954733 |
| 1 | Kluuvi | 60.170778 | 24.947329 |
| 2 | Kaartinkaupunki | 60.165214 | 24.947222 |
| 3 | Kamppi | 60.168535 | 24.930494 |
| 4 | Punavuori | 60.161237 | 24.936505 |


Now that we have collected the list of Helsinki neighborhoods, we should remove the ones that are more than 3 km from Helsinki city center. For this purpose, we first fetch the coordinates of the city center, compute each neighborhood's distance from it, and drop the neighborhoods that do not meet the criterion.

In [13]:
# Use geopy library to get the latitude and longitude for Helsinki Center

helsinki_center = 'Helsinki, FI'

geolocator = Nominatim(user_agent="center_explorer")
center_location = geolocator.geocode(helsinki_center)
center_latitude = center_location.latitude
center_longitude = center_location.longitude

center_coords = (center_latitude, center_longitude)
center_coords

(60.1674881, 24.9427473)

In [14]:
# create a list of all Helsinki neighborhoods' coordinates

neighborhood_coords = []

for ind in df_helsinki.index:
    coords = (df_helsinki['Latitude'][ind], df_helsinki['Longitude'][ind])
    neighborhood_coords.append(coords)

In [15]:
# calculate the distance between Helsinki center and neighborhoods

hel_distance = []

for coord in neighborhood_coords:
    hel_distance.append(distance.distance(center_coords, coord).km)

In [16]:
# add distance column to df_helsinki

df_helsinki['Distance from Center'] = hel_distance
df_helsinki.head()

|   | Neighborhood | Latitude | Longitude | Distance from Center |
|---|---|---|---|---|
| 0 | Kruununhaka | 60.17287 | 24.954733 | 0.895688 |
| 1 | Kluuvi | 60.170778 | 24.947329 | 0.446188 |
| 2 | Kaartinkaupunki | 60.165214 | 24.947222 | 0.354881 |
| 3 | Kamppi | 60.168535 | 24.930494 | 0.690177 |
| 4 | Punavuori | 60.161237 | 24.936505 | 0.77794 |
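
geopy's `distance.distance` computes a geodesic distance on an ellipsoid. As a rough cross-check, the haversine formula on a spherical Earth should give a close but not identical value. The sketch below assumes a mean Earth radius of 6371 km and uses the center and Kluuvi coordinates from the outputs in this section; `haversine_km` is a hypothetical helper, not part of the notebook.

```python
import math

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

center = (60.1674881, 24.9427473)  # Helsinki center from the geocoder output
kluuvi = (60.170778, 24.947329)    # Kluuvi row from df_helsinki

d = haversine_km(center, kluuvi)
within_limit = d <= 3.0  # the 3 km criterion from the introduction
```

The result should land close to the 0.446 km reported for Kluuvi; small differences are expected because the spherical model ignores the Earth's flattening.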


In [17]:
# now any neighborhood which is more than 3km away from center is removed

df_helsinki_filtered = df_helsinki[df_helsinki['Distance from Center'] <= 3.0]
df_helsinki_filtered = df_helsinki_filtered.reset_index()
df_helsinki_filtered

|   | index | Neighborhood | Latitude | Longitude | Distance from Center |
|---|---|---|---|---|---|
| 0 | 0 | Kruununhaka | 60.17287 | 24.954733 | 0.895688 |
| 1 | 1 | Kluuvi | 60.170778 | 24.947329 | 0.446188 |
| 2 | 2 | Kaartinkaupunki | 60.165214 | 24.947222 | 0.354881 |
| 3 | 3 | Kamppi | 60.168535 | 24.930494 | 0.690177 |
| 4 | 4 | Punavuori | 60.161237 | 24.936505 | 0.77794 |
| 5 | 5 | Eira | 60.156191 | 24.938375 | 1.28186 |
| 6 | 6 | Ullanlinna | 60.158715 | 24.949404 | 1.045046 |
| 7 | 7 | Katajanokka | 60.166975 | 24.968151 | 1.411529 |
| 8 | 8 | Kaivopuisto | 60.156465 | 24.955262 | 1.411137 |
| 9 | 9 | Sörnäinen | 60.183885 | 24.964409 | 2.186975 |


So far we have fetched the neighborhood data and coordinates, and filtered the data based on the criteria.

### 2.4 Venue Data for Helsinki and Toronto

We use the Foursquare API to access venue data for both Helsinki and Toronto.

In [18]:
# Foursquare credentials

CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your Foursquare Access Token
VERSION = '' # Foursquare API version date in YYYYMMDD format

In [19]:
# Iterate through the neighborhoods and fetch data from Foursquare API

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
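
The function indexes straight into `["response"]["groups"][0]["items"]` and takes `categories[0]` for every venue, so an empty response or a venue without a category would raise an error. Below is a minimal sketch of a more defensive parsing step, exercised on a hand-written payload that mimics the same nesting. `extract_venues`, the "Unknown" placeholder, and the second sample venue are assumptions made for illustration.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into per-venue tuples (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to a placeholder instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload with the nesting the function above expects;
# the second venue is invented to show the missing-category case
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Papu Cafe",
               "location": {"lat": 60.173040, "lng": 24.956453},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 60.1730, "lng": 24.9560},
               "categories": []}},
]}]}}

rows = extract_venues(sample_payload, "Kruununhaka", 60.17287, 24.954733)
```

Separating the parsing from the HTTP request also makes the logic easy to check without API credentials.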

Before running the function to get venues, we combine the filtered Helsinki and Toronto dataframes.

In [21]:
# create a new dataframe by combining df_helsinki_filtered and df_toronto_filtered and drop unnecessary columns
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)

df = pd.concat([df_helsinki_filtered, df_toronto_filtered], ignore_index=True)
df = df.drop(['index', 'Distance from Center'], axis=1)
df.head()

|   | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Kruununhaka | 60.17287 | 24.954733 |
| 1 | Kluuvi | 60.170778 | 24.947329 |
| 2 | Kaartinkaupunki | 60.165214 | 24.947222 |
| 3 | Kamppi | 60.168535 | 24.930494 |
| 4 | Punavuori | 60.161237 | 24.936505 |


In [22]:
# Run getNearbyVenues for the Helsinki neighborhoods and Central Bay Street and store the data in the "venues" variable

venues = getNearbyVenues(names=df['Neighborhood'],
                         latitudes=df['Latitude'],
                         longitudes=df['Longitude'])

Kruununhaka
Kluuvi
Kaartinkaupunki
Kamppi
Punavuori
Eira
Ullanlinna
Katajanokka
Kaivopuisto
Sörnäinen
Kallio
Alppiharju
Etu-Töölö
Taka-Töölö
Mustikkamaa-Korkeasaari
Länsisatama
Central Bay Street


In [23]:
# Check the content of venues dataframe

print(venues.shape)
venues

(917, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Kruununhaka | 60.172870 | 24.954733 | Papu Cafe | 60.173040 | 24.956453 | Café |
| 1 | Kruununhaka | 60.172870 | 24.954733 | Cafe LOV | 60.171284 | 24.956623 | Café |
| 2 | Kruununhaka | 60.172870 | 24.954733 | Korea House | 60.172910 | 24.956436 | Korean Restaurant |
| 3 | Kruununhaka | 60.172870 | 24.954733 | Anton & Anton | 60.172348 | 24.956458 | Organic Grocery |
| 4 | Kruununhaka | 60.172870 | 24.954733 | Gateau | 60.174137 | 24.953712 | Bakery |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 912 | Central Bay Street | 43.653779 | -79.382944 | Pantages Hotel & Spa | 43.654498 | -79.379035 | Hotel |
| 913 | Central Bay Street | 43.653779 | -79.382944 | Tim Hortons | 43.655212 | -79.380063 | Coffee Shop |
| 914 | Central Bay Street | 43.653779 | -79.382944 | Pantages Lounge & Bar | 43.654493 | -79.379000 | Cocktail Bar |
| 915 | Central Bay Street | 43.653779 | -79.382944 | Imperial Pub | 43.656254 | -79.378955 | Pub |


In [24]:
# Check how many unique venue categories are available

print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 201 unique categories.
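
A per-neighborhood venue count is a useful companion to the category total, since each Foursquare request is capped at `LIMIT` results. Below is a minimal sketch on a small hand-made frame that reuses the column names of `venues`; the four rows are copied from the output above purely for illustration and do not represent the full dataset.

```python
import pandas as pd

# Toy subset with the same column names as the venues dataframe
toy_venues = pd.DataFrame({
    "Neighborhood": ["Kruununhaka", "Kruununhaka", "Kruununhaka", "Central Bay Street"],
    "Venue": ["Papu Cafe", "Cafe LOV", "Korea House", "Tim Hortons"],
    "Venue Category": ["Café", "Café", "Korean Restaurant", "Coffee Shop"],
})

# Number of venues returned for each neighborhood
venue_counts = toy_venues.groupby("Neighborhood")["Venue"].count()

# Number of distinct categories across all neighborhoods
n_categories = toy_venues["Venue Category"].nunique()
```

Running the same two lines on the real `venues` frame would show which neighborhoods hit the request cap and which returned only a handful of venues.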


We have gathered all the data required to tackle this challenge, cleaned it, and filtered it based on the challenge criteria. Next, we move on to the Methodology section.