# The Battle of Neighborhoods

## 1. Introduction

People relocate to new cities all the time, for reasons such as work, studies, family, or weather. Relocating, whether within your country of residence or to a new country, can be both exciting and challenging. One of the main challenges is finding an apartment to buy or rent in a nice neighborhood. In many cases we are unfamiliar with the destination city and need to search extensively online to find reliable information. In addition, many people would like to move to a neighborhood that resembles their old one. This is where data science can help: it can find similar neighborhoods in any city we are relocating to. In this assignment, I introduce a fictional character called "John Anderson", who is struggling to find a good neighborhood in the city he is relocating to.

John is a 35-year-old business consultant working at a Fortune 500 consulting firm. At the moment, John lives in a rental apartment in the "Central Bay Street" neighborhood of Toronto, Canada. He recently got a promotion that requires him to relocate to Helsinki, Finland. John is very excited about the promotion, but he is a bit worried because he does not know much about Helsinki. He enjoys living in "Central Bay Street" and hopes to rent an apartment in a similar neighborhood in Helsinki. When John's friend Mark, a data scientist, heard about the problem, he offered to help. John told Mark that he has two main criteria for selecting a neighborhood in Helsinki:
1. **He wishes to find a neighborhood with characteristics very similar to "Central Bay Street", especially in terms of the venues available in the area.**
2. **He only wants to live in a neighborhood at most 3 km from Helsinki city center, so that the apartment is close to the center and he can explore the city more easily in his free time.**

Mark promised to give John a list of neighborhoods in Helsinki that meet his criteria.

From now on, I take the role of Mark and use the data science knowledge acquired throughout this course to solve this challenge. To make the report easier to follow, I break it down into the following sections:

+ Introduction
+ Data
+ Methodology
+ Results
+ Discussion
+ Conclusion

Let us dive into the challenge. I hope you find it interesting.

## 2. Data

Now that we understand the challenge and the selection criteria, it is time to figure out which datasets are required and how to obtain them. To solve this challenge we need neighborhood data for both Helsinki and Toronto, the coordinates of the neighborhoods, and venue data.

Neighborhood data is easily accessible from the following Wikipedia pages:

+ Subdivisions of Helsinki: https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki
+ List of postal codes of Canada: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To fetch the neighborhoods' geographical coordinates, we use the **GeoPy** library, and to get the venue data for each neighborhood, we use the **Foursquare API**.

### 2.1 Import Libraries

In this section all the libraries required for this analysis are imported.

In [1]:
# data analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

%matplotlib inline

# to handle requests
import requests

# k-means clustering
from sklearn.cluster import KMeans

# transform JSON into a pandas dataframe
from pandas import json_normalize

# ! pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from geopy import distance

# scraping data from website
! pip install beautifulsoup4
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### 2.2 Toronto Neighborhoods

As mentioned in the introduction section, John lives in Toronto, Canada in a neighborhood called "Central Bay Street".

The first step is to get Toronto neighborhoods' data from Wikipedia.

In [2]:
# assign the Toronto wiki url to a variable

url_wiki_toronto = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url_wiki_toronto


# fetch the table from wikipedia and store in a dataframe

dfs_toronto = pd.read_html(url_wiki_toronto, header=0)
df_wiki_toronto = pd.concat(dfs_toronto[0:1])
df_wiki_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
# rename Neighbourhood to Neighborhood

df_wiki_toronto = df_wiki_toronto.rename(columns={'Neighbourhood': 'Neighborhood'})
df_wiki_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now that we have Toronto neighborhood data, we should filter the data to only include "Central Bay Street".

In [4]:
# Remove all the rows except for Central Bay Street neighborhood

df_toronto_filtered = df_wiki_toronto[df_wiki_toronto['Neighborhood'] == 'Central Bay Street'].reset_index(drop=True)
df_toronto_filtered

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street


In [5]:
# fetch latitude and longitude for Central Bay Street neighborhood from geolocator

bay_toronto = df_toronto_filtered['Neighborhood'][0]

# geocode() expects a query string, so build one instead of passing a list
bay_query = bay_toronto + ", Toronto"

geolocator = Nominatim(user_agent="toronto_explorer")
bay_location = geolocator.geocode(bay_query)

# keep the coordinates in one-element lists so they can be assigned as dataframe columns
bay_latitude = [bay_location.latitude]
bay_longitude = [bay_location.longitude]

In [6]:
# add neighborhood coordinate data to the data frame and removing unnecessary columns

df_toronto_filtered['Latitude'] = bay_latitude
df_toronto_filtered['Longitude'] = bay_longitude
df_toronto_filtered = df_toronto_filtered[['Neighborhood','Latitude', 'Longitude']]
df_toronto_filtered

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Central Bay Street,43.653779,-79.382944


### 2.2 Helsinki Neighborhoods

Besides Toronto neighborhoods, we also need to get data for Helsinki neighborhoods from Wikipedia.

In [7]:
# fetch Helsinki neighborhood data from wikipedia

wiki_data_helsinki = requests.get('https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki').text
soup = BeautifulSoup(wiki_data_helsinki, 'html.parser')

# filter html divs with a specific class

wiki_divs_helsinki = soup.find_all("div", {"class": "div-col columns column-width"})

# select neighborhoods div

helsinki_neighborhoods_div = wiki_divs_helsinki[0]

# split the div text into tokens; each neighborhood name follows its 1-2 character number

helsinki_neighborhoods_split = helsinki_neighborhoods_div.text.split()

# create a list for Helsinki neighborhoods

helsinki_neighborhoods = []

# enumerate gives the token position directly, avoiding a list.index() lookup per token
for i, item in enumerate(helsinki_neighborhoods_split[:-1]):
    if len(item) <= 2:
        helsinki_neighborhoods.append(helsinki_neighborhoods_split[i + 1])
       

helsinki_neighborhoods

['Kruununhaka',
 'Kluuvi',
 'Kaartinkaupunki',
 'Kamppi',
 'Punavuori',
 'Eira',
 'Ullanlinna',
 'Katajanokka',
 'Kaivopuisto',
 'Sörnäinen',
 'Kallio',
 'Alppiharju',
 'Etu-Töölö',
 'Taka-Töölö',
 'Meilahti',
 'Ruskeasuo',
 'Pasila',
 'Laakso',
 'Mustikkamaa-Korkeasaari',
 'Länsisatama',
 'Hermanni',
 'Vallila',
 'Toukola',
 'Kumpula',
 'Käpylä',
 'Koskela',
 'Vanhakaupunki',
 'Oulunkylä',
 'Haaga',
 'Munkkiniemi',
 'Lauttasaari',
 'Konala',
 'Kaarela',
 'Pakila',
 'Tuomarinkylä',
 'Viikki',
 'Pukinmäki',
 'Malmi',
 'Tapaninkylä',
 'Suutarila',
 'Suurmetsä',
 'Kulosaari',
 'Herttoniemi',
 'Tammisalo',
 'Vartiokylä',
 'Pitäjänmäki',
 'Mellunkylä',
 'Vartiosaari',
 'Laajasalo',
 'Villinki',
 'Santahamina',
 'Suomenlinna',
 'Ulkosaaret',
 'Vuosaari',
 'Östersundom',
 'Salmenkallio',
 'Talosaari',
 'Karhusaari',
 'Ultuna']
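
The parsing step above relies on the div text alternating between a short district number and a district name. A minimal sketch of that pairing logic on a hand-written sample follows; the sample string and the helper name `extract_names` are assumptions made for illustration, not scraped data.

```python
def extract_names(text):
    """Return every token that directly follows a 1-2 character numbering token."""
    tokens = text.split()
    names = []
    # enumerate gives the position directly, so no list.index() lookup is needed
    for i, token in enumerate(tokens[:-1]):
        if len(token) <= 2:
            names.append(tokens[i + 1])
    return names


# Hypothetical sample in a "number name" layout
sample = "01 Kruununhaka 02 Kluuvi 03 Kaartinkaupunki"
print(extract_names(sample))
```

Because the text is split on whitespace, a name written as two separate words would be cut to its first word, so this approach only suits single-token names.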

The next step is to make a dataframe for Helsinki neighborhoods and fetch geographical coordinates.

In [8]:
# make a data frame for Helsinki neighborhoods

df_helsinki = pd.DataFrame(columns = ['Neighborhood', 'Latitude', 'Longitude'])
df_helsinki

Unnamed: 0,Neighborhood,Latitude,Longitude


In [9]:
# fetch latitude and longitude for all the Helsinki neighborhoods from geolocator

helsinki_geo = []
hel_latitude = []
hel_longitude = []

for neighborhood in helsinki_neighborhoods:
    helsinki_geo.append(neighborhood + ", Helsinki")

# create the geocoder once instead of once per neighborhood
geolocator = Nominatim(user_agent="hel_explorer")

for item in helsinki_geo:
    hel_location = geolocator.geocode(item)
    hel_latitude.append(hel_location.latitude)
    hel_longitude.append(hel_location.longitude)

In [10]:
# add helsinki neighborhood data to the data frame

df_helsinki['Neighborhood'] = helsinki_neighborhoods
df_helsinki['Latitude'] = hel_latitude
df_helsinki['Longitude'] = hel_longitude

df_helsinki.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Kruununhaka,60.17287,24.954733
1,Kluuvi,60.170778,24.947329
2,Kaartinkaupunki,60.165214,24.947222
3,Kamppi,60.168535,24.930494
4,Punavuori,60.161237,24.936505
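
The geocoding loop assumes that every query returns a location. A geocoder can return `None` when it finds no match, in which case reading `.latitude` would raise an error. The sketch below shows one way to guard against that; the helper `geocode_all` and the stand-in `fake_geocode` are hypothetical and exist only so the example runs offline.

```python
from collections import namedtuple


def geocode_all(names, geocode, suffix=", Helsinki"):
    """Geocode each name; keep None coordinates when the lookup finds nothing."""
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows


# Stand-in for a real geocoder so the sketch runs offline (hypothetical lookup table)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
known = {"Kluuvi, Helsinki": FakeLocation(60.170778, 24.947329)}


def fake_geocode(query):
    return known.get(query)


rows = geocode_all(["Kluuvi", "Nowhere"], fake_geocode)
print(rows)
```

With geopy, `geolocator.geocode` could be passed in place of `fake_geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter(geolocator.geocode, min_delay_seconds=1)` to space out the requests.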


Now that we have collected the list of Helsinki neighborhoods, we should remove those that are more than 3 km from Helsinki city center. For this purpose, we first fetch the coordinates of the city center, compute each neighborhood's distance from it, and drop the neighborhoods that do not meet the criterion.

In [11]:
# Use geopy library to get the latitude and longitude for Helsinki Center

helsinki_center = 'Helsinki, FI'

geolocator = Nominatim(user_agent="center_explorer")
center_location = geolocator.geocode(helsinki_center)
center_latitude = center_location.latitude
center_longitude = center_location.longitude

center_coords = (center_latitude, center_longitude)
center_coords

(60.1674881, 24.9427473)

In [12]:
# create a list of all Helsinki neighborhoods' coordinates

neighborhood_coords = []

for ind in df_helsinki.index:
    coords = (df_helsinki['Latitude'][ind], df_helsinki['Longitude'][ind])
    neighborhood_coords.append(coords)

In [13]:
# calculate the distance between Helsinki center and neighborhoods

hel_distance = []

for coord in neighborhood_coords:
    hel_distance.append(distance.distance(center_coords, coord).km)

In [14]:
# add distance column to df_helsinki

df_helsinki['Distance from Center'] = hel_distance
df_helsinki.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Distance from Center
0,Kruununhaka,60.17287,24.954733,0.895688
1,Kluuvi,60.170778,24.947329,0.446188
2,Kaartinkaupunki,60.165214,24.947222,0.354881
3,Kamppi,60.168535,24.930494,0.690177
4,Punavuori,60.161237,24.936505,0.77794
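
As a rough cross-check of the distances above, the haversine formula can be computed with the standard library alone. This is a sketch under a spherical-Earth assumption (the notebook itself uses geopy's geodesic distance), and the function name `haversine_km` is my own. The result should land close to, but not exactly on, the geodesic value in the table.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


center = (60.1674881, 24.9427473)    # city-center coordinates from the output above
kruununhaka = (60.17287, 24.954733)  # first row of the table above
d = haversine_km(center, kruununhaka)
print(round(d, 2), d <= 3.0)
```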


In [15]:
# now any neighborhood which is more than 3km away from center is removed

df_helsinki_filtered = df_helsinki[df_helsinki['Distance from Center'] <= 3.0]
df_helsinki_filtered = df_helsinki_filtered.reset_index()
df_helsinki_filtered

Unnamed: 0,index,Neighborhood,Latitude,Longitude,Distance from Center
0,0,Kruununhaka,60.17287,24.954733,0.895688
1,1,Kluuvi,60.170778,24.947329,0.446188
2,2,Kaartinkaupunki,60.165214,24.947222,0.354881
3,3,Kamppi,60.168535,24.930494,0.690177
4,4,Punavuori,60.161237,24.936505,0.77794
5,5,Eira,60.156191,24.938375,1.28186
6,6,Ullanlinna,60.158715,24.949404,1.045046
7,7,Katajanokka,60.166975,24.968151,1.411529
8,8,Kaivopuisto,60.156465,24.955262,1.411137
9,9,Sörnäinen,60.183885,24.964409,2.186975


So far, we have fetched the neighborhood data and coordinates and filtered the neighborhoods based on the criteria.

### 2.3 Venue Data for Helsinki and Toronto

We use the Foursquare API to get venue data for both Helsinki and Toronto.

In [16]:
# Foursquare credentials

CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your FourSquare Access Token
VERSION = ''

In [17]:
# Iterate through the neighbourhoods and fetch data from Foursquare API

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
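
The function above indexes `categories[0]` directly, which assumes that every venue carries at least one category. I have not verified whether the API can return an empty list, so the sketch below simply shows how the flattening step could tolerate one. The payload is a hand-written mock that only mirrors the keys the function reads, and `flatten_items` is a hypothetical helper name.

```python
def flatten_items(neighborhood, items):
    """Turn the 'items' list of an explore-style response into flat rows."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written mock payload (hypothetical; not real API output)
mock_items = [
    {"venue": {"name": "Papu Cafe",
               "location": {"lat": 60.173040, "lng": 24.956453},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 60.17, "lng": 24.95},
               "categories": []}},
]
rows = flatten_items("Kruununhaka", mock_items)
print(rows)
```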

Before running the function to get venues, we combine the filtered Helsinki and Toronto dataframes.

In [18]:
# create a new dataframe by combining df_helsinki_filtered and df_toronto_filtered and dropping unnecessary columns

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
df = pd.concat([df_helsinki_filtered, df_toronto_filtered], ignore_index=True)
df = df.drop(['index', 'Distance from Center'], axis=1)
df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Kruununhaka,60.17287,24.954733
1,Kluuvi,60.170778,24.947329
2,Kaartinkaupunki,60.165214,24.947222
3,Kamppi,60.168535,24.930494
4,Punavuori,60.161237,24.936505


In [19]:
# Run getNearbyVenues function for the Helsinki neighborhoods and Central Bay Street, and store the data in "venues" variable

venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Kruununhaka
Kluuvi
Kaartinkaupunki
Kamppi
Punavuori
Eira
Ullanlinna
Katajanokka
Kaivopuisto
Sörnäinen
Kallio
Alppiharju
Etu-Töölö
Taka-Töölö
Mustikkamaa-Korkeasaari
Länsisatama
Central Bay Street


In [20]:
# Check the content of venues dataframe

print(venues.shape)
venues

(917, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Kruununhaka,60.172870,24.954733,Papu Cafe,60.173040,24.956453,Café
1,Kruununhaka,60.172870,24.954733,Cafe LOV,60.171284,24.956623,Café
2,Kruununhaka,60.172870,24.954733,Korea House,60.172910,24.956436,Korean Restaurant
3,Kruununhaka,60.172870,24.954733,Anton & Anton,60.172348,24.956458,Organic Grocery
4,Kruununhaka,60.172870,24.954733,Gateau,60.174137,24.953712,Bakery
...,...,...,...,...,...,...,...
912,Central Bay Street,43.653779,-79.382944,Pantages Hotel & Spa,43.654498,-79.379035,Hotel
913,Central Bay Street,43.653779,-79.382944,Tim Hortons,43.655212,-79.380063,Coffee Shop
914,Central Bay Street,43.653779,-79.382944,Pantages Lounge & Bar,43.654493,-79.379000,Cocktail Bar
915,Central Bay Street,43.653779,-79.382944,Imperial Pub,43.656254,-79.378955,Pub


In [21]:
# Check how many unique venue categories are available

print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 201 unique categories.


We have gathered all the data required to tackle this challenge, cleaned it, and filtered it based on the challenge criteria. Next, we move on to the Methodology section.

## 3. Methodology

In the methodology section we first perform exploratory data analysis to get a better understanding of the data and the challenge. Next, we use the data to build a model which will help us in selecting the right neighborhoods.

### 3.1 Exploratory Data Analysis

First, let's take a look at the neighborhood where John Anderson lives: the "Central Bay Street" neighborhood in Toronto, Canada.

In [22]:
# Use geopy library to get the latitude and longitude for Toronto

toronto = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
toronto_location = geolocator.geocode(toronto)
toronto_latitude = toronto_location.latitude
toronto_longitude = toronto_location.longitude

toronto_coords = (toronto_latitude, toronto_longitude)
toronto_coords

(43.6534817, -79.3839347)

In [23]:
# visualizing Central Bay Street neighborhood in Toronto, Canada

map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=13)

for lat, lng, neighborhood in zip(df_toronto_filtered['Latitude'], df_toronto_filtered['Longitude'], df_toronto_filtered['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)


map_toronto

Now that we have highlighted the location of the neighborhood on the map, let's take a look at the venues available in this neighborhood.

In [24]:
# list of all venues available at Central Bay Street

venues_bay_street = venues[venues['Neighborhood'] == 'Central Bay Street']
venues_bay_street

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
829,Central Bay Street,43.653779,-79.382944,Downtown Toronto,43.653232,-79.385296,Neighborhood
830,Central Bay Street,43.653779,-79.382944,Nathan Phillips Square,43.652270,-79.383516,Plaza
831,Central Bay Street,43.653779,-79.382944,Indigo,43.653515,-79.380696,Bookstore
832,Central Bay Street,43.653779,-79.382944,CF Toronto Eaton Centre,43.654447,-79.380952,Shopping Mall
833,Central Bay Street,43.653779,-79.382944,LUSH,43.653557,-79.380400,Cosmetics Shop
...,...,...,...,...,...,...,...
912,Central Bay Street,43.653779,-79.382944,Pantages Hotel & Spa,43.654498,-79.379035,Hotel
913,Central Bay Street,43.653779,-79.382944,Tim Hortons,43.655212,-79.380063,Coffee Shop
914,Central Bay Street,43.653779,-79.382944,Pantages Lounge & Bar,43.654493,-79.379000,Cocktail Bar
915,Central Bay Street,43.653779,-79.382944,Imperial Pub,43.656254,-79.378955,Pub


In [25]:
# number of unique venue categories in Central Bay Street

venues_bay_street['Venue Category'].nunique()

59

As the table above shows, "Central Bay Street" is a neighborhood in the center of Toronto with a wide variety of venues: 59 distinct venue categories.

Now let's take a closer look at the Helsinki neighborhoods by visualizing them on the map.

In [26]:
# Use geopy library to get the latitude and longitude for Helsinki

helsinki = 'Helsinki, Finland'

geolocator = Nominatim(user_agent="helsinki_explorer")
helsinki_location = geolocator.geocode(helsinki)
helsinki_latitude = helsinki_location.latitude
helsinki_longitude = helsinki_location.longitude

helsinki_coords = (helsinki_latitude, helsinki_longitude)
helsinki_coords

(60.1674881, 24.9427473)

In [27]:
# visualizing Helsinki neighborhoods

map_helsinki = folium.Map(location=[helsinki_latitude, helsinki_longitude], zoom_start=11)

for lat, lng, neighborhood in zip(df_helsinki['Latitude'], df_helsinki['Longitude'], df_helsinki['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_helsinki)


map_helsinki

As mentioned in the Introduction section, John is only interested in neighborhoods that are at most 3 km from Helsinki city center. Let's visualize that on the map with a red circle.

In [28]:
# adding a red circle on the map to highlight the neighborhoods which are maximum 3km away from Helsinki city center

folium.Circle([helsinki_latitude, helsinki_longitude],radius=3000, color='red').add_to(map_helsinki)

map_helsinki

Now we can remove other neighborhoods from the map and take a closer look only at neighborhoods which are in the 3km radius.

In [29]:
# visualizing Helsinki neighborhoods located within 3km radius of city center

map_helsinki_filtered = folium.Map(location=[helsinki_latitude, helsinki_longitude], zoom_start=10)

for lat, lng, neighborhood in zip(df_helsinki_filtered['Latitude'], df_helsinki_filtered['Longitude'], df_helsinki_filtered['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_helsinki_filtered)

folium.Circle([helsinki_latitude, helsinki_longitude],radius=3000, color='red').add_to(map_helsinki_filtered)

map_helsinki_filtered

Now that we have limited the number of neighborhoods based on their proximity to Helsinki city center, it is time to build a model that helps us compare the venue profiles of "Central Bay Street" and the Helsinki neighborhoods.

The first step in building the model is transforming venue data from categorical to numerical.

In [30]:
# one hot encoding

onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
# (note: 'Neighborhood' is also a venue category, so this overwrites that one-hot column)

onehot['Neighborhood'] = venues['Neighborhood']

# Move neighborhood column to the first column

neighborhood = onehot['Neighborhood']
onehot.drop(labels=['Neighborhood'], axis=1, inplace = True)
onehot.insert(0, 'Neighborhood', neighborhood)

onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Asian Restaurant,Auditorium,Bagel Shop,...,Used Bookstore,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,Kruununhaka,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Kruununhaka,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Kruununhaka,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Kruununhaka,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Kruununhaka,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
onehot.shape

(917, 201)
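
The shape `(917, 201)` above, together with the 201 unique categories counted earlier and the venue category "Neighborhood" visible in the Central Bay Street listing, suggests that the one-hot column for that category was overwritten by the neighborhood names. A minimal sketch of one way to avoid such a collision follows, on hypothetical toy data; the `cat` prefix is my own naming choice, not something the notebook uses.

```python
import pandas as pd

# Hypothetical toy data; one venue has the category "Neighborhood"
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Café", "Neighborhood", "Café"],
})

# A prefix keeps the category columns distinct from the name column
onehot_toy = pd.get_dummies(toy["Venue Category"], prefix="cat").astype(int)
onehot_toy.insert(0, "Neighborhood", toy["Neighborhood"])

print(list(onehot_toy.columns))
```

With the prefix in place, the category column survives as `cat_Neighborhood` next to the name column instead of being replaced by it.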

The second step is to group the rows by neighborhood and take the mean of each one-hot column, which gives the share of each venue category among a neighborhood's venues.

In [32]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Asian Restaurant,Auditorium,Bagel Shop,...,Used Bookstore,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,Alppiharju,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0
1,Central Bay Street,0.0,0.022727,0.0,0.0,0.0,0.011364,0.0,0.0,0.0,...,0.0,0.011364,0.011364,0.011364,0.0,0.0,0.0,0.0,0.0,0.0
2,Eira,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0
3,Etu-Töölö,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kaartinkaupunki,0.0,0.021739,0.0,0.0,0.01087,0.01087,0.01087,0.0,0.0,...,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
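
To make the meaning of these values concrete: the mean of a one-hot column within a neighborhood is the fraction of that neighborhood's venues falling in the category. The sketch below checks this on hypothetical toy data by comparing the group mean with `pd.crosstab(..., normalize="index")`, which computes row-wise shares directly.

```python
import pandas as pd

# Hypothetical toy data: neighborhood A has three venues, B has one
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bar"],
})

# Route 1: one-hot encode, then average within each neighborhood
onehot_toy = pd.get_dummies(toy["Venue Category"]).astype(float)
by_mean = onehot_toy.groupby(toy["Neighborhood"]).mean()

# Route 2: count categories per neighborhood and normalize each row to sum to 1
by_share = pd.crosstab(toy["Neighborhood"], toy["Venue Category"], normalize="index")

print(by_mean)
```

Both routes give the same table, so a value such as 0.037037 in the output above can be read as roughly 3.7% of that neighborhood's returned venues.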


The third step is to create a function that selects the most common venue categories in each neighborhood. For this analysis, we use the top 20 categories per neighborhood.

In [33]:
# Function to sort venue categories in descending order of frequency and return the top ones

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
# Creating a dataframe to show top 20 venues in every neighborhood

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Alppiharju,Theme Park Ride / Attraction,Park,Bar,Greek Restaurant,Recreation Center,Blini House,Beer Garden,Theater,Theme Park,...,Bus Station,Café,Dog Run,History Museum,Pub,Vietnamese Restaurant,Italian Restaurant,Aquarium,Soccer Field,Steakhouse
1,Central Bay Street,Coffee Shop,Clothing Store,Hotel,Restaurant,Theater,Electronics Store,New American Restaurant,Office,Diner,...,Plaza,Seafood Restaurant,Breakfast Spot,Bookstore,Sushi Restaurant,Gym,American Restaurant,Bar,Music Venue,Ramen Restaurant
2,Eira,Ice Cream Shop,Park,Boat or Ferry,Italian Restaurant,Pizza Place,French Restaurant,Bakery,Café,Modern European Restaurant,...,Mexican Restaurant,Coffee Roaster,Sushi Restaurant,Bistro,Beach,Korean Restaurant,Coffee Shop,Harbor / Marina,Turkish Restaurant,Dog Run
3,Etu-Töölö,Scandinavian Restaurant,Park,Sushi Restaurant,Gym,Bakery,Coffee Shop,Plaza,Restaurant,Pub,...,Tennis Court,Dog Run,Fast Food Restaurant,Chinese Restaurant,Soccer Field,Café,Supermarket,Playground,Bookstore,Ice Cream Shop
4,Kaartinkaupunki,Scandinavian Restaurant,Hotel,Coffee Shop,Café,Cocktail Bar,Hotel Bar,Restaurant,Park,Music Venue,...,Dance Studio,Plaza,Pizza Place,Sushi Restaurant,Modern European Restaurant,Filipino Restaurant,American Restaurant,Bar,French Restaurant,Vegetarian / Vegan Restaurant
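
The column labels above are built with a three-item suffix list and a fallback to "th", which is enough for 20 columns but would label a 21st column "21th". A small sketch of a general ordinal helper follows; the name `ordinal` is hypothetical and the notebook does not use it.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) are irregular
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns_demo = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 21)
]
print(columns_demo[1], "|", columns_demo[11], "|", columns_demo[20])
```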


Finally, we use a clustering algorithm (k-means) to group neighborhoods with similar venue profiles.

In [35]:
# Set number of clusters

kclusters = 5

grouped_clustering = grouped.drop('Neighborhood', axis=1)

# Run k-means clustering

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# Check cluster labels generated for each row in the dataframe

kmeans.labels_[0:10]

array([1, 0, 2, 2, 0, 2, 3, 3, 0, 0])
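
The label array follows the row order of the frame passed to `fit`, which here is the alphabetical order of `grouped`, not the order of `df`. The sketch below illustrates this on hypothetical category shares for four made-up neighborhoods; `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical category shares for four made-up neighborhoods
grouped_toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Café": [0.9, 0.8, 0.1, 0.0],
    "Park": [0.1, 0.2, 0.9, 1.0],
})

features = grouped_toy.drop(columns="Neighborhood")
kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# labels_ is aligned with the rows of `features`, so pair it with the names explicitly
labels = pd.Series(kmeans_toy.labels_, index=grouped_toy["Neighborhood"])
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which neighborhoods share a label. Attaching labels by name, as the join in the next step does, keeps them correct even when the two frames are ordered differently.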

In [36]:
grouped_clustering.head()

Unnamed: 0,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Asian Restaurant,Auditorium,Bagel Shop,Bakery,...,Used Bookstore,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.022727,0.0,0.0,0.0,0.011364,0.0,0.0,0.0,0.011364,...,0.0,0.011364,0.011364,0.011364,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057143,...,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.055556,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.021739,0.0,0.0,0.01087,0.01087,0.01087,0.0,0.0,0.01087,...,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
# add clustering labels

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
merged = df

merged = merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Kruununhaka,60.17287,24.954733,3,Boat or Ferry,History Museum,Bar,Theater,Scandinavian Restaurant,Grocery Store,...,Indie Movie Theater,Beer Bar,Chinese Restaurant,Clothing Store,Restaurant,Event Space,Breakfast Spot,Korean Restaurant,Plaza,Coworking Space
1,Kluuvi,60.170778,24.947329,0,Coffee Shop,Café,Scandinavian Restaurant,Gym / Fitness Center,Theater,Park,...,Bar,Music Venue,Burger Joint,Hotel,Pizza Place,Clothing Store,Modern European Restaurant,Seafood Restaurant,Fountain,Scenic Lookout
2,Kaartinkaupunki,60.165214,24.947222,0,Scandinavian Restaurant,Hotel,Coffee Shop,Café,Cocktail Bar,Hotel Bar,...,Dance Studio,Plaza,Pizza Place,Sushi Restaurant,Modern European Restaurant,Filipino Restaurant,American Restaurant,Bar,French Restaurant,Vegetarian / Vegan Restaurant
3,Kamppi,60.168535,24.930494,3,Wine Bar,Beer Bar,Scandinavian Restaurant,Art Museum,Bar,Japanese Restaurant,...,Sushi Restaurant,Food Court,Pizza Place,Chinese Restaurant,Toy / Game Store,Burger Joint,Art Gallery,Restaurant,Vietnamese Restaurant,Playground
4,Punavuori,60.161237,24.936505,3,Scandinavian Restaurant,Restaurant,Bakery,Pizza Place,Coffee Shop,Park,...,Sandwich Place,Pub,Seafood Restaurant,Modern European Restaurant,Bar,Japanese Restaurant,Italian Restaurant,Art Gallery,Yoga Studio,Grocery Store


In this section, we performed an exploratory analysis of the data and built a clustering model that groups neighborhoods according to their venue data. In the next section, we review the clusters to see which one contains the "Central Bay Street" neighborhood.

## 4. Results

Now that we have built the clusters, let's visualize them on the map.

In [39]:
# set color scheme for the clusters: one evenly spaced rainbow color per cluster

colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per neighborhood to the map

for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # cluster-1 wraps around for cluster 0 (index -1 is the last color), so each cluster still gets its own color
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_helsinki_filtered)

map_helsinki_filtered

The above visualization shows how the clusters are color coded and where they are located in Helsinki, but it does not show which cluster "Central Bay Street" belongs to. So, let's narrow the map down and show only the neighborhoods that are in the same cluster as "Central Bay Street".

In [40]:
# checking the cluster label of Central Bay Street

bay_cluster = merged[merged['Neighborhood'] == 'Central Bay Street'] 
bay_cluster

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
16,Central Bay Street,43.653779,-79.382944,0,Coffee Shop,Clothing Store,Hotel,Restaurant,Theater,Electronics Store,...,Plaza,Seafood Restaurant,Breakfast Spot,Bookstore,Sushi Restaurant,Gym,American Restaurant,Bar,Music Venue,Ramen Restaurant


As the table above shows, the cluster label for "Central Bay Street" is 0. So let's create a dataframe that contains the data for all cluster 0 neighborhoods and then visualize them on the Helsinki map.

In [41]:
# creating a dataframe for cluster zero neighborhoods

cluster_zero = merged[merged['Cluster Labels'] == 0]
cluster_zero

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
1,Kluuvi,60.170778,24.947329,0,Coffee Shop,Café,Scandinavian Restaurant,Gym / Fitness Center,Theater,Park,...,Bar,Music Venue,Burger Joint,Hotel,Pizza Place,Clothing Store,Modern European Restaurant,Seafood Restaurant,Fountain,Scenic Lookout
2,Kaartinkaupunki,60.165214,24.947222,0,Scandinavian Restaurant,Hotel,Coffee Shop,Café,Cocktail Bar,Hotel Bar,...,Dance Studio,Plaza,Pizza Place,Sushi Restaurant,Modern European Restaurant,Filipino Restaurant,American Restaurant,Bar,French Restaurant,Vegetarian / Vegan Restaurant
7,Katajanokka,60.166975,24.968151,0,Park,Hotel,Scandinavian Restaurant,Restaurant,Bar,Boat or Ferry,...,Plaza,Escape Room,Piano Bar,Gourmet Shop,Grocery Store,Gym / Fitness Center,Himalayan Restaurant,Karaoke Bar,Hostel,Coffee Shop
16,Central Bay Street,43.653779,-79.382944,0,Coffee Shop,Clothing Store,Hotel,Restaurant,Theater,Electronics Store,...,Plaza,Seafood Restaurant,Breakfast Spot,Bookstore,Sushi Restaurant,Gym,American Restaurant,Bar,Music Venue,Ramen Restaurant
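
The two filtering steps above hard-code the label 0 and keep the Toronto row in the result. As a minimal sketch, the label can instead be looked up from the reference row and that row dropped afterwards. The `merged_demo` frame below is a hypothetical toy stand-in for `merged`, built only from the neighborhood names and cluster labels shown in the tables above; `reference`, `reference_label`, and `matches` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `merged` dataframe (assumption: only the
# two columns needed for this step, with values copied from the tables above).
merged_demo = pd.DataFrame({
    "Neighborhood": ["Kruununhaka", "Kluuvi", "Kaartinkaupunki",
                     "Kamppi", "Katajanokka", "Central Bay Street"],
    "Cluster Labels": [3, 0, 0, 3, 0, 0],
})

reference = "Central Bay Street"

# Look up the reference neighborhood's cluster label instead of hard-coding it.
is_reference = merged_demo["Neighborhood"] == reference
reference_label = merged_demo.loc[is_reference, "Cluster Labels"].iloc[0]

# Keep neighborhoods in the same cluster, excluding the reference row itself.
same_cluster = merged_demo[
    (merged_demo["Cluster Labels"] == reference_label) & ~is_reference
]
matches = same_cluster["Neighborhood"].tolist()
print(matches)
```

Looking the label up this way keeps the filter valid if the clustering is rerun, since k-means label numbers are arbitrary and may change between runs.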


In [42]:
map_helsinki_cluster = folium.Map(location=[helsinki_latitude, helsinki_longitude], zoom_start=13)

# add one marker per cluster 0 neighborhood to the map
# (the "Central Bay Street" row has Toronto coordinates, so its marker falls outside the Helsinki view)

for lat, lon, poi, cluster in zip(cluster_zero['Latitude'], cluster_zero['Longitude'], cluster_zero['Neighborhood'], cluster_zero['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_helsinki_cluster)

map_helsinki_cluster

Based on this analysis, Kluuvi, Kaartinkaupunki, and Katajanokka are the three Helsinki neighborhoods that are both similar to "Central Bay Street" and located within a 3 km radius of the Helsinki city center.
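
The 3 km criterion can be re-checked directly from the coordinates in the cluster 0 table. The sketch below uses the haversine great-circle formula, which is not necessarily the distance method used earlier in this notebook. The `CENTER` coordinates are an assumed approximation of the Helsinki city center, since the values of `helsinki_latitude` and `helsinki_longitude` are not shown in this section, and `haversine_km` is a helper introduced here for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Assumption: approximate Helsinki city-center coordinates, for illustration only.
CENTER = (60.1699, 24.9384)

# Coordinates copied from the cluster 0 table above.
candidates = {
    "Kluuvi": (60.170778, 24.947329),
    "Kaartinkaupunki": (60.165214, 24.947222),
    "Katajanokka": (60.166975, 24.968151),
}

distances = {
    name: haversine_km(CENTER[0], CENTER[1], lat, lon)
    for name, (lat, lon) in candidates.items()
}
within_3km = [name for name, d in distances.items() if d <= 3.0]

for name, d in distances.items():
    print(f"{name}: {d:.2f} km from the assumed city center")
```

Under the assumed center point, all three candidates fall within the 3 km limit. A different choice of center point would shift the distances somewhat.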

## 5. Discussion

Nowadays, it is very common for people to relocate to new cities for various reasons. One of the challenges of relocating is selecting a neighborhood in which to rent an apartment. The objective of this study was to help people find desirable neighborhoods in the city they are relocating to. To make the case more relatable, I introduced a fictional character called John Anderson, who was planning to move from Toronto to Helsinki and sought data-driven help to make a better decision.

I retrieved neighborhood data for Helsinki and Toronto from Wikipedia and used Geopy and Foursquare to fetch location and venue data for all the neighborhoods. I then performed an exploratory analysis to better understand the challenge.

To select the best neighborhoods in Helsinki, I first filtered the neighborhoods by their proximity to the Helsinki city center. I then used the k-means algorithm to cluster the neighborhoods, setting k to 5 and using the top 20 venues in each neighborhood. Finally, I summarized the results as a table and as a visualization on the Helsinki map.

## 6. Conclusion

The clustering revealed three neighborhoods in Helsinki, namely Kluuvi, Kaartinkaupunki, and Katajanokka, that have a lot in common with the "Central Bay Street" neighborhood. They are also within a 3 km radius of the Helsinki city center, which was the second deciding factor for the fictional character.

In the future, we can add more criteria to find even better matching neighborhoods in cities we are relocating to.