# Where to buy  in Helsinki

## 1 Introduction
### 1.1 Background

**Helsinki** is the capital city of Finland, with a population of 657,674. Together with the cities of Espoo, Vantaa, and Kauniainen and the surrounding commuter towns, Helsinki forms the Greater Helsinki metropolitan area (Uusimaa), which has a population of over 1.5 million. This area is the country's most important center for politics, education, finance, culture, and research. The urbanization and development of the Uusimaa area have created great opportunities for tertiary-sector businesses, including catering. Anyone looking for a suitable place in Helsinki to open a restaurant will want to know how restaurants are distributed across the city and which neighborhoods have the most of them. This project analyzes the 60 neighborhoods in the Helsinki area and the restaurant situation in each one. I will then divide the neighborhoods into several clusters ...

### 1.2 Data description

The data used in this project include:

- Subdivisions (neighborhoods) of Helsinki, collected from the Wikipedia page [1].
- The center coordinates of each neighborhood, collected from Google Maps [2].
- Housing price per square meter for each neighborhood, collected from the Blok company website [3].
- The most common venues in each neighborhood, collected from the Foursquare API [4].

## 2 Methodology

### 2.1 Data preparation

2.1.1 Prepare the libraries needed for data collection, pre-processing, and data modeling

In [1]:
# import libraries
import numpy as np 
import pandas as pd 
import requests # library to handle requests
!pip install bs4
from bs4 import BeautifulSoup

import json # library to handle JSON files
from pandas import json_normalize # transform JSON data into a pandas DataFrame (the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

2.1.2 Get the neighborhood data from Wikipedia using BeautifulSoup

In [74]:
# Use web scraping to extract the neighborhood list
url = 'https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki'
data = requests.get(url)

soup = BeautifulSoup(data.content, "html.parser")
helsinki_neiborhood_raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
records = []
for row in helsinki_neiborhood_raw:
    col = row.get_text().split(" ")
    code = col[0]
    neighborhood = col[1]
    codelen = len(col[0])  # length of the code
    records.append({"Code": code, "Neighborhood": neighborhood, "Codelen": codelen})
df = pd.DataFrame(records, columns=["Code", "Neighborhood", "Codelen"])

df = df[df.Codelen != 3]  # remove sub-neighborhood rows (their "Code" has three digits)
df = df.drop(['Codelen', 'Code'], axis=1)  # drop columns 'Codelen' and 'Code'
df = df.reset_index(drop=True)
df = df.replace({"Ultuna\n591": "Ultuna"})  # fix data of row 58
helsinki_neiborhood = df
helsinki_neiborhood

|  | Neighborhood |
|---|---|
| 0 | Kruununhaka |
| 1 | Kluuvi |
| 2 | Kaartinkaupunki |
| 3 | Kamppi |
| 4 | Punavuori |
| 5 | Eira |
| 6 | Ullanlinna |
| 7 | Katajanokka |
| 8 | Kaivopuisto |
| 9 | Sörnäinen |
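
The parsing and filtering step can also be tried offline on a few strings. The sketch below is a minimal illustration, not the notebook's actual code: it assumes each list item's text starts with a numeric code followed by the name, and that nested sub-neighborhood entries appear on later lines of the same string. The helper names `parse_entry` and `top_level_neighborhoods` and the sample strings are hypothetical.

```python
def parse_entry(text):
    """Split one list-item string into (code, name), using only its first line."""
    first_line = text.strip().splitlines()[0]
    # maxsplit=1 keeps multi-word names intact
    code, name = first_line.split(maxsplit=1)
    return code, name


def top_level_neighborhoods(entries):
    """Keep names whose code is not three digits long (three-digit codes mark sub-neighborhoods)."""
    result = []
    for text in entries:
        code, name = parse_entry(text)
        if len(code) != 3:
            result.append(name)
    return result


# Illustrative sample strings, not scraped data
sample = ["01 Kruununhaka", "02 Kluuvi", "101 Vilhonvuori", "59 Ultuna\n591 Landbo"]
print(top_level_neighborhoods(sample))
```

Reading only the first line of each item would avoid values such as "Ultuna\n591", which the cell above repairs afterwards with `replace`.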


2.1.3 Get housing price data from the Blok website

In [68]:
url2 = 'https://blok.ai/en/neighbourhoods/'
data2 = requests.get(url2)

soup2=BeautifulSoup(data2.content,'html.parser')
table = soup2.find_all('table')
housing_price_raw = table[0]

In [69]:
housing_price_raw

<table class="table table-hover table-striped table-condensed" id="datatable">
<thead>
<tr>
<th>#</th>
<th># +/- (1yr.)</th>
<th>Postcode</th>
<th>Neighborhood</th>
<th>City</th>
<th>Average price per square 2020</th>
<th>Price +/-% (1yr.)</th>
<th>Price +/-% (5yr.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>00140</td>
<td>Kaivopuisto - Ullanlinna</td>
<td>Helsinki</td>
<td>8713</td>
<td>2%</td>
<td>29%</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>00150</td>
<td>Eira - Hernesaari</td>
<td>Helsinki</td>
<td>8367</td>
<td>4%</td>
<td>27%</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>00120</td>
<td>Punavuori</td>
<td>Helsinki</td>
<td>8160</td>
<td>6%</td>
<td>27%</td>
</tr>
<tr>
<td>4</td>
<td>3</td>
<td>00180</td>
<td>Kamppi - Ruoholahti</td>
<td>Helsinki</td>
<td>8023</td>
<td>14%</td>
<td>27%</td>
</tr>
<tr>
<td>5</td>
<td>N/A</td>
<td>00220</td>
<td>Jätkäsaari</td>
<td>Helsinki</td>
<td>7871</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>6</td>
<td>-1</td>
<td>00170</td>
<td>

In [73]:
columns = ["Postcode", "Neighborhood", "City", "Avg_price_per_sqaure_meter_2020", "Price_change_percentage_1yr", "Price_change_percentage_5yr"]
rows = housing_price_raw.find('tbody').find_all('tr')

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
records = []
for row in rows:
    col = row.find_all('td')
    # Cells 2-7 hold the postcode, neighborhood, city, average price, and the two price changes
    records.append(dict(zip(columns, [cell.string for cell in col[2:8]])))
df2 = pd.DataFrame(records, columns=columns)

df2.head()

|  | Postcode | Neighborhood | City | Avg_price_per_sqaure_meter_2020 | Price_change_percentage_1yr | Price_change_percentage_5yr |
|---|---|---|---|---|---|---|
| 0 | 140 | Kaivopuisto - Ullanlinna | Helsinki | 8713 | 2% | 29% |
| 1 | 150 | Eira - Hernesaari | Helsinki | 8367 | 4% | 27% |
| 2 | 120 | Punavuori | Helsinki | 8160 | 6% | 27% |
| 3 | 180 | Kamppi - Ruoholahti | Helsinki | 8023 | 14% | 27% |
| 4 | 220 | Jätkäsaari | Helsinki | 7871 |  |  |
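
The scraped values are strings ("8713", "14%", "N/A"), so they need converting before any numeric comparison or clustering. The following is a minimal sketch of one way to do that, not a step taken from the notebook. The helper names `parse_percent` and `parse_price` are hypothetical, and treating empty strings and "N/A" as missing is an assumption based on the table shown above.

```python
def parse_percent(value):
    """Convert a string such as '14%' or '-7%' to a float; return None for missing values."""
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "N/A"):
        return None
    return float(text.rstrip("%"))


def parse_price(value):
    """Convert a price string such as '8713' to an int; return None for missing values."""
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "N/A"):
        return None
    return int(text)


print(parse_percent("14%"), parse_percent("N/A"), parse_price("8713"))
```

On the DataFrame, these helpers could be applied column by column, for example with `df2["Price_change_percentage_1yr"].map(parse_percent)`.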


In [85]:
# Keep only rows with City value "Helsinki"; chaining reset_index avoids modifying a slice of df2 in place
housing_price = df2[df2.City == 'Helsinki'].reset_index(drop=True)
housing_price.head()

|  | Postcode | Neighborhood | City | Avg_price_per_sqaure_meter_2020 | Price_change_percentage_1yr | Price_change_percentage_5yr |
|---|---|---|---|---|---|---|
| 0 | 140 | Kaivopuisto - Ullanlinna | Helsinki | 8713 | 2% | 29% |
| 1 | 150 | Eira - Hernesaari | Helsinki | 8367 | 4% | 27% |
| 2 | 120 | Punavuori | Helsinki | 8160 | 6% | 27% |
| 3 | 180 | Kamppi - Ruoholahti | Helsinki | 8023 | 14% | 27% |
| 4 | 220 | Jätkäsaari | Helsinki | 7871 |  |  |
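
Some rows in the price table carry a combined label such as "Kaivopuisto - Ullanlinna", while the Wikipedia list names each neighborhood separately. One possible way to line the two sources up is to split the combined labels, sketched below under stated assumptions. This is not necessarily how the project matches them. The helper name `split_combined` is hypothetical, and splitting only on a hyphen surrounded by spaces is an assumption made so that hyphenated single names such as "Itä-Pakila" stay intact.

```python
def split_combined(label):
    """Split a label like 'A - B' into ['A', 'B']; leave unspaced hyphens untouched."""
    return [part.strip() for part in label.split(" - ") if part.strip()]


for label in ["Kaivopuisto - Ullanlinna", "Punavuori", "Itä-Pakila"]:
    print(label, "->", split_combined(label))
```

Labels that join two names with an unspaced hyphen cannot be told apart from hyphenated single names by this rule, so they would need to be checked by hand.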


2.1.4 Get the center coordinates for each neighborhood using Google Maps

In [91]:
housing_price[50:79]

|  | Postcode | Neighborhood | City | Avg_price_per_sqaure_meter_2020 | Price_change_percentage_1yr | Price_change_percentage_5yr |
|---|---|---|---|---|---|---|
| 50 | 630 | Maunula-Suursuo | Helsinki | 3726 | 3% | 7% |
| 51 | 650 | Veräjämäki | Helsinki | 3571 | -7% | 6% |
| 52 | 390 | Konala | Helsinki | 3516 | 7% | 12% |
| 53 | 690 | Tuomarinkylä-Torpparinmäki | Helsinki | 3510 | 8% |  |
| 54 | 370 | Reimarla | Helsinki | 3488 | 8% | 2% |
| 55 | 730 | Tapanila | Helsinki | 3471 | 2% | 6% |
| 56 | 950 | Vartioharju | Helsinki | 3463 | 2% | 8% |
| 57 | 720 | Pukinmäki-Savela | Helsinki | 3385 | 5% | 10% |
| 58 | 680 | Itä-Pakila | Helsinki | 3378 | -2% | 6% |
| 59 | 910 | Puotila | Helsinki | 3376 | -2% | 8% |
