# <center>IBM Data Science Professional Certificate</center>

# <center>Capstone Project Report</center>

# Table of Contents

- [Introduction](#introduction)
- [Data](#data)
- [Methodology](#methodology)
- Analysis
- Results and Discussion
- Conclusion

# Introduction <a name="introduction"></a>

***
During the last decade, house prices skyrocketed in Europe, rising by more than 20% according to Eurostat. The pace has varied from country to country, and Portugal is among the countries with the highest growth, around 35% from 2009 to 2019. Several factors have driven this increase, among them:
- low interest rates, which spurred consumers to invest in new housing;
- low baseline prices;
- low supply of housing;
- foreign companies relocating to Portugal, creating more pressure on both the office and private housing markets;
- a boom in tourism, which created new investment alternatives for house owners (e.g. short-term rental).

All this pressure is forcing people to look for more affordable places to live, whether renting or buying, in areas they had not previously considered.

The purpose of this project is to explore the characteristics of the different neighborhoods, identify what makes each one unique, cluster them by similarity and compare their house prices.

I chose Porto Metropolitan Area, the second-largest urban area in Portugal, as the focus of this analysis. The main reasons for this choice are detailed below:
1. The region has been experiencing a tremendous increase in housing prices and therefore makes a good use case for the problem identified;
2. Porto is a mid-sized city, with around 2,659,524 inhabitants and 1,600,000 tourists visiting annually, which makes it a good candidate for having sufficient data to conduct the analysis;
3. I have been living in the city for more than 2 years, so it will be interesting to contrast the results of the analysis with my empirical knowledge of the city.

Despite Porto being the object of study, the methodology used can be applied to any other city.

# Data <a name="data"></a>

***
The data used for this project comes from several sources: Foursquare, an open data repository (Porto Digital) and one of the largest listings of apartments for sale in Portugal (Imovirtual). The sources are detailed below:
- venues in the metropolitan area (https://pt.foursquare.com/)
- neighborhoods (https://raw.githubusercontent.com/publicos-pt/pt_regions/master/pt_regions/counties.json)
- metro stations (https://opendata.urbanplatform.portodigital.pt/dataset/metro-do-porto)
- bus stations (https://opendata.urbanplatform.portodigital.pt/dataset/informacao-publica-dos-servicos-de-transporte-colectivo-do-porto/resource/4be68555-5a0d-49d0-b62e-4e76735953a5)
- train stations (https://pt.wikipedia.org/wiki/CP_Urbanos_do_Porto)
- bike stations (https://opendata.urbanplatform.portodigital.pt/dataset/bicicletarios)
- car parking (https://opendata.urbanplatform.portodigital.pt/dataset/parques-de-estacionamento-municipais)
- bike path (https://opendata.urbanplatform.portodigital.pt/dataset/ciclovias)
- supermarkets (https://opendata.urbanplatform.portodigital.pt/dataset/supermercados-e-hipermercados)
- hotels (https://opendata.urbanplatform.portodigital.pt/dataset/hoteis_-aparthoteis-e-albergarias-centroides/resource/3c490f86-9fd4-4268-8db2-1f7b4832dc6f)
- kindergartens, elementary and high schools (https://opendata.urbanplatform.portodigital.pt/dataset/estabelecimentos-de-ensino-por-ano-letivo)
- hospitals, other medical centers and pharmacies (https://opendata.urbanplatform.portodigital.pt/dataset/equipamentos-de-saude)
- pools (https://opendata.urbanplatform.portodigital.pt/dataset/piscinas-centroides)
- post office (https://opendata.urbanplatform.portodigital.pt/dataset/correios-2013)
- tourist information offices (https://opendata.urbanplatform.portodigital.pt/dataset/postos-de-informacao-turistica-centroides)
- apartments to buy (https://www.imovirtual.com/comprar/apartamento/?search%5Bdescription%5D=1&locations%5B0%5D%5Bregion_id%5D=13&locations%5B1%5D%5Bregion_id%5D=3&locations%5B2%5D%5Bregion_id%5D=1&nrAdsPerPage=103323)

# 0. Import libraries

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # import library to handle data in a vectorized manner

import pandas as pd # import library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # import library to handle requests

from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import; the pandas.io.json path is deprecated since pandas 1.0)

import matplotlib.cm as cm # Matplotlib and associated plotting modules
import matplotlib.colors as colors

from sklearn.cluster import KMeans # import k-means for the clustering stage

import folium # map rendering library

import geopandas as gpd # read and manipulate geospatial data (GeoJSON, geometries)

print('Libraries imported.')

Libraries imported.


1.1 Extract the neighborhoods of Porto Metropolitan Area

In [5]:
!wget -q -O ptfreguesias_data.json https://raw.githubusercontent.com/nmota/caop_GeoJSON/master/ContinenteFreguesias.geojson
print('Data downloaded!')

Data downloaded!
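
The `!wget` cell above relies on a shell tool that is not available in every environment. As a minimal sketch, the same download could be done with the Python standard library and skipped when the file is already cached. The helper name `download_if_missing` is hypothetical and not part of the original notebook; the URL and file name are the ones used in the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL copied from the wget cell above.
GEOJSON_URL = (
    "https://raw.githubusercontent.com/nmota/caop_GeoJSON/"
    "master/ContinenteFreguesias.geojson"
)


def download_if_missing(url: str, target: str) -> Path:
    """Download `url` to `target` unless a non-empty file already exists there.

    Hypothetical helper for illustration only.
    """
    path = Path(target)
    if path.exists() and path.stat().st_size > 0:
        return path  # reuse the cached copy, no network access
    urlretrieve(url, str(path))  # raises an error if the request fails
    return path


# Usage (performs a network request on the first call only):
# download_if_missing(GEOJSON_URL, "ptfreguesias_data.json")
```

Caching the file this way avoids downloading the full GeoJSON again each time the notebook is re-run.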


1.2 Load the data into a dataframe and explore it

In [16]:
ptfreg = gpd.read_file('ptfreguesias_data.json') #more info here: https://geopandas.org/io.html#reading-spatial-data
ptfreg.head()

Unnamed: 0,Dicofre,Freguesia,Concelho,Distrito,Area_Ha,Des_Simpli,geometry
0,10103,Aguada de Cima,ÁGUEDA,AVEIRO,2839.31,Aguada de Cima,"POLYGON ((-23153.704 98134.039, -22875.069 977..."
1,10109,Fermentelos,ÁGUEDA,AVEIRO,858.2,Fermentelos,"POLYGON ((-32142.508 98702.545, -32161.288 986..."
2,10112,Macinhata do Vouga,ÁGUEDA,AVEIRO,3195.44,Macinhata do Vouga,"POLYGON ((-20560.758 113803.912, -20550.798 11..."
3,10119,Valongo do Vouga,ÁGUEDA,AVEIRO,4320.11,Valongo do Vouga,"POLYGON ((-22002.741 110943.503, -21999.021 11..."
4,10121,União das freguesias de Águeda e Borralha,ÁGUEDA,AVEIRO,3602.93,Águeda e Borralha,"POLYGON ((-21105.293 105071.326, -21079.433 10..."


1.3 Remove the unnecessary column (i.e. "Dicofre") and order the columns from the largest geographical area to the smallest (i.e. from "Distrito" to "Freguesia")

In [17]:
ptfreg2 = ptfreg.drop(columns='Dicofre') #more info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
ptfreg2.head()

Unnamed: 0,Freguesia,Concelho,Distrito,Area_Ha,Des_Simpli,geometry
0,Aguada de Cima,ÁGUEDA,AVEIRO,2839.31,Aguada de Cima,"POLYGON ((-23153.704 98134.039, -22875.069 977..."
1,Fermentelos,ÁGUEDA,AVEIRO,858.2,Fermentelos,"POLYGON ((-32142.508 98702.545, -32161.288 986..."
2,Macinhata do Vouga,ÁGUEDA,AVEIRO,3195.44,Macinhata do Vouga,"POLYGON ((-20560.758 113803.912, -20550.798 11..."
3,Valongo do Vouga,ÁGUEDA,AVEIRO,4320.11,Valongo do Vouga,"POLYGON ((-22002.741 110943.503, -21999.021 11..."
4,União das freguesias de Águeda e Borralha,ÁGUEDA,AVEIRO,3602.93,Águeda e Borralha,"POLYGON ((-21105.293 105071.326, -21079.433 10..."


In [18]:
ptfreg3 = ptfreg2[['Distrito','Concelho','Freguesia','Des_Simpli','Area_Ha','geometry']] #ordered columns
ptfreg3.head()

Unnamed: 0,Distrito,Concelho,Freguesia,Des_Simpli,Area_Ha,geometry
0,AVEIRO,ÁGUEDA,Aguada de Cima,Aguada de Cima,2839.31,"POLYGON ((-23153.704 98134.039, -22875.069 977..."
1,AVEIRO,ÁGUEDA,Fermentelos,Fermentelos,858.2,"POLYGON ((-32142.508 98702.545, -32161.288 986..."
2,AVEIRO,ÁGUEDA,Macinhata do Vouga,Macinhata do Vouga,3195.44,"POLYGON ((-20560.758 113803.912, -20550.798 11..."
3,AVEIRO,ÁGUEDA,Valongo do Vouga,Valongo do Vouga,4320.11,"POLYGON ((-22002.741 110943.503, -21999.021 11..."
4,AVEIRO,ÁGUEDA,União das freguesias de Águeda e Borralha,Águeda e Borralha,3602.93,"POLYGON ((-21105.293 105071.326, -21079.433 10..."


In [20]:
ptfreg4 = ptfreg3.rename(columns={"Freguesia":"FreguesiaFullName","Des_Simpli":"Freguesia"}) #renamed columns
ptfreg4.head()

Unnamed: 0,Distrito,Concelho,FreguesiaFullName,Freguesia,Area_Ha,geometry
0,AVEIRO,ÁGUEDA,Aguada de Cima,Aguada de Cima,2839.31,"POLYGON ((-23153.704 98134.039, -22875.069 977..."
1,AVEIRO,ÁGUEDA,Fermentelos,Fermentelos,858.2,"POLYGON ((-32142.508 98702.545, -32161.288 986..."
2,AVEIRO,ÁGUEDA,Macinhata do Vouga,Macinhata do Vouga,3195.44,"POLYGON ((-20560.758 113803.912, -20550.798 11..."
3,AVEIRO,ÁGUEDA,Valongo do Vouga,Valongo do Vouga,4320.11,"POLYGON ((-22002.741 110943.503, -21999.021 11..."
4,AVEIRO,ÁGUEDA,União das freguesias de Águeda e Borralha,Águeda e Borralha,3602.93,"POLYGON ((-21105.293 105071.326, -21079.433 10..."
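
The three cells above (drop, reorder, rename) can also be written as a single method chain, which avoids the intermediate `ptfreg2`, `ptfreg3` and `ptfreg4` variables. The sketch below runs on a small plain pandas DataFrame built from two rows of the output above, with the geometry replaced by a placeholder string. It is an assumption that the same chain behaves identically on the GeoDataFrame, although all three methods are standard pandas operations.

```python
import pandas as pd

# Toy stand-in for the freguesias table: two rows copied from the output above,
# with the geometry column replaced by a placeholder string.
raw = pd.DataFrame({
    "Dicofre": ["10103", "10121"],
    "Freguesia": ["Aguada de Cima", "União das freguesias de Águeda e Borralha"],
    "Concelho": ["ÁGUEDA", "ÁGUEDA"],
    "Distrito": ["AVEIRO", "AVEIRO"],
    "Area_Ha": [2839.31, 3602.93],
    "Des_Simpli": ["Aguada de Cima", "Águeda e Borralha"],
    "geometry": ["POLYGON (...)", "POLYGON (...)"],
})

# Largest geographical unit first, geometry last.
ordered = ["Distrito", "Concelho", "Freguesia", "Des_Simpli", "Area_Ha", "geometry"]

tidy = (
    raw.drop(columns="Dicofre")   # remove the unused code column
       .loc[:, ordered]           # reorder the remaining columns
       .rename(columns={          # keep the full name, promote the short one
           "Freguesia": "FreguesiaFullName",
           "Des_Simpli": "Freguesia",
       })
)

print(tidy.columns.tolist())
# ['Distrito', 'Concelho', 'FreguesiaFullName', 'Freguesia', 'Area_Ha', 'geometry']
```

Note that `rename` applies both mappings at once, so swapping the meaning of `Freguesia` does not create a name clash.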


1.4 Keep only the neighborhoods within Porto Metropolitan Area (approximated here by the neighborhoods of the Porto, Braga and Aveiro districts)

In [22]:
ptfreg4.shape # check the total number of entries and cross-check with the total number of "freguesias" in Portugal Continental (source here: https://www.pordata.pt/Municipios/Freguesias-54)

(2882, 6)

In [23]:
porto = ptfreg4['Distrito'] == "PORTO" # Create variable with TRUE if district is PORTO
braga = ptfreg4['Distrito'] == "BRAGA" # Create variable with TRUE if district is BRAGA
aveiro = ptfreg4['Distrito'] == "AVEIRO" # Create variable with TRUE if district is AVEIRO
porto_neigh = ptfreg4[porto | braga | aveiro] # create new dataframe with only Porto Metropolitan Area neighborhoods
porto_neigh.head()

Unnamed: 0,Distrito,Concelho,FreguesiaFullName,Freguesia,Area_Ha,geometry
0,AVEIRO,ÁGUEDA,Aguada de Cima,Aguada de Cima,2839.31,"POLYGON ((-23153.704 98134.039, -22875.069 977..."
1,AVEIRO,ÁGUEDA,Fermentelos,Fermentelos,858.2,"POLYGON ((-32142.508 98702.545, -32161.288 986..."
2,AVEIRO,ÁGUEDA,Macinhata do Vouga,Macinhata do Vouga,3195.44,"POLYGON ((-20560.758 113803.912, -20550.798 11..."
3,AVEIRO,ÁGUEDA,Valongo do Vouga,Valongo do Vouga,4320.11,"POLYGON ((-22002.741 110943.503, -21999.021 11..."
4,AVEIRO,ÁGUEDA,União das freguesias de Águeda e Borralha,Águeda e Borralha,3602.93,"POLYGON ((-21105.293 105071.326, -21079.433 10..."


In [24]:
print('Porto Metropolitan area has {} neighborhoods.'.format(porto_neigh.shape[0])) # check the total number of neighborhoods and cross-check (source here: https://www.pordata.pt/Municipios/Freguesias-54)

Porto Metropolitan area has 737 neighborhoods.
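
The district filter above combines three boolean masks with `|`. A more compact alternative, sketched below on a toy table, is `Series.isin`, followed by `value_counts` to cross-check how many rows each district contributes. The freguesia labels `f1` to `f5` are placeholders, not real names; only the district names come from the notebook.

```python
import pandas as pd

# Toy table: district names as in the notebook, freguesia labels are placeholders.
toy = pd.DataFrame({
    "Distrito": ["AVEIRO", "PORTO", "LISBOA", "BRAGA", "PORTO"],
    "Freguesia": ["f1", "f2", "f3", "f4", "f5"],
})

DISTRICTS = ["PORTO", "BRAGA", "AVEIRO"]

# Keep rows whose district is in the list (equivalent to porto | braga | aveiro).
selected = toy[toy["Distrito"].isin(DISTRICTS)]

# Rows per district, useful for cross-checking against an external source.
counts = selected["Distrito"].value_counts()

print(len(selected))  # 4
print(counts)
```

Keeping the districts in one list makes it easy to change the study area later without touching the filtering logic.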


1.5 Add a centroid to each neighborhood

In [59]:
# copy GeoDataFrame
centroids = porto_neigh.copy()
# replace each polygon with its centroid point
centroids['geometry'] = centroids['geometry'].centroid
centroids.head()

Unnamed: 0,Distrito,Concelho,FreguesiaFullName,Freguesia,Area_Ha,geometry
0,AVEIRO,ÁGUEDA,Aguada de Cima,Aguada de Cima,2839.31,POINT (-23376.421 95067.498)
1,AVEIRO,ÁGUEDA,Fermentelos,Fermentelos,858.2,POINT (-33491.864 99637.413)
2,AVEIRO,ÁGUEDA,Macinhata do Vouga,Macinhata do Vouga,3195.44,POINT (-25191.511 110982.592)
3,AVEIRO,ÁGUEDA,Valongo do Vouga,Valongo do Vouga,4320.11,POINT (-23467.379 106945.630)
4,AVEIRO,ÁGUEDA,União das freguesias de Águeda e Borralha,Águeda e Borralha,3602.93,POINT (-24009.776 100641.937)
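
For intuition about what the `.centroid` attribute returns for a polygon, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula. It is an illustration in plain Python, not the implementation geopandas uses, and it assumes a single ring with no holes and planar coordinates. The coordinates in this dataset appear to be projected (in metres), which is the setting where a planar centroid is meaningful.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon (no holes, no self-intersections).

    `ring` lists the vertices in order; a repeated closing vertex is allowed.
    """
    if ring[0] == ring[-1]:
        ring = ring[:-1]  # drop the repeated closing vertex
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # signed contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(polygon_centroid(unit_square))  # (0.5, 0.5)
```

For irregular shapes the result generally differs from the plain average of the vertices, which is why the area-weighted version is the one usually used to represent a region by a single point.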


In [65]:
gjson = centroids.to_crs(epsg=4326).to_json() # reproject to WGS84 (latitude/longitude), which folium expects, and serialize to GeoJSON

1.6 Plot the map of Porto Metropolitan Area neighborhoods

In [66]:
portomap = folium.Map(
    location=[41.157944,-8.629105],
    zoom_start=11,
    tiles="openstreetmap") # Picked Porto coordinates to center the map (source: https://www.latlong.net/)

points = folium.features.GeoJson(gjson)
portomap.add_child(points) # add_child replaces the deprecated add_children

portomap
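
One detail worth keeping in mind when working with the GeoJSON string: GeoJSON stores point coordinates as `[longitude, latitude]`, while folium's `location` argument (as in the `folium.Map` call above) takes `[latitude, longitude]`. Passing the GeoJSON to `folium.GeoJson` handles this automatically, but the order must be swapped when building markers by hand. The sketch below shows the swap on a hand-written sample. The helper `point_locations` and the feature names are hypothetical; the first coordinate pair is the map center used above and the second is made up for illustration.

```python
import json

# Hand-written stand-in for the string returned by GeoDataFrame.to_json().
gjson_sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"Freguesia": "Example A"},
            # GeoJSON order: [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [-8.629105, 41.157944]},
        },
        {
            "type": "Feature",
            "properties": {"Freguesia": "Example B"},
            "geometry": {"type": "Point", "coordinates": [-8.6, 41.2]},
        },
    ],
})


def point_locations(geojson_text):
    """Return (name, [lat, lon]) pairs for every Point feature."""
    collection = json.loads(geojson_text)
    locations = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Point":
            continue  # skip polygons, lines, etc.
        lon, lat = geometry["coordinates"][:2]
        locations.append((feature["properties"].get("Freguesia"), [lat, lon]))
    return locations


locations = point_locations(gjson_sample)
print(locations[0])  # ('Example A', [41.157944, -8.629105])
```

Each `[lat, lon]` pair could then be passed as the `location` of a `folium.Marker` if individual, labeled markers are preferred over a single GeoJSON layer.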