# <center>IBM Data Science Professional Certificate</center>

# <center>Capstone Project Report</center>

# Table of Contents

- [Introduction](#introduction)
- [Data](#data)
- [Methodology](#methodology)
- Analysis
- Results and Discussion
- Conclusion

# Introduction <a name="introduction"></a>

***
Over the last decade, house prices in Europe rose by more than 20%, according to Eurostat. The pace has differed from country to country, and Portugal is among the countries with the highest growth, around 35% from 2009 to 2019. Several factors have driven this increase, among them:
- low interest rates, which encouraged consumers to invest in new housing;
- low baseline prices;
- low supply of housing;
- foreign companies relocating to Portugal, which added pressure on both the office and the private housing markets;
- a boom in tourism, which created new investment alternatives for homeowners (e.g. short-term rental).

All this pressure is pushing people to look for more affordable places to live, whether to rent or to buy, in areas they would not previously have considered.

The purpose of this project is to explore the characteristics of the different neighborhoods, identify what makes each one unique, group similar neighborhoods into clusters, and compare house prices across those clusters.

I chose the Porto Metropolitan Area, the second-largest urban area in Portugal, as the focus of this analysis. The main reasons for this choice are:
1. The region has seen a sharp increase in housing prices and is therefore a good use case for the problem identified;
2. Porto is a mid-sized city, with around 2.659.524 inhabitants and 1.600.000 tourists visiting annually, which makes it likely that there is enough data to conduct the analysis;
3. I have been living in the city for more than 2 years, so it will be interesting to contrast the results of the analysis with my empirical knowledge of the city.

Although Porto is the object of study, the methodology can be applied to any other city.

# Data <a name="data"></a>

***
The data used for this project comes from several sources: Foursquare, an open data repository (Porto Digital), and one of the largest listings of apartments for sale in Portugal (Imovirtual). The sources are detailed below:
- venues in the metropolitan area (https://pt.foursquare.com/)
- neighborhoods (https://raw.githubusercontent.com/publicos-pt/pt_regions/master/pt_regions/counties.json)
- metro stations (https://opendata.urbanplatform.portodigital.pt/dataset/metro-do-porto)
- bus stations (https://opendata.urbanplatform.portodigital.pt/dataset/informacao-publica-dos-servicos-de-transporte-colectivo-do-porto/resource/4be68555-5a0d-49d0-b62e-4e76735953a5)
- train stations (https://pt.wikipedia.org/wiki/CP_Urbanos_do_Porto)
- bike stations (https://opendata.urbanplatform.portodigital.pt/dataset/bicicletarios)
- car parking (https://opendata.urbanplatform.portodigital.pt/dataset/parques-de-estacionamento-municipais)
- bike path (https://opendata.urbanplatform.portodigital.pt/dataset/ciclovias)
- supermarkets (https://opendata.urbanplatform.portodigital.pt/dataset/supermercados-e-hipermercados)
- hotels (https://opendata.urbanplatform.portodigital.pt/dataset/hoteis_-aparthoteis-e-albergarias-centroides/resource/3c490f86-9fd4-4268-8db2-1f7b4832dc6f)
- kindergartens, elementary and high schools (https://opendata.urbanplatform.portodigital.pt/dataset/estabelecimentos-de-ensino-por-ano-letivo)
- hospitals, other medical centers and pharmacies (https://opendata.urbanplatform.portodigital.pt/dataset/equipamentos-de-saude)
- pools (https://opendata.urbanplatform.portodigital.pt/dataset/piscinas-centroides)
- post office (https://opendata.urbanplatform.portodigital.pt/dataset/correios-2013)
- tourist information offices (https://opendata.urbanplatform.portodigital.pt/dataset/postos-de-informacao-turistica-centroides)
- apartments for sale (https://www.imovirtual.com/comprar/apartamento/?search%5Bdescription%5D=1&locations%5B0%5D%5Bregion_id%5D=13&locations%5B1%5D%5Bregion_id%5D=3&locations%5B2%5D%5Bregion_id%5D=1&nrAdsPerPage=103323)

# 0. Import libraries

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
# import library to handle data in a vectorized manner
import numpy as np

# import library for data analysis
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# library to handle JSON files
import json

# convert an address into latitude and longitude values
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

# import library to handle requests
import requests

# transform JSON file into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated; pandas >= 1.0 exposes it at the top level)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium

# library to handle geospatial data as dataframes
import geopandas as gpd

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



Libraries imported.






## 1.1 Extract the neighborhoods

In [9]:
!wget -q -O ptneigh_data.json https://raw.githubusercontent.com/publicos-pt/pt_regions/master/pt_regions/counties.json
print('Data downloaded!')

Data downloaded!
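The `!wget` call above only works inside a notebook with `wget` available. As a hedged alternative, the same download can be sketched in plain Python with the standard library. The function name `download_json` and the `timeout` default are illustrative choices, not part of the original notebook; the URL is the one used in the cell above.

```python
import json
import urllib.request
from pathlib import Path

# URL copied from the cell above
COUNTIES_URL = (
    "https://raw.githubusercontent.com/publicos-pt/pt_regions/"
    "master/pt_regions/counties.json"
)


def download_json(url, dest, timeout=30):
    """Download `url`, save the raw bytes to `dest`, and return the parsed JSON.

    Hypothetical helper: parsing before writing means an invalid payload
    raises an error instead of leaving a broken file on disk.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    data = json.loads(raw.decode("utf-8"))
    Path(dest).write_bytes(raw)
    return data
```

Calling `download_json(COUNTIES_URL, "ptneigh_data.json")` would then produce the same local file that the next cell opens.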


## 1.2 Load and explore the data

In [13]:
with open('ptneigh_data.json','r', encoding='utf-8') as json_file:
    ptneigh_data = json.load(json_file)
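To explore the loaded list before reshaping it, a quick summary of the available keys and the number of records per district can help. The sketch below is one way to do that; `summarize_records` is a hypothetical helper, and `sample` is a small subset of records taken from the outputs shown in this notebook (only three of the fields are included).

```python
from collections import Counter


def summarize_records(records):
    """Return the set of keys seen and a Counter of records per district."""
    keys = set()
    per_district = Counter()
    for record in records:
        keys.update(record)  # collect every field name that appears
        per_district[record.get("district", "UNKNOWN")] += 1
    return keys, per_district


# Illustrative subset with the same field names as the real file
sample = [
    {"district": "LISBOA", "municipality": "TORRES VEDRAS",
     "name": "UNIÃO DAS FREGUESIAS DE A DOS CUNHADOS E MACEIRA"},
    {"district": "LEIRIA", "municipality": "CALDAS DA RAINHA",
     "name": "A DOS FRANCOS"},
    {"district": "LEIRIA", "municipality": "ÓBIDOS",
     "name": "A DOS NEGROS"},
    {"district": "BRAGA", "municipality": "GUIMARÃES",
     "name": "UNIÃO DAS FREGUESIAS DE ABAÇÃO E GÉMEOS"},
]

keys, per_district = summarize_records(sample)
print(sorted(keys))
print(per_district.most_common())
```

On the real data, `summarize_records(ptneigh_data)` would show how many entries each district contributes, which is useful context for the district filter applied later.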

## 1.3 Transform the data into a pandas dataframe

In [14]:
ptneigh_data[0] #have a look at the first item in the list to have an idea of the structure

{'NIF': 510833764,
 'area': 527229,
 'district': 'LISBOA',
 'municipality': 'TORRES VEDRAS',
 'name': 'UNIÃO DAS FREGUESIAS DE A DOS CUNHADOS E MACEIRA'}

In [15]:
column_names = ['District','Municipality','Neighborhood'] # define the dataframe columns
neighborhoods = pd.DataFrame(columns=column_names) # instantiate the dataframe
neighborhoods.head() # check the structure of the dataframe

| | District | Municipality | Neighborhood |
|---|---|---|---|


In [16]:
# collect one record per entry, then build the dataframe in a single call
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = [{'District': data['district'],
         'Municipality': data['municipality'],
         'Neighborhood': data['name']} for data in ptneigh_data]
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head() # quickly check if the first 5 rows are populated appropriately

| | District | Municipality | Neighborhood |
|---|---|---|---|
| 0 | LISBOA | TORRES VEDRAS | UNIÃO DAS FREGUESIAS DE A DOS CUNHADOS E MACEIRA |
| 1 | LEIRIA | CALDAS DA RAINHA | A DOS FRANCOS |
| 2 | LEIRIA | ÓBIDOS | A DOS NEGROS |
| 3 | PORTO | PÓVOA DE VARZIM | UNIÃO DAS FREGUESIAS DE AVER-O-MAR, AMORIM E T... |
| 4 | BRAGA | GUIMARÃES | UNIÃO DAS FREGUESIAS DE ABAÇÃO E GÉMEOS |


In [17]:
neighborhoods.shape # check the total number of entries

(3091, 3)
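Since `json_normalize` is imported at the top of the notebook, the same transformation can also be sketched with it. This is an alternative under stated assumptions, not the approach used in the cells above: `records_to_frame` is a hypothetical helper, the `records` list is a two-item sample in the shape of `ptneigh_data`, and `pd.json_normalize` requires pandas 1.0 or later.

```python
import pandas as pd

# Two sample records with the fields used in this section
records = [
    {"district": "LEIRIA", "municipality": "CALDAS DA RAINHA",
     "name": "A DOS FRANCOS"},
    {"district": "LEIRIA", "municipality": "ÓBIDOS",
     "name": "A DOS NEGROS"},
]


def records_to_frame(records):
    """Flatten a list of dicts and keep the three renamed columns."""
    frame = pd.json_normalize(records)
    frame = frame.rename(columns={
        "district": "District",
        "municipality": "Municipality",
        "name": "Neighborhood",
    })
    # any extra fields (e.g. NIF, area) are dropped by this selection
    return frame[["District", "Municipality", "Neighborhood"]]


print(records_to_frame(records))
```

Building the dataframe in one call avoids growing it row by row, which matters more as the list approaches the roughly 3,000 entries reported above.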

## 1.4 Keep only the neighborhoods within Porto Metropolitan Area (i.e. neighborhoods of Porto, Braga and Aveiro districts)

In [18]:
porto = neighborhoods['District'] == "PORTO" # Create variable with TRUE if district is PORTO
braga = neighborhoods['District'] == "BRAGA" # Create variable with TRUE if district is BRAGA
aveiro = neighborhoods['District'] == "AVEIRO" # Create variable with TRUE if district is AVEIRO
porto_neigh = neighborhoods[porto | braga | aveiro] # create new dataframe with only Porto Metropolitan Area neighborhoods 
porto_neigh = porto_neigh.drop_duplicates() # remove any duplicates created
print('Porto Metropolitan area has {} neighborhoods.'.format(porto_neigh.shape[0])) # check the total number of neighborhoods

Porto Metropolitan area has 737 neighborhoods.
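The three boolean masks above can be collapsed into a single `isin` test. The sketch below is a compact equivalent of that filter; `filter_districts` and `PMA_DISTRICTS` are illustrative names, the sample rows come from the `head()` output shown earlier, and the index reset is an optional addition that the original cell does not perform.

```python
import pandas as pd

# Same three districts as in the cell above
PMA_DISTRICTS = ["PORTO", "BRAGA", "AVEIRO"]


def filter_districts(frame, districts):
    """Keep rows whose District is in `districts`, without duplicates."""
    mask = frame["District"].isin(districts)
    # reset_index is an assumption added here for a clean 0..n-1 index
    return frame[mask].drop_duplicates().reset_index(drop=True)


# Sample rows taken from the head() output earlier in this section
sample = pd.DataFrame({
    "District": ["LISBOA", "LEIRIA", "LEIRIA", "PORTO", "BRAGA"],
    "Municipality": ["TORRES VEDRAS", "CALDAS DA RAINHA", "ÓBIDOS",
                     "PÓVOA DE VARZIM", "GUIMARÃES"],
    "Neighborhood": [
        "UNIÃO DAS FREGUESIAS DE A DOS CUNHADOS E MACEIRA",
        "A DOS FRANCOS",
        "A DOS NEGROS",
        "UNIÃO DAS FREGUESIAS DE AVER-O-MAR, AMORIM E T...",  # truncated in the output
        "UNIÃO DAS FREGUESIAS DE ABAÇÃO E GÉMEOS",
    ],
})

print(filter_districts(sample, PMA_DISTRICTS))
```

Keeping the district list in one place makes it easier to adjust the study area later, for example when applying the methodology to another city.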


## 1.5 Get the coordinates for each neighborhood and append them to the dataframe
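One possible way to carry out this step, given the `Nominatim` import at the top of the notebook, is sketched below. It is a minimal illustration and not necessarily the method used in the rest of the notebook: the helper names, the address format (`"Neighborhood, Municipality, Portugal"`), the `user_agent` string, and the one-second delay are all assumptions. The geocoder is passed in as a callable so the logic can be tried without network access.

```python
import pandas as pd


def add_coordinates(frame, geocode, country="Portugal"):
    """Return a copy of `frame` with Latitude and Longitude columns.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None when nothing is found.
    """
    latitudes, longitudes = [], []
    for row in frame.itertuples(index=False):
        # assumed address format; adjust if the geocoder resolves it poorly
        address = f"{row.Neighborhood}, {row.Municipality}, {country}"
        location = geocode(address)
        latitudes.append(location.latitude if location else None)
        longitudes.append(location.longitude if location else None)
    return frame.assign(Latitude=latitudes, Longitude=longitudes)


def make_nominatim_geocoder(user_agent="porto_explorer"):
    """Build a rate-limited Nominatim geocoder (requires geopy and network access)."""
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    # wait at least one second between requests to respect the public service
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

With these helpers, `porto_neigh = add_coordinates(porto_neigh, make_nominatim_geocoder())` would add the two columns. Neighborhoods that cannot be resolved end up with missing values, so they can be inspected or retried separately.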