# TPM034A Machine Learning for socio-technical systems 
## `Assignment 01: Discover, explore and visualise data`

**Delft University of Technology**<br>
**Q2 2022**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instruction below). 

### `Workspace set-up`
**Option 1: Google Colab**<br>
Uncomment the code lines in the following cell if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM34A/Q2_2022
#!pip install -r Q2_2022/requirements_colab.txt
#!mv "/content/Q2_2022/Assignments/assignment_01/data" /content/data

**Option 2: Local environment**<br>
Uncomment the following cell if you are running this notebook on your local environment. This will install all dependencies on your Python version.

In [None]:
#!pip install -r requirements.txt

## `Application: Liveability and affordable housing in Amsterdam` <br>

### **Introduction**
There is a widespread sense that affordable housing for middle-income households is under pressure. Especially for new entrants to the housing market (i.e. those who do not yet own a house), affordable houses to buy in pleasant neighbourhoods are in short supply. Entrants to the housing market are typically people in their 20s and 30s.<br>

The municipality of Amsterdam would like to tackle this issue (see https://openresearch.amsterdam/en/page/77950/housing-crisis for articles on the subject). However, at present, the municipality lacks insight into the extent to which access to affordable houses has deteriorated. <br>

*You are asked to assist the municipality of Amsterdam in investigating **whether** and **where** access to affordable houses has deteriorated.*<br>

### **Data**

You have access to four data sets:
1. Real-estate prices in Amsterdam, at buurt level
1. Liveability scores in the Netherlands, at buurt level
1. Population statistics in the Netherlands, at buurt level
1. Buurten boundaries in the Netherlands (GIS)

### **Notes**
- In the liveability scores dataset, the column *versie* lists the different versions of the liveability score. Use only the 3rd version: you may filter this column to keep *Leefbaarometer 3.0* only.
- You may assume that the population statistics and geospatial data did not change substantially between 2014 and 2020. Thus, you may assume both apply to 2014 and 2020.
- For Population statistics (3rd dataset), [this document](data/buurt/metadata_buurt.csv) provides a brief explanation of the features.
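
The version filter from the first note could be sketched as follows. This is a minimal illustration on a toy table: only the column `versie` and the value `Leefbaarometer 3.0` come from the notes above, while `buurt_code` and `score` are hypothetical column names that may differ in the real file.

```python
import pandas as pd

# Toy stand-in for the liveability table. Only `versie` and the value
# "Leefbaarometer 3.0" come from the notes; the other columns are hypothetical.
liveability = pd.DataFrame({
    "buurt_code": ["BU001", "BU001", "BU002", "BU002"],
    "versie": ["Leefbaarometer 2.0", "Leefbaarometer 3.0",
               "Leefbaarometer 2.0", "Leefbaarometer 3.0 "],
    "score": [3.9, 4.1, 4.0, 4.3],
})

# Strip stray whitespace before comparing, then keep version 3.0 only.
is_v3 = liveability["versie"].str.strip() == "Leefbaarometer 3.0"
liveability_v3 = liveability[is_v3].reset_index(drop=True)
print(liveability_v3)
```

Stripping whitespace first is a defensive choice, in case the version labels in the raw file carry trailing spaces or tabs.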

### **Tasks and grading**

Your assignment is divided into 3 subtasks, followed by a short qualitative reflection: (1) Data preparation, (2) Data exploration, (3) Assess the affordability of liveable neighbourhoods and (4) Qualitative reflection. In total, 10 points can be earned in this assignment. The weight per part is shown below. 

1.  **Data preparation: construct data from multiple data sources.** [2 pnt]
    1. Load the four datasets and show a preview of each dataset's structure (some DataFrame rows).
    1. Prepare the table data (non-GIS) to have two different DataFrames (for 2014 and 2020) that contain the following information:
        - the liveability data for the year of interest, using the 3rd version of the Leefbaarometer
        - population data 
        - Real-estate prices
        - at the buurt level
        - *Make sure to filter the data and remove NULL (NaN values) if required*
    1. Add the geographic component of the buurten to your data.
1. **Data exploration: discover and visualise data.** [4 pnt]
    1. Investigate the statistical distribution of the real-estate price levels and liveability in both years, using either a histogram or a CDF.
    1. Visualise the correlation between real-estate prices and liveability in Amsterdam, at the buurt level, with a scatter plot for each year. Then, visualise spatially real-estate prices and liveability in Amsterdam for 2014 and 2020 (use the same color scale for years 2014 and 2020).
1. **Assess the change in affordability of liveable neighbourhoods.** [3 pnt]
    1. Explore how the change in liveability associates with a change in real-estate prices, using a scatter plot.
    1. Compute the ratio of the liveability score over the real-estate price for both years, and show how the distribution of the ratio of liveability over real-estate price has changed between the two years.
    1. Determine the 5 buurten in which the ratio of liveability over real-estate price has deteriorated most.
    1. Determine whether the number of buurten with price < 5k euro/m2 and a liveability ratio > 1/k euro has decreased in 2020, compared to 2014
1. **Qualitative reflection on machine learning and generalisation: There are some buurten in Amsterdam for which real-estate price data are missing. Suppose the municipality of Amsterdam asks you whether you can create a machine learning model that can predict real-estate prices from the liveability index. Do you think this is possible? Explain your answer (conceptually).** [1 pnt]

### **Submission**
- The deadline for this assignment is **Wed, 23 November 2022** 
- Use **Python 3.7 or above**
- Submit your work on Brightspace as a zip file containing the fully executed notebook (.ipynb)

In [None]:
import os
from os import getcwd
from pathlib import Path

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.api import OLS, add_constant, tools
pd.set_option('display.max_columns', None)

### 1. Data preparation: construct data from multiple data sources [2 pnt]
#### 1.1 Load the four datasets and show a preview of each dataset's structure (some DataFrame rows).

#### 1.2 Prepare the table data (non-GIS) to have two different DataFrames (for 2014 and 2020) that contain the following information:
- the liveability data for the year of interest, using the 3rd version of the Leefbaarometer
- population data 
- Real-estate prices
- at the buurt level
- *Make sure to filter the data and remove NULL (NaN values) if required*
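
One possible shape for this step is sketched below on toy tables, shown for 2014 only. All column names (`buurt_code`, `score`, `inhabitants`, `price_m2`) are hypothetical placeholders, and the choice of inner joins followed by dropping rows with missing values is an assumption about what "filter and remove NULL" should mean here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the three tabular sources. Every column name is hypothetical.
liveability_2014 = pd.DataFrame({"buurt_code": ["BU001", "BU002", "BU003"],
                                 "score": [4.1, 4.3, 3.9]})
population = pd.DataFrame({"buurt_code": ["BU001", "BU002", "BU003"],
                           "inhabitants": [1200, 800, 1500]})
prices_2014 = pd.DataFrame({"buurt_code": ["BU001", "BU002", "BU004"],
                            "price_m2": [3500.0, np.nan, 4200.0]})

# Inner joins keep only buurten present in all three tables;
# dropna then removes rows where the score or the price is missing.
df_2014 = (
    liveability_2014
    .merge(population, on="buurt_code", how="inner")
    .merge(prices_2014, on="buurt_code", how="inner")
    .dropna(subset=["score", "price_m2"])
    .reset_index(drop=True)
)
print(df_2014)
```

In this toy example only one buurt survives: one is absent from the price table and another has a missing price. Comparing row counts before and after each step is a quick way to see how much data the filtering removes.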

#### 1.3 Add the geographic component of the buurten to your data

### 2. Data exploration: discover and visualise data [4 pnt]
#### 2.1 Investigate the statistical distribution of the real-estate price levels and liveability in both years
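
If a CDF is preferred over a histogram, an empirical CDF can be computed directly with NumPy. The sketch below uses made-up prices, and the helper name `ecdf` is hypothetical.

```python
import numpy as np

def ecdf(values):
    """Return the sorted non-missing values and their cumulative proportions."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])           # drop NaN, then sort ascending
    y = np.arange(1, x.size + 1) / x.size  # share of observations <= x
    return x, y

# Made-up prices in euro per m2, for illustration only.
x, y = ecdf([3500.0, 4200.0, np.nan, 2900.0, 5100.0])
print(x, y)
```

The two arrays can then be drawn with `plt.step(x, y, where="post")`, once per year on the same axes, so that a shift between 2014 and 2020 shows up as a horizontal displacement of the curve.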

#### 2.2 Visualise the correlation between real-estate prices and liveability in Amsterdam, at the buurt level, with a scatter plot for each year. Then, visualise spatially real-estate prices and liveability in Amsterdam for 2014 and 2020 (use the same color scale for years 2014 and 2020).
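
To use the same colour scale for both years, one option is to compute the colour limits from the pooled values of the two years. The sketch below uses made-up prices, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up prices per buurt for the two years (illustration only).
price_2014 = pd.Series([2900.0, 3500.0, 4200.0])
price_2020 = pd.Series([4100.0, 5200.0, 6300.0])

# Pool both years so that one colour maps to the same value on both maps.
pooled = pd.concat([price_2014, price_2020])
vmin, vmax = pooled.min(), pooled.max()
print(vmin, vmax)
```

These limits can then be passed as `vmin=vmin, vmax=vmax` to each GeoDataFrame `.plot(column=...)` call, so that the 2014 and 2020 maps are directly comparable.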

### 3. Assess the change in affordability of liveable neighbourhoods [3 pnt]

#### 3.1 Explore how the change in liveability associates with a change in real-estate prices, using a scatter plot.
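
The changes per buurt can be obtained by joining the two yearly tables and subtracting. This is a sketch on toy data with hypothetical column names (`buurt_code`, `score`, `price_m2`). It uses absolute differences, which is one assumption among several reasonable definitions of "change".

```python
import pandas as pd

# Toy yearly tables; column names are hypothetical.
df_2014 = pd.DataFrame({"buurt_code": ["BU001", "BU002", "BU003"],
                        "score": [4.0, 4.2, 3.8],
                        "price_m2": [3000.0, 4000.0, 2500.0]})
df_2020 = pd.DataFrame({"buurt_code": ["BU001", "BU002", "BU003"],
                        "score": [4.1, 4.2, 4.0],
                        "price_m2": [4500.0, 5000.0, 4000.0]})

# Join on the buurt identifier; suffixes keep the two years apart.
change = df_2014.merge(df_2020, on="buurt_code", suffixes=("_2014", "_2020"))
change["d_score"] = change["score_2020"] - change["score_2014"]
change["d_price"] = change["price_m2_2020"] - change["price_m2_2014"]
print(change[["buurt_code", "d_score", "d_price"]])
```

A scatter plot of `d_score` against `d_price` (for instance with `sns.scatterplot`) then shows whether buurten that became more liveable also became more expensive.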

#### 3.2 Compute the ratio of the liveability score over the real-estate price for both years, and show how the distribution of the ratio of liveability over real-estate price has changed between the two years.
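
A minimal sketch of the ratio computation follows, assuming a table that already holds both years side by side. The column names are hypothetical, and prices are assumed to be expressed in thousand euro per m2.

```python
import pandas as pd

# Toy table with both years side by side; names and units are assumptions.
df = pd.DataFrame({
    "buurt_code": ["BU001", "BU002"],
    "score_2014": [4.0, 4.2], "price_2014": [2.0, 4.0],  # k euro per m2
    "score_2020": [4.0, 4.2], "price_2020": [4.0, 5.0],
})

# Ratio of liveability over price, computed the same way for each year.
for year in (2014, 2020):
    df[f"ratio_{year}"] = df[f"score_{year}"] / df[f"price_{year}"]

# Summary statistics give a first impression of how the distribution moved.
print(df[["ratio_2014", "ratio_2020"]].describe())
```

Overlaying the two ratio columns in one histogram or CDF plot makes the shift between the years visible at a glance.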

#### 3.3 Determine the 5 buurten in which the ratio of liveability over real-estate price has deteriorated most.
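
One way to rank the buurten is by the change in the ratio between the two years. The sketch below uses made-up ratios and hypothetical column names. It measures deterioration as the absolute difference, which is an assumption; a relative change would be an equally defensible choice and may give a different ranking.

```python
import pandas as pd

# Made-up ratios for seven buurten (illustration only).
ratios = pd.DataFrame({
    "buurt_code": [f"BU{i:03d}" for i in range(1, 8)],
    "ratio_2014": [2.0, 1.5, 1.2, 1.8, 1.0, 1.6, 1.4],
    "ratio_2020": [1.0, 1.3, 1.1, 0.9, 0.95, 1.0, 1.4],
})

# Negative values mean the ratio fell, i.e. affordability deteriorated.
ratios["ratio_change"] = ratios["ratio_2020"] - ratios["ratio_2014"]

# The five most negative changes are the five largest deteriorations.
worst5 = ratios.nsmallest(5, "ratio_change")
print(worst5[["buurt_code", "ratio_change"]])
```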

#### 3.4 Determine whether the number of buurten with price < 5k euro/m2 and a liveability ratio > 1/k euro has decreased in 2020, compared to 2014
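
The count can be obtained with a boolean mask per year. The sketch below rests on an interpretation that should be checked against the real data: prices are assumed to be in thousand euro per m2, so that the thresholds become `price < 5` and `ratio > 1`. The column names and the helper `count_affordable` are hypothetical.

```python
import pandas as pd

# Toy data; prices assumed to be in thousand euro per m2.
df = pd.DataFrame({
    "price_2014": [3.0, 4.0, 4.5, 6.0],
    "score_2014": [4.0, 4.1, 4.2, 4.3],
    "price_2020": [4.5, 5.5, 6.0, 8.0],
    "score_2020": [4.1, 4.1, 4.3, 4.3],
})

def count_affordable(data, year, max_price=5.0, min_ratio=1.0):
    """Count rows with price below max_price and score/price above min_ratio."""
    price = data[f"price_{year}"]
    ratio = data[f"score_{year}"] / price
    return int(((price < max_price) & (ratio > min_ratio)).sum())

n_2014 = count_affordable(df, 2014)
n_2020 = count_affordable(df, 2020)
print(n_2014, n_2020, n_2020 < n_2014)
```

In this toy example two buurten meet both conditions in 2014 and none do in 2020. With the real data, the same comparison answers whether the number has decreased.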

### 4. Qualitative reflection on machine learning and generalisation [1 pnt]