# Project 5: mini machine learning project

Maaike de Jong 

Ironhack Amsterdam Data Analytics 2020


This project uses the data from project 2: Sustainability in Amsterdam

In this project I will use Machine Learning models to see to what extent green indicator variables can predict income and Amsterdam city district. My questions are:

- Q1: How well do energy label scores and number of solar panels predict income?
- Q2: Can energy scores, solar panels and income predict the city district?

I used the following datasets:
From the [maps data portal](https://maps.amsterdam.nl/open_geodata/) of the Amsterdam city council:
- Solar panels (Zonnepanelen)
- Postcodes (PC6_VLAKKEN_BAG.csv)
- Neighbourhoods (GEBIED_BUURTEN.csv)
- City districts (GEBIED_STADSDELEN.csv)

From [Overheid.nl](https://www.overheid.nl):
- Energy labels in Amsterdam
- Income per Amsterdam area

All datasets can be found in [this google folder](https://drive.google.com/drive/folders/19VhvQbT89SLKaLnWsP20jhrTrqCvwMbd) 

This is the second of two notebooks. Part one deals with the data wrangling that combines variables from the different datasets into the dataset analysed here.

In [None]:
# import packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

from scipy.stats import chi2_contingency

In [None]:
df = pd.read_csv('final_data_income.csv')
df.head()

In [None]:
df = df.rename(columns = {'energy_class_score': 'energy_score', 'mean_income (x 1.000 euro)': 'income'})
df.head()

In [None]:
# inspect the data

df.dtypes

In [None]:
df.shape

In [None]:
# check missing values
df.isnull().sum()

# no missing values

In [None]:
# inspect distribution of the variables

df['energy_score'].hist()

In [None]:
df['solar_panels'].hist()


In [None]:
df[df['solar_panels'] == 0].shape

# 154 of the 418 buurten have 0 solar panels. This might cause imbalance

In [None]:
df['income'].hist()

In [None]:
# investigate correlations

df[['energy_score', 'solar_panels', 'income']].corr()

# no meaningful correlations

## Apply different types of supervised models

First, linear regression with energy score and solar panels predicting income

In [None]:
# define predicting and predicted variables

y = df['income']
X = df[['energy_score', 'solar_panels']] 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:

LM = linear_model.LinearRegression()

In [None]:
LM.fit(X_train, y_train)

In [None]:
y_pred = LM.predict(X_test)
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
# calculate the R² score on the training data, which indicates how well the fitted regression model fits the data it was trained on

y_train_pred = LM.predict(X_train)
r2_score(y_train, y_train_pred)

In [None]:
# calculate the r squared score between y_pred and y_test

y_test_pred = LM.predict(X_test)
r2_score(y_test, y_test_pred)

In [None]:
# These scores show that the model doesn't fit, so the variables are likely unsuitable predictors of income
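
To make the interpretation of these R² values concrete, here is a minimal sketch of the formula behind `r2_score`, R² = 1 - SS_res / SS_tot. The incomes and predictions below are made up for illustration and are not taken from the dataset; the helper `r_squared` is a hypothetical name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up incomes (x 1,000 euro), purely illustrative
y_true = [25.0, 30.0, 35.0, 50.0]

mean_baseline = [35.0] * 4          # always predict the mean of y_true
perfect = list(y_true)              # predictions identical to the truth
poor = [50.0, 25.0, 30.0, 35.0]     # predictions worse than the mean

r2_baseline = r_squared(y_true, mean_baseline)
r2_perfect = r_squared(y_true, perfect)
r2_poor = r_squared(y_true, poor)
```

An R² near 0 means the model does no better than always predicting the mean income, and a negative R² on the test set means it does worse than that.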

### Second, try supervised learning classification models

Again, test whether energy score and solar panels can predict income, using the following models:
- Logistic Regression
- K-nearest neighbor

In [None]:
# make income into a categorical variable
# check range

df['income'].describe()


In [None]:
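# divide income into quartile categories, using the min, 25%, 50%, 75% and max values from describe() as bin edges
# include_lowest=True makes sure a value equal to the lowest edge is not dropped

df['income_cat'] = pd.cut(df['income'], bins=[20.0, 26.85, 35.4, 42.8, 92.7], labels=['low','medium-low','medium-high','high'], include_lowest=True)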
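
As an alternative to hard-coding the bin edges, `pd.qcut` can derive quartile bins from the data itself. This is only a sketch on made-up incomes; the names `demo_income` and `income_cat` are assumptions for illustration and the values are not from the dataset.

```python
import pandas as pd

# Made-up incomes (x 1,000 euro), purely illustrative
demo_income = pd.Series([21, 24, 27, 30, 36, 40, 45, 90])

# q=4 splits the values at the 25%, 50% and 75% quantiles
income_cat = pd.qcut(
    demo_income,
    q=4,
    labels=['low', 'medium-low', 'medium-high', 'high'],
)

# Number of observations per category
category_counts = income_cat.value_counts()
```

With this approach the categories stay balanced even if the underlying data changes, because the edges are recomputed each time.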

In [None]:
df.head()

In [None]:
y = df['income_cat']
X = df[['energy_score', 'solar_panels']] 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:
# Initiate Logistic Regression model

logreg = LogisticRegression()

In [None]:
logreg.max_iter=1000
logreg.fit(X_train, y_train)

In [None]:
# Predict the income category for the test set and assign the result to y_pred

y_pred = logreg.predict(X_test)
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
# print confusion matrix:

confusion_matrix(y_test, y_pred)

In [None]:
# print accuracy score
accuracy_score(y_test, y_pred)

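# the accuracy is still low; if the four income categories are roughly equal in size, chance level is about 0.25
# note that accuracy cannot be compared directly with the R² of the linear regression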

In [None]:
# Now a K-nearest neighbors approach

In [None]:
KNN3 = KNeighborsClassifier(n_neighbors = 3)
KNN3.fit(X_train, y_train)

In [None]:
y_pred = KNN3.predict(X_test)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
accuracy_score(y_test, y_pred)
# this score is still not great, but better than logistic regression

In [None]:
# same model with k = 5

KNN5 = KNeighborsClassifier(n_neighbors = 5)
KNN5.fit(X_train, y_train)

In [None]:
y_pred = KNN5.predict(X_test)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
accuracy_score(y_test, y_pred)
# this score is slightly lower than k=3, so not an improvement
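
Because each accuracy above comes from a single random train/test split, small differences between k = 3 and k = 5 may be noise. A hedged sketch of how cross-validation could give a steadier comparison is shown below; it runs on randomly generated data (`X_demo`, `y_demo` are made up), so the resulting numbers say nothing about the Amsterdam dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 2 features, 4 classes (pure noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = rng.integers(0, 4, size=200)

# Mean accuracy over 5 folds for several values of k
cv_accuracy = {}
for k in (3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    cv_accuracy[k] = scores.mean()
```

On the real data, `X_demo` and `y_demo` would be replaced by the feature columns and `income_cat`.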

### Supervised classification models predicting stadsdeel

Here I use the variables energy score, solar panels and income to predict stadsdeel.

In [None]:
# First, with unscaled variables

y = df['Stadsdeel']
X = df[['energy_score', 'solar_panels', 'income']] 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.max_iter=10000
logreg.fit(X_train, y_train)

# had to increase the nr of iterations to 10000 to avoid a convergence warning

In [None]:
# Predict the stadsdeel for the test set and assign the result to y_pred

y_pred = logreg.predict(X_test)
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
# print accuracy score
accuracy_score(y_test, y_pred)

# The score is 0.38
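
To judge an accuracy like this, it helps to compare it with the accuracy of always predicting the most common stadsdeel. The sketch below shows that baseline on made-up labels; the district counts are invented and `majority_baseline_accuracy` is a hypothetical helper, not part of the analysis above.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up district labels, not the real distribution of buurten
demo_labels = ['Zuid'] * 4 + ['West'] * 3 + ['Noord'] * 2 + ['Oost'] * 1

baseline = majority_baseline_accuracy(demo_labels)
```

On the real data this could be applied to `y_test`; a model is only informative to the extent that it beats this baseline.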

In [None]:
# Now with K-nearest neighbors

KNN3 = KNeighborsClassifier(n_neighbors = 3)
KNN3.fit(X_train, y_train)

In [None]:
y_pred = KNN3.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)
# this score is similar to that of logistic regression

In [None]:
# same model with k = 5

KNN5 = KNeighborsClassifier(n_neighbors = 5)
KNN5.fit(X_train, y_train)

In [None]:
y_pred = KNN5.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)
# this score is similar, not an improvement

In [None]:
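# Now check whether it helps to scale the data
# note: the scaler is fitted on the full dataset here, before the train/test split

scaler = StandardScaler()

X_scaled = scaler.fit_transform(df[['energy_score', 'solar_panels', 'income']])
X_scaled = pd.DataFrame(X_scaled, columns=['energy_score', 'solar_panels', 'income'], index=df.index)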

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20)

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.max_iter=1000
logreg.fit(X_train, y_train)

In [None]:
# Predict the stadsdeel for the test set and assign the result to y_pred

y_pred = logreg.predict(X_test)
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
# print accuracy score
accuracy_score(y_test, y_pred)

# similar accuracy score, not an improvement 
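
One possible refinement of the scaling step is to put the scaler and the classifier in a single scikit-learn pipeline, so that the scaler is fitted on the training data only. The sketch below uses randomly generated data with three features on different scales (`X_demo`, `y_demo` are made up), so its accuracy is meaningless; it only illustrates the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features on very different scales, 3 classes
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[5, 10, 35], scale=[1, 20, 10], size=(300, 3))
y_demo = rng.integers(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# The pipeline fits the scaler on X_train only, then fits the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# score() scales X_test with the training statistics before predicting
test_accuracy = pipe.score(X_test, y_test)
scaler_means = pipe.named_steps['standardscaler'].mean_
```

The same pipeline could wrap `KNeighborsClassifier`, which relies on distances and is therefore usually more sensitive to feature scale than logistic regression.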

## Now try unsupervised learning models

In [None]:
# Here I will use energy labels, solar panels and income data to create clusters
# and check to what extent these align with stadsdelen

# First, make two versions of the dataset: one with the unscaled relevant columns and one with the scaled columns

ULdf = df[['energy_score', 'solar_panels', 'income']]

scaler = StandardScaler()
ULdf_scaled = scaler.fit_transform(ULdf)
ULdf_scaled = pd.DataFrame(ULdf_scaled)

In [None]:
# Data clustering with K-means

# unscaled data (KMeans() uses its default of n_clusters=8)

kmeans = KMeans().fit(ULdf)
df['kmeans_labels'] = kmeans.labels_

In [None]:
# scaled data

kmeans2 = KMeans().fit(ULdf_scaled)
df['kmeans_scaled_labels'] = kmeans2.labels_
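
`KMeans()` is called here with its default number of clusters. As a hedged illustration of how the number of clusters could be chosen instead, the sketch below compares silhouette scores for several values of k. It runs on synthetic, well-separated blobs (`X_demo` is made up), not on the Amsterdam data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of points
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score (higher is better) for each candidate number of clusters
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    silhouettes[k] = silhouette_score(X_demo, labels)

best_k = max(silhouettes, key=silhouettes.get)
```

Setting `random_state` also makes the cluster labels reproducible between runs.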

In [None]:
# Data clustering with DBSCAN

# unscaled data

dbscan = DBSCAN(eps=0.5).fit(ULdf)

df['dbscan_labels'] = dbscan.labels_

In [None]:
# scaled data

dbscan2 = DBSCAN(eps=0.5).fit(ULdf_scaled)

df['dbscan_scaled_labels'] = dbscan2.labels_

In [None]:
df.head()

In [None]:
df['kmeans_labels'].value_counts()

In [None]:
df['kmeans_scaled_labels'].value_counts()

In [None]:
df['dbscan_labels'].value_counts()

In [None]:
df['dbscan_scaled_labels'].value_counts()

In [None]:
# looking at these outcomes, I'm going to go with the kmeans analysis with the scaled data

df.dtypes

In [None]:
# perform a Chi-square test of independence to check whether there is a relationship between Stadsdelen and the kmeans clusters

# format the data

df_x2 = df[['Stadsdeel', 'kmeans_scaled_labels']].copy()  # copy to avoid a SettingWithCopyWarning when modifying it later
df_x2.head()

In [None]:
# convert the kmeans labels to strings so they are treated as categories

df_x2['kmeans_scaled_labels'] = df_x2['kmeans_scaled_labels'].astype(str)

In [None]:
df_x2.dtypes

In [None]:
dfx2 = df_x2.groupby('Stadsdeel')['kmeans_scaled_labels'].value_counts().unstack().fillna(0)

In [None]:
dfx2

In [None]:
sns.heatmap(dfx2)

# the heatmap shows that the clusters are unevenly distributed across the stadsdelen, which suggests some dependence

In [None]:
chi2_contingency(dfx2)

# the p-value is very small (1.6140032016367972e-30), so there is a highly statistically significant relationship between the Stadsdelen and the kmeans clusters
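
A small p-value shows that a relationship exists, but not how strong it is. The sketch below works through the chi-square statistic by hand and adds Cramér's V as an effect size, on a made-up 3x3 contingency table. The helper `chi_square_and_cramers_v` is a hypothetical name, and the counts are not the ones from `dfx2`. For tables larger than 2x2 this is the same statistic that `chi2_contingency` reports.

```python
from math import sqrt

def chi_square_and_cramers_v(table):
    """Return the chi-square statistic and Cramér's V for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Cramér's V rescales chi2 to the range 0 (no association) to 1 (perfect)
    k = min(len(row_totals), len(col_totals))
    cramers_v = sqrt(chi2 / (n * (k - 1)))
    return chi2, cramers_v

# Made-up counts: rows could be districts, columns could be clusters
demo_table = [[20, 5, 5],
              [5, 20, 5],
              [5, 5, 20]]

chi2_stat, v = chi_square_and_cramers_v(demo_table)
```

When many expected counts are small, as can happen with many clusters and districts, the chi-square approximation becomes less reliable, so the p-value should be read with some caution.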