# Project 5: mini machine learning project

Maaike de Jong 

Ironhack Amsterdam Data Analytics 2020

### Notebook 1: data wrangling


This project uses the data from project 2: Sustainability in Amsterdam.

In this project I use machine learning models to see to what extent green indicator variables can predict income and the Amsterdam city district. My questions are:

Q1: How well do energy label scores and number of solar panels predict income?  
Q2: Can energy scores, solar panels and income predict the city district?

I used the following datasets:  
From the [maps data portal](https://maps.amsterdam.nl/open_geodata/) of the Amsterdam city council:

- Solar panels (Zonnepanelen)
- Postcodes (PC6_VLAKKEN_BAG.csv)
- Neighbourhoods (GEBIED_BUURTEN.csv)
- City districts (GEBIED_STADSDELEN.csv)

From [Overheid.nl](https://www.overheid.nl/):

- Energy labels in Amsterdam
- Income per Amsterdam area

All datasets can be found in this [Google Drive folder](https://drive.google.com/drive/folders/19VhvQbT89SLKaLnWsP20jhrTrqCvwMbd).

This is the first of two notebooks. Here I combine variables from the different datasets into the single dataset used for the analysis in notebook 2.

In [None]:
# Import packages

import geopandas
import numpy as np
import pandas as pd
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
from shapely import wkt

In [None]:
# Import energy label data
energy_labels = pd.read_csv('Energielabels_selectie gemeentes Amsterdam 4-1-2012.csv', sep = ';')
energy_labels.head()

In [None]:
energy_labels.shape

In [None]:
energy_labels['woningtype'].value_counts()

In [None]:
# select postcode and energy class columns into new df
energy_labels_df = energy_labels[['Pand_postcode', 'PandVanMeting_energieklasse']]
energy_labels_df = energy_labels_df.rename(columns = {'PandVanMeting_energieklasse':'energy_class', 'Pand_postcode': 'postcode'})
energy_labels_df.head()

In [None]:
# check energy_classes
labels_list = sorted(list(set(energy_labels_df['energy_class'])))
labels_list

In [None]:
# add extra column with energy classes converted to a numerical score (G = 1 ... A++ = 9)
# .map keeps the column numeric; labels missing from the dictionary become NaN
energy_labels_df['energy_class_score'] = energy_labels_df['energy_class'].map({'A++': 9, 'A+': 8, 'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1})
energy_labels_df.head()
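
One way to check that every energy label receives a score is to map the labels with a dictionary and then look for missing values. The sketch below uses made-up toy data; the label `'X'` and the names `toy`, `score_map` and `unmapped` are only illustrative and do not come from the real file.

```python
import pandas as pd

# Toy data, not taken from the real file; 'X' stands in for an unexpected label.
toy = pd.DataFrame({'energy_class': ['A++', 'B', 'G', 'X']})

# Same ordinal scale as in this notebook: G is the worst (1), A++ the best (9).
score_map = {'A++': 9, 'A+': 8, 'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1}

# .map returns NaN for labels that are missing from the dictionary,
# which makes unexpected labels easy to find.
toy['energy_class_score'] = toy['energy_class'].map(score_map)
unmapped = toy.loc[toy['energy_class_score'].isna(), 'energy_class'].unique().tolist()
print(unmapped)
```

An empty list would mean that all labels were converted.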

In [None]:
# add buurtcodes
postcodes = pd.read_csv('PC6_VLAKKEN_BAG.csv', sep = ';')
postcodes = postcodes.rename(columns = {'Postcode6':'postcode'})
postcodes.head()

In [None]:
pc_select = postcodes[['postcode', 'Buurtcode']]

In [None]:
# join buurten 

energy_buurten = pd.merge(energy_labels_df, pc_select, on = 'postcode', how = 'left')
energy_buurten.head()
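
A left join keeps every energy label, including those whose postcode has no match in the postcode table. A minimal sketch of how such rows could be counted, assuming toy stand-ins (`labels`, `lookup`) for the two dataframes and invented postcode and buurt values:

```python
import pandas as pd

# Toy stand-ins for energy_labels_df and pc_select (values are made up).
labels = pd.DataFrame({'postcode': ['1011AB', '1011AC', '9999ZZ'],
                       'energy_class_score': [7, 4, 2]})
lookup = pd.DataFrame({'postcode': ['1011AB', '1011AC'],
                       'Buurtcode': ['A00a', 'A00b']})

# indicator=True adds a '_merge' column telling where each row was found.
merged = pd.merge(labels, lookup, on='postcode', how='left', indicator=True)

# 'left_only' marks labels whose postcode is absent from the lookup table.
n_unmatched = int((merged['_merge'] == 'left_only').sum())
print(n_unmatched)
```

Rows flagged as `left_only` end up with a missing `Buurtcode` and are dropped by a later `groupby` on that column.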

In [None]:
# create df with average energy class scores by buurt

energy_buurt = energy_buurten.groupby('Buurtcode')[['energy_class_score']].mean().reset_index()

energy_buurt.head()

In [None]:
energy_buurt.shape

In [None]:
#import solar panel csv as pandas dataframe
solar_panels = pd.read_csv('ZONNEPANELEN2017.csv', sep = ';')
solar_panels.head()


In [None]:
solar_panels.shape

In [None]:
# Replace the comma between the coordinates with a space so the strings are valid WKT,
# then parse them with the shapely.wkt sub-module
solar_panels['WKT_LAT_LNG'] = solar_panels['WKT_LAT_LNG'].str.replace(',',' ')

solar_panels['WKT_LAT_LNG'] = solar_panels['WKT_LAT_LNG'].apply(wkt.loads)
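
To illustrate the cleaning and parsing step on a single value, here is a small sketch. The string `raw` is hypothetical; it only assumes that the raw column separates the two coordinates with a comma, as the replacement step suggests.

```python
from shapely import wkt

# Hypothetical raw value with a comma between the two coordinates,
# the layout assumed for the 'WKT_LAT_LNG' column before cleaning.
raw = 'POINT(52.37,4.89)'

# WKT separates the coordinates of a point with a space, so the comma is replaced first.
point = wkt.loads(raw.replace(',', ' '))

# shapely stores the first number as x and the second as y, whatever they represent.
print(point.geom_type, point.x, point.y)
```

If both layers store their coordinates in the same order, a containment test between them is not affected by that order, but plots and reprojections would be.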

In [None]:
#convert to geodataframe
solar_gdf = geopandas.GeoDataFrame(solar_panels, geometry='WKT_LAT_LNG')

solar_gdf.head()

In [None]:
#check that the geometry column is a GeoSeries
type(solar_gdf.WKT_LAT_LNG)

In [None]:
# then do a spatial join with the buurten geodata
# Import file with buurten to area conversion
buurten = pd.read_csv('GEBIED_BUURTEN.csv', sep = ';')
buurten.head()

In [None]:
#Use shapely.wkt sub-module to parse wkt format
#buurten['WKT_LAT_LNG'] = buurten['WKT_LAT_LNG'].str.replace(',',' ')

buurten['WKT_LAT_LNG'] = buurten['WKT_LAT_LNG'].apply(wkt.loads)

In [None]:
#convert to geodataframe
buurten_gdf = geopandas.GeoDataFrame(buurten, geometry='WKT_LAT_LNG')

In [None]:
#select relevant columns from solar_gdf
solar_select = solar_gdf[['Functie', 'Gedetecteerde_panelen', 'WKT_LAT_LNG']]
solar_select = solar_select.rename(columns = {'Gedetecteerde_panelen':'solar_panels'})
solar_select2 = solar_select[solar_select['Functie'] == 'Wonen']

In [None]:
solar_select2.head()

In [None]:
#assign the WGS84 latitude-longitude coordinate system to the GeoDataFrame
solar_select2 = solar_select2.set_crs("EPSG:4326")

In [None]:
buurten_select = buurten_gdf[['Buurt_code', 'WKT_LAT_LNG']]
buurten_select = buurten_select.set_crs("EPSG:4326")

In [None]:
#perform spatial join in geopandas
# 'predicate' replaces the older 'op' argument (geopandas >= 0.10)
solar_buurten = geopandas.sjoin(buurten_select, solar_select2, how="left", predicate="contains")

In [None]:
solar_buurten.head()

In [None]:
# new df with number of solar panels per buurt 

solar_buurt = solar_buurten.groupby('Buurt_code')[['solar_panels']].sum().reset_index()
solar_buurt = solar_buurt.rename(columns = {'Buurt_code': 'Buurtcode'})
solar_buurt.head()
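
Because the spatial join is a left join, a buurt without any detected panels still appears once, with a missing value in `solar_panels`. The sketch below, on made-up numbers, shows how `groupby(...).sum()` treats such a buurt; the dataframe `joined` is a toy stand-in for the join result.

```python
import numpy as np
import pandas as pd

# Toy result of a left spatial join: buurt 'B' has no matching panel rows.
joined = pd.DataFrame({'Buurt_code': ['A', 'A', 'B'],
                       'solar_panels': [10.0, 5.0, np.nan]})

# sum() skips NaN, so a buurt with only missing values gets 0 rather than NaN.
per_buurt = joined.groupby('Buurt_code')[['solar_panels']].sum().reset_index()
print(per_buurt)
```

A value of 0 in the summed column can therefore mean "no panels detected" rather than "no data".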

In [None]:
# join energy label scores and solar panel counts per buurt

energy_solar_buurt = pd.merge(energy_buurt, solar_buurt, on = 'Buurtcode', how = 'inner')
energy_solar_buurt.head()

In [None]:
energy_solar_buurt.shape

In [None]:
# add buurt stadsdeelcode, lat, long to this df

buurten_select = buurten[['Buurt_code', 'Stadsdeel_code','LNG', 'LAT']]
buurten_select = buurten_select.rename(columns = {'Buurt_code': 'Buurtcode'})
buurten_select.head()

In [None]:
# join this data to df

combined_data = pd.merge(energy_solar_buurt, buurten_select, on = 'Buurtcode', how = 'left')
combined_data.head()

In [None]:
# add stadsdeel namen
stadsdelen = pd.read_csv('GEBIED_STADSDELEN.csv', sep = ';')
stadsdelen.head()

In [None]:
stadsdelen_select = stadsdelen[['Stadsdeel_code', 'Stadsdeel']]

In [None]:
# join this data with main df into final df

final_data = pd.merge(combined_data, stadsdelen_select, on = 'Stadsdeel_code', how = 'left')
final_data.head()

In [None]:
# save data file for future use:
final_data.to_csv('final_data.csv', index=False)

In [None]:
# now also add income data

# Import income data file
income = pd.read_excel('2019_stadsdelen_3_15.xlsx', skiprows = [0,1,3,80,112,113])
income.head()

In [None]:
income_df = income[['wijk/std', 'gemiddeld persoonlijk inkomen (x 1.000 euro)']]

In [None]:
income_df = income_df.rename(columns = {'gemiddeld persoonlijk inkomen (x 1.000 euro)':'mean_income (x 1.000 euro)'})
income_df.head()

In [None]:
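# raw strings avoid invalid-escape warnings; regex=True is required in recent pandas,
# where str.replace treats the pattern as a literal string by default
income_df['area'] = income_df['wijk/std'].str.extract(r'([A-Z]\d\d)')
income_df['area_name'] = income_df['wijk/std'].str.replace(r'([A-Z]\d\d)', '', regex=True).str.strip()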
income_df.head()
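
The split of an area label into a code and a name can be tried on a few strings first. This is a sketch under the assumption that the `wijk/std` column follows a "code, then name" layout; the two example labels are only illustrative.

```python
import pandas as pd

# Illustrative labels in the assumed 'code + name' layout of the 'wijk/std' column.
toy = pd.DataFrame({'wijk/std': ['A00 Burgwallen-Oude Zijde', 'K44 Hoofddorppleinbuurt']})

pattern = r'([A-Z]\d\d)'  # one capital letter followed by two digits

# expand=False returns a Series instead of a one-column DataFrame.
toy['area'] = toy['wijk/std'].str.extract(pattern, expand=False)

# regex=True makes str.replace treat the pattern as a regular expression;
# .str.strip() removes the space left behind after the code is removed.
toy['area_name'] = toy['wijk/std'].str.replace(pattern, '', regex=True).str.strip()
print(toy)
```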

In [None]:
income_df2 = income_df[['area', 'mean_income (x 1.000 euro)']]
income_df2.head()

In [None]:
final_data.head()

In [None]:
final_data2 = final_data.copy()

In [None]:
final_data2.head()

In [None]:
final_data2['area'] = final_data2['Buurtcode'].str.extract(r'([A-Z]\d\d)')
final_data2.head()

In [None]:
final_data_income = pd.merge(final_data2, income_df2, on = 'area', how = 'left')
final_data_income.head()
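
Several buurten share one area code, so this merge is many-to-one: each buurt should receive exactly one income value. A minimal sketch of how that assumption could be checked, using toy stand-ins (`buurt`, `inc`) with invented values:

```python
import pandas as pd

# Toy stand-ins for final_data2 and income_df2 (values are made up).
buurt = pd.DataFrame({'Buurtcode': ['A00a', 'A00b', 'A01a'],
                      'area': ['A00', 'A00', 'A01']})
inc = pd.DataFrame({'area': ['A00', 'A01'],
                    'mean_income (x 1.000 euro)': [40.0, 35.0]})

# validate='many_to_one' raises pandas.errors.MergeError if 'area' is duplicated
# in the right-hand table, which would otherwise silently duplicate buurt rows.
out = pd.merge(buurt, inc, on='area', how='left', validate='many_to_one')

# The number of rows should be unchanged by the merge.
print(len(out) == len(buurt))
```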

In [None]:
# Save final data file to use in analysis

final_data_income.to_csv('final_data_income.csv', index=False)