# PROJECT 5: Machine Learning

# CONSTRUCTION COSTS IN THE NETHERLANDS

## Study of the costs of different types of buildings in the Netherlands between 2015 and 2019.

## It includes an analysis of construction-stage costs for different types of buildings, and of the regions of the Netherlands with the highest construction-related spending.

Data Source: https://opendata.cbs.nl/

In [None]:
# Import all necessary libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy import stats
from scipy.stats import ttest_1samp
import statsmodels.api as sm

import chart_studio.plotly as py
import cufflinks as cf

# Enable offline plotting for cufflinks/plotly
cf.go_offline()

## Total costs of all buildings in NL from 2015 - 2019

In [None]:
bldg_start = pd.read_excel('../data/costs/costs_buildings started.xlsx')
bldg_start.set_index("Periods", inplace = True) 

import plotly.express as px
fig = px.scatter(bldg_start, x=bldg_start.index, y=bldg_start['Total buildings_mln euro'])
fig.show()

#### Over the five years studied, 2018 had the highest new-building construction costs: about 5500 million euros were spent on construction in the Netherlands.

# Analysis of costs in the different regions of the Netherlands (North, South, East and West) for the Housing and Education sectors.

### Total "Dwelling" Costs for all the regions of the Netherlands

In [None]:
noord_housing = pd.read_excel('../data/regions/noord_nederland_housing_costs.xlsx')
noord_housing.rename(columns = {'Orders received by contractors_mln euro':'Noord'}, inplace = True)

oost_housing = pd.read_excel('../data/regions/oost_nederland_housing_costs.xlsx')
oost_housing.rename(columns = {'Orders received by contractors_mln euro':'Oost'}, inplace = True)

west_housing = pd.read_excel('../data/regions/west_nederland_housing_costs.xlsx')
west_housing.rename(columns = {'Orders received by contractors_mln euro':'West'}, inplace = True)


zuid_housing = pd.read_excel('../data/regions/zuid_nederland_housing_costs.xlsx')
zuid_housing.rename(columns = {'Orders received by contractors_mln euro':'Zuid'}, inplace = True)


# Orders received by contractors_mln euro for all the regions

combined = pd.concat([noord_housing['Noord'],oost_housing['Oost'],west_housing['West'],zuid_housing['Zuid']], axis=1)
                                  
combined_orders_cost = combined.set_index(noord_housing["Periods"])                                 
combined_orders_cost.head()

### Total "Education" Costs for all the regions of the Netherlands

In [None]:
noord_education = pd.read_excel('../data/regions/noord_nederland_education_costs.xlsx')
noord_education.rename(columns = {'Orders received by contractors_mln euro':'Noord'}, inplace = True)

oost_education = pd.read_excel('../data/regions/oost_nederland_education_costs.xlsx')
oost_education.rename(columns = {'Orders received by contractors_mln euro':'Oost'}, inplace = True)

west_education = pd.read_excel('../data/regions/west_nederland_education_costs.xlsx')
west_education.rename(columns = {'Orders received by contractors_mln euro':'West'}, inplace = True)


zuid_education = pd.read_excel('../data/regions/zuid_nederland_education_costs.xlsx')
zuid_education.rename(columns = {'Orders received by contractors_mln euro':'Zuid'}, inplace = True)


# Orders received by contractors_mln euro for all the regions

combined = pd.concat([noord_education['Noord'],oost_education['Oost'],west_education['West'],zuid_education['Zuid']], axis=1)
                                  
combined_education_cost = combined.set_index(noord_education["Periods"])                                 
combined_education_cost.head(50)

# Regression Analysis

#### My assumption is that the "Dwelling" orders received by contractors and the number of "Dwelling" projects where construction has started are correlated. I would like to check whether the relationship between the two variables is linear or non-linear.
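
One possible way to probe linear versus non-linear association is to compare Pearson correlation (which measures linear association) with Spearman rank correlation (which measures any monotonic association). The sketch below uses synthetic data, not the CBS figures; the frame `demo` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption, not the CBS figures):
# one roughly linear and one curved relationship with x.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
demo = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + rng.normal(0, 0.5, size=x.size),
    "curved": np.exp(x),  # monotonic but strongly non-linear
})

# Correlation of every column with x under both methods
pearson = demo.corr(method="pearson")["x"]
spearman = demo.corr(method="spearman")["x"]
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

A Spearman value that is clearly higher than the Pearson value hints that the relationship is monotonic but not well described by a straight line.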

In [None]:
# "Dwelling" costs for all the four regions of NL can be calculated by combining the values of all the 4 regions.

noord_housing = pd.read_excel('../data/regions/noord_nederland_housing_costs.xlsx')
oost_housing = pd.read_excel('../data/regions/oost_nederland_housing_costs.xlsx')
west_housing = pd.read_excel('../data/regions/west_nederland_housing_costs.xlsx')
zuid_housing = pd.read_excel('../data/regions/zuid_nederland_housing_costs.xlsx')



### The dataframe has multiple columns giving the millions of euros spent and the number of buildings at different stages of construction.

So, I would like to see how they are all correlated.

In [None]:
corr = noord_housing.corr()

# Heatmap

plt.figure(figsize=(20,15))

sns.heatmap(corr, annot=True)

There is a positive correlation between "Production of building projects" and "Building projects under construction". This makes complete sense.

There is also a positive correlation between "Building projects started" and "Remaining production of buildings". I think this correlation suggests that there is a continuous demand for buildings and that construction is ongoing. Since the dataset provides no information on what each column means, it is hard to say definitively how to interpret it.

The correlation that I find interesting is the strong positive correlation between "Orders received by contractors_mln euro" and "Building projects not yet started". This probably indicates a backlog in construction activities.

#### There is a positive correlation between the two chosen variables, "Orders received by contractors_mln euro" and "Building projects started".


In [None]:
noord_housing.isna().sum()

In [None]:
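# Sum the values of the four regions, aligned on "Periods"
# (indexing by "Periods" keeps the period labels themselves out of the sum)

reg_combined = (noord_housing.set_index("Periods") + oost_housing.set_index("Periods")
                + west_housing.set_index("Periods") + zuid_housing.set_index("Periods"))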

# Checking the correlation between "Orders received by contractors_mln euros" and "Building projects started"

reg_combined[["Orders received by contractors_mln euro", "Building projects started"]].corr()

### Building a regression model for the two variables "Orders received by contractors_mln euro" and "Building projects started"

In [None]:
sns.regplot(x="Orders received by contractors_mln euro", y="Building projects started", data=reg_combined)

In [None]:
# I am using linear regression because there is a positive correlation between the variables we want to analyze.
# In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted "X".
# The model is fitted by ordinary least squares (OLS), also known as linear least squares.

from scipy import stats
import statsmodels.api as sm

X = reg_combined['Orders received by contractors_mln euro']
Y = reg_combined['Building projects started']

# Add a constant column so that the model estimates an intercept
x = sm.add_constant(X)

 
results = sm.OLS(Y,x).fit()
 
results.summary()





#### From the above summary table, the p-value is 0.007, which is below 0.05, so the relationship between the two variables is statistically significant at the 5% level.
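
For a regression with a single predictor, the p-value of the slope tests the null hypothesis that the slope is zero. As a minimal sketch, the same kind of test can be run with `scipy.stats.linregress`. The arrays `orders` and `started` below are synthetic stand-ins, not the CBS data, and the 0.05 threshold is the conventional choice assumed here.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two columns (assumption, not the CBS data)
rng = np.random.default_rng(42)
orders = rng.uniform(500, 1500, size=20)                      # hypothetical orders, mln euro
started = 0.8 * orders + 250 + rng.normal(0, 150, size=20)    # hypothetical projects started

# Simple linear regression: started = slope * orders + intercept
fit = stats.linregress(orders, started)
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.1f}")
print(f"p-value for slope={fit.pvalue:.4g}, R-squared={fit.rvalue ** 2:.3f}")

# Compare the p-value with the chosen significance level
alpha = 0.05
significant = fit.pvalue < alpha
print("significant at 5% level:", significant)
```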


#### R-SQUARED:
The R-squared value is a widely used measure of how well a regression fits the data: it is the share of the total variability in the dependent variable that is explained by the model.
An R-squared of 1 would mean the model explains all of the variability in the data. Here the R-squared value is 0.567, so the model explains about 57% of the variability in "Building projects started".
What we usually observe are values ranging from 0.2 to 0.9. The value we got here falls within that range, so we can conclude that the regression is reasonably strong.
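
To make the definition concrete, here is a small sketch that computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The numbers in `x` and `y` are invented for illustration.

```python
import numpy as np

# Invented example data (assumption, not the CBS figures)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line through the points
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)       # variability left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)    # total variability around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

For a straight-line fit with one predictor, this value equals the square of the Pearson correlation between `x` and `y`.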





### Therefore, there is a reasonably strong positive correlation between housing orders received by contractors and the number of housing projects where construction started.


### Plotting the regression model on the scatter plot.

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X,Y)
 
# From the summary table above: the coefficient of "Orders received by contractors_mln euro" is 0.8371,
# and the intercept (the constant) is 256.4860.

yhat = 0.8371*X + 256.4860

 
plt.plot(X, yhat, lw=4, c='pink', label='regression line')
plt.legend()
 
plt.xlabel('Orders received by contractors_mln euro', fontsize = 20)
 
plt.ylabel('Building projects started', fontsize = 20)
 
plt.show()

#### From the above scatter plot, "Orders received by contractors_mln euro" appears to be a good predictor of "Building projects started".

#### The pink line in the plot above is the regression line: the values of "Building projects started" predicted by the model.

# Total Building Costs for all types of buildings

In [None]:
total_cost = pd.read_excel('../data/Total Building Costs.xlsx')
total_cost.set_index("Periods", inplace = True) 
total_cost.head()

In [None]:
total_cost.isna().sum()

In [None]:
total_cost.dtypes

# Performing Supervised Learning on the building dataframe

# Modeling, Prediction, and Evaluation

We'll start off this section by splitting the data into training and test sets. **Name your 4 variables `X_train`, `X_test`, `y_train`, and `y_test`. Select 80% of the data for training and 20% for testing.**
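
Because the split is random, scores change from run to run unless a seed is fixed. The sketch below shows an 80/20 split with a fixed `random_state`. The frame `demo` and the seed value 42 are arbitrary choices made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame with 20 rows (assumption, not the CBS data)
demo = pd.DataFrame({
    "feature_a": np.arange(20),
    "feature_b": np.arange(20) * 2,
    "target": np.arange(20) * 3,
})
X_demo = demo.drop(columns="target")
y_demo = demo["target"]

# 80% training, 20% testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
print(X_tr.shape, X_te.shape)  # (16, 2) (4, 2)
```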

In [None]:

from sklearn.model_selection import train_test_split

y = total_cost['Building projects started']
X = total_cost.drop('Building projects started', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

## Here I will evaluate the model and also try two different models to compare my results

## I will use Linear Regression and K-nearest Neighbours

### 1. Linear Regression



The first model we will use in this lab is **Linear Regression**. 

In [None]:

# TRAIN THE  MODEL 


from sklearn.linear_model import LinearRegression 
model = LinearRegression()

Next, fit the model to my training data. 

In [None]:
model.fit(X_train, y_train)

### Evaluate the model

Compute the predicted *y* based on `X_train` and call it `y_pred`. Then calculate the r squared score between `y_pred` and `y_train`, which indicates how well the estimated regression model fits the training data.

In [None]:
# PREDICT ON THE TRAINING DATA

y_pred = model.predict(X_train)
pd.DataFrame({'actual':y_train, 'predicted':y_pred})

In [None]:
from sklearn.metrics import r2_score

r2_score(y_train, y_pred)

#### Our next step is to evaluate the model using the test data. 

We would like to ensure that our model is not overfitting the data. This means that our model was made to fit too closely to the training data by being overly complex. If a model is overfitted, it is not generalizable to data outside the training data. In that case, we need to reduce the complexity of the model by removing certain features (variables).
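
With a small dataset, a single train/test split can give a noisy picture of generalization. One common complement, sketched here on synthetic data, is k-fold cross-validation, which reports one r squared score per held-out fold. The arrays `X_demo` and `y_demo` and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): the target depends on the first feature only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=40)

# Fit on 4 folds and score on the 5th, rotating through all 5 folds
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```

A large gap between the training score and the cross-validated scores is a sign that the model may be overfitting.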

In the cell below, use the model to generate the predicted values for the test data and assign them to `y_test_pred`. Compute the r squared score of the predicted `y_test_pred` and the observed `y_test` data.

In [None]:
y_test_pred = model.predict(X_test)
pd.DataFrame({'test':y_test, 'predicted':y_test_pred})

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, y_test_pred)

**The r squared score is 0.2578 for the training data and -1.01402 for the test data.**

**A negative r2 score on the test data means the model predicts worse than simply using the mean of the observed `y_test` values, so it does not generalize to unseen data.**
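
As a small illustration of how `r2_score` can go below zero, the sketch below compares a baseline that always predicts the mean with a deliberately poor set of predictions. All values are invented.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented observed values (assumption)
y_true = np.array([10.0, 12.0, 14.0, 16.0])

mean_baseline = np.full_like(y_true, y_true.mean())   # always predict the mean (13.0)
poor_model = np.array([16.0, 14.0, 12.0, 10.0])       # predictions in reverse order

print(r2_score(y_true, mean_baseline))  # 0.0
print(r2_score(y_true, poor_model))     # -3.0
```

Predicting the mean scores exactly 0, so any negative score means the predictions are further from the observations than the mean is.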

### 2. K-Nearest Neighbors

### Our second algorithm is K-Nearest Neighbors. 

We will fit a model using the training data and then test the performance of the model using the testing data. We will start by loading `KNeighborsClassifier` from scikit-learn and then initialize and fit the model. We'll start off with a model where k=4.

In [None]:
# TRAIN THE MODEL 

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)
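
The target "Building projects started" is treated as a numeric quantity elsewhere in this notebook, and a classifier treats every distinct value as a separate class. As a hedged alternative (not what the cell above does), scikit-learn also provides `KNeighborsRegressor`, which predicts the average target of the nearest neighbours. The toy arrays below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data (assumption, not the CBS data)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 10.0, 20.0, 30.0])

knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_demo, y_demo)

# The two nearest neighbours of 1.4 are x=1.0 and x=2.0,
# so the prediction is the mean of their targets: (10 + 20) / 2
prediction = knn_reg.predict(np.array([[1.4]]))
print(prediction)  # [15.]
```

A regressor like this would be evaluated with `r2_score` rather than with accuracy or a confusion matrix.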

To test your model, compute the predicted values for the testing sample and print the confusion matrix as well as the accuracy score.

**Accuracy Score**

The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

Reference: https://scikit-learn.org/stable/modules/model_evaluation.html
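
A short sketch of the two modes described above, using made-up labels:

```python
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels: positions 0 and 3 match
y_true = [0, 1, 2, 3]
y_hat = [0, 2, 1, 3]

fraction_correct = accuracy_score(y_true, y_hat)                 # default: fraction
count_correct = accuracy_score(y_true, y_hat, normalize=False)   # count of matches
print(fraction_correct, count_correct)
```

Here two of the four predictions match, so the fraction is 0.5 and the count is 2.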

In [None]:
# TEST THE MODEL

y_pred = model.predict(X_test)
# Compare the observed test values with the K-Nearest Neighbors predictions
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

**Confusion matrix**

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)
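
In scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes, with classes in sorted order. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes: 0, 1 and 2
y_true = [2, 0, 2, 2, 0, 1]
y_hat = [0, 0, 2, 2, 0, 2]

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

The diagonal holds the correct predictions. For example, `cm[2, 0]` is 1 because one sample of class 2 was predicted as class 0.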

# Unsupervised Learning


# Data Clustering with K-Means

Now let's cluster the data with K-Means. Initiate the K-Means model, then fit it to the data (the cells below fit the unscaled `total_cost` frame directly). The fitted model returned by the `.fit` method has an attribute called `labels_`, which holds the cluster number assigned to each data record. Assign these labels back to `total_cost` in a new column, `total_cost['labels']`, to see the cluster results alongside the original data.
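
K-Means relies on Euclidean distances, so a column measured in large units can dominate the clustering. A minimal sketch of standardizing the columns before fitting is shown below. The frame `demo`, its column names, and the choice of two clusters are assumptions for illustration and do not come from the CBS data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data with two well-separated groups of 10 rows each (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cost_mln_euro": np.concatenate([rng.normal(100, 5, 10), rng.normal(500, 5, 10)]),
    "projects": np.concatenate([rng.normal(10, 1, 10), rng.normal(50, 1, 10)]),
})

# Standardize each column to zero mean and unit variance, then cluster
scaled = StandardScaler().fit_transform(demo)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# labels_ holds one cluster number per row, in the original row order
demo["labels"] = kmeans.labels_
print(demo["labels"].value_counts())
```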

In [None]:
# Reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

# n_clusters: The number of clusters to form as well as the number of centroids to generate.
# By default n_clusters=8

from sklearn.cluster import KMeans

total_cost_kmeans = KMeans().fit(total_cost)
total_cost_kmeans.labels_

In [None]:
total_cost['labels'] = total_cost_kmeans.labels_

total_cost

Count the values in `labels`.

In [None]:
total_cost['labels'].value_counts() 

In [None]:
# K-Means Labels Scatter Plot
plt.scatter(x=total_cost['Orders received by contractors'], y=total_cost['Building projects started'], c=total_cost["labels"])
plt.title('Distribution of K-Means Labels with 8 clusters')
plt.xlabel('Orders received by contractors')
plt.ylabel('Building projects started')
plt.show()




### I would like to reduce the number of clusters to 2 to see how the result changes from the default 8 clusters above

In [None]:
from sklearn.cluster import KMeans

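# KMeans for n_clusters=2
# Drop the previous 'labels' column so the old cluster assignments are not used as a feature
features = total_cost.drop(columns='labels', errors='ignore')
total_cost_kmeans = KMeans(n_clusters=2).fit(features)
total_cost['labels'] = total_cost_kmeans.labels_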

plt.scatter(x=total_cost['Orders received by contractors'], y=total_cost['Building projects started'], c=total_cost["labels"])
plt.title('Distribution of K-Means Labels with 2 clusters')
plt.xlabel('Orders received by contractors')
plt.ylabel('Building projects started')
plt.show()

In [None]:
# By reducing the number of clusters I can clearly see the difference.
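
Beyond visual inspection, one possible way to compare different numbers of clusters is the silhouette score, which is higher when clusters are compact and well separated. The sketch below uses synthetic points with two obvious groups; it is an illustration under that assumption, not a result for the CBS data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic points forming two well-separated blobs (assumption)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(30, 2)),
])

# Fit K-Means for each candidate k and record the silhouette score
scores = {}
for k in (2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
print(scores)
```

On data like this, the score for 2 clusters comes out higher than for 8, matching the visual impression that 8 clusters split natural groups apart.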

# CONCLUSIONS

## Research Questions:

Q1: Are the "Housing" orders received by contractors and the number of "Housing" projects where the construction started correlated?



## Overall Analysis:

Yes, the "Orders received by contractors_mln euro" is a good predictor of "Building projects started": the linear regression gave an R-squared of 0.567 with a p-value of 0.007.
