Dataset: https://opendata.housing.gov.ie/dataset/social-housing-construction-status-report-q2-2022
Social Housing Construction Status Report Q2 2022
The latest Construction Status Report shows that 8,247 social homes are currently onsite, with an additional 12,327 homes at design and tender stage. In Quarter 2 2022, 118 new construction schemes (1,647 homes) were added to the pipeline. In total, the Construction Status Report provides details on 27,006 new build social homes across 1,566 schemes.

In [2]:
# import dataset "Social Housing Construction Status Report Q2 2022" downloaded from Department of Housing, Local Government and Heritage

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
df = pd.read_csv('social-housing-construction-status-report-q2-2022.csv', encoding='windows-1252')

## Understanding and Cleaning the Data

In [6]:
df.tail(50)

Unnamed: 0,No.,Funding Programme,LA,Scheme/Project Name,No. of Units,Approved Housing Body,Stage 1 Capital Appraisal,Stage 2 Pre Planning,Stage 3 Pre Tender design,Stage 4 Tender Report or Final Turnkey/CALF approval,On Site,Completed
1516,1517,SHIP CONSTRUCTION SINGLE STAGE,Wicklow,"Ballynerrin Upper, Wicklow Town",10,*N/A,,,,"Stage 4 approved Q2-2022, the scheme is fully ...",,
1517,1518,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Merrymeeting View, Rathnew",46,*N/A,,,,,,Q2-2021
1518,1519,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,Shillelagh Phase 4,20,*N/A,,,,"Stage 4 approved Q2-2022, the scheme is fully ...",,
1519,1520,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Ard na Greine, West Ward, Bray",31,*N/A,,,,,,Q1-2022
1520,1521,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Avondale Heights, Rathdrum",20,*N/A,,,,,Q3-2021,
1521,1522,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Sheehan Court, Back Street, Arklow",7,*N/A,,,,,Q2-2021,
1522,1523,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Three Trouts, Greystones",41,*N/A,,"Stage 2 approved Q1-2022, full design/tender b...",,,,
1523,1524,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Greenhill Road, Wicklow Town",36,*N/A,,,,,Q2-2021,
1524,1525,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Cedar Court, Schools Road, Bray",14,*N/A,,,,,Q1-2022,
1525,1526,SHIP CONSTRUCTION DESIGN BUILD,Wicklow,"Coolattin Oaks, Carnew (Phase 4)",8,*N/A,,,,,,Q1-2021


In [7]:
df.shape

(1566, 12)

In [8]:
# I now know the shape and contents of the data.
# I see missing information in columns so I will further investigate this.
missing_counts = df.isnull().sum()
print("Missing values in each column:")
print(missing_counts)

Missing values in each column:
No.                                                        0
Funding Programme                                          0
LA                                                         0
Scheme/Project Name                                        0
No. of Units                                               0
Approved Housing Body                                     42
Stage 1 Capital Appraisal                               1412
Stage 2 Pre Planning                                    1446
Stage 3 Pre Tender design                               1515
Stage 4 Tender Report or Final Turnkey/CALF approval    1237
On Site                                                 1156
Completed                                               1064
dtype: int64
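
The raw counts above are easier to weigh as a share of the 1,566 rows. The sketch below is a small illustration rather than a cell from the original notebook: it rebuilds the counts by hand from the printed output (the `missing_pct` name is an assumption) and converts them to percentages.

```python
import pandas as pd

# Missing-value counts copied from the printed output above;
# 1566 is the row count reported by df.shape.
n_rows = 1566
missing_counts = pd.Series({
    "Approved Housing Body": 42,
    "Stage 1 Capital Appraisal": 1412,
    "Stage 2 Pre Planning": 1446,
    "Stage 3 Pre Tender design": 1515,
    "Stage 4 Tender Report or Final Turnkey/CALF approval": 1237,
    "On Site": 1156,
    "Completed": 1064,
})

# Share of rows with a missing value, as a percentage rounded to one decimal place.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On the real DataFrame the same figures should come from `df.isnull().mean() * 100`, since the mean of a boolean column is the fraction of `True` values.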


In [9]:
# For my objectives I don't need the columns Stage 1, Stage 2, Stage 3, Stage 4 so I will drop these.
# These have a substantial amount of missing information.
df_1 = df.drop(columns = ['Stage 1 Capital Appraisal', 'Stage 2 Pre Planning ','Stage 3 Pre Tender design','Stage 4 Tender Report or Final Turnkey/CALF approval'])

In [10]:
df_1.shape

(1566, 8)

In [11]:
# Now these are dropped, I will investigate what remains missing
missing_counts = df_1.isnull().sum()
print("Missing values in each column:")
print(missing_counts)

Missing values in each column:
No.                         0
Funding Programme           0
LA                          0
Scheme/Project Name         0
No. of Units                0
Approved Housing Body      42
On Site                  1156
Completed                1064
dtype: int64


In [12]:
# There are 42 missing values under 'Approved Housing Body', which is 2.7% of the data (42 of 1566 rows)
# As removing these rows will not make a significant impact, I will drop them.

df_1.dropna(subset=['Approved Housing Body'], inplace = True)
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1524 entries, 0 to 1565
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   No.                    1524 non-null   int64 
 1   Funding Programme      1524 non-null   object
 2   LA                     1524 non-null   object
 3   Scheme/Project Name    1524 non-null   object
 4   No. of Units           1524 non-null   int64 
 5   Approved Housing Body  1524 non-null   object
 6   On Site                408 non-null    object
 7   Completed              491 non-null    object
dtypes: int64(2), object(6)
memory usage: 107.2+ KB


In [13]:
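# After dropping those rows, 1116 values are missing in 'On Site' and 1033 in 'Completed'
# (1524 rows minus the 408 and 491 non-null counts shown above)
# Removing these rows would have a significant impact on the data
# I know that a null value in these columns means 'No', so I will replace the nulls with 'No'

# Assign the result back instead of calling fillna(inplace=True) on a single column,
# which relies on chained assignment
df_1['On Site'] = df_1['On Site'].fillna('No')
df_1['Completed'] = df_1['Completed'].fillna('No')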

In [14]:
df_1.head()

Unnamed: 0,No.,Funding Programme,LA,Scheme/Project Name,No. of Units,Approved Housing Body,On Site,Completed
0,1,SHIP CONSTRUCTION TURNKEY,Carlow,"Carrigbrook, Tullow Road, Carlow",16,*N/A,No,Q4-2021
1,2,SHIP CONSTRUCTION TURNKEY,Carlow,"Carrigbrook, Tullow Road, Carlow",13,*N/A,No,Q2-2021
2,3,SHIP CONSTRUCTION TURNKEY,Carlow,"Granville Court, Granby Row,Carlow",4,*N/A,No,Q4-2021
3,4,SHIP CONSTRUCTION SINGLE STAGE,Carlow,"Ardattin, Co Carlow",6,*N/A,No,No
4,5,SHIP CONSTRUCTION SINGLE STAGE,Carlow,"Brownbog, Hackettstown",1,*N/A,Q1-2020,No


In [15]:
missing_counts = df_1.isnull().sum()
print("Missing values in each column:")
print(missing_counts)

Missing values in each column:
No.                      0
Funding Programme        0
LA                       0
Scheme/Project Name      0
No. of Units             0
Approved Housing Body    0
On Site                  0
Completed                0
dtype: int64


In [16]:
# I now have no missing data. 'Approved Housing Body' does contain "*N/A" as a value, but I know this means there is no Approved Housing Body.
# I want to further understand the values in each column

for column in df_1.columns:
    print(f"value counts for: {column}")
    print(df_1[column].value_counts(dropna=False))
    print()
          


value counts for: No.
No.
1       1
1047    1
1056    1
1055    1
1054    1
       ..
520     1
518     1
517     1
516     1
1566    1
Name: count, Length: 1524, dtype: int64

value counts for: Funding Programme
Funding Programme
CALF Turnkey                             447
SHIP CONSTRUCTION                        374
SHIP CONSTRUCTION TURNKEY                255
CAS CONSTRUCTION                         109
SHIP DIALOG                               76
SHIP CONSTRUCTION SINGLE STAGE            58
SHIP CONSTRUCTION DESIGN BUILD            45
SHIP CONSTRUCTION RENEWAL                 43
CALF Construction                         30
CAS CONSTRUCTION TURNKEY                  25
REGENERATION                              22
PUBLIC PRIVATE PARTNERSHIP (Bundle 2)     14
CAS CONSTRUCTION RENEWAL                  13
CAS CONSTRUCTION SINGLE STAGE              6
PUBLIC PRIVATE PARTNERSHIP (Bundle 3)      6
PUBLIC PRIVATE PARTNERSHIP (Bundle 1)      1
Name: count, dtype: int64

value counts for: LA
L

In [17]:
# My data is clean and ready for model building
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1524 entries, 0 to 1565
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   No.                    1524 non-null   int64 
 1   Funding Programme      1524 non-null   object
 2   LA                     1524 non-null   object
 3   Scheme/Project Name    1524 non-null   object
 4   No. of Units           1524 non-null   int64 
 5   Approved Housing Body  1524 non-null   object
 6   On Site                1524 non-null   object
 7   Completed              1524 non-null   object
dtypes: int64(2), object(6)
memory usage: 107.2+ KB


## Data Pre-processing

In [19]:
from sklearn import preprocessing


In [20]:
# Encode categorical features to numerical features
label_encoders = {}
for col in df_1.select_dtypes(include=[object]).columns:
    le = preprocessing.LabelEncoder()
    df_1[col] = le.fit_transform(df_1[col])
    label_encoders[col] = le                   # Store encoders if needed later
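
The encoders are stored so that the integer codes can be turned back into readable labels later. The sketch below shows one way to do that on a hypothetical three-row frame (`toy` and its values stand in for `df_1` and are not taken from the real data beyond the two county names).

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical three-row frame standing in for df_1.
toy = pd.DataFrame({
    "LA": ["Wicklow", "Carlow", "Wicklow"],
    "Completed": ["No", "Q2-2021", "No"],
})

# Same pattern as the cell above: one LabelEncoder per categorical column.
label_encoders = {}
for col in ["LA", "Completed"]:
    le = preprocessing.LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    label_encoders[col] = le

# classes_ holds the original labels in sorted order; the label at position i is encoded as i.
print(list(label_encoders["LA"].classes_))

# inverse_transform maps the integer codes back to the original strings.
decoded_la = label_encoders["LA"].inverse_transform(toy["LA"])
print(list(decoded_la))
```

Note that the codes follow alphabetical order of the labels, so the numeric distance between two codes carries no meaning about the categories themselves.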

In [21]:
df_1.tail()

Unnamed: 0,No.,Funding Programme,LA,Scheme/Project Name,No. of Units,Approved Housing Body,On Site,Completed
1561,1562,1,30,1207,30,19,0,4
1562,1563,1,30,1208,18,19,9,0
1563,1564,1,30,1289,45,19,13,0
1564,1565,1,30,1324,16,18,9,0
1565,1566,8,30,317,106,0,0,0


In [22]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1524 entries, 0 to 1565
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   No.                    1524 non-null   int64
 1   Funding Programme      1524 non-null   int32
 2   LA                     1524 non-null   int32
 3   Scheme/Project Name    1524 non-null   int32
 4   No. of Units           1524 non-null   int64
 5   Approved Housing Body  1524 non-null   int32
 6   On Site                1524 non-null   int32
 7   Completed              1524 non-null   int32
dtypes: int32(6), int64(2)
memory usage: 71.4 KB


In [23]:
max_value = df_1['No. of Units'].max()
print(max_value)

163


In [24]:
mid_value = df_1['No. of Units'].median()
print(mid_value)

11.0


In [25]:
# Convert the target variable 'No. of Units' to categorical, split at the median (11)
bins = [-1,11,1000]
labels = ['0-11','12+']

df_1['No. of Units Category'] = pd.cut(df_1['No. of Units'], bins=bins, labels=labels)

In [26]:
# Count how many times each class occurs
cat_value = df_1['No. of Units Category'].value_counts()
print(cat_value)

No. of Units Category
0-11    765
12+     759
Name: count, dtype: int64
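
To make the bin edges concrete, the sketch below applies the same `bins` and `labels` to a handful of hand-picked unit counts (the `units` values are illustrative, apart from 11 and 163, which are the median and maximum reported above). By default `pd.cut` uses right-closed intervals, so the bins are (-1, 11] and (11, 1000].

```python
import pandas as pd

# Illustrative unit counts: both edges of each bin plus the dataset median and maximum.
units = pd.Series([0, 1, 11, 12, 163])

bins = [-1, 11, 1000]
labels = ['0-11', '12+']

# right=True is the default, so 11 falls in '0-11' and 12 falls in '12+'.
categories = pd.cut(units, bins=bins, labels=labels)
print(pd.DataFrame({"units": units, "category": categories}))
```

Starting the first bin at -1 rather than 0 keeps a scheme with 0 units inside '0-11' instead of turning it into a missing value.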


In [27]:
df_1.shape

(1524, 9)

In [28]:
missing_counts = df_1.isnull().sum()
print("Missing values in each column:")
print(missing_counts)

Missing values in each column:
No.                      0
Funding Programme        0
LA                       0
Scheme/Project Name      0
No. of Units             0
Approved Housing Body    0
On Site                  0
Completed                0
No. of Units Category    0
dtype: int64


In [29]:
df_1 = df_1.drop(columns = ['No. of Units'])
df_1.head()

Unnamed: 0,No.,Funding Programme,LA,Scheme/Project Name,Approved Housing Body,On Site,Completed,No. of Units Category
0,1,14,0,347,0,0,6,12+
1,2,14,0,347,0,0,3,12+
2,3,14,0,708,0,0,6,0-11
3,4,13,0,151,0,0,0,0-11
4,5,13,0,309,0,3,0,0-11


In [30]:
# Define feature variables
X = df_1.drop('No. of Units Category',axis=1)

# Define target variable
y = df_1['No. of Units Category']

## Model Building and Evaluation

In [32]:
# Importing train-test-split 
from sklearn.model_selection import train_test_split


In [33]:
# Importing machine learning models from the sklearn library
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

# Importing classification report, confusion matrix and accuracy score from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [34]:
models = {
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=41),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=41),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=41),
    "Support Vector Machine": SVC(kernel='linear', random_state=41)}

metric_columns = ["Model", "Accuracy", "Precision", "Recall", "F1 Score"]

def apply_models(X, y, train_size):
    # Assign training and test data for the requested split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=41)

    rows = []
    for model_name, model_instance in models.items():
        # Train the model
        model_instance.fit(X_train, y_train)

        # Predict on the test set
        y_pred = model_instance.predict(X_test)

        # Calculate performance metrics
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        precision = report['weighted avg']['precision']
        recall = report['weighted avg']['recall']
        f1 = report['weighted avg']['f1-score']

        rows.append([model_name, accuracy, precision, recall, f1])

    # Return the results as a dataframe for visual ease
    return pd.DataFrame(rows, columns=metric_columns)

# Apply the models to each train/test split
results_df_90 = apply_models(X, y, train_size=0.90)
results_df_85 = apply_models(X, y, train_size=0.85)
results_df_75 = apply_models(X, y, train_size=0.75)

# Print the dataframes
print("Model Performance Metrics (90% Training set):")
print(results_df_90)
print("Model Performance Metrics (85% Training set):")
print(results_df_85)
print("Model Performance Metrics (75% Training set):")
print(results_df_75)

Model Performance Metrics (90% Training set):
                    Model  Accuracy  Precision    Recall  F1 Score
0     k-Nearest Neighbors  0.555556   0.555556  0.555556  0.555556
1           Decision Tree  0.647059   0.649408  0.647059  0.646455
2           Random Forest  0.692810   0.694030  0.692810  0.692679
3     Logistic Regression  0.640523   0.644503  0.640523  0.639168
4  Support Vector Machine  0.633987   0.636219  0.633987  0.633361
Model Performance Metrics (85% Training set):
                    Model  Accuracy  Precision    Recall  F1 Score
0     k-Nearest Neighbors  0.563319   0.562959  0.563319  0.562450
1           Decision Tree  0.655022   0.656063  0.655022  0.654930
2           Random Forest  0.685590   0.686978  0.685590  0.685446
3     Logistic Regression  0.663755   0.666725  0.663755  0.663088
4  Support Vector Machine  0.624454   0.634395  0.624454  0.619795
Model Performance Metrics (75% Training set):
                    Model  Accuracy  Precision    Recall  
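
In the 90% and 85% tables above, the Recall column matches Accuracy to six decimals for every model. That is expected: the support-weighted average of per-class recall is the number of correct predictions divided by the number of samples, which is the definition of accuracy. The sketch below checks this on a small hand-made set of labels (`y_true` and `y_pred` are invented for illustration).

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented labels for illustration: four samples per class, five correct predictions.
y_true = ['0-11', '0-11', '0-11', '12+', '12+', '12+', '12+', '0-11']
y_pred = ['0-11', '12+', '0-11', '12+', '0-11', '12+', '12+', '12+']

accuracy = accuracy_score(y_true, y_pred)

# Same lookup as in the notebook: the 'weighted avg' row of the report.
report = classification_report(y_true, y_pred, output_dict=True)
weighted_recall = report['weighted avg']['recall']

print(accuracy, weighted_recall)
```

Because of this identity, the weighted Recall column adds no information beyond Accuracy; the per-class rows of the report (`report['0-11']`, `report['12+']`) are where the two classes can be told apart.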

## Cross Validation on full dataset

In [36]:
# Cross-validation for accuracy on the full dataset
cv_results = []
for model_name, model_instance in models.items():
    cv_scores = cross_val_score(model_instance, X, y, cv=5, scoring='accuracy')
    cv_results.append([model_name, cv_scores.mean()])

# Display cross-validation results
cv_results_df = pd.DataFrame(cv_results, columns=["Model", "Mean CV Accuracy"])
print("\nCross-Validation Accuracy Scores:")
print(cv_results_df)


Cross-Validation Accuracy Scores:
                    Model  Mean CV Accuracy
0     k-Nearest Neighbors          0.526931
1           Decision Tree          0.582720
2           Random Forest          0.581383
3     Logistic Regression          0.598490
4  Support Vector Machine          0.562368
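
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, so each fold is a contiguous block of rows. The rows in this dataset appear to be ordered by local authority (Carlow first, Wicklow last), which may be one reason the cross-validated scores sit below the hold-out scores; that is a hypothesis, not something the notebook establishes. The sketch below shows how a shuffled splitter could be passed instead, using synthetic data from `make_classification` in place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (an assumption; not the housing data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=41)

model = LogisticRegression(max_iter=1000, random_state=41)

# cv=5 with a classifier: stratified folds taken in row order, no shuffling.
default_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')

# Explicit splitter that shuffles rows before building the stratified folds.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=41)
shuffled_scores = cross_val_score(model, X_demo, y_demo, cv=shuffled_cv, scoring='accuracy')

print(default_scores.mean(), shuffled_scores.mean())
```

On synthetic data the two means should be close, because the rows have no meaningful order; any difference on the real data would need to be measured rather than assumed.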


## Hyperparameter Tuning: GridSearchCV

In [79]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=41),
    "Decision Tree": DecisionTreeClassifier(random_state=41),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=41),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(kernel='linear', random_state=41),
   }

In [80]:
# Define hyperparameter grids for each classifier
param_grids = {
    "Logistic Regression": {
        "C": [0.1, 1, 10],
        "solver": ['lbfgs', 'liblinear'],
        "penalty": ['l2']
    },
    "Decision Tree": {
        "max_depth": [None, 10, 20, 30],

    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20, 30],

    },
    "k-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 9],
        "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute']     #The "algorithm" parameter in the KNeighborsClassifier controls how the nearest neighbors search is performed. It determines the method used to compute distances between points in the feature space and find the nearest neighbors. There are four options for the "algorithm"
    },
    "Support Vector Machine": {
        "C": [0.1, 1, 10],
        "kernel": ['linear', 'rbf'],
        "gamma": ['scale', 'auto']
    },
}
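
The size of each grid determines how many fits GridSearchCV performs: every combination of values is one candidate, and each candidate is fitted once per fold. The sketch below counts the candidates with `ParameterGrid`; the grids are copied from the cell above, while `candidate_counts` and `cv_folds` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the cell above.
param_grids = {
    "Logistic Regression": {
        "C": [0.1, 1, 10],
        "solver": ['lbfgs', 'liblinear'],
        "penalty": ['l2'],
    },
    "Decision Tree": {"max_depth": [None, 10, 20, 30]},
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20, 30],
    },
    "k-Nearest Neighbors": {
        "n_neighbors": [3, 5, 7, 9],
        "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
    },
    "Support Vector Machine": {
        "C": [0.1, 1, 10],
        "kernel": ['linear', 'rbf'],
        "gamma": ['scale', 'auto'],
    },
}

# len(ParameterGrid(grid)) is the number of parameter combinations in the grid.
candidate_counts = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}

cv_folds = 5
for name, n in candidate_counts.items():
    print(f"{name}: {n} candidates x {cv_folds} folds = {n * cv_folds} fits")
```

These counts line up with the "Fitting 5 folds for each of N candidates" lines that GridSearchCV prints for the five models defined here.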

In [40]:
X_train_75, X_test_75, y_train_75, y_test_75 = train_test_split(X, y, train_size=0.75, random_state = 41)

# Apply GridSearchCV to each model and evaluate its performance metrics on the test set
results = []
for model_name, model_instance in models.items():
    print(f"Running GridSearchCV for {model_name}...")
    grid_search = GridSearchCV(estimator=model_instance, param_grid=param_grids[model_name], cv = 5, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train_75, y_train_75)
    
    # Best model from grid search
    best_model = grid_search.best_estimator_
    
    # Predict on the test set with the best model
    y_pred = best_model.predict(X_test_75)
    
    # Calculate performance metrics
    accuracy = accuracy_score(y_test_75, y_pred)
    report = classification_report(y_test_75, y_pred, output_dict=True)
    precision = report['weighted avg']['precision']
    recall = report['weighted avg']['recall']
    f1_score = report['weighted avg']['f1-score']
    
    # Append results
    results.append([model_name, accuracy, precision, recall, f1_score, grid_search.best_params_])

Running GridSearchCV for Logistic Regression...
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Running GridSearchCV for Decision Tree...
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Running GridSearchCV for Random Forest...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Running GridSearchCV for k-Nearest Neighbors...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Running GridSearchCV for Support Vector Machine...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Running GridSearchCV for Neural Network...
Fitting 5 folds for each of 24 candidates, totalling 120 fits


In [41]:
# Convert results to DataFrame for easy viewing
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1 Score", "Best Parameters"])
print("Model Performance Metrics on Test Set after GridSearchCV:")
print(results_df)

Model Performance Metrics on Test Set after GridSearchCV:
                    Model  Accuracy  Precision    Recall  F1 Score  \
0     Logistic Regression  0.656168   0.656164  0.656168  0.656125   
1           Decision Tree  0.677165   0.677905  0.677165  0.676970   
2           Random Forest  0.695538   0.695675  0.695538  0.695530   
3     k-Nearest Neighbors  0.582677   0.583017  0.582677  0.582522   
4  Support Vector Machine  0.614173   0.615241  0.614173  0.612828   
5          Neural Network  0.593176   0.607524  0.593176  0.580809   

                                     Best Parameters  
0  {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}  
1                                {'max_depth': None}  
2            {'max_depth': None, 'n_estimators': 50}  
3            {'algorithm': 'auto', 'n_neighbors': 5}  
4   {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}  
5  {'activation': 'relu', 'hidden_layer_sizes': (...  


In [85]:
# Cross-validation for accuracy on the full dataset using best models
cv_results = []
for model_name, model_instances in models.items():
    # Get the best model from GridSearchCV
    grid_search = GridSearchCV(estimator=model_instances, param_grid=param_grids[model_name], cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(X_train_75, y_train_75)
    best_model = grid_search.best_estimator_
    
    cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='accuracy')
    cv_results.append([model_name, cv_scores.mean()])

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [86]:
# Display cross-validation results
cv_results_df = pd.DataFrame(cv_results, columns=["Model", "Mean CV Accuracy"])
print("\nCross-Validation Accuracy Scores after GridSearchCV:")
print(cv_results_df)


Cross-Validation Accuracy Scores after GridSearchCV:
                    Model  Mean CV Accuracy
0     Logistic Regression          0.597834
1           Decision Tree          0.582720
2           Random Forest          0.578745
3     k-Nearest Neighbors          0.526931
4  Support Vector Machine          0.559745
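
In the cell above, the hyperparameters are chosen on `X_train_75` and the resulting estimator is then re-scored with `cross_val_score` over the full `X`, which includes those same training rows. One common alternative is nested cross-validation, where the whole grid search is passed to `cross_val_score` so that tuning is repeated inside each outer fold. The sketch below illustrates the idea on synthetic data with a single model; it is an assumption about how this could be done, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (an assumption; not the housing data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=41)

# Inner loop: the grid search selects max_depth using only the data it is fitted on.
inner_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=41),
    param_grid={"max_depth": [None, 10, 20, 30]},
    cv=5,
)

# Outer loop: each outer fold refits the grid search on its training part
# and scores the tuned model on the held-out part.
nested_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5, scoring='accuracy')
print(nested_scores.mean())
```

The trade-off is cost: with 5 outer folds, the grid search itself runs five times.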
