## Data Preparation

After loading the data, the datatype of REF_DATE was changed to datetime format and the data is trimmed to range from "1986-01-01" till "2024-10-01".

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
file_path = "Merged_Time_Series_Data.csv"  # Update with your actual file path
df = pd.read_csv(file_path)

# Convert REF_DATE to datetime format
df['REF_DATE'] = pd.to_datetime(df['REF_DATE'])

# Define the date range
start_date = "1986-01-01"
end_date = "2024-10-01"

# Filter the dataset
df_trimmed = df[(df['REF_DATE'] >= start_date) & (df['REF_DATE'] <= end_date)]

# Display the first few rows of the trimmed dataset
print(df_trimmed.head())

# Save the trimmed dataset if needed
df_trimmed.to_csv("Trimmed_Time_Series_Data.csv", index=False)

       REF_DATE               GEO  Number_of_Households  Housing completions  \
1308 1986-01-01           Alberta                 859.0               4130.0   
1309 1986-01-01  British Columbia                1132.0               9185.0   
1310 1986-01-01            Canada                   NaN              77598.0   
1311 1986-01-01          Manitoba                 392.0               3327.0   
1312 1986-01-01     New Brunswick                 237.0               2190.0   

      Housing starts  Housing under construction  House only NHPI  \
1308          3778.1                      7477.0             28.0   
1309         11098.4                     23454.0             79.5   
1310         71934.4                    211817.0             39.4   
1311          4076.4                     10631.0             37.7   
1312           749.3                      3453.0             75.1   

      Land only NHPI  Total (house and land) NHPI  
1308            22.5                         26.4  


Checking data structure and quality

In [4]:
# Check data structure and quality
print("Dataset Information:")
df_trimmed.info()
print("\nMissing Values:")
print(df_trimmed.isnull().sum())
print("\nBasic Statistics:")
print(df_trimmed.describe())
print("\nUnique Values:")
print(df_trimmed.nunique())

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
Index: 5126 entries, 1308 to 6433
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   REF_DATE                     5126 non-null   datetime64[ns]
 1   GEO                          5126 non-null   object        
 2   Number_of_Households         4660 non-null   float64       
 3   Housing completions          5126 non-null   float64       
 4   Housing starts               5126 non-null   float64       
 5   Housing under construction   5126 non-null   float64       
 6   House only NHPI              5018 non-null   float64       
 7   Land only NHPI               5018 non-null   float64       
 8   Total (house and land) NHPI  5018 non-null   float64       
dtypes: datetime64[ns](1), float64(7), object(1)
memory usage: 400.5+ KB

Missing Values:
REF_DATE                         0
GEO                              0


Handle missing values


In [16]:
# Exclude non-numeric columns before performing numerical operations
numeric_cols = df_trimmed.select_dtypes(include=[np.number]).columns

# Interpolation to fill in missing values with a smooth trend, avoiding sudden jumps
df_trimmed[numeric_cols] = df_trimmed[numeric_cols].interpolate(method='linear')

# Fill remaining missing values with the median of each numeric column to prevent extreme values from skewing data
# Handle columns that still have missing values explicitly
for col in numeric_cols:
    if df_trimmed[col].isnull().sum() > 0:
        df_trimmed[col].fillna(df_trimmed[col].median(), inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trimmed[numeric_cols] = df_trimmed[numeric_cols].interpolate(method='linear')


In [13]:
print("\nMissing Values:")
print(df_trimmed.isnull().sum())


Missing Values:
REF_DATE                       0
GEO                            0
Number_of_Households           0
Housing completions            0
Housing starts                 0
Housing under construction     0
House only NHPI                0
Land only NHPI                 0
Total (house and land) NHPI    0
dtype: int64


## Feature Engineering


In [None]:
# Extract year, month, quarter, and day to enable seasonal and trend analysis
df_trimmed['Year'] = df_trimmed['REF_DATE'].dt.year
df_trimmed['Month'] = df_trimmed['REF_DATE'].dt.month
df_trimmed['Quarter'] = df_trimmed['REF_DATE'].dt.quarter
df_trimmed['Day'] = df_trimmed['REF_DATE'].dt.day

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trimmed['Year'] = df_trimmed['REF_DATE'].dt.year
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trimmed['Month'] = df_trimmed['REF_DATE'].dt.month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trimmed['Quarter'] = df_trimmed['REF_DATE'].dt.quarter
A value is trying to be set on a copy of 

In [17]:
df_trimmed.head()

Unnamed: 0,REF_DATE,GEO,Number_of_Households,Housing completions,Housing starts,Housing under construction,House only NHPI,Land only NHPI,Total (house and land) NHPI,Year,Month,Quarter,Day
1308,1986-01-01,Alberta,859.0,4130.0,3778.1,7477.0,28.0,22.5,26.4,1986,1,1,1
1309,1986-01-01,British Columbia,1132.0,9185.0,11098.4,23454.0,79.5,49.2,66.3,1986,1,1,1
1310,1986-01-01,Canada,762.0,77598.0,71934.4,211817.0,39.4,38.8,39.8,1986,1,1,1
1311,1986-01-01,Manitoba,392.0,3327.0,4076.4,10631.0,37.7,26.9,34.7,1986,1,1,1
1312,1986-01-01,New Brunswick,237.0,2190.0,749.3,3453.0,75.1,56.1,70.5,1986,1,1,1


Creating Lag Features (Using 1, 3, and 6 months lag).
These help the model recognize past patterns and predict future trends

In [None]:
# Create Lag Features (Using 1, 3, and 6 months lag)
