### Consolidation of Data
In this notebook, we consolidate our data from the four sources we have gathered:
1. HDB Resale Data with supplementary information
2. Macroeconomic Data for Singapore
3. Closest MRT to HDB by Year and Walking Time
4. Singapore resident population data

#### Load libraries

In [2]:
# Wrangling libraries
import numpy as np
import pandas as pd


#### Importing datasets

In [3]:
# Import HDB data
hdbdata = pd.read_parquet('../data/processed/HDB_full_resale_info_1990_2023.parquet.gzip')

# Import economic data
econdata = pd.read_csv('../data/processed/econdata_processed.csv')

# Import MRT + Walking time data
close_walking_data = pd.read_csv("../data/processed/by_year_walking_times_hdb_closest_mrt.csv")

# Import Sg population data
pop_df = pd.read_csv("../data/processed/singapore_population_1990_2023_v2.csv", index_col=0)


#### Standardizing Date Formats for HDB + Economic Dataframes

In [4]:
# Convert Date to datetime format for economic data
econdata['Date'] = pd.to_datetime(econdata['Date'], format='%Y-%m-%d')

# Create new column in hdb dataframe to make joining easier
hdbdata['Date'] = hdbdata['sold_year_month']


#### Merging HDB + Economic Dataframes

In [7]:
# Merge dataframes on Date column
hdb_econ = pd.merge(hdbdata, econdata, on='Date', how='left')

# Drop created date column (sold_year_month is sufficient)
hdb_econ = hdb_econ.drop(columns=['Date'])
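
A left merge keeps every HDB row, but it does not report how many rows found no economic match. One possible check, sketched here on toy data, uses the `validate` and `indicator` parameters of `pd.merge`. The frames `left` and `right` and the column `indicator_value` are hypothetical and not part of the actual datasets.

```python
import pandas as pd

# Toy stand-ins: three sales, but economic data for only one date
left = pd.DataFrame({
    "Date": pd.to_datetime(["1990-01-01", "1990-01-01", "1990-04-01"]),
    "resale_price": [31400.0, 66500.0, 77000.0],
})
right = pd.DataFrame({
    "Date": pd.to_datetime(["1990-01-01"]),
    "indicator_value": [1.5],  # hypothetical column
})

# validate raises MergeError if Date is not unique on the right side;
# indicator adds a _merge column recording where each row came from
checked = pd.merge(
    left, right, on="Date", how="left",
    validate="many_to_one", indicator=True,
)

# Rows present only on the left got no economic data
unmatched = int((checked["_merge"] == "left_only").sum())
print(len(checked), unmatched)
```

On the real frames, a non-zero `unmatched` count would point to dates in the HDB data that have no counterpart in the economic data.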


In [8]:
# Check to see if it worked
hdb_econ.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890376 entries, 0 to 890375
Data columns (total 55 columns):
 #   Column                                                                         Non-Null Count   Dtype         
---  ------                                                                         --------------   -----         
 0   town                                                                           890376 non-null  object        
 1   flat_type                                                                      890376 non-null  object        
 2   block                                                                          890376 non-null  object        
 3   street_name                                                                    890376 non-null  object        
 4   storey_range                                                                   890376 non-null  object        
 5   floor_area_sqm                                                          

#### Merging HDB_Econ to MRT Walking Times

In [9]:
# Merge dataframes on 'address' + 'sold_year'
hdb_time_mrt_econ = hdb_econ.merge(close_walking_data, how = 'left', on = ['address', 'sold_year'])

# Check to see if it worked
hdb_time_mrt_econ.head().T

Unnamed: 0,0,1,2,3,4
town,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA
flat_type,3 ROOM,3 ROOM,3 ROOM,3 ROOM,3 ROOM
block,44,20,14,46,49
street_name,BENDEMEER RD,ST. GEORGE'S RD,KG ARANG RD,OWEN RD,DORSET RD
storey_range,04 TO 06,04 TO 06,04 TO 06,01 TO 03,04 TO 06
floor_area_sqm,63.0,67.0,103.0,68.0,68.0
flat_model,Standard,New Generation,New Generation,New Generation,New Generation
lease_commence_date,1981,1984,1984,1982,1979
resale_price,31400.0,66500.0,77000.0,58000.0,52000.0
sold_year_month,1990-01-01 00:00:00,1990-01-01 00:00:00,1990-01-01 00:00:00,1990-01-01 00:00:00,1990-01-01 00:00:00
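
A left merge on `['address', 'sold_year']` adds rows if the right-hand table contains the same key pair more than once. The sketch below shows one way to detect and remove such duplicates before merging. The toy frames, the address strings and the `walking_time_min` column are assumptions made for illustration.

```python
import pandas as pd

keys = ["address", "sold_year"]

# Toy walking-time table with one duplicated key pair (hypothetical values)
walk_toy = pd.DataFrame({
    "address": ["44 BENDEMEER RD", "44 BENDEMEER RD", "20 ST. GEORGE'S RD"],
    "sold_year": [1990, 1990, 1990],
    "walking_time_min": [5.0, 5.0, 8.0],  # hypothetical column name
})
hdb_toy = pd.DataFrame({
    "address": ["44 BENDEMEER RD", "20 ST. GEORGE'S RD"],
    "sold_year": [1990, 1990],
    "resale_price": [31400.0, 66500.0],
})

# Count repeated key pairs in the lookup table
n_dupes = int(walk_toy.duplicated(subset=keys).sum())

# Merging as-is inflates the row count; merging after de-duplication does not
inflated = hdb_toy.merge(walk_toy, how="left", on=keys)
deduped = hdb_toy.merge(walk_toy.drop_duplicates(subset=keys), how="left", on=keys)
print(n_dupes, len(hdb_toy), len(inflated), len(deduped))
```

Comparing `len()` before and after the merge is a quick way to confirm that the row count is unchanged.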


#### Merging HDB_Econ_MRT_Walking to Singapore Population

In [10]:
# Merge dataframes on 'sold_year' (HDB side) and 'key' (population side)
hdb_time_mrt_econ = hdb_time_mrt_econ.merge(pop_df, how = 'left', left_on='sold_year', right_on='key')

# Check to see if it worked
hdb_time_mrt_econ.head().T

Unnamed: 0,0,1,2,3,4
town,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA,KALLANG/WHAMPOA
flat_type,3 ROOM,3 ROOM,3 ROOM,3 ROOM,3 ROOM
block,44,20,14,46,49
street_name,BENDEMEER RD,ST. GEORGE'S RD,KG ARANG RD,OWEN RD,DORSET RD
storey_range,04 TO 06,04 TO 06,04 TO 06,01 TO 03,04 TO 06
...,...,...,...,...,...
SingaporePermanentResident,112132,112132,112132,112132,112132
ResidentPopulation_Growth_Rate,2.150542,2.150542,2.150542,2.150542,2.150542
Non_ResidentPopulation_Growth_Rate,9.353796,9.353796,9.353796,9.353796,9.353796
SingaporeCitizen_Growth_Rate,1.558427,1.558427,1.558427,1.558427,1.558427
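
Because this merge uses `left_on` and `right_on` with different names, pandas keeps both key columns, which is why `key` survives into the merged frame. A minimal sketch of that behaviour, and of a rename-based alternative, follows. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the HDB frame and the population frame (illustrative values)
hdb_toy = pd.DataFrame({
    "sold_year": [1990, 1991],
    "resale_price": [31400.0, 66500.0],
})
pop_toy = pd.DataFrame({
    "key": [1990, 1991],
    "ResidentPopulation_Growth_Rate": [1.0, 2.0],
})

# Differently named keys: both 'sold_year' and 'key' appear in the result
merged = hdb_toy.merge(pop_toy, how="left", left_on="sold_year", right_on="key")
print("key" in merged.columns)

# Alternative: rename the right-hand key first so only one key column remains
merged_alt = hdb_toy.merge(
    pop_toy.rename(columns={"key": "sold_year"}),
    how="left", on="sold_year",
)
print(list(merged_alt.columns))
```

Either approach works; the notebook keeps the first form and removes `key` in the column-dropping step.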


#### Dropping Columns

The merged dataset is now fairly wide, so we drop columns that we expect to add little information for predicting HDB resale prices.

In [11]:
# List of columns to drop
drop_cols = ['1room_sold','2room_sold','3room_sold','4room_sold','5room_sold','exec_sold','multigen_sold',
             'studio_apartment_sold','  1-Room Residential Properties','  2-Room Residential Properties',
             '  3-Room Residential Properties','  4-Room Residential Properties','  5-Room Residential Properties',
             '  Executive Properties', 'Quarter', 'Year', 'key',
             'remaining_lease_in_2023', 'year_completed', 'residential', 'commercial', 'market_hawker', 'miscellaneous',
             'multistorey_carpark', 'precinct_pavilion', 'total_dwelling_units', 'Total Residential Properties',
             'Stamp Duty', 'LTV Value', 'yearly core inflation', 'yearly cement inflation', 'yearly clay inflation', 'GNI per capita',
             'TotalPopulation', 'SingaporeCitizenPopulation', 'Non_ResidentPopulation', 'SingaporePermanentResident',
             'Non_ResidentPopulation_Growth_Rate', 'SingaporeCitizen_Growth_Rate', 'SingaporePR_Growth_Rate']

# Drop the columns
hdb_time_mrt_econ = hdb_time_mrt_econ.drop(columns=drop_cols)
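
`DataFrame.drop` raises a `KeyError` if any listed label is absent, and several names in `drop_cols` start with two spaces, which makes them easy to mistype. The sketch below shows one way to list mismatched names before dropping. `df_toy` and `drop_toy` are hypothetical stand-ins, not the real frame or list.

```python
import pandas as pd

# Toy frame with one column name that carries leading spaces
df_toy = pd.DataFrame({
    "town": ["KALLANG/WHAMPOA"],
    "Quarter": [1],
    "  Executive Properties": [0],
})

# Candidate drop list: one name lacks its leading spaces, one does not exist
drop_toy = ["Quarter", "Executive Properties", "key"]

# Report labels that would make a plain drop() fail
missing = [col for col in drop_toy if col not in df_toy.columns]
print(missing)

# errors="ignore" skips absent labels instead of raising KeyError
trimmed = df_toy.drop(columns=drop_toy, errors="ignore")
print(list(trimmed.columns))
```

Note that `errors="ignore"` silently leaves the mistyped column in place, so inspecting `missing` first is the safer habit.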

In [12]:
# Checking column data types
hdb_time_mrt_econ.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890376 entries, 0 to 890375
Data columns (total 27 columns):
 #   Column                                                                         Non-Null Count   Dtype         
---  ------                                                                         --------------   -----         
 0   town                                                                           890376 non-null  object        
 1   flat_type                                                                      890376 non-null  object        
 2   block                                                                          890376 non-null  object        
 3   street_name                                                                    890376 non-null  object        
 4   storey_range                                                                   890376 non-null  object        
 5   floor_area_sqm                                                          

#### Saving Consolidated Dataset into a Parquet File

In [13]:
# Save file (uncomment the line below to write the consolidated dataset to disk)
#hdb_time_mrt_econ.to_parquet('../data/processed/final_HDB_for_model.parquet.gzip', compression='gzip')
