# Preprocessing & Feature Engineering
---

This notebook imports the cleaned training and testing datasets, creates dummy variables, and generates polynomial features to prepare the data for a housing price prediction model. The steps are as follows:



1. **Import Training and Testing Datasets**: Load the train and test data into the environment.

2. **Convert Data Type for facilities Column**: Convert the facilities column from a string to a list for easier manipulation.

3. **Function to Generate Dummy Variables**: Define functions that add dummy variables for categorical features in both the train and test datasets.

4. **Function to Create Polynomial Features**: Define a function that generates polynomial features to capture non-linear relationships in both the train and test datasets.

### 1. Import Training and Testing Datasets

In [348]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn import metrics
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

In [349]:
preprocess_data = pd.read_csv('../Dataset/bangkok_cleaned.csv')

In [350]:
test_preprocess_data = pd.read_csv('../Dataset/test_cleaned.csv')

In [351]:
preprocess_data.isnull().sum()[preprocess_data.isnull().sum()!=0]

nearby_station_distance    7014
dtype: int64

### 2. Convert Data Type for facilities Column

In [353]:
def extract_facility(facilities):
    # Missing values (NaN) get a single placeholder category
    if pd.isnull(facilities):
        return ['None']

    # Strip the surrounding brackets from the stringified list, e.g. "['a', 'b']"
    facility_text = str(facilities)[1:-1]

    # Split on commas and trim whitespace so each facility maps to one dummy column
    facility_list = [item.strip() for item in facility_text.split(',')]

    return facility_list

In [354]:
# Apply the extract_facility function to the 'facilities' column in train and test dataframe
preprocess_data['facilities'] = preprocess_data['facilities'].apply(extract_facility)
test_preprocess_data['facilities'] = test_preprocess_data['facilities'].apply(extract_facility)
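
As a cross-check on the string slicing above, the same conversion can be sketched with `ast.literal_eval`, which parses a stringified Python list directly. This is only an illustration: `parse_facilities` is a hypothetical helper, and it assumes the column stores values shaped like `"['Pool', 'Gym']"`, which the notebook does not show explicitly.

```python
import ast
import pandas as pd

def parse_facilities(value):
    """Hypothetical alternative to extract_facility (assumes "['a', 'b']"-style strings)."""
    # Missing values become a single placeholder category
    if pd.isnull(value):
        return ['None']
    try:
        parsed = ast.literal_eval(str(value))
    except (ValueError, SyntaxError):
        # Fall back to manual splitting if the string is not a valid Python literal
        parsed = str(value).strip('[]').split(',')
    # A bare scalar (e.g. "5") is wrapped so the result is always a list
    if not isinstance(parsed, (list, tuple)):
        parsed = [parsed]
    # Drop surrounding whitespace and quotes, and skip empty entries
    items = [str(item).strip().strip("'\"") for item in parsed]
    return [item for item in items if item]

# Toy values, not taken from the dataset
sample = pd.Series(["['Pool', 'Gym']", None, "[]"])
print(sample.apply(parse_facilities).tolist())
```

Unlike the slicing approach, this returns an empty list for `"[]"`, so such rows would produce no facility dummies at all once exploded.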

### 3. Function to Generate Dummy Variables

In [356]:
# Pivot the station data so each station gets its own distance column; listings without that station get 0.
def pivot_station(data):
    pivot = data.pivot_table(index=data.index,
                             columns='station_name',
                             values='station_distance',
                             aggfunc='sum')
    # Rows with no station are dropped by pivot_table, so restore them before filling with 0
    pivot = pivot.reindex(data.index).fillna(0)
    pivot.columns = ['stat_' + str(col) for col in pivot.columns]
    data = pd.concat([data.drop(columns=['nearby_station_distance', 'station_name', 'station_distance']), pivot], axis=1)
    return data
    
preprocess_data = pivot_station(preprocess_data)
test_preprocess_data = pivot_station(test_preprocess_data)
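
To make the effect of the pivot concrete, here is a minimal sketch on a toy frame. The station names, distances, and the `price` column are invented for illustration and are not taken from the dataset; the third row has no station, mirroring the missing `nearby_station_distance` values seen earlier.

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    'price': [100, 200, 300],
    'station_name': ['Asok', 'Siam', None],
    'station_distance': [250.0, 400.0, None],
})

# One column per station, one row per listing that has a station
pivot = toy.pivot_table(index=toy.index, columns='station_name',
                        values='station_distance', aggfunc='sum')

# Listings without a station are absent from the pivot; bring them back as zeros
pivot = pivot.reindex(toy.index).fillna(0)
pivot.columns = ['stat_' + str(col) for col in pivot.columns]

toy_wide = pd.concat([toy.drop(columns=['station_name', 'station_distance']), pivot], axis=1)
print(toy_wide)
```

Each listing ends up with a non-zero distance only in the column of its own station, so a value of 0 here means "no such station nearby" rather than "zero metres away".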

In [357]:
# Before creating dummy variables for facilities, we need to explode the list of facilities.
def explode_facility(data):
    # Explode the facilities list
    facility_exploded = data.explode('facilities')
    
    # Create dummy variables
    facility_dummies = pd.get_dummies(facility_exploded['facilities'].str.replace("'", ''), prefix='faci')
    
    # Group by the original index and sum the dummy variables
    facility_pivoted = facility_dummies.groupby(facility_exploded.index).sum()
    
    # Concatenate the dummy variables with the original DataFrame after dropping the 'facilities' column
    data = pd.concat([data.drop(columns=['facilities']), facility_pivoted], axis=1)

    return data
    
preprocess_data = explode_facility(preprocess_data)
test_preprocess_data = explode_facility(test_preprocess_data)
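
The explode, dummy, and group-by-index sequence can be traced on a two-row toy frame. The facility names and the `price` column below are made up for the illustration.

```python
import pandas as pd

# Toy data (hypothetical values): each listing holds a list of facilities
toy = pd.DataFrame({
    'price': [100, 200],
    'facilities': [['Pool', 'Gym'], ['Gym']],
})

# One row per (listing, facility) pair; the original index is repeated
exploded = toy.explode('facilities')

# One indicator column per facility
dummies = pd.get_dummies(exploded['facilities'], prefix='faci', dtype=int)

# Collapse back to one row per listing by summing over the repeated index
per_listing = dummies.groupby(exploded.index).sum()

toy_wide = pd.concat([toy.drop(columns=['facilities']), per_listing], axis=1)
print(toy_wide)
```

Because the indicators are summed, a facility listed twice for the same property would produce a count of 2 rather than a 0/1 flag.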

In [358]:
# For these categorical columns, we can create dummy variables directly.
preprocess_data = pd.get_dummies(data=preprocess_data, columns=["property_type", "district", "province"])
test_preprocess_data = pd.get_dummies(data=test_preprocess_data, columns=["property_type", "district", "province"])
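
Because `pd.get_dummies` is applied to the train and test frames separately, a category that appears in only one of them produces a column the other lacks. One possible way to reconcile the two, sketched here on toy data, is to reindex the test frame to the training feature columns. The district values and the `price` target name are assumptions for the example, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical): district 'D' appears only in test, 'B' and 'C' only in train
train = pd.DataFrame({'district': ['A', 'B', 'C'], 'price': [1, 2, 3]})
test = pd.DataFrame({'district': ['A', 'D']})

train_d = pd.get_dummies(train, columns=['district'])
test_d = pd.get_dummies(test, columns=['district'])

# Use the training feature columns as the reference layout
feature_cols = [c for c in train_d.columns if c != 'price']

# Columns missing from test are added as zeros; test-only columns are dropped
test_aligned = test_d.reindex(columns=feature_cols, fill_value=0)
print(test_aligned.columns.tolist())
```

Dropping test-only categories is a deliberate choice in this sketch: a model fitted on the training columns has no coefficient for them anyway.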

### 4. Function to Create Polynomial Features

In [360]:
def poly_feature(data):
    # Select the initial features for polynomial expansion
    starter_features = ['bedrooms','baths','floor_area', 'total_facilities']
    
    # Select features for polynomial transformation
    data_poly = data[starter_features]
    
    # Generate polynomial features of degree 3 without the bias term
    poly = PolynomialFeatures(include_bias=False, degree=3)
    X_poly = poly.fit_transform(data_poly)
    
    # Create a DataFrame for polynomial features, keeping the original index so rows align on concat
    poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(starter_features), index=data.index)
    data = pd.concat([data, poly_df], axis=1)
    # The degree-1 terms duplicate the original columns, so keep only the first occurrence
    data = data.loc[:, ~data.columns.duplicated()]

    return data

preprocess_data = poly_feature(preprocess_data)
test_preprocess_data = poly_feature(test_preprocess_data)

### Save Files for Model Tuning

In [362]:
preprocess_data.to_csv('../Dataset/bangkok_preprocess.csv',index = False)

In [363]:
test_preprocess_data.to_csv('../Dataset/test_preprocess.csv',index = False)