## Feature Engineering

*Coding along with the Udemy Course [Machine Learning Applied to Stock & Crypto Trading](https://www.udemy.com/course/machine-learning-applied-to-stock-crypto-trading-python/) by Shaun McDonogh.*

In [145]:
# data management
import pandas as pd
import numpy as np

# statistics
from statsmodels.tsa.stattools import adfuller

# data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# supervised machine learning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold

# reporting
import matplotlib.pyplot as plt

### Data Ingestion

In [146]:
df = pd.read_csv("../assets/data/SydneyHousePrices.csv")
print(f"Length of Data: {len(df)}")
df.head()

Length of Data: 199504


Unnamed: 0,Date,Id,suburb,postalCode,sellPrice,bed,bath,car,propType
0,2019-06-19,1,Avalon Beach,2107,1210000,4.0,2,2.0,house
1,2019-06-13,2,Avalon Beach,2107,2250000,4.0,3,4.0,house
2,2019-06-07,3,Whale Beach,2107,2920000,3.0,3,2.0,house
3,2019-05-28,4,Avalon Beach,2107,1530000,3.0,1,2.0,house
4,2019-05-22,5,Whale Beach,2107,8000000,5.0,4,4.0,house


In [147]:
# interpret data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199504 entries, 0 to 199503
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Date        199504 non-null  object 
 1   Id          199504 non-null  int64  
 2   suburb      199504 non-null  object 
 3   postalCode  199504 non-null  int64  
 4   sellPrice   199504 non-null  int64  
 5   bed         199350 non-null  float64
 6   bath        199504 non-null  int64  
 7   car         181353 non-null  float64
 8   propType    199504 non-null  object 
dtypes: float64(2), int64(4), object(3)
memory usage: 13.7+ MB


### Feature Engineering - Common Tasks

#### Handle Non-Numerical Data

In [148]:
# Count unique items for suburb
suburb_text_unique = df["suburb"].unique()
print("Unique Suburbs: ", len(suburb_text_unique))
print("Perform LabelEncoding")

Unique Suburbs:  685
Perform LabelEncoding


In [149]:
# suburb_text_unique # let's see what suburbs we have

In [150]:
# Count unique items for propType
prop_type_text_unique = df["propType"].unique()
print("Unique Prop Types: ", len(prop_type_text_unique))
print("Perform OneHotEncoding")

Unique Prop Types:  8
Perform OneHotEncoding


In [151]:
# label encoding
labelencoder = LabelEncoder()
encoded_suburbs = labelencoder.fit_transform(df["suburb"])
df["suburbs_encoded"] = encoded_suburbs
df.head()

Unnamed: 0,Date,Id,suburb,postalCode,sellPrice,bed,bath,car,propType,suburbs_encoded
0,2019-06-19,1,Avalon Beach,2107,1210000,4.0,2,2.0,house,22
1,2019-06-13,2,Avalon Beach,2107,2250000,4.0,3,4.0,house,22
2,2019-06-07,3,Whale Beach,2107,2920000,3.0,3,2.0,house,654
3,2019-05-28,4,Avalon Beach,2107,1530000,3.0,1,2.0,house,22
4,2019-05-22,5,Whale Beach,2107,8000000,5.0,4,4.0,house,654


In [152]:
# one hot encoding
onehot_encoded = pd.get_dummies(df["propType"], prefix="pt", drop_first=True)
df = df.join(onehot_encoded)
df.head(3)

Unnamed: 0,Date,Id,suburb,postalCode,sellPrice,bed,bath,car,propType,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse
0,2019-06-19,1,Avalon Beach,2107,1210000,4.0,2,2.0,house,22,False,True,False,False,False,False,False
1,2019-06-13,2,Avalon Beach,2107,2250000,4.0,3,4.0,house,22,False,True,False,False,False,False,False
2,2019-06-07,3,Whale Beach,2107,2920000,3.0,3,2.0,house,654,False,True,False,False,False,False,False
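To make the difference between the two encodings easier to see, here is a minimal sketch on a tiny made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the Sydney dataset). Label encoding maps each category to one integer in sorted order, while one-hot encoding with `drop_first=True` creates one boolean column per category except the first.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data, made up for illustration only
toy = pd.DataFrame({"propType": ["house", "villa", "house", "terrace"]})

# label encoding: one integer per category, assigned in sorted order
le = LabelEncoder()
toy["propType_label"] = le.fit_transform(toy["propType"])

# one-hot encoding: one boolean column per category;
# drop_first=True drops the first category, which becomes the implicit baseline
toy_onehot = pd.get_dummies(toy["propType"], prefix="pt", drop_first=True)

label_classes = list(le.classes_)
onehot_columns = list(toy_onehot.columns)

print(label_classes)                   # ['house', 'terrace', 'villa']
print(toy["propType_label"].tolist())  # [0, 2, 0, 1]
print(onehot_columns)                  # ['pt_terrace', 'pt_villa']
```

Label encoding keeps a single column, which suits a feature with many categories such as `suburb`, but the integers imply an order that does not really exist. One-hot encoding avoids that at the cost of one column per category, which is manageable for the few values of `propType`.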


#### Set Target

In [153]:
# set target
# our target we want to predict is the sell price
df["TARGET"] = df["sellPrice"]
df.head(3)

Unnamed: 0,Date,Id,suburb,postalCode,sellPrice,bed,bath,car,propType,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse,TARGET
0,2019-06-19,1,Avalon Beach,2107,1210000,4.0,2,2.0,house,22,False,True,False,False,False,False,False,1210000
1,2019-06-13,2,Avalon Beach,2107,2250000,4.0,3,4.0,house,22,False,True,False,False,False,False,False,2250000
2,2019-06-07,3,Whale Beach,2107,2920000,3.0,3,2.0,house,654,False,True,False,False,False,False,False,2920000


#### Remove Redundant Features

In [154]:
# remove features
df_drop = df.copy() # copying the dataframe first
df_drop.drop(columns=["Date", "Id", "suburb", "propType", "sellPrice"], inplace=True)
df_drop.head()

Unnamed: 0,postalCode,bed,bath,car,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse,TARGET
0,2107,4.0,2,2.0,22,False,True,False,False,False,False,False,1210000
1,2107,4.0,3,4.0,22,False,True,False,False,False,False,False,2250000
2,2107,3.0,3,2.0,654,False,True,False,False,False,False,False,2920000
3,2107,3.0,1,2.0,22,False,True,False,False,False,False,False,1530000
4,2107,5.0,4,4.0,654,False,True,False,False,False,False,False,8000000


#### Check for NaN or Inf Values

In [155]:
# Check for Null or Inf (infinity) values
is_null = df_drop.isnull().values.any()
is_inf = df_drop.isin([np.inf, -np.inf]).values.any()
print("Is Null: ", is_null)
print("Is Inf: ", is_inf)

Is Null:  True
Is Inf:  False


In [156]:
# Fill NA
df_drop = df_drop.fillna(df_drop.mean()) # filling na with the average
df_drop.isnull().values.any()

np.False_

#### Feature Scaling - Min Max Scaling

Neural networks are very sensitive to the scale of their inputs, so large raw values such as prices can make training unstable. They work best with numbers in a small range, for example between zero and one, between minus one and one, or between minus five and plus five. One way to get there is `min max scaling`: each value is rescaled according to where it sits between the minimum and the maximum of its own column, `x_scaled = (x - min) / (max - min)`, so the column minimum becomes 0 and the column maximum becomes 1.
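As a quick sanity check of the rule `x_scaled = (x - min) / (max - min)`, the sketch below scales a small made-up price column by hand and compares the result with `MinMaxScaler`. The `prices` values are assumptions chosen for easy arithmetic.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy column of prices (made-up values)
prices = np.array([[500_000.0], [1_000_000.0], [2_500_000.0]])

# by hand: (x - min) / (max - min), computed per column
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
manual = (prices - col_min) / (col_max - col_min)

# the same with scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(prices)

print(manual.ravel())               # [0.   0.25 1.  ]
print(np.allclose(manual, scaled))  # True
```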

In [157]:
df_scaling = df_drop.copy()

In [158]:
# cast the integer and boolean columns to float so the scaled values can be written back
# (bed and car are already float, so converting every column gives the same result)
df_scaling = df_scaling.astype(float)
df_scaling

Unnamed: 0,postalCode,bed,bath,car,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse,TARGET
0,2107.0,4.0,2.0,2.0,22.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1210000.0
1,2107.0,4.0,3.0,4.0,22.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2250000.0
2,2107.0,3.0,3.0,2.0,654.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2920000.0
3,2107.0,3.0,1.0,2.0,22.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1530000.0
4,2107.0,5.0,4.0,4.0,654.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,8000000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
199499,2234.0,5.0,3.0,7.0,318.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1900000.0
199500,2234.0,4.0,3.0,2.0,318.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,980000.0
199501,2234.0,4.0,2.0,2.0,5.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,850000.0
199502,2234.0,3.0,2.0,2.0,318.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,640000.0


In [159]:
mms = MinMaxScaler() # from sklearn.preprocessing
# iloc[:] writes the scaled values back into every column, including the target
df_scaling.iloc[:] = mms.fit_transform(df_scaling)
df_scaling.head()

Unnamed: 0,postalCode,bed,bath,car,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse,TARGET
0,0.037179,0.030612,0.010204,0.025,0.032164,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000563
1,0.037179,0.030612,0.020408,0.075,0.032164,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.001048
2,0.037179,0.020408,0.020408,0.025,0.95614,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.00136
3,0.037179,0.020408,0.0,0.025,0.032164,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000712
4,0.037179,0.040816,0.030612,0.075,0.95614,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.003725


### Train Test Split

We split the data into a training set, used to fit the model, and a test set, held back to check the predictions on data the model has not seen.

In [160]:
# use correct dataframe
is_deep_learning = False # we're not going to use deep learning
# so the next line decides which dataframe to use
# if DL we'll use dataframe with Min Max Scaling
df_tts = df_scaling.copy() if is_deep_learning else df_drop.copy()
df_tts.head(3)

Unnamed: 0,postalCode,bed,bath,car,suburbs_encoded,pt_duplex/semi-detached,pt_house,pt_other,pt_terrace,pt_townhouse,pt_villa,pt_warehouse,TARGET
0,2107,4.0,2,2.0,22,False,True,False,False,False,False,False,1210000
1,2107,4.0,3,4.0,22,False,True,False,False,False,False,False,2250000
2,2107,3.0,3,2.0,654,False,True,False,False,False,False,False,2920000


In [161]:
# split X and y data
X = df_tts.iloc[:, : -1].values # all the data going into the model
y = df_tts.iloc[:, -1].values # only the TARGET column
print("X Values: \n", X[:2])
print("y Values: \n", y[:5])

X Values: 
 [[2107 4.0 2 2.0 22 False True False False False False False]
 [2107 4.0 3 4.0 22 False True False False False False False]]
y Values: 
 [1210000 2250000 2920000 1530000 8000000]


In [162]:
# train test split
# hold back 10% of the data for testing: test_size=0.1
# random_state=1 seeds the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1, shuffle=True)
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

X_train:  (179553, 12)
X_test:  (19951, 12)
y_train:  (179553,)
y_test:  (19951,)
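The effect of `random_state` can be checked on toy data. This is a minimal sketch with made-up arrays (`X_toy` and `y_toy` are assumptions): calling `train_test_split` twice with the same seed should return the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features
y_toy = np.arange(10)

# same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.1, random_state=1)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.1, random_state=1)

same_split = np.array_equal(a_test, b_test)
print(a_train.shape, a_test.shape)  # (9, 2) (1, 2)
print(same_split)                   # True
```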


### Machine Learning

__In this example we use scikit-learn's `RandomForestRegressor` as the machine learning algorithm:__

`RandomForestRegressor` is an ensemble learning method that creates multiple decision trees and combines their predictions. It works by:

1. Building many decision trees, each trained on a bootstrap sample of the data (and, if `max_features` is lowered, considering only a random subset of the features at each split)
2. For predictions, averaging the outputs of all individual trees

The key benefits are:

- Handles non-linear relationships well
- Reduces overfitting through averaging multiple trees
- Provides feature importance scores
- Requires minimal preprocessing (no scaling needed)

Main parameters that control its behavior:

- `n_estimators`: number of trees
- `max_depth`: maximum depth of each tree
- `min_samples_split`: minimum samples needed to split a node

*(RandomForestRegressor according to Claude.ai)*
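The averaging step and the feature importance scores can be illustrated with a small sketch. It uses synthetic data from `make_regression` rather than the house price data, and the names `X_demo`, `y_demo` and `forest` are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data (not the house price data)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# each fitted tree is available in estimators_; average their predictions by hand
tree_preds = np.stack([tree.predict(X_demo[:3]) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

# the forest's prediction is the mean of the individual tree predictions
matches_forest = np.allclose(manual_mean, forest.predict(X_demo[:3]))
print(matches_forest)  # True

# impurity-based feature importances: one value per feature, summing to 1
importances = forest.feature_importances_
print(importances.round(3))
```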

In [163]:
# train our regressor
# this is going to be a regression based ml algorithm
# our ml model here is RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
regressor.fit(X_train, y_train) # passing in the training data, finding relationships to y_train

In [164]:
# make predictions on test set
y_pred = regressor.predict(X_test)
y_pred

array([ 590029.10825379, 2022675.047831  , 1112679.36926785, ...,
       2892150.6816391 ,  679998.09396343,  764235.15496623])

In [166]:
y_pred = [round(x, 0) for x in y_pred] # rounding the numbers
print("Test Predictions", y_pred[:5])
print("Test Actuals", y_test[:5])

Test Predictions [np.float64(590029.0), np.float64(2022675.0), np.float64(1112679.0), np.float64(1045638.0), np.float64(869990.0)]
Test Actuals [ 730000 1350100  860000 1390000  985000]


In [167]:
# estimate the prediction error with repeated 5-fold cross-validation (3 repeats = 15 fits)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
n_scores = cross_val_score(regressor, X_train, y_train, scoring="neg_mean_absolute_error", cv=cv, n_jobs=-1, 
                           error_score="raise")

In [168]:
# Report Performance
print("MAE Avg: ", abs(n_scores.mean())) # mean absolute error
print("MAE Std: ", n_scores.std()) # standard deviation around the mean

MAE Avg:  389118.5199750819
MAE Std:  20152.765770107322
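To see what the mean absolute error measures, the sketch below computes it by hand for the five test predictions and actual prices printed earlier and compares the result with scikit-learn's `mean_absolute_error`. Five samples are far too few to judge the model; this only illustrates the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# first five test predictions and actual prices, copied from the output above
preds = np.array([590029.0, 2022675.0, 1112679.0, 1045638.0, 869990.0])
actuals = np.array([730000, 1350100, 860000, 1390000, 985000])

# MAE by hand: mean of the absolute differences
mae_manual = np.abs(preds - actuals).mean()

# the same with scikit-learn
mae_sklearn = mean_absolute_error(actuals, preds)

print(mae_manual)   # 304919.4
print(mae_sklearn)  # 304919.4
```

The scorer `"neg_mean_absolute_error"` returns the negated error because scikit-learn scorers follow a higher-is-better convention, which is why the report cell wraps the mean in `abs()`.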


### Resources and Useful Reading

Data from: https://www.kaggle.com/datasets/mihirhalai/sydney-house-prices

One Hot Encoding vs Label Encoding: https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

Mean Absolute Error Scoring (also using House Prices): http://www.andrewgurung.com/2018/12/28/regression-model-evaluation-mean-absolute-error/