# Feature engineering for 30 Days of ML Challenge, by Juan Torres

#### Based on Abhishek Thakur's tutorials and notebooks:

https://www.youtube.com/watch?v=tx3FoYdiFwA

https://www.kaggle.com/abhishek/competition-part-2-feature-engineering

In our previous notebook we implemented k-folds into our previously built pipeline that used one-hot encoding with an XGBRegressor as a model, and achieved better results. From this point on we will follow Abhishek Thakur's videos for ideas on how to improve our model. Although we have spoken about parameter optimization in the past, first we will do some feature engineering on the data.

## 1. Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# For one-hot encoding categorical variables
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing

# from sklearn.model_selection import train_test_split We won't be needing this anymore!

# For the construction of the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For training the XGBoost model
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/train-folds-30-days-of-ml/train_folds.csv
/kaggle/input/30-days-of-ml/sample_submission.csv
/kaggle/input/30-days-of-ml/train.csv
/kaggle/input/30-days-of-ml/test.csv


## 2. Loading and preparing data and pipeline construction

In [2]:
# Load the training and test data. 
X_full = pd.read_csv("../input/train-folds-30-days-of-ml/train_folds.csv")
X_test_full = pd.read_csv("../input/30-days-of-ml/test.csv")

In [3]:
# We select all features except "id", "target" and "kfold", as these are not predictors of our target.
useful_features = [c for c in X_full.columns if c not in ("id", "target", "kfold")]

# Select numerical columns
num_cols = [col for col in useful_features if 'cont' in col]

# We select categorical columns. Note that we dropped the cardinality check.
object_cols = [col for col in useful_features if 'cat' in col]

# We build X_test out of X_test_full, but only selecting the useful features.
X_test = X_test_full[useful_features]

In [4]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data and one-hot encoding
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, num_cols),('cat', categorical_transformer, object_cols)])

# Define the model 
model = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor") # In Abhishek's method random_state was altered with each fold (as random_state = fold), so we'll trade repeatability for some induced randomness.

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

We will make some changes to our previous model:
* We will make a new list called "scores", in which we will store the scores to obtain an average and a standard deviation.

In [5]:
# We set up a list to store the final predictions.
final_predictions = []

# We set up a list for storing the mean non squared error scores.
scores = []

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid = my_pipeline.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test = my_pipeline.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions.append(preds_test) # We append each of the test predictions on to our final_predictions list.
    rmse = mean_squared_error(y_valid, preds_valid, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse) # Print the fold number, and the mean non squared error for each fold.
    scores.append(rmse) # We append the rmse value to the scores list.
    
print(np.mean(scores), np.std(scores)) # Print the mean non square error average, and its standard deviation

0 0.7233328877365391
1 0.7303783657341386
2 0.7263870188205617
3 0.7254889329611524
4 0.7201351391590578
0.7251444688822899 0.0033891436272372347


Okay, let's take another look at the data to see what we can see about the features:

In [6]:
X_full

Unnamed: 0,id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,...,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,target,kfold
0,1,B,B,B,C,B,B,A,E,C,...,0.160266,0.310921,0.389470,0.267559,0.237281,0.377873,0.322401,0.869850,8.113634,4
1,2,B,B,A,A,B,D,A,F,A,...,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083,8.481233,0
2,3,A,A,A,C,B,D,A,D,A,...,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846,8.364351,4
3,4,B,B,A,C,B,D,A,E,C,...,0.239061,0.732948,0.679618,0.574844,0.346010,0.714610,0.540150,0.280682,8.049253,1
4,6,A,A,A,C,B,D,A,E,A,...,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823,7.972260,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,499993,B,B,A,A,B,D,A,E,A,...,0.450538,0.934360,1.005077,0.853726,0.422541,1.063463,0.697685,0.506404,7.945605,4
299996,499996,A,B,A,C,B,B,A,E,E,...,0.508502,0.358247,0.257825,0.433525,0.301015,0.268447,0.577055,0.823611,7.326118,4
299997,499997,B,B,A,C,B,C,A,E,G,...,0.372425,0.364936,0.383224,0.551825,0.661007,0.629606,0.714139,0.245732,8.706755,3
299998,499998,A,B,A,C,B,B,A,E,E,...,0.424243,0.382028,0.468819,0.351036,0.288768,0.611169,0.380254,0.332030,7.229569,4


Not much huh? Since we don't have any idea as to what the data actually is, we can't infer much information just by eyeballing it, and feature engineering is usually done when we have some general ideas about the data, otherwise it will be just throwing spaghetti against the wall and seeing what sticks. Luckily, I like italian food. We have both numerical and categorical values, let's see what we can do about this data:
* Standardization: Used for numerical values, we subtract the mean value from the feature and divide by the standard deviation.
* Normalization: Used for numerical values.
* Log transformation: Used for numerical values. We go to each column and replace every numerical column by the log of 1 plus x.
* Polynomial Features: Used for numerical values. Sklearn function, generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

## 3. Implementing Standardization

For implementing standardization, we will change our pipeline to preprocess numerical data with a StandardScaler:

In [7]:
# Preprocessing for numerical data
numerical_transformer_2 = preprocessing.StandardScaler() # We change the numerical transformer to use a StandardScaler

# Preprocessing for categorical data and one-hot encoding
categorical_transformer_2 = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor_2 = ColumnTransformer(transformers=[('num', numerical_transformer_2, num_cols),('cat', categorical_transformer_2, object_cols)])

# Define the model 
model_2 = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor") # In Abhishek's method random_state was altered with each fold (as random_state = fold), so we'll trade repeatability for some induced randomness.

# Bundle preprocessing and modeling code in a pipeline
my_pipeline_2 = Pipeline(steps=[('preprocessor', preprocessor_2), ('model', model_2)])

In [8]:
# We set up a list to store the final predictions.
final_predictions_2 = []

# We set up a list for storing the mean non squared error scores.
scores_2 = []

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline_2.fit(X_train, y_train)

    preds_valid_2 = my_pipeline_2.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test_2 = my_pipeline_2.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions_2.append(preds_test_2) # We append each of the test predictions on to our final_predictions list.
    rmse_2 = mean_squared_error(y_valid, preds_valid_2, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse_2) # Print the fold number, and the mean non squared error for each fold.
    scores_2.append(rmse_2) # We append the rmse value to the scores list.
    
print(np.mean(scores_2), np.std(scores_2)) # Print the mean non square error average, and its standard deviation

0 0.7222957994558532
1 0.7303877162646868
2 0.7262597753893988
3 0.7254887712952974
4 0.7204276928254211
0.7249719510461314 0.0034370969646553865


We got an improvement in most of our folds! So, even if we know nothing about our data, it seems trying out standardization is worth it.

## 4. Implementing Normalization

For implementing normalization, we will change our pipeline to preprocess numerical data with a Normalizer:

In [9]:
# Preprocessing for numerical data
numerical_transformer_3 = preprocessing.Normalizer() # We change the numerical transformer to use a normalizer

# Preprocessing for categorical data and one-hot encoding
categorical_transformer_3 = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor_3 = ColumnTransformer(transformers=[('num', numerical_transformer_3, num_cols),('cat', categorical_transformer_3, object_cols)])

# Define the model 
model_3 = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor") # In Abhishek's method random_state was altered with each fold (as random_state = fold), so we'll trade repeatability for some induced randomness.

# Bundle preprocessing and modeling code in a pipeline
my_pipeline_3 = Pipeline(steps=[('preprocessor', preprocessor_3), ('model', model_3)])

In [10]:
# We set up a list to store the final predictions.
final_predictions_3 = []

# We set up a list for storing the mean non squared error scores.
scores_3 = []

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline_3.fit(X_train, y_train)

    preds_valid_3 = my_pipeline_3.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test_3 = my_pipeline_3.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions_3.append(preds_test_3) # We append each of the test predictions on to our final_predictions list.
    rmse_3 = mean_squared_error(y_valid, preds_valid_3, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse_3) # Print the fold number, and the mean non squared error for each fold.
    scores_3.append(rmse_3) # We append the rmse value to the scores list.
    
print(np.mean(scores_3), np.std(scores_3)) # Print the mean non square error average, and its standard deviation

0 0.7374587462192872
1 0.7451464464387808
2 0.740816174993611
3 0.7390018751824272
4 0.7354334328295111
0.7395713351327234 0.003302757768207594


We got worse results, bummer.

## 5. Implementing Log Transformation

For implementing Log Transformation, we will have to modify the X_full dataframe and the X_test dataframe. We will use the first pipeline to see the effects of log transformation without any standardization or normalization.

In [11]:
# We set up a list to store the final predictions.
final_predictions_4 = []

# We set up a list for storing the mean non squared error scores.
scores_4 = []

# We set up copies of the training and test data to not overwrite it.
X_full_copy = X_full.copy()
X_test_copy = X_test.copy()

# We perform log transformation on our training and test sets
for col in num_cols:
    X_full_copy[col] = np.log1p(X_full_copy[col])
    X_test_copy[col] = np.log1p(X_test_copy[col])

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full_copy[X_full_copy.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full_copy[X_full_copy.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid_4 = my_pipeline.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test_4 = my_pipeline.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions_4.append(preds_test_4) # We append each of the test predictions on to our final_predictions list.
    rmse_4 = mean_squared_error(y_valid, preds_valid_4, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse_4) # Print the fold number, and the mean non squared error for each fold.
    scores_4.append(rmse_4) # We append the rmse value to the scores list.
    
print(np.mean(scores_4), np.std(scores_4)) # Print the mean non square error average, and its standard deviation

0 0.722199574384611
1 0.7303888797180302
2 0.7263925761392298
3 0.7254851008898957
4 0.7201301016597935
0.7249192465583121 0.0035423241995017275


We also got a slight improvement using this method. Next up, we can mix different types of feature engineering to see if we can achieve even better results! Let's do a to-do list of feature engineering methods:
* Log Transform + Standardization
* Standardization of one-hot encoding & numerical values

## 6. Implementing Log Transformation + Standardization

Based off this question on stackexchange (https://stats.stackexchange.com/questions/402470/how-can-i-use-scaling-and-log-transforming-together), we should first apply the log transformation, and then standardize the data. We already have a pipeline working with my_pipeline_2, so we just need to apply log transformation and then apply that pipeline:

In [12]:
# We set up a list to store the final predictions.
final_predictions_5 = []

# We set up a list for storing the mean non squared error scores.
scores_5 = []

# We set up copies of the training and test data to not overwrite it.
X_full_copy = X_full.copy()
X_test_copy = X_test.copy()

# We perform log transformation on our training and test sets
for col in num_cols:
    X_full_copy[col] = np.log1p(X_full_copy[col])
    X_test_copy[col] = np.log1p(X_test_copy[col])

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full_copy[X_full_copy.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full_copy[X_full_copy.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid_5 = my_pipeline_2.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test_5 = my_pipeline_2.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions_5.append(preds_test_5) # We append each of the test predictions on to our final_predictions list.
    rmse_5 = mean_squared_error(y_valid, preds_valid_5, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse_5) # Print the fold number, and the mean non squared error for each fold.
    scores_5.append(rmse_5) # We append the rmse value to the scores list.
    
print(np.mean(scores_5), np.std(scores_5)) # Print the mean non square error average, and its standard deviation

0 0.7486244672376378
1 0.7559871170628206
2 0.7522413781353476
3 0.750569989940971
4 0.7473239151480965
0.7509493735049747 0.003024605939871177


That didn't bode so well either. Off to the next one:

## 7. Implementing Standardization of One-Hot Encoding & Numerical Values

Sounds simple enough, apply one-hot encoding to the categorical values and then apply standardization on said one-hot values.

In [13]:
# Preprocessing for numerical data
numerical_transformer_4 = preprocessing.StandardScaler() # We change the numerical transformer to use a normalizer

# Preprocessing for categorical data and one-hot encoding
categorical_transformer_4 = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore')), ('scaler', preprocessing.StandardScaler(with_mean=False))])

# Bundle preprocessing for numerical and categorical data
preprocessor_4 = ColumnTransformer(transformers=[('num', numerical_transformer_4, num_cols),('cat', categorical_transformer_4, object_cols)])

# Define the model 
model_4 = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor") # In Abhishek's method random_state was altered with each fold (as random_state = fold), so we'll trade repeatability for some induced randomness.

# Bundle preprocessing and modeling code in a pipeline
my_pipeline_4 = Pipeline(steps=[('preprocessor', preprocessor_4), ('model', model_4)])

In [14]:
# We set up a list to store the final predictions.
final_predictions_6 = []

# We set up a list for storing the mean non squared error scores.
scores_6 = []

# We set the loop to loop across all of the folds. Since we have 5 folds, the loop range will be range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We copy the original X_test to not alter or overwrite over it.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # We activate the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline_4.fit(X_train, y_train)

    preds_valid_6 = my_pipeline_4.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test_6 = my_pipeline_4.predict(X_test) # We instruct the pipeline to make predictions on X_test.
    final_predictions_6.append(preds_test_6) # We append each of the test predictions on to our final_predictions list.
    rmse_6 = mean_squared_error(y_valid, preds_valid_6, squared=False) # We store the mean non squared error in a variable.
    print(fold, rmse_6) # Print the fold number, and the mean non squared error for each fold.
    scores_6.append(rmse_6) # We append the rmse value to the scores list.
    
print(np.mean(scores_6), np.std(scores_6)) # Print the mean non square error average, and its standard deviation

0 0.7222957994558532
1 0.7303877162646868
2 0.7262597753893988
3 0.7254887712952974
4 0.7204276928254211
0.7249719510461314 0.0034370969646553865


Looks like we got better results, let's compare each method we have tried up to this point and select the one that gives us the smallest mean error. 

In [15]:
scores_dict = dict()
scores_dict['No Changes'] = (np.mean(scores), np.std(scores))
scores_dict['Standardization'] = (np.mean(scores_2), np.std(scores_2))
scores_dict['Normalization'] = (np.mean(scores_3), np.std(scores_3))
scores_dict['Log Transformation'] = (np.mean(scores_4), np.std(scores_4))
scores_dict['Log Transformation + Standardization'] = (np.mean(scores_5), np.std(scores_5))
scores_dict['Standardization of One-Hot Encoding & Numerical Values'] = (np.mean(scores_6), np.std(scores_6))
print("The method with the lowest mean error is: " + str(min(scores_dict, key=scores_dict.get)))

The method with the lowest mean error is: Log Transformation


Looks like for this specific case Log Transformation is the way to go. Although, one interesting thing is that Log Transformation also has the highest standard deviation of all methods:

In [16]:
print('No Changes: ' + 'Mean: ' + str(np.mean(scores)) + " Standard Deviation: " + str(np.std(scores)))
print('Standardization: ' + 'Mean: ' + str(np.mean(scores_2)) + " Standard Deviation: " + str(np.std(scores_2)))
print('Normalization: ' + 'Mean: ' + str(np.mean(scores_3)) + " Standard Deviation: " + str(np.std(scores_3)))
print('Log Transformation: ' + 'Mean: ' + str(np.mean(scores_4)) + " Standard Deviation: " + str(np.std(scores_4)))
print('Log Transformation + Standardization: ' + 'Mean: ' + str(np.mean(scores_5)) + " Standard Deviation: " + str(np.std(scores_5)))
print('Standardization of One-Hot Encoding & Numerical Values: ' + 'Mean: ' + str(np.mean(scores_6)) + " Standard Deviation: " + str(np.std(scores_6)))

No Changes: Mean: 0.7251444688822899 Standard Deviation: 0.0033891436272372347
Standardization: Mean: 0.7249719510461314 Standard Deviation: 0.0034370969646553865
Normalization: Mean: 0.7395713351327234 Standard Deviation: 0.003302757768207594
Log Transformation: Mean: 0.7249192465583121 Standard Deviation: 0.0035423241995017275
Log Transformation + Standardization: Mean: 0.7509493735049747 Standard Deviation: 0.003024605939871177
Standardization of One-Hot Encoding & Numerical Values: Mean: 0.7249719510461314 Standard Deviation: 0.0034370969646553865


We'll take the predictions of Log Transformation and make them the output for this notebook:

In [17]:
predictions = np.mean(np.column_stack(final_predictions_2), axis=1)

In [18]:
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test_full.id, 'target': predictions})
output.to_csv('submission.csv', index=False)

Before we keep following Abhishek's tutorials, I want to implement LightGBM, as I've read it yields better results. 