# LightGBM Implementation for 30 Days of ML Challenge, by Juan Torres

#### Based on Svyatoslav Sokolov's notebook:

https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version?select=train.csv

A quick word about our previous notebook. We had concluded that a log transformation was the way to go, but unfortunately the model performed poorly on the test data (scoring 0.77, where lower is better). On a second try, the predictions were made with standardization of the numeric data, and the model performed much better (scoring 0.72), so we're going with standardization.
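
To make the difference between the two transforms concrete, here is a small sketch on made-up numbers (not the competition data). It assumes `np.log1p` as the log transform, which may not be the exact function used in the previous notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric feature (hypothetical values, one column).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization: subtract the column mean and divide by the column
# standard deviation, so the result has mean 0 and unit variance.
scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Log transform: compresses large values but does not center or rescale.
logged = np.log1p(x)

print(scaled.ravel())
print(logged.ravel())
```

`StandardScaler` learns the mean and standard deviation during `fit`, so inside a pipeline the test data is scaled with statistics from the training data only.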

For this notebook, we will try switching regressors. XGBoost has served us well so far without any parameter optimization, but several write-ups online report that LightGBM can achieve better results. We will swap it into our current pipeline and see whether the results improve.

## 1. Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# For one-hot encoding categorical variables
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing

# For the construction of the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For training the LGBM regressor
from lightgbm import LGBMRegressor

# For the mean squared error needed to calculate our scores
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/30-days-of-ml/sample_submission.csv
/kaggle/input/30-days-of-ml/train.csv
/kaggle/input/30-days-of-ml/test.csv
/kaggle/input/train-folds-30-days-of-ml/train_folds.csv


## 2. Loading and preparing data and pipeline construction

In [2]:
# Load the training and test data. 
X_full = pd.read_csv("../input/train-folds-30-days-of-ml/train_folds.csv")
X_test_full = pd.read_csv("../input/30-days-of-ml/test.csv")
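
The `train_folds.csv` file already carries a `kfold` column, created outside this notebook. As a rough sketch of how such a column can be produced, the example below assigns fold numbers to a toy frame with scikit-learn's `KFold`. The toy data, the shuffling, and `random_state=42` are assumptions for illustration, not the settings used for the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for train.csv (hypothetical values).
toy = pd.DataFrame({"id": range(10), "target": np.linspace(0.0, 1.0, 10)})

# Start with a placeholder fold number, then fill in the validation fold
# each row belongs to.
toy["kfold"] = -1
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf.split(toy)):
    toy.loc[valid_idx, "kfold"] = fold

print(toy.kfold.value_counts().sort_index())
```

Storing the fold assignment in a file keeps the splits identical across notebooks, which makes scores from different models comparable.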

In [3]:
# We select all features except "id", "target" and "kfold", as these are not predictors of our target.
useful_features = [c for c in X_full.columns if c not in ("id", "target", "kfold")]

# Select numerical columns
num_cols = [col for col in useful_features if 'cont' in col]

# Select categorical columns. Note that we no longer apply a cardinality check.
object_cols = [col for col in useful_features if 'cat' in col]

# We build X_test out of X_test_full, but only selecting the useful features.
X_test = X_test_full[useful_features]
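
Selecting columns by substring works as long as every feature name contains exactly one of `cat` or `cont`. A quick sanity check along these lines can confirm that no feature is left out or counted twice. The column names below are hypothetical, modelled on the `cat`/`cont` naming used above.

```python
import pandas as pd

# Empty toy frame with hypothetical column names.
toy = pd.DataFrame(columns=["id", "cat0", "cat1", "cont0", "cont1", "target", "kfold"])

useful = [c for c in toy.columns if c not in ("id", "target", "kfold")]
num = [c for c in useful if "cont" in c]
cat = [c for c in useful if "cat" in c]

# Features matched by neither pattern, and features matched by both.
unassigned = sorted(set(useful) - set(num) - set(cat))
overlap = sorted(set(num) & set(cat))

print("unassigned:", unassigned)
print("overlap:", overlap)
```

Both lists should be empty before the columns are handed to the preprocessing step.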

In [4]:
# Preprocessing for numerical data, we use a StandardScaler to apply standardization.
numerical_transformer = preprocessing.StandardScaler()

# Preprocessing for categorical data and one-hot encoding.
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, num_cols),('cat', categorical_transformer, object_cols)])

# Define the model
model = LGBMRegressor(device="gpu") # An LGBMRegressor with default hyperparameters. For now, the only parameter we set is device="gpu", which requires a GPU-enabled LightGBM installation.

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
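
To see what the preprocessing step hands to the model, the sketch below runs a similar `ColumnTransformer` on a tiny made-up frame. The data is hypothetical, and `sparse_threshold=0` is added here only to force a dense array that is easy to print; it is not part of the notebook's preprocessor.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: the test frame contains a category ("C") never seen in training.
train = pd.DataFrame({"cat0": ["A", "B", "A", "B"], "cont0": [0.1, 0.2, 0.3, 0.4]})
test = pd.DataFrame({"cat0": ["A", "C"], "cont0": [0.25, 0.5]})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["cont0"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

train_matrix = pre.fit_transform(train)  # learns mean/std and the category list
test_matrix = pre.transform(test)        # reuses what was learned on train

print(train_matrix.shape)
print(test_matrix)
```

With `handle_unknown='ignore'`, the unseen category `"C"` is encoded as all zeros in the one-hot columns instead of raising an error.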

In [5]:
# We set up a list to store the final predictions.
final_predictions = []

# We set up a list for storing the root mean squared error (RMSE) of each fold.
scores = []

# Loop over all of the folds. Since we have 5 folds, the loop range is range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # Training data: every fold except the current one, with the index reset.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # Validation data: the current fold, with the index reset.
    
    y_train = X_train.target # Training target. It is recomputed every iteration because the split changes with the fold.
    y_valid = X_valid.target # Validation target, recomputed every iteration for the same reason.
    
    X_train = X_train[useful_features] # Keep only the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # Keep only the previously defined useful features of X_valid.
    
    # We fit the pipeline, which preprocesses the training data and fits the model (will take about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid = my_pipeline.predict(X_valid) # Predictions on the validation fold.
    preds_test = my_pipeline.predict(X_test) # Predictions on the test set. predict() does not modify X_test, so no copy is needed.
    final_predictions.append(preds_test) # We append each fold's test predictions to our final_predictions list.
    rmse = np.sqrt(mean_squared_error(y_valid, preds_valid)) # RMSE for this fold. np.sqrt replaces squared=False, which is deprecated as of scikit-learn 1.4.
    print(fold, rmse) # Print the fold number and its RMSE.
    scores.append(rmse) # We append the RMSE value to the scores list.
    
print(np.mean(scores), np.std(scores)) # Print the average RMSE across folds and its standard deviation.

0 0.7223753982194191
1 0.7301754794491622
2 0.7255235407438366
3 0.7244993511962095
4 0.7203523408978102
0.7245852221012876 0.0033147970608888955
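
For reference, the sketch below spells out the RMSE formula on a made-up example and then recomputes the summary line from the five fold scores printed above. The toy `y_true` and `y_pred` arrays are hypothetical; the fold scores are copied from the output.

```python
import numpy as np

# RMSE by hand on toy values: square the errors, average, take the root.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
rmse_toy = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Fold scores copied from the output above.
fold_scores = np.array([
    0.7223753982194191,
    0.7301754794491622,
    0.7255235407438366,
    0.7244993511962095,
    0.7203523408978102,
])
cv_mean = fold_scores.mean()
cv_std = fold_scores.std()  # population standard deviation (ddof=0), the np.std default

print(rmse_toy)
print(cv_mean, cv_std)
```

A standard deviation of roughly 0.003 across folds suggests the score is stable with respect to which fold is held out.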


Looking good on paper. Let's stack these predictions, average them across folds, and build the output file.

In [6]:
# Stack the per-fold test predictions as columns and average them row by row.
predictions = np.mean(np.column_stack(final_predictions), axis=1)
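
The averaging step can be checked on a tiny example. The two arrays below are made-up stand-ins for per-fold test predictions.

```python
import numpy as np

# Two hypothetical folds, each predicting the same three test rows.
toy_fold_predictions = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]

# column_stack puts each fold in its own column: shape (n_rows, n_folds).
stacked = np.column_stack(toy_fold_predictions)

# axis=1 averages across folds, leaving one value per test row.
averaged = np.mean(stacked, axis=1)

print(stacked.shape)
print(averaged)
```

Each test row ends up with the mean of the predictions made by the models trained on the different folds.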

In [7]:
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test_full.id, 'target': predictions})
output.to_csv('submission.csv', index=False)
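
Before submitting, it can help to compare the output frame against `sample_submission.csv`. The helper below is a hypothetical sketch, not part of the original notebook, and the toy header `id`/`target` is an assumption for illustration. In practice `sample` would be read from the `sample_submission.csv` file listed in the input directory.

```python
import pandas as pd

def check_submission(output, sample):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if list(output.columns) != list(sample.columns):
        problems.append(f"header {list(output.columns)} != {list(sample.columns)}")
    if len(output) != len(sample):
        problems.append(f"{len(output)} rows, expected {len(sample)}")
    if output.isna().any().any():
        problems.append("missing values present")
    return problems

# Toy frames standing in for the real files (hypothetical values).
sample = pd.DataFrame({"id": [0, 1, 2], "target": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [0, 1, 2], "target": [8.1, 8.3, 7.9]})
short = good.iloc[:2]

print(check_submission(good, sample))
print(check_submission(short, sample))
```

Comparing against the sample file catches header, row-count, and missing-value problems locally rather than at upload time.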

The pipeline implementation is very convenient for these types of changes. Next up, we will follow Abhishek's tutorials once again to continue improving our model.