# Implementing k-folds into Pipeline for 30 Days of ML Challenge, by Juan Torres

#### Based on Abhishek Thakur's tutorials and notebooks:

https://www.youtube.com/watch?v=t5fhRP62YdE

https://www.kaggle.com/abhishek/competition-part-1-baseline?scriptVersionId=72291885

In the previous notebook we created a new .csv file that divides the training data into folds for cross-validation. We will now use those folds in our previously built pipeline.
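
For context, a fold-assignment file like `train_folds.csv` can be produced roughly as follows. This is a minimal sketch on a toy DataFrame, not the previous notebook's exact code: the toy values and the `KFold` settings (`n_splits=5`, `shuffle=True`, `random_state=42`) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for train.csv (hypothetical values)
toy_train = pd.DataFrame({
    "id": np.arange(20),
    "cont0": np.linspace(0.0, 1.0, 20),
    "target": np.linspace(5.0, 10.0, 20),
})

# Every row starts unassigned, then receives the number of the fold
# in which it serves as validation data
toy_train["kfold"] = -1
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # assumed settings
for fold, (_, valid_idx) in enumerate(kf.split(toy_train)):
    toy_train.loc[valid_idx, "kfold"] = fold

# Each fold should contain the same number of rows (20 / 5 = 4)
print(toy_train["kfold"].value_counts().sort_index())
```

Each row belongs to exactly one fold, so filtering on `kfold != fold` and `kfold == fold` later yields non-overlapping training and validation sets.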

## 1. Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# For one-hot encoding categorical variables
from sklearn.preprocessing import OneHotEncoder

# from sklearn.model_selection import train_test_split  # No longer needed: the precomputed folds replace the train/validation split

# For the construction of the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For training the XGBoost model
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/train-folds-30-days-of-ml/train_folds.csv
/kaggle/input/30-days-of-ml/sample_submission.csv
/kaggle/input/30-days-of-ml/train.csv
/kaggle/input/30-days-of-ml/test.csv


We have to account for the fact that we are working with k-folds, so the data loading and processing and the pipeline construction will be intertwined:

## 2. Loading and preparing data and pipeline construction

In [2]:
# Load the training and test data. 
X_full = pd.read_csv("../input/train-folds-30-days-of-ml/train_folds.csv")
X_test_full = pd.read_csv("../input/30-days-of-ml/test.csv")

We have our data loaded. Now, let's do some data preprocessing to correctly use the folds we set up previously:

In [3]:
# We select all features except "id", "target" and "kfold", as these are not predictors of our target.
useful_features = [c for c in X_full.columns if c not in ("id", "target", "kfold")]

# Select numerical columns
num_cols = [col for col in useful_features if 'cont' in col]

# We select categorical columns. Note that we dropped the cardinality check.
object_cols = [col for col in useful_features if 'cat' in col]

# We build X_test out of X_test_full, but only selecting the useful features.
X_test = X_test_full[useful_features]
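
The selection above depends entirely on the column naming pattern. A minimal sketch with hypothetical column names (assumed to follow a `cat*` / `cont*` pattern) shows how the three list comprehensions partition the columns, plus a sanity check that every useful feature lands in exactly one group:

```python
# Hypothetical column names following a cat*/cont* naming pattern
columns = ["id", "cat0", "cat1", "cont0", "cont1", "cont2", "target", "kfold"]

# Drop the identifier, the target and the fold label
useful_features = [c for c in columns if c not in ("id", "target", "kfold")]

# Split the remaining columns by substring match on their names
num_cols = [c for c in useful_features if "cont" in c]
object_cols = [c for c in useful_features if "cat" in c]

print(num_cols)
print(object_cols)

# Sanity check: no feature is missed or counted twice
assert sorted(num_cols + object_cols) == sorted(useful_features)
```

Because the match is on substrings, a column whose name contained neither or both patterns would be dropped or duplicated; the final assertion catches that case.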

Now, we begin constructing the pipeline:

In [4]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data and one-hot encoding
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, num_cols),('cat', categorical_transformer, object_cols)])

# Define the model (tree_method='gpu_hist' requires a GPU runtime).
# Note: Abhishek's notebook sets random_state=fold inside the loop. Here random_state is left unset
# and the same model object is refit on every fold.
model = XGBRegressor(tree_method = 'gpu_hist')

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
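
To see what the preprocessing step does before the model receives the data, here is a minimal sketch on a hypothetical two-column toy dataset. It uses the same transformers as above, but the data and the `to_dense` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical column with a gap, one categorical column
toy_fit = pd.DataFrame({"cont0": [0.1, np.nan, 0.3], "cat0": ["A", "B", "A"]})
toy_new = pd.DataFrame({"cont0": [0.5], "cat0": ["C"]})  # "C" was never seen during fit

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["cont0"]),    # numeric NaN is filled with 0 by default
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
])

def to_dense(matrix):
    # Hypothetical helper: ColumnTransformer may return a sparse matrix, so densify for printing
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

fit_out = to_dense(toy_preprocessor.fit_transform(toy_fit))
new_out = to_dense(toy_preprocessor.transform(toy_new))

print(fit_out)  # columns: cont0, cat0=A, cat0=B
print(new_out)  # the unseen category "C" becomes all zeros in the one-hot columns
```

With `handle_unknown='ignore'`, a category that appears only in the validation or test data does not raise an error, which matters here because the encoder is refit on a different training subset in every fold.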

Finally, we loop over the k folds. In each iteration we train on all folds except one and validate on the held-out fold:

In [5]:
# We set up a list to store the final predictions.
final_predictions = []
# Loop over the folds. Since we have 5 folds, we use range(5).
for fold in range(5):
    X_train = X_full[X_full.kfold != fold].reset_index(drop=True) # We set the training data to be all folds different from the current fold number in the loop. We also reset the indices.
    X_valid = X_full[X_full.kfold == fold].reset_index(drop=True) # The validation data is the current fold number in the loop. We also reset the indices.
    X_test_copy = X_test.copy() # We work on a copy of X_test so that the original is never altered or overwritten.
    
    y_train = X_train.target # We set the training target equal to the target in the training set. This has to be done every iteration (as the fold and the data changes).
    y_valid = X_valid.target # We set the validation target equal to the target in the validation set. This has to be done every iteration (as the fold and the data changes).
    
    X_train = X_train[useful_features] # We set our training data to be the previously defined useful features of X_train.
    X_valid = X_valid[useful_features] # We set our validation data to be the previously defined useful features of X_valid.
    
    # Fit the pipeline: this preprocesses the training data and fits the model (takes about 10 minutes to run)
    my_pipeline.fit(X_train, y_train)

    preds_valid = my_pipeline.predict(X_valid) # We instruct the pipeline to make predictions on X_valid.
    preds_test = my_pipeline.predict(X_test_copy) # We instruct the pipeline to make predictions on the copy of X_test.
    final_predictions.append(preds_test) # We append each of the test predictions on to our final_predictions list.
    print(fold, mean_squared_error(y_valid, preds_valid, squared=False)) # Print the fold number and the root mean squared error (RMSE) for that fold; squared=False returns the square root of the MSE.

0 0.7233329089723132
1 0.7303783741255304
2 0.726387042692036
3 0.7254889468557322
4 0.7201351340010433
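
The scores above are root mean squared errors (RMSE), not plain MSE, because of `squared=False`. The sketch below, on hypothetical numbers, computes the same metric by taking the square root of the MSE explicitly. This form is worth knowing because recent scikit-learn releases deprecate the `squared` argument in favor of a separate `root_mean_squared_error` function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([8.0, 7.5, 9.0, 8.2])
y_pred = np.array([8.1, 7.9, 8.6, 8.2])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is expressed in the same units as the target

# The same value computed by hand
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse, manual_rmse)
```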


final_predictions is a list containing one array of test predictions per fold. We average across the folds to obtain a single predictions array:

In [6]:
predictions = np.mean(np.column_stack(final_predictions), axis=1)
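
A minimal sketch of what this averaging does, using hypothetical predictions from 3 folds for 4 test rows: `np.column_stack` places each fold's predictions in its own column, and the mean over `axis=1` collapses those columns into one value per test row.

```python
import numpy as np

# Hypothetical test predictions: 3 folds, 4 test rows each
fold_preds = [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([2.0, 2.0, 3.0, 6.0]),
    np.array([3.0, 2.0, 6.0, 8.0]),
]

stacked = np.column_stack(fold_preds)  # shape (4, 3): one row per test row, one column per fold
averaged = np.mean(stacked, axis=1)    # shape (4,): mean across folds for each test row

print(stacked.shape)
print(averaged)
```

For equally shaped arrays, `np.mean(fold_preds, axis=0)` gives the same result without the stacking step.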

Finally, we create the submission file:

In [7]:
# `predictions` already holds the test predictions averaged across the 5 folds (computed in the previous cell)

# Save the predictions to a CSV file
output = pd.DataFrame({'id': X_test_full.id,
                       'target': predictions})
output.to_csv('submission.csv', index=False)
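
Before submitting, a few quick checks on the output frame can catch common mistakes. The sketch below uses hypothetical stand-ins for the test ids and the averaged predictions; the column names are placeholders and should match whatever `sample_submission.csv` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test_full.id and the fold-averaged predictions
test_ids = pd.Series([10, 11, 12], name="id")
averaged_preds = np.array([8.1, 7.9, 8.4])

submission = pd.DataFrame({"id": test_ids, "target": averaged_preds})

# One prediction per test row, no missing values, no duplicated ids
assert len(submission) == len(test_ids)
assert submission["target"].notna().all()
assert submission["id"].is_unique

print(submission.head())
```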