# 30 Days of ML v3, by Juan Torres
### Based on my previous notebook, 30 Days of ML v2: https://www.kaggle.com/jtorres96/30-days-of-ml-v2

This notebook iterates on the previous one. Here, we use one-hot encoding with the previously built pipeline to see whether performance improves.

## 1. Importing libraries
First, let's import the libraries we'll use for this first model:

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# For encoding categorical variables, splitting data
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# For the construction of the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For training random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/30-days-of-ml/sample_submission.csv
/kaggle/input/30-days-of-ml/train.csv
/kaggle/input/30-days-of-ml/test.csv


## 2. Loading and preparing data
Next, let's load the training and test data, separate the target from the training features, use train_test_split to break off a validation set from the training data, and select the categorical and numerical columns to keep.

In [2]:
# Load the training and test data. We set index_col=0 in the code cell below to use the id column to index the DataFrame. 
X_full = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
X_test_full = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Separate target (designated "y") from the training features.
y = X_full['target']
X = X_full.drop(['target'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, random_state=0)

# Select categorical columns with relatively low cardinality (fewer than 10 unique values).
# We identify the categorical columns by column name; previously we checked the data type of each column.
object_cols = [col for col in X.columns if 'cat' in col and X[col].nunique() < 10]

# Select numerical columns
num_cols = [col for col in X.columns if 'cont' in col]

# Keep selected columns only
my_cols = object_cols + num_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# Take a peek at the training data
X_train.head()

| id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cont0 | ... | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 412957 | A | A | A | C | B | C | A | D | G | 1.0354 | ... | 0.284988 | 0.777432 | 0.391158 | 0.575325 | 0.449008 | 0.808493 | 0.638874 | 0.620083 | 0.860327 | 0.796654 |
| 282449 | A | A | A | C | B | B | A | E | E | 0.552942 | ... | 0.979303 | 0.388542 | 0.392169 | 0.251943 | 0.257683 | 0.289205 | 0.753284 | 0.108515 | 0.223745 | 0.867225 |
| 164867 | A | A | A | A | B | B | A | E | F | 0.300135 | ... | 0.528495 | 0.877585 | 0.692778 | 0.39411 | 0.372842 | 0.603872 | 0.729337 | 0.572204 | 0.381617 | 0.528454 |
| 541 | A | B | A | C | B | B | A | E | E | 0.423174 | ... | 0.795973 | 0.254421 | 0.279979 | 0.244269 | 0.549558 | 0.318729 | 0.093974 | 0.428488 | 0.176486 | 0.25028 |
| 80790 | B | B | A | C | B | D | A | E | A | 0.66709 | ... | 0.809695 | 0.412479 | 0.345747 | 0.537579 | 1.023082 | 0.433388 | 0.814753 | 0.655539 | 0.923203 | 0.467355 |
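
The column selection used above can be tried on a small frame first. The following is a minimal sketch, assuming a made-up DataFrame whose column names only mimic the competition's `cat*`/`cont*` scheme; the threshold of 3 is illustrative (the notebook uses 10).

```python
import pandas as pd

# Hypothetical toy data; names mimic the competition's naming scheme
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "B"],    # 2 distinct values
    "cat1": ["A", "B", "C", "D"],    # 4 distinct values
    "cont0": [0.1, 0.2, 0.3, 0.4],
    "target": [1.0, 2.0, 3.0, 4.0],
})

max_cardinality = 3  # illustrative threshold, not the notebook's value

# Categorical columns are picked by name and filtered by number of unique values
low_card_cols = [c for c in toy.columns
                 if "cat" in c and toy[c].nunique() < max_cardinality]

# Numerical columns are picked by name only
num_cols_toy = [c for c in toy.columns if "cont" in c]

print(low_card_cols)  # ['cat0']
print(num_cols_toy)   # ['cont0']
```

Filtering on cardinality keeps the number of one-hot columns manageable, since each distinct category becomes its own column.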


## 3. Building the pipeline
### 3.1: Define Preprocessing Steps
We'll use the ColumnTransformer class to bundle together different preprocessing steps. The code below will impute missing values in numerical data (filling them with a constant) and apply a **one-hot encoding** to categorical data.

In [3]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, num_cols),('cat', categorical_transformer, object_cols)])
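
To see what these two transformers do in isolation, here is a minimal sketch on made-up data (the toy frames and values are assumptions, not competition data). It shows that `SimpleImputer(strategy='constant')` fills missing numerical values with 0 by default, and that `handle_unknown='ignore'` encodes a category unseen during fitting as a row of zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Numerical imputation: with strategy='constant' and no fill_value,
# missing numerical entries are replaced by 0
imputer = SimpleImputer(strategy="constant")
filled = imputer.fit_transform(np.array([[1.0], [np.nan]]))
print(filled)

# One-hot encoding: "C" never appears in the (hypothetical) training frame
train_toy = pd.DataFrame({"cat0": ["A", "B", "A"]})
valid_toy = pd.DataFrame({"cat0": ["B", "C"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy)

# The encoder returns a sparse matrix by default; convert it for display
encoded = encoder.transform(valid_toy).toarray()
print(encoder.categories_)
print(encoded)  # the row for the unseen "C" is all zeros
```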

### 3.2: Define the Model
Next, we define a random forest model with the familiar RandomForestRegressor class, using the same parameters as the v1 notebook so that the scores remain comparable.

In [4]:
# Define the model 
model = RandomForestRegressor(random_state=1)

### 3.3: Create and Evaluate the Pipeline
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. Let's remember two important things:
* With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
* With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
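
The two points above can be checked on a toy problem. This is a minimal sketch with made-up data and a small forest (`n_estimators=10` is chosen only to keep it fast); it compares the manual route, where preprocessing and modeling are separate steps, with the pipeline route.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one categorical and one numerical column
X_tr = pd.DataFrame({"cat0": ["A", "B", "A", "B"], "cont0": [0.1, np.nan, 0.3, 0.4]})
y_tr = pd.Series([1.0, 2.0, 1.5, 2.5])
X_va = pd.DataFrame({"cat0": ["B", "A"], "cont0": [0.2, np.nan]})

def make_preprocessor():
    # Same structure as the notebook's preprocessor, on the toy columns
    return ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), ["cont0"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
    ])

# Manual route: preprocess, fit, then remember to preprocess again before predicting
prep = make_preprocessor()
rf = RandomForestRegressor(n_estimators=10, random_state=1)
rf.fit(prep.fit_transform(X_tr), y_tr)
manual_preds = rf.predict(prep.transform(X_va))

# Pipeline route: raw features go in, preprocessing happens automatically
pipe = Pipeline(steps=[
    ("preprocessor", make_preprocessor()),
    ("model", RandomForestRegressor(n_estimators=10, random_state=1)),
])
pipe.fit(X_tr, y_tr)
pipeline_preds = pipe.predict(X_va)

print(np.allclose(manual_preds, pipeline_preds))
```

Both routes should give the same predictions; the pipeline simply removes the chance of forgetting a preprocessing step on the validation or test data.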

In [5]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

# Preprocessing of training data, fit model (will take about 10 minutes to run)
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds_valid = my_pipeline.predict(X_valid)

# Take the square root of the MSE to get the root mean squared error (RMSE) on the validation data, as this is the competition's metric.
# np.sqrt is used instead of squared=False, which newer scikit-learn releases deprecate.
print(np.sqrt(mean_squared_error(y_valid, preds_valid)))

0.7371147047329659
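
For reference, RMSE is the square root of the mean of the squared errors. A minimal sketch on made-up numbers (the arrays below are assumptions for illustration) shows that taking the square root of `mean_squared_error` matches the formula computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# Squared errors are 1, 0 and 4, so the MSE is 5/3
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# Same quantity written out directly
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(mse, rmse, rmse_manual)
```

Because RMSE is in the same units as the target, it is easier to interpret than the raw MSE.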


The previous score was 0.7375392165180452. Let's compare:

In [6]:
print(abs(0.7375392165180452 - np.sqrt(mean_squared_error(y_valid, preds_valid))))

0.00042451178507929566
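
To put the gap in perspective, it helps to express it relative to the previous score. A short worked calculation using only the two scores reported in this notebook (the variable names are illustrative):

```python
previous_rmse = 0.7375392165180452  # score reported for the previous notebook
current_rmse = 0.7371147047329659   # score printed by this notebook

# Positive values mean the new score is lower, i.e. better for RMSE
absolute_gain = previous_rmse - current_rmse
relative_gain = absolute_gain / previous_rmse

print(f"absolute: {absolute_gain:.6f}, relative: {relative_gain:.4%}")
```

The relative change comes out at well under a tenth of a percent.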


This is a very small difference (about 0.0004 in RMSE, slightly in favor of the one-hot encoded version), most likely attributable to the change in encoding, so we can consider that the pipeline was correctly implemented! Now, let's obtain the predictions for the X_test set and submit them to see the RMSE on the test data.

## 4. Submit predictions

In [7]:
# Use the model to generate predictions
predictions = my_pipeline.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

Next up, let's try using an XGBoost model!