# One-Hot-Encoding and Cross-Validation

In this module you will learn how to handle categorical features with one-hot-encoding, how to use cross-validation to get a more reliable estimate of a model's error, and how validation sets differ from test sets when comparing models.

<b>Functions and attributes in this lecture: </b>
- `sklearn.preprocessing` - Submodule for preprocessing
  - `OneHotEncoder` - One-hot-encoder for categorical data
- `sklearn.compose` - Submodule for composing transformers to put them into a pipeline
  - `ColumnTransformer` - Apply transformers only to specific columns
- `sklearn.model_selection` - Submodule for choosing models
  - `cross_validate` - Cross-validation on models

In [93]:
# RUN THIS CELL!

# Non-sklearn packages
import numpy as np
import pandas as pd
from seaborn import load_dataset

# Import the tips dataset
tips = load_dataset("tips")
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


## One-Hot-Encoding

In [94]:
# Import the one-hot-encoder
from sklearn.preprocessing import OneHotEncoder

In [95]:
# Create a one-hot-encoder transformer
enc = OneHotEncoder()

In [96]:
# Fit the encoder on the categorical columns and transform them (the result is a sparse matrix)
tra_array = enc.fit_transform(tips[["sex","smoker","day","time"]])
enc.categories_

[array(['Female', 'Male'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Fri', 'Sat', 'Sun', 'Thur'], dtype=object),
 array(['Dinner', 'Lunch'], dtype=object)]

In [97]:
# Creating a dataframe containing the new columns
transformed = pd.DataFrame(tra_array.toarray(), columns = enc.get_feature_names_out())
transformed.head()

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
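
To make the table above less of a black box, here is a minimal plain-Python sketch of the idea behind one-hot-encoding: each distinct category gets its own 0/1 column. The helper `one_hot_encode` and the sample list `days` are hypothetical names made up for illustration; this is not how scikit-learn implements `OneHotEncoder`.

```python
# Plain-Python sketch of one-hot-encoding (illustrative only, not scikit-learn's implementation)
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))  # sorted order, like enc.categories_ above
    rows = [[1.0 if value == category else 0.0 for category in categories]
            for value in values]
    return categories, rows

days = ["Sun", "Sat", "Thur", "Fri", "Sun"]  # made-up sample, not rows from tips
categories, rows = one_hot_encode(days)

print(categories)
for day, row in zip(days, rows):
    print(day, row)
```

Every row contains exactly one 1.0, which is why one column per feature is redundant and can be dropped for binary features such as `sex` or `smoker`.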


In [98]:
# Adding the new columns to the tips dataframe
enc_tips = pd.concat([tips, transformed], axis = 1)

In [99]:
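# Drop the original categorical columns and one redundant dummy per binary feature
# (all four day_* columns are kept, even though they always sum to 1)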
enc_tips.drop(columns=["sex","smoker","day","time","sex_Male","smoker_Yes","time_Lunch"], inplace=True)
enc_tips.head()

Unnamed: 0,total_bill,tip,size,sex_Female,smoker_No,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner
0,16.99,1.01,2,1.0,1.0,0.0,0.0,1.0,0.0,1.0
1,10.34,1.66,3,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,21.01,3.5,3,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,23.68,3.31,2,0.0,1.0,0.0,0.0,1.0,0.0,1.0
4,24.59,3.61,4,1.0,1.0,0.0,0.0,1.0,0.0,1.0


In [100]:
# Save the cleaned data for further use
enc_tips.to_csv("cleaned_tips.csv", index=False)

## Cross-Validation

In [101]:
# Defining the features and targets
X = enc_tips.drop(columns=["tip"])
y = enc_tips["tip"]

In [102]:
# Import cross_validate
from sklearn.model_selection import cross_validate
# Import linear regression
from sklearn.linear_model import LinearRegression

In [103]:
# Creating an instance of linear regression
lin_reg = LinearRegression()

In [104]:
# Doing cross-validation with 2 folds (cv=2) and linear regression
result = cross_validate(lin_reg, X, y, cv=2, scoring="neg_mean_squared_error")
result

{'fit_time': array([0.00694394, 0.00346208]),
 'score_time': array([0.00226092, 0.00215602]),
 'test_score': array([-0.81200699, -1.38436749])}
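
With an integer `cv` and a regressor, `cross_validate` splits the data into contiguous, unshuffled folds, and each fold takes one turn as the validation set while the model is trained on the rest. The sketch below imitates that splitting in plain Python; `k_fold_indices` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more sample
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop

# Toy example: 10 samples, 3 folds
folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in exactly one validation fold, so every row is used for both training and validation, just never in the same round.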

In [105]:
# The resulting scores
result["test_score"]

array([-0.81200699, -1.38436749])

In [106]:
# Mean squared error averaged over the folds (sign flipped because the scorer returns negated MSE)
mean_test_score = -np.mean(result["test_score"])
mean_test_score

1.0981872351321682
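
The scorer `"neg_mean_squared_error"` returns the negated MSE so that higher is always better, which is why the sign is flipped above. As a small worked example, the arithmetic can be redone by hand from the fold scores printed earlier; taking the square root additionally gives an RMSE in the same units as `tip`. The variable names below are illustrative.

```python
import math

# Fold scores copied from the cross_validate output above (negated MSE)
test_scores = [-0.81200699, -1.38436749]

mean_mse = -sum(test_scores) / len(test_scores)  # flip the sign to get a positive MSE
rmse = math.sqrt(mean_mse)                       # same units as the target column

print(f"mean MSE: {mean_mse:.4f}")
print(f"RMSE:     {rmse:.4f}")
```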

## Using One-Hot-Encoding in a Pipeline

In [107]:
# Imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

In [108]:
# Defining the features and targets
X = tips.drop(["tip"], axis=1)
y = np.ravel(tips[["tip"]])

In [109]:
# Splitting into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [110]:
# Importing the column transformer
from sklearn.compose import ColumnTransformer
        

In [111]:
# Preprocessing step: one-hot-encode the categorical columns and standardize the numeric ones
prepros_pipeline = ColumnTransformer([("categorical", OneHotEncoder(),["sex","smoker","day","time"]),
                                      ("scaler", StandardScaler(),["total_bill","size"])])

In [112]:
# Building two model pipelines that share the same preprocessing step
forest_pipeline = Pipeline([("Preprocessing", prepros_pipeline),
                            ("RandomForest", RandomForestRegressor())])

linear_pipeline = Pipeline([("Preprocessing", prepros_pipeline),
                            ("Linear", LinearRegression())])


## Using a Pipeline in Cross-Validation

In [113]:
# cross_validate accepts pipelines; with no cv argument it uses 5 folds
result_forest = cross_validate(forest_pipeline, X_train, y_train, scoring="neg_mean_squared_error")
-np.mean(result_forest["test_score"])

1.233423963505682

In [114]:
# Cross-validation on the linear regression pipeline
result_linear = cross_validate(linear_pipeline, X_train, y_train, scoring="neg_mean_squared_error")
-np.mean(result_linear["test_score"])

1.1824818678630933
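
The two numbers above are validation errors: they come from folds of the training set and are used only to compare models, while the test set stays untouched until one model has been chosen. A minimal sketch of that selection step, using the two mean errors reported above (the dictionary and its keys are illustrative; the random forest value will differ between runs because no `random_state` is set):

```python
# Mean cross-validated MSE values copied from the two cells above
cv_errors = {
    "forest_pipeline": 1.233423963505682,
    "linear_pipeline": 1.1824818678630933,
}

# Pick the candidate with the lowest validation error
best_name = min(cv_errors, key=cv_errors.get)
print(best_name, cv_errors[best_name])
```

Only the selected model is then evaluated on the test set, so the test error remains an unbiased estimate for that one model.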

In [115]:
# Linear regression has the lower cross-validated error, so we select it.
# Refit it on the full training set, then use the held-out test set once to estimate its error.
linear_pipeline.fit(X_train,y_train)

In [119]:
# Predicting on the test set using the pipeline
y_pred = linear_pipeline.predict(X_test)

In [121]:
# Mean squared error of the chosen model on the test set
final_error = mean_squared_error(y_test, y_pred)
final_error

1.0335330005787038

In [123]:
# Retraining on all data (training and test set) to get the final model used in production
final_model = linear_pipeline.fit(X, y)
final_model