# One-Hot-Encoding and Cross-Validation

In this module you will learn how to handle categorical features using one-hot-encoding, how to use cross-validation to get a better estimate of a model's error, and how test sets differ from validation sets when comparing models.

<b>Functions and attributes in this lecture: </b>
- `sklearn.preprocessing` - Submodule for preprocessing
  - `OneHotEncoder` - One-hot-encoder for categorical data
- `sklearn.compose` - Submodule for composing transformers to put them into a pipeline
  - `make_column_transformer` - Transform only specific columns
- `sklearn.model_selection` - Submodule for choosing models
  - `cross_validate` - Cross-validation on models

In [None]:
# RUN THIS CELL!

# Non-sklearn packages
import numpy as np
import pandas as pd
from seaborn import load_dataset

# Import the tips dataset
tips = load_dataset("tips")
tips.head()

## One-Hot-Encoding

In [None]:
# Import the one-hot-encoder


In [None]:
# Create a one-hot-encoder transformer


In [None]:
# Fitting the data using the one-hot-encoder transformer


In [None]:
# Seeing the categories in the one-hot-encoder transformer
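
The import, create, fit, and inspect steps from the cells above can be sketched as follows. This is a minimal illustration rather than the lecture's own solution: the `toy` dataframe is a hypothetical stand-in for `tips` (so the sketch runs without downloading anything), and the choice of `sex` and `smoker` as the categorical columns is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the tips data (illustrative values only)
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Columns assumed to be categorical for this sketch
categorical = ["sex", "smoker"]

# Create the transformer and learn the categories of each column
encoder = OneHotEncoder()
encoder.fit(toy[categorical])

# One array of category labels per encoded column
print(encoder.categories_)

# transform returns a sparse matrix by default; convert it to a dense array
encoded = encoder.transform(toy[categorical]).toarray()
print(encoded.shape)
```

Each row of `encoded` contains exactly one 1 per original categorical column, which is what makes the encoding "one-hot".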


In [None]:
# Creating a dataframe containing the new columns


In [None]:
# Adding the new columns to the tips dataframe


In [None]:
# Drop repeated columns


In [None]:
# Save the cleaned data for further use


## Cross-Validation

In [None]:
# Defining the features and targets


In [None]:
# Import cross_validate

# Import linear regression


In [None]:
# Creating an instance of linear regression


In [None]:
# Doing cross validation with 5 folds and linear regression


In [None]:
# The resulting scores


In [None]:
# Mean of the scores
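
The cross-validation steps above can be sketched as follows. The features and target here are synthetic (generated with NumPy) rather than taken from `tips`, and the `neg_mean_squared_error` scoring is an assumption; without a `scoring` argument, `cross_validate` reports R² for regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic features and target (assumption: stands in for the cleaned tips data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Creating an instance of linear regression
model = LinearRegression()

# 5-fold cross-validation; scikit-learn negates errors so that higher is better
results = cross_validate(model, X, y, cv=5, scoring="neg_mean_squared_error")

# Flip the sign to get one mean squared error per fold
scores = -results["test_score"]
print(scores)

# The mean over the folds is the cross-validated error estimate
print(scores.mean())
```

Averaging over five held-out folds gives a more stable error estimate than a single train/test split, because every observation is used for validation exactly once.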


## Using One-Hot-Encoding in a Pipeline

In [None]:
# Imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

In [None]:
# Defining the features and targets


In [None]:
# Splitting into training and testing set


In [None]:
# Importing the column transformer


In [None]:
# Creating a pipeline using column transformer


In [None]:
# Setting two pipelines together
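
A minimal sketch of the pipeline steps above, assuming the preprocessing one-hot-encodes the categorical columns and standardizes the numeric ones before a random forest. The `toy` dataframe is a hypothetical stand-in for `tips`, and the step names `"preprocess"` and `"model"` are chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the tips data (illustrative values only)
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29,
                   8.77, 26.88, 15.04, 14.78, 10.27, 35.26],
    "size": [2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male",
            "Male", "Male", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No",
               "No", "Yes", "No", "No", "Yes", "No"],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71,
            2.00, 3.12, 1.96, 3.23, 1.71, 5.00],
})

# Defining the features and targets
X = toy.drop(columns="tip")
y = toy["tip"]

# Splitting into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each transformer is applied only to the columns listed next to it
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker"]),
    (StandardScaler(), ["total_bill", "size"]),
)

# Chain the preprocessing and the model into a single estimator
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(predictions.shape)
```

Because the encoder and scaler are fitted inside the pipeline, they only ever see the training rows, which avoids leaking information from the test set into the preprocessing.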


## Using a Pipeline in Cross-Validation

In [None]:
# Cross-validation can take in pipelines


In [None]:
# Cross-validation on linear regression
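
Passing whole pipelines to `cross_validate` to compare two models might look like the sketch below. The data are synthetic and built to be roughly linear, the helper `build_pipeline` is a hypothetical name introduced here, and the mean squared error scoring is an assumption. The comparison uses only the training portion, so the test set stays untouched for the final error estimate.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic tips-like data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "size": rng.integers(1, 7, size=n),
    "sex": rng.choice(["Female", "Male"], size=n),
    "smoker": rng.choice(["No", "Yes"], size=n),
})
data["tip"] = (0.15 * data["total_bill"] + 0.3 * data["size"]
               + rng.normal(scale=0.2, size=n))

X = data.drop(columns="tip")
y = data["tip"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def build_pipeline(model):
    """Hypothetical helper: one-hot-encode the categorical columns, then fit model."""
    preprocess = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker"]),
        remainder="passthrough",  # numeric columns are passed through unchanged
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])

candidates = {
    "random_forest": build_pipeline(RandomForestRegressor(random_state=0)),
    "linear_regression": build_pipeline(LinearRegression()),
}

# Cross-validate each pipeline on the training data only
cv_errors = {}
for name, pipe in candidates.items():
    results = cross_validate(pipe, X_train, y_train, cv=5,
                             scoring="neg_mean_squared_error")
    cv_errors[name] = -results["test_score"]
    print(name, cv_errors[name].mean())
```

The pipeline is refitted from scratch in every fold, so the encoder never sees the validation fold during fitting.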


In [None]:
# We choose linear regression since its cross-validation error is lower.
# Now we use the test set to estimate the error of the chosen model.


In [None]:
# Predicting on the test set using the pipeline


In [None]:
# Getting out the final error


In [None]:
# Training on all data to get the final model used in production
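
The last three steps (predicting on the test set, computing the final error, and refitting on all data) can be sketched as below. The data are synthetic, the metric is `mean_squared_error` as imported earlier in the notebook, and using `sklearn.base.clone` to obtain a fresh copy of the pipeline for the production model is a choice made for this sketch, not something the lecture prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic tips-like data (assumption: not the real dataset)
rng = np.random.default_rng(1)
n = 60
data = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "smoker": rng.choice(["No", "Yes"], size=n),
})
data["tip"] = 0.15 * data["total_bill"] + rng.normal(scale=0.2, size=n)

X = data.drop(columns="tip")
y = data["tip"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The chosen model: one-hot-encoding followed by linear regression
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# Fit on the training set and predict on the held-out test set
pipe.fit(X_train, y_train)
test_pred = pipe.predict(X_test)

# The test error is computed once, after model selection is finished
test_mse = mean_squared_error(y_test, test_pred)
print(test_mse)

# Refit an unfitted copy of the pipeline on all data for production use
final_model = clone(pipe).fit(X, y)
```

The test error reported here describes the model selection procedure as a whole; the model refitted on all data is expected to do at least as well but has no held-out data left to confirm it.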
