# Housing 5: Categorical encoding

So far, we have only used numerical features in our model. By leaving out the categorical features, we were missing a lot of potentially important information.

As we will see, converting categorical features to numerical ones (so that they can be "digested" by the Scikit-Learn transformers and models) adds a bit of complexity to the modelling pipeline. This is why in this notebook we will start by encoding them without using pipelines (just to understand what's going on), and only later will we include categorical encoding inside the pipeline.

Before going through this notebook, read the Platform lesson on One-Hot Encoding: https://platform.wbscodingschool.com/courses/data-science/12675/ 

## 1. Data reading & splitting

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# reading: turn the Google Drive share link into a direct-download URL
url = "https://drive.google.com/file/d/14PA1y_394HsGMMqYe4i3fdb7T0gYsE8h/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)

# X and y creation
X = data.copy()
y = X.pop("Expensive")


# data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
 9   Expensive     1460 non-null   int64  
 10  MSZoning      1460 non-null   object 
 11  Condition1    1460 non-null   object 
 12  Heating       1460 non-null   object 
 13  Street        1460 non-null   object 
 14  CentralAir    1460 non-null   object 
 15  Foundation    1460 non-null   object 
 16  ExterQual     1460 non-null   object 
 17  ExterCond     1460 non-null   object 
 18  BsmtQual      1423 non-null 

In [None]:
data.isna().sum()

LotArea           0
LotFrontage     259
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
Expensive         0
MSZoning          0
Condition1        0
Heating           0
Street            0
CentralAir        0
Foundation        0
ExterQual         0
ExterCond         0
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
KitchenQual       0
FireplaceQu     690
dtype: int64

## 2. Categorical encoding - "Manual" approach (Without using Pipelines)

### 2.1. Replacing NaNs

We will need two different strategies to deal with missing values in numerical and categorical features.

#### 2.1.1. Replacing NaNs in categorical features

In our preprocessing pipeline for numerical features, we imputed the column mean wherever a value was missing. This does not work for categorical features: they don’t have a “mean”. Here, we will replace NaNs with a string that marks them as missing: “N_A”. It is not an elegant solution, but it will allow us to move forward.

In [None]:
# selecting non-numerical columns
X_train_cat = X_train.select_dtypes(exclude="number")

# defining the imputer to use "N_A" as replacement value
cat_imputer = SimpleImputer(strategy="constant", 
                            fill_value="N_A")

# fitting the imputer
cat_imputer.fit(X_train_cat)

# transforming the data & keeping it as a DataFrame
X_cat_imputed = pd.DataFrame(cat_imputer.transform(X_train_cat), 
                             columns=X_train_cat.columns)
X_cat_imputed.head()

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual,FireplaceQu
0,RL,Norm,GasA,Pave,Y,PConc,Gd,TA,Gd,TA,Gd,GLQ,Gd,TA
1,RL,Norm,GasA,Pave,Y,CBlock,TA,TA,TA,TA,No,BLQ,Gd,Gd
2,RL,PosN,GasA,Pave,Y,CBlock,TA,Gd,Gd,Gd,No,ALQ,TA,TA
3,RL,Norm,GasA,Pave,N,CBlock,TA,TA,TA,TA,No,Unf,TA,N_A
4,RL,Norm,GasA,Pave,Y,Wood,TA,TA,Gd,TA,No,GLQ,TA,N_A


In [None]:
# To check categorical values
for i in X_train_cat:
    print(i, " : ", data[i].unique())

MSZoning  :  ['RL' 'RM' 'C (all)' 'FV' 'RH']
Condition1  :  ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Heating  :  ['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor']
Street  :  ['Pave' 'Grvl']
CentralAir  :  ['Y' 'N']
Foundation  :  ['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone']
ExterQual  :  ['Gd' 'TA' 'Ex' 'Fa']
ExterCond  :  ['TA' 'Gd' 'Fa' 'Po' 'Ex']
BsmtQual  :  ['Gd' 'TA' 'Ex' nan 'Fa']
BsmtCond  :  ['TA' 'Gd' nan 'Fa' 'Po']
BsmtExposure  :  ['No' 'Gd' 'Mn' 'Av' nan]
BsmtFinType1  :  ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ']
KitchenQual  :  ['Gd' 'TA' 'Ex' 'Fa']
FireplaceQu  :  [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']


#### 2.1.2. Replacing NaNs in numerical features

This is what we already did in previous notebooks: replacing numerical NaNs with the mean of their column.

In [None]:
# Selecting numerical columns
X_train_num = X_train.select_dtypes(include="number")

# Imputing the mean
num_imputer = SimpleImputer(strategy="mean")

# Fitting
num_imputer.fit(X_train_num)

# Transforming, keeping a DataFrame
X_num_imputed = pd.DataFrame(num_imputer.transform(X_train_num), 
                             columns=X_train_num.columns)

X_num_imputed.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
0,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
1,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
2,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0
3,10778.0,72.0,1768.0,4.0,0.0,0.0,0.0,0.0,0.0
4,14115.0,85.0,796.0,1.0,0.0,0.0,2.0,40.0,0.0


In [None]:
# Concatenating all columns
X_imputed = pd.concat([X_cat_imputed, X_num_imputed], axis=1)

X_imputed.head()

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,...,FireplaceQu,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
0,RL,Norm,GasA,Pave,Y,PConc,Gd,TA,Gd,TA,...,TA,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
1,RL,Norm,GasA,Pave,Y,CBlock,TA,TA,TA,TA,...,Gd,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
2,RL,PosN,GasA,Pave,Y,CBlock,TA,Gd,Gd,Gd,...,TA,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0
3,RL,Norm,GasA,Pave,N,CBlock,TA,TA,TA,TA,...,N_A,10778.0,72.0,1768.0,4.0,0.0,0.0,0.0,0.0,0.0
4,RL,Norm,GasA,Pave,Y,Wood,TA,TA,Gd,TA,...,N_A,14115.0,85.0,796.0,1.0,0.0,0.0,2.0,40.0,0.0


In [None]:
X_num_imputed.describe()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
count,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0
mean,10353.034247,69.58427,1061.137842,2.871575,0.605308,3.44863,1.759418,97.089041,14.263699
std,9411.800862,21.24299,448.16577,0.831439,0.636673,44.896939,0.745967,127.90262,55.068118
min,1300.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7538.75,60.0,793.0,2.0,0.0,0.0,1.0,0.0,0.0
50%,9452.5,69.58427,996.0,3.0,1.0,0.0,2.0,0.0,0.0
75%,11604.0,79.0,1311.75,3.0,1.0,0.0,2.0,168.0,0.0
max,215245.0,313.0,6110.0,8.0,3.0,738.0,4.0,857.0,480.0


### 2.2. One Hot encoding

As you have learnt in the Platform lesson, One Hot encoding means creating a new binary column for each category in every categorical column. Fortunately, a Scikit-Learn transformer takes care of everything.
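
Before applying the encoder to the housing data, here is a minimal sketch of the idea on a made-up, three-row `Street` column (the toy data and the variable names `toy`, `encoder` and `toy_onehot` are assumptions for illustration only). Each category becomes its own binary column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# Default settings: one binary column per category
encoder = OneHotEncoder()

# fit_transform returns a sparse matrix; .toarray() makes it a dense NumPy array
encoded = encoder.fit_transform(toy).toarray()

# The learned categories (sorted alphabetically) serve as column names
toy_onehot = pd.DataFrame(encoded, columns=encoder.categories_[0])
print(toy_onehot)
```

Every row has exactly one `1.0`, placed in the column of the category that row belongs to.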

#### 2.2.1. Fitting the `OneHotEncoder`

As with any transformer, we have to:
1. Import it
2. Initialize it
3. Fit it to the data
4. Use it to transform the data

In [None]:
# import
from sklearn.preprocessing import OneHotEncoder

# initialize
my_onehot = OneHotEncoder(drop="first")

# fit
my_onehot.fit(X_cat_imputed)

# transform
X_cat_imputed_onehot = my_onehot.transform(X_cat_imputed)

The result is a "sparse matrix": a SciPy data structure that stores only the non-zero entries. `OneHotEncoder` returns it by default because one-hot encoded data consists mostly of zeros:

In [None]:
X_cat_imputed_onehot

<1168x57 sparse matrix of type '<class 'numpy.float64'>'
	with 15479 stored elements in Compressed Sparse Row format>
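
In the output above, only 15479 of the 1168 × 57 = 66576 cells are stored, which is about 23% of the matrix. As a rough sketch of what "storing only the non-zero entries" means, the following builds a small sparse matrix by hand with SciPy (the array `dense` is invented for illustration):

```python
import numpy as np
from scipy import sparse

# A small, mostly-zero matrix (made-up values)
dense = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])

# Compressed Sparse Row format, the same format reported in the output above
sp = sparse.csr_matrix(dense)

print(sp.shape)  # overall shape is preserved
print(sp.nnz)    # number of explicitly stored (non-zero) elements

# Converting back to a dense array recovers the original values
roundtrip = sp.toarray()
```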

#### 2.2.2. Converting the sparse matrix into a DataFrame

To see what exactly is inside of this sparse matrix we can convert it to a pandas DataFrame: 

In [None]:
df = pd.DataFrame.sparse.from_spmatrix(X_cat_imputed_onehot)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


We can see that all the columns contain either 0's or 1's. This is exactly what "one-hot" encoded columns (also called "dummy columns") look like: binary indicators of category membership.

Now, for exploration and learning purposes, we will rename the columns in this dataframe so that we know the origin of each binary column (the category and original column they come from).

#### 2.2.3. Retrieving the column names for the "one-hot" columns

The fitted transformer contains this information, and the method `get_feature_names_out` allows us to recover the names of the columns.



> **Note:** `get_feature_names_out` was added in Scikit-Learn 1.0. If you're running this code as a local Jupyter notebook with an older version, you might have to adapt the code (older releases provide `get_feature_names` instead). Check the documentation for the Scikit-Learn version you have installed.
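
One possible way to adapt the code is to check which method the fitted encoder offers. This is only a sketch on a made-up `CentralAir` column; the fallback branch assumes an older Scikit-Learn release where the method is still called `get_feature_names`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data for illustration
toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y"]})
enc = OneHotEncoder().fit(toy)

if hasattr(enc, "get_feature_names_out"):
    # Scikit-Learn 1.0 and later
    names = enc.get_feature_names_out(toy.columns)
else:
    # Assumed fallback for older releases
    names = enc.get_feature_names(toy.columns)

names = list(names)
print(names)
```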



In [None]:
colnames = my_onehot.get_feature_names_out(X_cat_imputed.columns)
df.columns = colnames
df.head()

Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,...,BsmtFinType1_Rec,BsmtFinType1_Unf,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_N_A,FireplaceQu_Po,FireplaceQu_TA
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


Each column has the name of the original column, an underscore and the name of the category: 

- A column such as "CentralAir", with only two categories, "N" and "Y", has become a single column, "CentralAir_Y", where `1` stands for "Y" and `0` for "N".

- A column such as "Condition1", with many categories ("Artery", "Feedr", "Norm", "PosA", "PosN", "RRAe", "RRAn", "RRNe", "RRNn"), has become as many columns as there were categories (minus one, because we used `drop="first"`), making the dataframe very wide and sparse.

#### 2.2.4. Concatenating "one-hot" columns with numerical columns

Now that the categorical columns are numerical, we can join them back with the originally numerical columns and assemble the dataset that will be ready for modelling:

In [None]:
X_imputed = pd.concat([df, X_num_imputed], axis=1)

X_imputed.head(3)

Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,...,FireplaceQu_TA,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0
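
Everything above was fitted on the training set. To prepare the test set in the manual approach, the already fitted transformers would be reused with `transform` only, never fitted again. The following is a small sketch of that idea on made-up data (the frames `train` and `test` are hypothetical stand-ins, not the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for a categorical column in the train and test sets
train = pd.DataFrame({"FireplaceQu": ["Gd", np.nan, "TA", "Gd"]}, dtype=object)
test = pd.DataFrame({"FireplaceQu": [np.nan, "TA"]}, dtype=object)

# Fit both transformers on the training data only
imputer = SimpleImputer(strategy="constant", fill_value="N_A").fit(train)
encoder = OneHotEncoder(drop="first").fit(imputer.transform(train))

# Apply the fitted transformers to the test data (transform, not fit)
test_encoded = encoder.transform(imputer.transform(test)).toarray()

# Categories learned from train: "Gd" (dropped), "N_A", "TA"
print(encoder.categories_[0])
print(test_encoded)
```

Keeping track of these fitted objects by hand is exactly the bookkeeping that pipelines take over.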


## 3. Categorical encoding - "Automated" approach (Using Pipelines)

In the manual approach, to encode the categorical columns numerically, we have:

1. Selected the categorical columns.
2. Fitted a `OneHotEncoder` to them.
3. Transformed the categorical columns with the encoder.
4. Converted the sparse matrix into a dataframe.
5. Recovered the names of the columns.
6. Concatenated the one-hot columns with the numerical columns.

All these steps can be condensed by using Scikit-Learn Pipelines, and specifically something called `ColumnTransformer`, which allows us to apply different transformations to two or more groups of columns: in our case, categorical and numerical columns.

This process is also called creating "branches" in the pipeline. One branch for the categorical columns and another for the numerical columns. Each branch will contain as many transformers as we want. Then, the branches will meet again, and the transformed columns will be automatically concatenated. Let's see the process in action:

### 3.1. Creating the "numeric pipe" and the "categoric pipe"

In [None]:
# select categorical and numerical column names
X_cat_columns = X.select_dtypes(exclude="number").columns
X_num_columns = X.select_dtypes(include="number").columns

# create numerical pipeline, only with the SimpleImputer(strategy="mean")
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="mean"))

# create categorical pipeline, with the SimpleImputer(fill_value="N_A") and the OneHotEncoder
categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder()
)
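
One caveat about `OneHotEncoder()` with default settings: it raises an error when `transform` meets a category it did not see during `fit`. This can happen during cross-validation, when a rare category ends up only in the validation fold. A possible remedy, sketched here on made-up data (the frames `train` and `valid` are hypothetical), is the `handle_unknown="ignore"` option, which encodes unseen categories as all zeros:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Floor" appears only in the validation rows
train = pd.DataFrame({"Heating": ["GasA", "GasW", "GasA"]})
valid = pd.DataFrame({"Heating": ["GasA", "Floor"]})

# Default behaviour: unknown categories raise a ValueError at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(valid)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown="ignore": unknown categories become a row of zeros
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
valid_encoded = tolerant.transform(valid).toarray()

print(strict_failed)
print(valid_encoded)
```

Whether this option is appropriate depends on the data; it is shown here only as one way to make the encoder robust to unseen categories.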

### 3.2. Using `ColumnTransformer` to create a pipeline with 2 branches (the `preprocessor`)

We simply tell the pipeline the following:

- One branch, called `"num_pipe"`, will apply the steps in the `numeric_pipe` to the columns named in `X_num_columns`
- The second branch, called `"cat_pipe"`, will apply the steps in the `categoric_pipe` to the columns named in `X_cat_columns`

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num_columns),
        ("cat_pipe", categoric_pipe, X_cat_columns),
    ]
)
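
To see how the two branches meet again, here is a small self-contained sketch that builds the same kind of preprocessor on a made-up two-column frame (the data and the name `toy_preprocessor` are assumptions for illustration). The imputed numerical column comes first, followed by the one-hot columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Street": ["Pave", "Grvl", "Pave"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num_pipe", make_pipeline(SimpleImputer(strategy="mean")), ["LotFrontage"]),
    ("cat_pipe", make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                               OneHotEncoder()), ["Street"]),
])

out = toy_preprocessor.fit_transform(toy)

# The result may be sparse or dense depending on its density; make it dense to print
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out, dtype=float)
print(out)
```

The missing `LotFrontage` value is replaced by the mean of the other two, and `Street` is expanded into one column per category.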

### 3.3. Creating the `full_pipeline` (`preprocessor` + Decision Tree)

Pipelines are modular. The `preprocessor` we created above with the `ColumnTransformer` can now become a step in a new pipeline, which we'll call `full_pipeline_dt` and which includes, as its last step, a Decision Tree model:

In [None]:
full_pipeline_dt = make_pipeline(preprocessor, 
                                 DecisionTreeClassifier(random_state=10))

Note that we did not fit the `preprocessor` beforehand: we only fit the pipeline once it has been fully assembled.

We can now fit `full_pipeline_dt` to the training data:

In [None]:
full_pipeline_dt.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num_pipe',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer())]),
                                                  Index(['LotArea', 'LotFrontage', 'TotalBsmtSF', 'BedroomAbvGr', 'Fireplaces',
       'PoolArea', 'GarageCars', 'WoodDeckSF', 'ScreenPorch'],
      dtype='object')),
                                                 ('cat_pipe',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='N_A',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(

This full pipeline can make predictions, like any other pipeline that ends with a model:

In [None]:
# predictions on the training data
result = full_pipeline_dt.predict(X_train)
result

array([1, 0, 1, ..., 1, 0, 0])

In [None]:
# Google Colab only: save the predictions to a CSV file and download it
from google.colab import files
df = pd.DataFrame(result)
df.to_csv('result.csv', index=False)
files.download("result.csv")

In [None]:
acc_decision_tree = round(accuracy_score(y_test, full_pipeline_dt.predict(X_test))*100,2)

In [None]:
acc_decision_tree

89.38

In [None]:
# Random Forest
full_pipeline_rf = make_pipeline(preprocessor, 
                                 RandomForestClassifier(random_state=10))
full_pipeline_rf.fit(X_train, y_train)
# accuracy on the test set, as a percentage
acc_random_forest = round(accuracy_score(y_test, full_pipeline_rf.predict(X_test))*100, 2)
acc_random_forest

95.55

In [None]:
# KNN
full_pipeline_knn = make_pipeline(preprocessor, 
                                  KNeighborsClassifier())
full_pipeline_knn.fit(X_train, y_train)
# accuracy on the test set, as a percentage
acc_knn = round(accuracy_score(y_test, full_pipeline_knn.predict(X_test))*100, 2)
acc_knn

88.01

In [None]:
#Which is the best Model ?
results = pd.DataFrame({
    'Model': ['KNN','Random Forest','Decision Tree'],
    'Score': [acc_knn, acc_random_forest, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head()

Score,Model
95.55,Random Forest
89.38,Decision Tree
88.01,KNN


### **Exercise 1:** Use the new Pipeline with branches to train a Decision Tree with `GridSearchCV` cross-validation.

We are basically asking to combine what you have learned in this notebook (categorical encoding & branches) with what you learned in the previous one (using `GridSearchCV` for a whole Pipeline).
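
The keys of a parameter grid for a pipeline follow the pattern `step__parameter`, with one extra `__` for each level of nesting. If you are unsure about a name, `get_params()` lists every parameter that can be tuned. A minimal sketch on a small, hypothetical pipeline (`toy_pipeline` is not part of the notebook):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# make_pipeline names each step after its class, in lowercase
toy_pipeline = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# Parameters of the steps are exposed as "<step name>__<parameter name>"
tunable = [name for name in toy_pipeline.get_params() if "__" in name]

for name in tunable:
    print(name)
```

Calling `get_params()` on the full pipeline of this notebook would show the longer names used in the grid, such as `columntransformer__num_pipe__simpleimputer__strategy`.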

In [None]:
from sklearn.model_selection import GridSearchCV

param_range = [2, 4, 5, 10]
estimators_range = [50, 100, 150]
# imputation strategies to try for the numerical columns
num_range = ["mean", "median", "most_frequent"]

dt_param_grid = [{
    "columntransformer__num_pipe__simpleimputer__strategy": num_range,
    "decisiontreeclassifier__max_depth": param_range,
    "decisiontreeclassifier__min_samples_leaf": param_range
}]
rf_param_grid = [{ "columntransformer__num_pipe__simpleimputer__strategy": num_range,
                   "randomforestclassifier__n_estimators" :estimators_range,
                   "randomforestclassifier__max_depth": param_range,
                   "randomforestclassifier__min_samples_leaf": param_range
}]
knn_param_grid = [{'kneighborsclassifier__n_neighbors': param_range,
                   'kneighborsclassifier__weights': ['uniform', 'distance'],
                   'kneighborsclassifier__metric': ['euclidean', 'manhattan']
}]

dt_search = GridSearchCV(estimator=full_pipeline_dt,
                      param_grid=dt_param_grid,
                      scoring='accuracy',
                      cv=5,
                      verbose=1)
rf_search = GridSearchCV(estimator=full_pipeline_rf,
                      param_grid=rf_param_grid,
                      scoring='accuracy',
                      cv=5,
                      verbose=1)
knn_search = GridSearchCV(estimator=full_pipeline_knn,
                      param_grid=knn_param_grid,
                      scoring='accuracy',
                      cv=5,
                      verbose=1)
grids = [dt_search,rf_search,knn_search]
for pipe in grids:
    pipe.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/_scorer.py", line 258, in _score
    y_pred = method_caller(estimator, "predict", X)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/metrics/_scorer.py", line 68, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.8/dist-packages/sklearn/pipeline.py", line 469, in predict
    Xt = transform.transform(Xt)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/compose/_column_transformer.py"

Fitting 5 folds for each of 144 candidates, totalling 720 fits



Fitting 5 folds for each of 16 candidates, totalling 80 fits



In [None]:
grid_dict = {0: 'Decision Trees', 
             1: 'Random Forest',
             2: 'KNN'}
for i, model in enumerate(grids):
    print('{} Test Accuracy: {}'.format(grid_dict[i],
    model.score(X_test,y_test)))
    print('{} Best Params: {}'.format(grid_dict[i],model.best_params_))

Decision Trees Test Accuracy: 0.9178082191780822
Decision Trees Best Params: {'columntransformer__num_pipe__simpleimputer__strategy': 'mean', 'decisiontreeclassifier__max_depth': 2, 'decisiontreeclassifier__min_samples_leaf': 2}
Random Forest Test Accuracy: 0.9041095890410958
Random Forest Best Params: {'columntransformer__num_pipe__simpleimputer__strategy': 'mean', 'randomforestclassifier__max_depth': 2, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__n_estimators': 50}
KNN Test Accuracy: 0.8698630136986302
KNN Best Params: {'kneighborsclassifier__metric': 'euclidean', 'kneighborsclassifier__n_neighbors': 2, 'kneighborsclassifier__weights': 'uniform'}
