# Lectures 5 and 6: Class demo

## Imports

In [None]:
# import the libraries
import os
import sys
sys.path.append(os.path.join(os.path.abspath(".."), (".."), "code"))
from plotting_functions import *
from utils import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

%matplotlib inline

pd.set_option("display.max_colwidth", 200)

DATA_DIR = os.path.join(os.path.abspath(".."), (".."), "data/")

<br><br>

## Data and splitting

Do you recall [the restaurants survey](https://ubc.ca1.qualtrics.com/jfe/form/SV_73VuZiuwM1eDVrw) you completed at the start of the course?

Let's use that data for this demo. You'll find a [wrangled version](../../data/cleaned_restaurant_data.csv) in the course repository.

In [None]:
df = pd.read_csv(DATA_DIR + 'cleaned_restaurant_data.csv')

In [None]:
df

In [None]:
df.describe()

Are there any unusual values in this data that you notice?
Let's get rid of these outliers. 

In [None]:
restaurant_df.describe()

We aim to predict whether a restaurant is liked or disliked.

In [None]:
# Separate `X` and `y`. 

Below I'm perturbing this data just to demonstrate a few concepts. Don't do it in real life. 

In [None]:
X.at[459, 'food_type'] = 'Quebecois'
X['price'] = X['price'] * 100

In [None]:
# Split the data

<br><br>

## Exploratory data analysis 

In [None]:
X_train.hist(bins=20, figsize=(12, 8));

Do you see anything interesting in these plots? 

In [None]:
X_train['food_type'].value_counts()

Is this an error in data collection? The "Fusion" and "fusion" categories should probably be combined.

In [None]:
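# Replace "fusion" with "Fusion"
X_train['food_type'] = X_train['food_type'].replace("fusion", "Fusion")
X_test['food_type'] = X_test['food_type'].replace("fusion", "Fusion")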

In [None]:
X_train['food_type'].value_counts()

Usually we would spend much more time on EDA, but let's stop here so that we have time to learn about transformers and pipelines.

<br><br>

## Modeling 

### Dummy Classifier

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

We have a relatively balanced distribution of both 'like' and 'dislike' classes.
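
As a point of reference, here is a minimal sketch of what the baseline is doing. It uses a small made-up label vector, not the survey data. With its default `strategy="prior"`, `DummyClassifier` ignores the features and always predicts the most frequent class, so its accuracy is simply the proportion of that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: 6 "like" and 4 "dislike" examples (not the survey data)
X_toy = np.zeros((10, 1))  # the dummy model ignores the features
y_toy = np.array(["like"] * 6 + ["dislike"] * 4)

dummy_toy = DummyClassifier()  # default strategy is "prior"
dummy_toy.fit(X_toy, y_toy)

predictions = dummy_toy.predict(X_toy)
accuracy = dummy_toy.score(X_toy, y_toy)
print(predictions)
print(accuracy)  # proportion of the majority class
```

The closer the two classes are to a 50/50 split, the closer this baseline accuracy gets to 0.5.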

<br><br>

### Let's try KNN on this data

Do you think KNN would work directly on `X_train` and `y_train`?

In [None]:
# Preprocessing and pipeline
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
# knn.fit(X_train, y_train)  # try it: does this work on the raw data?

<br><br><br><br><br><br>
- We need to preprocess the data before feeding it into machine learning models. What are the different types of features in the data?
- What transformations are necessary before training a machine learning model?
- Can we categorize features based on the type of transformations they require?

In [None]:
X_train[4:11]

In [None]:
X_train.columns

In [None]:
X_train['food_type'].value_counts()

In [None]:
X_train['north_america'].value_counts()

In [None]:
X_train['good_server'].value_counts()

In [None]:
X_train['noise_level'].value_counts()

In [None]:
numeric_feats = [] # Continuous and quantitative features
categorical_feats = [] # Discrete and qualitative features
binary_feats = [] # Categorical features with only two possible values 
ordinal_feats = [] # Some natural ordering in the categories 
noise_cats = []
drop_feats = [] # Dropping text feats and `eat_out_freq` because it's not that useful

<br><br>

Let's begin with numeric features. What if we just use numeric features to train a KNN model? Would it work? 

In [None]:
X_train_num = X_train[numeric_feats]
X_test_num = X_test[numeric_feats]
# knn.fit(X_train_num, y_train)

We need to deal with NaN values. 

### sklearn's `SimpleImputer` 
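
Before applying it to the restaurant data, here is a small sketch of what median imputation does. The toy columns and values below are made up for illustration. The imputer learns one median per column during `fit` and reuses those training medians when transforming new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values (not the survey data)
train_toy = pd.DataFrame(
    {"age": [20.0, 30.0, np.nan, 40.0], "price": [10.0, np.nan, 30.0, 50.0]}
)
test_toy = pd.DataFrame({"age": [np.nan], "price": [25.0]})

imputer_toy = SimpleImputer(strategy="median")
imputer_toy.fit(train_toy)  # learns one median per column from the training data

train_imp = imputer_toy.transform(train_toy)
test_imp = imputer_toy.transform(test_toy)  # reuses the training medians

print(imputer_toy.statistics_)  # the learned medians
print(test_imp)
```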

In [None]:
# Impute numeric features using SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

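# Fit the imputer on the training data only
imputer.fit(X_train_num)

# Transform training data
X_train_num_imp = imputer.transform(X_train_num)

# Transform test data
X_test_num_imp = imputer.transform(X_test_num)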

In [None]:
knn.fit(X_train_num_imp, y_train)

No more errors. It worked! Let's check the training and test scores.

In [None]:
knn.score(X_train_num_imp, y_train)

In [None]:
knn.score(X_test_num_imp, y_test)

We have slightly improved results in comparison to the dummy model. 

### Discussion questions 

- What's the difference between sklearn estimators and transformers?  
- Can you think of a better way to impute missing values? 

<br><br><br><br>

Do we need to scale the data? 

In [None]:
X_train[numeric_feats]

In [None]:
# Scale the imputed data 

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

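# Fit scaler on the imputed training data only
scaler.fit(X_train_num_imp)

# Transform train
X_train_num_imp_scaled = scaler.transform(X_train_num_imp)

# Transform test
X_test_num_imp_scaled = scaler.transform(X_test_num_imp)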

### Alternative methods for scaling
- [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html): Transform each feature to a desired range
- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html): Scale features using median and quantiles. Robust to outliers. 
- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html): Works on rows rather than columns. Normalize examples individually to unit norm.
- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html): A scaler that scales each feature by its maximum absolute value.
    - What would happen when you apply `StandardScaler` to sparse data?    
- You can also apply custom scaling on columns using [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). For example, when a column follows the power law distribution (a handful of your values have many data points whereas most other values have few data points) log scaling is helpful.    

- For now, let's focus on `StandardScaler`. Let's carry out cross-validation
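
To make the differences between these options concrete, here is a small sketch on a made-up column with one very large value. It compares `StandardScaler`, `MinMaxScaler`, and a custom log transformation via `FunctionTransformer`. The choice of `np.log10` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler

# Hypothetical toy column with one very large value
col = np.array([[1.0], [10.0], [100.0], [1000.0]])

standardized = StandardScaler().fit_transform(col)  # mean 0, standard deviation 1
min_max = MinMaxScaler().fit_transform(col)  # squeezed into the range [0, 1]
log_scaled = FunctionTransformer(np.log10).fit_transform(col)  # custom log scaling

print(standardized.ravel())
print(min_max.ravel())
print(log_scaled.ravel())
```

Notice that the log transformation spreads the values evenly, whereas the other two scalers leave the small values bunched together.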

In [None]:
cross_val_score(knn, X_train_num_imp_scaled, y_train)

In this case, we don't see a big difference with `StandardScaler`. But usually, scaling is a good idea. 

<br><br><br><br>
- This worked but are we doing anything wrong here? 
- What's the problem with calling `cross_val_score` with preprocessed data? 


In [None]:
plot_improper_processing("kNN")

<br><br><br><br>

#### How would you do it properly? Enter sklearn pipelines!!

In [None]:
# Create a pipeline 
pipe_knn = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsClassifier()
)

In [None]:
cross_val_score(pipe_knn, X_train_num, y_train).mean()

- What is happening under the hood? 
- Why is this a better approach? 
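
One way to think about what happens under the hood is sketched below on made-up numeric data. For each fold, every step of the pipeline is fit on the training portion only and then applied to the validation portion, so no information from the validation portion leaks into the imputer or the scaler. The manual loop is only a rough illustration of the idea, not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 40 examples, 3 numeric features, a few missing values
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
X_toy[::7, 1] = np.nan
y_toy = np.array([0, 1] * 20)


def build_pipe():
    return make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler(), KNeighborsClassifier()
    )


# Passing the whole pipeline to cross_val_score
pipe_scores = cross_val_score(build_pipe(), X_toy, y_toy)

# Roughly the same thing by hand: fit all steps on the training portion of
# each fold, then score on the validation portion of that fold
manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_pipe = build_pipe()
    fold_pipe.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_pipe.score(X_toy[valid_idx], y_toy[valid_idx]))

print(pipe_scores)
print(np.array(manual_scores))
```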

<img src='../../img/pipeline.png' width="800">
    
[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)

In [None]:
plot_proper_processing("kNN")

<br><br><br><br>

We will continue with this demo in the next lecture. 

### Categorical features

Let's assess the scores using categorical features.

In [None]:
X_train['food_type'].value_counts()

In [None]:
X_train[categorical_feats]

In [None]:
X_train['north_america'].value_counts()

In [None]:
X_train['food_type'].value_counts()

In [None]:
X_train_cat = X_train[categorical_feats]
X_test_cat = X_test[categorical_feats]

In [None]:
# One-hot encoding of categorical features 
from sklearn.preprocessing import OneHotEncoder
# Create class object
ohe = OneHotEncoder(sparse_output=False)

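# Fit OneHotEncoder on the training data only
ohe.fit(X_train_cat)

# Transform (the test transform raises an error if it meets a category unseen during fit)
X_train_cat_ohe = ohe.transform(X_train_cat)
X_test_cat_ohe = ohe.transform(X_test_cat)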

In [None]:
X_train_cat_ohe

- By default (`sparse_output=True`), `OneHotEncoder` returns a sparse matrix. Why?
- Here we passed `sparse_output=False`, so we get a dense array. Why might we want to do that?

In [None]:
# Get the OHE feature names 

ohe_feats = ohe.get_feature_names_out().tolist()
ohe_feats

In [None]:
pd.DataFrame(X_train_cat_ohe, columns = ohe_feats)

In [None]:
cross_val_score(knn, X_train_cat_ohe, y_train)

- What's wrong here? 
- How can we fix this?

<br><br><br><br><br><br>

Let's do this properly with a pipeline. 

In [None]:
# Code to create a pipeline for OHE and KNN
pipe_ohe_knn = make_pipeline(
    OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
    KNeighborsClassifier()
)

In [None]:
cross_val_score(pipe_ohe_knn, X_train_cat, y_train)

### Ordinal features

Let's examine the scores using ordinal features.

In [None]:
noise_ordering = []

In [None]:
X_train['noise_level'].value_counts()

In [None]:
pipe_ordinal_knn = make_pipeline(
    OrdinalEncoder(categories=[noise_ordering]),
    KNeighborsClassifier()
)

In [None]:
# cross_val_score(pipe_ordinal_knn, X_train[['noise_level']], y_train)

<br><br><br><br><br><br>

In [None]:
X_train['noise_level'].isnull().any()

There are missing values. So we need an imputer. 
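
Here is a minimal sketch of imputation followed by ordinal encoding on a made-up column. The category ordering is the one used in this demo, and the toy values are hypothetical. The imputer fills the gap with the most frequent category, and the encoder then maps each category to its position in the given order.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with one missing value
noise_toy = pd.DataFrame(
    {"noise_level": pd.Series(["low", "high", np.nan, "low", "medium"], dtype=object)}
)
ordering = ["no music", "low", "medium", "high", "crazy loud"]

ordinal_toy = make_pipeline(
    SimpleImputer(strategy="most_frequent"),  # fills the gap with "low"
    OrdinalEncoder(categories=[ordering]),  # maps categories to 0, 1, 2, ... in the given order
)
encoded = ordinal_toy.fit_transform(noise_toy)
print(encoded.ravel())
```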

In [None]:
from sklearn.preprocessing import OrdinalEncoder
noise_ordering = ['no music', 'low', 'medium', 'high', 'crazy loud']

ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=[noise_ordering]),
    KNeighborsClassifier()
)

In [None]:
cross_val_score(ordinal_transformer, X_train[['noise_level']], y_train)

<br><br><br><br>

Right now we are working with numeric and categorical features separately. But ideally when we create a model, we need to use all these features together. 

**Enter column transformer!**

How can we horizontally stack  
- preprocessed numeric features, 
- preprocessed binary features, 
- preprocessed ordinal features, and 
- preprocessed categorical features?

Let's define a column transformer. 

In [None]:
from sklearn.compose import make_column_transformer

What does the transformed data look like?

In [None]:
categorical_feats

In [None]:
X_train.shape

In [None]:
preprocessor

In [None]:
# Getting feature names from a column transformer

In [None]:
numeric_feats

In [None]:
feat_names = 

In [None]:
transformed

In [None]:
pd.DataFrame(transformed, columns = feat_names)

We have new columns for the categorical features. Let's create a pipeline with the preprocessor and SVC. 
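
The idea can be sketched on made-up data as follows. The preprocessor becomes the first step of a pipeline and the classifier becomes the last step, so the preprocessing is re-fit inside every cross-validation fold. The toy columns and labels are hypothetical, and the scores on such repeated toy data are not meaningful.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: 20 rows built by repeating a small pattern
toy = pd.DataFrame(
    {
        "price": [1000.0, 2500.0, 4000.0, 1500.0] * 5,
        "food_type": ["Chinese", "Italian", "Mexican", "Italian"] * 5,
    }
)
y_toy = pd.Series(["dislike", "like", "like", "dislike"] * 5)

preprocessor_toy = make_column_transformer(
    (StandardScaler(), ["price"]),
    (OneHotEncoder(handle_unknown="ignore"), ["food_type"]),
)
svc_pipe_toy = make_pipeline(preprocessor_toy, SVC())

# The preprocessor is re-fit on the training portion of every fold
scores_toy = cross_validate(svc_pipe_toy, toy, y_toy, return_train_score=True)
print(pd.DataFrame(scores_toy))
```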

In [None]:
from sklearn.svm import SVC 

We are getting better results! 
<br><br><br>

### Incorporating text features 

We haven't incorporated the comments feature into our pipeline yet, even though it holds significant value in indicating whether the restaurant was liked or not.

In [None]:
X_train

Let's create bag-of-words representation of the `comments` feature. But first we need to impute the rows where there are no comments. There is a small complication if we want to put `SimpleImputer` and `CountVectorizer` in a pipeline. 
- `SimpleImputer` takes a 2D array as input and produces a 2D array as output. 
- `CountVectorizer` takes a 1D array as input. 

To deal with this, we will use sklearn's `FunctionTransformer` to convert the 2D output of `SimpleImputer` into a 1D array which can be passed to `CountVectorizer` as input. 

In [None]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

Pretty good scores just with text features! Do we get better scores if we combine all features? Let's define a column transformer which carries out 
- imputation and scaling on numeric features
- imputation and one-hot encoding with `drop="if_binary"` on binary features
- imputation and one-hot encoding with `handle_unknown="ignore"` on categorical features
- imputation, reshaping, and bag-of-words transformation on the text feature

Some improvement when we combine all features! 