# Integrating PCA in Pipelines - Lab

## Introduction

In a previous section, you learned how to use pipelines in scikit-learn to combine several supervised learning algorithms in a manageable workflow. In this lab, you will integrate PCA along with classifiers in a pipeline.

## Objectives

In this lab you will: 

- Integrate PCA in scikit-learn pipelines 

## The Data Science Workflow

You will be following the data science workflow:

1. Initial data inspection, exploratory data analysis, and cleaning
2. Feature engineering and selection
3. Create a baseline model
4. Create a machine learning pipeline and compare results with the baseline model
5. Interpret the model and draw conclusions

## Initial data inspection, exploratory data analysis, and cleaning

You'll use a dataset created by the Otto group, which was also used in a [Kaggle competition](https://www.kaggle.com/c/otto-group-product-classification-challenge/data). The description of the dataset is as follows:

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). They are selling millions of products worldwide every day, with several thousand products being added to their product line.

A consistent analysis of the performance of their products is crucial. However, due to their global infrastructure, many identical products get classified differently. Therefore, the quality of product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights the Otto Group can generate about their product range.

In this lab, you'll use a dataset containing:
- A column `id`, which is an anonymous id unique to a product
- 93 columns `feat_1`, `feat_2`, ..., `feat_93`, which are the various features of a product
- A column `target`, which is the class of a product



The dataset is stored in the `'otto_group.csv'` file. Import this file into a DataFrame called `df`, and then:

- Check for missing values 
- Check the distribution of columns 
- ... and any other things that come to your mind to explore the data 

In [18]:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

df = pd.read_csv('otto_group.csv', index_col='id')
df.head()

id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,feat_10,...,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93,target
1,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,Class_1
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Class_1
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Class_1
4,1,0,0,1,6,1,5,0,0,1,...,0,1,2,0,0,0,0,0,0,Class_1
5,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,Class_1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61878 entries, 0 to 61877
Data columns (total 95 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       61878 non-null  int64 
 1   feat_1   61878 non-null  int64 
 2   feat_2   61878 non-null  int64 
 3   feat_3   61878 non-null  int64 
 4   feat_4   61878 non-null  int64 
 5   feat_5   61878 non-null  int64 
 6   feat_6   61878 non-null  int64 
 7   feat_7   61878 non-null  int64 
 8   feat_8   61878 non-null  int64 
 9   feat_9   61878 non-null  int64 
 10  feat_10  61878 non-null  int64 
 11  feat_11  61878 non-null  int64 
 12  feat_12  61878 non-null  int64 
 13  feat_13  61878 non-null  int64 
 14  feat_14  61878 non-null  int64 
 15  feat_15  61878 non-null  int64 
 16  feat_16  61878 non-null  int64 
 17  feat_17  61878 non-null  int64 
 18  feat_18  61878 non-null  int64 
 19  feat_19  61878 non-null  int64 
 20  feat_20  61878 non-null  int64 
 21  feat_21  61878 non-null  int64 
 22

In [10]:
# Inspect missing values
df.isna().sum()

id         0
feat_1     0
feat_2     0
feat_3     0
feat_4     0
          ..
feat_90    0
feat_91    0
feat_92    0
feat_93    0
target     0
Length: 95, dtype: int64

In [14]:
# Inspect Duplicates
df.duplicated().sum()

0

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

If you look at the histograms, you can tell that much of the data is zero-inflated: most variables consist mostly of zeros, with occasional higher values. The features are clearly not normally distributed, but for most machine learning techniques this is not an issue.
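One way to put a number on zero inflation is to compute the share of zeros per column. The sketch below is only an illustration: it uses a small synthetic frame (`toy`, a hypothetical stand-in built from Poisson counts) rather than the Otto data, so the printed values say nothing about the real features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Otto features (an assumption, not the real data):
# Poisson counts with a small rate are mostly zero, with occasional larger values
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.poisson(lam=0.3, size=(1000, 5)),
                   columns=[f"feat_{i + 1}" for i in range(5)])

# Fraction of entries equal to zero in each column
zero_fraction = (toy == 0).mean()
print(zero_fraction.round(2))

# Share of zeros across the whole frame
overall_zero_share = (toy.to_numpy() == 0).mean()
print("Overall zero share:", round(overall_zero_share, 2))
```

On the lab data, the same two expressions could be applied to `df.drop('target', axis=1)` in place of `toy`.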

In [None]:
# Your code here

Because there are so many zeros, most values above zero will look like outliers. The safe decision for this data is to keep all of them and see what happens. In sparse data like this, the occasional high values may be highly informative. Moreover, without an intuitive meaning for each feature, we don't know whether a value of ~260 is actually an outlier.

In [None]:
# Your code here

## Feature engineering and selection with PCA

Have a look at the correlation structure of your features using a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

Use PCA to reduce the features to the smallest number of principal components that still retains at least 80% of the explained variance.

In [19]:
# Split the data into X and y
X = df.drop(['target'], axis=1)
y = df['target']

In [23]:
# Mean
X_mean = X.mean()
 
# Standard deviation
X_std = X.std()
 
# Standardization
Z = (X - X_mean) / X_std

# Compute covariance matrix
c = Z.cov()

# Compute eigenvalues and eigenvectors
# (eigh is intended for symmetric matrices such as a covariance matrix
# and always returns real values)
eigenvalues, eigenvectors = np.linalg.eigh(c)

# Sort the eigenvalues and the corresponding eigenvectors in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Compute the cumulative explained variance ratio
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Determine the number of principal components
# needed to explain at least 80% of the variance
n_components = np.argmax(explained_var >= 0.80) + 1

# print number of principal components
print("Number of Principal Components =", n_components)

Number of Principal Components = 49
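The manual eigenvalue computation can be cross-checked against scikit-learn, where passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) asks `PCA` to keep just enough components to pass that share of explained variance. The sketch below assumes a small synthetic matrix (`X_toy`, hypothetical) instead of the Otto features, so it only illustrates that the two routes agree.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features (an assumption standing in for the lab data)
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 12))
X_toy = latent @ mixing + 0.5 * rng.normal(size=(500, 12))

# Standardize, as in the manual computation
Z_toy = StandardScaler().fit_transform(X_toy)

# Manual route: eigenvalues of the covariance matrix, in descending order
eigenvalues = np.linalg.eigvalsh(np.cov(Z_toy, rowvar=False))[::-1]
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
n_manual = int(np.argmax(cumulative >= 0.80)) + 1

# scikit-learn route: a float n_components is read as a variance threshold
pca = PCA(n_components=0.80, svd_solver="full").fit(Z_toy)

print("Manual count: ", n_manual)
print("sklearn count:", pca.n_components_)
```

The fitted count is available as `pca.n_components_`, and `pca.explained_variance_ratio_` holds the per-component ratios behind it.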


In [24]:
# Create a dataframe of the principal components

# Import PCA
from sklearn.decomposition import PCA

# Instantiate PCA
n_components = 49
pca = PCA(n_components=n_components)

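# Fit PCA and transform the features
# (note: X is passed in unstandardized here, whereas the component
# count above was derived from the standardized data Z)
principalComponents = pca.fit_transform(X)

# Name the columns PC1, PC2, ..., PC49
df_pca = pd.DataFrame(principalComponents,
                      columns=[f'PC{i + 1}' for i in range(n_components)])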
df_pca

,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC40,PC41,PC42,PC43,PC44,PC45,PC46,PC47,PC48,PC49
0,0.683548,-1.456507,1.411944,-2.680561,-1.613445,-3.989781,-2.832998,-2.918754,-2.510902,-0.760803,...,0.818986,-0.494317,1.176411,-0.641635,-1.096102,-0.155344,1.346502,-0.481683,-0.214463,0.246286
1,-2.645988,-1.893243,-0.276866,3.213357,-3.256368,-1.145315,-1.993133,-0.244933,0.778022,-0.025340,...,0.019169,-0.318193,-0.085530,0.760190,-0.161854,0.253835,0.236596,-0.225338,0.361819,0.573120
2,-1.881297,-3.111402,0.167103,-0.358598,-2.231775,-2.001617,-1.724631,-2.017783,0.086283,1.111060,...,-0.024723,0.269060,0.036751,0.044180,-0.777383,-0.249647,0.117856,-1.058864,0.594294,1.496537
3,5.354253,-0.266834,5.140063,-2.754794,-1.604268,-0.389002,-1.379860,1.973593,0.624973,0.420288,...,9.027724,-4.967802,0.054585,1.404898,-8.510237,-3.669086,1.854414,-2.160291,-1.538034,3.585226
4,-2.948446,-1.601106,-2.180337,1.959822,-2.808223,-1.467113,-2.486020,-0.599538,-0.254935,-0.260538,...,-0.139382,0.379848,-0.117925,0.307644,-0.590505,0.622268,0.877822,0.032856,0.501068,0.350772
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61873,2.246453,3.820398,5.943894,-7.373531,14.608052,-0.915350,1.937525,-7.596933,-6.158874,-3.300357,...,8.131801,8.059145,12.713684,0.813836,-0.340855,3.637284,2.835560,6.354039,-6.612726,-2.776392
61874,-1.929256,-3.265102,0.279949,-0.639577,-0.187086,-2.951185,-0.059637,-2.487794,-0.118592,0.763973,...,0.580190,0.010070,1.107310,0.276667,-1.174522,-0.100865,-1.325581,0.338815,-1.144150,1.643478
61875,4.241452,-7.466896,6.809563,-8.615352,2.796913,-4.236099,0.372613,-8.487954,0.299560,5.484483,...,-1.572760,-0.132036,-1.820427,0.847127,-0.051036,1.891836,1.096517,-0.141154,-0.666354,-1.129437
61876,-0.347051,-2.212910,1.389312,-2.284562,-0.371693,-2.311001,-1.747018,-3.067846,-0.440319,1.020332,...,0.850393,1.538233,1.071178,-0.167398,-1.649595,0.438646,-3.176340,1.019229,-0.836431,-1.029299


## Create a train-test split with a test size of 40%

This is a relatively large dataset, so you can afford to assign 40% of it to the test set. Set the `random_state` to 42.

In [25]:
from sklearn.model_selection import train_test_split

# Split the PCA dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_pca, y, 
                                                    test_size=0.4, random_state=42)

## Create a baseline model

Create your baseline model *in a pipeline setting*. In the pipeline: 

- Your first step will be to reduce your features to the number of principal components that retains 80% of the explained variance (which we determined above)
- Your second step will be to build a basic logistic regression model 

Make sure to fit the model using the training set and test the result by obtaining the accuracy using the test set. Set the `random_state` to 123. 
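A baseline of this shape might look like the sketch below. It is not the lab solution: it runs on synthetic data from `make_classification` (an assumption), the names `pipe_lr`, `X_tr`, and `X_te` are hypothetical, and `max_iter=1000` is added only to help the solver converge on the toy data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic multi-class data (an assumption standing in for the Otto features)
X_toy, y_toy = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4,
                                          random_state=42)

# Step 1: PCA keeping enough components for 80% of the variance
# Step 2: a basic logistic regression
pipe_lr = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000, random_state=123)),
])

# Fit on the training set only, then score on the held-out test set
pipe_lr.fit(X_tr, y_tr)
test_accuracy = pipe_lr.score(X_te, y_te)
print("Test accuracy:", round(test_accuracy, 3))
```

Because PCA sits inside the pipeline, it is fit on the training rows only, so no information from the test set reaches the projection.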

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

## Create pipelines for a linear SVM, a simple decision tree, and a simple random forest classifier

Repeat the above, but now create three different pipelines:
- One for a standard linear SVM
- One for a default decision tree
- One for a random forest classifier
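The three pipelines can share the same first step and differ only in the final estimator. The following is a minimal sketch on synthetic data (an assumption), with hypothetical names such as `classifiers` and `scores`; it pairs a PCA step with each classifier, which is one reading of "repeat the above".

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (an assumption standing in for the Otto features)
X_toy, y_toy = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4,
                                          random_state=42)

# One final estimator per pipeline
classifiers = {
    "svm": SVC(kernel="linear", random_state=123),
    "tree": DecisionTreeClassifier(random_state=123),
    "forest": RandomForestClassifier(random_state=123),
}

# Build, fit, and score each pipeline in turn
scores = {}
for name, clf in classifiers.items():
    pipe = Pipeline([
        ("pca", PCA(n_components=0.80, svd_solver="full")),
        ("clf", clf),
    ])
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the scores in a dictionary makes it easy to compare the three models side by side with the baseline.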

In [30]:
# Check the class distribution of the target in the training set
y_train.value_counts()

Class_2    9666
Class_6    8441
Class_8    5094
Class_3    4863
Class_9    2921
Class_7    1732
Class_5    1654
Class_4    1621
Class_1    1134
Name: target, dtype: int64

In [31]:
# Import libraries
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Standard Linear SVM pipeline
pipe_svm = Pipeline([('ss', StandardScaler()), 
                    ('svc', SVC(kernel='linear', random_state=42))])

# Fit the pipeline on training set
pipe_svm.fit(X_train, y_train)

# ⏰ This cell may take several minutes to run

In [32]:
pipe_svm.score(X_test, y_test)

0.7584033613445378

In [29]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = pipe_svm.predict(X_test)
print("Overall accuracy score", accuracy_score(y_test, y_pred))
print("Overall precision score", precision_score(y_test, y_pred, average='weighted'))
print("Overall recall score", recall_score(y_test, y_pred, average='weighted'))
print("Overall F1-score", f1_score(y_test, y_pred, average='weighted'))

Overall accuracy score 0.7584033613445378
Overall precision score 0.7587977183724836
Overall recall score 0.7584033613445378
Overall F1-score 0.7290679512420039


## Pipeline with grid search

Construct two pipelines with grid search:
- One for random forests (try to have around 40 different models)
- One for the AdaBoost algorithm

### Random Forest pipeline with grid search

In [None]:
# Your code here 
# imports

In [None]:
# Your code here
# ⏰ This cell may take a long time to run!


Use your grid search object along with `.cv_results_` to get the full result overview:
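As an illustration, the sketch below fits a grid of exactly 40 random forest candidates on synthetic data and loads `cv_results_` (note the trailing underscore) into a DataFrame. The grid values, the tiny forests, and names such as `gs_rf` are assumptions chosen so the example runs quickly; they are not recommended settings for the Otto data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic multi-class data (an assumption standing in for the Otto features)
X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=6,
                                   n_classes=3, random_state=42)

pipe_rf = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", RandomForestClassifier(random_state=123)),
])

# 2 * 5 * 2 * 2 = 40 parameter combinations (hypothetical values);
# the "clf__" prefix routes each parameter to the pipeline step named "clf"
param_grid = {
    "clf__n_estimators": [10, 20],
    "clf__max_depth": [2, 3, 4, 5, None],
    "clf__min_samples_leaf": [1, 5],
    "clf__criterion": ["gini", "entropy"],
}

gs_rf = GridSearchCV(pipe_rf, param_grid, cv=3, scoring="accuracy")
gs_rf.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(gs_rf.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
print("Best parameters:", gs_rf.best_params_)
```

Sorting by `rank_test_score` puts the best candidates first; `gs_rf.best_estimator_` is the pipeline refit with the winning parameters.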

In [None]:
# Your code here 

### AdaBoost

In [None]:
# Your code here
# ⏰ This cell may take several minutes to run

Use your grid search object along with `.cv_results_` to get the full result overview:
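For a two-parameter grid, the overview is often easier to read as a table of mean scores. The sketch below assumes synthetic data and a deliberately small, hypothetical AdaBoost grid (`gs_ab`, two values of `n_estimators` by three learning rates), then pivots `cv_results_` into that shape.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic multi-class data (an assumption standing in for the Otto features)
X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=6,
                                   n_classes=3, random_state=42)

pipe_ab = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", AdaBoostClassifier(random_state=123)),
])

# 2 * 3 = 6 candidates (hypothetical values, kept small for speed)
param_grid = {
    "clf__n_estimators": [25, 50],
    "clf__learning_rate": [0.1, 0.5, 1.0],
}

gs_ab = GridSearchCV(pipe_ab, param_grid, cv=3, scoring="accuracy")
gs_ab.fit(X_toy, y_toy)

# One row per learning rate, one column per number of estimators
results_ab = pd.DataFrame(gs_ab.cv_results_)
score_table = results_ab.pivot(index="param_clf__learning_rate",
                               columns="param_clf__n_estimators",
                               values="mean_test_score")
print(score_table.round(3))
print("Best cross-validated accuracy:", round(gs_ab.best_score_, 3))
```

Each parameter in the grid appears in `cv_results_` as a column named `param_<name>`, which is what the pivot relies on.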

In [None]:
# Your code here 

### Level-up (Optional): SVM pipeline with grid search 

As extra level-up work, construct a pipeline with grid search for support vector machines. 
* Make sure your grid isn't too big. You'll see it takes quite a while to fit SVMs with non-linear kernel functions!

In [None]:
# Your code here
# ⏰ This cell may take a very long time to run!

Use your grid search object along with `.cv_results_` to get the full result overview:

In [None]:
# Your code here 

## Note

Note that this solution is only one of many options. The results in the Random Forest and AdaBoost models show that there is a lot of improvement possible by tuning the hyperparameters further, so make sure to explore this yourself!

## Summary 

Great! You've gotten a lot of practice using PCA in pipelines. Which algorithm would you choose, and why?