![](../src/logo.svg)

**© Jesús López**

Ask him any question on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

<a href="https://colab.research.google.com/github/jsulopz/resolving-machine-learning/blob/main/06_Principal%20Component%20Analysis%20%28PCA%29/06_dimensionality-reduction-pca_session_solution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# 06 | Principal Component Analysis (PCA)

## Chapter Importance

We used just two of the seven variables available in the whole DataFrame.

We could have computed a better cluster model by giving the Machine Learning model more information. Nevertheless, it would have been **harder to plot seven variables on seven axes in a single graph**.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

- Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for **dimensionality reduction** techniques. Specifically, Principal Component Analysis (PCA).

## [ ] Load the Data

Imagine for a second that you are the president of the United States of America and you are considering creating campaigns to reduce **car accidents**.

You won't create 51 different TV campaigns, one for each of the **States of the USA** (rows). Instead, you will see which States behave similarly and cluster them into 3 groups, based on the variation across their features (columns).

```python
import seaborn as sns #!

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes
```

| abbrev | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses |
|:--|--:|--:|--:|--:|--:|--:|--:|
| AL | 18.8 | 7.332 | 5.640 | 18.048 | 15.040 | 784.55 | 145.08 |
| AK | 18.1 | 7.421 | 4.525 | 16.290 | 17.014 | 1053.48 | 133.93 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| WI | 13.8 | 4.968 | 4.554 | 5.382 | 11.592 | 670.31 | 106.62 |
| WY | 17.4 | 7.308 | 5.568 | 14.094 | 15.660 | 791.14 | 122.04 |


> Check [this website](https://www.kaggle.com/fivethirtyeight/fivethirtyeight-bad-drivers-dataset/) to understand what each variable in this dataset measures.

## Data Preprocessing

## k-Means Model in Python
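
The steps named in the subsections below (import, instantiate, fit, predict, store the predictions) might look roughly like the following sketch. It is an illustration under assumptions, not the session's solution: `df_sample` holds only the four states printed earlier instead of the full 51-row dataset, the data is standardized first with `StandardScaler`, and the names `df_sample`, `model_km`, `df_pred` and the column `cluster` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumption: a four-row sample (the states shown in the output above),
# used so the sketch runs without downloading the full dataset.
df_sample = pd.DataFrame(
    data={
        'total': [18.8, 18.1, 13.8, 17.4],
        'speeding': [7.332, 7.421, 4.968, 7.308],
        'alcohol': [5.640, 4.525, 4.554, 5.568],
        'not_distracted': [18.048, 16.290, 5.382, 14.094],
        'no_previous': [15.040, 17.014, 11.592, 15.660],
        'ins_premium': [784.55, 1053.48, 670.31, 791.14],
        'ins_losses': [145.08, 133.93, 106.62, 122.04],
    },
    index=pd.Index(['AL', 'AK', 'WI', 'WY'], name='abbrev'),
)

# Standardize so that ins_premium (in the hundreds) does not dominate
# the distances that k-means relies on.
scaled = StandardScaler().fit_transform(df_sample)

# Instantiate the class with 3 groups, then fit and predict.
model_km = KMeans(n_clusters=3, n_init=10, random_state=42)
model_km.fit(scaled)
pred = model_km.predict(scaled)

# Keep the original DataFrame untouched; store predictions in a copy.
df_pred = df_sample.copy()
df_pred['cluster'] = pred
print(df_pred[['total', 'ins_premium', 'cluster']])
```

With only four rows the groups are not meaningful; the point is the sequence of calls, which stays the same on the full `df_crashes`.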

### Import the Class

### Instantiate the Class

### Fit the Model

### Calculate Predictions

### Create a New DataFrame for the Predictions

### Create a New Column for the Predictions

### Visualize the Model

### Model Interpretation

## [ ] Grouping Variables with `PCA()`

You need to group the original variables of the `DataFrame` into components so that the groups are clearly separated from each other when we visualize them:

![](src/pca.png)

### Transform Data to Components

`PCA()` is another technique used to transform data.

How has the data been manipulated so far?

1. Original Data `df_crashes`

```python
df_crashes
```

| abbrev | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses |
|:--|--:|--:|--:|--:|--:|--:|--:|
| AL | 18.8 | 7.332 | 5.640 | 18.048 | 15.040 | 784.55 | 145.08 |
| AK | 18.1 | 7.421 | 4.525 | 16.290 | 17.014 | 1053.48 | 133.93 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| WI | 13.8 | 4.968 | 4.554 | 5.382 | 11.592 | 670.31 | 106.62 |
| WY | 17.4 | 7.308 | 5.568 | 14.094 | 15.660 | 791.14 | 122.04 |


2. Normalized Data `df_scaled`

```python
df_scaled
```

| abbrev | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses |
|:--|--:|--:|--:|--:|--:|--:|--:|
| AL | 0.737446 | 1.168148 | 0.439938 | 1.002301 | 0.277692 | -0.580083 | 0.430514 |
| AK | 0.565936 | 1.212695 | -0.211311 | 0.608532 | 0.807258 | 0.943258 | -0.022900 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| WI | -0.487627 | -0.015114 | -0.194372 | -1.834714 | -0.647305 | -1.227190 | -1.133459 |
| WY | 0.394425 | 1.156135 | 0.397884 | 0.116657 | 0.444019 | -0.542754 | -0.506406 |


3. Principal Components Data `df_pca` (now)
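
The three stages listed above can be sketched end to end as follows. This is a minimal illustration, assuming the normalization is a `StandardScaler` and that two components are kept (neither is stated in this chapter). `df_sample` is a hypothetical four-row stand-in for `df_crashes`, so the scaled values will not match the 51-state table above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Original data. Assumption: only the four states printed above.
df_sample = pd.DataFrame(
    data={
        'total': [18.8, 18.1, 13.8, 17.4],
        'speeding': [7.332, 7.421, 4.968, 7.308],
        'alcohol': [5.640, 4.525, 4.554, 5.568],
        'not_distracted': [18.048, 16.290, 5.382, 14.094],
        'no_previous': [15.040, 17.014, 11.592, 15.660],
        'ins_premium': [784.55, 1053.48, 670.31, 791.14],
        'ins_losses': [145.08, 133.93, 106.62, 122.04],
    },
    index=pd.Index(['AL', 'AK', 'WI', 'WY'], name='abbrev'),
)

# 2. Normalized data: every column gets mean 0 and standard deviation 1.
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_sample),
    index=df_sample.index,
    columns=df_sample.columns,
)

# 3. Principal components data: seven columns compressed into two.
# Assumption: two components, so the result fits a 2D scatter plot.
pca = PCA(n_components=2)
df_pca = pd.DataFrame(
    pca.fit_transform(df_scaled),
    index=df_sample.index,
    columns=['PC1', 'PC2'],
)
print(df_pca)
```

Each state keeps its row, but it is now described by two component scores instead of seven variables.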

### Visualize Components & Clusters

## [ ] Explained Variance Ratio
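
As a hedged sketch of what this section is about: after fitting, scikit-learn's `PCA` exposes `explained_variance_ratio_`, the share of the total variance captured by each component. The array `X` below is an assumption (the four states shown earlier, not the full dataset), and `ratios` and `cumulative` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: four sample rows (AL, AK, WI, WY) with the seven variables
# total, speeding, alcohol, not_distracted, no_previous, ins_premium, ins_losses.
X = np.array([
    [18.8, 7.332, 5.640, 18.048, 15.040, 784.55, 145.08],
    [18.1, 7.421, 4.525, 16.290, 17.014, 1053.48, 133.93],
    [13.8, 4.968, 4.554, 5.382, 11.592, 670.31, 106.62],
    [17.4, 7.308, 5.568, 14.094, 15.660, 791.14, 122.04],
])
X_scaled = StandardScaler().fit_transform(X)

# Keep every available component to see how the variance is distributed.
pca = PCA()
pca.fit(X_scaled)

ratios = pca.explained_variance_ratio_   # one share per component
cumulative = ratios.cumsum()             # running total of those shares

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f'PC{i}: {r:.3f} of the variance, {c:.3f} cumulative')
```

If the first two shares add up to most of the variance, a 2D plot of the components loses little information.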

## [ ] Relationship between Original Variables & Components

### Loading Vectors

### Correlation Matrix

### [ ] Calculating One PCA Value
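
One way to see where a single component value comes from: it is the dot product between a state's scaled (and centered) row and the weights of that component, which scikit-learn stores in `pca.components_`. The sketch below checks this for the first state and the first component. The data is an assumption (four sample rows from the table shown earlier), and the names `loadings_pc1` and `manual_pc1` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: four sample rows (AL, AK, WI, WY), seven variables each.
X = np.array([
    [18.8, 7.332, 5.640, 18.048, 15.040, 784.55, 145.08],
    [18.1, 7.421, 4.525, 16.290, 17.014, 1053.48, 133.93],
    [13.8, 4.968, 4.554, 5.382, 11.592, 670.31, 106.62],
    [17.4, 7.308, 5.568, 14.094, 15.660, 791.14, 122.04],
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Weights of the first component: one number per original variable.
loadings_pc1 = pca.components_[0]

# PCA centers the data internally; after StandardScaler the mean is
# already ~0, but subtracting pca.mean_ keeps the formula exact.
row_al = X_scaled[0]
manual_pc1 = np.dot(row_al - pca.mean_, loadings_pc1)

print('manual :', manual_pc1)
print('sklearn:', scores[0, 0])
```

Both printed numbers should agree up to floating-point error, which shows that each component is a weighted sum of the scaled original variables.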

## [ ] PCA & Cluster Interpretation

### Biplot

## Conclusion

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.