# Onboarding DS - Part 3

Yay! We are already in the third part of this onboarding. Now, we will continue preparing the data and start clustering our penguins :D <br>
Do not forget that you can always revisit the exploratory data analysis part and try other ways to prepare your data for clustering! None of the steps is set in stone, and going back to previous analyses to reevaluate things is an important part of the process (check out the [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) methodology).

## Packages and data

Here, we will use the same packages we were using before and add a new one: scikit-learn (sklearn). This package is really helpful when we want to build machine learning models. Install it the same way you installed the other packages earlier.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.mixture import GaussianMixture

Let's import our penguins dataset (created in notebook 2) and continue working with it.

In [None]:
penguins = # TO DO

## Data prep (continuation)

Do you remember the histograms we did previously? One thing that is important to notice is the range of values of each feature: the body mass varies from 2700 up to 6300 grams, while the culmen depth can have values between 13 and 21.5 mm. In other words, the features are not on the same scale.

This can be a problem when we want to use clustering models because, in general, these models compare distances between data points, and when the features have different value ranges, one feature may unintentionally gain more weight than another. That is why we should normalize the data! There are many different techniques to do so, and depending on your intentions, you may prefer one over another. In some cases, you are more interested in putting all features in the same range of values; in other cases, you want your data to follow a normal distribution or you just want to deal with smaller values (like the square root of the original ones). We invite you to take a deeper look at this topic ([you can start here :)](https://scikit-learn.org/stable/modules/preprocessing.html))!
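
To make the scale problem concrete, here is a minimal sketch using `StandardScaler` (just one of the possible techniques, not necessarily the one you should pick). The toy values and the column names `body_mass_g` and `culmen_depth_mm` are assumptions made up for illustration, so adapt them to your own dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values in the same ranges as the penguin features (made up for illustration).
toy = pd.DataFrame({
    "body_mass_g": [2700.0, 3500.0, 4200.0, 5000.0, 6300.0],
    "culmen_depth_mm": [13.0, 15.5, 17.0, 19.0, 21.5],
})

# Euclidean distance between the first two rows before rescaling:
# it is driven almost entirely by the body mass.
raw_dist = np.linalg.norm(toy.iloc[0].to_numpy() - toy.iloc[1].to_numpy())

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
toy_sc = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# After rescaling, both features contribute on a comparable scale.
scaled_dist = np.linalg.norm(toy_sc.iloc[0].to_numpy() - toy_sc.iloc[1].to_numpy())

print("distance before rescaling:", raw_dist)
print("distance after rescaling:", scaled_dist)
print(toy_sc.describe())
```

Comparing the two distances shows how much the raw body mass dominates when the features are left on their original scales.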

<div class = 'alert alert-block alert-info'> Task 1: Choose one of the normalization techniques and rescale your features. Save these rescaled values in a new dataframe penguins_sc.<br>
    
Tip: if you choose to use one of the sklearn functions to rescale your data, do not forget to import it!<br>
    
Tip2: notice that removing outliers is important; otherwise, your rescaling will be affected by the extremely high/low values that do not fit the rest of your data.
</div>

In [None]:
# TO DO
# normalize features and save them in
penguins_sc = # TO DO

Since we cannot plot string values in a graph (and the clustering model only accepts numerical inputs), we must treat the `island` and `sex` columns. We can use Label Encoding to transform string categories into numerical ones, but this only makes sense when there is some kind of ordering in the categories. Another solution is One Hot Encoding: it turns each category value into a boolean column that is set to `1` if the value matches the row's actual category and to `0` otherwise.
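
As an illustration of what One Hot Encoding produces, here is a small sketch on a made-up dataframe. It uses `pd.get_dummies`, which is one possible way to do it (sklearn's `OneHotEncoder` is another); the category values below are only illustrative.

```python
import pandas as pd

# Toy dataframe with two categorical columns (values made up for illustration).
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Each category value becomes its own 0/1 column, named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["island", "sex"], dtype=int)
print(encoded)
```

Note that `sex_FEMALE` and `sex_MALE` carry the same information here (one is always the opposite of the other), which is the kind of redundancy you may want to drop, for instance by keeping only one of the two columns.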


<div class = 'alert alert-block alert-info'> Task 2: Since there is no way of ordering the islands or the sex, let's use the One Hot Encoding approach and turn these features into boolean ones.
</div>

In [None]:
# TO DO
# Do the one hot encoding for the island and sex attributes

# you can drop useless and redundant columns here

## Clustering

We are finally getting to the clustering part!

Clustering is one of the applications of unsupervised learning: you do not have labels, so your model must find patterns and ways to group the data by itself. K-Means, GMM (Gaussian Mixture Models) and DBSCAN are examples of clustering algorithms. We invite you to [read more](https://scikit-learn.org/stable/modules/clustering.html#clustering) about them, to find out which other ones exist and to try applying them in this onboarding.

We will guide you through implementing GMM.

For both K-Means and GMM, you need to give the number of clusters you want as input. To find out which number is the best, there are some metrics we can calculate to guide our choice. When we are working with K-Means, we normally use the Silhouette score and inertia. For GMM, BIC and AIC are more common. We calculate those metrics for each number of clusters we are considering, then we pick the number that shows the best results.
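
For the K-Means side, the loop over candidate numbers of clusters can be sketched as follows. The data is synthetic (three well-separated blobs generated with `make_blobs`) and the range of `k` values is an arbitrary choice for illustration, so treat this as a sketch rather than a recipe for the penguins.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups (not the penguins dataset).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = {
        "inertia": km.inertia_,                      # lower means tighter clusters
        "silhouette": silhouette_score(X, labels),   # higher means better separated
    }

# Pick the number of clusters with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(scores)
print("best k according to the silhouette score:", best_k)
```

Inertia always tends to decrease as `k` grows, so it is usually read by looking for an "elbow" in the curve rather than for its minimum.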

For BIC and AIC, the lower the score, the better. We invite you to find out how they are calculated.
[GMM from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) already has built-in `aic` and `bic` methods (check the documentation), so we only need to store these values and compare them afterwards.
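
As a starting point, these are the usual textbook definitions of the two criteria. The symbols are introduced here for illustration: $k$ is the number of free parameters of the model, $n$ the number of data points and $\hat{L}$ the maximized likelihood of the fitted model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}
```

Both reward a good fit (high $\hat{L}$) and penalize complexity (high $k$). Since $\ln(n) > 2$ as soon as $n \geq 8$, BIC penalizes extra parameters more strongly than AIC on any realistically sized dataset, so it tends to favor fewer clusters.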

### GMM

The idea of this algorithm is to group similar instances into the same cluster according to their features (in this case, we hope our clusters will represent the distinct penguin species). It works iteratively and stops when its output converges (the current output is equal or very close to the previous one). As input, you must specify the dataset you are working with, the number of clusters you want to split your data into, the covariance type (which determines the form of the covariance matrices and therefore the degrees of freedom in the clusters' shape), some convergence hyperparameters and other values.
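
Here is a minimal sketch of these inputs and outputs on synthetic data (three blobs from `make_blobs`, not the penguins). The `tol` and `max_iter` values shown are sklearn's defaults, written out only to make the convergence hyperparameters visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated groups in 2 dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

# tol and max_iter control when the iterative fitting stops.
gmm = GaussianMixture(
    n_components=3, covariance_type="full", tol=1e-3, max_iter=100, random_state=0
)
labels = gmm.fit_predict(X)      # hard assignment: one cluster per instance
probs = gmm.predict_proba(X)     # soft assignment: one probability per cluster

print("converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")

# The covariance type changes how many parameters describe the clusters' shape.
shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    model = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    model.fit(X)
    shapes[cov_type] = model.covariances_.shape
print(shapes)
```

The shape of `covariances_` gives a feel for the flexibility of each option: `full` stores one complete matrix per cluster, while `spherical` stores a single variance per cluster.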

<div class = 'alert alert-block alert-info'> Task 3: First of all, choose which features from the penguins dataset you will take into consideration during the clustering (you can pick all of them if you wish) and create a new dataframe <b>data</b> with these features.
</div>

In [None]:
data = # TO DO

<div class = 'alert alert-block alert-info'> Task 4: Adapt the function below to test the range of numbers of clusters you wish. Look up the GMM documentation on scikit-learn and find out what the <b>covariance type</b> hyperparameter is. Try changing it from 'full' to 'tied', 'diag' and 'spherical' and run the aic_bic function again (to make the comparison easier, adapt the function so you can plot all the covariance types simultaneously).
</div>

In [None]:
def aic_bic(data):
    """Fit a GMM for each candidate number of clusters and plot its AIC and BIC."""
    # candidate numbers of clusters to test
    n_components_range = list(range(2, 15))
    aics = []
    bics = []

    for n in n_components_range:
        gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=0)
        gmm.fit(data)
        # for both metrics, the lower the score, the better
        aics.append(gmm.aic(data))
        bics.append(gmm.bic(data))

    # one subplot per metric: AIC on top, BIC below
    fig = make_subplots(rows=2, cols=1)

    fig.add_trace(go.Scatter(
        x=n_components_range,
        y=aics,
        name='AIC'
    ), row=1, col=1)

    fig.add_trace(go.Scatter(
        x=n_components_range,
        y=bics,
        name='BIC'
    ), row=2, col=1)

    fig.update_layout(height=600, width=600, title_text="Choosing number of clusters")
    fig.show()


In [None]:
aic_bic(data)

<div class = 'alert alert-block alert-info'> Task 5: Now, run the GMM with the best hyperparameters you found and plot your results in a 3D scatter plot. Do not hesitate to try GMM with other n_components values. Sometimes, it will be more convenient for you to work with a lower or higher number of clusters, even if it is different from the ideal one according to the graphs above.
</div>

In [None]:
# TO DO
# GMM with chosen hyperparameters

# TO DO
# 3D scatter plot

## Analyzing our clusters

Now that you have chosen the clustering hyperparameters and applied the GMM to our dataset, we can analyze each of the groups in depth. Through this study, we can identify **who** belongs to each cluster.

<div class = 'alert alert-block alert-info'> Task 6: Do you remember the charts we plotted in the previous notebook? Now, try to plot some of them again comparing the properties/characteristics of each cluster. For instance: plot box-plots comparing the distribution of culmen depth and body mass values of the penguins from each cluster.<br>
    
Extra: one interesting chart you may try is a bubble chart: it is a scatter plot in which the size of each point is proportional to the amount of data it represents. We usually use it to compare the clusters (not the instances themselves). Choose 2 numerical features from the penguins dataset and plot a 2D bubble chart.
</div>

In [None]:
# TO DO
# comparison charts
# box-plots

# TO DO
# 2D bubble chart

<div class = 'alert alert-block alert-info'> Task 7: What can you conclude about the penguins? Try to describe the penguins of each of the species you have found (which species is the heaviest, which has the longest culmen, etc.).
</div>

<div class = 'alert alert-block alert-info'> Task 8: Well done! To finish this onboarding, push your changes to the remote repository and open a pull request (PR).
</div>