# Clustering with <a href="https://cran.r-project.org/"><img src="https://cran.r-project.org/Rlogo.svg" style="max-width: 40px; display: inline" alt="R"/></a>

---

The objective of this tutorial is to apply the different concepts studied during the course on clustering to identify groups of wines.

---

## Data exploration

In this tutorial we will study the _wine_ dataset (_wine.txt_).
This dataset includes physico-chemical measurements performed on a sample of $n=600$ wines (red and white) from Portugal. These measurements are complemented by a sensory evaluation of the quality by a set of experts. Each wine is described by the following variables:
- _Quality:_ Wine quality according to experts (“bad”, “medium”, “good”),
- _Type:_ 1 for red wine and 0 for white wine,
- _AcidVol:_ The volatile acid content (in g/dm3 of acetic acid),
- _AcidCitr:_ The citric acid content (in g/dm3),
- _SO2lbr:_ The measurement of free sulfur dioxide (in mg/dm3),
- _SO2tot:_ Total sulfur dioxide measurement (in mg/dm3),
- _Density:_ The density (in g/cm3),
- _Alcohol:_ The alcohol level (in % of Vol.).

### Some important packages

In [None]:
library(mclust)
library(cluster)
library(factoextra)
library(FactoMineR)
library(ppclust)
library(reticulate)
library(ggplot2)
library(reshape)
library(corrplot)
library(gridExtra)
library(circlize)
library(tidyverse)

##### <span style="color:purple">**Todo:** Load the _wine.txt_ dataset and: </span>

- Use the `str()` function to show information about the variables. Are all variables of the appropriate type?
- If not, convert the qualitative variables to factors with the `as.factor` function.
- Rename the levels of the variable _Type_: 1=red and 0=white.
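
As a rough illustration of these three steps, the sketch below works on a small made-up data frame rather than on _wine.txt_. The name `toy` and every value in it are assumptions chosen only for the example.

```r
# Toy data frame standing in for the real file (all values are made up)
toy <- data.frame(
  Quality = c("bad", "good", "medium", "good"),
  Type    = c(1, 0, 0, 1),
  Alcohol = c(9.4, 12.1, 10.5, 11.0)
)

# Inspect the type of each column: Type is read as numeric
str(toy)

# Convert the qualitative columns to factors
toy$Quality <- as.factor(toy$Quality)

# Convert Type and rename its levels in one step: 0 -> white, 1 -> red
toy$Type <- factor(toy$Type, levels = c(0, 1), labels = c("white", "red"))

str(toy)
```

Passing `levels` and `labels` explicitly avoids relying on the default alphabetical or numerical ordering of the levels.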

In [None]:
### TO BE COMPLETED ### 

wine = ...

[...]

head(wine)

In [None]:
# source("solutions/data/load_data.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
* The _Type_ variable is treated as a numeric variable,
* The _Quality_ variable is treated as a character variable. -->


### Exploratory data analysis

In [None]:
library(corrplot)

In [None]:
summary(wine)

##### <span style="color:purple">**Todo:** Descriptive statistics and bivariate analysis: </span>
- Summarize the variables with the `summary` function,
- Draw boxplots of the quantitative variables and analyze the results,
- Describe the qualitative variables graphically (bar plots),
- Analyze correlation between numeric variables.
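
The four steps above can be sketched with base R alone. The example uses the built-in `iris` data as a stand-in for the wine data, so the column selections are assumptions that would need to be adapted.

```r
# Built-in iris data used as a stand-in (not the wine data)
num_vars <- iris[, 1:4]   # quantitative columns
qual_var <- iris$Species  # qualitative column

# 1. Summary of all variables
summary(iris)

# 2. Boxplots of the quantitative variables, side by side
boxplot(num_vars, main = "Quantitative variables")

# 3. Bar plot of the level frequencies of the qualitative variable
barplot(table(qual_var), main = "Species frequencies")

# 4. Pairwise Pearson correlations between the numeric variables
cor_mat <- cor(num_vars)
round(cor_mat, 2)
```

A matrix such as `cor_mat` can also be passed to `corrplot()` for a graphical view.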

In [None]:
### TO BE COMPLETED ### 
# Descriptive statistics of quantitative data


In [None]:
# source("solutions/data/quanti.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * The variables do not have the same overall distribution and variances,
 * Moreover, they are not expressed in the same units (units are not homogeneous). -->

In [None]:
### TO BE COMPLETED ### 
# Descriptive statistics of qualitative data


In [None]:
# source("solutions/data/quali.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * The level frequencies of each qualitative variable are not homogeneous. -->

In [None]:
### TO BE COMPLETED ### 
# Correlation study


In [None]:
# source("solutions/data/correlation.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * Correlation coefficients between numeric variables are relatively low except for _Density_ and _Alcohol_ (-0.68). -->

### Principal Component Analysis

In [None]:
library(ggpubr)  #to get the ggarrange function

##### <span style="color:purple">**Todo:** PCA of wine dataset: </span>

- What impact can the above analyses have on the PCA result?
- Perform a PCA of the _wine_ data (qualitative variables should be specified as _supplementary_ variables) and visualize the wines (individuals) on the first factorial plane (use the _habillage_ parameter to show groups according to the qualitative variables).
- How many clusters of wines can be suggested?
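
A minimal sketch of such a PCA, assuming the `FactoMineR` and `factoextra` packages are installed and using `iris` as a stand-in: its fifth column (`Species`) plays the role of the supplementary qualitative variable.

```r
library(FactoMineR)
library(factoextra)

# Standardized PCA; column 5 is qualitative and declared as supplementary,
# so it does not contribute to the construction of the axes
res_pca <- PCA(iris, scale.unit = TRUE, quali.sup = 5, graph = FALSE)

# Eigenvalues with percentage and cumulative percentage of variance
res_pca$eig

# Individuals on the first factorial plane, coloured by the qualitative variable
fviz_pca_ind(res_pca, habillage = iris$Species, label = "none",
             addEllipses = TRUE)
```

Setting `scale.unit = TRUE` standardizes the variables, which matters when they are expressed in different units.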

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**:
-  The results of the exploratory analysis suggest that variables should be standardized before performing PCA. -->

In [None]:
### TO BE COMPLETED ### 
# PCA of wine data -- Variables

wine2 = wine
wine2[,-c(1,2)] = scale(...)

In [None]:
# source("solutions/data/pca_var.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * According to the barplot of explained variances, we can keep the first 4 principal components (they explain 91.8% of the variance), 
 * The variable _AcidCitr_ is not well represented on the first factorial plane. -->

In [None]:
### TO BE COMPLETED ### 
# PCA of wine data -- Individuals


In [None]:
# source("solutions/data/pca_ind.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * The first factorial plane clearly separates white and red wines,
 * We can also distinguish **3 clusters of wines**. -->

## Clustering with $k$-means

In this part, we will perform $k$-means clustering of the wines using only the quantitative variables. The qualitative variables will be used to interpret the resulting clusters.

##### <span style="color:purple">**Todo:** Clustering with $k=3$: </span>

- Using the `kmeans()` function, cluster the wines. Numeric variables should be standardized beforehand.
- Use the `fviz_cluster()` function to visualize the clusters on the first factorial plane of the PCA.
- Analyze the links between clusters and qualitative variables.
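
The workflow can be sketched as follows on the `iris` stand-in data (an assumption; the wine data would be handled the same way once its numeric columns are scaled). The seed and `nstart = 25` are illustrative choices, not values prescribed by the tutorial.

```r
library(factoextra)

set.seed(123)               # k-means starts from random centers
X <- scale(iris[, 1:4])     # standardize the numeric variables

# k-means with 3 clusters; several random starts reduce the risk
# of ending in a poor local optimum
km <- kmeans(X, centers = 3, nstart = 25)

# Clusters drawn on the first two principal components
fviz_cluster(km, data = X, geom = "point")

# Cross-tabulate clusters against a qualitative variable
tab <- table(cluster = km$cluster, species = iris$Species)
tab
```

A row of `tab` dominated by a single column indicates a cluster that matches one level of the qualitative variable.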

In [None]:
### TO BE COMPLETED ### 
# k-means, with k=3

reskmeans = kmeans(...)

In [None]:
# source("solutions/kmeans/kmeans.r", echo=TRUE)

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Type of wine


In [None]:
# source("solutions/kmeans/clust_vs_type.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
 * Cluster 2 is made of red wines,
 * Cluster 1 is made of white wines with high levels of alcohol,
 * Cluster 3 is made of white wines with low levels of alcohol. -->

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Quality of the wine


In [None]:
# source("solutions/kmeans/clust_vs_quality.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
- The clusters are not linked to the wine quality. -->

##### <span style="color:purple">**Todo:** Determine the best value of $k$: </span>

- using the elbow method
- using the silhouette score

**Note**: _One can use the `fviz_nbclust` function of the `factoextra` package_
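
To make explicit what the two criteria measure, the sketch below computes them by hand on the `iris` stand-in data, using only `kmeans()` and `cluster::silhouette()`. The range `2:8` and the seed are arbitrary assumptions.

```r
library(cluster)

set.seed(1)
X  <- scale(iris[, 1:4])
d  <- dist(X)
ks <- 2:8

# Elbow criterion: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)

# Silhouette criterion: average silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(X, centers = k, nstart = 20)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
plot(ks, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")

# Value of k with the largest average silhouette width
ks[which.max(avg_sil)]
```

The elbow is read visually as the point where `wss` stops decreasing sharply, whereas the silhouette criterion simply takes the maximum.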

In [None]:
### TO BE COMPLETED ### 
# Elbow method with the total within-cluster sum of squares as metric

fviz_nbclust(...)

In [None]:
# source("solutions/kmeans/elbow_wss.r", echo=TRUE)

In [None]:
### TO BE COMPLETED ### 
# Silhouette method with the average silhouette width as metric

fviz_nbclust(...)

In [None]:
# source("solutions/kmeans/elbow_silhouette.r", echo=TRUE)

In [None]:
reskmeans = kmeans(wine2[,-c(1,2)], centers=5) 
sil = silhouette(reskmeans$cluster, dist(wine2[, -c(1:2)]))
fviz_silhouette(sil)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: 
- The Elbow method suggests 3 as the optimal number of clusters,
- The Silhouette method suggests 4 as the optimal number of clusters, but the average silhouette values are close for 3 and 4 clusters.  -->

## Clustering with CAH

In this section, we will perform hierarchical agglomerative clustering (CAH) to carry out the same analysis as in the previous section.

##### <span style="color:purple">**Todo**: Use the `hclust` function to perform a hierarchical clustering of the wine data</span>

- Test the different types of linkage: _single_, _complete_ and _average_,
- Graphically, compare the associated dendrograms, and comment on the results.
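
A minimal base-R sketch of the comparison, again on the `iris` stand-in data with a Euclidean distance (both are assumptions):

```r
X <- scale(iris[, 1:4])
d <- dist(X)                      # Euclidean distances between individuals

# Same distance matrix, three different linkage criteria
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Dendrograms side by side
op <- par(mfrow = c(1, 3))
plot(hc_single,   labels = FALSE, main = "Single linkage")
plot(hc_complete, labels = FALSE, main = "Complete linkage")
plot(hc_average,  labels = FALSE, main = "Average linkage")
par(op)
```

Single linkage tends to produce chained, unbalanced trees, while complete and average linkage usually give more compact groups, which is worth checking on the dendrograms.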

In [None]:
### TO BE COMPLETED ### 

# Clustering
hclustsingle = hclust(...)
hclustcomplete = hclust(...)
hclustaverage = hclust(...)

# Dendrograms visualization
fviz_dend(...)

In [None]:
# source("solutions/cah/cah.r", echo=TRUE)

##### <span style="color:purple">**Todo:** Find the appropriate number of clusters with `hclustcomplete` using both methods (_wss_ and _silhouette_)</span>
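
Both criteria can be computed by hand for a hierarchical clustering by cutting the tree at several values of $k$. The sketch below does so on the `iris` stand-in data; the helper `wss_of()` is a hypothetical name introduced only for this example.

```r
library(cluster)

X  <- scale(iris[, 1:4])
d  <- dist(X)
hc <- hclust(d, method = "complete")
ks <- 2:8

# Within-cluster sum of squares of a given partition (hypothetical helper)
wss_of <- function(cl) {
  sum(sapply(split(as.data.frame(X), cl),
             function(g) sum(scale(g, scale = FALSE)^2)))
}

# Cut the same tree at each k and evaluate both criteria
wss     <- sapply(ks, function(k) wss_of(cutree(hc, k = k)))
avg_sil <- sapply(ks, function(k) mean(silhouette(cutree(hc, k = k), d)[, "sil_width"]))

plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
plot(ks, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
```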

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/cah/cah_nb.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**:
- Both methods suggest 3 as the optimal number of clusters. -->

##### <span style="color:purple">**Todo:** With the results of `hclustcomplete`, use the `cutree()` function to get a clustering with 3 clusters of wines.</span>
- Explain clusters with qualitative variables.
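
As an illustration on the `iris` stand-in data (an assumption), cutting a complete-linkage tree into 3 clusters and crossing the result with a qualitative variable looks like this:

```r
X  <- scale(iris[, 1:4])
hc <- hclust(dist(X), method = "complete")

# Cut the dendrogram so that exactly 3 clusters remain
cl3 <- cutree(hc, k = 3)

# Cluster sizes, then clusters against the qualitative variable
table(cl3)
table(cluster = cl3, species = iris$Species)
```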

In [None]:
### TO BE COMPLETED ### 

ClassK3 = cutree(...)

In [None]:
# source("solutions/cah/cah_cut.r", echo=TRUE)

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Type of wine


In [None]:
# source("solutions/cah/clust_vs_type.r", echo=TRUE)

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Quality of wine


In [None]:
# source("solutions/cah/clust_vs_quality.r", echo=TRUE)

## Clustering with Gaussian Mixture

In this part, we will carry out the same analysis as above with the GMM method.

##### <span style="color:purple">**Todo:** Perform clustering with the `Mclust` function by using the _BIC_ criterion. </span>
- Select the best model and visualize the obtained clusters.
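
A minimal sketch of BIC-based model selection, assuming the `mclust` package is installed and using `iris` as stand-in data. The range `G = 1:5` is an arbitrary choice for the example.

```r
library(mclust)

X <- scale(iris[, 1:4])

# Fit Gaussian mixtures for 1 to 5 components and all available
# covariance structures; the model with the best BIC is kept
res_bic <- Mclust(X, G = 1:5)
summary(res_bic)

# Selected number of components and covariance model
res_bic$G
res_bic$modelName

# BIC curves for all candidate models, then the selected partition
plot(res_bic, what = "BIC")
plot(res_bic, what = "classification")
```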

In [None]:
wine3 = wine2[, -c(1, 2)]

In [None]:
### TO BE COMPLETED ### 
# GMM with BIC

resBICall = Mclust(...)
summary(resBICall)

# --- #

fviz_mclust(...)

In [None]:
# source("solutions/gmm/gmm_bic.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**
- We select $G=9$. -->

In [None]:
### TO BE COMPLETED ### 
# Best model with BIC


In [None]:
# source("solutions/gmm/gmm_best_bic.r", echo=TRUE)

##### <span style="color:purple">**Todo:** Perform clustering with the `Mclust` function by using the _ICL_ criterion.</span>
- Select the best model and visualize the obtained clusters.
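
The sketch below shows one possible way to select a model with ICL and refit it, assuming `mclust` is installed and using `iris` as stand-in data. It relies on the ICL table having the numbers of components as row names and the model names as column names, which should be checked on the printed output.

```r
library(mclust)

X <- scale(iris[, 1:4])

# ICL for 1 to 5 components and all available covariance structures
icl <- mclustICL(X, G = 1:5)
summary(icl)

# Locate the (G, model) pair with the largest ICL value
icl_mat    <- unclass(icl)
best       <- which(icl_mat == max(icl_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
best_G     <- as.integer(rownames(icl_mat)[best["row"]])
best_model <- colnames(icl_mat)[best["col"]]

# Refit only the selected model to obtain the partition
res_icl <- Mclust(X, G = best_G, modelNames = best_model)
table(res_icl$classification)
```

ICL adds an entropy penalty to BIC, so it tends to favour well-separated clusters and often selects fewer components.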

In [None]:
### TO BE COMPLETED ### 
# GMM with ICL

resICLall = mclustICL(...)
summary(resICLall)

In [None]:
# source("solutions/gmm/gmm_icl.r", echo=TRUE)

In [None]:
### TO BE COMPLETED ### 
# Best model with ICL


In [None]:
# source("solutions/gmm/gmm_best_icl.r", echo=TRUE)

##### <span style="color:purple">**Question:** Which _GMM_ model to choose?</span>

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: The BIC criterion selects a model with a high number of clusters.
It is preferable to choose the model obtained with the _ICL_ criterion. -->

##### <span style="color:purple">**Todo:** Analyze the clusters using the qualitative variables.</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/gmm/quali.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: the obtained clusters are linked to the type of wines but not to their quality. -->

## Comparison of clustering algorithms

The purpose of this last section is to compare the different results we obtained previously.

### $k$-means _vs._ CAH

In [None]:
library(cvms)

In [None]:
# Recall that the best models for these algorithms are:

reskmeans = kmeans(wine2[,-c(1,2)], centers=3)
ClassK3 = cutree(hclustcomplete, 3)

##### <span style="color:purple">**Todo:** Use the `ggarrange` function and `fviz_pca` to visualize the clusters of these models on the principal component plane</span>
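
One way to build such a side-by-side view is sketched below on the `iris` stand-in data, assuming `FactoMineR`, `factoextra`, `ggpubr` and `ggplot2` are installed. It uses `fviz_pca_ind()`, the individuals-only variant of the `fviz_pca` family, and the object names are illustrative.

```r
library(FactoMineR)
library(factoextra)
library(ggpubr)
library(ggplot2)

set.seed(123)
X <- scale(iris[, 1:4])

# Two partitions of the same individuals
km_cl <- kmeans(X, centers = 3, nstart = 25)$cluster
hc_cl <- cutree(hclust(dist(X), method = "complete"), k = 3)

# One PCA, reused for both plots so that the axes are identical
res_pca <- PCA(iris[, 1:4], scale.unit = TRUE, graph = FALSE)

p_km <- fviz_pca_ind(res_pca, habillage = as.factor(km_cl), label = "none") +
  ggtitle("k-means")
p_hc <- fviz_pca_ind(res_pca, habillage = as.factor(hc_cl), label = "none") +
  ggtitle("Hierarchical clustering")

# Both plots on a single row
ggarrange(p_km, p_hc, ncol = 2)
```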

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/compare/cah_vs_kmeans.r", echo=TRUE)

##### <span style="color:purple">**Todo:** Analyze the result obtained with the `table()` function</span>
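
A contingency table of two partitions can be read as sketched below, here on the `iris` stand-in data (an assumption). The dominant-cell ratio at the end is only an informal summary added for illustration.

```r
set.seed(123)
X <- scale(iris[, 1:4])

km_cl <- kmeans(X, centers = 3, nstart = 25)$cluster
hc_cl <- cutree(hclust(dist(X), method = "complete"), k = 3)

# Rows: k-means clusters, columns: hierarchical clusters.
# Cluster labels are arbitrary, so agreement shows up as one dominant
# cell per row, not necessarily on the diagonal
conf <- table(kmeans = km_cl, hclust = hc_cl)
conf

# Share of individuals falling in the dominant cell of their row
agreement <- sum(apply(conf, 1, max)) / sum(conf)
agreement
```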

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/compare/cah_vs_kmeans_conf.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: The obtained clusters with these models are not similar. -->

### $k$-means _vs._ GMM

##### <span style="color:purple">**Todo:** Do the same analysis as for $k$-means _vs._ CAH</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/compare/gmm_vs_kmeans.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: There is a significant similarity between the clusters obtained by GMM and $k$-means. -->

### CAH _vs._ GMM

##### <span style="color:purple">**Todo:** Do the same analysis as for $k$-means _vs._ CAH</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/compare/cah_vs_gmm.r", echo=TRUE)

<span style="color:teal ">[Solution]</span>

<!-- **Interpretation**: Same conclusion as for $k$-means _vs._ CAH. -->