# Wine Analysis

## Introduction

We have a dataset covering the chemical composition of wines from three different cultivars grown in the same region of Italy.

These chemicals are:

- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline

We will see if we can find any interesting insights in it by applying exploratory analysis and clustering techniques.

## Pre-Analysis

Let's say that we are neither wine nor chemistry experts. In that case, it is useful to gather some background on the attributes we are working with, so that we have an idea of what we could expect to find.

### Alcohol
- **Description:** Represents the percentage of alcohol in the wine.
- **Typical Range:** Varies from around 8% to 15%.


### Malic Acid
- **Description:** Organic acid present in grapes. Affects acidity and can influence flavor.
- **Typical Range:** In the range of 0.1 to 5 g/L.



### Ash
- **Description:** Describes the total amount of minerals present in the wine after burning. Can be indicative of wine quality.
- **Typical Range:** Can vary, but typically found in the range of 1.5 to 3 g/L.



### Alcalinity of Ash
- **Description:** Measures the amount of alkali in terms of carbonate equivalent. Related to wine acidity.
- **Typical Range:** Common values are in the range of 10 to 30 mEq/L.



### Magnesium
- **Description:** Concentration of magnesium in the wine.
- **Typical Range:** Can vary, but typical values range between 70 and 162 mg/L.


### Total Phenols
- **Description:** Represents the total concentration of phenolic compounds in the wine, including antioxidants.
- **Typical Range:** Concentrations can vary, but red wines, in particular, may have values in the range of 100 to 300 mg/L.



### Flavanoids
- **Description:** Antioxidant compounds contributing to wine structure, flavor, and color.
- **Typical Range:** Concentrations can vary, but typical values are between 0.5 and 5 mg/L.



### Nonflavanoid Phenols
- **Description:** Another group of phenolic compounds excluding flavonoids.
- **Typical Range:** Concentrations can vary, but typical values are between 0.1 and 1.5 mg/L.



### Proanthocyanins
- **Description:** Antioxidant compounds contributing to astringency and flavor.
- **Typical Range:** Concentrations can vary, but typical values are between 0.5 and 3 mg/L.



### Color Intensity
- **Description:** Measures the intensity of the wine color.
- **Typical Range:** Red wines often have higher values, in the range of 1 to 15.



### Hue
- **Description:** Refers to the color tone of the wine.
- **Typical Range:** Typical values can be in the range of 0.5 to 1.5.



### OD280/OD315 of Diluted Wines
- **Description:** The ratio of optical density at 280 nm to 315 nm. Provides information about wine color concentration and clarity.
- **Typical Range:** As a ratio it has no unit; typical values may be in the range of 1 to 4.



### Proline
- **Description:** A measure of proline concentration, an amino acid, in the wine.
- **Typical Range:** Concentrations can vary, but typical values are between 300 and 1680 mg/L.


#### General Observations:

Based on this information, we could expect some attributes to be closely correlated:

- The ones that explicitly refer to or affect the color: Hue, OD280/OD315, Color Intensity
- The ones that relate to the acidity of each wine: Ash, Alcalinity of Ash, Malic Acid
- The ones that could tell us something about the antioxidants in the wine: Total Phenols, Flavanoids, Nonflavanoid Phenols

Now, let's get started.

## Analysis

### Import Libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score

### Load the data

In [5]:
#API request

url = "http://127.0.0.1:5000/download_csv"

response = requests.get(url)

if response.status_code == 200:
    print("The dataset was correctly downloaded")
else:
    print(f"There was an error trying to fetch the data: {response.status_code}")

The dataset was correctly downloaded


In [None]:
path = "data/"
dataset = "wine-clustering.csv"
df = pd.read_csv(path+dataset, sep=",")

In [None]:
df.head() #check that it was correctly loaded

### Data Processing

In [None]:
df.shape #(178, 13)

In [None]:
df.columns

In [None]:
df.info()

It seems that the data set has no null values and all the attributes have numeric data types.

As we have no categorical variables or strings, there is no need to check for misspelled words or encoding problems.

The column names are representative of what they contain. The values in their rows, at least based on the previous `.head()` output, seem to fall within the expected ranges (except for `Total_Phenols`, but we can assume that it uses a different unit of measurement).



However, it would be convenient to normalize the values for when we reach the clustering section.
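A minimal sketch of what that normalization does, assuming a small made-up frame (the values below are illustrative and not taken from the dataset). `StandardScaler` subtracts each column's mean and divides by its standard deviation, so columns measured in very different units end up on a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two columns on very different scales.
toy = pd.DataFrame({
    "Alcohol": [12.0, 13.0, 14.0, 13.5],
    "Proline": [500.0, 900.0, 1300.0, 1100.0],
})

scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# After scaling, every column has mean 0 and (population) standard deviation 1.
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

Without this step, a distance-based method would be dominated by the column with the largest numeric range (here `Proline`).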

### Exploratory Analysis

Let's start with a quick description of the ranges our values actually fall in and how they are distributed.

In [None]:
stats = df.describe()

We can do a box plot for each chemical to get a better view

In [None]:
def boxplot_per_chemical(data, title=""):
    fig = plt.figure(figsize=(16, 14))
    for i, column in enumerate(data.columns):
        plt.subplot(4, 4, i + 1)
        sns.boxplot(x=data[column])
        # Compute the mean from the data being plotted (not from the global `stats`),
        # so the line stays correct when the function is called with a filtered frame.
        mean_value = data[column].mean()
        plt.axvline(x=mean_value, color="red", linestyle="--", label="Mean value")
        plt.legend(loc="upper right")
        plt.title(f"{column}'s boxplot")

    fig.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()

In [None]:
boxplot_per_chemical(df, "Box plot per chemical")

We can observe a few things:

- We have some outliers. Depending on which clustering method we use, they may or may not have a notable impact on the results. With a method such as KMeans, which iteratively moves each centroid to the mean of the points assigned to it, they could give us unexpected results, since an outlier can shift the center significantly. With others such as DBSCAN, which assigns points that are close to each other to the same cluster, they would not be a problem. However, as we know that the wines come from three different cultivars, one would be tempted to use KMeans with K=3. In the clustering section we will see how to proceed.

- In the case of Alcohol, the distribution seems symmetric. The median is in the middle of the box, and the central 50% of the wines have values between approximately 12.5 and 13.6.

- Malic Acid doesn't have a symmetric distribution. Half of the wines have values below approximately 2, and almost 75% of them have values below 3. Since the upper whisker extends to a bit above 5, we can say that its distribution is concentrated at lower values of this chemical. It will be worth keeping this in mind when we explore this chemical's distribution, since the outliers are the wines with the highest values.

- Ash has a relatively symmetric boxplot, with a few outliers on each side. The body of the box looks narrower than for the previous chemicals, but this could be because the outliers compress the plot. 50% of the rows have values between approximately 1.5 and 2.6.

- The Ash Alcalinity boxplot looks very similar to the previous one; it even has the same number of outliers: one at the bottom and two at the top. The two attributes look closely related, as we suspected.

- Magnesium's box sits slightly to the left, and all of its outliers are on the right of the plot. They are the wines with the highest values, but there are not many of them.

- Total Phenols looks fairly symmetric and doesn't have any outliers. The boxes of the Flavanoids and Nonflavanoid Phenols boxplots look similar to it, but are slightly shifted toward lower values. They also have no outliers.

- Proanthocyanins is centered, but it has two outliers on the right.

- Color Intensity is shifted to the left. We can easily see that approximately 75% of the wines have values below a bit more than 6, but there are also outliers with values higher than 10; there seem to be four of them. This is similar to what happens with Malic Acid.

- We might have expected the Hue boxplot to be similar to the Color Intensity one, but it is very symmetric and has only one outlier. It looks more similar to the OD280 one, although that one is wider. The OD280 median is also not centered in the box; it is shifted to the right. We could expect a bit more than half of the wines to have an OD280 value greater than 2.6.

- Lastly, Proline. It clearly tends toward lower values, as 75% of the wines have less than 1000 units of it. Even though the remaining 25% have values between 1000 and approximately 1700, it doesn't have any outliers.

We saw that some of the chemicals have a box shifted to one side while also presenting outliers on the other. It would be interesting to see whether this happens because some wines have considerably lower values of these chemicals and others considerably higher ones, forming two groups.

In [None]:
def kdeplot_per_chemical(data, title=""):
    fig = plt.figure(figsize=(16, 14))
    for i, column in enumerate(data.columns):
        plt.subplot(4, 4, i + 1)
        sns.histplot(x=data[column], bins=35, kde=True)
        median_value = data[column].median()
        plt.axvline(x=median_value, color="red", linestyle="--", label="Median value")
        plt.legend(loc="upper right",)
        plt.title(f'{column}´s KDE and Histogram')
        del median_value
    fig.suptitle(title, fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
kdeplot_per_chemical(df, "KDE-Histogram per chemical")

The chemicals with a clearly left-shifted boxplot and some outliers on the right (Malic Acid, Magnesium, Proanthocyanins and Color Intensity) seem to simply have a distribution concentrated at lower values. The outliers do appear to be atypical values, so we can drop them.

In [None]:
def IQR_outliers_range(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    bottom = Q1 - 1.5 * IQR
    top = Q3 + 1.5 * IQR

    return bottom, top

In [None]:
outliers_btm_filters, outliers_upper_filters = IQR_outliers_range(df)

In [None]:
outliers_btm_filters

In [None]:
outliers_upper_filters

In [None]:
df_2 = df.copy()  # this will be our df without the outliers

# Keep only the rows that fall inside the IQR bounds for every column.
# The bounds were computed once, on the original df.
for column in df.columns:
    mask = (df_2[column] >= outliers_btm_filters[column]) & (df_2[column] <= outliers_upper_filters[column])
    df_2 = df_2[mask]

df_2.head()

In [None]:
len_before = len(df)
len_now = len(df_2)

print(f'Rows before the filter: {len_before}, rows after the filter: {len_now}')
print(f'We retained: {len_now*100/len_before:.2f} % of the data')
del len_before, len_now

In [None]:
df_2.describe()

In [None]:
boxplot_per_chemical(df_2, "Boxplots per chemical (outliers removed)")

In [None]:
kdeplot_per_chemical(df_2, "KDE-Histogram per chemical (outliers removed)")




Something more interesting happens with the Total Phenols, Flavanoids and OD280 plots: their KDE curves seem to have two slightly different peaks, so we may be able to separate our wines into two types based on this. Even after removing the outliers, the histograms and KDEs look very similar and keep that bimodal shape.

But first, let's see if we can quantify any correlations between our variables. We can use the correlation matrix for this and see whether there is a linear relation between pairs of columns.

In [None]:
corr_matrix = df_2.corr()

fig = plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(corr_matrix, vmin=-1, vmax=1, annot=True, linewidths=.5, cmap='coolwarm')

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12)
plt.show()
del heatmap, corr_matrix


In descending order, the strongest correlations are between:

- Total Phenols - Flavanoids: 0.88
- OD280 - Flavanoids: 0.78
- Proanthocyanins - Flavanoids: 0.74
- OD280 - Total Phenols: 0.70
- Proline - Alcohol: 0.66
- Total Phenols - Proanthocyanins: 0.65

This list only considers the pairs with an absolute value greater than 0.60 (there are others that are also noticeable).
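Reading the strongest pairs off a heatmap by eye is error-prone, so here is a hedged sketch of how such a list could be extracted programmatically. The helper name `top_correlated_pairs` and the synthetic frame are assumptions made for illustration; with the real data one would pass `df_2` instead.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(data, threshold=0.6):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = data.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]  # also discards any remaining NaN entries
    return pairs.sort_values(key=np.abs, ascending=False)

# Synthetic example: "b" is built from "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

strong_pairs = top_correlated_pairs(toy)
print(strong_pairs)
```

Sorting by absolute value keeps strong negative correlations in the list as well, which matches the "absolute value greater than 0.60" criterion used above.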

Now we can plot them and see what we can find out.

In [None]:
def scatter_plot_two_chemicals(data, x_column, y_column):
    fig, ax = plt.subplots()
    x = data[x_column]
    y = data[y_column]
    sns.scatterplot(x=x, y=y)
    ax.set_xlabel(x_column)
    ax.set_ylabel(y_column)
    fig.legend()
    plt.show()

In [None]:
pairs = [
    ('Total_Phenols', 'Flavanoids'),
    ('OD280', 'Flavanoids'),
    ('Proanthocyanins', 'Flavanoids'),
    ('OD280', 'Total_Phenols'),
    ('Proline', 'Alcohol'),
    ('Total_Phenols', 'Proanthocyanins')
    ]

In [None]:
fig = plt.figure(figsize=(16, 14))
for i, (x_column, y_column) in enumerate(pairs):  # unpack directly; avoids shadowing the built-in `tuple`
    plt.subplot(4, 4, i + 1)

    x = df_2[x_column]
    y = df_2[y_column]
    sns.scatterplot(x=x, y=y)

    plt.title(f'Relation between {x_column} - {y_column}')

fig.suptitle("Top best linear correlated chemicals scatterplots", fontsize=16)
plt.tight_layout()
plt.show()
del pairs

Reading from left to right and top to bottom, the first plots tend to show a more linear relationship than the later ones.

It is also notable that the `OD280 - Flavanoids` and `OD280 - Total_Phenols` scatter plots look as if there were two clusters or groups. It makes sense that this happens with `OD280` and those two columns, since those two are strongly linearly correlated with each other, as we can see in the first plot.

Earlier, we mentioned that these three attributes had two different peaks in their `KDE` plots and that this could make them a good way to get subgroups out of our data. These plots support that intuition.

Furthermore, as `Flavanoids` and `Total Phenols` are so well correlated, we can think about reducing those columns to a single dimension and see what happens.

But first, let's get a bigger picture with the pairwise plots between all the variables.

In [None]:
sns.pairplot(df_2)

There are no other strongly correlated variables beyond the previous ones.

But there are variables with a curious relationship, even if it isn't linear.

As we could expect from the boxplots, `Malic Acid` tends to have low values in each of its plots.

There appear to be 3 clusters in the `OD280` - `Alcohol` scatterplot. Something similar, though less clear, seems to happen in the `Total Phenols` - `Alcohol` and `Flavanoids` - `Alcohol` plots.

If we reduce the dimensions of our dataframe, maybe we could get a better insight into the number of clusters that could be hidden.

For this purpose, we will have to decide how many dimensions to reduce our dataframe to.

In [None]:
#as the chemicals use different units of measurement and PCA is sensitive to the scale of each column, we need to rescale them
scaler = StandardScaler()
df_2_standard = scaler.fit_transform(df_2)


variance_ratios = []
n_dimensions_list = range(1, len(df_2.columns) + 1)

for n_dimensions in n_dimensions_list:
    pca = PCA(n_components=n_dimensions)
    df_pca = pca.fit_transform(df_2_standard)
    
    cumulative_explained_variance = np.sum(pca.explained_variance_ratio_)
    variance_ratios.append(cumulative_explained_variance)

plt.plot(n_dimensions_list, variance_ratios, marker='o')
plt.xticks(n_dimensions_list)
plt.xlabel('Number of PCs')
plt.ylabel('Explained sum of variance')
plt.title('Total explained variance by number of PCs')
plt.show()

del variance_ratios, n_dimensions_list, df_2_standard

There is no clear elbow to decide the best number of principal components, so let's check the variance explained by each PC in each PCA.

In [None]:
scaler = StandardScaler()
df_2_standard = scaler.fit_transform(df_2)


variance_ratios_per_pca = []
n_dimensions_list = range(1, len(df_2.columns) + 1)

for n_dimensions in n_dimensions_list:
    pca = PCA(n_components=n_dimensions)
    df_pca = pca.fit_transform(df_2_standard)
    
    variance_ratios_per_pca.append(pca.explained_variance_ratio_)

In [None]:
variance_ratios_per_pca

In [None]:
fig = plt.figure(figsize=(16, 14))

for i, variances in enumerate(variance_ratios_per_pca):

    n_labels = range(1, len(variances)+1)

    var_df = pd.DataFrame()
    var_df["PC"] = n_labels
    var_df["Explained Variance"] = variances

    plt.subplot(4, 4, i + 1)
    sns.barplot(data=var_df, x="PC", y="Explained Variance")
    plt.title(f'Explained variance with {i+1} components' )

fig.suptitle("Explained variance by number of principal components")
plt.tight_layout()
plt.show()

We can see that PC1 explains almost 40% of the variance and PC2 about 20%; each further PC explains less than 10% of it.

However, since the first two principal components together explain up to 60% of the variance, it is worth checking whether they tell us something about how the data is distributed.
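As a side note, the cumulative figures quoted above do not require refitting a PCA for every number of components. A single fit that keeps all components, followed by a cumulative sum, gives the same curve for data of this size. The sketch below uses a synthetic matrix as a stand-in for the standardized wine data; the array `X` and the 60% target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 rows, 5 correlated columns driven by 2 latent factors.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(100, 5))
X_standard = StandardScaler().fit_transform(X)

pca_full = PCA()  # n_components=None keeps every component
pca_full.fit(X_standard)

# Running total of the per-component explained variance ratios.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 60%.
n_for_60 = int(np.searchsorted(cumulative, 0.60) + 1)
print(cumulative.round(3), n_for_60)
```

`np.searchsorted` works here because the cumulative sum is non-decreasing, so the first position at or above the target is well defined.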

In [None]:
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_2_standard)

df_pca = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])

In [None]:
df_pca.head()

In [None]:
sns.scatterplot(data=df_pca, x="PC1", y="PC2")

Not bad at all. Although these two components only account for 60% of the variance, they seem to be enough to show us that there may be three different types of wine. Based on the previous scatterplots and the KDE-histograms, we could have thought there were two or three clusters. Three makes more sense, since we know that the dataset has wine from three different cultivars.

Furthermore, since the distance between the bottom-left cluster and the bottom-right one seems a bit larger than the distance from either of them to the upper one, we can say that those two differ more from each other than from the third. The horizontal distance is given by PC1 and the vertical distance by PC2.
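To interpret what a horizontal or vertical separation means in terms of the original chemicals, one option is to look at the weights each column receives in every component (`pca.components_` in scikit-learn). This is a sketch on a synthetic frame with made-up column names `x1`, `x2`, `x3`; it is not a result from the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=150)
toy = pd.DataFrame({
    "x1": base + 0.1 * rng.normal(size=150),  # x1 and x2 share the same signal
    "x2": base + 0.1 * rng.normal(size=150),
    "x3": rng.normal(size=150),               # unrelated column
})

pca_toy = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Rows are the original columns, columns are the principal components.
weights = pd.DataFrame(pca_toy.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(weights.round(2))
```

Columns with large absolute weights in `PC1` are the ones that drive the horizontal axis of the scatterplot; the sign of a component is arbitrary, so only relative signs and magnitudes are meaningful.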

So, by now we have two candidate numbers of clusters for our data. We can start with the clustering in the next section.

### Clustering

Let's try two different clustering algorithms and compare them.

#### KMeans clustering

In [None]:
scaler = StandardScaler()
df_pca_scaled = scaler.fit_transform(df_pca)

del scaler

In [None]:
inertia = [] 

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42) #just using random state so every time we get the same result
    kmeans.fit(df_pca_scaled)
    inertia.append(kmeans.inertia_)


plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(range(1, 11))
plt.title('Elbow Method for Optimal K')
plt.show()

In [None]:
silhouette = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_pca_scaled)
    silhouette.append(silhouette_score(df_pca_scaled, kmeans.labels_))

plt.plot(range(2, 11), silhouette, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.axhline(y=max(silhouette), color="red", linestyle="--", label=f'Max Value: {max(silhouette):.2f}')
plt.legend()  # without this call the axhline label is never shown
plt.title('Silhouette Score for Optimal K')
plt.show()

With the inertia metric, we can see that there is a clear elbow at `k` = 3; from that point on, the inertia decreases more and more slowly. The silhouette score is also maximized at that value of `k`, with an approximate value of 0.60.

Since our exploratory analysis had already suggested this as a possible number of clusters, and these plots show it to be clearly better than the other option (`k` = 2), we will take it as our reference from now on.
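For reference, the silhouette of a single point is $(b - a) / \max(a, b)$, where $a$ is its mean distance to the other points in its own cluster and $b$ is its mean distance to the points of the nearest other cluster; `silhouette_score` averages this over all points. A small sketch of the selection procedure on synthetic blobs (the data and the range of `k` are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups, standing in for the PCA-projected wines.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, 0), (0, 5), (5, 0)],
    cluster_std=0.8,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)

# The k with the highest average silhouette is the preferred number of clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette is only defined for at least two clusters, which is why the loop starts at `k = 2`, as in the cell above.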

In [None]:
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit(df_pca_scaled).labels_

In [None]:
sns.scatterplot(data=df_pca, x="PC1", y="PC2", hue=labels, palette="bright")

Let's see now how our DataFrame looks with these labels.

In [None]:
df_with_labels = df_2.copy()
df_with_labels["KMeans_labels"] = labels

In [None]:
sns.pairplot(df_with_labels, hue='KMeans_labels', diag_kind='kde', palette="bright")
plt.show()

Looks pretty good. Some of the plots that we thought could have three clusters are now colored as we expected, but there are others that didn't show a clear cluster structure before, such as `Color_Intensity` - `Proline` (`Proline` gives good cluster plots in general) and `Alcohol` - `Hue`, that now seem to have nicely separated groups.

Let's quickly see how the wines are distributed across these clusters.

In [None]:
df_with_labels["KMeans_labels"].value_counts()

There are almost the same number of type `0` and type `2` wines. The type `1` ones (painted in orange) are notably fewer.

Even if these results are pretty good, we could try another algorithm and see whether we get a more uniform distribution.

#### DBSCAN

Since we know that 3 clusters is a good option, we will try to find values of $\epsilon$ and of the minimum number of neighbors (`min_samples` in scikit-learn's `DBSCAN`) that produce this number.

As we saw in the `PC1` - `PC2` scatterplot that the clusters are relatively easy to tell apart, what we will do first is look for a number of neighbors such that most data points have that many neighbors at roughly the same average distance.

Let $x_{1}$ be one of our points. Its $k$ closest neighbors lie at an average distance $d_{1}$. If we want all the points to have the same number of neighbors at a similar average distance, then for another point $x_{2}$ the corresponding $d_{2}$ should be similar to $d_{1}$. In general, this means that the variance of $d$ across all our data is relatively low.
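The quantity $d$ described above can be sketched on four hand-picked points (an illustrative assumption, not the wine data). Calling `kneighbors()` without arguments makes scikit-learn leave each point out of its own neighbor list, so each $d_i$ averages only over true neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three points close together and one far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
k = 2

nn = NearestNeighbors(n_neighbors=k).fit(points)
# With no argument, the fitted points are queried and self-matches are excluded.
distances, _ = nn.kneighbors()

# d[i] is the average distance from point i to its k nearest neighbors.
d = distances.mean(axis=1)
print(d.round(3), d.var().round(3))
```

The isolated point gets a much larger $d$ than the other three, which is what inflates the variance; a low variance means the points are spread at a roughly uniform density.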


In [None]:
scaler = StandardScaler()
pca_scaled = scaler.fit_transform(df_pca)

del scaler

In [None]:
pca_scaled

In [None]:
n_neighbors_list = list(range(2, 16))
n_neighbors_stats = {}


for n in n_neighbors_list:
    neighbors = NearestNeighbors(n_neighbors=n)
    neighbors.fit(pca_scaled)
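    # Querying the fitted points returns each point as its own first neighbor (distance 0),
    # so the mean below averages the point itself plus its n - 1 nearest neighbors.
    distances, indexes = neighbors.kneighbors(pca_scaled)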
    mean_distances = np.mean(distances, axis=1)
    mean_variance = np.var(mean_distances)
    variance_and_means = {
        'global_mean_distance': np.mean(distances),
        'mean_variance': mean_variance
    }
    n_neighbors_stats[n]=variance_and_means


In [None]:
n_neighbors_stats

In [None]:
n_neighbors_stats_df= pd.DataFrame(n_neighbors_stats).T

In [None]:
n_neighbors_stats_df

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(n_neighbors_stats_df.index, n_neighbors_stats_df['global_mean_distance'], label='Global Mean Distance', marker='o')
plt.plot(n_neighbors_stats_df.index, n_neighbors_stats_df['mean_variance'], label='Mean Variance', marker='o')

plt.title('Average Distance and Variance vs Number of Neighbors')
plt.xlabel('Number of Neighbors (n)')
plt.ylabel('Value')
plt.yscale('log')
plt.legend()
plt.xticks(n_neighbors_list)
plt.grid(True)
plt.show()

As the mean variance grows on a smaller scale than the average distance, the plot uses a log scale.

In any case, there isn't a clearly optimal number of neighbors that we can read from this. There is no `n` at which the variance stops increasing significantly.

But we can see that for higher values of `n` it increases more slowly than for lower ones.

Let's approach the problem in another way.

In [None]:
df_pca_scaled = pd.DataFrame()
df_pca_scaled["PC1"] = pca_scaled[:, 0]
df_pca_scaled["PC2"] = pca_scaled[:, 1]

In [None]:
df_pca_scaled

In [None]:
sns.scatterplot(data=df_pca_scaled, x="PC1", y="PC2")

Based on the graph, $\epsilon$ values between 0 and 1 may be a good choice. For the number of neighbors we couldn't determine a good range of values, but based on the graph we could try something between 3 and 10 and leave the final choice to a metric.

In [None]:
epsilon_values = np.linspace(start=0.1, stop=0.5, num=10)

In [None]:
minPts_values = range(3, 11)

In [None]:
fig, axes = plt.subplots(nrows=len(minPts_values), ncols=len(epsilon_values), figsize=(30, 26))

for i, minPts in enumerate(minPts_values):
    
    for j, epsilon in enumerate(epsilon_values):

        dbscan = DBSCAN(eps=epsilon, min_samples=minPts)
        
        labels = dbscan.fit_predict(df_pca_scaled)
        
        sns.scatterplot(data=df_pca_scaled, x='PC1', y='PC2', hue=labels, palette='bright', ax=axes[i, j])
        
        axes[i, j].set_title(f'Eps={epsilon:.2f}, MinPts={minPts}')
        axes[i, j].legend().set_visible(False)
        axes[i, j].set_xlabel('PC1')
        axes[i, j].set_ylabel('PC2')


plt.tight_layout()
plt.show()

It looks like for values of epsilon lower than 0.32 we get a lot of outliers. For values greater than or equal to 0.41 we get only two clusters (except for the case MinPts=10).

It would be better to restrict our epsilon range, then.
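Observations like these can also be checked in numbers rather than by eye, by counting clusters and noise points directly from the labels. scikit-learn's `DBSCAN` marks noise with the label `-1`. The sketch below uses synthetic points (two tight blobs plus one isolated point), which are an assumption for illustration and not the wine data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=(3, 3), scale=0.1, size=(30, 2))
far_point = np.array([[10.0, -10.0]])  # too isolated to join any cluster
X = np.vstack([blob_a, blob_b, far_point])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Count real clusters by dropping the noise label before counting unique values.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Counting this way does not depend on whether a given configuration happens to produce any noise points.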

In [None]:
epsilon_values = [eps for eps in epsilon_values if 0.31 <= eps <= 0.42] #filtered epsilon values

In [None]:
#plot again, but this time we will save their silhouettes

silhouette = {
}

three_clusters_dbscans = []


fig, axes = plt.subplots(nrows=len(minPts_values), ncols=len(epsilon_values), figsize=(16, 24))

for i, minPts in enumerate(minPts_values):

    silhouette[minPts] = {}
    
    for j, epsilon in enumerate(epsilon_values):

        dbscan = DBSCAN(eps=epsilon, min_samples=minPts)

        labels = dbscan.fit_predict(df_pca_scaled)

        # DBSCAN labels noise points as -1, so exclude that label when counting clusters
        if len(set(labels) - {-1}) == 3:

            three_clusters_dbscans.append(dbscan)


        silhouette_coefficients = silhouette_score(df_pca_scaled, labels)

        epsilon_key = f'{epsilon:.2f}'
        silhouette[minPts][epsilon_key] =  silhouette_coefficients
        
        sns.scatterplot(data=df_pca_scaled, x='PC1', y='PC2', hue=labels, palette='bright', ax=axes[i, j])
        
        axes[i, j].set_title(f'Eps={epsilon:.2f}, MinPts={minPts}')
        axes[i, j].legend().set_visible(False)
        axes[i, j].set_xlabel('PC1')
        axes[i, j].set_ylabel('PC2')


plt.tight_layout()
plt.show()

In [None]:
silhouette

As we saw that three clusters looks like a good choice, we will keep only the DBSCAN models that produce them.

In [None]:
three_clusters_dbscans

Now let's use a metric to choose the best one.

In [None]:
silhouette_scores = {}

for dbscan in three_clusters_dbscans:
    params = dbscan.get_params()

    eps = params["eps"]
    minPts = params["min_samples"]

    dict_key = (f'{eps}', f'{minPts}')

    db_silhouette = silhouette_score(df_pca_scaled, dbscan.labels_)

    silhouette_scores[dict_key] = db_silhouette

In [None]:
silhouette_scores

In [None]:
silhouette_scores = dict(sorted(silhouette_scores.items(), key=lambda item: item[1], reverse=True))

In [None]:
for key, value in silhouette_scores.items():
    print(f"Eps={key[0]}, MinPts={key[1]}: Silhouette Score = {value:.4f}")

We can plot the first three DBSCANs and see how they look.

In [None]:
first_three_dbscan = sorted(silhouette_scores.items(), key=lambda x: x[1], reverse=True)[:3]
best_dbscan_labels= []


plt.figure(figsize=(20, 16))

for i in range(0, 3):
    eps, minPts = first_three_dbscan[i][0]
    eps = float(eps)
    minPts = int(minPts)
    score = first_three_dbscan[i][1]

    dbscan = DBSCAN(eps=eps, min_samples=minPts)
    labels = dbscan.fit_predict(df_pca_scaled)

    best_dbscan_labels.append(labels)


    plt.subplot(3, 3, i+1)
    sns.scatterplot(x=df_pca_scaled['PC1'], y=df_pca_scaled['PC2'], hue=labels, palette='bright', legend='full')
    plt.title(f'DBSCAN - Eps: {eps}, MinPts: {minPts}\nSilhouette Score: {score:.3f}')
    plt.xlabel('Principal Component 1 (PC1)')
    plt.ylabel('Principal Component 2 (PC2)')
    

All three of them look pretty good. Let's see how many elements of each type they have.

In [None]:
plt.figure(figsize=(15, 5))

for i in range(3):
    plt.subplot(1, 3, i + 1)
    sns.countplot(x=best_dbscan_labels[i])
    plt.title(f'Cluster Distribution - DBSCAN {i + 1}')
    plt.xlabel('Cluster')
    plt.ylabel('Count')

plt.tight_layout()
plt.show()

The two most populated clusters have almost the same number of elements (about 50 each), just as happened with KMeans. The third one has around 40, while KMeans gave us 45 for it. Also, DBSCAN is giving us 10 outliers, but looking at the previous scatterplot, they don't look atypical enough to be considered as such.

As we didn't get a more uniform distribution and the outliers don't really look like outliers, we will continue with the KMeans labels for the analysis.
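Beyond comparing cluster sizes, the two labelings can be compared point by point. A hedged sketch, using short hypothetical label vectors rather than the real results: a contingency table shows how the clusters overlap, and the adjusted Rand index summarizes the agreement in one number that does not depend on how the clusters are named.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors, for illustration only (-1 is DBSCAN's noise label).
kmeans_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
dbscan_labels = np.array([1, 1, 1, 0, 0, 0, 2, 2, 2, -1])

# Rows: KMeans clusters; columns: DBSCAN clusters (including the noise column).
table = pd.crosstab(kmeans_labels, dbscan_labels, rownames=["KMeans"], colnames=["DBSCAN"])
print(table)

# Ignore the noise points and compare the two partitions of the remaining points.
mask = dbscan_labels != -1
ari = adjusted_rand_score(kmeans_labels[mask], dbscan_labels[mask])
print(ari)
```

An index close to 1 would mean that, apart from the points DBSCAN flags as noise, both algorithms found essentially the same groups.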

### Post-Clustering Analysis

So far, this is our labeled dataframe.

In [None]:
df_with_labels.head()

In [None]:
df_with_labels["KMeans_labels"].value_counts()

In [None]:
df_with_labels["KMeans_labels"] = df_with_labels["KMeans_labels"].astype(str)

In [None]:
replace_with_letters = {
    '0': 'A', '1': 'B', '2': 'C'
}
df_with_labels["KMeans_labels"] = df_with_labels["KMeans_labels"].replace(replace_with_letters)  # assign back instead of using inplace=True on a column selection

It will be useful to take a quick look at what the data looked like when we began, with all the labels plotted together.

In [None]:
sns.pairplot(df_with_labels)
plt.show()

Besides a few variables that seem to have a good linear correlation, most of them don't seem to have a clear one. But some of them look like they may contain clusters, with two or three groups.

In [None]:
sns.pairplot(df_with_labels, hue='KMeans_labels', diag_kind='kde', palette="bright")
plt.show()

We can quickly see that in most of the KDE plots on the diagonal the distributions are quite different for each category. The scatterplots also seem to show different groups now that we have colored them.

We can anticipate that the linear correlations within each group may be quite different from the overall ones. For example, the `Proline` - `Alcohol` relation looks like a well-defined positive correlation overall. But looking at the clusters individually, they don't seem to show one.
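This effect, a clear pooled correlation that disappears inside each group, can be reproduced on synthetic data. In the sketch below the group means differ but the two columns are generated independently within each group; the column names mirror the dataset, while the values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Two synthetic groups with different means and no relation inside each group.
toy = pd.DataFrame({
    "label": np.repeat(["A", "C"], n),
    "Alcohol": np.concatenate([rng.normal(12.0, 0.3, n), rng.normal(13.8, 0.3, n)]),
    "Proline": np.concatenate([rng.normal(500, 80, n), rng.normal(1100, 80, n)]),
})

# Correlation over all rows together.
pooled_corr = toy["Alcohol"].corr(toy["Proline"])

# Correlation computed separately inside each label.
within_corr = pd.Series({
    name: group["Alcohol"].corr(group["Proline"])
    for name, group in toy.groupby("label")
})

print(round(pooled_corr, 2))
print(within_corr.round(2))
```

The pooled value is driven almost entirely by the difference between the group means, so a strong overall correlation says little about the relation within any single cluster.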

Now we can get deeper into each class and see which insights we can find.

In [None]:
df_grouped_by_label = df_with_labels.groupby(by="KMeans_labels")


In [None]:
numeric_cols = df_with_labels.columns.difference(["KMeans_labels"])

In [None]:
global_median = df_with_labels[numeric_cols].median()

In [None]:
plt.figure(figsize=(16, 14))
for i, column in enumerate(numeric_cols):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(x="KMeans_labels", y=column, data=df_with_labels)

    all_data_median = global_median[column]

    plt.axhline(y=all_data_median, linestyle="--", color="red", label="Global Median")
    plt.legend()
    plt.title(f'{column}´s boxplot')

plt.tight_layout()
plt.show()

Things starts to get more interesting.

One step at a time:

- Before we saw that the `Alcohol`'s Boxplot was pretty simmetric, with a nor wide or really narrow body. Now we can see that the ranges in which each category moves for this variable is very delimitated if we only considered the body of the box. Based on the graphic, we can see that the `C` type wines are relatively easy to differentiate from the `B` ones just by their alcohol levels, since the first one have their bottom value around 13.0, which is approximately the top for the second ones. The `A` category has 50% of its wines between the bottom and top shadows of the blue and green wines, which may lead us to think that they have a less 'extrem' combination for this chemical, since even their shadows don´t go much away from the other boxplots bodies. With the `C` wines having relatively high values of `Alcohol` and the `B` ones have relatively lower ones, the `A` type of them looks like an intermediate within them.

- The `Ash` boxplot had a rather narrow but centered body in the general boxplot. Here we can see that the `C` and `B` wines have a very similar distribution, with values between a bit above 2.0 and almost 2.9, although the `B` ones are considerably more concentrated than the others (it looks like 75% of them are in the range 2.1-2.5 for this chemical). The `A` wines move in a wider range. As their body is centered slightly lower than the others, we can think that most of these wines tend to have slightly lower values of `Ash` than the others, and 25% of them (the bottom whisker) have values so low that they fall outside the range of the other wines.

- Previously, we saw that the `Ash_Alcanity` boxplot was proportionally similar to the `Ash` one. Now that we have a boxplot for each label, that observation does not seem very accurate. For the `B` wines, the `Ash` and `Ash_Alcanity` boxplots look somewhat similar, but that doesn't happen with the other wines. From `C`, to `B`, to `A`, they seem to progressively tend to higher values for this chemical. The difference between the values of the `B` and `A` wines is smaller than the difference between either of them and `C`. Looking deeper, we can see that approximately 75% of the wines in each category have values between a bit less than 16.0 (so that we include the bottom whisker of the `B` ones) and a bit more than 22.0. In other words, this chemical tends to take similar values whichever wine we pick, so it is probably not the most decisive one for telling the wines apart.

- Even after removing the outliers, our first `Color_Intensity` boxplot was centered on lower values. What we get now is really different. There is one category, `A`, which has extremely low values for this variable: the highest of them is around 5.0, which only reaches the lower part of the `C` body and not even the `B` body. We can also see that the boxplot for this last category is wider than the other ones, so its wines must have more spread-out values for this variable, while the others are more concentrated. The `C` wines move in a range that lies within the one of the `B` wines, but their distribution is more concentrated.

- Something similar to the `Ash_Alcanity` plot happens here. The `Flavanoids` are distributed in lower and lower amounts from `C` to `A` and then `B`. Their original boxplot was also similar to the `Ash_Alcanity` one, but with a wider body. Here we can see why that happened.

It looks like similar distribution patterns tend to repeat for different chemicals. There are ones with relatively:

- High `C`, medium `B` and low `A` values: `Alcohol`, `Color_Intensity`, `Magnesium` and `Proline`.

- Low `C`, medium `A` and high `B` values (the plots with the 'ascending' ladder boxplots): `Ash_Alcanity` and `Nonflavanoid_Phenols`.

- High `C`, medium `A` and low `B` values (the plots with the 'descending' ladder boxplots): `Flavanoids`, `OD280`, `Proanthocyanins` and `Total_Phenols`.
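
The ladder patterns above were read off the boxplots by eye. As a rough cross-check, one could sort the clusters by their median for each chemical. The following is only a sketch on a made-up toy frame: `median_order` is a hypothetical helper that is not part of the notebook, and the numbers are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_with_labels; the values are invented for illustration only.
toy = pd.DataFrame({
    "KMeans_labels": ["A", "A", "B", "B", "C", "C"],
    "Alcohol":       [12.0, 12.2, 13.0, 13.2, 13.8, 14.0],
    "Flavanoids":    [2.0, 2.2, 0.7, 0.9, 3.0, 3.2],
})

def median_order(df, label_col="KMeans_labels"):
    """Return, per numeric column, the cluster labels sorted from lowest to highest median."""
    medians = df.groupby(label_col).median(numeric_only=True)
    return {col: list(medians[col].sort_values().index) for col in medians.columns}

orders = median_order(toy)
for column, order in orders.items():
    print(column, "low -> high:", order)
```

Applied to the real labeled dataframe, chemicals that share the same ordering would fall into the same 'ladder' family, although medians alone say nothing about how much the boxes overlap.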

There is no clear pattern for `Ash` to belong to. Something similar happens with `Malic_Acid`. As we already discussed `Ash`, we can now do the same with this other chemical:

- Its general boxplot was centered to the left (lower values). Now we can see why. The `C` wines are extremely concentrated: even though they have a handful of outliers (about 7), it is obvious within which values their distribution is centered (1.5 to approximately 2.1). The `A` wines also have a relatively compact distribution. It is not as compact as `C`, but it is enough to say that approximately 75% of them have values for this chemical below 2.1 (the top of the `A` body), although we have to note that they also reach lower values. What happens with the `B` wines is curious: almost 75% of them have values above 2.5 and, considering that the top of the `A` upper whisker is around 3.0, the `B` wines clearly have the highest values of `Malic_Acid`.

In [None]:
plt.figure(figsize=(16, 14))
for i, column in enumerate(numeric_cols):
    plt.subplot(4, 4, i + 1)
    sns.histplot(data=df_with_labels, x=column, hue="KMeans_labels", palette="bright", multiple="dodge", kde=True)
    plt.title(f"{column}'s Kde-Histogram")

plt.tight_layout()
plt.show()

The histograms look much as we could expect from the boxplots. Chemicals like `Flavanoids`, which had really compact boxplots, have KDEs and histograms in well-differentiated ranges, while for others like `Ash` the histograms and KDEs overlap across the different types of wine. We anticipated this, since their boxplots moved in a very similar range.

In [None]:

group_A = df_grouped_by_label.get_group("A")
group_B = df_grouped_by_label.get_group("B")
group_C = df_grouped_by_label.get_group("C")

It will be useful to remember the six strongest linear correlations that we found in the global dataset.

- Total Phenols - Flavanoids: 0.88
- OD280 - Flavanoids: 0.78
- Proanthocyanins - Flavanoids: 0.74
- OD280 - Total Phenols: 0.70
- Proline - Alcohol: 0.66
- Total Phenols - Proanthocyanins: 0.65
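
A ranking like the one above can also be extracted programmatically instead of being read off the heatmap. This is a minimal sketch, assuming a hypothetical helper `top_correlations` (not defined in the notebook) and a tiny made-up frame in place of the wine data.

```python
import numpy as np
import pandas as pd

def top_correlations(df, k=6):
    """Return the k variable pairs with the largest absolute Pearson coefficient."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)

# Made-up data: y is an exact linear function of x, z just alternates in sign
x = np.arange(10, dtype=float)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": [1.0, -1.0] * 5})

top = top_correlations(toy, k=2)
print(top)
```

Sorting by absolute value keeps strong negative correlations in the ranking, which matters later on for pairs such as `Hue` - `Color_Intensity`.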

#### Group A

In [None]:
group_A.describe()

In [None]:
def plot_corr_matrix(df):
    """Plot the Pearson correlation heatmap of a DataFrame (we will reuse this for each group)."""
    corr_matrix = df.corr(numeric_only=True)

    plt.figure(figsize=(16, 6))
    heatmap = sns.heatmap(corr_matrix, vmin=-1, vmax=1, annot=True, linewidths=.5, cmap='coolwarm')

    heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12)

    plt.show()

In [None]:
plot_corr_matrix(group_A)

It looks like the most strongly linearly correlated variables are:

- `Flavanoids` - `Total_Phenols` with a 0.82 coefficient.

- `Proanthocyanins` - `Flavanoids` with 0.64.

- `Ash` - `Ash_Alcanity` with 0.58.

- `Nonflavanoid_Phenols` - `Total_Phenols` with -0.56.

With an absolute value a bit below 0.50 we have:

- `Nonflavanoid_Phenols` - `Flavanoids` with -0.49.

- `OD280` - `Flavanoids` with 0.49.

- `OD280` - `Nonflavanoid_Phenols` with -0.47.

We can notice that, compared to the global heatmap, there are fewer sizeable linear correlations here and most of them are weaker. Others, like the one between `Flavanoids` and `Total_Phenols`, are still present. `Nonflavanoid_Phenols` now also makes the podium.

The `Ash` - `Ash_Alcanity` pair had a global coefficient of 0.31, so it increased considerably for this cluster.

It is interesting to highlight that `Proline` - `Alcohol`, one of the strongest correlations in the first heatmap, is completely gone here. Also, the `OD280` correlations that we noticed here are weaker than the ones from the global heatmap.

Let's draw the pairplot for this group.

In [None]:
sns.pairplot(group_A)

We can check how these correlations looked in the non-grouped dataset. 

In [None]:
group_A_pairs = [
    ('Flavanoids', 'Total_Phenols'),
    ('Proanthocyanins', 'Flavanoids'),
    ('Ash', 'Ash_Alcanity'),
    ('Nonflavanoid_Phenols', 'Total_Phenols'),
    ('Nonflavanoid_Phenols', 'Flavanoids'),
    ('OD280', 'Flavanoids'),
    ('OD280', 'Nonflavanoid_Phenols'),
]

In [None]:
from sklearn.linear_model import LinearRegression

def plot_pairs(df, group_df, pairs):
    """Scatter each pair on the full data and overlay the global (dotted) and group (solid) regression lines."""
    plt.figure(figsize=(16, 14))
    # Unpack each pair directly instead of shadowing the built-in `tuple`
    for i, (x_column, y_column) in enumerate(pairs):
        plt.subplot(4, 4, i + 1)

        # Regression fitted on the full dataset
        x = df[[x_column]]
        lr = LinearRegression().fit(x, df[y_column])

        # Regression fitted on the selected cluster only
        group_x = group_df[[x_column]]
        group_lr = LinearRegression().fit(group_x, group_df[y_column])

        sns.scatterplot(data=df, x=x_column, y=y_column, hue="KMeans_labels")
        sns.lineplot(x=df[x_column], y=lr.predict(x), linestyle="dotted", color="black")
        sns.lineplot(x=group_df[x_column], y=group_lr.predict(group_x), color="black")

        plt.title(f'Relation between {x_column} - {y_column}')
    plt.tight_layout()
    plt.show()


We can see the global scatterplots for these pairs, but with coloured points and two regression lines (a solid one for this specific group and a dotted one for the full data), so we get a visual representation of how the group's correlation compares with the one in the full DataFrame.

In [None]:
plot_pairs(df_with_labels, group_A, group_A_pairs)

Based on the slopes of the lines, we can see that, compared to the global ones, group A has:

- A slightly flatter slope for the `Flavanoids` - `Proanthocyanins`, `Flavanoids` - `Nonflavanoid_Phenols` and `OD280` - `Flavanoids` relations, meaning that the same change in the chemical on the x-axis corresponds to a slightly smaller change in the chemical on the y-axis.

- A slightly steeper slope for `Flavanoids` - `Total_Phenols`, `Ash` - `Ash_Alcanity` and `OD280` - `Nonflavanoid_Phenols`. Following the previous idea, their slopes are more pronounced (upwards or downwards) than the dotted ones.

- In the case of `Total_Phenols` - `Flavanoids`, the relation is almost identical.
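
The slope comparison above was done visually. To put numbers on it, one could compare the least-squares slope of a group with the slope of the full data. This is a minimal sketch with made-up values: `slope` is a hypothetical helper, and `np.polyfit` is used here instead of the notebook's `LinearRegression` only to keep the example short.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x (degree-1 polynomial fit)."""
    return np.polyfit(x, y, deg=1)[0]

# Made-up values: the first three points play the role of one cluster
x_all = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_all = np.array([0.0, 1.0, 2.0, 6.0, 8.0, 10.0])
x_grp, y_grp = x_all[:3], y_all[:3]

global_slope = slope(x_all, y_all)
group_slope = slope(x_grp, y_grp)
print(f"global slope: {global_slope:.2f}, group slope: {group_slope:.2f}")
```

A group slope smaller in absolute value than the global one corresponds to what we called a flatter relation, and a larger one to a steeper relation.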

#### Group B

In [None]:
group_B.describe()

In [None]:
plot_corr_matrix(group_B)

Noticeably, the strongest correlations that we get here are different from the ones of the previous group. Let's list their coefficients.

- `Ash_Alcanity` - `Ash`: 0.69.

- `Hue` - `Color_Intensity`: -0.68.

- `Nonflavanoid_Phenols` - `Magnesium`: -0.55.

- `OD280` - `Hue`: 0.51.

- `Flavanoids` - `Magnesium`: 0.48.

- `Nonflavanoid_Phenols` - `Flavanoids`: -0.48.

Before, we had `Ash_Alcanity` - `Ash`, which remains here, but also some correlations between `OD280` and `Flavanoids` or `Nonflavanoid_Phenols`, and those are gone. The relation between these last two remains, with practically the same coefficient (-0.48 against -0.49). Also, `Magnesium` becomes relevant here.

In this group we also get good correlations involving `Hue`, which we didn't get before.

In [None]:
group_B_pairs = [
    ('Ash_Alcanity', 'Ash'), 
    ('Hue', 'Color_Intensity'), 
    ('Nonflavanoid_Phenols', 'Magnesium'), 
    ('OD280', 'Hue'),
    ('Flavanoids', 'Magnesium'),
    ('Nonflavanoid_Phenols', 'Flavanoids')
]

In [None]:
plot_pairs(df_with_labels, group_B, group_B_pairs)

There are some insights that we can extract from this:

- The `group B` `Ash` - `Ash_Alcanity` and `Magnesium` - `Flavanoids` relations have a more pronounced upward slope than the global ones. Something similar happens with `Color_Intensity` - `Hue` and `Magnesium` - `Nonflavanoid_Phenols`, but downwards.

- The `OD280` - `Hue` relation has a slope similar to the global one, but its range of values is more restricted, as we saw previously with the boxplots per group.

- The `Flavanoids` - `Nonflavanoid_Phenols` relation has a flatter slope. It is also concentrated in lower values of the first chemical.

#### Group C

In [None]:
group_C.describe()

In [None]:
plot_corr_matrix(group_C)

At a glance, it seems like we have more noticeable correlations here. As before, we list the six strongest ones with their Pearson coefficients:

- `Total_Phenols` - `Flavanoids`: 0.80.

- `Color_Intensity` - `Flavanoids`: 0.74.

- `Color_Intensity` - `Total_Phenols`: 0.64.

- `Proline` - `Color_Intensity`: 0.57.

- `Proanthocyanins` - `Flavanoids`: 0.54.

- `Hue` - `Malic_Acid`: -0.41.

In [None]:
group_C_pairs = [
    ('Total_Phenols', 'Flavanoids'),
    ('Color_Intensity', 'Flavanoids'),
    ('Color_Intensity', 'Total_Phenols'),
    ('Proline', 'Color_Intensity'),
    ('Proanthocyanins', 'Flavanoids'),
    ('Hue', 'Malic_Acid')
]

In [None]:
plot_pairs(df_with_labels, group_C, group_C_pairs)

Interesting. Again, here is a quick description of what we can see:

- While the global `Flavanoids` - `Color_Intensity` and `Total_Phenols` - `Color_Intensity` plots don't seem to be well described by a regression line, with the clustering labels we see that a line is a good indicator of how these chemicals correlate within group C: they show a clear positive correlation.

- The `Flavanoids` - `Total_Phenols` and `Flavanoids` - `Proanthocyanins` pairs already have a good linear correlation globally and fit the dotted line well. But when we discriminate by cluster we obtain better-fitting lines, with slightly flatter slopes.

- The global `Color_Intensity` - `Proline` and `Malic_Acid` - `Hue` lines are very similar to the `group C` ones, but the latter seem to describe the correlated behaviour of these chemicals better, since the points are less dispersed within the cluster.

We have got a lot of insights, but they are a bit mixed up. Let's put it all together.

## Conclusions and Insights

We started our analysis by trying to get the big picture of how our data was distributed.

We used a boxplot per chemical and observed that there were some outliers, so we checked the KDE and histogram plots to make sure that those were atypical values. We confirmed that and decided to remove them, so that our analysis would be as representative as possible of the data in our hands.

We retained 90.40% of it.
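
As an illustration of how such a retention percentage can be computed, here is a sketch that drops boxplot-style outliers using the usual 1.5 x IQR fences. That this is the exact rule behind the figure above is an assumption; `drop_iqr_outliers` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def drop_iqr_outliers(df, factor=1.5):
    """Keep the rows whose numeric values all lie within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    inside = ((num >= q1 - factor * iqr) & (num <= q3 + factor * iqr)).all(axis=1)
    return df[inside]

# Made-up data with one obvious outlier in column "a"
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100], "b": range(10)})

clean = drop_iqr_outliers(toy)
retained = 100 * len(clean) / len(toy)
print(f"retained {retained:.2f}% of the rows")
```

Note that a row is dropped as soon as any one of its chemicals is an outlier, so the retained share shrinks quickly as the number of columns grows.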

With those boxplots we saw that most of the chemicals seemed to have a relatively centered body, neither really wide nor narrow, but some of them had a left-centered one, indicating that most of the values of those chemicals tend to be low.

In a few words, we can say that:

- `Alcohol`, `Total_Phenols`, `Flavanoids`, `Hue` and `OD280` had a relatively centered and wide boxplot body.

- `Ash`, `Ash_Alcanity`, `Magnesium`, `Nonflavanoid_Phenols` and `Proanthocyanins` also had a relatively centered boxplot body, but a bit narrower than the previously mentioned chemicals.

- `Malic_Acid`, `Color_Intensity` and `Proline` (this last one having the widest body) had bodies centered to the left, on low values.

While at first sight most of the wines did not seem to have a clear distribution for the first two types of chemicals, it was evident that for these last three the values would frequently be restricted to a narrow domain.

But as they had outliers with higher values (i.e. on the opposite side from where they tend to congregate in the boxplot), we thought that maybe there were wines with relatively low values for these chemicals (the majority) and others with higher values (the minority) that would be significantly easier to differentiate.

So we looked at histograms and KDE plots for each chemical. We discovered that three chemicals seemed to have a kind of bimodal distribution: `Total_Phenols`, `Flavanoids` and `OD280`. This wasn't something that we could have figured out from the boxplots alone. Furthermore, `Malic_Acid`, `Color_Intensity` and `Proline` didn't show the two groups that we suspected.

Seeing that the first three chemicals mentioned could be interesting for separating our data, we started to suspect that there might be two clusters within our wines.

But before assuming that this was the case, we quickly checked whether there were any linear correlations between our variables with a Pearson correlation matrix plotted as a heatmap.

We saw that there were a lot of considerable linear correlations. Considering the top 6 by absolute value of the Pearson coefficient, we got:

- Total Phenols - Flavanoids: 0.88
- OD280 - Flavanoids: 0.78
- Proanthocyanins - Flavanoids: 0.74
- OD280 - Total Phenols: 0.70
- Proline - Alcohol: 0.66
- Total Phenols - Proanthocyanins: 0.65

We looked at them with scatterplots to get a better insight and saw something curious: while they showed that good linear correlation between the x-axis and y-axis variables, the `OD280 - Flavanoids` and `OD280 - Total_Phenols` plots looked as if they had two clusters, reinforcing our previous belief.

To get a better picture and see if this may happen with other variables, we did a pairplot of the full dataset.

Pairs like `Alcohol`-`Hue` or `Alcohol`-`OD280` (and most of the `Alcohol` ones) seemed to have three different concentrations of values, that is, three clusters. Others, like the ones mentioned just above or `Flavanoids` - `Color_Intensity`, looked like they had two clusters. So we had two candidate ways to separate our data.

Since the linear correlation was significant between many variables, we decided to reduce the dimensionality of the dataset with an algorithm that can take advantage of this: `PCA`.

The goal was to find the smallest number of Principal Components that explains most of the variance. We drew a line plot for this, with the number of PCs on the x-axis and the explained variance on the y-axis.

There was no clear elbow to decide on an optimal number of PCs. So we decided to try the first two PCs, since together they explained up to 60% of the variance (up to 40% from `PC1` and 20% from `PC2`), and drew a scatter plot.
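
The explained-variance check described above can be sketched as follows. The data here is synthetic (five features driven by two latent factors), and standardizing before `PCA` is an assumption of this example, not something stated in this summary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine features: 5 columns built from 2 latent factors plus small noise
latent = rng.normal(size=(150, 2))
loadings = np.array([[1.0, 0.5, 0.0, 1.0, -1.0],
                     [0.0, 1.0, 1.0, -0.5, 0.5]])
X = latent @ loadings + 0.05 * rng.normal(size=(150, 5))

# Standardize (assumption), fit PCA with all components and accumulate the explained variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for n_components, share in enumerate(cumulative, start=1):
    print(f"{n_components} PCs explain {share:.1%} of the variance")
```

Plotting `cumulative` against the number of components gives the line plot in which we looked for an elbow.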

The result was really interesting: there were three clearly different clusters, which makes sense since we knew from the beginning that these wines come from three different cultivars. It was clustering time.

First, we tried `KMeans` with different values of `K` to label our data using the dataset resulting from the PCA. We used two metrics to decide which was the best hyperparameter: the inertia and the silhouette. The first one presented a clear elbow at `K`=3 and the second was maximized at the same `K`. This confirmed that it was the best number of clusters to choose.
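
A minimal sketch of that selection loop, on synthetic 2D blobs standing in for the `PC1` - `PC2` data. The range of `K` values, `n_init` and `random_state` are choices made for this example, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs in place of the two principal components
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # Inertia is the within-cluster sum of squares; silhouette measures separation
    scores[k] = (model.inertia_, silhouette_score(points, model.labels_))

best_k = max(scores, key=lambda k: scores[k][1])
print("K with the highest silhouette:", best_k)
```

Inertia always tends to decrease as `K` grows, which is why we look for an elbow in it rather than a minimum, while the silhouette can simply be maximized.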

We got the following amount of elements for each group:

| Group 0 | Group 1 | Group 2 |
|:--------:|:--------:|:--------:|
|  59   |  45   | 57   |

with the numeric labels given by our algorithm.

We looked at a pairplot of the full dataset, but now labeled by these clusters. At a glance, it was clear that the different wines behaved differently, sometimes slightly and sometimes markedly, for each correlation between chemicals compared to what we could expect from the unlabeled pairplot.

But while clusters `0` and `2` had almost the same number of wines, cluster `1` had more than 10 fewer than the others.

Convinced that this was the correct number of clusters, we tried to see if we could make the distribution of elements among them a bit more even, so we turned to another clustering algorithm: `DBSCAN`.

Here it was more difficult to choose the hyperparameters `epsilon` and `min_samples`.

We started by using the `Nearest Neighbors` algorithm to get the distance from each point of our `PCA` dataframe to its `n` closest neighbors, for `n` in an arbitrary list [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], to see if we could get a good value for `min_samples`. If that was the case, we could then relatively easily get candidates for `epsilon`.

Since we wanted a more uniform distribution across the 3 groups of wines, we looked for a value of `n` for which the average distance from a point to its closest neighbors was approximately the same whichever point we picked. That means that the variance of the mean distance between a point and its `n` closest neighbors is relatively low.

Comparing the mean distance from any point to its closest neighbors with that variance, we tried to see if there was a value of `n` that formed an elbow in a line plot, meaning that after that point the variance stopped increasing significantly. There wasn't a clear `n` value, so we tried another method.
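
The neighbor-distance statistics described above could be computed along these lines. The points are random placeholders for the `PCA` dataframe, and `neighbor_distance_stats` is a hypothetical helper written for this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))  # placeholder for the PC1/PC2 coordinates

def neighbor_distance_stats(X, n_values):
    """For each n, return (mean, variance) of the per-point mean distance to its n nearest neighbors."""
    stats = {}
    for n in n_values:
        # Ask for n + 1 neighbors: querying the fitted data returns each point itself at distance 0
        distances, _ = NearestNeighbors(n_neighbors=n + 1).fit(X).kneighbors(X)
        per_point = distances[:, 1:].mean(axis=1)
        stats[n] = (per_point.mean(), per_point.var())
    return stats

stats = neighbor_distance_stats(points, range(2, 16))
for n, (mean_dist, var_dist) in stats.items():
    print(f"n={n:2d}  mean={mean_dist:.3f}  variance={var_dist:.4f}")
```

Plotting the variance against `n` gives the line plot in which we looked for an elbow.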

We built a linspace of `epsilon` values over a range that made sense looking at the `PC1` - `PC2` scatterplot, and tried each of them with different values of `min_samples`. We drew the scatterplots with the resulting labels and started to restrict the values of these hyperparameters to the ones that gave us three clusters.

Once we had a narrower domain, we selected the three combinations of `epsilon` and `min_samples` that maximized a metric: again, the silhouette.
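
That search can be sketched as a small grid over `eps` and `min_samples`. The data is synthetic, the grid values are arbitrary, and scoring the silhouette only on the non-noise points is an assumption of this example rather than a documented choice.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = []
for eps in np.linspace(0.3, 1.5, 7):
    for min_samples in (3, 5, 8):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        n_clusters = len(set(labels) - {-1})  # label -1 marks noise points
        if n_clusters != 3:
            continue  # keep only the settings that yield three clusters
        kept = labels != -1  # assumption: score only the points assigned to a cluster
        score = silhouette_score(points[kept], labels[kept])
        results.append((score, float(eps), min_samples, int((~kept).sum())))

# Three best combinations by silhouette: (score, eps, min_samples, number of noise points)
top3 = sorted(results, reverse=True)[:3]
for score, eps, min_samples, n_noise in top3:
    print(f"eps={eps:.2f} min_samples={min_samples} silhouette={score:.3f} noise={n_noise}")
```

Keeping the number of noise points next to each score makes it easier to see when a high silhouette was obtained by discarding many samples.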

All of them labeled around 10 wines as outliers and gave groups `0` and `1` almost the same number of samples (around 55). Cluster `2` got around 40. The situation didn't improve on what we had before, so we continued with the `KMeans` results.

Replacing the numeric labels `0`, `1` and `2` with `A`, `B` and `C`, we started to analyze the labeled dataframe.

Looking first at a labeled pairplot, we saw that the three-cluster division made sense even in pairs where we didn't think so initially. It looked like the labels would help us describe more precisely the distribution of each chemical in each kind of wine, so we drew labeled boxplots.

We got lots of interesting results that we wouldn't have been able to figure out without the clustering.

Based on how the boxplots of each chemical compare across clusters, we found that there are chemicals with relatively:

- High `C`, medium `B` and low `A` values: `Alcohol`, `Color_Intensity`, `Magnesium` and `Proline`.

- Low `C`, medium `A` and high `B` values (the plots with the 'ascending' ladder boxplots): `Ash_Alcanity` and `Nonflavanoid_Phenols`.

- High `C`, medium `A` and low `B` values (the plots with the 'descending' ladder boxplots): `Flavanoids`, `OD280`, `Proanthocyanins` and `Total_Phenols`.

But there are some that don't seem to fit any of the previously mentioned patterns:

- `Ash`, which has similar `C` and `B` boxplots, the latter with a slightly more compact body. The `A` boxplot has a body centered slightly lower, and its bottom whisker reaches values that neither the `C` nor the `B` wines reach.

- `Malic_Acid`, which is the one with the most outliers based on the boxplots. The `C` boxplot is extremely compact around lower values, but it has outliers with higher values. Looking at its corresponding histogram, we can figure out why that happens. Continuing with the boxplot, the `A` cluster also has a relatively compact domain of values, in around the same range as `C`, so these two could easily be mistaken for each other based on this chemical. Finally, the `B` cluster: its boxplot is clearly centered on higher values, with its body around the range in which the `C` and `A` outliers move, so most of the time this type of wine should be easy to tell apart from the others based on this component.

- Something relatively similar happens with `Hue`, but now the `C` and `A` clusters are the ones that tend to have higher values, while `B` is centered lower. In this case, the first two are a bit wider and `B` is really compact.

However, these patterns can be seen better with the histogram and KDE plots, again coloured by the labels.

Simply by looking at where the KDE peaks are and how wide they are around these values, we can make the same observations that we made from the boxplots. But they also give us a slightly deeper insight into the behaviour of the outliers. We said that the `Malic_Acid` `C` boxplot had a lot of them. Looking at the `C` histogram and KDE, we can see that there is a small but noticeable peak around values that are considerably higher than what we would expect for this cluster. It looks like a small bimodal distribution.

A similar thing seems to happen in the corresponding plot for `OD280` (if we look at its KDE peaks), but it doesn't really seem to be backed up by its histogram.

In a more friendly way, for each cluster we can conclude that the variables have values that are relatively (compared to the values that the other clusters tend to have and to the full domain of each chemical in this dataset):

| Cluster | Alcohol | Ash | Ash_Alcanity | Color_Intensity | Flavanoids | Hue | Magnesium | Malic_Acid |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|  A   |  Low   | Low-Medium-High   |  High   |  Low   | Medium   |  Low   |  Low-Medium  | Low   |
|  B   |  Medium   |  Medium-High  |  High   |  High   | Low   |  Medium-High  |  Medium  | Low   |
|  C   |  High   |  Medium-High   |  Medium   |  Medium   | High   |  Medium   |  Medium-High   | High   |


| Cluster | Nonflavanoid_Phenols | OD280 | Proanthocyanins | Proline | Total_Phenols |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|  A   |  Low-Medium-High  | Medium-High   |  Medium   |  Low   | Medium   |
|  B  |  High   | Low   |  Low   |  Low   | Low |
|  C   |  Low   | High   |  Medium-High   |  High   | High   |

With these naive labels for the distribution of the chemical values in each cluster, we can conclude that:

- The `A` wines tend to have low values: they got 5 `Low` and 1 `Low-Medium`, against only 1 `High`, 1 `Medium-High`, 3 `Medium` and 2 `Low-Medium-High`. So we can say that they do not fluctuate that much from chemical to chemical.

- The `B` wines tend, in general, to have low values: they got 6 `Low`. But they also got 3 `High`, 2 `Medium-High` and 2 `Medium`. So they can also fluctuate a lot.

- The `C` wines tend to have medium-high values: they got 3 `Medium`, 3 `Medium-High` and 6 `High`, and only one `Low`.

Now, taking a deeper look into the correlations among the chemicals within each cluster, what we did was plot its heatmap and analyze the specific plots for the most correlated pairs, based on the absolute value of their Pearson coefficient.

For group A:

- `Flavanoids` - `Total_Phenols` with a 0.82 coefficient.

- `Proanthocyanins` - `Flavanoids` with 0.64.

- `Ash` - `Ash_Alcanity` with 0.58.

- `Nonflavanoid_Phenols` - `Total_Phenols` with -0.56.

- `Nonflavanoid_Phenols` - `Flavanoids` with -0.49.

- `OD280` - `Flavanoids` with 0.49.

- `OD280` - `Nonflavanoid_Phenols` with -0.47.

For group B:

- `Ash_Alcanity` - `Ash`: 0.69.

- `Hue` - `Color_Intensity`: -0.68.

- `Nonflavanoid_Phenols` - `Magnesium`: -0.55.

- `OD280` - `Hue`: 0.51.

- `Flavanoids` - `Magnesium`: 0.48.

- `Nonflavanoid_Phenols` - `Flavanoids`: -0.48.

For group C:

- `Total_Phenols` - `Flavanoids`: 0.80.

- `Color_Intensity` - `Flavanoids`: 0.74.

- `Color_Intensity` - `Total_Phenols`: 0.64.

- `Proline` - `Color_Intensity`: 0.57.

- `Proanthocyanins` - `Flavanoids`: 0.54.

- `Hue` - `Malic_Acid`: -0.41

As there are some pair* correlations that make the podium in more than one group, we can give this a better format to see how each one varies between groups.


| Pair | Global Correlation | Cluster A | Cluster B | Cluster C |
|:--------:|:--------:|:--------:|:--------:|:--------:|
|  Flavanoids - Total_Phenols  |  0.88  | 0.82   |  0.19   | 0.80   |
|  Flavanoids - Proanthocyanins |  0.74   | 0.64   |  0.24   |  0.54   |
|  Flavanoids - Magnesium   |  0.21   | -0.079  |  0.48   |  0.14   |
|  Nonflavanoid_Phenols - Flavanoids   |  0.6   | -0.49   |  -0.48   |  -0.061  |
|  Nonflavanoid_Phenols - Total_Phenols  |  -0.5   | -0.56   | 0.3  |  0.0094   |
|  Nonflavanoid_Phenols - Magnesium   | -0.24   | 0.042   |  -0.55   | 0.18  |
|  OD280 - Nonflavanoid_Phenols  |  -0.53   | -0.47  |  0.18   |  -0.36   |
|  OD280 - Hue   |  0.57   | -0.14   |  0.51   |  -0.3   |
|  OD280 - Flavanoids  |  0.78  | 0.49   |  -0.28  |  -0.11   |
|  Color_Intensity - Hue  |  -0.48   | 0.1   |  0.51   |  0.075   |
|  Color_Intensity - Proline  |  0.38   | 0.18   |  0.11   |  0.57   |
|  Color_Intensity - Flavanoids   |  -0.14  | 0.26  |  -0.02   |  0.74   |
|  Color_Intensity - Total_Phenols   |  -0.04   | 0.15   |  -0.0084   |  0.64   |
|  Hue - Malic_Acid   |  -0.58   | -0.32   |  -0.22  |  -0.41   |
|  Ash - Ash_Alcanity   | 0.31   | 0.58   |  0.69   |  0.44   |


`*` Some pairs have their left and right chemicals swapped to form better groups, but as their Pearson coefficient remains the same, it doesn't affect the analysis.


Just to have a better visualization, the pairs with the same left chemical are displayed together. Then, they are sorted in descending order by the absolute value of their `Global Correlation` coefficient.
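
A table with this layout can be produced directly from the labeled dataframe. The sketch below uses a hypothetical helper, `correlation_by_cluster`, and a tiny made-up frame; with the real data one would pass `df_with_labels` and the list of pairs of interest.

```python
import pandas as pd

def correlation_by_cluster(df, pairs, label_col="KMeans_labels"):
    """Build a table with the global and the per-cluster Pearson coefficient of each pair."""
    rows = {}
    for x_col, y_col in pairs:
        row = {"Global": df[x_col].corr(df[y_col])}
        for label, group in df.groupby(label_col):
            row[f"Cluster {label}"] = group[x_col].corr(group[y_col])
        rows[f"{x_col} - {y_col}"] = row
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up data: x and y rise together in cluster A and move in opposite directions in cluster B
toy = pd.DataFrame({
    "KMeans_labels": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.0, 2.0, 3.0, 6.0, 5.0, 4.0],
})

table = correlation_by_cluster(toy, [("x", "y")])
print(table.round(2))
```

Even in this toy case the global coefficient hides two opposite within-cluster relations, which is the same effect we see in several rows of the table above.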

We can conclude that:

- While the coefficients of the first two `Flavanoids` pairs seem to roughly hold for `Cluster A` and `Cluster C`, they drop drastically for `Cluster B`. The opposite happens with the third one, `Flavanoids` - `Magnesium`: it has a low `Global Correlation` coefficient, which is even lower for clusters `A` and `C` (for the first one it is almost zero), but it improves considerably for `Cluster B`, going up to 0.48.

- For the `Nonflavanoid_Phenols` - `Flavanoids` pair, both clusters `A` and `B` keep some correlation, but in the opposite direction to the global one, since their coefficients are negative. In the case of `Cluster C`, the correlation seems to disappear: its coefficient drops to almost zero. The same happens for this last cluster in the `Nonflavanoid_Phenols` - `Total_Phenols` pair and, a bit less drastically, in the `Magnesium` one. In this last pair, the cluster that drops drastically is `A`, while `B` strengthens its correlation in the same direction as the global one.

- For all the top `OD280` pairs, the correlation in each cluster is weaker than the global one in terms of absolute value. In the `Nonflavanoid_Phenols` pair, `Cluster A` keeps it, while the others drop significantly and one of them even changes direction. For the `Hue` pair, `Cluster B` is the one that maintains a similarly strong correlation, while the others decrease considerably and get negative coefficients, in the other direction. Again, with `Flavanoids`, cluster `A` is the one that holds a relatively strong correlation, although it also dropped significantly from the global one. The others are relatively close to zero.

- With the exception of `Color_Intensity` - `Hue`, all the top `Color_Intensity` pairs show a great improvement in their linear correlation for `Cluster C`, even if we couldn't expect that from the `Global Correlation`. In the `Hue` pair, `C` drops to practically zero. `B` is the only one with a good correlation in that pair, even bigger than the global one in absolute value, but in the completely opposite direction.

- The `Hue` - `Malic_Acid` coefficient is weaker for all of the clusters in comparison to the `Global Correlation`, but they keep the same direction.

- Finally, the `Ash` - `Ash_Alcanity` pair wasn't really strong in the global view, but it improves, in the same direction, for each of the clusters, with `B` having the strongest correlation, at almost 0.70.

Using scatterplots labeled by cluster with two regression lines, one for the global non-grouped data and one for the specific cluster that we were checking, we could obtain these same insights in a visual way.

But what can we do with all this information?

With a global view of the boxplots of each chemical, we saw that most of them didn't have a real tendency to be centered on an unusual range of values within our domain; they were approximately well distributed.

There were some pretty good linear correlations between the variables, and even if that could have been enough to decide to describe them via a linear regression, we moved on, reduced their dimensionality and saw that there were three well-differentiated groups within our wines.

Therefore, after clustering our data, we saw that there were fairly well-distinguished ranges of values for most of the chemicals in each cluster.

Then, we observed whether the top correlated pairs remained similar in each cluster to what they were in the global dataset or whether they changed, and how.

With this knowledge of the distribution of the values of each chemical per wine type and of how the chemicals correlate with each other, one can make decisions about:

- How to classify a wine given its chemical composition values.
- What to expect of the full chemical composition of a wine given part of its chemicals.
- Which wine may be prioritized for production in view of changes in the prices of the chemicals (e.g. if alcohol gets more expensive, the types of wine that contain more of it will cost more to produce).
- How we may manipulate the levels of the different chemicals to obtain the same type of wine (e.g. in group C we saw that the Flavanoids and Proanthocyanins correlation is well described by a linear regression, so if we don't have much access to the second chemical, we can reduce the amount of the first one and still get the desired C wine).

But obviously, this has to be taken carefully, as the opinion of an expert in the field could give us a better understanding of what we found, why it happens and whether it makes sense.