# STEP 1-3

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
from dsiad_functions import Solution_UML
from dsiad_functions import plots
solution = Solution_UML()
plot = plots()   

In the previous module we prepared the data by imputing the mean for missing values, removing highly correlated features, and removing features with very low variance. We reuse that prepared dataset by loading the file saved at the end of the previous module.

In [None]:
wine = pd.read_csv("winequality-red_3.csv") 
X = wine.drop(["quality"], axis = 1)

In unsupervised machine learning we use only the features to look for hidden patterns in the data. There is no target `y`, so we drop the `quality` column and store the remaining features in `X`.

# STEP 4: Feature Engineering

Before clustering, we normalize our dataset. This means that we rescale every column to the range 0 to 1, so that each value is expressed relative to the minimum and maximum of its column. As a result, every column has the same range of values, which makes the features easier to compare.

Let's first check our dataset with the function we have learned! 

Now, we are going to apply normalization. Remember the formula for normalization is: (X - the minimum of X) / (the maximum of X - the minimum of X). 

You can use the following functions: 
`.min()` gives the minimum 
`.max()` gives the maximum

View the difference between X and X_normalized afterwards.
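As a reminder of how the formula behaves, here is a minimal sketch on a small made-up table. The DataFrame `toy` and its values are hypothetical and are not taken from the wine dataset. pandas applies `.min()` and `.max()` per column, so the whole table can be rescaled in one expression.

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({"alcohol": [9.0, 10.0, 13.0], "pH": [3.0, 3.2, 3.4]})

# Min-max normalization per column: (X - min) / (max - min)
toy_normalized = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_normalized)
```

After this step the smallest value in every column is 0 and the largest is 1.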

In [None]:
X_normalized = 

X_normalized.head()

In [None]:
solution.step_4()

# STEP 5: Feature selection 

We already selected our features! 

# STEP 6: Modelling

## 6.1 Selecting the number of clusters

When we use k-means clustering, we have to choose the number of clusters beforehand. How do we select the right number? A good clustering has tight clusters, but not too many of them. A simple rule of thumb is the elbow method: plot a measure of cluster tightness against the number of clusters and look for the elbow, the point where the curve suddenly flattens and adding more clusters yields little further improvement.
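The idea behind an elbow plot can be sketched as follows. This is an illustration under assumptions: we do not know how `plot.elbow_plot` is implemented, the data here is synthetic, and we use the `inertia_` attribute of scikit-learn's `KMeans` (the sum of squared distances of the points to their nearest centroid) as the tightness measure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: three well-separated blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the inertia of each fit
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# The inertia drops sharply up to the true number of blobs, then flattens
for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k` gives the curve in which you look for the elbow.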

In [None]:
plot.elbow_plot(X_normalized)

Where do you think the elbow is? What is our optimal number of clusters? Later, test the results for different numbers of clusters!

In [None]:
number_clusters = 

You can check the number of clusters below. Fill in your number of clusters in the solution function. 

In [None]:
solution.step_61(1)

## 6.2 K-means clustering of the data

Here, we define our model with the function `KMeans()` and fit it to our dataset `X_normalized`. The image shows an example of how the k-means algorithm works with k = 3. 



<center>
<img src="images/k-means.gif" width="300"><br/>
</center>


In [None]:
kmeans = KMeans(n_clusters = number_clusters).fit(X_normalized)

Next, we use the fitted model to determine the cluster for each row of our dataset with `.predict()`. 

In [None]:
cluster_pred = kmeans.predict(X_normalized)
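
What `.predict()` does can be checked by hand: each point gets the number of the centroid it is closest to. The sketch below uses a small made-up array `points` (hypothetical, not the wine data) and compares a manual nearest-centroid assignment with the output of `.predict()`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two groups of points in 2D
points = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                   [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance from every point to every centroid, shape (n_points, n_clusters)
distances = np.linalg.norm(
    points[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :], axis=2
)

# The index of the nearest centroid is the cluster number
nearest = distances.argmin(axis=1)

print(nearest)
print(model.predict(points))
```

Both lines print the same labels. The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 carries no meaning.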

Finally, we add the cluster numbers to our (non-normalized) dataset. You can see a new column `cluster` that shows which cluster each observation belongs to according to our model. 

In [None]:
wine["cluster"] = cluster_pred
wine.head()

# STEP 7: Reviewing results
## 7.1 Inspect centroids

To see whether the clusters make sense, we can compare the centroid values of the different clusters. Remember, we want isolation to be optimized alongside compactness, which means the centroids should lie in clearly different locations. Because we normalized the dataset, each centroid coordinate lies between 0 and 1. If one centroid is at 0 and another at 1 for a given feature, they are as far apart as possible. If one centroid is at 0.5 and the other at 0.51, they are very close and the clusters are likely to overlap on that feature. Let's look at `kmeans.cluster_centers_`.
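
One way to make this comparison concrete is to compute, per feature, how far apart the centroids are. The sketch below uses two hypothetical centroids with the values from the example above; the column names `feature_a` and `feature_b` are made up for illustration.

```python
import pandas as pd

# Hypothetical centroids of two clusters on two normalized features
centers = pd.DataFrame({"feature_a": [0.0, 1.0], "feature_b": [0.50, 0.51]})

# Spread of the centroids per feature: a large value means the clusters
# are well separated on that feature
separation = centers.max() - centers.min()

print(separation.sort_values(ascending=False))
```

Here `feature_a` separates the two clusters completely, while `feature_b` hardly distinguishes them. The same calculation can be applied to a table of the real centroids.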

In [None]:
# Returns the coordinates of the cluster centers, one row per cluster
pd.DataFrame(kmeans.cluster_centers_ , columns = X.columns)

## 7.2 Inspect clusters in pairs 
It is not possible to visualize the clusters with all features at once. However, we can inspect a combination of 2 features and see whether the clusters show up in the data. We do this with `plt.scatter()`. With the input argument `c=cluster_pred` we ask for a different colour for each cluster. Inspect the clusters: would you call them clusters when you see them visually? Try different combinations of features!

In [None]:
X_normalized.columns = X.columns

x_column = "alcohol" ## CHANGE VARIABLE ##
y_column = "pH"  ## CHANGE VARIABLE ##

plt.scatter(X_normalized[x_column], X_normalized[y_column], c=cluster_pred)
plt.xlabel(x_column)
plt.ylabel(y_column)
plt.title("Clustering of the wine")

## 7.3 Characteristics of the wine per cluster

We would like to inspect the difference in characteristics per cluster. With that information, we can determine what wine would suit which occasion! 

In [None]:
cluster_means = wine.groupby('cluster').mean().transpose()

cluster_means

In [None]:
cluster_means.plot.barh()