# k-Means Clustering on the UCI Wine Dataset

**Project Description:** This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. It lists measurements of 13 constituents found in each wine. Using k-Means clustering, students will develop and evaluate an ML model to address the task: "*Given the data, group the wine samples such that each group shares similar malic acid and alcohol content.*"

**Dataset used:** UCI Wine Dataset

# **Overview**
---



For this project, you are given a dataframe of chemical analyses of wines grown in the same region of Italy and derived from three different cultivars. Below is a list of some of the constituents for which data was collected, for your reference:

* ```alcohol```: the alcohol content of a wine sample
* ```malic_acid```: the malic acid content of a wine sample
* ```hue```: the hue index of a wine sample
* ```od280/od315_of_diluted_wines```: a quantitative measure used to assess the protein content of a wine sample



In [1]:
#!conda install -c districtdatalabs yellowbrick
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
import sklearn.datasets as datasets
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### **Step #1:** **Load the dataset. Run the following cell to view the first five rows.**

In [None]:
wine = datasets.load_wine()
# Combine the 13 feature columns and the cultivar label into one dataframe
wine_data = pd.DataFrame(data=np.c_[wine['data'], wine['target']],
                         columns=wine['feature_names'] + ['target'])
wine_data.head()

The data loaded correctly, and this dataset contains no null values, so we're good to go.
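
The `head()` call above shows only five rows, so it cannot confirm the absence of missing values on its own. A minimal sketch of an explicit check, assuming `wine_data` is built exactly as in Step #1 (the name `null_counts` is introduced here for illustration):

```python
import numpy as np
import pandas as pd
import sklearn.datasets as datasets

# Rebuild the dataframe as in Step #1 so this cell runs on its own
wine = datasets.load_wine()
wine_data = pd.DataFrame(data=np.c_[wine['data'], wine['target']],
                         columns=wine['feature_names'] + ['target'])

# Count missing entries per column, then in total
null_counts = wine_data.isnull().sum()
print(null_counts)
print("Total missing values:", int(null_counts.sum()))
```

A total of zero means no imputation or row dropping is needed before clustering.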



**We can now generate a pairplot for the ```wine_data``` using seaborn. Run the following cell block to view the data's pairplot.**

**NOTE:** The pairplot will take some time to generate!

### **Step #2: Applying k-Means Clustering to Alcohol and Malic Acid Measurements**
---

**Create a scatter plot of Alcohol vs. Malic Acid with the given data.**

*We did not provide this code because we want you to practice creating scatter plots*

In [None]:
x = wine_data['alcohol']
y = wine_data['malic_acid']
plt.title("Alcohol vs. Malic Acid Scatterplot")
plt.xlabel("Alcohol")
plt.ylabel("Malic Acid")
plt.scatter(x, y)
plt.show()

### **Step #3: Define `X` using the features `alcohol` and `malic_acid`**



In [4]:
# Column 0 of X is alcohol, column 1 is malic acid
df = wine_data[["alcohol", "malic_acid"]]
X = df.values

### **Step #4: Feature Scaling**
---
*Because k-Means is distance-based, a feature with a larger numeric range would dominate the distance calculation, so the features should be scaled first.*

Use `StandardScaler` to standardise the feature matrix `X`.



In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
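
To make the effect of standardisation concrete, the sketch below compares `StandardScaler` with the explicit formula `(x - mean) / std` applied per column. It assumes the same two features as Step #3; `X_manual` is an illustrative name, and the columns are selected directly from the scikit-learn arrays so the cell is self-contained.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Select alcohol and malic_acid, in the same column order as Step #3
wine = load_wine()
names = list(wine.feature_names)
X = wine.data[:, [names.index("alcohol"), names.index("malic_acid")]]

X_scaled = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column std devs after scaling:", X_scaled.std(axis=0))
print("Matches manual z-score:", np.allclose(X_scaled, X_manual))
```

After scaling, both columns have mean approximately 0 and standard deviation 1, so neither feature outweighs the other in the Euclidean distances k-Means relies on.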

### **Step #5: Find the optimal number of clusters using the Elbow Method**

Set the parameter `k=(2,10)` of `KElbowVisualizer`. Fit the visualizer, plot the result, and determine the optimal value of `k`.

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_scaled)
visualizer.show()
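
For readers without yellowbrick, the idea behind the elbow plot can be sketched with scikit-learn alone: fit one model per candidate `k` and record its inertia (the within-cluster sum of squared distances). This is an illustration under assumptions, not a reimplementation of `KElbowVisualizer`: it assumes the tuple `(2,10)` covers `k` from 2 to 9, and it sets `n_init=10` explicitly, which the original cell does not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Same two scaled features as in Steps #3 and #4
wine = load_wine()
names = list(wine.feature_names)
X = wine.data[:, [names.index("alcohol"), names.index("malic_acid")]]
X_scaled = StandardScaler().fit_transform(X)

ks = list(range(2, 10))  # candidate cluster counts 2..9 (assumed range)
inertias = []
for n_clusters in ks:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squares

for n_clusters, inertia in zip(ks, inertias):
    print(f"k={n_clusters}: inertia={inertia:.2f}")
```

The "elbow" is the value of `k` after which adding more clusters reduces the inertia only slightly.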

### **Step #6: Fit a k-Means model using your optimal `k`**

Now that we have determined our optimized k-parameter, let's apply our model.


In [None]:
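k = 3
# NOTE: the model is fit on the raw X, not X_scaled from Step #4, so later
# predict() calls can take measurements in their original units.
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)  # cluster index (0..k-1) for every sample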
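
The array returned by `fit_predict` is also stored on the fitted estimator as `labels_`, alongside the learned `cluster_centers_`. A small sketch that inspects both, assuming the same unscaled features as the cell above (`n_init=10` is set explicitly here as an assumption; the original cell leaves it at the library default):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine

# Same two unscaled features as the model above
wine = load_wine()
names = list(wine.feature_names)
X = wine.data[:, [names.index("alcohol"), names.index("malic_acid")]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

print("fit_predict output equals labels_:", np.array_equal(y_pred, kmeans.labels_))
print("Cluster centers [alcohol, malic_acid]:")
print(kmeans.cluster_centers_)
print("Samples per cluster:", np.bincount(y_pred))
```

The cluster indices themselves are arbitrary: they identify groups but carry no ordering and do not correspond to the cultivar labels in `target`.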

### **Step #7: Visualize your k-Means Clustering**

Using matplotlib and ```kmeans.labels_```, create a scatterplot of Alcohol and Malic Acid and color each datapoint according to the cluster it has been assigned to.

In [None]:
for cluster in range(k):
    # One figure per cluster; column 0 of X is alcohol, column 1 is malic acid
    plt.title(f"Cluster No. {cluster}")
    plt.xlabel("Alcohol", fontsize=14)
    plt.ylabel("Malic Acid", fontsize=14)
    plt.scatter(X[y_pred == cluster, 0], X[y_pred == cluster, 1], label=cluster, color=f'C{cluster}')
    plt.show()

In [None]:
plt.figure(figsize=(15,10))
# All clusters on one set of axes, one colour per cluster
for yval in range(k):
    plt.scatter(X[y_pred==yval,0], X[y_pred==yval,1], s=50, label=f'cluster {yval}')
plt.axis('equal')
plt.title("Clustering of wine by alcohol and malic acid")
plt.xlabel("alcohol")
plt.ylabel("malic acid")
plt.legend()
plt.show()

### **Step #8: Learning from our Clustering Analysis**

1. **If a sample of wine has a malic acid content of 3.0 and an alcohol content of 14.2, what cluster does this wine get assigned to?**
2. **If a sample of wine has a malic acid content of 3.5 and an alcohol content of 12.7, what cluster does this wine get assigned to?**

Draw a new scatterplot with the existing datapoints colored by cluster, and the new predictions colored with their assigned cluster but in a star shape.

>**Hint:** It may be helpful to define a colormap (`colormap = np.array(['red','green','blue'])`), increase the marker size (`s=150`), and give the star a black outline (`edgecolor="black"`).


In [None]:
w1 = [[14.2, 3.0]]  # [alcohol, malic_acid], the same column order as X
prediction1 = kmeans.predict(w1)
print(prediction1)

In [None]:
w2 = [[12.7, 3.5]]  # [alcohol, malic_acid], the same column order as X
prediction2 = kmeans.predict(w2)
print(prediction2)
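
The predictions above pass raw measurements to a model that was fit on the unscaled `X`. If the model were instead fit on `X_scaled` from Step #4, new samples would have to go through the same fitted scaler before `predict`. A minimal sketch of that variant, assuming the same two features; the names `kmeans_scaled`, `new_samples`, and `new_scaled` are illustrative, and `n_init=10` is an explicit choice not present in the original cells.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Same two features as Step #3, in [alcohol, malic_acid] order
wine = load_wine()
names = list(wine.feature_names)
X = wine.data[:, [names.index("alcohol"), names.index("malic_acid")]]

# Fit the scaler and the clustering model on the standardised features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# New samples must be transformed with the SAME fitted scaler before predict
new_samples = np.array([[14.2, 3.0], [12.7, 3.5]])
new_scaled = scaler.transform(new_samples)
predictions = kmeans_scaled.predict(new_scaled)
print(predictions)
```

Cluster numbering is arbitrary, so the indices printed here need not match those from the unscaled model even when the groupings are similar.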

In [None]:
plt.figure(figsize=(15,10))
colormap = np.array(['red','green','blue'])
# Existing samples, coloured by assigned cluster
for yval in range(k):
    plt.scatter(X[y_pred==yval,0], X[y_pred==yval,1], s=50, label=f'cluster {yval}', c=colormap[yval])

# New samples drawn as stars in the colour of their predicted cluster
plt.scatter(14.2, 3.0, s=150, marker='*', edgecolor='black', c=colormap[prediction1])
plt.scatter(12.7, 3.5, s=150, marker='*', edgecolor='black', c=colormap[prediction2])

plt.title("Clustering of wine by alcohol and malic acid")
plt.xlabel("alcohol")
plt.ylabel("malic acid")
plt.legend()
plt.show()
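
The colormap hint relies on NumPy integer-array indexing: indexing `colormap` with an array of cluster labels yields one colour name per sample. The sketch below uses that to draw all existing points in a single `scatter` call instead of a loop. It is an alternative under assumptions (same unscaled features, explicit `n_init=10`), and `point_colors` is an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine

# Same two unscaled features and model settings as Step #6
wine = load_wine()
names = list(wine.feature_names)
X = wine.data[:, [names.index("alcohol"), names.index("malic_acid")]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

colormap = np.array(['red', 'green', 'blue'])
point_colors = colormap[y_pred]  # one colour name per sample

fig, ax = plt.subplots(figsize=(15, 10))
ax.scatter(X[:, 0], X[:, 1], s=50, c=point_colors)
ax.set_xlabel("alcohol")
ax.set_ylabel("malic acid")
```

The same indexing explains `colormap[prediction1]` in the cell above: `prediction1` is a one-element array, so the result is a one-element array of colour names.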

### **Step #9: DIY**

1. **If a sample of wine has a malic acid content of 1.5 and an alcohol content of 12, what cluster does this wine get assigned to?**
2. **If a sample of wine has a malic acid content of 3.9 and an alcohol content of 13.2, what cluster does this wine get assigned to?**

Draw a new scatterplot with the existing datapoints colored by cluster, and the new predictions colored with their assigned cluster but in a star shape.


In [None]:
#Insert your code here for the first pair

In [None]:
#Insert your code here for the second pair

In [None]:
#Insert your code here for the scatterplot