## KMeans clustering 

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd


### let's get some data

In [2]:
from sklearn import datasets

In [3]:
data = datasets.load_wine()

In [None]:
#explore the data 

In [6]:
data.head()

AttributeError: head

In [None]:
# create data frame from data['data'], columns=data['feature_names']
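`load_wine()` returns a dictionary-like `Bunch`, not a DataFrame, which is why `data.head()` fails above. A minimal sketch of one way to build the frame follows; the variable name `X` is an assumption of this example, not part of the exercise.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_wine()

# The Bunch stores the feature matrix and the column names separately
X = pd.DataFrame(data["data"], columns=data["feature_names"])

print(X.shape)   # rows and feature columns
print(X.dtypes)  # data type of every column
```

Once the data is a DataFrame, `X.head()` and `X.describe()` work as usual.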



In [None]:
# what data types do you have?


### Preparing the data
The scale of "proline" is much larger than that of most other variables. K-Means is a distance-based algorithm, so features with large values would dominate the distances: we need to scale / normalize the data first.

Check out the docs for StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Explore other methods for normalizing data: https://scikit-learn.org/stable/modules/preprocessing.html

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# scale your data with the standard scaler

In [None]:
# create a dataframe of scaled features
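One possible way to do the two scaling steps above, as a sketch. The names `X`, `X_scaled` and `X_scaled_df` are assumptions chosen for this example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

data = datasets.load_wine()
X = pd.DataFrame(data["data"], columns=data["feature_names"])

# fit_transform learns each column's mean and standard deviation, then rescales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # returns a NumPy array

# Wrap the array in a DataFrame again so the column names are kept
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

print(X_scaled_df.describe().loc[["mean", "std"]].round(2))
```

After scaling, every column should have a mean of roughly 0 and a standard deviation of roughly 1, so "proline" no longer dominates the distances.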


### Clustering 

We will manually pick the number of clusters we want: let's set it to 8. Later we will discuss how many clusters we should have.

When randomness is involved, it is better to set a random seed so that we can reproduce our results. We can pass it directly through the `random_state` argument.

In [None]:
from sklearn.cluster import KMeans

#define the model, fit the model to your data 


In [None]:
#look at the cluster centres 


In [None]:
# Predicting / assigning the clusters:
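The three steps above (define and fit the model, look at the centres, assign the clusters) could look like the sketch below. The seed value `1234` and the explicit `n_init=10` are assumptions of this example; any fixed seed works.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=8, n_init=10, random_state=1234)
kmeans.fit(X_scaled)

# One centroid per cluster, one coordinate per feature
print(kmeans.cluster_centers_.shape)

# predict assigns each row to its nearest centroid
clusters = kmeans.predict(X_scaled)
print(clusters[:10])
```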


In [None]:
# Check the size of the clusters
pd.Series(clusters).value_counts().sort_index()

In [None]:
# Explore the cluster assignment by placing it in the original dataset
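A sketch of one way to attach the labels to the original (unscaled) data, so that the clusters can be read in the original units. The names `X_clustered` and the seed are assumptions of this example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = datasets.load_wine()
X = pd.DataFrame(data["data"], columns=data["feature_names"])

kmeans = KMeans(n_clusters=8, n_init=10, random_state=1234)
clusters = kmeans.fit_predict(StandardScaler().fit_transform(X))

# Keep the original values and add the cluster label as a new column
X_clustered = X.copy()
X_clustered["cluster"] = clusters

# Average feature values per cluster, in the original units
print(X_clustered.groupby("cluster").mean().round(2))
```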


### Time to think : What makes a cluster a "good" cluster?

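+ Scikit-Learn can try several random initializations and keep the best model, based on inertia. The number of attempts is set by `n_init`: older versions defaulted to 10, while scikit-learn 1.4 and later default to `'auto'`, which runs a single initialization when the default `k-means++` init is used.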



* **Inertia**: the sum of squared distances between each point and the centroid of its cluster. Intuitively, it tells how far the points within a cluster are from its centre, so we aim for a small inertia. Its value starts at zero and has no upper bound.

* **Silhouette score** (discussed later): ranges from -1 to 1


In [None]:
# total inertia: sum of squared distances from each sample to its closest centroid
kmeans.inertia_
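To make the definition of inertia concrete, the sketch below recomputes it by hand from the fitted centroids and labels and compares it with `kmeans.inertia_`. The variable names and the seed are assumptions of this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])
kmeans = KMeans(n_clusters=8, n_init=10, random_state=1234).fit(X_scaled)

# Centroid that each sample was assigned to
assigned_centres = kmeans.cluster_centers_[kmeans.labels_]

# Inertia = sum of squared Euclidean distances to the assigned centroid
manual_inertia = np.sum((X_scaled - assigned_centres) ** 2)

print(manual_inertia, kmeans.inertia_)
```

The two numbers should agree up to floating-point rounding.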

### other parameters 

max_iter: model iterates up to 300 times by default (those are the re-computing centroids iterations we saw earlier)

tol: determines when to stop iterating (if the centroids have moved only veeeery slightly between two iterations, we assume we have achieved 'convergence')

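algorithm: There are variations in the implementation of most algorithms and K-Means is no exception. Since scikit-learn 1.1 the default is the classic 'lloyd' implementation; 'elkan' is a 'smart' variation that uses the triangle inequality to skip some distance computations (older versions picked it by default for dense data).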

## Activity 
- For learning purposes, we can tweak the parameters

In [None]:
# Play with the KMeans parameters and see how that affects the 'inertia' result.
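One way to approach this activity is to fit the same data with a few different settings and compare the resulting inertia. The settings below are hypothetical choices for illustration; `init="random"` is used so that the starting point matters more than with the default `k-means++`.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])

# Hypothetical settings to compare; every other parameter keeps its default
settings = [
    {"n_init": 1, "max_iter": 2},     # one start, stopped almost immediately
    {"n_init": 1, "max_iter": 300},   # one start, allowed to converge
    {"n_init": 10, "max_iter": 300},  # ten starts, best one kept
]

results = {}
for params in settings:
    model = KMeans(n_clusters=8, init="random", random_state=1234, **params)
    model.fit(X_scaled)
    results[str(params)] = model.inertia_
    print(params, round(model.inertia_, 2))
```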

### Finding the optimal number of clusters
We have used K=8 for now, but 8 might not be the optimal number of clusters for our dataset. With a metric like inertia, we can compute it for several K values and then use the "elbow method" to choose the best K.

We will now leave all other parameters with their default value, since it seems to work pretty well.

In [None]:
# Try to run KMeans with every value of K from 2 to 19 (range excludes its upper bound, 20)
K = range(2, 20)

# For each model, store the inertia in a list
inertia = []

for ...


print(inertia)
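A possible way to fill in the loop above, as a sketch. The seed and the explicit `n_init=10` are assumptions of this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])

K = range(2, 20)  # range() excludes the upper bound, so this covers k = 2 ... 19
inertia = []

for k in K:
    # Fit one model per candidate k and keep only its inertia
    model = KMeans(n_clusters=k, n_init=10, random_state=1234)
    model.fit(X_scaled)
    inertia.append(model.inertia_)

print([round(value, 1) for value in inertia])
```

The list has one entry per value in `K`, so it can be passed directly to `plt.plot(K, inertia, 'bx-')`.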

In [None]:
# Plot the results
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

plt.figure(figsize=(16,8))
plt.plot(K, inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('inertia')
plt.xticks(np.arange(min(K), max(K)+1, 1.0))
plt.title('Elbow Method showing the optimal k')

Findings: 







+ **Inertia** is the metric that Scikit-Learn optimizes, but it has no upper bound and tends to keep decreasing as K grows, which makes it difficult to evaluate on its own.



+ There's another metric called the **Silhouette Score**
* What the silhouette score measures: **how similar an observation is to its own cluster compared to other clusters**
* $S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
    * `a`: the mean intra-cluster distance (the average distance between the i-th observation and every other observation in the cluster that i belongs to)
    * `b`: the mean **nearest** inter-cluster distance (the average distance between the i-th observation and every observation of the nearest cluster that i is **not part of**)
    
* The **silhouette score for the whole model** is the **average** of the silhouette scores of all observations.

Well separated clusters:
* `a`, the mean intra-cluster distance, is relatively small compared to
* `b`, the mean distance to the nearest cluster that the point is not part of
* that means $S = (b - a) / \max(a,b)$ approaches 1

Not so well separated clusters:
* `a`, the mean intra-cluster distance, is not so small (relatively) compared to
* `b`, the mean distance to the nearest cluster that the point is not part of
* that means $S = (b - a) / \max(a,b)$ becomes smaller and smaller (approaches 0 when b=a)
* S becomes negative for a point that is closer to another cluster than to its own, i.e. a point that is not (yet) in the right cluster (too few iterations? Play with the tolerance. Or a random effect: increase n_init?)
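The definitions of `a`, `b` and `S` above can be checked by hand for a single observation. The sketch below does that and compares the result with scikit-learn's `silhouette_samples`. Using k=3, the seed and observation `i = 0` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])
labels = KMeans(n_clusters=3, n_init=10, random_state=1234).fit_predict(X_scaled)

i = 0  # observation to inspect
distances = np.linalg.norm(X_scaled - X_scaled[i], axis=1)

# a_i: mean distance to the other members of i's own cluster
# (the distance from i to itself is 0, so it can stay in the sum)
own = labels == labels[i]
a_i = distances[own].sum() / (own.sum() - 1)

# b_i: smallest mean distance to the members of any other cluster
b_i = min(distances[labels == c].mean() for c in set(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)
sample_scores = silhouette_samples(X_scaled, labels)
print(s_i, sample_scores[i])

# The model score is the mean of the per-observation scores
model_score = silhouette_score(X_scaled, labels)
print(sample_scores.mean(), model_score)
```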

In [None]:
from sklearn.metrics import silhouette_score

K = range(2, 20)

silhouettes = []

for ...
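A possible way to fill in the loop above, as a sketch. The seed and the explicit `n_init=10` are assumptions of this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_wine()["data"])

K = range(2, 20)
silhouettes = []

for k in K:
    # The silhouette score needs the data and the labels, not the fitted model
    model = KMeans(n_clusters=k, n_init=10, random_state=1234)
    labels = model.fit_predict(X_scaled)
    silhouettes.append(silhouette_score(X_scaled, labels))

print([round(value, 3) for value in silhouettes])
```

Unlike inertia, a higher silhouette score is better, so here we look for a peak rather than an elbow.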

In [None]:
import matplotlib.pyplot as plt
import numpy as np  # needed for np.arange in plt.xticks below

plt.figure(figsize=(16,8))
plt.plot(K, silhouettes, 'bo-')
plt.xlabel('k (number of clusters)')
plt.xticks(np.arange(min(K), max(K)+1, 1.0))
plt.ylabel('silhouette score')

Findings: 



# What next?

It's the moment to perform clustering on the songs you collected. Remember that the ultimate goal of this project is to improve the recommendations of songs. Clustering the songs will allow the recommendation system to limit the scope of the recommendations to only songs that belong to the same cluster - songs with similar audio features.

The experiments you did with the Spotify API and the Billboard web scraping will allow you to create a pipeline such that when the user enters a song, you:

+ 1. Check whether or not the song is in the Billboard Hot 100.
    + 1.1. If the song is in the Billboard Hot 100, recommend another song from there.
    + 1.2. If the song is not in the Billboard Hot 100, skip to step 2.
    
+ 2. Collect the audio features from that song by sending a request to the Spotify API.

+ 3. "Predict" the cluster of the song.

+ 4. Pick a random song from the predicted cluster and give it back to the user.
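The four steps above can be sketched as follows. Everything in this example is a hypothetical stand-in: the song titles, the feature values, the `hot_100` list and the `get_audio_features` helper (which replaces the real Spotify API request) are made up for illustration.

```python
import random

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the data collected earlier in the project
hot_100 = ["Song A", "Song B", "Song C"]
songs = pd.DataFrame({
    "title": ["Song D", "Song E", "Song F", "Song G", "Song H", "Song I"],
    "danceability": [0.90, 0.85, 0.20, 0.25, 0.88, 0.22],
    "energy": [0.80, 0.90, 0.30, 0.20, 0.85, 0.28],
})
feature_columns = ["danceability", "energy"]

# Fit the scaler and the model once, on the collected songs
scaler = StandardScaler().fit(songs[feature_columns])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1234)
songs["cluster"] = kmeans.fit_predict(scaler.transform(songs[feature_columns]))


def get_audio_features(title):
    """Placeholder for the Spotify API request; returns made-up values."""
    return pd.DataFrame([[0.87, 0.82]], columns=feature_columns)


def recommend(title):
    # Step 1: a song in the hot list gets another hot song back
    if title in hot_100:
        return random.choice([s for s in hot_100 if s != title])
    # Steps 2 and 3: fetch the features, scale them like the training data, predict
    features = get_audio_features(title)
    cluster = kmeans.predict(scaler.transform(features))[0]
    # Step 4: pick a random song from the predicted cluster
    candidates = songs.loc[songs["cluster"] == cluster, "title"]
    return random.choice(candidates.tolist())


print(recommend("Song A"))
print(recommend("Some other song"))
```

Note that the new song is transformed with the scaler that was fitted on the collected songs, not with a newly fitted one, so that it lands in the same feature space as the clusters.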

We want to make sure that the clusters make sense. Besides tuning the parameters of the K-Means algorithm, the most important measure of "performance" is checking whether or not the recommendations make sense to you and your classmates, so test and tune before demonstrating your new recommender product.