In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("data/Mall_Customers.csv")
df.columns = ["CustomerID", "Gender", "Age", "Income", "Spending"]
df.head()

In [None]:
df.tail()

# Pre-processing

We need to do some pre-processing before we can cluster this dataset.

CustomerID is just a row identifier and isn't informative for clustering, so we can drop it:

In [None]:
# YOUR CODE HERE

### OneHotEncoding

Let's move on to Gender. It is a string column, and most machine learning algorithms can't work with strings directly. Let's take a look:

In [None]:
df["Gender"].value_counts()

Ok, only two genders in this dataset. With just two categories, we can *One Hot Encode* this into a single new variable `isFemale` (1 for female customers, 0 otherwise).

You can call the `.map()` method on any column in a DataFrame to convert all values in that column to new values. The *mapping* needs to be specified as a *dictionary*.

For example, to map `Foo` to 1, and `Bar` to 2, you would pass the following dictionary to `map()`:

```
{"Foo": 1,
 "Bar": 2}
```
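As a minimal sketch of how `.map()` behaves, here is a toy example on a made-up column (the `toy` DataFrame and its `label` values are purely illustrative and not part of the mall dataset):

```python
import pandas as pd

# Hypothetical toy data, only for illustrating .map()
toy = pd.DataFrame({"label": ["Foo", "Bar", "Foo", "Baz"]})

mapping = {"Foo": 1, "Bar": 2}

# Each value is looked up in the dictionary.
# "Baz" has no entry in the mapping, so it becomes NaN.
toy["label_code"] = toy["label"].map(mapping)
print(toy)
```

Values that are missing from the dictionary silently turn into `NaN`, so it is worth checking the result with `value_counts(dropna=False)` after mapping.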

In [None]:
df["isFemale"] = # YOUR CODE HERE

Now we can drop the Gender column:

In [None]:
# YOUR CODE HERE

In [None]:
df.head()

### Standardizing

Finally, we should standardize the variables such that they're on the same scale.

In scikit-learn you would typically use `sklearn.preprocessing.StandardScaler` for this, but it returns a NumPy array rather than a DataFrame, so it takes a few extra steps to use. We can also do this ourselves quite easily.

A simple way to standardize values is to subtract the mean and divide the result by the standard deviation.

You can get the mean and standard deviation like this:

    df["column"].mean()
    df["column"].std()
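To make the idea concrete, here is a small sketch on a made-up Series (the `toy` values are an assumption for illustration only). After standardizing, the values should have a mean of roughly 0 and a standard deviation of roughly 1:

```python
import pandas as pd

# Hypothetical toy column, only for illustration
toy = pd.Series([2.0, 4.0, 6.0, 8.0], name="value")

mean = toy.mean()
std = toy.std()  # pandas uses the sample standard deviation (ddof=1) by default

# Subtract the mean first, then divide by the standard deviation
standardized = (toy - mean) / std

print(standardized.mean())
print(standardized.std())
```

Note that pandas' `.std()` defaults to `ddof=1`, while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two approaches give slightly different numbers on small datasets.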

In [None]:
from sklearn.preprocessing import StandardScaler  # needed before StandardScaler can be used
ss = StandardScaler()

In [None]:
df_scaled = df.copy()

for col in df_scaled.columns:
    df_scaled[col] = # YOUR CODE HERE

Did you think about operator precedence? Remember that `/` is evaluated before `-`. Use brackets appropriately.
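A quick illustration of the precedence rule with plain numbers (the values are arbitrary and chosen only to make the difference visible):

```python
x, mean, std = 10.0, 4.0, 2.0

# Without parentheses the division happens first: 10 - (4 / 2)
without_parens = x - mean / std

# With parentheses the subtraction happens first: (10 - 4) / 2
with_parens = (x - mean) / std

print(without_parens)  # 8.0
print(with_parens)     # 3.0
```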

Let's see the result:

In [None]:
df_scaled.head()

In [None]:
df["Age"].hist()

In [None]:
df_scaled["Age"].hist()

# Clustering

In [None]:
from sklearn.cluster import KMeans

Now, initialize KMeans with 3 clusters. Refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) if needed.

Also, make sure to use `random_state=42` to ensure we all get the same results.

In [None]:
kmeans = # YOUR CODE HERE

Fit to `df_scaled`:

In [None]:
# YOUR CODE HERE

Predict the cluster for each row of `df_scaled`:

In [None]:
clusters = # YOUR CODE HERE

Let's take a look at the clusters:

In [None]:
clusters

Ok, this doesn't make a lot of sense yet. Let's add these to the original data:

In [None]:
df_cluster = df.copy().assign(cluster=clusters)
df_cluster.head()

Now we have to figure out if these clusters make sense. Maybe we can describe each cluster and assign names to them? To do that, we need some plots that show how the clusters differ.

In [None]:
sns.boxplot(x="cluster", y="Age", data=df_cluster)

**Exercise**: check out all variables and see how the clusters differ. Boxplots are good, but you might also like `violinplot`.


It's also nice to visualize them in a scatterplot:

In [None]:
x = "Age"
y = "Spending"

for cluster in df_cluster["cluster"].unique():
    df_cluster_single = df_cluster[df_cluster["cluster"] == cluster]
    plt.scatter(df_cluster_single[x], df_cluster_single[y], label=f"Cluster: {cluster}")
plt.legend()
plt.xlabel(x)
plt.ylabel(y)

**Exercise**:

Think of a name for each cluster, based on their properties!

- Cluster 0:
- Cluster 1:
- Cluster 2:

# Open-ended bonus assignments
- Try increasing the number of clusters. Do you (still) get sensible results?
- Do 10 fits, with the number of clusters ranging from 1 to 10. For each fit, grab the `inertia_` value. Plot these values against the number of clusters and use the *elbow method* to find the optimal number of clusters.
- Try clustering the wine quality dataset and see what interesting things you can find.
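
If you get stuck on the elbow method, the following is a rough sketch of the loop on synthetic data rather than the mall dataset. The `make_blobs` data, the choice of 4 centers, and `n_init=10` are assumptions made for illustration; only `KMeans` and its `inertia_` attribute come from the assignment above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups (illustrative assumption, not the mall data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

ks = list(range(1, 11))
inertias = []
for k in ks:
    # Fit one model per candidate number of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `ks` against `inertias`, for example with `plt.plot(ks, inertias, marker="o")`, should show the curve flattening after some value of `k`; that bend is the "elbow".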