# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [None]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Challenge 1 - Import and Describe the Dataset

In this lab, we will use a dataset containing information about customer preferences. We will look at how much each customer spends in a year on each subcategory in the grocery store and try to find similarities using clustering.

The dataset is located [here](https://drive.google.com/file/d/1z1gYSD32ktbHuKSzB5JVS_u4YsLibh5F/view?usp=sharing). Please download it and place it in the `data` folder.

In [None]:
# loading the data:

customers = pd.read_csv('../data/Wholesale_customer_ data.csv')
customers.head()

#### Explore the dataset with mathematical and visualization techniques. What do you find?

Checklist:

* What does each column mean?
* Any categorical data to convert?
* Any missing data to remove?
* Column collinearity - any high correlations?
* Descriptive statistics - any outliers to remove?
* Column-wise data distribution - is the distribution skewed?
* Etc.

Additional info: Over a century ago, an Italian economist named Vilfredo Pareto discovered that roughly 20% of the customers account for 80% of the typical retail sales. This is called the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle). Check if this dataset displays this characteristic.
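
One possible way to check the Pareto principle is to rank customers by total spend and measure what share of sales comes from the top 20%. The sketch below uses a small synthetic frame rather than the lab data; `top_share` and `demo` are hypothetical names introduced for illustration. On the real dataset, the `Channel` and `Region` columns would presumably need to be excluded before summing.

```python
import pandas as pd


def top_share(spend: pd.Series, top_frac: float = 0.2) -> float:
    """Return the fraction of total spend contributed by the top `top_frac` of customers."""
    ordered = spend.sort_values(ascending=False)
    n_top = max(1, int(round(len(ordered) * top_frac)))
    return ordered.iloc[:n_top].sum() / ordered.sum()


# Synthetic stand-in for the real data (assumption): 10 customers, two spending columns.
demo = pd.DataFrame({
    "Fresh": [100, 5, 5, 5, 5, 5, 5, 5, 5, 60],
    "Milk": [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
})

total_spend = demo.sum(axis=1)   # total annual spend per customer
share = top_share(total_spend)   # share of sales from the top 20% of customers
print(f"Top 20% of customers account for {share:.0%} of total spend")
```

A share close to 80% would suggest the dataset follows the principle; a much lower share would suggest spending is spread more evenly.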

In [None]:
# Your code here:

customers.dtypes

In [None]:
customers.isna().sum()

In [None]:
corr = customers.corr()
corr

In [None]:
sns.heatmap(corr, xticklabels = corr.columns, yticklabels = corr.columns)

In [None]:
customers.describe()

In [None]:
customers.boxplot(figsize = (12, 8))

In [None]:
# Plot the distribution of each spending column (histogram with a KDE overlay).
# sns.distplot is deprecated, so sns.histplot(kde=True) is used instead.
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
f, axes = plt.subplots(2, 3, figsize=(16, 8), sharex=True)
for ax, col in zip(axes.flat, spend_cols):
    sns.histplot(customers[col], kde=True, ax=ax)


In [None]:
# Your observations here

# The first two columns seem to be categorical columns ('Channel' only has values 1 and 2, 'Region' only 1,2 and 3)
# The other columns are types of groceries and should all be numerical columns, which they are
# There is no missing data
# Only two columns are highly correlated: 'Grocery' and 'Detergents_Paper' (r = 0.92), so one should be removed
# The distribution is positively skewed for all numerical variables, so possibly a square root, cube root, or log transformation could help here
# There are many outliers to the right, but this is probably the result of the skewness and they are probably valid data points. 
# I don't see reason to remove them


# Challenge 2 - Data Cleaning and Transformation

If you concluded in the previous challenge that the data need cleaning or transformation, do it in the cells below. If you concluded that the data need not be cleaned or transformed, feel free to skip this challenge, but please provide your rationale.
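
Whether a transformation is worthwhile can also be judged numerically, not only by eye. The sketch below compares the sample skewness of a right-skewed column before and after three common transformations. The data are synthetic (a log-normal sample standing in for a spending column), and `demo` and `skew_table` are hypothetical names, so the numbers say nothing about the lab dataset itself.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a spending column (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"spend": rng.lognormal(mean=8, sigma=1, size=1000)})

# Sample skewness of the raw column and of three candidate transformations.
# Values near 0 indicate a roughly symmetric distribution.
skew_table = pd.DataFrame({
    "raw": demo.skew(),
    "log10": np.log10(demo).skew(),   # requires strictly positive values
    "sqrt": np.sqrt(demo).skew(),
    "cbrt": np.cbrt(demo).skew(),
})
print(skew_table.round(2))
```

The transformation whose skewness is closest to zero is a reasonable candidate, although the plots are still worth checking.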

In [None]:
# Your code here

# First, remove the 'Detergents_Paper' column, as it's highly correlated with the 'Grocery' column

customers.drop(columns='Detergents_Paper', inplace=True)
customers.head()

In [None]:
# I'm going to approach data conversion in several ways as I see different possible options.
# I'll compare the models with the different conversion approaches to see which one works best

# The first two columns ('Channel' and 'Region') have numerical type but behave as categorical data (and according to data source are nominal data)
# So, I will test three versions: one with the columns as numerical data (as they are), one without these columns, and one with the one-hot encoded columns

# customers = version with all numerical columns

In [None]:
# customers1 will be the version with one-hot encoding for the first two columns
# convert the data type of the first two columns from numerical to categorical (string)

customers1 = customers.copy()
customers1['Channel'] = customers1['Channel'].apply(str)
customers1['Region'] = customers1['Region'].apply(str)

In [None]:
customers1.dtypes

In [None]:
# one hot encode first two columns:

customers1_dummy = pd.get_dummies(customers1, columns = ['Channel', 'Region'])

In [None]:
customers1_dummy.dtypes

In [None]:
customers1_dummy.head()

In [None]:
# Create customers2 without the first two columns

customers2 = customers.drop(columns=['Channel', 'Region'])

In [None]:
customers2.head()

In [None]:
# For the customers2 version, I'm also going to try the effect of transforming the numerical variables so they're less skewed
# I'm trying three different transformations: log, square root and cube root

# Apply log conversion to remaining numerical columns (all values must be strictly positive)

customers2_log = np.log10(customers2)

In [None]:
# visually check whether distribution improved

# sns.distplot is deprecated, so sns.histplot(kde=True) is used instead
f, axes = plt.subplots(2, 3, figsize=(16, 8), sharex=True)
for ax, col in zip(axes.flat, customers2_log.columns):
    sns.histplot(customers2_log[col], kde=True, ax=ax)
axes.flat[-1].set_visible(False)  # only five columns, so hide the unused sixth panel

In [None]:
# Apply square root conversion to numerical columns

customers2_sqrt = np.sqrt(customers2)

In [None]:
# visually check whether distribution improved

# sns.distplot is deprecated, so sns.histplot(kde=True) is used instead
f, axes = plt.subplots(2, 3, figsize=(16, 8), sharex=True)
for ax, col in zip(axes.flat, customers2_sqrt.columns):
    sns.histplot(customers2_sqrt[col], kde=True, ax=ax)
axes.flat[-1].set_visible(False)  # only five columns, so hide the unused sixth panel

In [None]:
# Apply cube root conversion to numerical columns

customers2_cubrt = np.cbrt(customers2)

In [None]:
# visually check whether distribution improved

# sns.distplot is deprecated, so sns.histplot(kde=True) is used instead
f, axes = plt.subplots(2, 3, figsize=(16, 8), sharex=True)
for ax, col in zip(axes.flat, customers2_cubrt.columns):
    sns.histplot(customers2_cubrt[col], kde=True, ax=ax)
axes.flat[-1].set_visible(False)  # only five columns, so hide the unused sixth panel

In [None]:
# The cube root transformation seems to reduce the skew of the data the most, so I'll go forward with that one

# check outliers in the transformed data:

customers2_cubrt.boxplot(figsize = (12, 8))

# there are considerably fewer outliers than with the untransformed data
# not sure whether to remove them - as they seem to be valid data points and not errors I choose not to

In [None]:
# So now I have 4 versions of the data:
# customers: first two columns as numerical, data untransformed
# customers1_dummy: first two columns categorical one hot encoded
# customers2: first two columns dropped
# customers2_cubrt: variables transformed with cube root transformation to improve skewness

# Challenge 3 - Data Preprocessing

One problem with the dataset is that the value ranges differ remarkably across categories (e.g. `Fresh` and `Grocery` compared to `Detergents_Paper` and `Delicassen`). If you made this observation in the first challenge, you've done a great job! This means you not only completed the bonus questions in the previous Supervised Learning lab but also researched [*feature scaling*](https://en.wikipedia.org/wiki/Feature_scaling) in depth. Keep up the good work!

Diverse value ranges in different features could cause issues in our clustering. The way to reduce the problem is through feature scaling. We'll use this technique again with this dataset.

#### We will use the `StandardScaler` from `sklearn.preprocessing` and scale our data. Read more about `StandardScaler` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).

*After scaling your data, assign the transformed data to a new variable `customers_scale`.*
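
As a minimal sketch of what `StandardScaler` does, the example below scales a toy frame whose two columns differ by three orders of magnitude. The frame `demo` and its values are made up for illustration. Since `fit_transform` returns a NumPy array, the column names and index are passed back in when rebuilding the DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with two features on very different scales (assumption).
demo = pd.DataFrame({"Fresh": [1000.0, 2000.0, 3000.0], "Delicassen": [1.0, 2.0, 3.0]})

scaler = StandardScaler()
demo_scale = pd.DataFrame(
    scaler.fit_transform(demo),   # NumPy array of z-scores: (x - mean) / std
    columns=demo.columns,         # restore the column names
    index=demo.index,             # keep row alignment with the original frame
)
print(demo_scale)
```

After scaling, every column has mean 0 and unit (population) standard deviation, so no single feature dominates the distance computations used by the clustering algorithms.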

In [None]:
# Your import here:

from sklearn.preprocessing import StandardScaler

# Your code here:

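scaler = StandardScaler()
customers_scale = scaler.fit_transform(customers)
# Keep the column names and index so the scaled frame stays readable
customers_scale = pd.DataFrame(customers_scale, columns=customers.columns, index=customers.index)

In [None]:
customers_scale1 = scaler.fit_transform(customers1_dummy)
customers_scale1 = pd.DataFrame(customers_scale1, columns=customers1_dummy.columns, index=customers1_dummy.index)

In [None]:
customers_scale2 = scaler.fit_transform(customers2)
customers_scale2 = pd.DataFrame(customers_scale2, columns=customers2.columns, index=customers2.index)

In [None]:
customers_scale2_cubrt = scaler.fit_transform(customers2_cubrt)
customers_scale2_cubrt = pd.DataFrame(customers_scale2_cubrt, columns=customers2_cubrt.columns, index=customers2_cubrt.index)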

# Challenge 3 - Data Clustering with K-Means

Let's cluster the data with K-Means first. Initiate the K-Means model, then fit it to your scaled data. The fitted model returned by the `.fit` method has an attribute called `labels_`, which holds the cluster number assigned to each data record. Assign these labels back to `customers` in a new column called `customers['labels']`. Then you'll see the cluster results alongside the original data.
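
The fit-then-assign pattern can be sketched on synthetic data as follows. The two blobs, the `demo` frame, and the choice of `n_clusters=2` are assumptions made for this illustration only; `n_init` and `random_state` are set explicitly so that repeated runs give the same labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (assumption, not the lab data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])
demo = pd.DataFrame(points, columns=["x", "y"])

# Fit on the feature columns, then attach one cluster id per row, in row order.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo[["x", "y"]])
demo["labels"] = model.labels_
print(demo["labels"].value_counts())
```

Because `labels_` follows the row order of the data passed to `.fit`, it can be assigned to the unscaled frame as long as the rows were not reordered or filtered in between.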

In [None]:
# Your code here:

from sklearn.cluster import KMeans

# KMeans uses n_clusters=8 by default; random_state makes the labels reproducible
customer_kmeans = KMeans(random_state=42).fit(customers_scale)
customers['labels'] = customer_kmeans.labels_

Count the values in `labels`.

In [None]:
# Your code here:

customers['labels'].value_counts()

# Challenge 4 - Data Clustering with DBSCAN

Now let's cluster the data using DBSCAN. Use `DBSCAN(eps=0.5)` to initiate the model, then fit it to your scaled data. Assign the `labels_` attribute of the fitted model back to `customers['labels_DBSCAN']`. Your original data now have two sets of labels, one from K-Means and the other from DBSCAN.
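
One detail worth knowing when reading DBSCAN output: points that do not belong to any dense region are labelled `-1` (noise) rather than being forced into a cluster. The sketch below illustrates this on synthetic data; the blobs, the single far-away point, and `min_samples=5` (the scikit-learn default) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense synthetic blobs plus one isolated point (assumption, not the lab data).
rng = np.random.default_rng(0)
dense_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
dense_b = rng.normal(loc=3.0, scale=0.1, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
points = np.vstack([dense_a, dense_b, outlier])

labels = DBSCAN(eps=0.5, min_samples=5).fit(points).labels_

# Cluster ids start at 0; the label -1 marks noise and is not a cluster.
n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

When counting the values in `labels_DBSCAN`, a large `-1` group therefore means that many records were treated as noise, not that they form one big cluster.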

In [None]:
# Your code here

from sklearn.cluster import DBSCAN

customers_dbscan = DBSCAN(eps=0.5).fit(customers_scale)

customers['labels_DBSCAN'] = customers_dbscan.labels_

Count the values in `labels_DBSCAN`.

In [None]:
# Your code here

customers['labels_DBSCAN'].value_counts()

# Challenge 5 - Compare K-Means with DBSCAN

Now we want to visually compare how K-Means and DBSCAN have clustered our data. We will create scatter plots for several columns. For each of the following column pairs, plot one scatter plot colored by `labels` and another colored by `labels_DBSCAN`. Put them side by side to compare. Which clustering algorithm makes more sense?

Columns to visualize:

* `Detergents_Paper` as X and `Milk` as y
* `Grocery` as X and `Fresh` as y
* `Frozen` as X and `Delicassen` as y

Visualize `Detergents_Paper` as X and `Milk` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:

# I removed 'Detergents_Paper' from the data as it was highly correlated with 'Grocery'
# I'll replace it here with 'Grocery'

fig, [ax1, ax2] = plt.subplots(1,2, figsize = (16,6))

ax1.scatter(x=customers['Grocery'], y=customers['Milk'], c=customers['labels'])
ax1.set_title('K-Means')
ax1.set_xlabel('Grocery')
ax1.set_ylabel('Milk')

ax2.scatter(x=customers['Grocery'], y=customers['Milk'], c=customers['labels_DBSCAN'])
ax2.set_title('DBSCAN')
ax2.set_xlabel('Grocery')
ax2.set_ylabel('Milk')

plt.show()

Visualize `Grocery` as X and `Fresh` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:

fig, [ax1, ax2] = plt.subplots(1,2, figsize = (16,6))

ax1.scatter(x=customers['Grocery'], y=customers['Fresh'], c=customers['labels'])
ax1.set_title('K-Means')
ax1.set_xlabel('Grocery')
ax1.set_ylabel('Fresh')

ax2.scatter(x=customers['Grocery'], y=customers['Fresh'], c=customers['labels_DBSCAN'])
ax2.set_title('DBSCAN')
ax2.set_xlabel('Grocery')
ax2.set_ylabel('Fresh')

plt.show()

Visualize `Frozen` as X and `Delicassen` as y by `labels` and `labels_DBSCAN` respectively

In [None]:
# Your code here:

fig, [ax1, ax2] = plt.subplots(1,2, figsize = (16,6))

ax1.scatter(x=customers['Frozen'], y=customers['Delicassen'], c=customers['labels'])
ax1.set_title('K-Means')
ax1.set_xlabel('Frozen')
ax1.set_ylabel('Delicassen')

ax2.scatter(x=customers['Frozen'], y=customers['Delicassen'], c=customers['labels_DBSCAN'])
ax2.set_title('DBSCAN')
ax2.set_xlabel('Frozen')
ax2.set_ylabel('Delicassen')

plt.show()

Let's use a groupby to see how the mean differs between the groups. Group `customers` by `labels` and `labels_DBSCAN` respectively and compute the means for all columns.

In [None]:
# Your code here:

customers.groupby('labels').agg('mean')

In [None]:
customers.groupby('labels_DBSCAN').agg('mean')

Which algorithm appears to perform better?

In [None]:
# Your observations here

# Perhaps I'm doing something wrong, but I really can't say which one performs better based on these figures or numbers, it's really unclear to me

# So I'm going to leave the comparison with the three other formatted datasets, because I can't really make sense of the first case



# Bonus Challenge 2 - Changing K-Means Number of Clusters

As we mentioned earlier, we don't need to worry about the number of clusters with DBSCAN because it automatically decides that based on the parameters we send to it. But with K-Means, we have to supply the `n_clusters` param (if you don't supply `n_clusters`, the algorithm will use `8` by default). You need to know that the optimal number of clusters differs case by case based on the dataset. K-Means can perform badly if the wrong number of clusters is used.

In advanced machine learning, data scientists try different numbers of clusters and evaluate the results with statistical measures (read [here](https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation)). We are not using statistical measures today; we'll use our eyes instead. In the cells below, experiment with different numbers of clusters and visualize the results with scatter plots. What number of clusters seems to work best for K-Means?
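
For reference, the statistical approach mentioned above could look roughly like the sketch below, which fits K-Means for several values of `n_clusters` and compares the silhouette score of each result. The silhouette score is only one of several possible measures and is chosen here as an assumption; the three synthetic blobs stand in for real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (assumption, not the lab data).
rng = np.random.default_rng(0)
centers = [(0, 0), (6, 0), (0, 6)]
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score
# (ranges from -1 to 1; higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The same loop can be combined with scatter plots, drawing one panel per value of `k`, to compare the numeric ranking with what the eye suggests.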

In [None]:
# Your code here

In [None]:
# Your comment here

# Bonus Challenge 3 - Changing DBSCAN `eps` and `min_samples`

Experiment with changing the `eps` and `min_samples` parameters for DBSCAN. See how the results differ using scatter plot visualizations.
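
Before plotting, it can help to tabulate how many clusters and noise points each parameter combination produces. The sketch below does this for a small grid on synthetic data; the grid values and the two blobs are assumptions chosen for illustration, not recommended settings for the lab dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two synthetic blobs (assumption, not the lab data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(50, 2)),
])

# Try every (eps, min_samples) pair and record cluster and noise counts.
results = []
for eps in (0.1, 0.5, 1.0):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        n_clusters = len(set(labels) - {-1})      # -1 is the noise label
        n_noise = int((labels == -1).sum())
        results.append((eps, min_samples, n_clusters, n_noise))
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

In general, a smaller `eps` or a larger `min_samples` makes the density requirement stricter, which tends to produce more noise points.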

In [None]:
# Your code here

In [None]:
# Your comment here