# Customer segmentation with unsupervised learning

__We will learn by doing__: write your own code in the places marked with __[WORKSHOP]__

We will be using the following tools in Python 3:
* jupyter
* pandas
* matplotlib
* scikit-learn

## jupyter
You are already using __jupyter__!

__[jupyter](http://jupyter.org/) is a web-based IDE__, a development environment __with a REPL__ where you can execute arbitrary code and see the results instantly. Beneath the web interface there is a server that executes the code. The language is usually Python 3, but there are server kernels for other languages.

Thanks to its REPL _powers_, __jupyter has become the de facto standard format__ for presenting data analyses, research papers and other data-related work. We developers may find jupyter too simple as an IDE, but this simplicity is precisely what helps researchers __easily explore datasets__ and try things out quickly.

### jupyter notebook's cells

__A cell usually contains code or markdown__. Do a single click here to see it ;-)

Here are some ways to move around and execute cells:
* You can edit any cell by clicking on it with the mouse, or by selecting it with the arrow keys and pressing _Enter_
* You can unfocus a cell with _Esc_
* You can add a new cell below the current one with the _B_ key
* Executing a cell is simple: _shift+Enter_ or _control+Enter_ will run the code or render the markdown

You can use the mouse and the top menu to do all kinds of things, or learn the keyboard shortcuts (in the Help menu).

In [None]:
# This is a cell with code. Try to execute it: use the mouse to focus it and press control+Enter.

' '.join(['Hello', 'world'])

## pandas

__[pandas](http://pandas.pydata.org/) is a library for manipulating data frames__. It is built on top of a lower-level library, __NumPy__, which operates on arrays and matrices, and it adds an incredible collection of functions to play with the data.

Let's load Ulabox's dataset and play a bit with the data. If the .csv file is not in the current directory, this code will download it.

In order to know __the meaning of each column__ of the dataset, please have a look at [its data dictionary](https://github.com/ulabox/datasets/blob/master/README.md).


In [None]:
# 'pandas' is usually imported under the alias 'pd'.
# Please execute this cell with control+Enter (and the following ones, as you read them).
import pandas as pd
import os.path

filename = 'ulabox_orders_with_categories_partials_2017.csv'
if not os.path.isfile(filename):
    import urllib.request
    urllib.request.urlretrieve('https://raw.githubusercontent.com/ulabox/datasets/master/data/ulabox_orders_with_categories_partials_2017.csv', filename) 

raw_df = pd.read_csv(filename)   

# head() shows first 5 rows
raw_df.head()

If you have a look at the raw data, __each row has an index__ (left, bold) __with its corresponding column data__. In data analysis rows are usually named "samples" while columns are named "features".

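Actually in this case __the feature "order" (order number) can be used directly as the index__ of the dataframe. So let's use set_index(), which moves the "order" column into the index, so it no longer appears among the features.

In [None]:
# set_index() uses the values of 'order' as row labels and removes the column from the features.
# (reindex() would instead look rows up by their current labels, which is not what we want here.)
df = raw_df.set_index('order')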

df.head()
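
To make the index idea concrete, here is a minimal sketch on a tiny made-up dataframe (the values are invented for illustration and are not taken from the Ulabox dataset). It shows that set_index() turns a column into row labels that can be used for lookups afterwards.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'customer': [7, 7, 9],
    'order': [8856, 12, 301],
    'total_items': [45, 8, 30],
})

# set_index() returns a new frame that uses the 'order' values as row labels
# and, by default, removes 'order' from the columns
indexed = toy.set_index('order')

print(indexed.columns.tolist())         # ['customer', 'total_items']
print(indexed.loc[12, 'total_items'])   # 8
```

Notice that the original toy dataframe is left untouched, because set_index() returns a copy unless told otherwise.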

Notice that __the pandas library is really powerful__. It can manipulate data in many ways and supports all kinds of dataframe operations (at cell level, at row and column level, and even between dataframes).

For example, it can use multi-indexes as follows.

In [None]:
multi_indexed_df = raw_df.groupby(by=['customer','order']).sum()

multi_indexed_df.head()
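
Once a dataframe has a multi-index, rows can be selected with one or more levels of it. The following is a small sketch with invented values (the column names mirror the dataset, the numbers do not).

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'customer': [1, 1, 2],
    'order': [10, 11, 12],
    'total_items': [5, 7, 3],
})

grouped = toy.groupby(by=['customer', 'order']).sum()

# A tuple selects one (customer, order) pair
print(grouped.loc[(1, 11), 'total_items'])   # 7

# A single value selects every order of that customer
print(grouped.loc[1])
```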

Filtering, sampling and __indexing by sample or feature__ are really easy too.

In [None]:
# Get 50 random rows. random_state is the random seed, so the same rows are picked on every run.
sample = df.sample(50, random_state = 1)

sample.head()

In [None]:
# Getting a column (feature)
column = sample['total_items']

column.head()

In [None]:
# Getting a row (sample)
sample.loc[8856]

In [None]:
# Getting an individual value
sample.loc[8856, 'hour']

In [None]:
# Filtering with a boolean condition
orders_with_more_than_50_items = sample.loc[sample['total_items'] > 50]

orders_with_more_than_50_items.head()

In [None]:
# The comparison itself produces a boolean Series (one True/False per row)
(sample['total_items'] > 50).head()
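
Boolean Series like the one above can be combined to build richer filters. This is a minimal sketch on made-up data (the column names exist in the dataset, the values are invented).

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {'total_items': [60, 10, 55, 20], 'hour': [9, 22, 23, 12]},
    index=[101, 102, 103, 104],
)

many_items = toy['total_items'] > 50   # True, False, True, False
late = toy['hour'] >= 22               # False, True, True, False

# & is "and", | is "or", ~ is "not"
big_and_late = toy.loc[many_items & late]
print(big_and_late.index.tolist())     # [103]

# True counts as 1, so sum() tells how many rows match
print(many_items.sum())                # 2
```

When the comparisons are written inline, each one needs its own parentheses, because & binds tighter than > in Python.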

In [None]:
# [WORKSHOP] Can you find any order that contains only Drinks products?


You should have found more than one order with just drinks. Somebody was really thirsty!

Moreover, __pandas has several functions for computing statistics__, helpful when exploring the data.

In [None]:
df.describe()

As you can see, the dataset contains 30k rows (samples). We get the mean, standard deviation, min, max and other statistics for each feature.

Wait a moment! __Look at the discount% column__. It seems the maximum _discount%_ is 100%, and the minimum _discount%_ is -65% (yes, negative)! __This looks weird...__

If you want to have a look at other pandas features, I recommend [its documentation](https://pandas.pydata.org/pandas-docs/stable/) and [this cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

## matplotlib
__[matplotlib](https://matplotlib.org/) is a library for graphs__. It can use numpy arrays and pandas dataframes as input.

Actually pandas comes with direct helpers to display data through matplotlib! If you want to learn more about the matplotlib integration in pandas, check the [pandas visualization documentation](https://pandas.pydata.org/pandas-docs/stable/visualization.html) and the [matplotlib documentation](https://matplotlib.org/contents.html).

For instance, let's explore the discount% feature using a histogram.

In [None]:
import matplotlib.pyplot as plt

# This is a jupyter helper: it makes matplotlib graphs show up inline, right below the cell that creates them
%matplotlib inline

# Use 15 bins (bars)
df['discount%'].hist(bins=15)
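
A histogram is just a count of values per bin. As a rough sketch of what happens behind the plot, numpy's histogram function performs the same kind of binning and returns the numbers instead of drawing them. The values below are made up for illustration.

```python
import numpy as np

# Made-up discount percentages for illustration only
values = np.array([-5, 0, 0, 10, 20, 100])

# Split the range [min, max] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(counts.tolist())   # [5, 0, 1]
print(edges.tolist())    # [-5.0, 30.0, 65.0, 100.0]
```

There is always one more edge than there are bins, and the last bin includes its right edge, which is why the value 100 is counted.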

After seeing the discount% feature displayed, it's clear that if the __discount% is 100%__, it should be some kind of __free order__ (like a gift for a VIP).

On the other hand, why are there some negative discount% values? This is not easy to understand until you ask the domain expert: some drinks have a surcharge (a negative discount) due to a law that taxes drinks with added sugar.

In [None]:
# [WORKSHOP] Plot a histogram to display at which hours of the day orders are most commonly placed.


The results should show a peak at 12~13h and one at 22h.

## scikit-learn

__[scikit-learn](http://scikit-learn.org/stable/) is a library with machine learning algorithms__ and helpers.

Before we start using ML algorithms, let's __prepare a smaller dataset without free orders__, with just 100 rows, so the algorithms will run really fast. Also, let's consider only the 8 category partials.

Notice that many algorithms, especially distance-based ones such as k-means, work better when all features share a similar scale, for instance values between 0 and 1; __it's vital to keep values at this order of magnitude__. An easy way to normalize the samples' data is to divide the percentage values by 100. Another option is standardization (see the [documentation](http://scikit-learn.org/stable/modules/preprocessing.html)).
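
The two scaling options can be compared side by side. This is a minimal sketch on a made-up matrix of percentages; it uses scikit-learn's StandardScaler for the standardization option, which is one possible choice and not necessarily the best one for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up percentages for two features, for illustration only
X_toy = np.array([[10.0, 90.0],
                  [20.0, 80.0],
                  [60.0, 40.0]])

# Option 1: divide by 100, so the values stay between 0 and 1
X_divided = X_toy / 100

# Option 2: standardization, each column ends up with mean 0 and standard deviation 1
X_standardized = StandardScaler().fit_transform(X_toy)

print(X_divided.max())                # 0.9
print(X_standardized.mean(axis=0))    # approximately [0, 0]
print(X_standardized.std(axis=0))     # approximately [1, 1]
```

Dividing by 100 keeps the values interpretable as fractions of the order, while standardization makes rarely bought categories weigh as much as common ones.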

In ML jargon, the algorithm input is called __X__ (a matrix with samples and features).

In [None]:
no_free_orders = df.loc[df['discount%'] < 100]
one_hundred = no_free_orders.sample(100, random_state = 1)
X = one_hundred[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Beauty%', 'Health%', 'Baby%', 'Pets%']].divide(100)

X.head()

### KMeans clustering

scikit-learn comes with a k-means clustering algorithm whose main parameter is the number of expected clusters. Let's try with 7 clusters.

The clustering algorithms in this library come with a .fit_predict() method that does the training and returns the cluster label assigned to each sample.

In [None]:
from sklearn.cluster import KMeans

seven_clusters_alg = KMeans(n_clusters = 7, random_state = 1)
cluster_labels = seven_clusters_alg.fit_predict(X)
cluster_labels

Each one of the 100 samples is assigned to a cluster (from 0 to 6). For instance, the first sample fell in cluster #3, the second in #4, etc.
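
To see the labelling on data where the answer is obvious, here is a small sketch with two made-up groups of points. It also shows predict(), which assigns new samples to the nearest center of an already trained model. The n_init parameter is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two obvious groups, for illustration only
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                  [1.0, 1.0], [1.0, 0.9], [0.9, 1.0]])

toy_alg = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_labels = toy_alg.fit_predict(X_toy)

# After training, predict() assigns a new sample to the nearest cluster center
new_label = toy_alg.predict(np.array([[0.05, 0.05]]))[0]

print(toy_labels)
print(new_label == toy_labels[0])   # True
```

The cluster numbers themselves are arbitrary: what matters is which samples share a label, not whether a group is called #0 or #1.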

Let's see how many samples fell in each cluster.

In [None]:
# cluster_labels is a numpy array, so first we wrap it in a dataframe and then plot the number of samples in each cluster
pd.DataFrame(cluster_labels).hist(bins = 7)

As you can see, clusters #1 and #3 have quite a lot of samples, while cluster #5 has just 5 samples.

Was choosing 7 clusters a good idea? If a cluster has a lot of samples, is it a correct clustering? __How can we find the most "correct" number of clusters?__ Silhouette score to the rescue!

scikit-learn comes with a __[silhouette_score function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)__ that evaluates how _well_ the samples fit in their clusters. Its result is a score between -1 and 1: the higher the value, the better. If you want to understand this function better, have a look at [the visual example that comes with its documentation](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html).
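
The behavior of the score is easier to trust on data where the right answer is known. The sketch below builds two well-separated made-up groups and checks that the silhouette score prefers 2 clusters over 3 or 4. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up data: 4 points near (0, 0) and 4 points near (1, 1)
offsets = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1]])
X_toy = np.vstack([offsets, offsets + 1.0])

# Compute the average silhouette score for several numbers of clusters
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Pick the number of clusters with the highest score
best_k = max(scores, key=scores.get)
print(best_k)   # 2
```

With 3 or 4 clusters, at least one of the natural groups gets split, so its points sit almost as close to a neighboring cluster as to their own, and the average score drops.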

In our case, let's write a small script that tries different numbers of clusters, from 2 to 19, and prints the score of each case.

In [None]:
from sklearn.metrics import silhouette_score

range_n_clusters = range(2,20)

for n_clusters in range_n_clusters:
    cluster_alg = KMeans(n_clusters = n_clusters, random_state = 1)
    cluster_labels = cluster_alg.fit_predict(X)

    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "the average silhouette_score is :", silhouette_avg)

Apparently choosing 7 clusters was quite a good guess, but the best score goes to 6 clusters.

In [None]:
# [WORKSHOP] Use KMeans algorithm with 6 clusters (and random_state = 1), and plot clusters' histogram (with 6 bins)


Changing from 7 clusters to 6 doesn't look like a big improvement, but the fact that some clusters have more samples than others doesn't mean they were badly chosen. It's normal that some _kinds of customers_ are more common than _others_.

A good way to better understand those 6 clusters is to look at their centers.

In [None]:
six_clusters_alg = KMeans(n_clusters = 6, random_state = 1)
cluster_labels = six_clusters_alg.fit_predict(X)

centers = pd.DataFrame(six_clusters_alg.cluster_centers_, columns=X.columns)
centers.multiply(100).round(0)

__Looking at the 6 clusters' centers everything makes more sense__. Let's see what is bought in each one:
* #0 : food and some drinks, "_the food lover_"
* #1 : drinks and some food, "_the thirsty_"
* #2 : basically baby stuff, "_the parent_"
* #3 : fresh, food and drink, "_the healthy_", very common
* #4 : home products, "_the cleaner_"
* #5 : a bit of everything, "_the balanced_", quite common

Notice too that some features are totally irrelevant: Health% and Pets%. So we can consider ignoring them in the future.
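
Reading the centers can be partly automated. The following sketch uses made-up centers (they are not the real output of the KMeans run above) to show two quick checks: the dominant category of each cluster, and the largest share each feature reaches in any center.

```python
import pandas as pd

# Made-up centers (fractions between 0 and 1), for illustration only
toy_centers = pd.DataFrame(
    [[0.70, 0.10, 0.20, 0.00],
     [0.15, 0.05, 0.80, 0.00],
     [0.30, 0.40, 0.29, 0.01]],
    columns=['Food%', 'Fresh%', 'Drinks%', 'Pets%'],
)

# Dominant category of each cluster
dominant = toy_centers.idxmax(axis=1)
print(dominant.tolist())   # ['Food%', 'Drinks%', 'Fresh%']

# Largest share each feature reaches in any center
largest_share = toy_centers.max()
print(largest_share.to_dict())
```

A feature whose largest share stays close to 0 in every center, like Pets% in this toy example, barely influences which cluster a sample falls in.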

### DBSCAN clustering

Another algorithm for clustering is DBSCAN. Let's try it!

In [None]:
from sklearn.cluster import DBSCAN
# Let's ask for at least 3 samples in each neighborhood (min_samples), with a maximum distance of 0.3 between neighbors (eps)
dbscan_alg = DBSCAN(eps = 0.3, min_samples = 3)
cluster_labels = dbscan_alg.fit_predict(X)
cluster_labels

Ouch! This algorithm only found 3 clusters (#0, #1 and #2) and some samples are marked as noise, that is, outliers that belong to no cluster (#-1).
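
Unlike k-means, DBSCAN decides the number of clusters by itself, so it is handy to count clusters and noise samples from the labels. Here is a minimal sketch on made-up points, with two dense groups and one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data for illustration only
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # dense group
                  [1.0, 1.0], [1.0, 0.9], [0.9, 1.0],   # another dense group
                  [5.0, 5.0]])                          # isolated point

toy_labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X_toy)

# The label -1 means noise, so it must not be counted as a cluster
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))

print(toy_labels.tolist())    # [0, 0, 0, 1, 1, 1, -1]
print(n_clusters, n_noise)    # 2 1
```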

Let's try removing Health% and Pets%, since we have seen that those features are irrelevant.

In [None]:
X2 = X[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Baby%']]
cluster_labels = dbscan_alg.fit_predict(X2)
cluster_labels

Well, the result has improved a bit. But there is still room for improvement...

In [None]:
# [WORKSHOP] Feel free to try alternative configurations for this algorithm...

# Thank you for attending this workshop!

As you have experienced, playing with Machine Learning algorithms is a matter of spending time learning about the data and finding the right parameters. Also notice that we were working with just 100 samples... things can get slow with real _big data_.

I hope you liked this workshop!