## Clustering Workshop: Online Retail Dataset

Dataset:
https://archive.ics.uci.edu/ml/datasets/online+retail

Objective:
Explore the dataset by finding clusters

### Data Set Information:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

### Attribute Information:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. 
- Description: Product (item) name. Nominal. 
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling. 
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. 
- Country: Country name. Nominal, the name of the country where each customer resides.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [None]:
# You can use read_excel, but it requires additional dependencies, and
# isn't as easy to use as pd.read_csv

df = pd.read_csv('d:/tmp/online-retail/Online Retail.csv', parse_dates=True, encoding='latin-1')
df.head()

In [None]:
df.isnull().values.any()

In [None]:
df[df.isnull().any(axis=1)]

In [None]:
df.dropna(inplace=True)

In [None]:
# See what types we need to convert
df.dtypes

## Data Transformation (Estimated time: 60 minutes)


### Which of these should we convert to numbers?
```
InvoiceNo              object -> Label Encode
StockCode              object -> Label Encode
Description            object -> Tfidf
Quantity                int64
InvoiceDate            object -> pd.to_datetime
UnitPrice             float64
CustomerID            float64
Country                object -> Label Encode
```

### 1. Label Encode string columns (except Description)

Q: Is it better to keep the fitted encoders, or to keep the original strings in separate columns next to the encoded ones?

A: Keeping the encoders is cheaper for large datasets, because only the label mapping has to stay in memory.
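A minimal sketch of this step, assuming one `LabelEncoder` per column stored in a dict. The `toy` frame and the `encoders` name are illustrative and not part of the workshop code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for the real dataframe (illustrative values only).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379"],
    "StockCode": ["85123A", "71053", "D"],
    "Country": ["United Kingdom", "United Kingdom", "France"],
})

# One encoder per column, kept in a dict so each mapping can be reversed later.
encoders = {}
for col in ["InvoiceNo", "StockCode", "Country"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col].astype(str))

# inverse_transform recovers the original strings from the integer codes.
original_countries = encoders["Country"].inverse_transform(toy["Country"])
```

The integer codes follow the sorted order of the labels, so they carry no meaningful distance between categories. That is worth keeping in mind when the codes are fed to a distance-based method such as KMeans.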

### 2. Convert InvoiceDate column from string to datetime

Try something like:
```
pd.to_datetime(..., format='%d/%m/%Y %H:%M')
```
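As a small sketch of the conversion, assuming the strings are in day/month/year order as in the format above. The two sample values and the `Month` and `Hour` columns are illustrative. The real file may use a different layout, so check a few raw values first.

```python
import pandas as pd

# Illustrative strings in day/month/year order (not read from the file).
toy = pd.DataFrame({"InvoiceDate": ["01/12/2010 08:26", "09/12/2011 12:50"]})

# An explicit format avoids ambiguous day/month guessing and parses faster.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%d/%m/%Y %H:%M")

# Once parsed, the .dt accessor exposes numeric parts usable as features.
toy["Month"] = toy["InvoiceDate"].dt.month
toy["Hour"] = toy["InvoiceDate"].dt.hour
```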

### 3. Convert Description to Tf-Idf features

Description is fairly simple text, so scikit-learn's built-in tokenizer is enough.

There is no need to use spaCy to tokenize.

```
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                             max_df=3, min_df=1)
```

- max_df allows us to skip words that are too frequent (an integer counts documents, a float is a fraction of documents)
- min_df allows us to skip words that are too rare
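To see how the two thresholds prune the vocabulary, here is a sketch on four made-up descriptions. It uses `min_df=2` instead of the `min_df=1` above, purely so that the lower threshold has a visible effect on such a small sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up descriptions; "heart" appears in all of them.
docs = [
    "white heart lantern",
    "red heart mug",
    "heart mug holder",
    "heart lantern",
]

# Integer thresholds count documents: keep terms found in 2 to 3 documents.
vectorizer = TfidfVectorizer(analyzer="word", stop_words="english",
                             max_df=3, min_df=2)
description_tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each surviving term to its column index.
kept_terms = sorted(vectorizer.vocabulary_)
```

Here "heart" is dropped for being in more than 3 documents, and "white", "red" and "holder" are dropped for being in fewer than 2, which leaves "lantern" and "mug".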

### Combine all converted columns into our dataframe

Something like this:

```
# Recall that TFIDF has each term as a feature

df_tfidf = pd.DataFrame(description_tfidf.toarray(),
                        # get_feature_names() was removed in scikit-learn 1.2
                        columns=vectorizer.get_feature_names_out(),
                        index=df.index)
                        
# Combine into 1 dataframe

df_combined = pd.concat([df, df_tfidf], axis=1)
```

## Cluster! (Estimated time: 90 minutes)

### 1. Pick our numeric columns

```
columns = ['InvoiceNo', 'StockCode', 'Quantity', 'UnitPrice', 'CustomerID', 'Country']

# get_feature_names_out() returns an array, so convert it before concatenating
columns = columns + list(vectorizer.get_feature_names_out())
```

Then apply .loc to select these columns from df_combined.

### 2. Pick a subset of datapoints to try clustering

(let's say 300 datapoints)

(You can always add more datapoints after you have the initial clustering model)
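Steps 1 and 2 can be sketched together as follows. The random `df_combined` is only a synthetic stand-in for the real combined dataframe, and `X_all` and `X` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_combined (random numbers, not real data).
rng = np.random.default_rng(0)
df_combined = pd.DataFrame(
    rng.random((1000, 4)),
    columns=["Quantity", "UnitPrice", "CustomerID", "Country"],
)
columns = ["Quantity", "UnitPrice"]

# .loc[:, columns] keeps every row but only the chosen feature columns.
X_all = df_combined.loc[:, columns]

# A fixed random_state makes the 300-row sample repeatable.
X = X_all.sample(n=300, random_state=42)
```

Because `sample` keeps the original index, the sampled rows can still be matched back to the full dataframe later.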

### 3. Plot the data points

In [None]:
# Apply PCA to convert to 2-dimensions

In [None]:
# Plot the scatter plot
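One possible way to fill in the two cells above, shown on random data rather than the retail features. The 10-feature matrix `X` is a synthetic stand-in for the 300 sampled rows.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the 300 sampled rows (random values, 10 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))

# Project onto the two directions of highest variance.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)

# Scatter plot of the projected points.
fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=10)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

`pca.explained_variance_ratio_` reports how much of the variance the two components retain, which indicates how faithful the 2-D picture is.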

### 4. Apply KMeans clustering

In [4]:
# Plot the Elbow diagram to pick the number of clusters
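A sketch of the elbow diagram, assuming inertia (within-cluster sum of squares) as the quantity plotted against k. The data comes from `make_blobs` as a stand-in for the sampled feature matrix, and the range of k is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs stands in for the sampled feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid.
    inertias.append(kmeans.inertia_)

# The "elbow" is where adding clusters stops reducing inertia by much.
fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Inertia")
```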

### 5. Re-plot the PCA plot with cluster centroids for the best k

1. Pick the best k

2. Do something like this to PCA transform the cluster centers:
```
centroids_2d = pca.transform(kmeans.cluster_centers_)
```

3. Re-plot the PCA scatter plot with the cluster centers overlaid.


4. You can also colour the scatter plots using the cluster ids
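The four steps above might look like this on synthetic data. Here k=3 and the `make_blobs` data are arbitrary choices for illustration, not results from the retail dataset.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 5-feature data; k=3 is an arbitrary choice for the illustration.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Fit PCA on the data, then reuse the same projection for the centroids.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)
centroids_2d = pca.transform(kmeans.cluster_centers_)

# Points coloured by cluster id, centroids overlaid as red crosses.
fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap="viridis", s=10)
ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c="red", marker="x", s=100)
```

Clustering is done in the full feature space and PCA is used only for display, so clusters that overlap in the 2-D plot may still be well separated in the original dimensions.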

### 6. Cluster Metrics

Since we don't have ground-truth labels, we need a metric that works from the data alone, such as silhouette_score.

```
from sklearn.metrics import silhouette_score

# s = (b - a) / max(a, b), computed per sample and then averaged
# a: mean distance from the sample to the other samples in its own cluster
# b: mean distance from the sample to the samples in the nearest other cluster

print(silhouette_score(X, clusters, sample_size=300, random_state=42))
```
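The silhouette score can also be used as a cross-check on the elbow diagram by computing it for several values of k. This is a sketch on synthetic `make_blobs` data, and the `scores` and `best_k` names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# The silhouette score needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Scores lie in [-1, 1]; higher means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
```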

### Exploring data with clusters

Now that we have the clusters, we can use pandas to divide the dataset into the clusters.

In [None]:
df['cluster'] = clusters
df.head()

In [None]:
df[df.cluster==1].head()

In [None]:
df[df.cluster==2].head()

In [None]:
df[df.cluster==3].head()

In [None]:
df[df.cluster==1].describe()

In [None]:
df[df.cluster==2].describe()

In [None]:
df[df.cluster==3].describe()
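The per-cluster `head()` and `describe()` calls can be condensed into a single table with `groupby`. This sketch uses made-up rows and illustrative output column names.

```python
import pandas as pd

# Made-up rows with an already assigned cluster id.
toy = pd.DataFrame({
    "Quantity": [6, 8, 120, 150, 2],
    "UnitPrice": [2.55, 3.39, 0.85, 0.72, 9.95],
    "cluster": [0, 0, 1, 1, 2],
})

# One row per cluster: size plus the mean of each numeric column.
summary = toy.groupby("cluster").agg(
    n_rows=("Quantity", "size"),
    mean_quantity=("Quantity", "mean"),
    mean_unit_price=("UnitPrice", "mean"),
)
```

KMeans numbers its clusters from 0, so with k clusters the ids run from 0 to k-1.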