# Overview of Week 3

This week's assignment consists of two parts. The first part gives you an introduction to unsupervised learning. In particular, we focus on techniques for clustering and dimensionality reduction and how they can be applied to ecommerce data. As you work through the three clustering case studies, you will find yourself generating many intermediate datasets, trying different models, and tuning each model as you go along. There's a lot to keep track of.   

This is where Part 2 comes in. Part 2 introduces the ideas of **workflow management** and **computational reproducibility**. Workflow management means organising your project directory to manage your analysis's artefacts (visualisations, processed datasets, notebooks, utility functions and experiment results). Ideally, your code for these should be clearly commented and use well-chosen names. Computational reproducibility means that someone else (including future you!) can take just the code and data and reproduce your project, from its results and models to its visualisations. How you practise workflow management and computational reproducibility can be quite a personal decision, so we provide guidelines, not rules. The most important thing is to have a system rather than no system at all.
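As a small illustration of "having a system", the sketch below writes every intermediate dataset to one predictable place. The `data/interim` folder, the `save_intermediate` helper and the `01_cleaned` name are hypothetical choices made for this example, not a layout the course prescribes.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_intermediate(df: pd.DataFrame, name: str, root: Path) -> Path:
    """Save df as <root>/data/interim/<name>.csv and return the path."""
    out_dir = root / "data" / "interim"  # hypothetical folder layout
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path


# Demo in a throwaway directory so the sketch does not touch a real project.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
    saved = save_intermediate(demo, "01_cleaned", Path(tmp))
    reloaded = pd.read_csv(saved)
```

Prefixing file names with a step number is one simple way to record the order in which the datasets were produced.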

**note about the week**   
While week 1's assignment was guided, with specific instructions about what code to run, as we move on the assignments will involve less hand-holding. For this week, we include some instructions, but leave the specific implementations up to you. There are also many techniques we cover. Again, while we share some resources, we leave the bulk of the research and background reading up to you to manage for yourself. 

**recap of the objectives for the first 6 weeks:**  
We aim to broadly cover a wide range of Machine Learning algorithms so that you can: 
- handle the technical demands of a 100E given some guidance on the right direction to take 
- handle a technical job interview and get hired 

*materials for unsupervised learning adapted from William Thji* 

# Part I : Unsupervised Learning 
Unsupervised learning refers to a set of machine learning techniques where no output variables (Y) are given. Only the input variables (X) are available, and our job is to find patterns in X. You can read more about it on *page 485 of Hastie, Tibshirani and Friedman's Elements of Statistical Learning*, available [here](https://web.stanford.edu/~hastie/Papers/ESLII.pdf). 

ESL by Hastie et al. will be the primary reference for this week, although feel free to find your own books and links. 

## Short introduction to clustering 
Clustering puts datapoints into subsets so that datapoints within a cluster are more closely related to one another than to datapoints in other clusters. More information is available from page 501 of *Elements of Statistical Learning*. 

Some quick points: 
- Clustering is extremely useful to many fields: 
    - Customer segmentation for personalised product recommendations
    - Topic identification to relieve the need to manually vet documents 
    - Image or geo-spatial segmentation to optimise supply and demand (Gojek does this) 
    - and maybe most importantly, getting a sense of the data before starting to model it. 

- Some examples of clustering algorithms: 
    - KMeans
    - Gaussian Mixture Models for drawing soft clustering boundaries instead of hard ones 
    - Hierarchical clustering
    - DBSCAN for density-based clustering and anomaly detection 
    - Co-clustering
    - Biclustering for analysing genes

## Clustering Case Study 1: Using PCA and clustering to create customer segments 
Context: The dataset we will be working with contains ecommerce transactions from a UK-based online retail store. The dataset is available on [Kaggle](https://www.kaggle.com/carrie1/ecommerce-data/home) or the UCI Machine Learning Repository. The dataset is quite small, so we have also included it in this repo as `data/data.csv`. 

From the Kaggle website: 

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."



In [None]:
import pandas as pd 
df = pd.read_csv('data/data.csv', encoding='ISO-8859-1')
df.head()

### Cleaning data 

Some data types are muddled, and there are duplicates, NA values and unreasonable values hiding in the columns.

1. Clean the dataset and save the output as an intermediate dataset 
2. List the steps taken to clean the data
3. Extra: Encapsulate the steps inside their own functions so they can be reused. Organise the functions into their own library
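One possible way to structure the cleaning steps as reusable functions is sketched below. It runs on a tiny synthetic frame rather than `data/data.csv`. The column names (`InvoiceNo`, `StockCode`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`), the date format and the rules for what counts as "unreasonable" are assumptions made for illustration, so check them against the real data before reusing anything.

```python
import pandas as pd

# Tiny synthetic frame standing in for data/data.csv; column names are assumed.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "C536367", "536368"],
    "StockCode": ["85123A", "85123A", "71053", "22633", "22632"],
    "Quantity": [6, 6, 8, -2, 3],
    "InvoiceDate": ["12/1/2010 8:26"] * 5,
    "UnitPrice": [2.55, 2.55, 3.39, 1.85, 0.0],
    "CustomerID": [17850.0, 17850.0, 17850.0, 13047.0, None],
})


def fix_dtypes(df):
    """Parse the invoice date (assumed month/day/year format)."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], format="%m/%d/%Y %H:%M")
    return out


def drop_duplicates_and_missing(df):
    """Drop exact duplicate rows and rows without a customer, then cast the ID."""
    out = df.drop_duplicates().dropna(subset=["CustomerID"]).copy()
    out["CustomerID"] = out["CustomerID"].astype(int)
    return out


def drop_unreasonable(df):
    """Keep rows with positive quantity and price (an assumed rule)."""
    return df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]


clean = raw.pipe(fix_dtypes).pipe(drop_duplicates_and_missing).pipe(drop_unreasonable)
```

Chaining the functions with `DataFrame.pipe` keeps the order of the steps visible, which also doubles as the list of cleaning steps.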

In [None]:
## your code here 

### Feature Engineering iteration #1 

Inside the dataset, each row contains information about an ecommerce transaction. However, we want to cluster the data by customers, which means each row should instead contain information about a customer. 

1. Reshape the data to look like the table below: 
![alt text](customer.png)

2. `data.columns` should give you `['UnitPriceMean','UnitPriceStd','TotalQuantity','NoOfUniqueItems','NoOfInvoices','UniqueItemsPerInvoice','QuantityPerInvoice','SpendingPerInvoice']` 

3. `data.shape` should give you (4339, 8)
4. Save this dataset as an intermediate dataset 
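The reshaping from one row per transaction to one row per customer can be sketched with a `groupby` and named aggregations. The toy transactions and the definition of each feature below (for example, spending as `Quantity * UnitPrice`, and the per-invoice columns as totals divided by the number of invoices) are assumptions; the table in `customer.png` is the reference for what each column should actually contain.

```python
import pandas as pd

# Toy transactions; the column names are assumed to match the cleaned dataset.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A", "A", "B", "C", "C"],
    "StockCode": ["x", "y", "x", "x", "z"],
    "Quantity": [2, 1, 4, 10, 5],
    "UnitPrice": [1.0, 3.0, 1.0, 2.0, 4.0],
})
tx["Spending"] = tx["Quantity"] * tx["UnitPrice"]

# One row per customer. UnitPriceStd is NaN for customers with a single row.
data = tx.groupby("CustomerID").agg(
    UnitPriceMean=("UnitPrice", "mean"),
    UnitPriceStd=("UnitPrice", "std"),
    TotalQuantity=("Quantity", "sum"),
    NoOfUniqueItems=("StockCode", "nunique"),
    NoOfInvoices=("InvoiceNo", "nunique"),
    TotalSpending=("Spending", "sum"),
)
data["UniqueItemsPerInvoice"] = data["NoOfUniqueItems"] / data["NoOfInvoices"]
data["QuantityPerInvoice"] = data["TotalQuantity"] / data["NoOfInvoices"]
data["SpendingPerInvoice"] = data["TotalSpending"] / data["NoOfInvoices"]
data = data.drop(columns="TotalSpending")  # helper column, not a final feature
```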

In [None]:
## your code here 

### Hierarchical clustering iteration #1 

1. Normalise the dataset from the section above and create a pairplot of the data
2. Apply hierarchical clustering to the dataset. [link to resource on hierarchical clustering]
3. Experiment with different linkage algorithms. Visualise the resulting trees side-by-side. Which linkage algorithm works best? 
4. List two ways to improve the clustering
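A minimal sketch of comparing linkage methods with SciPy follows. It uses synthetic blobs in place of the customer table, and it uses the cophenetic correlation as one numeric way to compare linkages; that metric is a suggestion added here, not something the assignment requires.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic blobs standing in for the customer table.
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(6, 1, (30, 3))])
X_scaled = StandardScaler().fit_transform(X)

# Cophenetic correlation: how well each tree preserves pairwise distances.
distances = pdist(X_scaled)
cophenetic = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X_scaled, method=method)
    cophenetic[method], _ = cophenet(Z, distances)

# Cut the Ward tree into (at most) two flat clusters.
Z_ward = linkage(X_scaled, method="ward")
labels = fcluster(Z_ward, t=2, criterion="maxclust")
```

Each linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the trees side by side.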

In [None]:
# your code here 

### K-means and GMM Clustering iteration #1 
Apart from hierarchical clustering, we can also apply KMeans and Gaussian Mixture Models (GMM) on the data

1. Implement K-means clustering on the data. You may want to create a dictionary or DataFrame to store the predicted labels for each value of `k` tried. For example, for `k = 4`, we might get a dictionary of `{0:50, 1:100, 2:40, 3:20}`, where the keys are the cluster labels and the values are the number of datapoints assigned to each label. 
2. For each value of `k`, print the number of members in each label class. 
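The bookkeeping described above could look like the sketch below, which stores one label-count dictionary per value of `k`. The synthetic blobs and the range of `k` are placeholders for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 40 points each.
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5, 10)])

label_counts = {}  # maps k -> {cluster label: number of members}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, counts = np.unique(km.labels_, return_counts=True)
    label_counts[k] = dict(zip(labels.tolist(), counts.tolist()))

for k, counts_for_k in label_counts.items():
    print(k, counts_for_k)
```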

In [None]:
# your code here 

### Model improvement iteration #1 

1. One way of improving our model is to identify and remove outliers in the data. Clustering can help us do this. Using a clustering algorithm, identify and remove potential outliers in the data. More resources on how to implement clustering to identify outliers can be found here [link]
2. Create two visualisations, one of the data before applying clustering and one of the data after clustering. 
3. Identify and remove outliers from the data WITHOUT using a clustering algorithm
4. Create a boxplot of each column in the dataset after outliers have been removed. 
5. Save the normalised dataset with no outliers as an intermediate dataset 
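Both approaches to outlier removal can be sketched on synthetic data. DBSCAN is used here as the clustering-based option and the 1.5 × IQR rule as the non-clustering option; both choices, along with `eps=0.5` and `min_samples=5`, are assumptions for illustration and would need tuning on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, (200, 2))
extremes = np.array([[15.0, 15.0], [-12.0, 14.0], [20.0, -18.0]])
df = pd.DataFrame(np.vstack([inliers, extremes]), columns=["f1", "f2"])

X = StandardScaler().fit_transform(df)

# Option 1: DBSCAN labels points in low-density regions as -1 (noise).
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
no_outliers_dbscan = df[db_labels != -1]

# Option 2 (no clustering): keep rows inside 1.5 * IQR on every column.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
within = ((df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)).all(axis=1)
no_outliers_iqr = df[within]
```

Note that DBSCAN may also flag some sparse points on the fringe of the main cloud, so it is worth inspecting what was removed before saving the result.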

In [None]:
# your code here 

### Hierarchical clustering iteration #2 

1. Using the dataset created from `Model Improvement Iteration #1`, implement hierarchical clustering again. Are there improvements? 

In [None]:
# your code here

### K-means and Gaussian Mixture Model (GMM) Clustering iteration #2 

1. Using the dataset created from `Model Improvement Iteration #1`, implement `K-means` and `GMM` clustering again. 
2. Run multiple experiments by testing k=1,...,20 for K-means and components=1,...,20 for Gaussian Mixture Models 
3. Plot a graph of the silhouette score against k for K-means and plot a graph of the silhouette score against number of components for GMM 
4. How would you select the optimal number of clusters using the silhouette score? 
5. Explain one other way of validating your clusters 
6. Plot a graph of the cluster center coordinates against number of clusters for K-means. Does this help you choose the best number of clusters? 
7. Plot a graph of the means of each GMM component distribution against the number of components for GMM. Does this help you choose the best number of clusters? 
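The silhouette experiments can be sketched as a loop that records one score per model size. One caveat: scikit-learn's `silhouette_score` needs at least two distinct labels, so this sketch starts at `k=2` rather than `k=1`. The synthetic blobs and the shortened range of `k` are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each.
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

kmeans_scores, gmm_scores = {}, {}
for k in range(2, 8):  # silhouette is undefined for a single cluster
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    kmeans_scores[k] = silhouette_score(X, km_labels)
    gmm_labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    gmm_scores[k] = silhouette_score(X, gmm_labels)

# One common heuristic: pick the size with the highest silhouette score.
best_k_kmeans = max(kmeans_scores, key=kmeans_scores.get)
best_k_gmm = max(gmm_scores, key=gmm_scores.get)
```

Plotting `kmeans_scores` and `gmm_scores` against their keys gives the two graphs asked for in step 3.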

In [1]:
# Your code here 

### K-means and GMM Clustering iteration #3

1. Choose a subset of columns from the ecommerce dataset with the outliers removed 
2. Run multiple experiments by testing k=1,...,20 for K-means and components=1,...,20 for Gaussian Mixture Models 
3. Plot a graph of the silhouette score against k for K-means and plot a graph of the silhouette score against number of components for GMM
4. Based on the silhouette scores for the GMM model, choose the optimal number of components for this dataset
5. For each mixture component in the GMM model from step 4, plot their covariances for each column in the dataset. Based on this plot, which columns should we keep? 
6. Create a DataFrame with only the columns you chose to keep in step 5. Save this DataFrame as an intermediate dataset 
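One reading of "plot their covariances for each column" is to compare the per-column variances of each mixture component, which sit on the diagonal of each covariance matrix. The sketch below extracts them into a DataFrame that can then be plotted. The column names `f1` to `f3`, the synthetic data and the choice of `covariance_type="full"` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3"]  # placeholder column names
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))]),
    columns=cols,
)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# For "full", covariances_ has shape (n_components, n_features, n_features);
# the diagonal of each matrix holds that component's per-column variances.
variances = pd.DataFrame(
    [np.diag(cov) for cov in gmm.covariances_],
    columns=cols,
).rename_axis("component")
```

`variances.plot.bar()` would then show one group of bars per component.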

In [2]:
# Your code here 

### GMM Clustering iteration #4 

1. Based on the intermediate dataset created from iteration #3, again run n_components=1,...,20 for the GMM
2. Plot the silhouette scores against n_components. Based on this plot, choose the optimal value for n_components and fit a GMM model using this optimal value. What are the value counts of each mixture component? 
3. Create a DataFrame with the columns you chose to keep as columns and the means of the mixture components as rows. 
4. Save this final DataFrame. These are the customer segments that will be used for Case Study 2
5. Create a heatmap of the DataFrame of customer segments. What does this tell us about the information contained within each mixture component? 
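The value counts and the table of component means can be sketched as below. The two-column synthetic data and `n_components=2` are placeholders; on the real data, the columns and the number of components come from the earlier steps.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
cols = ["f1", "f2"]  # placeholder column names
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(8, 1, (40, 2))]),
    columns=cols,
)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Number of datapoints assigned to each mixture component.
component = pd.Series(gmm.predict(X), name="component")
counts = component.value_counts().sort_index()

# One row per component, one column per feature: the "customer segments".
segments = pd.DataFrame(gmm.means_, columns=cols).rename_axis("component")
```

Passing `segments` to `seaborn.heatmap` is one way to produce the heatmap in step 5.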

### Clustering the outliers 

1. Do the outliers themselves cluster into subgroups, or are they distributed randomly?  

In [3]:
# Your code here 

### PCA for Dimensionality Reduction 

1. Using the intermediate dataset that was normalised with outliers removed, construct a pairplot again. How is it different from the first plot and how might you interpret it? 
2. Apply PCA to the normalised dataset with outliers removed. More information on PCA here [link]
3. Create a plot of cumulative explained variance and number of components. How does this inform you about the best number of components to select? 
4. Create a plot of PC0 against PC1, coloured by the GMM's predictions on the normalised dataset with outliers removed for `n_components=7`
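The cumulative explained variance can be computed directly from a fitted `PCA` object, as sketched below on synthetic data. The 95% threshold used to pick a number of components is an assumed rule of thumb, not a value given in the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Six observed columns driven by two latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95% (assumed).
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# First two principal component scores (PC0 and PC1) for a scatter plot.
scores = pca.transform(X_scaled)[:, :2]
```

Plotting `cumulative` against `range(1, len(cumulative) + 1)` gives the curve for step 3, and `scores` can be coloured by the GMM labels for step 4.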