<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>

# Lab 1: Customer Clustering & Visualization

## Objective & Description:
This Lab focuses on clustering and visualizing bank customers' data, which includes variables such as date of birth (DOB), account balance, transactions, transaction amount, and transaction place. We will use publicly available data from a bank in India to learn how to preprocess data and to prepare it for clustering and visualization.

The objective of the lab is to equip participants with the skills to analyze and gain insights from customer data, which they can use to improve business strategies and decision-making processes. Participants will learn how to use Python programming language libraries such as Pandas, NumPy, and Scikit-learn to preprocess the data, perform clustering using algorithms such as K-Means, Hierarchical Clustering, and DBSCAN, and visualize the results using techniques such as scatter plots, 3D plots, and heatmaps.

If you don't know Python, don't worry, we've got you covered. Here are some resources for learning Python:
- [Introduction to Python - DataCamp](https://www.datacamp.com/courses/intro-to-python-for-data-science)
- [Python for Data Science: Fundamentals  - DataQuest](https://www.dataquest.io/course/python-for-data-science-fundamentals/)
- [Python for Data Science: Intermediate  - DataQuest](https://www.dataquest.io/course/python-for-data-science-intermediate/)
- [Maybe you prefer books](https://pythonbooks.org/free-books/)


## Tasks:
The workshop consists of multiple tasks which have to be solved sequentially:

1. **Data cleaning.** Clean the bank customer data by handling missing values, identifying and handling outliers, and transforming the data as needed. 

2. **Data exploration.** Use appropriate visualization techniques to identify patterns in the data. Create visualizations that clearly and accurately represent the relationships between different features of the data. This is critical for assessing your trained model in the next step.

3. **Train your clustering model.** Apply clustering techniques to the preprocessed data and compare the results of at least two different clustering algorithms. Assess the overall quality of your clustering models with suitable evaluation metrics.

4. **Visualize clustering results.** Create visualizations that accurately represent the clustering results. Use appropriate visualization techniques to highlight the differences between different clusters and provide insights into the patterns present in the data.

5. **Data interpretation and conclusions.** Interpret and analyze the results of your clustering models and visualizations. Draw meaningful insights from the analysis and present your findings in a clear and concise manner.

Follow the steps in this Jupyter notebook.


### Data Cleaning

#### Task 1.1
We have preloaded the dataset into this environment. You are expected to do the following:

1. Load the dataset to the notebook as a pandas DataFrame.
<details>
    <summary>hint:</summary>
Pandas has the function `read_csv`.
Look for the CSV file in the side panel, copy its path, and pass it to that function.
</details>

2. Inspect the data.
<details>
    <summary>hint:</summary>
Pandas Dataframe has the method `head`
</details>

3. You are provided with the function `data_status`. Call it on the DataFrame. What do you observe?
<details>
    <summary>hint:</summary>
Maybe wrong types?
</details>

4. Look at the info of your data. What do you notice about the counts?
<details>
    <summary>hint:</summary>
A pandas DataFrame has the method `info`. Invoke it on your data. Are there any null values?
</details>
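
As a rough illustration of the hints above, here is a minimal sketch on a tiny in-memory CSV rather than on the lab dataset. The column names and values are made up for the example; only `read_csv`, `head`, and `info` come from the hints.

```python
import io
import pandas as pd

# A tiny CSV standing in for the real file (hypothetical columns and values).
csv_text = """CustomerID,CustGender,CustAccountBalance
C1,F,17819.05
C2,M,
C3,F,973.46
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))              # first two rows
toy.info()                      # dtypes and non-null counts per column
null_counts = toy.isna().sum()  # number of missing values per column
print(null_counts)
```

With the real data you would pass the path of the CSV file instead of the `StringIO` object.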

In [None]:
!git clone https://github.com/khengari77/Lab-1-Customer-Clustering-Visualization.git
!pip install umap-learn

In [None]:
import pandas as pd
import numpy as np

In [None]:
# load your data by passing the csv file path to the read_csv function.
transactions = pd.read_csv('')

In [None]:
def data_status(df):
    """Summarize each column: Python type of its first value, duplicates, nulls."""
    stat = []
    for column in df.columns:
        # Use .iloc[0] (positional) so this still works after rows are dropped
        # and the index no longer contains the label 0.
        data_type = type(df[column].iloc[0])
        num_dups = df[column].duplicated().sum()
        num_null = df[column].isna().sum()
        stat.append([column, data_type, num_dups, num_null])
    status_frame = pd.DataFrame(
        stat, columns=['column', 'data type', 'no. duplicates', 'no. null']
    )
    return status_frame

In [None]:
# Solve Task 1.1.1 here


In [None]:
# Solve Task 1.1.2 here

In [None]:
# Solve Task 1.1.3 here

#### Task 1.2
What did you notice? Probably some missing values, wrong types, and features we don't need.
Drop any null values and any columns that we don't need.
<details>
    <summary>hint:</summary>
    To drop null values we have the method `dropna`. To drop a certain column or row we have the `drop` method. Remember that you can always look in the pandas documentation.
</details>
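
A minimal sketch of `dropna` and `drop` on a toy DataFrame may help here. The column names are invented for the example and are not the columns of the lab dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one column we do not need.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "balance": [100.0, np.nan, 250.0],
    "note": ["x", "y", "z"],
})

no_nulls = toy.dropna()                    # drop rows containing any missing value
trimmed = no_nulls.drop(columns=["note"])  # drop a column by name
print(trimmed)
```

Both methods return a new DataFrame by default, so `toy` itself is left unchanged.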

In [None]:
# Solve Task 1.2 here

#### Task 1.3
It is often said that data cleaning makes up 80% of a data scientist's work. We have quite some cleaning to do if we want to get the most out of our data.

Convert columns with wrong types to a more suitable form. The DOB column was processed for you. You only have to work through the other columns. Feel free to add any other columns that you can derive from the data, or to drop any column once you have extracted the useful data from it.

**Remember that it's always good practice to make a copy of your DataFrame after each major manipulation.** Use the method `copy`.

**Also, remember to inspect your data after each step using the methods in (Task 1.1).** They are quite handy. They also show you if your process succeeded or not.
<details>
    <summary>hint:</summary>
    1. Columns in pandas DataFrames are basically pandas Series. A great feature of a Series is that you can apply operations to it element-wise. Suppose we have a Series with values on a scale from 0 to 2 and we want to rescale it to 0 to 1. You can do something like: `series/2`
     <br> <br>
    2. Pandas has the function `to_datetime`, which converts many date or time representations to datetime objects.
     <br> <br>
    3. Rare categories should be dealt with either by merging them into a single "UNKNOWN" category or by using a statistical method to eliminate them.
</details>
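
The three hints above can be sketched on toy Series as follows. The values, the date format, and the rarity threshold of 2 are assumptions chosen for illustration, not properties of the lab dataset.

```python
import pandas as pd

# 1. Element-wise arithmetic on a Series: rescale 0-2 to 0-1.
scores = pd.Series([0.0, 1.0, 2.0])
rescaled = scores / 2

# 2. Parse date strings; an explicit format avoids day/month ambiguity.
raw_dates = pd.Series(["2/8/2016", "15/9/2016"])
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# 3. Collapse rare categories into a single "UNKNOWN" label.
cities = pd.Series(["MUMBAI", "MUMBAI", "DELHI", "DELHI", "AGRA"])
counts = cities.value_counts()
rare = counts[counts < 2].index  # categories seen fewer than 2 times
grouped = cities.where(~cities.isin(rare), "UNKNOWN")

print(rescaled.tolist(), parsed.dt.month.tolist(), grouped.tolist())
```

On real data the threshold for "rare" is a judgment call, so it is worth looking at `value_counts()` before picking one.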

In [None]:
# Feel free to change the variable names.
# Some dates have a malformed string representation, so we keep only the rows
# whose CustomerDOB fully matches the pattern d/m/yy; missing values count as no match.
# Filtering in pandas has the general syntax: df = df[boolean_mask]
cleaned_transactions = transactions[
    transactions['CustomerDOB'].str.fullmatch(r"\d{1,2}/\d{1,2}/\d{2}", na=False)
]

In [None]:
# We could have used pd.to_datetime, but with two-digit years
# the %y directive maps 00-68 to 2000-2068, so most birth years
# before 1969 would end up in the future.
# We therefore split the strings and rebuild the year ourselves.
# See the pandas docs:
# https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
dates = cleaned_transactions['CustomerDOB'].copy().str.split("/")

In [None]:
# This data was collected in 2016, so we assume that any
# two-digit birth year between 00 and 09 belongs to the 2000s.
# This might not be true in every case, but it does little harm
# because the share of such cases is negligible.

def full_year(two_digit_year):
    """Expand a two-digit year: 00-09 -> 2000-2009, 10-99 -> 1910-1999."""
    if two_digit_year > 9:
        return 1900 + two_digit_year
    return 2000 + two_digit_year

year = dates.apply(lambda parts: full_year(int(parts[2])))

In [None]:
# Solve task 1.3 here & below
# you can add a new cell by pressing ESC then b on your keyboard

### Data Exploration
There is really no limit to how you can explore data other than your imagination. You can get some summary metrics of your data by invoking the `describe` method. A better way to explore data is through visualizations. [Matplotlib](https://matplotlib.org/stable/index.html) and [Seaborn](https://seaborn.pydata.org/) are two well-known libraries for data visualization. Here are some ideas:
- correlation matrix.
- age distribution.
- gender distribution.
- income distribution.
- transactions amount vs income.
- number of transactions vs income.
- income vs age.
- income vs gender
- time vs transactions amount
- date vs transactions amount
- time vs number of transactions
- date vs number of transactions

Feel free to do other investigations. Don't limit your imagination!
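
As one possible starting point, the sketch below draws an age histogram and a correlation matrix with plain Matplotlib. The data is synthetic and the column names (`age`, `balance`, `amount`) are placeholders, so treat it as a pattern to adapt rather than as the expected solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the cleaned data (hypothetical columns).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "balance": rng.lognormal(mean=10, sigma=1, size=200),
})
toy["amount"] = toy["balance"] * rng.uniform(0.01, 0.1, size=200)

summary = toy.describe()  # count, mean, std, min, quartiles, max
corr = toy.corr()         # pairwise Pearson correlations

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(toy["age"], bins=20)
axes[0].set_title("Age distribution")
axes[0].set_xlabel("age")

# Show the correlation matrix as a heatmap.
image = axes[1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns)
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Correlation matrix")
fig.colorbar(image, ax=axes[1])
```

Seaborn's `heatmap` and `histplot` give similar plots with less code if you prefer them.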

In [None]:
# Now we have a clean dataset that we can work with. 
# Let's start by importing some visualization libraries.
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
font = {'weight' : 'bold',
        'size'   : 16}

matplotlib.rc('font', **font)

### Data Preprocessing
Most machine learning models take numbers as their inputs, but our data contains categorical features and time features. To solve this problem we have to convert these categories to numbers. One way to do that is [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). This method turns each category into a column of its own and assigns 1 to the samples that belong to that category and 0 to those that don't. When the categorical data is binary (like our gender data), one of the two columns is redundant and can be omitted.


The order of operations that gave the fastest runtime for me is:
 1. Copy the location column.
 2. Apply one-hot encoding to the gender and location columns.
    Note that you don't need the `OneHotEncoder` object for gender; you can do the conversion directly, for example with the `apply` method.
 3. Drop the string columns.
 4. Normalize the columns.
 5. Concatenate the location data in its one-hot format.
 6. Reduce the dimension of the data.

This order is not necessarily the only correct one, but it worked best for me.

In [None]:
from sklearn.preprocessing import OneHotEncoder, minmax_scale
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

#### Task 3.1

Convert the location data and the gender data to one-hot encoding format

#### Task 3.2

Convert the time and date data into a numerical representation
<details>
    <summary>hint:</summary>
Consider values that repeat and affect the data. Data that doesn't change, or changes only minimally, can be ignored. For example, all our data is from 2016, so we can ignore the year.
</details>
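
One way to read this hint is to extract the repeating parts of a timestamp (month, day of week, hour) as numbers. The sketch below does that on two made-up timestamps; the optional sine/cosine step is an extra assumption, not something the lab requires.

```python
import numpy as np
import pandas as pd

# Two made-up timestamps standing in for the transaction date and time.
stamps = pd.to_datetime(pd.Series(["2016-08-02 14:30:00", "2016-09-15 23:05:00"]))

features = pd.DataFrame({
    "month": stamps.dt.month,
    "day_of_week": stamps.dt.dayofweek,  # Monday = 0, Sunday = 6
    "hour": stamps.dt.hour,
})

# Optional cyclical encoding, so that hour 23 ends up close to hour 0.
features["hour_sin"] = np.sin(2 * np.pi * features["hour"] / 24)
features["hour_cos"] = np.cos(2 * np.pi * features["hour"] / 24)
print(features)
```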


#### Task 3.3
It is usually better to normalize our data into a controlled range, for example [0, 1]. Scikit-learn has various tools for normalizing data. One well-known method is min-max scaling. Apply it to your data, feature by feature.
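
Min-max scaling maps each value $x$ of a feature to $(x - \min) / (\max - \min)$. A minimal sketch with scikit-learn's `minmax_scale` on a toy DataFrame (hypothetical columns) could look like this:

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

toy = pd.DataFrame({"age": [20, 30, 60], "balance": [0.0, 500.0, 2000.0]})

# minmax_scale works column by column (axis=0) by default and returns a NumPy array,
# so we wrap the result back into a DataFrame with the original labels.
scaled = pd.DataFrame(minmax_scale(toy), columns=toy.columns, index=toy.index)
print(scaled)
```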

#### Task 3.4
Inspect your data now. Wow, that is a lot of features! High-dimensional data is hard to train on and often leads to poor results. Fortunately, there are techniques for reducing the dimensionality of data, such as [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and [autoencoders](https://en.wikipedia.org/wiki/Autoencoder). Now use PCA to reduce the number of features in your data to a sensible number.

Note: *Reducing the dimensions of your data isn't a lossless operation. You will lose information; the goal is to lose only the information you don't need. The number of features to keep is a hyperparameter, and you will likely have to experiment with different values until you find a suitable one.*
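
To get a feel for this trade-off, the sketch below runs PCA on synthetic data and inspects how much variance the kept components explain. The data and the 95% target are assumptions chosen for illustration; a suitable value for the lab data has to be found by experiment.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 observed features driven by only 3 hidden factors, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 20))

# A float n_components between 0 and 1 keeps just enough components
# to explain that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape, cumulative)
```

Passing an integer instead, for example `PCA(n_components=10)`, fixes the number of components directly.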

### Training your model
At last you have reached the main part! That was a long journey, wasn't it? Many clustering algorithms are in use in the ML industry: K-Means, DBSCAN, and Hierarchical Clustering, to name a few. You are required to implement K-Means on the data. Choosing the right number of clusters can be tricky, but fortunately there are heuristics that help us determine a good number of clusters. Two of the most popular methods are:

1. The elbow method
2. The silhouette method

If you want to learn more, read [here](https://towardsdatascience.com/how-many-clusters-6b3f220f0ef5).

You are provided with a function that calculates the silhouette score for a range of values of K, starting from 2 and going up to (but not including) the number you specify.

I will leave the actual interpretation of the silhouette score as an exercise for the reader.
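
The elbow method mentioned above is not covered by the provided function, so here is a minimal sketch of it on synthetic blobs. The data, the range of K, and `n_init=10` are assumptions for the example. The idea is to plot the K-Means inertia (within-cluster sum of squared distances) against K and look for the point where the curve stops dropping sharply.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

ks = list(range(1, 8))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squares

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("K")
ax.set_ylabel("Inertia")
ax.set_title("Elbow plot")
```

On real data the elbow is often less pronounced than on toy blobs, which is why it helps to compare it with the silhouette plot.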

In [None]:
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score
from sklearn.utils import resample
import matplotlib.cm as cm

In [None]:
def silhouette_plot(data, max_clusters=16, n_samples=256, rounds=50):
    """Plot the mean sampled silhouette score for K = 2 .. max_clusters - 1."""
    ks = range(2, max_clusters)
    silhouette_scores = []
    for k in ks:
        clusterer = KMeans(n_clusters=k, n_init='auto')
        cluster_labels = clusterer.fit_predict(data)
        # The full silhouette score is expensive, so average it over
        # several random subsamples of n_samples points each.
        mean_score = np.mean([
            silhouette_score(data, cluster_labels, sample_size=n_samples)
            for _ in range(rounds)
        ])
        silhouette_scores.append(mean_score)
        print(f"For n_clusters = {k} the average silhouette_score is: {mean_score}")
    plt.figure(figsize=(20, 8))
    plt.plot(ks, silhouette_scores)
    plt.title("Silhouette Score per K")
    plt.xlabel("K")
    plt.ylabel("Silhouette Score")
    plt.show()

In [None]:
# Run the silhouette_plot here

In [None]:
# Implement a K-means clusterer here. 
# There are few hyperparameters that you can tweak.
# Look at scikit-learn documentation.

### Evaluate the results of clustering
In most cases you don't have well-defined categories, so you can't evaluate your model as you would in supervised learning. Nonetheless, there are [other methods](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) that we can use to reason about our model. One of them is the [Calinski-Harabasz Index](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index). Read the documentation at the previous link and apply this method to your model. What do you think about the results?
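
As a hedged illustration, the sketch below computes the Calinski-Harabasz index with scikit-learn's `calinski_harabasz_score` for a few values of K on synthetic blobs. The data and the chosen values of K are assumptions for the example. A higher score indicates denser, better separated clusters, and the score is only meaningful when comparing clusterings of the same data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with three groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

for k, score in scores.items():
    print(f"K = {k}: Calinski-Harabasz index = {score:.1f}")
```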

Another useful way to reason about your model is to visualize the clusters. Unfortunately, our data lives in a high-dimensional space, so we can't plot it directly. A common technique in these situations is [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). t-SNE is in essence a dimensionality reduction algorithm, but it is often used to project data into 2D space for visualization. Unfortunately, it can take hours to finish on large datasets. So we will use another method, namely [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html), to reduce our data to 2D space. Plot your results using a library such as matplotlib.

In [None]:
from umap import UMAP

## Finished early?
Don't worry, we've got you covered. Here are a few things to consider:
- We talked about dimensionality reduction methods. We used PCA, but it has drawbacks. The most serious one is that it is a linear method, so it cannot capture non-linear structure, which real data often has. A more powerful method for reducing dimensionality is using autoencoders. Autoencoders are neural networks that learn to represent the data in a lower-dimensional space. [Read here about autoencoders and try implementing one by yourself](https://blog.keras.io/building-autoencoders-in-keras.html)

- In this notebook we used K-means. K-means has two serious drawbacks:
  1. It assumes spherical clusters.
  2. It is sensitive to noise and outliers.

If you want to read further: [Click here](https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages).
Our data had some outliers, which make K-means a less suitable method for this problem. A method that is more robust to outliers is DBSCAN. [Read here about DBSCAN and try applying it by yourself](https://scikit-learn.org/stable/modules/clustering.html)
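
To show how DBSCAN treats outliers, here is a small sketch on synthetic data with two dense groups and two far-away points. The values of `eps` and `min_samples` are assumptions that happen to suit this toy data; on the lab data they need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense groups plus two obvious outliers appended at the end.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.3, random_state=0)
outliers = np.array([[20.0, 20.0], [-15.0, 10.0]])
X = np.vstack([X, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1 instead of forcing them into a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Unlike K-means, DBSCAN does not need the number of clusters in advance, but its results depend strongly on `eps`.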

## Give us your feedback!
[Please fill in this form and give us your feedback](https://forms.gle/hg9VRLUdnocU4G3Y7)