#**Scenario:** Unsupervised Learning: Customer Market Segmentation



The marketing department of any enterprise is responsible for promoting the enterprise's products, ideas and mission, finding new customers, and reminding existing customers that the enterprise is in business.

###**Problem Statement:**

You have been hired as a Data Scientist and have to perform Customer Market Segmentation, which will help the bank's marketing team launch a targeted ad campaign tailored to specific groups of customers. In simple words, the bank wants to divide its customers into different groups for the ad campaign to be successful.

This process is called **Market Segmentation**, and it is crucial for maximizing the conversion rate of a marketing campaign.

###**Aim:**

In this demo, you have to build a clustering model to divide customers into distinct groups

###**Dataset Description:**
The [**data set**](https://www.dropbox.com/s/6v54wro81mlyp4x/marketing_data.csv) contains the following attributes:

- **CUST_ID** -  Identification of Credit Card holder (Categorical)
- **BALANCE** - Balance amount left in their account to make purchases
- **BALANCE_FREQUENCY** - How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- **PURCHASES** - Amount of purchases made from account
- **ONEOFF_PURCHASES** - Maximum purchase amount done in one-go
- **INSTALLMENTS_PURCHASES** - Amount of purchase done in installment
- **CASH_ADVANCE** -  Cash in advance given by the user
- **PURCHASES_FREQUENCY** - How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- **ONEOFF_PURCHASES_FREQUENCY** - How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- **PURCHASES_INSTALLMENTS_FREQUENCY** - How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- **CASH_ADVANCE_FREQUENCY** - How frequently cash in advance is being paid
- **CASH_ADVANCE_TRX** - Number of transactions made with "Cash in Advance"
- **PURCHASES_TRX** - Number of purchase transactions made
- **CREDIT_LIMIT** - Limit of Credit Card for user
- **PAYMENTS** - Amount of Payment done by user
- **MINIMUM_PAYMENTS** - Minimum amount of payments made by user
- **PRC_FULL_PAYMENT** - Percent of full payment paid by user
- **TENURE** - Tenure of credit card service for user

###**Tasks to be performed:**
- Import required libraries and load the data set from Dropbox
- Perform Exploratory Data Analysis (EDA) on the data set
  -  Generate a Data Report using Pandas Profiling and record your observations
  - Plot **Univariate Distributions**
    - What is the distribution of the Purchase & Purchases Frequency columns in the data set?
    - What's the tenure of the credit card service for user?
    
  - Plot **Bi-Variate Distributions**
    - How does the tenure of a credit card user affect their purchases?
    - Analyse the data set for outliers using a Box Plot
    
- Pre-process that data set for modeling
  - Handle Missing values present in the data set
  - Deal with extreme values 
  - Perform scaling on the data set using StandardScaler
- Modelling
  - Build a K-Means Clustering Model and then, apply PCA to visualize the clusters and use Elbow Method to find optimal value of K 
  - Build a Hierarchical Model
  - Build a DBSCAN Model
  - Build a Spectral Clustering Model
- Use **PyCaret** to build and analyze the models:

  - Import PyCaret and load the data set
  - Initialize or setup the environment 
  - Compare Multiple Models and their Accuracy Metrics
  - Create the model
  - Assign the model
  - Plot the model


###**Install the required dependencies, import the required libraries and load the data set**


####**Installing Dependencies**

**Note:** After installing Pandas Profiling, restart the kernel and run from the top, excluding the installation cell

- It's better to install all the libraries at the top

In [1]:
#Installing Pandas-Profiling

!pip install pandas-profiling==2.7.1 


In [2]:
!pip install pycaret


In [4]:
#Downloading the data set from Dropbox
#Please run this cell in Google Colab

#!wget https://www.dropbox.com/s/6v54wro81mlyp4x/marketing_data.csv

####**Importing Required Libraries**

In [5]:
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling
import plotly.express as px
import scipy.cluster.hierarchy as shc
import seaborn as sns
from pandas_profiling import ProfileReport
from scipy import stats
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, SpectralClustering
from sklearn.metrics import accuracy_score, confusion_matrix, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#Use seaborn's default plot styling
sns.set()

#Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


print('Libraries Imported')

Libraries Imported


In [6]:
#To be filled by learner

#Read the dataset 
df = pd.read_csv('marketing_data.csv')

#Print the top 5 rows
df.head()

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


###**Exploratory Data Analysis**

**Note:** If you want to learn more about Pandas-Profiling [**Click Here!**](https://pypi.org/project/pandas-profiling/)

In [None]:
#To be filled by learner

#Pass the data frame to the ProfileReport function
#Generating a Data Report

In [None]:
#To be filled by learner

#Display the notebook in Colab itself

In [None]:
#Generating a Pandas Profiling Report 


Please refer to the HTML file created by the name of **output.html**

**Note:** Answer the following questions:

- Which columns are highly **skewed**?
- Which columns have high **kurtosis**?
- Which columns have the wrong data type?
- Which columns seem to have **outliers** based on **min**, **max** and **percentile values**, **IQR range** along with the **standard deviation** and **mean absolute deviation**?
- Which columns have missing values? (Check the **Missing Values** section in **Pandas Profiling**)
- Which columns have a high percentage of zero/infinite values? Make sure that these zero/infinite values are supposed to be there

**For Example:** Weight cannot be zero/infinite, so any percentage of zero/infinite values in that column is erroneous
- Which columns have **high variance** and **standard deviation**?
- Comment on the distribution of the continuous values **(Real Number: ℝ≥0)**
- Do you see any alarming trends in the extreme values (minimum 5 and maximum 5)?
- How many boolean columns are there in the data set and out of those how many are imbalanced?

**For Example:** Gender Male and Female in which Male is **95%** and Female is just **5%**
- Check for **duplicate records** across all columns (**Check Warning Section**)
- How many columns are categorical?
  - Are those categories in sync with the domain categories?
  - Check if all the categories are unique and they represent distinct information
  - Is there any imbalance in the categorical columns?

Based on the above questions and your observations, chart out a plan for **Data Pre-processing** and feature engineering

**Note:** Feature Engineering (Feature Selection and Feature Creation)

- From the **Interaction Tab**, write at least 3 observations that may be very crucial for prediction. Make sure that they are in story format

**For Example:** Av monthly hours vs Satisfaction Level..

- Check the **Pearson** and **Spearman** tabs in the **correlation** section and note down the columns which are highly correlated (positive and negative correlation). Create two bands of thresholds (consider 0.6 to 0.8, or 0.8 to 1.0, as high).


####**Plotting Univariate Distributions**
A **Univariate distribution** is a probability distribution of only one random variable

**Note:** You have already seen this in Pandas Profiling. Still, if you want to write the code, you can do so.

What is the distribution of the **PURCHASES** & **PURCHASES_FREQUENCY** columns in the data set?


In [None]:
#To be filled by learner

___
**Observations:**
- 
- 
___

In [None]:
#To be filled by learner

**What's the tenure of the credit card service for user?**


In [None]:
#To be filled by learner

___
**Observations:**
- 
- 
___

####**Bi-variate Distributions**
- A Bi-variate distribution is a distribution of two random variables
- The concept generalizes to any number of random variables, giving a **Multivariate Distribution**

**How does the tenure of a credit card user affect their purchases?**

In [None]:
#To be filled by learner

___
**Observations:**

- 
- 
___

**Analyse the data set for outliers using a Box Plot**

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner
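
As a hedged illustration of what a box plot flags as an outlier, the sketch below applies the usual 1.5 × IQR whisker rule by hand. The `balance` series is synthetic stand-in data (an assumption), not the real `BALANCE` column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a skewed column such as BALANCE (assumption, not the real data)
rng = np.random.default_rng(0)
balance = pd.Series(
    np.append(rng.normal(1500, 300, size=200), [9000.0, 12000.0]),
    name="BALANCE",
)

# A box plot conventionally draws its whiskers 1.5 * IQR beyond the quartiles
q1, q3 = balance.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the ones a box plot would draw as individual dots
outliers = balance[(balance < lower_fence) | (balance > upper_fence)]
print(f"{len(outliers)} potential outliers outside [{lower_fence:.1f}, {upper_fence:.1f}]")
```

On the real data, the same rule can be applied column by column, or the plot itself can be drawn with `sns.boxplot`.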

___
**Observations:**
- 
- 
___

###**Data Pre-processing**

**Handle Missing Values present in the data set**

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

#Dropping the CUST_ID Column
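
A minimal sketch of one way to approach these steps, assuming median imputation is acceptable (the mean or a model-based imputer are alternatives). The `toy` frame is hypothetical and only mimics the column layout of the data set.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame that mimics the layout of the data set
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "CREDIT_LIMIT": [1000.0, 7000.0, np.nan, 1200.0],
    "MINIMUM_PAYMENTS": [139.5, np.nan, 627.3, 244.8],
})

# Count missing values per column
print(toy.isnull().sum())

# Impute numeric gaps with the column median (one common choice among several)
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# CUST_ID is an identifier and carries no information for clustering
toy = toy.drop(columns="CUST_ID")
print(toy)
```

The median is less sensitive to extreme values than the mean, which matters for skewed monetary columns.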

**Checking for correlation using heatmap**

In [None]:
#To be filled by learner

**Removing highly correlated features(correlation>=0.8)**

In [None]:
#To be filled by learner
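
One possible way to drop highly correlated features is sketched below on synthetic columns `A`, `B` and `C` (hypothetical names). It keeps the first column of each correlated pair and drops the second; which column to keep is a judgment call on real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "A": a,
    "B": a * 2 + rng.normal(scale=0.1, size=500),  # nearly a rescaled copy of A
    "C": rng.normal(size=500),                      # independent of A and B
})

# Absolute Pearson correlations; keep only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair whose correlation is at least 0.8
to_drop = [col for col in upper.columns if (upper[col] >= 0.8).any()]
reduced = toy.drop(columns=to_drop)
print("Dropped:", to_drop)
```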

**Let us take care of outliers**

In [None]:
#To be filled by learner

**Let us use Z-Score to take care of Outliers**

In [None]:
# To be filled by learner

In [None]:
# To be filled by learner

In [None]:
# To be filled by learner
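
A hedged sketch of Z-score filtering on synthetic data follows. The cut-off of 3 standard deviations is a common convention, not a value taken from this notebook, and the column values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "PURCHASES": np.append(rng.normal(1000, 200, size=300), 50000.0),  # one extreme value
    "PAYMENTS": np.append(rng.normal(1500, 300, size=300), 1500.0),
})

# Z-score: how many standard deviations each value lies from its column mean
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep rows where every column is within 3 standard deviations
filtered = toy[(z < 3).all(axis=1)]
print(toy.shape, "->", filtered.shape)
```

Removing rows is only one option; capping extreme values is another and keeps every customer in the data set.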

From Pandas Profiling, we saw that there is a lot of skewness in the following columns: 
- **BALANCE**
- **ONEOFF_PURCHASES**
- **INSTALLMENTS_PURCHASES**
- **CASH_ADVANCE**
- **CASH_ADVANCE_TRX**
- **PURCHASES_TRX**
- **CREDIT_LIMIT**
- **PAYMENTS**
- **MINIMUM_PAYMENTS**


In [None]:
# To be filled by learner

#Performing Log Transformation to deal with Skewed Columns
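
The sketch below shows, on synthetic right-skewed data, how a log transformation reduces skewness. It uses `np.log1p` rather than `np.log` on the assumption that columns may contain zeros; the `BALANCE_LOG` column name is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Log-normal values are right-skewed, much like monetary columns such as BALANCE
toy = pd.DataFrame({"BALANCE": rng.lognormal(mean=7, sigma=1, size=1000)})

# log1p computes log(1 + x), so a zero balance maps to 0 instead of -inf
toy["BALANCE_LOG"] = np.log1p(toy["BALANCE"])

# Skewness before and after the transformation
print(toy.skew())
```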

**Scale the data set using StandardScaler**

In [None]:
#To be filled by learner

#Performing Scaling on the data set

In [None]:
#To be filled by learner

#Checking the shape of the data set
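
As a small illustration of what `StandardScaler` does, the sketch below scales a hypothetical array whose two columns sit on very different scales. The values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values)
X = np.array([[1000.0, 0.1],
              [7000.0, 0.9],
              [7500.0, 1.0],
              [1200.0, 0.2]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract the column mean, divide by the column std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
print(X_scaled.shape)         # scaling never changes the shape
```

Distance-based algorithms such as K-Means are sensitive to feature scale, which is why scaling comes before modelling.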

###**Model Building**


###K-Means Clustering

**K-Means Clustering**

- Unsupervised Learning Algorithm
- Used to group some data points together (Clustering) in an Unsupervised fashion 
- Groups observations with similar attribute values together by measuring the Euclidean distance between points

A **cluster** refers to a collection of data points clubbed together because of existing similarities between them. The task of identifying and assigning similar data-points to a cluster is known as **Clustering**.


**Steps in K-Means Algorithm:**

**Step 1 :** Select the number of clusters you want to identify, i.e. K.
 
**Step 2 :** Randomly select K distinct data points as the initial centroids.

**Step 3 :** Calculate the distance from each data point to every centroid (standard K-Means uses Euclidean distance) and assign each data point to the closest cluster centroid.

**Step 4 :** Recalculate the centroid of each newly formed cluster.

**Step 5 :** Repeat steps 3 and 4 until a stopping criterion is met.

**Stopping Criteria**

We stop the K-Means Clustering Algorithm **when the centroids of the clusters no longer change between iterations. That means the cluster assignments have stabilized and the algorithm has converged.**

The algorithm will also stop training **once the maximum number of iterations has been reached. For example, if you set the maximum number of iterations to 30, the algorithm will stop after 30 iterations.**
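
The steps and stopping criteria above can be sketched in plain NumPy. This is a teaching sketch under simple assumptions (Euclidean distance, random initial points, synthetic data), not the implementation that scikit-learn uses; the `kmeans` function is a hypothetical helper.

```python
import numpy as np

def kmeans(X, k, max_iter=30, seed=0):
    """Plain K-Means (Lloyd's algorithm) with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):  # stopping criterion: iteration cap
        # Step 3: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is preferable, since it adds smarter initialization and multiple restarts.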


If you want to learn more about K-Means, [**Click Here!**](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

Let us say you work at a bank and the customers provide you with features such as Age and Savings (how much did they save?). 

Now, you want to cluster these customers into different groups 
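
A minimal sketch of this Age and Savings scenario with scikit-learn follows. The customer values are synthetic, and three groups are assumed purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical customers: three loose groups by age and savings
customers = pd.DataFrame({
    "Age": np.concatenate([rng.normal(25, 2, 40), rng.normal(45, 2, 40), rng.normal(65, 2, 40)]),
    "Savings": np.concatenate([
        rng.normal(5_000, 500, 40),
        rng.normal(40_000, 3_000, 40),
        rng.normal(90_000, 5_000, 40),
    ]),
})

# Scale first so that Savings does not dominate the Euclidean distance
X = StandardScaler().fit_transform(customers)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["Cluster"] = model.fit_predict(X)

# Average Age and Savings per cluster, in original units
print(customers.groupby("Cluster").mean().round(1))
```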

In [None]:
#To be filled by learner

#Here, we are assuming the value of k as 3

#Creating a K-Means Object

#Fitting the Model

In [None]:
#To be filled by learner

#Gives the labels associated to each data point 

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

These are the centroids of the clusters for each feature. The problem with the above data frame is that the values are still scaled. Let's apply an inverse transform to bring the data back to its original scale


In [None]:
#To be filled by learner
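
The inverse transform can be sketched as follows, assuming the same `StandardScaler` object that scaled the data is still available. The `raw` frame holds hypothetical values chosen to form two obvious groups.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data with two obvious groups
raw = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, 1200.0, 1100.0, 7000.0, 7500.0, 7200.0],
    "PURCHASES": [95.0, 16.0, 50.0, 1500.0, 1700.0, 1600.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# cluster_centers_ live in the scaled space; inverse_transform maps them back to original units
centers = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_), columns=raw.columns)
print(centers.round(1))
```

If a log transformation was applied before scaling, the centroids would additionally need the matching inverse (for example `np.expm1` after `np.log1p`) to return to the raw units.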

___
**Observations:**
- Cluster 1:
 - Customers with a high credit limit purchase more frequently 
 - These customers purchase items in installments
 - They have the highest percentage of full payments and have been credit card users for a very long time
___
**Note:** Similarly, write observations for the remaining two clusters
- Cluster 2:
 - 
 - 
 -
- Cluster 3:
 - 
 - 
 -
___



In [None]:
#To be filled by learner

#Predict the clusters

In [None]:
#To be filled by learner

#Concatenate the Cluster Labels to our Original Dataframe

In [None]:
#To be filled by learner

#Plot the Histogram of Various Clusters

####**Apply Elbow Method to find the optimal value of K**

####**How to determine the value of K?**

>* If we know how many classes we want to classify, then we use that value as 'k'. For Example - All of us have heard of the Iris data or even worked with it earlier. It has three classes we could classify our flowers into. So, in that case the value of k could be taken as 3.
>* If we don't know how many classes we want, then we will have to decide what the best 'k' value is. A very popular way to find the value of 'k' is the **Elbow Method**

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

#Plotting the Elbow Curve

**From above, we select the optimum value of k by determining the Elbow Point: the point after which the inertia decreases only slowly and roughly linearly as k grows. In this case, we can select the value of k as 10.**
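
A hedged sketch of how the inertia values behind an elbow curve can be computed is shown below. The data is synthetic with four known groups, so the elbow here falls at k = 4 and says nothing about the best k for the real data set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true groups (assumption for illustration)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.8,
    random_state=0,
)

inertias = []
k_values = range(1, 9)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# Inertia drops sharply up to the true number of groups, then flattens
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `k_values` against `inertias` with `plt.plot` gives the elbow curve itself.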

###Apply PCA and visualize the results

Principal Component Analysis (PCA) is a dimensionality reduction technique in which we extract a new, smaller set of variables (the principal components) from the dataset. It is one of the most widely used unsupervised algorithms.

This technique can also be used for:

- **visualization**
- **noise filtering**
- **feature extraction or engineering**


In [None]:
#To be filled by learner

# Obtain the principal components 

In this example, we are using PCA to bring down the number of dimensions from **14** to **2** (n_components)

PCA has a number of parameters, such as:

*   **n_components** - Defines the number of components you want to project your data onto.
*  **random_state** - A fixed seed which guarantees that the same sequence of random numbers is generated each time you run the code. Unless there is some other randomness present in the process, the results will be the same on every run. This helps in verifying the output.

For more information on PCA parameters [click here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) 
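
A minimal sketch of projecting 14 features onto 2 principal components follows. The input is random synthetic data standing in for the scaled data set, and the column names `pca1` and `pca2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 rows x 14 columns of synthetic, already-scaled features (stand-in for the real data)
X = rng.normal(size=(200, 14))

pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(X)

# Wrap the two components in a data frame so cluster labels can be added later
pca_df = pd.DataFrame(components, columns=["pca1", "pca2"])
print(pca_df.shape)

# Share of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

A low explained variance ratio is a reminder that a 2-D scatter plot of clusters shows only part of the structure in the data.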



In [None]:
#To be filled by learner

# Create a dataframe with the two components

In [None]:
#To be filled by learner


In [None]:
#To be filled by learner

# Concatenate the clusters labels to the dataframe

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner


# Training the model and Storing the predicted cluster labels 

**Silhouette score**
- Used to evaluate the quality of clusters created by clustering algorithms such as K-Means, in terms of how similar each sample is to its own cluster compared with the other clusters
- Calculated for each sample and then averaged over all samples
- Value ranges from -1 to 1
 - **1:** Means clusters are well apart from each other and clearly distinguished.
 - **0:** Means clusters overlap, or we can say that the distance between clusters is not significant.
 - **-1:** Means samples have been assigned to the wrong clusters

In [None]:
#To be filled by learner
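
As an illustration of the score, the sketch below clusters well-separated synthetic blobs, for which the silhouette score should come out close to 1. The data is made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic clusters
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples
score = silhouette_score(X, labels)
print(round(score, 3))
```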

###Hierarchical Clustering

**Understanding Hierarchical Clustering**

Hierarchical Clustering is an unsupervised learning algorithm which is used to group similar data-points in a cluster. It creates clusters that have a pre-determined order from top to bottom. For example, files and folders organized in a hierarchy on a hard-disk.

There are two types of Hierarchical Clustering:



*   **Agglomerative Hierarchical Clustering** :

It is the most commonly used type and works in a **bottom-up manner**. We start by assigning each data point to its own cluster, then calculate the similarity between clusters using a distance such as **Euclidean Distance** or **Manhattan Distance** and merge the most similar clusters. Merging repeats until all the data points belong to a single cluster.

*   **Divisive Hierarchical Clustering** :

It is used less often and is the inverse of Agglomerative Hierarchical Clustering, working in a **top-down manner**. We start with all the data points in a single cluster and, after each iteration, remove the data points that are not similar to the rest; each removed data point is treated as an individual cluster. Because the cluster is divided at every step, it is known as Divisive Hierarchical Clustering. 

The Hierarchical Clustering technique can be visualized with a **Dendrogram**.

A dendrogram is a tree-like diagram showing hierarchical clustering. It shows the relationships between similar sets of data points. We can also use the dendrogram to decide the number of clusters in Hierarchical Clustering.





If you want to learn more about Agglomerative Clustering, [**Click Here!**](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)

####**Getting the Dendrogram**

In [None]:
#To be filled by learner

#Print the Dendrogram

####**How to decide the Number of Clusters?**



From above, you can see that the blue line has the maximum distance. We can select a threshold of 125 and then cut the dendrogram.
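
Cutting a dendrogram at a threshold can be sketched with SciPy as below. The data is synthetic, and the threshold of 10 suits this toy data only; it is unrelated to the value of 125 chosen for the real data set.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Three synthetic groups in 2-D
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(30, 2)),
    rng.normal([10, 0], 0.5, size=(30, 2)),
    rng.normal([5, 10], 0.5, size=(30, 2)),
])

# Ward linkage merges, at each step, the pair of clusters that increases variance least
Z = linkage(X, method="ward")

# no_plot=True returns the tree layout without drawing; remove it to draw with matplotlib
tree = dendrogram(Z, no_plot=True)

# Cut the tree at a height read off the dendrogram
threshold = 10
labels = fcluster(Z, t=threshold, criterion="distance")
print("clusters found:", len(set(labels)))
```

Every vertical line the horizontal cut crosses becomes one cluster, which is how the threshold translates into a cluster count.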

In [None]:
#To be filled by learner

**From above, you can see that the line cuts the dendrogram at 3 points. That means we are going to apply hierarchical clustering for 3 clusters**

####**Applying Hierarchical Clustering**

In [None]:
#To be filled by learner

#Agglomerative Hierarchical Clustering

#Fit and predict the clusters

**From above, you can see that the labels run from 0 up to one less than the number of clusters we defined. 0 represents the points that belong to the first cluster, 1 represents the points that belong to the second cluster, and so on. These values represent the cluster labels**
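
A minimal sketch of agglomerative clustering in scikit-learn, showing how the labels are numbered. It runs on synthetic data and assumes 3 clusters for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Three synthetic groups in 2-D
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(30, 2)),
    rng.normal([10, 0], 0.5, size=(30, 2)),
    rng.normal([5, 10], 0.5, size=(30, 2)),
])

# Bottom-up merging with Ward linkage (scikit-learn's default linkage)
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

# Labels run from 0 to n_clusters - 1
print(np.unique(labels))
```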

In [None]:
#To be filled by learner

# Visualizing the clustering 

In [None]:
#To be filled by learner

#Print the Silhouette Score

###DBSCAN Clustering

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is one of the most common clustering algorithms and works based on the density of the data points.
The whole idea is that if a particular point belongs to a cluster, it should be near to lots of other points in that cluster.

It works based on two parameters: Epsilon and Minimum Samples  

- **Epsilon** is the radius of the neighbourhood around a point; if that neighbourhood contains enough points, we call it a dense area  
- **minimumSamples** is the minimum number of data points we want in a neighbourhood to define a cluster.


If you want to learn more about **DBSCAN** Clustering, [**Click Here!**](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
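
The two parameters can be seen in action in the sketch below. The data is synthetic and the values of `epsilon` and `minimum_samples` were picked for this toy data only; on real data they need tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense synthetic groups plus one isolated point that should be flagged as noise
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),
    rng.normal([5, 5], 0.3, size=(50, 2)),
    [[20.0, 20.0]],
])

epsilon = 1.0        # neighbourhood radius (eps in scikit-learn)
minimum_samples = 5  # points required within eps to form a dense region (min_samples)

db = DBSCAN(eps=epsilon, min_samples=minimum_samples).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```

Unlike K-Means, DBSCAN does not take the number of clusters as an input; it follows from the density settings.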

In [None]:
#To be filled by learner

#Define epsilon and minimumSamples

#Create a DBSCAN Object
#Fit the model

#Print the labels

In [None]:
#To be filled by learner

###Spectral Clustering



In [None]:
#To be filled by learner

#Create a SpectralClustering Object



In [None]:
#To be filled by learner

# List of different values of affinity 

# List of Silhouette Scores 

# Evaluating the performance 

#Print the Silhouette Scores
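
One way to compare affinity settings for Spectral Clustering is sketched below. The data is synthetic, and `rbf` and `nearest_neighbors` are just two of the affinities scikit-learn supports; the list used in the actual exercise may differ.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)  # mirror the scaled data used in the notebook

# Two ways of building the similarity graph between points
affinities = ["rbf", "nearest_neighbors"]
scores = {}

for affinity in affinities:
    model = SpectralClustering(n_clusters=3, affinity=affinity, random_state=0)
    labels = model.fit_predict(X)
    scores[affinity] = silhouette_score(X, labels)

for affinity, score in scores.items():
    print(affinity, round(score, 3))
```

The `scores` dictionary can then be passed to a bar chart to compare the settings side by side.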

In [None]:
#To be filled by learner

# Plotting a Bar Graph to compare the models 

###**PyCaret**


Use **PyCaret** to find the best model and perform Automatic Hyperparameter tuning

**PyCaret** is an open source, low-code machine learning library in **Python** that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment

[**Click Here!**](https://pycaret.org/) to learn more about **PyCaret**

**Installing PyCaret**

- !pip install pycaret

####**Tasks to be performed**

- Import PyCaret and load the data set
- Initialize or setup the environment 
- Create the model
- Assign the model
- Plot the model

####**Import PyCaret and load the data set**

In [None]:
import pycaret.clustering as pc
from pycaret.clustering import *

#dir(pc)

In [None]:
#To be filled by learner

#Loading the dataset

#Printing the first 5 rows of dataframe

####**Initialize or setup the environment**

In [None]:
#To be filled by learner

#Set up the Clustering environment

####**Create the Model**



- Creates a model on the dataset passed as a data param during the setup stage
- setup() function must be called before using create_model()
- Returns a trained model object

In [None]:
#To be filled by learner

In [None]:
#To be filled by learner

Parameters:

model : string / object, default = None
Enter abbreviated string of the model class. List of available models supported:

- **kmeans**	- K-Means Clustering
- **ap**	- Affinity Propagation
- **meanshift**	- Mean shift Clustering
- **sc**	- Spectral Clustering
- **hclust**	- Agglomerative Clustering
- **dbscan**	- Density-Based Spatial Clustering
- **optics**	- OPTICS Clustering
- **birch**	- Birch Clustering
- **kmodes**	- K-Modes Clustering


####**Assign the model**


- Assigns each data point in the dataset passed during the setup stage to one of the clusters, using the trained model object passed as the model param
- create_model() function must be called before using assign_model()
- Returns a pandas Dataframe.

In [None]:
#To be filled by learner

####**Plot the Model**

- **plot_model** takes a trained model object and returns a plot on the dataset passed during setup stage
- Internally calls assign_model before generating a plot

In [None]:
#To be filled by learner

In [None]:
plot_model(kmeans, plot = 'tsne')

In [None]:
plot_model(kmeans, plot = 'silhouette')