# PROJECT 4:

## Unsupervised Machine Learning - Clustering

This notebook is a part of my fourth project from the IBM Machine Learning certificate.

The main objective of this project is to cluster the customers in this dataset and compare clustering algorithms to see which one suits the task best. The resulting customer segmentation is intended to inform marketing strategy. The dataset summarizes the usage behavior of about 9,000 active credit card holders.

Source Data: https://www.kaggle.com/arjunbhasin2013/ccdata

## Exploratory Data Analysis

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns, os
os.chdir('data')
from colorsetup import colors, palette
sns.set_palette(palette)

In [None]:
# We import and take a preliminary look at the dataset
data = pd.read_csv('CC GENERAL.csv')

data.head(4).T

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data2 = data.copy() # Keep a copy of our original data 

In [None]:
print("Number of rows in the data:", data.shape[0])
print("Number of columns in the data:", data.shape[1])

We have determined the following:

- There are 18 columns and 8950 rows in this dataset.
- This dataset has 3 types of data: object, float64 and int64.

We will now do the following:

- Examine the correlation and skew of all variables except the CUST_ID column, which is an identifier and adds no value to the clustering.
- Perform any appropriate feature transformations and/or scaling.
- Examine the pairwise distributions of the variables with pairplots to verify the scaling and normalization efforts.


## Data Cleaning and Feature Engineering

In [None]:
# We examine correlation between all variables excluding 'CUST_ID'
num_columns = [x for x in data.columns if x not in ['CUST_ID']]

# The correlation matrix
corr_mat = data[num_columns].corr()

# Strip out the diagonal values for the next step
for x in range(len(num_columns)):
    corr_mat.iloc[x,x] = 0.0
    
corr_mat

In [None]:
# Pairwise maximal correlations: for each feature, the feature it is most strongly correlated with.
corr_mat.abs().idxmax()

In [None]:
# We measure the skewness of each column in anticipation of transformations.
skew_columns = (data[num_columns]
                .skew()
                .sort_values(ascending=False))

skew_columns = skew_columns.loc[skew_columns > 0.75]
skew_columns

In [None]:
# We perform log transform on the skewed columns
for col in skew_columns.index.tolist():
    data[col] = np.log1p(data[col])
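
To illustrate why `np.log1p` is a reasonable choice for right-skewed, non-negative columns, here is a minimal sketch on synthetic data. The lognormal sample and the name `amounts` are assumptions made for illustration only; they are not taken from the credit card dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values (a hypothetical stand-in
# for a monetary column; not the real data)
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))

# Skewness before and after the log(1 + x) transform
skew_before = amounts.skew()
skew_after = np.log1p(amounts).skew()

print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after log1p:  {skew_after:.2f}")
```

`log1p` is defined at zero, which matters for columns where many customers have a value of exactly 0.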


In [None]:
# We now perform feature scaling.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
data[num_columns] = sc.fit_transform(data[num_columns])

data.head(4)
# Now, all our values are scaled

In [None]:
# We now make a pairplot of the scaled and transformed data
sns.pairplot(data, plot_kws=dict(alpha=.1, edgecolor='none'))

In [None]:
sns.set_context('notebook')
sns.pairplot(data[num_columns]);

The pairplots show that, for the most part, the scaled and transformed features are not strongly correlated with each other. Exceptions include pairs such as "ONEOFF_PURCHASES" and "PURCHASES".
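
Strongly correlated pairs can also be listed numerically rather than read off the plots. The sketch below uses a small synthetic frame; the column names are borrowed only as labels, and the 0.8 threshold is an arbitrary assumption, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns share a common signal, the third is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "PURCHASES": base + rng.normal(scale=0.1, size=500),
    "ONEOFF_PURCHASES": base + rng.normal(scale=0.3, size=500),
    "TENURE": rng.normal(size=500),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation) is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Pairs above the (assumed) threshold
strong = pairs[pairs > 0.8]
print(strong)
```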

In [None]:
# We replace any NaN values with 0; because the columns are standardized, 0 corresponds to each column's mean
smaller_data = data.fillna(0)
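
Filling with 0 after standardization amounts to placing each missing entry at the mean of the observed values in that column. A minimal sketch on a toy column (the name `payments` and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column with one missing entry
toy = pd.DataFrame({"payments": [100.0, 200.0, np.nan, 400.0]})

# StandardScaler disregards NaN when fitting and keeps it as NaN in the output
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Fill the missing entry with 0 in the standardized space
filled = scaled.fillna(0)

# Mapping back to the original units shows the filled value is the column mean
restored = scaler.inverse_transform(filled)
print(restored.ravel())
print(toy["payments"].mean())
```

Whether mean imputation is appropriate depends on why the values are missing; this sketch only shows what the fill does, not that it is the best choice.
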

## Clustering Method 1: K-Means

We will fit a K-means clustering model with 2 clusters.


In [None]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=42)
km = km.fit(smaller_data[num_columns])

data['kmeans'] = km.predict(smaller_data[num_columns])

In [None]:
(data[['TENURE', 'kmeans']]
 .groupby(['kmeans', 'TENURE'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

We see that our dataset has been separated into 2 data clusters. The majority cluster is the "0" cluster.

In [None]:
# We now fit K-Means models with cluster values ranging from 1 to 20 to determine which K-Means value we want to use.
# Create and fit a range of models
km_list = list()

for clust in range(1,21):
    km = KMeans(n_clusters=clust, random_state=42)
    km = km.fit(smaller_data[num_columns])
    
    km_list.append(pd.Series({'clusters': clust, 
                              'inertia': km.inertia_,
                              'model': km}))

In [None]:
plot_data = (pd.concat(km_list, axis=1)
             .T
             [['clusters','inertia']]
             .set_index('clusters'))

ax = plot_data.plot(marker='o',ls='-')
ax.set_xticks(range(0,21,2))
ax.set_xlim(0,21)
ax.set(xlabel='Cluster', ylabel='Inertia');

In this case, I will choose 5 as the number of clusters, because the inertia curve appears to bend (the "elbow") around that value.
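
Reading an elbow off an inertia curve is a judgment call. One complementary check is the silhouette score, which rewards compact, well-separated clusters. The following is a sketch on synthetic data only: `make_blobs` stands in for the scaled features, and the range of 2 to 8 clusters and `n_init=10` are illustrative assumptions rather than choices made in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=42)

# Silhouette needs at least 2 clusters, so the range starts at 2
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("highest silhouette at k =", best_k)
```

On the real data, the same loop could be run over `smaller_data[num_columns]`. Silhouette computation is quadratic in the number of rows, so it may be noticeably slower than computing inertia.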

In [None]:
# We see how our 5 clusters look
km = KMeans(n_clusters=5, random_state=42)
km = km.fit(smaller_data[num_columns])

data['kmeans'] = km.predict(smaller_data[num_columns])

In [None]:
(data[['TENURE', 'kmeans']]
 .groupby(['kmeans', 'TENURE'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

## Clustering Method 2: Hierarchical Agglomerative Clustering

In [None]:
# We now use the Agglomerative Clustering method on our dataset 
# We will then compare the results between this method and K-Means
from sklearn.cluster import AgglomerativeClustering

# compute_full_tree=True builds the whole merge tree; setting it to False can save computation time
ag = AgglomerativeClustering(n_clusters=2, linkage='ward', compute_full_tree=True)
# fit_predict fits the model and returns the cluster labels in one step
data['agglom'] = ag.fit_predict(smaller_data[num_columns])

In [None]:
# First, for Agglomerative Clustering:
(data[['TENURE','agglom','kmeans']]
 .groupby(['TENURE','agglom'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

Agglomerative clustering has also separated the customers into two clusters, labeled "0" and "1". The table above shows how many customers fall into each cluster for every TENURE value.

In [None]:
# Comparing with KMeans results:
(data[['TENURE','agglom','kmeans']]
 .groupby(['TENURE','kmeans'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

In [None]:
# Now we compare the results:
(data[['TENURE','agglom','kmeans']]
 .groupby(['TENURE','agglom','kmeans'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

Though the cluster numbers are not identical, the clusters themselves are very consistent. We will now plot a dendrogram created from the agglomerative clustering.
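
Because cluster IDs are arbitrary, agreement between two labelings is easier to judge with a cross-tabulation or a permutation-invariant score such as the adjusted Rand index. A minimal sketch with hand-made label arrays (`labels_a`, `labels_b` and `labels_c` are hypothetical, not outputs of the models above):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 0, 1, 1])  # same grouping, different IDs
labels_c = np.array([0, 0, 1, 1, 1, 1, 2, 2])  # one point assigned differently

# Identical partitions score 1.0 regardless of how the clusters are numbered
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

# Contingency table: rows are clusters in a, columns are clusters in b
table = pd.crosstab(labels_a, labels_b, rownames=["a"], colnames=["b"])

print("ARI, same grouping:     ", ari_same)
print("ARI, one point moved:   ", round(ari_diff, 3))
print(table)
```

In this notebook the equivalent call would be `adjusted_rand_score(data['agglom'], data['kmeans'])`. The score also works when the two methods use different numbers of clusters.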

In [None]:
from scipy.cluster import hierarchy

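# Build the linkage matrix from the clustered data itself;
# ag.children_ holds merge indices rather than observations
Z = hierarchy.linkage(smaller_data[num_columns].to_numpy(), method='ward')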

fig, ax = plt.subplots(figsize=(15,5))

# Some color setup
red = colors[2]
blue = colors[0]

hierarchy.set_link_color_palette([red, 'gray'])

den = hierarchy.dendrogram(Z, orientation='top', 
                           p=50, truncate_mode='lastp',
                           show_leaf_counts=True, ax=ax,
                           above_threshold_color=blue)
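
A dendrogram can also be drawn from a fitted `AgglomerativeClustering` model by assembling a SciPy-style linkage matrix from its `children_` and `distances_` attributes. The sketch below follows that general pattern on synthetic data. `linkage_from_model` is a hypothetical helper name, the blob data is a stand-in for the real features, and `compute_distances=True` requires scikit-learn 0.24 or later.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustration only)
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# compute_distances=True stores the merge distances in model.distances_
model = AgglomerativeClustering(
    n_clusters=3, linkage="ward",
    compute_full_tree=True, compute_distances=True,
).fit(X)

def linkage_from_model(model):
    """Assemble a SciPy linkage matrix from a fitted model (hypothetical helper)."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        count = 0
        for child in merge:
            if child < n_samples:
                count += 1  # original observation (leaf)
            else:
                count += counts[child - n_samples]  # previously merged cluster
        counts[i] = count
    # Columns: left child, right child, merge distance, cluster size
    return np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

Z = linkage_from_model(model)

# no_plot=True computes the layout without drawing; drop it to plot
den = hierarchy.dendrogram(Z, no_plot=True)
print(Z.shape, len(den["leaves"]))
```

This reuses the merge tree the model has already computed, so the dendrogram corresponds to the same hierarchy that produced the cluster labels.
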

## Conclusion

We conclude that the Hierarchical Agglomerative Clustering model is the method best suited to this dataset. It separated the data into two classes, and the dendrogram gives a clearer picture of how the clusters are formed.