#**Credit Card Customers**


The dataset for this project was obtained through [Kaggle](https://www.kaggle.com/arjunbhasin2013/ccdata) from an anonymous financial institution that provides loans to individuals. The objective of this project is to develop a customer segmentation that can inform the marketing strategy for the next campaign.

The sample dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at the customer level, with 18 behavioral variables.



Clustering is commonly used to explore how a business's customers group together based on shared features. Here we explore kMeans, MeanShift, and Affinity Propagation, all in order to support decisions on the marketing strategy for the next campaign.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
from scipy.stats.mstats import winsorize
import seaborn as sns
from sklearn.cluster import AffinityPropagation, estimate_bandwidth, KMeans, MeanShift, SpectralClustering
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action="ignore")




Let's upload the dataset into Google Colab:

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import io
data = pd.read_csv(io.BytesIO(uploaded['Credit Card Dataset for Clustering.csv']))
# Dataset is now stored in a Pandas Dataframe

In [None]:
data.head()

**The following is the data dictionary for the credit card dataset:**



**CUST_ID** : Identification of Credit Card holder (Categorical)

**BALANCE** : Balance amount left in their account to make purchases

**BALANCE_FREQUENCY** : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

**PURCHASES** : Amount of purchases made from account

**ONEOFF_PURCHASES** : Maximum purchase amount done in one-go

**INSTALLMENTS_PURCHASES** : Amount of purchases done in installments

**CASH_ADVANCE** : Cash in advance given by the user

**PURCHASES_FREQUENCY** : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

**ONEOFF_PURCHASES_FREQUENCY** : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)

**PURCHASES_INSTALLMENTS_FREQUENCY** : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)

**CASH_ADVANCE_FREQUENCY** : How frequently cash in advance is being paid

**CASH_ADVANCE_TRX** : Number of transactions made with "Cash in Advance"

**PURCHASES_TRX** : Number of purchase transactions made

**CREDIT_LIMIT** : Limit of Credit Card for user

**PAYMENTS** : Amount of Payment done by user

**MINIMUM_PAYMENTS** : Minimum amount of payments made by user

**PRC_FULL_PAYMENT** : Percent of full payment paid by user

**TENURE** : Tenure of credit card service for user

#**Exploratory Data Analysis**

Let's begin EDA by visualizing some of the variables we have in our dataset:

In [None]:
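# Clip the top 9% of values to the 91st percentile so extreme outliers do not
# stretch the x-axis. NaNs are dropped here because they are only imputed later.
plt.hist(winsorize(data['MINIMUM_PAYMENTS'].dropna(), (0, 0.09)))
plt.title('Distribution of Minimum Payments Across Customers')
plt.xlabel('Payment (USD)')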
plt.ylabel('Frequency')
plt.show()

As expected, most payments fall in the lowest bin (0 - 250). But the highest bin is also well populated: a significant number of customers have a minimum payment of more than $2300 a month. Part of that spike is a side effect of winsorizing, which clips the top 9% of values to the same ceiling, so at least 9% of customers land in the last bin by construction.

My guess is that our dataset contains a group of big spenders with high credit limits.
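
To make the effect of winsorizing concrete, here is a minimal sketch on made-up numbers (the array is illustrative and not taken from the dataset). With `limits=(0, 0.2)`, nothing changes at the low end, while the top 20% of values are replaced by the largest value that remains:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten illustrative values; the last two are extreme outliers.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 500, 1000])

# limits=(lower, upper): clip nothing at the bottom, the top 20% at the top.
clipped = np.asarray(winsorize(values, limits=(0, 0.2)))

print(clipped)
```

Both outliers end up at 8, the largest unclipped value. In a histogram, all clipped values therefore pile up in the last bin.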

Another question is how much cash customers take in advance. Let's take a look:

In [None]:
plt.hist(winsorize(data['CASH_ADVANCE'],(0,0.05)))
plt.title('Distribution of Cash Advances Across Customers')
plt.xlabel('Cash Advance (USD)')
plt.ylabel('Frequency')
plt.show()

It seems that many customers use this feature and, as in the plot above, a sizable group takes large amounts of cash in advance (over $4000).

##**Data Cleaning**

Since our data is all numerical apart from the customer ID, let's begin with a quick statistical summary:

In [None]:
data.describe()

Columns like **BALANCE**, **PURCHASES**, **ONEOFF_PURCHASES** and others have outliers. These should be dealt with in order to have an easier and clearer definition of our clusters.

**Missing Values**

In [None]:
data.isnull().sum().sort_values(ascending=False)

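Missing values make up only a small part of the dataset, and because the affected columns (**MINIMUM_PAYMENTS** and **CREDIT_LIMIT** in the code below) are numerical, it is easy to fill them with the mean of each column. We also drop **CUST_ID**, which is an identifier and carries no behavioral information: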

In [None]:
# Replace missing values with the column mean.
data['MINIMUM_PAYMENTS'] = data['MINIMUM_PAYMENTS'].fillna(data['MINIMUM_PAYMENTS'].mean())
data['CREDIT_LIMIT'] = data['CREDIT_LIMIT'].fillna(data['CREDIT_LIMIT'].mean())

data = data.drop(columns=['CUST_ID'])

In [None]:
data.isnull().sum().sort_values(ascending=False)

Now we have no missing values in our dataframe.
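
Mean imputation is simple, but with outliers as heavy as the ones in this dataset the mean can sit far from a typical value. As a hedged alternative, which this notebook does not use, the median is less sensitive to extremes. The toy column below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical payments column with one missing value and one large outlier.
payments = pd.Series([100.0, 120.0, 150.0, np.nan, 10000.0])

mean_filled = payments.fillna(payments.mean())
median_filled = payments.fillna(payments.median())

# Compare the value that each strategy puts in the missing slot.
print(mean_filled[3], median_filled[3])
```

In this toy case the mean fill is 2592.5 while the median fill is 135.0, which is much closer to the bulk of the values.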

##**Feature Engineering**

The feature engineering section focuses on outliers, since many of the columns in our dataset have them. Conveniently, the columns fall into a few groups with similar units and even similar ranges, so each group can be binned with one shared set of ranges, which helps us cluster the data.


Binning also avoids losing data points: no rows are dropped, extreme values simply fall into the top bin, and the distinctions we have in mind (for example small versus large amounts) are made explicit to the algorithm:

In [None]:
# The first group: with individual US dollars as units
columns=['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT',
        'PAYMENTS', 'MINIMUM_PAYMENTS']

range_min = [0,500,1000,3000,5000]
range_max = [500,1000,3000,5000,10000]
for col in columns:
    bin_col=col+'_BIN'
    data[bin_col]=0
    for i in range(5):
        data.loc[((data[col]>range_min[i])&(data[col]<=range_max[i])),bin_col]=i+1
    data.loc[((data[col]>10000)),bin_col]=6
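
The same dollar bins can also be expressed with `pd.cut`. This is a sketch on hypothetical amounts, assuming the same edges as the loop above; values that fall outside every interval (here, exactly 0) are mapped to bin 0 to match the loop's default:

```python
import numpy as np
import pandas as pd

# Hypothetical dollar amounts, covering each bin plus the edge cases.
amounts = pd.Series([0, 250, 500, 750, 2000, 4000, 8000, 20000])

edges = [0, 500, 1000, 3000, 5000, 10000, np.inf]
labels = [1, 2, 3, 4, 5, 6]

# right=True gives intervals (0, 500], (500, 1000], ..., (10000, inf].
binned = pd.cut(amounts, bins=edges, labels=labels, right=True)

# Values outside every interval come back as NaN; map them to bin 0.
binned = binned.astype(float).fillna(0).astype(int)

print(binned.tolist())
```

Keeping the edges in one list makes the bin boundaries easier to audit than two parallel `range_min` and `range_max` lists.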

In [None]:
data = data.drop(columns=['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT',
        'PAYMENTS', 'MINIMUM_PAYMENTS']) 

In [None]:
# The second group: with frequencies between 0 and 1 as units
columns=['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 
         'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT']

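for col in columns:
    bin_col=col+'_BIN'
    data[bin_col]=0
    # Keep the loop counter an integer and divide only when comparing.
    # Building edges as float + 0.1 rounds badly: 0.7 + 0.1 falls just below 0.8,
    # which would label the (0.7, 0.8] bin as 7 and exclude the value 0.8 itself.
    for i in range(0,10):
        data.loc[((data[col]>i/10)&(data[col]<=(i+1)/10)),bin_col]=i+1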
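
Bin edges and labels on the (0, 1] range are easy to get subtly wrong with floating-point arithmetic. As a small self-contained check (not tied to the dataset), compare an edge built by adding 0.1 to a float with one built from integers:

```python
# Accumulating a float edge: 0.7 + 0.1 falls just short of 0.8 in binary floating point.
float_edge = 7 / 10 + 0.1
float_label = int(float_edge * 10)

# Integer arithmetic first, with a single division at the end.
int_edge = (7 + 1) / 10
int_label = 7 + 1

print(float_edge, float_label)
print(int_edge, int_label)
```

The float version yields label 7 for what should be the eighth bin, and its upper edge excludes the value 0.8 itself. Integer loop counters are the safer choice for this kind of binning.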

In [None]:
data = data.drop(columns=['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 
         'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT'])

In [None]:
# The 3rd group: with number of transactions as units
columns=['PURCHASES_TRX', 'CASH_ADVANCE_TRX']  
for col in columns:
    bin_col=col+'_BIN'
    data[bin_col]=0
    for i in range(0,25,5):
        data.loc[((data[col]>i)&(data[col]<=i+5)),bin_col]=int((i/5)+1)    
    data.loc[((data[col]>25)&(data[col]<=50)),bin_col]=6
    data.loc[((data[col]>50)&(data[col]<=100)),bin_col]=7
    data.loc[((data[col]>100)),bin_col]=8

In [None]:
data = data.drop(columns=['PURCHASES_TRX', 'CASH_ADVANCE_TRX'])



#**Clustering**

Next, we define our features as a numpy array, then we scale the data using the Scikit-Learn StandardScaler in preparation to feed it to the kMeans algorithm:

In [None]:
X= np.asarray(data)
scale = StandardScaler()
X = scale.fit_transform(X)
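
Scaling matters because kMeans relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate. A minimal sketch of what `StandardScaler` does, on a made-up two-column array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```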

##**Kmeans**


We first evaluate kMeans using inertia: the sum of squared distances from each point to the centroid of its assigned cluster, added up over all clusters. Comparing inertia across several values of k gives us an idea of which value to choose:

In [None]:
n_clusters=8
inertia=[]
k = []
for i in range(1,n_clusters):
    kmean= KMeans(i)
    kmean.fit(X)
    inertia.append(kmean.inertia_)  
    k.append(i)
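
To make the definition of inertia concrete, the sketch below recomputes it by hand and compares it with the `inertia_` attribute. It uses synthetic blobs from `make_blobs` (an assumption for illustration, not the credit card data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only.
points, _ = make_blobs(n_samples=200, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Squared distance from every point to the centroid of its assigned cluster.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two numbers should agree up to floating-point rounding.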


In [None]:
# Plot against the actual k values; without them the x-axis would start at 0.
plt.plot(k, inertia, 'o-')
plt.title('Inertia for each value of k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()


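Inertia always decreases as k grows, so the lowest value alone does not tell us which k to pick. Instead we look for the point where the curve starts to flatten (the "elbow"). On that basis, 6 clusters seems like a good value to start from: the data points are distributed coherently across 6 clusters without adding more clusters than we can interpret.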



In [None]:
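# Fit kMeans with 6 clusters on the scaled features.
kmean= KMeans(6)
kmean.fit(X)
labels=kmean.labels_

# For plotting only: reduce the pairwise cosine distances to 2 components.
# Note that dist is an n x n matrix, so this step is memory-hungry.
dist = 1 - cosine_similarity(X)
pca = PCA(2)
pca.fit(dist)
X_PCA = pca.transform(dist)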

In [None]:
clusters=pd.concat([data, pd.DataFrame({'cluster':labels})], axis=1)
clusters.head()

#**Visualization of Cluster Features**

In [None]:
for c in clusters:
    grid= sns.FacetGrid(clusters, col='cluster')
    grid.map(plt.hist, c)

**Cluster Highlights**

* **Cluster0** Average credit limit, mostly small to medium purchases, they often make use of cash advances.

* **Cluster1** Rarely takes cash advances but makes a lot of purchases. Average to high credit limit and more payment amounts.

* **Cluster2** Same credit limit as cluster 1. Usually doesn't purchase with full payment and doesn't spend much in general.

* **Cluster3** Average to high credit limit, with customers who take more cash in advance.

* **Cluster4** Average credit limit with a distributed purchase frequency.

* **Cluster5** Average to high credit limit customers who do not make much use of their credit cards.
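
Readings taken from histograms by eye can be cross-checked numerically. Here is a minimal sketch of profiling each cluster by its mean bin values, on made-up values rather than the real `clusters` frame:

```python
import pandas as pd

# Hypothetical binned features and cluster labels, for illustration only.
toy = pd.DataFrame({
    'CREDIT_LIMIT_BIN': [2, 3, 5, 6, 3, 5],
    'CASH_ADVANCE_BIN': [4, 5, 0, 1, 4, 0],
    'cluster':          [0, 0, 1, 1, 0, 1],
})

# Mean bin value per cluster: a compact numeric profile of each segment.
profile = toy.groupby('cluster').mean()

print(profile)
```

On the notebook's data, the equivalent call would be `clusters.groupby('cluster').mean()`.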

Projecting the data onto two principal components with PCA lets us plot every customer in 2-D, which makes it easier to see each cluster and where its customers fall within the larger sample:

In [None]:
x, y = X_PCA[:, 0], X_PCA[:, 1]

colors = {0: 'red',
          1: 'blue',
          2: 'green', 
          3: 'yellow', 
          4: 'orange',  
          5:'purple'}

names = {0: 'Small-Med purchases', 
         1: 'People with due payments', 
         2: 'Credit Purchasers', 
         3: 'Take more cash in advance', 
         4: 'Expensive purchases',
         5:'Economical spenders'}
  
df = pd.DataFrame({'x': x, 'y':y, 'label':labels}) 
groups = df.groupby('label')

fig, ax = plt.subplots(figsize=(14, 7)) 

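for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=5,
            label=names[name], mec='none')

# Hide ticks and tick labels: the PCA axes have no directly interpretable units.
# tick_params expects booleans here; the string 'off' would not hide the ticks.
ax.set_aspect('auto')
ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)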
    
ax.legend()
ax.set_title("Customers Segmentation")
plt.show()

The clusters are densely packed around the overall shape in the plot, each occupying a different side. The higher we go on the y-axis, the more credit the clusters tend to use.

**Usage**

This document provides a way for financial institutions to cluster their individual credit customers into groups. As market needs evolve, the clusters found here may change accordingly. This clustering can provide useful input for targeting these customers within a marketing strategy and for designing services that fit their needs.

#**Other Clustering Approaches**

We will try other clustering approaches, this time setting purchase frequency aside as a separate variable and clustering on the remaining features:

In [None]:
X = data.drop(columns=['PURCHASES_FREQUENCY_BIN'])
y = data['PURCHASES_FREQUENCY_BIN']

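# Note: test_size=0.9 leaves only 10% of the rows in X_train;
# the models below are fit on that subset.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.9,
    random_state=42)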

#**MeanShift**

Mean shift makes no assumptions about the nature of the data or the number of clusters, making it more versatile than kMeans, but it still creates clusters in which data points form a "globe" around a central point. It works for data sets where many clusters are suspected.

Let's see how MeanShift clusters our dataset based on purchase frequency:

In [None]:
# Here we set the bandwidth. This function automatically derives a bandwidth
# number based on an inspection of the distances among points in the data.
bandwidth = estimate_bandwidth(X_train, quantile=0.2, n_samples=500)

# Declare and fit the model.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_train)

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))


print("Number of estimated clusters: {}".format(n_clusters_))

MeanShift estimates a single cluster here, which certainly does not reflect the data. We know that from the values we have seen before and from the differences in customer behavior across the features in this dataset.

It also makes little sense logically: we cannot expect customers in almost any community, regardless of its size, to behave uniformly.
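
One plausible reason for a single cluster is a bandwidth that is too wide for the data, since MeanShift merges everything that falls within one kernel window. The sketch below illustrates this on synthetic blobs (not the notebook's data); the bandwidth values 3 and 50 are arbitrary choices for the demonstration:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs, for illustration only.
points, _ = make_blobs(n_samples=150,
                       centers=[[-10, -10], [0, 0], [10, 10]],
                       cluster_std=0.5,
                       random_state=0)

# A bandwidth smaller than the gap between blobs keeps them apart...
n_small = len(np.unique(MeanShift(bandwidth=3).fit(points).labels_))

# ...while a bandwidth wider than the whole dataset merges everything.
n_large = len(np.unique(MeanShift(bandwidth=50).fit(points).labels_))

print(n_small, n_large)
```

On the notebook's data, lowering the `quantile` passed to `estimate_bandwidth`, or fitting on scaled features, would be natural things to try before discarding MeanShift.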

#**Affinity Propagation**

This algorithm is based on the idea that data points can represent each other through similarity.

It starts from a similarity matrix that records how similar each pair of points is. Points then exchange messages iteratively until a set of exemplars (representative points) emerges, and each point is assigned to the exemplar that represents it best, which maximizes similarity within clusters.

Two kinds of messages drive this process. Responsibility is sent from each point to a candidate exemplar and reflects how well suited that candidate is to represent the point, compared with the other candidates. Availability is sent from a candidate exemplar back to each point and reflects how appropriate it would be for the point to choose that exemplar, given the support the candidate receives from other points.
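
For reference, the standard message updates of Affinity Propagation can be written as follows. The notation is introduced here and does not appear elsewhere in the notebook: $s(i,k)$ is the similarity between point $i$ and candidate exemplar $k$, $r(i,k)$ is the responsibility, and $a(i,k)$ is the availability.

$$
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left[ a(i,k') + s(i,k') \right]
$$

$$
a(i,k) \leftarrow \min\left(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr)\right) \quad \text{for } i \neq k
$$

$$
a(k,k) \leftarrow \sum_{i' \neq k} \max\bigl(0,\, r(i',k)\bigr)
$$

In practice the updates are damped between iterations, and point $i$ is finally assigned to the $k$ that maximizes $a(i,k) + r(i,k)$.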

All of this is encompassed in the next block of code, showing the work of Affinity Propagation:

In [None]:
# Declare the model and fit it.
af = AffinityPropagation().fit(X_train)
print('Done')

# Pull the number of clusters and cluster assignments for each data point.
cluster_centers_indices = af.cluster_centers_indices_
n_clusters_ = len(cluster_centers_indices)
labels = af.labels_

print('Estimated number of clusters: {}'.format(n_clusters_))

It is common for Affinity Propagation to overestimate the number of clusters, which is the case in our data.

Realistically, we cannot have 44 types of customers, even for a global company with branches in every country. Such a number might only arise in companies that serve customers across different product lines, and those customers would not be comparable anyway.
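
The number of clusters Affinity Propagation finds depends heavily on its `preference` parameter, which by default is the median of the pairwise similarities; lower values generally yield fewer exemplars. The sketch below compares the default with a lower preference on synthetic blobs. The data and the value `-50` are illustrative assumptions, not settings tuned for the credit card data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic blobs for illustration (not the credit card data).
points, _ = make_blobs(n_samples=300,
                       centers=[[1, 1], [-1, -1], [1, -1]],
                       cluster_std=0.5,
                       random_state=0)

# Default preference: the median of the pairwise similarities.
default_model = AffinityPropagation(random_state=0).fit(points)

# A much lower preference makes points less willing to become exemplars.
strict_model = AffinityPropagation(preference=-50, random_state=0).fit(points)

print(len(default_model.cluster_centers_indices_),
      len(strict_model.cluster_centers_indices_))
```

On the notebook's data, one could pass a lower `preference` to `AffinityPropagation` and check whether the cluster count drops to something a marketing team could act on.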

#**Conclusion**


kMeans may be the most widely used clustering algorithm not because it works every time, but because it handles data in a straightforward way and produces clusters that are easy to interpret.

There are still benefits to using other clustering algorithms. They may not come to light in such a small project, but they matter in a sustained effort where clustering is really needed and variations of clustering come in handy for more specific problems in grouping data points.

We did not even need to evaluate the MeanShift and Affinity Propagation clusters with metrics, because their output was not plausible. kMeans performed much better: its output can be interpreted by people and businesses, and it could produce value for a marketing manager looking for a high-level segmentation of their company's customers.