## Practical Activity Classification using SVM

### 1 Practical Activity

#### 1.1 Classification Using Support Vector Machines

This notebook is an exercise in developing an SVM classifier for predicting customer attrition.

### 2 ATH LEAPS Bank Data

In this task, we will predict which customers are more likely to churn, that is, to stop using the bank's services. Once an adequate prediction model is developed, the bank will be better informed about how to develop strategies to retain customers, or at least to lose fewer of them in the future.

This practical builds a Support Vector Machine to predict customer attrition using the bank dataset.

The last two columns should be deleted.

In [20]:
# Loading the required libraries
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import accuracy_score
import collections
from sklearn.metrics import classification_report

In [None]:
# Loading the dataset
df = pd.read_csv("BankChurners.csv")

# Check the columns
df.columns

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
      dtype='object')

After loading the dataset, we remove the unwanted columns. In this case, we remove the last two columns and the client number (CLIENTNUM) column.

In [3]:
df = df.iloc[:, 1:-2]  # Dropping the first and last two columns

In [4]:
df.head()

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


In [5]:
# Check the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           10127 non-null  object 
 5   Marital_Status            10127 non-null  object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Op

In [6]:
df.shape

(10127, 20)

In [None]:
# Check the class distribution
df["Attrition_Flag"].value_counts()

Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64

We observe that there are far more samples of "Existing Customer" (8500) than of "Attrited Customer" (1627). A model trained on this dataset is therefore likely to be biased towards "Existing Customer", that is, it may predict most of the test instances as "Existing Customer". We will test this in the last part of the exercise.
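To make the imbalance concrete, the short sketch below computes the accuracy a trivial model would reach by always predicting the majority class. The counts are copied from the value_counts() output above; the variable names are illustrative only.

```python
# Class counts taken from the value_counts() output above.
counts = {"Existing Customer": 8500, "Attrited Customer": 1627}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against this baseline rather than against 50%.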

#### 2.1 Note

We have 19 features in this dataset. Training the model on all of them would take a long time, so we select a subset. Ideally, we would measure correlations, chi^2, or other statistics to find the best set of variables to use. For this practical, however, we select the following variables: Customer_Age (int64), Income_Category (object), Credit_Limit (float64), Total_Revolving_Bal (int64), Total_Trans_Amt (int64).

Among the selected variables, Income_Category is categorical, so we need to encode it numerically. The target, Attrition_Flag, is encoded in the same way.

In [None]:
# Encoding the categorical variables
le = preprocessing.LabelEncoder()

df["en_Income_Category"] = le.fit_transform(df["Income_Category"])
df["en_Attrition_Flag"] = le.fit_transform(df["Attrition_Flag"])
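LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels. The sketch below shows the resulting mapping on a few illustrative values that mirror the categories seen in df.head(). It uses one encoder per column so that each mapping stays retrievable, which is an assumption for illustration and differs from the single `le` object reused above.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; they mirror categories seen in df.head().
income = ["$60K - $80K", "Less than $40K", "$80K - $120K", "Less than $40K"]
flag = ["Existing Customer", "Attrited Customer", "Existing Customer"]

# One encoder per column keeps each label-to-code mapping available later.
income_encoder = LabelEncoder().fit(income)
flag_encoder = LabelEncoder().fit(flag)

# classes_ is sorted, and the position of a label in classes_ is its code.
income_mapping = {label: code for code, label in enumerate(income_encoder.classes_)}
flag_mapping = {label: code for code, label in enumerate(flag_encoder.classes_)}

print(income_mapping)
print(flag_mapping)
```

Two points follow from the sorted order: "Attrited Customer" becomes 0 and "Existing Customer" becomes 1, and the integer codes for Income_Category follow alphabetical order rather than income order.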

In [None]:
features = [
    "Customer_Age",
    "en_Income_Category",
    "Credit_Limit",
    "Total_Revolving_Bal",
    "Total_Trans_Amt",
]

### 3 Feature scaling

In [None]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(df[features])

# Reverting back to df
X = pd.DataFrame(X_scaled)

X["target"] = df["en_Attrition_Flag"]
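As a quick check of what StandardScaler does, the sketch below standardizes a single column by hand and compares the result with the scaler's output. The five values are the Credit_Limit entries from df.head() and serve only as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Credit_Limit values from df.head(), used here only as a small example.
values = np.array([[12691.0], [8256.0], [3418.0], [3313.0], [4716.0]])

scaled = StandardScaler().fit_transform(values)

# StandardScaler computes z = (x - mean) / std with the population
# standard deviation (ddof=0), which is also NumPy's default.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel().round(3))
```

After scaling, each column has mean 0 and unit variance, which matters for SVMs because the kernels depend on distances and dot products between samples.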

### 4 Building an SVM classifier

___
sklearn provides different SVM implementations:

👉 [SVM Classification](https://scikit-learn.org/stable/modules/svm.html#svm-classification)

We will explore three types of kernels in this practical: linear, polynomial, and rbf.
___
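For intuition, the three kernels can be written out for a single pair of points. The sketch below computes each kernel value by hand and checks it against the helpers in sklearn.metrics.pairwise. The two points are arbitrary, and gamma, degree, and coef0 are set explicitly for illustration (coef0=0.0 matches the SVC default for the polynomial kernel).

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two arbitrary points, chosen only for illustration.
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
gamma, degree, coef0 = 0.7, 4, 0.0

# Kernel values written out by hand.
k_linear = x @ y                                   # <x, y>
k_poly = (gamma * (x @ y) + coef0) ** degree       # (gamma * <x, y> + coef0) ** degree
k_rbf = np.exp(-gamma * np.sum((x - y) ** 2))      # exp(-gamma * ||x - y||^2)

# The same values from scikit-learn, which expects 2D inputs.
X2, Y2 = x.reshape(1, -1), y.reshape(1, -1)
sk_linear = linear_kernel(X2, Y2)[0, 0]
sk_poly = polynomial_kernel(X2, Y2, degree=degree, gamma=gamma, coef0=coef0)[0, 0]
sk_rbf = rbf_kernel(X2, Y2, gamma=gamma)[0, 0]

print(np.isclose(k_linear, sk_linear), np.isclose(k_poly, sk_poly), np.isclose(k_rbf, sk_rbf))
```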

In [None]:
train, test = train_test_split(X, test_size=0.3, stratify=X["target"])

X_train = train.drop("target", axis=1)
y_train = train["target"]

X_test = test.drop("target", axis=1)
y_test = test["target"]

In [None]:
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel="linear", C=C).fit(X_train, y_train)

#### 4.1 Evaluate

In [16]:
predictions = svc.predict(X_test)
acc = accuracy_score(y_test, predictions)

print(f"Accuracy: {acc:.2f}")

Accuracy: 0.84


In [17]:
X_test.shape

(3039, 5)

The model is almost 84% accurate, which looks good for a first attempt without tuning any parameters. However, as mentioned earlier, the dataset is imbalanced, so the model is likely to be biased towards the majority class, in this case "Existing Customer", which is encoded as 1.

We now inspect the model's predictions. With 3039 test samples, the predictions array is too big to print in full, so we print the first 10 predictions and then count the number of 0s and 1s in the array. To count the elements in predictions, we can use Counter from the Python collections module.

In [18]:
# Print the first 10 predictions
print("Predictions for the first 10 samples:")
for i in range(10):
    print(f"Sample {i + 1}: Predicted = {predictions[i]}, Actual = {y_test.iloc[i]}")

Predictions for the first 10 samples:
Sample 1: Predicted = 1, Actual = 0
Sample 2: Predicted = 1, Actual = 1
Sample 3: Predicted = 1, Actual = 1
Sample 4: Predicted = 1, Actual = 1
Sample 5: Predicted = 1, Actual = 1
Sample 6: Predicted = 1, Actual = 1
Sample 7: Predicted = 1, Actual = 1
Sample 8: Predicted = 1, Actual = 1
Sample 9: Predicted = 1, Actual = 1
Sample 10: Predicted = 1, Actual = 1


In [19]:
counter = collections.Counter(predictions.tolist())
print(counter)

Counter({1: 3039})


Interestingly, the model has predicted every test sample as class 1, i.e., it does not predict class 0 at all. Still, it has an accuracy of 84%.

This illustrates both the problem with an imbalanced dataset and the weakness of accuracy as a measure: a model that always predicts the majority class scores well on accuracy while being useless for identifying attrited customers.

#### 4.2 classification_report

___
sklearn provides a detailed classification performance report:

👉 [sklearn.metrics: classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

___

In [21]:
target_names = ["Attrited Customer", "Existing Customer"]
print(classification_report(y_test, predictions, target_names=target_names))

                   precision    recall  f1-score   support

Attrited Customer       0.00      0.00      0.00       488
Existing Customer       0.84      1.00      0.91      2551

         accuracy                           0.84      3039
        macro avg       0.42      0.50      0.46      3039
     weighted avg       0.70      0.84      0.77      3039





The classification report shows that we have 488 instances of attrited customers in our test set and the model failed to predict any of them.
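The headline numbers in the report can be reproduced by hand, which helps show why accuracy and macro-averaged recall disagree so strongly here. The sketch below rebuilds the label arrays from the support counts in the report (488 and 2551) and assumes, as observed earlier, that every prediction is class 1. The helper name `recall_for` is hypothetical.

```python
# Test-set class counts taken from the classification report above.
n_attrited, n_existing = 488, 2551

y_true = [0] * n_attrited + [1] * n_existing
y_pred = [1] * len(y_true)  # the model predicted class 1 for every sample


def recall_for(label, y_true, y_pred):
    """Fraction of samples with the given true label that were predicted as that label."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_attrited = recall_for(0, y_true, y_pred)
recall_existing = recall_for(1, y_true, y_pred)
macro_recall = (recall_attrited + recall_existing) / 2

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (Attrited): {recall_attrited:.2f}")
print(f"Recall (Existing): {recall_existing:.2f}")
print(f"Macro-averaged recall: {macro_recall:.2f}")
```

The macro-averaged recall of 0.50 matches the "macro avg" row of the report and is the value a model gets when it ignores one of two classes entirely.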

#### 4.3 Dealing with imbalanced dataset

The simplest thing we can do is to balance the dataset by undersampling the majority class so that both classes have the same number of samples.

In [22]:
df_minor = X.loc[X["target"] == 0]
df_minor.shape

(1627, 6)

In [23]:
# Sampling the majority class
df_major = X.loc[X["target"] == 1].sample(n=df_minor.shape[0])
df_major.shape

(1627, 6)

df_major and df_minor now have the same number of samples. We merge them and use the result for training.

In [25]:
df_sub = pd.concat([df_major, df_minor], axis=0)
df_sub.head()

Unnamed: 0,0,1,2,3,4,target
4848,-0.789126,1.41967,-0.462016,0.399027,-0.163995,1
3485,0.458314,1.41967,1.657076,1.520572,-0.227287,1
2458,0.95729,0.090436,0.761641,1.637144,-0.461614,1
3771,-0.91387,-0.574182,1.749722,0.383075,-0.264674,1
4750,-0.040662,-0.574182,-0.604837,-1.426858,-0.084807,1


In [26]:
df_sub.shape

(3254, 6)

In [27]:
# Shuffle the dataset. In the current version the first 1627 are of class 1 and the last 1627 are of class 0
df_sub = df_sub.sample(frac=1).reset_index(drop=True)
df_sub.head()

Unnamed: 0,0,1,2,3,4,target
0,-1.038614,0.755053,-0.65237,-1.426858,-0.486342,0
1,0.084082,0.755053,-0.726091,-1.147086,-0.539331,0
2,1.456266,-1.903416,0.447392,1.470262,0.006157,1
3,-0.040662,0.090436,2.848054,-0.230462,2.492496,1
4,1.705754,1.41967,0.514951,0.995385,2.923764,1


In [28]:
# Training model

train, test = train_test_split(df_sub, test_size=0.3, stratify=df_sub["target"])
X_train = train.drop("target", axis=1)
y_train = train["target"]
X_test = test.drop("target", axis=1)
y_test = test["target"]

svc = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

In [29]:
# Evaluate model
predictions = svc.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(f"Accuracy: {acc:.2f}")

Accuracy: 0.72


In [30]:
counter = collections.Counter(predictions.tolist())
print(counter)

Counter({1: 543, 0: 434})


In [31]:
print(classification_report(y_test, predictions, target_names=target_names))

                   precision    recall  f1-score   support

Attrited Customer       0.74      0.66      0.70       488
Existing Customer       0.70      0.77      0.73       489

         accuracy                           0.72       977
        macro avg       0.72      0.72      0.72       977
     weighted avg       0.72      0.72      0.72       977
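One caveat about this evaluation: the undersampling was applied before the split, so the test set is balanced as well (488 vs. 489 samples) and no longer reflects the original class ratio. A possible variation, sketched below on synthetic data, is to split first and undersample only the training portion. The DataFrame, its column names, and the random seeds are assumptions for illustration, not part of the practical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for X: two features and roughly 84% of samples in class 1.
data = pd.DataFrame(
    {
        "f0": rng.normal(size=1000),
        "f1": rng.normal(size=1000),
        "target": (rng.random(1000) < 0.84).astype(int),
    }
)

# Split first, so the test set keeps the original class ratio.
train, test = train_test_split(
    data, test_size=0.3, stratify=data["target"], random_state=0
)

# Undersample the majority class in the training portion only.
minor = train[train["target"] == 0]
major = train[train["target"] == 1].sample(n=len(minor), random_state=0)
train_balanced = (
    pd.concat([major, minor]).sample(frac=1, random_state=0).reset_index(drop=True)
)

print(train_balanced["target"].value_counts().to_dict())
print(test["target"].value_counts(normalize=True).round(2).to_dict())
```

With this arrangement, metrics computed on `test` describe how the model would behave on data with the real class distribution.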



Though the new model predicts both classes, its performance is mediocre. Next, we try other types of kernels and compare their performance.

In [32]:
rbf_svc = svm.SVC(kernel="rbf", gamma=0.7, C=C).fit(X_train, y_train)
poly_svc = svm.SVC(kernel="poly", degree=4, C=C).fit(X_train, y_train)

In [None]:
predictions = rbf_svc.predict(X_test)
print(
    f"Accuracy of the model with rbf kernel: {accuracy_score(y_test, predictions):.2f}"
)

predictions = poly_svc.predict(X_test)
print(
    f"Accuracy of the model with poly kernel: {accuracy_score(y_test, predictions):.2f}"
)

Accuracy of the model with rbf kernel: 0.79
Accuracy of the model with poly kernel: 0.73


We observe that the rbf kernel gives the most promising model for our dataset. We can also inspect its support vectors.

In [34]:
rbf_svc.n_support_

array([594, 635], dtype=int32)

In [35]:
rbf_svc.support_vectors_

array([[ 1.08203433,  0.09043572,  2.84805374,  1.52793469, -0.9470482 ],
       [-1.2881015 ,  0.75505294, -0.62222157,  0.21742017, -0.64854615],
       [-1.53758948,  0.09043572,  2.84805374, -1.42685834,  1.4050538 ],
       ...,
       [ 1.33152231,  1.41967017, -0.14160511,  0.79905302, -0.88728891],
       [-1.03861352, -0.5741815 , -0.29619901, -0.0758504 , -0.7453974 ],
       [-0.66438154, -1.23879873,  0.45905543,  1.6616857 , -0.97236494]],
      shape=(1229, 5))

In the above, rbf_svc.n_support_ shows that the model has learned 594 support vectors for class 0 and 635 support vectors for class 1.

rbf_svc.support_vectors_ holds all 1229 support vectors (594 + 635), one row per vector and one column per feature.
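The relationship between n_support_, support_vectors_, and the training data can be checked on a small example. The sketch below fits an rbf SVC on synthetic data from make_classification (an assumption for illustration; the names `X_demo`, `y_demo`, and `model` are hypothetical) and splits the support vectors by class using the counts in n_support_.

```python
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic two-class data with five features, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = svm.SVC(kernel="rbf", gamma=0.7, C=1.0).fit(X_demo, y_demo)

# n_support_ gives the number of support vectors per class.
n_class0, n_class1 = model.n_support_

# support_vectors_ stores the vectors grouped by class, class 0 first.
sv_class0 = model.support_vectors_[:n_class0]
sv_class1 = model.support_vectors_[n_class0:]

# support_ holds the indices of the support vectors in the training data,
# so it can be used to look up their labels.
labels_of_sv = y_demo[model.support_]

print(model.n_support_, model.support_vectors_.shape)
print(sv_class0.shape, sv_class1.shape)
```

The same slicing applied to rbf_svc would separate its 594 class-0 vectors from its 635 class-1 vectors.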