## Classification on Finance Data

In this notebook you will learn:
- how to clean and tidy the data
- how to deal with unbalanced classes
- which kinds of classification models we can use

### About the dataset we will be using

Imagine you are working as a data scientist in a big corporate finance company. The company you work for has gathered a lot of credit-related information. Companies use credit scores to decide whether to offer you a mortgage, credit card, loan, or other credit product. Credit scores are also used to determine the interest rate and credit limit you receive. We will use this dataset to build a classification model that classifies the credit score. If you would like to download the dataset yourself, check out this link: https://www.kaggle.com/datasets/parisrohan/credit-score-classification?resource=download

In [4]:
import pandas as pd

# Let's first read this data into a dataframe that we can view
df = pd.read_csv("test.csv")
# Let's view the first few rows
df.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,0x160a,CUS_0xd40,September,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,2022.0,Good,809.98,35.030402,22 Years and 9 Months,No,49.574949,236.64268203272132,Low_spent_Small_value_payments,186.26670208571767
1,0x160b,CUS_0xd40,October,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.053114,22 Years and 10 Months,No,49.574949,21.465380264657146,High_spent_Medium_value_payments,361.444003853782
2,0x160c,CUS_0xd40,November,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.811894,,No,49.574949,148.23393788500923,Low_spent_Medium_value_payments,264.67544623343
3,0x160d,CUS_0xd40,December,Aaron Maashoh,24_,821-00-0265,Scientist,19114.12,,3,...,4.0,Good,809.98,32.430559,23 Years and 0 Months,No,49.574949,39.08251089460281,High_spent_Medium_value_payments,343.82687322383634
4,0x1616,CUS_0x21b1,September,Rick Rothackerj,28,004-07-5839,_______,34847.84,3037.986667,2,...,5.0,Good,605.03,25.926822,27 Years and 3 Months,No,18.816215,39.684018417945296,High_spent_Large_value_payments,485.2984336755923


### Credit Standing

Based on a person's credit standing we can make a very good inference. Each person's standing is one of the following:

Standing:
- Good
- Standard
- Bad


"Credit Mix" refers to how well a customer is managing their financial obligations, particularly their payments. It could be an important factor for assessing creditworthiness and risk for lenders and financial institutions. The binary classification of "Good" or "Bad" payment behavior is commonly used to predict whether a customer is likely to make their payments on time or have issues with payments. 

In [5]:
df['Credit_Mix'].value_counts()

Standard    18379
Good        12260
_            9805
Bad          9556
Name: Credit_Mix, dtype: int64

It seems there are 9805 placeholder values ("_") in this column. To preprocess our data, we need to filter these rows out so that our classification is more accurate.

In [6]:
# Let's filter our data: drop the "_" placeholder rows, then the "Standard" rows
filterd_df = df[df['Credit_Mix'] != "_"]
filterd_df = filterd_df[filterd_df['Credit_Mix'] != "Standard"]

filterd_df['Credit_Mix'].value_counts()

Good    12260
Bad      9556
Name: Credit_Mix, dtype: int64

As you can see, I have gotten rid of the 9805 "_" values. I also removed the "Standard" label for now, which leaves a binary problem. (There is a reason for this that I will show later.)
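
The two filters can also be written as a single step. Here is a minimal sketch on a made-up toy column (the values are hypothetical, not taken from the dataset) that keeps only the two labels we want with `isin`:

```python
import pandas as pd

# Toy column (hypothetical) with the same four kinds of values as Credit_Mix
toy = pd.DataFrame({"Credit_Mix": ["Good", "_", "Standard", "Bad", "Good", "_"]})

# Keep only the rows whose label is one of the two classes we want
binary = toy[toy["Credit_Mix"].isin(["Good", "Bad"])]

print(binary["Credit_Mix"].value_counts())
```

Listing the labels to keep, rather than the ones to drop, also protects against any other unexpected placeholder that might appear in the column.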

### Feature Selection

For our feature selection we will consider the following columns:

- Age
- Annual_Income
- Num_Credit_Card
- Interest_Rate
- Credit_Utilization_Ratio
- Num_of_Delayed_Payment
- Credit_History_Age
- Payment_of_Min_Amount
- Total_EMI_per_month
- Amount_invested_monthly


Our Target Variable is the Credit_Mix column

### Let's run .info() to see what our dataframe looks like

In [7]:
filterd_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21816 entries, 0 to 49998
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        21816 non-null  object 
 1   Customer_ID               21816 non-null  object 
 2   Month                     21816 non-null  object 
 3   Name                      19617 non-null  object 
 4   Age                       21816 non-null  object 
 5   SSN                       21816 non-null  object 
 6   Occupation                21816 non-null  object 
 7   Annual_Income             21816 non-null  object 
 8   Monthly_Inhand_Salary     18573 non-null  float64
 9   Num_Bank_Accounts         21816 non-null  int64  
 10  Num_Credit_Card           21816 non-null  int64  
 11  Interest_Rate             21816 non-null  int64  
 12  Num_of_Loan               21816 non-null  object 
 13  Type_of_Loan              19233 non-null  object 
 14  Delay_

Since some rows contain null values, we should drop them so that the classification models we run later are not affected by missing data.

In [8]:
filterd_df.dropna(inplace=True)
df

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,0x160a,CUS_0xd40,September,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,2022.0,Good,809.98,35.030402,22 Years and 9 Months,No,49.574949,236.64268203272135,Low_spent_Small_value_payments,186.26670208571772
1,0x160b,CUS_0xd40,October,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.053114,22 Years and 10 Months,No,49.574949,21.465380264657146,High_spent_Medium_value_payments,361.44400385378196
2,0x160c,CUS_0xd40,November,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.811894,,No,49.574949,148.23393788500925,Low_spent_Medium_value_payments,264.67544623342997
3,0x160d,CUS_0xd40,December,Aaron Maashoh,24_,821-00-0265,Scientist,19114.12,,3,...,4.0,Good,809.98,32.430559,23 Years and 0 Months,No,49.574949,39.08251089460281,High_spent_Medium_value_payments,343.82687322383634
4,0x1616,CUS_0x21b1,September,Rick Rothackerj,28,004-07-5839,_______,34847.84,3037.986667,2,...,5.0,Good,605.03,25.926822,27 Years and 3 Months,No,18.816215,39.684018417945296,High_spent_Large_value_payments,485.2984336755923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0x25fe5,CUS_0x8600,December,Sarah McBridec,4975,031-35-0942,Architect,20002.88,1929.906667,10,...,12.0,_,3571.7,34.780553,,Yes,60.964772,146.48632477751087,Low_spent_Small_value_payments,275.53956951573343
49996,0x25fee,CUS_0x942c,September,Nicks,25,078-73-5990,Mechanic,39628.99,,4,...,7.0,Good,502.38,27.758522,31 Years and 11 Months,NM,35.104023,181.44299902757518,Low_spent_Small_value_payments,409.39456169535066
49997,0x25fef,CUS_0x942c,October,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,...,7.0,Good,502.38,36.858542,32 Years and 0 Months,No,35.104023,__10000__,Low_spent_Large_value_payments,349.7263321025098
49998,0x25ff0,CUS_0x942c,November,Nicks,25,078-73-5990,Mechanic,39628.99,,4,...,7.0,Good,502.38,39.139840,32 Years and 1 Months,No,35.104023,97.59857973344877,High_spent_Small_value_payments,463.23898098947717


In [83]:
filterd_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40428 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       40428 non-null  object 
 1   Annual_Income             40428 non-null  object 
 2   Num_Credit_Card           40428 non-null  int64  
 3   Interest_Rate             40428 non-null  int64  
 4   Num_of_Delayed_Payment    40428 non-null  object 
 5   Credit_Utilization_Ratio  40428 non-null  float64
 6   Credit_History_Age        40428 non-null  object 
 7   Payment_of_Min_Amount     40428 non-null  object 
 8   Total_EMI_per_month       40428 non-null  float64
 9   Amount_invested_monthly   40428 non-null  object 
 10  Payment_Behaviour         40428 non-null  object 
dtypes: float64(2), int64(2), object(7)
memory usage: 3.7+ MB


Perfect! As you can see, every column now has the same non-null count, so no missing values remain.

### Setting up our binary decision variable

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# lets set up our features
features = ['Age', 'Annual_Income', 'Num_Credit_Card', 'Interest_Rate', 'Credit_Utilization_Ratio']
# our target variable
target = 'Credit_Mix'

X = filterd_df[features].copy()
y = filterd_df[target]

print(X)





      Age Annual_Income  Num_Credit_Card  Interest_Rate  \
0      23      19114.12                4              3   
1      24      19114.12                4              3   
4      28      34847.84                4              6   
5      28      34847.84                4              6   
9      35     143162.64                5              8   
...    ..           ...              ...            ...   
49990  50       37188.1                4           4252   
49992  29      20002.88                8             29   
49993  29      20002.88                8             29   
49994  29      20002.88                8             29   
49997  25      39628.99                6              7   

       Credit_Utilization_Ratio  
0                     35.030402  
1                     33.053114  
4                     25.926822  
5                     30.116600  
9                     35.685836  
...                         ...  
49990                 25.708414  
49992              

It seems some values have "_" in them. That will become a problem in our next step, so let's clean up this data.

In [14]:


# Loop through each column
for col in X.columns:
    # Check if the column data type is object (string)
    if X[col].dtype == 'object':
        # Replace underscores with an appropriate value (e.g., empty string)
        X[col] = X[col].str.replace('_', '')

# Now the underscores are removed from the entire dataset
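
Stripping the underscores still leaves the cleaned columns as strings. Below is a minimal sketch, on a made-up toy frame with the same kind of debris (values such as `24_`), of how the cleaned text could also be converted to numbers with `pd.to_numeric`. The toy values and the `errors="coerce"` choice are my assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy data (hypothetical) mimicking the stray underscores seen in the real columns
toy = pd.DataFrame({
    "Age": ["23", "24_", "28"],
    "Annual_Income": ["19114.12", "34847.84_", "20002.88"],
})

for col in toy.columns:
    # Only text columns need cleaning; numeric columns are left alone
    if not pd.api.types.is_numeric_dtype(toy[col]):
        # Remove literal underscores, then convert the text to numbers;
        # anything that is still not numeric becomes NaN instead of raising
        cleaned = toy[col].str.replace("_", "", regex=False)
        toy[col] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)
```

With `errors="coerce"`, any value that cannot be parsed shows up as a missing value that can be inspected or dropped, rather than stopping the notebook with an exception.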


The StandardScaler is a preprocessing technique used to standardize the features by subtracting the mean and dividing by the standard deviation. This process transforms the features into a distribution with mean 0 and standard deviation 1. This is done independently for each feature.

In [15]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler object
scaler = StandardScaler()

# fit_transform calculates the mean and standard deviation of each feature in the
# training set (X_train), then transforms the training features by subtracting the
# mean and dividing by the standard deviation
X_train_scaled = scaler.fit_transform(X_train)

# The same transformation (using the training mean and standard deviation) is applied to the test set (X_test)

X_test_scaled = scaler.transform(X_test)
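
To make the standardization concrete, here is a small sketch that compares ```StandardScaler``` with the same calculation done by hand. The toy matrix is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): two features on very different scales
X_toy = np.array([[23.0, 19114.12],
                  [28.0, 34847.84],
                  [35.0, 143162.64]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual z-score: subtract the column mean, divide by the column standard deviation
# (np.std defaults to the population standard deviation, which StandardScaler also uses)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Each column is handled separately, so a feature measured in tens (such as age) and one measured in tens of thousands (such as income) end up on the same scale.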

### Dummy Classification

To start off we will use ```DummyClassifier``` as our first model.

The ```DummyClassifier``` is a classifier provided by scikit-learn that serves as a baseline model for binary or multiclass classification tasks. It is often used to understand how well a more complex classifier performs compared to a simple, naive strategy. It makes predictions from simple rules that ignore the input features, which makes it useful for establishing a benchmark performance level.

In [16]:
# Initialize and train the DummyClassifier
dummy_model = DummyClassifier(strategy='most_frequent')  # This strategy predicts the most frequent class
dummy_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = dummy_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

Accuracy: 0.49040139616055844
Confusion Matrix:
 [[   0 1168]
 [   0 1124]]
Classification Report:
               precision    recall  f1-score   support

         Bad       0.00      0.00      0.00      1168
        Good       0.49      1.00      0.66      1124

    accuracy                           0.49      2292
   macro avg       0.25      0.50      0.33      2292
weighted avg       0.24      0.49      0.32      2292





### What does this accuracy score mean?

The accuracy score of ```0.4904``` means that our classification model correctly predicted the target value approximately 49.04% of the time. This makes sense because the split between 'Good' and 'Bad' is quite even: the dummy classifier always predicts the class that was most frequent in the training data, so it guesses wrong about half of the time.

However, this accuracy is quite low for binary classification, where always guessing one class already gets about half of the rows right. The dummy classifier is only a baseline, so other classification methods may work better for our dataset. This value is far too low to call this a reliable method.
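
The accuracy can be reproduced by hand from the confusion matrix above. This short sketch uses only the two row totals reported there:

```python
# Counts taken from the confusion matrix above (rows are the true classes)
true_bad = 1168
true_good = 1124
total = true_bad + true_good

# The dummy model predicted "Good" for every test row,
# so it is correct exactly on the rows whose true label is "Good"
accuracy_always_good = true_good / total

# For comparison: what always predicting "Bad" would have scored on the same rows
accuracy_always_bad = true_bad / total

print(f"always 'Good': {accuracy_always_good:.4f}")
print(f"always 'Bad':  {accuracy_always_bad:.4f}")
```

A constant prediction can never score higher than the share of the largest class in the test set, which is why this number works as a floor that any real model should beat.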

### Let's undersample the majority class

Undersampling the majority class balances the two classes in the training set, which can change what the dummy classifier predicts. Our next step is to undersample the majority class and run the same dummy classifier.

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from imblearn.under_sampling import RandomUnderSampler


# Undersample the majority class using RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy="majority", random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train_scaled, y_train)


# Initialize and train the DummyClassifier on the resampled data
dummy_model = DummyClassifier(strategy='most_frequent')
dummy_model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = dummy_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)


Accuracy: 0.5095986038394416
Confusion Matrix:
 [[1168    0]
 [1124    0]]
Classification Report:
               precision    recall  f1-score   support

         Bad       0.51      1.00      0.68      1168
        Good       0.00      0.00      0.00      1124

    accuracy                           0.51      2292
   macro avg       0.25      0.50      0.34      2292
weighted avg       0.26      0.51      0.34      2292





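As you can see, the accuracy is slightly higher now, but not because the model learned anything. After undersampling, 'Good' and 'Bad' appear equally often in the training set, and the dummy classifier now predicts 'Bad' for every row. 'Bad' happens to be slightly more common in the test set (1168 rows versus 1124), so the accuracy moves from about 49% to about 51%.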
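
Here is a small sketch of why the prediction changes once the classes are balanced. It assumes, as in the scikit-learn releases I have used, that ```DummyClassifier``` with `strategy='most_frequent'` resolves a tie by taking the first class in sorted order. The toy labels are made up:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy, perfectly balanced labels (hypothetical): two "Good" and two "Bad"
X_toy = np.zeros((4, 1))  # the feature values are ignored by DummyClassifier
y_toy = np.array(["Good", "Bad", "Good", "Bad"])

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

print(dummy.classes_)       # class labels in sorted order
print(dummy.class_prior_)   # class frequencies seen during fit
print(dummy.predict(np.zeros((3, 1))))
```

With equal frequencies there is no real "most frequent" class, so which label the baseline predicts is an artifact of label ordering rather than something learned from the data.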

### Logistic Regression on Credit Mix Column

Let's talk about the logistic regression algorithm. Logistic regression is an effective classification algorithm for several reasons:

1. ```Interpretability```: Logistic regression provides interpretable results. The coefficients associated with each feature can be directly interpreted as the change in the log-odds of the outcome for a one-unit change in the corresponding feature. This makes it easier to understand the relationship between the features and the target.

2. ```Efficiency```: Logistic regression is computationally efficient and can handle large datasets relatively well. Training logistic regression models is generally faster compared to more complex algorithms.
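
To illustrate the interpretability point, here is a minimal sketch with a made-up single-feature model. The `intercept` and `coef` values are hypothetical, not fitted on our data. It shows that a one-unit increase in the feature multiplies the odds by `exp(coef)`, no matter where you start:

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted values for a single-feature model: log-odds = intercept + coef * x
intercept = -1.0
coef = 0.7

def odds(x: float) -> float:
    """Odds p / (1 - p) of the positive class at feature value x."""
    p = sigmoid(intercept + coef * x)
    return p / (1.0 - p)

# Increasing x by one unit multiplies the odds by exp(coef), whatever x is
ratio_at_0 = odds(1.0) / odds(0.0)
ratio_at_5 = odds(6.0) / odds(5.0)
print(ratio_at_0, ratio_at_5, math.exp(coef))
```

Note that when the features have been standardized, "one unit" means one standard deviation of the original feature.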

Let's conduct multiclass classification on the dataset.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Let's first read this data into a dataframe that we can view
df = pd.read_csv("test.csv")

# Let's filter our data: drop only the "_" placeholder rows
filterd_df = df[df['Credit_Mix'] != "_"]
filterd_df['Credit_Mix'].value_counts()


Standard    18379
Good        12260
Bad          9556
Name: Credit_Mix, dtype: int64

We preprocess the data first, as we did for binary classification. Note that we are not getting rid of the "Standard" label, since the problem is no longer binary.

In [22]:
# lets set up our features
features = ['Age', 'Annual_Income', 'Num_Credit_Card', 'Interest_Rate', 'Credit_Utilization_Ratio']
# our target variable
target = 'Credit_Mix'

X = filterd_df[features].copy()
y = filterd_df[target]

# Loop through each column
for col in X.columns:
    # Check if the column data type is object (string)
    if X[col].dtype == 'object':
        # Replace underscores with an appropriate value (e.g., empty string)
        X[col] = X[col].str.replace('_', '')

# Now the underscores are removed from the entire dataset

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Choose a classifier (Logistic Regression with One-vs-Rest approach)
classifier = LogisticRegression(max_iter=1000, multi_class='ovr')  # 'ovr' stands for "One-vs-Rest"

# Train the classifier
classifier.fit(X_train_scaled, y_train)

# Make predictions
y_pred = classifier.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

Accuracy: 0.4562756561761413
Classification Report:
               precision    recall  f1-score   support

         Bad       0.00      0.00      0.00      1937
        Good       0.00      0.00      0.00      2434
    Standard       0.46      1.00      0.63      3668

    accuracy                           0.46      8039
   macro avg       0.15      0.33      0.21      8039
weighted avg       0.21      0.46      0.29      8039





### K-Nearest Neighbours

Since we are not seeing much success with logistic regression, let's try the K-nearest neighbours algorithm and see what kind of results we get.

K-nearest neighbours (KNN) is a simple and intuitive algorithm for classification. It works by finding the K closest data points in the training set to a given test data point and then predicting the majority class among those K neighbours.
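
The idea fits in a few lines of plain Python. The sketch below is a simplified illustration on made-up points, not how scikit-learn implements it internally. The helper name `knn_predict` and the toy data are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, already-scaled data (hypothetical values)
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9)]
labels = ["Good", "Good", "Good", "Bad", "Bad"]

print(knn_predict(points, labels, query=(0.1, 0.1), k=3))
print(knn_predict(points, labels, query=(2.1, 2.0), k=3))
```

Because the vote depends entirely on distances, features on larger scales would dominate the result, which is one reason we standardize the features before fitting KNN.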

 

Let's import the needed class from scikit-learn.

In [23]:
from sklearn.neighbors import KNeighborsClassifier

We will use the same data; let's just make sure we have everything correct.

In [24]:
# Let's first read this data into a dataframe that we can view
df = pd.read_csv("test.csv")

# Let's filter our data: drop only the "_" placeholder rows
filterd_df = df[df['Credit_Mix'] != "_"]
filterd_df['Credit_Mix'].value_counts()

Standard    18379
Good        12260
Bad          9556
Name: Credit_Mix, dtype: int64

In [28]:
# lets set up our features
features = ['Age', 'Annual_Income', 'Num_Credit_Card', 'Interest_Rate', 'Credit_Utilization_Ratio']
# our target variable
target = 'Credit_Mix'

X = filterd_df[features].copy()
y = filterd_df[target]


# Loop through each column
for col in X.columns:
    # Check if the column data type is object (string)
    if X[col].dtype == 'object':
        # Replace underscores with an appropriate value (e.g., empty string)
        X[col] = X[col].str.replace('_', '')

# Now the underscores are removed from the entire dataset

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



This should seem very familiar: we select our features and target and filter out the "_" placeholder rows in the dataset.

In [29]:
# Initialize and train the KNN classifier
k = 5  # Number of neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn_classifier.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)



Accuracy: 0.6543102375917402
Classification Report:
               precision    recall  f1-score   support

         Bad       0.67      0.71      0.69      1937
        Good       0.67      0.66      0.67      2434
    Standard       0.63      0.62      0.63      3668

    accuracy                           0.65      8039
   macro avg       0.66      0.66      0.66      8039
weighted avg       0.65      0.65      0.65      8039



Using the KNN algorithm, we seem to have achieved the best accuracy so far. Here are some likely reasons why:

1. Local Patterns
In our dataset, determining credit standing comes down to analyzing our feature columns and finding slight patterns that result in a "Good", "Bad", or "Standard" credit standing. The KNN algorithm is good at picking up these local patterns and making inferences based on them.
2. No Strong Assumptions
KNN is a non-parametric algorithm that doesn't make strong assumptions about the underlying data distribution. If the data distribution is complex or nonlinear, KNN can adapt to it.

### Conclusion

For this dataset, we did not see much success with the binary dummy classifier or with logistic regression. However, not getting a very high accuracy score is quite common, as no algorithm is perfect for every dataset. It is realistic to get an accuracy score below 50%, but a good classification model should do better than a coin flip. For our dataset, the KNN algorithm worked best and gave us about 65% accuracy.