In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, train_test_split,KFold,RandomizedSearchCV,GridSearchCV
from sklearn import metrics,svm
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import lightgbm as lgb
from matplotlib.pyplot import figure
from sklearn.metrics import recall_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
from sklearn.utils import class_weight
from sklearn.preprocessing import LabelEncoder

import os
import random

import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from keras.utils import np_utils
from keras import layers
from kerastuner.tuners import RandomSearch
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

os.getcwd()
os.chdir('/Users/haochunniu/Desktop/Kaggle Compatition/Credit Card Customer Segmentation')




# Problem Statement #
Having completed my master's degree and no longer being a student, the very first financial tool I obtained was a credit card. In today's business environment, banks segment their customers and offer many kinds of credit cards, and different card types provide different levels of benefits. To understand what kind of benefits I could receive based on my income and basic information, I decided to build a classification model that predicts the card level a customer would receive.
     
In this project, I will try several different models, such as NN, XGBoost, and LGBM. In addition, to compare the performance of different models and find the best hyperparameter combination, I will use Nested Cross-Validation, Cross-Validation, Grid-Search, and Random-Search techniques.

## Data Preprocessing ##
The dataset has **10,127** rows and **23** columns in total. **6** of the columns are categorical, and the rest are numeric. The **dependent (target) variable** we are interested in is the categorical **Card Category**, which has only **4 possible outcomes: Blue, Silver, Gold, and Platinum**. Fortunately, the data is quite clean and tidy, with no NA or NULL values. Among all the columns, Avg_Open_To_Buy has no clear definition, and I am not sure how it is calculated, so it is dropped from the data. Because the CLIENTNUM column is only an identifier and carries no actual meaning, it is also dropped. In addition, customer income is originally divided into several brackets. To make it ordered and comparable, I turn the categorical income feature into a numerical one by mapping each bracket to a representative dollar value (with 'Unknown' mapped to 80,000).

In [2]:
raw = pd.read_csv("BankChurners.csv")
raw = raw.drop(columns="Avg_Open_To_Buy")

In [3]:
income_mapping={'Less than $40K':20000,
                '$40K - $60K':50000,
                '$60K - $80K':70000,
                '$80K - $120K':100000,
                '$120K +':200000,
                'Unknown':80000
                }
raw=raw.assign(Income=raw.Income_Category.map(income_mapping))
raw=raw.drop(columns=["Income_Category","CLIENTNUM"])

Based on the distribution of the card categories, this dataset suffers from an extreme **class imbalance**. I will address it with **sampling techniques (under/over-sampling)** or by **setting specific class weights**.

In [4]:
#Percentage of 4 card categories
round(raw.Card_Category.value_counts()/len(raw)*100,2)

Blue        93.18
Silver       5.48
Gold         1.15
Platinum     0.20
Name: Card_Category, dtype: float64

## Exploratory Data Analysis & Data Visualization ##

In this project, all of the EDA and Data Visualization will be done on Tableau. Please use the link below to access the public dashboard on my Tableau public account.   
https://public.tableau.com/views/CreditCardCustomerSegmentationAnalysis/Dashboard1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link

## Train & Test Data Split ##
To train the models and evaluate the final model's performance, I split the entire dataset into **train (80%) and test (20%)** sets. To tune the **hyper-parameters and choose the best performing model**, I use **cross-validation** and **nested cross-validation**.

In [5]:
x_train,x_test,y_train,y_test = train_test_split(raw.drop(columns="Card_Category"),raw.Card_Category,test_size=0.2,random_state=99)

In [6]:
#Percentage of 4 card categories for train data
print("Card category distribution for train data")
print(round(y_train.value_counts()/len(y_train)*100,2))

Card category distribution for train data
Blue        93.28
Silver       5.39
Gold         1.11
Platinum     0.21
Name: Card_Category, dtype: float64


In [7]:
#Percentage of 4 card categories for test data
x_test = x_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
print("Card category distribution for test data")
print(round(y_test.value_counts()/len(y_test)*100,2))

Card category distribution for test data
Blue        92.74
Silver       5.82
Gold         1.28
Platinum     0.15
Name: Card_Category, dtype: float64
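The train and test shares above differ slightly (Platinum is 0.21% of train but 0.15% of test) because the split is purely random. As a minimal sketch, assuming synthetic labels with a similar imbalance rather than the real data, passing `stratify` to `train_test_split` keeps the class shares nearly identical in both parts. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced labels (hypothetical stand-in for Card_Category)
labels = pd.Series(
    rng.choice(["Blue", "Silver", "Gold", "Platinum"],
               size=10_000, p=[0.93, 0.055, 0.012, 0.003])
)
features = pd.DataFrame({"x": rng.normal(size=len(labels))})

# stratify=labels makes each split preserve the class proportions
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=99, stratify=labels
)

train_pct = y_tr.value_counts(normalize=True)
test_pct = y_te.value_counts(normalize=True)
print(pd.DataFrame({"train": train_pct, "test": test_pct}).round(4))
```

With so few Platinum rows, stratifying mainly guards against a split in which the rarest class is almost absent from the test set.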


## Outlier Detection: Isolation Forest ##
Before dealing with the class imbalance, I first handle outliers with **the Isolation Forest method**. Because the data is extremely imbalanced, I **detect outliers category by category** instead of on the entire training data at once; otherwise the rare categories themselves could be flagged as anomalies. With the Isolation Forest function, **outliers** are tagged as **-1** and inliers as **1**.
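To make the tagging convention concrete, here is a minimal sketch on synthetic data (not the credit card data): `IsolationForest.fit_predict` returns `1` for inliers and `-1` for outliers, and `contamination=0.05` sets the threshold so that roughly 5% of the fitted rows are flagged. The array `points` is a hypothetical stand-in for the rows of one card category.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(123)

# Synthetic stand-in for one card category: 1,000 rows, 5 numeric features
points = rng.normal(size=(1000, 5))

detector = IsolationForest(contamination=0.05, random_state=123)
flags = detector.fit_predict(points)  # 1 = inlier, -1 = outlier

outlier_share = float(np.mean(flags == -1))
kept = points[flags == 1]  # rows that survive the filter
print(f"flagged {outlier_share:.1%} of rows, kept {len(kept)}")
```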

In [8]:
from sklearn.ensemble import IsolationForest
model_IF = IsolationForest(contamination=float(0.05),random_state=123)

In [9]:
# Fit the Isolation Forest within each card category and tag every row (-1 = outlier, 1 = inlier)
tagged_parts = []
for i in y_train.unique():
    x_train_tem = x_train[y_train==i]
    x_train_tem_encode = pd.get_dummies(x_train_tem,columns=['Attrition_Flag','Gender','Education_Level','Marital_Status'])
    model_IF.fit(x_train_tem_encode.values)
    # Work on an explicit copy so adding columns does not trigger SettingWithCopyWarning
    train_tem = x_train_tem.copy()
    train_tem['anomaly'] = model_IF.predict(x_train_tem_encode.values)
    train_tem['Card_Category'] = i
    tagged_parts.append(train_tem)

# Keep only the inliers, then rebuild the feature matrix and the target
train_no_out = pd.concat(tagged_parts)
train_no_out = train_no_out[train_no_out.anomaly==1]
train_no_out = train_no_out.reset_index(drop=True)
x_train_no_out = train_no_out.drop(columns=['Card_Category','anomaly'])
y_train_no_out = train_no_out.Card_Category


In [10]:
#Percentage of 4 card categories for train data without outliers
print("Card category distribution for train data without outliers")
print(round(y_train_no_out.value_counts()/len(y_train_no_out)*100,2))

Card category distribution for train data without outliers
Blue        93.29
Silver       5.39
Gold         1.10
Platinum     0.21
Name: Card_Category, dtype: float64


## One-Hot Encoding Categorical Variable ##   
Before fitting the models, I transform the categorical variables into numeric variables via **one-hot encoding**.

In [11]:
#One-Hot encode X
dummy_x_train_no_out = pd.get_dummies(x_train_no_out,columns=['Attrition_Flag','Gender','Education_Level','Marital_Status'])
dummy_x_test =  pd.get_dummies(x_test,columns=['Attrition_Flag','Gender','Education_Level','Marital_Status'])
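Encoding the train and test frames separately works only if every category level appears in both. As a hedged sketch on toy data (the frames below are made up for illustration), one common safeguard is to reindex the encoded test frame to the training columns, filling any missing dummy column with 0.

```python
import pandas as pd

# Toy frames: the test set happens to contain no 'M' and no 'Divorced' rows
train = pd.DataFrame({"Gender": ["M", "F", "F"],
                      "Marital_Status": ["Single", "Married", "Divorced"],
                      "Age": [40, 35, 52]})
test = pd.DataFrame({"Gender": ["F", "F"],
                     "Marital_Status": ["Single", "Married"],
                     "Age": [29, 61]})

cat_cols = ["Gender", "Marital_Status"]
train_enc = pd.get_dummies(train, columns=cat_cols)
test_enc = pd.get_dummies(test, columns=cat_cols)

# Force the test frame to have exactly the training columns, in the same order
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

A model fitted on `train_enc` can then score `test_aligned` without a column mismatch.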

In [12]:
#Label-encode Y, then one-hot encode it for the neural networks
encoder1 = LabelEncoder()
encoder1.fit(y_train_no_out)
encoded_y_train_no_out = encoder1.transform(y_train_no_out)
dummy_y_train_no_out = np_utils.to_categorical(encoded_y_train_no_out)
single_label_y_train_no_out=np.argmax(dummy_y_train_no_out, axis=1)

#Reuse the encoder fitted on the training labels so train and test share one label mapping
encoded_y_test = encoder1.transform(y_test)
dummy_y_test = np_utils.to_categorical(encoded_y_test)
single_label_y_test=np.argmax(dummy_y_test, axis=1)
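For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, which is why the reports later in this notebook show 0 for Blue, 1 for Gold, 2 for Platinum, and 3 for Silver. A small sketch with made-up label lists, fitting the encoder once and reusing it so both sets share one mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label lists for illustration
train_labels = ["Blue", "Silver", "Gold", "Blue", "Platinum"]
test_labels = ["Silver", "Blue"]

encoder = LabelEncoder().fit(train_labels)

# Classes are stored alphabetically; the position is the integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
test_codes = encoder.transform(test_labels)

print(mapping)
print(test_codes.tolist())
```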

## Calculate Class Weight ##
Given that the data is extremely imbalanced, I will first calculate the class weights for all credit card categories, before training and fitting the models. 

In [13]:
# Calculate class categories' weights: 0 is 'Blue', 1 is 'Gold', 2 'Platinum', 3 is 'Silver'
nn_class_weights = compute_class_weight(class_weight = 'balanced',classes = np.unique(y_train_no_out),y = y_train_no_out)
nn_class_weights = dict(zip([0,1,2,3], nn_class_weights))

nn_class_weights2 = dict(zip([0,1,2,3],[1,50,500,300]))

class_weights = compute_class_weight(class_weight = 'balanced',classes = np.unique(y_train_no_out),y = y_train_no_out)
class_weights = dict(zip(np.unique(y_train_no_out), class_weights))
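As a worked check of what `'balanced'` means, scikit-learn computes each weight as `n_samples / (n_classes * count_of_class)`. The sketch below rebuilds a label vector from the class counts shown in the training classification report later in this notebook (7179 Blue, 85 Gold, 16 Platinum, 415 Silver) and compares the formula with `compute_class_weight`. The label vector is reconstructed for illustration and is not the actual `y_train_no_out`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the support column of the training report
counts = {"Blue": 7179, "Gold": 85, "Platinum": 16, "Silver": 415}
y = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(y)  # sorted: Blue, Gold, Platinum, Silver
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same quantity by hand: n_samples / (n_classes * class_count)
manual = len(y) / (len(classes) * np.array([counts[c] for c in classes]))

for name, w in zip(classes, weights):
    print(f"{name:9s} {w:8.3f}")
```

The rarest class receives the largest weight, so each Platinum row counts far more in the loss than a Blue row.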

In [14]:
# Unlike the other models, XGBoost takes per-row sample weights rather than class weights, so this cell computes the sample weights
xgb_class_weights = class_weight.compute_sample_weight(class_weight='balanced',y=y_train_no_out)

## Nested Random Search to select best ML model ##
To find the best model for the task, I use **nested cross-validation with a random search in the inner loop** to compare the models' performance. In this project, I try three different ML models: **XGBoost, LightGBM, and Random Forest**.

In [57]:
# 1. Create the Classifier
xgb=XGBClassifier(objective="multi:softmax",seed=9,use_label_encoder =False)

rf=RandomForestClassifier(random_state=9,class_weight=class_weights)

lgbm=lgb.LGBMClassifier(objective='multiclass',random_state=9,class_weight=class_weights)

##############################################################
# 2. Create the parameter grid
xgb_grid={'eta':np.arange(0.1,0.6,0.1),
          'max_depth':list(range(3,16)),
          'n_estimators':list(range(10,310,10)),
          'gamma':list(range(1,6)) }

rf_grid={'n_estimators':list(range(100,1100,100)),
         'max_depth':list(range(3,11))}

lgbm_grid={'learning_rate':np.arange(0.1,0.6,0.1),
           'max_depth':list(range(3,16)),
           'n_estimators':list(range(10,310,10))}

##############################################################
# 3. Create the CV
inner_cv = KFold(n_splits=3, shuffle=True, random_state=9)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=9)

##############################################################
#Because this is a multi-class classification problem, we need to choose how the per-class scores are averaged in the CVs
#Here, weighted F1 is used for both the outer and the inner CV
# 4-1-1. Random-search CV for XGBoost
clf = RandomizedSearchCV(xgb,xgb_grid,cv=inner_cv,scoring='f1_weighted',n_iter=10,random_state=9)

# 4-1-2. Nested CV for XGBoost
nested_score = cross_val_score(clf,X=dummy_x_train_no_out, y=single_label_y_train_no_out, cv=outer_cv,scoring='f1_weighted') 

# 4-1-3. Result for Nested CV
xgb_result=nested_score.mean()

##############################################################
# 4-2-1. Random-search CV for Random Forest
clf = RandomizedSearchCV(rf,rf_grid,cv=inner_cv,scoring='f1_weighted',n_iter=15,random_state=9)

# 4-2-2. Nested CV for Random Forest
nested_score = cross_val_score(clf,X=dummy_x_train_no_out, y=y_train_no_out, cv=outer_cv,scoring='f1_weighted')

# 4-2-3. Result for Nested CV
rf_result=nested_score.mean()

##############################################################
# 4-3-1. Random-search CV for LightGBM Classifier
clf = RandomizedSearchCV(lgbm,lgbm_grid,cv=inner_cv,scoring='f1_weighted',n_iter=15,random_state=9)

# 4-3-2. Nested CV for LightGBM Classifier
nested_score = cross_val_score(clf,X=dummy_x_train_no_out, y=y_train_no_out, cv=outer_cv,scoring='f1_weighted')

# 4-3-3. Result for Nested CV
lgbm_result=nested_score.mean()

##############################################################
print("XGBoost Nested CV Weighted F1:",round(xgb_result*100,2),"%")
print("Random Forest Nested CV Weighted F1:",round(rf_result*100,2),"%")
print("LightGBM Nested CV Weighted F1:",round(lgbm_result*100,2),"%")

XGBoost Nested CV Weighted F1: 97.22 %
Random Forest Nested CV Weighted F1: 93.6 %
LightGBM Nested CV Weighted F1: 97.11 %
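One caveat when reading these scores: weighted F1 averages the per-class F1 values by class frequency, so on imbalanced data it is dominated by the majority class. The toy sketch below (synthetic labels, not the project data) scores a classifier that always predicts the majority class; its weighted F1 stays high while its macro F1, which treats every class equally, drops to about half.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels: 95 majority-class rows and 5 minority-class rows
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that never predicts the minority class
y_pred = np.zeros_like(y_true)

weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"weighted F1: {weighted_f1:.3f}")
print(f"macro F1:    {macro_f1:.3f}")
```

Reporting macro F1 or per-class recall alongside weighted F1 makes weak minority-class performance easier to spot.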


## Random Search on XGBoost Classifier ##
Based on the nested CV results, the XGBoost classifier is the best performing model (97.22% weighted F1, narrowly ahead of LightGBM at 97.11%). To find its best hyper-parameter combination, I run a random search with 5-fold cross-validation on the training data.

In [150]:
# 1. Create estimator
xgb=XGBClassifier(objective="multi:softmax",seed=9,use_label_encoder =False,verbosity = 0)

# 2. Create parameter grid
xgb_grid={'eta':np.arange(0.1,0.6,0.1),
          'max_depth':list(range(3,16)),
          'n_estimators':list(range(10,310,10)),
          'gamma':list(range(1,6))}

# 3. Random search (10 sampled combinations, 5-fold CV)
xgb_model = RandomizedSearchCV(xgb,xgb_grid,cv=5,scoring='f1_weighted',n_iter=10,random_state=9)

# 4. Fit the model
xgb_model.fit(dummy_x_train_no_out,single_label_y_train_no_out,sample_weight=xgb_class_weights)

# 5. Predict on the training data (in-sample, so the report below is optimistic)
y_train_no_out_pred=xgb_model.predict(dummy_x_train_no_out)

In [151]:
# 6. Result
print ("With CV random search, I found the best hyperparameter is eta={}, max_depth={}, n_estimators={}, and gamma={}.".format(xgb_model.best_params_['eta'],
                                                                                                                               xgb_model.best_params_['max_depth'],
                                                                                                                               xgb_model.best_params_['n_estimators'],
                                                                                                                               xgb_model.best_params_['gamma']))
print('----------------------------------------------------------------------------------------------------------------')
print(classification_report(single_label_y_train_no_out,y_train_no_out_pred))

With CV random search, I found the best hyperparameter is eta=0.4, max_depth=14, n_estimators=210, and gamma=1.
----------------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      7179
           1       0.89      1.00      0.94        85
           2       1.00      1.00      1.00        16
           3       0.88      1.00      0.94       415

    accuracy                           0.99      7695
   macro avg       0.94      1.00      0.97      7695
weighted avg       0.99      0.99      0.99      7695



In [154]:
#Save the final XGboost model
import joblib
joblib.dump(xgb_model.best_estimator_, 'Final_XGBoost.pkl')

#Load the model
#xgb_model = joblib.load("Final_XGBoost.pkl")


['Final_XGBoost.pkl']

## Random Search on Neural Network ##
After finding the best performing machine learning model, I also try some **deep learning models**. In this section, I try both a **3-layer** and a **10-layer neural network**. To find the best hyper-parameters, I use **random search with a train-validation split**.

In [131]:
#Instead of cross-validation, I use a simple 70-30 train-validation split to evaluate each hyper-parameter combination.
nn_dummy_x_train_no_out,nn_dummy_x_val_no_out,nn_dummy_y_train_no_out,nn_dummy_y_val_no_out = train_test_split(dummy_x_train_no_out,dummy_y_train_no_out,test_size=0.3,random_state=99)

In [114]:
#Random Search on 3 Layers Neural Network
def build_model(hp):
    # Every hidden layer reuses the same hyperparameter names ('units', 'activation'),
    # so the tuner samples one width and one activation shared by all three layers.
    units = hp.Int('units', min_value=100, max_value=200, step=10)
    activation = hp.Choice('activation', values=['relu','sigmoid'])

    model = keras.Sequential()
    model.add(layers.Dense(units=units, activation=activation,
                           input_dim=nn_dummy_x_train_no_out.shape[1]))
    for _ in range(2):
        model.add(layers.Dense(units=units, activation=activation))
    model.add(layers.Dense(4,activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate',
                                                            values=[0.01,0.001,0.0001])),
                 loss='categorical_crossentropy',
                 metrics=[keras.metrics.categorical_accuracy])
    return model

tuner=RandomSearch(build_model,
                   objective='val_categorical_accuracy',
                   max_trials=5,
                   overwrite=True,
                   seed=99,
                   executions_per_trial=5)

#Keras cannot take object-dtype input, so every column, whether boolean or numeric, is cast to float32
nn_dummy_x_train_no_out_float = np.asarray(nn_dummy_x_train_no_out).astype(np.float32)
nn_dummy_x_val_no_out_float = np.asarray(nn_dummy_x_val_no_out).astype(np.float32)


tuner.search(x=nn_dummy_x_train_no_out_float,
             y=nn_dummy_y_train_no_out,
             epochs=300,
             batch_size=512,
             validation_data=(nn_dummy_x_val_no_out_float,nn_dummy_y_val_no_out),
             class_weight = nn_class_weights)

Trial 5 Complete [00h 03m 16s]
val_categorical_accuracy: 0.9385015368461609

Best val_categorical_accuracy So Far: 0.9444781422615052
Total elapsed time: 00h 15m 06s
INFO:tensorflow:Oracle triggered exit


In [115]:
# Result
result=tuner.get_best_hyperparameters()[0].values
print('The best 3 layers DNN parameters would be {} neurons, {} as activation function, and {} learning rate.'.format(result['units'],result['activation'],result['learning_rate']))
#print('------------------------------------------')
#print(tuner.results_summary())

#Get the final model
Three_layers_NN=tuner.get_best_models()[0]

#Get the predictions on the validation data
nn_y_val_pred=Three_layers_NN.predict(nn_dummy_x_val_no_out_float)
nn_y_val_pred=np.argmax(nn_y_val_pred,axis=1)

single_label_nn_dummy_y_val_no_out=np.argmax(nn_dummy_y_val_no_out,axis=1)
print(classification_report(single_label_nn_dummy_y_val_no_out,nn_y_val_pred))

The best 3 layers DNN parameters would be 140 neurons, relu as activation function, and 0.0001 learning rate.
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      2150
           1       0.38      0.70      0.49        27
           2       0.00      0.00      0.00         4
           3       0.67      0.48      0.56       128

    accuracy                           0.95      2309
   macro avg       0.51      0.54      0.51      2309
weighted avg       0.95      0.95      0.95      2309



In [141]:
#Random Search on 10 Layers Neural Network
def build_model(hp):
    # Every hidden layer reuses the same hyperparameter names ('units', 'activation'),
    # so the tuner samples one width and one activation shared by all ten layers.
    units = hp.Int('units', min_value=10, max_value=150, step=10)
    activation = hp.Choice('activation', values=['relu','sigmoid'])

    model = keras.Sequential()
    # First hidden layer also declares the input dimension
    model.add(layers.Dense(units=units, activation=activation,
                           input_dim=nn_dummy_x_train_no_out.shape[1]))
    # Remaining nine hidden layers
    for _ in range(9):
        model.add(layers.Dense(units=units, activation=activation))
    model.add(layers.Dense(4,activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('learning_rate',
                                                            values=[0.01,0.001,0.0001])),
                 loss='categorical_crossentropy',
                 metrics=[keras.metrics.categorical_accuracy])
    return model

tuner=RandomSearch(build_model,
                   objective='val_categorical_accuracy',
                   max_trials=5,
                   overwrite=True,
                   seed=99,
                   executions_per_trial=3)

#Keras cannot take object-dtype input, so every column, whether boolean or numeric, is cast to float32
nn_dummy_x_train_no_out_float = np.asarray(nn_dummy_x_train_no_out).astype(np.float32)
nn_dummy_x_val_no_out_float = np.asarray(nn_dummy_x_val_no_out).astype(np.float32)


tuner.search(x=nn_dummy_x_train_no_out_float,
             y=nn_dummy_y_train_no_out,
             epochs=400,
             batch_size=512,
             validation_data=(nn_dummy_x_val_no_out_float,nn_dummy_y_val_no_out),
             class_weight = nn_class_weights2)

Trial 5 Complete [00h 01m 31s]
val_categorical_accuracy: 0.05543525516986847

Best val_categorical_accuracy So Far: 0.9113613367080688
Total elapsed time: 00h 08m 42s
INFO:tensorflow:Oracle triggered exit


In [142]:
# Result
result=tuner.get_best_hyperparameters()[0].values
print('The best 10 layers DNN parameters would be {} neurons, {} as activation function, and {} learning rate.'.format(result['units'],result['activation'],result['learning_rate']))
#print('------------------------------------------')
#print(tuner.results_summary())

#Get the final model
Ten_layers_NN=tuner.get_best_models()[0]

#Get the predictions on the validation data
nn_y_val_pred=Ten_layers_NN.predict(nn_dummy_x_val_no_out_float)
nn_y_val_pred=np.argmax(nn_y_val_pred,axis=1)

single_label_nn_dummy_y_val_no_out=np.argmax(nn_dummy_y_val_no_out,axis=1)
print(classification_report(single_label_nn_dummy_y_val_no_out,nn_y_val_pred))

The best 10 layers DNN parameters would be 70 neurons, relu as activation function, and 0.0001 learning rate.
              precision    recall  f1-score   support

           0       0.99      0.93      0.96      2150
           1       0.70      0.26      0.38        27
           2       0.08      0.25      0.12         4
           3       0.41      0.87      0.55       128

    accuracy                           0.92      2309
   macro avg       0.54      0.58      0.50      2309
weighted avg       0.96      0.92      0.93      2309



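Comparing the final results of the 3-layer and 10-layer NN models, the 3-layer model may look slightly better overall (0.95 vs. 0.92 validation accuracy). Yet the **3-layer NN model did not capture any Platinum customers** (recall 0.00), whereas the 10-layer model captured 1 of the 4 (recall 0.25). So, I believe that the **10-layer NN model with 70 neurons, relu as the activation function, and a 0.0001 learning rate did a better job.** Note that the two networks were also trained with different class weights (balanced weights vs. the manual 1/50/500/300 set), so depth is not the only difference between them. However, the 10-layer model is still worse than our XGBoost model, so I chose **the XGBoost model as the final model**.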

## Final Model on Test Data ##
After selecting the best performing model, I finally evaluate the XGBoost model on the test set to get an unbiased estimate of its performance. The classifier still did a **poor job of capturing the Gold and Platinum customers**: **fewer than 40% of the Gold customers are captured (recall 0.38), and none of the 3 Platinum customers are captured**. For further research, I believe that with **many more samples, a much deeper NN model, and more extreme sample weights**, we might be able to **sacrifice some accuracy on the majority categories (Silver and Blue) to boost performance on the important Gold and Platinum customers**.
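"Captured" here means per-class recall: the share of true members of a class that the model labels correctly. As a small sketch on made-up labels (not the test data), recall per class is the diagonal of the confusion matrix divided by its row sums, which matches `recall_score` with `average=None`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true and predicted labels for three classes
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)        # rows = true class, columns = predicted class
recall_from_cm = cm.diagonal() / cm.sum(axis=1)
per_class_recall = recall_score(y_true, y_pred, average=None, zero_division=0)

print(cm)
print(per_class_recall.round(2))
```

Inspecting the confusion matrix also shows where the missed minority rows end up, which the summary averages hide.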

In [23]:
#Load the final XGBoost model
import joblib
xgb_model = joblib.load("Final_XGBoost.pkl")

#Predict on test data
y_test_pred = xgb_model.predict(dummy_x_test)

#Final classification report on test data
print(classification_report(single_label_y_test,y_test_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1879
           1       0.53      0.38      0.44        26
           2       0.00      0.00      0.00         3
           3       0.73      0.94      0.82       118

    accuracy                           0.97      2026
   macro avg       0.56      0.58      0.56      2026
weighted avg       0.97      0.97      0.97      2026



