This project aims to identify the attributes/characteristics associated with a high tendency to default on credit card payments, based on the dataset collected.

The hypothesis is that some demographic factors provide important clues for classifying card users into different risk levels of credit card default.

Dataset Information:
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
The target column is 'default.payment.next.month'.

There are 25 variables:
ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import csv
import scipy.stats as stats
import seaborn as sns
import json
from math import log

%matplotlib inline

In [None]:
Credit= pd.read_csv('/Users/kaiengwee/Documents/GitHub/GA18Aug/Project4/UCI_Credit_Card.csv')
Credit.head()

In [None]:
Credit.rename(columns={'default.payment.next.month':'default_nextMTH'},inplace= True)

In [None]:
Credit.describe()

In [None]:
Credit['default_nextMTH'].describe()

In [None]:
Credit.shape

In [None]:
Credit.info()

In [None]:
Credit.isnull().sum()

In [None]:
Credit.isna().sum()

In [None]:
Credit.nunique()

In [None]:

target_count = Credit['default_nextMTH'].value_counts()
print('No default in next month:', target_count[0])
print('Default in next month:', target_count[1])
print('Ratio of default to non-default cases:', round(target_count[1] / target_count[0], 2), ': 1')

target_count.plot(kind='bar', title='Count (target column)');


In [None]:
plt.subplots(figsize=(20,20))
sns.heatmap(Credit.corr(),  annot=True, vmin=-1, vmax= 1, cmap='PiYG', center= 0)

From the heatmap above, BILL_AMT1 to BILL_AMT6 are highly correlated with one another. Since BILL_AMT1 is the 
latest bill statement before the default occurred, I will drop BILL_AMT2 to BILL_AMT6. In addition, the ID and AGE features will 
also be removed because they carry little meaning for predicting default and are weakly correlated with the target.
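The visual check of the heatmap can be complemented by listing the strongly correlated column pairs programmatically. The following is only a sketch on toy data: the helper name `highly_correlated_pairs`, the 0.9 threshold, and the generated values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if r >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data (hypothetical values): two near-duplicate bill columns and one unrelated column
rng = np.random.default_rng(0)
base = rng.normal(50_000, 10_000, size=500)
toy = pd.DataFrame({
    "BILL_AMT1": base,
    "BILL_AMT2": base + rng.normal(0, 1_000, size=500),
    "PAY_AMT1": rng.normal(5_000, 2_000, size=500),
})

pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data the same helper could be called as `highly_correlated_pairs(Credit)` to confirm which BILL_AMT columns are candidates for removal.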

In [None]:
Credit.drop(['ID','AGE','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'], axis = 1, inplace = True)

In [None]:
plt.subplots(figsize=(20,20))
sns.heatmap(Credit.corr(),  annot=True, vmin=-1, vmax= 1, cmap='PiYG', center= 0)

In [None]:
count_d = 0
count_nd = 0

for i in Credit['default_nextMTH']:
    if i == 1:
        count_d += 1
    else:
        count_nd += 1
    
total= count_d + count_nd
print("Total cases under study", total)
print("Total default cases", count_d, "or", round(count_d / total * 100, 2), "%")
print("Total non-default cases", count_nd, "or", round(count_nd / total * 100, 2), "%")
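The counting loop above can also be expressed with pandas' `value_counts`, which gives counts and percentages in one table. This is a sketch on a small made-up target column (the ten values are hypothetical), not the notebook's actual output.

```python
import pandas as pd

# Hypothetical target column: 1 = default, 0 = no default
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="default_nextMTH")

counts = target.value_counts().sort_index()
shares = target.value_counts(normalize=True).sort_index()

# One row per class with its count and its share in percent
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(2)})
print(summary)
```

With the real data, `target` would simply be `Credit['default_nextMTH']`.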

In [None]:
# 18 columns remain after dropping the features above
Credit.shape

In [None]:
Credit.head()

In [None]:
# SEX: Gender (1=male, 2=female); the SEX column contains only these two values
Credit['SEX'].value_counts().sort_index()

In [None]:
SEX_dummies= pd.get_dummies(Credit.SEX, prefix= 'SEX')
SEX_dummies.sample(n=5, random_state=1)

In [None]:
# Drop SEX_2 since one column is enough to represent the two genders

SEX_dummies.drop(SEX_dummies.columns[-1], axis=1, inplace= True)
SEX_dummies.head()

In [None]:
# EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
# value counts show undocumented or redundant categories in the EDUCATION column, i.e. categories 0 and 6
Credit['EDUCATION'].value_counts().sort_index()

In [None]:
# Replace the undocumented category 0 with NA

Credit['EDUCATION'] = Credit['EDUCATION'].replace(0, np.nan)
Credit.isna().sum()

In [None]:
# Remove the rows with NA

Credit.dropna(inplace=True)

In [None]:
# Check the column EDUCATION after removing the rows with NA

Credit['EDUCATION'].value_counts().sort_index()

In [None]:
# Create EDUCATION DUMMY COLUMNS

EDUCATION_dummies= pd.get_dummies(Credit.EDUCATION, prefix= 'EDUCATION')
EDUCATION_dummies.sample(n=5, random_state=1)

In [None]:
# Drop EDUCATION_5.0 and EDUCATION_6.0, the two 'unknown' categories (rows with category 0 were already removed above)
# The dummies for EDUCATION 1, 2, 3 and 4 are enough to represent the remaining categories

EDUCATION_dummies.drop(['EDUCATION_5.0','EDUCATION_6.0'], axis=1, inplace= True)
EDUCATION_dummies.head()

In [None]:
# MARRIAGE: Marital status (1=married, 2=single, 3=others)
# value counts show an undocumented category 0 in the MARRIAGE column

Credit['MARRIAGE'].value_counts().sort_index()

In [None]:
# Replace the undocumented category 0 with NA

Credit['MARRIAGE'] = Credit['MARRIAGE'].replace(0, np.nan)
Credit.isna().sum()

In [None]:
# Remove the rows with NA

Credit.dropna(inplace=True)

In [None]:
# Check the column MARRIAGE after removing the rows with NA

Credit['MARRIAGE'].value_counts().sort_index()

In [None]:
MARRIAGE_dummies= pd.get_dummies(Credit.MARRIAGE, prefix= 'MARRIAGE')
MARRIAGE_dummies.sample(n=5, random_state=1)

In [None]:
# Drop MARRIAGE_3 since MARRIAGE 1 and 2 are enough to represent all categories of MARRIAGE

MARRIAGE_dummies.drop(MARRIAGE_dummies.columns[-1], axis=1, inplace= True)
MARRIAGE_dummies.head()
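The "create dummies, then drop one column" pattern used for SEX and MARRIAGE can be done in a single call with the `drop_first` option of `pd.get_dummies`. The sketch below uses a tiny hypothetical frame. Note that `drop_first=True` drops the first level of each variable, whereas the cells above drop the last one, so the reference category differs.

```python
import pandas as pd

# Hypothetical rows using the same coding as the data set
toy = pd.DataFrame({"SEX": [1, 2, 2, 1], "MARRIAGE": [1, 2, 3, 2]})

# One call encodes both columns and drops the first level of each,
# leaving k-1 dummy columns per k-level variable
encoded = pd.get_dummies(toy, columns=["SEX", "MARRIAGE"], drop_first=True)
print(encoded.columns.tolist())
```

Either choice of reference category avoids perfectly collinear dummy columns; it only changes which category the coefficients are measured against.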

In [None]:
# Check that the Credit data set is free of NA cells

Credit.isna().sum()

In [None]:
# The Credit data set now has fewer rows, at 29932

Credit.shape

In [None]:
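# Concatenate Credit with the three sets of dummies created, namely SEX, EDUCATION and MARRIAGE.
# join='inner' keeps only the rows present in every frame: SEX_dummies and EDUCATION_dummies were
# built before later rows were dropped, so the default outer join would bring those rows back as NA.

Credit_dummies= pd.concat([Credit, SEX_dummies, EDUCATION_dummies, MARRIAGE_dummies], axis=1, join='inner')
Credit_dummies.head()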

In [None]:
# Remove the original columns for SEX, EDUCATION and MARRIAGE

Credit_dummies.drop(['SEX','EDUCATION','MARRIAGE'], axis = 1, inplace = True)

In [None]:
# Shift the target column ('default_nextMTH') to the last column

y= Credit_dummies.default_nextMTH
Credit_dummies.drop(['default_nextMTH'], axis=1, inplace= True)
Credit_dummies= pd.concat([Credit_dummies, y], axis=1)

In [None]:
# Show the columns in data set named as Credit_dummies

Credit_dummies.columns

In [None]:
# Check the shape of Credit_dummies after adding the three sets of dummy columns

Credit_dummies.shape

In [None]:
# Create feature matrix (X)
feature_cols= Credit_dummies.columns.drop(['default_nextMTH'])
X= Credit_dummies[feature_cols]

# Create response vector (y)

y= Credit_dummies.default_nextMTH

In [None]:
print((type(X)))
print((type(X.values)))
print((type(y)))
print((type(y.values)))

In [None]:
print((X.shape))
print((y.shape))

In [None]:
# Check if Credit_dummies contains any NA cells

Credit_dummies.isna().sum()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardised copy of the features; the models below are fitted on the unscaled X
scaled_pred = scaler.fit_transform(X)

Split the data set into train and test sets, fit Logistic Regression, and check the accuracies

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = logreg.fit(X_train,y_train)

y_pred = logreg.predict(X_test)

print('Testing score: '  + str(logreg.score(X_test, y_test)))
print('Training score: '  + str(logreg.score(X_train, y_train)))
# LogisticRegression has no feature_importances_ attribute; its per-feature weights are in coef_
list(zip(feature_cols, logreg.coef_[0]))

The training and testing scores are almost equal to the share of non-default cases in the data set, so the model adds little over always predicting non-default.
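One thing that may be worth trying here is fitting the logistic regression on standardised features, since the scaled matrix computed earlier is not passed to the model. The sketch below wires the scaler and the classifier into a scikit-learn pipeline so the scaler is fitted on the training split only. The data are synthetic (`make_classification` with an assumed 78/22 class mix), so the printed score says nothing about the credit data itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: roughly 78% / 22% classes)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=123
)

# The pipeline fits the scaler on the training data and reuses it for the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_acc = pipe.score(X_te, y_te)
print(round(test_acc, 3))
```

On the notebook's data the same pipeline could be fitted with `X_train, y_train` from the cell above.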

In [None]:
print(logreg.intercept_)
print(logreg.coef_)
coeff = pd.DataFrame(dict(zip(X.columns,model.coef_[0])),index=[0])
coeff

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
cm=metrics.confusion_matrix(y_test,y_pred)

In [None]:
list1 = ["Actual: No", "Actual: Yes"]
list2 = ["Predicted: No", "Predicted: Yes"]
pd.DataFrame(cm, list1,list2)

In [None]:
print(classification_report(y_test, y_pred))

The classification report suggests that the model mostly predicts the majority (non-default) class.

In [None]:
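# Use predicted probabilities rather than hard 0/1 labels so the curve sweeps over thresholds
y_score = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

plt.title('ROC curve for default_nextMTH')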
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
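An ROC curve is traced by sweeping a decision threshold over continuous scores, so hard 0/1 predictions collapse it to a single operating point joined to the corners. The sketch below contrasts the two inputs on synthetic data; the data set and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=123
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)               # hard 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]   # probability of class 1

fpr_l, tpr_l, _ = roc_curve(y_te, labels)
fpr_s, tpr_s, _ = roc_curve(y_te, scores)

# Hard labels give at most three points; scores give a full curve
print(len(fpr_l), len(fpr_s))
print(roc_auc_score(y_te, labels), roc_auc_score(y_te, scores))
```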

Use scikit-learn's linear_model module to train and test a second logistic regression model

In [None]:
import pandas as pd
from sklearn import linear_model, model_selection, metrics

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size= 0.25, random_state=46)
logit_simple = linear_model.LogisticRegression(C=1e9).fit(X_train, y_train)

In [None]:
1- y_train.mean()

In [None]:
# What is our accuracy on the test set?
print(np.mean(y_test == logit_simple.predict(X_test)))

In [None]:
# Get probability predictions.
logit_pred_proba = logit_simple.predict_proba(X_test)[:,1]

In [None]:
metrics.confusion_matrix(y_true=y_test, y_pred=logit_pred_proba > .5)

This model is also a logistic regression (with C=1e9, i.e. almost no regularisation) and hence shows the same accuracy of 78%
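The quantity `1 - y_train.mean()` computed above is the accuracy of always predicting the majority class, which is the baseline any model should beat. As a sketch, scikit-learn's `DummyClassifier` makes that baseline explicit. The 78/22 toy labels below are hypothetical and chosen only to mirror the accuracy quoted above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 78 non-default cases and 22 default cases
y_toy = np.array([0] * 78 + [1] * 22)
X_toy = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc, 1 - y_toy.mean())
```

An accuracy close to this baseline indicates that the classifier is not separating the two classes much better than the trivial rule.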

Apply RandomForest to get model accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train model
Random = RandomForestClassifier(n_estimators=10, max_depth= 10)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model= Random.fit(X_train,y_train)
 
# Predict on the test set
y_pred = Random.predict(X_test)
 
print('Testing score: '  + str(Random.score(X_test, y_test)))
print('Training score: '  + str(Random.score(X_train, y_train)))
list(zip(feature_cols, Random.feature_importances_)) 

The feature importance check does not show any single feature dominating; in other words, the features 
carry broadly similar weights on the target column. 
The most important feature is PAY_0, which makes sense as this is the latest repayment status. The least important feature
is EDUCATION category 4, likely because of limited data availability for that category.
Random forests can work better with categorical features than logistic regression.
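The list of (feature, importance) tuples is easier to read when sorted. The sketch below wraps the importances of a random forest in a pandas Series and sorts it; the synthetic data and the `feat_i` column names are hypothetical stand-ins for the credit features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names
X_arr, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=0
)
cols = [f"feat_{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=cols)

forest = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
forest.fit(X_demo, y_demo)

# Importances sum to 1; sorting shows the ranking at a glance
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's model, the equivalent would be `pd.Series(Random.feature_importances_, index=feature_cols).sort_values(ascending=False)`.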

Check the test set performance using the ROC curve and AUC after modelling with Random Forest

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
cm=metrics.confusion_matrix(y_test,y_pred)

In [None]:
list1 = ["Actual: No", "Actual: Yes"]
list2 = ["Predicted: No", "Predicted: Yes"]
pd.DataFrame(cm, list1,list2)

In [None]:
print(classification_report(y_test, y_pred))

The precision is good but the recall is very low for class 1 (the default class) because the data set is imbalanced.
In other words, the model is only good at predicting non-default cases.
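To make the gap between precision and recall concrete, both can be computed by hand from the four cells of the confusion matrix. The labels below are hypothetical and chosen so that the model finds only one of four defaulters.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical outcome: 8 non-defaulters, 4 defaulters, only 1 defaulter detected
y_true = np.array([0] * 8 + [1] * 4)
y_hat = np.array([0] * 8 + [1, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted defaults that are real
recall = tp / (tp + fn)     # share of real defaults that are found

print(precision, recall)
```

Here every predicted default is correct (precision 1.0), yet three of the four real defaults are missed (recall 0.25), which is the pattern described above.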

In [None]:
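# Use predicted probabilities rather than hard 0/1 labels so the curve sweeps over thresholds
y_score = Random.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

plt.title('ROC curve for default_nextMTH')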
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Apply SMOTE + Tomek links (SMOTETomek) to the imbalanced dataset

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

from imblearn.combine import SMOTETomek

X= Credit_dummies[feature_cols]
y= Credit_dummies.default_nextMTH

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components= 2)
# Fit and transform X to visualise inside a 2D feature space
X_vis = pca.fit_transform(X)

# Apply SMOTE + Tomek links
sm = SMOTETomek()
X_resampled, y_resampled = sm.fit_resample(X, y)
X_res_vis = pca.transform(X_resampled)

# Two subplots, unpack the axes array immediately
f, (ax1, ax2) = plt.subplots(1, 2)

c0 = ax1.scatter(X_vis[y == 0, 0], X_vis[y == 0, 1], label="Class #0",
                 alpha=0.5)
c1 = ax1.scatter(X_vis[y == 1, 0], X_vis[y == 1, 1], label="Class #1",
                 alpha=0.5)
ax1.set_title('Original set')

ax2.scatter(X_res_vis[y_resampled == 0, 0], X_res_vis[y_resampled == 0, 1],
            label="Class #0", alpha=0.5)
ax2.scatter(X_res_vis[y_resampled == 1, 0], X_res_vis[y_resampled == 1, 1],
            label="Class #1", alpha=0.5)
ax2.set_title('SMOTE + Tomek')

# Tidy up the axes
for ax in (ax1, ax2):
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    #ax.set_xlim([-6, 8])
    #ax.set_ylim([-6, 6])

plt.figlegend((c0, c1), ('Class #0', 'Class #1'), loc='lower center',
              ncol=2, labelspacing=0.)
plt.tight_layout(pad=3)
plt.show()



In [None]:
# Check the class balance after applying SMOTE + Tomek links:
# the results show that SMOTETomek has increased the minority class and achieved a balanced data set

count_d = 0
count_nd = 0

for i in y_resampled:
    if i == 1:
        count_d += 1
    else:
        count_nd += 1
    
total= count_d + count_nd
print("Total cases under study", total)
print("Total default cases", count_d, "or", round(count_d / total * 100, 2), "%")
print("Total non-default cases", count_nd, "or", round(count_nd / total * 100, 2), "%")
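A point to keep in mind is that resampling the whole data set before the train/test split lets synthetic or duplicated minority rows influence the test set. A common alternative is to split first and rebalance only the training part. The sketch below illustrates that ordering on synthetic data, using plain random oversampling from `sklearn.utils.resample` as a simple stand-in for SMOTETomek; it is not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.78, 0.22], random_state=0
)

# 1) Split first, so the test set keeps the original class mix
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=123
)

# 2) Oversample the minority class in the training split only
minority = y_tr == 1
n_major = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_major, random_state=0,
)
X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

print(np.bincount(y_tr_bal), np.bincount(y_te))
```

With imbalanced-learn, the same ordering would mean calling the resampler on `X_train, y_train` only, after `train_test_split`.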

In [None]:
# Create a new dataframe from all the columns after Credit_dummies was resampled by SMOTETomek

X= pd.DataFrame(X_resampled)
y= pd.DataFrame(y_resampled)
Credit_ST= pd.concat([X, y], axis= 1)
Credit_ST.head()

In [None]:
# Show the headers of Credit_dummies to compare against the headers of X_resampled

Credit_dummies.head(1)

In [None]:
# Put headers of Credit_dummies.columns onto X_resampled and y_resampled

initialcol = Credit_dummies.columns
Credit_ST.columns = initialcol

In [None]:
Credit_ST.head(1)

In [None]:
# Create feature matrix X from the columns treated by Smote Tomek
feature_cols= Credit_ST.columns.drop(['default_nextMTH'])
X= Credit_ST[feature_cols]

# Create response vector (y)

y= Credit_ST.default_nextMTH

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardised copy of the features; the models below are fitted on the unscaled X
scaled_pred = scaler.fit_transform(X)

The Credit_dummies data set has been resampled with SMOTETomek. Use Logistic Regression to check the accuracies

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = logreg.fit(X_train,y_train)

y_pred = logreg.predict(X_test)

print('Testing score: '  + str(logreg.score(X_test, y_test)))
print('Training score: '  + str(logreg.score(X_train, y_train)))
# LogisticRegression has no feature_importances_ attribute; its per-feature weights are in coef_
list(zip(feature_cols, logreg.coef_[0]))

The feature importance check does not show any single feature dominating; in other words, 
the features carry broadly similar weights on the target column.
The train and test scores of about 52% are only slightly better than the 50% baseline of the data set 
balanced by SMOTETomek.

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
cm=metrics.confusion_matrix(y_test,y_pred)

In [None]:
list1 = ["Actual: No", "Actual: Yes"]
list2 = ["Predicted: No", "Predicted: Yes"]
pd.DataFrame(cm, list1,list2)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
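# Use predicted probabilities rather than hard 0/1 labels so the curve sweeps over thresholds
y_score = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

plt.title('ROC curve for default_nextMTH')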
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Use scikit-learn's linear_model module to train and test a second logistic regression model

In [None]:
import pandas as pd
from sklearn import linear_model, model_selection, metrics

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size= 0.25, random_state=46)
logit_simple = linear_model.LogisticRegression(C=1e9).fit(X_train, y_train)

In [None]:
1- y_train.mean()

In [None]:
# What is our accuracy on the test set?
print(np.mean(y_test == logit_simple.predict(X_test)))

In [None]:
# Get probability predictions.
logit_pred_proba = logit_simple.predict_proba(X_test)[:,1]

In [None]:
metrics.confusion_matrix(y_true=y_test, y_pred=logit_pred_proba > .5)

This model is also a logistic regression (with C=1e9, i.e. almost no regularisation) and hence shows the same accuracy of about 52%

Apply RandomForest to get model accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train model
Random = RandomForestClassifier(n_estimators=10, max_depth= 12)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model= Random.fit(X_train,y_train)
 
# Predict on the test set
y_pred = Random.predict(X_test)
 
 
print('Testing score: '  + str(Random.score(X_test, y_test)))
print('Training score: '  + str(Random.score(X_train, y_train)))
list(zip(feature_cols, Random.feature_importances_))  

The feature importance check does not show any single feature dominating; in other words, the features 
carry broadly similar weights on the target column. 
The most important feature is PAY_0, which makes sense as this is the latest repayment status. The least important feature
is EDUCATION category 4, likely because of limited data availability for that category.
Random forests can work better with categorical features than logistic regression.

Check the test set accuracy using the confusion matrix

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
cm=metrics.confusion_matrix(y_test,y_pred)

In [None]:
list1 = ["Actual: No", "Actual: Yes"]
list2 = ["Predicted: No", "Predicted: Yes"]
pd.DataFrame(cm, list1,list2)

In [None]:
print(classification_report(y_test, y_pred))

Precision and recall are both good now that the data set has been balanced by SMOTETomek.
The correct predictions are spread rather evenly across default (positive) and non-default (negative) cases.

In [None]:
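# Use predicted probabilities rather than hard 0/1 labels so the curve sweeps over thresholds
y_score = Random.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

plt.title('ROC curve for default_nextMTH')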
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The precision and the recall are both high because the data set is now balanced.
In other words, the model is good at predicting both default and non-default cases.

In [None]:
import numpy as np

from sklearn.svm import SVC

svc = SVC(gamma='auto')
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model= svc.fit(X_train, y_train) 

y_pred= svc.predict(X_test)

print('Testing score: '  + str(svc.score(X_test, y_test)))
print('Training score: '  + str(svc.score(X_train, y_train)))
# SVC with the default RBF kernel exposes neither feature_importances_ nor coef_,
# so no per-feature weights are listed here

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)
                             

#array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        #0.93...,  0.93...,  1.     ,  0.93...,  1.      ])
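The cross-validation cell above runs on the iris data set. The same call pattern can be applied to a binary, imbalanced problem closer to this project. The sketch below uses synthetic data and a random forest with the settings from earlier cells; the data and the added `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

forest = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)

# For classifiers, an integer cv uses stratified folds, so each fold keeps the class mix
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=10)
print(cv_scores.mean(), cv_scores.std())
```

On the credit data, `X_demo, y_demo` would be replaced by the feature matrix `X` and the response vector `y`.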