# FYS-STK4155 Assignment #1 - Credit Card Fault Detection

Evaluation of Project number: 1 <br />
Name: Lennart Lehmann (ERASMUS Student)

## Abstract 

In this project we analyse a credit card dataset to identify cases of misuse. 
Using classification techniques from the scikit-learn toolbox (k-nearest neighbours, support vector machines, random forests and a multi-layer perceptron), we find x cases of transaction misuse over the entire recording period. 

## Introduction

The goal of this work is to predict whether credit card debts are paid back on time. The question we want to answer is how likely a person is, based on the given dataset, to fail to pay back their debt on time.
We were given a dataset with 25 variables (which I will call features from now on), such as ID, Sex, Age, and PAY_0 to PAY_6 (the repayment status in specific months).
The dataset contains 102,234 transactions, each described by all 25 features, so the overall data matrix is $\mathbf{X} \in \mathbb{R}^{102234 \times 25}$:
$$
\mathbf{X} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,25} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,25} \\
\vdots & \vdots & \ddots & \vdots \\
x_{102234,1} & x_{102234,2} & \cdots & x_{102234,25}
\end{bmatrix}
$$

Some of these features are unnecessary and can be dropped before further analysis. The ID, for example, could only bias our estimator. We want an unbiased classifier that handles the data independently of a person's identity, since it could otherwise end up discriminating against specific individuals.


## Formalism

We used several techniques to accomplish our goal.
First, note that the dataset consists of data along with its targets (output values). 
Hence, we use a **supervised learning technique**, since we have the corresponding output for every recorded transaction. This restricts our choice of methods to supervised algorithms that can handle numerical data for classification tasks.
Before training, we needed to clean the data properly, i.e. remove all NaN values from our matrix and delete irrelevant features (such as ID).
For classification we used a **Support Vector Machine (SVM)**, which constructs a hyperplane separating the classes with the maximum margin to the nearest data points. This technique is very widely used for classification problems.
For benchmarking purposes (and because I am a big fan of random forests) I also compare the results of the SVM against a random forest. 

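The soft-margin SVM solves the following optimization problem:

$$
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=0}^{n-1} \xi_i
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^T \phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0,
$$

where $y_i \in \{-1, +1\}$ are the class labels, $\phi$ is the feature map induced by the kernel, and $C$ controls the trade-off between a wide margin and misclassified training points.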
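
For a linear SVM, the soft-margin cost is commonly written as $\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \max(0,\, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$. The following is a minimal sketch that evaluates this cost, assuming a linear kernel and labels in $\{-1, +1\}$. The function name `soft_margin_objective` and the toy data are illustrative and not part of the project code.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM cost for a linear model; labels y must be in {-1, +1}."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # slack of each sample
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Hypothetical toy data: two points per class on the first axis
X = np.array([[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])

# All margins are >= 1, so only the regularisation term contributes
cost_a = soft_margin_objective(np.array([0.5, 0.0]), 0.0, X, y)
# A smaller weight vector violates the margin and pays hinge penalties
cost_b = soft_margin_objective(np.array([0.25, 0.0]), 0.0, X, y)

print(cost_a)  # 0.125
print(cost_b)  # 1.53125
```

The second weight vector has a smaller norm but a larger total cost, which illustrates the trade-off that the parameter `C` controls.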

*Dive into other stuff I did and explain it exhaustively*



## Code and Implementation
*Readability of Code, Implementation and testing and discussion of Benchmarks*

In [1]:
import os
import numpy as np
import pandas as pd

# import visual libraries
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D 
plt.style.use('ggplot')

# import the scikit-learn tools
import joblib  # sklearn.externals.joblib has been removed from scikit-learn
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# GridSearchCV lives in model_selection; sklearn.grid_search has been removed
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, confusion_matrix, accuracy_score

In [None]:
# read in the data and check the first entries
pathToHappiness = os.path.join(os.getcwd(), 'assignment1_data.csv')  # portable across operating systems
happinessDataFrame = pd.read_csv(pathToHappiness)
happinessDataFrame.head()

In [7]:
# print the shape of the dataframe and count the columns that contain null values
print(happinessDataFrame.shape)
print(happinessDataFrame.isnull().any().sum())


### Data Preprocessing

To get a first impression of the data, I usually look at the distribution of the labels (True vs. False). <br />
This already shows whether we are dealing with an **imbalanced dataset**, where a classifier can reach a high accuracy simply by always predicting the majority class, or with a balanced dataset that has approximately the same number of True and False labels. <br />
Furthermore, we have to check for any *NaN* values, which would also distort our classifier. <br /> 

Personally, I like making correlation plots to see how the features depend on one another, which helps in later steps to drop specific features and reduce the computational complexity. <br />

Another important step is to convert categorical values to numerical values, since the classifiers cannot handle non-numeric data. This can be done by one-hot encoding or similar techniques. <br /> 
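
As an illustration of one-hot encoding, here is a minimal sketch using `pandas.get_dummies`. The toy frame, its column names and its category values are assumptions made for this example and do not come from the project dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names and categories are assumptions
toy = pd.DataFrame({
    "SEX": ["male", "female", "female"],
    "AGE": [34, 27, 45],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["SEX"], dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column, so the classifier does not read an artificial ordering into the category codes.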

Lastly, normalizing the data helps to deal with outliers, since they no longer carry as much weight, and gives the classifier inputs on a consistent scale for every feature. <br />
<br /> 
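
One detail worth checking when scaling with scikit-learn: `normalize` rescales each row (sample) to unit norm, whereas `StandardScaler` standardises each column (feature). The sketch below contrasts the two on made-up numbers. Which of the two is appropriate for this project is a judgment call and not something the analysis here settles.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Made-up data: the second feature is three orders of magnitude larger
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test = np.array([[2.0, 4000.0]])

# normalize(): every ROW is divided by its own L2 norm
row_normalized = normalize(X_train)

# StandardScaler: every COLUMN gets zero mean and unit variance;
# the statistics come from the training data only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuses the training statistics

print(np.linalg.norm(row_normalized, axis=1))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Fitting the scaler on the training set and reusing it for the test set keeps information from the test set out of the preprocessing step.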

So, let's get our hands dirty and massage the data the way it loves it!

In [None]:
# Now let's make a deep dive into our data and first check the labels to see with which kind of data we have to deal this time
entireTransactions = happinessDataFrame.shape[0]
disgustingFraudsters = happinessDataFrame[happinessDataFrame['Class'] == 1]
sweetNonFraudsters = happinessDataFrame[happinessDataFrame['Class'] == 0]

relativeFraudsters = len(disgustingFraudsters)/entireTransactions
relativeNonFraudsters = len(sweetNonFraudsters)/entireTransactions

# print the % value of Fraudsters vs. non Fraudsters to get a better feeling of our data at hand
print('FRAUDSTERS: {}% vs. NON FRAUDSTERS: {}%'.format(relativeFraudsters*100, relativeNonFraudsters*100))

# let's visualize our balance of fraudsters vs. non-fraudsters
labels = ['non-fraud','fraud']
# pd.value_counts is deprecated; sort_index() orders the bars as class 0, 1 to match the labels above
classes = happinessDataFrame['Class'].value_counts().sort_index()
classes.plot(kind = 'bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), labels)
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

In [None]:
# since we also have categorical data we have to convert it to numerical data
# for this purpose I just use standard conversion techniques like one-hot encoding for the gender, ...



In [None]:
# let's check how the features correlate with one another
correlation_matrix = happinessDataFrame.corr()
fig = plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,vmax=0.8,square = True)
plt.show()

### Correlation Plot

Based on the feature correlation plot, we can see that only a few features correlate with each other. Thus, dropping a single feature should not affect the others.
Since the ID might introduce unwanted bias and does not correlate with any other feature, I drop this column to reduce the complexity of the data and to obtain an unbiased, non-discriminating predictor.
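
Reading strongly correlated pairs off a heatmap by eye can be error-prone, so the same information can also be listed programmatically. The helper below is a minimal sketch: the name `correlated_pairs`, the threshold of 0.8 and the synthetic frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for all pairs with |r| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic example: "b" is almost a multiple of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

Pairs returned by such a helper are candidates for dropping one of the two features, since they carry nearly the same information.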

In [None]:
# Next, since the dataset is highly imbalanced we balance it by using the same number of fraudsters and non-fraudsters
# Let's shuffle the data before creating the subsamples
df = happinessDataFrame.sample(frac=1)

# take the subsamples from the shuffled frame so the non-frauds are a random subset
frauds = df[df['Class'] == 1]
non_frauds = df[df['Class'] == 0][:len(frauds)]

new_dataFrame = pd.concat([non_frauds, frauds])
# Shuffle dataframe rows
new_dataFrame = new_dataFrame.sample(frac=1, random_state=38)
# Let's plot the Transaction class against the Frequency
labels = ['non frauds','fraud']
# sort_index() orders the bars as class 0, 1 to match the labels above
classes = new_dataFrame['Class'].value_counts().sort_index()
classes.plot(kind = 'bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), labels)
plt.xlabel("Class")
plt.ylabel("Frequency")

In [None]:
# Now let's drop the unnecessary features we don't want our classifier to utilize for its predictions
features = new_dataFrame.drop(['Class'], axis = 1)
features = features.drop(['ID'], axis = 1)
labels = pd.DataFrame(new_dataFrame['Class'])

feature_array = features.values
label_array = labels.values

In [None]:
# finally split our data into train (80%) and test (20%) datasets
X_train, X_test, y_train, y_test = train_test_split(feature_array,label_array,test_size=0.20)

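# Scale each feature to zero mean and unit variance so all features enter the classifiers on a comparable scale
# Note: sklearn's normalize() rescales each row (sample), not each feature, so StandardScaler is used here
# The scaler is fitted on the training data only and then applied to both sets
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)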


## K Nearest Neighbor (k-NN) as first approach

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for `KNeighborsClassifier` and all its parameters.

In [None]:
# k nearest neighbor approach to check the baseline <-- we want to beat this accuracy
neighbours = np.arange(1, 31) # evaluate k = 1, ..., 30 neighbors
train_accuracy = np.empty(len(neighbours))
test_accuracy = np.empty(len(neighbours))

# evaluate the optimal number of k for our dataset
for i,k in enumerate(neighbours):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree", n_jobs=-1)
    
    #Fit the model (.ravel() - function is flattening the array)
    knn.fit(X_train,y_train.ravel())
    
    #Compute accuracy on the training and test set
    train_accuracy[i] = knn.score(X_train, y_train.ravel())
    test_accuracy[i] = knn.score(X_test, y_test.ravel())
    
# plot the different accuracies w.r.t. the amount of k-neighbors
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbours, test_accuracy, label='Testing Accuracy')
plt.plot(neighbours, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

# select the k with the maximum test accuracy (argmax returns the first maximum if there are ties)
# note: choosing k on the test set makes the reported test accuracy optimistic
optimal_k = int(neighbours[np.argmax(test_accuracy)])

# fit the final k-NN classifier to our dataset
knn = KNeighborsClassifier(n_neighbors=optimal_k, algorithm="kd_tree", n_jobs=-1)
knn.fit(X_train,y_train.ravel())

# predict the labels of the test set
knn_predicted_test_labels = knn.predict(X_test)

# # save the model 
# filename = os.path.join(os.getcwd(), 'finalized_kNN_model.sav')
# joblib.dump(knn, filename)

# # load model again and predict
# knn = joblib.load(filename)
# knn_predicted_test_labels = knn.predict(X_test)

# get the score
knn_accuracy_score  = accuracy_score(y_test, knn_predicted_test_labels)
knn_MSE             = mean_squared_error(y_test, knn_predicted_test_labels)
knn_r2              = r2_score(y_test, knn_predicted_test_labels)

print("Accuracy Score: {} \nMean Squared Error: {} \nR2 Score: {}".format(knn_accuracy_score, knn_MSE, knn_r2))

In [None]:
# confusion Matrix for visualizing the classification task
LABELS = ['Non-Fraud', 'Fraud']
conf_matrix = confusion_matrix(y_test, knn_predicted_test_labels)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
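
The search for k above scores every candidate on the test set, so the test accuracy of the selected model is no longer an independent estimate. A common alternative is to pick k by cross-validation on the training set and touch the test set only once at the end. The sketch below shows this on synthetic data from `make_classification`. The data, the range of k and the variable names are assumptions, not the project's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Mean 5-fold cross-validation accuracy on the training set for each k
k_values = np.arange(1, 16)
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=int(k)), X_train, y_train, cv=5).mean()
    for k in k_values
]
best_k = int(k_values[int(np.argmax(cv_scores))])

# The test set is used once, after k has been fixed
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
held_out_accuracy = final_knn.score(X_test, y_test)
print(best_k, held_out_accuracy)
```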

## Support Vector Machines (SVM)

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for `SVC` and all its parameters.

In [None]:
# evaluate the best soft margin model with different parameter settings
param_grid = { 
    'C': np.logspace(-3, 2, 6),
    'gamma': np.logspace(-3, 2, 6),
    'kernel': ['rbf', 'linear', 'sigmoid']
}
# set up our SVM classifier (C, gamma and kernel are set by the grid search below)
svm_model = SVC()

# check for the optimal parameters using GridSearch
parameterSearch_SVM = GridSearchCV(estimator=svm_model, param_grid=param_grid, refit=True)
parameterSearch_SVM.fit(X_train, y_train.ravel()) 
print(parameterSearch_SVM.best_params_)
print('\n')
print(' ------------------------------------------------------------------------------------- ')

# Since the GridSearchCV already stores the best parameters, we can straight predict with that model
svm_predicted = parameterSearch_SVM.predict(X_test)

# # save the model 
# filename = os.path.join(os.getcwd(), 'finalized_SVM_model.sav')
# joblib.dump(parameterSearch_SVM, filename)

# # load model again and predict
# svm_model = joblib.load(filename)
# svm_predicted = svm_model.predict(X_test)

# get the score
svm_accuracy_score  = accuracy_score(y_test, svm_predicted)
svm_MSE             = mean_squared_error(y_test, svm_predicted)
svm_r2              = r2_score(y_test, svm_predicted)

print("Accuracy Score: {} \nMean Squared Error: {} \nR2 Score: {}".format(svm_accuracy_score, svm_MSE, svm_r2))

In [None]:
# confusion Matrix for visualizing the classification task
LABELS = ['Non-Fraud', 'Fraud']
conf_matrix = confusion_matrix(y_test, svm_predicted)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
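
When feature scaling and a grid search are combined, one option is to wrap both in a `Pipeline`, so that the scaler is re-fitted on the training part of every cross-validation fold. The following is a sketch under assumptions: the data is synthetic, the grid is deliberately small, and the step names `scale` and `svc` are chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
    "svc__kernel": ["rbf"],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

svm_test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(svm_test_accuracy)
```

Because `refit=True` is the default, `search` can be used directly for prediction with the best parameter combination.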

## Random Forests

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for `RandomForestClassifier` and all its parameters.

In [None]:
# set up our Random Forest Classifier
rfc_model = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)
param_grid = { 
    'n_estimators': [10, 30, 50, 100, 200, 400, 600, 800, 1000],
    # 'auto' was equivalent to 'sqrt' for classifiers and has been removed from scikit-learn
    'max_features': ['sqrt', 'log2'],
    'max_depth': [3, 6, 10, 13, 15, 17, None],
}
# evaluate the best parameters for Random Forest
parameterSearch_RFC = GridSearchCV(estimator=rfc_model, param_grid=param_grid, refit=True)
parameterSearch_RFC.fit(X_train, y_train.ravel()) 
print(parameterSearch_RFC.best_params_)
print('\n')
print(' ------------------------------------------------------------------------------------- ')

# predict the outputs for our test dataset
randomForest_predicted = parameterSearch_RFC.predict(X_test)

# get the score
rfc_accuracy_score  = accuracy_score(y_test, randomForest_predicted)
rfc_MSE             = mean_squared_error(y_test, randomForest_predicted)
rfc_r2              = r2_score(y_test, randomForest_predicted)

print("Accuracy Score: {} \nMean Squared Error: {} \nR2 Score: {}".format(rfc_accuracy_score, rfc_MSE, rfc_r2))

In [None]:
# confusion Matrix for visualizing the classification task
LABELS = ['Non-Fraud', 'Fraud']
conf_matrix = confusion_matrix(y_test, randomForest_predicted)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
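
The forest above is created with `oob_score=True`, but the out-of-bag estimate is never read. As a minimal sketch on synthetic data (the dataset and variable names are assumptions), the fitted attributes `oob_score_` and `feature_importances_` can be inspected like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy estimated on the samples each tree did not see during bootstrapping
oob_accuracy = forest.oob_score_

# Feature indices ordered from most to least important
ranking = np.argsort(forest.feature_importances_)[::-1]
print(oob_accuracy)
print(ranking)
```

The out-of-bag accuracy gives a validation-style estimate without a separate hold-out set, and the importance ranking is one way to see which features the forest relies on.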

## Multi-Layered Perceptron (Neural Network)

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) for all `MLPClassifier` parameters.

In [None]:
# set up the parameters we want to test as well as the classifier itself
parameters={
'learning_rate': ["constant", "invscaling", "adaptive"],
'hidden_layer_sizes': [(128, 256, 256, 64), (111, 168, 66), (122, 122), (256, 512, 364, 168, 44), (123, 127, 55), (22, 33, 44, 22)],
'alpha': 10.0 ** -np.arange(1, 7),
'activation': ["logistic", "relu", "tanh"]
}
# MLPClassifier has no n_jobs, max_features, n_estimators or oob_score arguments;
# the parallelism is handled by GridSearchCV below
mlp_classifier = MLPClassifier()

# create the grid Search
mlp = GridSearchCV(estimator=mlp_classifier, param_grid=parameters, n_jobs=-1, refit=True)
mlp.fit(X_train, y_train.ravel()) 
print(mlp.best_params_)
print('\n')
print(' ------------------------------------------------------------------------------------- ')

# predict the values of our test dataset
mlp_predicted = mlp.predict(X_test)

# get the score
mlp_accuracy_score  = accuracy_score(y_test, mlp_predicted)
mlp_MSE             = mean_squared_error(y_test, mlp_predicted)
mlp_r2              = r2_score(y_test, mlp_predicted)

print("Accuracy Score: {} \nMean Squared Error: {} \nR2 Score: {}".format(mlp_accuracy_score, mlp_MSE, mlp_r2))


In [None]:
# confusion Matrix for visualizing the classification task
LABELS = ['Non-Fraud', 'Fraud']
conf_matrix = confusion_matrix(y_test, mlp_predicted)
plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

## Comparison of all tested Algorithms

We have seen that each algorithm performs differently on the underlying dataset. 
To make this easier to compare visually, the following chart shows the test-set accuracy of each algorithm with its best tested parameters. 

In [None]:
objects = ('Random Forest', 'SVM', 'k-NN', 'MLP')
y_pos = np.arange(len(objects))
performance = [rfc_accuracy_score, svm_accuracy_score, knn_accuracy_score, mlp_accuracy_score]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Accuracy')
plt.title('Test Accuracy across the ' + str(len(objects)) + ' used algorithms')

plt.show()

## Analysis

As shown in the previous section (**Code and Implementation**), some people are at risk of not paying their debt on time. These people could therefore be contacted early and offered help or alternative arrangements. Here we only looked at the data at hand (25 features) and neglected all other (key) factors that could give us more insight into a person's circumstances. <br />
We compared four different machine learning algorithms (k-NN, SVM, random forest and MLP) and benchmarked each of them. <br />
The overall plot tells us that the random forest achieved better results than k-NN or SVM. 
Since I had only limited computing power and time, I searched only a small, arbitrarily chosen set of hyperparameter values for each algorithm. <br />
Keep in mind that the SVM might outperform the random forest with a different parameter setting, but for our case we can say that the random forest had the best accuracy with the tested parameters. <br />
One big problem with the dataset is that it is imbalanced. We need balanced classes for the classification outputs, otherwise our predictor will be biased (for further reading on the impact of imbalanced datasets, see [this thread](https://www.researchgate.net/post/Effect_of_imbalanced_data_on_machine_learning)). <br />
To improve the accuracy on the credit card dataset, one could scrape the web to see where the transactions were made and derive valuable insights from that, i.e. add more data and more features to the training and test sets to make the predictions more expressive. 
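
To illustrate why accuracy alone can mislead on imbalanced data, the sketch below compares two hypothetical predictors on made-up labels with 5% fraud. The numbers are invented for this example and are not results from the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up ground truth: 95 non-fraud (0) and 5 fraud (1) transactions
y_true = np.array([0] * 95 + [1] * 5)

# Predictor A always answers "non-fraud"
y_majority = np.zeros_like(y_true)

# Predictor B raises 10 false alarms but catches 4 of the 5 frauds
y_model = np.zeros_like(y_true)
y_model[:10] = 1
y_model[95:99] = 1

majority_accuracy = accuracy_score(y_true, y_majority)
majority_recall = recall_score(y_true, y_majority)
model_accuracy = accuracy_score(y_true, y_model)
model_recall = recall_score(y_true, y_model)
model_precision = precision_score(y_true, y_model)

print(majority_accuracy, majority_recall)            # 0.95 0.0
print(model_accuracy, model_recall, model_precision)
```

Predictor A has the higher accuracy although it never detects a single fraud, which is why recall and precision are worth reporting alongside accuracy.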

*Dive into the Analysis and plots from the previous section --> Correlation of single features with each other...* 

## Conclusions

This dataset already reveals some interesting facts, and we can derive useful predictions from it. 
Now that we have identified potentially at-risk customers, we can approach them directly and try to help, for example by offering alternative arrangements or suggesting different credit card options. 
Other ways to improve the accuracy of our system would be to either 

1. generate more data with more features, or
2. apply state-of-the-art artificial neural network (deep learning) algorithms such as DenseNet.

Option 1 would take more time, since new payments are recorded continuously and the dataset only grows over time. Another way to artificially generate more data would be to enrich the feature space of the existing data: instead of 25 features we could extend it to 30 with additional features such as *usual shopping district*, *usual time of payments*, etc. 

Option 2 is a more modern approach to classification: such models can fit the training set very closely and usually outperform SVMs or linear regression methods on more complex datasets.

## References

[1] Rajaratne, M. (2018). *Data Pre-Processing Techniques you should know*, retrieved from https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6, last accessed on 20th September 2019. <br />
[2] Bishop, C.M. (2011). *Pattern Recognition and Machine Learning*. Cambridge: Springer. <br />
[3] Duda, R. O. (2007). *Pattern Classification*. San Jose: Wiley. <br />
[4] Murphy, K. P. (2007). *Machine Learning: A Probabilistic Perspective*. Cambridge: MIT Press. <br />