# Credit Card Fraud Analysis

## Data
The dataset is found [here](https://www.kaggle.com/dalpozz/creditcardfraud).
Its description is found on the same page.
Summary of the description:
- 284807 credit card transactions among which 492 of them are fraudulent, i.e. 0.172%.
- Predictors are Amount (transaction amount) and V1, V2, ... V28, which are principal components of the original data.
The original features are not available due to confidentiality issues.
- Time (seconds elapsed between transactions) is also part of the data, but we did not include the variable in our models.

## Approach
Since the dataset is highly unbalanced, prediction accuracy is a meaningless metric.
We use F1 score for model evaluation.
We also record recall and precision scores in order to understand the tradeoff between the two scores.
The plot below shows an example of recall-precision tradeoff in a logistic regression model.
The vertical line is the class weight selected by cross-validation.

![class weights vs scores](f1-vs-class-weight.png)

The models we consider are the following:

- Ridge regression & classification by thresholding
- Logistic regression with L1 regularization
- Linear SVM classification.

Models are evaluated in the following manner:
0. Standardize the predictors.
1. Split the dataset into a training set and test set.
2. Parameter tuning on a model is done by 5-fold cross-validation on a training set.
3. The model with tuned parameters is fit on the training set, followed by its evaluation on the testing set.
4. Repeat the tuning and evaluation on different splits of the dataset (i.e. do cross-validation) to estimate the F1 score of the model. The estimated F1 score and accompanying recall and precision scores are reported in the Results section.

Each model has two parameters: M, weight on positive class, and C, regularization strength. M is the ratio of weight on the positive class (class=1) to negative class (class=0), i.e. if M=m, then (weight on class 0):(weight on class 1) = 1:m.
Class weights are necessary in our models as a remedy for unbalanced classes.

The parameter tuning step, which is embarassingly parallel, is run in parallel using `joblib`.
Machine used: Acer S3-391 (Intel Core i3 (2nd Gen) 2367M / 1.4 GHz Dual-Core; DDR3 SDRAM 4GB)

## Results

### Ridge Classification
Ridge regression is ordinary least squares regression with L2 regularization.
A ridge classifier uses a decision rule on a ridge regression result to convert it to binary outputs.

F1 | Recall | Precision
--- | --- | ---
0.662578 | 0.886734 | 0.553262



### Logistic Regression with L1 regularization

F1 | Recall | Precision
--- | --- | ---
0.781691 | 0.802129 | 0.770161

C varies between 0.144 and 0.00298, and M between 1.0 and 10.8.

Results of L1-regularized regression indicate strong predictors.
The plot on the left shows that many cofficients are driven to zero with a stronger regularization.
The vertical line indicates the regularization constant chosen by cross-validation.

![coefficients die out](logistic-L1-allvars-9vars.png)

Logistic regression with L1 regularization on a training set indicates that ['V4', 'V11', 'V5', 'V19', 'V7', 'V28', 'V3', 'V17', 'V18'] are strong predictors.

We run logistic regression without penalty using these variables to obtain the following scores.

F1 | Recall | Precision
--- | --- | ---
0.670403 | 0.699218 | 0.651430

The reduced model is faster to run, because of its small number of features, and a smaller parameter space to search, as it doesn't use regularization constant.
It is trained and tested in 19 mins for 50 parameters to search over, while the full model takes 8h for 200 parameters.
A trade-off is that the reduced model does not have the same predictive perfermance as the full model.
The scores reported above is most likely upward-biased, because variable selection was done prior to parameter tuning without cross-validation.
A proper method to select variables is to run a L1-regularized logistic regression after splitting the dataset into k-folds.
Although variable selection adds extra complexity to model training, prediction and training on a new dataset can be done more efficiently once strong predictors are identified.

### Linear SVM classification
We use `LinearSVC` of `sklearn` with `dual=False`.

F1 | Recall | Precision
--- | --- | ---
0.821735 | 0.900472 | 0.763462

### Summary of Results
The following table shows the results for our models together, with time to complete computation.

Classifier | Predicted F1, Recall, and Precision Scores | Time to Train & Test (in min; grid search over 200 parameter pairs)
--- | --- | --- 
Ridge | 0.663, 0.887, 0.553 | 53
Logistic | 0.783, 0.808, 0.768 | 480
Linear SVM | 0.822, 0.900, 0.763 | 199

In all models, M varies between 1 and 25.
Ridge classifiers achieve high recall scores, though F1 score is comparatively low.
An advantage of ridge classification is that it trains faster than other methods; the time complexity of its training is on the same order as OLS regression, O(n\*p^2).

Linear SVM has the best performance measured in F1 score.
Though not tested, all models, in theory, have same prediction time complexity, which is linear in the number of features.

## Discussion
Training and testing of a model including a thorough parameter tuning takes from 6 hours to overnight, and this has been the bottleneck of progress in model evaluation.
The computation can be expedited with a faster parameter tuning scheme instead of grid search and a better hardware (my laptop has the specs of a chromebook).
Variable selection is another means of reducing computation time, although we found that our model reduced via L1-regularized logistic regression does not preserve the perfermance of the full model.

Linear SVM has the best predictive ability among the models tested.
It is, however, worth studying ridge classifier in more depth, since the model achieves high recall scores, and runs much faster.
Further work on ridge classifier should involve understanding the mechanism behind its recall-precision tradeoff to see if raising the precision score without recall taking a big hit; and applying results from variable selection via L1-regularized logistic regression. 


## Future work
- Take a closer look at variable selection via L1-regularized logistic regression. Try other variable selection methods, such as forward stepwise selection, as well.
- Tree-based classification
- Speed up the parameter tuning step. Consider using RandomizedSearchCV of sklearn or Hyperopt.
- Cost-sensitive analysis. Models should include penalty for not accurately classifying a fraudulent transaction with a high amount.


## Code
The following is the linear SVM classifier.
This code can be made to do logistic regression by minor modifications, namely, by changing `LinearSVC` to `LogisticRegression` and changing parameters of `clf`.
A large portion of the parameter tuning code can be replaced with `cross_val_score` from scikit-learn; however,
coding our own cross-validation allows for 

In [4]:
import pandas as pd
import scipy as sp
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import RidgeClassifier, RidgeClassifierCV, LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, recall_score, precision_score
from joblib import Parallel, delayed, load, dump
import tempfile, shutil, os

df = pd.read_csv("../input/creditcard.csv")
X = df.drop(['Time','Class'],axis=1)
y = df['Class']
X = (X - X.mean()) / X.std()

def make_weights(M):
    return {0: 1, 1: M}

import warnings, sklearn
#ignore warnings from f1_score and recall_score.
#The methods return 0 when no positive class is available.
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

kC=30
kM=10
cs = sp.logspace(-6,-2,kC)
ms = sp.linspace(0,10,kM)
clf = LinearSVC()
k = 5
kf = KFold(n_splits=k)

def param_eval_parallel(X, y, clf_class, cs, ms): #for each fold at the top-level
        folder = tempfile.mkdtemp()
        X_path = os.path.join(folder, 'X-data')
        y_path = os.path.join(folder, 'y-data')
        scores_path = os.path.join(folder, 'scores')
        try:
            scores = sp.memmap(scores_path, dtype=X.iat[0,0].dtype,\
                          shape=(kC,kM,3), mode='w+')
            dump(X, X_path)
            dump(y, y_path)
            X = load(X_path, mmap_mode='r')
            y = load(y_path, mmap_mode='r')
            Parallel(n_jobs=2)(delayed(param_eval)(X, y, clf_class, ci, C, mi, M, scores) \
                for ci, C in enumerate(cs) for mi, M in enumerate(ms))
            f1_scores = scores[:,:,0]
            best_idx = sp.unravel_index(sp.argmax(f1_scores), f1_scores.shape)
            return cs[best_idx[0]], ms[best_idx[1]]
        finally:
            try:
                shutil.rmtree(folder)
            except:
                print("Failed to delete: " + folder)
                
def param_eval(X, y, clf_class, ci, C, mi, M, scores):
        class_weight = make_weights(M)
        if clf_class == LogisticRegression:
            clf.set_params(penalty='l1', C=C, dual=False, class_weight=class_weight)
        elif clf_class == RidgeClassifier:
            clf.set_params(alpha=C, class_weight=class_weight)
        elif clf_class == LinearSVC:
            clf.set_params(C=C, dual=False, class_weight=class_weight)

        f1_tmp = sp.empty(k)
        recall_tmp = sp.empty(k)
        precision_tmp = sp.empty(k)
        kf = KFold(n_splits=k)
        for j, (train_idx, test_idx) in enumerate(kf.split(y)):
            #scores = cross_val_score(clf, X, y, scoring='f1', cv=5)
            clf_fit = clf.fit(X.iloc[train_idx], y.iloc[train_idx])
            y_pred_cv = clf_fit.predict(X.iloc[test_idx])
            y_test_cv = y.iloc[test_idx]             
            f1_tmp[j] = f1_score(y_pred_cv, y_test_cv)
            recall_tmp[j] = recall_score(y_pred_cv, y_test_cv)
            precision_tmp[j] = precision_score(y_pred_cv, y_test_cv)
            #print(f1_tmp.mean(), recall_tmp.mean(), precision_tmp.mean())
            scores[ci,mi,0] = f1_tmp.mean()
            scores[ci,mi,1] = recall_tmp.mean()
            scores[ci,mi,2] = precision_tmp.mean()

def model_eval(clf, kC, kM, k):
    clf_class = clf.__class__
    if clf_class == LogisticRegression:
        cs = sp.logspace(-4,0,kC)
        ms = sp.linspace(1,12,kM)
    elif clf_class == RidgeClassifier:
        cs = sp.linspace(1e-1, 1e4, kC) #alphas
        ms = sp.linspace(1,50,kM)
    elif clf_class == LinearSVC:
        cs = sp.logspace(-5,-1,kC)
        ms = sp.linspace(1,12,kM)
    else:
        raise ValueError("%s is not supported" % clf.class_)
                
    f1_scores = sp.empty(k)
    recall_scores = sp.empty(k)
    precision_scores = sp.empty(k)
    kf = KFold(n_splits=k, random_state=0)
    for i, (train_idx, test_idx) in enumerate(kf.split(y)):
        print("%d-th fold" % i)
        Xi, yi = X.iloc[train_idx], y.iloc[train_idx]
        C,M = param_eval_parallel(Xi,yi,clf_class,cs,ms)
        print("%d-th fold:  regularization = %f,  weight = %f" \
             % (i, C, M))
        class_weight = make_weights(M)
        if clf_class == LogisticRegression:
            clf.set_params(penalty='l1', C=C, dual=False, class_weight=class_weight)
        elif clf_class == RidgeClassifier:
            clf.set_params(alpha=C, class_weight=class_weight)
        elif clf_class == LinearSVC:
            clf.set_params(C=C, dual=False, class_weight=class_weight)

        clf_fit = clf.fit(Xi,yi)
        y_pred = clf_fit.predict(X.iloc[test_idx])
        y_test = y.iloc[test_idx]
        f1_scores[i] = f1_score(y_pred,y_test)
        recall_scores[i] = recall_score(y_pred,y_test)
        precision_scores[i] = precision_score(y_pred,y_test)
    print("Predicted scores:\n F1: %f;  Recall: %f;  Precision: %f" \
        % (f1_scores.mean(), recall_scores.mean(), precision_scores.mean()))
    return f1_scores, recall_scores, precision_scores


In [5]:
#Example
kC=3 #regularization grid_num
kM=2 #class weight grid_num  
clf = LinearSVC()
k = 2 # k-fold cross-validation
model_eval(clf,kC,kM,k)