In this project, I'm going to do the following:

 1. Implement Logistic Regression based on the Machine Learning course from
    Stanford University

 2. Use the Logistic Regression classifier in the Python scikit-learn library

 3. Compare Logistic Regression with other classifiers: (1) Support Vector Classifier (2) Random
    Forest Classifier

 4. Change the size of the training set and compare

 5. Conclusion

First, let's read in data.

In [None]:
import pandas as pd
import numpy as np

In [None]:
creditcard = pd.read_csv('../input/creditcard.csv')

In [None]:
X = creditcard[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']]
y = creditcard['Class']

#1. Implement Logistic Regression

Next, let's do logistic regression. Of course, we could explore and scale the features or introduce new higher-degree features, but I'm just going to implement a basic logistic regression here.

References:

https://www.coursera.org/learn/machine-learning

http://aimotion.blogspot.com/2011/11/machine-learning-with-python-logistic.html


The cost function is defined as shown in http://4.bp.blogspot.com/-0vWgkEmE-u4/TraaI_rd-bI/AAAAAAAAAow/Ya5rp0rQS48/s1600/Screen+shot+2011-11-06+at+11.30.37+AM.png

The gradient is defined as shown in http://2.bp.blogspot.com/-jpwtW1KQIoE/TraaRvy_8MI/AAAAAAAAAo4/9qnO3SyiqaA/s1600/Screen+shot+2011-11-06+at+11.30.41+AM.png
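
Written out, and assuming the unregularized form that the code in this section implements (with $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$ and $m$ training examples), the two formulas read roughly as follows. The linked images may use slightly different notation.

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```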

In [None]:
# we need train_test_split to split the data into a training set and a test set
# we need metrics to measure accuracy after predictions
# we need optimize from scipy to minimize the cost function
# (train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
from sklearn import metrics
import scipy.optimize as op

In [None]:
# to save time, keep the training set below 100,000 rows (35% of the data)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.65, random_state=0)
print(Xtrain.shape)
print(Xtest.shape)

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
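
With unscaled features such as `Time` and `Amount`, the argument `z` can become very large in magnitude, and `np.exp(-z)` then overflows for very negative `z`. Below is a minimal sketch of a numerically safer variant, not part of the original notebook; the name `stable_sigmoid` is a hypothetical helper introduced here for illustration. (`scipy.special.expit` offers a ready-made sigmoid as well.)

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in np.exp for large |z| (hypothetical helper)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # for z >= 0, exp(-z) <= 1, so this form cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # for z < 0, exp(z) < 1, so use the algebraically equivalent form
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

The two branches compute the same function; they only differ in which exponential is evaluated, so that its argument is never positive.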

In [None]:
# define the cost function. theta is an array containing the coefficients for all features
# (despite the "Reg" in the name, no regularization term is included).
def costFunctionReg(theta, X, y):
    m = len(y)
    h = sigmoid(X.dot(theta))
    J = (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h))) / m
    return J

In [None]:
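# define the gradient of the cost function with respect to theta
def Gradient(theta, X, y):
    m = len(y)
    h = sigmoid(X.dot(theta))
    # divide the float array by m (avoids integer division of 1/m under Python 2)
    grad = (X.T).dot(h - y) / m
    return grad.flatten()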
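
Before handing a cost function and its gradient to an optimizer, it can help to confirm that the two agree. The following is a small sketch of such a check using central differences on made-up data (not the credit card set). The lowercase `cost` and `gradient` functions restate the definitions above so the block runs on its own, and `numerical_gradient` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X.dot(theta))
    return (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) / len(y)

def gradient(theta, X, y):
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / len(y)

def numerical_gradient(f, theta, X, y, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step, X, y) - f(theta - step, X, y)) / (2 * eps)
    return grad

# small synthetic problem: 50 samples, an intercept column plus 3 features
rng = np.random.RandomState(0)
X_demo = np.hstack([np.ones((50, 1)), rng.randn(50, 3)])
y_demo = (rng.rand(50) > 0.5).astype(float)
theta_demo = rng.randn(4) * 0.1

max_diff = np.max(np.abs(gradient(theta_demo, X_demo, y_demo)
                         - numerical_gradient(cost, theta_demo, X_demo, y_demo)))
print('largest difference between analytic and numerical gradient:', max_diff)
```

A difference close to zero suggests the analytic gradient matches the cost function; a large one usually points to a sign or scaling error.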

In [None]:
# define the predict function: label 1 where the predicted probability exceeds 0.5, else 0
def predict(theta, X):
    h = sigmoid(X.dot(theta))
    return (h > 0.5).astype(float)

In [None]:
# convert data to arrays
Xtrain = np.array(Xtrain)
ytrain = np.array(ytrain)
Xtest = np.array(Xtest)
ytest = np.array(ytest)

In [None]:
# add a column of ones to Xtrain
Xtrain_ones = np.append(np.ones((Xtrain.shape[0],1)), Xtrain, axis = 1)

In [None]:
# use fmin_bfgs to minimize the cost function and find theta (takes about 2 minutes)
initial_theta = np.zeros(Xtrain_ones.shape[1])
theta_optimal = op.fmin_bfgs(f=costFunctionReg, x0=initial_theta, args=(Xtrain_ones, ytrain), fprime=Gradient, maxiter=400)

In [None]:
# make predictions and check accuracy
Xtest_ones = np.append(np.ones((Xtest.shape[0], 1)), Xtest, axis=1)
ypred = predict(theta_optimal, Xtest_ones)
print(metrics.confusion_matrix(ytest, ypred))
print(metrics.classification_report(ytest, ypred))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypred)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypred)))
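
Note that `metrics.roc_auc_score` is called here with hard 0/1 predictions. It also accepts continuous scores, which generally gives a different value. A toy sketch with made-up labels and scores shows the difference. For the models in this notebook the scores would be something like `sigmoid(Xtest_ones.dot(theta_optimal))` or, for a scikit-learn classifier, `predict_proba(Xtest)[:, 1]`; that substitution is a suggestion, not what the original cells do.

```python
import numpy as np
from sklearn import metrics

# toy labels and scores (made-up values for illustration)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.9])
y_label = (y_score > 0.5).astype(int)  # hard 0/1 predictions at threshold 0.5

auc_from_scores = metrics.roc_auc_score(y_true, y_score)
auc_from_labels = metrics.roc_auc_score(y_true, y_label)
print('AUC from scores: %f' % auc_from_scores)
print('AUC from labels: %f' % auc_from_labels)
```

The label-based value summarizes a single threshold, while the score-based value summarizes the ranking over all thresholds.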

Well, it works, but it's not perfect. Next, let's see what the Python library can do.

#2. Use Logistic Regression Classifier in Python scikit-learn library

In [None]:
# create the classifier and fit it on the training data
from sklearn.linear_model import LogisticRegression
clf_logistic = LogisticRegression(penalty='l2')
clf_logistic.fit(Xtrain, ytrain)

In [None]:
# make predictions and check accuracy
ypred = clf_logistic.predict(Xtest)
print(metrics.confusion_matrix(ytest, ypred))
print(metrics.classification_report(ytest, ypred))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypred)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypred)))

Yes, it's better!
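
Feature scaling, mentioned earlier as an option, can be combined with the scikit-learn classifier in a pipeline. The sketch below uses synthetic imbalanced data from `make_classification` as a stand-in for the credit card set, so the names `X_demo`, `y_demo` and `clf_scaled` are assumptions for illustration, and no claim is made about how scaling would change the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic imbalanced data standing in for the credit card set (assumption)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# the scaler is fit on the training data only, then applied to the test data
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_scaled.fit(X_tr, y_tr)

# predicted probability of class 1 for each test sample
proba = clf_scaled.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Putting the scaler inside the pipeline keeps test-set statistics out of the training step.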

#3. Check other classifiers and compare
I will check Support Vector Classifier and Random Forest Classifier

##(1) Support Vector Classifier

We can change kernels for SVC. Here I've tested 'linear' and 'sigmoid'.

In [None]:
from sklearn.svm import SVC

A reminder: the following cell is very slow!

In [None]:
# SVC with 'linear' kernel. It took about 10 minutes.
clf_linear = SVC(kernel='linear')
clf_linear.fit(Xtrain, ytrain)

In [None]:
# make prediction and check accuracy
ypred = clf_linear.predict(Xtest)
print(metrics.confusion_matrix(ytest,ypred))
print(metrics.classification_report(ytest,ypred))
print('Accuracy : %f' %(metrics.accuracy_score(ytest,ypred)))
print('Area under the curve : %f' %(metrics.roc_auc_score(ytest,ypred)))

In [None]:
# SVC with 'sigmoid' kernel
clf_sigmoid = SVC(kernel='sigmoid')
clf_sigmoid.fit(Xtrain, ytrain)

In [None]:
# make predictions and check accuracy
ypred = clf_sigmoid.predict(Xtest)
print(metrics.confusion_matrix(ytest, ypred))
print(metrics.classification_report(ytest, ypred))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypred)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypred)))

Mmm, SVC does not work very well.

##(2) Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(Xtrain,ytrain)

In [None]:
# make predictions and check accuracy
ypred = clf_rf.predict(Xtest)
print(metrics.confusion_matrix(ytest, ypred))
print(metrics.classification_report(ytest, ypred))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypred)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypred)))

Random Forest is pretty good.

#4. Change the size of training set and compare

I'm done with my code and testing, and I'm wondering: what if the training set were even larger? The results should be better, right? Let's see.

In [None]:
Xtrain2, Xtest2, ytrain2, ytest2 = train_test_split(X, y, test_size=0.2, random_state=0)
print(Xtrain2.shape)
print(Xtest2.shape)
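
When the positive class is rare, a different split can put a different share of positives into the test set, which makes scores from two splits harder to compare. A possible variation, not used in the original cells, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses made-up labels (`X_demo`, `y_demo`) rather than the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up imbalanced labels: 1000 samples, 2% positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the 2% positive rate in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)
print('positives in training set:', y_tr.sum())
print('positives in test set:', y_te.sum())
```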

Logistic Regression Classifier

In [None]:
# Use logistic regression again
clf_logistic2 = LogisticRegression(penalty='l2')
clf_logistic2.fit(Xtrain2, ytrain2)

In [None]:
# make predictions and check accuracy
ypred2 = clf_logistic2.predict(Xtest2)
print(metrics.confusion_matrix(ytest2, ypred2))
print(metrics.classification_report(ytest2, ypred2))
print('Accuracy : %f' % (metrics.accuracy_score(ytest2, ypred2)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest2, ypred2)))

Random Forest Classifier

In [None]:
# Use random forest classifier again
clf_rf2 = RandomForestClassifier()
clf_rf2.fit(Xtrain2, ytrain2)

In [None]:
# predict with the model trained on the larger training set (clf_rf2, not clf_rf)
ypred2 = clf_rf2.predict(Xtest2)
print(metrics.confusion_matrix(ytest2, ypred2))
print(metrics.classification_report(ytest2, ypred2))
print('Accuracy : %f' % (metrics.accuracy_score(ytest2, ypred2)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest2, ypred2)))

#5. Conclusion

1. The basic logistic regression code I implemented needs to be improved for this problem.

2. The Logistic Regression and Random Forest classifiers in the scikit-learn library are pretty good. The areas under the ROC curve are 0.86 and 0.87, respectively.

3. The Support Vector Classifiers with the 'linear' and 'sigmoid' kernels are not good. The areas under the ROC curve are 0.68 and 0.5. The 'linear' SVC is quite slow, and the 'sigmoid' SVC does not recognize any '1' in the test set at all. This also tells us that 'Accuracy' is not a good measure for this problem.

4. After I increased the size of the training set (from 35% to 80% of all data), the area under the ROC curve became 0.78 for Logistic Regression and stayed at 0.87 for the Random Forest Classifier. This is interesting to me.
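
To illustrate the point about 'Accuracy' in item 3, here is a toy sketch with made-up labels (not the credit card data): a classifier that never predicts '1' still reaches a very high accuracy when positives are rare, while its recall is zero and its area under the ROC curve is 0.5.

```python
import numpy as np
from sklearn import metrics

# made-up labels: 1000 samples with 5 positives
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1
y_all_zero = np.zeros(1000, dtype=int)  # classifier that never predicts '1'

acc = metrics.accuracy_score(y_true, y_all_zero)
rec = metrics.recall_score(y_true, y_all_zero)
auc = metrics.roc_auc_score(y_true, y_all_zero)
print('Accuracy : %f' % acc)
print('Recall   : %f' % rec)
print('Area under the curve : %f' % auc)
```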