# Code for Jupyter Notebook demo (July 2018)

## Building a logistic regression model for fraud detection 

## Prerequisites

Spin up a Data Science Virtual Machine (DSVM) via the Azure portal: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm

- Preconfigured Virtual Machines: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
- Download the credit card dataset and upload it to Azure Blob storage (https://azure.microsoft.com/en-us/services/storage/blobs/). Note down the blob credentials; the code below uses them for authentication.

## References: 

- Preconfigured Virtual Machines: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/ 

- Data Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

- Blog Post by Venelin Valkov: https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-20e0c85301bd , https://www.data-blogger.com/2017/06/15/fraud-detection-a-simple-machine-learning-approach/

- Deep Learning Book by Ian Goodfellow, Yoshua Bengio, Aaron Courville: http://www.deeplearningbook.org/ 

## Environment setup

In [None]:
# Import necessary components
import os
import keras
import shutil
import json

In [None]:
import re
import pandas as pd
import numpy as np
import datetime

from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, recall_score, precision_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Activation
from math import ceil

In [None]:
import pickle
from scipy import stats
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers

In [None]:
import os
cwd = os.getcwd()
cwd

In [None]:
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
import glob
import os

from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

Enter the credentials to access the data from the cloud and then download the file for analysis.

In [None]:
# Azure blob credentials to read data
storage_account = '****'
storage_key = '****'

input_container = 'stratalondon'
output_container = 'modeldeploy'

az_blob_service = BlockBlobService(account_name=storage_account, account_key=storage_key)
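
The cell above uses placeholder strings for the storage account and key. As a minimal sketch, the same two values could be read from environment variables so that real secrets never appear in the notebook. The variable names `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY` and the helper `load_blob_credentials` are assumptions made for illustration; they are not part of the original demo.

```python
import os


def load_blob_credentials(env=os.environ):
    """Return (account, key) read from the environment.

    AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are assumed variable
    names chosen for this sketch, not names required by any library.
    """
    account = env.get("AZURE_STORAGE_ACCOUNT")
    key = env.get("AZURE_STORAGE_KEY")

    # Collect the names of any variables that are unset or empty
    missing = [
        name
        for name, value in (
            ("AZURE_STORAGE_ACCOUNT", account),
            ("AZURE_STORAGE_KEY", key),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return account, key
```

The returned pair could then be assigned to `storage_account` and `storage_key` before the blob service is created.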

In [None]:
blob_service = BlockBlobService(account_name=storage_account, account_key=storage_key)
# List the blobs in the input container and remember the credit card file
fname = None
for blob in blob_service.list_blobs(input_container):
    if "creditcard" in blob.name:
        print(blob.name)
        fname = blob.name

In [None]:
aml_dir = cwd
my_service = BlockBlobService(account_name=storage_account, account_key=storage_key)
my_service.get_blob_to_path('stratalondon', fname, 'C://dsvm//notebooks/creditcard.csv')

## Import the Credit card data set

In [None]:
# Check the path
aml_dir

In [None]:
# Ingest the dataset
cc = pd.read_csv('C://dsvm//notebooks/creditcard.csv')

After ingesting the data from Blob storage, inspect the columns and the number of rows and columns in the dataset.

In [None]:
# Check sample data
cc.head(1)

In [None]:
# Check the number of rows/columns
cc.shape

Now that the data is properly imported, check the descriptive statistics of the columns in the dataset.

In [None]:
# Check data statistics
print(cc.describe())

Here we visualize and assess the distribution of the variable 'Class'. This variable indicates whether a transaction was fraudulent or normal.

In [None]:
from matplotlib import pyplot as plt 

In [None]:
# Variable class is used for the classification of entries as fraud/non-fraud, check the distribution of the variable
class_freq = cc['Class'].value_counts(sort=True)
class_freq.plot(kind = 'bar', rot=0)
plt.title("Class Frequency")
plt.xlabel("Class")
plt.ylabel("Frequency");

In [None]:
# Count of Fraud/normal transactions
fraud = cc[cc.Class == 1]
normal = cc[cc.Class == 0]
print("Number of Fraud transactions:")
print(fraud.shape)
print("Number of Non-Fraud transactions:")
print(normal.shape)
print("% of Fraud transactions:")
prop = (len(fraud)/(len(fraud)+len(normal)))*100
print(prop)

Check to see how the fraud/normal transactions vary in terms of variable 'Amount'.

In [None]:
# Check Fraud data statistics for variable = 'Amount'
fraud.Amount.describe()

In [None]:
# Compare Fraud data statistics with normal data for variable = 'Amount'
normal.Amount.describe()

## Modeling 

First exclude the variable 'Time'. Since the spread of the variable 'Amount' is large, this variable is standardized. 

In [None]:
# Remove the column 'Time' and standardize the variable 'Amount'
from sklearn.preprocessing import StandardScaler
data = cc.drop(['Time'], axis=1)
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
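
To make the standardization step concrete, here is a library-free sketch of the z-score transform, z = (x - mean) / std. Like `StandardScaler` with its default settings, it divides by the population standard deviation. The sample amounts are made up for illustration and do not come from the dataset.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts with a wide spread
amounts = [1.0, 2.5, 10.0, 250.0, 2000.0]
scaled = standardize(amounts)

# After scaling, the values have mean 0 and unit standard deviation
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

After the transform, 'Amount' is on a scale comparable to the other features, which tends to help the solver behind logistic regression converge.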

The next step is to split the data into training and test sets.

Then define the logistic regression model and fit it on the training data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import metrics
import pandas as pd

In [None]:
# Only use the 'Amount' and 'V1', ..., 'V28' features
features = ['Amount'] + ['V%d' % number for number in range(1, 29)]

# The target variable which we would like to predict, is the 'Class' variable
target = 'Class'

# Now create an X variable (containing the features) and an y variable (containing only the target variable)
X = data[features]
y = data[target]

In [None]:
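# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Define the logistic regression model and fit it on the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)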
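
The split in this notebook is a plain random split. When one class is rare, an alternative worth considering is a stratified split, which keeps the class proportions the same in both sets; scikit-learn's `train_test_split` supports this through its `stratify` argument. The library-free sketch below only illustrates the idea. The helper `stratified_split` and the toy labels are hypothetical and are not part of the original demo.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 90 normal rows (0) and 10 fraud rows (1)
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

With these toy labels, the test set receives 3 of the 10 fraud rows, matching the 30% test share.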

In [None]:
y_pred = lr.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))
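
When reading the report, the precision and recall of the fraud class (label 1) are usually more informative than overall accuracy, because fraud is expected to be a small fraction of all transactions. The following library-free sketch shows how those two numbers come out of the confusion counts. The toy labels are made up, and the helper names are hypothetical.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: 2 fraud rows out of 10
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = binary_confusion(y_true_toy, y_pred_toy)
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / len(y_true_toy)
print(precision, recall, accuracy)
```

In this toy case accuracy is 0.8 even though only half of the fraud rows are caught, which is why the per-class rows of the report deserve the closer look.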