<H1>Minimizing the "Real" Cost of Credit Card Fraud</H1>
<H3>by Michael Klear</H3>
This <a href='https://www.kaggle.com/dalpozz/creditcardfraud'>awesome dataset</a> provides a great real-world example of the challenges of credit card fraud detection. Why? Take a look:


In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

df = pd.read_csv('../input/creditcard.csv')
df = df.sample(frac=1, random_state=123) #Shuffle samples

class_imb = df.Class.sum()/len(df) #Record class imbalance

#Pie plot
labels = ['Genuine Transactions', 'Fraudulent Transactions']
sizes = [(len(df)-df.Class.sum()), df.Class.sum()]
colors = ['blue', 'red']

plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Genuine Transactions vs. Fraudulent Transactions')
plt.axis('equal')
plt.show()

The number of fraudulent transactions in this dataset is tiny compared to the body of all the "honest" transactions recorded. In a purely theoretical binary classification problem, this situation has its difficulties. In the "real world," it's even more complicated. That's because <b>the cost of a misclassification can vary wildly between type I errors (false positives) and type II errors (false negatives)</b>. Because we live in the real world<sup>[citation needed]</sup>, it may be best to define the cost of each type of error before we predict anything.<br>
<H2>The Cost of Misclassification</H2>
<H3>Type II Error: The Cost of an Undetected Fraudulent Charge</H3><br>
For the sake of this task, let's assume that we (the credit card company) are responsible for refunding the full amount of a fraudulent transaction. The cost of a type II error, then, is simply the amount of that particular transaction.
<H3>Type I Error: The Cost of a "False Alarm"</H3><br>
If our model misclassifies a genuine transaction as fraudulent and blocks the transaction, we have a few costs to consider:
<li>The cost of investigating the transaction</li>
<li>The cost of inconvenience to the customer</li>
<li>The lost business (the transaction did not occur)</li><br>
Let's attempt to quantify these costs:
<li>We will probably need to pay out an average of one employee hour to investigate the transaction.  Let's call this \$25. </li>
<li>The cost of inconvenience to the customer is hard to quantify. The biggest risk is losing the customer altogether. Let's say the probability of losing a customer due to inconvenience is 1/100 and the cost of losing a customer (the lifetime expected profit from a customer) is \$1,000. Thus, the mean cost due to inconvenience to a customer would be one percent of \$1,000, or \$10.</li>
<li>If we collect 1% of the transaction amount in fees, the cost of the lost transaction is simply one percent of the transaction amount.</li><br>
We now have a rough notion of the cost of both types of errors. This will come in handy when we evaluate our predictive model.<br>
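The cost assumptions above can be collected into a single helper. This is a minimal sketch, not part of the original kernel: the function name `misclassification_cost` and the constant names are hypothetical, and the dollar figures are the rough estimates chosen above.

```python
import numpy as np

# Rough estimates from the text above (assumptions, not measured values)
INVESTIGATION_COST = 25.0           # one employee hour per false alarm
INCONVENIENCE_COST = 0.01 * 1000.0  # 1% chance of losing a $1,000 customer
FEE_RATE = 0.01                     # share of the amount collected as fees


def misclassification_cost(y_true, y_pred, amount):
    """Return per-transaction cost arrays (type I, type II)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amount = np.asarray(amount, dtype=float)

    false_positive = (y_pred == 1) & (y_true == 0)
    false_negative = (y_pred == 0) & (y_true == 1)

    # A false alarm costs the investigation, the inconvenience, and the lost fee
    cost_I = np.where(
        false_positive,
        INVESTIGATION_COST + INCONVENIENCE_COST + FEE_RATE * amount,
        0.0,
    )
    # A missed fraud costs the full transaction amount
    cost_II = np.where(false_negative, amount, 0.0)
    return cost_I, cost_II


# Toy example: four transactions with made-up labels and amounts
cost_I, cost_II = misclassification_cost(
    y_true=[0, 0, 1, 1],
    y_pred=[0, 1, 0, 1],
    amount=[50.0, 200.0, 120.0, 80.0],
)
print(cost_I.sum(), cost_II.sum())
```

In the toy example only the second transaction is a false alarm and only the third is a missed fraud, so each total comes from a single row.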
<H2>Training/Test Split</H2>
I'd like to reserve a good number of samples for model evaluation, so we'll use a 50-50 training/test split. Half of the data will be used to fit the model, and the other half will be used to evaluate it.<br>
I'll also define our features. I omit time, as it's not a feature that makes sense to use in a production environment: any information our model picks up from this feature would be useless in the future, because future times do not appear in our training data.

In [None]:
#Define our features, omitting time
features = ['V{}'.format(x) for x in range(1, 29)] + ['Amount']
#Set cutoff at 50%
cutoff=int(.5*len(df))
#Split data into training and test 
train = df[:cutoff]
test = df[cutoff:]

#Split up into X's and Y's
X_train = train[features]
Y_train = train.Class

X_test = test[features]
Y_test = test.Class

<H2>Making Predictions</H2><br>
<H3>Model 1: Make No Predictions</H3><br>
With such a high class imbalance, predicting that 100% of transactions are genuine is a viable option that yields 99.8% overall accuracy. It's certainly the simplest solution, and it would be easy to put into production.<br>
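To see where a figure like this comes from: a model that always predicts "genuine" is wrong only on the fraudulent transactions, so its accuracy is one minus the fraud rate. Here is a small sketch with made-up labels (the 0.2% fraud rate below is an illustrative assumption, not the dataset's exact rate):

```python
import numpy as np

# Illustrative labels: 2 frauds in 1,000 transactions (assumed, not the real data)
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# The "no predictions" model labels everything as genuine (class 0)
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
fraud_rate = y_true.mean()
print(accuracy, 1 - fraud_rate)
```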
<H4>Error Breakdown</H4>

In [None]:
#record results:
results1 = pd.DataFrame()
results1['true_value'] = np.array(Y_test)
results1['predicted'] = 0
results1['correct'] = np.where(results1.predicted == results1.true_value, 1, 0)
results1['tI_error'] = np.where((results1.predicted == 1) & (results1.true_value == 0), 1, 0)
results1['tII_error'] = np.where((results1.predicted == 0) & (results1.true_value == 1), 1, 0)

labels = ['False Negatives (type II errors)', 'Correctly Identified']
sizes = [results1.tII_error.sum(), results1.correct.sum()]
colors = ['red', 'blue']

plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Model 1 Performance: Error Breakdown')
plt.axis('equal')
plt.show()

results1['Amount'] = np.array(X_test['Amount'])
#type I error cost, as defined
results1['cost_I'] = np.where(results1.tI_error == 1, (25 + 10 + (.01*results1.Amount)), 0)
#type II error cost, as defined
results1['cost_II'] = np.where(results1.tII_error == 1, results1.Amount, 0)
results1['total_cost'] = results1[['cost_I', 'cost_II']].sum(axis=1)
total_cost_I = results1.cost_I.sum()
total_cost_II = results1.cost_II.sum()
total_cost = total_cost_I + total_cost_II
        
fig, ax = plt.subplots(figsize=(8, 5))
plt.bar(
    ['cost due to type I errors', 'cost due to type II errors', 'total cost'],
    [total_cost_I, total_cost_II, total_cost],
    color=['yellow', 'orange', 'green']
);
plt.title('Contributions to Cost');
plt.ylabel('Cost in Dollars');
plt.show();

print("The total cost of this model on our test set is: ${}".format(str(total_cost)))

<H3>Results Interpretation</H3><br>
The total cost of using the model in production is almost \$30,000, despite the fact that it's highly accurate. Let's see if we can improve this cost with a predictive model.<br>
<H2>Model 2: Gradient Boosted Model</H2><br>
Now we'll try to reduce the cost by making some predictions and trying to catch fraud. I'm going to make use of xgboost to implement the gradient boosting algorithm. This wouldn't be a proper Kaggle kernel without xgboost.

In [None]:
#Set up our data and parameters
dtrain = xgb.DMatrix(X_train, label=Y_train)
param = {
     'colsample_bytree': 0.4,
     'eta': 14/100,
     'max_depth': 3,
     'nthread': 4,
     'objective': 'binary:logistic',
     'scale_pos_weight': .5/class_imb,
     'verbosity': 0  #replaces the older 'silent' flag, which newer xgboost releases no longer accept
}

#Train the model (cue 'Eye of the Tiger')
#Pass the dict itself; a dict_items view can fail on newer xgboost releases
num_round = 500
bst = xgb.train(param, dtrain, num_round, verbose_eval=False)

#Grab predictions at p=.5 threshold
dtest = xgb.DMatrix(X_test)
Y_ = bst.predict(dtest)
Y_ = np.where(Y_>.5, 1, 0)
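
A note on `scale_pos_weight`: it scales the weight of positive (fraud) samples during training, and the xgboost documentation suggests the ratio of negative to positive samples as a typical starting value. The setting used above, `.5/class_imb`, works out to roughly half of that ratio when the classes are this imbalanced. The sketch below compares the two on made-up class counts; the counts and the variable names `typical_weight` and `kernel_weight` are illustrative assumptions.

```python
# Made-up class counts for illustration (not the real dataset's counts)
n_negative = 99_800
n_positive = 200

# Fraction of positives, computed the same way as class_imb in the first cell
class_imb = n_positive / (n_negative + n_positive)

typical_weight = n_negative / n_positive  # common starting point: negatives / positives
kernel_weight = 0.5 / class_imb           # the expression used in the cell above

print(typical_weight, kernel_weight)
```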

In [None]:
#record results:
results2 = pd.DataFrame()
results2['true_value'] = np.array(Y_test)
results2['predicted'] = Y_
results2['correct'] = np.where(results2.predicted == results2.true_value, 1, 0)
results2['tI_error'] = np.where((results2.predicted == 1) & (results2.true_value == 0), 1, 0)
results2['tII_error'] = np.where((results2.predicted == 0) & (results2.true_value == 1), 1, 0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.set_title('Model 2 Performance: Error Breakdown')
lbl = ['Incorrectly Identified', 
       'Correctly Identified ({}%)'.format(str(100*results2.correct.sum()/len(results2))[:5])
]
sz = [results2.tII_error.sum() + results2.tI_error.sum(), results2.correct.sum()]
clrs = ['red', 'blue']
ax1.pie(sz, labels=lbl, colors=clrs, shadow=True, startangle=90)

ax2.set_title('Error Types')
labels = [
    'Type I Errors (False Positives): {}'.format(results2.tI_error.sum()),
    'Type II Errors (False Negatives): {}'.format(results2.tII_error.sum())
]
sizes = [
    results2.tI_error.sum(),
    results2.tII_error.sum()
]

ax2.pie(sizes, labels=labels, colors=['cyan', 'purple'], startangle=215, 
        autopct='%1.1f%%')
plt.subplots_adjust(wspace=1.5)
plt.show()

results2['Amount'] = np.array(X_test['Amount'])
#type I error cost, as defined
results2['cost_I'] = np.where(results2.tI_error == 1, (25 + 10 + (.01*results2.Amount)), 0)
#type II error cost, as defined
results2['cost_II'] = np.where(results2.tII_error == 1, results2.Amount, 0)
results2['total_cost'] = results2[['cost_I', 'cost_II']].sum(axis=1)
total_cost_I = results2.cost_I.sum()
total_cost_II = results2.cost_II.sum()
total_cost = total_cost_I + total_cost_II
        
fig, ax = plt.subplots(figsize=(8, 5))
plt.bar(
    ['cost due to type I errors', 'cost due to type II errors', 'total cost'], 
    [total_cost_I, total_cost_II, total_cost],
    color=['cyan', 'purple', 'green']
);
plt.title('Contributions to Cost');
plt.ylabel('Cost in Dollars');
plt.show();

print("The total cost of this model on our test set is: ${:.2f}".format(total_cost))

<H3>Results Interpretation</H3><br>
The total cost of using the model on our test set is over \$8,000, and the model is more accurate overall than the first (null) model. This reduces our expense due to fraud to less than a third of what it would have been without taking action. Not bad!<br>
We can still see, however, that the majority of our cost comes from type II errors, or instances of undetected fraud. Given the disparity in the cost of each type of misclassification, we may be better off with a lower overall accuracy if we can "catch" more fraudulent charges. One technique would be, of course, to tune up our xgboost implementation (more than I've already done). Another is to try a different type of model. Maybe I'll come back and try that, but for now I'm happy to have saved my hypothetical company over \$20,000!<br>
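Because the two error types have different costs, the 0.5 cutoff is not necessarily the cheapest one. A third option, which needs no retraining, is to sweep the decision threshold and keep the value that minimizes cost on held-out data. The sketch below is not something done in this kernel: it uses synthetic scores in place of `bst.predict(dtest)`, and the helper name `total_cost_at` is hypothetical. The cost terms follow the definitions from earlier (\$35 plus 1% of the amount for a false alarm, the full amount for a missed fraud).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a held-out set (assumptions for illustration only)
n = 20_000
y_true = (rng.random(n) < 0.01).astype(int)    # roughly 1% fraud
amount = rng.exponential(scale=90.0, size=n)   # made-up transaction amounts
# Fraud tends to score higher, with some overlap between the classes
scores = np.clip(rng.normal(0.15 + 0.6 * y_true, 0.2), 0.0, 1.0)


def total_cost_at(threshold):
    """Total cost on the synthetic set when flagging scores above threshold."""
    y_pred = (scores > threshold).astype(int)
    false_pos = (y_pred == 1) & (y_true == 0)
    false_neg = (y_pred == 0) & (y_true == 1)
    cost_I = np.where(false_pos, 25 + 10 + 0.01 * amount, 0.0).sum()
    cost_II = np.where(false_neg, amount, 0.0).sum()
    return cost_I + cost_II


# Evaluate the cost on a grid of candidate thresholds and keep the cheapest
thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost_at(t) for t in thresholds])
best = thresholds[costs.argmin()]

print("cost at 0.5 threshold:", round(total_cost_at(0.5), 2))
print("lowest-cost threshold:", round(best, 2), "cost:", round(costs.min(), 2))
```

On real predictions the cheapest threshold may land above or below 0.5, depending on how well the scores separate the classes and on the transaction amounts involved. It should be chosen on a validation split rather than on the test set used for the final cost estimate.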