version 1.1 ~ GMcK

# Fraud Detection

An important application of data science within the fintech (financial technology) industry is detecting fraudulent transactions. In this notebook we train a model to predict which transactions might be fraudulent.

## The dataset

The dataset contains the following columns:

* merchant: The id of our client
* first_6_digits: The first 6 digits of the credit card used in the transaction
* payment_ref: The unique reference for this payment
* amount_eur: The amount in euros
* card_network: The network that issued the card
* card_type: The type of card
* timestamp: The timestamp when the payment was received
* refused_by_bank: A payment can also be refused by the bank for various reasons (lack of funds, lost card, etc.)
* bank_country_id: The country of the bank
* user_id: The id of the user that is performing the payment
* user_country_id: The country of the user that is performing the payment
* is_fraud: Whether the payment was flagged as fraudulent.

## Machine Learning Steps

* Read in the data file, clean the data and create some derived variables
* Explore relationships between target and features
* Create a baseline model against which to benchmark performance
* Select, tune and assess the final classification model


## Notebook Structure

The Notebook is structured in the following way:

 * Setup (requirements for running the following analysis & code)
 * Data Loading & Preparation & Feature Engineering
 * Exploratory Data Analysis
 * Model Development & Performance




# Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from scipy import stats
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
plt.style.use('ggplot')

In [None]:
! pip install seaborn
! pip install scikit-learn

# Data Preparation

In [None]:
df = pd.read_csv('../data/fraud_data.csv', index_col=[0])

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df['first_6_digits']= df['first_6_digits'].astype(str)

In [None]:
df.dtypes

## Test yourself!!

* Look at 5 rows of the dataframe where is_fraud==1
![image.png](attachment:image.png)

In [None]:
# Answer
filt = (df['is_fraud']== 1)
df[filt].head(5)

## Time Features

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

In [None]:
df['dow_txn'] = df['timestamp'].dt.dayofweek


In [None]:
df['hour_txn'] = df['timestamp'].dt.hour
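
To make the numbering behind these two features concrete, here is a minimal sketch on two made-up epoch timestamps (the values are illustrative and not taken from the dataset). pandas counts `dayofweek` from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Hypothetical epoch timestamps in seconds (not from the fraud dataset)
toy = pd.DataFrame({"timestamp": [0, 1_600_000_000]})
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")

# The .dt accessor exposes datetime components for a whole column at once
toy["dow_txn"] = toy["timestamp"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["hour_txn"] = toy["timestamp"].dt.hour       # 0-23, on the UTC clock for epoch input
print(toy)
```

The first row is 1 January 1970 (a Thursday) and the second is 13 September 2020 (a Sunday), so the day-of-week values come out as 3 and 6.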

## Test yourself!!

* Create a new feature "month_txn": a column in the df holding the month number of the transaction

Hint:
See Above

In [None]:
# Answer
df['month_txn'] = df['timestamp'].dt.month

df.head()

## Data Quality

### Duplication

In [None]:
# check for duplicates
df.duplicated().any()

In [None]:
df.duplicated('payment_ref').any()

In [None]:
df[df.duplicated('payment_ref', keep=False)]

In [None]:
# removal: drop every row whose payment_ref occurs more than once
df = df[~df.duplicated('payment_ref', keep=False)]

In [None]:
df.shape
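
The removal above drops every copy of a repeated payment_ref rather than keeping one of them. The sketch below, on a made-up four-row frame, shows how the `keep` argument of `duplicated` controls which rows get flagged; the data is hypothetical.

```python
import pandas as pd

# Hypothetical payments: reference "B" appears twice
toy = pd.DataFrame({"payment_ref": ["A", "B", "B", "C"],
                    "amount_eur": [10.0, 5.0, 5.0, 7.5]})

# Default keep='first' flags only the later repeats
later_repeats = toy.duplicated("payment_ref")
# keep=False flags every row whose reference occurs more than once
all_copies = toy.duplicated("payment_ref", keep=False)

print(later_repeats.tolist())  # [False, False, True, False]
print(all_copies.tolist())     # [False, True, True, False]

# Dropping all copies leaves only references that were unique to begin with
deduped = toy[~all_copies]
print(deduped)
```

Whether to keep one copy or none depends on whether the repeated rows can be trusted; dropping all of them is the more conservative choice.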

## Test yourself!!
* Check whether there are users that have made the same payment repeatedly. <br>
Hint: build a filter with **df.duplicated(['one','two'])**, apply it to the dataframe, and then use the method value_counts()

In [None]:
filt = df.duplicated(['amount_eur', 'user_id'])
df[filt]['user_id'].value_counts()

### Missing Data

In [None]:
sns.heatmap(df.isnull(), cbar=False)

In [None]:
df.isnull().sum()

Data are missing for a number of the variables in this dataset. The treatment of each is outlined below. With more time, more sophisticated methods (e.g. imputing a missing user_country_id from bank_country_id, merchant etc.) should be considered and the sensitivity to missing data assessed. Here I have taken the approach of keeping information about 'missingness' (in case it is predictive of fraud) where sensible, at the cost of losing information possibly inferable from imputing these values. Omitting these records completely would only be warranted if we were sure the missing data mechanism was MCAR (missing completely at random) and that the "good" records were representative of those missing (but, at a minimum, you could imagine the missingness of card_network being associated with merchant).

* first_6_digits: it is very suspicious that these are missing. It is hard to think of a reason other than data corruption, but I will nevertheless encode them as 'missing'
* card_network: again this seems more like a data corruption/processing issue, but I will nevertheless encode it as 'missing'
* card_type: encoded as 'missing'. This could legitimately be unavailable for some card types?
* bank_country_id: encoded as 'missing'. This could legitimately be unavailable for some banks
* user_id: encoded as 'missing'. Again, it seems unlikely that this would legitimately be unavailable
* user_country_id: encoded as 'missing'. This could legitimately be unavailable for some users, although the pattern observed in the above heatmap warrants some further investigation.
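
Two of the questions raised above (is missingness predictive of fraud, and is the missingness of card_network associated with merchant?) can be checked with a missingness indicator. This is a minimal sketch on made-up data; the column names mirror the dataset, but the values and the `card_network_status` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "merchant": ["Shop1", "Shop1", "Shop2", "Shop2", "Shop2", "Shop3"],
    "card_network": ["visa", np.nan, np.nan, "mc", np.nan, "visa"],
    "is_fraud": [0, 1, 0, 0, 1, 0],
})

# Indicator column recording whether card_network is missing
toy["card_network_status"] = (
    toy["card_network"].isnull().map({True: "missing", False: "present"})
)

# Fraud rate among rows with and without a card_network value
fraud_rate_by_status = toy.groupby("card_network_status")["is_fraud"].mean()
print(fraud_rate_by_status)

# Share of missing vs present card_network per merchant (each row sums to 1)
status_by_merchant = pd.crosstab(
    toy["merchant"], toy["card_network_status"], normalize="index"
)
print(status_by_merchant)
```

On real data this has to be run before the missing values are replaced, and a large gap in fraud rate between the two groups would be a reason to keep the 'missing' category as a feature.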



In [None]:
df.fillna('missing', inplace=True)

One hypothesis suggested by the pattern above is that, if the dataset were ordered by the time the payment was received, the country ids might be missing for a particular time period. Looking at the plot below, that does not seem to be the case. A fuller investigation would check whether the df is ordered by any of the other features (it doesn't appear to be) and also consider the cross tabulation of user_country_id missingness with the other categorical features.


In [None]:
plt.plot(df.index, df.timestamp)
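
As a complement to the plot, ordering can also be checked programmatically. The sketch below treats a column as "ordered" if it never decreases from one row to the next, which is only one possible definition; the frame is made up.

```python
import pandas as pd

# Hypothetical frame; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([30, 10, 20, 40], unit="s"),
    "merchant": ["Shop1", "Shop1", "Shop2", "Shop3"],
})

# A column counts as ordered here if it never decreases down the rows
ordered = {col: bool(toy[col].is_monotonic_increasing) for col in toy.columns}
print(ordered)
```

In this toy frame the rows are sorted by merchant but not by timestamp.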

## Test yourself!!

* You are suspicious of the missing data in the first_6_digits column, so instead of replacing those values you drop all the rows in the dataframe for which this column is missing a value. Note that the fillna cell above has already replaced the missing values, so the drop has to run before that cell. Hint: df.dropna(subset=['one'])

In [None]:
# Put your answer here

#  Exploratory Data Analysis

## Target Variable

In [None]:
df['is_fraud'].value_counts()


In [None]:
df.groupby(['is_fraud'])['payment_ref'].count()

In [None]:
fraud_counts = df.groupby(['is_fraud'])['payment_ref'].count()
plt.bar(fraud_counts.index, fraud_counts)

## Test yourself!

* Add a title and labels for the x and y axes to the above plot. <br>
Hint:
plt.title('x')
plt.xlabel('y')
plt.ylabel('z')

In [None]:
ax = sns.countplot(x="is_fraud", data=df)

## Merchant and First 6 Digits of the Card

In [None]:
pd.crosstab(df['merchant'], df['is_fraud'], normalize='index').round(3)


In [None]:
merchant_counts = pd.crosstab(df['merchant'], df['is_fraud'])
ax = sns.heatmap(merchant_counts, annot=True, fmt=".0f")
# Workaround for matplotlib versions (e.g. 3.1.1) that cut off the top and bottom heatmap rows
ax.set_ylim(len(merchant_counts) + 0.5, -0.5)

In [None]:
df['first_6_digits'].value_counts()[0:20]

Let's look at the top twenty of these and see if there appears to be any glaring relation to fraud.

In [None]:
top_20 = df['first_6_digits'].value_counts()[0:20].index
temp = df[df['first_6_digits'].isin(top_20)]
pd.crosstab(temp['first_6_digits'], temp['is_fraud'], normalize='index').round(3)


There is no major pattern immediately discernible, and such a high-cardinality variable is difficult to include in the model. In future developments it may make sense to look at variables derived from it (e.g. issuing bank).
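
One common way to make a high-cardinality variable usable, offered here as an assumption rather than something this notebook does, is to keep the most frequent values and lump the rest into a single 'other' category. The card prefixes and the cut-off `top_n` below are hypothetical.

```python
import pandas as pd

# Hypothetical card prefixes; values are illustrative only
bins = pd.Series(["411111", "411111", "411111", "550000", "550000",
                  "340000", "601100"], name="first_6_digits")

top_n = 2  # assumed cut-off, to be tuned on real data
top_values = bins.value_counts().nlargest(top_n).index

# Keep the most frequent prefixes and lump everything else into 'other'
bins_grouped = bins.where(bins.isin(top_values), "other")
print(bins_grouped.value_counts())
```

The grouped column has at most `top_n + 1` distinct values, so it can be one-hot encoded without creating thousands of columns.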

## Test yourself!!

* Using a pandas groupby (or some other way), calculate the average amount spent at each merchant. <br>
df.groupby(['one'])['two'].mean()

In [None]:
# Answer
df.groupby(['merchant'])['amount_eur'].mean()

## Amount Euro



Currency amounts are typically highly skewed, and that is what this dataset exhibits. In addition, there are a number of extreme values and transactions of amount 0.

* Most institutions have some limit on card transactions, so my suspicion is that the extreme values are not 'true outliers' but rather corrupted data. I will therefore exclude those 8 for which the amount is above 10,000 euro.
* 0 euro transactions may happen when a merchant is "validating" a card. I have left these in as I believe them to be non-corrupted, although there are merchant patterns worthy of later investigation.
* I use a log transform of amount_eur to aid visualisation and model training.
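
To illustrate why the log transform helps, the sketch below compares the sample skewness of some made-up amounts before and after applying log(1 + x). The amounts are hypothetical; `np.log1p` is used so that zero-amount transactions map to 0 rather than to minus infinity.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical amounts: mostly small, a few large, and one zero
amounts = pd.Series([0.0, 5.0, 8.0, 12.0, 20.0, 35.0, 60.0, 900.0, 4500.0])

# log1p computes log(1 + x), so a zero amount stays at zero
log_amounts = np.log1p(amounts)

skew_before = stats.skew(amounts)
skew_after = stats.skew(log_amounts)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed values are much closer to symmetric, which makes the distribution plots readable and tends to suit linear models such as logistic regression better.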



In [None]:
ax = sns.boxplot(x="is_fraud", y="amount_eur",data=df)

There appears to be nothing else strange about these extreme values, and no obvious pattern.

In [None]:
df.nlargest(5, "amount_eur", keep='first')

In [None]:
df = df[df["amount_eur"]< 10000]

The zero-amount transactions also look legitimate, except that they largely seem to come from 'Shop8' and none are fraudulent. Perhaps Shop8 is some sort of service (e.g. an app) where there is a possibility to upgrade (for a very low amount, e.g. 1 euro) and a card must be provided at sign-up? I have left them in for the time being, although with more time this should be investigated.

In [None]:
df[df["amount_eur"]== 0]['merchant'].value_counts()

In [None]:
df[df['merchant']== 'Shop8']['amount_eur'].value_counts()


In [None]:
df[df["amount_eur"]== 0]['is_fraud'].value_counts()

In [None]:
df[df["amount_eur"]== 0].head(5)

Looking at the overall distribution of the amount in euro, it is difficult to draw any conclusions due to the skew in the data, which we address below.

In [None]:

sns.distplot( df[df.is_fraud ==0]["amount_eur"], color="skyblue", label="Non Fraudulent")
sns.distplot( df[df.is_fraud ==1]["amount_eur"] , color="red", label="Fraudulent")

In [None]:
df[df['amount_eur']< 0]

In [None]:
df['amount_eur'] = np.log1p(df['amount_eur'])

In [None]:
ax = sns.distplot( df[df.is_fraud ==0]["amount_eur"], color="skyblue", label="Non Fraudulent")
ax = sns.distplot( df[df.is_fraud ==1]["amount_eur"] , color="red", label="Fraudulent")
ax.legend()

## Test yourself! (For at Home)

* Try looking at another of the features (e.g. user_country_id), using the above analysis as a template.

In [None]:
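# Restrict to numeric and boolean columns; recent pandas versions raise an error on string columns
corr = df.select_dtypes(include=['number', 'bool']).corr()

plt.figure(figsize=(20,10))
ax = sns.heatmap(corr, annot=True, fmt='.2f')
ax.set_ylim(len(corr) + 0.5, -0.5)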

#  Model Development & Performance

In [None]:
y = df["is_fraud"]

base_features = ['merchant', 'amount_eur', 'card_network']
dummy_features = ['merchant',  'card_network']

X = df[base_features]

X = pd.get_dummies(X, columns=dummy_features)

In [None]:
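# Fix the seed for reproducibility; stratify to keep the fraud rate the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)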
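
The steps listed at the top mention a baseline model against which to benchmark. One common choice, offered here as an assumption rather than the notebook's own method, is scikit-learn's `DummyClassifier`, which always predicts the majority class. The data below is synthetic and only mimics the class imbalance typical of fraud data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 5% positives (not the fraud dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# Always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

baseline_accuracy = accuracy_score(y_te, y_base)
baseline_recall = recall_score(y_te, y_base)
print(f"accuracy: {baseline_accuracy:.3f}, recall: {baseline_recall:.3f}")
```

The baseline reaches high accuracy while catching no positives at all, which is why accuracy alone is a poor yardstick on imbalanced data and why any real model should be compared on precision and recall as well.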

## Logistic Regression

In [None]:
lr = LogisticRegression()

In [None]:
lr.fit(X_train, y_train)


In [None]:
y_pred = lr.predict(X_test)

## Test yourself!(~home)

* Try training a different model on your data. Have a look at the scikit-learn docs to see which may suit this prediction task! (Do not try this at home!)

## Performance

In [None]:
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.0f')
# Workaround for matplotlib versions (e.g. 3.1.1) that cut off the top and bottom heatmap rows
ax.set_ylim(2 + 0.2, -0.2)
plt.title('Confusion Matrix', fontsize = 20) 
plt.xlabel('Predicted', fontsize = 15) 
plt.ylabel('Actual', fontsize = 15) 
plt.show()
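
For reading the heatmap: scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns. A minimal sketch on made-up labels shows how the four cells can be unpacked for a binary problem.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```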


In [None]:
test_df = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [None]:
test_df.head()

In [None]:
test_df['real_label'] = y_test
test_df['predicted_label'] = y_pred

In [None]:
test_df

## Test yourself! (At Home)

* Calculate the precision and recall of your model on the test set (test_df). You can do this using sklearn.metrics, or you can use pandas to calculate them from scratch.
* Precision = number of correctly predicted fraud cases / number of all predicted fraud cases
* Recall = number of correctly predicted fraud cases / number of all true fraud cases
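
As a starting point for this exercise, the two definitions can be written out with pandas and cross-checked against sklearn.metrics. This is a sketch on a made-up stand-in for test_df (`toy_df` is hypothetical); only the two label columns are needed.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical stand-in for test_df, with the same two label columns
toy_df = pd.DataFrame({
    "real_label":      [0, 0, 0, 0, 1, 1, 1, 0],
    "predicted_label": [0, 0, 1, 0, 1, 0, 1, 0],
})

is_fraud = toy_df["real_label"] == 1
flagged = toy_df["predicted_label"] == 1

# From scratch: count the correctly predicted fraud cases, then divide
true_pos = (is_fraud & flagged).sum()
precision = true_pos / flagged.sum()
recall = true_pos / is_fraud.sum()

# The same quantities from scikit-learn
sk_precision = precision_score(toy_df["real_label"], toy_df["predicted_label"])
sk_recall = recall_score(toy_df["real_label"], toy_df["predicted_label"])
print(precision, recall, sk_precision, sk_recall)
```

If the model never predicts fraud, the precision denominator is zero, so the from-scratch version needs a guard for that case on real data.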