# 3 - Fraud Detection - eCommerce Retailer

The dataset is composed of a set of transactions from an online retailer: for each transaction we have access to several attributes (i.e. properties), as well as a label indicating whether each transaction was fraudulent or not fraudulent.

**The objective is to use this historical data to develop a prediction algorithm capable of classifying new transactions as *fraud* or *not fraud*.**

<div class="alert alert-block alert-danger">
<b>Q: Is this supervised or unsupervised learning?</b>
</div>

<div class="alert alert-block alert-success">
We have the true label in the dataset, thus it is *supervised*
</div>

The fraudulent transactions in the original dataset were labelled according to the reports of the credit card owners: when an owner told the bank that a transaction was fraudulent and asked for the money back, the transaction was labelled as **fraud**.

The properties of a transaction are called features (a.k.a. attributes, independent variables).
For each transaction, the dataset provides five features:
- age of the account
- number of items purchased
- time of purchase
- payment method
- time since the payment method was added to the account

plus the label:
- indication of the transaction being fraudulent or not

**The goal of this practical session is to train a ML algorithm to identify fraudulent transactions from the five features.**
Ideally, the model should return a probability that the transaction is fraudulent.

<div class="alert alert-block alert-danger">
<b>Q: Why do we prefer a probability instead of a simple [1; 0] output?</b>
</div>

<div class="alert alert-block alert-success">
The probability tells us how confident the model is, and we can leverage that information depending on our needs (e.g. deciding that we want to get as many TP as possible, accepting some additional FP). 
</div>
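
As a rough illustration of this point, the sketch below trains a classifier on synthetic data (an assumption made only for this example, not the retailer dataset) and applies two different thresholds to the same predicted probabilities. The threshold values 0.8 and 0.2 are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the retailer dataset):
# one feature, and the positive class becomes more likely as it grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

model = LogisticRegression().fit(X, y)

# predict_proba returns one column per class; column 1 is P(class == 1)
proba_fraud = model.predict_proba(X)[:, 1]

# The same probabilities support different operating points
flagged_strict = proba_fraud >= 0.8   # fewer alerts, fewer false positives
flagged_lenient = proba_fraud >= 0.2  # more alerts, fewer missed positives

print("flagged at 0.8:", flagged_strict.sum())
print("flagged at 0.2:", flagged_lenient.sum())
```

A hard 0/1 output fixes the threshold once and for all, while a probability lets us move it later without retraining.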

One column has a non-numeric type.
It is a "categorical variable": its values indicate which category each item belongs to.

Many (not all) ML algorithms require features to be numeric, thus we have to convert categorical features to numeric features:
- create one binary feature per category, so that each row has exactly one of them set to 1
- this is called **one-hot encoding**
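
A minimal sketch of the idea, on a hypothetical toy column (the column name and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"payment": ["card", "paypal", "card", "voucher"]})

# dtype=int asks for 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, columns=["payment"], dtype=int)
print(encoded)

# Every row has exactly one indicator column set to 1
print(encoded.sum(axis=1).tolist())
```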

---

## Index

- [3.1](#3.1): preliminaries - introduction to *pandas*
- [3.2](#3.2): data exploration
- [3.3](#3.3): data preparation
- [3.4](#3.4): training the models

---
# 3.1

## Preliminaries: introduction to *Pandas*

You can find all the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.25/); and remember, Google is your friend.

In [1]:
# this lets you refer to pandas with the shortened name, which is convenient if you call "pandas" many times
import pandas as pd

An important data structure is the `DataFrame`: it is a tabular data structure with labeled axes (rows and columns).
You can think of it as a table with named columns.

In [2]:
# Let's define a small dataframe
example_df = pd.DataFrame({'column_a':[0,1,2,2,3,4], 'column_b':['a','b','c','d','e','f']})

# and display it
example_df

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
3,2,d
4,3,e
5,4,f


An alternative command to display pandas DataFrames is:

In [3]:
display(example_df)

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
3,2,d
4,3,e
5,4,f


There is no (theoretical) limit to the number of rows and the number of columns in the DataFrame.

Useful attributes and methods of DataFrames (examples are shown below):
- **index**: the index (row labels) of the DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html)
- **values**: returns a NumPy representation of the DataFrame (the data only, without the axis labels) [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html)
- **columns**: returns the column labels of the pandas dataframe. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html)
- **select columns** by putting a column name, or a list of column names, between square brackets
- **slicing**, similar to what you do with lists [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
- **drop**: drop specified labels from rows or columns [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- **drop_duplicates**: return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes are ignored. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)
- **sort_values**: Sort by the values along either axis. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
- **unique**: Return unique values. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- **get_dummies**: Convert categorical variable into dummy/indicator variables. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- **sample**: Return a random sample of items from an axis of object. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
- **groupby**: A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

Useful functions of Pandas:
- **read_csv**: Read a comma-separated values (csv) file into DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

There are many others that can be useful in some cases but, for now, these should be the only ones you need.

---

Examples with the `example_df` created above

In [4]:
# index
len(example_df.index)

6

In [5]:
example_df.index

RangeIndex(start=0, stop=6, step=1)

In [6]:
# columns
example_df.columns

Index(['column_a', 'column_b'], dtype='object')

In [7]:
# refer to one column
example_df['column_a']

0    0
1    1
2    2
3    2
4    3
5    4
Name: column_a, dtype: int64

In [8]:
# refer to several columns
example_df[['column_a', 'column_b']]  # careful, there are double square brackets

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
3,2,d
4,3,e
5,4,f


In [9]:
# filter dataframe depending on some column's values
example_df[example_df['column_a'] == 2]

Unnamed: 0,column_a,column_b
2,2,c
3,2,d


In [10]:
# keep rows matching either condition: | is the element-wise OR, and each condition needs its own parentheses
example_df[(example_df['column_a'] == 2)|(example_df['column_a'] == 1)]

Unnamed: 0,column_a,column_b
1,1,b
2,2,c
3,2,d


In [12]:
# keep rows matching both conditions: & is the element-wise AND
example_df[(example_df['column_a'] == 2) & (example_df['column_b'] == 'c')]

Unnamed: 0,column_a,column_b
2,2,c


In [13]:
# keep rows whose column value belongs to a list of allowed values
tmp_list = [1, 2]
example_df[example_df['column_a'].isin(tmp_list)]

Unnamed: 0,column_a,column_b
1,1,b
2,2,c
3,2,d


In [14]:
# slicing
example_df[1:3]

Unnamed: 0,column_a,column_b
1,1,b
2,2,c


In [15]:
# drop
example_df.drop('column_a', axis=1)

Unnamed: 0,column_b
0,a
1,b
2,c
3,d
4,e
5,f


In [16]:
# drop_duplicates
example_df.drop_duplicates('column_a')

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
4,3,e
5,4,f


In [17]:
# sort_values
example_df.sort_values('column_a', ascending=True)

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
3,2,d
4,3,e
5,4,f


In [18]:
# sort_values
example_df.sort_values(['column_a', 'column_b'], ascending=[False, True])

Unnamed: 0,column_a,column_b
5,4,f
4,3,e
2,2,c
3,2,d
1,1,b
0,0,a


In [19]:
# unique
example_df['column_a'].unique()

array([0, 1, 2, 3, 4])

In [20]:
example_df

Unnamed: 0,column_a,column_b
0,0,a
1,1,b
2,2,c
3,2,d
4,3,e
5,4,f


In [21]:
# get_dummies
pd.get_dummies(example_df, columns=['column_b'])

Unnamed: 0,column_a,column_b_a,column_b_b,column_b_c,column_b_d,column_b_e,column_b_f
0,0,1,0,0,0,0,0
1,1,0,1,0,0,0,0
2,2,0,0,1,0,0,0
3,2,0,0,0,1,0,0
4,3,0,0,0,0,1,0
5,4,0,0,0,0,0,1
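
A few entries of the list above (`values`, `groupby`, `sample`) have no example yet. A short sketch on the same small DataFrame follows; the `random_state=0` seed is an arbitrary choice that only makes the random draw repeatable.

```python
import pandas as pd

# Same small DataFrame as in the examples above
example_df = pd.DataFrame({'column_a': [0, 1, 2, 2, 3, 4],
                           'column_b': ['a', 'b', 'c', 'd', 'e', 'f']})

# values: the underlying data as a NumPy array (one row per DataFrame row)
print(example_df.values.shape)

# groupby + size: number of rows for each distinct value of column_a
counts = example_df.groupby('column_a').size()
print(counts)

# sample: n rows picked at random; random_state makes the draw repeatable
print(example_df.sample(n=2, random_state=0))
```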


---

## Why do we need a specific data structure for managing large amounts of data?

Let's assume we have a large number of values, and we want to get the maximum.
Let's try to do that with a list and using a Pandas DataFrame, and look at the time required in both cases.

In [22]:
import pandas as pd
import numpy as np
import time

# This creates a list with random values in it; specifically, this line creates a list containing 100M values
tmp_list = list(np.random.randn(10**8))

Let's find its max value, with three different approaches.

#### 1) With a `for` loop

In [23]:
# start the timer
start_time = time.time()

# find the max value
max_value = tmp_list[0]
for x in tmp_list[1:]:
    if x > max_value:
        max_value = x
        
# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 6.46 seconds


#### 2) Using the `max` built-in

In [24]:
# start the timer
start_time = time.time()

# find the max value
max_value = max(tmp_list)

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 2.12 seconds


#### 3) Using a DataFrame

In [25]:
# define a dataframe made of only one column, containing the list defined above
tmp_df = pd.DataFrame({'A':tmp_list})

In [29]:
display(tmp_df[:2])
display(tmp_df[-2:])

Unnamed: 0,A
0,-1.477562
1,-0.61026


Unnamed: 0,A
99999998,-0.677898
99999999,0.382182


In [27]:
# start the timer
start_time = time.time()

# find the max value
max_a = tmp_df['A'].max()

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 0.66 seconds
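
Much of the difference comes from where the work happens: a pandas column is backed by a NumPy array, so `.max()` runs in compiled code over a contiguous block of numbers, while a Python list is traversed one Python object at a time. The sketch below repeats the comparison on a smaller input (1M values, an arbitrary size chosen so that it runs quickly) using `timeit`; the exact timings depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# 1M values instead of 100M, so that the cell runs in a moment
values = np.random.randn(10**6)
as_list = list(values)
as_series = pd.Series(values)

# Time each approach over a few repetitions
t_builtin = timeit.timeit(lambda: max(as_list), number=5)
t_pandas = timeit.timeit(lambda: as_series.max(), number=5)
print("built-in max on a list: %.4f seconds" % t_builtin)
print("Series.max():           %.4f seconds" % t_pandas)

# Both approaches agree on the answer
print(max(as_list) == as_series.max())
```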


---

## Let's move on to fraud detection

---

# 3.2
## data exploration

<div class="alert alert-block alert-warning">
<b>Warning:</b>

As in the previous session, please be careful with how you define the path and with the working directory of the notebook.
</div>

In [30]:
# Read in the data from the CSV file
df = pd.read_csv('datasets/payment_fraud.csv')

One of the most important things to do when working with large quantities of data is to have a look at the data before starting to play with it.

In [31]:
# Let's have a look at what the data looks like
df.sample(5)

Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethod,paymentMethodAgeDays,label
13343,861,1,4.876771,paypal,0.002083,0
37487,27,1,5.017904,creditcard,0.0,0
23252,2000,1,4.836982,creditcard,5.211806,0
28004,751,1,4.461622,paypal,105.992361,0
7876,123,1,4.52458,paypal,122.924306,0


<div class="alert alert-block alert-danger">
<b>Q: How many entries in the dataframe?</b>
</div>

In [32]:
len(df.index)

39221

<div class="alert alert-block alert-danger">
<b>Q: What are the columns of the `payment_fraud` dataframe?</b>
</div>

In [33]:
df.columns

Index(['accountAgeDays', 'numItems', 'localTime', 'paymentMethod',
       'paymentMethodAgeDays', 'label'],
      dtype='object')

<div class="alert alert-block alert-danger">
<b>Q: Which column tells you whether a payment is a fraud or not?</b>
</div>

<div class="alert alert-block alert-success">
'label'
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many different values of `paymentMethod` are in the dataframe?</b>
</div>

In [34]:
len(df['paymentMethod'].unique())

3

Another method:

In [35]:
df['paymentMethod'].nunique()

3

<div class="alert alert-block alert-danger">
<b>Q: What are the possible values?</b>
</div>

In [36]:
df['paymentMethod'].unique()

array(['paypal', 'storecredit', 'creditcard'], dtype=object)

<div class="alert alert-block alert-danger">
<b>Q: What is the maximum number of items purchased in a single transaction?</b>
</div>

In [37]:
df['numItems'].max()

29

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do, in your opinion?</b>
</div>

In [38]:
df.groupby('paymentMethod').size().reset_index()

Unnamed: 0,paymentMethod,0
0,creditcard,28004
1,paypal,9303
2,storecredit,1914


<div class="alert alert-block alert-success">
Counts the number of occurrences of each possible value of 'paymentMethod'
</div>

Alternative command to do the same thing.

In [39]:
df['paymentMethod'].value_counts()

creditcard     28004
paypal          9303
storecredit     1914
Name: paymentMethod, dtype: int64

<div class="alert alert-block alert-danger">
<b>Q: Can you see any differences between the outcome of the previous two approaches?</b>
</div>

<div class="alert alert-block alert-success">
<code> df.groupby('paymentMethod').size().reset_index() </code> returns a DataFrame object, while 
<code> df['paymentMethod'].value_counts() </code> returns a Series, sorted by count in descending order. Also, if you apply <code> .reset_index()</code> to the result of value_counts, you obtain a DataFrame whose column names differ from those of the groupby version
</div>

In [40]:
df['paymentMethod'].value_counts().reset_index()

Unnamed: 0,index,paymentMethod
0,creditcard,28004
1,paypal,9303
2,storecredit,1914


<div class="alert alert-block alert-danger">
<b>Q: Try to show the number of transactions for each value of `numItems`.</b>
</div>

In [42]:
df.groupby('numItems').size().reset_index()

Unnamed: 0,numItems,0
0,1,37398
1,2,1348
2,3,164
3,4,42
4,5,168
5,6,15
6,7,5
7,8,5
8,9,1
9,10,71


---

# 3.3
## data preparation

As we said before, we cannot train our model directly on the input DataFrame, as it contains a categorical column.
We have to encode it with one-hot encoding.
The `pd.get_dummies` function does just that.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the 'paymentMethod' attribute with one hot encoding.</b>
</div>

In [43]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = pd.get_dummies(df, columns=['paymentMethod'])

<div class="alert alert-block alert-danger">
<b>Q: How is the data different after performing one hot encoding?</b>
</div>

In [44]:
df_one_hot.sample(5)

Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethodAgeDays,label,paymentMethod_creditcard,paymentMethod_paypal,paymentMethod_storecredit
7464,67,1,4.748314,0.0,0,0,1,0
23876,419,1,4.745402,275.886806,0,1,0,0
22055,2000,1,4.745402,0.0,0,0,0,1
25159,3,1,3.954522,2.398611,0,1,0,0
38572,1477,1,4.745402,64.592361,0,1,0,0


<div class="alert alert-block alert-success">
It generated 3 columns from the paymentMethod column (one for each possible value) and removed the original paymentMethod column
</div>

Now that we have a dataset that can be given as input to our model, we can perform the split into training and test set.

<div class="alert alert-block alert-danger">
<b>Q: Perform the split into training and testing set (keep a 70:30 ratio), completing the cell below.</b>
</div>

Please do **not** rename the variables, as the rest of the notebook will not run properly if you do that. 

Remember:
- X_train and X_test have to contain all the attributes
- the label must appear only in y_train and y_test

In [46]:
from sklearn.model_selection import train_test_split

In [47]:
X_train, X_test, y_train, y_test = train_test_split(
    df_one_hot.drop('label', axis=1),
    df_one_hot['label'],
    test_size=0.3,
)

In [52]:
X_train[:2]

Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethodAgeDays,paymentMethod_creditcard,paymentMethod_paypal,paymentMethod_storecredit
5345,60,1,4.748314,0.0,1,0,0
7110,111,1,4.742303,110.43125,1,0,0


<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the training data?</b>
</div>

In [48]:
len(X_train.index)

27454

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the test data?</b>
</div>

In [49]:
len(X_test.index)

11767

<div class="alert alert-block alert-danger">
<b>Q: How many features are in the feature matrices (i.e. X_train and X_test)?</b>
</div>

In [50]:
len(X_train.columns)

7

In [51]:
len(X_test.columns)

7
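
The split above is purely random, so the number of frauds that end up in each part changes from run to run. A possible variant, sketched here on hypothetical toy data with different variable names (so that it does not overwrite `X_train` and friends), uses the `stratify` and `random_state` parameters of `train_test_split`. The seed 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 980 negatives, 20 positives
toy_X = pd.DataFrame({"feature": np.arange(1000)})
toy_y = pd.Series([0] * 980 + [1] * 20)

# stratify keeps the class proportions roughly equal in both parts;
# random_state makes the split reproducible across runs
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=42,
)

print("positives in train:", ys_train.sum())
print("positives in test: ", ys_test.sum())
```

With rare positives, stratifying avoids the unlucky case where the test set contains almost no frauds to evaluate on.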

---

# 3.4

## Training and evaluating the models

### Logistic Regression

In [53]:
from sklearn.linear_model import LogisticRegression

# Initialize and train classifier model
clf = LogisticRegression(max_iter=20).fit(X_train, y_train)

# Make predictions on test set
y_pred = clf.predict(X_test)



In [54]:
# import methods for measuring accuracy, precision and recall
from sklearn.metrics import (
    accuracy_score, 
    precision_score,
    recall_score,
)

# Compare test set predictions with ground truth labels
# (scikit-learn expects the ground truth first, then the predictions)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 0.999830033143537
precision 0.9875776397515528
recall 1.0


<div class="alert alert-block alert-danger">
<b>Q: What does a recall value of 1.0 mean? (e.g. it misses no positives, it raises no false alarms, etc.) </b>
</div>

<div class="alert alert-block alert-success">
You catch all the positives: every actual fraud in the test set is flagged (no false negatives).
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does a precision value of 1.0 mean?</b>
</div>

<div class="alert alert-block alert-success">
You didn't raise any false positives: every transaction flagged as fraud really is a fraud.
</div>
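
To make the two definitions concrete, the sketch below computes precision and recall by hand from the counts of true positives (TP), false positives (FP) and false negatives (FN), and compares them with scikit-learn. The two label lists are hypothetical, made up for this example.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for eight transactions (1 = fraud)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0]

# Count the three outcomes that matter for these two metrics
tp = sum(t == 1 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true_toy, y_pred_toy))

precision_manual = tp / (tp + fp)  # share of raised alerts that are real frauds
recall_manual = tp / (tp + fn)     # share of real frauds that raised an alert

# scikit-learn signature: (y_true, y_pred)
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
```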

### Gaussian Naive Bayes

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to train and test a Gaussian Naive Bayes model.</b>
</div>

Hint: here is the documentation:
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [55]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB().fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [56]:
# ground truth first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 1.0
precision 1.0
recall 1.0


### Linear SVM

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to train and test a Linear SVM.</b>
</div>

Hint: here is the documentation:
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [57]:
from sklearn.svm import LinearSVC

clf = LinearSVC().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# ground truth first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 1.0
precision 1.0
recall 1.0


### Do the results look "strange"? Try to understand why...

<div class="alert alert-block alert-success">
Yes, indeed. It's the classic scenario of "too good to be true".
</div>

Let's try to analyze the data and understand where these results come from...

Let's start by looking at the test labels... Maybe y_test contains only 1 or only 0.

In [58]:
y_test.unique()

array([0, 1])

Nope, that is not the reason. Let's check the training labels.

In [60]:
y_train.unique()

array([0, 1])

That is okay as well.

Let's see how many 1 and how many 0 labels we have in the original dataset

In [62]:
df_one_hot.groupby('label').size()

label
0    38661
1      560
dtype: int64

It is very unbalanced, but that is not what is causing the problem here.
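
To put the imbalance in numbers, the short computation below uses the two class counts printed above. It shows what accuracy a trivial "model" that always answers *not fraud* would reach, which is a useful reference point when reading accuracy values on this dataset.

```python
# Class counts taken from the groupby output above
n_not_fraud = 38661
n_fraud = 560
n_total = n_not_fraud + n_fraud

fraud_rate = n_fraud / n_total

# A "model" that always predicts 0 is right on every non-fraud transaction
baseline_accuracy = n_not_fraud / n_total

print("fraud rate:        %.4f" % fraud_rate)
print("baseline accuracy: %.4f" % baseline_accuracy)
```

Such a model would have a recall of 0, which is why accuracy alone is not enough here and precision and recall are reported as well.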

Let's look for suspicious correlations between the labels and the features.

In [63]:
df_one_hot.groupby(['label', 'numItems']).size().reset_index()

Unnamed: 0,label,numItems,0
0,0,1,36944
1,0,2,1266
2,0,3,148
3,0,4,39
4,0,5,164
5,0,6,15
6,0,7,5
7,0,8,5
8,0,9,1
9,0,10,70


Considering 'label' and 'numItems' everything is okay.

In [64]:
df_one_hot.groupby(['label', 'accountAgeDays']).size().reset_index()

Unnamed: 0,label,accountAgeDays,0
0,0,2,1243
1,0,3,723
2,0,4,477
3,0,5,388
4,0,6,315
5,0,7,298
6,0,8,205
7,0,9,206
8,0,10,153
9,0,11,166


Considering 'label' and 'accountAgeDays' you can find the issue! All the entries with label==1 have accountAgeDays==1, and there are no entries with label==0 that have accountAgeDays==1.

Thus the model learned that all the entries with accountAgeDays==1 are fraudulent.

While this worked in our case (as we have seen from the evaluation metrics), it probably would not work in a real-world application, where a new account is not necessarily fraudulent and frauds are not limited to new accounts.
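
This kind of check can be automated. The sketch below uses hypothetical toy data that reproduces the pattern, and a hypothetical helper `separates_perfectly` (not part of pandas or scikit-learn) that reports whether every value of a feature occurs with a single label only.

```python
import pandas as pd

# Hypothetical toy data reproducing the pattern: every fraud has
# account_age == 1 and no legitimate transaction does
toy = pd.DataFrame({
    "account_age": [1, 1, 1, 5, 9, 30, 200, 7],
    "n_items":     [1, 2, 1, 1, 2, 1, 1, 3],
    "label":       [1, 1, 1, 0, 0, 0, 0, 0],
})

def separates_perfectly(frame, feature, target="label"):
    """True if no value of `feature` occurs with more than one target value."""
    return bool((frame.groupby(feature)[target].nunique() == 1).all())

for col in ["account_age", "n_items"]:
    print(col, separates_perfectly(toy, col))
```

The check is only meaningful for features with repeated values: a continuous feature in which every value is unique passes it trivially.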

## Bonus part, let's quickly retrain the model ignoring `accountAgeDays`

In [66]:
df_one_hot = df_one_hot.drop('accountAgeDays', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    df_one_hot.drop('label', axis=1), df_one_hot['label'], test_size=0.3,
)

In [67]:
# Logistic Regression

clf = LogisticRegression(max_iter=20).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# ground truth first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 0.9848729497747939
precision 0.3333333333333333
recall 0.005649717514124294




In [68]:
# GaussianNB

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# ground truth first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 0.49358375116852216
precision 0.02684124386252046
recall 0.9265536723163842


As expected, they perform much worse than before!

---