# 3 - Fraud Detection - eCommerce Retailer

This is the same example that was introduced during the lecture on classification (you can find it in the 2nd set of slides).

The dataset is composed of a set of transactions from an online retailer: for each transaction we have access to several attributes (i.e. properties), as well as a label indicating whether each transaction was fraudulent or not.

**The objective is to use this historical data to develop a prediction algorithm capable of classifying new transactions as *fraud* or *not fraud*.**



<div class="alert alert-block alert-danger">
<b>Q: Is this supervised or unsupervised learning?</b>
</div>

<div class="alert alert-block alert-success">
Supervised: we know the ground truth (i.e. whether each transaction is a fraud).
</div>

The fraudulent transactions in the original dataset were labelled according to the reports of the credit card owners: when they let the bank know that a transaction was fraudulent and requested their money back, the transaction was labelled as **fraud**.

The properties of a transaction are called features (a.k.a. attributes, independent variables).
For each transaction, the dataset contains five features plus the label:
- age of the account
- number of items purchased
- time of purchase
- payment method
- time since the payment method was added to the account
- the label: an indication of whether the transaction was fraudulent or not

**The goal of this practical session is to train a ML algorithm to identify fraudulent transactions from the five features.**
Ideally, the model should return a probability that the transaction is fraudulent.

<div class="alert alert-block alert-danger">
<b>Q: Why do we prefer a probability instead of a simple [1; 0] output?</b>
</div>

<div class="alert alert-block alert-success">
A probability tells us how confident the model is when it flags a transaction as fraudulent.
</div>
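
To make this concrete, the sketch below applies two different decision thresholds to the same set of predicted probabilities. The probability values and the thresholds are made up for illustration; they are not produced by any model in this notebook.

```python
# Hypothetical fraud probabilities for five transactions (illustrative values only)
fraud_probabilities = [0.02, 0.35, 0.51, 0.80, 0.97]

def classify(probabilities, threshold):
    """Return 1 (fraud) when the probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# A cautious rule might flag anything at or above 30% for manual review...
cautious = classify(fraud_probabilities, threshold=0.3)
# ...while a stricter rule might only block transactions at or above 90%
strict = classify(fraud_probabilities, threshold=0.9)

print(cautious)
print(strict)
```

With a plain 0/1 output this choice would already have been made by the model; with a probability, the threshold can be picked to suit the application.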

One column has a non-numeric type.
It is a *categorical variable*: its values indicate the category each item belongs to.

Many (not all) ML algorithms require features to be numeric, thus we have to convert categorical features to numeric features:
- create one binary feature per category, so that each row has exactly one of them set to 1
- this is called **one-hot encoding**
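
As a rough illustration of the idea, the sketch below one-hot encodes a short list by hand in plain Python. The list of payment methods is a hypothetical example, and this is only meant to show the principle, not how Pandas implements it.

```python
# Hypothetical payment methods for four transactions
payment_methods = ["creditcard", "paypal", "creditcard", "storecredit"]

# One binary column per distinct category (sorted for a stable column order)
categories = sorted(set(payment_methods))

# For each row, put 1 in the column matching its category and 0 elsewhere
one_hot_rows = [
    [1 if method == category else 0 for category in categories]
    for method in payment_methods
]

print(categories)
for row in one_hot_rows:
    print(row)
```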

---

## Pandas

You can find all the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.25/); and, remember, Google is your friend (as are all the other search engines).

In [1]:
# this lets you refer to pandas with the shortened name, which is convenient if you call "pandas" many times
import pandas as pd

The most important data structure is the `DataFrame`: it is a tabular data structure with labeled axes (rows and columns).
You can think of it as a table with named columns.

In [2]:
# Let's define a small dataframe
example_df = pd.DataFrame({'column_a':[0,1,2,2,3,4], 'column_b':['a','b','c','d','e','f']})

# and display it
example_df

| | column_a | column_b |
|---|---|---|
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | c |
| 3 | 2 | d |
| 4 | 3 | e |
| 5 | 4 | f |


There is no (theoretical) limit to the number of rows and the number of columns in the DataFrame.

Useful attributes and methods of DataFrames (examples are shown below):
- **index**: the index (row labels) of the DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html)
- **values**: returns a NumPy representation of the DataFrame (the data only, without the axis labels). [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html)
- **columns**: returns the column labels of the pandas dataframe. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html)
- **refer to one or more columns** by putting the column name (or a list of column names) between square brackets
- **slicing**, similar to what you do with lists [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
- **drop**: drop specified labels from rows or columns [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- **drop_duplicates**: return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes are ignored. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)
- **sort_values**: Sort by the values along either axis. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
- **unique**: Return unique values. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- **get_dummies**: Convert categorical variable into dummy/indicator variables. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- **sample**: Return a random sample of items from an axis of object. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
- **groupby**: A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

Useful functions of Pandas:
- **read_csv**: Read a comma-separated values (csv) file into DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

There are tons of other ones which might be useful in some cases but, as of now, these should be the only ones you need.

---

Examples with the `example_df` created above

In [3]:
# index
len(example_df.index)

6

In [4]:
# columns
example_df.columns

Index(['column_a', 'column_b'], dtype='object')

In [5]:
# refer to one column
example_df['column_a']

0    0
1    1
2    2
3    2
4    3
5    4
Name: column_a, dtype: int64

In [6]:
# slicing
example_df[1:3]

| | column_a | column_b |
|---|---|---|
| 1 | 1 | b |
| 2 | 2 | c |


In [7]:
# drop
example_df.drop('column_a', axis=1)

| | column_b |
|---|---|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |
| 5 | f |


In [8]:
# drop_duplicates
example_df.drop_duplicates('column_a')

| | column_a | column_b |
|---|---|---|
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | c |
| 4 | 3 | e |
| 5 | 4 | f |


In [9]:
# sort_values
example_df.sort_values('column_a', ascending=False)

| | column_a | column_b |
|---|---|---|
| 5 | 4 | f |
| 4 | 3 | e |
| 2 | 2 | c |
| 3 | 2 | d |
| 1 | 1 | b |
| 0 | 0 | a |


In [10]:
# unique
example_df['column_a'].unique()

array([0, 1, 2, 3, 4])

In [11]:
# get_dummies
pd.get_dummies(example_df, columns=['column_b'])

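| | column_a | column_b_a | column_b_b | column_b_c | column_b_d | column_b_e | column_b_f |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |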
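
The list of useful attributes and methods also mentions `values`, `sample` and `groupby`, which the examples so far do not exercise. Here is a minimal sketch that re-creates the same small DataFrame so that the cell is self-contained; `random_state=0` is an arbitrary choice that makes the random sample repeatable.

```python
import pandas as pd

example_df = pd.DataFrame({'column_a': [0, 1, 2, 2, 3, 4],
                           'column_b': ['a', 'b', 'c', 'd', 'e', 'f']})

# values: the underlying data as a NumPy array (one row per DataFrame row)
as_array = example_df.values
print(as_array.shape)

# sample: pick 2 random rows; random_state makes the draw repeatable
print(example_df.sample(2, random_state=0))

# groupby: count how many rows share each value of column_a
counts = example_df.groupby('column_a').size()
print(counts)
```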


---

## Why do we need a specific data structure for managing large amounts of data?

Let's assume we have a large number of values and we want to find the largest one.
Let's try to do that with a plain list and with a Pandas DataFrame, and compare the time required in each case.

In [12]:
import pandas as pd
import numpy as np
import time

# This creates a list with random values in it; specifically, this line creates a list containing 100M values
tmp_list = list(np.random.randn(10**8))

Let's find its max value, with three different approaches.

#### 1) With a `for` loop

In [13]:
# start the timer
start_time = time.time()

# find the max value
max_value = tmp_list[0]
for x in tmp_list[1:]:
    if x > max_value:
        max_value = x
        
# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 7.01 seconds


#### 2) Using the `max` built-in

In [14]:
# start the timer
start_time = time.time()

# find the max value
max_value = max(tmp_list)

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 2.70 seconds


#### 3) Using a DataFrame

In [15]:
# define a dataframe made of only one column, containing the list defined above
tmp_df = pd.DataFrame({'A':tmp_list})

In [16]:
# start the timer
start_time = time.time()

# find the max value
max_a = tmp_df['A'].max()

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

Elapsed time 0.68 seconds
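
The gap largely comes from where the iteration happens: a Python list holds separate objects that must be visited one at a time, whereas a numeric DataFrame column is backed by a NumPy array whose `max` runs in compiled code. The sketch below repeats the comparison directly on a NumPy array, using a smaller size (1M values, an arbitrary choice) so that it runs quickly; the exact timings will vary from machine to machine.

```python
import time
import numpy as np

# A smaller array than above (1M values) so the cell runs quickly
values = np.random.randn(10**6)
values_list = list(values)

# The built-in max walks over the list elements one Python object at a time
start = time.perf_counter()
max_from_list = max(values_list)
list_seconds = time.perf_counter() - start

# The array's max method runs inside NumPy's compiled code
start = time.perf_counter()
max_from_array = values.max()
array_seconds = time.perf_counter() - start

print("list: %.4f s, array: %.4f s" % (list_seconds, array_seconds))
```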


---

## Let's (finally) move on to fraud detection!

In [17]:
# Read in the data from the CSV file
df = pd.read_csv('datasets/payment_fraud.csv')

One of the most important things to do when working with large quantities of data is to have a look at the data before starting to play with it.

In [18]:
# Let's have a look at what the data looks like
df.sample(5)

| | accountAgeDays | numItems | localTime | paymentMethod | paymentMethodAgeDays | label |
|---|---|---|---|---|---|---|
| 11665 | 851 | 1 | 5.017904 | creditcard | 0.009722 | 0 |
| 8543 | 2000 | 1 | 4.057414 | creditcard | 248.020139 | 0 |
| 27289 | 112 | 1 | 4.965339 | paypal | 0.0 | 0 |
| 14778 | 1307 | 1 | 4.895263 | creditcard | 0.0 | 0 |
| 28025 | 427 | 1 | 4.895263 | creditcard | 0.0 | 0 |


<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the dataframe?</b>
</div>

In [19]:
len(df.index)

39221

<div class="alert alert-block alert-danger">
<b>Q: What are the columns of the `payment_fraud` dataframe?</b>
</div>

In [20]:
df.columns

Index(['accountAgeDays', 'numItems', 'localTime', 'paymentMethod',
       'paymentMethodAgeDays', 'label'],
      dtype='object')

<div class="alert alert-block alert-danger">
<b>Q: Which column tells you whether a payment is a fraud or not?</b>
</div>

- `label`

<div class="alert alert-block alert-danger">
<b>Q: How many different values of `paymentMethod` are in the dataframe?</b>
</div>

In [21]:
len(df['paymentMethod'].unique())

3

<div class="alert alert-block alert-danger">
<b>Q: What are the possible values?</b>
</div>

In [22]:
df['paymentMethod'].unique()

array(['paypal', 'storecredit', 'creditcard'], dtype=object)

<div class="alert alert-block alert-danger">
<b>Q: What is the maximum number of items purchased in a single transaction?</b>
</div>

In [23]:
df['numItems'].max()

29

<div class="alert alert-block alert-danger">
<b>Q: What does the cell below do, in your opinion?</b>
</div>

In [24]:
df.groupby('paymentMethod').size().reset_index()

| | paymentMethod | 0 |
|---|---|---|
| 0 | creditcard | 28004 |
| 1 | paypal | 9303 |
| 2 | storecredit | 1914 |


<div class="alert alert-block alert-success">
It counts how many transactions are stored for each payment method.
</div>
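
For counting rows per category, `value_counts` on a single column is a common alternative to `groupby(...).size()`. A minimal sketch on a tiny, made-up DataFrame (`toy_df` is a hypothetical stand-in, not the fraud dataset):

```python
import pandas as pd

# Tiny hypothetical sample of payment methods
toy_df = pd.DataFrame({'paymentMethod': ['creditcard', 'paypal', 'creditcard',
                                         'storecredit', 'creditcard']})

# Same pattern as the cell above: one entry per group, holding the group size
by_group = toy_df.groupby('paymentMethod').size()

# value_counts gives the same counts, sorted from most to least frequent
by_value = toy_df['paymentMethod'].value_counts()

print(by_group)
print(by_value)
```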

<div class="alert alert-block alert-danger">
<b>Q: Try to show the number of transactions for each value of `numItems`.</b>
</div>

In [25]:
df.groupby('numItems').size().reset_index()

| | numItems | 0 |
|---|---|---|
| 0 | 1 | 37398 |
| 1 | 2 | 1348 |
| 2 | 3 | 164 |
| 3 | 4 | 42 |
| 4 | 5 | 168 |
| 5 | 6 | 15 |
| 6 | 7 | 5 |
| 7 | 8 | 5 |
| 8 | 9 | 1 |
| 9 | 10 | 71 |


---

As we said before, we cannot train our model directly on the input DataFrame, as it contains a categorical column.
We have to encode it with one-hot encoding.
The `get_dummies` function can be used for doing just that.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the 'paymentMethod' attribute with one hot encoding.</b>
</div>

In [26]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = pd.get_dummies(df, columns=['paymentMethod'])

<div class="alert alert-block alert-danger">
<b>Q: How is the data different after performing one hot encoding?</b>
</div>

In [27]:
df_one_hot.sample(3)

| | accountAgeDays | numItems | localTime | paymentMethodAgeDays | label | paymentMethod_creditcard | paymentMethod_paypal | paymentMethod_storecredit |
|---|---|---|---|---|---|---|---|---|
| 5669 | 2000 | 1 | 4.895263 | 0.0 | 0 | 1 | 0 | 0 |
| 11450 | 485 | 1 | 4.748314 | 0.0 | 0 | 1 | 0 | 0 |
| 14551 | 173 | 2 | 2.94894 | 171.6375 | 0 | 0 | 1 | 0 |


Now that we have a dataset that can be given as input to our model, we can split it into a training set and a test set.

<div class="alert alert-block alert-danger">
<b>Q: Perform the split into training and test sets (keep a 50:50 ratio), completing the cell below.</b>
</div>

Please do **not** rename the variables, as the rest of the notebook will not run properly if you do that. 

Remember:
- X_train and X_test have to contain all the features (and only the features)
- the label can appear only in y_train and y_test

In [28]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_one_hot.drop('label', axis=1), df_one_hot['label'], test_size=0.50
)
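
As a side note, the split is random, so the exact rows in each set change from run to run unless a seed is fixed. The sketch below shows the same call on a small made-up DataFrame (`toy` and its columns are hypothetical), with `test_size=0.3` and `random_state=42` picked arbitrarily for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_one_hot: 10 rows, two features and a label (made-up values)
toy = pd.DataFrame({
    'feature_a': range(10),
    'feature_b': [x * 0.5 for x in range(10)],
    'label':     [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

X = toy.drop('label', axis=1)   # features only
y = toy['label']                # label only

# test_size=0.3 keeps 30% of the rows for testing; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))
print('label' in X_tr.columns)
```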

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the training data?</b>
</div>

In [29]:
len(X_train)

19610

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the test data?</b>
</div>

In [30]:
len(X_test)

19611

<div class="alert alert-block alert-danger">
<b>Q: How many features are in the feature arrays (i.e. X_train and X_test)?</b>
</div>

In [31]:
len(X_train.columns)

7

---

## Training of the model

### Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression

# Initialize and train classifier model
clf = LogisticRegression(max_iter=20).fit(X_train, y_train)

# Make predictions on test set
y_pred = clf.predict(X_test)
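
Since the stated goal is to obtain a probability, it is worth knowing that scikit-learn's `LogisticRegression` also exposes `predict_proba`. The sketch below uses synthetic one-feature data rather than the fraud dataset; `toy_clf`, `X_toy` and `y_toy` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data (not the fraud dataset): larger values are more likely class 1
rng = np.random.RandomState(0)
X_toy = rng.randn(200, 1)
y_toy = (X_toy[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

toy_clf = LogisticRegression().fit(X_toy, y_toy)

# predict_proba returns one column per class; column 1 is P(class == 1)
new_points = np.array([[-2.0], [0.0], [2.0]])
proba = toy_clf.predict_proba(new_points)[:, 1]
print(proba)

# In this binary case, predict amounts to thresholding that probability at 0.5
print(toy_clf.predict(new_points))
```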



In [33]:
# import methods for measuring accuracy, precision and recall
from sklearn.metrics import (
    accuracy_score, 
    precision_score,
    recall_score,
)

# Compare test set predictions with ground truth labels
# (scikit-learn metrics expect the ground truth first, then the predictions)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 1.0
precision 1.0
recall 1.0


<div class="alert alert-block alert-danger">
<b>Q: What does a recall value close to 1.0 mean? (e.g. it misses almost no positives, it raises almost no warnings, etc.) </b>
</div>

<div class="alert alert-block alert-success">
If the recall value is 1, it means that the model detected all the fraudulent transactions.
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does a precision value close to 1.0 mean?</b>
</div>

<div class="alert alert-block alert-success">
If the precision value is 1, it means that all the transactions flagged as fraudulent by the model were actually fraudulent.
</div>
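
To tie the two definitions together, the sketch below computes precision and recall by hand on a handful of made-up labels and predictions (the values are illustrative only and unrelated to the results above).

```python
# Made-up ground truth and predictions for eight transactions (1 = fraud)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 0, 1, 1, 0, 0, 0]

# Count the outcomes that involve the positive (fraud) class
true_positives  = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
false_positives = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
false_negatives = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)

# precision: of the transactions flagged as fraud, how many really were fraud
precision_by_hand = true_positives / (true_positives + false_positives)
# recall: of the real frauds, how many were flagged
recall_by_hand = true_positives / (true_positives + false_negatives)

print(precision_by_hand, recall_by_hand)
```

Here the model flags four transactions but only two of them are real frauds (precision 0.5), and it catches two of the three real frauds (recall of about 0.67).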

### Decision Tree

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to train and test a Decision Tree model.</b>
</div>

Hint: here is the documentation:
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [34]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Compare test set predictions with ground truth labels
# (scikit-learn metrics expect the ground truth first, then the predictions)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

accuracy 1.0
precision 1.0
recall 1.0
