# 3 - Fraud Detection - eCommerce Retailer

The dataset is composed of a set of transactions from an online retailer: for each transaction we have access to several attributes (i.e. properties), as well as a label indicating whether each transaction was fraudulent or not fraudulent.

**The objective is to use this historical data to develop a prediction algorithm capable of classifying new transactions as *fraud* or *not fraud*.**

<div class="alert alert-block alert-danger">
<b>Q: Is this supervised or unsupervised learning?</b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

The fraudulent transactions in the original datasets were labelled according to the reports of the credit card owners: when an owner told the bank that a transaction was fraudulent and asked for the money back, the transaction was labelled as **fraud**.

The properties of a transaction are called features (a.k.a. attributes, independent variables).
For each transaction, the dataset provides five features and one label:
- age of the account
- number of items purchased
- time of purchase
- payment method
- time since the payment method was added to the account
- label: indication of whether the transaction is fraudulent or not

**The goal of this practical session is to train a ML algorithm to identify fraudulent transactions from the five features.**
Ideally, the model should return a probability that the transaction is fraudulent.

<div class="alert alert-block alert-danger">
<b>Q: Why do we prefer a probability instead of a simple [1; 0] output?</b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

One column has a non-numeric type.
It is called a "categorical variable": its values indicate the category each item belongs to.

Many (not all) ML algorithms require features to be numeric, so we have to convert categorical features to numeric features:
- create one binary feature per category, so that each row has exactly one of these features set to 1
- this is called **one-hot encoding**
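As a minimal sketch of the idea, the cell below one-hot encodes a tiny made-up table. The DataFrame `toy_df` and the payment method names in it are assumptions chosen for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: the values are made up for illustration only
toy_df = pd.DataFrame({
    'numItems': [1, 3, 2],
    'paymentMethod': ['creditcard', 'paypal', 'creditcard'],
})

# One binary column is created for each distinct category of 'paymentMethod'
toy_one_hot = pd.get_dummies(toy_df, columns=['paymentMethod'])

# The original categorical column is replaced by the new binary columns
print(toy_one_hot.columns.tolist())
display_cols = [c for c in toy_one_hot.columns if c.startswith('paymentMethod_')]

# In every row, exactly one of the binary columns is set
print(toy_one_hot[display_cols].sum(axis=1).tolist())
```

Each new column is named after the original column and the category it stands for, so the encoding stays readable.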

---

## Index

- [3.1](#3.1): preliminaries - introduction to *pandas*
- [3.2](#3.2): data exploration
- [3.3](#3.3): data preparation
- [3.4](#3.4): training the models

---
# 3.1

## Preliminaries: introduction to *Pandas*

You can find all the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.25/); and, remember, Google is your friend.

In [None]:
# this lets you refer to pandas with the shortened name, which is convenient if you call "pandas" many times
import pandas as pd

An important data structure is the `DataFrame`: it is a tabular data structure with labeled axes (rows and columns).
You can think of it as a table with named columns.

In [None]:
# Let's define a small dataframe
example_df = pd.DataFrame({'column_a':[0,1,2,2,3,4], 'column_b':['a','b','c','d','e','f']})

# and display it
example_df

An alternative command to display pandas DataFrames is:

In [None]:
display(example_df)

There is no (theoretical) limit to the number of rows and the number of columns in the DataFrame.

Useful attributes and methods of DataFrames (examples are shown below):
- **index**: the index (row labels) of the DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html)
- **values**: returns a NumPy representation of the DataFrame's data (the axis labels are dropped). [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html)
- **columns**: returns the column labels of the pandas dataframe. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html)
- **refer to columns** by putting one column name, or a list of column names, between square brackets
- **slicing**, similar to what you do with lists [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
- **drop**: drop specified labels from rows or columns [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- **drop_duplicates**: return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes, are ignored. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)
- **sort_values**: Sort by the values along either axis. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
- **unique**: Return unique values. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- **get_dummies**: Convert categorical variable into dummy/indicator variables. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- **sample**: Return a random sample of items from an axis of object. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
- **groupby**: A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

Useful functions of Pandas:
- **read_csv**: Read a comma-separated values (csv) file into DataFrame. [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

There are tons of other ones which might be useful in some cases but, as of now, these should be the only ones you need.

---

Examples with the `example_df` created above

In [None]:
# index
len(example_df.index)

In [None]:
# columns
example_df.columns

In [None]:
# refer to one column
example_df['column_a']

In [None]:
# refer to several columns
example_df[['column_a', 'column_b']]  # careful, there are double square brackets

In [None]:
# filter dataframe depending on some column's values
example_df[example_df['column_a'] == 2]

In [None]:
# filter dataframe depending on some column's values
example_df[(example_df['column_a'] == 2)|(example_df['column_a'] == 1)]

In [None]:
# filter dataframe depending on some column's values
tmp_list = [1, 2]
example_df[example_df['column_a'].isin(tmp_list)]

In [None]:
# slicing
example_df[1:3]

In [None]:
# drop
example_df.drop('column_a', axis=1)

In [None]:
# drop_duplicates
example_df.drop_duplicates('column_a')

In [None]:
# sort_values
example_df.sort_values('column_a', ascending=False)

In [None]:
# unique
example_df['column_a'].unique()

In [None]:
# get_dummies
pd.get_dummies(example_df, columns=['column_b'])
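The list of useful attributes and methods also mentions `values`, `sample` and `groupby`. A short sketch of how they might be used on the same toy DataFrame follows; the `random_state` argument is only there to make the random draw repeatable.

```python
import pandas as pd

# Same toy DataFrame as defined earlier in this section
example_df = pd.DataFrame({'column_a': [0, 1, 2, 2, 3, 4],
                           'column_b': ['a', 'b', 'c', 'd', 'e', 'f']})

# values: the data as a NumPy array, one row per DataFrame row
print(example_df.values.shape)

# sample: pick 3 random rows; random_state makes the draw repeatable
sampled = example_df.sample(3, random_state=0)
print(len(sampled))

# groupby: count how many rows share each value of column_a
group_sizes = example_df.groupby('column_a').size()
print(group_sizes)
```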

---

## Why do we need a specific data structure for managing large amounts of data?

Let's assume we have a large number of values, and we want to get the maximum.
Let's try to do that with a list and using a Pandas DataFrame, and look at the time required in both cases.

In [None]:
import pandas as pd
import numpy as np
import time

# This creates a list with random values in it; specifically, this line creates a list containing 100M values
tmp_list = list(np.random.randn(10**8))

Let's find its max value, with three different approaches.

#### 1) With a `for` loop

In [None]:
# start the timer
start_time = time.time()

# find the max value
max_value = tmp_list[0]
for x in tmp_list[1:]:
    if x > max_value:
        max_value = x
        
# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

#### 2) Using the `max` built-in

In [None]:
# start the timer
start_time = time.time()

# find the max value
max_value = max(tmp_list)

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)

#### 3) Using a DataFrame

In [None]:
# define a dataframe made of only one column, containing the list defined above
tmp_df = pd.DataFrame({'A':tmp_list})

In [None]:
display(tmp_df)

In [None]:
# start the timer
start_time = time.time()

# find the max value
max_a = tmp_df['A'].max()

# stop the timer and print the result
elapsed_time = time.time() - start_time
print("Elapsed time %.2f seconds" % elapsed_time)
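If 100M values are too many for your machine, the same comparison can be sketched at a smaller scale. The helper `time_it` below is a hypothetical convenience function written for this example (it is not part of pandas), and 1M values is an arbitrary choice. Timings vary from machine to machine, but pandas is usually faster here because a numeric column is stored in a NumPy array and its maximum is computed in compiled code.

```python
import time
import numpy as np
import pandas as pd

def time_it(func):
    """Hypothetical helper: run func once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

# 1M random values, stored both as a plain list and as a pandas Series
rng = np.random.default_rng(0)
values = rng.standard_normal(10**6)
values_list = list(values)
values_series = pd.Series(values)

# Time the built-in max on the list and the max method on the Series
max_builtin, t_builtin = time_it(lambda: max(values_list))
max_pandas, t_pandas = time_it(lambda: values_series.max())

print("built-in max: %.4f seconds" % t_builtin)
print("pandas max:   %.4f seconds" % t_pandas)
```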

---

## Let's move on to fraud detection

---

# 3.2
## Data exploration

<div class="alert alert-block alert-warning">
<b>Warning:</b>

As in the previous session, please be careful with how you define the path and with the working directory of the notebook.
</div>

In [None]:
# Read in the data from the CSV file
df = pd.read_csv('datasets/payment_fraud.csv')

One of the most important things to do when working with large quantities of data is to have a look at the data before starting to play with it.

In [None]:
# Let's have a look at what the data looks like
df.sample(5)

<div class="alert alert-block alert-danger">
<b>Q: How many entries in the dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which are the columns of the `payment_fraud` dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which column tells you whether a payment is a fraud or not?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many different values of `paymentMethod` are in the dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: What are the possible values?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which is the maximum number of items purchased in a single transaction?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do, in your opinion?</b>
</div>

In [None]:
df.groupby('paymentMethod').size().reset_index()

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

Alternative command to do the same thing.

In [None]:
df['paymentMethod'].value_counts()

<div class="alert alert-block alert-danger">
<b>Q: Can you see any differences between the outcome of the previous two approaches?</b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

<div class="alert alert-block alert-danger">
<b>Q: Try to show the number of transactions for each value of `numItems`.</b>
</div>

---

# 3.3
## Data preparation

As we said before, we cannot train our model directly on the input DataFrame, as it contains some categorical values.
We have to encode them with one-hot encoding.
The `pd.get_dummies` function can be used to do just that.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the 'paymentMethod' attribute with one hot encoding.</b>
</div>

In [None]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = # TODO

<div class="alert alert-block alert-danger">
<b>Q: How is the data different after performing one hot encoding?</b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

Now that we have a dataset that can be given as input to our model, we can perform the split into training and test set.

<div class="alert alert-block alert-danger">
<b>Q: Perform the split into training and testing set (keep a 70:30 ratio), completing the cell below.</b>
</div>

Please do **not** rename the variables, as the rest of the notebook will not run properly if you do that. 

Remember:
- X_train and X_test have to contain all the attributes
- the label can appear only in y_train and y_test
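To illustrate how the attributes are separated from the label, here is a minimal sketch on a made-up table. The names `toy_df`, `feature_1`, `feature_2` and `target` are hypothetical and do not match the columns of the real dataset, so the cell you have to complete will differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the column names are made up for illustration
toy_df = pd.DataFrame({
    'feature_1': range(10),
    'feature_2': range(10, 20),
    'target': [0, 1] * 5,
})

toy_X = toy_df.drop('target', axis=1)   # attributes only
toy_y = toy_df['target']                # label only

# test_size=0.3 keeps 30% of the rows for testing;
# random_state fixes the shuffle so the split is repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0)

print(toy_X_train.shape, toy_X_test.shape)
```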

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = # TODO

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the training data?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the test data?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many features in the features arrays (i.e. X_train and X_test)?</b>
</div>

---

# 3.4

## Training and evaluating the models

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train classifier model
clf = LogisticRegression(max_iter=20).fit(X_train, y_train)

# Make predictions on test set
y_pred = clf.predict(X_test)
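The introduction says that, ideally, the model should return a probability. `predict` only returns the predicted class, while scikit-learn's `LogisticRegression` also offers `predict_proba`. The sketch below shows the difference on a tiny made-up dataset; `toy_X`, `toy_y` and `toy_clf` are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, the label is 1 when the feature is large
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression().fit(toy_X, toy_y)

# predict returns the hard class; predict_proba returns one probability per class
toy_labels = toy_clf.predict(toy_X)
toy_proba = toy_clf.predict_proba(toy_X)

# The column order follows toy_clf.classes_, so column 1 is the probability of class 1
print(toy_clf.classes_)
print(toy_proba[:, 1])
```

The same call on the fraud model would be `clf.predict_proba(X_test)`, assuming `clf` and `X_test` are defined as in the cell above.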

In [None]:
# import methods for measuring accuracy, precision and recall
from sklearn.metrics import (
    accuracy_score, 
    precision_score,
    recall_score,
)

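# Compare test set predictions with ground truth labels
# (scikit-learn expects the ground truth first, then the predictions;
# swapping them would exchange precision and recall)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)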
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)
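To see where precision and recall come from, the sketch below computes them by hand on a few made-up labels and compares the result with scikit-learn. The lists `toy_true` and `toy_pred` are hypothetical; 1 stands for the positive class.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# Count the outcomes for the positive class by hand
tp = sum(t == 1 and p == 1 for t, p in zip(toy_true, toy_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(toy_true, toy_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(toy_true, toy_pred))  # false negatives

manual_precision = tp / (tp + fp)   # share of predicted positives that are correct
manual_recall = tp / (tp + fn)      # share of actual positives that are found

# scikit-learn expects the ground truth first, then the predictions
print(manual_precision, precision_score(toy_true, toy_pred))
print(manual_recall, recall_score(toy_true, toy_pred))
```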

<div class="alert alert-block alert-danger">
<b>Q: What does a recall value close to 1.0 mean? (e.g. it misses almost no positives, it raises almost no warnings, etc.) </b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does a precision value close to 1.0 mean?</b>
</div>

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

### Gaussian Naive Bayes

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to train and test a Gaussian Naive Bayes model.</b>
</div>

Hint: here is the documentation:
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [None]:
from sklearn.naive_bayes import # TODO

clf = # TODO

y_pred = # TODO

accuracy = # TODO
precision = # TODO
recall = # TODO
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

### Linear SVM

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to train and test a Linear SVM.</b>
</div>

Hint: here is the documentation:
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [None]:
from sklearn.svm import # TODO

clf = # TODO

y_pred = # TODO

accuracy = # TODO
precision = # TODO
recall = # TODO
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

### Do the results look "strange"? Try to understand why...

<div class="alert alert-block alert-success">
Write here the answer (double click on this cell to enable editing)
</div>

---