<center>
<img align="center" src="http://sydney.edu.au/images/content/about/logo-mono.jpg">
</center>
<h1 align="center" style="margin-top:10px">Statistical Learning and Data Mining</h1>
<h2 align="center" style="margin-top:20px">Week 13 Tutorial: Introduction to PyTorch</h2>
<br>

This tutorial is an introduction to building and training neural networks with [PyTorch](https://pytorch.org/). We'll build a simple feedforward network for fraud detection and train it by stochastic gradient descent.

<a href="#1.-Credit-Card-Fraud-Data">Credit card fraud data</a> <br>
<a href="#2.-Dataset-and-DataLoader">Dataset and DataLoader</a> <br>
<a href="#3.-Building-a-neural-network">Building a neural network</a> <br>
<a href="#4.-Training">Training</a> <br>
<a href="#5.-Logistic-Regression">Logistic regression</a> <br>
<a href="#6.-Validation-results">Validation results</a> <br>

This notebook relies on the following libraries and settings.

In [1]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') 
from IPython.display import clear_output

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# 1. Credit card fraud data

This tutorial is based on the [credit card fraud dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud) available from [Kaggle Datasets](https://www.kaggle.com/datasets). Our objective is to detect fraudulent credit card transactions using classification methods.

Let's assume the following loss matrix: 

<table>
  <tr>
    <th>Actual/Predicted</th>
    <th>Legitimate</th>
     <th>Fraud</th>
  </tr>
  <tr>
    <th>Legitimate</th>
    <td>0</td>
    <td>1</td>
  </tr>
  <tr>
    <th>Fraud</th>
    <td>10</td>
    <td>0</td>
  </tr>
</table>

That is, we assume that it is much worse for the financial institution to miss a fraudulent transaction than to flag a legitimate transaction as potential fraud.
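
The loss matrix also determines the decision threshold. As a minimal sketch (the function names here are illustrative and not part of the tutorial code): flagging a transaction with fraud probability `p` has expected loss `(1 - p) * loss_fp`, while not flagging it has expected loss `p * loss_fn`, so flagging is the cheaper action whenever `p > loss_fp / (loss_fp + loss_fn)`.

```python
def optimal_threshold(loss_fp, loss_fn):
    """Probability above which flagging has lower expected loss than not flagging."""
    return loss_fp / (loss_fp + loss_fn)


def expected_losses(p, loss_fp, loss_fn):
    """Expected loss of (flagging, not flagging) a transaction with fraud probability p."""
    return (1 - p) * loss_fp, p * loss_fn


# Losses taken from the loss matrix above
tau = optimal_threshold(loss_fp=1, loss_fn=10)
print(tau)  # 1/11, about 0.0909

# A transaction with a 20% fraud probability is cheaper to flag than to ignore
loss_if_flagged, loss_if_ignored = expected_losses(0.2, loss_fp=1, loss_fn=10)
print(loss_if_flagged < loss_if_ignored)  # True
```

This is the value `tau = 1/11` that the validation code in this tutorial uses to turn predicted probabilities into class predictions.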

We start by loading and inspecting the data. All features except the time and the transaction amount are the result of a principal components analysis (PCA) transformation of undisclosed predictors.

In [3]:
data = pd.read_csv('Data/creditcard.csv')
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


This is a relatively large dataset with 284,807 transactions:

In [4]:
print(data.shape)

(284807, 31)


The classes are highly imbalanced: only 492 transactions (0.17%) are fraudulent.  This makes the problem much more challenging than the total number of observations would suggest.

In [5]:
data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64
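
To illustrate why the imbalance matters, the arithmetic below uses the class counts reported above and the assumed loss matrix. It is only a sketch with illustrative variable names: a trivial classifier that never flags fraud is almost always correct, yet it misses every fraudulent transaction.

```python
n_total = 284_807   # transactions in the dataset
n_fraud = 492       # fraudulent transactions
loss_fn = 10        # loss for a missed fraud (from the loss matrix)

# A trivial classifier that labels every transaction as legitimate
accuracy = (n_total - n_fraud) / n_total
sensitivity = 0 / n_fraud                    # no fraud is ever detected
average_loss = loss_fn * n_fraud / n_total   # every fraud is a false negative

print(f"Fraud rate:   {n_fraud / n_total:.4%}")
print(f"Accuracy:     {accuracy:.4%}")
print(f"Sensitivity:  {sensitivity:.1%}")
print(f"Average loss: {average_loss:.4f}")
```

Accuracy is therefore a poor guide for this problem, which is why the tutorial reports sensitivity, average precision, and the estimated risk.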

With so few observations in the fraud class, we should ideally use cross-validation throughout the analysis. To keep the code simple, however, we use a single stratified validation set instead.

In [6]:
response='Class'
index_train, index_val  = train_test_split(np.array(data.index), stratify=data[response], train_size=0.8, random_state=1)

predictors = list(data.columns[1:-1])  # skip the first column (Time) and the last (Class)

X_train = data.loc[index_train, predictors].to_numpy()
y_train = data.loc[index_train, response].to_numpy()

X_valid = data.loc[index_val, predictors].to_numpy()
y_valid = data.loc[index_val, response].to_numpy()

The last data preparation step is to standardise the predictors. The scaler is fitted on the training set only and then applied to both sets.

In [7]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_valid = scaler.transform(X_valid)

# 2. Dataset and DataLoader

When working with PyTorch, we need to tell it how to process the data and construct minibatches for stochastic gradient descent. 

The first step is to create a PyTorch dataset object. A `Dataset` class must implement three methods: `__init__`, `__len__`, and `__getitem__`. The first takes the data as input, processes it as required, and instantiates the `Dataset` object. The second returns the number of observations. The third takes an index and returns the corresponding observation.

The `__init__` implementation below converts the original NumPy arrays into [PyTorch tensors](https://pytorch.org/docs/stable/tensors.html) and casts them to 32-bit floats, the default data type of PyTorch layers.

A tensor in this context is just another name for an array. The PyTorch documentation writes:

> A `torch.Tensor` is a multi-dimensional matrix containing elements of a single data type.

In [8]:
import torch
from torch.utils.data import Dataset

class FraudDataset(Dataset):
    
    def __init__(self, features, response):
        self.features =  torch.from_numpy(features).float()
        self.response = torch.from_numpy(response).float()

    def __len__(self):
        return len(self.response)

    def __getitem__(self, idx):
        return self.features[idx, :], self.response[idx]
    
train_data = FraudDataset(X_train, y_train)
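
The cast to `float` in `__init__` matters because NumPy arrays are typically 64-bit floats, whereas PyTorch layers use 32-bit parameters by default. The following sketch, on a small made-up array, shows what each conversion step does.

```python
import numpy as np
import torch

# Toy array for illustration; NumPy defaults to float64
features = np.array([[0.5, -1.2], [1.0, 0.3]])

as_tensor = torch.from_numpy(features)   # keeps dtype float64 and shares memory with the array
as_float32 = as_tensor.float()           # new tensor cast to float32

print(as_tensor.dtype, as_float32.dtype, as_float32.shape)
```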

A [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) takes a `Dataset` as input and combines it with a sampling strategy so that PyTorch can iterate over mini-batches.

Setting the `shuffle` option to `True` makes the DataLoader reshuffle the data at every epoch.

In [9]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size = 1024, shuffle=True)

The next cell grabs a randomly sampled mini-batch for inspection.

In [10]:
X, y = next(iter(train_loader))

X

tensor([[ 0.4037, -0.7890,  0.3855,  ...,  0.1138,  0.1171,  0.3953],
        [-0.2206,  0.5561, -1.6911,  ...,  0.9195,  0.8567, -0.3148],
        [ 0.6056,  0.3252, -0.1057,  ...,  0.0495,  0.1221, -0.3508],
        ...,
        [ 0.5775,  0.0777,  0.5506,  ...,  0.0917,  0.1075, -0.2349],
        [-0.3241,  0.4454,  0.4775,  ..., -1.0179, -0.0201, -0.3428],
        [-0.2604,  0.4896,  1.4939,  ..., -0.1925, -0.4826, -0.3463]])

In [11]:
X.shape

torch.Size([1024, 29])

In [12]:
y.shape

torch.Size([1024])
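
The number of mini-batches per epoch is the number of observations divided by the batch size, rounded up, with the last batch holding whatever is left over. The sketch below checks this on toy data; it uses `TensorDataset`, a PyTorch utility that is not otherwise part of this tutorial.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 observations with 3 features each (illustrative only)
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
response = torch.zeros(10)
toy_data = TensorDataset(features, response)

toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True)

# Shuffling changes which observations land in each batch, not the batch sizes
batch_sizes = [len(y) for _, y in toy_loader]
print(len(toy_loader), batch_sizes)  # 3 [4, 4, 2]
```

By the same logic, the 227,845 training observations and a batch size of 1024 give 223 mini-batches per epoch.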

# 3. Building a neural network

Our neural network model will be a feedforward network with three hidden layers. Each layer will have 128 hidden units and the activation function will be the rectified linear unit (ReLU).

We need to specify the model as a PyTorch neural network module, that is, a subclass of [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The model class needs an `__init__` method that initialises the neural network layers and a `forward` method that implements the operations to be performed on the inputs.

The following code takes advantage of the `nn.Sequential` class, which allows us to quickly stack pre-defined layers.

In [13]:
from torch import nn
   
class NeuralNetwork(nn.Module):
    
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        
        self.feedforward = nn.Sequential(            
            nn.Linear(29, 128),            
            nn.ReLU(),                       
            nn.Linear(128, 128),
            nn.ReLU(),  
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )                        


    def forward(self, features):        
        return self.feedforward(features).flatten() # returns a flat array as desired

Next, we instantiate the model, move it to the GPU (if available), and print the model structure. 

In [14]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

dfn = NeuralNetwork().to(device)

print(dfn)

NeuralNetwork(
  (feedforward): Sequential(
    (0): Linear(in_features=29, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=128, bias=True)
    (5): ReLU()
    (6): Linear(in_features=128, out_features=1, bias=True)
    (7): Sigmoid()
  )
)


If you're using a GPU, note that it needs enough memory to hold the model and a mini-batch. This is not a problem here, but in practice you'll often need to keep an eye on your GPU's memory usage.
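
As a rough check on the model size, the sketch below counts the parameters implied by the layer sizes printed above, assuming 32-bit floats. It is a back-of-the-envelope calculation rather than a full memory estimate.

```python
layer_sizes = [29, 128, 128, 128, 1]  # inputs, three hidden layers, output

# Each linear layer has (inputs x outputs) weights plus one bias per output
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

bytes_per_param = 4  # float32
print(n_params)  # 36993
print(f"{n_params * bytes_per_param / 1024:.0f} KiB for the parameters alone")
```

Training needs more than this, since gradients, optimiser state, and the activations for each mini-batch are stored as well, but the total is still tiny for this network.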

# 4. Training

To train the neural network, it's useful to code a function that loops over the entire training set to complete one epoch of the optimisation algorithm.

In [15]:
def train_loop(dataloader, model, loss_fn, optimiser):
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    n_batches = len(dataloader)
    n_obs = len(dataloader.dataset)
    
    for batch, (features, response) in enumerate(dataloader):
        
        # Move data to GPU, if available
        features = features.to(device)
        response  = response.to(device)
        
        # Compute the predictions (forward pass)
        prediction = model(features)
        
        # Evaluate cost function
        loss = loss_fn(prediction, response)

        # Compute gradient (backward pass)
        optimiser.zero_grad()
        loss.backward()
        
        # Update parameters
        optimiser.step()

        # Print progress
        if batch % max(1, n_batches // 5) == 0:  # max() avoids a zero divisor when there are fewer than 5 batches
            loss, current = loss.item(), batch * len(response)
            print(f"Loss: {loss:>7f}  [{current:>5d}/{n_obs:>5d}]")
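
The call to `optimiser.zero_grad()` is needed because PyTorch adds new gradients to the existing ones on every call to `backward()` instead of overwriting them. A minimal sketch with a single toy parameter:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)  # a single toy parameter

(2 * w).backward()
first_grad = w.grad.item()    # d(2w)/dw = 2

(2 * w).backward()            # no reset in between
second_grad = w.grad.item()   # 2 + 2 = 4: the gradients were added together

w.grad.zero_()                # manual reset, similar in effect to optimiser.zero_grad()
reset_grad = w.grad.item()

print(first_grad, second_grad, reset_grad)  # 2.0 4.0 0.0
```

Without the reset, each parameter update would use the sum of the gradients from all previous mini-batches.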

We'll also compute and print the validation results at the end of every epoch. In this case, we process the validation set in one batch.

In [16]:
from sklearn.metrics import recall_score, log_loss, average_precision_score

def validation(model):
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Predicted probabilities 
    # .cpu().detach().numpy() moves the result to the CPU and converts it to a NumPy array
    with torch.no_grad():
        y_prob = model(torch.from_numpy(X_valid).float().to(device)).cpu().detach().numpy()
    
    
    # Classification using the decision threshold
    tau = 1/11
    y_pred = (y_prob > tau).astype(int)
    
    # Metrics
    nll = log_loss(y_valid, y_prob)
    sensitivity = recall_score(y_valid, y_pred)
    auprc = average_precision_score(y_valid, y_prob)
      

    print('')
    print('Validation metrics \n')
    print(f"Loss: {np.round(nll, 4)}")
    print(f"Sensitivity: {np.round(sensitivity, 3)}")
    print(f"Average precision: {np.round( auprc, 3)} \n")

We now finally train the model. We use the Adam optimiser, which is an extension of SGD that often works well as a default optimisation method. 

The learning rate is based on trial and error, though in practice we can use hyperparameter optimisation tools to select it. 

To select the number of epochs, we can use early stopping. In this approach, we keep track of the validation metrics at the end of each epoch and stop the learning process when the validation performance stops improving. We don't explicitly code this method below, but the validation metrics tend to stop improving after about five or six epochs for this problem.
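
Although the tutorial does not implement early stopping, the stopping rule itself is simple. The sketch below is one common variant, assuming a hypothetical `patience` parameter (the number of consecutive epochs without improvement that we tolerate) and made-up validation losses.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, given
    per-epoch validation losses, or None if patience never runs out."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, for illustration only
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]
print(early_stopping_epoch(losses, patience=2))  # 5
```

In a real training loop, the same check would run after each call to the validation function, and one would normally also save the model parameters from the best epoch.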

Note that you will not get the same numbers because of the randomness in the learning algorithm. When using the CPU for training, it's possible to achieve reproducibility by following the [recommendations](https://pytorch.org/docs/stable/notes/randomness.html) in the PyTorch documentation. With GPU training, reproducibility is difficult and often not possible. 
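
As a minimal sketch of the first step towards reproducibility on the CPU, seeding PyTorch's random number generator makes subsequent random draws repeatable. Weight initialisation and, by default, the DataLoader's shuffling draw from this generator, so the seed should be set before the model and the data loader are created. The seed value `0` is arbitrary.

```python
import torch

torch.manual_seed(0)          # seed PyTorch's random number generator
first = torch.rand(3)

torch.manual_seed(0)          # re-seeding reproduces the same draws
second = torch.rand(3)

print(torch.equal(first, second))  # True
```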

In [17]:
epochs = 5
learning_rate = 1e-3

loss_fn = nn.BCELoss() # binary cross-entropy loss
optimiser = torch.optim.Adam(dfn.parameters(), lr = learning_rate) # Adam tends to work well as a default
 
for i in range(epochs):
    print(f"Epoch {i+1}\n-------------------------------")
    train_loop(train_loader, dfn, loss_fn, optimiser)
    validation(dfn)

print("Done!")

Epoch 1
-------------------------------
Loss: 0.660927  [    0/227845]
Loss: 0.008776  [45056/227845]
Loss: 0.014260  [90112/227845]
Loss: 0.007543  [135168/227845]
Loss: 0.003137  [180224/227845]
Loss: 0.004282  [225280/227845]

Validation metrics 

Loss: 0.0036
Sensitivity: 0.816
Average precision: 0.801 

Epoch 2
-------------------------------
Loss: 0.016526  [    0/227845]
Loss: 0.000512  [45056/227845]
Loss: 0.003951  [90112/227845]
Loss: 0.001746  [135168/227845]
Loss: 0.002246  [180224/227845]
Loss: 0.001426  [225280/227845]

Validation metrics 

Loss: 0.0025
Sensitivity: 0.888
Average precision: 0.845 

Epoch 3
-------------------------------
Loss: 0.001321  [    0/227845]
Loss: 0.010567  [45056/227845]
Loss: 0.001628  [90112/227845]
Loss: 0.000954  [135168/227845]
Loss: 0.010703  [180224/227845]
Loss: 0.000702  [225280/227845]

Validation metrics 

Loss: 0.0022
Sensitivity: 0.888
Average precision: 0.865 

Epoch 4
-------------------------------
Loss: 0.009099  [    0/227845]

# 5. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# No regularisation (on scikit-learn versions before 1.2, use penalty='none' instead)
logit = LogisticRegression(penalty=None, solver='lbfgs')
logit.fit(X_train, y_train)

# L2 regularisation
logit_l2= LogisticRegressionCV(Cs = 50, penalty='l2', solver='lbfgs', scoring='neg_log_loss', n_jobs=-1)
logit_l2.fit(X_train, y_train)

# 6. Validation results

The next cell compares the validation performance of the neural network against the logistic regression benchmark. In my results, the neural network significantly outperforms the logistic regression in terms of the estimated risk, average precision, and cross-entropy. 

Some important comments:

(i) As noted above, you will not get the same numbers because of the randomness in the optimisation process. 

(ii) A useful trick for training neural networks is to re-run the learning algorithm if necessary. You can then select and save a model that performs well on the validation set. The disadvantage of this approach is that it can overfit the validation set. If possible, it's better to average multiple neural networks trained on different training-validation splits, discarding those with poor validation performance. 

(iii) The comparison with the logistic regression is not entirely rigorous since we looked at the validation performance to select a reasonable number of epochs for training the neural network.

(iv) It's not clear why the neural network seems to perform better. As earlier in the unit, it's important to use EDA and interpretability tools to understand what's happening.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix,  f1_score

columns=['Estimated risk', 'Error rate', 'Sensitivity', 'Specificity', 
         'Precision', 'Average Precision', 'F1 Score', 'Cross-entropy']
rows=['Logistic', r'Logistic $\ell_2$', 'Neural network']  # raw string avoids the invalid '\e' escape
results=pd.DataFrame(0.0, columns=columns, index=rows) 

methods=[logit, logit_l2, dfn]

lfp = 1
lfn = 10
tau = lfp/(lfp+lfn)

for i, method in enumerate(methods):
    
    if i==2:
        with torch.no_grad():
            y_prob = dfn(torch.from_numpy(X_valid).float().to(device)).cpu().detach().numpy()
    else:
        y_prob = method.predict_proba(X_valid)[:, 1]

    y_pred = (y_prob>tau).astype(int)
       
    tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()
    
    results.iloc[i,0]=  (fp*lfp+fn*lfn)/len(y_valid)
    results.iloc[i,1]=  1 - accuracy_score(y_valid, y_pred)
    results.iloc[i,2]=  tp/(tp+fn)
    results.iloc[i,3]=  tn/(tn+fp)
    results.iloc[i,4]=  precision_score(y_valid, y_pred)
    results.iloc[i,5]=  average_precision_score(y_valid, y_prob)
    results.iloc[i,6]=  f1_score(y_valid, y_pred)
    results.iloc[i,7]=  10*log_loss(y_valid, y_prob)

results.iloc[:,0] /= results.iloc[0,0]
results.round(3)