# Unit 3, Exercise 1: Banknote Authentication

In this exercise, we apply logistic regression to a banknote authentication dataset to distinguish between genuine and forged banknotes.


**The dataset consists of 1372 examples and 4 features for binary classification.** The features are

1. variance of a wavelet-transformed image (continuous)
2. skewness of a wavelet-transformed image (continuous)
3. kurtosis of a wavelet-transformed image (continuous)
4. entropy of the image (continuous)

(You can find more details about this dataset at [https://archive.ics.uci.edu/ml/datasets/banknote+authentication](https://archive.ics.uci.edu/ml/datasets/banknote+authentication).)


In essence, these four features were manually extracted from image data. Note that you do not need to understand the details of these features for this exercise.

However, you are encouraged to explore the dataset further, e.g., by plotting the features, looking at the value ranges, and so forth. (We skip these steps in this exercise for brevity.)

Most of the code should look familiar to you since it is based on the logistic regression code from Unit 3.6.

## 1) Installing Libraries

You likely already have all libraries installed and don't need to do anything here.

In [1]:
# !conda install numpy pandas matplotlib --yes

In [2]:
# !pip install torch

In [6]:
!pip install watermark

Collecting watermark
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting jedi>=0.16 (from ipython>=6.0->watermark)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading watermark-2.5.0-py2.py3-none-any.whl (7.7 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, watermark
Successfully installed jedi-0.19.2 watermark-2.5.0


In [7]:
%load_ext watermark
%watermark -v -p numpy,pandas,matplotlib,torch

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

numpy     : 1.26.4
pandas    : 2.2.2
matplotlib: 3.8.0
torch     : 2.5.1+cu121



## 2) Loading the Dataset

We are using the familiar `read_csv` function from pandas to load the dataset:

In [8]:
import pandas as pd

In [13]:
# url = "https://github.com/Lightning-AI/dl-fundamentals/blob/main/unit03-pytorch-training/exercises/2_standardization/data_banknote_authentication.txt"
df = pd.read_csv("/content/data_banknote_authentication.txt", header=None, sep=',')
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |
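As suggested in the introduction, looking at value ranges is a quick first exploration step. The following is a minimal sketch that uses only the five rows printed by `df.head()`; the column names are assumptions taken from the feature list (the raw file has no header), and `sample_df` and `value_ranges` are illustrative names.

```python
import pandas as pd

# The five rows printed by df.head(); the column names are illustrative
# assumptions based on the feature list, since the raw file has no header.
columns = ["variance", "skewness", "kurtosis", "entropy", "label"]
head_rows = [
    [3.6216, 8.6661, -2.8073, -0.44699, 0],
    [4.5459, 8.1674, -2.4586, -1.4621, 0],
    [3.866, -2.6383, 1.9242, 0.10645, 0],
    [3.4566, 9.5228, -4.0112, -3.5944, 0],
    [0.32924, -4.4552, 4.5718, -0.9888, 0],
]
sample_df = pd.DataFrame(head_rows, columns=columns)

# Per-feature minimum, maximum, and mean (label column excluded)
value_ranges = sample_df[columns[:-1]].agg(["min", "max", "mean"])
print(value_ranges)
```

On the full dataset, the same idea should work with `df[[0, 1, 2, 3]].agg(["min", "max", "mean"])`.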


In [14]:
X_features = df[[0, 1, 2, 3]].values
y_labels = df[4].values

Number of examples and features:

In [15]:
X_features.shape

(1372, 4)

It is usually a good idea to look at the label distribution:

In [16]:
import numpy as np

np.bincount(y_labels)

array([762, 610])
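
The label counts also give a useful reference point: a classifier that always predicts the majority class sets a baseline accuracy that any trained model should beat. A small sketch of that arithmetic, using the counts from the output above (the variable names are illustrative):

```python
import numpy as np

# Class counts reported by np.bincount(y_labels) in the cell above
class_counts = np.array([762, 610])

# Fraction of examples per class
class_fractions = class_counts / class_counts.sum()

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_fractions.max()

print(f"Class fractions: {class_fractions.round(4)}")
print(f"Majority-class baseline accuracy: {majority_baseline*100:.2f}%")
```

The baseline is roughly 55.5%, so the two classes are only mildly imbalanced.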

## 3) Defining a DataLoader

The `DataLoader` code is the same code we used in Unit 3.6:

In [17]:
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, X, y):

        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.float32)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

We will use 80% of the data for training and 20% for validation. In a real project, we would also hold out a separate test set for the final evaluation (in this case, we do not have an explicit test set).

In [18]:
train_size = int(X_features.shape[0]*0.80)
train_size

1097

In [19]:
val_size = X_features.shape[0] - train_size
val_size

275

Using `torch.utils.data.random_split`, we generate the training and validation sets along with the respective data loaders:

In [20]:
import torch

dataset = MyDataset(X_features, y_labels)

torch.manual_seed(1)
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(
    dataset=train_set,
    batch_size=10,
    shuffle=True,
)

val_loader = DataLoader(
    dataset=val_set,
    batch_size=10,
    shuffle=False,
)
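
With a batch size of 10, the number of mini-batches per epoch follows directly from the split sizes. The sketch below works through that arithmetic; it assumes the `DataLoader` default `drop_last=False`, which keeps the final, smaller batch.

```python
import math

train_size, val_size, batch_size = 1097, 275, 10

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the batch count is the ceiling of size / batch_size.
num_train_batches = math.ceil(train_size / batch_size)
num_val_batches = math.ceil(val_size / batch_size)

# Number of examples left over for the final training batch
last_train_batch = train_size - (num_train_batches - 1) * batch_size

print(num_train_batches, num_val_batches, last_train_batch)
```

The 110 training batches match the `Batch .../110` counter printed by the training loop.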

## 4) Implementing the model

Here, we are reusing the same model code we used in Unit 3.6:

In [21]:
import torch

class LogisticRegression(torch.nn.Module):

    def __init__(self, num_features):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, 1)

    def forward(self, x):
        logits = self.linear(x)
        probas = torch.sigmoid(logits)
        return probas
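
To see what the forward pass produces, here is a minimal sketch that repeats its two steps on a random stand-in batch (not the banknote data). It illustrates why the training loop reshapes the labels with `class_labels.view(probas.shape)`: the model returns one probability per row in a column of shape `(batch_size, 1)`, while the labels arrive as a 1-D tensor.

```python
import torch

torch.manual_seed(1)

linear = torch.nn.Linear(4, 1)   # same layer type as self.linear with num_features=4
features = torch.randn(10, 4)    # random stand-in for one batch of 10 examples
class_labels = torch.zeros(10)   # placeholder labels, 1-D tensor of shape (10,)

# The two steps of forward(): linear layer, then sigmoid
logits = linear(features)
probas = torch.sigmoid(logits)

print(probas.shape)        # one probability per example, as a column
print(class_labels.shape)  # labels are 1-D

# Reshape the labels so they line up element-wise with the probabilities
labels_2d = class_labels.view(probas.shape)
```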

## 5) The training loop

In this section, we are using the training loop from Unit 3.6. It is the same code except for one small modification: we added the line `if not batch_idx % 20` so that the loss is printed only for every 20th batch (to reduce the number of output lines).

<font color='red'>YOUR TASK is to find a good learning rate and epoch number so that you achieve a training and validation performance of at least 98%.</font>
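
One way to approach this kind of search is to wrap training in a function and compare a few candidate learning rates. The sketch below is only an illustration under simplifying assumptions: it uses synthetic toy data rather than the banknote dataset, full-batch gradient descent instead of mini-batches, and a hypothetical helper named `train_and_score`.

```python
import torch
import torch.nn.functional as F


def train_and_score(lr, num_epochs, X, y):
    """Train a fresh logistic regression model and return its accuracy on (X, y)."""
    torch.manual_seed(1)  # same initial weights for every candidate
    linear = torch.nn.Linear(X.shape[1], 1)
    optimizer = torch.optim.SGD(linear.parameters(), lr=lr)

    for _ in range(num_epochs):
        # Full-batch update (a simplification of the mini-batch loop)
        probas = torch.sigmoid(linear(X))
        loss = F.binary_cross_entropy(probas, y.view(probas.shape))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        preds = (torch.sigmoid(linear(X)) > 0.5).float()
    return (preds == y.view(preds.shape)).float().mean().item()


# Synthetic, linearly separable toy data (NOT the banknote dataset)
torch.manual_seed(0)
X_toy = torch.randn(200, 4)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).float()

results = {
    lr: train_and_score(lr, num_epochs=20, X=X_toy, y=y_toy)
    for lr in (0.001, 0.01, 0.1, 1.0)
}
for lr, acc in results.items():
    print(f"lr={lr}: accuracy={acc*100:.2f}%")
```

For the exercise itself, the same comparison would be made with the mini-batch loop and with accuracy measured on both the training and validation loaders.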

In [66]:
import torch.nn.functional as F


torch.manual_seed(1)
model = LogisticRegression(num_features=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02) ## FILL IN VALUE

num_epochs = 20  ## FILL IN VALUE

for epoch in range(num_epochs):

    model = model.train()
    for batch_idx, (features, class_labels) in enumerate(train_loader):

        probas = model(features)

        loss = F.binary_cross_entropy(probas, class_labels.view(probas.shape))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        if not batch_idx % 20: # log every 20th batch
            print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
                   f' | Batch {batch_idx:03d}/{len(train_loader):03d}'
                   f' | Loss: {loss:.2f}')

Epoch: 001/020 | Batch 000/110 | Loss: 1.30
Epoch: 001/020 | Batch 020/110 | Loss: 0.57
Epoch: 001/020 | Batch 040/110 | Loss: 0.48
Epoch: 001/020 | Batch 060/110 | Loss: 0.23
Epoch: 001/020 | Batch 080/110 | Loss: 0.16
Epoch: 001/020 | Batch 100/110 | Loss: 0.17
Epoch: 002/020 | Batch 000/110 | Loss: 0.29
Epoch: 002/020 | Batch 020/110 | Loss: 0.14
Epoch: 002/020 | Batch 040/110 | Loss: 0.29
Epoch: 002/020 | Batch 060/110 | Loss: 0.13
Epoch: 002/020 | Batch 080/110 | Loss: 0.14
Epoch: 002/020 | Batch 100/110 | Loss: 0.14
Epoch: 003/020 | Batch 000/110 | Loss: 0.09
Epoch: 003/020 | Batch 020/110 | Loss: 0.20
Epoch: 003/020 | Batch 040/110 | Loss: 0.14
Epoch: 003/020 | Batch 060/110 | Loss: 0.31
Epoch: 003/020 | Batch 080/110 | Loss: 0.12
Epoch: 003/020 | Batch 100/110 | Loss: 0.08
Epoch: 004/020 | Batch 000/110 | Loss: 0.10
Epoch: 004/020 | Batch 020/110 | Loss: 0.04
Epoch: 004/020 | Batch 040/110 | Loss: 0.13
Epoch: 004/020 | Batch 060/110 | Loss: 0.11
Epoch: 004/020 | Batch 080/110 |
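
The logged loss values come from `F.binary_cross_entropy`, which by default averages the per-example negative log-likelihood over the batch. A small NumPy sketch of that formula, using hypothetical probabilities and labels chosen only for illustration:

```python
import numpy as np

probas = np.array([0.9, 0.2])  # hypothetical predicted probabilities for class 1
labels = np.array([1.0, 0.0])  # hypothetical true labels

# Binary cross-entropy per example: -[y*log(p) + (1-y)*log(1-p)]
per_example = -(labels * np.log(probas) + (1 - labels) * np.log(1 - probas))

# Mean over the batch, matching the default reduction="mean"
loss = per_example.mean()
print(f"Loss: {loss:.4f}")
```

Confident, correct predictions contribute values near zero, which is why the loss shrinks as training progresses.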

## 6) Evaluating the results

Again, reusing the code from Unit 3.6, we will calculate the training and validation set accuracy.

In [67]:
def compute_accuracy(model, dataloader):

    model = model.eval()

    correct = 0.0
    total_examples = 0

    for idx, (features, class_labels) in enumerate(dataloader):

        with torch.no_grad():
            probas = model(features)

        pred = torch.where(probas > 0.5, 1, 0)
        lab = class_labels.view(pred.shape).to(pred.dtype)

        compare = lab == pred
        correct += torch.sum(compare)
        total_examples += len(compare)

    return correct / total_examples
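
The thresholding and reshaping steps inside `compute_accuracy` can be checked by hand on a tiny example. The values below are hypothetical; the sketch also shows what would happen without `.view(pred.shape)`, where comparing a `(4,)` tensor with a `(4, 1)` tensor broadcasts to a `(4, 4)` grid instead of an element-wise comparison.

```python
import torch

probas = torch.tensor([[0.9], [0.4], [0.6], [0.1]])  # hypothetical outputs, shape (4, 1)
class_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])    # hypothetical labels, shape (4,)

# Probabilities above 0.5 become class 1, everything else class 0
pred = torch.where(probas > 0.5, 1, 0)

# Bring the labels to the same shape and dtype as the predictions
lab = class_labels.view(pred.shape).to(pred.dtype)

# Three of the four predictions match the labels
accuracy = (lab == pred).sum() / len(pred)
print(f"Accuracy: {accuracy*100:.2f}%")

# Without the reshape, broadcasting compares every label with every prediction
unreshaped = class_labels == pred
print(unreshaped.shape)
```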

In [68]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 98.45%


<font color='red'>Notice that the code for the validation accuracy is not shown? It's part of the exercise to implement it :)</font>

In [69]:
## YOUR CODE
val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 98.91%
