### 📦 Dataset Attribution

This project uses the **FUTURA Synthetic Invoices Dataset**, publicly available via Zenodo:

> **FUTURA - Synthetic Invoices Dataset for Document Analysis**  
> Authors: Dimosthenis Karatzas, Fei Chen, Davide Fichera, Diego Marchetti  
> License: [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)  
> DOI: [https://doi.org/10.5281/zenodo.10371464](https://doi.org/10.5281/zenodo.10371464)  
> Accessed via Zenodo. Redistribution and derivative works must credit the original authors.

We gratefully acknowledge the authors for creating and releasing this dataset.

## 🧠 Training CNN for Invoice Tampering Detection (with Class Imbalance Handling)

In this notebook, we train a convolutional neural network (CNN) in **PyTorch** to classify invoice images as either *real* or *tampered*. Because the dataset is **heavily imbalanced** (10,000 real vs. 600 tampered), we take the following measures:

- ✅ **Transfer Learning** using a pre-trained `ResNet18` to leverage learned visual features
- 🧪 **WeightedRandomSampler** to balance batches during training
- 🎨 **Data Augmentation** (horizontal flips, small rotations, brightness and contrast jitter) to increase tampered sample diversity
- 📊 **Performance Tracking** on train, validation, and test splits using accuracy and a confusion matrix

The overall goal is to train a reliable and reproducible model that detects subtle manipulations in invoices effectively.
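
With this class ratio, plain accuracy has a high floor: a classifier that labels every invoice as *real* is already right most of the time. A minimal sketch of that arithmetic, using only the class counts quoted above (the variable names are illustrative):

```python
# Class counts quoted in the introduction
n_real, n_tampered = 10_000, 600

# Accuracy of a trivial model that always predicts "real" (label 0)
baseline_accuracy = n_real / (n_real + n_tampered)

# Recall on the tampered class for that same trivial model: it finds none of them
baseline_tampered_recall = 0 / n_tampered

print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
print(f"Tampered recall: {baseline_tampered_recall:.4f}")
```

Any accuracy reported later should be read against this floor of roughly 94.3%.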


In [5]:
# Imports, Seeding and Environment Setup
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms
from sklearn.model_selection import train_test_split
import numpy as np
import random
import os

In [6]:
# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [7]:
# Set seeds for reproducibility
def seed_all(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_all()

In [10]:
# Load data using memory-mapping
X = np.load("../data/processed/X.npy", mmap_mode='r')
y = np.load("../data/processed/y.npy")  # y is small enough to load entirely

print(f"✅ Loaded X: shape={X.shape}, dtype={X.dtype} (memory-mapped)")
print(f"✅ Loaded y: shape={y.shape}, unique labels: {np.unique(y, return_counts=True)}")

✅ Loaded X: shape=(10600, 841, 595, 3), dtype=uint8 (memory-mapped)
✅ Loaded y: shape=(10600,), unique labels: (array([0, 1]), array([10000,   600]))
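
The reported shape explains why `X` is memory-mapped rather than loaded eagerly. A quick sketch of the size calculation, using the shape and dtype printed above:

```python
import numpy as np

# Shape and dtype reported by the load cell above
shape = (10600, 841, 595, 3)
itemsize = np.dtype(np.uint8).itemsize  # 1 byte per channel value

# Total bytes if the whole array were held in RAM
n_bytes = int(np.prod(shape, dtype=np.int64)) * itemsize
print(f"X would need about {n_bytes / 1e9:.1f} GB if loaded eagerly")
```

With `mmap_mode='r'`, only the images actually indexed by the `Dataset` are read from disk.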


In [11]:
# Split data into Train, Validation, and Test sets
from sklearn.model_selection import train_test_split

# First split: Train (70%) vs Temp (30%)
X_train_idx, X_temp_idx, y_train, y_temp = train_test_split(
    np.arange(len(X)), y, test_size=0.3, stratify=y, random_state=42
)

# Second split: Val (15%) and Test (15%)
X_val_idx, X_test_idx, y_val, y_test = train_test_split(
    X_temp_idx, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(f"🧪 Train: {len(X_train_idx)}, Val: {len(X_val_idx)}, Test: {len(X_test_idx)}")

🧪 Train: 7420, Val: 1590, Test: 1590
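
Because both splits pass `stratify`, each subset should keep roughly the same tampered share as the full dataset. The sketch below checks this on stand-in labels with the same class counts as `y`; `y_demo` and the other `*_demo`-style names are hypothetical and exist only for this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as y (0 = real, 1 = tampered)
y_demo = np.concatenate([np.zeros(10_000, dtype=int), np.ones(600, dtype=int)])
idx = np.arange(len(y_demo))

# Same two-stage split as the notebook: 70% train, then 15% / 15% val / test
train_idx, temp_idx, y_tr, y_tmp = train_test_split(
    idx, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
val_idx, test_idx, y_va, y_te = train_test_split(
    temp_idx, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(labels)}, tampered fraction={labels.mean():.4f}")

# The three index sets should be disjoint and cover every sample
n_unique = len(set(train_idx) | set(val_idx) | set(test_idx))
print("no overlap:", n_unique == len(y_demo))
```

The same loop can be run on the notebook's own `y_train`, `y_val`, and `y_test` arrays.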


In [12]:
# Define the baseline transform and a Dataset that reads from the memory-mapped array
from torchvision import transforms
from torch.utils.data import Dataset

# Define target size for resizing
TARGET_SIZE = (256, 256)

# Define transform
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(TARGET_SIZE),
    transforms.ToTensor(),  # Converts to [C, H, W] with values in [0,1]
])

# Custom Dataset to read from memory-mapped array
class InvoiceDataset(Dataset):
    def __init__(self, X, y, indices, transform=None):
        self.X = X
        self.y = y
        self.indices = indices
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        i = self.indices[idx]
        image = self.X[i]
        label = self.y[i]
        if self.transform:
            image = self.transform(image)
        return image, label


In [14]:
# Initialize datasets using index arrays (no slicing of X)
train_dataset = InvoiceDataset(X, y, X_train_idx, transform)
val_dataset   = InvoiceDataset(X, y, X_val_idx, transform)
test_dataset  = InvoiceDataset(X, y, X_test_idx, transform)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=32)
test_loader  = DataLoader(test_dataset, batch_size=32)

# Peek at one batch to verify
images, labels = next(iter(train_loader))
print(f"🧾 Batch shape: {images.shape}, Labels: {labels[:5]}")


🧾 Batch shape: torch.Size([32, 3, 256, 256]), Labels: tensor([0, 0, 0, 0, 0])


In [20]:
# Applied ONLY to training images
train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(TARGET_SIZE),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor()
])

# Used for validation and test (resize only, no augmentation)
eval_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(TARGET_SIZE),
    transforms.ToTensor()
])

In [21]:
# Redefine datasets with final transforms
train_dataset = InvoiceDataset(X, y, X_train_idx, transform=train_transform)
val_dataset   = InvoiceDataset(X, y, X_val_idx,   transform=eval_transform)
test_dataset  = InvoiceDataset(X, y, X_test_idx,  transform=eval_transform)

# Compute class weights: inverse of class frequency
class_counts = np.bincount(y_train)
class_weights = 1. / class_counts
sample_weights = class_weights[y_train]

# Create the sampler
weighted_sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)

# Final DataLoaders using updated datasets and sampler
train_loader = DataLoader(train_dataset, batch_size=32, sampler=weighted_sampler)
val_loader   = DataLoader(val_dataset, batch_size=32)
test_loader  = DataLoader(test_dataset, batch_size=32)

# Peek at final batch
images, labels = next(iter(train_loader))
print(f"✅ Final train batch shape: {images.shape}, Labels: {labels[:5]}")


✅ Final train batch shape: torch.Size([32, 3, 256, 256]), Labels: tensor([0, 1, 1, 1, 1])
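
Inverse-frequency weights give each class the same total sampling mass, so batches come out roughly half tampered. The sketch below shows the arithmetic with NumPy as a stand-in for `WeightedRandomSampler` (it is not the sampler itself). The training counts of 7,000 real and 420 tampered are an assumption derived from a stratified 70% split of 10,000 and 600:

```python
import numpy as np

# Assumed training-split labels: 7000 real (0) and 420 tampered (1)
y_train_demo = np.concatenate([np.zeros(7000, dtype=int), np.ones(420, dtype=int)])

# Same weighting scheme as the notebook cell above
class_counts = np.bincount(y_train_demo)
class_weights = 1.0 / class_counts
sample_weights = class_weights[y_train_demo]

# Normalised weights are the per-sample draw probabilities
probs = sample_weights / sample_weights.sum()
expected_tampered = probs[y_train_demo == 1].sum()

# Empirical check: draw one "epoch" with replacement
rng = np.random.default_rng(42)
drawn = rng.choice(len(y_train_demo), size=len(y_train_demo), replace=True, p=probs)
observed_tampered = y_train_demo[drawn].mean()

print(f"expected tampered share: {expected_tampered:.3f}")
print(f"observed tampered share: {observed_tampered:.3f}")
```

Under these assumed counts, each tampered image is drawn several times per epoch on average, which is why the augmentation pipeline matters most for that class.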


In [29]:
# Build the model: pre-trained ResNet18 with a single-logit output head
from torchvision.models import resnet18, ResNet18_Weights

# Load pre-trained ResNet18
weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)

# Replace the final FC layer with a single output (for binary classification)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 1)  # output = 1 logit

model = model.to(device)
print("✅ ResNet18 adjusted for binary classification.")

✅ ResNet18 adjusted for binary classification.


In [30]:
# Binary classification → use BCEWithLogitsLoss (more numerically stable)
criterion = nn.BCEWithLogitsLoss()

# Optimizer: all layers are fine-tuned here (training only the final layer is an alternative)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
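
The stability remark about `BCEWithLogitsLoss` can be illustrated outside PyTorch. The sketch below is a NumPy re-derivation of the loss, not PyTorch's implementation; `bce_with_logits` and `bce_naive` are hypothetical helper names. It compares the standard stable form with an explicit sigmoid followed by a log:

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (stable form)."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    # max(x, 0) - x*y + log(1 + exp(-|x|)) never exponentiates a large positive number
    return np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x))))

def bce_naive(logits, targets):
    """Same loss via an explicit sigmoid; loses precision for large-magnitude logits."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    p = 1.0 / (1.0 + np.exp(-x))
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

logits = np.array([-2.0, 0.5, 3.0])
targets = np.array([0.0, 1.0, 1.0])
stable = bce_with_logits(logits, targets)
naive = bce_naive(logits, targets)
print(stable, naive)  # the two agree for moderate logits

# A confidently wrong prediction: the stable form stays finite,
# whereas the naive form would hit log(0) because sigmoid(50) rounds to 1.0
extreme = bce_with_logits([50.0], [0.0])
print(extreme)
```

This is also why the model's final layer emits a raw logit and the sigmoid is applied only when predictions are thresholded.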

In [31]:
# Define the training loop with a validation pass after every epoch
def train_model(model, train_loader, val_loader, criterion, optimizer, device, epochs=5):
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.float().to(device).unsqueeze(1)  # for BCEWithLogitsLoss

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item() * images.size(0)

        avg_train_loss = train_loss / len(train_loader.dataset)

        # Validation phase
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for images, labels in val_loader:
                images = images.to(device)
                labels = labels.float().to(device).unsqueeze(1)

                outputs = model(images)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * images.size(0)

                preds = torch.sigmoid(outputs) > 0.5
                correct += (preds.squeeze().long() == labels.squeeze().long()).sum().item()
                total += labels.size(0)

        avg_val_loss = val_loss / len(val_loader.dataset)
        accuracy = correct / total

        print(f"📅 Epoch {epoch+1}/{epochs} | Train Loss: {avg_train_loss:.4f} | "
              f"Val Loss: {avg_val_loss:.4f} | Val Accuracy: {accuracy:.4f}")


In [33]:
# Train the model for 5 epochs
train_model(model, train_loader, val_loader, criterion, optimizer, device, epochs=5)

📅 Epoch 1/5 | Train Loss: 0.1610 | Val Loss: 0.0627 | Val Accuracy: 0.9881
📅 Epoch 2/5 | Train Loss: 0.1398 | Val Loss: 0.0763 | Val Accuracy: 0.9862
📅 Epoch 3/5 | Train Loss: 0.1347 | Val Loss: 0.0999 | Val Accuracy: 0.9774
📅 Epoch 4/5 | Train Loss: 0.1144 | Val Loss: 0.0771 | Val Accuracy: 0.9692
📅 Epoch 5/5 | Train Loss: 0.1100 | Val Loss: 0.0796 | Val Accuracy: 0.9855
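
Validation accuracy alone does not show how many tampered invoices are missed. A minimal sketch of a confusion-matrix summary follows; `confusion_counts` is a hypothetical helper, not part of the notebook, and the labels and scores are made-up toy values rather than model outputs:

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) for binary labels, treating 1 (tampered) as positive."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

# Toy example: 6 real invoices followed by 2 tampered ones
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.7, 0.3, 0.05, 0.4, 0.9, 0.35]

tn, fp, fn, tp = confusion_counts(y_true, y_prob)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"TN={tn} FP={fp} FN={fn} TP={tp} precision={precision:.2f} recall={recall:.2f}")
```

In the notebook, `y_prob` would come from `torch.sigmoid(model(images))` collected over `val_loader` or `test_loader`.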


In [37]:
# Make sure the output directory exists
os.makedirs("../outputs", exist_ok=True)

# Save only the model weights (state_dict), not the full model object
torch.save(model.state_dict(), "../outputs/resnet_invoice.pt")
print("💾 Model saved to ../outputs/resnet_invoice.pt")

💾 Model saved to ../outputs/resnet_invoice.pt


## 📌 Conclusion: Invoice Fraud Detection Model

This notebook fine-tuned a pre-trained **ResNet18** to classify invoice images as real or tampered.

### ✅ Summary of Steps

- **Data Handling**:
  - Loaded preprocessed image arrays as a memory-mapped array with `np.load(..., mmap_mode='r')`.
  - Split data into stratified train (70%), validation (15%), and test (15%) sets.
  - Applied augmentations for training; resizing only for evaluation.

- **Class Balancing**:
  - Used inverse-frequency sample weights with a `WeightedRandomSampler` so that training batches are roughly balanced.

- **Model**:
  - Fine-tuned `ResNet18` for binary classification.
  - Used `BCEWithLogitsLoss` + Adam optimizer.

- **Training**:
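  - Trained for 5 epochs.
  - Reached **98.55% validation accuracy** after the final epoch (the best epoch, the first, reached 98.81%). For reference, always predicting *real* would score about 94.3% at this class ratio.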

- **Saving**:
  - Saved model weights to `../outputs/resnet_invoice.pt`.

The saved weights can now be evaluated on the held-out test split, which this notebook has not yet used, before the model is used for inference or deployment.