# Mon-Reader: Detecting Page Flipping, Text Extraction, and Speech Synthesis
This notebook covers the following workflow:
1. **Detect page flipping vs. still pages using CNNs** (ResNet, MobileNet, EfficientNet, and a custom Osama Net) on the provided image dataset.
2. **Extract text from detected still page frames** (future step, not implemented in this notebook).
3. **Synthesize speech from extracted text (TTS)** (future step, not implemented in this notebook).

## Notebook Outline
- Import Required Libraries (with GPU support)
- Load and Preprocess Dataset (images/training and images/testing, with flip/notflip subfolders)
- Prepare Data Generators (with augmentation)
- Build and Evaluate ResNet Model
- Build and Evaluate MobileNet Model
- Build and Evaluate EfficientNet Model
- Define and Train Custom CNN (Osama Net)
- Evaluate Osama Net on Test Set and Show Predictions
- Compare Model Accuracies and Analysis

# 1. Import Required Libraries and Enable GPU Support

In this section, we import PyTorch and check for GPU availability. PyTorch does not move work to the GPU on its own: we select a device once, then explicitly move the model and every batch to it. Training on a GPU is typically much faster than on a CPU.

## CPU vs GPU Tips
- **GPU Acceleration**: Deep learning models typically train much faster on a GPU than on a CPU.
- **Device Selection**: We use `torch.device()` to select the GPU when one is available and fall back to the CPU otherwise.
- **Memory Management**:
  - If the GPU runs out of memory, reduce the batch size.
  - Pass `pin_memory=True` to `DataLoader` for faster host-to-GPU transfers.
  - Call `model.to(device)` to move the model to the selected device.
  - Move every input tensor to the model's device with `.to(device)`.
- **Switching Between Devices**: To switch between CPU and GPU, change the `device` variable; the `.to(device)` calls then move the model and tensors accordingly.
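The device-handling tips above can be sketched as follows. This is a minimal illustration, not the notebook's training code: the tiny linear model and the random tensors are stand-ins, and the code falls back to the CPU when no GPU is present.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Pick the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data and model (hypothetical; the notebook uses image folders and CNNs).
dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))
loader = DataLoader(
    dataset,
    batch_size=4,
    pin_memory=(device.type == "cuda"),  # pinned host memory only helps GPU transfers
)
model = nn.Linear(4, 1).to(device)  # move the parameters to the chosen device

for features, labels in loader:
    # Inputs must live on the same device as the model.
    features = features.to(device, non_blocking=True)
    outputs = model(features)
    print(outputs.device.type, tuple(outputs.shape))
```

Switching devices only requires changing `device`; the two `.to(device)` calls stay the same.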

In [2]:
import torch
import torchvision

print(torch.__version__)
print(f"CUDA available: {torch.cuda.is_available()}")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

2.7.1+cu128
CUDA available: True
Using device: cuda


In [3]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset
from torchvision.datasets import ImageFolder
import random
from torch import nn, optim
from torchvision import models
from torch.optim import Adam
from torch.utils.data.sampler import SubsetRandomSampler

# 2. Load and Preprocess Dataset

Here we define our image size and batch size parameters, and load the training and testing datasets from their respective directories. The dataset has two classes: 'flip' and 'notflip'.

In [5]:
IMG_SIZE = (128, 128)
BATCH_SIZE = 32

train_dir = "images/training"
test_dir = "images/testing"

transform = transforms.Compose(
    [
        transforms.Resize(IMG_SIZE),
        transforms.ToTensor(),
    ]
)

train_dataset = ImageFolder(root=train_dir, transform=transform)
test_dataset = ImageFolder(root=test_dir, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

class_names = train_dataset.classes
print(f"Classes: {class_names}")

Classes: ['flip', 'notflip']
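`ImageFolder` assigns integer labels by sorting the class folder names, so with the folders above `flip` maps to 0 and `notflip` maps to 1. A sketch of that mapping in plain Python (the folder names come from the output above; `class_to_index` is a hypothetical helper, not a torchvision function):

```python
def class_to_index(folder_names):
    """Mimic ImageFolder's labeling: sort class folder names, then enumerate them."""
    classes = sorted(folder_names)
    return {name: index for index, name in enumerate(classes)}


mapping = class_to_index(["notflip", "flip"])
print(mapping)  # {'flip': 0, 'notflip': 1}
```

Because `notflip` is label 1, a positive logit from the single-output models in the following sections means the model predicts a still page.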


# 3. Prepare Data Generators with Augmentation

Data augmentation reduces overfitting by applying random transformations to each training image every time it is loaded, so the model rarely sees exactly the same input twice.
This is especially important for smaller datasets. The transformations include:
- Random horizontal flipping
- Random rotation (up to 10 degrees)
- Random affine transformations (translation and scaling)

In [6]:
aug_transform = transforms.Compose(
    [
        transforms.Resize(IMG_SIZE),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(10),
        transforms.RandomAffine(0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
        transforms.ToTensor(),
    ]
)

aug_train_dataset = ImageFolder(root=train_dir, transform=aug_transform)
aug_train_loader = DataLoader(aug_train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# 4. Build and Evaluate Models

## Using BCEWithLogitsLoss Instead of BCELoss

In our updated models, we'll use `BCEWithLogitsLoss` instead of `BCELoss` for the following reasons:

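1. **Numerical Stability**: BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy into one operation and uses the log-sum-exp trick, so it never evaluates `log(0)` when the sigmoid saturates.

2. **Performance**: The fused operation skips a separate sigmoid layer and its intermediate output, although the saving is usually small compared with the rest of the network.

3. **Stable Gradients**: The gradient of the fused loss with respect to the logit is `sigmoid(x) - y`, which stays finite and informative for logits of any magnitude. A separate sigmoid followed by `BCELoss` can lose this when the sigmoid output rounds to exactly 0 or 1.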

4. **Simplified Model Architecture**: We can remove the final Sigmoid layer from our models, as BCEWithLogitsLoss applies it internally.

All of our models will output raw logits rather than probabilities, with the loss function handling the sigmoid transformation.
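To make the stability argument concrete, the sketch below compares a naive two-step computation with the standard stable formulation, `max(x, 0) - x*y + log(1 + exp(-|x|))`, using only the standard library. It illustrates the idea in plain Python and is not PyTorch's actual implementation; both function names are hypothetical.

```python
import math


def naive_bce(logit, target):
    """Sigmoid followed by binary cross-entropy, as two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def stable_bce_with_logits(logit, target):
    """Fused form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Moderate logit: both forms agree.
print(naive_bce(2.0, 1.0), stable_bce_with_logits(2.0, 1.0))

# Large logit with the wrong label: the sigmoid rounds to exactly 1.0,
# so the naive form tries to evaluate log(0) and fails.
try:
    naive_bce(40.0, 0.0)
except ValueError as err:
    print("naive form failed:", err)
print(stable_bce_with_logits(40.0, 0.0))
```

The same reasoning explains the prediction threshold used later: `sigmoid(x) > 0.5` exactly when `x > 0`, so the logits can be thresholded at 0 without applying the sigmoid.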

## ResNet Model

In [10]:
def build_resnet(num_classes=1):
    model = models.resnet50(weights='IMAGENET1K_V1')
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Sequential(
        nn.Linear(model.fc.in_features, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
        # No sigmoid - using BCEWithLogitsLoss
    )
    return model

def train_model(model, train_loader, test_loader, device, epochs=10):
    model = model.to(device)
    # Using BCEWithLogitsLoss for numerical stability
    criterion = nn.BCEWithLogitsLoss()
    optimizer = Adam(model.parameters(), lr=1e-4)
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.float().to(device)
            labels = labels.unsqueeze(1)  # Reshape labels from (N,) to (N, 1) to match the model output
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.float().to(device)
                labels = labels.unsqueeze(1)
                outputs = model(images)
                # Threshold logits at 0 (equivalent to 0.5 for sigmoid)
                preds = (outputs > 0).int()
                correct += (preds == labels.int()).sum().item()
                total += labels.size(0)
        acc = correct / total
        print(f"Epoch {epoch+1}/{epochs}, Test Accuracy: {acc:.4f}")
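        # Track the best per-epoch test accuracy. Note that this is an optimistic
        # estimate, because the test set is also used to pick the best epoch.
        if acc > best_acc:
            best_acc = acc
    return best_acc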

resnet_model = build_resnet()
resnet_acc = train_model(resnet_model, aug_train_loader, test_loader, device, epochs=10)
print(f"ResNet Test Accuracy: {resnet_acc:.4f}")

Epoch 1/10, Test Accuracy: 0.7069
Epoch 2/10, Test Accuracy: 0.7605
Epoch 3/10, Test Accuracy: 0.8208
Epoch 4/10, Test Accuracy: 0.8476
Epoch 5/10, Test Accuracy: 0.8191
Epoch 6/10, Test Accuracy: 0.8040
Epoch 7/10, Test Accuracy: 0.8007
Epoch 8/10, Test Accuracy: 0.7420
Epoch 9/10, Test Accuracy: 0.7822
Epoch 10/10, Test Accuracy: 0.8576
ResNet Test Accuracy: 0.8576
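Because the backbone is frozen, only the replacement head is trained. Its size can be worked out by hand. The sketch below assumes the standard torchvision feature widths (2048 for `resnet50`, 1280 for `mobilenet_v2` and `efficientnet_b0`); both helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def head_params(backbone_features, hidden=128, num_classes=1):
    """Parameters in the two-linear-layer head used in this notebook."""
    return linear_params(backbone_features, hidden) + linear_params(hidden, num_classes)


resnet_head = head_params(2048)
print(resnet_head)        # 262401
print(head_params(1280))  # 164097
```

ReLU and Dropout have no parameters, so the two linear layers account for everything the optimizer updates.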


# 5. Build and Evaluate MobileNet Model

MobileNet is a family of lightweight CNN architectures designed for mobile and embedded vision applications; here we use MobileNetV2.
It is much smaller than ResNet-50 (roughly 3.5M versus 25.6M parameters) while still providing good accuracy.

In [None]:
def build_mobilenet(num_classes=1):
    model = models.mobilenet_v2(weights="IMAGENET1K_V1")
    # Freeze all the parameters in the pre-trained model
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier with a new one
    model.classifier = nn.Sequential(
        nn.Linear(model.classifier[1].in_features, 128),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(128, num_classes),
        # No sigmoid - using BCEWithLogitsLoss
    )
    return model


def train_model(model, train_loader, test_loader, device, epochs=10):
    model = model.to(device)

    # Ensure only the classifier parameters are being trained
    optimizer = Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

    # Using BCEWithLogitsLoss for numerical stability
    criterion = nn.BCEWithLogitsLoss()
    best_acc = 0.0

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.float().to(device)
            labels = labels.unsqueeze(1)  # Reshape labels from (N,) to (N, 1) to match the model output

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.float().to(device)
                labels = labels.unsqueeze(1)

                outputs = model(images)
                # Threshold logits at 0 (equivalent to 0.5 for sigmoid)
                preds = (outputs > 0).int()

                correct += (preds == labels.int()).sum().item()
                total += labels.size(0)

        acc = correct / total
        print(f"Epoch {epoch + 1}/{epochs}, Test Accuracy: {acc:.4f}")
        if acc > best_acc:
            best_acc = acc

    return best_acc


# Build the MobileNet model
mobilenet_model = build_mobilenet()

# Train the model
mobilenet_acc = train_model(
    mobilenet_model, aug_train_loader, test_loader, device, epochs=10
)

# Print the final accuracy
print(f"MobileNet Test Accuracy: {mobilenet_acc:.4f}")


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
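A device-side assert is reported asynchronously, so the traceback rarely points at the real cause, and the CUDA context generally stays unusable until the kernel is restarted. One common way to narrow it down is to run a single batch on the CPU, where errors are raised synchronously, and to check the label values and shapes explicitly. The sketch below is an assumption about how one might do this, not a diagnosis of the error above: `check_batch` is a hypothetical helper, and the tiny model and random batch are stand-ins.

```python
import torch
from torch import nn


def check_batch(model, images, labels):
    """Run one batch on the CPU and validate what BCEWithLogitsLoss will receive."""
    model = model.to("cpu").eval()
    images, labels = images.to("cpu"), labels.to("cpu")
    # Binary targets must be 0 or 1.
    assert set(labels.unique().tolist()) <= {0, 1}, f"unexpected labels: {labels.unique()}"
    with torch.no_grad():
        outputs = model(images)
    targets = labels.float().unsqueeze(1)
    # Logits and targets must have the same shape, here (N, 1).
    assert outputs.shape == targets.shape, f"{outputs.shape} vs {targets.shape}"
    return nn.BCEWithLogitsLoss()(outputs, targets).item()


# Stand-in model and batch (hypothetical shapes matching IMG_SIZE = (128, 128)).
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 1))
loss = check_batch(toy_model, torch.randn(4, 3, 128, 128), torch.tensor([0, 1, 1, 0]))
print(f"CPU loss: {loss:.4f}")
```

With the real notebook objects, the call would take `mobilenet_model` and one batch drawn from `aug_train_loader` after a kernel restart.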


# 6. Build and Evaluate EfficientNet Model

EfficientNet uses a compound scaling method that uniformly scales network width, depth, and resolution
to balance model size and accuracy. At its release it reached state-of-the-art ImageNet accuracy with far fewer parameters than comparable networks. Here we use the smallest variant, EfficientNet-B0.

In [None]:
def build_efficientnet(num_classes=1):
    model = models.efficientnet_b0(weights='IMAGENET1K_V1')
    for param in model.parameters():
        param.requires_grad = False
    model.classifier = nn.Sequential(
        nn.Dropout(0.3),
        nn.Linear(model.classifier[1].in_features, 128),
        nn.ReLU(),
        nn.Linear(128, num_classes),
        # No sigmoid - using BCEWithLogitsLoss
    )
    return model

efficientnet_model = build_efficientnet()
efficientnet_acc = train_model(efficientnet_model, aug_train_loader, test_loader, device, epochs=10)
print(f"EfficientNet Test Accuracy: {efficientnet_acc:.4f}")

# 7. Define and Train Custom CNN (Osama Net)

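Here we define a custom CNN architecture called "Osama Net" with three convolutional blocks (convolution, ReLU, batch normalization, max pooling)
followed by a classifier with dropout for regularization. The first linear layer assumes 128×128 inputs, matching `IMG_SIZE`.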

In [None]:
class OsamaNet(nn.Module):
    def __init__(self, num_classes=1):
        super(OsamaNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(128),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 128),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(128, num_classes),
            # No sigmoid - using BCEWithLogitsLoss
        )
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

osama_net = OsamaNet()
osama_acc = train_model(osama_net, aug_train_loader, test_loader, device, epochs=20)
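The `128 * 16 * 16` input size of the first linear layer follows from the layer shapes: each 3×3 convolution with padding 1 keeps the spatial size, and each 2×2 max pool halves it. A short check of that arithmetic, assuming the 128×128 input defined by `IMG_SIZE` (`conv_out` is a hypothetical helper implementing the usual output-size formula):

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 128  # IMG_SIZE used in this notebook
for block in range(3):
    size = conv_out(size, kernel=3, padding=1)  # 3x3 conv with padding 1 keeps the size
    size = conv_out(size, kernel=2, stride=2)   # 2x2 max pool halves it
    print(f"after block {block + 1}: {size}x{size}")

flattened = 128 * size * size  # 128 channels after the third block
print(flattened)  # 32768
```

If `IMG_SIZE` changes, the first `nn.Linear` in the classifier has to change with it.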

# 8. Evaluate Models and Compare Results

Finally, we evaluate our models on the test dataset and visualize some predictions.
We also compare the performance of all models to identify the best architecture for our task.

In [None]:
# Evaluate Osama Net on Test Set and Show Predictions
osama_net.eval()
correct, total = 0, 0
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.float().to(device)
        outputs = osama_net(images)
        # Threshold logits at 0 (equivalent to 0.5 for sigmoid)
        preds = (outputs > 0).int()
        correct += (preds == labels.unsqueeze(1).int()).sum().item()
        total += labels.size(0)
        all_preds.extend(preds.cpu().numpy().flatten())
        all_labels.extend(labels.cpu().numpy().flatten())
osama_acc = correct / total
print(f"Osama Net Test Accuracy: {osama_acc:.4f}")

# Show sample predictions
indices = random.sample(range(len(all_labels)), 6)
plt.figure(figsize=(12, 6))
for i, idx in enumerate(indices):
    img_path, true_label, pred_label = test_dataset.imgs[idx][0], int(all_labels[idx]), int(all_preds[idx])
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.subplot(2, 3, i + 1)
    plt.imshow(img)
    plt.title(f"True: {class_names[true_label]}\nPred: {class_names[pred_label]}")
    plt.axis("off")
plt.tight_layout()
plt.show()

# Compare Model Accuracies and Analysis
# Every model cell above must have run successfully; otherwise its *_acc variable is undefined.
print(f"ResNet Test Accuracy: {resnet_acc:.4f}")
print(f"MobileNet Test Accuracy: {mobilenet_acc:.4f}")
print(f"EfficientNet Test Accuracy: {efficientnet_acc:.4f}")
print(f"Osama Net Test Accuracy: {osama_acc:.4f}")

accuracies = [resnet_acc, mobilenet_acc, efficientnet_acc, osama_acc]
model_names = ["ResNet", "MobileNet", "EfficientNet", "Osama Net"]
plt.figure(figsize=(8, 5))
plt.bar(model_names, accuracies, color=["royalblue", "orange", "green", "purple"])
plt.ylabel("Test Accuracy")
plt.title("Model Comparison on Flipping Classification")
plt.ylim(0, 1)
plt.show()

# Analysis
print("Analysis:")
print("Compare the results above. The best model is the one with the highest test accuracy. Consider model complexity, training time, and overfitting when choosing the best architecture for this task.")
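Accuracy alone can hide which class a model gets wrong. As a possible extension (not part of the original notebook), the `all_labels` and `all_preds` lists collected above can be turned into per-class counts. The sketch uses short made-up lists in place of real predictions, `binary_report` is a hypothetical helper, and label 1 is treated as the positive class.

```python
def binary_report(labels, preds):
    """Confusion counts plus accuracy, precision, and recall for label 1."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example values, not results from this notebook.
example_labels = [0, 0, 1, 1, 1, 0]
example_preds = [0, 1, 1, 1, 0, 0]
report = binary_report(example_labels, example_preds)
print(report)
```

In the notebook, the call would be `binary_report([int(y) for y in all_labels], [int(p) for p in all_preds])`.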