<a href="https://colab.research.google.com/github/juliancramos/cnn-pneumonia-covid-detector/blob/main/cnn-pneumonia-covid-detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chest X-Ray Classification: Normal vs. Pneumonia vs. COVID-19

**Authors:** Sergio Ortíz - Julián Ramos - Melissa Ruíz

**Course:** Machine Learning Techniques

**Date:** February 2026

## 1. Executive Summary
Lung diseases such as Pneumonia and COVID-19 are critical health conditions requiring timely diagnosis. Traditional manual interpretation of chest X-rays is subjective and prone to error.

This project aims to develop an automated Deep Learning system using **Convolutional Neural Networks (CNNs)** to classify chest X-ray images into three categories:
1.  **Normal**
2.  **Pneumonia**
3.  **COVID-19**

---
### 1.1 Project Objectives
This project follows a structured workflow designed to meet specific technical requirements:

1.  **Exploratory Data Analysis (EDA):**
    * Explore the dataset to identify key characteristics, including class distribution and image quality.
    * Detect potential biases (e.g., class imbalance) and present visualizations to support design decisions.

2.  **Image Preprocessing:**
    * Implement techniques to improve input quality, such as pixel normalization and resizing.
    * Apply **Data Augmentation** strategies to improve model generalization and prevent overfitting.
    * Justify how these steps prepare the data for the neural network.

3.  **Neural Network Design:**
    * Design a custom **Convolutional Neural Network (CNN)** architecture suitable for classification.
    * Justify the selection of layers (Convolutional, Pooling, Fully Connected), filter sizes, and activation functions.
    * Present the architecture diagram and explain its expected behavior.

4.  **Training and Evaluation:**
    * Train the model using an appropriate optimization strategy and loss function.
    * Evaluate performance on a validation set using robust metrics such as **Precision, Recall, and F1-Score**.
    * Analyze results to identify issues like overfitting and propose technical solutions.

## 2. Environment Setup and Configuration

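This section configures the execution environment: it imports the required libraries, defines global constants, fixes the random seeds, and sets plotting defaults.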

### 2.1 Technical Stack
* **Data Manipulation:** `pandas` and `numpy` are utilized for metadata handling and numerical operations.
* **Visualization:** `matplotlib` and `seaborn` are employed to analyze data distributions and training metrics.
* **Image Processing:** `cv2` (OpenCV) is used for reading and resizing X-ray images.
* **Deep Learning:** `tensorflow` and `keras` serve as the core framework for constructing the CNN.



In [None]:
import os
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tensorflow as tf

# CONSTANTS CONFIGURATION
# Fixed seed to ensure reproducibility (weights)
SEED = 42
# Target image resolution (224x224 pixels)
IMG_SIZE = 224
# Number of images processed per iteration
BATCH_SIZE = 32

def set_seed(seed_value):
    """
    Sets the random seed for reproducibility across all libraries.

    This function locks the pseudo-random number generators of Python, NumPy,
    and TensorFlow to a specific sequence.

    Args:
        seed_value (int): The seed number to initialize generators.
    """
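    # Hash seed for Python's str/bytes hashing (affects dict/set ordering).
    # Note: the running interpreter read PYTHONHASHSEED at startup, so this
    # assignment only affects processes launched after this point.
    os.environ['PYTHONHASHSEED'] = str(seed_value)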

    # Python's standard random library (shuffle, choice)
    random.seed(seed_value)

    # NumPy's random module (array splitting, matrices)
    np.random.seed(seed_value)

    # TensorFlow's random module (weight initialization)
    tf.random.set_seed(seed_value)

# Apply the configuration
set_seed(SEED)

# Visualization Setup
# Set a style for all plots (background grid)
sns.set_style("whitegrid")

# Define a global figure size (12x6 inches)
plt.rcParams['figure.figsize'] = (12, 6)

# Silence all Python warnings (DeprecationWarnings included) to keep cell output readable
warnings.filterwarnings('ignore')

print("Environment setup complete.")
print(f"TensorFlow Version: {tf.__version__}")

Environment setup complete.
TensorFlow Version: 2.19.0
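
As a quick sanity check of what seeding provides, the sketch below re-seeds Python's standard `random` module and confirms that the same seed reproduces the same draws. `draw_sample` is a hypothetical helper that is not part of the notebook. Only the standard library is used to keep the example light; `np.random.seed` and `tf.random.set_seed` are expected to behave analogously for their own generators.

```python
import random

def draw_sample(seed_value, n=5):
    """Hypothetical helper: seed Python's RNG, then draw n integers in [0, 99]."""
    random.seed(seed_value)
    return [random.randint(0, 99) for _ in range(n)]

first = draw_sample(42)
second = draw_sample(42)   # same seed as `first`
other = draw_sample(7)     # different seed

print("same seed, same draws:", first == second)
print("different seed, same draws:", first == other)
```

Re-running `set_seed(SEED)` before an experiment plays the same role: it resets the generators so that later random operations start from a known state.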


## 3. Data Ingestion

In this section, the required datasets are downloaded from Kaggle.

The Kaggle API, authenticated with a `kaggle.json` token, is used because the datasets consist of **unstructured image data** organized in directory trees rather than structured tabular files (e.g., CSV or Excel), so they are fetched as archives and extracted locally.

**Datasets acquired:**
* **Chest X-Ray Images (Pneumonia):** provided by Paul Mooney.
* **COVID-19 Radiography Database:** provided by Tawsifur Rahman.

In [None]:
import os
from google.colab import drive

def setup_kaggle_from_drive():
    """
    Configures the Kaggle API by loading the token from Google Drive.

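    This function mounts the user's Google Drive, looks for 'kaggle.json'
    in the Drive root ('/content/drive/MyDrive/'), copies it to ~/.kaggle,
    and downloads and extracts both datasets. If both dataset folders
    already exist, it returns without doing anything.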
    """

    # Validate if data already exists
    if os.path.exists('chest_xray') and os.path.exists('COVID-19_Radiography_Dataset'):
        print("Data already exists. Skipping download and extraction.")
        return

    # Mount Google Drive
    print("Mounting Google Drive...")
    drive.mount('/content/drive')

    # Path to the token in Drive
    drive_path = '/content/drive/MyDrive/kaggle.json'
    #local_path = '/root/.kaggle/kaggle.json'

    # Check and Copy Credentials
    if not os.path.exists(drive_path):
        print(f"Error: 'kaggle.json' not found at {drive_path}.")
        print("Upload the file to your Google Drive root folder.")
        return

    # Create the hidden directory .kaggle
    !mkdir -p ~/.kaggle

    # Copy the file from Drive to the local Colab environment
    !cp "$drive_path" ~/.kaggle/

    # Set security permissions (Required by Kaggle API)
    !chmod 600 ~/.kaggle/kaggle.json
    print("API Token loaded from Drive successfully.")

    # Dataset Download & Extraction
    print("\nDownloading Datasets...")

    print("Downloading Chest X-Ray Images (Pneumonia)...")
    !kaggle datasets download -d paultimothymooney/chest-xray-pneumonia --force

    print("Downloading COVID-19 Radiography Database...")
    !kaggle datasets download -d tawsifurrahman/covid19-radiography-database --force

    print("\nExtracting Files...")
    # Unzip
    !unzip -qo chest-xray-pneumonia.zip
    !unzip -qo covid19-radiography-database.zip

    print("SUCCESS: Data ingestion complete.")

# Execute the function
setup_kaggle_from_drive()

Data already exists. Skipping download and extraction.
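
The credential steps in the cell above (`mkdir -p`, `cp`, `chmod 600`) rely on IPython shell magics. A minimal pure-Python sketch of the same three steps is shown below, assuming a POSIX file system; `install_kaggle_token` is a hypothetical helper, and the demo uses throwaway temporary directories and a fake token rather than Google Drive and `~/.kaggle`.

```python
import shutil
import stat
import tempfile
from pathlib import Path

def install_kaggle_token(token_path, kaggle_dir):
    """Copy a Kaggle token into kaggle_dir and restrict it to owner read/write.

    Hypothetical helper mirroring the mkdir / cp / chmod 600 steps.
    Returns the destination path, or None if the token file is missing.
    """
    token_path = Path(token_path)
    kaggle_dir = Path(kaggle_dir)
    if not token_path.exists():
        return None
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p
    dest = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, dest)                   # cp
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)         # chmod 600
    return dest

# Demo with temporary directories and a placeholder token
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "drive" / "kaggle.json"
    fake_token.parent.mkdir()
    fake_token.write_text('{"username": "demo", "key": "not-a-real-key"}')

    installed = install_kaggle_token(fake_token, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(installed.stat().st_mode)
    missing = install_kaggle_token(Path(tmp) / "absent.json", Path(tmp) / ".kaggle")
    print(installed.name, oct(mode), missing)
```

Returning `None` for a missing token mirrors the early `return` in `setup_kaggle_from_drive()`, so a caller can decide whether to stop or to prompt for an upload.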


## 4. Data Processing
The data comes from two different sources with distinct directory structures. In this section, it is unified into a single DataFrame.

**Process:**
1.  **Cleanup:** Remove temporary `.zip` files to free up disk space.
2.  **Path Extraction:** Traverse the directory trees of both datasets.
3.  **Labeling:** Assign standardized labels (`COVID-19`, `NORMAL`, `PNEUMONIA`) based on the folder source.
4.  **Consolidation:** Create a Pandas DataFrame containing all file paths and labels for easy manipulation.

### 4.1 Initialization and Environment Cleanup
Before processing, the environment is prepared by importing the necessary libraries (`pathlib`, `os`). The source `.zip` archives are also removed from the working directory to free disk space.

Global lists `image_paths` and `labels` are initialized here to serve as the accumulators for the subsequent data extraction steps.

In [None]:
import os
from pathlib import Path

# Remove source zip files to save space
if os.path.exists('chest-xray-pneumonia.zip'):
    !rm *.zip
    print("Temporary zip files removed.")

# Initialize storage for the unified dataset
image_paths = []
labels = []
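
The cleanup cell checks for one archive and then removes every `.zip` file through the shell. As an alternative sketch, the same cleanup can be written with `pathlib` alone, removing whichever archives happen to be present. `remove_zip_archives` is a hypothetical helper, and the demo runs in a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path

def remove_zip_archives(directory):
    """Delete every .zip file directly inside `directory`; return the removed names."""
    removed = []
    for archive in sorted(Path(directory).glob("*.zip")):
        archive.unlink()
        removed.append(archive.name)
    return removed

# Demo: two placeholder archives and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.zip", "b.zip", "keep.txt"):
        (Path(tmp) / name).write_text("placeholder")

    removed = remove_zip_archives(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print("removed:", removed)
    print("remaining:", remaining)
```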

### 4.2 Processing COVID-19 Radiography Dataset
This dataset is organized into four distinct folders: `COVID`, `Normal`, `Lung_Opacity`, and `Viral Pneumonia`.

**Class Selection:**

1.  **`COVID` (Selected):** Represents the positive class for the detection target.
2.  **`Normal` (Selected):** These images supplement the `NORMAL` class from the second dataset, combining healthy samples from both sources.
3.  **`Lung_Opacity` (Excluded):** This folder is skipped. "Lung Opacity" is a general label that can point to pneumonia, pulmonary edema, lung cancer, or other conditions, so keeping it as a separate class alongside "Pneumonia" and "COVID-19" would add ambiguity.
4.  **`Viral Pneumonia` (Selected):** This folder is merged into the general `PNEUMONIA` class. These 1,345 additional samples increase the diversity of the pneumonia data.

In [None]:
# COVID-19 Radiography Dataset
# Extract 'COVID', 'Normal', and 'Viral Pneumonia' ('Lung_Opacity' is skipped)
print("Processing COVID-19 Radiography Dataset...")

# Class: COVID-19
covid_path = Path("COVID-19_Radiography_Dataset/COVID/images")
covid_files = list(covid_path.glob('*.*'))  # any file whose name contains a dot
image_paths.extend([str(path) for path in covid_files])
labels.extend(['COVID-19'] * len(covid_files))
print(f"-> Added {len(covid_files)} COVID-19 images.")

# Class: Normal
normal_covid_path = Path("COVID-19_Radiography_Dataset/Normal/images")
normal_covid_files = list(normal_covid_path.glob('*.*'))
image_paths.extend([str(path) for path in normal_covid_files])
labels.extend(['NORMAL'] * len(normal_covid_files))
print(f"-> Added {len(normal_covid_files)} Normal images from COVID-19 dataset")

# Class: Viral Pneumonia
# These images are added to the PNEUMONIA label to increase its sample size
viral_pneumo_path = Path("COVID-19_Radiography_Dataset/Viral Pneumonia/images")
viral_files = list(viral_pneumo_path.glob('*.*'))
image_paths.extend([str(path) for path in viral_files])
labels.extend(['PNEUMONIA'] * len(viral_files))
print(f"-> Added {len(viral_files)} Viral Pneumonia images to the PNEUMONIA class.")

Processing COVID-19 Radiography Dataset...
-> Added 3616 COVID-19 images.
-> Added 10192 Normal images from COVID-19 dataset
-> Added 1345 Viral Pneumonia images to the PNEUMONIA class.
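
The three blocks in the cell above follow the same pattern: pick a folder, list its images, and attach a label. One way to express the class selection as data is sketched below. `FOLDER_TO_LABEL`, `IMAGE_EXTENSIONS`, and `collect_labeled_images` are hypothetical names, the extension filter is an assumption (the notebook uses `glob('*.*')`), and the demo builds a tiny fake directory tree instead of reading the real dataset.

```python
import tempfile
from pathlib import Path

# Folder name in the COVID-19 Radiography Dataset -> unified label.
# "Lung_Opacity" is intentionally absent, so it is never read.
FOLDER_TO_LABEL = {
    "COVID": "COVID-19",
    "Normal": "NORMAL",
    "Viral Pneumonia": "PNEUMONIA",
}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed set of image types

def collect_labeled_images(root, folder_to_label):
    """Return (paths, labels) for images found under root/<folder>/images."""
    found_paths, found_labels = [], []
    for folder, label in folder_to_label.items():
        image_dir = Path(root) / folder / "images"
        files = sorted(
            p for p in image_dir.glob("*") if p.suffix.lower() in IMAGE_EXTENSIONS
        )
        found_paths.extend(str(p) for p in files)
        found_labels.extend([label] * len(files))
    return found_paths, found_labels

# Demo on a fake tree with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    fake_tree = {
        "COVID": ["c1.png", "c2.png"],
        "Normal": ["n1.png"],
        "Viral Pneumonia": ["v1.png", "notes.txt"],
        "Lung_Opacity": ["l1.png"],
    }
    for folder, names in fake_tree.items():
        image_dir = Path(tmp) / folder / "images"
        image_dir.mkdir(parents=True)
        for name in names:
            (image_dir / name).write_bytes(b"")

    demo_paths, demo_labels = collect_labeled_images(tmp, FOLDER_TO_LABEL)
    print(demo_labels)
```

Keeping the mapping in one dictionary makes the exclusion of `Lung_Opacity` and the merge of `Viral Pneumonia` into `PNEUMONIA` visible at a glance.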


### 4.3 Processing Chest X-Ray (Pneumonia) Dataset
This dataset follows a standard machine learning directory structure, split into `train`, `test`, and `val` folders, each containing subdirectories for `PNEUMONIA` and `NORMAL`.

**Extraction Strategy:**
Because a custom split is created later from the unified DataFrame, the code ignores the original pre-split structure. It iterates over all three folders (`train`, `test`, `val`) and aggregates every image into the shared lists.



In [None]:
# Chest X-Ray (Pneumonia) Dataset
print("\n--- Processing Chest X-Ray (Pneumonia) Dataset ---")
pneumonia_path = Path("chest_xray")

# Initialize counters
count_normal = 0
count_pneumonia = 0

# Loop through the training, testing, and validation folders
for split in ['train', 'test', 'val']:
    # Loop through both classes inside each folder
    for class_name in ['NORMAL', 'PNEUMONIA']:
        class_path = pneumonia_path / split / class_name

        # Search for all possible image extensions to avoid missing any files
        files = list(class_path.glob('*.jpeg')) + \
                list(class_path.glob('*.jpg')) + \
                list(class_path.glob('*.png'))

        # Add them to the main lists
        image_paths.extend([str(path) for path in files])
        labels.extend([class_name] * len(files))

        # Update specific counters
        if class_name == 'NORMAL':
            count_normal += len(files)
        elif class_name == 'PNEUMONIA':
            count_pneumonia += len(files)

        print(f"-> Added {len(files)} images for class '{class_name}' from '{split}' set.")

# Final report for this specific source
print("-> Processed all images from Chest X-Ray source.")
print(f"   - Total PNEUMONIA added: {count_pneumonia}")
print(f"   - Total NORMAL added:    {count_normal}")


--- Processing Chest X-Ray (Pneumonia) Dataset ---
-> Added 1341 images for class 'NORMAL' from 'train' set.
-> Added 3875 images for class 'PNEUMONIA' from 'train' set.
-> Added 234 images for class 'NORMAL' from 'test' set.
-> Added 390 images for class 'PNEUMONIA' from 'test' set.
-> Added 8 images for class 'NORMAL' from 'val' set.
-> Added 8 images for class 'PNEUMONIA' from 'val' set.
-> Processed all images from Chest X-Ray source.
   - Total PNEUMONIA added: 4273
   - Total NORMAL added:    1583
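
The reported totals can be cross-checked against the per-split lines. The short sketch below copies the six counts from the output above into a dictionary (`split_counts` is an illustrative name) and sums them per class.

```python
# Per-split counts copied from the cell output
split_counts = {
    "train": {"NORMAL": 1341, "PNEUMONIA": 3875},
    "test": {"NORMAL": 234, "PNEUMONIA": 390},
    "val": {"NORMAL": 8, "PNEUMONIA": 8},
}

totals = {"NORMAL": 0, "PNEUMONIA": 0}
for counts in split_counts.values():
    for class_name, n_images in counts.items():
        totals[class_name] += n_images

print(totals)                # {'NORMAL': 1583, 'PNEUMONIA': 4273}
print(sum(totals.values()))  # 5856
```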


### 4.4 DataFrame Consolidation and Shuffling
Finally, the accumulated file paths and their corresponding labels are merged into a Pandas DataFrame.

**Data Shuffling:**

Here, the DataFrame is shuffled with `sample(frac=1)`. As loaded, the rows are grouped by source folder and therefore by class (for example, all COVID-19 images first, then Normal, then Pneumonia). If the model were trained in that order, each batch would contain essentially one class, so the model could struggle to learn and end up fitting one class at a time.

By randomizing the order, each training batch is likely to contain a mix of classes.
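
To illustrate the effect on batch composition, the toy sketch below uses Python's standard `random` module as a stand-in for `DataFrame.sample`; it is not the notebook's code. All names (`toy_labels`, `toy_paths`, `batches`) are hypothetical, and the twelve file names are made up.

```python
import random

# Toy stand-in for the unshuffled lists: rows grouped by class, as after loading
toy_labels = ["COVID-19"] * 4 + ["NORMAL"] * 4 + ["PNEUMONIA"] * 4
toy_paths = [f"img_{i}.png" for i in range(len(toy_labels))]
rows = list(zip(toy_paths, toy_labels))

rng = random.Random(42)   # local generator with a fixed seed
shuffled = rows[:]        # copy, so the original order stays available
rng.shuffle(shuffled)

def batches(items, size):
    """Split a list into consecutive chunks of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def classes_per_batch(items, size):
    """List the distinct classes that appear in each batch."""
    return [sorted({label for _, label in batch}) for batch in batches(items, size)]

print("grouped: ", classes_per_batch(rows, 4))
print("shuffled:", classes_per_batch(shuffled, 4))
```

In the grouped order every batch of four holds a single class, while the shuffled order typically mixes them. Each path stays paired with its label because whole rows are permuted, which is also what shuffling a DataFrame row-wise does.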


In [None]:
# Create DataFrame
df = pd.DataFrame({'filepath': image_paths, 'label': labels})

# Shuffle the rows so that classes are interleaved instead of grouped
# .sample(frac=1): returns 100% of the rows in random order.
# random_state=SEED: reuses the global seed (42) so the shuffle is reproducible.
# .reset_index(drop=True): re-indexes the rows from 0 to N-1 and discards the old index.
df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(f"DataFrame Created. Total Samples: {len(df)}")
print("-" * 50)
print(df['label'].value_counts())
print("-" * 50)
display(df.head())

DataFrame Created. Total Samples: 21009
--------------------------------------------------
label
NORMAL       11775
PNEUMONIA     5618
COVID-19      3616
Name: count, dtype: int64
--------------------------------------------------


| | filepath | label |
|---|---|---|
| 0 | COVID-19_Radiography_Dataset/Normal/images/Nor... | NORMAL |
| 1 | COVID-19_Radiography_Dataset/Normal/images/Nor... | NORMAL |
| 2 | COVID-19_Radiography_Dataset/COVID/images/COVI... | COVID-19 |
| 3 | COVID-19_Radiography_Dataset/Normal/images/Nor... | NORMAL |
| 4 | COVID-19_Radiography_Dataset/Normal/images/Nor... | NORMAL |
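
The `value_counts()` output already hints at class imbalance. As a small worked calculation, the sketch below takes the three printed counts and derives each class's share of the dataset plus the ratio between the largest and smallest class. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output
class_counts = {"NORMAL": 11775, "PNEUMONIA": 5618, "COVID-19": 3616}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"{label:<10} {share:6.1%}")
print(f"total samples: {total}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

`NORMAL` accounts for a little over half of the 21,009 samples and is roughly 3.3 times as frequent as `COVID-19`, which is the kind of bias the EDA objective asks to detect.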
