# Pneumonia Detection: CNN from Scratch
## Thesis Section: Baseline Model
This notebook trains a convolutional neural network (CNN) from scratch to detect pneumonia in chest X-ray images from the Kaggle Chest X-Ray Pneumonia dataset ([Kaggle link](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia)). It establishes the baseline against which the transfer learning approaches in the thesis are compared. The working expectation is that a small network trained from scratch on a dataset of this size reaches only moderate accuracy.

The dataset is listed as containing 5,863 images; the split counts used in this notebook sum to 5,856 (train: 1,341 normal and 3,875 pneumonia; validation: 16; test: 624). Because the training set is imbalanced, we apply class weights during training. Data augmentation is used to improve generalization, and the model is evaluated with accuracy, precision, recall, and F1-score, alongside training curves and a confusion matrix.

In [None]:
# Install Kaggle API to download dataset
!pip install -q kaggle

# Upload kaggle.json file (from your Kaggle account)
from google.colab import files
files.upload()  # Upload kaggle.json

# Set up Kaggle directory and permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the chest X-ray pneumonia dataset
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
!unzip -q chest-xray-pneumonia.zip -d chest_xray

## Data Exploration
We explore the dataset to understand its structure and class distribution. The training set is imbalanced, with roughly 2.9 times as many pneumonia images as normal ones (3,875 vs. 1,341), which motivates the class weights used during training.

In [None]:
import os

# Define dataset paths
base_dir = 'chest_xray/chest_xray'
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')

# Count images in each class
train_normal = len(os.listdir(os.path.join(train_dir, 'NORMAL')))
train_pneumonia = len(os.listdir(os.path.join(train_dir, 'PNEUMONIA')))
print(f'Training set: {train_normal} normal, {train_pneumonia} pneumonia images')
print(f'Validation set: {len(os.listdir(os.path.join(val_dir, "NORMAL")))} normal, {len(os.listdir(os.path.join(val_dir, "PNEUMONIA")))} pneumonia images')
print(f'Test set: {len(os.listdir(os.path.join(test_dir, "NORMAL")))} normal, {len(os.listdir(os.path.join(test_dir, "PNEUMONIA")))} pneumonia images')
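
The counts above rely on `os.listdir`, which also counts any non-image file that happens to sit in a class folder. As a minimal sketch, assuming the images use `.jpeg`, `.jpg`, or `.png` suffixes, the hypothetical helpers `count_images` and `imbalance_ratio` below (not part of the original notebook) count only image files and quantify the imbalance:

```python
from pathlib import Path

# Assumed set of image suffixes; adjust if the dataset uses others.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png"}


def count_images(directory):
    """Count regular files in `directory` whose suffix looks like an image."""
    return sum(
        1
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_EXTENSIONS
    )


def imbalance_ratio(n_majority, n_minority):
    """Return how many majority-class samples exist per minority-class sample."""
    return n_majority / n_minority


# Ratio for the training counts quoted in the introduction.
train_ratio = imbalance_ratio(3875, 1341)
print(f"Pneumonia-to-normal ratio in training set: {train_ratio:.2f}")
```

In the notebook, `count_images(os.path.join(train_dir, 'NORMAL'))` could stand in for the `len(os.listdir(...))` calls.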

## Data Preprocessing
We use `ImageDataGenerator` to rescale pixel values to [0, 1] and, for the training set only, to apply augmentation (rotation up to 15 degrees, zoom up to 10%, horizontal flip) to improve generalization. Validation and test images are only rescaled. All images are resized to 150x150 pixels to reduce computational load.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Parameters
img_height, img_width = 150, 150
batch_size = 32

# Data generators with augmentation for training
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=15,
    zoom_range=0.1,
    horizontal_flip=True
)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# Load data
train_data = train_datagen.flow_from_directory(
    train_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary'
)
val_data = val_datagen.flow_from_directory(
    val_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary'
)
test_data = test_datagen.flow_from_directory(
    test_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False
)
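
With `class_mode='binary'`, the generators above derive labels from the subdirectory names, assigned in alphanumeric order. The sketch below mimics that assignment in plain Python to show why `NORMAL` becomes 0 and `PNEUMONIA` becomes 1. The helper name `directory_class_indices` is made up for illustration; in the notebook the actual mapping can be read from `train_data.class_indices`.

```python
def directory_class_indices(class_names):
    """Mimic label assignment by sorting class directory names."""
    return {name: index for index, name in enumerate(sorted(class_names))}


mapping = directory_class_indices(["PNEUMONIA", "NORMAL"])
print(mapping)  # {'NORMAL': 0, 'PNEUMONIA': 1}
```

This ordering is what makes the sigmoid output interpretable as the probability of pneumonia, and it matches the `['Normal', 'Pneumonia']` label order used later in the classification report.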

## Class Weights
Due to the imbalanced dataset, we compute class weights to assign higher importance to the minority class (normal images) during training.

In [None]:
from sklearn.utils import class_weight
import numpy as np

# Compute class weights
labels = train_data.classes
weights = class_weight.compute_class_weight('balanced', classes=np.unique(labels), y=labels)
class_weights = dict(enumerate(weights))
print('Class weights:', class_weights)
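
The `'balanced'` mode weights each class by `n_samples / (n_classes * class_count)`. As a rough check on the printed values, the sketch below reproduces that formula in plain Python using the training counts quoted in the introduction; `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Compute n_samples / (n_classes * count) for every class label."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Label 0 = normal (1,341 images), label 1 = pneumonia (3,875 images).
expected_weights = balanced_class_weights({0: 1341, 1: 3875})
for label, weight in expected_weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

Under this scheme each class contributes the same total weight to the loss, so the minority (normal) class receives the larger per-sample weight.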

## Model Architecture
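We define a simple CNN with three convolutional blocks (32, 64, and 128 filters with 3x3 kernels and ReLU activations, each followed by 2x2 max-pooling), a flatten layer, a 128-unit dense layer with dropout (0.5) for regularization, and a single sigmoid output for binary classification. The model is compiled with the Adam optimizer and binary cross-entropy loss.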

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Define CNN model
model = Sequential([
    Conv2D(32, (3,3), activation='relu', input_shape=(img_height, img_width, 3)),
    MaxPooling2D(2,2),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Conv2D(128, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
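
To make the `model.summary()` output easier to sanity-check, the sketch below derives the feature-map sizes and parameter counts by hand. It assumes the layer defaults used above: 3x3 kernels with stride 1 and no padding, and 2x2 pooling that rounds down.

```python
def conv_output(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after non-overlapping max-pooling (rounds down)."""
    return size // pool


size, channels = 150, 3  # input: 150x150 RGB
total_params = 0

for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias.
    total_params += (3 * 3 * channels + 1) * filters
    size = pool_output(conv_output(size))
    channels = filters
    print(f"after block with {filters} filters: {size}x{size}x{channels}")

flat_features = size * size * channels
total_params += (flat_features + 1) * 128  # Dense(128)
total_params += (128 + 1) * 1              # Dense(1)

print(f"flattened features: {flat_features}")
print(f"total trainable parameters: {total_params}")
```

Nearly all of the parameters sit in the first dense layer, because the flatten step hands it a 17x17x128 feature map.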

## Training
The model is trained for 20 epochs with class weights to address imbalance. Validation performance is monitored, though the small validation set (16 images) may lead to noisy metrics.

In [None]:
# Train model
epochs = 20
history = model.fit(
    train_data,
    epochs=epochs,
    validation_data=val_data,
    class_weight=class_weights
)
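
The noise in the validation metrics can be quantified with a back-of-the-envelope estimate. The sketch below treats accuracy as a binomial proportion and computes its standard error for 16 validation images versus 624 test images. The 0.8 accuracy is an assumed value chosen for illustration, not a measured result.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


assumed_accuracy = 0.8  # illustrative assumption, not a result from this notebook

val_step = 1 / 16  # one image flips validation accuracy by this much
se_val = accuracy_standard_error(assumed_accuracy, 16)
se_test = accuracy_standard_error(assumed_accuracy, 624)

print(f"validation accuracy moves in steps of {val_step:.4f}")
print(f"standard error with 16 images:  {se_val:.3f}")
print(f"standard error with 624 images: {se_test:.3f}")
```

With 16 images, a single misclassification shifts validation accuracy by more than six percentage points, so epoch-to-epoch swings in the validation curve carry little information.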

## Evaluation and Visualization
We evaluate the model on the test set and visualize training/validation accuracy and loss, as well as a confusion matrix to assess performance across classes.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Evaluate on test set
test_loss, test_acc = model.evaluate(test_data)
print(f'Test Accuracy: {test_acc:.3f}')

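# Generate confusion matrix
test_data.reset()  # restart the generator so predictions align with test_data.classes
y_pred = (model.predict(test_data) > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,)
y_true = test_data.classes
cm = confusion_matrix(y_true, y_pred)
class_names = ['Normal', 'Pneumonia']  # label order: 0 = NORMAL, 1 = PNEUMONIA
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)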
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Classification report
print(classification_report(y_true, y_pred, target_names=['Normal', 'Pneumonia']))
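
To show how the figures in the classification report follow from the confusion matrix, the sketch below computes accuracy, precision, recall, and F1 for the positive (pneumonia) class by hand. The matrix values are made-up placeholders, not results from this model; the layout follows scikit-learn's convention of true labels on rows and predicted labels on columns.

```python
def binary_metrics(cm):
    """Derive metrics for the positive class from a 2x2 confusion matrix.

    cm is [[TN, FP], [FN, TP]] (rows: true label, columns: predicted label).
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
example_cm = [[50, 10], [5, 35]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a screening task such as pneumonia detection, recall on the pneumonia class is often the figure to watch, since a false negative means a missed case.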

## Model Saving
The trained model is saved for integration into the web service backend and downloaded for local use.

In [None]:
# Save and download model
model.save('pneumonia_cnn.h5')
files.download('pneumonia_cnn.h5')

## Discussion
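The CNN trained from scratch is expected to reach only moderate test accuracy (roughly 80-85%), which we attribute to the limited model depth and dataset size. Because the validation set holds only 16 images, the validation curves are noisy and should not be read as a reliable estimate of generalization; the test-set metrics are the more trustworthy figure. This baseline motivates a more advanced approach, such as transfer learning, to improve accuracy for the web service.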