# Pneumonia Detection: Transfer Learning with ResNet50
## Thesis Section: Improved Model
This notebook applies transfer learning with ResNet50, pre-trained on ImageNet, to detect pneumonia in chest X-ray images from the Kaggle Chest X-Ray Pneumonia dataset ([Kaggle link](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia)). By reusing pre-trained features and fine-tuning the upper layers, the approach aims for higher accuracy (target: 90-95%) than the CNN trained from scratch, which would make it suitable for the web service backend.

The original train and val folders are merged and re-split 80/10/10 (train/validation/test), because the original validation set (8 normal, 8 pneumonia) is too small to give stable metrics. Class imbalance is handled with adjusted class weights, and the model is trained with aggressive augmentation while more ResNet50 layers are fine-tuned.

In [None]:
# Install Kaggle API to download dataset
!pip install -q kaggle

# Upload kaggle.json file
from google.colab import files
files.upload()  # Upload kaggle.json

# Set up Kaggle directory and permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the chest X-ray pneumonia dataset
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
!unzip -q chest-xray-pneumonia.zip -d chest_xray

## Data Exploration and Splitting
Merge train and val sets, then create a new 80/10/10 split.

In [None]:
import os
import glob
from sklearn.model_selection import train_test_split

# Define base directory
base_dir = 'chest_xray/chest_xray'
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')

# Collect all file paths
normal_files = glob.glob(os.path.join(train_dir, 'NORMAL', '*.jpeg')) + glob.glob(os.path.join(val_dir, 'NORMAL', '*.jpeg'))
pneumonia_files = glob.glob(os.path.join(train_dir, 'PNEUMONIA', '*.jpeg')) + glob.glob(os.path.join(val_dir, 'PNEUMONIA', '*.jpeg'))
all_files = normal_files + pneumonia_files
labels = [0] * len(normal_files) + [1] * len(pneumonia_files)

# First split: hold out 10% of all images as the test set
train_val_files, test_files, train_val_labels, test_labels = train_test_split(
    all_files, labels, test_size=0.1, stratify=labels, random_state=42
)

# Second split: validation is 1/9 of the remaining 90% (10% of all images),
# which leaves 80% of all images for training
train_files, val_files, train_labels, val_labels = train_test_split(
    train_val_files, train_val_labels, test_size=1/9, stratify=train_val_labels, random_state=42
)
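
The overall proportions produced by two chained `train_test_split` calls can be checked with a little arithmetic. The sketch below is illustrative only: `overall_proportions` is a hypothetical helper, not part of the notebook, and it simply converts the two `test_size` values into overall train/validation/test fractions.

```python
def overall_proportions(first_test_size, second_test_size):
    """Return overall (train, val, test) fractions for a two-stage split.

    first_test_size: fraction of all samples held out as the test set.
    second_test_size: fraction of the remainder used for validation.
    """
    test = first_test_size
    remainder = 1.0 - first_test_size
    val = remainder * second_test_size
    train = remainder - val
    return train, val, test


# An 80/10/10 split needs a second-stage fraction of 0.1 / 0.9 = 1/9
train, val, test = overall_proportions(0.1, 1 / 9)
print(f"train={train:.3f}, val={val:.3f}, test={test:.3f}")
```

Other second-stage values shift the balance between training and validation; for example, 2/9 gives roughly 70/20/10.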

## Data Preprocessing
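Images are resized to 224x224. The training generator applies enhanced augmentation (rotation, zoom, shifts, flips, shear, brightness) for robustness; validation and test images are only rescaled.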

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Parameters
img_height, img_width = 224, 224
batch_size = 128

# Data generators with enhanced augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    zoom_range=0.2,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
    shear_range=0.2,
    brightness_range=[0.8, 1.2]
)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

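# Build DataFrames for the new split.
# flow_from_dataframe with class_mode='binary' requires string labels,
# so the integer labels are converted ('0' = normal, '1' = pneumonia).
import pandas as pd
train_df = pd.DataFrame({'filename': train_files, 'class': [str(label) for label in train_labels]})
val_df = pd.DataFrame({'filename': val_files, 'class': [str(label) for label in val_labels]})
test_df = pd.DataFrame({'filename': test_files, 'class': [str(label) for label in test_labels]})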

train_data = train_datagen.flow_from_dataframe(
    train_df,
    x_col='filename',
    y_col='class',
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary'
)
val_data = val_datagen.flow_from_dataframe(
    val_df,
    x_col='filename',
    y_col='class',
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary'
)
test_data = test_datagen.flow_from_dataframe(
    test_df,
    x_col='filename',
    y_col='class',
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False
)

## Class Weights
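Balanced class weights are computed from the training labels, and the pneumonia weight is then manually raised to 0.8. Pneumonia is the majority class, so its balanced weight is lower than that.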

In [None]:
from sklearn.utils import class_weight
import numpy as np

# Compute balanced class weights from the training labels
weights = class_weight.compute_class_weight('balanced', classes=np.unique(train_labels), y=train_labels)
# Keep the balanced weight for normal (0); manually override the pneumonia (1) weight
class_weights = {0: weights[0], 1: 0.8}
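
For reference, scikit-learn's `'balanced'` mode weights each class by `n_samples / (n_classes * class_count)`. A minimal pure-Python sketch of that formula follows; `balanced_class_weights` is a hypothetical helper, and the toy labels are made up rather than taken from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Hypothetical toy labels: 25 normal (0) and 75 pneumonia (1)
toy_labels = [0] * 25 + [1] * 75
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)  # the minority class (0) receives the larger weight
```

With a 25/75 imbalance the minority class is weighted 2.0 and the majority class about 0.67, so errors on the rarer class contribute more to the loss.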

## Model Architecture
ResNet50 with its last 100 layers unfrozen for fine-tuning, a small classification head, and learning-rate reduction on plateau.

In [None]:
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Load ResNet50 base model
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(img_height, img_width, 3))

# Unfreeze last 100 layers for fine-tuning
for layer in base_model.layers[:-100]:
    layer.trainable = False
for layer in base_model.layers[-100:]:
    layer.trainable = True

# Add custom head
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(1, activation='sigmoid')(x)

# Create model
model = Model(inputs=base_model.input, outputs=predictions)

# Compile model
model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])

# Callbacks
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-6)
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Model summary
model.summary()
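
To see how far the `ReduceLROnPlateau` settings can lower the learning rate, the sequence of possible reductions can be listed directly. This is a simplified sketch that only multiplies by `factor` and clips at `min_lr`; it does not model patience or the monitored metric, and `lr_ladder` is a hypothetical helper.

```python
def lr_ladder(initial_lr, factor, min_lr):
    """List the learning rates reached by repeated plateau reductions."""
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    if min_lr <= 0:
        raise ValueError("min_lr must be positive for the ladder to terminate")
    rates = [initial_lr]
    while rates[-1] > min_lr:
        # Each reduction multiplies by factor, but never goes below min_lr
        rates.append(max(rates[-1] * factor, min_lr))
    return rates


# Values taken from the compile step and the ReduceLROnPlateau callback
for step, lr in enumerate(lr_ladder(1e-4, 0.2, 1e-6)):
    print(f"after {step} reduction(s): lr = {lr:.1e}")
```

With these settings the floor of 1e-6 is reached after three reductions. Because early stopping uses a patience of 5 and the scheduler a patience of 3, a single uninterrupted plateau would trigger roughly one reduction before training stops.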

## Model Training

In [None]:
# Train model
history = model.fit(
    train_data,
    epochs=20,
    validation_data=val_data,
    class_weight=class_weights,
    callbacks=[reduce_lr, early_stopping]
)
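
With `restore_best_weights=True`, the weights kept when early stopping triggers come from the epoch with the lowest `val_loss`. A small sketch for locating that epoch is shown below; `best_epoch` is a hypothetical helper and the history values are made up. In the notebook, `history.history` would be passed instead.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) where the metric is lowest."""
    values = history_dict[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Hypothetical history values, for illustration only
toy_history = {"val_loss": [0.42, 0.31, 0.28, 0.30, 0.33]}
epoch, value = best_epoch(toy_history)
print(f"Lowest val_loss {value:.2f} at epoch {epoch}")
```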

## Evaluation and Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Plot combined training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

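# Evaluate on test set (the Keras accuracy metric uses the default 0.5 threshold)
test_loss, test_acc = model.evaluate(test_data)
print(f'Test Accuracy: {test_acc:.3f}')

# Generate confusion matrix with a lowered decision threshold of 0.4,
# so these counts can differ from the accuracy reported above
y_pred = (model.predict(test_data) > 0.4).astype(int).ravel()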
y_true = test_data.classes
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Classification report
print(classification_report(y_true, y_pred, target_names=['Normal', 'Pneumonia']))
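
The effect of lowering the decision threshold from 0.5 to 0.4 can be illustrated without the model. The sketch below uses made-up labels and probabilities (not results from this notebook) and two hypothetical helpers to show how recall and precision move as the threshold changes.

```python
def confusion_counts(y_true, probabilities, threshold):
    """Return (tn, fp, fn, tp) for binary labels at a decision threshold."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, probabilities):
        predicted = 1 if prob > threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def recall_precision(counts):
    """Compute (recall, precision) from (tn, fp, fn, tp)."""
    tn, fp, fn, tp = counts
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Made-up labels (1 = pneumonia) and predicted probabilities
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_prob = [0.10, 0.45, 0.30, 0.42, 0.48, 0.70, 0.90]

for threshold in (0.5, 0.4):
    recall, precision = recall_precision(confusion_counts(toy_true, toy_prob, threshold))
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case the lower threshold catches every positive at the cost of one extra false positive, which is the trade-off a recall-oriented screening threshold accepts.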

## Model Saving
The fine-tuned model is saved in the Keras format and downloaded for backend integration.

In [None]:
# Save and download model
model.save('resnet50_pneumonia_optimized.keras')
files.download('resnet50_pneumonia_optimized.keras')
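
On the backend side, the saved file could be loaded and applied to a single image roughly as follows. This is a sketch under assumptions, not the actual service code: `predict_image` and `label_from_probability` are hypothetical names, and the preprocessing is assumed to mirror the training generators (224x224, pixel values scaled by 1/255) with the same 0.4 threshold used in evaluation.

```python
IMG_SIZE = (224, 224)  # same target size as in training
THRESHOLD = 0.4        # same decision threshold as in the evaluation cell


def label_from_probability(probability, threshold=THRESHOLD):
    """Map the sigmoid output to a class name (1 = pneumonia)."""
    return "PNEUMONIA" if probability > threshold else "NORMAL"


def predict_image(model_path, image_path):
    """Load the saved model and classify a single image file."""
    # Imported here so the helper above stays usable without TensorFlow installed
    import tensorflow as tf

    model = tf.keras.models.load_model(model_path)
    image = tf.keras.utils.load_img(image_path, target_size=IMG_SIZE)
    array = tf.keras.utils.img_to_array(image) / 255.0  # match rescale=1./255
    probability = float(model.predict(array[None, ...])[0][0])
    return label_from_probability(probability), probability
```

A call might look like `predict_image('resnet50_pneumonia_optimized.keras', 'xray.jpeg')`, where `xray.jpeg` is a placeholder path. A real service would likely load the model once at startup rather than on every request.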

## Discussion
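The optimized ResNet50 model combines a larger validation set, adjusted class weights, and deeper fine-tuning, and targets 90-95% accuracy. The new split should make validation metrics more stable than the original 16-image validation set allowed, while the raised pneumonia weight and the lowered decision threshold (0.4) are intended to favor pneumonia recall. One limitation is potential over-augmentation, such as vertical flips, which are unlikely in real chest X-rays; this should be monitored by comparing the training and validation curves.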