# Chest X-Ray Abnormality Detection (Multi-Label CNN)

KENDALL MCNEIL

November 2023

insert image collage here

DESCRIPTION: Chest radiographs are among the most challenging medical images to interpret, and misreads occur even among seasoned healthcare providers. A strong convolutional neural network (CNN) that detects common thoracic abnormalities in chest X-rays could improve diagnostic accuracy and support earlier detection, which can ultimately save lives. The CNN is intended to act as an automated second opinion for radiologists reviewing chest X-rays for abnormalities. The work product aims to reduce the load on busy doctors and healthcare providers while giving patients a more accurate and efficient diagnosis.

OBJECTIVE: The objective is to detect 14 common thoracic abnormalities in chest X-rays by building a multi-label convolutional neural network (CNN). The model was designed using TensorFlow.

AUDIENCES: The general target audience for the project is healthcare providers. The more specific presentation audience is the Vingroup Big Data Institute (VinBigData), which is working to build large-scale, high-precision medical imaging solutions based on the latest advancements in AI to facilitate efficient clinical workflows.

DATA: The dataset includes 18,000 DICOM images and was created by assembling de-identified chest X-ray studies provided by two hospitals in Vietnam: Hospital 108 and the Hanoi Medical University Hospital.

# A. Imports and Setup

In [1]:
cd

C:\Users\Jackson


In [2]:
cd Documents\flatiron\CAPSTONE\data

C:\Users\Jackson\Documents\flatiron\CAPSTONE\data


In [3]:
#basic imports 
import numpy as np
import pandas as pd
import os
import shutil
import random
import glob
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

#tensorflow imports for CNN image classification project 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization, Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator

#additional plotting imports
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

First, let's read in the data. We'll store the full dataset as `original_df`, then create a dataframe with only the image ID and class ID.

In [None]:
original_df = pd.read_csv('train.csv')
df = original_df[['image_id','class_id']]

# B. Data Cleaning

Let's create a legend for the class names and class ids.

In [51]:
class_df = pd.DataFrame({'Number': list(range(15)),
    'Class_Name': ['Aortic enlargement','Atelectasis','Calcification','Cardiomegaly','Consolidation','ILD',
    'Infiltration','Lung Opacity','Nodule/Mass','Other lesion','Pleural effusion','Pleural thickening','Pneumothorax',
    'Pulmonary fibrosis','No finding']})
class_df

Unnamed: 0,Number,Class_Name
0,0,Aortic enlargement
1,1,Atelectasis
2,2,Calcification
3,3,Cardiomegaly
4,4,Consolidation
5,5,ILD
6,6,Infiltration
7,7,Lung Opacity
8,8,Nodule/Mass
9,9,Other lesion


How many of each class are in our dataset?

In [None]:
class_counts_df = df['class_id'].value_counts()
class_counts_df

The distribution looks okay, although there may be a class imbalance that rears its head later. Let's peek at the image ID value counts.

In [None]:
df['image_id'].value_counts().head(55)

That's odd. Some image IDs have over 50 counts. Let's check on duplicates.

In [None]:
df.duplicated().sum()

Wow. There are a lot of duplicates. Let's drop them.

In [None]:
df = df.drop_duplicates()
df['image_id'].value_counts().head(55)

Much better. Let's also check that there are no duplicate photos in the images folder just to be sure.

In [58]:
folder_path = 'images'

# Create a dictionary to store encountered file names.
file_names = {}

# Iterate through the files in the folder.
for filename in os.listdir(folder_path):
    # Check if the file is a regular file (not a subdirectory).
    if os.path.isfile(os.path.join(folder_path, filename)):
        # Check if the file name has been encountered before.
        if filename in file_names:
            print(f'Duplicate file name: {filename}')
            print(f'First occurrence: {file_names[filename]}')
            print(f'Second occurrence: {os.path.join(folder_path, filename)}')
        else:
            # Store the file name and its full path for future reference.
            file_names[filename] = os.path.join(folder_path, filename)

Great. There are no duplicate photos.
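
File names inside a single folder are always unique, so a name-based check cannot surface two files that hold the same image under different names. One complementary check is to compare file contents by hash. The following is a minimal sketch, not part of the original workflow; the helper name `find_duplicate_images` is hypothetical, and it assumes the images fit comfortably in memory one at a time.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_images(folder_path):
    """Return groups of file names in folder_path whose bytes are identical."""
    by_digest = defaultdict(list)
    for filename in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, filename)
        # Skip subdirectories; only regular files are hashed.
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        by_digest[digest].append(filename)
    # Keep only digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]


# Usage (assumes the notebook's 'images' folder exists in the working directory).
if os.path.isdir('images'):
    print(find_duplicate_images('images'))
```

An empty list means no two files share identical bytes; it does not rule out the same radiograph saved with different encodings.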

# C. Data Preprocessing

The next few cells are outdated as I was working to organize the data. I created a function that would create class id folders and copy images respectively into them, but then realized there are more efficient and storage-conscious ways to organize the data.

Additionally, it is worth mentioning that the test data provided did not have labeled targets (likely because it is an active competition). Therefore, we will use the 15,000 images in the train folder and hold out a subset of them for testing. We will use 12,500 images for training and 2,500 images for testing, an 83/17% split.

Originally, I had random.sample randomly select 2500 images and move them into the test folder, but later on decided to perform the split inside our ImageDataGenerator.

In [None]:
# Function to create class folders and copy images
def create_class_folders_and_copy_images(base_folder, class_counts, df):
    for _, row in df.iterrows():
        image_id = row['image_id']
        class_id = row['class_id']

        # Create a folder for the class if it doesn't exist
        class_folder = os.path.join(base_folder, str(class_id))
        os.makedirs(class_folder, exist_ok=True)

        source_image_path = os.path.join(base_folder, f"{image_id}.PNG")
        image_path = os.path.join(class_folder, f"{image_id}.PNG")

        # Check if the source file exists before copying
        if os.path.exists(source_image_path):
            shutil.copy(source_image_path, image_path)
            print(f"Copied: {source_image_path} -> {image_path}")
        else:
            print(f"Source file not found: {source_image_path}")

In [None]:
# Create class folders and copy images for the "train" folder
#create_class_folders_and_copy_images(train_folder, class_counts, df)

# Create class folders and copy images for the "test" folder
#create_class_folders_and_copy_images(test_folder, class_counts, df)

In [61]:
#train_folder = 'train'
#test_folder = 'test'

# List all image files in the train folder
#image_files = [file for file in os.listdir(train_folder) if file.lower().endswith(('.png', '.jpg', '.jpeg'))]

# Randomly select 2500 images from the list
#selected_images = random.sample(image_files, 2500)

# Move the selected images to the test folder
#for image in selected_images:
    #source_path = os.path.join(train_folder, image)
    #destination_path = os.path.join(test_folder, image)
    #shutil.move(source_path, destination_path)

#print("Randomly selected and moved 2500 images to the test folder.")
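
One way to make the hold-out split without moving files on disk is to shuffle a dataframe of file names with a fixed seed and slice it. The sketch below uses placeholder file names; the `img_` names, the seed value, and the variable names are assumptions for illustration, not the real image IDs.

```python
import pandas as pd

# Placeholder file names standing in for the 15,000 training images.
ids = pd.DataFrame({'image_id': [f'img_{i:05d}.png' for i in range(15000)]})

# Shuffle reproducibly, then slice: first 12,500 rows for training, the rest for testing.
shuffled = ids.sample(frac=1, random_state=42).reset_index(drop=True)
train_df = shuffled[:12500]
test_df = shuffled[12500:]

print(f'train: {len(train_df)} ({len(train_df) / len(shuffled):.1%})')
print(f'test:  {len(test_df)} ({len(test_df) / len(shuffled):.1%})')
```

Because the split lives in the dataframe, changing the ratio or the seed does not require copying or moving any image files.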

We need to perform additional data preprocessing before using TensorFlow. Currently, an image that falls into multiple classes has multiple rows. To fix this for our ImageDataGenerator, we need to group by image ID, one-hot encode the class IDs, and then merge the dataframes. That creates a final dataframe with one row per image and one-hot encoded classes. Let's also shuffle the dataframe, because I will be selecting the first 12,500 rows for the training set and the remaining 2,500 for the test set. Last, let's add ".png" to all the image IDs so that our ImageDataGenerator can locate the files.

In [65]:
#Step 1: Group by image ID
grouped = df.groupby('image_id')['class_id'].apply(list).reset_index()

#Step 2: Perform one-hot encoding (explode the lists to one class per row, then sum the indicators per image)
one_hot_encoded = pd.get_dummies(grouped['class_id'].explode()).groupby(level=0).sum()

#Step 3: Merge dataframes 
final_df = grouped.merge(one_hot_encoded, left_index=True, right_index=True)

#Step 4: shuffle the dataframe (fixed random_state so the train/test slices are reproducible)
final_df = final_df.sample(frac=1, random_state=42).reset_index(drop=True)

#Step 5: add ".png" to the image_ids 
final_df['image_id'] = final_df['image_id'] + '.png'

#The final_df DataFrame now contains one-hot encoded class labels for each image ID with one row per image
final_df

Unnamed: 0,image_id,class_id
0,000434271f63a053c4128a0ba6352c7f,[14]
1,00053190460d56c53cc3e57321387478,[14]
2,0005e8e3701dfb1dd93d53e2ff537b6e,"[7, 8, 6, 4]"
3,0006e0a85696f6bb578e84fafa9a5607,[14]
4,0007d316f756b3fa0baea2ff514ce945,"[13, 11, 3, 0, 5]"
...,...,...
14995,ffe6f9fe648a7ec29a50feb92d6c15a4,"[3, 0, 9]"
14996,ffea246f04196af602c7dc123e5e48fc,[14]
14997,ffeffc54594debf3716d6fcd2402a99f,[0]
14998,fff0f82159f9083f3dd1f8967fc54f6a,[14]
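
To make the target layout concrete, here is a minimal sketch of the same reshaping on a toy frame: several (image, class) rows collapse into one row per image with one indicator column per class. The toy data and the names `toy` and `multi_hot` are assumptions for illustration; the reindex step simply guarantees all 15 label columns exist even when a class is absent from the sample.

```python
import pandas as pd

# Toy annotations: one row per (image, class) pair, already de-duplicated.
toy = pd.DataFrame({
    'image_id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'class_id': [0, 3, 14, 7, 8, 6],
})

multi_hot = (
    pd.get_dummies(toy['class_id'])            # one indicator column per class seen
    .groupby(toy['image_id']).max()            # collapse to one row per image
    .reindex(columns=range(15), fill_value=0)  # make sure all 15 classes have a column
    .astype(int)
)
print(multi_hot)
```

Each row is now a 15-element label vector, which is the shape a sigmoid output layer with 15 units expects.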


Now we're cookin' and ready for TensorFlow's ImageDataGenerator!

# D. ImageDataGenerator Setup 

In [72]:
train_datagen= ImageDataGenerator(rescale=1.0/255.0)
test_datagen= ImageDataGenerator(rescale=1.0/255.0)

final_columns = final_df.columns[2:].tolist()

train_generator = train_datagen.flow_from_dataframe(
    dataframe=final_df[:12500], directory='images', x_col='image_id',
    y_col=final_columns, seed=42, class_mode='raw', color_mode='grayscale')

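# The test generator also needs labels (y_col with class_mode='raw') so that
# model.evaluate() and validation_data can compute loss and metrics on it.
test_generator=test_datagen.flow_from_dataframe(
    dataframe=final_df[12500:], directory='images', x_col='image_id',
    y_col=final_columns, seed=42, class_mode='raw', color_mode='grayscale', shuffle=False)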

Found 12500 validated image filenames.
Found 2500 validated image filenames.


Let's also create a function to evaluate the models moving forward. 

In [None]:
def evaluate(model):
    train_loss, train_accuracy = model.evaluate(train_generator)
    test_loss, test_accuracy = model.evaluate(test_generator)
    
    print(f'Train Loss: {train_loss}')
    print(f'Test Loss: {test_loss}')
    
    print('----------')
    
    
    print(f'Train Accuracy: {train_accuracy}')
    print(f'Test Accuracy: {test_accuracy}')
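
For a multi-label problem, "accuracy" can mean different things, and it is worth checking which variant Keras resolves the `'accuracy'` string to in the installed version. The sketch below uses plain NumPy on made-up predictions (all values are assumptions for illustration) to contrast two common readings: per-label accuracy and exact-match accuracy.

```python
import numpy as np

# Made-up labels and sigmoid outputs for 4 images and 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.4],
                   [0.1, 0.2, 0.7],
                   [0.6, 0.4, 0.2]])

# Threshold each class independently at 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# Per-label accuracy: fraction of individual label decisions that are correct.
element_wise = (y_pred == y_true).mean()

# Exact-match accuracy: fraction of images whose full label vector is correct.
exact_match = (y_pred == y_true).all(axis=1).mean()

print(f'Per-label accuracy:   {element_wise:.3f}')
print(f'Exact-match accuracy: {exact_match:.3f}')
```

With sparse labels, per-label accuracy can look high even when the model misses most positive findings, so it helps to report it alongside a stricter measure.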

# E. First Simple Model

In [None]:
# Define the first simple model
fsmodel = keras.Sequential([
    layers.Input(shape=(256, 256, 1)),  # Adjust input shape as needed
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(15, activation='sigmoid')  # Use 'sigmoid' for multi-label classification
])

# Compile the model
fsmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
fsmodel.summary()
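
As a sanity check on what `summary()` should report for this model, the parameter counts can be worked out by hand: a Dense layer has one weight per input-output pair plus one bias per output. A small sketch of that arithmetic (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out


flattened = 256 * 256 * 1                   # Flatten turns each image into a vector
hidden_params = dense_params(flattened, 64)
output_params = dense_params(64, 15)
total_params = hidden_params + output_params

print(f'Flattened inputs: {flattened}')
print(f'Dense(64) params: {hidden_params}')
print(f'Dense(15) params: {output_params}')
print(f'Total params:     {total_params}')
```

Almost all of the parameters sit in the first Dense layer, because every pixel is connected to every hidden unit.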

In [None]:
# Train the model (⏰ this cell may take about a minute to run)
fsmhistory = fsmodel.fit(train_generator, epochs=10, validation_data=test_generator)

# Evaluate the model
evaluate(fsmodel)

# F. Model Building

The first simple model above is our baseline: every later model must outperform it. Let's start by adding some Conv2D and MaxPooling2D layers.

In [78]:
model1 = keras.Sequential([
    layers.Input(shape=(256, 256, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(15, activation='sigmoid')  # Use 'sigmoid' activation for multi-label classification
])

# Compile the model
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model1.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 254, 254, 32)      320       
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 127, 127, 32)      0         
 g2D)                                                            
                                                                 
 conv2d_3 (Conv2D)           (None, 125, 125, 64)      18496     
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 62, 62, 64)        0         
 g2D)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 246016)            0         
                                                                 
 dense_2 (Dense)             (None, 128)              
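
The output shapes and parameter counts shown in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model definition (3x3 kernels, 'valid' padding, stride 1, 2x2 pooling); the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def conv_params(kernel, channels_in, channels_out):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * channels_in * channels_out + channels_out


size = 256
size = conv_out(size)   # first Conv2D
size = pool_out(size)   # first MaxPooling2D
size = conv_out(size)   # second Conv2D
size = pool_out(size)   # second MaxPooling2D
flattened = size * size * 64

print(f'Final feature map: {size} x {size} x 64')
print(f'Flattened length:  {flattened}')
print(f'Conv2D(32) params: {conv_params(3, 1, 32)}')
print(f'Conv2D(64) params: {conv_params(3, 32, 64)}')
```

The flattened length is what drives the size of the following Dense layer, so extra pooling or larger strides are the usual levers for shrinking the model.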

In [None]:
# Train the model (⏰ this cell may take about a minute to run)
history1 = model1.fit(train_generator, epochs=10, validation_data=test_generator)

# Evaluate the model on the held-out images
test_loss, test_accuracy = model1.evaluate(test_generator)
print(f'Test accuracy: {test_accuracy}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [None]:
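# Can I use this cell instead and get the same results?
# Dropping the `history1 =` assignment only discards the training history. However, calling
# fit() again continues training model1 from its current weights for 10 more epochs,
# so the results will differ from the run above unless the model is rebuilt first.

# Train the model (⏰ this cell may take about a minute to run)
model1.fit(train_generator, epochs=10, validation_data=test_generator)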

# Evaluate the model
evaluate(model1)

# G. Tuning the Model

# H. Selecting Final Model