### Authors: Prof. Dr. Soumi Ray, Ravi Teja Kothuru and Abhay Srivastav

### Acknowledgements:
I would like to thank my teammates, Prof. Dr. Soumi Ray and Abhay Srivastav, for their guidance and support throughout this project.

**Title of the Project:** Comparative Analysis of Image-Based and Feature-Based Approaches for Pneumonia Detection in Chest X-rays

**Description of the Project:** This project focuses on detecting pneumonia from chest X-ray images using machine learning and deep learning techniques (Rajpurkar et al., 2017; Wang et al., 2017). Using a dataset of annotated pneumonia and normal chest X-rays, we develop and compare image-based and feature-based approaches. Our goal is to identify the most effective method for accurate and interpretable pneumonia detection, contributing to improved patient outcomes through early diagnosis and treatment. The resulting models classify each patient's chest X-ray image as either pneumonia (1) or no pneumonia (0).

**Objectives of the Project:** 

- **Image Analysis:** Develop and evaluate deep learning models to classify chest X-rays directly. This approach leverages deep learning models, particularly Convolutional Neural Networks (CNNs), to perform end-to-end image classification. The models directly process raw chest X-ray images to classify them as normal or pneumonia.

- **Feature Analysis:** Extract meaningful features from the images and use them to train and evaluate traditional machine learning models. In this approach, we first extract features from the chest X-ray images and then use them as inputs to traditional machine learning algorithms. The process includes feature extraction, selection, and transformation, followed by machine learning techniques such as Support Vector Machines (SVMs) and Random Forests.

**Name of the Dataset:** The dataset used in this project is the chest X-ray dataset from the publication **Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification**.

**Description of the Dataset:** The dataset contains chest X-ray images, each labeled as either normal or pneumonia. The target variable for classification is whether a patient has pneumonia (1) or does not have pneumonia (0).

**Dataset Source:** 

- https://data.mendeley.com/datasets/rscbjbr9sj/2

**Type of the Dataset:**

- X-ray Images

**Description of Dataset:** 
The dataset provides:
- Separate folders for training and for validating/testing the model.
- A sufficient number of chest X-ray images to train a model to detect and diagnose pneumonia.
- A binary target variable: whether the patient has pneumonia or not.

**Goal of the Project using this Dataset:**
The goal of this project is to conduct a comprehensive comparative analysis of image-based and feature-based approaches for pneumonia detection using chest X-ray images. By evaluating the performance, robustness, and interpretability of deep learning and traditional machine learning models, we aim to identify the most effective method for accurately classifying chest X-rays as normal or pneumonia. This comparison will provide valuable insights into the strengths and limitations of each approach, ultimately contributing to improved detection and diagnosis of pneumonia, which can enhance patient outcomes and survival rates.

**Why did we choose this dataset?**
We selected this dataset for the following reasons:
- The dataset is extensive, providing a large number of images suitable for evaluating and training deep learning models.
- It aligns well with the project's objectives by offering a challenging and realistic scenario for developing an image classification model using deep learning, specifically for Chest X-ray images.
- The dataset is annotated with two classes (normal and pneumonia), enabling the development of a binary classification model.
- It is publicly available, facilitating easy access for research and development purposes.

**Size of dataset:**
- Total size of all images = 1.27 GB
- Dataset has 2 folders:
  -  **Train:**
    -  Normal (without Pneumonia) = 1349 images
    -  Pneumonia = 3884 images
  -  **Test:**
    -  Normal (without Pneumonia) = 234 images
    -  Pneumonia = 390 images
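
The counts above imply a class imbalance in the training split. The sketch below computes each class's share and a set of inverse-frequency class weights from those counts. The weighting rule `total / (num_classes * class_count)` is a common heuristic assumed here for illustration; it is not stated in this notebook as the method used.

```python
# Image counts copied from the dataset summary above.
counts = {
    "train": {"NORMAL": 1349, "PNEUMONIA": 3884},
    "test": {"NORMAL": 234, "PNEUMONIA": 390},
}

def class_shares(split_counts):
    """Return each class's fraction of the images in one split."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}

def balanced_class_weights(split_counts):
    """Assumed heuristic: weight = total / (num_classes * class_count)."""
    total = sum(split_counts.values())
    num_classes = len(split_counts)
    return {label: total / (num_classes * n) for label, n in split_counts.items()}

train_shares = class_shares(counts["train"])
train_weights = balanced_class_weights(counts["train"])
print(train_shares)
print(train_weights)
```

With these counts, roughly three quarters of the training images are pneumonia cases, so the minority NORMAL class receives the larger weight.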
    
**Expected Behaviors and Problem Handling:**
- Classify Chest X-ray images with high accuracy.
- Handle variations in image quality, resolution, and orientation.
- Be robust to noise and artifacts in the images.
- Provide interpretable results.

**Issues to focus on:**
- Improving model interpretability and explainability.
- Optimizing model performance on a held-out test set.
- Following AI Ethics and Data Safety practices.

# Import all the required files and libraries

In [1]:
import os
import ssl

# Disable SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

# Automatically reload imported modules when their source code changes
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

# Import the local module that provides the feature extraction class
from cxr_image_features_extraction import CxrImageFeatureExtraction

# Perform Chest X-ray Images Feature Extraction

## Create an object of the Image Feature Extraction class

In [2]:
image_feature_extraction = CxrImageFeatureExtraction()

## Fetch the absolute paths of the normalized image dataset

In [3]:
# Define the path to the dataset
dataset_path = image_feature_extraction.get_base_path_of_dataset() + "_nrm"
print(f"Normalized Dataset Path = {dataset_path}")

# Fetch train, test, NORMAL and PNEUMONIA folder names
train_folder_name = str(image_feature_extraction.train_test_image_dirs[0])
test_folder_name = str(image_feature_extraction.train_test_image_dirs[1])

normal_img_folder_name = str(image_feature_extraction.normal_pneumonia_image_dirs[0])
pneumonia_img_folder_name = str(image_feature_extraction.normal_pneumonia_image_dirs[1])

# Define the paths to the train and test datasets
# Train
train_normal = os.path.join(dataset_path, train_folder_name + "_nrm", normal_img_folder_name + "_nrm")
train_pneumonia = os.path.join(dataset_path, train_folder_name + "_nrm", pneumonia_img_folder_name + "_nrm")

# Test
test_normal = os.path.join(dataset_path, test_folder_name + "_nrm", normal_img_folder_name + "_nrm")
test_pneumonia = os.path.join(dataset_path, test_folder_name + "_nrm", pneumonia_img_folder_name + "_nrm")

# Print the paths to the train and test datasets
print("\nNormalized Train Images")
print("************************")
print(f"NORMAL = {train_normal}")
print(f"\nPNEUMONIA = {train_pneumonia}")

print("\n\nNormalized Test Images")
print("***************************")
print(f"NORMAL = {test_normal}")
print(f"\nPNEUMONIA = {test_pneumonia}")

Normalized Dataset Path = /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm

Normalized Train Images
************************
NORMAL = /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/NORMAL_nrm

PNEUMONIA = /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/PNEUMONIA_nrm


Normalized Test Images
***************************
NORMAL = /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/NORMAL_nrm

PNEUMONIA = /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/PNEUMONIA_nrm


## Convert all Normalized image folder absolute paths to a list

In [4]:
image_normalized_folders = [
    train_normal, train_pneumonia,
    test_normal, test_pneumonia
]

image_normalized_folders

['/Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/NORMAL_nrm',
 '/Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/PNEUMONIA_nrm',
 '/Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/NORMAL_nrm',
 '/Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/PNEUMONIA_nrm']
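
Before starting a long extraction run, it can help to confirm that every folder in the list exists and holds the expected number of images. The following is a minimal sketch using only the standard library. The helper names `count_images` and `summarize_folders` and the list of file extensions are assumptions for illustration; they are not part of `CxrImageFeatureExtraction`.

```python
import os

# Assumed image extensions; adjust to match the files actually on disk.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")

def count_images(folder):
    """Count files in `folder` whose names end with a known image extension."""
    if not os.path.isdir(folder):
        raise FileNotFoundError(f"Folder not found: {folder}")
    return sum(
        1 for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )

def summarize_folders(folders):
    """Map each folder path to its image count."""
    return {folder: count_images(folder) for folder in folders}
```

Calling `summarize_folders(image_normalized_folders)` would then return one count per normalized folder, which can be compared against the dataset summary.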

### Extract second-order NGTDM (Neighborhood Gray Tone Difference Matrix) features from all the images and write them to the existing Excel file

1. Coarseness

Definition: Coarseness is the sum of all NGTDM matrix entries divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Coarseness} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{N \times M}$$

where $G$ is the number of gray levels, $N$ and $M$ are the dimensions of the image, and $ngtdm(i,j)$ is the value of the NGTDM matrix at position $(i,j)$.

2. Contrast

Definition: Contrast is the sum of the squared NGTDM matrix entries divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Contrast} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^2}{N \times M}$$
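
The Coarseness and Contrast formulas above can be sketched in a few lines of NumPy. This sketch treats the NGTDM as a $G \times G$ array, as the formulas do. The function names and toy values are illustrative assumptions and may differ from the implementation in `cxr_image_features_extraction`.

```python
import numpy as np

def ngtdm_coarseness(ngtdm, image_shape):
    """Sum of all matrix entries divided by the pixel count N * M."""
    n_rows, n_cols = image_shape
    return float(np.sum(ngtdm)) / (n_rows * n_cols)

def ngtdm_contrast(ngtdm, image_shape):
    """Sum of squared matrix entries divided by the pixel count N * M."""
    n_rows, n_cols = image_shape
    return float(np.sum(np.square(ngtdm))) / (n_rows * n_cols)

# Toy inputs (hypothetical): a 2 x 2 matrix (G = 2) for a 2 x 5 image.
toy_ngtdm = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_shape = (2, 5)

coarseness = ngtdm_coarseness(toy_ngtdm, toy_shape)  # 10 / 10
contrast = ngtdm_contrast(toy_ngtdm, toy_shape)      # 30 / 10
print(coarseness, contrast)
```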

3. Busyness

Definition: Busyness is the sum of all NGTDM matrix entries divided by the product of the total number of pixels in the image and the sum of the absolute differences between each image pixel and the mean pixel value.

Formula: $$\text{NGTDM_Busyness} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{(N \times M) \times \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |I(i,j) - \mu|}$$

where $I(i,j)$ is the value of the image pixel at position $(i,j)$, and $\mu$ is the mean image pixel value.
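
As a worked example of the Busyness formula, the sketch below computes the denominator from the image itself and guards against a constant image, where the sum of absolute deviations is zero. Returning NaN in that case is an assumed convention, and the function name and toy arrays are hypothetical.

```python
import numpy as np

def ngtdm_busyness(ngtdm, image):
    """sum(ngtdm) / ((N * M) * sum(|I - mean(I)|)), following the formula above."""
    image = np.asarray(image, dtype=float)
    abs_deviation = np.abs(image - image.mean()).sum()
    if abs_deviation == 0:
        # Constant image: the denominator is zero (NaN is an assumed convention).
        return float("nan")
    return float(np.sum(ngtdm)) / (image.size * abs_deviation)

# Toy inputs (hypothetical): mean pixel value 2, absolute deviations sum to 4.
toy_image = np.array([[0, 2], [2, 4]])
toy_ngtdm = np.array([[1.0, 3.0], [0.0, 4.0]])  # entries sum to 8

busyness = ngtdm_busyness(toy_ngtdm, toy_image)  # 8 / (4 * 4)
print(busyness)
```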

4. Complexity

Definition: Complexity is the sum, over all NGTDM matrix entries, of each entry multiplied by the base-2 logarithm of that entry plus 1, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Complexity} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times \log_2(ngtdm(i,j)+1)}{N \times M}$$
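
The Complexity formula adds 1 inside the logarithm, so zero entries contribute zero instead of producing an undefined $\log_2(0)$. A minimal sketch, assuming non-negative matrix entries and using hypothetical names and toy values:

```python
import numpy as np

def ngtdm_complexity(ngtdm, image_shape):
    """sum(ngtdm * log2(ngtdm + 1)) / (N * M); assumes entries are >= 0."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    n_rows, n_cols = image_shape
    return float(np.sum(ngtdm * np.log2(ngtdm + 1.0))) / (n_rows * n_cols)

# Toy inputs (hypothetical): terms are 1*1, 3*2, 0*0 and 7*3, which sum to 28.
toy_ngtdm = np.array([[1.0, 3.0], [0.0, 7.0]])
toy_shape = (2, 7)  # 14 pixels

complexity = ngtdm_complexity(toy_ngtdm, toy_shape)
print(complexity)
```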

5. Dissimilarity

Definition: Dissimilarity is the sum of the absolute differences between each NGTDM matrix entry and the mean NGTDM value, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Dissimilarity} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} |ngtdm(i,j) - \mu_{ngtdm}|}{N \times M}$$

where $\mu_{ngtdm}$ is the mean NGTDM value.
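
The Dissimilarity formula can be sketched as follows, again treating the NGTDM as a $G \times G$ array. The function name and the toy matrix are assumptions for illustration.

```python
import numpy as np

def ngtdm_dissimilarity(ngtdm, image_shape):
    """sum(|ngtdm - mean(ngtdm)|) / (N * M), following the formula above."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    n_rows, n_cols = image_shape
    return float(np.abs(ngtdm - ngtdm.mean()).sum()) / (n_rows * n_cols)

# Toy inputs (hypothetical): mean entry is 4, absolute deviations are 3, 1, 1, 3.
toy_ngtdm = np.array([[1.0, 3.0], [5.0, 7.0]])
toy_shape = (2, 4)  # 8 pixels

dissimilarity = ngtdm_dissimilarity(toy_ngtdm, toy_shape)  # 8 / 8
print(dissimilarity)
```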

6. Joint Energy

Definition: Joint energy is the sum of the squared NGTDM matrix entries divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_JointEnergy} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^2}{N \times M}$$

7. Joint Entropy

Definition: Joint entropy is the sum, over all NGTDM matrix entries, of each entry multiplied by the base-2 logarithm of that entry plus 1, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_JointEntropy} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times \log_2(ngtdm(i,j)+1)}{N \times M}$$

8. Informational Measure of Correlation 1

Definition: Informational measure of correlation 1 compares the joint entropy of the NGTDM matrix ($HXY$) with an entropy term computed against the matrix shifted by one position ($HXY1$), expressed as a fraction of $HXY$.

Formula: $$\text{NGTDM_InformationalMeasureOfCorrelation1} = \frac{HXY - HXY1}{HXY}$$

where:

$HXY$ is the joint entropy of the NGTDM matrix, calculated as:
$$HXY = -\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times \log_2(ngtdm(i,j))$$

$HXY1$ is the corresponding entropy term in which the logarithm is taken of the NGTDM matrix shifted by one position along $i$, calculated as:
$$HXY1 = -\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times \log_2(ngtdm(i+1,j))$$

Note that $ngtdm(i,j)$ is the value of the NGTDM matrix at position $(i,j)$, and $G$ is the number of gray levels in the image.
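
The formulas for $HXY$ and $HXY1$ leave two details open: how to handle $ngtdm(i+1,j)$ when $i = G-1$, and what to do when a logarithm argument is zero. The sketch below makes two assumptions that are not taken from this notebook: the shift wraps around to row 0, and terms with a zero entry or a zero logarithm argument are skipped. All names and toy values are hypothetical.

```python
import numpy as np

def _neg_sum_p_log2_q(p, q):
    """Return -sum(p * log2(q)), skipping positions where p or q is not positive."""
    mask = (p > 0) & (q > 0)
    return float(-np.sum(p[mask] * np.log2(q[mask])))

def ngtdm_imc1(ngtdm):
    """(HXY - HXY1) / HXY with a wrap-around shift (an assumed boundary rule)."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    # shifted[i, j] == ngtdm[(i + 1) % G, j]
    shifted = np.roll(ngtdm, -1, axis=0)
    hxy = _neg_sum_p_log2_q(ngtdm, ngtdm)
    hxy1 = _neg_sum_p_log2_q(ngtdm, shifted)
    if hxy == 0:
        return float("nan")  # assumed convention when HXY is zero
    return (hxy - hxy1) / hxy

# Toy matrix (hypothetical): HXY = 1.75 and HXY1 = 2.625.
toy_ngtdm = np.array([[0.5, 0.25], [0.125, 0.125]])
imc1 = ngtdm_imc1(toy_ngtdm)
print(imc1)
```

A different boundary rule, such as dropping the last row, would give a different value, so the choice should match whatever the extraction code actually does.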

9. Inverse Difference Moment

Definition: Inverse difference moment is the sum of all NGTDM matrix entries divided by the sum of the absolute differences between each image pixel and the mean pixel value.

Formula: $$\text{NGTDM_InverseDifferenceMoment} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |I(i,j) - \mu|}$$

10. Inverse Difference Moment Normalized

Definition: Inverse difference moment normalized is the sum of all NGTDM matrix entries divided by the product of the sum of the absolute differences between each image pixel and the mean pixel value and the number of gray levels $G$.

Formula: $$\text{NGTDM_InverseDifferenceMomentNormalized} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{(\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |I(i,j) - \mu|) \times G}$$

11. Inverse Difference

Definition: Inverse difference is the sum of all NGTDM matrix entries divided by the sum of the absolute differences between each image pixel and the mean pixel value.

Formula: $$\text{NGTDM_InverseDifference} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} |I(i,j) - \mu|}$$

12. Inverse Variance

Definition: Inverse variance is the sum of all NGTDM matrix entries divided by the variance of the image.

Formula: $$\text{NGTDM_InverseVariance} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{\sigma^2}$$

where $\sigma^2$ is the variance of the image.
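
The text does not say whether $\sigma^2$ is the population or the sample variance. The sketch below assumes the population variance, which is NumPy's default (`ddof=0`), and returns NaN for a constant image. The function name and toy arrays are hypothetical.

```python
import numpy as np

def ngtdm_inverse_variance(ngtdm, image):
    """sum(ngtdm) / var(image), using the population variance (ddof=0, assumed)."""
    image = np.asarray(image, dtype=float)
    variance = image.var()  # ddof=0 by default
    if variance == 0:
        return float("nan")  # constant image (NaN is an assumed convention)
    return float(np.sum(ngtdm)) / float(variance)

# Toy inputs (hypothetical): pixel variance is (4 + 0 + 0 + 4) / 4 = 2.
toy_image = np.array([[0, 2], [2, 4]])
toy_ngtdm = np.array([[1.0, 3.0], [0.0, 4.0]])  # entries sum to 8

inverse_variance = ngtdm_inverse_variance(toy_ngtdm, toy_image)  # 8 / 2
print(inverse_variance)
```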

13. Maximum Probability

Definition: Maximum probability is the maximum value in the NGTDM matrix.

Formula: $$\text{NGTDM_MaximumProbability} = \max_{i,j} ngtdm(i,j)$$

14. Sum Average

Definition: Sum average is the sum of all NGTDM matrix entries divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_SumAverage} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}{N \times M}$$

15. Sum Entropy

Definition: Sum entropy is the sum, over all NGTDM matrix entries, of each entry multiplied by the base-2 logarithm of that entry plus 1, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_SumEntropy} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times \log_2(ngtdm(i,j)+1)}{N \times M}$$

16. Sum of Squares

Definition: Sum of squares is the sum of the squared NGTDM matrix entries divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_SumOfSquares} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^2}{N \times M}$$

17. Auto-Correlation

Definition: Auto-correlation is the sum of the products of each NGTDM matrix entry and the entry one position further along $i$, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_AutoCorrelation} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times ngtdm(i+1,j)}{N \times M}$$
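
The Auto-Correlation formula indexes $ngtdm(i+1,j)$, which falls outside the matrix when $i = G-1$. The sketch below exposes that boundary choice as a hypothetical `wrap` parameter: with `wrap=False` the last row has no partner and is dropped, and with `wrap=True` it pairs with row 0. Neither option is confirmed by this notebook; the example only shows that the two choices give different values.

```python
import numpy as np

def ngtdm_autocorrelation(ngtdm, image_shape, wrap=False):
    """sum(ngtdm[i, j] * ngtdm[i + 1, j]) / (N * M) with an explicit boundary rule."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    if wrap:
        # Pair the last row with row 0.
        current, shifted = ngtdm, np.roll(ngtdm, -1, axis=0)
    else:
        # Pair rows 0..G-2 with rows 1..G-1 and drop the unmatched last row.
        current, shifted = ngtdm[:-1, :], ngtdm[1:, :]
    n_rows, n_cols = image_shape
    return float(np.sum(current * shifted)) / (n_rows * n_cols)

# Toy inputs (hypothetical): a 2 x 2 matrix for an image with 11 pixels.
toy_ngtdm = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_shape = (1, 11)

autocorr_no_wrap = ngtdm_autocorrelation(toy_ngtdm, toy_shape)           # 11 / 11
autocorr_wrap = ngtdm_autocorrelation(toy_ngtdm, toy_shape, wrap=True)   # 22 / 11
print(autocorr_no_wrap, autocorr_wrap)
```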

18. Cluster Prominence

Definition: Cluster prominence is the sum of the squared NGTDM matrix entries divided by the sum of all NGTDM matrix entries.

Formula: $$\text{NGTDM_ClusterProminence} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^2}{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}$$

19. Cluster Shade

Definition: Cluster shade is the sum of the cubed NGTDM matrix entries divided by the sum of all NGTDM matrix entries.

Formula: $$\text{NGTDM_ClusterShade} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^3}{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}$$

20. Cluster Tendency

Definition: Cluster tendency is the sum of the fourth powers of the NGTDM matrix entries divided by the sum of all NGTDM matrix entries.

Formula: $$\text{NGTDM_ClusterTendency} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)^4}{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j)}$$
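
As written above, Cluster Prominence, Cluster Shade, and Cluster Tendency differ only in the exponent (2, 3, and 4), so one helper can cover all three. This is a sketch under that reading; the helper name `ngtdm_power_ratio` and the toy matrix are hypothetical.

```python
import numpy as np

def ngtdm_power_ratio(ngtdm, power):
    """sum(ngtdm ** power) / sum(ngtdm); NaN if the matrix sums to zero (assumed)."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    total = ngtdm.sum()
    if total == 0:
        return float("nan")
    return float(np.sum(ngtdm ** power) / total)

# Toy matrix (hypothetical): entries sum to 4.
toy_ngtdm = np.array([[1.0, 2.0], [0.0, 1.0]])

cluster_prominence = ngtdm_power_ratio(toy_ngtdm, 2)  # (1 + 4 + 0 + 1) / 4
cluster_shade = ngtdm_power_ratio(toy_ngtdm, 3)       # (1 + 8 + 0 + 1) / 4
cluster_tendency = ngtdm_power_ratio(toy_ngtdm, 4)    # (1 + 16 + 0 + 1) / 4
print(cluster_prominence, cluster_shade, cluster_tendency)
```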

21. Correlation

Definition: Correlation is the sum of the products of each NGTDM matrix entry and the entry one position further along $i$, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Correlation} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} ngtdm(i,j) \times ngtdm(i+1,j)}{N \times M}$$

22. Variance

Definition: Variance is the sum of the squared differences between each NGTDM matrix entry and the mean NGTDM value, divided by the total number of pixels in the image.

Formula: $$\text{NGTDM_Variance} = \frac{\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} (ngtdm(i,j) - \mu_{ngtdm})^2}{N \times M}$$

where $\mu_{ngtdm}$ is the mean NGTDM value.
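
The Variance formula divides by the image pixel count $N \times M$ rather than by the number of matrix entries, so it generally differs from `np.var` applied to the matrix. The sketch below makes that difference visible with hypothetical names and toy values.

```python
import numpy as np

def ngtdm_variance(ngtdm, image_shape):
    """sum((ngtdm - mean(ngtdm)) ** 2) / (N * M), following the formula above."""
    ngtdm = np.asarray(ngtdm, dtype=float)
    n_rows, n_cols = image_shape
    return float(np.sum((ngtdm - ngtdm.mean()) ** 2)) / (n_rows * n_cols)

# Toy inputs (hypothetical): squared deviations are 9, 1, 1, 9, which sum to 20.
toy_ngtdm = np.array([[1.0, 3.0], [5.0, 7.0]])
toy_shape = (2, 4)  # 8 pixels

variance = ngtdm_variance(toy_ngtdm, toy_shape)  # 20 / 8
matrix_variance = float(np.var(toy_ngtdm))       # 20 / 4, normalised by the G * G entries
print(variance, matrix_variance)
```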

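These are the 22 features extracted from the NGTDM matrix in this project. As defined above, several features share the same formula (for example, Contrast, Joint Energy, and Sum of Squares), so they yield identical values and carry no additional information for downstream models. The features can be used in applications such as image classification, segmentation, and retrieval.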

In [9]:
image_feature_extraction.update_second_order_ngtdm_features_to_excel_file(folders=image_normalized_folders)

Extracted NGTDM features will be saved to - /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/image_information/chest_xray_images_second_order_features_ngtdm.xlsx


Extracting second-order features NGTDM from: /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/NORMAL_nrm

Folder: train_nrm/NORMAL_nrm:   0%|                                                           | 0/1349 [00:00<?, ?it/s]
Folder: train_nrm/NORMAL_nrm:   1%|▌                                                 | 16/1349 [00:33<45:55,  2.07s/it]
Folder: train_nrm/NORMAL_nrm:   2%|▊                                               | 24/1349 [01:03<1:01:48,  2.80s/it]
Folder: train_nrm/NORMAL_nrm:   2%|█▏                                              | 32/1349 [01:44<1:19:51,  3.64s/it]
Folder: train_nrm/NORMAL_nrm:   3%|█▍                                              | 40/1349 [02:24<1:30:29,  4.15s/it]
Folder: train_nrm/NORMAL_nrm:   4%|█▋                                              | 48/1349 [03:33<2:01:24,  5.60s/it]
Folder: train_nrm/NORMAL_nrm:   4%|█▉                                              | 56/1349 [04:26<2:08:06,  5.94s/it]
Folder: train_nrm/NORMAL_nrm:   5%|██▎

Extracting second-order features NGTDM from: /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/train_nrm/PNEUMONIA_nrm

Folder: train_nrm/PNEUMONIA_nrm:   0%|                                                        | 0/3883 [00:00<?, ?it/s]
Folder: train_nrm/PNEUMONIA_nrm:   0%|▏                                              | 16/3883 [00:05<23:48,  2.71it/s]
Folder: train_nrm/PNEUMONIA_nrm:   1%|▎                                            | 24/3883 [00:26<1:24:22,  1.31s/it]
Folder: train_nrm/PNEUMONIA_nrm:   1%|▎                                            | 32/3883 [00:43<1:41:18,  1.58s/it]
Folder: train_nrm/PNEUMONIA_nrm:   1%|▍                                            | 40/3883 [01:08<2:17:01,  2.14s/it]
Folder: train_nrm/PNEUMONIA_nrm:   1%|▌                                            | 48/3883 [01:40<2:55:06,  2.74s/it]
Folder: train_nrm/PNEUMONIA_nrm:   1%|▋                                            | 56/3883 [02:03<2:58:46,  2.80s/it]
Folder: train_nrm/PNEUMONIA_nrm:   2%|▋

Extracting second-order features NGTDM from: /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/NORMAL_nrm


Folder: test_nrm/NORMAL_nrm: 100%|███████████████████████████████████████████████████| 234/234 [21:33<00:00,  5.53s/it]


Extracting second-order features NGTDM from: /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/dataset/chest_xray_nrm/test_nrm/PNEUMONIA_nrm


Folder: test_nrm/PNEUMONIA_nrm: 100%|████████████████████████████████████████████████| 390/390 [12:36<00:00,  1.94s/it]




All first-order features are extracted to the Excel file: /Users/raviteja/Documents/Teja_Career/Master_Degree/USD/MS_AAI/AAI-501/Final_Project/pneumonia-detection-in-chest-X-rays/image_information/chest_xray_images_second_order_features_ngtdm.xlsx
Please check the Excel file for further analysis and interpretation
