# **Final Project: Chest X-Ray Images Pneumonia**

Authors: Adrián Barreno Sánchez (adrian.barreno@alumnos.upm.es), Alberto González Delgado (alberto.gondelgado@alumnos.upm.es), Julian Elijah Politsch (julian.politsch@alumnos.upm.es), Angelo D'Angelo (angelo.dangelo@alumnos.upm.es)

Date: 01/2023

## 1. Introduction


## 2. Setup

### Starting Spark Session

In [1]:
appname = "Chest X-Ray Images Pneumonia"

# Look into https://spark.apache.org/downloads.html for the latest version
spark_mirror = "https://mirrors.sonic.net/apache/spark"
spark_version = "3.3.1"
hadoop_version = "3"

# Install Java 8 (Spark 3.3 also runs on Java 11 and 17; Java 8 is used here)
! apt-get update > /dev/null
! apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download and extract Spark binary distribution
! rm -rf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz spark-{spark_version}-bin-hadoop{hadoop_version}
! wget -q {spark_mirror}/spark-{spark_version}/spark-{spark_version}-bin-hadoop{hadoop_version}.tgz
! tar xzf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz

# The only 2 environment variables needed to set up Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{spark_version}-bin-hadoop{hadoop_version}"

# Set up the Spark environment based on the environment variable SPARK_HOME 
! pip install -q findspark
import findspark
findspark.init()

# Get the Spark session object (basic entry point for every operation)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(appname).master("local[*]").getOrCreate()

# Mount Google Drive to access the dataset archive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing data

In [5]:
# -o overwrites existing files without prompting; -q suppresses the file listing
!unzip -o -q /content/drive/MyDrive/chest_x_ray.zip

### Importing packages


In [9]:
import os
import glob
import pandas as pd
import numpy as np

## 3. Dataset


### Summary of the dataset

In [60]:
for dirpath, dirnames, filenames in os.walk('/content/chest_xray'):
  print(f"There are {len(filenames)} images in {dirpath}")

There are 0 images in /content/chest_xray
There are 0 images in /content/chest_xray/val
There are 8 images in /content/chest_xray/val/PNEUMONIA
There are 8 images in /content/chest_xray/val/NORMAL
There are 0 images in /content/chest_xray/train
There are 3875 images in /content/chest_xray/train/PNEUMONIA
There are 1341 images in /content/chest_xray/train/NORMAL
There are 1 images in /content/chest_xray/__MACOSX
There are 3 images in /content/chest_xray/__MACOSX/chest_xray
There are 1 images in /content/chest_xray/__MACOSX/chest_xray/val
There are 9 images in /content/chest_xray/__MACOSX/chest_xray/val/PNEUMONIA
There are 9 images in /content/chest_xray/__MACOSX/chest_xray/val/NORMAL
There are 3 images in /content/chest_xray/__MACOSX/chest_xray/train
There are 3876 images in /content/chest_xray/__MACOSX/chest_xray/train/PNEUMONIA
There are 1342 images in /content/chest_xray/__MACOSX/chest_xray/train/NORMAL
There are 3 images in /content/chest_xray/__MACOSX/chest_xray/test
There are 390 
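The walk above counts every file, including the macOS metadata under `__MACOSX` and hidden entries such as `.DS_Store`, so those figures are not image counts. The following is a minimal sketch of a stricter count, assuming the `split/CLASS/*.jpeg` layout shown above; `count_images` is a hypothetical helper, not part of the original notebook.

```python
import os
from collections import Counter

def count_images(root, extensions=(".jpeg", ".jpg", ".png")):
    """Count image files per folder (relative to root), skipping macOS metadata."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune __MACOSX in place so os.walk never descends into it
        dirnames[:] = [d for d in dirnames if d != "__MACOSX"]
        for name in filenames:
            # Skip hidden files such as .DS_Store and ._IM-0001-0001.jpeg
            if name.startswith("."):
                continue
            if name.lower().endswith(extensions):
                counts[os.path.relpath(dirpath, root)] += 1
    return counts

# Example usage in the notebook (path assumed from the cells above):
# for folder, n in sorted(count_images("/content/chest_xray").items()):
#     print(f"There are {n} images in {folder}")
```

Pruning `dirnames` in place is the documented way to stop `os.walk` from entering a subtree, which keeps the metadata folders out of the totals.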

### Getting the paths of all images


In [49]:
# Each split contains one subfolder per class (NORMAL, PNEUMONIA)
train_images = glob.glob("/content/chest_xray/train/**/*.jpeg")
val_images = glob.glob("/content/chest_xray/val/**/*.jpeg")
test_images = glob.glob("/content/chest_xray/test/**/*.jpeg")


In [59]:
print(f'Number of train samples:\t\t{len(train_images)}')
print(f'Number of validation samples :\t\t{len(val_images)}')
print(f'Number of test samples :\t\t{len(test_images)}')
print('=============================================')
print(f'Total number of samples :\t\t{len(train_images)+len(val_images)+len(test_images)}')

Number of train samples:		5216
Number of validation samples :		16
Number of test samples :		624
=============================================
Total number of samples :		5856
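The totals hide how uneven the classes are: the directory walk reports 3875 PNEUMONIA and 1341 NORMAL images in the training split. As a small worked example, the sketch below turns those two counts into class shares; `class_shares` is a hypothetical helper introduced only for illustration.

```python
def class_shares(counts):
    """Return each class's fraction of the total number of samples."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one sample")
    return {name: n / total for name, n in counts.items()}

# Training-set counts as reported by the directory walk above
train_counts = {"NORMAL": 1341, "PNEUMONIA": 3875}
shares = class_shares(train_counts)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Roughly three quarters of the training images are PNEUMONIA, which is worth keeping in mind when reading accuracy figures later on.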


Label encoding: 0 = NORMAL, 1 = PNEUMONIA

In [76]:
# Label each training image from its path: 0 = NORMAL, 1 = PNEUMONIA
label = [0 if 'NORMAL' in path else 1 for path in train_images]
# Use the full list of paths as the image column, not the loop variable
df_train = pd.DataFrame({'image': train_images, 'label': label})
df_train.head()


In [80]:
# Label each test image from its path: 0 = NORMAL, 1 = PNEUMONIA
label = [0 if 'NORMAL' in path else 1 for path in test_images]
# Use the full list of paths as the image column, not the loop variable
df_test = pd.DataFrame({'image': test_images, 'label': label})
df_test.head()


In [78]:
# Label each validation image from its path: 0 = NORMAL, 1 = PNEUMONIA
label = [0 if 'NORMAL' in path else 1 for path in val_images]
# Use the full list of paths as the image column, not the loop variable
df_val = pd.DataFrame({'image': val_images, 'label': label})
df_val.head()
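The three cells above repeat the same labeling logic for each split. One way to avoid the repetition is a single helper that pairs every path with its label. This is a sketch under the assumption that each split folder contains `NORMAL` and `PNEUMONIA` subfolders of `.jpeg` files; `CLASS_TO_LABEL` and `label_paths` are hypothetical names, not part of the original notebook.

```python
import glob
import os

# Label encoding used in this notebook: 0 = NORMAL, 1 = PNEUMONIA
CLASS_TO_LABEL = {"NORMAL": 0, "PNEUMONIA": 1}

def label_paths(split_dir):
    """Return a list of (image_path, label) pairs for one dataset split."""
    pairs = []
    for class_name, label in CLASS_TO_LABEL.items():
        pattern = os.path.join(split_dir, class_name, "*.jpeg")
        # sorted() makes the row order reproducible across runs
        for path in sorted(glob.glob(pattern)):
            pairs.append((path, label))
    return pairs

# Example usage in the notebook (paths assumed from the cells above):
# df_train = pd.DataFrame(label_paths("/content/chest_xray/train"),
#                         columns=["image", "label"])
```

Taking the label from the class folder name rather than from a substring of the whole path keeps the path and its label paired by construction, and `glob` patterns with `*` skip hidden files such as `._IM-0001-0001.jpeg`.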


## 4. Preprocessing

## 5. Modeling

## 6. Training

## 7. Performance and evaluation

## 8. Discussions and conclusions
