#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image Classification Project

### Group Members: Patricia, Javi, Jalen



In this project we will build an image classification model and use the model to identify if the lungs pictured indicate that the patient has pneumonia. The outcome of the model will be true or false for each image.

The [data is hosted on Kaggle](https://www.kaggle.com/rob717/pneumonia-dataset) and consists of 5,863 x-ray images. Each image is classified as 'pneumonia' or 'normal'.

## Ethical Considerations

We will frame the problem as:

> *A hospital is having issues correctly diagnosing patients with pneumonia. Their current solution is to have two trained technicians examine every patient scan. Unfortunately, there are many times when two technicians are not available, and the scans have to wait for multiple days to be interpreted.*
>
> *They hope to fix this issue by creating a model that can identify if a patient has pneumonia. They will have one technician and the model both examine the scans and make a prediction. If the two agree, then the diagnosis is accepted. If the two disagree, then a second technician is brought in to provide their analysis and break the tie.*

Discuss some of the ethical considerations of building and using this model. 

* Consider potential bias in the data that we have been provided. 
* Should this model err toward precision or accuracy?
* What are the implications of massively over-classifying patients as having pneumonia?
* What are the implications of massively under-classifying patients as having pneumonia?
* Are there any concerns with having only one technician make the initial call?

The questions above are prompts. Feel free to bring in other considerations you might have.
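
The review workflow in the framing above can be sketched as a small decision rule. This is only an illustration: the function and argument names are hypothetical, and `True` stands for a pneumonia diagnosis.

```python
def resolve_diagnosis(first_technician, model_prediction, second_technician=None):
    """Return the accepted diagnosis (True = pneumonia) under the framed workflow.

    All names here are hypothetical; they only illustrate the tie-break rule.
    """
    # If the first technician and the model agree, the diagnosis is accepted.
    if first_technician == model_prediction:
        return first_technician
    # Otherwise a second technician is brought in to break the tie.
    if second_technician is None:
        raise ValueError("Disagreement: a second technician must review the scan.")
    return second_technician


print(resolve_diagnosis(True, True))          # technician and model agree
print(resolve_diagnosis(True, False, False))  # second technician breaks the tie
```

Note that when the technician and the model agree, no second person looks at the scan, so an error they both make goes unchecked.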

### **Student Solution**

* Each x-ray scan comes from a different patient, so underlying conditions may vary from patient to patient. "Normal" x-rays may show other conditions that could affect the model.

* Since we want to detect pneumonia as early as possible, the model should lean more toward accuracy, in the sense of catching every possible case (high recall). The model should try to detect the slightest hint of pneumonia. However, this may lead to more work for the doctors. If we lean more toward precision, we could miss potential diagnoses, but the doctors would have fewer cases to review.

* One implication of massively over-classifying patients as having pneumonia is that doctors could prescribe unnecessary medication, which could lead to health complications down the line. An implication of massively under-classifying patients as having pneumonia is that individuals could go home with an undiagnosed condition, which could lead to death.

* There could be sampling bias if the dataset is skewed toward a particular age range. Bodies change as we grow older, and pneumonia probably looks different on a 5-year-old than on an 80-year-old.

* If someone is diagnosed with pneumonia when they do not actually have it, they could miss work and pay for medicine and hospital care that they do not need.

* If someone with pneumonia is not diagnosed, they could infect many others, become very sick without proper care, and in extreme cases die.

* Having two technicians agree on a diagnosis is the safest approach because it reduces the chance of a mistake. If only one technician makes the call, there is a real possibility that they are wrong, which is why technology could be useful as a second check in the near future.
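
To make the trade-off concrete, the sketch below computes accuracy, precision, and recall from confusion-matrix counts. The counts are hypothetical and do not come from this dataset; they only show that a model with higher accuracy can still miss more pneumonia cases.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical results on 100 scans, with 'pneumonia' as the positive class.
cautious = classification_scores(tp=38, fp=20, fn=2, tn=40)  # flags scans readily
strict = classification_scores(tp=25, fp=2, fn=15, tn=58)    # flags only when sure

for name, scores in (("cautious", cautious), ("strict", strict)):
    print(name, [round(score, 3) for score in scores])
```

In this made-up example the strict model scores better on accuracy and precision, but the cautious model has far fewer false negatives, which is the costlier error in the scenario described above.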


---

## Modeling

In this section of the lab, you will build, train, test, and validate a model or models. The data is the ["Detecting Pneumonia" dataset](https://www.kaggle.com/rob717/pneumonia-dataset). You will build a binary classifier that determines if an x-ray image has pneumonia or not.

You'll need to:

* Download the dataset
* Perform EDA on the dataset
* Build a model that can classify the data
* Train the model using the training portion of the dataset. (It is already split out.)
* Test at least three different models or model configurations using the testing portion of the dataset. This step can include changing model types, adding and removing layers or nodes from a neural network, or any other parameter tuning that you find potentially useful. Score the model (using accuracy, precision, recall, F1, or some other relevant score(s)) for each configuration.
* After finding the "best" model and parameters, use the validation portion of the dataset to perform one final sanity check by scoring the model once more with the hold-out data.
* If you train a neural network (or other model that you can get epoch-per-epoch performance), graph that performance over each epoch.

Explain your work!

> *Note: You'll likely want to [enable GPU in this lab](https://colab.research.google.com/notebooks/gpu.ipynb) if it is not already enabled.*

If you get to a working solution you're happy with and want another challenge, you'll find pre-trained models on the [landing page of the dataset](https://www.kaggle.com/paultimothymooney/detecting-pneumonia-in-x-ray-images). Try to load one of those and see how it compares to your best model.

Use as many text and code cells as you need to for your solution.

### **Student Solution**

#### Load dataset

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && cp kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle datasets download paultimothymooney/chest-xray-pneumonia
! unzip chest-xray-pneumonia.zip
! ls

#### EDA

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
%matplotlib inline

train_dir = 'chest_xray/train' # image folder
test_dir = 'chest_xray/test' # image folder
val_dir = 'chest_xray/val' # image folder

# get the list of jpegs from sub image class folders
train_normal_imgs = [fn for fn in os.listdir(f'{train_dir}/NORMAL') if fn.endswith('.jpeg')]
train_pneumo_imgs = [fn for fn in os.listdir(f'{train_dir}/PNEUMONIA') if fn.endswith('.jpeg')]

# get the list of jpegs from sub image class folders
test_normal_imgs = [fn for fn in os.listdir(f'{test_dir}/NORMAL') if fn.endswith('.jpeg')]
test_pneumo_imgs = [fn for fn in os.listdir(f'{test_dir}/PNEUMONIA') if fn.endswith('.jpeg')]

# get the list of jpegs from sub image class folders
val_normal_imgs = [fn for fn in os.listdir(f'{val_dir}/NORMAL') if fn.endswith('.jpeg')]
val_pneumo_imgs = [fn for fn in os.listdir(f'{val_dir}/PNEUMONIA') if fn.endswith('.jpeg')]

print(len(train_normal_imgs), len(train_pneumo_imgs), len(test_normal_imgs), len(test_pneumo_imgs), len(val_normal_imgs), len(val_pneumo_imgs))

In [None]:
train_dir = './chest_xray/train'
train_categories = set(os.listdir(train_dir))
test_dir = 'chest_xray/test'
test_categories = set(os.listdir(test_dir))

if train_categories.symmetric_difference(test_categories):
  print("Warning!: ", train_categories.symmetric_difference(test_categories))

print(sorted(train_categories))
print(len(train_categories))

In [None]:
import cv2 as cv
import matplotlib.pyplot as plt

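# Load the first NORMAL training image. OpenCV reads images in BGR channel order.
sample_dir = os.path.join(train_dir, 'NORMAL')
img = cv.imread(os.path.join(sample_dir, os.listdir(sample_dir)[0]))
# Convert to RGB for display, since matplotlib expects RGB channel order.
_ = plt.imshow(cv.cvtColor(img, cv.COLOR_BGR2RGB))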

In [None]:
img.shape

In [None]:
img.min(), img.max()

Now we need to find a way to get the images into the model. TensorFlow Keras has a class called [`DirectoryIterator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/DirectoryIterator) that can help with that.

The iterator pulls images from a directory and passes them to our model in batches. There are many settings we can change. In our example here, we set `target_size` to `(100, 100)`, so each image is resized to that height and width when it is loaded. Notice that we don't provide a third dimension even though the images are loaded with three color channels. This is because the default `color_mode` is `'rgb'`, which implies three values per pixel.
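
As a rough sketch of what these settings imply, the helpers below work out the shape of one image batch and the number of batches in an epoch. The helper names and the image count are hypothetical; the channel counts follow the `color_mode` options that Keras documents (`'grayscale'`, `'rgb'`, `'rgba'`).

```python
import math

# Channels implied by each Keras color_mode value.
CHANNELS = {"grayscale": 1, "rgb": 3, "rgba": 4}


def batch_shape(batch_size, target_size, color_mode="rgb"):
    """Shape of one full image batch: (batch, height, width, channels)."""
    height, width = target_size
    return (batch_size, height, width, CHANNELS[color_mode])


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once (last one may be smaller)."""
    return math.ceil(num_images / batch_size)


print(batch_shape(128, (100, 100)))
print(batches_per_epoch(5000, 128))  # 5000 is a made-up image count
```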

We also set `image_data_generator` to `None`. If we wanted to, we could have passed an [`ImageDataGenerator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) to augment the image and increase the size of our dataset. We'll save this for an exercise.
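
To give a feel for what augmentation does, here is a minimal NumPy sketch that produces shifted and brightened copies of one image. It is a simplified stand-in, not the `ImageDataGenerator` implementation: the `augment` helper, its parameters, and the random test image are all assumptions made for illustration.

```python
import numpy as np


def augment(image, rng, max_shift=5, brightness=(0.9, 1.1)):
    """Return a randomly shifted and brightened copy of an HxWxC image array."""
    # Shift the image horizontally by a few pixels (np.roll wraps around the edge).
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(image, shift, axis=1)
    # Scale pixel intensities by a random factor and keep them in the 0-255 range.
    scale = rng.uniform(*brightness)
    return np.clip(shifted * scale, 0, 255).astype(image.dtype)


rng = np.random.default_rng(0)
# Random pixels standing in for a 100x100 x-ray image.
image = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
augmented = [augment(image, rng) for _ in range(4)]
print(len(augmented), augmented[0].shape, augmented[0].dtype)
```

Each call yields a slightly different variant of the same scan, which is how augmentation increases the effective size of a training set.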

In [None]:
import tensorflow as tf

train_dir = 'chest_xray/train'

train_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(100, 100),
    directory=train_dir,
    batch_size=128,
    image_data_generator=None)

In [None]:
print(train_image_iterator.filepaths[np.where(train_image_iterator.labels == 0)[0][0]])
print(train_image_iterator.filepaths[np.where(train_image_iterator.labels == 1)[0][0]])

#### Model 1

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                           input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
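
As a cross-check on the summary, the parameter count of this architecture can be worked out by hand. The sketch below assumes the layer settings shown above (3x3 kernels, `'same'` padding, and the default 2x2 max pooling); the helper names are our own.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units


size, channels, total = 100, 3, 0
for filters in (16, 32, 64):
    total += conv2d_params(3, channels, filters)
    channels = filters
    size //= 2  # 'same' padding keeps the size; 2x2 pooling halves it, rounding down

flat = size * size * channels  # number of values after Flatten
total += dense_params(flat, 512) + dense_params(512, 2)
print(flat, total)
```

Nearly all of the parameters sit in the first `Dense` layer, so the input image size has a large effect on model size.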

Now let's start training. Let one or two epochs run but then **!!!! STOP THE CELL FROM RUNNING !!!!**

How long was each epoch taking? Ours was taking about `4` minutes. Let's do the math. If each epoch took `4` minutes and we ran `100` epochs, then we'd be training for `400` minutes. That's just under `7` hours of training!

Luckily there is a better way. In the menu click on 'Runtime' and then 'Change runtime type'. In the modal that appears, there is an option called 'Hardware accelerator' that is set to 'None'. Change this to 'GPU' and save your settings.

Your runtime will change, so you'll need to go back to the start of this section and run all of the cells from the start. Don't forget to upload your `kaggle.json` again.

When you get back to this cell a second time and start it running, you should notice a big improvement in training time. We were getting `9` seconds per epoch, which would be about `900` seconds, or `15` minutes, for `100` epochs. That is much better. The cell below runs only `5` epochs, so at that rate it should finish in under a minute. Let the cell run to completion. You should see it progressing as it is running.
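
The arithmetic above is easy to reproduce. This small sketch uses the per-epoch timings quoted in the text; the function name is hypothetical.

```python
def total_training_seconds(seconds_per_epoch, epochs):
    """Estimated training time, assuming every epoch takes the same time."""
    return seconds_per_epoch * epochs


cpu_seconds = total_training_seconds(4 * 60, 100)  # about 4 minutes per epoch
gpu_seconds = total_training_seconds(9, 100)       # about 9 seconds per epoch

print(f"CPU: {cpu_seconds / 3600:.2f} hours")
print(f"GPU: {gpu_seconds / 60:.0f} minutes")
print(f"Speedup: {cpu_seconds / gpu_seconds:.1f}x")
```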

In [None]:
history = model.fit(
    train_image_iterator,
    epochs=5,
)

##### Plots

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['accuracy']))),
         history.history['accuracy'])
plt.show()

And our loss.

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['loss']))), history.history['loss'])
plt.show()

Over `99%` training accuracy. Let's see how well this generalizes:

In [None]:
import tensorflow as tf

test_dir = 'chest_xray/test'

test_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(100, 100),
    directory=test_dir,
    batch_size=128,
    shuffle=False,
    image_data_generator=None)

model.evaluate(test_image_iterator)

We can also make predictions. The code below selects the next batch, gets predictions for it, and then returns the first prediction.

In [None]:
predicted_class = np.argmax(model(next(test_image_iterator)[0])[0])
predicted_class
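
The model's final layer uses softmax, so each prediction is a pair of probabilities and `np.argmax` picks the more likely class. Below is a plain-Python sketch of that step. The raw scores are made up, and the helper functions are illustrative rather than part of Keras.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]  # subtract max for stability
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda index: values[index])


probabilities = softmax([0.3, 2.1])  # hypothetical raw scores for the two classes
predicted = argmax(probabilities)
print(probabilities, predicted)
```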

This maps to the directory in that position.

In [None]:
# Keras assigns class indices in sorted directory order, so sort before indexing.
sorted(os.listdir(train_dir))[predicted_class]
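
Keras directory iterators assign label indices to class subfolders in alphanumeric order, while `os.listdir` returns entries in arbitrary order, so it is safer to sort before indexing. A minimal sketch, using a hypothetical unsorted directory listing:

```python
def class_indices(subfolders):
    """Map each class folder name to its label index, in sorted order."""
    return {name: index for index, name in enumerate(sorted(subfolders))}


listing = ["PNEUMONIA", "NORMAL"]  # hypothetical, unsorted os.listdir() result
mapping = class_indices(listing)
index_to_name = {index: name for name, index in mapping.items()}

predicted_class = 1
print(index_to_name[predicted_class])  # label according to the sorted mapping
print(listing[predicted_class])        # what unsorted indexing would return
```

The iterator also exposes this mapping directly through its `class_indices` attribute.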

##### F1 Score

In [None]:
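# F1 score on the test set
from sklearn.metrics import f1_score

# Rewind the iterator, since a batch was already consumed above.
test_image_iterator.reset()

# True labels, in directory order because shuffle=False.
actual_classes = test_image_iterator.classes

predictions = model.predict(test_image_iterator)

# Pick the most probable class for each image.
predicted_classes = np.argmax(predictions, axis=1)

# average='micro' would equal plain accuracy here, so score the positive
# class (PNEUMONIA, label 1) instead.
f1_score(actual_classes, predicted_classes, average='binary')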
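
For reference, F1 is the harmonic mean of precision and recall. When every image has exactly one label, micro-averaged F1 reduces to plain accuracy, while binary F1 focuses on the positive class. The sketch below illustrates this with hypothetical labels and hand-written helpers rather than scikit-learn.

```python
def f1_binary(actual, predicted, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def f1_micro(actual, predicted):
    """Micro-averaged F1: pool TP, FP, and FN over every class first."""
    tp = fp = fn = 0
    for label in set(actual) | set(predicted):
        for a, p in zip(actual, predicted):
            tp += a == label and p == label
            fp += a != label and p == label
            fn += a == label and p != label
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels: 0 = NORMAL, 1 = PNEUMONIA.
actual = [0, 0, 0, 1, 1, 1, 1, 1]
predicted = [1, 1, 0, 1, 1, 1, 1, 1]

print(f1_binary(actual, predicted), f1_micro(actual, predicted), accuracy(actual, predicted))
```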

#### Model 2

In [None]:
import tensorflow as tf

train_dir = 'chest_xray/train'

train_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(50, 50), # changed target size
    directory=train_dir,
    batch_size=256, # changed batch size
    image_data_generator=None)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                           input_shape=(50, 50, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(
    train_image_iterator,
    epochs=5,
)

##### Plots

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['accuracy']))),
         history.history['accuracy'])
plt.show()

And our loss.

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['loss']))), history.history['loss'])
plt.show()

Over `99%` training accuracy. Let's see how well this generalizes:

In [None]:
import tensorflow as tf

test_dir = 'chest_xray/test'

test_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(50, 50),
    directory=test_dir,
    batch_size=256,
    shuffle=False,
    image_data_generator=None)

model.evaluate(test_image_iterator)

We can also make predictions. The code below selects the next batch, gets predictions for it, and then returns the first prediction.

In [None]:
predicted_class = np.argmax(model(next(test_image_iterator)[0])[0])
predicted_class

This maps to the directory in that position.

In [None]:
# Keras assigns class indices in sorted directory order, so sort before indexing.
sorted(os.listdir(train_dir))[predicted_class]

##### F1 Score

In [None]:
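# F1 score on the test set
from sklearn.metrics import f1_score

# Rewind the iterator, since a batch was already consumed above.
test_image_iterator.reset()

# True labels, in directory order because shuffle=False.
actual_classes = test_image_iterator.classes

predictions = model.predict(test_image_iterator)

# Pick the most probable class for each image.
predicted_classes = np.argmax(predictions, axis=1)

# average='micro' would equal plain accuracy here, so score the positive
# class (PNEUMONIA, label 1) instead.
f1_score(actual_classes, predicted_classes, average='binary')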

#### Model 3 (Best Model)

In [None]:
import tensorflow as tf

train_dir = 'chest_xray/train'

train_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(50, 50), # changed target size
    directory=train_dir,
    batch_size=256, # changed batch size
    image_data_generator=None)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                           input_shape=(50, 50, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(
    train_image_iterator,
    epochs=5,
)

##### Plots

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['accuracy']))),
         history.history['accuracy'])
plt.show()

And our loss.

In [None]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(history.history['loss']))), history.history['loss'])
plt.show()

Over `99%` training accuracy. Let's see how well this generalizes:

In [None]:
import tensorflow as tf

test_dir = 'chest_xray/test'

test_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(50, 50),
    directory=test_dir,
    batch_size=256,
    shuffle=False,
    image_data_generator=None)

model.evaluate(test_image_iterator)

We can also make predictions. The code below selects the next batch, gets predictions for it, and then returns the first prediction.

In [None]:
predicted_class = np.argmax(model(next(test_image_iterator)[0])[0])
predicted_class

This maps to the directory in that position.

In [None]:
# Keras assigns class indices in sorted directory order, so sort before indexing.
sorted(os.listdir(train_dir))[predicted_class]

##### F1 Score

In [None]:
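# F1 score on the test set
from sklearn.metrics import f1_score

# Rewind the iterator, since a batch was already consumed above.
test_image_iterator.reset()

# True labels, in directory order because shuffle=False.
actual_classes = test_image_iterator.classes

predictions = model.predict(test_image_iterator)

# Pick the most probable class for each image.
predicted_classes = np.argmax(predictions, axis=1)

# average='micro' would equal plain accuracy here, so score the positive
# class (PNEUMONIA, label 1) instead.
f1_score(actual_classes, predicted_classes, average='binary')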

#### Validate

In [None]:
val_dir = 'chest_xray/val'
val_image_iterator = tf.keras.preprocessing.image.DirectoryIterator(
    target_size=(50, 50),
    directory=val_dir,
    batch_size=256,
    shuffle=False,
    image_data_generator=None)
# Score the model on the hold-out data. Calling fit here would train on it.
model.evaluate(val_image_iterator)

In [None]:
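# F1 score on the hold-out validation set
from sklearn.metrics import f1_score

# Rewind the iterator before predicting.
val_image_iterator.reset()

# True labels, in directory order because shuffle=False.
actual_classes = val_image_iterator.classes

predictions = model.predict(val_image_iterator)

# Pick the most probable class for each image.
predicted_classes = np.argmax(predictions, axis=1)

# average='micro' would equal plain accuracy here, so score the positive
# class (PNEUMONIA, label 1) instead.
f1_score(actual_classes, predicted_classes, average='binary')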

---