# Data Exploration
### Plot data characteristics and visually examine the images. 

### Setup drive

Run the following cell to mount your Drive in Colab. Open the URL it prints, log in, then copy the authorization code and paste it into the prompt. Once the mount succeeds, "drive" should appear in the Files tab on the left.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Click the little triangle next to "drive" and navigate to the "AI4All Chest X-Ray Project" folder. Hover over it and click the three dots that appear on the right. Select "Copy path", then replace PASTE PATH HERE in the cell below with the path to your folder.

In [None]:
%cd "PASTE PATH HERE"

/content/drive/My Drive/AI4All Project/AI4All Chest X-Ray Project


### Import necessary libraries
`skimage` (scikit-image) is a package for image processing; here we use it to read the image files.

In [None]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import random
import seaborn as sns
from skimage import io

### Setup paths

Define paths and load metadata

In [None]:
path_to_dataset = os.path.join('data')
path_to_images = os.path.join(path_to_dataset, 'images')

metadata = pd.read_csv(os.path.join(path_to_dataset, 'metadata_train.csv'))

### Examine metadata

The metadata provides useful information about the dataset such as data sources and other background factors.

Print out metadata and take a look at each of the columns!

In [None]:
metadata
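
Besides printing the whole table, a few pandas summaries can give a quicker overview of what each column holds. The sketch below runs on a small made-up table (`toy`); the column names follow the ones used in this notebook, but the values are invented for illustration and are not taken from the real metadata.

```python
import pandas as pd

# Toy stand-in for the metadata table (values are invented for illustration)
toy = pd.DataFrame({
    'finding': ['COVID-19', 'No Finding', 'COVID-19', 'No Finding'],
    'dataset': ['Cohen', 'Stanford', 'Cohen', 'Stanford'],
    'age': [54.0, 31.0, None, 67.0],
})

n_rows, n_cols = toy.shape                       # number of rows and columns
col_types = toy.dtypes                           # data type of each column
finding_counts = toy['finding'].value_counts()   # how often each value occurs

print(n_rows, n_cols)
print(col_types)
print(finding_counts)
```

The same calls should work on `metadata` itself, assuming it loaded correctly.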

**Check out the dataframe**

What information is provided for each image? What values are in each column?

Are there any patterns with regard to empty values in the dataset?

Write your thoughts here: 

In [None]:
# Experiment with .isna(), .isna().any(), and .isna().any(axis = 0)
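
As a starting point, here is a minimal sketch of what these calls return, run on a small made-up table rather than the real metadata. The `toy` dataframe and its values are assumptions for illustration only. Note that `.isna().any()` and `.isna().any(axis=0)` give the same result, because `axis=0` is the default.

```python
import pandas as pd

# Toy table with a few missing entries (values are invented for illustration)
toy = pd.DataFrame({
    'dataset': ['Cohen', 'Cohen', 'Stanford', 'Stanford'],
    'age': [54.0, None, 31.0, 67.0],
    'sex': ['M', 'F', None, None],
})

missing_mask = toy.isna()              # True/False for every cell
cols_with_missing = toy.isna().any()   # per column: is anything missing?
missing_per_col = toy.isna().sum()     # per column: how many values are missing?

# Missing values per column, split by data source
missing_by_dataset = (
    toy.drop(columns='dataset').isna().groupby(toy['dataset']).sum()
)

print(cols_with_missing)
print(missing_per_col)
print(missing_by_dataset)
```

Splitting the counts by a second column, as in the last step, is one way to check whether missing values cluster in one data source.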


### Explore data characteristics

Characteristics of the data may affect the model evaluation in unexpected ways. By examining properties of the data, we can anticipate and account for biases in the data. For example, if all the Covid-19 images came from adults while all the No Finding images came from children, our model may pick up signals related to age rather than to the disease. Such a model would not generalize well when applied to predict real-world data.
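
One way to check for the age scenario described above is to summarize age separately for each finding. This is only a sketch on invented data: the `toy` table is hypothetical, and it assumes the real metadata has `finding` and `age` columns.

```python
import pandas as pd

# Invented data that mimics the worst case: adults vs. children
toy = pd.DataFrame({
    'finding': ['COVID-19', 'COVID-19', 'COVID-19',
                'No Finding', 'No Finding', 'No Finding'],
    'age': [61, 58, 70, 9, 12, 7],
})

# Age range and median within each class
age_by_finding = toy.groupby('finding')['age'].agg(['min', 'median', 'max'])
print(age_by_finding)
```

If the age ranges of the two classes barely overlap, as in this toy table, age alone would be enough to separate them.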

Note: Feel free to add cells and experiment with the data on your own! Maybe you'll find patterns we didn't even think of!

**Making plots from dataframes with Seaborn**

Seaborn (`sns`) is a nice library for visualizing plots from dataframes. In the code below, we provide the dataframe `data=metadata` to the plotting function `sns.countplot`. We also specify which column from the dataframe to plot on the x-axis `x='finding'` and which column to use for color `hue='dataset'`. 

Other functions in seaborn use similar syntax. You can [check out examples here](http://seaborn.pydata.org/examples/index.html) for other plots to try!



In [None]:
# plots number of covid vs no finding images in the two datasets
sns.countplot(x='finding', hue='dataset', data=metadata)
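
If you want the numbers behind a count plot, `pd.crosstab` produces the same counts as a table. The sketch below uses a small invented dataframe (`toy`), not the real metadata.

```python
import pandas as pd

# Invented rows using the same column names as the plot above
toy = pd.DataFrame({
    'finding': ['COVID-19', 'COVID-19', 'No Finding', 'No Finding', 'No Finding'],
    'dataset': ['Cohen', 'Cohen', 'Stanford', 'Stanford', 'Cohen'],
})

# Rows: finding, columns: dataset, cells: number of images
counts = pd.crosstab(toy['finding'], toy['dataset'])
print(counts)
```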

In [None]:
# EXERCISE: try changing the x and hue to explore other factors!




**Examine how the data is distributed according to variables in the dataframe** 

Try different plots to examine these variables (age, sex, etc).

Write your observations here: 

In [None]:
# EXERCISE: plot other variables, try different kinds of plots such as
# sns.histplot (sns.distplot is deprecated in recent seaborn versions)



What are some other important attributes to consider, including those not provided in the metadata?

**More plotting functions**

We can make more detailed visualizations that split the data across multiple plots based on attributes. Here we use [`FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) to plot the distribution of `sex` in subsets based on `finding` and `dataset` values.

Feel free to play around with this code to make additional plots.

In [None]:
g = sns.FacetGrid(metadata, col='dataset', row='finding', sharey=False)
p = g.map(sns.countplot, 'sex').add_legend()

### Explore the images

Machines are smart but not that smart. If human vision cannot distinguish between two sets of images, it may also be difficult for our models to detect any differences. Our knowledge of the data can also help guide model design for better accuracy.

In [None]:
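def sample_images(df, n=25):
    '''Randomly samples up to n images (rows) from a dataframe and reads them
    
    Args:
      df: dataframe with `folder` and `filename` columns
      n: number of images
    
    Returns: list of images
    '''
    
    # build the full path to each image file
    files = df.apply(lambda x: os.path.join(path_to_images, x.folder, x.filename), 
                     axis=1)
    files = files.tolist()
    
    # random.sample raises a ValueError if n exceeds the number of files,
    # so cap n for small subsets
    subset = random.sample(files, min(n, len(files)))
    
    ims = [io.imread(p) for p in subset]

    return ims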

In [None]:
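def plot_grid(ims, n=5):
    '''Plots images in a grid (n x n)
    
    Args:
      ims: list of images
      n: grid size (n x n)
    
    Returns: None (plots images)
    '''

    # squeeze=False keeps axes as a 2D array even when n=1
    fig, axes = plt.subplots(n, n, figsize=(15, 15), squeeze=False)
    axes = axes.ravel()

    # hide ticks and frames, including on grid cells left empty
    for ax in axes:
        ax.axis('off')

    # plots images as greyscale
    for im, ax in zip(ims, axes):
        ax.imshow(im, cmap='gray')

    plt.show()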

**Do you see any differences between Covid and healthy images?**

Write your thoughts here:


In [None]:
# EXERCISE: Use the above helper functions to plot 25 randomly selected 
# images of patients with Covid


In [None]:
# EXERCISE: Use the above helper functions to plot 25 randomly selected 
# images of healthy patients



**Any other image properties to note?**

How do the three views differ?

Are there any superficial differences in the image that could impact the training?
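
Some differences are easier to spot in numbers than by eye. `io.imread` returns NumPy arrays, so each image has a `shape` and a `dtype` you can inspect. The sketch below uses synthetic arrays as stand-ins for loaded images; the sizes and types are invented for illustration and say nothing about the real dataset.

```python
import numpy as np

# Synthetic stand-ins for images returned by sample_images
ims = [
    np.zeros((256, 256), dtype=np.uint8),      # greyscale, 8-bit
    np.zeros((512, 400, 3), dtype=np.uint8),   # 3-channel color, different size
    np.zeros((256, 256), dtype=np.uint16),     # greyscale, 16-bit
]

# Print size, data type, and intensity range of each image
for im in ims:
    print(im.shape, im.dtype, im.min(), im.max())

# Collect the distinct shapes and data types in the sample
shapes = {im.shape for im in ims}
dtypes = {str(im.dtype) for im in ims}
print(shapes)
print(dtypes)
```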

In [None]:
# EXERCISE: Compare images from different views by subsetting the 
# dataframe and using the same helper functions above



### Class balancing

Another important factor to consider when examining your data is the proportion of data in each class (label). In this dataset, I have provided an equal number of Covid-19 and No Finding images (395 images each). However, there are actually 16985 No Finding images available in the Stanford dataset!

What would happen if we used all the No Finding images with only 395 Covid-19 images?

In [None]:
num_covid = 395
num_norm = 16985

# divide by the total number of images, not just the No Finding images
print(f'Percent Covid in entire dataset: {num_covid / (num_covid + num_norm) * 100:0.2f}%')
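
To get a feel for why this imbalance matters, consider a hypothetical "model" that ignores the image and always predicts No Finding. The sketch below only does arithmetic on the two counts given above; no real model is involved.

```python
num_covid = 395
num_norm = 16985
total = num_covid + num_norm

# A "model" that always predicts No Finding gets every No Finding image
# right and every Covid-19 image wrong
baseline_accuracy = num_norm / total
covid_detected = 0 / num_covid

print(f'Accuracy of always predicting No Finding: {baseline_accuracy * 100:0.2f}%')
print(f'Fraction of Covid-19 images detected: {covid_detected * 100:0.2f}%')
```

The accuracy comes out close to 98% even though not a single Covid-19 image is detected, which is why accuracy alone can be misleading on unbalanced data.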

### Check for multiple images for the same patient

Here, we will be checking whether there are multiple images from the same patient in our dataset.
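
The pandas method `duplicated` does the checking, and its `keep` argument changes what gets flagged. The sketch below shows the difference on an invented table (`toy`); the patient IDs and filenames are made up.

```python
import pandas as pd

# Invented table: p1 has 2 images, p2 has 1, p3 has 3
toy = pd.DataFrame({
    'patientid': ['p1', 'p1', 'p2', 'p3', 'p3', 'p3'],
    'filename': ['a.png', 'b.png', 'c.png', 'd.png', 'e.png', 'f.png'],
})

# Default (keep='first'): flags every image after the first one per patient
repeats_only = toy[toy.duplicated(['patientid'])]

# keep=False: flags all images of any patient who appears more than once
all_from_repeat_patients = toy[toy.duplicated(['patientid'], keep=False)]

# Number of images per patient
images_per_patient = toy['patientid'].value_counts()

print(len(repeats_only), len(all_from_repeat_patients))
print(images_per_patient)
```

Which of the two counts you want depends on the question: "how many extra images are there?" versus "how many images belong to patients with more than one image?".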

**Why might it be bad to include images of the same patient both in the train and test set?**

Write your thoughts here: 

In [None]:
# duplicated() flags every image after the first one for each patient id;
# pass keep=False to flag the first image as well
multiple_images = metadata[metadata.duplicated(['patientid'])]

# EXERCISE: fill in the following variables (replace each None)!

# the total number of images that have duplicate patient IDs
num_multi = None

# the number of images that have duplicate patient IDs in the Cohen dataset
num_multi_cohen = None

# the number of images that have duplicate patient IDs in the Stanford dataset
num_multi_stanford = None

print(f'Number of images for the same patients: {num_multi}')
print(f'- In Cohen dataset: {num_multi_cohen}')
print(f'- In Stanford dataset: {num_multi_stanford}')
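
If you decide that all images of a patient should stay on the same side of a train/test split, one option is to split the patient IDs rather than the images. The following is a minimal sketch on invented data, not the split used in this project; the `toy` table, the 20% test fraction, and the seed are all assumptions for illustration.

```python
import random
import pandas as pd

# Invented table: some patients have more than one image
toy = pd.DataFrame({
    'patientid': ['p1', 'p1', 'p2', 'p3', 'p3', 'p4', 'p5', 'p5'],
    'filename': [f'img_{i}.png' for i in range(8)],
})

# Shuffle the unique patient IDs (seeded so the split is repeatable)
patients = sorted(toy['patientid'].unique())
rng = random.Random(0)
rng.shuffle(patients)

# Hold out roughly 20% of the patients (hypothetical fraction)
n_test = max(1, int(0.2 * len(patients)))
test_patients = set(patients[:n_test])

# Assign every image according to its patient
is_test = toy['patientid'].isin(test_patients)
train_df, test_df = toy[~is_test], toy[is_test]

# No patient should appear on both sides
overlap = set(train_df['patientid']) & set(test_df['patientid'])
print(len(train_df), len(test_df), overlap)
```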

### Discussion

**What are your initial thoughts on the data? Anything to keep in mind when training and evaluating the model?**

Write your thoughts here: 