<a href="https://colab.research.google.com/github/mthomp89/NU_489_capstone/blob/master/NIH_EDA_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This section downloads the data from Kaggle.**
**I could not get the .csv file to download on its own and did not want to wait for the full 43 GB to transfer, so I stored the file in my Google Drive instead.**

In [0]:
# bring in Colab files
from google.colab import files

# install the Kaggle API client
!pip install -q kaggle

In [0]:
# upload the kaggle.json API token from the local machine
api = files.upload()

In [0]:
# make a directory to save the .json file
# (-p avoids an error if the directory already exists)

! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
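
The same credential setup can be sketched in plain Python, which may be handy outside Colab where the `!` shell syntax is unavailable. The helper name `install_kaggle_token` and the placeholder token contents are assumptions made for illustration; only the target layout (`~/.kaggle/kaggle.json` with mode 600) comes from the cell above.

```python
from pathlib import Path
import shutil
import stat
import tempfile


def install_kaggle_token(token_path, home_dir):
    """Copy a kaggle.json token into <home_dir>/.kaggle with owner-only access.

    Hypothetical helper, not part of the Kaggle package.
    """
    kaggle_dir = Path(home_dir) / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like mkdir -p
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like cp
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like chmod 600
    return target


# Demo with a placeholder token inside a temporary directory,
# so nothing in the real home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, Path(tmp) / "home")
    installed_exists = installed.exists()
    installed_mode = stat.S_IMODE(installed.stat().st_mode)
```

For real use, `home_dir` would be `Path.home()` and `token_path` the uploaded `kaggle.json`.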


In [0]:
# try to download only the metadata file from Kaggle
# (-d names the dataset, -f selects a single file within it)

!kaggle datasets download -d nih-chest-xrays/data -f Data_Entry_2017.csv

**From here on, the Data_Entry_2017.csv file is imported from a saved Google Drive location, not from Kaggle.**

In [0]:
# load G Drive files

from google.colab import drive
drive.mount('/content/drive')

In [0]:
# bring in Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

In [0]:
# load data file
df = pd.read_csv('/content/drive/My Drive/Capstone/Data_Entry_2017.csv', low_memory=False)

In [0]:
df.head()

In [0]:
# clean up some of the column names

df.rename(columns = {'OriginalImage[Width': 'ImageWidth',
                     'Height]': 'ImageHeight',
                     'OriginalImagePixelSpacing[x': 'ImageSpacing_X',
                     'y]': 'ImageSpacing_Y'}, inplace = True)

# drop the last column - it is empty

df.drop('Unnamed: 11', axis = 1, inplace = True)

In [0]:
df.head()

In [0]:
df.describe()

**There are some age outliers - one record has an age of 414**


*   Identify the age outliers - any record with an age over 100
*   Replace those outliers with the median age of the remaining records

In [0]:
# look at age outliers and try to clean up
sns.set_style('whitegrid')
sns.boxplot(x = 'Patient Gender', y = 'Patient Age', data = df)

In [0]:
# how many records have age > 100?

print('There are', len(df[df['Patient Age'] > 100]), 'records with Age > 100')

In [0]:
# replace ages over 100 with the median age of the records at or below 100

median = df.loc[df['Patient Age'] <= 100, 'Patient Age'].median()
df['Patient Age'] = np.where(df['Patient Age'] > 100, median, df['Patient Age'])
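
As a quick check of the imputation logic, here is a minimal sketch on a made-up five-row frame (the ages are invented for illustration, not taken from the dataset). It also shows a side effect worth knowing: because the median is a float, `np.where` returns a float column.

```python
import numpy as np
import pandas as pd

# Toy ages: three plausible values and two outliers (invented data).
toy = pd.DataFrame({'Patient Age': [25, 40, 61, 414, 150]})

plausible = toy['Patient Age'] <= 100

# Median of the plausible ages only (25, 40, 61), so the outliers
# cannot pull the replacement value upward.
median_age = toy.loc[plausible, 'Patient Age'].median()

# Keep plausible ages, replace the rest with the median.
toy['Patient Age'] = np.where(plausible, toy['Patient Age'], median_age)

print(toy['Patient Age'].tolist())  # [25.0, 40.0, 61.0, 40.0, 40.0]
print(toy['Patient Age'].dtype)     # float64
```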

In [0]:
# show how ages break down
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
plt.figure(figsize=(9,9))
sns.histplot(df['Patient Age'], kde = False, color = 'blue', bins = 10)
plt.xlabel('Patient Age', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.title('Age Breakdown', fontsize = 16)

**Look at how the images are labeled with their conditions**


*   Some labels contain multiple conditions
*   First look at the label counts after collapsing every multi-condition label into a single label



In [0]:
# how many images are in each label?

df['Finding Labels'].value_counts()

**There are 836 unique image labels.  Many of them have multiple diagnoses.**
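
To put a number on "many", one option is to count the `|` separators in each label string. The sketch below uses a made-up five-row series rather than the real column, so the figures in the comments describe only this toy data.

```python
import pandas as pd

# Invented label strings in the same |-separated format as 'Finding Labels'.
toy_labels = pd.Series([
    'No Finding',
    'Effusion',
    'Effusion|Mass',
    'Effusion|Mass',
    'Atelectasis|Effusion|Mass',
])

# Number of distinct label strings (what value_counts() lists).
n_unique = toy_labels.nunique()                       # 4

# Conditions per image = number of separators + 1.
conditions_per_image = toy_labels.str.count(r'\|') + 1

# Rows and distinct label strings that carry more than one condition.
multi = conditions_per_image > 1
n_multi_rows = int(multi.sum())                       # 3
n_multi_labels = toy_labels[multi].nunique()          # 2
```

Applied to the real data, the same three lines would use `df['Finding Labels']` in place of `toy_labels`.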

In [0]:
# create an updated image label for multiple diagnoses
# these are separated by | in the finding labels

def multiple_conditions(x):
  if '|' not in x:
    return x
  return 'Multiple Conditions'

df['Condition_Updated'] = df['Finding Labels'].apply(multiple_conditions)


In [0]:
# show counts of conditions

label_count = df.groupby('Condition_Updated').size().reset_index(name = 'Counts')
label_count = label_count.sort_values(['Counts'], ascending=False).reset_index(drop = True)

plt.figure(figsize=(9,9))
sns.barplot(x = 'Counts', y = 'Condition_Updated', data = label_count)
plt.xlabel('Image Count')
plt.ylabel('Conditions')
plt.title('Count of Image Results')


**Split out the multiple-condition labels so that each image is counted under every individual condition it carries**

In [0]:
# separate the different conditions in each image to classify them better
# use one-hot encoding: one 0/1 column per condition

conditions = ['No Finding','Infiltration','Atelectasis','Effusion','Nodule','Pneumothorax','Mass','Consolidation','Pleural_Thickening','Cardiomegaly','Emphysema','Fibrosis','Edema','Pneumonia','Hernia']

# split each label on | and test for an exact match, so a condition name
# can never match as a substring of another name
label_lists = df['Finding Labels'].str.split('|')
for condition in conditions:
    df[condition] = label_lists.apply(lambda labels: int(condition in labels))

# melt data together to get count of all conditions
df_2 = pd.melt(df, id_vars=['Image Index','Finding Labels','Patient ID','Patient Gender','Patient Age'], value_vars = conditions, var_name='Labels', value_name='Count')
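
As an alternative to the loop, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. This is a sketch on invented rows, not a replacement the notebook depends on; note that the resulting columns come out in alphabetical order and include only labels that actually occur.

```python
import pandas as pd

# Invented rows in the same format as the NIH metadata file.
toy = pd.DataFrame({
    'Image Index': ['a.png', 'b.png', 'c.png'],
    'Finding Labels': ['No Finding', 'Effusion|Mass', 'Mass'],
})

# One 0/1 column per distinct condition, split on the | separator.
one_hot = toy['Finding Labels'].str.get_dummies(sep='|')

# Images per condition, largest first.
label_totals = one_hot.sum().sort_values(ascending=False)

print(one_hot)
print(label_totals)
```

The explicit loop over a fixed `conditions` list has one advantage: it guarantees a column for every expected condition, even one that never appears in the data.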



In [0]:
# how many images are there for each condition now?
# sum only the 0/1 Count column; summing every column would also try to add up the string columns
new_label_count = df_2.groupby('Labels')['Count'].sum().reset_index()
new_label_count = new_label_count.sort_values(['Count'], ascending=False).reset_index(drop = True)

plt.figure(figsize=(9,9))
sns.barplot(x = 'Count', y = 'Labels', data = new_label_count)
plt.xlabel('Image Count')
plt.ylabel('Conditions')
plt.title('Count of Image Results')


In [0]:
new_label_count.head(20)
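
One way to sanity-check these totals is to compare them with the number of individual conditions found in the raw label strings: if the condition list is complete, the two numbers match. The sketch below demonstrates the check on invented rows with a shortened condition list; both are assumptions made for illustration.

```python
import pandas as pd

# Invented rows and a shortened condition list for the demo.
toy = pd.DataFrame({'Finding Labels': ['No Finding', 'Effusion|Mass', 'Mass']})
conditions = ['No Finding', 'Effusion', 'Mass']

# One-hot encode with an exact match on the split labels.
label_lists = toy['Finding Labels'].str.split('|')
for condition in conditions:
    toy[condition] = label_lists.apply(lambda labels: int(condition in labels))

# Total conditions implied by the raw strings (separators + 1 per row).
expected_total = int((toy['Finding Labels'].str.count(r'\|') + 1).sum())

# Total of all one-hot indicators.
encoded_total = int(toy[conditions].sum().sum())

# A mismatch would point to a condition missing from the list.
print(expected_total, encoded_total)
```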