# Predicting pneumonia from X-ray images: Checkpoint #1

Members: 
- Claire Boyd
- Jack Gibson
- Benjamin Leiva
- Raul Castellanos

In [None]:
# Import modules
import pandas as pd
from process import load

## Load and prepare data

In [None]:
# Generate dataframes from csv files for descriptive statistics
train = pd.read_csv('data/output/train.csv')
test  = pd.read_csv('data/output/test.csv')
val   = pd.read_csv('data/output/val.csv')

types = ['train', 'test', 'val']
dataframes = [train, test, val]

# Tag each split with its name and compute the image area in pixels
for name, df in zip(types, dataframes):
    df['type'] = name
    df['size'] = df['height'] * df['width']

# Combine datasets
data = pd.concat([train, test, val], ignore_index=True, axis=0)

## Descriptive statistics

### 1. Label count

In [None]:
data.groupby(["type", "label"])["type"].count().reset_index(name="count")

The table above shows that:
- Training data has 2,682 normal labels (26%) and 7,750 pneumonia labels (74%).
- Test data has 468 normal labels (38%) and 780 pneumonia labels (62%).
- Validation data is split 50-50 between normal and pneumonia labels.
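
The percentages quoted above are not produced by the count table itself. A minimal sketch of how they could be derived is shown below; it runs on a small made-up dataframe rather than the real `data`, and the label values `'normal'` and `'pneumonia'` are assumptions about how the `label` column is encoded.

```python
import pandas as pd

# Toy stand-in for `data` (rows are made up, not the real dataset)
toy = pd.DataFrame({
    'type':  ['train'] * 4 + ['test'] * 2,
    'label': ['normal', 'pneumonia', 'pneumonia', 'pneumonia',
              'normal', 'pneumonia'],
})

# Count rows per (split, label) pair
counts = toy.groupby(['type', 'label']).size().reset_index(name='count')

# Share of each label within its own split, as a percentage
split_totals = counts.groupby('type')['count'].transform('sum')
counts['percent'] = 100 * counts['count'] / split_totals

print(counts)
```

Replacing `toy` with `data` should give the per-split shares directly, assuming the `type` and `label` columns are present as in the cells above.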

### 2. Height and width count

In [None]:
# Height count
train['height_bracket'] = 'More than 2000 pixels'
train.loc[(train.loc[:, 'height'] < 2000) & (train.loc[:, 'height'] >= 1500), 'height_bracket'] = 'Between 1500-2000 pixels'
train.loc[(train.loc[:, 'height'] < 1500) & (train.loc[:, 'height'] >= 1000), 'height_bracket'] = 'Between 1000-1500 pixels'
train.loc[(train.loc[:, 'height'] < 1000) & (train.loc[:, 'height'] >= 500), 'height_bracket'] = 'Between 500-1000 pixels'
train.loc[(train.loc[:, 'height'] < 500), 'height_bracket'] = 'Less than 500 pixels'

train.groupby(["type", "height_bracket"])["type"].count().reset_index(name="count").sort_values('count', ascending=False)

In [None]:
# Width count
train['width_bracket'] = 'More than 2000 pixels'
train.loc[(train.loc[:, 'width'] < 2000) & (train.loc[:, 'width'] >= 1500), 'width_bracket'] = 'Between 1500-2000 pixels'
train.loc[(train.loc[:, 'width'] < 1500) & (train.loc[:, 'width'] >= 1000), 'width_bracket'] = 'Between 1000-1500 pixels'
train.loc[(train.loc[:, 'width'] < 1000) & (train.loc[:, 'width'] >= 500), 'width_bracket'] = 'Between 500-1000 pixels'
train.loc[(train.loc[:, 'width'] < 500), 'width_bracket'] = 'Less than 500 pixels'

train.groupby(["type", "width_bracket"])["type"].count().reset_index(name="count").sort_values('count', ascending=False)


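The height and width cells repeat the same five bracket assignments. As a possible compact alternative, the sketch below expresses the same brackets with `pd.cut`; the helper name `bracket` is hypothetical and the input values are made up for illustration.

```python
import pandas as pd


def bracket(pixels: pd.Series) -> pd.Series:
    """Assign each pixel count to one of the five brackets used above."""
    edges = [0, 500, 1000, 1500, 2000, float('inf')]
    labels = [
        'Less than 500 pixels',
        'Between 500-1000 pixels',
        'Between 1000-1500 pixels',
        'Between 1500-2000 pixels',
        'More than 2000 pixels',
    ]
    # right=False closes each bin on the left: [500, 1000), [1000, 1500), ...
    # which matches the >= / < comparisons in the cells above.
    return pd.cut(pixels, bins=edges, labels=labels, right=False).astype(str)


# Made-up pixel counts that hit the bracket boundaries
toy = pd.Series([499, 500, 1499, 2000, 2500])
print(bracket(toy).tolist())
```

Under this sketch, `train['height_bracket'] = bracket(train['height'])` and `train['width_bracket'] = bracket(train['width'])` would replace the two blocks of `.loc` assignments. As in the original cells, a value of exactly 2000 falls into the 'More than 2000 pixels' bracket.
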
In [None]:
# Size descriptive statistics
# data[['height', 'width', 'size']].describe()
train['size'].describe().apply("{0:.0f}".format)

## Data pre-processing

We decided on two transformations for our training data: CenterCrop and Resize.

We chose CenterCrop because cropping away the borders of an X-ray keeps the part of the image that carries information about the presence of pneumonia. This keeps the model from analyzing pixels from, for example, the shoulder or arm regions, and focuses it on the chest and lungs. We then apply Resize to the cropped images so that every image in the training data has the same size.

Both transformations take height and width parameters. We set them to the mean height and mean width observed in the training data.
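
As a rough illustration of how the two steps could be wired together, the sketch below assumes that CenterCrop and Resize refer to the classes of the same name in `torchvision.transforms`. The helper names `mean_image_size` and `build_transform` are hypothetical, and the dataframe values are made up. The means are rounded because both transformations expect whole-pixel sizes.

```python
import pandas as pd


def mean_image_size(df: pd.DataFrame) -> tuple:
    """Return the mean (height, width) of `df`, rounded to whole pixels."""
    height = int(round(df['height'].mean()))
    width = int(round(df['width'].mean()))
    return height, width


def build_transform(size: tuple):
    """Compose CenterCrop and Resize with the same (height, width) target."""
    # Imported lazily so the size helper also works without torchvision.
    from torchvision import transforms

    return transforms.Compose([
        transforms.CenterCrop(size),  # keep the central region of the X-ray
        transforms.Resize(size),      # enforce one common output size
    ])


# Toy stand-in for the training metadata (values are made up)
toy = pd.DataFrame({'height': [1000, 1500, 1101],
                    'width':  [1200, 1800, 1500]})
size = mean_image_size(toy)
print(size)

# transform = build_transform(size)  # requires torchvision to be installed
```

One point to keep in mind with this sketch: when both steps receive the same size, CenterCrop already returns images of that size (recent torchvision versions pad smaller images with zeros), so Resize only changes the output if the two sizes differ.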



In [None]:
# Height mean
height_mean = train['height'].mean()
print(height_mean)

# Width mean
width_mean = train['width'].mean()
print(width_mean)
