# Pulmonary Fibrosis Progression (I)

## Analysis of tabular data

In this notebook we explore the input data to better understand its main properties and the correlations between them.

Some of the questions we are going to be answering:


In [2]:
## Imports

import os
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

#plotly
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
#import cufflinks
#cufflinks.go_offline()
#cufflinks.set_config_file(world_readable=True, theme='pearl')

#color
#from colorama import Fore, Back, Style

import seaborn as sns
sns.set(style="whitegrid")

#pydicom
import pydicom

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# Plot style
plt.style.use('fivethirtyeight')

### Overview of the train and test dataset
In this first section we look at both the train and test datasets to collect some basic information about them.

In [17]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
print(f'[train_df] shape: {train_df.shape}')
print(f'[test_df] shape: {test_df.shape}')

[train_df] shape: (1549, 7)
[test_df] shape: (5, 7)


In [6]:
train_df.head()

|   | Patient | Weeks | FVC | Percent | Age | Sex | SmokingStatus |
|---|---|---|---|---|---|---|---|
| 0 | ID00007637202177411956430 | -4 | 2315 | 58.253649 | 79 | Male | Ex-smoker |
| 1 | ID00007637202177411956430 | 5 | 2214 | 55.712129 | 79 | Male | Ex-smoker |
| 2 | ID00007637202177411956430 | 7 | 2061 | 51.862104 | 79 | Male | Ex-smoker |
| 3 | ID00007637202177411956430 | 9 | 2144 | 53.950679 | 79 | Male | Ex-smoker |
| 4 | ID00007637202177411956430 | 11 | 2069 | 52.063412 | 79 | Male | Ex-smoker |


In [7]:
test_df.head()

|   | Patient | Weeks | FVC | Percent | Age | Sex | SmokingStatus |
|---|---|---|---|---|---|---|---|
| 0 | ID00419637202311204720264 | 6 | 3020 | 70.186855 | 73 | Male | Ex-smoker |
| 1 | ID00421637202311550012437 | 15 | 2739 | 82.045291 | 68 | Male | Ex-smoker |
| 2 | ID00422637202311677017371 | 6 | 1930 | 76.672493 | 73 | Male | Ex-smoker |
| 3 | ID00423637202312137826377 | 17 | 3294 | 79.258903 | 72 | Male | Ex-smoker |
| 4 | ID00426637202313170790466 | 0 | 2925 | 71.824968 | 73 | Male | Never smoked |


In [30]:
print(f'[train_df] Number of rows: {train_df["Patient"].count()}')
print(f'[train_df] Number of unique patients: {train_df["Patient"].nunique()}')
print(f'[test_df] Number of rows: {test_df["Patient"].count()}')
print(f'[test_df] Number of unique patients: {test_df["Patient"].nunique()}')
print(f'[train_df] SmokingStatus values: {train_df["SmokingStatus"].value_counts().index.tolist()}')
print(f'[test_df] SmokingStatus values: {test_df["SmokingStatus"].value_counts().index.tolist()}')

[train_df] Number of rows: 1549
[train_df] Number of unique patients: 176
[test_df] Number of rows: 5
[test_df] Number of unique patients: 5
[train_df] SmokingStatus values: ['Ex-smoker', 'Never smoked', 'Currently smokes']
[test_df] SmokingStatus values: ['Ex-smoker', 'Never smoked']
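The totals above imply about 1549 / 176 ≈ 8.8 rows per patient in train, while test has exactly one row per patient. A minimal sketch for looking at this per patient follows; `visits_per_patient` is a hypothetical helper, and the small frame only repeats the five train rows shown by `head()` above, so it is a fragment and not that patient's full history.

```python
import pandas as pd

def visits_per_patient(df):
    """Number of rows (visits) recorded for each patient."""
    return df.groupby("Patient").size()

# Fragment for illustration only: the five train rows shown by head() above.
head_rows = pd.DataFrame({
    "Patient": ["ID00007637202177411956430"] * 5,
    "Weeks": [-4, 5, 7, 9, 11],
})
counts = visits_per_patient(head_rows)
print(counts)

# Average implied by the printed totals: 1549 rows / 176 patients
mean_visits = round(1549 / 176, 1)
print(mean_visits)
```

On the full data, `visits_per_patient(train_df).describe()` would summarise how the number of visits varies across patients.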


In [34]:
train_images_path = "data/train"
test_images_path = "data/test"

def count_subfolders(path):
    # Count only the immediate subdirectories (one per patient);
    # os.walk would also add any nested directories to the total.
    with os.scandir(path) as entries:
        return sum(entry.is_dir() for entry in entries)

num_train_folders = count_subfolders(train_images_path)
num_test_folders = count_subfolders(test_images_path)

print(f'Number of train folders: {num_train_folders}')
print(f'Number of test folders: {num_test_folders}')


Number of train folders: 176
Number of test folders: 5
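
The folder counts equal the numbers of unique patients, which suggests one scan folder per patient. Matching counts do not prove matching names, so a direct comparison can be sketched as below. It assumes each folder is named after its `Patient` ID (an assumption, not something the outputs above confirm), and `compare_ids_with_folders` is a hypothetical helper. The demo uses a temporary directory so that it runs without the dataset.

```python
import os
import tempfile
import pandas as pd

def compare_ids_with_folders(df, images_path):
    """Return (IDs with no folder, folders with no ID in the table).

    Assumes each scan folder is named after a Patient ID.
    """
    csv_ids = set(df["Patient"].unique())
    folder_ids = {
        name for name in os.listdir(images_path)
        if os.path.isdir(os.path.join(images_path, name))
    }
    return csv_ids - folder_ids, folder_ids - csv_ids

# Self-contained demo with one patient ID taken from the train table.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "ID00007637202177411956430"))
    demo_df = pd.DataFrame({"Patient": ["ID00007637202177411956430"] * 2})
    missing, extra = compare_ids_with_folders(demo_df, tmp)
    print(missing, extra)  # both sets are empty when IDs and folders agree
```

With the real data the call would be `compare_ids_with_folders(train_df, train_images_path)`; two empty sets mean every patient has a folder and every folder has a patient.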


#### [Summary] Overview of tabular data

* Each row in the train and test datasets describes one patient visit:
 * Weeks - the number of weeks relative to the baseline CT scan (may be negative)
 * FVC - the recorded forced vital capacity (lung capacity) in mL
 * Percent - a computed field that approximates the patient's FVC as a percentage of the typical FVC for a person with similar characteristics
 * Age
 * Sex
 * SmokingStatus
* SmokingStatus values (all three appear in train; test contains only the first two)
 * Ex-smoker
 * Never smoked
 * Currently smokes
* The train and test datasets contain 176 and 5 unique patients, respectively.
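
The description of Percent can be made concrete by inverting it. Assuming Percent is simply 100 × FVC / typical FVC (an assumption based on the field description, not a documented formula), the implied reference value is FVC / (Percent / 100). The sketch below applies this to the first three train rows shown earlier; `typical_fvc` is a hypothetical helper name.

```python
import pandas as pd

# First three train rows as shown in the head() output earlier.
sample = pd.DataFrame({
    "FVC": [2315, 2214, 2061],
    "Percent": [58.253649, 55.712129, 51.862104],
})

def typical_fvc(df):
    """Reference FVC in mL implied by FVC and Percent (assumed relation)."""
    return df["FVC"] / (df["Percent"] / 100.0)

sample["TypicalFVC"] = typical_fvc(sample)
print(sample.round(1))
```

For these three visits the implied reference FVC comes out at about 3974 mL each time, which is consistent with a per-patient reference value that does not change between visits.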

#### [Summary] Overview of DICOM data
