# Pulmonary Fibrosis Progression (I)

## Analysis of tabular data

In this notebook we explore the input data to get a better understanding of its main properties and the correlations between them.

Some of the questions we are going to be answering:


In [118]:
## Imports

import os
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

#plotly
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')


import seaborn as sns
sns.set(style="whitegrid")

#pydicom
import pydicom

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# Settings for pretty nice plots
plt.style.use('fivethirtyeight')
plt.show()

### Overview of the train and test dataset
In this first section we look at both the train and test datasets to collect some basic information about them.

In [17]:
train_df = pd.read_csv( 'data/train.csv' )
test_df  = pd.read_csv( 'data/test.csv' )
print(f'[train_df] shape: {train_df.shape}')
print(f'[test_df] shape: {test_df.shape}')

[train_df] shape: (1549, 7)
[test_df] shape: (5, 7)


In [6]:
train_df.head()

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,ID00007637202177411956430,-4,2315,58.253649,79,Male,Ex-smoker
1,ID00007637202177411956430,5,2214,55.712129,79,Male,Ex-smoker
2,ID00007637202177411956430,7,2061,51.862104,79,Male,Ex-smoker
3,ID00007637202177411956430,9,2144,53.950679,79,Male,Ex-smoker
4,ID00007637202177411956430,11,2069,52.063412,79,Male,Ex-smoker


In [7]:
test_df.head()

Unnamed: 0,Patient,Weeks,FVC,Percent,Age,Sex,SmokingStatus
0,ID00419637202311204720264,6,3020,70.186855,73,Male,Ex-smoker
1,ID00421637202311550012437,15,2739,82.045291,68,Male,Ex-smoker
2,ID00422637202311677017371,6,1930,76.672493,73,Male,Ex-smoker
3,ID00423637202312137826377,17,3294,79.258903,72,Male,Ex-smoker
4,ID00426637202313170790466,0,2925,71.824968,73,Male,Never smoked


In [30]:
print(f'[train_df] Number of rows: {train_df["Patient"].count()}')
print(f'[train_df] Number of unique patients: {train_df["Patient"].nunique()}')
print(f'[test_df] Number of rows: {test_df["Patient"].count()}')
print(f'[test_df] Number of unique patients: {test_df["Patient"].value_counts().shape[0]}')
print(f'[train_df] SmokingStatus values: {train_df["SmokingStatus"].value_counts().index.tolist()}')
print(f'[test_df] SmokingStatus values: {test_df["SmokingStatus"].value_counts().index.tolist()}')

[train_df] Number of rows: 1549
[train_df] Number of unique patients: 176
[test_df] Number of rows: 5
[test_df] Number of unique patients: 5
[train_df] SmokingStatus values: ['Ex-smoker', 'Never smoked', 'Currently smokes']
[test_df] SmokingStatus values: ['Ex-smoker', 'Never smoked']


In [92]:
train_images_path = "data/train"
test_images_path = "data/test"
num_train_folders = 0
num_test_folders = 0
train_patient_array = np.array(os.listdir(train_images_path))
test_patient_array = np.array(os.listdir(test_images_path))

for root, dirnames, filenames in os.walk(train_images_path):
    num_train_folders += len(dirnames)
        
for root, dirnames,filenames in os.walk(test_images_path):
    num_test_folders += len(dirnames)
    
print(f'Number of train CT scan folders: {num_train_folders}')
print(f'Number of test CT scan folders: {num_test_folders}')
print(f'[TRAIN] Does every patient have a CT scan folder? '+
      ('yes :) ' if len(np.setdiff1d(train_df["Patient"].unique(), train_patient_array))==0 else 'No :('))
print(f'[TEST] Does every patient have a CT scan folder? '+
      ('yes :) ' if len(np.setdiff1d(test_df["Patient"].unique(), test_patient_array))==0 else 'No :('))

Number of train CT scan folders: 176
Number of test CT scan folders: 5
[TRAIN] Does every patient have a CT scan folder? yes :) 
[TEST] Does every patient have a CT scan folder? yes :) 
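
The yes/no check above does not say which patients are missing when the answer is "No". A minimal sketch of a helper that lists them is shown below; the function name `patients_without_scan_folder` and the placeholder patient IDs are hypothetical, and the demo runs against a temporary directory rather than the real `data/train` folder.

```python
import os
import tempfile

def patients_without_scan_folder(patient_ids, images_path):
    """Return the patient IDs that have no sub-directory under images_path.

    Hypothetical helper: only immediate sub-directories are considered,
    so stray files or nested folders do not affect the result.
    """
    folders = {entry.name for entry in os.scandir(images_path) if entry.is_dir()}
    return sorted(set(patient_ids) - folders)

# Demo on a throwaway directory with placeholder IDs (not real patients)
with tempfile.TemporaryDirectory() as tmp_dir:
    for patient_id in ["patient_a", "patient_b"]:
        os.makedirs(os.path.join(tmp_dir, patient_id))
    missing = patients_without_scan_folder(
        ["patient_a", "patient_b", "patient_c"], tmp_dir)

print(missing)
```

In the notebook this could be called as `patients_without_scan_folder(train_df["Patient"].unique(), train_images_path)`. Note that `os.walk` in the cell above sums `len(dirnames)` at every level, so it would overcount if patient folders ever contained sub-folders.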


#### [Summary] Overview of tabular data

* Each of the rows in the train and test dataset contains information for a patient visit with details about:
 * Weeks - the relative number of weeks pre/post the baseline CT (may be negative)
 * FVC - the recorded lung capacity in ml
 * Percent - a computed field which approximates the patient's FVC as a percent of the typical FVC for a person of similar characteristics
 * Age
 * Sex
 * SmokingStatus
* SmokingStatus values
 * Ex-smoker
 * Never smoked
 * Currently smokes
* The train and test datasets contain 176 and 5 unique patients, respectively.
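
To make the `Percent` field more concrete, the sketch below backs out the "typical" FVC implied by an FVC/Percent pair. It assumes `Percent = 100 * FVC / typical_FVC`, which matches the description above; the exact reference formula behind the field is not given, and the helper name `typical_fvc` is hypothetical.

```python
def typical_fvc(fvc_ml, percent):
    """Reference FVC (ml) implied by a measured FVC and its Percent value.

    Assumption: Percent = 100 * FVC / typical_FVC.
    """
    return fvc_ml / (percent / 100.0)

# (FVC, Percent) for the first two visits shown in train_df.head()
visits = [(2315, 58.253649), (2214, 55.712129)]
implied = [typical_fvc(fvc, pct) for fvc, pct in visits]
print([round(value) for value in implied])
```

Both visits imply a reference value of about 3974 ml, which is what one would expect if the reference depends only on static patient characteristics.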

#### [Summary] Overview of DICOM data
* Every patient in the tabular data has a directory containing the images of the baseline CT scan.

### Analysis of the Tabular data
In this section we look at the nature of the tabular data given as part of the input dataset. The main idea is to better understand the distribution of the data for each of the identified features (FVC, percent, age, sex and smoking status). We also look at how the data changes over time by analysing patient data between weeks.
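
One simple way to quantify how a patient's data changes between weeks is the average FVC change per week between the first and last visit. The sketch below is only an illustration under that assumption (it is not a fitted trend); the function name `fvc_change_per_week` is hypothetical, and the demo rows are the five visits shown in `train_df.head()`.

```python
import pandas as pd

def fvc_change_per_week(data_df):
    """Average FVC change (ml/week) between each patient's first and last visit.

    Hypothetical helper: patients with a single visit get NaN, because the
    elapsed number of weeks is zero.
    """
    ordered = data_df.sort_values(["Patient", "Weeks"])
    grouped = ordered.groupby("Patient")
    first = grouped[["Weeks", "FVC"]].first()
    last = grouped[["Weeks", "FVC"]].last()
    span_weeks = last["Weeks"] - first["Weeks"]
    change = (last["FVC"] - first["FVC"]) / span_weeks
    return change.rename("FVC_Change_Per_Week")

# The five visits of the first patient, copied from train_df.head()
head_df = pd.DataFrame({
    "Patient": ["ID00007637202177411956430"] * 5,
    "Weeks": [-4, 5, 7, 9, 11],
    "FVC": [2315, 2214, 2061, 2144, 2069],
})
print(fvc_change_per_week(head_df))
```

For these five rows the FVC drops from 2315 ml at week -4 to 2069 ml at week 11, which is roughly -16.4 ml per week.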

In [144]:
# Build a per-patient dataframe: Patient, Age, Sex, Smoking_Status, Num_Visits
# Accumulators, one list per column
patients = []
age = []
sex = []
smoking_status = []
num_visits = []

# For each unique patient, select their rows and collect the static data
for patient in train_df["Patient"].unique():
    patient_rows = train_df[train_df['Patient'] == patient]
    patient_row = patient_rows.iloc[0]
    patients.append(patient_row['Patient'])
    age.append(patient_row['Age'])
    sex.append(patient_row['Sex'])
    smoking_status.append(patient_row['SmokingStatus'])
    num_visits.append(patient_rows.shape[0])
    
patient_df = pd.DataFrame(list(zip(patients, age, sex, smoking_status, num_visits)), 
                                 columns =['Patient', 'Age', 'Sex', 'Smoking_Status', 'Num_Visits'])
print(patient_df.info())
patient_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Patient         176 non-null    object
 1   Age             176 non-null    int64 
 2   Sex             176 non-null    object
 3   Smoking_Status  176 non-null    object
 4   Num_Visits      176 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 7.0+ KB
None


Unnamed: 0,Patient,Age,Sex,Smoking_Status,Num_Visits
0,ID00007637202177411956430,79,Male,Ex-smoker,9
1,ID00009637202177434476278,69,Male,Ex-smoker,9
2,ID00010637202177584971671,60,Male,Ex-smoker,9
3,ID00011637202177653955184,72,Male,Ex-smoker,9
4,ID00012637202177665765362,65,Male,Never smoked,9
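
The same per-patient table can also be built without an explicit loop. The sketch below uses `groupby` with named aggregation and additionally keeps a baseline FVC and Percent, taken here as the values at each patient's earliest recorded week (an assumption about what "baseline" should mean). The function name `build_patient_df` is hypothetical, and the second patient in the demo is made up purely for illustration.

```python
import pandas as pd

def build_patient_df(data_df):
    """One row per patient with static fields, baseline values and visit count.

    Assumption: the "baseline" measurement is the one at the earliest week.
    """
    ordered = data_df.sort_values(["Patient", "Weeks"])
    return (
        ordered.groupby("Patient", sort=False)
        .agg(
            Baseline_FVC=("FVC", "first"),
            Baseline_Percent=("Percent", "first"),
            Age=("Age", "first"),
            Sex=("Sex", "first"),
            Smoking_Status=("SmokingStatus", "first"),
            Num_Visits=("Weeks", "count"),
        )
        .reset_index()
    )

# Three real rows from train_df.head() plus one HYPOTHETICAL patient
demo_df = pd.DataFrame({
    "Patient": ["ID00007637202177411956430"] * 3 + ["HYPOTHETICAL_PATIENT"] * 2,
    "Weeks": [-4, 5, 7, 3, 0],
    "FVC": [2315, 2214, 2061, 3000, 3100],
    "Percent": [58.253649, 55.712129, 51.862104, 80.0, 82.0],
    "Age": [79] * 3 + [65] * 2,
    "Sex": ["Male"] * 3 + ["Female"] * 2,
    "SmokingStatus": ["Ex-smoker"] * 3 + ["Never smoked"] * 2,
})
demo_patient_df = build_patient_df(demo_df)
print(demo_patient_df)
```

Sorting by `Weeks` before grouping is what makes `"first"` pick the earliest visit rather than whichever row happens to appear first in the file.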


In [220]:
def plot_sex_distribution(patient_df):
    '''Plots sex distribution as a pie from patient_df
       :param patient_df: dataframe containing patient information
       '''
    male_condition = patient_df['Sex'] == 'Male'
    female_condition = patient_df['Sex'] == 'Female'
    sex_distribution = pd.DataFrame({
        'Sex': ['Male', 'Female'], 
        'Count': [patient_df[male_condition]['Patient'].count(), patient_df[female_condition]['Patient'].count()]})
    fig = px.pie(sex_distribution, values='Count', names='Sex')
    fig.show()
    
def plot_patient_df_age_over_sex(patient_df):
    '''Plots the patients' age distribution, split by sex, from patient_df
       :param patient_df: dataframe containing patient information
       '''
    # plot patient age distribution
    male_condition = patient_df['Sex'] == 'Male'
    female_condition = patient_df['Sex'] == 'Female'
    age_distribution_over_sex = pd.concat([patient_df[male_condition]['Age'].value_counts(),
                                           patient_df[female_condition]['Age'].value_counts()],
                                          axis=1, ignore_index=True).fillna(0)
    age_distribution_over_sex = age_distribution_over_sex.rename(columns={0: "Male", 1: "Female"})
    age_distribution_over_sex.iplot(kind='bar',
                                      xTitle='Age',
                                      yTitle='Counts',
                                      linecolor='black', 
                                      opacity=0.7,
                                      theme='pearl',
                                      bargap=0.3,
                                      barmode='stack',
                                      gridcolor='white',
                                      title='Age distribution over Sex')
    
def plot_patient_df_smoking_status_over_sex(patient_df):
    '''Plots the patients' smoking status distribution, split by sex
       :param patient_df: dataframe containing patient information
    '''
    # count patients per smoking status, separately for each sex
    smoking_status_male = []
    smoking_status_female = []
    for smoking_status in patient_df['Smoking_Status'].unique():
        male_condition = (patient_df['Smoking_Status'] == smoking_status) & (patient_df['Sex'] == 'Male')
        female_condition = (patient_df['Smoking_Status'] == smoking_status) & (patient_df['Sex'] == 'Female')
        smoking_status_male.append(patient_df[male_condition]['Patient'].count())
        smoking_status_female.append(patient_df[female_condition]['Patient'].count())
    smoking_status_over_sex_distribution = pd.DataFrame({
        'Male': smoking_status_male, 
        'Female': smoking_status_female}, index=patient_df['Smoking_Status'].unique())
    smoking_status_over_sex_distribution.iplot(kind='bar',
                                      xTitle='Smoking Distribution',
                                      yTitle='Count (over Sex)',
                                      linecolor='black', 
                                      opacity=0.7,
                                      theme='pearl',
                                      bargap=0.3,
                                      barmode='stack',
                                      gridcolor='white',
                                      title='Smoking distribution over Sex')

plot_sex_distribution(patient_df) 
plot_patient_df_age_over_sex(patient_df)
plot_patient_df_smoking_status_over_sex(patient_df)

In [142]:
def plot_data_df_distributions(data_df):
    '''Plots data distribution of data_df
       :param data_df: data dataframe which corresponds with the format of the train and test dataset
       '''
    # plot weeks distribution
    # use the dataframe passed in, not the global train_df
    data_df['Weeks'].value_counts().iplot(kind='bar',
                                          xTitle='Week',
                                          yTitle='Counts',
                                          linecolor='black', 
                                          opacity=0.7,
                                          color='blue',
                                          theme='pearl',
                                          bargap=0.3,
                                          gridcolor='white',
                                          title='Weeks distribution')