## Loading Required Libraries

This experiment uses the following libraries:

* **pandas** to read, process, and visualize tabular data;
* **collections** to compute group metrics such as counts, sums, and means;
* **numpy** for matrix and algebraic operations;
* **os** to search the file system; and
* **missingno** to visualize missing data.

In [1]:
from collections import Counter

import pandas as pd
import numpy as np
import os

## Loading and Merging Data Sets

All data come from the CoMMpass℠ study (Relating Clinical Outcomes in MM to Personal Assessment of Genetic Profile), organized by the Multiple Myeloma Research Foundation (MMRF) and available at https://research.themmrf.org. The script merges all collected data into a single pandas data frame, keeping only patients treated with one of the following therapy classes:

* IMIDs-based
* Bortezomib-based
* Carfilzomib-based
* Combined bortezomib/IMIDs-based
* Combined IMIDs/carfilzomib-based

We exclude observations with a missing *therapy_first_line* or *iss* value; these variables describe the first therapy applied to the Multiple Myeloma (MM) patient and the cancer stage, respectively. We also remove the following redundant or leakage variables:

* progression_free_survivel_status
* therapy_first_line_most_common
* therapy_first_line_starting_treatment
* days_to_overall_survival
* disease_status
* overall_survival_status
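
As a minimal sketch of the exclusions described above, the toy frame below drops rows with a missing *therapy_first_line* or *iss* and then removes leakage columns. The rows and values are invented for illustration; only the column names come from the text. The `errors="ignore"` argument is an assumption added here so that the call does not fail when a listed column is absent.

```python
import numpy as np
import pandas as pd

# Toy data: values are hypothetical, only the column names follow the text.
toy = pd.DataFrame(
    {
        "therapy_first_line": ["Bor-Dex", None, "Len-Dex", "Bor"],
        "iss": ["Stage I", "Stage II", np.nan, "Stage III"],
        "disease_status": ["a", "b", "c", "d"],
        "overall_survival_status": [0, 1, 0, 1],
    },
    index=pd.Index(["P1", "P2", "P3", "P4"], name="ID"),
)

leakage = ["disease_status", "overall_survival_status", "days_to_overall_survival"]

# Keep rows where both variables are present, then drop the leakage columns.
filtered = toy.dropna(subset=["therapy_first_line", "iss"])
filtered = filtered.drop(columns=leakage, errors="ignore")

print(filtered)
```

Only the rows with both values present remain, and the leakage columns are gone.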

In [2]:
BASE_PATH = "./data/mmrf"

df = None

for root, directories, files in os.walk(BASE_PATH):
    for file in files:
        if file.endswith('.tsv'):
            
            tmp = pd.read_csv(os.path.join(root, file), sep='\t', index_col='ID')
            
            tmp = tmp[~tmp.index.duplicated(keep='first')]
            
            if tmp.shape[0] > 400:
            
                # Best effort: skip any table that cannot be joined
                try:
                    df = tmp if df is None else df.join(tmp, how='outer')
                except Exception:
                    pass

for c in ['progression_free_survivel_status', 'therapy_first_line_most_common', 
          'therapy_first_line_starting_treatment', 'days_to_overall_survival', 
          'disease_status', 'overall_survival_status']:     
    del df[c]
                
classes = ['IMIDs-based', 'Bortezomib-based', 'Carfilzomib-based', 
           'Combined bortezomib/IMIDs-based']

print('\n\nRaw data set composed by {} rows and {} columns\n'.format(df.shape[0], df.shape[1]))

# Count patients per therapy class and per therapy before filtering
count_therapy_class_all = pd.DataFrame(dict(Counter(df['therapy_first_line_class'])),index=['count']).T

count_therapy_all = pd.DataFrame(dict(Counter(df['therapy_first_line'])), index=['count']).T

# Keep the selected therapy classes and drop rows with a missing therapy or ISS
df = df.loc[df['therapy_first_line_class'].isin(classes),:]

df = df.loc[~df.index.duplicated(keep='first')]

df = df.loc[~df['therapy_first_line'].isnull()]

df = df.loc[~df['iss'].isnull()]

df.to_csv('data/input.tsv', sep='\t', index=True)

display(df.iloc[:8,:8])

print('\n\nFiltered data set composed by {} rows and {} columns'.format(df.shape[0], df.shape[1]))



Raw data set composed by 1525 rows and 60 columns



| ID | cmmc | ecog_ps | cell_markers | dna_index | lgh | lgl | percent_aneuploid | percent_plama_cells_bone_marrow |
|---|---|---|---|---|---|---|---|---|
| MMRF1011 | NaN | PS 1 (Restricted in physically strenuous activ... | CD138 | NaN | Not Recorded | Not Recorded | 0.0 | 0.9 |
| MMRF1013 | NaN | PS 1 (Restricted in physically strenuous activ... | CD117 | NaN | Unknown | Unknown | 0.0 | 1.3 |
| MMRF1016 | NaN | PS 1 (Restricted in physically strenuous activ... | CD117 | NaN | IgG | Lambda | 0.0 | 2.0 |
| MMRF1017 | NaN | PS 1 (Restricted in physically strenuous activ... | CD138 | 1.25 | IgG | Lambda | 6.9 | 2.1 |
| MMRF1018 | NaN | PS 1 (Restricted in physically strenuous activ... | CD117 | NaN | IgA | Kappa | 0.0 | 2.1 |
| MMRF1029 | NaN | PS 1 (Restricted in physically strenuous activ... | CD117 | NaN | Unknown | Kappa | 0.0 | 8.4 |
| MMRF1030 | NaN | PS 1 (Restricted in physically strenuous activ... | CD117 | 1.16 | IgG | Kappa | 15.4 | 9.6 |
| MMRF1031 | NaN | PS 0 (Fully Active) | CD117 | 1.28 | IgA | Unknown | 18.3 | 10.1 |




Filtered data set composed by 690 rows and 60 columns
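
The merge step in the cell above can be illustrated on toy data. This is a sketch under stated assumptions: the two small tables and their patient IDs are hypothetical stand-ins for the per-file tables, and only the de-duplication and the outer join mirror the notebook code.

```python
import pandas as pd

# Two hypothetical per-file tables indexed by patient ID.
clinical = pd.DataFrame(
    {"iss": ["Stage I", "Stage II", "Stage II"]},
    index=pd.Index(["P1", "P2", "P2"], name="ID"),  # P2 appears twice
)
therapy = pd.DataFrame(
    {"therapy_first_line_class": ["IMIDs-based", "Bortezomib-based"]},
    index=pd.Index(["P1", "P3"], name="ID"),
)

merged = None
for table in (clinical, therapy):
    # Keep the first row for each repeated ID, as in the cell above.
    table = table[~table.index.duplicated(keep="first")]
    # An outer join keeps every ID that appears in either table.
    merged = table if merged is None else merged.join(table, how="outer")

print(merged)
```

Patients that appear in only one table are kept, with missing values in the columns contributed by the other table. This is why the later filters on missing *therapy_first_line* and *iss* are needed.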


# Counting Patients per Therapy Class

The following script counts the patients in each therapy class under three counting criteria:

1. All observations, with or without missing data
2. Excluding observations with missing *days_to_disease_progression*
3. Excluding observations with missing *best_response_first_line*

In [3]:
# Rows with a recorded progression time and rows with a recorded best response
has_progression = df['days_to_disease_progression'].notnull()
has_response = df['best_response_first_line'].notnull()

count_therapy_class_rep = pd.DataFrame(
    dict(Counter(df.loc[has_progression, 'therapy_first_line_class'])),
    index=['days_to_disease_progression']).T.join(pd.DataFrame(
    dict(Counter(df.loc[has_response, 'therapy_first_line_class'])),
    index=['best_response_first_line']).T)

count_therapy_class_all.join(count_therapy_class_rep).fillna(0)

|  | count | days_to_disease_progression | best_response_first_line |
|---|---|---|---|
| Combined bortezomib/IMIDs-based | 554 | 241.0 | 477.0 |
| Combined bortezomib/IMIDs/carfilzomib-based | 34 | 0.0 | 0.0 |
| NaN | 443 | 0.0 | 0.0 |
| IMIDs-based | 64 | 34.0 | 50.0 |
| Bortezomib-based | 203 | 91.0 | 122.0 |
| Combined IMIDs/carfilzomib-based | 163 | 0.0 | 0.0 |
| Carfilzomib-based | 62 | 0.0 | 0.0 |
| Combined bortezomib/carfilzomib-based | 2 | 0.0 | 0.0 |
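
The same three counts could also be sketched with `Series.value_counts` in place of `Counter`. This is an alternative rather than the notebook's method, and the toy rows below are invented. Unlike `Counter`, `value_counts` drops missing labels unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the notebook.
toy = pd.DataFrame(
    {
        "therapy_first_line_class": [
            "IMIDs-based", "IMIDs-based",
            "Bortezomib-based", "Bortezomib-based", "Bortezomib-based",
        ],
        "days_to_disease_progression": [100.0, np.nan, np.nan, np.nan, np.nan],
        "best_response_first_line": ["CR", "PR", None, "PR", "CR"],
    }
)

by_class = toy["therapy_first_line_class"]

# One column per criterion; classes absent under a criterion become 0.
counts = pd.DataFrame(
    {
        "count": by_class.value_counts(),
        "days_to_disease_progression": by_class[
            toy["days_to_disease_progression"].notnull()
        ].value_counts(),
        "best_response_first_line": by_class[
            toy["best_response_first_line"].notnull()
        ].value_counts(),
    }
).fillna(0)

print(counts)
```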


# Counting Patients per First-Line Therapy

Next, we count patients per first-line therapy, grouping the results in the same way as for the first-line therapy classes above.

In [4]:
# Rows with a recorded progression time and rows with a recorded best response
has_progression = df['days_to_disease_progression'].notnull()
has_response = df['best_response_first_line'].notnull()

count_therapy_rep = pd.DataFrame(
    dict(Counter(df.loc[has_progression, 'therapy_first_line'])),
    index=['days_to_disease_progression']).T.join(pd.DataFrame(
    dict(Counter(df.loc[has_response, 'therapy_first_line'])),
    index=['best_response_first_line']).T)

count_therapy_all.join(count_therapy_rep).fillna(0)

|  | count | days_to_disease_progression | best_response_first_line |
|---|---|---|---|
| NaN | 738 | 0.0 | 0.0 |
| Bor-Dex | 126 | 74.0 | 95.0 |
| Bor-Len-Dex | 367 | 146.0 | 309.0 |
| Bor-Cyc-Dex | 190 | 84.0 | 160.0 |
| Len-Dex | 88 | 51.0 | 73.0 |
| Bor | 11 | 8.0 | 8.0 |
| Len | 5 | 3.0 | 4.0 |


# Response Variables

Here we define two response variables:

* **response_best_response_first_line** is derived from the *best_response_first_line* variable, which describes qualitatively how well the patient responded to the applied therapy. We split patients into two groups according to their response:
    * **Low Risk** (encoded as 1): Stringent Complete Response and Complete Response
    * **High Risk** (encoded as 0): Very Good Partial Response, Partial Response, Stable Disease, and Progressive Disease

* **response_days_to_disease_progression** is derived from the *days_to_disease_progression* variable, which gives the number of days until disease progression. We split patients into two groups, taking 18 months as 30 × 18 = 540 days:
    * **Low Risk** (encoded as 1): patients with days_to_disease_progression > 18 months
    * **High Risk** (encoded as 0): patients with days_to_disease_progression <= 18 months

In [5]:
groups = (['Stringent Complete Response', 'Complete Response'], 
          ['Very Good Partial Response', 'Partial Response', 'Stable Disease', 'Progressive Disease'])

# Binarize each response variable and drop its source column
    
df['response_best_response_first_line'] = \
    df['best_response_first_line'].apply(lambda x: np.nan if pd.isnull(x) else (1 if x in groups[0] else 0))

del df['best_response_first_line']

df['response_days_to_disease_progression'] = \
    df['days_to_disease_progression'].apply(
        lambda x: np.nan if pd.isnull(x) else (0 if x <= 30 * 18 else 1))

del df['days_to_disease_progression']

df[['response_best_response_first_line', 'response_days_to_disease_progression']].head(10)

| ID | response_best_response_first_line | response_days_to_disease_progression |
|---|---|---|
| MMRF1011 | 0.0 | 1.0 |
| MMRF1013 | 1.0 | 1.0 |
| MMRF1016 | 0.0 | 1.0 |
| MMRF1017 | 0.0 | 0.0 |
| MMRF1018 | 0.0 | NaN |
| MMRF1029 | 0.0 | NaN |
| MMRF1030 | 1.0 | 1.0 |
| MMRF1031 | 0.0 | 1.0 |
| MMRF1032 | 0.0 | 1.0 |
| MMRF1033 | 0.0 | 0.0 |
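
For comparison, the same encoding can be sketched without `apply`, using vectorized pandas operations. The toy values are invented; the column names, the response groups, and the 30 × 18 day threshold follow the cell above. Treat this as one possible formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical values; column names and thresholds follow the cell above.
toy = pd.DataFrame(
    {
        "best_response_first_line": [
            "Complete Response", "Partial Response", None, "Stable Disease",
        ],
        "days_to_disease_progression": [700.0, 540.0, np.nan, 200.0],
    }
)

complete = ["Stringent Complete Response", "Complete Response"]
threshold_days = 30 * 18  # 18 months, counting 30 days per month

best = toy["best_response_first_line"]
days = toy["days_to_disease_progression"]

# 1 for a complete response, 0 otherwise; missing inputs stay missing.
toy["response_best_response_first_line"] = (
    best.isin(complete).astype(float).where(best.notnull())
)
# 1 when progression came after the threshold, 0 otherwise.
toy["response_days_to_disease_progression"] = (
    (days > threshold_days).astype(float).where(days.notnull())
)

print(toy)
```

A patient whose progression time is exactly 540 days is encoded as 0, matching the `x <= 30 * 18` test in the original lambda.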


# Describing Qualitative Variables

We describe the qualitative variables, selected by `dtype == 'object'`: for each one we count its unique values and present them as a list.

In [6]:
qualitative_counts = {'variable': [], 'unique_values': [], 'unique_count': []}

for c in df:
    if df[c].dtype == 'object':
        uniques = df[c].unique().tolist()
        qualitative_counts['variable'].append(c)
        qualitative_counts['unique_values'].append(', '.join([str(u) for u in uniques]))
        qualitative_counts['unique_count'].append(len(uniques))

pd.DataFrame(qualitative_counts).set_index('variable').head(8)

| variable | unique_values | unique_count |
|---|---|---|
| ecog_ps | PS 1 (Restricted in physically strenuous activ... | 6 |
| cell_markers | CD138, CD117, CD13, nan, CD38 | 5 |
| lgh | Not Recorded, Unknown, IgG, IgA, nan, IgM, IgM... | 7 |
| lgl | Not Recorded, Unknown, Lambda, Kappa, nan, Bi-... | 7 |
| iss | Stage III, Stage I, Stage II | 3 |
| family_cancer | Yes, No, Unk | 3 |
| gender | Male, Female | 2 |
| race | White, Asian, Black/African American, Other | 4 |
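
A more compact way to build a similar summary is sketched below with `select_dtypes` and `nunique`. The toy frame is hypothetical, and its text columns are created with an explicit `object` dtype so that the selection behaves the same as the `dtype == 'object'` test above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; text columns are forced to the object dtype.
toy = pd.DataFrame(
    {
        "gender": pd.Series(["Male", "Female", "Male"], dtype=object),
        "cell_markers": pd.Series(["CD138", np.nan, "CD117"], dtype=object),
        "dna_index": [1.25, 1.16, 1.28],  # numeric, so it is skipped
    }
)

qualitative = toy.select_dtypes(include="object")

summary = pd.DataFrame(
    {
        "unique_values": qualitative.apply(
            lambda col: ", ".join(str(v) for v in col.unique())
        ),
        # dropna=False counts NaN as a value, matching len(col.unique()).
        "unique_count": qualitative.nunique(dropna=False),
    }
)
summary.index.name = "variable"

print(summary)
```

As in the table above, a missing value shows up as `nan` in the list and is included in the count.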


# Cleaning Qualitative Variables

We replace *Not Recorded* and *Unknown* with *numpy.nan* in all qualitative variables, and map the ISS stages to their corresponding numeric values.

In [7]:
iss_dict = {'Stage I': 1, 'Stage II': 2, 'Stage III': 3}

df['iss'] = df['iss'].apply(lambda x: iss_dict[x])

for c in df:
    if df[c].dtype == 'object':
        df[c] = df[c].apply(lambda v: np.nan if v in ('Not Recorded', 'Unknown') else v)
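
The two cleaning steps can also be written with `Series.map` and `Series.replace`, as in the sketch below on invented toy values. One difference worth noting: `map` yields NaN for a label that is missing from the dictionary, whereas the `iss_dict[x]` lookup above raises a `KeyError`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names follow the notebook.
toy = pd.DataFrame(
    {
        "iss": ["Stage I", "Stage III", "Stage II"],
        "lgh": ["IgG", "Not Recorded", "Unknown"],
    }
)

iss_dict = {"Stage I": 1, "Stage II": 2, "Stage III": 3}

# Convert the stage labels to their numeric codes.
toy["iss"] = toy["iss"].map(iss_dict)

# Treat both placeholder strings as missing values.
toy["lgh"] = toy["lgh"].replace(["Not Recorded", "Unknown"], np.nan)

print(toy)
```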

# Exporting New Data Set

We save the new data set to *data/input.tsv* as a tab-separated (\t) file.

In [8]:
df.to_csv('data/input.tsv', sep='\t', index=True)
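
To check that such a file can be read back, a round trip can be sketched as follows. The toy frame is hypothetical and an in-memory buffer stands in for *data/input.tsv*; the `sep` and `index_col` arguments mirror the ones used in this notebook.

```python
import io

import pandas as pd

# Hypothetical toy data indexed by patient ID.
toy = pd.DataFrame(
    {"iss": [1, 3], "gender": ["Male", "Female"]},
    index=pd.Index(["P1", "P2"], name="ID"),
)

# Write to an in-memory buffer instead of data/input.tsv for illustration.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=True)

# Read it back with the same separator and index column.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col="ID")

print(restored)
```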