# Objectives
1) To assess the association between demographic factors (age, gender, ethnicity and deprivation) and the frailty score of myeloma patients.
Method : Binary Logistic Regression (BLR)

2) To assess and predict overall survival (the time between the date of diagnosis and the date of death) based on frailty, including demographic factors and deprivation.
Method : Kaplan-Meier (KM) survival analysis & Random Forest (RF) (for prediction)

# Data Source
Simulacrum : https://simulacrum.healthdatainsight.org.uk/

# Dataset and Variables Used

1) sim_av_patient dataset 
   Variables:
   * PATIENTID : Pseudonymised patient ID (used to merge the two datasets)
   * GENDER : Person stated gender
   * ETHNICITY : Person stated ethnicity
   * VITALSTATUS : Vital status of the patient (used to keep only patients with status D: Dead or A: Alive)
   * VITALSTATUSDATE : Date of vital status (for patients who died, the date of death; for alive patients, the date on which that status was recorded)
   
2) sim_av_tumour dataset
   Variables:
   * PATIENTID : Pseudonymised patient ID (used to merge the two datasets)
   * DIAGNOSISDATEBEST : Diagnosis date (used to calculate the survival period)
   * SITE_ICD10_O2_3CHAR : Site of neoplasm (3-character ICD-10/O2 code, original version) (used to select the cancer type)
   * MORPH_ICD10_O2 : Histology of the cancer, in the ICD-10/O2 system
   * AGE : Age at diagnosis (used for the frailty score)
   * QUINTILE_2019 : Measure of deprivation: the population-weighted quintile of income-level deprivation at small area level (LSOA)
   * PERFORMANCESTATUS : Performance status recorded at diagnosis (used as the WHO performance status for the frailty score)
   * CHRL_TOT_27_03 : Total Charlson comorbidity score (used for the frailty score)
   

# 1) Data Merging

In [4]:
# importing libraries
import numpy as np
import pandas as pd #will be used for various task (merge, import data, change data format for date column etc)
from ydata_profiling import ProfileReport #will be used for EDA

In [5]:
# importing required datasets
patient = pd.read_csv("C:/Users/User/Documents/2. Master in Data Science/3. Dissertation/3. Simulacrum/simulacrum_v2.1.0/simulacrum_v2.1.0/Data/sim_av_patient.csv")
tumour = pd.read_csv("C:/Users/User/Documents/2. Master in Data Science/3. Dissertation/3. Simulacrum/simulacrum_v2.1.0/simulacrum_v2.1.0/Data/sim_av_tumour.csv", low_memory = False)
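
Both files are large (the outputs in this notebook show roughly 1.9 to 2.0 million rows each), and only a handful of columns are used later. A minimal sketch of loading just the needed columns is given here, assuming a tiny in-memory CSV as a stand-in for `sim_av_tumour.csv`; the variable names `csv_text`, `wanted` and `tumour_small` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for sim_av_tumour.csv: a tiny in-memory CSV
# that reuses some of the real column names.
csv_text = io.StringIO(
    "TUMOURID,PATIENTID,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,AGE,QUINTILE_2019\n"
    "10399610,10000001,2017-03-31,C44,84,4\n"
    "10694862,10000002,2016-01-14,C90,67,5 - least deprived\n"
)

# Columns needed for the analysis (TUMOURID is deliberately left out).
wanted = ["PATIENTID", "DIAGNOSISDATEBEST", "SITE_ICD10_O2_3CHAR", "AGE", "QUINTILE_2019"]

tumour_small = pd.read_csv(
    csv_text,
    usecols=wanted,                       # read only the listed columns
    parse_dates=["DIAGNOSISDATEBEST"],    # parse the date column while loading
    dtype={"QUINTILE_2019": "string"},    # mixed labels such as "5 - least deprived" stay text
)
print(tumour_small.dtypes)
```

Reading fewer columns reduces memory use, and declaring the type of `QUINTILE_2019` up front is one possible alternative to `low_memory = False` for that column.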

In [6]:
patient

Unnamed: 0,PATIENTID,GENDER,ETHNICITY,DEATHCAUSECODE_1A,DEATHCAUSECODE_1B,DEATHCAUSECODE_1C,DEATHCAUSECODE_2,DEATHCAUSECODE_UNDERLYING,DEATHLOCATIONCODE,VITALSTATUS,VITALSTATUSDATE,LINKNUMBER
0,10000001,1,A,,,,,,,A,2022-07-05,101610884
1,10000002,1,,,,,,,,A,2022-07-05,101343783
2,10000003,2,A,,,,,,,A,2022-07-05,101560124
3,10000004,1,A,,,,,,,A,2022-07-05,101833580
4,10000005,1,A,,,,,,,A,2022-07-05,100957799
...,...,...,...,...,...,...,...,...,...,...,...,...
1871600,250002539,2,A,,,,,,,A,2022-07-05,100642102
1871601,250002540,2,L,,,,,,,D,2021-06-10,101223249
1871602,250002541,2,A,C439,,,,I259,4,D,2022-06-10,100870402
1871603,250002542,1,A,C66,,,"I259,J449,I10",C66,2,D,2019-09-25,100803641


In [7]:
tumour

Unnamed: 0,TUMOURID,GENDER,PATIENTID,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,SITE_ICD10_O2,SITE_ICD10R4_O2_3CHAR_FROM2013,SITE_ICD10R4_O2_FROM2013,SITE_ICDO3REV2011,SITE_ICDO3REV2011_3CHAR,...,QUINTILE_2019,DATE_FIRST_SURGERY,CANCERCAREPLANINTENT,PERFORMANCESTATUS,CHRL_TOT_27_03,COMORBIDITIES_27_03,GLEASON_PRIMARY,GLEASON_SECONDARY,GLEASON_TERTIARY,GLEASON_COMBINED
0,10399610,1,10000001,2017-03-31,C44,C444,C44,C444,C444,C44,...,4,,,3.0,0.0,,,,,
1,10694862,1,10000002,2016-01-14,C44,C449,C44,C449,C449,C44,...,5 - least deprived,2016-01-14,,,0.0,,,,,
2,11938715,2,10000003,2018-12-10,C44,C442,C44,C442,C442,C44,...,3,2018-12-10,,,0.0,,,,,
3,11869010,1,10000004,2018-04-05,C44,C449,C44,C449,C449,C44,...,4,,C,0.0,1.0,06,,,,
4,11037077,1,10000005,2018-04-23,C44,C446,C44,C446,C446,C44,...,3,2018-04-23,,0.0,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995565,11848292,1,250002429,2018-01-23,C64,C64,C64,C64,C669,C66,...,2,,C,,0.0,,,,,
1995566,11802787,1,250002512,2019-02-05,C61,C61,C61,C61,C619,C61,...,1 - most deprived,2019-02-05,,,3.0,0413,3.0,4.0,,7.0
1995567,11070198,1,250000217,2018-11-30,C73,C73,C73,C73,C659,C65,...,1 - most deprived,2018-11-30,,,2.0,13,,,,
1995568,10795064,1,250001761,2018-03-19,C66,C66,C66,C66,C341,C34,...,5 - least deprived,2018-03-19,9,,2.0,0109,,,,


In [8]:
# merging patient dataset with tumour dataset
set1 = pd.merge(patient, tumour, on="PATIENTID")
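
`pd.merge` defaults to an inner join, and because `GENDER` exists in both tables pandas appends the suffixes `_x` and `_y`. The sketch below illustrates this on toy data, with explicit suffixes and a `validate` check that each patient appears once in the patient table. The frames `patient_demo` and `tumour_demo` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the patient and tumour tables (values are invented).
patient_demo = pd.DataFrame({
    "PATIENTID": [1, 2, 3],
    "GENDER": [1, 2, 1],
    "VITALSTATUS": ["A", "D", "A"],
})
tumour_demo = pd.DataFrame({
    "PATIENTID": [1, 1, 2, 4],
    "GENDER": [1, 1, 2, 2],
    "SITE_ICD10_O2_3CHAR": ["C90", "C44", "C90", "C61"],
})

merged_demo = pd.merge(
    patient_demo,
    tumour_demo,
    on="PATIENTID",
    how="inner",                           # keep only patients present in both tables
    suffixes=("_patient", "_tumour"),      # readable names instead of GENDER_x / GENDER_y
    validate="one_to_many",                # raises if PATIENTID repeats in the patient table
)
print(merged_demo)
```

The merged table has one row per tumour, so a patient with more than one tumour record appears more than once.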

In [9]:
# to keep only rows with VITALSTATUS = D (dead) or A (alive)
set2 = set1[set1['VITALSTATUS'].isin(['D', 'A'])]
set2

Unnamed: 0,PATIENTID,GENDER_x,ETHNICITY,DEATHCAUSECODE_1A,DEATHCAUSECODE_1B,DEATHCAUSECODE_1C,DEATHCAUSECODE_2,DEATHCAUSECODE_UNDERLYING,DEATHLOCATIONCODE,VITALSTATUS,...,QUINTILE_2019,DATE_FIRST_SURGERY,CANCERCAREPLANINTENT,PERFORMANCESTATUS,CHRL_TOT_27_03,COMORBIDITIES_27_03,GLEASON_PRIMARY,GLEASON_SECONDARY,GLEASON_TERTIARY,GLEASON_COMBINED
0,10000001,1,A,,,,,,,A,...,4,,,3.0,0.0,,,,,
1,10000002,1,,,,,,,,A,...,5 - least deprived,2016-01-14,,,0.0,,,,,
2,10000003,2,A,,,,,,,A,...,3,2018-12-10,,,0.0,,,,,
3,10000004,1,A,,,,,,,A,...,4,,C,0.0,1.0,06,,,,
4,10000005,1,A,,,,,,,A,...,3,2018-04-23,,0.0,0.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995565,250002540,2,L,,,,,,,D,...,1 - most deprived,2016-06-09,,,0.0,,,,,
1995566,250002541,2,A,C439,,,,I259,4,D,...,3,2018-01-25,C,,1.0,07,,,,
1995567,250002542,1,A,C66,,,"I259,J449,I10",C66,2,D,...,4,2019-07-18,,,3.0,010206,,,,
1995568,250002543,2,A,C809,,,I259,K559,2,D,...,5 - least deprived,2016-11-27,,,0.0,,,,,


# 2) Data Preparation

# 2.1) Selecting relevant variables and value for analysis

In [11]:
# to select the variables relevant to the analysis (the dataset still contains all types of cancer at this point)
set3 = set2[["PATIENTID", "GENDER_x", "ETHNICITY", "VITALSTATUS", "VITALSTATUSDATE", "DIAGNOSISDATEBEST", "SITE_ICD10_O2_3CHAR", "MORPH_ICD10_O2", "BEHAVIOUR_ICD10_O2", "AGE", "QUINTILE_2019", "PERFORMANCESTATUS", "CHRL_TOT_27_03"]]
set3

Unnamed: 0,PATIENTID,GENDER_x,ETHNICITY,VITALSTATUS,VITALSTATUSDATE,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,MORPH_ICD10_O2,BEHAVIOUR_ICD10_O2,AGE,QUINTILE_2019,PERFORMANCESTATUS,CHRL_TOT_27_03
0,10000001,1,A,A,2022-07-05,2017-03-31,C44,8070,3,84,4,3.0,0.0
1,10000002,1,,A,2022-07-05,2016-01-14,C44,8090,3,67,5 - least deprived,,0.0
2,10000003,2,A,A,2022-07-05,2018-12-10,C44,8070,3,79,3,,0.0
3,10000004,1,A,A,2022-07-05,2018-04-05,C44,8090,3,76,4,0.0,1.0
4,10000005,1,A,A,2022-07-05,2018-04-23,C44,8070,3,49,3,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995565,250002540,2,L,D,2021-06-10,2016-06-09,D01,9440,2,63,1 - most deprived,,0.0
1995566,250002541,2,A,D,2022-06-10,2018-01-25,C67,8120,3,64,3,,1.0
1995567,250002542,1,A,D,2019-09-25,2019-07-18,C45,8810,3,46,4,,3.0
1995568,250002543,2,A,D,2020-05-24,2016-10-18,C49,8851,3,75,5 - least deprived,,0.0


In [12]:
# to create dataset only for Multiple Myeloma / Myeloma patients using SITE_ICD10_O2_3CHAR
set4 = set3[set3['SITE_ICD10_O2_3CHAR'].isin(["C90"])]
set4

Unnamed: 0,PATIENTID,GENDER_x,ETHNICITY,VITALSTATUS,VITALSTATUSDATE,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,MORPH_ICD10_O2,BEHAVIOUR_ICD10_O2,AGE,QUINTILE_2019,PERFORMANCESTATUS,CHRL_TOT_27_03
419272,10390456,1,,A,2022-07-05,2019-12-17,C90,9732,3,70,5 - least deprived,,2.0
421866,10392912,1,A,A,2022-07-05,2018-03-22,C90,9732,3,83,5 - least deprived,,4.0
421972,10393008,1,A,A,2022-07-05,2018-10-06,C90,9732,3,75,3,1.0,0.0
422916,10393897,1,A,A,2022-07-05,2018-06-23,C90,9732,3,73,3,1.0,0.0
426691,10397412,2,A,A,2022-07-05,2018-03-29,C90,9732,3,75,3,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992051,240002155,1,A,D,2019-11-17,2019-09-11,C90,8990,1,58,2,,0.0
1992088,240002190,1,P,A,2022-07-05,2019-03-27,C90,8990,1,72,1 - most deprived,0.0,0.0
1992264,240002360,1,A,A,2022-07-05,2019-05-17,C90,9591,3,70,4,1.0,0.0
1992576,240002657,2,A,A,2022-07-05,2017-08-09,C90,9861,3,69,5 - least deprived,0.0,0.0


In [13]:
# to rename variable GENDER_x to GENDER
# (reassigning instead of using inplace=True avoids the SettingWithCopyWarning raised when modifying a filtered slice)
set4 = set4.rename(columns = {'GENDER_x' : 'GENDER'})
set4.columns

Index(['PATIENTID', 'GENDER', 'ETHNICITY', 'VITALSTATUS', 'VITALSTATUSDATE',
       'DIAGNOSISDATEBEST', 'SITE_ICD10_O2_3CHAR', 'MORPH_ICD10_O2',
       'BEHAVIOUR_ICD10_O2', 'AGE', 'QUINTILE_2019', 'PERFORMANCESTATUS',
       'CHRL_TOT_27_03'],
      dtype='object')
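
Modifying a filtered subset in place can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the subset is a view of the parent DataFrame or a copy. A minimal sketch of one common way to avoid it follows, assuming toy data; `all_cancers` and `myeloma_demo` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the all-cancer dataset (values are invented).
all_cancers = pd.DataFrame({
    "GENDER_x": [1, 2, 1],
    "SITE_ICD10_O2_3CHAR": ["C90", "C44", "C90"],
})

# .copy() makes the subset an independent DataFrame,
# so later changes are clearly not meant for the parent.
myeloma_demo = all_cancers[all_cancers["SITE_ICD10_O2_3CHAR"] == "C90"].copy()

# rename returns a new DataFrame; assigning it back avoids inplace=True.
myeloma_demo = myeloma_demo.rename(columns={"GENDER_x": "GENDER"})
print(myeloma_demo.columns.tolist())
```

The parent frame keeps its original column name, and the subset can be edited freely afterwards.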

In [14]:
set5 = set4.copy()

In [15]:
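# to replace values in the QUINTILE_2019 column (5 - least deprived : 5 | 1 - most deprived : 1)
# (the result is assigned back because replace(..., inplace=True) on a selected column may not update the DataFrame in newer pandas versions)
set5["QUINTILE_2019"] = set5["QUINTILE_2019"].replace({"5 - least deprived" : "5", "1 - most deprived" : "1"})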

In [16]:
set5

Unnamed: 0,PATIENTID,GENDER,ETHNICITY,VITALSTATUS,VITALSTATUSDATE,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,MORPH_ICD10_O2,BEHAVIOUR_ICD10_O2,AGE,QUINTILE_2019,PERFORMANCESTATUS,CHRL_TOT_27_03
419272,10390456,1,,A,2022-07-05,2019-12-17,C90,9732,3,70,5,,2.0
421866,10392912,1,A,A,2022-07-05,2018-03-22,C90,9732,3,83,5,,4.0
421972,10393008,1,A,A,2022-07-05,2018-10-06,C90,9732,3,75,3,1.0,0.0
422916,10393897,1,A,A,2022-07-05,2018-06-23,C90,9732,3,73,3,1.0,0.0
426691,10397412,2,A,A,2022-07-05,2018-03-29,C90,9732,3,75,3,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992051,240002155,1,A,D,2019-11-17,2019-09-11,C90,8990,1,58,2,,0.0
1992088,240002190,1,P,A,2022-07-05,2019-03-27,C90,8990,1,72,1,0.0,0.0
1992264,240002360,1,A,A,2022-07-05,2019-05-17,C90,9591,3,70,4,1.0,0.0
1992576,240002657,2,A,A,2022-07-05,2017-08-09,C90,9861,3,69,5,0.0,0.0


In [17]:
# to recode the ETHNICITY letter codes as integers
set5["ETHNICITY"] = set5["ETHNICITY"].replace({
    "0" : 0,
    "A" : 1,
    "B" : 2,
    "C" : 3,
    "D" : 4,
    "E" : 5,
    "F" : 6,
    "G" : 7,
    "H" : 8,
    "J" : 9,
    "K" : 10,
    "L" : 11,
    "M" : 12,
    "N" : 13,
    "P" : 14,
    "R" : 15,
    "S" : 16,
    "X" : 17,
    "Z" : 18})

In [18]:
# to recode the VITALSTATUS column (A = alive : "1", D = dead : "0")
set5["VITALSTATUS"] = set5["VITALSTATUS"].replace({"A" : "1", "D" : "0"})

In [19]:
set5

Unnamed: 0,PATIENTID,GENDER,ETHNICITY,VITALSTATUS,VITALSTATUSDATE,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,MORPH_ICD10_O2,BEHAVIOUR_ICD10_O2,AGE,QUINTILE_2019,PERFORMANCESTATUS,CHRL_TOT_27_03
419272,10390456,1,,1,2022-07-05,2019-12-17,C90,9732,3,70,5,,2.0
421866,10392912,1,1.0,1,2022-07-05,2018-03-22,C90,9732,3,83,5,,4.0
421972,10393008,1,1.0,1,2022-07-05,2018-10-06,C90,9732,3,75,3,1.0,0.0
422916,10393897,1,1.0,1,2022-07-05,2018-06-23,C90,9732,3,73,3,1.0,0.0
426691,10397412,2,1.0,1,2022-07-05,2018-03-29,C90,9732,3,75,3,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992051,240002155,1,1.0,0,2019-11-17,2019-09-11,C90,8990,1,58,2,,0.0
1992088,240002190,1,14.0,1,2022-07-05,2019-03-27,C90,8990,1,72,1,0.0,0.0
1992264,240002360,1,1.0,1,2022-07-05,2019-05-17,C90,9591,3,70,4,1.0,0.0
1992576,240002657,2,1.0,1,2022-07-05,2017-08-09,C90,9861,3,69,5,0.0,0.0
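
As an alternative to dictionary-based `replace`, the same kind of recoding could be written with `str.extract` and `map`. The sketch below uses toy data and, unlike the notebook, stores integers rather than strings, so a later comparison such as `== '1'` would need to become `== 1`. The frame `demo` is illustrative only.

```python
import pandas as pd

# Toy rows using the label formats seen in the dataset.
demo = pd.DataFrame({
    "QUINTILE_2019": ["5 - least deprived", "3", "1 - most deprived"],
    "VITALSTATUS": ["A", "D", "A"],
})

# Take the leading digit of each quintile label and store it as an integer.
demo["QUINTILE_2019"] = (
    demo["QUINTILE_2019"].str.extract(r"^(\d)", expand=False).astype(int)
)

# Explicit mapping that follows the notebook's coding: alive = 1, dead = 0.
# Any value missing from the dictionary would become NaN with map.
demo["VITALSTATUS"] = demo["VITALSTATUS"].map({"A": 1, "D": 0})
print(demo)
```

Because `map` turns unmapped values into `NaN`, unexpected codes show up as missing values instead of passing through silently.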


In [20]:
set5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20650 entries, 419272 to 1992694
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PATIENTID            20650 non-null  int64  
 1   GENDER               20650 non-null  int64  
 2   ETHNICITY            20322 non-null  float64
 3   VITALSTATUS          20650 non-null  object 
 4   VITALSTATUSDATE      20650 non-null  object 
 5   DIAGNOSISDATEBEST    20650 non-null  object 
 6   SITE_ICD10_O2_3CHAR  20650 non-null  object 
 7   MORPH_ICD10_O2       20650 non-null  int64  
 8   BEHAVIOUR_ICD10_O2   20650 non-null  int64  
 9   AGE                  20650 non-null  int64  
 10  QUINTILE_2019        20650 non-null  object 
 11  PERFORMANCESTATUS    10757 non-null  float64
 12  CHRL_TOT_27_03       20545 non-null  float64
dtypes: float64(3), int64(5), object(5)
memory usage: 2.2+ MB


# 2.2) To calculate survival time for alive and dead patients

In [22]:
# to convert DIAGNOSISDATEBEST and VITALSTATUSDATE to datetime format
set5['DIAGNOSISDATEBEST'] = pd.to_datetime(set5['DIAGNOSISDATEBEST'])
set5['VITALSTATUSDATE'] = pd.to_datetime(set5['VITALSTATUSDATE'])

# to check each column data format
set5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20650 entries, 419272 to 1992694
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   PATIENTID            20650 non-null  int64         
 1   GENDER               20650 non-null  int64         
 2   ETHNICITY            20322 non-null  float64       
 3   VITALSTATUS          20650 non-null  object        
 4   VITALSTATUSDATE      20650 non-null  datetime64[ns]
 5   DIAGNOSISDATEBEST    20650 non-null  datetime64[ns]
 6   SITE_ICD10_O2_3CHAR  20650 non-null  object        
 7   MORPH_ICD10_O2       20650 non-null  int64         
 8   BEHAVIOUR_ICD10_O2   20650 non-null  int64         
 9   AGE                  20650 non-null  int64         
 10  QUINTILE_2019        20650 non-null  object        
 11  PERFORMANCESTATUS    10757 non-null  float64       
 12  CHRL_TOT_27_03       20545 non-null  float64       
dtypes: datetime64[ns](2), fl

In [23]:
# to calculate survival time for each patient
# [VITALSTATUSDATE : date of death, or the recorded vital status date for patients still alive; DIAGNOSISDATEBEST : date of diagnosis]
set5['SURVIVAL_TIME'] = set5['VITALSTATUSDATE'] - set5['DIAGNOSISDATEBEST']

# converting the survival time to years (2 decimal places) and storing it in a new column
set5['SURVIVAL_YEARS'] = (set5['SURVIVAL_TIME'] / pd.Timedelta(days=365.25)).round(2)  # 365.25 days per year to account for leap years

# displaying the DataFrame with the survival years
print(set5[['DIAGNOSISDATEBEST', 'VITALSTATUSDATE', 'SURVIVAL_YEARS']])

        DIAGNOSISDATEBEST VITALSTATUSDATE  SURVIVAL_YEARS
419272         2019-12-17      2022-07-05            2.55
421866         2018-03-22      2022-07-05            4.29
421972         2018-10-06      2022-07-05            3.75
422916         2018-06-23      2022-07-05            4.03
426691         2018-03-29      2022-07-05            4.27
...                   ...             ...             ...
1992051        2019-09-11      2019-11-17            0.18
1992088        2019-03-27      2022-07-05            3.27
1992264        2019-05-17      2022-07-05            3.13
1992576        2017-08-09      2022-07-05            4.90
1992694        2019-05-02      2020-08-07            1.27

[20650 rows x 3 columns]
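
For patients who are still alive, the interval above is a follow-up time and not a time to death, so survival methods such as Kaplan-Meier treat those rows as right-censored. A minimal sketch of building a duration and an event indicator is shown here, using three date pairs taken from the output above. The column names `DURATION_DAYS` and `DEATH_EVENT` are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

# Three rows with dates taken from the displayed output; VITALSTATUS kept as A/D for clarity.
demo = pd.DataFrame({
    "VITALSTATUS": ["A", "D", "A"],
    "DIAGNOSISDATEBEST": pd.to_datetime(["2019-12-17", "2019-09-11", "2018-03-22"]),
    "VITALSTATUSDATE": pd.to_datetime(["2022-07-05", "2019-11-17", "2022-07-05"]),
})

# Follow-up time in whole days from diagnosis to the vital status date.
demo["DURATION_DAYS"] = (demo["VITALSTATUSDATE"] - demo["DIAGNOSISDATEBEST"]).dt.days

# Event indicator: 1 if death was observed, 0 if the patient is censored (still alive).
demo["DEATH_EVENT"] = (demo["VITALSTATUS"] == "D").astype(int)

# Same conversion to years as in the notebook.
demo["SURVIVAL_YEARS"] = (demo["DURATION_DAYS"] / 365.25).round(2)
print(demo[["DURATION_DAYS", "DEATH_EVENT", "SURVIVAL_YEARS"]])
```

Note that the notebook codes `VITALSTATUS` as 1 = alive and 0 = dead, so an event indicator of this kind is the inverse of that column.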


In [24]:
set5

Unnamed: 0,PATIENTID,GENDER,ETHNICITY,VITALSTATUS,VITALSTATUSDATE,DIAGNOSISDATEBEST,SITE_ICD10_O2_3CHAR,MORPH_ICD10_O2,BEHAVIOUR_ICD10_O2,AGE,QUINTILE_2019,PERFORMANCESTATUS,CHRL_TOT_27_03,SURVIVAL_TIME,SURVIVAL_YEARS
419272,10390456,1,,1,2022-07-05,2019-12-17,C90,9732,3,70,5,,2.0,931 days,2.55
421866,10392912,1,1.0,1,2022-07-05,2018-03-22,C90,9732,3,83,5,,4.0,1566 days,4.29
421972,10393008,1,1.0,1,2022-07-05,2018-10-06,C90,9732,3,75,3,1.0,0.0,1368 days,3.75
422916,10393897,1,1.0,1,2022-07-05,2018-06-23,C90,9732,3,73,3,1.0,0.0,1473 days,4.03
426691,10397412,2,1.0,1,2022-07-05,2018-03-29,C90,9732,3,75,3,,0.0,1559 days,4.27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992051,240002155,1,1.0,0,2019-11-17,2019-09-11,C90,8990,1,58,2,,0.0,67 days,0.18
1992088,240002190,1,14.0,1,2022-07-05,2019-03-27,C90,8990,1,72,1,0.0,0.0,1196 days,3.27
1992264,240002360,1,1.0,1,2022-07-05,2019-05-17,C90,9591,3,70,4,1.0,0.0,1145 days,3.13
1992576,240002657,2,1.0,1,2022-07-05,2017-08-09,C90,9861,3,69,5,0.0,0.0,1791 days,4.90


In [25]:
set5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20650 entries, 419272 to 1992694
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   PATIENTID            20650 non-null  int64          
 1   GENDER               20650 non-null  int64          
 2   ETHNICITY            20322 non-null  float64        
 3   VITALSTATUS          20650 non-null  object         
 4   VITALSTATUSDATE      20650 non-null  datetime64[ns] 
 5   DIAGNOSISDATEBEST    20650 non-null  datetime64[ns] 
 6   SITE_ICD10_O2_3CHAR  20650 non-null  object         
 7   MORPH_ICD10_O2       20650 non-null  int64          
 8   BEHAVIOUR_ICD10_O2   20650 non-null  int64          
 9   AGE                  20650 non-null  int64          
 10  QUINTILE_2019        20650 non-null  object         
 11  PERFORMANCESTATUS    10757 non-null  float64        
 12  CHRL_TOT_27_03       20545 non-null  float64        
 13  SURVIVAL_

In [27]:
# separate the dataset based on 'VITALSTATUS' values (as recoded earlier: '1' = alive, '0' = dead)
alive_data = set5[set5['VITALSTATUS'] == '1']
dead_data = set5[set5['VITALSTATUS'] == '0']

# to find the earliest and latest dates for alive patients
earliest_alive_date = alive_data['VITALSTATUSDATE'].min()
latest_alive_date = alive_data['VITALSTATUSDATE'].max()

# to find the earliest and latest dates for dead patients
earliest_dead_date = dead_data['VITALSTATUSDATE'].min()
latest_dead_date = dead_data['VITALSTATUSDATE'].max()

# Print the results
print("Earliest Alive Date:", earliest_alive_date)
print("Latest Alive Date:", latest_alive_date)
print("Earliest Dead Date:", earliest_dead_date)
print("Latest Dead Date:", latest_dead_date)


Earliest Alive Date: 2022-07-05 00:00:00
Latest Alive Date: 2022-07-25 00:00:00
Earliest Dead Date: 2016-01-08 00:00:00
Latest Dead Date: 2022-11-10 00:00:00


In [29]:
set5.isnull().sum()

PATIENTID                 0
GENDER                    0
ETHNICITY               328
VITALSTATUS               0
VITALSTATUSDATE           0
DIAGNOSISDATEBEST         0
SITE_ICD10_O2_3CHAR       0
MORPH_ICD10_O2            0
BEHAVIOUR_ICD10_O2        0
AGE                       0
QUINTILE_2019             0
PERFORMANCESTATUS      9893
CHRL_TOT_27_03          105
SURVIVAL_TIME             0
SURVIVAL_YEARS            0
dtype: int64
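
In the counts above, `PERFORMANCESTATUS` is missing for 9,893 of 20,650 rows (about 48%), which is easier to judge as a percentage. A small sketch of a missing-value summary with both counts and percentages follows, assuming toy data; the frame `demo` and the summary name `missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values (values are invented).
demo = pd.DataFrame({
    "PERFORMANCESTATUS": [1.0, np.nan, 0.0, np.nan],
    "CHRL_TOT_27_03": [0.0, 2.0, np.nan, 0.0],
    "AGE": [70, 83, 75, 73],
})

# Count and percentage of missing values per column, most incomplete first.
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "pct_missing": (demo.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(missing)
```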

In [30]:
# to save the myeloma dataset as a CSV file
set5.to_csv("C:/Users/User/Documents/2. Master in Data Science/3. Dissertation/5. Final/Data/myeloma_RAW_dataset.csv", index=False)
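
CSV files do not store column types, so the datetime and timedelta columns are written as text and come back as plain strings unless they are converted on reload. The sketch below shows one way to restore them, assuming an in-memory buffer in place of the real file; `demo`, `buffer` and `reloaded` are illustrative names.

```python
import io
import pandas as pd

# Toy frame with the same kinds of columns as the saved dataset.
demo = pd.DataFrame({
    "DIAGNOSISDATEBEST": pd.to_datetime(["2019-12-17", "2019-09-11"]),
    "VITALSTATUSDATE": pd.to_datetime(["2022-07-05", "2019-11-17"]),
})
demo["SURVIVAL_TIME"] = demo["VITALSTATUSDATE"] - demo["DIAGNOSISDATEBEST"]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the date columns while reading, then convert the timedelta column.
reloaded = pd.read_csv(buffer, parse_dates=["DIAGNOSISDATEBEST", "VITALSTATUSDATE"])
reloaded["SURVIVAL_TIME"] = pd.to_timedelta(reloaded["SURVIVAL_TIME"])
print(reloaded.dtypes)
```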