# Exploratory Data Analysis


This notebook explores the CSV files of the MSOAC placebo dataset. We analyse each file in turn and look for patterns in the data.

In [1]:
import pandas as pd

### 1. Demographics data (dm.csv) [one record per subject]

In [2]:
# Load demographics .csv
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/dm.csv'

# create data frame
demographics = pd.read_csv(file_path)
demographics


Unnamed: 0,STUDYID,DOMAIN,USUBJID,SUBJID,RFSTDTC,RFENDTC,DTHDTC,DTHFL,SITEID,INVID,...,ARM,ACTARMCD,ACTARM,COUNTRY,DMDTC,DMDY,DMENDY,DMDTC_TS,RFENDTC_TS,RFSTDTC_TS
0,MSOAC,DM,MSOAC/0649,649,,,,,,,...,PLACEBO,,,USA,,,,,,
1,MSOAC,DM,MSOAC/2224,2224,,,,,,,...,PLACEBO,,,SRB,,,,,,
2,MSOAC,DM,MSOAC/0576,576,,,,,,,...,PLACEBO,,,,,,,,,
3,MSOAC,DM,MSOAC/4961,4961,,,,,,,...,PLACEBO,,,,,,,,,
4,MSOAC,DM,MSOAC/5990,5990,,,,,,,...,PLACEBO,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2460,MSOAC,DM,MSOAC/2501,2501,,,,,,,...,PLACEBO,1.0,PLACEBO,,,,,,,
2461,MSOAC,DM,MSOAC/8672,8672,,,,,,,...,PLACEBO,,,,,,,,,
2462,MSOAC,DM,MSOAC/5705,5705,,,,,,,...,PLACEBO,,,,,,,,,
2463,MSOAC,DM,MSOAC/8255,8255,,,,,,,...,PLACEBO,,,,,,,,,


Check how many missing values we have per column

In [3]:
missing_percentage = (demographics.isnull().sum() / len(demographics)) * 100
missing_demographics = pd.DataFrame({'Column Name': missing_percentage.index, 'Missing Percentage': missing_percentage.values})
#missing_demographics = missing_demographics.sort_values(by='Missing Percentage', ascending=False)
print(missing_demographics)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3       SUBJID            0.000000
4      RFSTDTC          100.000000
5      RFENDTC          100.000000
6       DTHDTC          100.000000
7        DTHFL          100.000000
8       SITEID          100.000000
9        INVID          100.000000
10      INVNAM          100.000000
11     BRTHDTC          100.000000
12         AGE            3.367140
13        AGEU            3.367140
14         SEX            0.000000
15        RACE           31.399594
16      ETHNIC           90.750507
17       ARMCD            0.000000
18         ARM            0.000000
19    ACTARMCD           87.626775
20      ACTARM           87.626775
21     COUNTRY           56.267748
22       DMDTC          100.000000
23        DMDY          100.000000
24      DMENDY          100.000000
25    DMDTC_TS          100.000000
26  RFENDTC_TS          100.000000
27  RFSTDTC_TS      
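The same missing-value summary is recomputed for every file in this notebook, so it may be worth wrapping in a small helper. The sketch below is one possible version: `missing_report` is a hypothetical name, and the toy frame only stands in for `demographics` (its values are illustrative, not taken from the dataset).

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of missing values per column, highest first."""
    # The mean of a boolean mask is the fraction of True values, i.e. the fraction missing
    pct = df.isna().mean() * 100
    report = pct.rename("Missing Percentage").rename_axis("Column Name").reset_index()
    return report.sort_values("Missing Percentage", ascending=False, ignore_index=True)

# Toy data standing in for the demographics frame (illustrative values only)
toy = pd.DataFrame({
    "USUBJID": ["A", "B", "C", "D"],
    "AGE": [38.0, None, 50.0, 44.0],
    "DTHFL": [None, None, None, None],
})
report = missing_report(toy)
print(report)
```

Sorting the report puts the emptiest columns at the top, which makes it easier to pick a drop threshold by eye.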

We will drop the columns with more than 85% missing values

In [4]:
columns_to_drop = ['RFSTDTC','RFENDTC','DTHDTC','DTHFL','SITEID','INVID','INVNAM','ETHNIC','ACTARMCD','ACTARM','BRTHDTC','DMDTC','DMDY','DMENDY','DMDTC_TS','RFENDTC_TS','RFSTDTC_TS']
demographics = demographics.drop(columns_to_drop, axis=1)
demographics

Unnamed: 0,STUDYID,DOMAIN,USUBJID,SUBJID,AGE,AGEU,SEX,RACE,ARMCD,ARM,COUNTRY
0,MSOAC,DM,MSOAC/0649,649,,,F,WHITE,1,PLACEBO,USA
1,MSOAC,DM,MSOAC/2224,2224,38.0,YEARS,F,WHITE,1,PLACEBO,SRB
2,MSOAC,DM,MSOAC/0576,576,50.0,YEARS,F,WHITE,1,PLACEBO,
3,MSOAC,DM,MSOAC/4961,4961,44.0,YEARS,F,WHITE,1,PLACEBO,
4,MSOAC,DM,MSOAC/5990,5990,52.0,YEARS,F,WHITE,1,PLACEBO,
...,...,...,...,...,...,...,...,...,...,...,...
2460,MSOAC,DM,MSOAC/2501,2501,46.0,YEARS,F,WHITE,1,PLACEBO,
2461,MSOAC,DM,MSOAC/8672,8672,43.0,YEARS,F,,1,PLACEBO,
2462,MSOAC,DM,MSOAC/5705,5705,30.0,YEARS,M,,1,PLACEBO,
2463,MSOAC,DM,MSOAC/8255,8255,42.0,YEARS,M,,1,PLACEBO,
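The list of columns to drop above is typed by hand, which can drift out of sync with the stated 85% threshold. As a hedged alternative, the list can be derived from the threshold itself. `drop_sparse_columns` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 85.0) -> pd.DataFrame:
    """Drop columns whose missing percentage is strictly above `threshold`."""
    missing_pct = df.isna().mean() * 100
    sparse = missing_pct[missing_pct > threshold].index
    return df.drop(columns=sparse)

# Toy data: ETHNIC is 90% missing, COUNTRY is 60% missing (illustrative values only)
toy = pd.DataFrame({
    "USUBJID": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "ETHNIC": [None] * 9 + ["NOT HISPANIC OR LATINO"],
    "COUNTRY": ["USA", None, None, "SRB", None, None, "POL", None, "CAN", None],
})
cleaned = drop_sparse_columns(toy, threshold=85.0)
print(cleaned.columns.tolist())
```

Because the function returns a new frame, the original stays available for comparison if the threshold is changed later.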


- Descriptive statistics for continuous variables (in this case, just age)

In [5]:
continuous_columns = ['AGE']

descriptive_continuous = {
    'Count': demographics[continuous_columns].count(), #cases that are not missing
    'Missing Cases': demographics[continuous_columns].isna().sum(),
    'Mean': demographics[continuous_columns].mean(),
    'Standard Deviation': demographics[continuous_columns].std()
}

cont_demographics = pd.DataFrame(descriptive_continuous)

print(cont_demographics)

     Count  Missing Cases       Mean  Standard Deviation
AGE   2382             83  41.766583           10.413545


- Descriptive statistics for categorical variables (in this case, gender, race and country)

In [6]:
categorical_columns = ['SEX', 'RACE','COUNTRY']

descriptive_categorical = {}
for col in categorical_columns:
    descriptive_categorical[col] = {
        'Count': demographics[col].count(),
        'Missing Cases': demographics[col].isna().sum(),
        'Unique Values': demographics[col].nunique(),
        'Mode': demographics[col].mode().values[0],
        'Mode Frequency': demographics[col].value_counts().max()
    }

cat_demographics = pd.DataFrame(descriptive_categorical).T
print(cat_demographics)

        Count Missing Cases Unique Values   Mode Mode Frequency
SEX      2465             0             2      F           1658
RACE     1691           774             7  WHITE           1534
COUNTRY  1078          1387            35    USA            249


- Number of observations for each RACE category

In [7]:
race_counts_demographics = demographics['RACE'].value_counts().reset_index()
race_counts_demographics.columns = ['Race', 'Count']

print(race_counts_demographics)

                               Race  Count
0                             WHITE   1534
1                             ASIAN     64
2                             OTHER     41
3         BLACK OR AFRICAN AMERICAN     39
4                          HISPANIC     10
5  AMERICAN INDIAN OR ALASKA NATIVE      2
6                HISPANIC OR LATINO      1


- Number of observations for each SEX category

In [8]:
sex_counts_demographics = demographics['SEX'].value_counts().reset_index()
sex_counts_demographics.columns = ['Gender', 'Count']

print(sex_counts_demographics)

  Gender  Count
0      F   1658
1      M    807


- Number of observations for each COUNTRY category

In [9]:
country_counts_demographics = demographics['COUNTRY'].value_counts().reset_index()
country_counts_demographics.columns = ['Country', 'Count']

print(country_counts_demographics)

   Country  Count
0      USA    249
1      POL    177
2      CAN     73
3      UKR     63
4      CZE     63
5      IND     56
6      RUS     48
7      SRB     46
8      DEU     44
9      GBR     37
10     NLD     26
11     BGR     21
12     HUN     19
13     ROU     16
14     GRC     14
15     FRA     13
16     NZL     10
17     BEL     10
18     SWE      9
19     MEX      9
20     EST      8
21     ESP      7
22     PER      7
23     GEO      7
24     AUS      7
25     ISR      6
26     CHE      6
27     HRV      5
28     TUR      5
29     COL      5
30     LVA      3
31     FIN      3
32     IRL      3
33     DNK      2
34     CHL      1


#### *Ideas*:
- Impute age with mean (only around 3% missing)
- Is country important for prognosis? If not, drop it. If yes, what to do about the missing values?
- COUNTRY variable (if used): should we group by continent?
- RACE variable is highly imbalanced - maybe use just two categories (white / non-white)?
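The ideas above could be prototyped roughly as follows. This is only a sketch on toy data: the new column names (`AGE_IMPUTED`, `RACE_BINARY`, `CONTINENT`) and the `continent_of` lookup are hypothetical, and the lookup is deliberately incomplete.

```python
import pandas as pd

# Toy data standing in for the demographics frame (illustrative values only)
toy = pd.DataFrame({
    "AGE": [38.0, None, 50.0, 44.0],
    "RACE": ["WHITE", "ASIAN", None, "WHITE"],
    "COUNTRY": ["USA", "SRB", None, "IND"],
})

# Idea 1: mean imputation of AGE, with a flag so imputed rows stay identifiable
toy["AGE_IMPUTED"] = toy["AGE"].isna()
toy["AGE"] = toy["AGE"].fillna(toy["AGE"].mean())

# Idea 2: collapse RACE to two categories, leaving missing values missing
toy["RACE_BINARY"] = toy["RACE"].map(
    lambda r: "WHITE" if r == "WHITE" else "NON-WHITE", na_action="ignore"
)

# Idea 3: group COUNTRY by continent through a lookup table
# (hypothetical and incomplete; countries not listed become missing)
continent_of = {"USA": "North America", "SRB": "Europe", "IND": "Asia"}
toy["CONTINENT"] = toy["COUNTRY"].map(continent_of)

print(toy)
```

Keeping missing RACE and COUNTRY values as missing, rather than folding them into a category, leaves the decision about how to handle them open.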

### 2. Clinical events (ce.csv) - [One record per event per subject]

In [10]:
# Load clinical events .csv file
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/ce.csv'

# Create data frame
clinical_events = pd.read_csv(file_path)

# Sort by the 'USUBJID' and 'CESEQ' columns in ascending order
clinical_events = clinical_events.sort_values(by=['USUBJID','CESEQ'], ascending=True)
clinical_events

Unnamed: 0,STUDYID,DOMAIN,USUBJID,CESEQ,CEGRPID,CEREFID,CESPID,CETERM,CEMODIFY,CEDECOD,...,CEENDY,CESTRF,CEENRF,CEEVLINT,CEEVINTX,CESTRTPT,CESTTPT,CEENRTPT,CEENTPT,MIDS
432,MSOAC,CE,MSOAC/0031,1,,,,MS RELAPSE,Neurologist Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,...,279.0,,,,,,,,,MS RELAPSE 1
1334,MSOAC,CE,MSOAC/0031,2,,,,MS RELAPSE,Neurologist Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,...,,,,,,,,,,MS RELAPSE 2
1022,MSOAC,CE,MSOAC/0035,1,,,,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,...,,,,,,,,,,MS RELAPSE 1
1368,MSOAC,CE,MSOAC/0035,2,,,,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,...,,,,,,,,,,MS RELAPSE 2
1819,MSOAC,CE,MSOAC/0041,1,,,,MS EXACERBATION #1,MULTIPLE SCLEROSIS AGGRAVATED,MULTIPLE SCLEROSIS,...,,,ONGOING,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1959,MSOAC,CE,MSOAC/9995,4,,,,UNCONFIRMED MS RELAPSE,Suspected Relapse,RELAPSE-LIKE EVENT,...,,,,,,,,,,
871,MSOAC,CE,MSOAC/9998,1,,,,Confirmed MS Exacerbation,,MULTIPLE SCLEROSIS RELAPSE,...,,,,,,,,,,MS RELAPSE 1
1739,MSOAC,CE,MSOAC/9998,2,,,,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,...,,,,,,,,,,
975,MSOAC,CE,MSOAC/9999,1,,,,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,...,,,,,,,,,,MS RELAPSE 1


Check how many different patients are in the dataset

In [11]:
unique_count = clinical_events['USUBJID'].nunique()
print(f"The number of unique values in USUBJID: {unique_count}") #less than the total we have (2465)

The number of unique values in USUBJID: 1215
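Since fewer subjects appear here than in the demographics file, it can help to list exactly which subjects have no clinical event record. A minimal sketch, using two toy frames with made-up subject IDs in place of `demographics` and `clinical_events`:

```python
import pandas as pd

# Toy frames with made-up subject IDs (not taken from the dataset)
dm = pd.DataFrame({"USUBJID": ["MSOAC/0001", "MSOAC/0002", "MSOAC/0003"]})
ce = pd.DataFrame({
    "USUBJID": ["MSOAC/0001", "MSOAC/0001", "MSOAC/0003"],
    "CESEQ": [1, 2, 1],
})

# True for every demographics subject that has at least one clinical event row
has_event = dm["USUBJID"].isin(ce["USUBJID"])
subjects_without_events = dm.loc[~has_event, "USUBJID"].tolist()

print(f"Subjects with at least one event: {has_event.sum()} of {len(dm)}")
print(f"Subjects without events: {subjects_without_events}")
```

The boolean mask can also be stored as a column in the demographics frame if event presence turns out to be a useful feature.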


Check columns with missing values

In [12]:
missing_percentage_ce = (clinical_events.isnull().sum() / len(clinical_events)) * 100
missing_clinical_events = pd.DataFrame({'Column Name': missing_percentage_ce.index, 'Missing Percentage': missing_percentage_ce.values})
#missing_clinical_events = missing_clinical_events.sort_values(by='Missing Percentage', ascending=False)
print(missing_clinical_events)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        CESEQ            0.000000
4      CEGRPID          100.000000
5      CEREFID          100.000000
6       CESPID          100.000000
7       CETERM            0.000000
8     CEMODIFY           55.315355
9      CEDECOD           54.624471
10       CECAT           85.892579
11      CESCAT          100.000000
12     CEPRESP           45.375529
13     CEOCCUR           45.375529
14      CESTAT          100.000000
15    CEREASND          100.000000
16    CEBODSYS           77.646534
17       CELOC          100.000000
18       CELAT          100.000000
19       CESEV           55.449075
20       CESER           78.493425
21      CEPATT          100.000000
22       CEOUT           87.162915
23     CESHOSP          100.000000
24    CECONTRT           81.034099
25     CETOXGR          100.000000
26    VISITNUM           39.915311
27       VISIT      

Drop columns with more than 80% missing values

In [13]:
columns_to_drop = missing_clinical_events[missing_clinical_events['Missing Percentage'] > 80]['Column Name'].tolist()
clinical_events.drop(columns=columns_to_drop, inplace=True)
clinical_events

Unnamed: 0,STUDYID,DOMAIN,USUBJID,CESEQ,CETERM,CEMODIFY,CEDECOD,CEPRESP,CEOCCUR,CEBODSYS,CESEV,CESER,VISITNUM,VISIT,CEDY,CESTDY,MIDS
432,MSOAC,CE,MSOAC/0031,1,MS RELAPSE,Neurologist Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MILD,N,,,,268.0,MS RELAPSE 1
1334,MSOAC,CE,MSOAC/0031,2,MS RELAPSE,Neurologist Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MILD,N,,,,814.0,MS RELAPSE 2
1022,MSOAC,CE,MSOAC/0035,1,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,,MODERATE,,,,,144.0,MS RELAPSE 1
1368,MSOAC,CE,MSOAC/0035,2,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,,MODERATE,,,,,221.0,MS RELAPSE 2
1819,MSOAC,CE,MSOAC/0041,1,MS EXACERBATION #1,MULTIPLE SCLEROSIS AGGRAVATED,MULTIPLE SCLEROSIS,,,Nervous system disorders,MILD,N,,,,179.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1959,MSOAC,CE,MSOAC/9995,4,UNCONFIRMED MS RELAPSE,Suspected Relapse,RELAPSE-LIKE EVENT,,,,MILD,,,,,682.0,
871,MSOAC,CE,MSOAC/9998,1,Confirmed MS Exacerbation,,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,,,3.0,MONTH 3,85.0,79.0,MS RELAPSE 1
1739,MSOAC,CE,MSOAC/9998,2,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,,,Nervous system disorders,,,999.0,UNSCHEDULED,95.0,,
975,MSOAC,CE,MSOAC/9999,1,MS RELAPSE CONFIRMED BY EDSS,EDSS Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,,MILD,,,,,69.0,MS RELAPSE 1


- descriptive analysis of categorical variables

In [14]:
ceterm_counts_clinical_events = clinical_events['CETERM'].value_counts().reset_index()
ceterm_counts_clinical_events.columns = ['CETERM', 'Count']
print(ceterm_counts_clinical_events)

                                                CETERM  Count
0                    POSSIBLE RELAPSE SINCE LAST VISIT   2451
1                         MS RELAPSE CONFIRMED BY EDSS    762
2                                           MS RELAPSE    573
3                               UNCONFIRMED MS RELAPSE    271
4                                      MS EXACERBATION     30
..                                                 ...    ...
173                   NUMBNESS WAIST DOWN (MS RELAPSE)      1
174                                        RELAPSE # 2      1
175  INCREASED LEG TWITCHING WITH R GREATER THAN L ...      1
176                                          MSRELAPSE      1
177                MS RELAPSE-PROGRESSIVE LEG WEAKNESS      1

[178 rows x 2 columns]


In [15]:
cemodify_counts_clinical_events = clinical_events['CEMODIFY'].value_counts().reset_index()
cemodify_counts_clinical_events.columns = ['CEMODIFY', 'Count']
print(cemodify_counts_clinical_events)

                             CEMODIFY  Count
0              EDSS Confirmed Relapse    762
1                   Suspected Relapse    470
2       Neurologist Confirmed Relapse    459
3              INEC Confirmed Relapse    181
4                   Confirmed Relapse     84
5        Non-Protocol Defined Relapse     24
6                  Multiple Sclerosis     14
7       MULTIPLE SCLEROSIS AGGRAVATED      7
8   PROGRESSION OF MULTIPLE SCLEROSIS      2
9                  MULTIPLE SCLEROSIS      1
10                   MS-LIKE SYNDROME      1


In [16]:
cebodsys_counts_clinical_events = clinical_events['CEBODSYS'].value_counts().reset_index()
cebodsys_counts_clinical_events.columns = ['CEBODSYS', 'Count']
print(cebodsys_counts_clinical_events)

                   CEBODSYS  Count
0  Nervous system disorders   1003


Note: CEBODSYS has a single category, so it carries no information on its own - remove?
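Single-valued columns like this one can be found programmatically rather than one `value_counts()` at a time. The sketch below uses a hypothetical helper, `constant_columns`, on toy data. Note the `dropna` switch: a column with one category plus many missing values is constant only if missingness is ignored.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame, dropna: bool = True) -> list:
    """Return the columns that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=dropna) <= 1]

# Toy data standing in for the clinical events frame (illustrative values only)
toy = pd.DataFrame({
    "CEBODSYS": ["Nervous system disorders", None, "Nervous system disorders"],
    "CESEV": ["MILD", "MODERATE", None],
})

ignoring_missing = constant_columns(toy)                 # missing values not counted
counting_missing = constant_columns(toy, dropna=False)   # missing counts as a value
print(ignoring_missing, counting_missing)
```

If the pattern of missing values in such a column is itself informative, converting it to a present/absent flag may be preferable to dropping it.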

In [17]:
cesev_counts_clinical_events = clinical_events['CESEV'].value_counts().reset_index()
cesev_counts_clinical_events.columns = ['CESEV', 'Count']
print(cesev_counts_clinical_events) # severity of the event

      CESEV  Count
0  MODERATE    989
1      MILD    784
2    SEVERE    226


Note: CEMODIFY seems to group the CETERM values into broader categories
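One way to test whether CEMODIFY really groups CETERM is to count how many distinct CEMODIFY values each CETERM maps to: in a strict grouping, no term maps to more than one. The sketch below runs on toy data. The term/modifier pairings are illustrative and not all are taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the clinical events frame (pairings are illustrative)
toy = pd.DataFrame({
    "CETERM": ["MS RELAPSE", "MS RELAPSE", "MSRELAPSE",
               "UNCONFIRMED MS RELAPSE", "RELAPSE # 2"],
    "CEMODIFY": ["Neurologist Confirmed Relapse", "Neurologist Confirmed Relapse",
                 "Neurologist Confirmed Relapse", "Suspected Relapse", None],
})

# Distinct CEMODIFY values per CETERM (missing CEMODIFY is not counted)
modify_per_term = toy.groupby("CETERM")["CEMODIFY"].nunique()

# Terms mapped to more than one CEMODIFY would contradict a strict grouping
ambiguous_terms = modify_per_term[modify_per_term > 1].index.tolist()

# Distinct CETERM spellings collected under each CEMODIFY
terms_per_modify = toy.groupby("CEMODIFY")["CETERM"].nunique()

print(ambiguous_terms)
print(terms_per_modify)
```

Terms with a count of zero in `modify_per_term` are those that never receive a CEMODIFY value, which is worth inspecting separately given how many CEMODIFY entries are missing.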

#### **Ideas**:
- The MIDS column seems to be linked to the MIDS column in milestones
- CESEV could be used either as an output or as an input
- Check ce_analysis.png: some variables seem to complement each other

*to be continued*

### 3. Disposition (ds.csv) - [One record per disposition status or protocol milestone per subject]

In [18]:
# Load disposition .csv file
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/ds.csv'

# create data frame
disposition = pd.read_csv(file_path)

# Sort by the 'USUBJID' and 'DSSEQ' columns in ascending order
disposition = disposition.sort_values(by=['USUBJID','DSSEQ'], ascending=True)
disposition

Unnamed: 0,STUDYID,DOMAIN,USUBJID,DSSEQ,DSGRPID,DSREFID,DSSPID,DSTERM,DSMODIFY,DSDECOD,DSCAT,DSSCAT,VISITNUM,VISIT,EPOCH,DSDTC,DSSTDTC,DSSTDY
1194,MSOAC,DS,MSOAC/0019,1,,,,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,,-3.0,SCREENING -3,,,,-27
712,MSOAC,DS,MSOAC/0019,2,,,,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,,997.0,EARLY/TERMINATION,,,,899
108,MSOAC,DS,MSOAC/0019,3,,,,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,,36.0,MONTH 36,,,,1088
992,MSOAC,DS,MSOAC/0030,1,,,,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,,,,,1
858,MSOAC,DS,MSOAC/0041,1,,,,Lack of Clinical Efficacy,,LACK OF EFFICACY,DISPOSITION EVENT,EARLY WITHDRAWAL,,,,,,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
861,MSOAC,DS,MSOAC/9974,2,,,,SPONSOR DECISION,OTHER,OTHER,DISPOSITION EVENT,,997.0,EARLY/TERMINATION,,,,1007
616,MSOAC,DS,MSOAC/9980,1,,,,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,,,,,1
606,MSOAC,DS,MSOAC/9986,1,,,,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,,,,,1
390,MSOAC,DS,MSOAC/9998,1,,,,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,,-3.0,SCREENING -3,,,,-28


Number of unique patients

In [22]:
unique_count = disposition['USUBJID'].nunique()
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 852


Check how many missing values we have per column

In [19]:
missing_percentage_ds = (disposition.isnull().sum() / len(disposition)) * 100
missing_disposition = pd.DataFrame({'Column Name': missing_percentage_ds.index, 'Missing Percentage': missing_percentage_ds.values})
#missing_disposition = missing_disposition.sort_values(by='Missing Percentage', ascending=False)
print(missing_disposition)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        DSSEQ            0.000000
4      DSGRPID          100.000000
5      DSREFID          100.000000
6       DSSPID          100.000000
7       DSTERM            0.000000
8     DSMODIFY           42.789598
9      DSDECOD            0.000000
10       DSCAT            0.000000
11      DSSCAT           96.611505
12    VISITNUM           42.789598
13       VISIT           42.789598
14       EPOCH          100.000000
15       DSDTC          100.000000
16     DSSTDTC          100.000000
17      DSSTDY            0.000000


Note: whenever DSSCAT is filled in, it is 'EARLY WITHDRAWAL', but there is no obvious pattern to which rows have it

Drop columns with more than 95% missing values

In [20]:
columns_to_drop = ['DSGRPID','DSREFID','DSSPID','DSSCAT','EPOCH','DSDTC','DSSTDTC']
disposition = disposition.drop(columns_to_drop, axis=1)
disposition

Unnamed: 0,STUDYID,DOMAIN,USUBJID,DSSEQ,DSTERM,DSMODIFY,DSDECOD,DSCAT,VISITNUM,VISIT,DSSTDY
1194,MSOAC,DS,MSOAC/0019,1,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,-3.0,SCREENING -3,-27
712,MSOAC,DS,MSOAC/0019,2,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,997.0,EARLY/TERMINATION,899
108,MSOAC,DS,MSOAC/0019,3,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,36.0,MONTH 36,1088
992,MSOAC,DS,MSOAC/0030,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
858,MSOAC,DS,MSOAC/0041,1,Lack of Clinical Efficacy,,LACK OF EFFICACY,DISPOSITION EVENT,,,365
...,...,...,...,...,...,...,...,...,...,...,...
861,MSOAC,DS,MSOAC/9974,2,SPONSOR DECISION,OTHER,OTHER,DISPOSITION EVENT,997.0,EARLY/TERMINATION,1007
616,MSOAC,DS,MSOAC/9980,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
606,MSOAC,DS,MSOAC/9986,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
390,MSOAC,DS,MSOAC/9998,1,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,-3.0,SCREENING -3,-28


- analysis of categorical variables

In [24]:
# DSTERM
dsterm_counts_disposition = disposition['DSTERM'].value_counts().reset_index()
dsterm_counts_disposition.columns = ['DSTERM', 'Count']
print(dsterm_counts_disposition)

                                                DSTERM  Count
0                                  FIRST RANDOMIZATION    500
1                             WRITTEN CONSENT OBTAINED    309
2                          END OF PLANNED STUDY COURSE     74
3                                     SPONSOR DECISION     54
4                             SPONSOR TERMINATED STUDY     22
..                                                 ...    ...
176          SPONSOR DECISION TO TERMINATE TRIAL EARLY      1
177  PATIENT WITHDREW CONSENT DUE TO PROGRESSION OF MS      1
178                      NONCOMPLIANCE WITH STUDY DRUG      1
179                              UNABLE TO CONTACT PT.      1
180                            SPONSOR HAS ENDED STUDY      1

[181 rows x 2 columns]


In [25]:
# DSMODIFY
dsmodify_counts_disposition = disposition['DSMODIFY'].value_counts().reset_index()
dsmodify_counts_disposition.columns = ['DSMODIFY', 'Count']
print(dsmodify_counts_disposition)

                       DSMODIFY  Count
0      WRITTEN CONSENT OBTAINED    309
1                         OTHER    249
2   END OF PLANNED STUDY COURSE     74
3  PATIENT DECISION TO WITHDRAW     66
4             LOST TO FOLLOW-UP     10
5       DEATH: NOT DRUG RELATED      8
6                    SERIOUS AE      6
7            ADVERSE EXPERIENCE      4


In [26]:
# Check categories of DSTERM when DSMODIFY is NA
filter_dsterm = disposition.loc[disposition['DSMODIFY'].isnull(), 'DSTERM']
unique_values_dsterm = filter_dsterm.unique()
print(unique_values_dsterm)

['FIRST RANDOMIZATION' 'Lack of Clinical Efficacy'
 'Patient withdrew of consent' 'Investigator judgment'
 'Serious Adverse Event' 'Patient withdrew of consent: PERSONAL PROBLEMS'
 'Adverse Event' 'REFUSED FOLLOW-UP'
 'Initiation of other treatment for MS' 'Lost to Follow up']


Notes:
- DSMODIFY seems to group the DSTERM values into broader categories
- DSTERM seems to call for survival analysis
- some of the DSTERM values with a missing DSMODIFY match categories that do exist in DSMODIFY (e.g. 'Lost to Follow up'), so are they really missing?
- whenever DSTERM is 'FIRST RANDOMIZATION', DSMODIFY is missing
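The last observation can be checked directly by computing, for each DSTERM, the share of rows where DSMODIFY is missing. This is a sketch on toy data with illustrative pairings; a share of 1.0 means DSMODIFY is always missing for that term.

```python
import pandas as pd

# Toy data standing in for the disposition frame (pairings are illustrative)
toy = pd.DataFrame({
    "DSTERM": ["FIRST RANDOMIZATION", "SPONSOR DECISION",
               "Lost to Follow up", "FIRST RANDOMIZATION"],
    "DSMODIFY": [None, "OTHER", None, None],
})

# Share of rows with a missing DSMODIFY for each DSTERM
missing_share = toy["DSMODIFY"].isna().groupby(toy["DSTERM"]).mean()

# Terms for which DSMODIFY is never filled in
always_missing = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print(always_missing)
```

Terms with a share strictly between 0 and 1 would be the interesting cases, since they are sometimes grouped and sometimes not.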

In [27]:
# DSDECOD
dsdecod_counts_disposition = disposition['DSDECOD'].value_counts().reset_index()
dsdecod_counts_disposition.columns = ['DSDECOD', 'Count']
print(dsdecod_counts_disposition)

                     DSDECOD  Count
0              RANDOMIZATION    500
1  INFORMED CONSENT OBTAINED    309
2                      OTHER    253
3      WITHDRAWAL BY SUBJECT     81
4                  COMPLETED     74
5              ADVERSE EVENT     18
6           LACK OF EFFICACY     14
7          LOST TO FOLLOW-UP     11
8                      DEATH      8
9      INVESTIGATOR JUDGMENT      1


Note: DSDECOD seems to organize DSMODIFY and DSTERM in groups

### 4. Medical history (mh.csv) - [One record per medical history event per subject]

In [28]:
# Load medical history .csv file
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/mh.csv'

# create data frame
medical_history = pd.read_csv(file_path)

# Sort by the 'USUBJID' and 'MHSEQ' columns in ascending order
medical_history = medical_history.sort_values(by=['USUBJID','MHSEQ'], ascending=True)
medical_history # TODO: check the warning raised by read_csv

  medical_history = pd.read_csv(file_path)


Unnamed: 0,STUDYID,DOMAIN,USUBJID,MHSEQ,MHGRPID,MHREFID,MHSPID,MHTERM,MHMODIFY,MHLLT,...,MHDY,MHSTDY,MHENDY,MHDUR,MHSTRF,MHENRF,MHEVLINT,MHENRTPT,MHENTPT,MHHLGT
21041,MSOAC,MH,MSOAC/0014,1,,,,RRMS,,,...,,,,,,,,,,
12262,MSOAC,MH,MSOAC/0016,1,,,,BOWEL URGENCY OR INCONTINENCE,,,...,-21.0,,,,,,-P3M,,,
12296,MSOAC,MH,MSOAC/0016,2,,,,CEREBELLAR SYMPTOMS,,,...,-21.0,,,,,,-P3M,,,
12302,MSOAC,MH,MSOAC/0016,3,,,,CONSTIPATION,,,...,-21.0,,,,,,-P3M,,,
12323,MSOAC,MH,MSOAC/0016,4,,,,DECREASED MENTATION,,,...,-21.0,,,,,,-P3M,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15955,MSOAC,MH,MSOAC/9998,36,,,,URINARY URGENCY,,,...,-28.0,-294.0,,,,,,ONGOING,SCREENING -3,
15372,MSOAC,MH,MSOAC/9998,37,,,,VERTIGO,,,...,-28.0,,,,,,,ONGOING,SCREENING -3,
20393,MSOAC,MH,MSOAC/9998,38,,,,MS DIAGNOSIS,,,...,-28.0,,,,,,,,,
20438,MSOAC,MH,MSOAC/9998,39,,,,PPMS,,,...,-28.0,,,,,,,,,
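The loading cell above is flagged with a warning whose full text is not shown. With wide CSV files a common cause is pandas' `DtypeWarning` about columns with mixed types, and the sketch below assumes that is the case here. It shows two standard `read_csv` options on a tiny in-memory CSV; the column types chosen are assumptions, not the dataset's documented types.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for mh.csv (illustrative values only)
csv_text = "USUBJID,MHSEQ,MHDY\nMSOAC/0016,1,-21\nMSOAC/9998,36,\n"

# Option 1: infer each column's type from the whole file instead of chunk by chunk
df_inferred = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the types explicitly for the columns that matter
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"USUBJID": "string", "MHSEQ": "Int64", "MHDY": "float64"},
)

print(df_inferred.dtypes)
print(df_typed.dtypes)
```

Explicit types are more work up front but make it obvious when a column contains unexpected values, because the read fails instead of silently falling back to `object`.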


Number of patients

In [29]:
unique_count = medical_history['USUBJID'].nunique()
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 2465


Check how many missing values we have per column

In [30]:
missing_percentage_mh = (medical_history.isnull().sum() / len(medical_history)) * 100
missing_medical_history = pd.DataFrame({'Column Name': missing_percentage_mh.index, 'Missing Percentage': missing_percentage_mh.values})
#missing_medical_history = missing_medical_history.sort_values(by='Missing Percentage', ascending=False)
print(missing_medical_history)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        MHSEQ            0.000000
4      MHGRPID          100.000000
5      MHREFID          100.000000
6       MHSPID          100.000000
7       MHTERM            0.000000
8     MHMODIFY          100.000000
9        MHLLT           87.608563
10     MHDECOD           40.153131
11       MHCAT           12.391437
12      MHSCAT           69.644218
13     MHPRESP           76.980801
14     MHOCCUR           76.980801
15      MHSTAT          100.000000
16    MHREASND          100.000000
17    MHBODSYS           56.879476
18       MHSOC           84.153588
19       MHLOC          100.000000
20       MHLAT          100.000000
21       MHSEV           81.932805
22      MHPATT           99.169587
23    MHCONTRT           94.853725
24    VISITNUM           20.912692
25       VISIT           20.912692
26       MHDTC          100.000000
27     MHSTDTC      

- drop columns with more than 80% missing

In [31]:
columns_to_drop = missing_medical_history[missing_medical_history['Missing Percentage'] > 80]['Column Name'].tolist()
medical_history.drop(columns=columns_to_drop, inplace=True)
medical_history

Unnamed: 0,STUDYID,DOMAIN,USUBJID,MHSEQ,MHTERM,MHDECOD,MHCAT,MHSCAT,MHPRESP,MHOCCUR,MHBODSYS,VISITNUM,VISIT,MHDY,MHENRF
21041,MSOAC,MH,MSOAC/0014,1,RRMS,Relapsing-remitting multiple sclerosis,PRIMARY DIAGNOSIS,PROTOCOL DEFINED CURRENT COURSE,,,,,,,
12262,MSOAC,MH,MSOAC/0016,1,BOWEL URGENCY OR INCONTINENCE,,MS SYMPTOMS,,Y,N,,1.0,SCREENING,-21.0,
12296,MSOAC,MH,MSOAC/0016,2,CEREBELLAR SYMPTOMS,,MS SYMPTOMS,,Y,N,,1.0,SCREENING,-21.0,
12302,MSOAC,MH,MSOAC/0016,3,CONSTIPATION,,MS SYMPTOMS,,Y,N,,1.0,SCREENING,-21.0,
12323,MSOAC,MH,MSOAC/0016,4,DECREASED MENTATION,,MS SYMPTOMS,,Y,N,,1.0,SCREENING,-21.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15955,MSOAC,MH,MSOAC/9998,36,URINARY URGENCY,,MS SYMPTOMS PRIOR TO STUDY,,,,,-3.0,SCREENING -3,-28.0,
15372,MSOAC,MH,MSOAC/9998,37,VERTIGO,,MS SYMPTOMS PRIOR TO STUDY,,,,,-3.0,SCREENING -3,-28.0,
20393,MSOAC,MH,MSOAC/9998,38,MS DIAGNOSIS,Multiple sclerosis,PRIMARY DIAGNOSIS,ONSET COURSE,,,,-3.0,SCREENING -3,-28.0,
20438,MSOAC,MH,MSOAC/9998,39,PPMS,Primary progressive multiple sclerosis,PRIMARY DIAGNOSIS,PROTOCOL DEFINED CURRENT COURSE,,,,-3.0,SCREENING -3,-28.0,


Note: MHDECOD spells out in full some of the terms in MHTERM (which makes it look redundant, so I might drop it later), but it also groups other MHTERM categories, which can help
- Should I use MHDECOD where it exists and fall back on MHTERM where it is missing?
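The fallback described in this note can be written with `fillna`, which takes the value from a second column wherever the first is missing. A minimal sketch on toy data; `MH_LABEL` and `MH_LABEL_SOURCE` are hypothetical column names, and upper-casing the result is an extra assumption made here to reconcile the two columns' different casing.

```python
import pandas as pd

# Toy data standing in for the medical history frame (illustrative values only)
toy = pd.DataFrame({
    "MHTERM": ["RRMS", "CONSTIPATION", "MS DIAGNOSIS", "VERTIGO"],
    "MHDECOD": ["Relapsing-remitting multiple sclerosis", None,
                "Multiple sclerosis", None],
})

# Prefer MHDECOD, fall back on MHTERM, then normalise case and whitespace
toy["MH_LABEL"] = toy["MHDECOD"].fillna(toy["MHTERM"]).str.upper().str.strip()

# Record where each label came from, so the two sources can still be told apart
toy["MH_LABEL_SOURCE"] = toy["MHDECOD"].notna().map({True: "MHDECOD", False: "MHTERM"})

print(toy[["MH_LABEL", "MH_LABEL_SOURCE"]])
```

Keeping the source column makes it possible to check later whether labels that came from free-text MHTERM need further cleaning.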

*to be continued*

### 5. Reproductive System Findings (rp.csv) - [One record per Reproductive System Finding per time point per visit per subject]

In [32]:
# Load reproductive system .csv file
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/rp.csv'

# create data frame
reproductive_system = pd.read_csv(file_path)

# Sort by the 'USUBJID' and 'RPSEQ' columns in ascending order
reproductive_system = reproductive_system.sort_values(by=['USUBJID','RPSEQ'], ascending=True)
reproductive_system

Unnamed: 0,STUDYID,DOMAIN,USUBJID,RPSEQ,RPGRPID,RPREFID,RPSPID,RPTESTCD,RPTEST,RPCAT,...,RPDRVFL,VISITNUM,VISIT,VISITDY,EPOCH,RPDY,RPTPT,RPTPTNUM,RPELTM,RPTPTREF
315,MSOAC,RP,MSOAC/0041,1,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
253,MSOAC,RP,MSOAC/0041,2,,,,PREGTEST,Pregnancy Test,,...,,6.0,VISIT 6,167,,167,,,,
207,MSOAC,RP,MSOAC/0041,3,,,,PREGTEST,Pregnancy Test,,...,,15.1,EARLY WITHDRAWAL VISIT,365,,365,,,,
458,MSOAC,RP,MSOAC/0094,1,,,,PREGTEST,Pregnancy Test,,...,,1.0,BASELINE,-28,,-28,,,,
498,MSOAC,RP,MSOAC/0094,2,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,MSOAC,RP,MSOAC/9921,1,,,,PREGTEST,Pregnancy Test,,...,,10.0,VISIT 10,340,,340,,,,
365,MSOAC,RP,MSOAC/9951,1,,,,PREGTEST,Pregnancy Test,,...,,1.0,BASELINE,-28,,-28,,,,
19,MSOAC,RP,MSOAC/9951,2,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
471,MSOAC,RP,MSOAC/9951,3,,,,PREGTEST,Pregnancy Test,,...,,6.0,VISIT 6,176,,176,,,,


In [33]:
# change entry 'MISSING' to NA
reproductive_system['RPORRES'] = reproductive_system['RPORRES'].replace('MISSING', pd.NA)
reproductive_system['RPSTRESC'] = reproductive_system['RPSTRESC'].replace('MISSING', pd.NA)

reproductive_system

Unnamed: 0,STUDYID,DOMAIN,USUBJID,RPSEQ,RPGRPID,RPREFID,RPSPID,RPTESTCD,RPTEST,RPCAT,...,RPDRVFL,VISITNUM,VISIT,VISITDY,EPOCH,RPDY,RPTPT,RPTPTNUM,RPELTM,RPTPTREF
315,MSOAC,RP,MSOAC/0041,1,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
253,MSOAC,RP,MSOAC/0041,2,,,,PREGTEST,Pregnancy Test,,...,,6.0,VISIT 6,167,,167,,,,
207,MSOAC,RP,MSOAC/0041,3,,,,PREGTEST,Pregnancy Test,,...,,15.1,EARLY WITHDRAWAL VISIT,365,,365,,,,
458,MSOAC,RP,MSOAC/0094,1,,,,PREGTEST,Pregnancy Test,,...,,1.0,BASELINE,-28,,-28,,,,
498,MSOAC,RP,MSOAC/0094,2,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,MSOAC,RP,MSOAC/9921,1,,,,PREGTEST,Pregnancy Test,,...,,10.0,VISIT 10,340,,340,,,,
365,MSOAC,RP,MSOAC/9951,1,,,,PREGTEST,Pregnancy Test,,...,,1.0,BASELINE,-28,,-28,,,,
19,MSOAC,RP,MSOAC/9951,2,,,,PREGTEST,Pregnancy Test,,...,,2.0,VISIT 2,1,,1,,,,
471,MSOAC,RP,MSOAC/9951,3,,,,PREGTEST,Pregnancy Test,,...,,6.0,VISIT 6,176,,176,,,,


Number of patients

In [34]:
unique_count = reproductive_system['USUBJID'].nunique()
print(f"The number of unique values in USUBJID: {unique_count}") #less than total no. of women in the dataset

The number of unique values in USUBJID: 141


Check how many missing values per column

In [35]:
missing_percentage_rs = (reproductive_system.isnull().sum() / len(reproductive_system)) * 100
missing_reproductive_system = pd.DataFrame({'Column Name': missing_percentage_rs.index, 'Missing Percentage': missing_percentage_rs.values})
#missing_reproductive_system = missing_reproductive_system.sort_values(by='Missing Percentage', ascending=False)
print(missing_reproductive_system)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        RPSEQ            0.000000
4      RPGRPID          100.000000
5      RPREFID          100.000000
6       RPSPID          100.000000
7     RPTESTCD            0.000000
8       RPTEST            0.000000
9        RPCAT          100.000000
10      RPSCAT          100.000000
11     RPORRES            0.570342
12    RPORRESU          100.000000
13    RPSTRESC            0.570342
14    RPSTRESN          100.000000
15    RPSTRESU          100.000000
16      RPSTAT          100.000000
17    RPREASND          100.000000
18      RPSPEC            0.570342
19      RPBLFL          100.000000
20     RPDRVFL          100.000000
21    VISITNUM            0.000000
22       VISIT            0.000000
23     VISITDY            0.000000
24       EPOCH          100.000000
25        RPDY            0.000000
26       RPTPT          100.000000
27    RPTPTNUM      

Drop columns with no info (100% missing)

In [36]:
columns_to_drop = ['RPGRPID','RPREFID','RPSPID','RPCAT','RPSCAT','RPORRESU','RPSTRESN','RPSTRESU','RPSTAT','RPREASND','RPBLFL','RPDRVFL','EPOCH','RPTPT','RPTPTNUM','RPELTM','RPTPTREF']
reproductive_system = reproductive_system.drop(columns_to_drop, axis=1)
reproductive_system

Unnamed: 0,STUDYID,DOMAIN,USUBJID,RPSEQ,RPTESTCD,RPTEST,RPORRES,RPSTRESC,RPSPEC,VISITNUM,VISIT,VISITDY,RPDY
315,MSOAC,RP,MSOAC/0041,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
253,MSOAC,RP,MSOAC/0041,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,6.0,VISIT 6,167,167
207,MSOAC,RP,MSOAC/0041,3,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,15.1,EARLY WITHDRAWAL VISIT,365,365
458,MSOAC,RP,MSOAC/0094,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,1.0,BASELINE,-28,-28
498,MSOAC,RP,MSOAC/0094,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,MSOAC,RP,MSOAC/9921,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,10.0,VISIT 10,340,340
365,MSOAC,RP,MSOAC/9951,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,1.0,BASELINE,-28,-28
19,MSOAC,RP,MSOAC/9951,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
471,MSOAC,RP,MSOAC/9951,3,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,6.0,VISIT 6,176,176
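
The column list above was typed by hand from the missing-value table. A minimal sketch of letting pandas find the fully empty columns instead, assuming a small toy frame (`toy_rp`, with made-up values) in place of the real `reproductive_system` data:

```python
import pandas as pd

# Toy stand-in for the RP table (illustrative values, not real MSOAC data)
toy_rp = pd.DataFrame({
    'USUBJID': ['S1', 'S1', 'S2'],
    'RPORRES': ['NEGATIVE', None, 'NEGATIVE'],  # partly missing, must be kept
    'RPCAT': [None, None, None],                # 100% missing
    'EPOCH': [None, None, None],                # 100% missing
})

# Names of the columns in which every value is missing
empty_columns = toy_rp.columns[toy_rp.isna().all()].tolist()

# how='all' drops a column only if all of its values are missing
toy_rp_clean = toy_rp.dropna(axis=1, how='all')

print(empty_columns)
print(toy_rp_clean.columns.tolist())
```

Printing `empty_columns` before dropping keeps a record of what was removed, which is useful when the same step is repeated for every csv file.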


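Notes:
- RPTESTCD and RPTEST give the same info
    - RPTESTCD is always 'PREGTEST' and RPTEST is always 'Pregnancy Test', so one of them (or both) can be dropped
- VISITNUM and VISIT seem to give the same info, but some visits have decimal numbers:
    - 15.1	EARLY WITHDRAWAL VISIT
    - 15.2	FOLLOW UP MONTH 27
    - 15.3	FOLLOW UP VISIT
- VISITDY and RPDY seem to give the same info -- drop one
    - what do they mean? In SDTM, VISITDY is the planned study day of the visit and RPDY is the actual study day of the assessment, so they need not always agree (to confirm against the MSOAC data dictionary)
- RPORRES and RPSTRESC are always 'NEGATIVE' (apart from the 0.57% missing) -- the variables do not seem to add anything, so maybe I should drop one (or both)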
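
Before dropping any of these columns, the "same info" impressions in the notes above can be checked directly. A rough sketch, assuming a toy frame (`toy_rp`, made-up rows) and a hypothetical helper `is_one_to_one` that is not part of pandas:

```python
import pandas as pd

# Toy stand-in for the RP table (illustrative values, not real MSOAC data)
toy_rp = pd.DataFrame({
    'RPTESTCD': ['PREGTEST'] * 4,
    'RPTEST': ['Pregnancy Test'] * 4,
    'VISITNUM': [1.0, 2.0, 15.1, 15.1],
    'VISIT': ['BASELINE', 'VISIT 2', 'EARLY WITHDRAWAL VISIT', 'EARLY WITHDRAWAL VISIT'],
    'VISITDY': [-28, 1, 365, 365],
    'RPDY': [-28, 1, 365, 370],  # last row deliberately differs from VISITDY
})

def is_one_to_one(df, col_a, col_b):
    """Hypothetical helper: True if every value of col_a pairs with exactly
    one value of col_b, and the other way around."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

same_test = is_one_to_one(toy_rp, 'RPTESTCD', 'RPTEST')
same_visit = is_one_to_one(toy_rp, 'VISITNUM', 'VISIT')

# Number of rows where the two day columns disagree
n_day_mismatch = int((toy_rp['VISITDY'] != toy_rp['RPDY']).sum())

print(same_test, same_visit, n_day_mismatch)
```

If `n_day_mismatch` is zero on the real data, dropping RPDY loses nothing; otherwise the mismatching rows are worth a look first.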

In [37]:
# Drop some more variables based on previous notes
columns_to_drop = ['RPTESTCD','RPSTRESC','VISITNUM','RPDY'] # might drop RPTEST and RPORRES later
reproductive_system = reproductive_system.drop(columns_to_drop, axis=1)
reproductive_system

Unnamed: 0,STUDYID,DOMAIN,USUBJID,RPSEQ,RPTEST,RPORRES,RPSPEC,VISIT,VISITDY
315,MSOAC,RP,MSOAC/0041,1,Pregnancy Test,NEGATIVE,URINE,VISIT 2,1
253,MSOAC,RP,MSOAC/0041,2,Pregnancy Test,NEGATIVE,URINE,VISIT 6,167
207,MSOAC,RP,MSOAC/0041,3,Pregnancy Test,NEGATIVE,SERUM,EARLY WITHDRAWAL VISIT,365
458,MSOAC,RP,MSOAC/0094,1,Pregnancy Test,NEGATIVE,SERUM,BASELINE,-28
498,MSOAC,RP,MSOAC/0094,2,Pregnancy Test,NEGATIVE,URINE,VISIT 2,1
...,...,...,...,...,...,...,...,...,...
409,MSOAC,RP,MSOAC/9921,1,Pregnancy Test,NEGATIVE,URINE,VISIT 10,340
365,MSOAC,RP,MSOAC/9951,1,Pregnancy Test,NEGATIVE,SERUM,BASELINE,-28
19,MSOAC,RP,MSOAC/9951,2,Pregnancy Test,NEGATIVE,URINE,VISIT 2,1
471,MSOAC,RP,MSOAC/9951,3,Pregnancy Test,NEGATIVE,URINE,VISIT 6,176


- analysis of categorical variables

In [38]:
# VISIT
visit_counts_reproductive_system = reproductive_system['VISIT'].value_counts().reset_index()
visit_counts_reproductive_system.columns = ['VISIT', 'Count']
print(visit_counts_reproductive_system)

                    VISIT  Count
0                BASELINE    125
1                 VISIT 2     94
2                 VISIT 6     88
3                VISIT 10     79
4                VISIT 12     68
5                VISIT 14     59
6  EARLY WITHDRAWAL VISIT      8
7      FOLLOW UP MONTH 27      3
8                VISIT 11      1
9         FOLLOW UP VISIT      1


In [39]:
# RPSPEC
rpspec_counts_reproductive_system = reproductive_system['RPSPEC'].value_counts().reset_index()
rpspec_counts_reproductive_system.columns = ['RPSPEC', 'Count']
print(rpspec_counts_reproductive_system)

  RPSPEC  Count
0  URINE    388
1  SERUM    135


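Notes:
- we have 1658 females in the dataset, but 141 in this file (to confirm: the VISIT counts above sum to 526 records, so 141 is presumably the number of subjects, not records)
- imputation is an option for columns RPSTRESC, RPORRES and RPSPEC (only 0.57% missing, i.e. 3 of 526 rows)
- look more closely at the relation between VISIT and VISITDY
- add the number of missing values to the categorical variable counts?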
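
Two of the notes above can be sketched quickly: summarising VISITDY per VISIT, and including missing values in the category counts. This assumes a toy frame (`toy_rp`, made-up rows) in place of the real table:

```python
import pandas as pd

# Toy stand-in for the RP table (illustrative values, not real MSOAC data)
toy_rp = pd.DataFrame({
    'VISIT': ['BASELINE', 'BASELINE', 'VISIT 2', 'VISIT 2', 'VISIT 6'],
    'VISITDY': [-28, -28, 1, 1, 167],
    'RPSPEC': ['SERUM', 'SERUM', 'URINE', None, 'URINE'],
})

# Range of study days seen for each visit label;
# nunique == 1 means the visit always falls on the same planned day
visit_day_summary = toy_rp.groupby('VISIT')['VISITDY'].agg(['count', 'min', 'max', 'nunique'])

# dropna=False adds a NaN row, so missing values show up in the counts
rpspec_counts = toy_rp['RPSPEC'].value_counts(dropna=False)

print(visit_day_summary)
print(rpspec_counts)
```

On the real data, a visit whose `min` and `max` differ would suggest that VISITDY varies between the pooled studies.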

### 6. Subject Disease Milestones (sm.csv) - [One record per Disease Milestone per subject]

In [40]:
# Load subject disease milestones .csv file
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/sm.csv'

# Create data frame
milestones = pd.read_csv(file_path)

# Sort by the 'USUBJID' and 'SMSEQ' columns in ascending order
milestones = milestones.sort_values(by=['USUBJID', 'SMSEQ'], ascending=True)
milestones

Unnamed: 0,STUDYID,DOMAIN,USUBJID,SMSEQ,SMSTDY,SMENDY,SMENRF,MIDS,MIDSTYPE
222,MSOAC,SM,MSOAC/0031,1,268.0,279.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
57,MSOAC,SM,MSOAC/0031,2,814.0,,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
683,MSOAC,SM,MSOAC/0035,1,144.0,,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
421,MSOAC,SM,MSOAC/0035,2,221.0,,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
1287,MSOAC,SM,MSOAC/0044,1,414.0,,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
...,...,...,...,...,...,...,...,...,...
797,MSOAC,SM,MSOAC/9995,1,142.0,,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
996,MSOAC,SM,MSOAC/9995,2,555.0,,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
1039,MSOAC,SM,MSOAC/9998,1,79.0,,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
272,MSOAC,SM,MSOAC/9999,1,69.0,,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT


Number of patients

In [41]:
unique_count = milestones['USUBJID'].nunique()
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 853


Check how many missing values per column

In [42]:
missing_percentage_sm = (milestones.isnull().sum() / len(milestones)) * 100
missing_milestones = pd.DataFrame({'Column Name': missing_percentage_sm.index, 'Missing Percentage': missing_percentage_sm.values})
#missing_milestones = missing_milestones.sort_values(by='Missing Percentage', ascending=False)
print(missing_milestones)

  Column Name  Missing Percentage
0     STUDYID            0.000000
1      DOMAIN            0.000000
2     USUBJID            0.000000
3       SMSEQ            0.000000
4      SMSTDY            1.989390
5      SMENDY           57.294430
6      SMENRF           99.469496
7        MIDS            0.000000
8    MIDSTYPE            0.000000


Drop columns with more than 90% missing values (SMENRF)

In [43]:
columns_to_drop = ['SMENRF']
milestones = milestones.drop(columns_to_drop, axis=1)
milestones

Unnamed: 0,STUDYID,DOMAIN,USUBJID,SMSEQ,SMSTDY,SMENDY,MIDS,MIDSTYPE
222,MSOAC,SM,MSOAC/0031,1,268.0,279.0,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
57,MSOAC,SM,MSOAC/0031,2,814.0,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
683,MSOAC,SM,MSOAC/0035,1,144.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
421,MSOAC,SM,MSOAC/0035,2,221.0,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
1287,MSOAC,SM,MSOAC/0044,1,414.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
...,...,...,...,...,...,...,...,...
797,MSOAC,SM,MSOAC/9995,1,142.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
996,MSOAC,SM,MSOAC/9995,2,555.0,,MS RELAPSE 2,MULTIPLE SCLEROSIS RELAPSE EVENT
1039,MSOAC,SM,MSOAC/9998,1,79.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
272,MSOAC,SM,MSOAC/9999,1,69.0,,MS RELAPSE 1,MULTIPLE SCLEROSIS RELAPSE EVENT
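
The 90% rule could be wrapped in a small function so that the same cut-off is applied to every file. A minimal sketch, assuming a hypothetical helper `drop_sparse_columns` and a toy frame (`toy_sm`, made-up values):

```python
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=90.0):
    """Hypothetical helper: drop columns whose share of missing values
    exceeds max_missing_pct; returns the new frame and the dropped names."""
    missing_pct = df.isna().mean() * 100
    to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Toy stand-in for the SM table (illustrative values, not real MSOAC data)
toy_sm = pd.DataFrame({
    'USUBJID': [f'S{i}' for i in range(20)],
    'SMSTDY': [float(i) for i in range(20)],
    'SMENDY': [float(i) if i % 2 == 0 else None for i in range(20)],  # 50% missing
    'SMENRF': ['ONGOING'] + [None] * 19,                              # 95% missing
})

toy_sm_clean, dropped = drop_sparse_columns(toy_sm)
print(dropped)
```

The threshold is a parameter, so a stricter or looser cut-off only needs a different `max_missing_pct`.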


- Descriptive statistics for continuous variables (SMSTDY, SMENDY)

In [44]:
continuous_columns_ms = ['SMSTDY','SMENDY']

descriptive_continuous_ms = {
    'Count': milestones[continuous_columns_ms].count(), #cases that are not missing
    'Missing Cases': milestones[continuous_columns_ms].isna().sum(),
    'Mean': milestones[continuous_columns_ms].mean(),
    'Standard Deviation': milestones[continuous_columns_ms].std()
}

cont_milestones = pd.DataFrame(descriptive_continuous_ms)
print(cont_milestones)

        Count  Missing Cases        Mean  Standard Deviation
SMSTDY   1478             30  395.510149          303.959886
SMENDY    644            864  338.309006          223.011404


- Descriptive statistics for categorical variables

In [45]:
categorical_columns_ms = ['SMSEQ', 'MIDS','MIDSTYPE']

descriptive_categorical_ms = {}
for col in categorical_columns_ms:
    descriptive_categorical_ms[col] = {
        'Count': milestones[col].count(),
        'Missing Cases': milestones[col].isna().sum(),
        'Unique Values': milestones[col].nunique(),
        'Mode': milestones[col].mode().values[0],
        'Mode Frequency': milestones[col].value_counts().max()
    }

cat_milestones = pd.DataFrame(descriptive_categorical_ms).T
print(cat_milestones)

         Count Missing Cases Unique Values                              Mode  \
SMSEQ     1508             0             9                                 1   
MIDS      1508             0             9                      MS RELAPSE 1   
MIDSTYPE  1508             0             1  MULTIPLE SCLEROSIS RELAPSE EVENT   

         Mode Frequency  
SMSEQ               853  
MIDS                853  
MIDSTYPE           1508  


Note: MIDSTYPE has a single value for all 1508 rows, so it carries no information -- drop column later

- Number of observations for each SMSEQ category 

In [46]:
# SMSEQ
smseq_counts_milestones = milestones['SMSEQ'].value_counts().reset_index()
smseq_counts_milestones.columns = ['SMSEQ', 'Count']
print(smseq_counts_milestones)

   SMSEQ  Count
0      1    853
1      2    368
2      3    157
3      4     75
4      5     36
5      6     14
6      7      3
7      8      1
8      9      1


- number of observations for each MIDS category

In [47]:
# MIDS
mids_counts_milestones = milestones['MIDS'].value_counts().reset_index()
mids_counts_milestones.columns = ['MIDS', 'Count']
print(mids_counts_milestones)

           MIDS  Count
0  MS RELAPSE 1    853
1  MS RELAPSE 2    368
2  MS RELAPSE 3    157
3  MS RELAPSE 4     75
4  MS RELAPSE 5     36
5  MS RELAPSE 6     14
6  MS RELAPSE 7      3
7  MS RELAPSE 8      1
8  MS RELAPSE 9      1


Conclusion: MIDS and SMSEQ give exactly the same info: the order of each relapse within a subject. The highest value per subject is that subject's number of relapses.
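
The matching counts support this conclusion, but a row-by-row check is stricter. A sketch, assuming a toy frame (`toy_sm`, made-up rows) and assuming every MIDS value ends in the relapse number:

```python
import pandas as pd

# Toy stand-in for the SM table (illustrative values, not real MSOAC data)
toy_sm = pd.DataFrame({
    'USUBJID': ['S1', 'S1', 'S2'],
    'SMSEQ': [1, 2, 1],
    'MIDS': ['MS RELAPSE 1', 'MS RELAPSE 2', 'MS RELAPSE 1'],
})

# Pull the trailing number out of 'MS RELAPSE n';
# rows that do not match the pattern become NaN
mids_number = pd.to_numeric(toy_sm['MIDS'].str.extract(r'(\d+)$', expand=False))

# True only if the number in MIDS equals SMSEQ on every row
mids_matches_smseq = bool((mids_number == toy_sm['SMSEQ']).all())
print(mids_matches_smseq)
```

If this prints `True` on the real data, either column can be dropped without losing information.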

**Final MS dataset**

In [48]:
# Create the 'RLPCOUNT' column and keep only the rows with the maximum 'SMSEQ' value for each 'USUBJID'
# milestones['RLPCOUNT'] = milestones.groupby('USUBJID')['SMSEQ'].transform('max')
# milestones = milestones.loc[milestones.groupby('USUBJID')['SMSEQ'].idxmax()]

# Drop unnecessary columns
columns_to_drop = ['STUDYID','SMSEQ','MIDSTYPE'] #MIDS or SMSEQ
milestones = milestones.drop(columns_to_drop, axis=1)
milestones

Unnamed: 0,DOMAIN,USUBJID,SMSTDY,SMENDY,MIDS
222,SM,MSOAC/0031,268.0,279.0,MS RELAPSE 1
57,SM,MSOAC/0031,814.0,,MS RELAPSE 2
683,SM,MSOAC/0035,144.0,,MS RELAPSE 1
421,SM,MSOAC/0035,221.0,,MS RELAPSE 2
1287,SM,MSOAC/0044,414.0,,MS RELAPSE 1
...,...,...,...,...,...
797,SM,MSOAC/9995,142.0,,MS RELAPSE 1
996,SM,MSOAC/9995,555.0,,MS RELAPSE 2
1039,SM,MSOAC/9998,79.0,,MS RELAPSE 1
272,SM,MSOAC/9999,69.0,,MS RELAPSE 1


Notes:
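- what are the values in SMSTDY and SMENDY? In SDTM these are the study days on which the milestone (here, the relapse) starts and ends, counted relative to the subject's reference start date (to confirm against the MSOAC data dictionary)

*continue*: compute the continuous-variable statistics per individual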
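
One possible shape for the per-individual statistics, as a sketch only. It assumes a toy frame (`toy_sm`, made-up rows); the column names `RLPDUR`, `FIRST_RLP_DAY` and `MEAN_RLPDUR` are invented here, and `RLPCOUNT` is counted from MIDS because SMSEQ has already been dropped:

```python
import pandas as pd

# Toy stand-in for the final SM table (illustrative values, not real MSOAC data)
toy_sm = pd.DataFrame({
    'USUBJID': ['S1', 'S1', 'S2'],
    'SMSTDY': [268.0, 814.0, 144.0],
    'SMENDY': [279.0, None, None],
    'MIDS': ['MS RELAPSE 1', 'MS RELAPSE 2', 'MS RELAPSE 1'],
})

# Approximate relapse duration in days (NaN when the end day is missing)
toy_sm['RLPDUR'] = toy_sm['SMENDY'] - toy_sm['SMSTDY']

# One row per subject: number of relapses, day of first relapse, mean duration
per_subject = toy_sm.groupby('USUBJID').agg(
    RLPCOUNT=('MIDS', 'count'),
    FIRST_RLP_DAY=('SMSTDY', 'min'),
    MEAN_RLPDUR=('RLPDUR', 'mean'),
).reset_index()

print(per_subject)
```

Since SMENDY is missing for more than half of the rows, `MEAN_RLPDUR` will be NaN for many subjects and should be read with that in mind.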

---

# Notes

- we can merge the datasets by the column 'USUBJID'
- for the Clinical Events dataset (ce.csv): the CETERM column is inconsistent: 'MS RELAPSE', 'MS - RELAPSE', 'RELAPSE OF MULTIPLE SCLEROSIS', etc. all describe the same event
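
Both notes can be sketched together: mapping the CETERM variants to one label, then merging on 'USUBJID'. The frames below (`toy_dm`, `toy_ce`) hold made-up rows, the 'HEADACHE' term and the SEX column are assumptions for illustration, and the variant set covers only the three spellings named above, so it would need extending for the real file:

```python
import pandas as pd

# Toy stand-ins for dm.csv and ce.csv (illustrative values, not real MSOAC data)
toy_dm = pd.DataFrame({'USUBJID': ['S1', 'S2', 'S3', 'S4'], 'SEX': ['F', 'M', 'F', 'F']})
toy_ce = pd.DataFrame({
    'USUBJID': ['S1', 'S1', 'S2', 'S3'],
    'CETERM': ['MS RELAPSE', 'MS - RELAPSE', 'RELAPSE OF MULTIPLE SCLEROSIS', 'HEADACHE'],
})

# Spellings that all mean "relapse" (only the three listed in the note)
relapse_variants = {'MS RELAPSE', 'MS - RELAPSE', 'RELAPSE OF MULTIPLE SCLEROSIS'}

# Normalise case and whitespace, then map every variant to a single label
cleaned = toy_ce['CETERM'].str.upper().str.strip()
toy_ce['CETERM_CLEAN'] = cleaned.where(~cleaned.isin(relapse_variants), 'MS RELAPSE')

# Left merge keeps subjects without clinical events;
# validate raises if USUBJID is duplicated in the demographics table
merged = toy_dm.merge(toy_ce, on='USUBJID', how='left', validate='one_to_many')
print(merged)
```

A left merge from demographics keeps one row per event plus one row for each subject with no events, so the row count grows beyond the number of subjects.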

#### Datasets by relevance

- High: demographics (dm), subject disease milestones (sm)
- Medium:
- Low: