# Exploratory Data Analysis


This notebook explores the CSV files of the MSOAC placebo dataset. We analyse each file and look for patterns.

### 1. Demographics data (dm.csv) [one record per subject]

In [6]:
import pandas as pd

# Path to the demographics file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/dm.csv'

# Read the CSV file into a DataFrame
demographics = pd.read_csv(file_path)

# Now you can work with the data in the DataFrame
print(demographics)


     STUDYID DOMAIN     USUBJID  SUBJID  RFSTDTC  RFENDTC  DTHDTC  DTHFL  \
0      MSOAC     DM  MSOAC/0649     649      NaN      NaN     NaN    NaN   
1      MSOAC     DM  MSOAC/2224    2224      NaN      NaN     NaN    NaN   
2      MSOAC     DM  MSOAC/0576     576      NaN      NaN     NaN    NaN   
3      MSOAC     DM  MSOAC/4961    4961      NaN      NaN     NaN    NaN   
4      MSOAC     DM  MSOAC/5990    5990      NaN      NaN     NaN    NaN   
...      ...    ...         ...     ...      ...      ...     ...    ...   
2460   MSOAC     DM  MSOAC/2501    2501      NaN      NaN     NaN    NaN   
2461   MSOAC     DM  MSOAC/8672    8672      NaN      NaN     NaN    NaN   
2462   MSOAC     DM  MSOAC/5705    5705      NaN      NaN     NaN    NaN   
2463   MSOAC     DM  MSOAC/8255    8255      NaN      NaN     NaN    NaN   
2464   MSOAC     DM  MSOAC/6796    6796      NaN      NaN     NaN    NaN   

      SITEID  INVID  ...      ARM  ACTARMCD   ACTARM COUNTRY DMDTC DMDY  \
0        NaN
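
Since every section repeats the same long path, a small loader could keep the cells shorter. The following is only a sketch: `load_domain` and `demo_dir` are hypothetical names, the demo rows are made up, and `low_memory=False` is an optional choice that makes pandas read the whole file before inferring column types.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_domain(data_dir: Path, domain: str) -> pd.DataFrame:
    """Read `<domain>.csv` from `data_dir` into a DataFrame."""
    # low_memory=False: infer dtypes from the full file instead of chunk by chunk
    return pd.read_csv(data_dir / f"{domain}.csv", low_memory=False)


# Demo on a throwaway directory so the sketch runs without the real files
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    # Made-up rows, only mimicking the column names of dm.csv
    (demo_dir / "dm.csv").write_text("USUBJID,AGE\nMSOAC/0001,38\nMSOAC/0002,\n")
    demo = load_domain(demo_dir, "dm")

print(demo.shape)
```

With the real data, `data_dir` would point at the folder holding the csv files, and each section would call `load_domain(data_dir, "ce")`, `load_domain(data_dir, "ds")`, and so on.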

Check how many missing values we have per column

In [7]:
missing_percentage = (demographics.isnull().sum() / len(demographics)) * 100

missing_demographics = pd.DataFrame({'Column Name': missing_percentage.index, 'Missing Percentage': missing_percentage.values})

#missing_demographics = missing_demographics.sort_values(by='Missing Percentage', ascending=False)

print(missing_demographics)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3       SUBJID            0.000000
4      RFSTDTC          100.000000
5      RFENDTC          100.000000
6       DTHDTC          100.000000
7        DTHFL          100.000000
8       SITEID          100.000000
9        INVID          100.000000
10      INVNAM          100.000000
11     BRTHDTC          100.000000
12         AGE            3.367140
13        AGEU            3.367140
14         SEX            0.000000
15        RACE           31.399594
16      ETHNIC           90.750507
17       ARMCD            0.000000
18         ARM            0.000000
19    ACTARMCD           87.626775
20      ACTARM           87.626775
21     COUNTRY           56.267748
22       DMDTC          100.000000
23        DMDY          100.000000
24      DMENDY          100.000000
25    DMDTC_TS          100.000000
26  RFENDTC_TS          100.000000
27  RFSTDTC_TS      

We will drop the columns with (almost) no information (more than 85% missing)

In [8]:
columns_to_drop = ['RFSTDTC','RFENDTC','DTHDTC','DTHFL','SITEID','INVID','INVNAM','ETHNIC','ACTARMCD','ACTARM','BRTHDTC','DMDTC','DMDY','DMENDY','DMDTC_TS','RFENDTC_TS','RFSTDTC_TS']
demographics = demographics.drop(columns_to_drop, axis=1)
demographics

Unnamed: 0,STUDYID,DOMAIN,USUBJID,SUBJID,AGE,AGEU,SEX,RACE,ARMCD,ARM,COUNTRY
0,MSOAC,DM,MSOAC/0649,649,,,F,WHITE,1,PLACEBO,USA
1,MSOAC,DM,MSOAC/2224,2224,38.0,YEARS,F,WHITE,1,PLACEBO,SRB
2,MSOAC,DM,MSOAC/0576,576,50.0,YEARS,F,WHITE,1,PLACEBO,
3,MSOAC,DM,MSOAC/4961,4961,44.0,YEARS,F,WHITE,1,PLACEBO,
4,MSOAC,DM,MSOAC/5990,5990,52.0,YEARS,F,WHITE,1,PLACEBO,
...,...,...,...,...,...,...,...,...,...,...,...
2460,MSOAC,DM,MSOAC/2501,2501,46.0,YEARS,F,WHITE,1,PLACEBO,
2461,MSOAC,DM,MSOAC/8672,8672,43.0,YEARS,F,,1,PLACEBO,
2462,MSOAC,DM,MSOAC/5705,5705,30.0,YEARS,M,,1,PLACEBO,
2463,MSOAC,DM,MSOAC/8255,8255,42.0,YEARS,M,,1,PLACEBO,
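
The missing-value check and the threshold-based drop are repeated for every file, so they could be wrapped in two helpers. This is a minimal sketch, not the code used above: `missing_summary` and `drop_sparse_columns` are hypothetical names, and the toy frame contains made-up values.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of missing values per column, highest first."""
    pct = df.isna().mean() * 100
    return (
        pct.rename("Missing Percentage")
        .rename_axis("Column Name")
        .reset_index()
        .sort_values("Missing Percentage", ascending=False, ignore_index=True)
    )


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 85.0) -> pd.DataFrame:
    """Drop every column whose missing percentage is strictly above `threshold`."""
    pct = df.isna().mean() * 100
    return df.loc[:, pct <= threshold]


# Toy frame (made-up values) standing in for dm.csv
toy = pd.DataFrame({
    "USUBJID": ["A", "B", "C", "D"],
    "AGE": [38.0, None, 50.0, 44.0],      # 25% missing, kept
    "DTHFL": [None, None, None, None],    # 100% missing, dropped
})
toy_summary = missing_summary(toy)
toy_clean = drop_sparse_columns(toy, threshold=85.0)
print(toy_summary)
print(list(toy_clean.columns))
```

Computing the drop list from the threshold avoids typing the column names by hand, which makes it easier to reuse the same rule (85%, 80%, 90%, or 100%) across the other files.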


- Descriptive statistics for continuous variables (in this case, just age)

In [9]:
continuous_columns = ['AGE']

# Calculate the statistics for the selected columns
descriptive_continuous = {
    'Count': demographics[continuous_columns].count(), #cases that are not missing
    'Missing Cases': demographics[continuous_columns].isna().sum(),
    'Mean': demographics[continuous_columns].mean(),
    'Standard Deviation': demographics[continuous_columns].std()
}

cont_demographics = pd.DataFrame(descriptive_continuous)

print(cont_demographics)

     Count  Missing Cases       Mean  Standard Deviation
AGE   2382             83  41.766583           10.413545


- Descriptive statistics for categorical variables (in this case, sex, race, and country)

In [10]:
categorical_columns = ['SEX', 'RACE','COUNTRY']

# Calculate descriptive statistics for each categorical column
descriptive_categorical = {}
for col in categorical_columns:
    descriptive_categorical[col] = {
        'Count': demographics[col].count(),
        'Missing Cases': demographics[col].isna().sum(),
        'Unique Values': demographics[col].nunique(),
        'Mode': demographics[col].mode().values[0],
        'Mode Frequency': demographics[col].value_counts().max()
    }

# Create a summary DataFrame for categorical variables
cat_demographics = pd.DataFrame(descriptive_categorical).T

print(cat_demographics)

        Count Missing Cases Unique Values   Mode Mode Frequency
SEX      2465             0             2      F           1658
RACE     1691           774             7  WHITE           1534
COUNTRY  1078          1387            35    USA            249


- Number of observations for each RACE category

In [11]:
race_counts_demographics = demographics['RACE'].value_counts().reset_index()
race_counts_demographics.columns = ['Race', 'Count']

# Display the DataFrame with count of observations for each category
print(race_counts_demographics)

                               Race  Count
0                             WHITE   1534
1                             ASIAN     64
2                             OTHER     41
3         BLACK OR AFRICAN AMERICAN     39
4                          HISPANIC     10
5  AMERICAN INDIAN OR ALASKA NATIVE      2
6                HISPANIC OR LATINO      1


- Number of observations for each SEX category

In [5]:
sex_counts_demographics = demographics['SEX'].value_counts().reset_index()
sex_counts_demographics.columns = ['Gender', 'Count']

# Display the DataFrame with count of observations for each category
print(sex_counts_demographics)

  Gender  Count
0      F   1658
1      M    807


- Number of observations for each COUNTRY category

In [6]:
country_counts_demographics = demographics['COUNTRY'].value_counts().reset_index()
country_counts_demographics.columns = ['Country', 'Count']

# Display the DataFrame with count of observations for each category
print(country_counts_demographics)

   Country  Count
0      USA    249
1      POL    177
2      CAN     73
3      UKR     63
4      CZE     63
5      IND     56
6      RUS     48
7      SRB     46
8      DEU     44
9      GBR     37
10     NLD     26
11     BGR     21
12     HUN     19
13     ROU     16
14     GRC     14
15     FRA     13
16     NZL     10
17     BEL     10
18     SWE      9
19     MEX      9
20     EST      8
21     ESP      7
22     PER      7
23     GEO      7
24     AUS      7
25     ISR      6
26     CHE      6
27     HRV      5
28     TUR      5
29     COL      5
30     LVA      3
31     FIN      3
32     IRL      3
33     DNK      2
34     CHL      1


#### Ideas:

- Impute age with mean (only around 3% missing)
- Is country important for prognosis? If not, drop it. If yes, what should we do about the missing values?
- COUNTRY variable (if used): should we group by continent?
- RACE variable is highly imbalanced - maybe use just two categories (white / non-white)?
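
Two of these ideas (mean imputation for AGE and a two-level RACE variable) could look like the sketch below. It is only an illustration on made-up rows: `AGE_IMPUTED` and `RACE_BINARY` are hypothetical column names, and whether these choices are appropriate for the analysis is still open.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as dm.csv
toy_dm = pd.DataFrame({
    "AGE": [38.0, None, 50.0, 44.0],
    "RACE": ["WHITE", "ASIAN", None, "BLACK OR AFRICAN AMERICAN"],
})

# Mean imputation for AGE; the flag records which values were filled in
toy_dm["AGE_IMPUTED"] = toy_dm["AGE"].isna()
toy_dm["AGE"] = toy_dm["AGE"].fillna(toy_dm["AGE"].mean())

# Collapse RACE into two levels, leaving missing values missing
toy_dm["RACE_BINARY"] = toy_dm["RACE"].map(
    lambda r: r if pd.isna(r) else ("WHITE" if r == "WHITE" else "NON-WHITE")
)
print(toy_dm)
```

Keeping the missing RACE values as missing (rather than counting them as non-white) matters here, because about 31% of the RACE column is empty.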

### 2. Clinical events (ce.csv) - [One record per event per subject]

In [42]:
# Path to the clinical events file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/ce.csv'

# Read the CSV file into a DataFrame
clinical_events = pd.read_csv(file_path)

# Now you can work with the data in the DataFrame
print(clinical_events)

     STUDYID DOMAIN     USUBJID  CESEQ  CEGRPID  CEREFID  CESPID  \
0      MSOAC     CE  MSOAC/8216      1      NaN      NaN     NaN   
1      MSOAC     CE  MSOAC/9349      1      NaN      NaN     NaN   
2      MSOAC     CE  MSOAC/1879      1      NaN      NaN     NaN   
3      MSOAC     CE  MSOAC/5359      1      NaN      NaN     NaN   
4      MSOAC     CE  MSOAC/4758      1      NaN      NaN     NaN   
...      ...    ...         ...    ...      ...      ...     ...   
4482   MSOAC     CE  MSOAC/1248      1      NaN      NaN     NaN   
4483   MSOAC     CE  MSOAC/2966      1      NaN      NaN     NaN   
4484   MSOAC     CE  MSOAC/2800      1      NaN      NaN     NaN   
4485   MSOAC     CE  MSOAC/4845      1      NaN      NaN     NaN   
4486   MSOAC     CE  MSOAC/3018      1      NaN      NaN     NaN   

                         CETERM                CEMODIFY  \
0                    MS RELAPSE  INEC Confirmed Relapse   
1                    MS RELAPSE  INEC Confirmed Relapse   
2     

In [43]:
unique_count = clinical_events['USUBJID'].nunique()

# Display the number of unique values
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 1215


In [44]:
# Sort the DataFrame by the 'USUBJID' column in ascending order
sorted_ce = clinical_events.sort_values(by='USUBJID', ascending=True)

print("Data frame sorted by 'USUBJID' in ascending order:")
print(sorted_ce)

Data frame sorted by 'USUBJID' in ascending order:
     STUDYID DOMAIN     USUBJID  CESEQ  CEGRPID  CEREFID  CESPID  \
432    MSOAC     CE  MSOAC/0031      1      NaN      NaN     NaN   
1334   MSOAC     CE  MSOAC/0031      2      NaN      NaN     NaN   
1022   MSOAC     CE  MSOAC/0035      1      NaN      NaN     NaN   
1368   MSOAC     CE  MSOAC/0035      2      NaN      NaN     NaN   
2375   MSOAC     CE  MSOAC/0041      2      NaN      NaN     NaN   
...      ...    ...         ...    ...      ...      ...     ...   
1454   MSOAC     CE  MSOAC/9995      3      NaN      NaN     NaN   
1739   MSOAC     CE  MSOAC/9998      2      NaN      NaN     NaN   
871    MSOAC     CE  MSOAC/9998      1      NaN      NaN     NaN   
1397   MSOAC     CE  MSOAC/9999      2      NaN      NaN     NaN   
975    MSOAC     CE  MSOAC/9999      1      NaN      NaN     NaN   

                                 CETERM                       CEMODIFY  \
432                          MS RELAPSE  Neurologist Confi

Check columns with missing values

In [45]:
missing_percentage_ce = (clinical_events.isnull().sum() / len(clinical_events)) * 100

missing_clinical_events = pd.DataFrame({'Column Name': missing_percentage_ce.index, 'Missing Percentage': missing_percentage_ce.values})

#missing_clinical_events = missing_clinical_events.sort_values(by='Missing Percentage', ascending=False)

print(missing_clinical_events)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        CESEQ            0.000000
4      CEGRPID          100.000000
5      CEREFID          100.000000
6       CESPID          100.000000
7       CETERM            0.000000
8     CEMODIFY           55.315355
9      CEDECOD           54.624471
10       CECAT           85.892579
11      CESCAT          100.000000
12     CEPRESP           45.375529
13     CEOCCUR           45.375529
14      CESTAT          100.000000
15    CEREASND          100.000000
16    CEBODSYS           77.646534
17       CELOC          100.000000
18       CELAT          100.000000
19       CESEV           55.449075
20       CESER           78.493425
21      CEPATT          100.000000
22       CEOUT           87.162915
23     CESHOSP          100.000000
24    CECONTRT           81.034099
25     CETOXGR          100.000000
26    VISITNUM           39.915311
27       VISIT      

Drop columns with more than 80% missing values

In [46]:
# Identify columns with more than 80% missing values
columns_to_drop = missing_clinical_events[missing_clinical_events['Missing Percentage'] > 80]['Column Name'].tolist()

# Drop the identified columns
clinical_events.drop(columns=columns_to_drop, inplace=True)

clinical_events

Unnamed: 0,STUDYID,DOMAIN,USUBJID,CESEQ,CETERM,CEMODIFY,CEDECOD,CEPRESP,CEOCCUR,CEBODSYS,CESEV,CESER,VISITNUM,VISIT,CEDY,CESTDY,MIDS
0,MSOAC,CE,MSOAC/8216,1,MS RELAPSE,INEC Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MODERATE,N,300.0,UNSCD RLPSE EVAL,,2.0,MS RELAPSE 1
1,MSOAC,CE,MSOAC/9349,1,MS RELAPSE,INEC Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MODERATE,Y,300.0,UNSCD RLPSE EVAL,,5.0,MS RELAPSE 1
2,MSOAC,CE,MSOAC/1879,1,MS RELAPSE,INEC Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,SEVERE,N,300.0,UNSCD RLPSE EVAL,,5.0,MS RELAPSE 1
3,MSOAC,CE,MSOAC/5359,1,MS RELAPSE,INEC Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MODERATE,N,300.0,UNSCD RLPSE EVAL,,3.0,MS RELAPSE 1
4,MSOAC,CE,MSOAC/4758,1,MS RELAPSE,INEC Confirmed Relapse,MULTIPLE SCLEROSIS RELAPSE,,,Nervous system disorders,MODERATE,N,300.0,UNSCD RLPSE EVAL,,6.0,MS RELAPSE 1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4482,MSOAC,CE,MSOAC/1248,1,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,,,Nervous system disorders,,,999.0,UNSCHEDULED,281.0,,
4483,MSOAC,CE,MSOAC/2966,1,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,,,Nervous system disorders,,,999.0,UNSCHEDULED,427.0,,
4484,MSOAC,CE,MSOAC/2800,1,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,,,Nervous system disorders,,,999.0,UNSCHEDULED,428.0,,
4485,MSOAC,CE,MSOAC/4845,1,Suspected MS Exacerbation,,RELAPSE-LIKE EVENT,,,Nervous system disorders,,,999.0,UNSCHEDULED,470.0,,


*continue*

### 3. Disposition (ds.csv) - [One record per disposition status or protocol milestone per subject]

In [20]:
# Path to the disposition file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/ds.csv'

# Read the CSV file into a DataFrame
disposition = pd.read_csv(file_path)

# Sort the DataFrame by the 'USUBJID' column in ascending order
disposition = disposition.sort_values(by='USUBJID', ascending=True)

print("Data frame sorted by 'USUBJID' in ascending order:")
print(disposition)

Data frame sorted by 'USUBJID' in ascending order:
     STUDYID DOMAIN     USUBJID  DSSEQ  DSGRPID  DSREFID  DSSPID  \
108    MSOAC     DS  MSOAC/0019      3      NaN      NaN     NaN   
712    MSOAC     DS  MSOAC/0019      2      NaN      NaN     NaN   
1194   MSOAC     DS  MSOAC/0019      1      NaN      NaN     NaN   
992    MSOAC     DS  MSOAC/0030      1      NaN      NaN     NaN   
858    MSOAC     DS  MSOAC/0041      1      NaN      NaN     NaN   
...      ...    ...         ...    ...      ...      ...     ...   
861    MSOAC     DS  MSOAC/9974      2      NaN      NaN     NaN   
616    MSOAC     DS  MSOAC/9980      1      NaN      NaN     NaN   
606    MSOAC     DS  MSOAC/9986      1      NaN      NaN     NaN   
390    MSOAC     DS  MSOAC/9998      1      NaN      NaN     NaN   
212    MSOAC     DS  MSOAC/9998      2      NaN      NaN     NaN   

                           DSTERM                  DSMODIFY  \
108   BAD INTERIM ANALYSIS RESULT                     OTHER   
712   

Check how many missing values we have per column

In [21]:
missing_percentage_ds = (disposition.isnull().sum() / len(disposition)) * 100

missing_disposition = pd.DataFrame({'Column Name': missing_percentage_ds.index, 'Missing Percentage': missing_percentage_ds.values})

#missing_disposition = missing_disposition.sort_values(by='Missing Percentage', ascending=False)

print(missing_disposition)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        DSSEQ            0.000000
4      DSGRPID          100.000000
5      DSREFID          100.000000
6       DSSPID          100.000000
7       DSTERM            0.000000
8     DSMODIFY           42.789598
9      DSDECOD            0.000000
10       DSCAT            0.000000
11      DSSCAT           96.611505
12    VISITNUM           42.789598
13       VISIT           42.789598
14       EPOCH          100.000000
15       DSDTC          100.000000
16     DSSTDTC          100.000000
17      DSSTDY            0.000000


Drop columns with more than 90% values missing

In [22]:
columns_to_drop = ['DSGRPID','DSREFID','DSSPID','DSSCAT','EPOCH','DSDTC','DSSTDTC']
disposition = disposition.drop(columns_to_drop, axis=1)
disposition

Unnamed: 0,STUDYID,DOMAIN,USUBJID,DSSEQ,DSTERM,DSMODIFY,DSDECOD,DSCAT,VISITNUM,VISIT,DSSTDY
108,MSOAC,DS,MSOAC/0019,3,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,36.0,MONTH 36,1088
712,MSOAC,DS,MSOAC/0019,2,BAD INTERIM ANALYSIS RESULT,OTHER,OTHER,DISPOSITION EVENT,997.0,EARLY/TERMINATION,899
1194,MSOAC,DS,MSOAC/0019,1,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,-3.0,SCREENING -3,-27
992,MSOAC,DS,MSOAC/0030,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
858,MSOAC,DS,MSOAC/0041,1,Lack of Clinical Efficacy,,LACK OF EFFICACY,DISPOSITION EVENT,,,365
...,...,...,...,...,...,...,...,...,...,...,...
861,MSOAC,DS,MSOAC/9974,2,SPONSOR DECISION,OTHER,OTHER,DISPOSITION EVENT,997.0,EARLY/TERMINATION,1007
616,MSOAC,DS,MSOAC/9980,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
606,MSOAC,DS,MSOAC/9986,1,FIRST RANDOMIZATION,,RANDOMIZATION,PROTOCOL MILESTONE,,,1
390,MSOAC,DS,MSOAC/9998,1,WRITTEN CONSENT OBTAINED,WRITTEN CONSENT OBTAINED,INFORMED CONSENT OBTAINED,PROTOCOL MILESTONE,-3.0,SCREENING -3,-28


Number of patients

In [23]:
unique_count = disposition['USUBJID'].nunique()

# Display the number of unique values
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 852


### 4. Medical history (mh.csv) - [One record per medical history event per subject]

In [25]:
# Path to the medical history file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/mh.csv'

# Read the CSV file into a DataFrame
medical_history = pd.read_csv(file_path)

# Sort the DataFrame by the 'USUBJID' column in ascending order
medical_history = medical_history.sort_values(by='USUBJID', ascending=True)

print("Data frame sorted by 'USUBJID' in ascending order:")
print(medical_history)

Data frame sorted by 'USUBJID' in ascending order:
      STUDYID DOMAIN     USUBJID  MHSEQ  MHGRPID  MHREFID  MHSPID  \
21041   MSOAC     MH  MSOAC/0014      1      NaN      NaN     NaN   
13487   MSOAC     MH  MSOAC/0016     25      NaN      NaN     NaN   
13321   MSOAC     MH  MSOAC/0016     15      NaN      NaN     NaN   
12687   MSOAC     MH  MSOAC/0016     24      NaN      NaN     NaN   
12262   MSOAC     MH  MSOAC/0016      1      NaN      NaN     NaN   
...       ...    ...         ...    ...      ...      ...     ...   
15715   MSOAC     MH  MSOAC/9998     27      NaN      NaN     NaN   
15366   MSOAC     MH  MSOAC/9998     31      NaN      NaN     NaN   
3752    MSOAC     MH  MSOAC/9998     12      NaN      NaN     NaN   
15621   MSOAC     MH  MSOAC/9998     23      NaN      NaN     NaN   
22313   MSOAC     MH  MSOAC/9999      1      NaN      NaN     NaN   

                              MHTERM  MHMODIFY MHLLT  ...  MHDY MHSTDY MHENDY  \
21041                           RRMS   


Number of patients

In [26]:
unique_count = medical_history['USUBJID'].nunique()

# Display the number of unique values
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 2465


Check how many missing values we have per column

In [27]:
missing_percentage_mh = (medical_history.isnull().sum() / len(medical_history)) * 100

missing_medical_history = pd.DataFrame({'Column Name': missing_percentage_mh.index, 'Missing Percentage': missing_percentage_mh.values})

#missing_medical_history = missing_medical_history.sort_values(by='Missing Percentage', ascending=False)

print(missing_medical_history)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        MHSEQ            0.000000
4      MHGRPID          100.000000
5      MHREFID          100.000000
6       MHSPID          100.000000
7       MHTERM            0.000000
8     MHMODIFY          100.000000
9        MHLLT           87.608563
10     MHDECOD           40.153131
11       MHCAT           12.391437
12      MHSCAT           69.644218
13     MHPRESP           76.980801
14     MHOCCUR           76.980801
15      MHSTAT          100.000000
16    MHREASND          100.000000
17    MHBODSYS           56.879476
18       MHSOC           84.153588
19       MHLOC          100.000000
20       MHLAT          100.000000
21       MHSEV           81.932805
22      MHPATT           99.169587
23    MHCONTRT           94.853725
24    VISITNUM           20.912692
25       VISIT           20.912692
26       MHDTC          100.000000
27     MHSTDTC      

*to be continued*

### 5. Reproductive System Findings (rp.csv) - [One record per Reproductive System Finding per time point per visit per subject]

In [29]:
# Path to the reproductive system findings file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/rp.csv'

# Read the CSV file into a DataFrame
reproductive_system = pd.read_csv(file_path)

# Sort the DataFrame by the 'USUBJID' column in ascending order
reproductive_system = reproductive_system.sort_values(by='USUBJID', ascending=True)

print("Data frame sorted by 'USUBJID' in ascending order:")
print(reproductive_system)

Data frame sorted by 'USUBJID' in ascending order:
    STUDYID DOMAIN     USUBJID  RPSEQ  RPGRPID  RPREFID  RPSPID  RPTESTCD  \
207   MSOAC     RP  MSOAC/0041      3      NaN      NaN     NaN  PREGTEST   
315   MSOAC     RP  MSOAC/0041      1      NaN      NaN     NaN  PREGTEST   
253   MSOAC     RP  MSOAC/0041      2      NaN      NaN     NaN  PREGTEST   
458   MSOAC     RP  MSOAC/0094      1      NaN      NaN     NaN  PREGTEST   
498   MSOAC     RP  MSOAC/0094      2      NaN      NaN     NaN  PREGTEST   
..      ...    ...         ...    ...      ...      ...     ...       ...   
409   MSOAC     RP  MSOAC/9921      1      NaN      NaN     NaN  PREGTEST   
19    MSOAC     RP  MSOAC/9951      2      NaN      NaN     NaN  PREGTEST   
471   MSOAC     RP  MSOAC/9951      3      NaN      NaN     NaN  PREGTEST   
365   MSOAC     RP  MSOAC/9951      1      NaN      NaN     NaN  PREGTEST   
196   MSOAC     RP  MSOAC/9951      4      NaN      NaN     NaN  PREGTEST   

             RPTEST  RPC

Number of patients

In [30]:
unique_count = reproductive_system['USUBJID'].nunique()

# Display the number of unique values
print(f"The number of unique values in USUBJID: {unique_count}")

The number of unique values in USUBJID: 141


Check how many missing values per column

In [31]:
missing_percentage_rs = (reproductive_system.isnull().sum() / len(reproductive_system)) * 100

missing_reproductive_system = pd.DataFrame({'Column Name': missing_percentage_rs.index, 'Missing Percentage': missing_percentage_rs.values})

#missing_reproductive_system = missing_reproductive_system.sort_values(by='Missing Percentage', ascending=False)

print(missing_reproductive_system)

   Column Name  Missing Percentage
0      STUDYID            0.000000
1       DOMAIN            0.000000
2      USUBJID            0.000000
3        RPSEQ            0.000000
4      RPGRPID          100.000000
5      RPREFID          100.000000
6       RPSPID          100.000000
7     RPTESTCD            0.000000
8       RPTEST            0.000000
9        RPCAT          100.000000
10      RPSCAT          100.000000
11     RPORRES            0.000000
12    RPORRESU          100.000000
13    RPSTRESC            0.000000
14    RPSTRESN          100.000000
15    RPSTRESU          100.000000
16      RPSTAT          100.000000
17    RPREASND          100.000000
18      RPSPEC            0.570342
19      RPBLFL          100.000000
20     RPDRVFL          100.000000
21    VISITNUM            0.000000
22       VISIT            0.000000
23     VISITDY            0.000000
24       EPOCH          100.000000
25        RPDY            0.000000
26       RPTPT          100.000000
27    RPTPTNUM      

Drop columns with no info (100% missing)

In [32]:
columns_to_drop = ['RPGRPID','RPREFID','RPSPID','RPCAT','RPSCAT','RPORRESU','RPSTRESN','RPSTRESU','RPSTAT','RPREASND','RPBLFL','RPDRVFL','EPOCH','RPTPT','RPTPTNUM','RPELTM','RPTPTREF']
reproductive_system = reproductive_system.drop(columns_to_drop, axis=1)
reproductive_system

Unnamed: 0,STUDYID,DOMAIN,USUBJID,RPSEQ,RPTESTCD,RPTEST,RPORRES,RPSTRESC,RPSPEC,VISITNUM,VISIT,VISITDY,RPDY
207,MSOAC,RP,MSOAC/0041,3,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,15.1,EARLY WITHDRAWAL VISIT,365,365
315,MSOAC,RP,MSOAC/0041,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
253,MSOAC,RP,MSOAC/0041,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,6.0,VISIT 6,167,167
458,MSOAC,RP,MSOAC/0094,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,1.0,BASELINE,-28,-28
498,MSOAC,RP,MSOAC/0094,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,MSOAC,RP,MSOAC/9921,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,10.0,VISIT 10,340,340
19,MSOAC,RP,MSOAC/9951,2,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,2.0,VISIT 2,1,1
471,MSOAC,RP,MSOAC/9951,3,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,URINE,6.0,VISIT 6,176,176
365,MSOAC,RP,MSOAC/9951,1,PREGTEST,Pregnancy Test,NEGATIVE,NEGATIVE,SERUM,1.0,BASELINE,-28,-28


*continue*

Notes:
- we have 1658 females in the dataset, but only 141 subjects have records in this file
- column RPSPEC could be imputed (only about 0.57% missing)
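
Both notes can be checked or applied with a few lines. The sketch below uses made-up rows in place of dm.csv and rp.csv; filling RPSPEC with the most frequent value is an assumption about which imputation we would pick, not a decision taken above.

```python
import pandas as pd

# Toy frames (made-up values) standing in for dm.csv and rp.csv
toy_dm = pd.DataFrame({
    "USUBJID": ["S1", "S2", "S3", "S4"],
    "SEX": ["F", "F", "M", "F"],
})
toy_rp = pd.DataFrame({
    "USUBJID": ["S1", "S1", "S4"],
    "RPSPEC": ["URINE", None, "URINE"],
})

# Number of female subjects with at least one record in the RP domain
females = toy_dm.loc[toy_dm["SEX"] == "F", "USUBJID"]
covered = females.isin(toy_rp["USUBJID"])
print(f"{covered.sum()} of {len(females)} female subjects have RP records")

# Mode imputation for the few missing RPSPEC values
most_common = toy_rp["RPSPEC"].mode().iloc[0]
toy_rp["RPSPEC"] = toy_rp["RPSPEC"].fillna(most_common)
```

The same `isin` check could be reused for any other file to see how many of the 2465 subjects in dm.csv it covers.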

### 6. Subject Disease Milestones (sm.csv) - [One record per Disease Milestone per subject]

### (extra) Functional tests (ft.csv) - [One record per functional test per task per repetition per time point per visit per subject]

In [9]:
import pandas as pd

# Path to the functional tests file (adjust to your local copy)
file_path = 'C:/Users/anaso/Desktop/SOFIA MENDES/KU Leuven/Master Thesis/MSOAC Placebo dataset/csv files/ft.csv'

# Read the CSV file into a DataFrame
functional_tests = pd.read_csv(file_path)

# Now you can work with the data in the DataFrame
print(functional_tests)


       STUDYID DOMAIN     USUBJID  FTSEQ   FTGRPID  FTREFID  FTSPID  FTTESTCD  \
0        MSOAC     FT  MSOAC/7115      6   NHPT001      NaN     NaN  NHPT0101   
1        MSOAC     FT  MSOAC/7115      5   NHPT001      NaN     NaN  NHPT0101   
2        MSOAC     FT  MSOAC/7115      3   NHPT001      NaN     NaN  NHPT0101   
3        MSOAC     FT  MSOAC/7115      2   NHPT001      NaN     NaN  NHPT0101   
4        MSOAC     FT  MSOAC/7115     10  T25FW001      NaN     NaN  T25FW101   
...        ...    ...         ...    ...       ...      ...     ...       ...   
241351   MSOAC     FT  MSOAC/6673     78  T25FW008      NaN     NaN  T25FW102   
241352   MSOAC     FT  MSOAC/6720     40  T25FW005      NaN     NaN  T25FW102   
241353   MSOAC     FT  MSOAC/8672     20  T25FW003      NaN     NaN  T25FW102   
241354   MSOAC     FT  MSOAC/9011     53  T25FW006      NaN     NaN  T25FW102   
241355   MSOAC     FT  MSOAC/9336     73  T25FW008      NaN     NaN  T25FW102   

                           

In [10]:
# Sort the DataFrame by the 'USUBJID' column in ascending order
sorted_ft = functional_tests.sort_values(by='USUBJID', ascending=True)

print("Data frame sorted by 'USUBJID' in ascending order:")
print(sorted_ft)

Data frame sorted by 'USUBJID' in ascending order:
       STUDYID DOMAIN     USUBJID  FTSEQ   FTGRPID  FTREFID  FTSPID  FTTESTCD  \
196574   MSOAC     FT  MSOAC/0014     10  PASAT001      NaN     NaN  PASAT101   
232174   MSOAC     FT  MSOAC/0014     11  T25FW002      NaN     NaN  T25FW101   
195151   MSOAC     FT  MSOAC/0014      9   NHPT001      NaN     NaN  NHPT0102   
193354   MSOAC     FT  MSOAC/0014      2  T25FW001      NaN     NaN  T25FW101   
234012   MSOAC     FT  MSOAC/0014     19   NHPT002      NaN     NaN  NHPT0102   
...        ...    ...         ...    ...       ...      ...     ...       ...   
216105   MSOAC     FT  MSOAC/9999     40  PASAT004      NaN     NaN  PASAT101   
239739   MSOAC     FT  MSOAC/9999     56   NHPT006      NaN     NaN  NHPT0102   
208469   MSOAC     FT  MSOAC/9999     24   NHPT003      NaN     NaN  NHPT0101   
239136   MSOAC     FT  MSOAC/9999     52  T25FW006      NaN     NaN  T25FW101   
191531   MSOAC     FT  MSOAC/9999      8   NHPT001      Na

---

# Notes

- we can merge the datasets on the 'USUBJID' column
- in the Clinical Events dataset (ce.csv), the CETERM column is inconsistent: 'MS RELAPSE', 'MS - RELAPSE', 'RELAPSE OF MULTIPLE SCLEROSIS', etc. all mean the same thing
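
A possible way to act on both notes is sketched below: harmonise the CETERM spellings, count relapses per subject, and merge the counts onto the demographics table. Everything here is illustrative: the rows are made up, `clean_term`, `RELAPSE_VARIANTS`, `CETERM_STD` and `N_RELAPSES` are hypothetical names, and the variant set only contains the spellings quoted above, so the real column would need its unique values inspected first.

```python
import re

import pandas as pd

# Toy frames (made-up rows) standing in for ce.csv and dm.csv
toy_ce = pd.DataFrame({
    "USUBJID": ["S1", "S1", "S2", "S3"],
    "CETERM": ["MS RELAPSE", "MS - RELAPSE",
               "RELAPSE OF MULTIPLE SCLEROSIS", "Suspected MS Exacerbation"],
})
toy_dm = pd.DataFrame({"USUBJID": ["S1", "S2", "S3", "S4"],
                       "SEX": ["F", "M", "F", "F"]})


def clean_term(term: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    return re.sub(r"[^A-Z0-9]+", " ", term.upper()).strip()


# Assumed list of cleaned spellings that all denote a relapse
RELAPSE_VARIANTS = {"MS RELAPSE", "RELAPSE OF MULTIPLE SCLEROSIS"}

cleaned = toy_ce["CETERM"].map(clean_term)
toy_ce["CETERM_STD"] = cleaned.where(~cleaned.isin(RELAPSE_VARIANTS), "MS RELAPSE")

# Relapse count per subject, then a left merge so subjects without events are kept
relapse_counts = (
    toy_ce[toy_ce["CETERM_STD"] == "MS RELAPSE"]
    .groupby("USUBJID").size()
    .rename("N_RELAPSES").reset_index()
)
merged = toy_dm.merge(relapse_counts, on="USUBJID", how="left")
merged["N_RELAPSES"] = merged["N_RELAPSES"].fillna(0).astype(int)
print(merged)
```

The left merge keeps one row per subject from dm.csv, which matters because ce.csv only covers 1215 of the 2465 subjects. Whether terms such as 'Suspected MS Exacerbation' should also count as relapses is a clinical decision that this sketch deliberately leaves open.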

#### Datasets by relevance

- High: demographics (dm)
- Medium:
- Low: