# Exploratory Data Analysis

This notebook aims to explore the several csv files of the MSOAC placebo dataset. We will analyse the data and try to find patterns. The csv files that will be explored in this notebook are:

- cm.csv (Concomitant Medications)
- fa.csv (Findings About)
- ft.csv (Functional Tests)
- oe.csv (Ophthalmic Examinations)
- qs.csv (Questionnaires)
- sc.csv (Subject Characteristics)

In [1]:
# Imports

import pandas as pd

### 1. Findings data (fa.csv) [one record per finding per object per time point per time point reference per visit per subject]

This dataset contains findings about **multiple sclerosis disease history**, including whether the subject has experienced at least 1 acute relapse and the number of multiple sclerosis relapses in the past 1, 2, or 3 years or since diagnosis.

In [2]:
# Replace 'your_file.csv' with the path to your CSV file
file_path = 'C:/Users/lenne/Downloads/MSOAC Placebo Data/fa.csv'

# Read the CSV file into a DataFrame
findings = pd.read_csv(file_path)
findings

Unnamed: 0,STUDYID,DOMAIN,USUBJID,FASEQ,FAGRPID,FASPID,FATESTCD,FATEST,FACAT,FASCAT,...,FAMETHOD,FABLFL,FAOBJ,FAEVAL,VISITNUM,VISIT,FADTC,FADY,FAEVLINT,FAEVINTX
0,MSOAC,FAMH,MSOAC/8028,1,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,-P1Y,
1,MSOAC,FAMH,MSOAC/5757,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,V1 - SCREENING,,,-P2Y,
2,MSOAC,FAMH,MSOAC/3737,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,,,,-36.0,-P3Y,
3,MSOAC,FAMH,MSOAC/3673,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,,Since MS Diagnosis
4,MSOAC,FAMH,MSOAC/5603,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,,,,-29.0,-P3Y,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4624,MSOAC,FAMH,MSOAC/5244,1,,,ACUTRLPS,Experienced at Least One Acute Relapse,,,...,,,MS DISEASE HISTORY,,0.0,SCREENING,,-55.0,,SINCE MS DIAGNOSIS
4625,MSOAC,FAMH,MSOAC/0885,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,,Since MS Diagnosis
4626,MSOAC,FAMH,MSOAC/2774,1,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,10.0,SCREENING,,,-P1Y,
4627,MSOAC,FAMH,MSOAC/0521,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,V1 - SCREENING,,,,Since MS Diagnosis


Check how many missing values we have per column.

In [3]:
missing_percentage_findings = (findings.isnull().sum() / len(findings)) * 100
missing_findings = pd.DataFrame({'Column Name': missing_percentage_findings.index, 'Missing Percentage': missing_percentage_findings.values})
missing_findings = missing_findings.sort_values(by='Missing Percentage', ascending=False)
print(missing_findings)

   Column Name  Missing Percentage
14    FASTRESU          100.000000
15      FASTAT          100.000000
25       FADTC          100.000000
22      FAEVAL          100.000000
4      FAGRPID          100.000000
5       FASPID          100.000000
20      FABLFL          100.000000
19    FAMETHOD          100.000000
8        FACAT          100.000000
9       FASCAT          100.000000
18       FALAT          100.000000
11    FAORRESU          100.000000
17       FALOC          100.000000
16    FAREASND          100.000000
28    FAEVINTX           76.928062
26        FADY           71.894578
27    FAEVLINT           23.071938
23    VISITNUM           21.602938
24       VISIT           21.602938
13    FASTRESN            6.588896
1       DOMAIN            0.000000
12    FASTRESC            0.000000
10     FAORRES            0.000000
7       FATEST            0.000000
6     FATESTCD            0.000000
21       FAOBJ            0.000000
3        FASEQ            0.000000
2      USUBJID      

We will drop the columns with more than 85% missing values.

In [4]:
# Set the threshold for missing percentage
threshold = 85

# Filter columns based on missing percentage
columns_to_drop = missing_findings[missing_findings['Missing Percentage'] >= threshold]['Column Name']

# Drop columns from the DataFrame
findings = findings.drop(columns=columns_to_drop)

In [5]:
# the column studyid is redundant so we remove it
studyid_values = findings['STUDYID'].unique()
print(studyid_values)

# the column domain is also redundant so we remove it
studyid_values = findings['DOMAIN'].unique()
print(studyid_values)

# the column faobj is also redundant so we remove it
studyid_values = findings['FAOBJ'].unique()
print(studyid_values)

['MSOAC']
['FAMH']


In [6]:
findings = findings.drop(columns=['STUDYID', 'FAMH', 'FAOBJ'])
findings.sort_values(by='USUBJID', inplace=True)
findings

Unnamed: 0,DOMAIN,USUBJID,FASEQ,FATESTCD,FATEST,FAORRES,FASTRESC,FASTRESN,FAOBJ,VISITNUM,VISIT,FADY,FAEVLINT,FAEVINTX
4602,FAMH,MSOAC/0014,3,NUMRLPS,Number of MS Relapses,3,3,3.0,MS DISEASE HISTORY,1.0,SCREENING D-28 TO -2,,,Since MS Diagnosis
1289,FAMH,MSOAC/0014,1,NUMRLPS,Number of MS Relapses,1,1,1.0,MS DISEASE HISTORY,1.0,SCREENING D-28 TO -2,,-P1Y,
1616,FAMH,MSOAC/0014,2,NUMRLPS,Number of MS Relapses,2,2,2.0,MS DISEASE HISTORY,1.0,SCREENING D-28 TO -2,,-P2Y,
1912,FAMH,MSOAC/0024,1,NUMRLPS,Number of MS Relapses,00,00,0.0,MS DISEASE HISTORY,-1.0,PRIOR TO RANDOMIZATION,,-P1Y,
1887,FAMH,MSOAC/0024,2,NUMRLPS,Number of MS Relapses,00,00,0.0,MS DISEASE HISTORY,-1.0,PRIOR TO RANDOMIZATION,,-P3Y,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1021,FAMH,MSOAC/9995,2,NUMRLPS,Number of MS Relapses,3,3,3.0,MS DISEASE HISTORY,1.0,V1 - SCREENING,,-P2Y,
1293,FAMH,MSOAC/9995,1,NUMRLPS,Number of MS Relapses,2,2,2.0,MS DISEASE HISTORY,1.0,V1 - SCREENING,,-P1Y,
4213,FAMH,MSOAC/9999,2,NUMRLPS,Number of MS Relapses,2,2,2.0,MS DISEASE HISTORY,1.0,SCREENING D-28 TO -2,,-P2Y,
2166,FAMH,MSOAC/9999,3,NUMRLPS,Number of MS Relapses,10,10,10.0,MS DISEASE HISTORY,1.0,SCREENING D-28 TO -2,,,Since MS Diagnosis


For how many patients do we have data about findings?

In [7]:
unique_usubjid_count = findings['USUBJID'].nunique()
print("Number of unique values in USUBJID:", unique_usubjid_count)

Number of unique values in USUBJID: 2086


**Note that we don't have data on all the 2465 patients in this dataset!**

Which columns are numerical and categorical? (see also SDTM fields in Data Dictionary!)

In [13]:
# Create an empty list to store column types
column_types = []

# Iterate through each column
for column, dtype in findings.dtypes.items():
    # Categorize columns
    if dtype == 'object':
        column_type = 'categorical'
    elif dtype in ['int64', 'float64']:
        column_type = 'numeric'
    else:
        column_type = 'other'

    # Append to the list
    column_types.append({'Column': column, 'Type': column_type})

# Create a DataFrame from the list
column_types_df = pd.DataFrame(column_types)
column_types_df.sort_values(by='Type', inplace=True)

# Display the resulting DataFrame
column_types_df


Unnamed: 0,Column,Type
0,DOMAIN,categorical
1,USUBJID,categorical
3,FATESTCD,categorical
4,FATEST,categorical
5,FAORRES,categorical
6,FASTRESC,categorical
8,FAOBJ,categorical
10,VISIT,categorical
12,FAEVLINT,categorical
13,FAEVINTX,categorical


- Descriptive statistics for categorical variables

In [15]:
categorical_columns = ['FATESTCD', 'FATEST', 'FAORRES', 'FASTRESC', 'VISIT', 'FAEVLINT', 'FAEVINTX']

descriptive_categorical = {}
for col in categorical_columns:
    descriptive_categorical[col] = {
        'Count': findings[col].count(),
        'Missing Cases': findings[col].isna().sum(),
        'Unique Values': findings[col].nunique(),
        'Mode': findings[col].mode().values[0],
        'Mode Frequency': findings[col].value_counts().max()
    }

cat_findings = pd.DataFrame(descriptive_categorical).T
cat_findings

Unnamed: 0,Count,Missing Cases,Unique Values,Mode,Mode Frequency
FATESTCD,4629,0,2,NUMRLPS,4324
FATEST,4629,0,2,Number of MS Relapses,4324
FAORRES,4629,0,35,2,1371
FASTRESC,4629,0,35,2,1371
FAOBJ,4629,0,1,MS DISEASE HISTORY,4629
VISIT,3629,1000,4,V1 - SCREENING,1248
FAEVLINT,3561,1068,3,-P1Y,1781
FAEVINTX,1068,3561,2,Since MS Diagnosis,763
