# Exploratory Data Analysis

This notebook explores several CSV files from the MSOAC placebo dataset, analysing the data and looking for patterns. The CSV files explored in this notebook are:

- cm.csv (Concomitant Medications)
- fa.csv (Findings About)
- ft.csv (Functional Tests)
- oe.csv (Ophthalmic Examinations)
- qs.csv (Questionnaires)
- sc.csv (Subject Characteristics)

In [127]:
# Imports

import pandas as pd

### 1. Findings about MS disease history data (fa.csv) [one record per finding per object per time point per time point reference per visit per subject]

This dataset contains findings about **multiple sclerosis disease history**, including whether the subject has experienced at least 1 acute relapse and the number of multiple sclerosis relapses in the past 1, 2, or 3 years or since diagnosis.

In [128]:
# Path to the Findings About CSV file (adjust to your local copy)
file_path = 'C:/Users/lenne/Downloads/MSOAC Placebo Data/fa.csv'

# Read the CSV file into a DataFrame
findings = pd.read_csv(file_path)
findings

Unnamed: 0,STUDYID,DOMAIN,USUBJID,FASEQ,FAGRPID,FASPID,FATESTCD,FATEST,FACAT,FASCAT,...,FAMETHOD,FABLFL,FAOBJ,FAEVAL,VISITNUM,VISIT,FADTC,FADY,FAEVLINT,FAEVINTX
0,MSOAC,FAMH,MSOAC/8028,1,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,-P1Y,
1,MSOAC,FAMH,MSOAC/5757,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,V1 - SCREENING,,,-P2Y,
2,MSOAC,FAMH,MSOAC/3737,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,,,,-36.0,-P3Y,
3,MSOAC,FAMH,MSOAC/3673,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,,Since MS Diagnosis
4,MSOAC,FAMH,MSOAC/5603,2,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,,,,-29.0,-P3Y,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4624,MSOAC,FAMH,MSOAC/5244,1,,,ACUTRLPS,Experienced at Least One Acute Relapse,,,...,,,MS DISEASE HISTORY,,0.0,SCREENING,,-55.0,,SINCE MS DIAGNOSIS
4625,MSOAC,FAMH,MSOAC/0885,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,SCREENING D-28 TO -2,,,,Since MS Diagnosis
4626,MSOAC,FAMH,MSOAC/2774,1,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,10.0,SCREENING,,,-P1Y,
4627,MSOAC,FAMH,MSOAC/0521,3,,,NUMRLPS,Number of MS Relapses,,,...,,,MS DISEASE HISTORY,,1.0,V1 - SCREENING,,,,Since MS Diagnosis


Check how many missing values we have per column.

In [129]:
missing_percentage_findings = (findings.isnull().sum() / len(findings)) * 100
missing_findings = pd.DataFrame({'Column Name': missing_percentage_findings.index, 'Missing Percentage': missing_percentage_findings.values})
missing_findings = missing_findings.sort_values(by='Missing Percentage', ascending=False)
print(missing_findings)

   Column Name  Missing Percentage
14    FASTRESU          100.000000
15      FASTAT          100.000000
25       FADTC          100.000000
22      FAEVAL          100.000000
4      FAGRPID          100.000000
5       FASPID          100.000000
20      FABLFL          100.000000
19    FAMETHOD          100.000000
8        FACAT          100.000000
9       FASCAT          100.000000
18       FALAT          100.000000
11    FAORRESU          100.000000
17       FALOC          100.000000
16    FAREASND          100.000000
28    FAEVINTX           76.928062
26        FADY           71.894578
27    FAEVLINT           23.071938
23    VISITNUM           21.602938
24       VISIT           21.602938
13    FASTRESN            6.588896
1       DOMAIN            0.000000
12    FASTRESC            0.000000
10     FAORRES            0.000000
7       FATEST            0.000000
6     FATESTCD            0.000000
21       FAOBJ            0.000000
3        FASEQ            0.000000
2      USUBJID      

We will drop the columns with at least 80% missing values.

In [130]:
# Set the threshold for missing percentage
threshold = 80

# Filter columns based on missing percentage
columns_to_drop = missing_findings[missing_findings['Missing Percentage'] >= threshold]['Column Name']

# Drop columns from the DataFrame
findings = findings.drop(columns=columns_to_drop)
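
Since the same missing-value filter will probably be needed for the other CSV files, the two cells above could be folded into one helper. This is only a sketch: the function name `drop_sparse_columns` and the toy data are made up for illustration, and the threshold uses `>=` like the cell above.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 80.0) -> pd.DataFrame:
    """Return a copy of df without columns whose missing percentage is >= threshold."""
    # isna().mean() gives the fraction of missing values per column
    missing_pct = df.isna().mean() * 100
    sparse = missing_pct[missing_pct >= threshold].index
    return df.drop(columns=sparse)


# Toy data (hypothetical values, only the column names come from fa.csv)
toy = pd.DataFrame({
    "USUBJID": ["A", "B", "C", "D", "E"],
    "FADTC": [None] * 5,                                    # 100% missing
    "FAEVINTX": [None, "SINCE MS DIAGNOSIS", None, None,
                 "SINCE MS DIAGNOSIS"],                     # 60% missing
    "VISITNUM": [1.0, None, 1.0, 1.0, 1.0],                 # 20% missing
})

reduced = drop_sparse_columns(toy, threshold=80)
print(list(reduced.columns))
```

Returning a new DataFrame instead of modifying in place keeps the original available for comparison, which can be handy while the threshold is still being tuned.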

In [131]:
# STUDYID holds a single value, so the column is redundant and will be removed
studyid_values = findings['STUDYID'].unique()
print(studyid_values)

# DOMAIN also holds a single value, so it will be removed as well
domain_values = findings['DOMAIN'].unique()
print(domain_values)

# FAOBJ also holds a single value, so it will be removed as well
faobj_values = findings['FAOBJ'].unique()
print(faobj_values)

# FATESTCD and FATEST carry the same information, so one of the two will be removed

['MSOAC']
['FAMH']
['MS DISEASE HISTORY']
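
The redundancy checks above can also be done programmatically, which may scale better to the wider CSV files. The sketch below uses a small made-up frame (the IDs and labels are taken from the table shown earlier); the variable names `constant_cols` and `one_to_one` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "STUDYID": ["MSOAC", "MSOAC", "MSOAC"],
    "DOMAIN": ["FAMH", "FAMH", "FAMH"],
    "USUBJID": ["MSOAC/8028", "MSOAC/5757", "MSOAC/5244"],
    "FATESTCD": ["NUMRLPS", "NUMRLPS", "ACUTRLPS"],
    "FATEST": [
        "Number of MS Relapses",
        "Number of MS Relapses",
        "Experienced at Least One Acute Relapse",
    ],
})

# Columns with exactly one distinct value (missing values counted as a value)
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)

# FATESTCD and FATEST are interchangeable if every code maps to exactly one
# label and every label maps to exactly one code
pairs = toy[["FATESTCD", "FATEST"]].drop_duplicates()
one_to_one = pairs["FATESTCD"].is_unique and pairs["FATEST"].is_unique
print(one_to_one)
```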


In [132]:
findings = findings.drop(columns=['STUDYID', 'DOMAIN', 'FAOBJ', 'FATESTCD'])
findings.sort_values(by=['USUBJID', 'FASEQ'], inplace=True)
findings

Unnamed: 0,USUBJID,FASEQ,FATEST,FAORRES,FASTRESC,FASTRESN,VISITNUM,VISIT,FADY,FAEVLINT,FAEVINTX
1289,MSOAC/0014,1,Number of MS Relapses,1,1,1.0,1.0,SCREENING D-28 TO -2,,-P1Y,
1616,MSOAC/0014,2,Number of MS Relapses,2,2,2.0,1.0,SCREENING D-28 TO -2,,-P2Y,
4602,MSOAC/0014,3,Number of MS Relapses,3,3,3.0,1.0,SCREENING D-28 TO -2,,,Since MS Diagnosis
1912,MSOAC/0024,1,Number of MS Relapses,00,00,0.0,-1.0,PRIOR TO RANDOMIZATION,,-P1Y,
1887,MSOAC/0024,2,Number of MS Relapses,00,00,0.0,-1.0,PRIOR TO RANDOMIZATION,,-P3Y,
...,...,...,...,...,...,...,...,...,...,...,...
1021,MSOAC/9995,2,Number of MS Relapses,3,3,3.0,1.0,V1 - SCREENING,,-P2Y,
2750,MSOAC/9995,3,Number of MS Relapses,7,7,7.0,1.0,V1 - SCREENING,,,Since MS Diagnosis
1340,MSOAC/9999,1,Number of MS Relapses,1,1,1.0,1.0,SCREENING D-28 TO -2,,-P1Y,
4213,MSOAC/9999,2,Number of MS Relapses,2,2,2.0,1.0,SCREENING D-28 TO -2,,-P2Y,


For how many patients do we have data about MS disease history?

In [133]:
unique_usubjid_count = findings['USUBJID'].nunique()
print("Number of unique values in USUBJID:", unique_usubjid_count)

Number of unique values in USUBJID: 2086


**Note that we don't have data on all the 2465 patients in this dataset!**

In [134]:
# 'Since MS Diagnosis' and 'SINCE MS DIAGNOSIS' mean the same thing, so convert the 'FAEVINTX' column to uppercase
findings['FAEVINTX'] = findings['FAEVINTX'].str.upper()

Which columns are numerical and which are categorical? (see also SDTM fields in Data Dictionary!)

In [135]:
# Create an empty list to store column types
column_types = []

# Iterate through each column
for column, dtype in findings.dtypes.items():
    # Categorize columns
    if dtype == 'object':
        column_type = 'categorical'
    elif dtype in ['int64', 'float64']:
        column_type = 'numeric'
    else:
        column_type = 'other'

    # Append to the list
    column_types.append({'Column': column, 'Type': column_type})

# Create a DataFrame from the list
column_types_df = pd.DataFrame(column_types)

# Display the resulting DataFrame
column_types_df


Unnamed: 0,Column,Type
0,USUBJID,categorical
1,FASEQ,numeric
2,FATEST,categorical
3,FAORRES,categorical
4,FASTRESC,categorical
5,FASTRESN,numeric
6,VISITNUM,numeric
7,VISIT,categorical
8,FADY,numeric
9,FAEVLINT,categorical
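
As an alternative to comparing dtype names by hand, pandas can split the columns directly with `select_dtypes`. A minimal sketch on made-up rows follows; treating every non-numeric column as categorical is an assumption that fits this dataset but may not hold for date columns elsewhere.

```python
import pandas as pd

toy = pd.DataFrame({
    "USUBJID": ["MSOAC/0014", "MSOAC/0024"],
    "FASEQ": [1, 2],
    "FASTRESN": [1.0, 0.0],
    "VISIT": ["SCREENING D-28 TO -2", "PRIOR TO RANDOMIZATION"],
})

# "number" covers all integer and float dtypes
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here (an assumption)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

This avoids depending on the exact name of the string dtype, which can differ between pandas versions.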


- Descriptive statistics for numeric variables

In [136]:
numeric_columns = ['FASEQ', 'FASTRESN', 'VISITNUM', 'FADY']

descriptive_numeric = {
    'Count': findings[numeric_columns].count(), #cases that are not missing
    'Missing Cases': findings[numeric_columns].isna().sum(),
    'Median': findings[numeric_columns].median(),
    'Standard Deviation': findings[numeric_columns].std()
}

num_findings = pd.DataFrame(descriptive_numeric)
num_findings 

Unnamed: 0,Count,Missing Cases,Median,Standard Deviation
FASEQ,4629,0,2.0,0.73045
FASTRESN,4324,305,2.0,2.454083
VISITNUM,3629,1000,1.0,3.507341
FADY,1301,3328,-32.0,14.244152


- Descriptive statistics for categorical variables

In [137]:
categorical_columns = ['FATEST', 'FAORRES', 'FASTRESC', 'VISIT', 'FAEVLINT', 'FAEVINTX']

descriptive_categorical = {}
for col in categorical_columns:
    descriptive_categorical[col] = {
        'Count': findings[col].count(),
        'Missing Cases': findings[col].isna().sum(),
        'Unique Values': findings[col].nunique(),
        'Mode': findings[col].mode().values[0],
        'Mode Frequency': findings[col].value_counts().max()
    }

cat_findings = pd.DataFrame(descriptive_categorical).T
cat_findings

Unnamed: 0,Count,Missing Cases,Unique Values,Mode,Mode Frequency
FATEST,4629,0,2,Number of MS Relapses,4324
FAORRES,4629,0,35,2,1371
FASTRESC,4629,0,35,2,1371
VISIT,3629,1000,4,V1 - SCREENING,1248
FAEVLINT,3561,1068,3,-P1Y,1781
FAEVINTX,1068,3561,1,SINCE MS DIAGNOSIS,1068


- Number of observations for each FATEST (Findings About Test Name) category

In [138]:
FATEST_counts = findings['FATEST'].value_counts().reset_index()
FATEST_counts.columns = ['FATEST', 'Count']
FATEST_counts

Unnamed: 0,FATEST,Count
0,Number of MS Relapses,4324
1,Experienced at Least One Acute Relapse,305


- Number of observations for each FAORRES (Result or Finding in Original Units) category

In [139]:
FAORRES_counts = findings['FAORRES'].value_counts().reset_index()
FAORRES_counts.columns = ['FAORRES', 'Count']
FAORRES_counts

# contains both yes/no and numbers, should be looked at together with FATEST!

Unnamed: 0,FAORRES,Count
0,2,1371
1,1,1329
2,3,518
3,YES,303
4,4,238
5,00,230
6,5,125
7,01,118
8,0,64
9,6,63


In [140]:
# Create a cross-tabulation for counts
counts_table = pd.crosstab(index=findings['FATEST'], columns=findings['FAORRES'], margins=True, margins_name='Total')

# Display the counts table
counts_table

FATEST / FAORRES,0,00,01,02,03,04,05,06,07,1,...,4,5,50,6,7,8,9,NO,YES,Total
Experienced at Least One Acute Relapse,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,303,305
Number of MS Relapses,64,230,118,42,24,8,1,5,2,1329,...,238,125,1,63,52,28,14,0,0,4324
Total,64,230,118,42,24,8,1,5,2,1329,...,238,125,1,63,52,28,14,2,303,4629


- Number of observations for each FASTRESC (Character Result/Finding in Std Format) category

Same values as FAORRES, except that YES/NO are coded as Y/N. The zero-padded numeric strings (e.g. 00 and 01) appear unchanged.

In [141]:
FASTRESC_counts = findings['FASTRESC'].value_counts().reset_index()
FASTRESC_counts.columns = ['FASTRESC', 'Count']
FASTRESC_counts

# contains both yes/no and numbers - look at it together with FATEST!

Unnamed: 0,FASTRESC,Count
0,2,1371
1,1,1329
2,3,518
3,Y,303
4,4,238
5,00,230
6,5,125
7,01,118
8,0,64
9,6,63


In [142]:
# Create a cross-tabulation for counts
counts_table = pd.crosstab(index=findings['FATEST'], columns=findings['FASTRESC'], margins=True, margins_name='Total')

# Display the counts table
counts_table

FATEST / FASTRESC,0,00,01,02,03,04,05,06,07,1,...,4,5,50,6,7,8,9,N,Y,Total
Experienced at Least One Acute Relapse,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,303,305
Number of MS Relapses,64,230,118,42,24,8,1,5,2,1329,...,238,125,1,63,52,28,14,0,0,4324
Total,64,230,118,42,24,8,1,5,2,1329,...,238,125,1,63,52,28,14,2,303,4629
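
The cross-tabulations show that the result column mixes relapse counts (with zero-padded duplicates such as 0 and 00) and yes/no answers, depending on FATEST. One possible way to untangle this is sketched below on made-up rows; the new column names `RELAPSES` and `ANY_RELAPSE` are hypothetical. Note that FASTRESN may already hold the numeric form: its 305 missing cases match the 305 yes/no rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "FATEST": ["Number of MS Relapses"] * 4
              + ["Experienced at Least One Acute Relapse"] * 2,
    "FAORRES": ["0", "00", "01", "1", "YES", "NO"],
})

is_count = toy["FATEST"] == "Number of MS Relapses"

# Numeric relapse count: "00" and "0" collapse to 0.0, "01" and "1" to 1.0;
# the yes/no rows are masked out first and stay missing
toy["RELAPSES"] = pd.to_numeric(toy["FAORRES"].where(is_count), errors="coerce")

# Boolean flag for the yes/no test; the count rows stay missing
toy["ANY_RELAPSE"] = toy["FAORRES"].where(~is_count).map({"YES": True, "NO": False})

print(toy)
```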


- Number of observations for each VISIT (Visit Name) category

VISIT has 1000 missing cases.

In [143]:
VISIT_counts = findings['VISIT'].value_counts().reset_index()
VISIT_counts.columns = ['VISIT', 'Count']
VISIT_counts

Unnamed: 0,VISIT,Count
0,V1 - SCREENING,1248
1,SCREENING D-28 TO -2,1060
2,SCREENING,887
3,PRIOR TO RANDOMIZATION,434


- Number of observations for each FAEVLINT (Evaluation Interval) category

Indicates the evaluation time period of the test in ISO 8601 format. For example, FAEVLINT="-P1Y" indicates an evaluation interval of the past 1 year relative to the visit day.

In [144]:
FAEVLINT_counts = findings['FAEVLINT'].value_counts().reset_index()
FAEVLINT_counts.columns = ['FAEVLINT', 'Count']
print(FAEVLINT_counts)

  FAEVLINT  Count
0     -P1Y   1781
1     -P3Y   1008
2     -P2Y    772


- Number of observations for each FAEVINTX (Evaluation Interval Text) category

Indicates the evaluation time period of the test when it cannot be expressed in ISO 8601 format.

In [145]:
FAEVINTX_counts = findings['FAEVINTX'].value_counts().reset_index()
FAEVINTX_counts.columns = ['FAEVINTX', 'Count']
print(FAEVINTX_counts)

# FAEVINTX should be combined with FAEVLINT into a single evaluation-interval column

             FAEVINTX  Count
0  SINCE MS DIAGNOSIS   1068


#### *Questions and ideas*:
- **This dataset is not straightforward to interpret**
- Make FAEVLINT a numerical column, but how should it be combined with FAEVINTX? Whenever one of the two is missing, the other is present (so together the two columns have 0% missing).
- How to interpret VISIT?
- What to do with FATEST, FAORRES & FASTRESC?
- How to interpret FADY? Why is it negative so often?
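
One idea for making the relapse counts easier to work with is to reshape them to one row per subject, with one column per evaluation interval. The sketch below uses the rows for MSOAC/0014 and MSOAC/0024 from the sorted table above; the `INTERVAL` column and the choice of `aggfunc="first"` are assumptions, and `first` would silently pick one value if a subject had several records for the same interval.

```python
import pandas as pd

toy = pd.DataFrame({
    "USUBJID": ["MSOAC/0014"] * 3 + ["MSOAC/0024"] * 2,
    "FASTRESN": [1.0, 2.0, 3.0, 0.0, 0.0],
    "FAEVLINT": ["-P1Y", "-P2Y", None, "-P1Y", "-P3Y"],
    "FAEVINTX": [None, None, "SINCE MS DIAGNOSIS", None, None],
})

# Single interval label per row (ISO duration, else the free-text interval)
toy["INTERVAL"] = toy["FAEVLINT"].fillna(toy["FAEVINTX"])

# One row per subject, one column per interval
wide = toy.pivot_table(
    index="USUBJID", columns="INTERVAL", values="FASTRESN", aggfunc="first"
)
print(wide)
```

Intervals that were not recorded for a subject show up as missing values, which makes the differences between the underlying studies visible at a glance.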

### 2. Functional tests (ft.csv) [one record per functional test per task per repetition per time point per visit per subject]

This dataset contains info on Timed 25-Foot Walk (T25FW), Nine Hole Peg Test (NHPT), Paced Auditory Serial Addition Test (PASAT), and Symbol Digit Modalities Test (SDMT).

In [146]:
# Path to the Functional Tests CSV file (adjust to your local copy)
file_path = 'C:/Users/lenne/Downloads/MSOAC Placebo Data/ft.csv'

# Read the CSV file into a DataFrame
ftests = pd.read_csv(file_path)
ftests


Unnamed: 0,STUDYID,DOMAIN,USUBJID,FTSEQ,FTGRPID,FTREFID,FTSPID,FTTESTCD,FTTEST,FTTSTDTL,...,VISIT,VISITDY,FTDTC,FTDY,FTTPT,FTTPTNUM,FTELTM,FTTPTREF,FTRFTDTC,FTREPNUM
0,MSOAC,FT,MSOAC/7115,6,NHPT001,,,NHPT0101,NHPT01-Time to Complete 9-Hole Peg Test,,...,SCREENING,,,-50.0,PRACTICE TEST 1,10.0,,,,2.0
1,MSOAC,FT,MSOAC/7115,5,NHPT001,,,NHPT0101,NHPT01-Time to Complete 9-Hole Peg Test,,...,SCREENING,,,-50.0,PRACTICE TEST 1,10.0,,,,1.0
2,MSOAC,FT,MSOAC/7115,3,NHPT001,,,NHPT0101,NHPT01-Time to Complete 9-Hole Peg Test,,...,SCREENING,,,-50.0,PRACTICE TEST 1,10.0,,,,2.0
3,MSOAC,FT,MSOAC/7115,2,NHPT001,,,NHPT0101,NHPT01-Time to Complete 9-Hole Peg Test,,...,SCREENING,,,-50.0,PRACTICE TEST 1,10.0,,,,1.0
4,MSOAC,FT,MSOAC/7115,10,T25FW001,,,T25FW101,T25FW1-Time to Complete 25-Foot Walk,,...,SCREENING,,,-50.0,PRACTICE TEST 1,10.0,,,,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241351,MSOAC,FT,MSOAC/6673,78,T25FW008,,,T25FW102,T25FW1-More Than Two Attempts,,...,UNSCHEDULED,,,,,,,,,
241352,MSOAC,FT,MSOAC/6720,40,T25FW005,,,T25FW102,T25FW1-More Than Two Attempts,,...,UNSCHEDULED,,,,,,,,,
241353,MSOAC,FT,MSOAC/8672,20,T25FW003,,,T25FW102,T25FW1-More Than Two Attempts,,...,UNSCHEDULED,,,,,,,,,
241354,MSOAC,FT,MSOAC/9011,53,T25FW006,,,T25FW102,T25FW1-More Than Two Attempts,,...,UNSCHEDULED,,,,,,,,,
