# 03 — Data Cleaning
**ClinicalTrials.gov Dataset**  
*Author: John Seaton*  
*Last updated: 2025-12-10*

---

## 1. Purpose of This Notebook

This notebook performs a complete, systematic cleaning of the ClinicalTrials.gov dataset, preparing it for downstream analysis and long-term reuse. 

Goals:  
- Standardize critical fields such as trial phases, date columns, interventions, and enrollment values
- Resolve inconsistencies arising from formatting differences, categorical variants, and grouped values
- Handle missing, irregular, or redundant data using transparent and reproducible transformations
- Engineer derived columns that improve analytical clarity and support further exploratory or modeling efforts
- Generate a fully cleaned, versioned dataset (CSV or Parquet) for use in subsequent notebooks and pipeline stages

All transformations in this notebook are documented, reproducible, and reversible.
While a new processed dataset is created through this notebook, the original raw dataset remains intact.

## 2. Imports and Settings

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')

## 3. Load Raw Data

In [28]:
df = pd.read_csv("../data/raw/ClinicalTrials/ctg-studies.csv")
df.shape

(530028, 22)

## 4. High-Level DataFrame Exploration

In [12]:
df.dtypes

NCT Number                     object
Study Title                    object
Study URL                      object
Study Status                   object
Conditions                     object
Interventions                  object
Primary Outcome Measures       object
Secondary Outcome Measures     object
Sponsor                        object
Collaborators                  object
Age                            object
Phases                         object
Enrollment                    float64
Funder Type                    object
Study Type                     object
Study Design                   object
Start Date                     object
Primary Completion Date        object
Completion Date                object
First Posted                   object
Locations                      object
Study Documents                object
dtype: object

In [13]:
# Show column names
df.columns.tolist()


['NCT Number',
 'Study Title',
 'Study URL',
 'Study Status',
 'Conditions',
 'Interventions',
 'Primary Outcome Measures',
 'Secondary Outcome Measures',
 'Sponsor',
 'Collaborators',
 'Age',
 'Phases',
 'Enrollment',
 'Funder Type',
 'Study Type',
 'Study Design',
 'Start Date',
 'Primary Completion Date',
 'Completion Date',
 'First Posted',
 'Locations',
 'Study Documents']

In [15]:
# High-level missingness snapshot
df.isna().mean().sort_values(ascending=False).head(15)

Study Documents               0.909578
Collaborators                 0.675219
Phases                        0.614666
Secondary Outcome Measures    0.269754
Interventions                 0.098470
Locations                     0.081267
Primary Completion Date       0.032085
Primary Outcome Measures      0.025414
Completion Date               0.023299
Enrollment                    0.007309
Conditions                    0.000006
Study Status                  0.000000
NCT Number                    0.000000
Study Title                   0.000000
Study URL                     0.000000
dtype: float64

The columns containing missing data will be addressed in the following ways:

- Study Documents (90.9% missing): Drop column. Not needed for analysis.
- Collaborators (67.5% missing): Drop column. Duplicative with Sponsor/Funder Type.
- Phases (61.5% missing): Keep column, leave missing values as-is. Missingness is expected for observational studies.
- Primary Outcome Measures (2.5% missing) and Secondary Outcome Measures (27.0% missing): Keep columns, leave missing values as-is. Helpful but not essential data.
- Interventions (9.8% missing): Keep column. Investigate missing cases during cleaning.
- Locations (8.1% missing): Keep column. Investigate missing cases during cleaning.
- Primary Completion Date (3.2% missing) and Completion Date (2.3% missing): Keep columns. Missingness is attributed to ongoing trials.
- Enrollment (0.7% missing): Keep column. Investigate missing cases during cleaning.
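
To support the "investigate missing cases" items above, it can help to see absolute counts next to the percentages. The following is a minimal sketch, not part of the original notebook: the helper name `missingness_report` and the toy frame are assumptions made for illustration, and the toy values do not come from the dataset.

```python
import numpy as np
import pandas as pd

def missingness_report(frame):
    """Return per-column missing counts and percentages, most-missing first.

    Hypothetical helper for illustration only.
    """
    report = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })
    return report.sort_values("missing_pct", ascending=False)

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Phases": ["PHASE1", np.nan, np.nan, "PHASE2"],
    "Enrollment": [30.0, 54.0, np.nan, 58.0],
    "Sponsor": ["A", "B", "C", "D"],
})
report = missingness_report(toy)
print(report)
```

In the notebook itself, the same function could be called as `missingness_report(df)`.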

In [14]:
# Small data sample
df.sample(3)

Unnamed: 0,NCT Number,Study Title,Study URL,Study Status,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,...,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,First Posted,Locations,Study Documents
157186,NCT03309410,Mathematical Arterialization of Venous Blood Gas,https://clinicaltrials.gov/study/NCT03309410,COMPLETED,Matched-Pair Analysis|Blood Gas Analysis|Emerg...,DIAGNOSTIC_TEST: Venous to arterial conversion...,"Lin's Concordance correlation coefficient, Com...","Hemoglobin concentration, Comparison of venous...",Aalborg University,"Department of Anesthesiology, North Denmark Re...",...,30.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,2015-09-01,2016-01-30,2017-10-01,2017-10-13,,
133809,NCT04778410,Study of Magrolimab Combinations in Participan...,https://clinicaltrials.gov/study/NCT04778410,TERMINATED,Myeloid Malignancies,DRUG: Magrolimab|DRUG: Azacitidine|DRUG: Venet...,Complete Remission (CR) Rate (Cohorts 1 and 2)...,"Overall Response Rate (ORR), ORR was the perce...",Gilead Sciences,,...,54.0,INDUSTRY,INTERVENTIONAL,Allocation: NON_RANDOMIZED|Intervention Model:...,2021-06-28,2024-03-04,2024-03-04,2021-03-03,"University of Alabama at Birmingham, Birmingha...","Study Protocol, https://cdn.clinicaltrials.gov..."
24323,NCT02841891,A Clinical Study Comparing Standard Anastomosi...,https://clinicaltrials.gov/study/NCT02841891,TERMINATED,Colorectal and Ileorectal Anastomosis|Colocoli...,DEVICE: Sylys® Surgical Sealant|PROCEDURE: Sta...,Safety of Sylys® Surgical Sealant: Number of A...,"Sealant Application Evaluation Questionnaire, ...","Cohera Medical, Inc.",,...,58.0,INDUSTRY,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2016-07,2018-08-13,2018-08-13,2016-07-22,"University of Alabama at Birmingham, Birmingha...",


## 5. Cleaning Overview

This notebook applies a series of cleaning steps to standardize the ClinicalTrials.gov dataframe for analysis. Key transformations include: converting all date fields to proper datetime format, normalizing categorical fields such as trial phases and sponsor information, parsing and classifying interventions, cleaning and structuring the conditions and locations fields, verifying enrollment values, removing redundant or irrelevant columns, dropping rows that are not pertinent to the analysis of drug development pipelines, and engineering derived fields such as trial duration and primary intervention class. Each transformation or cleaning step is applied sequentially below and documented for reproducibility.
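
As a rough illustration of one derived field named above, trial duration, the sketch below computes a day count from two datetime columns and applies it with `DataFrame.pipe` so that steps can be chained sequentially. The function name `add_trial_duration`, the output column name `Trial Duration (days)`, and the choice of `Completion Date` as the end point are all assumptions for this example, not the notebook's confirmed definitions.

```python
import pandas as pd

def add_trial_duration(frame, start="Start Date", end="Completion Date"):
    """Add a day-count column; rows with a missing date get NaN.

    Hypothetical step: column names and the end-date choice are assumptions.
    """
    out = frame.copy()
    out["Trial Duration (days)"] = (out[end] - out[start]).dt.days
    return out

# Toy frame with already-parsed datetime columns (illustrative values only)
toy = pd.DataFrame({
    "Start Date": pd.to_datetime(["2021-06-28", "2015-09-01"]),
    "Completion Date": pd.to_datetime(["2024-03-04", None]),
})

# .pipe lets each cleaning step be applied in sequence on the running frame
result = toy.pipe(add_trial_duration)
print(result)
```

Because the step returns a copy, the input frame is left unchanged, which keeps each stage easy to rerun in isolation.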

## 6. Standardize Date Columns

In [None]:
# Convert all four date columns from object type to datetime format

date_cols = ['Start Date', 'Primary Completion Date', 'Completion Date', 'First Posted']

# Store raw string backups
for col in date_cols:
    df[f'{col} (raw)'] = df[col]

# Convert to datetime; values that cannot be parsed become NaT
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Add '15' (monthly midpoint) to dates with missing day values to convert YYYY-MM to YYYY-MM-15
for col in date_cols:
    raw_col = f'{col} (raw)'

    # Locate dates with YYYY-MM formatting
    month_only = df[raw_col].astype(str).str.match(r"^\d{4}-\d{2}$")

    df.loc[month_only, col] = pd.to_datetime(df.loc[month_only, raw_col] + '-15', errors="coerce")
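
The month-only handling can be checked on a few values before running it on the full dataset. The sketch below is an alternative formulation under stated assumptions, not the notebook's method: it completes `YYYY-MM` strings first and then parses everything with one explicit format. The helper name `parse_partial_dates` and the sample strings are hypothetical.

```python
import pandas as pd

def parse_partial_dates(raw, fill_day="15"):
    """Parse YYYY-MM-DD strings, completing month-only YYYY-MM values first.

    Hypothetical helper for illustration only.
    """
    # Boolean mask of month-only values; missing entries count as False
    month_only = raw.str.fullmatch(r"\d{4}-\d{2}", na=False)

    # Append the fill day only where the day component is absent
    completed = raw.copy()
    completed.loc[month_only] = raw.loc[month_only] + "-" + fill_day

    # A single explicit format avoids per-column format inference
    return pd.to_datetime(completed, format="%Y-%m-%d", errors="coerce")

raw_dates = pd.Series(["2015-09-01", "2016-07", None, "not a date"])
parsed = parse_partial_dates(raw_dates)
print(parsed)
```

Completing the strings before parsing means a month-only value is never coerced to `NaT` and then patched afterwards, at the cost of assuming every full date uses the `YYYY-MM-DD` layout.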

In [43]:
# Confirm dates have been converted to datetime
print("=== 1. Confirm dtype ===")
print(df[date_cols].dtypes)
print()

=== 1. Confirm dtype ===
Start Date                 datetime64[ns]
Primary Completion Date    datetime64[ns]
Completion Date            datetime64[ns]
First Posted               datetime64[ns]
dtype: object



In [None]:
# Check on missing date values after above transformations
df[date_cols].isna().mean() * 100

Start Date                 0.000000
Primary Completion Date    3.208510
Completion Date            2.329877
First Posted               0.000000
dtype: float64

In [47]:
# Convert % missing into absolute counts for key date fields
for col in ['Primary Completion Date', 'Completion Date']:
    missing_count = df[col].isna().sum()
    total = len(df)
    print(f"{col}: {missing_count} missing out of {total} "
          f"({missing_count / total * 100:.2f}%)")

Primary Completion Date: 17006 missing out of 530028 (3.21%)
Completion Date: 12349 missing out of 530028 (2.33%)


In [49]:
for col in ['Primary Completion Date', 'Completion Date']:
    print(f"\n=== Value counts for {col} ===")
    print(df[col].value_counts(dropna=False).head(3))


=== Value counts for Primary Completion Date ===
Primary Completion Date
NaT           17006
2025-12-31     3768
2025-12-15     3116
Name: count, dtype: int64

=== Value counts for Completion Date ===
Completion Date
NaT           12349
2025-12-31     4188
2025-12-15     3307
Name: count, dtype: int64


The missing counts match the pre-conversion missingness (3.2% and 2.3%), which indicates that the conversion introduced no new missing values and that the remaining gaps are dates that were never entered in ClinicalTrials.gov. The four date columns are now in datetime format, month-only values (YYYY-MM) have been set to the 15th of the month (YYYY-MM-15), and all remaining missing values in these columns are NaT (Not a Time).
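
A more direct way to confirm that parsing created no new gaps is to look for rows where the raw string is present but the parsed value is `NaT`. The following is a minimal sketch on toy data; the helper name `conversion_failures` is an assumption, and in the notebook the comparison would be between `df[f'{col} (raw)']` and `df[col]`.

```python
import pandas as pd

def conversion_failures(raw, parsed):
    """Return raw values that were present but did not survive datetime parsing.

    Hypothetical helper for illustration only.
    """
    failed = raw.notna() & parsed.isna()
    return raw[failed]

# Toy values: one valid date, one missing entry, one unparseable string
raw = pd.Series(["2021-06-28", None, "n/a"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

failures = conversion_failures(raw, parsed)
print(failures.tolist())
```

An empty result for every date column would indicate that all `NaT` values trace back to entries that were already missing in the raw data.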

## 7. Dropping Data

In [None]:
# Check existing dataframe shape before dropping columns
df.shape


(530028, 26)

In [51]:
# Drop the eight columns in a single call
cols_to_drop = ['NCT Number', 'Study Title', 'Study Status', 'Study URL',
                'Conditions', 'Interventions', 'Study Documents', 'Collaborators']
df = df.drop(columns=cols_to_drop)

In [None]:
# Check existing dataframe shape after dropping columns
df.shape

(530028, 18)
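
The before-and-after shape check can also be made automatic. The sketch below wraps the drop in a small function that reports unknown column names and asserts the expected change in width. The function name `drop_columns_checked` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def drop_columns_checked(frame, columns):
    """Drop columns, failing loudly if any are absent or the shape is unexpected.

    Hypothetical helper for illustration only.
    """
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")

    result = frame.drop(columns=columns)

    # Row count must be unchanged; width must shrink by exactly len(columns)
    assert result.shape == (frame.shape[0], frame.shape[1] - len(columns))
    return result

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Study URL": ["u1", "u2"],
    "Sponsor": ["A", "B"],
    "Enrollment": [30.0, 54.0],
})
trimmed = drop_columns_checked(toy, ["Study URL"])
print(trimmed.shape)
```

Since the function returns a new frame, the call can be rerun safely as long as it is given the pre-drop frame each time.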

## 8. Clean and Normalize Categorical Columns

This section identifies categorical columns that require normalization and light cleaning (inconsistent casing, delimiter usage, or multi-value fields). The goal is to ensure categorical values are internally consistent and suitable for later analysis.

In [53]:
df.select_dtypes(include="object").nunique().sort_values()

Study Type                            2
Age                                   6
Phases                                7
Funder Type                           9
Study Design                       1692
First Posted (raw)                 6093
Start Date (raw)                   9234
Primary Completion Date (raw)      9933
Completion Date (raw)             10477
Sponsor                           46212
Locations                        289919
Secondary Outcome Measures       384364
Primary Outcome Measures         508913
dtype: int64
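
The high cardinality of `Study Design` (1,692 distinct values) comes from its pipe-delimited `Key: VALUE` layout, visible in the data sample earlier. One possible approach, sketched below under stated assumptions, is to trim and upper-case simple labels and to split multi-value fields into separate columns. The helper names `normalize_label` and `parse_design` are hypothetical, and the toy values are illustrative (the `PARALLEL` value in particular is assumed, since the sample output truncates it).

```python
import pandas as pd

def normalize_label(value):
    """Trim whitespace and upper-case a label; pass missing values through."""
    if pd.isna(value):
        return value
    return str(value).strip().upper()

def parse_design(value):
    """Split 'Key: VALUE|Key: VALUE' text into a dict; missing input yields {}."""
    if pd.isna(value):
        return {}
    pairs = {}
    for part in str(value).split("|"):
        key, sep, val = part.partition(":")
        if sep:  # skip fragments that lack a 'key: value' shape
            pairs[key.strip()] = val.strip() or None  # empty values become None
    return pairs

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Funder Type": [" industry", "OTHER ", None],
    "Study Design": [
        "Allocation: RANDOMIZED|Intervention Model: PARALLEL",
        "Observational Model: |Time Perspective: p",
        None,
    ],
})

toy["Funder Type"] = toy["Funder Type"].map(normalize_label)

# One column per design key; keys absent from a row are left missing
design = pd.DataFrame(toy["Study Design"].map(parse_design).tolist(), index=toy.index)
print(design)
```

Expanding the keys into columns turns one free-text field into several low-cardinality ones, which tend to be easier to count and filter.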

## 9. Interventions Cleaning

## 10. Conditions Cleaning


## 11. Location Cleaning


## 12. Enrollment Cleaning

## 13. Engineering Derived Analytical Columns

## 14. Final Checks and Export