# 01 — Data Overview  
**ClinicalTrials.gov Dataset**  
*Author: John Seaton*  
*Last updated: 2025-12-08*

---
## 1. Scope & Role of This Notebook

**Project context:** This notebook provides a high-level overview of the raw ClinicalTrials.gov dataset before any exploration or cleaning.

**Goals:** This notebook answers the following questions:
- What files are present?  
- How large is the dataset?  
- What columns exist?  
- What are the basic types and missingness patterns?  
- Are there any initial structural anomalies?

This notebook does not modify the data.

## 2. Imports & Settings
Load the core libraries used for data exploration.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

## 3. Load Raw Data

In [2]:
df = pd.read_csv("../data/raw/ClinicalTrials/ctg-studies.csv")
df.shape

(530028, 22)

In [3]:
df.head()

Unnamed: 0,NCT Number,Study Title,Study URL,Study Status,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,First Posted,Locations,Study Documents
0,NCT05299372,Telemonitoring in NIV MND (OptNIVent),https://clinicaltrials.gov/study/NCT05299372,UNKNOWN,"Motor Neuron Disease, Amyotrophic Lateral Scle...",OTHER: Telemonitoring via Careportal®,Acceptability - Qualitative data through semi-...,Patient Reported Outcome Measurement - Neurolo...,Liverpool University Hospitals NHS Foundation ...,,"ADULT, OLDER_ADULT",,15.0,OTHER_GOV,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,2022-04-25,2022-10-31,2022-12-31,2022-03-29,,
1,NCT01662895,High-Dose Deferoxamine in Intracerebral Hemorr...,https://clinicaltrials.gov/study/NCT01662895,TERMINATED,Intracerebral Hemorrhage,DRUG: Deferoxamine|DRUG: Normal saline,Number of Subjects With Modified Rankin Scale ...,"Number of Subjects With mRS Score 0-3, The pro...",Beth Israel Deaconess Medical Center,Medical University of South Carolina|National ...,"ADULT, OLDER_ADULT",PHASE2,42.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2013-03-18,2014-01-15,2018-05-10,2012-08-13,"St. Joseph's Hospital, Phoenix, Arizona, 85013...",
2,NCT06047795,Endurance Training in Patients With Post-TB Lu...,https://clinicaltrials.gov/study/NCT06047795,COMPLETED,Tuberculosis|Post-Tuberculous Pleural Fibrosis...,OTHER: Experimental|OTHER: Control,"Functional capacity, Changes from baseline to ...",,Riphah International University,,ADULT,,36.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2023-09-11,2024-06-30,2024-06-30,2023-09-21,"Green Star NGO, Peshawar, KPK, 25000, Pakistan",
3,NCT00414895,Absolute Myocardial Perfusion Measurement in t...,https://clinicaltrials.gov/study/NCT00414895,COMPLETED,Cardiac Transplantation,,,,"Insel Gruppe AG, University Hospital Bern",Swiss National Science Foundation|Swiss Heart ...,"ADULT, OLDER_ADULT",,90.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,2006-12,2008-12,2009-06,2006-12-22,"University Hospital Inselspital, Bern, 3010, S...",
4,NCT05934695,Lymphedema Severity on Shoulder Joint Function...,https://clinicaltrials.gov/study/NCT05934695,COMPLETED,Breast Cancer|Breast Cancer Female,OTHER: Lymphedema severity stratification,"Shoulder joint mobility, shoulder flexion, abd...","Shoulder flexors strength, Maximal shoulder fl...",Ahram Canadian University,,"ADULT, OLDER_ADULT",,75.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,2023-06-30,2024-01-03,2024-01-03,2023-07-07,Outpatient clinic of faculty of physical thera...,


## 4. Column Overview

In [4]:
df.columns.tolist()

['NCT Number',
 'Study Title',
 'Study URL',
 'Study Status',
 'Conditions',
 'Interventions',
 'Primary Outcome Measures',
 'Secondary Outcome Measures',
 'Sponsor',
 'Collaborators',
 'Age',
 'Phases',
 'Enrollment',
 'Funder Type',
 'Study Type',
 'Study Design',
 'Start Date',
 'Primary Completion Date',
 'Completion Date',
 'First Posted',
 'Locations',
 'Study Documents']

In [6]:
df.dtypes

NCT Number                     object
Study Title                    object
Study URL                      object
Study Status                   object
Conditions                     object
Interventions                  object
Primary Outcome Measures       object
Secondary Outcome Measures     object
Sponsor                        object
Collaborators                  object
Age                            object
Phases                         object
Enrollment                    float64
Funder Type                    object
Study Type                     object
Study Design                   object
Start Date                     object
Primary Completion Date        object
Completion Date                object
First Posted                   object
Locations                      object
Study Documents                object
dtype: object

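'Start Date', 'Primary Completion Date', 'Completion Date', and 'First Posted' are all stored as `object` (string) columns and will need to be converted to datetime.
Some values contain only a year and month (for example `2006-12` in the preview above), so the conversion has to handle partial dates.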
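
One possible way to do this conversion in a later cleaning notebook is sketched below. It assumes the date strings come in only two shapes, `YYYY-MM-DD` and `YYYY-MM`, which is what the previews in this notebook show but has not been verified across all rows. The helper name `parse_partial_dates` and the choice to map month-only values to the first day of the month are illustrative assumptions, not decisions already made for this project.

```python
import pandas as pd

# The four date-like columns identified from df.dtypes.
DATE_COLUMNS = ["Start Date", "Primary Completion Date", "Completion Date", "First Posted"]


def parse_partial_dates(values: pd.Series) -> pd.Series:
    """Parse 'YYYY-MM-DD' strings, falling back to 'YYYY-MM' (hypothetical helper).

    Month-only values are mapped to the first day of that month (an assumption).
    Anything matching neither format becomes NaT.
    """
    full = pd.to_datetime(values, format="%Y-%m-%d", errors="coerce")
    month_only = pd.to_datetime(values, format="%Y-%m", errors="coerce")
    # Keep the full-date parse where it succeeded; otherwise use the month-only parse.
    return full.fillna(month_only)


# Toy values in the two shapes seen in the preview, plus a missing entry.
example = pd.Series(["2022-04-25", "2006-12", None])
parsed = parse_partial_dates(example)
print(parsed)
```

Applied to the real data, something like `df[DATE_COLUMNS].apply(parse_partial_dates)` would return a new frame and leave `df` itself unchanged, in keeping with this notebook's read-only role. If the day-level precision matters downstream, it may be worth keeping a separate flag for values that were month-only.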

## 5. Memory Usage & Duplicates

In [7]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530028 entries, 0 to 530027
Data columns (total 22 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   NCT Number                  530028 non-null  object 
 1   Study Title                 530028 non-null  object 
 2   Study URL                   530028 non-null  object 
 3   Study Status                530028 non-null  object 
 4   Conditions                  530025 non-null  object 
 5   Interventions               477836 non-null  object 
 6   Primary Outcome Measures    516558 non-null  object 
 7   Secondary Outcome Measures  387051 non-null  object 
 8   Sponsor                     530028 non-null  object 
 9   Collaborators               172143 non-null  object 
 10  Age                         530028 non-null  object 
 11  Phases                      204238 non-null  object 
 12  Enrollment                  526154 non-null  float64
 13  Funder Type   

In [8]:
df.duplicated().sum()

np.int64(0)
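
The check above finds no fully duplicated rows. A complementary check, sketched here on a toy frame, is whether the identifier column alone repeats, since two rows can share an 'NCT Number' while differing in other fields. The toy identifiers are made up for illustration, and whether 'NCT Number' is actually unique in this dataset has not been checked in this notebook.

```python
import pandas as pd

# Toy frame with made-up identifiers: the last two rows share an ID
# but differ in status, so they are not full-row duplicates.
toy = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002", "NCT00000002"],
    "Study Status": ["COMPLETED", "UNKNOWN", "TERMINATED"],
})

# Full-row duplicates, as in the cell above.
full_row_dupes = int(toy.duplicated().sum())

# Duplicates on the identifier column only.
id_dupes = int(toy.duplicated(subset="NCT Number").sum())

# keep=False marks every row in a repeated group, which helps with inspection.
repeated = toy[toy.duplicated(subset="NCT Number", keep=False)]

print(full_row_dupes, id_dupes)
print(repeated)
```

On the real data the same calls would run against `df`; `df["NCT Number"].is_unique` gives a quick yes/no answer.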

## 6. Missingness Overview

In [9]:
df.isna().mean().sort_values(ascending=False)

Study Documents               0.909578
Collaborators                 0.675219
Phases                        0.614666
Secondary Outcome Measures    0.269754
Interventions                 0.098470
Locations                     0.081267
Primary Completion Date       0.032085
Primary Outcome Measures      0.025414
Completion Date               0.023299
Enrollment                    0.007309
Conditions                    0.000006
Study Status                  0.000000
NCT Number                    0.000000
Study Title                   0.000000
Study URL                     0.000000
Funder Type                   0.000000
Sponsor                       0.000000
Age                           0.000000
Start Date                    0.000000
Study Design                  0.000000
Study Type                    0.000000
First Posted                  0.000000
dtype: float64

The output above shows the fraction of missing values in each column. 'Study Documents' is about 91% null and will be dropped in a later cleaning step, since this notebook does not modify the data. Missing values in the remaining columns will be analyzed and handled on a case-by-case basis.
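
To make the case-by-case review easier, the fractions can be combined with raw counts and a flag for columns that are candidates for removal. The sketch below is one way to do that; the helper name `missingness_summary` and the 0.9 threshold are assumptions chosen for illustration, not settled project rules. It only builds a summary table and does not drop anything.

```python
import pandas as pd


def missingness_summary(frame: pd.DataFrame, drop_threshold: float = 0.9) -> pd.DataFrame:
    """Summarize missing values per column (hypothetical helper).

    drop_threshold is an illustrative cutoff: columns whose missing
    fraction is at or above it are flagged as removal candidates.
    """
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "frac_missing": frame.isna().mean(),
    })
    summary["drop_candidate"] = summary["frac_missing"] >= drop_threshold
    return summary.sort_values("frac_missing", ascending=False)


# Toy frame reusing three column names from the dataset, with made-up values.
toy = pd.DataFrame({
    "Sponsor": ["A", "B", "C", "D"],
    "Phases": ["PHASE2", None, None, "PHASE1"],
    "Study Documents": [None, None, None, None],
})

summary = missingness_summary(toy)
print(summary)
```

Calling `missingness_summary(df)` on the real data would flag 'Study Documents' at the default threshold, given its missing fraction of about 0.91 shown above.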

## 7. DataFrame Sample

In [11]:
df.sample(7)

Unnamed: 0,NCT Number,Study Title,Study URL,Study Status,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Start Date,Primary Completion Date,Completion Date,First Posted,Locations,Study Documents
80122,NCT06819995,Accuracy of Static Guided Implant Surgery: 3D-...,https://clinicaltrials.gov/study/NCT06819995,RECRUITING,Dental Implant|Guided Surgery Accuracy,PROCEDURE: Static guided implant placement wit...,"Angular deviation, The discrepancy between the...","Probing Depth (PD), This measurement will be p...",Universidad Complutense de Madrid,Klockner Implant System,"ADULT, OLDER_ADULT",,48.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2025-02-15,2026-12-31,2026-12-31,2025-02-11,"Facultad de Odontología, Universidad Compluten...",
407089,NCT01975610,Efficacy and Safety Study of CC-292 Versus Pla...,https://clinicaltrials.gov/study/NCT01975610,COMPLETED,Rheumatoid Arthritis,DRUG: CC-292|DRUG: Placebo,American College of Rheumatology Criteria for ...,"Number of participants with adverse events, Sa...",Celgene,,"ADULT, OLDER_ADULT",PHASE2,47.0,INDUSTRY,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2013-10,2016-01,2016-02,2013-11-04,"Achieve Clinical Research LLC, Birmingham, Ala...",
529232,NCT04584060,Conventional VS Enhanced Recovery After Surger...,https://clinicaltrials.gov/study/NCT04584060,UNKNOWN,Patient Presented With Acute Abdomen|Patient U...,COMBINATION_PRODUCT: ERAS protocols|COMBINATIO...,"Length of hospital stay, In days, Up to one month","Time to ambulation, in hours, Up to one month|...",Assiut University,,"ADULT, OLDER_ADULT",,60.0,OTHER,OBSERVATIONAL,Observational Model: |Time Perspective: p,2020-12-01,2021-11-30,2022-02-28,2020-10-12,,
438381,NCT04735835,Personalized Responses to Dietary Composition ...,https://clinicaltrials.gov/study/NCT04735835,RECRUITING,Diabetes|Heart Diseases|Diet Habit|Diet Modifi...,OTHER: Dietary Intervention,"Glucose, Measurement of blood glucose by conti...","Dietary assessment, Weighed food log, 6-14 day...",Zoe Global Limited,Massachusetts General Hospital|Stanford Univer...,"ADULT, OLDER_ADULT",,250000.0,OTHER,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,2020-07-20,2030-01-01,2030-01-01,2021-02-03,"Zoe US Inc., Needham, Massachusetts, 02494, Un...",
64400,NCT01400672,Imiquimod/Brain Tumor Initiating Cell (BTIC) V...,https://clinicaltrials.gov/study/NCT01400672,TERMINATED,Diffuse Intrinsic Pontine Glioma,BIOLOGICAL: Tumor Lysate Vaccine|DRUG: Imiquim...,"Dose-limiting toxicity, Determined as Grade 3 ...","Time to Tumor Progression, Imaging will includ...","Masonic Cancer Center, University of Minnesota",,"CHILD, ADULT, OLDER_ADULT",PHASE1,8.0,OTHER,INTERVENTIONAL,Allocation: NA|Intervention Model: SINGLE_GROU...,2012-07-17,2016-03-07,2018-10-08,2011-07-22,"Masonic Cancer Center, University of Minnesota...",
33386,NCT06841458,"Six-Month Single-Blind, Placebo-Controlled Stu...",https://clinicaltrials.gov/study/NCT06841458,RECRUITING,Androgenetic Alopecia,OTHER: Oral Supplement,Objective Evaluation of Hair Health Benefits A...,"Subjective evaluation questionnaire, Consumer ...","Industrial Farmacéutica Cantabria, S.A.",,ADULT,,45.0,INDUSTRY,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2024-10-07,2025-06-01,2025-06-01,2025-02-24,"DermaClaim, Valencia, Valencia, 46020, Spain|D...",
267340,NCT01907191,Ultrasound Guided Local Infiltration Analgesia...,https://clinicaltrials.gov/study/NCT01907191,TERMINATED,Hip Injuries,DRUG: liposomal bupivacaine|DRUG: Bupivacaine,Opioid Consumption (Mean Number of 5mg Oxycont...,"Pain Scores, Analysis of pain scores (at rest ...",Trinity Health Of New England,,"ADULT, OLDER_ADULT",,23.0,OTHER,INTERVENTIONAL,Allocation: RANDOMIZED|Intervention Model: PAR...,2013-07,2019-01-30,2019-01-30,2013-07-24,"Saint Francis Hospital and Medical Center, Har...","Study Protocol and Statistical Analysis Plan, ..."
