# TrialDura ↔ src Dataset Alignment 

Objective – Keep only the TrialDura rows whose NCT IDs also appear in `features_v3.parquet` (src pipeline), then create the same 2019 train/test split for TrialDura.

Prerequisites: <br>
• `data/time_prediction_input.csv` freshly rebuilt<br>
• `data/src_nctids.txt` (279 203 IDs; written by cell In [2] below)

In [1]:
import pathlib
import pprint

import pandas as pd
import pyarrow.parquet as pq

cols = pq.read_schema(
    "/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/data/processed/features_v3.parquet"
).names
pprint.pprint(cols[:60])      # first 60 column names
print("… total", len(cols), "columns")


['indication/disease area',
 'rare, non-rare (established disease area and clear diagnosis criteria)',
 'mode of administration (ex. NBE, NCE, iv vs pill)',
 'disease modifying or treating symptoms',
 'population - adults vs peds',
 'phase',
 '# patients',
 'primary_outcomes',
 'secondary_outcomes',
 'other_outcomes',
 'cohorts (sequential or parallel)',
 'Eligibility Criteria: The stringency and number of eligibility criteria for '
 'participants',
 'placebo included',
 '# safety cuts, DMCs',
 '# sites',
 'Study Start Date',
 'Primary Completion Date',
 'mask level',
 'sponsorship type',
 'study type',
 'minimum age',
 'maximum age',
 'Allocation (Randomised / Non-randomised)',
 'FDA-regulated drug',
 'FDA-regulated device',
 'Primary Completion Type',
 'Overall status',
 'Number of arms',
 'NCT',
 'start_date',
 'complete_date',
 'duration_days',
 'start_year',
 'sponsor_class',
 'condition_top',
 'therapeutic_area',
 'intervention_type',
 'site_n',
 'country_n',
 'assessments_n',
 '

In [2]:
tbl = pd.read_parquet(
    "/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/data/processed/features_v3.parquet"
)

ids = tbl["NCT"].dropna().astype(str).unique()
pd.Series(ids).to_csv("data/src_nctids.txt", index=False, header=False)

print("✅  saved", len(ids), "IDs to data/src_nctids.txt")


✅  saved 279203 IDs to data/src_nctids.txt
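
The `isin` filter in the next step matches IDs as exact strings, so a stray space or a lowercase prefix would silently drop a row. A minimal sketch of a format check that could run before saving, assuming every ID should be `NCT` followed by 8 digits (the shape of the IDs printed in this notebook); `malformed_ids` is a hypothetical helper, not part of either pipeline:

```python
import re

# Assumed ID shape: the literal prefix "NCT" followed by exactly 8 digits
NCT_PATTERN = re.compile(r"NCT\d{8}")

def malformed_ids(ids):
    """Return the IDs that do not match the assumed NCT format exactly."""
    return [i for i in ids if not NCT_PATTERN.fullmatch(str(i))]

# Hypothetical sample: one clean ID and three that exact matching would miss
sample = ["NCT01161732", "nct01161732", "NCT0116173 ", "NCT011617322"]
print(malformed_ids(sample))
```

An empty list means every ID has the expected shape; anything returned is worth normalising before the filter runs.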


## Trim TrialDura’s merged CSV to only those IDs

In [3]:
keep = set(pd.read_csv("data/src_nctids.txt", header=None)[0])

td = pd.read_csv("/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/TrialDura/data/time_prediction_input.csv", sep="\t")
print("Before filter:", len(td))

td = td[td["nctid"].isin(keep)]
td.to_csv("/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/TrialDura/data/time_prediction_input_MATCHED.csv", sep="\t", index=False)

print("After  filter :", len(td))   # at most the ID count above; any shortfall is src IDs absent from TrialDura


Before filter: 308369
After  filter : 279194


In [4]:
CSV = pathlib.Path(
    "/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/TrialDura/data/time_prediction_input_MATCHED.csv"
)

td = pd.read_csv(CSV, sep="\t")
print("rows in TrialDura CSV:", len(td))


rows in TrialDura CSV: 279194


In [5]:
src_ids = set(pd.read_csv("data/src_nctids.txt", header=None)[0])
td_ids  = set(pd.read_csv(
    "/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/TrialDura/data/time_prediction_input.csv",
    sep="\t")["nctid"])

missing = src_ids - td_ids
print("IDs missing in TrialDura:", len(missing))
print(sorted(missing)[:9])


IDs missing in TrialDura: 9
['NCT01161732', 'NCT01199562', 'NCT01287871', 'NCT02823522', 'NCT04441268', 'NCT04537273', 'NCT05219656', 'NCT05521854', 'NCT05618041']
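
These numbers reconcile with the filter step: 279 203 source IDs minus the 9 absent from TrialDura gives the 279 194 matched rows. That identity only holds if `nctid` is unique in the TrialDura CSV, which the cells above do not check. A minimal sketch of the check on toy data; `reconcile` and the toy IDs are hypothetical:

```python
import pandas as pd

def reconcile(src_ids, td):
    """Return (matched row count, sorted missing IDs, duplicate matched rows)."""
    src_ids = set(src_ids)
    matched = td[td["nctid"].isin(src_ids)]
    missing = src_ids - set(td["nctid"])
    # Duplicated nctid values would make the row count exceed the ID count
    dupes = int(matched["nctid"].duplicated().sum())
    return len(matched), sorted(missing), dupes

# Hypothetical toy inputs standing in for the real files
toy_td = pd.DataFrame({"nctid": ["NCT001", "NCT002", "NCT003", "NCT004"]})
toy_src = ["NCT002", "NCT003", "NCT999"]

n_matched, missing, dupes = reconcile(toy_src, toy_td)
print(n_matched, missing, dupes)

# With unique IDs: matched rows = source IDs - missing IDs
assert n_matched == len(set(toy_src)) - len(missing)
```

A non-zero duplicate count on the real data would mean the row count and the ID count are no longer directly comparable.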


## Create 2019 split

In [6]:
df = pd.read_csv(
    "/Users/jonathanfung/Library/Mobile Documents/com~apple~CloudDocs/UCL MSc DSML/MSc Project/TrialDura/data/time_prediction_input_MATCHED.csv", sep="\t"
)

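# same date rule the paper uses: the year is the last 4 characters of each date string
df["start_year"]      = df["start_date"].str[-4:]
df["completion_year"] = df["completion_date"].str[-4:]

# 4-digit year strings compare correctly as text; a missing date makes its comparison False
# trials that start before 2019 and complete in or after 2019 land in neither split
train = df[df["completion_year"] <  "2019"]
test  = df[df["start_year"]      >= "2019"]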

train.to_csv("data/time_prediction_train.csv", sep="\t", index=False)
test.to_csv("data/time_prediction_test.csv",  sep="\t", index=False)

print("✔ train:", train.shape, "test:", test.shape)


✔ train: (161870, 13) test: (79233, 13)
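
The two files hold 161 870 + 79 233 = 241 103 rows, so 279 194 - 241 103 = 38 091 matched rows fall in neither file, assuming no row satisfies both conditions. Those are trials that start before 2019 and complete in or after 2019, or rows whose relevant date is missing. A minimal sketch of the same rule with numeric years and an explicit "neither" bucket; `split_by_year` and the toy rows are hypothetical, and the date strings are assumed to end in a 4-digit year:

```python
import pandas as pd

def split_by_year(df, cutoff=2019):
    """Split on completion year < cutoff (train) and start year >= cutoff (test)."""
    # Last 4 characters as a number; missing or unparseable dates become NaN
    start = pd.to_numeric(df["start_date"].str[-4:], errors="coerce")
    done = pd.to_numeric(df["completion_date"].str[-4:], errors="coerce")
    in_train = done < cutoff      # a NaN year compares as False
    in_test = start >= cutoff
    return df[in_train], df[in_test], df[~in_train & ~in_test]

# Hypothetical rows covering each case
toy = pd.DataFrame({
    "nctid": ["A", "B", "C", "D"],
    "start_date": ["March 2015", "June 2018", "January 2020", None],
    "completion_date": ["May 2017", "July 2021", "August 2022", "April 2016"],
})

train, test, neither = split_by_year(toy)
print(list(train["nctid"]), list(test["nctid"]), list(neither["nctid"]))
```

Here A and D land in train, C in test, and B (started 2018, completed 2021) in neither. Counting the third bucket on the real data would confirm the 38 091 figure.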
