# 00 - Data download and raw data validation
Initial download of the data from the ClinicalTrials.gov API (https://clinicaltrials.gov/data-api/api) using the script `src/download_clinicaltrials.py`.

- Verifies that the RAW data file was generated correctly.
- Loads a sample to inspect its structure and consistency.
- Saves the dataset for Notebook 01 (EDA).

**Note: run this notebook only when you want to download data.** Ideally, run it once, save the snapshot in `data/raw/`, and use that file in all other notebooks.
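
A minimal sketch of the run-once guard described in the note, assuming snapshots follow the `clinicaltrials_raw_*.csv` naming seen in `data/raw/`. The helper name `latest_snapshot` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def latest_snapshot(raw_dir: Path, pattern: str = "clinicaltrials_raw_*.csv") -> Optional[Path]:
    """Return the most recently modified snapshot in raw_dir, or None if there is none."""
    candidates = list(raw_dir.glob(pattern))
    if not candidates:
        return None
    # Modification time decides which snapshot counts as the latest.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

If `latest_snapshot(RAW_DIR)` returns a path, the download cells can be skipped and that file loaded directly. A `None` result means the download script still has to run.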

In [1]:
from pathlib import Path
import pandas as pd
import datetime

import json

# Locate the project root, one level above the notebook's directory

PROJECT_ROOT = Path.cwd().parent
print("Proyecto cargado desde:", PROJECT_ROOT)

# Create the RAW and CLEAN folders for the data

RAW_DIR = PROJECT_ROOT / "data" / "raw"
CLEAN_DIR = PROJECT_ROOT / "data" / "clean"

RAW_DIR.mkdir(parents = True, exist_ok = True)
CLEAN_DIR.mkdir(parents = True, exist_ok = True)

print("\nCarpeta datos RAW:", RAW_DIR)
print("Carpeta datos CLEAN:", CLEAN_DIR)

Proyecto cargado desde: C:\Users\Administrador\Documents\tfm_clinicaltrials

Carpeta datos RAW: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw
Carpeta datos CLEAN: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\clean


In [2]:
# Run the download script
# Note: the download script keeps only records with `StudyType = Interventional`, dated after 2000, and with the Phase field filled in, using `filter.advanced` in the API.

%cd ..
%run -m src.download_clinicaltrials 200

C:\Users\Administrador\Documents\tfm_clinicaltrials
Descargando 200 filas
Descargando todos los campos por defecto


Descargando ensayos: 100%|██████████| 200/200 [00:02<00:00, 90.03ensayos/s] 


Alcanzado el número máximo de ensayos=200, parando la ejecución.
Se han descargado 200 ensayos.
Datos RAW guardados en: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251209_015930.csv
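
The log above suggests the script pages through results until it hits the requested maximum or the response has no `nextPageToken`. The following is only a sketch of that loop, not the contents of `src/download_clinicaltrials.py`. The endpoint URL, the filter expression, and the names `build_params` and `iter_studies` are assumptions, and the page fetcher is passed in so the logic can be tried without network access.

```python
from typing import Callable, Dict, Iterator, Optional

# Assumed v2 endpoint; the project's script may use a different URL or client.
API_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_params(page_token: Optional[str] = None, page_size: int = 100) -> Dict[str, str]:
    """Build the query parameters for one page of results."""
    params = {
        # Assumed filter expression; the real one lives in the download script.
        "filter.advanced": "AREA[StudyType]INTERVENTIONAL",
        "pageSize": str(page_size),
    }
    if page_token:
        params["pageToken"] = page_token
    return params


def iter_studies(fetch_page: Callable[[Optional[str]], dict], max_studies: int) -> Iterator[dict]:
    """Yield studies until max_studies is reached or no nextPageToken is returned."""
    token = None
    count = 0
    while True:
        page = fetch_page(token)
        for study in page.get("studies", []):
            if count >= max_studies:
                return
            yield study
            count += 1
        token = page.get("nextPageToken")
        if not token or count >= max_studies:
            return
```

With the `requests` library (an assumption, since the script's HTTP client is not shown), `fetch_page` could be `lambda token: requests.get(API_URL, params=build_params(token), timeout=30).json()`.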


In [3]:
raw_files = list(RAW_DIR.glob("clinicaltrials_*.csv"))
raw_files_sorted = sorted(raw_files, key = lambda x: x.stat().st_mtime, reverse = True)

print("\nFicheros RAW mas recientes", RAW_DIR)

raw_files_sorted[:5]



Ficheros RAW mas recientes C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw


[WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_015930.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011852.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011310.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011013.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_010901.csv')]

In [4]:
if not raw_files_sorted:
    raise FileNotFoundError("No hay ficheros en data/raw/. Necesario ejecutar el script de descarga.")

raw_file = raw_files_sorted[0]
print("Se usa el fichero mas reciente:", raw_file)

df_raw_check = pd.read_csv(raw_file, low_memory = False)
df_raw_check.head(20)


Se usa el fichero mas reciente: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251209_015930.csv


Unnamed: 0,protocolSection.identificationModule.nctId,protocolSection.identificationModule.orgStudyIdInfo.id,protocolSection.identificationModule.organization.fullName,protocolSection.identificationModule.organization.class,protocolSection.identificationModule.briefTitle,protocolSection.identificationModule.officialTitle,protocolSection.statusModule.statusVerifiedDate,protocolSection.statusModule.overallStatus,protocolSection.statusModule.lastKnownStatus,protocolSection.statusModule.expandedAccessInfo.hasExpandedAccess,...,resultsSection.moreInfoModule.pointOfContact.phoneExt,protocolSection.eligibilityModule.genderBased,protocolSection.eligibilityModule.genderDescription,derivedSection.miscInfoModule.submissionTracking.firstMcpInfo.postDateStruct.date,derivedSection.miscInfoModule.submissionTracking.firstMcpInfo.postDateStruct.type,protocolSection.identificationModule.orgStudyIdInfo.type,protocolSection.identificationModule.orgStudyIdInfo.link,protocolSection.statusModule.expandedAccessInfo.nctId,protocolSection.statusModule.expandedAccessInfo.statusForNctId,protocolSection.identificationModule.nctIdAliases
0,NCT01315821,1234,Zekai Tahir Burak Women's Health Research and ...,OTHER,Effect of Saccharomyces Boulardii on Necrotizi...,Role Of Saccharomyces Boulardii in Preventin N...,2011-02,UNKNOWN,RECRUITING,False,...,,,,,,,,,,
1,NCT04551521,NCT-PMO-1602,German Cancer Research Center,OTHER,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,2024-05,COMPLETED,,False,...,,,,,,,,,,
2,NCT04086121,1368-0037,Boehringer Ingelheim,INDUSTRY,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,2025-02,TERMINATED,,False,...,,,,,,,,,,
3,NCT01181921,CR015586,"Janssen-Cilag, S.A.",INDUSTRY,The CIRCADIAN Study: Evaluation of Modulating ...,Phase IV Study for the Assessment of Modulatin...,2014-04,TERMINATED,,False,...,,,,,,,,,,
4,NCT05435014,TACE-OHEP-001,"T-ACE Medical Co., Ltd",INDUSTRY,T-ACE Oil by TAE/TACE in Patients With Hepatoc...,"Phase I/II Randomized, Double-Blind, First-in-...",2024-11,RECRUITING,,False,...,,,,,,,,,,
5,NCT02497716,17992,Bayer,INDUSTRY,Phase I Study on Rivaroxaban Granules for Oral...,Single-dose Study Testing Rivaroxaban Granules...,2019-04,COMPLETED,,False,...,,,,,,,,,,
6,NCT01730716,NS2012-3,Neuralstem Inc.,INDUSTRY,Dose Escalation and Safety Study of Human Spin...,"A Phase II, Open-label, Dose Escalation and Sa...",2013-05,UNKNOWN,ACTIVE_NOT_RECRUITING,False,...,,,,,,,,,,
7,NCT06236516,202401015,Washington University School of Medicine,OTHER,One Fraction Simulation-Free Treatment With CT...,One Fraction Simulation-Free Treatment With CT...,2025-08,COMPLETED,,False,...,,,,,,,,,,
8,NCT05833906,COSMOS-21-RegenT-1,CK Regeon Inc.,INDUSTRY,"Safety, Tolerability, and Pharmacokinetic Eval...","A Randomized, Single-blind, Placebo-controlled...",2024-01,COMPLETED,,False,...,,,,,,,,,,
9,NCT02157506,CXL-1427-02,Bristol-Myers Squibb,INDUSTRY,A Dose Ranging Phase IIa Study of 6 Hour Intra...,"A Phase IIa Study of the Safety, Tolerability ...",2019-07,COMPLETED,,False,...,,,,,,,,,,


In [5]:
print("Shape:", df_raw_check.shape)
print("\nColumnas:")
print(df_raw_check.columns.tolist())

Shape: (200, 130)

Columnas:
['protocolSection.identificationModule.nctId', 'protocolSection.identificationModule.orgStudyIdInfo.id', 'protocolSection.identificationModule.organization.fullName', 'protocolSection.identificationModule.organization.class', 'protocolSection.identificationModule.briefTitle', 'protocolSection.identificationModule.officialTitle', 'protocolSection.statusModule.statusVerifiedDate', 'protocolSection.statusModule.overallStatus', 'protocolSection.statusModule.lastKnownStatus', 'protocolSection.statusModule.expandedAccessInfo.hasExpandedAccess', 'protocolSection.statusModule.startDateStruct.date', 'protocolSection.statusModule.primaryCompletionDateStruct.date', 'protocolSection.statusModule.primaryCompletionDateStruct.type', 'protocolSection.statusModule.completionDateStruct.date', 'protocolSection.statusModule.completionDateStruct.type', 'protocolSection.statusModule.studyFirstSubmitDate', 'protocolSection.statusModule.studyFirstSubmitQcDate', 'protocolSection.st
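
Beyond looking at the shape and column names, a few automatic checks can confirm that the raw file is coherent. This is a hedged sketch: the function name `validate_raw` is hypothetical, and the check assumes identifiers have the form `NCT` followed by eight digits, as in the rows shown above.

```python
import pandas as pd

NCT_COL = "protocolSection.identificationModule.nctId"


def validate_raw(df: pd.DataFrame, id_col: str = NCT_COL) -> dict:
    """Summarise basic integrity checks on a raw download."""
    ids = df[id_col].astype("string")
    present = ids.dropna()
    return {
        "rows": len(df),
        "missing_ids": int(ids.isna().sum()),
        "duplicate_ids": int(present.duplicated().sum()),
        # Assumes identifiers look like NCT followed by eight digits.
        "malformed_ids": int((~present.str.fullmatch(r"NCT\d{8}")).sum()),
        "empty_columns": [c for c in df.columns if df[c].isna().all()],
    }
```

Calling `validate_raw(df_raw_check)` returns a small dictionary. Non-zero counts for missing, duplicate, or malformed identifiers would be a reason to review the download before continuing.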

In [6]:
## 2. Field selection and renaming

lista_columnas = [
    "protocolSection.identificationModule.nctId",
    "protocolSection.identificationModule.briefTitle",
    "protocolSection.identificationModule.officialTitle",
    "protocolSection.statusModule.overallStatus",
    "protocolSection.statusModule.lastKnownStatus",
    "protocolSection.statusModule.whyStopped",
    "protocolSection.statusModule.startDateStruct.date",
    "protocolSection.statusModule.primaryCompletionDateStruct.date",
    "protocolSection.statusModule.completionDateStruct.date",
    "protocolSection.designModule.studyType",
    "protocolSection.designModule.phases",
    "protocolSection.designModule.enrollmentInfo.count",
    "protocolSection.designModule.designInfo.allocation",
    "protocolSection.designModule.designInfo.interventionModel",
    "protocolSection.designModule.designInfo.primaryPurpose",
    "protocolSection.designModule.designInfo.maskingInfo.masking",
    "protocolSection.designModule.designInfo.maskingInfo.whoMasked",
    "protocolSection.conditionsModule.conditions",
    "protocolSection.conditionsModule.keywords",
    "protocolSection.contactsLocationsModule.locations",
    "protocolSection.armsInterventionsModule.interventions",
    "protocolSection.outcomesModule.primaryOutcomes",
    "protocolSection.outcomesModule.secondaryOutcomes",
    "protocolSection.descriptionModule.briefSummary",
    "protocolSection.sponsorCollaboratorsModule.leadSponsor.name",
    "protocolSection.sponsorCollaboratorsModule.leadSponsor.class",
    "protocolSection.sponsorCollaboratorsModule.collaborators",
    "protocolSection.eligibilityModule.minimumAge",
    "protocolSection.eligibilityModule.maximumAge",
    "protocolSection.eligibilityModule.sex",
    "protocolSection.eligibilityModule.healthyVolunteers",
    "protocolSection.eligibilityModule.eligibilityCriteria",
    "derivedSection.conditionBrowseModule.meshes",
    "derivedSection.interventionBrowseModule.meshes",
    "hasResults",
    "protocolSection.statusModule.lastUpdateSubmitDate",
    "protocolSection.statusModule.studyFirstSubmitDate",
    "protocolSection.ipdSharingStatementModule.ipdSharing",
]

# See https://clinicaltrials.gov/data-api/about-api/study-data-structure for the field names the API accepts:

fields_api = [
    "NCTId", "BriefTitle", "OfficialTitle", "OverallStatus", "LastKnownStatus","WhyStopped", "StartDate", "PrimaryCompletionDate","CompletionDate",
    "StudyType", "Phase", "EnrollmentCount", "DesignAllocation", "DesignInterventionModel", "DesignPrimaryPurpose", "DesignMasking","DesignWhoMasked",
    "Condition","Keyword", "LocationCountry", "InterventionType","InterventionName", "PrimaryOutcomeMeasure","SecondaryOutcomeMeasure", "BriefSummary",
    "LeadSponsorName","LeadSponsorClass","CollaboratorName", "MinimumAge","MaximumAge","Sex","HealthyVolunteers", "EligibilityCriteria",
    "ConditionMeshTerm","InterventionMeshTerm", "HasResults", "LastUpdateSubmitDate","StudyFirstSubmitDate", "IPDSharing"
]

cols_str = " ".join(fields_api)

In [7]:
%run -m src.download_clinicaltrials 300000 $cols_str

Descargando 300000 filas
Usando columnas: ['NCTId', 'BriefTitle', 'OfficialTitle', 'OverallStatus', 'LastKnownStatus', 'WhyStopped', 'StartDate', 'PrimaryCompletionDate', 'CompletionDate', 'StudyType', 'Phase', 'EnrollmentCount', 'DesignAllocation', 'DesignInterventionModel', 'DesignPrimaryPurpose', 'DesignMasking', 'DesignWhoMasked', 'Condition', 'Keyword', 'LocationCountry', 'InterventionType', 'InterventionName', 'PrimaryOutcomeMeasure', 'SecondaryOutcomeMeasure', 'BriefSummary', 'LeadSponsorName', 'LeadSponsorClass', 'CollaboratorName', 'MinimumAge', 'MaximumAge', 'Sex', 'HealthyVolunteers', 'EligibilityCriteria', 'ConditionMeshTerm', 'InterventionMeshTerm', 'HasResults', 'LastUpdateSubmitDate', 'StudyFirstSubmitDate', 'IPDSharing']


Descargando ensayos:  68%|██████▊   | 204675/300000 [05:38<02:37, 603.93ensayos/s]


No existe nextPageToken, parando la ejecución.
Se han descargado 204675 ensayos.
Datos RAW guardados en: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251209_020512.csv


In [8]:
raw_files = list(RAW_DIR.glob("clinicaltrials_*.csv"))
raw_files_sorted = sorted(raw_files, key=lambda x: x.stat().st_mtime, reverse=True)

print("\nFicheros RAW mas recientes", RAW_DIR)

raw_files_sorted[:5]


Ficheros RAW mas recientes C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw


[WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_020512.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_015930.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011852.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011310.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251209_011013.csv')]

In [9]:
if not raw_files_sorted:
    raise FileNotFoundError("No hay ficheros en data/raw/. Necesario ejecutar el script de descarga.")

raw_file = raw_files_sorted[0]
print("Se usa el fichero mas reciente:", raw_file)

df_raw = pd.read_csv(raw_file, low_memory=False)
df_raw.head()

Se usa el fichero mas reciente: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251209_020512.csv


Unnamed: 0,protocolSection.identificationModule.nctId,protocolSection.identificationModule.briefTitle,protocolSection.identificationModule.officialTitle,protocolSection.statusModule.overallStatus,protocolSection.statusModule.lastKnownStatus,protocolSection.statusModule.startDateStruct.date,protocolSection.statusModule.primaryCompletionDateStruct.date,protocolSection.statusModule.completionDateStruct.date,protocolSection.statusModule.studyFirstSubmitDate,protocolSection.statusModule.lastUpdateSubmitDate,...,protocolSection.eligibilityModule.sex,protocolSection.eligibilityModule.minimumAge,protocolSection.eligibilityModule.maximumAge,protocolSection.contactsLocationsModule.locations,derivedSection.conditionBrowseModule.meshes,hasResults,derivedSection.interventionBrowseModule.meshes,protocolSection.statusModule.whyStopped,protocolSection.ipdSharingStatementModule.ipdSharing,protocolSection.sponsorCollaboratorsModule.collaborators
0,NCT01315821,Effect of Saccharomyces Boulardii on Necrotizi...,Role Of Saccharomyces Boulardii in Preventin N...,UNKNOWN,RECRUITING,2011-02,2011-12,2011-12,2011-02-24,2011-08-04,...,ALL,1 Day,2 Months,"[{""country"": ""Turkey (Türkiye)""}]","[{""term"": ""Enterocolitis, Necrotizing""}]",False,,,,
1,NCT04551521,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,COMPLETED,,2021-10-13,2024-12-30,2024-12-30,2020-07-24,2025-01-07,...,ALL,18 Years,,"[{""country"": ""Germany""}, {""country"": ""Germany""...","[{""term"": ""Neoplasm Metastasis""}]",False,"[{""term"": ""Vemurafenib""}, {""term"": ""cobimetini...",,,
2,NCT04086121,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,TERMINATED,,2019-09-24,2021-04-28,2022-02-23,2019-09-10,2025-02-10,...,ALL,18 Years,75 Years,"[{""country"": ""United States""}, {""country"": ""Un...","[{""term"": ""Dermatitis, Atopic""}]",True,"[{""term"": ""spesolimab""}]",Sponsor decision,NO,
3,NCT01181921,The CIRCADIAN Study: Evaluation of Modulating ...,Phase IV Study for the Assessment of Modulatin...,TERMINATED,,2011-05,2011-06,2011-06,2010-08-12,2014-04-15,...,ALL,18 Years,,"[{""country"": ""Spain""}]","[{""term"": ""Alzheimer Disease""}, {""term"": ""Deme...",True,"[{""term"": ""Galantamine""}]",The recruitment rate was very low (one screeni...,,
4,NCT05435014,T-ACE Oil by TAE/TACE in Patients With Hepatoc...,"Phase I/II Randomized, Double-Blind, First-in-...",RECRUITING,,2022-09-13,2026-06-30,2026-06-30,2021-11-16,2024-12-17,...,ALL,20 Years,,"[{""country"": ""Taiwan""}, {""country"": ""Taiwan""},...","[{""term"": ""Carcinoma, Hepatocellular""}]",False,"[{""term"": ""Ethiodized Oil""}]",,,


In [10]:
df_raw.columns[:10]

Index(['protocolSection.identificationModule.nctId',
       'protocolSection.identificationModule.briefTitle',
       'protocolSection.identificationModule.officialTitle',
       'protocolSection.statusModule.overallStatus',
       'protocolSection.statusModule.lastKnownStatus',
       'protocolSection.statusModule.startDateStruct.date',
       'protocolSection.statusModule.primaryCompletionDateStruct.date',
       'protocolSection.statusModule.completionDateStruct.date',
       'protocolSection.statusModule.studyFirstSubmitDate',
       'protocolSection.statusModule.lastUpdateSubmitDate'],
      dtype='object')
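
Several columns in the preview above (locations, MeSH terms, collaborators) hold lists of objects serialized as text, for example `[{"country": "Germany"}]`. The sketch below shows one way to unpack them, assuming each cell is valid JSON. The helper name `extract_values` is hypothetical.

```python
import json
from typing import List


def extract_values(cell, key: str) -> List[str]:
    """Return the values stored under `key` in a JSON-encoded list of objects."""
    if not isinstance(cell, str):  # NaN or any other non-text value
        return []
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]
```

For example, `df_raw["protocolSection.contactsLocationsModule.locations"].apply(extract_values, key="country")` would give one list of countries per trial. Cells that are missing or cannot be parsed yield an empty list.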

In [11]:
df_clean = df_raw.copy()

df_clean.columns = (df_clean.columns.str.lower())

rename_map = {
    "protocolsection.identificationmodule.nctid": "NCTId",
    "protocolsection.identificationmodule.brieftitle": "BriefTitle",
    "protocolsection.identificationmodule.officialtitle": "OfficialTitle",
    "protocolsection.sponsorcollaboratorsmodule.leadsponsor.name": "LeadSponsorName",
    "protocolsection.sponsorcollaboratorsmodule.leadsponsor.class": "LeadSponsorClass",
    "protocolsection.sponsorcollaboratorsmodule.collaborators": "CollaboratorName",
    "protocolsection.statusmodule.overallstatus": "OverallStatus",
    "protocolsection.statusmodule.lastknownstatus": "LastKnownStatus",
    "protocolsection.statusmodule.whystopped": "WhyStopped",
    "protocolsection.statusmodule.startdatestruct.date": "StartDate",
    "protocolsection.statusmodule.primarycompletiondatestruct.date": "PrimaryCompletionDate",
    "protocolsection.statusmodule.completiondatestruct.date": "CompletionDate",
    "protocolsection.statusmodule.lastupdatesubmitdate": "LastUpdateSubmitDate",
    "protocolsection.statusmodule.studyfirstsubmitdate": "StudyFirstSubmitDate",
    "protocolsection.designmodule.studytype": "StudyType",
    "protocolsection.designmodule.phases": "Phase",
    "protocolsection.designmodule.enrollmentinfo.count": "EnrollmentCount",
    "protocolsection.designmodule.enrollmentinfo.type": "EnrollmentType",
    "protocolsection.designmodule.designinfo.allocation": "DesignAllocation",
    "protocolsection.designmodule.designinfo.interventionmodel": "DesignInterventionModel",
    "protocolsection.designmodule.designinfo.primarypurpose": "DesignPrimaryPurpose",
    "protocolsection.designmodule.designinfo.maskinginfo.masking": "DesignMasking",
    "protocolsection.designmodule.designinfo.maskinginfo.whomasked": "DesignWhoMasked",
    "protocolsection.conditionsmodule.conditions": "Condition",
    "protocolsection.conditionsmodule.keywords": "Keyword",
    "protocolsection.contactslocationsmodule.locations": "LocationCountry",
    "protocolsection.armsinterventionsmodule.interventions": "InterventionName",
    "protocolsection.outcomesmodule.primaryoutcomes": "PrimaryOutcomeMeasure",
    "protocolsection.outcomesmodule.secondaryoutcomes": "SecondaryOutcomeMeasure",
    "protocolsection.descriptionmodule.briefsummary": "BriefSummary",
    "protocolsection.eligibilitymodule.minimumage": "MinimumAge",
    "protocolsection.eligibilitymodule.maximumage": "MaximumAge",
    "protocolsection.eligibilitymodule.sex": "Sex",
    "protocolsection.eligibilitymodule.healthyvolunteers": "HealthyVolunteers",
    "protocolsection.eligibilitymodule.eligibilitycriteria": "EligibilityCriteria",
    "derivedsection.conditionbrowsemodule.meshes": "ConditionMeshTerm",
    "derivedsection.interventionbrowsemodule.meshes": "InterventionMeshTerm",
    "hasresults": "HasResults",
    "protocolsection.ipdsharingstatementmodule.ipdsharing": "IPDSharing",
}



df_clean = df_clean.rename(columns=rename_map)


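# Replace embedded line breaks in text columns with spaces.
# .where(col.notna()) keeps missing values as NaN; astype(str) alone
# would turn them into the literal string "nan".
df_clean = df_clean.apply(lambda col: 
    col.astype(str)
       .str.replace("\n", " ", regex=False)
       .str.replace("\r", " ", regex=False)
       .where(col.notna())
    if col.dtype == "object" else col
)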


df_clean


Unnamed: 0,NCTId,BriefTitle,OfficialTitle,OverallStatus,LastKnownStatus,StartDate,PrimaryCompletionDate,CompletionDate,StudyFirstSubmitDate,LastUpdateSubmitDate,...,Sex,MinimumAge,MaximumAge,LocationCountry,ConditionMeshTerm,HasResults,InterventionMeshTerm,WhyStopped,IPDSharing,CollaboratorName
0,NCT01315821,Effect of Saccharomyces Boulardii on Necrotizi...,Role Of Saccharomyces Boulardii in Preventin N...,UNKNOWN,RECRUITING,2011-02,2011-12,2011-12,2011-02-24,2011-08-04,...,ALL,1 Day,2 Months,"[{""country"": ""Turkey (Türkiye)""}]","[{""term"": ""Enterocolitis, Necrotizing""}]",False,,,,
1,NCT04551521,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,COMPLETED,,2021-10-13,2024-12-30,2024-12-30,2020-07-24,2025-01-07,...,ALL,18 Years,,"[{""country"": ""Germany""}, {""country"": ""Germany""...","[{""term"": ""Neoplasm Metastasis""}]",False,"[{""term"": ""Vemurafenib""}, {""term"": ""cobimetini...",,,
2,NCT04086121,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,TERMINATED,,2019-09-24,2021-04-28,2022-02-23,2019-09-10,2025-02-10,...,ALL,18 Years,75 Years,"[{""country"": ""United States""}, {""country"": ""Un...","[{""term"": ""Dermatitis, Atopic""}]",True,"[{""term"": ""spesolimab""}]",Sponsor decision,NO,
3,NCT01181921,The CIRCADIAN Study: Evaluation of Modulating ...,Phase IV Study for the Assessment of Modulatin...,TERMINATED,,2011-05,2011-06,2011-06,2010-08-12,2014-04-15,...,ALL,18 Years,,"[{""country"": ""Spain""}]","[{""term"": ""Alzheimer Disease""}, {""term"": ""Deme...",True,"[{""term"": ""Galantamine""}]",The recruitment rate was very low (one screeni...,,
4,NCT05435014,T-ACE Oil by TAE/TACE in Patients With Hepatoc...,"Phase I/II Randomized, Double-Blind, First-in-...",RECRUITING,,2022-09-13,2026-06-30,2026-06-30,2021-11-16,2024-12-17,...,ALL,20 Years,,"[{""country"": ""Taiwan""}, {""country"": ""Taiwan""},...","[{""term"": ""Carcinoma, Hepatocellular""}]",False,"[{""term"": ""Ethiodized Oil""}]",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204670,NCT01596387,Validation of a Pharmacokinetic Pharmacodynami...,Validation of a Pharmacokinetic Pharmacodynami...,COMPLETED,,2012-03,2012-04,2012-05,2012-05-08,2012-05-10,...,ALL,18 Years,60 Years,"[{""country"": ""Chile""}]","[{""term"": ""Obesity""}]",False,"[{""term"": ""Propofol""}]",,,
204671,NCT02919787,Nordic Pancreatic Cancer Trial (NorPACT) - 1,Nordic Multicentre Un-blinded Phase II Randomi...,ACTIVE_NOT_RECRUITING,,2016-09,2022-12-22,2026-04-30,2016-09-14,2024-11-18,...,ALL,18 Years,,"[{""country"": ""Denmark""}, {""country"": ""Finland""...","[{""term"": ""Pancreatic Neoplasms""}]",False,"[{""term"": ""Fluorouracil""}, {""term"": ""Oxaliplat...",,NO,"[{""name"": ""St. Olavs Hospital""}, {""name"": ""Hau..."
204672,NCT04954287,Phase 1 Study of Intranasal PIV5 COVID-19 Vacc...,"A Phase 1 Open-Label, Dose-Ranging Trial to Ev...",COMPLETED,,2021-08-06,2023-06-10,2023-06-10,2021-06-30,2024-02-05,...,ALL,18 Years,55 Years,"[{""country"": ""United States""}, {""country"": ""Un...","[{""term"": ""COVID-19""}]",False,"[{""term"": ""CVXGA1 COVID-19 vaccine""}]",,YES,
204673,NCT06577987,Safety/Efficacy Study of CID-078 in Patients W...,"A Phase 1, Open-Label, Multicenter Study to Ev...",RECRUITING,,2024-08-14,2027-01-14,2027-03-14,2024-08-23,2025-09-09,...,ALL,12 Years,,"[{""country"": ""United States""}, {""country"": ""Un...","[{""term"": ""Neoplasm Metastasis""}, {""term"": ""Ne...",False,,,,
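
Because the rename relies on exact lowercase keys, a column missing from `rename_map` keeps its long dotted name without any warning. A small sketch of a check for that, using a hypothetical helper `unmapped_columns`:

```python
from typing import Dict, Iterable, List


def unmapped_columns(columns: Iterable[str], rename_map: Dict[str, str]) -> List[str]:
    """Return lowercased column names that have no entry in rename_map."""
    lowered = (col.lower() for col in columns)
    return [col for col in lowered if col not in rename_map]
```

Running `unmapped_columns(df_raw.columns, rename_map)` before renaming lists any columns that would be left with their original names.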


In [12]:
# Save the cleaned file, which is used in the next step

out_path = CLEAN_DIR / "clinicaltrials_clean.csv"

df_clean.to_csv(out_path, index=False)

print("Dataset guardado en:")
out_path


Dataset guardado en:


WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/clean/clinicaltrials_clean.csv')
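
Since later notebooks depend on this file, it can be worth confirming that it reloads with the same shape and column names. The following is a sketch under that assumption, with a hypothetical helper name `check_saved_csv`.

```python
from pathlib import Path

import pandas as pd


def check_saved_csv(df: pd.DataFrame, path: Path) -> None:
    """Raise ValueError if the CSV at `path` does not match df's shape and columns."""
    reloaded = pd.read_csv(path, low_memory=False)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape mismatch: saved {reloaded.shape}, expected {df.shape}")
    if list(reloaded.columns) != list(df.columns):
        raise ValueError("Column names changed between save and reload")
```

`check_saved_csv(df_clean, out_path)` returns nothing when the file matches. It compares structure only, not cell values.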

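- The most recent RAW file was loaded successfully.

- The initial structure of the data was inspected as it comes from the API.

- A first cleaning pass was applied:
  - Column names normalized: lowercased and mapped to the short API field names (`NCTId`, `BriefTitle`, ...)
  - Line breaks (`\n`, `\r`) inside text columns replaced with spaces
  - Date columns are kept as strings; no date conversion is done in this notebook, and values mix month precision (`2011-02`) with full dates (`2021-10-13`)

The rest of the analysis is based on the file generated in `data/clean/`:
- `clinicaltrials_clean.csv`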
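
The date columns in the previews mix month-precision values such as `2011-02` with full dates such as `2021-10-13`. The sketch below is one possible way to parse them; it is not code from this notebook. The helper name `parse_partial_dates` is hypothetical, and mapping month-only values to the first day of the month is an assumption that may not suit every analysis.

```python
import pandas as pd


def parse_partial_dates(values: pd.Series) -> pd.Series:
    """Parse YYYY-MM and YYYY-MM-DD strings; anything else becomes NaT."""
    text = values.astype("string").str.strip()
    # Month-only values are padded to the first day of that month (assumption).
    month_only = text.str.fullmatch(r"\d{4}-\d{2}").fillna(False).astype(bool)
    padded = text.where(~month_only, text + "-01")
    return pd.to_datetime(padded, format="%Y-%m-%d", errors="coerce")
```

For example, `parse_partial_dates(df_clean["StartDate"])` would return a datetime column in which unparseable or missing entries appear as `NaT`.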