# 00 - Data download and raw data validation
Initial download of the data from the ClinicalTrials.gov API (https://clinicaltrials.gov/data-api/api) using the script `src/download_clinicaltrials.py`.  

- Verifies that the RAW data file was generated correctly.
- Loads a sample to inspect its structure and coherence.
- Saves the dataset for Notebook 01 (EDA).

**Note: run this notebook only when you want to download data.** Ideally, run it once, save the snapshot in `data/raw/`, and always use that file in the remaining notebooks.

In [1]:
from pathlib import Path
import pandas as pd
import datetime

import json

# Locate the project root, one level above the notebook's directory

PROJECT_ROOT = Path.cwd().parent
print("Proyecto cargado desde:", PROJECT_ROOT)

# Create the RAW and CLEAN folders where the data is stored

RAW_DIR = PROJECT_ROOT / "data" / "raw"
CLEAN_DIR = PROJECT_ROOT / "data" / "clean"

RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

print("\nCarpeta datos RAW:", RAW_DIR)
print("Carpeta datos CLEAN:", CLEAN_DIR)

Proyecto cargado desde: C:\Users\Administrador\Documents\tfm_clinicaltrials

Carpeta datos RAW: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw
Carpeta datos CLEAN: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\clean


In [2]:
# Run the download script from the project root; 200 is the maximum number of trials for this test download

%cd ..
%run -m src.download_clinicaltrials 200

C:\Users\Administrador\Documents\tfm_clinicaltrials
Descargando 200 filas
Descargando todos los campos por defecto


Descargando ensayos: 100%|██████████| 200/200 [00:01<00:00, 119.40ensayos/s]


Alcanzado el número máximo de ensayos=200, parando la ejecución.
Se han descargado 200 ensayos.
Datos RAW guardados en: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251208_182649.csv


In [3]:
raw_files = list(RAW_DIR.glob("clinicaltrials_*.csv"))
raw_files_sorted = sorted(raw_files, key=lambda x: x.stat().st_mtime, reverse=True)

print("\nFicheros RAW mas recientes", RAW_DIR)

raw_files_sorted[:5]



Ficheros RAW mas recientes C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw


[WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_182649.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_182300.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_181025.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_143015.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_141314.csv')]

In [4]:
if not raw_files_sorted:
    raise FileNotFoundError("No hay ficheros en data/raw/. Necesario ejecutar el script de descarga.")

raw_file = raw_files_sorted[0]
print("Se usa el fichero mas reciente:", raw_file)

df_raw_check = pd.read_csv(raw_file, low_memory=False)
df_raw_check.head(20)


Se usa el fichero mas reciente: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251208_182649.csv


Unnamed: 0,protocolSection.identificationModule.nctId,protocolSection.identificationModule.orgStudyIdInfo.id,protocolSection.identificationModule.secondaryIdInfos,protocolSection.identificationModule.organization.fullName,protocolSection.identificationModule.organization.class,protocolSection.identificationModule.briefTitle,protocolSection.identificationModule.officialTitle,protocolSection.statusModule.statusVerifiedDate,protocolSection.statusModule.overallStatus,protocolSection.statusModule.expandedAccessInfo.hasExpandedAccess,...,protocolSection.ipdSharingStatementModule.url,protocolSection.identificationModule.orgStudyIdInfo.type,protocolSection.identificationModule.orgStudyIdInfo.link,derivedSection.miscInfoModule.submissionTracking.firstMcpInfo.postDateStruct.date,derivedSection.miscInfoModule.submissionTracking.firstMcpInfo.postDateStruct.type,protocolSection.oversightModule.isUnapprovedDevice,protocolSection.referencesModule.availIpds,resultsSection.moreInfoModule.pointOfContact.phoneExt,protocolSection.statusModule.expandedAccessInfo.nctId,protocolSection.statusModule.expandedAccessInfo.statusForNctId
0,NCT03034993,H-35967,"[{""id"": ""catalyst grant"", ""type"": ""OTHER"", ""do...",Boston University,OTHER,Self-Management Using Text Messaging in a Home...,Improving Self-Management of Chronic Condition...,2019-05,COMPLETED,False,...,,,,,,,,,,
1,NCT06822621,106/2023_16.11.23,,Harokopio University,OTHER,Functional Cereal Products and Contribution to...,Functional Cereal Products and Contribution to...,2025-02,ENROLLING_BY_INVITATION,False,...,,,,,,,,,,
2,NCT04551521,NCT-PMO-1602,,German Cancer Research Center,OTHER,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,2024-05,COMPLETED,False,...,,,,,,,,,,
3,NCT04086121,1368-0037,,Boehringer Ingelheim,INDUSTRY,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,2025-02,TERMINATED,False,...,,,,,,,,,,
4,NCT04538521,NiaMIT_002,,University of Helsinki,OTHER,NiaMIT Continuation With Early-stage Mitochond...,NiaMIT (NiaMIT_0001) Continuation for Early-st...,2021-01,COMPLETED,False,...,,,,,,,,,,
5,NCT07103421,Tishreen U _ periodontic,,Tishreen University,OTHER,Impact of Non Surgical Periodontal Therapy on ...,Effect of Non-Surgical Periodontal Therapy on ...,2025-08,COMPLETED,False,...,,,,,,,,,,
6,NCT06684821,36264PR903/10/24,,Tanta University,OTHER,Epidural Neuroplasty Using Racz Catheter Durin...,Analgesic Efficacy of Epidural Neuroplasty Usi...,2025-04,COMPLETED,False,...,,,,,,,,,,
7,NCT05435014,TACE-OHEP-001,,"T-ACE Medical Co., Ltd",INDUSTRY,T-ACE Oil by TAE/TACE in Patients With Hepatoc...,"Phase I/II Randomized, Double-Blind, First-in-...",2024-11,RECRUITING,False,...,,,,,,,,,,
8,NCT02497716,17992,"[{""id"": ""2015-000962-76"", ""type"": ""EUDRACT_NUM...",Bayer,INDUSTRY,Phase I Study on Rivaroxaban Granules for Oral...,Single-dose Study Testing Rivaroxaban Granules...,2019-04,COMPLETED,False,...,,,,,,,,,,
9,NCT06236516,202401015,,Washington University School of Medicine,OTHER,One Fraction Simulation-Free Treatment With CT...,One Fraction Simulation-Free Treatment With CT...,2025-08,COMPLETED,False,...,,,,,,,,,,


In [5]:
print("Shape:", df_raw_check.shape)
print("\nColumnas:")
print(df_raw_check.columns.tolist())

Shape: (200, 130)

Columnas:
['protocolSection.identificationModule.nctId', 'protocolSection.identificationModule.orgStudyIdInfo.id', 'protocolSection.identificationModule.secondaryIdInfos', 'protocolSection.identificationModule.organization.fullName', 'protocolSection.identificationModule.organization.class', 'protocolSection.identificationModule.briefTitle', 'protocolSection.identificationModule.officialTitle', 'protocolSection.statusModule.statusVerifiedDate', 'protocolSection.statusModule.overallStatus', 'protocolSection.statusModule.expandedAccessInfo.hasExpandedAccess', 'protocolSection.statusModule.startDateStruct.date', 'protocolSection.statusModule.startDateStruct.type', 'protocolSection.statusModule.primaryCompletionDateStruct.date', 'protocolSection.statusModule.primaryCompletionDateStruct.type', 'protocolSection.statusModule.completionDateStruct.date', 'protocolSection.statusModule.completionDateStruct.type', 'protocolSection.statusModule.studyFirstSubmitDate', 'protocolSec
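
Beyond the shape and the column names, a basic coherence check on the sample could confirm that the trial identifiers are well formed and unique. The following is a minimal sketch, assuming identifiers follow the pattern `NCT` plus eight digits (as in the rows shown above); `check_nct_ids` is a hypothetical helper, not part of the download script.

```python
import re

# Assumed identifier format: "NCT" followed by eight digits (e.g. NCT03034993).
NCT_PATTERN = re.compile(r"NCT\d{8}")

def check_nct_ids(ids):
    """Return (malformed, duplicated) identifiers found in `ids`."""
    seen = set()
    malformed, duplicated = [], []
    for value in ids:
        text = str(value)
        if not NCT_PATTERN.fullmatch(text):
            malformed.append(text)
        elif text in seen:
            duplicated.append(text)
        seen.add(text)
    return malformed, duplicated

# Small hand-made sample: one truncated id and one repeated id.
sample_ids = ["NCT03034993", "NCT06822621", "NCT0455", "NCT03034993"]
malformed, duplicated = check_nct_ids(sample_ids)
print("Malformed:", malformed)
print("Duplicated:", duplicated)
```

In this notebook the same helper could be called with `df_raw_check["protocolSection.identificationModule.nctId"]`; two empty lists would indicate that the sample passes this check.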

In [6]:
lista_columnas = [
    "protocolSection.identificationModule.nctId",
    "protocolSection.identificationModule.briefTitle",
    "protocolSection.identificationModule.officialTitle",
    "protocolSection.statusModule.overallStatus",
    "protocolSection.statusModule.lastKnownStatus",
    "protocolSection.statusModule.whyStopped",
    "protocolSection.statusModule.startDateStruct.date",
    "protocolSection.statusModule.primaryCompletionDateStruct.date",
    "protocolSection.statusModule.completionDateStruct.date",
    "protocolSection.designModule.studyType",
    "protocolSection.designModule.phases",
    "protocolSection.designModule.enrollmentInfo.count",
    "protocolSection.designModule.designInfo.allocation",
    "protocolSection.designModule.designInfo.interventionModel",
    "protocolSection.designModule.designInfo.primaryPurpose",
    "protocolSection.designModule.designInfo.maskingInfo.masking",
    "protocolSection.designModule.designInfo.maskingInfo.whoMasked",
    "protocolSection.conditionsModule.conditions",
    "protocolSection.conditionsModule.keywords",
    "protocolSection.contactsLocationsModule.locations",
    "protocolSection.armsInterventionsModule.interventions",
    "protocolSection.outcomesModule.primaryOutcomes",
    "protocolSection.outcomesModule.secondaryOutcomes",
    "protocolSection.descriptionModule.briefSummary",
    "protocolSection.sponsorCollaboratorsModule.leadSponsor.name",
    "protocolSection.sponsorCollaboratorsModule.leadSponsor.class",
    "protocolSection.sponsorCollaboratorsModule.collaborators",
    "protocolSection.eligibilityModule.minimumAge",
    "protocolSection.eligibilityModule.maximumAge",
    "protocolSection.eligibilityModule.sex",
    "protocolSection.eligibilityModule.healthyVolunteers",
    "protocolSection.eligibilityModule.eligibilityCriteria",
    "derivedSection.conditionBrowseModule.meshes",
    "derivedSection.interventionBrowseModule.meshes",
    "hasResults",
    "protocolSection.statusModule.lastUpdateSubmitDate",
    "protocolSection.statusModule.studyFirstSubmitDate",
    "protocolSection.ipdSharingStatementModule.ipdSharing",
]

# Field names accepted by the API, as listed in https://clinicaltrials.gov/data-api/about-api/study-data-structure:

fields_api = [
    "NCTId","BriefTitle","OfficialTitle",
    "OverallStatus","LastKnownStatus","WhyStopped",
    "StartDate","PrimaryCompletionDate","CompletionDate",
    "StudyType","Phase","EnrollmentCount",
    "DesignAllocation",
    "DesignInterventionModel",
    "DesignPrimaryPurpose",
    "DesignMasking","DesignWhoMasked",
    "Condition","Keyword",
    "LocationCountry",
    "InterventionType","InterventionName",
    "PrimaryOutcomeMeasure","SecondaryOutcomeMeasure",
    "BriefSummary",
    "LeadSponsorName","LeadSponsorClass","CollaboratorName",
    "MinimumAge","MaximumAge","Sex","HealthyVolunteers",
    "EligibilityCriteria",
    "ConditionMeshTerm","InterventionMeshTerm",
    "HasResults",
    "LastUpdateSubmitDate","StudyFirstSubmitDate",
    "IPDSharing"
]

cols_str = " ".join(fields_api)
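
Because the field names are typed by hand and then passed to the script as a single space-separated string, stray whitespace or a repeated name can go unnoticed. The sketch below shows one way to normalize the list before joining it; `build_fields_argument` is a hypothetical helper, and how the script itself handles malformed names is not shown in this notebook.

```python
def build_fields_argument(fields):
    """Join API field names into one space-separated string.

    Strips stray whitespace, drops empty entries, and rejects duplicates,
    so each name reaches the command line as exactly one argument.
    """
    cleaned = [name.strip() for name in fields if name.strip()]
    duplicates = sorted({name for name in cleaned if cleaned.count(name) > 1})
    if duplicates:
        raise ValueError(f"Duplicated field names: {duplicates}")
    return " ".join(cleaned)

# Hand-made example with trailing spaces and an empty entry.
example_fields = ["NCTId", "DesignInterventionModel ", " DesignPrimaryPurpose", "", "Phase"]
example_cols_str = build_fields_argument(example_fields)
print(example_cols_str)
```

Raising on duplicates rather than silently dropping them is a design choice for this sketch: a repeated name usually points to a copy-paste mistake worth reviewing.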

In [7]:
%run -m src.download_clinicaltrials 300000 $cols_str

Descargando 300000 filas
Usando columnas: ['NCTId', 'BriefTitle', 'OfficialTitle', 'OverallStatus', 'LastKnownStatus', 'WhyStopped', 'StartDate', 'PrimaryCompletionDate', 'CompletionDate', 'StudyType', 'Phase', 'EnrollmentCount', 'DesignAllocation', 'DesignInterventionModel', 'DesignPrimaryPurpose', 'DesignMasking', 'DesignWhoMasked', 'Condition', 'Keyword', 'LocationCountry', 'InterventionType', 'InterventionName', 'PrimaryOutcomeMeasure', 'SecondaryOutcomeMeasure', 'BriefSummary', 'LeadSponsorName', 'LeadSponsorClass', 'CollaboratorName', 'MinimumAge', 'MaximumAge', 'Sex', 'HealthyVolunteers', 'EligibilityCriteria', 'ConditionMeshTerm', 'InterventionMeshTerm', 'HasResults', 'LastUpdateSubmitDate', 'StudyFirstSubmitDate', 'IPDSharing']


Descargando ensayos:  91%|█████████ | 273095/300000 [07:39<00:45, 594.90ensayos/s]


No existe nextPageToken, parando la ejecución.
Se han descargado 273095 ensayos.
Datos RAW guardados en: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251208_183431.csv
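
The message `No existe nextPageToken` suggests that the script follows continuation tokens until the API stops returning one, which is why the download ends at 273095 trials instead of the requested 300000. The script's code is not shown here, so the following is only a sketch of token-based pagination under stated assumptions: `fetch_page` is a hypothetical callable standing in for the HTTP request, and each response is assumed to be a dict with a `studies` list and an optional `nextPageToken`.

```python
def download_studies(fetch_page, max_studies):
    """Collect up to `max_studies` records by following page tokens.

    `fetch_page(token)` is a hypothetical callable that returns a dict with
    a "studies" list and, when more pages exist, a "nextPageToken" string.
    """
    studies, token = [], None
    while len(studies) < max_studies:
        page = fetch_page(token)
        studies.extend(page.get("studies", []))
        token = page.get("nextPageToken")
        if token is None:  # no continuation token: last page reached
            break
    return studies[:max_studies]


# Fake three-page source used only to exercise the loop without network access.
FAKE_PAGES = {
    None: {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "p2"},
    "p2": {"studies": [{"id": 3}, {"id": 4}], "nextPageToken": "p3"},
    "p3": {"studies": [{"id": 5}]},
}

all_studies = download_studies(FAKE_PAGES.get, max_studies=100)
first_three = download_studies(FAKE_PAGES.get, max_studies=3)
print(len(all_studies), len(first_three))
```

The two calls mirror the two stop conditions seen in the outputs above: the source running out of pages, and the requested maximum being reached.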


In [8]:
raw_files = list(RAW_DIR.glob("clinicaltrials_*.csv"))
raw_files_sorted = sorted(raw_files, key=lambda x: x.stat().st_mtime, reverse=True)

print("\nFicheros RAW mas recientes", RAW_DIR)

raw_files_sorted[:5]


Ficheros RAW mas recientes C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw


[WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_183431.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_182649.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_182300.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_181025.csv'),
 WindowsPath('C:/Users/Administrador/Documents/tfm_clinicaltrials/data/raw/clinicaltrials_raw_20251208_143015.csv')]

In [9]:
if not raw_files_sorted:
    raise FileNotFoundError("No hay ficheros en data/raw/. Necesario ejecutar el script de descarga.")

raw_file = raw_files_sorted[0]
print("Se usa el fichero mas reciente:", raw_file)

df_raw = pd.read_csv(raw_file, low_memory=False)
df_raw.head()

Se usa el fichero mas reciente: C:\Users\Administrador\Documents\tfm_clinicaltrials\data\raw\clinicaltrials_raw_20251208_183431.csv


Unnamed: 0,protocolSection.identificationModule.nctId,protocolSection.identificationModule.briefTitle,protocolSection.identificationModule.officialTitle,protocolSection.statusModule.overallStatus,protocolSection.statusModule.startDateStruct.date,protocolSection.statusModule.primaryCompletionDateStruct.date,protocolSection.statusModule.completionDateStruct.date,protocolSection.statusModule.studyFirstSubmitDate,protocolSection.statusModule.lastUpdateSubmitDate,protocolSection.sponsorCollaboratorsModule.leadSponsor.name,...,protocolSection.eligibilityModule.minimumAge,protocolSection.contactsLocationsModule.locations,protocolSection.ipdSharingStatementModule.ipdSharing,derivedSection.conditionBrowseModule.meshes,hasResults,protocolSection.designModule.designInfo.maskingInfo.whoMasked,protocolSection.eligibilityModule.maximumAge,derivedSection.interventionBrowseModule.meshes,protocolSection.statusModule.whyStopped,protocolSection.statusModule.lastKnownStatus
0,NCT03034993,Self-Management Using Text Messaging in a Home...,Improving Self-Management of Chronic Condition...,COMPLETED,2017-07-21,2018-12-30,2019-04-30,2017-01-26,2019-05-01,Boston University,...,18 Years,"[{""country"": ""United States""}]",NO,"[{""term"": ""Chronic Disease""}, {""term"": ""Medica...",False,,,,,
1,NCT06822621,Functional Cereal Products and Contribution to...,Functional Cereal Products and Contribution to...,ENROLLING_BY_INVITATION,2024-09-02,2025-06-30,2025-09-30,2025-01-31,2025-02-06,Harokopio University,...,25 Years,"[{""country"": ""Greece""}]",NO,"[{""term"": ""Hypercholesterolemia""}, {""term"": ""O...",False,"[""PARTICIPANT""]",60 Years,"[{""term"": ""Methods""}]",,
2,NCT04551521,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,COMPLETED,2021-10-13,2024-12-30,2024-12-30,2020-07-24,2025-01-07,German Cancer Research Center,...,18 Years,"[{""country"": ""Germany""}, {""country"": ""Germany""...",,"[{""term"": ""Neoplasm Metastasis""}]",False,,,"[{""term"": ""Vemurafenib""}, {""term"": ""cobimetini...",,
3,NCT04086121,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,TERMINATED,2019-09-24,2021-04-28,2022-02-23,2019-09-10,2025-02-10,Boehringer Ingelheim,...,18 Years,"[{""country"": ""United States""}, {""country"": ""Un...",NO,"[{""term"": ""Dermatitis, Atopic""}]",True,,75 Years,"[{""term"": ""spesolimab""}]",Sponsor decision,
4,NCT04538521,NiaMIT Continuation With Early-stage Mitochond...,NiaMIT (NiaMIT_0001) Continuation for Early-st...,COMPLETED,2019-02-11,2020-09-18,2020-09-18,2020-08-29,2021-01-21,University of Helsinki,...,17 Years,"[{""country"": ""Finland""}]",NO,"[{""term"": ""Mitochondrial Myopathies""}]",False,,,"[{""term"": ""Niacin""}]",,
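
Several columns, such as `locations` and the two MeSH columns, hold JSON-encoded lists that the CSV stores as plain text. A minimal sketch of decoding them with the standard `json` module follows; `parse_json_list` and `extract_key` are hypothetical helpers, and treating missing cells as empty lists is an assumption made for this example.

```python
import json

def parse_json_list(cell):
    """Decode a JSON-encoded list stored as text; missing values give []."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # NaN (a float) or an empty string
    value = json.loads(cell)
    return value if isinstance(value, list) else [value]

def extract_key(cell, key):
    """Collect the unique values of `key` from a list of JSON objects, in order."""
    values = []
    for item in parse_json_list(cell):
        if isinstance(item, dict) and key in item and item[key] not in values:
            values.append(item[key])
    return values

# Hand-made cell in the same shape as the `locations` column.
locations_cell = '[{"country": "Germany"}, {"country": "Germany"}, {"country": "Finland"}]'
countries = extract_key(locations_cell, "country")
print(countries)
print(extract_key(float("nan"), "country"))
```

Applied to the DataFrame, this could look like `df_raw[col].map(lambda cell: extract_key(cell, "country"))` for the locations column, or with `"term"` as the key for the MeSH columns.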


In [12]:
df_clean = df_raw.copy()

df_clean.columns = (df_clean.columns.str.lower())

rename_map = {
    "protocolsection.identificationmodule.nctid": "nct_id",
    "protocolsection.identificationmodule.brieftitle": "brief_title",
    "protocolsection.identificationmodule.officialtitle": "official_title",
    "protocolsection.sponsorcollaboratorsmodule.leadsponsor.name": "lead_sponsor_name",
    "protocolsection.sponsorcollaboratorsmodule.leadsponsor.class": "lead_sponsor_class",
    "protocolsection.sponsorcollaboratorsmodule.collaborators": "collaborators",
    "protocolsection.statusmodule.overallstatus": "overall_status",
    "protocolsection.statusmodule.lastknownstatus": "last_known_status",
    "protocolsection.statusmodule.whystopped": "why_stopped",
    "protocolsection.statusmodule.startdatestruct.date": "start_date",
    "protocolsection.statusmodule.primarycompletiondatestruct.date": "primary_completion_date",
    "protocolsection.statusmodule.completiondatestruct.date": "completion_date",
    "protocolsection.statusmodule.lastupdatesubmitdate": "last_update_submit_date",
    "protocolsection.statusmodule.studyfirstsubmitdate": "study_first_submit_date",
    "protocolsection.designmodule.studytype": "study_type",
    "protocolsection.designmodule.phases": "phases",
    "protocolsection.designmodule.enrollmentinfo.count": "enrollment_count",
    "protocolsection.designmodule.enrollmentinfo.type": "enrollment_type",
    "protocolsection.designmodule.designinfo.allocation": "allocation",
    "protocolsection.designmodule.designinfo.interventionmodel": "intervention_model",
    "protocolsection.designmodule.designinfo.primarypurpose": "primary_purpose",
    "protocolsection.designmodule.designinfo.maskinginfo.masking": "masking",
    "protocolsection.designmodule.designinfo.maskinginfo.whomasked": "who_masked",
    "protocolsection.conditionsmodule.conditions": "conditions",
    "protocolsection.conditionsmodule.keywords": "keywords",
    "protocolsection.contactslocationsmodule.locations": "locations",
    "protocolsection.armsinterventionsmodule.interventions": "interventions",
    "protocolsection.outcomesmodule.primaryoutcomes": "primary_outcomes",
    "protocolsection.outcomesmodule.secondaryoutcomes": "secondary_outcomes",
    "protocolsection.descriptionmodule.briefsummary": "brief_summary",
    "protocolsection.eligibilitymodule.minimumage": "minimum_age",
    "protocolsection.eligibilitymodule.maximumage": "maximum_age",
    "protocolsection.eligibilitymodule.sex": "sex",
    "protocolsection.eligibilitymodule.healthyvolunteers": "healthy_volunteers",
    "protocolsection.eligibilitymodule.eligibilitycriteria": "eligibility_criteria",
    "derivedsection.conditionbrowsemodule.meshes": "condition_mesh_terms",
    "derivedsection.interventionbrowsemodule.meshes": "intervention_mesh_terms",
    "hasresults": "has_results",
    "protocolsection.ipdsharingstatementmodule.ipdsharing": "ipd_sharing",
}


df_clean = df_clean.rename(columns=rename_map)
  
# Convert date columns
for col in ["start_date", "primary_completion_date", "completion_date", "last_update_submit_date", "study_first_submit_date"]:
    if col in df_clean.columns:
        df_clean[col] = pd.to_datetime(df_clean[col], errors="coerce")

# Trial duration in days (NaN when either date is missing)
df_clean["duration_days"] = (df_clean["completion_date"] - df_clean["start_date"]).dt.days

# Replace line breaks in text columns with spaces; missing values stay missing
# instead of becoming the literal string "nan"
df_clean = df_clean.apply(lambda col: 
    col.astype(str)
       .str.replace("\n", " ", regex=False)
       .str.replace("\r", " ", regex=False)
       .where(col.notna())
    if col.dtype == "object" else col
)


df_clean


Unnamed: 0,nct_id,brief_title,official_title,overall_status,start_date,primary_completion_date,completion_date,study_first_submit_date,last_update_submit_date,lead_sponsor_name,...,locations,ipd_sharing,condition_mesh_terms,has_results,who_masked,maximum_age,intervention_mesh_terms,why_stopped,last_known_status,duration_days
0,NCT03034993,Self-Management Using Text Messaging in a Home...,Improving Self-Management of Chronic Condition...,COMPLETED,2017-07-21,2018-12-30,2019-04-30,2017-01-26,2019-05-01,Boston University,...,"[{""country"": ""United States""}]",NO,"[{""term"": ""Chronic Disease""}, {""term"": ""Medica...",False,,,,,,648.0
1,NCT06822621,Functional Cereal Products and Contribution to...,Functional Cereal Products and Contribution to...,ENROLLING_BY_INVITATION,2024-09-02,2025-06-30,2025-09-30,2025-01-31,2025-02-06,Harokopio University,...,"[{""country"": ""Greece""}]",NO,"[{""term"": ""Hypercholesterolemia""}, {""term"": ""O...",False,"[""PARTICIPANT""]",60 Years,"[{""term"": ""Methods""}]",,,393.0
2,NCT04551521,CRAFT: The NCT-PMO-1602 Phase II Trial,Continuous ReAssessment With Flexible ExTensio...,COMPLETED,2021-10-13,2024-12-30,2024-12-30,2020-07-24,2025-01-07,German Cancer Research Center,...,"[{""country"": ""Germany""}, {""country"": ""Germany""...",,"[{""term"": ""Neoplasm Metastasis""}]",False,,,"[{""term"": ""Vemurafenib""}, {""term"": ""cobimetini...",,,1174.0
3,NCT04086121,A Study to Test the Long-term Safety of BI 655...,An Open Label Extension Study to Assess the Lo...,TERMINATED,2019-09-24,2021-04-28,2022-02-23,2019-09-10,2025-02-10,Boehringer Ingelheim,...,"[{""country"": ""United States""}, {""country"": ""Un...",NO,"[{""term"": ""Dermatitis, Atopic""}]",True,,75 Years,"[{""term"": ""spesolimab""}]",Sponsor decision,,883.0
4,NCT04538521,NiaMIT Continuation With Early-stage Mitochond...,NiaMIT (NiaMIT_0001) Continuation for Early-st...,COMPLETED,2019-02-11,2020-09-18,2020-09-18,2020-08-29,2021-01-21,University of Helsinki,...,"[{""country"": ""Finland""}]",NO,"[{""term"": ""Mitochondrial Myopathies""}]",False,,,"[{""term"": ""Niacin""}]",,,585.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
273090,NCT05253287,Growth Hormone in Decompensated Liver Cirrhosis,Impact of Repurposed Growth Hormone Treatment ...,COMPLETED,2022-02-01,2024-12-31,2024-12-31,2021-12-15,2025-11-27,Post Graduate Institute of Medical Education a...,...,"[{""country"": ""India""}]",NO,"[{""term"": ""Liver Cirrhosis""}, {""term"": ""Fibros...",False,"[""OUTCOMES_ASSESSOR""]",80 Years,"[{""term"": ""Growth Hormone""}]",,,1064.0
273091,NCT02919787,Nordic Pancreatic Cancer Trial (NorPACT) - 1,Nordic Multicentre Un-blinded Phase II Randomi...,ACTIVE_NOT_RECRUITING,NaT,2022-12-22,2026-04-30,2016-09-14,2024-11-18,Oslo University Hospital,...,"[{""country"": ""Denmark""}, {""country"": ""Finland""...",NO,"[{""term"": ""Pancreatic Neoplasms""}]",False,,,"[{""term"": ""Fluorouracil""}, {""term"": ""Oxaliplat...",,,
273092,NCT04954287,Phase 1 Study of Intranasal PIV5 COVID-19 Vacc...,"A Phase 1 Open-Label, Dose-Ranging Trial to Ev...",COMPLETED,2021-08-06,2023-06-10,2023-06-10,2021-06-30,2024-02-05,CyanVac LLC,...,"[{""country"": ""United States""}, {""country"": ""Un...",YES,"[{""term"": ""COVID-19""}]",False,,55 Years,"[{""term"": ""CVXGA1 COVID-19 vaccine""}]",,,673.0
273093,NCT06577987,Safety/Efficacy Study of CID-078 in Patients W...,"A Phase 1, Open-Label, Multicenter Study to Ev...",RECRUITING,2024-08-14,2027-01-14,2027-03-14,2024-08-23,2025-09-09,Circle Pharma,...,"[{""country"": ""United States""}, {""country"": ""Un...",,"[{""term"": ""Neoplasm Metastasis""}, {""term"": ""Ne...",False,,,,,,942.0
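
The `NaT` in `start_date` for NCT02919787 shows that some values end up without a parsed date. One possible cause, which is an assumption and has not been checked against the raw file, is a month-only date such as `2019-05` (the form seen in `statusVerifiedDate` in the first sample), which may not match the format inferred for the rest of the column. The sketch below uses only the standard library to accept both forms; `parse_partial_date` is a hypothetical helper.

```python
from datetime import datetime

def parse_partial_date(text):
    """Parse "YYYY-MM-DD" or "YYYY-MM"; return None when neither matches.

    A month-only date is mapped to the first day of that month, which is a
    convention chosen for this sketch.
    """
    if not isinstance(text, str):
        return None  # NaN or other missing markers
    for fmt in ("%Y-%m-%d", "%Y-%m"):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

full = parse_partial_date("2017-07-21")
partial = parse_partial_date("2019-05")
missing = parse_partial_date(float("nan"))
print(full, partial, missing)
```

If this turns out to be the cause, the helper could be mapped over the raw string column before the `pd.to_datetime` call. Counting `NaT` values per date column before and after would show whether it makes a difference.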


In [None]:
# Save the cleaned file, which is used in the next step

today = datetime.datetime.today().strftime("%Y%m%d_%H%M%S")
out_path = CLEAN_DIR / f"clinicaltrials_clean_{today}.csv"

df_clean.to_csv(out_path, index=False)

print("Dataset guardado en:")
out_path


- The most recent RAW file was loaded correctly.

- The initial structure of the data, as returned by the API, was inspected.

- A first cleaning pass was applied:
  - Normalization of column names
  - Initial conversion of date columns
  - Removal of line breaks embedded in text fields

The rest of the analysis is based on the file generated in `data/clean/`:
- `clinicaltrials_clean_YYYYMMDD_HHMMSS.csv`
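
Later notebooks need to locate that file. As an alternative to sorting by modification time, the sketch below picks the snapshot by the timestamp embedded in its name, which stays stable when files are copied or restored. `latest_snapshot` is a hypothetical helper, and the naming pattern is assumed to match the one used above.

```python
import re
import tempfile
from datetime import datetime
from pathlib import Path

# Timestamp suffix assumed from the file names used in this notebook.
STAMP = re.compile(r"_(\d{8}_\d{6})\.csv$")

def latest_snapshot(directory, prefix="clinicaltrials_clean"):
    """Return the file whose name carries the most recent timestamp."""
    candidates = []
    for path in Path(directory).glob(f"{prefix}_*.csv"):
        match = STAMP.search(path.name)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            candidates.append((stamp, path))
    if not candidates:
        raise FileNotFoundError(f"No {prefix}_*.csv snapshot in {directory}")
    return max(candidates, key=lambda pair: pair[0])[1]

# Demonstration in a temporary folder with two empty snapshot files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("clinicaltrials_clean_20251208_182649.csv",
                 "clinicaltrials_clean_20251208_183431.csv"):
        (Path(tmp) / name).touch()
    newest_name = latest_snapshot(tmp).name
print(newest_name)
```

In a downstream notebook this could be used as `pd.read_csv(latest_snapshot(CLEAN_DIR))`, with `CLEAN_DIR` defined as in the first cell.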