# ETL - Clinical Trial Trends

## Objectives
- Extract: Load the CSV file from Kaggle into a pandas DataFrame.
- Transform: Handle missing values, normalise column names, remove duplicates and convert data types.
- Load: Save the cleaned dataset to a new CSV file, ready for EDA.
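
The transform objective above can be sketched as a single function. This is a minimal illustration on a tiny invented frame, not the notebook's actual cleaning code: the helper name `clean_trials`, the toy rows, and the policies of filling missing text with "Unknown" and coercing enrollment to a nullable integer are all assumptions.

```python
import pandas as pd


def clean_trials(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalise names, drop duplicates, fix types, fill gaps."""
    out = raw.copy()
    # Normalise column names: strip whitespace, lower-case, spaces to underscores
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    # Remove exact duplicate rows
    out = out.drop_duplicates().reset_index(drop=True)
    # Convert enrollment to a nullable integer; unparseable values become <NA>
    out["enrollment"] = pd.to_numeric(out["enrollment"], errors="coerce").astype("Int64")
    # Fill missing text values with an explicit placeholder (an assumed policy)
    out["condition"] = out["condition"].fillna("Unknown")
    return out


# Toy rows invented for illustration only
toy = pd.DataFrame({
    "Sponsor ": ["A", "A", "B"],
    "Enrollment": ["75", "75", "n/a"],
    "Condition": ["Leukemia", "Leukemia", None],
})
cleaned = clean_trials(toy)
print(cleaned)
```

Whether missing values should be filled, dropped or left as-is depends on the later analysis, so the fill policy here is only one option.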

## Inputs

- Overview Clinical Trial Trends dataset from [Kaggle](https://www.kaggle.com/datasets/thedevastator/a-quick-overview-of-clinical-trials)

## Outputs

INSERT CLEANED DATASET

## Additional Comments

- The Clinical Trials dataset gives an overview of clinical trials over the years, from multiple sponsors.

_____

## Extract - Load and Clean Dataset

The clinical trials dataset is loaded into this notebook so that its structure can be inspected before cleaning.

In [13]:
# Import necessary Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [14]:
# Load dataset and convert to DataFrame
df = pd.read_csv('../data/inputs/raw/clinical_trials.csv')
df.head()

| | index | NCT | Sponsor | Title | Summary | Start_Year | Start_Month | Phase | Enrollment | Status | Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NCT00003305 | Sanofi | A Phase II Trial of Aminopterin in Adults and ... | RATIONALE: Drugs used in chemotherapy use diff... | 1997 | 7 | Phase 2 | 75 | Completed | Leukemia |
| 1 | 1 | NCT00003821 | Sanofi | Phase II Trial of Aminopterin in Patients With... | RATIONALE: Drugs used in chemotherapy use diff... | 1998 | 1 | Phase 2 | 0 | Withdrawn | Endometrial Neoplasms |
| 2 | 2 | NCT00004025 | Sanofi | Phase I/II Trial of the Safety, Immunogenicity... | RATIONALE: Vaccines made from a person's white... | 1999 | 3 | Phase 1/Phase 2 | 36 | Unknown status | Melanoma |
| 3 | 3 | NCT00005645 | Sanofi | Phase II Trial of ILX295501 Administered Orall... | RATIONALE: Drugs used in chemotherapy use diff... | 1999 | 5 | Phase 2 | 0 | Withdrawn | Ovarian Neoplasms |
| 4 | 4 | NCT00008281 | Sanofi | A Multicenter, Open-Label, Randomized, Three-A... | RATIONALE: Drugs used in chemotherapy use diff... | 2000 | 10 | Phase 3 | 0 | Unknown status | Colorectal Neoplasms |


In [15]:
# Investigate the size of the dataset
df.shape

(13748, 11)
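
Beyond the shape, a quick profile of types, missing values and duplicates can help plan the transform step. The following is a minimal sketch run on a small invented frame so that it is self-contained; `profile` is a hypothetical helper, and in this notebook it would be called on `df` instead of `toy`.

```python
import pandas as pd


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column dtype, missing count and distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # distinct non-missing values
    })


# Invented rows for illustration; not taken from the real dataset
toy = pd.DataFrame({
    "Phase": ["Phase 2", "Phase 2", None],
    "Enrollment": [75, 75, 36],
})
summary = profile(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
print(summary)
print("duplicate rows:", n_duplicates)
```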

In [None]:
import os

# Ensure the output directory exists before anything is written to it
output_dir = '../data/outputs'
os.makedirs(output_dir, exist_ok=True)

In [30]:
# Take a reproducible random sample of 10,000 entries
sampled_df = df.sample(n=10000, random_state=28)

# Save the sampled dataset to the output directory for future use
sampled_df.to_csv(f'{output_dir}/sampled_clinical_trials.csv', index=False)

In [32]:
# Confirm the shape of the sampled DataFrame
sampled_df.shape

(10000, 11)
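
Fixing `random_state` makes the sample reproducible: the same seed on the same data selects the same rows, and because `sample` draws without replacement by default, no row is picked twice. A small sketch on an invented frame, with sizes chosen only for illustration:

```python
import pandas as pd

# Invented stand-in for the full dataset
toy = pd.DataFrame({"value": range(100)})

first = toy.sample(n=10, random_state=28)
second = toy.sample(n=10, random_state=28)

# Same seed, same data: identical rows in identical order
same_rows = first.equals(second)
# Default replace=False means each original index label appears at most once
no_repeats = first.index.is_unique
print(same_rows, no_repeats)
```

The sampled rows keep their original index labels, which is why the preview of `sampled_df` shows non-consecutive index values.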

In [34]:
# Display the first few rows of the sampled dataset
sampled_df.head()

| | index | NCT | Sponsor | Title | Summary | Start_Year | Start_Month | Phase | Enrollment | Status | Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8746 | 8746 | NCT00442871 | GSK | An Open-Label, Non-Randomized Pharmacokinetic ... | The main purpose of this study is to compare h... | 2006 | 9 | Phase 1 | 29 | Completed | Purpura, Thrombocytopenic, Idiopathic |
| 1499 | 1499 | NCT03600805 | Sanofi | A Randomized, Double-blind, Placebo-controlled... | Primary Objective: To evaluate the efficacy of... | 2018 | 11 | Phase 3 | 360 | Recruiting | Giant Cell Arteritis |
| 2132 | 2132 | NCT00408954 | Pfizer | A Multi Center Randomized Cross Over Double Bl... | This is a pilot study to generate hypotheses a... | 2007 | 3 | Phase 2 | 27 | Completed | Prostatic Hyperplasia |
| 4422 | 4422 | NCT00785083 | Novartis | A Double-blind, Randomized, Placebo-controlled... | This study will evaluate the effect of FTY720 ... | 2008 | 9 | Phase 2 | 36 | Completed | Asthma |
| 5352 | 5352 | NCT02047656 | Novartis | A Two Part Study Including a Randomized, Doubl... | This study is designed to enable optimal dose ... | 2013 | 8 | Phase 1 | 93 | Terminated | Hypertension |
