# EDA - Exploratory Data Analysis in Clinical Trial Trends

## Objectives

Explore the clinical trials dataset to investigate and carry out the following:

- Identify Outliers
- Statistical Analysis
- Descriptive Analysis
- Predictive Analysis
- Hypothesis Testing
- Qualitative Analysis

## Inputs

- data/outputs/cleaned_clinical_trials.csv

## Output

- INSERT UPDATED DATASET HERE

_______

The clinical trials dataset will be loaded into this notebook to conduct bivariate and multivariate analysis. 

The necessary libraries to conduct the analysis will be imported into this notebook.

In [21]:
# Import necessary libraries for data analysis and visualization (assistance from CoPilot)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

## Descriptive Analysis

Load the cleaned dataset from its CSV file to understand the structure of the data.

In [22]:
# Load the cleaned dataset 
df = pd.read_csv('../data/outputs/cleaned_clinical_trials.csv')
df.head()

Unnamed: 0,index,Sponsor,Title,Summary,Start_Year,Start_Month,Phase,Enrollment,Status,Condition,Start_Month_Name,Start_Year_Period,Enrolled_Participants
0,8746,GSK,"An Open-Label, Non-Randomized Pharmacokinetic ...",The main purpose of this study is to compare h...,2006,9,Phase 1,29,Completed,"Purpura, Thrombocytopenic, Idiopathic",September,2006-2010,<50
1,1499,Sanofi,"A Randomized, Double-blind, Placebo-controlled...",Primary Objective: To evaluate the efficacy of...,2018,11,Phase 3,360,Recruiting,Giant Cell Arteritis,November,2016-2020,100-499
2,2132,Pfizer,A Multi Center Randomized Cross Over Double Bl...,This is a pilot study to generate hypotheses a...,2007,3,Phase 2,27,Completed,Prostatic Hyperplasia,March,2006-2010,<50
3,4422,Novartis,"A Double-blind, Randomized, Placebo-controlled...",This study will evaluate the effect of FTY720 ...,2008,9,Phase 2,36,Completed,Asthma,September,2006-2010,<50
4,5352,Novartis,"A Two Part Study Including a Randomized, Doubl...",This study is designed to enable optimal dose ...,2013,8,Phase 1,93,Terminated,Hypertension,August,2011-2015,50-99


Investigate the basic structure of the dataset.

In [23]:
# Display basic information about the dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9719 entries, 0 to 9718
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   index                  9719 non-null   int64 
 1   Sponsor                9719 non-null   object
 2   Title                  9719 non-null   object
 3   Summary                9719 non-null   object
 4   Start_Year             9719 non-null   int64 
 5   Start_Month            9719 non-null   int64 
 6   Phase                  9719 non-null   object
 7   Enrollment             9719 non-null   int64 
 8   Status                 9719 non-null   object
 9   Condition              9719 non-null   object
 10  Start_Month_Name       9719 non-null   object
 11  Start_Year_Period      9719 non-null   object
 12  Enrolled_Participants  9439 non-null   object
dtypes: int64(4), object(9)
memory usage: 987.2+ KB


Unnamed: 0,index,Sponsor,Title,Summary,Start_Year,Start_Month,Phase,Enrollment,Status,Condition,Start_Month_Name,Start_Year_Period,Enrolled_Participants
0,8746,GSK,"An Open-Label, Non-Randomized Pharmacokinetic ...",The main purpose of this study is to compare h...,2006,9,Phase 1,29,Completed,"Purpura, Thrombocytopenic, Idiopathic",September,2006-2010,<50
1,1499,Sanofi,"A Randomized, Double-blind, Placebo-controlled...",Primary Objective: To evaluate the efficacy of...,2018,11,Phase 3,360,Recruiting,Giant Cell Arteritis,November,2016-2020,100-499
2,2132,Pfizer,A Multi Center Randomized Cross Over Double Bl...,This is a pilot study to generate hypotheses a...,2007,3,Phase 2,27,Completed,Prostatic Hyperplasia,March,2006-2010,<50
3,4422,Novartis,"A Double-blind, Randomized, Placebo-controlled...",This study will evaluate the effect of FTY720 ...,2008,9,Phase 2,36,Completed,Asthma,September,2006-2010,<50
4,5352,Novartis,"A Two Part Study Including a Randomized, Doubl...",This study is designed to enable optimal dose ...,2013,8,Phase 1,93,Terminated,Hypertension,August,2011-2015,50-99
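
The `df.info()` output lists 9,439 non-null values for `Enrolled_Participants` against 9,719 rows, so 280 rows have no enrollment band. One way to count and inspect such gaps is sketched here. The `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Sponsor": ["GSK", "Sanofi", "Pfizer", "Novartis"],
    "Enrolled_Participants": ["<50", "100-499", None, "50-99"],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)

# Look at the affected rows to see what they have in common.
rows_with_gaps = toy[toy["Enrolled_Participants"].isna()]
print(rows_with_gaps)
```

Running the same two steps on `df` would show which trials lack a band before deciding whether to fill or drop them.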


Explore the basic statistics from the cleaned dataset.

In [24]:
# Display summary statistics for the categorical (object) columns (assistance from CoPilot)
categorical_stats = df.describe(include='object')
display(categorical_stats)

Unnamed: 0,Sponsor,Title,Summary,Phase,Status,Condition,Start_Month_Name,Start_Year_Period,Enrolled_Participants
count,9719,9719,9719,9719,9719,9719,9719,9719,9439
unique,10,9635,9623,7,9,745,12,8,8
top,GSK,See Detailed Description,#NAME?,Phase 3,Completed,"Diabetes Mellitus, Type 2",October,2006-2010,100-499
freq,1752,4,7,3541,7465,381,948,3665,3532


Observations:

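The initial exploration of this dataset shows that the most frequent sponsor is GSK (1,752 of 9,719 trials, about 18%) and the most frequent condition is Diabetes Mellitus, Type 2 (381 trials, about 4%). Most trials are Phase 3 (3,541) and have the status Completed (7,465).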
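
`describe(include='object')` reports only the single most frequent value in each column. To see the full ranking, `value_counts()` can be used, as in this minimal sketch. The `toy` frame is hypothetical sample data, not drawn from the dataset.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df["Sponsor"].
toy = pd.DataFrame(
    {"Sponsor": ["GSK", "Pfizer", "GSK", "Novartis", "GSK", "Pfizer"]}
)

# Absolute counts, sorted from most to least frequent.
counts = toy["Sponsor"].value_counts()

# The same ranking expressed as a share of all rows.
shares = toy["Sponsor"].value_counts(normalize=True).round(3)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

The same pattern applies to `Condition`, `Phase` and `Status`.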

### Basic Statistics

Drop the `index` column first. It is a row index that was saved as a data column, and leaving it in would add a meaningless variable to the numeric summary.

Here is the summary and breakdown of the basic statistics found within the dataset. 

In [42]:
# Drop the column labeled 'index' if it exists
if 'index' in df.columns:
    df = df.drop(columns=['index'])

# Verify the changes
print(df.head())

    Sponsor                                              Title  \
0       GSK  An Open-Label, Non-Randomized Pharmacokinetic ...   
1    Sanofi  A Randomized, Double-blind, Placebo-controlled...   
2    Pfizer  A Multi Center Randomized Cross Over Double Bl...   
3  Novartis  A Double-blind, Randomized, Placebo-controlled...   
4  Novartis  A Two Part Study Including a Randomized, Doubl...   

                                             Summary  Start_Year  Start_Month  \
0  The main purpose of this study is to compare h...        2006            9   
1  Primary Objective: To evaluate the efficacy of...        2018           11   
2  This is a pilot study to generate hypotheses a...        2007            3   
3  This study will evaluate the effect of FTY720 ...        2008            9   
4  This study is designed to enable optimal dose ...        2013            8   

     Phase  Enrollment      Status                              Condition  \
0  Phase 1          29   Completed  Pur

In [43]:
# Perform statistical analysis
summary_stats = df.describe()
styled_summary_stats = summary_stats.style.background_gradient(cmap='Blues')
display(styled_summary_stats)

Unnamed: 0,Start_Year,Start_Month,Enrollment
count,9719.0,9719.0,9719.0
mean,2009.151456,6.696162,441.062146
std,4.810069,3.491232,1806.892397
min,1984.0,1.0,0.0
25%,2006.0,4.0,40.0
50%,2009.0,7.0,125.0
75%,2013.0,10.0,372.0
max,2020.0,12.0,69274.0


Most of the columns are categorical (9 of the 12 columns, excluding the index), so `describe()` summarises only the 3 numeric columns: `Start_Year`, `Start_Month` and `Enrollment`.
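
The split between numeric and categorical columns can be checked directly with `select_dtypes`. This is a small sketch on a made-up frame; the `toy` name and its rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "Sponsor": ["GSK", "Sanofi"],
    "Start_Year": [2006, 2018],
    "Enrollment": [29, 360],
})

# Columns with a numeric dtype; describe() summarises these by default.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```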

**Analysis Summary:**

**Start_Year:**
- Mean = 2009.15
- Median = 2009

The mean and median are nearly identical, which suggests the start year distribution is fairly symmetrical around 2009.

- Standard Deviation = 4.81

This value indicates a moderate spread in the distribution. The trials span 1984 to 2020, with the middle 50% starting between 2006 and 2013, so the dataset covers studies conducted over more than three decades rather than a single period.
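
A matching mean and median is only a rough indicator of symmetry. In the summary table the minimum (1984) lies 25 years below the median while the maximum (2020) lies only 11 years above it, which hints at a longer left tail. Sample skewness is one way to check this, sketched below. The helper `symmetry_report` and both series are hypothetical and not part of the notebook.

```python
import pandas as pd


def symmetry_report(values: pd.Series) -> dict:
    """Return the mean, median, their gap, and the sample skewness."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "gap": values.mean() - values.median(),
        # Roughly 0 for symmetric data, negative when the left tail is longer.
        "skew": values.skew(),
    }


# Made-up start years, not drawn from the dataset.
symmetric = pd.Series([2005, 2007, 2009, 2011, 2013])
left_tailed = pd.Series([1984, 2008, 2009, 2010, 2011, 2012])

print(symmetry_report(symmetric))
print(symmetry_report(left_tailed))
```

Calling `symmetry_report(df["Start_Year"])` would give the corresponding figures for the real data.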



**Start_Month:**
- Mean = 6.7 (between June and July)
- Median = 7 (July)

**Enrollment:**
- Mean = 441
- Median = 125
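
The mean enrollment is more than three times the median, and the maximum (69,274) is far above the 75th percentile (372), which points to a strongly right-skewed column with large outliers. As a rough sketch, assuming Tukey's 1.5 × IQR rule (a common convention, not necessarily the method this notebook uses), the outlier fences can be worked out from the quartiles in the summary table. The function `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Enrollment taken from the describe() output: 25% = 40, 75% = 372.
lower, upper = iqr_fences(40.0, 372.0)
print(lower, upper)  # -458.0 870.0
```

Enrollment cannot be negative, so only the upper fence is informative here: under this rule, trials with more than 870 participants would be flagged for review.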