# Introduction

This notebook loads and processes the source data to make it ready for a RAG system. I am experimenting with two publicly available datasets from Kaggle:
- [Patient-level Clinical Drug Trial Data](https://www.kaggle.com/datasets/dillonmyrick/bells-palsy-clinical-trial)
- [COVID-19 Clinical Trials dataset](https://www.kaggle.com/datasets/parulpandey/covid19-clinical-trials-dataset)

The two datasets are fundamentally different. The former consists mostly of integer or boolean fields, while the latter is semi-structured, with free-text fields such as trial title and outcome measures.

In [16]:
import pandas as pd

## Data Processing 

The input data is the [COVID-19 Clinical Trials dataset](https://www.kaggle.com/datasets/parulpandey/covid19-clinical-trials-dataset), a dataset in CSV format from Kaggle. Using the __pandas__ package, I load the data into a data frame and compute some descriptive statistics with `describe()`.

In [17]:
## load the clinical trial data from file to dataframe for analysis
df_CT_cancer = pd.read_csv('./data/COVID_clinical_trials.csv')
df_CT_cancer.head()

|   | Rank | NCT Number | Title | Acronym | Status | Study Results | Conditions | Interventions | Outcome Measures | Sponsor/Collaborators | ... | Other IDs | Start Date | Primary Completion Date | Completion Date | First Posted | Results First Posted | Last Update Posted | Locations | Study Documents | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NCT04785898 | Diagnostic Performance of the ID Now™ COVID-19... | COVID-IDNow | Active, not recruiting | No Results Available | Covid19 | Diagnostic Test: ID Now™ COVID-19 Screening Test | Evaluate the diagnostic performance of the ID ... | Groupe Hospitalier Paris Saint Joseph | ... | COVID-IDNow | November 9, 2020 | December 22, 2020 | April 30, 2021 | March 8, 2021 |  | March 8, 2021 | Groupe Hospitalier Paris Saint-Joseph, Paris, ... |  | https://ClinicalTrials.gov/show/NCT04785898 |
| 1 | 2 | NCT04595136 | Study to Evaluate the Efficacy of COVID19-0001... | COVID-19 | Not yet recruiting | No Results Available | SARS-CoV-2 Infection | Drug: Drug COVID19-0001-USR\|Drug: normal saline | Change on viral load results from baseline aft... | United Medical Specialties | ... | COVID19-0001-USR | November 2, 2020 | December 15, 2020 | January 29, 2021 | October 20, 2020 |  | October 20, 2020 | Cimedical, Barranquilla, Atlantico, Colombia |  | https://ClinicalTrials.gov/show/NCT04595136 |
| 2 | 3 | NCT04395482 | Lung CT Scan Analysis of SARS-CoV2 Induced Lun... | TAC-COVID19 | Recruiting | No Results Available | covid19 | Other: Lung CT scan analysis in COVID-19 patients | A qualitative analysis of parenchymal lung dam... | University of Milano Bicocca | ... | TAC-COVID19 | May 7, 2020 | June 15, 2021 | June 15, 2021 | May 20, 2020 |  | November 9, 2020 | Ospedale Papa Giovanni XXIII, Bergamo, Italy\|P... |  | https://ClinicalTrials.gov/show/NCT04395482 |
| 3 | 4 | NCT04416061 | The Role of a Private Hospital in Hong Kong Am... | COVID-19 | Active, not recruiting | No Results Available | COVID | Diagnostic Test: COVID 19 Diagnostic Test | Proportion of asymptomatic subjects\|Proportion... | Hong Kong Sanatorium & Hospital | ... | RC-2020-08 | May 25, 2020 | July 31, 2020 | August 31, 2020 | June 4, 2020 |  | June 4, 2020 | Hong Kong Sanatorium & Hospital, Hong Kong, Ho... |  | https://ClinicalTrials.gov/show/NCT04416061 |
| 4 | 5 | NCT04395924 | Maternal-foetal Transmission of SARS-Cov-2 | TMF-COVID-19 | Recruiting | No Results Available | Maternal Fetal Infection Transmission\|COVID-19... | Diagnostic Test: Diagnosis of SARS-Cov2 by RT-... | COVID-19 by positive PCR in cord blood and / o... | Centre Hospitalier Régional d'Orléans\|Centre d... | ... | CHRO-2020-10 | May 5, 2020 | May 2021 | May 2021 | May 20, 2020 |  | June 4, 2020 | CHR Orléans, Orléans, France |  | https://ClinicalTrials.gov/show/NCT04395924 |


In [19]:
## Dataset statistics such as total count and distribution values (mean / std)
df_CT_cancer.describe()

|   | Rank | Enrollment |
|---|---|---|
| count | 5783.0 | 5749.0 |
| mean | 2892.0 | 18319.49 |
| std | 1669.552635 | 404543.7 |
| min | 1.0 | 0.0 |
| 25% | 1446.5 | 60.0 |
| 50% | 2892.0 | 170.0 |
| 75% | 4337.5 | 560.0 |
| max | 5783.0 | 20000000.0 |


In [22]:
# keep only rows where both Title and Study Documents are present
df_CT_cancer = df_CT_cancer[df_CT_cancer['Title'].notna()]
df_CT_cancer = df_CT_cancer[df_CT_cancer['Study Documents'].notna()]

df_CT_cancer.describe()

|   | Rank | Enrollment |
|---|---|---|
| count | 182.0 | 179.0 |
| mean | 3158.406593 | 10216.96648 |
| std | 1821.715884 | 82183.01819 |
| min | 20.0 | 1.0 |
| 25% | 1579.25 | 50.0 |
| 50% | 3057.5 | 200.0 |
| 75% | 5017.0 | 880.0 |
| max | 5771.0 | 1000000.0 |
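
The filter above keeps only rows where both `Title` and `Study Documents` are present, which cuts the table from 5,783 rows to 182. As a minimal sketch, the same filtering can be written with `dropna(subset=...)`, preceded by a per-column count of missing values to show which column removes the most rows. The toy frame below is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the trials table; these rows are made up for illustration.
toy = pd.DataFrame({
    "Title": ["Trial A", None, "Trial C", "Trial D"],
    "Study Documents": ["Protocol", "Protocol", None, "Consent form"],
    "Enrollment": [100.0, 50.0, None, 20.0],
})

required = ["Title", "Study Documents"]

# Missing values per required column, to see which filter removes the most rows.
missing_per_column = toy[required].isna().sum()
print(missing_per_column)

# Keep only rows where every required column is present.
filtered = toy.dropna(subset=required)
print(f"kept {len(filtered)} of {len(toy)} rows")
```

Running the same count on the real frame before filtering would show whether `Study Documents` alone accounts for most of the dropped rows.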


In [23]:
## Convert the dataframe to a list of dictionaries in preparation for creating the embeddings (vectorised data).
## Each row is stored in key-value format
df_CT_cancer.to_dict('records')

[{'Rank': 20,
  'NCT Number': 'NCT04407585',
  'Title': 'Testing the Accuracy of a Digital Test to Diagnose Covid-19',
  'Acronym': nan,
  'Status': 'Recruiting',
  'Study Results': 'No Results Available',
  'Conditions': 'Covid-19',
  'Interventions': 'Diagnostic Test: Covid-19 swab PCR test',
  'Outcome Measures': 'SARS-CoV-2 infection',
  'Sponsor/Collaborators': "King's College London|Zoe Global Limited|Department of Health, United Kingdom",
  'Gender': 'All',
  'Age': '18 Years and older \xa0 (Adult, Older Adult)',
  'Phases': nan,
  'Enrollment': 1000000.0,
  'Funded Bys': 'Other',
  'Study Type': 'Observational',
  'Study Designs': 'Observational Model: Cohort|Time Perspective: Prospective',
  'Other IDs': 'Covid-19 Validation Study',
  'Start Date': 'June 1, 2020',
  'Primary Completion Date': 'May 10, 2021',
  'Completion Date': 'May 10, 2021',
  'First Posted': 'May 29, 2020',
  'Results First Posted': nan,
  'Last Update Posted': 'June 24, 2020',
  'Locations': "King's Colle
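
The records above still contain `nan` for missing cells (for example `Acronym` and `Phases`), which would add noise if passed straight to an embedding model. One possible next step, sketched here under assumptions, is to flatten each record into a short text block and skip missing values. The helper name `record_to_text` and the choice of fields are hypothetical and not part of the original notebook.

```python
import math

def record_to_text(record, fields):
    """Join the selected fields as 'name: value' lines, skipping missing values."""
    parts = []
    for name in fields:
        value = record.get(name)
        # to_dict('records') represents missing cells as float NaN
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        parts.append(f"{name}: {value}")
    return "\n".join(parts)

# Values copied from the first record shown above.
record = {
    "NCT Number": "NCT04407585",
    "Title": "Testing the Accuracy of a Digital Test to Diagnose Covid-19",
    "Acronym": float("nan"),
    "Conditions": "Covid-19",
    "Interventions": "Diagnostic Test: Covid-19 swab PCR test",
    "Outcome Measures": "SARS-CoV-2 infection",
}

# Hypothetical field selection; which fields to embed is a design decision.
fields = ["Title", "Acronym", "Conditions", "Interventions", "Outcome Measures"]
text = record_to_text(record, fields)
print(text)
```

Keeping the field names in the text gives the embedding model some context about what each value means, at the cost of slightly longer inputs.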

In [24]:
## load the patient-level clinical trial dataset (eligibility criteria) from file into a dataframe for analysis
df_PatintEC = pd.read_csv('./data/Bells_Palsy_Clinical_Trial.csv')
df_PatintEC.head()

|   | Patient ID | Sex | Age | Baseline Score on House–Brackmann scale | Time between onset of symptoms and start of treatment | Treatment Group | Received Prednisolone | Received Acyclovir | 3-Month Score on House–Brackmann scale | Full Recovery in 3 Months | 9-Month Score on House–Brackmann scale | Full Recovery in 9 Months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 77 | 6 | Within 24 hr | Prednisolone–Placebo | Yes | No | 2 | No | 2 | No |
| 1 | 2 | Female | 61 | 6 | Within 24 hr | Prednisolone–Placebo | Yes | No | 1 | Yes | 1 | Yes |
| 2 | 3 | Female | 46 | 4 | >24 to ≤48 hr | Prednisolone–Placebo | Yes | No | 1 | Yes | 1 | Yes |
| 3 | 4 | Female | 46 | 3 | Within 24 hr | Prednisolone–Placebo | Yes | No | 1 | Yes | 1 | Yes |
| 4 | 5 | Female | 42 | 3 | >24 to ≤48 hr | Prednisolone–Placebo | Yes | No | 1 | Yes | 1 | Yes |


In [28]:
df_PatintEC.describe()

|   | Patient ID | Age | Baseline Score on House–Brackmann scale | 3-Month Score on House–Brackmann scale | 9-Month Score on House–Brackmann scale |
|---|---|---|---|---|---|
| count | 494.0 | 494.0 | 494.0 | 494.0 | 494.0 |
| mean | 247.5 | 44.868421 | 3.680162 | 1.340081 | 1.143725 |
| std | 142.749781 | 14.550357 | 1.131752 | 0.609037 | 0.46105 |
| min | 1.0 | 16.0 | 2.0 | 1.0 | 1.0 |
| 25% | 124.25 | 34.0 | 3.0 | 1.0 | 1.0 |
| 50% | 247.5 | 44.0 | 4.0 | 1.0 | 1.0 |
| 75% | 370.75 | 55.0 | 4.0 | 2.0 | 1.0 |
| max | 494.0 | 90.0 | 6.0 | 4.0 | 4.0 |


In [29]:
## Convert the dataframe to a list of dictionaries in preparation for creating the embeddings (vectorised data).
## Each row is stored in key-value format
df_PatintEC.to_dict('records')

[{'Patient ID': 1,
  'Sex': 'Female',
  'Age': 77,
  'Baseline Score on House–Brackmann scale': 6,
  'Time between onset of symptoms and start of treatment': 'Within 24 hr',
  'Treatment Group': 'Prednisolone–Placebo',
  'Received Prednisolone': 'Yes',
  'Received Acyclovir': 'No',
  '3-Month Score on House–Brackmann scale': 2,
  'Full Recovery in 3 Months': 'No',
  '9-Month Score on House–Brackmann scale': 2,
  'Full Recovery in 9 Months': 'No'},
 {'Patient ID': 2,
  'Sex': 'Female',
  'Age': 61,
  'Baseline Score on House–Brackmann scale': 6,
  'Time between onset of symptoms and start of treatment': 'Within 24 hr',
  'Treatment Group': 'Prednisolone–Placebo',
  'Received Prednisolone': 'Yes',
  'Received Acyclovir': 'No',
  '3-Month Score on House–Brackmann scale': 1,
  'Full Recovery in 3 Months': 'Yes',
  '9-Month Score on House–Brackmann scale': 1,
  'Full Recovery in 9 Months': 'Yes'},
 {'Patient ID': 3,
  'Sex': 'Female',
  'Age': 46,
  'Baseline Score on House–Brackmann scale'
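
Because the patient-level records are mostly integers and Yes/No flags, they carry little natural-language signal on their own. One option, sketched here as an assumption rather than the notebook's actual method, is to map the Yes/No columns to booleans and render each row as a sentence before embedding. The function `patient_to_sentence` and the sentence template are hypothetical.

```python
import pandas as pd

# First two rows of the head() output above, restricted to a few columns.
patients = pd.DataFrame({
    "Patient ID": [1, 2],
    "Sex": ["Female", "Female"],
    "Age": [77, 61],
    "Treatment Group": ["Prednisolone–Placebo", "Prednisolone–Placebo"],
    "Full Recovery in 9 Months": ["No", "Yes"],
})

# Map the Yes/No flag to a real boolean so it can be used in conditions.
patients["Full Recovery in 9 Months"] = patients["Full Recovery in 9 Months"].map(
    {"Yes": True, "No": False}
)

def patient_to_sentence(row):
    """Render one patient record as a sentence (hypothetical template)."""
    if row["Full Recovery in 9 Months"]:
        outcome = "recovered fully"
    else:
        outcome = "did not recover fully"
    return (
        f"Patient {row['Patient ID']} ({row['Sex']}, age {row['Age']}) in the "
        f"{row['Treatment Group']} group {outcome} within 9 months."
    )

sentences = [patient_to_sentence(r) for r in patients.to_dict("records")]
for s in sentences:
    print(s)
```

The template wording is a design choice; a different phrasing or a plain key-value layout may retrieve better depending on the embedding model.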