# Cleaning clinical trials df 

**Fields to be kept and cleaned:**

- Title & ID
- Location (country)
- Location (state or city)
- Conditions
- Interventions
- Outcome measures 
- Status
- Age
- Gender
- Sponsor/collaborators
- Phases
- Enrollment
- Study Type
- Study Design
- Start Date
- Completion date 
- URL 


## 0. Setup

In [209]:
import pandas as pd 
import numpy as np

# Change option so that you can see all column/rownames displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [210]:
# Read in dataset
covid_trials_df = pd.read_csv("../data/covid_studies_092020.tsv", sep="\t")

covid_trials_df.head()

Unnamed: 0,Rank,NCT Number,Title,Acronym,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,Gender,Age,Phases,Enrollment,Funded Bys,Study Type,Study Designs,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents,URL
0,1,NCT04372602,Duvelisib to Combat COVID-19,,Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib|Procedure: Peripheral blood dr...,Overall survival|Length of hospital stay|Lengt...,Washington University School of Medicine|Veras...,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other|Industry,Interventional,Allocation: Randomized|Intervention Model: Sin...,202007009,"September 30, 2020","October 31, 2021","March 31, 2022","May 4, 2020",,"September 10, 2020","Washington University School of Medicine, Sain...",,https://ClinicalTrials.gov/show/NCT04372602
1,2,NCT04364698,Observational Cohort of COVID-19 Patients at R...,COVID-RPC,Recruiting,No Results Available,COVID-19,,"clinical, biological and radiological characte...",Assistance Publique - Hôpitaux de Paris,All,"18 Years and older (Adult, Older Adult)",,500.0,Other,Observational,Observational Model: Cohort|Time Perspective: ...,20SBS-COVID-RPC,"May 7, 2020",June 2020,June 2020,"April 28, 2020",,"May 14, 2020","Department of Infectiology, Raymond Poincaré H...",,https://ClinicalTrials.gov/show/NCT04364698
2,3,NCT04482621,Decitabine for Coronavirus (COVID-19) Pneumoni...,DART,Recruiting,No Results Available,COVID-19,Drug: Decitabine|Other: Placebo Saline,Change in clinical state as assessed by a 6-po...,Johns Hopkins University,All,"18 Years and older (Adult, Older Adult)",Phase 2,40.0,Other,Interventional,Allocation: Randomized|Intervention Model: Par...,IRB00247544,"August 31, 2020",January 2021,July 2021,"July 22, 2020",,"August 18, 2020","Johns Hopkins University, Baltimore, Maryland,...",,https://ClinicalTrials.gov/show/NCT04482621
3,4,NCT04459637,COVID-19 Surveillance Based on Smart Wearable ...,COVID-19SWD,Not yet recruiting,No Results Available,COVID-19,,Deterioration of the condition|Mortality|The i...,Peking University First Hospital,All,"18 Years to 75 Years (Adult, Older Adult)",,200.0,Other,Observational,Observational Model: Cohort|Time Perspective: ...,2020055-0615,"July 1, 2020","March 10, 2021","March 10, 2021","July 7, 2020",,"July 7, 2020","Peking University First Hospital, Beijing, Bei...",,https://ClinicalTrials.gov/show/NCT04459637
4,5,NCT04425538,A Phase 2 Trial of Infliximab in Coronavirus D...,,Recruiting,No Results Available,COVID-19,Drug: Infliximab,Time to improvement in oxygenation|28-day mort...,Tufts Medical Center|National Institutes of He...,All,"18 Years and older (Adult, Older Adult)",Phase 2,17.0,Other|NIH,Interventional,Allocation: N/A|Intervention Model: Single Gro...,STUDY00000564,"June 1, 2020",September 2020,December 2020,"June 11, 2020",,"June 11, 2020","Tufts Medical Center, Boston, Massachusetts, U...",,https://ClinicalTrials.gov/show/NCT04425538


## 1. Only keep columns of interest 

In [211]:
covid_trials_df.columns

Index(['Rank', 'NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',
       'Conditions', 'Interventions', 'Outcome Measures',
       'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
       'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
       'Primary Completion Date', 'Completion Date', 'First Posted',
       'Results First Posted', 'Last Update Posted', 'Locations',
       'Study Documents', 'URL'],
      dtype='object')

In [212]:
columns_of_interest = ['NCT Number', 
                       'Title', 
                       'Locations',
                       'Status', 
                       'Study Results',
                       'Conditions', 
                       'Interventions', 
                       'Outcome Measures', 
                       'Sponsor/Collaborators', 
                       'Gender', 
                       'Age', 
                       'Phases', 
                       'Enrollment',
                       'Funded Bys', 
                       'Study Type', 
                       'Study Designs',
                       'Start Date',
                       'Completion Date',
                       'First Posted',
                       'Last Update Posted',
                       'URL']

# .copy() gives an independent frame, so later column assignments do not act on a slice
covid_trials_df = covid_trials_df[columns_of_interest].copy()

Look at unique values in each column: 

In [213]:
for col in covid_trials_df:
    print(covid_trials_df[col].unique())

['NCT04372602' 'NCT04364698' 'NCT04482621' ... 'NCT04386876' 'NCT04276987'
 'NCT03474965']
['Duvelisib to Combat COVID-19'
 'Observational Cohort of COVID-19 Patients at Raymond-Poincare'
 'Decitabine for Coronavirus (COVID-19) Pneumonia- Acute Respiratory Distress Syndrome (ARDS) Treatment: DART Trial'
 ...
 'Bioequivalence Study of Lopinavir/Ritonavir 200/50 mg Film Tablet (World Medicine Ilac, Turkey) Under Fasting Conditions'
 'A Pilot Clinical Study on Inhalation of Mesenchymal Stem Cells Exosomes Treating Severe Novel Coronavirus Pneumonia'
 'Study of Dose Confirmation and Safety of Crizanlizumab in Pediatric Sickle Cell Disease Patients']
['Washington University School of Medicine, Saint Louis, Missouri, United States'
 'Department of Infectiology, Raymond Poincaré Hospital, APHP, Garches, France'
 'Johns Hopkins University, Baltimore, Maryland, United States' ...
 'ICDDRB, Dhaka, Bangladesh|Kinshasa School of Public Health, Kinshasa, Congo, The Democratic Republic of the|Instit

## 2. Standardize date format

The dates are inconsistent: some use the format "Month Day, Year" and others use "Month Year". First, look at a sample of the values in each date column:

In [214]:
date_columns = ['Start Date',                       
                'Completion Date',
                'First Posted',
                'Last Update Posted' ]

for d in date_columns:
    print(covid_trials_df[d].unique()[0:10])

['September 30, 2020' 'May 7, 2020' 'August 31, 2020' 'July 1, 2020'
 'June 1, 2020' 'October 15, 2020' 'July 2020' 'July 21, 2020'
 'April 2020' 'May 25, 2020']
['March 31, 2022' 'June 2020' 'July 2021' 'March 10, 2021' 'December 2020'
 'April 15, 2021' 'October 2020' 'September 2021' 'March 2021'
 'September 25, 2020']
['May 4, 2020' 'April 28, 2020' 'July 22, 2020' 'July 7, 2020'
 'June 11, 2020' 'September 14, 2020' 'July 1, 2020' 'June 30, 2020'
 'April 3, 2020' 'June 16, 2020']
['September 10, 2020' 'May 14, 2020' 'August 18, 2020' 'July 7, 2020'
 'June 11, 2020' 'September 14, 2020' 'July 2, 2020' 'September 1, 2020'
 'May 4, 2020' 'June 16, 2020']


For consistency (and because the *exact* day doesn't matter much here), let's drop the day and keep only the month and year:

In [215]:
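for d in date_columns:
    # Only keep month and year: drop the day ("September 30, 2020" -> "September 2020").
    # Missing values stay NaN instead of turning into the string "nan nan".
    covid_trials_df[d] = covid_trials_df[d].str.replace(r"\s+\d{1,2},", "", regex=True)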

covid_trials_df[date_columns].head()

Unnamed: 0,Start Date,Completion Date,First Posted,Last Update Posted
0,September 2020,March 2022,May 2020,September 2020
1,May 2020,June 2020,April 2020,May 2020
2,August 2020,July 2021,July 2020,August 2020
3,July 2020,March 2021,July 2020,July 2020
4,June 2020,December 2020,June 2020,June 2020


These can now be easily converted into datetime values if needed (note that the format is "%B %Y").
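As a minimal sketch of that conversion, the example below parses a small made-up series with `pd.to_datetime` and the "%B %Y" format. The series `toy_dates` is hypothetical and only stands in for one of the date columns; `errors="coerce"` is an assumption about how unparseable entries should be handled (they become `NaT`).

```python
import pandas as pd

# Hypothetical stand-in for one cleaned date column
toy_dates = pd.Series(["September 2020", "May 2020", "nan nan"], name="Start Date")

# "%B" is the full month name, "%Y" the four-digit year.
# With no day in the string, the parsed date falls on the 1st of the month.
# errors="coerce" turns entries that do not match the format into NaT.
parsed = pd.to_datetime(toy_dates, format="%B %Y", errors="coerce")

print(parsed)
```

Applied to the real data, the same call could be run on each column in `date_columns`.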

In [216]:
covid_trials_df.head()

Unnamed: 0,NCT Number,Title,Locations,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,Gender,Age,Phases,Enrollment,Funded Bys,Study Type,Study Designs,Start Date,Completion Date,First Posted,Last Update Posted,URL
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib|Procedure: Peripheral blood dr...,Overall survival|Length of hospital stay|Lengt...,Washington University School of Medicine|Veras...,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other|Industry,Interventional,Allocation: Randomized|Intervention Model: Sin...,September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602
1,NCT04364698,Observational Cohort of COVID-19 Patients at R...,"Department of Infectiology, Raymond Poincaré H...",Recruiting,No Results Available,COVID-19,,"clinical, biological and radiological characte...",Assistance Publique - Hôpitaux de Paris,All,"18 Years and older (Adult, Older Adult)",,500.0,Other,Observational,Observational Model: Cohort|Time Perspective: ...,May 2020,June 2020,April 2020,May 2020,https://ClinicalTrials.gov/show/NCT04364698
2,NCT04482621,Decitabine for Coronavirus (COVID-19) Pneumoni...,"Johns Hopkins University, Baltimore, Maryland,...",Recruiting,No Results Available,COVID-19,Drug: Decitabine|Other: Placebo Saline,Change in clinical state as assessed by a 6-po...,Johns Hopkins University,All,"18 Years and older (Adult, Older Adult)",Phase 2,40.0,Other,Interventional,Allocation: Randomized|Intervention Model: Par...,August 2020,July 2021,July 2020,August 2020,https://ClinicalTrials.gov/show/NCT04482621
3,NCT04459637,COVID-19 Surveillance Based on Smart Wearable ...,"Peking University First Hospital, Beijing, Bei...",Not yet recruiting,No Results Available,COVID-19,,Deterioration of the condition|Mortality|The i...,Peking University First Hospital,All,"18 Years to 75 Years (Adult, Older Adult)",,200.0,Other,Observational,Observational Model: Cohort|Time Perspective: ...,July 2020,March 2021,July 2020,July 2020,https://ClinicalTrials.gov/show/NCT04459637
4,NCT04425538,A Phase 2 Trial of Infliximab in Coronavirus D...,"Tufts Medical Center, Boston, Massachusetts, U...",Recruiting,No Results Available,COVID-19,Drug: Infliximab,Time to improvement in oxygenation|28-day mort...,Tufts Medical Center|National Institutes of He...,All,"18 Years and older (Adult, Older Adult)",Phase 2,17.0,Other|NIH,Interventional,Allocation: N/A|Intervention Model: Single Gro...,June 2020,December 2020,June 2020,June 2020,https://ClinicalTrials.gov/show/NCT04425538


## 3. Extract consistent location information 

The location information is very inconsistent. For example: 

In [217]:
# Look at first 5 location entries: 
for i in covid_trials_df["Locations"][0:5]:
    print(i)

Washington University School of Medicine, Saint Louis, Missouri, United States
Department of Infectiology, Raymond Poincaré Hospital, APHP, Garches, France
Johns Hopkins University, Baltimore, Maryland, United States
Peking University First Hospital, Beijing, Beijing, China
Tufts Medical Center, Boston, Massachusetts, United States


For information and mapping purposes (at least initially) we probably just want:

- The institution
- The city (non-US) or state (US)
- The country

So, let's extract that: 

### Country

In [218]:
# Add a countries column
covid_trials_df.loc[:,"Location_Country"] = [str(i).split(",")[-1].strip() for i in list(covid_trials_df["Locations"].copy())]

# Inspect
covid_trials_df["Location_Country"].unique()

array(['United States', 'France', 'China', 'Netherlands', 'nan',
       'United Kingdom', 'Islamic Republic of', 'Turkey', 'Switzerland',
       'Italy', 'Spain', 'Russian Federation', 'Malaysia', 'Colombia',
       'Portugal', 'Republic of', 'Puerto Rico', 'Israel', 'Pakistan',
       'Mexico', 'Egypt', 'Brazil', 'Peru', 'Saudi Arabia', 'Belgium',
       'Canada', 'Singapore', 'Sweden', 'Japan', 'South Africa',
       'Indonesia', 'Senegal', 'Argentina', 'Germany', 'India', 'Denmark',
       'Australia', 'Poland', 'Hong Kong', 'Jordan', 'Qatar', 'Iceland',
       'Ukraine', 'Austria', 'Dominican Republic', 'Bahrain',
       'United Arab Emirates', 'Romania', 'Greece', 'Bangladesh',
       'Bosnia and Herzegovina', 'Taiwan', 'Slovenia', 'Sudan', 'Monaco',
       'Nigeria', 'Costa Rica', 'Chile', 'Norway', 'Vietnam', 'Mongolia',
       'Hungary', 'Guinea-Bissau', 'Kuwait', 'Croatia',
       'The Democratic Republic of the', 'Zimbabwe', 'Thailand',
       'North Macedonia', 'Czechia', 'E

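There are some issues. For example, studies in Iran and Congo are listed as "Islamic Republic of" and "The Democratic Republic of the" instead of their respective countries, because the full country names contain a comma (e.g. "Iran, Islamic Republic of") and only the text after the last comma is kept: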

In [219]:
for i in covid_trials_df.loc[covid_trials_df["Location_Country"].isin(["Islamic Republic of",
                                                                      "The Democratic Republic of the"]),
                                                                      "Locations"]:
    print(i)

Regenerative Medicine Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran, Kermanshah, Iran, Islamic Republic of
Kings County Hospital Center, Brooklyn, NY, USA, Albertson, New York, United States|Faculty of Medicine, Zagazig University, Zagazig, Sharkia, Egypt|Al-Azhar Univerisity, Cairo, Egypt|Ahvaz Imam hospital, Ahvaz, Iran, Islamic Republic of
Medical Biology Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran, Kermanshah, Iran, Islamic Republic of
Shahid Modarres Hospital, Shahid Beheshti University of Medical Sciences and Health Services, Tehran, Iran, Islamic Republic of
Shahid Modarres Hospital, Shahid Beheshti University of Medical Sciences and Health Services, Tehran, Iran, Islamic Republic of
Loghman Hakim Hospital, Shahid Beheshti University of Medical Sciences and Health Services, Tehran, Iran, Islamic Republic of
Loghman Hakim Hospital, Shahid Beheshti University of Medical Sciences and Health Services, Tehran, Iran, Isla

In [220]:
# Fix country names that were truncated by the comma split.
# Assigning through a single .loc call avoids the chained-indexing SettingWithCopyWarning.
covid_trials_df.loc[covid_trials_df["Location_Country"] == "Islamic Republic of", "Location_Country"] = "Iran"
covid_trials_df.loc[covid_trials_df["Location_Country"] == "The Democratic Republic of the", "Location_Country"] = "Congo"

# Inspect
covid_trials_df["Location_Country"].unique()


array(['United States', 'France', 'China', 'Netherlands', 'nan',
       'United Kingdom', 'Iran', 'Turkey', 'Switzerland', 'Italy',
       'Spain', 'Russian Federation', 'Malaysia', 'Colombia', 'Portugal',
       'Republic of', 'Puerto Rico', 'Israel', 'Pakistan', 'Mexico',
       'Egypt', 'Brazil', 'Peru', 'Saudi Arabia', 'Belgium', 'Canada',
       'Singapore', 'Sweden', 'Japan', 'South Africa', 'Indonesia',
       'Senegal', 'Argentina', 'Germany', 'India', 'Denmark', 'Australia',
       'Poland', 'Hong Kong', 'Jordan', 'Qatar', 'Iceland', 'Ukraine',
       'Austria', 'Dominican Republic', 'Bahrain', 'United Arab Emirates',
       'Romania', 'Greece', 'Bangladesh', 'Bosnia and Herzegovina',
       'Taiwan', 'Slovenia', 'Sudan', 'Monaco', 'Nigeria', 'Costa Rica',
       'Chile', 'Norway', 'Vietnam', 'Mongolia', 'Hungary',
       'Guinea-Bissau', 'Kuwait', 'Croatia', 'Congo', 'Zimbabwe',
       'Thailand', 'North Macedonia', 'Czechia', 'Ethiopia', 'Kenya',
       'Martinique', 'Luxemb
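
The truncated names come from taking only the text after the last comma. One alternative, sketched below, is to rejoin the last two pieces whenever the final piece ends in "of" or "of the", which would also cover the "Republic of" value that is still present in the output. The function name `extract_country` and the suffix rule are assumptions for illustration, not part of the original notebook. Like the original code, it only looks at the last site of a multi-site entry.

```python
def extract_country(location):
    """Return the country of the last listed site, or None if missing."""
    if not isinstance(location, str):
        return None  # NaN / missing location

    # Multi-site entries are separated by "|"; mirror the original
    # behaviour and use the last site only.
    last_site = location.split("|")[-1]
    parts = [p.strip() for p in last_site.split(",")]
    country = parts[-1]

    # Names such as "Iran, Islamic Republic of" contain a comma themselves,
    # so put the two halves back together (assumed heuristic).
    if len(parts) > 1 and country.endswith((" of", " of the")):
        country = f"{parts[-2]}, {country}"
    return country


examples = [
    "Tufts Medical Center, Boston, Massachusetts, United States",
    "Ahvaz Imam hospital, Ahvaz, Iran, Islamic Republic of",
    float("nan"),
]
for loc in examples:
    print(extract_country(loc))
```

The full names could then be mapped to shorter labels in one place, which avoids guessing which country a bare "Republic of" belongs to.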

### City or State

In [221]:
# Add a city/state column (the second-to-last comma-separated piece)
covid_trials_df.loc[:,"Location_City_or_State"] = [str(i).split(",")[-2].strip() 
                                                   if len(str(i).split(",")) > 1 
                                                   else str(i)
                                                   for i in list(covid_trials_df["Locations"].copy())]

# Inspect
covid_trials_df["Location_City_or_State"].unique()

array(['Missouri', 'Garches', 'Maryland', 'Beijing', 'Massachusetts',
       'Limburg', 'nan', 'London', 'Iran', 'Istanbul', 'Canton De Genève',
       'Milano', 'Zaragoza', 'Wisconsin', 'St. Petersburg',
       'Kuala Lumpur', 'Bogotá', 'Shanghai', 'Oporto',
       'Boulogne-Billancourt', 'Nebraska', 'Utah', 'Connecticut',
       'Louisiana', 'Korea', 'Ponce', 'Ohio', 'New York', 'Mississippi',
       'Hubei', 'Guangdong', "Be'er Sheva", 'Lahore', 'North Carolina',
       'Michigan', 'Illinois', 'Valencia', 'Minnesota', 'Marseille',
       'Nuevo Leon', 'Qinā', 'Sao Paulo', 'New Jersey', 'Siena',
       'Créteil', 'Lima', 'Paris', 'Riyadh', 'Arkansas', 'Cairo',
       'Hasselt', 'Kansas', 'Yaroslavl', 'São Paulo', 'Toulon',
       'Pennsylvania', 'Mantova', 'Smolensk', 'Alberta', 'Lombardia',
       'SP', 'Jiangsu', 'Nova Scotia', 'Singapore', 'Hong Kong', 'Madrid',
       'Québec', 'Ile-de-France', 'Danderyd', 'Rio de Janeiro', 'Milan',
       'Assiut', 'Osaka', 'Brussels', 'Nîmes', 

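Most values look fine, but a few are not cities or states: "Iran" and "Korea", for example, most likely appear because the country name itself contains a comma, so the second-to-last piece is part of the country. City, state, and region names are hard to check by hand, so we'll leave them for now and probably edit them later.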

### Institution

In [222]:
covid_trials_df.loc[:,"Location_Institution"] = [str(i).split(",")[0].strip()  
                                                   for i in list(covid_trials_df["Locations"].copy())]

# inspect
covid_trials_df["Location_Institution"].unique()

array(['Washington University School of Medicine',
       'Department of Infectiology', 'Johns Hopkins University', ...,
       'ICDDRB',
       'Ruijin Hospital Shanghai Jiao Tong University School of Medicine',
       'University of Alabama 1600 7th ave'], dtype=object)

## 4. Split fields with multiple entries 

In [223]:
cols_to_explode = ["Interventions", 
                   "Outcome Measures", 
                   "Sponsor/Collaborators",
                   "Funded Bys",
                   "Study Type",
                   "Study Designs"]
cols_not_to_explode = [i for i in list(covid_trials_df.columns) if i not in cols_to_explode]

# Convert columns with multiple fields into lists so they can be exploded 
for lst_col in cols_to_explode: 
    print(lst_col)
    # split by delimiter
    covid_trials_df = covid_trials_df.assign(**{lst_col:covid_trials_df[lst_col].str.split('|')})
    
    # explode delimited column
    covid_trials_df = covid_trials_df.explode(lst_col)

Interventions
Outcome Measures
Sponsor/Collaborators
Funded Bys
Study Type
Study Designs


In [224]:
covid_trials_df.head()

Unnamed: 0,NCT Number,Title,Locations,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,Gender,Age,Phases,Enrollment,Funded Bys,Study Type,Study Designs,Start Date,Completion Date,First Posted,Last Update Posted,URL,Location_Country,Location_City_or_State,Location_Institution
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib,Overall survival,Washington University School of Medicine,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other,Interventional,Allocation: Randomized,September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602,United States,Missouri,Washington University School of Medicine
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib,Overall survival,Washington University School of Medicine,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other,Interventional,Intervention Model: Single Group Assignment,September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602,United States,Missouri,Washington University School of Medicine
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib,Overall survival,Washington University School of Medicine,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other,Interventional,"Masking: Triple (Participant, Care Provider, I...",September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602,United States,Missouri,Washington University School of Medicine
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib,Overall survival,Washington University School of Medicine,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Other,Interventional,Primary Purpose: Treatment,September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602,United States,Missouri,Washington University School of Medicine
0,NCT04372602,Duvelisib to Combat COVID-19,"Washington University School of Medicine, Sain...",Not yet recruiting,No Results Available,COVID-19,Drug: Duvelisib,Overall survival,Washington University School of Medicine,All,"18 Years and older (Adult, Older Adult)",Phase 2,28.0,Industry,Interventional,Allocation: Randomized,September 2020,March 2022,May 2020,September 2020,https://ClinicalTrials.gov/show/NCT04372602,United States,Missouri,Washington University School of Medicine
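
Exploding several columns one after another produces every combination of their values, which is why the same trial (index 0) repeats in the rows above. The sketch below reproduces the effect on a tiny made-up frame; the trial IDs and values are hypothetical, and the final `reset_index` is an optional extra step rather than something the original cell does.

```python
import pandas as pd

# Made-up data: TRIAL-A has 2 interventions and 2 funders, TRIAL-B has 1 of each
toy = pd.DataFrame({
    "NCT Number": ["TRIAL-A", "TRIAL-B"],
    "Interventions": ["Drug: X|Other: Placebo", "Drug: Y"],
    "Funded Bys": ["Other|Industry", "NIH"],
})

for col in ["Interventions", "Funded Bys"]:
    # Split the "|"-delimited string into a list, then give each item its own row
    toy = toy.assign(**{col: toy[col].str.split("|")})
    toy = toy.explode(col)

# explode() repeats the original index; reset it to get unique row labels
toy = toy.reset_index(drop=True)

# Rows per trial = product of the list lengths (2 x 2 for TRIAL-A, 1 x 1 for TRIAL-B)
rows_per_trial = toy.groupby("NCT Number").size()
print(toy)
print(rows_per_trial)
```

Any per-trial count or sum computed on the exploded frame needs to account for this multiplication, for example by de-duplicating on "NCT Number" first.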


## 5. Convert all string columns to upper case 

Upper-case every string value for consistency, since capitalization varies across entries. Note that this also upper-cases the URL column.

In [225]:
# Note: DataFrame.applymap is deprecated in pandas >= 2.1 in favor of DataFrame.map
covid_trials_df = covid_trials_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)

In [226]:
covid_trials_df.head()

Unnamed: 0,NCT Number,Title,Locations,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,Gender,Age,Phases,Enrollment,Funded Bys,Study Type,Study Designs,Start Date,Completion Date,First Posted,Last Update Posted,URL,Location_Country,Location_City_or_State,Location_Institution
0,NCT04372602,DUVELISIB TO COMBAT COVID-19,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE, SAIN...",NOT YET RECRUITING,NO RESULTS AVAILABLE,COVID-19,DRUG: DUVELISIB,OVERALL SURVIVAL,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE,ALL,"18 YEARS AND OLDER (ADULT, OLDER ADULT)",PHASE 2,28.0,OTHER,INTERVENTIONAL,ALLOCATION: RANDOMIZED,SEPTEMBER 2020,MARCH 2022,MAY 2020,SEPTEMBER 2020,HTTPS://CLINICALTRIALS.GOV/SHOW/NCT04372602,UNITED STATES,MISSOURI,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE
0,NCT04372602,DUVELISIB TO COMBAT COVID-19,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE, SAIN...",NOT YET RECRUITING,NO RESULTS AVAILABLE,COVID-19,DRUG: DUVELISIB,OVERALL SURVIVAL,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE,ALL,"18 YEARS AND OLDER (ADULT, OLDER ADULT)",PHASE 2,28.0,OTHER,INTERVENTIONAL,INTERVENTION MODEL: SINGLE GROUP ASSIGNMENT,SEPTEMBER 2020,MARCH 2022,MAY 2020,SEPTEMBER 2020,HTTPS://CLINICALTRIALS.GOV/SHOW/NCT04372602,UNITED STATES,MISSOURI,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE
0,NCT04372602,DUVELISIB TO COMBAT COVID-19,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE, SAIN...",NOT YET RECRUITING,NO RESULTS AVAILABLE,COVID-19,DRUG: DUVELISIB,OVERALL SURVIVAL,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE,ALL,"18 YEARS AND OLDER (ADULT, OLDER ADULT)",PHASE 2,28.0,OTHER,INTERVENTIONAL,"MASKING: TRIPLE (PARTICIPANT, CARE PROVIDER, I...",SEPTEMBER 2020,MARCH 2022,MAY 2020,SEPTEMBER 2020,HTTPS://CLINICALTRIALS.GOV/SHOW/NCT04372602,UNITED STATES,MISSOURI,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE
0,NCT04372602,DUVELISIB TO COMBAT COVID-19,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE, SAIN...",NOT YET RECRUITING,NO RESULTS AVAILABLE,COVID-19,DRUG: DUVELISIB,OVERALL SURVIVAL,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE,ALL,"18 YEARS AND OLDER (ADULT, OLDER ADULT)",PHASE 2,28.0,OTHER,INTERVENTIONAL,PRIMARY PURPOSE: TREATMENT,SEPTEMBER 2020,MARCH 2022,MAY 2020,SEPTEMBER 2020,HTTPS://CLINICALTRIALS.GOV/SHOW/NCT04372602,UNITED STATES,MISSOURI,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE
0,NCT04372602,DUVELISIB TO COMBAT COVID-19,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE, SAIN...",NOT YET RECRUITING,NO RESULTS AVAILABLE,COVID-19,DRUG: DUVELISIB,OVERALL SURVIVAL,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE,ALL,"18 YEARS AND OLDER (ADULT, OLDER ADULT)",PHASE 2,28.0,INDUSTRY,INTERVENTIONAL,ALLOCATION: RANDOMIZED,SEPTEMBER 2020,MARCH 2022,MAY 2020,SEPTEMBER 2020,HTTPS://CLINICALTRIALS.GOV/SHOW/NCT04372602,UNITED STATES,MISSOURI,WASHINGTON UNIVERSITY SCHOOL OF MEDICINE


## 6. Set null values to NaN

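Some missing values are stored as the strings "nan" or "NaN" rather than as true nulls. Because step 5 upper-cased every string, these now mostly appear as "NAN", so that spelling has to be replaced as well.

In [227]:
# Also match "NAN" (after upper-casing) and "NAN NAN", which a missing date
# can become if it was stringified during the month/year cleanup
covid_trials_df = covid_trials_df.replace(["nan", "NaN", "NAN", "NAN NAN"], np.nan)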
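
Because step 5 upper-cases every string, a placeholder written as "nan" reads "NAN" by the time this step runs. The sketch below, on made-up data, replaces all three spellings at once and then counts nulls per column as a quick check. The list `placeholders` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing real values, string placeholders, and a true null
toy = pd.DataFrame({
    "Location_Country": ["FRANCE", "NAN", "nan"],
    "Phases": ["PHASE 2", None, "NaN"],
})

# Every spelling of the placeholder that should count as missing
placeholders = ["nan", "NaN", "NAN"]
cleaned = toy.replace(placeholders, np.nan)

# Count nulls per column to confirm nothing was left as a string
null_counts = cleaned.isna().sum()
print(null_counts)
```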

## 7. Write cleaned data frame to file

In [228]:
# index=False: after the explode in step 4 the index only holds repeated row labels
covid_trials_df.to_csv("../data/cleaned_covid_studies_092020.tsv", sep="\t", index=False)