# Predictive Analysis on Clinical Trial Outcomes

#### Investigator
- [Nicholas S. McBride, PhD](https://www.linkedin.com/in/nsmcbride/)

## Project Objective
This project aims to develop predictive models of clinical trial outcomes based on the study design and implementation details available before any results are published. Leveraging data from [ClinicalTrials.gov](https://clinicaltrials.gov/), the goal is to forecast how a clinical trial will end, which is valuable in the biotechnology industry for risk assessment, resource allocation, and strategic decision-making.

## Section 1: Data Collection

### Data Sources
Data is sourced through the [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api), which provides a comprehensive database of clinical studies. The information structure and query syntax used in this project are guided by the [API documentation](https://clinicaltrials.gov/data-api/api) and [ESSIE query syntax](https://clinicaltrials.gov/find-studies/constructing-complex-search-queries), which was [developed specifically for the ClinicalTrials.gov project at the NLM](https://doi.org/10.1197/jamia.M2233) for in silico biomedical literature research.

#### [ClinicalTrials.gov](https://clinicaltrials.gov/) Reference Links

- [API documentation](https://clinicaltrials.gov/data-api/api)
- [Data structure](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
- [ESSIE query syntax](https://clinicaltrials.gov/find-studies/constructing-complex-search-queries)

### Inclusion Criteria
The scope is confined to clinical studies with specific recruitment and expanded access statuses. Clinical studies whose overall status indicates a terminal success or failure were selected; studies whose overall status indicates that they are active or still recruiting were excluded.

#### [Study Overall Status Definitions](https://clinicaltrials.gov/study-basics/glossary)

| Recruitment status | Description |
|-|-|
| Not yet recruiting | The study has not started recruiting participants. |
| Recruiting | The study is currently recruiting participants. |
| Enrolling by invitation | The study is selecting its participants from a population decided on by the researchers in advance. These studies are not open to everyone who meets the eligibility criteria but only to people in that particular population, who are specifically invited to participate. |
| Active, not recruiting | The study is ongoing, and participants are receiving an intervention or being examined, but potential participants are not currently being recruited or enrolled. |
| Suspended | The study has stopped early but may start again. |
| Terminated | The study has stopped early and will not start again. Participants are no longer being examined or treated. |
| Completed | The study has ended normally, and participants are no longer being examined or treated (that is, the last participant's last visit has occurred). |
| Withdrawn | The study stopped early, before enrolling its first participant. |
| Unknown | A study on ClinicalTrials.gov whose last known status was recruiting; not yet recruiting; or active, not recruiting but that has passed its completion date, and the status has not been last verified within the past 2 years. |

| Expanded access status | Description |
|-|-|
| Available | Expanded access is currently available for this investigational treatment, and patients who are not participants in the clinical study may be able to gain access to the drug, biologic, or medical device being studied. |
| No longer available | Expanded access was available for this intervention previously but is not currently available and will not be available in the future. |
| Temporarily not available | Expanded access is not currently available for this intervention but is expected to be available in the future. |
| Approved for marketing | The intervention has been approved by the U.S. Food and Drug Administration for use by the public. |

Based on this information, the scope was limited to studies with the following overall statuses:
- Suspended
- Terminated
- Completed
- Withdrawn
- Unknown
- Available
- No longer available
- Temporarily not available
- Approved for marketing
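
As a rough illustration of how this list maps onto an API query, the sketch below converts the glossary labels into the enum-style values that appear in the `filter.overallStatus` parameter later in this notebook. The helper names `to_api_value` and `build_status_filter` are hypothetical, and the upper-case-with-underscores convention is an assumption that holds for the in-scope labels only.

```python
# Glossary labels for the in-scope statuses listed above.
IN_SCOPE_LABELS = [
    'Suspended',
    'Terminated',
    'Completed',
    'Withdrawn',
    'Unknown',
    'Available',
    'No longer available',
    'Temporarily not available',
    'Approved for marketing',
]


def to_api_value(label):
    """
    Convert a glossary label to an enum-style value.
    Assumed convention: upper case with spaces replaced by underscores.
    This would not handle labels containing punctuation, such as 'Active, not recruiting'.
    """
    return label.upper().replace(' ', '_')


def build_status_filter(labels):
    """Join the converted labels into a single comma-separated filter value."""
    return ','.join(to_api_value(label) for label in labels)


status_filter = build_status_filter(IN_SCOPE_LABELS)
print(status_filter)
```

The resulting string contains the same nine values as the `filter.overallStatus` parameter used in the query cell, although in a different order.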

Examination of the API data revealed an additional undocumented status 'Withheld' encompassing 880 studies in the ClinicalTrials.gov database. Withheld studies contained little data and most information was redacted. These studies were excluded due to insufficient data. Further documentation reveals that "Withheld studies are a special case where a FDA-regulated device product is not yet approved or cleared by U.S. FDA".
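
Because the query in this notebook filters on an explicit list of statuses, Withheld studies should not be returned in the first place. If a defensive check on downloaded records is still wanted, it might look like the sketch below. The nested path `protocolSection.statusModule.overallStatus` and the value `'WITHHELD'` are assumptions about the v2 study record layout, and `overall_status` and `drop_withheld` are hypothetical helper names.

```python
def overall_status(study):
    """
    Return the overall status of a study record, or None if it is absent.
    The nested key path is an assumption about the API v2 JSON layout.
    """
    return (study.get('protocolSection', {})
                 .get('statusModule', {})
                 .get('overallStatus'))


def drop_withheld(studies):
    """Return only the records whose overall status is not 'WITHHELD' (assumed enum value)."""
    return [study for study in studies if overall_status(study) != 'WITHHELD']


# Toy records for illustration only; these are not real studies
toy_studies = [
    {'protocolSection': {'statusModule': {'overallStatus': 'COMPLETED'}}},
    {'protocolSection': {'statusModule': {'overallStatus': 'WITHHELD'}}},
    {'protocolSection': {}},  # record with no status information
]

kept = drop_withheld(toy_studies)
print(len(kept))
```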

### Imports and Functions

In [1]:
# Imports
import requests
import pandas as pd
from datetime import date, datetime
import pickle
import bz2

In [2]:
# Google Colab environment
# Determine if the runtime is a Google Colab environment
colab = 'google.colab' in str(get_ipython())

# Workspace settings
if colab:
    project_path = data_path = './'
else:
    project_path = '../'
    data_path = project_path + 'data/'

Here we define two functions for interacting with the [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api). Note that at the time of this project's completion in May 2024, v2 of the API had been introduced two months prior and v1 was being decommissioned, with intermittent outages. As such, existing wrappers built on v1 of the API were unreliable, and the following functions were written from scratch.

In [3]:
# Functions
def count_studies(url, parameters=None):
    """
    Submits an API request and returns only the total count of studies matching the supplied parameters.
    """

    # Copy the parameters so the caller's dict is not modified, then add 'countTotal': 'true'
    parameters = dict(parameters or {})
    parameters['countTotal'] = 'true'

    # Send request
    response = requests.get(url, params=parameters)
    response.raise_for_status()

    return response.json()['totalCount']

def request_all_studies(url, parameters=None):
    """
    Submits successive API requests until all pages have been returned and combines the output into a list of studies.
    """

    # Copy the parameters so the caller's dict is not modified, then add 'countTotal': 'true'
    parameters = dict(parameters or {})
    parameters['countTotal'] = 'true'

    # Initialize variables
    studies = []

    while True:

        response = requests.get(url, params=parameters)
        response.raise_for_status()

        # Note: `response.json()` converts the JSON payload to a Python dict.
        # Duplicate keys are possible in JSON but not in a dict, so later values overwrite earlier ones.
        # For the purpose of this analysis, we will treat the presence of the values as boolean, not as quantitative variables.
        page = response.json()

        # Keep the most recently reported total; not every page is guaranteed to include it
        if 'totalCount' in page:
            total_count = page['totalCount']

        studies.extend(page['studies'])
        print(f'Retreived studies {len(studies)} / {total_count} ({round(100 * len(studies) / total_count)}%)')

        # Stop once the API no longer returns a token for the next page
        if 'nextPageToken' not in page:
            break
        parameters['pageToken'] = page['nextPageToken']

    return studies
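
The token-based pagination used by `request_all_studies` can be illustrated without network access. The sketch below is a minimal stand-in, not the project's implementation: `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages only mimic the `studies` and `nextPageToken` keys used above.

```python
def collect_pages(fetch_page):
    """
    Call fetch_page(token) repeatedly, starting with token=None, and
    concatenate the 'studies' lists until no 'nextPageToken' is returned.
    """
    studies = []
    token = None
    while True:
        page = fetch_page(token)
        studies.extend(page['studies'])
        token = page.get('nextPageToken')
        if token is None:
            break
    return studies


# Fake three-page response, keyed by the token that requests each page
FAKE_PAGES = {
    None: {'studies': ['A', 'B'], 'nextPageToken': 't1'},
    't1': {'studies': ['C', 'D'], 'nextPageToken': 't2'},
    't2': {'studies': ['E']},  # last page: no token for a further page
}


def fake_fetch(token):
    """Stand-in for an HTTP request; returns the fake page for the given token."""
    return FAKE_PAGES[token]


all_studies = collect_pages(fake_fetch)
print(all_studies)  # ['A', 'B', 'C', 'D', 'E']
```

Passing the fetch step in as a function keeps the loop logic separate from the HTTP call, which makes the stopping condition easy to check in isolation.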

### Data Collection from [ClinicalTrials.gov](https://clinicaltrials.gov/)

#### Download Field Information

The [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api) provides an endpoint with database field value information and metadata. This proved helpful in understanding and working with the dataset. Data were saved using pickle object serialization and bzip2 compression for further analysis.

In [4]:
# Download API field information
field_info = requests.get('https://clinicaltrials.gov/api/v2/stats/field/values')
field_info.raise_for_status()
field_info = pd.json_normalize(field_info.json())

# Pickle field info
field_info.to_pickle(data_path + 'field_info.pkl.bz2', compression='bz2')

#### Download Study Data

Initial project work excluded studies based on `StartDate`, `CompletionDate`, and `Phase`. However, exploratory analysis showed that this had the unintended consequence of excluding all studies with Expanded Access statuses, such as 'Approved for marketing'. To keep these studies for exploratory analysis, the inclusion criteria were broadened to rely solely on `OverallStatus` for inclusion/exclusion.
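
For reference, the original filter can be assembled as a single ESSIE expression. The sketch below mirrors the commented-out parameters in the next cell; `build_advanced_filter` is a hypothetical helper name, and taking the cutoff date as an argument (rather than always using today's date) is a choice made here for reproducibility.

```python
from datetime import date


def build_advanced_filter(cutoff):
    """
    Build the ESSIE expression for the original inclusion criteria:
    a start date is present, the completion date is on or before the cutoff,
    and the phase is neither NA nor missing.
    """
    clauses = [
        'AREA[StartDate]NOT MISSING',
        f'AND AREA[CompletionDate]RANGE[MIN, {cutoff.isoformat()}]',
        'AND AREA[Phase]NOT (NA OR MISSING)',
    ]
    return ' '.join(clauses)


advanced_filter = build_advanced_filter(date(2024, 5, 22))
print(advanced_filter)
```

The returned string is what would be supplied as the `filter.advanced` query parameter.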

In [5]:
# Parameters to query '/studies' endpoint
api_server = 'https://clinicaltrials.gov/api/v2'
endpoint = '/studies'
url = api_server + endpoint

# Load list of in scope fields
with open(project_path + 'inscope_fields.txt') as file:
    fields = file.read().splitlines()
print(f'No. fields selected: {len(fields)}')

## Original parameters
# filter_advanced = ['AREA[StartDate]NOT MISSING',
#                    f'AND AREA[CompletionDate]RANGE[MIN, {date.today()}]',
#                    'AND AREA[Phase]NOT (NA OR MISSING)']

# params = {'format': 'json',
#           'fields': ','.join(fields),
#           'filter.overallStatus': 'COMPLETED,SUSPENDED,TERMINATED,WITHDRAWN,AVAILABLE,NO_LONGER_AVAILABLE,TEMPORARILY_NOT_AVAILABLE,APPROVED_FOR_MARKETING,UNKNOWN',
#           'filter.advanced': ' '.join(filter_advanced),
#           'pageSize': 1000}

# Revised parameters
params = {'format': 'json',
          'fields': ','.join(fields),
          'filter.overallStatus': 'COMPLETED,SUSPENDED,TERMINATED,WITHDRAWN,AVAILABLE,NO_LONGER_AVAILABLE,TEMPORARILY_NOT_AVAILABLE,APPROVED_FOR_MARKETING,UNKNOWN',
          'pageSize': 1000}

No. fields selected: 115


In [6]:
study_count = count_studies(url, params)
print(f'Count of studies in scope: {study_count}')

Count of studies in scope: 383135
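
With `pageSize` set to 1000, this count gives a quick estimate of how many paginated requests the full download needs. The arithmetic below assumes every page except the last is full.

```python
import math

study_count = 383135  # count reported above
page_size = 1000      # 'pageSize' value in the query parameters

# Each request returns at most `page_size` studies, so round up
n_requests = math.ceil(study_count / page_size)
print(n_requests)  # 384
```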


In [7]:
# Retrieve studies
studies = request_all_studies(url, parameters=params)

Retreived studies 1000 / 383135 (0%)
Retreived studies 2000 / 383135 (1%)
Retreived studies 3000 / 383135 (1%)
Retreived studies 4000 / 383135 (1%)
Retreived studies 5000 / 383135 (1%)
Retreived studies 6000 / 383135 (2%)
Retreived studies 7000 / 383135 (2%)
Retreived studies 8000 / 383135 (2%)
Retreived studies 9000 / 383135 (2%)
Retreived studies 10000 / 383135 (3%)
Retreived studies 11000 / 383135 (3%)
Retreived studies 12000 / 383135 (3%)
Retreived studies 13000 / 383135 (3%)
Retreived studies 14000 / 383135 (4%)
Retreived studies 15000 / 383135 (4%)
Retreived studies 16000 / 383135 (4%)
Retreived studies 17000 / 383135 (4%)
Retreived studies 18000 / 383135 (5%)
Retreived studies 19000 / 383135 (5%)
Retreived studies 20000 / 383135 (5%)
Retreived studies 21000 / 383135 (5%)
Retreived studies 22000 / 383135 (6%)
Retreived studies 23000 / 383135 (6%)
Retreived studies 24000 / 383135 (6%)
Retreived studies 25000 / 383135 (7%)
Retreived studies 26000 / 383135 (7%)
Retreived studies 270

#### Save Study Data to File

In [8]:
# Pickle the studies data
filename = data_path + datetime.now().strftime('%Y-%m-%dT%H%M%S') + '_studies_json.pkl.bz2'

with bz2.BZ2File(filename, 'w') as file:
    pickle.dump(studies, file)

print(f'Saved {study_count} study records to {filename}')

Saved 383135 study records to ../data/2024-05-22T142524_studies_json.pkl.bz2
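
To read the saved records back in a later session, the same two libraries can be used in reverse. The sketch below shows a round trip on a small stand-in object written to a temporary directory; `save_pickle_bz2` and `load_pickle_bz2` are hypothetical helper names, and the sample records are not real study data. Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import bz2
import os
import pickle
import tempfile


def save_pickle_bz2(obj, filename):
    """Serialize obj with pickle and write it with bzip2 compression."""
    with bz2.BZ2File(filename, 'w') as file:
        pickle.dump(obj, file)


def load_pickle_bz2(filename):
    """Read a bzip2-compressed pickle file and return the stored object."""
    with bz2.BZ2File(filename, 'r') as file:
        return pickle.load(file)


# Round trip on a small stand-in for the list of study records
sample = [{'id': 'example-1'}, {'id': 'example-2'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_studies_json.pkl.bz2')
    save_pickle_bz2(sample, path)
    restored = load_pickle_bz2(path)

print(restored == sample)  # True
```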
