Draft conversion of ORCHID_COVID19.ipynb from R to Python.

# ORCHID Clinical Trial: statistical analysis reproduction

# Version 1.0

This notebook reproduces the statistical analysis of the ORCHID clinical trial. The results were published in JAMA on November 9, 2020: ["Effect of Hydroxychloroquine on Clinical Status at 14 Days in Hospitalized Patients With COVID-19"](https://jamanetwork.com/journals/jama/fullarticle/2772922). The statistical analysis plan can be found on [clinicaltrials.gov](https://clinicaltrials.gov/ct2/show/NCT04332991?term=orchid&cond=Covid19&cntry=US&draw=2&rank=1).

The clinical trial was conducted between April and July 2020 and was stopped for futility before enrollment was complete, having found no difference in efficacy between hydroxychloroquine and placebo. This notebook reproduces the clinical trial results based on the trial protocol and the investigators' original source code.

# Data Access using PIC-SURE API

User authentication relies on a security token, which is passed to the API through the token.txt file (to be created by the user). To learn how to obtain your security token, see [the README of the PIC-SURE API GitHub repo](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst).
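
A minimal sketch of reading the token defensively, assuming it is stored as plain text in `token.txt` in the working directory. The helper name `read_token` is hypothetical and not part of the PIC-SURE client; it only strips surrounding whitespace (editors often append a newline) and fails early when the file is missing or empty.

```python
from pathlib import Path


def read_token(path="token.txt"):
    """Return the security token stored in `path`, without surrounding whitespace.

    `read_token` is a hypothetical helper for illustration, not a PIC-SURE function.
    """
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(
            f"{path} not found: create it and paste your PIC-SURE token inside"
        )
    # Editors often append a newline; strip it so the token is sent unchanged.
    token = token_path.read_text().strip()
    if not token:
        raise ValueError(f"{path} is empty")
    return token
```

Failing at this step, rather than at connection time, makes a missing or empty token file easier to diagnose.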

# ORCHID Clinical Trial

ORCHID is a multicenter, double-blind, randomized clinical trial conducted to assess the efficacy of hydroxychloroquine in the treatment of COVID-19 in hospitalized patients.

NHLBI has made the data available to all authorized investigators. This notebook therefore enables anyone with authorized credentials to reproduce the ORCHID clinical trial results by showing how to:
1. Access the data using the PIC-SURE API
2. Reproduce the results of this study using the open-source Python programming language

# Overview of the statistical analysis plan

The primary outcome is the COVID-19 Outcome Scale (COS) assessed at 14 days. The scale comprises 7 levels, as follows:
- 1, Dead
- 2, Hospitalized on invasive mechanical ventilation or ECMO
- 3, Hospitalized on non-invasive ventilation or high flow nasal cannula
- 4, Hospitalized on supplemental oxygen
- 5, Hospitalized not on supplemental oxygen
- 6, Not hospitalized with limitation in activity (continued symptoms)
- 7, Not hospitalized without limitation in activity (no symptoms)

This scale will also be assessed at different timepoints as secondary outcomes: day-3, day-7, and day-28.

This scale will be treated as an ordered factor, and thus this outcome will be analyzed using a proportional odds regression model. 
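
To make the proportional odds assumption concrete, the sketch below computes the probability of each COS level from a set of cut-points and a single linear predictor, using only the standard library. The parameterization `P(Y <= k) = expit(alpha_k - eta)` is one common convention and an assumption here; the cut-point values and the effect size are invented for illustration and are not estimates from the trial.

```python
import math


def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(cutpoints, eta):
    """Return P(Y = k) for k = 1..K under a proportional odds model.

    cutpoints: increasing thresholds alpha_1 < ... < alpha_{K-1}
    eta: linear predictor (larger values shift mass toward higher levels)
    Assumed convention: P(Y <= k) = expit(alpha_k - eta).
    """
    cumulative = [expit(alpha - eta) for alpha in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        # Each category probability is a difference of adjacent cumulative ones.
        probs.append(c - previous)
        previous = c
    return probs


# Illustrative cut-points for a 7-level scale (made-up values, not trial estimates).
cutpoints = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
placebo = category_probabilities(cutpoints, eta=0.0)
treated = category_probabilities(cutpoints, eta=0.5)  # hypothetical log odds ratio
```

Under this convention the same shift `eta` applies at every cut-point, so the odds of being above any given level are multiplied by `exp(eta)`. That single common odds ratio is the quantity a proportional odds model estimates for the treatment effect.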

Other secondary outcomes will be considered. Death, and the composite of death and ECMO, will be analyzed using logistic regression models. Time to recovery (patient without oxygen supplementation, i.e., COS level 6 or 7) and time to discharge will be analyzed using survival models. Support-free days (hospital, oxygen, ICU, ventilation, and vasopressor) will be treated as ordered factors and analyzed using proportional odds regression models.
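
As a sketch of how the time-to-recovery outcome could be derived, the function below scans a patient's COS assessments and returns the first study day at which the level is 6 or 7, following the definition above. The input layout (a dict mapping study day to COS level) and the function name are assumptions for illustration. Patients who never reach those levels are returned as `None`, which a survival model would treat as censored.

```python
def days_to_recovery(cos_by_day, recovered_levels=(6, 7)):
    """Return the first study day with a COS level in `recovered_levels`.

    cos_by_day: dict {study_day: cos_level}; a hypothetical layout, not the
    structure of the PIC-SURE export.
    Returns None when no recovery is observed (a censored observation).
    """
    for day in sorted(cos_by_day):
        if cos_by_day[day] in recovered_levels:
            return day
    return None


# Made-up patient: on supplemental oxygen at day 3, out of hospital by day 7.
example = {3: 4, 7: 6, 14: 7, 28: 7}
print(days_to_recovery(example))  # 7
```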

All these outcomes will be analyzed using multivariable models, taking into account the following potential confounders:
- Age at randomization
- Sex
- Clinical status as assessed by the COVID Ordinal Outcome Scale at randomization
- Sequential Organ Failure Assessment (SOFA) score at randomization
- Duration of acute respiratory infection symptoms prior to randomization
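
Written out, and assuming the common parameterization in which a positive coefficient shifts patients toward higher (better) levels of the scale, the adjusted proportional odds model for the 7-level outcome $Y$ takes the form below. The symbols ($\alpha_k$, the $\beta$ coefficients, and the covariate names) are notation introduced here for illustration, not taken from the statistical analysis plan.

$$
\operatorname{logit} P(Y \le k \mid x) = \alpha_k - \left(\beta_{\text{trt}}\,\text{trt} + \beta_1\,\text{age} + \beta_2\,\text{sex} + \beta_3\,\text{COS}_0 + \beta_4\,\text{SOFA} + \beta_5\,\text{symptom days}\right), \qquad k = 1, \dots, 6
$$

Here $\text{trt}$ indicates assignment to hydroxychloroquine, $\text{COS}_0$ is the scale value at randomization, and $\exp(\beta_{\text{trt}})$ is the adjusted common odds ratio for hydroxychloroquine versus placebo, shared across all six cut-points.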

# Packages Installation 

In [None]:
# Install packages needed to use PIC-SURE
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
!cat requirements.txt

In [None]:
!{sys.executable} -m pip install -r requirements.txt

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

# Installing the library and connecting to the database using the PIC-SURE API

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"
with open(token_file, "r") as f:
    # Strip the trailing newline that editors often append to the token file
    my_token = f.read().strip()
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

# Querying the data

In [None]:
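# Search the data dictionary for entries matching the term 'ORCHID'
dictionary_results = resource.dictionary().find('ORCHID')
list_variables = dictionary_results.keys()
# Build a query on these variables and retrieve the results as a DataFrame
query = resource.query()
query.anyof().add(list_variables)
raw_df = query.getResultsDataFrame(low_memory=False)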

# Data Management

The raw data contains long variable names, and the following code trims the uninformative part.
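
To make the trimming rule concrete before it is applied to `raw_df`, here is a small standalone sketch on hypothetical column names. The example path and the helper name `shorten` are invented for illustration. The sketch assumes that variable paths are backslash-delimited and end with a backslash, while the `Patient ID` column is a bare name.

```python
def shorten(column_name):
    """Keep the informative segment of a backslash-delimited variable path."""
    parts = column_name.split("\\")
    # 'Patient ID' contains no backslash, so it is its own last segment.
    # Other paths are assumed to end with a backslash, which leaves an empty
    # last segment and puts the variable name in the second-to-last one.
    return parts[-1] if parts[-1] == "Patient ID" else parts[-2]


# Hypothetical column names shaped like PIC-SURE concept paths.
columns = ["Patient ID", "\\ORCHID (hypothetical study path)\\age\\"]
print([shorten(c) for c in columns])  # ['Patient ID', 'age']
```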

In [None]:
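simplified_names = []
for col in raw_df.columns:
    # Variable paths are backslash-delimited; split them into segments
    short_var = col.split('\\')
    if short_var[-1] == 'Patient ID':
        # The patient identifier column is a bare name, kept as-is
        simplified_names.append(short_var[-1])
    else:
        # Other paths are expected to end with a backslash, so the
        # variable name is the second-to-last segment
        simplified_names.append(short_var[-2])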