# Working with OpenTrials

This notebook is a very simple introduction to the [OpenTrials API](https://api.opentrials.net/). 

You can read the [docs for the API here](https://api.opentrials.net/v1/docs/).

The API powers the main UI for OpenTrials, the [OpenTrials Explorer](https://explorer.opentrials.net/), and is freely available for use in 3rd party applications.

## Configuring an API client

The OpenTrials API is compliant with the [Open API spec](https://openapis.org). Formerly known as "Swagger", Open API is a standard and tooling for writing and working with APIs. A nice benefit of working to a standard like this is the ability to use a generic library that introspects the API definition and dynamically generates an API client.

That's nice.

Here, we are using Python. But there are clients available in many languages. See [here](http://swagger.io/open-source-integrations/) to check support in your favorite programming language. It should not be too big a jump to go from these examples in Python to using a Swagger client in another programming language.

In [1]:
# just for presentation in notebooks
from pprint import pprint as print

# our actual code imports
from bravado.client import SwaggerClient


# The spec that will be used to generate the methods of the API client.
OPENTRIALS_API_SPEC = 'http://api.opentrials.net/v1/swagger.yaml'

# we want our data returned as an array of dicts, and not as class instances.
config = {'use_models': False}

# instantiate our API client
client = SwaggerClient.from_url(OPENTRIALS_API_SPEC, config=config)

# inspect the client properties
dir(client)

['conditions',
 'interventions',
 'organisations',
 'persons',
 'publications',
 'search',
 'sources',
 'trials']

All we've done above is create an instance of a Swagger client. Comparing it with the [OpenTrials API docs](https://api.opentrials.net/v1/docs/), we can see that the client has a property for each endpoint of the API.

To learn more about how this Swagger/Open API stuff all works, [read the spec](http://swagger.io/specification/). Otherwise, just follow along and let's explore the OpenTrials data.

## Working with data

The main endpoint for interacting with data on OpenTrials is the search endpoint. This is an endpoint that exposes all the OpenTrials trial data, indexed in [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-query-string-query.html#query-string-syntax). 

If you know Elasticsearch, you are probably crying with joy now at the power you have at your fingertips to explore the OpenTrials database. 

If you don't know Elasticsearch, then don't worry - we'll walk through some simple examples here, and you'll be able to consult the [Elasticsearch Query String docs](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-query-string-query.html#query-string-syntax) to build progressively more complex queries.

Enough words, let's get some data. The `searchTrials` method is the key to our Elasticsearch kingdom.

### Construct a query

First, we'll query for a condition that surely someone has conducted some clinical trials on.

In [3]:
# Passing in a very simple query, we will paginate results by 10
# The query response is then saved in the `result` variable
result = client.trials.searchTrials(q='depression', per_page=10).result()

Now that we have some data in `result`, let's take a look.

### Basic metadata

How many trials related to depression does OpenTrials know about? 

Handily, a result from the search endpoint has a property called `total_count` to answer that!

In [69]:
'OpenTrials knows about {} trials related to depression.'.format(result['total_count'])

'OpenTrials knows about 7596 trials related to depression.'

### Where are the matching trials?

But where is the data? The result has a property called `items`, which is an array of trials.

Let's see the name of each of the first ten trials related to depression.

In [20]:
[obj['public_title'] for obj in result['items']]

['Optimising the clinical application of capnography for monitoring ventilation of sedated patients in the cardiac catheterisation laboratory',
 'Ventilation status of sedated patients in the cardiac catheterisation laboratory',
 'CANreduce 2.0 - comparing two differently optimized versions of a web-based self-help program to reduce cannabis use with each other and a waiting list',
 'A pilot evaluation on the efficacy of a universal school-based mindfulness intervention to enhance resilience in children',
 'The Clinical Research on the Relationship Between Depression and Gut Microbiota in TBI Patients',
 'I2PETPG - Imidazoline2 Binding Sites in a Group of Participants Diagnosed With AD',
 'A Study of Chinese Medicine Treating Depression',
 'Evaluation of the Impact of the Level of Mindfulness on the Management of Patients With Recurrent Depressive Disorders by the Mindfulness Based Cognitive Therapy ( MBCT ): an Exploratory Study',
 'Integrated Mental Health Care and Vocational Rehabil
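
The titles above are only the first page of results. The following is a minimal sketch of walking through further pages. It assumes the search endpoint accepts a `page` parameter alongside `per_page` (check the API docs before relying on this), and `iter_trials`, `fetch_page` and `fake_fetch_page` are hypothetical names introduced here. The page fetcher is passed in as a function, so the loop can be tried offline against stand-in data.

```python
def iter_trials(fetch_page, query, per_page=10, max_pages=5):
    """Yield trial dicts page by page until results or max_pages run out.

    fetch_page(query, page, per_page) must return a dict shaped like a
    search result, with 'total_count' and 'items' keys.
    """
    seen = 0
    for page in range(1, max_pages + 1):
        result = fetch_page(query, page, per_page)
        items = result['items']
        if not items:
            break
        for trial in items:
            yield trial
        seen += len(items)
        if seen >= result['total_count']:
            break


# Offline stand-in for the API: 25 fake trials served in pages.
FAKE_TRIALS = [{'public_title': 'Trial {}'.format(n)} for n in range(25)]


def fake_fetch_page(query, page, per_page):
    start = (page - 1) * per_page
    return {'total_count': len(FAKE_TRIALS),
            'items': FAKE_TRIALS[start:start + per_page]}


titles = [t['public_title'] for t in iter_trials(fake_fetch_page, 'depression')]
```

With the real client, `fetch_page` could be something like `lambda q, page, per_page: client.trials.searchTrials(q=q, page=page, per_page=per_page).result()`, again assuming that `page` is supported.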

### Inspecting a trial

What type of data does OpenTrials actually expose on each trial? 

Let's look at the properties of one of them to see.

In [22]:
sample = result['items'][0]
list(sample.keys())

['records',
 'url',
 'id',
 'interventions',
 'publications',
 'target_sample_size',
 'risks_of_bias',
 'locations',
 'status',
 'has_published_results',
 'gender',
 'documents',
 'registration_date',
 'public_title',
 'identifiers',
 'brief_summary',
 'recruitment_status',
 'discrepancies',
 'sources',
 'source_id',
 'conditions',
 'persons',
 'organisations']

Ok!

There is a lot to explore here. Let's dive further into the *trial object* that is returned by our search queries for trials. This will help us understand what we can expect to see in the OpenTrials database, and thereby, the type of questions we may expect to answer with this data, in further analysis.

In [14]:
# the unique identifier in the OpenTrials database
sample['id']

'3a9dec72-6ba7-11e6-8215-0242ac12000b'

In [24]:
# the globally known identifiers
sample['identifiers']

{'actrn': 'ACTRN12616001132437'}

In [32]:
# Does the trial have published results?
sample['has_published_results'] or 'No results'

'No results'

In [34]:
# when was the trial first registered?
sample['registration_date'].isoformat()

'2016-08-19T00:00:00+00:00'

In [39]:
# exactly which conditions does the trial test
[condition['name'] for condition in sample['conditions']]

['sedation-induced respiratory depression']

In [40]:
# and, which interventions are tested against this condition?
[intervention['name'] for intervention in sample['interventions']]

[]

In [70]:
# is there any summary description that tells us what this trial is about?
sample['brief_summary']

'Respiratory depression is more likely to be detected during sedation when capnography is used. However, sedation-induced respiratory depression is commonly transient and does not always cause patient harm. Randomised controlled trials have produced conflicting results regarding whether capnography improves patient safety when used during sedation and there is considerable variation in the utilisation of capnography monitoring in clinical practice. This project seeks to optimise the implementation of this technology into clinical practice.  AIMS 1.\tTo identify subgroups of patients based on their physiological responses to sedation. 2.\tTo characterise the identified subgroups by determining whether they are associated with particular demographic and clinical characteristics. 3.\tTo examine variation in clinical interventions applied to support respiratory function between subgroups.  4.\tTo determine if there are associations between the subgroups and intra-procedural ventilation sta

In [45]:
# what data sources have contributed to the information on this trial?
list(sample['sources'].keys())

['actrn']

In [50]:
# what records from these sources has OpenTrials collected?
[(record['source_url'], record['url']) for record in sample['records']]

[('https://www.anzctr.org.au/Trial/Registration/TrialReview.aspx?id=371304&isReview=true',
  'http://api.opentrials.net/v1/trials/3a9dec72-6ba7-11e6-8215-0242ac12000b/records/19b47d01-84f6-41c9-9133-2addff32d5bd')]

And, you get the picture. It can be quite interesting to inspect the top-level data of a given trial like this.
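
The cell-by-cell inspection above can be folded into one helper. This is a minimal sketch: `summarise_trial` is a hypothetical name, and `example` is a hand-made stand-in that copies only a few of the fields shown above (the contents of its `sources` entry are left empty because they were not printed), so it runs without calling the API.

```python
from datetime import datetime, timezone


def summarise_trial(trial):
    """Collect the top-level fields inspected above into one flat dict."""
    registered = trial.get('registration_date')
    return {
        'id': trial['id'],
        'identifiers': trial.get('identifiers', {}),
        'has_published_results': bool(trial.get('has_published_results')),
        'registered': registered.date().isoformat() if registered else None,
        'conditions': [c['name'] for c in trial.get('conditions', [])],
        'interventions': [i['name'] for i in trial.get('interventions', [])],
        'sources': sorted(trial.get('sources', {})),
    }


# Hand-made stand-in mirroring the sample trial inspected above.
example = {
    'id': '3a9dec72-6ba7-11e6-8215-0242ac12000b',
    'identifiers': {'actrn': 'ACTRN12616001132437'},
    'has_published_results': False,
    'registration_date': datetime(2016, 8, 19, tzinfo=timezone.utc),
    'conditions': [{'name': 'sedation-induced respiratory depression'}],
    'interventions': [],
    'sources': {'actrn': {}},
}

summary = summarise_trial(example)
```

Running the same function over `result['items']` would give one comparable summary per trial, which is a convenient starting point for loading the data into a table.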

## Related entities

The major API endpoint for exploration is the search endpoint. But as you saw at the start of the notebook, we have several other endpoints available.

In [51]:
dir(client)

['conditions',
 'interventions',
 'organisations',
 'persons',
 'publications',
 'search',
 'sources',
 'trials']

Each of these additional endpoints is a RESTful endpoint for entities related to trials. As we've collected information on trials, we've also extracted out entities and started cleaning and normalising these. We think this will turn into a really useful collection of datasets on clinical trials, giving direct query access to the organisations and people, for example, behind clinical trials.

Let's take another sample and investigate some of these related entities.

In [19]:
sample = result['items'][9]

sample

{'brief_summary': 'This is a Phase 1/2, open-label, multicenter study.',
 'conditions': [{'id': '574bbf1b-5a6a-47df-bcbe-aa389f88dd82',
   'name': "Non-Hodgkin's Lymphoma",
   'url': 'http://api.opentrials.net/v1/conditions/574bbf1b-5a6a-47df-bcbe-aa389f88dd82'}],
 'discrepancies': {'recruitment_status': [{'record_id': '14fb8631-26fc-49e7-9bca-f54609329992',
    'source_name': 'WHO ICTRP',
    'value': 'not_recruiting'},
   {'record_id': '3ce116a4-214e-4534-9dc6-b488548feb2c',
    'source_name': 'ClinicalTrials.gov',
    'value': 'recruiting'}]},
 'documents': [],
 'gender': 'both',
 'has_published_results': False,
 'id': '60a5ba18-594e-4e13-87b8-65c035c8d0de',
 'identifiers': {'nct': 'NCT02856685'},
 'interventions': [{'id': 'e456bc59-a706-40e4-bcaa-6b3003ec5cec',
   'name': 'Mitoxantrone Hydrochloride Liposome',
   'type': None,
   'url': 'http://api.opentrials.net/v1/interventions/e456bc59-a706-40e4-bcaa-6b3003ec5cec'}],
 'locations': [{'id': '4e9484c3-2bc9-4813-a3e7-6710ae2a3324',


This sample has information on several related entities. Notably, persons, interventions and conditions. Let's look at each. As the database expands and the data is further cleaned and normalised, these entity endpoints will hold more and more contextual information.

Currently, they do not hold much more than is already available via the search API, so watch this space.

In [25]:
sample['interventions']

[{'id': 'e456bc59-a706-40e4-bcaa-6b3003ec5cec',
  'name': 'Mitoxantrone Hydrochloride Liposome',
  'type': None,
  'url': 'http://api.opentrials.net/v1/interventions/e456bc59-a706-40e4-bcaa-6b3003ec5cec'}]

In [26]:
intervention = client.interventions.getIntervention(id=sample['interventions'][0]['id']).result()

intervention

{'id': 'e456bc59-a706-40e4-bcaa-6b3003ec5cec',
 'name': 'Mitoxantrone Hydrochloride Liposome',
 'type': None,
 'url': 'http://api.opentrials.net/v1/interventions/e456bc59-a706-40e4-bcaa-6b3003ec5cec'}
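
Each related entity embedded in a trial carries its own `id` and `url`, which is what makes the lookup above possible. As a rough sketch, the ids can be gathered per entity type before fetching anything. The helper name `related_entity_ids` is made up here, the `example` dict reuses only values printed above, and it is an assumption that persons and organisations carry an `id` in the same way as conditions and interventions.

```python
ENTITY_KEYS = ('conditions', 'interventions', 'persons', 'organisations')


def related_entity_ids(trial, keys=ENTITY_KEYS):
    """Map each entity type to the list of ids embedded in the trial."""
    return {key: [entity['id'] for entity in trial.get(key, [])]
            for key in keys}


# Stand-in trial holding only the entity values shown in the output above.
example = {
    'conditions': [{'id': '574bbf1b-5a6a-47df-bcbe-aa389f88dd82',
                    'name': "Non-Hodgkin's Lymphoma"}],
    'interventions': [{'id': 'e456bc59-a706-40e4-bcaa-6b3003ec5cec',
                       'name': 'Mitoxantrone Hydrochloride Liposome'}],
}

ids = related_entity_ids(example)
```

Each id under `'interventions'` could then be passed to `client.interventions.getIntervention(id=...)`, as in the cell above.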

### Unique insights

Let's take a look at some information that is *only* possible because OpenTrials has centralised data from a range of sources and threaded that data together. 

Doing so shows us the type of unique insights that are possible now, and will grow as we collect and match more data from more varied sources.

The example we'll use is that of *discrepancies*. It is known that the various records of information on trials often hold discrepant data - but no one really knows how much. Discrepancies in what is published about a trial are a big issue, as these public records get referenced and used - for example in academic literature - and then become a 'truth' on which new facts are based.

So, just how many trials does OpenTrials think have discrepant data? Let's take a look now, but a few caveats:

1. There are definitely false positives - this is a matter of tuning the way we detect discrepancies across multiple data providers, in a world where there are no standards on how to actually publish clinical trial data.
2. OpenTrials is just at the beginning of building a huge database of information, and obviously information can change over time, as more data is added.

In [4]:
# the total number of trials
result = client.trials.searchTrials(per_page=10).result()
trial_count = result['total_count']

trial_count

335025

In [5]:
# the total number of trials we think have discrepancies
result = client.trials.searchTrials(q='_exists_:discrepancies', per_page=10).result()
discrepancy_count = result['total_count']

discrepancy_count

36720

So we can see that, currently, we think around 11% of trials have discrepancies (36720 of 335025).
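
For reference, the share can be computed directly from the two counts returned above. The values are hard-coded here so the snippet runs on its own; live counts will differ as the database grows.

```python
# Counts copied from the two search results above.
trial_count = 335025
discrepancy_count = 36720

# Fraction of all trials that carry at least one flagged discrepancy.
share = discrepancy_count / trial_count
share_text = '{:.1%} of trials have at least one flagged discrepancy'.format(share)
```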

Even if that figure changes, let's inspect an example to see how important discrepancy detection can be as a heuristic for exploration.

In [13]:
sample = result['items'][0]

sample['public_title'], sample['status'], sample['has_published_results'], \
[i['name'] for i in sample['interventions']], sample['discrepancies'], \
sample['registration_date'].isoformat()

('A Investigating Safety, Tolerability, PK and PD for Multiple Doses of NNC9204-0530 in Combination With Liraglutide in Male and Female Subjects Being Overweight or With Obesity',
 'ongoing',
 False,
 ['Placebo', 'NNC9204-0530', 'LIRAGLUTIDE'],
 {'recruitment_status': [{'record_id': 'a48ef0f3-3bc9-4f23-8f3b-bd63a5d910f4',
    'source_name': 'ClinicalTrials.gov',
    'value': 'recruiting'},
   {'record_id': '5cf4bac8-93f6-4c08-9c6b-7a7b39b50423',
    'source_name': 'WHO ICTRP',
    'value': 'not_recruiting'}]},
 '2016-08-12T00:00:00+00:00')

So, is this recently registered trial recruiting participants, or not? Which trial record is correct, and which is incorrect? Who does this impact (patients, carers, clinicians, etc.)?
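
To make disagreements like this easier to scan, the nested `discrepancies` structure can be flattened into one line per field. This is a small sketch, with `describe_discrepancies` as a hypothetical helper and the input copied from the output above.

```python
def describe_discrepancies(discrepancies):
    """Return one readable line per field, listing each source's value."""
    lines = []
    for field, entries in sorted(discrepancies.items()):
        parts = ['{} says {!r}'.format(e['source_name'], e['value'])
                 for e in entries]
        lines.append('{}: {}'.format(field, '; '.join(parts)))
    return lines


# Copied from the sample trial's 'discrepancies' value shown above.
discrepancies = {
    'recruitment_status': [
        {'record_id': 'a48ef0f3-3bc9-4f23-8f3b-bd63a5d910f4',
         'source_name': 'ClinicalTrials.gov',
         'value': 'recruiting'},
        {'record_id': '5cf4bac8-93f6-4c08-9c6b-7a7b39b50423',
         'source_name': 'WHO ICTRP',
         'value': 'not_recruiting'},
    ],
}

report = describe_discrepancies(discrepancies)
```

Applied to every trial in a result page, the same helper gives a quick overview of which fields the sources disagree on most often.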

## Power search

Most of the above examples have been with a very simple query - 'depression'. However, in the last example, we used `_exists_:discrepancies`. What was that?

That was an example of the [powerful query interface](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-query-string-query.html#query-string-syntax) that Elasticsearch exposes so that you can do all kinds of awesome.

Here are some examples - we encourage you to try more!

In [58]:
# this and that
result = client.trials.searchTrials(q='suicide AND depression', per_page=10).result()

result['total_count']

178

In [59]:
# wildcard matches
result = client.trials.searchTrials(q='head*', per_page=10).result()

result['total_count']

5932

In [60]:
# this or that
result = client.trials.searchTrials(q='headache OR migraine', per_page=10).result()

result['total_count']

1634

In [61]:
# fuzzy matching
result = client.trials.searchTrials(q='brain~', per_page=10).result()

result['total_count']

12950

In [63]:
# date ranges
result = client.trials.searchTrials(q='registration_date:[2014-01-01 TO 2014-12-31]', per_page=10).result()

result['total_count']

27891

In [65]:
# grouping clauses
result = client.trials.searchTrials(q='(male OR female) AND sex', per_page=10).result()

result['total_count']

647

Honestly, the examples above barely scratch the surface. See the [full reference here](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-query-string-query.html#query-string-syntax).
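
When queries get longer, it can help to assemble the query string from small pieces rather than writing it by hand. The helpers below are hypothetical conveniences, not part of the OpenTrials API or of bravado. They only build strings in the query string syntax used in the examples above.

```python
def exists(name):
    """Match documents where the field has a value."""
    return '_exists_:{}'.format(name)


def date_range(name, start, end):
    """Inclusive range on a date field, as in the date range example."""
    return '{}:[{} TO {}]'.format(name, start, end)


def all_of(*clauses):
    """Require every clause; each is parenthesised to keep grouping explicit."""
    return ' AND '.join('({})'.format(clause) for clause in clauses)


def any_of(*clauses):
    """Require at least one clause."""
    return ' OR '.join('({})'.format(clause) for clause in clauses)


# Headache or migraine trials registered in 2014 with flagged discrepancies.
query = all_of(
    any_of('headache', 'migraine'),
    date_range('registration_date', '2014-01-01', '2014-12-31'),
    exists('discrepancies'),
)
```

The resulting string can be passed as `q` to `client.trials.searchTrials`, exactly like the hand-written queries above.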

## Closing

This primer on the OpenTrials API is to get you started quickly. The focus has been on basic inspection of the data, a high-level exploration of the endpoints, and a brief glance into the search capabilities available. 

From here, it is up to you. Whether you want to analyse the data, do some visualisations, write a custom app, or even hack on the core OpenTrials platform, we can't wait to see what you are working on! [Come tell us here](https://gitter.im/opentrials/chat).