# Clinical Trials Statistics Report, for a selected Organization 

Given a GRID ID and a time frame, this notebook extracts all of the organization's publications that were cited by clinical trials, together with the citing clinical trials.
    
Publications CSV fields 

```
1. Dim. pub ID  [as a link]
2. Pub title [Full title]
3. Citations [number of]
4. Recent citations [number of]
5. FCR [number of]
6. FOR-codes [semicolon separated values, no cap]
7. Authors [semicolon separated values, capped]
8. Affiliations [semicolon separated values, capped]
9. Own authors [GRID's own authors, nested not capped; include corr. author when available]
10. # of citing clinical trials [number of]
11. Citing clinical trial IDs [Nested, no cap]
```

Clinical Trials CSV fields 

```
1. Dim. Clinical Trial ID [as a link]
2. FOR-codes [semicolon separated values, no cap]
3. Clinical Trial title
4. Investigators [semicolon separated values, no cap; all names also if not disambiguated]
5. Investigators (GRID own)
6. Organizations related [semicolon separated values, no cap]
7. # of cited pubs (total) [number of]
8. # of cited pubs (GRIDs own) [number of]
9. Cited publication IDs [Nested, no cap]
```
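
The "semicolon separated" and "capped" conventions used in both field lists can be illustrated with a small helper. This is only a sketch: `join_values` and its `cap` parameter are hypothetical names, and it assumes that "capped" means keeping the first N values. The extraction code in this notebook joins values without applying a cap.

```python
def join_values(values, cap=None, sep="; "):
    """Join a list into one delimited string, keeping at most `cap` items.

    `cap` is a hypothetical parameter: None means "no cap".
    Empty or missing values are skipped.
    """
    cleaned = [str(v) for v in (values or []) if v]
    if cap is not None:
        cleaned = cleaned[:cap]
    return sep.join(cleaned)


# Toy data, not taken from a real record
authors = ["Ada Rossi", "Bo Chen", "Cy Diaz"]

uncapped = join_values(authors)        # all three names
capped = join_values(authors, cap=2)   # first two names only
missing = join_values(None)            # empty string for a blank cell
```

A blank cell (`None`) becomes an empty string, which mirrors the `fillna("")` step used before joining in the extraction cell.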

Sample outputs (Google Sheets):
* pubs: https://docs.google.com/spreadsheets/d/1NBprMWj4jOkEXApX2pKLXiUVTXMYnEopyoMn0I4iLG4/edit?usp=sharing
* clinical trials: https://docs.google.com/spreadsheets/d/1niEGPxt_1JOVxEOSvPdIzOX8j0FXegcxWFqYkDsEcHo/edit?usp=sharing



## 1. Install Libraries and Log into Dimensions API


In [1]:
# @markdown # Get the API library and login 
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}


# INSTALL/LOAD LIBRARIES 
# NB: optimized for Google Colab; adjust the installation as needed for your environment
# 
print("==\nInstalling libraries..")
!pip install dimcli tqdm -U --quiet 

import os
import sys
import time
import json
import pandas as pd
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
from tqdm.notebook import tqdm as progress
import dimcli 
from dimcli.shortcuts import *

# AUTHENTICATION 
# https://github.com/digital-science/dimcli#authentication
#
# == Google Colab users ==
# If username/password not provided, the interactive setup assistant `dimcli --init` is invoked
#
# == Jupyter Notebook users == 
# If username/password not provided, try to use the global API credentials file.
# To create one, open a terminal (File/New/Terminal) and run `dimcli --init` from there
#  
#
print("==\nLogging in..")
if username and password:
  dimcli.login(username, password, endpoint)
else:
  if 'google.colab' in sys.modules:
    print("Environment: Google Colab")
    if username and not password:
      import getpass
      password = getpass.getpass(prompt='Password: ')     
      dimcli.login(username, password, endpoint)
    else:
      print("... launching interactive setup assistant")
      !dimcli --init    
      dimcli.login()
  else:
    print("Environment: Jupyter Notebook\n... looking for API credentials file")
    dimcli.login()

dsl = dimcli.Dsl()

   

==
Installing libraries..


You should consider upgrading via the '/Users/michele.pasin/Envs/dslqa/bin/python -m pip install --upgrade pip' command.


==
Logging in..
Environment: Jupyter Notebook
... looking for API credentials file


Dimcli - Dimensions API Client (v0.7.3)


Connected to: https://app.dimensions.ai - DSL v1.26


Method: dsl.ini file


## 2. Select GRID organization and time period

Tip: pick one from https://grid.ac/institutes. 

In [2]:
#@markdown Please enter a valid GRID ID for the organization

GRIDID = "grid.7841.a" #@param {type:"string"}

#@markdown The start/end year of the publications used to find citing clinical trials
YEAR_START = 2011 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2012 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

# gen link to Dimensions
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # fall back to an empty name if the lookup fails or returns no match
  gridname = ""
from IPython.display import display, HTML
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))


#
# data-saving utils 
#
DATAFOLDER = "stats_cltrials_" + str(GRIDID)
if not os.path.exists(DATAFOLDER):
  !mkdir $DATAFOLDER
  print(f"==\nCreated data folder:", DATAFOLDER + "/")
#
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")


==
Created data folder: stats_cltrials_grid.7841.a/


## 3. Run the extraction

The results are saved in a folder named `stats_cltrials_<GRIDID>` (for example `stats_cltrials_grid.7841.a`), which contains two CSV files: `publications.csv` and `clinical_trials.csv`.
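
The two files are linked through the list of publication IDs stored on each clinical trial. As a rough illustration of how the per-publication trial counts can be derived, here is a minimal sketch that builds an index from publication ID to citing trial IDs using exact ID matching. The records and IDs are toy data, and `trials_per_publication` is a hypothetical helper rather than the function used in the extraction cell.

```python
from collections import defaultdict

# Toy records shaped like the fields used in this notebook (IDs are made up)
trials = [
    {"id": "NCT-A", "publication_ids": ["pub.1", "pub.2"]},
    {"id": "NCT-B", "publication_ids": ["pub.2", "pub.10"]},
]
own_pubs = ["pub.1", "pub.2", "pub.3"]


def trials_per_publication(trials, pub_ids):
    """Map each publication ID to the IDs of the trials that cite it."""
    wanted = set(pub_ids)
    index = defaultdict(list)
    for trial in trials:
        for pub_id in trial.get("publication_ids") or []:
            if pub_id in wanted:  # exact match: "pub.10" never matches "pub.1"
                index[pub_id].append(trial["id"])
    # keep the original publication order, with an empty list when uncited
    return {pub_id: index.get(pub_id, []) for pub_id in pub_ids}


citing = trials_per_publication(trials, own_pubs)
counts = {pub_id: len(ids) for pub_id, ids in citing.items()}
```

Each trial is scanned once, so the work grows with the total number of cited IDs rather than with publications times trials.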

Note: individual publication records can sometimes contain a lot of data, causing the publications extraction to fail with a 'query too long or complex' error. Reducing the number of records extracted per iteration usually avoids this (1000 per iteration is generally fine).
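
The idea of shrinking the batch size when a batch is too heavy can also be automated. The sketch below is a generic illustration only: `fetch_page` and `ResponseTooLarge` are hypothetical stand-ins for whatever paging call and error your client exposes, not part of dimcli. It halves the batch size whenever a page fails and retries from the same offset.

```python
class ResponseTooLarge(Exception):
    """Hypothetical error raised when a batch returns too much data."""


def fetch_in_batches(fetch_page, limit=1000, min_limit=10):
    """Collect all records page by page, halving `limit` when a page fails.

    `fetch_page(skip, limit)` is a hypothetical callable returning a list of
    records (empty when there is nothing left) or raising ResponseTooLarge.
    """
    records, skip = [], 0
    while True:
        try:
            page = fetch_page(skip, limit)
        except ResponseTooLarge:
            if limit <= min_limit:
                raise  # cannot shrink further: give up
            limit = max(min_limit, limit // 2)
            continue  # retry the same offset with a smaller batch
        if not page:
            break
        records.extend(page)
        skip += len(page)
    return records


# Simulated source: record 120 is "heavy", so any batch holding it must
# contain at most 50 records to succeed.
DATA = list(range(250))


def fake_fetch(skip, limit):
    batch = DATA[skip:skip + limit]
    if 120 in batch and len(batch) > 50:
        raise ResponseTooLarge()
    return batch


result = fetch_in_batches(fake_fetch, limit=100)
```

In this simulation the first batch succeeds at 100 records, the second fails once and is retried at 50, and the smaller size is kept for the remaining pages.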

In [3]:
#@markdown If you run into 'query too long or complex' errors, try reducing the records extracted per iteration. 
PUBS_PER_ITERATION = 100 #@param {type: "slider", min: 10, max: 1000}
# PS this is just for publications loop

# ;;;;;
#
# get publications from selected grid and time period
#
# ;;;;;


print("===\nExtracting publications for: ", GRIDID, "from", YEAR_START, "to", YEAR_END)

publications = dsl.query_iterative(f"""
search publications
    where research_orgs.id = "{GRIDID}"
    and year in [{YEAR_START}:{YEAR_END}]
    return publications[id+doi+title+times_cited+recent_citations+field_citation_ratio+category_for+authors]
""", limit=PUBS_PER_ITERATION).as_dataframe()

print("Total publications found: ", len(publications))




# ;;;;;
#
# get CLINICAL TRIALS citations 
#
# ;;;;;


pubsids = list(publications['id'])

print("\n===\nExtracting clinical trials citing these publications")

q = """search clinical_trials where publication_ids in {}
  return clinical_trials[id+title+category_for+investigator_details+organizations+publication_ids]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in progress(list(chunks_of(pubsids, CHUNKS_SIZE))):
    query = q.format(json.dumps(chunk))
    data = dsl.query_iterative(query, verbose=VERBOSE)
    results += data.clinical_trials
    time.sleep(0.5)

# remove duplicates
clinical_trials = pd.DataFrame(results)
clinical_trials.drop_duplicates(subset='id', inplace=True)

print("Total related clinical_trials found: ", len(clinical_trials))

    
    

# ;;;;;
#
# count CLINICAL TRIALS per publication
#
# ;;;;;


print("\n===\nCounting clinical_trials for each publication...")

# build str column version for checking inclusion
clinical_trials['publication_ids_str'] = clinical_trials['publication_ids'].apply(lambda x: ','.join(map(str, x)))                                 
def get_clinical_trials_per_pub(pubid):
    global clinical_trials
    # literal substring match on the joined IDs (regex=False, so the "." in an ID is not treated as a wildcard)
    return clinical_trials[clinical_trials['publication_ids_str'].str.contains(pubid, regex=False)]['id']


publications['clinical_trials_count'] = 0
publications['clinical_trials_ids'] = ""
for index, row in progress(publications.iterrows(), total=publications.shape[0]):
    match_clinical_trials = get_clinical_trials_per_pub(row['id'])
    publications.at[index,'clinical_trials_count'] = len(match_clinical_trials)
    publications.at[index,'clinical_trials_ids'] = list(match_clinical_trials)


# keep only pubs with at least one clinical trial citation and sort by number of citing clinical trials
publications_subset = publications[publications['clinical_trials_count'] > 0].copy()
publications_subset.sort_values("clinical_trials_count", ascending=False, inplace=True)

print("Total publications with at least one clinical_trials citation: ", len(publications_subset))




# ;;;;;
#
# count PUBLICATIONS per clinical trial 
#
# ;;;;;


print("\n===\nCounting publications for each clinical_trial ...")
# add tot column
clinical_trials['publications_cited_tot'] = clinical_trials['publication_ids'].apply(lambda x: len(x))

# count tot publications cited from GRIDID
def is_in_grid_pubs(test_ids):
    "intersection of two lists: all cited pubs VS pubs from selected grid org"
    global pubsids
    return len(list(set(test_ids) & set(pubsids)))

progress.pandas(desc="Clinical Trials")
clinical_trials['publications_cited_grid'] = clinical_trials['publication_ids'].progress_apply(lambda x: is_in_grid_pubs(x))



# ;;;;;
#
# simplify JSON publication fields into simple strings 
#
# ;;;;;


print("\n===\nSimplifying publication/clinical_trials fields..")

# turn ids into URLs
publications_subset['id'] = publications_subset['id'].apply(lambda x: dimensions_url(x))
# simplify FOR codes (after filling in blanks)
publications_subset['category_for'] = publications_subset['category_for'].fillna("").apply(lambda x: "; ".join([y['name'] for y in x]))

# represent authors and affiliations as semicolon-delimited lists 

def nice_authors(authorslist):
    authors = []
    for x in authorslist:
        name = x.get('first_name', "") + " " + x.get('last_name', "")
        authors.append(name)
    return "; ".join(authors)

def nice_affiliations(authorslist):
    affiliations = []
    for x in authorslist:
        for a in x['affiliations']:
            affiliations.append(a.get('name', ""))
    return "; ".join(list(set(affiliations)))

# extract OWN authors (at any point in time!) 
def ownauthors(authorslist):
    ownauthors = []
    global GRIDID
    for x in authorslist:
        name = x.get('first_name', "") + " " + x.get('last_name', "")
        for a in x['affiliations']:
            if "id" in a and a['id'] == GRIDID:
                ownauthors.append(name)
    return "; ".join(ownauthors)

publications_subset['all_authors'] = publications_subset['authors'].fillna("").apply(lambda x: nice_authors(x))
publications_subset['own_authors'] = publications_subset['authors'].fillna("").apply(lambda x: ownauthors(x))
publications_subset['affiliations'] = publications_subset['authors'].fillna("").apply(lambda x: nice_affiliations(x))

# sort columns
publications_subset = publications_subset[['id', 'title', 'times_cited', 'recent_citations', 'field_citation_ratio', 'category_for', 'all_authors',  'own_authors', 'affiliations', 'clinical_trials_count', 'clinical_trials_ids']]





# ;;;;;
#
# simplify JSON CLINICAL TRIALS fields  
#
# ;;;;;

# turn ids into URLs
clinical_trials['id'] = clinical_trials['id'].apply(lambda x: dimensions_url(x, "clinical_trials"))
# simplify FOR codes (after filling in blanks)
clinical_trials['category_for'] = clinical_trials['category_for'].fillna("").apply(lambda x: "; ".join([y['name'] for y in x]))

# transform list into semicolon delimited string
clinical_trials['investigator_all'] = clinical_trials['investigator_details'].fillna("").apply(lambda x: "; ".join([val[0] for val in x]))

# identify own investigators  
# helper:
def check_grid_investigator(llist, gridid):
    """
    Handle exceptions silently - normally grid is in position 5
    
            "investigator_details": [
                [
                    "Salvacion Gatchalian",
                    "MD",
                    "Principal Investigator",
                    "Research Institute for Tropical Medicine,",
                    "Research Institute for Tropical Medicine",
                    "grid.437564.7"
                ]
            ],
    """
    try:
        return llist[5] == gridid
    except (IndexError, TypeError):
        # record is too short or not a list: treat as "not own investigator"
        return False

clinical_trials['investigator_grid'] = clinical_trials['investigator_details'].fillna("").apply(lambda x: "; ".join([val[0] for val in x if check_grid_investigator(val, GRIDID)]))
    

# simplify the related organizations column
clinical_trials['organizations'] = clinical_trials['organizations'].fillna("").apply(lambda x: "; ".join([val['id'] for val in x]))
    
    
# finally sort columns
clinical_trials = clinical_trials[['id', 'title', 'category_for', 'investigator_all', 'investigator_grid', 'organizations', 'publications_cited_tot', 'publications_cited_grid', 'publication_ids']]





# ;;;;;
#
# save the data as CSV
#
# ;;;;;


save_as_csv(publications_subset, "publications")
save_as_csv(clinical_trials, "clinical_trials")


print("===\nCompleted.")

===
Extracting publications for:  grid.7841.a from 2011 to 2012
Starting iteration with limit=100 skip=0 ...


0-100 / 10437 (1.2106270790100098s)


100-200 / 10437 (3.7274019718170166s)


200-300 / 10437 (2.0583558082580566s)


300-400 / 10437 (4.803505897521973s)


400-500 / 10437 (4.843851089477539s)


500-600 / 10437 (1.432697057723999s)


600-700 / 10437 (0.6380801200866699s)


700-800 / 10437 (4.8519041538238525s)


800-900 / 10437 (7.4519617557525635s)


900-1000 / 10437 (9.503302097320557s)


1000-1100 / 10437 (4.729135990142822s)


1100-1200 / 10437 (4.124349117279053s)


1200-1300 / 10437 (6.876145362854004s)


1300-1400 / 10437 (5.198312997817993s)


1400-1500 / 10437 (5.492586612701416s)


1500-1600 / 10437 (1.562514066696167s)


1600-1700 / 10437 (5.676270961761475s)


1700-1800 / 10437 (8.78333306312561s)


1800-1900 / 10437 (2.2388579845428467s)


1900-2000 / 10437 (4.877010107040405s)


2000-2100 / 10437 (4.189291000366211s)


2100-2200 / 10437 (8.140240907669067s)


2200-2300 / 10437 (2.6171867847442627s)


2300-2400 / 10437 (0.6224629878997803s)


2400-2500 / 10437 (3.8107597827911377s)


2500-2600 / 10437 (5.285265922546387s)


2600-2700 / 10437 (3.591262102127075s)


2700-2800 / 10437 (1.3589859008789062s)


2800-2900 / 10437 (7.6187379360198975s)


2900-3000 / 10437 (8.034372091293335s)


3000-3100 / 10437 (8.853363990783691s)


3100-3200 / 10437 (0.7193419933319092s)


3200-3300 / 10437 (7.836992025375366s)


3300-3400 / 10437 (3.6634538173675537s)


3400-3500 / 10437 (6.867624998092651s)


3500-3600 / 10437 (1.2446489334106445s)


3600-3700 / 10437 (10.610204219818115s)


3700-3800 / 10437 (6.146013975143433s)


3800-3900 / 10437 (3.0118179321289062s)


3900-4000 / 10437 (2.5806870460510254s)


1 EvaluationError found
The response generated by your query is too large, e.g. because it includes records with lots of data. Please review it by keeping in mind the guidelines on https://docs.dimensions.ai/dsl/faq.html#queries-and-errors [code: 2]

>>>[Dimcli tip] An error occurred with the batch '4000-4100'. Consider using the 'limit' argument to retrieve fewer records per iteration, or use 'force=True' to ignore errors and continue the extraction.


TypeError: can only concatenate list (not "DslDataset") to list

## 4. Download the results 

If you are viewing this notebook in **Google Colab**, run the following cell to download all data as a zip file. 

In [4]:

# zip up all files to make download easier
import zipfile
import os 

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zip_name = DATAFOLDER + '.zip'
# the context manager closes the archive even if writing fails
with zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir(DATAFOLDER + '/', zipf)

try:
  # try to download from colab: sometimes it fails hence print a message
  from google.colab import files
  time.sleep(5)
  files.download(zip_name) 
except:
  print("Google Colab failed to download - please try again.")


Google Colab failed to download - please try again.
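
When running outside Google Colab the download step does not apply, and the message above is expected. In that case the data folder can be archived with the standard library alone. The following is a minimal sketch using `shutil.make_archive`; the `zip_folder` helper and the demo folder name are illustrative assumptions, and a temporary directory stands in for the real data folder.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder):
    """Create `<folder>.zip` next to `folder` and return the archive path."""
    folder = os.path.abspath(folder)
    return shutil.make_archive(
        folder,                              # archive path without ".zip"
        "zip",
        root_dir=os.path.dirname(folder),    # archive paths are relative to this
        base_dir=os.path.basename(folder),   # only this folder is included
    )


# Demo on a throwaway folder (hypothetical name and content)
with tempfile.TemporaryDirectory() as tmp:
    data_folder = os.path.join(tmp, "stats_cltrials_demo")
    os.mkdir(data_folder)
    with open(os.path.join(data_folder, "publications.csv"), "w") as f:
        f.write("id,title\n")

    archive = zip_folder(data_folder)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
```

With the real data, calling `zip_folder(DATAFOLDER)` would write the archive in the current working directory, from where it can be copied manually.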
