# Dimensions Statistics Report, for one or more Organizations

This notebook produces up-to-date statistics on the number of documents associated with one or more GRID organizations. 

**Basic stats:**

- Publications total for GRID IDs
  `search publications where research_orgs = []..`
- Grant number and funding given to the GRID IDs
  `search grants where research_orgs = []..`
- Patents with GRID IDs as assignees 
  `search patents where assignees = []..`
- Clinical Trials with associated GRID IDs
  `search clinical_trials where organizations = []..`
- Policy documents published by the GRID IDs (this count is tricky and will always need a manual check, but this is the query):
  `search policy_documents where publisher_org = []..`

**Advanced stats (= publication references from other sources)**

- Publication count citing the GRID ID publications (straightforward citation count; if difficult, it can be taken from the web app)
  `search publications where research_orgs=[] return year aggregate citations_total`
- Clinical trials referencing the publications of the GRID ID
  `search clinical_trials where publication_ids = [..]`
- Grants linked to publications of GRID ID
  `search grants where resulting_publication_ids ..`
- Patents referencing GRID ID publications 
  `search patents where publication_ids in`
- Policy documents referencing the GRID ID publications
  `search policy_documents where publication_ids in`
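
As a rough illustration of how the query stubs above get filled in, the sketch below builds a complete DSL query string for a list of GRID IDs. The helper name `build_query` is hypothetical (it is not part of dimcli); the `in [...]` filter and the `return <source>[id]` clause mirror the query templates used in the extraction cells of this notebook.

```python
import json

def build_query(source, field, grid_ids):
    """Return a DSL query string that filters `source` by a list of GRID IDs.

    Hypothetical helper, for illustration only.
    """
    # json.dumps renders the Python list as a bracketed, double-quoted list
    return f"search {source} where {field} in {json.dumps(grid_ids)} return {source}[id]"

print(build_query("publications", "research_orgs", ["grid.469830.0"]))
# search publications where research_orgs in ["grid.469830.0"] return publications[id]
```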





# TODOs
* max number of grid ids accepted
* add query for total funding amount
* also a way to load them from a csv file
* sort by time to deal with max 50k results
* add citations total from publications


## 1. Install Libraries and Log into Dimensions API


In [1]:
# @markdown # Get the API library and login 
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = "dsl.demo.1@dimensions.ai"  #@param {type: "string"}
password = "1.Demo.Dsl"  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}


# INSTALL/LOAD LIBRARIES 
# ps optimized for Google Colab /modify installation as needed based on your environment
# 
print("==\nInstalling libraries..")
!pip install dimcli plotly_express -U --quiet 

import os
import sys
import time
import json
import pandas as pd
from pandas.io.json import json_normalize
from tqdm import tqdm_notebook as progressbar
import plotly_express as px
import dimcli 
from dimcli.shortcuts import *

# AUTHENTICATION 
# https://github.com/digital-science/dimcli#authentication
#
# == Google Colab users ==
# If username/password not provided, the interactive setup assistant `dimcli --init` is invoked
#
# == Jupyter Notebook users == 
# If username/password not provided, try to use the global API credentials file.
# To create one, open a terminal (File/New/Terminal) and run `dimcli --init` from there
#  
#
print("==\nLogging in..")
if username and password:
  dimcli.login(username, password, endpoint)
else:
  if 'google.colab' in sys.modules:
    print("Environment: Google Colab")
    if username and not password:
      import getpass
      password = getpass.getpass(prompt='Password: ')     
      dimcli.login(username, password, endpoint)
    else:
      print("... launching interactive setup assistant")
      !dimcli --init    
      dimcli.login()
  else:
    print("Environment: Jupyter Notebook\n... looking for API credentials file")
    dimcli.login()

dsl = dimcli.Dsl()



#
# data-saving utils 
#
DATAFOLDER = "stats_data"
if not os.path.exists(DATAFOLDER):
  # create the folder from Python rather than via a shell command
  os.makedirs(DATAFOLDER)
  print("==\nCreated data folder:", DATAFOLDER + "/")
#
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")
   

==
Installing libraries..
==
Logging in..
DimCli v0.6.1.2 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)
==
Created data folder: stats_data/


## 2. GRIDs selection

You can look up GRID IDs at https://grid.ac/institutes. 

In [2]:
#@markdown Please enter one or more GRID IDs, comma separated

organizations = "grid.469830.0" #@param {type: "string"}
gridids = [x.strip() for x in organizations.split(",")]
gridids = list(set(gridids))

MAX_GRIDS = 30

from IPython.display import display, HTML
def dimensions_url(grids):
    # build a Dimensions web-app link with one facet parameter per GRID ID
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)


if len(gridids) > MAX_GRIDS:
  print("You entered too many GRID IDs. Max is ", MAX_GRIDS)
else:
  print("GRIDs entered (unique): ", len(gridids), "\n =>", gridids)
  # gen link to Dimensions
  display(HTML('---<br /><a href="{}">View selected organizations in Dimensions &#x29c9;</a>'.format(dimensions_url(gridids))))


GRIDs entered (unique):  1 
 => ['grid.469830.0']


## 3. Data Extraction: Counts of Publications, Grants, Patents etc..

In this step we count how many documents directly associated with the GRID IDs exist, for each of the main Dimensions document types. 

Things to note: 

* Since the list of GRID IDs can be long, we need to split them up into smaller chunks to ensure API queries aren't too long. We will thus have a separate query per GRID IDs group. 
* After going through all GRID IDs chunks, we sum up the results and remove duplicates. This is because the same document could result from different GRID IDs.
* All data gets saved in the `stats_data` folder as CSV files
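
The chunk-then-deduplicate logic described in the notes above can be sketched without any API calls. The helpers `split_into_chunks` and `merge_unique` are hypothetical names used only for this illustration (the notebook itself relies on `chunks_of` and on pandas `drop_duplicates`), and the IDs are made-up placeholders.

```python
def split_into_chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def merge_unique(batches):
    """Merge lists of records (dicts with an 'id' key), keeping the first copy of each id."""
    seen = {}
    for batch in batches:
        for record in batch:
            seen.setdefault(record["id"], record)
    return list(seen.values())

grid_ids = [f"grid.{n}" for n in range(25)]   # made-up placeholder IDs
chunks = list(split_into_chunks(grid_ids, 10))
print([len(c) for c in chunks])               # [10, 10, 5]

# two made-up result batches that share one document
batches = [[{"id": "pub.1"}, {"id": "pub.2"}], [{"id": "pub.2"}, {"id": "pub.3"}]]
print(len(merge_unique(batches)))             # 3
```

Each chunk would correspond to one API query; the merge step is what prevents a document affiliated with several of the selected organizations from being counted more than once.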

In [0]:
from tqdm import tqdm_notebook as pbar

query_templates = {
    "publications" : """search publications where research_orgs in {} return publications[id]""" ,
    "grants" : """search grants where research_orgs in {} return grants[id]""" ,
    "clinical_trials" : """search clinical_trials where organizations in {} return clinical_trials[id]""" ,
    "patents" : """search patents where assignees in {} return patents[id]""" ,
    "policy_documents" : """search policy_documents where publisher_org in {} return policy_documents[id]""" ,
}

VERBOSE = False
CHUNKS_SIZE = 10 # how many grids to process at a time
results = []

loop1 = pbar(list(query_templates))
for doctype in loop1:
  loop1.set_description("Extracting %s" % doctype.capitalize())
  bucket = []
  # sum of the per-chunk totals; can exceed the unique count if chunks share documents
  available = 0
  for chunk in chunks_of(list(gridids), CHUNKS_SIZE):
    # build and run the DSL query for this chunk of GRID IDs
    q = query_templates[doctype].format(json.dumps(chunk))
    data = dsl.query_iterative(q, verbose=VERBOSE)
    #
    bucket += getattr(data, doctype)
    available += data.count_total

  df = pd.DataFrame(bucket)
  if not df.empty:
    # the same document can be returned for more than one chunk
    df.drop_duplicates(subset="id", inplace=True)
  #
  print(f"============\n{doctype.capitalize()} available: ", available)
  print(f"{doctype.capitalize()} downloaded: ", len(df))
  save_as_csv(df, f"tot_{doctype}")
  #
  results.append({'doctype': doctype, 'total' : available, 'downloaded' : len(df)})


# save to a dataframe
summary = pd.DataFrame().from_dict(results)
save_as_csv(summary, "overview_objects")
summary


Publications available:  473
Publications downloaded:  473
===
Saved:  stats_data/tot_publications.csv
Grants available:  177
Grants downloaded:  177
===
Saved:  stats_data/tot_grants.csv
Clinical_trials available:  0
Clinical_trials downloaded:  0
===
Saved:  stats_data/tot_clinical_trials.csv
Patents available:  0
Patents downloaded:  0
===
Saved:  stats_data/tot_patents.csv
Policy_documents available:  0
Policy_documents downloaded:  0
===
Saved:  stats_data/tot_policy_documents.csv

===
Saved:  stats_data/overview_objects.csv


| | doctype | total | downloaded |
|---|---|---|---|
| 0 | publications | 473 | 473 |
| 1 | grants | 177 | 177 |
| 2 | clinical_trials | 0 | 0 |
| 3 | patents | 0 | 0 |
| 4 | policy_documents | 0 | 0 |


## 5. Data Extraction: incoming links **to** Publications **from** Grants, Clinical Trials, Patents etc...


#### **TODO clarify the logic here!! what links do we need?**

In this section we extract the links **to** the publications **from** the other document types (grants, clinical trials, patents and policy documents), based on the publication list previously created for the GRID IDs.

Note: 

* The publications data extracted above (`tot_publications.csv`) is loaded to seed the extraction.
* We are only interested in total counts, but since each document type requires several query iterations, we still need to extract all the records in order to remove duplicates afterwards. 
* After each query we pause for half a second to reduce the risk of hitting the API quota of 30 queries per minute.
* As before, the data gets saved as CSV files in the `stats_data` folder.
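
To make the quota arithmetic concrete, here is a minimal sketch of a throttled loop. Both helper names are hypothetical, and the calculation assumes the worst case in which every query returns instantly; in practice the time each query takes to complete adds further spacing between requests.

```python
import time

def min_pause_seconds(queries_per_minute):
    """Smallest pause between queries that stays within a per-minute quota."""
    return 60.0 / queries_per_minute

def run_throttled(tasks, pause, sleep=time.sleep):
    """Call each zero-argument task in turn, pausing `pause` seconds between calls."""
    results = []
    for i, task in enumerate(tasks):
        if i > 0:
            sleep(pause)  # no need to wait before the first call
        results.append(task())
    return results

print(min_pause_seconds(30))  # 2.0

# stand-in tasks; in the notebook each task would be one API query
print(run_throttled([lambda: "a", lambda: "b"], pause=0.01))  # ['a', 'b']
```

Under these assumptions, a quota of 30 queries per minute corresponds to at most one query every 2 seconds, so a shorter pause only stays within the quota when the queries themselves take up the remaining time.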



In [0]:
pubs = pd.read_csv(DATAFOLDER + "/tot_publications.csv")

In [0]:
#
# the main queries
#

q_template_sources = {
    'grants' : 
      """search grants where resulting_publication_ids in {} return grants[id] limit 1000""",
    'clinical_trials' : 
      """search clinical_trials where publication_ids in {} return clinical_trials[id] limit 1000""",
    'patents' : 
      """search patents where publication_ids in {} return patents[id] limit 1000""",
    'policy_documents' : 
      """search policy_documents where publication_ids in {} return policy_documents[id] limit 1000""", 
}

ids = pubs['id']

VERBOSE = False
CHUNK_SIZE = 300

output = []

for doctype in q_template_sources:
  print("\n===\nExtracting", doctype)
  results = []
  for chunk in pbar(list(chunks_of(list(ids), CHUNK_SIZE))):
    q = q_template_sources[doctype].format(json.dumps(chunk))
    # 
    data = dsl.query(q, verbose=VERBOSE)
    if data.count_total > 1000:
      # more matches than `limit 1000` can return in one query: this chunk is truncated
      print("Warning: no of records too high - please lower the chunks size for iterations")
    #
    results += getattr(data, doctype)
    time.sleep(0.5)
  dfdf = pd.DataFrame().from_dict(results)
  dfdf.drop_duplicates(subset="id", inplace=True)
  #
  save_as_csv(dfdf, f"linked_{doctype}")
  #
  print(f"---\nLinks from {doctype} found:", len(dfdf))
  output.append({'to': 'publications', 'from': doctype, 'links' : len(dfdf)})


# save to a dataframe
df2 = pd.DataFrame().from_dict(output)
#
save_as_csv(df2, f"overview_links")
#
df2



===
Extracting grants



===
Saved:  stats_data/linked_grants.csv
---
Links from grants found: 1275

===
Extracting clinical_trials



===
Saved:  stats_data/linked_clinical_trials.csv
---
Links from clinical_trials found: 11

===
Extracting patents



===
Saved:  stats_data/linked_patents.csv
---
Links from patents found: 5930

===
Extracting policy_documents



===
Saved:  stats_data/linked_policy_documents.csv
---
Links from policy_documents found: 14
===
Saved:  stats_data/overview_links.csv


| | to | from | links |
|---|---|---|---|
| 0 | publications | grants | 1275 |
| 1 | publications | clinical_trials | 11 |
| 2 | publications | patents | 5930 |
| 3 | publications | policy_documents | 14 |


## 6. Downloading all results 

If you are viewing this notebook in **Google Colab**, run the following cell to download all data as a zip file. 

In [0]:

# zip up all files to make download easier
import zipfile
import os 

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

# the context manager closes the archive even if zipping fails
with zipfile.ZipFile('results.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir(DATAFOLDER + '/', zipf)

try:
  # try to download from colab: sometimes it fails hence print a message
  from google.colab import files
  time.sleep(10)
  files.download('results.zip') 
except Exception:
  # also reached when google.colab cannot be imported (i.e. outside Colab)
  print("Google Colab failed to download - please try again.")
