# Dimensions Statistics Report, for one or more Organizations

This notebook produces up-to-date statistics on the number of documents associated with one or more GRID organizations. These numbers can then be used to generate visual summaries like the following:

In [1]:
from IPython.display import Image
Image(url= "http://api-sample-data.dimensions.ai/stats-notebook/chart-example-1.jpg", width=700)

In [2]:
from IPython.display import Image
Image(url= "http://api-sample-data.dimensions.ai/stats-notebook/chart-example-2.jpg", width=400)

### Statistics Detailed Description

The statistics are divided into two groups and two corresponding CSV files:

**1) Number of Documents Overview** - saved to `overview_objects.csv`

- Number of publications from the GRID IDs.
  - Query template: `search publications where research_orgs = []..`
- Number of grants and total funding awarded to the GRID IDs (funding aggregated in US dollars).
  - Query template: `search grants where research_orgs = []..`
- Patents with GRID IDs as assignees. 
  - Query template: `search patents where assignees = []..`
- Clinical Trials with associated GRID IDs.
  - Query template: `search clinical_trials where research_orgs = []..` (the older `organizations` field name is deprecated)
- Policy papers published by GRID IDs (**NOTE**: policy paper numbers always need to be adjusted manually, since this is tricky)
  - Query template: `search policy_documents where publisher_org = []..`
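
The templates above are filled in by serializing the list of GRID IDs as a JSON array, which the notebook then passes to the DSL as a list literal. A minimal sketch of that step, assuming the `in` operator used in the extraction cell further down; `build_count_query` is a hypothetical helper name, not part of dimcli:

```python
import json

# Same shape as the templates used later in this notebook: the first
# placeholder takes the GRID IDs, the second an optional trailing clause.
TEMPLATE = "search publications where research_orgs in {} return publications[id] {}"


def build_count_query(gridids, suffix="limit 1"):
    """Hypothetical helper: fill the template with a JSON-encoded ID list."""
    return TEMPLATE.format(json.dumps(gridids), suffix)


query = build_count_query(["grid.473100.1", "grid.474488.3"])
print(query)
```

Requesting a single record (`limit 1`) is enough for counting, because the total number of matches is reported alongside the returned records.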


**2) Incoming Links Overview (= publication citations)** - saved to `overview_links.csv`

- Publication count citing the GRID ID publications. 
  - Query template: `search publications where research_orgs=[] return year aggregate citations_total` (then the yearly citations are summed up).
- Clinical trials referencing the publications of the GRID ID.
  - Query template: `search clinical_trials where publication_ids = [..]`
- Grants linked to publications of GRID ID.
  - Query template: `search grants where resulting_publication_ids ..`
- Patents referencing the GRID ID publications. 
  - Query template: `search patents where publication_ids in`
- Policy documents referencing the GRID ID publications.
  - Query template: `search policy_documents where publication_ids in`
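
The citations query returns one bucket per publication year rather than a single total, so the buckets have to be added up. A minimal sketch with made-up numbers; the record shape (`id`, `count`, `citations_total`) is an assumption based on how the notebook reads `x['citations_total']` from each bucket:

```python
# Hypothetical yearly buckets, shaped like the `year` list the notebook iterates over.
year_buckets = [
    {"id": 2017, "count": 10, "citations_total": 120.0},
    {"id": 2018, "count": 12, "citations_total": 80.0},
    {"id": 2019, "count": 7, "citations_total": 15.0},
]

# Sum the per-year aggregates; a bucket without the key contributes 0.
citations_total = sum(bucket.get("citations_total", 0) for bucket in year_buckets)
print("Total citations:", citations_total)
```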


**NOTE** 

If the starting GRID IDs generate a dataset of more than 50k publications, only the __most recent 50k publications__  will be extracted. 

This means that in such cases the group 2 statistics should then be read as the *number of patents citing the most recent 50k publications* etc...
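
The group 2 links are found by querying with explicit publication IDs, so the ID list is split into batches first (the notebook uses `chunks_of` from `dimcli.shortcuts` with batches of 300) and the results are de-duplicated afterwards, since the same document can cite publications in several batches. A minimal stand-in for those two steps, assuming the helper simply yields consecutive slices; `chunked` and `dedupe_by_id` are illustrative names:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def dedupe_by_id(records):
    """Keep the first record seen for each 'id', preserving order."""
    seen = {}
    for record in records:
        seen.setdefault(record["id"], record)
    return list(seen.values())


# Made-up publication IDs, only to show the batch sizes.
pub_ids = ["pub.{}".format(n) for n in range(1, 701)]
batch_sizes = [len(batch) for batch in chunked(pub_ids, 300)]
print(batch_sizes)

# The same grant can be returned by more than one batch.
found = [{"id": "grant.1"}, {"id": "grant.2"}, {"id": "grant.1"}]
unique = dedupe_by_id(found)
print(len(unique))
```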


# 1. Install Libraries and Log into Dimensions API


In [3]:
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}


# INSTALL/LOAD LIBRARIES 
# NB: optimized for Google Colab; modify the installation as needed for your environment
# 
print("==\nInstalling libraries..")
!pip install dimcli networkx pyvis -U --quiet 

import os
import sys
import time
import json
import pandas as pd
from pandas import json_normalize  # public import path since pandas 1.0
from tqdm.notebook import tqdm as progressbar
import dimcli 
from dimcli.shortcuts import *

# AUTHENTICATION 
# https://github.com/digital-science/dimcli#authentication
#
# == Google Colab users ==
# If username/password not provided, the interactive setup assistant `dimcli --init` is invoked
#
# == Jupyter Notebook users == 
# If username/password not provided, try to use the global API credentials file.
# To create one, open a terminal (File/New/Terminal) and run `dimcli --init` from there
#  
#
print("==\nLogging in..")
if username and password:
  dimcli.login(username, password, endpoint)
else:
  if 'google.colab' in sys.modules:
    print("Environment: Google Colab")
    if username and not password:
      import getpass
      password = getpass.getpass(prompt='Password: ')     
      dimcli.login(username, password, endpoint)
    else:
      print("... launching interactive setup assistant")
      !dimcli --init    
      dimcli.login()
  else:
    print("Environment: Jupyter Notebook\n... looking for API credentials file")
    dimcli.login()

dsl = dimcli.Dsl()

   

==
Installing libraries..
==
Logging in..
Environment: Jupyter Notebook
... looking for API credentials file
DimCli v0.6.6.5 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


# 2. Enter GRIDs and start data extraction

Tip: pick one from https://grid.ac/institutes. 
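
The input is a single comma-separated string, so it has to be split, trimmed and de-duplicated before use. The sketch below is one possible variant, not the code in the cell that follows: it drops empty entries and keeps the order in which IDs were entered, which makes the name of the data folder (built from the first ID) predictable. `parse_grid_ids` is a hypothetical helper name.

```python
def parse_grid_ids(raw, max_ids=30):
    """Split a comma-separated string into unique GRID IDs, in entry order."""
    ids = [part.strip() for part in raw.split(",")]
    ids = [part for part in ids if part]   # ignore empty entries, e.g. a trailing comma
    unique = list(dict.fromkeys(ids))      # de-duplicate while preserving order
    if len(unique) > max_ids:
        raise ValueError("Too many GRID IDs: {} (max is {})".format(len(unique), max_ids))
    return unique


parsed = parse_grid_ids("grid.473100.1, grid.474488.3, grid.473100.1,")
print(parsed)
```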

In [4]:
#@markdown Please enter up to 30 GRID IDs, comma separated.

organizations = "grid.473100.1, grid.474488.3" #@param {type: "string"}
gridids = [x.strip() for x in organizations.split(",")]
gridids = list(set(gridids))

MAX_GRIDS = 30

from IPython.display import display, HTML  # importing these from IPython.core.display is deprecated
def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join([x for x in grids])


if len(gridids) > MAX_GRIDS:
  # a bare `raise` has no active exception to re-raise here, so raise one explicitly
  raise ValueError("You entered too many GRID IDs. Max is {}".format(MAX_GRIDS))
else:
  print("GRIDs entered (unique): ", len(gridids), "\n =>", gridids)
  # gen link to Dimensions
  display(HTML('---<br /><a href="{}">View selected organizations in Dimensions &#x29c9;</a>'.format(dimensions_url(gridids))))

#
# data-saving utils 
#

DATAFOLDER = "stats_data_" + str(gridids[0])
if not os.path.exists(DATAFOLDER):
  os.makedirs(DATAFOLDER)  # portable alternative to the `!mkdir` shell escape
  print("==\nCreated data folder:", DATAFOLDER + "/")
#
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")


###
#
# STEP 1: Basic stats: Publications, Grants, Patents, Clinical Trials etc..
# In this step we count how many documents directly associated with the GRID IDs exist, for each of the main Dimensions document types.
#
###


print("\n\n======\n======\nStarting data extraction part [1]....\n======\n======\n\n")

from tqdm.notebook import tqdm as pbar  # `tqdm.tqdm_notebook` is deprecated

query_templates = {
    "publications" : """search publications where research_orgs in {} return publications[id] {}""" ,
    "grants" : """search grants where research_orgs in {} return grants[id] {}""" ,
    "patents" : """search patents where assignees in {} return patents[id] {}""" ,
    # the `organizations` field is deprecated for clinical trials in favor of `research_orgs`
    "clinical_trials" : """search clinical_trials where research_orgs in {} return clinical_trials[id] {}""" ,
    "policy_documents" : """search policy_documents where publisher_org in {} return policy_documents[id] {}""" ,
}

VERBOSE = True
# CHUNKS_SIZE = 30 # how many grids to process at a time
results = []


def get_funding():
  """Extra query to get the total funding.
  We facet on `title_language` and aggregate funding, as that seems to be a field which is a) single-valued per grant and b) always present.
  We also tried `start_year`, but grants were often missed because they did not have one.
  """
  q = """search grants where research_orgs in {} return title_language aggregate funding limit 500""".format(json.dumps(gridids))
  d = dsl.query(q, verbose=False)  
  funding_total = 0
  if d.count_total and "title_language" in d.json:
    for x in d.title_language:
      funding_total += x['funding']
  print("Grants total Funding: ", funding_total)
  return funding_total


loop1 = pbar(list(query_templates))
for doctype in loop1:
  loop1.set_description("Extracting Total Count of %s" % doctype.capitalize())
  # dsl query
  q = query_templates[doctype].format(json.dumps(gridids), "limit 1")
  data = dsl.query(q, verbose=VERBOSE)
  if doctype == "grants":
    money = get_funding()
    results.append({"source" : doctype, "documents" : data.count_total, "funding" : money})
  else:
    results.append({"source" : doctype, "documents" : data.count_total, "funding" : 0})
  time.sleep(1)

# print(results)

# save to a dataframe
summary = pd.DataFrame().from_dict(results)
save_as_csv(summary, "overview_objects")
# summary




###
#
# STEP 2: References: from Grants, Clinical Trials, Patents etc... to Publications
# In this section we extract existing links from the various document types to the publications from the selected GRID IDs.
#
###


print("\n\n======\n======\nStarting data extraction part [2]....\n======\n======\n\n")

#
# 0. prerequisite: pubs baseset
#

print("===\nExtracting all publications for GRIDs (max 50k, sorted by date)")

q = """search publications where research_orgs in {} return publications[id] sort by date"""
pubs = dsl.query_iterative(q.format(json.dumps(gridids)))
pubs = pubs.as_dataframe()
pubsids = list(pubs['id'])

# dict to store final results
overview_links = { 'publications' : 0, 'grants' : 0, 'patents' : 0, 'clinical_trials' : 0, 'policy_documents' : 0 }


#
# 1. pubs references
#
print("===\nExtracting publications citations")

q = """search publications where research_orgs in {} return year aggregate citations_total limit 1000 """
d = dsl.query(q.format(json.dumps(gridids)))
citations_total = 0
if d.count_total and "year" in d.json:
  for x in d.year:
    citations_total += x['citations_total']
print("===\nTotal Citations: ", citations_total)
overview_links['publications'] = citations_total

#
# 2. grants references 
#

print("===\nExtracting grants citations")

q = """search grants where resulting_publication_ids in {} return grants[id]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in pbar(list(chunks_of(pubsids, CHUNKS_SIZE))):
  query = q.format(json.dumps(chunk))
  data = dsl.query_iterative(query, verbose=VERBOSE)
  results += data.grants
  time.sleep(0.5)

#
# put the citing grants data into a dataframe, remove duplicates and save
grants = pd.DataFrame().from_dict(results)
# print("===\nRelated grants found: ", len(grants))
grants.drop_duplicates(subset='id', inplace=True)
print("Total related grants found: ", len(grants))
overview_links['grants'] = len(grants)


#
# 3. patents references 
#

print("===\nExtracting patents citations")

q = """search patents where publication_ids in {} return patents[id]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in pbar(list(chunks_of(pubsids, CHUNKS_SIZE))):
  query = q.format(json.dumps(chunk))
  data = dsl.query_iterative(query, verbose=VERBOSE)
  results += data.patents
  time.sleep(0.5)

#
# put the citing patents data into a dataframe and remove duplicates
patents = pd.DataFrame().from_dict(results)
# print("===\nRelated patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Total related patents found: ", len(patents))
overview_links['patents'] = len(patents)

#
# 4. clinical trials references 
#

print("===\nExtracting clinical_trials citations")

q = """search clinical_trials where publication_ids in {} return clinical_trials[id]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in pbar(list(chunks_of(pubsids, CHUNKS_SIZE))):
  query = q.format(json.dumps(chunk))
  data = dsl.query_iterative(query, verbose=VERBOSE)
  results += data.clinical_trials
  time.sleep(0.5)

#
# put the citing clinical trials data into a dataframe and remove duplicates
clinical_trials = pd.DataFrame().from_dict(results)
# print("===\nRelated clinical_trials found: ", len(clinical_trials))
clinical_trials.drop_duplicates(subset='id', inplace=True)
print("Total related clinical_trials found: ", len(clinical_trials))
overview_links['clinical_trials'] = len(clinical_trials)


#
# 5. policy documents references 
#

print("===\nExtracting policy_documents citations")

q = """search policy_documents where publication_ids in {} return policy_documents[id]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in pbar(list(chunks_of(pubsids, CHUNKS_SIZE))):
  query = q.format(json.dumps(chunk))
  data = dsl.query_iterative(query, verbose=VERBOSE)
  results += data.policy_documents
  time.sleep(0.5)

#
# put the citing policy documents data into a dataframe and remove duplicates
policy_documents = pd.DataFrame().from_dict(results)
# print("===\nRelated policy_documents found: ", len(policy_documents))
policy_documents.drop_duplicates(subset='id', inplace=True)
print("Total related policy_documents found: ", len(policy_documents))
overview_links['policy_documents'] = len(policy_documents)



#
# 6. collate results and save 
#


print("===\nGenerating summary...")

# we want to produce a list of dicts like this
# overview_links = [
#                   {'from_source' : 'publications',     'to_source' : 'publications', 'links' : 0 },
#                   {'from_source' : 'grants',           'to_source' : 'publications', 'links' : 0 },
#                   {'from_source' : 'patents',          'to_source' : 'publications', 'links' : 0 },
#                   {'from_source' : 'clinical_trials',  'to_source' : 'publications', 'links' : 0 },
#                   {'from_source' : 'policy_documents', 'to_source' : 'publications', 'links' : 0 },
#                 ]

nice_links = []
for x in overview_links:
  nice_links.append({'from_source' : x, 'to_source' : 'publications', 'links' : overview_links[x] })

#
# finally, save to a dataframe
df2 = pd.DataFrame().from_dict(nice_links)
#
save_as_csv(df2, "overview_links")
#
df2




###
#
# STEP 3: Download
#
###

print("\n===\nDownloading...")

# zip up all files to make download easier
import zipfile
import os 

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zip_name = DATAFOLDER + '.zip'
zipf = zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED)
zipdir(DATAFOLDER + '/', zipf)
zipf.close()

try:
  # try to download from colab: sometimes it fails hence print a message
  from google.colab import files
  time.sleep(5)
  files.download(zip_name) 
  print("\n===\nDone.")
except Exception:  # a bare `except:` would also swallow KeyboardInterrupt
  print("ERROR - Google Colab couldn't download - please try again...")
  



GRIDs entered (unique):  2 
 => ['grid.474488.3', 'grid.473100.1']


==
Created data folder: stats_data_grid.474488.3/


Starting data extraction part [1]....




Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

Returned Publications: 1 (total = 100)
Returned Grants: 0
Grants total Funding:  0
Returned Patents: 1 (total = 106)
Returned Clinical_trials: 0
Field 'organizations' is deprecated in favor of research_orgs. Please refer to https://docs.dimensions.ai/dsl/releasenotes.html for more details
Returned Policy_documents: 0

===
Saved:  stats_data_grid.474488.3/overview_objects.csv


Starting data extraction part [2]....


===
Extracting all publications for GRIDs (max 50k, sorted by date)
100 / 100
===
Extracting publications citations
Returned Year: 20
===
Total Citations:  591.0
===
Extracting grants citations


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))


Total related grants found:  15
===
Extracting patents citations


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))


Total related patents found:  7
===
Extracting clinical_trials citations


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))


Total related clinical_trials found:  0
===
Extracting policy_documents citations


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))


Total related policy_documents found:  0
===
Generating summary...
===
Saved:  stats_data_grid.474488.3/overview_links.csv

===
Downloading...
ERROR - Google Colab couldn't download - please try again...


## Downloading the data

The exports should download automatically.
If the download fails, run the cell below (on Google Colab) to try again.

In [None]:
files.download(zip_name)

## Network diagram 

Run the following cell to build a simple network diagram based on the data extracted. 
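
The diagram is built from the two tables produced above: each document type becomes a node sized by its document count, and each row of the links table becomes an edge pointing at `publications`. The sketch below reproduces only the label and edge bookkeeping in plain Python, using the counts printed in the run above. Dropping zero-weight edges is an optional choice made here for illustration; the cell below keeps them.

```python
# Counts taken from the output of the extraction run above.
summary_records = [
    {"source": "publications", "documents": 100},
    {"source": "grants", "documents": 0},
    {"source": "patents", "documents": 106},
    {"source": "clinical_trials", "documents": 0},
    {"source": "policy_documents", "documents": 0},
]
link_records = [
    {"from_source": "publications", "to_source": "publications", "links": 591.0},
    {"from_source": "grants", "to_source": "publications", "links": 15},
    {"from_source": "patents", "to_source": "publications", "links": 7},
    {"from_source": "clinical_trials", "to_source": "publications", "links": 0},
    {"from_source": "policy_documents", "to_source": "publications", "links": 0},
]

# Node labels: "<document type> (<count>)".
labels = ["{} ({})".format(r["source"], r["documents"]) for r in summary_records]

# Edges as (from, to, weight); weights are cast to int because the citation
# total is a float, and links with weight 0 are skipped.
edges = [
    (r["from_source"], r["to_source"], int(r["links"]))
    for r in link_records
    if r["links"] > 0
]

print(labels)
print(edges)
```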

In [6]:
#@markdown Please select the width of the diagram

WIDTH = 945 #@param {type: "slider", min: 500, max: 1200}


#
# 7. Display Network Graph 
#

from dimcli.core.extras import NetworkViz
json_summary = json.loads(summary.to_json(orient="records"))
json_links = json.loads(df2.to_json(orient="records"))

labels = [x['source'] + " ({})".format(x['documents']) for x in json_summary]
titles = [str(x['documents']) + " " + x['source'] + " from {}".format(str(gridids)) for x in json_summary]

g = NetworkViz(notebook=True, width="{}px".format(WIDTH))
g.add_nodes(list(summary.source),
            value=[x for x in summary.documents], # never 0, default 1
            title=titles,
            label=labels,
            color=["#00ff1e", "#162347", "#dd4b39", "#00bfff", "#ffbf00"])

for x in json_links:
  g.add_edge(x['from_source'], x['to_source'], 
             value=int(x['links']), 
             label=int(x['links']), 
             arrows="to",
             title="{} {} link to {} publications ".format(int(x['links']), x['from_source'], str(gridids)))

g.show(DATAFOLDER + "/graph.html")

