# Fraunhofer Data in Dimensions: Statistics

This notebook produces up-to-date statistics on the number of documents associated with the Fraunhofer Society organizations that are available via Dimensions.

The statistics produced are: 
* number of authored publications 
* number of grants received 
* number of associated clinical trials 
* number of assigned patents 
* number of associated policy documents *(TBC as this is zero)*

Furthermore, starting from the authored Publications data, we also extract the overall number of outgoing links to other objects: Grants, Clinical Trials, Patents and Policy Documents. 




Background: 
* Extraction Plan: https://docs.google.com/document/d/1uNT-GXP5xoL0IGMwLFTJ9p6R8yM1z3x1vnl_1t6qMQM/edit
* Fraunhofer GRIDs: https://docs.google.com/spreadsheets/d/1R0aDY0BZvzag9dAiklaqWq9v8FQpMxHSqZVcKg__CmQ/edit#gid=0


## 1. Install Libraries


In [1]:
!pip install dimcli plotly_express -U --quiet 
import os
# `!mkdir` never raises a Python exception, so the folder is created with os.makedirs instead
try:
  os.makedirs("results", exist_ok=True)
except OSError:
  print("Failed to create a `results` folder to store the output of this notebook - please do it manually.")

## 2. Log into the Dimensions API

This cell also imports the libraries used throughout the notebook.
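
As an alternative to typing credentials into the cell, they can be read from the environment. The following is a minimal sketch, not the notebook's own method: the helper `read_credentials` and the variable names `DIMENSIONS_USER` and `DIMENSIONS_PASSWORD` are hypothetical and are not defined by `dimcli`.

```python
import os
from getpass import getpass


def read_credentials(env=os.environ, prompt=getpass):
    """Return (user, password), preferring environment variables over prompts."""
    # DIMENSIONS_USER / DIMENSIONS_PASSWORD are hypothetical names chosen for this sketch
    user = env.get("DIMENSIONS_USER") or input("Dimensions username: ")
    password = env.get("DIMENSIONS_PASSWORD") or prompt("Dimensions password: ")
    return user, password


# Demo with an in-memory mapping, so nothing is prompted for and nothing real is stored
demo_env = {"DIMENSIONS_USER": "someone@example.org", "DIMENSIONS_PASSWORD": "not-a-real-password"}
user, password = read_credentials(demo_env)
print("=> username is:", user)
print("=> password is:", "*" * len(password))
```

The returned pair could then be passed to the login call in the cell below, which keeps the password out of the saved notebook.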

In [2]:
##
# load common libraries
##

import dimcli
from dimcli.shortcuts import *
import plotly_express as px
import pandas as pd
import time
import json


##
# LOG IN 
##

user = "m.pasin@digital-science.com"  #@param {type: "string"}
password = "" #@param {type: "string"}
print('=> username is:', user)
print('=> password is:', "*" * len(password))
dimcli.login(user, password)

##
# OBJECTS 
##

dsl = dimcli.Dsl()



=> username is: 
=> password is: 
ERROR: `dsl.ini` credentials file not found.
HowTo: https://github.com/digital-science/dimcli#credentials-file





## 3. GRIDs selection

In [0]:
#@markdown Please choose a specific member of the Fraunhofer Society, or 'ALL', and then run this cell.

organization = "ALL"  #@param ['ALL', 'Fraunhofer Institute for Laser Technology == grid.461628.f', 'Fraunhofer Institute for Molecular Biology and Applied Ecology == grid.418010.c', 'Fraunhofer Institute for Production Technology == grid.461634.2', 'Fraunhofer Research Institution for Casting, Composite and Processing Technology == grid.506241.4', 'Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute == grid.435231.2', 'Fraunhofer Institute for Open Communication Systems == grid.469837.7', 'Fraunhofer Institute for Production Systems and Design Technology == grid.469819.b', 'Fraunhofer Institute for Reliability and Microintegration == grid.469839.9', 'Fraunhofer Institute for Wood Research Wilhelm-Klauditz-Institut == grid.469829.8', 'Fraunhofer Institute for Surface Engineering and Thin Films == grid.462227.7', 'Fraunhofer Institute for Digital Medicine == grid.428590.2', 'Fraunhofer Institute for Manufacturing Technology and Advanced Materials == grid.461617.3', 'Fraunhofer Institute for Wind Energy Systems == grid.8440.8', 'Fraunhofer Institute for Electronic Nano Systems == grid.469847.0', 'Fraunhofer Institute for Machine Tools and Forming Technology == grid.461651.1', 'Fraunhofer Institute for Structural Durability and System Reliability == grid.434481.e', 'Fraunhofer Institute for Computer Graphics Research == grid.461618.c', 'Fraunhofer Institute for Secure Information Technology == grid.469848.f', 'Fraunhofer Institute for Material Flow and Logistics == grid.469827.6', 'Fraunhofer Institute for Software and Systems Engineering == grid.469821.0', 'Fraunhofer Institute for Ceramic Technologies and Systems == grid.461622.5', 'Fraunhofer Institute for Organic Electronics, Electron Beam and Plasma Technology == grid.469851.7', 'Fraunhofer Institute for Photonic Microsystems == grid.469853.5', 'Fraunhofer Institute for Transportation and Infrastructure Systems == grid.469826.7', 'Fraunhofer Institute for Material and Beam Technology == grid.461641.0', 'Fraunhofer Institute for Microelectronic Circuits and Systems == grid.469854.2', 'Fraunhofer Institute for Integrated Circuits == grid.469850.6', 'Fraunhofer Institute for Integrated Circuits == grid.469823.2', 'Fraunhofer Institute for Integrated Systems and Device Technology == grid.469855.3', 'Fraunhofer Institute of Optronics, System Technologies and Image Exploitation == grid.466706.5', 'Fraunhofer Institute for Technological Trend Analysis == grid.469856.0', 'Fraunhofer Institute for Applied Solid State Physics == grid.424642.2', 'Fraunhofer Institute for High-Speed Dynamics, Ernst-Mach-Institut == grid.461627.0', 'Fraunhofer Institute for Physical Measurement Techniques == grid.461631.7', 'Fraunhofer Institute for Solar Energy Systems == grid.434479.9', 'Fraunhofer Institute for Mechanics of Materials == grid.461645.4', 'Fraunhofer Institute for Process Engineering and Packaging == grid.466709.a', 'Fraunhofer Institute for Microstructure of Materials and Systems == grid.469857.1', 'Fraunhofer Research Institution for Additive Manufacturing Technologies == grid.506239.b', 'Fraunhofer Institute for Toxicology and Experimental Medicine == grid.418009.4', 'Fraunhofer Institute for Ceramic Technologies and Systems == grid.461622.5', 'Fraunhofer Institute for Building Physics == grid.469871.5', 'Fraunhofer Institute for Digital Media Technology  == grid.469861.4', 'Fraunhofer Institute for Silicon Technology == grid.469817.5', 'Fraunhofer Institute for Applied Optics and Precision Engineering == grid.418007.a', 'Fraunhofer Institute for Experimental Software Engineering == grid.469863.6', 'Fraunhofer Institute for Industrial Mathematics == grid.461635.3', 'Fraunhofer Institute of Optronics, System Technologies and Image Exploitation == grid.466706.5', 'Fraunhofer Institute for Systems and Innovation Research == grid.459551.9', 'Fraunhofer Institute for Energy Economics and Energy System Technology == grid.506250.4', 'Fraunhofer Center for International Management and Knowledge Economy == grid.462230.1', 'Fraunhofer Institute for Cell Therapy and Immunology == grid.418008.5', 'Fraunhofer Research Institution for Marine Biotechnology and Cell Technology == grid.469834.4', 'Fraunhofer Institute for Factory Operation and Automation == grid.469818.a', 'Fraunhofer Institute for Microengineering and Microsystems == grid.28894.3f', 'Fraunhofer Institute for Embedded Systems and Communication Technologies == grid.469865.0', 'Fraunhofer Research Institution for Microsystems and Solid State Technologies == grid.469866.3', 'Fraunhofer Institute for Applied and Integrated Security == grid.469867.2', 'Fraunhofer Institute for Environmental, Safety, and Energy Technology == grid.424428.c', 'Fraunhofer Institute for Mechatronic Systems Design == grid.469868.d', 'Fraunhofer Institute for Chemical Technology == grid.461616.2', 'Fraunhofer Institute for Applied Polymer Research == grid.461615.1', 'Fraunhofer Research Institution for Large Structures in Production Engineering == grid.506226.5', 'Fraunhofer Institute for Nondestructive Testing == grid.469830.0', 'Fraunhofer Institute for Algorithms and Scientific Computing == grid.418688.b', 'Fraunhofer Institute for Applied Information Technology == grid.469870.4', 'Fraunhofer Institute for Intelligent Analysis and Information Systems == grid.469822.3', 'Fraunhofer Institute for Molecular Biology and Applied Ecology == grid.418010.c', 'Fraunhofer Institute for Industrial Engineering == grid.434477.7', 'Fraunhofer Institute for Building Physics == grid.469871.5', 'Fraunhofer Institute for Interfacial Engineering and Biotechnology == grid.469831.1', 'Fraunhofer Institute for Manufacturing Engineering and Automation == grid.469833.3', 'Fraunhofer Information Center for Planning and Building == grid.469872.6', 'Fraunhofer Institute for Biomedical Engineering == grid.452493.d', 'Fraunhofer Institute for High Frequency Physics and Radar Techniques == grid.461619.d', 'Fraunhofer Institute for Communication, Information Processing and Ergonomics == grid.469836.6', 'Fraunhofer Institute for Silicate Research == grid.424644.4']
grids_all = ['Fraunhofer Institute for Laser Technology == grid.461628.f', 'Fraunhofer Institute for Molecular Biology and Applied Ecology == grid.418010.c', 'Fraunhofer Institute for Production Technology == grid.461634.2', 'Fraunhofer Research Institution for Casting, Composite and Processing Technology == grid.506241.4', 'Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute == grid.435231.2', 'Fraunhofer Institute for Open Communication Systems == grid.469837.7', 'Fraunhofer Institute for Production Systems and Design Technology == grid.469819.b', 'Fraunhofer Institute for Reliability and Microintegration == grid.469839.9', 'Fraunhofer Institute for Wood Research Wilhelm-Klauditz-Institut == grid.469829.8', 'Fraunhofer Institute for Surface Engineering and Thin Films == grid.462227.7', 'Fraunhofer Institute for Digital Medicine == grid.428590.2', 'Fraunhofer Institute for Manufacturing Technology and Advanced Materials == grid.461617.3', 'Fraunhofer Institute for Wind Energy Systems == grid.8440.8', 'Fraunhofer Institute for Electronic Nano Systems == grid.469847.0', 'Fraunhofer Institute for Machine Tools and Forming Technology == grid.461651.1', 'Fraunhofer Institute for Structural Durability and System Reliability == grid.434481.e', 'Fraunhofer Institute for Computer Graphics Research == grid.461618.c', 'Fraunhofer Institute for Secure Information Technology == grid.469848.f', 'Fraunhofer Institute for Material Flow and Logistics == grid.469827.6', 'Fraunhofer Institute for Software and Systems Engineering == grid.469821.0', 'Fraunhofer Institute for Ceramic Technologies and Systems == grid.461622.5', 'Fraunhofer Institute for Organic Electronics, Electron Beam and Plasma Technology == grid.469851.7', 'Fraunhofer Institute for Photonic Microsystems == grid.469853.5', 'Fraunhofer Institute for Transportation and Infrastructure Systems == grid.469826.7', 'Fraunhofer Institute for Material and Beam Technology == grid.461641.0', 'Fraunhofer Institute for Microelectronic Circuits and Systems == grid.469854.2', 'Fraunhofer Institute for Integrated Circuits == grid.469850.6', 'Fraunhofer Institute for Integrated Circuits == grid.469823.2', 'Fraunhofer Institute for Integrated Systems and Device Technology == grid.469855.3', 'Fraunhofer Institute of Optronics, System Technologies and Image Exploitation == grid.466706.5', 'Fraunhofer Institute for Technological Trend Analysis == grid.469856.0', 'Fraunhofer Institute for Applied Solid State Physics == grid.424642.2', 'Fraunhofer Institute for High-Speed Dynamics, Ernst-Mach-Institut == grid.461627.0', 'Fraunhofer Institute for Physical Measurement Techniques == grid.461631.7', 'Fraunhofer Institute for Solar Energy Systems == grid.434479.9', 'Fraunhofer Institute for Mechanics of Materials == grid.461645.4', 'Fraunhofer Institute for Process Engineering and Packaging == grid.466709.a', 'Fraunhofer Institute for Microstructure of Materials and Systems == grid.469857.1', 'Fraunhofer Research Institution for Additive Manufacturing Technologies == grid.506239.b', 'Fraunhofer Institute for Toxicology and Experimental Medicine == grid.418009.4', 'Fraunhofer Institute for Ceramic Technologies and Systems == grid.461622.5', 'Fraunhofer Institute for Building Physics == grid.469871.5', 'Fraunhofer Institute for Digital Media Technology  == grid.469861.4', 'Fraunhofer Institute for Silicon Technology == grid.469817.5', 'Fraunhofer Institute for Applied Optics and Precision Engineering == grid.418007.a', 'Fraunhofer Institute for Experimental Software Engineering == grid.469863.6', 'Fraunhofer Institute for Industrial Mathematics == grid.461635.3', 'Fraunhofer Institute of Optronics, System Technologies and Image Exploitation == grid.466706.5', 'Fraunhofer Institute for Systems and Innovation Research == grid.459551.9', 'Fraunhofer Institute for Energy Economics and Energy System Technology == grid.506250.4', 'Fraunhofer Center for International Management and Knowledge Economy == grid.462230.1', 'Fraunhofer Institute for Cell Therapy and Immunology == grid.418008.5', 'Fraunhofer Research Institution for Marine Biotechnology and Cell Technology == grid.469834.4', 'Fraunhofer Institute for Factory Operation and Automation == grid.469818.a', 'Fraunhofer Institute for Microengineering and Microsystems == grid.28894.3f', 'Fraunhofer Institute for Embedded Systems and Communication Technologies == grid.469865.0', 'Fraunhofer Research Institution for Microsystems and Solid State Technologies == grid.469866.3', 'Fraunhofer Institute for Applied and Integrated Security == grid.469867.2', 'Fraunhofer Institute for Environmental, Safety, and Energy Technology == grid.424428.c', 'Fraunhofer Institute for Mechatronic Systems Design == grid.469868.d', 'Fraunhofer Institute for Chemical Technology == grid.461616.2', 'Fraunhofer Institute for Applied Polymer Research == grid.461615.1', 'Fraunhofer Research Institution for Large Structures in Production Engineering == grid.506226.5', 'Fraunhofer Institute for Nondestructive Testing == grid.469830.0', 'Fraunhofer Institute for Algorithms and Scientific Computing == grid.418688.b', 'Fraunhofer Institute for Applied Information Technology == grid.469870.4', 'Fraunhofer Institute for Intelligent Analysis and Information Systems == grid.469822.3', 'Fraunhofer Institute for Molecular Biology and Applied Ecology == grid.418010.c', 'Fraunhofer Institute for Industrial Engineering == grid.434477.7', 'Fraunhofer Institute for Building Physics == grid.469871.5', 'Fraunhofer Institute for Interfacial Engineering and Biotechnology == grid.469831.1', 'Fraunhofer Institute for Manufacturing Engineering and Automation == grid.469833.3', 'Fraunhofer Information Center for Planning and Building == grid.469872.6', 'Fraunhofer Institute for Biomedical Engineering == grid.452493.d', 'Fraunhofer Institute for High Frequency Physics and Radar Techniques == grid.461619.d', 'Fraunhofer Institute for Communication, Information Processing and Ergonomics == grid.469836.6', 'Fraunhofer Institute for Silicate Research == grid.424644.4']
if organization == "ALL":
  gridids = [x.split("==")[1].strip() for x in grids_all]
else:
  gridids = [organization.split("==")[1].strip()]

gridids = list(set(gridids))
print("Selection: ", len(gridids), "unique GRIDs", gridids)

def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)

# generate a link to Dimensions
from IPython.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions &#x29c9;</a>'.format(dimensions_url(gridids))))
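
The selection cell above relies on every dropdown label having the form `Name == grid.id`. The parsing step can be sketched in isolation as follows, using two labels from the list; the helper name `extract_grid_ids` is hypothetical and is not part of the notebook or of `dimcli`.

```python
def extract_grid_ids(labels):
    """Return the unique GRID IDs found in 'Name == grid.xxx' labels, in first-seen order."""
    seen = {}
    for label in labels:
        # Split on the last '==' so a name that itself contained '==' would not break the parse
        name, _, grid_id = label.rpartition("==")
        seen.setdefault(grid_id.strip(), name.strip())
    return list(seen)


labels = [
    "Fraunhofer Institute for Laser Technology == grid.461628.f",
    "Fraunhofer Institute for Building Physics == grid.469871.5",
    "Fraunhofer Institute for Building Physics == grid.469871.5",  # duplicate entry
]
grid_ids = extract_grid_ids(labels)
print(grid_ids)
```

Unlike `list(set(...))`, this variant keeps the IDs in a stable order, which makes the later chunking reproducible between runs.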


## 4. Data Extraction: Counts of Publications, Grants, Patents etc.

In this step we count how many documents are directly associated with the selected GRID IDs, for each of the main Dimensions document types.

Things to note: 

* Since the list of GRID IDs can be long, we split it into smaller chunks so that the API queries don't get too long. There is thus a separate query for each chunk of GRID IDs.
* After going through all the chunks, we combine the results and remove duplicates, because the same document can be returned for different GRID IDs.
* All data gets saved in the `results` folder as CSV files.
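
The chunk-then-deduplicate pattern described above can be sketched without calling the API. In this illustration `fake_query` and its small catalogue are hypothetical stand-ins for a Dimensions query, and `chunks_of` is re-implemented locally so the snippet runs on its own.

```python
def chunks_of(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_query(grid_chunk):
    """Hypothetical stand-in for an API call: returns records as a list of dicts."""
    # pub.1 is affiliated with two organizations, so it is returned for both GRID IDs
    catalogue = {
        "grid.a": [{"id": "pub.1"}, {"id": "pub.2"}],
        "grid.b": [{"id": "pub.1"}],
        "grid.c": [{"id": "pub.3"}],
    }
    return [rec for grid in grid_chunk for rec in catalogue.get(grid, [])]


bucket = []
for chunk in chunks_of(["grid.a", "grid.b", "grid.c"], 1):
    bucket += fake_query(chunk)

# dict.fromkeys keeps the first occurrence of each ID and preserves order
unique_ids = list(dict.fromkeys(rec["id"] for rec in bucket))
print(len(bucket), "records fetched,", len(unique_ids), "unique")
```

The notebook performs the same deduplication with pandas `drop_duplicates(subset="id")`; the plain-Python version is used here only to keep the sketch free of dependencies.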

In [0]:
# `tqdm_notebook` is deprecated in recent tqdm releases; `tqdm.notebook.tqdm` replaces it
from tqdm.notebook import tqdm as pbar

query_templates = {
    "publications" : """search publications where research_orgs in {} return publications[id]""" ,
    "grants" : """search grants where research_orgs in {} return grants[id]""" ,
    "clinical_trials" : """search clinical_trials where organizations in {} return clinical_trials[id]""" ,
    "patents" : """search patents where assignees in {} return patents[id]""" ,
    "policy_documents" : """search policy_documents where publisher_org in {} return policy_documents[id]""" ,
    "publications_altmetric" : """search publications where research_orgs in {} and altmetric > 0 return publications[id]""" ,
}

VERBOSE = False

results = []

loop1 = pbar(list(query_templates))
for doctype in loop1:
  loop1.set_description("Extracting %s" % doctype.capitalize())
  loop2 = pbar(list(chunks_of(list(gridids), 10)))
  bucket = []
  for chunk in loop2: 
    loop2.set_description("Processing Grid IDs..")
    #
    q = query_templates[doctype].format(json.dumps(chunk))
    data = dsl.query_iterative(q, verbose=VERBOSE)
    #
    if doctype == "publications_altmetric": # name used only for results export
      bucket += getattr(data, "publications")
    else:
      bucket += getattr(data, doctype)
    # time.sleep(1)
  # naming the column up front keeps drop_duplicates working when no records were found
  temp_df = pd.DataFrame(bucket, columns=["id"])
  temp_df.drop_duplicates(subset="id", inplace=True)
  temp_df.to_csv(f"results/tot_{doctype}.csv", index=False)
  print(f"===\nUnique {doctype} found: ", len(temp_df))
  results.append({'doctype': doctype, 'count' : len(temp_df)})


# save to a dataframe
df = pd.DataFrame(results)
df.to_csv("results/overview_objects.csv", index=False)
df

## 5. Data Extraction: links **from** Publications **to** Grants, Clinical Trials, Patents etc.

In this section we extract existing links from **publications** to the other document types, based on the publications list previously created for the selected GRID IDs.

Note: 

* The publications data extracted above (`tot_publications.csv`) is loaded to seed the extraction.
* We are only interested in total counts, but since each document type requires several queries, we still need to extract all the records in order to remove duplicates afterwards.
* After each query we pause for half a second to reduce the risk of hitting the API quota of 30 queries per minute.
* As before, the data gets saved in the `results` folder.
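
The quota arithmetic behind the pause can be made explicit: at 30 queries per minute, consecutive queries need to start at least 60 / 30 = 2 seconds apart, so a shorter pause only stays within the quota when each query itself takes long enough to make up the difference. A minimal sketch of a throttled loop follows; `min_pause_seconds` and `run_throttled` are hypothetical helpers, not `dimcli` functions.

```python
import time


def min_pause_seconds(max_queries_per_minute):
    """Smallest gap between query starts that stays within a per-minute quota."""
    return 60.0 / max_queries_per_minute


def run_throttled(queries, run_query, pause, sleep=time.sleep):
    """Run each query in turn, sleeping `pause` seconds after every call."""
    results = []
    for q in queries:
        results.append(run_query(q))
        sleep(pause)
    return results


pause = min_pause_seconds(30)
recorded = []  # collects the requested pauses instead of actually sleeping
answers = run_throttled(["q1", "q2", "q3"], str.upper, pause, sleep=recorded.append)
print(pause, answers, recorded)
```

Passing the `sleep` function as a parameter keeps the loop easy to test; in the notebook it would simply default to `time.sleep`.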



In [0]:
pubs = pd.read_csv("results/tot_publications.csv")

In [0]:
#
# the main queries
#

q_template_sources = {
    'grants' : 
      """search grants where resulting_publication_ids in {} return grants[id] limit 1000""",
    'clinical_trials' : 
      """search clinical_trials where publication_ids in {} return clinical_trials[id] limit 1000""",
    'patents' : 
      """search patents where publication_ids in {} return patents[id] limit 1000""",
    'policy_documents' : 
      """search policy_documents where publication_ids in {} return policy_documents[id] limit 1000""", 
}

ids = pubs['id']

VERBOSE = False

output = []

for doctype in q_template_sources:
  print("\n===\nExtracting", doctype)
  results = []
  for chunk in pbar(list(chunks_of(list(ids), 300))):
    q = q_template_sources[doctype].format(json.dumps(chunk))
    # print(q)
    data = dsl.query(q, verbose=VERBOSE)
    results += getattr(data, doctype)
    time.sleep(0.5)
  # naming the column up front keeps drop_duplicates working when no links were found
  dfdf = pd.DataFrame(results, columns=["id"])
  dfdf.drop_duplicates(subset="id", inplace=True)
  dfdf.to_csv(f"results/linked_{doctype}.csv", index=False)
  print(f"---\nLinks to {doctype} found:", len(dfdf))
  output.append({'from': 'publications', 'to': doctype, 'links' : len(dfdf)})


# save to a dataframe
df2 = pd.DataFrame(output)
df2.to_csv("results/overview_links.csv", index=False)
df2


## 6. Downloading all results 

If you are viewing this notebook in **Google Colab**, run the following cell to download all data as a zip file. 

In [0]:

# zip up all files to make download easier
import zipfile
import os 

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

# the context manager closes the archive even if zipdir raises
with zipfile.ZipFile('results.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir('results/', zipf)

try:
  # try to download from colab: sometimes it fails hence print a message
  from google.colab import files
  time.sleep(10)
  files.download('results.zip') 
except Exception:
  print("Google Colab failed to download - please try again.")
