# Patents Statistics Report, for a selected Organization

Given a GRID ID and a time frame, this notebook extracts all publications from that organization that were cited by patents, together with the citing patents.
    
Publications CSV fields 

```
1. Dim. pub ID  [as a link]
2. Pub title [Full title]
3. Citations [number of]
4. Recent citations [number of]
5. FCR [number of]
6. FOR-codes [semicolon separated values, no cap]
7. Authors [semicolon separated values, capped]
8. Affiliations [semicolon separated values, capped]
9. Own authors [GRID's own authors, nested not capped; include corr. author when available]
10. # of citing patents [number of]
11. Citing patents IDs [Nested, no cap]
```

Patents CSV fields 

```
1. Dim. patent ID [as a link]
2. FOR-codes [semicolon separated values, no cap]
3. Patent title
4. Inventors [semicolon separated values, no cap; all names also if not disambiguated]
5. Assignees [Nested, no cap]
6. # of cited pubs (total) [number of]
7. # of cited pubs (GRID's own) [number of]
8. Cited publication IDs [Nested, no cap]
9. # of patent citations [number of]
```
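Several of the fields above are described as "semicolon separated values", some of them capped. As a minimal sketch of what that flattening could look like, the hypothetical helper below (`join_values`, its `cap` parameter and the "(+N more)" suffix are illustrative assumptions, not part of the notebook) joins a list into one CSV cell:

```python
from typing import Optional, Sequence


def join_values(values: Optional[Sequence[str]], cap: Optional[int] = None, sep: str = "; ") -> str:
    """Flatten a list of strings into a single delimited cell, optionally capped."""
    # Missing values (None, NaN) become an empty cell
    if not isinstance(values, (list, tuple)):
        return ""
    # Drop empty entries while keeping the original order
    items = [str(v).strip() for v in values if v and str(v).strip()]
    if cap is not None and len(items) > cap:
        hidden = len(items) - cap
        return sep.join(items[:cap]) + f"{sep}(+{hidden} more)"
    return sep.join(items)


authors = ["Ada Lovelace", "Alan Turing", "Grace Hopper", "Edsger Dijkstra"]
print(join_values(authors))         # uncapped
print(join_values(authors, cap=2))  # capped at two names
```

The cap value is a presentation choice; nothing in the field list above fixes a specific number.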

Sample outputs (Google Sheets):
* pubs: https://docs.google.com/spreadsheets/d/1NBprMWj4jOkEXApX2pKLXiUVTXMYnEopyoMn0I4iLG4/edit?usp=sharing
* patents: https://docs.google.com/spreadsheets/d/1WtWxzbiNm6uQbnTvzAUK0D-J95f97gSiMPNWUV6yAdM/edit?usp=sharing
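The two files are linked through IDs: each publication row lists its citing patents, and each patent row lists the publications it cites. The sketch below shows that relationship on toy data (all IDs are made up for illustration), by inverting a patent-to-publications mapping and counting on both sides:

```python
from collections import defaultdict

# Toy records standing in for patents (IDs are invented for this example)
patents = [
    {"id": "PAT-A", "publication_ids": ["pub.1", "pub.2"]},
    {"id": "PAT-B", "publication_ids": ["pub.2", "pub.10"]},
]
own_pubs = {"pub.1", "pub.2"}  # publications of the selected organization

# Invert patent -> publications into publication -> citing patents
patents_by_pub = defaultdict(list)
for patent in patents:
    for pub_id in patent["publication_ids"]:
        if pub_id in own_pubs:
            patents_by_pub[pub_id].append(patent["id"])

# Publications side: number of citing patents and their IDs
for pub_id in sorted(patents_by_pub):
    print(pub_id, len(patents_by_pub[pub_id]), patents_by_pub[pub_id])

# Patents side: total cited publications vs. those of the organization
for patent in patents:
    cited = patent["publication_ids"]
    print(patent["id"], len(cited), len(own_pubs.intersection(cited)))
```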




## 1. Install Libraries and Log into Dimensions API


In [1]:
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}


# INSTALL/LOAD LIBRARIES 
# Note: optimized for Google Colab; adjust the installation to suit your environment
# 
print("==\nInstalling libraries..")
!pip install dimcli tqdm -U --quiet 

import os
import sys
import time
import json
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
try:
  from tqdm.notebook import tqdm as pbar
except ImportError:
  # older tqdm releases
  from tqdm import tqdm_notebook as pbar
import dimcli 
from dimcli.shortcuts import *

# AUTHENTICATION 
# https://github.com/digital-science/dimcli#authentication
#
# == Google Colab users ==
# If username/password not provided, the interactive setup assistant `dimcli --init` is invoked
#
# == Jupyter Notebook users == 
# If username/password not provided, try to use the global API credentials file.
# To create one, open a terminal (File/New/Terminal) and run `dimcli --init` from there
#  
#
print("==\nLogging in..")
if username and password:
  dimcli.login(username, password, endpoint)
else:
  if 'google.colab' in sys.modules:
    print("Environment: Google Colab")
    if username and not password:
      import getpass
      password = getpass.getpass(prompt='Password: ')     
      dimcli.login(username, password, endpoint)
    else:
      print("... launching interactive setup assistant")
      !dimcli --init    
      dimcli.login()
  else:
    print("Environment: Jupyter Notebook\n... looking for API credentials file")
    dimcli.login()

dsl = dimcli.Dsl()

   

==
Installing libraries..

==
Logging in..
Environment: Jupyter Notebook
... looking for API credentials file
Dimcli - Dimensions API Client (v0.7.3)
Connected to: https://app.dimensions.ai - DSL v1.26
Method: dsl.ini file


## 2. Select GRID organization and time period

Tip: pick one from https://grid.ac/institutes. 
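Before running a query it can help to catch obvious typos in the identifier. The check below is only a sketch: the pattern is an assumption inferred from IDs such as `grid.170205.1`, not an official specification, and `looks_like_grid_id` is a hypothetical helper name.

```python
import re

# Assumed shape: "grid.<digits>.<alphanumeric suffix>" (inferred, not an official spec)
GRID_ID_PATTERN = re.compile(r"^grid\.\d+\.[0-9a-z]+$")


def looks_like_grid_id(value: str) -> bool:
    """Return True when `value` has the general shape of a GRID ID."""
    return bool(GRID_ID_PATTERN.match(value.strip()))


print(looks_like_grid_id("grid.170205.1"))
print(looks_like_grid_id("170205"))
```

A passing check only means the string is well formed; whether the organization exists still has to be confirmed by a lookup.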

In [2]:
#@markdown Please enter a valid GRID ID for the organization

GRIDID = "grid.170205.1" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2012 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2013 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START

# look up the organization name and generate a link to Dimensions
from IPython.display import display, HTML
try:
  gridname = dsl.query(f"""search organizations where id="{GRIDID}" return organizations[name]""", verbose=False).organizations[0]['name']
except Exception:
  # unknown ID or failed query: fall back to an empty name
  gridname = ""
display(HTML('GRID: <a href="{}" title="View selected organization in Dimensions">{} - {} &#x29c9;</a>'.format(dimensions_url(GRIDID), GRIDID, gridname)))
display(HTML('Time period: {} to {} <br /><br />'.format(YEAR_START, YEAR_END)))


#
# data-saving utils 
#
DATAFOLDER = "stats_patents_" + str(GRIDID)
if not os.path.exists(DATAFOLDER):
  os.makedirs(DATAFOLDER)  # portable alternative to the `!mkdir` shell call
  print("==\nCreated data folder:", DATAFOLDER + "/")
#
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaving: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")



==
Created data folder: stats_patents_grid.170205.1/


## 3. Run the extraction

The results are saved in a folder called `stats_patents_<GRID ID>` (e.g. `stats_patents_grid.170205.1`), which contains two CSV files: `publications.csv` and `patents.csv`.
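The nested columns (citing patent IDs, cited publication IDs) are stored as Python lists, which pandas writes to CSV as their string form. A minimal sketch of reading such a column back into lists, using a toy frame and a temporary folder rather than the real output files:

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, mirroring the shape of `patents_ids`
df = pd.DataFrame({
    "id": ["pub.1", "pub.2"],
    "patents_ids": [["PAT-A"], ["PAT-A", "PAT-B"]],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "publications.csv")
    df.to_csv(path, index=False)
    # The list column is stored as text such as "['PAT-A', 'PAT-B']";
    # ast.literal_eval turns that text back into a Python list
    loaded = pd.read_csv(path, converters={"patents_ids": ast.literal_eval})

print(loaded["patents_ids"].tolist())
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this round trip.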

Note: a single publication record can sometimes contain a lot of data, causing the publications extraction to fail with a 'query too long or complex' (or 'response too large') error. If this happens, reduce the number of records extracted per iteration; 1000 is generally fine, but record-heavy result sets may need a lower value.
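Picking the batch size by hand can take a few attempts. The sketch below shows one way to automate it: page through the records and halve the batch size whenever a page fails. It does not use the Dimensions API; `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake fetcher simply rejects batches above 100 records to simulate the error.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]],
              total: int, limit: int = 1000, min_limit: int = 10) -> List[dict]:
    """Page through `total` records, halving `limit` whenever a page fails."""
    records: List[dict] = []
    skip = 0
    while skip < total:
        try:
            page = fetch_page(skip, limit)
        except RuntimeError:
            if limit <= min_limit:
                raise  # already at the smallest batch size: give up
            limit = max(min_limit, limit // 2)
            continue  # retry the same offset with a smaller batch
        if not page:
            break  # nothing more to read
        records.extend(page)
        skip += len(page)
    return records


def fake_fetch_page(skip: int, limit: int) -> List[dict]:
    """Hypothetical stand-in for an API call: rejects batches above 100 records."""
    if limit > 100:
        raise RuntimeError("response too large")
    return [{"id": i} for i in range(skip, min(skip + limit, 450))]


records = fetch_all(fake_fetch_page, total=450)
print(len(records))
```

The smaller batch size is kept for the rest of the run, on the assumption that later pages are as heavy as the one that failed.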

In [3]:
#@markdown If you run into 'query too long or complex' errors, try reducing the records extracted per iteration. 
PUBS_PER_ITERATION = 100 #@param {type: "slider", min: 10, max: 1000}
# PS this is just for publications loop

# ;;;;;
#
# get publications from selected grid and time period
#
# ;;;;;


print("===\nExtracting publications for: ", GRIDID, "from", YEAR_START, "to", YEAR_END)

publications = dsl.query_iterative(f"""
search publications
    where research_orgs.id = "{GRIDID}"
    and year in [{YEAR_START}:{YEAR_END}]
    return publications[id+doi+title+times_cited+recent_citations+field_citation_ratio+category_for+authors]
""", limit=PUBS_PER_ITERATION).as_dataframe()

print("Total publications found: ", len(publications))


# ;;;;;
#
# get patents citations 
#
# ;;;;;



pubsids = list(publications['id'])

print("\n===\nExtracting patents citing these publications")

q = """search patents where publication_ids in {}
  return patents[id+title+category_for+inventor_names+assignee_names+publication_ids+times_cited]"""

# iterate pubids using chunks 
VERBOSE = False
CHUNKS_SIZE = 300 
results = []

for chunk in pbar(list(chunks_of(pubsids, CHUNKS_SIZE))):
    query = q.format(json.dumps(chunk))
    data = dsl.query_iterative(query, verbose=VERBOSE)
    results += data.patents
    time.sleep(0.5)

# remove duplicates (the same patent can be returned for several chunks)
patents = pd.DataFrame(results)
patents.drop_duplicates(subset='id', inplace=True)

print("Total related patents found: ", len(patents))




    
# ;;;;;
#
# count PATENTS per publication
#
# ;;;;;


print("\n===\nCounting patents for each publication...")

def get_patents_per_pub(pubid):
    "IDs of the patents whose `publication_ids` list contains exactly `pubid`"
    # exact list membership: a substring/regex test on a joined string
    # could also match e.g. 'pub.100' inside 'pub.1000'
    mask = patents['publication_ids'].apply(lambda ids: pubid in ids)
    return patents.loc[mask, 'id']


publications['patents_count'] = 0
publications['patents_ids'] = ""
for index, row in pbar(publications.iterrows(), total=publications.shape[0]):
    match_patents = get_patents_per_pub(row['id'])
    publications.at[index,'patents_count'] = len(match_patents)
    publications.at[index,'patents_ids'] = list(match_patents)


# keep only pubs with at least one patent citation and sort by citations     
publications_subset = publications[publications['patents_count'] > 0].copy()
publications_subset.sort_values("patents_count", ascending=False, inplace=True)

print("Total publications with at least one patent citation: ", len(publications_subset))



# ;;;;;
#
# count PUBLICATIONS per patent 
#
# ;;;;;


print("\n===\nCounting publications for each patent ...")
# add tot column
patents['publications_cited_tot'] = patents['publication_ids'].apply(len)

# count tot publications cited from GRIDID
pubsids_set = set(pubsids)  # built once, so each lookup is fast
def is_in_grid_pubs(test_ids):
    "size of the intersection: all cited pubs VS pubs from selected grid org"
    return len(pubsids_set.intersection(test_ids))

pbar.pandas(desc="Patents")
patents['publications_cited_grid'] = patents['publication_ids'].progress_apply(is_in_grid_pubs)



# ;;;;;
#
# simplify JSON publication fields into simple strings 
#
# ;;;;;


print("\n===\nSimplifying publication/patents fields..")

# turn ids into URLs
publications_subset['id'] = publications_subset['id'].apply(lambda x: dimensions_url(x))
# simplify FOR codes (after filling in blanks)
publications_subset['category_for'] = publications_subset['category_for'].fillna("").apply(lambda x: "; ".join([y['name'] for y in x]))

# represent authors and affiliations as semicolon-delimited lists 

def nice_authors(authorslist):
    "all author names, in publication order"
    authors = []
    for x in authorslist:
        name = (x.get('first_name', "") + " " + x.get('last_name', "")).strip()
        authors.append(name)
    return "; ".join(authors)

def nice_affiliations(authorslist):
    "unique affiliation names, sorted so that the output is reproducible"
    affiliations = set()
    for x in authorslist:
        for a in x.get('affiliations') or []:
            affiliations.add(a.get('name') or "")
    return "; ".join(sorted(affiliations))

# extract OWN authors (at any point in time!) 
def ownauthors(authorslist):
    "names of the authors with at least one affiliation matching GRIDID"
    own = []
    for x in authorslist:
        name = (x.get('first_name', "") + " " + x.get('last_name', "")).strip()
        # `any` lists each author once, even with several matching affiliations
        if any(a.get('id') == GRIDID for a in x.get('affiliations') or []):
            own.append(name)
    return "; ".join(own)

publications_subset['all_authors'] = publications_subset['authors'].fillna("").apply(lambda x: nice_authors(x))
publications_subset['own_authors'] = publications_subset['authors'].fillna("").apply(lambda x: ownauthors(x))
publications_subset['affiliations'] = publications_subset['authors'].fillna("").apply(lambda x: nice_affiliations(x))

# sort columns
publications_subset = publications_subset[['id', 'title', 'times_cited', 'recent_citations', 'field_citation_ratio', 'category_for', 'all_authors',  'own_authors', 'affiliations', 'patents_count', 'patents_ids']]



# ;;;;;
#
# simplify JSON patent fields  
#
# ;;;;;

# turn ids into URLs
patents['id'] = patents['id'].apply(lambda x: dimensions_url(x, "patents"))
# simplify FOR codes (after filling in blanks)
patents['category_for'] = patents['category_for'].fillna("").apply(lambda x: "; ".join([y['name'] for y in x]))

# transform list into semicolon delimited string
patents['inventor_names'] = patents['inventor_names'].fillna("").apply(lambda x: "; ".join([y for y in x]))
patents['assignee_names'] = patents['assignee_names'].fillna("").apply(lambda x: "; ".join([y for y in x]))
# set no value to 0 (assign back: in-place fillna on a selected column is unreliable)
patents['times_cited'] = patents['times_cited'].fillna(0)

# sort columns
patents = patents[['id', 'title', 'times_cited', 'category_for', 'inventor_names', 'assignee_names', 'publications_cited_tot', 'publications_cited_grid', 'publication_ids']]


# ;;;;;
#
# save the data as CSV
#
# ;;;;;


save_as_csv(publications_subset, "publications")
save_as_csv(patents, "patents")


print("===\nCompleted.")



===
Extracting publications for:  grid.170205.1 from 2012 to 2013
Starting iteration with limit=100 skip=0 ...


0-100 / 4996 (1.6070530414581299s)


100-200 / 4996 (1.9879209995269775s)


200-300 / 4996 (1.3492639064788818s)


300-400 / 4996 (2.1135737895965576s)


400-500 / 4996 (0.9145572185516357s)


500-600 / 4996 (10.237658739089966s)


600-700 / 4996 (3.1922919750213623s)


700-800 / 4996 (3.03245210647583s)


800-900 / 4996 (2.656450033187866s)


900-1000 / 4996 (36.44534087181091s)


1000-1100 / 4996 (3.70888614654541s)


1100-1200 / 4996 (1.474830150604248s)


1200-1300 / 4996 (7.093698978424072s)


1300-1400 / 4996 (0.7300140857696533s)


1400-1500 / 4996 (3.6571860313415527s)


1500-1600 / 4996 (4.992002964019775s)


1600-1700 / 4996 (2.1645758152008057s)


1 EvaluationError found
The response generated by your query is too large, e.g. because it includes records with lots of data. Please review it by keeping in mind the guidelines on https://docs.dimensions.ai/dsl/faq.html#queries-and-errors [code: 2]

>>>[Dimcli tip] An error occurred with the batch '1700-1800'. Consider using the 'limit' argument to retrieve fewer records per iteration, or use 'force=True' to ignore errors and continue the extraction.


TypeError: can only concatenate list (not "DslDataset") to list

## 4. Download the results 

If you are viewing this notebook in **Google Colab**, run the following cell to download all data as a zip file. 

In [4]:

# zip up all files to make download easier
import zipfile
import os 

def zipdir(path, ziph):
    # ziph is zipfile handle
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

zip_name = DATAFOLDER + '.zip'
# the context manager closes the archive even if zipping fails
with zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir(DATAFOLDER + '/', zipf)

try:
  # try to download from colab: sometimes it fails hence print a message
  from google.colab import files
  time.sleep(5)
  files.download(zip_name) 
except Exception:
  print("Google Colab failed to download - please try again.")


Google Colab failed to download - please try again.
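
Outside Colab, the zipping step can also be written with the standard library's `shutil.make_archive`. The sketch below runs on a temporary folder with placeholder files (the folder name `stats_patents_demo` is made up for this example) rather than on the real data folder:

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Placeholder data folder with two tiny CSV files
    data_folder = os.path.join(workdir, "stats_patents_demo")
    os.makedirs(data_folder)
    for name in ("publications.csv", "patents.csv"):
        with open(os.path.join(data_folder, name), "w") as handle:
            handle.write("id\n")

    # Creates <data_folder>.zip; entries are stored relative to `workdir`
    archive_path = shutil.make_archive(
        base_name=data_folder,
        format="zip",
        root_dir=workdir,
        base_dir="stats_patents_demo",
    )

    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())

print(names)
```

Passing `root_dir` and `base_dir` keeps absolute paths out of the archive, so it unpacks into a single folder.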
