# Notebook for coronavirus open citations visualisation

This Python notebook contains the code used to retrieve all the data for the [Coronavirus Open Citations Dataset](https://opencitations.github.io/coronavirus/). In particular, we used the Crossref API and the unifying REST API for all the OpenCitations Indexes to get the information needed for the visualisation.

## Preliminaries

The following code imports all the modules needed and sets up the basic variables used for retrieving the data.

In [1]:
from requests import get
from requests.exceptions import Timeout, ConnectionError
from json import loads, load, dump
from re import sub
from os.path import exists
from os import makedirs, sep
from csv import reader, writer
import logging
from urllib.parse import quote

headers = {
    "User-Agent": 
    "COVID-19 / OpenCitations "
    "(http://opencitations.net; mailto:contact@opencitations.net)"
}

# the base directory where to store the files with the full data
data_dir = "data"

# the CSV document containing the DOIs of the articles relevant for the analysis
doi_file = data_dir + sep + "dois.csv"

# the JSON document containing the citations of the articles relevant for the analysis
cit_file = data_dir + sep + "citations.json"

# the CSV document containing the DOIs of the articles relevant for the analysis that do not have references deposited in Crossref
doi_no_ref_file = data_dir + sep + "dois_no_ref.csv"

# the JSON document containing the metadata of the articles involved in the relevant citations
met_file = data_dir + sep + "metadata.json"

# the CSV document containing the DOIs of the articles for which Crossref does not return any information
nod_file = data_dir + sep + "metadata_not_found.csv"

# the base directory containing all the material for the visualisation
vis_dir = "docs"

# the directory where to store the files with the partial data used in the visualisation
vis_data_dir = vis_dir + sep + data_dir

# the JSON document containing the citations of the articles relevant for the visualisation
vis_cit_file = vis_data_dir + sep + "citations.json"

# the JSON document containing the metadata of the articles used in the visualisation
vis_met_file = vis_data_dir + sep + "metadata.json"
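
As a side note on the path variables above, the same strings can be built with `os.path.join` instead of concatenating with `sep`. The sketch below only illustrates that equivalence for three of the variables; it is an alternative spelling, not something the notebook requires.

```python
from os import sep
from os.path import join

data_dir = "data"
vis_dir = "docs"

# Same strings as the `+ sep +` concatenations in the cell above
doi_file = join(data_dir, "dois.csv")
vis_data_dir = join(vis_dir, data_dir)
vis_cit_file = join(vis_data_dir, "citations.json")

print(doi_file)
print(vis_cit_file)
```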

To debug the following code snippets, set the logger to the debug level (`logging.DEBUG`). If debug messages are not needed, set the level to `logging.INFO`.

In [2]:
# set the following variable to logging.INFO to hide debug messages, or to logging.DEBUG to show them
logging_level = logging.INFO

logging.basicConfig(format='%(levelname)s: %(message)s.')
log = logging.getLogger()
log.setLevel(logging_level)
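
To make the effect of the two levels concrete, here is a minimal sketch that sends the same two messages through a logger at each level and captures what is emitted. The logger name `demo_logging_level` and the messages are made up for illustration; the format string is the one used above.

```python
import io
import logging

def capture_messages(level):
    """Return the lines a logger set to `level` emits for one debug and one info call."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s."))

    # A dedicated, hypothetical logger name keeps this demo away from the root logger
    demo_log = logging.getLogger("demo_logging_level")
    demo_log.handlers = [handler]
    demo_log.propagate = False
    demo_log.setLevel(level)

    demo_log.debug("querying Crossref")
    demo_log.info("Total DOIs available: 3")
    return buffer.getvalue().splitlines()

info_only = capture_messages(logging.INFO)
with_debug = capture_messages(logging.DEBUG)
print(info_only)
print(with_debug)
```

At `logging.INFO` only the second message is emitted; at `logging.DEBUG` both are.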

## Getting the data

The following code uses the Crossref API to retrieve the list of relevant articles about coronaviruses. It looks for all the articles that contain the word "coronavirus", "covid19", "sarscov", "ncov2019", or "2019ncov", either in their title or abstract. All these data are stored in the file indicated by the variable `doi_file`. If the file already exists on the file system, the process will not run. Thus, to run the process again, remove the file `doi_file` from the file system first.

In [3]:
crossref_query = "https://api.crossref.org/works?query.bibliographic=coronavirus+OR+covid19+OR+sarscov+OR+ncov2019+OR+2019ncov&rows=1000&cursor=*"
dois = set()
cursors = set()
next_cursor = "*"

if not exists(doi_file):
    log.debug("The file with DOIs does not exist: start querying Crossref for retrieving data")

    while next_cursor:
        if next_cursor not in cursors:
            log.debug(f"Current cursor for querying Crossref: '{next_cursor}'")
            cursors.add(next_cursor)
            crossref_data = get(sub("&cursor=.+$", "&cursor=", crossref_query) + quote(next_cursor), 
                                headers=headers)
            if crossref_data.status_code == 200:
                crossref_data.encoding = "utf-8"
                if crossref_json := loads(crossref_data.text).get("message"):
                    next_cursor = crossref_json.get("next-cursor")
                    for item in crossref_json.get("items"):
                        dois.add(item.get("DOI"))
                else:
                    log.debug(f"Crossref response does not contain a 'message' " 
                              f"item:\n{crossref_data.text}")
            else:
                log.debug(f"The request to Crossref ended with a non-OK status code "
                          f"('{crossref_data.status_code}'): stopping the download\n" + crossref_data.text)
                next_cursor = None
        else:
            log.debug(f"Current cursor '{next_cursor}' already used: stopping the download")
            next_cursor = None
    
    if not exists(data_dir):
        makedirs(data_dir)
    # newline="" is what the csv module expects; it avoids blank rows on some platforms
    with open(doi_file, "w", newline="") as f:
        csv_writer = writer(f)
        for doi in dois:
            csv_writer.writerow((doi, ))
        
else:
    log.debug("The file with DOIs exists: load information directly from there")
    with open(doi_file, newline="") as f:
        csv_reader = reader(f)
        for doi, in csv_reader:
            dois.add(doi)

log.info(f"Total DOIs available: {len(dois)}")

INFO: Total DOIs available: 11842.
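
The stopping logic of the loop above can be tried without network access. The sketch below follows the same idea (keep requesting pages until no cursor is returned or a cursor repeats) against made-up pages. The function `collect_dois`, the `fetch_page` callable, and the DOIs are hypothetical stand-ins, and the assumption that an exhausted result set hands back an already used cursor is only what the `cursors` set in the cell above guards against.

```python
def collect_dois(fetch_page, first_cursor="*"):
    """Follow cursors until none is returned or one repeats.

    `fetch_page` is a hypothetical callable that takes a cursor and returns
    a dict shaped like the "message" part of a Crossref response.
    """
    dois, seen_cursors = set(), set()
    cursor = first_cursor
    while cursor and cursor not in seen_cursors:
        seen_cursors.add(cursor)
        message = fetch_page(cursor)
        for item in message.get("items", []):
            if item.get("DOI"):
                dois.add(item["DOI"])
        cursor = message.get("next-cursor")
    return dois

# Made-up pages standing in for the network: the last page is empty and
# returns its own cursor again, which ends the loop.
FAKE_PAGES = {
    "*": {"next-cursor": "c1", "items": [{"DOI": "10.1000/a"}, {"DOI": "10.1000/b"}]},
    "c1": {"next-cursor": "c2", "items": [{"DOI": "10.1000/b"}, {"DOI": "10.1000/c"}]},
    "c2": {"next-cursor": "c2", "items": []},
}

found = collect_dois(FAKE_PAGES.__getitem__)
print(sorted(found))
```

Using a set for the DOIs means the DOI that appears on two pages is kept once.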


The following code retrieves all the citations that involve the DOIs of the articles obtained in the previous step, either as a citing entity or as a cited entity. It gets the citation information from the unifying REST API for all the OpenCitations Indexes, and thus uses all the OpenCitations Indexes currently available, i.e. COCI and CROCI. All the citation data are stored in the file indicated by the variable `cit_file`, in a JSON format compatible with the one used by Cytoscape JS, which is the tool used for visualising the data. If the file `cit_file` already exists on the file system, the process will not run. Thus, to run the process again, remove the file `cit_file` from the file system first.

In [4]:
citations = list()
opencitations_query = 'https://opencitations.net/index/api/v1/%s/%s?json=array("; ",citing).array("; ",cited).dict(" => ",citing,source,doi).dict(" => ",cited,source,doi)'
cit_id = 0    

def extract_citations(res, cit_id):
    result = list()
    res.encoding = "utf-8"
    for citation in loads(res.text):
        cit_id += 1
        citation_item = {
            "id": str(cit_id), 
            "source": citation["citing"][0]["doi"], 
            "target": citation["cited"][0]["doi"]
        }

        result.append(citation_item)
    
    return result, cit_id
        
if not exists(cit_file):
    log.debug("The file with citations does not exist: start querying OpenCitations "
                  "for retrieving citation data")
    for doi in dois:
        log.debug(f"Process DOI '{doi}'")
        reference_data = get(opencitations_query % ("references", doi), headers=headers)
        if reference_data.status_code == 200:
            all_citations, cit_id = extract_citations(reference_data, cit_id)
            for citation in all_citations:
                if citation not in citations:
                    citations.append(citation)
        else:
            log.warning(f"Status code '{reference_data.status_code}' when requesting references for "
                            f"DOI '{doi}'")
        citation_data = get(opencitations_query % ("citations", doi), headers=headers)
        if citation_data.status_code == 200:
            all_citations, cit_id = extract_citations(citation_data, cit_id)
            for citation in all_citations:
                if citation not in citations:
                    citations.append(citation)
        else:
            log.warning(f"Status code '{citation_data.status_code}' when requesting citations for "
                            f"DOI '{doi}'")
    
    with open(cit_file, "w") as f:
        dump(citations, f, ensure_ascii=False, indent=0)
    
else:
    log.debug("The file with citations exists: load information directly from there.")
    with open(cit_file) as f:
        citations.extend(load(f))

citing_dois = set()
for citation in citations:
    citing_dois.add(citation["source"])
        
articles_with_references = set()
articles_without_references = set()
for doi in dois:
    if doi in citing_dois:
        articles_with_references.add(doi)
    else:
        articles_without_references.add(doi)

if not exists(doi_no_ref_file):
    with open(doi_no_ref_file, "w", newline="") as f:
        csv_writer = writer(f)
        for doi in articles_without_references:
            csv_writer.writerow((doi, ))
        
log.info(f"Total citations available: {len(citations)}. Number of articles with references "
         f"deposited in Crossref and available in the OpenCitations Indexes: "
         f"{len(articles_with_references)} out of {len(dois)} total articles retrieved in Crossref")

INFO: Total citations available: 189697. Number of articles with references deposited in Crossref and available in the OpenCitations Indexes: 3348 out of 11842 total articles retrieved in Crossref.
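
One detail of the cell above is worth noting. Every item built by `extract_citations` receives a fresh `id`, so two items describing the same citing/cited pair are never equal as dictionaries. As a result, the `citation not in citations` test does not filter a citation that is returned once as a reference of the citing article and once as a citation of the cited one. If deduplicating on the DOI pair is wanted, one possible approach is sketched below. It is an alternative, not what the cell above does, and adopting it would change the totals reported above. The response texts, DOIs, and the helper `add_unique` are made up; their shape is inferred from how `extract_citations` reads the data.

```python
from json import loads

# Made-up response texts in the shape that extract_citations reads:
# "citing" and "cited" are lists of dicts, each with a "doi" key.
REFERENCES_OF_A = '''[
  {"citing": [{"source": "coci", "doi": "10.1000/a"}],
   "cited": [{"source": "coci", "doi": "10.1000/b"}]}
]'''
CITATIONS_OF_B = '''[
  {"citing": [{"source": "coci", "doi": "10.1000/a"}],
   "cited": [{"source": "coci", "doi": "10.1000/b"}]},
  {"citing": [{"source": "coci", "doi": "10.1000/c"}],
   "cited": [{"source": "coci", "doi": "10.1000/b"}]}
]'''

def add_unique(text, seen_pairs, collected):
    """Append citations from a JSON response, skipping known (source, target) pairs."""
    for row in loads(text):
        pair = (row["citing"][0]["doi"], row["cited"][0]["doi"])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            collected.append({
                "id": str(len(collected) + 1),
                "source": pair[0],
                "target": pair[1],
            })

seen_pairs, unique_citations = set(), []
add_unique(REFERENCES_OF_A, seen_pairs, unique_citations)
add_unique(CITATIONS_OF_B, seen_pairs, unique_citations)
print(unique_citations)
```

Keeping the pairs in a set also makes each membership test constant-time, instead of scanning the whole list for every new citation.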


Finally, the following code uses the Crossref API again to retrieve the basic metadata (i.e. authors, year of publication, title, publication venue, and DOI) of the articles involved in all the citations retrieved in the previous step. Only the metadata of the articles for which Crossref has metadata are stored in the file `met_file`. Articles already listed in `met_file` are not requested again. Thus, to retrieve all the metadata from scratch, remove the file `met_file` from the file system first.

In [5]:
crossref_query = "https://api.crossref.org/works/"

def normalise(o):
    if o is None:
        s = ""
    else:
        s = str(o)
    # raw string, so that \s is not treated as an (invalid) string escape
    return sub(r"\s+", " ", s).strip()

def create_title_from_list(title_list):
    cur_title = ""

    for title in title_list:
        strip_title = title.strip()
        if strip_title != "":
            if cur_title == "":
                cur_title = strip_title
            else:
                cur_title += " - " + strip_title

    return normalise(cur_title.title())

def get_basic_metadata(body):
    authors = []
    for author in body.get("author", []):
        authors.append(normalise(author.get("family", "").title()))

    year = ""
    if "issued" in body and "date-parts" in body["issued"] and len(body["issued"]["date-parts"]) and \
            len(body["issued"]["date-parts"][0]):
        year = normalise(body["issued"]["date-parts"][0][0])

    title = ""
    if "title" in body:
        title = create_title_from_list(body.get("title", []))

    source_title = ""
    if "container-title" in body:
        source_title = create_title_from_list(body.get("container-title", []))

    return ", ".join(authors), year, title, source_title

dois_in_citations = set()
for citation in citations:
    dois_in_citations.add(citation["source"])
    dois_in_citations.add(citation["target"])

existing_doi = set()
metadata = []
if exists(met_file):
    with open(met_file) as f:
        for article in load(f):
            existing_doi.add(article["id"])
            metadata.append(article)

if exists(nod_file):
    with open(nod_file, newline="") as f:
        csv_reader = reader(f)
        for doi, in csv_reader:
            existing_doi.add(doi)

dois_not_found = []
for doi in dois_in_citations.difference(existing_doi):
    log.debug(f"Requesting Crossref metadata for DOI '{doi}'")
    try:
        article = get(crossref_query + doi, headers=headers, timeout=30)
        if article.status_code == 200:
            article.encoding = "utf-8"
            if article_json := loads(article.text).get("message"):
                author, year, title, source_title = get_basic_metadata(article_json)
                metadata.append({
                    "id": doi,
                    "author": author,
                    "year": year,
                    "title": title,
                    "source_title": source_title
                })
            else:
                log.warning(f"No article metadata in Crossref for DOI '{doi}'")
                dois_not_found.append(doi)
        else:
            dois_not_found.append(doi)
            log.warning(f"Status code '{article.status_code}' when requesting Crossref metadata "
                        f"for DOI '{doi}'")
    except Timeout:
        dois_not_found.append(doi)
        log.warning(f"Timeout when querying Crossref for DOI '{doi}'")
    except ConnectionError:
        dois_not_found.append(doi)
        log.warning(f"Connection issues when querying Crossref for DOI '{doi}'")

with open(met_file, "w") as f:
    dump(metadata, f, ensure_ascii=False, indent=0)
        
if dois_not_found:
    with open(nod_file, "w", newline="") as f:
        csv_writer = writer(f)
        for doi in dois_not_found:
            csv_writer.writerow((doi, ))

log.info(f"The total number of articles involved in the citations retrieved "
         f"are {len(metadata) + len(dois_not_found)}. "
         f"The total number of articles with available metadata is {len(metadata)}, "
         f"and there are {len(dois_not_found)} articles with no metadata found")

INFO: The total number of articles involved in the citations retrieved are 49719. The total number of articles with available metadata is 49719, and there are 0 articles with no metadata found.
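
To see what the metadata extraction above produces, the sketch below applies the same normalisation steps to a single made-up record. `SAMPLE_MESSAGE` contains only the fields that `get_basic_metadata` reads and does not come from Crossref; `tidy` and `join_titles` are compact, hypothetical stand-ins for `normalise` and `create_title_from_list`.

```python
from re import sub

# Made-up record limited to the fields that get_basic_metadata reads
SAMPLE_MESSAGE = {
    "author": [{"family": "ROSSI", "given": "Anna"}, {"family": "de luca"}],
    "issued": {"date-parts": [[2019, 5, 17]]},
    "title": ["  a study of   coronaviruses ", "second part"],
    "container-title": ["journal of examples"],
}

def tidy(value):
    """Collapse whitespace runs and trim, treating None as an empty string."""
    return sub(r"\s+", " ", "" if value is None else str(value)).strip()

def join_titles(parts):
    """Join the non-empty title fragments with ' - ' and title-case the result."""
    return tidy(" - ".join(p.strip() for p in parts if p.strip()).title())

authors = ", ".join(
    tidy(a.get("family", "").title()) for a in SAMPLE_MESSAGE.get("author", [])
)
date_parts = SAMPLE_MESSAGE.get("issued", {}).get("date-parts", [[]])
year = tidy(date_parts[0][0]) if date_parts and date_parts[0] else ""
title = join_titles(SAMPLE_MESSAGE.get("title", []))
venue = join_titles(SAMPLE_MESSAGE.get("container-title", []))

print(authors, year, title, venue, sep=" | ")
```

Only family names are kept for authors, and only the first element of `date-parts` (the year) is used.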


All the DOIs of the articles for which Crossref does not provide any metadata are stored in the file `nod_file`. In this case, the metadata for those articles must be completed by hand and then added to the file `met_file`.
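
Adding hand-completed records to the metadata file could look like the following sketch. It assumes the records use the same keys as those written by the cell above (`id`, `author`, `year`, `title`, `source_title`). The helper `add_manual_records` and the sample values are hypothetical, and the demo runs on a temporary file rather than on `met_file`.

```python
import os
import tempfile
from json import dump, load

def add_manual_records(metadata_path, manual_records):
    """Append hand-completed records whose "id" is not already in the file."""
    with open(metadata_path, encoding="utf-8") as f:
        metadata = load(f)
    known_ids = {article["id"] for article in metadata}
    added = 0
    for record in manual_records:
        if record["id"] not in known_ids:
            metadata.append(record)
            known_ids.add(record["id"])
            added += 1
    with open(metadata_path, "w", encoding="utf-8") as f:
        dump(metadata, f, ensure_ascii=False, indent=0)
    return added

# Demo on a temporary file, so the real metadata file is left untouched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metadata.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        dump([{"id": "10.1000/a", "author": "Rossi", "year": "2019",
               "title": "A Title", "source_title": "A Journal"}], f)
    manual = [
        {"id": "10.1000/a", "author": "Rossi", "year": "2019",
         "title": "A Title", "source_title": "A Journal"},
        {"id": "10.1000/z", "author": "Bianchi", "year": "2020",
         "title": "Another Title", "source_title": "A Journal"},
    ]
    added_count = add_manual_records(demo_path, manual)
    with open(demo_path, encoding="utf-8") as f:
        final_ids = [article["id"] for article in load(f)]

print(added_count, final_ids)
```

Skipping records whose `id` is already present keeps the file free of duplicate entries if the helper is run more than once.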

## Data for the visualisation

For visualisation purposes, we selected only a subset of the citations and the articles retrieved in the previous steps. In particular, we considered only:

1. the citations having both the DOIs of the citing entity and the cited entity included in the file `doi_file`;
2. the articles that received, overall, at least twenty citations per year since their publication date.

The following code stores the citations and the metadata of the articles used in the visualisation in the files `vis_cit_file` and `vis_met_file` respectively. Files that already exist on the file system are not overwritten. Thus, to regenerate them, remove the files `vis_cit_file` and `vis_met_file` from the file system first.

In [6]:
min_num_cits_per_year = 20

# Keep the citations whose cited article belongs to the selected dataset,
# and count the citations received by each of those articles
filtered_citations = []
number_of_citations = {}
for citation in citations:
    if citation["target"] in dois:
        filtered_citations.append(citation)
        number_of_citations[citation["target"]] = number_of_citations.get(citation["target"], 0) + 1

# Publication years of the articles
pub_year = {}
for article in metadata:
    pub_year[article["id"]] = int(article["year"]) if article["year"] else 0

current_year = 2020
only_highly_cited = []
dois_in_selected_citations = set()
for citation in filtered_citations:
    if citation["source"] in dois and citation["target"] in dois and \
       number_of_citations.get(citation["source"], 0) >= ((current_year - pub_year[citation["source"]] + 1) * min_num_cits_per_year) and \
       number_of_citations.get(citation["target"], 0) >= ((current_year - pub_year[citation["target"]] + 1) * min_num_cits_per_year):
        dois_in_selected_citations.add(citation["source"])
        dois_in_selected_citations.add(citation["target"])
        only_highly_cited.append(citation)

log.info(f"Number of citations ({len(only_highly_cited)}) and articles "
         f"({len(dois_in_selected_citations)}) selected for visualization purposes")

if not exists(vis_data_dir):
    makedirs(vis_data_dir)

if not exists(vis_cit_file):
    with open(vis_cit_file, "w") as f:
        dump(only_highly_cited, f, ensure_ascii=False, indent=0)

if not exists(vis_met_file):
    partial_metadata = []
    with open(met_file) as f:
        for article in load(f):
            if article["id"] in dois_in_selected_citations:
                article["count"] = number_of_citations.get(article["id"], 0)
                partial_metadata.append(article)

    with open(vis_met_file, "w") as f:
        dump(partial_metadata, f, ensure_ascii=False, indent=0)

INFO: Number of citations (902) and articles (109) selected for visualization purposes.
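
The selection rule in the cell above boils down to a small piece of arithmetic: an article is kept when its citation count is at least `(current_year - pub_year + 1) * min_num_cits_per_year`. The sketch below works through a few values with the constants used above (2020 and 20). The helper names `required_citations` and `is_highly_cited` are introduced here only for illustration.

```python
def required_citations(pub_year, current_year=2020, min_num_cits_per_year=20):
    """Minimum citation count an article needs under the rule in the cell above."""
    return (current_year - pub_year + 1) * min_num_cits_per_year

def is_highly_cited(citation_count, pub_year):
    """True when the article reaches the threshold for its publication year."""
    return citation_count >= required_citations(pub_year)

# The publication year itself counts as one year
print(required_citations(2020))
print(required_citations(2018))
print(is_highly_cited(59, 2018), is_highly_cited(60, 2018))

# Articles without a year get pub_year 0 in the cell above, which makes
# the threshold so high that they are effectively excluded
print(required_citations(0))
```

An article published in 2018 therefore needs at least 60 citations, since 2018, 2019, and 2020 each count as one year.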


## License

### Code

Copyright 2020, Silvio Peroni (essepuntato@gmail.com)

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

### Notebook text
Copyright 2020, Silvio Peroni (essepuntato@gmail.com)

Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/legalcode

You are free to:

* Share — copy and redistribute the material in any medium or format.
* Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

* Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
* No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

This license is acceptable for Free Cultural Works. The licensor cannot revoke these freedoms as long as you follow the license terms.

Notices:

* You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
* No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.