<div align="center" style="border:solid 1px gray;">
    <a href="https://openalex.org/">
        <img src="../../../resources/img/OpenAlex-banner.png" alt="OpenAlex banner" width="300">
    </a>
</div>

# OpenAlex API Webinar - Tutorial 01 - Getting data about papers that a university's research has cited
April 25, 2024

Welcome to the Jupyter Notebook accompanying the first OpenAlex webinar on using Python to access the API.

* If you aren't familiar with Jupyter notebooks, [you can learn more here](https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb)
* To learn all about the OpenAlex API: [visit the technical documentation](https://docs.openalex.org)

The OpenAlex API is powerful but easy to use: no authentication is required, and all your code needs to do is make standard HTTP GET requests.

While there are some good libraries you can use to access the API, we're going to start very simply by making API calls directly. We will import just two libraries to help us: `requests`, to make the HTTP calls, and `csv`, to write the results to a file.

In [1]:
# setup: import libraries
import requests
import csv

In [2]:
# IMPORTANT: Set your email here in order to use the API's "polite pool"
# See: https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool
mailto = ""

In [3]:
url = "https://api.openalex.org/works"
if not mailto:
    raise ValueError("You need to fill in your email address in the `mailto` variable above!")
params = {
    "mailto": mailto,
    "filter": "authorships.author.id:a5086928770",  # Kyle Demes's author ID
}
response = requests.get(url, params=params)

# A "200" status code means that the API query was successful
if response.status_code == 200:
    print("Success!")

Success!


In [4]:
results = response.json()['results']
print(f"Number of results: {len(results)}")

Number of results: 24


We've retrieved the papers from the API, using a simple query:

[`https://api.openalex.org/works?filter=authorships.author.id:a5086928770`](https://api.openalex.org/works?filter=authorships.author.id:a5086928770)

You can follow that link in the browser to get the same result. But with the data now accessible by our code, we can save a CSV file with whatever fields we want.
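As a rough illustration of how the `params` dictionary becomes that URL, the sketch below builds the query string with the standard library's `urllib.parse.urlencode`. The `safe=":"` argument is only there to keep the colon readable; `requests` may percent-encode it as `%3A`, which the server should decode to the same filter. The `mailto` parameter is left out here to match the link above.

```python
from urllib.parse import urlencode

base_url = "https://api.openalex.org/works"
params = {"filter": "authorships.author.id:a5086928770"}

# urlencode turns the dict into "key=value" pairs joined by "&"
query_string = urlencode(params, safe=":")
full_url = f"{base_url}?{query_string}"
print(full_url)
```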

The Python documentation explains [how to write CSV files](https://docs.python.org/3/library/csv.html). Following those instructions:

In [5]:
# The Python documentation shows how to write data to a CSV file:
# https://docs.python.org/3/library/csv.html
with open('kdemes_works.csv', 'w', newline='', encoding='utf-8') as f:  # utf-8 so non-ASCII titles are written safely
    # initialize the csv writer for this file
    writer = csv.writer(f)

    # write a header row at the top
    header = ['id', 'doi', 'publication_year', 'title']
    writer.writerow(header)

    # loop through the works and write each row
    for item in results:
        this_id = item['id']
        this_doi = item['doi']
        this_publication_year = item['publication_year']
        this_title = item['title']
        writer.writerow([this_id, this_doi, this_publication_year, this_title])

### University (Institution)

Next, we'll try something a little more advanced. We're going to get the works from a certain institution, and then retrieve all of the references from those works (the works cited by the university's works).

We'll start by collecting the university's works. We'll limit the query to recently published papers (2023 onward) so it doesn't take too long, but you could get all of the papers just as easily if you're willing to wait.
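The query in the next step combines two filters. In the OpenAlex filter syntax, each filter is written as `attribute:value`, and filters separated by commas are combined with AND. Here is a minimal sketch of assembling such a string; the `build_filter` helper is hypothetical and not part of any OpenAlex library.

```python
def build_filter(conditions):
    """Join {attribute: value} pairs into a comma-separated filter string."""
    return ",".join(f"{attribute}:{value}" for attribute, value in conditions.items())


filter_string = build_filter({
    "authorships.institutions.lineage": "i129801699",  # University of Tasmania
    "publication_year": ">2022",  # published after 2022
})
print(filter_string)
```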

#### Cursor paging

Each API query will only return a limited subset of the overall data, in what is known as a page. We need to make multiple queries to "page through" all of the data, collecting the data for each API query. We use a method called ["cursor paging"](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) to do this.
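To see the mechanics without touching the network, here is a minimal sketch that pages through a fake, in-memory "API". The `page_through` and `fake_fetch_page` names and the sample data are made up for illustration. The only parts taken from the real API are the `"*"` starting cursor, the `results` list, and the `meta.next_cursor` field. The fake ends with an empty page whose `next_cursor` is `None`, on the assumption that the real API signals the end in the same way.

```python
def page_through(fetch_page):
    """Collect results from every page, following `next_cursor` until it is None."""
    all_results = []
    cursor = "*"  # "*" asks for the first page
    while cursor:
        page = fetch_page(cursor)
        all_results.extend(page["results"])
        cursor = page["meta"]["next_cursor"]
    return all_results


# A fake "API": each cursor value maps to one page of made-up results
FAKE_PAGES = {
    "*": {"results": ["W1", "W2"], "meta": {"next_cursor": "abc"}},
    "abc": {"results": ["W3"], "meta": {"next_cursor": "def"}},
    "def": {"results": [], "meta": {"next_cursor": None}},  # final, empty page
}


def fake_fetch_page(cursor):
    """Stand-in for an HTTP request: look the page up in memory."""
    return FAKE_PAGES[cursor]


works = page_through(fake_fetch_page)
print(works)
```

Passing the fetching step in as a function keeps the loop itself easy to test; with the real API, that function would make the `requests.get` call with the cursor added to the query parameters.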

In [6]:
url = "https://api.openalex.org/works"
params = {
    "mailto": mailto,  # reuse the email address you set above
    "filter": "authorships.institutions.lineage:i129801699,publication_year:>2022",  # University of Tasmania
    "per-page": 100,
    "select": "id,doi,publication_year,title,primary_location,authorships,topics",
}

# Initialize cursor
cursor = "*"

# Loop through pages
all_results = []
count_api_queries = 0
while cursor:
    params["cursor"] = cursor
    response = requests.get(url, params=params)
    if response.status_code != 200:
        print("Oh no! Something went wrong during the live demo! How embarrassing!")
        break
    this_page_results = response.json()['results']
    for result in this_page_results:
        all_results.append(result)
    count_api_queries += 1

    # Update cursor, using the response's `next_cursor` metadata field
    cursor = response.json()['meta']['next_cursor']
print(f"Done paging through results. We made {count_api_queries} API queries, and retrieved {len(all_results)} results.")

Done paging through results. We made 44 API queries, and retrieved 4254 results.


Our next step is to loop through each work collected above, and collect all of the referenced works.

In [7]:
# Let's make our cursor paging code above into a function, so we can reuse it easily.
# This code just defines the function. We'll need to call the function later on to actually get it to run.
def api_query_page_results(url, params):
    # Initialize cursor
    cursor = "*"

    # Loop through pages
    all_results = []
    while cursor:
        params["cursor"] = cursor
        response = requests.get(url, params=params)
        if response.status_code != 200:
            print("Oh no! Something went wrong during the live demo! How embarrassing!")
            response.raise_for_status()
        this_page_results = response.json()['results']
        for result in this_page_results:
            all_results.append(result)

        # Update cursor
        cursor = response.json()['meta']['next_cursor']
    return all_results

In [8]:
# collect all of the works referenced by the works found above
# This will be a dictionary mapping Citing Paper -> List of Cited Papers
all_references = {}

# Let's loop through only the first 100 works, because this is a demo and we don't want to wait too long
works_to_collect = all_results[:100]

# We will keep track of the number of works retrieved from the API
count_works_retrieved = 0

for work in works_to_collect:
    # Get this work's references (i.e., the works that this work cites)
    this_work_id = work['id']
    url = "https://api.openalex.org/works"
    params = {
        "mailto": mailto,  # reuse the email address you set above
        "filter": f"cited_by:{this_work_id}",
        "per-page": 100,
        "select": "id,doi,publication_year,title,primary_location,authorships,topics",
    }
    this_work_references = api_query_page_results(url, params=params)
    # put this data into our dictionary:
    all_references[this_work_id] = this_work_references
    count_works_retrieved += len(this_work_references)
print(f"Done collecting references. We retrieved {count_works_retrieved} works.")

Done collecting references. We retrieved 9132 works.


In [9]:
# Function to shorten the OpenAlex ID to make it better for display
def make_short_id(long_id):
    short_id = long_id.replace("https://openalex.org/", "")
    return short_id

In [10]:
# Write each citing -> cited pair of works to a CSV file
output_filename = "tasmania_paper_references.csv"
with open(output_filename, 'w', newline='') as f:
    # initialize the csv writer for this file
    writer = csv.writer(f)

    # write a header row at the top
    header = ['citing_paper_id', 'cited_paper_id']
    writer.writerow(header)

    # loop through each citation, writing one row for each citation
    for citing_id, cited_works in all_references.items():
        citing_id_short = make_short_id(citing_id)
        for cited_work in cited_works:
            cited_id_short = make_short_id(cited_work['id'])
            writer.writerow([citing_id_short, cited_id_short])

In [11]:
# We can keep track of how many times each work has been cited.
# One way to do this is to use Python's collections.Counter
from collections import Counter
citation_counts = Counter()
for citing_id, cited_works in all_references.items():
    citation_counts.update([w['id'] for w in cited_works])

In [12]:
output_filename = "tasmania_references_paper_metadata.csv"
seen_work_ids = set()
with open(output_filename, 'w', newline='', encoding='utf-8') as f:  # utf-8 so non-ASCII titles are written safely
    # initialize the csv writer for this file
    writer = csv.writer(f)

    # write a header row at the top
    header = ['work_id', 'title', 'doi', 'utasmania_citation_count', 
              'source_id', 'source_issn', 'source_display_name', 
              'primary_topic_id', 'primary_topic_display_name']
    writer.writerow(header)

    for cited_works in all_references.values():
        for w in cited_works:
            work_id = w['id']
            work_id_short = make_short_id(work_id)
            title = w['title']
            if work_id not in seen_work_ids and title != 'Deleted Work':
                # Write a row to the CSV file for this work
                doi = w['doi']
                utasmania_citation_count = citation_counts[work_id]

                # Get source (journal)
                try:
                    source = w['primary_location']['source']
                    source_id = source['id']
                    source_id_short = make_short_id(source_id)
                    source_issn = source['issn_l']
                    source_display_name = source['display_name']
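                except (KeyError, TypeError):
                    # No source for this work: reset every source field,
                    # including the short ID that is written to the CSV
                    source_id = None
                    source_id_short = None
                    source_issn = None
                    source_display_name = None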

                # Get primary_topic
                try:
                    primary_topic = w['topics'][0]
                    primary_topic_id = primary_topic['id']
                    primary_topic_id_short = make_short_id(primary_topic_id)
                    primary_topic_display_name = primary_topic['display_name']
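                except (IndexError, KeyError, TypeError):
                    # No topic for this work: reset every topic field,
                    # including the short ID that is written to the CSV
                    primary_topic_id = None
                    primary_topic_id_short = None
                    primary_topic_display_name = None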
                
                writer.writerow([work_id_short, title, doi, 
                                 utasmania_citation_count, source_id_short, 
                                 source_issn, source_display_name, 
                                 primary_topic_id_short, primary_topic_display_name])
            
                seen_work_ids.add(work_id)


Now we have two CSV files:

* `tasmania_paper_references.csv` has a two-column edge list of citing work -> cited work
* `tasmania_references_paper_metadata.csv` has metadata about each cited work
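
As one possible next step, the edge list can be read back in to see which works are cited most often. The sketch below uses a few made-up rows held in memory in place of the real file, so the IDs and counts are illustrative only.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for tasmania_paper_references.csv
sample_csv = io.StringIO(
    "citing_paper_id,cited_paper_id\n"
    "W100,W1\n"
    "W100,W2\n"
    "W200,W1\n"
)

# To read the real file instead, replace `sample_csv` with:
#   open("tasmania_paper_references.csv", newline="")
reader = csv.DictReader(sample_csv)
cited_counts = Counter(row["cited_paper_id"] for row in reader)

# most_common() sorts by count, highest first
for cited_id, count in cited_counts.most_common():
    print(cited_id, count)
```

`csv.DictReader` uses the header row as keys, so each row can be accessed by the same column names that were written to the file.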