OpenAlex
========

https://openalex.org/ is an open catalog of the scientific literature. You can access information about papers, authors, and more through a browser or via an API.

# API

An API is an Application Programming Interface. OpenAlex provides a REST API, which is an interface for exchanging information over the internet using URLs. Your browser does this all the time, and today we will learn to do it in a shell.

The first step in using an API is to review it: https://docs.openalex.org/how-to-use-the-api/api-overview

Note there are some Python libraries already, but we do not use these today.

It is helpful to install a JsonViewer extension in your browser. I use https://chrome.google.com/webstore/detail/jsonvue/chklaanhfefbnpoihckbnefhakgolnmc.

We start with the quickstart tutorial to get an idea of what is happening.

https://docs.openalex.org/quickstart-tutorial

To use a REST API, we have to know how to construct a URL to an *endpoint* that will return data to us. These URLs have some pieces:

    https://api.openalex.org/institutions is an endpoint for institutions
    
    ?search=carnegie+mellon+university is a query for that endpoint
    
We combine these to form a single URL: https://api.openalex.org/institutions?search=carnegie+mellon+university




If you click on that URL, you should see a web page of JSON data. It resembles a dictionary in Python.
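The same assembly can be sketched in a shell, which is where the rest of this lesson takes place. This is only an illustration: the variable names (`base`, `endpoint`, `search`, `query`, `url`) are made up for the example, and nothing is sent to the API; the script just prints the URL it builds.

```shell
#!/bin/bash
# Sketch: assemble an OpenAlex URL from its pieces (variable names are illustrative).
base="https://api.openalex.org"        # API host
endpoint="institutions"                # the endpoint to query
search="carnegie mellon university"    # free-text search terms, with spaces

# Replace every space with + so the terms can be placed in a URL.
query="search=$(printf '%s' "$search" | tr ' ' '+')"

# endpoint and query are joined by a ?
url="$base/$endpoint?$query"
echo "$url"
```

The printed URL is the same one shown above, so it can be pasted into a browser to check it.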



# JSON

JSON stands for JavaScript Object Notation. It is a standard format for exchanging data. Most programming languages provide a library to parse and access this data, which makes it very convenient.



# Using REST APIs in the shell

The goal today is not to use the browser. We will use it to help explore data, but with JSON data like this, the browser is not well suited to automating analysis or extracting data.

## Exploring the institutions endpoint

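In a shell, we use the `curl` command. The URL is in quotes because `?` is a special character in some shells; zsh, for example, would otherwise try to expand it as a filename pattern. Run this command in a shell:

    curl "https://api.openalex.org/institutions?search=carnegie+mellon+university"


In [1]:
! curl "https://api.openalex.org/institutions?search=carnegie+mellon+university"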

{"meta":{"count":4,"db_response_time_ms":29,"page":1,"per_page":25},"results":[{"id":"https://openalex.org/I74973139","ror":"https://ror.org/05x2bcf33","display_name":"Carnegie Mellon University","relevance_score":211218.61,"country_code":"US","type":"education","homepage_url":"http://www.cmu.edu/index.shtml","image_url":"https://upload.wikimedia.org/wikipedia/commons/1/1d/Www.wikipedia.org_screenshot_2018.png","image_thumbnail_url":"https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Www.wikipedia.org_screenshot_2018.png/76px-Www.wikipedia.org_screenshot_2018.png","display_name_acronyms":["CMU"],"display_name_alternatives":[],"repositories":[{"id":"https://openalex.org/S4306400668","display_name":"Research Showcase @ Carnegie Mellon University (Carnegie Mellon University)","host_organization":"https://openalex.org/I74973139","host_organization_name":"Carnegie Mellon University","host_organization_lineage":["https://openalex.org/I74973139"]}],"works_count":111876,"cited_by_count"

```{note}
Note the + in the search query. This is a common way to combine terms. It is usually undesirable to have spaces in the query, so they must either be escaped (usually as %20) or replaced (as in this case with +).

It is also worth noting that by default curl and your browser make a GET request to the API. In this form, all data is sent to the API through the URL. Other kinds of requests are possible, such as POST, which lets you send data in a different form. We don't cover that here.
```



The output is quite dense and hard to read. Let's pipe the output of that command into a Python module that pretty-prints it (the `-s` flag silences curl's progress meter):

    

In [3]:
! curl -s "https://api.openalex.org/institutions?search=carnegie+mellon+university" | python -m json.tool

{
    "meta": {
        "count": 4,
        "db_response_time_ms": 29,
        "page": 1,
        "per_page": 25
    },
    "results": [
        {
            "id": "https://openalex.org/I74973139",
            "ror": "https://ror.org/05x2bcf33",
            "display_name": "Carnegie Mellon University",
            "relevance_score": 211218.61,
            "country_code": "US",
            "type": "education",
            "homepage_url": "http://www.cmu.edu/index.shtml",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/1/1d/Www.wikipedia.org_screenshot_2018.png",
            "image_thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Www.wikipedia.org_screenshot_2018.png/76px-Www.wikipedia.org_screenshot_2018.png",
            "display_name_acronyms": [
                "CMU"
            ],
            "display_name_alternatives": [],
            "repositories": [
                {
                    "id": "https://openalex.org/S4306400668"

You should scroll up and down to confirm that this is the same information you saw in the browser. Note that we have now requested the same information twice, and we still haven't done anything with it. To avoid requesting it again each time, let's request it once more and redirect the output into a file for further analysis. Since we do that in the notebook, the file is saved in the current working directory.


In [5]:
! curl "https://api.openalex.org/institutions?search=carnegie+mellon+university" > cmu.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22429  100 22429    0     0  49472      0 --:--:-- --:--:-- --:--:-- 49512


    
Now, we can check the contents of that file to make sure we have the right thing:

    

In [6]:
! cat cmu.json | python -m json.tool

{
    "meta": {
        "count": 4,
        "db_response_time_ms": 29,
        "page": 1,
        "per_page": 25
    },
    "results": [
        {
            "id": "https://openalex.org/I74973139",
            "ror": "https://ror.org/05x2bcf33",
            "display_name": "Carnegie Mellon University",
            "relevance_score": 211218.61,
            "country_code": "US",
            "type": "education",
            "homepage_url": "http://www.cmu.edu/index.shtml",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/1/1d/Www.wikipedia.org_screenshot_2018.png",
            "image_thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Www.wikipedia.org_screenshot_2018.png/76px-Www.wikipedia.org_screenshot_2018.png",
            "display_name_acronyms": [
                "CMU"
            ],
            "display_name_alternatives": [],
            "repositories": [
                {
                    "id": "https://openalex.org/S4306400668"

Now we can study what we get and extract some data. At the top of this JSON file, you see there are 4 results. That is because we have several campuses. As you will soon see, the results are the main Carnegie Mellon University record plus the Qatar, Australia, and Africa (Rwanda) campuses.

For each result, there is a subsection of data available. For now, let us focus on the "works_count" and "cited_by_count" data. These are the total number of publications and citations for each campus.



# jq - shell command to parse json files

jq (https://stedolan.github.io/jq/) is a command-line JSON processor. It can be used to extract data from JSON files. The command generally works like this:

    jq query-string json-file
    
where the query-string is written in a specialized language that the tool uses to get data from the file. This language is sophisticated, and we will only cover enough to meet our needs. Languages like Python are much more flexible, and for complex queries it is often better to switch to them. We do that in the next class. The query-string is often a "path" to the data you want. For example, we can get the count in the meta section like this:

    

In [7]:
! jq ".meta.count" cmu.json

4


    
We can loop over the results to extract subsets of information. Here, for each result we create a new json object containing the display name, works_count and cited_by_count.    
    
    

In [8]:
! jq '.results[] | {name: .display_name, count: .works_count, cites: .cited_by_count}' cmu.json

{
  "name": "Carnegie Mellon University",
  "count": 111876,
  "cites": 4533093
}
{
  "name": "Carnegie Mellon University Qatar",
  "count": 612,
  "cites": 8180
}
{
  "name": "Carnegie Mellon University Australia",
  "count": 81,
  "cites": 1672
}
{
  "name": "Carnegie Mellon University Africa",
  "count": 125,
  "cites": 651
}


   
    
You may want to redirect that into a file for subsequent analysis.    

It is moderately tedious to get reasonable displays with jq in the shell. Here we extract the data that we want in one query, and pipe it to a second query that builds up a string. Python is definitely more convenient for this, but we will work with this for now.
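As an aside, jq can also build the string in a single query using string interpolation (`\(...)`), and the `-r` flag prints the result without the surrounding quotes. The sketch below wraps that in a helper called `summarize`, which is a hypothetical name chosen for this example and assumes a file shaped like the cmu.json saved earlier. The two-query version used in this lesson follows.

```shell
#!/bin/bash
# summarize FILE: print one plain-text line per institution in a saved OpenAlex response.
# (summarize is a hypothetical helper name, not part of jq or OpenAlex.)
summarize() {
    # -r prints raw strings (no surrounding quotes);
    # \(...) inserts the value of a jq expression into the string.
    jq -r '.results[] | "\(.display_name): documents (\(.works_count)) citations (\(.cited_by_count))"' "$1"
}

# Example (assumes cmu.json was saved as shown earlier):
#   summarize cmu.json
```

With the cmu.json file from above, the first line printed would be `Carnegie Mellon University: documents (111876) citations (4533093)`, without quotes.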

    

In [24]:
! jq '.results[] | {name: .display_name, count: .works_count, cites: .cited_by_count}' cmu.json | jq '(.name) + ": documents (" + (.count | tostring) + ") citations (" + (.cites | tostring) + ")"'

"Carnegie Mellon University: documents (111876) citations (4533093)"
"Carnegie Mellon University Qatar: documents (612) citations (8180)"
"Carnegie Mellon University Australia: documents (81) citations (1672)"
"Carnegie Mellon University Africa: documents (125) citations (651)"


# Writing a shell script

It is tedious to write those commands over and over. We can avoid that by capturing the commands in a shell script. Let's review what needs to happen for that.

1. We need to create the file where the shell commands will be put.
2. The file needs to be executable (`chmod +x fname`).
3. Then we can run the file in our shell.

We will start with a few new concepts first. We will want the script to take some arguments. In our script, we refer to the nth argument as `$n`, so `$1` is the first argument and `$2` is the second. Here is a short script that simply echoes the arguments you provide. Create this file (myscript.sh), make it executable, and try it in the terminal with some arguments.

```
    #!/bin/bash
    echo "arg1 = " $1
    echo "arg2 = " $2
``` 
    
Note this script does not have very sophisticated error checking, e.g. for too few or too many arguments. We will not address that here; it is much easier in Python.

The next cell writes this [script](./myscript.sh).


In [10]:
%%writefile myscript.sh
#!/bin/bash
echo "arg1 = " $1
echo "arg2 = " $2

Writing myscript.sh


In [11]:
! chmod +x myscript.sh

In [12]:
! ./myscript.sh arg1

arg1 =  arg1
arg2 = 


In [13]:
! ./myscript.sh arg1 arg2

arg1 =  arg1
arg2 =  arg2


## OpenAlex institutions script

Now we can create a short script with our code. We will use one argument, the institution to query. It is moderately tedious (IMO) to build a free-form query from the arguments. One option is to require the user to enter only one argument with no spaces, and + where words should be joined.

I use some bash scripting tricks here. First we assign the first argument (the first word) to the query variable. Then the `shift 1` line drops that argument, so the remaining arguments each move down one position. The special variable `$@` now holds the remaining arguments; we loop over them in the for loop and concatenate each word onto the query with a +. Then we put the query into the curl command. Make this file, e.g. oa-inst.sh, make it executable, and then test it with some examples.

See the script [here](./oa-inst.sh).
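Before wiring it to curl, the word-joining logic described above can be tried on its own. The sketch below puts the same loop into a function, `join_query`, and adds an alternative, `join_query_ifs`, that relies on `"$*"` joining arguments with the first character of `IFS`. Both function names are hypothetical and chosen only for this illustration; neither calls the API.

```shell
#!/bin/bash
# join_query WORD...: glue the arguments together with +, mirroring the loop in oa-inst.sh.
# (join_query and join_query_ifs are hypothetical helper names.)
join_query() {
    local query="$1"   # start with the first word
    shift 1            # drop it; "$@" now holds only the remaining words
    local word
    for word in "$@"; do
        query="$query+$word"   # same effect as query+="+$word"
    done
    echo "$query"
}

# Alternative: "$*" joins all arguments using the first character of IFS.
# The parentheses run the body in a subshell, so IFS is unchanged afterwards.
join_query_ifs() (
    IFS=+
    echo "$*"
)

join_query carnegie mellon university       # prints carnegie+mellon+university
join_query_ifs carnegie mellon university   # prints carnegie+mellon+university
```

The loop version is easier to follow line by line; the `IFS` version is shorter but depends on a less obvious shell rule.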



In [17]:
%%writefile oa-inst.sh
#!/bin/bash
query=$1
shift 1
for word in "$@"; do
    query+="+$word"
done

curl -s "https://api.openalex.org/institutions?search=$query" | jq '.results[] | {name: .display_name, count: .works_count, cites: .cited_by_count}' | jq '(.name) + ": documents (" + (.count | tostring) + ") citations (" + (.cites | tostring) + ")"'

Overwriting oa-inst.sh


In [18]:
! chmod +x oa-inst.sh

In [19]:
! ./oa-inst.sh carnegie mellon university

"Carnegie Mellon University: documents (111876) citations (4533093)"
"Carnegie Mellon University Qatar: documents (612) citations (8180)"
"Carnegie Mellon University Australia: documents (81) citations (1672)"
"Carnegie Mellon University Africa: documents (125) citations (651)"


```{note}
Shell scripts are very particular about whitespace and punctuation. They are less forgiving than languages like Python. 

This example is about the limit of what I would consider in a shell script before switching over to Python. Some reasons to use a shell script include:

1. Python is not installed.
2. You don't want to install Python just for a short script.
```



I try to minimize shell scripting in my work. I think you are almost always better off leveraging what you know in Python, and using Python tools to write code that is easier to document and debug. 

Some scenarios where I have to use shell scripting:

1. Building software. Any time you have to install or compile code, you use shell scripts.
2. Cloud services. Any kind of maintenance or setup of cloud services almost always requires command line tools and shell scripting.

Some resources on shell scripting:

1. https://www.freecodecamp.org/news/shell-scripting-crash-course-how-to-write-bash-scripts-in-linux/
2. https://devhints.io/bash



# Exercise

Use the single author API (https://docs.openalex.org/api-entities/authors/get-a-single-author) to get a list of works_count and cited_by_count by year for me. Here is the URL to the data: https://api.openalex.org/authors/https://orcid.org/0000-0003-2625-9232

For each year, print the works_count and cited_by_count. The output should look like this:

```
"2023 1 879"
"2022 9 3415"
"2021 7 3213"
"2020 9 2806"
"2019 7 2376"
"2018 5 1849"
"2017 8 1335"
"2016 15 1154"
"2015 14 1006"
"2014 7 806"
"2013 8 625"
"2012 9 485"
```

Write a shell script that takes an ORCID as a single argument and prints out this information. You can find some CMU ORCIDs at https://orcid.org/orcid-search/search?institution=carnegie%20mellon%20university. Test your script on some of these.

See a solution in the next cell.
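The solution cell below hard-codes one ORCID. Since the exercise asks for a script that takes the ORCID as an argument, the same pipeline could be wrapped as in the following sketch. The file name oa-author.sh and the function names `author_url` and `author_counts` are hypothetical; the URL pattern and the jq query are the ones used in this exercise.

```shell
#!/bin/bash
# oa-author.sh (hypothetical name): print year, works_count and cited_by_count for one ORCID.

# Build the API URL for an ORCID such as 0000-0001-7235-1481.
author_url() {
    echo "https://api.openalex.org/authors/https://orcid.org/$1"
}

# Fetch the author record and print one line per year.
author_counts() {
    curl -s "$(author_url "$1")" \
    | jq '.counts_by_year[] | (.year | tostring) + " " + (.works_count | tostring) + " " + (.cited_by_count | tostring)'
}

# Only call the API when exactly one argument was given.
if [ "$#" -eq 1 ]; then
    author_counts "$1"
else
    echo "usage: oa-author.sh ORCID" >&2
fi
```

After `chmod +x oa-author.sh`, it would be run as `./oa-author.sh 0000-0001-7235-1481`.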

In [1]:
%%bash
curl -s https://api.openalex.org/authors/https://orcid.org/0000-0001-7235-1481 \
| jq '.counts_by_year[]' \
| jq '(.year | tostring) + " " + (.works_count | tostring) + " " + (.cited_by_count | tostring)'

"2023 3 1"
"2022 0 1"
"2020 1 2"
