# Python and REST APIs

Python can be used instead of `curl` for accessing REST APIs. The most useful library for this is [requests](https://requests.readthedocs.io/en/latest/). Combined with Python's `json` library, it lets us easily write small programs and wrap them into command-line utilities.

We simply import the `requests` library, and use it to retrieve data. The OpenAlex API returns JSON data, and the `requests` library makes it easy to access that in the form of a Python dictionary. Here is a short example.



In [1]:
import requests
requests.__file__

'/opt/tljh/user/lib/python3.9/site-packages/requests/__init__.py'

In [2]:
import sys
sys.path

['/home/jupyter-jkitchin@andrew.cm-11dd7/s24-06643/sse/02-python-requests',
 '/opt/tljh/user/lib/python39.zip',
 '/opt/tljh/user/lib/python3.9',
 '/opt/tljh/user/lib/python3.9/lib-dynload',
 '',
 '/home/jupyter-jkitchin@andrew.cm-11dd7/.local/lib/python3.9/site-packages',
 '/opt/tljh/user/lib/python3.9/site-packages']

In [4]:
import requests

req = requests.get('https://api.openalex.org/institutions?search=carnegie+mellon+university')
data = req.json()
data

{'meta': {'count': 4,
  'db_response_time_ms': 40,
  'page': 1,
  'per_page': 25,
  'groups_count': None},
 'results': [{'id': 'https://openalex.org/I74973139',
   'ror': 'https://ror.org/05x2bcf33',
   'display_name': 'Carnegie Mellon University',
   'relevance_score': 226166.28,
   'country_code': 'US',
   'type': 'education',
   'lineage': ['https://openalex.org/I74973139'],
   'homepage_url': 'http://www.cmu.edu/index.shtml',
   'image_url': 'https://upload.wikimedia.org/wikipedia/commons/1/1d/Www.wikipedia.org_screenshot_2018.png',
   'image_thumbnail_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Www.wikipedia.org_screenshot_2018.png/76px-Www.wikipedia.org_screenshot_2018.png',
   'display_name_acronyms': ['CMU'],
   'display_name_alternatives': [],
   'repositories': [{'id': 'https://openalex.org/S4306400668',
     'display_name': 'Research Showcase @ Carnegie Mellon University (Carnegie Mellon University)',
     'host_organization': 'https://openalex.org/I7497

A better way to make this request is to separate the URL from the parameters and pass the parameters as a dictionary, so that `requests` takes care of encoding them. Here we specify the `search` and `select` params.

In [21]:
req = requests.get('https://api.openalex.org/institutions',
                   params={'search': 'carnegie mellon university',
                           'select': 'id,display_name,works_count,cited_by_count'})
print(req.status_code)
data = req.json()
data

200


{'meta': {'count': 4,
  'db_response_time_ms': 14,
  'page': 1,
  'per_page': 25,
  'groups_count': None},
 'results': [{'id': 'https://openalex.org/I74973139',
   'display_name': 'Carnegie Mellon University',
   'works_count': 121925,
   'cited_by_count': 4978056},
  {'id': 'https://openalex.org/I4210089979',
   'display_name': 'Carnegie Mellon University Qatar',
   'works_count': 697,
   'cited_by_count': 9769},
  {'id': 'https://openalex.org/I4210091826',
   'display_name': 'Carnegie Mellon University Australia',
   'works_count': 93,
   'cited_by_count': 2194},
  {'id': 'https://openalex.org/I4210130200',
   'display_name': 'Carnegie Mellon University Africa',
   'works_count': 187,
   'cited_by_count': 1179}],
 'group_by': []}

In [6]:
data.keys()

dict_keys(['meta', 'results', 'group_by'])

In [7]:
data['meta']['count']

4

In [11]:
[result['display_name'] for result in data['results']]

['Carnegie Mellon University',
 'Carnegie Mellon University Qatar',
 'Carnegie Mellon University Australia',
 'Carnegie Mellon University Africa']

In [13]:
i = 'Do not delete'
print(i)
names = []
for i in range(data['meta']['count']):
    names += [data['results'][i]['display_name']]
names, i    

Do not delete


(['Carnegie Mellon University',
  'Carnegie Mellon University Qatar',
  'Carnegie Mellon University Australia',
  'Carnegie Mellon University Africa'],
 3)

Now, it is easy to replicate the example we had from last class to show each result with the works_count and cited_by_count, even with simple formatting.
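
The layout comes from the format specifiers in the f-string. As a small illustration (the `rows` list is copied from the output above rather than fetched, and `format_row` is a name made up for this sketch), `:50s` pads a string to 50 characters and `:10d` right-aligns an integer in a 10-character field:

```python
# Rows copied from the results above so the example runs without a network call.
rows = [('Carnegie Mellon University', 121925, 4978056),
        ('Carnegie Mellon University Qatar', 697, 9769)]

def format_row(name, works, cited):
    # :50s left-aligns text in 50 columns; :10d right-aligns an int in 10 columns.
    return f'{name:50s}{works:10d}{cited:10d}'

lines = [format_row(*row) for row in rows]
print('\n'.join(lines))
```

Every line comes out 70 characters wide, which is what makes the columns line up.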



In [14]:
for result in data['results']:
    print(f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}')

Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


In [17]:
print('\n'.join([f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}' 
           for result in data['results']]))

Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


## Our first Python-based shell script



It is convenient to use the notebook for this, but let's convert this to a script that takes an argument. Let's do this in a few pieces.

1. create a file called oa_inst.py, and make it executable.

In [22]:
! pwd

/home/jupyter-jkitchin@andrew.cm-11dd7/s24-06643/sse/02-python-requests


In [23]:
import os
os.getcwd()

'/home/jupyter-jkitchin@andrew.cm-11dd7/s24-06643/sse/02-python-requests'

In [24]:
%%writefile oa_inst.py
#!/usr/bin/env python

import sys
print(sys.argv)

Writing oa_inst.py


In [29]:
! ls -l oa_inst.py

-rwxr-xr-x 1 jupyter-jkitchin@andrew.cm-11dd7 jupyter-jkitchin@andrew.cm-11dd7 50 Mar 18 18:37 oa_inst.py


In [32]:
! ./oa_inst.py

['./oa_inst.py']


In [35]:
! ./oa_inst.py carnegie mellon university

['./oa_inst.py', 'carnegie', 'mellon', 'university']


In [27]:
! python oa_inst.py

['oa_inst.py']


In [28]:
! chmod +x oa_inst.py

The first line (the so-called shebang line) tells the shell which interpreter to use to run the file, in this case `python`. Then we import the `sys` module, which provides basic access to the command-line arguments through `sys.argv`.

Run your script with a few examples:

    ./oa_inst.py
    ./oa_inst.py carnegie mellon university
    
You can see the first element of `sys.argv` is always the script name. All the other elements are what we call the command-line arguments. 

In [36]:
%%bash 
./oa_inst.py
./oa_inst.py carnegie mellon university

['./oa_inst.py']
['./oa_inst.py', 'carnegie', 'mellon', 'university']


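We need to join these into a single query string. In the shell script we joined the words with + ourselves; here we join them with spaces and let `requests` encode the query through `params`, which turns each space into a + in the final URL. It is easy to do in Python. Now, replace the contents of the file with these lines:

In [69]:
%%writefile oa_inst.py
#!/usr/bin/env python

import sys
query = ' '.join(sys.argv[1:])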

url = 'https://api.openalex.org/institutions'
print(url)
import requests

req = requests.get(url,
                  params={'search': query})

print(f'Generated URL: {req.request.url}')

if req.status_code != 200:
    raise Exception(f'Status = {req.status_code}')
                    
data = req.json()

for result in data['results']:
    print(f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}')

Overwriting oa_inst.py


In [70]:
! chmod +x oa_inst.py

In [74]:
! ./oa_inst.py "carnegie mellon university"

https://api.openalex.org/institutions
Generated URL: https://api.openalex.org/institutions?search=carnegie+mellon+university
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


In [50]:
! echo $PATH

/opt/tljh/user/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin


In [49]:
! PATH=$PATH:`pwd` oa_inst.py "carnegie  mellon university"

https://api.openalex.org/institutions?search=carnegie  mellon university
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


Now you should be able to run this python script like the shell script.

You can move the script to ~/bin if you put that on your path like we described in the first lecture, and then run that from anywhere.

Our script is not without issues. They aren't big issues, but we can *only* use this script in the shell. We can't import it and use it here in the notebook. If you do import it, all of the top-level code runs immediately, and the query is built from whatever happens to be in the notebook's `sys.argv` rather than from anything you chose, so it doesn't work right.

We need to separate some things out so we can have a script *and* importable library.
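
One common way to get both, sketched here with a hypothetical layout (the names `format_query` and `main` are illustrative, not taken from the script above), is to put the work in functions and call them only when the file is executed directly. Python sets `__name__` to `'__main__'` when a file is run as a script, and to the module name when it is imported:

```python
#!/usr/bin/env python
import sys

def format_query(words):
    """Join command-line words into one search string (hypothetical helper)."""
    return ' '.join(words)

def main(argv):
    # argv[0] is the script name; the rest are the user's arguments.
    print(format_query(argv[1:]))

# This block runs when the file is executed as a script,
# but not when it is imported from a notebook or another module.
if __name__ == '__main__':
    main(sys.argv)
```

With this layout, importing the file only defines `format_query` and `main`; nothing is printed until one of them is called.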

In [51]:
! cp oa_inst.py ~/bin

In [53]:
! which oa_inst.py

In [56]:
! source ~/.bashrc & oa_inst.py carnegie mellon university

/bin/bash: oa_inst.py: command not found


In [58]:
%%bash 
echo $PATH

/opt/tljh/user/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin


## Security risk

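Building the URL by pasting user input directly into it, as in `f'https://api.openalex.org/institutions?search={query}'`, has some security risks: a user can inject new parameters that break the script. The examples below were run with a version of `oa_inst.py` written that way.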

First, we add a `select=id` param. This will break the script because the data it needs won't be available. It is necessary to quote the argument for this because the shell interprets & as a command separator otherwise.

In [15]:
! ./oa_inst.py "carnegie mellon university&select=id"

Traceback (most recent call last):
  File "/home/jupyter-jkitchin@andrew.cm-11dd7/src/sse/02-python-requests/./oa_inst.py", line 13, in <module>
    print(f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}')
KeyError: 'display_name'


Another way we can break this is to add an invalid api-key param. This breaks the script in another way, because we will get a forbidden access error. That isn't shown here because we don't check the status_code of the response.

In [16]:
! ./oa_inst.py "carnegie mellon university&api-key=xxx"

Traceback (most recent call last):
  File "/home/jupyter-jkitchin@andrew.cm-11dd7/src/sse/02-python-requests/./oa_inst.py", line 12, in <module>
    for result in data['results']:
KeyError: 'results'
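
One way to fail more gracefully, sketched under the assumption that a successful response is a dictionary with a `results` list (as in the outputs above), is to validate the payload before formatting it. The function name `format_results` and the placeholder values are hypothetical choices for this illustration.

```python
def format_results(data):
    """Return formatted rows, or raise a clear error for unexpected payloads."""
    if 'results' not in data:
        # For example, an error payload that came back with a non-200 status.
        raise ValueError(f'Unexpected response, keys = {sorted(data)}')

    rows = []
    for result in data['results']:
        # .get() supplies a placeholder when a field was not selected.
        name = result.get('display_name', '(no name)')
        works = result.get('works_count', 0)
        cited = result.get('cited_by_count', 0)
        rows.append(f'{name:50s}{works:10d}{cited:10d}')
    return rows

# A payload shaped like the select=id case: only the id comes back.
print(format_results({'results': [{'id': 'https://openalex.org/I74973139'}]}))
```

Whether a placeholder or an error is the better response to a missing field depends on the script; the point is that the failure is one you chose rather than a bare `KeyError`.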


```{warning}
You should always consider what can happen with user input. Here the consequences are not too dire on our end; the script simply breaks. However, worse things might happen on the server end, depending on what is done with that input.

In this case you can avoid the issue by using the params argument in `requests.get`. This will ensure the input is encoded and not used literally.
```
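
To see what that encoding does, here is a small sketch using the standard library's `urllib.parse.urlencode`, which performs the same kind of encoding that `requests` applies to `params`. The `naive` string shows the unsafe alternative of pasting the input straight into the URL.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

base = 'https://api.openalex.org/institutions'
user_input = 'carnegie mellon university&select=id'

# Unsafe: the & in the input starts a second parameter.
naive = f'{base}?search={user_input}'

# Safer: special characters are percent-encoded, so the whole
# input stays inside the value of `search`.
query_string = urlencode({'search': user_input})
encoded = f'{base}?{query_string}'

# Compare which parameter names a server would see in each case.
naive_params = sorted(parse_qs(urlsplit(naive).query))
encoded_params = sorted(parse_qs(urlsplit(encoded).query))
print(naive_params)
print(encoded_params)
print(encoded)
```

In the naive URL the server sees both `search` and `select`; in the encoded URL it sees only `search`, whose value happens to contain an ampersand.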

## Getting better than sys.argv

`sys.argv` is really only suitable for the simplest of command line arguments, and it isn't really even great for those. Among the limitations are:

1. No option parsing (or you have to write your own)
2. No built-in help or documentation

Some built-in core libraries in Python can help with this, e.g. [argparse](https://docs.python.org/3/library/argparse.html). There are also third-party libraries like [click](https://palletsprojects.com/p/click/). 
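
As a point of comparison, here is a minimal sketch of this kind of interface with `argparse`. The description text and the `query` argument name are choices made for this illustration; `nargs='*'` collects zero or more words into a list.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='OpenAlex Institutions')
    # nargs='*' gathers all positional words into a list of strings.
    parser.add_argument('query', nargs='*', help='words to search for')
    return parser

# Passing a list to parse_args lets us try the parser without a shell;
# called with no argument it reads sys.argv[1:] instead.
args = build_parser().parse_args(['carnegie', 'mellon', 'university'])
print(args.query)
```

The parser also generates `--help` output automatically, which addresses the second limitation listed above.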

Let's rewrite the script above using click. The principal idea is that we write a function that does what we want with some arguments, use the click library to convert the command-line arguments into variables for that function, and then run the function only when the file is run as a script (as opposed to imported).

Making this look easy requires some advanced Python skills. Let's work out a reusable function first. I am writing this with some 20/20 hindsight we don't have yet. We will return later to why we wrote the function this specific way. This function should take a list of terms, or a string to query. Either way, we convert it to a string with each word joined by +. Then, we return a formatted string for each result found.



In [82]:
import requests 
from collections.abc import Iterable 

def openalex_institution(query):
    'query is a list of terms in the query, or a string.'
    
    # Replace spaces with +
    if isinstance(query, str):
        query = '+'.join(query.split())

    # If it is not a string, we assume it is an iterable of strings.
    elif isinstance(query, Iterable):
        query = '+'.join([str(x) for x in query])
        
    else:
        raise Exception('query should be a string or Iterable')
        
    url = 'https://api.openalex.org/institutions'
    req = requests.get(url,
                       params={'search': query})
    print(req.request.url)
    data = req.json()

    return [f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}'
            for result in data['results']]
            
openalex_institution('carnegie mellon university')            

https://api.openalex.org/institutions?search=carnegie%2Bmellon%2Buniversity


['Carnegie Mellon University                            121925   4978056',
 'Carnegie Mellon University Qatar                         697      9769',
 'Carnegie Mellon University Australia                      93      2194',
 'Carnegie Mellon University Africa                        187      1179']

In [84]:
# Test with list of words
openalex_institution(['carnegie', 'mellon', 'university'])          

https://api.openalex.org/institutions?search=carnegie%2Bmellon%2Buniversity


['Carnegie Mellon University                            121925   4978056',
 'Carnegie Mellon University Qatar                         697      9769',
 'Carnegie Mellon University Australia                      93      2194',
 'Carnegie Mellon University Africa                        187      1179']

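Note our function returns data in the form of a list of strings. Note also the `%2B` in the printed URL: because the query goes through `params`, the literal + characters we joined with are percent-encoded. The search still works here, but joining the words with spaces instead would let `requests` produce the usual + separators. Later, we can join the returned strings into a single string like this.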



In [85]:
print('\n'.join(openalex_institution(['carnegie', 'mellon', 'university'])))

https://api.openalex.org/institutions?search=carnegie%2Bmellon%2Buniversity
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


## Basic click usage
That is the independent reusable part. Now, let's look at how click works. We have to create a function that does what we want, and then decorate it with click functions. Start by creating a new file: oa_inst2.py with these contents. The `main` function is what will run in our script, and only when we run this as a script.


Once the file is written and made executable, running it with `--help` shows that we automatically get help.

In [86]:
%%writefile oa_inst2.py
#!/usr/bin/env python
import click

@click.command(help='OpenAlex Institutions')
@click.argument('query', nargs=-1)
def main(query):
    print(query)
    
if __name__ == '__main__':
    main()

Writing oa_inst2.py


In [87]:
! chmod +x oa_inst2.py

In [88]:
! ./oa_inst2.py --help

Usage: oa_inst2.py [OPTIONS] [QUERY]...

  OpenAlex Institutions

Options:
  --help  Show this message and exit.


We can also run the command with arguments. Here you see that the arguments arrive as a tuple of strings; because the query is quoted, the tuple has a single element.



In [93]:
! ./oa_inst2.py "carnegie mellon university"

('carnegie mellon university',)


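Now, we combine the function we worked out above with the script. We make a few modifications here. The `main` function has to join the strings returned by `openalex_institution` with `\n`, and then print that string so we can see it on stdout in the shell. This version of `openalex_institution` also prints messages showing which branch was taken, and it builds the URL by string interpolation rather than through `params`, so the caveat from the Security risk section applies to it.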

In [1]:
%%writefile oa_inst2.py
#!/usr/bin/env python
import click

import requests 
from collections.abc import Iterable 

def openalex_institution(query):
    'query is a list of terms in the query, or a string.'
    if isinstance(query, str):
        print('You passed a string as a query')
        query = '+'.join(query.split())

    # We assume it is an iterable of strings.
    elif isinstance(query, Iterable):
        print('you passed an iterable collection')
        query = '+'.join(query)
        
    url = f'https://api.openalex.org/institutions?search={query}'
    print(url)
    req = requests.get(url)
    print(req.request.url)
    data = req.json()

    return [f'{result["display_name"]:50s}{result["works_count"]:10d}{result["cited_by_count"]:10d}'
            for result in data['results']]

@click.command(help='OpenAlex Institutions')
@click.argument('query', nargs=-1)
def main(query):
    print(query)
    print('\n'.join(openalex_institution(query)))
    
if __name__ == '__main__':
    main()

Overwriting oa_inst2.py


In [2]:
! ./oa_inst2.py "carnegie mellon university"

('carnegie mellon university',)
you passed an iterable collection
https://api.openalex.org/institutions?search=carnegie mellon university
https://api.openalex.org/institutions?search=carnegie%20mellon%20university
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179



Finally, we can import the function we wrote and use it in the notebook. This import works because the Python file is in this directory. Later we will learn how to do this more generally. For now the critical idea is that we have one file that can be used in two different ways: as a script in a shell, and as a Python library from which you can import the same function for use in a notebook, or even another script.



In [9]:
from oa_inst2 import openalex_institution
print('\n'.join(openalex_institution(['carnegie', 'mellon', 'university'])))

you passed an iterable collection
https://api.openalex.org/institutions?search=carnegie+mellon+university
https://api.openalex.org/institutions?search=carnegie+mellon+university
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


In [10]:
print('\n'.join(openalex_institution('carnegie mellon university')))

You passed a string as a query
https://api.openalex.org/institutions?search=carnegie+mellon+university
https://api.openalex.org/institutions?search=carnegie+mellon+university
Carnegie Mellon University                            121925   4978056
Carnegie Mellon University Qatar                         697      9769
Carnegie Mellon University Australia                      93      2194
Carnegie Mellon University Africa                        187      1179


# Back to the author endpoint

You can access an author from a URL like this:
https://api.openalex.org/authors/https://orcid.org/0000-0003-2625-9232

Let's look at a few things. We get the name, the number of works, and a URL to those works. Although the record says how many works there are (154 in the output below), it does not list them. Instead, it provides a URL to get to them. Let's follow this URL and see what is there.



In [115]:
import requests

url = 'https://api.openalex.org/authors/https://orcid.org/0000-0003-2625-9232'
data = requests.get(url).json()
data['display_name'], data['works_count'], data['works_api_url'], data['summary_stats']['h_index']

('John R. Kitchin',
 154,
 'https://api.openalex.org/works?filter=author.id:A5003442464',
 40)

The `works_api_url` points to another set of JSON data.



In [116]:
works = requests.get(data['works_api_url']).json()
works['meta']

{'count': 154,
 'db_response_time_ms': 122,
 'page': 1,
 'per_page': 25,
 'groups_count': None}

This new data has a new feature. There are many works, but this "page" of data holds only 25 results. We have to consider how to access all the pages to get the rest of the data. Paging is described here https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging. The gist is that we add a `per-page` parameter to the request to increase the number of results per page. The maximum is 200, and this author has fewer works than that, so a single request with `per-page` set to 200 gets them all.



In [117]:
works = requests.get(data['works_api_url'],
                     params={'per-page': 200}).json()
works['meta']

{'count': 154,
 'db_response_time_ms': 238,
 'page': 1,
 'per_page': 200,
 'groups_count': None}
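
For an author with more than 200 works, one page is not enough. A minimal sketch of basic paging follows, assuming the `page` and `per-page` parameters and the `meta` count field shown in the outputs above. The `fetch_page` argument is a hypothetical callable so the loop can be tried without a network connection; in real use it would wrap something like `requests.get(url, params={'page': page, 'per-page': per_page}).json()`.

```python
import math

def get_all_results(fetch_page, per_page=200):
    """Collect results from every page.

    fetch_page(page, per_page) must return a dict with 'meta'
    (containing 'count') and 'results', shaped like the data above.
    """
    first = fetch_page(1, per_page)
    results = list(first['results'])
    n_pages = math.ceil(first['meta']['count'] / per_page)
    for page in range(2, n_pages + 1):
        results.extend(fetch_page(page, per_page)['results'])
    return results

# A stand-in for the API: 450 fake works, served one page at a time.
fake_works = [{'cited_by_count': i} for i in range(450)]

def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return {'meta': {'count': len(fake_works)},
            'results': fake_works[start:start + per_page]}

print(len(get_all_results(fake_fetch)))
```

The paging documentation linked above covers the details and limits of the real API, which this sketch does not attempt to reproduce.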

Our goal now is to get the `cited_by_count` for each work, and then compute the H-index for this list of papers. The H-index is the largest number H such that H papers have at least H citations each. To compute it, we sort the citation counts in descending order and walk down the list until the rank exceeds the citation count.
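
The rule can be captured in a small function. This is a sketch with an illustrative name, `h_index`, and arbitrary sample inputs. Unlike a loop that only reports a value at the first paper whose rank exceeds its citations, it also returns an answer when every paper passes the test, for example five papers with five citations each.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites < rank:
            break
        h = rank
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([5, 5, 5, 5, 5]))   # 5
print(h_index([]))                # 0
```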



In [118]:
len(works['results'])

154

In [120]:
citations = sorted([work['cited_by_count'] for work in works['results']], reverse=True)
for i, cite in enumerate(citations, start=1):
    print(f'{i:3d}{cite:8d}')
    if i > cite:
        print(f'H-index = {i - 1}')
        break
    

  1    8431
  2    4010
  3    3159
  4    1915
  5    1225
  6    1177
  7     432
  8     416
  9     314
 10     300
 11     271
 12     215
 13     185
 14     150
 15     130
 16     115
 17     108
 18     103
 19     100
 20      97
 21      95
 22      86
 23      85
 24      81
 25      78
 26      70
 27      69
 28      68
 29      68
 30      68
 31      61
 32      59
 33      58
 34      54
 35      49
 36      48
 37      46
 38      46
 39      43
 40      41
 41      40
H-index = 40


# Exercise

Work together to create a Python-based shell script that takes an ORCID and computes the H-index for the author, and prints it with their name.

In [124]:
%%writefile orcid
#!/usr/bin/env python

import click
import requests

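def hindex(orcid):
    'Return (display_name, h-index) for the author with this ORCID.'
    url = f'https://api.openalex.org/authors/https://orcid.org/{orcid}'
    data = requests.get(url).json()

    displayname = data['display_name']
    # works_api_url already contains a query string, so add paging via params.
    # per-page is capped at 200; authors with more works need several pages.
    works = requests.get(data['works_api_url'],
                         params={'per-page': 200}).json()

    citations = sorted([work['cited_by_count'] for work in works['results']], reverse=True)
    h = 0
    for i, cite in enumerate(citations, start=1):
        if cite < i:
            break
        h = i
    # Returning after the loop also covers the case where no paper fails the test.
    return displayname, h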

@click.command(help='OpenAlex h-index')
@click.argument('orcid', nargs=1)
def main(orcid):
    dn, hi = hindex(orcid)
    print(f'{dn}: h-index = {hi}')
    
if __name__ == '__main__':
    main()

Overwriting orcid


In [125]:
! chmod +x orcid

In [126]:
! ./orcid 0000-0003-2625-9232

John R. Kitchin: h-index = 40


**Exercise**: work out a script to calculate some other kind of metric, e.g. i10-index, or some aggregate statistic like citations per year over time.
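
As a possible starting point for the i10-index, taken here to mean the number of works with at least 10 citations, the counting step might look like the sketch below. The function name is illustrative and the sample counts are arbitrary; fetching the real citation counts would follow the same steps as in the `orcid` script.

```python
def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)

# Arbitrary sample counts, not real data.
print(i10_index([25, 12, 10, 9, 3, 0]))  # 3
```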