# How are University of Washington (UW) researchers collaborating with people around the globe?

The UW Office of Global Affairs is interested in showing off the scholarly collaborations between UW's faculty and researchers around the world. Their [interactive dashboard](https://www.washington.edu/global/publications/) lets people dig deep into these collaborations using maps, graphs, and other visualizations. The data powering this dashboard is based on the Microsoft Academic Graph, which is no longer being updated. In this tutorial, we will use OpenAlex's API to update the data by getting all of the publications that are collaborations between UW faculty and other institutions.



First, let's look at how the University of Washington is represented in OpenAlex.

[Institutions](https://docs.openalex.org/api-entities/institutions) in OpenAlex are closely linked with the [ROR registry](https://ror.org/) of research organizations. By searching the ROR website, we can find the [ROR ID for UW](https://ror.org/00cvxb145).

In [1]:
uw_id = "https://ror.org/00cvxb145"

To get the data for the UW dashboard, we need to collect all of the works from OpenAlex which have at least one author from UW and at least one author outside UW. We'll do this in two steps: first, we'll get all of the works with UW authors, then we'll filter to keep only the papers with at least one other affiliation.

In [2]:
# specify endpoint
endpoint = 'works'

# build the 'filter' parameter
# We'll limit it to the last 20 years
filters = ",".join((
    f'institutions.ror:{uw_id}',
    'from_publication_date:2003-01-01',
))

# put the URL together
filtered_works_url = f'https://api.openalex.org/{endpoint}?filter={filters}'
print(f'complete URL with filters:\n{filtered_works_url}')

complete URL with filters:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/00cvxb145,from_publication_date:2003-01-01


We've built the URL. Now let's get the results from the API, using the `requests` library.

In [3]:
import requests

In [4]:
r = requests.get(filtered_works_url)
results_page = r.json()
print(f"retrieved {len(results_page['results'])} works")

retrieved 25 works


We've retrieved 25 works, but of course that isn't the total number of works by UW authors. We only got the first page of results. We can see how many works there actually are by looking at the `results_page['meta']['count']` value:

In [5]:
print(results_page['meta']['count'])

246613


In [50]:
results_page['meta']['count'] / 25

9864.52

There are about 250,000 works. To get all of them, we will need to use the [paging technique](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging). Since we want more than 10,000 works, we need to use the cursor paging technique. At 25 results per page, that means we will need to make nearly 10,000 API calls to get all of the data we want. This may seem like a lot, but don't panic! This is well under the [free allowance of 100,000 API calls per day.](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) It will take about an hour in total, so let it run overnight or while you take a break.
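The cursor paging pattern described above can be sketched without touching the network. In this minimal sketch, `iter_cursor_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names standing in for real `requests.get(...).json()` calls; only the `results` and `meta.next_cursor` keys come from the API responses used in this notebook. The last two lines estimate the number of requests, assuming the API also accepts a `per-page` value of 200 as the OpenAlex paging documentation describes.

```python
import math

def iter_cursor_pages(fetch_page, start_cursor="*"):
    """Yield each page's results, following meta.next_cursor until it is empty."""
    cursor = start_cursor
    while cursor:
        page = fetch_page(cursor)
        yield page["results"]
        # A missing or null next_cursor ends the loop.
        cursor = page["meta"].get("next_cursor")

# Hypothetical stand-in for the API: three fake pages keyed by cursor value.
FAKE_PAGES = {
    "*": {"results": [1, 2], "meta": {"next_cursor": "a"}},
    "a": {"results": [3, 4], "meta": {"next_cursor": "b"}},
    "b": {"results": [5], "meta": {"next_cursor": None}},
}

def fake_fetch(cursor):
    return FAKE_PAGES[cursor]

collected = [item for page in iter_cursor_pages(fake_fetch) for item in page]

# Requests needed for the 246,613 works counted above.
requests_at_25 = math.ceil(246613 / 25)    # default page size
requests_at_200 = math.ceil(246613 / 200)  # assumed larger page size
```

With a real fetch function in place of `fake_fetch`, the same generator would walk every page; a larger page size would mainly shorten the total run time.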


In [5]:
cursor = '*'

select = ",".join((
    'id',
    'ids',
    'title',
    'display_name',
    'publication_year',
    'publication_date',
    'primary_location',
    'open_access',
    'authorships',
    'cited_by_count',
    'is_retracted',
    'is_paratext',
    'updated_date',
    'created_date',
))

# loop through pages
works = []
loop_index = 0
while cursor:
    
    # set cursor value and request page from OpenAlex
    url = f'{filtered_works_url}&select={select}&cursor={cursor}'
    page_with_results = requests.get(url).json()
    
    results = page_with_results['results']
    works.extend(results)

    # update cursor to meta.next_cursor
    cursor = page_with_results['meta']['next_cursor']
    loop_index += 1
    if loop_index in [5, 10, 20, 50] or loop_index % 100 == 0:
        print(f'{loop_index} api requests made so far')
print(f'done. made {loop_index} api requests. collected {len(works)} works')

5 api requests made so far
10 api requests made so far
20 api requests made so far
50 api requests made so far
100 api requests made so far
200 api requests made so far
300 api requests made so far
400 api requests made so far
500 api requests made so far
600 api requests made so far
700 api requests made so far
800 api requests made so far
900 api requests made so far
1000 api requests made so far
1100 api requests made so far
1200 api requests made so far
1300 api requests made so far
1400 api requests made so far
1500 api requests made so far
1600 api requests made so far
1700 api requests made so far
1800 api requests made so far
1900 api requests made so far
2000 api requests made so far
2100 api requests made so far
2200 api requests made so far
2300 api requests made so far
2400 api requests made so far
2500 api requests made so far
2600 api requests made so far
2700 api requests made so far
2800 api requests made so far
2900 api requests made so far
3000 api requests made so far

Now might be a good time to save the data we just retrieved, in case this notebook restarts and we want to come back to it without having to get all of the works from the API again. The following cell allows us to do this by saving the data using the `.pickle` data format. (If you are loading the data in this way, then you don't need to run the above cell.)

In [6]:
import pickle

# uncomment these lines and run to save the results so we won't have to fetch them
# again next time we run the notebook
# import os
# os.makedirs('../../data', exist_ok=True)
# with open('../../data/uw_works_since_2003.pickle', 'wb') as outf:
#     pickle.dump(works, outf, protocol=pickle.HIGHEST_PROTOCOL)

# OR uncomment these lines and run to load the saved results
# with open('../../data/uw_works_since_2003.pickle', 'rb') as f:
#     works = pickle.load(f)
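
The save and load steps can also be wrapped in two small helpers. This is a minimal sketch: `save_pickle`, `load_pickle`, and the sample record are hypothetical, and a temporary directory is used so the example does not write into the notebook's own data folder.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Create the parent directory if needed, then pickle obj to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as outf:
        pickle.dump(obj, outf, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Load and return a previously pickled object."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up record standing in for the list of works.
sample_works = [{"id": "https://openalex.org/W0000000000", "title": "example"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "data", "sample_works.pickle")
    save_pickle(sample_works, sample_path)
    restored = load_pickle(sample_path)
```

Saving and loading through the same helpers keeps the two paths from drifting apart.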

Each work object has a lot of information, not all of which we will use. Let's get the data into a Pandas dataframe, limiting the number of fields to just the ones we might be interested in.

In [9]:
import pandas as pd
data = []
for work in works:
    for authorship in work['authorships']:
        if authorship:
            author = authorship['author']
            author_id = author['id'] if author else None
            author_name = author['display_name'] if author else None
            author_position = authorship['author_position']
            for institution in authorship['institutions']:
                if institution:
                    institution_id = institution['id']
                    institution_name = institution['display_name']
                    institution_country_code = institution['country_code']
                    data.append({
                        'work_id': work['id'],
                        'work_title': work['title'],
                        'work_display_name': work['display_name'],
                        'work_publication_year': work['publication_year'],
                        'work_publication_date': work['publication_date'],
                        'author_id': author_id,
                        'author_name': author_name,
                        'author_position': author_position,
                        'institution_id': institution_id,
                        'institution_name': institution_name,
                        'institution_country_code': institution_country_code,
                    })
df = pd.DataFrame(data)

In [54]:
print(f"There are {df['institution_id'].nunique():,} institutions that collaborate with UW, in {df['institution_country_code'].nunique()} countries.")

There are 20,842 institutions that collaborate with UW, in 193 countries.


In [None]:
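def outside_uw_collab(institution_ids):
    # institution_ids holds every institution ID attached to one work.
    # The work is an outside collaboration unless all of them are UW's OpenAlex ID.
    return not all(institution_ids == 'https://openalex.org/I201448701')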
df['is_outside_uw_collab'] = df.groupby('work_id')['institution_id'].transform(outside_uw_collab)
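
The second step of the plan (keeping only works with at least one non-UW affiliation) can be checked on a toy table. This sketch uses a vectorized variant of the cell above rather than the notebook's own function: it marks each row as non-UW, then asks whether any row in the same work is marked. The toy rows and work IDs are made up; the two institution IDs are the ones that appear in this notebook.

```python
import pandas as pd

UW_OPENALEX_ID = "https://openalex.org/I201448701"
OTHER_ID = "https://openalex.org/I27837315"

# Toy data: W1 has an outside co-author, W2 and W3 are UW-only.
toy = pd.DataFrame({
    "work_id": ["W1", "W1", "W2", "W3", "W3"],
    "institution_id": [UW_OPENALEX_ID, OTHER_ID, UW_OPENALEX_ID,
                       UW_OPENALEX_ID, UW_OPENALEX_ID],
})

# True for rows whose institution is not UW.
is_non_uw = toy["institution_id"].ne(UW_OPENALEX_ID)

# A work is an outside collaboration if any of its rows is non-UW.
toy["is_outside_uw_collab"] = is_non_uw.groupby(toy["work_id"]).transform("any")
```

Every row of W1 is flagged `True`, and every row of W2 and W3 is flagged `False`.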

In [None]:
df.drop_duplicates(subset='work_id')['is_outside_uw_collab'].value_counts().sort_index()

False     96886
True     149651
Name: is_outside_uw_collab, dtype: int64

In [None]:
df_collab = df[df['is_outside_uw_collab']].drop(columns='is_outside_uw_collab')
print(f"dataframe has {len(df_collab)} rows, with {df_collab['work_id'].nunique()} unique publications")

dataframe has 1453778 rows, with 149651 unique publications


We've found about 20K institutions that collaborate with UW. We already know the names and the countries of these institutions; these were included in the results from the `works` endpoint of the API. There is some more information about the institutions that would be helpful to supply to the dashboard, especially the geolocation data that can be used to show the collaborations on maps. This information can be found in the [Institutions](https://docs.openalex.org/api-entities/institutions) entities.

To get the institutions, we will query the `/institutions` endpoint of the API. We will use the technique to get multiple entities per API request as suggested [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or). This technique involves getting 50 results at once by requesting them in a filter, separated by a pipe ('`|`').

This will get all of the institutions in about 400 API calls, which should only take a few minutes.
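
The batching idea can be sketched on its own, without making any requests. Here `chunked`, `build_institution_urls`, and the toy IDs are hypothetical; the URL shape mirrors the request built in the next cell.

```python
import math

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_institution_urls(ids, size=50):
    """Build one filter URL per chunk, joining the IDs with a pipe."""
    base = "https://api.openalex.org/institutions"
    return [
        f"{base}?filter=openalex:{'|'.join(chunk)}&per-page={size}"
        for chunk in chunked(ids, size)
    ]

# 120 made-up IDs split into chunks of 50, 50, and 20.
toy_ids = [f"https://openalex.org/I{n}" for n in range(1, 121)]
urls = build_institution_urls(toy_ids)

# The same arithmetic for the 20,842 institutions found above.
calls_needed = math.ceil(20842 / 50)
```

The last line is where the "about 400 API calls" estimate comes from.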

In [14]:
institution_ids = df['institution_id'].dropna().unique()

endpoint = "institutions"
size = 50
loop_index = 0
institutions = []
for list_index in range(0, len(institution_ids), size):
    subset = institution_ids[list_index:list_index+size]
    pipe_separated_ids = "|".join(subset)
    r = requests.get(f"https://api.openalex.org/institutions?filter=openalex:{pipe_separated_ids}&per-page={size}")
    # ! ids.openalex filter does not work, even though it is in the documentation
    results = r.json()['results']
    institutions.extend(results)
    loop_index += 1
print(f"collected {len(institutions)} institutions using {loop_index} api calls")

collected 20842 institutions using 417 api calls


In [55]:
data = []
for institution in institutions:
    data.append({
        'id': institution['id'],
        'ror': institution['ror'],
        'display_name': institution['display_name'],
        'country_code': institution['country_code'],
        'type': institution['type'],
        'latitude': institution['geo']['latitude'],
        'longitude': institution['geo']['longitude'],
        'city': institution['geo']['city'],
        'region': institution['geo']['region'],
        'country': institution['geo']['country'],
        'image_url': institution['image_url'],
        'image_thumbnail_url': institution['image_thumbnail_url'],
    })

df_institutions = pd.DataFrame(data)
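
One way the geolocation columns could be attached to the collaboration rows is a left merge on the institution ID. This is only a sketch under assumptions: the toy frames and coordinate values are made up, and the notebook itself does not perform this join.

```python
import pandas as pd

# Toy stand-in for df_collab: one work with two affiliations.
toy_collab = pd.DataFrame({
    "work_id": ["W1", "W1"],
    "institution_id": ["https://openalex.org/I27837315",
                       "https://openalex.org/I201448701"],
})

# Toy stand-in for df_institutions, with placeholder coordinates.
toy_institutions = pd.DataFrame({
    "id": ["https://openalex.org/I27837315"],
    "latitude": [1.0],
    "longitude": [2.0],
})

# Rename 'id' so both frames share the join key, then keep every collab row.
merged = toy_collab.merge(
    toy_institutions.rename(columns={"id": "institution_id"}),
    on="institution_id",
    how="left",
)
```

A left merge keeps rows whose institution has no match and leaves their coordinates as `NaN`, which makes gaps in the institution data easy to spot.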

In [35]:
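def international_collab(institution_country_codes):
    # institution_country_codes holds the country code of every affiliation on one work.
    # The work is international unless all of them are 'US'.
    # Note: a missing country code also compares unequal to 'US',
    # so works with a missing code are counted as international.
    return not all(institution_country_codes == 'US')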
df['is_international_collab'] = df.groupby('work_id')['institution_country_code'].transform(international_collab)

In [36]:
df.drop_duplicates(subset='work_id')['is_international_collab'].value_counts().sort_index()

False    152031
True      94506
Name: is_international_collab, dtype: int64

In [48]:
outpath = '../../data/uw_collabs.csv'
df_collab.to_csv(outpath, index=False)

In [49]:
# also save a gzip-compressed copy; pandas infers the compression from the '.gz' extension
outpath = '../../data/uw_collabs.csv.gz'
df_collab.to_csv(outpath, index=False)

In [53]:
institutions[0]

{'id': 'https://openalex.org/I27837315',
 'ror': 'https://ror.org/00jmfr291',
 'display_name': 'University of Michigan–Ann Arbor',
 'country_code': 'US',
 'type': 'education',
 'homepage_url': 'https://www.umich.edu/',
 'image_url': 'https://upload.wikimedia.org/wikipedia/commons/9/93/Seal_of_the_University_of_Michigan.svg',
 'image_thumbnail_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Seal_of_the_University_of_Michigan.svg/100px-Seal_of_the_University_of_Michigan.svg.png',
 'display_name_acronyms': ['UM'],
 'display_name_alternatives': ['UMich'],
 'works_count': 727419,
 'cited_by_count': 36062005,
 'ids': {'openalex': 'https://openalex.org/I27837315',
  'ror': 'https://ror.org/00jmfr291',
  'mag': '27837315',
  'grid': 'grid.214458.e',
  'wikipedia': 'https://en.wikipedia.org/wiki/University%20of%20Michigan',
  'wikidata': 'https://www.wikidata.org/wiki/Q230492'},
 'geo': {'city': 'Ann Arbor',
  'geonames_city_id': '4984247',
  'region': 'Michigan',
  'country_co