# How are University of Washington (UW) researchers collaborating with people around the globe?

The UW Office of Global Affairs is interested in showing off the scholarly collaborations between UW's faculty and researchers around the world. Their [interactive dashboard](https://www.washington.edu/global/publications/) lets people dig deep into these collaborations using maps, graphs, and other visualizations. The data powering this dashboard is based on the Microsoft Academic Graph, which is no longer being updated. In this tutorial, we will use OpenAlex's API to update the data by getting all of the publications that are collaborations between UW faculty and other institutions.



First, let's look at how the University of Washington is represented in OpenAlex.

[Institutions](https://docs.openalex.org/api-entities/institutions) in OpenAlex are closely linked with the [ROR registry](https://ror.org/) of research organizations. By searching the ROR website, we can find the [ROR ID for UW](https://ror.org/00cvxb145).

In [5]:
uw_id = "https://ror.org/00cvxb145"

To get the data for the UW dashboard, we need to collect all of the works from OpenAlex which have at least one author from UW and at least one author outside UW. We'll do this in two steps: first, we'll get all of the works with UW authors, then we'll filter to keep only the papers with at least one other affiliation.

In [6]:
# specify endpoint
endpoint = 'works'

# build the 'filter' parameter
# We'll limit it to the last 20 years
filters = ",".join((
    f'institutions.ror:{uw_id}',
    'from_publication_date:2003-01-01',
))

# put the URL together
filtered_works_url = f'https://api.openalex.org/{endpoint}?filter={filters}'
print(f'complete URL with filters:\n{filtered_works_url}')

complete URL with filters:
https://api.openalex.org/works?filter=institutions.ror:https://ror.org/00cvxb145,from_publication_date:2003-01-01
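
The `filter` parameter is a comma-separated list of `attribute:value` pairs, so the same string can also be assembled from a dictionary. The sketch below shows one way to do that; `build_filter` is a hypothetical helper name introduced here for illustration, not part of OpenAlex or `requests`.

```python
def build_filter(conditions):
    """Join attribute/value pairs into a comma-separated filter string.

    `conditions` maps filter attributes to values; the comma between
    pairs is what the API reads as AND.
    """
    return ",".join(f"{key}:{value}" for key, value in conditions.items())


uw_id = "https://ror.org/00cvxb145"
filters = build_filter({
    "institutions.ror": uw_id,
    "from_publication_date": "2003-01-01",
})
filtered_works_url = f"https://api.openalex.org/works?filter={filters}"
print(filtered_works_url)
```

Keeping the conditions in a dictionary makes it easy to add or drop a filter without editing the string by hand.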


We've built the URL. Now let's get the results from the API, using the `requests` library.

In [7]:
import requests

In [9]:
r = requests.get(filtered_works_url)
results_page = r.json()
print(f"retrieved {len(results_page['results'])} works")

retrieved 25 works


We've retrieved 25 works, but of course that isn't the total number of works by UW authors. We only got the first page of results. We can see how many works there actually are by looking at the `results_page['meta']['count']` value:

In [10]:
print(results_page['meta']['count'])

246572


There are about 250,000 works. To get all of them, we will need to use the [paging technique](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging). ...
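
Before looping, it can help to estimate how many requests the download will take. The arithmetic below uses the count reported above and the 25-works page size seen in the first response. The larger page size of 200 is an assumption about the maximum the `per-page` parameter accepts; check the paging documentation before relying on it.

```python
import math

total_works = 246_572     # meta.count reported above
default_page_size = 25    # page size observed in the first response
larger_page_size = 200    # ASSUMPTION: largest per-page value the API accepts

requests_at_default = math.ceil(total_works / default_page_size)
requests_at_larger = math.ceil(total_works / larger_page_size)
print(f"{requests_at_default:,} requests at {default_page_size} per page")
print(f"{requests_at_larger:,} requests at {larger_page_size} per page")
```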

In [5]:
cursor = '*'

select = ",".join((
    'id',
    'ids',
    'title',
    'display_name',
    'publication_year',
    'publication_date',
    'primary_location',
    'open_access',
    'authorships',
    'cited_by_count',
    'is_retracted',
    'is_paratext',
    'updated_date',
    'created_date',
))

# loop through pages
works = []
loop_idx = 0
while cursor:
    
    # set cursor value and request page from OpenAlex
    url = f'{filtered_works_url}&select={select}&cursor={cursor}'
    page_with_results = requests.get(url).json()
    
    results = page_with_results['results']
    works.extend(results)

    # update cursor to meta.next_cursor
    cursor = page_with_results['meta']['next_cursor']
    loop_idx += 1
    if loop_idx in [5, 10, 20, 50] or loop_idx % 100 == 0:
        print(f'{loop_idx} api requests made so far')
print(f'done. made {loop_idx} api requests. collected {len(works)} works')

5 api requests made so far
10 api requests made so far
20 api requests made so far
50 api requests made so far
100 api requests made so far
200 api requests made so far
...
3000 api requests made so far
...
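
The paging loop can also be packaged as a generator, which makes the cursor logic reusable and testable without network access. This is a sketch rather than the notebook's own code: `iter_pages` and `fake_get_json` are hypothetical names, the in-memory pages are made up, and the default `per_page=200` assumes the API accepts that value.

```python
def iter_pages(base_url, get_json, per_page=200):
    """Yield each page's `results` list, following meta.next_cursor.

    `get_json` is any callable that takes a URL and returns the decoded
    JSON response, e.g. `lambda url: requests.get(url).json()`.
    """
    cursor = "*"
    while cursor:
        page = get_json(f"{base_url}&per-page={per_page}&cursor={cursor}")
        yield page["results"]
        # Stop once the response no longer carries a next cursor.
        cursor = page["meta"].get("next_cursor")


# Made-up stand-in for the API so the sketch runs offline.
_fake_pages = {
    "*": {"results": [1, 2], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [3, 4], "meta": {"next_cursor": "def"}},
    "def": {"results": [], "meta": {"next_cursor": None}},
}


def fake_get_json(url):
    cursor = url.rsplit("cursor=", 1)[1]
    return _fake_pages[cursor]


works_demo = []
for results in iter_pages("https://api.openalex.org/works?filter=x", fake_get_json):
    works_demo.extend(results)
print(f"collected {len(works_demo)} items")
```

Passing the fetch function in as an argument keeps the generator independent of `requests`, so the same code can be exercised against canned responses.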

In [11]:
import pickle

# uncomment these lines and run to save the results so we won't have to fetch them
# again next time we run the notebook
# with open('uw_works_since_2003.pickle', 'wb') as outf:
#     pickle.dump(works, outf, protocol=pickle.HIGHEST_PROTOCOL)

# OR run these lines to load previously saved results
with open('../../data/uw_works_since_2003.pickle', 'rb') as f:
    works = pickle.load(f)
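
As an alternative to commenting and uncommenting lines, the save and load steps can be folded into one helper that fetches only when no cached file exists. This is a minimal sketch: `load_or_fetch` and `fake_fetch` are hypothetical names, and the demo writes to a temporary directory rather than the notebook's data folder.

```python
import os
import pickle
import tempfile


def load_or_fetch(path, fetch):
    """Return the pickled object at `path`, or call `fetch()` and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = fetch()
    with open(path, "wb") as outf:
        pickle.dump(data, outf, protocol=pickle.HIGHEST_PROTOCOL)
    return data


# Demo with a stand-in fetch function that records how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return [{"id": "W1"}, {"id": "W2"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "works.pickle")
    first = load_or_fetch(cache_path, fake_fetch)   # fetches, then saves
    second = load_or_fetch(cache_path, fake_fetch)  # loads from disk
print(first == second, len(calls))
```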

In [12]:
sum([work['is_paratext'] is True for work in works])

52

In [13]:
works[0]

{'id': 'https://openalex.org/W2142225512',
 'ids': {'openalex': 'https://openalex.org/W2142225512',
  'doi': 'https://doi.org/10.1177/1049732305276687',
  'mag': '2142225512',
  'pmid': 'https://pubmed.ncbi.nlm.nih.gov/16204405'},
 'title': 'Three Approaches to Qualitative Content Analysis',
 'display_name': 'Three Approaches to Qualitative Content Analysis',
 'publication_year': 2005,
 'publication_date': '2005-11-01',
 'primary_location': {'is_oa': False,
  'landing_page_url': 'https://doi.org/10.1177/1049732305276687',
  'pdf_url': None,
  'source': {'id': 'https://openalex.org/S32648145',
   'display_name': 'Qualitative Health Research',
   'issn_l': '1049-7323',
   'issn': ['1049-7323', '1552-7557'],
   'host_organization': 'https://openalex.org/P4310320017',
   'type': 'journal'},
  'license': None,
  'version': None},
 'open_access': {'is_oa': False, 'oa_status': 'closed', 'oa_url': None},
 'authorships': [{'author_position': 'first',
   'author': {'id': 'https://openalex.org/A2

In [14]:
import pandas as pd

# flatten the works into one row per (work, author, institution) combination
data = []
for work in works:
    for authorship in work['authorships']:
        if authorship:
            author = authorship['author']
            author_id = author['id'] if author else None
            author_name = author['display_name'] if author else None
            author_position = authorship['author_position']
            for institution in authorship['institutions']:
                if institution:
                    institution_id = institution['id']
                    institution_name = institution['display_name']
                    institution_country_code = institution['country_code']
                    data.append({
                        'work_id': work['id'],
                        'work_title': work['title'],
                        'work_display_name': work['display_name'],
                        'work_publication_year': work['publication_year'],
                        'work_publication_date': work['publication_date'],
                        'author_id': author_id,
                        'author_name': author_name,
                        'author_position': author_position,
                        'institution_id': institution_id,
                        'institution_name': institution_name,
                        'institution_country_code': institution_country_code,
                    })
df = pd.DataFrame(data)

In [15]:
df

| | work_id | work_title | work_display_name | work_publication_year | work_publication_date | author_id | author_name | author_position | institution_id | institution_name | institution_country_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://openalex.org/W2142225512 | Three Approaches to Qualitative Content Analysis | Three Approaches to Qualitative Content Analysis | 2005 | 2005-11-01 | https://openalex.org/A2642964564 | Hsiu Ching Laura Hsieh | first | https://openalex.org/I64045040 | Fooyin University | TW |
| 1 | https://openalex.org/W2142225512 | Three Approaches to Qualitative Content Analysis | Three Approaches to Qualitative Content Analysis | 2005 | 2005-11-01 | https://openalex.org/A2111315299 | Sarah E. Shannon | last | https://openalex.org/I201448701 | University of Washington | US |
| 2 | https://openalex.org/W2963037989 | You Only Look Once: Unified, Real-Time Object ... | You Only Look Once: Unified, Real-Time Object ... | 2016 | 2016-06-27 | https://openalex.org/A2392241600 | Joseph Redmon | first | https://openalex.org/I201448701 | University of Washington | US |
| 3 | https://openalex.org/W2963037989 | You Only Look Once: Unified, Real-Time Object ... | You Only Look Once: Unified, Real-Time Object ... | 2016 | 2016-06-27 | https://openalex.org/A2310010008 | Santosh K. Divvala | middle | https://openalex.org/I2945602774 | Allen Institute for Artificial Intelligence | US |
| 4 | https://openalex.org/W2963037989 | You Only Look Once: Unified, Real-Time Object ... | You Only Look Once: Unified, Real-Time Object ... | 2016 | 2016-06-27 | https://openalex.org/A2473549963 | Ross Girshick | middle | https://openalex.org/I2252078561 | Facebook | IL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1632158 | https://openalex.org/W99980466 | LASIK: Early Postoperative Complications | LASIK: Early Postoperative Complications | 2008 | 2008-12-01 | https://openalex.org/A2180837181 | José L. Güell | middle | https://openalex.org/I4210131277 | Instituto de Microcirugía Ocular | ES |
| 1632159 | https://openalex.org/W99980466 | LASIK: Early Postoperative Complications | LASIK: Early Postoperative Complications | 2008 | 2008-12-01 | https://openalex.org/A2111176233 | Merce Morral | middle | https://openalex.org/I4210131277 | Instituto de Microcirugía Ocular | ES |
| 1632160 | https://openalex.org/W99980466 | LASIK: Early Postoperative Complications | LASIK: Early Postoperative Complications | 2008 | 2008-12-01 | https://openalex.org/A2077428892 | Oscar Gris | middle | https://openalex.org/I4210131277 | Instituto de Microcirugía Ocular | ES |
| 1632161 | https://openalex.org/W99980466 | LASIK: Early Postoperative Complications | LASIK: Early Postoperative Complications | 2008 | 2008-12-01 | https://openalex.org/A2127986247 | Javier Gaytan | middle | https://openalex.org/I4210131277 | Instituto de Microcirugía Ocular | ES |


In [37]:
institution_ids = df['institution_id'].dropna().unique()
print(f"There are {len(institution_ids):,} institutions that collaborate with UW.")

There are 20,842 institutions that collaborate with UW.


We've found about 20K distinct institutions in these works (a count that includes UW itself). We already know the names and countries of these institutions, because they were included in the results from the `works` endpoint of the API. The dashboard also needs some additional information about each institution, especially the geolocation data used to show the collaborations on maps. This information can be found in the [Institutions](https://docs.openalex.org/api-entities/institutions) entities.

To get the institutions, we will query the `/institutions` endpoint of the API. Rather than making one request per institution, we will fetch 50 at a time by joining their IDs with the OR operator (`|`), as described in the [filter documentation](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#addition-or).
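
The batching step can be sketched on its own, without touching the network. The helper below only builds the request URLs; `batched_institution_urls` is a hypothetical name, the demo IDs are made up, and the `openalex:` filter attribute simply mirrors the one used in this notebook.

```python
def batched_institution_urls(institution_ids, size=50):
    """Yield one /institutions URL per batch of up to `size` IDs."""
    ids = list(institution_ids)
    for start in range(0, len(ids), size):
        subset = ids[start:start + size]
        pipe_separated_ids = "|".join(subset)  # "|" means OR within a filter
        yield (
            "https://api.openalex.org/institutions"
            f"?filter=openalex:{pipe_separated_ids}&per-page={size}"
        )


# Made-up IDs, only to show how 120 IDs split into batches of 50, 50, and 20.
demo_ids = [f"https://openalex.org/I{n}" for n in range(1, 121)]
urls = list(batched_institution_urls(demo_ids))
print(f"{len(urls)} requests needed for {len(demo_ids)} IDs")
```

Setting `per-page` to the batch size matters: with the default page size, a batch of 50 IDs would come back split across more than one page.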

In [50]:
endpoint = "institutions"
size = 50
loop_index = 0
institutions = []
for list_index in range(0, len(institution_ids), size):
    subset = institution_ids[list_index:list_index+size]
    pipe_separated_ids = "|".join(subset)
    # NOTE: the documented `ids.openalex` filter did not work here, so we filter on `openalex` instead
    r = requests.get(f"https://api.openalex.org/{endpoint}?filter=openalex:{pipe_separated_ids}&per-page={size}")
    results = r.json()['results']
    institutions.extend(results)
    print(f"added {len(results)}")
    loop_index += 1
print(f"collected {len(institutions)} institutions using {loop_index} api calls")

added 50
added 50
added 50
...
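
With the institution records in hand, the geolocation data for the maps can be pulled out into a simple lookup. The sketch below assumes each record carries a `geo` object with `latitude` and `longitude` keys, as described in the Institutions entity documentation; verify the field names against a real response. `institution_locations` is a hypothetical helper, and the sample records use rough, illustrative coordinates and a placeholder ID.

```python
def institution_locations(institutions):
    """Map institution ID -> (latitude, longitude), skipping records without coordinates."""
    locations = {}
    for inst in institutions:
        geo = inst.get("geo") or {}
        lat, lon = geo.get("latitude"), geo.get("longitude")
        if lat is not None and lon is not None:
            locations[inst["id"]] = (lat, lon)
    return locations


# Illustrative records only: approximate coordinates and a placeholder ID.
sample_institutions = [
    {"id": "https://openalex.org/I201448701",
     "geo": {"latitude": 47.6, "longitude": -122.3}},
    {"id": "https://openalex.org/I0000000000", "geo": None},
]
locations = institution_locations(sample_institutions)
print(locations)
```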

In [26]:
def international_collab(gdf):
    # True if at least one institution on the work is not in the US
    return not (gdf['institution_country_code'] == 'US').all()
df_international_collab = df.groupby('work_id').apply(international_collab)

In [34]:
df_international_collab.value_counts().sort_index()

False    152031
True      94506
dtype: int64
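
One caveat with the comparison above: a row whose country code is missing compares unequal to `'US'`, so a work with an institution of unknown country is counted as international. The sketch below is a stricter variant that works directly on the work records and ignores missing codes. It is an alternative for illustration, not the notebook's method; `is_international` and the sample works are hypothetical.

```python
def is_international(work, home_country="US"):
    """True if any listed institution has a known country code other than `home_country`."""
    for authorship in work.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            code = (institution or {}).get("country_code")
            if code is not None and code != home_country:
                return True
    return False


# Made-up minimal works: domestic, international, and one with an unknown country.
domestic = {"authorships": [{"institutions": [{"country_code": "US"}]},
                            {"institutions": [{"country_code": "US"}]}]}
international = {"authorships": [{"institutions": [{"country_code": "US"}]},
                                 {"institutions": [{"country_code": "TW"}]}]}
unknown = {"authorships": [{"institutions": [{"country_code": "US"}]},
                           {"institutions": [{"country_code": None}]}]}

flags = [is_international(w) for w in (domestic, international, unknown)]
print(flags)
```

Whether unknown countries should count as international is a judgment call for the dashboard; the point is to make the choice explicit.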

In [28]:
def outside_uw_collab(gdf):
    # True if at least one institution on the work is not UW (OpenAlex ID I201448701)
    return not (gdf['institution_id'] == 'https://openalex.org/I201448701').all()
df_outside_uw_collab = df.groupby('work_id').apply(outside_uw_collab)

In [35]:
df_outside_uw_collab.value_counts().sort_index()

False     96886
True     149651
dtype: int64
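
The counts above say how many works involve an institution other than UW; the dashboard needs the works themselves. A minimal sketch of that final filtering step follows, assuming the structure of the `works` records shown earlier. `institution_ids_of` and `is_outside_collab` are hypothetical helpers, and the demo works are made up apart from the two institution IDs, which appear in the table above.

```python
UW_OPENALEX_ID = "https://openalex.org/I201448701"


def institution_ids_of(work):
    """Collect the distinct institution IDs listed on a work."""
    ids = set()
    for authorship in work.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            if institution and institution.get("id"):
                ids.add(institution["id"])
    return ids


def is_outside_collab(work, home_id=UW_OPENALEX_ID):
    """True if the work lists the home institution and at least one other."""
    ids = institution_ids_of(work)
    return home_id in ids and len(ids - {home_id}) > 0


# Made-up works: one UW-only, one UW plus another institution.
demo_works = [
    {"id": "W-demo-1", "authorships": [
        {"institutions": [{"id": UW_OPENALEX_ID}]}]},
    {"id": "W-demo-2", "authorships": [
        {"institutions": [{"id": UW_OPENALEX_ID}]},
        {"institutions": [{"id": "https://openalex.org/I2945602774"}]}]},
]
collaborations = [w for w in demo_works if is_outside_collab(w)]
print([w["id"] for w in collaborations])
```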