# OpenAlex and OpenAlex API

## What is OpenAlex?

OpenAlex is an open, comprehensive, and interconnected index of scholarly works, authors, venues, institutions, and concepts. It aims to democratize access to scholarly metadata and make it easier for researchers and developers to analyze and build upon academic knowledge.

## OpenAlex API

The OpenAlex API provides programmatic access to the OpenAlex database. It allows users to query and retrieve scholarly information using various parameters and filters.

### Key Features:

1. **RESTful API**: Follows REST principles, making it easy to integrate with various programming languages.
2. **Rich Metadata**: Provides detailed information about scholarly works, including authors, citations, concepts, and more.
3. **Flexible Querying**: Supports complex queries with multiple filters and search parameters.

## Understanding API Components

### Cursor-based Pagination

The `get_articles` function uses cursor-based pagination to retrieve large result sets efficiently:

- A cursor is an opaque pointer to a position in a result set.
- To start, make the initial request with `cursor=*`; the response contains a page of results and a `next_cursor` value in its `meta` object.
- To get the next page, pass that `next_cursor` value as the `cursor` parameter of the following request.
- Repeat until `next_cursor` comes back empty.
- For large datasets this is more efficient than page-number (offset) pagination, which OpenAlex limits to the first 10,000 results.

### URL Construction in `build_url`

The `build_url` function constructs the API request URL based on various parameters:

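1. **Base URL**: Starts with `https://api.openalex.org/works?`
2. **Search Query**:
   - If a search term is provided, it is added as `search=term1+term2`
   - Spaces in the search query are replaced with `+`; no other URL encoding is applied
3. **Filters**:
   - The time window, concept ID, and author ID are added as filters when provided
   - Multiple filters are joined with commas, which the API treats as a logical AND
   - The filter section starts with `&filter=` if a search term precedes it, or just `filter=` if it is the first parameter
4. **Page size**: The `limit` argument is appended last as the results-per-page parameter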

Example:
```
https://api.openalex.org/works?search=machine+learning&filter=publication_year:2020-2022,concepts.id:C154945302
```

This URL searches for works matching "machine learning" that were published between 2020 and 2022 and are tagged with the concept C154945302.
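
As an alternative to manual string concatenation, the same URL can be produced with the standard library's `urlencode`, which also percent-encodes characters that the simple space replacement would leave untouched. This is only a sketch: `build_works_url` is a hypothetical helper, not the notebook's `build_url`, and it takes filters as a dictionary rather than as separate arguments.

```python
from urllib.parse import urlencode


def build_works_url(search=None, filters=None):
    """Hypothetical helper: build a /works URL from a search string and a dict of filters."""
    params = {}
    if search:
        params["search"] = search
    if filters:
        params["filter"] = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ':' and ',' readable; spaces become '+', other unsafe characters are percent-encoded
    return "https://api.openalex.org/works?" + urlencode(params, safe=":,")


url = build_works_url(
    search="machine learning",
    filters={"publication_year": "2020-2022", "concepts.id": "C154945302"},
)
print(url)
```

With these inputs the result matches the example URL above character for character.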

In [2]:
import tqdm
import re
import pandas as pd
import math
import time
import requests
import json

In [34]:
def get_abstract(words):
    """
    Reconstruct the abstract from an inverted index representation.

    Args:
    words (dict): A dictionary where keys are words and values are lists of indices.

    Returns:
    str: The reconstructed abstract.
    """
    n = max({idx for name in words for idx in words[name]}) + 1
    temp_list = [' '] * n
    for word in words:
        idxs = words[word]
        for idx in idxs:
            temp_list[idx] = word
    abstract = ' '.join(temp_list)
    return abstract


def extract_source_id(primary_location):
    """
    Extract the source ID from the primary location dictionary.

    Args:
    primary_location (dict): A dictionary containing source information.

    Returns:
    str or None: The extracted source ID, or None if not found.
    """
    source = primary_location.get('source')
    if not source or "id" not in source:
        return None
    return re.sub("https://openalex.org/", "", source['id'])


def extract_authorship_details(authorships):
    """
    Extract author IDs, countries, and institutions from authorships data.

    Args:
    authorships (list): A list of dictionaries containing authorship information.

    Returns:
    tuple: Three strings containing semicolon-separated lists of author IDs, countries, and institutions.
    """
    author_ids = []
    countries = []
    institutions = []
    
    for auth in authorships:
        # Extract author ID
        author_id = auth['author'].get('id')
        if author_id:
            author_ids.append(re.sub("https://openalex.org/", "", author_id))
        
        # Extract country
        if auth.get('countries'):
            countries.append(auth['countries'][0])
        
        # Extract institution ID
        if auth.get('institutions'):
            institutions.append(auth['institutions'][0]['id'])

    return '; '.join(author_ids), '; '.join(countries), '; '.join(institutions)


def get_basic_infos(doc):
    """
    Extract basic information from a document dictionary.

    Args:
    doc (dict): A dictionary containing document information from OpenAlex.

    Returns:
    dict: A dictionary containing extracted basic information about the document.
    """
    list_concepts = [concept['id'] for concept in doc.get('concepts', [])]
    
    # Extracting source information (primary_location may be present but null)
    Source = extract_source_id(doc.get('primary_location') or {})
    
    # Extracting authorship details
    auth_ids, countries, institutions = extract_authorship_details(doc.get('authorships', []))

    infos = {
        'id': re.sub("https://openalex.org/", "", doc['id']),
        'doi': doc.get('doi'),
        'year': doc.get('publication_year'),
        'language': doc.get('language'),
        'type': doc.get('type'),
        'source': Source,
        'nb_auth': len(doc.get('authorships', [])),
        'auth_ids': auth_ids,
        'countries': countries,
        'institutions': institutions,
        'nb_citations': doc.get('cited_by_count'),
        'title': doc.get('title'),
        'abstract': get_abstract(doc.get('abstract_inverted_index')) if doc.get('abstract_inverted_index') else None,
        'nb_ref': doc.get('referenced_works_count', 0),
        'concepts': '; '.join(list_concepts),
        'references': '; '.join(doc.get('referenced_works', []))
    }
    
    return infos


def get_nb_pages(url, limit=50):
    """
    Fetch the total number of pages based on the total count of articles and limit per page.

    Args:
    url (str): The API URL to fetch data from.
    limit (int): The number of results per page. Defaults to 50.

    Returns:
    int or None: The total number of pages, or None if the request fails.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve data: {response.status_code}")
        return None

    data = response.json()
    count = data['meta']['count']
    nb_pages = math.ceil(count / limit)
    print(f"Number of articles matched: {count}")
    print(f"Number of pages: {nb_pages}")
    return nb_pages


def get_articles(url, cursor=None, limit=100):
    """
    Fetch one page of articles from the OpenAlex API.

    Args:
    url (str): The API URL to fetch data from. The page size is whatever the URL already specifies.
    cursor (str, optional): A cursor for pagination; pass '*' to request the first page. Defaults to None.
    limit (int): Currently unused; kept so existing calls keep working. Defaults to 100.

    Returns:
    tuple, list or None: If cursor is provided, returns a tuple of (articles, next_cursor).
                         If no cursor is provided, returns just the list of articles.
                         Returns None if the request fails.
    """
    if cursor:
        url += f"&cursor={cursor}"
    
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve data: {response.status_code}")
        return None
    data = response.json()
    articles = data['results']
    # next_cursor is only present in the response when a cursor was sent
    next_cursor = data['meta'].get('next_cursor')
    return (articles, next_cursor) if cursor else articles


def build_url(search=None, time_windows=None, concept_id=None, author_id=None, limit=100):
    """
    Build the search URL for the OpenAlex API based on provided parameters.

    Args:
    search (str, optional): Search terms. Defaults to None.
    time_windows (str, optional): Publication year range, e.g. '2000-2023'. Defaults to None.
    concept_id (str, optional): Concept ID to filter by. Defaults to None.
    author_id (str, optional): Author ID to filter by. Defaults to None.
    limit (int): Number of results per page. Defaults to 100.

    Returns:
    str: The constructed API URL.
    """
    base_url = "https://api.openalex.org/works?"
    params = []
    filters = []

    if search:
        # Spaces become '+'; other characters are passed through unchanged
        params.append(f"search={search.replace(' ', '+')}")

    if time_windows:
        filters.append(f"publication_year:{time_windows}")
    if concept_id:
        filters.append(f"concepts.id:{concept_id}")
    if author_id:
        filters.append(f"author.id:{author_id}")
    if filters:
        # Comma-separated filters are combined with AND by the API
        params.append("filter=" + ','.join(filters))

    # Add the page-size parameter ('per-page' is the name used in the OpenAlex documentation)
    params.append(f"per-page={limit}")

    # Joining with '&' avoids a stray '&' right after '?' when there is no search or filter
    return base_url + '&'.join(params)
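
To make the `abstract_inverted_index` format handled by `get_abstract` concrete: each key is a word and each value is the list of positions where that word occurs. The sketch below rebuilds a sentence from a toy index. `rebuild_abstract` is an illustrative variant, not the function defined above; it sorts (position, word) pairs instead of filling a pre-sized list, so missing positions do not leave gaps. The toy index is made up for illustration and is not an API response.

```python
def rebuild_abstract(inverted_index):
    """Illustrative variant of get_abstract: sort (position, word) pairs and join them."""
    positions = [(idx, word) for word, idxs in inverted_index.items() for idx in idxs]
    return " ".join(word for _, word in sorted(positions))


# Toy inverted index: "a" occurs twice, at positions 2 and 5
toy_index = {
    "Radium": [0],
    "is": [1],
    "a": [2, 5],
    "radioactive": [3],
    "element,": [4],
    "metal.": [6],
}
print(rebuild_abstract(toy_index))
```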

In [17]:
def get_concept_id(query):
    query = query.replace(' ', '+')
    url = f"https://api.openalex.org/concepts?search={query}&per-page=200"
    response = requests.get(url)
    
    if response.status_code != 200:
        print("Failed to retrieve data:", response.status_code)
        return None
    data = response.json()
    print(f"Number of concepts with {query} in display_name: {data['meta']['count']}")
    concepts = data['results']
    return concepts

    
    
concepts = get_concept_id('Marie curie')
pd.DataFrame(concepts)

Number of concepts with Marie+curie in display_name: 1


Unnamed: 0,id,wikidata,display_name,relevance_score,level,description,works_count,cited_by_count,summary_stats,ids,image_url,image_thumbnail_url,international,ancestors,related_concepts,counts_by_year,works_api_url,updated_date,created_date
0,https://openalex.org/C2993243194,https://www.wikidata.org/wiki/Q7186,Marie curie,15384.579,3,Polish-French physicist and chemist (1867-1934),5591,63861,"{'2yr_mean_citedness': 1.5991735537190082, 'h_...",{'openalex': 'https://openalex.org/C2993243194...,https://upload.wikimedia.org/wikipedia/commons...,https://upload.wikimedia.org/wikipedia/commons...,"{'display_name': {'af': 'Marie Curie', 'alt': ...","[{'id': 'https://openalex.org/C2910001868', 'w...",[],"[{'year': 2024, 'works_count': 74, 'cited_by_c...",https://api.openalex.org/works?filter=concepts...,2024-09-08T11:39:13.639180,2019-12-13


In [27]:
marie_curie = build_url(concept_id="https://openalex.org/C2993243194")
marie_curie

'https://api.openalex.org/works?filter=concepts.id:https://openalex.org/C2993243194'

In [28]:
import pprint

lst_articles = get_articles(marie_curie, limit=50)
pprint.pprint(lst_articles[4])

{'abstract_inverted_index': {'(Paris': [61],
                             '(Web):October': [78],
                             '(both': [117],
                             '(iRTSV),': [42],
                             '1': [89],
                             '10,': [75, 79],
                             '107,': [74],
                             '12': [63],
                             '17': [43],
                             '2007': [94],
                             '2007,': [73],
                             '2007Publication': [80],
                             '2007Published': [84, 87],
                             '2007https://pubs.acs.org/doi/10.1021/cr050196rhttps://doi.org/10.1021/cr050196rresearch-articleACS': [91],
                             '2008': [116],
                             '38054': [47],
                             '4206–4272Publication': [76],
                             '5092,': [31],
                             '6),': [62],
                             '750

In [29]:
get_nb_pages(marie_curie)

Number of articles matched: 5278
Number of pages: 106


106
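
The page count reported above is a ceiling division: 5278 results at 50 per page fill 105 complete pages, and the remaining 28 results need one more. A quick check of that arithmetic (`page_count` is a hypothetical name that mirrors the calculation inside `get_nb_pages`):

```python
import math


def page_count(total, per_page):
    """Number of pages needed to cover `total` results at `per_page` results per page."""
    return math.ceil(total / per_page)


print(page_count(5278, 50))   # 106
print(5278 - 105 * 50)        # 28 results on the last, partial page
```

The result only matches the real number of requests if `per_page` here equals the page size actually sent in the URL.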

In [30]:
get_basic_infos(lst_articles[7])

{'id': 'W2321296767',
 'doi': 'https://doi.org/10.1021/cr2004212',
 'year': 2012,
 'language': 'en',
 'type': 'review',
 'source': 'S41143188',
 'nb_auth': 5,
 'auth_ids': 'A5066664391; A5082624328; A5067567506; A5057401087; A5016771012',
 'countries': 'FR; FR; FR; US; FR',
 'institutions': 'https://openalex.org/I4210138474; https://openalex.org/I1294671590; https://openalex.org/I4210138474; https://openalex.org/I106959904; https://openalex.org/I4210138474',
 'nb_citations': 733,
 'title': 'Surface Modification Using Phosphonic Acids and Esters',
 'abstract': 'ADVERTISEMENT RETURN TO ISSUEPREVReviewNEXTSurface Modification Using Phosphonic Acids and EstersClémence Queffélec†, Marc Petit†‡, Pascal Janvier†, D. Andrew Knight§, and Bruno Bujoli*†View Author Information† LUNAM Université, CNRS, UMR 6230, Chimie Et Interdisciplinarité: Synthèse Analyse Modélisation (CEISAM), UFR Sciences et Techniques, 2, rue de la Houssinière, BP 92208, 44322 NANTES Cedex 3, France‡ Université Pierre et Ma

In [8]:
concepts

[{'id': 'https://openalex.org/C2777583440',
  'wikidata': 'https://www.wikidata.org/wiki/Q4204823',
  'display_name': 'History of childhood',
  'relevance_score': 4146.808,
  'level': 5,
  'description': 'aspect of history',
  'works_count': 416,
  'cited_by_count': 9447,
  'summary_stats': {'2yr_mean_citedness': 0.8666666666666667,
   'h_index': 49,
   'i10_index': 140},
  'ids': {'openalex': 'https://openalex.org/C2777583440',
   'wikidata': 'https://www.wikidata.org/wiki/Q4204823',
   'mag': '2777583440',
   'wikipedia': 'https://en.wikipedia.org/wiki/History%20of%20childhood'},
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/c/c0/Su_Han_Ch%27en_001.jpg',
  'image_thumbnail_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Su_Han_Ch%27en_001.jpg/55px-Su_Han_Ch%27en_001.jpg',
  'international': {'display_name': {'ar': 'تاريخ الطفولة',
    'ast': 'Historia de la infancia',
    'be-tarask': 'гісторыя дзяцінства',
    'ca': 'història de la infància',
    'e

In [31]:
import logging
import os
import time
import tqdm
import pandas as pd

# Standard logging levels:
#   NOTSET=0, DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50

# Set up logging
logging.basicConfig(filename='app.log', 
                    filemode='w',  # overwrite the log file on each run
                    format='%(asctime)s - %(levelname)s - %(message)s', 
                    level=logging.DEBUG)  # log everything from DEBUG upwards
logger = logging.getLogger(__name__)

In [32]:
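time_windows = '2000-2023'
search = None
limit = 200  # results per page; used for both the URL and the page count

for concept in concepts:
    logger.info(f"Concept: {concept}")
    concept_id = concept['id']
    # The URL must carry the same page size that get_nb_pages uses,
    # otherwise the loop stops before all results are fetched
    url = build_url(search, time_windows, concept_id=concept_id, limit=limit)
    nb_pages = get_nb_pages(url, limit)
    if nb_pages is None:
        # The count request failed; skip this concept
        logger.error(f"Could not get the page count for {concept_id}")
        continue
    all_infos = []
    next_cursor = '*'  # '*' requests the first page of a cursor-paged query

    for page in tqdm.tqdm(range(1, nb_pages+1)):
        logger.info(f"Page: {page}\n")
        try:
            time.sleep(1)  # pause between requests to go easy on the API
            page_i, next_cursor = get_articles(url, next_cursor, limit)

            for article in page_i:
                if not article:
                    # Log relevant info to a file
                    logger.warning(f"NoneType article found on page {page} with cursor {next_cursor}")
                    logger.info(f"NoneType from page_i\n{page_i}")
                    continue  # Skip this article

                infos = get_basic_infos(article)
                all_infos.append(infos)

            if not next_cursor:
                break  # no further pages

        except Exception as e:
            # Log exceptions
            logger.error(f"Error on page {page} with cursor {next_cursor}: {str(e)}")

    # df is rebuilt for each concept, so after the loop it holds the last concept's results
    df = pd.DataFrame(all_infos)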


Number of articles matched: 3469
Number of pages: 18


100%|██████████| 18/18 [00:33<00:00,  1.86s/it]


In [33]:
df

Unnamed: 0,id,doi,year,language,type,source,nb_auth,auth_ids,countries,institutions,nb_citations,title,abstract,nb_ref,concepts,references
0,W2118233996,https://doi.org/10.1002/jcc.20290,2005,en,article,S21117631,10,A5014694160; A5020579549; A5012407838; A506394...,US; US; US; DE; US; US; GB; US; US; US,https://openalex.org/I123431417; https://opena...,8355,The Amber biomolecular simulation programs,"Journal of Computational ChemistryVolume 26, I...",155,https://openalex.org/C2991991741; https://open...,https://openalex.org/W1487420490; https://open...
1,W1991039509,https://doi.org/10.1021/cr040410w,2006,en,review,S41143188,4,A5053501559; A5050708434; A5009218318; A504898...,IT; IT; IT; IT,https://openalex.org/I219388962; https://opena...,1593,Copper Homeostasis and Neurodegenerative Disor...,ADVERTISEMENT RETURN TO ISSUEPREVArticleNEXTCo...,418,https://openalex.org/C2780596555; https://open...,https://openalex.org/W1489286950; https://open...
2,W2092638302,https://doi.org/10.1021/cr050196r,2007,en,review,S41143188,2,A5050853948; A5023552550,FR; FR,https://openalex.org/I4210144381; https://open...,1458,"Occurrence, Classification, and Biological Fun...",ADVERTISEMENT RETURN TO ISSUEPREVArticleNEXTOc...,587,https://openalex.org/C2778805511; https://open...,https://openalex.org/W100311329; https://opena...
3,W2321296767,https://doi.org/10.1021/cr2004212,2012,en,review,S41143188,5,A5066664391; A5082624328; A5067567506; A505740...,FR; FR; FR; US; FR,https://openalex.org/I4210138474; https://open...,733,Surface Modification Using Phosphonic Acids an...,ADVERTISEMENT RETURN TO ISSUEPREVReviewNEXTSur...,394,https://openalex.org/C161191863; https://opena...,https://openalex.org/W1600530500; https://open...
4,W3125613570,https://doi.org/10.1016/j.joule.2020.12.025,2021,en,article,S2898305631,3,A5057309116; A5018906660; A5028485156,NL; DE; NL,https://openalex.org/I121797337; https://opena...,647,Electrocatalytic Nitrate Reduction for Sustain...,Phebe van Langevelde earned her BSc in molecul...,15,https://openalex.org/C33989665; https://openal...,https://openalex.org/W2019322675; https://open...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,W3123576375,https://doi.org/10.1051/medsci/2020255,2021,fr,article,S20253135,1,A5039467504,FR,https://openalex.org/I4210136791,6,"CRISPR : le Nobel, enfin…","Après Marie Curie, en 1903 et 1911, Irène Joli...",15,https://openalex.org/C2993243194; https://open...,https://openalex.org/W1990487336; https://open...
434,W4233300423,https://doi.org/10.1002/advs.202101787,2021,en,erratum,S2737737698,12,A5073678325; A5036453390; A5079839801; A507599...,,,6,3D‐Printed Soft Lithography for Complex Compar...,"Adv. Sci. 2020, 7, 2001150 DOI: 10.1002/advs.2...",2,https://openalex.org/C2993243194; https://open...,https://openalex.org/W3035220540; https://open...
435,W4246316445,https://doi.org/10.1002/9783527679461.oth1,2014,en,other,S4306511533,1,A5075725880,FR,https://openalex.org/I2802931824,6,End User License Agreement,Free Access End User License Agreement Book Ed...,0,https://openalex.org/C2780560020; https://open...,
436,W2014165355,https://doi.org/10.1021/jo010451h,2001,en,article,S205050996,4,A5054628077; A5084133257; A5010758714; A508715...,FR; FR; FR; FR,https://openalex.org/I62396329; https://openal...,14,Why Are Monomeric Lithium Amides Planar?,ADVERTISEMENT RETURN TO ISSUEPREVNoteNEXTWhy A...,17,https://openalex.org/C2778805511; https://open...,https://openalex.org/W1968605816; https://open...


In [38]:
def get_author_id(query):
    query = query.replace(' ', '+')
    url = f"https://api.openalex.org/authors?search={query}&per-page=200"
    response = requests.get(url)
    
    if response.status_code != 200:
        print("Failed to retrieve data:", response.status_code)
        return None
    data = response.json()
    print(f"Number of authors with {query} in display_name: {data['meta']['count']}")
    authors = data['results']
    return authors

get_author_id("Delphine Batho")

Number of authors with Delphine+Batho in display_name: 1


[{'id': 'https://openalex.org/A5070491227',
  'orcid': None,
  'display_name': 'Delphine Batho',
  'display_name_alternatives': ['Delphine Batho'],
  'relevance_score': 363.94003,
  'works_count': 3,
  'cited_by_count': 1,
  'summary_stats': {'2yr_mean_citedness': 0.0, 'h_index': 1, 'i10_index': 0},
  'ids': {'openalex': 'https://openalex.org/A5070491227'},
  'affiliations': [{'institution': {'id': 'https://openalex.org/I4210118081',
     'ror': 'https://ror.org/01q07sy43',
     'display_name': 'Ministre de la Santé',
     'country_code': 'BJ',
     'type': 'government',
     'lineage': ['https://openalex.org/I4210118081']},
    'years': [2018]},
   {'institution': {'id': 'https://openalex.org/I4210092945',
     'ror': 'https://ror.org/00g8rx069',
     'display_name': 'Ministère de la Transition écologique et de la Cohésion des territoires',
     'country_code': 'FR',
     'type': 'government',
     'lineage': ['https://openalex.org/I2802818602',
      'https://openalex.org/I4210092945

In [39]:

url = build_url(search=search, author_id='A5070491227')
lst_papers = get_articles(url)
get_basic_infos(lst_papers[0])


{'id': 'W2792322390',
 'doi': 'https://doi.org/10.3917/espri.1801.0046',
 'year': 2018,
 'language': 'fr',
 'type': 'article',
 'source': 'S79731123',
 'nb_auth': 2,
 'auth_ids': 'A5070491227; A5103332534',
 'countries': 'BJ; FR',
 'institutions': 'https://openalex.org/I4210118081',
 'nb_citations': 1,
 'title': 'Sur le front de l’écologie',
 'abstract': None,
 'nb_ref': 0,
 'concepts': 'https://openalex.org/C138885662',
 'references': ''}
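
Because `get_basic_infos` stores multi-valued fields as `'; '`-separated strings, they need to be split again before any per-author or per-country analysis. A minimal sketch, using values copied from the output above (`split_field` is a hypothetical helper name):

```python
def split_field(value, sep="; "):
    """Split a '; '-separated string back into a list; empty or None gives []."""
    return value.split(sep) if value else []


record = {
    "auth_ids": "A5070491227; A5103332534",
    "countries": "BJ; FR",
    "institutions": "https://openalex.org/I4210118081",
    "references": "",
}

print(split_field(record["auth_ids"]))
print(len(split_field(record["countries"])), len(split_field(record["institutions"])))
```

Note that the lists are not guaranteed to line up: this record has two authors and two countries but only one institution, because `extract_authorship_details` skips authors without that information. Positions in one list therefore cannot be matched reliably to positions in another.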

# Exercise



## Retrieve Publications

Use the OpenAlex API to fetch all publications by Delphine Batho.
Hint: You'll need to first find her OpenAlex ID using the API's author search functionality.


## Identify Most Cited Paper

- From the retrieved publications, determine which paper has the highest citation count.
- Extract relevant information about this paper (title, publication year, DOI, etc.).
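
As a starting point, picking the record with the highest citation count from a list of dictionaries shaped like the output of `get_basic_infos` could look like the sketch below. `most_cited` is a hypothetical helper, and the sample records reuse titles and counts from the DataFrame shown earlier purely to exercise the code; they are not the answer to the exercise.

```python
def most_cited(records):
    """Return the record with the highest 'nb_citations' (missing counts are treated as 0)."""
    return max(records, key=lambda r: r.get("nb_citations") or 0)


sample = [
    {"title": "Why Are Monomeric Lithium Amides Planar?", "year": 2001, "nb_citations": 14},
    {"title": "The Amber biomolecular simulation programs", "year": 2005, "nb_citations": 8355},
    {"title": "End User License Agreement", "year": 2014, "nb_citations": 6},
]

top = most_cited(sample)
print(top["title"], top["year"], top["nb_citations"])
```

`max` raises a `ValueError` on an empty list, so check that the author search actually returned publications before calling it.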


## Analyze the Most Cited Paper

- Retrieve the abstract of the most cited paper.
- Identify the main topic or breakthrough described in this paper.
- List the top 5 concepts associated with this paper (use the 'concepts' field in the API response).