In [1]:
#| output: false
#| code-fold: true
import requests
import json

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

import torch
from sentence_transformers import SentenceTransformer
from transformers import pipeline as transformers_pipeline

import contextlib
import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_colwidth', 160)



# Introduction

This guide provides a comprehensive overview of the **Who's Who in UK Research** data pipeline. The project aims to link person data from the UKRI Gateway to Research (GtR) and OpenAlex to identify matching researchers and track activity and impact at the individual level.

![DSIT Taxonomy System Architecture](handover_notebook_files/dsit-whos-who-diagram.svg){width=100% fig-align="center"}

### Key Components

The system is built using Kedro and consists of five core pipelines:

1.  **Gateway to Research (GtR) Data Collection**: Fetches project, funding, and researcher data from the GtR API.
2.  **OpenAlex Data Collection**: Enriches GtR data with publication, citation, and affiliation information from OpenAlex.
3.  **Author Disambiguation**: Matches GtR researchers to their OpenAlex profiles using a machine learning approach.
4.  **Basic Metrics Analysis**: Calculates fundamental bibliometric and career indicators.
5.  **Complex Metrics Analysis**: Computes research disruption and discipline diversity metrics.


### Current performance 

#### Overall coverage statistics
| Metric | Value |
|--------|--------|
| Total GtR persons | 140,245 |
| Matchable persons (at least one name candidate) | 126,304 |
| Matched persons | 85,444 |
| Overall coverage rate | 60.9% |
| Coverage of matchable persons | 67.6% |
| Coverage of active researchers (with projects) | 68.4% |

#### Coverage by grant category

| Category | Coverage Rate | Matches/Total |
|----------|--------------|---------------|
| Fellowship | 82.9% | 7,710/9,303 |
| Research and Innovation | 82.8% | 3,339/4,034 | 
| Intramural | 82.8% | 1,330/1,606 |
| Institute Project | 79.0% | 595/753 |
| Research Grant | 79.0% | 55,881/70,748 |
| Training Grant | 70.7% | 5,285/7,477 |
| Studentship | 56.3% | 33,829/60,074 |
| EU-Funded | 47.9% | 1,062/2,218 |
| Collaborative R&D | 13.7% | 1,088/7,921 |
| Feasibility Studies | 15.1% | 556/3,689 |
| Small Business Research Initiative | 16.9% | 136/805 |

---

This guide will walk through each component, demonstrating how to:

1.  Understand the data collection processes.
2.  Interpret the author disambiguation results.
3.  Generate and understand basic and complex metrics.
4.  Utilise the final combined dataset.


## Setup

We begin by setting up our Kedro environment. While Kedro pipelines are typically run through the command line interface (CLI), we can also interact with them in notebooks using the `kedro.ipython` extension. This is particularly useful for exploration and debugging.

When pipelines are run through the CLI, [OmegaConf](https://omegaconf.readthedocs.io/en/2.3_branch/) automatically loads [configuration](https://docs.kedro.org/en/stable/configuration/configuration_basics.html) from:

- `conf/base/catalog.yml`: Data catalog definitions
- `conf/base/parameters.yml`: Pipeline parameters  
- `conf/local/credentials.yml`: API keys and credentials
- `conf/base/settings.yml`: Project settings
- Registered hooks in `src/hooks.py` (none in this project)

In notebooks, we need to explicitly load these using the IPython magic commands below.
See the [Kedro documentation](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html) for more details.

In [2]:
# | output: false
# Load Kedro extensions
%load_ext kedro.ipython

# Reload the Kedro context
%reload_kedro

## 1. Gateway to Research (GtR) Data Collection (`data_collection_gtr`)

This pipeline retrieves core information about UKRI-funded research from the GtR API.

**Purpose**: To gather foundational data on projects, funding, involved personnel, organisations, and associated publications as recorded in GtR. This forms the basis for identifying researchers and their activities.

**Key Data Points & Endpoints Used**:

-   **`projects`**: Retrieves details about funded research projects, including titles, abstracts, start/end dates, funding amounts, and grant categories. Essential for understanding the research context.
-   **`persons`**: Fetches information about individuals involved in projects (Principal Investigators, Co-Investigators, etc.), including their names and unique GtR identifiers. Crucial for the subsequent author disambiguation step.
-   **`publications`**: Collects metadata about publications linked to GtR projects, primarily DOIs. These are used to find corresponding records in OpenAlex.
-   **`organisations`**: Gathers details about the institutions involved in funded research, which aids in affiliation matching during disambiguation.

**Process**: The pipeline systematically interacts with these GtR API endpoints, handling pagination to retrieve all relevant records. It incorporates robust error handling, retry mechanisms for transient network issues, and adheres to API rate limits.

### Direct API Example

Below is a simple example demonstrating a direct request to the GtR API (using the `requests` library) to fetch data for a specific project. This illustrates the raw interaction that the Kedro pipeline nodes encapsulate.

In [3]:
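# Request a single project record; the Accept header selects GtR's JSON format
response = requests.get(
    "https://gtr.ukri.org/gtr/api/projects/FFDDCE32-AA62-447F-9F40-2D285C5CB209",
    headers={"Accept": "application/vnd.rcuk.gtr.json-v7"},
    timeout=30,
)
response.raise_for_status()  # stop early on HTTP errors
data = response.json()

# Keep a few fields and truncate the abstract for display
result = {
    "id": data["id"],
    "title": data["title"],
    "abstract": data["abstractText"][:250],
}
print(json.dumps(result, indent=2))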

{
  "id": "FFDDCE32-AA62-447F-9F40-2D285C5CB209",
  "title": "Spinal inhibitory interneurons that suppress itch",
  "abstract": "Chronic itch is a distressing feature of many diseases, including conditions affecting the skin, kidneys and blood, as well as certain forms of cancer. It is also a side-effect of certain drugs, such as morphine. Although some types of itch respond t"
}


### Pipeline Output Example
In practice, we run the pipelines through the command line interface. For example, collecting all `publications` data available through the GtR API requires the following command:

```bash
kedro run --pipeline data_collection_gtr --tags publications
```

This runs two nodes: the first fetches the data in partitions and saves each partition, and the second concatenates the partitioned dataset into a single Parquet file.
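
As a rough sketch of the concatenation step, assume the partitions are exposed the way Kedro's `PartitionedDataset` loads them: a dictionary mapping partition names to load callables. The function name `concatenate_partitions` and the toy partitions below are illustrative and not taken from the project.

```python
from typing import Callable, Dict

import pandas as pd


def concatenate_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> pd.DataFrame:
    """Load each partition lazily and stack them into a single DataFrame."""
    # Sort by partition name so the output order is deterministic.
    frames = [load() for _, load in sorted(partitions.items())]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Toy partitions standing in for the saved API pages (made-up values).
toy_partitions = {
    "page_0002": lambda: pd.DataFrame({"person_id": ["C"], "surname": ["Gamma"]}),
    "page_0001": lambda: pd.DataFrame(
        {"person_id": ["A", "B"], "surname": ["Alpha", "Beta"]}
    ),
}

combined = concatenate_partitions(toy_partitions)
print(combined)
```

Loading partitions one at a time keeps memory use bounded during the fetch; only the final concatenation needs the full table in memory.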

After the pipeline runs, the collected and processed data is stored in the Kedro catalog. Let's load an example of the processed `persons` data:

In [4]:
# | output: false
persons = catalog.load("gtr.data_collection.persons.intermediate")

In [5]:
# | code-fold: true
print(f"Shape: {persons.shape}")
print("Columns:", persons.columns.tolist())
print("Example data:")
display(persons.head(2))

Shape: (140245, 8)
Columns: ['person_id', 'orcid_id', 'first_name', 'surname', 'other_names', 'email', 'projects', 'organisations']
Example data:


Unnamed: 0,person_id,orcid_id,first_name,surname,other_names,email,projects,organisations
0,153B0438-9136-480C-B0E1-908A9DDE2AF1,,Anne,Osbourn,,,"[{'id': '74DF8C7B-CA58-4961-9EFF-00FBA00803D6', 'role': 'PI_PER'}, {'id': 'DDEE52BB-6B2B-4A3E-9D67-0BF0C77EE0A9', 'role': 'PI_PER'}, {'id': 'BDC017FB-2D64-4...","[{'id': '8E38A887-3CD8-49FD-87E5-A1844E81B9CF', 'role': 'EMPLOYED'}]"
1,15522B00-543E-4C45-87C4-B27D78459184,,Nick,Plant,,,"[{'id': '4FAFE57B-4C27-4B45-9AEA-026F0DDF42DC', 'role': 'TGH_PER'}, {'id': '7AC1CF29-AA48-4716-9D21-34ADE4C075CC', 'role': 'PI_PER'}, {'id': '01C229C2-90F4-...","[{'id': 'CA799973-1F1B-4936-B99A-9970F567FE67', 'role': 'EMPLOYED'}]"


## 2. OpenAlex Data Collection (`data_collection_oa`)

This pipeline fetches and processes data from the OpenAlex API, using identifiers gathered from the GtR dataset (persons, publications) as starting points.

**Purpose**: To collect detailed information about authors, publications, and institutions from OpenAlex that correspond to or are potentially related to the entities identified in GtR. This data serves two main goals:

1.  Gathering potential **candidate matches** for GtR researchers in OpenAlex, forming the input for the `author_disambiguation` pipeline.
2.  **Enriching** the dataset with publication and institutional details for matching and basic metrics analysis.

**Sub-Pipelines Overview**:

This pipeline is structured into four main sub-pipelines, each targeting different entities or using different GtR identifiers:

1.  **ORCID Fetch (`fetch_orcid`)**: Uses ORCID iDs available for some GtR persons to directly query OpenAlex. This aims to find highly probable or exact matches and retrieve their corresponding OpenAlex author profiles.
2.  **Author Name Search (`fetch_author_names`)**: Takes the display names of *all* GtR persons (~140,000) and searches for potential matches in OpenAlex. This generates a large pool of candidate OpenAlex authors (resulting in ~5.7 million potential GtR-OA pairs) for the disambiguation model to evaluate.
3.  **Publication Fetch (`fetch_doi_publications`)**: Uses DOIs associated with GtR projects/publications to retrieve detailed publication records (metadata, authorships, citations, etc.) from OpenAlex.
4.  **Institution Fetch (`fetch_institutions`)**: Extracts OpenAlex institution IDs identified in the *author search results* and fetches detailed information about these institutions.

**Process**: The pipeline queries the OpenAlex API, often in batches that are processed in **parallel** using `joblib`. Within each request, filter values are joined with the `OR` operator (`|`), which reduces the number of calls by a factor of 50 (see [documentation details](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists#logical-expressions)).
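
The OR-joined batching can be sketched as below. The helper names `chunked` and `build_or_filters` are hypothetical, the DOIs are placeholders, and the batch size of 50 simply mirrors the factor quoted above.

```python
from typing import Iterator, List


def chunked(values: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start : start + size]


def build_or_filters(field: str, values: List[str], batch_size: int = 50) -> List[str]:
    """Build one OpenAlex filter string per batch, joining values with `|` (OR)."""
    return [f"{field}:{'|'.join(batch)}" for batch in chunked(values, batch_size)]


# 120 placeholder DOIs become 3 requests instead of 120.
dois = [f"10.1234/example.{i}" for i in range(120)]
filters = build_or_filters("doi", dois)
print(len(filters))
print(filters[-1][:60])
```

Each resulting string would be sent as the `filter` query parameter of a single request. Because the batches are independent of one another, they can be dispatched in parallel, which is where `joblib` comes in.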

### Direct API Example

Below is a simple example demonstrating a direct request to the OpenAlex API (using the `requests` library) to fetch candidates for a specific name: `Anouk L'Hermitte`. This illustrates the raw interaction that the Kedro pipeline nodes encapsulate.

In [6]:
# Search OpenAlex authors by display name, selecting only the fields we need
response = requests.get(
    "https://api.openalex.org/authors?filter=display_name.search:Anouk L'Hermitte",
    params={"select": "id,orcid,display_name,counts_by_year"},
    timeout=30,
)
response.raise_for_status()  # stop early on HTTP errors

data = response.json()
author = data["results"][0]  # first candidate returned for this name
result = {
    "id": author["id"],
    "orcid": author.get("orcid"),
    "display_name": author["display_name"],
    "counts_by_year": author["counts_by_year"][:2],
}
print(json.dumps(result, indent=2))

{
  "id": "https://openalex.org/A5004845442",
  "orcid": "https://orcid.org/0000-0001-7595-5387",
  "display_name": "Anouk L\u2019Hermitte",
  "counts_by_year": [
    {
      "year": 2025,
      "works_count": 0,
      "cited_by_count": 29
    },
    {
      "year": 2024,
      "works_count": 0,
      "cited_by_count": 99
    }
  ]
}


### Pipeline Output Example

This pipeline generates multiple datasets corresponding to the outputs of its sub-pipelines (e.g., authors found via ORCID, authors found via name search, publications, institutions). Let's load the ORCID-based data, which corresponds to the output of the following sub-pipeline:

```bash
kedro run --pipeline data_collection_oa --tags fetch_orcid
```

Note that the names of the catalog objects are defined in `conf/base/catalog.yml`. Instead of listing each dataset, the project uses dataset factories: a "generic" entry is defined once and OmegaConf is left to resolve the references. For example, the pattern `oa.data_collection.{filter}.intermediate` covers the pipeline outputs `oa.data_collection.orcid.intermediate` and `oa.data_collection.publications.intermediate`, without either being defined explicitly in the YAML file.
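
To make the idea of a factory pattern concrete, the snippet below shows how a placeholder pattern can be matched against concrete dataset names. It is a standalone illustration using regular expressions, not Kedro's own resolution code, and `match_factory_pattern` is a hypothetical helper.

```python
import re
from typing import Dict, Optional


def match_factory_pattern(pattern: str, dataset_name: str) -> Optional[Dict[str, str]]:
    """Return the placeholder values if `dataset_name` fits `pattern`, else None."""
    # Split on {placeholder} tokens: odd positions hold placeholder names,
    # even positions hold literal text that must match exactly.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, dataset_name)
    return found.groupdict() if found else None


pattern = "oa.data_collection.{filter}.intermediate"
print(match_factory_pattern(pattern, "oa.data_collection.orcid.intermediate"))
print(match_factory_pattern(pattern, "gtr.data_collection.persons.intermediate"))
```

The captured placeholder value (here `filter`) is what a generic catalog entry can reuse, for example inside a file path.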

In [7]:
# | output: false
orcid_authors = catalog.load("oa.data_collection.orcid.intermediate")

In [8]:
# | code-fold: true
print(f"Shape: {orcid_authors.shape}")
print("Columns:", orcid_authors.columns.tolist())
print("Example data:")
display(orcid_authors.head(2))

Shape: (16847, 12)
Columns: ['id', 'orcid', 'display_name', 'display_name_alternatives', 'works_count', 'cited_by_count', 'h_index', 'i10_index', 'affiliations', 'last_known_institutions', 'topics', 'counts_by_year']
Example data:


Unnamed: 0,id,orcid,display_name,display_name_alternatives,works_count,cited_by_count,h_index,i10_index,affiliations,last_known_institutions,topics,counts_by_year
0,A5079633934,0000-0003-1376-8409,Adrian L. Harris,"[A. L Harris, A. L. Harris, Adrian. Harris, Austin Harris, AdrianL. Harris, Adrian L. Harris, AL. Harris, Al Harris, Alex L. Harris, and Adrian L. Harris, A...",2061,149619,186,1096,"[[I4210110594, MRC Weatherall Institute of Molecular Medicine, GB, facility, 2025,2024,2023,2022,2021,2020,2019,2018,2017,2016], [I40120149, University of O...","[[I4210110594, MRC Weatherall Institute of Molecular Medicine, GB, facility], [I2801316944, Cancer Research UK, GB, nonprofit], [I40120149, University of Ox...","[[T10631, Cancer, Hypoxia, and Metabolism, 627, subfields/1306, Cancer Research, fields/13, Biochemistry, Genetics and Molecular Biology, domains/1, Life Sc...","[[2025, 4, 1802], [2024, 19, 11200], [2023, 312, 12997], [2022, 17, 13197], [2021, 28, 15117], [2020, 33, 13976], [2019, 42, 12614], [2018, 35, 12295], [201..."
1,A5019685830,0000-0002-0784-8640,Andrew P. Monkman,"[A. Monkman, Andy Monkman, Andrew T. Monkman, A.P Monkman, A. P. Monkman, Andrew P. Monkman, Andy P. Monkman, Andrew Monkman]",798,27353,82,394,"[[I190082696, Durham University, GB, funder, 2025,2024,2023,2022,2021,2020,2019,2018,2017,2016], [I169381384, University of Groningen, NL, funder, 2023,2003...","[[I190082696, Durham University, GB, funder]]","[[T10611, Organic Light-Emitting Diodes Research, 308, subfields/2208, Electrical and Electronic Engineering, fields/22, Engineering, domains/3, Physical Sc...","[[2025, 2, 278], [2024, 19, 2624], [2023, 58, 2740], [2022, 42, 2860], [2021, 37, 3075], [2020, 29, 2787], [2019, 50, 2807], [2018, 36, 2682], [2017, 23, 20..."


## 3. Author Disambiguation (`author_disambiguation`)

This pipeline links researcher records from GtR to their corresponding author profiles in OpenAlex using a machine learning approach.

**Purpose**: To accurately identify and match individuals across the two datasets, resolving potential ambiguities based on names, affiliations, and research activities. This is necessary for attributing publications and metrics collected from OpenAlex back to the correct GtR person.

**Pipeline Overview**:

1.  **Data Aggregation & Preprocessing**: Consolidates GtR person information (details, affiliations, projects, topics, publications) and processes OpenAlex candidate profiles (affiliations, metrics, name variants).
2.  **Candidate Generation**: Creates potential GtR-OA author pairs (~5.7 million) based initially on name similarity. Uses ORCID links where available to create a validated training set.
3.  **Feature Engineering**: Computes a wide range of features for each candidate pair to quantify the likelihood of a match across multiple dimensions.
4.  **Model Training**: Trains a classifier (e.g., LightGBM) to predict match probability, using strategies like SMOTE or class weighting to handle the severe class imbalance (~2.4% positive class).
5.  **Prediction & Thresholding**: Applies the trained model to all candidate pairs and uses a selected probability threshold (0.80 in production) to determine the final matches, selecting the highest confidence match per GtR author.
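
The final thresholding step can be sketched with pandas as follows. The column names match the matched-authors table shown later in this guide, but the function `select_best_matches`, the toy data, and the choice of `>=` at the threshold are assumptions made for illustration.

```python
import pandas as pd

THRESHOLD = 0.80  # production threshold quoted above


def select_best_matches(pairs: pd.DataFrame, threshold: float = THRESHOLD) -> pd.DataFrame:
    """Keep pairs at or above the threshold, then the top-scoring pair per GtR person."""
    confident = pairs[pairs["match_probability"] >= threshold]
    # idxmax returns the row label of the highest probability within each group.
    best_rows = confident.groupby("gtr_id")["match_probability"].idxmax()
    return confident.loc[best_rows].reset_index(drop=True)


# Toy candidate pairs with made-up identifiers and scores.
candidate_pairs = pd.DataFrame(
    {
        "gtr_id": ["G1", "G1", "G2", "G3"],
        "oa_id": ["A10", "A11", "A20", "A30"],
        "match_probability": [0.95, 0.85, 0.40, 0.81],
    }
)
matches = select_best_matches(candidate_pairs)
print(matches)
```

In this toy example `G2` is dropped because its only candidate falls below the threshold, which is how GtR persons end up unmatched even when name candidates exist.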


**Feature Engineering Details**:

The pipeline computes five categories of features for each GtR-OA author pair:

**1. Name Similarity Features**

| Feature Type                  | Specific Features                                                                                                 |
| :---------------------------- | :---------------------------------------------------------------------------------------------------------------- |
| **Direct Name Comparisons**   | `display_lev` (Levenshtein), `display_jw` (Jaro-Winkler), `display_token` (Token Set Ratio)                        |
| **Name Component Matches**    | `surname_match` (Binary), `first_initial_match` (Binary), `full_first_match` (Binary)                             |
| **Alternative Name Comparisons** | `alt_lev_mean`/`max`, `alt_jw_mean`/`max`, `alt_token_mean`/`max` (Applied to OpenAlex alternative names)         |
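
As an illustration of the name features, here is a dependency-free sketch of a normalised Levenshtein similarity and a binary surname flag. The normalisation (one minus distance over the longer length), the lower-casing, and the "last token" surname rule are assumptions; the pipeline may rely on a string-matching library with different conventions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1], where 1.0 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def surname_match(gtr_surname: str, oa_display_name: str) -> int:
    """Binary flag: 1 if the GtR surname is the last token of the OpenAlex name."""
    tokens = oa_display_name.lower().split()
    return int(bool(tokens) and tokens[-1] == gtr_surname.lower())


print(round(levenshtein_similarity("Andrew Counter", "Andrew J. Counter"), 3))
print(surname_match("Counter", "Andrew J. Counter"))
```

The `alt_*_mean` and `alt_*_max` features would then be the mean and maximum of such a score taken over every entry in an author's `display_name_alternatives`.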




**2. Institution Features**

| Feature Type                     | Specific Features                                                                                                  |
| :------------------------------- | :----------------------------------------------------------------------------------------------------------------- |
| **Direct Institution Comparison** | `inst_jw_max`, `inst_token_max` (Max similarity between GtR orgs and OA affiliations)                            |
| **Associated Institution Metrics** | `inst_child_jw_max`, `inst_child_token_max` (Max similarity including OA associated/parent/child institutions) |
| **GB Affiliation Indicators**     | `gb_affiliation_proportion`, `has_gb_affiliation` (Binary), `has_gb_associated` (Binary)                           |



**3. Topic Similarity Features**

Computed at four OpenAlex concept levels (domain, field, subfield, topic):

| Feature Type     | Specific Features                                                                                                     |
| :--------------- | :-------------------------------------------------------------------------------------------------------------------- |
| **Overlap Metrics** | `{level}_jaccard`, `{level}_cosine`, `{level}_js_div` (Jensen-Shannon), `{level}_containment` (Calculated for each level) |
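
The four overlap metrics can be written in a few lines of standard-library Python. This is a sketch under stated assumptions: topics are represented as label counts, containment is measured relative to the first argument, Jensen-Shannon divergence uses log base 2, and an empty profile is treated as maximally different. None of these choices is confirmed by the pipeline code.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Share of `a` that also appears in `b` (asymmetric)."""
    return len(a & b) / len(a) if a else 0.0


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def js_divergence(a: Counter, b: Counter) -> float:
    """Jensen-Shannon divergence (base 2) between normalised count vectors."""
    keys = a.keys() | b.keys()
    total_a, total_b = sum(a.values()), sum(b.values())
    if not total_a or not total_b:
        return 1.0  # assumption: a missing profile is maximally different
    p = {k: a[k] / total_a for k in keys}
    q = {k: b[k] / total_b for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}

    def kl(x, y):
        return sum(x[k] * math.log2(x[k] / y[k]) for k in keys if x[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy field-level profiles (made-up counts).
gtr_fields = Counter({"Engineering": 3, "Physics": 1})
oa_fields = Counter({"Engineering": 2, "Chemistry": 2})

print(jaccard(set(gtr_fields), set(oa_fields)))
print(containment(set(gtr_fields), set(oa_fields)))
print(round(cosine(gtr_fields, oa_fields), 3))
print(round(js_divergence(gtr_fields, oa_fields), 3))
```

Running the same functions once per level (domain, field, subfield, topic) would yield the `{level}_*` feature columns.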



**4. Publication Features**

| Feature Type        | Specific Features                                                                       |
| :------------------ | :-------------------------------------------------------------------------------------- |
| **Coverage Metrics** | `publication_coverage`, `author_proportion` (Relating GtR project pubs to OA author) |



**5. Author Metadata Features**

| Feature Type          | Specific Features                                                                                              |
| :-------------------- | :------------------------------------------------------------------------------------------------------------- |
| **Publication Metrics** | `works_count`, `cited_by_count`, `h_index`, `i10_index`, `citations_per_work` (From OpenAlex author profile) |

**Model Performance & Production Choice**:

The pipeline was tested with both SMOTE and Class Weighting strategies to handle imbalance. The Class Weights model with a threshold of **0.80** was selected for production.

*   **Test Set Performance (Class Weights, Threshold 0.80)**:
    *   Accuracy: 0.996
    *   Precision: 0.920
    *   Recall: 0.937
    *   F1 Score: 0.928

*   **Confusion Matrix (Test Set)**:

    | True \ Predicted | Negative | Positive |
    | :---------------- | --------:| --------:|
    | **Negative**      | 58,217   | 118      |
    | **Positive**      | 92       | 1,357    |

*   **Feature Importance**: Name similarity (`alt_lev_max`) and institution matching (`inst_token_max`) were found to be the most important features, followed by publication overlap metrics. Topic similarity and author metadata features had lower importance.
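
The reported test-set metrics can be re-derived from the confusion matrix above, which is a useful sanity check when the model is retrained. The variable names are illustrative.

```python
# Confusion-matrix counts reported above (test set, threshold 0.80).
tn, fp, fn, tp = 58_217, 118, 92, 1_357
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
positive_share = (tp + fn) / total  # share of true matches in the test set

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"positive class share={positive_share:.1%}")
```

The positive class share of roughly 2.4% also shows why accuracy alone is uninformative here: a model predicting "no match" for every pair would still score about 97.6%.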

**Production Results & Coverage**:

*   **Matches**: Applied to 5.7M candidate pairs, resulting in **85,444** matched unique GtR persons (using the 0.80 threshold).
*   **Overall Coverage**: **60.9%** of all GtR persons, or **67.6%** of those considered 'matchable' (having at least one candidate name match in OpenAlex).
*   **Coverage by Grant Category**: Coverage varies significantly by the primary grant category associated with the researcher in GtR.

    | Category                         | Coverage Rate | Matches/Total   |
    | :------------------------------- | ------------: | :-------------- |
    | Fellowship                       | 82.9%         | 7,710/9,303     |
    | Research and Innovation          | 82.8%         | 3,339/4,034     |
    | Intramural                       | 82.8%         | 1,330/1,606     |
    | Institute Project                | 79.0%         | 595/753         |
    | Research Grant                   | 79.0%         | 55,881/70,748   |
    | Training Grant                   | 70.7%         | 5,285/7,477     |
    | Studentship                      | 56.3%         | 33,829/60,074   |
    | EU-Funded                        | 47.9%         | 1,062/2,218     |
    | Collaborative R&D                | 13.7%         | 1,088/7,921     |
    | Feasibility Studies              | 15.1%         | 556/3,689       |
    | Small Business Research Initiative | 16.9%         | 136/805         |

    *(Note: Lower coverage for industry-focused grants is expected)*
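
For reference, the coverage rates are simple ratios of the counts quoted above. The snippet below recomputes the headline figures and a subset of the per-category rates; only the variable names are new.

```python
total_persons = 140_245
matchable_persons = 126_304
matched_persons = 85_444

overall = matched_persons / total_persons
of_matchable = matched_persons / matchable_persons
print(f"overall: {overall:.1%}, matchable: {of_matchable:.1%}")

# Per-category rates from (matches, total) pairs; a subset of the table above.
categories = {
    "Fellowship": (7_710, 9_303),
    "Studentship": (33_829, 60_074),
    "Collaborative R&D": (1_088, 7_921),
}
rates = {name: matches / total for name, (matches, total) in categories.items()}
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.1%}")
```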

### Pipeline Output Example
Let's first load the ORCID feature data to understand how the pipeline defines the feature space.

In [9]:
# | output: false
orcid_features = catalog.load("ad.orcid_labelled_feature_matrix.intermediate")

In [26]:
# | code-fold: true
# Display basic info about the dataset
print("Dataset shape:", orcid_features.shape)
print("\nClass distribution:")
print(orcid_features["is_match"].value_counts())
print("\nClass distribution (normalized):")
print(orcid_features["is_match"].value_counts(normalize=True))

print("\nFirst row with all features:")
display(orcid_features.sample(1).iloc[0])

Dataset shape: (597833, 45)

Class distribution:
is_match
0    583344
1     14489
Name: count, dtype: int64

Class distribution (normalized):
is_match
0    0.975764
1    0.024236
Name: proportion, dtype: float64

First row with all features:



gtr_id                       EA1E1EB1-31ED-4CCF-893D-ACC0B98012B1
oa_id                                                 A5074272633
is_match                                                        0
display_lev                                              0.857143
display_jw                                               0.971429
display_token                                                 1.0
surname_match                                                   1
first_initial_match                                             1
full_first_match                                                1
alt_lev_mean                                             0.857143
alt_jw_mean                                              0.971429
alt_token_mean                                                1.0
alt_lev_max                                              0.857143
alt_jw_max 

The main output is a table linking GtR person IDs to the best-matching OpenAlex author ID, along with the match probability. Let's load the matched authors, which includes the model's `match_probability`.

In [11]:
# | output: false
matched_authors = catalog.load("ad.matched_authors.primary")

In [12]:
# | code-fold: true
display(matched_authors.head(2))

Unnamed: 0,gtr_id,oa_id,match_probability,id,orcid,display_name,display_name_alternatives,works_count,cited_by_count,h_index,i10_index,topics,gtr_author_name,name_match_score,institution_names,has_gb_affiliation,gb_affiliation_proportion,associated_institution_names,has_gb_associated
0,0014394C-5B8D-4475-A74A-41C2272F8061,A5008934318,0.999874,A5008934318,0000-0001-7695-9208,Hazel Conley,"[Hazel. Conley, Hazel Margot Conley, H. Conley, Hazel M. Conley]",57,650,16,24,"[[T11421, Labor Movements and Unions, 25, subfields/3321, Public Administration, fields/33, Social Sciences, domains/2, Social Sciences], [T10443, Social Po...",Hazel Conley,1.0,"[University of the West of England, Queen Mary University of London, London School of Business and Management, Cardiff University]",True,1.0,"[Hartpury University, Frenchay Hospital, Royal United Hospital, William Harvey Research Institute, University of London, Broomfield Hospital, Moorfields Eye...",True
1,0021BF58-F3F1-43F3-8C81-470480622745,A5054332135,0.998462,A5054332135,,Andrew J. Counter,"[Andrew J. Counter, Andrew Counter, A. Counter, A. J. Counter]",69,118,4,2,"[[T13641, Historical Studies and Socio-cultural Analysis, 21, subfields/1207, History and Philosophy of Science, fields/12, Arts and Humanities, domains/2, ...",Andrew Counter,0.962963,"[New College, University of Oxford, King's College London, Canisius College, St. John's College of Nursing, Christ University]",True,0.5,"[CRUK/MRC Oxford Institute for Radiation Oncology, Cancer Research UK Oxford Centre, Centre for Human Genetics, Centre for the Observation and Modelling of ...",True


Detailed predictive metrics and coverage analysis are available in a JSON format through the catalog object `ad.coverage_analysis.tmp`. This includes model performance statistics and breakdowns of matching coverage across different project types.


## 4. Basic Metrics Analysis (`data_analysis_basic_metrics`)

This pipeline calculates standard bibliometric and career-related indicators for the matched researchers, combining data from OpenAlex and GtR.

**Purpose**: To provide fundamental quantitative measures of research output, impact, collaboration patterns, and career progression, often comparing metrics before and after the researcher received their first GtR grant.

**Pipeline Overview**:

The pipeline operates in several stages:

1.  **Data Integration and Preprocessing**: Processes OpenAlex author metadata, GtR person/project data, and publication data for the matched authors.
2.  **Metric Computation**: Orchestrated by the `compute_basic_metrics` function, this stage calculates various metrics, including:
    *   Author metadata processing (first publication year, citation metrics)
    *   Academic age calculation
    *   Grant processing (aggregation of categories, funders, timelines)
    *   International metrics (affiliation analysis, time abroad)
    *   Collaboration metrics (analysis before/after first grant, collaborator counts)

**Key Metrics Calculated**:

The pipeline generates a wide range of metrics, detailed in the tables below. Many metrics are computed separately for the periods *before* and *after* the researcher's first recorded GtR grant start date.

**Identifiers & Basic Info**

| Variable                  | Description                     | Type         |
| :------------------------ | :------------------------------ | :----------- |
| `oa_id`                   | OpenAlex unique identifier      | string       |
| `orcid`                   | ORCID identifier                | string       |
| `display_name`            | Full name from OpenAlex         | string       |
| `display_name_alternatives` | Alternative names               | list[string] |
| `first_name`              | First name from GtR             | string       |
| `surname`                 | Surname from GtR                | string       |
| `gtr_person_id`           | GtR person ID                   | string       |
| `match_probability`       | Probability of correct matching | float        |


**Current Institutional Information**

| Variable                      | Description                               | Type         |
| :---------------------------- | :---------------------------------------- | :----------- |
| `gtr_organisation`          | Current GtR organisation ID             | string       |
| `gtr_organisation_name`     | Current GtR organisation name           | string       |
| `last_known_institutions`   | List of last known institutions         | list[list]   |
| `last_known_institution_uk` | Whether last known institution is in UK | boolean      |


**Academic Profile & Metrics**

| Variable                      | Description                                              | Type       |
| :---------------------------- | :------------------------------------------------------- | :--------- |
| `works_count`                 | Total number of works                                    | integer    |
| `cited_by_count`              | Total citations                                          | integer    |
| `citations_per_publication` | Average citations per publication                      | float      |
| `h_index`                     | H-index                                                  | integer    |
| `i10_index`                   | i10-index                                                | integer    |
| `first_work_year`             | First publication year                                   | integer    |
| `academic_age_at_first_grant` | Academic age when receiving first grant                  | integer    |
| `topics`                      | Research topics                                          | list[list] |
| `affiliations`                | Historical affiliations                                  | list[list] |
| `counts_by_year`              | Pub counts/citations by *citation year* (OA, 2012 onwards) | list[list] |
| `counts_by_pubyear`           | Pub counts/citations by *publication year* (OA, full) | list[list] |


**Note on Citation Counting Methods**

The pipeline uses two different approaches to count citations, reflected in `counts_by_year` and `counts_by_pubyear`:

1.  **Publication Year Based (`counts_by_pubyear`)**: Citations are attributed to the year the cited paper was published. This method provides complete time coverage and is primarily used for the before/after grant comparisons in this pipeline.
2.  **Citation Year Based (`counts_by_year`)**: Citations are attributed to the year the citation occurred. This gives a more accurate view of *when* impact happened, but OpenAlex only provided these counts as far back as 2012 at the time of pipeline creation, so the metric is of limited use for before/after comparisons.

Both metrics are retained in the final dataset where available.
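
A before/after split on publication-year counts could look like the sketch below. It assumes each `counts_by_pubyear` row has the form `[year, n_pubs, n_citations]`, that the grant year itself counts as "after", and that academic age is the gap between the first grant year and the first publication year. The function name and toy values are illustrative.

```python
from typing import Dict, List


def split_before_after(
    counts_by_pubyear: List[List[int]], first_grant_year: int
) -> Dict[str, float]:
    """Aggregate [year, n_pubs, n_citations] rows before and after the first grant."""
    summary: Dict[str, float] = {}
    for label, keep in (
        ("before", lambda year: year < first_grant_year),
        ("after", lambda year: year >= first_grant_year),
    ):
        rows = [row for row in counts_by_pubyear if keep(row[0])]
        n_pubs = sum(row[1] for row in rows)
        citations = sum(row[2] for row in rows)
        summary[f"n_pubs_{label}"] = n_pubs
        summary[f"total_citations_pubyear_{label}"] = citations
        summary[f"citations_pp_pubyear_{label}"] = citations / n_pubs if n_pubs else 0.0
    return summary


# Toy author: made-up yearly counts and a first grant starting in 2014.
toy_counts = [[2010, 2, 30], [2012, 1, 10], [2015, 4, 20], [2018, 3, 6]]
first_grant_year = 2014

result = split_before_after(toy_counts, first_grant_year)
first_work_year = min(row[0] for row in toy_counts)
academic_age_at_first_grant = first_grant_year - first_work_year

print(result)
print(academic_age_at_first_grant)
```

Because recent papers have had less time to accumulate citations, raw "after" values tend to be lower per publication; field-weighted indicators such as `mean_fwci_before` and `mean_fwci_after` are less sensitive to this.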


**Grant Information**

| Variable                     | Description                       | Type         |
| :--------------------------- | :-------------------------------- | :----------- |
| `earliest_start_date`        | First grant start date            | date         |
| `latest_end_date`          | Last grant end date               | date         |
| `has_active_project`         | Whether has active projects       | boolean      |
| `number_grants`            | Total number of grants            | integer      |
| `has_multiple_funders`     | Whether has multiple funders      | boolean      |
| `grant_categories`         | List of grant categories          | list[list]   |
| `lead_funders`             | List of lead funders              | list[list]   |
| `gtr_project_timeline`       | Detailed project timeline         | list[list]   |
| `gtr_project_id`           | GtR project IDs                   | list[string] |
| `gtr_project_publications` | Project-linked publications       | list[string] |
| `gtr_project_topics`       | Project-specific topics           | list[list]   |
| `gtr_project_oa_authors`   | Project OpenAlex authors          | list[string] |
| `gtr_project_oa_ids`       | Project OpenAlex IDs              | list[string] |


**Publication Metrics (Before/After First Grant)**

(Based primarily on `counts_by_pubyear`)

| Variable                        | Description                              | Type    |
| :------------------------------ | :--------------------------------------- | :------ |
| `n_pubs_before`                 | Number of publications before            | integer |
| `n_pubs_after`                  | Number of publications after             | integer |
| `total_citations_pubyear_before` | Total citations (by pub year) before     | integer |
| `total_citations_pubyear_after`  | Total citations (by pub year) after      | integer |
| `mean_citations_pubyear_before`  | Mean citations (by pub year) before      | float   |
| `mean_citations_pubyear_after`   | Mean citations (by pub year) after       | float   |
| `citations_pp_pubyear_before`   | Citations per pub (by pub year) before   | float   |
| `citations_pp_pubyear_after`    | Citations per pub (by pub year) after    | float   |
| `mean_citations_before`         | Mean citations before                    | float   |
| `mean_citations_after`          | Mean citations after                     | float   |
| `citations_pp_before`           | Citations per pub before                 | float   |
| `citations_pp_after`            | Citations per pub after                  | float   |
| `mean_fwci_before`              | Mean Field-Weighted Citation Impact before | float   |
| `mean_fwci_after`               | Mean Field-Weighted Citation Impact after  | float   |
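
As a rough illustration of how the before/after publication variables could be derived, the sketch below splits one author's publications at the first grant year and summarises each period. The input frame, its column names, and the choice to count the grant year itself as "after" are assumptions made for this example, not the pipeline's confirmed behaviour.

```python
import pandas as pd

# Hypothetical per-publication records for one author (illustrative values only).
pubs = pd.DataFrame({
    "publication_year": [2010, 2012, 2015, 2018],
    "cited_by_count": [10, 4, 6, 0],
})
first_grant_year = 2014  # assumed year of the author's first GtR grant


def before_after_summary(pubs: pd.DataFrame, grant_year: int) -> dict:
    """Split publications at the grant year and summarise each period."""
    periods = {
        "before": pubs[pubs["publication_year"] < grant_year],
        # Assumption: the grant year itself counts as "after".
        "after": pubs[pubs["publication_year"] >= grant_year],
    }
    out = {}
    for label, part in periods.items():
        n_pubs = len(part)
        total = int(part["cited_by_count"].sum())
        out[f"n_pubs_{label}"] = n_pubs
        out[f"total_citations_pubyear_{label}"] = total
        # Citations per publication are undefined when the period has no papers.
        out[f"citations_pp_pubyear_{label}"] = total / n_pubs if n_pubs else float("nan")
    return out


summary = before_after_summary(pubs, first_grant_year)
print(summary)
```

Returning NaN for an empty period mirrors the missing values visible in the example output for authors with no publications before their first grant.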


**International Experience Metrics (Before/After First Grant)**

| Variable                   | Description                                | Type         |
| :------------------------- | :----------------------------------------- | :----------- |
| `abroad_experience_before` | Had international experience before        | boolean      |
| `abroad_experience_after`  | Had international experience after         | boolean      |
| `countries_before`         | Countries worked in before                 | list[string] |
| `countries_after`          | Countries worked in after                  | list[string] |
| `abroad_fraction_before`   | Fraction of time affiliated abroad before  | float        |
| `abroad_fraction_after`    | Fraction of time affiliated abroad after   | float        |
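
The international experience variables can be pictured with a small sketch. It assumes affiliation data reduced to one `(year, country_code)` record per affiliation-year and treats `"GB"` as the home country; both the input shape and the function name `abroad_metrics` are hypothetical.

```python
# Hypothetical affiliation records: one (year, ISO country code) pair per affiliation-year.
affiliations = [
    (2008, "GB"), (2009, "US"), (2010, "US"), (2011, "GB"),
    (2015, "GB"), (2016, "DE"),
]


def abroad_metrics(records, grant_year, home="GB"):
    """Summarise international experience before and after the grant year."""
    result = {}
    periods = {
        "before": [(y, c) for y, c in records if y < grant_year],
        "after": [(y, c) for y, c in records if y >= grant_year],
    }
    for label, period in periods.items():
        abroad = [c for _, c in period if c != home]
        # With no affiliation records the metrics stay missing rather than False/0.
        result[f"abroad_experience_{label}"] = bool(abroad) if period else None
        result[f"countries_{label}"] = sorted({c for _, c in period})
        result[f"abroad_fraction_{label}"] = len(abroad) / len(period) if period else None
    return result


metrics = abroad_metrics(affiliations, grant_year=2012)
print(metrics)
```

Leaving the values missing when a period has no records matches rows in the example output where all four international columns are empty.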


**Collaboration Metrics (Before/After First Grant)**

| Variable                         | Description                                 | Type         |
| :------------------------------- | :------------------------------------------ | :----------- |
| `collab_countries_before`        | Collaboration countries with counts before  | list[list]   |
| `collab_countries_after`         | Collaboration countries with counts after   | list[list]   |
| `collab_countries_list_before` | List of collaboration countries before      | list[string] |
| `collab_countries_list_after`  | List of collaboration countries after       | list[string] |
| `unique_collabs_before`        | Unique collaborators before                 | integer      |
| `unique_collabs_after`         | Unique collaborators after                  | integer      |
| `total_collabs_before`         | Total collaborations before                 | integer      |
| `total_collabs_after`          | Total collaborations after                  | integer      |
| `foreign_collab_fraction_before` | Fraction of foreign collaborations before | float        |
| `foreign_collab_fraction_after`  | Fraction of foreign collaborations after  | float        |
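
A minimal sketch of how the collaboration variables relate to each other, for a single period. It assumes one `(collaborator_id, country_code)` record per co-authorship and `"GB"` as the home country; the identifiers are invented for illustration.

```python
from collections import Counter

# Hypothetical co-authorship records for one period:
# one (collaborator id, collaborator country) pair per co-authored paper.
collabs = [("A1", "GB"), ("A2", "US"), ("A2", "US"), ("A3", "FR"), ("A1", "GB")]
HOME = "GB"  # assumed home country

country_counts = Counter(country for _, country in collabs)

# Countries with counts (list[list]) and the plain country list (list[string]).
collab_countries = [[country, n] for country, n in country_counts.items()]
collab_countries_list = sorted(country_counts)

unique_collabs = len({author for author, _ in collabs})  # distinct collaborators
total_collabs = len(collabs)                             # all co-authorship records
foreign_collab_fraction = (
    sum(1 for _, country in collabs if country != HOME) / total_collabs
)

print(collab_countries, unique_collabs, total_collabs, foreign_collab_fraction)
```

Running the same calculation separately on the records before and after the first grant would give the `_before` and `_after` variants.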


### Pipeline Output Example
These metrics are typically added to the main author table. We can display some examples from the `basic_metrics` dataset, which is loaded in the next cell.

In [13]:
# | output: false
basic_metrics = catalog.load("analysis.basic_metrics.primary")

In [14]:
# | code-fold: true
print("\nExample Basic Metrics:")

# Show citation metrics
citation_cols = [
    "oa_id",
    "display_name", 
    "works_count",
    "cited_by_count",
    "citations_per_publication",
    "h_index",
    "i10_index",
]
display(basic_metrics[citation_cols].sample(3, random_state=42))

# Show before/after metrics
print("\nBefore/After Grant Metrics:")
before_after_cols = [
    "oa_id",
    "display_name",
    "n_pubs_before", 
    "n_pubs_after",
    "mean_citations_pubyear_before",
    "mean_citations_pubyear_after",
]
display(basic_metrics[before_after_cols].sample(3, random_state=43))

# Show international metrics
print("\nInternational Experience Metrics:")
international_cols = [
    "oa_id",
    "display_name",
    "abroad_experience_before",
    "abroad_experience_after", 
    "abroad_fraction_before",
    "abroad_fraction_after",
]
display(basic_metrics[international_cols].sample(3, random_state=44))


Example Basic Metrics:


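| Unnamed: 0 | oa_id | display_name | works_count | cited_by_count | citations_per_publication | h_index | i10_index |
| :--------- | :---- | :----------- | :---------- | :------------- | :------------------------ | :------ | :-------- |
| 64299 | A5043849805 | Andrew Jahoda | 230 | 4506 | 19.591304 | 36 | 98 |
| 41046 | A5081261971 | David R. Garrod | 241 | 13096 | 54.340249 | 64 | 163 |
| 73710 | A5020989903 | Emma V. King | 186 | 3260 | 17.526882 | 25 | 36 |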



Before/After Grant Metrics:


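| Unnamed: 0 | oa_id | display_name | n_pubs_before | n_pubs_after | mean_citations_pubyear_before | mean_citations_pubyear_after |
| :--------- | :---- | :----------- | :------------ | :----------- | :---------------------------- | :--------------------------- |
| 82720 | A5105470222 | Ralph Burton | 7.0 | 28 | 2.0 | 8.4 |
| 34330 | A5101536117 | Daniel George | | 3 | | 1.333 |
| 61370 | A5083595167 | Simon H. Reed | 39.0 | 45 | 107.385 | 33.692 |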



International Experience Metrics:


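| Unnamed: 0 | oa_id | display_name | abroad_experience_before | abroad_experience_after | abroad_fraction_before | abroad_fraction_after |
| :--------- | :---- | :----------- | :----------------------- | :---------------------- | :--------------------- | :-------------------- |
| 63358 | A5065151012 | Neville Wylie | True | True | 0.062 | 0.111 |
| 79779 | A5035601896 | Emma Barrett | False | True | 0.0 | 0.474 |
| 70272 | A5018389548 | Charlotte Greene | | | | |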



## 5. Complex Metrics Analysis (`data_analysis_complex_metrics`)

This pipeline computes discipline diversity and research disruption indicators from publication and citation data. These indicators describe the nature and influence of a researcher's work: the degree to which it disrupts or consolidates existing knowledge domains, and the diversity of the research topics it covers.

**Pipeline Overview**:

The calculation involves several sequential sub-pipelines:

1.  **Sample Pipeline**: Performs stratified sampling of publications per author (up to 50, weighted by FWCI) to stay within OpenAlex daily rate limits. The 5.3M papers have on average 22 citations each, which implies some 71M unique papers for which references would need to be fetched. OpenAlex allows only up to 5M paper lookups per day (100,000 queries of 50 papers each; [see documentation](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication)).
2.  **Focal Collection Pipeline**: Fetches citation data for the sampled publications.
3.  **Reference Collection Pipeline**: Fetches metadata for the references *within* the sampled publications.
4.  **Disruption Index Pipeline**: Calculates the disruption index for sampled publications and aggregates to the author level.
5.  **Discipline Diversity Pipeline**: Computes topic embeddings (using SPECTER), aggregates topics per author, and calculates diversity components.
6.  **Metric Computation Pipeline**: Combines disruption and diversity metrics, computes before/after funding comparisons, and merges with basic metrics.
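
To make the sampling step and the rate-limit arithmetic concrete, here is a simplified sketch. It caps each author at 50 papers and draws the sample with probabilities proportional to FWCI; the column names (`author_id`, `work_id`, `fwci`), the small weight offset, and the function name are assumptions, and the real pipeline's stratification logic may differ.

```python
import pandas as pd

MAX_PER_AUTHOR = 50  # per-author cap stated in the text


def sample_author_pubs(pubs, max_per_author=MAX_PER_AUTHOR, seed=42):
    """Keep all papers for small authors; draw an FWCI-weighted sample otherwise."""
    parts = []
    for _, group in pubs.groupby("author_id"):
        if len(group) > max_per_author:
            # Assumption: a small offset keeps zero or missing FWCI papers eligible.
            weights = group["fwci"].fillna(0.0) + 1e-6
            group = group.sample(n=max_per_author, weights=weights, random_state=seed)
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Toy data: one prolific author (120 papers) and one author with 10 papers.
pubs = pd.DataFrame({
    "author_id": ["A1"] * 120 + ["A2"] * 10,
    "work_id": [f"W{i}" for i in range(130)],
    "fwci": [float(i % 7) for i in range(130)],
})
sampled = sample_author_pubs(pubs)
per_author = sampled.groupby("author_id").size().to_dict()
print(per_author)

# Daily budget from the text: 100,000 queries of 50 papers each.
daily_budget = 100_000 * 50
days_for_all_references = 71_000_000 / daily_budget
print(f"{days_for_all_references:.1f} days without sampling")
```

Using the figures quoted above, fetching all 71M papers at 5M per day would take a little over 14 days of full quota, which is the motivation for sampling first.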

**Methodological Approach**:

*   **Disruption Index**: Implements the [Wu & Yan (2019)](https://doi.org/10.48550/arXiv.1905.03461) variant, following recommendations from [Leibel & Bornmann (2023)](https://doi.org/10.48550/arXiv.2308.02383). For a focal paper *i*, this index measures how subsequent papers cite it relative to its references:
    $$ DI = \frac{n_f - n_b}{n_f + n_b} $$
    where $n_f$ is the count of papers citing *i* but not *i*'s references, and $n_b$ is the count of papers citing both *i* and *i*'s references. Values near +1 suggest disruption, while values near -1 suggest consolidation. Author-level scores are averaged, with an option to weight by Field-Weighted Citation Impact (FWCI).

*   **Discipline Diversity**: Follows [Leydesdorff et al. (2019)](https://doi.org/10.1016/j.joi.2018.12.006), combining three components:
    1.  **Variety**: Proportion of unique topics covered by the author.
    2.  **Evenness**: Balance of publication distribution across topics (using Kvålseth-Jost measure).
    3.  **Disparity**: Average cognitive distance between topic embeddings (using SPECTER model).
    The overall diversity is the product: Diversity = Variety × Evenness × Disparity.
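
The two indicators described above can be sketched in a few lines. The disruption function follows the formula as written in this section. For diversity, the sketch assumes the Kvålseth-Jost evenness of order 2, $(1/\sum_i p_i^2 - 1)/(n - 1)$, and cosine distance between topic embeddings for disparity; the order, the distance measure, the NaN handling for fewer than two topics, and all function names are assumptions rather than confirmed pipeline details.

```python
import numpy as np


def disruption_index(focal_citers: set, reference_citers: set) -> float:
    """DI = (n_f - n_b) / (n_f + n_b) for one focal paper."""
    n_f = len(focal_citers - reference_citers)  # cite the focal paper only
    n_b = len(focal_citers & reference_citers)  # cite focal paper and its references
    if n_f + n_b == 0:
        return float("nan")  # no citing papers, index undefined
    return (n_f - n_b) / (n_f + n_b)


def variety(n_author_topics: int, n_total_topics: int) -> float:
    """Share of all topics in which the author has published."""
    return n_author_topics / n_total_topics


def evenness_kj(counts) -> float:
    """Kvalseth-Jost evenness of order 2 (assumed order)."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    if n < 2:
        return float("nan")  # assumption: undefined for a single topic
    p = counts / counts.sum()
    hill_2 = 1.0 / np.sum(p ** 2)  # effective number of topics
    return float((hill_2 - 1.0) / (n - 1))


def disparity(embeddings) -> float:
    """Mean pairwise cosine distance between topic embeddings."""
    x = np.asarray(embeddings, dtype=float)
    n = len(x)
    if n < 2:
        return float("nan")
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    cosine = unit @ unit.T
    upper = np.triu_indices(n, k=1)  # each unordered pair once
    return float(np.mean(1.0 - cosine[upper]))


# Toy example: four papers cite the focal paper, one of them also cites its references.
di = disruption_index({"p1", "p2", "p3", "p4"}, {"p4", "p9"})

# Toy example: two equally used topics out of four, with orthogonal embeddings.
diversity = variety(2, 4) * evenness_kj([5, 5]) * disparity([[1.0, 0.0], [0.0, 1.0]])
print(di, diversity)
```

In the pipeline the embeddings come from SPECTER; the two-dimensional vectors here only stand in for them.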

**Key Metrics Calculated (Output Variables)**:

The pipeline produces author-level metrics aggregated for periods before and after their first GtR funding date, as well as annual time series data.

*   **Before/After Funding Metrics (Examples)**:
    *   `mean_disruption_before`/`mean_disruption_after`: Average unweighted disruption.
    *   `mean_weighted_disruption_before`/`mean_weighted_disruption_after`: Average FWCI-weighted disruption.
    *   `mean_variety_before`/`mean_variety_after`: Average topic variety.
    *   `mean_evenness_before`/`mean_evenness_after`: Average topic evenness.
    *   `mean_disparity_before`/`mean_disparity_after`: Average topic disparity.

*   **Annual Time Series Data**:
    *   Contains yearly values for disruption (mean, weighted mean) and diversity (variety, evenness, disparity).
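
As an illustration of the author-level aggregation, the sketch below computes the unweighted and FWCI-weighted mean disruption before and after the first grant year. The paper-level frame, its column names, and the treatment of the grant year as "after" are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical paper-level disruption scores for one author.
papers = pd.DataFrame({
    "year": [2010, 2012, 2016, 2018],
    "disruption": [0.5, -0.5, -1.0, 0.0],
    "fwci": [1.0, 3.0, 1.0, 1.0],
})
first_grant_year = 2014  # assumed first GtR funding year


def aggregate_disruption(papers: pd.DataFrame, grant_year: int) -> dict:
    """Unweighted and FWCI-weighted mean disruption per period."""
    out = {}
    masks = {
        "before": papers["year"] < grant_year,
        "after": papers["year"] >= grant_year,  # assumption: grant year counts as after
    }
    for label, mask in masks.items():
        part = papers[mask]
        if part.empty:
            out[f"mean_disruption_{label}"] = np.nan
            out[f"mean_weighted_disruption_{label}"] = np.nan
            continue
        out[f"mean_disruption_{label}"] = float(part["disruption"].mean())
        # np.average raises if the weights sum to zero; the toy data avoids that case.
        out[f"mean_weighted_disruption_{label}"] = float(
            np.average(part["disruption"], weights=part["fwci"])
        )
    return out


author_metrics = aggregate_disruption(papers, first_grant_year)
print(author_metrics)
```

In the toy data the highly cited consolidating paper pulls the weighted mean before the grant below the unweighted one, which is the kind of difference the two variants are meant to expose.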

### Pipeline Output Example

These aggregated metrics are typically added to the main author table. We can display some examples from the `final_metrics` dataset, which is loaded in the next cell.


In [15]:
# | output: false
final_metrics = catalog.load("analysis.final_metrics.primary")

In [None]:
# | code-fold: true
print("Example Complex Metrics (Before/After First Grant):")
complex_metrics_cols = [
    "oa_id",
    "display_name",
    "mean_disruption_before",
    "mean_disruption_after",
    "mean_weighted_disruption_before",
    "mean_weighted_disruption_after",
    "mean_variety_before",
    "mean_variety_after",
    "mean_evenness_before",
    "mean_evenness_after",
    "mean_disparity_before",
    "mean_disparity_after",
]
display(final_metrics[complex_metrics_cols].sample(10, random_state=42))

Example Complex Metrics (Before/After First Grant):


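| Unnamed: 0 | oa_id | display_name | mean_disruption_before | mean_disruption_after | mean_weighted_disruption_before | mean_weighted_disruption_after | mean_variety_before | mean_variety_after | mean_evenness_before | mean_evenness_after | mean_disparity_before | mean_disparity_after |
| :--------- | :---- | :----------- | :--------------------- | :-------------------- | :------------------------------ | :----------------------------- | :------------------ | :----------------- | :------------------- | :------------------ | :-------------------- | :------------------- |
| 64299 | A5043849805 | Andrew Jahoda | 0.148 | -0.305 | 0.227 | -0.235 | 0.053 | 0.072 | 0.406 | 0.303 | 0.553 | 0.564 |
| 41046 | A5081261971 | David R. Garrod | -0.282 | -0.551 | -0.315 | -0.545 | 0.057 | 0.043 | 0.338 | 0.435 | 0.601 | 0.615 |
| 73710 | A5020989903 | Emma V. King | -0.255 | -0.422 | -0.215 | -0.403 | 0.087 | 0.081 | 0.427 | 0.279 | 0.603 | 0.609 |
| 72234 | A5071118063 | Beth Jefferies | | -0.6 | | -1.0 | 0.012 | 0.011 | 1.0 | 0.468 | 0.552 | 0.554 |
| 82799 | A5067019894 | Luke Griffith | | | | | | | | | | |
| 66287 | A5046276492 | Santie de Villiers | -0.373 | -0.196 | -0.357 | -0.157 | 0.017 | 0.027 | 0.537 | 0.591 | 0.595 | 0.594 |
| 39065 | A5059921894 | Norman Morrison | -0.495 | -0.032 | -0.273 | 0.043 | 0.08 | 0.141 | 0.232 | 0.363 | 0.597 | 0.613 |
| 62927 | A5089882185 | Daniel T. Bowron | -0.401 | -0.644 | -0.396 | -0.665 | 0.093 | 0.103 | 0.437 | 0.513 | 0.577 | 0.582 |
| 38282 | A5063458205 | Martin C. Stennett | -0.265 | -0.581 | -0.331 | -0.539 | 0.045 | 0.061 | 0.211 | 0.182 | 0.577 | 0.599 |
| 82704 | A5016855567 | Marion M. Hetherington | -0.374 | -0.569 | -0.398 | -0.645 | 0.08 | 0.075 | 0.357 | 0.314 | 0.605 | 0.598 |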


In [31]:
# | code-fold: true
print("Example Complex Metrics TS data:")
complex_metrics_cols = [
    "oa_id",
    "display_name",
    "author_year_disruption",
    "author_year_diversity",
]
display(final_metrics[complex_metrics_cols].sample(10, random_state=42))

Example Complex Metrics TS data:


Unnamed: 0,oa_id,display_name,author_year_disruption,author_year_diversity
64299,A5043849805,Andrew Jahoda,"[[2002, 1.0, 1.0], [2004, 0.21, 0.35], [2005, -0.3, -0.3], [2006, -0.11, -0.03], [2007, 0.19, 0.19], [2008, -0.1, 0.15], [2009, -0.08, 0.56], [2010, -0.47, ...","[[2002, 0.048, 0.467, 0.547], [2004, 0.048, 0.438, 0.547], [2005, 0.048, 0.419, 0.547], [2006, 0.048, 0.401, 0.547], [2007, 0.052, 0.403, 0.546], [2008, 0.0..."
41046,A5081261971,David R. Garrod,"[[2002, -0.6, -0.6], [2003, 0.04, -0.07], [2004, -0.2, -0.39], [2005, -0.43, -0.42], [2006, -0.77, -0.54], [2007, 0.27, 0.13], [2008, -0.41, -0.4], [2009, -...","[[2002, 0.056, 0.311, 0.606], [2003, 0.044, 0.413, 0.585], [2004, 0.056, 0.319, 0.571], [2005, 0.052, 0.312, 0.6], [2006, 0.067, 0.309, 0.619], [2007, 0.067..."
73710,A5020989903,Emma V. King,"[[2011, 0.33, 0.3], [2012, 0.09, -0.09], [2013, -0.42, -0.42], [2014, -0.52, -0.34], [2015, -0.62, -0.39], [2016, -0.39, -0.35], [2017, -0.42, -0.44], [2018...","[[2008, 0.048, 0.587, 0.582], [2011, 0.099, 0.416, 0.611], [2012, 0.095, 0.452, 0.621], [2013, 0.099, 0.396, 0.611], [2014, 0.103, 0.374, 0.599], [2015, 0.0..."
72234,A5071118063,Beth Jefferies,"[[2015, -1.0, -1.0], [2018, 1.0, nan], [2020, -1.0, -1.0], [2021, -1.0, -1.0], [2023, -1.0, -1.0]]","[[2002, 0.012, 1.0, 0.552], [2014, 0.008, 0.625, 0.567], [2015, 0.012, 0.565, 0.596], [2018, 0.012, 0.284, 0.474], [2019, 0.016, 0.435, 0.501], [2020, 0.008..."
82799,A5067019894,Luke Griffith,,
66287,A5046276492,Santie de Villiers,"[[2009, 0.0, 0.0], [2010, -0.65, -0.63], [2011, -0.47, -0.44], [2012, -0.24, -0.24], [2014, -0.14, 0.2], [2015, -0.41, -0.41], [2016, -0.41, -0.41], [2018, ...","[[2006, 0.016, 0.4, 0.598], [2008, 0.016, 0.549, 0.598], [2009, 0.016, 0.607, 0.598], [2010, 0.016, 0.596, 0.598], [2011, 0.02, 0.534, 0.582], [2012, 0.02, ..."
39065,A5059921894,Norman Morrison,"[[2002, -1.0, nan], [2004, -0.63, -0.55], [2005, -0.1, -0.03], [2006, -0.25, -0.24], [2007, 0.38, 0.57], [2008, -0.24, -0.04], [2009, -0.33, -0.33], [2010, ...","[[2002, 0.063, 0.375, 0.587], [2004, 0.075, 0.232, 0.612], [2005, 0.087, 0.177, 0.593], [2006, 0.095, 0.145, 0.598], [2007, 0.103, 0.231, 0.591], [2008, 0.1..."
62927,A5089882185,Daniel T. Bowron,"[[2002, -0.67, -0.62], [2003, -0.27, -0.25], [2004, -0.47, -0.06], [2005, -0.1, -0.19], [2006, 0.25, -0.15], [2007, -0.58, -0.57], [2008, -0.48, -0.66], [20...","[[2002, 0.087, 0.498, 0.586], [2003, 0.083, 0.483, 0.58], [2004, 0.075, 0.481, 0.589], [2005, 0.079, 0.462, 0.58], [2006, 0.079, 0.45, 0.571], [2007, 0.083,..."
38282,A5063458205,Martin C. Stennett,"[[2002, 0.38, 0.38], [2004, -0.43, -0.43], [2005, 0.0, 0.03], [2006, 0.09, -0.43], [2007, -0.13, -0.26], [2008, -0.27, -0.24], [2009, -1.0, -1.0], [2010, -0...","[[2002, 0.032, 0.28, 0.53], [2004, 0.036, 0.245, 0.574], [2005, 0.048, 0.19, 0.574], [2006, 0.044, 0.204, 0.565], [2007, 0.044, 0.209, 0.565], [2008, 0.052,..."
82704,A5016855567,Marion M. Hetherington,"[[2002, -0.12, -0.3], [2003, -0.21, -0.27], [2004, -0.23, -0.36], [2005, -0.39, -0.41], [2006, -0.33, -0.36], [2007, -0.31, -0.31], [2008, -0.45, -0.35], [2...","[[2002, 0.052, 0.469, 0.608], [2003, 0.052, 0.459, 0.616], [2004, 0.06, 0.422, 0.604], [2005, 0.083, 0.332, 0.608], [2006, 0.099, 0.316, 0.604], [2007, 0.10..."
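
The nested time series columns are easier to analyse once unpacked into a long table. The sketch below does this for one author's `author_year_disruption` value, using the first three entries shown above for `A5071118063`. It assumes each inner list is ordered `[year, mean disruption, FWCI-weighted mean disruption]`, which is inferred from the description of the annual time series and not confirmed; the output column names are chosen for this example.

```python
import pandas as pd

# One row as it appears in the example output (first three yearly entries only).
row = {
    "oa_id": "A5071118063",
    "author_year_disruption": [
        [2015, -1.0, -1.0],
        [2018, 1.0, float("nan")],
        [2020, -1.0, -1.0],
    ],
}

# Assumed ordering of each inner list: year, mean, FWCI-weighted mean.
ts = pd.DataFrame(
    row["author_year_disruption"],
    columns=["year", "mean_disruption", "mean_weighted_disruption"],
)
ts.insert(0, "oa_id", row["oa_id"])
ts["year"] = ts["year"].astype(int)
print(ts)
```

The same pattern would apply to `author_year_diversity` with four columns (year, variety, evenness, disparity), again assuming that ordering. If the lists are stored as strings in a given export, they would need parsing first, for example with `ast.literal_eval`.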
