Semi-supervised LDA topic modelling and analysis of AHRC research grant application abstracts
The AHRC Grants Topic Browser is built on a topic analysis of research grant applications that were awarded funding by the Arts and Humanities Research Council (AHRC) between 2013 and 2023.
The 32 topics presented here were generated heuristically, using a combination of unsupervised and semi-supervised machine learning techniques (LDA and seeded LDA). The goal was to arrive at a classification of the documents in the corpus that is both statistically robust and intuitively meaningful to a human observer. I labelled the emerging topics based on my interpretation of two outputs: the documents the model identified as most strongly associated with a given category (Most Relevant Projects), and the cluster of terms identified as having the highest probability of appearing in those documents (Most Frequent Words). The topic labels are of necessity imperfect. In selecting them, my aim was to find broad concepts that best capture the semantic overlap within each category.
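As a rough illustration of how the two lists behind each label can be read off a fitted topic model, here is a minimal Python sketch. It is not the R code used for this analysis. It assumes the model has already produced a topic-term matrix (called phi here) and a document-topic matrix (called theta here); those names, the vocabulary, the project titles, and all the numbers are invented for the example.

```python
# Toy illustration only: the numbers, project titles, and vocabulary below are
# invented. They are not taken from the AHRC corpus or from the fitted model.

def top_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]

def most_frequent_words(phi, vocab, topic, n=3):
    """Highest-probability terms for one topic (rows of phi are topics)."""
    return [vocab[i] for i in top_indices(phi[topic], n)]

def most_relevant_projects(theta, titles, topic, n=2):
    """Documents with the largest share of one topic (rows of theta are documents)."""
    shares = [row[topic] for row in theta]
    return [titles[i] for i in top_indices(shares, n)]

vocab = ["museum", "collection", "archive", "music", "performance", "sound"]

# phi[k][v]: probability of term v under topic k (each row sums to 1).
phi = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.40, 0.30, 0.20],
]

titles = ["Project A", "Project B", "Project C", "Project D"]

# theta[d][k]: share of topic k in document d (each row sums to 1).
theta = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.05, 0.95],
]

# Print the two lists a human reader would inspect before naming each topic.
for k in range(len(phi)):
    print(k, most_frequent_words(phi, vocab, k), most_relevant_projects(theta, titles, k))
```

In this toy case a reader might label the first topic something like "museums and archives" and the second "music and performance". The model only supplies the ranked lists, and the label remains a matter of interpretation.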
The data analysed here comes from publicly available information provided by UK Research and Innovation (UKRI) through Gateway to Research (GtR). The analysis covers research grant applications only; studentships, fellowships, and training grants awarded by the AHRC were excluded. In total, 2,270 applications were analysed.
Author: Anna Kuslits
Acknowledgments: The analysis was performed using the quanteda and seededLDA R packages, developed by Kenneth Benoit and Kohei Watanabe at the LSE Data Science Institute. In visualising the results and designing the dashboard, I drew inspiration from Mining the Dispatch, created by Robert K. Nelson and the Digital Scholarship Lab at the University of Richmond.