
Software Impact Hackathon 2023: Scientific Software Citation Intent

Introduction

We are motivated to investigate authors’ citation intent when referencing software in scholarly articles. Aligned with the main topic of the hackathon, we identify citation intent classes that convey the notion of impact for both authors and consumers of the software.

Citation context analysis is believed to reveal more granular reasons why researchers give and receive citations, and hence to contribute to an overall understanding of how credit and knowledge flow between researchers and their outputs. We believe that understanding not just how software is mentioned in publications, but why it is mentioned, is critical for building a deeper understanding of the roles research software plays in the academic system and for contributing to a fairer scientific reward system in the future.

Even though a large number of citation context classification systems have been proposed for scientific publications, such as Scite (Nicholson et al., 2021), very few systems have been proposed for citations or mentions of research software in scientific publications, the notable exceptions being SoftCite (Du et al., 2021) and SoMeSci (Schindler et al., 2020). However, both of these efforts are based on, and only tested against, limited publication samples. We therefore believe it is vital to construct a system that reuses existing efforts as much as possible and to apply it to a large-scale software name dataset that only recently emerged, in order to improve the research infrastructure for empirical studies of research software.

Citation Intent Classes

We analyzed several existing schemes of citation contexts/intents for regular research articles and for software in order to construct our own classes. The schemes and their categories are listed below:

  • ACL-ARC (Jurgens, 2018):
    • Background
    • Compare/contrast
    • Motivation
    • Extends
    • Future work
  • SciCite: Citation intent classification dataset (Cohan et al., 2019):
    • Background Information
    • Method
    • Result Comparison
  • SoftCite (Du et al., 2021):
    • Created
    • Used
    • Shared
  • SoMeSci (Schindler et al., 2020):
    • Usage
    • Mention
    • Creation
    • Deposition

We further established two principles behind our own scheme:

  1. Citations are inherently shared and deposited, as they are based on URIs and hence point to “things on the (open) web”;
  2. We are striving to understand the impact of software cited in scholarly articles, so authors of created and cited software should get credit for their work.

Based on the above principles and other systems, we proposed the following three categories in our scheme of research software citation intent:

  • Paper <describe_creation_of> Software: the paper describes or acknowledges the creation of a piece of research software. It corresponds to the Created category in SoftCite and Creation in SoMeSci.
  • Paper <describes_usage_of> Software: the paper describes the use of research software in any part of the research procedure, for any purpose. It corresponds to the Used category in SoftCite and Usage in SoMeSci.
  • Paper <describes_related_software> Software: the paper describes the research software for any reason beyond the first two categories. It corresponds to the Mention category in SoMeSci.

Similar to existing efforts, this system only considers “functional” intents, i.e., functional reasons for mentioning the software in publications, rather than other aspects of intent such as sentiment and importance (Zhang & Ding, 2013). Compared to the two existing software citation intent classification systems, we deliberately did not include a Sharing or Deposition category, because we believe it is not strongly relevant to the impact of software mentioned in publications, despite its importance to open science.

An important attribute of our scheme is that it is designed to be applied at the sentence level: the assessment is made for each sentence in which a software entity is mentioned, and hence a paper-software pair can have multiple citation intents. Moreover, we decided that each sentence can only be classified into one category. Where multiple categories could apply, we decide based on the degree of impact conveyed by the sentence, i.e., creation outranks usage, and usage outranks other mentions.
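As an illustration, the precedence rule amounts to something like the following helper (a minimal sketch; the function and lower-case label names are hypothetical, not code from this repository):

# Hypothetical sketch of the precedence rule: creation outranks usage, which outranks other mentions.
PRECEDENCE = ["creation", "usage", "mention"]

def resolve_intent(candidate_labels):
    """Return the single highest-impact label among the candidates for one sentence."""
    for label in PRECEDENCE:
        if label in candidate_labels:
            return label
    return None

# Example: a sentence that both announces and uses a tool is labelled "creation".
assert resolve_intent({"usage", "creation"}) == "creation"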

Datasets

We collectively decided to reuse existing datasets as much as possible rather than spend valuable time creating new datasets and defining new gold standards. We chose to build on the SoftCite, SoMeSci, and CZI datasets that were available to us. The datasets, for the most part, consist of single sentences that contain a software mention (implicit, by means of a verbal reference to software, or explicit, a verbal reference accompanied by a URI) and the corresponding labels "used", "created", and "shared". Consolidating these similar yet slightly different datasets was outside the scope of our group’s work; in fact, it was the subject of another group’s effort during the hackathon. However, given our decision regarding citation intent classes outlined above, we had to make a few adjustments to the labels in the provided datasets. From the SoftCite dataset, for example, we transferred the labels "used" and "created" directly to our "Usage" and "Creation" classes and mapped most of the "shared" labeled data into "Creation". After careful consideration and much debate, we moved some records that had multiple positive labels, or no positive labels at all, into our "Mention" category.
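A simplified sketch of that relabeling step is shown below; the file name, column names, and the handling of multi-label records are assumptions for illustration, and the real decisions involved more case-by-case judgment:

import pandas as pd

# Hypothetical file and column names; the real SoftCite export and the handling of
# multi-label / unlabeled records involved more case-by-case decisions than shown here.
softcite = pd.read_csv("softcite_mentions.csv")  # columns: sentence, used, created, shared

def map_intent(row):
    if row["created"] or row["shared"]:   # "shared" mostly folded into Creation
        return "Creation"
    if row["used"]:
        return "Usage"
    return "Mention"                      # fallback for the remaining records

softcite["intent"] = softcite.apply(map_intent, axis=1)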

As part of the data curation, we created a pipeline that downloaded all available full text via the PMC API in order to extract an expanded citation context of three sentences around the citation (the leading, citing, and trailing sentence).
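A minimal sketch of that pipeline, assuming the NCBI E-utilities efetch endpoint and a naive regex sentence splitter (the actual pipeline parses the returned XML and handles edge cases more carefully):

import re
import requests

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_pmc_fulltext_xml(pmcid):
    """Download the full-text XML of an open-access article from PMC via E-utilities."""
    resp = requests.get(EFETCH, params={"db": "pmc", "id": pmcid})
    resp.raise_for_status()
    return resp.text

def citation_context(fulltext, citing_sentence):
    """Return (leading, citing, trailing) sentences around the citing sentence.

    `fulltext` is assumed to be plain text already extracted from the XML.
    """
    sentences = re.split(r"(?<=[.!?])\s+", fulltext)  # naive sentence splitter
    idx = next((i for i, s in enumerate(sentences) if citing_sentence in s), None)
    if idx is None:
        return None
    leading = sentences[idx - 1] if idx > 0 else ""
    trailing = sentences[idx + 1] if idx + 1 < len(sentences) else ""
    return leading, sentences[idx], trailing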

After all preprocessing, we ended up with one dataset that we used to train a variety of language models. We split the dataset into the usual chunks for training, testing, and evaluation in order to facilitate a reasonable comparison between models.
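A sketch of such a split, assuming a pandas DataFrame with text and label columns; the file name and 50/25/25 proportions are illustrative, not the exact splits we used:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name and proportions; the actual splits ship with the data folder.
df = pd.read_csv("data/consolidated.csv")  # columns: text, label
train_df, rest = train_test_split(df, test_size=0.5, stratify=df["label"], random_state=42)
test_df, eval_df = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)

Stratifying on the label keeps the class distribution comparable across the three chunks.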

The dataset consists of 2,283 software citations, each labeled as "Creation", "Usage", or "Mention", along with the citation context. The context comes in two forms: the citing sentence alone, and the leading, citing, and trailing sentences together. In addition, we created a dataset of 1,000 unlabeled citation contexts that could be used as negative training examples.

The distribution of labels in this dataset is as follows:

(Figure: label distribution)

(Figure: word count)

For evaluation, we used a dataset of 411 samples curated by CZI. This dataset was manually curated by reviewing sentences that contain mentions of software names; it was initially curated before the hackathon using a more granular intent classification, which was subsequently mapped to the intent classification described above (Creation, Usage, Mention).

The label distribution is as follows:

(Figure: label distribution, CZI dataset)

All datasets are also located in the data folder, with documentation in the respective README.

Training Language Models

We explored finetuning several BERT-family language models for classifying software mentions based on their intent, namely BERT, distilBERT, SciBERT, and PubMedBERT. Moreover, we evaluated GPT-3.5 and GPT-4 using three different strategies: zero-shot prompting, few-shot prompting, and (for GPT-3.5) finetuning.

The code for finetuning the BERT models is located in BERT_finetuning.
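For reference, the finetuning follows the usual Hugging Face sequence-classification recipe, roughly as sketched below; the checkpoint, file name, and hyperparameters are illustrative, and the actual code in BERT_finetuning is authoritative:

import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Creation", "Usage", "Mention", "Unlabelled"]
label2id = {name: i for i, name in enumerate(labels)}

# Hypothetical file/column names; the real inputs live in the data folder.
df = pd.read_csv("data/train.csv")  # columns: text, label
df["label"] = df["label"].map(label2id)

checkpoint = "allenai/scibert_scivocab_uncased"  # swap in BERT, distilBERT, PubMedBERT, ...
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(df).map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()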

For zero-shot prompting, the prompt used was:

initial_message = [{"role": "system",
"content": "You are a scientist trying to figure out the citation intent behind software mentioned in sentences coming from research articles. Your four categories are: usage, creation, mention, or none. The definitions of the classes are: \
- usage: software was used in the paper \
- creation: software was created by the authors of the paper \
- mention: software was mentioned in the paper, but not used, nor created \
- none: none of the previous 3 categories apply \
You need to output one category only."}]

For few-shot prompting, the prompt was:

num_examples = 5
initial_message = [{"role": "system",
"content": "You are a scientist trying to figure out the citation intent behind software mentioned in sentences coming from research articles. Your four categories are: usage, creation, mention, or none. The definitions of the classes are: \
- usage: software was used in the paper \
- creation: software was created by the authors of the paper \
- mention: software was mentioned in the paper, but not used, nor created \
- none: none of the previous 3 categories apply \
You need to output one category only."}]
# examples_used, examples_created, examples_mentioned, and examples_none hold the
# example sentences for each class that are appended as demonstrations below.
for example in examples_used:
    initial_message += [{"role": "user", "content": example}]
    initial_message += [{"role": "assistant", "content": 'usage'}]
for example in examples_created:
    initial_message += [{"role": "user", "content": example}]
    initial_message += [{"role": "assistant", "content": 'creation'}]
for example in examples_mentioned:
    initial_message += [{"role": "user", "content": example}]
    initial_message += [{"role": "assistant", "content": 'mention'}]
for example in examples_none:
    initial_message += [{"role": "user", "content": example}]
    initial_message += [{"role": "assistant", "content": 'none'}]
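In both settings, the assembled messages are then sent to the chat completions endpoint once per sentence to classify; a sketch assuming the openai Python SDK (v1 interface), reusing initial_message from above:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify(sentence):
    # Append the sentence as the final user turn and return the predicted category string.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=initial_message + [{"role": "user", "content": sentence}],
    )
    return response.choices[0].message.content.strip().lower()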

For finetuning GPT-3.5, we employed early stopping (n_epochs = 2) based on a previous run:

(Figure: training loss)
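For reference, a finetuning job of this kind can be created roughly as follows (a sketch assuming the openai Python SDK v1; the training file name is hypothetical):

from openai import OpenAI

client = OpenAI()

# Training data in the chat-format JSONL expected by the finetuning API.
training_file = client.files.create(file=open("train_intents.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 2},  # early stopping after 2 epochs, as noted above
)
print(job.id)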

Evaluation

For multi-class precision, recall and F1-score, we used the macro average method.
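Concretely, the overall scores reported below correspond to something like the following scikit-learn sketch (the toy labels are only for illustration):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy example; in practice y_true/y_pred are the gold and predicted intent labels.
y_true = ["Usage", "Creation", "Mention", "Usage"]
y_pred = ["Usage", "Creation", "Usage", "Usage"]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

Passing average=None instead yields the per-class precision, recall, and F1 values shown in the class columns of the table.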

| model | method | test set | precision (overall) | recall (overall) | F1 (overall) | accuracy (overall) | P (Creation) | R (Creation) | F1 (Creation) | P (Mention) | R (Mention) | F1 (Mention) | P (Usage) | R (Usage) | F1 (Usage) | P (Unlabelled) | R (Unlabelled) | F1 (Unlabelled) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | finetuned-sentence | test split (n=838) | 0.866 | 0.859 | 0.862 | 0.903 | 0.90 | 0.87 | 0.88 | 0.71 | 0.69 | 0.70 | 0.93 | 0.94 | 0.94 | 0.92 | 0.94 | 0.93 |
| BERT | finetuned-sentence | CZI validation (n=410) | 0.323 | 0.368 | 0.335 | 0.771 | 0.00 | 0.00 | 0.00 | 0.15 | 0.29 | 0.20 | 0.94 | 0.85 | 0.90 | 0.20 | 0.33 | 0.2 |
| distilBERT | finetuned-sentence | test split (n=838) | 0.823 | 0.826 | 0.824 | 0.884 | 0.86 | 0.81 | 0.84 | 0.59 | 0.62 | 0.61 | 0.95 | 0.93 | 0.94 | 0.90 | 0.95 | 0.92 |
| distilBERT | finetuned-sentence | CZI validation (n=410) | 0.481 | 0.412 | 0.443 | 0.801 | 0.71 | 0.50 | 0.59 | 0.29 | 0.26 | 0.27 | 0.94 | 0.88 | 0.91 | 0.00 | 0.00 | 0.00 |
| SciBERT | finetuned-sentence | test split (n=843) | 0.85 | 0.846 | 0.846 | 0.906 | 0.89 | 0.77 | 0.82 | 0.63 | 0.68 | 0.65 | 0.96 | 0.95 | 0.95 | 0.93 | 0.99 | 0.96 |
| SciBERT | finetuned-sentence | CZI validation (n=410) | 0.302 | 0.308 | 0.306 | 0.80 | 0.00 | 0.00 | 0.00 | 0.27 | 0.35 | 0.30 | 0.95 | 0.90 | 0.92 | 0.00 | 0.00 | 0.00 |
| PubmedBERT | finetuned-sentence | test split (n=838) | 0.867 | 0.891 | 0.88 | 0.919 | 0.88 | 0.88 | 0.88 | 0.69 | 0.78 | 0.73 | 0.97 | 0.93 | 0.95 | 0.94 | 0.97 | 0.96 |
| PubmedBERT | finetuned-sentence | CZI validation (n=410) | 0.319 | 0.392 | 0.342 | 0.81 | 0.00 | 0.00 | 0.00 | 0.28 | 0.39 | 0.33 | 0.94 | 0.91 | 0.93 | 0.00 | 0.00 | 0.00 |
| GPT3.5 | zero-shot | test split (n=837) | 0.7 | 0.608 | 0.627 | 0.717 | 0.77 | 0.46 | 0.57 | 0.24 | 0.47 | 0.32 | 0.84 | 0.86 | 0.85 | 0.96 | 0.65 | 0.77 |
| GPT3.5 | zero-shot | CZI validation (n=410) | 0.464 | 0.511 | 0.478 | 0.8 | 0.55 | 0.6 | 0.57 | 0.35 | 0.61 | 0.44 | 0.96 | 0.84 | 0.89 | 0.00 | 0.00 | 0.00 |
| GPT3.5 | few-shot | test split (n=837) | 0.59 | 0.617 | 0.556 | 0.54 | 0.612 | 0.81 | 0.46 | 0.59 | 0.28 | 0.24 | 0.26 | 0.93 | 0.57 | 0.71 | 0.45 | 0.95 |
| GPT3.5 | few-shot | CZI validation (n=410) | 0.525 | 0.291 | 0.373 | 0.457 | 0.5 | 0.3 | 0.37 | 0.6 | 0.39 | 0.47 | 1.00 | 0.47 | 0.64 | 0.00 | 0.00 | 0.00 |
| GPT3.5 | few-shot (context) | test split (n=837) | 0.546 | 0.512 | 0.463 | 0.5 | 0.72 | 0.57 | 0.64 | 0.21 | 0.15 | 0.17 | 0.88 | 0.35 | 0.5 | 0.38 | 0.98 | 0.55 |
| GPT3.5 | few-shot (context) | CZI validation (n=410) | 0.454 | 0.195 | 0.269 | 0.338 | 0.5 | 0.2 | 0.29 | 0.33 | 0.22 | 0.26 | 0.98 | 0.36 | 0.53 | 0.0 | 0.0 | 0.0 |
| GPT3.5 | finetuned | test split (n=837) | 0.839 | 0.867 | 0.851 | 0.9 | 0.8 | 0.91 | 0.85 | 0.65 | 0.69 | 0.67 | 0.97 | 0.93 | 0.95 | 0.94 | 0.93 | 0.94 |
| GPT3.5 | finetuned | CZI validation (n=410) | 0.571 | 0.531 | 0.545 | 0.881 | 0.71 | 0.5 | 0.59 | 0.59 | 0.7 | 0.64 | 0.98 | 0.93 | 0.95 | 0.00 | 0.00 | 0.00 |
| GPT3.5 | finetuned with context | test split (n=837) | 0.766 | 0.808 | 0.783 | 0.857 | 0.66 | 0.88 | 0.76 | 0.49 | 0.52 | 0.51 | 0.96 | 0.88 | 0.92 | 0.95 | 0.95 | 0.95 |
| GPT3.5 | finetuned with context | CZI validation (n=410) | 0.553 | 0.503 | 0.509 | 0.819 | 0.83 | 0.5 | 0.62 | 0.41 | 0.65 | 0.5 | 0.97 | 0.86 | 0.91 | 0.00 | 0.00 | 0.00 |
| GPT4 | zero-shot | test split (n=837) | 0.684 | 0.662 | 0.664 | 0.815 | 0.73 | 0.70 | 0.71 | 0.26 | 0.11 | 0.15 | 0.83 | 0.96 | 0.89 | 0.92 | 0.88 | 0.90 |
| GPT4 | zero-shot | CZI validation (n=410) | 0.473 | 0.544 | 0.495 | 0.8 | 0.58 | 0.70 | 0.64 | 0.35 | 0.65 | 0.45 | 0.96 | 0.82 | 0.89 | 0.00 | 0.00 | 0.00 |
| GPT4 | few-shot | test split (n=837) | 0.746 | 0.736 | 0.738 | 0.839 | 0.74 | 0.62 | 0.67 | 0.46 | 0.44 | 0.45 | 0.93 | 0.91 | 0.92 | 0.86 | 0.97 | 0.91 |
| GPT4 | few-shot | CZI validation (n=410) | 0.473 | 0.385 | 0.421 | 0.614 | 0.7 | 0.7 | 0.7 | 0.21 | 0.17 | 0.19 | 0.98 | 0.67 | 0.79 | 0.0 | 0.0 | 0.0 |
| GPT4 | few-shot (context) | test split (n=837) | 0.716 | 0.73 | 0.716 | 0.832 | 0.68 | 0.82 | 0.74 | 0.43 | 0.26 | 0.33 | 0.92 | 0.91 | 0.92 | 0.83 | 0.93 | 0.87 |
| GPT4 | few-shot (context) | CZI validation (n=410) | 0.399 | 0.206 | 0.269 | 0.471 | 0.5 | 0.2 | 0.29 | 0.11 | 0.09 | 0.1 | 0.99 | 0.54 | 0.7 | 0.00 | 0.00 | 0.00 |

More granular metrics on the GPT3.5 finetuning are included as screenshots in images/.

Key takeaways

  • For GPT-3.5, prompt engineering requires time and careful crafting
    • Responses also take longer because the model has to work through each answer
  • Finetuning takes ~1 hr for training, but inference is very fast: ~1.5 min for ~800 sentences
  • Tuning is hard: you cannot change any hyperparameters besides n_epochs, so the only real tuning option is early stopping during training; the API is not as flexible as the Hugging Face API in letting you interact with the models

References

Please do not modify or delete any other part of the readme below this line.


About this project

This repository was developed as part of the Mapping the Impact of Research Software in Science hackathon hosted by the Chan Zuckerberg Initiative (CZI). By participating in this hackathon, owners of this repository acknowledge the following:

  1. The code for this project is hosted by the project contributors in a repository created from a template generated by CZI. The purpose of this template is to help ensure that repositories adhere to the hackathon’s project naming conventions and licensing recommendations. CZI does not claim any ownership or intellectual property on the outputs of the hackathon. This repository allows the contributing teams to maintain ownership of code after the project, and indicates that the code produced is not a CZI product, and CZI does not assume responsibility for assuring the legality, usability, safety, or security of the code produced.
  2. This project is published under a MIT license.

Code of Conduct

Contributions to this project are subject to CZI’s Contributor Covenant code of conduct. By participating, contributors are expected to uphold this code of conduct.

Reporting Security Issues

If you believe you have found a security issue, please responsibly disclose by contacting the repository owner via the ‘security’ tab above.
