
This repository contains the dataset and code for the NAACL 2024 paper "MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference." The dataset can be downloaded directly from here. You can also access the dataset through Hugging Face.

If you face any difficulties downloading the dataset, please open an issue in this repository or contact us at msadat3@uic.edu.
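As a minimal sketch, the dataset can be loaded with the `datasets` library; the Hub ID `msadat3/MSciNLI` below is an assumption, so substitute the actual ID from the Hugging Face link above.

```python
# Minimal sketch: load MSciNLI from the Hugging Face Hub.
# NOTE: the Hub ID below is an assumption; use the ID linked above.
from datasets import load_dataset

dataset = load_dataset("msadat3/MSciNLI")  # hypothetical Hub ID
print(dataset["train"][0])  # one sample with its id, sentences, label, and domain
```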

Abstract

The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset containing $132,320$ sentence pairs extracted from five new scientific domains. The availability of multiple domains makes it possible to study domain shift for scientific NLI. We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs). The highest Macro F1 scores of the PLM and LLM baselines are $77.21$% and $51.77$%, respectively, illustrating that MSciNLI is challenging for both types of models. Furthermore, we show that domain shift degrades the performance of scientific NLI models, which demonstrates the diverse characteristics of the different domains in our dataset. Finally, we use both scientific NLI datasets in an intermediate task transfer learning setting and show that they can improve the performance of downstream tasks in the scientific domain.

Dataset Description

MSciNLI is derived from papers published in five different scientific domains: "Hardware", "Networks", "Software & its Engineering", "Security & Privacy", and "NeurIPS." Similar to SciNLI, we use a distant supervision method that exploits the linking phrases between sentences in scientific papers to construct a large training set, and we directly use these potentially noisy sentence pairs during training. For the test and development sets, we manually annotate $4,000$ and $1,000$ examples, respectively, to create high-quality evaluation data for scientific NLI. We refer the reader to our paper for an in-depth description of our dataset construction process.
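As a rough illustration of the distant supervision idea (not the authors' exact pipeline), a linking phrase at the start of the second sentence can be mapped to a noisy class label. The phrase-to-label mapping below is a hypothetical subset; the full phrase lists, phrase removal, and neutral-pair sampling are described in the paper.

```python
# Illustrative sketch of linking-phrase-based distant supervision.
# The phrase-to-label mapping is a hypothetical subset; see the paper
# for the actual phrase lists and the neutral-pair sampling strategy.
LINKING_PHRASES = {
    "however": "contrasting",
    "therefore": "reasoning",
    "in other words": "entailment",
}

def label_pair(sentence1: str, sentence2: str):
    """Return a (noisy) label if sentence2 opens with a known linking phrase."""
    lowered = sentence2.lower()
    for phrase, label in LINKING_PHRASES.items():
        if lowered.startswith(phrase):
            return label
    return None  # no linking phrase found; pair is not used

print(label_pair("The model converges quickly.",
                 "However, it overfits on small datasets."))  # contrasting
```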

Examples

[Figure: example sentence pairs from the five MSciNLI domains]

Dataset Statistics

A comparison of key statistics of MSciNLI with those of the existing SciNLI dataset is shown below.

[Table: dataset statistics of MSciNLI vs. SciNLI]

Files

train.csv, test.csv, and dev.csv contain the training, testing, and development data, respectively. Each file has five columns (a minimal loading sketch follows the list below):

* 'id': a unique id for each sample.

* 'sentence1': the premise of each sample.

* 'sentence2': the hypothesis of each sample.

* 'label': the label representing the semantic relation between the premise and the hypothesis.

* 'domain': the scientific domain from which each sample is extracted.
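A minimal sketch of inspecting one split with pandas, assuming the CSV files sit in the working directory:

```python
# Minimal sketch: inspect one of the released CSV splits with pandas.
import pandas as pd

train = pd.read_csv("train.csv")
print(train.columns.tolist())          # ['id', 'sentence1', 'sentence2', 'label', 'domain']
print(train["domain"].value_counts())  # samples per scientific domain
print(train["label"].unique())         # the semantic relation classes
```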

Baseline Performance

PLM Baselines

The performance of the Pre-trained Language Model (PLM) baselines can be seen in the table below. The experimental details are described in our paper.

[Table: Macro F1 scores of the PLM baselines on MSciNLI]

A comparison of the in-domain (ID) and out-of-domain (OOD) performance of the best-performing PLM baseline, RoBERTa, fine-tuned separately on the five domains introduced in MSciNLI and on the ACL domain from SciNLI, can be seen below.

[Figure: ID vs. OOD performance of RoBERTa across the six domains]
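For reference, below is a minimal fine-tuning sketch for a RoBERTa baseline using Hugging Face `transformers`. This is not the authors' released training code (see Model Training & Testing below); the four-way label set and all hyperparameters are assumptions.

```python
# Minimal fine-tuning sketch for a RoBERTa baseline on MSciNLI.
# The label set and hyperparameters are assumptions; see the paper
# for the experimental details.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["contrasting", "reasoning", "entailment", "neutral"]  # assumed classes
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels))

def preprocess(batch):
    # Encode the premise-hypothesis pair and map string labels to ids.
    enc = tokenizer(batch["sentence1"], batch["sentence2"],
                    truncation=True, max_length=256)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

data = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})
data = data.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("mscinli_roberta", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["dev"],
    tokenizer=tokenizer,
)
trainer.train()
```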

LLM Baselines

The performance of the Large Language Model (LLM) baselines can be seen in the table below. The experimental details and the prompts used for the LLMs are described in our paper.

[Table: Macro F1 scores of the LLM baselines on MSciNLI]
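As an illustration only, a zero-shot prompt for this task might look like the template below; the actual prompts used in the paper may differ.

```python
# Illustrative zero-shot prompt template for scientific NLI; the actual
# prompts used for the LLM baselines are described in the paper.
PROMPT = (
    "Given two sentences from a scientific paper, classify their semantic "
    "relation as contrasting, reasoning, entailment, or neutral.\n"
    "Sentence 1: {premise}\n"
    "Sentence 2: {hypothesis}\n"
    "Relation:"
)

print(PROMPT.format(premise="The model converges quickly.",
                    hypothesis="It overfits on small datasets."))
```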

Model Training & Testing

Coming soon!

Citation

If you use this dataset, please cite our paper:

@inproceedings{sadat-caragea-2024-mscinli,
    title = "{MSciNLI}: A Diverse Benchmark for Scientific Natural Language Inference",
    author = "Sadat, Mobashir  and  Caragea, Cornelia",
    booktitle = "Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "",
}

License

MSciNLI is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).

Contact

Please contact us at msadat3@uic.edu with any questions.
