# Reproduction of Citation-Integrity

### Authored by Jeffrey Dick on 2024-11-29

This notebook presents my reproduction of the Citation-Integrity model.
See [multivers/README.md](https://github.com/jedick/readycite/tree/main/multivers) for details about setting up the model, including sources of data.

## Introduction to the model

This description was adapted from [Sarol et al. (2024)](https://doi.org/10.1093/bioinformatics/btae420).

- [Citation-Integrity](https://github.com/ScienceNLP-Lab/Citation-Integrity) is based on [MultiVerS](https://github.com/dwadden/multivers), which uses the Longformer encoder.
    - The data processed by MultiVerS consist of **claims** and **abstracts**. The original model was applied to a version of the SciFact dataset with additional preprocessing for negative sampling ([Wadden et al., 2022](https://doi.org/10.48550/arXiv.2112.01640)).
    - The Citation-Integrity dataset consists of **claims** and **evidence**. The evidence is treated like the abstract for the purpose of the MultiVerS model. *This is the dataset used for this notebook.*
- MultiVerS performs two tasks independently:
    - selection of rationale sentences from the abstracts
    - label prediction for the claims
- Because rationale selection performed poorly in experiments, Citation-Integrity ignores the rationale sentence selection in MultiVerS by setting the weight of its loss term to 0.
- Compared to MultiVerS, Citation-Integrity adds three tokens to the claims as citation markers:
    - `[CIT]` for the target citation
    - `[MULTI_CIT]` for the target citation among other citations
    - `[OTHER_CIT]` for non-target citations
- Unlike MultiVerS, which selects rationale sentences from only the abstracts of cited articles, Citation-Integrity uses the full text of articles.
    - The BM25 model is used to retrieve the top 60 sentences.
    - The MonoT5 reranker is used to rerank those sentences.
    - The top-k (5, 10, or 20 in experiments) sentences are used as evidence sentences.
- The labels used in Citation-Integrity (ACCURATE, NOT_ACCURATE, NEI) have different names but otherwise correspond to the labels used in MultiVerS (SUPPORT, REFUTE, NOT ENOUGH INFORMATION). Note: The IRRELEVANT label used in the paper of Sarol et al. (2024) maps to the NEI label in the Citation-Integrity dataset.
- The baseline model in Citation-Integrity is MultiVerS trained on HealthVER.
- I used the baseline model as the starting point for training on the Citation-Integrity dataset.
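
The label correspondence and the evidence-selection steps listed above (BM25 retrieval of 60 sentences, reranking, then keeping the top-k) can be sketched in plain Python. This is only an illustrative sketch, not the code used by Citation-Integrity: the function names, the whitespace tokenizer, and the BM25 parameters `k1` and `b` are my assumptions, and the MonoT5 reranker is represented by an optional `rerank` callable that is left unimplemented here.

```python
import math
from collections import Counter

# Label correspondence described above (Citation-Integrity -> MultiVerS).
LABEL_MAP = {
    "ACCURATE": "SUPPORT",
    "NOT_ACCURATE": "REFUTE",
    "NEI": "NOT ENOUGH INFORMATION",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines differ)."""
    return text.lower().split()


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Score each sentence against the query with Okapi BM25.

    k1 and b are commonly used defaults, not values taken from the paper.
    """
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of sentences containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


def select_evidence(claim, sentences, n_retrieve=60, top_k=5, rerank=None):
    """Retrieve n_retrieve sentences with BM25, optionally rerank, keep top_k.

    rerank is a hypothetical callable (claim, candidates) -> reordered list,
    standing in for a reranker such as MonoT5.
    """
    if not sentences:
        return []
    scores = bm25_scores(claim, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    candidates = [sentences[i] for i in ranked[:n_retrieve]]
    if rerank is not None:
        candidates = rerank(claim, candidates)
    return candidates[:top_k]
```

For example, `select_evidence(claim, sentences, top_k=10)` would return up to ten sentences in BM25 order, since no reranker is supplied. Passing a `rerank` callable changes only the order of the retrieved candidates before truncation.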

## Model parameters and checkpoints

The predictions made by the model at four checkpoints are compared below.
These checkpoints are:

- Model A (`bestModel-001.ckpt`): This is the best model from [Citation-Integrity](https://github.com/ScienceNLP-Lab/Citation-Integrity) and was downloaded from [Google Drive](https://drive.google.com/drive/u/0/folders/11b6Z8iv2FXObWmLaqfYzgUQsaL4QgTT2?q=parent:11b6Z8iv2FXObWmLaqfYzgUQsaL4QgTT2).
- Model B (`citint_20241127.ckpt`): This is my first reproduction of the Citation-Integrity model. Except for modifications made to `requirements.txt` and imported packages, the codebase is identical to [this commit of Citation-Integrity](https://github.com/ScienceNLP-Lab/Citation-Integrity/commit/277152f9dfe3873455220f4cd15269474ab15617). This corresponds to commit [e10022](https://github.com/jedick/readycite/commit/e10022ecc4a24646708f6dd81e40f20208d62860).
- Model C (`citint_20241128.ckpt`): As in Model B, but with the dataset in `val_dataloader` changed from `"test"` to `"val"`. This corresponds to commit [cf8461](https://github.com/jedick/readycite/commit/cf846148c39557c45d99e2fcbb3409adea4fede3).
- Model D (`citint_20241129.ckpt`): As in Model C, but with the number of epochs in `train_target.py` changed from 5 to 20.