# Data Wrangling for Citation-Integrity

Written by Jeffrey Dick, 2024-12-14

This notebook wrangles data from the **Citation-Integrity** dataset.
The purpose of the data wrangling is to identify and deal with:
- duplicated examples
- missing information
- outliers (e.g. very short claims)

Here are the main tasks:

- Summary statistics are calculated for each column in the dataset.
- Each fold (train, dev, test) is treated individually, then they are combined.
- Furthermore, the `claim` and `evidence` columns are processed to get word counts and labels.
- Separately from the claims, we read the `corpus.jsonl` file to get the abstracts and check that `cited_doc_ids` in the claims have matching abstracts.
- *A separate notebook* will be created for exploratory data analysis of the abstracts.

NOTE: The dataset follows the schema used by **SciFact** [(schema information)](https://github.com/allenai/scifact/blob/master/doc/data.md).
For consistency with terminology used in **SciFact**, the evidence sentences are referred to as the *abstract*.
In **Citation-Integrity**, this is not the actual abstract of a cited paper, but instead the set of evidence sentences identified by human annotators.
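
As a rough illustration of this schema, the sketch below builds one claim record and its matching corpus record, then looks up the evidence sentence. The field names are the ones that appear in this notebook; the claim text, sentence text, and empty title are made-up placeholders, not rows from the dataset.

```python
# Illustrative records only: field names follow this notebook,
# but the text values are placeholders, not real dataset rows.
claim_record = {
    "id": 0,
    "claim": "Placeholder citing sentence <|cit|>.",
    "evidence": {"66000": [{"sentences": [0], "label": "ACCURATE"}]},
    "cited_doc_ids": [66000],
}
corpus_record = {
    "doc_id": 66000,
    "title": "",
    "abstract": ["Placeholder evidence sentence."],
}

# The evidence key is the doc_id as a string;
# 'sentences' holds indices into the 'abstract' list.
doc_key = str(corpus_record["doc_id"])
entry = claim_record["evidence"][doc_key][0]
evidence_text = [corpus_record["abstract"][i] for i in entry["sentences"]]
print(entry["label"], evidence_text)
```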

First we define a function to read a given fold of the dataset.

In [1]:
import pandas as pd

def read_citint(fold):
    """
    Reads Citation-Integrity dataset of the given fold
    (train, dev, test).
    Returns a DataFrame
    """
    file = f'../data/citint/claims_{fold}.jsonl'
    try:
        df = pd.read_json(file, lines=True)
    except (OSError, ValueError) as err:
        # Missing or unreadable file (OSError) or malformed JSON (ValueError)
        print(f'Error attempting to read file {file}: {err}')
        return None
    # Print range of claim IDs
    print(f"Range of claim IDs for {fold} is {df['id'].min()}..{df['id'].max()}")
    # Prepend a 'fold' column with the name of the fold
    df.insert(0, 'fold', fold)
    df['fold'] = df['fold'].astype('category')
    return df


## Summary statistics for each fold

Start by getting DataFrames for each fold.
The ranges of claim IDs are also printed.

In [2]:
train = read_citint('train')
dev = read_citint('dev')
test = read_citint('test')

Range of claim IDs for train is 0..2141
Range of claim IDs for dev is 0..316
Range of claim IDs for test is 0..605


Next, change the type of the `id` column from int to str so that `describe()` reports the number of unique values instead of numeric statistics.
Then summarize the data for the train, dev, and test folds.
Exclude the category dtype, which is used for the `fold` column.

In [3]:
train['id'] = train['id'].astype('str')
train.describe(exclude='category')

statistic,id,claim,evidence,cited_doc_ids
count,2138,2138,2138,2138
unique,2138,2136,1996,2138
top,2137,SARS-CoV-2 pseudoviruses were generated essent...,{},[18066]
freq,1,2,143,1


In [4]:
dev['id'] = dev['id'].astype('str')
dev.describe(exclude='category')

statistic,id,claim,evidence,cited_doc_ids
count,316,316,316,316
unique,316,316,293,316
top,315,Closing this testing gap through increased acc...,{},[39022]
freq,1,1,24,1


In [5]:
test['id'] = test['id'].astype('str')
test.describe(exclude='category')

statistic,id,claim,evidence,cited_doc_ids
count,606,606,606,606
unique,606,606,557,606
top,603,This finding confirms the delayed sleep schedu...,{},[33034]
freq,1,1,50,1


**Results**
- The claim IDs start at 0 and are unique in each fold.
- For the train and dev folds a few claim IDs are unused.
- Two claims are duplicated in the train set (2138 - 2136 = 2 repeated rows, and the most frequent claim occurs only twice); no claims are duplicated within the other folds.
- Each fold has a certain number of empty evidence statements (`{}`).
- The remaining evidence statements are unique (i.e. count - unique - freq + 1 = 0).
- Cited doc IDs are unique in each fold.
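
The uniqueness check for evidence statements can be reproduced as follows. This is a minimal sketch on a made-up Series (the values are hypothetical and stored as strings so that they are hashable); it assumes that `{}` is the most frequent value, as in the tables above, and adds a direct check with `duplicated()`.

```python
import pandas as pd

# Toy stand-in for the `evidence` column; '{}' marks empty evidence.
evidence = pd.Series(["{}", "{'1': 'a'}", "{}", "{'2': 'b'}", "{}", "{'3': 'c'}"])

stats = evidence.describe()  # count, unique, top, freq
# If '{}' is the only repeated value, the number of non-empty values
# (count - freq) equals the number of other unique values (unique - 1),
# so this difference is 0.
remainder = stats["count"] - stats["unique"] - stats["freq"] + 1

# Direct check: count duplicates among the non-empty values.
non_empty = evidence[evidence != "{}"]
n_duplicated = int(non_empty.duplicated().sum())
print(remainder, n_duplicated)
```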

**Action**
- Unused claim IDs in the train and dev folds may indicate claims that were later removed by the curators and require no further action.
- Duplicated claims in the train fold should be considered for dropping.
- Empty evidence statements can be classified as `NEI` (not enough information).
    - However, we should be aware of claims that might have an explicit `NEI` label.

## Summary statistics for the combined dataset

In [6]:
citint = pd.concat([train, dev, test], ignore_index=True)
citint.describe()

statistic,fold,id,claim,evidence,cited_doc_ids
count,3060,3060,3060,3060,3060
unique,3,2138,3056,2844,3060
top,train,277,"In particular, the expansion of neutralizing a...",{},[33034]
freq,2138,3,2,217,1


**Results**
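- There are four duplicated claims (3060 - 3056 = 4 repeated rows; the most frequent claim occurs only twice, so each duplicated claim occurs exactly twice). The train fold accounts for two of them (2138 - 2136 = 2), so the other two are repeated across folds.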
- There are no duplicated evidence statements (3060 - 2844 - 217 + 1 = 0).
- There are no duplicated cited doc IDs.

**Action**
- The duplicated claims should be investigated and possibly removed.
    - The cited doc IDs are all unique, so the duplicated claims might actually be unique examples with identical text by chance.
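
To investigate the duplicated claims, one option is to list every row whose claim text occurs more than once, next to its fold and cited document. The sketch below does this on a small made-up DataFrame (`toy` and its rows are hypothetical); on the real data the same calls could be applied to `citint`. Whether to drop such rows is a judgment call, so `drop_duplicates` is shown only as one possible policy.

```python
import pandas as pd

# Toy stand-in for the combined DataFrame (hypothetical rows).
toy = pd.DataFrame({
    "fold": ["train", "train", "dev", "test"],
    "id": ["0", "1", "0", "0"],
    "claim": ["Claim A <|cit|>.", "Claim A <|cit|>.",
              "Claim B <|cit|>.", "Claim C <|cit|>."],
    "cited_doc_ids": [[10], [11], [12], [13]],
})

# keep=False flags every member of a duplicated group, not just the repeats.
is_dupe = toy.duplicated(subset="claim", keep=False)
dupes = toy[is_dupe].sort_values("claim")
print(dupes[["fold", "id", "claim", "cited_doc_ids"]])

# One possible policy: keep the first occurrence of each claim text.
deduplicated = toy.drop_duplicates(subset="claim", keep="first")
```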

## Number of words in claims

This code calculates the number of words of the claims by splitting on spaces and getting the length of the word list.
Then we calculate summary statistics: median, min, and max number of words for each fold.

In [7]:
citint['claim_length'] = citint['claim'].apply(lambda x: len(str(x).split(' ')))
citint.groupby('fold')['claim_length'].agg(['median', 'min', 'max'])

fold,median,min,max
dev,31.0,3,142
test,29.0,5,260
train,30.0,4,606


Let's look at some of the very short claims.

In [8]:
citint[citint['claim_length'] <= 5]

index,fold,id,claim,evidence,cited_doc_ids,claim_length
611,train,772,2012; 380: 1491–7) [<|cit|>••],"{'73020': [{'sentences': [0], 'label': 'ACCURA...",[73020],4
1146,train,1443,Chen et al. [<|cit|>].,"{'63000': [{'sentences': [0], 'label': 'NOT_AC...",[63000],4
1671,train,2110,isolates in hACE2 transgenic mice<|cit|>.,"{'18028': [{'sentences': [0], 'label': 'ACCURA...",[18028],5
2253,dev,140,Kim et al[<|cit|>] are positive.,"{'40013': [{'sentences': [0], 'label': 'ACCURA...",[40013],5
2372,dev,282,"Notably, in [<|cit|>].","{'27017': [{'sentences': [0], 'label': 'NOT_AC...",[27017],3
2764,test,378,These are unprecedent times (<|multi_cit|>).,"{'25015': [{'sentences': [0], 'label': 'ACCURA...",[25015],5


**Results**
- The claims in each fold have comparable median lengths (29-31 words).
- A few claims are very short (fewer than 6 words, including the citation marker).

**Action**
- The short claims are incomplete or unspecific statements that should be considered for removal.
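
Because the citation marker itself counts as a word above, a filter on content words may be a fairer test of which claims are too short. The following is a minimal sketch under stated assumptions: the regular expression covers only the `<|cit|>` and `<|multi_cit|>` markers seen in the output above, the threshold of 6 is an assumed cutoff, and `toy`, `content_word_count`, and `min_words` are hypothetical names.

```python
import re
import pandas as pd

# Hypothetical claims; the markers mimic those seen in the dataset.
toy = pd.DataFrame({"claim": [
    "Chen et al. [<|cit|>].",
    "These are unprecedented times (<|multi_cit|>).",
    "Vaccination reduced hospital admissions in adults over the study period <|cit|>.",
]})

marker = re.compile(r"<\|(?:multi_)?cit\|>")

def content_word_count(text):
    """Count whitespace-separated tokens containing a letter or digit,
    after removing citation markers."""
    stripped = marker.sub(" ", str(text))
    return sum(1 for token in stripped.split() if re.search(r"\w", token))

toy["content_words"] = toy["claim"].apply(content_word_count)
min_words = 6  # assumed threshold
kept = toy[toy["content_words"] >= min_words]
print(toy[["claim", "content_words"]])
```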

## Class (label) distribution

Each entry in the `evidence` column is a dictionary keyed by the ID of the cited document in the corpus; its value is a list of entries, each giving the indices of the evidence sentences and the label.
Let's take a look:

In [9]:
citint['evidence'].head()

0    {'66000': [{'sentences': [0], 'label': 'ACCURA...
1    {'66001': [{'sentences': [0], 'label': 'ACCURA...
2    {'66002': [{'sentences': [0], 'label': 'ACCURA...
3    {'66003': [{'sentences': [0], 'label': 'ACCURA...
4    {'66004': [{'sentences': [0], 'label': 'ACCURA...
Name: evidence, dtype: object

To get the sentence indices and labels, use a list comprehension to index into each of the dictionaries in the `evidence` column.
We handle empty evidence by using a conditional expression in the list comprehension.
Empty evidence is treated as `NEI` (Not Enough Information).
Then, count the distribution of labels in each fold and calculate percentages.

In [10]:
sentences = [list(x.values())[0][0]['sentences'] if x != {} else None for x in citint['evidence']]
label = [list(x.values())[0][0]['label'] if x != {} else 'NEI' for x in citint['evidence']]
citint['sentences'] = sentences
citint['label'] = label
label_counts = citint.groupby(['fold', 'label']).size().unstack()
label_sum = label_counts.sum(axis=1)
label_percentage = label_counts.div(label_sum, axis=0) * 100
print(label_counts)
print(label_percentage.round())

label  ACCURATE  NEI  NOT_ACCURATE
fold                              
dev         191   24           101
test        386   50           170
train      1366  143           629
label  ACCURATE  NEI  NOT_ACCURATE
fold                              
dev        60.0  8.0          32.0
test       64.0  8.0          28.0
train      64.0  7.0          29.0


Finally, look at the values for all sentence indices.

In [11]:
citint['sentences'].value_counts(dropna=False)

sentences
[0]     2843
None     217
Name: count, dtype: int64

**Results**
- The dataset exhibits class imbalance:
    - `ACCURATE` is approximately twice as frequent as `NOT_ACCURATE`.
    - Less than 10% of labels are `NEI`.
- The folds are comparable in terms of class imbalance.
- Only the first sentence index is listed for each claim with evidence.
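
The imbalance can be expressed as per-class weights. The sketch below applies the common "balanced" heuristic, n_samples / (n_classes * class_count), to the train-fold counts from the table above. It is only one possible weighting scheme, and `train_counts` and `weights` are hypothetical names.

```python
import pandas as pd

# Train-fold label counts copied from the table above.
train_counts = pd.Series({"ACCURATE": 1366, "NEI": 143, "NOT_ACCURATE": 629})

# "Balanced" heuristic: n_samples / (n_classes * class_count).
# Rarer classes receive larger weights.
weights = train_counts.sum() / (len(train_counts) * train_counts)
print(weights.round(2))
```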

**Action**
- Model training should be adjusted for class imbalance.
- The model code should be checked for correct handling of missing evidence statements (`NEI` labels).
- Although only the first sentence is indexed, the code should be constructed to handle abstracts with more than one sentence.
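
One way to make the evidence handling robust to more than one sentence index, entry, or document is sketched below. `parse_evidence` is a hypothetical helper, not part of the dataset's tooling, and it assumes that every entry for a claim carries the same label; it raises an error if that assumption fails.

```python
def parse_evidence(evidence):
    """Return (sentences_by_doc, label) for one evidence dict.

    Collects sentence indices from every document and every entry,
    not only the first. Empty evidence is mapped to ({}, 'NEI').
    """
    if not evidence:
        return {}, "NEI"
    sentences_by_doc, labels = {}, set()
    for doc_id, entries in evidence.items():
        indices = set()
        for entry in entries:
            indices.update(entry["sentences"])
            labels.add(entry["label"])
        sentences_by_doc[doc_id] = sorted(indices)
    if len(labels) != 1:
        # Assumption: one label per claim
        raise ValueError(f"Expected one label per claim, found {sorted(labels)}")
    return sentences_by_doc, labels.pop()

print(parse_evidence({"66000": [{"sentences": [0], "label": "ACCURATE"}]}))
print(parse_evidence({}))
# Hypothetical multi-sentence case, not observed in this dataset:
print(parse_evidence({"1": [{"sentences": [2, 0], "label": "NOT_ACCURATE"}]}))
```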

## Checking that all claims have matching abstracts

Let's read and take a look at the corpus of abstracts (i.e. evidence sentences for the claims).

In [12]:
corpus = pd.read_json('../data/citint/corpus.jsonl', lines=True)
corpus.head()

index,doc_id,title,abstract
0,66000,,[Accumulating evidence indicates that lncRNAs ...
1,66001,,[We present evidence that loc285194 is a direc...
2,66002,,"[Finally, we demonstrate that loc285194 negati..."
3,66003,,[This miR-211-promoted cell growth was also se...
4,66004,,"[Moreover, a muscle-specific lncRNA, linc-MD1,..."


List claims that do not have matching abstracts.

In [13]:
claim_has_abstract = citint['cited_doc_ids'].str[0].isin(corpus['doc_id'])
print("Out of "+str(len(citint))+" claims the following do not have matching abstracts:")
citint[~claim_has_abstract]

Out of 3060 claims the following do not have matching abstracts:


index,fold,id,claim,evidence,cited_doc_ids,claim_length,sentences,label


List abstracts that do not have matching claims.

In [14]:
abstract_has_claim = corpus['doc_id'].isin(citint['cited_doc_ids'].str[0])
print("Out of "+str(len(corpus))+" abstracts the following do not have matching claims:")
corpus[~abstract_has_claim]

Out of 3063 abstracts the following do not have matching claims:


index,doc_id,title,abstract
624,44005,,[]
756,76011,,[]
1259,12013,,[]


**Results**
- Each claim has a matching abstract.
- Three abstracts do not have matching claims, and these abstracts also have no text.

**Action**
- No action is required for training the model, since the data for each claim is complete.
- The abstracts with missing text may affect some outcomes of data exploration.
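
Abstracts with no text can be flagged before data exploration by counting the sentences in each one. This is a minimal sketch on a made-up corpus (`toy_corpus` and its rows are hypothetical); on the real data the same steps could be applied to `corpus`.

```python
import pandas as pd

# Toy stand-in for the corpus; `abstract` holds lists of sentences.
toy_corpus = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "abstract": [["One sentence."], [], ["First.", "Second."]],
})

# Number of sentences per abstract; 0 marks an abstract with no text.
toy_corpus["n_sentences"] = toy_corpus["abstract"].apply(len)
non_empty_corpus = toy_corpus[toy_corpus["n_sentences"] > 0]
print(toy_corpus[["doc_id", "n_sentences"]])
```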