# Data Wrangling for SciFact

Written by Jeffrey Dick, 2024-12-14

This notebook wrangles data from the **SciFact** dataset.
The purpose of the data wrangling is to identify and deal with:
- duplicated examples
- missing information
- outliers (e.g. very short claims)

Here are the main tasks:

- Summary statistics are calculated for each column in the dataset.
- Each fold (train, dev, test) is treated individually, then they are combined.
- The `claim` and `evidence` columns are processed to get word counts and labels.
- Separately from the claims, we read the `corpus.jsonl` file to get the abstracts and check that `cited_doc_ids` in the claims have matching abstracts.
- *A separate notebook* will be created for exploratory data analysis of the abstracts.

First we define a function to read a given fold of the dataset.

In [1]:
import pandas as pd

def read_scifact(fold):
    """
    Reads the given fold of the SciFact dataset
    (train, dev, test).
    Returns a DataFrame
    """
    file = f'../data/scifact/claims_{fold}.jsonl'
    try:
        df = pd.read_json(file, lines=True)
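    except (OSError, ValueError) as err:
        # OSError covers a missing file; ValueError covers malformed JSON
        print(f'Error attempting to read file {file}: {err}')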
        return None
    # Print range of claim IDs
    print(f"Range of claim IDs for {fold} is {df['id'].min()}..{df['id'].max()}")
    # Prepend a 'fold' column with the name of the fold
    df.insert(0, 'fold', fold)
    df['fold'] = df['fold'].astype('category')
    return df


## Summary statistics for each fold

Start by getting DataFrames for each fold.
The ranges of claim IDs are also printed.

In [2]:
train = read_scifact('train')
dev = read_scifact('dev')
test = read_scifact('test')

Range of claim IDs for train is 0..1407
Range of claim IDs for dev is 1..1395
Range of claim IDs for test is 7..1408


Next, change the type of the `id` column from int to str so that `describe()` reports unique-value counts for it.
Then summarize the data for the train, dev, and test folds.
Exclude the category dtype, which is used for the `fold` column.

In [3]:
train['id'] = train['id'].astype('str')
train.describe(exclude='category')

Unnamed: 0,id,claim,evidence,cited_doc_ids
count,809,809,809,809
unique,809,807,481,513
top,1407,Adult tissue-resident macrophages are seeded b...,{},[16472469]
freq,1,2,304,8


In [4]:
dev['id'] = dev['id'].astype('str')
dev.describe(exclude='category')

Unnamed: 0,id,claim,evidence,cited_doc_ids
count,300,300,300,300
unique,300,300,188,250
top,1395,p16INK4A accumulation is linked to an abnorma...,{},[21366394]
freq,1,1,112,4


In [5]:
test['id'] = test['id'].astype('str')
test.describe(exclude='category')

Unnamed: 0,id,claim
count,300,300
unique,300,297
top,1408,Hematopoietic progenitor cells are never susce...
freq,1,2


**Results**
- The claim IDs are unique in each fold.
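- The train fold has 2 duplicated claims (807 unique out of 809) and the test fold has 3 (297 unique out of 300); the dev fold has none.
- The train and dev folds have 304 and 112 empty evidence statements (`{}`), respectively; the test fold has no `evidence` or `cited_doc_ids` columns.
- Many `cited_doc_ids` values are shared by more than one claim in the train and dev folds (513 unique out of 809, and 250 unique out of 300).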

**Action**
- The duplicated claims should be considered for removal.
- Empty evidence statements should be labelled with `NEI` (not enough information).
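
One way to inspect the duplicated claims before deciding on removal is sketched below. It uses a small made-up DataFrame (`fold_df`) in place of a real fold, so the claims and IDs are illustrative assumptions, not rows from SciFact.

```python
import pandas as pd

# Toy stand-in for one fold (hypothetical data); the notebook would use train, dev, or test
fold_df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'claim': ['A binds B.', 'C inhibits D.', 'A binds B.', 'E activates F.'],
})

# keep=False marks every member of a duplicated group, not only the repeats
dup_mask = fold_df.duplicated(subset='claim', keep=False)
duplicates = fold_df[dup_mask]
print(duplicates)

# One possible removal policy: keep the first occurrence of each claim
deduplicated = fold_df.drop_duplicates(subset='claim', keep='first')
print(len(fold_df), '->', len(deduplicated))
```

Listing the duplicates first makes it possible to compare their `evidence` values, since two rows with the same claim text are not guaranteed to carry the same label.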

## Summary statistics for the combined dataset

In [6]:
scifact = pd.concat([train, dev, test], ignore_index=True)
scifact.describe()

Unnamed: 0,fold,id,claim,evidence,cited_doc_ids
count,1409,1409,1409,1109,1109
unique,3,1409,1401,648,606
top,train,1408,Obesity decreases life quality.,{},[16472469]
freq,809,1,2,416,9


**Results**
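- Only 1401 of the 1409 claims are unique. The per-fold unique counts sum to 1404 (807 + 300 + 297), so 3 claims are duplicated across folds, in addition to the 5 duplicates within the train and test folds.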

**Action**
- The duplicated claims should be investigated and possibly removed.
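
A minimal sketch for finding claims that appear in more than one fold is given below. The `combined` DataFrame is a made-up stand-in for `scifact`, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the combined dataset (hypothetical rows)
combined = pd.DataFrame({
    'fold': ['train', 'train', 'dev', 'test', 'test'],
    'id': ['1', '2', '3', '4', '5'],
    'claim': ['A binds B.', 'C inhibits D.', 'A binds B.',
              'G blocks H.', 'G blocks H.'],
})

# Number of distinct folds in which each claim text appears
folds_per_claim = combined.groupby('claim')['fold'].nunique()
cross_fold_claims = folds_per_claim[folds_per_claim > 1].index

# Rows whose claim text occurs in at least two folds
cross_fold = combined[combined['claim'].isin(cross_fold_claims)]
print(cross_fold[['fold', 'id', 'claim']])
```

Counting folds per claim separates cross-fold duplicates (a potential source of leakage between train and dev) from duplicates that stay inside a single fold.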

## Number of words in claims

This code calculates the number of words in each claim by splitting on spaces and taking the length of the resulting word list.
Then we calculate summary statistics: median, min, and max number of words for each fold.

In [7]:
scifact['claim_length'] = scifact['claim'].apply(lambda x: len(str(x).split(' ')))
scifact.groupby('fold')['claim_length'].agg(['median', 'min', 'max'])

fold,median,min,max
dev,12.0,4,29
test,12.0,4,39
train,12.0,3,39
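
Splitting on a single space counts an empty string for every extra space, so the word count would be inflated if any claim contained consecutive or leading spaces (this is not checked here). The sketch below compares the two splitting styles on made-up strings; the example claims are assumptions chosen to show the difference.

```python
import pandas as pd

# Hypothetical claims with a double space and a leading space
claims = pd.Series(['Obesity decreases  life quality.',
                    ' DUSP4 increases apoptosis.'])

# Splitting on a single space yields empty strings for the extra spaces
count_single_space = claims.apply(lambda x: len(str(x).split(' ')))

# Splitting with no argument collapses runs of whitespace
count_whitespace = claims.str.split().str.len()

print(pd.DataFrame({'split_space': count_single_space,
                    'split_any': count_whitespace}))
```

If the two counts agree for every claim in the dataset, the simpler single-space split is sufficient.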


Let's look at some of the very short claims.

In [8]:
scifact[scifact['claim_length'] <= 4]

Unnamed: 0,fold,id,claim,evidence,cited_doc_ids,claim_length
8,train,14,5'-nucleotidase metabolizes 6MP.,{},[641786],3
176,train,306,DUSP4 decreases apoptosis.,"{'7821634': [{'sentences': [6], 'label': 'CONT...",[7821634],3
178,train,309,DUSP4 increases apoptosis.,"{'7821634': [{'sentences': [6], 'label': 'SUPP...",[7821634],3
500,train,871,Obesity decreases life quality.,{},[195689316],4
501,train,876,Obesity raises life quality.,{},[195689316],4
569,train,992,Pyridostatin deregulates G2/M progression.,"{'16472469': [{'sentences': [5], 'label': 'SUP...",[16472469],4
571,train,996,Pyridostatin induces checkpoint activation.,"{'16472469': [{'sentences': [5], 'label': 'SUP...",[16472469],4
743,train,1304,Tirasemtiv targets cardiac muscle.,{},[12631697],4
744,train,1305,Tirasemtiv targets fast-twitch muscle.,"{'12631697': [{'sentences': [1], 'label': 'SUP...",[12631697],4
999,dev,870,Obesity decreases life quality.,{},[195689316],4


**Results**
- The claims in each fold have the same median lengths (12 words).
- The short claims are sensible statements.
    - Unlike **Citation-Integrity**, the claims have no citation markers.

**Action**
- No action is needed.

## Class (label) distribution

The `evidence` column contains dictionaries keyed by the doc ID of the evidence abstract in the corpus; each value is a list of entries holding the indices of the evidence sentences and the label.
Let's take a look:

In [9]:
scifact['evidence'].head()

0                                                   {}
1    {'13734012': [{'sentences': [4], 'label': 'CON...
2                                                   {}
3                                                   {}
4    {'44265107': [{'sentences': [15], 'label': 'SU...
Name: evidence, dtype: object

To get the sentence indices and labels, use a list comprehension to index into each dictionary in the `evidence` column.
Only the first evidence entry for the first cited document is used.
We handle empty evidence with a conditional expression in the list comprehension.
Empty evidence is treated as `NEI` (Not Enough Information).
Then, count the labels in each fold and calculate percentages.

In [10]:
sentences = [list(x.values())[0][0]['sentences'] if not (x == {} or pd.isnull(x)) else None for x in scifact['evidence']]
label = [list(x.values())[0][0]['label'] if not (x == {} or pd.isnull(x)) else 'NEI' for x in scifact['evidence']]
scifact['sentences'] = sentences
scifact['label'] = label
# No labels are available for the test fold
scifact.loc[pd.isnull(scifact['evidence']), 'label'] = None
label_counts = scifact.groupby(['fold', 'label']).size().unstack()
label_sum = label_counts.sum(axis=1)
label_percentage = label_counts.div(label_sum, axis=0) * 100
print(label_counts)
print(label_percentage.round())

label  CONTRADICT  NEI  SUPPORT
fold                           
dev            64  112      124
train         173  304      332
label  CONTRADICT   NEI  SUPPORT
fold                            
dev          21.0  37.0     41.0
train        21.0  38.0     41.0
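
The list comprehensions above read only `list(x.values())[0][0]`, that is, the first entry of the first document in each dictionary. A hedged sketch of a helper that collects the labels from every entry is shown below; `evidence_labels` is a hypothetical function, and the doc IDs in the examples are made up.

```python
import pandas as pd

def evidence_labels(evidence):
    """Return the set of labels found in one evidence value (hypothetical helper)."""
    # Missing evidence (NaN, as in the test fold): no label is available
    if not isinstance(evidence, dict):
        return None
    # Empty dictionary: treated as NEI, as in the notebook
    if not evidence:
        return {'NEI'}
    # Collect the label of every entry for every cited document
    return {entry['label'] for entries in evidence.values() for entry in entries}

# Made-up evidence values covering the three cases
examples = pd.Series([
    {},
    {'111': [{'sentences': [4], 'label': 'CONTRADICT'}]},
    {'222': [{'sentences': [1], 'label': 'SUPPORT'}],
     '333': [{'sentences': [0, 2], 'label': 'SUPPORT'}]},
    float('nan'),
], dtype=object)

label_sets = examples.apply(evidence_labels)
print(label_sets)
```

A set with more than one element would flag a claim whose evidence entries disagree, which the first-entry approach cannot detect.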


Finally, look at the values for sentence indices.

In [11]:
scifact['sentences'].value_counts(dropna=False).head(20)

sentences
None       716
[4]         90
[3]         79
[2]         65
[5]         62
[1]         57
[7]         46
[6]         43
[10]        32
[0]         32
[8]         31
[9]         26
[11]        21
[12]        17
[13]        10
[2, 3]       9
[4, 5]       6
[5, 6]       6
[9, 10]      4
[3, 4]       4
Name: count, dtype: int64

**Results**
- The dataset exhibits class imbalance:
    - `SUPPORT` is approximately twice as frequent as `CONTRADICT`.
    - Nearly 40% of labels are `NEI` (much higher than 10% in **Citation-Integrity**).
- The train and dev folds are comparable in terms of class imbalance.
- Some claims have multiple sentences indexed from the abstracts.

**Action**
- Model training should be adjusted for class imbalance.
- The code should be checked for correct handling of missing evidence statements (`NEI` labels).
- The code should correctly handle labels in different datasets (**SciFact**: `SUPPORT` and `CONTRADICT`; **Citation-Integrity**: `ACCURATE` and `NOT_ACCURATE`).
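
One common way to adjust for class imbalance is to weight each class inversely to its frequency. The sketch below applies the "balanced" formula, n_samples / (n_classes * class_count), to the train-fold counts reported above. Whether these weights suit the eventual model is an assumption to be tested, not a recommendation from this notebook.

```python
import pandas as pd

# Train-fold label counts from the table above
train_counts = pd.Series({'CONTRADICT': 173, 'NEI': 304, 'SUPPORT': 332})

# Balanced weights: n_samples / (n_classes * count per class)
class_weights = train_counts.sum() / (len(train_counts) * train_counts)
print(class_weights.round(2))
```

With these weights the rarest class (`CONTRADICT`) receives the largest weight, and the weighted counts of all classes are equal.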

## Checking that all claims have matching abstracts

Let's read and take a look at the corpus of abstracts (i.e. evidence sentences for the claims).

In [12]:
corpus = pd.read_json('../data/scifact/corpus.jsonl', lines=True)
corpus.head()

Unnamed: 0,doc_id,title,abstract,structured
0,4983,Microstructural development of human newborn c...,[Alterations of the architecture of cerebral w...,False
1,5836,Induction of myelodysplasia by myeloid-derived...,[Myelodysplastic syndromes (MDS) are age-depen...,False
2,7912,"BC1 RNA, the transcript from a master gene for...",[ID elements are short interspersed elements (...,False
3,18670,The DNA Methylome of Human Peripheral Blood Mo...,[DNA methylation plays an important role in bi...,False
4,19238,The human myelin basic protein gene is include...,[Two human Golli (for gene expressed in the ol...,False


List claims that do not have matching abstracts.
Only the first ID in each claim's `cited_doc_ids` list is checked.

In [13]:
claim_has_abstract = scifact['cited_doc_ids'].str[0].isin(corpus['doc_id'])
print("Out of "+str(len(scifact))+" claims the following do not have matching abstracts:")
scifact[~claim_has_abstract]

Out of 1409 claims the following do not have matching abstracts:


Unnamed: 0,fold,id,claim,evidence,cited_doc_ids,claim_length,sentences,label
1109,test,7,10-20% of people with severe mental disorder r...,,,16,,
1110,test,8,25% of patients with melanoma and an objective...,,,19,,
1111,test,16,50% of patients exposed to radiation have acti...,,,11,,
1112,test,23,8% of burn patients are admitted for hospitali...,,,20,,
1113,test,29,A breast cancer patient's capacity to metaboli...,,,14,,
...,...,...,...,...,...,...,...,...
1404,test,1384,c-MYC is important for maintaining pluripotent...,,,11,,
1405,test,1388,mTORC2 inhibits xCT antiporter through phospho...,,,6,,
1406,test,1396,p16INK4A accumulation is encoded by CDKN2A.,,,6,,
1407,test,1399,p53 controls autophagy through the AMPK/mTOR-d...,,,7,,


**Results**
- Every claim in the train and dev folds has a matching abstract. The test claims have no `cited_doc_ids`, so none of them can be matched.

**Action**
- Predictions for the test fold need to be made without abstracts.
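
The check above looks only at the first element of `cited_doc_ids` (`.str[0]`). A sketch that checks every cited doc ID is shown below, using made-up stand-ins (`claims`, `corpus_ids`) for `scifact` and `corpus['doc_id']`.

```python
import pandas as pd

# Toy stand-ins (hypothetical data); the last claim has no cited_doc_ids
claims = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'cited_doc_ids': [[10], [10, 20], [30, 99], None],
})
corpus_ids = pd.Series([10, 20, 30], name='doc_id')

# One row per (claim, cited doc ID); the original index is repeated
exploded = claims['cited_doc_ids'].explode().dropna().astype('int64')
found = exploded.isin(corpus_ids)

# A claim passes only if all of its cited doc IDs are in the corpus;
# claims without any cited doc ID are marked as not matched
all_found = found.groupby(level=0).all().reindex(claims.index, fill_value=False)
missing = claims[~all_found]
print(missing)
```

Here claim 3 is reported because one of its two cited doc IDs is absent from the corpus, a case the first-element check would miss.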