# Generate claim-evidence pairs with random negative sampling (RNS)

Load the data from JSON as claim-evidence pairs with a binary `related` label: `1` if the evidence is linked to the claim and `0` if it is not. The `0` instances are obtained by negative sampling.

In [1]:
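# Change the working directory to project root
import os
import pathlib
ROOT_DIR = pathlib.Path.cwd()
while not ROOT_DIR.joinpath("src").exists():
    # Stop at the filesystem root instead of looping forever
    if ROOT_DIR == ROOT_DIR.parent:
        raise FileNotFoundError("No parent directory contains 'src'")
    ROOT_DIR = ROOT_DIR.parent
os.chdir(ROOT_DIR)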

In [2]:
# Imports and dependencies
import pandas as pd
import numpy as np
random_seed = 42
np.random.seed(random_seed)

from src.data import load_as_dataframe, get_paired_texts, slice_by_claim

## Load the datasets

In [3]:
data_names = ["train-claims", "dev-claims", "evidence"]
train_claims, dev_claims, all_evidence \
    = load_as_dataframe(data_names, full_evidence=True)

Loaded train-claims
Loaded dev-claims
Loaded evidence


## Generate samples

### Positive samples

Select the source dataset.

In [4]:
src_data = train_claims

Pair each claim with its linked evidence and attach the positive label (`related=1`).

In [5]:
positive_samples = get_paired_texts(dataset_df=src_data).assign(related=1)
print(positive_samples.shape)
positive_samples.head(10)

(4122, 3)


,claim_text,evidence_text,related
0,Not only is there no scientific evidence that ...,At very high concentrations (100 times atmosph...,1
1,Not only is there no scientific evidence that ...,Plants can grow as much as 50 percent faster i...,1
2,Not only is there no scientific evidence that ...,Higher carbon dioxide concentrations will favo...,1
3,El Niño drove record highs in global temperatu...,While ‘climate change’ can be due to natural f...,1
4,El Niño drove record highs in global temperatu...,This acceleration is due mostly to human-cause...,1
5,"In 1946, PDO switched to a cool phase.",There is evidence of reversals in the prevaili...,1
6,"In 1946, PDO switched to a cool phase.","1945/1946: The PDO changed to a ""cool"" phase, ...",1
7,Weather Channel co-founder John Coleman provid...,There is no convincing scientific evidence tha...,1
8,Weather Channel co-founder John Coleman provid...,"He has called global warming the ""greatest sca...",1
9,Weather Channel co-founder John Coleman provid...,International Council of Academies of Engineer...,1
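
For readers without access to `src.data`, the pairing step can be pictured as flattening each claim's evidence list into one row per (claim, evidence) pair. The sketch below is not the project's `get_paired_texts`; the record layout (`claim_text`, `evidences`), the toy texts, and the helper name `pair_claims_with_evidence` are all assumptions made for illustration.

```python
# Hypothetical toy data; the real JSON layout may differ.
claims = {
    "claim-1": {"claim_text": "claim A", "evidences": ["ev-1", "ev-2"]},
    "claim-2": {"claim_text": "claim B", "evidences": ["ev-3"]},
}
evidence = {
    "ev-1": "evidence text 1",
    "ev-2": "evidence text 2",
    "ev-3": "evidence text 3",
}


def pair_claims_with_evidence(claims, evidence):
    """Return one positive record per (claim, linked evidence) pair."""
    return [
        {
            "claim_text": claim["claim_text"],
            "evidence_text": evidence[evidence_id],
            "related": 1,
        }
        for claim in claims.values()
        for evidence_id in claim["evidences"]
    ]


positive_pairs = pair_claims_with_evidence(claims, evidence)
print(len(positive_pairs))
```

A claim with several linked evidences contributes several rows, which is why the same `claim_text` repeats in the output above.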


### Related negative samples

Related negative samples are drawn from the evidence linked to other claims in the source data. Because that evidence is on-topic, it is likely to be closer in similarity to the claim than a sentence drawn from the larger evidence corpus. For each claim, sample as many negatives as it has positive pairs.

In [6]:
related_negative_samples = (
    src_data
    .groupby(level=["claim", "claim_text"], sort=False, as_index=True)
    # For each claim, sample from the rows that belong to other claims.
    # Note: the same random_state is reused for every group.
    .apply(lambda g, src: (src
        .loc[src.index.get_level_values("claim") != g.name[0]]
        .droplevel(["claim", "claim_text", "claim_label"])
        .sample(n=g.shape[0], random_state=random_seed, ignore_index=False)
    ), src=src_data)
    .reset_index()
    # Keep only the text columns and attach the negative label
    .get(["claim_text", "evidence_text"])
    .assign(related=0)
)
print(related_negative_samples.shape)
related_negative_samples.head(10)

(4122, 3)


,claim_text,evidence_text,related
0,Not only is there no scientific evidence that ...,The Legendre symbol was introduced by Adrien-M...,0
1,Not only is there no scientific evidence that ...,They were already working with the Met Office ...,0
2,Not only is there no scientific evidence that ...,The 2014 flip from the cool PDO phase to the w...,0
3,El Niño drove record highs in global temperatu...,The term hockey stick was coined by the climat...,0
4,El Niño drove record highs in global temperatu...,Climate model projections summarized in the re...,0
5,"In 1946, PDO switched to a cool phase.",The term hockey stick was coined by the climat...,0
6,"In 1946, PDO switched to a cool phase.",Climate model projections summarized in the re...,0
7,Weather Channel co-founder John Coleman provid...,The 2014–16 El Niño was a warming of the easte...,0
8,Weather Channel co-founder John Coleman provid...,"To make accurate records, tide gauges at fixed...",0
9,Weather Channel co-founder John Coleman provid...,Climate scientists have reached a consensus th...,0
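
In the output above, rows 3-4 and rows 5-6 receive the same two evidence sentences. This is most likely because every group is sampled with the same `random_state`, so claims with the same number of evidences draw from the same positions. One possible alternative, sketched below in plain Python on hypothetical toy ids, is to share a single random generator across claims so that each claim gets an independent draw. The sketch also removes a claim's own evidence from the pool in case it is linked to another claim as well. The function name `sample_related_negatives` and the id-based data layout are assumptions, not the notebook's implementation.

```python
import random

# Hypothetical toy mapping: claim id -> ids of evidence linked to that claim.
claim_evidence = {
    "claim-1": ["ev-1", "ev-2"],
    "claim-2": ["ev-3", "ev-4"],
    "claim-3": ["ev-2", "ev-5", "ev-6"],
}


def sample_related_negatives(claim_evidence, seed=42):
    """Draw, for each claim, as many negatives as it has positives.

    Negatives come from evidence linked to *other* claims. One shared
    generator is used so that each claim gets an independent draw.
    """
    rng = random.Random(seed)
    negatives = {}
    for claim_id, own_ids in claim_evidence.items():
        own = set(own_ids)
        # Pool: evidence linked to other claims, minus this claim's own evidence.
        pool = sorted({
            evidence_id
            for other_id, ids in claim_evidence.items()
            if other_id != claim_id
            for evidence_id in ids
            if evidence_id not in own
        })
        negatives[claim_id] = rng.sample(pool, k=len(own_ids))
    return negatives


related_negatives = sample_related_negatives(claim_evidence)
```

Unlike the pandas cell, this sketch deduplicates the pool by evidence id, so evidence linked to many claims is not more likely to be drawn.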


### Distant negative samples

Distant negative samples are drawn at random from the full evidence corpus, excluding the evidence linked to the claim. Since they are not linked to the claim in the training data, they are assumed to be less similar to it. For each claim, sample as many negatives as it has positive pairs.

In [7]:
distant_negative_samples = (
    src_data
    .groupby(level=["claim", "claim_text"], sort=False, as_index=True)
    # For each claim, sample from the corpus after dropping the evidence
    # that is linked to that claim.
    .apply(lambda g, all_evidence: (all_evidence
        .loc[~all_evidence.index.isin(g.index.get_level_values("evidences"))]
        .sample(n=g.shape[0], random_state=random_seed, ignore_index=False)
    ), all_evidence=all_evidence)
    .reset_index()
    # Keep only the text columns and attach the negative label
    .get(["claim_text", "evidence_text"])
    .assign(related=0)
)
print(distant_negative_samples.shape)
distant_negative_samples.head(10)

(4122, 3)


,claim_text,evidence_text,related
0,Not only is there no scientific evidence that ...,Sabit Hadžić (born 7 August 1957 in Sarajevo) ...,0
1,Not only is there no scientific evidence that ...,"It was described by Albert Günther in 1864, or...",0
2,Not only is there no scientific evidence that ...,Mahsa Vahdat took part in the albums Listen to...,0
3,El Niño drove record highs in global temperatu...,A large hangar and maintenance facilities were...,0
4,El Niño drove record highs in global temperatu...,"In 1952, Tony Stecher sold a one-third interes...",0
5,"In 1946, PDO switched to a cool phase.",An eye bolt is a bolt with a loop at one end.,0
6,"In 1946, PDO switched to a cool phase.","In 1952, Tony Stecher sold a one-third interes...",0
7,Weather Channel co-founder John Coleman provid...,"20 nations participated in the tournament, whi...",0
8,Weather Channel co-founder John Coleman provid...,"Calvary Episcopal Church (Burnt Hills, New Yor...",0
9,Weather Channel co-founder John Coleman provid...,Jean Isabel Melzer (7 February 192618 June 201...,0
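
The cell above filters the whole corpus once per claim before sampling. If the corpus is large, an alternative worth considering is rejection sampling: draw ids at random from the full corpus and discard any that are linked to the claim. The sketch below illustrates that idea on hypothetical toy ids. It is a swapped-in technique, not what the notebook does, and the name `sample_distant_negatives` is an assumption.

```python
import random

# Hypothetical toy corpus ids and one claim's linked evidence ids.
corpus_ids = [f"ev-{i}" for i in range(1000)]
linked_ids = {"ev-3", "ev-17"}


def sample_distant_negatives(corpus_ids, linked_ids, k, rng):
    """Pick k distinct ids from corpus_ids that are not in linked_ids.

    Uses rejection sampling: draw at random and discard collisions, which
    avoids building a filtered copy of the corpus for every claim.
    """
    # Conservative guard against an endless loop; assumes corpus_ids are unique.
    if k > len(corpus_ids) - len(linked_ids):
        raise ValueError("not enough unlinked evidence to sample from")
    chosen = set()
    while len(chosen) < k:
        candidate = rng.choice(corpus_ids)
        if candidate not in linked_ids:
            chosen.add(candidate)
    return sorted(chosen)


distant = sample_distant_negatives(
    corpus_ids, linked_ids, k=len(linked_ids), rng=random.Random(42)
)
```

Rejection sampling stays cheap as long as the linked evidence is a small fraction of the corpus, because collisions are then rare.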


### Combine positive and negative samples

The combined dataset should have shape $ (3n, 3) $, where $ n $ is the number of claim-evidence pairs in the source dataset.

In [8]:
combined_samples = pd.concat([
    positive_samples,
    related_negative_samples,
    distant_negative_samples
])
print(combined_samples.shape)
assert combined_samples.shape[0] == 3 * positive_samples.shape[0]
combined_samples.head(10)

(12366, 3)


,claim_text,evidence_text,related
0,Not only is there no scientific evidence that ...,At very high concentrations (100 times atmosph...,1
1,Not only is there no scientific evidence that ...,Plants can grow as much as 50 percent faster i...,1
2,Not only is there no scientific evidence that ...,Higher carbon dioxide concentrations will favo...,1
3,El Niño drove record highs in global temperatu...,While ‘climate change’ can be due to natural f...,1
4,El Niño drove record highs in global temperatu...,This acceleration is due mostly to human-cause...,1
5,"In 1946, PDO switched to a cool phase.",There is evidence of reversals in the prevaili...,1
6,"In 1946, PDO switched to a cool phase.","1945/1946: The PDO changed to a ""cool"" phase, ...",1
7,Weather Channel co-founder John Coleman provid...,There is no convincing scientific evidence tha...,1
8,Weather Channel co-founder John Coleman provid...,"He has called global warming the ""greatest sca...",1
9,Weather Channel co-founder John Coleman provid...,International Council of Academies of Engineer...,1
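
Because every positive pair is matched by one related and one distant negative, the combined data is imbalanced at one positive to two negatives. The short check below works this out from the row count reported above. It rebuilds only the label column, not the actual data, and the variable names are illustrative.

```python
from collections import Counter

n_positive = 4122  # number of positive pairs reported above

# Label column of the combined data: positives, then the two negative sets.
labels = [1] * n_positive + [0] * n_positive + [0] * n_positive

counts = Counter(labels)
positive_fraction = counts[1] / len(labels)

print(len(labels))                  # 12366
print(counts[1], counts[0])         # 4122 8244
print(round(positive_fraction, 3))  # 0.333
```

The 1:2 ratio is worth keeping in mind when choosing evaluation metrics or class weights for a model trained on these pairs.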


Save a copy of the samples as JSON.

In [14]:
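output_file = \
    ROOT_DIR.joinpath("./result/train_data/claim_evidence_pair_rns.json")
# Create the output directory if it does not exist yet
output_file.parent.mkdir(parents=True, exist_ok=True)
with open(output_file, mode="w") as f:
    combined_samples.to_json(f, orient="records")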
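
With `orient="records"`, pandas writes a JSON array containing one object per row, keyed by column name. The sketch below shows how such a file could be read back and sanity-checked with the standard library alone. It writes its own toy records to a temporary file rather than touching the real output, and the file name `pairs.json` is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Toy records in the same shape that orient="records" produces.
records = [
    {"claim_text": "claim A", "evidence_text": "evidence 1", "related": 1},
    {"claim_text": "claim A", "evidence_text": "evidence 2", "related": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pairs.json"  # hypothetical file name
    path.write_text(json.dumps(records), encoding="utf-8")

    # Read the file back while the temporary directory still exists.
    loaded = json.loads(path.read_text(encoding="utf-8"))

# Every row should carry exactly the three expected fields and a binary label.
expected_keys = {"claim_text", "evidence_text", "related"}
assert all(set(row) == expected_keys for row in loaded)
assert all(row["related"] in (0, 1) for row in loaded)
```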