# Preprocessing of claim-evidence pairs using a spaCy pipeline

References:
- [spaCy linguistic features](https://spacy.io/usage/linguistic-features)
- [spaCy processing pipelines](https://spacy.io/usage/processing-pipelines)
- [spaCy custom components](https://spacy.io/usage/processing-pipelines#custom-components)

In [1]:
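# Change the working directory to the project root,
# i.e. the first ancestor directory that contains "src"
import pathlib
import os
ROOT_DIR = pathlib.Path.cwd()
while not ROOT_DIR.joinpath("src").exists():
    if ROOT_DIR == ROOT_DIR.parent:  # reached the filesystem root without a match
        raise FileNotFoundError("No ancestor directory contains 'src'")
    ROOT_DIR = ROOT_DIR.parent
os.chdir(ROOT_DIR)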

In [2]:
# Imports and dependencies
import pandas as pd
import numpy as np
import spacy
import torch
from sentence_transformers import SentenceTransformer, util
from src.torch_utils import get_torch_device
from src.spacy_utils import repl_special_token, process_sentence
from typing import Callable, Tuple
import copy
import re

random_seed = 42
np.random.seed(random_seed)
torch_device = get_torch_device()
pd.set_option("display.max_colwidth", 0)

Torch device is 'mps'




## Select models

In [3]:
nlp = spacy.load("en_core_web_trf")
nlp

<spacy.lang.en.English at 0x103719c40>

In [4]:
embedder = SentenceTransformer(
    "sentence-transformers/msmarco-bert-base-dot-v5",
    device=torch_device
)
embedder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

## Load pair samples

The data must have shape $(n, 5)$ with the following columns:
- `claim`: claim id
- `claim_text`: claim text string
- `evidence`: evidence id
- `evidence_text`: evidence text string
- `related`: relation labels as `1/0`
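
The expected layout can be checked before the data is indexed. The following is a minimal sketch, assuming the JSON file holds a list of records (as `orient="records"` suggests); `REQUIRED_COLUMNS` and `validate_pair_records` are hypothetical names and not part of the project code.

```python
# Hypothetical helper: checks records-oriented data against the column list above.
REQUIRED_COLUMNS = ("claim", "claim_text", "evidence", "evidence_text", "related")


def validate_pair_records(records):
    """Return a list of (index, problem) tuples for malformed records."""
    problems = []
    for i, rec in enumerate(records):
        missing = [c for c in REQUIRED_COLUMNS if c not in rec]
        if missing:
            problems.append((i, f"missing columns: {missing}"))
        elif rec["related"] not in (0, 1):
            problems.append((i, f"unexpected label: {rec['related']!r}"))
    return problems


# Made-up records for illustration only; the second one lacks `evidence_text`.
sample = [
    {"claim": "claim-1", "claim_text": "a", "evidence": "evidence-1",
     "evidence_text": "b", "related": 1},
    {"claim": "claim-2", "claim_text": "c", "evidence": "evidence-2",
     "related": 0},
]
problems = validate_pair_records(sample)
print(problems)
```

An empty list means every record carries all five fields and a `1/0` label.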

In [5]:
train_data_file_path = \
    ROOT_DIR.joinpath("./result/train_data/train_claim_evidence_pair_rns.json")
with open(train_data_file_path, mode="r") as f:
    train_data = (
        pd.read_json(f, orient="records")
        .set_index(["claim", "evidence"])
    )
print(train_data.shape)
train_data.head(60)

(12366, 3)


claim,evidence,claim_text,evidence_text,related
claim-1937,evidence-442946,"Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.","At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.",1
claim-1937,evidence-1194317,"Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.","Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.",1
claim-1937,evidence-12171,"Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.",Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.,1
claim-126,evidence-338219,El Niño drove record highs in global temperatures suggesting rise may not be down to man-made emissions.,"While ‘climate change’ can be due to natural forces or human activity, there is now substantial evidence to indicate that human activity – and specifically increased greenhouse gas (GHGs) emissions – is a key factor in the pace and extent of global temperature increases.",1
claim-126,evidence-1127398,El Niño drove record highs in global temperatures suggesting rise may not be down to man-made emissions.,"This acceleration is due mostly to human-caused global warming, which is driving thermal expansion of seawater and the melting of land-based ice sheets and glaciers.",1
claim-2510,evidence-530063,"In 1946, PDO switched to a cool phase.","There is evidence of reversals in the prevailing polarity (meaning changes in cool surface waters versus warm surface waters within the region) of the oscillation occurring around 1925, 1947, and 1977; the last two reversals corresponded with dramatic shifts in salmon production regimes in the North Pacific Ocean.",1
claim-2510,evidence-984887,"In 1946, PDO switched to a cool phase.","1945/1946: The PDO changed to a ""cool"" phase, the pattern of this regime shift is similar to the 1970s episode with maximum amplitude in the subarctic and subtropical front but with a greater signature near the Japan while the 1970s shift was stronger near the American west coast.",1
claim-2021,evidence-1177431,Weather Channel co-founder John Coleman provided evidence that convincingly refutes the concept of anthropogenic global warming.,"There is no convincing scientific evidence that human release of carbon dioxide, methane, or other greenhouse gases is causing or will, in the foreseeable future, cause catastrophic heating of the Earth's atmosphere and disruption of the Earth's climate.",1
claim-2021,evidence-782448,Weather Channel co-founder John Coleman provided evidence that convincingly refutes the concept of anthropogenic global warming.,"He has called global warming the ""greatest scam in history"" and made numerous false or misleading claims about climate science.",1
claim-2021,evidence-540069,Weather Channel co-founder John Coleman provided evidence that convincingly refutes the concept of anthropogenic global warming.,"International Council of Academies of Engineering and Technological Sciences (CAETS) in 2007, issued a Statement on Environment and Sustainable Growth: As reported by the Intergovernmental Panel on Climate Change (IPCC), most of the observed global warming since the mid-20th century is very likely due to human-produced emission of greenhouse gases and this warming will continue unabated if present anthropogenic emissions continue or, worse, expand without control.",1


## Basic preprocessing exploration

In [6]:
def get_emb_similarity(claim_texts: list, evidence_texts: list) -> torch.Tensor:
    """Return the cosine-similarity matrix between claim and evidence embeddings."""
    emb_kwargs = {"convert_to_tensor": True, "device": torch_device}
    claim_emb = embedder.encode(sentences=claim_texts, **emb_kwargs)
    evidence_emb = embedder.encode(sentences=evidence_texts, **emb_kwargs)
    # util.cos_sim scores every claim against every evidence text
    return util.cos_sim(a=claim_emb, b=evidence_emb)
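
`util.cos_sim` returns a matrix whose entry $(i, j)$ is the cosine similarity between the $i$-th claim embedding and the $j$-th evidence embedding, so a single pair yields a $1 \times 1$ tensor. The pure-Python sketch below illustrates that computation on toy vectors; `cosine_similarity_matrix` is a hypothetical stand-in written for illustration, not the library implementation.

```python
import math


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
        for u in a
    ]


# Toy 2-dimensional "embeddings": 2 claims and 3 evidence texts.
claims = [[1.0, 0.0], [1.0, 1.0]]
evidences = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = cosine_similarity_matrix(claims, evidences)
print(len(scores), len(scores[0]))  # number of claims, number of evidence texts
```

Vectors that point in the same direction score 1 regardless of their length, and orthogonal vectors score 0.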

### Lower case

This test checks whether casting the text to lower case changes the embedding similarity.

In [7]:
test_A = train_data.loc[("claim-2510", "evidence-984887")]
test_A



claim,evidence,claim_text,evidence_text,related
claim-2510,evidence-984887,"In 1946, PDO switched to a cool phase.","1945/1946: The PDO changed to a ""cool"" phase, the pattern of this regime shift is similar to the 1970s episode with maximum amplitude in the subarctic and subtropical front but with a greater signature near the Japan while the 1970s shift was stronger near the American west coast.",1


In [8]:
get_emb_similarity(
    claim_texts=test_A["claim_text"].tolist(),
    evidence_texts=test_A["evidence_text"].values.tolist()
)

tensor([[0.9589]], device='mps:0')

In [9]:
get_emb_similarity(
    claim_texts=test_A["claim_text"].str.lower().tolist(),
    evidence_texts=test_A["evidence_text"].str.lower().values.tolist()
)

tensor([[0.9589]], device='mps:0')

### Replace special tokens

In [10]:
test_nlp = copy.deepcopy(nlp)
test_nlp.add_pipe("repl_special_token", first=True)
test_nlp.pipe_names

['repl_special_token',
 'transformer',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']
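
The implementation of `repl_special_token` lives in `src.spacy_utils` and is not shown in this notebook. As a rough illustration of the idea, the sketch below normalises a few special tokens with regular expressions; the rule list and the function name `replace_special_tokens` are assumptions, and the real component may use different rules and replacement strings.

```python
import re

# Hypothetical replacement rules, chosen to match the test cases in this section.
SPECIAL_TOKEN_RULES = [
    (re.compile(r"\bCO\s?2\b"), "carbon dioxide"),
    (re.compile(r"°\s?C\b"), "degrees Celsius"),
    (re.compile(r"°\s?F\b"), "degrees Fahrenheit"),
]


def replace_special_tokens(text):
    """Apply each rule in order and return the normalised text."""
    for pattern, replacement in SPECIAL_TOKEN_RULES:
        text = pattern.sub(replacement, text)
    return text


print(replace_special_tokens("higher CO2 concentrations"))
print(replace_special_tokens("1,000 ppm CO 2 when compared"))
print(replace_special_tokens("With average temperature +8.1 °C (47 °F)."))
```

Writing "CO2", "CO 2" and "carbon dioxide" the same way is intended to bring claim and evidence wording closer together before embedding.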

#### Test case: CO2, CO 2, carbon dioxide

In [11]:
test_A = train_data.loc[("claim-1937", "evidence-12171")]
test_A



claim,evidence,claim_text,evidence_text,related
claim-1937,evidence-12171,"Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.",Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.,1


In [12]:
get_emb_similarity(
    claim_texts=test_A["claim_text"].tolist(),
    evidence_texts=test_A["evidence_text"].values.tolist()
)

tensor([[0.9141]], device='mps:0')

In [13]:
test_B = test_A.copy()
cols = ["claim_text", "evidence_text"]
test_B[cols] = test_A[cols].applymap(lambda t: test_nlp(t).text)
test_B

claim,evidence,claim_text,evidence_text,related
claim-1937,evidence-12171,"Not only is there no scientific evidence that carbon dioxide is a pollutant, higher carbon dioxide concentrations actually help ecosystems support more plant and animal life.",Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.,1


In [14]:
get_emb_similarity(
    claim_texts=test_B["claim_text"].tolist(),
    evidence_texts=test_B["evidence_text"].values.tolist()
)

tensor([[0.9309]], device='mps:0')

#### Test case: temperature symbol

In [15]:
test_A = train_data.loc[("claim-2449", "evidence-1010750")]
test_A



claim,evidence,claim_text,evidence_text,related
claim-2449,evidence-1010750,"""January 2008 capped a 12 month period of global temperature drops on all of the major well respected indicators.",With average temperature +8.1 °C (47 °F).,1


In [16]:
get_emb_similarity(
    claim_texts=test_A["claim_text"].tolist(),
    evidence_texts=test_A["evidence_text"].values.tolist()
)

tensor([[0.9069]], device='mps:0')

In [17]:
test_B = test_A.copy()
cols = ["claim_text", "evidence_text"]
test_B[cols] = test_A[cols].applymap(lambda t: test_nlp(t).text)
test_B

claim,evidence,claim_text,evidence_text,related
claim-2449,evidence-1010750,"""January 2008 capped a 12 month period of global temperature drops on all of the major well respected indicators.",With average temperature +8.1 degree celcius temperature (47 degree fahrenheit temperature).,1


In [18]:
get_emb_similarity(
    claim_texts=test_B["claim_text"].tolist(),
    evidence_texts=test_B["evidence_text"].values.tolist()
)

tensor([[0.8969]], device='mps:0')
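
The effect of each exploration step is easier to compare as a change in score. The snippet below only tabulates the similarity values copied from the cell outputs above; it does not re-run the models, and the labels are informal names for the three test cases.

```python
# (before, after) similarity scores copied from the cell outputs above.
results = {
    "lower case": (0.9589, 0.9589),
    "CO2 -> carbon dioxide": (0.9141, 0.9309),
    "temperature symbol": (0.9069, 0.8969),
}

# Positive delta: preprocessing raised the similarity of the related pair.
deltas = {name: round(after - before, 4) for name, (before, after) in results.items()}
for name, delta in deltas.items():
    print(f"{name:<24}{delta:+.4f}")
```

On these single examples, lower-casing leaves the score unchanged, spelling out CO2 raises it, and spelling out the temperature symbols lowers it. Three pairs are too few to generalise from.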

## Preprocessing pipe

In [19]:
nlp.add_pipe("repl_special_token", first=True)
nlp.pipe_names

['repl_special_token',
 'transformer',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

#### Test case: dispute due to negation

In [20]:
test_A = train_data.loc[("claim-3003", "evidence-515817")]
test_A



claim,evidence,claim_text,evidence_text,related
claim-3003,evidence-515817,Venus is not hot because of a runaway greenhouse.,"The planet Venus experienced runaway greenhouse effect, resulting in an atmosphere which is 96% carbon dioxide, with surface atmospheric pressure roughly the same as found 900 m (3,000 ft) underwater on Earth.",1


In [21]:
get_emb_similarity(
    claim_texts=test_A["claim_text"].tolist(),
    evidence_texts=test_A["evidence_text"].values.tolist()
)

tensor([[0.9335]], device='mps:0')

In [22]:
test_B = test_A.copy()
cols = ["claim_text", "evidence_text"]
test_B[cols] = test_A[cols].applymap(process_sentence, nlp=nlp)
test_B

claim,evidence,claim_text,evidence_text,related
claim-3003,evidence-515817,venus be hot because of a runaway greenhouse .,"the planet venus experience runaway greenhouse effect , result in an atmosphere which be 96 % carbon dioxide , with surface atmospheric pressure roughly the same as find 900 m ( 3,000 ft ) underwater on earth .",1


In [23]:
get_emb_similarity(
    claim_texts=test_B["claim_text"].tolist(),
    evidence_texts=test_B["evidence_text"].values.tolist()
)

tensor([[0.9397]], device='mps:0')
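
In the processed output above, the claim no longer contains "not", while the similarity rises from 0.9335 to 0.9397. A simple way to flag such cases is to compare the negation words before and after processing. This is a minimal sketch; `NEGATION_WORDS`, `negation_words` and `lost_negations` are hypothetical names, and the word list is an assumption rather than anything taken from `src.spacy_utils`.

```python
import re

# Assumed, non-exhaustive list of negation words.
NEGATION_WORDS = {"not", "no", "never", "neither", "nor"}


def negation_words(text):
    """Negation words that occur in `text` (case-insensitive)."""
    return set(re.findall(r"[a-z]+", text.lower())) & NEGATION_WORDS


def lost_negations(original, processed):
    """Negation words present in the original text but missing after processing."""
    return negation_words(original) - negation_words(processed)


original = "Venus is not hot because of a runaway greenhouse."
processed = "venus be hot because of a runaway greenhouse ."
print(lost_negations(original, processed))
```

A non-empty result indicates that processing changed the polarity of the sentence, which matters when the claim disputes the evidence.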