# Preparing data

This notebook contains `SpikeX` usage examples
tailored to our field:

- loading and processing the **Influence-IR S2ORC** dataset
- running `SpikeX` pipelines on a sample publication

In [1]:
import json
import pathlib
from typing import Iterator, Dict, Any

In [2]:
# path
ROOT = pathlib.Path("./").resolve().parent
RAW = ROOT / "data" / "raw"

# type aliases
# these don't change the behaviour of the code;
# they help to understand what a particular function
# takes as input or returns as output
Publication = Dict[str, Any]

In [3]:
def read_jsonl(p: pathlib.Path) -> Iterator[Publication]:
    """Yield .jsonl file's contents line by line."""
    # open the path that was passed in instead of a hard-coded file
    with open(p, "r", encoding="utf-8") as lines:
        for line in lines:
            yield json.loads(line)
            
            
def retrive_text(data: Publication, field: str = "body_text") -> str:
    """Parse 'body_text' or 'abstract' fields extracting raw texts."""
    return " ".join(section["text"] for section in data[field])
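
The two helpers above can be checked without the real corpus. The sketch below writes two made-up records to a temporary `.jsonl` file and reads them back. The records and the names `iter_jsonl` and `join_section_text` are illustrative stand-ins (assumptions, not part of the dataset or of `SpikeX`); only the `abstract` / `text` layout mirrors the records used in this notebook.

```python
import json
import pathlib
import tempfile
from typing import Any, Dict, Iterator

Publication = Dict[str, Any]


def iter_jsonl(path: pathlib.Path) -> Iterator[Publication]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


def join_section_text(data: Publication, field: str = "body_text") -> str:
    """Join the 'text' of every section stored under `field`."""
    return " ".join(section["text"] for section in data[field])


# Made-up records that only mimic the layout of the real data.
samples = [
    {"paper_id": "a", "abstract": [{"text": "First."}, {"text": "Second."}]},
    {"paper_id": "b", "abstract": [{"text": "Only one."}]},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = pathlib.Path(tmp) / "sample.jsonl"
    sample_path.write_text(
        "\n".join(json.dumps(record) for record in samples), encoding="utf-8"
    )
    # Materialise the generator while the temporary file still exists.
    records = list(iter_jsonl(sample_path))

texts = [join_section_text(record, field="abstract") for record in records]
print(texts)
```

Because the reader is a generator, nothing is read until it is iterated, so the file has to exist at iteration time rather than at call time.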

In [4]:
# testing on a sample text - the second one in our corpus
# creating generator
data = read_jsonl(RAW / "test.jsonl")

# loading the second entry
next(data)
publication = next(data)

In [5]:
publication.keys()

dict_keys(['paper_id', '_pdf_hash', 'abstract', 'body_text', 'bib_entries', 'ref_entries'])

In [6]:
# this is what the first section of the abstract field
# looks like
publication["abstract"][0]

{'section': 'Abstract',
 'text': 'Sovereignty is a funny thing. It is allegedly the foundation of the Westphalian order, but its exact contours are frustratingly indeterminate. When it was revealed that the Russian government interfered in the 2016 U.S. presidential election by, among other things, hacking into the e-mail system of the Democratic National Committee (DNC) and releasing its e-mails, international lawyers were divided over whether the cyber attack violated international law. President Obama seemingly went out of his way to describe the attack as a mere violation of "established international norms of behavior" and pointedly declined to refer to the cyber attacks as a violation of international law. \' Some international lawyers were more willing to describe the cyber attack as a violation of international law.',
 'cite_spans': [],
 'ref_spans': []}

In [7]:
text = retrive_text(publication, field="abstract")

In [8]:
text

'Sovereignty is a funny thing. It is allegedly the foundation of the Westphalian order, but its exact contours are frustratingly indeterminate. When it was revealed that the Russian government interfered in the 2016 U.S. presidential election by, among other things, hacking into the e-mail system of the Democratic National Committee (DNC) and releasing its e-mails, international lawyers were divided over whether the cyber attack violated international law. President Obama seemingly went out of his way to describe the attack as a mere violation of "established international norms of behavior" and pointedly declined to refer to the cyber attacks as a violation of international law. \' Some international lawyers were more willing to describe the cyber attack as a violation of international law. 2 However, identifying the exact legal norm that was contravened turns out to be harder than it might otherwise appear. To the layperson, the Russian hacking constituted an impermissible (and perha

---

# Testing SpikeX

`SpikeX`:
- aims to help build knowledge extraction tools
- is built on top of spaCy

In [11]:
# loading a spaCy pipeline and processing the sample text
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
doc = nlp(text)

---

### ClusterX

The `ClusterX` pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

In [12]:
# ClusterX
from spikex.pipes import ClusterX

clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
    print(cluster)

[Sovereignty]
[President Obama]
[It]
[international lawyers, that term, the Russian government, the Russian hacking, that point, Some international lawyers, the cyber attack]
[none]
[its exact contours]
[the translation effort]
[nonlawyers]
[the layperson, a mere violation, a variety, the standard rubrics, the foundation, legal discourse, the cyber attacks, the American political process, the 2016 U.S. presidential election, the exact legal norm, an intervention, a funny thing, the difficulty, international law, the attack, a violation, the Westphalian order]
[the Democratic National Committee]
["established international norms]
[his way]
[the layperson, a mere violation, a variety, the standard rubrics, the foundation, legal discourse, the cyber attacks, the American political process, the 2016 U.S. presidential election, the exact legal norm, an intervention, a funny thing, international law, the attack, a violation, the Westphalian order]
[perhaps) shocking interference]
[the layper
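
Some chunks in the output above appear in more than one cluster. A ball-cover style of clustering can produce that kind of overlap, which the toy sketch below illustrates. It is not `SpikeX`'s implementation: the greedy landmark selection, the cosine similarity measure, the 2-D vectors, and the names `cosine` and `ball_cover` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ball_cover(vectors: Dict[str, Vector], min_score: float) -> List[List[str]]:
    """Group items into possibly overlapping balls around greedy landmarks."""
    names = list(vectors)
    # A point becomes a landmark if no earlier landmark is similar enough.
    landmarks: List[str] = []
    for name in names:
        if all(cosine(vectors[name], vectors[lm]) < min_score for lm in landmarks):
            landmarks.append(name)
    # One ball per landmark; a point may fall into several balls.
    return [
        [n for n in names if cosine(vectors[n], vectors[lm]) >= min_score]
        for lm in landmarks
    ]


# Hypothetical 2-D "embeddings", chosen by hand for the example.
toy_vectors = {
    "the attack": (1.0, 0.0),
    "the cyber attack": (0.8, 0.6),
    "the election": (0.0, 1.0),
}

clusters = ball_cover(toy_vectors, min_score=0.5)
for cluster in clusters:
    print(cluster)
```

Here "the cyber attack" is similar enough to both landmarks, so it lands in both balls. Raising `min_score` shrinks the balls and reduces the overlap.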



### AbbrX

The **AbbrX** pipe finds abbreviations and acronyms in the text, linking short and long forms together:

In [13]:
# AbbrX

from spikex.pipes import AbbrX

abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
    print(abbr, "->", abbr._.long_form)

DNC -> Democratic National Committee


### LabelX

The `LabelX` pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.

In [14]:
# LabelX
from spikex.pipes import LabelX

# I don't fully understand how it works,
# but from what I see you can create a pattern
# and then assign a label to that pattern

patterns = [
  [{"LOWER": "presidential"}, {"LOWER": "election"}],
]
labelx = LabelX(nlp.vocab, [("Process", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
    print(labeling, f"[{labeling.label_}]")


presidential election [Process]
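
One plausible reading of keeping only the longest match is sketched below: when labeled spans overlap, the longest one wins and the spans it covers are dropped. Whether `LabelX` resolves overlaps exactly this way is an assumption; the function `keep_longest`, the token list, and the labels `PHRASE`, `WORD` and `YEAR` are made up for the example.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]


def keep_longest(spans: List[Span]) -> List[Span]:
    """Drop every span that overlaps a longer (or equally long, earlier) one."""
    # Longest spans first; ties go to the span that starts earlier.
    ranked = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    taken = set()
    for start, end, label in ranked:
        positions = set(range(start, end))
        if positions.isdisjoint(taken):
            kept.append((start, end, label))
            taken |= positions
    # Return the survivors in document order.
    return sorted(kept)


tokens = ["the", "2016", "u.s.", "presidential", "election"]
candidates = [
    (3, 5, "PHRASE"),  # "presidential election"
    (4, 5, "WORD"),    # "election", contained in the span above
    (1, 2, "YEAR"),    # "2016", overlaps nothing
]

resolved = keep_longest(candidates)
for start, end, label in resolved:
    print(" ".join(tokens[start:end]), f"[{label}]")
```

spaCy ships a comparable helper, `spacy.util.filter_spans`, which applies the same longest-first rule to `Span` objects.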


### PhraseX

The `PhraseX` pipe creates a custom underscore extension on the `Doc` and fills it with matches from phrase patterns.

In [15]:
# PhraseX
from spikex.pipes import PhraseX

patterns = [
  [{"LOWER": "presidential"}],
  [{"LOWER": "election"}],
]
phrasex = PhraseX(nlp.vocab, "Russia", patterns)
doc = phrasex(doc)
for r in doc._.Russia: # <- now all matches are stored in `doc._.Russia`
    print(r)

presidential
election


### SentX

The **SentX** pipe splits a text into sentences. It modifies the tokens' *is_sent_start* attribute, so it must be added before the *parser* pipe in the spaCy pipeline:

In [16]:
# SentX
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
    from spacy.language import Language

    @Language.factory("sentx-ir")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
# on spaCy 3+, refer to the factory by the name it was registered under above
sentx_pipe = SentX() if spacy_version < 3 else "sentx-ir"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp(text)
for sent in doc.sents:
    print(sent)

Sovereignty is a funny thing.
It is allegedly the foundation of the Westphalian order, but its exact contours are frustratingly indeterminate.
When it was revealed that the Russian government interfered in the 2016 U.S. presidential election by, among other things, hacking into the e-mail system of the Democratic National Committee (DNC) and releasing its e-mails, international lawyers were divided over whether the cyber attack violated international law.
President Obama seemingly went out of his way to describe the attack as a mere violation of "established international norms of behavior" and pointedly declined to refer to the cyber attacks as a violation of international law.
' Some international lawyers were more willing to describe the cyber attack as a violation of international law.
2 However, identifying the exact legal norm that was contravened turns out to be harder than it might otherwise appear.
To the layperson, the Russian hacking constituted an impermissible (and perhaps