## Read and explore data

We'll start by reading the data and doing some basic processing. We'll assume that:
* you've downloaded the data from http://www.eraserbenchmark.com/ and have unpacked it to a directory called `data`
* you're running the kernel in the root of the `eraserbenchmark` repo

We're going to work with the movies dataset, as it's the smallest and easiest to get started with. All the data is stored as either plain text or jsonl, and should be pre-tokenized and ready to go!
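If jsonl is unfamiliar: it is simply one JSON object per line. The sketch below parses two made-up records with the standard library only. The field names `query` and `classification` are assumptions chosen to mirror the attributes printed later in this notebook; the real files may contain more fields, and the loaders in `rationale_benchmark.utils` remain the recommended way to read them.

```python
import json

# Two made-up records, one JSON object per line (not taken from the dataset).
jsonl_text = '''{"query": "What is the sentiment of this review?", "classification": "NEG"}
{"query": "What is the sentiment of this review?", "classification": "POS"}
'''

# Parse each non-empty line independently.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
labels = [r["classification"] for r in records]
print(len(records), labels)
```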

In [1]:
import os
from rationale_benchmark.utils import load_documents, load_datasets, annotations_from_jsonl, Annotation

data_root = os.path.join('data', 'movies')
documents = load_documents(data_root)
val = annotations_from_jsonl(os.path.join(data_root, 'val.jsonl'))
# Or load all three splits at once:
train, val, test = load_datasets(data_root)

In [2]:
ann = train[0]
evidences = ann.all_evidences()
print(type(ann))
print(ann.query)
print(ann.classification)
print(len(evidences))

<class 'rationale_benchmark.utils.Annotation'>
What is the sentiment of this review?
NEG
16


So we have a review with a negative sentiment, and 16 evidence statements. Let's take a look.

In [3]:
for ev in evidences:
    print(ev.text)

the sad part is
what 's the deal ?
not really
just did n't snag this one correctly
it does n't entertain , it 's confusing , it rarely excites
have no idea what 's going on
do we really need to see it over and over again ?
skip it !
pretty redundant
the film does n't stick
it 's simply too jumbled
i get kind of fed up after a while
downshifts into this " fantasy " world
executed it terribly
a very bad package
mind - fuck movie


Let's get the document these evidence statements come from, and take a look at its contents:

In [4]:
(docid,) = set(ev.docid for ev in evidences)
doc = documents[docid]
print(len(doc))
for sent in doc:
    print(' '.join(sent))

44
plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what 's the deal ?
watch the movie and " sorta " find out . . .
critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package .
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just did n't snag this one correctly
.
they seem to have taken this pretty neat concept , but executed it terribly .
so what are the problems with the movie ?
well , its main problem is that it 's simply too jumbled
.
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member ,
have no idea

Now let's take a look at where in the document these appear, by comparing each evidence's saved text against the tokens found at its offsets:

In [5]:
import itertools
flattened_doc = list(itertools.chain.from_iterable(doc))
for ev in evidences:
    # saved text
    print(ev.text)
    # offset text
    print(' '.join(flattened_doc[ev.start_token:ev.end_token]))

the sad part is
the sad part is
what 's the deal ?
what 's the deal ?
not really
not really
just did n't snag this one correctly
just did n't snag this one correctly
it does n't entertain , it 's confusing , it rarely excites
it does n't entertain , it 's confusing , it rarely excites
have no idea what 's going on
have no idea what 's going on
do we really need to see it over and over again ?
do we really need to see it over and over again ?
skip it !
skip it !
pretty redundant
pretty redundant
the film does n't stick
the film does n't stick
it 's simply too jumbled
it 's simply too jumbled
i get kind of fed up after a while
i get kind of fed up after a while
downshifts into this " fantasy " world
downshifts into this " fantasy " world
executed it terribly
executed it terribly
a very bad package
a very bad package
mind - fuck movie
mind - fuck movie
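The saved text and the offset text agree, which suggests the offsets index into the flattened token list rather than into individual sentences. As a minimal sketch of going the other way (from a flat token offset back to a sentence index), here is a toy example. The toy document and the helper `sentence_of` are hypothetical and not part of `rationale_benchmark`.

```python
import bisect
import itertools

# Toy document: a list of sentences, each a list of tokens.
doc = [
    ["what", "'s", "the", "deal", "?"],
    ["skip", "it", "!"],
    ["a", "very", "bad", "package"],
]
flattened = list(itertools.chain.from_iterable(doc))

# sentence_ends[i] is the flat offset one past the last token of sentence i.
sentence_ends = list(itertools.accumulate(len(s) for s in doc))

def sentence_of(token_offset: int) -> int:
    """Return the index of the sentence containing a flat token offset."""
    return bisect.bisect_right(sentence_ends, token_offset)

# A half-open span [start_token, end_token), like the evidence offsets above.
start_token, end_token = 5, 8
span_text = " ".join(flattened[start_token:end_token])
span_sentence = sentence_of(start_token)
print(span_text, "-> sentence", span_sentence)
```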


### Count rationale tokens, document tokens, and sentences

In [6]:
import numpy as np

def process_annotation(ann: Annotation, docs: dict) -> dict:
    evidences = ann.all_evidences()
    if len(evidences) == 0:
        return {}
    (docid,) = set(ev.docid for ev in evidences)
    doc = docs[docid]
    sentences = len(doc)
    tokens = sum(len(s) for s in doc)
    # this accumulation will take care of any potentially overlapping evidence statements.
    # there should be none in the data, but getting familiar with the idea of how to do this is potentially useful
    rationale_tokens = len(set(itertools.chain.from_iterable(range(ev.start_token, ev.end_token) for ev in evidences)))
    return {
        'class': ann.classification,
        'evidences': len(evidences),
        'document_sentences': sentences,
        'document_tokens': tokens,
        'rationale_tokens': rationale_tokens,
        'rationale_token_fraction': rationale_tokens / tokens
    }

def average(counts, key) -> float:
    ns = [c[key] for c in counts]
    return np.mean(ns)

# this filter skips annotations with no evidence, for which process_annotation returns {}
annotation_counts = list(filter(lambda x: len(x) > 0, (process_annotation(ann, documents) for ann in train)))
for key in ['evidences', 'document_sentences', 'document_tokens', 'rationale_tokens', 'rationale_token_fraction']:
    print(key, average(annotation_counts, key))

evidences 8.679174484052533
document_sentences 36.78924327704816
document_tokens 773.5622263914947
rationale_tokens 66.83989993746091
rationale_token_fraction 0.09350348753236702
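To see why `process_annotation` collects token positions in a set rather than summing span lengths, here is a small sketch with made-up spans, two of which overlap. The spans are hypothetical `(start_token, end_token)` pairs, not dataset values.

```python
import itertools

# Made-up half-open spans; the first two overlap on tokens 2 and 3.
spans = [(0, 4), (2, 6), (10, 12)]

# Summing lengths counts the overlapping tokens twice.
naive_count = sum(end - start for start, end in spans)

# Collecting positions in a set counts each token once.
unique_count = len(set(itertools.chain.from_iterable(
    range(start, end) for start, end in spans)))

print(naive_count, unique_count)
```

When no spans overlap, the two counts are identical, so the set-based version costs nothing in correctness and guards against double counting.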
