# Loading and exploring the dataset

This notebook walks you through the core classes that end users of the module and the dataset are expected to work with.
It is written for and tested with dataset version 1.0.0a0 and should be compatible with all 1.0 versions. Please refer to the repository's `README.md` for download instructions.
In addition to Python 3.10, you will need Jupyter to run this notebook locally. Warning: PDFs are not rendered in GitHub's preview.

In [1]:
from IPython.display import IFrame

## Proof Bank and Samples
Let us first make the necessary imports and load the dataset:

In [2]:
import LassyExtraction
aethel = LassyExtraction.ProofBank.load_data('../data/aethel.pickle')

Loading and verifying aethel.pickle...
Loaded æthel dump version 1.0.0a1 containing 68782 samples.


The newly initialized `aethel` object is an instance of `ProofBank`, i.e. a simple container of `Sample` objects.
It provides some basic functionality, such as a `version` field that specifies the dataset's version and a `__len__` method that returns its size.

More importantly, it allows us to retrieve a single `Sample` using standard python indexing.

In [3]:
sample = aethel[2310]
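
To make the container behaviour concrete, here is a minimal sketch of a class that exposes a version tag, a size, and integer indexing. `ToyBank` and `ToySample` are hypothetical stand-ins written for this illustration; they are not the `LassyExtraction` classes and say nothing about how `ProofBank` is implemented internally.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToySample:
    # Hypothetical stand-in for a Sample; only the name is modelled.
    name: str


@dataclass(frozen=True)
class ToyBank:
    # Hypothetical stand-in for a ProofBank: a version tag plus an
    # ordered collection of samples.
    version: str
    samples: Sequence[ToySample]

    def __len__(self) -> int:
        # Size of the bank = number of samples it holds.
        return len(self.samples)

    def __getitem__(self, index: int) -> ToySample:
        # Standard Python indexing, delegated to the underlying sequence.
        return self.samples[index]


bank = ToyBank(version='0.0.0',
               samples=(ToySample('a.xml'), ToySample('b.xml(1)')))
print(bank.version, len(bank), bank[1].name)
```

In this toy version, defining `__len__` and `__getitem__` is all that is needed for `len(bank)` and `bank[i]` to work.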

`Sample` objects are identified by their names, which are unique and consist of two parts: a *prefix* (ending in `xml`) that specifies the name of the source file in Lassy, and (optionally) a *suffix* indicating that the original parse graph was split into multiple graphs during preprocessing. Such splits can happen for a number of reasons, but are mostly due to incomplete or underspecified annotations.

In [4]:
sample.name

'dpc-svb-000432-nl-sen.p.44.s.2.xml(2)'
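
If you need the two parts of a name separately, a small helper along the following lines may do. It is a sketch that assumes the suffix always appears as a parenthesised integer right after `.xml`, as in the name printed above; `split_sample_name` is a hypothetical helper and not part of the module.

```python
import re
from typing import Optional, Tuple

# Assumed format: a prefix ending in '.xml', optionally followed by a
# parenthesised integer suffix, e.g. 'file.xml(2)'.
_NAME_PATTERN = re.compile(r'^(?P<prefix>.+\.xml)(?:\((?P<suffix>\d+)\))?$')


def split_sample_name(name: str) -> Tuple[str, Optional[int]]:
    """Split a sample name into (prefix, suffix); suffix is None if absent."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f'unexpected sample name: {name!r}')
    suffix = match.group('suffix')
    return match.group('prefix'), (int(suffix) if suffix is not None else None)


prefix, suffix = split_sample_name('dpc-svb-000432-nl-sen.p.44.s.2.xml(2)')
print(prefix, suffix)
```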

We can inspect the raw sentence of the sample via its `sentence` property...

In [5]:
print(sample.sentence)

Inkomen dat u in niet-Nederlandse valuta heeft ontvangen , rekent de SVB om naar euro volgens de officiële koers van De Nederlandsche Bank N.V.


...and the data subset (train/dev/test) it belongs to via its `subset` property.

In [6]:
print(sample.subset)

train
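
The `subset` property makes it straightforward to split the data or count samples per split. The sketch below uses `SimpleNamespace` objects as stand-ins for samples so that it runs without the dataset; `subset_sizes` and `select_subset` are hypothetical helpers, and applying them to the real bank assumes that it can be iterated over.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable, List

# Toy stand-ins: objects that expose only a `subset` attribute.
toy_samples = [SimpleNamespace(subset='train'),
               SimpleNamespace(subset='train'),
               SimpleNamespace(subset='dev'),
               SimpleNamespace(subset='test')]


def subset_sizes(samples: Iterable) -> Counter:
    # Count how many samples fall in each subset.
    return Counter(sample.subset for sample in samples)


def select_subset(samples: Iterable, subset: str) -> List:
    # Keep only the samples that belong to the requested subset.
    return [sample for sample in samples if sample.subset == subset]


print(subset_sizes(toy_samples))
print(len(select_subset(toy_samples, 'train')))
```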


## Lexical Phrases and Items

The lexical content of each sample is provided pre-tokenized and chunked according to Lassy's annotations.

Lexical phrases are stored in the `lexical_phrases` field of a `Sample`. 
Each `LexicalPhrase` is a wrapper around a
* non-empty tuple of `LexicalItems` (*access via `items`*),
* for which a `Type` is supplied (*access via `type`*).

The full string can be accessed via the `string` property, and the number of `LexicalItems` contained via `__len__`.

In [7]:
lp7 = sample.lexical_phrases[7]
print(lp7)

LexicalPhrase(string=ontvangen, type=◇obj1(VNW)⟶PPART, len=1)


Each `LexicalItem` within a `LexicalPhrase` corresponds to a single word, and comes packed with some rudimentary token-level features. This allows us to assign a single type to multi-word expressions (rather common in Lassy), while still maintaining their token-level annotations.

In [8]:
print(lp7.items[0])

LexicalItem(word='ontvangen', pos='verb', pt='ww', lemma='ontvangen')
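
The relation between phrases and items can be pictured with two small dataclasses. This is a simplified sketch: `ToyItem` and `ToyPhrase` are hypothetical stand-ins, the type is reduced to a placeholder string, token-level features other than `word` are omitted, and joining words with a single space is an assumption about how `string` is produced. The grouping of the four words into one phrase is illustrative only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyItem:
    # Stand-in for a LexicalItem; pos, pt and lemma are left out.
    word: str


@dataclass(frozen=True)
class ToyPhrase:
    # Stand-in for a LexicalPhrase: one type shared by one or more items.
    items: Tuple[ToyItem, ...]
    type: str  # placeholder; the real field holds a Type object

    def __post_init__(self) -> None:
        # A phrase must contain at least one item.
        if not self.items:
            raise ValueError('a phrase needs at least one item')

    @property
    def string(self) -> str:
        # Assumption: the full string is the space-joined words.
        return ' '.join(item.word for item in self.items)

    def __len__(self) -> int:
        return len(self.items)


words = ('De', 'Nederlandsche', 'Bank', 'N.V.')
phrase = ToyPhrase(items=tuple(ToyItem(w) for w in words), type='T')
print(phrase.string, len(phrase))
```

The point of the design is visible in the last two lines: the phrase carries a single type, while each word remains individually accessible through `items`.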


Most lexical phrases participate in the proof derivation as lexical constants, typed as specified.

Some, however, do not (namely those assigned default dependencies, such as punctuation symbols). This is why they are provided *outside* the proof: without them, the sample representation would be lossy.

## Proofs, Judgements and Terms

The syntactic analysis of each sample resides in its `proof` field, and is a `Proof` object.

In [9]:
proof = sample.proof

A `Proof` is an inductive datatype that faithfully mirrors the Natural Deduction presentation of the underlying type theory, i.e. dependency-enhanced Lambek with permutations (or Modal Multiplicative Intuitionistic Linear Logic).

It contains three named fields:
* `premises` --  a (possibly empty) tuple of premise `Proofs`
* `conclusion` -- a conclusion `Judgement`, and
* `rule` -- a `Rule`. 

A `Judgement`, in turn, contains
* a `Structure` of `Variables` (hypothetical elements) and/or `Constants` (lexical constants)

For brevity, printing a `Proof` will only print its `conclusion` field.


In [10]:
print(proof)

〈c13, 〈c14〉obj1〉mod, 〈c15, 〈〈c19, 〈c20〉obj1〉mod, 〈c16〉det, 〈c17〉mod, c18〉obj1〉mod, c9, 〈c12〉svp, 〈〈c10〉det, c11〉obj1, 〈〈c1, 〈c6, 〈〈c3, 〈〈c4〉mod, c5〉obj1〉mod, c7〉vc, 〈c2〉su〉relcl〉mod, c0〉su ⊢ ▾mod(c13 ▵obj1(c14)) (▾mod(c15 ▵obj1(▾mod(c19 ▵obj1(c20)) (▾det(c16) (▾mod(c17) c18)))) (c9 ▵svp(c12) ▵obj1(▾det(c10) c11) ▵su(▾mod(c1 ▵relcl(λx0.(c6 ▵vc(▾mod(c3 ▵obj1(▾mod(c4) c5)) (c7 ▾x(▿x(x0)))) ▵su(c2)))) c0))) : SMAIN


Shortcut properties `Proof.structure`, `Proof.type`, `Proof.term` provide access to fields and properties nested in `Proof.conclusion`.

In [11]:
print(proof.type)

SMAIN
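
The inductive structure described in this section can be imitated with a pair of recursive dataclasses. The sketch below is a toy model only: `ToyProof` and `ToyJudgement` are hypothetical, structures, terms, types and rules are plain strings, and the three judgement fields simply mirror the `structure ⊢ term : type` shape of the printed conclusion. `rules_used` is a hypothetical helper showing how one might walk the premises recursively.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class ToyJudgement:
    structure: str  # stand-in for a Structure of variables/constants
    term: str
    type: str

    def __str__(self) -> str:
        return f'{self.structure} ⊢ {self.term} : {self.type}'


@dataclass(frozen=True)
class ToyProof:
    premises: Tuple['ToyProof', ...]  # possibly empty
    conclusion: ToyJudgement
    rule: str  # stand-in for a Rule

    @property
    def type(self) -> str:
        # Shortcut into the conclusion, analogous to Proof.type.
        return self.conclusion.type

    def __str__(self) -> str:
        # Printing a proof shows only its conclusion.
        return str(self.conclusion)


def rules_used(proof: ToyProof) -> Set[str]:
    # Collect the rule at this node and, recursively, in all premises.
    used = {proof.rule}
    for premise in proof.premises:
        used |= rules_used(premise)
    return used


function = ToyProof((), ToyJudgement('c0', 'c0', 'A⟶B'), 'Constant')
argument = ToyProof((), ToyJudgement('c1', 'c1', 'A'), 'Constant')
application = ToyProof((function, argument),
                       ToyJudgement('c0, c1', 'c0 c1', 'B'),
                       'ArrowElimination')
print(application)
print(sorted(rules_used(application)))
```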


For a more holistic inspection of a proof, you can use the `LassyExtraction.utils.tex` submodule to cast samples and proofs to compilable tex code:

In [12]:
from LassyExtraction.utils.tex import sample_to_tex
tex_code = sample_to_tex(sample)

The tex code can be saved to a file and compiled externally. If you have pdflatex installed, you should also be able to directly invoke the `compile_tex` function.

In [13]:
from LassyExtraction.utils.tex import compile_tex
compile_tex(tex_code, 'tmp')

This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019/Debian) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
(./tmp.tex
LaTeX2e <2020-02-02> patch level 2
L3 programming layer <2020-02-14>
(/usr/share/texlive/texmf-dist/tex/latex/standalone/standalone.cls
Document Class: standalone 2018/03/26 v1.3a Class to compile TeX sub-files stan
dalone
(/usr/share/texlive/texmf-dist/tex/latex/tools/shellesc.sty)
(/usr/share/texlive/texmf-dist/tex/generic/iftex/ifluatex.sty
(/usr/share/texlive/texmf-dist/tex/generic/iftex/iftex.sty))
(/usr/share/texlive/texmf-dist/tex/latex/xkeyval/xkeyval.sty
(/usr/share/texlive/texmf-dist/tex/generic/xkeyval/xkeyval.tex
(/usr/share/texlive/texmf-dist/tex/generic/xkeyval/xkvutils.tex
(/usr/share/texlive/texmf-dist/tex/generic/xkeyval/keyval.tex))))
(/usr/share/texlive/texmf-dist/tex/latex/standalone/standalone.cfg)
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2019/12/20 v1.4l Standard L

The compiled end result can be found as `tmp.pdf` in the current directory.
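
If you prefer to handle the file yourself, the following sketch writes the generated code to disk and, when a `pdflatex` executable is found on the `PATH`, compiles it. `save_tex` and `compile_if_available` are hypothetical helpers written with the standard library only; they make no claim about what `compile_tex` does internally.

```python
import shutil
import subprocess
from pathlib import Path


def save_tex(tex_code: str, stem: str, directory: Path) -> Path:
    # Write the TeX source to <directory>/<stem>.tex and return its path.
    directory.mkdir(parents=True, exist_ok=True)
    tex_path = directory / f'{stem}.tex'
    tex_path.write_text(tex_code, encoding='utf-8')
    return tex_path


def compile_if_available(tex_path: Path, timeout: float = 120.0) -> bool:
    # Run pdflatex next to the source file; return False if it is not installed.
    if shutil.which('pdflatex') is None:
        return False
    subprocess.run(['pdflatex', '-interaction=nonstopmode', tex_path.name],
                   cwd=tex_path.parent,
                   stdout=subprocess.DEVNULL,
                   check=False,
                   timeout=timeout)
    return True
```

A call such as `compile_if_available(save_tex(tex_code, 'tmp', Path('.')))` would then leave `tmp.tex` (and, if compilation succeeds, `tmp.pdf`) in the current directory.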

## Searching the dataset
`scripts/search.py` provides some simple first-order filtering tools.

In [14]:
from scripts.search import (search, length_between, of_type, must_contain_rules,
                            may_only_contain_rules, contains_word, Atoms, Query, Sample)
from LassyExtraction.mill.proofs import Logical

The `search` function takes a (subset of the) dataset, a logical `Query` and (optionally) a maximum number of hits, and returns the matching samples.
The expression below retrieves the first 50 samples whose proofs contain exclusively applicative terms:

In [15]:
is_simple_applicative = may_only_contain_rules({Logical.Constant, Logical.ArrowElimination, Logical.BoxElimination, Logical.DiamondIntroduction})
simple_applicative = list(search(bank=aethel, query=is_simple_applicative, num_hits=50))
compile_tex(sample_to_tex(simple_applicative[33]), 'applicative')


Queries can be composed, combined and negated like standard logical expressions.
The next one finds samples that are 5 to 7 phrases long and contain at least one λ-abstraction, but do not contain the word "en":

In [16]:
higher_order = list(search(bank=aethel, query=must_contain_rules({Logical.Variable}) & length_between(5, 7) & (~ contains_word('en')), num_hits=10))
compile_tex(sample_to_tex(higher_order[0]), 'higher_order')
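
To see why `&` and `~` compose so naturally, it can help to look at a toy re-implementation of the idea. The sketch below is not the code in `scripts/search.py`: `ToyQuery`, `toy_search`, `toy_length_between` and `toy_contains_word` are hypothetical, the "bank" is a plain list of strings, the `|` operator and the inclusive length bounds are assumptions, and length is measured in whitespace-separated tokens.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class ToyQuery:
    # A query wraps a boolean predicate and overloads the logical operators.
    fn: Callable[[str], bool]

    def __call__(self, item: str) -> bool:
        return self.fn(item)

    def __and__(self, other: 'ToyQuery') -> 'ToyQuery':
        return ToyQuery(lambda item: self(item) and other(item))

    def __or__(self, other: 'ToyQuery') -> 'ToyQuery':
        return ToyQuery(lambda item: self(item) or other(item))

    def __invert__(self) -> 'ToyQuery':
        return ToyQuery(lambda item: not self(item))


def toy_length_between(low: int, high: int) -> ToyQuery:
    # Assumption: both bounds are inclusive.
    return ToyQuery(lambda sentence: low <= len(sentence.split()) <= high)


def toy_contains_word(word: str) -> ToyQuery:
    return ToyQuery(lambda sentence: word in sentence.split())


def toy_search(bank: Iterable[str], query: ToyQuery,
               num_hits: Optional[int] = None) -> Iterator[str]:
    # Lazily yield matches, stopping after num_hits (no cap if None).
    return islice(filter(query, bank), num_hits)


sentences = ['ik zie hem',
             'jij en ik lopen naar huis',
             'wie heeft dat boek gelezen ?',
             'zij leest']
query = toy_length_between(3, 6) & ~toy_contains_word('en')
print(list(toy_search(sentences, query)))
print(list(toy_search(sentences, query, num_hits=1)))
```

Because each operator returns a new `ToyQuery`, arbitrarily nested expressions remain ordinary queries that can be passed to the search function unchanged.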


Custom queries are also easy to write. The query below selects sentences that end with a question mark and are typed as a WH-question.

In [17]:
def ends_with_qmark() -> Query:
    # Wrap a plain predicate over samples into a composable Query.
    def f(s: Sample) -> bool:
        return s.sentence.endswith('?')
    return Query(f)

questions = list(search(bank=aethel, query=ends_with_qmark() & of_type(Atoms['whq']), num_hits=10))
compile_tex(sample_to_tex(questions[4]), 'question')
