In [17]:
from lxml import etree
from standoffconverter import Standoff, View
import json
from spacy.lang.fr import French
from tqdm.notebook import tqdm

# Preparing the Corpus Charles Bruneau for ML

In this notebook, we demonstrate how to use the standoff converter to apply ML methods to an existing TEI edition. We show how to load the TEI XML file and parse it into the standoff format of the standoff converter, how to extract a plain text view from the edition, and how to split that view into sentences using the spaCy NLP library. Crucially, we show how the detected sentences are (1) added to the TEI and (2) written to a JSONL file for further analysis or manual labelling. Both output formats use the same identifier for a given sentence, so future annotations from an ML model can easily be added back to the TEI.

## Loading the XML
First, the XML file is parsed with `lxml.etree`. Then, a `standoffconverter.Standoff` object is created from the resulting tree. The `Standoff` object always keeps the standoff representation and the `etree` representation in sync.

In [1]:
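parser = etree.XMLParser(remove_comments=True)
# Read bytes so lxml handles any XML encoding declaration itself,
# and close the file handle once parsing is done.
with open("indir/charles-destinataires.xml", "rb") as fin:
    tree = etree.fromstring(fin.read(), parser=parser)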
so = Standoff(tree, namespaces={
    "tei":"http://www.tei-c.org/ns/1.0"
})
print(so.collapsed_table)

                                                 context          text
0      [[[<Element {http://www.tei-c.org/ns/1.0}pb at...        \n    
1      [[[<Element {http://www.tei-c.org/ns/1.0}pb at...      \n      
2      [[[<Element {http://www.tei-c.org/ns/1.0}pb at...              
3      [[[<Element {http://www.tei-c.org/ns/1.0}pb at...      \n      
4      [[[<Element {http://www.tei-c.org/ns/1.0}pb at...    \n        
...                                                  ...           ...
27817  [[[<Element {http://www.tei-c.org/ns/1.0}pb at...  \n          
27818  [[[<Element {http://www.tei-c.org/ns/1.0}pb at...    \n        
27819  [[[<Element {http://www.tei-c.org/ns/1.0}pb at...      \n      
27820  [[[<Element {http://www.tei-c.org/ns/1.0}pb at...        \n    
27821  [[[<Element {http://www.tei-c.org/ns/1.0}pb at...          \n  

[27822 rows x 2 columns]


## Preparing the plain text view
After loading the TEI XML, we need to extract a plain text view from it. This can be a demanding scholarly task, for example in genetic editions, where the XML file may contain countless variants of the document. Here, we focus on the main body of the text inside the `<div1>` tags, remove comments (denoted by `<!-- ... -->` in XML), and shrink longer sequences of whitespace to a single whitespace character.
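The idea behind a view that shrinks whitespace while still remembering where each character came from can be sketched in plain Python. This is only an illustration of the principle, not the standoff converter's implementation; the function name `shrink_whitespace_with_map` and the sample string are made up for this example.

```python
def shrink_whitespace_with_map(text):
    """Collapse each whitespace run to one space.

    Returns (plain, positions), where positions[i] is the index in
    `text` of the character that ended up at plain[i].
    """
    plain_chars = []
    positions = []
    previous_was_space = False
    for index, char in enumerate(text):
        if char.isspace():
            if previous_was_space:
                continue  # drop the rest of a whitespace run
            plain_chars.append(" ")
            previous_was_space = True
        else:
            plain_chars.append(char)
            previous_was_space = False
        positions.append(index)
    return "".join(plain_chars), positions


# Hypothetical input with irregular whitespace, as often found in pretty-printed XML.
original = "Ma  Guerre\n   1914"
plain, positions = shrink_whitespace_with_map(original)

# Look up where a match in the shrunken text sits in the original text.
start_in_plain = plain.index("1914")
start_in_original = positions[start_in_plain]
print(repr(plain), start_in_plain, start_in_original)
```

Because every character of the shrunken text carries its original index, an offset found by a downstream tool in the plain text can be translated back to the source document.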



In [2]:
view = (
    View(so)
        .exclude_outside("{http://www.tei-c.org/ns/1.0}div1")
        .shrink_whitespace()
        .remove_comments()
)

plain = view.get_plain()

create view: 100%|██████████| 37158/37158 [00:17<00:00, 2095.27it/s]
100%|██████████| 31/31 [00:00<00:00, 210.10it/s]
shrink whitespace: 100%|██████████| 760188/760188 [03:05<00:00, 4103.42it/s]
0it [00:00, ?it/s]


## Split plain text view into sentences
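Now we are ready to apply spaCy to the plain text view to split it into sentences. We use the rule-based `sentencizer` component, which splits on punctuation and does not need a trained model.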

In [3]:
nlp = French()
nlp.add_pipe('sentencizer')

sentences = []
for sent in tqdm(nlp(plain).sents):
    sentences.append(sent)





## Add resulting sentence tags to the TEI
The key feature of the standoff converter package is that it keeps, for each character, the mapping from its position in the plain text view to its position in the TEI document. Here, we look up the position in the standoff data structure from the position in the plain text view: `start_ind = view.get_table_pos(sent.start_char)`. Then, we can add `<s>` tags to the TEI document using the `so.add_inline` command.
Since the TEI document must remain a tree, it is sometimes not possible to add a new inline element, for example when a sentence begins inside one element and ends inside another. Whenever the inline element cannot be created, we use a `<span>`-`<anchor>` combination that does not break the tree structure of the document.
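To illustrate why an inline element cannot always be added, here is a minimal sketch that is independent of the standoff converter. It assumes that elements are described as half-open character ranges over the plain text; the function names `crosses` and `can_wrap_inline` and the sample ranges are made up for illustration and are not part of the library.

```python
def crosses(a, b):
    """Return True if half-open ranges a and b overlap only partially."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1


def can_wrap_inline(new_range, element_ranges):
    """A new element keeps the tree intact only if it crosses no existing one."""
    return not any(crosses(new_range, r) for r in element_ranges)


# Hypothetical layout: two <p> elements over the 19-character text
# "Bonjour. Au revoir.", the first one ending in the middle of sentence two.
paragraphs = [(0, 11), (11, 19)]

print(can_wrap_inline((0, 8), paragraphs))   # sentence inside the first <p>
print(can_wrap_inline((9, 19), paragraphs))  # sentence spanning both <p>
```

The first sentence nests cleanly inside one paragraph, while the second starts in one paragraph and ends in the other, which is the kind of case where a fallback that does not rely on nesting is needed.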


In [4]:
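for isent, sent in tqdm(enumerate(sentences), total=len(sentences)):

    # Map plain-text character offsets to positions in the standoff table.
    start_ind = view.get_table_pos(sent.start_char)
    # `end_char` is exclusive: map the last character, then add 1.
    end_ind = view.get_table_pos(sent.end_char-1)+1
    try:
        # Preferred: wrap the sentence in an inline <s> element.
        so.add_inline(
            begin=start_ind,
            end=end_ind,
            tag="s",
            depth=None,
            attrib={'id':f'{isent}'}
        )
    except ValueError:
        # Fallback when an inline element would break the tree structure.
        so.add_span(
            begin=start_ind,
            end=end_ind,
            tag="s",
            depth=None,
            attrib={'id':f'{isent}'},
            id_=f'{isent}-anchor'
        )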





## Saving the resulting files
We now create two output files. First, we save the sentencized version of the TEI XML. The sentence id is stored as an attribute of each sentence tag. Second, we save the data set as a JSONL file. This format follows the schema expected by the open source labelling tool doccano https://doccano.herokuapp.com/. This way, the sentences can be loaded into the labelling tool to create manual labels that can later be used to train a new ML model.

In [40]:
print(etree.tostring(so.text_el, pretty_print=True).decode('utf-8')[:2000])

with open("outdir/sentencized_charles-destinataires.xml", "w") as fout:
    fout.write(
        etree.tostring(so.tree, pretty_print=True).decode('utf-8')
    )

<text xmlns="http://www.tei-c.org/ns/1.0">
    <front>
      <pb xml:id="page-000a" facs="./facs/p_000a.jpg"/>
      <titlePage>
        <docTitle>
          <titlePart>Ma Guerre<lb/>1914-1918</titlePart>
        </docTitle>
        <byline>par <lb/><docAuthor>Charles Bruneau</docAuthor></byline>
      </titlePage>
      <pb xml:id="page-000b" facs="./facs/p_000b.jpg"/>
      <epigraph>
        <quote>"Ne croyez pas aux dires du Poilu quand il se<lb/> "m&#234;le de juger d'un combat. Pensez
          qu'il vous<lb/> "&#233;crit le ventre garni, s'il dit que tout va<lb/> "bien; que son soulier
          lui fait mal ou qu'il<lb/> "n'a pas dormi, s'il affirme que rien ne va<lb/> "plus
          ."</quote>
        <bibl><author>J. ARENE</author><lb/> (<title>Les Carnets d'un soldat en Haute-Alsace et<lb/>
            dans les Vosges</title>
          <biblScope>p.15</biblScope>). </bibl>
      </epigraph>
      <pb xml:id="page-001" facs="./facs/p_001.jpg"/>
      <div1 type="section" xml

In [36]:
with open("outdir/dataset.jsonl", "w") as fout:
    out_data = []
    for isentence, sentence in enumerate(sentences):
        out_data.append(json.dumps({
            "text": sentence.text,
            "labels": [],
            "meta": {"sentenceId": f"{isentence}"}
        }))
    fout.write("\n".join(out_data))

As you can see in this screenshot from doccano, the sentence id is preserved in the metadata of each doccano record:
![screenshot from doccano showing a few sentences from the edition.](doccano_screenshot.png "screenshot from doccano showing a few sentences from the edition.")
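
Because the same identifier appears in both outputs, labels created later can be looked up by sentence id before they are written back to the TEI. The following is a minimal sketch under stated assumptions: the labelled records are hypothetical and are assumed to keep the same shape as the records written above (`text`, `labels`, `meta.sentenceId`); the actual export layout of a labelling tool may differ, and the helper `labels_by_sentence_id` is not part of any library.

```python
import json

# Hypothetical labelled records, assumed to mirror the shape of dataset.jsonl.
labelled_jsonl = "\n".join([
    json.dumps({"text": "Premiere phrase.", "labels": ["narration"],
                "meta": {"sentenceId": "0"}}),
    json.dumps({"text": "Seconde phrase.", "labels": [],
                "meta": {"sentenceId": "1"}}),
])


def labels_by_sentence_id(jsonl_text):
    """Index the labels of each JSONL record by its sentence id."""
    result = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        result[record["meta"]["sentenceId"]] = record["labels"]
    return result


labels = labels_by_sentence_id(labelled_jsonl)
print(labels)
```

The resulting dictionary is keyed by the same string ids that were written to the `id` attribute of the sentence tags, so each label list can be matched to its element in the sentencized XML.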