# LEPIDEMO : LECTAUREP PIPELINE DEMONSTRATOR

Generate an XML TEI edition of the Paris notaries' registers (*répertoires*) processed by the [LECTAUREP](https://lectaurep.hypotheses.org) project, so that the transcriptions annotated in [eScriptorium](https://escriptorium.inria.fr) are converted automatically into XML TEI files that meet the expectations of the [TEI Publisher](https://teipublisher.com/index.html) *framework*.

## Preamble: description of the input data

eScriptorium can export the data produced in the application as XML PAGE files.

### Summary of the annotations specific to the LECTAUREP project

To carry out the chain of transformations described below and to build the table structure, we first defined 10 region types and 2 line types:

| name | level | usage |
| --- | ------ | ----- |
| `header` | region | marks the header zone containing pre-printed text |
| `stamp` | region | marks a graphic zone corresponding to a fiscal stamp  <!--; helps explain possible interruptions in the headers--> |
| `col_1` | region | 1st column of the table, headed "Numéros du répertoire" |
| `col_2` | region | 2nd column of the table, headed "Date de l'acte" |
| `col_3` | region | 3rd column of the table, headed "Nature et espèce des actes :/ en brevets" |
| `col_4` | region | 4th column of the table, headed "Nature et espèce des actes :/ en minutes" |
| `col_5` | region | 5th column of the table, headed "Noms, prénoms et domiciles des parties / indications, situations et prix des biens" |
| `col_6` | region | 6th column of the table, headed "Relation de l'enregistrement / dates" |
| `col_7` | region | 7th column of the table, headed "Relation de l'enregistrement / droits" |
| `marginal` | region | any marginal note |
| `first_line` | line | the first line of each entry in the register |
| `main_date` | line | each line stating the year and month covered by the current page |
<!--| printed | line | |-->

> Most documents contain a single line of type "main_date", but some contain several.

![illustration of how the tags are used](https://gitlab.inria.fr/almanach/lectaurep/lepidemo/-/raw/master/images/region_and_line_tags.png)


### Rendering in XML PAGE

The annotations made in eScriptorium appear in the `@custom` attribute of the `//TextLine` nodes as objects of class "structure": for example, for lines annotated with the "first_line" tag, the `@custom` attribute contains `custom="structure {type:first_line;}"`.

``` xml
      <!-- line without annotation -->
      <TextLine id="eSc_line_d7e2b970" >
        <Coords points="1249,2811 1256,2763 1322,2785 1428,2763 1483,2789 1538,2767 1586,2789 1626,2767 1666,2789 1732,2771 1754,2789 1842,2789 1882,2771 1948,2785 1978,2771 2113,2771 2135,2793 2183,2771 2238,2793 2267,2771 2384,2771 2406,2793 2443,2793 2450,2826 2439,2848 2003,2848 1945,2833 1523,2844 1414,2829 1252,2840"/>
        <Baseline points="1250,2815 1527,2815 1589,2822 1886,2826 2454,2828"/>
        <TextEquiv>
          <Unicode>Jenny Belin safe- à Paris 15 rue Picot à Abel à Paris 24 BdSt Denis</Unicode>
        </TextEquiv>
      </TextLine>

      <!-- annotated line -->
      <TextLine id="eSc_line_73523fc1" custom="structure {type:first_line;}">
        <Coords points="1300,2741 1304,2712 1337,2687 1406,2712 1487,2698 1652,2716 1677,2698 1703,2709 1751,2661 1780,2665 1831,2716 1926,2719 1981,2665 2432,2661 2443,2756 2410,2782 2373,2767 2245,2782 2128,2782 2077,2763 1882,2771 1816,2807 1740,2774 1674,2789 1633,2763 1586,2782 1545,2760 1498,2778 1425,2763 1304,2771"/>
        <Baseline points="1304,2745 1461,2752 1498,2760 2446,2760"/>
        <TextEquiv>
          <Unicode>Boudier (par Abel Eugène) architecte &amp;amp; Marie Thérèse Louise</Unicode>
        </TextEquiv>
      </TextLine>
```
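
To make the `@custom` convention concrete, here is a minimal sketch of how the annotation type could be read back from such a string. The helper name `custom_type` and the regular expression are illustrative assumptions, not part of the notebook, which compares the full attribute value instead.

```python
import re

# Matches the "type" value inside a PAGE @custom string such as
# "structure {type:first_line;}". The pattern is an assumption based on
# the two examples shown above.
CUSTOM_TYPE = re.compile(r"structure\s*\{[^}]*?type:\s*([^;}]+)")


def custom_type(custom):
    """Return the annotation type stored in @custom, or None if absent."""
    if not custom:
        return None
    match = CUSTOM_TYPE.search(custom)
    return match.group(1).strip() if match else None


print(custom_type("structure {type:first_line;}"))
print(custom_type("structure {type:col_5;}"))
print(custom_type(None))
```

A line without annotation has no `@custom` attribute at all, which is why the helper accepts `None`.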


## 1\. Setting up the environment

### Installing the dependencies

In [1]:
import os
import pprint
import re
import shutil
import zipfile

from bs4 import BeautifulSoup, NavigableString
import itertools
from lxml import etree
from tqdm.notebook import trange, tqdm

# if running on colab, you'll need to install the corresponding files first
import lutils.lutils as lutils
from constants import MONTHS_MAPS, TEIHEADER

### Layout of the working directory

- `/content/source/` -> receives the XML PAGE files
- `/content/tei_output/` -> receives the output of the XSL transformation
- `/content/tei4publisher/` -> receives the XML TEI files ready to be loaded into TEI Publisher

In [2]:
path_dir_content = f".{os.sep}content"
path_dir_source = f"{path_dir_content}{os.sep}source"
path_dir_output = f"{path_dir_content}{os.sep}tei_output"
path_dir_teipub = f"{path_dir_content}{os.sep}tei4publisher"

#if running locally
# if you need to flush "./content/"
#lutils.flush_dir(path_dir_content)
os.makedirs(path_dir_content, mode=0o777, exist_ok=True)
os.makedirs(path_dir_source, mode=0o777, exist_ok=True)
os.makedirs(path_dir_output, mode=0o777, exist_ok=True)
os.makedirs(path_dir_teipub, mode=0o777, exist_ok=True)


# --------------------
# if running on colab
#!rm -r content/source/
#!rm -r content/tei_output/
#!rm -r content/tei4publisher/

#!mkdir source
#!mkdir tei4publisher
#!mkdir tei_output

Retrieving the XSLT stylesheet for PAGE -> TEI:

In [3]:
url_xsl = "https://raw.githubusercontent.com/lectaurep/page2tei/main/xmlpage_to_tei.xsl"

# if running locally
path_to_xsl = os.path.join(os.path.abspath("."),(os.path.basename(url_xsl)))
lutils.pywget(url_xsl, path_to_xsl)

# --------------------
# if executing on colab:
#path_to_xsl = os.path.join("content", os.path.basename(url_xsl))
#!wget -N $url_xsl

Retrieving the Saxon HE 9.9.1-7 XSLT processor to apply the XSL stylesheet:

In [4]:
# retrieving the Saxon processor
url_saxon = "https://repo1.maven.org/maven2/net/sf/saxon/Saxon-HE/9.9.1-7/Saxon-HE-9.9.1-7.jar"

# if running locally
path_to_saxon = os.path.join(os.path.abspath("."),(os.path.basename(url_saxon)))
lutils.pywget(url_saxon, path_to_saxon)

# --------------------
# if executing on colab:
#path_to_saxon = os.path.join("/content", os.path.basename(url_saxon))
#!wget -N $url_saxon

Retrieving the RNG schema

## 2\. Function and class definitions

### 2.1 Input/output utils

In [5]:
def control_schema_validity(xml_tree):
    """Control the existence of an accepted root element in XML tree"""
    ACCEPTED_SCHEMAS = ["PcGts", "TEI"]
    for schema in ACCEPTED_SCHEMAS:
        if len(xml_tree.find_all(schema)) == 1:
            return True
    print("No valid schema found in the XML file!")
    return False
    

def open_and_parse_file(path):
    """Open a file, parse with BS and return result"""
    with open(path, 'r', encoding='utf8') as fh:
        content = fh.read()
    parsed = BeautifulSoup(content, 'xml')
    if not control_schema_validity(parsed):
        parsed = None
    return parsed


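def save_file(path, content):
    """Create or overwrite a file with the given content"""
    try:
        with open(path, 'w', encoding="utf8") as fh:
            fh.write(content)
    except OSError as e:
        # report the failure instead of silently discarding it
        print(f"Could not write {path}: {e}")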


def correct_xsl(path_to_xsl, tei_file):
    """Patch the stylesheet in place: add standalone="yes" and redirect the result document to tei_file"""
    with open(path_to_xsl, "r", encoding='utf-8') as fh:
        xsl = fh.read()
    xsl = xsl.replace("""<?xml version="1.0" encoding="UTF-8"?>""", """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>""")
    # we don't like how the output file is named:
    xsl = xsl.replace('<xsl:result-document href="teifromxmlpage.xml"', f'<xsl:result-document href="{tei_file}"')
    with open(path_to_xsl, "w", encoding="utf-8") as fh:
        fh.write(xsl)
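
The root-element check performed by `control_schema_validity` can also be sketched with the standard library only. This is an illustrative alternative, not the notebook's implementation: it uses `xml.etree.ElementTree` instead of BeautifulSoup, and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

# Same two root elements as ACCEPTED_SCHEMAS in the cell above.
ACCEPTED_ROOTS = {"PcGts", "TEI"}


def root_local_name(xml_text):
    """Return the root element name without its "{namespace}" prefix."""
    root = ET.fromstring(xml_text)
    return root.tag.rsplit("}", 1)[-1]


def has_accepted_root(xml_text):
    """True if the document root is one of the accepted elements."""
    return root_local_name(xml_text) in ACCEPTED_ROOTS


print(has_accepted_root('<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'))
print(has_accepted_root("<html/>"))
```

Unlike the BeautifulSoup version, this variant only looks at the document root, so a `TEI` element nested deeper in the tree would not be accepted.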

### 2.2 XML PAGE analysis

We look for the lines annotated as "first_line" and, from their coordinates, generate a vertical slicing of the page that identifies the individual entries of the register page.

**Warning:** for now, anything located before the first entry is lost. For example, if a paragraph that started on the previous page ends at the top of the page being processed, those few lines are "lost".

In [6]:
def vertical_slicing(xml_tree):
    """Compose the vertical slices in an image depending on the coordinates of the
    TextLine nodes tagged "first_line" inside the col_5 TextRegion nodes."""
    # Targeting the central column (col_5)
    main_cols = xml_tree.find_all("TextRegion", custom="structure {type:col_5;}")
    head_lines = []
    for main_col in main_cols:
        first_lines = main_col.find_all("TextLine", custom="structure {type:first_line;}")
        # nested loop, so that every col_5 region contributes its first lines
        for fline in first_lines:
            coords = fline.find_all("Coords")
            points = coords[0].attrs.get("points", f"le noeud {fline.attrs['id']} a des coords incomplètes (pas de @points)")
            y_coords = [int(xy.split(",")[-1]) for xy in points.split(" ")]
            min_y = min(y_coords)
            head_lines.append({"head_line": fline.attrs["id"], "top_max": min_y})

    # the bottom of each slice is the top of the next one (None for the last slice)
    for i, line in enumerate(head_lines):
        if i + 1 < len(head_lines):
            max_y = head_lines[i + 1]["top_max"]
        else:
            max_y = None
        line["bottom_max"] = max_y
    return head_lines
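
The slicing logic can be illustrated independently of the XML. The sketch below assumes we already have the top `y` value of each "first_line"; the function name `slices_from_tops` and the sample values are hypothetical. It also sorts the values, which the notebook function does not do (it relies on document order).

```python
def slices_from_tops(tops):
    """Pair the top y of each entry with the top y of the next entry."""
    ordered = sorted(tops)
    # the last slice is open-ended, hence the trailing None
    bottoms = ordered[1:] + [None]
    return [{"top_max": top, "bottom_max": bottom}
            for top, bottom in zip(ordered, bottoms)]


for part in slices_from_tops([2661, 2105, 3240]):
    print(part)
```

Each dictionary has the same `top_max` / `bottom_max` keys that `Row` later reads to decide which lines belong to an entry.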

### 2.3 Generating the table entries

A Row object provides a quick model of the entries of the register page, based on the vertical slicing performed in the previous step. The object's properties record the links between the segments through their identifiers.

`Row.show_row()` displays those identifiers; `Row.show_text_in_row()` displays the text gathered for the entry.

In [7]:
class Row:
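    def show_row(self):
        # copy the attributes so that the instance keeps its _xml_tree
        shown = {key: value for key, value in self.__dict__.items() if key != "_xml_tree"}
        pp = pprint.PrettyPrinter(indent=2)
        pp.pprint(shown)
        return shown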

    def _find_text(self, node_id):
        return self._xml_tree.find(True, id=node_id).text.strip()

    def show_text_in_row(self):
        main_paragraph = "\n+ ".join([self._find_text(n_id) for n_id in self.main_paragraph])
        print("\n".join([f"{self.top_limit} < {self.bottom_limit}",
                         f"num de répertoire : {[self._find_text(n_id) for n_id in self.entry_id]}",
                         f"date de l'acte : {[self._find_text(n_id) for n_id in self.date_of_act]}",
                         f"types de l'acte (brevet) : {[self._find_text(n_id) for n_id in self.type_of_act['brevet']]}",
                         f"types de l'acte (minute) : {[self._find_text(n_id) for n_id in self.type_of_act['minute']]}",
                         f"paragraphe central : {main_paragraph}",
                         f"date d'enregistrement : {[self._find_text(n_id) for n_id in self.registration_relation['date']]}",
                         f"droits d'enregistrement : {[self._find_text(n_id) for n_id in self.registration_relation['droits']]}",
                         f"misc : {[self._find_text(n_id) for n_id in self.misc]}",
                         "___fin___"]))

    def _elems_in_range(self):
        """Collect all segments fitting within the range defined by the vertical slice"""
        selected_elems = []
        for textline in self._xml_tree.find_all("TextLine"):
            baseline = textline.find("Baseline")
            points = baseline.attrs.get("points", f"le noeud {textline.attrs['id']} a des coords incomplètes (pas de @points)")
            y_coords = [int(xy.split(",")[-1]) for xy in points.split(" ")]
            highest_point = sorted(y_coords)[0]
            if not self.bottom_limit:
                if self.top_limit < highest_point:
                    selected_elems.append(textline)
            else:
                if self.top_limit < highest_point <= self.bottom_limit:
                    selected_elems.append(textline)
        return selected_elems

    def _distribute_lines(self, lines):
        for line in lines:
            # custom="structure {type:col_3;}" -> "col_3"
            if "custom" in line.parent.attrs.keys():
                region_type = line.parent.attrs["custom"].replace("structure {type:", "").replace(";}", "")
                if region_type == "col_1":
                    self.entry_id.append(line.attrs["id"])
                elif region_type == "col_2":
                    self.date_of_act.append(line.attrs["id"])
                elif region_type == "col_3":
                    self.type_of_act["brevet"].append(line.attrs["id"])
                elif region_type == "col_4":
                    self.type_of_act["minute"].append(line.attrs["id"])
                elif region_type == "col_5":
                    # TODO: what if the id of the line is not head_line...
                    self.main_paragraph.append(line.attrs["id"])
                elif region_type == "col_6":
                    self.registration_relation["date"].append(line.attrs["id"])
                elif region_type == "col_7":
                    self.registration_relation["droits"].append(line.attrs["id"])
                else:
                    self.misc.append(line.attrs["id"])
            else:
                self.misc.append(line.attrs["id"])

    def __init__(self, xml_tree, part):
        self._xml_tree = xml_tree
        self.top_limit = part["top_max"]
        self.bottom_limit = part["bottom_max"]
        self.head_line = part["head_line"]
        self.main_paragraph = []
        self.entry_id = []
        self.date_of_act = []
        self.type_of_act = {"minute" : [], "brevet": []} # warrant?
        self.registration_relation = {"date" : [] , "droits" : []}
        self.misc = []

        associated_lines = self._elems_in_range()
        self._distribute_lines(associated_lines)
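
The range test used by `Row._elems_in_range` can be sketched on plain numbers. The helper names, slice tuples and line identifiers below are hypothetical; only the comparison `top < y <= bottom` is taken from the class above.

```python
def in_slice(y, top, bottom):
    """True if y falls in (top, bottom]; bottom=None means no lower bound."""
    if bottom is None:
        return top < y
    return top < y <= bottom


def group_by_slice(line_ys, slices):
    """Assign each line id to the index of the first slice containing its y."""
    groups = {}
    for line_id, y in line_ys.items():
        for index, (top, bottom) in enumerate(slices):
            if in_slice(y, top, bottom):
                groups.setdefault(index, []).append(line_id)
                break
    return groups


slices = [(100, 200), (200, None)]
line_ys = {"a": 150, "b": 200, "c": 250, "d": 90}
print(group_by_slice(line_ys, slices))
```

Line `"d"` sits above the first slice and is therefore not assigned to any entry, which is the loss of information mentioned in the warning of section 2.2.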

### 2.4 Modifying the TEI files

1. update the metadata in the teiHeader
2. add a "//text" section containing a table (//table) in which each row (//row) corresponds to a Row object

#### 2.4.1 Checking the dates

In [8]:
# dates control
def make_combinations(years, months):
    """Create every possible combination of months and years"""
    return [f"{yyyy}-{mm}" for yyyy, mm in itertools.product(years, months)]


def get_months_and_years(tree):
    """Get month(s) and year(s) a page refers to"""
    main_date = [elem for elem in tree.find_all(True, custom=True) if elem.attrs["custom"] == 'structure {type:main_date;}']
    years, months = build_list_of_years_and_months(main_date)
    return make_combinations(years, months)


def compose_iso_date(date_node, yyyy_mm):
    """Prefix a day-only @when-iso with the year and month; return None otherwise"""
    iso_date = None
    when_iso = date_node.attrs.get("when-iso")
    if when_iso and len(when_iso.split("-")) == 1:
        iso_date = f'{yyyy_mm}-{when_iso}'
    return iso_date


def build_list_of_years_and_months(main_dates_nodes):
    years = []
    months = []
    for textline in main_dates_nodes:
        line = str(textline.TextEquiv.Unicode.string)
        # get year(s)
        myear = re.search(r"\d+", line) #r"\d{4}" ?
        if myear:
            years.append(myear.group(0))
        # get month(s)
        for month in MONTHS_MAPS.keys():
            if month in line.lower():
                months.append(MONTHS_MAPS[month])
    years = list(set(years))
    months = list(set(months))
    return years, months
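
As a worked example of the year and month detection, the sketch below runs on a single "main_date" string. The `MONTHS` dictionary is a hypothetical stand-in for `MONTHS_MAPS` (whose real content lives in `constants.py`), and the four-digit pattern is the stricter variant hinted at in the comment above, not the pattern the cell uses.

```python
import itertools
import re

# Hypothetical excerpt; the real mapping is MONTHS_MAPS in constants.py.
MONTHS = {"janvier": "01", "mars": "03", "avril": "04"}


def year_month_candidates(line):
    """Return every "YYYY-MM" combination mentioned in a main_date line."""
    years = re.findall(r"\b\d{4}\b", line)
    months = [number for name, number in MONTHS.items() if name in line.lower()]
    return [f"{year}-{month}" for year, month in itertools.product(years, months)]


print(year_month_candidates("Mars et Avril 1897"))
```

A page covering two months yields two candidates, which is why the pipeline carries a list of `yyyy_mms` rather than a single value.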

In [9]:
# TEI Tree date modification
def is_date(value):
    """Evaluate how likely a string is to refer to a date"""
    value = str(value)
    if '(' in value or ')' in value or '&amp;' in value:
        value = value.replace("(", "").replace(")", "").replace("&amp;", " ").strip()
    if value and value.isdigit():
        if 1 <= int(value) <= 31:
            return "high"
    if value.strip() in ['"', "d°", "-", "- -"]:
        return "unknown"
    for month in MONTHS_MAPS.keys():
        if month in value.lower():
            return "medium"
    return "low"


def control_dates(tree):
    """Parse date elements in TEI tree and add a cert attribute or delete when-iso"""
    #slices = slice_dates(tree)
    for date in tree.body.find_all("date"):
        date.attrs["cert"] = is_date(date.string)
        # drop @when-iso unless it holds a purely numeric (day) value
        if "when-iso" in date.attrs and not date.attrs["when-iso"].isdigit():
            del date.attrs["when-iso"]


def complete_wheniso_attrs(tei_tree, yyyy_mms):
    """Change values in wheniso attrs"""
    for date in tei_tree.find_all("date"):
        if "when-iso" in date.attrs.keys():
            new_values = [f"{yyyy_mm}-{date.attrs['when-iso']}" for yyyy_mm in yyyy_mms]
            if len(new_values) == 1:
                date.attrs["when-iso"] = new_values[0]
            elif len(new_values) > 1:
                date.attrs["when-iso"] = f"{', '.join(new_values)}"
    return tei_tree
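
The completion step concatenates the year-month prefix and the day as they are, so a day transcribed as "5" produces "1897-03-5". A possible refinement, shown here as an assumption and not as what the cell above does, is to zero-pad the day so that the result is a well-formed ISO 8601 date. The function name `complete_day` is hypothetical.

```python
def complete_day(day, yyyy_mms):
    """Combine a numeric day string with each candidate "YYYY-MM" prefix."""
    return [f"{yyyy_mm}-{int(day):02d}" for yyyy_mm in yyyy_mms]


print(complete_day("5", ["1897-03"]))
print(complete_day("17", ["1897-03", "1897-04"]))
```

When a page covers several months, every candidate is kept, mirroring the comma-separated list written by `complete_wheniso_attrs`.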

#### 2.4.2 Modifying the teiHeader

In [10]:
# TEI header
def update_tei_header(tei_tree):
    # 1. update the "title" and "author" elements
    tei_tree.fileDesc.replace_with(BeautifulSoup(TEIHEADER["fileDesc"], "xml").fileDesc.extract())
    tei_tree.titleStmt.title.string = tei_file.split("/")[-1].replace(".xml", "")
    tei_tree.titleStmt.author.string = "Lectaurep"
    if len(tei_tree.find_all("encodingDesc")) == 1:
        tei_tree.encodingDesc.replace_with(BeautifulSoup(TEIHEADER["encodingDesc"], "xml").encodingDesc.extract())
    else:
        tei_tree.fileDesc.insert_after(BeautifulSoup(TEIHEADER["encodingDesc"], "xml").encodingDesc.extract())
    return tei_tree

#### 2.4.3 Adding the rows

In [11]:
# table & rows
def build_new_row(xml_tree, row):
    new_row = xml_tree.new_tag("row")
    # 1. register number
    cell = xml_tree.new_tag("cell", n=1, role="col1")
    for num_rep in row.entry_id:
        matching_tag = xml_tree.find(True, attrs={"xml:id": num_rep})
        cell.append(xml_tree.new_tag("lb", facs=f"#{num_rep}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    # 2. date of the act
    cell = xml_tree.new_tag("cell", n=2, role="col2")
    for date_act in row.date_of_act:
        matching_tag = xml_tree.find(True, attrs={"xml:id": date_act})
        cell.append(xml_tree.new_tag("lb", facs=f"#{date_act}"))
        cell_content = matching_tag.line.text.strip()
        # every value is wrapped in <date>; control_dates() later sets @cert
        # and removes the non-numeric @when-iso values
        cell_date = xml_tree.new_tag("date", attrs={"when-iso":cell_content}) #TODO: build @when more carefully
        cell_date.append(NavigableString(cell_content))
        cell.append(cell_date)
    new_row.append(cell)
    # 3. type of act (brevet)
    cell = xml_tree.new_tag("cell", n=3, role="col3")
    for type_brevet in row.type_of_act["brevet"]:
        matching_tag = xml_tree.find(True, attrs={"xml:id": type_brevet})
        cell.append(xml_tree.new_tag("lb", facs=f"#{type_brevet}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    # 4. type of act (minute)
    cell = xml_tree.new_tag("cell", n=4, role="col4")
    for type_minute in row.type_of_act["minute"]:
        matching_tag = xml_tree.find(True, attrs={"xml:id": type_minute})
        cell.append(xml_tree.new_tag("lb", facs=f"#{type_minute}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    # 5. central paragraph
    cell = xml_tree.new_tag("cell", n=5, role="col5")
    for mainp in row.main_paragraph:
        n = row.main_paragraph.index(mainp) + 1
        matching_tag = xml_tree.find(True, attrs={"xml:id": mainp})
        cell.append(xml_tree.new_tag("lb", n=n, facs=f"#{mainp}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    # 6. registration date
    cell = xml_tree.new_tag("cell", n=6, role="col6")
    for date_reg in row.registration_relation["date"]:
        matching_tag = xml_tree.find(True, attrs={"xml:id": date_reg})
        cell.append(xml_tree.new_tag("lb", facs=f"#{date_reg}"))
        cell_content = matching_tag.line.text.strip()
        # every value is wrapped in <date>; control_dates() later sets @cert
        # and removes the non-numeric @when-iso values
        cell_date = xml_tree.new_tag("date", attrs={"when-iso":cell_content}) #TODO: build @when more carefully
        cell_date.append(NavigableString(cell_content))
        cell.append(cell_date)
    new_row.append(cell)
    # 7. registration fees
    cell = xml_tree.new_tag("cell", n=7, role="col7")
    for droits_reg in row.registration_relation["droits"]:
        matching_tag = xml_tree.find(True, attrs={"xml:id": droits_reg})
        cell.append(xml_tree.new_tag("lb", facs=f"#{droits_reg}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    # 8. misc
    cell = xml_tree.new_tag("cell", n=8, role="misc")
    for misc in row.misc:
        n = row.misc.index(misc) + 1
        matching_tag = xml_tree.find(True, attrs={"xml:id": misc})
        cell.append(xml_tree.new_tag("lb", n=n, facs=f"#{misc}"))
        cell.append(NavigableString(matching_tag.line.text))
    new_row.append(cell)
    return new_row


def add_table_structure(tei_tree):
    # 1. build the tempo_text (later renamed text) and body nodes
    tei_tree.sourceDoc.insert_after(tei_tree.new_tag("tempo_text"))
    tei_tree.tempo_text.append(tei_tree.new_tag("body"))
    # 2. create the table (`rows` is the notebook-level list built in section 3.2)
    tei_tree.body.append(tei_tree.new_tag("div", type="main"))
    tei_tree.body.div.append(tei_tree.new_tag("table", rows=len(rows), cols=8))
    table_labels = tei_tree.new_tag("row", role="label", n=0)
    labels = ["Numéros du répertoire", "Dates des actes", "Actes en brevets", 
                        "Actes en minutes", "Noms, prénoms et domiciles des parties ; indication, situations et prix des biens", 
                        "Date de l'enregistrement", "Droits de l'enregistrement", "Autres"]
    for label in labels:
        cell = tei_tree.new_tag("cell", role=f"label{labels.index(label) + 1}", n=labels.index(label) + 1)
        cell.append(NavigableString(label))
        table_labels.append(cell)
    tei_tree.body.div.table.append(table_labels)
    return tei_tree
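
To visualise the target structure (a `row` made of `cell` elements, each line introduced by an `lb` pointing back to its source segment), here is a minimal sketch using the standard library. It relies on `xml.etree.ElementTree` rather than BeautifulSoup, and the function name and line identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_row(cells):
    """Build a <row>; `cells` holds one list of (line_id, text) pairs per column."""
    row = ET.Element("row")
    for n, lines in enumerate(cells, start=1):
        cell = ET.SubElement(row, "cell", {"n": str(n), "role": f"col{n}"})
        for line_id, text in lines:
            lb = ET.SubElement(cell, "lb", {"facs": f"#{line_id}"})
            lb.tail = text  # the transcription follows the empty <lb/> element
    return row


# Hypothetical identifiers and values, for illustration only.
row = make_row([[("eSc_line_a", "412")], [("eSc_line_b", "3")]])
print(ET.tostring(row, encoding="unicode"))
```

Keeping the segment identifier in `@facs` is what lets each transcribed line be traced back to its zone on the image.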

#### 2.4.4 Main function for modifying the XML TEI files

In [12]:
def modify_tei_file(tei_file, rows, yyyy_mms):
    # 1. open the TEI file
    xml_tree = open_and_parse_file(tei_file)
    print("Done parsing file")
    
    # 2. update the teiHeader
    xml_tree = update_tei_header(xml_tree)
    print("Done updating teiHeader")
    
    # 3. add the table structure
    xml_tree = add_table_structure(xml_tree)
    print("Done building table structure")
    
    # 4. add the table content
    for row in tqdm(rows):
        new_row = build_new_row(xml_tree, row)
        xml_tree.body.div.table.append(new_row)
    print("Done building rows")
    
    # 5. clean up the temporary elements
    text_node = xml_tree.find_all('tempo_text')
    if len(text_node) >= 1:
        text_node[0].name = "text"
    print("Done cleaning")
    
    # 6. standardise the dates in the when-iso attributes
    control_dates(xml_tree)
    complete_wheniso_attrs(xml_tree, yyyy_mms)
    print("Done handling dates")
    
    tei4publisher_file = tei_file.replace("tei_output/", "tei4publisher/")
    file_content = xml_tree.prettify().replace("&amp;amp;", "&amp;")
    return tei4publisher_file, file_content

## 3\. Running the pipeline

### 3.1 Downloading the source files

The value of "url" is the permalink generated by eScriptorium for downloading an archive. The archive is created directly in the eScriptorium GUI with the "Export" function. Exporting the XML PAGE files is sufficient.

In [13]:
url = "https://escriptorium.inria.fr/media/users/4/export_test_vers_tei_publisher_pagexml_202106181347.zip"
path_to_zip = os.path.join(path_dir_source, os.path.basename(url))
abszipname = os.path.abspath(path_to_zip)

lutils.pywget(url, path_to_zip, verify=False)
with zipfile.ZipFile(path_to_zip, 'r') as ziph:
    ziph.extractall(path_dir_source)
os.remove(path_to_zip)

# ------------------
#if running on colab
#!cd content/source && wget -N $url --no-check-certificate && unzip -u $abszipname && rm $abszipname
# The --no-check-certificate parameter is added to download the eScriptorium export from its API without being authenticated

sources = [f for f in os.listdir(path_dir_source) if f.endswith(".xml")]
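
Before extracting an export, it can be useful to check which entries of the archive will actually be picked up as sources. The sketch below builds a tiny archive in memory so that it runs without network access; the function name and the file names are hypothetical and do not describe the real content of an eScriptorium export.

```python
import io
import zipfile


def xml_members(zip_bytes):
    """List the .xml entries of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as ziph:
        return sorted(name for name in ziph.namelist()
                      if name.lower().endswith(".xml"))


# Build a small test archive instead of downloading one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as ziph:
    ziph.writestr("page_0002.xml", "<PcGts/>")
    ziph.writestr("page_0001.xml", "<PcGts/>")
    ziph.writestr("notes.txt", "not a source file")

print(xml_members(buffer.getvalue()))
```

The same suffix filter is applied by the `sources` list above once the archive has been extracted.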



### 3.2 Running the full pipeline

In [14]:
out_files = []
for source in sources:
    print(f"-------: {source} :-------")
    # open xml page file
    source = os.path.join("content/source/", source)
    page_tree = open_and_parse_file(source)
    
    # get months and years combinations
    yyyy_mms = get_months_and_years(page_tree)
    
    # analyze page tree and create rows    
    rows = []
    for entry in vertical_slicing(page_tree):
        current_row = Row(page_tree, entry)
        rows.append(current_row)
    
    # generate tei file
    tei_file = source.replace(".xml", "-tei.xml").replace("/source/", "/tei_output/")
    
    # java must be installed
    print("starting conversion")
    !java -jar $path_to_saxon -xsl:$path_to_xsl -s:$source -o:$tei_file
    print("conversion finished")
    
    # modify tei file and save it #TODO control validity
    out_file_name, out_file_content = modify_tei_file(tei_file, rows, yyyy_mms)
    
    # save file
    print(f"saving to {out_file_name}")
    save_file(out_file_name, out_file_content)
    out_files.append(out_file_name)

-------: FRAN_0025_3657_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/20 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_3657_L-1-tei.xml
-------: DAFANCH96_023MIC07633_L-0.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/13 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/DAFANCH96_023MIC07633_L-0-tei.xml
-------: DAFANCH96_023MIC07645_L-0.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/18 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/DAFANCH96_023MIC07645_L-0-tei.xml
-------: FRAN_0025_3056_L-0.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/19 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_3056_L-0-tei.xml
-------: FRAN_0025_1290_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/21 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_1290_L-1-tei.xml
-------: FRAN_0025_5094_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/21 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_5094_L-1-tei.xml
-------: FRAN_0025_4648_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/33 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_4648_L-1-tei.xml
-------: FRAN_0025_5795_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/14 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_5795_L-1-tei.xml
-------: FRAN_0025_0227_L-0.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/37 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_0227_L-0-tei.xml
-------: FRAN_0025_6067_L-1.xml :-------
starting conversion
conversion finished
Done parsing file
Done updating teiHeader
Done building table structure


  0%|          | 0/26 [00:00<?, ?it/s]

Done building rows
Done cleaning
Done handling dates
saving to content/tei4publisher/FRAN_0025_6067_L-1-tei.xml
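
The `!java` shell escape used in the loop only works inside IPython or Jupyter. Outside a notebook, the same Saxon call could be issued with `subprocess`; the sketch below is one possible way to do it, with hypothetical function names and example paths. The `-xsl:`, `-s:` and `-o:` options are the ones already used in the loop.

```python
import subprocess


def saxon_command(jar, xsl, source, output):
    """Build the argument list equivalent to the `!java -jar ...` line."""
    return ["java", "-jar", jar, f"-xsl:{xsl}", f"-s:{source}", f"-o:{output}"]


def run_saxon(jar, xsl, source, output):
    """Run the transformation; raises CalledProcessError on a non-zero exit."""
    subprocess.run(saxon_command(jar, xsl, source, output), check=True)


# Only the command is printed here; run_saxon() requires Java and the jar.
print(saxon_command("Saxon-HE-9.9.1-7.jar", "xmlpage_to_tei.xsl",
                    "content/source/page.xml",
                    "content/tei_output/page-tei.xml"))
```

Passing the arguments as a list avoids shell quoting problems with file names, and `check=True` stops the pipeline when the transformation fails instead of continuing with a missing TEI file.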


### [opt] 3.3 Checking the validity of the XML TEI files

Warning: building the "RelaxNG" object is extremely slow...

In [15]:
#path_to_rng = os.path.join(".", os.path.join("schema", "LEPIDEMO.rng"))
#rng_doc = etree.parse(path_to_rng)
#relaxng = etree.RelaxNG(rng_doc)

In [16]:
#def validator(xml_file_name, relaxng):    
#    try:
#        tree = etree.parse(xml_file_name)
#    except etree.XMLSyntaxError as e:
#        print(f"Error parsing {xml_file_name}:", e, sep="\n")
#        return False
#    return relaxng.validate(tree)

In [17]:
#for file in tqdm(out_files):
#    # control validity
#    print(f"Controlling validity of {file}")
#    if validator(file, relaxng):
#        print("validity: ok")
#    else:
#        print("validity: failed!")
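
Since building the RelaxNG object is slow, a cheap well-formedness check can catch broken files first. The sketch below uses only the standard library and the hypothetical helper name `is_well_formed`; it does not replace validation against the schema, it only confirms that the XML parses.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<TEI><text/></TEI>"))
print(is_well_formed("<TEI><text></TEI>"))
```

Files that fail this check can be reported immediately, and only the remaining ones need to go through the RelaxNG validator.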

### [opt] 3.4 Generating an archive

In [18]:
# ------------------
#if running on colab
# zip the content of tei4publisher/
#!zip -j content/tei4publisher.zip -r content/tei4publisher/
# zip the content of tei_output/
#!zip -j content/tei_output.zip -r content/tei_output/