## Objectives

Show the impact of preprocessing on lemmatization and morphological tagging.

## Methods

1. Compare the annotation obtained with and without normalization and simple sentence segmentation.
2. Compare the number of tokens per analysis with Pie vs. Pie-Extended, and the results?

In [1]:
import pandas

TASKS = "lemma,Deg,Numb,Person,Mood_Tense_Voice,Case,Gend,pos".split(",")


# Chunk size. Despite the name, the parsing loop counts tokens, not characters.
SENTENCE_CHAR_LENGTH = 35

def pretty_print(sentences, size=5):
    df = pandas.DataFrame([
        {task: tok.get(task, "_") for task in ["form"]+TASKS}
        for sent in sentences[:size] 
        for tok in sent+[{"form": "SENTENCE_BREAK"}]
    ], dtype=str)
    print(df.to_string())
    
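# Display labels, kept in French because they appear verbatim in the output table further down.
# English glosses: Normalized = full preprocessing; Raw = no preprocessing, sentence segmentation kept;
# RawVJUI = no preprocessing except VJUI, sentence segmentation kept;
# Chunks35 / Chunks35Normalized = fixed-size chunks without / with full preprocessing.
PrettyName = {
    "Normalized": "Prétraitement complet",
    "Raw": "Prétraitement absent, tokenization phrase conservé",
    "RawVJUI": "Prétraitement absent, VJUI OUI, tokenization phrase conservé",
    "Chunks35": f"Prétraitement absent, tokenization tous les {SENTENCE_CHAR_LENGTH} caractères",
    "Chunks35Normalized": f"Prétraitement complet, tokenization tous les {SENTENCE_CHAR_LENGTH} caractères"
}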

## Loading the data

### General imports

In [2]:
from tqdm import tqdm

from tools.analysis_utils import (
    import_known_tokens,
    compile_scores,
    convert_raw,
    vjui
)

### Parsing the document

For each document, we keep:

1. A normalized version (`Normalized`), where punctuation is removed and sentences are split on strong punctuation (`.`, `?`, `!`).
2. A non-normalized version (`Raw`), where punctuation is kept and sentences are split on strong punctuation.
3. A non-normalized version (`Chunks35`), where a new segment starts every 35 tokens (Pie's default).
4. A normalized version (`Chunks35Normalized`), where a new segment starts every 35 tokens (Pie's default).
5. A non-normalized version (`RawVJUI`), split on strong punctuation, to which the VJUI normalization is applied later.
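
The two segmentation strategies can be sketched on a toy token list. This is a simplified illustration, not the notebook's parsing loop: `split_on_hard_punct`, `split_every` and the sample tokens are hypothetical, and the chunk size is reduced to 4 so that the result stays readable.

```python
HARD_PUNCT = {".", "?", "!"}


def is_punct(token):
    """Assumption for this sketch: a token with no alphanumeric character is punctuation."""
    return not any(ch.isalnum() for ch in token)


def split_on_hard_punct(tokens, keep_punct=True):
    """Start a new sentence after each strong punctuation mark."""
    sentences, current = [], []
    for tok in tokens:
        if keep_punct or not is_punct(tok):
            current.append(tok)
        if tok in HARD_PUNCT and current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_every(tokens, size, keep_punct=True):
    """Start a new chunk every `size` tokens, ignoring sentence boundaries."""
    kept = [tok for tok in tokens if keep_punct or not is_punct(tok)]
    return [kept[i:i + size] for i in range(0, len(kept), size)]


tokens = ["Veni", ",", "vidi", ".", "Vici", "!", "Alea", "iacta", "est", "."]

raw_like = split_on_hard_punct(tokens, keep_punct=True)
normalized_like = split_on_hard_punct(tokens, keep_punct=False)
chunks_like = split_every(tokens, 4, keep_punct=True)
chunks_normalized_like = split_every(tokens, 4, keep_punct=False)

print(normalized_like)         # [['Veni', 'vidi'], ['Vici'], ['Alea', 'iacta', 'est']]
print(chunks_normalized_like)  # [['Veni', 'vidi', 'Vici', 'Alea'], ['iacta', 'est']]
```

Fixed-size chunks cut across sentence boundaries, which is the effect the `Chunks35` and `Chunks35Normalized` versions are meant to measure.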

In [3]:
Texts = []

# Punctuation removed, split on strong punctuation
Normalized = [[]]
# Punctuation kept, split on strong punctuation
Raw = [[]]
# Punctuation kept, new chunk every 35 tokens
Chunks35 = [[]]
# Punctuation removed, new chunk every 35 tokens
Chunks35Normalized = [[]]
# Punctuation kept, split on strong punctuation, VJUI applied at normalization time
RawVJUI = [[]]

Versions = {
    "Normalized": Normalized,
    "Raw": Raw,
    "RawVJUI": RawVJUI,
    "Chunks35": Chunks35,
    "Chunks35Normalized": Chunks35Normalized
}

EndToken = 14225 # As per Anthony's sheet
nb_tokens = 0


def filter_gold(data):
    return [lst for lst in data if lst]

def append(val, *arrs):
    for x in arrs:
        x[-1].append(val)
        
def create_chunks(*arrs):
    for x in filter_gold(arrs):
        if x[-1] != []:
            x.append([])
        yield x

def filter_all(*arrs):
    for x in arrs:
        yield filter_gold(x)

def count_toks(sentences):
    return len([tok for sent in sentences for tok in sent])
        

TextsMilestonesStart = {
    
}

HARD_PUNCT = "?.!"


Milestones = set(list(range(0, 14226)) + list(range(20384, 24316)))

with open("/home/thibault/dev/latin-non-classical-data/latin-chretien-v2.tsv") as f:
    header = []
    current_text = None
    last_token = {}
    for lineno, line in enumerate(f):
        line = line.strip().split("\t")
        if lineno == 0:
            header = line
            continue
        line = dict(zip(header, line))
        
        if lineno not in Milestones:
            continue
        
        if line["form"].startswith("urn:"):
            TextsMilestonesStart[line["form"]] = {
                version: (len(arr) - 1, count_toks(arr))
                for version, arr in Versions.items()
            }
            last_token = {}
            Raw, Chunks35, Normalized, Chunks35Normalized, RawVJUI = create_chunks(
                Raw, Chunks35, Normalized, Chunks35Normalized, RawVJUI
            ) 
            continue
        
        
        if line["POS"] == "PUNC":
            append(line, Raw, Chunks35, RawVJUI)
            # Default to "_" so the test cannot raise on None right after a text boundary
            if line["lemma"] in HARD_PUNCT and last_token.get("lemma", "_") not in ".!?":
                Normalized, Raw, RawVJUI = create_chunks(Normalized, Raw, RawVJUI)
        else:
            append(line, Raw, Chunks35, Normalized, Chunks35Normalized, RawVJUI)
        
        if len(Chunks35[-1]) == SENTENCE_CHAR_LENGTH:
            Chunks35, *_ = create_chunks(Chunks35)
        if len(Chunks35Normalized[-1]) == SENTENCE_CHAR_LENGTH:
            Chunks35Normalized, *_ = create_chunks(Chunks35Normalized)
        
        last_token = line
        #if lineno+1 == EndToken:
        #    break   

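# Drop empty sentences. This rebinds the names to new lists:
# `Versions` still points to the original lists, which the rest of the notebook uses.
Normalized, Raw, Chunks35, Chunks35Normalized, RawVJUI = filter_all(
    Normalized, Raw, Chunks35, Chunks35Normalized, RawVJUI
)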

for version, arr in Versions.items():
    print(f"{version}")
    print(f"\t{len(arr)} sentences found")
    print(f"\t{count_toks(arr)} tokens found")
# print(TextsMilestonesStart)

Normalized
	716 sentences found
	15367 tokens found
Raw
	717 sentences found
	18138 tokens found
RawVJUI
	717 sentences found
	18138 tokens found
Chunks35
	527 sentences found
	18138 tokens found
Chunks35Normalized
	448 sentences found
	15367 tokens found


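### Normalization

For each version, a normalization function is applied to the forms: `vjui` for `Normalized`, `Chunks35Normalized` and `RawVJUI`, and the identity for `Raw` and `Chunks35`. Lemmas are always passed through `vjui`.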

In [4]:
Normalisations = {
    "Normalized": vjui,
    "Chunks35Normalized": vjui,
    "RawVJUI": vjui,
    "Raw": lambda x: x,
    "Chunks35": lambda x: x
}

PlatinumVersions = {
    version: convert_raw(sentences, task_list=TASKS, lemma_fn=vjui, form_fn=Normalisations[version], 
                       pos_fn=lambda x:x)
    for version, sentences in Versions.items()
}
pretty_print(PlatinumVersions["Raw"])

100%|██████████| 716/716 [00:00<00:00, 14726.17it/s]
100%|██████████| 717/717 [00:00<00:00, 12283.08it/s]
100%|██████████| 717/717 [00:00<00:00, 11039.14it/s]
100%|██████████| 527/527 [00:00<00:00, 9441.75it/s]
100%|██████████| 448/448 [00:00<00:00, 8309.26it/s]

               form        lemma   Deg  Numb Person Mood_Tense_Voice Case      Gend     pos
0            Mutant         muto     _  Plur      3     Ind|Pres|Act    _         _     VER
1                et           et     _     _      _                _    _         _  CONcoo
2           bestiae       bestia     _  Plur      _                _  Nom         _  NOMcom
3               pro          pro     _     _      _                _    _         _     PRE
4             veste       uestis     _  Sing      _                _  Abl         _  NOMcom
5            formam        forma     _  Sing      _                _  Acc         _  NOMcom
6                 ;            ;     _     _      _                _    _         _    PUNC
7          quamquam     quamquam     _     _      _                _    _         _  CONsub
8                et           et     _     _      _                _    _         _  CONcoo
9              pavo        pauus     _  Sing      _                _  Dat       
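
`vjui` is imported from `tools.analysis_utils` and its body is not shown in this notebook. Judging from the lemmas in the output above (`uestis` next to the form `veste`, `pauus` next to `pavo`), it rewrites `v` as `u`. The sketch below additionally assumes that it rewrites `j` as `i` and treats capitals the same way; `vjui_sketch` is a hypothetical stand-in, not the project's function.

```python
# Assumed mapping: v -> u and j -> i, in both cases.
_VJUI_TABLE = str.maketrans({"v": "u", "V": "U", "j": "i", "J": "I"})


def vjui_sketch(text: str) -> str:
    """Collapse the v/u and j/i spelling distinction (assumed behaviour)."""
    return text.translate(_VJUI_TABLE)


for form in ["veste", "pavo", "Julius", "Mutant"]:
    print(form, "->", vjui_sketch(form))
```

With such a mapping, `Raw` and `RawVJUI` differ only in spelling, which isolates the effect of letters the model may not have seen.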




### Extracting the tokens

In [5]:
TokensVersion = {
    version: [
        [x["form"] for x in sentence]
        for sentence in sentences
    ]
    for version, sentences in PlatinumVersions.items()
}
print(TokensVersion["Chunks35"][0])

['Mutant', 'et', 'bestiae', 'pro', 'veste', 'formam', ';', 'quamquam', 'et', 'pavo', 'pluma', 'vestis', ',', 'et', 'quidem', 'de', 'cataclistis', ',', 'immo', 'omni', 'conchylio', 'pressior', 'qua', 'colla', 'florent', ',', 'et', 'omni', 'patagio', 'inauratior', 'qua', 'terga', 'fulgent', ',', 'et']


### Lemmatization

In [6]:
from pie.tagger import Tagger
from pie.utils import chunks
DEVICE = "cpu"
BATCH_SIZE = 100  # value actually passed to the tagger for the run shown below


OutputVersion = {
    version: []
    for version in TokensVersion
}

# MODEL_PATH = "../../../pie/models/FinalModelVulgate-lemma-2020_10_14-18_09_35.tar"
MODEL_PATH = "../../../latin-lasla-models/model.tar"

tagger = Tagger(device=DEVICE, batch_size=BATCH_SIZE, lower=False, tokenize=False)
tagger.add_model(MODEL_PATH, *TASKS)

for version, sentences in TokensVersion.items():
    print(f"Tagging : `{version}`\n")
    for chunk in tqdm(chunks([(sent, len(sent)) for sent in sentences], tagger.batch_size)):
        tagged, tasks = tagger.tag(*zip(*chunk))#, use_beam=True)
        OutputVersion[version].extend([
            [
                (token, dict(zip(tasks, result)))
                for token, result in sentence
            ]
            for sentence in tagged
        ])


0it [00:00, ?it/s]

Tagging : `Normalized`



8it [00:13,  1.70s/it]
0it [00:00, ?it/s]

Tagging : `Raw`



8it [00:15,  1.88s/it]
0it [00:00, ?it/s]

Tagging : `RawVJUI`



8it [00:14,  1.85s/it]
0it [00:00, ?it/s]

Tagging : `Chunks35`



6it [00:13,  2.31s/it]
0it [00:00, ?it/s]

Tagging : `Chunks35Normalized`



5it [00:12,  2.58s/it]
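
The iteration counts reported by `tqdm` above follow from the batching: each version is cut into batches of at most 100 sentences, so 716 sentences give 8 batches and 448 give 5. The sketch below reproduces that arithmetic with `batched_sketch`, a hypothetical stand-in for `pie.utils.chunks` that is assumed to yield consecutive lists of at most `size` items. It does not call the tagger.

```python
def batched_sketch(items, size):
    """Yield consecutive batches of at most `size` items (assumed behaviour of `chunks`)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Sentence counts taken from the parsing output earlier in the notebook.
sentence_counts = {
    "Normalized": 716,
    "Raw": 717,
    "RawVJUI": 717,
    "Chunks35": 527,
    "Chunks35Normalized": 448,
}

batch_counts = {
    version: sum(1 for _ in batched_sketch(range(count), 100))
    for version, count in sentence_counts.items()
}
print(batch_counts)

# Each batch is a list of (sentence, length) pairs; zip(*batch) separates the two.
batch = [(["Mutant", "et"], 2), (["pro", "veste", "formam"], 3)]
sentences, lengths = zip(*batch)
print(sentences, lengths)
```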


### Generating the aligned versions

The output contains tokens that must not be counted in the scores: tokens that start with a non-word character (punctuation) or with a Greek letter are filtered out.

In [7]:
import regex

punkts = regex.compile(r"\W+")
greek = regex.compile(r"\p{Greek}+")

def keep_tokens(token):
    if punkts.match(token):
        #print(token)
        return False
    elif greek.match(token):
        return False
    return True


CleanedUpVersions = {
    version: [
        [
            (token, annots)
            for (token, annots) in sentence
            if keep_tokens(token)
        ]
        for sentence in sentences
    ]
    for version, sentences in OutputVersion.items()
    #if not print(version)
}
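
To make the filter concrete, the sketch below applies the same two rules to a few sample tokens. It is an approximation that relies only on the standard library: `re` has no `\p{Greek}` class, so Greek letters are detected through `unicodedata.name` instead. `keep_tokens_sketch` and the sample tokens are hypothetical.

```python
import re
import unicodedata

NON_WORD = re.compile(r"\W+")


def starts_with_greek(token):
    """True when the first character's Unicode name starts with GREEK."""
    return bool(token) and unicodedata.name(token[0], "").startswith("GREEK")


def keep_tokens_sketch(token):
    # `match` is anchored at the start only, so any token that
    # begins with a non-word character is rejected.
    if NON_WORD.match(token):
        return False
    if starts_with_greek(token):
        return False
    return True


samples = ["formam", ";", "λόγος", "et", "«", "..."]
kept = [tok for tok in samples if keep_tokens_sketch(tok)]
print(kept)  # ['formam', 'et']
```

Because the test only looks at the start of the token, a form such as `,quod` would be discarded as well.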

## Analysis of the results

### Variety of forms: lexical richness of lemmas

This analysis looks at the number of distinct lemmas predicted for each version, compared with the `Normalized` version.

In [8]:
RichesseLexicaleLemmes = {
    version: len(set([
        annot["lemma"]
        for sent in sentences
        for _, annot in sent
    ]))
    for version, sentences in CleanedUpVersions.items()
}

for version, richesse in RichesseLexicaleLemmes.items():
    if version != "Normalized":
        print(f"{version} has {richesse} different lemma")
        print(
            f"{version} has {(richesse/RichesseLexicaleLemmes['Normalized']-1)*100:.2f}% "
            "more lemma than the 'Perfect' version\n"
        )


Raw has 3459 different lemma
Raw has 3.32% more lemma than the 'Perfect' version

RawVJUI has 3366 different lemma
RawVJUI has 0.54% more lemma than the 'Perfect' version

Chunks35 has 3468 different lemma
Chunks35 has 3.58% more lemma than the 'Perfect' version

Chunks35Normalized has 3355 different lemma
Chunks35Normalized has 0.21% more lemma than the 'Perfect' version
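
The percentages above are relative increases over the `Normalized` baseline, computed as `(count / baseline - 1) * 100`. A minimal sketch on made-up lemma lists (the data and `relative_increase` are hypothetical; only the formula comes from the cell above):

```python
def relative_increase(count, baseline):
    """Relative increase of `count` over `baseline`, in percent."""
    return (count / baseline - 1) * 100


# Toy predictions: the Raw-like version yields one spurious extra lemma ("vestis").
predicted_lemmas = {
    "Normalized": ["uestis", "forma", "et", "pauus", "et"],
    "Raw": ["uestis", "vestis", "forma", "et", "pauus"],
}

richness = {
    version: len(set(lemmas))
    for version, lemmas in predicted_lemmas.items()
}
increase = relative_increase(richness["Raw"], richness["Normalized"])
print(f"{increase:.2f}%")  # 25.00%
```

A higher count than the baseline points to lemmas that the tagger produces only because of unnormalized input.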



### Accuracy scores

We now compute the accuracy of each version, excluding the filtered forms (punctuation and Greek) from the computation.

Reminder: the reference versions are stored in `PlatinumVersions` and the predictions in `OutputVersion`.

#### Building the comparisons

In [9]:
Scores = {
    version: {task:[] for task in TASKS}
    for version in PlatinumVersions
}
for version, gt_sentences in PlatinumVersions.items():
    # gt_sentences = [{annotation_dicts}]
    pred_sentences = OutputVersion[version]
    for gt_sentence, pred_sentence in zip(gt_sentences, pred_sentences):
        # We filter out noise so we have the same basis in terms of comparison
        for gt_annot, (token, pred_annot) in zip(gt_sentence, pred_sentence):
            if not keep_tokens(gt_annot["form"]):
                continue
            if token != gt_annot["form"]:
                print("ERROR", version, token, gt_annot["form"])
            for task in TASKS:
                Scores[version][task].append(int(gt_annot[task] == pred_annot[task]))

#### Global computations

In [10]:
GlobAccuracies = {
    version: {
        task: sum(comparison_ints) / len(comparison_ints)*100
        for task, comparison_ints in tasks.items()
    }
    for version, tasks in Scores.items()
}
GlobAccuraciesDF = pandas.DataFrame({
    PrettyName[version]: {
        task: task_score-GlobAccuracies["Normalized"][task]
        for task, task_score in tasks.items()
    }
    for version, tasks in GlobAccuracies.items()
    if version != "Normalized"
})
print(GlobAccuraciesDF.to_string())

print(GlobAccuraciesDF.to_latex(float_format="{:0.2f}%".format, decimal=","))

print(GlobAccuraciesDF.to_markdown())

                  Prétraitement absent, tokenization phrase conservé  Prétraitement absent, VJUI OUI, tokenization phrase conservé  Prétraitement absent, tokenization tous les 35 caractères  Prétraitement complet, tokenization tous les 35 caractères
lemma                                                      -1.513572                                                     -0.222446                                                  -1.624495                                                   -0.226986
Deg                                                        -0.301302                                                     -0.157115                                                  -0.261781                                                   -0.064441
Numb                                                       -0.412643                                                     -0.307780                                                  -0.563092                                                   -0.122956
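
Each value in the table is a difference in percentage points between a version's accuracy and that of `Normalized`, so a negative number means the version does worse than the fully preprocessed baseline. The sketch below reproduces the computation on made-up 0/1 comparison lists; `toy_scores` and `accuracy` are hypothetical, and none of the figures come from the actual run.

```python
def accuracy(comparisons):
    """Share of correct predictions, in percent."""
    return sum(comparisons) / len(comparisons) * 100


# 1 = prediction equals the reference annotation, 0 = it does not.
toy_scores = {
    "Normalized": {"lemma": [1, 1, 1, 0], "pos": [1, 1, 1, 1]},
    "Raw": {"lemma": [1, 0, 1, 0], "pos": [1, 1, 0, 1]},
}

accuracies = {
    version: {task: accuracy(ints) for task, ints in tasks.items()}
    for version, tasks in toy_scores.items()
}

deltas = {
    version: {
        task: score - accuracies["Normalized"][task]
        for task, score in tasks.items()
    }
    for version, tasks in accuracies.items()
    if version != "Normalized"
}
print(deltas)  # {'Raw': {'lemma': -25.0, 'pos': -25.0}}
```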


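## Conclusions

Apart from grammatical case, which is to be expected, sentence segmentation has a fairly limited impact on lemmatization and the associated tasks. By contrast, keeping unknown tokens (mainly punctuation) and unknown letters has a fairly strong impact on the final result: on lemma accuracy, fixed 35-token chunks with full preprocessing lose about 0.23 points against the baseline, whereas skipping preprocessing loses about 1.51 points.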