### German Annotation

In this notebook, we try jp-errant on a German-language dataset.

In [1]:
import sys
sys.path.insert(0, '../')

In [2]:
import jp_errant

In [3]:
jp_errant.__version__

'3.0.1'

In [4]:
annotator = jp_errant.load(lang="de") # Deutsch = German

2025-02-13 20:40:17 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

2025-02-13 20:40:18 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-13 20:40:19 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| pos       | gsd_charlm   |
| lemma     | gsd_nocharlm |
| depparse  | gsd_charlm   |

2025-02-13 20:40:19 INFO: Using device: cuda
2025-02-13 20:40:19 INFO: Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2025-02-13 20:40:21 INFO: Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2025-02-13 20:40:21 INFO: Loading: pos
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
2025-02-13 20:40:21 INFO: Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2025-02-13 20:40:22 INFO: Loa

In [5]:
original  = "Dagegen wieder , bekommen BA Studenten die ein extra Jahr oder mehr studiert haben , leichter Jobs ."
corrected = "Dahingegen bekommen BA-Studenten , die ein extra Jahr oder mehr studiert haben , wiederum leichter Jobs ."

In [6]:
# Parse both sentences (tokenization, POS tagging, lemmatization, dependency parsing)
original_tokens = annotator.parse(original)
corrected_tokens = annotator.parse(corrected)

In [7]:
for item in original_tokens.iter_words():
    print(f"Text: {item.text} | Lemma: {item.lemma} | POS: {item.upos}")

Text: Dagegen | Lemma: dagegen | POS: ADV
Text: wieder | Lemma: wieder | POS: ADV
Text: , | Lemma: , | POS: PUNCT
Text: bekommen | Lemma: bekommen | POS: VERB
Text: BA | Lemma: BA | POS: PROPN
Text: Studenten | Lemma: Student | POS: NOUN
Text: die | Lemma: der | POS: PRON
Text: ein | Lemma: ein | POS: DET
Text: extra | Lemma: extra | POS: ADV
Text: Jahr | Lemma: Jahr | POS: NOUN
Text: oder | Lemma: oder | POS: CCONJ
Text: mehr | Lemma: mehr | POS: ADV
Text: studiert | Lemma: studieren | POS: VERB
Text: haben | Lemma: haben | POS: AUX
Text: , | Lemma: , | POS: PUNCT
Text: leichter | Lemma: leicht | POS: ADJ
Text: Jobs | Lemma: Job | POS: NOUN
Text: . | Lemma: . | POS: PUNCT


In [8]:
# Align the two parsed sentences and classify each edit
edits = annotator.annotate(orig=original_tokens, cor=corrected_tokens)

In [9]:
for edit_item in edits:
    print(edit_item)

Orig: [0, 2, 'Dagegen wieder'], Cor: [0, 1, 'Dahingegen'], Type: 'R:ADV ADV -> ADV'
Orig: [2, 3, ','], Cor: [1, 1, ''], Type: 'U:PUNCT'
Orig: [5, 5, ''], Cor: [3, 4, '-'], Type: 'M:PUNCT'
Orig: [6, 6, ''], Cor: [5, 6, ','], Type: 'M:PUNCT'
Orig: [15, 15, ''], Cor: [15, 16, 'wiederum'], Type: 'M:ADV'
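
Each printed edit pairs a token span in the original (`[start, end, text]`) with its replacement in the correction. As a sanity check on how to read those offsets, the sketch below applies the five edits to the whitespace-split original sentence. It does not call jp-errant: the tuples are copied by hand from the output above, `apply_edits` is a hypothetical helper, and it assumes the original offsets index the whitespace tokens (which matches the 18 tokens listed earlier).

```python
# Original sentence from the notebook, already whitespace-tokenized.
original = (
    "Dagegen wieder , bekommen BA Studenten die ein extra Jahr "
    "oder mehr studiert haben , leichter Jobs ."
)

# (orig_start, orig_end, correction), copied by hand from the printed edits.
edits = [
    (0, 2, "Dahingegen"),   # R: replace "Dagegen wieder"
    (2, 3, ""),             # U: delete the comma
    (5, 5, "-"),            # M: insert "-" before "Studenten"
    (6, 6, ","),            # M: insert "," before "die"
    (15, 15, "wiederum"),   # M: insert "wiederum" before "leichter"
]


def apply_edits(tokens, edits):
    """Return a new token list with every edit applied (hypothetical helper)."""
    result = list(tokens)
    # Apply from right to left so that earlier offsets stay valid.
    for start, end, correction in sorted(edits, reverse=True):
        result[start:end] = correction.split()
    return result


rebuilt = " ".join(apply_edits(original.split(), edits))
print(rebuilt)
```

The rebuilt sentence contains `BA - Studenten` as three tokens rather than `BA-Studenten`, which is consistent with the `M:PUNCT` edit for `-` and suggests the parser split the hyphenated compound in the corrected sentence.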


### Convert parallel files to M2 from the command line

In [10]:
ORIGINAL_FILE = "../data/GEC_European_Datasets/German/fm-dev.src"
CORRECTION_FILE = "../data/GEC_European_Datasets/German/fm-dev.trg"
OUTPUT_FILE = "../data/GEC_European_Datasets/German/fm-dev-jp-errant.m2"
LANGUAGE = "de"
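
The output path uses the `.m2` extension. Assuming jp-errant writes the standard M2 layout used by ERRANT (an `S` line with the source tokens, followed by `A start end|||type|||correction|||REQUIRED|||-NONE-|||annotator` lines, with blocks separated by a blank line), a file like this could be read back with a few lines of Python. This is a minimal sketch: `read_m2` is a hypothetical helper, and the sample text is hand-written for illustration from two of the edits shown earlier, not actual tool output.

```python
def read_m2(text):
    """Parse M2-formatted text into (source_tokens, edits) pairs."""
    blocks = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        source = lines[0][2:].split()  # drop the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            edits.append({
                "start": start,
                "end": end,
                "type": fields[1],
                "correction": fields[2],
                "annotator": int(fields[-1]),
            })
        blocks.append((source, edits))
    return blocks


# Hand-written sample in the assumed layout (illustration only).
sample = """\
S Dagegen wieder , bekommen BA Studenten die ein extra Jahr oder mehr studiert haben , leichter Jobs .
A 2 3|||U:PUNCT||||||REQUIRED|||-NONE-|||0
A 15 15|||M:ADV|||wiederum|||REQUIRED|||-NONE-|||0
"""

parsed = read_m2(sample)
for source, edits in parsed:
    print(len(source), "tokens,", len(edits), "edits")
    for edit in edits:
        print(edit)
```

For a real file, the same function could be called on `open(OUTPUT_FILE, encoding="utf-8").read()` once the conversion has finished.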

In [11]:
!jp_errant_parallel --help

usage: jp_errant_parallel [-h] [options] -orig ORIG -cor COR [COR ...] [-tsv yes] -out OUT

Align parallel text files and extract and classify the edits.

options:
  -h, --help            show this help message and exit
  -orig ORIG            The path to the original text file.
  -cor COR [COR ...]    The paths to >= 1 corrected text files.
  -out OUT              The output filepath.
  -lang LANG            The 2-letter language code (default: en).
  -lev                  Align using standard Levenshtein (default: True).
  -merge {rules,all-split,all-merge,all-equal}
                        Choose a merging strategy for automatic alignment.
                        rules: Use a rule-based merging strategy (default)
                        all-split: Merge nothing: MSSDI -> M, S, S, D, I
                        all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
                        all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I
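
The `-merge` examples in the help text can be reproduced on a plain operation string, where `M` is a match and `S`, `D`, `I` are substitution, deletion and insertion. The sketch below only illustrates the three `all-*` groupings described above; `merge_ops` is a hypothetical function, not jp-errant's implementation, and the rule-based default is not covered. It assumes matches are always left as single items.

```python
from itertools import groupby


def merge_ops(ops, strategy):
    """Group an alignment op string according to an `all-*` merge strategy."""
    if strategy == "all-split":
        # Merge nothing: every operation stands alone.
        return list(ops)
    if strategy == "all-merge":
        # Adjacent non-matches form one group, whatever their type.
        key = lambda op: op == "M"
    elif strategy == "all-equal":
        # Only adjacent operations of the same type are grouped.
        key = lambda op: op
    else:
        raise ValueError(f"unsupported strategy: {strategy}")

    groups = []
    for _, run in groupby(ops, key=key):
        run = "".join(run)
        if run[0] == "M":
            groups.extend(run)  # assumption: matches are never merged
        else:
            groups.append(run)
    return groups


for strategy in ("all-split", "all-merge", "all-equal"):
    print(strategy, merge_ops("MSSDI", strategy))
```

On `MSSDI` this yields `M, S, S, D, I`, then `M, SSDI`, then `M, SS, D, I`, matching the three lines of the help text.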


#### Process German dev set

In [12]:
!jp_errant_parallel -orig {ORIGINAL_FILE} -cor {CORRECTION_FILE} -lang {LANGUAGE} -out {OUTPUT_FILE}

Loading resources...
Processing parallel files...


2025-02-13 20:40:34 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json: 392kB [00:00, 12.4MB/s]                    
2025-02-13 20:40:35 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-13 20:40:35 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| pos       | gsd_charlm   |
| lemma     | gsd_nocharlm |
| depparse  | gsd_charlm   |

2025-02-13 20:40:35 INFO: Using device: cuda
2025-02-13 20:40:35 INFO: Loading: tokenize
  checkpoint = torch.load(filen