### Demos

This notebook demonstrates how to use the `jp-errant` package.

#### Installation instructions

First, create and activate a conda environment with Python 3.12:
- `conda create --name jp_errant_test python==3.12`
- `conda activate jp_errant_test`

Clone the `jp-errant` repo if you haven't done so:
- `git clone https://github.com/open-writing-evaluation/jp-errant.git`
- `cd jp-errant`  # move to the jp-errant folder of the cloned repo

Check out the latest branch (`minh`, `dev`, or `main`):
- `git checkout minh`

Install dependencies:
- `pip install -r requirements.txt` 

Then install the package in editable mode:
- `pip install -e .`  # install the jp-errant package

In [1]:
import sys
sys.path.insert(0, '../')

In [18]:
import jp_errant

In [19]:
jp_errant.__version__

'3.0.1'

In [20]:
# Initialize jp-errant with the default Stanza pipeline
annotator = jp_errant.load(lang="en")

2025-02-24 19:59:00 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 424kB [00:00, 8.99MB/s]                    
2025-02-24 19:59:00 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-24 19:59:01 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2025-02-24 19:59:01 INFO: Using device: cpu
2025-02-24 19:59:01 INFO: Loading: tokenize
2025-02-24 19:59:01 INFO: Loading: mwt
2025-02-24 19:59:01 INFO: Loading: pos
2025-02-24 19:59:03 INFO: Loading: lemma
2025-02-24 19:59:03 INF

In [21]:
type(annotator)

jp_errant.annotator.Annotator

In [22]:
original = "Today is good day."
corrected = "Today is a good day."

In [23]:
# Parse both sentences with the annotator's pipeline before annotating
original_tokens = annotator.parse(original)
corrected_tokens = annotator.parse(corrected)

In [24]:
# Compare the parsed sentences and extract the edits
edits = annotator.annotate(orig=original_tokens, 
                           cor=corrected_tokens
                           )

In [25]:
for edit_item in edits:
    print(edit_item)

Orig: [2, 2, ''], Cor: [2, 3, 'a'], Type: 'M:DET'
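The printed edit can be read as a half-open token span in the original sentence plus the replacement text: `[2, 2, '']` is an empty span before token 2, and `'a'` is inserted there. The sketch below replays such an edit on a plain token list. The helper `apply_edits` is hypothetical (it is not part of jp-errant), and the token list is an assumption about how the sentence is tokenized.

```python
def apply_edits(tokens, edits):
    """Apply (o_start, o_end, correction) edits to a list of tokens.

    Spans are half-open token offsets into the original sentence.
    This helper is illustrative only and not part of jp-errant.
    """
    result = list(tokens)
    # Apply edits right-to-left so that earlier offsets stay valid.
    for o_start, o_end, cor_str in sorted(edits, reverse=True):
        result[o_start:o_end] = cor_str.split()
    return result


# Assumed tokenization of "Today is good day."
orig_tokens = ["Today", "is", "good", "day", "."]
# The edit printed above: empty span [2, 2) replaced by "a" (type M:DET).
edits_to_apply = [(2, 2, "a")]

corrected = apply_edits(orig_tokens, edits_to_apply)
print(" ".join(corrected))
```

An empty correction string removes the span, so the same helper covers missing, unnecessary, and replacement edits.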


In [26]:
# Chinese example with the default Stanza pipeline
annotator_zh = jp_errant.load(lang="zh")

2025-02-24 19:59:04 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 424kB [00:00, 6.56MB/s]                    
2025-02-24 19:59:05 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-24 19:59:05 INFO: "zh" is an alias for "zh-hans"
2025-02-24 19:59:05 INFO: Loading these models for language: zh-hans (Simplified_Chinese):
| Processor | Package          |
--------------------------------
| tokenize  | gsdsimp          |
| pos       | gsdsimp_charlm   |
| lemma     | gsdsimp_nocharlm |
| depparse  | gsdsimp_charlm   |

2025-02-24 19:59:05 INFO: Using device: cpu
2025-02-24 19:59:05 INFO: Loading: tokenize
2025-02-24 19:59:06 INFO: Loading: pos
2025-02-24 19:59:08 INFO: Loading: lemma
2025-02-24 19:59:08 INFO: Loa

In [27]:
original_text_zh = "冬阴功是泰国最著名的菜之一，它虽然不是很豪华，但它的味确实让人上瘾，做法也不难、不复杂。"
corrected_text_zh = "冬阴功是泰国最著名的菜之一，虽然它不是很豪华，但它的味确实让人上瘾，做法也不难、不复杂。"

In [28]:
original_zh_tokens = annotator_zh.parse(original_text_zh)
corrected_zh_tokens = annotator_zh.parse(corrected_text_zh)

In [29]:
edits = annotator_zh.annotate(orig=original_zh_tokens,
                              cor=corrected_zh_tokens
                              )

Comparing: 它虽然 虽然它 0.75 0.7200819306244606


In [30]:
for edit in edits:
    print(edit)

Orig: [11, 13, '它 虽然'], Cor: [11, 13, '虽然 它'], Type: 'R:WO'


In [31]:
import stanza

In [32]:
# Chinese example with a custom-trained Stanza tokenizer that follows LTP-style segmentation
SAVED_MODEL_FOLDER = "../trained_models/"  # folder containing the saved .pt files
tokenize_path = SAVED_MODEL_FOLDER + "UD_Chinese-GSDSimpLTP_model/saved_models/tokenize/zh_gsdsimpltp_tokenizer.pt"
nlp = stanza.Pipeline(lang="zh", 
                      processors="tokenize,pos,lemma,depparse",
                      tokenize_model_path=tokenize_path,  
                      )

# load jp-errant with custom Stanza nlp component
annotator_zh = jp_errant.load(lang="zh", nlp=nlp)

2025-02-24 19:59:10 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 424kB [00:00, 14.6MB/s]                    
2025-02-24 19:59:10 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-24 19:59:10 INFO: "zh" is an alias for "zh-hans"
2025-02-24 19:59:11 INFO: Loading these models for language: zh-hans (Simplified_Chinese):
| Processor | Package                 |
---------------------------------------
| tokenize  | ../trained...kenizer.pt |
| pos       | gsdsimp_charlm          |
| lemma     | gsdsimp_nocharlm        |
| depparse  | gsdsimp_charlm          |

2025-02-24 19:59:11 INFO: Using device: cpu
2025-02-24 19:59:11 INFO: Loading: tokenize
2025-02-24 19:59:11 INFO: Loading: pos
2025-02-24 19:59:13 INFO: Lo

In [33]:
original_zh_tokens = annotator_zh.parse(original_text_zh)
corrected_zh_tokens = annotator_zh.parse(corrected_text_zh)

In [34]:
# print tokenized words
print("Original text")
for item in original_zh_tokens.iter_words():
    print(item.text, end="  ")

print("\n", "-"*10)
print("Corrected text")
for item in corrected_zh_tokens.iter_words():
    print(item.text, end="  ")

Original text
冬阴功  是  泰国  最  著名  的  菜  之  一  ，  它  虽然  不是  很  豪华  ，  但  它  的  味  确实  让  人  上瘾  ，  做法  也  不难  、  不  复杂  。  
 ----------
Corrected text
冬阴功  是  泰国  最  著名  的  菜  之  一  ，  虽然  它  不是  很  豪华  ，  但  它  的  味  确实  让  人  上瘾  ，  做法  也  不难  、  不  复杂  。  

In [35]:
edits = annotator_zh.annotate(orig=original_zh_tokens,
                              cor=corrected_zh_tokens
                              )
for edit in edits:
    print(edit)

Comparing: 它虽然 虽然它 0.75 0.7200819306244606
Orig: [10, 12, '它 虽然'], Cor: [10, 12, '虽然 它'], Type: 'R:WO'
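The offsets `[10, 12]` can be checked by hand against the tokenized words printed earlier. The sketch below finds the changed span by trimming the common prefix and suffix of the two token lists. This is only a sanity check for a single contiguous change, not the alignment method jp-errant uses, and `changed_span` is a hypothetical helper.

```python
def changed_span(orig, cor):
    """Return half-open (start, end) spans of the region where two token lists differ.

    Trims the longest common prefix and suffix. Illustrative only; it assumes
    at most one contiguous changed region.
    """
    limit = min(len(orig), len(cor))
    prefix = 0
    while prefix < limit and orig[prefix] == cor[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and orig[-1 - suffix] == cor[-1 - suffix]:
        suffix += 1
    return (prefix, len(orig) - suffix), (prefix, len(cor) - suffix)


# Token lists copied from the tokenizer output printed above.
orig_tokens = ("冬阴功 是 泰国 最 著名 的 菜 之 一 ， 它 虽然 不是 很 豪华 ， "
               "但 它 的 味 确实 让 人 上瘾 ， 做法 也 不难 、 不 复杂 。").split()
cor_tokens = ("冬阴功 是 泰国 最 著名 的 菜 之 一 ， 虽然 它 不是 很 豪华 ， "
              "但 它 的 味 确实 让 人 上瘾 ， 做法 也 不难 、 不 复杂 。").split()

orig_span, cor_span = changed_span(orig_tokens, cor_tokens)
print(orig_span, orig_tokens[orig_span[0]:orig_span[1]])
print(cor_span, cor_tokens[cor_span[0]:cor_span[1]])
```

With the default tokenizer the same edit was reported at `[11, 13]`, which suggests that the two tokenizers segment the text before the edit differently.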


#### Testing the multi-reference command line

In [36]:
INPUT_FILE = "./data/multi_references/en_input.txt"         # original sentence
REFERENCE_FILE = "./data/multi_references/en_ref_multi.tsv" # corrected sentences, separated by tab \t
OUTPUT_FILE = "./data/multi_references/en_output.m2"        # m2 output file
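For readers preparing their own data, the sketch below writes a small input file and a tab-separated reference file. The layout is an assumption inferred from the comments above: one original sentence per line, and on the matching line of the TSV file, all corrections of that sentence separated by tabs. The helper `write_multi_reference` and the example sentences are hypothetical; the files are written to a temporary folder so nothing in `./data` is touched.

```python
import tempfile
from pathlib import Path


def write_multi_reference(folder, originals, references):
    """Write originals (one per line) and their references (one tab-separated row each).

    Assumed layout, not confirmed by the jp-errant documentation.
    """
    if len(originals) != len(references):
        raise ValueError("each original sentence needs one list of references")
    folder = Path(folder)
    input_path = folder / "en_input.txt"
    ref_path = folder / "en_ref_multi.tsv"
    input_path.write_text("\n".join(originals) + "\n", encoding="utf-8")
    rows = ["\t".join(refs) for refs in references]
    ref_path.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return input_path, ref_path


with tempfile.TemporaryDirectory() as tmp:
    input_path, ref_path = write_multi_reference(
        tmp,
        originals=["Today is good day."],
        references=[["Today is a good day.", "Today is a nice day."]],
    )
    # Read the TSV back to confirm that each row holds all references.
    ref_rows = [line.split("\t")
                for line in ref_path.read_text(encoding="utf-8").splitlines()]
    print(ref_rows)
```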

In [37]:
!jp_errant_parallel --help

usage: jp_errant_parallel [-h] [options] -orig ORIG -cor COR [COR ...] [-tsv yes] -out OUT

Align parallel text files and extract and classify the edits.

options:
  -h, --help            show this help message and exit
  -orig ORIG            The path to the original text file.
  -cor COR [COR ...]    The paths to >= 1 corrected text files.
  -out OUT              The output filepath.
  -lang LANG            The 2-letter language code (default: en).
  -lev                  Align using standard Levenshtein (default: True).
  -merge {rules,all-split,all-merge,all-equal}
                        Choose a merging strategy for automatic alignment.
                        rules: Use a rule-based merging strategy (default)
                        all-split: Merge nothing: MSSDI -> M, S, S, D, I
                        all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
                        all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I


In [38]:
!jp_errant_parallel -orig {INPUT_FILE} -cor {REFERENCE_FILE} -out {OUTPUT_FILE}

Loading resources...
Processing parallel files...


2025-02-24 19:59:24 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 424kB [00:00, 12.1MB/s]                    
2025-02-24 19:59:24 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-24 19:59:25 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2025-02-24 19:59:25 INFO: Using device: cpu
2025-02-24 19:59:25 INFO: Loading: tok
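The command writes its result to the `.m2` file. Assuming the output follows the standard M2 layout (an `S` line with the tokenized source, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator_id`), it can be read back with a few lines of Python. The parser `parse_m2` and the sample text are hypothetical; the sample is hand-written from the edit shown earlier in this notebook, not copied from a generated file.

```python
def parse_m2(text):
    """Parse M2-formatted text into a list of (source_sentence, edits) pairs.

    Assumes the standard M2 layout; blocks are separated by blank lines.
    """
    sentences = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        source = lines[0][2:]  # drop the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append({
                "start": start,
                "end": end,
                "type": fields[1],
                "correction": fields[2],
                "annotator": int(fields[-1]),
            })
        sentences.append((source, edits))
    return sentences


# Hand-written sample in the assumed format.
sample = (
    "S Today is good day .\n"
    "A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0\n"
)

parsed = parse_m2(sample)
for source, edits in parsed:
    print(source)
    for edit in edits:
        print(" ", edit)
```

In a multi-reference file, the annotator id in the last field is what would distinguish edits that belong to different references.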

In [39]:
INPUT_FILE_ZH = "./data/multi_references/zh_input.txt"         # original sentence
REFERENCE_FILE_ZH = "./data/multi_references/zh_ref.tsv"       # corrected sentences, separated by tab \t
OUTPUT_FILE_ZH = "./data/multi_references/zh_output.m2"        # m2 output file

In [40]:
# Run the parallel-to-M2 command line on Chinese (zh) input.
# NOTE: the command line does NOT yet support loading the custom LTP-style model files; it still uses the default Stanza pipeline.
!jp_errant_parallel -orig {INPUT_FILE_ZH} -cor {REFERENCE_FILE_ZH} -out {OUTPUT_FILE_ZH} -lang zh

Loading resources...
Processing parallel files...
Comparing: 它虽然 虽然它 0.75 0.7200819306244606
Comparing: 大 :大 0.8 0
Comparing: 搾好 榨好 1.0 0.5604585667221499
Comparing: 叶三 叶三 1.0 0.9999999949197186
Comparing: 辣椒6 辣椒六 0.75 0.777275938334677
Comparing: 4 四 0.0 0.5606951171827526
Comparing: 草菇10 草菇十 0.6666666666666666 0
Comparing: 死爱 爱死 0.5 0.8413336325438497
Comparing: 对外国人 外国人对 0.75 0.5867633548646929
Comparing: 是 在 0.3333333333333333 0.7543451641312413
Comparing: 的心里 都 0.2 0
Comparing: 在 使 0.3333333333333333 0.8399053672795396
Comparing: 猛近 猛进 1.0 0.9481206153284796
Comparing: 日益严重造成了 造成了日益严重 0.5454545454545454 0.6659675649793019
Comparing: 日益严重造成了 造成了日益严重 0.5454545454545454 0.6659675649793019
Comparing: 空气 的 0.0 0
Comparing: 有助 无助 0.7272727272727273 0.9148432177750987
Comparing: 人生 人 0.5454545454545454 0
Comparing: 建康 健康 1.0 0.9501210549857415
Comparing: 人生 人体 0.46153846153846156 0.9037275247914941
Comparing: 题 题— 0.8 0
Comparing: 。 ， 0.0 0.7181735669053702
Comparing: 人生 人体 0.46153846153

2025-02-24 19:59:36 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 424kB [00:00, 6.61MB/s]                    
2025-02-24 19:59:37 INFO: Downloaded file to C:\Users\Minh UBC\stanza_resources\resources.json
2025-02-24 19:59:37 INFO: "zh" is an alias for "zh-hans"
2025-02-24 19:59:38 INFO: Loading these models for language: zh-hans (Simplified_Chinese):
| Processor | Package          |
--------------------------------
| tokenize  | gsdsimp          |
| pos       | gsdsimp_charlm   |
| lemma     | gsdsimp_nocharlm |
| depparse  | gsdsimp_charlm   |

2025-02-24 19:59:38 INFO: Using device: cpu
2025-