<a href="https://colab.research.google.com/github/isikus/qualification-project/blob/master/notebooks/1.%20Converting%20datasets%20to%20parallel%20corpora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting datasets to parallel corpora
In this notebook we process the corpora we used so that each original entry is paired with its corrected version. The notebook downloads the corpora where possible and then performs the necessary conversions, so when execution finishes you will have the corresponding `pandas` DataFrames, saved as pickle files, in the top-level working directory (`/content` on Colab).

**Please note the following:**
1. While [REALEC](https://realec.org) is freely distributable, the owners of [FCE](https://ilexir.co.uk/datasets/index.html), [W&I+LOCNESS](https://www.cl.cam.ac.uk/research/nl/bea2019st/) and [EFCAMDAT](https://corpus.mml.cam.ac.uk/efcamdat2/) ask you to credit them if you use their corpora in your research. The same applies to [ICNALE](http://language.sakura.ne.jp/icnale/) and [NUCLE](https://www.comp.nus.edu.sg/~nlp/corpora.html), which additionally require registration. To obtain the password for the ICNALE dataset we used, fill in the form [here](http://tinyurl.com/8rwy472). To get the NUCLE dataset, fill in [this form](https://sterling8.d2.comp.nus.edu.sg/nucle_download/nucle.php) and wait about 3 working days. For this reason we **did not include** these parts in our release, and the corresponding code is placed at the end of the notebook for your convenience.
2. If you are running on Colab, you may want to copy the produced datasets to your Google Drive. The code for that is at the end of the notebook, but it is commented out, so uncomment it before running.
3. Alternatively, you can skip this notebook and simply re-evaluate the final checkpoint of our model: to do that, run [this notebook](https://colab.research.google.com/github/isikus/qualification-project/blob/master/notebooks/4.%20Model%20reevaluation.ipynb).

## Imports and necessary dependencies

In [0]:
!git clone https://github.com/isikus/corpora-manipulation
!cp ./corpora-manipulation/* .
!rm -rf corpora-manipulation

In [0]:
from parallel_error_corpora import list_to_corpus_df

## Processing FCE

In [0]:
!mkdir fce
%cd fce

/content/fce


In [0]:
!wget https://www.cl.cam.ac.uk/research/nl/bea2019st/data/fce_v2.1.bea19.tar.gz
!tar -xzf fce_v2.1.bea19.tar.gz

In [0]:
import json

# Each file is in JSON Lines format: one JSON object per line.
# json.loads maps JSON null to None, so no eval() workaround is needed.
with open("./fce/json/fce.dev.json", "r", encoding="utf-8") as injson:
  fce_dev = [json.loads(line) for line in injson if line.strip()]

with open("./fce/json/fce.train.json", "r", encoding="utf-8") as injson:
  fce_train = [json.loads(line) for line in injson if line.strip()]

with open("./fce/json/fce.test.json", "r", encoding="utf-8") as injson:
  fce_test = [json.loads(line) for line in injson if line.strip()]

In [0]:
print(any(e != 1 for e in [len(e["edits"]) for e in fce_dev]))
print(any(e != 1 for e in [len(e["edits"]) for e in fce_train]))
print(any(e != 1 for e in [len(e["edits"]) for e in fce_test]))
print("\n")
print(any(e != 2 for e in [len(e["edits"][0]) for e in fce_dev]))
print(any(e != 2 for e in [len(e["edits"][0]) for e in fce_train]))
print(any(e != 2 for e in [len(e["edits"][0]) for e in fce_test]))

False
False
False


False
False
False


In [0]:
fce_dev = [{
    "id": entry["id"],
    "text": entry["text"],
    "patch": [[e[0], e[1], e[2]] for e in entry["edits"][0][1] if e[2] is not None]
} for entry in fce_dev]

fce_train = [{
    "id": entry["id"],
    "text": entry["text"],
    "patch": [[e[0], e[1], e[2]] for e in entry["edits"][0][1] if e[2] is not None]
} for entry in fce_train]

fce_test = [{
    "id": entry["id"],
    "text": entry["text"],
    "patch": [[e[0], e[1], e[2]] for e in entry["edits"][0][1] if e[2] is not None]
} for entry in fce_test]
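
The patch lists built above hold `[start, end, correction]` triples of character offsets into `text`. How `list_to_corpus_df` applies them is not shown in this notebook, so the following is only a minimal sketch of the idea, assuming that edits do not overlap. The helper `apply_patch` and the toy JSON line are hypothetical and are not part of the corpora or of `parallel_error_corpora`.

```python
import json


def apply_patch(text, patch):
    """Apply [start, end, replacement] edits given as character offsets into text."""
    corrected = text
    # Work from the end of the text backwards so that earlier offsets stay valid.
    for start, end, replacement in sorted(patch, key=lambda e: e[0], reverse=True):
        corrected = corrected[:start] + replacement + corrected[end:]
    return corrected


# Toy line shaped like the entries read above (made up, not real corpus data):
# one annotator record, whose second element is the list of edits.
toy_line = (
    '{"id": "toy-1", "text": "I has a apple", '
    '"edits": [[0, [[2, 5, "have"], [6, 7, "an"], [8, 13, null]]]]}'
)
entry = json.loads(toy_line)
annotator_id, edits = entry["edits"][0]

# Same filtering as in the cell above: edits without a correction are dropped.
patch = [[e[0], e[1], e[2]] for e in edits if e[2] is not None]
corrected = apply_patch(entry["text"], patch)

print(patch)      # [[2, 5, 'have'], [6, 7, 'an']]
print(corrected)  # I have an apple
```

Applying edits right to left avoids having to shift the remaining offsets after each replacement.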

In [0]:
%%time

fce_dev_df = list_to_corpus_df(fce_dev)
fce_train_df = list_to_corpus_df(fce_train)
fce_test_df = list_to_corpus_df(fce_test)

CPU times: user 537 ms, sys: 3.11 ms, total: 540 ms
Wall time: 546 ms


In [0]:
fce_dev_df.sample(5)

Unnamed: 0,id,orig_text,corr_text,corrections_num
1,TR27*0100*2000*01,"Dear Kim,\n\nHello! I hope everything around y...","Dear Kim,\n\nHello! I hope everything around y...",21
145,TR1016*0100*2000*01,"Dear Mrs Helen Ryan,\n\nI was writting as repl...","Dear Mrs Helen Ryan,\n\nI am writing in reply ...",34
47,TR252*0100*2000*01,Dear Helen:\n\nI've recived your letter and I ...,Dear Helen:\n\nI've received your letter and I...,24
153,TR1116*0102*2000*01,"Mr John Smythe,\n\nI'm writing to tell you abo...","Mr John Smythe,\n\nI'm writing to tell you abo...",9
66,TR322*0100*2000*02,Best Detective Stories of Agatha Christie\n\n...,The Best Detective Stories of Agatha Christie\...,26


In [0]:
fce_dev_df.to_pickle("fce_dev_df.pickle")
fce_train_df.to_pickle("fce_train_df.pickle")
fce_test_df.to_pickle("fce_test_df.pickle")

In [0]:
!cp fce_dev_df.pickle ../
!cp fce_train_df.pickle ../
!cp fce_test_df.pickle ../
%cd ../
!rm -rf fce

/content


## Processing W&I + LOCNESS

In [0]:
!mkdir wi_locness
%cd wi_locness

/content/wi_locness


In [0]:
!wget https://www.cl.cam.ac.uk/research/nl/bea2019st/data/wi+locness_v2.1.bea19.tar.gz
!tar -xzf wi+locness_v2.1.bea19.tar.gz

In [0]:
import json

def read_json_lines(path):
  # Each file is in JSON Lines format: one JSON object per line.
  # json.loads maps JSON null to None, so no eval() workaround is needed.
  with open(path, "r", encoding="utf-8") as injson:
    return [json.loads(line) for line in injson if line.strip()]

# The dev set has four parts (A, B, C, N); the train set has three (A, B, C).
wi_locness_dev = []
for part in ["A", "B", "C", "N"]:
  wi_locness_dev += read_json_lines("./wi+locness/json/" + part + ".dev.json")

wi_locness_train = []
for part in ["A", "B", "C"]:
  wi_locness_train += read_json_lines("./wi+locness/json/" + part + ".train.json")

In [0]:
print(any(e != 1 for e in [len(e["edits"]) for e in wi_locness_dev]))
print(any(e != 1 for e in [len(e["edits"]) for e in wi_locness_train]))
print("\n")
print(any(e != 2 for e in [len(e["edits"][0]) for e in wi_locness_dev]))
print(any(e != 2 for e in [len(e["edits"][0]) for e in wi_locness_train]))

False
False


False
False


In [0]:
wi_locness_dev = [{
    "id": entry["id"],
    "text": entry["text"],
    "patch": [[e[0], e[1], e[2]] for e in entry["edits"][0][1] if e[2] is not None]
} for entry in wi_locness_dev]

wi_locness_train = [{
    "id": entry["id"],
    "text": entry["text"],
    "patch": [[e[0], e[1], e[2]] for e in entry["edits"][0][1] if e[2] is not None]
} for entry in wi_locness_train]

In [0]:
%%time

wi_locness_dev_df = list_to_corpus_df(wi_locness_dev)
wi_locness_train_df = list_to_corpus_df(wi_locness_train)

CPU times: user 803 ms, sys: 2.63 ms, total: 806 ms
Wall time: 814 ms


In [0]:
wi_locness_dev_df.to_pickle("wi_locness_dev_df.pickle")
wi_locness_train_df.to_pickle("wi_locness_train_df.pickle")

In [0]:
!cp wi_locness_dev_df.pickle ../
!cp wi_locness_train_df.pickle ../
%cd ../
!rm -rf wi_locness

/content


## Processing REALEC

In [0]:
import re
import os

In [0]:
from realec_brat_to_patch_list import ann_to_patchlist
from parallel_error_corpora import list_to_corpus_df_realec

In [0]:
!pip install unidecode

Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)

In [0]:
!mkdir realec
%cd realec

/content/realec


In [0]:
fileid = "12ggqRLoctoiRRFm8GMDrVVKgPnkAlbYH"
filename = "realec_240520_1918.tar.gz"

!wget --save-cookies cookies.txt 'https://docs.google.com/uc?export=download&id='$fileid -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt
!wget --load-cookies cookies.txt -O $filename 'https://docs.google.com/uc?export=download&id='$fileid'&confirm='$(<confirm.txt)

!rm confirm.txt
!rm cookies.txt

In [0]:
!tar -xzf realec_240520_1918.tar.gz
!rm realec_240520_1918.tar.gz

In [0]:
paths = []

for root, dirs, files in os.walk("data"):
    for file in files:
        if os.path.splitext(file)[1] == ".ann":
            paths.append("./" + os.path.join(root, file))

In [0]:
%%time

realec_entries = []

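for path in paths:
    # The essay text sits next to its annotation file: same name, ".txt" extension
    fid = path[:-3]+"txt"
    patch = ann_to_patchlist(path)
    with open(fid, "r", encoding="utf-8") as intxt:
        text = intxt.read()
    realec_entries.append({
        "id": fid[7:],  # drop the leading "./data/" prefix
        "text": text,
        "patch": patch
    })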

CPU times: user 40 s, sys: 665 ms, total: 40.7 s
Wall time: 40.7 s
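
REALEC annotations come as brat `.ann` files. The internals of `ann_to_patchlist` are not shown in this notebook, so the sketch below only illustrates the general idea. It assumes that error spans are text-bound annotations (`T` lines) and that the correction is stored in the annotator note (`#` line) pointing at the span. The function `brat_to_patch`, the tag names and the toy annotation are hypothetical.

```python
def brat_to_patch(ann_text):
    """Turn brat standoff annotations into sorted [start, end, correction] triples."""
    spans = {}  # annotation id -> (start, end)
    notes = {}  # annotation id -> note text (assumed here to hold the correction)
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        if fields[0].startswith("T"):
            # "T1<TAB>Type start end<TAB>covered text"
            _type, offsets = fields[1].split(" ", 1)
            if ";" in offsets:
                continue  # discontinuous spans are skipped in this sketch
            start, end = (int(x) for x in offsets.split())
            spans[fields[0]] = (start, end)
        elif fields[0].startswith("#") and len(fields) >= 3:
            # "#1<TAB>AnnotatorNotes T1<TAB>note text"
            target = fields[1].split(" ")[1]
            notes[target] = fields[2]
    # Spans without a note carry no correction and are left out.
    return sorted([s, e, notes[t]] for t, (s, e) in spans.items() if t in notes)


# Toy annotation for the text "I've recived your a letter" (made up for illustration).
toy_ann = (
    "T1\tSpelling 5 12\trecived\n"
    "#1\tAnnotatorNotes T1\treceived\n"
    "T2\tArticles 18 19\ta\n"
)
toy_patch = brat_to_patch(toy_ann)
print(toy_patch)  # [[5, 12, 'received']]
```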


In [0]:
%%time

realec_df = list_to_corpus_df_realec(realec_entries)

Failed at old IELTS/IELTS2015/ASt_14_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_9_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_4_2.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_8_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_4_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_1_2.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_7_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_3_2.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_6_1.txt with string index out of range
Failed at old IELTS/IELTS2015/ASt_7_2.txt with string index out of range
CPU times: user 19.4 s, sys: 107 ms, total: 19.5 s
Wall time: 19.5 s


In [0]:
realec_df = realec_df.sort_values(by=['id']).reset_index(drop=True)

realec_df_train = realec_df.loc[~realec_df["id"].str.contains("^exam/exam2017/DOv/")].reset_index(drop=True)
realec_df_gold = realec_df.loc[realec_df["id"].str.contains("^exam/exam2017/DOv/")].reset_index(drop=True)

In [0]:
realec_df_train.to_pickle("realec_df_train.pickle")
realec_df_gold.to_pickle("realec_df_gold.pickle")

In [0]:
!cp realec_df_train.pickle ../
!cp realec_df_gold.pickle ../
%cd ../
!rm -rf realec

/content


## Processing EFCamDat

In [0]:
!mkdir efcamdat
%cd efcamdat

/content/efcamdat


In [0]:
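!wget -O EF201403_selection7.xml.gz "http://corpus.mml.cam.ac.uk/efcamdat2/public_html/download.php?f=EF201403_selection7.xml.gz"
# Decompress if the download is gzip data; otherwise keep it as plain XML under the expected name
!gzip -t EF201403_selection7.xml.gz 2>/dev/null && gzip -d EF201403_selection7.xml.gz || mv EF201403_selection7.xml.gz EF201403_selection7.xml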

In [0]:
!git clone https://github.com/isikus/EFCamDat-Preprocess

Cloning into 'EFCamDat-Preprocess'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 29 (delta 7), reused 10 (delta 3), pack-reused 13
Unpacking objects: 100% (29/29), done.


In [0]:
!cp ./EFCamDat-Preprocess/* .

In [0]:
import re
import os
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
%%time

!mkdir output
!python read_ef.py EF201403_selection7.xml | python ef_to_diff.py > output/diff_corp.txt

CPU times: user 9.25 s, sys: 1.41 s, total: 10.7 s
Wall time: 39min 46s


In [0]:
%%time

%cd output
!split -l 1 diff_corp.txt

/content/output
CPU times: user 1.18 s, sys: 153 ms, total: 1.33 s
Wall time: 4min 50s


In [0]:
%cd ../

from diff_to_parallel import run

/content/efcamdat


In [0]:
run("./output")

In [0]:
len(os.listdir("output/src"))

965123

In [0]:
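import time

# Keep this cell running so that the progress messages can be followed;
# it never finishes on its own and has to be interrupted manually.
while True:
  time.sleep(5000)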

5000 files processed
10000 files processed
15000 files processed
20000 files processed
25000 files processed
30000 files processed
35000 files processed
40000 files processed
45000 files processed
50000 files processed
55000 files processed
60000 files processed
65000 files processed
70000 files processed
75000 files processed
80000 files processed
85000 files processed
90000 files processed
95000 files processed
100000 files processed
105000 files processed
110000 files processed
115000 files processed
120000 files processed
125000 files processed
130000 files processed
135000 files processed
140000 files processed
145000 files processed
150000 files processed
155000 files processed
160000 files processed
165000 files processed
170000 files processed
175000 files processed
180000 files processed
185000 files processed
190000 files processed
195000 files processed
200000 files processed
205000 files processed
210000 files processed
215000 files processed
220000 files processed
225000 f

KeyboardInterrupt: ignored

**Don't forget to hit** `Interrupt Execution` **once the number of processed files reaches the file count printed before the waiting cell (965123 in our run), then continue with the cells below.**
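
As an alternative to interrupting by hand, the wait could be automated by polling a directory until it holds the expected number of files. This is only a sketch: the notebook does not show where `diff_to_parallel.run` writes its results, so the directory to watch is an assumption, and `wait_for_files` is a hypothetical helper.

```python
import os
import tempfile
import time


def wait_for_files(directory, expected, poll_seconds=5.0, timeout=None):
    """Block until `directory` holds at least `expected` entries; return the count."""
    started = time.monotonic()
    while True:
        count = sum(1 for _ in os.scandir(directory))
        if count >= expected:
            return count
        if timeout is not None and time.monotonic() - started > timeout:
            raise TimeoutError(f"only {count} of {expected} files after {timeout} s")
        time.sleep(poll_seconds)


# Demo on a temporary directory so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    for i in range(3):
        open(os.path.join(demo_dir, f"{i}.txt"), "w").close()
    found = wait_for_files(demo_dir, expected=3, poll_seconds=0.1, timeout=5)
    print(found)  # 3
```

In this notebook the call might look like `wait_for_files("output/crs", 965123)`, but only if that directory is where the finished files accumulate, which is a guess.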

In [0]:
%%time

!mkdir processed
!mv ./output/src ./processed/src
!mv ./output/trg ./processed/trg
!mv ./output/crs ./processed/crs

CPU times: user 15.8 s, sys: 2.2 s, total: 18 s
Wall time: 52min 28s


In [0]:
%%time

Srcs = []
Trgs = []
Corrs = []

for filename in os.listdir("./processed/crs"):
  with open("./processed/src/"+filename, "r", encoding="utf-8") as infile:
    Srcs.append(infile.read())
  with open("./processed/trg/"+filename, "r", encoding="utf-8") as infile:
    Trgs.append(infile.read())
  with open("./processed/crs/"+filename, "r", encoding="utf-8") as infile:
    Corrs.append(int(infile.read()))

CPU times: user 1min 55s, sys: 2min 47s, total: 4min 42s
Wall time: 26min 7s


In [0]:
%%time

import pandas as pd

EFCamDat = pd.DataFrame({
    "orig_text": Srcs,
    "corr_text": Trgs,
    "corrections_num": Corrs
})

EFCamDat['idx'] = EFCamDat.index
EFCamDat = EFCamDat[["idx", "orig_text", "corr_text", "corrections_num"]]

def convertbr(text):
  return re.sub(r"<br */?>", r"\n", text)

EFCamDat["orig_text"] = EFCamDat["orig_text"].apply(convertbr)
EFCamDat["corr_text"] = EFCamDat["corr_text"].apply(convertbr)

EFCamDat.to_pickle("efcamdat_df.pickle")

CPU times: user 2.92 s, sys: 4.01 s, total: 6.93 s
Wall time: 10.4 s


In [0]:
!cp efcamdat_df.pickle ../
%cd ../
!rm -rf efcamdat

/content


## Processing NUCLE

In [0]:
!mkdir nucle
%cd nucle

We assume that you have the `release3.3.tar.bz2` file in this directory.

In [0]:
!bzip2 -d release3.3.tar.bz2
!tar -xf release3.3.tar

In [0]:
import re
from lxml import etree

In [0]:
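with open("./release3.3/data/nucle3.2.sgml", "r", encoding="utf-8") as sgml:
  s = sgml.read()

# Split the file into one SGML string per <DOC> element
d = "</DOC>\n\n<DOC"
entries = ["<DOC" + e + "</DOC>" for e in s.split(d)]
entries = [entries[0][4:]] + entries[1:-1] + [entries[-1][:-6]]

# These characters serve as placeholders in sgml_to_dict,
# so none of them may occur in the corpus (all three checks should print False)
print(chr(1006) in s)
print(chr(1007) in s)
print(chr(1008) in s)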

In [0]:
def sgml_to_dict(sgml_entry):
  try:
    sgml_entry = sgml_entry.replace('&', chr(1008))
    sgml_entry = re.sub(r"<P>\n<([A-Z/])", r"</P>\n<\g<1>", sgml_entry)
    sgml_soup = re.split(r"(<P>.*?</P>)", sgml_entry, flags=re.DOTALL)
    sgml_entry = ""
    for el in sgml_soup:
      if el[:3] + el[-4:] == '<P></P>':
        sgml_entry += '<P>' + re.sub(r"<", chr(1006), re.sub(r">", chr(1007), el[3:-4])) + '</P>'
      else:
        sgml_entry += el
    xtree = etree.fromstring(sgml_entry)
    nid = xtree.attrib["nid"]

    text_elem = xtree.xpath("./TEXT")[0]
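    # Iterate over the element directly; lxml's getchildren() is deprecated
    parrs = [el.text.strip()+"\n" for el in text_elem if el.tag in {"TITLE", "P"}]
    # lenlist[i] is the absolute offset at which paragraph i starts in the joined text
    lenlist = [0]
    for c in parrs:
      lenlist.append(len(c) + lenlist[-1])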

    text = "".join(parrs)[:-1].replace(chr(1006), '<').replace(chr(1007), '>').replace(chr(1008), '&')

    patch = []
    for m in xtree.xpath("./ANNOTATION")[0]:
      start = lenlist[int(m.attrib['start_par'])] + int(m.attrib['start_off'])
      end = lenlist[int(m.attrib['end_par'])] + int(m.attrib['end_off'])
      correction = m.xpath("./CORRECTION")[0].text
      if correction is None:
        correction = ""
      patch.append([start, end, correction])
    
    outdict = {
        "id": nid,
        "text": text,
        "patch": patch
    }

    return outdict
  
  except Exception:
    # Show the offending entry, then re-raise the original error
    print(sgml_entry)
    raise

In [0]:
nucle_struct = [sgml_to_dict(e) for e in entries]
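
NUCLE annotations locate each edit by paragraph index and offset within that paragraph, whereas the patch format used here needs offsets into the whole text. The conversion in `sgml_to_dict` relies on cumulative paragraph lengths. The sketch below isolates that step on made-up paragraphs; `paragraph_starts` and `to_absolute` are hypothetical names, not functions from the notebook.

```python
from itertools import accumulate


def paragraph_starts(paragraphs):
    """Absolute offset at which each paragraph begins once they are concatenated."""
    return [0] + list(accumulate(len(p) for p in paragraphs))


def to_absolute(starts, par, off):
    """Convert a (paragraph index, offset in paragraph) pair to an absolute offset."""
    return starts[par] + off


# Toy paragraphs, each ending with a newline as in sgml_to_dict.
toy_paragraphs = ["A title\n", "First paragraph.\n", "Second one.\n"]
starts = paragraph_starts(toy_paragraphs)
text = "".join(toy_paragraphs)

# An edit covering the first six characters of paragraph 2.
start = to_absolute(starts, 2, 0)
end = to_absolute(starts, 2, 6)

print(starts)           # [0, 8, 25, 37]
print(text[start:end])  # Second
```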

In [0]:
%%time

nucle_df = list_to_corpus_df(nucle_struct)

CPU times: user 1.21 s, sys: 58.1 ms, total: 1.26 s
Wall time: 1.28 s


In [0]:
nucle_df.to_pickle("nucle_df.pickle")

In [0]:
!cp nucle_df.pickle ../
%cd ../
!rm -rf nucle

/content


## Processing ICNALE

In [0]:
!mkdir icnale
%cd icnale

In [0]:
!apt install libreoffice 1>/dev/null



Extracting templates from packages: 100%


In [0]:
!pip install python-docx

Collecting python-docx
  Downloading https://files.pythonhosted.org/packages/e4/83/c66a1934ed5ed8ab1dbb9931f1779079f8bca0f6bbc5793c06c4b5e7d671/python-docx-0.8.10.tar.gz (5.5MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.10-cp36-none-any.whl size=184491 sha256=2f500e519f927745be5917ed25a597dc6c0d373f7f773c582a7c367a6b6a991a
  Stored in directory: /root/.cache/pip/wheels/18/0b/a0/1dd62ff812c857c9e487f27d80d53d2b40531bec1acecfa47b
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.10


In [0]:
import os

import pandas as pd

We assume that you have the `icnale_credentials.py` file in this directory with `icnale_password` string containing the password for ICNALE.

In [0]:
from icnale_credentials import icnale_password

In [0]:
!wget http://language.sakura.ne.jp/icnale/corpus/ICNALE_EE_2.1.zip
!unzip -P {icnale_password} ICNALE_EE_2.1.zip 1>/dev/null
%cd ICNALE_Edited\ Essays_2.1/EE_Unmerged_Unclassified

--2020-05-25 14:17:05--  http://language.sakura.ne.jp/icnale/corpus/ICNALE_EE_2.1.zip
Resolving language.sakura.ne.jp (language.sakura.ne.jp)... 202.181.97.71
Connecting to language.sakura.ne.jp (language.sakura.ne.jp)|202.181.97.71|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10383630 (9.9M) [application/zip]
Saving to: ‘ICNALE_EE_2.1.zip’


2020-05-25 14:17:15 (1.09 MB/s) - ‘ICNALE_EE_2.1.zip’ saved [10383630/10383630]

/content/ICNALE_Edited Essays_2.1/EE_Unmerged_Unclassified


In [0]:
%%time

!find . -name "*.doc"  -exec lowriter --convert-to docx {} \; 1>/dev/null 2>/dev/null

In [0]:
from docx import Document

def process_icnale_corrs(docxname):
  doc = Document(docxname)
  body = doc._body._body

  num_corrs = 0

  for parr in body.xpath('./w:p'):
    met = False
    for subparr in parr.getchildren():
      if subparr.tag[-3:] in {"ins", "del"}:
        if not met:
          num_corrs += 1
        met = True
      else:
        met = False
  
  return num_corrs
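
The function above treats every uninterrupted run of tracked insertions and deletions inside a paragraph as a single correction, so a deletion immediately followed by an insertion (a replacement) counts once. The same counting rule can be sketched on a plain list of tag names; `count_correction_runs` and the toy tag list are hypothetical and only mirror the logic for one paragraph.

```python
from itertools import groupby


def count_correction_runs(tags):
    """Count maximal runs of consecutive 'ins'/'del' tags; each run is one correction."""
    is_edit = (tag in {"ins", "del"} for tag in tags)
    # groupby merges neighbouring equal values, so every True group is one run.
    return sum(1 for edited, _ in groupby(is_edit) if edited)


# Children of one toy paragraph: "r" stands for an ordinary, unchanged run.
toy_tags = ["r", "del", "ins", "r", "ins", "r", "r", "del"]
toy_count = count_correction_runs(toy_tags)
print(toy_count)  # 3
```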

In [0]:
from docx.opc.exceptions import PackageNotFoundError

In [0]:
icnale_data = {
    "id": [],
    "orig_text": [],
    "corr_text": [],
    "corrections_num": []
}

# Essay ids are the file names without the extension and without the part after the last underscore
for fid in sorted({os.path.splitext(f)[0].rsplit("_", 1)[0].strip() for f in os.listdir(".")}):
  spellings = [fid+"_ORIG+EDIT.docx", fid+" _ORIG+EDIT.docx", fid+"_ ORIG+EDIT.docx"]
  f = None
  for sp in spellings:
    if os.path.isfile(sp):
      f = sp
      break
  if f is None:
    print("No edit info for", fid)
    continue
  else:
    num_corrs = process_icnale_corrs(f)
  if os.path.isfile(fid+"_ORIG.txt"):
    with open(fid+"_ORIG.txt", "r", encoding="utf-8") as intext:
      orig_text = intext.read()
  else:
    print("No ORIG text for", fid)
    continue
  if os.path.isfile(fid+"_EDIT.txt"):
    with open(fid+"_EDIT.txt", "r", encoding="utf-8") as intext:
      corr_text = intext.read()
  else:
    print("No EDIT text for", fid)
    continue

  icnale_data["id"].append(fid)
  icnale_data["orig_text"].append(orig_text)
  icnale_data["corr_text"].append(corr_text)
  icnale_data["corrections_num"].append(num_corrs)

icnale_df = pd.DataFrame(icnale_data)

No edit info for W_CHN_PTJ0_272_B2_0
No EDIT text for W_CHN_PTJ0_277_B2_0


In [0]:
icnale_df.to_pickle("icnale_df.pickle")

In [0]:
!cp icnale_df.pickle ../../../
%cd ../../../
!rm -rf icnale

/content


## Uploading to Google Drive

In [0]:
# from google.colab import drive
# drive.mount("/content/gdrive")

In [0]:
# !cp *.pickle /content/gdrive/My\ Drive