# Run the heptabot `tiny` model from a Docker image

This notebook showcases a way to process data with `heptabot` using JupyterLab in your container image.

First, let's initialize the system. Note that in the process of creating the image this notebook gets moved to the root `heptabot` directory, so we assume we're already there, not in `heptabot/notebooks`.

In [1]:
import os
import subprocess
import pickle
import Pyro4
import Pyro4.util
from time import sleep
from tqdm.notebook import tqdm

!mkdir -p input
!mkdir -p output

In [2]:
%%writefile prompt_run.sh
source ~/mambaforge/etc/profile.d/conda.sh
export MODEL_PLACE=cpu
conda activate heptabot
pyro4-ns &
sleep 5; python models.py &
sleep 70

Writing prompt_run.sh


In [3]:
os.environ["HPT_MODEL_TYPE"] = "tiny"
!chmod +x prompt_run.sh
subprocess.Popen(["/bin/bash", os.path.join(os.path.abspath(os.getcwd()), "prompt_run.sh")])
sleep(70)

After waiting around 70 seconds, the model should be up and running, so we connect to it using `Pyro4`:

In [4]:
Heptamodel = Pyro4.Proxy("PYRONAME:heptabot.heptamodel")
batchify, process_batch, result_to_div = Heptamodel.batchify, Heptamodel.process_batch, Heptamodel.result_to_div
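
The fixed 70-second wait above is a simple heuristic. As a minimal sketch of an alternative, one could poll until the service answers. The helper `wait_until_ready` and its parameters are hypothetical and not part of `heptabot`; the probe used here is a stand-in so that the example runs on its own.

```python
import time


def wait_until_ready(check, timeout=120.0, interval=2.0):
    """Call `check` repeatedly until it stops raising or `timeout` seconds pass.

    Returns True as soon as `check()` completes without an exception,
    False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            check()
            return True
        except Exception:
            # Not ready yet: give up if out of time, otherwise wait and retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Stand-in for a real readiness probe: fails twice, then succeeds.
attempts = {"count": 0}


def fake_probe():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service not up yet")


ready = wait_until_ready(fake_probe, timeout=5.0, interval=0.01)
```

In this notebook, the probe might be something like `lambda: Pyro4.Proxy("PYRONAME:heptabot.heptamodel")._pyroBind()`, which raises while the name server or the registered object cannot be reached. Whether a successful bind also means the model has finished loading is an assumption that would need checking against `models.py`.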

Now let's get some example texts, each about 300 words long as measured by `nltk.word_tokenize`, and perform correction. You will find the correction results in the `output` directory. Feel free to change the cell below to process the texts you need.

**Important**: please choose the appropriate task type in the following cell. The default, `correction`, corrects whole essays, and it is the only task whose pipeline includes the error classification subroutine. You may also want to perform sentencewise correction. In that case, choose one of the identifiers of the relevant GEC tasks: `jfleg` (trained on JFLEG data) is for general sentencewise correction and should provide more diverse results, while `conll` (trained on CoNLL-2014 shared task data) and `bea` (trained on BEA-2019 shared task data) mainly correct grammar-related errors; their pipelines append grammar parsing data to each sentence. Please note that `heptabot` expects whole paragraphs of text for `correction` and one sentence per line for the other tasks, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.
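
To make the input expectations above concrete, here is a small sketch of a pre-check that could be run on each text before it is sent to the model. The names `VALID_TASKS` and `prepare_text` are hypothetical helpers, not part of `heptabot`, and the function only tidies line breaks: it does not split paragraphs into sentences.

```python
# The four task identifiers described above.
VALID_TASKS = {"correction", "jfleg", "conll", "bea"}


def prepare_text(text, task_type):
    """Validate the task type and tidy the text for the chosen task."""
    if task_type not in VALID_TASKS:
        raise ValueError(f"Unknown task type: {task_type!r}")
    if task_type == "correction":
        # Whole paragraphs are expected, so the text is passed through unchanged.
        return text
    # Sentencewise tasks: keep one non-empty, stripped line per sentence.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "I has a cat.\n\n  She like fish. \n"
cleaned = prepare_text(sample, "jfleg")
```

If your source files contain running text rather than one sentence per line, a sentence splitter would be needed before this step.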

In [5]:
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=2015_KT_12_2&extension=txt&protocol=1" -O ./input/KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=2014_ESha_2_1&extension=txt&protocol=1" -O ./input/ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=2016_LKa_2_2&extension=txt&protocol=1" -O ./input/LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(os.path.join("input", f), "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()

task_type = "correction"

Now we will convert the initial collection in `textdict` to the final `texts` dict, which we will then pass to the model. Note that you may set a different task type for each text: tasks in one batch don't have to be uniform.

In [6]:
texts = {}

for textid in textdict:
  texts[textid] = {"task_type": task_type, "text": textdict[textid]}
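
Since task types don't have to be uniform, one possible way to mix them is a mapping of per-text overrides on top of a default. This is only a sketch: `task_overrides`, `default_task` and the sample text IDs are made up for illustration, and the `textdict` here is a stand-in for the one built earlier.

```python
# Stand-in for the `textdict` built earlier (text IDs mapped to raw text).
textdict = {
    "essay_a": "First paragraph.\nSecond paragraph.",
    "sentences_b": "One sentence per line.\nAnother sentence.",
}

default_task = "correction"
# Hypothetical per-text overrides; IDs not listed here use the default.
task_overrides = {"sentences_b": "jfleg"}

# Same structure as `texts` above: {"task_type": ..., "text": ...} per text ID.
texts = {
    textid: {"task_type": task_overrides.get(textid, default_task), "text": text}
    for textid, text in textdict.items()
}
```

The resulting `texts` dict has the same shape as the one produced by the cell above, so the rest of the notebook could consume it unchanged.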

And now it is time to actually perform the correction:

In [7]:
%%time

prepared_data = {}
for textid in texts:
    batches, delims = batchify(texts[textid]["text"], texts[textid]["task_type"])
    prepared_data[textid] = (texts[textid]["task_type"], batches, delims)

with open("./templates/result.html", "r", encoding="utf-8") as inres:
    outhtml = inres.read()
outhtml = outhtml.replace("{{ which_font }}", "{0}").replace("{{ response }}", "{1}").replace("{{ task_type }}", "{2}")

for textid in tqdm(prepared_data):
    task_type, batches, delims = prepared_data[textid]
    which_font = "" if task_type == "correction" else "font-family: Ubuntu Mono; letter-spacing: -0.5px;"
    task_str = "text" if task_type == "correction" else "sentences"
    processed = []

    if task_type != "correction":
        print("Processing text with ID", textid)
        for batch in tqdm(batches):
            processed.append(process_batch(batch))
    else:
        for batch in batches:
            processed.append(process_batch(batch))
    response = result_to_div(texts[textid]["text"], processed, delims, task_type)

    proc_html = outhtml.format(which_font, response, task_str)
    with open(os.path.join("output", textid + ".html"), "w", encoding="utf-8") as outfile:
        outfile.write(proc_html)

  0%|          | 0/3 [00:00<?, ?it/s]

CPU times: user 63 ms, sys: 6.33 ms, total: 69.3 ms
Wall time: 1min 7s


## Download the results

You may also want to download the texts processed by `heptabot`. The code below creates an archive with all the processed files: download it from the file browser on the left and unzip it on your device to view the results as they would be displayed in the Web version.

In [8]:
!zip -q heptabot_processed.zip -r output
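
If the `zip` command is not available in your image, the same archive could be built with Python's standard library. The sketch below runs in a temporary directory with a dummy file so that it is self-contained; in the notebook, `root` would presumably be the current directory.

```python
import os
import shutil
import tempfile
import zipfile

# Demonstration in a temporary directory; in the notebook, `root` would be ".".
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "output"))
with open(os.path.join(root, "output", "example.html"), "w", encoding="utf-8") as f:
    f.write("<html></html>")

# Creates heptabot_processed.zip containing the `output` directory itself.
archive_path = shutil.make_archive(
    os.path.join(root, "heptabot_processed"), "zip", root_dir=root, base_dir="output"
)

# List the archive contents to confirm what was packed.
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
```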