
<a href="https://www.kaggle.com/kernels/fork/18483792" target="_parent"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle"/></a>

# Run heptabot `xxl` model on TPU

This notebook lets you process data with our `xxl` model in the Kaggle TPU environment. Kaggle is the only widely available environment that provides access to the `v3-8` TPU topology required by the T5 11B checkpoint, so with Kaggle you can run our `xxl` model efficiently.

First, we set some required environment variables:



In [1]:
import os

model_type = "xxl"
# The steps are largely the same for the medium and xxl models.
# However, we keep this switch because it is better to run the medium model
# in Google Colab and the xxl model in Kaggle, and the two environments differ.

os.environ["MODEL_PLACE"] = "tpu"
os.environ["HPT_MODEL_TYPE"] = model_type

if model_type == "xxl":
    os.environ["CHECKPOINT_STEP"] = "1014000"
    os.environ["TPU_TOPOLOGY"] = "v3-8"
else:
    os.environ["CHECKPOINT_STEP"] = "1003800"
    os.environ["TPU_TOPOLOGY"] = "v2-8"
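
Before moving on, it can help to confirm that the variables above are actually set. The sketch below is not part of `heptabot`: the helper name `missing_vars` is hypothetical, and the list of required names is simply the four variables set in this cell.

```python
import os

# Names taken from the cell above; the helper itself is a hypothetical addition.
REQUIRED_VARS = ["MODEL_PLACE", "HPT_MODEL_TYPE", "CHECKPOINT_STEP", "TPU_TOPOLOGY"]

def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = missing_vars(os.environ)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required variables are set.")
```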

## Prepare environment

Now click the '▷' buttons next to each of these cells. The code below will install the environment for `heptabot`, which takes around 10-12 minutes to execute.

In [2]:
!pip install -qq t5==0.9.0 seqio rouge_score sacrebleu sentencepiece

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytorch-lightning 1.3.8 requires tensorboard!=2.5.0,>=2.2.0, but you have tensorboard 2.5.0 which is incompatible.


In [3]:
!git clone -q https://github.com/lcl-hse/heptabot -b gpu-tpu
%cd heptabot
!mv scripts/colab-run/* .
!mv scripts/tpu-run/* .
!chmod +x colab_run.sh
!chmod +x tpu_run.sh
!mv scripts/measures/* .
!chmod +x run_measures.sh
!sed -i -e 's/mamba/conda/g' colab_setup.sh
!chmod +x colab_setup.sh

/kaggle/working/heptabot


In [4]:
!time bash colab_setup.sh

Initializing virtual environment with python 3.6.9
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda/envs/heptabot

  added / updated specs:
    - python=3.6.9


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            1_gnu          22 KB  conda-forge
    certifi-2021.5.30          |   py36h5fab9bb_0         141 KB  conda-forge
    ld_impl_linux-64-2.36.1    |       hea4e1c9_2         667 KB  conda-forge
    libffi-3.2.1               |    he1b5a44_1007       

In [5]:
!mkdir output
!cp -r static output
!mkdir raw

In [6]:
import subprocess
from time import sleep

_ram_before = !awk '/MemAvailable/ { printf "%.3f\n", $2/1024/1024 }' /proc/meminfo
os.environ["RAM_BEFORE"] = str(_ram_before[0])
subprocess.Popen(["/bin/bash", os.path.join(os.path.realpath("."), "colab_run.sh")])
sleep(70)
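
The `awk` one-liner above reads the `MemAvailable` entry of `/proc/meminfo` (reported in kB) and converts it to GiB. A pure-Python sketch of the same computation is shown below; `available_gib` is a hypothetical helper, not something `heptabot` provides, and the sample text is made up for illustration.

```python
def available_gib(meminfo_text):
    """Parse /proc/meminfo content and return MemAvailable in GiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable"):
            kib = int(line.split()[1])  # the value is reported in kB
            return round(kib / 1024 / 1024, 3)
    return None

# Made-up sample in the /proc/meminfo layout.
sample = "MemTotal:       16390000 kB\nMemAvailable:   15728640 kB\n"
print(available_gib(sample))
```

On a Linux machine the real value could be read with `available_gib(open("/proc/meminfo").read())`.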

In [7]:
import re
import os
import pickle
import IPython

## Get the texts

The textual data is downloaded in this part. Here we use 3 essays from [REALEC](https://realec.org/) as example data; you should, however, change this part to process the texts you need.

In [8]:
!mkdir input

!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=2015_KT_12_2&extension=txt&protocol=1" -O ./input/KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=2014_ESha_2_1&extension=txt&protocol=1" -O ./input/ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=2016_LKa_2_2&extension=txt&protocol=1" -O ./input/LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(os.path.join("input", f), "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()

**Important**: If you got here from the error page on the `heptabot` website stating "*In order to maintain server resources and stable uptime, we limit the amounts of data that can be processed via our Web interface*", hit the **+ Add data** button next to the Data tab on the upper right of this page, upload the `generated.txt` file you got from our website, follow the steps to create a new dataset and move the file to `heptabot`'s `input` folder (e.g. `mv ../../input/generated.txt ./input`). Alternatively, set up direct link access to the file and use `wget` (e.g. `wget -q https://example.com/storage/generated.txt -O ./input/generated.txt`); GitHub or Google Cloud Storage may be a good place to host it. Finally, modify the previous cell so that it reads your file. (If this is too complicated, you can use the `tiny` and `medium` models in Colab, where adding your own files is considerably easier.)
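
If your files are already in the `input` folder, the hard-coded list of filenames in the previous cell can be replaced with a directory scan. The following is a minimal sketch under that assumption; the helper name `load_texts` is hypothetical, and it produces the same `textdict` layout (file name without extension mapped to file content).

```python
import glob
import os

def load_texts(folder="input"):
    """Read every .txt file in `folder` into a dict keyed by file name without extension."""
    loaded = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        key = os.path.splitext(os.path.basename(path))[0]
        with open(path, "r", encoding="utf-8") as infile:
            loaded[key] = infile.read()
    return loaded

# An empty dict is returned if the folder is missing or holds no .txt files.
textdict = load_texts("input")
```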

Next, put all your texts in a `dict` named `textdict`, where the keys are `str`s with text IDs (preferably filenames without extension) and the values are the texts themselves, also stored as `str`s. The following cell checks that this is the case:

In [9]:
assert all(type(k) is str for k in textdict.keys())
assert all(type(v) is str for v in textdict.values())

## Process data with `heptabot`

The actual `heptabot` magic is performed here!

**Important**: please choose the appropriate task type in the following cell. `correction`, the default, is used to correct whole essays, and only its pipeline incorporates the error classification subroutine. You may also want to perform sentencewise correction; in this case, choose one of the identifiers of the relevant GEC tasks: `jfleg` (trained on JFLEG data) is for general sentencewise correction and should provide more diverse results, while `conll` (trained on CONLL-14 competition) and `bea` (trained on BEA-2019 competition) correct mainly grammar-related errors, and for these two the grammar parsing data is appended to the sentence in the corresponding pipeline. Please note that `heptabot` expects whole paragraphs of text as data for `correction` and sentence-by-sentence structured data for the other tasks, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.
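
For the sentencewise tasks, each input file needs one sentence per line. The sketch below shows one rough way to get there with a regular expression. The helper `to_sentence_lines` is hypothetical, and the splitting rule (break after `.`, `!` or `?` when followed by whitespace) is a simplifying assumption that will mis-split abbreviations such as "e.g. this", so a proper sentence tokenizer is preferable for real data.

```python
import re

def to_sentence_lines(text):
    """Naively split text into sentences and join them with newlines."""
    # Break after ., ! or ? when followed by any whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(p for p in parts if p)

print(to_sentence_lines("I agree with this. Do you? Yes!"))
```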

In [10]:
task_type = "correction"  #@param ["correction", "jfleg", "conll", "bea"] 

In [11]:
import random
chosen_one = random.choice(list(textdict.keys()))

print(textdict[chosen_one])

In modern world our life is demanding more and more different knowledge and skills from us so to set it children from early age go to some lessons and courses. Because of it they usually spend quite a little time outside and do not aware of all value and beauty of our nature, I can partly asree with this statement. 
From one side, it is true that nowdays children spent less time outside enjoying some simple things such as trees, grass, sun and fresh air. Even when they go for a walk, in big sities it is complicated to find place where virgin nature is saved. They have to walk around blocks of flats and roads where no fresh air or spectacular views are left, although they are very important. 
From other side, there is a lot of time children have to spend learning nature. They all have holidays when parents try to send they to different camps in forests or round the sea, to countryside where a lot of them have relatives or friends and so on. 
So in this time children have enough space an

In [12]:
texts = {}

for textid in textdict:
  texts[textid] = {"task_type": task_type, "text": textdict[textid]}

with open("./raw/process_texts.pkl", "wb") as outpickle:
  pickle.dump(texts, outpickle)
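
The pickle written above is a `dict` that maps each text ID to a `dict` with the keys `task_type` and `text`. A short sketch for checking a payload of that shape before handing it to the batching script follows; `check_payload` is a hypothetical helper, and the set of allowed task names is copied from the `task_type` cell.

```python
ALLOWED_TASKS = {"correction", "jfleg", "conll", "bea"}  # from the task_type cell

def check_payload(payload):
    """Return a list of problems found in a {text_id: {"task_type", "text"}} dict."""
    problems = []
    for text_id, item in payload.items():
        if not isinstance(text_id, str):
            problems.append(f"{text_id!r}: ID is not a str")
        if item.get("task_type") not in ALLOWED_TASKS:
            problems.append(f"{text_id!r}: unknown task type {item.get('task_type')!r}")
        if not isinstance(item.get("text"), str) or not item["text"].strip():
            problems.append(f"{text_id!r}: text is missing or empty")
    return problems

# Illustrative payload; an empty list means no problems were found.
example = {"essay_1": {"task_type": "correction", "text": "Some text."}}
print(check_payload(example))
```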

To take advantage of the TPU, we split our process into three parts. First, we prepare our texts by splitting them into batches required by `heptabot`:

In [13]:
%%bash
source activate heptabot
python batchify_input.py

Preparing texts for TPU model inference


100%|██████████| 3/3 [00:00<00:00, 151.25it/s]


Then we call the TPU process (this is where our `xxl` model runs inference on the texts):

In [14]:
%%bash
python tpu_model_run.py

2021-07-30 21:45:54.825799: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2021-07-30 21:45:54.825945: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-30 21:46:03.197171: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
2021-07-30 21:46:03.337424: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadata check was skipped.
2021-07-30 21:46:03.516070: I tensorflow/core/platform/cloud/google_auth_provider.cc:180] Attempting an empty bearer token since no token was retrieved from files, and GCE metadat

And then, finally, we glue our processed texts back together to produce the outputs:

In [15]:
%%bash
source activate heptabot
python process_output.py

Processing TPU model outputs


100%|██████████| 3/3 [00:04<00:00,  1.33s/it]


## Display the results

In this section we show the processed results.

In [16]:
# This cell contains a function that makes pretty displaying work
def prepare_display(filekey):
  template = """<html><head>
	<meta charset="utf-8">
	<meta content="IE=edge" http-equiv="X-UA-Compatible">
	<meta content="width=device-width, initial-scale=1" name="viewport">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
	<link href="https://getbootstrap.com/docs/3.3/dist/css/bootstrap.min.css" rel="stylesheet"><!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
	<link href="https://getbootstrap.com/docs/3.3/assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet"><!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
	<link href="https://fonts.googleapis.com/css2?family=Kanit&family=Mukta&family=PT+Sans&family=PT+Serif&family=Ubuntu+Mono&display=swap" rel="stylesheet">
<style>
{0}
</style>
<script type="text/javascript">
{1}
</script>
</head>
<body>
<div class="header2">{2}</div><br>
{3}
</body></html>"""

  with open("static/result/style.css", "r") as inhtml:
    style = inhtml.read()
  with open("static/result/engine.js", "r") as inhtml:
    script = inhtml.read().replace("var em;", "var em=18;").replace("elemtitle.style.left = varleft + 'px';", "elemtitle.style.left = varleft - 70 + 'px';")
  with open(os.path.join('./output', filekey + ".html"), "r") as inhtml:
    htmlcont = inhtml.read()
  tt = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL).group(1)
  result_div = re.search(r'<div id="resulta".*?\n', htmlcont).group(0)
  outcont = template.format(style, script, tt, result_div)
  with open("display.html", "w", encoding="utf-8") as outhtml:
    outhtml.write(outcont)
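
The two regular expressions in `prepare_display` rely on the layout of the generated page: the header may span several lines (hence `re.DOTALL`), while the result `div` is expected to sit on a single line. The sketch below isolates that extraction step and adds a guard for pages that do not match; `extract_parts` is a hypothetical helper and the sample HTML is made up for illustration.

```python
import re

def extract_parts(htmlcont):
    """Return (title, result_div) from an output page, or None if either is missing."""
    title = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL)
    # Without re.DOTALL, "." stops at a newline, so this grabs the rest of one line only.
    result = re.search(r'<div id="resulta".*?\n', htmlcont)
    if title is None or result is None:
        return None
    return title.group(1), result.group(0)

# Made-up page fragment in the layout the expressions expect.
sample = (
    '<div class="header2">Essay\ntitle</div>\n'
    '<div id="resulta" class="r">corrected text</div>\n'
    '<p>footer</p>\n'
)
print(extract_parts(sample))
```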

In [17]:
#@markdown Enter the desired text ID below to pretty-print the result
display_id = chosen_one  #@param {type: "string"}

prepare_display(display_id)
IPython.display.HTML(filename='display.html')

## Download the results

Now, you may also want to get the texts processed by `heptabot`. The code below will generate an archive with the processed files. Navigate to `/kaggle/working/heptabot` in the output subsection of Data bar on the right, scroll to `heptabot_processed.zip` (hit the "🗘" button next to `heptabot` folder if it is not present), select "More actions" and click on Download. When downloaded, unzip the archive to view the results as they would be displayed in our Web version.

In [18]:
!zip -q heptabot_processed.zip -r output
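
If the `zip` command is not available, roughly the same archive can be produced from Python. This is a sketch using the standard library's `shutil.make_archive`, not the notebook's own method; the wrapper name `zip_folder` is hypothetical.

```python
import shutil

def zip_folder(folder="output", archive_name="heptabot_processed"):
    """Create <archive_name>.zip containing `folder` and return the archive path."""
    # root_dir="." keeps the folder name as the top-level entry inside the archive.
    return shutil.make_archive(archive_name, "zip", root_dir=".", base_dir=folder)
```

Calling `zip_folder()` from the `heptabot` directory would write `heptabot_processed.zip` next to the `output` folder.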

## Measure performance

Finally, here we include a section to test the performance of this version and reproduce the scores we report for some Grammatical Error Correction tasks. Our test set for the `correction` task consists of 40 texts, 20 with 1000 symbols and 20 with 1500 symbols (texts of such length are fairly common in [REALEC](https://realec.org/)), for a total of 50000 symbols. During our research, we found that using GLEU to assess the performance of `correction` is not very informative, so we do not measure the quality of our model for this task.
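
The test-set size and the averages reported by the measurement script are related by simple arithmetic. The sketch below reproduces it, assuming 20 texts of 1000 symbols and 20 texts of 1500 symbols (the split consistent with 40 texts and 50000 symbols in total) and the elapsed time of 4 minutes 9 seconds shown in the log further down; the variable names are illustrative.

```python
n_short, n_long = 20, 20                        # assumed split of the 40 test texts
total_texts = n_short + n_long                  # 40
total_symbols = n_short * 1000 + n_long * 1500  # 50000

elapsed_s = 4 * 60 + 9                          # "4 minutes 9 seconds" from the log
print(round(elapsed_s / total_texts, 3))            # average seconds per text
print(round(elapsed_s / total_symbols * 1000, 2))   # average milliseconds per symbol
```

The two printed values, 6.225 and 4.98, agree with the "Average time/text" and "Average time/symbol" lines in the log.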

**Important:** please note that in order to reproduce our BEA-2019 score you need to upload the zipped version of our `bea` task output to the official [scoring system](https://competitions.codalab.org/competitions/20229#participate). To get the file, navigate to `/kaggle/working/heptabot` in the output subsection of the Data bar on the right, scroll to `bea_test_heptabot_xxl_tpu.zip` (hit the "🗘" button next to the `heptabot` folder if it is not present), select "More actions" and click on Download. Submit the downloaded archive as is, without unzipping it.

In [19]:
!chmod +x run_measures.sh
!bash run_measures.sh
print("All tests finished.")

Evaluating heptabot "xxl" model on architecture "tpu"

Test 1. Correction task, running time and memory usage
Preparing texts for TPU model inference
100%|██████████████████████████████████████████| 40/40 [00:00<00:00, 201.90it/s]
Processing TPU model outputs
100%|███████████████████████████████████████████| 40/40 [00:36<00:00,  1.10it/s]
RAM used: 2.563 GiB
Time elapsed: 4 minutes 9 seconds
Average time/text: 6.225 secs
Average time/symbol: 4.98 ms
Note that for TPU RAM usage is not so relevant and elapsed time included system startup unlike in CPU and GPU tests.

Test 2. Competition scores
Getting JFLEG from https://github.com/keisks/jfleg
Getting CONLL-14 test set from https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz, M2-scorer from https://www.comp.nus.edu.sg/~nlp/sw/m2scorer.tar.gz
Getting BEA-2019 test set from https://www.cl.cam.ac.uk/research/nl/bea2019st/data/ABCN.test.bea19.orig
Preparing input for heptabot...
Processing files...
Prepar