
<a href="https://colab.research.google.com/github/lcl-hse/heptabot/blob/gpu-tpu/notebooks/Run_medium_model_on_Colab_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run heptabot `medium` model on Colab TPU

This notebook lets you process data with our `medium` model in a Google Colab TPU environment, which provides the highest speed and lets you process large amounts of data.

As Colab has recently switched to Python 3.7 and our dependency `spaCy 1.9.0` supports only Python 3.6, we use `mamba` to ensure that we get the right packages in our environment. To get `mamba`, you should execute the following cell (click the '▷' button). Please note that the runtime will restart after that, so don't schedule the rest of the cells to execute just yet.

In [1]:
!pip install -q condacolab==0.1.1
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:40
🔁 Restarting kernel...


After your runtime has restarted, execute the following cell to set some environment variables:



In [1]:
import os

model_type = "medium"
# The steps are largely the same for the medium and xxl models.
# However, we keep this switch because the medium model is best run in Google Colab and the xxl model in Kaggle, and these environments differ.

os.environ["MODEL_PLACE"] = "tpu"
os.environ["HPT_MODEL_TYPE"] = model_type

if model_type == "xxl":
    os.environ["CHECKPOINT_STEP"] = "1014000"
    os.environ["TPU_TOPOLOGY"] = "v3-8"
else:
    os.environ["CHECKPOINT_STEP"] = "1003800"
    os.environ["TPU_TOPOLOGY"] = "v2-8"

## Prepare environment

Now click the '▷' on this group of cells. The code below will install the environment for `heptabot`, which takes around 10 minutes to execute.

In [2]:
!pip install -qq t5==0.9.0 seqio rouge_score sacrebleu sentencepiece


In [3]:
!git clone -q https://github.com/lcl-hse/heptabot -b gpu-tpu
%cd heptabot
!mv scripts/colab-run/* .
!mv scripts/tpu-run/* .
!chmod +x colab_run.sh
!chmod +x tpu_run.sh
!mv scripts/measures/* .
!chmod +x run_measures.sh

/content/heptabot


In [4]:
!time bash colab_setup.sh

Initializing virtual environment with python 3.6.9
  Package             Version  Build               Channel                    Size
────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────

  _libgcc_mutex           0.1  conda_forge         conda-forge/linux-64     Cached
  _openmp_mutex           4.5  1_gnu               conda-forge/linux-64     Cached
  ca-certificates   2021.5.30  ha878542_0          conda-forge/linux-64     136 KB
  certifi           2021.5.30  py36h5fab9bb_0      conda-forge/linux-64     141 KB
  ld_impl_linux-64     2.36.1  hea4e1c9_2          conda-forge/linux-64     667 KB
  libffi                3.2.1  he1b5a44_1007       conda-forge/linux-64      47 KB
  libgcc-ng            11.1.0  hc902ee8_5          conda-forge/linux-64     907 KB
  libgom

In [5]:
!mkdir output
!cp -r static output
!mkdir raw

In [6]:
import subprocess
from time import sleep

_ram_before = !awk '/MemAvailable/ { printf "%.3f\n", $2/1024/1024 }' /proc/meminfo
os.environ["RAM_BEFORE"] = str(_ram_before[0])
subprocess.Popen(["/bin/bash", os.path.join(os.path.realpath("."), "colab_run.sh")])
sleep(70)

In [7]:
import re
import os
import pickle
import IPython
from google.colab.files import download

## Check the installation

The following cell is designed to check if the preparations went through correctly. 

In [8]:
#@markdown ### Environment check

test = !lsof | grep 9090
if len(test) > 6:
  print('\x1b[1mEverything seems to be OK!\x1b[0m')
else:
  print('\x1b[1;31mSeems like something went wrong.\nTry waiting for a couple minutes and re-run this cell. If the problem persists, click Runtime ➔ Factory reset runtime ➔ YES and redo all the steps.\x1b[0m')

Everything seems to be OK!
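
The check above counts `lsof` lines that mention 9090. As a complementary sketch, assuming the background process started by `colab_run.sh` listens on TCP port 9090 on localhost (an assumption inferred from the `grep 9090`, not something this notebook states), you could test the port directly. The helper name `port_is_open` is hypothetical:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Port 9090 is an assumption taken from the lsof check above.
print(port_is_open("127.0.0.1", 9090))
```

A `False` result right after setup does not necessarily mean failure, since the service may still be starting; waiting and retrying, as advised above, still applies.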


## Get the texts

The textual data is downloaded in this part. Here we use 3 essays from [REALEC](https://realec.org/) as example data; you should, however, change this part to process the texts you need.

In [9]:
!mkdir input

!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=2015_KT_12_2&extension=txt&protocol=1" -O ./input/KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=2014_ESha_2_1&extension=txt&protocol=1" -O ./input/ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=2016_LKa_2_2&extension=txt&protocol=1" -O ./input/LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(os.path.join("input", f), "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()

**Important**: If you got here from the error page on `heptabot` website stating "*In order to maintain server resources and stable uptime, we limit the amounts of data that can be processed via our Web interface*", uncomment the following code (remove all the number signs) and upload the `generated.txt` file you got from our website:

In [None]:
#from google.colab import files
#files.upload()

#textdict = {}

#with open("generated.txt", "r", encoding="utf-8") as infile:
  #textdict["generated"] = infile.read()

In other cases, we recommend putting your files into the **`input`** folder for clarity.

Put all your texts in a `dict` named `textdict`, where the keys are `str`s with text IDs (preferably filenames without extension) and the values are the texts themselves, also stored as `str`s. The cell below checks that both conditions hold:

In [10]:
assert all(type(k) is str for k in textdict.keys())
assert all(type(v) is str for v in textdict.values())
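
If your texts are already in the `input` folder, `textdict` can be built in one step. The following is a minimal sketch, assuming every file is UTF-8 encoded and has a `.txt` extension; `load_texts` is a hypothetical helper, not part of `heptabot`:

```python
from pathlib import Path


def load_texts(folder: str = "input") -> dict:
    """Map each .txt file's name (without extension) to its contents."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

Calling `textdict = load_texts("input")` would then give a dictionary that satisfies both assertions above.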

## Process data with `heptabot`

The actual `heptabot` magic is performed here!

**Important**: please choose the appropriate task type in the following cell. `correction`, the default, corrects whole essays, and it is the only task whose pipeline incorporates the error classification subroutine. You may also want to perform sentencewise correction; in that case, choose one of the relevant GEC task identifiers. `jfleg` (trained on JFLEG data) is for general sentencewise correction and should provide more diverse results, while `conll` (trained on CONLL-14 competition data) and `bea` (trained on BEA-2019 competition data) mainly correct grammar-related errors, so their pipelines append grammar parsing data to each sentence. Please note that `heptabot` expects whole paragraphs of text for `correction` and sentence-by-sentence data for the other tasks, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.

In [11]:
task_type = "correction"  #@param ["correction", "jfleg", "conll", "bea"] 
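
For the sentencewise tasks, each input text should hold one sentence per line. Here is a minimal sketch of preparing such input, assuming a naive punctuation-based split is good enough for your data. `to_sentence_lines` is a hypothetical helper rather than part of `heptabot`, and it will mis-split abbreviations such as "e.g.":

```python
import re


def to_sentence_lines(text: str) -> str:
    """Put each sentence on its own line using a naive split after . ! or ?"""
    # Split on whitespace that follows sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(s for s in sentences if s)


print(to_sentence_lines("I am here. Are you? Yes!"))
```

For real data, a proper sentence tokenizer is likely a better choice than this regular expression.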

In [12]:
import random
chosen_one = random.choice(list(textdict.keys()))

print(textdict[chosen_one])

In modern world our life is demanding more and more different knowledge and skills from us so to set it children from early age go to some lessons and courses. Because of it they usually spend quite a little time outside and do not aware of all value and beauty of our nature, I can partly asree with this statement. 
From one side, it is true that nowdays children spent less time outside enjoying some simple things such as trees, grass, sun and fresh air. Even when they go for a walk, in big sities it is complicated to find place where virgin nature is saved. They have to walk around blocks of flats and roads where no fresh air or spectacular views are left, although they are very important. 
From other side, there is a lot of time children have to spend learning nature. They all have holidays when parents try to send they to different camps in forests or round the sea, to countryside where a lot of them have relatives or friends and so on. 
So in this time children have enough space an

In [13]:
texts = {}

for textid in textdict:
  texts[textid] = {"task_type": task_type, "text": textdict[textid]}

with open("./raw/process_texts.pkl", "wb") as outpickle:
  pickle.dump(texts, outpickle)
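
To make the structure of the pickled file explicit, here is a small sketch that builds the same kind of mapping for a made-up text and checks that it survives a round trip through `pickle`. The helper `build_payload`, the sample text and the temporary path are illustrative assumptions:

```python
import os
import pickle
import tempfile


def build_payload(textdict: dict, task_type: str) -> dict:
    """Wrap each text in the {"task_type": ..., "text": ...} record used above."""
    return {
        textid: {"task_type": task_type, "text": text}
        for textid, text in textdict.items()
    }


payload = build_payload({"example": "A short text."}, "correction")

# Write to a temporary file so the real ./raw folder is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "process_texts.pkl")
    with open(path, "wb") as outpickle:
        pickle.dump(payload, outpickle)
    with open(path, "rb") as inpickle:
        restored = pickle.load(inpickle)

print(restored == payload)
```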

To take advantage of the TPU, we split the process into three parts. First, we prepare our texts by splitting them into batches required by `heptabot`:

In [14]:
%%bash
source activate heptabot
python batchify_input.py

Preparing texts for TPU model inference


  0%|          | 0/3 [00:00<?, ?it/s]100%|██████████| 3/3 [00:00<00:00, 196.69it/s]


Then we call the TPU process (this is where our `medium` model runs inference on the texts). Please note that `mesh-tensorflow` TPU processes tend to produce a lot of logging output. We omitted it from this GitHub notebook but left it enabled in the actual code cell so that you can check that the process is running as intended. You can, however, discard this output entirely: set the `SUPPRESS_OUTPUT` variable to `True` and wait for the cell below to finish executing.

In [15]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

SUPPRESS_OUTPUT = False  #@param {"type": "boolean"}

if SUPPRESS_OUTPUT:
  !bash tpu_run.sh 1>/dev/null 2>&1
else:
  !bash tpu_run.sh

Instructions for updating:
non-resource variables are not supported in the long term
I0730 21:46:02.370088 140202429888384 mesh_transformer_main.py:163] No write access to model directory. Skipping command logging.
I0730 21:46:02.378151 140202429888384 resource_reader.py:50] system_path_file_exists:gs://heptabot/models/medium/tpu/operative_config.gin
E0730 21:46:02.378844 140202429888384 resource_reader.py:55] Path not found: gs://heptabot/models/medium/tpu/operative_config.gin
INFO:tensorflow:model_type=bitransformer
I0730 21:46:02.590454 140202429888384 utils.py:2535] model_type=bitransformer
INFO:tensorflow:mode=infer
I0730 21:46:02.590665 140202429888384 utils.py:2536] mode=infer
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0730 21:46:02.590729 140202429888384 utils.py:2537] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=16
I0730 21:46:02.590778 140202429888384 utils.py:2538] batch_size=16
INFO:tensorflow:train_steps=8388608
I0730 21:
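
Outside of IPython `!` magics, the same choice between showing and discarding output can be made with `subprocess`. This is a sketch under that assumption; `run_script` is a hypothetical helper, and discarding both streams corresponds to redirecting stdout and stderr to `/dev/null`:

```python
import subprocess
import sys


def run_script(cmd, suppress_output=False):
    """Run cmd and return its exit code, optionally discarding stdout and stderr."""
    if suppress_output:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    else:
        result = subprocess.run(cmd)
    return result.returncode


# Harmless demonstration command; in this notebook the command
# would be ["bash", "tpu_run.sh"].
print(run_script([sys.executable, "-c", "print('hidden')"], suppress_output=True))
```

Checking the returned exit code is useful when output is suppressed, since a failure would otherwise go unnoticed.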

And then, finally, we glue our processed texts back together to produce the outputs:

In [16]:
%%bash
source activate heptabot
python process_output.py

Processing TPU model outputs


  0%|          | 0/3 [00:00<?, ?it/s] 33%|███▎      | 1/3 [00:01<00:03,  1.54s/it] 67%|██████▋   | 2/3 [00:02<00:01,  1.09s/it]100%|██████████| 3/3 [00:04<00:00,  1.48s/it]100%|██████████| 3/3 [00:04<00:00,  1.42s/it]


## Display the results

In this section we show the processed results.

In [17]:
#@markdown This cell hides a function to make pretty displaying work
def prepare_display(filekey):
  template = """<html><head>
	<meta charset="utf-8">
	<meta content="IE=edge" http-equiv="X-UA-Compatible">
	<meta content="width=device-width, initial-scale=1" name="viewport">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
	<link href="https://getbootstrap.com/docs/3.3/dist/css/bootstrap.min.css" rel="stylesheet"><!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
	<link href="https://getbootstrap.com/docs/3.3/assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet"><!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
	<link href="https://fonts.googleapis.com/css2?family=Kanit&family=Mukta&family=PT+Sans&family=PT+Serif&family=Ubuntu+Mono&display=swap" rel="stylesheet">
<style>
{0}
</style>
<script type="text/javascript">
{1}
</script>
</head>
<body>
<div class="header2">{2}</div><br>
{3}
</body></html>"""

  with open("static/result/style.css", "r") as inhtml:
    style = inhtml.read()
  with open("static/result/engine.js", "r") as inhtml:
    script = inhtml.read().replace("var em;", "var em=18;").replace("elemtitle.style.top = (rect.top - prect.top) + 'px';", "elemtitle.style.top = (rect.top - prect.top) + 6 + 'px';")
  with open(os.path.join('./output', filekey + ".html"), "r") as inhtml:
    htmlcont = inhtml.read()
  tt = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL).group(1)
  result_div = re.search(r'<div id="resulta".*?\n', htmlcont).group(0)
  outcont = template.format(style, script, tt, result_div)
  with open("display.html", "w", encoding="utf-8") as outhtml:
    outhtml.write(outcont)
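
To illustrate how `prepare_display` pulls the title and the result block out of an output page, here is a sketch that applies the same two regular expressions to a hypothetical HTML fragment. The fragment is made up for illustration; the real files in `output` may be structured differently:

```python
import re

# Hypothetical stand-in for the contents of an output HTML file.
sample = (
    '<div class="header2">Example title</div>\n'
    '<div id="resulta" class="result">Corrected text</div>\n'
)

# Non-greedy match with DOTALL, so the title may span several lines.
title = re.search(r'<div class="header2">(.*?)</div>', sample, flags=re.DOTALL).group(1)

# Without DOTALL, "." stops at a newline, so this captures one line only.
result_div = re.search(r'<div id="resulta".*?\n', sample).group(0)

print(title)
print(result_div)
```

Note that the second pattern matches only up to the first newline, so it relies on the whole result `div` being written on a single line.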

In [18]:
#@markdown Enter the desired text ID below to pretty-print the result
display_id = chosen_one  #@param {type: "string"}

prepare_display(display_id)
IPython.display.HTML(filename='display.html')

## Download the results

You may also want to get the texts processed by `heptabot`. The code below packs them into a zip archive and downloads it directly to your computer; unzip the archive to view the results as they would be displayed in the web version. With Colab, you can also easily save the resulting folder to your Google Drive.

In [None]:
!zip -q heptabot_processed.zip -r output

In [None]:
from google.colab import files

files.download("heptabot_processed.zip")

## Measure performance

Finally, we include a section to test the performance of this version and reproduce the scores we report for some Grammatical Error Correction tasks. Our test set for the `correction` task consists of 40 texts, 20 with 1000 symbols and 20 with 1500 symbols (texts of such length are fairly common in [REALEC](https://realec.org/)), for a total of 50000 symbols. During our research, we found that using GLEU to assess the performance of `correction` is not very informative, so we do not measure the quality of our model for this task.

**Important:** please note that in order to reproduce our BEA-2019 score you need to upload the zipped version of our `bea` task output, which will automatically start downloading near the end of this cell's execution, to the official [scoring system](https://competitions.codalab.org/competitions/20229#participate).

In [19]:
!chmod +x run_measures.sh
!bash run_measures.sh
download("bea_test_heptabot_{}_{}.zip".format(os.environ["HPT_MODEL_TYPE"], os.environ["MODEL_PLACE"]))
print("All tests finished.")

Evaluating heptabot "medium" model on architecture "tpu"

Test 1. Correction task, running time and memory usage
Preparing texts for TPU model inference
100% 40/40 [00:00<00:00, 273.55it/s]
Processing TPU model outputs
100% 40/40 [00:32<00:00,  1.22it/s]
RAM used: 2.193 GiB
Time elapsed: 1 minutes 23 seconds
Average time/text: 2.075 secs
Average time/symbol: 1.66 ms
Note that for TPU RAM usage is not so relevant and elapsed time included system startup unlike in CPU and GPU tests.

Test 2. Competition scores
Getting JFLEG from https://github.com/keisks/jfleg
Getting CONLL-14 test set from https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz, M2-scorer from https://www.comp.nus.edu.sg/~nlp/sw/m2scorer.tar.gz
Getting BEA-2019 test set from https://www.cl.cam.ac.uk/research/nl/bea2019st/data/ABCN.test.bea19.orig
Preparing input for heptabot...
Processing files...
Preparing texts for TPU model inference
100% 3/3 [00:11<00:00,  3.75s/it]
Processing TPU model outputs
100% 3/

All tests finished.
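
The averages reported in Test 1 follow directly from the elapsed time and the size of the test set. As a quick arithmetic check, using the 40 texts, the 50000-symbol total and the 1 minute 23 seconds shown above:

```python
texts = 40
total_symbols = 50000
elapsed_seconds = 1 * 60 + 23  # "1 minutes 23 seconds" from the log above

time_per_text = elapsed_seconds / texts
time_per_symbol_ms = elapsed_seconds / total_symbols * 1000

print(round(time_per_text, 3))       # 2.075
print(round(time_per_symbol_ms, 2))  # 1.66
```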
