
<a href="https://colab.research.google.com/github/lcl-hse/heptabot/blob/tensorflow/notebooks/Run_heptabot_medium_model_on_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run heptabot `medium` model on TPU

This notebook lets you process data with our `medium` model in a Google Colab TPU environment, which provides the highest speed and makes it possible to process large amounts of data.

As Colab has recently switched to Python 3.7 and our dependency `spaCy 1.9.0` supports only Python 3.6, we use `mamba` to ensure that we get the right packages in our environment. To get `mamba`, you should execute the following cell (click the '▷' button). Please note that the runtime will restart after that, so don't schedule the rest of the cells to execute just yet.

In [1]:
!pip install -q condacolab==0.1.1
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:42
🔁 Restarting kernel...


After your runtime has restarted, execute the following cell to set some environment variables:



In [1]:
import os

model_type = "medium"
# The steps are largely the same for the medium and xxl models.
# However, we keep this branch, as it is advantageous to run the medium model in Google Colab and the xxl model in Kaggle, and these environments differ.

os.environ["MODEL_PLACE"] = "tpu"
os.environ["HPT_MODEL_TYPE"] = model_type

if model_type == "xxl":
    os.environ["CHECKPOINT_STEP"] = "1014000"
    os.environ["TPU_TOPOLOGY"] = "v3-8"
else:
    os.environ["CHECKPOINT_STEP"] = "1003800"
    os.environ["TPU_TOPOLOGY"] = "v2-8"
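The branching above can also be written as a lookup table, so that a mistyped `model_type` fails loudly instead of silently falling back to the `medium` settings. This is only a sketch: `MODEL_SETTINGS` and `settings_for` are hypothetical names that are not part of `heptabot`, while the checkpoint steps and TPU topologies are copied from the cell above.

```python
# Hypothetical helper: the values come from the cell above, the names are illustrative.
MODEL_SETTINGS = {
    "medium": {"CHECKPOINT_STEP": "1003800", "TPU_TOPOLOGY": "v2-8"},
    "xxl": {"CHECKPOINT_STEP": "1014000", "TPU_TOPOLOGY": "v3-8"},
}


def settings_for(model_type):
    """Return the environment variables for a model type, or raise on a typo."""
    try:
        specific = MODEL_SETTINGS[model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {model_type!r}") from None
    # Merge the shared variables with the model-specific ones.
    return {"MODEL_PLACE": "tpu", "HPT_MODEL_TYPE": model_type, **specific}


env = settings_for("medium")
print(env)
# To apply the settings in the notebook: os.environ.update(env)
```

Unlike the original `if`/`else`, this version rejects any value other than `medium` or `xxl`.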

## Prepare environment

Now click the '▷' on this group of cells. The code below will install the environment for `heptabot` and needs around 10 minutes to execute.

In [2]:
!pip install -qq t5==0.9.0 seqio rouge_score sacrebleu sentencepiece


In [3]:
!git clone -q https://github.com/lcl-hse/heptabot -b tensorflow
%cd heptabot
!mv scripts/colab-run/* .
!mv scripts/tpu-run/* .
!chmod +x colab_run.sh
!chmod +x tpu_run.sh

/content/heptabot


In [4]:
!time bash colab_setup.sh

Initializing virtual environment with python 3.6.9
  Package             Version  Build               Channel                    Size
────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────

  _libgcc_mutex           0.1  conda_forge         conda-forge/linux-64     Cached
  _openmp_mutex           4.5  1_gnu               conda-forge/linux-64     Cached
  ca-certificates   2021.5.30  ha878542_0          conda-forge/linux-64     136 KB
  certifi           2021.5.30  py36h5fab9bb_0      conda-forge/linux-64     141 KB
  ld_impl_linux-64     2.36.1  hea4e1c9_1          conda-forge/linux-64     668 KB
  libffi                3.2.1  he1b5a44_1007       conda-forge/linux-64      47 KB
  libgcc-ng             9.3.0  h2828fa1_19         conda-forge/linux-64       8 MB
  libgom

In [5]:
!mkdir output
!cp -r static output
!mkdir raw

In [6]:
import subprocess
from time import sleep

subprocess.Popen(["/bin/bash", os.path.join(os.path.realpath("."), "colab_run.sh")])
sleep(70)

In [7]:
import re
import os
import pickle
import IPython

## Check the installation

The following cell is designed to check if the preparations went through correctly. 

In [8]:
#@markdown ### Environment check

test = !lsof | grep 9090
if len(test) > 6:
  print('\x1b[1mEverything seems to be OK!\x1b[0m')
else:
  print('\x1b[1;31mSeems like something went wrong.\nTry waiting for a couple minutes and re-run this cell. If the problem persists, click Runtime ➔ Factory reset runtime ➔ YES and redo all the steps.\x1b[0m')

Everything seems to be OK!
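If `lsof` is unavailable, a rough alternative is to poll the port directly. The sketch below assumes that the service started by `colab_run.sh` accepts TCP connections on local port 9090; the notebook only shows that `lsof` output is searched for `9090`, so treat this as an assumption. The helper names `port_is_open` and `wait_for_port` are hypothetical, and the check does not reproduce the `len(test) > 6` heuristic used above.

```python
import socket
import time


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(port, attempts=30, delay=2.0):
    """Poll the port until it opens or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(port):
            return True
        time.sleep(delay)
    return False


# Assumption: the background service listens on TCP port 9090 on this machine.
# print(wait_for_port(9090))
```

Polling like this could also stand in for a fixed `sleep`, since it returns as soon as the port accepts connections.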


## Get the texts

The textual data is downloaded in this part. Here we use 3 essays from [REALEC](https://realec.org/) as example data; you should, however, change this part to process the texts you need.

In [9]:
!mkdir input

!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=2015_KT_12_2&extension=txt&protocol=1" -O ./input/KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=2014_ESha_2_1&extension=txt&protocol=1" -O ./input/ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=2016_LKa_2_2&extension=txt&protocol=1" -O ./input/LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(os.path.join("input", f), "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()
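When you replace the example essays with your own data, listing every filename by hand gets tedious. A minimal sketch, assuming all your texts are UTF-8 `.txt` files placed directly in one folder (`read_texts` is a hypothetical helper, not part of `heptabot`):

```python
from pathlib import Path


def read_texts(folder="input"):
    """Map each .txt file's name (without extension) to its contents."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }


# In the notebook this would replace the explicit file list:
# textdict = read_texts("input")
```

The keys match what `f[:-4]` produces in the cell above, so the rest of the notebook works unchanged.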

**Important**: If you got here from the error page on `heptabot` website stating "*In order to maintain server resources and stable uptime, we limit the amounts of data that can be processed via our Web interface*", uncomment the following code (remove all the number signs) and upload the `generated.txt` file you got from our website:

In [None]:
#from google.colab import files
#files.upload()

#textdict = {}

#with open("generated.txt", "r", encoding="utf-8") as infile:
  #textdict["generated"] = infile.read()

In other cases, we recommend putting your files into the **`input`** folder for clarity.

Put all your texts in a `dict` named `texts`, where the keys are `str` text IDs (preferably filenames without the extension) and the values are the texts themselves, also stored as `str`, like so:

In [10]:
texts = textdict

assert all(type(k) is str for k in texts.keys())
assert all(type(v) is str for v in texts.values())
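A bare `assert` does not say which entry is wrong. The following sketch reports the offending key instead; `validate_texts` is a hypothetical helper, and it uses `isinstance`, which is slightly more permissive than the `type(...) is str` check above because it also accepts subclasses of `str`.

```python
def validate_texts(texts):
    """Raise TypeError naming the first entry that is not a str-to-str pair."""
    for key, value in texts.items():
        if not isinstance(key, str):
            raise TypeError(f"Text ID {key!r} is not a str")
        if not isinstance(value, str):
            raise TypeError(
                f"Text for ID {key!r} is {type(value).__name__}, not str"
            )
    return texts


checked = validate_texts({"essay_1": "Some text.", "essay_2": "More text."})
print(sorted(checked))
```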

## Process data with `heptabot`

The actual `heptabot` magic is performed here!

**Important**: please choose the appropriate task type in the following cell. `correction`, the default, is used to correct whole essays, and it is the only task whose pipeline incorporates the error classification subroutine. You may also want to perform sentencewise correction; in this case, choose one of the identifiers of the relevant GEC tasks. `jfleg` (trained on JFLEG data) is for general sentencewise correction and should provide more diverse results, while `conll` (trained on the CoNLL-2014 competition data) and `bea` (trained on the BEA-2019 competition data) mainly correct grammar-related errors, so their pipelines append grammar parsing data to each sentence. Please note that `heptabot` expects whole paragraphs of text for `correction` and sentence-by-sentence data for the other tasks, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.

In [11]:
task_type = "correction"  #@param ["correction", "jfleg", "conll", "bea"] 

In [12]:
import random
chosen_one = random.choice(list(texts.keys()))

print(texts[chosen_one])

The bar chart illustrates information about the percentage between men and women at levels of post-school skills in Australia in the duration of 1999. 
It is noticable that the figures in males who skilled vocational diploma was the highest and made up about 90%. The lowest persantage in men was the undergraduate diploma and came to approximately 35%. In terms of Bachelor's degree, postgraduate diploma and Master's degree in males the figures fell on 47%, 70% and 60% respectively. 
The figures changed in undergraduate diplomas, in this section, women prevailed at 70% in comparison with rest categories. The figures females who have a Bachelor's degree was about 55%. Skilled vocational diploma has a lowest popular in women, it made up only 10%. The percentage females with postgraduate diploma and Master's degree was 30 and 40 respectively. Women figures instead were significant lower almost the half comparably to the men at 30% and 40%.


In [13]:
pickledata = (task_type, texts)

with open("./raw/process_texts.pkl", "wb") as outpickle:
  pickle.dump(pickledata, outpickle)
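To confirm what was written, the pickle can be loaded back: it holds a single `(task_type, texts)` tuple. The sketch below demonstrates the round trip on a temporary file with made-up data; `roundtrip` is a hypothetical helper, and how `batchify_input.py` actually consumes the file is not shown in this notebook.

```python
import os
import pickle
import tempfile


def roundtrip(task_type, texts, path):
    """Write (task_type, texts) to path and return what is read back."""
    with open(path, "wb") as outpickle:
        pickle.dump((task_type, texts), outpickle)
    with open(path, "rb") as inpickle:
        return pickle.load(inpickle)


# Made-up data in a temporary folder, so the real ./raw folder is untouched.
with tempfile.TemporaryDirectory() as tmp:
    loaded_task, loaded_texts = roundtrip(
        "correction", {"demo": "A short text."}, os.path.join(tmp, "demo.pkl")
    )

print(loaded_task, list(loaded_texts))
```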

To take advantage of the TPU, we split the process into three parts. First, we prepare our texts by splitting them into the batches required by `heptabot`:

In [14]:
%%bash
source activate heptabot
python batchify_input.py

Preparing texts for TPU model inference


  0%|          | 0/3 [00:00<?, ?it/s]100%|██████████| 3/3 [00:00<00:00, 155.53it/s]


Then we call the TPU process (this is where our `medium` model runs inference on the texts). Please note that `mesh-tensorflow` TPU processes tend to produce a lot of logging output. We omitted it from this GitHub notebook but left it enabled in the actual code cell so that you can check that the process is running as intended. There is, however, an option to discard this output entirely: set the `SUPPRESS_OUTPUT` variable to `True` and wait for the cell below to finish executing.

In [None]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

SUPPRESS_OUTPUT = False  #@param {"type": "boolean"}

if SUPPRESS_OUTPUT:
  !bash tpu_run.sh 1>/dev/null 2>&1
else:
  !bash tpu_run.sh
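Outside of IPython `!` commands, the same effect (discarding both standard output and standard error) can be had with `subprocess`. This is a sketch only: `run_quietly` is a hypothetical helper, and the command used here is a harmless stand-in rather than `tpu_run.sh`.

```python
import subprocess
import sys


def run_quietly(command):
    """Run a command with stdout and stderr discarded; return its exit code."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # same idea as 1>/dev/null
        stderr=subprocess.STDOUT,   # send stderr wherever stdout goes
    )
    return completed.returncode


# Stand-in command; in the notebook this would be ["bash", "tpu_run.sh"].
code = run_quietly(
    [sys.executable, "-c", "import sys; print('noisy'); sys.exit(3)"]
)
print(code)
```

Returning the exit code keeps failures detectable even though the logs are hidden.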

And then, finally, we glue our processed texts back together to produce the outputs:

In [17]:
%%bash
source activate heptabot
python process_output.py

Processing TPU model outputs


  0%|          | 0/3 [00:00<?, ?it/s] 33%|███▎      | 1/3 [00:01<00:03,  1.57s/it] 67%|██████▋   | 2/3 [00:02<00:01,  1.06s/it]100%|██████████| 3/3 [00:04<00:00,  1.48s/it]100%|██████████| 3/3 [00:04<00:00,  1.42s/it]


## Display the results

Finally, in this section you can display the processed results.

In [18]:
#@markdown This cell hides a function to make pretty displaying work
def prepare_display(filekey):
  template = """<html><head>
	<meta charset="utf-8">
	<meta content="IE=edge" http-equiv="X-UA-Compatible">
	<meta content="width=device-width, initial-scale=1" name="viewport">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
	<link href="https://getbootstrap.com/docs/3.3/dist/css/bootstrap.min.css" rel="stylesheet"><!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
	<link href="https://getbootstrap.com/docs/3.3/assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet"><!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
	<link href="https://fonts.googleapis.com/css2?family=Kanit&family=Mukta&family=PT+Sans&family=PT+Serif&family=Ubuntu+Mono&display=swap" rel="stylesheet">
<style>
{0}
</style>
<script type="text/javascript">
{1}
</script>
</head>
<body>
<div class="header2">{2}</div><br>
{3}
</body></html>"""

  # Read everything as UTF-8 explicitly, matching how display.html is written below
  with open("static/result/style.css", "r", encoding="utf-8") as incss:
    style = incss.read()
  with open("static/result/engine.js", "r", encoding="utf-8") as injs:
    script = injs.read().replace("var em;", "var em=18;").replace("elemtitle.style.top = (rect.top - prect.top) + 'px';", "elemtitle.style.top = (rect.top - prect.top) + 6 + 'px';")
  with open(os.path.join('./output', filekey + ".html"), "r", encoding="utf-8") as inhtml:
    htmlcont = inhtml.read()
  tt = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL).group(1)
  result_div = re.search(r'<div id="resulta".*?\n', htmlcont).group(0)
  outcont = template.format(style, script, tt, result_div)
  with open("display.html", "w", encoding="utf-8") as outhtml:
    outhtml.write(outcont)
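The two regular expressions in `prepare_display` are easiest to understand on a small input. The sketch below applies them to a made-up HTML snippet (the real files in `output` are more complex) and raises a readable error instead of an `AttributeError` when a part is missing; `extract_parts` is a hypothetical helper.

```python
import re


def extract_parts(htmlcont):
    """Return (header text, result <div> line) from a heptabot-style page."""
    # DOTALL lets the header span several lines; the lazy .*? stops at the first </div>.
    header = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL)
    # Without DOTALL, . stops at a newline, so this grabs one whole line.
    result = re.search(r'<div id="resulta".*?\n', htmlcont)
    if header is None or result is None:
        raise ValueError("Expected header2 and resulta <div> elements")
    return header.group(1), result.group(0)


# Made-up sample, for illustration only.
sample = (
    '<html><body>\n'
    '<div class="header2">Essay\ntitle</div><br>\n'
    '<div id="resulta" class="x">corrected text</div>\n'
    '</body></html>'
)
title, result_div = extract_parts(sample)
print(repr(title))
print(repr(result_div))
```

Note that the second pattern relies on the whole result `<div>` sitting on a single line of the output file.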

In [19]:
#@markdown Enter the desired text ID below to pretty-print the result
display_id = chosen_one  #@param {type: "string"}

prepare_display(display_id)
IPython.display.HTML(filename='display.html')

## Download the results

Now, you may also want to get the texts processed by `heptabot`. The code below packs the results into a zip archive and downloads it directly to your computer: unzip it to view the results as they would be displayed in the web version. With Colab, you can also easily save the resulting folder to your Google Drive.

In [20]:
!zip -q heptabot_processed.zip -r output
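If the `zip` command is not available in your environment, the archive can also be built from Python. A minimal sketch using the standard library (`zip_folder` is a hypothetical helper); like `zip -r`, it keeps the folder name as the top-level entry in the archive:

```python
import shutil
from pathlib import Path


def zip_folder(folder, archive_name):
    """Create <archive_name>.zip containing the folder itself; return its path."""
    folder = Path(folder).resolve()
    return shutil.make_archive(
        archive_name,
        "zip",
        root_dir=folder.parent,  # archive paths are relative to this folder
        base_dir=folder.name,    # so entries start with e.g. "output/"
    )


# In the notebook: zip_folder("output", "heptabot_processed")
```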

In [None]:
from google.colab import files

files.download("heptabot_processed.zip")
