
<a href="https://colab.research.google.com/github/lcl-hse/heptabot/blob/master/notebooks/Use_in_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use heptabot in Google Colab

With this notebook you can process data with `heptabot` in Google Colab and get the same results as you would from the web version. It is currently the only way to process large batches of texts, but it works for any amount of data.

## Prepare environment

The code below sets up everything `heptabot` needs to run correctly. It takes around 10 minutes to execute.

In [1]:
import tensorflow as tf

sess = tf.compat.v1.Session()

In [2]:
%%time

!git clone -q https://github.com/lcl-hse/heptabot
%cd heptabot

!pip install -q -r conda_requirements.txt

!sed -i "s/tensorflow-gpu==2.3.0//g" requirements.txt
!pip install -q -r requirements.txt

/content/heptabot

In [3]:
%%time

!python -c 'import nltk; nltk.download("punkt")' 1>/dev/null 2>/dev/null
!python -m spacy download -d en_core_web_sm-1.2.0 1>/dev/null 2>/dev/null
!python -m spacy link en_core_web_sm en 1>/dev/null 2>/dev/null

!wget -q --show-progress https://storage.googleapis.com/ml-bucket-isikus/cbmodel/err_type_classifier.cbm -P ./models
!mkdir ./models/savemodel
!wget -q --show-progress https://storage.googleapis.com/ml-bucket-isikus/t5-base-model/models/base-basedrei/export/1599625548/saved_model.pb -P ./models/savemodel
!mkdir ./models/savemodel/variables
!wget -q --show-progress https://storage.googleapis.com/ml-bucket-isikus/t5-base-model/models/base-basedrei/export/1599625548/variables/variables.data-00000-of-00002 -P ./models/savemodel/variables
!wget -q --show-progress https://storage.googleapis.com/ml-bucket-isikus/t5-base-model/models/base-basedrei/export/1599625548/variables/variables.data-00001-of-00002 -P ./models/savemodel/variables
!wget -q --show-progress https://storage.googleapis.com/ml-bucket-isikus/t5-base-model/models/base-basedrei/export/1599625548/variables/variables.index -P ./models/savemodel/variables

CPU times: user 137 ms, sys: 151 ms, total: 288 ms
Wall time: 48.5 s


In [4]:
from models import batchify, process_batch, result_to_div

100%|██████████| 245M/245M [00:29<00:00, 8.34MB/s]

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.

INFO:tensorflow:Restoring parameters from models/savemodel/variables/variables


In [5]:
import re
import os
import IPython
from tqdm.notebook import tqdm

## Get text data

This part downloads the texts. This version uses example data (3 essays from [REALEC](https://realec.org/)); change this part to load whatever data you need.

In [6]:
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=KT_12_2&extension=txt&protocol=1" -O KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=ESha_2_1&extension=txt&protocol=1" -O ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=LKa_2_2&extension=txt&protocol=1" -O LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(f, "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()

**Important**: If you got here from the error page on the `heptabot` website stating *In order to maintain server resources and stable uptime, we limit the amounts of data that can be processed via our Web interface*, uncomment the following code (remove all the number signs) and upload the `generated.txt` file you got from our website:

In [None]:
#from google.colab import files
#files.upload()

#textdict = {}

#with open("generated.txt", "r", encoding="utf-8") as infile:
  #textdict["generated"] = infile.read()

Put all your texts in a `dict` named `texts`, where the keys are `str` text IDs (preferably filenames without extension) and the values are the texts themselves, also stored as `str`, like this:

In [7]:
texts = textdict

assert all(type(k) is str for k in texts.keys())
assert all(type(v) is str for v in texts.values())
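
If your own essays are stored as plain-text files in a folder, a dictionary of the required shape could be assembled roughly as follows. This is a minimal sketch: the helper name `load_texts` is hypothetical and is not part of `heptabot`.

```python
from pathlib import Path


def load_texts(folder):
    """Read every .txt file in `folder` into a {text ID: content} dict.

    Hypothetical helper for illustration; not part of heptabot.
    """
    loaded = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # The file name without its extension serves as the text ID.
        loaded[path.stem] = path.read_text(encoding="utf-8")
    return loaded
```

Calling `load_texts("my_essays")` (the folder name is an assumption) would then return a value that can be assigned to `texts` and passes the two assertions above.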

## Process data with `heptabot`

The actual processing is here!

**Important**: please choose the appropriate task type in the following cell. `correction`, the default, corrects whole essays, and it is the only pipeline that includes error classification. If you want sentence-by-sentence correction instead, choose one of the identifiers of the relevant GEC tasks: `jfleg` (trained on JFLEG data) performs general sentence-wise correction and should provide more diverse results, while `conll` (trained on CoNLL-2014 competition data) and `bea` (trained on BEA-2019 competition data) correct only grammar errors; for these two, the corresponding pipeline adds sentence parsing to each sentence. Please note that `heptabot` expects whole bodies of text as single pieces of data for `correction` and sentence-by-sentence structured data otherwise, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.

In [8]:
task_type = "correction"  #@param ["correction", "jfleg", "conll", "bea"] 
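
For the sentence-wise tasks, running text first has to be reshaped into one sentence per line. The sketch below shows one naive way to do that with a regular expression. The helper `to_sentence_lines` is hypothetical, not part of `heptabot`, and it will split abbreviations such as "e.g." incorrectly; since this notebook already downloads the NLTK `punkt` data, `nltk.sent_tokenize` is likely a more robust choice.

```python
import re


def to_sentence_lines(text):
    """Naively reshape running text into one sentence per line.

    Hypothetical helper for illustration; not part of heptabot.
    """
    # Collapse all whitespace (including existing newlines) into single spaces.
    flat = " ".join(text.split())
    # Split wherever '.', '!' or '?' is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return "\n".join(s for s in sentences if s)


sample = "I has a cat. It are black!\nIs it yours?"
print(to_sentence_lines(sample))
```

The printed result has three lines, one per sentence, which is the layout the `jfleg`, `conll` and `bea` tasks expect.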

In [11]:
import random
chosen_one = random.choice(list(texts.keys()))

print(texts[chosen_one])

Some people think that social media in the Internet following purpose like give some information to people, but other people think that Facebook, Vkontakte and other media in the Internet just help people entertain. 
People with the first idea may be true because Facebook and Vkontakte have many groups which showing differents news and have many comments about it. They presenting much advertising about new-opens cafe and lectures which soon are going happening in ypur city. Also, we can get known about lastly new booksor films, sometimes we can research texts of some objects and read it ourselves. On these sites we can see all information about people whose we know or just famous people. Many funats use these resurse that know what like and what doing their lovely stars in simple life. They can chatting with people who is unvalable but wont that other people get known what they feel or think about something. 
However, many people don't use social media for take or get some information.

First we prepare the data (batchify the texts and separate the delimiters). This should not take long.

In [12]:
prepared_data = {}

for textid in tqdm(texts):
    batches, delims = batchify(texts[textid], task_type)    
    prepared_data[textid] = (batches, delims)





In [13]:
!mkdir -p output
!cp -r static output

with open("./templates/result.html", "r") as inres:
  outhtml = inres.read()

outhtml = outhtml.replace("{{ which_font }}", "{0}").replace("{{ response }}", "{1}").replace("{{ task_type }}", "{2}")
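
The cell above turns the template's named placeholders into positional `str.format` fields. The sketch below illustrates the same technique on a toy template. The toy markup is an assumption made for illustration and does not reproduce the real `templates/result.html`.

```python
# Toy stand-in for templates/result.html (assumed markup, for illustration only).
toy_template = (
    '<div class="header2">{{ task_type }}</div>'
    '<div style="{{ which_font }}">{{ response }}</div>'
)

# Same substitution as in the cell above: named placeholders become positional fields.
toy_format = (
    toy_template.replace("{{ which_font }}", "{0}")
    .replace("{{ response }}", "{1}")
    .replace("{{ task_type }}", "{2}")
)

# Fill the fields in the order (which_font, response, task_type).
page = toy_format.format("font-family: Ubuntu Mono;", "<p>corrected text</p>", "sentences")
print(page)
```

Note that this approach only works if the template contains no other literal curly braces, because `str.format` would try to interpret them as replacement fields.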

And here we finally run the correction routine. Please note that the processed texts are written to the `output` directory as HTML files, with their ID (the key in the `texts` dictionary) as the filename.

In [14]:
processed_texts = {}

which_font = "" if task_type == "correction" else "font-family: Ubuntu Mono; letter-spacing: -0.5px;"
task_str = "text" if task_type == "correction" else "sentences"

for textid in tqdm(prepared_data):
    batches, delims = prepared_data[textid]
    processed = []

    if task_type != "correction":
      print("Processing text with ID", textid)
      for batch in tqdm(batches):
          processed.append(process_batch(batch))

    else:
      for batch in batches:
          processed.append(process_batch(batch))

    plist = [item for subl in processed for item in subl] 
    response = result_to_div(texts[textid], plist, delims, task_type)

    proc_html = outhtml.format(which_font, response, task_str)

    with open(os.path.join("output", textid+".html"), "w", encoding="utf-8") as outfile:
        outfile.write(proc_html)





## Display the results

Here you may display the processed html results!

In [15]:
#@markdown This cell hides a function to make pretty displaying work
def prepare_display(filekey):
  template = """<html><head>
	<meta charset="utf-8">
	<meta content="IE=edge" http-equiv="X-UA-Compatible">
	<meta content="width=device-width, initial-scale=1" name="viewport">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
	<link href="https://getbootstrap.com/docs/3.3/dist/css/bootstrap.min.css" rel="stylesheet"><!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
	<link href="https://getbootstrap.com/docs/3.3/assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet"><!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
	<link href="https://fonts.googleapis.com/css2?family=Kanit&family=Mukta&family=PT+Sans&family=PT+Serif&family=Ubuntu+Mono&display=swap" rel="stylesheet">
<style>
{0}
</style>
<script type="text/javascript">
{1}
</script>
</head>
<body>
<div class="header2">{2}</div><br>
{3}
</body></html>"""

  with open("static/result/style.css", "r") as inhtml:
    style = inhtml.read()
  with open("static/result/engine.js", "r") as inhtml:
    script = inhtml.read().replace("var em;", "var em=18;").replace("elemtitle.style.top = (rect.top - prect.top) + 'px';", "elemtitle.style.top = (rect.top - prect.top) + 6 + 'px';")
  with open(os.path.join('./output', filekey + ".html"), "r") as inhtml:
    htmlcont = inhtml.read()
  tt = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL).group(1)
  result_div = re.search(r'<div id="resulta".*?\n', htmlcont).group(0)
  outcont = template.format(style, script, tt, result_div)
  with open("display.html", "w", encoding="utf-8") as outhtml:
    outhtml.write(outcont)

In [16]:
#@markdown Enter the desired text ID below to pretty-print the result
display_id = chosen_one  #@param {type: "string"}

prepare_display(display_id)
IPython.display.HTML(filename='display.html')

## Download the results

Once you have finished displaying texts, you may want to download your results as an archive. The code below downloads the processed texts directly to your computer; unzip the archive to view the results as intended. Alternatively, you may save the resulting folder to your Google Drive.

In [17]:
!zip -q heptabot_processed.zip -r output

In [18]:
from google.colab import files

files.download("heptabot_processed.zip")
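
To keep the results in Google Drive instead, the `output` folder can be copied to a mounted Drive path. The following is a minimal sketch: `copy_results` is a hypothetical helper that is not part of `heptabot`, and it relies on `shutil.copytree` with `dirs_exist_ok`, which requires Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def copy_results(src_dir, dst_dir):
    """Copy the results folder to `dst_dir`, creating parent folders if needed.

    Hypothetical helper for illustration; not part of heptabot.
    """
    dst = Path(dst_dir)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True lets a repeated run overwrite an earlier copy.
    shutil.copytree(src_dir, dst, dirs_exist_ok=True)
    return dst
```

In Colab, you would first mount your Drive with `from google.colab import drive` and `drive.mount("/content/drive")`, and then call something like `copy_results("output", "/content/drive/MyDrive/heptabot_output")`. The target folder name is an assumption, and the exact Drive path may differ in your environment.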
