
<a href="https://colab.research.google.com/github/lcl-hse/heptabot/blob/master/notebooks/Use_in_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use heptabot in Google Colab

This notebook lets you process data with `heptabot` in Google Colab and get the same results as you would from the web version. To avoid overloading our servers, we currently suggest this method for processing large amounts of text; however, it can be used for any amount of data.

As Colab has recently switched to Python 3.7 and our dependency `spaCy 1.9.0` supports only Python 3.6, we use `mamba` to ensure that we get the right packages in our environment. To get `mamba`, you should execute the following cell (click the '▷' button). Please note that the runtime will restart after that, so don't schedule the rest of the cells to execute just yet.

In [1]:
!pip install -q condacolab==0.1.1
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:40
🔁 Restarting kernel...


## Prepare environment

Now click the '▷' on this group of cells. The code below will install the environment for `heptabot` and needs around 10 minutes to execute.

In [1]:
!git clone -q https://github.com/lcl-hse/heptabot
%cd heptabot
!mv colab-scripts/* .
!chmod +x colab_run.sh

/content/heptabot


In [2]:
!time bash colab_setup.sh

Initializing virtual environment with python 3.6.9
  Package             Version  Build               Channel                    Size
────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────

  _libgcc_mutex           0.1  conda_forge         conda-forge/linux-64     Cached
  _openmp_mutex           4.5  1_gnu               conda-forge/linux-64     Cached
  ca-certificates   2020.12.5  ha878542_0          conda-forge/linux-64     Cached
  certifi           2020.12.5  py36h5fab9bb_1      conda-forge/linux-64     143 KB
  ld_impl_linux-64     2.35.1  hea4e1c9_2          conda-forge/linux-64     618 KB
  libffi                3.2.1  he1b5a44_1007       conda-forge/linux-64      47 KB
  libgcc-ng             9.3.0  h2828fa1_18         conda-forge/linux-64       8 MB

In [3]:
!mkdir output
!cp -r static output

In [4]:
import subprocess
from time import sleep

subprocess.Popen(["/bin/bash", "/content/heptabot/colab_run.sh"])
sleep(70)
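
The fixed 70-second pause above gives the background process time to start. As an alternative, here is a minimal sketch that polls a TCP port until it accepts connections. It assumes the service started by `colab_run.sh` listens on local port 9090, which is only inferred from the `lsof | grep 9090` environment check later in this notebook; `wait_for_port` is a hypothetical helper, not part of `heptabot`.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=180.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection timed out): wait and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port(9090)` after the `subprocess.Popen(...)` line could then stand in for `sleep(70)`, under the port assumption stated above.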

In [5]:
import re
import os
import pickle
import IPython

## Check the installation

The following cell checks whether the preparations completed correctly.

In [7]:
#@markdown ### Environment check

test = !lsof | grep 9090
if len(test) > 6:
  print('\x1b[1mEverything seems to be OK!\x1b[0m')
else:
  print('\x1b[1;31mSeems like something went wrong.\nTry waiting for a couple minutes and re-run this cell. If the problem persists, click Runtime ➔ Factory reset runtime ➔ YES and redo all the steps.\x1b[0m')

Everything seems to be OK!


## Get the texts

The textual data is downloaded in this part. Here we use 3 essays from [REALEC](https://realec.org/) as example data; you should, however, change this part to process the texts you need.

In [None]:
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2015%2F&document=KT_12_2&extension=txt&protocol=1" -O KT_12_2.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2014%2F&document=ESha_2_1&extension=txt&protocol=1" -O ESha_2_1.txt
!wget -q "https://realec.org/ajax.cgi?action=downloadFile&collection=%2Fexam%2FExam2016%2F&document=LKa_2_2&extension=txt&protocol=1" -O LKa_2_2.txt

files = ["KT_12_2.txt", "ESha_2_1.txt", "LKa_2_2.txt"]
textdict = {}

for f in files:
  with open(f, "r", encoding="utf-8") as infile:
    textdict[f[:-4]] = infile.read()
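
If your own texts are already uploaded to a folder in the Colab runtime, the loop above can be generalized. The following is a sketch under that assumption; `load_texts` and the folder name in the usage note are hypothetical and not part of `heptabot`.

```python
from pathlib import Path


def load_texts(folder, pattern="*.txt"):
    """Map each matching file's name (without extension) to its UTF-8 contents."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob(pattern))
        if path.is_file()
    }
```

For example, `textdict = load_texts("my_essays")` would produce the same kind of dictionary as the loop above, keyed by filename without the `.txt` extension.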

**Important**: If you got here from the error page on the `heptabot` website stating "*In order to maintain server resources and stable uptime, we limit the amounts of data that can be processed via our Web interface*", uncomment the following code (remove all the number signs) and upload the `generated.txt` file you got from our website:

In [None]:
#from google.colab import files
#files.upload()

#textdict = {}

#with open("generated.txt", "r", encoding="utf-8") as infile:
  #textdict["generated"] = infile.read()

Put all your texts in a `dict` named `texts`, where the keys are `str`s with text IDs (preferably filenames without extension) and the values are the texts themselves, also stored as `str`s, as such:

In [None]:
texts = textdict

assert all(type(k) is str for k in texts.keys())
assert all(type(v) is str for v in texts.values())

## Process data with `heptabot`

The actual `heptabot` magic is performed here!

**Important**: please choose the appropriate task type in the following cell. The default, `correction`, corrects whole essays and is the only pipeline that incorporates the error classification subroutine. You may also want to perform sentencewise correction; in that case, choose one of the relevant GEC task identifiers: `jfleg` (trained on JFLEG data) is for general sentencewise correction and should provide more diverse results, while `conll` (trained on CONLL-14 competition data) and `bea` (trained on BEA-2019 competition data) mainly correct grammar-related errors, and their pipelines append grammar parsing data to each sentence. Please note that `heptabot` expects whole paragraphs of text for `correction` and sentence-by-sentence data for the other tasks, so make sure your file(s) contain single sentences separated by newlines if you wish to perform any task other than `correction`.

In [None]:
task_type = "correction"  #@param ["correction", "jfleg", "conll", "bea"] 
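
For the sentencewise tasks, texts need one sentence per line. The sketch below shows one rough way to reshape running text into that layout. It uses a deliberately naive regular expression rather than a real sentence tokenizer, and `to_sentence_lines` is a hypothetical helper, so review its output before relying on it.

```python
import re


def to_sentence_lines(text):
    """Naively split text into sentences and join them with newlines."""
    # Split on whitespace that follows '.', '!' or '?'. Abbreviations such as
    # "e.g." will be split incorrectly, so check the result on real data.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(part for part in parts if part)
```

If you choose `jfleg`, `conll` or `bea`, something like `texts = {k: to_sentence_lines(v) for k, v in texts.items()}` could be applied before processing.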

In [None]:
import random
chosen_one = random.choice(list(texts.keys()))

print(texts[chosen_one])

Some people think that social media in the Internet following purpose like give some information to people, but other people think that Facebook, Vkontakte and other media in the Internet just help people entertain. 
People with the first idea may be true because Facebook and Vkontakte have many groups which showing differents news and have many comments about it. They presenting much advertising about new-opens cafe and lectures which soon are going happening in ypur city. Also, we can get known about lastly new booksor films, sometimes we can research texts of some objects and read it ourselves. On these sites we can see all information about people whose we know or just famous people. Many funats use these resurse that know what like and what doing their lovely stars in simple life. They can chatting with people who is unvalable but wont that other people get known what they feel or think about something. 
However, many people don't use social media for take or get some information.

In [None]:
pickledata = (task_type, texts)

with open("process.pkl", "wb") as outpickle:
  pickle.dump(pickledata, outpickle)
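
As an optional sanity check, the pickled payload can be read back before processing starts. This is a minimal sketch: `read_payload` is a hypothetical helper, and the set of accepted task names simply mirrors the options offered in the task type cell above.

```python
import pickle

# Task identifiers offered by the task type selector in this notebook.
KNOWN_TASKS = {"correction", "jfleg", "conll", "bea"}


def read_payload(path="process.pkl"):
    """Load the (task_type, texts) tuple and check its basic shape."""
    with open(path, "rb") as inpickle:
        task, data = pickle.load(inpickle)
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task type: {task!r}")
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in data.items()):
        raise TypeError("text IDs and texts must all be str")
    return task, data
```

Running `read_payload()` right after the cell above should return the same task type and dictionary that were just written.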

The actual processing is called by the script from the following cell, as it needs to be done in a virtual environment:

In [None]:
!bash colab_execute.sh

100% 3/3 [00:50<00:00, 16.99s/it]


## Display the results

Finally, in this section you can display the processed results.

In [None]:
#@markdown This cell hides a function to make pretty displaying work
def prepare_display(filekey):
  template = """<html><head>
	<meta charset="utf-8">
	<meta content="IE=edge" http-equiv="X-UA-Compatible">
	<meta content="width=device-width, initial-scale=1" name="viewport">
	<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
	<link href="https://getbootstrap.com/docs/3.3/dist/css/bootstrap.min.css" rel="stylesheet"><!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
	<link href="https://getbootstrap.com/docs/3.3/assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet"><!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
	<link href="https://fonts.googleapis.com/css2?family=Kanit&family=Mukta&family=PT+Sans&family=PT+Serif&family=Ubuntu+Mono&display=swap" rel="stylesheet">
<style>
{0}
</style>
<script type="text/javascript">
{1}
</script>
</head>
<body>
<div class="header2">{2}</div><br>
{3}
</body></html>"""

  with open("static/result/style.css", "r") as inhtml:
    style = inhtml.read()
  with open("static/result/engine.js", "r") as inhtml:
    script = inhtml.read().replace("var em;", "var em=18;").replace("elemtitle.style.top = (rect.top - prect.top) + 'px';", "elemtitle.style.top = (rect.top - prect.top) + 6 + 'px';")
  with open(os.path.join('./output', filekey + ".html"), "r") as inhtml:
    htmlcont = inhtml.read()
  tt = re.search(r'<div class="header2">(.*?)</div>', htmlcont, flags=re.DOTALL).group(1)
  result_div = re.search(r'<div id="resulta".*?\n', htmlcont).group(0)
  outcont = template.format(style, script, tt, result_div)
  with open("display.html", "w", encoding="utf-8") as outhtml:
    outhtml.write(outcont)

In [None]:
#@markdown Enter the desired text ID below to pretty-print the result
display_id = chosen_one  #@param {type: "string"}

prepare_display(display_id)
IPython.display.HTML(filename='display.html')

## Download the results

You may also want to download the texts processed by `heptabot`. The code below packs the results into an archive and downloads it directly to your computer: unzip it to view the results as they would be displayed in the web version. With Colab, you can also easily save the resulting folder to your Google Drive.

In [None]:
!zip -q heptabot_processed.zip -r output
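
Before downloading, you may want to confirm which texts made it into the archive. The sketch below assumes each processed text is stored as `output/<text ID>.html`, which is inferred from the display helper earlier in this notebook; `processed_ids` is a hypothetical helper.

```python
import zipfile
from pathlib import PurePosixPath


def processed_ids(archive_path="heptabot_processed.zip"):
    """List text IDs that have an HTML result directly under output/ in the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        names = [PurePosixPath(name) for name in archive.namelist()]
    return sorted(
        path.stem
        for path in names
        if path.parent == PurePosixPath("output") and path.suffix == ".html"
    )
```

Comparing `processed_ids()` with `sorted(texts)` is a quick way to spot texts that are missing from the results.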

In [None]:
from google.colab import files

files.download("heptabot_processed.zip")
