<a href="https://colab.research.google.com/github/juliakreutzer/masakhane-covid/blob/master/CovidSurveyTranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid Survey Translation with Masakhane Models

This notebook loads a Masakhane Joey NMT model for your selected language, translates the [Covid-19 survey](https://coronasurveys.org/), and then presents the translations to you for correction.

- Connect to a GPU runtime for fast translations: In the menu, select 'Runtime' -> 'Change Runtime Type' -> 'Hardware Accelerator' -> select 'GPU'.
- Please go through and *execute this notebook cell by cell*. Some cells seem to be empty because their code is hidden. Please execute them anyway.
- Proceed until you reach *the post-editing part*, where you will be asked to correct the model's translations.
- If a cell reports an error that prevents you from continuing, please take a screenshot and email it to kreutzer@cl.uni-heidelberg.de.
- At the very end, you'll be asked to *download* the created translation files and *send them via email*.

In total, this should not take more than 30 minutes to complete. **Thank you for your time! <3**

## Getting Ready

In [1]:
#@title Installation

! pip3 install gdown --quiet
! pip3 install polyglot --quiet
! pip3 install pyicu --quiet
! pip3 install pycld2 --quiet
! pip3 install morfessor --quiet
! pip3 install pyter3 --quiet

! git clone https://github.com/joeynmt/joeynmt.git joeynmt
! cd joeynmt; pip3 install . --quiet

fatal: destination path 'joeynmt' already exists and is not an empty directory.
  Building wheel for joeynmt (setup.py) ... done


In [0]:
#@title Imports

import os
import re
import yaml
import sacrebleu
import numpy as np
import pandas as pd
import functools
import pyter
import ipywidgets as widgets
from IPython.display import display
from polyglot.text import Text
from subword_nmt import apply_bpe

## Model Preparation

In [23]:
#@title Target language selection

from joeynmt.helpers import load_config

class MasakhaneModelLoader():

  def __init__(self, available_models_file):
    self._model_dir_prefix = 'joeynmt/models/'
    self._src_language = ''
    self.models = self.load_available_models(available_models_file)
  
  def load_available_models(self, available_models_file, 
                            src_language='en', domain='JW300'):
    # Get list of available models.
    # If multiple models: select domain. 
    # Only select relevant models with correct src language.
    models = {}
    with open(available_models_file, 'r') as ofile:
      for i, line in enumerate(ofile):
        entries = line.strip().split("\t")
        if i == 0:
          headers = entries
          header_keys = [str(h) for h in headers]
          continue
        model = {h: v for h, v in zip(header_keys, entries)}
        if model['src_language'] != src_language or model['complete'] != 'yes':
          continue
        if model['trg_language'] in models.keys() and model['domain'] != domain:
          continue
        models[model['trg_language']] = model
    print('Found {} Masakhane models.'.format(len(models)))
    self._model_dir_prefix += src_language
    self._src_language = src_language
    return models
  
  def download_model(self, trg_language):
    """ Download model for given trg language. """
    model_dir = "{}-{}".format(self._model_dir_prefix, trg_language)
    !mkdir -p $model_dir
    model_files = self.models[trg_language]
    # Download the checkpoint.
    ckpt_path = os.path.join(model_dir, 'model.ckpt')
    self._download(model_files['ckpt'], ckpt_path)
    # Download the vocabularies.
    src_vocab_file = model_files['src_vocab']
    trg_vocab_file = model_files['trg_vocab']
    src_vocab_path = os.path.join(model_dir, 'src_vocab.txt')
    self._download(src_vocab_file, src_vocab_path)
    trg_vocab_path = os.path.join(model_dir, 'trg_vocab.txt')
    self._download(trg_vocab_file, trg_vocab_path)
    # Download the config.
    config_file = model_files['config.yaml']
    config_path = os.path.join(model_dir, 'config_orig.yaml')
    self._download(config_file, config_path)
    # Adjust config.
    config = load_config(config_path)
    new_config_file = os.path.join(model_dir, 'config.yaml')
    config = self._update_config(config, src_vocab_path, trg_vocab_path,
                                 model_dir, ckpt_path)
    with open(new_config_file, 'w') as cfile:
      yaml.dump(config, cfile)
    # Download BPE codes.
    src_bpe_path = os.path.join(model_dir, 'src.bpe.model')
    trg_bpe_path = os.path.join(model_dir, 'trg.bpe.model')
    self._download(model_files['src_bpe'], src_bpe_path)
    self._download(model_files['trg_bpe'], trg_bpe_path)
    print('Downloaded model for {}-{}.'.format(self._src_language, trg_language))
    return model_dir, config, self._is_lc(src_vocab_path)

  def _update_config(self, config, new_src_vocab_path, new_trg_vocab_path,
                     new_model_dir, new_ckpt_path):
    """Overwrite the settings in the given config."""
    config['data']['src_vocab'] = new_src_vocab_path
    if config['model'].get('tied_embeddings', False):
      config['data']['trg_vocab'] = new_src_vocab_path
    else:
      config['data']['trg_vocab'] = new_trg_vocab_path
    config['training']['model_dir'] = new_model_dir
    config['training']['load_model'] = new_ckpt_path
    return config

  def _is_lc(self, src_vocab_path):
    # Infer whether the model is built on lowercased data.
    lc = True
    with open(src_vocab_path, 'r') as ofile:
      for line in ofile:
        if line != line.lower():
          lc = False
          break
    return lc

  def _download_gdrive_file(self, file_id, destination):
    """Download a file from Google Drive and store in local file."""
    download_link = 'https://drive.google.com/uc?id={}'.format(file_id)
    !gdown -q -O $destination $download_link

  def _download_github_file(self, github_raw_path, destination):
    """Download a file from GitHub."""
    ! wget -q -O $destination $github_raw_path

  def _download(self, url, destination):
    """Download file from Github or Googledrive."""
    if url.startswith('https://raw.githubusercontent.com'):
      self._download_github_file(url, destination)
    elif 'drive.google.com' in url:
      file_id = None
      if url.startswith('https://drive.google.com/file'):
        file_id = url.split("/")[-1]
      elif url.startswith('https://drive.google.com/open?'):
        file_id = url.split('id=')[-1]
      if file_id is None:
        # Unrecognized Google Drive link format.
        print("Download failed, didn't recognize url.")
      else:
        self._download_gdrive_file(file_id, destination)
    else:
      print("Download failed, didn't recognize url.")

available_models_file = 'available_models.tsv'
! wget -q -O $available_models_file https://raw.githubusercontent.com/juliakreutzer/masakhane-covid/master/models/available_models.tsv

model_loader = MasakhaneModelLoader(available_models_file=available_models_file)

import ipywidgets as widgets
print('Please select a target language.')
lang_picker = widgets.Dropdown(options=model_loader.models.keys(), value='yo')
lang_picker

Found 17 Masakhane models.
Please select a target language.


Dropdown(index=15, options=('af', 'efi', 'bin', 'ish', 'ha', 'ig', 'iso', 'kam', 'ki', 'luo', 'pcm', 'nso', 's…

In [39]:
#@title Model download 
language = lang_picker.value
model_dir, config, lc = model_loader.download_model(language)

Downloaded model for en-zu.
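
The filtering done in `load_available_models` above can be illustrated on a small in-memory table. This is a minimal sketch, not the notebook's loader: the sample rows are made up, the function name `select_models` is hypothetical, and the real `available_models.tsv` contains more columns (checkpoint, vocabulary, and BPE links).

```python
import csv
import io

# Hypothetical sample rows; the real file has additional columns.
SAMPLE_TSV = (
    "src_language\ttrg_language\tdomain\tcomplete\n"
    "en\tyo\tJW300\tyes\n"
    "en\tyo\tnews\tyes\n"
    "en\tzu\tJW300\tno\n"
    "fr\tha\tJW300\tyes\n"
    "en\tig\tnews\tyes\n"
)

def select_models(tsv_text, src_language="en", domain="JW300"):
    """Keep one complete model per target language, preferring `domain`."""
    models = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip models with the wrong source language or unfinished training.
        if row["src_language"] != src_language or row["complete"] != "yes":
            continue
        trg = row["trg_language"]
        # A registered model is only replaced by one from the preferred domain.
        if trg in models and row["domain"] != domain:
            continue
        models[trg] = row
    return models

selected = select_models(SAMPLE_TSV)
print(sorted(selected))
```

A target language with only an out-of-domain model (`ig` in the sample) is still kept, so the domain acts as a preference rather than a hard filter.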


In [40]:
#@title Test loaded model
# Sanity check: this should run without an error message.
! echo "Test.\nAnd again." > test.txt
new_config_path = os.path.join(model_dir, 'config.yaml')
! python -m joeynmt translate $new_config_path < test.txt > test_out.txt

2020-04-29 04:49:55,897 Hello! This is Joey-NMT.
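
The `_download` method above distinguishes two Google Drive link formats before building a `https://drive.google.com/uc?id=...` download link. The following sketch isolates that ID extraction; the function name `drive_file_id` and the example IDs are placeholders, not real files.

```python
def drive_file_id(url):
    """Return the Drive file ID for the two link formats handled above, else None."""
    if url.startswith("https://drive.google.com/file"):
        # The ID is assumed to be the last path segment of the link.
        return url.split("/")[-1]
    if url.startswith("https://drive.google.com/open?"):
        # The ID follows the `id=` query parameter.
        return url.split("id=")[-1]
    return None

# Placeholder IDs, for illustration only.
print(drive_file_id("https://drive.google.com/file/d/FILE_ID_123"))
print(drive_file_id("https://drive.google.com/open?id=FILE_ID_456"))
print(drive_file_id("https://example.org/model.ckpt"))
```

Note that a link ending in an extra path segment such as `/view` would yield that segment instead of the ID, so links in the model table need to end with the ID itself.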


## Survey Data

Now we're loading the data to translate.

In [79]:
#@title Load the source

sentence_sep = "###"

class SourceData():
  def __init__(self, survey_link, bpe_path, out_file):
    self._src_df = pd.read_csv(survey_link, sep='\t')
    print("Loaded {} lines.".format(len(self._src_df)))
    self._bpe_model = self.load_bpe(bpe_path)
    self._src_df, self._sources = self.preprocess(out_file)
  
  def get_df(self):
    return self._src_df
  
  def get_sources(self):
    return self._sources

  def preprocess(self, out_file):
    """Split into sentences, tokenize, (lowercase,) sub-word split.
    
    Using Polyglot since it was used for JW300.
    Preprocess the source column of a dataframe object and write to file.
  
    Pipeline:
    - split sentences
    - tokenize
    - split into sub-words

    Append pre-processed sources to dataframe."""
    split_sentences = []
    tokenized_sentences = []
    bped_sentences = []
    sources = []
    with open(out_file, 'w') as ofile:
      for i, row in self._src_df.iterrows():
        sentences_i = Text(row.iloc[0]).sentences
        split_sentences.append([str(s) for s in sentences_i])
        tokenized_sentence = []
        bped_sentence = []
        for sentence_i in sentences_i:
          tokenized = " ".join(sentence_i.words)
          sources.append(str(sentence_i))
          if lc:
            tokenized = tokenized.lower()
          tokenized_sentence.append(tokenized)
          bped = self._bpe_model.process_line(tokenized)
          bped_sentence.append(bped)
          ofile.write("{}\n".format(bped))
        tokenized_sentences.append(tokenized_sentence)
        bped_sentences.append(bped_sentence)
    data = self._src_df.assign(
        split_sentences=[sentence_sep.join(s) for s in split_sentences])
    data = data.assign(
        tokenized_sentences=[sentence_sep.join(s) for s in tokenized_sentences])
    data = data.assign(
        bped_sentences=[sentence_sep.join(s) for s in bped_sentences])
    return data, sources

  def load_bpe(self, bpe_path):
    with open(bpe_path, 'r') as ofile:
      bpe_model = apply_bpe.BPE(codes=ofile)
    return bpe_model
  
src_input_file = 'src_input.txt'
src_bpe_path = os.path.join(model_dir, 'src.bpe.model')
survey_link = 'https://raw.githubusercontent.com/juliakreutzer/masakhane-covid/master/data/survey.tsv'
src_data = SourceData(survey_link, bpe_path=src_bpe_path, out_file=src_input_file)
sources = src_data.get_sources()
survey_df = src_data.get_df()

Detector is not able to detect the language reliably.


Loaded 31 lines.
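
To make the preprocessing steps concrete, here is a simplified sketch of the split, tokenize, and join stages. It deliberately swaps Polyglot for naive regular expressions and leaves out the BPE step, so its output will differ from the notebook's on harder inputs; the helper names are hypothetical.

```python
import re

SENTENCE_SEP = "###"

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace (stand-in for Polyglot)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence, lowercase=True):
    """Naive tokenizer: separate words from punctuation (stand-in for Polyglot)."""
    tokenized = " ".join(re.findall(r"\w+|[^\w\s]", sentence))
    return tokenized.lower() if lowercase else tokenized

row = "Please answer once a day. Thank you!"
sentences = split_sentences(row)
tokenized = [tokenize(s) for s in sentences]

# One survey row becomes one string per column, with sentences joined by the separator.
print(SENTENCE_SEP.join(sentences))
print(SENTENCE_SEP.join(tokenized))
```

Joining with `###` keeps one table row per survey item while still recording where the sentence boundaries were, which is what allows the translations to be merged back later.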


In [82]:
#@title Source excerpt
survey_df[:3]

| | Survey on the number of persons with symptoms compatible with COVID-19 | split_sentences | tokenized_sentences | bped_sentences |
|---|---|---|---|---|
| 0 | We are an international team of scientists fro... | We are an international team of scientists fro... | we are an international team of scientists fro... | we are an international te@@ am of scienti@@ s... |
| 1 | Please answer this survey once a day, even if ... | Please answer this survey once a day, even if ... | please answer this survey once a day , even if... | ple@@ ase answer this sur@@ ve@@ y once a day ... |
| 2 | Pressing the "submit" button implies explicit ... | Pressing the "submit" button implies explicit ... | pressing the " submit " button implies explici... | pre@@ ssing the " sub@@ m@@ it " bu@@ tt@@ on ... |


In [83]:
#@title Translation with Joey NMT
trg_output_file = 'targets.txt'
print('Translating...')
! python -m joeynmt translate $model_dir/config.yaml < $src_input_file > $trg_output_file

# Post-processing
def post_process(output_file):
  """Load and detokenize translations.
  
  There is no given Polyglot detokenizer, so we do it by heuristics.
  """
  targets = []
  with open(output_file, 'r') as ofile:
    for line in ofile:
      sent = line.strip()
      sent = sent.replace('<pad>', '')
      # Remove whitespace before closing punctuation (hyphens and slashes are handled below).
      sent = re.sub(r'\s+([?.!",:’)])', r'\1', sent)
      sent = sent.replace('( ', '(').replace(' - ', '-').replace(' / ', '/').replace(' /', '/')
      if lc and sent:
        # Cheap casing restoration... only first character but better than nothing.
        sent = sent[0].upper() + sent[1:]
      targets.append(sent)
  return targets

targets = post_process(trg_output_file)

print('Done!')

Translating...
2020-04-29 04:14:02,339 Hello! This is Joey-NMT.
Done!
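
The detokenization heuristics can be tried out on a single string. This is a sketch under assumptions: the function name `detokenize` is hypothetical, the punctuation set is a simplified version of the one in `post_process`, and the example sentences are made up rather than taken from model output.

```python
import re

def detokenize(sent, restore_case=True):
    """Heuristically undo tokenization: drop padding, reattach punctuation, fix casing."""
    sent = sent.replace("<pad>", "").strip()
    # Remove whitespace before closing punctuation.
    sent = re.sub(r'\s+([?.!",:’)])', r"\1", sent)
    # Remove whitespace after an opening parenthesis and around hyphens.
    sent = sent.replace("( ", "(").replace(" - ", "-")
    if restore_case and sent:
        # Only the first character is restored; proper nouns stay lowercased.
        sent = sent[0].upper() + sent[1:]
    return sent

print(detokenize("please answer once a day , even if you feel well ( no symptoms ) ."))
print(detokenize("ok . <pad> <pad>"))
```

The empty-string guard matters in practice: a model can return an empty line, and indexing `sent[0]` on it would raise an `IndexError`.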


## Post-edit

In [62]:
#@title Now it's your turn!
class SentenceCounterLabel():
  def __init__(self, num_sentences):
    self._num_sentences = num_sentences
    self._label = widgets.Label("")

  def update(self, current_sentence_index):
    self._label.value = f"Please correct the mistakes in the following translation ({current_sentence_index + 1} out of {self._num_sentences}):"

  def get_label(self):
    return self._label

class SentenceCounter(object):
  def __init__(self, source_sentences, target_sentences):
    self._count = 0
    
    assert len(source_sentences) == len(target_sentences)
    self._max_count = len(source_sentences) - 1

    self._source = source_sentences
    self._target = target_sentences
    self._post_edits = [""]*len(target_sentences)

    self._label = SentenceCounterLabel(self._max_count + 1)
    self._update_label()

  def _update_label(self):
    self._label.update(self._count)

  def decrement(self):
    if self._count > 0:
      self._count -= 1

    self._update_label()

  def increment(self):
    if self._count < self._max_count:
      self._count += 1

    self._update_label()

  def get_count(self):
    return self._count
  
  def get_max_count(self):
    return self._max_count

  def get_source(self):
    return self._source[self._count]

  def get_target(self):
    return self._target[self._count]

  def update_pe(self, post_edit):
    self._post_edits[self._count] = post_edit

  def get_label(self):
    return self._label.get_label()

  def get_pes(self):
    return self._post_edits

  @property
  def pe_is_complete(self):
    return all([e != '' for e in self._post_edits])

sentence_counter = SentenceCounter(
  sources, targets
)

out = widgets.Output()

trg_text = widgets.Textarea(
  value=sentence_counter.get_target(),
  description="Target",
  disabled=False,
  layout=widgets.Layout(width="90%", overflow="auto"),
  rows=5
)

src_text = widgets.Textarea(
  value=sentence_counter.get_source(),
  description="Source",
  disabled=True,
  layout=widgets.Layout(width="90%", overflow="auto"),
  rows=5
)

prev_button = widgets.Button(
  description='Previous Sentence',
  disabled=False,
  button_style='',
  tooltip='Click me',
  icon='check'
)

next_button = widgets.Button(
  description='Next Sentence',
  disabled=False,
  button_style='',
  tooltip='Click me',
  icon='check'
)

def update_text_fields(trg_text_field, src_text_field, counter, increment=True):
  counter.update_pe(trg_text_field.value)

  if increment:
    counter.increment()

  else:
    counter.decrement()
  
  trg_text_field.value = counter.get_target()
  src_text_field.value = counter.get_source()


with out:
  def on_submit(trg_text_field, src_text_field, counter):
    update_text_fields(trg_text_field, src_text_field, counter)

  def on_click(button, trg_text_field, src_text_field, counter, increment=True):
    update_text_fields(trg_text_field, src_text_field, counter, increment=increment)

  display(sentence_counter.get_label())
  display(src_text)
  display(trg_text)
  display(prev_button)
  display(next_button)

  prev_button.on_click(functools.partial(
    on_click,
    trg_text_field=trg_text,
    src_text_field=src_text,
    counter=sentence_counter,
    increment=False
  ))

  next_button.on_click(functools.partial(
    on_click,
    trg_text_field=trg_text,
    src_text_field=src_text,
    counter=sentence_counter
  ))

display(out)

Output()

In [63]:
print("Finished?", sentence_counter.pe_is_complete)

Finished? True


In [0]:
assert sentence_counter.pe_is_complete, 'Only move on once the post-edits are finished.'
post_edits = sentence_counter.get_pes()

### Post-Edit Analysis

We compute several automatic metrics to assess how much of the model output had to be edited.
- BLEU: on corpus- and on sentence-level, for tokenized inputs and without tokenization.
- ChrF: a character-level metric, on untokenized inputs.
- TER: on sentence-level, white-space tokenized.

In addition, your feedback would be very valuable.
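
As a rough intuition for the TER metric listed above, the following sketch computes a simplified variant: the word-level edit distance between the model translation and the post-edit, divided by the length of the post-edit. It is not the `pyter` implementation used in this notebook, since real TER also allows block shifts; the function names and the handling of empty references are assumptions.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simple_ter(hypothesis, reference):
    """Edits per reference word, without the shift operation of full TER."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        # Assumed convention: count every hypothesis word as one edit.
        return float(len(hyp))
    return word_edit_distance(hyp, ref) / len(ref)

print(simple_ter("the cat sat on mat", "the cat sat on the mat"))
```

A score of 0 means the translation was accepted unchanged; higher scores mean more words had to be inserted, deleted, or replaced during post-editing.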


In [74]:
#@title Where did the model struggle?

bleu_pe_none_tok = sacrebleu.corpus_bleu(targets, [post_edits], tokenize='none').score
bleu_pe_intl_tok = sacrebleu.corpus_bleu(targets, [post_edits], tokenize='intl').score
bleu_pe_none_tok_sent = [sacrebleu.sentence_bleu(target, post_edit).score for target, post_edit in zip(targets, post_edits)]
chrf_pe_sent = [sacrebleu.sentence_chrf(target, post_edit).score*100 for target, post_edit in zip(targets, post_edits)]
ter_pe_sent = [pyter.ter(target.split(' '), post_edit.split(' ')) for target, post_edit in zip(targets, post_edits)]

#print('BLEU without tokenization: ', bleu_pe_none_tok)
#print('BLEU with intl tokenization: ', bleu_pe_intl_tok)
#print('Avg sent. BLEU', np.mean(bleu_pe_none_tok_sent))
#print('Avg ChrF', np.mean(chrf_pe_sent))
#print('Avg TER', np.mean(ter_pe_sent))

feedback = widgets.Textarea(
  value='The model had difficulties... ',
  description="Feedback",
  disabled=False,
  layout=widgets.Layout(width="90%", overflow="auto"),
  rows=5
)
feedback

Textarea(value='The model had difficulties... ', description='Feedback', layout=Layout(overflow='auto', width=…

## Store the results

We store sources, post-edits and targets in a per-sentence table, and merge individual sentences back together into the format of the survey.
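
The merge step relies on the `###` separator recorded during preprocessing: each survey row consumes as many translations as it had sentences. A minimal sketch with plain lists instead of a dataframe (the function name `join_by_row` and the toy data are hypothetical):

```python
SENTENCE_SEP = "###"

def join_by_row(split_sources, translations, sep=SENTENCE_SEP):
    """Group a flat list of sentence translations back into one string per source row."""
    joined = []
    position = 0
    for row in split_sources:
        # A row with n separators contains n + 1 sentences.
        num_sents = row.count(sep) + 1
        joined.append(sep.join(translations[position:position + num_sents]))
        position += num_sents
    assert position == len(translations), "sentence counts do not match"
    return joined

split_sources = ["A one.###A two.", "B one."]
translations = ["T-A1", "T-A2", "T-B1"]

joined = join_by_row(split_sources, translations)
flat = [j.replace(SENTENCE_SEP, " ") for j in joined]
print(joined)
print(flat)
```

This only works if no sentence itself contains the separator string, which is why an unusual token like `###` is used.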


In [0]:
#@title Prepare output files
assert len(targets) == len(sources) == len(post_edits)

pe_output_file = 'pe_{}.tsv'.format(language)
with open(pe_output_file, 'w') as ofile:
  ofile.write('source\ttranslation\tpost-edit\tbleu\tchrf\tter\n')
  for t, s, p, b, c, ter in zip(targets, sources, post_edits, bleu_pe_none_tok_sent, chrf_pe_sent, ter_pe_sent):
    ofile.write('{}\t{}\t{}\t{:.2f}\t{:.2f}\t{:.2f}\n'.format(s, t, p, b, c, ter))

pe_metric_file = 'pe_metrics_{}.tsv'.format(language)
with open(pe_metric_file, 'w') as ofile:
  ofile.write('Corpus BLEU tok=none:\t{:.2f}\n'.format(bleu_pe_none_tok))
  ofile.write('Corpus BLEU tok=intl:\t{:.2f}\n'.format(bleu_pe_intl_tok))
  ofile.write('Sent. BLEU tok=none:\t{:.2f}\n'.format(np.mean(bleu_pe_none_tok_sent)))
  ofile.write('Avg TER tok=white:\t{:.2f}\n'.format(np.mean(ter_pe_sent)))
  ofile.write('Avg ChrF:\t{:.2f}\n'.format(np.mean(chrf_pe_sent)))

feedback_file = 'feedback_{}.txt'.format(language)
with open(feedback_file, 'w') as ofile:
  ofile.write(feedback.value)

def join_translated_sentences(data, translations):
  """Join sentence translations just like in original source."""
  joined_targets = ['']*len(data)
  target_counter = 0
  for source_counter, row in enumerate(data['split_sentences']):
    num_sents = row.count(sentence_sep)+1
    joined_targets[source_counter] = sentence_sep.join(
        translations[target_counter:target_counter+num_sents])
    target_counter += num_sents
  assert len(joined_targets) == len(data)
  data = data.assign(
      split_translations=joined_targets)
  data = data.assign(
      translations=[t.replace(sentence_sep, ' ') for t in joined_targets])  
  return data

survey_df = join_translated_sentences(survey_df, post_edits)

def store_translations(data, output_file):
  """Store the translations (post-edited) in a file like the survey inputs."""
  with open(output_file, 'w') as ofile:
    for translation in data['translations']:
      ofile.write('{}\n'.format(translation))

output_file = 'survey_{}.tsv'.format(language)
store_translations(survey_df, output_file)

In [100]:
#@title Source and target excerpt
survey_df[:3]

| | Survey on the number of persons with symptoms compatible with COVID-19 | split_sentences | tokenized_sentences | bped_sentences | split_translations | translations |
|---|---|---|---|---|---|---|
| 0 | We are an international team of scientists fro... | We are an international team of scientists fro... | we are an international team of scientists fro... | we are an international te@@ am of scienti@@ s... | "ons is’ n internasionale span van wetenskapli... | "ons is’ n internasionale span van wetenskapli... |
| 1 | Please answer this survey once a day, even if ... | Please answer this survey once a day, even if ... | please answer this survey once a day , even if... | ple@@ ase answer this sur@@ ve@@ y once a day ... | "antwoord asseblief hierdie oppervlakkige dag,... | "antwoord asseblief hierdie oppervlakkige dag,... |
| 2 | Pressing the "submit" button implies explicit ... | Pressing the "submit" button implies explicit ... | pressing the " submit " button implies explici... | pre@@ ssing the " sub@@ m@@ it " bu@@ tt@@ on ... | " | " |


# Done!

To finish, complete the following steps:
1. Download the following files from this colab:
   - survey_{language}.tsv
   - pe_metrics_{language}.tsv
   - pe_{language}.tsv
   - feedback_{language}.txt

   For the download, either execute the next cell and allow the download, or click on 'Files' in the left bar, right-click on each file and download it.
2. Send the files to kreutzer@cl.uni-heidelberg.de.

In [0]:
#@title File download
from google.colab import files
files.download('survey_{}.tsv'.format(language))
files.download('pe_metrics_{}.tsv'.format(language))
files.download('pe_{}.tsv'.format(language)) 
files.download('feedback_{}.txt'.format(language))