<a href="https://colab.research.google.com/github/markaaronslater/NMT/blob/master/NMT_driver.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!pip install subword-nmt # for segmenting words into subwords
!pip install stanza # for tokenizing corpus and tagging with morphological data
!pip install sacrebleu # for evaluation
!git clone https://github.com/moses-smt/mosesdecoder.git # for detokenizing model outputs prior to evaluation

In [None]:
!nvidia-smi

In [None]:
# recommended: place cloned NMT folder in Google drive folder 'My Drive':
path = '/content/gdrive/My Drive/NMT/'
corpus_path = path + 'corpuses/iwslt16_en_de/'
config_path = path + 'configs/'
data_path = path + 'data/'
checkpoint_path = path + 'checkpoints/'

model_name = 'my_model' # identifier for this model's data (tensor batches, hyperparameters, etc.), saved as <model_name>.pkl inside data_path

In [None]:
%cd /content/gdrive/My Drive/

#from NMT import mod
from NMT.src.preprocessing.apply_stanza_processors import apply_stanza_processors
from NMT.src.preprocessing.truecase import truecase_corpuses
from NMT.src.import_configs import import_configs
from NMT.src.preprocessing.preprocess import construct_model_data, retrieve_model_data
from NMT.src.train import train, load_checkpoint
from NMT.src.predict import predict

below, steps 1, 2, 4, and 5 only ever need to be run once (they save their outputs to text and pickle files). step 3 only imports hyperparameter settings from the configuration files.


step 1 - **apply stanza processors to tokenize and POS-tag the corpuses**


> &lt;corpus\_name&gt; is saved to &lt;corpus\_path&gt;/stanza\_&lt;corpus\_name&gt;.pkl, e.g., train.en is saved to /content/gdrive/My Drive/NMT/corpuses/iwslt16\_en\_de/stanza\_train.en.pkl.

> can be retrieved via retrieve\_stanza\_outputs().


step 2 - **truecase the corpuses using linguistic heuristics that leverage the morphological data produced by the stanza tagger in step 1**

> &lt;corpus\_name&gt; is saved to &lt;corpus\_path&gt;/word\_&lt;corpus\_name&gt;.

> can be retrieved via read\_tokenized\_corpuses(prefix='word\_').


step 4 - **segment corpuses of words into corpuses of subwords**

> &lt;corpus\_name&gt; is saved to &lt;corpus\_path&gt;/subword\_joint\_&lt;corpus\_name&gt; or &lt;corpus\_path&gt;/subword\_ind\_&lt;corpus\_name&gt;, depending on whether a joint vocabulary or separate, independent vocabularies, respectively, are learned for the source and target languages.

> can be retrieved via read\_tokenized\_corpuses(prefix='subword\_joint\_') and read\_tokenized\_corpuses(prefix='subword\_ind\_').



step 5 - **convert corpuses into intelligently batched sets of tensors that can be passed directly to the model**

> a dictionary containing all model data is saved to &lt;data\_path&gt;/&lt;model\_name&gt;.pkl, where &lt;model\_name&gt; identifies which model to load.

> can be retrieved via retrieve\_model\_data().
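
the run-once pattern described above (compute, save to a pickle, retrieve on later runs) can be sketched as follows. this is only an illustration under stated assumptions: `load_or_build` and `expensive_step` are hypothetical names, not functions from this repo, which uses its own save and retrieve functions for each step.

```python
import os
import pickle
import tempfile

def load_or_build(pkl_path, build_fn):
    """Return the cached output if pkl_path exists; otherwise build, save, and return it.

    Hypothetical helper for illustration only (not part of the NMT repo).
    """
    if os.path.exists(pkl_path):
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(pkl_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_step():
    calls.append(1)  # record each time the expensive work actually runs
    return {"tokens": [["ein", "Satz"], ["noch", "einer"]]}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "stanza_train.de.pkl")
    first = load_or_build(cache_file, expensive_step)   # runs expensive_step and saves
    second = load_or_build(cache_file, expensive_step)  # loads from the pickle instead
```

the second call returns the same data without rerunning the expensive step, which is why these cells can be skipped in later sessions.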

In [None]:
# only meaningful for unit tests on subsets of the corpus data: _start is the starting line number
# (1-based indexing) and num is how many lines to extract. if num is None, extract all lines from _start to the end of the corpus.
_start = 1
num = None
# num = 10 # uncomment this line if unit testing
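
to make the `_start` / `num` semantics concrete, here is a minimal sketch of 1-based line extraction. `extract_lines` is a hypothetical helper written for this note; how the repo slices its corpuses internally is an assumption.

```python
from itertools import islice

def extract_lines(lines, _start=1, num=None):
    """Return `num` lines beginning at 1-based line `_start` (all remaining lines if num is None)."""
    begin = _start - 1                          # convert 1-based line number to 0-based index
    end = None if num is None else begin + num  # None means "until the end of the corpus"
    return list(islice(lines, begin, end))

corpus = ["line %d" % i for i in range(1, 6)]
subset = extract_lines(corpus, _start=2, num=3)  # lines 2, 3, and 4
everything = extract_lines(corpus)               # defaults select the whole corpus
```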

In [None]:
# step 1 - tokenize corpuses and tag them with morphological data
apply_stanza_processors("train.de", "train.en", "dev.de", "dev.en", "test.de", path=corpus_path, _start=_start, num=num)

In [None]:
# step 2 - truecase corpuses using linguistic heuristics that leverage the morphological data produced by the stanza tagger in step 1
# e.g., remove capitalization from words that are only capitalized for a syntactic reason, like occurring at the beginning of a sentence,
# but retain capitalization in proper nouns, etc. (more sophisticated heuristics are employed for the German corpuses)
truecase_corpuses("train.de", "train.en", "dev.de", "dev.en", "test.de", path=corpus_path)
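
the kind of heuristic described in the comments above can be sketched as below. this is a deliberately simplified stand-in, assuming universal POS tags as input; `truecase_sentence` and its `keep_tags` parameter are hypothetical, and the repo's actual heuristics (especially for German) are more sophisticated.

```python
def truecase_sentence(words, upos_tags, keep_tags=("PROPN",)):
    """Lowercase the sentence-initial word unless its POS tag says the capital is meaningful."""
    if not words:
        return []
    first = words[0]
    if upos_tags[0] not in keep_tags:
        first = first.lower()  # capital was only there because the word starts the sentence
    return [first] + list(words[1:])

english = truecase_sentence(["The", "cat", "saw", "Berlin"], ["DET", "NOUN", "VERB", "PROPN"])
name_first = truecase_sentence(["Berlin", "is", "large"], ["PROPN", "AUX", "ADJ"])
# German capitalizes all nouns, so a German variant would also keep NOUN capitalized
german = truecase_sentence(["Katzen", "schlafen"], ["NOUN", "VERB"], keep_tags=("PROPN", "NOUN"))
```

this sketch only touches the first word; cases such as the English pronoun "I" would need extra rules.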

In [None]:
# step 3 - import vocab, training, and model hyperparameter settings from configuration files
hyperparams = import_configs(config_path=config_path)

In [None]:
# step 4 - segment words of corpuses into subwords
# (skip this cell if using a word-level vocabulary)
!bash subword_joint.sh {hyperparams["num_merge_ops"]} {hyperparams["vocab_threshold"]} "{corpus_path}"

In [None]:
# step 5 - build intelligently batched sets of tensors that can be directly passed to model
construct_model_data("train.de", "train.en", "dev.de", "dev.en", "test.de", hyperparams=hyperparams,
                     corpus_path=corpus_path, data_path=data_path, model_name=model_name
                    )
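
one common reading of "intelligently batched" is length bucketing: sentences of similar length share a batch, so little padding is wasted. the sketch below illustrates that idea in plain Python. it is an assumption about the general technique, not a description of what `construct_model_data` does; `make_batches` and the `<pad>` symbol are hypothetical.

```python
def make_batches(sentences, batch_size, pad="<pad>"):
    """Group sentences of similar length, then pad each batch to its own max length."""
    # sort indices by sentence length so each batch holds similar-length sentences
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches = []
    for k in range(0, len(order), batch_size):
        idxs = order[k:k + batch_size]
        max_len = max(len(sentences[i]) for i in idxs)
        padded = [sentences[i] + [pad] * (max_len - len(sentences[i])) for i in idxs]
        batches.append((idxs, padded))  # keep original indices to restore corpus order later
    return batches

corpus = [["a", "b", "c", "d"], ["a"], ["a", "b", "c"], ["a", "b"]]
batches = make_batches(corpus, batch_size=2)
```

here the two short sentences land in one batch and the two long ones in another, so only two pad symbols are needed in total.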

In [None]:
# step 6 - retrieve the model data (batches, vocabulary, references, hyperparameters) constructed in the previous step
model_data = retrieve_model_data(data_path=data_path, model_name=model_name)

train_batches = model_data["train_batches"]
dev_batches = model_data["dev_batches"]
test_batches = model_data["test_batches"]
idx_to_trg_word = model_data["idx_to_trg_word"]
references = model_data["ref_corpuses"] # passed to train() and predict() below
hyperparams = model_data["hyperparams"]

In [None]:
# step 7 - run tests to ensure model is correct before expensive train step

In [None]:
# step 8 - train model
model = train(hyperparams, train_batches, dev_batches, references, idx_to_trg_word, checkpoint_path, save=True)

In [None]:
# step 9 - predict test set
# can load a checkpoint rather than using prev cell's model:
if hyperparams["early_stopping"]:
    model, _ = load_checkpoint(hyperparams, checkpoint_path, "best_model")
else:
    model, _ = load_checkpoint(hyperparams, checkpoint_path, "most_recent_model")

# to predict the test set, pass test_batches instead of dev_batches
predict(model, dev_batches, references, idx_to_trg_word, checkpoint_path, 1000, inference_alg="beam_search", write=True)

In [None]:
# step 10 - evaluate model
# step 10a - desegment predictions (skip this cell if a word-level vocab was used)
translations_file = '' # set this to the name of the file containing the predictions to evaluate
!sed -E 's/(@@ )|(@@ ?$)//g' < "{translations_file}" > desegmented_translations.txt
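
for reference, the sed command above can be mirrored in Python with the same regular expression: it deletes the `@@ ` marker that joins subwords, plus any dangling `@@` at the end of a line. `desegment` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import re

# same pattern as the sed command: drop "@@ " between subwords and a trailing "@@"
BPE_JOINER = re.compile(r"(@@ )|(@@ ?$)")

def desegment(line):
    """Merge BPE subwords (marked with a trailing '@@') back into whole words."""
    return BPE_JOINER.sub("", line)

restored = desegment("the new@@ est mod@@ el trans@@ lates")
dangling = desegment("an un@@ finish@@")  # prediction that ends mid-word
```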

In [None]:
# step 10b - evaluate
!bash eval.sh "{corpus_path}"


In [None]:
%%shell
#!/bin/bash



# REFERENCE_FILE=corpus_path+"dev.en" # replace with test.en

# TRANSLATED_FILE=checkpoint_path+"beampreds16"
# perl "mosesdecoder/scripts/tokenizer/detokenizer.perl" -l en < "$TRANSLATED_FILE" > "$TRANSLATED_FILE.detok"
# PARAMS=("-tok" "intl" "-l" "de-en" "$REFERENCE_FILE")
# sacrebleu "${PARAMS[@]}" < "$TRANSLATED_FILE.detok"