<a href="https://colab.research.google.com/github/markaaronslater/recurrent-NMT/blob/master/NMT_driver.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!nvidia-smi

In [None]:
# overwrite these with your own paths, and make sure the folders already exist.
path = '/content/gdrive/My Drive/NMT/'
corpus_path = path + 'iwslt16_en_de/'
config_path = path + 'configs/'
data_path = path + 'data/'
checkpoint_path = path + 'checkpoints/'

model_name = 'my_model' # identifies the model; its tensor batches, hyperparameters, etc. are saved as <model_name>.pkl inside data_path
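
Since the later steps write into these folders, it can help to create them up front. A minimal sketch, assuming a hypothetical helper `ensure_dirs` (it is not part of this repository):

```python
import os

def ensure_dirs(*dirs):
    """Create each directory, including missing parents; existing ones are left alone.

    Hypothetical helper for illustration only.
    """
    for d in dirs:
        os.makedirs(d, exist_ok=True)
    return list(dirs)
```

In this notebook it could be called as `ensure_dirs(corpus_path, config_path, data_path, checkpoint_path)` after the paths are set.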

Below, steps 1 through 4 only ever need to be run once (they save their outputs to text and pickle files).


step 1 - **apply stanza processors to tokenize and pos-tag the corpuses**


> &lt;corpus_name&gt; saved to &lt;corpus_path&gt;/stanza_&lt;corpus_name&gt;.pkl, e.g., train.en saved to /content/gdrive/My Drive/iwslt16_en_de/stanza_train.en.pkl.

> can retrieve via retrieve_stanza_outputs().
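
The naming convention above can be sketched as follows. The helper names here (`stanza_pickle_path`, `save_outputs`, `load_outputs`) are hypothetical and only illustrate the `stanza_<corpus_name>.pkl` layout; they are not the repository's `retrieve_stanza_outputs()`.

```python
import os
import pickle

def stanza_pickle_path(corpus_path, corpus_name):
    # e.g. corpus_name "train.en" -> <corpus_path>/stanza_train.en.pkl
    return os.path.join(corpus_path, f"stanza_{corpus_name}.pkl")

def save_outputs(outputs, corpus_path, corpus_name):
    # Hypothetical helper: pickle any Python object holding the processed corpus.
    with open(stanza_pickle_path(corpus_path, corpus_name), "wb") as f:
        pickle.dump(outputs, f)

def load_outputs(corpus_path, corpus_name):
    # Hypothetical helper: read the pickled object back.
    with open(stanza_pickle_path(corpus_path, corpus_name), "rb") as f:
        return pickle.load(f)
```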


step 2 - **truecase corpuses using linguistic heuristics that leverage the morphological data produced by the tagger in step 1**

> &lt;corpus_name&gt; saved to &lt;corpus_path&gt;/word_&lt;corpus_name&gt;. 

> can retrieve via read_tokenized_corpuses(prefix='word_')
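
To give a feel for this kind of heuristic, here is a deliberately simplified sketch. It assumes each sentence is a list of `(word, upos)` pairs using Universal POS tags, and it only handles the sentence-initial word; the function name `truecase_sentence` and the `keep_tags` parameter are hypothetical, and the heuristics actually used in this project are more sophisticated.

```python
def truecase_sentence(tagged, keep_tags=frozenset({"PROPN"})):
    """Lowercase the sentence-initial word unless its tag says the capital is inherent.

    tagged: list of (word, upos) pairs for one sentence.
    keep_tags: tags whose capitalization is kept (illustrative assumption;
               for German one might also keep "NOUN").
    """
    words = [w for w, _ in tagged]
    if not tagged:
        return words
    first_word, first_tag = tagged[0]
    if first_tag not in keep_tags:
        # Only the first character is lowercased; the rest is left untouched.
        words[0] = first_word[:1].lower() + first_word[1:]
    return words
```

For example, `[("The", "DET"), ("cat", "NOUN")]` becomes `["the", "cat"]`, while a sentence starting with `("Berlin", "PROPN")` keeps its capital.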


step 3 - **segment corpuses of words into corpuses of subwords**

> &lt;corpus_name&gt; saved to &lt;corpus_path&gt;/subword_joint_&lt;corpus_name&gt; or &lt;corpus_path&gt;/subword_ind_&lt;corpus_name&gt;, depending on whether a joint vocabulary or separate, independent vocabularies, respectively, are learned for the source and target languages.

> can retrieve via read_tokenized_corpuses(prefix='subword_joint_') and read_tokenized_corpuses(prefix='subword_ind_')
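
The segmentation itself is done by subword-nmt, which learns byte-pair-encoding (BPE) merge operations. The toy sketch below is not subword-nmt's implementation; it only illustrates what a number of merge operations means. The function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are choices made for this illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Return the list of symbol pairs merged, most frequent first (toy version)."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into a single symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

Each learned merge adds one new symbol, so more merge operations yield longer subword units and a larger subword vocabulary.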



step 4 - **convert corpuses into batches of tensors that can be passed directly to the model**

> dictionary containing all model data is saved to &lt;data_path&gt;/&lt;model_name&gt;.pkl, where &lt;model_name&gt; is identifier for which model to load. 

> can retrieve via retrieve_model_data().
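
As a rough picture of what a batch looks like, the sketch below maps tokenized sentences to vocabulary indices and right-pads them to a common length. It uses plain Python lists so that it runs anywhere; the result could then be wrapped with `torch.tensor`. The function name `build_batch` and the `<pad>`/`<unk>` symbols are assumptions for illustration, not the names used by `construct_model_data`.

```python
def build_batch(sentences, word_to_idx, pad="<pad>", unk="<unk>"):
    """Convert tokenized sentences to equal-length rows of indices.

    Returns (batch, lengths): batch is a list of index lists padded on the right,
    lengths holds the unpadded length of each sentence.
    """
    pad_idx, unk_idx = word_to_idx[pad], word_to_idx[unk]
    max_len = max(len(s) for s in sentences)
    batch = [
        # Out-of-vocabulary words map to unk_idx; short sentences are padded.
        [word_to_idx.get(w, unk_idx) for w in s] + [pad_idx] * (max_len - len(s))
        for s in sentences
    ]
    lengths = [len(s) for s in sentences]
    return batch, lengths
```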

In [None]:
# only meaningful for unit tests on subsets of corpus data, where _start is the starting line number
# (using 1-based indexing) and num is how many lines to extract. if num is None, extract all lines from _start to the end of the corpus.
_start = 1
num = 10
#num = None
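
The `_start`/`num` convention described above can be sketched with a hypothetical helper `extract_lines` (the project's own functions may implement the slicing differently):

```python
from itertools import islice

def extract_lines(lines, _start=1, num=None):
    """Return num lines beginning at 1-based line _start; all remaining lines if num is None."""
    first = _start - 1                      # convert to 0-based index
    stop = None if num is None else first + num
    return list(islice(lines, first, stop))
```

With `_start = 1` and `num = 10`, this returns the first ten lines of the corpus.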

In [None]:
# step 1 - tokenize corpuses and tag them with morphological data
#apply_stanza_processors("train.de", "train.en", "dev.de", "dev.en", "test.de", path=corpus_path)

In [None]:
# step 2 - true-case corpuses using linguistic heuristics that leverage the morphological data produced by the tagger in step 1,
# e.g., remove capitalization from words that are only capitalized for a syntactic reason, like occurring at the beginning of a sentence,
# but retain capitalization in proper nouns, etc. (more sophisticated heuristics are employed for the German corpuses)
#truecase_corpuses("train.de", "train.en", "dev.de", "dev.en", "test.de", path=corpus_path, _start=_start, num=num)

In [None]:
# step 3 - segment words of corpuses into subwords
# (skip this cell if using a word-level vocabulary)
!pip install subword-nmt

# arg1 is num_merge_ops, arg2 is vocab_threshold.
#!bash subword_joint.sh 30000 10 '/content/gdrive/My Drive/iwslt16_en_de/'

# produces bpe_codes, vocab.de, vocab.en, and text files containing the segmented corpuses.

In [None]:
# step 4 - build batches of tensors that can be passed directly to the model
# construct_model_data("train.de", "train.en", "dev.de", "dev.en", "test.de",
#                      corpus_path=corpus_path, config_path=config_path, data_path=data_path, model_name=model_name,
#                      vocab_threshold=10, src_vocab_file='vocab.de', trg_vocab_file='vocab.en')

In [None]:
# step 5 - retrieve the model data (batches, vocabulary, references, hyperparameters) saved in step 4
model_data = retrieve_model_data(data_path=data_path, model_name=model_name)

train_batches = model_data["train_batches"]
dev_batches = model_data["dev_batches"]
test_batches = model_data["test_batches"]
idx_to_trg_word = model_data["idx_to_trg_word"]
ref_corpuses = model_data["ref_corpuses"]
hyperparams = model_data["hyperparams"]

In [None]:
# step 6 - train model
# ref_corpuses (loaded in step 5) holds the reference translations
model = train(hyperparams, train_batches, dev_batches, ref_corpuses, idx_to_trg_word, checkpoint_path, save=True)

In [None]:
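# step 7 - predict test set
# can load a checkpoint rather than using the previous cell's model.
# NOTE: early_stopping is not defined earlier in this notebook; the value below is an
# assumption, so set it to match how the model was trained.
early_stopping = True
name = "best_model" if early_stopping else "most_recent_model"
load_checkpoint(hyperparams, checkpoint_path, name)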

In [None]:
# step 8 - evaluate model
!pip install sacrebleu


In [None]:
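import torch

# requires a GPU runtime; creates the tensor directly on the first CUDA device
x = torch.tensor([1,2,3], device="cuda:0")
x

In [None]:
# returns a copy of x in CPU memory; x itself is not moved
x.cpu()

In [None]:
# x is still on cuda:0
x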