<a href="https://colab.research.google.com/drive/1EAovK1wc4DtuEXaL2iwLSttUrlQZDuY4?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"></a>

If we want to save data to Google drive, first we should do this:

In [1]:
import os
from google.colab import drive

In [2]:
drive.mount('/content/drive')
os.chdir('drive/My Drive/ChukchiLM')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Download data and baseline implementation from Github:

In [7]:
!git clone https://github.com/ftyers/global-classroom.git

Cloning into 'global-classroom'...
remote: Enumerating objects: 42, done.[K
remote: Counting objects: 100% (42/42), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 306 (delta 10), reused 39 (delta 7), pack-reused 264[K
Receiving objects: 100% (306/306), 1.61 MiB | 20.08 MiB/s, done.
Resolving deltas: 100% (131/131), done.


# Baseline

In [None]:
os.listdir()

['global-classroom']

In [3]:
os.chdir('global-classroom/chukchi/baseline')
!python3 train.py ../data/train.tsv model.dat
!python3 predict.py model.dat < ../data/dev.tsv > output.tsv
!python3 ../evaluate.py ../data/dev.tsv output.tsv 

Written 33724 unigrams and 107835 bigrams to model.dat.
Hits: 33 ; Tokens: 4504
Characters: 37897
Tokens: 8788
Clicks: 37754
Clicks/Token: 4.2960855712335
Clicks/Character: 0.9962266142438716


In [None]:
!python3 predict.py model.dat < ../data/test/test.tsv > output_test.tsv
!python3 ../evaluate.py ../data/test/test.tsv output_test.tsv

Hits: 33 ; Tokens: 4187
Characters: 37927
Tokens: 8374
Clicks: 37768
Clicks/Token: 4.510150465727251
Clicks/Character: 0.995807735913729


In [None]:
!sed 20q < ../data/dev.tsv output.tsv 

ӄԓявыԓя риӄукэтэ ивнин ытри ынкы варкыт гынин ӈэвъэн гэԓгыԓин мэмыԓя рагтыгъэ	ӄԓявыԓ>я риӄукэ>тэ ив>ни>н ытри ын>кы варкы>т гын>ин ӈэвъэ>н гэ>ԓгы>ԓин мэмыԓ>я ра>гты>гъэ
ӄԓявыԓ ынкъам купрэн ынанъомрычьын	ӄԓявыԓ ынкъам купрэн ынанъомры>чьы>н
рытэнмавнэн ынӄо эргатык эквэтгъэт копрантыватысӄэквъат	рытэ>нм>ав>нэн ынӄо эрг>атык эквэт>гъэ>т копра>нтыв>аты>сӄэк>въа>т
риӄукэтэ гамгаваны нэнайӈоткоӄэн ӄоԓ	риӄукэ>тэ га>мгав>аны н>эна>йӈо>тко>ӄэн ӄоԓ
ԓьунин ынӄо йыӈотконэн	ԓьу>ни>н ынӄо йыӈо>тко>нэ>н
 эвын коԓё	 эвын коԓё
тыкгаткэта ивнин риӄукэтэ	ты>кгат>кэ>та ив>ни>н риӄукэ>тэ
йыӈотконэн эвын коԓё	йыӈо>тко>нэ>н эвын коԓё
тыкгаткэта	ты>кгат>кэ>та
ивнин риӄукэтэ ынкы	ив>ни>н риӄукэ>тэ ын>кы
рытваннэнат риӄукэтэ ивнин	рытва>н>нэ>на>т риӄукэ>тэ ив>ни>н
мачынан ынкы нытваркын ынӄо	мачы>нан ын>кы ны>тва>ркын ынӄо
эргатык купрэт ёпаннэнат эвын гакваԓен	эрг>атык купрэт ё>пан>нэ>на>т эвын гаква>ԓен
ӈавмэмыԓчыӈын ынӄо рырагтаннэн ынкъам	ӈавмэмы>ԓчы>ӈы>н ынӄо ры>ра>гт>ан>нэ>н ынкъам
тэнуйгын йыннин	тэну>

In output.tsv we should have the original sentence and model output separated by tab. If the model's next word guess was right we append the whole word to output. If the model's guess was wrong, we append to the output each character separately. Between words we should add an underscore

# Our model

## Data loading

In [4]:
## Go three directories up
os.chdir('..')
os.chdir('..')
os.chdir('..')

In [5]:
import os
import re

import torch as tt
import torch.nn as nn
import torch.optim as optim
import pandas as pd

from math import ceil
from tqdm import tqdm_notebook

from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from torch.nn.utils import clip_grad_norm_

from collections import Counter

import numpy as np

In [None]:
os.listdir("global-classroom/chukchi/data")

['dev.tsv', 'test', 'train.tsv']

Now, we load the train data and write both word tokenized and morph tokenized sentences to two lists - word_sentences and morph_sentences

In [6]:
with open("global-classroom/chukchi/data/train.tsv") as inp:
  word_sentences = []
  morph_sentences = []
  for line in inp.readlines():
    words, morphs = line.strip('\n').split('\t')
    word_sentences.append(words.split(' '))
    morph_sentences.append(morphs.replace('>',' >').split(' '))

with open("global-classroom/chukchi/data/dev.tsv") as inp:
  dev_word_sentences = []
  dev_morph_sentences = []
  for line in inp.readlines():
    words, morphs = line.strip('\n').split('\t')
    dev_word_sentences.append(words.split(' '))
    dev_morph_sentences.append(morphs.replace('>',' >').split(' '))

with open("global-classroom/chukchi/data/test/test.tsv") as inp:
  test_word_sentences = []
  test_morph_sentences = []
  for line in inp.readlines():
    words, morphs = line.strip('\n').split('\t')
    test_word_sentences.append(words.split(' '))
    test_morph_sentences.append(morphs.replace('>',' >').split(' '))

In [None]:
len(morph_sentences), len(dev_morph_sentences), len(test_morph_sentences)

(30000, 1000, 1006)

Let's see how our data looks by outputting the first three examples:

In [None]:
print(word_sentences[:3])

[['амаравкэваратэн', 'таа’койӈын'], ['йъйыӄык', 'ныӄэԓпэратӄэн', 'вытэчгытрыӄэргыԓьын', 'йыӈэттэт'], ['мыкыӈ', 'нывытрэтӄин', 'чеԓгатвытрыԓьо', 'ынӄорыым', 'вытэчгытрыԓьо']]


In [None]:
print(morph_sentences[:3])

[['а', '>маравкэва', '>ра', '>тэн', 'таа', '>’ко', '>йӈы', '>н'], ['йъйыӄы', '>к', 'ны', '>ӄэԓпэр', '>ат', '>ӄэн', 'вытэч', '>гытры', '>ӄэргы', '>ԓьы', '>н', 'йыӈэт', '>тэ', '>т'], ['мык', '>ы', '>ӈ', 'ны', '>вытрэт', '>ӄин', 'чеԓг', '>ат', '>вытры', '>ԓь', '>о', 'ынӄор', '>ыым', 'вытэч', '>гытры', '>ԓь', '>о']]


Now, let's build a vocab. Let's count every unique word in sentences and create a two-way map from words to indices:

In [9]:
vocab = Counter([morph for sent in morph_sentences for morph in sent])

In [10]:
len(vocab), sum([v for k,v in vocab.items()])

(19487, 318530)

In [11]:
pd.Series(vocab).describe()

count    19487.000000
mean        16.345769
std        189.081036
min          1.000000
25%          1.000000
50%          1.000000
75%          4.000000
max      15991.000000
dtype: float64

50% of words appear no more than once - might consider throwing them out

In [12]:
id2word = [i for i in vocab] + ['<bos>','<eos>','<pad>','<unk>']
word2id = {v:k for k,v in enumerate(id2word)}

Here we use some special tokens:

<b>\<bos\></b> - Begininng of sequence - so the model could predct first word in sentence

<b>\<eos\></b> - End of sequence - so the model would know where to stop

<b>\<pad\></b> - If the sentences in batch are of different length (which is almost always the case) insert \<pad\> tokens right before eos until the sentences are of equal length

In [13]:
word2id['<bos>'], word2id['<eos>'], word2id['<pad>']

(19487, 19488, 19489)

Now let's see the ditribution of sentence length in the training set:

In [None]:
sent_lens = [len(sent) for sent in morph_sentences]

In [None]:
pd.Series(sent_lens).describe()

count    30000.000000
mean        10.617667
std          7.877840
min          1.000000
25%          5.000000
50%          9.000000
75%         14.000000
max         94.000000
dtype: float64

The average length of a sentence in dataset is 10.6 tokens, whereas the maximum length is 94 tokens.

## Model definition

Let's define our model

In [14]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(MyModel, self).__init__()
        ## Learnable parameters:

        # Embeddings:
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        # LSTM layer - the recurrent part:
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden_size,
                           bidirectional=True,
                           batch_first=True)
        
        # Fully connected (linear) layer - for final decision
        self.fc = nn.Linear(hidden_size * 2, vocab_size)
        
        # Set the initial values of weigths
        self.init_weights()
        
    def init_weights(self):
        '''Sets intitial values of model parameters'''
        nn.init.uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
        
    def forward(self, x):
        # how many sentences are in batch
        batch_size = x.size(1)
        # get length of a single sentence
        total_length = x.size(0)
        
        # apply embedding layer to inputs:
        x = self.embedding(x)

        # apply recurrent layer to inputs:  
        x, _ = self.rnn(x)
        
        # apply linear layer to inputs:
        x = self.fc(x)
        
        return x#.transpose(1,2)

## Training utils definition

In [15]:
def _train_epoch(model, iterator, optimizer, criterion, curr_epoch, device, clip,
                 pad_id):
    '''
    Function that runs model on train set for one epoch,
    backpropagates gradient, updates weigths
    and calcultes scores for train set

    Arguments:

    model - instance of MyModel
    iterator - instance of MyBatchIterator
    optimizer - instance of Torch optimizer (implements update of model weights)
    criterion - Loss function from Torch
    curr_epoch - number of current training epoch
    device - an interface to device that performs calculations (CPU or GPU)
    clip - a constant by which we clip the model gradient (to prevent it from 'exploding')
    pad_id - index of padding token in our mapping
    '''

    model.train()
    
    epoch_loss, accuracy, f1score = 0, 0, 0

    n_batches = len(iterator)
    iterator = tqdm_notebook(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, (batch_in, batch_out) in enumerate(iterator):
        optimizer.zero_grad()
        
        x = batch_in.to(device)
        pred = model(x)
        
        y = batch_out.to(device)

        loss = criterion(pred.transpose(2,1), y)

        loss.backward()

        clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        curr_loss = loss.data.cpu().detach().item()

        y_flat, ypred_flat = batch_out.flatten().numpy(), pred.detach().cpu().argmax(dim=2).flatten().numpy()

        curr_acc = accuracy_score(y_flat, ypred_flat)
        curr_f1 = f1_score(y_flat, ypred_flat, average='micro')

        accuracy += curr_acc
        f1score += curr_f1

        iterator.set_postfix(loss='%.5f' % curr_loss, acc='%.5f' % curr_acc, f1='%.5f'%curr_f1)

    return epoch_loss/n_batches, accuracy/n_batches, f1score/n_batches

def _test_epoch(model, iterator, criterion, device, pad_id):
    '''Function that runs model on test set for one epoch and calcualtes scores on it

    Arguments:

    model - instance of MyModel
    iterator - instance of MyBatchIterator
    criterion - Loss function from Torch
    device - an interface to device that performs calculations (CPU or GPU)
    pad_id - index of padding token in our mapping'''
    model.eval()
    epoch_loss = 0

    epoch_loss, accuracy, f1score = 0, 0, 0

    n_batches = len(iterator)
    with tt.no_grad():
        for batch_in, batch_out in iterator:
            
            x, y = batch_in.to(device), batch_out.to(device)
            pred = model(x)
            loss = criterion(pred.transpose(2,1), y)
            epoch_loss += loss.data.cpu().detach().item()

            y_flat, ypred_flat = batch_out.flatten().numpy(), pred.detach().cpu().argmax(dim=2).flatten().numpy()

            curr_acc = accuracy_score(y_flat, ypred_flat)
            curr_f1 = f1_score(y_flat, ypred_flat, average='micro')

            accuracy += curr_acc
            f1score += curr_f1

    return epoch_loss / n_batches, accuracy/n_batches, f1score/n_batches


def nn_train(model, train_iterator, valid_iterator, criterion, optimizer, device, n_epochs=100,
          scheduler=None, early_stopping=0, clip=1.0, pad_id=word2id['<pad>']):
    '''
    Train the model for specified number of epochs

    Arguments:

    model - instance of MyModel
    train_iterator - instance of MyBatchIterator with training data
    valid_iterator - instance of MyBatchIterator with validation data
    criterion - Loss function from Torch
    optimizer - instance of Torch optimizer (implements update of model weights)
    curr_epoch - number of current training epoch
    device - an interface to device that performs calculations (CPU or GPU)
    n_epochs - number of training epochs
    scheduler - a PyTorch learning rate scheduler (for manipulating learning rate)
    early_stopping - number of epochs to wait before stopping training if the loss doesn't reduce
    clip - a constant by which we clip the model gradient (to prevent it from 'exploding')
    pad_id - index of padding token in our mapping
    '''

    prev_loss = float('inf')

    es_epochs = 0
    best_epoch = None
    history = pd.DataFrame()

    for epoch in range(n_epochs):
        train_loss, train_acc, train_f1 = _train_epoch(model, train_iterator, optimizer, criterion, epoch, device, clip, pad_id)

        print(f"Epoch {epoch} Training loss: {np.round(train_loss, 4)} Training accuracy: {np.round(train_acc, 4)} Training F1: {np.round(train_f1, 4)}")

        valid_loss, valid_acc, valid_f1 = _test_epoch(model, valid_iterator, criterion, device, pad_id)

        print(f"Epoch {epoch} Validation loss: {np.round(valid_loss, 4)} Validation accuracy: {np.round(valid_acc, 4)} Validation F1: {np.round(valid_f1, 4)}")

        if valid_loss < prev_loss:
          print("New record! Saving model")
          tt.save(model, 'lstm_morph_model')

        record = {'epoch': epoch, 'train_loss': train_loss, 'train_acc': train_acc, 'train_f1': train_f1,
                  'valid_loss': valid_loss, 'valid_acc': valid_acc, 'valid_f1': valid_f1}
        history = history.append(record, ignore_index=True)

        if early_stopping > 0:
            if valid_loss > prev_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = history[history.valid_loss == history.valid_loss.min()].iloc[0]
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break
                
        prev_loss = min(prev_loss, valid_loss)
    return history

## Prepare data for training:

In [17]:
def code_sentences(sentences, mapping):
  '''
  Map tokens in our sentences to their indices
  '''
  out = []
  for sent in sentences:
    sent_out = []
    for word in sent:
      if word in mapping:
        sent_out.append(mapping[word])
      else:
        sent_out.append(mapping['<unk>'])
    out.append(sent_out)
  return out

def make_lm_dataset(sentences, word2id, seq_len):
  '''Prepares data - creates sequences of selected length'''
  text_new = []
  for sent in sentences:
    new_sent = ['<bos>'] + sent
    text_new += new_sent
    text_new += ['<pad>' for i in range(seq_len)]
  
  sents_in = []
  sents_out = []

  for word_id in range(len(text_new)-1):
    sents_in.append(text_new[word_id:word_id+seq_len])
    sents_out.append(text_new[word_id+1:word_id+seq_len+1])
  return code_sentences(sents_in, word2id), code_sentences(sents_out, word2id)

In [22]:
def pad_sentences(sents, pad_id, max_len, append_eos=False):
  '''Function that makes sentences same length'''
  for sent_id, sent in enumerate(sents):
    if len(sent) < max_len:
      sents[sent_id] = sent + [pad_id for i in range(max_len-len(sent))]
    elif len(sent) < max_len:
      sents[sent_id] = sent[:max_len]
  return sents


def make_batch(batch_in, batch_out, pad_id, max_len):
   '''
   Function that converts Python lists to Torch tensors

   batch_in - list of sequences (w_i,...,w_j) words from original text
   batch_out - list of sequnces of (w_{i+1},...,w_{j+1}) words from original text
   pad_id - id of <pad> token
   max_len - maximum length of sentence
   '''
   max_len = min(max_len, max([len(sent) for sent in batch_in]))
   batch_in = pad_sentences(batch_in, pad_id, max_len)
   batch_out = pad_sentences(batch_out, pad_id, max_len)
   return tt.LongTensor(batch_in), tt.LongTensor(batch_out)

class MyBatchIterator:
  def __init__(self, sents_in, sents_out, batch_size=128,
              pad_id=word2id['<pad>'], max_len=5):
    '''
    An object that will return a new portion of our input and target data with every loop
    '''
    self.sents_in = sents_in
    self.sents_out = sents_out
    self.batch_size = batch_size
    self.pad_id = pad_id
    self.max_len = max_len
  
  def __iter__(self):
    self.start = 0
    return self
  
  def __next__(self):
    if self.start >= len(self.sents_in):
      raise StopIteration
    batch_in = self.sents_in[self.start:self.start+self.batch_size]
    batch_out = self.sents_out[self.start:self.start+self.batch_size]
    self.start += self.batch_size
    return make_batch(batch_in, batch_out, self.pad_id, self.max_len)
  
  def __len__(self):
    return ceil(len(self.sents_in)/self.batch_size)

A hyperparameter: length of input sequence

In [23]:
SEQ_LEN = 5

Let's prepare our dataset:

In [24]:
sents_in, sents_out = make_lm_dataset(morph_sentences, word2id, SEQ_LEN)
sents_in_dev, sents_out_dev = make_lm_dataset(dev_morph_sentences, word2id, SEQ_LEN)
sents_in_test, sents_out_test = make_lm_dataset(test_morph_sentences, word2id, SEQ_LEN)

In [None]:
sents_in[:4]

[[19487, 0, 1, 2, 3], [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]

In [None]:
sents_out[:4]

[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]

In [None]:
X_train, y_train = sents_in, sents_out
X_dev, y_dev = sents_in_dev, sents_out_dev
X_test, y_test = sents_in_test, sents_out_test

In [None]:
len(X_train), len(X_test)

(498529, 16567)

## Training model on data

Create batch iterators for data:

In [None]:
train_iter = MyBatchIterator(X_train[:-1], y_train[:-1], max_len=SEQ_LEN)
test_iter = MyBatchIterator(X_dev, y_dev, max_len=SEQ_LEN)

Select device for training:

In [None]:
device = tt.device('cuda')

Initalize model:

In [None]:
model = MyModel(len(word2id), 100, 100).to(device)

In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=word2id["<pad>"]).to(device)
optimizer = tt.optim.Adam(model.parameters())
scheduler = tt.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)

Start training:

In [None]:
history_df = nn_train(model, train_iter, test_iter, criterion, optimizer, device, n_epochs=10, scheduler=scheduler)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if __name__ == '__main__':


HBox(children=(FloatProgress(value=0.0, description='epoch 0', max=3895.0, style=ProgressStyle(description_wid…


Epoch 0 Training loss: 0.0 Training accuracy: 0.3038 Training F1: 0.3038
Epoch 0 Validation loss: 2.5042 Validation accuracy: 0.4712 Validation F1: 0.4712
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 1', max=3895.0, style=ProgressStyle(description_wid…


Epoch 1 Training loss: 0.0 Training accuracy: 0.4927 Training F1: 0.4927
Epoch 1 Validation loss: 2.0647 Validation accuracy: 0.5265 Validation F1: 0.5265
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 2', max=3895.0, style=ProgressStyle(description_wid…


Epoch 2 Training loss: 0.0 Training accuracy: 0.5415 Training F1: 0.5415
Epoch 2 Validation loss: 1.9748 Validation accuracy: 0.5474 Validation F1: 0.5474
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 3', max=3895.0, style=ProgressStyle(description_wid…


Epoch 3 Training loss: 0.0 Training accuracy: 0.5681 Training F1: 0.5681
Epoch 3 Validation loss: 1.9498 Validation accuracy: 0.5573 Validation F1: 0.5573
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 4', max=3895.0, style=ProgressStyle(description_wid…


Epoch 4 Training loss: 0.0 Training accuracy: 0.5894 Training F1: 0.5894
Epoch 4 Validation loss: 1.9361 Validation accuracy: 0.5613 Validation F1: 0.5613
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 5', max=3895.0, style=ProgressStyle(description_wid…


Epoch 5 Training loss: 0.0 Training accuracy: 0.6041 Training F1: 0.6041
Epoch 5 Validation loss: 1.9251 Validation accuracy: 0.5636 Validation F1: 0.5636
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 6', max=3895.0, style=ProgressStyle(description_wid…


Epoch 6 Training loss: 0.0 Training accuracy: 0.6117 Training F1: 0.6117
Epoch 6 Validation loss: 1.9177 Validation accuracy: 0.5647 Validation F1: 0.5647
New record! Saving model


HBox(children=(FloatProgress(value=0.0, description='epoch 7', max=3895.0, style=ProgressStyle(description_wid…


Epoch 7 Training loss: 0.0 Training accuracy: 0.6156 Training F1: 0.6156
Epoch 7 Validation loss: 1.9198 Validation accuracy: 0.5648 Validation F1: 0.5648


HBox(children=(FloatProgress(value=0.0, description='epoch 8', max=3895.0, style=ProgressStyle(description_wid…


Epoch 8 Training loss: 0.0 Training accuracy: 0.618 Training F1: 0.618
Epoch 8 Validation loss: 1.9366 Validation accuracy: 0.5647 Validation F1: 0.5647


HBox(children=(FloatProgress(value=0.0, description='epoch 9', max=3895.0, style=ProgressStyle(description_wid…


Epoch 9 Training loss: 0.0 Training accuracy: 0.6199 Training F1: 0.6199
Epoch 9 Validation loss: 1.9216 Validation accuracy: 0.5649 Validation F1: 0.5649


## Example of model prediction:

Now let's write code for text prediction

In [25]:
class LMPredictor:
  def __init__(self, model, id2word, word2id):
    self.model = model
    self.id2word = id2word
    self.word2id = word2id

  def predict(self, sent):
    sent = [self.word2id['<bos>']] + [self.word2id[word] if word in self.word2id else self.word2id['<unk>'] for word in sent]
    sent = [sent]
    sent = tt.LongTensor(sent)
    output = model(sent)
    new_word = output.argmax(dim=2)[0][-1]
    return self.id2word[new_word]

In [26]:
device = tt.device('cpu')
model = tt.load('lstm_morph_model').to(device)

In [27]:
predictor = LMPredictor(model, id2word, word2id)

Let's see how model works on one sentence from dev set

In [28]:
sent = dev_morph_sentences[0]
for i in range(len(sent)):
  feed = sent[:i]
  predicted = predictor.predict(feed)
  print(f"Item: {sent[i]} Predicted: {predicted}")

Item: ӄԓявыԓ Predicted: ны
Item: >я Predicted: >я
Item: риӄукэ Predicted: ынкъам
Item: >тэ Predicted: >тэ
Item: ив Predicted: ив
Item: >ни Predicted: >ни
Item: >н Predicted: >н
Item: ытри Predicted: ӈин
Item: ын Predicted: гэ
Item: >кы Predicted: >кы
Item: варкы Predicted: >ри
Item: >т Predicted: >т
Item: гын Predicted: ынкъам
Item: >ин Predicted: >ин
Item: ӈэвъэ Predicted: >рин
Item: >н Predicted: >н
Item: гэ Predicted: ынкъам
Item: >ԓгы Predicted: >мо
Item: >ԓин Predicted: >ԓин
Item: мэмыԓ Predicted: ԓыгэн
Item: >я Predicted: >тэ
Item: ра Predicted: ынкъам
Item: >гты Predicted: >гты
Item: >гъэ Predicted: >гъэ


## Scoring on test set

Let's count word accuracy on test set:

In [29]:
## count accuracy
acc = 0
output = []
dev_test = []
for sent in tqdm_notebook(test_morph_sentences, total=len(test_morph_sentences)):
  sent_acc = 0
  for i in range(len(sent)):
    feed = sent[:i]
    predicted = predictor.predict(feed)
    if predicted == sent[i]:
 #     if predicted == '<eos>':
 #       continue
 #     sent_acc += 1
 #     else:
      output.append(predicted)
      sent_acc += 1
    else:
      output += [c for c in sent[i]]
    if i!=0:
      if i == len(sent)-1:
        output.append('_')
      elif not sent[i+1].startswith('>'):
        output.append('_')
  output.append('\n')
  acc += sent_acc/len(sent)
acc = acc/len(test_morph_sentences)
listToStr = ' '.join(map(str, output)).replace('>',' ')
listToStr = re.sub(' +', ' ', listToStr)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  """


HBox(children=(FloatProgress(value=0.0, max=1006.0), HTML(value='')))




In [30]:
output1 = list(listToStr.split('\n'))

In [31]:
output1[:10]

['г ы м нин _ ы т л ь а т _ ы ’ т т ъ у в и _ ы н к ъ а м _ о ’ м р ы т в а а л _ ',
 ' п у у р ъ у э п ы _ г ы м нин _ о ’ м _ о ’ м р ы к в о т _ н э м ы ӄ э й _ ӄ о р а г ы н р э т ы л ь о _ н и т ӄин _ ',
 ' г ы м нин _ ы т л ы г ы т _ ы т л ь ’ а т э _ ӄ о н п ы _ э м н у ӈ кы _ н ы м и г ч и р эт ӄ и н э т _ н э м ы ӄ э й _ ӄ о р а г ы н р э т ы л ь о _ н и т ӄ и н э т _ ',
 ' ы н к ъ а м м у р и _ г а ч а к э т т о м г а _ г э _ г э е г т э л ь м у р и _ к а в р а г ы р г ы н _ ы н к ъ а м _ г ы м _ ',
 ' ы т р ъ э ч е _ ӈ и р э ӄ _ н э н э н э т _ в а г ъ э т _ ',
 ' ы т р ъ э ч ',
 ' к ы т у р ы н ӈ ы т э ӄ э _ м э д в э д э в _ п р э з и д э н т о _ э н м а _ к а н ч а л я н э т ы _ г э р э м к и ч и л ь и н _ ',
 ' ы н к ъ а л ю у т _ ӈ ы р ъ а _ в э р т а л ё т т э _ в а к ъ о г ъ а т ӈ а _ ',
 ' ны н ы п ч е ӈ и в й и в ӄ и н _ к о л ё _ н ы м н ы м а _ м э д в э д э в ъ ы м _ ',
 ' г э ч е в кы _ н ы н т ы ӄ и н _ т а ӈ к о л ё _ ы н кы _ н ы г ы н р и т ӄ и н _ ']

In [None]:
!sed 10q < global-classroom/chukchi/data/test/test.tsv

гымнин ытльат ы’ттъуви ынкъам о’мрытваал	гым>нин ытльа>т ы’ттъуви ынкъам о’мрытваал>
пууръу эпы гымнин о’м о’мрыквот нэмыӄэй ӄорагынрэтыльо нитӄин	пууръу эпы гым>нин о’м о’мрыквот нэмыӄэй ӄора>гынрэты>ль>о н>ит>ӄин
гымнин ытлыгыт ытль’атэ ӄонпы эмнуӈкы нымигчирэтӄинэт нэмыӄэй ӄорагынрэтыльо нитӄинэт	гым>нин ытлыгы>т ытль’а>т>э ӄонпы эмнуӈ>кы ны>мигчир>эт>ӄинэ>т нэмыӄэй ӄора>гынрэты>ль>о нитӄинэт
ынкъам мури гачакэттомга гэ гэегтэльмури каврагыргын ынкъам гым	ынкъам мури га>чакэт>томг>а гэ гэ>егтэль>мури каврагыргы>н ынкъам гым
ытръэче ӈирэӄ нэнэнэт вагъэт	ытръэч>е ӈирэӄ нэнэнэ>т ва>гъэ>т
ытръэч	ытръэч
кытур ынӈытэӄэ мэдвэдэв прэзидэнто энма канчалянэты гэрэмкичильин	кытур ынӈытэӄ>э мэдвэдэв прэзидэнт>о эн>ма канчалян>эты гэ>рэмкичи>льин
ынкъа люут ӈыръа вэрталёттэ вакъогъатӈа	ынкъа люут ӈыръа вэрталёт>тэ вакъо>гъа>т>ӈа
ныныпчеӈивйивӄин колё нымныма мэдвэдэвъым	ны>ны>пчеӈи>в>йив>ӄин колё нымным>а мэдвэдэв>>ъым
гэчевкы нынтыӄин таӈколё ынкы ныгынритӄин	гэчев>кы ны>нты>ӄин таӈ>колё ын>кы 

In [32]:
with open("global-classroom/chukchi/data/test/test.tsv") as inp:
  for line in inp.readlines():
    words, morphs = line.strip('\n').split('\t')
    dev_test.append(words)
  dev_test.append('\n')
import csv
with open('output_test_lstm.tsv', 'w') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t')
    tsv_writer.writerows(zip(dev_test, output1))

In [33]:
acc

0.11955693984085125

In [34]:
!sed 10q < output_test_lstm.tsv

гымнин ытльат ы’ттъуви ынкъам о’мрытваал	г ы м нин _ ы т л ь а т _ ы ’ т т ъ у в и _ ы н к ъ а м _ о ’ м р ы т в а а л _ 
пууръу эпы гымнин о’м о’мрыквот нэмыӄэй ӄорагынрэтыльо нитӄин	 п у у р ъ у э п ы _ г ы м нин _ о ’ м _ о ’ м р ы к в о т _ н э м ы ӄ э й _ ӄ о р а г ы н р э т ы л ь о _ н и т ӄин _ 
гымнин ытлыгыт ытль’атэ ӄонпы эмнуӈкы нымигчирэтӄинэт нэмыӄэй ӄорагынрэтыльо нитӄинэт	 г ы м нин _ ы т л ы г ы т _ ы т л ь ’ а т э _ ӄ о н п ы _ э м н у ӈ кы _ н ы м и г ч и р эт ӄ и н э т _ н э м ы ӄ э й _ ӄ о р а г ы н р э т ы л ь о _ н и т ӄ и н э т _ 
ынкъам мури гачакэттомга гэ гэегтэльмури каврагыргын ынкъам гым	 ы н к ъ а м м у р и _ г а ч а к э т т о м г а _ г э _ г э е г т э л ь м у р и _ к а в р а г ы р г ы н _ ы н к ъ а м _ г ы м _ 
ытръэче ӈирэӄ нэнэнэт вагъэт	 ы т р ъ э ч е _ ӈ и р э ӄ _ н э н э н э т _ в а г ъ э т _ 
ытръэч	 ы т р ъ э ч 
кытур ынӈытэӄэ мэдвэдэв прэзидэнто энма канчалянэты гэрэмкичильин	 к ы т у р ы н ӈ ы т э ӄ э _ м э д в э д э в _ п р э з и д э н т о

In [35]:
!python3 global-classroom/chukchi/evaluate.py global-classroom/chukchi/data/test/test.tsv output_test_lstm.tsv 

Characters: 37927
Tokens: 8374
Clicks: 37358
Clicks/Token: 4.461189395748746
Clicks/Character: 0.9849974951881245


## Scoring on dev set

In [36]:
## count accuracy
dev_acc = 0
output = []
dev_test = []
for sent in tqdm_notebook(dev_morph_sentences, total=len(dev_morph_sentences)):
  sent_acc = 0
  for i in range(len(sent)):
    feed = sent[:i]
    predicted = predictor.predict(feed)
    if predicted == sent[i]:
 #     if predicted == '<eos>':
 #       continue
 #     sent_acc += 1
 #     else:
      output.append(predicted)
      sent_acc += 1
    else:
      output += [c for c in sent[i]]
    if i!=0:
      if i == len(sent)-1:
        output.append('_')
      elif not sent[i+1].startswith('>'):
        output.append('_')
  output.append('\n')
  dev_acc += sent_acc/len(sent)
dev_acc = dev_acc/len(dev_morph_sentences)
listToStr = ' '.join(map(str, output)).replace('>',' ')
listToStr = re.sub(' +', ' ', listToStr)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  """


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))




In [37]:
output1 = list(listToStr.split('\n'))

In [38]:
dev_acc

0.2760651385429509

In [39]:
with open("global-classroom/chukchi/data/dev.tsv") as inp:
  for line in inp.readlines():
    words, morphs = line.strip('\n').split('\t')
    dev_test.append(words)
  dev_test.append('\n')
import csv
with open('output_dev_lstm.tsv', 'w') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t')
    tsv_writer.writerows(zip(dev_test, output1))

In [40]:
!sed 10q < output_dev_lstm.tsv

ӄԓявыԓя риӄукэтэ ивнин ытри ынкы варкыт гынин ӈэвъэн гэԓгыԓин мэмыԓя рагтыгъэ	ӄ ԓ я в ы ԓ я _ р и ӄ у к э тэ _ ив ни н _ ы т р и _ ы н кы _ в а р к ы т _ г ы н ин _ ӈ э в ъ э н _ г э ԓ г ы ԓин _ м э м ы ԓ я _ р а гты гъэ _ 
ӄԓявыԓ ынкъам купрэн ынанъомрычьын	 ӄ ԓ я в ы ԓ ы н к ъ а м _ к у п р э н _ ы н а н ъ о м р ы ч ь ы н _ 
рытэнмавнэн ынӄо эргатык эквэтгъэт копрантыватысӄэквъат	 р ы т э нм ав н э н _ ы н ӄ о _ э р г атык _ э к в э т г ъ э т _ к о п р а н т ы в аты с ӄ э к в ъ а т _ 
риӄукэтэ гамгаваны нэнайӈоткоӄэн ӄоԓ	 р и ӄ у к э т э _ г а м г а в а н ы _ н эна й ӈ о т к о ӄэн _ ӄ о ԓ _ 
ԓьунин ынӄо йыӈотконэн	 ԓ ь у ни н _ ы н ӄ о _ й ы ӈ о тко н э н _ 
 эвын коԓё	 эвын _ к о ԓ ё _ 
тыкгаткэта ивнин риӄукэтэ	 т ы к г а т к э т а _ и в ни н _ р и ӄ у к э тэ _ 
йыӈотконэн эвын коԓё	 й ы ӈ о тко н э н _ э в ы н _ к о ԓ ё _ 
тыкгаткэта	 т ы к г а т к э т а _ 
ивнин риӄукэтэ ынкы	 и в ни н _ р и ӄ у к э тэ _ ы н кы _ 


In [41]:
!python3 global-classroom/chukchi/evaluate.py global-classroom/chukchi/data/dev.tsv output_dev_lstm.tsv 

Characters: 37897
Tokens: 8788
Clicks: 35372
Clicks/Token: 4.025034137460173
Clicks/Character: 0.9333720347257038
