# Language modelling


We will train two different character-level models for generating dinosaur names:
* a model based on character bigrams
* an ***RNN*** model.


## Bigram model


In [None]:
!wget https://raw.githubusercontent.com/artemovae/NLP-seminar-LM/master/dinos.txt




In [None]:
!cat dinos.txt | tail

Zhuchengtyrannus
Ziapelta
Zigongosaurus
Zizhongosaurus
Zuniceratops
Zunityrannus
Zuolong
Zuoyunlong
Zupaysaurus
Zuul

In [None]:
names = ['<' + name.strip().lower() + '>' for name in open('dinos.txt').readlines()]
print(names[:10])

['<aachenosaurus>', '<aardonyx>', '<abdallahsaurus>', '<abelisaurus>', '<abrictosaurus>', '<abrosaurus>', '<abydosaurus>', '<acanthopholis>', '<achelousaurus>', '<acheroraptor>']


In [None]:
import nltk

Compute the frequency of each character in the corpus of dinosaur names.

In [None]:
chars = [char for name in names for char in name]
freq = nltk.FreqDist(chars)

In [None]:
freq

FreqDist({'a': 2487, 's': 2285, 'u': 2123, 'o': 1710, 'r': 1704, '<': 1536, '>': 1536, 'n': 1081, 'i': 944, 'e': 913, ...})

In [None]:
print(list(freq.keys()))

['<', 'a', 'c', 'h', 'e', 'n', 'o', 's', 'u', 'r', '>', 'd', 'y', 'x', 'b', 'l', 'i', 't', 'p', 'v', 'm', 'g', 'f', 'j', 'k', 'w', 'z', 'q']


In [None]:
freq.most_common(10)

[('a', 2487),
 ('s', 2285),
 ('u', 2123),
 ('o', 1710),
 ('r', 1704),
 ('<', 1536),
 ('>', 1536),
 ('n', 1081),
 ('i', 944),
 ('e', 913)]

Define a function that computes the unigram (unconditional) probability of a character.

In [None]:
l = sum([freq[char] for char in freq])
def unigram_prob(char):
    return freq[char] / l

In [None]:
print('p(<) = %1.4f' % unigram_prob('<'))

p(<) = 0.0716


Compute the conditional probability of each character given the character at the previous position.

In [None]:
cfreq = nltk.ConditionalFreqDist(nltk.bigrams(chars))

In [None]:
cfreq['a']

FreqDist({'u': 791, 'n': 347, 't': 204, 's': 171, 'l': 138, '>': 138, 'r': 124, 'c': 100, 'p': 89, 'm': 68, ...})

Estimate the conditional probabilities with maximum likelihood estimation (MLE).
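For a bigram model, the MLE estimate is a relative frequency: P(next | prev) = count(prev, next) / count(prev, anything). The following is a minimal pure-Python sketch of that computation, not the `nltk` implementation used in this notebook. The names `bigram_mle` and `toy_names` are hypothetical, and unlike `nltk.bigrams(chars)` over the flattened corpus, this sketch counts pairs only within each name.

```python
from collections import Counter, defaultdict

def bigram_mle(names):
    """Estimate P(next_char | prev_char) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for name in names:
        # Count every adjacent character pair inside one name
        for prev, nxt in zip(name, name[1:]):
            pair_counts[prev][nxt] += 1
    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {nxt: count / total for nxt, count in counter.items()}
    return probs

toy_names = ['<abba>', '<baba>']  # hypothetical toy corpus, not dinos.txt
toy_probs = bigram_mle(toy_names)
print(toy_probs['a'])  # {'b': 0.5, '>': 0.5}
print(toy_probs['b'])
```

Each conditional distribution sums to 1 by construction, which is what allows sampling the next character from it.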

In [None]:
cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)

In [None]:
print('p(a | a) = %1.4f' % cprob['a'].prob('a'))
print('p(b | a) = %1.4f' % cprob['a'].prob('b'))
print('p(u | a) = %1.4f' % cprob['a'].prob('u'))

p(a | a) = 0.0044
p(b | a) = 0.0097
p(u | a) = 0.3181


In [None]:
cprob['a'].generate()

'n'

### Task 1.
a. Write a function that generates a dinosaur name of **fixed** length. Use '<' as the start-of-name symbol.

b. Write a function that generates dinosaur names of arbitrary length.

In [None]:
def generate_n_word(n=10):
    word = '<'    
    while (len(word) <= n):
        c = cprob[word[-1]].generate()
        if c not in {'<','>'}: word += c        
    word += '>'
    return word

In [None]:
generate_n_word()

'<sausanosar>'

In [None]:
def generate_any_word():
    word = '<'
    while True:
        # Sample the next character given the previous one
        c = cprob[word[-1]].generate()
        if c == '>':
            # End-of-name symbol: close the word and stop
            word += c
            break
        elif c != '<':
            word += c
    return word

In [None]:
generate_any_word()

'<ploochasaus>'

## Word n-gram models
(https://nlpforhackers.io/language-models/)

In [None]:
import nltk
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

nltk.download('reuters')
nltk.download('punkt')
!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora

first_sentence = reuters.sents()[0]
print (first_sentence) # [u'ASIAN', u'EXPORTERS', u'FEAR', u'DAMAGE', u'FROM' ...
 
# Get the bigrams
print (list(bigrams(first_sentence))) # [(u'ASIAN', u'EXPORTERS'), (u'EXPORTERS', u'FEAR'), (u'FEAR', u'DAMAGE'), (u'DAMAGE', u'FROM'), ...
 
# Get the padded bigrams
print (list(bigrams(first_sentence, pad_left=True, pad_right=True))) # [(None, u'ASIAN'), (u'ASIAN', u'EXPORTERS'), (u'EXPORTERS', u'FEAR'), (u'FEAR', u'DAMAGE'), (u'DAMAGE', u'FROM'),
 
# Get the trigrams
print (list(trigrams(first_sentence))) # [(u'ASIAN', u'EXPORTERS', u'FEAR'), (u'EXPORTERS', u'FEAR', u'DAMAGE'), (u'FEAR', u'DAMAGE', u'FROM'), ...
 
# Get the padded trigrams
print (list(trigrams(first_sentence, pad_left=True, pad_right=True))) # [(None, None, u'ASIAN'), (None, u'ASIAN', u'EXPORTERS'), (u'ASIAN', u'EXPORTERS', u'FEAR'), (u'EXPORTERS', u'FEAR', u'DAMAGE'), (u'FEAR', u'DAMAGE', u'FROM') ...


In [None]:
model = defaultdict(lambda: defaultdict(lambda: 0))
 
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
 
print (model["what", "the"]["economists"]) # "economists" follows "what the" 2 times
print (model["what", "the"]["nonexistingword"]) # 0 times
print (model[None, None]["The"]) # 8839 sentences start with "The"
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
 
print (model["what", "the"]["economists"]) # 0.0434782608696
print (model["what", "the"]["nonexistingword"]) # 0.0
print (model[None, None]["The"]) # about 0.1616
 

2
0
8839
0.043478260869565216
0.0
0.16155800478879934


The probability of a sequence is computed with the chain rule, as a product of conditional probabilities.
In a trigram model, the probability of word[i] given the two preceding words is P(word[i] | word[i-2], word[i-1]), which in our case equals model[(word[i-2], word[i-1])][word[i]].
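As an illustration of this chain-rule product, here is a small self-contained sketch on a toy corpus. It does not use the Reuters data, and the names `train_trigram_model`, `sentence_probability`, and `toy_corpus` are hypothetical. It pads with `None` in the same way as `trigrams(..., pad_left=True, pad_right=True)`, and for simplicity it does not multiply in the probability of the end-of-sentence padding.

```python
from collections import Counter, defaultdict

def train_trigram_model(sentences):
    """Return {(w1, w2): {w3: P(w3 | w1, w2)}}, with None used as padding."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = [None, None] + list(sent) + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return {ctx: {w: c / sum(counter.values()) for w, c in counter.items()}
            for ctx, counter in counts.items()}

def sentence_probability(model, sentence):
    """Chain rule: multiply P(w_i | w_{i-2}, w_{i-1}) over the words of the sentence."""
    prob = 1.0
    context = (None, None)
    for word in sentence:
        # Unseen contexts or words get probability 0 (no smoothing in this sketch)
        prob *= model.get(context, {}).get(word, 0.0)
        context = (context[1], word)
    return prob

toy_corpus = [['the', 'cat', 'sleeps', '.'],
              ['the', 'dog', 'sleeps', '.'],
              ['the', 'cat', 'eats', '.']]
toy_model = train_trigram_model(toy_corpus)
p_seen = sentence_probability(toy_model, ['the', 'cat', 'sleeps', '.'])
p_unseen = sentence_probability(toy_model, ['the', 'bird', 'sleeps', '.'])
print(p_seen, p_unseen)
```

Here `p_seen` is 1 * 2/3 * 1/2 * 1 = 1/3, while the sentence with an unseen word gets probability 0, which is the usual motivation for smoothing.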

Let’s add the probability computation in the generation script:

In [None]:
import random
 
 
text = [None, None]
prob = 1.0  # <- Init probability
 
sentence_finished = False
 
while not sentence_finished:
    r = random.random()
    accumulator = .0
 
    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
 
        if accumulator >= r:
            prob *= model[tuple(text[-2:])][word]  # <- Update the probability with the conditional probability of the new word
            text.append(word)
            break
 
    # Stop once the last generated token is sentence-final punctuation
    if text[-1] in ('.', '!', '?'):
        sentence_finished = True
print ("Probability of text=", prob)  # <- Print the probability of the text
print (' '.join([t for t in text if t]))

Probability of text= 7.804332920583214e-15
But he agreed in principle under which the seven - year high of 1 .


## Task 2

Write a function that generates a sentence and estimates its probability. Hint: the sentence must end with an appropriate punctuation mark.

## Recurrent neural networks (RNN)

Input sequence:

$x_{1:n} = x_1, x_2, \ldots, x_n$, $x_i \in \mathbb{R}^{d_{in}}$

For each input prefix $x_{1:i}$ we obtain an output $y_i$:

$y_i = RNN(x_{1:i})$, $y_i \in \mathbb{R}^{d_{out}}$

For the whole sequence $x_{1:n}$:

$y_{1:n} = RNN^{*}(x_{1:n})$, $y_i \in \mathbb{R}^{d_{out}}$

$R$ is the recurrent state-update function. It depends on two arguments: the current input $x_i$ and the previous state vector $s_{i-1}$.

$RNN^{*}(x_{1:n}, s_0) = y_{1:n}$

$y_i = O(s_i) = g(s_i W^{out} + b^{out})$

$s_i = R(s_{i-1}, x_i)$

$s_i = R(s_{i-1}, x_i) = g([s_{i-1}, x_i] W^{hid} + b^{hid})$, where $[s_{i-1}, x_i]$ denotes concatenation and all vectors are treated as rows

$x_i \in \mathbb{R}^{d_{in}}$, $y_i \in \mathbb{R}^{d_{out}}$, $s_i \in \mathbb{R}^{d_{hid}}$

$W^{hid} \in \mathbb{R}^{(d_{hid}+d_{in}) \times d_{hid}}$, $W^{out} \in \mathbb{R}^{d_{hid} \times d_{out}}$
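To make the shapes concrete, here is a minimal pure-Python sketch of one recurrent step unrolled over a short sequence. It assumes row vectors, a hidden weight matrix of shape (d_hid + d_in) x d_hid, tanh as the hidden activation, and an identity output activation (raw scores). All names and sizes (`rnn_step`, `rand_matrix`, `d_in = 3`, and so on) are hypothetical and chosen for illustration only.

```python
import math
import random

def rnn_step(s_prev, x, W_hid, b_hid, W_out, b_out):
    """One step: s_i = tanh([s_prev, x] W_hid + b_hid), y_i = s_i W_out + b_out."""
    combined = s_prev + x  # list concatenation [s_{i-1}, x_i]
    s = [math.tanh(sum(c * W_hid[k][j] for k, c in enumerate(combined)) + b_hid[j])
         for j in range(len(b_hid))]
    y = [sum(v * W_out[k][j] for k, v in enumerate(s)) + b_out[j]
         for j in range(len(b_out))]
    return s, y

d_in, d_hid, d_out = 3, 4, 2  # hypothetical toy sizes
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_hid = rand_matrix(d_hid + d_in, d_hid)  # shape (d_hid + d_in) x d_hid
W_out = rand_matrix(d_hid, d_out)         # shape d_hid x d_out
b_hid, b_out = [0.0] * d_hid, [0.0] * d_out

state = [0.0] * d_hid                     # initial state s_0
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot x_1..x_3
outputs = []
for x in inputs:
    # The same weights are reused at every position; only the state changes
    state, y = rnn_step(state, x, W_hid, b_hid, W_out, b_out)
    outputs.append(y)

print(len(outputs), len(outputs[0]), len(state))  # 3 2 4
```

The loop is the whole idea of $RNN^{*}$: one output per input position, with the state carrying information from the prefix.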

Let's build an RNN-based language model with PyTorch.

In [None]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
hidden_size = 50

Prepare the data

In [None]:
class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content)) + ['<', '>']
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
    def __getitem__(self, index):
        line = self.lines[index]
        #teacher forcing
        x_str = '<' + line 
        y_str = line + '>' 
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        for i, (x_ch, y_ch) in enumerate(zip(x_str, y_str)):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        
        return x, y
    
    def __len__(self):
        return len(self.lines)

In [None]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, shuffle=True)

In [None]:
trn_ds.lines[1]

'aardonyx'

In [None]:
print(trn_ds.idx_to_ch)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 27: '<', 28: '>'}


In [None]:
trn_ds.vocab_size

29

In [None]:
x, y = trn_ds[1]

In [None]:
x.shape

torch.Size([9, 29])

In [None]:
y.shape

torch.Size([9])

In [None]:
y

tensor([ 1,  1, 18,  4, 15, 14, 25, 24, 28])
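The target tensor above can be reproduced by hand. The sketch below mirrors what `__getitem__` does for one name, using plain lists instead of tensors: the input is the name shifted right by the start symbol `<`, and the target is the name followed by `>`. The vocabulary is rebuilt here under the assumption that it matches the `idx_to_ch` mapping printed earlier; `encode_pair` is a hypothetical helper name.

```python
import string

# Same ordering as the idx_to_ch mapping printed above: '\n', a..z, '<', '>'
vocab = ['\n'] + list(string.ascii_lowercase) + ['<', '>']
ch_to_idx = {c: i for i, c in enumerate(vocab)}

def encode_pair(name):
    """Return (one-hot input rows, target indices) for one name."""
    x_str = '<' + name   # what the model sees at each step
    y_str = name + '>'   # what it should predict at each step
    x = [[1.0 if j == ch_to_idx[c] else 0.0 for j in range(len(vocab))]
         for c in x_str]
    y = [ch_to_idx[c] for c in y_str]
    return x, y

x_rows, y_idx = encode_pair('aardonyx')
print(len(x_rows), len(x_rows[0]))  # 9 29
print(y_idx)                        # [1, 1, 18, 4, 15, 14, 25, 24, 28]
```

At step `i` the model receives character `i` of `<aardonyx` and is trained to predict character `i` of `aardonyx>`.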

Define the model, the loss function, and the optimization algorithm

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.dropout = nn.Dropout(0.3)
        self.i2o = nn.Linear(hidden_size, output_size)
    
    def forward(self, h_prev, x):
        combined = torch.cat([h_prev, x], dim = 1) # concatenate x and h
        h = torch.tanh(self.dropout(self.i2h(combined)))
        y = self.i2o(h)
        return h, y

In [None]:
model = RNN(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

![rnn](images/dinos3.png)

In [None]:
def sample(model):
    model.eval()
    word_size = 0
    end_idx = trn_ds.ch_to_idx['>']  # end-of-name symbol
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<']
        indices = [start_char_idx]
        x[0, start_char_idx] = 1
        predicted_char_idx = start_char_idx

        while predicted_char_idx != end_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)

            # Sample the next character from the predicted distribution
            probs = y_softmax_scores.cpu().numpy().ravel().astype(np.float64)
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=probs / probs.sum()))
            indices.append(idx)

            # Feed the sampled character (one-hot) back in as the next input
            x = torch.zeros_like(x)
            x[0, idx] = 1

            predicted_char_idx = idx
            word_size += 1

        # Close the name if the length limit was hit before '>' was sampled
        if predicted_char_idx != end_idx:
            indices.append(end_idx)
    return indices

In [None]:
def print_sample(sample_idxs):
    print(''.join(trn_ds.idx_to_ch[i] for i in sample_idxs))

Train the resulting model

In [None]:
def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x, y = x.to(device), y.to(device)
        # Unroll the RNN over the characters of one name and sum the per-step losses
        for i in range(x.shape[1]):
            h_prev, y_pred = model(h_prev, x[:, i])
            loss += loss_fn(y_pred, y[:, i])

        loss.backward()
        optimizer.step()

        if (line_num + 1) % 100 == 0:
            print_sample(sample(model))
            model.train()  # sample() switches to eval mode; re-enable dropout

In [None]:
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
        print()

In [None]:
train(model, loss_fn, optimizer, epochs = 10)

Epoch:1
<izacusasaunrauas>
<rxd>
<usnfsruxus>
<suamlaaudur>
<acriaigsnreseusustususzurus>
<lsruakuiustur>
<aauras>
<tandudiurys>
<tureshusus>
<suainan>
<ateuaheugus>
<gmcuceurus>
<trneosazrus>
<tanraniurus>
<crnaausus>

Epoch:2
<tantusaussurus>
<zurusherus>
<scritauruh>
<comgasausus>
<suromiurus>
<rutanidauatp>
<aeoialheuros>
<kvrubturus>
<shwrosoulus>
<etatisaurug>
<cuglasasturus>
<silosaurus>
<tanianoa>
<guadosaurus>
<auaiuaurus>

Epoch:3
<jlrus>
<smproanaurus>
<euadsacrus>
<haucotaurus>
<klorivaurus>
<khcisauruo>
<sibnltinaurus>
<tertsonaurus>
<rruaralrus>
<guckopaurus>
<atdryseurus>
<crnivsosaurus>
<erasevaasalrus>
<harbkpsaurus>
<etshzrsiuris>

Epoch:4
<altaslsaurus>
<doricsurus>
<burrraurus>
<lrrtaosauros>
<tagthcaurus>
<saiupuesaurus>
<iyrniuras>
<alocoltcauros>
<selaocaurus>
<ssliturus>
<ppuahnantasaurus>
<anbtiocodtor>
<csuoraurus>
<lhusanieitaps>
<agnobiisaurus>

Epoch:5
<whurusaurus>
<ypnsusausussus>
<tajsibldaurus>
<kvntaurus>
<smnuunpusas>
<amrasaurus>
<anatapasaurus>
<sup

In [None]:
#def generate_dino(n=50):
#    word = '<'
#    for i in range(n):
#        word += cprob[word[-1]].generate()
#        if word[-1] == '>':
#            return word
#    word += '>'
#    return word

In [None]:
#nn.LSTM

print_sample(sample(model))

<krotclnaurus>


In [None]:
#for i in range(10):
#    print(generate_dino())

### Task 2.
Rewrite the sampling function to generate pangrams (words that contain each character of the alphabet only once)

### Task 3.
Rewrite the sampling function so that it would be possible to change the sampling temperature

### Task 4.
Implement the beam search for sampling

### Task 5
Rewrite the training function using LSTM/GRU. Also try Adam optimizer.

In [None]:
#a = y_softmax_scores.cpu().numpy().ravel()

In [None]:
def equalize_probs_sqrt(in_vector):
    # Flatten the distribution: take square roots, then renormalize to sum to 1
    out_vector = np.sqrt(np.asarray(in_vector, dtype=np.float64))
    return out_vector / out_vector.sum()
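The square-root rescaling above is a special case of temperature sampling (Task 3): taking the square root of softmax probabilities and renormalizing gives the same distribution as a softmax over the scores divided by a temperature of 2. The sketch below checks this numerically in pure Python. The function names and the example `logits` are hypothetical and not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (higher temperature = flatter distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sqrt_rescale(probs):
    """Same operation as equalize_probs_sqrt, written with plain lists."""
    roots = [math.sqrt(p) for p in probs]
    total = sum(roots)
    return [r / total for r in roots]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four characters
base = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
cold = softmax_with_temperature(logits, 0.5)

# sqrt-rescaled probabilities coincide with temperature T = 2
print(all(abs(a - b) < 1e-12 for a, b in zip(hot, sqrt_rescale(base))))
print(max(cold) > max(base) > max(hot))
```

A temperature above 1 makes rare characters more likely (more diverse, noisier names), while a temperature below 1 concentrates mass on the most probable character.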

In [None]:
#equalize_probs_sqrt(vec)

In [None]:
def sample_hotter(model):
    model.eval()
    word_size = 0
    end_idx = trn_ds.ch_to_idx['>']  # end-of-name symbol
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<']
        indices = [start_char_idx]
        x[0, start_char_idx] = 1
        predicted_char_idx = start_char_idx

        while predicted_char_idx != end_idx and word_size != 50:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)

            # Flatten the predicted distribution before sampling from it
            next_prob_vector = equalize_probs_sqrt(y_softmax_scores.cpu().numpy().ravel())
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=next_prob_vector))
            indices.append(idx)

            # Feed the sampled character (one-hot) back in as the next input
            x = torch.zeros_like(x)
            x[0, idx] = 1

            predicted_char_idx = idx
            word_size += 1

        # Close the name if the length limit was hit before '>' was sampled
        if predicted_char_idx != end_idx:
            indices.append(end_idx)
    return indices

In [None]:
print_sample(sample_hotter(model))

<pozrlzou>


# Reference

1. Sampling in  RNN: https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly/
2. Coursera course (main source): https://github.com/furkanu/deeplearning.ai-pytorch/tree/master/5-%20Sequence%20Models
3. Coursera course (main source): https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
4. LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

## Named Entity Recognition


#### The sequence labeling problem:

* We are given a text corpus $D$
* Each text is a sequence of tokens
* Each token is assigned a label from some set $V$

Depending on the label set $V$, we get different subtasks. For example:
* if $V$ is a set of parts of speech, this is ***POS*** tagging
* if $V$ is a set of named entity types, this is ***NER***

A named entity is any text span that denotes an object of interest, for example a person, an organization, or a location.
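Since every token gets exactly one label, multi-token entities are usually encoded with a B-/I-/O scheme: `B-x` opens an entity of type `x`, `I-x` continues it, and `O` marks tokens outside any entity. The sketch below turns such labels back into entity spans. It assumes this tag format, and the example sentence and the helper name `extract_entities` are hypothetical.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts; close the previous one if it is still open
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # 'O' (or an inconsistent I- tag) closes any open entity
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities

# Hypothetical example; tag names follow the B-geo / B-per style
tokens = ['Thousands', 'marched', 'through', 'London', 'with', 'John', 'Smith', '.']
tags = ['O', 'O', 'O', 'B-geo', 'O', 'B-per', 'I-per', 'O']
entities = extract_entities(tokens, tags)
print(entities)  # [('geo', 'London'), ('per', 'John Smith')]
```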

#### Conditional Random Fields

***Conditional Random Field*** (***CRF***) is an extension of Markov random fields: an undirected graphical model that represents the distribution of a set of output variables (the labels) conditioned on the observed input sequence.

Properties of the ***CRF*** model:

* Quality depends heavily on the choice of features

* One of the strongest classical methods for ***NER*** and ***POS*** tagging

* Slow to train

* Works well in combination with recurrent neural networks: it models the joint distribution over the entire sequence of network outputs at once
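To illustrate what "a distribution over the entire label sequence" means, here is a toy linear-chain sketch in pure Python. A tag sequence is scored as the sum of per-token emission scores and tag-to-tag transition scores, and its probability is exp(score) divided by the sum of exp(score) over all possible sequences. All scores are made-up numbers, the names are hypothetical, and start/stop transitions are omitted for brevity. The normalizer is computed twice, by brute-force enumeration and by the forward algorithm, to show that they agree.

```python
import itertools
import math

tags = ['B', 'I', 'O']
# Hypothetical scores for a 3-token sentence
emissions = [{'B': 2.0, 'I': -1.0, 'O': 0.5},
             {'B': 0.0, 'I': 1.5, 'O': 0.5},
             {'B': -0.5, 'I': 0.0, 'O': 2.0}]
transitions = {(p, n): 0.0 for p in tags for n in tags}
transitions[('O', 'I')] = -5.0  # discourage I directly after O
transitions[('B', 'I')] = 1.0   # encourage I directly after B

def sequence_score(seq):
    """Unnormalized log-score of one complete tag sequence."""
    score = emissions[0][seq[0]]
    for t in range(1, len(seq)):
        score += transitions[(seq[t - 1], seq[t])] + emissions[t][seq[t]]
    return score

def log_partition_brute_force():
    """log Z by enumerating all len(tags) ** n sequences."""
    return math.log(sum(math.exp(sequence_score(s))
                        for s in itertools.product(tags, repeat=len(emissions))))

def log_partition_forward():
    """log Z by dynamic programming (forward algorithm)."""
    alpha = {tag: emissions[0][tag] for tag in tags}
    for t in range(1, len(emissions)):
        alpha = {nxt: math.log(sum(math.exp(alpha[prev] + transitions[(prev, nxt)])
                                   for prev in tags)) + emissions[t][nxt]
                 for nxt in tags}
    return math.log(sum(math.exp(a) for a in alpha.values()))

log_z = log_partition_forward()
best = max(itertools.product(tags, repeat=len(emissions)), key=sequence_score)
prob_best = math.exp(sequence_score(best) - log_z)
print(best, prob_best)
print(abs(log_z - log_partition_brute_force()) < 1e-9)
```

The forward pass needs a number of operations proportional to sentence length times the squared number of tags, whereas enumeration grows exponentially with sentence length.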

code from https://github.com/Akshayc1/named-entity-recognition/blob/master/NER%20using%20CRF.ipynb

In [None]:
!pip install -U 'scikit-learn<0.24'
!pip install sklearn-crfsuite



In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_classification_report

Dataset source: https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus/download?datasetVersionNumber=4

In [None]:
!wget https://github.com/KMA504/hse-ami-nlp-course-fall-22/raw/main/seminars/ner_dataset.csv
#Reading the csv file
df = pd.read_csv('ner_dataset.csv', encoding = "ISO-8859-1")

#Display first 10 rows
df.head(10)




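| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | | of | IN | O |
| 2 | | demonstrators | NNS | O |
| 3 | | have | VBP | O |
| 4 | | marched | VBN | O |
| 5 | | through | IN | O |
| 6 | | London | NNP | B-geo |
| 7 | | to | TO | O |
| 8 | | protest | VB | O |
| 9 | | the | DT | O |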


In [None]:
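# Helper class that groups the rows into sentences.
# Each sentence becomes a list of (word, POS, tag) tuples.
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        # "Sentence #" is set only on the first row of each sentence, so forward-fill it.
        # Without this, groupby drops the NaN rows and every sentence collapses to its
        # first word (the outputs recorded in this notebook, such as 'Thousands' and the
        # scores further down, come from a run without this step).
        self.df = df.ffill()
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_text(self):
        # Return the next sentence, or None when there are no more
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

#Displaying one full sentence
getter = sentence(df)
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]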

'Thousands'

In [None]:
sentences = getter.sentences

## Features
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

crf = CRF(algorithm = 'lbfgs',
         c1 = 0.1,
         c2 = 0.1,
         max_iterations = 100,
         all_possible_transitions = False)
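try:
    crf.fit(X_train, y_train)
except AttributeError:
    # Workaround: with some scikit-learn / sklearn-crfsuite version combinations
    # this call raises AttributeError; the error is deliberately ignored here
    pass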

In [None]:
#Predicting on the test set.
y_pred = crf.predict(X_test)

In [None]:
f1_score = flat_f1_score(y_test, y_pred, average = 'weighted')
print(f1_score)

0.9390271110263764


In [None]:
report = flat_classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

       B-art       1.00      0.20      0.33         5
       B-eve       0.00      0.00      0.00         2
       B-geo       0.75      0.89      0.82       675
       B-gpe       0.96      0.92      0.94       605
       B-nat       1.00      1.00      1.00         3
       B-org       0.73      0.53      0.62       545
       B-per       0.86      0.89      0.88       807
       B-tim       0.98      0.81      0.89       107
           O       0.98      0.99      0.99      6843

    accuracy                           0.94      9592
   macro avg       0.81      0.69      0.72      9592
weighted avg       0.94      0.94      0.94      9592





#### BiLSTM-CRF architecture

In this model, a vector representation is computed for each word from its character composition, pretrained word embeddings (***Word2Vec***, ***FastText***, ***GloVe***), and other features (***POS*** tag, syntactic role in the sentence, etc.).

![representation](images/word_representation_model.png)

Overall model architecture

![bilstm_crf](images/bilstm_crf_model.png)

Main steps of the algorithm:
* Obtain pretrained word embeddings for the collection (***word2vec***, ***GloVe***)
* Train character-level embeddings (***char-BiLSTM***, ***char-CNN***)
* Build morphological/syntactic features for each word (***POS*** tag, syntactic role in the sentence, etc.)
* Concatenate all of these and feed them to the main network (***BiLSTM***)
* Feed the outputs $h_t$ for all words of the sentence to a classifier that predicts the NER tag (***SoftMax***, ***CRF***)

#### Toy example of BiLSTM-CRF

From here: https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html

In [None]:
!pip3 install torchvision



In [None]:
#!pip3 install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-linux_x86_64.whl

In [None]:
!pip3 install torchtext



In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fd897392670>

Helper functions to make the code more readable.

In [None]:
def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
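The reason `log_sum_exp` subtracts the maximum before exponentiating can be seen in a small pure-Python sketch: the naive formula overflows for large scores, while the shifted version returns the same value safely. The function names and test values below are hypothetical and independent of the tensor helper above.

```python
import math

def naive_log_sum_exp(values):
    return math.log(sum(math.exp(v) for v in values))

def stable_log_sum_exp(values):
    # log(sum(exp(v))) = m + log(sum(exp(v - m))) for any constant m;
    # choosing m = max(values) keeps every exponent <= 0
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

small = [0.1, 0.2, 0.3]
large = [1000.0, 1000.5, 999.0]

# On small inputs both versions agree
print(abs(naive_log_sum_exp(small) - stable_log_sum_exp(small)) < 1e-12)

# On large inputs the naive version fails, the stable one does not
try:
    naive_log_sum_exp(large)
    naive_failed = False
except OverflowError:
    naive_failed = True
print(naive_failed, stable_log_sum_exp(large))
```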

Creating model

In [None]:
class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Maps the output of the LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Matrix of transition parameters.  Entry i,j is the score of
        # transitioning *to* i *from* j.
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two statements enforce the constraint that we never transfer
        # to the start tag and we never transfer from the stop tag
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        # Do the forward algorithm to compute the partition function
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the
                # scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we don't want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):  # don't confuse this with _forward_alg above.
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq
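
The decoding step in `_viterbi_decode` can be easier to follow without tensors. Below is a minimal pure-Python sketch of the same recurrence, assuming the same convention as above (`transitions[next][prev]` is the score of moving from `prev` to `next`). The function name `viterbi`, the four-tag toy tagset and all the scores are made up for illustration and are not part of the notebook.

```python
NEG = -10000.0  # stands in for "impossible", as in the class above


def viterbi(emissions, transitions, start, stop):
    """Return (best_score, best_path) for a linear-chain CRF.

    emissions[t][tag]       -- score of `tag` at position t
    transitions[nxt][prev]  -- score of moving from `prev` to `nxt`
    """
    n_tags = len(transitions)
    # Before the first token only the START tag is reachable.
    scores = [NEG] * n_tags
    scores[start] = 0.0
    backpointers = []
    for emit in emissions:
        step_scores, step_ptrs = [], []
        for nxt in range(n_tags):
            # Best previous tag for arriving at `nxt` (emission does not affect the argmax).
            prev = max(range(n_tags), key=lambda p: scores[p] + transitions[nxt][p])
            step_ptrs.append(prev)
            step_scores.append(scores[prev] + transitions[nxt][prev] + emit[nxt])
        scores = step_scores
        backpointers.append(step_ptrs)
    # Close the path with a transition into STOP.
    last = max(range(n_tags), key=lambda p: scores[p] + transitions[stop][p])
    best_score = scores[last] + transitions[stop][last]
    # Follow the backpointers from the last tag to the START tag.
    path = [last]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    path.pop()  # drop the START tag
    path.reverse()
    return best_score, path


# Toy tagset (hypothetical): 0 = A, 1 = B, 2 = START, 3 = STOP
transitions = [
    [0.0, 0.0, 0.0, NEG],   # into A
    [0.0, 0.0, 0.0, NEG],   # into B
    [NEG, NEG, NEG, NEG],   # nothing may move into START
    [0.0, 0.0, NEG, NEG],   # into STOP (only from A or B)
]
emissions = [[2.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
best_score, best_path = viterbi(emissions, transitions, start=2, stop=3)
print(best_score, best_path)  # 6.0 [0, 1, 0]
```

With all allowed transitions set to zero, the best path simply picks the highest emission at each position (A, B, A), which makes the result easy to check by hand.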

Run training

In [None]:
START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
    print(model(precheck_sent))

# Make sure prepare_sequence from earlier in the LSTM section is loaded
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run the forward pass; neg_log_likelihood returns the loss directly.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the gradients and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))
# We got it!

(tensor(2.6907), [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1])
(tensor(20.4906), [0, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2])
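
The second printed line is a list of tag indices, which is hard to read on its own. The sketch below maps it back to B/I/O tags and groups the tokens into chunks. The helper `bio_to_spans` is a hypothetical name introduced here; the sentence, the index-to-tag mapping and the predicted sequence are copied from the cells above.

```python
def bio_to_spans(tag_ids, ix_to_tag):
    """Group a BIO-tagged sequence into (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, tag_id in enumerate(tag_ids):
        tag = ix_to_tag[tag_id]
        if tag == "B":
            # A new chunk starts; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":
            # Outside any chunk; close the open one, if any.
            if start is not None:
                spans.append((start, i))
            start = None
        # "I" continues the open chunk; a stray "I" with no open chunk is ignored here.
    if start is not None:
        spans.append((start, len(tag_ids)))
    return spans


sentence = "the wall street journal reported today that apple corporation made money".split()
ix_to_tag = {0: "B", 1: "I", 2: "O"}
predicted = [0, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2]  # tag sequence printed after training

spans = bio_to_spans(predicted, ix_to_tag)
chunks = [" ".join(sentence[s:e]) for s, e in spans]
print(chunks)  # ['the wall street journal', 'apple corporation']
```

The recovered chunks match the gold annotation of the first training sentence, which is what "We got it!" refers to.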


#### Full example of BiLSTM-CRF

In [None]:
!pip3 install natasha

Collecting natasha
  Downloading natasha-1.4.0-py3-none-any.whl (34.4 MB)
Collecting ipymarkup>=0.8.0
  Downloading ipymarkup-0.9.0-py3-none-any.whl (14 kB)
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
Collecting navec>=0.9.0
  Downloading navec-0.10.0-py3-none-any.whl (23 kB)
Collecting slovnet>=0.3.0
  Downloading slovnet-0.5.0-py3-none-any.whl (49 kB)
Collecting razdel>=0.5.0
  Downloading razdel-0.5.0-py3-none-any.whl (21 kB)
Collecting yargy>=0.14.0
  Downloading yargy-0.15.0-py3-none-any.whl (41 kB)
Collecting intervaltree>=3
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
Collecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2 MB)


In [None]:
!python3 -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
!pip install html5lib



In [None]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa

--2021-09-22 19:25:51--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827012 (808K) [text/plain]
Saving to: ‘eng.testa’


2021-09-22 19:25:51 (17.5 MB/s) - ‘eng.testa’ saved [827012/827012]



In [None]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb

--2021-09-22 19:26:04--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 748096 (731K) [text/plain]
Saving to: ‘eng.testb’


2021-09-22 19:26:04 (16.4 MB/s) - ‘eng.testb’ saved [748096/748096]



In [None]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train

--2021-09-22 19:26:07--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3281528 (3.1M) [text/plain]
Saving to: ‘eng.train’


2021-09-22 19:26:07 (45.0 MB/s) - ‘eng.train’ saved [3281528/3281528]



In [None]:
!mkdir datasets && mv eng.* datasets/

In [None]:
!wget https://worksheets.codalab.org/rest/bundles/0x15a09c8f74f94a20bec0b68a2e6703b3/contents/blob/ && mkdir embeddings && mv index.html embeddings/glove.6B.100d.txt

--2021-09-22 19:26:19--  https://worksheets.codalab.org/rest/bundles/0x15a09c8f74f94a20bec0b68a2e6703b3/contents/blob/
Resolving worksheets.codalab.org (worksheets.codalab.org)... 13.68.212.115
Connecting to worksheets.codalab.org (worksheets.codalab.org)|13.68.212.115|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-09-22 19:26:20 ERROR 404: Not Found.



In [None]:
!git clone https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling

Cloning into 'a-PyTorch-Tutorial-to-Sequence-Labeling'...
remote: Enumerating objects: 139, done.
remote: Total 139 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (139/139), 6.49 MiB | 19.25 MiB/s, done.
Resolving deltas: 100% (68/68), done.


#### spaCy

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


Let's download an article, find the named entities in it, and print how many there are:

In [None]:
from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
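
The last line of `url_to_string` only collapses runs of newlines and tabs, not ordinary spaces. The sketch below shows the effect on a made-up string (`raw` is an illustrative value, not page content) and compares it with `str.split()`, which collapses every kind of whitespace.

```python
import re

raw = "Title\n\n\tFirst paragraph.\n\t\tSecond  paragraph.\n"

# Same expression as in url_to_string: split on runs of newlines/tabs, rejoin with one space.
flat = " ".join(re.split(r'[\n\t]+', raw))
print(repr(flat))  # 'Title First paragraph. Second  paragraph. '

# Alternative: str.split() with no arguments also collapses repeated spaces
# and drops leading/trailing whitespace.
compact = " ".join(raw.split())
print(repr(compact))  # 'Title First paragraph. Second paragraph.'
```

The double space inside the text and the trailing space survive the regex version. That is harmless for spaCy, but worth knowing when comparing character offsets.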

In [None]:
ny_bb = url_to_string('https://towardsdatascience.com/from-graph-ml-to-deep-relational-learning-f07a0dddda89')
article = nlp(ny_bb)

In [None]:
len(article.ents)

271

Print the number of entities found for each type:

In [None]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 102,
         'PERSON': 58,
         'WORK_OF_ART': 10,
         'DATE': 31,
         'FAC': 3,
         'CARDINAL': 44,
         'GPE': 12,
         'PERCENT': 1,
         'EVENT': 1,
         'LAW': 2,
         'PRODUCT': 5,
         'TIME': 1,
         'LOC': 1})
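
As a quick sanity check, the per-type counts should add up to the 271 entities reported by `len(article.ents)`. The sketch below rebuilds the `Counter` from the output above (the numbers are copied, not recomputed from the article) and also reports the three most frequent types with their shares.

```python
from collections import Counter

# Counts copied from the output of Counter(labels) above.
counts = Counter({'ORG': 102, 'PERSON': 58, 'WORK_OF_ART': 10, 'DATE': 31,
                  'FAC': 3, 'CARDINAL': 44, 'GPE': 12, 'PERCENT': 1,
                  'EVENT': 1, 'LAW': 2, 'PRODUCT': 5, 'TIME': 1, 'LOC': 1})

total = sum(counts.values())
top3 = counts.most_common(3)
shares = {label: round(100 * n / total, 1) for label, n in top3}

print(total)   # 271
print(top3)    # [('ORG', 102), ('PERSON', 58), ('CARDINAL', 44)]
print(shares)  # {'ORG': 37.6, 'PERSON': 21.4, 'CARDINAL': 16.2}
```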

Render the text with entities of different types highlighted:

In [None]:
sentences = [x for x in article.sents]
# Render the already parsed document directly instead of re-parsing str(sentences)
displacy.render(article, jupyter=True, style='ent')

In [None]:
!mkdir conll2003 && cp datasets/eng.testa conll2003/eng.testa.txt && cp datasets/eng.testb conll2003/eng.testb.txt && cp datasets/eng.train conll2003/eng.train.txt

In [None]:
!mkdir .data && mv conll2003 .data/

In [None]:
!git clone https://github.com/kolloldas/torchnlp

Cloning into 'torchnlp'...
remote: Enumerating objects: 247, done.
remote: Total 247 (delta 0), reused 0 (delta 0), pack-reused 247
Receiving objects: 100% (247/247), 2.13 MiB | 11.14 MiB/s, done.
Resolving deltas: 100% (120/120), done.


In [None]:
!cd torchnlp && pip3 install -r requirements.txt && python3 setup.py install

Collecting git+git://github.com/kolloldas/text.git (from -r requirements.txt (line 4))
  Cloning git://github.com/kolloldas/text.git to /tmp/pip-req-build-zux4f_rg
  Running command git clone -q git://github.com/kolloldas/text.git /tmp/pip-req-build-zux4f_rg
Collecting pytest-mock
  Downloading pytest_mock-3.6.1-py3-none-any.whl (12 kB)
Collecting pytest-pythonpath
  Downloading pytest-pythonpath-0.7.3.tar.gz (4.5 kB)
Collecting pytest
  Downloading pytest-6.2.5-py3-none-any.whl (280 kB)
  Downloading pytest-6.2.4-py3-none-any.whl (280 kB)
  Downloading pytest-6.2.3-py3-none-any.whl (280 kB)
  Downloading pytest-6.2.2-py3-none-any.whl (280 kB)
  Downloading pytest-6.2.1-py3-none-any.whl (279 kB)

#### Natasha

In [None]:
!pip install natasha
import natasha

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting natasha
  Downloading natasha-1.4.0-py3-none-any.whl (34.4 MB)
Collecting razdel>=0.5.0
  Downloading razdel-0.5.0-py3-none-any.whl (21 kB)
Collecting yargy>=0.14.0
  Downloading yargy-0.15.0-py3-none-any.whl (41 kB)
Collecting navec>=0.9.0
  Downloading navec-0.10.0-py3-none-any.whl (23 kB)
Collecting ipymarkup>=0.8.0
  Downloading ipymarkup-0.9.0-py3-none-any.whl (14 kB)
Collecting slovnet>=0.3.0
  Downloading slovnet-0.5.0-py3-none-any.whl (49 kB)
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
Collecting intervaltree>=3
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
Collecting dawg-python

In [None]:
#l = natasha.MoneyExtractor

In [None]:
from natasha import (
    Segmenter,
    MorphVocab,    
    NewsEmbedding,
    NewsMorphTagger,
    NewsSyntaxParser,
    NewsNERTagger,    
    PER,
    NamesExtractor,
    Doc
)


segmenter = Segmenter()
morph_vocab = MorphVocab()

emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)
syntax_parser = NewsSyntaxParser(emb)
ner_tagger = NewsNERTagger(emb)

names_extractor = NamesExtractor(morph_vocab)

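# Two sample sentences; only the second is analysed because it overwrites the first.
text = 'Я живу в Москве, но иногда езжу в Питер'  # "I live in Moscow, but sometimes travel to Piter"
text = 'Магазин ООО "Мечта" дарит скидки всем Антонам'  # 'The shop OOO "Mechta" gives discounts to all Antons'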
doc = Doc(text)
doc.segment(segmenter)
doc.tag_ner(ner_tagger)
print(doc.spans[:5])
doc.ner.print()

[DocSpan(start=8, stop=19, type='ORG', text='ООО "Мечта"', tokens=[...]), DocSpan(start=38, stop=45, type='PER', text='Антонам', tokens=[...])]
Магазин ООО "Мечта" дарит скидки всем Антонам
        ORG────────                   PER────
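
The `start` and `stop` fields in the printed `DocSpan` objects are character offsets into `text`, with `stop` exclusive. The sketch below checks this with plain string slicing; the tuples in `spans` are copied from the output above rather than produced by Natasha, so it runs without the library.

```python
text = 'Магазин ООО "Мечта" дарит скидки всем Антонам'

# (start, stop, type) copied from the DocSpan output above.
spans = [(8, 19, 'ORG'), (38, 45, 'PER')]

# Slicing the original text with the offsets recovers each entity string.
found = [(label, text[start:stop]) for start, stop, label in spans]
for label, surface in found:
    print(label, surface)
```

Note that the PER span is the inflected form "Антонам" as it appears in the text, not a normalized name.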


#### LSTM-CRF

In [None]:
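import time
import torch
import torch.nn as nn  # needed below for nn.CrossEntropyLoss
import torch.optim as optim
import os
import sys

# models, utils, datasets and inference are modules of the cloned
# a-PyTorch-Tutorial-to-Sequence-Labeling repo; if the imports below raise
# ModuleNotFoundError, uncomment the next line to put the repo on sys.path.
#sys.path.append('./a-PyTorch-Tutorial-to-Sequence-Labeling')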

from models import LM_LSTM_CRF, ViterbiLoss
from utils import *
from torch.nn.utils.rnn import pack_padded_sequence
from datasets import WCDataset
from inference import ViterbiDecoder
from sklearn.metrics import f1_score

ModuleNotFoundError: ignored

In [None]:
#!touch ./a-PyTorch-Tutorial-to-Sequence-Labeling/__init__.py

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2021-09-22 19:31:55--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-09-22 19:31:55--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-22 19:31:56--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
import zipfile
with zipfile.ZipFile('glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

In [None]:
# Data parameters
task = 'ner'  # tagging task, to choose column in CoNLL 2003 dataset
train_file = './datasets/eng.train'  # path to training data
val_file = './datasets/eng.testa'  # path to validation data
test_file = './datasets/eng.testb'  # path to test data
emb_file = './glove.6B.100d.txt'  # path to pre-trained word embeddings
min_word_freq = 5  # threshold for word frequency
min_char_freq = 1  # threshold for character frequency
caseless = True  # lowercase everything?
expand_vocab = True  # expand model's input vocabulary to the pre-trained embeddings' vocabulary?

# Model parameters
char_emb_dim = 30  # character embedding size
with open(emb_file, 'r') as f:
    word_emb_dim = len(f.readline().split(' ')) - 1  # word embedding size
word_rnn_dim = 300  # word RNN size
char_rnn_dim = 300  # character RNN size
char_rnn_layers = 1  # number of layers in character RNN
word_rnn_layers = 1  # number of layers in word RNN
highway_layers = 1  # number of layers in highway network
dropout = 0.5  # dropout
fine_tune_word_embeddings = False  # fine-tune pre-trained word embeddings?

# Training parameters
start_epoch = 0  # start at this epoch
batch_size = 10  # batch size
lr = 0.015  # learning rate
lr_decay = 0.05  # decay learning rate by this amount
momentum = 0.9  # momentum
workers = 1  # number of workers for loading data in the DataLoader
epochs = 10  # number of epochs to run without early-stopping
grad_clip = 5.  # clip gradients at this value
print_freq = 100  # print training or validation status every __ batches
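best_f1 = 0.  # F1 score to start with
epochs_since_improvement = 0  # epochs since the validation F1 last improved (main_func updates this global)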
checkpoint = None  # path to model checkpoint, None if none

tag_ind = 1 if task == 'pos' else 3  # choose column in CoNLL 2003 dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
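
The `tag_ind` setting picks a column from the whitespace-separated CoNLL 2003 layout (word, POS tag, chunk tag, NER tag). A minimal sketch of that selection follows; the helper `pick_tag` and the sample line are illustrative assumptions and not the tutorial's `read_words_tags`.

```python
def pick_tag(line, task):
    """Return (word, tag) from one CoNLL 2003 line for the given task."""
    tag_ind = 1 if task == 'pos' else 3  # same rule as in the cell above
    columns = line.split()
    return columns[0], columns[tag_ind]


# Illustrative line in the four-column layout: word, POS, chunk, NER.
line = "EU NNP I-NP I-ORG"

ner_pair = pick_tag(line, 'ner')
pos_pair = pick_tag(line, 'pos')
print(ner_pair)  # ('EU', 'I-ORG')
print(pos_pair)  # ('EU', 'NNP')
```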

In [None]:
def train(train_loader, model, lm_criterion, crf_criterion, optimizer, epoch, vb_decoder):
    """
    Performs one epoch's training.
    :param train_loader: DataLoader for training data
    :param model: model
    :param lm_criterion: cross entropy loss layer
    :param crf_criterion: viterbi loss layer
    :param optimizer: optimizer
    :param epoch: epoch number
    :param vb_decoder: viterbi decoder (to decode and find F1 score)
    """

    model.train()  # training mode enables dropout

    batch_time = AverageMeter()  # forward prop. + back prop. time per batch
    data_time = AverageMeter()  # data loading time per batch
    ce_losses = AverageMeter()  # cross entropy loss
    vb_losses = AverageMeter()  # viterbi loss
    f1s = AverageMeter()  # f1 score

    start = time.time()

    # Batches
    for i, (wmaps, cmaps_f, cmaps_b, cmarkers_f, cmarkers_b, tmaps, wmap_lengths, cmap_lengths) in enumerate(
            train_loader):

        data_time.update(time.time() - start)

        max_word_len = max(wmap_lengths.tolist())
        max_char_len = max(cmap_lengths.tolist())

        # Reduce batch's padded length to maximum in-batch sequence
        # This saves some compute on nn.Linear layers (RNNs are unaffected, since they don't compute over the pads)
        wmaps = wmaps[:, :max_word_len].to(device)
        cmaps_f = cmaps_f[:, :max_char_len].to(device)
        cmaps_b = cmaps_b[:, :max_char_len].to(device)
        cmarkers_f = cmarkers_f[:, :max_word_len].to(device)
        cmarkers_b = cmarkers_b[:, :max_word_len].to(device)
        tmaps = tmaps[:, :max_word_len].to(device)
        wmap_lengths = wmap_lengths.to(device)
        cmap_lengths = cmap_lengths.to(device)

        # Forward prop.
        crf_scores, lm_f_scores, lm_b_scores, wmaps_sorted, tmaps_sorted, wmap_lengths_sorted, _, __ = model(cmaps_f,
                                                                                                             cmaps_b,
                                                                                                             cmarkers_f,
                                                                                                             cmarkers_b,
                                                                                                             wmaps,
                                                                                                             tmaps,
                                                                                                             wmap_lengths,
                                                                                                             cmap_lengths)

        # LM loss

        # We don't predict the next word at the pads or <end> tokens
        # We will only predict at [dunston, checks, in] among [dunston, checks, in, <end>, <pad>, <pad>, ...]
        # So, prediction lengths are word sequence lengths - 1
        lm_lengths = wmap_lengths_sorted - 1
        lm_lengths = lm_lengths.tolist()

        # Remove scores at timesteps we won't predict at
        # pack_padded_sequence is a good trick to do this (see dynamic_rnn.py, where we explore this)
        # Recent PyTorch returns a PackedSequence with more than two fields, so take `.data`
        # instead of unpacking the result into two names
        lm_f_scores = pack_padded_sequence(lm_f_scores, lm_lengths, batch_first=True).data
        lm_b_scores = pack_padded_sequence(lm_b_scores, lm_lengths, batch_first=True).data

        # For the forward sequence, targets are from the second word onwards, up to <end>
        # (timestep -> target) ...dunston -> checks, ...checks -> in, ...in -> <end>
        lm_f_targets = wmaps_sorted[:, 1:]
        lm_f_targets = pack_padded_sequence(lm_f_targets, lm_lengths, batch_first=True).data

        # For the backward sequence, targets are <end> followed by all words except the last word
        # ...notsnud -> <end>, ...skcehc -> dunston, ...ni -> checks
        lm_b_targets = torch.cat(
            [torch.LongTensor([word_map['<end>']] * wmaps_sorted.size(0)).unsqueeze(1).to(device), wmaps_sorted], dim=1)
        lm_b_targets = pack_padded_sequence(lm_b_targets, lm_lengths, batch_first=True).data

        # Calculate loss
        ce_loss = lm_criterion(lm_f_scores, lm_f_targets) + lm_criterion(lm_b_scores, lm_b_targets)
        vb_loss = crf_criterion(crf_scores, tmaps_sorted, wmap_lengths_sorted)
        loss = ce_loss + vb_loss

        # Back prop.
        optimizer.zero_grad()
        loss.backward()

        if grad_clip is not None:
            clip_gradient(optimizer, grad_clip)

        optimizer.step()

        # Viterbi decode to find accuracy / f1
        decoded = vb_decoder.decode(crf_scores.to("cpu"), wmap_lengths_sorted.to("cpu"))

        # Remove timesteps we won't predict at, and also <end> tags, because to predict them would be cheating
        # (use .data: recent PyTorch returns a PackedSequence that no longer unpacks into two names)
        decoded = pack_padded_sequence(decoded, lm_lengths, batch_first=True).data
        tmaps_sorted = tmaps_sorted % vb_decoder.tagset_size  # actual target indices (see create_input_tensors())
        tmaps_sorted = pack_padded_sequence(tmaps_sorted, lm_lengths, batch_first=True).data

        # F1
        f1 = f1_score(tmaps_sorted.to("cpu").numpy(), decoded.numpy(), average='macro')

        # Keep track of metrics
        ce_losses.update(ce_loss.item(), sum(lm_lengths))
        vb_losses.update(vb_loss.item(), crf_scores.size(0))
        batch_time.update(time.time() - start)
        f1s.update(f1, sum(lm_lengths))

        start = time.time()

        # Print training status
        if i % print_freq == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data Load Time {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'CE Loss {ce_loss.val:.4f} ({ce_loss.avg:.4f})\t'
                  'VB Loss {vb_loss.val:.4f} ({vb_loss.avg:.4f})\t'
                  'F1 {f1.val:.3f} ({f1.avg:.3f})'.format(epoch, i, len(train_loader),
                                                          batch_time=batch_time,
                                                          data_time=data_time, ce_loss=ce_losses,
                                                          vb_loss=vb_losses, f1=f1s))
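
`AverageMeter` comes from the tutorial's `utils` module and is not shown in this notebook. Judging from how it is used above (`.val`, `.avg`, `update(value, n)`), it behaves like a weighted running mean. The class below is a hypothetical re-implementation written only to illustrate that behaviour; the real one may differ in details.

```python
class AverageMeter:
    """Track the latest value and the weighted running average of a metric."""

    def __init__(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values so far
        self.count = 0    # total weight (e.g. number of tokens or sequences)
        self.avg = 0.0    # sum / count

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


# Two batches: F1 of 0.5 over 10 tokens, then 0.8 over 30 tokens.
f1s = AverageMeter()
f1s.update(0.5, 10)
f1s.update(0.8, 30)
print(f1s.val, round(f1s.avg, 3))  # 0.8 0.725
```

Weighting by the number of predicted tokens, as `f1s.update(f1, sum(lm_lengths))` does, keeps short batches from counting as much as long ones.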

In [None]:
def validate(val_loader, model, crf_criterion, vb_decoder):
    """
    Performs one epoch's validation.
    :param val_loader: DataLoader for validation data
    :param model: model
    :param crf_criterion: viterbi loss layer
    :param vb_decoder: viterbi decoder
    :return: validation F1 score
    """
    model.eval()

    batch_time = AverageMeter()
    vb_losses = AverageMeter()
    f1s = AverageMeter()

    start = time.time()

    for i, (wmaps, cmaps_f, cmaps_b, cmarkers_f, cmarkers_b, tmaps, wmap_lengths, cmap_lengths) in enumerate(
            val_loader):

        max_word_len = max(wmap_lengths.tolist())
        max_char_len = max(cmap_lengths.tolist())

        # Reduce batch's padded length to maximum in-batch sequence
        # This saves some compute on nn.Linear layers (RNNs are unaffected, since they don't compute over the pads)
        wmaps = wmaps[:, :max_word_len].to(device)
        cmaps_f = cmaps_f[:, :max_char_len].to(device)
        cmaps_b = cmaps_b[:, :max_char_len].to(device)
        cmarkers_f = cmarkers_f[:, :max_word_len].to(device)
        cmarkers_b = cmarkers_b[:, :max_word_len].to(device)
        tmaps = tmaps[:, :max_word_len].to(device)
        wmap_lengths = wmap_lengths.to(device)
        cmap_lengths = cmap_lengths.to(device)

        # Forward prop.
        crf_scores, wmaps_sorted, tmaps_sorted, wmap_lengths_sorted, _, __ = model(cmaps_f,
                                                                                   cmaps_b,
                                                                                   cmarkers_f,
                                                                                   cmarkers_b,
                                                                                   wmaps,
                                                                                   tmaps,
                                                                                   wmap_lengths,
                                                                                   cmap_lengths)

        # Viterbi / CRF layer loss
        vb_loss = crf_criterion(crf_scores, tmaps_sorted, wmap_lengths_sorted)

        # Viterbi decode to find accuracy / f1
        decoded = vb_decoder.decode(crf_scores.to("cpu"), wmap_lengths_sorted.to("cpu"))

        # Remove timesteps we won't predict at, and also <end> tags, because to predict them would be cheating
        # (use .data: recent PyTorch returns a PackedSequence that no longer unpacks into two names)
        decoded = pack_padded_sequence(decoded, (wmap_lengths_sorted - 1).tolist(), batch_first=True).data
        tmaps_sorted = tmaps_sorted % vb_decoder.tagset_size  # actual target indices (see create_input_tensors())
        tmaps_sorted = pack_padded_sequence(tmaps_sorted, (wmap_lengths_sorted - 1).tolist(), batch_first=True).data

        # f1
        f1 = f1_score(tmaps_sorted.to("cpu").numpy(), decoded.numpy(), average='macro')

        # Keep track of metrics
        vb_losses.update(vb_loss.item(), crf_scores.size(0))
        f1s.update(f1, sum((wmap_lengths_sorted - 1).tolist()))
        batch_time.update(time.time() - start)

        start = time.time()

        if i % print_freq == 0:
            print('Validation: [{0}/{1}]\t'
                  'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'VB Loss {vb_loss.val:.4f} ({vb_loss.avg:.4f})\t'
                  'F1 Score {f1.val:.3f} ({f1.avg:.3f})\t'.format(i, len(val_loader), batch_time=batch_time,
                                                                  vb_loss=vb_losses, f1=f1s))

    print(
        '\n * LOSS - {vb_loss.avg:.3f}, F1 SCORE - {f1.avg:.3f}\n'.format(vb_loss=vb_losses,
                                                                          f1=f1s))

    return f1s.avg
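
Both `train` and `validate` use `pack_padded_sequence` purely as a filter: it keeps the first `length` items of every row and drops pads and the final `<end>` position. The pure-Python sketch below mimics the order in which the kept items come out for a batch-first batch sorted by decreasing length. `packed_data` is a hypothetical helper for illustration, and the tag values are made up.

```python
def packed_data(batch, lengths):
    """Mimic the element order of a packed, batch-first, length-sorted batch."""
    assert lengths == sorted(lengths, reverse=True), "rows must be sorted by decreasing length"
    out = []
    for t in range(max(lengths)):
        # At timestep t, take the t-th item of every row that is still long enough.
        for row, n in zip(batch, lengths):
            if t < n:
                out.append(row[t])
    return out


# Three tag sequences; 9 marks the <end> tag and 0 marks padding (toy values).
decoded = [[1, 2, 3, 9],
           [4, 5, 9, 0],
           [6, 9, 0, 0]]
lengths = [4, 3, 2]

# Using length - 1 drops the <end> position as well as the pads.
kept = packed_data(decoded, [n - 1 for n in lengths])
print(kept)  # [1, 4, 6, 2, 5, 3]
```

Because predictions and targets are packed with the same lengths, they stay aligned item by item, which is all `f1_score` needs.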

In [None]:
def main_func():
    """
    Training and validation.
    """
    global best_f1, epochs_since_improvement, checkpoint, start_epoch, word_map, char_map, tag_map

    # Read training and validation data
    train_words, train_tags = read_words_tags(train_file, tag_ind, caseless)
    val_words, val_tags = read_words_tags(val_file, tag_ind, caseless)

    # Initialize model or load checkpoint
    if checkpoint is not None:
        checkpoint = torch.load(checkpoint)
        model = checkpoint['model']
        optimizer = checkpoint['optimizer']
        word_map = checkpoint['word_map']
        lm_vocab_size = checkpoint['lm_vocab_size']
        tag_map = checkpoint['tag_map']
        char_map = checkpoint['char_map']
        start_epoch = checkpoint['epoch'] + 1
        best_f1 = checkpoint['f1']
    else:
        word_map, char_map, tag_map = create_maps(train_words + val_words, train_tags + val_tags, min_word_freq,
                                                  min_char_freq)  # create word, char, tag maps
        embeddings, word_map, lm_vocab_size = load_embeddings(emb_file, word_map,
                                                              expand_vocab)  # load pre-trained embeddings

        model = LM_LSTM_CRF(tagset_size=len(tag_map),
                            charset_size=len(char_map),
                            char_emb_dim=char_emb_dim,
                            char_rnn_dim=char_rnn_dim,
                            char_rnn_layers=char_rnn_layers,
                            vocab_size=len(word_map),
                            lm_vocab_size=lm_vocab_size,
                            word_emb_dim=word_emb_dim,
                            word_rnn_dim=word_rnn_dim,
                            word_rnn_layers=word_rnn_layers,
                            dropout=dropout,
                            highway_layers=highway_layers).to(device)
        model.init_word_embeddings(embeddings.to(device))  # initialize embedding layer with pre-trained embeddings
        model.fine_tune_word_embeddings(fine_tune_word_embeddings)  # fine-tune
        optimizer = optim.SGD(params=filter(lambda p: p.requires_grad, model.parameters()), lr=lr, momentum=momentum)

    # Loss functions
    lm_criterion = nn.CrossEntropyLoss().to(device)
    crf_criterion = ViterbiLoss(tag_map).to(device)

    # Since the language model's vocab is restricted to in-corpus indices, encode training/val with only these!
    # word_map might have been expanded, and in-corpus words eliminated due to low frequency might still be added because
    # they were in the pre-trained embeddings
    temp_word_map = {k: v for k, v in word_map.items() if v <= word_map['<unk>']}
    train_inputs = create_input_tensors(train_words, train_tags, temp_word_map, char_map,
                                        tag_map)
    val_inputs = create_input_tensors(val_words, val_tags, temp_word_map, char_map, tag_map)

    # DataLoaders
    train_loader = torch.utils.data.DataLoader(WCDataset(*train_inputs), batch_size=batch_size, shuffle=True,
                                               num_workers=workers, pin_memory=False)
    val_loader = torch.utils.data.DataLoader(WCDataset(*val_inputs), batch_size=batch_size, shuffle=True,
                                             num_workers=workers, pin_memory=False)

    # Viterbi decoder (to find accuracy during validation)
    vb_decoder = ViterbiDecoder(tag_map)

    # Epochs
    for epoch in range(start_epoch, epochs):

        # One epoch's training
        train(train_loader=train_loader,
              model=model,
              lm_criterion=lm_criterion,
              crf_criterion=crf_criterion,
              optimizer=optimizer,
              epoch=epoch,
              vb_decoder=vb_decoder)

        # One epoch's validation
        val_f1 = validate(val_loader=val_loader,
                          model=model,
                          crf_criterion=crf_criterion,
                          vb_decoder=vb_decoder)

        # Did validation F1 score improve?
        is_best = val_f1 > best_f1
        best_f1 = max(val_f1, best_f1)
        if not is_best:
            epochs_since_improvement += 1
            print("\nEpochs since improvement: %d\n" % (epochs_since_improvement,))
        else:
            epochs_since_improvement = 0

        # Save checkpoint
        save_checkpoint(epoch, model, optimizer, val_f1, word_map, char_map, tag_map, lm_vocab_size, is_best)

        # Decay learning rate every epoch
        adjust_learning_rate(optimizer, lr / (1 + (epoch + 1) * lr_decay))
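
The decay rule on the last line sets the learning rate for the next epoch to `lr / (1 + (epoch + 1) * lr_decay)`. With the values configured earlier (`lr = 0.015`, `lr_decay = 0.05`), the schedule can be tabulated as below. This is a standalone sketch that only evaluates the formula and does not touch the optimizer.

```python
lr = 0.015
lr_decay = 0.05
epochs = 10

# Rate applied after finishing each epoch, i.e. the rate used by the following epoch.
schedule = [lr / (1 + (epoch + 1) * lr_decay) for epoch in range(epochs)]

for epoch, rate in enumerate(schedule):
    print(f"after epoch {epoch}: lr = {rate:.5f}")
```

The first epoch trains with the undecayed 0.015. After the tenth epoch the formula gives 0.015 / 1.5 = 0.01, so the rate falls by a third over the whole run.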

In [None]:
main_func()

Embedding length is 100.
You have elected to include embeddings that are out-of-corpus.

Loading embeddings...
'word_map' is being updated accordingly.

Done.
 Embedding vocabulary: 400054
 Language Model vocabulary: 4671.



  "num_layers={}".format(dropout, num_layers))


RuntimeError: ignored