# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement the model in PyTorch from "An Unsupervised Neural Attention Model for Aspect Extraction" (He et al., 2017), also described in the seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i=1}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{d \times d}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
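
The attention formulas above can be sketched as a small PyTorch module. This is only a minimal sketch, not the reference implementation: the class name `AttentionPooling` and the `mask` argument (1.0 for real tokens, 0.0 for padding) are assumptions that do not appear in the assignment, and every sentence is assumed to contain at least one real token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPooling(nn.Module):
    """Attention-weighted sentence embedding z_s (hypothetical helper name)."""

    def __init__(self, embed_size):
        super().__init__()
        # Trainable matrix M of shape (d, d)
        self.M = nn.Parameter(torch.empty(embed_size, embed_size))
        nn.init.xavier_uniform_(self.M)

    def forward(self, emb, mask):
        # emb: (batch, n, d) token embeddings
        # mask: (batch, n), 1.0 for real tokens and 0.0 for padding (assumed input)
        mask = mask.to(emb.dtype)
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        # y_s: average of the real token embeddings, (batch, d)
        y = (emb * mask.unsqueeze(-1)).sum(dim=1) / lengths
        # d_i = e_{w_i}^T M y_s for every token, (batch, n)
        d = torch.einsum('bnd,de,be->bn', emb, self.M, y)
        # Padding positions get zero attention weight
        d = d.masked_fill(mask == 0, float('-inf'))
        alpha = F.softmax(d, dim=1)
        # z_s = sum_i alpha_i e_{w_i}, (batch, d)
        z = (alpha.unsqueeze(-1) * emb).sum(dim=1)
        return z, alpha


torch.manual_seed(0)
pool = AttentionPooling(embed_size=4)
emb = torch.randn(2, 3, 4)
mask = torch.tensor([[1., 1., 0.], [1., 1., 1.]])
z, alpha = pool(emb, mask)
```

Here the softmax runs over the tokens of one sentence, so each row of `alpha` sums to 1.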

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{K \times d}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{K \times d}$ trainable matrix of topic embeddings, $K$ = number of topics
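
The two formulas above (topic weights and reconstruction) might look as follows in PyTorch. This is a sketch under assumptions: the class name `TopicReconstruction` is hypothetical, $W$ is stored as an `nn.Linear` whose weight has shape `(K, d)`, and $T$ is stored as an explicit `(K, d)` parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopicReconstruction(nn.Module):
    """Maps a sentence embedding z_s to topic weights p_t and reconstruction r_s."""

    def __init__(self, embed_size, n_topics):
        super().__init__()
        # W and b: nn.Linear stores a weight of shape (K, d) and a bias of shape (K,)
        self.W = nn.Linear(embed_size, n_topics)
        # T: one topic embedding per row, shape (K, d)
        self.T = nn.Parameter(torch.empty(n_topics, embed_size))
        nn.init.xavier_uniform_(self.T)

    def forward(self, z):
        # z: (batch, d)
        p = F.softmax(self.W(z), dim=1)  # p_t, (batch, K)
        r = p @ self.T                   # row-wise T^T p_t, (batch, d)
        return p, r


torch.manual_seed(0)
layer = TopicReconstruction(embed_size=5, n_topics=3)
p, r = layer(torch.randn(4, 5))
```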


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence  
$||T T^T - I ||^2_F$ regularizer that encourages the rows of $T$ (the topic embeddings) to be orthonormal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{N \times M}$ squared Frobenius norm
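
A vectorized sketch of this objective for one batch is given below. It is not the notebook's own loss. It assumes the negatives are stacked into a single tensor of shape `(batch, m, d)`, that `T` has shape `(K, d)`, and that the regularizer is taken over the $K \times K$ Gram matrix of the topic rows. The function name `aspect_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def aspect_loss(z, r, negs, T, lambd=1.0):
    # z, r: (batch, d) sentence embeddings and their reconstructions
    # negs: (batch, m, d) averaged embeddings of the m negative sentences
    # T: (K, d) topic embeddings
    pos = (r * z).sum(dim=1, keepdim=True)       # r_s^T z_s, (batch, 1)
    neg = torch.einsum('bd,bmd->bm', r, negs)    # r_s^T n_i, (batch, m)
    # One hinge term per (sentence, negative) pair, summed over both
    hinge = F.relu(1.0 - pos + neg).sum()
    gram = T @ T.t()                             # (K, K)
    eye = torch.eye(T.size(0), device=T.device, dtype=T.dtype)
    reg = ((gram - eye) ** 2).sum()              # squared Frobenius norm
    return hinge + lambd * reg


T = torch.tensor([[1., 0., 0.], [0., 1., 0.]])  # orthonormal rows, so reg = 0
z = torch.tensor([[1., 0., 0.]])
r = z.clone()
negs = torch.zeros(1, 2, 3)
loss = aspect_loss(z, r, negs, T)
```

Note that the hinge is applied to each negative separately; summing all negative scores inside a single `max` gives a different objective.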


**[3 points]** Compute topic coherence for at least 3 different numbers of topics, using the 10 nearest words for each topic. This means you have to train one model per number of topics. You can use the code from the seminar notes with word2vec similarity scores.
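
One possible way to get the nearest words and a coherence score is sketched below. The seminar code is not reproduced here, so this is an assumption about its general shape: nearest words are found by cosine similarity between topic and word embeddings, and coherence is taken to be the mean pairwise cosine similarity of those words. The names `nearest_words` and `topic_coherence` are hypothetical.

```python
import torch
import torch.nn.functional as F


def nearest_words(T, E, itos, k=10):
    # T: (K, d) topic embeddings; E: (V, d) word embeddings; itos: list of V words
    sims = F.normalize(T, dim=1) @ F.normalize(E, dim=1).t()  # cosine, (K, V)
    top = sims.topk(k, dim=1).indices                         # (K, k)
    return [[itos[j] for j in row] for row in top.tolist()]


def topic_coherence(words, stoi, E):
    # Mean pairwise cosine similarity between the words' embeddings
    # (assumed coherence definition; requires at least two words)
    idx = torch.tensor([stoi[w] for w in words])
    v = F.normalize(E[idx], dim=1)
    sim = v @ v.t()
    n = len(words)
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()


# Toy example with a 4-word vocabulary
E = torch.eye(4)
itos = ['a', 'b', 'c', 'd']
stoi = {w: i for i, w in enumerate(itos)}
T = torch.tensor([[1., 0., 0., 0.], [0., 0., 1., 0.]])
top_words = nearest_words(T, E, itos, k=1)
```

With the notebook's objects, `E` would be the embedding matrix and `itos`/`stoi` would come from `TEXT.vocab`; the score would then be averaged over topics for each trained model.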

In [0]:
# TODO Replace texts[23] with something more sensible when choosing negative samples
# TODO Change TabularDataset so that the neg_{} fields depend on NEG_SAMPLES
# TODO Make the model handle a number of negatives that depends on NEG_SAMPLES instead of a fixed one
# TODO Define and verify the structure of the loss.

In [1]:
import pandas as pd
import numpy as np

import nltk
import spacy 

import torch
from torchtext.data import Field, TabularDataset, BucketIterator
from torchtext.vocab import Vectors
from gensim.models import Word2Vec, KeyedVectors

nltk.download('punkt')
spacy_en = spacy.load('en')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
BATCH_SIZE = 64
NEG_SAMPLES = 3  # number of negative samples
random_state = 23
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DATA

In [3]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
!unzip '/content/data.zip'

--2020-03-22 10:48:11--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-22 10:48:11--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-22 10:48:12--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [0]:
with open('/content/data.txt', 'r') as f:
    data = nltk.tokenize.sent_tokenize(f.read())

with open('/content/stopwords.txt', 'r') as f:
    stopwords = f.read().splitlines()

In [5]:
print(len(data), data[0])
print(len(stopwords), stopwords[0])

183400 Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years.
350 a


# Making DataFrame

In [0]:
def create_df(texts, neg_samples):
    """
    Creating pandas DataFrame from texts and adding randomly chosen negative samples
    """
    df = pd.DataFrame()
    df['text'] = texts

    for i in range(1, neg_samples+1):
        df['neg_{}'.format(i)] = [texts[ind] if ind != el else texts[23] for el, ind in enumerate(np.random.choice(np.arange(0,len(texts)), size=len(texts)))]
    return df
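
A sketch of one way to avoid the `texts[23]` fallback: draw a random non-zero offset for every row, so a sentence can never be picked as its own negative. The helper name `sample_negative_indices` and the `seed` parameter are assumptions, and the same negative may still appear twice for one sentence.

```python
import numpy as np


def sample_negative_indices(n_texts, neg_samples, seed=None):
    # Returns an (n_texts, neg_samples) array of row indices, none equal to its own row.
    # Requires n_texts >= 2.
    rng = np.random.default_rng(seed)
    # Offsets in [1, n_texts - 1] guarantee (i + offset) % n_texts != i
    offsets = rng.integers(1, n_texts, size=(n_texts, neg_samples))
    return (np.arange(n_texts)[:, None] + offsets) % n_texts


idx = sample_negative_indices(n_texts=100, neg_samples=3, seed=23)
```

Column `k` of `idx` could then fill a negative column, for example `[texts[j] for j in idx[:, 0]]` for `neg_1`.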

In [0]:
df = create_df(data, neg_samples=NEG_SAMPLES)

In [8]:
df.head()

Unnamed: 0,text,neg_1,neg_2,neg_3
0,Barclays' defiance of US fines has merit Barcl...,We are also changing the process for new comme...,When are football clubs going to start this ki...,The whole thing stinks to high heaven.
1,"So it is tempting to think the bank, when aske...","Instead, like most, I found I was more British...","Jess Bradley of Action for Trans Health, a cam...",“We’ve received your loan inquiry.
2,"That is not the view of the chief executive, J...","The decision by the chancellor, Philip Hammond...",No decision has been taken on the cost to pass...,"“Lock her up, lock her up,” chanted the crowd,..."
3,Barclays thinks the DoJ’s claims are “disconne...,“My focus is on winning the largest state in o...,Very well deserved and a demonstration that yo...,“We can’t spread existing funding across Austr...
4,"But actually, some grudging respect for Staley...",But there are also movies covered where film m...,"During that nervy, scratchy period Palace were...","The current debate over diversity in films, he..."


In [9]:
df.tail()

Unnamed: 0,text,neg_1,neg_2,neg_3
183395,It feels as though Stone realised that some of...,To miss out by one goal … to see a team playin...,The contrast with 2014’s Scottish referendum i...,“What do we want?
183396,"There are some fun elements, many involving Rh...","Alison Macfarlane, a professor of perinatal he...",I can pretty much remember where each and ever...,Most say they don’t deliberately withhold bad ...
183397,I particularly enjoyed a scene in which O’Bria...,Sid Lowe has the latest ...,I think Dan Jarvis is a really good guy.,"Thus begat Cinerama, an initially successful e..."
183398,His carnivorous snarl fills the immense screen...,"O’Malley has the option, if he wants to narrow...",They were almost entirely ignored.,“There’s going to be a tremendous problem” if ...
183399,There’s a playful visual flair to this moment ...,"My rule of thumb, which I continue to follow, ...",He also worked for the Rolling Stones on their...,Certainly Republican women angrily withdrew th...


In [0]:
df.to_csv('data.csv', index=False)

# TORCHTEXT and stuff

In [0]:
def tokenize(text):
    return [tok.lemma_ for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]

In [12]:
# Using data to pretrain word-embeddings

data_tokenized = list(df['text'].apply(lambda x: tokenize(x)))
model = Word2Vec(data_tokenized, size=200, window=10, negative=5)  # building emb of size 200 (parameters from the paper)
model_weights = torch.FloatTensor(model.wv.vectors)
model.wv.save_word2vec_format('pretrained_embeddings')
vectors = Vectors(name='pretrained_embeddings', cache='./')  # and saving the weights to build vocab later

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  0%|          | 0/24240 [00:00<?, ?it/s]Skipping token b'24240' with 1-dimensional vector [b'200']; likely a header
100%|█████████▉| 24203/24240 [00:01<00:00, 14941.65it/s]


In [0]:
# https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

TEXT = Field(sequential=True, 
             include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize, 
             lower=True, 
             stop_words=stopwords)

# RESULT = Field(sequential=True, 
#              include_lengths=False, 
#              batch_first=True, 
#              tokenize=tokenize, 
#              lower=True,
#              stop_words=stopwords)

dataset = TabularDataset(
           path="/content/data.csv",
           format='csv',
           skip_header=True,
           fields=[('text', TEXT),('neg_1', TEXT), ('neg_2', TEXT), ('neg_3', TEXT)])

TEXT.build_vocab(dataset, min_freq=2, vectors=vectors,
                   unk_init = torch.Tensor.normal_)

# RESULT is commented out above, so building its vocabulary would raise a NameError
# RESULT.build_vocab(dataset, min_freq=2, vectors=vectors,
#                    unk_init = torch.Tensor.normal_)

In [0]:
vocab = TEXT.vocab

In [93]:
print('Vocab size:', len(TEXT.vocab.itos))
TEXT.vocab.itos[:10]

Vocab size: 55879


['<unk>',
 '<pad>',
 '-pron-',
 'i',
 'trump',
 'people',
 'go',
 'get',
 'time',
 'take']

In [97]:
valid[0].__dict__

{'neg_1': ['york',
  'university',
  'brennan',
  'center',
  'publish',
  'article',
  'michael',
  'price',
  'smart',
  'tv',
  'scare',
  'turn',
  'thing',
  'myriad',
  'disturb',
  'feature',
  'facial',
  'recognition'],
 'neg_2': ['friday',
  'trade',
  'barb',
  'conservative',
  'ted',
  'cruz',
  'launch',
  'tv',
  'ad',
  'feature',
  'jamiel',
  'shaw',
  'son',
  'murder',
  'undocumented',
  'migrant'],
 'neg_3': ['glower', 'loose', 'divot', 'suggest', '-pron-', 'pitch', 'blame'],
 'text': ['go',
  'lee',
  'brown',
  'injury',
  'time',
  'goal',
  'mean',
  'accrington',
  'score',
  'remain',
  'bump',
  'play',
  'offs']}

In [95]:
train[0].__dict__

{'neg_1': ['internet',
  'messy',
  'fill',
  'lot',
  'different',
  'stuff',
  'need',
  'conversation'],
 'neg_2': ['change', 'football', 'league'],
 'neg_3': ['give',
  'big',
  'advantage',
  'campaign',
  'ability',
  'datum',
  'technology',
  'target',
  'individual',
  'voter'],
 'text': ['catherine',
  'invite',
  'lakeside',
  'summer',
  'house',
  'old',
  'friend',
  'virginia',
  'play',
  'katherine',
  'waterston',
  'recuperate']}

In [0]:
train, test = dataset.split(0.8)
train, valid = train.split(0.8)

In [0]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
    shuffle=True,
    sort_key=lambda x: len(x.text),
    device=device
)

In [101]:
b = next(iter(valid_iterator))
vars(b).keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'neg_1', 'neg_2', 'neg_3'])

In [102]:
b = next(iter(train_iterator))
vars(b).keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'neg_1', 'neg_2', 'neg_3'])

In [70]:
b.text.size(), b.neg_1.size(), b.neg_2.size(), b.neg_3.size()

(torch.Size([64, 33]),
 torch.Size([64, 39]),
 torch.Size([64, 42]),
 torch.Size([64, 34]))

# Model

In [0]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [0]:
# Optimizer = Adam

# learning rate 0.001 for 15 epochs and batch size of 50. 
# We set the number of negative samples per input sample m to 20,
# λ to 1
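
The settings listed above could be wired up roughly as follows. This is a sketch: the constant names and the `make_optimizer` helper are hypothetical, and passing only trainable parameters to Adam is a design choice made here so that frozen embeddings are skipped.

```python
import torch.nn as nn
import torch.optim as optim

# Values quoted in the comment above
LEARNING_RATE = 0.001
N_EPOCHS = 15
PAPER_BATCH_SIZE = 50
PAPER_NEG_SAMPLES = 20
LAMBDA = 1.0


def make_optimizer(model, lr=LEARNING_RATE):
    # Only parameters with requires_grad=True are handed to Adam
    trainable = [p for p in model.parameters() if p.requires_grad]
    return optim.Adam(trainable, lr=lr)


# Toy model: a frozen embedding followed by a trainable linear layer
toy = nn.Sequential(nn.Embedding(5, 3), nn.Linear(3, 2))
toy[0].weight.requires_grad = False
opt = make_optimizer(toy)
```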

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{K \times d}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{K \times d}$ trainable matrix of topic embeddings, $K$ = number of topics

In [0]:
class MyModel(nn.Module):

    def __init__(self, vocab_size, embed_size, topics_size):
        super(MyModel, self).__init__()
        
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.topics_size = topics_size

        self.embeddings = nn.Embedding(self.vocab_size, self.embed_size)
        self.embeddings.weight.data.copy_(vocab.vectors)
        self.embeddings.weight.requires_grad = False  # For now let's freeze them

        self.fc1 = nn.Linear(self.embed_size, self.topics_size)  # W
        self.fc2 = nn.Linear(self.topics_size, self.embed_size, bias=False)  # T; weight has shape (embed_size, topics_size)

        self.bias = None


    def sentence_embeddings(self, x):
        '''
        input: (batch_size, seq_length, embed_size)
        output: (batch_size, embed_size)
        '''
        x = torch.sum(x, dim=1)/x.size()[1]
        return x

    
    def forward(self, batch):
        
        text, neg_1, neg_2, neg_3 = batch.text, batch.neg_1, batch.neg_2, batch.neg_3

        text_true = self.sentence_embeddings(self.embeddings(text))
        neg_1 = self.sentence_embeddings(self.embeddings(neg_1))
        neg_2 = self.sentence_embeddings(self.embeddings(neg_2))
        neg_3 = self.sentence_embeddings(self.embeddings(neg_3))

        
        text_out = self.fc1(text_true)
        text_out = F.softmax(text_out, dim=1)
        text_out = self.fc2(text_out)

        return text_true, text_out, neg_1, neg_2, neg_3
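
In `sentence_embeddings`, `x.size()[1]` is the padded length of the batch, so padding positions count toward the denominator, and toward the sum as well unless the `<pad>` vector is zero. A sketch of a padding-aware average follows. The function name `masked_mean` is hypothetical, and `pad_idx=1` is taken from the position of `<pad>` in the vocabulary printed earlier.

```python
import torch


def masked_mean(emb, token_ids, pad_idx=1):
    # emb: (batch, seq_len, d) token embeddings
    # token_ids: (batch, seq_len) vocabulary indices used to look up emb
    mask = (token_ids != pad_idx).unsqueeze(-1).to(emb.dtype)  # (batch, seq_len, 1)
    lengths = mask.sum(dim=1).clamp(min=1.0)                   # (batch, 1), avoids 0-division
    return (emb * mask).sum(dim=1) / lengths                   # (batch, d)


emb = torch.tensor([[[2., 2.], [4., 4.], [100., 100.]]])
token_ids = torch.tensor([[5, 6, 1]])  # last position is padding
avg = masked_mean(emb, token_ids)
```

In the model this would be called with the index tensor as well, for example `masked_mean(self.embeddings(text), text)`.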

In [106]:
model.fc2.weight.size()

torch.Size([200, 10])

In [0]:
model = MyModel(vocab_size=len(vocab), embed_size=200, topics_size=10)
model.to(device)

optimizer = optim.Adam(model.parameters())

In [108]:
model.forward(b)

(tensor([[-0.1040,  0.0607,  0.0434,  ..., -0.0655,  0.0050,  0.0481],
         [-0.0277, -0.0103,  0.1187,  ..., -0.0986,  0.1300, -0.0005],
         [ 0.0428,  0.1004,  0.0653,  ..., -0.0619, -0.0193,  0.1776],
         ...,
         [-0.0033,  0.0018,  0.0027,  ...,  0.0054, -0.0019, -0.0067],
         [-0.0692,  0.0241, -0.0075,  ..., -0.0812,  0.0274, -0.0182],
         [-0.1031,  0.0283,  0.0414,  ..., -0.0419,  0.1143,  0.2702]],
        device='cuda:0', grad_fn=<DivBackward0>),
 tensor([[-2.5620e-01,  3.2627e-01, -2.1972e-01,  ..., -3.5729e-02,
           2.6560e-03, -1.6473e-01],
         [-2.5265e-01,  3.1814e-01, -2.2743e-01,  ..., -3.3048e-02,
           8.6200e-04, -1.6366e-01],
         [-2.5449e-01,  3.1818e-01, -2.1674e-01,  ..., -3.3927e-02,
          -2.4349e-04, -1.6762e-01],
         ...,
         [-2.5471e-01,  3.2195e-01, -2.2187e-01,  ..., -3.4323e-02,
           1.1400e-03, -1.6472e-01],
         [-2.5465e-01,  3.2172e-01, -2.2440e-01,  ..., -3.5722e-02,
       

In [109]:
tr, out, n1, n2, n3 = model.forward(b)

tr.size(), out.size(), n1.size(), n2.size(), n3.size()

(torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]))

In [110]:
loss_batch = 0 
for ind, truth in enumerate(tr): # Iterate over the batch
    # print(truth.size())
    temp = 1 - torch.dot(out[ind].T, tr[ind]) + torch.dot(out[ind].T, n1[ind]) + torch.dot(out[ind].T, n2[ind]) + torch.dot(out[ind].T, n3[ind])
    # print(temp.size(), temp)
    t_max = F.relu(temp)
    # print(t_max)
    loss_batch += t_max
    # break
loss_batch / BATCH_SIZE

tensor(1.0597, device='cuda:0', grad_fn=<DivBackward0>)

**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence  
$||T T^T - I ||^2_F$ regularizer that encourages the rows of $T$ (the topic embeddings) to be orthonormal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{N \times M}$ squared Frobenius norm

In [0]:
# https://spandan-madan.github.io/A-Collection-of-important-tasks-in-pytorch/

class LossFunction(nn.Module):

    def __init__(self):
        super(LossFunction, self).__init__()

    def forward(self, emb_true, emb_pred, n1, n2, n3, T, lambd=1):
        # T is fc2.weight, shape (embed_size, topics_size): one topic embedding per column

        # r_s^T z_s for every sentence in the batch, shape (batch_size,)
        pos = torch.sum(emb_pred * emb_true, dim=1)

        loss_batch = 0
        for neg in (n1, n2, n3):  # one hinge term per negative sample
            neg_score = torch.sum(emb_pred * neg, dim=1)
            loss_batch = loss_batch + F.relu(1 - pos + neg_score).sum()

        # Squared Frobenius norm of the (topics_size x topics_size) Gram matrix minus I
        gram = torch.mm(T.T, T)
        reg = torch.sum((gram - torch.eye(T.size(1), device=T.device)) ** 2)
        loss_batch = loss_batch + lambd * reg
        return loss_batch / emb_true.size(0)  # actual batch size; the last batch may be smaller

In [112]:
criterion = LossFunction()
criterion.to(device)

LossFunction()

In [0]:
train_losses = []
valid_losses = []

def _train_epoch(model, iterator, optimizer, curr_epoch):

    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm_notebook(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        text_true, text_out, neg_1, neg_2, neg_3 = model(batch)
        loss = criterion(text_true, text_out, neg_1, neg_2, neg_3, model.fc2.weight)
        loss.backward()
        optimizer.step()

        curr_loss = loss.item()
        
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        train_losses.append(running_loss)
        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss


def _test_epoch(model, iterator):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with torch.no_grad():
        for batch in iterator:
            text_true, text_out, neg_1, neg_2, neg_3 = model(batch)
            loss = criterion(text_true, text_out, neg_1, neg_2, neg_3, model.fc2.weight)
            epoch_loss += loss.item()
            valid_losses.append(loss.item())  # store a float, not a tensor on the device

    return epoch_loss / n_batches


def nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=100,
          scheduler=None, early_stopping=0):

    prev_loss = 100500
    es_epochs = 0
    best_epoch = None
    history = pd.DataFrame()

    for epoch in range(n_epochs):
        train_loss = _train_epoch(model, train_iterator, optimizer, epoch)
        valid_loss = _test_epoch(model, valid_iterator)
        # scheduler.step(valid_loss)

        print('validation loss %.5f' % valid_loss)

        record = {'epoch': epoch, 'train_loss': train_loss, 'valid_loss': valid_loss}
        history = history.append(record, ignore_index=True)

        if early_stopping > 0:
            if valid_loss > prev_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = history[history.valid_loss == history.valid_loss.min()].iloc[0]
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break

            prev_loss = min(prev_loss, valid_loss)

# nn_train(model, train_loader, valid_loader, optimizer, n_epochs=100) # Removed the output because it is long; the loss plot is below

In [0]:
from tqdm import tqdm_notebook

In [115]:
nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=15) # Removed the output because it is long; the loss plot is below

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  # This is added back by InteractiveShellApp.init_path()


HBox(children=(IntProgress(value=0, description='epoch 0', max=1834, style=ProgressStyle(description_width='in…


validation loss nan


HBox(children=(IntProgress(value=0, description='epoch 1', max=1834, style=ProgressStyle(description_width='in…


validation loss nan


HBox(children=(IntProgress(value=0, description='epoch 2', max=1834, style=ProgressStyle(description_width='in…

KeyboardInterrupt: ignored

# Topic Coherence