# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement the model in PyTorch from "An Unsupervised Neural Attention Model for Aspect Extraction" (He et al., 2017), also described in the seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i=1}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{Kxd}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics
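
The attention variant and the reconstruction step above can be sketched as a single PyTorch module. This is a minimal sketch, not a reference implementation: the class name `AttentionEncoder`, the `mask` argument, and the initialization of $M$ and $T$ are assumptions made for illustration. It also assumes every sentence contains at least one real token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionEncoder(nn.Module):
    """Hypothetical module: attention-weighted z_s, topic weights p_t, reconstruction r_s."""

    def __init__(self, embed_size, n_topics):
        super().__init__()
        # attention matrix M (d x d); identity initialization is an assumption
        self.M = nn.Parameter(torch.eye(embed_size))
        # W and b from p_t = softmax(W z_s + b)
        self.W = nn.Linear(embed_size, n_topics)
        # topic embeddings T (K x d); small random initialization is an assumption
        self.T = nn.Parameter(torch.randn(n_topics, embed_size) * 0.1)

    def forward(self, e, mask):
        # e: (batch, n, d) token embeddings
        # mask: (batch, n), 1.0 for real tokens and 0.0 for padding
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        y = (e * mask.unsqueeze(-1)).sum(dim=1) / lengths      # y_s, (batch, d)
        d = torch.einsum('bnd,de,be->bn', e, self.M, y)        # d_i = e_i^T M y_s
        d = d.masked_fill(mask == 0, float('-inf'))            # padding gets zero weight
        alpha = F.softmax(d, dim=1)                            # (batch, n)
        z = (alpha.unsqueeze(-1) * e).sum(dim=1)               # z_s, (batch, d)
        p = F.softmax(self.W(z), dim=1)                        # p_t, (batch, K)
        r = p @ self.T                                         # r_s = T^T p_t, (batch, d)
        return z, p, r
```

Replacing `alpha` with uniform weights over the real tokens would give the simpler averaging variant described above.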


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence  
$||T T^T - I ||_F$ regularizer that encourages the rows of $T$ (the topic embeddings) to be orthonormal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ squared Frobenius norm


**[3 points]** Compute topic coherence for at least 3 different numbers of topics. Use the 10 nearest words for each topic. This means you have to train one model for each number of topics. You can use the code from the seminar notes with word2vec similarity scores.
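
One possible way to score a topic is sketched below. The seminar code is not reproduced here, so this is an assumption about what "coherence with word2vec similarity scores" could look like: take the 10 words whose embeddings are closest to a topic embedding by cosine similarity, then average the pairwise cosine similarities among those words. The function names `nearest_words` and `coherence` are hypothetical, and all embedding rows are assumed to be non-zero.

```python
import numpy as np


def nearest_words(topic_vec, emb, k=10):
    """Indices of the k rows of emb closest to topic_vec by cosine similarity."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    t_n = topic_vec / np.linalg.norm(topic_vec)
    return np.argsort(-(emb_n @ t_n))[:k]


def coherence(word_ids, emb):
    """Mean pairwise cosine similarity between the selected word vectors."""
    v = emb[word_ids]
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    upper = np.triu_indices(len(word_ids), k=1)   # each unordered pair once
    return float(sims[upper].mean())
```

With a trained model, `emb` would be the word-embedding matrix and `topic_vec` one row of the topic matrix $T$; averaging `coherence` over all topics gives one number per model, which can then be compared across different values of K.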

In [0]:
# TODO Replace texts[23] with something more sensible when choosing negative samples
# TODO Change TabularDataset so that the neg_{} fields depend on NEG_SAMPLES
# TODO Make the model handle a number of negatives that depends on NEG_SAMPLES instead of a fixed one
# TODO Define and verify the structure of the loss.

In [1]:
import pandas as pd
import numpy as np

import nltk
import spacy 

import torch
from torchtext.data import Field, TabularDataset, BucketIterator
from torchtext.vocab import Vectors
from gensim.models import Word2Vec, KeyedVectors

nltk.download('punkt')
spacy_en = spacy.load('en')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
BATCH_SIZE = 64
NEG_SAMPLES = 3  # number of negative samples
random_state = 23
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DATA

In [3]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
!unzip '/content/data.zip'

--2020-03-20 20:03:47--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 52.74.223.119
Connecting to github.com (github.com)|52.74.223.119|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-20 20:03:47--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-20 20:03:48--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151

In [0]:
with open('/content/data.txt', 'r') as f:
    data = nltk.tokenize.sent_tokenize(f.read())

with open('/content/stopwords.txt', 'r') as f:
    stopwords = f.read().splitlines()

In [5]:
print(len(data), data[0])
print(len(stopwords), stopwords[0])

183400 Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years.
350 a


# Making DataFrame

In [0]:
def create_df(texts, neg_samples):
    """
    Creating pandas DataFrame from texts and adding randomly chosen negative samples
    """
    df = pd.DataFrame()
    df['text'] = texts

    for i in range(1, neg_samples+1):
        df['neg_{}'.format(i)] = [texts[ind] if ind != el else texts[23] for el, ind in enumerate(np.random.choice(np.arange(0,len(texts)), size=len(texts)))]
    return df
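
The `texts[23]` fallback in `create_df` can still return the sentence itself (for row 23), and it makes one sentence slightly over-represented among the negatives. A sketch of one alternative is shown below: draw a random non-zero shift for every row and add it modulo the number of texts, so a row can never be paired with itself. The helper name `sample_negative_indices` and the use of a seeded `numpy` generator are assumptions; it needs at least two texts.

```python
import numpy as np


def sample_negative_indices(n_texts, neg_samples, seed=23):
    """Return an (n_texts, neg_samples) array of indices where row i never contains i."""
    rng = np.random.default_rng(seed)
    # a shift in [1, n_texts - 1] added modulo n_texts can never map i back to i
    shifts = rng.integers(1, n_texts, size=(n_texts, neg_samples))
    return (np.arange(n_texts)[:, None] + shifts) % n_texts
```

Column `j` of the result could then fill `df['neg_{}'.format(j + 1)]` via `[texts[k] for k in idx[:, j]]`. The negatives within one row are drawn independently, so the same sentence may occasionally appear twice for a given row.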

In [0]:
df = create_df(data, neg_samples=NEG_SAMPLES)

In [8]:
df.head()

Unnamed: 0,text,neg_1,neg_2,neg_3
0,Barclays' defiance of US fines has merit Barcl...,"He asks her to be in his music video and, afte...",Most high-profile sites and services have two-...,It shows that too many patients are no longer ...
1,"So it is tempting to think the bank, when aske...",But at recent rallies Trump has continued to t...,“We want a president to make data-based decisi...,"Mace Windu’s is purple because, the exhibition..."
2,"That is not the view of the chief executive, J...","Congressman Filemon Vela, a Texas Democratic, ...",“I told you to get the house cleaned!” She yel...,(A Labor favourite) Why does it take so long t...
3,Barclays thinks the DoJ’s claims are “disconne...,The next three games (West Brom and Leicester ...,It’s really intriguing.,"Alternatively, you could buy books while suppo..."
4,"But actually, some grudging respect for Staley...","80 mins: Bony sums up a poor day, struggling t...","Unlike most other smart home systems, a single...",The 29-year-old actor plays a pickpocket who b...


In [9]:
df.tail()

Unnamed: 0,text,neg_1,neg_2,neg_3
183395,It feels as though Stone realised that some of...,Or you could say it was a battle for the soul ...,Newly sprouted flowers pushing up through the ...,23 mins: Peter Oh emails: “If the Black Cats a...
183396,"There are some fun elements, many involving Rh...",Goal!,The South Staffordshire and Shropshire trust h...,David Moyes has now had his pre-match chat wit...
183397,I particularly enjoyed a scene in which O’Bria...,Almost nine out of ten school leaders are tell...,“I haven’t supported Mr Trump at any point alo...,My patient taught me that we don’t have to act...
183398,His carnivorous snarl fills the immense screen...,“Often the harm may be too small to make it pr...,"Rather than privacy from the state, the real c...",She refused to comment on reports that Cameron...
183399,There’s a playful visual flair to this moment ...,“The MP for Dewsbury was employed by Virgin Ca...,"(Like humans, they have sex for pleasure as we...",Man of the match Virgil van Dijk (Southampton)...


In [0]:
df.to_csv('data.csv', index=False)

# TORCHTEXT and stuff

In [0]:
def tokenize(text):
    return [tok.lemma_ for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]

In [12]:
# Using data to pretrain word-embeddings

data_tokenized = list(df['text'].apply(lambda x: tokenize(x)))
model = Word2Vec(data_tokenized, size=200, window=10, negative=5)  # building emb of size 200 (parameters from the paper)
model_weights = torch.FloatTensor(model.wv.vectors)
model.wv.save_word2vec_format('pretrained_embeddings')
vectors = Vectors(name='pretrained_embeddings', cache='./')  # and saving the weights to build vocab later

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  0%|          | 0/24240 [00:00<?, ?it/s]Skipping token b'24240' with 1-dimensional vector [b'200']; likely a header
 96%|█████████▌| 23179/24240 [00:01<00:00, 13596.14it/s]


In [0]:
# https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

TEXT = Field(sequential=True, 
             include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize, 
             lower=True)

RESULT = Field(sequential=True, 
             include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize, 
             lower=True)

dataset = TabularDataset(
           path="/content/data.csv",
           format='csv',
           skip_header=True,
           fields=[('text', RESULT),('neg_1', TEXT), ('neg_2', TEXT), ('neg_3', TEXT)])

TEXT.build_vocab(dataset, min_freq=2, vectors=vectors,
                   unk_init = torch.Tensor.normal_)

RESULT.build_vocab(dataset, min_freq=2, vectors=vectors,
                   unk_init = torch.Tensor.normal_)

In [0]:
vocab = TEXT.vocab

In [15]:
print('Vocab size:', len(TEXT.vocab.itos))
TEXT.vocab.itos[:10]

Vocab size: 52958


['<unk>', '<pad>', 'the', 'be', 'a', 'to', 'of', 'and', 'in', 'that']

In [0]:
train, test = dataset.split(0.8)
train, valid = train.split(0.8)

In [0]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
    shuffle=True,
    sort_key=lambda x: len(x.text),
    device=device
)

In [65]:
b = next(iter(train_iterator))
vars(b).keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'neg_1', 'neg_2', 'neg_3'])

In [66]:
b.text.size(), b.neg_1.size(), b.neg_2.size(), b.neg_3.size()

(torch.Size([64, 61]),
 torch.Size([64, 78]),
 torch.Size([64, 100]),
 torch.Size([64, 80]))

# Model

In [0]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [0]:
# Hyperparameters reported in the paper:
# Adam optimizer, learning rate 0.001, 15 epochs, batch size 50,
# m = 20 negative samples per input sample, λ = 1

In [0]:
class MyModel(nn.Module):

    def __init__(self, vocab_size, embed_size, topics_size):
        super(MyModel, self).__init__()
        
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.topics_size = topics_size

        self.embeddings = nn.Embedding(self.vocab_size, self.embed_size)
        self.embeddings.weight.data.copy_(vocab.vectors)

        self.fc1 = nn.Linear(self.embed_size, self.topics_size)
        self.fc2 = nn.Linear(self.topics_size, self.embed_size)   

        self.bias = None


    def sentence_embeddings(self, x):
        '''
        input: (batch_size, seq_length, embed_size)
        output: (batch_size, embed_size)
        '''
        x = torch.sum(x, dim=1)/x.size()[1]
        return x

    
    def forward(self, batch):
        
        text, neg_1, neg_2, neg_3 = batch.text, batch.neg_1, batch.neg_2, batch.neg_3

        text_true = self.sentence_embeddings(self.embeddings(text))
        neg_1 = self.sentence_embeddings(self.embeddings(neg_1))
        neg_2 = self.sentence_embeddings(self.embeddings(neg_2))
        neg_3 = self.sentence_embeddings(self.embeddings(neg_3))

        
        text_out = self.fc1(text_true)
        text_out = F.softmax(text_out, dim=1)
        text_out = self.fc2(text_out)

        return text_true, text_out, neg_1, neg_2, neg_3
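
Note that `sentence_embeddings` divides by the padded sequence length `x.size()[1]`, so sentences shorter than the longest one in the batch are scaled down and the `<pad>` embedding is mixed in. A sketch of a padding-aware average follows. The function name `masked_mean` is hypothetical, and `pad_idx=1` is taken from the vocabulary printout above, where `<pad>` is at index 1.

```python
import torch


def masked_mean(token_ids, embedded, pad_idx=1):
    """Average embeddings over real tokens only.

    token_ids: (batch, seq_len) integer ids
    embedded:  (batch, seq_len, embed_size) output of the embedding layer
    """
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embedded.dtype)  # (batch, seq_len, 1)
    lengths = mask.sum(dim=1).clamp(min=1.0)                        # (batch, 1), avoids 0/0
    return (embedded * mask).sum(dim=1) / lengths
```

Inside the model this would be called as `masked_mean(text, self.embeddings(text))`, which requires passing the token ids along with the embedded tensor.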

In [60]:
model = MyModel(vocab_size=len(vocab), embed_size=200, topics_size=10)
model.to(device)

MyModel(
  (embeddings): Embedding(52958, 200)
  (fc1): Linear(in_features=200, out_features=10, bias=True)
  (fc2): Linear(in_features=10, out_features=200, bias=True)
)

In [61]:
model.forward(b)

(tensor([[ 0.1027, -0.0895,  0.0326,  ..., -0.0178, -0.0326, -0.1298],
         [-0.0048, -0.0119,  0.0072,  ...,  0.0247, -0.0146, -0.0277],
         [ 0.1114, -0.1159,  0.0220,  ...,  0.0841,  0.0010, -0.0294],
         ...,
         [-0.0439, -0.1392,  0.0732,  ...,  0.1561,  0.0902, -0.0220],
         [-0.0087,  0.0303,  0.0064,  ...,  0.0101,  0.0033,  0.0102],
         [ 0.0119,  0.0146,  0.0214,  ...,  0.0071,  0.0080, -0.0172]],
        device='cuda:0', grad_fn=<DivBackward0>),
 tensor([[-0.0886,  0.1370, -0.1686,  ...,  0.1837, -0.2896, -0.1951],
         [-0.0889,  0.1345, -0.1676,  ...,  0.1859, -0.2943, -0.1954],
         [-0.0891,  0.1408, -0.1684,  ...,  0.1781, -0.2888, -0.1991],
         ...,
         [-0.0886,  0.1415, -0.1721,  ...,  0.1790, -0.2825, -0.1964],
         [-0.0882,  0.1338, -0.1673,  ...,  0.1879, -0.2960, -0.1950],
         [-0.0887,  0.1350, -0.1674,  ...,  0.1869, -0.2950, -0.1956]],
        device='cuda:0', grad_fn=<AddmmBackward>),
 tensor([[ 0.0149

In [62]:
t, t2, n1, n2, n3 = model.forward(b)

t.size(), t2.size(), n1.size(), n2.size(), n3.size()

(torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]),
 torch.Size([64, 200]))

**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence  
$||T T^T - I ||_F$ regularizer that encourages the rows of $T$ (the topic embeddings) to be orthonormal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ squared Frobenius norm

In [0]:
class LossFunction(nn.Module):

    def __init__(self, emb_true, emb_pred, negative, T, lambd):
        super(LossFunction, self).__init__()

        self.emb_true = emb_true
        self.emb_pred = emb_pred
        self.negative = negative
        self.T = T
        self.lambd = lambd

    def __call__(self):
        pass

In [0]:
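def loss_function(true, negative, pred, T, lambd):
    """
    Placeholder for the training objective J; not implemented yet.
    `lambd` is the regularization weight (`lambda` is a reserved word in Python).
    """
    return None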
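
The placeholder above could be filled in along the following lines. This is a sketch under stated assumptions, not a verified implementation of the paper's loss: the name `max_margin_loss` is hypothetical, the negatives are assumed to be stacked into one tensor of shape `(batch, m, d)`, the hinge terms are summed over the batch, and the regularizer is computed over the K topic rows as $T T^T$ without normalizing them.

```python
import torch


def max_margin_loss(z, r, negatives, T, lambd=1.0):
    """Hinge loss over negative samples plus an orthogonality penalty on T.

    z:         (batch, d) sentence embeddings z_s
    r:         (batch, d) reconstructed embeddings r_s
    negatives: (batch, m, d) averaged embeddings n_i of the negative sentences
    T:         (K, d) topic embeddings
    """
    pos = (r * z).sum(dim=1, keepdim=True)               # r_s^T z_s, (batch, 1)
    neg = torch.einsum('bd,bmd->bm', r, negatives)       # r_s^T n_i, (batch, m)
    hinge = torch.clamp(1.0 - pos + neg, min=0.0).sum()  # summed over batch and negatives
    eye = torch.eye(T.size(0), device=T.device, dtype=T.dtype)
    reg = ((T @ T.t() - eye) ** 2).sum()                 # squared Frobenius norm
    return hinge + lambd * reg
```

With the outputs of `MyModel.forward`, the negatives could be combined as `torch.stack([n1, n2, n3], dim=1)`, and `T` would correspond to `model.fc2.weight.t()`, since the weight of `nn.Linear(topics_size, embed_size)` has shape `(embed_size, topics_size)`. Note that `fc2` also has a bias term, which the formula for $r_s$ does not include.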

In [0]:
from tqdm.notebook import tqdm

train_losses = []
valid_losses = []

def _train_epoch(model, iterator, optimizer, curr_epoch):

    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        loss = model(batch)
        loss.backward()
        optimizer.step()

        curr_loss = loss.item()
        
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        train_losses.append(running_loss)
        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss


def _test_epoch(model, iterator):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with torch.no_grad():
        for batch in iterator:
            loss = model(batch)
            epoch_loss += loss.item()
            valid_losses.append(loss.item())  # store a float, not a tensor

    return epoch_loss / n_batches


def nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=100,
          scheduler=None, early_stopping=0):

    prev_loss = 100500
    es_epochs = 0
    best_epoch = None
    history = pd.DataFrame()

    for epoch in range(n_epochs):
        train_loss = _train_epoch(model, train_iterator, optimizer, epoch)
        valid_loss = _test_epoch(model, valid_iterator)
        if scheduler is not None:
            # assumes a scheduler that takes the validation loss, e.g. ReduceLROnPlateau
            scheduler.step(valid_loss)

        print('validation loss %.5f' % valid_loss)

        record = {'epoch': epoch, 'train_loss': train_loss, 'valid_loss': valid_loss}
        history = pd.concat([history, pd.DataFrame([record])], ignore_index=True)

        if early_stopping > 0:
            if valid_loss > prev_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = history[history.valid_loss == history.valid_loss.min()].iloc[0]
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break

            prev_loss = min(prev_loss, valid_loss)

# nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=100) # Output removed because it is long; the loss plot is below

# Topic Coherence