
## Assignment 9
Use data from https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip
Implement model in pytorch from "An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017", also desribed in seminar notes.

You can use sentence embeddings with attention [7 points]:
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding
$\alpha_i = softmax(d_i)$ attention weight for i-th token
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context
$e_{w_i} \in R^d$, token embedding of size d
$n$ - number of tokens in a sentence

Or just use sentence embedding as an average over word embeddings [5 points]:
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding
$e_{w_i} \in R^d$, token embedding of size d
$n$ - number of tokens in a sentence

$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics

Training objective: $$ J = \sum_{s \in D} \sum_{i=1}^n max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$ where
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm

[3 points] Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

In [1]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
!unzip '/content/data.zip'

--2020-03-22 16:58:34--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:58:35--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-22 16:58:35--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [2]:
import pandas as pd
import numpy as np

import nltk
import spacy 

import torch
from torchtext.data import Field, TabularDataset, BucketIterator
from torchtext.vocab import Vectors
from gensim.models import Word2Vec, KeyedVectors

nltk.download('punkt')
spacy_en = spacy.load('en')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
BATCH_SIZE = 64
neg_samples = 3  
random_state = 23
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [0]:
with open('/content/data.txt', 'r') as f:
    data = f.read()

with open('/content/stopwords.txt', 'r') as f:
    stopwords = f.read().splitlines()

In [0]:
def tokenize(text):
  return nltk.word_tokenize(text) 
    
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize,
             lower=True,
             stop_words=stopwords)

In [0]:
train_data = pd.DataFrame()
fields = [('sent',TEXT)]
train_data['sent'] = nltk.sent_tokenize(data)[:10000]
train_data.to_csv('train.csv', index=False)
train1 = TabularDataset(path="train.csv", format='csv',
                     skip_header=True, fields=fields)

In [0]:
TEXT.build_vocab(train1)

In [8]:
print(len(TEXT.vocab.itos))

21346


In [0]:
train, test = train1.split(0.8)
train, valid = train.split(0.8)

In [0]:
train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
    shuffle=True,
    sort_key=lambda x: len(x.result),
    device=device
)

In [0]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, emb_dim=300, topic_dim=5):
      super(MyModel, self).__init__()
      self.get_emb = nn.EmbeddingBag(vocab_size, emb_dim)
      self.embedding = nn.EmbeddingBag(vocab_size, emb_dim)  
      self.pt = nn.Linear(emb_dim, topic_dim)
      self.softmax = F.softmax
      self.x2 = nn.Linear(topic_dim, emb_dim, bias=False)

    def forward(self, batch):
      vecs = self.get_emb(batch.sent)
      x1 = torch.tensor(vecs).unsqueeze(1)
      x2 = self.step(torch.tensor(vecs))
      x3 = x2.unsqueeze(1).permute(0, 2, 1)
      x4 = torch.bmm(x3, x1)
      return x3, x4
    
    def step(self, x):
        x = self.pt(torch.tensor(x))
        x = self.softmax(x)
        x = self.x2(x)
        return x
     
    def negs(self, batch):
        total = len(batch)
        for idx in range(total):
            to_random = list(range(0, idx)) + list(range(idx+1, total))
            neg_ids = np.random.choice(to_random, size=5,replace=False)

            negs = [self.get_emb(batch.sent)[i] for i in neg_ids]
            negs = torch.stack(negs, dim=-1)
            yield negs

In [0]:
model = MyModel(vocab_size=len(TEXT.vocab))
model = model.to(device)

In [0]:
class MyLoss(nn.Module):
  
    def __init__(self):
        super().__init__()

    def regularization(self, param, lambda_=1):
        inner = torch.mm(param.permute(1, 0), param) 
        reg = inner - torch.eye(inner.shape[0])
        return lambda_ * torch.norm(input=reg, p='fro')

    def forward(self, rsT, rsTzs, negs, param):
        negs = torch.stack(list(negs))
        losses = []
        for ni in negs.permute(2, 0, 1):
            ni = torch.bmm(rsT, ni.unsqueeze(1))
            tmp = (1 - rsTzs + ni).squeeze(1)
            zeros = torch.zeros_like(tmp)
            values, _ = torch.max(torch.stack([tmp, zeros]), 0)
            losses.append(values)
        losses = torch.stack(losses, dim=-1)
        return torch.sum(losses) + self.regularization(param)

In [0]:
from tqdm import tqdm, tqdm_notebook
def train_epoch(data_iter, n_epoch, model, criterion, optimizer=None):
    train_losses = []
    total_loss = 0
    counter = 0
    data_iter = tqdm_notebook(data_iter, total=len(data_iter), 
                              desc=f"Epoch {n_epoch + 1}", leave=True)
    for batch in data_iter:
        if optimizer:
          optimizer.zero_grad()
        rsT, rsTzs = model(batch)
        negs = model.negs(batch)
        param = list(model.parameters())[1]
        loss = criterion(rsT, rsTzs, negs, param)
        loss.backward()
        if optimizer:
          optimizer.step()
        curr_value = loss.detach().item()
        total_loss += curr_value
        train_losses.append(curr_value)
        data_iter.set_postfix(loss = curr_value)
        counter += 1
        
    total_loss /= counter
    return total_loss, train_losses

In [0]:
criterion = MyLoss()
criterion.to(device)
optimizer = torch.optim.Adam(model.parameters())

In [17]:
total_train_losses = []
total_valid_losses = []
for epoch in range(5):
    model.train()
    loss, train_losses = train_epoch(train_iter, 5, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  import sys


HBox(children=(IntProgress(value=0, description='Epoch 6', style=ProgressStyle(description_width='initial')), …

  del sys.path[0]
  



train 29143335.28


HBox(children=(IntProgress(value=0, description='Epoch 6', style=ProgressStyle(description_width='initial')), …


train 29088316.54


HBox(children=(IntProgress(value=0, description='Epoch 6', style=ProgressStyle(description_width='initial')), …

KeyboardInterrupt: ignored