# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement model in pytorch from "An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017", also desribed in seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^n max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm


**[3 points]** Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.externals import joblib
from sklearn import metrics
from tqdm import tqdm
import nltk
from nltk import PunktSentenceTokenizer
nltk.download('punkt')
from torchtext.data import Field, LabelField, BucketIterator, TabularDataset, Iterator
import numpy as np
import torch
from torch.nn import init
from torch.nn.parameter import Parameter


%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [42]:
raw_documents = ''
with open( "data/data.txt", "r") as f:
    raw_documents = f.read()
        
print("Read %d raw text documents" % len(raw_documents.split('/n')))

Read 13 raw text documents


In [43]:
# custom stopwords
custom_stop_words = []
with open( "data/stopwords.txt", "r" ) as f:
    for line in f.readlines():
        custom_stop_words.append( line.strip().lower() )
        
print("Stopword list has %d entries" % len(custom_stop_words) )

Stopword list has 350 entries


In [0]:
train = pd.DataFrame()
train['doc'] = nltk.sent_tokenize(raw_documents)[:1000] 

In [45]:
train.head()

Unnamed: 0,doc
0,Barclays' defiance of US fines has merit Barcl...
1,"So it is tempting to think the bank, when aske..."
2,"That is not the view of the chief executive, J..."
3,Barclays thinks the DoJ’s claims are “disconne...
4,"But actually, some grudging respect for Staley..."


In [0]:
train['neg1'] = train['doc'].apply(lambda x: train.iloc[np.random.choice(len(train)), 0])
train['neg2'] = train['doc'].apply(lambda x: train.iloc[np.random.choice(len(train)), 0])
train['neg3'] = train['doc'].apply(lambda x: train.iloc[np.random.choice(len(train)), 0])

In [47]:
train.head()

Unnamed: 0,doc,neg1,neg2,neg3
0,Barclays' defiance of US fines has merit Barcl...,"A dispatch comes in from Glastonbury, where th...",European Union referendum polling day – as it ...,But Ukip has also been getting stronger in the...
1,"So it is tempting to think the bank, when aske...",The Real Clear Politics average uses two polls...,Spanish-owned TSB – once part of Lloyds – want...,Don’t lose this chance to make today our Indep...
2,"That is not the view of the chief executive, J...",The first electronic message - 1965 The very f...,"I’m tired of the rich getting richer, and havi...",A Twitter spokesperson said: “Our rules explic...
3,Barclays thinks the DoJ’s claims are “disconne...,"Ana Boata, economist at trade insurance firm E...",Whether he built the movement or simply rode i...,We’re getting some reports around the country ...
4,"But actually, some grudging respect for Staley...","Rubio, who spent part of his childhood in Neva...","That view was echoed by Peter Waiswa, an assoc...","It’s a similar picture for Ladbrokes, which re..."


In [0]:
train.to_csv('train.csv', index=False)

In [0]:
def tokenize(text):
  return nltk.word_tokenize(text) 

In [0]:
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize,
             lower=True,
             stop_words=custom_stop_words)

datafields = [('doc',TEXT), ('neg1', TEXT), ('neg2', TEXT), ('neg3', TEXT)]

In [0]:
train_d = TabularDataset(path="train.csv", format='csv', skip_header=True, fields=datafields)

TEXT.build_vocab(train_d)

In [52]:
vocab_size = len(TEXT.vocab)
vocab_size

5357

Thanks to https://github.com/alexeyev/abae-pytorch

In [0]:
class SelfAttention(torch.nn.Module):
    def __init__(self, wv_dim, maxlen):
        super(SelfAttention, self).__init__()
        self.wv_dim = wv_dim

        # max sentence length -- batch 2nd dim size
        self.maxlen = maxlen
        self.M = Parameter(torch.Tensor(wv_dim, wv_dim))
        init.kaiming_uniform(self.M.data)

        # softmax for attending to wod vectors
        self.attention_softmax = torch.nn.Softmax()

    def forward(self, input_embeddings):
        # (b, wv, 1)
        mean_embedding = torch.mean(input_embeddings, (1,)).unsqueeze(2)

        # (wv, wv) x (b, wv, 1) -> (b, wv, 1)
        product_1 = torch.matmul(self.M, mean_embedding)

        # (b, maxlen, wv) x (b, wv, 1) -> (b, maxlen, 1)
        product_2 = torch.matmul(input_embeddings, product_1).squeeze(2)

        results = self.attention_softmax(product_2)

        return results

    def extra_repr(self):
        return 'wv_dim={}, maxlen={}'.format(self.wv_dim, self.maxlen)


class ABAE(torch.nn.Module):
    """
        The model described in the paper ``An Unsupervised Neural Attention Model for Aspect Extraction''
        by He, Ruidan and  Lee, Wee Sun  and  Ng, Hwee Tou  and  Dahlmeier, Daniel, ACL2017
        https://aclweb.org/anthology/papers/P/P17/P17-1036/
    """

    def __init__(self, wv_dim=200, asp_count=30, ortho_reg=0.1, maxlen=201, init_aspects_matrix=None):
        """
        Initializing the model
        :param wv_dim: word vector size
        :param asp_count: number of aspects
        :param ortho_reg: coefficient for tuning the ortho-regularizer's influence
        :param maxlen: sentence max length taken into account
        :param init_aspects_matrix: None or init. matrix for aspects
        """
        super(ABAE, self).__init__()
        self.wv_dim = wv_dim
        self.asp_count = asp_count
        self.ortho = ortho_reg
        self.maxlen = maxlen

        self.attention = SelfAttention(wv_dim, maxlen)
        self.linear_transform = torch.nn.Linear(self.wv_dim, self.asp_count)
        self.softmax_aspects = torch.nn.Softmax()
        self.aspects_embeddings = Parameter(torch.Tensor(wv_dim, asp_count))

        if init_aspects_matrix is None:
            torch.nn.init.xavier_uniform(self.aspects_embeddings)
        else:
            self.aspects_embeddings.data = torch.from_numpy(init_aspects_matrix.T)

    def get_aspects_importances(self, text_embeddings):
        """
            Takes embeddings of a sentence as input, returns attention weights
        """

        # compute attention scores, looking at text embeddings average
        attention_weights = self.attention(text_embeddings)

        # multiplying text embeddings by attention scores -- and summing
        # (matmul: we sum every word embedding's coordinate with attention weights)
        weighted_text_emb = torch.matmul(attention_weights.unsqueeze(1), # (batch, 1, sentence)
                                         text_embeddings                 # (batch, sentence, wv_dim)
                                         ).squeeze()

        # encoding with a simple feed-forward layer (wv_dim) -> (aspects_count)
        raw_importances = self.linear_transform(weighted_text_emb)

        # computing 'aspects distribution in a sentence'
        aspects_importances = self.softmax_aspects(raw_importances)

        return attention_weights, aspects_importances, weighted_text_emb

    def forward(self, text_embeddings, negative_samples_texts):

        # negative samples are averaged
        averaged_negative_samples = torch.mean(negative_samples_texts, dim=2)

        # encoding: words embeddings -> sentence embedding, aspects importances
        _, aspects_importances, weighted_text_emb = self.get_aspects_importances(text_embeddings)

        # decoding: aspects embeddings matrix, aspects_importances -> recovered sentence embedding
        recovered_emb = torch.matmul(self.aspects_embeddings, aspects_importances.unsqueeze(2)).squeeze()

        # loss
        reconstruction_triplet_loss = ABAE._reconstruction_loss(weighted_text_emb,
                                                                recovered_emb,
                                                                averaged_negative_samples)
        max_margin = torch.max(reconstruction_triplet_loss, torch.zeros_like(reconstruction_triplet_loss))

        return self.ortho * self._ortho_regularizer() + max_margin

    @staticmethod
    def _reconstruction_loss(text_emb, recovered_emb, averaged_negative_emb):

        positive_dot_products = torch.matmul(text_emb.unsqueeze(1), recovered_emb.unsqueeze(2)).squeeze()
        negative_dot_products = torch.matmul(averaged_negative_emb, recovered_emb.unsqueeze(2)).squeeze()
        reconstruction_triplet_loss = torch.sum(1 - positive_dot_products.unsqueeze(1) + negative_dot_products, dim=1)

        return reconstruction_triplet_loss

    def _ortho_regularizer(self):
        return torch.norm(
            torch.matmul(self.aspects_embeddings.t(), self.aspects_embeddings) \
            - torch.eye(self.asp_count))

    def get_aspect_words(self, w2v_model, topn=15):
        words = []

        # getting aspects embeddings
        aspects = self.aspects_embeddings.detach().numpy()

        # getting scalar products of word embeddings and aspect embeddings;
        # to obtain the ``probabilities'', one should also apply softmax
        words_scores = w2v_model.wv.syn0.dot(aspects)

        for row in range(aspects.shape[1]):
            argmax_scalar_products = np.argsort(- words_scores[:, row])[:topn]
            # print([w2v_model.wv.index2word[i] for i in argmax_scalar_products])
            # print([w for w, dist in w2v_model.similar_by_vector(aspects.T[row])[:topn]])
            words.append([w2v_model.wv.index2word[i] for i in argmax_scalar_products])

        return words

In [39]:
import argparse

parser = argparse.ArgumentParser()

parser.add_argument("--word-vectors-path", "-wv",
                  dest="wv_path", type=str, metavar='<str>',
                  default="word_vectors/reviews_Electronics_5.json.txt.w2v",
                  help="path to word vectors file")

parser.add_argument("--batch-size", "-b", dest="batch_size", type=int, default=50,
                  help="Batch size for training")

parser.add_argument("--aspects-number", "-as", dest="aspects_number", type=int, default=40,
                  help="A total number of aspects")

parser.add_argument("--ortho-reg", "-orth", dest="ortho_reg", type=float, default=0.1,
                  help="Ortho-regularization impact coefficient")

parser.add_argument("--epochs", "-e", dest="epochs", type=int, default=1,
                  help="Epochs count")

parser.add_argument("--optimizer", "-opt", dest="optimizer", type=str, default="adam", help="Optimizer",
                  choices=["adam", "adagrad", "sgd"])

parser.add_argument("--negative-samples", "-ns", dest="neg_samples", type=int, default=5,
                  help="Negative samples per positive one")

parser.add_argument("--dataset-path", "-d", dest="dataset_path", type=str, default="reviews_Electronics_5.json.txt",
                  help="Path to a training texts file. One sentence per line, tokens separated wiht spaces.")

parser.add_argument("--maxlen", "-l", type=int, default=201,
                  help="Max length of the considered sentence; the rest is clipped if longer")

args = parser.parse_args()

w2v_model = get_w2v(args.wv_path)
wv_dim = w2v_model.vector_size
y = torch.zeros(args.batch_size, 1)

model = ABAE(wv_dim=wv_dim,
            asp_count=args.aspects_number,
            init_aspects_matrix=get_centroids(w2v_model, aspects_count=args.aspects_number))
print(model)

criterion = torch.nn.MSELoss(reduction="sum")

optimizer = None
scheduler = None

# if args.optimizer == "cycsgd":
#     optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
#     scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-5, max_lr=0.05, mode="triangular2")
# elif args.optimizer == "adam":

if args.optimizer == "adam":
  optimizer = torch.optim.Adam(model.parameters())
elif args.optimizer == "sgd":
  optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
elif args.optimizer == "adagrad":
  optimizer = torch.optim.Adagrad(model.parameters())
else:
  raise Exception("Optimizer '%s' is not supported" % args.optimizer)

for t in range(args.epochs):

  print("Epoch %d/%d" % (t + 1, args.epochs))

  data_iterator = read_data_tensors(args.dataset_path, args.wv_path,
                                    batch_size=args.batch_size, maxlen=args.maxlen)

  for item_number, (x, texts) in enumerate(data_iterator):
      if x.shape[0] < args.batch_size:  # pad with 0 if smaller than batch size
          x = np.pad(x, ((0, args.batch_size - x.shape[0]), (0, 0), (0, 0)))

      x = torch.from_numpy(x)

      # extracting bad samples from the very same batch; not sure if this is OK, so todo
      negative_samples = torch.stack(
          tuple([x[torch.randperm(x.shape[0])[:args.neg_samples]] for _ in range(args.batch_size)]))

      # prediction
      y_pred = model(x, negative_samples)

      # error computation
      loss = criterion(y_pred, y)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      # scheduler.step(epoch=t)

      if item_number % 1000 == 0:

          print(item_number, "batches, and LR:", optimizer.param_groups[0]['lr'])

          for i, aspect in enumerate(model.get_aspect_words(w2v_model)):
              print(i + 1, " ".join(["%10s" % a for a in aspect]))

          print("Loss:", loss.item())
          print()

usage: ipykernel_launcher.py [-h] [--word-vectors-path <str>]
                             [--batch-size BATCH_SIZE]
                             [--aspects-number ASPECTS_NUMBER]
                             [--ortho-reg ORTHO_REG] [--epochs EPOCHS]
                             [--optimizer {adam,adagrad,sgd}]
                             [--negative-samples NEG_SAMPLES]
                             [--dataset-path DATASET_PATH] [--maxlen MAXLEN]
ipykernel_launcher.py: error: unrecognized arguments: -f /root/.local/share/jupyter/runtime/kernel-04c4a7a4-142e-4cb0-959b-eb86a6610db9.json


SystemExit: ignored

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
