# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement the model from "An Unsupervised Neural Attention Model for Aspect Extraction" (He et al., 2017) in PyTorch; it is also described in the seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i=1}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$ attention weight for the i-th token, with the softmax taken over $d_1, \dots, d_n$  
$d_i = e_{w_i}^T M y_s$ attention score with trainable matrix $M \in R^{d \times d}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
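
The attention variant above can be sketched as a small PyTorch module. This is a minimal sketch, not the reference implementation: the class name `AttentionSentenceEmbedding`, the boolean padding `mask` argument and the Xavier initialisation of $M$ are assumptions introduced here, and every sentence is assumed to contain at least one real token.

```python
import torch
import torch.nn as nn


class AttentionSentenceEmbedding(nn.Module):
    """Hypothetical helper: z_s = sum_i alpha_i * e_{w_i}, alpha = softmax(e^T M y_s)."""

    def __init__(self, emb_dim):
        super().__init__()
        # Trainable attention matrix M of shape (d, d); the init scheme is an assumption.
        self.M = nn.Parameter(torch.empty(emb_dim, emb_dim))
        nn.init.xavier_uniform_(self.M)

    def forward(self, emb, mask):
        # emb:  (batch, n, d) token embeddings e_{w_i}
        # mask: (batch, n) bool, True for real tokens, False for padding
        m = mask.unsqueeze(-1).to(emb.dtype)
        lengths = m.sum(dim=1).clamp(min=1.0)        # (batch, 1)
        y = (emb * m).sum(dim=1) / lengths           # (batch, d) sentence context y_s
        # d_i = e_{w_i}^T M y_s for every token of every sentence
        scores = torch.einsum('bnd,de,be->bn', emb, self.M, y)
        # Padding positions get zero attention weight after the softmax
        scores = scores.masked_fill(~mask, float('-inf'))
        alpha = torch.softmax(scores, dim=1)         # (batch, n)
        z = (alpha.unsqueeze(-1) * emb).sum(dim=1)   # (batch, d) sentence embedding z_s
        return z, alpha
```

Returning `alpha` alongside `z` makes it easy to inspect which tokens the model attends to.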

**Or** simply use the average of the word embeddings as the sentence embedding **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
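
The averaged variant could look roughly like this in PyTorch. The function name and the padding `mask` argument are assumptions made for illustration; the mask keeps padded positions from diluting the average when sentences in a batch have different lengths.

```python
import torch


def average_sentence_embedding(emb, mask):
    """Hypothetical helper: mean of token embeddings over non-padding positions.

    emb:  (batch, n, d) token embeddings
    mask: (batch, n) bool, True for real tokens, False for padding
    """
    m = mask.unsqueeze(-1).to(emb.dtype)
    # Clamp so that an all-padding row does not cause a division by zero
    lengths = m.sum(dim=1).clamp(min=1.0)
    return (emb * m).sum(dim=1) / lengths
```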
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{K \times d}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{K \times d}$ trainable matrix of topic embeddings, $K$ = number of topics
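
A minimal sketch of the reconstruction step, assuming a batch of sentence embeddings `z` of shape `(batch, d)`. The class name and the initialisation of $T$ are assumptions; `nn.Linear(d, K)` stores its weight with shape `(K, d)`, which is the shape needed for $W z_s + b$ to land in $R^K$.

```python
import torch
import torch.nn as nn


class TopicReconstruction(nn.Module):
    """Hypothetical module: p_t = softmax(W z_s + b), r_s = T^T p_t."""

    def __init__(self, emb_dim, n_topics):
        super().__init__()
        # W and b; the weight has shape (K, d), the bias has shape (K,)
        self.linear = nn.Linear(emb_dim, n_topics)
        # Topic embedding matrix T of shape (K, d); the init scheme is an assumption
        self.T = nn.Parameter(torch.empty(n_topics, emb_dim))
        nn.init.xavier_uniform_(self.T)

    def forward(self, z):
        p = torch.softmax(self.linear(z), dim=-1)  # (batch, K) topic weights
        r = p @ self.T                             # (batch, d), row-wise T^T p_t
        return r, p
```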


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m \max(0, 1 - r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the i-th negative sentence, where $n$ is the number of tokens in that sentence  
$||T T^T - I ||^2_F$ regularizer that pushes the rows of $T$ (the topic embeddings) to be orthonormal; $T T^T \in R^{K \times K}$  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{N \times M}$ squared Frobenius norm
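
The objective can be sketched as a single function over a batch. The name `abae_loss` and the tensor layout are assumptions. The regularizer here uses the $K \times K$ Gram matrix of the topic rows, `T @ T.t()`, which is an assumption about the intended shape given $T \in R^{K \times d}$.

```python
import torch


def abae_loss(r, z, neg, T, lam=1.0):
    """Hypothetical loss: hinge term over negatives plus orthogonality penalty.

    r:   (batch, d)    reconstructed sentence embeddings r_s
    z:   (batch, d)    sentence embeddings z_s
    neg: (batch, m, d) averaged embeddings n_i of the m negative sentences
    T:   (K, d)        topic embedding matrix
    """
    pos = (r * z).sum(dim=-1, keepdim=True)        # (batch, 1)  r_s^T z_s
    negs = torch.einsum('bd,bmd->bm', r, neg)      # (batch, m)  r_s^T n_i
    hinge = torch.clamp(1.0 - pos + negs, min=0.0).sum()
    # Squared Frobenius norm of (T T^T - I), with I of size K x K
    eye = torch.eye(T.size(0), device=T.device, dtype=T.dtype)
    reg = ((T @ T.t() - eye) ** 2).sum()
    return hinge + lam * reg
```

Summing the hinge term follows the formula literally; dividing by the batch size instead is a common alternative that only rescales the balance against $\lambda$.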


**[3 points]** Compute topic coherence for at least 3 different numbers of topics, using the 10 nearest words for each topic. This means you have to train one model per number of topics. You can use the code from the seminar notes with word2vec similarity scores.
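
One possible way to set this up is sketched below. The seminar code is not reproduced here, so the coherence definition used (mean pairwise similarity of a topic's top words) is an assumption, as are the function names. `similarity` is any callable taking two words, for example `model.wv.similarity` from a trained gensim `Word2Vec` model. Rows of `emb` and the topic vector are assumed to be non-zero.

```python
import numpy as np


def nearest_words(topic_vec, emb, itos, k=10):
    """Return the k words whose embeddings are closest (cosine) to topic_vec.

    emb:  (V, d) word embedding matrix
    itos: list of V words, aligned with the rows of emb
    """
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    t_n = topic_vec / np.linalg.norm(topic_vec)
    sims = emb_n @ t_n
    top = np.argsort(-sims)[:k]
    return [itos[i] for i in top]


def topic_coherence(words, similarity):
    """Assumed definition: mean similarity over all unordered pairs of words."""
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    return float(np.mean([similarity(a, b) for a, b in pairs]))
```

Averaging `topic_coherence` over all topics of a model gives one number per choice of $K$, which can then be compared across the trained models.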

In [0]:
# TODO Replace texts[23] with something more sensible when choosing negative samples
# TODO Change the TabularDataset fields so that neg_{} depends on NEG_SAMPLES
# TODO Remove punctuation

In [85]:
import pandas as pd
import numpy as np

import torch
import nltk

from torchtext.vocab import Vectors
from gensim.models import Word2Vec, KeyedVectors

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
NEG_SAMPLES = 3  # number of negative samples
random_state = 23
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DATA

In [6]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
!unzip '/content/data.zip'


In [0]:
# with open('/content/data.txt', 'r') as f:
#     data = f.read().splitlines()

with open('/content/data.txt', 'r') as f:
    data = nltk.tokenize.sent_tokenize(f.read())

with open('/content/stopwords.txt', 'r') as f:
    stopwords = f.read().splitlines()

In [94]:
print(len(data), data[0])
print(len(stopwords), stopwords[0])

183400 Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years.
350 a



# Making DataFrame

In [0]:
def create_df(texts, neg_samples):
    """
    Build a pandas DataFrame from texts and add randomly chosen negative samples.
    """
    df = pd.DataFrame()
    df['text'] = texts

    for i in range(1, neg_samples + 1):
        # Draw one random sentence index per row (with replacement).
        sampled = np.random.choice(len(texts), size=len(texts))
        # If a row draws its own index, fall back to texts[23] (placeholder, see the TODO above).
        df['neg_{}'.format(i)] = [texts[ind] if ind != row else texts[23]
                                  for row, ind in enumerate(sampled)]
    return df
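
The `texts[23]` fallback can be avoided by sampling an index that is never the row's own. The sketch below is one way to do that, not the notebook's method: the function name and the `seed` parameter are hypothetical, and it assumes at least two texts. It draws a random offset in `[1, n_texts - 1]` and adds it to the row index modulo `n_texts`, so every other sentence is equally likely.

```python
import numpy as np


def sample_negative_indices(n_texts, neg_samples, seed=23):
    """Hypothetical helper: (n_texts, neg_samples) indices, none equal to its row."""
    rng = np.random.default_rng(seed)
    # An offset of 0 would map a row to itself, so offsets start at 1
    offsets = rng.integers(1, n_texts, size=(n_texts, neg_samples))
    own = np.arange(n_texts)[:, None]
    return (own + offsets) % n_texts
```

Column `i` of the result could then fill a `neg_{}` column with `[texts[j] for j in idx[:, i]]`. Passing a fixed seed also makes the sampled negatives reproducible between runs.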

In [0]:
df = create_df(data, neg_samples=NEG_SAMPLES)

In [98]:
df.head()

| | text | neg_1 | neg_2 | neg_3 |
|---|---|---|---|---|
| 0 | Barclays' defiance of US fines has merit Barcl... | Aides have yet to say whether he was speaking ... | Instead today NHS England announced it was not... | And part of the reason, according to CNBC’s Ji... |
| 1 | So it is tempting to think the bank, when aske... | The EU did run up backlog of €26bn (£20bn) in ... | “The developing, poorer countries are impacted... | (And when he does, well you can’t fault him fo... |
| 2 | That is not the view of the chief executive, J... | Nigel Farage and the other Ukip MEPs, as well ... | For her part, Swami Ambikananda Saraswati seem... | “We are now here with our crying four year old... |
| 3 | Barclays thinks the DoJ’s claims are “disconne... | And so an easy dichotomy presents itself. | Conor D’Arcy, a policy analyst at the Resoluti... | Footloose On paper, even now, the setup of thi... |
| 4 | But actually, some grudging respect for Staley... | A documentary crew invited to film the recordi... | Bill Clinton is still somewhere on a bus in Oh... | “Maybe that’s it! |


In [102]:
df.tail()

| | text | neg_1 | neg_2 | neg_3 |
|---|---|---|---|---|
| 183395 | It feels as though Stone realised that some of... | Steven W Thrasher: Trump is better at whipping... | So, here you would think, is an open goal – a ... | The question is whether our top law enforcemen... |
| 183396 | There are some fun elements, many involving Rh... | The tracks that really shine the most though a... | One man, bless his heart, all but leapt into t... | FGM is defined by the World Health Organisatio... |
| 183397 | I particularly enjoyed a scene in which O’Bria... | Having an eating disorder is extremely isolati... | Run-in 2 Apr Man City H 9 Apr Aston Villa A 17... | No Man’s Sky shares the film’s tranquil pace, ... |
| 183398 | His carnivorous snarl fills the immense screen... | He noted that often the party’s nominee did no... | The Oscars is on February 28. | Chelsea Clinton will attend a glitzy, star-stu... |
| 183399 | There’s a playful visual flair to this moment ... | So help us to get a sense of the country as th... | I looked at TV pictures of a home match this m... | As the virus continues to spread since the fir... |


In [0]:
df.to_csv('data.csv', index=False)

# TORCHTEXT and stuff

In [0]:
import spacy 
spacy_en = spacy.load('en')

In [0]:
def tokenize(text):
    return [tok.lemma_ for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]

In [125]:
# Using data to pretrain word-embeddings

data_tokenized = list(df['text'].apply(lambda x: tokenize(x)))
model = Word2Vec(data_tokenized, size=200, window=10, negative=5)  # building emb of size 200 (parameters from the paper)
model_weights = torch.FloatTensor(model.wv.vectors)
model.wv.save_word2vec_format('pretrained_embeddings')
vectors = Vectors(name='pretrained_embeddings', cache='./')  # and saving the weights to build vocab later



In [0]:
from torchtext.data import Field, TabularDataset

In [0]:
# https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

TEXT = Field(sequential=True, 
             include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize, 
             lower=True)

RESULT = Field(sequential=True, 
             include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize, 
             lower=True,
             is_target=True)

dataset = TabularDataset(
           path="/content/data.csv",
           format='csv',
           skip_header=True,
           fields=[('text', RESULT),('neg_1', TEXT), ('neg_2', TEXT), ('neg_3', TEXT)])

TEXT.build_vocab(dataset, min_freq=2, vectors=vectors,
                   unk_init = torch.Tensor.normal_)

RESULT.build_vocab(dataset, min_freq=2, vectors=vectors,
                   unk_init = torch.Tensor.normal_)
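
The hard-coded `neg_1` to `neg_3` entries in `fields` could instead be generated from `NEG_SAMPLES`. The helper below is a hypothetical sketch in plain Python; the strings `'RESULT'` and `'TEXT'` stand in for the actual `Field` objects so that the snippet runs without torchtext.

```python
NEG_SAMPLES = 3  # same constant as defined earlier in the notebook


def build_fields(target_field, text_field, neg_samples):
    """Hypothetical helper: (column name, field) pairs for the `fields` argument."""
    return [('text', target_field)] + [
        ('neg_{}'.format(i), text_field) for i in range(1, neg_samples + 1)
    ]


# Placeholders stand in for the RESULT and TEXT Field objects
fields = build_fields('RESULT', 'TEXT', NEG_SAMPLES)
```

In the notebook this would be called with the real objects, `build_fields(RESULT, TEXT, NEG_SAMPLES)`, and the result passed as `fields=` to `TabularDataset`.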

In [0]:
vocab = TEXT.vocab

In [129]:
print('Vocab size:', len(TEXT.vocab.itos))
TEXT.vocab.itos[:10]

Vocab size: 52932


['<unk>', '<pad>', 'the', 'be', 'a', 'to', 'of', 'and', 'in', 'that']

# Model

In [0]:
# Hyperparameters from the paper:
# optimizer: Adam, learning rate 0.001, 15 epochs, batch size 50,
# number of negative samples per input sample m = 20,
# λ = 1
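
A training loop using these settings might be organised as follows. This is a generic sketch under stated assumptions: `train`, `loss_fn` and the layout of `batches` are hypothetical names, and the model and loss are passed in rather than defined here. Only the optimizer choice and the constants come from the notes above.

```python
import torch
import torch.nn as nn

LEARNING_RATE = 0.001
EPOCHS = 15
BATCH_SIZE = 50   # intended for whatever iterator produces `batches`
LAMBDA = 1.0      # regularizer weight, to be used inside loss_fn


def train(model, loss_fn, batches, epochs=EPOCHS, lr=LEARNING_RATE):
    """Run Adam over `batches`; loss_fn(model, batch) must return a scalar tensor.

    Returns the summed loss per epoch so that progress can be inspected.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        total = 0.0
        for batch in batches:
            optimizer.zero_grad()
            loss = loss_fn(model, batch)
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append(total)
    return history
```

Keeping the loss as a callable means the same loop can be reused for each number of topics when comparing coherence.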

# Topic coherence