🚨**NOTE** 🚨: to run the notebooks, move them to the main directory. Simply run:

```bash
cp notebook_name.ipynb ../
```

In this and the other notebooks I will describe, step by step, my implementation of [Hierarchical Attention Networks](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf) (Zichao Yang et al., 2016) and discuss the results I obtained for the [Amazon reviews dataset](https://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf) (R. He, J. McAuley, 2016), in particular the [Clothing, Shoes and Jewelry](http://jmcauley.ucsd.edu/data/amazon/) category.

An illustration of the architecture I will implement is shown in the figure below (Figure 2 in Zichao Yang et al., 2016).

<p align="center">
  <img width="400" src="figures/HAN_arch.png">
</p>

In words, we consider a document composed of $L$ sentences, each containing $T$ tokens. The network first encodes the tokens ($w_{it}, t \in [1,T]$) in each sentence $s_i, i \in [1,L]$ using an RNN, in particular a GRU (Word Encoder), with an attention mechanism (Word Attention). The result of this first step is a sentence representation. The sentence representations are then passed through a second GRU (Sentence Encoder) with attention (Sentence Attention). The result of this second step is a document representation (in this exercise, documents are reviews). The latter is fed to a fully connected layer for prediction (Softmax in the figure).

Let me give you an example with tensor dimensions. Let's assume that we have a document composed of 10 sentences of 20 words each, and that we use 100-dimensional word embeddings. Assuming the encoder outputs also have 100 dimensions, the output of Word-Encoder + Word-Attention will be a tensor of shape (None, 10, 100), where None is the batch size, while the output of the Sentence-Encoder + Sentence-Attention will be a tensor of shape (None, 1, 100). The latter will be the input of a fully connected (i.e. prediction) layer.

Taken directly from their paper, the **word attention mechanism** can be formulated as:

$$
u_{it} = \text{tanh}(W_wh_{it} + b_w)
$$

$$
\alpha_{it} = \frac{\exp(u_{it}u_w^{\mathsf{T}})}{\sum_{t}\exp(u_{it}u_w^{\mathsf{T}})}
$$

$$
s_i = \sum_{t}\alpha_{it}h_{it}
$$

Here $u_{it}$ can be seen as a hidden representation of $h_{it}$ (the GRU output). The importance of a word is then measured as the similarity of $u_{it}$ with a context vector $u_{w}$, which is normalized through a softmax function, resulting in $\alpha_{it}$, the so-called *normalized importance weight*. The sentence vector $s_i$ is the weighted sum of the word annotations $h_{it}$, with weights $\alpha_{it}$. For more details, please have a look at the paper [Zichao Yang et al., 2016](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf).

The **sentence attention mechanism** the authors used is identical to the previous one, but applied to the sentence-level representations:

$$
u_{i} = \text{tanh}(W_sh_i + b_s)
$$

$$
\alpha_{i} = \frac{\exp(u_iu_s^{\mathsf{T}})}{\sum_{i}\exp(u_{i}u_s^{\mathsf{T}})}
$$

$$
v = \sum_{i}\alpha_{i}h_{i}
$$
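
Since both mechanisms share the same form, a single function can cover them. The block below is a minimal NumPy sketch of the equations above, not the implementation used in these notebooks. Random arrays stand in for the GRU outputs, the sentence encoder is omitted (sentence vectors go straight into the second attention), and all names (`attention_pool`, `W_w`, `u_s`, ...) are hypothetical. It uses a row-vector convention, so $W_w h_{it}$ becomes `h @ W`.

```python
import numpy as np


def attention_pool(h, W, b, u_ctx):
    """Attention-weighted sum over the first axis of h.

    h: (n_steps, dim) annotations, W: (dim, att_dim), b: (att_dim,),
    u_ctx: (att_dim,) context vector. Returns (pooled, alpha).
    """
    u = np.tanh(h @ W + b)            # hidden representation, (n_steps, att_dim)
    scores = u @ u_ctx                # similarity with the context vector, (n_steps,)
    scores = scores - scores.max()    # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ h, alpha           # weighted sum of annotations, (dim,)


rng = np.random.default_rng(0)
n_sents, n_words, dim = 10, 20, 100

# Stand-in for the word encoder output of one document: (10, 20, 100)
h_words = rng.normal(size=(n_sents, n_words, dim))
W_w, b_w, u_w = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim), rng.normal(size=dim)
W_s, b_s, u_s = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim), rng.normal(size=dim)

# Word attention: one vector per sentence -> (10, 100)
sent_vecs = np.stack([attention_pool(h, W_w, b_w, u_w)[0] for h in h_words])

# Sentence attention (sentence encoder omitted in this sketch) -> (100,)
doc_vec, alpha_s = attention_pool(sent_vecs, W_s, b_s, u_s)

print(sent_vecs.shape, doc_vec.shape)
```

The shapes match the example given earlier: 10 sentence vectors of size 100, pooled into a single document vector of size 100.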

I will come back to these expressions in the next notebook, where I will implement them in code. 

However, as with any other data problem, the first step is preparing the data. For this particular exercise, we need to tokenize reviews into sentences and sentences into tokens. Of course, this is easily done using [`Spacy`](https://spacy.io/usage) or any other of your favourite NLP packages.

To make life easier I have wrapped up the full preprocessing in a class called `HANPreprocessor` in the `utils` module. One can access this class simply with:

```python
from utils.preprocessors import HANPreprocessor
```

Here I will describe step by step what happens inside this class. Let's start by reading the reviews.

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import spacy

from pathlib import Path
from sklearn.model_selection import train_test_split
from gensim.utils import tokenize
from fastai.text import Tokenizer, Vocab

In [2]:
DATA_PATH = Path("../datasets/amazon_reviews/")
OUT_PATH = Path("data/")
if not os.path.exists(OUT_PATH):
    os.makedirs(OUT_PATH)

df_org = pd.read_json(DATA_PATH / "reviews_Clothing_Shoes_and_Jewelry_5.json.gz", lines=True)

# classes from [0,num_class)
df = df_org.copy()
df["overall"] = (df["overall"] - 1).astype("int64")

# group reviews with 1 and 2 scores into one class
df.loc[df.overall == 0, "overall"] = 1

# and back again to [0,num_class)
df["overall"] = (df["overall"] - 1).astype("int64")

# aggressive preprocessing: drop short reviews
df["reviewLength"] = df.reviewText.apply(lambda x: len(x.split(" ")))
df = df[df.reviewLength >= 5]
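
The label arithmetic above is easy to misread, so here is the same mapping written out for a single rating. This is only an illustrative sketch (the helper `remap_rating` is hypothetical and not part of `utils`). It shows that 1- and 2-star reviews end up in class 0 and that there are four classes in total.

```python
def remap_rating(stars):
    """Map a 1-5 star rating to a class in [0, 4), merging 1 and 2 stars."""
    label = stars - 1      # 1..5 -> 0..4
    if label == 0:         # merge the 1-star class into the 2-star class
        label = 1
    return label - 1       # 1..4 -> 0..3


mapping = {stars: remap_rating(stars) for stars in range(1, 6)}
print(mapping)  # {1: 0, 2: 0, 3: 1, 4: 2, 5: 3}
```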

The main goal of these notebooks is to illustrate the details of the process rather than to run real experiments. With that in mind, I will only use 1000 samples here; otherwise, tokenizing reviews into sentences and sentences into tokens takes a while.

In [3]:
sdf = df.sample(n=1000, random_state=1)
texts = sdf.reviewText.tolist()
target = sdf.overall.tolist()

Within the `HANPreprocessor` class, there is a private method called `_sentencizer` that does mostly this:

In [4]:
tok_func = spacy.load("en_core_web_sm")
n_cpus = os.cpu_count()
bsz = 100
texts_sents = []
for doc in tok_func.pipe(texts, n_process=n_cpus, batch_size=bsz):
    sents = [str(s) for s in list(doc.sents)]
    texts_sents.append(sents)

In [5]:
texts_sents[0]

['This is and original design for the beach.',
 'Super light and it also covers very well.',
 'I got XL.',
 "I'm  5'11&#34; and size 16 and it works.",
 'Length is a little above knee.',
 'there are others, but sizes are not true and this one has better fabric quality.']

Now that we have the sentences, we would like to process the text in them. However, the whole dataset contains over 1.4 million sentences, and looping through reviews and then through sentences is just **BAD**. Let's implement a better solution. First we flatten the nested lists.

In [6]:
# from nested to flat list.
all_sents = [s for sents in texts_sents for s in sents]

We then save the lengths of the original documents (number of sentences) and build a list with the sentence indexes that belong to each document, so we can "fold back" the list later.

In [7]:
# saving the lengths of the documents: 1) for padding purposes and 2) to compute consecutive ranges 
# so we can "fold" the list again
texts_length = [0] + [len(s) for s in texts_sents]
# cumulative sum of the lengths: range_idx[i] is the index of the first sentence of document i
range_idx = np.cumsum(texts_length).tolist()
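
To make the flatten/fold-back idea concrete, here is a toy round trip on three tiny "documents". It is a sketch using only the standard library (`itertools.accumulate` plays the role of the cumulative sum), and the variable names are illustrative.

```python
from itertools import accumulate

docs = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]

# flatten: one long list of sentences
flat = [s for doc in docs for s in doc]

# cumulative document lengths give the slice boundaries
lengths = [0] + [len(doc) for doc in docs]
bounds = list(accumulate(lengths))  # [0, 3, 4, 6]

# fold back: consecutive boundaries delimit each document
folded = [flat[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
print(folded == docs)  # True
```

Any per-sentence processing can then run once over the flat list, and the result is regrouped with the same boundaries.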

Now let's process the text in `all_sents` in parallel using the fantastic [`fastai`](https://github.com/fastai/fastai/blob/96339e70184eeed8d28261a88be54631eafc77cf/fastai/text/transform.py#L87) tokenizer:

In [8]:
sents_tokens = Tokenizer().process_all(all_sents)

In [9]:
sents_tokens[0]

['xxmaj',
 'this',
 'is',
 'and',
 'original',
 'design',
 'for',
 'the',
 'beach',
 '.']

Note that I apply very little cleaning to the text (other than the one applied by default by the fastai Tokenizer). This is **intentional**, as I aim to keep as much information as possible in the text; hopefully the "noise" will be removed when the least frequent tokens are dropped from the vocabulary.

Nonetheless, I have included the option of applying some cleaning via a mildly customised version of `gensim`'s [`simple_preprocess`](https://radimrehurek.com/gensim/utils.html). All this is wrapped up in a function called `get_texts` in `utils.preprocessors`, which is mostly this:

In [10]:
def simple_preprocess(doc, lower=False, deacc=False, min_len=2, max_len=15):
    tokens = [
        token
        for token in tokenize(doc, lower=lower, deacc=deacc, errors="ignore")
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]
    return tokens


def get_texts(texts, with_preprocess=False):
    if with_preprocess:
        texts = [" ".join(simple_preprocess(s)) for s in texts]
    tokens = Tokenizer().process_all(texts)
    return tokens

In [11]:
sents_tokens_2 = get_texts(all_sents, with_preprocess=True)

In [12]:
sents_tokens_2[0]

['xxmaj', 'this', 'is', 'and', 'original', 'design', 'for', 'the', 'beach']

Note that the punctuation has disappeared, which is good. However, let's have a look at this sentence:

In [13]:
s = "I don't: particularly ; like data science."

In [14]:
get_texts([s], with_preprocess=False)[0]

['i', 'do', "n't", ':', 'particularly', ';', 'like', 'data', 'science', '.']

In [15]:
get_texts([s], with_preprocess=True)[0]

['don', 'particularly', 'like', 'data', 'science']

You can see that the negation has disappeared, which is not something I want here. If one wanted to remove some punctuation while keeping negations, the solution is rather simple given the structure of `fastai`'s `Tokenizer`. You could define a function like:

In [16]:
import re
def rm_punctuation(x):
    x = re.sub(r"\.|,|:|;", " ", x)
    # or 
    # x = x.replace(".", "").replace(",","").replace(":", "").replace(";","")
    return x

In [17]:
new_tok = Tokenizer()
new_tok.pre_rules = [rm_punctuation] + new_tok.pre_rules

This way you can take full advantage of all of `fastai`'s preprocessing rules and simply add a new one.

In [18]:
new_tok.process_all([s])

[['i', 'do', "n't", 'particularly', 'like', 'data', 'science']]

I will leave that for future implementations if/when I revisit this exercise. The full vocabulary for these reviews, applying a minimum frequency cutoff of 5, is around 22k words (i.e. not too big). Therefore, in the worst case, the network will end up learning a few useless embeddings (as we will see).

Let's move on.

In [19]:
#  saving the lengths of sentences for padding purposes
sents_length = [len(s) for s in sents_tokens]

In [20]:
# Create Vocabulary using fastai's Vocab class
vocab = Vocab.create(sents_tokens, max_vocab=5000, min_freq=5)

In [21]:
# 'numericalize' each sentence
sents_numz = [vocab.numericalize(s) for s in sents_tokens]
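
For readers unfamiliar with what a vocabulary object does, the following is a rough, standard-library sketch of the idea: count tokens, keep the most frequent ones above a frequency threshold, and map everything else to an "unknown" index. It is not fastai's implementation. The special tokens and the functions `build_vocab` and `numericalize` are assumptions made for illustration, with index 0 for unknown tokens and index 1 for padding (mirroring the `pad_idx=1` used later).

```python
from collections import Counter

# hypothetical special tokens: index 0 = unknown, index 1 = padding
SPECIALS = ["<unk>", "<pad>"]


def build_vocab(token_lists, max_vocab, min_freq):
    """Keep at most max_vocab tokens that appear at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = SPECIALS + kept                       # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi


def numericalize(tokens, stoi):
    """Replace each token by its index; unseen or rare tokens map to 0."""
    return [stoi.get(tok, 0) for tok in tokens]


toy = [["the", "dress", "fits"], ["the", "dress", "is", "great"], ["the", "shoes"]]
itos, stoi = build_vocab(toy, max_vocab=10, min_freq=2)
print(itos)                                          # ['<unk>', '<pad>', 'the', 'dress']
print(numericalize(["the", "shoes", "fit"], stoi))   # [2, 0, 0]
```

This is why rare tokens (most of the "noise") collapse into a single index once the frequency cutoff is applied.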

In [22]:
# group the sentences again into documents
texts_numz = [sents_numz[range_idx[i] : range_idx[i + 1]] for i in range(len(range_idx[:-1]))]

In [23]:
texts_numz[0]

[[5, 19, 17, 13, 703, 320, 18, 10, 950, 9],
 [5, 268, 216, 13, 15, 105, 1072, 39, 64, 9],
 [11, 112, 6, 491, 9],
 [11, 79, 0, 51, 13, 44, 951, 13, 15, 453, 9],
 [5, 227, 17, 14, 80, 492, 952, 9],
 [115, 24, 493, 12, 25, 321, 24, 31, 236, 13, 19, 59, 108, 160, 237, 88, 9]]

Let's check that everything is consistent:

In [24]:
print(sents_tokens[1])
print([vocab.itos[i] for i in texts_numz[0][1]])

['xxmaj', 'super', 'light', 'and', 'it', 'also', 'covers', 'very', 'well', '.']
['xxmaj', 'super', 'light', 'and', 'it', 'also', 'covers', 'very', 'well', '.']


All sentences are now tokenized and numericalized; the only thing left is to pad them so they are ready to be passed to the RNN.

In [25]:
q=0.8
maxlen_sent = int(np.quantile(sents_length, q=q))
maxlen_doc  = int(np.quantile(texts_length[1:], q=q))

In [26]:
print(maxlen_sent, maxlen_doc)

21 7


All sentences will be padded to 21 tokens and all documents/reviews to 7 sentences.
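
The point of using a quantile rather than the maximum is that a few very long sequences would otherwise force a lot of padding on everything else. A small sketch with made-up lengths (the numbers are purely illustrative) shows the trade-off: a shorter padded length at the cost of truncating the longest sequences.

```python
import numpy as np

lengths = [3, 5, 8, 9, 10, 12, 15, 18, 21, 60]  # toy sentence lengths

maxlen = int(np.quantile(lengths, q=0.8))
n_truncated = sum(length > maxlen for length in lengths)

print(maxlen, max(lengths), n_truncated)  # 18 60 2
```

Here padding to 18 instead of 60 truncates only 2 of the 10 sequences.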

I have coded two helper functions to help with the padding. These are inspired by code in the fastai library.

In [27]:
def pad_sequences(seq, maxlen, pad_first=True, pad_idx=1):
    if len(seq) >= maxlen:
        res = np.array(seq[-maxlen:]).astype("int32")
        return res
    else:
        res = np.zeros(maxlen, dtype="int32") + pad_idx
        if pad_first:
            res[-len(seq) :] = seq
        else:
            res[: len(seq)] = seq
        return res


def pad_nested_sequences(
    seq, maxlen_sent, maxlen_doc, pad_sent_first=True, pad_doc_first=False, pad_idx=1
):
    seq = [s for s in seq if len(s) >= 1]
    if len(seq) == 0:
        return np.array([[pad_idx] * maxlen_sent] * maxlen_doc).astype("int32")
    seq = [pad_sequences(s, maxlen_sent, pad_sent_first, pad_idx) for s in seq]
    if len(seq) >= maxlen_doc:
        return np.array(seq[:maxlen_doc])
    else:
        res = np.array([[pad_idx] * maxlen_sent] * maxlen_doc).astype("int32")
        if pad_doc_first:
            res[-len(seq) :] = seq
        else:
            res[: len(seq)] = seq
        return res
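
To see what the flags do, here is a compact restatement of the sentence-level padding applied to toy inputs. The function `pad_1d` is a hypothetical stand-in written for this illustration; it is meant to behave like `pad_sequences` for non-empty input, and returns an all-padding row for empty input.

```python
import numpy as np


def pad_1d(seq, maxlen, pad_first=True, pad_idx=1):
    """Pad (or truncate) a list of token ids to exactly maxlen entries."""
    seq = list(seq)[-maxlen:]  # if too long, keep the last maxlen tokens
    res = np.full(maxlen, pad_idx, dtype="int32")
    if not seq:
        return res
    if pad_first:
        res[-len(seq):] = seq  # padding goes in front
    else:
        res[:len(seq)] = seq   # padding goes at the end
    return res


print(pad_1d([7, 8, 9], 5))                   # [1 1 7 8 9]
print(pad_1d([7, 8, 9], 5, pad_first=False))  # [7 8 9 1 1]
print(pad_1d([2, 3, 4, 5, 6, 7], 5))          # [3 4 5 6 7]
```

Note the asymmetry in the functions above: a sentence that is too long keeps its last `maxlen_sent` tokens, whereas a document that is too long keeps its first `maxlen_doc` sentences.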

In [28]:
padded_texts = np.stack([pad_nested_sequences(r, maxlen_sent, maxlen_doc) for r in texts_numz], axis=0)

In [29]:
padded_texts.shape

(1000, 7, 21)

In [30]:
padded_texts[:5]

array([[[   1,    1,    1,    1, ...,   18,   10,  950,    9],
        [   1,    1,    1,    1, ..., 1072,   39,   64,    9],
        [   1,    1,    1,    1, ...,  112,    6,  491,    9],
        [   1,    1,    1,    1, ...,   13,   15,  453,    9],
        [   1,    1,    1,    1, ...,   80,  492,  952,    9],
        [   1,    1,    1,    1, ...,  160,  237,   88,    9],
        [   1,    1,    1,    1, ...,    1,    1,    1,    1]],

       [[  11,  142,  249,   10, ...,   72,   42,   15,    9],
        [   1,    1,    1,    1, ...,   20,   19,  139,    9],
        [   1,    1,    5,  300, ...,   23,   10,  467,    9],
        [   1,    1,    1,    5, ...,  315,    7,  135,    9],
        [   1,    1,    1,    1, ...,    0,   21,  315,    9],
        [   0,  183,   10,  274, ...,  643,   16,    0,    9],
        [ 156,   18,  495,   49, ...,  494,  282,  274,    9]],

       [[   1,    1,    1,    1, ...,    5,    0,  797,    9],
        [   1,    1,    1,    1, ...,  222,  169,  

And that's it! The data is ready for Deep Learning.