# Hierarchical Attention Network for text classification

In [1]:
## Uncomment the command below to kill the current job:
#!neuro kill $(hostname)

Our recipe is based on the highly cited paper
[Hierarchical Attention Networks for Document Classification](https://arxiv.org/abs/1608.07775) (Z. Yang et al.),
published in 2016. We classify IMDB reviews as positive or negative
(25k reviews for training and the same number for testing). The proposed neural network architecture works in two steps:
1. It encodes sentences. An attention mechanism predicts the importance of each word in the final embedding of a sentence.
2. It encodes texts. An attention mechanism predicts the importance of each sentence in the final embedding of a text.
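
The two steps above can be sketched in plain Python. This is a simplified illustration, not the code in `src.model`: it assumes the word vectors are already computed, leaves out the recurrent encoders and the small projection layer that the paper applies before scoring, and uses hypothetical context vectors (`word_context`, `sentence_context`) that would be learned in a real model.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Score each vector against the context, then return (weights, weighted sum)."""
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return weights, pooled


# Toy document: 2 sentences, each word is a 2-dimensional vector (made-up values).
doc = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
    [[0.5, 0.5], [0.0, 3.0]],
]
word_context = [1.0, 0.0]      # hypothetical, learned in a real model
sentence_context = [0.0, 1.0]  # hypothetical, learned in a real model

# Step 1: pool the words of every sentence into one sentence vector.
word_weights, sentence_vectors = [], []
for sentence in doc:
    weights, vector = attention_pool(sentence, word_context)
    word_weights.append(weights)
    sentence_vectors.append(vector)

# Step 2: pool the sentence vectors into one document vector.
sentence_weights, doc_vector = attention_pool(sentence_vectors, sentence_context)
```

The document vector would then be passed to a classifier, while `word_weights` and `sentence_weights` are the quantities that can be inspected afterwards.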

This architecture is appealing because its attention weights can be visualized to show which words and sentences
were most important for a prediction. You can find more information in the original article.
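
As a rough illustration of how such weights could be read, the sketch below ranks words by the product of their word weight and the weight of the sentence they belong to. The function `rank_words` is hypothetical, the product is just one simple heuristic, and all weights are made up for the example rather than produced by the model.

```python
def rank_words(sentences, word_weights, sentence_weights, top_k=3):
    """Return the top_k (word, score) pairs, where score = sentence weight * word weight."""
    scored = []
    for words, w_weights, s_weight in zip(sentences, word_weights, sentence_weights):
        for word, w_weight in zip(words, w_weights):
            scored.append((word, s_weight * w_weight))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up tokens and attention weights, for illustration only.
sentences = [["the", "plot", "was", "terrible"], ["i", "left", "early"]]
word_weights = [[0.05, 0.15, 0.05, 0.75], [0.2, 0.5, 0.3]]
sentence_weights = [0.8, 0.2]

top = rank_words(sentences, word_weights, sentence_weights, top_k=2)
```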

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

from pathlib import Path
from src.notebooks_utils import display_predict
from src.dataset import get_test_dataset, collate_docs, ImdbReviewsDataset
from src.model import HAN
from src.const import RESULT_DIR

### Load IMDB reviews dataset

In [3]:
!sh ../src/download_data.sh ../data

Downloading dataset to ../data
--2020-06-01 12:46:07--  http://data.neu.ro/aclImdb.zip
Resolving data.neu.ro (data.neu.ro)... 52.216.28.51
Connecting to data.neu.ro (data.neu.ro)|52.216.28.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44329613 (42M) [application/zip]
Saving to: ‘/tmp/aclImdb.zip’


2020-06-01 12:46:08 (41.7 MB/s) - ‘/tmp/aclImdb.zip’ saved [44329613/44329613]

Unpacking...
Finished


In [4]:
dataset = get_test_dataset()
itow = dict(zip(dataset.vocab.values(), dataset.vocab.keys()))

Dataset loading from ../data/aclImdb/test.


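
The `itow` dictionary built above inverts the vocabulary so that token indices can be turned back into words. A minimal sketch of the idea, assuming `dataset.vocab` maps each word to an integer index; the toy vocabulary and the `<pad>` and `<unk>` tokens are hypothetical:

```python
# Toy vocabulary: word -> index (hypothetical contents).
vocab = {"<pad>": 0, "<unk>": 1, "great": 2, "movie": 3}

# Invert it: index -> word. Equivalent to dict(zip(vocab.values(), vocab.keys())).
itow = {index: word for word, index in vocab.items()}

# Decode a padded sequence of token indices back into words.
token_ids = [2, 3, 0, 0]
words = [itow[i] for i in token_ids if i != vocab["<pad>"]]
```

The inversion is only lossless when every word has a unique index, which is what a vocabulary is expected to guarantee.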




### Load your trained model

In [3]:
# path_to_ckpt = RESULT_DIR / 'logs' / 'checkpoints' / 'best.pth'  # the checkpoint will appear here after you run training
# model = HAN.from_imbd_ckpt(path_to_ckpt)

### Load pretrained model

In [4]:
! sh ../src/download_pretrained.sh ../data/pretrained_hier.pth

RESULT_DIR = Path("../data/")
path_to_ckpt = RESULT_DIR / 'pretrained_hier.pth'
model = HAN.from_imbd_ckpt(path_to_ckpt)

Model was loaded from ../results/pretrained_hier.pth.


## Display the prediction for a review from the test set

In [5]:
from random import randint

# randint is inclusive on both ends, so the upper bound must be len(dataset) - 1
idx = randint(0, len(dataset) - 1)
batch = collate_docs([dataset[idx]])
display_predict(model=model, batch=batch, itow=itow)

print('Raw review:')
print(Path(dataset._paths[idx]).read_text())

Predict: negative (confedence: 0.997)
Ground truth: negative.


Raw review:
Wow, there are no words to describe how bad this movie truly is. I usually pride myself on being able to enjoy any movie no matter how bad, but this was just too much. I would only suggest watching this movie as a torture device. If you can get past the terrible plot and dialogue by watching it on mute, the even more terrible camera work and shoddy special effects make this movie a real horror. If your thinking about watching this because your a Sandra Bullock fan, don't even bother as she has less than ten minutes of screen time, and her acting is absolutely atrocious. Not to mention the rest of the cast, which could be replaced with baboons who would do a better job, at least then we wouldn't have to listen to the terrible dialogue.


## Display the prediction for your own review

In [6]:
text_str = '''

I really like films like this! I discover several new talents for myself. 
The sound was great and picture as well. So, I am going to see it again.

'''


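# Tokenize the raw string into the format expected by collate_docs
text, snt_max_len, txt_len = dataset.tokenize_plane_text(text_str)

# A custom review has no ground-truth label, so pass a placeholder and clear the targets
batch = collate_docs([{'txt': text, 'snt_len': snt_max_len,
                       'txt_len': txt_len, 'label': -1}])
batch['targets'] = None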

display_predict(model=model, batch=batch, itow=itow)

Predict: positive (confedence: 0.334)
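
For intuition about the three values returned by the tokenization step, here is a minimal sketch of how a plain string could be turned into a hierarchical input. It is not the implementation in `src.dataset`: the function `tokenize_plain_text_sketch`, the regular expressions, and the toy vocabulary are hypothetical, and it assumes that `snt_max_len` is the length of the longest sentence and `txt_len` is the number of sentences.

```python
import re


def tokenize_plain_text_sketch(text, vocab, unk_index=1):
    """Split text into sentences and words, and map each word to a vocabulary index."""
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    doc = []
    for sentence in sentences:
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        if words:
            # Words missing from the vocabulary fall back to the unknown-token index.
            doc.append([vocab.get(word, unk_index) for word in words])
    snt_max_len = max((len(ids) for ids in doc), default=0)
    txt_len = len(doc)
    return doc, snt_max_len, txt_len


# Toy vocabulary (hypothetical contents).
vocab = {"<pad>": 0, "<unk>": 1, "i": 2, "like": 3, "films": 4, "great": 5}

doc, snt_max_len, txt_len = tokenize_plain_text_sketch(
    "I like films! Sound was great.", vocab
)
```

Keeping the sentence boundaries, instead of flattening the review into one token list, is what allows the model to attend over words and over sentences separately.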
