# Hierarchical Attention Network for text classification

In [1]:
## Uncomment the command below to kill the current job:
#!neuro kill $(hostname)

Our recipe is based on the highly cited paper
[Hierarchical Attention Networks for Document Classification](https://arxiv.org/abs/1608.07775) (Z. Yang et al.),
published in 2016. We classify IMDB reviews as positive or negative
(25k reviews for training and the same number for testing). The proposed neural network architecture works in two steps:
1. It encodes sentences. An attention mechanism predicts how much each word contributes to the final embedding of a sentence.
2. It encodes texts. An attention mechanism predicts how much each sentence contributes to the final embedding of a text.
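
The two steps can be sketched in plain Python. This is a simplified illustration, not the code in `src.model`: it scores each vector with a dot product against a context vector and omits the recurrent encoders and the projection layer that the paper applies before scoring. The names `attention_pool` and `encode_document`, the context vectors, and the toy numbers are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weight each vector by softmax(vector . context) and sum them.

    Returns the pooled vector and the attention weights.
    """
    scores = [sum(v_k * c_k for v_k, c_k in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(context)
    pooled = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return pooled, weights


def encode_document(doc, word_context, sent_context):
    """Two-level pooling: words -> sentence vectors -> document vector."""
    sent_vectors, word_weights = [], []
    for sentence in doc:
        vec, weights = attention_pool(sentence, word_context)
        sent_vectors.append(vec)
        word_weights.append(weights)
    doc_vector, sent_weights = attention_pool(sent_vectors, sent_context)
    return doc_vector, word_weights, sent_weights


# Toy document: 2 sentences of 2-dimensional word vectors (made-up numbers).
doc = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0]],
]
doc_vector, word_weights, sent_weights = encode_document(
    doc, word_context=[1.0, 0.0], sent_context=[0.0, 1.0]
)
print(word_weights)  # one list of word weights per sentence
print(sent_weights)  # one weight per sentence
```

The same pooling function is reused at both levels; only the inputs change. The returned weights are what makes the model inspectable, since each one is tied to a specific word or sentence.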

This architecture is appealing because the attention weights can be visualized to show which words and sentences
were important for a prediction. You can find more information in the original article.
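
As a rough idea of how such a visualization might be built, the sketch below turns per-word attention weights into HTML with a background whose opacity follows the weight. It is an assumption-laden illustration: `highlight_words` is a hypothetical helper, the words and weights are made up, and it is not how `display_predict` is implemented.

```python
from html import escape


def highlight_words(words, weights):
    """Return HTML where each word's background opacity follows its weight."""
    peak = max(weights) or 1.0  # avoid division by zero if all weights are 0
    spans = []
    for word, weight in zip(words, weights):
        alpha = weight / peak  # rescale so the top word is fully opaque
        spans.append(
            f'<span style="background: rgba(255, 0, 0, {alpha:.2f})">'
            f"{escape(word)}</span>"
        )
    return " ".join(spans)


# Made-up attention weights for a three-word sentence.
html = highlight_words(["a", "fantastic", "movie"], [0.1, 0.7, 0.2])
print(html)
```

In a notebook the resulting string can be rendered with `IPython.display.HTML(html)`. Rescaling by the largest weight keeps the most important word clearly visible even when the weights are spread thinly over a long sentence.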

In [2]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

from src.notebooks_utils import display_predict
from src.dataset import get_test_dataset, collate_docs, ImdbReviewsDataset
from src.model import HAN
from src.const import RESULT_DIR

### Load IMDB reviews dataset

In [3]:
!sh ../src/download_data.sh ../data

Downloading dataset to ../data
--2020-06-01 12:46:07--  http://data.neu.ro/aclImdb.zip
Resolving data.neu.ro (data.neu.ro)... 52.216.28.51
Connecting to data.neu.ro (data.neu.ro)|52.216.28.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44329613 (42M) [application/zip]
Saving to: ‘/tmp/aclImdb.zip’


2020-06-01 12:46:08 (41.7 MB/s) - ‘/tmp/aclImdb.zip’ saved [44329613/44329613]

Unpacking...
Finished


In [4]:
dataset = get_test_dataset()
itow = dict(zip(dataset.vocab.values(), dataset.vocab.keys()))

Dataset loading from ../data/aclImdb/test.
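
The `itow` dictionary built above is the inverse of the vocabulary. A minimal sketch of the idea, assuming `dataset.vocab` maps each token to a unique integer id (the toy vocabulary below is hypothetical, not the real IMDB one):

```python
# Hypothetical toy vocabulary: token -> integer id.
vocab = {"<pad>": 0, "great": 1, "movie": 2}

# Invert it so ids map back to tokens; this relies on the ids being unique.
itow = {idx: word for word, idx in vocab.items()}

# Decode a sequence of ids back into readable tokens.
token_ids = [1, 2, 0]
decoded = [itow[i] for i in token_ids]
print(decoded)  # ['great', 'movie', '<pad>']
```

This reverse lookup is what lets the prediction display show words instead of integer ids.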




### Load your trained model

In [5]:
#!sh ../src/download_pretrained.sh ../results/pretrained_hier.pth
#path_to_ckpt = RESULT_DIR / 'pretrained_hier.pth'  # the checkpoint appears here after you run training
#model = HAN.from_imbd_ckpt(path_to_ckpt)

### Load pretrained model

In [6]:
!! chmod 700 ../src/download_pretrained.sh && bash -c ../src/download_pretrained.sh

path_to_ckpt = RESULT_DIR / 'pretrained_hier.pth'
model = HAN.from_imbd_ckpt(path_to_ckpt)

Model was loaded from ../results/pretrained_hier.pth.


## Display predictions for reviews from the test set

In [7]:
from random import randrange

# randrange excludes the upper bound, so idx is always a valid index
idx = randrange(len(dataset))
batch = collate_docs([dataset[idx]])
display_predict(model=model, batch=batch, itow=itow)

print('Raw review:')
with open(dataset._paths[idx], 'r') as f:
    print(f.read())

Predict: positive (confedence: 0.997)
Ground truth: positive.


Raw review:
Fantastic Chaplin movie with many memorable moments as Charlie joins the army to fight in WW 1.<br /><br />At first he goes to boot camp, where he has to learn how to handle his rifle and how to walk in line. That's a really funny scene as the tramp is not used to keeping his feet straight!<br /><br />Next thing you know he's in France in a trench. Hilarious scenes here include a starving Charlie eating the cheese of a mousetrap and reading a letter from home over someone's shoulder.<br /><br />When Charlie goes to sleep he finds his bunker all flooded and his roommate snoring. This is such a funny part! I can't really describe it, just watch the movie. When Charlie wakes up his legs feel numb so he tries to 'wake them up'. It had me rolling on the floor when it turns out his second leg still feels numb... while Charlie actually rubs his roommate's foot!<br /><br />The movie then turns a bit grim, as Charlie shoots a couple of Germans from his trench (although it's done in 

## Display the prediction for your own review

In [8]:
text_str = '''

I really like films like this! I discover several new talents for myself. 
The sound was great and picture as well. So, I am going to see it again.

'''


text, snt_max_len, txt_len = dataset.tokenize_plane_text(text_str)

batch = collate_docs([{'txt': text, 'snt_len': snt_max_len,
                       'txt_len': txt_len, 'label': -1}])
batch['targets'] = None

display_predict(model=model, batch=batch, itow=itow)

Predict: positive (confedence: 0.334)
