# Tutorial 1.  Sentence classification with word embeddings

This tutorial aims to familiarize participants with text classification in **DeepPavlov**.

The tutorial has the following **structure**:

1. [Library and requirements installation](#Library-and-requirements-installation)

2. [Dataset downloading](#Dataset-downloading)

3. [Dataset Reader](#Dataset-Reader): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_readers.html)

4. [Dataset Iterator](#Dataset-Iterator): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_iterators.html)

5. [Preprocessor](#Preprocessor): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)

6. [Tokenizer](#Tokenizer): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)

7. [GloVe Embedder](#Embedder): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)
[pre-trained embeddings link](https://deeppavlov.readthedocs.io/en/latest/intro/pretrained_vectors.html)

8. [Vocabulary of classes](#Vocabulary-of-classes)

9. [Keras Classifier](#Classifier): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/classifiers.html)

## Library and requirements installation

Let's install the `DeepPavlov` library and the dependencies for the Keras classification model.

In [None]:
!pip install deeppavlov

Collecting deeppavlov
  Downloading https://files.pythonhosted.org/packages/03/4f/1f73825653f388ead9a9ea2f46cad51c92bd84a899ebf983906013e14d1c/deeppavlov-0.4.0-py3-none-any.whl (682kB)
Collecting Cython==0.28.5 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/19/8e/32b280abb0947a96cdbb8329fb2014851a21fc1d099009f946ea8a8202c3/Cython-0.28.5-cp36-cp36m-manylinux1_x86_64.whl (3.4MB)
Collecting numpy==1.14.5 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/68/1e/116ad560de97694e2d0c1843a7a0075cc9f49e922454d32f49a80eb6f1f2/numpy-1.14.5-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
Collecting pytelegrambotapi==3.5.2 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/a4/5a/7aab147b253f19e5ef007316f39cf693a63d5cd7f654c3805c

In [None]:
!python -m deeppavlov install intents_snips

2019-07-01 02:28:36.323 INFO in 'deeppavlov.core.common.file'['file'] at line 30: Interpreting 'intents_snips' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/classifiers/intents_snips.json'
Collecting tensorflow==1.10.0
  Downloading https://files.pythonhosted.org/packages/ee/e6/a6d371306c23c2b01cd2cb38909673d17ddd388d9e4b3c0f6602bfd972c8/tensorflow-1.10.0-cp36-cp36m-manylinux1_x86_64.whl (58.4MB)
Collecting setuptools<=39.1.0 (from tensorflow==1.10.0)
  Downloading https://files.pythonhosted.org/packages/8c/10/79282747f9169f21c053c562a0baa21815a8c7879be97abd930dbcf862e8/setuptools-39.1.0-py2.py3-none-any.whl (566kB)
Collecting tensorboard<1.11.0,>=1.10.0 (from tensorflow==1.10.0)
  Downloading https://files.pythonhosted.org/packages/c6/17/ecd918a004f297955c30b4fffbea100b1606c225dbf0443264012773c3ff/tensorboard-1.10.0-py3-none-any.whl (3

## Dataset downloading

This tutorial uses the Stanford Sentiment Treebank (SST) dataset introduced in [this paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf).

The dataset contains unlabelled sentences divided into train/dev/test sets, and phrases labelled with a float sentiment value. Most of the sentences also appear in the labelled list of phrases. Therefore, we extract the sentences that coincide with labelled phrases, convert their float sentiment values to fine-grained classes (5 classes: very negative, negative, neutral, positive, very positive) and to binary classes (negative and positive only), and build two classifiers.
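The conversion from float sentiment values to classes can be sketched as follows. The cut-offs (0.2, 0.4, 0.6, 0.8) follow the convention recommended in the SST README. Dropping neutral sentences for the binary task is an assumption about how the CSV files used later were prepared, and `to_fine_grained` and `to_binary` are hypothetical helper names, not part of DeepPavlov.

```python
from typing import Optional

# Assumed upper bounds for the first four classes, in order of increasing sentiment.
FINE_GRAINED_BOUNDS = [
    (0.2, "very negative"),
    (0.4, "negative"),
    (0.6, "neutral"),
    (0.8, "positive"),
]


def to_fine_grained(score: float) -> str:
    """Map a sentiment value in [0, 1] to one of the five classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("sentiment value must lie in [0, 1]")
    for upper_bound, label in FINE_GRAINED_BOUNDS:
        if score <= upper_bound:
            return label
    return "very positive"  # everything above the last bound


def to_binary(score: float) -> Optional[str]:
    """Map a sentiment value to 'negative' or 'positive'; None for neutral."""
    label = to_fine_grained(score)
    if label == "neutral":
        return None  # assumption: neutral sentences are excluded from the binary task
    return "negative" if label.endswith("negative") else "positive"
```

With these helpers, a value of `0.5` is neutral in the fine-grained setting and is skipped in the binary one.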

Let's download and extract the SST dataset.

In [None]:
from deeppavlov.core.data.utils import download

download("./stanfordSentimentTreebank.zip", source_url="http://files.deeppavlov.ai/datasets/stanfordSentimentTreebank.zip")

2019-07-01 02:30:05.291 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/datasets/stanfordSentimentTreebank.zip to /content/stanfordSentimentTreebank.zip
100%|██████████| 7.26M/7.26M [00:01<00:00, 3.83MB/s]


In [None]:
!unzip stanfordSentimentTreebank.zip

Archive:  stanfordSentimentTreebank.zip
   creating: stanfordSentimentTreebank/
  inflating: stanfordSentimentTreebank/datasetSentences.txt  
   creating: __MACOSX/
   creating: __MACOSX/stanfordSentimentTreebank/
  inflating: __MACOSX/stanfordSentimentTreebank/._datasetSentences.txt  
  inflating: stanfordSentimentTreebank/datasetSplit.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._datasetSplit.txt  
  inflating: stanfordSentimentTreebank/dictionary.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._dictionary.txt  
  inflating: stanfordSentimentTreebank/original_rt_snippets.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._original_rt_snippets.txt  
  inflating: stanfordSentimentTreebank/README.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._README.txt  
  inflating: stanfordSentimentTreebank/sentiment_labels.txt  
  inflating: __MACOSX/stanfordSentimentTreebank/._sentiment_labels.txt  
  inflating: stanfordSentimentTreebank/SOStr.txt  
  inflating: stanfo

In [None]:
!ls stanfordSentimentTreebank/

datasetSentences.txt	  sentiment_labels.txt	 train_binary.csv
datasetSplit.txt	  SOStr.txt		 train_fine_grained.csv
dictionary.txt		  STree.txt		 valid_binary.csv
original_rt_snippets.txt  test_binary.csv	 valid_fine_grained.csv
README.txt		  test_fine_grained.csv


## Dataset Reader

DatasetReaders are components for reading datasets from files. DeepPavlov contains several different DatasetReaders; you can either use one of them or build your own component.

The only requirement concerns the output of a **DatasetReader**:
* output must be a dictionary with three fields "train", "valid" and "test", 
* each dictionary value must be a list of corresponding samples,
* each sample must be a tuple (x, y) where either x, y or both can also be lists of several inputs.
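To make the required output structure concrete, here is a minimal sketch in plain Python. The function `read_toy_dataset` and its hard-coded samples are hypothetical; a real reader would load the samples from files.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, List[str]]  # (x, y): a text and its list of labels


def read_toy_dataset() -> Dict[str, List[Sample]]:
    """Return hard-coded toy data in the structure a DatasetReader must produce."""
    return {
        "train": [
            ("a lovely film", ["positive"]),
            ("a deadly bore", ["negative"]),
        ],
        "valid": [("quite a vision", ["positive"])],
        "test": [],  # left empty here for brevity; the key itself is still present
    }


toy_data = read_toy_dataset()
# Every field is a list of (x, y) tuples.
for field in ("train", "valid", "test"):
    assert all(isinstance(s, tuple) and len(s) == 2 for s in toy_data[field])
```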

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/dataset_readers.html

In [None]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

In [None]:
reader = BasicClassificationDatasetReader()
data = reader.read(data_path="./stanfordSentimentTreebank", 
                   train="train_binary.csv", valid="valid_binary.csv", test="test_binary.csv",
                   x="text", y="binary_label")

In [None]:
data.keys()

dict_keys(['train', 'valid', 'test'])

For every sample, the label(s) are stored as a list because the reader does not know whether the task is binary, multi-class or multi-label classification.

In [None]:
data["train"][0]

("The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 ['positive'])

## Dataset Iterator

DatasetIterators are components for iterating over datasets. DeepPavlov contains several different DatasetIterators; you can either use one of them or build your own component.

A DatasetIterator must have the following methods:
* **gen_batches** generates batches of inputs and expected outputs for training neural networks. The output is a tuple of a batch of inputs and a batch of expected outputs.
* **get_instances** returns all the data of a selected data type ("train", "valid", "test"). The output is a tuple of all inputs and all expected outputs for that data type.
* **split** merges or splits the data of a selected data type ("train", "valid", "test") coming from the DatasetReader.
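The first two methods can be sketched in plain Python as follows. `ToyIterator` is a hypothetical class written for illustration only; it is not DeepPavlov's implementation, and it omits `split`.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple

Sample = Tuple[str, List[str]]


class ToyIterator:
    """Illustrative iterator over data in the DatasetReader output format."""

    def __init__(self, data: Dict[str, List[Sample]],
                 seed: Optional[int] = None, shuffle: bool = True) -> None:
        self.data = data
        self.shuffle = shuffle
        self.rng = random.Random(seed)  # private RNG makes shuffling reproducible

    def gen_batches(self, batch_size: int, data_type: str = "train",
                    shuffle: Optional[bool] = None) -> Iterator[Tuple[tuple, tuple]]:
        """Yield (inputs, expected outputs) tuples of at most `batch_size` samples."""
        samples = list(self.data[data_type])
        if self.shuffle if shuffle is None else shuffle:
            self.rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            xs, ys = zip(*chunk)  # transpose the list of (x, y) pairs
            yield xs, ys

    def get_instances(self, data_type: str = "train") -> Tuple[tuple, tuple]:
        """Return all inputs and all expected outputs of one data type."""
        samples = self.data[data_type]
        if not samples:
            return (), ()
        xs, ys = zip(*samples)
        return xs, ys


toy = {"train": [("text %d" % i, ["positive"]) for i in range(5)],
       "valid": [], "test": []}
toy_iterator = ToyIterator(toy, seed=42)
batch_sizes = [len(xs) for xs, _ in toy_iterator.gen_batches(batch_size=2)]
```

Here five training samples with `batch_size=2` give two full batches and one final batch with a single sample.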

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/dataset_iterators.html

In [None]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [None]:
iterator = BasicClassificationDatasetIterator(data, seed=42, shuffle=True)

In [None]:
for batch in iterator.gen_batches(data_type="train", batch_size=13):
  print(batch)
  break

(('Kids should have a stirring time at this beautifully drawn movie .', 'It is quite a vision .', "It 's rare to find a film to which the adjective ` gentle ' applies , but the word perfectly describes Pauline & Paulette .", "Despite terrific special effects and funnier gags , Harry Potter and the Chamber of Secrets finds a way to make J.K. Rowling 's marvelous series into a deadly bore .", 'A harrowing account of a psychological breakdown .', "If you 're a fan of the series you 'll love it and probably want to see it twice .", "An escapist confection that 's pure entertainment .", "It 's mostly a pleasure to watch .", 'All three women deliver remarkable performances .', 'The IMAX screen enhances the personal touch of manual animation .', "With amazing finesse , the film shadows Heidi 's trip back to Vietnam and the city where her mother , Mai Thi Kim , still lives .", 'For those of an indulgent , slightly sunbaked and summery mind , Sex and Lucia may well prove diverting enough .', 'N

## Preprocessor

We can preprocess text according to our needs.
Let's define the simplest preprocessor: lower-casing.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/preprocessors.html

In [None]:
from deeppavlov.models.preprocessors.str_lower import StrLower

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/nonbreaking_prefixes.zip.


In [None]:
preprocessor = StrLower()

In [None]:
preprocessor(["The Rock is destined to be the 21st Century 's new `` Conan ''."])

["the rock is destined to be the 21st century 's new `` conan ''."]

## Tokenizer

We need to tokenize our texts because we are going to use word embeddings.
DeepPavlov contains several different tokenizers; you can choose the most appropriate one.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/tokenizers.html

In [None]:
from deeppavlov.models.tokenizers.nltk_tokenizer import NLTKTokenizer

In [None]:
tokenizer = NLTKTokenizer()

In [None]:
tokenizer(["The Rock is destined to be the 21st Century 's new `` Conan ''."])

[['The',
  'Rock',
  'is',
  'destined',
  'to',
  'be',
  'the',
  '21st',
  'Century',
  "'",
  's',
  'new',
  '``',
  'Conan',
  "''."]]

## Embedder

We are going to use non-trainable GloVe word embeddings.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/embedders.html

First, we need to download the GloVe embeddings file. It is available [here](https://nlp.stanford.edu/projects/glove/), but that download is larger than 800 MB. To save time, you can download the GloVe embeddings file from DeepPavlov instead (about 350 MB).

In [None]:
from deeppavlov.core.data.utils import download

download("./glove.6B.100d.txt", source_url="http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt")

2019-07-01 02:31:12.602 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt to /content/glove.6B.100d.txt
347MB [00:24, 14.1MB/s]


Now we can define the `GloVeEmbedder`. The `pad_zero` parameter, set to `True` here, determines whether each embedded batch of tokens is zero-padded to the length of the longest sample in the batch.

In [None]:
from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder

embedder = GloVeEmbedder(load_path="./glove.6B.100d.txt",
                         pad_zero=True  # zero-pad every sample up to the longest sample in the batch
                        )

2019-07-01 02:31:37.943 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/content/glove.6B.100d.txt`]
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [None]:
embedder.dim

100

In [None]:
embedder(tokenizer(preprocessor(["The Rock is destined to be the 21st Century 's new 'Conan'.",
                                 "The Rock is a new 'Conan'."])))

array([[[-0.038194, -0.24487 ,  0.72812 , ..., -0.1459  ,  0.8278  ,
          0.27062 ],
        [-0.68387 ,  0.39176 ,  0.5367  , ..., -0.14145 ,  1.3115  ,
          0.31476 ],
        [-0.54264 ,  0.41476 ,  1.0322  , ..., -1.2969  ,  0.76217 ,
          0.46349 ],
        ...,
        [-0.34562 , -0.24993 ,  0.58678 , ..., -1.3106  ,  1.0294  ,
         -0.058794],
        [-0.7287  , -0.40513 ,  0.25123 , ..., -0.78115 , -0.45564 ,
          0.10672 ],
        [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
          0.      ]],

       [[-0.038194, -0.24487 ,  0.72812 , ..., -0.1459  ,  0.8278  ,
          0.27062 ],
        [-0.68387 ,  0.39176 ,  0.5367  , ..., -0.14145 ,  1.3115  ,
          0.31476 ],
        [-0.54264 ,  0.41476 ,  1.0322  , ..., -1.2969  ,  0.76217 ,
          0.46349 ],
        ...,
        [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
          0.      ],
        [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
   

In [None]:
embedder(tokenizer(preprocessor(["The Rock is destined to be the 21st Century 's new 'Conan'.",
                                 "The Rock is a new 'Conan'."]))).shape

(2, 15, 100)
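The shape reads as (batch size, length of the longest sample, embedding dimension): the shorter sentence is padded with zero vectors up to the length of the longer one. The padding idea can be sketched in plain Python as follows. The function `embed_and_pad` and the two-dimensional toy vectors are hypothetical, and mapping unknown tokens to zero vectors is an assumption of this sketch, not a statement about the library.

```python
from typing import Dict, List


def embed_and_pad(batch: List[List[str]],
                  vectors: Dict[str, List[float]],
                  dim: int) -> List[List[List[float]]]:
    """Look up a vector for every token and zero-pad samples to equal length."""
    max_len = max(len(tokens) for tokens in batch)
    zero = [0.0] * dim
    embedded = []
    for tokens in batch:
        # Assumption: tokens missing from the table get a zero vector.
        rows = [list(vectors.get(token, zero)) for token in tokens]
        # Append zero vectors until the sample is as long as the longest one.
        rows += [list(zero) for _ in range(max_len - len(tokens))]
        embedded.append(rows)
    return embedded


toy_vectors = {"the": [0.1, 0.2], "rock": [0.3, 0.4], "is": [0.5, 0.6]}
toy_batch = [["the", "rock", "is"], ["the", "rock"]]
padded = embed_and_pad(toy_batch, toy_vectors, dim=2)
```

In this toy batch, both samples end up with three rows of two values each, and the last row of the second sample is all zeros.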

## Vocabulary of classes

By default, we assume that classes may be given as strings. Therefore, we need to convert them to something more appropriate for a classifier. For example, neural classifiers typically expect a **one-hot** representation of classes. To obtain it, we collect a dictionary of all the classes that appear in the data (adding an "unknown" class if needed), replace each class with its index, and convert the indices to one-hot vectors.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/core/data.html

In [None]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [None]:
vocab = SimpleVocabulary(save_path="./binary_classes.dict")



In [None]:
iterator.get_instances(data_type="train")

(("The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
  "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
  'Singer\\/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
  'Yet the act is still charming here .',
  "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
  'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
  'Part of the ch

In [None]:
vocab.fit(iterator.get_instances(data_type="train")[1])

In [None]:
list(vocab.items())

[('positive', 0), ('negative', 1)]

In [None]:
vocab(["positive", "positive", "negative"])

[0, 0, 1]

In [None]:
vocab([0, 0, 1])

['positive', 'positive', 'negative']

**One-hotter**

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/preprocessors.html

In [None]:
from deeppavlov.models.preprocessors.one_hotter import OneHotter

In [None]:
one_hotter = OneHotter(depth=vocab.len, 
                       single_vector=True  # means we want to have one vector per sample
                      )

In [None]:
one_hotter(vocab(["positive", "positive", "negative"]))

[array([1., 0.], dtype=float32),
 array([1., 0.], dtype=float32),
 array([0., 1.], dtype=float32)]
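Conceptually, the two components perform the steps sketched below in plain Python. The helpers `build_class_vocab` and `one_hot` are hypothetical, and ordering the classes by frequency is an assumption of this sketch rather than a description of `SimpleVocabulary` internals.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_class_vocab(labels_per_sample: Iterable[List[str]]) -> Dict[str, int]:
    """Collect all classes and give each an index (most frequent class first)."""
    counts = Counter(label for labels in labels_per_sample for label in labels)
    return {label: index for index, (label, _) in enumerate(counts.most_common())}


def one_hot(indices: Iterable[int], depth: int) -> List[float]:
    """Build a single vector of length `depth` with ones at the given indices."""
    vector = [0.0] * depth
    for index in indices:
        vector[index] = 1.0
    return vector


toy_labels = [["positive"], ["positive"], ["negative"]]
class_vocab = build_class_vocab(toy_labels)
toy_one_hot = [one_hot([class_vocab[label] for label in labels], len(class_vocab))
               for labels in toy_labels]
```

Because `one_hot` accepts several indices per sample, the same sketch also covers the multi-label case, where one vector can contain more than one `1.0`.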

**Converting from probability to labels**

A neural model not only accepts a one-hot representation of classes but also returns, for every sample, a vector with a probability distribution over the classes. Therefore, we need a component that converts a probability distribution to label indices.

The `Proba2Labels` component supports three different modes:
* if `max_proba` is true, it returns the index of the highest probability for each sample,
* if `confident_threshold` is given, it returns the indices with probabilities higher than the threshold,
* if `top_n` is given, it returns the `top_n` indices with the highest probabilities.
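The three behaviours can be sketched in plain Python as follows. The function names are hypothetical and the code only illustrates the idea; it is not the library's implementation and says nothing about which option takes precedence when several are set.

```python
from typing import List, Sequence


def by_max_proba(probs: Sequence[Sequence[float]]) -> List[List[int]]:
    """For each sample, keep the index of the single highest probability."""
    return [[max(range(len(p)), key=lambda i: p[i])] for p in probs]


def by_threshold(probs: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """For each sample, keep every index whose probability exceeds the threshold."""
    return [[i for i, value in enumerate(p) if value > threshold] for p in probs]


def by_top_n(probs: Sequence[Sequence[float]], n: int) -> List[List[int]]:
    """For each sample, keep the n indices with the highest probabilities."""
    return [sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:n] for p in probs]


toy_probs = [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
max_proba_labels = by_max_proba(toy_probs)
```

The threshold variant may return zero or several indices per sample, which makes it the natural choice for multi-label classification.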

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/classifiers.html

In [None]:
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

prob2labels = Proba2Labels(max_proba=True)

In [None]:
vocab.len

2

In [None]:
prob2labels([[0.6, 0.4], 
             [0.2, 0.8],
             [0.1, 0.9]])

[[0], [1], [1]]

In [None]:
vocab(prob2labels([[0.6, 0.4], 
                   [0.2, 0.8],
                   [0.1, 0.9]]))

[['positive'], ['negative'], ['negative']]

## Classifier

DeepPavlov contains several classification components: sklearn classifiers, neural networks on Keras, and a BERT classifier on TensorFlow. This tutorial demonstrates how to build a convolutional neural network classifier on Keras.

`KerasClassificationModel` is a class that builds a Keras classifier. Each network architecture is defined in a separate method of the class, and the `model_name` parameter names the method to use.

**DOCS:** http://docs.deeppavlov.ai/en/latest/apiref/models/classifiers.html

In [None]:
from keras.layers import Input, Dense, Activation, Dropout, Flatten, GlobalMaxPooling1D
from keras import Model

from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.metrics.accuracy import sets_accuracy

Using TensorFlow backend.


In [None]:
model = KerasClassificationModel(
    filters_cnn=256,
    kernel_sizes_cnn=[3,5,7],
    dropout_rate=0.2,
    dense_size=100,
    save_path="./cnn_model_v0", 
    load_path="./cnn_model_v0", 
    embedding_size=embedder.dim,
    n_classes=vocab.len,
    model_name="cnn_model",  # name of the class method that builds the network
    optimizer="Adam",
    learning_rate=0.001,
    learning_rate_decay=0.001,
    loss="categorical_crossentropy")

2019-07-01 02:34:28.477 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 244: [initializing `KerasClassificationModel` from scratch as cnn_model]
2019-07-01 02:34:29.196 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 100)    0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, None, 256)    77056       input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)  

In [None]:
# `get_instances` returns all samples of the given data type
x_valid, y_valid = iterator.get_instances(data_type="valid")
# Save the model only when the validation score beats the previous best.
# `best_score` holds the highest validation accuracy seen so far.
best_score = 0.
# Stop once the score has failed to beat the best `patience` times in a row.
patience = 2
impatience = 0

# let's train for at most 10 epochs
for ep in range(10):

    for x, y in iterator.gen_batches(batch_size=64,
                                     data_type="train", shuffle=True):
        x_embed = embedder(tokenizer(preprocessor(x)))
        y_onehot = one_hotter(vocab(y))
        model.train_on_batch(x_embed, y_onehot)

    y_valid_pred = model(embedder(tokenizer(preprocessor(x_valid))))
    score = sets_accuracy(y_valid, vocab(prob2labels(y_valid_pred)))
    print("Epochs done: {}. Valid Accuracy: {}".format(ep + 1, score))
    if score > best_score:
        model.save()
        print("New best score. Saving model.")
        best_score = score
        impatience = 0
    else:
        impatience += 1
        if impatience == patience:
            print("Out of patience. Stop training.")
            break

2019-07-01 02:35:35.782 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 373: [saving model to /content/cnn_model_v0_opt.json]


Epochs done: 1. Valid Accuracy: 0.7915151515151515
New best score. Saving model.
Epochs done: 2. Valid Accuracy: 0.7890909090909091


2019-07-01 02:36:57.485 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 373: [saving model to /content/cnn_model_v0_opt.json]


Epochs done: 3. Valid Accuracy: 0.7951515151515152
New best score. Saving model.


2019-07-01 02:37:38.513 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 373: [saving model to /content/cnn_model_v0_opt.json]


Epochs done: 4. Valid Accuracy: 0.8315151515151515
New best score. Saving model.
Epochs done: 5. Valid Accuracy: 0.7915151515151515
Epochs done: 6. Valid Accuracy: 0.8024242424242424
Out of patience. Stop training.
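The stopping behaviour in this log can be replayed without a model. The sketch below applies the same patience rule to the six validation accuracies printed above; `early_stopping_trace` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple


def early_stopping_trace(scores: List[float], patience: int) -> Tuple[Optional[int], int]:
    """Return (epoch of the best score, epoch at which training stops)."""
    best_score = 0.0
    best_epoch = None
    impatience = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # New best score: remember it and reset the counter.
            best_score, best_epoch, impatience = score, epoch, 0
        else:
            impatience += 1
            if impatience == patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


valid_scores = [0.7915151515151515, 0.7890909090909091, 0.7951515151515152,
                0.8315151515151515, 0.7915151515151515, 0.8024242424242424]
best_epoch, stop_epoch = early_stopping_trace(valid_scores, patience=2)
print(best_epoch, stop_epoch)  # prints "4 6"
```

The saved model is therefore the one from epoch 4, while epochs 5 and 6 only exhaust the patience counter.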


In [None]:
# Let's inspect the outputs for one validation sample
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted probability distribution: {}".format(dict(zip(vocab.keys(), 
                                                               y_valid_pred[0]))))
print("Predicted label: {}".format(vocab(prob2labels(y_valid_pred))[0]))

Text sample: It 's a lovely film with lovely performances by Buy and Accorsi .
True label: ['positive']
Predicted probability distribution: {'positive': 0.8794663548469543, 'negative': 0.048511240631341934}
Predicted label: ['positive']


# Fine-grained classification

The fine-grained labelled dataset corresponds to a multi-class classification task with 5 classes.
This classification is still not multi-label, so apart from reading the fine-grained CSV files you do not need to change anything from the binary classification pipeline, except perhaps the network or training parameters.

The **TASK** is to build a fine-grained classifier from scratch.