## Check Questions

**Question 1**: What is backpropagation?

Backpropagation is a method for computing the gradient of the loss function with respect to the model weights using the chain rule.
It is used in neural networks: during the forward pass each neuron computes its local partial derivatives (or caches the values needed for them), and these are then multiplied together in reverse order to obtain the final gradient (the so-called backward pass).
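To make the chain rule concrete, here is a minimal sketch for a single sigmoid neuron with a squared-error loss. The function names and the choice of loss are assumptions made for this illustration; they do not appear elsewhere in the notebook.

```python
import numpy as np


def forward_backward(w, b, x, target):
    """Return the loss and its gradient w.r.t. w and b for one sigmoid neuron."""
    # forward pass: keep the intermediate values needed by the backward pass
    z = w * x + b
    y = 1.0 / (1.0 + np.exp(-z))
    loss = (y - target) ** 2
    # backward pass: multiply the local derivatives in reverse order (chain rule)
    dloss_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)
    dloss_dz = dloss_dy * dy_dz
    return loss, dloss_dz * x, dloss_dz


def numerical_grad_w(w, b, x, target, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw, used as a sanity check."""
    loss_plus, _, _ = forward_backward(w + eps, b, x, target)
    loss_minus, _, _ = forward_backward(w - eps, b, x, target)
    return (loss_plus - loss_minus) / (2.0 * eps)
```

Comparing the analytic gradient with `numerical_grad_w` is a common way to check a hand-written backward pass.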


**Question 2**: Which activation functions do you know?

$\tanh(x)$, $\sigma(x) = \frac{1}{1 + e^{-x}}$, $\mathrm{ReLU}(x) = \max(0, x)$
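These three functions are one-liners in NumPy. The sketch below evaluates them on a few sample points; the input values are arbitrary.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


x = np.array([-2.0, 0.0, 2.0])
activations = {"tanh": np.tanh(x), "sigmoid": sigmoid(x), "relu": relu(x)}
```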

**Question 3**: What makes them interesting, and why were these chosen out of all possible functions?

These functions are nonlinear, so the network can approximate complex dependencies, and they are differentiable (ReLU everywhere except at 0), which backpropagation requires. Tanh and sigmoid squash their input into (-1, 1) and (0, 1) respectively, loosely imitating the behaviour of neurons in the brain. Tanh and sigmoid suffer from vanishing gradients in very deep networks, because their derivatives become very small once they saturate, i.e. when the output approaches 1 (or -1 for tanh, 0 for sigmoid). ReLU does not have this problem for positive inputs, where its derivative is exactly 1.
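The saturation effect can be seen numerically. The sketch below multiplies the local derivative across 10 layers while ignoring the weight matrices, which is a simplifying assumption made only for illustration.

```python
import numpy as np


def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2


def relu_grad(x):
    # subgradient convention: the derivative is taken to be 0 at x == 0
    return (x > 0).astype(np.float64)


x = np.array([0.0, 5.0])
depth = 10
# product of the local derivatives across `depth` layers (weights ignored)
sigmoid_chain = sigmoid_grad(x) ** depth
relu_chain = relu_grad(np.array([5.0])) ** depth
```

Even at `x = 0`, where the sigmoid derivative is at its maximum of 0.25, the product over 10 layers is below 1e-6, while the ReLU product stays at 1 for positive inputs.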

**Question 4**: How does deep learning differ from ordinary neural networks?

Deep neural networks have many hidden layers, whereas ordinary neural networks have only 2-3 layers.

**Question 5**: What is a convolution?

Applied to images, it is a function of a patch of the image that multiplies the pixel values by weights and sums the results; the same weights are applied at every position of the image.
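A minimal NumPy sketch of this operation, without padding or strides. Like most deep learning libraries it does not flip the kernel, so strictly speaking it is a cross-correlation. The image and the averaging kernel are made-up examples.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and return the weighted sums (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out


image = np.arange(16, dtype=np.float64).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0   # averaging filter
result = conv2d_valid(image, kernel)
```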

**Question 6**: What is a hidden state?

A vector that is stored in the RNN cell and depends on the current input and on its own previous value. This lets the RNN take the preceding context into account.
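As an illustration, here is one step of a vanilla (Elman-style) RNN in NumPy. The weight names, sizes and random inputs are assumptions for this sketch and are unrelated to the model trained later in the notebook.

```python
import numpy as np


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: the new state mixes the input and the old state."""
    return np.tanh(np.dot(x, W_xh) + np.dot(h_prev, W_hh) + b)


rng = np.random.RandomState(0)
input_size, hidden_size = 3, 4
W_xh = rng.normal(scale=0.5, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
sequence = rng.normal(size=(5, input_size))
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b)   # h carries context from earlier inputs
```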

**Question 7**: What is an embedding?

An embedding is a map from tokens (words, characters) to vectors of size hidden_size, which then serve as the inputs to the RNN cells.
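In practice the map is a matrix with one row per token, and the lookup is plain row indexing. A small sketch with a made-up four-symbol vocabulary:

```python
import numpy as np

vocab = ["#", "a", "b", "c"]
word2index = {token: i for i, token in enumerate(vocab)}
hidden_size = 5

rng = np.random.RandomState(0)
embeddings = rng.uniform(-1.0, 1.0, size=(len(vocab), hidden_size))

token_ids = np.array([word2index[ch] for ch in "abca"])
embedded = embeddings[token_ids]   # row lookup, shape (len(text), hidden_size)
```

Identical tokens map to identical vectors; during training the rows of `embeddings` are learned together with the rest of the network.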

In [1]:
import tensorflow as tf
from tensorflow.python.ops import rnn_cell
from tensorflow.python.ops.math_ops import sigmoid, tanh
from scipy.ndimage.interpolation import shift
import numpy as np
import re
import string
from datetime import datetime
from bs4 import BeautifulSoup, Tag
import os

In [2]:
print(tf.__version__)

0.11.0rc0


### Utilities

In [3]:
class Vocabulary(object):
    """Vocabulary class."""

    __slots__ = ["word2index", "index2word", "unknown_idx", "charlevel"]

    def __init__(self, words=None, charlevel=False):
        """Init vocabulary."""
        self.word2index = {}
        self.index2word = []
        self.charlevel = charlevel
        # add unknown word/symbol:
        if not self.charlevel:
            self.add_words(["<UNK>"])
        else:
            self.add_words(["#"])
        self.unknown_idx = 0

        if words is not None:
            self.add_words(words)

    def add_words(self, words):
        """Add words to dictionary."""
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = len(self.word2index)
                self.index2word.append(word)

    def __call__(self, line):
        """Convert from numerical representation to words and vice-versa."""
        if isinstance(line, np.ndarray):
            return " ".join([self.index2word[word_idx] for word_idx in line])
        if isinstance(line, list):
            # a list of ints is decoded; a list of tokens is encoded
            if len(line) > 0 and isinstance(line[0], (int, np.integer)):
                return " ".join([self.index2word[word] for word in line])
            tokens = line
        elif self.charlevel:
            tokens = list(line)
        else:
            tokens = re.split(r'\s+', line)

        indices = np.zeros(len(tokens), dtype=np.int32)
        for i, token in enumerate(tokens):
            indices[i] = self.word2index.get(token, self.unknown_idx)

        return indices

    @property
    def size(self):
        """Return length of vocabulary."""
        return len(self.index2word)

    def __len__(self):
        """Return length of vocabulary."""
        return len(self.index2word)


def pad_into_matrix(rows, padding=0):
    """Pad numerical rows with zeros so they all have the same width."""
    if len(rows) == 0:
        # keep the (matrix, lengths) return type for an empty batch
        return np.zeros((0, 0), dtype=np.int32), []
    lengths = list(map(len, rows))
    width = max(lengths)
    height = len(rows)
    mat = np.empty([height, width], dtype=rows[0].dtype)
    mat.fill(padding)
    for i, row in enumerate(rows):
        mat[i, 0:len(row)] = row
    return mat, lengths


def get_batches(numerical_lines, batch_size):
    """Generate batches of data."""
    current_batch_num = 0
    train_size = len(numerical_lines)
    while current_batch_num * batch_size < train_size:
        start = current_batch_num * batch_size
        end = (current_batch_num + 1) * batch_size
        lines_batch, lengths_batch = pad_into_matrix(numerical_lines[start:end])
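        # next-token labels: inputs shifted one position left, zero-filled;
        # plain slicing avoids spline interpolation on integer data
        labels_batch = np.zeros_like(lines_batch)
        labels_batch[:, :-1] = lines_batch[:, 1:]
        yield (lines_batch, lengths_batch, labels_batch)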
        current_batch_num += 1

def prepare_data(destination_file, *filenames):
    """Extract the text of every <body> tag from the given files into one file."""
    with open(destination_file, 'w') as parsed:
        for filename in filenames:
            # print(filename)
            with open(filename) as source:
                soup = BeautifulSoup(source, "html.parser")
            texts = map(Tag.getText, soup.findAll("body", text=True))
            # separate stories (and files) with a newline
            parsed.write('\n'.join(texts) + '\n')

## Custom GRU

A GRU implementation, as described [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
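Written out, the update computed by the cell below is as follows. Biases are omitted, as in the code; $[a, b]$ denotes concatenation, $\odot$ the elementwise product, and the row-vector convention matches the `tf.matmul` calls.

$$
\begin{aligned}
z_t &= \sigma\left([h_{t-1}, x_t]\, W_z\right) \\
r_t &= \sigma\left([h_{t-1}, x_t]\, W_r\right) \\
\tilde{h}_t &= \tanh\left([r_t \odot h_{t-1}, x_t]\, W\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$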

In [4]:
class CustomGRUCell(rnn_cell.RNNCell):
    """Custom GRU implementation."""

    def __init__(self, hidden_size):
        """Init cell."""
        self.hidden_size = hidden_size

    def __call__(self, inputs, state, scope=None):
        """Compute GRU outputs."""
        # print("Starting GRU...")
        with tf.variable_scope(scope or "CustomGRUCell"):
            # width of the concatenated [state, inputs] vector
            dim1 = int(inputs.get_shape()[1]) + int(state.get_shape()[1])
            # every gate maps the concatenation to a vector of size hidden_size
            dim2 = self.hidden_size
            # print("Initializing matrices...")
            W = tf.get_variable("W", [dim1, dim2],
                                initializer=tf.random_normal_initializer(0, np.sqrt(2.0 / dim1)))
            W_r = tf.get_variable("W_r", [dim1, dim2],
                                  initializer=tf.random_normal_initializer(0, np.sqrt(2.0 / dim1)))
            W_z = tf.get_variable("W_z", [dim1, dim2],
                                  initializer=tf.random_normal_initializer(0, np.sqrt(2.0 / dim1)))
            # print("Done.")
            z = sigmoid(tf.matmul(tf.concat(1, [state, inputs]), W_z))
            r = sigmoid(tf.matmul(tf.concat(1, [state, inputs]), W_r))
            new_state_candidate = tanh(tf.matmul(tf.concat(1, [r * state, inputs]), W))
            new_state = (1 - z) * state + z * new_state_candidate
            # print("Computed new states.")
            return new_state, new_state

    @property
    def state_size(self):
        """Return hidden state size."""
        return self.hidden_size

    @property
    def output_size(self):
        """Return output size of the cell."""
        return self.hidden_size
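For experimenting with the gate arithmetic outside TensorFlow, the same step can be mirrored in NumPy. This is a sketch with randomly initialised weights and made-up sizes, not a drop-in replacement for `CustomGRUCell`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, W_z, W_r, W):
    """One GRU step without biases; rows of x and h are batch items."""
    hx = np.concatenate([h, x], axis=1)
    z = sigmoid(hx.dot(W_z))                      # update gate
    r = sigmoid(hx.dot(W_r))                      # reset gate
    candidate = np.tanh(np.concatenate([r * h, x], axis=1).dot(W))
    return (1.0 - z) * h + z * candidate


rng = np.random.RandomState(0)
batch, input_size, hidden_size = 2, 3, 4
shape = (hidden_size + input_size, hidden_size)
W_z, W_r, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))
h_new = gru_step(rng.normal(size=(batch, input_size)),
                 np.zeros((batch, hidden_size)), W_z, W_r, W)
```

With all-zero weights both gates equal 0.5 and the candidate is 0, so the step simply halves the previous state, which makes a handy sanity check.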

## Model

In [5]:
class RnnLm(object):
    """Rnn language model class.

    Can be character-level or word-level.
    You can pass RNN hyperparameters to constructor.
    """

    log = None

    def __init__(self, filename=None, hidden_size=512, num_layers=3, batch_size=20,
                 num_epochs=200, num_tokens=20, charlevel=False, verbose=False, cell_type="lstm"):
        """Init hyperparameters or load existing model."""
        self.vocab = Vocabulary(charlevel=charlevel)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.num_epochs = num_epochs
        self.num_tokens = num_tokens
        self.charlevel = charlevel
        self.verbose = verbose
        self.cell_type = cell_type

    def print_(self, str_):
        if self.verbose:
            print(str_)

    def fit_data(self, filename):
        """Fit data to vocabulary."""
        self.print_("Loading data...")
        with open(filename) as text_file:
            lines = text_file.readlines()
        self.print_("Preparing data...")
        self.vocab = Vocabulary(charlevel=self.charlevel)
        for line in lines:
            if len(line) < 1000:
                if self.charlevel:
                    self.vocab.add_words(list(line))
                else:
                    # split into words; iterating over the string would add characters
                    self.vocab.add_words(re.split(r'\s+', line))
        self.numerical_lines = []
        for line in lines:
            self.numerical_lines.append(self.vocab(line))
        self.print_("Prepared data. Vocabulary size: {}".format(len(self.vocab)))

    def construct_graph(self):
        """Construct tensorflow graph."""
        num_items = len(self.vocab)
        self.print_("Building graph...")
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.inputs = tf.placeholder(tf.int64, shape=(self.batch_size, None), name='inputs')
            self.labels = tf.placeholder(tf.int64, shape=(self.batch_size, None), name='labels')
            self.lengths = tf.placeholder(tf.int64, shape=(self.batch_size), name='lengths')
            self.hidden_states = []
            if self.cell_type == "lstm":
                hidden_state_size = 2*self.hidden_size
            elif self.cell_type == "gru":
                hidden_state_size = self.hidden_size
            else:
                raise ValueError("Unknown cell type!")
            for i in range(self.num_layers):
                self.hidden_states.append(tf.placeholder(tf.float32, shape=(1, hidden_state_size),
                                                         name='hidden_state{}'.format(i)))
            self.hidden_state = tuple(self.hidden_states)
            embeddings = tf.Variable(tf.random_uniform([num_items, self.hidden_size], -1.0, 1.0))
            # weights and bias for fully connected layer before softmax
            W = tf.Variable(tf.truncated_normal([self.hidden_size, num_items],
                                                stddev=np.sqrt(2.0 / self.hidden_size)))
            # one bias per output token (a scalar bias would not change the softmax)
            b = tf.Variable(tf.zeros([num_items]))
            if self.cell_type == "lstm":
                cell = rnn_cell.LSTMCell(self.hidden_size, state_is_tuple=False)
            elif self.cell_type == "gru":
                cell = CustomGRUCell(self.hidden_size)
            else:
                raise ValueError("Unknown cell type!")
            rnn_multicell = rnn_cell.MultiRNNCell([cell] * self.num_layers)
            embedding = tf.nn.embedding_lookup(embeddings, self.inputs)
            self.token = tf.placeholder(tf.int64, shape=(1, 1), name="token")
            self.token_length = tf.placeholder(tf.int64, shape=(1), name="token_length")
            token_embedding = tf.nn.embedding_lookup(embeddings, self.token)
            with tf.variable_scope("dynamic_rnn") as scope:
                outputs, _ = tf.nn.dynamic_rnn(rnn_multicell, embedding, self.lengths,
                                               dtype=tf.float32, swap_memory=True)
                scope.reuse_variables()
                token_outputs, self.hidden_state = tf.nn.dynamic_rnn(rnn_multicell, token_embedding,
                                                                     self.token_length,
                                                                     initial_state=self.hidden_state,
                                                                     swap_memory=True)
            token_outputs = tf.reshape(token_outputs, [-1, self.hidden_size])
            token_logits = tf.matmul(token_outputs, W) + b
            self.token_probs = tf.nn.softmax(token_logits)
            outputs = tf.reshape(outputs, [-1, self.hidden_size])
            logits = tf.matmul(outputs, W) + b
            logits = tf.reshape(logits, [self.batch_size, -1, num_items])
            self.loss = (tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits,
                                                                                      self.labels) *
                                       tf.to_float(tf.not_equal(self.labels, 0))) /
                         tf.to_float(tf.reduce_sum(tf.to_int32(tf.not_equal(self.labels, 0)))))
            self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
        self.print_("Graph construction done.")
        self.print_("Starting training...")

    def run_session(self):
        """Run tensorflow session with already constructed self.graph."""
        # self.session = tf.Session(graph=self.graph)
        with tf.Session(graph=self.graph) as session:
            tf.initialize_all_variables().run()
            for epoch in range(self.num_epochs):
                batch_number = 0
                batch_costs = []
                print("Starting epoch {}...".format(epoch))
                for (inputs_batch, lengths_batch, labels_batch) in get_batches(self.numerical_lines,
                                                                               self.batch_size):
                    if len(inputs_batch) < self.batch_size:
                        break
                    feedDict = {self.inputs: inputs_batch, self.lengths: lengths_batch,
                                self.labels: labels_batch}
                    self.print_("\tStarting batch {}...".format(batch_number))
                    start = datetime.now()
                    batch_cost, _ = session.run([self.loss, self.optimizer], feed_dict=feedDict)
                    batch_costs.append(batch_cost)
                    self.print_("\tDone batch {} in {} seconds.".format(batch_number,
                                                                        datetime.now()-start))
                    self.print_("\tCost: {}".format(batch_cost))
                    batch_number += 1
                if self.charlevel:
                    self.sample(" ", session=session)
                else:
                    self.sample("the", session=session)
                print("Done. Average cost: {}".format(np.average(batch_costs)))

    def train(self, filename):
        """Train rnn, using data in provided file."""
        self.fit_data(filename)
        self.construct_graph()
        self.run_session()

    def sample(self, token, num_tokens=None, session=None):
        """Sample tokens, starting with token."""
        if session is None:
            raise ValueError("You should run train first.")
        if num_tokens is None:
            num_tokens = self.num_tokens
        print("Sampling...")
        start_word = token
        words = [start_word]
        if self.cell_type == "lstm":
            hidden_state_size = 2*self.hidden_size # because lstm has 2 hidden states
        elif self.cell_type == "gru":
            hidden_state_size = self.hidden_size
        else:
            raise ValueError("Unknown cell type!")
        hidden_states_ = tuple([np.zeros(shape=(1, hidden_state_size))] *
                               self.num_layers)

        for _ in range(num_tokens):
            encoded_token = self.vocab(words[-1])
            if len(encoded_token.shape) > 1:
                new_token = [[encoded_token[0]]]
            else:
                new_token = [encoded_token]
            feedDict = {self.token: new_token, self.token_length: [1]}
            # feed the previous state of every layer back into the graph
            for layer in range(self.num_layers):
                feedDict[self.hidden_states[layer]] = hidden_states_[layer]
            probabilities, hidden_states_ = session.run([self.token_probs,
                                                         self.hidden_state],
                                                        feed_dict=feedDict)
            next_word = np.random.choice(self.vocab.index2word, 1,
                                         p=probabilities[0])[0]
            words.append(next_word)
        if self.charlevel:
            print("".join(words))
        else:
            print(" ".join(words))
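The loss in `construct_graph` averages the per-token cross-entropy only over positions whose label is non-zero, so padding does not contribute. A NumPy sketch of the same computation on made-up logits; `masked_cross_entropy` is a hypothetical helper, not part of the model class.

```python
import numpy as np


def masked_cross_entropy(logits, labels, pad_id=0):
    """Mean cross-entropy over positions where labels != pad_id.

    logits: (batch, time, vocab) floats; labels: (batch, time) ints.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    batch_idx, time_idx = np.indices(labels.shape)
    token_loss = -log_probs[batch_idx, time_idx, labels]
    mask = (labels != pad_id).astype(np.float64)
    return (token_loss * mask).sum() / mask.sum()


logits = np.zeros((1, 3, 4))          # uniform predictions over 4 tokens
labels = np.array([[2, 1, 0]])        # last position is padding
loss = masked_cross_entropy(logits, labels)
```

Note that index 0 doubles as the unknown-token index in `Vocabulary`, so unknown tokens are masked out together with the padding.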

In [6]:
hidden_size = 200
num_layers = 3
batch_size = 500
num_epochs = 200

We will use the [Reuters](http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz) dataset. Let's parse the texts:

In [15]:
dirname = "reuters21578"
dest_file = "reuters_parsed.txt"
files = []
for filename in os.listdir(dirname):
    if filename[-4:] == ".sgm" and filename != "reut2-017.sgm": # for some reason beautifulsoup crashes on that file.
        # print(filename)
        files.append(os.path.join(dirname, filename))
prepare_data(dest_file, *sorted(files))

Let's train a character-level model with LSTM, optimizing cross-entropy with AdamOptimizer with default parameters (the word-level model for some reason produced exploding gradients, and with AdaDelta the cost did not decrease):

In [34]:
model = RnnLm(hidden_size=hidden_size, num_layers=num_layers, batch_size=batch_size, num_epochs=num_epochs,
              num_tokens=100, charlevel=True)
model.train("reuters_parsed.txt")



Starting epoch 0...
Sampling...
    14..8 billion clodued.

 The said.




 Shode 7.16 vs discruce fustrees














. Torne




Done. Average cost: 2.197575807571411
Starting epoch 1...
Sampling...
    NMT       0.5 pct 802-23,278 vs 27,952,416-vs 50





 10.7 mln

 its








8 share reasured



Done. Average cost: 1.5354686975479126
Starting epoch 2...
Sampling...
    SOOitour ReIehverson explaring on were are is not planned








 Flas






















4
.
Done. Average cost: 1.3829985857009888
Starting epoch 3...
Sampling...
    Shr 72 cts vs prote line rall for












s











/10,000













8



6


 NOA  2.
Done. Average cost: 1.309670329093933
Starting epoch 4...
Sampling...
    It said the share requirements to range is
 a troud-and Apde




















8
8
.78
 mln




Done. Average cost: 1.2629112005233765
Starting epoch 5...
Sampling...
    Nine mths
 replaced 1986 electric shares than a Health, a decline,













8 14.59




 3.16
Don

The output looks like reasonable text, but there are a great many line breaks, because the training texts contain many of them.
One could try to get rid of them by joining each news story with spaces, but then the input lines would become very long.
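One way to try this idea is to collapse every run of whitespace inside a story into a single space before writing it out. A minimal sketch; `collapse_whitespace` is a hypothetical helper that is not part of the notebook, and the sample story is made up.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


story = "Shr 72 cts vs 60 cts\n    Net 1.2 mln vs 1.0 mln\n\n"
flat = collapse_whitespace(story)
```

The trade-off is that each story then becomes one long line, which increases the padded width of every batch.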

In [7]:
hidden_size = 200
num_layers = 3
batch_size = 500
num_epochs = 80

Now let's train the model with the custom GRU.

In [7]:
model = RnnLm(hidden_size=hidden_size, num_layers=num_layers, batch_size=batch_size, num_epochs=num_epochs,
              num_tokens=100, charlevel=True, cell_type="gru")
model.train("reuters_parsed.txt")

Starting epoch 0...
Sampling...
 Bruzwill be rate of the its California countryet ships
 to










 
alk




I












Yit has
Done. Average cost: 1.6597518920898438
Starting epoch 1...
Sampling...
 Reuter

be sood apparent to affected
 further







ge






elve







 Putum








-I computer
Done. Average cost: 1.2289400100708008
Starting epoch 2...
Sampling...
    Revs 345,000 vs 7,200,000










le











ha,

hal







































Done. Average cost: 1.1550521850585938
Starting epoch 3...
Sampling...
    "I'm value a 56 pct be about 280 billion gain of 40 pct

yehtly











wo

earn






hew
 "i
Done. Average cost: 1.1150093078613281
Starting epoch 4...
Sampling...
    The spokesman said this had a coffee speculation at a crow by




s


well

t

est
.



-based



Done. Average cost: 1.088284969329834
Starting epoch 5...
Sampling...
    Nine Mths
17
     LDOR
   World Inc said its ability




utio








fall,





 Taukmant,

has

Do

KeyboardInterrupt: 

The RNN with my GRU learns faster than the RNN with LSTM: its average cost is lower after every epoch shown (for example 1.66 vs 2.20 after epoch 0), probably because the model is simpler. It is also amusing that the char-level model learned to pick out the "smth vs smth" patterns well, often even with the correct units on both sides of "vs" :) I also tested on a Russian dataset with a narrower topic, and the results there were better. They were also funnier, because a char-level model can easily change word endings and so on.