# Review Generator

- Train a predictive model on the text of all reviews
- Generate a few random reviews

### References
- [Text Generation With LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)
- [Keras example - lstm_text_generation.py](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py)
- [textgenrnn - Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.](https://github.com/minimaxir/textgenrnn)

In [1]:
!pip install -q keras unidecode

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint, LambdaCallback
from keras.utils import np_utils

from collections import defaultdict

from IPython.display import display, Markdown, clear_output

import numpy as np
import random
import itertools

#!pip install unidecode
from unidecode import unidecode

from util import dataset, gpu
gpu.keras_share_gpu()

Using TensorFlow backend.


In [2]:
def _normalize_text(text, vocabulary=None):
    text = unidecode(text)  # Replace Unicode chars with their closest ASCII representation
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    
    if vocabulary is not None:
        text = ''.join([
            char
            for char in text
            if char in vocabulary
        ])
    return text

In [3]:
SYMBOL_PAD = '∅'
SYMBOL_END = '␃'

vocabulary = ''
vocabulary += SYMBOL_PAD
vocabulary += SYMBOL_END
vocabulary += 'abcdefghijklmnopqrstuvwxyz'
vocabulary += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
vocabulary += '0123456789'
vocabulary += '\n\t '
vocabulary += '.,:;?!\'"`/|\\()[]{}<>~@#$^%&*-+_='

char_to_int = {c: i for i, c in enumerate(vocabulary)}
int_to_char = {i: c for i, c in enumerate(vocabulary)}
vocabulary = set(vocabulary)

', '.join(int_to_char.values())

'∅, ␃, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, \n, \t,  , ., ,, :, ;, ?, !, \', ", `, /, |, \\, (, ), [, ], {, }, <, >, ~, @, #, $, ^, %, &, *, -, +, _, ='
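
As an illustration of how these lookup tables are used later in the notebook, the following minimal sketch encodes a string into integer indices, one-hot encodes it by hand, and decodes it back. It assumes a tiny hypothetical vocabulary (`demo_vocabulary`) instead of the full one above, and uses plain lists as a stand-in for `np_utils.to_categorical`.

```python
# Hypothetical, reduced vocabulary used only for this illustration.
demo_vocabulary = '∅␃abc '
demo_char_to_int = {c: i for i, c in enumerate(demo_vocabulary)}
demo_int_to_char = {i: c for i, c in enumerate(demo_vocabulary)}


def encode(text):
    """Map each character to its integer index."""
    return [demo_char_to_int[c] for c in text]


def one_hot(indices, num_classes):
    """Plain-Python stand-in for a one-hot (categorical) encoding."""
    return [[1 if i == index else 0 for i in range(num_classes)]
            for index in indices]


def decode(indices):
    """Map integer indices back to characters."""
    return ''.join(demo_int_to_char[i] for i in indices)


indices = encode('a cab')
vectors = one_hot(indices, len(demo_vocabulary))
roundtrip = decode(indices)
print(indices)
print(roundtrip)
```

Each character becomes a vector with a single 1, so a context of `context_size` characters becomes a `(context_size, len(vocabulary))` matrix, which is the input shape the model expects.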

In [4]:
vocabulary_prob = defaultdict(int)
num_chars = 0
for review in dataset.read('review'):
    for char in _normalize_text(review['text']):
        vocabulary_prob[char] += 1
        num_chars += 1
vocabulary_prob

KeyboardInterrupt: 

In [5]:
vocabulary_prob_sorted = {
    k: v*100/num_chars
    for v, k in sorted([(v,k) for k, v in vocabulary_prob.items()], reverse=True)
}
vocabulary_prob_sorted

{'\n': 0.3167662231700803,
 ' ': 18.496309239687296,
 '!': 0.24270328621570036,
 '"': 0.0658344230868113,
 '#': 0.001655465559732789,
 '$': 0.036132408444559276,
 '%': 0.0038870196683405624,
 '&': 0.01704574057653688,
 "'": 0.3228658605082416,
 '(': 0.06755258929941653,
 ')': 0.07391775759034641,
 '*': 0.006360539383670845,
 '+': 0.003991380486544358,
 ',': 0.6182759888241178,
 '-': 0.11513228539558004,
 '.': 1.3212534059131404,
 '/': 0.026925511906330073,
 '0': 0.07466385327855339,
 '1': 0.061293044255942904,
 '2': 0.0508502294795502,
 '3': 0.03466714889219792,
 '4': 0.026039707380850277,
 '5': 0.0453628703288345,
 '6': 0.013668321516441465,
 '7': 0.011145987870015858,
 '8': 0.011397632101007268,
 '9': 0.013623294873103537,
 ':': 0.031549790258110354,
 ';': 0.009580154787208103,
 '<': 5.8913365115045875e-06,
 '=': 0.0013731022169313906,
 '>': 5.8913365115045875e-06,
 '?': 0.036788871655841215,
 '@': 0.0007204262934068467,
 'A': 0.15048745823037155,
 'B': 0.0903293378723949,
 'C': 0.09

In [6]:
for char, prob in vocabulary_prob_sorted.items():
    if char not in vocabulary:
        print('Dataset has character missing from vocabulary:', (char, prob))

Dataset has character missing from vocabulary: ('\x7f', 4.2080975082175623e-07)


In [7]:
def resize_context(context, size):
    context = list(context)
    if len(context) > size:
        context = context[-size:]
    elif len(context) < size:
        context = [SYMBOL_PAD] * (size-len(context)) + context
    
    return context

def gen_trainning_data(context_size=50, batch_size=50):
    def random_text_slice(text):
        text = _normalize_text(text, vocabulary=vocabulary)
        offset = random.randint(0, len(text))
        context = resize_context(text[:offset], size=context_size)
        if offset < len(text):
            next_char = text[offset]
        else:
            next_char = SYMBOL_END

        return context, next_char
                
    data = dataset.read('review', 
            repeat=True, 
            map=lambda review: random_text_slice(review['text']),
            batch_size=batch_size
    )
    #return data
    for batch in data:
        #print(batch)
        context = [[char_to_int[z] for z in x[0]] for x in batch]
        next_char = [char_to_int[x[1]] for x in batch]
        
        context = np_utils.to_categorical(context, len(vocabulary))
        next_char = np_utils.to_categorical(next_char, len(vocabulary))
        yield context, next_char

#next(gen_trainning_data(context_size=3, batch_size=2))
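
To make the shape of a training example concrete, here is a small self-contained sketch of the slicing logic. The names `demo_resize_context` and `demo_slice` are hypothetical re-implementations for illustration only; the split point is passed in explicitly instead of being drawn with `random.randint`, so the result is deterministic.

```python
PAD = '∅'  # same padding symbol as SYMBOL_PAD above
END = '␃'  # same end-of-text symbol as SYMBOL_END above


def demo_resize_context(context, size):
    """Keep the last `size` characters, left-padding with PAD when too short."""
    context = list(context)[-size:] if size > 0 else []
    return [PAD] * (size - len(context)) + context


def demo_slice(text, offset, context_size):
    """Return the (context, next_char) pair for a given split point."""
    context = demo_resize_context(text[:offset], context_size)
    next_char = text[offset] if offset < len(text) else END
    return ''.join(context), next_char


text = 'Great food'
for offset in (0, 3, len(text)):
    print(offset, demo_slice(text, offset, context_size=5))
```

An offset of 0 yields an all-padding context that predicts the first character, and an offset equal to the text length predicts the end symbol, which is how the model learns where reviews start and stop.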

In [8]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def gen_texts(prefixes=['']*10, max_len=100, temperature=0.1):
    context_size = model.input_shape[1]
    result = []
    while len(prefixes) != 0:
        # Encode each prefix as a fixed-size window of character indices
        x = np.array([
            [char_to_int[char] for char in resize_context(prefix, context_size)]
            for prefix in prefixes
        ])
        x = np_utils.to_categorical(x, len(vocabulary))
        
        predictions = model.predict(x, verbose=0)
        predictions = [int_to_char[sample(p, temperature=temperature)] for p in predictions]
        
        result += [
            prefix + ('…' if next_char != SYMBOL_END else '')
            for prefix, next_char in zip(prefixes, predictions)
            if next_char == SYMBOL_END or len(prefix) >= max_len
        ]
        prefixes = [
            prefix + next_char
            for prefix, next_char in zip(prefixes, predictions)
            if next_char != SYMBOL_END and len(prefix) < max_len
        ]
    return result
        
def gen_text(prefix='', max_len=100, temperature=0.1):
    return gen_texts(prefixes=[prefix], max_len=max_len, temperature=temperature)[0]
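
The effect of the `temperature` argument can be seen in isolation with a short sketch. It applies the same log, divide, exponentiate, and renormalize steps as `sample` above, but to a hypothetical three-character distribution and without the random draw, so only the reshaped probabilities are shown.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution as p_i ** (1 / temperature), then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]  # hypothetical softmax output over three characters
for t in (1.0, 0.3, 0.1):
    print(t, [round(p, 4) for p in apply_temperature(probs, t)])
```

A temperature of 1.0 leaves the distribution unchanged, while low values concentrate almost all the mass on the most likely character. That is consistent with the repetitive samples printed during training, which use the default `temperature=0.1`.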

In [12]:
def build_model(units=(512, 512, 512), dropout=(0.2, 0.2, 0.2), context_size=50):

    model = Sequential()
    for i, (lstm_units, lstm_dropout) in enumerate(zip(units, dropout)):
        extra_args = {}
        if i == 0:
            extra_args['input_shape'] = (context_size, len(vocabulary))
        extra_args['return_sequences'] = i < len(units) - 1


        model.add(LSTM(lstm_units, **extra_args))
        if lstm_dropout != 0:
            model.add(Dropout(lstm_dropout))
            
    model.add(Dense(len(vocabulary), activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = build_model()
model.load_weights('review-generator-weights-816-1.2344.h5')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 50, 512)           1253376   
_________________________________________________________________
dropout_7 (Dropout)          (None, 50, 512)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 50, 512)           2099200   
_________________________________________________________________
dropout_8 (Dropout)          (None, 50, 512)           0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_9 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 99)                50787     
Total para

In [13]:
checkpoint_callbacks = ModelCheckpoint('review-generator-weights-{epoch:02d}-{loss:.4f}.h5', monitor='loss', verbose=1, save_best_only=True, mode='min')
eval_callback = LambdaCallback(on_epoch_end=lambda epoch, logs: print('\n ---x---\n'.join([''] + gen_texts(['']*5, max_len=50) + [''])))

# fit the model
model.fit_generator(
    gen_trainning_data(batch_size=100), 
    epochs=1000, 
    steps_per_epoch=100,
    callbacks=[
        checkpoint_callbacks,
        eval_callback
    ])

Epoch 1/1000

Epoch 00001: loss improved from inf to 1.33023, saving model to review-generator-weights-01-1.3302.h5

 ---x---
I was a great service and the service is always so…
 ---x---
The service is always a great place to go back aga…
 ---x---
The service is always a great service and the serv…
 ---x---
The service is always a great place to go back aga…
 ---x---
I was a great place to go to the staff and the ser…
 ---x---

Epoch 2/1000

Epoch 00002: loss improved from 1.33023 to 1.29833, saving model to review-generator-weights-02-1.2983.h5

 ---x---
I was a great service and I will be back again.
 ---x---
I was a great place to go back and the staff was g…
 ---x---
The service was good and the service was good and …
 ---x---
The service is always friendly and the food was go…
 ---x---
I had a great service that I was a great place to …
 ---x---

Epoch 3/1000
  5/100 [>.............................] - ETA: 16s - loss: 1.2925

KeyboardInterrupt: 

In [14]:
#Generate a few random review-like texts
for x in gen_texts(temperature=0.3, max_len=500):
    print(x)
    print('-------x--------')



I have been the food is the best service and I will be back.
-------x--------
We will be back to this place and the salad was always a great place for a fantastic service. I will be back again.
-------x--------
This place is a service and the staff was good.  The salad was good and the food was good.  The price was good for a very salad to the staff and the menu is a good place to go back.
-------x--------
I had the counter and the staff is always a good deal of the price and the staff was a show on the staff and it was delicious. The chicken was on the staff and we were the best place to go back for the staff and the staff was so good and the service was good and the food was good.
-------x--------
All the staff was okay and great and he was a little spot of beef and the company was a little considerable. We were so much looking for a staff and was good and the food was a bit of the best service in the pork that is a great selection of pork to the hotel in the food and the service is 

In [15]:
#Autocomplete animation

prefix_len = 80
for autocomplete_text in dataset.read('review', map=lambda review: _normalize_text(review['text'], vocabulary=vocabulary)):
    prefix = ""
    for char in [""] + list(autocomplete_text):
        prefix += char
        pretty_prefix = '[empty]' if prefix == '' else prefix
        if len(pretty_prefix) > prefix_len: 
            pretty_prefix = '…' + pretty_prefix[len(pretty_prefix)-prefix_len+1:]
        completions = gen_texts(prefixes=[prefix]*5, max_len=len(prefix)+30, temperature=0.3)
        clear_output()
        display(
            Markdown('### ' + pretty_prefix),
            Markdown('\n'.join(' - ' + ('…' if prefix else '') + x[len(prefix):] for x in (completions))))


### I love this plac

 - …e to go back and was a bit so …
 - …e for the staff and the chicke…
 - …e for a lot of the considerabl…
 - …e.  I will be back to this pla…
 - …e. The service was good and th…

KeyboardInterrupt: 

-----------





## Performance Considerations

This implementation was written with simplicity and readability in mind, and it is terribly inefficient.

### Training
To begin with, for each review a single 50-character slice is extracted and used to train a single output character.

In other words, this code evaluates an _extremely_ complex function, with about 5.5 million parameters, only to discard the intermediate results and use just one out of hundreds of characters in the loss computation.

A better solution would take the structure of the LSTM into account: it is possible to predict the next character at every position of the sequence and include those intermediate predictions in the loss, effectively training on every character of the sequence simultaneously.
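
A minimal sketch of the data side of that idea: instead of one `(context, next_char)` pair per window, the target is the window shifted by one position, so every position contributes to the loss. The helper name `next_char_pairs` is hypothetical. On the model side this would probably mean setting `return_sequences=True` on the last LSTM as well and wrapping the output `Dense` layer in `TimeDistributed`; that part is an assumption about how `build_model` could be adapted and is not shown.

```python
def next_char_pairs(window):
    """Split a text window into per-position input and target characters.

    The input is the window without its last character and the target is the
    same window shifted left by one, so position i is trained to predict
    character i + 1.
    """
    inputs = window[:-1]
    targets = window[1:]
    return inputs, targets


window = 'The food was good.'
inputs, targets = next_char_pairs(window)
for i in range(3):
    print(repr(inputs[:i + 1]), '->', repr(targets[i]))
print(len(targets), 'training targets from one window')
```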

### Prediction
Prediction is also done naively. For each character to be predicted, the full window of the previous 50 characters is fed through the network, causing a huge amount of redundant computation.

A more efficient version would preserve the LSTM state and compute only the prediction for the next character.
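
The saving can be illustrated with a toy recurrent cell. This is only a sketch: `step` is a hypothetical scalar `tanh` cell standing in for an LSTM step, and the inputs are made-up numbers rather than encoded characters. It compares re-processing the whole prefix for every new character against carrying the state forward.

```python
import math


def step(state, x, w_state=0.5, w_input=0.8):
    """One step of a toy scalar recurrent cell (a stand-in for an LSTM step)."""
    return math.tanh(w_state * state + w_input * x)


def run_from_scratch(xs):
    """Re-process an entire prefix starting from a zero state."""
    state, steps = 0.0, 0
    for x in xs:
        state = step(state, x)
        steps += 1
    return state, steps


xs = [0.1, -0.4, 0.7, 0.2, -0.9]  # hypothetical encoded characters

# Naive: recompute the full prefix for every new character.
naive_states, naive_steps = [], 0
for n in range(1, len(xs) + 1):
    state, steps = run_from_scratch(xs[:n])
    naive_states.append(state)
    naive_steps += steps

# Stateful: carry the state forward and process only the new character.
stateful_states, stateful_steps, state = [], 0, 0.0
for x in xs:
    state = step(state, x)
    stateful_states.append(state)
    stateful_steps += 1

print(naive_steps, stateful_steps)
```

Both loops reach the same states, but the naive one performs a number of cell evaluations that grows quadratically with the text length. Note that the notebook's model sees a fixed, padded 50-character window, so a stateful variant would not be numerically identical to it once the text exceeds that window. In Keras, such a variant would likely rely on the `stateful=True` option of the LSTM layer, which is an assumption about a possible implementation, not something this notebook does.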





## Uses

Generating text that looks like a review is fun, but not terribly useful.

A more interesting use is typing assistance, especially on mobile phones. This reminded me of [Dasher](http://www.inference.org.uk/dasher/) ([Demo](https://www.youtube.com/watch?v=nr3s4613DX8)), a project that has existed for many years and certainly does not use neural networks, but that could perhaps make a good Android keyboard.