## Text Generation

The goal of this session is to implement a vocabulary-based (word-level) text generation model. This requires training a set of embeddings.

In [1]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding


Using TensorFlow backend.


Generate a sequence of words given a trained model, a tokenizer, the input length the model expects, a seed text, and the number of words to generate.

In [0]:
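def generate_seq_2(model, tokenizer, max_length, ini_text, numwords):
  # Greedy generation: append the most probable next word, numwords times.
  in_text = ini_text
  for _ in range(numwords):
    print(in_text)
    # Encode the text generated so far as a list of word indices.
    txt2seq = tokenizer.texts_to_sequences([in_text])[0]
    print(txt2seq)
    # Keep only the last max_length indices (shorter inputs are left-padded with 0).
    padseq = pad_sequences([txt2seq], maxlen=max_length, padding='pre')
    print(padseq)
    # Index of the most probable next word. Equivalent to the older
    # model.predict_classes(padseq), which recent Keras releases no longer provide.
    pred = model.predict(padseq, verbose=0).argmax(axis=-1)[0]
    print(pred)
    # Map the predicted index back to its word and extend the text.
    pred2word = tokenizer.sequences_to_texts([[pred]])[0]
    in_text += ' ' + pred2word
    print("\n")
  return in_text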
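
As a rough illustration of what the `pad_sequences` call does to a single sequence, here is a pure-Python sketch. `pad_pre` is a hypothetical helper, not part of Keras; it assumes the Keras defaults of truncating from the front and padding with 0.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences on one sequence ('pre' padding and truncation)."""
    # Keep only the last `maxlen` items (truncation from the front).
    kept = list(seq[-maxlen:])
    # Left-pad with `value` if the sequence is shorter than `maxlen`.
    return [value] * (maxlen - len(kept)) + kept

print(pad_pre([44, 198, 45], 2))  # longer than maxlen: the oldest index is dropped
print(pad_pre([45], 2))           # shorter than maxlen: padded on the left
```

With a two-word context, only the last two words of the growing text ever reach the model, so earlier words no longer influence the prediction.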

Load the data; the text can also just be pasted in directly. In my case, the 2019 SOTUA (State of the Union Address).

In [0]:
data = ''' '''

Create the tokenizer from the text(s) and convert the text into sequences of word indices.

In [4]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)


Vocabulary Size: 1649
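
To make the `+ 1` in `vocab_size` concrete, the sketch below mimics, in a simplified way, how a word index can be built: words are ranked by frequency and numbered from 1, leaving index 0 free for padding. `build_word_index` is a hypothetical helper; the real `Tokenizer` also strips punctuation and has further options that are ignored here.

```python
from collections import Counter

def build_word_index(text):
    """Hypothetical, simplified word index: the most frequent word gets index 1."""
    counts = Counter(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index("the cat saw the dog")
print(word_index)
# Index 0 is never assigned to a word, hence the extra slot.
print(len(word_index) + 1)
```

The output layer needs one unit per possible index, including the unused index 0, which is why the vocabulary size is one more than the number of distinct words.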


Turn the stream of indices into fixed-size sequences and split off the next word to predict.

In [5]:
# encode n_words -> 1 word

n_words = 2

sequences = list()
for i in range(n_words, len(encoded)):
	sequence = encoded[i-n_words:i+1]
	sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)


Total Sequences: 5554
Max Sequence Length: 3


In [6]:
print(sequences)

[[651 652 653]
 [652 653 654]
 [653 654 119]
 ...
 [ 13  27  66]
 [ 27  66  25]
 [ 66  25  13]]
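
The windowing step can be sketched without Keras on a short list of indices. `make_windows` is a hypothetical helper that reproduces the loop above: each window holds `n_words` context indices followed by the index to predict.

```python
def make_windows(encoded, n_words):
    """Hypothetical helper: overlapping windows of n_words context indices plus one target."""
    return [encoded[i - n_words:i + 1] for i in range(n_words, len(encoded))]

windows = make_windows([651, 652, 653, 654, 119], n_words=2)
print(windows)

# Split each window into its context (all but the last index) and its target (the last one).
X = [w[:-1] for w in windows]
y = [w[-1] for w in windows]
print(X, y)
```

A text of N indices yields N - `n_words` windows, and because every window already has the same length, the padding step in the cell above leaves them unchanged.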


Create the model: it learns embeddings for the input sequences, passes them through an LSTM, and predicts the next word.

In [7]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(250))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=100, verbose=2)
# evaluate model





Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2, 10)             16490     
_________________________________________________________________
lstm_1 (LSTM)                (None, 250)               261000    
_________________________________________________________________
dense_1 (Dense)              (None, 1649)              413899    
Total params: 691,389
Trainable params: 691,389
Non-trainable params: 0
_________________________________________________________________
None


Epoch 1/100
 - 11s - loss: 6.8377 - acc: 0.0452
Epoch 2/100
 - 2s - loss: 6.3134 - acc: 0.0484
Epoch 3/100
 - 2s - loss: 6.2292 - acc: 0.0484
Epoch 4/100
 - 2s - loss: 6.1482 - acc: 0.0486
Epoch 5/100
 - 2s - loss: 6.0456 - acc: 0.0484
Epoch 6/100
 - 2s - loss: 5.89

<keras.callbacks.History at 0x7f5edae47780>
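
The parameter counts in the model summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, emb_dim, lstm_units = 1649, 10, 250
counts = [
    embedding_params(vocab_size, emb_dim),
    lstm_params(emb_dim, lstm_units),
    dense_params(lstm_units, vocab_size),
]
print(counts, sum(counts))
```

The largest share of the parameters sits in the output layer, since it has to score every one of the 1,649 vocabulary entries.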

Generate the sequence

In [8]:
print(generate_seq_2(model, tokenizer, max_length-1, 'My fellow Americans', 50))


My fellow Americans
[44, 198, 45]
[[198  45]]
6


My fellow Americans we
[44, 198, 45, 6]
[[45  6]]
352


My fellow Americans we meet
[44, 198, 45, 6, 352]
[[  6 352]]
42


My fellow Americans we meet tonight
[44, 198, 45, 6, 352, 42]
[[352  42]]
46


My fellow Americans we meet tonight at
[44, 198, 45, 6, 352, 42, 46]
[[42 46]]
8


My fellow Americans we meet tonight at a
[44, 198, 45, 6, 352, 42, 46, 8]
[[46  8]]
255


My fellow Americans we meet tonight at a moment
[44, 198, 45, 6, 352, 42, 46, 8, 255]
[[  8 255]]
4


My fellow Americans we meet tonight at a moment of
[44, 198, 45, 6, 352, 42, 46, 8, 255, 4]
[[255   4]]
656


My fellow Americans we meet tonight at a moment of unlimited
[44, 198, 45, 6, 352, 42, 46, 8, 255, 4, 656]
[[  4 656]]
353


My fellow Americans we meet tonight at a moment of unlimited potential
[44, 198, 45, 6, 352, 42, 46, 8, 255, 4, 656, 353]
[[656 353]]
22


My fellow Americans we meet tonight at a moment of unlimited potential as
[44, 198, 45, 6, 352, 42,