## Bag of Words

First, we will build a simple Bag of Words model.

In [1]:
def BagofWords(text):
    # Map each lowercased word to the number of times it appears
    bag = {}

    for word in text.split():
        word = word.lower()
        # Start at 0 for unseen words so every occurrence is counted once
        bag[word] = bag.get(word, 0) + 1

    return bag

text="Las palabras se repiten se repiten las palabras"
BagofWords(text)

{'las': 2, 'palabras': 2, 'se': 2, 'repiten': 2}
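
As a cross-check, the same counts can be obtained with `collections.Counter` from the standard library. This is only a sketch; the name `bag_of_words` is an illustrative choice and not part of the original notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return dict(Counter(text.lower().split()))

counts = bag_of_words("Las palabras se repiten se repiten las palabras")
print(counts)
```

A word that appears three or more times is counted correctly as well, which makes for a quick sanity check of any hand-written version.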

## Sentiment Analysis

We will use the IMDB Movie Review Dataset.

In [2]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
import os
import numpy as np

In [3]:
VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
                                                    num_words=VOCAB_SIZE)



In [4]:
len(train_data[0]), len(train_data[5])

(218, 43)

In [5]:
test_labels.shape

(25000,)

As we can see, not all inputs have the same length. Neural networks do not accept inputs of varying shape, so the data has to be adjusted to a fixed length.
 
 - If a review is longer than 250 words, the extra words are cut off.
 - If it is shorter, 0s are added until it reaches 250.
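
The two rules above can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and only imitates what `pad_sequences` does with its default settings, as far as I understand them: zeros are added at the front and overly long sequences lose their first tokens.

```python
def pad_or_truncate(tokens, maxlen, pad_value=0):
    """Hypothetical helper: force one token list to exactly `maxlen` items."""
    if len(tokens) >= maxlen:
        # Too long: keep only the last `maxlen` tokens (drop from the front)
        return tokens[len(tokens) - maxlen:]
    # Too short: prepend padding values until the list has `maxlen` items
    return [pad_value] * (maxlen - len(tokens)) + tokens

short = pad_or_truncate([5, 8, 2], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Padding at the front keeps the actual words next to the end of the sequence, which is the last thing the LSTM reads.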

In [6]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)

#### Build the model

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          2834688   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________


An embedding layer comes first so that each word index in an input is mapped to a learned vector that carries more meaning than the raw number. The 32 is the dimension of those vectors.
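
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are illustrative and not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(88584, 32)
lstm = lstm_params(32, 32)
dense = dense_params(32, 1)
print(emb, lstm, dense, emb + lstm + dense)  # 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding table, which is why the vocabulary size dominates the size of this model.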

#### Training

In [9]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])

history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Even though we train for 10 epochs, the validation metric does not change much. This suggests that something in the model needs tuning. For now we will leave it as it is.

In [10]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.49047818992614745, 0.84788]


Not bad for such a simple network.

#### Making predictions

In [11]:
word_index = imdb.get_word_index()

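def encode_text(text):
    # Split the sentence into lowercase word tokens
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    # Map each word to its number in word_index (0 if the word is unknown).
    # Note: imdb.load_data() shifts indices by index_from=3 by default, whereas
    # get_word_index() returns them unshifted; this lookup ignores that offset.
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    return sequence.pad_sequences([tokens], MAXLEN)[0]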

text = 'the movie was just amazing, so amazing'
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   1  17  13  4

In [12]:
# Now the reverse: we pass in numbers and get the sentence back

reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "
            
    return text[:-1]

print(decode_integers(encoded))

the movie was just amazing so amazing


In [13]:
# now let's make a prediction

def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, MAXLEN))  # the model expects a batch of shape (1, MAXLEN)
    pred[0] = encoded_text
    result = model.predict(pred)
    if result[0][0] >= 0.5:
        print("Review {:.2f}% positive".format(result[0][0] * 100))
    else:
        print("Review {:.2f}% negative".format(100 - result[0][0] * 100))

pos_review = "That movie was so cool! I really loved it and would watch it again because it was amazingly great"
predict(pos_review)

neg_review = "That movie totally sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(neg_review)

Review 89.44% positive
Review 63.33% negative


### Writing a play with an RNN

We will build an RNN that predicts the next character of a text, much like the predictive keyboard on a phone. We will train it on a corpus of Shakespeare's plays.

In [14]:
from tensorflow.keras.preprocessing import sequence
import tensorflow.keras as keras
import tensorflow as tf
import os
import numpy as np

In [15]:
path_to_file = tf.keras.utils.get_file("shakespeare.txt", 
                                      "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt")

In [16]:
path_to_file

'C:\\Users\\jaime\\.keras\\datasets\\shakespeare.txt'

#### Read the contents of the text


In [17]:
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print("Length of text: {} characters".format(len(text)))

Length of text: 1115394 characters


In [18]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



#### Encode the text

Unlike before, here we have no ready-made number assigned to each token, so we have to build the mapping ourselves.

In [19]:
vocab = sorted(set(text))

# build a mapping that assigns a number to each character
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
    return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

We have converted each character (not each word) of the text into a number.
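
To make the mapping concrete, here is the same idea applied to a tiny string. The names `sample`, `sample_vocab` and `sample_char2idx` are illustrative only and kept separate from the notebook's own variables.

```python
sample = "hello"

# Sorted unique characters form the vocabulary
sample_vocab = sorted(set(sample))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}

# Encode each character as its index, then decode back to text
encoded_sample = [sample_char2idx[ch] for ch in sample]
decoded_sample = "".join(sample_vocab[i] for i in encoded_sample)

print(sample_vocab)    # ['e', 'h', 'l', 'o']
print(encoded_sample)  # [1, 0, 2, 2, 3]
print(decoded_sample)  # hello
```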

In [20]:
print("text:", text[:13])
print("encoded:",text_as_int[:13])

text: First Citizen
encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


And the reverse:

In [21]:
def int_to_text(ints):
    try:
        # Tensors expose .numpy(); plain NumPy arrays do not
        ints = ints.numpy()
    except AttributeError:
        pass
    return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


#### Create training examples

The goal is to give the model a sequence of characters and have it return the ones that follow. To do that, we split the long text into shorter sequences that serve as training examples.

Each training example pairs an input sequence with a target that is the same sequence shifted by one character.

 - input: Hell | output: ello
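
A minimal sketch of that shift on an ordinary string; `split_chunk` is an illustrative name, and the word used is just a toy example.

```python
def split_chunk(chunk):
    """Return (input, target) where target is the input shifted by one position."""
    return chunk[:-1], chunk[1:]

inp, target = split_chunk("Hello")
print(inp, target)  # Hell ello

# At every position, the target holds the character that follows the input character
for current_char, next_char in zip(inp, target):
    print(repr(current_char), "->", repr(next_char))
```

A chunk of `seq_length + 1` characters therefore yields an input and a target of `seq_length` characters each.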

In [22]:
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [23]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [24]:
# split each chunk into input and target

def split_input_target(chunk):
    input_text = chunk[:-1]
    output_text = chunk[1:]
    return input_text, output_text

dataset = sequences.map(split_input_target)

In [25]:
for x,y in dataset.take(2):
    print("\n\nEXAMPLE\n")
    print("INPUT")
    print(int_to_text(x))
    print("\nOUTPUT")
    print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


In [26]:
# we need to create training batches

BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Size of the buffer from which shuffle() draws sequences at random
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
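
The number of batches per epoch follows from simple integer arithmetic. The sketch below hard-codes the text length printed earlier in the notebook and assumes, as in the cells above, that incomplete chunks and incomplete batches are dropped.

```python
text_length = 1115394  # "Length of text" reported when the file was read
seq_length = 100
batch_size = 64

# Each training chunk holds seq_length + 1 characters; the remainder is dropped
num_sequences = text_length // (seq_length + 1)
# Batches are also built with drop_remainder=True
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)  # 11043 172
```

The result agrees with the "Train for 172 steps" line that Keras prints when training starts.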

#### Build the model

In [27]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
lstm_1 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_1 (Dense)              (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


#### Create a loss function

Recall that the model's input (during training) consists of sequences of length 100 in batches of 64. When we make predictions, the input will not always have this length or shape.

In [28]:
for input_example_batch, target_example_batch in data.take(1):
    example_batch_predictions = model(input_example_batch) 
    # prediction on the first batch of the training data
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [29]:
# the prediction has 64 entries, one for each sequence in the batch
print(len(example_batch_predictions))
#print(example_batch_predictions[0])

64


In [30]:
pred = example_batch_predictions[0]
print(len(pred))
#print(pred)

# it is a 2D array of length 100, where each inner array holds the predictions
# for the next character at that time step

100


In [31]:
# prediction at the first time step
time_pred = pred[0]
print(len(time_pred))
print(time_pred)

# logits (unnormalized scores) for each character at the first time step

65
tf.Tensor(
[-0.00175085 -0.00432556 -0.00706172 -0.001812   -0.00034917  0.00351697
 -0.0022645   0.00410583  0.00272123 -0.00171106  0.00333517  0.00290065
  0.00039189 -0.00290298 -0.00153354  0.00026906 -0.00834936 -0.0004392
  0.00074661  0.00657446  0.00539018 -0.00166748 -0.00534092 -0.00185909
  0.00085378 -0.00276041 -0.00294825 -0.00536192 -0.00095577 -0.00364206
  0.00096232 -0.00187163 -0.00188679 -0.00075335 -0.00551311 -0.0018262
 -0.00304138  0.00091418  0.00210359 -0.00223989 -0.00203788  0.00632498
  0.00109678  0.00090947 -0.00492635 -0.00076982  0.00352446 -0.00183091
  0.00241397  0.00202192  0.00291978 -0.0018658   0.00246556  0.00080082
 -0.00018268 -0.00219252  0.00136219 -0.00204161  0.00048571 -0.00018054
  0.00083675 -0.00174689 -0.00155232  0.00104817  0.00372499], shape=(65,), dtype=float32)


In [32]:
# draw one sample per time step from the predicted distribution
sampled_indices = tf.random.categorical(pred, num_samples=1)

# map each sampled index to its character in the vocabulary
sampled_indices = np.reshape(sampled_indices, (1,-1))[0]
print(sampled_indices)
predicted_chars = int_to_text(sampled_indices)

predicted_chars

[38 19 62  3 60 55  1 19 23 62 42 56 42 33  7 12 21 45 18 32 48  9 26 13
 49  1 47 49 48 42  6 31 20 14 26 17 30 39 11 22 35 25 37 16 60 16  1 15
 26 22 35 27 63 11 62 61 61 13 14  7  0 59  1 14 26  7 27 16 16  5 56 44
 52 39 59 21 46 30 49 23  2  0 43 20 42  9 58  0  1  4 12 52 30  3 51 49
 17 12 18 60]


"ZGx$vq GKxdrdU-?IgFTj3NAk ikjd,SHBNERa;JWMYDvD CNJWOy;xwwAB-\nu BN-ODD'rfnauIhRkK!\neHd3t\n &?nR$mkE?Fv"

In [33]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
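
For intuition, the quantity this loss computes for a single character can be written out in plain Python. This is a sketch of the underlying formula, the negative log of the softmax probability assigned to the true label, and not the Keras implementation; the function name is illustrative.

```python
import math

def sparse_categorical_crossentropy_from_logits(label, logits):
    """Cross-entropy for one integer label, given unnormalized scores."""
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    # Equivalent to -log(softmax(logits)[label])
    return log_norm - logits[label]

# With 65 equal logits every character is equally likely, so the loss is ln(65)
uniform_logits = [0.0] * 65
uniform_loss = sparse_categorical_crossentropy_from_logits(3, uniform_logits)
print(uniform_loss, math.log(65))
```

Since the untrained model outputs logits close to zero, its loss should start out near ln(65), roughly 4.17, and fall from there as training progresses.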

#### Compile the model

In [34]:
model.compile(optimizer='adam', loss=loss)

#### Create checkpoints

This lets us restore the model from a given checkpoint and continue training it.

In [35]:
checkpoint_dir = './training_checkpoints'

checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

#### Training

In [36]:
history = model.fit(data, epochs=100, callbacks=[checkpoint_callback])

Train for 172 steps
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
E

#### Load the model

We will rebuild the model from a checkpoint using a batch size of 1. This lets us feed the model one piece of text at a time.

In [37]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

In [38]:
# load the weights from the latest checkpoint
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1,None]))

Any other checkpoint can be loaded by passing its path directly, where checkpoint_num is the epoch number of the checkpoint:

    model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
    model.build(tf.TensorShape([1, None]))

#### Generate text

In [48]:
def generate_text(model, start_string):
    # Evaluation step: generate text with the trained model

    # Number of characters to generate
    num_generate = 800

    # Convert the start string to numbers
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store the generated characters
    text_generated = []

    # Low temperatures produce more predictable text
    # High temperatures produce more surprising text
    # Experiment to find the best setting
    temperature = 0.8

    # Batch size is 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Scale the logits by the temperature and sample the next
        # character from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Pass the predicted character as the next input; the earlier
        # context is carried by the model's internal state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)
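
The effect of the temperature is easiest to see on a small example. The sketch below applies a softmax to made-up logits after dividing them by different temperatures; `softmax_with_temperature` is an illustrative helper and the three scores are invented.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the highest-scoring character, while higher temperatures flatten the distribution so that unlikely characters are sampled more often.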

In [49]:
inp = input("Type the opening sentence (in English): ")
print("\n")
print(generate_text(model, inp))

Type the opening sentence (in English): You are all resolved rather to die than to famish?


You are all resolved rather to die than to famish?

All:
Resolved. resolved.

BUCKINGHAM:
What says his titue on you of such thing world,
And 'twill be withdraw together, let him be
known the gates. This is the matter, many hearing of
the souls of Richard makes a numbering creature king?
Ere he did not, that act on 't: though it passabler,
Nor had not off with maids to-morrow:
Trong fair Tyrrelent did I drop Clarence,
That love should purpose
To counterfeit that, noble lo, a
wicked vanity the helm and Richard like the town
Thy brother's blood open it again: if you are too former
And hang me: the rest her in the field.

YORK:
What bore may gentlemen, I have heard you say
'That a Jack in them.

First Senator:
Be it so; if any were in love before I came.

DUCHESS:
Whose house, my lord, we would have had you heard
The traitor speak bring
Till


To improve the results, you can add another layer to the model (LSTM or GRU), adjust the temperature, or train for more epochs (although with 100 epochs the loss seems to have plateaued).