
# Text generation problem

This text generation model is based on the RNN model from the following page: https://www.tensorflow.org/tutorials/text/text_generation, modified to work with the datasets provided for the challenge.

## Imported libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import os
import time

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


## Downloading the data

### Downloading the data from https://ndownloader.figshare.com/files/3199448

In [3]:
path_to_keywords = tf.keras.utils.get_file('keywords_train.txt', 'https://ndownloader.figshare.com/files/3199448')

Downloading data from https://ndownloader.figshare.com/files/3199448


In [4]:
path_to_keywords_eval = tf.keras.utils.get_file('keywords_eval.txt', 'https://competitions.codalab.org/my/datasets/download/1d4287b8-74ef-4807-9f07-1cf62a3f15d5')

Downloading data from https://competitions.codalab.org/my/datasets/download/1d4287b8-74ef-4807-9f07-1cf62a3f15d5


In [5]:
keyword_train_text = text = open(path_to_keywords, 'rb').read().decode(encoding='utf-8')
keyword_eval_text = text_eval = open(path_to_keywords_eval, 'rb').read().decode(encoding='utf-8')

print ('Length of text: {} characters'.format(len(text)))

Length of text: 2361 characters


### Downloading and processing the data from https://data.mendeley.com/datasets/hpfprmy49b/1/files/ef50dc7c-f5c7-4da9-ad3e-cd14d6767255/health_claim_data_submit.xls?dl=1

In [6]:
headlines_df = pd.read_excel(r'/content/gdrive/My Drive/health_claim_data_submit.xls')

In [7]:
headlines_df = headlines_df.drop(['news_title', 'reported_date', 'source', 'health_claim_or_not', 'IV', 'relation', 'DV', 'multiple_IV'], axis = 1)

In [8]:
headlines_df = headlines_df.news_topic.astype(str)

In [9]:
headlines = headlines_df.values
headlines = np.array2string(headlines)
headlines = headlines[1:-1]
headlines

"'Breast Cancer' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes'\n 'Breast Cancer' 'Diabetes' 'Breast Cancer' 'Breast Cancer' 'Diabetes'\n 'Breast Cancer' 'Diabetes' 'Diabetes' 'Breast Cancer' 'Breast Cancer'\n 'Breast Cancer' 'Breast Cancer' 'Diabetes' 'Diabetes' 'Diabetes'\n 'Diabetes' 'Diabetes' 'Diabetes' 'Breast Cancer' 'Diabetes'\n 'Breast Cancer' 'Diabetes' 'Diabetes' 'Breast Cancer' 'Breast Cancer'\n 'Breast Cancer' 'Diabetes' 'Diabetes' 'Diabetes' 'Breast Cancer'\n 'Diabetes' 'Diabetes' 'Breast Cancer' 'Diabetes' 'Breast Cancer'\n 'Diabetes' 'Breast Cancer' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes'\n 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes' 'Breast Cancer'\n 'Diabetes' 'Breast Cancer' 'Breast Cancer' 'Diabetes' 'Breast Cancer'\n 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes'\n 'Diabetes' 'Breast Cancer' 'Diabetes' 'Diabetes' 'Diabetes' 'Diabetes'\n 'Diabetes' 'Diabetes' 'Diabetes' 'Breast Cancer' 'Breast Cancer'\n 'Breast Cancer' 'Diabetes' 'Brea

### Building the input data

In [10]:
total_training_set = keyword_train_text + headlines
total_training_set

"ache,\naches,\nachey,\naching,\nachy,\nacl,\nacne,\nacupuncture,\nadvil,\naleve,\nallergic,\nallergies,\nallergy,\nankle,\nantibiotics,\nanxiety,\nanxious,\nappetite,\nappointment,\nappt,\narthritis,\naspirin,\nasthma,\nbackache,\nbattling,\nbedtime,\nbenadryl,\nbladder,\nblisters,\nbody,\nbreathing,\nbronchitis,\nbruised,\nburning,\nbypass,\ncaffeine,\ncancer,\nchemo,\nchest,\nchronic,\nclinic,\nclogged,\ncodeine,\ncold,\ncolds,\ncoma,\ncongested,\ncongestion,\ncontagious,\ncough,\ncoughed,\ncoughing,\ncoughs,\ncramps,\ncravings,\ncrutches,\ncure,\ncured,\ndealing,\ndehydrated,\ndehydration,\ndental,\ndentist,\ndepression,\ndiabetes,\ndiagnosed,\ndiarrhea,\ndieting,\ndizziness,\ndizzy,\ndoctor,\ndoctors,\ndose,\ndrained,\ndrowsy,\ndrugged,\near,\nearache,\neaten,\nelbow,\nemergency,\nexcedrin,\nexcruciating,\nexercise,\nexhausted,\nexhaustion,\nfaint,\nfatigue,\nfeelin,\nfever,\nfeverish,\nfevers,\nflu,\nfluids,\nforehead,\nfreezing,\ngastric,\ngerms,\nglands,\ngroggy,\nh1n1,\nhackin

## Text processing

Once the full dataset is loaded in the notebook and available as total_training_set, it must be processed into a mapping that represents each character of a string as a non-negative integer, and vice versa. This gives the model a numeric form of the input text and lets us decode the output it later generates back into characters.
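As a small illustration of the mapping described above, the sketch below builds both directions on a toy string and checks that encoding followed by decoding returns the original text. The names `sample_text`, `sample_vocab`, `sample_char2idx` and `sample_idx2char` are hypothetical and only stand in for the real variables used in the cells of this notebook.

```python
import numpy as np

# Hypothetical toy corpus standing in for the real training text.
sample_text = "ache,\naches,\n"

# The sorted set of unique characters defines the vocabulary.
sample_vocab = sorted(set(sample_text))

# Forward mapping (character -> index) and inverse mapping (index -> character).
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = np.array(sample_vocab)

# Encode the text as integers, then decode it back into a string.
encoded = np.array([sample_char2idx[ch] for ch in sample_text])
decoded = "".join(sample_idx2char[encoded])

print(encoded)
print(repr(decoded))
```

If the round trip does not reproduce the input exactly, the two mappings are out of sync, which is worth checking before training.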

### Text vectorization

In [11]:
vocab = sorted(set(total_training_set))
print('{} unique characters'.format(len(vocab)))

34 unique characters


In [12]:
char2idx = {u:i for i, u in enumerate(vocab)} # Mapping from characters to numbers
idx2char = np.array(vocab) # Mapping from numbers to characters

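# Note: only `text` (the keyword list) is encoded here; the headlines contribute to vocab but not to the training sequences
text_as_int = np.array([char2idx[c] for c in text])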

In [13]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  "'" :   2,
  ',' :   3,
  '1' :   4,
  'B' :   5,
  'C' :   6,
  'D' :   7,
  'a' :   8,
  'b' :   9,
  'c' :  10,
  'd' :  11,
  'e' :  12,
  'f' :  13,
  'g' :  14,
  'h' :  15,
  'i' :  16,
  'j' :  17,
  'k' :  18,
  'l' :  19,
  ...
}


In [14]:
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'ache,\naches,\n' ---- characters mapped to int ---- > [ 8 10 15 12  3  0  8 10 15 12 26  3  0]


### Creating the training examples and targets

In [15]:
seq_length = 20 # Set to 20 because the dataset consists of names of diseases or physical conditions, which do not exceed that many characters
examples_per_epoch = len(text)//(seq_length+1)

In [16]:
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [17]:
for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

a
c
h
e
,


In [18]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'ache,\naches,\nachey,\na'
'ching,\nachy,\nacl,\nacn'
'e,\nacupuncture,\nadvil'
',\naleve,\nallergic,\nal'
'lergies,\nallergy,\nank'


In [19]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

dataset = sequences.map(split_input_target)

In [20]:
for input_example, target_example in dataset.take(1): # Check that the input vectorization works properly, both individually and as an input + target pair
  #input_example.shape
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'ache,\naches,\nachey,\n'
Target data: 'che,\naches,\nachey,\na'


In [21]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])): 
  print("Step {:4d}".format(i))
  print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
  print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 8 ('a')
  expected output: 10 ('c')
Step    1
  input: 10 ('c')
  expected output: 15 ('h')
Step    2
  input: 15 ('h')
  expected output: 12 ('e')
Step    3
  input: 12 ('e')
  expected output: 3 (',')
Step    4
  input: 3 (',')
  expected output: 0 ('\n')


## Creating the training batches

In [22]:
BATCH_SIZE = 64

BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 20), (64, 20)), types: (tf.int64, tf.int64)>
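As a rough check of how much data each epoch actually sees, the arithmetic behind the two `batch(..., drop_remainder=True)` calls can be worked out from the values reported in this notebook (2361 characters, `seq_length = 20`, `BATCH_SIZE = 64`). This is a plain-Python sketch of that arithmetic, not part of the original pipeline; the variable names are introduced here for illustration.

```python
# Values taken from this notebook's code and outputs.
text_length = 2361   # characters in the keyword text
seq_length = 20
batch_size = 64

# Each training chunk holds seq_length + 1 characters (input plus shifted target).
chunk_size = seq_length + 1
num_sequences = text_length // chunk_size   # full chunks kept by drop_remainder=True
dropped_chars = text_length % chunk_size    # trailing characters that do not fill a chunk

# The second batch() call groups sequences and also drops the remainder.
batches_per_epoch = num_sequences // batch_size
unused_sequences = num_sequences % batch_size

print(num_sequences, dropped_chars)         # 112 9
print(batches_per_epoch, unused_sequences)  # 1 48
```

Under these numbers each epoch consists of a single batch of 64 sequences, and 48 sequences are left out. Because `shuffle` reshuffles on each iteration by default, the sequences that are left out can differ from one epoch to the next.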

## Building the model

In [98]:
vocab_size = len(vocab)

embedding_dim = 256

rnn_units = 1024

In [145]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]), # input layer: maps each integer-encoded character to a dense vector
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'), # this model uses a gated recurrent unit (GRU)
    tf.keras.layers.Dense(vocab_size) # output layer of the model
  ])
  return model

In [146]:
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)



In [147]:
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (64, None, 256)           8704      
_________________________________________________________________
gru_10 (GRU)                 (64, None, 1024)          3938304   
_________________________________________________________________
dense_10 (Dense)             (64, None, 34)            34850     
Total params: 3,981,858
Trainable params: 3,981,858
Non-trainable params: 0
_________________________________________________________________
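The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm that the layers are wired as intended. The sketch below assumes the GRU uses three gates with two bias vectors each, which corresponds to the `reset_after=True` default of `tf.keras.layers.GRU` in TF 2.x; the `*_params` names are introduced here for illustration.

```python
vocab_size = 34
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three gates (update, reset, candidate), each with an input kernel,
# a recurrent kernel and two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output unit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

The four values agree with the `Param #` column and the total reported by `model.summary()`.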


## Testing the model

The model is tested before training to check that it accepts the input and produces its corresponding prediction properly.

In [148]:
dataset.take(1)

<TakeDataset shapes: ((64, 20), (64, 20)), types: (tf.int64, tf.int64)>

In [149]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 20, 34) # (batch_size, sequence_length, vocab_size)


In [150]:
input_example_batch.shape

TensorShape([64, 20])

In [151]:
idx2char

array(['\n', ' ', "'", ',', '1', 'B', 'C', 'D', 'a', 'b', 'c', 'd', 'e',
       'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
       's', 't', 'u', 'v', 'w', 'x', 'y', 'z'], dtype='<U1')

In [152]:
input_example_batch

<tf.Tensor: shape=(64, 20), dtype=int64, numpy=
array([[19, 19, 16, ..., 32, 20, 23],
       [ 3,  0, 23, ..., 12, 10, 22],
       [18,  3,  0, ...,  0, 21, 28],
       ...,
       [21, 12, 12, ...,  3,  0, 26],
       [ 3,  0, 23, ..., 21, 16,  8],
       [ 3,  0, 13, ...,  3,  0, 13]])>

In [153]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [154]:
sampled_indices

array([15, 15, 20,  2,  4, 13, 33, 10,  8, 27, 26,  2,  2, 13,  0,  5, 23,
        8, 31, 13])

In [155]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0].numpy()])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

Input: 
 'lling,\nswollen,\nsymp'

Next Char Predictions: 
 "hhm'1fzcats''f\nBpaxf"


## Training the model

The model is trained with the following configuration:
- loss: sparse categorical cross-entropy.
- batch size: 64 training samples.
- optimizer: Adam.
- number of epochs: 120.

In [156]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())


Prediction shape:  (64, 20, 34)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       3.5264924
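An initial loss of about 3.526 is what an untrained model is expected to produce: if all 34 characters are predicted with roughly equal probability, the cross-entropy is close to ln(34) ≈ 3.526. The NumPy sketch below illustrates this with a hand-written version of the loss; `sparse_ce_from_logits` and the random labels are hypothetical and are not the Keras implementation.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Per-position cross-entropy between integer labels and unnormalized logits."""
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true label.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

vocab_size = 34
rng = np.random.default_rng(0)
labels = rng.integers(0, vocab_size, size=(64, 20))  # hypothetical targets
uniform_logits = np.zeros((64, 20, vocab_size))      # every character equally likely

baseline = sparse_ce_from_logits(labels, uniform_logits).mean()
print(baseline, np.log(vocab_size))
```

A starting loss far above this baseline would suggest a problem with the logits or the labels rather than with training itself.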


In [157]:
model.compile(optimizer='adam', loss=loss)

In [158]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [159]:
EPOCHS=120

In [160]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/120
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120
Epoch 10/120
Epoch 11/120
Epoch 12/120
Epoch 13/120
Epoch 14/120
Epoch 15/120
Epoch 16/120
Epoch 17/120
Epoch 18/120
Epoch 19/120
Epoch 20/120
Epoch 21/120
Epoch 22/120
Epoch 23/120
Epoch 24/120
Epoch 25/120
Epoch 26/120
Epoch 27/120
Epoch 28/120
Epoch 29/120
Epoch 30/120
Epoch 31/120
Epoch 32/120
Epoch 33/120
Epoch 34/120
Epoch 35/120
Epoch 36/120
Epoch 37/120
Epoch 38/120
Epoch 39/120
Epoch 40/120
Epoch 41/120
Epoch 42/120
Epoch 43/120
Epoch 44/120
Epoch 45/120
Epoch 46/120
Epoch 47/120
Epoch 48/120
Epoch 49/120
Epoch 50/120
Epoch 51/120
Epoch 52/120
Epoch 53/120
Epoch 54/120
Epoch 55/120
Epoch 56/120
Epoch 57/120
Epoch 58/120
Epoch 59/120
Epoch 60/120
Epoch 61/120
Epoch 62/120
Epoch 63/120
Epoch 64/120
Epoch 65/120
Epoch 66/120
Epoch 67/120
Epoch 68/120
Epoch 69/120
Epoch 70/120
Epoch 71/120
Epoch 72/120
Epoch 73/120
Epoch 74/120
Epoch 75/120
Epoch 76/120
Epoch 77/120
Epoch 78

## Text generation

In [161]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_120'

In [162]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [163]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 20

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a categorical distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
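The effect of the `temperature` value used inside `generate_text` can be seen in isolation with a small NumPy sketch. Dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is below 1 and flattens it when it is above 1. The function `softmax_with_temperature` and the example logits are hypothetical and only illustrate the idea.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits for a four-character vocabulary.
example_logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, np.round(probs, 3))
```

With a low temperature most of the probability mass goes to the highest logit, so sampled text tends to repeat what the model saw in training; a high temperature spreads the mass out and produces more varied output.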

In [164]:
eval = keyword_eval_text.split(',')

for i in range(len(eval)-1): 
  eval[i] = eval[i].strip()
    
print(eval)

['acupuncture', 'antibiotics', 'bypass', 'hospital', 'nausea', 'prescription', 'surgery', 'treatment', '']
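The empty string at the end of the list appears to come from a trailing comma in the file, which is why the loops in this notebook iterate over `len(eval)-1` elements. A possible alternative, sketched here on a hypothetical sample string, is to strip every entry and discard the empty ones while parsing, so that no index arithmetic is needed afterwards. The names `raw_keywords` and `keywords` are introduced only for this example.

```python
# Hypothetical file content; the real keywords_eval.txt is assumed to end with a comma.
raw_keywords = "acupuncture,\nantibiotics,\nbypass,"

# Strip whitespace around each entry and drop empty strings.
keywords = [word.strip() for word in raw_keywords.split(",") if word.strip()]

print(keywords)  # ['acupuncture', 'antibiotics', 'bypass']
```

Using a name such as `keywords` would also avoid shadowing Python's built-in `eval`.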


In [165]:
test = []
predictions = []

for i in range(len(eval)-1): # the last element of eval is an empty string, so it is skipped
  test.append(eval[i])
  print("predicting word {}...".format(eval[i]))
  prediction = generate_text(model, eval[i])
  prediction = prediction.split(',') # keep only the text generated before the first comma
  predictions.append(prediction[0])
  #print(generate_text(model, start_string=word[:2]))

print(test)
print(predictions)


predicting word acupuncture...
predicting word antibiotics...
predicting word bypass...
predicting word hospital...
predicting word nausea...
predicting word prescription...
predicting word surgery...
predicting word treatment...
['acupuncture', 'antibiotics', 'bypass', 'hospital', 'nausea', 'prescription', 'surgery', 'treatment']
['acupuncture', 'antibiotics', 'bypass', 'hospitaling', 'nausea', 'prescription', 'surgery', 'treatmentablen']


In [166]:
evaluation = zip(test, predictions)
evaluation = list(evaluation)
evaluation 

[('acupuncture', 'acupuncture'),
 ('antibiotics', 'antibiotics'),
 ('bypass', 'bypass'),
 ('hospital', 'hospitaling'),
 ('nausea', 'nausea'),
 ('prescription', 'prescription'),
 ('surgery', 'surgery'),
 ('treatment', 'treatmentablen')]
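To summarise the comparison above as a single number, one option is an exact-match rate over the (keyword, prediction) pairs. The sketch below copies the pairs from the output shown above; `pairs`, `matches` and `exact_match_rate` are names introduced here, and this metric is only one possible choice, not necessarily the one used by the challenge.

```python
# (keyword, prediction) pairs copied from the notebook output.
pairs = [
    ('acupuncture', 'acupuncture'),
    ('antibiotics', 'antibiotics'),
    ('bypass', 'bypass'),
    ('hospital', 'hospitaling'),
    ('nausea', 'nausea'),
    ('prescription', 'prescription'),
    ('surgery', 'surgery'),
    ('treatment', 'treatmentablen'),
]

# Count the predictions that reproduce the keyword exactly.
matches = sum(1 for expected, predicted in pairs if expected == predicted)
exact_match_rate = matches / len(pairs)

print(matches, len(pairs), exact_match_rate)  # 6 8 0.75
```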

In [168]:
np.savetxt('text_generation.csv', evaluation, delimiter = ', ', fmt = '% s')