# Create a sequence generator with an RNN

In this notebook, we develop a Recurrent Neural Network (RNN) with a TimeDistributed output layer to build a Galician-language sequence generator. The training text comes from the Galician politician Beiras.

**The model gets good results with only 100,000 characters of training text, but at the cost of a lot of memory: on my AWS machine, 100,000 characters use almost all of the available memory.**

This work is based on https://github.com/udacity/aind2-rnn/blob/master/RNN_project.ipynb



In [1]:
import sys
sys.path.insert(0, '../aux/')
import numpy as np
from beiras_aux import load_text,predict_next_chars,load_coded_dictionaries


## Load the data
First we load and preprocess the data:
* convert to lowercase
* remove lines with http links
* remove symbols: '[ºªàâäçèêïìôöü&%@•…«»”“*/!"(),.:;_¿¡¿‘’´\[\]\']'
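The cleaning itself happens inside `load_text`, which is not shown in this notebook. As a rough illustration of the three steps listed above, here is a minimal sketch. The helper name `clean_text`, the choice to replace symbols with spaces, and the final whitespace collapse are assumptions, not the actual `beiras_aux` implementation.

```python
import re

# Symbols taken from the list above (duplicates removed).
SYMBOLS = 'ºªàâäçèêïìôöü&%@•…«»”“*/!"(),.:;_¿¡‘’´[]\''

def clean_text(raw):
    """Hypothetical helper: lowercase, drop lines with http links, strip symbols."""
    # lowercase everything and drop any line that contains a link
    lines = [line for line in raw.lower().splitlines() if 'http' not in line]
    text = ' '.join(lines)
    # replace every listed symbol with a space (assumption: space, not deletion)
    text = re.sub('[' + re.escape(SYMBOLS) + ']', ' ', text)
    # collapse the runs of whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = clean_text('Ola, Mundo!\nsee http://x.y\n«Ben»')
```

Here `sample` keeps only the words of the first and third lines, in lowercase and without punctuation.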

In [11]:
window_size = 100
step_size = 1
X,y,chars,chars_to_indices,indices_to_chars,text_clean=load_text('../data/Beiras.txt',window_size,step_size);

* X .- Array of shape (sentences, window_size, num_chars); input for training.
* y .- Array of shape (sentences, num_chars); output for training.
* chars .- Array with the characters present in the clean text.
* chars_to_indices, indices_to_chars .- Dictionaries to convert from character to index and from index to character.
* text_clean .- The full cleaned text.

**In this case the network outputs a sequence, so y, the target used for training, must also be a sequence.**
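To make the sequence target concrete, the sketch below builds input/target pairs on a toy string, where each target window is the input window shifted one character to the right. The helper `shifted_pairs` and the toy text are illustrative assumptions; the notebook itself builds the same relation directly in one-hot form.

```python
def shifted_pairs(text, window_size, step_size=1):
    """Return (input, target) windows; the target is the input shifted by one char."""
    pairs = []
    for k in range(0, len(text) - window_size, step_size):
        window = text[k:k + window_size]
        target = text[k + 1:k + window_size + 1]
        pairs.append((window, target))
    return pairs

pairs = shifted_pairs('galiza', window_size=3)
# pairs[0] is ('gal', 'ali'): at every position the target holds the next character
```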

Because y now has the same shape as X, this uses more memory, so we train on a reduced dataset.
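A quick back-of-the-envelope calculation shows why the dataset has to be reduced. The sketch assumes one byte per element (NumPy boolean arrays) and uses the window size of 100, step size of 1 and 55 distinct characters that appear in this notebook; the helper name `one_hot_bytes` is hypothetical.

```python
def one_hot_bytes(text_length, window_size, step_size, num_chars, bytes_per_item=1):
    """Bytes needed by one (sentences, window_size, num_chars) one-hot array."""
    sentences = (text_length - window_size) // step_size
    return sentences * window_size * num_chars * bytes_per_item

# 100,000 characters of text: about 0.55 GB per array, so about 1.1 GB for X and y
reduced = one_hot_bytes(100000, 100, 1, 55)

# 1,074,280 windows (text of 1,074,380 characters): about 5.9 GB per array
full = one_hot_bytes(1074380, 100, 1, 55)

print(reduced / 1e9, full / 1e9)
```

If the arrays are converted to 32-bit floats at any point, these figures are multiplied by four.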


In [17]:
TEXT_TO_USE = 100000

def window_transform_text(text, window_size, step_size):
    # containers for input/output pairs
    inputs = []
    outputs = []
    # number of windows to create
    n_windows = int((len(text) - window_size) / step_size)
    for j in range(n_windows):
        # k .- start index of the window
        k = j * step_size
        inputs.append(text[k:(k + window_size)])
        outputs.append(text[k + window_size])
    return inputs, outputs

def encode_io_pairs_distributedtime(text, window_size, step_size):
    # number of unique chars; taken from the global dictionary so that the
    # one-hot width always matches the indices in chars_to_indices
    num_chars = len(chars_to_indices)

    # cut up text into character input/output pairs
    inputs, outputs = window_transform_text(text, window_size, step_size)
    # create empty vessels for one-hot encoded input/output
    # (np.bool was removed in NumPy 1.24; the builtin bool is equivalent)
    X = np.zeros((len(inputs), window_size, num_chars), dtype=bool)
    y = np.zeros((len(inputs), window_size, num_chars), dtype=bool)

    # loop over inputs/outputs, transform and store in X/y
    for i, sentence in enumerate(inputs):
        for t, char in enumerate(sentence):
            X[i, t, chars_to_indices[char]] = 1
            # the target at step t-1 is the input char at step t
            if t > 0:
                y[i, t - 1, chars_to_indices[char]] = 1
        # the target at the last step is the char that follows the window
        y[i, len(sentence) - 1, chars_to_indices[outputs[i]]] = 1
    return X, y

# NOTE: text_clean[TEXT_TO_USE:] skips the first TEXT_TO_USE characters and keeps the rest;
# text_clean[:TEXT_TO_USE] would keep only the first TEXT_TO_USE characters.
X_distributedtime, y_distributedtime = encode_io_pairs_distributedtime(text_clean[TEXT_TO_USE:], window_size, step_size)

1074280 100 55


MemoryError: 

## Test we have a GPU
I used a g2.2xlarge EC2 machine. Without a GPU this is too slow.

In [6]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/cpu:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 13253911401161739314, name: "/gpu:0"
 device_type: "GPU"
 memory_limit: 26214400
 locality {
   bus_id: 1
 }
 incarnation: 7405973703273733530
 physical_device_desc: "device: 0, name: GRID K520, pci bus id: 0000:00:03.0"]

## Create the network with TimeDistributed

In [7]:
from keras.layers import TimeDistributed
from keras.layers import Dense, Activation,GRU
from keras.optimizers import RMSprop
from keras.models import Sequential

def create_gru_distributed_model(num_chars):
    model = Sequential()
    # Layer 1 .- GRU with 200 hidden units; returns the full sequence
    model.add(GRU(200, input_shape=(None, num_chars), return_sequences=True))
    # Layer 2 .- GRU with 200 hidden units
    # It also returns the full sequence, which TimeDistributed needs.
    model.add(GRU(200, return_sequences=True))
    # Layer 3 .- Dense with num_chars units, wrapped in TimeDistributed
    # so that it is applied at every time step
    model.add(TimeDistributed(Dense(num_chars)))
    # softmax over the characters at every time step
    model.add(Activation('softmax'))
    # initialize optimizer
    optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
    # compile the model with the optimizer initialized above
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    return model

In [None]:
# pass the number of characters, not the list of characters
model = create_gru_distributed_model(len(chars))
model.summary()
# `epochs` replaces the deprecated `nb_epoch` argument (Keras 2 and later)
model.fit(X_distributedtime, y_distributedtime, batch_size=500, epochs=30, verbose=1)

# save weights
model.save_weights('model_weights/best_beiras_gru_distributed_textdata_weights.hdf5')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_3 (GRU)                  (None, None, 200)         153600    
_________________________________________________________________
gru_4 (GRU)                  (None, None, 200)         240600    
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 55)          11055     
_________________________________________________________________
activation_2 (Activation)    (None, None, 55)          0         
Total params: 405,255
Trainable params: 405,255
Non-trainable params: 0
_________________________________________________________________
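The parameter counts in the summary can be checked by hand. The sketch below assumes a GRU with three gates and a single bias vector per gate, which is what these numbers are consistent with; more recent Keras versions may use a second bias set by default and report larger counts. The helper names are hypothetical.

```python
def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and one bias vector."""
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias; TimeDistributed shares it across time steps."""
    return input_dim * units + units

num_chars = 55
layer_1 = gru_params(num_chars, 200)    # first GRU reads one-hot characters
layer_2 = gru_params(200, 200)          # second GRU reads the first GRU's output
layer_3 = dense_params(200, num_chars)  # per-step projection back to characters
total = layer_1 + layer_2 + layer_3
```

Because the Dense layer is shared across all time steps, `TimeDistributed` adds no parameters of its own.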




Epoch 1/30
  61500/1074280 [>.............................] - ETA: 1198s - loss: 2.5717

**We need different functions to predict: the network outputs a whole sequence, and we only need its last element.**
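As a small illustration of that selection step, the sketch below uses a hand-made array in place of `model.predict(...)`. The array values and the tiny sizes are assumptions chosen only to make the indexing visible.

```python
import numpy as np

window_size, num_chars = 4, 3

# Stand-in for model.predict(x_test): shape (batch, window_size, num_chars)
fake_output = np.zeros((1, window_size, num_chars))
fake_output[0, :, 0] = 1.0              # every time step votes for class 0 ...
fake_output[0, -1] = [0.1, 0.7, 0.2]    # ... except the last one

# keep only the distribution produced at the last time step
last_step = fake_output[0][window_size - 1]
next_index = int(np.argmax(last_step))
```

Only `last_step` describes the character that follows the whole window; the earlier steps describe characters that are already part of the input.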

In [13]:
# function that uses trained model to predict a desired number of future characters
def predict_next_chars_distributed(model,input_chars,num_to_predict):     
    # create output
    predicted_chars = ''
    for i in range(num_to_predict):
        # convert this round's predicted characters to numerical input    
        x_test = np.zeros((1, window_size, len(chars)))
        for t, char in enumerate(input_chars):
            x_test[0, t, chars_to_indices[char]] = 1.

        # make this round's prediction
        test_predict = model.predict(x_test,verbose = 0)[0][window_size-1]
        

        # translate numerical prediction back to characters
        r = np.argmax(test_predict)                           # index of the most likely next character
        d = indices_to_chars[r] 

        # update predicted_chars and input
        predicted_chars+=d
        input_chars+=d
        input_chars = input_chars[1:]
    return predicted_chars


def print_predicctions_distributed(model,weights_file):
    start_inds = [100,1000,5000]

    # load in weights
    model.load_weights(weights_file)
    for s in start_inds:
        start_index = s
        input_chars = text_clean[start_index: start_index + window_size]

        # use the prediction function
        predict_input = predict_next_chars_distributed(model,input_chars,num_to_predict = 100)

        # print out input characters
        print('------------------')
        input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
        print(input_line)

        # print out predicted characters
        line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
        print(line)  

In [14]:
model=create_gru_distributed_model(len(chars_to_indices))
print_predicctions_distributed(model,'model_weights/best_beiras_gru_distributed_textdata_weights.hdf5')


------------------
input chars = 
pla panfletaria contra as leoninas taxas impostas polo ministro de xustiza actual malia que vulneran"

predicted chars = 
 entencíar nen sequer por parte da miña memoria infantil e a de castelao de anteriores do sistema co"

------------------
input chars = 
poema de rosalía titulado a xusticia pola man e dado á luz no seu libro follas novas por certo que s"

predicted chars = 
e entende a sua propria condea para ser consabidos ao cabo para delato e resultado do colonizador e "

------------------
input chars = 
se moito cando dixen eu que as suas políticas agresoras do común cidadán matan e a sua cospedal alcu"

predicted chars = 
nda pola sua peripecia liberal de panorama sen deza na barbarie de galiza de anova -sintenciais neol"

