<a href="https://colab.research.google.com/github/orestislampridis/Greek-Lyric-Generation/blob/master/Text_gen_4_char.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Activation, Dropout, Dense, CuDNNLSTM, Embedding
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.utils import np_utils
import numpy as np
import pandas as pd
import sys


Using TensorFlow backend.


Load the data (lyrics) from Google Drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The data is raw text stored in a single .txt file. We chose not to clean or edit it, so that the results stay closer to realistic conditions.

In [0]:

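# Read the whole lyrics file into one string (UTF-8 is an assumption here, since the text is Greek)
with open('/content/drive/My Drive/Colab Notebooks/lyrics entexnoi raw.txt', 'r', encoding='utf-8') as f:
    text = f.read()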

In [0]:
text = text.replace('\ufeff', "")  # strip the byte-order mark (BOM) if present
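
As an aside, the BOM can also be dropped while reading. The sketch below assumes the file is UTF-8 encoded and relies on Python's `utf-8-sig` codec, which removes a leading BOM during decoding. The `read_lyrics` helper and the temporary demo file are hypothetical stand-ins, not part of the notebook.

```python
import os
import tempfile

def read_lyrics(path):
    """Read a text file as UTF-8, dropping a leading BOM if one is present."""
    # 'utf-8-sig' decodes like 'utf-8' but removes a leading '\ufeff'.
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()

# Demo on a throwaway file that starts with a BOM (stand-in for the lyrics file).
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as tmp:
    tmp.write('\ufeffΤώρα τι κλαις')
    demo_path = tmp.name

demo_text = read_lyrics(demo_path)
os.remove(demo_path)
print(repr(demo_text))
```

With this approach the separate `replace('\ufeff', "")` step would only matter for BOM characters that appear somewhere other than the start of the file.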

Check the text

In [0]:
print(repr(text[:100]))  # show the first 100 characters of the text


'Τώρα τι κλαις τι άλλο θες το μάθαμε κι οι δυο μας ψάξε και βρες σκύψε και δες πια δύναμη μας χώρισε '


In [0]:
n_char = len(text)
print('Length of text: %i characters' % n_char)  # length = number of characters in the text


Length of text: 741237 characters


In [0]:
vocab = sorted(set(text))  # vocabulary: the sorted set of unique characters
n_vocab = len(vocab)
print('number of unique characters: %i' % n_vocab)

number of unique characters: 133


We need to map characters to numbers, because neural networks operate on numeric input rather than raw text.

In [0]:
char2int = {c: i for i, c in enumerate(vocab)}  # map each character to an integer index
int2char = {i: c for i, c in enumerate(vocab)}  # inverse map, used to turn indices back into text

print(char2int) #print the result of mapping the characters in the vocabulary
print(int2char)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, ',': 5, '.': 6, '0': 7, '1': 8, '2': 9, '3': 10, '4': 11, '5': 12, '6': 13, '7': 14, '8': 15, '9': 16, ':': 17, ';': 18, '?': 19, 'A': 20, 'B': 21, 'C': 22, 'D': 23, 'E': 24, 'F': 25, 'G': 26, 'H': 27, 'I': 28, 'J': 29, 'K': 30, 'L': 31, 'M': 32, 'N': 33, 'O': 34, 'P': 35, 'Q': 36, 'R': 37, 'S': 38, 'T': 39, 'X': 40, 'a': 41, 'b': 42, 'c': 43, 'd': 44, 'e': 45, 'f': 46, 'g': 47, 'h': 48, 'i': 49, 'j': 50, 'k': 51, 'l': 52, 'm': 53, 'n': 54, 'o': 55, 'p': 56, 'q': 57, 'r': 58, 's': 59, 't': 60, 'u': 61, 'v': 62, 'w': 63, 'y': 64, 'z': 65, 'µ': 66, 'Ά': 67, 'Έ': 68, 'Ή': 69, 'Ί': 70, 'Ό': 71, 'Ύ': 72, 'Ώ': 73, 'ΐ': 74, 'Α': 75, 'Β': 76, 'Γ': 77, 'Δ': 78, 'Ε': 79, 'Ζ': 80, 'Η': 81, 'Θ': 82, 'Ι': 83, 'Κ': 84, 'Λ': 85, 'Μ': 86, 'Ν': 87, 'Ξ': 88, 'Ο': 89, 'Π': 90, 'Ρ': 91, 'Σ': 92, 'Τ': 93, 'Υ': 94, 'Φ': 95, 'Χ': 96, 'Ψ': 97, 'Ω': 98, 'ά': 99, 'έ': 100, 'ή': 101, 'ί': 102, 'α': 103, 'β': 104, 'γ': 105, 'δ': 106, 'ε': 107, 'ζ': 108, 'η': 109, 'θ': 110

In [0]:
text_as_int = np.array([char2int[c] for c in text])  # encode the whole text as integer indices

In [0]:
# Show a sample of our data mapped from text to integers
print ('%s --[chars to int] -- > %s' %(repr(text[101:120]), text_as_int[101:120]))

'σύ λοιπόν δε φταις ' --[chars to int] -- > [121 131   1 113 117 111 118 130 115   1 106 107   1 124 122 103 111 120
   1]
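
The mapping step can be checked with a quick round trip: encoding a string and decoding it again should return the original. This is a minimal sketch on a hypothetical toy string; `encode`, `decode`, `to_int` and `to_char` are illustrative names that mirror `char2int` and `int2char` above.

```python
import numpy as np

sample_text = "καλή μέρα"  # hypothetical toy corpus, not the lyrics data
sample_vocab = sorted(set(sample_text))
to_int = {c: i for i, c in enumerate(sample_vocab)}   # character -> index
to_char = {i: c for i, c in enumerate(sample_vocab)}  # index -> character

def encode(s, mapping):
    """Turn a string into an array of integer indices."""
    return np.array([mapping[c] for c in s])

def decode(indices, mapping):
    """Turn a sequence of integer indices back into a string."""
    return "".join(mapping[int(i)] for i in indices)

encoded = encode(sample_text, to_int)
decoded = decode(encoded, to_char)
print(encoded, "->", repr(decoded))
```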


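To feed the network, we need to divide the text into samples (sequences).

We also divide the data into inputs and targets: each input is a window of seq_length characters, and its target is the character that follows it.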


In [0]:
print('Making samples (sequences) and dividing data into input and target...\n')
seq_length = 100  # number of characters per input sequence
# e.g. with seq_length=3 and text=καλή: input=καλ, target=ή
target = []
input = []  # note: this name shadows Python's built-in input(); kept because later cells use it
step = 5  # stride between window starts; a smaller step yields more (overlapping) sequences
for i in range(0, n_char - seq_length, step):
    input.append(text_as_int[i:i + seq_length])  # window of seq_length characters
    target.append(text_as_int[i + seq_length])   # the character right after the window

print('Input and target data example:')
print("input 49:", "".join([int2char[c] for c in input[49]]))
print("target 49:", int2char[target[49]])


n_samples = len(input)
print("\nNumber of samples:", n_samples)



Making samples (sequences) and dividing data into input and target...

Input and target data example:
input 49:  γιατί μη με ρωτήσεις το 'χω νιώσει όταν πονώ σαν θα γίνεσαι ένας ξένος πιο βαθιά να σ' αγαπώ έλα μη
target 49: ν

Number of samples: 148228
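
The same windows can be built without a Python loop. The following is a sketch under the assumption that the encoded text is a 1-D NumPy array; `make_windows` is a hypothetical helper, and the last lines recompute the sample count reported above from the printed text length, `seq_length` and `step`.

```python
import math
import numpy as np

def make_windows(encoded, seq_length, step):
    """Return (inputs, targets) for next-character prediction."""
    starts = np.arange(0, len(encoded) - seq_length, step)
    # Broadcasting builds one row of positions per window.
    idx = starts[:, None] + np.arange(seq_length)[None, :]
    return encoded[idx], encoded[starts + seq_length]

toy = np.arange(10)  # toy "text" whose values equal their positions
X, y = make_windows(toy, seq_length=3, step=2)
print(X)
print(y)

# Number of windows produced by range(0, n_char - seq_length, step)
expected_samples = math.ceil((741237 - 100) / 5)
print(expected_samples)
```

Because `step` is much smaller than `seq_length`, consecutive windows overlap heavily, which is why roughly 741k characters yield about 148k samples.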


We need to reshape the sequences before feeding them to the RNN.

In [0]:
# np.reshape() stacks the list of 1-D windows into a 2-D array of shape (n_samples, seq_length).
# The Embedding layer used in the model expects integer indices in exactly this shape.
# Feeding an LSTM directly (without an Embedding layer) would instead need a 3-D array of
# (samples, time steps, features), e.g. np.reshape(input, (n_samples, seq_length, 1)).
inputR = np.reshape(input, (n_samples, seq_length))
print("The input representation of: ", "".join([int2char[c] for c in input[0][:13]]),"is now:")
print(inputR[0][:13])
# The targets are represented as binary vectors via one-hot encoding.
# With one-hot targets the network learns a probability-like score for every possible
# character, which can make the problem easier to model and gives more nuanced
# predictions than a single label would.
targetE = np_utils.to_categorical(target)
print("The target representation of: ",int2char[target[70]]," is now:\n",targetE[70])


The input representation of:  Τώρα τι κλαις is now:
[ 93 132 119 103   1 122 111   1 112 113 103 111 120]
The target representation of:  ι  is now:
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
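
To make the encoding concrete, here is a small NumPy sketch of what one-hot encoding produces. `one_hot` is a hypothetical helper written for illustration; it is not the Keras implementation. One detail worth knowing: when `num_classes` is not given, `to_categorical` infers the width from the largest index it sees, so passing the vocabulary size explicitly is the safer choice if some character might never occur as a target.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return an array with one row per index and a single 1.0 per row."""
    indices = np.asarray(indices)
    encoded = np.zeros((indices.size, num_classes), dtype=np.float32)
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

toy_targets = [2, 0, 3]
toy_encoded = one_hot(toy_targets, num_classes=5)
print(toy_encoded)
print(toy_encoded.argmax(axis=1))  # argmax recovers the original indices
```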


In [0]:
# An alternative representation: one-hot encode the inputs too (for a model without an Embedding layer)
# inputR = np.zeros((n_samples, seq_length, n_vocab), dtype=bool)
# targetE = np.zeros((n_samples, n_vocab), dtype=bool)
# for i, sentence in enumerate(input):
#     for t, char in enumerate(sentence):
#         inputR[i, t, char] = 1
        
#     targetE[i, target[i]] = 1

In [0]:
print("the shape of the input data is:",inputR.shape)
print("the shape of the target data is:",targetE.shape)

the shape of the input data is: (148228, 100)
the shape of the target data is: (148228, 133)


**Building the model**

We will use a Sequential model built around a bidirectional LSTM.

In [0]:
model= Sequential()




In [0]:
rnn_size=512


In [0]:
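# Embedding(input_dim, output_dim): input_dim only needs to cover the largest index + 1, i.e. n_vocab.
# Passing n_samples (as in this run) still works, but it allocates a far larger table than needed.
model.add(Embedding(n_samples, seq_length, input_length=seq_length))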





In [0]:
# Recurrent layer: a bidirectional LSTM (CuDNNLSTM requires a GPU runtime)
model.add(Bidirectional(CuDNNLSTM(rnn_size)))

In [0]:
#Hidden layers 




In [0]:
# Dropout layer (helps against overfitting)
model.add(Dropout(0.3))


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [0]:
#Output layer
model.add(Dense(targetE.shape[1]))

In [0]:
#Activation function
model.add(Activation('softmax'))

In [0]:
#compile model
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam',metrics=['accuracy'])





In [0]:
#model details
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          14822800  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1024)              2514944   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 133)               136325    
_________________________________________________________________
activation_1 (Activation)    (None, 133)               0         
=================================================================
Total params: 17,474,069
Trainable params: 17,474,069
Non-trainable params: 0
_________________________________________________________________
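
The parameter counts in the summary can be reproduced by hand. The sketch below uses the values printed earlier in the notebook. The two bias vectors per LSTM gate are an assumption about CuDNNLSTM, chosen because it makes the arithmetic match the reported figure. The last line shows, for comparison only, how large the embedding table would be if it were sized to the vocabulary.

```python
n_samples, n_vocab = 148228, 133   # values printed earlier in the notebook
embed_dim, rnn_size = 100, 512

# Embedding table as configured in this run: one row per possible index.
embedding_params = n_samples * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights
# and (assumed for CuDNNLSTM) two bias vectors.
lstm_one_direction = 4 * (rnn_size * (embed_dim + rnn_size) + 2 * rnn_size)
bidirectional_params = 2 * lstm_one_direction

# Dense layer: weights from the 2 * rnn_size concatenated outputs, plus biases.
dense_params = (2 * rnn_size) * n_vocab + n_vocab

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)

# For comparison: an embedding table sized to the vocabulary instead.
embedding_params_vocab = n_vocab * embed_dim
print(embedding_params_vocab)
```

Most of the 17.5M parameters sit in the embedding table, and only the first `n_vocab` rows of that table can ever be looked up.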


***(Callbacks)***

In [0]:
filepath="/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:{epoch:03d}-val_acc:{val_acc:.5f}.hdf5"
# Checkpoints go to a folder called CheckpointsLyricsGen on Drive.
# Each file name records the epoch number and the validation accuracy;
# the files contain the weights of the network.

In [0]:
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose = 1, save_best_only = True, mode ='max')
# monitor='val_acc' with mode='max' tracks the validation accuracy;
# save_best_only=True writes a checkpoint only when it exceeds the best value seen so far


In [0]:
callbacks_list = [checkpoint]
# Kept as a list so that other callbacks can be appended and passed to fit() during training;
# Keras invokes every callback in the list at its hooks, e.g. at the end of each epoch
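
The checkpointing rule can be illustrated without Keras. This is a plain-Python sketch of the "save only on improvement" behaviour, not the library's implementation. `simulate_checkpoints` is a hypothetical helper, the file name template is shortened to its last path component, and the accuracy values are the ones that appear in the training log further down.

```python
filepath_template = "epochs:{epoch:03d}-val_acc:{val_acc:.5f}.hdf5"
logged_val_acc = [0.38660, 0.44411, 0.47251, 0.48910, 0.49787]  # from the training log

def simulate_checkpoints(val_accs, template):
    """Mimic save_best_only=True with mode='max': keep a file only on improvement."""
    best = float("-inf")
    saved = []
    for epoch, val_acc in enumerate(val_accs, start=1):
        if val_acc > best:
            best = val_acc
            saved.append(template.format(epoch=epoch, val_acc=val_acc))
    return saved

saved_files = simulate_checkpoints(logged_val_acc, filepath_template)
print(saved_files)

# A made-up run with a dip: the second epoch does not beat the first, so it is skipped.
plateau_files = simulate_checkpoints([0.40, 0.39, 0.41], filepath_template)
print(plateau_files)
```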

Training the model

In [0]:
# To resume training, uncomment the line below and point it at the right checkpoint

#model.load_weights('/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:011-val_acc:0.49730.hdf5')


In [0]:
print('Training model...')

Training model...


In [0]:
#fit the model
model.fit(inputR,
          targetE,
          epochs=30,
          batch_size=128,
          shuffle= True,
          initial_epoch=0,
          callbacks=callbacks_list,
          validation_split = 0.2,
          validation_data = None,
          validation_steps = None)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 118582 samples, validate on 29646 samples
Epoch 1/30






Epoch 00001: val_acc improved from -inf to 0.38660, saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:001-val_acc:0.38660.hdf5
Epoch 2/30

Epoch 00002: val_acc improved from 0.38660 to 0.44411, saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:002-val_acc:0.44411.hdf5
Epoch 3/30

Epoch 00003: val_acc improved from 0.44411 to 0.47251, saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:003-val_acc:0.47251.hdf5
Epoch 4/30

Epoch 00004: val_acc improved from 0.47251 to 0.48910, saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:004-val_acc:0.48910.hdf5
Epoch 5/30

Epoch 00005: val_acc improved from 0.48910 to 0.49787, saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/ep
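
The sample counts in the log follow directly from `validation_split`. The short calculation below uses the values from this notebook. Keras takes the validation samples from the end of the arrays before shuffling, so here the validation set corresponds to the last part of the text.

```python
import math

n_samples = 148228
validation_split = 0.2
batch_size = 128

# The last fraction of the arrays is held out for validation.
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# Number of weight updates per epoch on the training part.
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```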

Load weights for generation

In [0]:

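# Load weights (choose the right checkpoint filename).
# The model-building cells above must have been run in this session first; otherwise `model` is undefined.
model.load_weights('/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:006-val_acc:0.50668.hdf5')
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')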

NameError: ignored

Lyrics Generation

In [0]:
# pick a random training sequence to use as the seed text:
start = np.random.randint(0, len(input)-1)
pattern = input[start]
print('Seed : ')
print("\"",''.join([int2char[v] for v in pattern]), "\"\n")

# How many characters you want to generate
generated_characters = 300

results=[]

for i in range(generated_characters):
    x = np.reshape(pattern, ( 1, len(pattern)))
    
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)  # greedy choice: always take the most likely character

    result = int2char[index]

    results.append(result)
    sys.stdout.write(result)
    
    # slide the context window: append the predicted index, then drop the oldest one
    pattern = np.append(pattern, index)
    pattern = pattern[1:]
# print("\"",''.join(results), "\"\n")    
print('\nDone')

NameError: ignored
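
The loop above always picks the most likely character (greedy decoding), which can lead to repetitive output. A common alternative, not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is one possible way to do that. `sample_index`, `generate` and `toy_predict` are hypothetical names, and `toy_predict` merely stands in for `model.predict` so that the example runs without a trained model.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64).ravel()
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

def generate(predict_fn, seed, n_chars, temperature=1.0, rng=None):
    """Generate n_chars indices, sliding the context window one step each time."""
    pattern = np.array(seed)
    out = []
    for _ in range(n_chars):
        probs = predict_fn(pattern.reshape(1, -1))
        index = sample_index(probs, temperature, rng)
        out.append(index)
        pattern = np.append(pattern[1:], index)  # drop oldest, append newest
    return out

# Stand-in for model.predict: strongly favours (last index + 1) mod 4.
def toy_predict(x):
    probs = np.full(4, 0.01)
    probs[(int(x[0, -1]) + 1) % 4] = 0.97
    return probs.reshape(1, -1)

generated = generate(toy_predict, seed=[0, 1, 2], n_chars=5,
                     temperature=0.05, rng=np.random.default_rng(0))
print(generated)
```

Low temperatures behave almost like `argmax`, while values closer to 1.0 follow the model's distribution and produce more varied text.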