
In [0]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Activation, Dropout, Dense, CuDNNLSTM, Embedding, GRU, CuDNNGRU
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.utils import np_utils
import numpy as np
import pandas as pd
import sys


Load the data (lyrics) from Google Drive.

In [66]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The data is raw text in a .txt file. We chose not to clean or edit it, so that the generated lyrics stay closer to real ones.

In [0]:

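#read the whole file as one string; the explicit encoding keeps the Greek characters intact
with open('/content/drive/My Drive/Colab Notebooks/entexna.txt', 'r', encoding='utf-8') as f:
    text = f.read()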

Check the text

In [68]:
print(repr(text[:100]))  #show the first 100 characters of the text


'Στον δρόμο έκαιγε η άσφαλτος, ο Αύγουστος μου φέρνει ζάλη, μα αυτός ακούμπησε στον ώμο μου να γείρω '


In [69]:
n_char=len(text)
print('Length of text: %i characters' % n_char) #length = number of characters in the text


Length of text: 658282 characters


In [70]:
vocab=sorted(set(text)) #making the vocabulary of characters
n_vocab=len(vocab) 
print('number of unique characters: %i' %n_vocab)

number of unique characters: 103


We need to map characters to numbers: neural networks operate on numeric input, not on raw text.

In [71]:
char2int={c: i for i, c in enumerate(vocab)} #map characters to int
int2char={i: c for i, c in enumerate(vocab)} #map int to char (for "translation")

print(char2int) #print the result of mapping the characters in the vocabulary
print(int2char)

{'\n': 0, ' ': 1, '!': 2, ',': 3, '.': 4, '0': 5, '1': 6, '2': 7, '3': 8, '4': 9, '5': 10, '6': 11, '7': 12, '8': 13, '9': 14, ';': 15, 'Ά': 16, 'Έ': 17, 'Ή': 18, 'Ί': 19, 'Ό': 20, 'Ύ': 21, 'Ώ': 22, 'ΐ': 23, 'Α': 24, 'Β': 25, 'Γ': 26, 'Δ': 27, 'Ε': 28, 'Ζ': 29, 'Η': 30, 'Θ': 31, 'Ι': 32, 'Κ': 33, 'Λ': 34, 'Μ': 35, 'Ν': 36, 'Ξ': 37, 'Ο': 38, 'Π': 39, 'Ρ': 40, 'Σ': 41, 'Τ': 42, 'Υ': 43, 'Φ': 44, 'Χ': 45, 'Ψ': 46, 'Ω': 47, 'ά': 48, 'έ': 49, 'ή': 50, 'ί': 51, 'α': 52, 'β': 53, 'γ': 54, 'δ': 55, 'ε': 56, 'ζ': 57, 'η': 58, 'θ': 59, 'ι': 60, 'κ': 61, 'λ': 62, 'μ': 63, 'ν': 64, 'ξ': 65, 'ο': 66, 'π': 67, 'ρ': 68, 'ς': 69, 'σ': 70, 'τ': 71, 'υ': 72, 'φ': 73, 'χ': 74, 'ψ': 75, 'ω': 76, 'ϊ': 77, 'ϋ': 78, 'ό': 79, 'ύ': 80, 'ώ': 81, 'ἀ': 82, 'ἁ': 83, 'ἆ': 84, 'Ἀ': 85, 'ἐ': 86, 'ἕ': 87, 'Ἐ': 88, 'ἡ': 89, 'ἶ': 90, 'ὁ': 91, 'ὅ': 92, 'ὐ': 93, 'ὰ': 94, 'ὲ': 95, 'ὴ': 96, 'ὶ': 97, 'ὸ': 98, 'ὺ': 99, 'ᾶ': 100, 'ῖ': 101, 'ῦ': 102}
{0: '\n', 1: ' ', 2: '!', 3: ',', 4: '.', 5: '0', 6: '1', 7: '2', 8: '3', 9: '

In [0]:
text_as_int=np.array([char2int[c] for c in text]) #map the data as int

In [73]:
# Show a sample of our data mapped from text to integers
print ('%s --[chars to int] -- > %s' %(repr(text[100:119]), text_as_int[100:119]))

'πάνω το κεφάλι. Να ' --[chars to int] -- > [67 48 64 76  1 71 66  1 61 56 73 48 62 60  4  1 36 52  1]
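The mapping can be checked with a round trip: encode a string to integers, decode it again, and compare. The snippet below is a minimal sketch on a toy string; the names `sample`, `toy_vocab`, `toy_char2int` and `toy_int2char` are illustrative and not part of the notebook.

```python
# Toy round trip between characters and integer ids (the sample text is illustrative).
sample = "καλή μέρα"

# Same construction as in the notebook, applied to the toy string.
toy_vocab = sorted(set(sample))
toy_char2int = {c: i for i, c in enumerate(toy_vocab)}
toy_int2char = {i: c for i, c in enumerate(toy_vocab)}

# Encode every character, then decode the ids back into text.
encoded = [toy_char2int[c] for c in sample]
decoded = "".join(toy_int2char[i] for i in encoded)

print(encoded)
print(decoded)
```

If `decoded` equals `sample`, the two dictionaries are inverses of each other.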


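To feed the network, we need to divide the text into samples (sequences).

We also split the data into inputs and targets: each input is a sequence of seq_length characters, and its target is the character that follows it.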


In [74]:
print('Making samples (sequences) and dividing data into input and target...\n')
seq_length = 100 #number of characters per input sequence
#e.g. seq_length=3, text=καλή: input=καλ, target=ή
target=[]
input=[]
step=5 #stride between the starts of consecutive sequences; a smaller step yields more samples
for i in range(0, n_char-seq_length, step):

  input.append(text_as_int[i:i+seq_length])
  target.append(text_as_int[i+seq_length])

print('Input and target data example:')
print("input 2:", "".join([int2char[c] for c in input[2]]))
print("target 2:", int2char[target[2]])


n_samples=len(input)
print("\nNumber of samples:",n_samples)



Making samples (sequences) and dividing data into input and target...

Input and target data example:
input 2:  έκαιγε η άσφαλτος, ο Αύγουστος μου φέρνει ζάλη, μα αυτός ακούμπησε στον ώμο μου να γείρω πάνω το κε
target 2: φ

Number of samples: 131637
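The windowing step can be reproduced on a toy string to see exactly which characters end up as inputs and targets. This is a minimal sketch: `make_windows` is a hypothetical helper that mirrors the loop in the cell above, and the last line redoes the sample-count arithmetic with the text length, sequence length and step reported in this notebook.

```python
def make_windows(ids, seq_length, step):
    """Slide a window of seq_length items over ids, moving step items each time."""
    inputs, targets = [], []
    for i in range(0, len(ids) - seq_length, step):
        inputs.append(ids[i:i + seq_length])   # the window itself
        targets.append(ids[i + seq_length])    # the item right after the window
    return inputs, targets

# Toy example on characters instead of integer ids.
toy_inputs, toy_targets = make_windows(list("καλή μέρα"), seq_length=3, step=2)
for window, nxt in zip(toy_inputs, toy_targets):
    print("".join(window), "->", nxt)

# Number of samples for the real text: one window per start index.
n_expected = len(range(0, 658282 - 100, 5))
print(n_expected)
```

The count agrees with the number of samples printed above.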


We need to reshape the sequences into the form the RNN expects.

In [75]:
#np.reshape() turns the list of sequences into a two-dimensional array of shape (n_samples, seq_length).
#The Embedding layer later adds the third (feature) dimension that the recurrent layers need.
inputR=np.reshape(input,(n_samples, seq_length))
print("The input representation of: ", "".join([int2char[c] for c in input[0][:13]]),"is now:")
print(inputR[0][:13])
#We can represent the target variables as binary vectors with one-hot encoding.
#"This way we can give the RNN more expressive power to learn a probability-like number for each possible label value.
#This can help make the problem easier for the network to model.
#When a one hot encoding is used for the output variable, it may offer a more nuanced set of predictions than a single label."
targetE= np_utils.to_categorical(target)
print("The target representation of: ",int2char[target[60]]," is now:\n",targetE[60])


The input representation of:  Στον δρόμο έκ is now:
[41 71 66 64  1 55 68 79 63 66  1 49 61]
The target representation of:  έ  is now:
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]


In [0]:
#another way of reshaping: one-hot encode the inputs as well as the targets
# inputR = np.zeros((n_samples, seq_length, n_vocab), dtype=bool)
# targetE = np.zeros((n_samples, n_vocab), dtype=bool)
# for i, sentence in enumerate(input):
#     for t, char in enumerate(sentence):
#         inputR[i, t, char] = 1
        
#     targetE[i, target[i]] = 1

In [77]:
print("the shape of the input data is:",inputR.shape)
print("the shape of the target data is:",targetE.shape)

the shape of the input data is: (131637, 100)
the shape of the target data is: (131637, 102)
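The target array has 102 columns although the vocabulary has 103 characters. A likely explanation is that `to_categorical`, when called without `num_classes`, infers the number of classes as the largest label plus one, so the width depends on which characters actually occur as targets. The sketch below imitates that behaviour in plain Python; `one_hot` is a hypothetical stand-in, not the Keras function.

```python
def one_hot(labels, num_classes=None):
    """One-hot encode integer labels; infer the width from the data if not given."""
    if num_classes is None:
        num_classes = max(labels) + 1  # assumed to match the to_categorical default
    return [[1.0 if j == y else 0.0 for j in range(num_classes)] for y in labels]

toy_targets = [0, 3, 1]

inferred = one_hot(toy_targets)             # width follows the largest label: 4
fixed = one_hot(toy_targets, num_classes=6)  # width fixed to the vocabulary size

print(len(inferred[0]), len(fixed[0]))
print(inferred[1])
```

Passing the vocabulary size explicitly (here `n_vocab`) would keep the output layer aligned with the full vocabulary regardless of which characters appear as targets.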


**Building the model**

We will use a Sequential model with bidirectional LSTM layers.

In [0]:
model= Sequential()

In [0]:
rnn_size=512


In [0]:
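#Embedding(input_dim, output_dim): input_dim only needs to cover the vocabulary (n_vocab);
#n_samples works, but it allocates far more embedding rows than are ever looked up
model.add(Embedding(n_samples, seq_length,input_length=seq_length, trainable=True))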

In [0]:
#first recurrent layer (returns the full sequence for the next recurrent layer)
model.add(Bidirectional( CuDNNLSTM(rnn_size, return_sequences=True)))

In [0]:
#second recurrent layer (returns a single vector per sequence)
model.add(Bidirectional( CuDNNLSTM(rnn_size)))

In [0]:
#Dropout layer(avoid overfitting)
model.add(Dropout(0.2))

In [0]:
#Output layer
model.add(Dense(targetE.shape[1]))

In [0]:
#Activation function
model.add(Activation('softmax'))

In [0]:
adam = Adam(lr=0.001)

In [0]:
#compile model
model.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics=['accuracy'])

In [135]:
#model details
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 100, 100)          13163700  
_________________________________________________________________
bidirectional_2 (Bidirection (None, 100, 1024)         2514944   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 1024)              6299648   
_________________________________________________________________
dropout_5 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 102)               104550    
_________________________________________________________________
activation_5 (Activation)    (None, 102)               0         
Total params: 22,082,842
Trainable params: 22,082,842
Non-trainable params: 0
__________________________________________
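The parameter counts in the summary can be reproduced by hand. The sketch below assumes that each `CuDNNLSTM` gate has input weights, recurrent weights and two bias vectors; the helper `cudnn_lstm_params` is hypothetical and only does the arithmetic.

```python
n_samples = 131637  # used as the Embedding input_dim in this notebook
n_vocab = 103
embed_dim = 100     # Embedding output_dim (set to seq_length above)
rnn_size = 512
n_classes = 102

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and (assumed) two biases
    return 4 * (units * (input_dim + units) + 2 * units)

embedding = n_samples * embed_dim
bi_lstm_1 = 2 * cudnn_lstm_params(embed_dim, rnn_size)     # forward + backward
bi_lstm_2 = 2 * cudnn_lstm_params(2 * rnn_size, rnn_size)  # input is the 1024-wide output
dense = 2 * rnn_size * n_classes + n_classes               # weights + biases
total = embedding + bi_lstm_1 + bi_lstm_2 + dense

# For comparison: an embedding table sized to the vocabulary instead.
embedding_vocab_sized = n_vocab * embed_dim

print(embedding, bi_lstm_1, bi_lstm_2, dense, total)
print(embedding_vocab_sized)
```

More than half of the parameters sit in the embedding table, because it has one row per training sample rather than one row per character.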

**Callbacks**

In [0]:
filepath="/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:{epoch:03d}-val_acc:{val_acc:.5f}.hdf5"
#checkpoints go to a folder called CheckpointsLyricsGen in Drive
#each file name contains the epoch number and the validation accuracy
#each file stores the model (including its weights) at that epoch

In [0]:
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose = 1, save_best_only = False, mode ='max')
#the checkpoint monitors validation accuracy
#with save_best_only=False a file is written after every epoch, not only when validation accuracy improves


In [0]:
callbacks_list = [checkpoint]
#kept as a list so that other callbacks can be appended and passed to fit() while training
#all the callbacks in the list will be called after every epoch
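The placeholders in the checkpoint file name follow Python's format-string syntax. As a minimal sketch, assuming Keras fills them with the 1-based epoch number and the logged metrics, the expansion can be imitated with `str.format`; the values below are illustrative, and `template` is just the file-name part of `filepath`.

```python
# File-name part of the checkpoint path used in this notebook.
template = "epochs:{epoch:03d}-val_acc:{val_acc:.5f}.hdf5"

# {epoch:03d}   -> integer, zero-padded to three digits
# {val_acc:.5f} -> float with five decimal places
name = template.format(epoch=1, val_acc=0.3788)
print(name)
```

Zero-padding the epoch number keeps the files in training order when they are sorted by name.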


Training the model

In [0]:
#if we need to train more: uncomment the code below with the correct checkpoint 

#model.load_weights('/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:011-val_acc:0.49730.hdf5')


In [139]:
print('Training model...')

Training model...


In [140]:
#fit the model
model.fit(inputR,
          targetE,
          epochs=20,
          batch_size=128,
          shuffle= True,
          initial_epoch=0,
          callbacks=callbacks_list,
          validation_split = 0.2,
          validation_data = None,
          validation_steps = None)

Train on 105309 samples, validate on 26328 samples
Epoch 1/20

Epoch 00001: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:001-val_acc:0.37880.hdf5
Epoch 2/20

Epoch 00002: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:002-val_acc:0.45359.hdf5
Epoch 3/20

Epoch 00003: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:003-val_acc:0.48264.hdf5
Epoch 4/20

Epoch 00004: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:004-val_acc:0.50471.hdf5
Epoch 5/20

Epoch 00005: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:005-val_acc:0.51128.hdf5
Epoch 6/20

Epoch 00006: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:006-val_acc:0.51193.hdf5
Epoch 7/20

Epoch 00007: saving model to /content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:007-val_acc:0.50687.hdf5
Epoch 8/20


KeyboardInterrupt: ignored
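The train and validation sizes in the log follow from `validation_split`. The sketch below assumes that Keras computes the split index as `int(n * (1 - validation_split))` and holds out the last samples before shuffling; the variable names are illustrative.

```python
n_samples = 131637
validation_split = 0.2
batch_size = 128

# Samples before the split index are used for training, the rest for validation.
split_at = int(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# Batches per epoch, rounding up for the last partial batch.
steps_per_epoch = -(-n_train // batch_size)

print(n_train, n_val, steps_per_epoch)
```

Because the held-out samples come from the end of the text, validation accuracy here measures performance on the last songs in the file rather than on a random sample.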

Load weights for generation

In [0]:

#Load weights (choose the right filename)
model.load_weights('/content/drive/My Drive/Colab Notebooks/CheckpointsLyricsGen/epochs:008-val_acc:0.49939.hdf5')
#compile model                                                                       
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

Lyrics Generation

In [156]:
# option 1: a random seed taken from the training sequences
start = np.random.randint(0, len(input))  #the upper bound is exclusive
random_pattern = input[start] 


# option 2: a fixed, hand-picked seed
seed="Η Άννα"
seed_int=[char2int[c] for c in seed]
pad_len=seq_length-len(seed_int)
set_pattern=np.pad(seed_int,(pad_len,0),constant_values=char2int[" "]) #left-pad the seed with spaces up to seq_length

pattern = set_pattern   #choose which seed to use: set_pattern or random_pattern

# if pattern.all() == set_pattern.all():
print('Seed : ')
print(seed)
# elif pattern.all() == random_pattern.all():
# print('Seed : ')
# print("\"",''.join([int2char[v] for v in random_pattern]), "\"\n")
# else:
#   print("No seed")



# How many characters you want to generate
generated_characters = 300

results=[]

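for i in range(generated_characters):
    x = np.reshape(pattern, (1, len(pattern)))  #batch containing one sequence

    prediction = model.predict(x, verbose=0)  #one probability per character

    index = np.argmax(prediction)  #greedy choice: the most probable character

    result = int2char[index]

    results.append(result)
    # sys.stdout.write(result)

    pattern = np.append(pattern, index)  #append the predicted character...

    pattern = pattern[1:]  #...and drop the oldest one so the length stays seq_length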
print("Generated text:")
print("\"",''.join(results), "\"\n")    
print('\nDone')

Seed : 
Η Άννα
Generated text:
"  μακριά και με παίρνεις με το φως το φως το φεγγάρι σου και με παγούδες μου είσαι εσύ που το αίμα και με το φως το παραθείο και το φως το φεγγάρι σου το παραμό που σου το παραπάνε. Πάντα μου είπες που το αίμα και μου λέξεις το φως μου το παραθάκι το φως το χρώμα της καρδιάς σου με το φως το φεγγάρι  "


Done
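The two mechanics of the generation loop, left-padding a short seed and sliding the window after each prediction, can be seen on small integer lists. This is a plain-Python sketch: `left_pad` and `slide` are hypothetical helpers that mimic the `np.pad` call and the append-then-drop step above, and `left_pad` assumes the seed is no longer than the target length.

```python
def left_pad(ids, length, pad_id):
    """Prepend pad_id until the list has the requested length."""
    return [pad_id] * (length - len(ids)) + list(ids)

def slide(window, new_id):
    """Drop the oldest id and append the newly predicted one."""
    return window[1:] + [new_id]

seed_ids = [7, 8, 9]                            # stands in for the encoded seed
window = left_pad(seed_ids, length=6, pad_id=1)  # 1 plays the role of the space id
print(window)

window = slide(window, 4)                        # 4 plays the role of a predicted id
print(window)
```

The window length never changes, so every call to the model sees an input of the same shape as during training.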
