<a href="https://colab.research.google.com/github/mojgan65/Poem-Generation/blob/main/Hafez_Poem_Generation_Using_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook generates poems in the style of Hafez using a character-level RNN.

### Import TensorFlow and other libraries

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
# tf.enable_eager_execution()

import numpy as np
import os
import time
import glob

In [None]:
from google.colab import files
uploaded = files.upload()

Saving hafez.txt to hafez.txt


In [None]:
path_to_file = 'hafez.txt'

### Read the data

First, take a look at the text.

In [None]:
# Read, then decode for py2 compat.
# text = open('poems', 'rb').read().decode(encoding='utf-8')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

  
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

# Remove extraneous characters (punctuation and invisible marks)
excluded = '!()*-.1:=[]«»;؛,،~?؟#\u200f\ufeff'
out = ""

for char in text:
  if char not in excluded:
    out += char
text = out
text = text.replace("\t\t\t", "\t")
text = text.replace("\r\r\n", "\n")
text = text.replace("\r\n","\n")
text = text.replace("\t\n", "\n")
text = text.replace("\n\n", "\n")
text = '\n'.join(line for line in text.split('\n') if len(line) >= 7)


vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

Length of text: 296677 characters
48 unique characters
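
The character filter above can also be written with `str.translate`, which avoids building the string one character at a time. The following is a minimal sketch rather than a drop-in replacement: `clean_text` and `min_line_length` are hypothetical names, and the line-ending handling (`splitlines` plus stripping trailing tabs) only approximates the chain of `replace` calls in the cell above.

```python
# Characters to drop, copied from the cleaning cell above.
EXCLUDED = '!()*-.1:=[]«»;؛,،~?؟#\u200f\ufeff'
DELETE_TABLE = str.maketrans('', '', EXCLUDED)

def clean_text(raw, min_line_length=7):
    """Drop excluded characters, normalize line endings, and skip short lines."""
    stripped = raw.translate(DELETE_TABLE)
    # splitlines() handles \n, \r\n and \r; rstrip removes trailing tabs.
    lines = (line.rstrip('\t') for line in stripped.splitlines())
    return '\n'.join(line for line in lines if len(line) >= min_line_length)

sample = "abc!defgh\r\nxy\nhello, world\t\n"
print(repr(clean_text(sample)))  # 'abcdefgh\nhello world'
```

The short line `xy` is dropped because it has fewer than 7 characters, mirroring the `len(line) >= 7` filter.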


In [None]:
# Take a look at the first 250 characters in text
print(text[:250])

الا يا ايها الساقی ادر کاسا و ناولها
که عشق آسان نمود اول ولی افتاد مشکل‌ها
به بوی نافه‌ای کاخر صبا زان طره بگشايد
ز تاب جعد مشکينش چه خون افتاد در دل‌ها
مرا در منزل جانان چه امن عيش چون هر دم
جرس فرياد می‌دارد که بربنديد محمل‌ها
به می سجاده رنگين کن


## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [None]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

Now we have an integer representation for each character. Notice that we mapped the characters to indices from 0 to `len(vocab) - 1`.

In [None]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  'Y' :   2,
  'آ' :   3,
  'ا' :   4,
  'ب' :   5,
  'ت' :   6,
  'ث' :   7,
  'ج' :   8,
  'ح' :   9,
  'خ' :  10,
  'د' :  11,
  'ذ' :  12,
  'ر' :  13,
  'ز' :  14,
  'س' :  15,
  'ش' :  16,
  'ص' :  17,
  'ض' :  18,
  'ط' :  19,
  ...
}


In [None]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'الا يا ايها ا' ---- characters mapped to int ---- > [ 4 25  4  1 30  4  1  4 30 28  4  1  4]
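
To make the two lookup tables concrete, here is a small self-contained sketch on a toy string. The helper names `build_lookup`, `encode`, and `decode` are illustrative and do not appear in the notebook; the point is only that encoding followed by decoding returns the original text.

```python
def build_lookup(text):
    """Return the sorted vocabulary and a character-to-index mapping."""
    vocab = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    return vocab, char2idx

def encode(text, char2idx):
    """Map each character to its integer index."""
    return [char2idx[ch] for ch in text]

def decode(ids, vocab):
    """Map integer indices back to characters."""
    return ''.join(vocab[i] for i in ids)

vocab_demo, char2idx_demo = build_lookup("hello")
ids = encode("hello", char2idx_demo)
print(vocab_demo)               # ['e', 'h', 'l', 'o']
print(ids)                      # [1, 0, 2, 2, 3]
print(decode(ids, vocab_demo))  # hello
```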


### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform. The input to the model will be a sequence of characters, and we train the model to predict the output—the following character at each time step.

Because RNNs maintain an internal state that depends on the previously seen elements, the question becomes: given all the characters seen so far, what is the next character?


### Create training examples and targets

Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".
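
The "Hello" example can be checked with a few lines of plain Python. This is only a sketch of the chunk-and-shift idea on strings; `make_examples` is a hypothetical helper, and the notebook itself performs the same step on integer tensors with `tf.data`.

```python
def make_examples(text, seq_length):
    """Split text into chunks of seq_length + 1 and return (input, target) pairs."""
    chunk = seq_length + 1
    pairs = []
    # Any remainder shorter than a full chunk is dropped.
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        pairs.append((piece[:-1], piece[1:]))
    return pairs

print(make_examples("Hello", 4))  # [('Hell', 'ello')]
```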

To do this first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [None]:
# The maximum length of a single input sequence, in characters
seq_length = 200
# Each training example consumes seq_length + 1 characters (input plus shifted target)
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

ا
ل
ا
 
ي


The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'الا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فريا'
'د می\u200cدارد که بربنديد محمل\u200cها\nبه می سجاده رنگين کن گرت پير مغان گويد\nکه سالک بی\u200cخبر نبود ز راه و رسم منزل\u200cها\nشب تاريک و بيم موج و گردابی چنين هايل\nکجا دانند حال ما سبکباران ساحل\u200cها\nهمه کارم ز خود کامی ب'
'ه بدنامی کشيد آخر\nنهان کی ماند آن رازی کز او سازند محفل\u200cها\nحضوری گر همی\u200cخواهی از او غايب مشو حافظ\nمتی ما تلق من تهوی دع الدنيا و اهملها\nصلاح کار کجا و من خراب کجا\nببين تفاوت ره کز کجاست تا به کجا\nدلم ز'
' صومعه بگرفت و خرقه سالوس\nکجاست دير مغان و شراب ناب کجا\nچه نسبت است به رندی صلاح و تقوا را\nسماع وعظ کجا نغمه رباب کجا\nز روی دوست دل دشمنان چه دريابد\nچراغ مرده کجا شمع آفتاب کجا\nچو کحل بينش ما خاک آستان'
' شماست\nکجا رويم بفرما از اين جناب کجا\nمبين به سيب زنخدان که چاه در راه است\nکجا همی\u200cروی ای دل بدين شتاب 

For each sequence, duplicate and shift it to form the input and target text by using the `map` method to apply a simple function to each batch:

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print the first example's input and target values:

In [None]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'الا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فري'
Target data: 'لا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فريا'


Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "ا" and tries to predict the index for "ل" as the next character. At the next time step, it does the same thing, but the `RNN` considers the previous step's context in addition to the current input character.

In [None]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 4 ('ا')
  expected output: 25 ('ل')
Step    1
  input: 25 ('ل')
  expected output: 4 ('ا')
Step    2
  input: 4 ('ا')
  expected output: 1 (' ')
Step    3
  input: 1 (' ')
  expected output: 30 ('ي')
Step    4
  input: 30 ('ي')
  expected output: 4 ('ا')


### Create training batches

We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.

In [None]:
# Batch size
BATCH_SIZE = 128
steps_per_epoch = examples_per_epoch//BATCH_SIZE

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<_BatchDataset element_spec=(TensorSpec(shape=(128, 200), dtype=tf.int64, name=None), TensorSpec(shape=(128, 200), dtype=tf.int64, name=None))>
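
The comment about the shuffle buffer can be illustrated with a simplified model in plain Python. This is a sketch of the general bounded-buffer idea, not TensorFlow's actual implementation; `buffered_shuffle` and its `seed` parameter are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a partially random order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=3)))
```

With a buffer of 3, the first element emitted can only be one of the first three inputs. A buffer at least as large as the dataset gives a full shuffle, which is the case here if the number of 201-character sequences is below `BUFFER_SIZE = 10000`.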

## Build The Model

Use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps the index of each character to a vector with `embedding_dim` dimensions.
* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units`. (You can also use an LSTM layer here.)
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs.

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Next define a function to build the model.

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
        return_sequences=True,
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [None]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

For each character, the model looks up the embedding, runs the GRU one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

![A drawing of the data passing through the model](https://tensorflow.org/tutorials/sequences/images/text_generation_training.png)

## Try the model

Now run the model to see that it behaves as expected.

First check the shape of the output:

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(128, 200, 48) # (batch_size, sequence_length, vocab_size)


In the above example the sequence length of the input is `200`, but the model can be run on inputs of any length:

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (128, None, 256)          12288     
                                                                 
 gru (GRU)                   (128, None, 1024)         3938304   
                                                                 
 dense (Dense)               (128, None, 48)           49200     
                                                                 
Total params: 3,999,792
Trainable params: 3,999,792
Non-trainable params: 0
_________________________________________________________________
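
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses three gates with two bias vectors each (the `reset_after=True` layout that `tf.keras.layers.GRU` uses by default in TensorFlow 2); under that assumption the numbers match the summary exactly. The function names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel,
    # and two bias vectors (assumes the reset_after=True layout).
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output.
    return input_dim * outputs + outputs

counts = [
    embedding_params(48, 256),  # 12288
    gru_params(256, 1024),      # 3938304
    dense_params(1024, 48),     # 49200
]
print(counts, sum(counts))      # total: 3999792
```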


To get actual predictions from the model we need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.

Note: It is important to _sample_ from this distribution as taking the _argmax_ of the distribution can easily get the model stuck in a loop.
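
The difference between sampling and taking the argmax can be seen on a toy logit vector. This is a plain-Python sketch that stands in for `tf.random.categorical`; the logits and the helper names are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index according to the softmax probabilities."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

toy_logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
greedy = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
draws = [sample_index(toy_logits, rng) for _ in range(1000)]

print("argmax always picks:", greedy)
print("distinct sampled indices:", sorted(set(draws)))
```

The argmax returns index 0 every time, whereas sampling also visits the less likely indices, which is what keeps generated text from repeating itself.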

Try it for the first example in the batch:

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index:

In [None]:
print(len(sampled_indices))
sampled_indices

200


array([ 2, 12, 21, 20, 15, 39, 41, 38,  2, 37, 39, 25,  1, 28, 31, 33, 37,
       40,  3,  2, 27,  9, 15, 17, 41,  7,  7,  2, 13, 33,  2, 31, 25, 23,
       39,  7,  8,  7, 26, 44, 26, 41, 10,  4, 47, 16,  7, 35, 18, 37,  7,
       25, 20, 22, 16, 28, 40, 21, 15, 46, 33,  5, 42,  5, 13, 47,  1,  7,
        3,  5, 32,  7, 38,  4,  8, 16, 28,  3,  8, 25, 35, 46,  8,  7, 23,
       25, 38,  2, 13, 15,  5, 34, 22, 39, 36, 44, 33,  9, 42, 26,  4, 34,
       46, 20, 42, 19,  8,  8, 26, 32, 16,  0,  5, 30, 12, 33,  2, 40,  3,
       10, 45, 15, 30, 44, 47, 24,  3,  8, 15,  8, 46,  5, 36, 25, 47, 34,
        5, 10, 35,  4, 27, 26, 11, 45,  9, 42, 46, 36, 34,  5, 30, 15, 22,
       46,  1, 35, 18, 25, 18,  6, 41, 40, 42, 40, 35, 26, 13, 23,  3, 35,
       45,  7, 37, 30, 34, 12, 36, 32,  1, 24,  6,  7,  6,  5, 34, 24, 18,
       12, 42,  7, 36,  0,  0, 21,  7, 32,  1, 11,  6, 14])

Decode these to see the text predicted by this untrained model:

In [None]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0].numpy()])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 'ظی دامی\nگله از زاهد بدخو نکنم رسم اين است\nکه چو صبحی بدمد در پی اش افتد شامی\nيار من چون بخرامد به تماشای چمن\nبرسانش ز من ای پيک صبا پيغامی\nآن حريفی که شب و روز می صاف کشد\nبود آيا که کند ياد ز دردآشامی'

Next Char Predictions: 
 'Yذعظس۲۴۱Y۰۲ل هپژ۰۳آYنحسص۴ثثYرژYپلف۲ثجثم۷م۴خا\u200cشثگض۰ثلظغشه۳عس۹ژب۵بر\u200c ثآبچث۱اجشهآجلگ۹جثفل۱Yرسبکغ۲ی۷ژح۵ماک۹ظ۵طججمچش\nبيذژY۳آخ۸سي۷\u200cقآجسج۹بیل\u200cکبخگانمد۸ح۵۹یکبيسغ۹ گضلضت۴۳۵۳گمرفآگ۸ث۰يکذیچ قتثتبکقضذ۵ثی\n\nعثچ دتز'


## Train the model

At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because our model returns logits, we need to set the `from_logits` flag.


In [None]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (128, 200, 48)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       3.8705091
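
The initial loss of about 3.87 is close to ln(48) ≈ 3.871, the loss a uniform guess over the 48-character vocabulary would give, which suggests the untrained model's predictions are nearly uniform. The sketch below computes sparse categorical cross-entropy from logits for a single time step in plain Python; the function name is illustrative and this is not the Keras implementation.

```python
import math

def sparse_crossentropy_from_logits(logits, label):
    """Cross-entropy of one integer label against unnormalized logits."""
    m = max(logits)
    # log-sum-exp, shifted by the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

uniform_logits = [0.0] * 48
print(round(sparse_crossentropy_from_logits(uniform_logits, 5), 4))  # 3.8712
```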


Configure the training procedure using the `tf.keras.Model.compile` method. We'll use the Adam optimizer (`optimizer='adam'`, i.e. `tf.keras.optimizers.Adam`) with default arguments and the loss function defined above.

In [None]:
model.compile(
    optimizer='adam',
    loss = loss,
    metrics=['accuracy'])

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Execute the training

Train the model for 100 epochs (`EPOCHS=100`). In Colab, set the runtime to GPU for faster training.

In [None]:
EPOCHS=100

In [None]:
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch,
                    callbacks=[checkpoint_callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
...

## Generate text

### Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different `batch_size`, we need to rebuild the model and restore the weights from the checkpoint.


In [None]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_100'

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (1, None, 256)            12288     
                                                                 
 gru_1 (GRU)                 (1, None, 1024)           3938304   
                                                                 
 dense_1 (Dense)             (1, None, 48)             49200     
                                                                 
Total params: 3,999,792
Trainable params: 3,999,792
Non-trainable params: 0
_________________________________________________________________


### The prediction loop

The following code block generates the text:

* It starts by choosing a start string, initializing the RNN state, and setting the number of characters to generate.

* It gets the prediction distribution of the next character using the start string and the RNN state.

* It then samples the index of the predicted character from a categorical (multinomial) distribution and uses this predicted character as the next input to the model.

* The RNN state is carried over between calls (the GRU is stateful), so the model has more context than a single character. After each prediction the updated state is used again, which is how the model accumulates context from the previously generated characters.


![To generate text the model's output is fed back to the input](https://tensorflow.org/tutorials/sequences/images/text_generation_sampling.png)

Looking at the generated text, you'll see that the model has learned to break its output into verse-like lines and to imitate the vocabulary of Hafez. It does not reliably form coherent sentences.

In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 500

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # List to collect the generated characters
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # Scale the logits by the temperature, then sample the next character
      # from the resulting categorical distribution (last time step only)
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # Pass the predicted character as the next input to the model;
      # the stateful GRU carries the previous hidden state forward
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
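
The effect of `temperature` can be seen by dividing a toy logit vector by different values before applying softmax. This is a plain-Python sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper that mirrors the `predictions / temperature` step above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the most likely character, while higher temperatures flatten the distribution and make rarer characters more likely to be sampled.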

In [None]:
generate_poetry = generate_text(model, start_string=u"الا يا ايها الساقی ادر کاسا و ناولها")
print(generate_poetry)

الا يا ايها الساقی ادر کاسا و ناولها کنم
ابروی عندليبش
تندی گر نظر بر هجر تکرت است که صوفی داشت
اگر ز رسيدی و مرغ سهی بشين و شمت
نعد بگردد و سلطنت نگار شم
هر که در اين پرده نزنم سر ما و دارا
دلت برانک می برون را چه مدارد انگيش
که گويی نبرتيده‌ام به دام است و خواب می‌زدم
وز دور طرب آری ز مسکينه ديگرفتار
کمان ابروی جانان گشاده در کنار داشت
گفتم ره نشينه تا تازيرت درازم بنشست
کس نکردی کنيم و مار و جان و زلف نگار تو
حيرتم که ببويد به دربانه مرام است
در خنمه چنگست
جام می گل کن که پرس به خون بگذرد
که به جام بی‌دل خم جهان را چو شمع
غريب 


In [None]:
model.save('keras.h5')



In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Real poems
real_poems = [
    'الا يا ايها الساقی ادر کاسا و ناولها',
    'که عشق آسان نمود اول ولی افتاد مشکل‌ها',
    'به بوی نافه‌ای کاخر صبا زان طره بگشايد',
    'ز تاب جعد مشکينش چه خون افتاد در دل‌ها',
    'مرا در منزل جانان چه امن عيش چون هر دم',
    'جرس فرياد می‌دارد که بربنديد محمل‌ها',
    'به می سجاده رنگين کن گرت پير مغان گويد',
    'که سالک بی‌خبر نبود ز راه و رسم منزل‌ها',
    'شب تاريک و بيم موج و گردابی چنين هايل',
    'کجا دانند حال ما سبکباران ساحل‌ها',
    'همه کارم ز خود کامی به بدنامی کشيد آخر',
    'نهان کی ماند آن رازی کز او سازند محفل‌ها',
]

# Generate poetry
generated_text = generate_text(model, start_string=u"الا يا ايها الساقی ادر کاسا و ناولها")



In [None]:
generated_text = generated_text.split('\n')

In [None]:
# Vectorize the real poems
vectorizer = TfidfVectorizer()
real_poems_vectors = vectorizer.fit_transform(real_poems)

# Vectorize the generated poem
generated_vector = vectorizer.transform(generated_text)

# Cosine similarity between every generated line (rows) and every real verse (columns)
similarity_scores = cosine_similarity(generated_vector, real_poems_vectors)

# Average over the generated lines to get one score per real verse
average_similarity_score = similarity_scores.mean(axis=0)

# Get the most similar real verse
most_similar_index = np.argmax(average_similarity_score)
most_similar_poem = real_poems[most_similar_index]

print("Generated Poem:\n", generated_text)
print("Most Similar Real Poem:\n", most_similar_poem)
print("Cosine Similarity Score of most similar poem:", average_similarity_score[most_similar_index])
print("Cosine Similarity Score:", average_similarity_score)

Generated Poem:
 ['الا يا ايها الساقی ادر کاسا و ناولهاست', 'برفکن حافظ جام زه می\u200cآيد فيض ازل', 'به پياهی که اين\u200cهاست نباشد', 'با کار خود باشم ز باغ عيش يک جم شد', 'اين آبثان که دعا رود گرفته چو داری', 'مکن به چنگ مرد خاک نفس باده فروش', 'کان عالم و رندان به سر رفتم بدان سانگ کش', 'تا به ابد او دلم کز سخت بی\u200cنبارد دگر بينشته', 'و اندر گل بدين شاه درست', 'با آنم نگردد گدای عارض فضول', 'چون باد صبا گوش به تو زان لب شيرين کارگاه کم', 'ز روی طاعت و اين راه خبرند چنگ و رباب باد', 'خواه شمشيد و گدا می\u200cشتفتم بی\u200cبها کرد', 'غزل    ۱۷۹', 'سرود عقل بر سر کو فرق از بطرض دوست', 'به کوک مکن که ديد ار درياب ']
Most Similar Real Poem:
 به می سجاده رنگين کن گرت پير مغان گويد
Cosine Similarity Score of most similar poem: 0.10008756944085351
Cosine Similarity Score: [0.05786376 0.04143822 0.09084686 0.         0.02545009 0.07887823
 0.10008757 0.09319273 0.         0.         0.0886961  0.0216562 ]
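
The scores above are cosine similarities between word-based vectors, so they are only non-zero when a generated line shares whole words with a real verse. The sketch below shows the underlying computation on raw word counts rather than TF-IDF weights, which is a deliberate simplification; `cosine` is a hypothetical helper.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine("a b", "a b"))  # identical word sets
print(cosine("a b", "a c"))  # one shared word out of two
print(cosine("a b", "c d"))  # no shared words
```

Because a character-level model often produces words that are slightly misspelled, word-level overlap with the real verses is expected to be small, which is consistent with the low average scores reported above.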


In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
# tf.enable_eager_execution()

import numpy as np
import os
import time
import glob

In [None]:
from google.colab import files
uploaded = files.upload()

Saving hafez.txt to hafez.txt


In [None]:
path_to_file = 'hafez.txt'

### Read the data

First, look in the text.

In [None]:
# Read, then decode for py2 compat.
# text = open('poems', 'rb').read().decode(encoding='utf-8')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

  
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

# remove some exteranous chars 
execluded = '!()*-.1:=[]«»;؛,،~?؟#\u200f\ufeff'
out = ""


for char in text:
  if char not in execluded:
    out += char
text = out
text = text.replace("\t\t\t", "\t")
text = text.replace("\r\r\n", "\n")
text = text.replace("\r\n","\n")
text = text.replace("\t\n", "\n")
text = text.replace("\n\n", "\n")
text = '\n'.join(line for line in text.split('\n') if len(line) >= 7)


vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

Length of text: 296677 characters
48 unique characters


In [None]:
# Take a look at the first 250 characters in text
print(text[:250])

الا يا ايها الساقی ادر کاسا و ناولها
که عشق آسان نمود اول ولی افتاد مشکل‌ها
به بوی نافه‌ای کاخر صبا زان طره بگشايد
ز تاب جعد مشکينش چه خون افتاد در دل‌ها
مرا در منزل جانان چه امن عيش چون هر دم
جرس فرياد می‌دارد که بربنديد محمل‌ها
به می سجاده رنگين کن


## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [None]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

Now we have an integer representation for each character. Notice that we mapped the character as indexes from 0 to `len(unique)`.

In [None]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  'Y' :   2,
  'آ' :   3,
  'ا' :   4,
  'ب' :   5,
  'ت' :   6,
  'ث' :   7,
  'ج' :   8,
  'ح' :   9,
  'خ' :  10,
  'د' :  11,
  'ذ' :  12,
  'ر' :  13,
  'ز' :  14,
  'س' :  15,
  'ش' :  16,
  'ص' :  17,
  'ض' :  18,
  'ط' :  19,
  ...
}


In [None]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'الا يا ايها ا' ---- characters mapped to int ---- > [ 4 25  4  1 30  4  1  4 30 28  4  1  4]


### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform. The input to the model will be a sequence of characters, and we train the model to predict the output—the following character at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, given all the characters computed until this moment, what is the next character?


### Create training examples and targets

Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".

To do this first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [None]:
# The maximum length sentence we want for a single input in characters
seq_length = 200
examples_per_epoch = len(text)//seq_length

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

ا
ل
ا
 
ي


The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'الا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فريا'
'د می\u200cدارد که بربنديد محمل\u200cها\nبه می سجاده رنگين کن گرت پير مغان گويد\nکه سالک بی\u200cخبر نبود ز راه و رسم منزل\u200cها\nشب تاريک و بيم موج و گردابی چنين هايل\nکجا دانند حال ما سبکباران ساحل\u200cها\nهمه کارم ز خود کامی ب'
'ه بدنامی کشيد آخر\nنهان کی ماند آن رازی کز او سازند محفل\u200cها\nحضوری گر همی\u200cخواهی از او غايب مشو حافظ\nمتی ما تلق من تهوی دع الدنيا و اهملها\nصلاح کار کجا و من خراب کجا\nببين تفاوت ره کز کجاست تا به کجا\nدلم ز'
' صومعه بگرفت و خرقه سالوس\nکجاست دير مغان و شراب ناب کجا\nچه نسبت است به رندی صلاح و تقوا را\nسماع وعظ کجا نغمه رباب کجا\nز روی دوست دل دشمنان چه دريابد\nچراغ مرده کجا شمع آفتاب کجا\nچو کحل بينش ما خاک آستان'
' شماست\nکجا رويم بفرما از اين جناب کجا\nمبين به سيب زنخدان که چاه در راه است\nکجا همی\u200cروی ای دل بدين شتاب 

For each sequence, duplicate and shift it to form the input and target text by using the `map` method to apply a simple function to each batch:

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print the first examples input and target values:

In [None]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'الا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فري'
Target data: 'لا يا ايها الساقی ادر کاسا و ناولها\nکه عشق آسان نمود اول ولی افتاد مشکل\u200cها\nبه بوی نافه\u200cای کاخر صبا زان طره بگشايد\nز تاب جعد مشکينش چه خون افتاد در دل\u200cها\nمرا در منزل جانان چه امن عيش چون هر دم\nجرس فريا'


Each index of these vectors are processed as one time step. For the input at time step 0, the model receives the index for "F" and tries to predict the index for "i" as the next character. At the next timestep, it does the same thing but the `RNN` considers the previous step context in addition to the current input character.

In [None]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 4 ('ا')
  expected output: 25 ('ل')
Step    1
  input: 25 ('ل')
  expected output: 4 ('ا')
Step    2
  input: 4 ('ا')
  expected output: 1 (' ')
Step    3
  input: 1 (' ')
  expected output: 30 ('ي')
Step    4
  input: 30 ('ي')
  expected output: 4 ('ا')


### Create training batches

We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.

In [None]:
# Batch size
BATCH_SIZE = 128
steps_per_epoch = examples_per_epoch//BATCH_SIZE

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<_BatchDataset element_spec=(TensorSpec(shape=(128, 200), dtype=tf.int64, name=None), TensorSpec(shape=(128, 200), dtype=tf.int64, name=None))>

## Build The Model

Use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps the index of each character to a vector with `embedding_dim` dimensions.
* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units`. (You can also use an LSTM layer here.)
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs.

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Next define a function to build the model.

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
        return_sequences=True,
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [None]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

For each character the model looks up the embedding, runs the GRU one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

![A drawing of the data passing through the model](https://tensorflow.org/tutorials/sequences/images/text_generation_training.png)

## Try the model

Now run the model to see that it behaves as expected.

First check the shape of the output:

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(128, 200, 48) # (batch_size, sequence_length, vocab_size)


In the above example the sequence length of the input is `200`, but the model can be run on inputs of any length:

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (128, None, 256)          12288     
                                                                 
 gru (GRU)                   (128, None, 1024)         3938304   
                                                                 
 dense (Dense)               (128, None, 48)           49200     
                                                                 
Total params: 3,999,792
Trainable params: 3,999,792
Non-trainable params: 0
_________________________________________________________________
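
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the TF 2 Keras default `reset_after=True`, which gives each of the three gates two bias vectors; under that assumption the arithmetic matches the summary.

```python
vocab_size, embedding_dim, rnn_units = 48, 256, 1024

# Embedding: one embedding_dim-sized vector per character.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel,
# and two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# → 12288 3938304 49200 3999792
```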


To get actual character indices from the model we need to sample from the output distribution. This distribution is defined by the logits over the character vocabulary.

Note: It is important to _sample_ from this distribution as taking the _argmax_ of the distribution can easily get the model stuck in a loop.
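
A toy comparison shows the difference. This sketch uses made-up logits over three classes and the standard library only; it is not the notebook's model.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # hypothetical logits for three characters
probs = softmax(logits)

# Greedy decoding: the same index every time for the same logits.
greedy = [max(range(len(probs)), key=probs.__getitem__) for _ in range(10)]

# Sampling: indices drawn in proportion to their probability.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=10)

print("argmax :", greedy)
print("sampled:", sampled)
```

With argmax, identical context always yields the identical next character, which is how a generator can fall into a repeating cycle; sampling breaks that determinism.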

Try it for the first example in the batch:

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index:

In [None]:
print(len(sampled_indices))
sampled_indices

200


array([ 2, 12, 21, 20, 15, 39, 41, 38,  2, 37, 39, 25,  1, 28, 31, 33, 37,
       40,  3,  2, 27,  9, 15, 17, 41,  7,  7,  2, 13, 33,  2, 31, 25, 23,
       39,  7,  8,  7, 26, 44, 26, 41, 10,  4, 47, 16,  7, 35, 18, 37,  7,
       25, 20, 22, 16, 28, 40, 21, 15, 46, 33,  5, 42,  5, 13, 47,  1,  7,
        3,  5, 32,  7, 38,  4,  8, 16, 28,  3,  8, 25, 35, 46,  8,  7, 23,
       25, 38,  2, 13, 15,  5, 34, 22, 39, 36, 44, 33,  9, 42, 26,  4, 34,
       46, 20, 42, 19,  8,  8, 26, 32, 16,  0,  5, 30, 12, 33,  2, 40,  3,
       10, 45, 15, 30, 44, 47, 24,  3,  8, 15,  8, 46,  5, 36, 25, 47, 34,
        5, 10, 35,  4, 27, 26, 11, 45,  9, 42, 46, 36, 34,  5, 30, 15, 22,
       46,  1, 35, 18, 25, 18,  6, 41, 40, 42, 40, 35, 26, 13, 23,  3, 35,
       45,  7, 37, 30, 34, 12, 36, 32,  1, 24,  6,  7,  6,  5, 34, 24, 18,
       12, 42,  7, 36,  0,  0, 21,  7, 32,  1, 11,  6, 14])

Decode these to see the text predicted by this untrained model:

In [None]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0].numpy()])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 'ظی دامی\nگله از زاهد بدخو نکنم رسم اين است\nکه چو صبحی بدمد در پی اش افتد شامی\nيار من چون بخرامد به تماشای چمن\nبرسانش ز من ای پيک صبا پيغامی\nآن حريفی که شب و روز می صاف کشد\nبود آيا که کند ياد ز دردآشامی'

Next Char Predictions: 
 'Yذعظس۲۴۱Y۰۲ل هپژ۰۳آYنحسص۴ثثYرژYپلف۲ثجثم۷م۴خا\u200cشثگض۰ثلظغشه۳عس۹ژب۵بر\u200c ثآبچث۱اجشهآجلگ۹جثفل۱Yرسبکغ۲ی۷ژح۵ماک۹ظ۵طججمچش\nبيذژY۳آخ۸سي۷\u200cقآجسج۹بیل\u200cکبخگانمد۸ح۵۹یکبيسغ۹ گضلضت۴۳۵۳گمرفآگ۸ث۰يکذیچ قتثتبکقضذ۵ثی\n\nعثچ دتز'


## Train the model

At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because our model returns logits, we need to set the `from_logits` flag.


In [None]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (128, 200, 48)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       3.8705091
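
As a sanity check, an untrained model should score close to a uniform guess over the 48 characters, whose cross-entropy is ln(48). The sketch below computes that baseline in plain Python; `sparse_ce_from_logits` is a hypothetical helper written for illustration, not a TensorFlow function.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy of one integer label against unnormalized logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

vocab_size = 48
uniform_logits = [0.0] * vocab_size  # every character equally likely
baseline = sparse_ce_from_logits(label=7, logits=uniform_logits)
print(round(baseline, 4))  # → 3.8712
```

The measured loss of about 3.87 sits right at this baseline, which suggests the freshly initialized model is close to guessing uniformly.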


Configure the training procedure using the `tf.keras.Model.compile` method. We'll use the Adam optimizer (`tf.keras.optimizers.Adam`, selected here by the string `'adam'`) with default arguments, together with the loss function defined above.

In [None]:
model.compile(
    optimizer='adam',
    loss = loss,
    metrics=['accuracy'])

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Execute the training

Train the model for 100 epochs. In Colab, set the runtime to GPU for faster training.

In [None]:
EPOCHS=100

In [None]:
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch,
                    callbacks=[checkpoint_callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

## Generate text

### Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different `batch_size`, we need to rebuild the model and restore the weights from the checkpoint.


In [None]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_100'

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (1, None, 256)            12288     
                                                                 
 gru_1 (GRU)                 (1, None, 1024)           3938304   
                                                                 
 dense_1 (Dense)             (1, None, 48)             49200     
                                                                 
Total params: 3,999,792
Trainable params: 3,999,792
Non-trainable params: 0
_________________________________________________________________


### The prediction loop

The following code block generates the text:

* It starts by choosing a start string, initializing the RNN state and setting the number of characters to generate.

* It gets the prediction distribution of the next character using the start string and the RNN state.

* Then it samples from a categorical (multinomial) distribution to pick the index of the predicted character, and uses this character as the next input to the model.

* The RNN state returned by the model is fed back into the model so that it has more context than a single character. After each prediction, the updated RNN state is fed back in again, so every new character is conditioned on everything generated so far.


![To generate text the model's output is fed back to the input](https://tensorflow.org/tutorials/sequences/images/text_generation_sampling.png)

Looking at the generated text, you'll see the model has learned where to break lines and imitates a Hafez-like vocabulary, but it does not yet form coherent sentences.

In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 500

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character id from the categorical (multinomial)
      # distribution defined by the temperature-scaled logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
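
The effect of `temperature` is easier to see on a small example. The following sketch applies the same division by temperature to made-up logits and converts them to probabilities; `softmax_with_temperature` is a hypothetical helper, not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures flatten the distribution toward uniform.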

In [None]:
generate_poetry = generate_text(model, start_string=u"الا يا ايها الساقی ادر کاسا و ناولها")
print(generate_poetry)

الا يا ايها الساقی ادر کاسا و ناولها کنم
ابروی عندليبش
تندی گر نظر بر هجر تکرت است که صوفی داشت
اگر ز رسيدی و مرغ سهی بشين و شمت
نعد بگردد و سلطنت نگار شم
هر که در اين پرده نزنم سر ما و دارا
دلت برانک می برون را چه مدارد انگيش
که گويی نبرتيده‌ام به دام است و خواب می‌زدم
وز دور طرب آری ز مسکينه ديگرفتار
کمان ابروی جانان گشاده در کنار داشت
گفتم ره نشينه تا تازيرت درازم بنشست
کس نکردی کنيم و مار و جان و زلف نگار تو
حيرتم که ببويد به دربانه مرام است
در خنمه چنگست
جام می گل کن که پرس به خون بگذرد
که به جام بی‌دل خم جهان را چو شمع
غريب 


In [None]:
model.save('keras.h5')



In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Real poems
real_poems = [
    'الا يا ايها الساقی ادر کاسا و ناولها',
    'که عشق آسان نمود اول ولی افتاد مشکل‌ها',
    'به بوی نافه‌ای کاخر صبا زان طره بگشايد',
    'ز تاب جعد مشکينش چه خون افتاد در دل‌ها',
    'مرا در منزل جانان چه امن عيش چون هر دم',
    'جرس فرياد می‌دارد که بربنديد محمل‌ها',
    'به می سجاده رنگين کن گرت پير مغان گويد',
    'که سالک بی‌خبر نبود ز راه و رسم منزل‌ها',
    'شب تاريک و بيم موج و گردابی چنين هايل',
    'کجا دانند حال ما سبکباران ساحل‌ها',
    'همه کارم ز خود کامی به بدنامی کشيد آخر',
    'نهان کی ماند آن رازی کز او سازند محفل‌ها',
]

# Generate poetry
generated_text = generate_text(model, start_string=u"الا يا ايها الساقی ادر کاسا و ناولها")



In [None]:
generated_text = generated_text.split('\n')

In [None]:
# Vectorize the real poems
vectorizer = TfidfVectorizer()
real_poems_vectors = vectorizer.fit_transform(real_poems)

# Vectorize the generated poem
generated_vector = vectorizer.transform(generated_text)

# Cosine similarity between every generated line (rows) and every real line (columns)
similarity_scores = cosine_similarity(generated_vector, real_poems_vectors)

# Average over the generated lines: one score per real line
average_similarity_score = similarity_scores.mean(axis=0)

# Get the real line with the highest average similarity
most_similar_index = np.argmax(average_similarity_score)
most_similar_poem = real_poems[most_similar_index]

print("Generated Poem:\n", generated_text)
print("Most Similar Real Poem:\n", most_similar_poem)
print("Cosine Similarity Score of most similar poem:", average_similarity_score[most_similar_index])
print("Cosine Similarity Score:", average_similarity_score)

Generated Poem:
 ['الا يا ايها الساقی ادر کاسا و ناولهاست', 'برفکن حافظ جام زه می\u200cآيد فيض ازل', 'به پياهی که اين\u200cهاست نباشد', 'با کار خود باشم ز باغ عيش يک جم شد', 'اين آبثان که دعا رود گرفته چو داری', 'مکن به چنگ مرد خاک نفس باده فروش', 'کان عالم و رندان به سر رفتم بدان سانگ کش', 'تا به ابد او دلم کز سخت بی\u200cنبارد دگر بينشته', 'و اندر گل بدين شاه درست', 'با آنم نگردد گدای عارض فضول', 'چون باد صبا گوش به تو زان لب شيرين کارگاه کم', 'ز روی طاعت و اين راه خبرند چنگ و رباب باد', 'خواه شمشيد و گدا می\u200cشتفتم بی\u200cبها کرد', 'غزل    ۱۷۹', 'سرود عقل بر سر کو فرق از بطرض دوست', 'به کوک مکن که ديد ار درياب ']
Most Similar Real Poem:
 به می سجاده رنگين کن گرت پير مغان گويد
Cosine Similarity Score of most similar poem: 0.10008756944085351
Cosine Similarity Score: [0.05786376 0.04143822 0.09084686 0.         0.02545009 0.07887823
 0.10008757 0.09319273 0.         0.         0.0886961  0.0216562 ]
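
To see why some scores are exactly zero, it helps to compute a cosine similarity by hand. The sketch below uses raw word counts instead of TF-IDF weights, so its numbers will not match scikit-learn's; `cosine` is a hypothetical helper for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two strings, using whitespace word counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

real_line = 'به می سجاده رنگين کن گرت پير مغان گويد'
generated_line = 'به کوک مکن که ديد ار درياب'

print(cosine(real_line, generated_line))  # low: few shared words
print(cosine(real_line, real_line))       # identical lines score ~1.0
```

Because the score depends only on shared words, a generated line that reuses none of a real line's vocabulary scores 0 against it, however similar it may be in style.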
