<a href="https://colab.research.google.com/github/minhaz1172/Deep-Learning/blob/main/Text_Generation_by_transfomer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Transformer for Text Generation**

# Objectives:
# Implement Transformers for text generation tasks

# Build, train, and evaluate Transformer models for text generation using TensorFlow and Keras

# Apply text generation in real-world scenarios

# Import necessary libraries

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.utils import get_file


# Load the dataset

# get_file downloads the file from the URL if it’s not present locally.

# The text is read in binary mode (byte string) and then decoded into UTF-8 for proper text handling.
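The read-then-decode step described above can be tried without any download. The following is a minimal sketch: the sample string and the temporary file are stand-ins invented for illustration, not part of the dataset. It shows that reading in binary mode yields `bytes`, that decoding yields `str`, and that the two lengths differ once a non-ASCII character is present.

```python
import os
import tempfile

# Hypothetical sample text; the second apostrophe is non-ASCII (U+2019).
sample = "We know't, we know’t.\n"

# Write the sample as UTF-8 bytes to a temporary file (stand-in for the download).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("utf-8"))
    tmp_path = tmp.name

# Read in binary mode: the result is a bytes object, not a string.
with open(tmp_path, "rb") as f:
    raw_bytes = f.read()
os.remove(tmp_path)

# Decode the bytes into a Python string.
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, len(raw_bytes))
print(type(decoded).__name__, len(decoded))
```

Because `len()` on the decoded string counts characters rather than bytes, a character count printed later in the notebook is measured after decoding.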

In [2]:
# Load the dataset (Shakespeare's works) from a URL. This utility downloads the file if needed.

path_to_file = get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Read the raw bytes from the downloaded file and decode them into a UTF-8 string
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')

# Preview the first 1000 characters of the dataset
print(text[:1000])

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
[1m1115394/1115394[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their 

In [6]:
print(f"length of text: {len(text)} characters")

length of text: 1115394 characters


# The TextVectorization layer converts raw strings into sequences of integer tokens.
# The .adapt() method analyzes the text to build the vocabulary.
# Calling vectorizer([text]) adds a batch dimension, so [0] would be needed to extract the token sequence; the code below passes the string directly and gets a 1-D sequence.
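To make the vocabulary-building step concrete, here is a pure-Python sketch of roughly what a word-level vectorizer does with default-style settings: lowercase, strip punctuation, split on whitespace, rank words by frequency, and reserve index 0 for padding and index 1 for unknown words. The helper names (`standardize_and_split`, `build_vocab`, `vectorize`) and the toy corpus are hypothetical, and the tie-breaking order among equally frequent words is an assumption that may differ from the Keras layer.

```python
import re
import string
from collections import Counter

def standardize_and_split(s):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    s = s.lower()
    s = re.sub(f"[{re.escape(string.punctuation)}]", "", s)
    return s.split()

def build_vocab(corpus, max_tokens):
    """Return a vocabulary list; index 0 is padding, index 1 is unknown."""
    counts = Counter(standardize_and_split(corpus))
    # Two slots are reserved, so keep at most max_tokens - 2 real words.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(s, vocab):
    """Map each word to its vocabulary index, or 1 if it is unknown."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize_and_split(s)]

# Toy corpus (illustrative only).
corpus = "Speak, speak. We know't, we know't. Speak!"
vocab = build_vocab(corpus, max_tokens=5)
ids = vectorize("Speak, citizens: we know't", vocab)

print(vocab)
print(ids)
```

Since tokens here are whole words rather than characters, the vectorized sequence is much shorter than the raw character count of the text.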

# **Data preprocessing and text vectorization**

In [5]:
# Set parameters: a maximum vocabulary size and a sequence length.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

vocab_size=10000
seq_lenth=100

# Create a TextVectorization layer instance that outputs integer token indices
vectorizer=TextVectorization(max_tokens=vocab_size,output_mode='int')

# Create a TensorFlow dataset from the raw text and batch it (batch size 1 here) for adaptation
text_ds=tf.data.Dataset.from_tensor_slices([text]).batch(1)
# Adapt the TextVectorization layer on the dataset; this builds the vocabulary
vectorizer.adapt(text_ds)
# Vectorize the entire text into a sequence of integer token IDs
vectorized_text=vectorizer(text)
print(f"shape of vectorized text: {vectorized_text.shape}")
print(f"first 10 tokens of vectorized text: {vectorized_text[:10]}")

shape of vectorized text: (202646,)
first 10 tokens of vectorized text: [ 89 270 138  36 982 144 673 125  16 106]


# To prepare the data for sequence prediction, we slide a window of fixed size (seq_length) over the vectorized text.

# Input Sequences (X): Each sequence of tokens used as model input.

# Target Sequences (Y): The next tokens (shifted by one position) that the model is expected to predict.

# Finally, the data is converted to TF tensors so the training loop can process them efficiently.
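The windowing described above is easier to see on a tiny sequence. The sketch below uses six token IDs and a window of 3 instead of 100; `X_toy` and `Y_toy` are illustrative names. It uses NumPy's `sliding_window_view` as a vectorized alternative to an explicit loop, under the assumption that NumPy 1.20 or newer is installed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy token sequence and a short window (the notebook uses 100).
tokens = np.array([89, 270, 138, 36, 982, 144])
seq_length = 3

# Each window holds seq_length + 1 tokens: the input plus one extra
# token so the target can be shifted right by one position.
windows = sliding_window_view(tokens, seq_length + 1)

X_toy = windows[:, :-1]  # first seq_length tokens of each window
Y_toy = windows[:, 1:]   # same window shifted by one token

print(X_toy)
print(Y_toy)
```

A sequence of `n` tokens yields `n - seq_length` input/target pairs, and every target row overlaps its input row in all but the last token.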

In [8]:
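def create_sequences(text,seq_length):
  # Slide a window of seq_length tokens over the sequence, one token at a time
  input_seqs=[]
  target_seqs=[]

  for i in range(len(text)-seq_length):
    # The target is the input window shifted one token to the right
    input_seq=text[i:i+seq_length]
    target_seq=text[i+1:i+seq_length+1]

    input_seqs.append(input_seq)
    target_seqs.append(target_seq)

  return np.array(input_seqs),np.array(target_seqs)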


# Generate input (X) and target (Y) sequences from the vectorized text

X,Y=create_sequences(vectorized_text.numpy(),seq_lenth)

# Verify that the sequences were generated successfully
print(f"number of sequences generated: {len(X)}")
print(f"first input sequence: {X[0]}")
print(f"first target sequence: {Y[0]}")

# Convert X and Y to TensorFlow tensors for training
X=tf.convert_to_tensor(X)
Y=tf.convert_to_tensor(Y)
print(f"shape of X: {X.shape}")
print(f"shape of Y: {Y.shape}")

number of sequences generated: 202546
first input sequence: [  89  270  138   36  982  144  673  125   16  106   34  106  106   89
  270    7   41   34 1286  344    4  200   64    4 3690   34 1286 1286
   89  270   89    7   93 1187  225   12 2442  592    4    2  307   34
   36 2655   36 2655   89  270   72   79  506   27    3   56   24 1390
   57   40  161 2328  644    9 4980   34   32   54 2863  885   72   17
   18  163  146  146  165  270   74  218   46  595   89  270   36   41
 6739  172  595    2 1780   46   29 1323 5151   47   58 4151   79   39
   60   58]
first target sequence: [ 270  138   36  982  144  673  125   16  106   34  106  106   89  270
    7   41   34 1286  344    4  200   64    4 3690   34 1286 1286   89
  270   89    7   93 1187  225   12 2442  592    4    2  307   34   36
 2655   36 2655   89  270   72   79  506   27    3   56   24 1390   57
   40  161 2328  644    9 4980   34   32   54 2863  885   72   17   18
  163  146  146  165  270   74  218   46  595   89  2