# Step 1: Setting Up the Environment

# Step 2: Importing Necessary Libraries

We will use TensorFlow and PyTorch, two popular frameworks for developing machine learning models.

Both provide a comprehensive set of tools for creating and deploying ML models. The model code in this guide uses PyTorch.

In [9]:
import tensorflow as tf
import torch
import torch.nn as nn

# Step 3: Gathering and Preprocessing Data

The first step in building a language model is to gather and preprocess the data. 

The data for a language model is typically a large corpus of text. 

For example, you could use a book, a collection of articles, or any other large text file.

https://www.mltut.com/how-to-build-generative-ai-model/ 



Once you have your text data, you'll need to preprocess it. This typically involves:

* Tokenization: Splitting the text into individual words or tokens.
* Lowercasing: Converting all the text to lowercase to ensure the model doesn't treat the same word in different cases as different words.
* Removing punctuation and non-alphanumeric characters: This simplifies the model's input space.
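
The three steps above can be sketched in a few lines. This is a minimal illustration, assuming plain ASCII English text; `tokenize` and `build_vocab` are hypothetical helper names, not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, replace non-alphanumeric characters, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation -> space
    return text.split()                        # split() also collapses whitespace

def build_vocab(tokens):
    """Map each unique token to an integer index, most frequent first."""
    counts = Counter(tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common())}

tokens = tokenize("The cat sat. The cat ran!")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]  # integer sequence a model can consume
print(tokens)
print(ids)
```

The integer ids, rather than the raw words, are what an embedding layer would take as input.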

In [10]:
import re

In [11]:
def preprocess_text(text):
    text = text.lower()                # lowercase everything
    text = re.sub(r'\d+', '', text)    # drop digits
    text = re.sub(r'\W', ' ', text)    # replace punctuation with spaces
    text = re.sub(r'\s+', ' ', text)   # collapse whitespace last, after the substitutions above
    return text.strip()

# Step 4: Building the Model

We will use a Recurrent Neural Network (RNN) for our language model. 

RNNs are well suited to generating sequences, such as sentences or melodies.

The notes below walk through a simple RNN defined in PyTorch; the full class definition follows them.

https://coda.io/@peter-sigurdson/building-a-simple-ai-generative-language-model-in-python#_lu891

torch is imported to ensure we have access to all necessary PyTorch functions and classes.

torch.nn is aliased as nn for ease of use.

The RNNModel class extends nn.Module, which is the base class for all neural network modules in PyTorch.

The forward method is where the input tensor x goes through the layers of the network.

To run this, install PyTorch if it is not already available in your session:

!pip install torch

After importing the necessary libraries and defining your model, you can instantiate the RNNModel class and use it for text generation or another sequence modeling task.

1. The first line declares a class named RNNModel that inherits from nn.Module, the base class for all PyTorch neural network modules.
2. The __init__ method initializes the RNNModel. It takes four parameters:

- vocab_size: The size of the vocabulary (number of unique words).
- embed_size: The size of the word embeddings.
- hidden_size: The number of features in the hidden state of the RNN.
- num_layers: The number of RNN layers.

It then calls the constructor of the parent class (nn.Module) using super().

3. The model components are:
- self.embed: An embedding layer that converts input indices (word IDs) into dense vectors of fixed size (embed_size).
- self.rnn: An RNN layer with num_layers layers, taking input of size embed_size and producing a hidden state of size hidden_size.
- self.linear: A linear layer that maps the RNN hidden state to the output vocabulary size (vocab_size).

4. The forward method defines the forward pass of the model. It takes two parameters:

- x: Input sequence (word indices).
- h: Initial hidden state of the RNN.

The forward pass involves the following steps:

- Embed the input sequence (x) using the embedding layer.
- Pass the embedded sequence through the RNN, producing an output tensor (out) and an updated hidden state (h).
- Apply the linear layer to the RNN output to get the final prediction.

The method returns the output tensor and the updated hidden state.



In [12]:
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(RNNModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, h):
        x = self.embed(x)
        out, h = self.rnn(x, h)
        out = self.linear(out)
        return out, h

In summary, this code defines a basic RNN model for language modeling or sequence prediction tasks, where the goal is to predict the next word in a sequence given the previous words.
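
To make the tensor shapes in the forward pass concrete, here is a small shape check that applies the same three layers directly. The sizes are arbitrary illustrative values, and the input is random token indices rather than real text.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not values from the guide)
vocab_size, embed_size, hidden_size, num_layers = 20, 8, 16, 1
seq_len, batch_size = 5, 2

embed = nn.Embedding(vocab_size, embed_size)
rnn = nn.RNN(embed_size, hidden_size, num_layers)
linear = nn.Linear(hidden_size, vocab_size)

# nn.RNN defaults to batch_first=False, so inputs are (seq_len, batch)
x = torch.randint(0, vocab_size, (seq_len, batch_size))

emb = embed(x)        # (seq_len, batch, embed_size)
out, h = rnn(emb)     # out: (seq_len, batch, hidden_size), h: (num_layers, batch, hidden_size)
logits = linear(out)  # (seq_len, batch, vocab_size)

print(emb.shape, out.shape, h.shape, logits.shape)
```

The last dimension of `logits` is one score per vocabulary entry at every position, which is why the loss in training is computed against word indices.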

# Step 5: Training the Model

Training involves feeding your preprocessed data into the model, calculating the error of the model's predictions, and updating the model's parameters to reduce this error. 

This process is repeated for a number of iterations or epochs.

In [13]:
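def train(model, data, epochs, lr):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        hidden = None  # start each epoch with a fresh hidden state
        for x, y in data:  # x, y: tensors of token indices with matching shapes
            optimizer.zero_grad()
            outputs, hidden = model(x, hidden)
            # Detach so gradients do not flow back into earlier batches
            hidden = hidden.detach()
            # CrossEntropyLoss expects (N, vocab_size) scores and (N,) targets
            loss = criterion(outputs.reshape(-1, outputs.size(-1)), y.reshape(-1))
            loss.backward()
            optimizer.step()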
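
The `data` argument is not defined in this guide. One common way to build it, sketched here as an assumption, is to slice the token-index sequence into fixed-length windows where each target is the input shifted by one position. `make_pairs` is a hypothetical helper name.

```python
def make_pairs(token_ids, seq_len):
    """Split a list of token ids into (input, target) windows.

    Each target is its input shifted one position to the right,
    so the model learns to predict the next token at every step.
    """
    pairs = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        x = token_ids[start : start + seq_len]
        y = token_ids[start + 1 : start + seq_len + 1]
        pairs.append((x, y))
    return pairs

pairs = make_pairs([0, 1, 2, 3, 4, 5, 6], seq_len=3)
print(pairs)
```

Before being passed to `train`, each list would still need to be converted to a tensor, for example with `torch.tensor(x)`.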

# Step 6: Using the Model for Inference (Next-Token Generation)

Once the model is trained, you can use it to generate new text. 

This involves:

1. Providing the model with a seed sequence.
2. Having the model predict the next word.
3. Adding the predicted word to the sequence.
4. Repeating this process for as many words as you want to generate.

The code below does the following:
    
1. Function signature

  The function is defined with three parameters:

- model: The PyTorch model used for text generation.
- seed_text: The starting sequence from which generation begins. As written, the code expects a list of token indices (it is indexed and appended to), not a raw string.
- num_words: The number of words to generate.

2. Set Model to Evaluation Mode:

    model.eval()

This line sets the model to evaluation mode. In PyTorch, this is important because it disables certain operations like dropout, which are typically used during training but not during evaluation or generation.

3. Initialize Text:

    text = seed_text
    
The text variable is initialized with the provided seed_text. This will be the starting point for text generation.

4. Generate Text:


    for _ in range(num_words):
        x = torch.tensor([text[-1]])
        output, _ = model(x, None)
        _, predicted = torch.max(output, 1)
        text.append(predicted.item())
        
        
Inside the loop, a tensor x is created from the last item of the current text (text[-1]) and passed to the model.

The model's forward pass is executed (model(x, None)), and the output tensor (output) is obtained. The second return value (_) is the updated hidden state, which is discarded here, and None is passed as the initial state on every step. As a result, each prediction depends only on the most recent token rather than on the whole sequence so far.

The torch.max function returns the index with the highest score in the output distribution (greedy decoding). This index (predicted) is appended to the text as the next predicted word.

5. Return Generated Text:

    return text
    
After generating the specified number of words, the function returns the complete generated text.

In [14]:
def generate_text(model, seed_text, num_words):
    model.eval()
    text = seed_text
    for _ in range(num_words):
        x = torch.tensor([text[-1]])
        output, _ = model(x, None)
        _, predicted = torch.max(output, 1)
        text.append(predicted.item())
    return text

In summary, this function takes a trained model, a seed text, and a number of words to generate. It utilizes the model to predict the next word in the sequence iteratively, updating the text with each prediction, and finally returns the generated text.
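
A possible variant, shown here only as a sketch, carries the hidden state from one step to the next so that each prediction can depend on the whole sequence so far. `TinyRNN` is a hypothetical, untrained stand-in with the same interface as RNNModel, `generate_with_state` is an illustrative name, and unbatched inputs assume a recent PyTorch version.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Untrained stand-in with the same forward(x, h) interface as RNNModel."""
    def __init__(self, vocab_size=10, embed_size=4, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.linear(out), h

def generate_with_state(model, seed_ids, num_words):
    model.eval()
    ids = list(seed_ids)   # copy so the caller's list is not mutated
    with torch.no_grad():  # gradients are not needed for inference
        # Feed the whole seed once to build up the hidden state
        output, hidden = model(torch.tensor(ids), None)
        for _ in range(num_words):
            next_id = output[-1].argmax().item()  # greedy pick from the last step
            ids.append(next_id)
            # Feed only the new token, reusing the hidden state
            output, hidden = model(torch.tensor([next_id]), hidden)
    return ids

seed = [1, 2, 3]
result = generate_with_state(TinyRNN(), seed, num_words=5)
print(result)
```

Because the stand-in model is untrained, the generated indices are arbitrary; the point is the flow of the hidden state, not the output quality.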

# Step 7: Interacting with the Model

You interact with your trained model by providing it with a seed sequence (the prompt) and having it generate a response. This can be done in a loop to simulate a conversation with the model.

The following steps are a simple guide to working with a pretrained generative language model in Python: GPT-2, loaded through the Hugging Face Transformers library. GPT-2 is well known for generating coherent and contextually relevant sentences from a given prompt.

In [17]:
!pip install transformers





# Step 8: Importing the Model and Tokenizer

In [18]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 9: Encoding and Decoding Functions

These two functions are related to text encoding and decoding using the **Hugging Face Transformers library**. This library is commonly used for natural language processing tasks, including pre-trained models like BERT, GPT, etc. The provided functions are specifically designed to work with a tokenizer provided by this library.

Here's a breakdown of each function:

**encode function:**

def encode(prompt):
    return tokenizer.encode(prompt, return_tensors="pt")
    
This function takes a prompt as input, which is a piece of text or a sentence.
It uses the Hugging Face tokenizer to encode the text. The encode method of the tokenizer converts the input text into a sequence of numerical IDs (usually representing subword tokens).

The return_tensors="pt" argument specifies that the function should return a PyTorch tensor. This is useful when working with PyTorch-based models.

In [19]:
def encode(prompt):
    return tokenizer.encode(prompt, return_tensors="pt")

def decode(encoded_prompt):
    return tokenizer.decode(encoded_prompt[0], skip_special_tokens=True)

**decode function:**

def decode(encoded_prompt):
    return tokenizer.decode(encoded_prompt[0], skip_special_tokens=True)

This function takes an encoded_prompt as input, which is the result of the encoding process (a tensor of numerical IDs).
It uses the Hugging Face tokenizer again, this time with the decode method. This method converts the numerical IDs back into human-readable text.
The skip_special_tokens=True argument instructs the tokenizer to exclude any special tokens from the decoded text (e.g., GPT-2's <|endoftext|>, or [CLS] and [SEP] in BERT-style tokenizers).

In summary, these functions provide a convenient way to encode and decode text using the Hugging Face Transformers library's tokenizer. They are particularly useful when working with models that require input in a numerical format (encoded) and when you want to convert the model's output back into human-readable text (decoded).
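
The round trip from text to ids and back can be illustrated with a toy word-level vocabulary. This is not how GPT-2's tokenizer works internally (it uses byte-pair encoding over subword units); `toy_encode` and `toy_decode` are hypothetical names, and the toy encoder appends an end marker purely to make the effect of `skip_special_tokens` visible.

```python
# Toy word-level vocabulary (an assumption for illustration only)
vocab = {"<eos>": 0, "how": 1, "are": 2, "you": 3}
inverse_vocab = {idx: word for word, idx in vocab.items()}
special_ids = {vocab["<eos>"]}

def toy_encode(prompt):
    """Text -> list of integer ids, with an end-of-sequence marker appended."""
    return [vocab[w] for w in prompt.lower().split()] + [vocab["<eos>"]]

def toy_decode(ids, skip_special_tokens=True):
    """List of integer ids -> text, optionally dropping special tokens."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in special_ids]
    return " ".join(inverse_vocab[i] for i in ids)

ids = toy_encode("How are you")
text = toy_decode(ids)
raw = toy_decode(ids, skip_special_tokens=False)
print(ids, text, raw)
```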

# Step 10: Running The Model

Now for the fun part: generating text!

The GPT-2 model's inputs and outputs are sequences of integer token IDs. We can encode our input prompt, generate a response, and then decode this response to get our output message:

In [20]:
input_prompt = "How are you feeling today?"

input_prompt_encoded = encode(input_prompt)
output = model.generate(input_prompt_encoded, max_length=50, num_return_sequences=1, no_repeat_ngram_size=2, do_sample=True)
output_message = decode(output)

print(output_message)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


How are you feeling today? Can you smell? Do you feel good? Are you able to speak? Have you ever had to work after work that made you cringe? And that's just one part of the journey that we are taking today.



# Step 11: Create a conversational model

**A conversational model** is a type of natural language processing (NLP) model designed to understand and generate human-like responses in the context of a conversation. These models are often built using techniques from machine learning and deep learning and can be used for various applications, including chatbots, virtual assistants, and other interactive systems.

To create a conversational model, append each new prompt to the dialogue so far, so that the model sees the full conversation history when it generates a reply.

In [21]:
# conversation history
history_encoded = tokenizer.encode("Hello, I'm an AI model. ", return_tensors="pt")

# user input
user_input_encoded = tokenizer.encode("Hello, how are you?", return_tensors="pt")

# append the new user input tokens to the chat history
history_with_user_input_encoded = torch.cat([history_encoded, user_input_encoded], dim=-1)

# generate a response
output = model.generate(history_with_user_input_encoded, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, do_sample=True)

history_with_reply_encoded = output

# Print message
output_message = decode(history_with_reply_encoded)
print(output_message)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, I'm an AI model. Hello, how are you?  It sounds so familiar. If I look for those words in my head, they are.  
I don't know what that means. I have my thoughts, if I recall accurately.   
This is a big one. It's one of those things that I think is so central to human experience where we are all human. 
What?   I will be getting more into this soon.
How
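
In a longer chat, the history grows with every turn, while GPT-2 can only attend to a limited number of tokens (1024 for the base model). A minimal sketch of one way to handle this, using plain lists of token ids, is to drop the oldest tokens once the history no longer fits. `append_turn` and its parameters are hypothetical names introduced for illustration.

```python
MAX_CONTEXT = 1024  # GPT-2's maximum sequence length in tokens

def append_turn(history_ids, new_ids, max_new_tokens=50, max_context=MAX_CONTEXT):
    """Append a new turn and drop the oldest tokens if the context would overflow."""
    combined = history_ids + new_ids
    budget = max_context - max_new_tokens  # leave room for the generated reply
    return combined[-budget:] if len(combined) > budget else combined

history = []
history = append_turn(history, [10, 11, 12])  # first turn
history = append_turn(history, [20, 21])      # second turn
print(history)

# A history that is too long is trimmed from the front
trimmed = append_turn(list(range(2000)), [1, 2, 3])
print(len(trimmed))
```

Trimming from the front keeps the most recent turns, which usually matter most for the next reply; other strategies, such as summarizing older turns, are also possible.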
