<a href="https://colab.research.google.com/github/mukeshrock7897/GenerativeAI/blob/main/2_Transformers_Intermediate_Level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Intermediate Level**

1. Types of Transformer Models
    * BERT (Bidirectional Encoder Representations from Transformers)
    * GPT (Generative Pre-trained Transformer)
    * T5 (Text-to-Text Transfer Transformer)
    * RoBERTa (A Robustly Optimized BERT Pretraining Approach)
    * XLNet (Generalized Autoregressive Pretraining for Language Understanding)
    * ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)

2. Pre-training and Fine-tuning
    * Pre-training objectives (e.g., masked language modeling, causal language modeling)
    * Fine-tuning for specific tasks
    * Transfer learning

3. Sequence-to-Sequence Models
    * Overview of seq2seq models
    * Applications in translation, summarization

4. Transformer Implementations
    * Using Hugging Face Transformers library
    * Training and fine-tuning transformers on custom datasets
    * Handling large datasets and model training

5. Practical Projects
    * Building a text generator
    * Creating a chatbot
    * Implementing image generation using transformers


# **1. Types of Transformer Models**
**BERT (Bidirectional Encoder Representations from Transformers)**
* BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. It is pre-trained using masked language modeling (MLM) and next sentence prediction (NSP) tasks.
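
The next sentence prediction task can be pictured as building labeled sentence pairs. The sketch below is a simplified illustration, not BERT's actual data pipeline: the helper `make_nsp_examples` is a hypothetical name, and negatives are drawn from the same small list of sentences, whereas the original setup draws them from other documents.

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build (text, is_next) pairs for a toy next sentence prediction task."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows
            second, is_next = sentences[i + 1], True
        else:
            # Negative example: any sentence other than the pair itself
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            second, is_next = rng.choice(candidates), False
        # [CLS] and [SEP] are the special tokens BERT places around a sentence pair
        examples.append((f"[CLS] {first} [SEP] {second} [SEP]", is_next))
    return examples

sentences = [
    "The fox left its den.",
    "It crossed the field.",
    "A dog was sleeping nearby.",
    "The fox jumped over it.",
]
for text, is_next in make_nsp_examples(sentences):
    print(is_next, "|", text)
```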

**Example of BERT for Text Classification:**



In [1]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Tokenize input
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1])  # One class index per example, shape (batch_size,)

# Forward pass
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**GPT (Generative Pre-trained Transformer)**
* GPT is designed for generating text by predicting the next token in a sequence. It is trained with causal language modeling, so each prediction conditions only on the tokens to its left, and it can generate coherent and contextually relevant text.

**Example of GPT for Text Generation:**



In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode input text
input_text = "The quick brown fox"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text: The quick brown foxes are a great way to get a little bit of a kick out of your dog.

The quick brown foxes are a great way to get a little bit of a kick out of your dog. The quick brown fox
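
The repeated sentence in the output above is a common side effect of greedy decoding, which `generate` uses by default when neither sampling nor beam search is enabled. The sketch below reproduces the effect with a toy next-token table. The table `NEXT_TOKEN_PROBS` and both helper functions are hypothetical and only stand in for a real language model.

```python
import random

# Toy next-token probabilities (made up for illustration)
NEXT_TOKEN_PROBS = {
    "the": {"quick": 0.6, "lazy": 0.4},
    "quick": {"fox": 0.7, "dog": 0.3},
    "lazy": {"dog": 0.9, "fox": 0.1},
    "fox": {"jumps": 0.55, "sleeps": 0.45},
    "dog": {"sleeps": 0.8, "jumps": 0.2},
    "jumps": {"over": 1.0},
    "over": {"the": 1.0},
    "sleeps": {"<eos>": 1.0},
}

def greedy_decode(start, max_new_tokens):
    """Always pick the most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

def sample_decode(start, max_new_tokens, seed=0):
    """Draw the next token at random, weighted by its probability."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

# Greedy decoding cycles through "the quick fox jumps over" until the length limit
print("Greedy: ", " ".join(greedy_decode("the", 10)))
# Sampling can leave the cycle because lower-probability tokens are sometimes chosen
print("Sampled:", " ".join(sample_decode("the", 10, seed=1)))
```

With the real model, sampling is switched on by passing `do_sample=True` to `generate`, usually together with `top_k`, `top_p`, or `temperature`. `no_repeat_ngram_size` is another option for suppressing repeated phrases.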


**T5 (Text-to-Text Transfer Transformer)**
* T5 treats every NLP task as a text-to-text problem, allowing it to be used for a wide range of tasks such as translation, summarization, and classification.
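
To make the text-to-text framing concrete, the sketch below shows how different tasks reduce to plain input strings by prepending a task prefix. The prefixes are ones used with the original T5 checkpoints. The `TASK_PREFIXES` table and the `to_t5_input` helper are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table: task name -> text prefix the model was trained with
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
    "translation_en_to_fr": "translate English to French: ",
    "acceptability": "cola sentence: ",
}

def to_t5_input(task, text):
    """Turn a (task, text) pair into a single input string for a T5-style model."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PREFIXES[task] + text

sentence = "The quick brown fox jumps over the lazy dog."
for task in TASK_PREFIXES:
    print(to_t5_input(task, sentence))
```

Because every task shares this string-in, string-out interface, the same model, loss, and decoding procedure can be reused across tasks.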

**Example of T5 for Summarization:**

In [3]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load pre-trained model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Encode input text
input_text = "summarize: The quick brown fox jumps over the lazy dog."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate summary
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
summary = tokenizer.decode(output[0], skip_special_tokens=True)
print("Summary:", summary)





Summary: the quick brown fox jumps over the lazy dog.


**RoBERTa (A Robustly Optimized BERT Pretraining Approach)**
* RoBERTa keeps BERT's architecture but optimizes the pre-training recipe: it trains longer on more data with larger batches and longer sequences, applies dynamic masking, and removes the next sentence prediction task.

**XLNet (Generalized Autoregressive Pretraining for Language Understanding)**
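* XLNet combines the strengths of autoregressive and autoencoding models. It is trained autoregressively over randomly sampled permutations of the factorization order, so each token learns to condition on context from both sides without relying on masked input tokens.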
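
The permutation idea can be sketched in a few lines: sample an order over positions, then let each token condition on the positions that come earlier in that order, wherever they sit in the sentence. This is a simplified illustration under stated assumptions. The helper `permutation_contexts` is a hypothetical name, and details of the real model such as two-stream attention are left out.

```python
import random

def permutation_contexts(tokens, seed=0):
    """For one sampled factorization order, list the positions each token may see."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # sampled factorization order over positions
    contexts = {}
    for step, position in enumerate(order):
        # Visible context: positions that appear earlier in the sampled order
        contexts[position] = sorted(order[:step])
    return order, contexts

tokens = ["The", "quick", "brown", "fox"]
order, contexts = permutation_contexts(tokens, seed=0)
print("Factorization order:", order)
for position in order:
    visible = [tokens[i] for i in contexts[position]]
    print(f"predict {tokens[position]!r} at position {position} given {visible}")
```

Across many sampled orders, a token ends up being predicted from words on its left, on its right, or both, which is how bidirectional context enters an autoregressive objective.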

**ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)**
* ALBERT reduces the memory footprint and increases the training speed of BERT by factorizing the embedding parameters and sharing parameters across layers.
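
A quick parameter count shows why both ideas save memory. The sizes below are illustrative assumptions at roughly base-model scale, the per-layer count ignores biases and layer normalization, and the helper names are hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, optionally with ALBERT-style factorization."""
    if embedding_size is None:
        # Unfactorized: one vocab_size x hidden_size matrix
        return vocab_size * hidden_size
    # Factorized: vocab_size x embedding_size, then embedding_size x hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(params_per_layer, num_layers, share_across_layers):
    """Count encoder parameters with or without cross-layer sharing."""
    return params_per_layer if share_across_layers else params_per_layer * num_layers

# Illustrative sizes (assumptions)
VOCAB_SIZE, HIDDEN_SIZE, EMBEDDING_SIZE, NUM_LAYERS = 30_000, 768, 128, 12
# Rough per-layer weight count: 4 attention projections + 2 feed-forward matrices
PARAMS_PER_LAYER = 4 * HIDDEN_SIZE**2 + 2 * HIDDEN_SIZE * (4 * HIDDEN_SIZE)

print("Embeddings, unfactorized:", embedding_params(VOCAB_SIZE, HIDDEN_SIZE))
print("Embeddings, factorized:  ", embedding_params(VOCAB_SIZE, HIDDEN_SIZE, EMBEDDING_SIZE))
print("Encoder, separate layers:", encoder_params(PARAMS_PER_LAYER, NUM_LAYERS, False))
print("Encoder, shared layers:  ", encoder_params(PARAMS_PER_LAYER, NUM_LAYERS, True))
```

Sharing reduces the number of stored parameters, but every layer still has to be executed, so the amount of computation per forward pass does not shrink by the same factor.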

# **2. Pre-training and Fine-tuning**
**Pre-training Objectives**

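* **Masked Language Modeling (MLM):** Used in BERT, where a fraction of the input tokens (15% in the original BERT setup) is selected and masked, and the model learns to predict the original tokens from the surrounding context on both sides.
* **Causal Language Modeling (CLM):** Used in GPT, where the model predicts the next token in a sequence using only the tokens that precede it.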
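
The two objectives differ mainly in how training examples are built from the same text. The sketch below is a minimal illustration on whitespace tokens. `mask_tokens` and `causal_lm_examples` are hypothetical helpers, and the 80/10/10 replacement rule follows the recipe described for BERT rather than anything stated in this notebook.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style corruption: returns (corrupted_tokens, labels)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(token)
    return corrupted, labels

def causal_lm_examples(tokens):
    """CLM: every prefix of the sequence predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mask_prob=0.3)
print("MLM input: ", corrupted)
print("MLM labels:", labels)
for prefix, target in causal_lm_examples(tokens[:4]):
    print("CLM:", prefix, "->", target)
```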

**Example of Fine-tuning a Pre-trained Model:**

In [8]:
# !pip install datasets
# !pip install accelerate -U
# !pip install transformers[sentencepiece]
!pip install transformers[torch]

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the IMDB sentiment dataset (binary labels: 0 = negative, 1 = positive)
dataset = load_dataset('imdb')

# Load a tokenizer and a matching classification model.
# BERT already defines a padding token, so none has to be added.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize data, padding or truncating every review to the model's maximum length
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Trainer
trainer = Trainer(
    model=model, # Make sure 'model' is defined appropriately
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

# Train the model
trainer.train()

**Transfer Learning**
* Transfer learning reuses a pre-trained model and fine-tunes it on a specific task, often with a much smaller labeled dataset. Compared with training from scratch, this reduces training time and computational cost while typically still achieving good performance.

# **3. Sequence-to-Sequence Models**
**Overview of Seq2Seq Models**
* Seq2Seq models transform an input sequence into an output sequence. Transformers can be used as Seq2Seq models for tasks like translation and summarization.

**Applications in Translation and Summarization**
* Transformers like T5 can be used for translating text from one language to another or summarizing long documents into shorter versions.

**Example of Translation using a Seq2Seq Model:**

In [9]:
from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-fr'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Encode input text
input_text = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate translation
output = model.generate(input_ids, max_length=50)
translation = tokenizer.decode(output[0], skip_special_tokens=True)
print("Translation:", translation)





Translation: Le renard brun rapide saute sur le chien paresseux.


# **4. Transformer Implementations**

**Using Hugging Face Transformers Library**
* The Hugging Face Transformers library provides easy-to-use APIs for working with transformer models, including pre-trained models for various tasks.

**Training and Fine-tuning Transformers on Custom Datasets**
* You can train and fine-tune transformers on your custom datasets by loading the data, tokenizing it, and using the Trainer class.

**Example of Loading a Custom Dataset and Tokenizing:**

In [11]:
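from datasets import load_dataset
from transformers import BertTokenizer

# Load dataset from local CSV files (each file is expected to contain a 'text' column)
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})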

# Tokenize data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


**Handling Large Datasets and Model Training**
* For large datasets and models, you can use techniques like gradient accumulation, mixed precision training, and distributed training to handle memory and computational constraints.
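
Gradient accumulation is the easiest of these techniques to check by hand. The sketch below uses a one-parameter linear model (an assumption made purely for illustration, with hypothetical helper names) to show that summing scaled gradients from small micro-batches matches the gradient of one large batch, while only one micro-batch needs to be held in memory at a time.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_gradient(w, batch, micro_batch_size):
    """Accumulate gradients over micro-batches instead of one large batch."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    accumulation_steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum equals the full-batch mean
        # (exact when all micro-batches have the same size)
        total += mse_gradient(w, micro_batch) / accumulation_steps
    return total

# Toy (x, y) pairs, made up for illustration
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

full = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
print("Full-batch gradient: ", full)
print("Accumulated gradient:", accumulated)
```

With the `Trainer` class, the same idea is controlled by the `gradient_accumulation_steps` argument of `TrainingArguments`, and mixed precision can be requested with `fp16=True` on supported hardware.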

# **5. Practical Projects**
**Building a Text Generator**
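* Use a GPT-style model to build a text generator that produces coherent and contextually relevant text from a given prompt. GPT-2 can be downloaded and run locally through the Transformers library, whereas GPT-3 is available only through a hosted API.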

**Creating a Chatbot**
* Use transformers like DialoGPT to create a conversational AI that can respond to user queries in a natural and engaging manner.

**Example of Building a Chatbot:**

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained model and tokenizer
model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Chatbot conversation
input_text = "Hello! How are you?"
input_ids = tokenizer.encode(input_text + tokenizer.eos_token, return_tensors="pt")
chat_history_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(chat_history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
print("Chatbot Response:", response)
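
The example above handles a single turn. For a multi-turn conversation, DialoGPT-style models expect the earlier turns to be concatenated, each followed by the end-of-sequence token. The sketch below shows that layout at the string level. `build_dialogue_prompt` is a hypothetical helper, and the `max_turns` cut-off is an assumed way to keep the prompt short.

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-sequence marker used by GPT-2-style tokenizers

def build_dialogue_prompt(turns, eos_token=EOS_TOKEN, max_turns=None):
    """Join alternating user/bot turns into one prompt string."""
    if max_turns is not None:
        # Keep only the most recent turns so the prompt stays within the context window
        turns = turns[-max_turns:] if max_turns > 0 else []
    return "".join(turn + eos_token for turn in turns)

history = [
    "Hello! How are you?",               # user
    "I'm good, thanks. How about you?",  # bot
    "Great. Any plans for today?",       # user
]
print(build_dialogue_prompt(history))
print(build_dialogue_prompt(history, max_turns=1))
```

The resulting string can be tokenized and passed to `generate` in the same way as `input_text + tokenizer.eos_token` in the example above.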


**Implementing Image Generation Using Transformers**
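* Transformer-based models such as DALL-E generate images from text descriptions. Vision Transformers (ViTs) apply the same architecture to image understanding tasks such as classification rather than to generation.

**Example of Image Classification with a Vision Transformer:**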

In [None]:
# Vision Transformer example
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Load pre-trained model and image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
model_name = "google/vit-base-patch16-224"
image_processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

# Load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess image (resize, normalize, convert to tensors)
inputs = image_processor(images=image, return_tensors="pt")

# Make prediction
outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
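
The model name `vit-base-patch16-224` encodes how an image becomes a token sequence: a 224x224 image is cut into 16x16 patches, and each patch is flattened into one input vector. The arithmetic can be sketched as follows. The helper names are hypothetical, and the extra position assumes a classification token is prepended, as in the standard ViT design.

```python
def vit_sequence_length(image_size, patch_size, add_class_token=True):
    """Number of tokens a ViT-style encoder sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if add_class_token else 0)

def patch_vector_size(patch_size, channels=3):
    """Length of one flattened patch before the linear projection."""
    return patch_size * patch_size * channels

print("Tokens per image:", vit_sequence_length(224, 16))  # 14 * 14 patches + 1 class token = 197
print("Values per patch:", patch_vector_size(16))         # 16 * 16 * 3 = 768
```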
