

---
# Attention-Based Models and Transfer Learning


---




# 1. What is BERT and how does it work?


BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that learns deep contextual representations of words by considering both the left and right context at the same time. It works by pretraining on large text data using masked language modeling (predicting masked words) and next sentence prediction, then fine-tuning for specific NLP tasks such as classification, question answering, and named entity recognition.
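To make the masked language modeling objective concrete, here is a minimal sketch of how training inputs might be corrupted. It assumes the masking scheme described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The function name `mask_tokens`, the whitespace tokenization, and the higher `mask_prob` used in the demo are illustrative assumptions, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(tok)                    # untouched position
            labels.append(None)
    return corrupted, labels

# Toy sentence; mask_prob is raised so the effect is visible on a short input.
tokens = "the cat sat on the mat because it was tired".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mask_prob=0.3)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, and the model sees context on both sides of each such position.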


# 2. What are the main advantages of using the attention mechanism in neural networks?


The attention mechanism allows a neural network to focus on the most relevant parts of the input data while processing it. It helps models handle long sequences better, capture important contextual relationships, improve accuracy in tasks like translation and text understanding, and reduce information loss compared to traditional sequential models.


# 3. How does the self-attention mechanism differ from traditional attention mechanisms?


Self-attention computes relationships **within the same sequence**, allowing each word or element to attend to all other elements in that sequence. Traditional attention (often called cross-attention) usually works **between two different sequences**, such as the decoder attending to the encoder outputs. Because every position can attend to every other position in a single step, self-attention captures global context directly and can be computed in parallel across the sequence.


# 4. What is the role of the decoder in a Seq2Seq model?


The decoder in a Seq2Seq model is responsible for generating the output sequence one element at a time. It uses the encoded representation from the encoder along with previously generated outputs to predict the next output token until the complete sequence is produced.


# 5. What is the difference between GPT-2 and BERT models?


BERT is a bidirectional model that focuses on understanding language by using context from both directions, making it suitable for tasks like classification and question answering. GPT-2 is a unidirectional model designed for text generation, predicting the next word based only on previous words. BERT uses only transformer encoders, while GPT-2 uses only transformer decoders.
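The bidirectional versus unidirectional distinction comes down to which positions each token is allowed to attend to. The sketch below is a simplified illustration in plain Python, with hypothetical all-zero scores so the effect of the mask is easy to read; the function names are assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds how much position i attends to each position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # A causal mask hides future positions (j > i) by setting their score to -inf.
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal for clarity).
scores = [[0.0] * 3 for _ in range(3)]

bert_style = attention_weights(scores, causal=False)  # every token sees every token
gpt2_style = attention_weights(scores, causal=True)   # each token sees only itself and the past

for row in gpt2_style:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

With the causal mask, the first token can attend only to itself, which is what lets a GPT-2 style model be trained to predict the next token without seeing it.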


# 6. Why is the Transformer model considered more efficient than RNNs and LSTMs?


The Transformer is more efficient to train because it processes all input tokens in parallel instead of one step at a time, which makes better use of modern parallel hardware such as GPUs. Attention also connects any two tokens directly, so long-range relationships do not have to pass through many recurrent steps as they do in RNNs and LSTMs, which eases the long-term dependency problem.


# 7. How does the attention mechanism work in a Transformer model?


In a Transformer, the attention mechanism computes relationships between all words in a sequence using **queries, keys, and values**. Each word generates these three vectors, and attention scores are calculated to determine how much focus each word should give to others. The weighted combination of values produces context-aware representations that capture the meaning of each word based on the entire sequence.
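The computation can be sketched as scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`, as defined in the original Transformer paper. The example below is a minimal, dependency-free version in plain Python. As a simplifying assumption, the toy token vectors are used directly as queries, keys, and values; a real model derives each of them through its own learned projection matrix.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        context = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how strongly one token attends to the others; each row of `outputs` is the resulting context-aware vector for that token.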


# 8. What is the difference between an encoder and a decoder in a Seq2Seq model?


The **encoder** processes the input sequence and converts it into a fixed-size context representation that captures the meaning of the input. The **decoder** takes this context and generates the output sequence step by step, using both the context and previously generated tokens to produce accurate outputs.


# 9. What is the primary purpose of using the self-attention mechanism in transformers?


The primary purpose of self-attention is to allow the model to **capture relationships between all elements in a sequence**, regardless of their distance. This enables transformers to understand context and dependencies globally, improving tasks like translation, text understanding, and sequence modeling.


# 10. How does the GPT-2 model generate text?


GPT-2 generates text **autoregressively**, predicting the next word in a sequence based on all previously generated words. It uses a transformer decoder to model the probability of the next token and continues this process step by step to produce coherent, contextually relevant text.
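The step-by-step loop can be sketched as follows. The hand-written bigram table `BIGRAMS` is a hypothetical stand-in for a trained model, and unlike GPT-2 it looks only at the last token rather than the whole prefix; the loop structure (predict, append, repeat) is the part being illustrated.

```python
def generate(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # distribution over the next token
        next_tok = max(probs, key=probs.get)  # greedy choice
        if next_tok == eos:
            break
        tokens.append(next_tok)               # the output becomes part of the next input
    return tokens

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
BIGRAMS = {
    "once": {"upon": 0.9, "a": 0.1},
    "upon": {"a": 1.0},
    "a": {"time": 0.8, "upon": 0.2},
    "time": {"<eos>": 1.0},
}

def toy_model(tokens):
    return BIGRAMS[tokens[-1]]

result = generate(toy_model, ["once"], max_new_tokens=10)
print(result)  # ['once', 'upon', 'a', 'time']
```

Replacing the greedy `max` with sampling from `probs` gives more varied text at the cost of determinism.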


# 11. What is the main difference between the encoder-decoder architecture and a simple neural network?


A simple feed-forward neural network maps a fixed-size input to a fixed-size output in a single forward pass, while an **encoder-decoder architecture** first encodes the input into a **latent context representation** (encoder) and then decodes it into the output sequence (decoder). This separation allows the model to handle variable-length input and output sequences and complex tasks like machine translation or text summarization.


# 12. What is “fine-tuning” in BERT?


Fine-tuning in BERT refers to **adapting a pretrained BERT model** to a specific downstream task (like sentiment analysis, question answering, or named entity recognition) by training it on a smaller task-specific dataset. During fine-tuning, the **entire BERT model’s weights are updated** along with a task-specific output layer to optimize performance for that particular task.


# 13. How does the attention mechanism handle long-range dependencies in sequences?


The attention mechanism handles long-range dependencies by allowing each element in a sequence to **directly attend to all other elements**, regardless of their distance. This enables the model to capture relationships between distant tokens without relying on sequential steps, unlike RNNs or LSTMs, which struggle with long-term dependencies.


# 14. What is the core principle behind the Transformer architecture?


The core principle of the Transformer is to **replace sequential processing with self-attention and parallel computation**, allowing the model to capture **global dependencies** in a sequence efficiently. It uses **encoder and decoder stacks** with attention mechanisms to process input and output sequences, enabling faster training and better handling of long-range relationships compared to RNNs and LSTMs.


# 15. What is the role of the "position encoding" in a Transformer model?


Position encoding provides the Transformer with **information about the order of tokens** in a sequence. Since Transformers process all tokens in parallel and do not inherently capture sequence order, position encodings (usually sinusoidal or learned vectors) are added to input embeddings so the model can distinguish the relative and absolute positions of tokens.
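As an illustration, the sinusoidal variant from the original Transformer paper assigns `sin(pos / 10000^(2i/d_model))` to even dimensions and the matching cosine to odd dimensions. The sketch below is a plain-Python version; the function name and the small sizes in the demo are assumptions made for this example.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_position_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(x, 3) for x in row])
```

Each row would be added element-wise to the embedding of the token at that position, so identical words at different positions receive different input vectors.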


# 16. How do Transformers use multiple layers of attention?


Transformers stack **multiple self-attention layers** in both the encoder and decoder. Each layer refines the representation of tokens by allowing them to attend to all other tokens, capturing increasingly complex relationships and hierarchical features. Stacking layers enables the model to learn **richer, deeper contextual information** for better sequence understanding and generation.


# 17. What does it mean when a model is described as “autoregressive” like GPT-2?


An **autoregressive model** generates each token in a sequence **based on the previously generated tokens**. In GPT-2, this means the model predicts the next word using all prior words in the sequence, producing text **step by step** in a sequential, left-to-right manner.
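Written as a formula, an autoregressive model factorizes the probability of a sequence into a product of next-token probabilities, each conditioned on the tokens before it. The symbols $x_t$ (the token at position $t$) and $T$ (the sequence length) are notation introduced here for illustration:

$$
p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
$$

Generation samples or selects $x_t$ from the conditional on the right, appends it to the context, and repeats.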

# 18. How does BERT's bidirectional training improve its performance?


BERT’s bidirectional training allows the model to **look at both the left and right context** of each word during pretraining. This provides a **richer, more complete understanding of word meaning** in context, improving performance on tasks like question answering, sentiment analysis, and named entity recognition compared to unidirectional models.


# 19. What are the advantages of using the Transformer over RNN-based models in NLP?


* **Parallel Processing:** Transformers process all tokens simultaneously, unlike RNNs which process sequentially, enabling faster training.
* **Long-Range Dependencies:** Self-attention captures relationships between distant tokens more effectively than RNNs or LSTMs.
* **Better Context Understanding:** Each token attends to all others, improving contextual representations.
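* **Scalability:** Transformers scale well to large datasets and large models, although the cost of self-attention grows quadratically with sequence length.
* **More Stable Gradients:** Direct attention paths and residual connections reduce the vanishing and exploding gradient problems common in RNNs.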


# 20. What is the attention mechanism’s impact on the performance of models like BERT and GPT-2?


The attention mechanism allows these models to **focus on the most relevant parts of the input sequence**, capturing contextual relationships between words regardless of their distance. This leads to **better understanding, higher accuracy, and improved generation quality** compared to models without attention. It is a key reason why BERT and GPT-2 excel in NLP tasks.




---
# Practical


---



# 1. How to implement a simple text classification model using LSTM in Keras?

In [1]:
# Import necessary libraries
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# ===== Sample Data =====
texts = [
    "I love this movie",
    "This movie is terrible",
    "Amazing film",
    "I did not like this movie",
    "Best film ever",
    "Worst movie"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# ===== Step 1: Tokenize the text =====
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# ===== Step 2: Pad sequences =====
max_len = 5
X = pad_sequences(sequences, maxlen=max_len)
y = np.array(labels)

# ===== Step 3: Build LSTM Model =====
model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=64, input_length=max_len))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# ===== Step 4: Compile the model =====
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ===== Step 5: Train the model =====
model.fit(X, y, epochs=10, batch_size=2, verbose=1)

# ===== Step 6: Make predictions =====
test_texts = ["I really enjoyed this film", "This was a bad movie"]
test_seq = tokenizer.texts_to_sequences(test_texts)
test_pad = pad_sequences(test_seq, maxlen=max_len)
predictions = model.predict(test_pad)

for text, pred in zip(test_texts, predictions):
    print(f"Text: {text} | Predicted Score: {pred[0]:.4f}")


Epoch 1/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 2s 18ms/step - accuracy: 0.2708 - loss: 0.6936
Epoch 2/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.7708 - loss: 0.6867
Epoch 3/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.4375 - loss: 0.6884
Epoch 4/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.7292 - loss: 0.6834
Epoch 5/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 1.0000 - loss: 0.6759
Epoch 6/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6732
Epoch 7/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6766
Epoch 8/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6676
Epoch 9/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 16

# 2. How to generate sequences of text using a Recurrent Neural Network (RNN)?

In [2]:
# Import libraries
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ===== Sample Text Data =====
corpus = [
    "hello world",
    "hello there",
    "hello machine learning",
    "machine learning is fun",
    "deep learning with python"
]

# ===== Step 1: Tokenize text =====
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Create input sequences for training
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[:i+1]
        input_sequences.append(n_gram_seq)

# Pad sequences
max_seq_len = max(len(x) for x in input_sequences)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre'))

# Split predictors and label
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

# ===== Step 2: Build RNN model =====
model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_seq_len-1))
model.add(SimpleRNN(50))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# ===== Step 3: Train the model =====
model.fit(X, y, epochs=200, verbose=0)

# ===== Step 4: Generate text =====
def generate_text(seed_text, next_words, max_seq_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        # Map the predicted index back to its word ("" for the padding index 0)
        output_word = tokenizer.index_word.get(int(predicted), "")
        seed_text += " " + output_word
    return seed_text

# ===== Example =====
print(generate_text("hello", 5, max_seq_len))


hello world learning is python python


# 3. How to perform sentiment analysis using a simple CNN model?

In [3]:
# Import libraries
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout

# ===== Sample Data =====
texts = [
    "I love this movie",
    "This movie is terrible",
    "Amazing film",
    "I did not like this movie",
    "Best film ever",
    "Worst movie"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# ===== Step 1: Tokenize and pad sequences =====
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
max_len = 5
X = pad_sequences(sequences, maxlen=max_len)
y = np.array(labels)

# ===== Step 2: Build CNN model =====
model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=64, input_length=max_len))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# ===== Step 3: Compile the model =====
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ===== Step 4: Train the model =====
model.fit(X, y, epochs=10, batch_size=2, verbose=1)

# ===== Step 5: Predict on new text =====
test_texts = ["I really enjoyed this film", "This was a bad movie"]
test_seq = tokenizer.texts_to_sequences(test_texts)
test_pad = pad_sequences(test_seq, maxlen=max_len)
predictions = model.predict(test_pad)

for text, pred in zip(test_texts, predictions):
    print(f"Text: {text} | Predicted Score: {pred[0]:.4f}")


Epoch 1/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.6458 - loss: 0.6865
Epoch 2/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.7292 - loss: 0.6810
Epoch 3/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.5833 - loss: 0.6781
Epoch 4/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.8542 - loss: 0.6346
Epoch 5/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7708 - loss: 0.6571
Epoch 6/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9167 - loss: 0.6431
Epoch 7/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 1.0000 - loss: 0.6199
Epoch 8/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 1.0000 - loss: 0.6005
Epoch 9/10
3/3 ━━━━━━━━━━━━━━━━━━━━

# 4. How to perform Named Entity Recognition (NER) using spaCy?

In [4]:
# Install spaCy (if not already installed)
!pip install spacy

# Download the English language model
!python -m spacy download en_core_web_sm

# Import spaCy
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion. Elon Musk lives in California."

# Process the text
doc = nlp(text)

# ===== Extract Named Entities =====
print("Named Entities, Labels, and Positions:\n")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}")

# ===== Visualize Entities (Optional) =====
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)


Named Entities, Labels, and Positions:

Entity: Apple, Label: ORG, Start: 0, End: 5
Entity: U.K., Label: GPE, Start: 27, End: 31
Entity: $1 billion, Label: MONEY, Start: 44, End: 54
Entity: Elon Musk, Label: PERSON, Start: 56, End: 65
Entity: California, Label: GPE, Start: 75, End: 85


# 5. How to implement a simple Seq2Seq model for machine translation using LSTM in Keras?

In [7]:
# Import libraries
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ===== Sample Data =====
# English to French (toy dataset)
input_texts = ["hello", "how are you", "good morning", "thank you"]
target_texts = ["bonjour", "comment ça va", "bonjour", "merci"]

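# Add start and end tokens for target.
# Space-separated word tokens are used so the word-level tokenizer keeps them as
# separate vocabulary entries ("\t" and "\n" would stay glued to the adjacent words).
target_texts = ["<start> " + t + " <end>" for t in target_texts]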

# ===== Step 1: Tokenize and pad sequences =====
# Input language
input_tokenizer = Tokenizer(char_level=False)
input_tokenizer.fit_on_texts(input_texts)
input_sequences = input_tokenizer.texts_to_sequences(input_texts)
max_encoder_len = max(len(seq) for seq in input_sequences)
encoder_input_data = pad_sequences(input_sequences, maxlen=max_encoder_len, padding='post')

# Target language
target_tokenizer = Tokenizer(char_level=False, filters='')
target_tokenizer.fit_on_texts(target_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)
max_decoder_len = max(len(seq) for seq in target_sequences)
decoder_input_data = pad_sequences(target_sequences, maxlen=max_decoder_len, padding='post')

# Prepare decoder target data (shifted by 1)
decoder_target_data = np.zeros_like(decoder_input_data)
decoder_target_data[:, :-1] = decoder_input_data[:, 1:]
decoder_target_data[:, -1] = 0

# Vocabulary sizes
num_encoder_tokens = len(input_tokenizer.word_index) + 1
num_decoder_tokens = len(target_tokenizer.word_index) + 1

# ===== Step 2: Build Seq2Seq Model =====
# Encoder
encoder_inputs = Input(shape=(max_encoder_len,))
# The sequence length is already fixed by the Input layer, so input_length is not needed
encoder_embedding = Embedding(num_encoder_tokens, 64)(encoder_inputs)
encoder_lstm = LSTM(64, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]  # final hidden and cell state passed to the decoder

# Decoder
decoder_inputs = Input(shape=(max_decoder_len,))
decoder_embedding = Embedding(num_decoder_tokens, 64)(decoder_inputs)
decoder_lstm = LSTM(64, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# ===== Step 3: Compile and train =====
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

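# Note: sparse_categorical_crossentropy expects integer token IDs as targets, so
# decoder_target_data can be used as is (no one-hot encoding is required).
# A real dataset would need far more sentence pairs and many epochs.
# Training is skipped here for simplicity.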

# 6. How to generate text using a pre-trained transformer model (GPT-2)?

In [8]:
# Install the transformers library
!pip install transformers

# Import libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# ===== Function to generate text =====
def generate_text_gpt2(prompt, max_length=50):
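    # Tokenize the prompt; this returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # avoids the missing-mask warning
        pad_token_id=tokenizer.eos_token_id,      # GPT-2 has no dedicated pad token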
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8
    )

    # Decode the generated text
    return tokenizer.decode(output[0], skip_special_tokens=True)

# ===== Example usage =====
prompt_text = "Once upon a time"
generated_text = generate_text_gpt2(prompt_text, max_length=100)
print("Generated Text:\n")
print(generated_text)





Generated Text:

Once upon a time there were no humans in the sky, and the sun had set above the horizon, the wind blew over the mountains, bringing them full of life.

All life was brought to light by the light from heaven. Now, I must say, that was a very good thing. I had seen the stars. There was no way to know which way. In the end, my world was an endless stream of light, where none knew what was going on. My world is
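The `temperature`, `top_k`, and `top_p` arguments passed to `generate` control how the next token is sampled. The sketch below illustrates the general idea on a toy distribution in plain Python. It is a simplified approximation written for this note, not the Hugging Face implementation, and the function name `sample_next_token` and the example logits are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Sample one token from a {token: logit} dict; also return the surviving candidates."""
    rng = rng or random.Random()
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    candidates = [tok for tok, _ in kept]
    weights = [p for _, p in kept]
    # random.choices renormalizes the remaining weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0], candidates

# Hypothetical logits for the token following "Once upon a".
logits = {"time": 5.0, "day": 3.0, "night": 1.0, "banana": -2.0}
token, candidates = sample_next_token(
    logits, temperature=1.0, top_k=3, top_p=0.9, rng=random.Random(0)
)
print(candidates)  # ['time', 'day']
print(token)
```

Here top-k removes `banana`, and top-p then drops `night` because `time` and `day` already cover more than 90% of the probability mass.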


# 8. How can you add an Attention Mechanism to a Seq2Seq model?


To add attention to a Seq2Seq model:

1. **Compute attention scores:** At each decoder step, calculate a score (e.g., dot product or additive) between the decoder hidden state and each encoder hidden state.
2. **Apply softmax:** Convert the scores into a probability distribution (attention weights) over the encoder outputs.
3. **Compute context vector:** Multiply the attention weights with encoder outputs and sum them to get a context vector.
4. **Combine with decoder state:** Concatenate or combine the context vector with the decoder hidden state to generate the output token.
5. **Train end-to-end:** The model learns to focus on the relevant encoder states for each output step.

This allows the decoder to **focus on specific parts of the input sequence** instead of relying only on the fixed encoder state, improving performance in tasks like translation and summarization.
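Steps 1 to 4 above can be sketched for a single decoder step as follows. This is a minimal plain-Python illustration that assumes dot-product scoring and concatenation as the combination step; the vectors are hypothetical, and in a trained model (step 5) they would come from the encoder and decoder LSTMs.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One attention step: returns (weights, context, combined)."""
    # 1. Scores: dot product between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Softmax turns the scores into attention weights.
    weights = softmax(scores)
    # 3. Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[c] for w, h in zip(weights, encoder_states)) for c in range(dim)]
    # 4. Combine: concatenate the decoder state with the context vector.
    combined = decoder_state + context
    return weights, context, combined

# Hypothetical hidden states: three encoder time steps, one decoder step.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context, combined = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
print(len(combined))  # 4
```

The `combined` vector would then feed a dense softmax layer that predicts the output token, and the whole computation is repeated at every decoder step.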