This notebook collects my practice exercises with large language models (LLMs).

# Transformer
The Transformer was introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al. Transformers are particularly known for their efficiency and effectiveness in handling sequential data.

### Core Components of Transformers

1. **Attention Mechanism**:
   - The key innovation of the transformer model is the attention mechanism, specifically the "self-attention" mechanism. This allows the model to weigh the importance of different words within the input data regardless of their position in the sequence. In essence, self-attention gives the model the ability to focus on different parts of the input when producing a specific part of the output.

2. **Multi-Head Attention**:
   - In a transformer, the attention mechanism is extended into what is known as multi-head attention. This setup allows the model to jointly attend to information from different representation subspaces at different positions, providing a richer understanding of the context.

3. **Layered Architecture**:
   - Transformers are composed of a series of identical layers, each containing two main sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Additionally, each sub-layer has a residual connection around it followed by layer normalization.

4. **Positional Encoding**:
   - Since transformers do not inherently process sequential data as a sequence (like RNNs do), they require some form of positional encoding to maintain the order of the input. Positional encodings are added to the input embeddings to provide some information about the relative or absolute position of the tokens in the sequence.
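
To make the components above more concrete, here is a minimal, dependency-free sketch of sinusoidal positional encoding followed by single-head scaled dot-product self-attention. It is an illustration under simplifying assumptions, not the full architecture: the queries, keys, and values are all the input itself (no learned projection matrices), there is only one head, and the toy `embeddings` are hypothetical values chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (assumption: no learned projections)."""
    d_k = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Scaled dot product between this token and every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in x]
        weights = softmax(scores)
        # Each output vector is a weighted average of all value vectors
        out = [sum(w * value[j] for w, value in zip(weights, x)) for j in range(d_k)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings for a 3-token sequence (d_model = 4)
embeddings = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(len(embeddings), 4)

# Positional encodings are added element-wise to the input embeddings
x = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]

attended, attention_weights = self_attention(x)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over the whole sequence, so every row sums to 1. Multi-head attention would run several such computations in parallel on different learned projections of `x` and concatenate the results.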

### Advantages of Transformers

- **Parallelization**: Unlike recurrent neural networks (RNNs), transformers do not require that the data be processed in order. This means that operations can be parallelized, significantly speeding up training.
- **Long-range Dependencies**: Transformers can handle long-range dependencies in text, making them effective for applications like document summarization, where understanding broader context is crucial.
- **Flexibility**: They can be adapted for a wide range of tasks beyond NLP, such as image classification (Vision Transformer), and time-series forecasting.

### Applications

Transformers are the backbone of many modern NLP systems, including:
- **Text Generation**: Models like GPT (Generative Pre-trained Transformer) can generate coherent and contextually relevant text over extensive passages.
- **Translation**: The original Transformer architecture was designed for machine translation and remains highly effective at it.
- **Sentiment Analysis, Named Entity Recognition, and more**: BERT (Bidirectional Encoder Representations from Transformers) and its variants have set new standards in various NLP benchmarks.

The transformer model has not only improved the performance on traditional NLP tasks but has also enabled new applications and methods in AI, leading to ongoing innovations across many domains of machine learning.

In [None]:
from transformers import BertModel, BertTokenizer
import torch

In [None]:
# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text
text = "Hello, how are you today?"

# Encode text to get token ids and attention mask
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
print(inputs)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state

# You can take the mean of the token embeddings to get a sentence-level representation
sentence_embedding = last_hidden_states.mean(dim=1)
print(sentence_embedding)


# Information Retrieval with BERT and FAISS
Transformer-based language models can improve information retrieval by matching queries and documents on their semantic content rather than on exact keywords. A common approach is embedding-based retrieval: documents and queries are converted into dense vectors, and a similarity metric is used to find the best matches.

For this example, let’s use the transformers library by Hugging Face, combined with FAISS for efficient similarity search among embeddings. We'll use a BERT model to generate embeddings for a set of documents and a query, then use FAISS to find the most relevant documents.
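
Before bringing in BERT and FAISS, the retrieval step itself can be sketched with plain Python lists. The snippet below is only an illustration: the 3-dimensional `toy_docs` and `toy_query` vectors are hypothetical stand-ins for real embeddings, and `search` is a helper written for this example. It performs an exhaustive search by squared L2 distance, which is the same idea a flat L2 index applies at scale.

```python
import math

def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (an alternative similarity metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, k):
    """Exhaustive nearest-neighbour search; returns (distance, index) pairs."""
    scored = sorted((l2_squared(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return scored[:k]

# Hypothetical toy embeddings (real BERT embeddings have 768 dimensions)
toy_docs = [[0.9, 0.1, 0.0],
            [0.0, 1.0, 0.2],
            [0.1, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.1]

results = search(toy_query, toy_docs, k=2)
for distance, idx in results:
    print(idx, round(distance, 3), round(cosine_similarity(toy_query, toy_docs[idx]), 3))
```

Smaller L2 distance means a closer match, while larger cosine similarity means a closer match. Note that `faiss.IndexFlatL2` reports squared L2 distances, so its values are comparable to `l2_squared` here rather than to the Euclidean distance itself.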

In [3]:
#!pip install transformers faiss-cpu
from transformers import BertTokenizer, BertModel
import torch
import faiss
import numpy as np

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding='max_length')
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last hidden state as the document representation.
    # Note: with padding='max_length', this mean also covers the [PAD] positions.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Example documents
docs = [
    "Pandas is an open source data analysis library.",
    "Hugging Face provides state-of-the-art machine learning models.",
    "Transformers are models that handle sequential data.",
    "FAISS is designed for efficient similarity search.",
    "NumPy is a fundamental package for scientific computing."
]

# Convert documents to embeddings
doc_embeddings = np.array([get_embedding(doc) for doc in docs])

# Initialize FAISS index
dimension = doc_embeddings.shape[1]  # Dimension of embeddings
index = faiss.IndexFlatL2(dimension)  # Using L2 distance for similarity
index.add(doc_embeddings)  # Add embeddings to index

# Example query
query = "Pandas"
query_embedding = get_embedding(query)

# Perform search
k = 5  # Number of nearest neighbors
distances, indices = index.search(np.array([query_embedding]), k)

# Display search results
print("Query:", query)
print("Top results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}: {docs[idx]} (Distance: {distances[0][i]})")


Query: Pandas
Top results:
1: Hugging Face provides state-of-the-art machine learning models. (Distance: 48.62348556518555)
2: Pandas is an open source data analysis library. (Distance: 55.538551330566406)
3: NumPy is a fundamental package for scientific computing. (Distance: 56.726688385009766)
4: FAISS is designed for efficient similarity search. (Distance: 69.76235961914062)
5: Transformers are models that handle sequential data. (Distance: 86.79019165039062)


(5, 768)
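
In the results above, the document that actually mentions Pandas ranks only second for the query "Pandas". One possible contributor (an assumption, not verified in this notebook) is that `get_embedding` pads every input to 512 tokens and then averages over all positions, so the padding positions take part in the mean. A common remedy is mask-aware mean pooling, which averages only the real tokens. The sketch below shows the idea on plain lists; `mean_pool` and the toy `hidden` values are hypothetical, and in the notebook's setting the mask would come from `inputs["attention_mask"]`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        # No real tokens: fall back to a zero vector instead of dividing by zero
        return [0.0] * dim
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Hypothetical hidden states: 2 real tokens followed by 2 padding positions
hidden = [[1.0, 3.0], [3.0, 1.0], [10.0, 10.0], [10.0, 10.0]]
mask = [1, 1, 0, 0]

masked = mean_pool(hidden, mask)                  # real tokens only
unmasked = mean_pool(hidden, [1] * len(hidden))   # padding included

print(masked)    # [2.0, 2.0]
print(unmasked)  # [6.0, 6.0]
```

In this toy case the unmasked mean is pulled toward the padding vectors, which hints at how a very short query could end up with an embedding dominated by padding. Whether mask-aware pooling changes the ranking for the documents above would need to be checked by rerunning the search.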