<a href="https://colab.research.google.com/github/kla55/langchain-llm/blob/main/Langchain_RAG_app_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install faiss-cpu transformers sentence-transformers

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [2]:
# Sample dataset (can be replaced with any open-source text data)
data = [
    {"id": 1, "text": "What is COVID-19?", "answer": "COVID-19 is caused by the SARS-CoV-2 virus."},
    {"id": 2, "text": "What are symptoms of COVID-19?", "answer": "Symptoms include fever, cough, and fatigue."},
    {"id": 3, "text": "How does COVID-19 spread?", "answer": "COVID-19 spreads primarily through respiratory droplets."},
]

In [3]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

In [4]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [entry['text'] for entry in data]

# Create embeddings for the dataset
embeddings = embedding_model.encode(texts)

# Build a FAISS index
dimensions = embeddings.shape[1]
index = faiss.IndexFlatL2(dimensions) # Flat index using L2 (Euclidean) distance
index.add(np.array(embeddings)) # Add embeddings to the index

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
dimensions

384

# Build a FAISS index
**Embedding**\
`embeddings` is a 2D NumPy array in which each row is the vector representation (embedding) of one text and each column is one dimension of that vector.\
Example: if `embeddings` has shape (3, 384), there are 3 embeddings, each with 384 dimensions.\
Purpose: `embeddings.shape[1]` is the number of dimensions per embedding (384 in this example). FAISS needs this value to initialize the index.
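
As a quick check of the shape convention described above, the sketch below uses a random array as a stand-in for real embeddings (an assumption, so that it runs without downloading a model) and reads the dimension from the array's shape. The name `fake_embeddings` is hypothetical.

```python
import numpy as np

# Stand-in for embedding_model.encode(texts): 3 texts, 384 dimensions each.
# The random values are for illustration only; real embeddings come from the model.
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.random((3, 384), dtype=np.float32)

# Rows are texts, columns are embedding dimensions.
num_vectors, dimension = fake_embeddings.shape
print(num_vectors, dimension)
```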

**FAISS**\
FAISS (Facebook AI Similarity Search) is a library for fast similarity search over high-dimensional vectors.

**IndexFlatL2:**\
This creates a "flat" (non-hierarchical) index: vectors are stored as-is, and every search compares the query against all of them using L2 (Euclidean) distance.\
L2 distance is the square root of the sum of squared differences between corresponding elements of two vectors: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$\
Note that `IndexFlatL2` reports squared L2 distances (it skips the square root), which does not change the ranking of results.\
Purpose: The IndexFlatL2 object is initialized to accept embeddings of the specified dimension and perform searches based on Euclidean distance.
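
To make the distance concrete, here is a small worked example in plain Python. The helper `l2_distance` and the two vectors are made up for illustration; the squared value is included because that is what `IndexFlatL2` returns from a search.

```python
import math

def l2_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

dist = l2_distance(x, y)  # sqrt(3^2 + 4^2 + 0^2)
squared = dist ** 2       # squared L2 distance, the form IndexFlatL2 reports
print(dist, squared)
```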

**np.array(embeddings):**\
Converts the embeddings into a NumPy array if they are not one already.\
This ensures compatibility with FAISS, which expects a 2D `float32` NumPy array. `encode()` already returns a NumPy array by default, so the call here is a safeguard.

**index.add():**\
Adds the embeddings to the FAISS index. Each row in embeddings becomes a searchable vector in the index.\
Purpose: The index now contains the embeddings and is ready to perform similarity searches.
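
If the vectors come from somewhere other than `encode()` (plain lists, `float64` arrays), they may need converting before `index.add()`. The sketch below shows one way to do that; `to_faiss_input` is a hypothetical helper, not part of FAISS.

```python
import numpy as np

def to_faiss_input(vectors):
    """Return a C-contiguous float32 2D array (hypothetical helper)."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError("expected a 2D array of shape (n_vectors, dimension)")
    return arr

raw = [[0.1, 0.2], [0.3, 0.4]]  # plain Python lists of floats
prepared = to_faiss_input(raw)
print(prepared.dtype, prepared.shape)
```

The result could then be passed to `index.add(prepared)` on an index created with a matching dimension.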

In [6]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # Flat index using L2 (Euclidean) distance
index.add(np.array(embeddings))  # Add embeddings to the index

embedding_model.encode([query])
- Encodes the query into its vector representation using the embedding model. Wrapping the query in a list yields a 2D array of shape (1, 384), the shape index.search() expects.

index.search(query_embedding, k)
- Searches the FAISS index for the k embeddings most similar to query_embedding.
- k=1: Only the closest match (the single most similar vector) is retrieved from the index.
- FAISS computes the distance from query_embedding to every vector in the index and returns the k vectors with the smallest distances.

Returns:
- distances: The squared L2 distances between the query embedding and the closest embeddings in the index, smallest first.
- indices: The positions of the matching embeddings in the FAISS index, in the order they were added.
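
The search described above can be sketched in plain NumPy. This is a brute-force stand-in written for illustration, not how FAISS is implemented internally; `flat_l2_search` and the toy vectors are assumptions.

```python
import numpy as np

def flat_l2_search(stored, queries, k=1):
    """Brute-force k-nearest-neighbour search, a NumPy stand-in for a flat L2 index."""
    # Squared L2 distance between every query and every stored vector.
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Indices of the k smallest distances per query, nearest first.
    indices = np.argsort(sq_dists, axis=1)[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices

stored = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

distances, indices = flat_l2_search(stored, queries, k=2)
print(indices[0])    # positions of the two nearest stored vectors
print(distances[0])  # their squared L2 distances, smallest first
```

As with `index.search()`, both results are 2D: one row per query, one column per neighbour.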

In [7]:
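def retrieve_relevant_text(query, index, embedding_model, texts, k=1):
    # Encode the query as a (1, dimension) array, the shape FAISS expects
    query_embedding = embedding_model.encode([query])
    # indices[0] holds the positions of the k nearest texts, closest first
    distances, indices = index.search(query_embedding, k)
    return [texts[i] for i in indices[0]]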

In [16]:
# Example query
query = "What causes COVID-19?"
retrieved_texts = retrieve_relevant_text(query, index, embedding_model, texts, 3)
print("Retrieved Text:", retrieved_texts)

Retrieved Text: ['What is COVID-19?', 'How does COVID-19 spread?', 'What are symptoms of COVID-19?']
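
The retrieval step returns the stored questions, but the `answer` field is usually the more useful context for generation. Because the index rows follow the order of `data`, the returned indices can be mapped back to answers. The sketch below assumes this ordering; `lookup_answers` is a hypothetical helper and `example_indices` is a hand-written stand-in for `indices[0]`.

```python
data = [
    {"id": 1, "text": "What is COVID-19?", "answer": "COVID-19 is caused by the SARS-CoV-2 virus."},
    {"id": 2, "text": "What are symptoms of COVID-19?", "answer": "Symptoms include fever, cough, and fatigue."},
    {"id": 3, "text": "How does COVID-19 spread?", "answer": "COVID-19 spreads primarily through respiratory droplets."},
]

def lookup_answers(indices_row, data):
    """Map one row of search-result indices to the stored answers (hypothetical helper)."""
    # FAISS pads the result with -1 when fewer than k vectors are available.
    return [data[i]["answer"] for i in indices_row if i != -1]

# Hand-written stand-in for indices[0] returned by index.search()
example_indices = [0, 2, -1]
answers = lookup_answers(example_indices, data)
print(answers)
```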


In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a generative model (GPT-style model)
model_name = "gpt2"  # Or any other Hugging Face GPT-like model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_answer(context, query):
    prompt = f"Given the following context: {context}\n\nAnswer the following question concisely: {query}\n\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generate an answer
context = retrieved_texts[0]
response = generate_answer(context, query)
print("Generated Answer:", response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Answer: Given the following context: What is COVID-19?

Answer the following question concisely: What causes COVID-19?

Answer: COVID-19 is a chemical compound that is used in the manufacture of a wide variety of products. It is used in many different ways, including as a solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent, as a solvent-based solvent,
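
The decoded output above begins with the prompt because a causal language model returns the prompt followed by its continuation. One simple way to keep only the continuation is to strip the prompt prefix from the decoded string. This is a minimal sketch; `extract_answer` and the sample strings are hypothetical.

```python
def extract_answer(generated_text, prompt):
    """Return only the text generated after the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if decoding did not reproduce the prompt exactly.
    return generated_text.strip()

prompt = (
    "Given the following context: ctx\n\n"
    "Answer the following question concisely: q\n\nAnswer:"
)
generated = prompt + " SARS-CoV-2 causes it."
answer_only = extract_answer(generated, prompt)
print(answer_only)
```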


In [None]:
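# `inputs` inside generate_answer() is local to that function, so rebuild it here
prompt = f"Given the following context: {context}\n\nAnswer the following question concisely: {query}\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,  # Pass input_ids together with the attention mask
    max_new_tokens=50,  # Limit the number of newly generated tokens (prompt excluded)
    num_return_sequences=1,  # Generate only one response
    no_repeat_ngram_size=2,  # Avoid repetitive phrases
    do_sample=True,  # Enable sampling; temperature, top_k and top_p are ignored without it
    temperature=0.7,  # Control randomness; lower values make the output more deterministic
    top_k=50,  # Limit sampling to top-k tokens
    top_p=0.9,  # Use nucleus sampling
    eos_token_id=tokenizer.eos_token_id,  # Stop when the end-of-sequence token is generated
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid the warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))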
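
To show what `temperature`, `top_k` and `top_p` do to the next-token distribution, here is a toy NumPy sketch over five made-up logits. It is a simplified illustration of the idea, not the `transformers` implementation; the function name and values are assumptions.

```python
import numpy as np

def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Toy temperature + top-k + top-p filtering over one logit vector."""
    # Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens, then renormalize.
    order = np.argsort(probs)[::-1][:top_k]
    kept_probs = probs[order] / probs[order].sum()

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative = np.cumsum(kept_probs)
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

logits = [4.0, 3.0, 1.0, 0.5, -2.0]
probs = filter_next_token_probs(logits, temperature=0.7, top_k=3, top_p=0.9)

# Sample the next token id from the filtered distribution.
rng = np.random.default_rng(seed=0)
next_token = int(rng.choice(len(probs), p=probs))
print(probs.round(3), next_token)
```

With these toy values only the two most likely tokens survive the filters, so sampling can never pick the unlikely tail.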