In [2]:
!pip install transformers sentence-transformers faiss-cpu



#### Step 1: Load the LLM (Large Language Model)
The first step is to load a pre-trained language model that generates text from an input prompt. We'll use a small open-source model from the BLOOM family.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained open-source LLM
model_name = "bigscience/bloom-560m"  # Small Bloom model for demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print("Language model and tokenizer loaded!")

Language model and tokenizer loaded!


#### Step 2: Load the Embedding Model
For retrieval, we need a model that converts text into vector embeddings. We'll use a Sentence Transformers model to encode both the query and the documents into the same vector space, so that similar texts end up close together.

In [4]:
from sentence_transformers import SentenceTransformer

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # Compact and efficient
print("Embedding model loaded!")


Embedding model loaded!
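
To make "similar texts end up close together" concrete, here is a minimal sketch of comparing two embedding vectors with cosine similarity. The three-dimensional vectors are made up for illustration; the real model returns 384-dimensional vectors, and `cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not real model output)
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # points in the same direction as the query
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query

sim_close = cosine_similarity(query_vec, close_vec)
sim_far = cosine_similarity(query_vec, far_vec)
print("close:", round(sim_close, 3), "far:", round(sim_far, 3))
```

A score near 1 means the vectors point the same way, and a score near 0 means they are unrelated. The notebook itself compares vectors by L2 distance through FAISS rather than by cosine similarity.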


#### Step 3: Prepare and Index the Documents
To enable retrieval, we need a corpus of documents and a way to find the ones most relevant to a query. We'll embed each document and store the vectors in a FAISS index, which handles the similarity search.

In [5]:
import faiss
import numpy as np

# Sample corpus of documents
documents = [
    "What is machine learning?",
    "Explain neural networks.",
    "How do transformers work in NLP?",
    "What is gradient descent?"
]

# Generate embeddings for the documents
doc_embeddings = embedding_model.encode(documents)

# Create a FAISS index for exact (brute-force) L2 similarity search
dimension = doc_embeddings.shape[1]  # Embedding size
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)  # FAISS expects a float32 array of shape (n_docs, dimension)

print("FAISS index created and documents added!")


FAISS index created and documents added!
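
For intuition, the search that a flat L2 index performs can be sketched in plain Python: compute the squared L2 distance from the query to every stored vector and keep the `k` smallest. This is an illustrative stand-in, not FAISS's implementation; `flat_l2_search` and the toy vectors are assumptions made for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query."""
    # Score every stored vector, then sort by distance (ties broken by index)
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical 2-dimensional "document embeddings"
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)
```

Brute-force search costs time proportional to the corpus size per query, which is fine for a four-document demo. Larger corpora usually call for one of FAISS's approximate index types.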


#### Step 4: Implement the Retrieval Step
Given a query, we'll embed it with the same model and retrieve the `k` nearest documents from the FAISS index.

In [6]:
def retrieve(query, k=2):
    """Retrieve top-k documents for a given query."""
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [documents[idx] for idx in indices[0]]

# Test the retrieval
query = "What is gradient descent?"
retrieved_docs = retrieve(query)
print("Query:", query)
print("Retrieved Documents:", retrieved_docs)


Query: What is gradient descent?
Retrieved Documents: ['What is gradient descent?', 'What is machine learning?']
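
Note that the second result above is only loosely related to the query: a top-k search always returns `k` items, relevant or not. One possible refinement is to drop hits whose distance exceeds a cutoff. The helper below is a sketch; `select_hits` and the `max_distance` value are hypothetical, and a sensible cutoff would have to be tuned on real data.

```python
def select_hits(distances, indices, docs, max_distance):
    """Keep retrieved documents whose distance is at most max_distance."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            # FAISS fills unused result slots with index -1
            continue
        if dist <= max_distance:
            hits.append(docs[idx])
    return hits

# Toy inputs shaped like one row of index.search() output (assumed values)
toy_docs = ["doc about gradient descent", "doc about machine learning"]
kept = select_hits([0.0, 1.4, 3.0e38], [0, 1, -1], toy_docs, max_distance=1.0)
print(kept)
```

Inside `retrieve`, this would be called with `distances[0]` and `indices[0]` in place of the list comprehension.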


#### Step 5: Combine Retrieval and Generation
Now, we'll combine the retrieved context with the query and pass it to the language model to generate an answer.

In [7]:
def generate_answer(query):
    """Generate an answer using retrieved context and LLM."""
    # Step 1: Retrieve relevant documents
    retrieved_docs = retrieve(query)
    context = " ".join(retrieved_docs)

    # Step 2: Combine context and query
    input_text = f"Context: {context}\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(input_text, return_tensors="pt")

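    # Step 3: Generate response from LLM
    # **inputs passes the attention mask along with the input IDs;
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)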
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the RAG system
response = generate_answer("Can you explain how transformers work?")
print("Response:", response)


Response: Context: How do transformers work in NLP? Explain neural networks.
Question: Can you explain how transformers work?
Answer: Transformers are a set of rules that are used to transform input data into output data. Transformers are used to transform input data into output data in a way that is consistent with the input data. Transformers are used to transform input data into output data in a way that is consistent with the input data. Transformers are used to transform input data into output data in a way that is consistent with the input data. Transformers are used to transform input data into output data in a way that is consistent with the input data. Transformers are used to transform input data into output data in a way that is consistent
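
The response above echoes the whole prompt before the answer, because a causal language model returns the input followed by its continuation. A minimal sketch of separating the two is shown below. `build_prompt` and `extract_answer` are hypothetical helpers, and the generated text is a stand-in string rather than real model output.

```python
def build_prompt(context_docs, query):
    """Assemble the prompt in the same layout the notebook uses."""
    context = " ".join(context_docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

def extract_answer(generated_text, prompt):
    """Strip the echoed prompt, leaving only the model's continuation."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Decoding may not reproduce the prompt exactly; fall back to the full text
    return generated_text.strip()

prompt = build_prompt(
    ["How do transformers work in NLP?", "Explain neural networks."],
    "Can you explain how transformers work?",
)
stand_in_output = prompt + " A stand-in answer."  # placeholder, not model output
answer_only = extract_answer(stand_in_output, prompt)
print(answer_only)
```

The repeated sentences in the response are typical of greedy decoding with a small model. The `generate` method accepts options such as `no_repeat_ngram_size` and `repetition_penalty` that may reduce this, though suitable values depend on the model and task.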
