##### Retrieval-Augmented Generation

In [298]:
import os
import numpy as np
from openai import OpenAI
from langchain_openai import ChatOpenAI

from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

import faiss

In [262]:
with open("../open_router_api_key.txt", "r") as fi:
    api_key = fi.read().strip()  # strip the trailing newline so the key is sent cleanly

In [278]:
#os.environ["OPENAI_API_KEY"] = api_key
#os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"

In this exercise, you'll put together a RAG system and compare outputs from RAG vs. just querying an LLM.

For this exercise, you'll be asking about Subspace-Constrained LoRA (SC-LoRA), a new technique described in [a recent article published on arXiv.org](https://arxiv.org/abs/2505.23724). You've been provided the text of this article in the file 2505.23724v1.txt.

### Part 1: Manual RAG

In this first part, you'll build all of the pieces of the RAG system individually.

First, you'll need the retriever portion. Create a FAISS index to hold the text of the article. Encode this text using the all-MiniLM-L6-v2 encoder. Note that you'll want to divide the text into smaller chunks rather than encoding the whole article at once. You could try, for example, the [RecursiveCharacterTextSplitter class from LangChain](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html). You'll need to specify a chunk_size and a chunk_overlap. A chunk_size of 500 and a chunk_overlap of 50 are a reasonable starting point.
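To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line breaks, so its chunks will differ. The function name sliding_window_chunks is hypothetical and exists only for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one. A larger overlap lowers the chance that a passage relevant to the query is cut in half at a boundary, at the cost of more chunks to embed.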

In [30]:
with open("../data/2505.23724v1.txt", "r", encoding="utf-8") as f:
    article = f.read()
    #article = article.replace('\n', '')

In [31]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

In [44]:
chunked_article = text_splitter.split_text(article)

In [36]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [81]:
articles_vector = embedder.encode(chunked_article, show_progress_bar=True)


In [82]:
d = articles_vector.shape[1]

faiss_index = faiss.IndexFlatIP(d)   # build the index
faiss_index.add(articles_vector)     # add vectors to the index
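IndexFlatIP performs an exact, brute-force search that ranks the stored vectors by their inner product with the query. The sketch below reproduces that ranking in plain Python on toy 2-D vectors. The helper names are hypothetical, and the example assumes unit-length vectors, in which case the inner product equals cosine similarity. It is worth checking that the rows of articles_vector have a norm close to 1 before reading the scores that way.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_inner_product(query, vectors, k):
    """Return (index, score) pairs for the k largest inner products."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


toy_vectors = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
toy_query = normalize([1.0, 0.1])
toy_hits = top_k_inner_product(toy_query, toy_vectors, k=2)
print([i for i, _ in toy_hits])  # [0, 1]
```

With an inner-product index, larger scores mean closer matches, so the first returned index is the best candidate.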

Next, you'll need to set up a way to interact with the generator model. You can use the OpenAI class from the openai library for this. See [this page](https://platform.openai.com/docs/api-reference/chat/create) for more information. When you create the client, set the base_url to ["https://openrouter.ai/api/v1"](https://openrouter.ai/api/v1) and pass in your API key. Set the model to "meta-llama/llama-4-scout:free".

In [281]:
client = OpenAI(
    api_key = api_key,
    base_url="https://openrouter.ai/api/v1",
)

First, ask the model "How does SC-LoRA differ from regular LoRA?" without providing any additional context. Read through a few different responses.

In [58]:
query = "How does SC-LoRA differ from regular LoRA?"

In [301]:
completion = client.chat.completions.create(
  model="meta-llama/llama-4-scout:free",
  messages=[
    #{"role": "developer", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": query
    }
  ]
)

print(completion.choices[0].message.content)

ChatCompletionMessage(content='SC-LoRA (Structured and Controlled Low-Rank Adaptation) is an extension or a variation of LoRA (Low-Rank Adaptation), which is a method used in the context of large language models and other neural networks to adapt or fine-tune these models efficiently. While both SC-LoRA and LoRA aim to achieve efficient adaptation of large models with a minimal number of additional parameters, they differ in their approach and objectives:\n\n1. **LoRA (Low-Rank Adaptation):** \n   - LoRA is designed to adapt large pre-trained models to specific tasks with a relatively small number of additional parameters. It achieves this by introducing low-rank matrices that are learned during the adaptation process. These low-rank matrices are used to update the weights of the original model in a way that is efficient in terms of the number of parameters and computations required.\n   - The primary goal of LoRA is to reduce the number of trainable parameters during adaptation, makin

In [302]:
completion = client.chat.completions.create(
  model="meta-llama/llama-4-scout:free",
  messages=[
    #{"role": "developer", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": query
    }
  ]
)

print(completion.choices[0].message.content)

ChatCompletionMessage(content="SC-LoRA, or Scalable Low-Rank Adaptation, and LoRA (Low-Rank Adaptation) are both methods used for efficient fine-tuning of large pre-trained models, such as those used in natural language processing and computer vision. While they share some similarities, SC-LoRA is an advancement over the traditional LoRA method, primarily focusing on improving scalability and efficiency. Here's how SC-LoRA differs from regular LoRA:\n\n1. **Scalability and Efficiency:**\n   - **LoRA:** LoRA is designed to adapt large pre-trained models to specific tasks more efficiently than full fine-tuning. It achieves this by updating only a small portion of the model's parameters, specifically through low-rank matrices that are learned during the fine-tuning process. While LoRA is efficient for a single task adaptation, managing and deploying multiple task-specific models can become cumbersome and less scalable as the number of tasks increases.\n   - **SC-LoRA:** SC-LoRA builds upo

Next, use the following as a system prompt:

    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        f"Context: {context}"
    )

Use the FAISS index to retrieve the most relevant chunks and use them to fill in the context. Then pass this system prompt to the model along with the query. Hint: you can do this by passing the following messages to the client.chat.completions.create function:

    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": query,
        }
    ]

How does adding this context change the results?

In [304]:
query_vector = embedder.encode([query])

In [305]:
k = 5
distances, indices = faiss_index.search(query_vector, k)

In [307]:
# Join the k retrieved chunks, in ranked order, into a single context string
most_similar_chunks = indices[0]
context = "\n\n".join(chunked_article[i] for i in most_similar_chunks)

In [None]:
print(context)

In [309]:
system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    f"Context: {context}"
)

In [311]:
completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout:free",
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": query,
        },
    ],
)

print(completion.choices[0].message.content)

SC-LoRA is a LoRA initialization method that modifies the beta parameter (β) to balance utility and safety, whereas regular LoRA has a fixed learning rate and does not have this beta parameter. SC-LoRA aims to preserve safety and knowledge while fine-tuning, and its β values (e.g., 0.5, 0.7, 0.9) control this balance. This allows SC-LoRA to achieve better safety and utility performance than regular LoRA.


### Part 2: LangChain

You can also use the [LangChain library](https://www.langchain.com/) to help build your RAG system.

For the retriever, you can use the [HuggingFaceEmbeddings class](https://python.langchain.com/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html), using the all-MiniLM-L6-v2 model, to create your embedding model. There is also a [FAISS class](https://python.langchain.com/docs/integrations/vectorstores/faiss/), which has a useful [from_texts method](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS.from_texts). Once you've created your vector store, use the [as_retriever method](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS.as_retriever) on it and save it to a variable named retriever.

For the generator, you can use the [ChatOpenAI class](https://python.langchain.com/docs/integrations/chat/openai/). Be sure to set base_url="[https://openrouter.ai/api/v1](https://openrouter.ai/api/v1)" and model_name="meta-llama/llama-4-scout:free", and to set openai_api_key to your API key. Save this to a variable named llm.

In [241]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [242]:
# Named vector_store so that it does not shadow the imported faiss module
vector_store = FAISS.from_texts(chunked_article, embeddings)

In [243]:
retriever = vector_store.as_retriever()

In [244]:
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    model_name="meta-llama/llama-4-scout:free",
    openai_api_key=api_key
)

Now that the two components have been created, we can connect them through a prompt built with the [ChatPromptTemplate](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html) class. We can set up a system prompt and then pass it in, like this:

    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        "Context: {context}"
    )

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )
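Note the difference from Part 1: there the last line of the system prompt was an f-string, so the context was inserted immediately. Here it is a plain string, and the {context} and {input} placeholders stay literal until the chain fills them in. The sketch below shows the distinction in plain Python, with str.format standing in for the template's substitution step. The sample context string is a placeholder, not text from the article.

```python
context = "retrieved chunk text goes here"  # placeholder value for illustration

# Part 1 style: the f-string is interpolated as soon as this line runs,
# so context must already be defined.
eager_prompt = f"Context: {context}"

# Part 2 style: a plain string keeps the literal placeholder until it is
# filled in later (str.format is a stand-in for the chain doing this).
template = "Context: {context}"
late_prompt = template.format(context=context)

print(template)                     # Context: {context}
print(eager_prompt == late_prompt)  # True
```

Accidentally writing the Part 2 prompt as an f-string would bake in whatever value context happens to hold when the cell runs, instead of the documents retrieved for each query.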
  
Then, you can use the [create_stuff_documents_chain function](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html), passing in your llm and the prompt, and then create a chain using the [create_retrieval_chain](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval.create_retrieval_chain.html) function, passing in the retriever and the chain you just created.

Finally, you can use the invoke method to pass in your query as input. See the example on [this page](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval.create_retrieval_chain.html).

In [295]:
system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    "Context: {context}"
)

In [296]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [299]:
chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, chain)

In [300]:
retrieval_chain.invoke({"input": query})

{'input': 'How does SC-LoRA differ from regular LoRA?',
 'context': [Document(id='870459b3-7640-4535-a499-b250fd2ebe81', metadata={}, page_content='methods, both in utility and safety metric. Com-\npared to the original model, SC-LoRA ( β= 0.9)\nexhibits almost no safety degradation, and achieves\nbest utility, even surpassing full fine-tuning by 3.79\npoints. When increasing the learning rate, LoRA\nshows a sharp decline in safety alignment while\nmath ability is increasing. LoRA (lr=2e-5) and\nCorDA KPA, though preserving safety well, are\ninsufficient in fine-tuning performance compared\nto our method. PiSSA and CorDA IPA, though'),
  Document(id='6a4610e5-713b-48f8-bcfb-312d6c3102d3', metadata={}, page_content='sponses (score = 5) as harmfulness rate . Lower\nvalues for both metrics indicate stronger safety of\nthe model.\n5Method #Params HS↓HR(%) ↓Utility ↑\nLlama-2-7b-Chat - 1.100 1.212 24.13\nFull fine-tuning 6738M 1.364 5.455 51.41\nLoRA 320M 1.176 2.424 50.32\nPiSSA 320M 1.252