In [4]:
from sentence_transformers import SentenceTransformer
import faiss

from openai import OpenAI

from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

from transformers import pipeline

import os
os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"

import json
with open("keys.json", "r") as fi:
    api_key = json.load(fi)['api_key']

## Part 1: Manual RAG System

First, we'll put together the components of our RAG system individually.

We'll start with our data source. We'll use FAISS for our vector database.

For this exercise, we'll be working with a recent research article, [SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA](https://arxiv.org/abs/2505.23724). The text of this article is stored in the data directory.

Our goal is to store passages from this text in our database. We'll use the RecursiveCharacterTextSplitter, which divides the text into chunks of at most 500 characters, with consecutive chunks overlapping by up to 50 characters.

In [6]:
with open("data/2505.23724v1.txt", "r", encoding="utf-8") as f:
    text = f.read()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(text)
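To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of overlapping chunking using a fixed-width sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter first tries to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit); the helper name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return pieces


# Hypothetical sample: 1200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(sample)

print([len(c) for c in demo_chunks])
print(demo_chunks[0][-50:] == demo_chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.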

Now we need to embed each of these chunks. We can use the all-MiniLM-L6-v2 sentence-transformer model for this.

**Task 2:** Use a sentence transformer to encode all of the chunks. Then save the results in a faiss IndexFlatIP index.

In [7]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(chunks)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)
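One detail worth keeping in mind: IndexFlatIP ranks by raw inner product, which matches cosine similarity only when the vectors have unit length. The pure-Python sketch below illustrates the difference on toy 2-D vectors; the helper names and the toy data are hypothetical, and the exhaustive loop only mimics what a flat index does conceptually. If an embedder does not return unit-length vectors, options such as `normalize_embeddings=True` in `encode` or `faiss.normalize_L2` can be applied before indexing; it is worth checking the norms of `embeddings` rather than assuming.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_inner_product(query, vectors, k):
    """Exhaustively score every vector and return the indices of the k best."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy data: vector 0 is long but points away from the query direction.
toy_vectors = [[10.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
toy_query = [1.0, 1.0]

raw_ranking = top_k_inner_product(toy_query, toy_vectors, k=3)
unit_ranking = top_k_inner_product(
    l2_normalize(toy_query), [l2_normalize(v) for v in toy_vectors], k=3
)

print(raw_ranking[0])   # best match by raw inner product
print(unit_ranking[0])  # best match after normalization (cosine similarity)
```

Without normalization the longest vector wins even though vector 1 points in exactly the query's direction; after normalization vector 1 ranks first.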

The next piece we need is a generative model. We'll access one through [OpenRouter](https://openrouter.ai/), using the OpenAI client interface.

In [8]:
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=api_key,
)

In [9]:
query = "How does SC-LoRA differ from regular LoRA?"

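**Baseline LLM Query**

The notebook sets up an OpenAI-compatible client for OpenRouter and queries the LLM directly with a question (e.g., "How does SC-LoRA differ from regular LoRA?") without any context.
This demonstrates how the LLM answers based only on its pretraining, not the specific paper. In the response below, the model even expands SC-LoRA as "Space-Conditioned Low-Rank Adaptation" rather than the "Subspace-Constrained LoRA" of the paper's title, a sign that it is guessing.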

In [10]:
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout:free",
    messages=[
        {
            "role": "user",
            "content": query,
        },
    ],
)

response.choices[0].message.content

"SC-LoRA (Space-Conditioned Low-Rank Adaptation) and LoRA (Low-Rank Adaptation) are both methods used in the context of adapting large pre-trained models to specific tasks or domains with efficient and effective fine-tuning strategies. While they share the common goal of updating a subset of model parameters to adapt to new conditions (like tasks, styles, or domains) without retraining the entire model, there are key differences between SC-LoRA and regular LoRA:\n\n1. **Basic Approach**:\n   - **LoRA**: This method works by adding low-rank matrices to the original weights of the model layers. These low-rank matrices are learned during fine-tuning and allow the model to adapt to new tasks with a relatively small number of additional parameters. LoRA focuses on optimizing these adaptation matrices with the goal of minimizing the need for extensive retraining of the entire model.\n   - **SC-LoRA**: SC-LoRA extends the basic LoRA approach by incorporating an additional conditioning mechani

**Task 3:**

In [11]:
query = "How does SC-LoRA differ from regular LoRA?"

In [12]:
query_embedding = embedder.encode([query])
D, I = index.search(query_embedding, k=5)
most_similar_chunks = I[0]

context = ""
for i in most_similar_chunks:
    context += "\n\n" + chunks[i]

In [13]:
system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    f"Context: {context}"
)

The notebook then:

- Encodes the query.
- Retrieves the top-5 most similar text chunks from the FAISS index.
- Concatenates these chunks as context.
- Constructs a system_prompt that instructs the LLM to answer using the provided context.

The LLM is then queried again, but this time with the context included as a system prompt.

**Improvement:** The answer is now grounded in the actual content of the paper, making it more accurate and reliable.
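The concatenation and prompt-building steps above can be sketched as two small functions. This is a minimal illustration rather than the notebook's own code: the helper names `build_context` and `build_messages` and the placeholder passages are hypothetical. The sketch also filters out negative indices, on the understanding that FAISS pads its results with -1 when an index holds fewer than k vectors; without the filter, `chunks[-1]` would silently pull in the last chunk.

```python
def build_context(chunks, indices):
    """Concatenate retrieved chunks in rank order, skipping -1 placeholders."""
    valid = [i for i in indices if i >= 0]
    return "".join("\n\n" + chunks[i] for i in valid)


def build_messages(query, context):
    """Wrap the context in a system prompt and pair it with the user query."""
    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        f"Context: {context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


# Placeholder passages standing in for real chunks of the paper.
toy_chunks = ["passage A", "passage B", "passage C"]
toy_indices = [2, 0, -1]  # as if the search returned two hits and one pad

toy_context = build_context(toy_chunks, toy_indices)
toy_messages = build_messages(
    "How does SC-LoRA differ from regular LoRA?", toy_context
)
print(toy_messages[0]["content"])
```

The returned list has the same shape as the `messages` argument passed to `client.chat.completions.create` in the notebook.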

In [14]:
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout:free",
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": query,
        }
    ]
)

print(response.choices[0].message.content)

SC-LoRA is a LoRA initialization method that modifies the base LoRA approach. The key difference is that SC-LoRA introduces a hyperparameter β to balance utility and safety, allowing for better preservation of safety and knowledge during fine-tuning. This results in SC-LoRA outperforming regular LoRA in both safety and utility metrics.


In [15]:
print(context)



methods, both in utility and safety metric. Com-
pared to the original model, SC-LoRA ( β= 0.9)
exhibits almost no safety degradation, and achieves
best utility, even surpassing full fine-tuning by 3.79
points. When increasing the learning rate, LoRA
shows a sharp decline in safety alignment while
math ability is increasing. LoRA (lr=2e-5) and
CorDA KPA, though preserving safety well, are
insufficient in fine-tuning performance compared
to our method. PiSSA and CorDA IPA, though

sponses (score = 5) as harmfulness rate . Lower
values for both metrics indicate stronger safety of
the model.
5Method #Params HS↓HR(%) ↓Utility ↑
Llama-2-7b-Chat - 1.100 1.212 24.13
Full fine-tuning 6738M 1.364 5.455 51.41
LoRA 320M 1.176 2.424 50.32
PiSSA 320M 1.252 4.242 51.87
CorDA IPA 320M 1.209 3.333 44.61
CorDA KPA 320M 1.106 0.606 50.89
SC-LoRAβ= 0.5 320M 1.161 1.818 52.54
β= 0.7 320M 1.148 1.818 52.07
β= 0.9 320M 1.097 0.000 51.67

2019) with the following hyper-parameters: batch
size 128, learning 

## Part 2: LangChain

Now, let's see how we could use the [LangChain](https://www.langchain.com/) library to build our RAG system.

In [16]:
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(chunks, embedding_model)

In [17]:
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    model_name="meta-llama/llama-4-scout:free",
    openai_api_key=api_key
)
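The chain assembled in the next cell combines two steps: a retriever fetches documents, and a "stuff" documents chain pastes their text into the {context} slot of the prompt. As a rough mental model only (this is not LangChain's implementation), the second step behaves like the sketch below. The helper names `stuff_documents` and `fill_prompt` are hypothetical, and the blank-line separator is an assumption about the library's default.

```python
def stuff_documents(page_contents, separator="\n\n"):
    """Join the text of all retrieved documents into one context string."""
    return separator.join(page_contents)


def fill_prompt(system_template, human_template, context, user_input):
    """Substitute the context and the question into the two message templates."""
    return [
        ("system", system_template.format(context=context)),
        ("human", human_template.format(input=user_input)),
    ]


# Placeholder strings standing in for retrieved Document.page_content values.
toy_docs = ["first retrieved passage", "second retrieved passage"]

toy_prompt = fill_prompt(
    "Use the given context to answer the question. Context: {context}",
    "{input}",
    stuff_documents(toy_docs),
    "How does SC-LoRA differ from regular LoRA?",
)
for role, content in toy_prompt:
    print(role, "->", content)
```

One difference from Part 1 to be aware of: to my understanding, a retriever created with `as_retriever()` returns four documents by default rather than the five used in the manual search, and passing `search_kwargs={"k": 5}` should make the two setups comparable.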

In [18]:
query = "How does SC-LoRA differ from regular LoRA?"

retriever = vector_store.as_retriever()

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

chain.invoke({"input": query})

{'input': 'How does SC-LoRA differ from regular LoRA?',
 'context': [Document(id='a0b7312c-7b6b-420f-882e-c210023329aa', metadata={}, page_content='methods, both in utility and safety metric. Com-\npared to the original model, SC-LoRA ( β= 0.9)\nexhibits almost no safety degradation, and achieves\nbest utility, even surpassing full fine-tuning by 3.79\npoints. When increasing the learning rate, LoRA\nshows a sharp decline in safety alignment while\nmath ability is increasing. LoRA (lr=2e-5) and\nCorDA KPA, though preserving safety well, are\ninsufficient in fine-tuning performance compared\nto our method. PiSSA and CorDA IPA, though'),
  Document(id='bcaef4f2-ab84-4573-89e1-83d7fd77bfae', metadata={}, page_content='sponses (score = 5) as harmfulness rate . Lower\nvalues for both metrics indicate stronger safety of\nthe model.\n5Method #Params HS↓HR(%) ↓Utility ↑\nLlama-2-7b-Chat - 1.100 1.212 24.13\nFull fine-tuning 6738M 1.364 5.455 51.41\nLoRA 320M 1.176 2.424 50.32\nPiSSA 320M 1.252