##### Retrieval-Augmented Generation

In [78]:
import os
import numpy as np
from langchain_openai import ChatOpenAI

from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

import faiss


In [None]:
#os.environ["OPENAI_API_KEY"] = api_key
#os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"

In [53]:
with open("../open_router_api_key.txt", "r") as fi:
    api_key = fi.read().strip()  # strip the trailing newline so the key is sent cleanly

In this exercise, you'll put together a RAG system and compare outputs from RAG vs. just querying an LLM.

For this exercise, you'll be asking about Subspace-Constrained LoRA (SC-LoRA), a new technique described in [a recent article published on arXiv.org](https://arxiv.org/abs/2505.23724). The text of this article is provided in the file 2505.23724v1.txt.

### Part 1: Manual RAG

In this first part, you'll build all of the pieces of the RAG system individually.

First, you'll need the retriever portion. Create a FAISS index to hold the text of the article, encoding the text with the all-MiniLM-L6-v2 encoder. Divide the text into smaller chunks rather than encoding the whole article at once. You could try, for example, the [RecursiveCharacterTextSplitter class from LangChain](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html). You'll need to specify a chunk_size and a chunk_overlap; a chunk_size of 500 and a chunk_overlap of 50 are a reasonable starting point.

In [30]:
with open("../data/2505.23724v1.txt", "r", encoding="utf-8") as f:
    article = f.read()
    #article = article.replace('\n', '')

In [31]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

In [44]:
chunked_article = text_splitter.split_text(article)
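
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally: that class prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks vary in length. The function name `chunk_text` and the toy string are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.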

In [36]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
#embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [81]:
articles_vector = embedder.encode(chunked_article, show_progress_bar=True)


In [82]:
d = articles_vector.shape[1]

faiss_index = faiss.IndexFlatIP(d)   # build the index
faiss_index.add(articles_vector)       # add vectors to the index
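
IndexFlatIP performs an exhaustive inner-product search over every stored vector. The pure-Python sketch below mimics that behavior on toy two-dimensional vectors; the helper names `normalize` and `inner_product_search` are illustrative assumptions, not FAISS functions. Inner product only equals cosine similarity when the vectors have unit length, so if you are unsure whether your embeddings are normalized, normalizing them explicitly (for example with `faiss.normalize_L2` or `normalize_embeddings=True` in `encode`) is a low-cost safeguard.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product_search(index_vectors, query, k):
    """Score every stored vector against the query and return the top k."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in index_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return [scores[i] for i in top], top


toy_index = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
toy_query = normalize([0.9, 0.1])
scores, ids = inner_product_search(toy_index, toy_query, k=2)
print(ids)  # [0, 2]
```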

Next, you'll need to set up a way to interact with the generator model. You can use the OpenAI class from the openai library for this; see [this page](https://platform.openai.com/docs/api-reference/chat/create) for more information. When you do, set the base_url to ["https://openrouter.ai/api/v1"](https://openrouter.ai/api/v1) and pass in your API key. Set the model to "meta-llama/llama-4-scout:free".

In [223]:
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    model_name="meta-llama/llama-4-scout:free",
    openai_api_key=api_key
)

First, ask the model "How does SC-LoRA differ from regular LoRA?" without providing any additional context. Read through a few different responses.

In [58]:
query = "How does SC-LoRA differ from regular LoRA?"

In [236]:
response = llm.invoke(query)

In [235]:
print(response.content)

SC-LoRA (Space-Conditioned Low-Rank Adaptation) is an adaptation of LoRA (Low-Rank Adaptation), which is a method used in transformer-based models for efficient fine-tuning. The primary distinction between SC-LoRA and regular LoRA lies in how they condition or adapt the model's parameters during the fine-tuning process.

1. **Regular LoRA**: LoRA introduces a low-rank matrix that is learned during fine-tuning. This matrix is used to adapt the weights of the model. Specifically, for a given weight matrix \(W\), LoRA updates it as \(W + \Delta W\), where \(\Delta W = BA\), and \(B\) and \(A\) are low-rank matrices learned during training. The key here is that the adaptation \(\Delta W\) is not dependent on the input but is a fixed adaptation learned during fine-tuning.

2. **SC-LoRA**: SC-LoRA enhances the basic LoRA approach by conditioning the adaptation on the input space or certain conditions. This means that instead of having a static \(\Delta W\), SC-LoRA allows \(\Delta W\) to var

In [238]:
print(response.content)

SC-LoRA (Structured Compression-LoRA) and LoRA (Low-Rank Adaptation) are both efficient fine-tuning methods for large language models, but they differ in their approach and structure:

**LoRA (Low-Rank Adaptation)**

LoRA is a method that adapts a pre-trained model to a specific task by adding low-rank matrices to the model's weights. The key idea is to update only a small subset of the model's parameters, specifically by adding a low-rank matrix to the weight matrix of a given layer. This allows for efficient adaptation to new tasks with a relatively small number of additional parameters.

**SC-LoRA (Structured Compression-LoRA)**

SC-LoRA builds upon LoRA by introducing an additional structured compression component. The main differences between SC-LoRA and LoRA are:

1. **Structured pruning**: SC-LoRA incorporates structured pruning, which eliminates entire groups of parameters or neurons, leading to a more compact and efficient model. This pruning is done in a way that maintains th

Next, use the following as a system prompt:

    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        f"Context: {context}"
    )

Use the FAISS index to retrieve relevant text from the article and use it to fill in the context. Then pass this system prompt to the model. Hint: you can do this by using the following messages in the client.chat.completions.create function:

    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": query,
        }
    ]

How does adding this context change the results?
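
The hint refers to client.chat.completions.create, which belongs to the openai library's client rather than to LangChain. A minimal sketch of that route follows, assuming the `openai` package is installed. The helper names `make_client`, `build_messages`, and `ask` are illustrative, not part of any library.

```python
def make_client(api_key: str):
    """Create an OpenAI-compatible client pointed at OpenRouter."""
    # Imported lazily so build_messages can be used without the package.
    from openai import OpenAI

    return OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)


def build_messages(context: str, query: str) -> list[dict]:
    """Assemble the system and user messages for one RAG request."""
    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        f"Context: {context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


def ask(client, context: str, query: str,
        model: str = "meta-llama/llama-4-scout:free") -> str:
    """Send the request and return the text of the first choice."""
    completion = client.chat.completions.create(
        model=model,
        messages=build_messages(context, query),
    )
    return completion.choices[0].message.content
```

With a retrieved `context` in hand, `ask(make_client(api_key), context, query)` would return the model's answer as a string.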

In [84]:
query_vector = embedder.encode(query)

In [214]:
k = 1
distances, indices = faiss_index.search(np.array([query_vector], dtype=np.float32), k)

In [219]:
context = chunked_article[indices[0][0]]  # top-ranked chunk for the first (and only) query

In [220]:
system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    f"Context: {context}"
)

In [221]:
messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": query,
        }
    ]

In [231]:
response = llm.invoke(messages)

In [232]:
print(response.content)

SC-LoRA has a β value of 0.9, which is not specified for regular LoRA. SC-LoRA also outperforms LoRA in terms of utility and safety metrics. LoRA's performance varies with learning rate, whereas SC-LoRA achieves a balance between utility and safety.


In [230]:
print(response.content)

SC-LoRA has a β value of 0.9, which seems to make it outperform regular LoRA in terms of utility and safety. Regular LoRA shows a decline in safety alignment when the learning rate is increased, but SC-LoRA doesn't exhibit this degradation. The exact differences between SC-LoRA and LoRA are not specified, but β=0.9 appears to be a key factor.
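
Both answers above draw on a single retrieved chunk (k = 1), which may explain why they center on a β value rather than on the method itself. One possible extension, sketched here under that assumption, is to retrieve several chunks and join them into one context string. The helper name `build_context` and the separator are illustrative choices.

```python
def build_context(chunks, retrieved_ids, separator="\n\n---\n\n"):
    """Join the retrieved chunks, in rank order, into one context string."""
    seen = set()
    selected = []
    for idx in retrieved_ids:
        idx = int(idx)  # FAISS returns NumPy integers
        # FAISS pads with -1 when fewer than k results exist; skip those
        # and any duplicate ids.
        if idx < 0 or idx in seen:
            continue
        seen.add(idx)
        selected.append(chunks[idx])
    return separator.join(selected)


toy_chunks = ["first chunk", "second chunk", "third chunk"]
print(build_context(toy_chunks, [2, 0, -1, 2]))
```

In the notebook this would be called as `build_context(chunked_article, indices[0])` after searching with a larger k, for example k = 4.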


### Part 2: LangChain

You can also use the [LangChain library](https://www.langchain.com/) to help build your RAG system.

For the retriever, you can use the [HuggingFaceEmbeddings class](https://python.langchain.com/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html), with the all-MiniLM-L6-v2 model, to create your embedding model. There is also a [FAISS class](https://python.langchain.com/docs/integrations/vectorstores/faiss/), which has a useful [from_texts method](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS.from_texts). Once you've created your vector store, call the [as_retriever method](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS.as_retriever) on it and save the result to a variable named retriever.

For the generator, you can use the [ChatOpenAI class](https://python.langchain.com/docs/integrations/chat/openai/). Be sure to set base_url="https://openrouter.ai/api/v1" and model_name="meta-llama/llama-4-scout:free", and to set openai_api_key to your API key. Save this to a variable named llm.
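
One way the two components might be built is sketched below, assuming the `langchain_huggingface`, `langchain_community`, and `langchain_openai` packages are installed. The wrapper functions `build_retriever` and `build_llm`, the constants, and the default of k = 4 retrieved chunks are illustrative assumptions.

```python
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
GENERATOR_MODEL = "meta-llama/llama-4-scout:free"


def build_retriever(chunks, k=4):
    """Embed the text chunks, store them in FAISS, and expose a retriever."""
    # Imports are local so this file loads even before the packages are installed.
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    vector_store = FAISS.from_texts(chunks, embeddings)
    # search_kwargs controls how many chunks each query returns.
    return vector_store.as_retriever(search_kwargs={"k": k})


def build_llm(api_key):
    """Create the chat model, routed through OpenRouter."""
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(
        base_url=OPENROUTER_BASE_URL,
        model_name=GENERATOR_MODEL,
        openai_api_key=api_key,
    )
```

In the notebook, `retriever = build_retriever(chunked_article)` and `llm = build_llm(api_key)` would produce the two variables the exercise asks for.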







Now that the two components have been created, we can combine them using a prompt built with the ChatPromptTemplate class. We can set up a system prompt and then pass that in, like so:

    system_prompt = (
        "Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "Use three sentences maximum and keep the answer concise. "
        "Context: {context}"
    )
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

Then, you can use the create_stuff_documents_chain function, passing in your llm and the prompt, and then create a chain using the create_retrieval_chain function, passing in the retriever and the chain you just created.

Finally, you can use the invoke method to pass in your query as input. See the example on this page.
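
Putting these steps together might look like the following sketch. The import paths reflect the LangChain 0.2 and 0.3 releases and may have moved in later versions. The wrapper name `build_rag_chain` is an illustrative assumption. Note that the system prompt here is a plain string, not an f-string, so that {context} is left for LangChain to fill in.

```python
SYSTEM_PROMPT = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    "Context: {context}"
)


def build_rag_chain(llm, retriever):
    """Combine a chat model and a retriever into a retrieval chain."""
    # Imports are local so this file loads even before the packages are installed.
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.prompts import ChatPromptTemplate

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", SYSTEM_PROMPT),
            ("human", "{input}"),
        ]
    )
    # "Stuffs" the retrieved documents into the {context} slot of the prompt.
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, question_answer_chain)
```

Calling `build_rag_chain(llm, retriever).invoke({"input": query})` should return a dictionary whose "answer" key holds the generated text and whose "context" key holds the retrieved documents.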