<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/Introduction_to_frontiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
__author__ = "Pranava Madhyastha"
__version__ = "INM434/IN3045 City, University of London, Spring 2025"

## Exploring Advanced LLM Techniques: RAG, Chain of Thought, and LoRA

In this notebook, we will dive into three important techniques that enhance the capabilities and efficiency of Large Language Models (LLMs):

1.  **Retrieval Augmented Generation (RAG):** We'll learn how to augment LLMs with external knowledge to overcome their limitations and generate more informed and accurate responses.
2.  **Chain of Thought (CoT) Reasoning:** We'll explore how to prompt LLMs to perform step-by-step reasoning, improving their ability to solve complex problems.
3.  **Low-Rank Adaptation (LoRA):** We'll also learn how to fine-tune LLMs efficiently using LoRA, a parameter-efficient technique that reduces computational costs and resource requirements.

In [None]:
!pip uninstall -y wandb  # Explicitly uninstall wandb (Weights & Biases, an experiment-tracking tool; have a peek at it if you are curious)
import os
os.environ["WANDB_DISABLED"] = "true" # Set environment variable to disable wandb
os.environ["WANDB_MODE"] = "disabled"
!echo "WandB disabled forcefully."
!pip install faiss-gpu-cu12
!pip install -U transformers bitsandbytes accelerate datasets peft # we have seen most of these libraries before; peft provides LoRA (Section 3)


Before you begin, as usual, make sure you are running this notebook in an environment with **GPU acceleration enabled**. In Google Colab, go to "Runtime" -> "Change runtime type" and select "A100 GPU" (or a similar GPU) as the hardware accelerator.

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display


# 1. Retrieval Augmented Generation (RAG)

Large Language Models are powerful language generators, but, as we saw in the lecture, they have limitations:

*   **Knowledge cut-off (time and domain):** LLMs are trained on data collected up to a certain point in time and drawn from a finite selection of domains. They therefore lack knowledge of recent events and of specialised domain information that was not in their training data.
*   **Hallucinations:** LLMs can sometimes generate factually incorrect or nonsensical information, as they are trained to generate text that is statistically likely, not necessarily factually true.

RAG is one approach to addressing these issues: it combines the generative strengths of LLMs with the ability to retrieve information from external knowledge sources.

**How RAG Works:**

1.  **Retrieval:** Given a user query, RAG first retrieves relevant documents or passages from a knowledge base (e.g., a vector database of documents).
2.  **Augmentation:** The retrieved information is then incorporated into the prompt given to the LLM.
3.  **Generation:** The LLM uses both its pre-existing knowledge and the retrieved information to generate a more informed and grounded response.
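The three steps can be sketched without any model at all. The snippet below is a toy illustration, not the pipeline we build later: it swaps embedding search for plain word overlap and replaces the LLM with a hypothetical `generate` stub, so that only the retrieve, augment, generate flow is visible. The prompt template mirrors the one used later in this notebook.

```python
import re

# Toy knowledge base (a subset of the one used later in this notebook).
DOCS = [
    "The capital of France is Paris.",
    "The Eiffel Tower is a famous landmark in Paris.",
    "RAG stands for Retrieval Augmented Generation.",
]

def tokenize(text):
    """Lower-cased word set; a deliberately crude stand-in for embeddings."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, top_k=1):
    """Step 1 (Retrieval): rank documents by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def augment(query, passages):
    """Step 2 (Augmentation): place the retrieved passages into the prompt."""
    context = "\n".join(passages)
    return f"Context information: {context}\n\nUser Query: {query}\n\nAnswer:"

def generate(prompt):
    """Step 3 (Generation): hypothetical stub standing in for an LLM call."""
    return prompt + " <model output would go here>"

query = "What does RAG stand for?"
prompt = augment(query, retrieve(query, DOCS))
print(generate(prompt))
```

Replacing `retrieve` with embedding search and `generate` with a real model gives the structure of the pipeline implemented below.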

Let's implement a simple RAG pipeline! We will use a small dataset of example documents and a pre-trained LLM for text generation.

First, let's create a simple **knowledge base**. For this example, we'll use a few sentences about different topics. In a real-world scenario, this could be a large collection of documents, articles, or a specialized database.

In [None]:
# Simple knowledge base (list of text snippets)
knowledge_base = [
    "The capital of France is Paris.",
    "The Eiffel Tower is a famous landmark in Paris.",
    "Machine learning is a subfield of artificial intelligence.",
    "Large Language Models are a type of machine learning model.",
    "RAG stands for Retrieval Augmented Generation.",
    "LoRA is a parameter-efficient fine-tuning technique.",
    "Chain of Thought prompting improves reasoning in LLMs."
]

Now, we need to **index** this knowledge base so we can efficiently search for relevant information. We will use **FAISS** (Facebook AI Similarity Search) to create a vector database.

First, we need to convert our text snippets into **vector embeddings**. Embeddings are numerical representations of text that capture their semantic meaning. We'll use a pre-trained sentence transformer model from Hugging Face for this.
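The model produces one vector per *token*, but we need one vector per *snippet*. The `get_embeddings` function in the next cell handles this with masked mean pooling: it averages the token vectors while ignoring padding positions. A small NumPy sketch of the same idea, on made-up numbers, shows why the attention mask matters (the clamp against division by zero is our own addition).

```python
import numpy as np

# Hypothetical token embeddings: batch of 2 sequences, 3 token positions, 2 dimensions.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [100.0, 100.0]],  # last position is padding
])
# 1 = real token, 0 = padding (as returned by the tokenizer when padding=True).
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where attention_mask == 1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid dividing by zero
    return summed / counts

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # second row averages only the two real tokens
```

Without the mask, the padding vector `[100, 100]` would drag the second sentence embedding far away from its real content.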

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import faiss

# Load a sentence transformer model and tokenizer
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2" # A fast and efficient sentence embedding model
embedding_tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
embedding_model = AutoModel.from_pretrained(embedding_model_name)

# Function to get embeddings for text
def get_embeddings(texts):
    encoded_input = embedding_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = embedding_model(**encoded_input)
    # Perform pooling. In this case, Mean Pooling - Take attention mask into account for correct averaging
    input_mask_expanded = encoded_input['attention_mask'].unsqueeze(-1).expand(model_output.last_hidden_state.size()).float()
    sum_embeddings = torch.sum(model_output.last_hidden_state * input_mask_expanded, 1)
    sum_mask = torch.sum(input_mask_expanded, 1)
    mean_pooled = sum_embeddings / sum_mask
    return mean_pooled.cpu().numpy()

# Get embeddings for our knowledge base
knowledge_embeddings = get_embeddings(knowledge_base)
print("Embeddings shape:", knowledge_embeddings.shape)

We have generated embeddings for each text snippet in our knowledge base. Now, let's build a **FAISS index** to store these embeddings and enable efficient similarity search. We will use `IndexFlatL2`, which performs exact (brute-force) search by comparing the query with every stored vector; this is fine for a small dataset like ours. For larger datasets, FAISS offers approximate indexes (for example, inverted-file and graph-based indexes) that trade a little accuracy for much faster search.
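Related to the TODO in the next cell: if the embeddings are L2-normalised, ranking by Euclidean distance and ranking by cosine similarity give the same order, because for unit vectors `||a - b||^2 = 2 - 2 * (a · b)`. The sketch below checks this with random NumPy vectors standing in for our embeddings (384 dimensions, the output size of all-MiniLM-L6-v2). Note that `IndexFlatL2` reports *squared* L2 distances. In FAISS, the cosine variant would typically apply `faiss.normalize_L2` to the vectors and use `faiss.IndexFlatIP`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for knowledge_embeddings (7 snippets) and one query embedding.
docs = rng.normal(size=(7, 384))
query = rng.normal(size=(1, 384))

def l2_normalise(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

docs_n, query_n = l2_normalise(docs), l2_normalise(query)

# Exact search as IndexFlatL2 does it: squared Euclidean distance to every document.
sq_l2 = ((docs_n - query_n) ** 2).sum(axis=1)
# Cosine similarity = inner product of unit vectors (what IndexFlatIP would score).
cosine = (docs_n @ query_n.T).ravel()

# For unit vectors the two are related by sq_l2 = 2 - 2 * cosine ...
print(np.allclose(sq_l2, 2 - 2 * cosine))
# ... so smallest distance and largest similarity pick out the same documents.
print(np.argsort(sq_l2)[:2], np.argsort(-cosine)[:2])
```

Without normalisation the equivalence no longer holds, because longer vectors are penalised by L2 distance but favoured by a raw inner product.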

In [None]:
dimension = knowledge_embeddings.shape[1] # Dimension of embeddings
index = faiss.IndexFlatL2(dimension) # Exact search using (squared) L2 / Euclidean distance. TODO: try other measures, e.g. faiss.IndexFlatIP (inner product) on L2-normalised embeddings
index.add(knowledge_embeddings)

Now we have our knowledge base indexed! Let's create a **retrieval function** that takes a user query, generates its embedding, and retrieves the most similar documents from the knowledge base using the FAISS index.

In [None]:
def retrieve_relevant_documents(query, top_k=2): # Retrieve top 2 most relevant documents by default
    query_embedding = get_embeddings([query])
    D, I = index.search(query_embedding, top_k) # Search top_k most similar embeddings
    retrieved_documents = [knowledge_base[i] for i in I[0]] # Get corresponding documents
    return retrieved_documents

Let's try retrieving documents for an example query.

In [None]:
query_rag = "What is RAG?"
relevant_docs = retrieve_relevant_documents(query_rag)
print(f"Query: '{query_rag}'")
print("Retrieved documents:")
for doc in relevant_docs:
    print(f"- {doc}")

Great! Our retrieval function is working. Now, let's integrate this with an **LLM for text generation**. We'll use the `distilgpt2` model again, similar to the previous notebook.

We will construct a **prompt** that includes both the user query and the retrieved relevant documents. This combined prompt will be given to the LLM to generate a RAG-enhanced response.

In [None]:
from transformers import pipeline

llm_model_name = "distilgpt2"
llm_generator = pipeline('text-generation', model=llm_model_name)

def generate_rag_response(query):
    relevant_documents = retrieve_relevant_documents(query)
    context = "\n".join(relevant_documents) # Combine retrieved documents into context
    prompt = f"Context information: {context}\n\nUser Query: {query}\n\nAnswer:" # Construct RAG prompt
    rag_output = llm_generator(prompt, max_new_tokens=100, num_return_sequences=1, pad_token_id=llm_generator.tokenizer.eos_token_id) # Generate up to 100 new tokens after the prompt
    return rag_output[0]['generated_text']

# Example RAG response generation
query_example_rag = "Tell me about Paris."
rag_response = generate_rag_response(query_example_rag)
print(f"RAG Query: '{query_example_rag}'")
print(f"RAG Response: '{rag_response}'")


**Examine the RAG Response:**

*   Does the response incorporate information from the retrieved documents?
*   Is the response more informative and relevant compared to just asking the LLM directly without RAG?

Let's create an **interactive interface** using `ipywidgets` so you can experiment with different queries and see RAG in action!

In [None]:
def rag_interface(query_text):
    rag_response_text = generate_rag_response(query_text)
    print(f"Query: '{query_text}'\n")
    print(f"RAG Response: '{rag_response_text}'")

query_input_widget_rag = widgets.Textarea(
    value='Ask me something about the topics in the knowledge base...',
    placeholder='Type your query here',
    description='Query:',
    disabled=False,
    continuous_update=False, # Run the model only when you leave the box, not on every keystroke
    layout=widgets.Layout(width='80%', height='100px')
)

interactive_rag = interactive(rag_interface, query_text=query_input_widget_rag)
display(interactive_rag)


**Try it out!**

*   Enter different queries related to the topics in our `knowledge_base` (Paris, Eiffel Tower, Machine Learning, RAG, LoRA, Chain of Thought).
*   Observe how RAG uses the retrieved context to generate answers.
*   Try queries outside of the knowledge base topics. How does RAG behave in those cases?
*   Can you try a different model (perhaps a larger model)?



**TODOs for RAG Exploration:**

1.  **Experiment with different knowledge bases:**
    *   Replace the simple `knowledge_base` with a larger text file or a dataset from Hugging Face Datasets ([https://huggingface.co/datasets](https://huggingface.co/datasets)).
    *   How does the performance of RAG change with a larger and more diverse knowledge base?
2.  **Explore different embedding models:**
    *   Try other sentence transformer models from Hugging Face ([https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)). Some models are more accurate but slower, while others are faster but potentially less accurate.
    *   How does the choice of embedding model affect retrieval quality and RAG response?
3.  **Investigate different retrieval strategies:**
    *   Instead of simple similarity search with FAISS, research and implement more advanced retrieval techniques like:
        *   Keyword-based retrieval (e.g., using TF-IDF or BM25).
        *   Hybrid retrieval (combining keyword and semantic search).
    *   How do these different strategies impact the relevance of retrieved documents and the overall RAG performance?
4.  **Evaluate RAG qualitatively and quantitatively:**
    *   **Qualitative Evaluation:** Manually examine the RAG responses for different queries. Are they factually correct? Do they answer the query effectively? Are they coherent and well-written?
    *   **Quantitative Evaluation (more advanced):** If you have access to ground truth answers for your queries, you can use metrics like:
        *   **Recall@k:**  Does the correct answer appear in the top-k retrieved documents?
        *   **Answer Relevance:**  How relevant is the generated answer to the query? (Requires more sophisticated evaluation methods).
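For TODOs 3 and 4, here is a dependency-free sketch of one keyword-based retriever, Okapi BM25, together with a simple Recall@k helper. It reflects our own choices rather than a reference implementation: the tokenizer is a plain regex, `k1=1.5` and `b=0.75` are common default values, and the IDF uses the `log(1 + ...)` variant so that scores stay non-negative. For real projects, a dedicated library is likely more convenient.

```python
import math
import re
from collections import Counter

# Same snippets as knowledge_base above, copied so this cell runs on its own.
DOCS = [
    "The capital of France is Paris.",
    "The Eiffel Tower is a famous landmark in Paris.",
    "Machine learning is a subfield of artificial intelligence.",
    "Large Language Models are a type of machine learning model.",
    "RAG stands for Retrieval Augmented Generation.",
    "LoRA is a parameter-efficient fine-tuning technique.",
    "Chain of Thought prompting improves reasoning in LLMs.",
]

def tokenize(text):
    """Lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and document-length normalisation (b).
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def bm25_top_k(query, docs, k=2):
    """Indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document ids found among the top-k retrieved ids."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

top = bm25_top_k("What is RAG?", DOCS, k=2)
print([DOCS[i] for i in top])
print("Recall@1:", recall_at_k(top, relevant={4}, k=1))
```

Here the RAG definition ranks first because "rag" is rare in the collection, while the common word "is" adds only a small score to several snippets. A hybrid retriever could combine these scores with the FAISS similarities, for example by a weighted sum after normalising both.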

# 2. Chain of Thought (CoT) Reasoning

Large Language Models can sometimes struggle with complex reasoning tasks that require multiple steps or logical inference. **Chain of Thought (CoT) prompting** is a technique that encourages LLMs to break down complex problems into smaller, more manageable steps, mimicking human-like step-by-step reasoning.

**How CoT Works:**

Instead of directly asking the LLM for the final answer, we prompt it to **show its reasoning process**.  This is typically done by adding phrases like "Let's think step by step" or "Reasoning process:" to the prompt.  By explicitly prompting for the reasoning chain, we guide the LLM to generate intermediate steps before arriving at the final answer.
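Both prompting styles are ordinary string templates. As a minimal sketch, the helper below builds the direct and zero-shot CoT prompts in the same format as the code cells that follow, plus a third variant, few-shot CoT, which prepends a worked example; that exemplar is hypothetical and written purely for illustration.

```python
# Hypothetical hand-written exemplar used for few-shot CoT prompting.
FEW_SHOT_EXEMPLAR = (
    "Question: Tom has 5 pens and buys 3 more. How many pens does he have?\n"
    "Let's think step by step: Tom starts with 5 pens. He buys 3 more, so 5 + 3 = 8.\n"
    "Answer: 8\n\n"
)

def build_prompt(question, mode="direct"):
    """Return a direct, zero-shot CoT, or few-shot CoT prompt for the question."""
    if mode == "direct":
        return f"Question: {question}\nAnswer:"
    if mode == "zero_shot_cot":
        return f"Question: {question}\nLet's think step by step:"
    if mode == "few_shot_cot":
        # The exemplar shows the model what a reasoning chain should look like.
        return FEW_SHOT_EXEMPLAR + f"Question: {question}\nLet's think step by step:"
    raise ValueError(f"unknown mode: {mode!r}")

q = "If I have 3 apples and I give 2 to my friend, how many apples do I have left?"
for mode in ("direct", "zero_shot_cot", "few_shot_cot"):
    print(f"--- {mode} ---\n{build_prompt(q, mode)}\n")
```

Any of these strings can be passed to `generate_llm_response` below; few-shot CoT tends to matter most for small models that do not follow instructions well.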

Let's see an example. Consider a simple arithmetic problem:

**Problem:** "If I have 3 apples and I give 2 to my friend, how many apples do I have left?"

Let's try prompting the LLM in two ways:

1.  **Direct Prompting (without CoT):** Just ask the question directly.
2.  **CoT Prompting (with CoT):**  Prompt the LLM to think step by step.


In [None]:
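_llm_generators = {} # Cache pipelines so each model is loaded only once, not on every call

def generate_llm_response(prompt_cot, model_name_cot="distilgpt2"): # Function to generate text from LLM
    if model_name_cot not in _llm_generators:
        _llm_generators[model_name_cot] = pipeline('text-generation', model=model_name_cot)
    generator_cot = _llm_generators[model_name_cot]
    output_cot = generator_cot(prompt_cot, max_new_tokens=80, num_return_sequences=1, pad_token_id=generator_cot.tokenizer.eos_token_id)
    return output_cot[0]['generated_text']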

# Example Problem
arithmetic_problem = "If I have 3 apples and I give 2 to my friend, how many apples do I have left?"

# 1. Direct Prompting
direct_prompt = f"Question: {arithmetic_problem}\nAnswer:"
direct_response = generate_llm_response(direct_prompt)
print("Direct Prompt Response:\n", direct_response)

# 2. CoT Prompting
cot_prompt = f"Question: {arithmetic_problem}\nLet's think step by step:"
cot_response = generate_llm_response(cot_prompt)
print("\nChain of Thought Response:\n", cot_response)


**Examine the Responses:**

*   Compare the direct response and the CoT response.
*   Does the CoT response show any steps of reasoning? Even if it doesn't perfectly solve the arithmetic, does it attempt to break down the problem?
*   Note: `distilgpt2` is a smaller model and might not be ideal for complex reasoning, but we can still observe the effect of CoT prompting.

Let's try a slightly more complex problem and see if CoT makes a difference.

In [None]:
logic_problem = "A train leaves Chicago at 6am traveling at 60mph towards Denver. Another train leaves Denver at 8am traveling at 70mph towards Chicago. When will they meet?"

# Direct Prompting for logic problem
direct_prompt_logic = f"Question: {logic_problem}\nAnswer:"
direct_response_logic = generate_llm_response(direct_prompt_logic)
print("\nDirect Prompt Response (Logic Problem):\n", direct_response_logic)

# CoT Prompting for logic problem
cot_prompt_logic = f"Question: {logic_problem}\nLet's think step by step:"
cot_response_logic = generate_llm_response(cot_prompt_logic)
print("\nChain of Thought Response (Logic Problem):\n", cot_response_logic)

**Examine the Responses for the Logic Problem:**

*   Is there a noticeable difference between the direct response and the CoT response for this more complex problem?
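*   Does CoT prompting encourage the LLM to at least attempt a step-by-step approach, even if it doesn't solve the problem correctly?
*   Notice that the problem never states the distance between Chicago and Denver, so it has no unique numerical answer. Does either response notice this?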
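To see what a good chain of thought would look like, here is the train problem solved step by step in code. Because the problem omits the distance, the sketch assumes a **hypothetical** distance of 770 miles, chosen only so the numbers come out even (it is not meant to be the real Chicago-Denver distance), and it treats both departure times as being on the same clock, ignoring that Chicago and Denver are in different time zones. Each commented step is one link in the reasoning chain.

```python
def meeting_time(distance_miles, depart_a=6.0, speed_a=60.0, depart_b=8.0, speed_b=70.0):
    """Clock time (hours after midnight) at which two trains heading towards each other meet.

    Assumes train A departs first and both times are on the same clock.
    """
    # Step 1: distance train A covers before train B sets off.
    head_start = speed_a * (depart_b - depart_a)
    # Step 2: gap remaining when train B departs.
    remaining = distance_miles - head_start
    if remaining < 0:
        raise ValueError("train A reaches the other city before train B departs")
    # Step 3: the trains close the gap at their combined speed.
    closing_speed = speed_a + speed_b
    # Step 4: time after B's departure until they meet, as a clock time.
    return depart_b + remaining / closing_speed

# Hypothetical distance (not given in the original problem).
t = meeting_time(770)
print(f"They meet at {int(t)}:{round((t % 1) * 60):02d} (24-hour clock)")
# -> They meet at 13:00 (24-hour clock)
```

With 770 miles: train A covers 120 miles by 8am, leaving 650 miles, which the trains close at 130 mph in 5 hours.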

**Important Note:** The effectiveness of CoT prompting can depend on:

*   **Model Size and Capabilities:** Larger and more capable LLMs are generally better at CoT reasoning. Smaller models like `distilgpt2` might show limited CoT abilities.
*   **Prompt Engineering:** The way you phrase your CoT prompt can significantly impact the results. Experiment with different CoT prompting phrases and formats.
*   **Task Complexity:** CoT is most beneficial for tasks that genuinely require multi-step reasoning. For very simple tasks, direct prompting might be sufficient.

Let's create an **interactive interface** to experiment with CoT prompting for different questions!

In [None]:
def cot_interface(question_text, use_cot):
    if use_cot:
        cot_prompt_interactive = f"Question: {question_text}\nLet's think step by step:"
        response_text = generate_llm_response(cot_prompt_interactive)
        print(f"Prompt (CoT): '{cot_prompt_interactive}'\n")
        print(f"CoT Response: '{response_text}'")
    else:
        direct_prompt_interactive = f"Question: {question_text}\nAnswer:"
        response_text = generate_llm_response(direct_prompt_interactive)
        print(f"Prompt (Direct): '{direct_prompt_interactive}'\n")
        print(f"Direct Response: '{response_text}'")

question_input_widget_cot = widgets.Textarea(
    value='Ask me a question that might require some reasoning...',
    placeholder='Type your question here',
    description='Question:',
    disabled=False,
    continuous_update=False, # Run the model only when you leave the box, not on every keystroke
    layout=widgets.Layout(width='80%', height='100px')
)
cot_checkbox_widget = widgets.Checkbox(
    value=False,
    description='Use Chain of Thought (CoT)',
    disabled=False
)

interactive_cot = interactive(cot_interface, question_text=question_input_widget_cot, use_cot=cot_checkbox_widget)
display(interactive_cot) # interactive_cot already contains the text box, the checkbox and the output area

**Try it out!**

*   Enter different questions that might require reasoning (arithmetic, logic, common sense reasoning, etc.).
*   Toggle the "Use Chain of Thought (CoT)" checkbox to compare responses with and without CoT prompting.
*   Observe if CoT prompting makes a difference in the quality of the responses, especially for more complex questions.

**TODOs for CoT Exploration:**

1.  **Experiment with different CoT prompt formats:**
    *   Try different phrases to initiate CoT prompting, such as:
        *   "Let's work this out step by step."
        *   "Reasoning:"
        *   "First, I will..." "Then, I will..." "Finally, I will..." (more explicit step-by-step prompting)
    *   How do different prompt formats influence the LLM's reasoning process and output?
2.  **Apply CoT to different types of reasoning tasks:**
    *   Explore CoT prompting for tasks beyond arithmetic and logic, such as:
        *   Common sense reasoning questions.
        *   Textual entailment or contradiction detection.
        *   Simple planning or decision-making problems.
    *   For which types of tasks is CoT most effective?
3.  **Investigate techniques to improve CoT reliability and accuracy:**
    *   Research more advanced CoT techniques like:
        *   **Self-Consistency in CoT:** Generate multiple CoT reasoning paths and select the most consistent answer.
        *   **Least-to-Most Prompting:** Break down a complex problem into a sequence of simpler subproblems and solve them in order, building up to the final solution.
    *   How can these techniques further enhance the power of CoT reasoning?
4.  **Evaluate CoT performance:**
    *   Design experiments to quantitatively evaluate the impact of CoT prompting on reasoning accuracy for specific tasks.
    *   Compare the performance of LLMs with and without CoT prompting using appropriate evaluation metrics.
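Self-consistency (TODO 3) can be sketched in a few lines: sample several reasoning chains, extract each chain's final answer, and take a majority vote. In the sketch below the chains are hypothetical hard-coded strings, and the answer extractor simply takes the last number in the text, a crude assumption that only suits numeric answers. With the pipeline used above, you would obtain real chains by sampling, for example by passing `do_sample=True` and `num_return_sequences=5` to the generator.

```python
import re
from collections import Counter

def extract_final_number(text):
    """Return the last number in a reasoning chain (crude answer extraction)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(reasoning_chains):
    """Majority vote over the final answers of several sampled reasoning chains."""
    answers = [extract_final_number(chain) for chain in reasoning_chains]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical sampled chains for the apples question (one of them is wrong).
chains = [
    "I start with 3 apples and give away 2, so 3 - 2 = 1.",
    "3 apples minus 2 apples leaves 1.",
    "I had 3 and gave 2, so I have 5.",
]
print(self_consistency(chains))
```

A single faulty chain is outvoted as long as most sampled chains reach the same answer, which is the intuition behind the technique.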

# 3. Low-Rank Adaptation (LoRA) for Efficient Fine-tuning

Fine-tuning Large Language Models can be computationally expensive and resource-intensive, especially for very large models. **Low-Rank Adaptation (LoRA)** is a parameter-efficient fine-tuning technique that significantly reduces the number of trainable parameters, making fine-tuning more accessible and efficient.

**How LoRA Works:**

LoRA freezes the pre-trained weights of the LLM and introduces a small number of **new trainable parameters** in the form of low-rank matrices. During fine-tuning, only these low-rank matrices are updated, while the original LLM weights remain unchanged.
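Concretely, for a frozen weight matrix `W` (shape `d_out x d_in`), LoRA learns two small matrices `B` (`d_out x r`) and `A` (`r x d_in`) and computes `h = W x + (alpha / r) * B A x`. The NumPy sketch below uses toy sizes to show three properties: with `B` initialised to zero (as in the original LoRA formulation) the adapted layer starts out identical to the pre-trained one, the learned update `B A` has rank at most `r`, and after training the update can be merged back into a single weight matrix so inference needs no extra computation. The sizes and `alpha` value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 8   # toy sizes; real layers are much larger
scaling = alpha / r

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random initialisation
B = np.zeros((d_out, r))                 # trainable, zero initialisation

def lora_forward(x, W, A, B, scaling):
    """h = W x + (alpha / r) * B A x for a batch of column vectors x."""
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=(d_in, 3))
# With B = 0 the adapted layer initially behaves exactly like the frozen layer.
print(np.allclose(lora_forward(x, W, A, B, scaling), W @ x))

# Pretend training has updated B; the update B A has rank at most r.
B = rng.normal(size=(d_out, r))
print(np.linalg.matrix_rank(B @ A))

# After training, the adapter can be merged into one dense weight matrix.
W_merged = W + scaling * (B @ A)
print(np.allclose(lora_forward(x, W, A, B, scaling), W_merged @ x))
```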

**Benefits of LoRA:**

*   **Parameter Efficiency:**  Reduces the number of trainable parameters by orders of magnitude (e.g., from billions to millions or even less).
*   **Reduced Computational Cost:**  Faster training and lower memory requirements due to fewer trainable parameters.
*   **Preserves Pre-trained Knowledge:**  Freezing the original weights helps retain the general knowledge and capabilities of the pre-trained LLM.
*   **Modular and Portable:**  LoRA adapters (the low-rank matrices) are small and can be easily swapped or combined for different tasks without modifying the base LLM.
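The modularity point can be sketched in the same style: one frozen base weight is shared, and each task only needs its own small `(B, A)` pair, selected at run time. The task names and sizes below are hypothetical; with a 64 x 64 layer and `r=2`, both adapters together hold 512 numbers versus 4,096 in the base weight. In `peft`, this corresponds roughly to loading several adapters and switching between them with `set_adapter`.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 64, 64, 2, 16
scaling = alpha / r

W_base = rng.normal(size=(d_out, d_in))  # one shared, frozen base weight

# Hypothetical per-task adapters: each is just a small (B, A) pair.
adapters = {
    "sentiment": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
    "topic": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
}

def forward(x, task=None):
    """Apply the base layer, plus the adapter for `task` if one is given."""
    h = W_base @ x
    if task is not None:
        B, A = adapters[task]
        h = h + scaling * (B @ (A @ x))
    return h

x = rng.normal(size=(d_in,))
for task in (None, "sentiment", "topic"):
    print(task, np.round(forward(x, task)[:3], 3))

adapter_size = sum(B.size + A.size for B, A in adapters.values())
print("base params:", W_base.size, "| params in both adapters:", adapter_size)
```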

Let's demonstrate LoRA fine-tuning using the `peft` (Parameter-Efficient Fine-Tuning) library from Hugging Face. We will fine-tune `distilgpt2` for a simple text classification task (e.g., sentiment analysis, although we'll simplify it for demonstration).

First, let's load a dataset for text classification. We'll use a very small, synthetic dataset for quick demonstration purposes. In a real scenario, you would use a larger and more representative dataset for your target task.

In [None]:
# Synthetic dataset for demonstration (replace with a real dataset for meaningful fine-tuning)
lora_dataset_dict = {
    "text": [
        "This is a great movie!",
        "I really enjoyed this book.",
        "The food was terrible.",
        "I hated this product.",
        "Excellent service and quality.",
        "Very disappointing experience."
    ],
    "label": [1, 1, 0, 0, 1, 0] # 1 = positive, 0 = negative
}

from datasets import Dataset
lora_dataset = Dataset.from_dict(lora_dataset_dict)
print(lora_dataset) # Examine the synthetic dataset

Now, let's tokenize this dataset using the tokenizer for `distilgpt2`. This is the same tokenization process we used in the previous notebook for sentiment analysis fine-tuning.

In [None]:
from transformers import AutoTokenizer

lora_model_name = "distilgpt2"
lora_tokenizer = AutoTokenizer.from_pretrained(lora_model_name)
lora_tokenizer.pad_token = lora_tokenizer.eos_token # Important: GPT-2 has no padding token, so reuse the end-of-sequence token

def tokenize_lora_function(examples):
    return lora_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_lora_dataset = lora_dataset.map(tokenize_lora_function, batched=True)
print(tokenized_lora_dataset) # Inspect the tokenized dataset

Next, we need to load the `distilgpt2` model for **sequence classification** (since we're doing text classification). We use `AutoModelForSequenceClassification` as before.

Now, the crucial step: we will apply **LoRA** to this model using the `get_peft_model` function from the `peft` library. We need to configure `LoraConfig` to specify the LoRA parameters, such as the rank `r` and the alpha parameter.

In [None]:
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Load model for sequence classification
lora_base_model = AutoModelForSequenceClassification.from_pretrained(lora_model_name, num_labels=2) # 2 labels for our example
# GPT-2 has no padding token: reuse EOS so the classifier can find the last real token in padded batches
lora_base_model.config.pad_token_id = lora_tokenizer.pad_token_id

# Configure LoRA
lora_config = LoraConfig(
    r=8, # Rank of the low-rank matrices
    lora_alpha=32, # Scaling factor: the LoRA update is multiplied by lora_alpha / r
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS" # Task type is sequence classification
)

# Apply LoRA to the base model
lora_model = get_peft_model(lora_base_model, lora_config)
lora_model.print_trainable_parameters() # Print the number of trainable parameters

**Examine the Trainable Parameters:**

*   Run the `lora_model.print_trainable_parameters()` line.
*   Compare the number of trainable parameters with LoRA to the total number of parameters in `distilgpt2` (which is around 82 million).
*   You should see a significantly smaller number of trainable parameters with LoRA, demonstrating its parameter efficiency.
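You can predict the number that `print_trainable_parameters()` reports with a little arithmetic. The sketch below assumes distilgpt2's configuration (6 transformer blocks, hidden size 768) and that `peft` applies LoRA by default to GPT-2's fused attention projection `c_attn` (768 inputs, 3 x 768 = 2,304 outputs), which is our understanding of its defaults for this architecture. With `task_type="SEQ_CLS"`, the new classification head is trained as well. If these assumptions hold, the LoRA total for `r=8` plus the head size should be close to the trainable count printed in the cell above.

```python
# Assumed distilgpt2 dimensions: 6 transformer blocks, hidden size 768.
n_layers, hidden = 6, 768
# GPT-2's fused query/key/value projection `c_attn` maps 768 -> 3 * 768.
c_attn_in, c_attn_out = hidden, 3 * hidden

def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

for r in (4, 8, 16, 32):
    lora_total = n_layers * lora_params(c_attn_in, c_attn_out, r)
    full_total = n_layers * c_attn_in * c_attn_out
    print(f"r={r:>2}: {lora_total:>9,} LoRA params vs {full_total:,} in the c_attn weights "
          f"({100 * lora_total / full_total:.2f}%)")

# Assumed classification head: a 768 -> 2 linear layer without bias.
print("classification head:", hidden * 2)
```

Doubling `r` doubles the adapter size, yet even `r=32` stays a small fraction of the attention weights, and a much smaller fraction of the whole 82-million-parameter model.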

Now, let's set up the **training arguments** and the **Trainer** from Hugging Face `transformers`, just like we did for fine-tuning in the previous notebook. The training process for LoRA is very similar to regular fine-tuning.

In [None]:
from transformers import TrainingArguments, Trainer

lora_training_args = TrainingArguments(
    output_dir="./lora-distilgpt2-classification",
    learning_rate=2e-4, # Higher learning rate often used for LoRA
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3, # Small number of epochs for demonstration
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=1, # The toy dataset gives only one optimisation step per epoch, so log every step
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    push_to_hub=False,
    report_to="none", # Disable all logging integrations; None means "use the default", which in many versions is "all"
)



lora_trainer = Trainer(
    model=lora_model,
    args=lora_training_args,
    train_dataset=tokenized_lora_dataset,
    eval_dataset=tokenized_lora_dataset, # Using train dataset as eval for simplicity in demonstration
    processing_class=lora_tokenizer, # Replaces the deprecated `tokenizer` argument in recent transformers versions
    # metrics can be added if needed
)

lora_trainer.train() # Start LoRA fine-tuning


The `lora_trainer.train()` command starts the LoRA fine-tuning process. This will be much faster and less memory-intensive than full fine-tuning of `distilgpt2`.

After training, let's evaluate the LoRA-fine-tuned model and see how it performs on our synthetic dataset.

In [None]:
def predict_lora_sentiment(sentence):
    lora_model.eval() # Disable dropout so predictions are deterministic
    inputs = lora_tokenizer(sentence, return_tensors="pt", truncation=True, padding=True).to(lora_model.device)
    with torch.no_grad(): # No gradients are needed at inference time
        outputs = lora_model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)
    sentiment_lora = "positive" if predictions.item() == 1 else "negative" # Label 1 = positive, 0 = negative
    return f"Sentiment (LoRA): {sentiment_lora}"

lora_sentences_to_test = [
    "This movie was fantastic!",
    "I am really disappointed with this.",
    "It was okay, nothing special."
]

print("\nSentiment Predictions from LoRA Fine-tuned GPT-2:")
for sentence in lora_sentences_to_test:
    print(f"- Sentence: '{sentence}' - {predict_lora_sentiment(sentence)}")

**Observe the LoRA Predictions:**

*   Do the sentiment predictions from the LoRA-fine-tuned model seem reasonable for the example sentences?
*   Remember that we fine-tuned on a very small and synthetic dataset, so the model's performance might be limited.

**TODOs for LoRA Exploration:**

1.  **Experiment with different LoRA configurations:**
    *   Change the LoRA rank `r` (e.g., try r=4, r=16, r=32). Higher rank generally allows for more expressiveness but also increases the number of trainable parameters.
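    *   Adjust the `lora_alpha` parameter. The low-rank update is scaled by `lora_alpha / r` before being added to the frozen weights, so changing either value changes how strongly the adapter influences the model.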
    *   How do these configuration changes affect the fine-tuning process, model performance, and the number of trainable parameters?
2.  **Apply LoRA to different tasks and datasets:**
    *   Fine-tune `distilgpt2` (or other models) using LoRA for different NLP tasks, such as:
        *   Sentiment analysis on a real dataset (e.g., IMDB reviews, as in the previous notebook).
        *   Text summarization.
        *   Question answering.
    *   Explore datasets on Hugging Face Datasets ([https://huggingface.co/datasets](https://huggingface.co/datasets)) for various tasks.
3.  **Compare LoRA with full fine-tuning (conceptually):**
    *   Think about the advantages and disadvantages of LoRA compared to full fine-tuning, especially in terms of:
        *   Computational cost and resource requirements.
        *   Fine-tuning speed.
        *   Number of trainable parameters.
        *   Potential performance differences (LoRA might sometimes be slightly less performant than full fine-tuning but is much more efficient).
4.  **Investigate other parameter-efficient fine-tuning methods (conceptually):**
    *   Have a look at other parameter-efficient fine-tuning techniques besides LoRA, such as:
        *   Adapter layers.
        *   Prefix tuning.
        *   Prompt tuning.
    *   How do these methods compare to LoRA in terms of efficiency, implementation complexity, and performance?