<a href="https://colab.research.google.com/github/hirdeshkumar2407/NLP_Group_Assigment/blob/main/Training%20models/2_RAG_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports and loading the dataset:

In [1]:
pip install hnswlib

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import os
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch
import hnswlib
from transformers import AutoModel

if os.path.isfile("rag_instruct.json"): 
    df = pd.read_json("rag_instruct.json")
else:
    df = pd.read_json("hf://datasets/FreedomIntelligence/RAG-Instruct/rag_instruct.json")

documents = df['documents']



In [3]:
print(documents[:3])

0    [decided to make the story more straightforwar...
1    [the world with 68.5% of Taiwanese high school...
2    [Sparrho Sparrho combines human and artificial...
Name: documents, dtype: object


## Models: SentenceTransformer for the embeddings and CrossEncoder for re-ranking

In [4]:
semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
#semb_model.to('cuda')

## Calculating the embeddings for the corpus:

In [5]:
corpus_embeddings = semb_model.encode(documents, convert_to_tensor=True, show_progress_bar=True)



## Indexing for faster access:

In [6]:
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

In [7]:
# Define hnswlib index path
index_path = "./hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    index.save_index(index_path)

Loading index...
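
Because HNSW search is approximate, and because a saved index is only valid while the corpus order is unchanged, it can help to compare a few queries against an exact search. The following is a minimal pure-Python sketch, assuming small lists of float vectors; `exact_top_k` and `recall_at_k` are hypothetical helper names and are not part of hnswlib.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def exact_top_k(query_vec, corpus_vecs, k=3):
    """Return the ids of the k most similar corpus vectors, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    # Sort by descending similarity; ties are broken by the lower id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data (illustrative only, not taken from the dataset).
toy_corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.05]
exact_ids = exact_top_k(toy_query, toy_corpus, k=2)
```

In the notebook, `approx_ids` would come from `index.knn_query(...)` on a small sample of queries. A low recall after loading an old index file is one sign that the file no longer matches the current embeddings.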


In [8]:
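# Retrieve candidates with the HNSW index, then re-rank them with the cross-encoder
def get_related_docs(query, k=3):
    """Return a context string built from the documents most relevant to `query`.

    The k nearest neighbours from the index are re-scored by the cross-encoder.
    Documents with a positive score are kept, best first. If no document scores
    above 0, the two highest-scoring documents are used instead.
    """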
    query_embedding = semb_model.encode(query, convert_to_tensor=True)
    corpus_ids, _ = index.knn_query(query_embedding.cpu(), k=k)

    model_inputs = [(query, str(documents[idx])) for idx in corpus_ids[0]]
    cross_scores = xenc_model.predict(model_inputs)
    send_to_LLM = ""
    positive_docs = [documents[corpus_ids[0][idx]] for idx in np.argsort(-cross_scores) if cross_scores[idx] > 0]

    if len(positive_docs) > 1:
        for i, doc in enumerate(positive_docs):
            send_to_LLM += f"Document {i+1}:\n\n"
            # Convert the list 'doc' to a string before concatenating
            send_to_LLM += str(doc) + "\n"
    elif len(positive_docs) == 1:
        # Convert the list to a string if there's only one document
        send_to_LLM = str(positive_docs[0])

    else:
        # If no document has a positive score, fall back to the two highest-scoring (least negative) documents
        negative_docs = []
        for idx in np.argsort(-cross_scores)[:2]: # Take the top 2 indices based on sorted scores
            negative_docs.append(documents[corpus_ids[0][idx]])

        if len(negative_docs) > 1:
            for i, doc in enumerate(negative_docs):
                send_to_LLM += f"Document {i+1}:\n"
                send_to_LLM += str(doc) + "\n\n"
        elif len(negative_docs) == 1:
            send_to_LLM = str(negative_docs[0])

    return send_to_LLM
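
The selection rule inside `get_related_docs` (keep positively scored documents, otherwise fall back to the two best) can be isolated so it is easy to test without loading any model. This is a small sketch of that rule only; `select_reranked` is a hypothetical name, and it works on a plain list of scores rather than on the cross-encoder output array.

```python
def select_reranked(scores, fallback_n=2):
    """Return candidate positions ordered best first.

    Positions with a score above 0 are kept. If there are none, the
    `fallback_n` highest-scoring positions are returned instead.
    """
    # Positions sorted by descending score (the same order as np.argsort(-scores)).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    positive = [i for i in order if scores[i] > 0]
    return positive if positive else order[:fallback_n]

# Illustrative scores, not taken from a real run.
mixed = select_reranked([-1.2, 3.4, 0.5])
all_negative = select_reranked([-5.0, -0.3, -2.0])
```

The returned positions index into the candidate list from the index query, so they still need to be mapped back through `corpus_ids[0]` as the function above does.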



In [9]:
!pip install -U bitsandbytes accelerate transformers
print("Required libraries upgrade/installation attempted.")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Required libraries upgrade/installation attempted.


In [10]:
import bitsandbytes
print(f"bitsandbytes version: {bitsandbytes.__version__}")
import transformers
print(f"transformers version: {transformers.__version__}")
import torch
print(f"PyTorch version: {torch.__version__}")
# Check if GPU is available to transformers
if torch.cuda.is_available():
    print(f"CUDA is available. GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available. Running on CPU.")

bitsandbytes version: 0.45.5
transformers version: 4.51.3
PyTorch version: 2.6.0+cu124
CUDA is available. GPU: Tesla T4


In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch # Already imported, but good practice if cell is standalone

# Define quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model_name = "AITeamVN/Vi-Qwen2-3B-RAG"

try:
    print(f"\nLoading tokenizer for {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Tokenizer loaded.")

    print(f"\nLoading model {model_name} with 4-bit quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto" # Let accelerate handle device placement
    )
    print("Model loaded successfully with 4-bit quantization.")
    if hasattr(model, 'hf_device_map'):
        print(f"Model device map: {model.hf_device_map}")
    else:
        print(f"Model is on device: {model.device}")


except ImportError as e:
    print(f"ImportError during model loading: {e}")
    print("This usually means 'bitsandbytes' is not the correct version or not found.")
    print("Ensure the 'pip install -U ...' cell above ran successfully in THIS session.")
    print("If the kernel/session fully restarted after that cell ran, run the install cell again.")
except Exception as e:
    print(f"An error occurred during model loading: {e}")


Loading tokenizer for AITeamVN/Vi-Qwen2-3B-RAG...
Tokenizer loaded.

Loading model AITeamVN/Vi-Qwen2-3B-RAG with 4-bit quantization...



Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



Model loaded successfully with 4-bit quantization.
Model device map: {'model.embed_tokens': 0, 'lm_head': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 1, 'model.layers.8': 1, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.layers.32': 1, 'model.layers.33': 1, 'model.layers.34': 1, 'model.layers.35': 1, 'model.norm': 1, 'model.rotary_emb': 1}


In [12]:
# model.to('cuda')
# tokenizer.to('cuda')

query = "Do all plants do photosynthesis?"

context_docs = get_related_docs(query)

prompt = f"Given this context: \n{context_docs} \n\nPlease answer the question: {query}.\n\nAnswer:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print result
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n=== Generated Answer ===\n")
print(answer.split("Answer:")[-1].strip())  # Optional: strip prompt parts



=== Generated Answer ===

All plants do photosynthesis. This is based on the information provided in Document 1, which states that "Green plants obtain most of their energy from sunlight via photosynthesis by primary chloroplasts that are derived from endosymbiosis with cyanobacteria. Their chloroplasts contain chlorophylls a and b, which gives them their green color." This clearly indicates that photosynthesis is a fundamental process for all plants. Additionally, the text also mentions that "Oxygenic photosynthetic organisms use chlorophyll 'a', but differ in accessory pigments like chlorophylls 'b'." This further supports the idea that photosynthesis is a key process for all oxygenic photosynthetic organisms, which includes plants. 

For further clarification, Document 1 provides additional context about the process of photosynthesis in plants, including the role of chlorophyll, the Calvin cycle, and the light-dependent and light-independent reactions. However, the core statement t
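
The generation call above samples with `temperature=0.7` and `top_p=0.9`. To make the effect of these two parameters concrete, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy logit vector. It illustrates the idea only and is not the implementation used inside `model.generate`; the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)  # token ids that remain eligible for sampling
```

A temperature below 1 concentrates probability on the top tokens, and top-p then discards the low-probability tail. Both push the sampler toward more predictable text.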

# Model Inspection

In [17]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_name = "AITeamVN/Vi-Qwen2-3B-RAG" # Ensure this is the correct model name

print(f"\n---  Inspecting Configuration for {model_name} ---")

try:
    # Load the configuration
    config = AutoConfig.from_pretrained(model_name)
    print("\nModel Configuration Loaded Successfully.")

    # a) Maximum Context Window / Sequence Length
    # Common attribute names for max sequence length / context window:
    # 'max_position_embeddings', 'n_positions', 'n_ctx'
    # The exact name can vary between model architectures.
    max_len_attrs = ['max_position_embeddings', 'n_positions', 'n_ctx', 'sliding_window'] # 'sliding_window' is reported for reference but is not treated as the sequence-length limit
    context_window = None
    print("\nPotential attributes for max context window:")
    for attr in max_len_attrs:
        if hasattr(config, attr):
            value = getattr(config, attr)
            print(f"  - Found '{attr}': {value}")
            if isinstance(value, int) and (context_window is None or value > context_window) and attr != 'sliding_window': # sliding_window is different
                context_window = value
            if attr == 'sliding_window' and value is not None:
                print(f"    Note: This model uses a sliding window of {value}. Effective context might be related but not strictly this value for all operations.")
                # In Qwen2 configs, 'sliding_window' is the size of the attention window, not the maximum sequence length.
                # The usable context length is usually given by 'max_position_embeddings' or stated in the model card.
                # As an extra hint, also load the tokenizer and check its 'model_max_length'.
                try:
                    tokenizer_temp = AutoTokenizer.from_pretrained(model_name)
                    if hasattr(tokenizer_temp, 'model_max_length'):
                        print(f"  - Tokenizer 'model_max_length': {tokenizer_temp.model_max_length}")
                        if context_window is None or tokenizer_temp.model_max_length > context_window:
                             context_window = tokenizer_temp.model_max_length
                except Exception as e_tok:
                    print(f"    Could not load tokenizer to check its max length: {e_tok}")


    if context_window:
        print(f"\nEstimated Maximum Context Window / Sequence Length: {context_window} tokens")
    else:
        print("\nCould not automatically determine a clear maximum context window from common config attributes.")
        print("Please refer to the model card or documentation for the definitive context window.")

    # b) Model Type / Architecture
    if hasattr(config, 'model_type'):
        print(f"\nModel Type / Architecture: {config.model_type}")
    else:
        print("\nModel Type not explicitly found in config.")

  

except Exception as e:
    print(f"An error occurred while loading or inspecting the model configuration: {e}")
    print("Ensure the model name is correct and you have an internet connection.")


---  Inspecting Configuration for AITeamVN/Vi-Qwen2-3B-RAG ---

Model Configuration Loaded Successfully.

Potential attributes for max context window:
  - Found 'max_position_embeddings': 32768
  - Found 'sliding_window': 32768
    Note: This model uses a sliding window of 32768. Effective context might be related but not strictly this value for all operations.
  - Tokenizer 'model_max_length': 131072

Estimated Maximum Context Window / Sequence Length: 131072 tokens

Model Type / Architecture: qwen2
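
The config and the tokenizer report different limits above (32768 and 131072). When budgeting prompt length, one conservative option is to take the smaller of the two and subtract the space reserved for generation. The sketch below assumes that choice; `prompt_token_budget` is a hypothetical helper, and the right limit should be confirmed against the model card.

```python
def prompt_token_budget(max_position_embeddings, tokenizer_max_length,
                        max_new_tokens, safety_margin=10):
    """Return how many prompt tokens fit while leaving room for generation."""
    # Assumption: the smaller reported limit is the safe one to plan against.
    context_limit = min(max_position_embeddings, tokenizer_max_length)
    budget = context_limit - max_new_tokens - safety_margin
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget

# Values taken from the inspection output above, with max_new_tokens=250.
budget = prompt_token_budget(32768, 131072, max_new_tokens=250)
```

The result could be passed as `max_length` when tokenizing with `truncation=True`, in place of a value derived from the tokenizer's `model_max_length` alone.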


In [18]:
sample_queries = [
    "What is a key reason OpenSSH is considered secure?",
    "What are the main components of Docker's service?",
    "In which film did Christopher Walken portray a character who gives a speech involving a gold watch related to his experiences in the Vietnam War?",
    "Which health benefits are associated with running?",
    "Which flag, the New Zealand Ensign or the Union Jack, had formal legislation passed for its use earlier?",
    "Did Jimmy Carter's high school activities have any influence on his professional pursuit?",
]

print(f"Defined {len(sample_queries)} sample queries for evaluation.")

Defined 6 sample queries for evaluation.


# Helper Function for LLM Generation

In [26]:
# Cell A: Helper Function for LLM Generation

import torch # Ensure torch is imported

def generate_llm_answer_experimental(current_model, current_tokenizer, prompt_text,
                                     max_new_tokens=250, temperature=0.7, top_p=0.9, do_sample=True):
    """
    Generates an answer from the LLM given a prompt and generation parameters.
    Returns the extracted answer text.
    """
    answer_text = "Error during generation."
    # print(f"--- DEBUG: Prompt to LLM (first 500 chars) ---\n{prompt_text[:500]}...") # Uncomment for deep debugging of prompt
    try:
        inputs = current_tokenizer(prompt_text, return_tensors="pt", truncation=True, max_length=current_tokenizer.model_max_length - max_new_tokens - 10).to(current_model.device) # Truncate long prompts, leaving room for the new tokens plus a small buffer
        
        with torch.no_grad():
            outputs = current_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
                do_sample=do_sample,
                pad_token_id=current_tokenizer.eos_token_id
            )
        
        full_generation = current_tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Try to extract answer after "Answer:" or "answer:"
        answer_cue_index = -1
        if "Answer:" in full_generation:
            answer_cue_index = full_generation.rfind("Answer:") + len("Answer:")
        elif "answer:" in full_generation.lower():
            answer_cue_index = full_generation.lower().rfind("answer:") + len("answer:")
            
        if answer_cue_index != -1:
            answer_text = full_generation[answer_cue_index:].strip()
        else:
            # Fallback: take text after the prompt if cue not found
            # This might be noisy if the model doesn't follow the prompt structure.
            answer_text = full_generation[len(prompt_text):].strip() if len(full_generation) > len(prompt_text) else full_generation
            # print("Warning: 'Answer:' cue not found in LLM output. Using fallback extraction.")

    except Exception as e:
        print(f"  Error during LLM generation: {e}")
        answer_text = f"Error: {e}"
    return answer_text

print("Helper function 'generate_llm_answer_experimental' defined.")

Helper function 'generate_llm_answer_experimental' defined.
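
Searching for the last "Answer:" with `rfind` can drop part of the output when the model itself writes "Answer:" inside its response. One alternative, sketched here in plain Python, is to skip exactly as many cues as the prompt contains. `extract_answer` is a hypothetical helper and not part of the notebook.

```python
def extract_answer(full_generation, prompt_text, cue="Answer:"):
    """Return only the text the model produced after the prompt."""
    # Preferred path: the decoded text starts with the prompt, so drop that prefix.
    if full_generation.startswith(prompt_text):
        return full_generation[len(prompt_text):].strip()
    # Fallback: skip the cues that belong to the prompt and keep everything after them,
    # including any later "Answer:" written by the model.
    n_prompt_cues = prompt_text.count(cue)
    parts = full_generation.split(cue, n_prompt_cues)
    if n_prompt_cues and len(parts) == n_prompt_cues + 1:
        return parts[-1].strip()
    return full_generation.strip()

# Toy strings (illustrative only).
toy_prompt = "Context: x\n\nAnswer:\n"
kept_inner_cue = extract_answer(toy_prompt + "Yes. Answer: it depends.", toy_prompt)
whitespace_changed = extract_answer("Context: x\nAnswer: Paris", toy_prompt)
```

With a decoder-only model, another common approach is to decode only the newly generated tokens, for example by slicing `outputs[0]` from `inputs["input_ids"].shape[1]` onward before calling `decode`, which avoids string matching altogether.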


# Baseline & Retriever Output Check

In [28]:
# Cell B: Baseline Generation & Retriever Check for Sample Queries

# Assumes 'get_related_docs' (from Mehdi's notebook) is defined above and uses the global 'documents'.
# Assumes 'model' and 'tokenizer' for the LLM are loaded.
# Assumes 'sample_queries' is defined above.

print("--- Baseline Performance & Retriever Check ---")

baseline_results = []

# Define your baseline/default generation parameters
default_max_new_tokens = 250
default_temperature = 0.7
default_top_p = 0.9
default_do_sample = True

for i, query_text in enumerate(sample_queries):
    print(f"\n\n=== PROCESSING QUERY {i+1}/{len(sample_queries)}: \"{query_text}\" ===")

    # 1. Get Retrieved Context
    print("  Retrieving context...")
    try:
        # Call Mehdi's get_related_docs, which returns the retrieved context as a single string.
        retrieved_context_str = get_related_docs(query_text, k=3)  # k=3 is the function's default
    except Exception as e_retriever:
        print(f"  Error during context retrieval: {e_retriever}")
        retrieved_context_str = "Error: Could not retrieve context."
    
    print(f"\n  --- Retrieved Context (first 500 chars) ---")
    print(retrieved_context_str[:500] + "..." if len(retrieved_context_str) > 500 else retrieved_context_str)
    

    # 2. Construct Prompt
    prompt_text = f"Given this context: \n{retrieved_context_str} \n\nPlease answer the question: {query_text}.\n\nAnswer:\n"

    # 3. Generate Answer
    print("\n  --- Generating LLM Answer (Baseline) ---")
    generated_answer = generate_llm_answer_experimental(
        model, tokenizer, prompt_text,
        max_new_tokens=default_max_new_tokens,
        temperature=default_temperature,
        top_p=default_top_p,
        do_sample=default_do_sample
    )
    print(f"  Generated Answer:\n    {generated_answer}")

    baseline_results.append({
        "query": query_text,
        "retrieved_context": retrieved_context_str,
        "prompt_used": prompt_text,
        "generated_answer": generated_answer
    })
    
    print("-" * 50)

print("\nBaseline generation for sample queries complete.")

--- Baseline Performance & Retriever Check ---


=== PROCESSING QUERY 1/6: "What is a key reason OpenSSH is considered secure?" ===
  Retrieving context...




  --- Retrieved Context (first 500 chars) ---
Document 1:

["codebase with the OpenSSH 7.6 release. SSH is a protocol that can be used for many applications across many platforms including most Unix variants (GNU/Linux, the BSDs including Apple's macOS, and Solaris), as well as Microsoft Windows. Some of the applications below may require features that are only available or compatible with specific SSH clients or servers. For example, using the SSH protocol to implement a VPN is possible, but presently only with the OpenSSH server and clien...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    OpenSSH is considered secure due to the following key reasons:

1. **End-to-End Encryption**: OpenSSH encrypts all information, including usernames and passwords, ensuring that sensitive data is protected from unauthorized access.

2. **Multiple Layers of Security**:
   - **Transport Layer**: Uses SSH-2 protocol, which includes multiple layers of security.
   - **User-Authentica



  --- Retrieved Context (first 500 chars) ---
Document 1:
["VMware vSphere Integrated Containers. The Cloud Foundry Diego project integrates Docker into the Cloud Foundry PaaS. Nanobox uses Docker (natively and with VirtualBox) containers as a core part of its software development platform. Red Hat's OpenShift PaaS integrates Docker with related projects (Kubernetes, Geard, Project Atomic and others) since v3 (June 2015). The Apprenda PaaS integrates Docker containers in version 6.0 of its product. Jelastic PaaS provides managed multi-tenan...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    The main components of Docker's service include:

- Resource Isolation: Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces.
- Union-capable File System: Docker uses a union-capable file system such as OverlayFS to allow independent "containers" to run within a single Linux instance.
-
--------------------------------------------



  --- Retrieved Context (first 500 chars) ---
['and killers. Cimino countered that his film was not political, polemical, literally accurate, or posturing for any particular point of view. He further defended his position by saying that he had news clippings from Singapore that confirm Russian roulette was used during the war (without specifying which article). During the 29th Berlin International Film Festival in 1979, the Soviet delegation expressed its indignation with the film which, in their opinion, insulted the Vietnamese people in n...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    In the film "The Deer Hunter" (1978), Christopher Walken portrayed a character who gives a speech involving a gold watch related to his experiences in the Vietnam War. The character is a young Pennsylvania steelworker who is emotionally destroyed by the Vietnam War and explains in graphic detail how he had hidden the watch from the Vietcon
---------------------------------------



  --- Retrieved Context (first 500 chars) ---
Document 1:

['exists the potential for injury while running (just as there is in any sport), there are many benefits. Some of these benefits include potential weight loss, improved cardiovascular and respiratory health (reducing the risk of cardiovascular and respiratory diseases), improved cardiovascular fitness, reduced total blood cholesterol, strengthening of bones (and potentially increased bone density), possible strengthening of the immune system and an improved self-esteem and emotional...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    Based on Document 1, the health benefits associated with running include:

1. Weight loss
2. Improved cardiovascular and respiratory health (reducing the risk of cardiovascular and respiratory diseases)
3. Improved cardiovascular fitness
4. Reduced total blood cholesterol
5. Strengthening of bones (and potentially increased bone density)
6. Possible strengthening of the immune s



  --- Retrieved Context (first 500 chars) ---
Document 1:

['the Flags Act 1953, section 8 of that Act specified that "this Act does not affect the right or privilege of a person to fly the Union Jack." The Union Jack continued to be used for a period thereafter as a national flag. The current national flag of New Zealand was given official standing under the New Zealand Ensign Act in 1902, but similarly to Australia the Union Jack continued to be used in some contexts as a national flag. On 5 February 2008, Conservative MP Andrew Rosindell...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    The New Zealand Ensign had formal legislation passed for its use earlier. The New Zealand Ensign was given official standing under the New Zealand Ensign Act in 1902, while the Union Jack was not subject to any formal legislation until the 1953 Flags Act. Therefore, the New Zealand Ensign had
--------------------------------------------------


=== PROCESSING QUERY 6/6: "Did Jim



  --- Retrieved Context (first 500 chars) ---
['The future president wrote that it was the last time he ever stole. Carter would be credited by his eldest son with being the person who most shaped his "work habits and ambitions". Carter was a conservative in his political views. However, his son Jimmy recollected that, "within our family we never thought about trying to define such labels." Initially having supported Franklin D. Roosevelt, Carter opposed implimentation of his New Deal when production control programs instituted under the Ro...

  --- Generating LLM Answer (Baseline) ---
  Generated Answer:
    Based on the context provided, Jimmy Carter's high school activities indeed had some influence on his professional pursuit. Specifically:

- Carter played on the Plains High School basketball team.
- He joined the Future Farmers of America and developed a lifelong interest in woodworking.
- These activities helped shape his work habits and ambitions, according to his eldest son,


#### Query 1: "What is a key reason OpenSSH is considered secure?"
Retriever: Provided context was topically related (mentioned OpenSSH and SSH) but did not contain specific reasons for its security.
LLM Answer: Generated a plausible list of security features of OpenSSH.
Grounding: Poorly grounded. The answer likely comes from the LLM's pre-trained knowledge, not the provided context.
RAG Performance: Failure (answer not derived from context).

#### Query 2: "What are the main components of Docker's service?"
Retriever: Provided context mentioned Docker integrations but did not list Docker's own components.
LLM Answer: Correctly listed the main components of Docker.
Grounding: Completely ungrounded. Answer derived from pre-trained knowledge.
RAG Performance: Failure (answer not derived from context).

#### Query 3: "In which film did Christopher Walken portray a character who gives a speech involving a gold watch related to his experiences in the Vietnam War?"
Retriever: Provided context was irrelevant to the question (it concerns "The Deer Hunter", with no mention of Walken, the gold watch, or "Pulp Fiction").
LLM Answer: Incorrect in the run shown above. The model named "The Deer Hunter" (1978), whereas the gold watch speech is from "Pulp Fiction". Because sampling is enabled, other runs may produce a different answer.
Grounding: Poor. The answer combines the film from the retrieved context with the gold watch detail from the question, a claim the context does not support.
RAG Performance: Failure (irrelevant context led to an unsupported answer).

#### Query 4: "Which health benefits are associated with running?"
Retriever: Provided context was highly relevant and listed several health benefits.
LLM Answer: Listed many health benefits. The initial set was well-grounded in the provided context snippet. Later points might have come from the fuller document or pre-trained knowledge. Ended with a repetitive artifact.
Grounding: Good to Fair. The core of the answer was grounded.
RAG Performance: Mostly Successful (demonstrates RAG working when context is good, though answer could be more concise and avoid repetition).

#### Query 5: "Which flag, the New Zealand Ensign or the Union Jack, had formal legislation passed for its use earlier?"
Retriever: Provided context was highly relevant, containing dates for legislation for both flags.
LLM Answer: Correctly identified the New Zealand Ensign (though the output was truncated in the baseline).
Grounding: Good. The answer is derivable from the context.
RAG Performance: Successful (demonstrates RAG working when context is good, though generation length needs adjustment).

#### Query 6: "Did Jimmy Carter's high school activities have any influence on his professional pursuit?"
Retriever: The portion of the context shown above appears to describe Jimmy Carter's father and his political views. It does not mention specific high school activities or their direct influence on Jimmy Carter's career.
LLM Answer: Generated a detailed answer about Carter's high school activities (basketball, Future Farmers of America, and woodworking in the run shown above) and their influence.
Grounding: Poorly grounded with respect to the snippet shown. The answer appears to come from the LLM's pre-trained knowledge about Carter. Only the first 500 characters of the context are printed, so the rest of the retrieved text is not visible in this output.
RAG Performance: Failure (answer not derived from the context shown).
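
The grounding judgements above were made by reading the outputs. A crude automatic proxy is lexical overlap: the share of content words in the answer that also appear in the retrieved context. The sketch below is an assumption-laden illustration (tiny stopword list, no stemming, hypothetical function names) and cannot replace manual review, since an answer can reuse context words and still be wrong.

```python
import re

# A deliberately small stopword list, chosen for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are",
             "was", "for", "on", "with", "by", "that", "this", "it", "as"}

def content_tokens(text):
    """Lowercase word tokens with the stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def grounding_score(answer, context):
    """Fraction of the answer's content tokens that also occur in the context (0.0 to 1.0)."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(content_tokens(context))
    hits = sum(1 for t in answer_tokens if t in context_vocab)
    return hits / len(answer_tokens)

# Toy strings (illustrative only, not the notebook's real outputs).
toy_context = "Benefits of running include weight loss and improved cardiovascular fitness."
grounded = grounding_score("Running benefits include weight loss.", toy_context)
ungrounded = grounding_score("OpenSSH encrypts passwords.", toy_context)
```

Applied to `baseline_results`, the score could be computed from each entry's `generated_answer` and `retrieved_context` to flag low-overlap answers for closer inspection.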

# Experiment 1 - Prompt Engineering

In [29]:
# Cell C: Experiment 1 - Prompt Engineering

print("\n--- Experiment 1: Prompt Engineering ---")

# Choose one query index from 'sample_queries' (0 to len(sample_queries) - 1)
query_index_for_prompt_exp = 0 # Example: use the first sample query
selected_query = sample_queries[query_index_for_prompt_exp]

# Re-retrieve the context for this query (the same call used in the baseline run)
print(f"  Retrieving context for: \"{selected_query}\"")
context_for_prompt_exp = get_related_docs(selected_query, k=3) # Using Mehdi's function
print(f"  Context (start): {context_for_prompt_exp[:300]}...")


print(f"\n--- Testing different prompts for query: \"{selected_query}\" ---")

# Prompt Style 1 (Baseline)
prompt_style_1 = f"Given this context: \n{context_for_prompt_exp} \n\nPlease answer the question: {selected_query}.\n\nAnswer:\n"
print("\n--- Prompt Style 1 (Baseline) ---")
answer_1 = generate_llm_answer_experimental(model, tokenizer, prompt_style_1)
print(f"  Generated Answer:\n    {answer_1}")

# Prompt Style 2 (More direct, instruction first)
prompt_style_2 = f"Based ONLY on the following context, answer the question. If the answer is not in the context, state that.\n\nContext:\n{context_for_prompt_exp}\n\nQuestion: {selected_query}\n\nAnswer:\n"
print("\n--- Prompt Style 2 (Direct Instruction) ---")
answer_2 = generate_llm_answer_experimental(model, tokenizer, prompt_style_2)
print(f"  Generated Answer:\n    {answer_2}")

# Prompt Style 3 (Role-playing)
prompt_style_3 = f"You are a helpful AI assistant. Your task is to answer the question using *only* the provided text. Question: {selected_query}\n\nProvided text:\n{context_for_prompt_exp}\n\nAnswer based on provided text:\n"
print("\n--- Prompt Style 3 (Role-Playing) ---")
answer_3 = generate_llm_answer_experimental(model, tokenizer, prompt_style_3)
print(f"  Generated Answer:\n    {answer_3}")




--- Experiment 1: Prompt Engineering ---
  Retrieving context for: "What is a key reason OpenSSH is considered secure?"



  Context (start): Document 1:

["codebase with the OpenSSH 7.6 release. SSH is a protocol that can be used for many applications across many platforms including most Unix variants (GNU/Linux, the BSDs including Apple's macOS, and Solaris), as well as Microsoft Windows. Some of the applications below may require featu...

--- Testing different prompts for query: "What is a key reason OpenSSH is considered secure?" ---

--- Prompt Style 1 (Baseline) ---
  Generated Answer:
    OpenSSH is considered secure due to several key reasons:

1. **End-to-End Encryption**: OpenSSH encrypts all information, including usernames and passwords. This ensures that data transmitted between the client and server is protected from eavesdropping and unauthorized access.

2. **Multiple Layers of Security**: OpenSSH uses multiple layers of security, including transport, user authentication, and connection layers. This multi-layered approach helps protect against various types of attacks and vulnerabilities.
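
The three prompt styles in the cell above differ only in their template string, so they could also be kept in one dictionary and rendered in a loop, which makes adding a further style a one-line change. The block below is a minimal sketch under that assumption: `PROMPT_TEMPLATES`, `build_prompts`, `example_query`, and `example_context` are hypothetical names that do not appear elsewhere in this notebook, and the context is a placeholder string rather than real retriever output.

```python
# Hypothetical helper (not part of the original notebook): keep the prompt
# styles in one dictionary so each can be rendered and compared in a loop.
PROMPT_TEMPLATES = {
    "style_1_baseline": (
        "Given this context: \n{context} \n\n"
        "Please answer the question: {query}.\n\nAnswer:\n"
    ),
    "style_2_direct": (
        "Based ONLY on the following context, answer the question. "
        "If the answer is not in the context, state that.\n\n"
        "Context:\n{context}\n\nQuestion: {query}\n\nAnswer:\n"
    ),
    "style_3_role": (
        "You are a helpful AI assistant. Your task is to answer the question "
        "using *only* the provided text. Question: {query}\n\n"
        "Provided text:\n{context}\n\nAnswer based on provided text:\n"
    ),
}


def build_prompts(query, context, templates=PROMPT_TEMPLATES):
    """Return {style_name: rendered_prompt} for one query/context pair."""
    return {
        name: template.format(context=context, query=query)
        for name, template in templates.items()
    }


# Placeholder context; in the notebook this would come from get_related_docs(...).
example_query = "What is a key reason OpenSSH is considered secure?"
example_context = "Document 1:\n<retrieved text goes here>"

prompts = build_prompts(example_query, example_context)
for name, prompt in prompts.items():
    print(f"--- {name} ---")
    print(prompt)
```

In the notebook, each rendered prompt would then be passed to `generate_llm_answer_experimental(model, tokenizer, prompt)` exactly as in the cell above.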


## LLM Generation Parameters

In [24]:

print("\n--- Experiment 2: LLM Generation Parameters ---")

query_index_for_param_exp = 1 # Example: use the second sample query
selected_query_params = sample_queries[query_index_for_param_exp]

# Get context for this query
print(f"  Retrieving context for: \"{selected_query_params}\"")
context_for_param_exp = get_related_docs(selected_query_params, k=3)
print(f"  Context (start): {context_for_param_exp[:300]}...")

# Reuse the preferred prompt style from the previous cell.
# Here that is Style 2 (direct instruction), written as a template with {context} and {query} placeholders.
best_prompt_template = "Based ONLY on the following context, answer the question. If the answer is not in the context, state that.\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:\n"
base_prompt_for_params = best_prompt_template.format(context=context_for_param_exp, query=selected_query_params)

print(f"\n--- Testing different generation parameters for query: \"{selected_query_params}\" ---")
print(f"--- Using base prompt (first 100 chars): {base_prompt_for_params[:100]}... ---")

# Param Set 1 (Default-ish: temp=0.7, top_p=0.9, do_sample=True)
print("\n--- Param Set 1 (temp=0.7, top_p=0.9, sample=True) ---")
answer_p1 = generate_llm_answer_experimental(model, tokenizer, base_prompt_for_params, temperature=0.7, top_p=0.9, do_sample=True)
print(f"  Generated Answer:\n    {answer_p1}")

# Param Set 2 (Lower temperature -> less random)
print("\n--- Param Set 2 (temp=0.2, top_p=0.9, sample=True) ---")
answer_p2 = generate_llm_answer_experimental(model, tokenizer, base_prompt_for_params, temperature=0.2, top_p=0.9, do_sample=True)
print(f"  Generated Answer:\n    {answer_p2}")

# Param Set 3 (No sampling -> greedy decoding)
print("\n--- Param Set 3 (sample=False) ---")
answer_p3 = generate_llm_answer_experimental(model, tokenizer, base_prompt_for_params, do_sample=False) # temperature and top_p have no effect when do_sample=False (greedy decoding)
print(f"  Generated Answer:\n    {answer_p3}")

# Param Set 4 (Shorter max_new_tokens)
print("\n--- Param Set 4 (max_new_tokens=100, temp=0.7, sample=True) ---")
answer_p4 = generate_llm_answer_experimental(model, tokenizer, base_prompt_for_params, max_new_tokens=100, temperature=0.7, do_sample=True)
print(f"  Generated Answer:\n    {answer_p4}")

print("\n>>> MANUAL EVALUATION: Compare answers from different generation parameters. <<<")
print(">>> How do temperature, sampling, and max_tokens affect the output? <<<")


--- Experiment 2: LLM Generation Parameters ---
  Retrieving context for: "What are the main components of Docker's service?"



  Context (start): Document 1:
["VMware vSphere Integrated Containers. The Cloud Foundry Diego project integrates Docker into the Cloud Foundry PaaS. Nanobox uses Docker (natively and with VirtualBox) containers as a core part of its software development platform. Red Hat's OpenShift PaaS integrates Docker with relate...

--- Testing different generation parameters for query: "What are the main components of Docker's service?" ---
--- Using base prompt (first 100 chars): Based ONLY on the following context, answer the question. If the answer is not in the context, state... ---

--- Param Set 1 (temp=0.7, top_p=0.9, sample=True) ---
  Generated Answer:
    The main components of Docker's service are:

- Resource Isolation: Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces.
- Union-capable File System: Docker uses a union-capable file system such as OverlayFS to allow independent "containers" to run within a single Linux instance, avoid



  Generated Answer:
    Docker's service consists of several main components:

1. **Resource Isolation**: Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces. These features help to isolate applications by controlling their CPU, memory, and other resources.

2. **Union File System**: Docker employs a union file system such as OverlayFS to allow independent "containers" to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines (VMs).

3. **Namespaces**: The Linux kernel's support for namespaces isolates an application's view of the operating environment, including process trees, network, user IDs, and mounted file systems. This helps to ensure that each container runs in a secure and isolated environment.

4. **Cgroups (Control Groups)**: Cgroups provide resource limiting for memory usage, helping to control the amount of CPU, memory, and other resources that each container can consume.

5.
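
To help interpret the comparison above, the following sketch shows what `temperature` and `top_p` do to a next-token distribution. It uses a made-up four-token logit vector and plain NumPy. It illustrates the general technique (temperature-scaled softmax followed by nucleus filtering) and is not the code path used inside the model's own generation routine; the function names `softmax_with_temperature` and `top_p_filter` are hypothetical.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]          # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    n_keep = min(int(np.searchsorted(cumulative, top_p)) + 1, len(probs))
    filtered = np.zeros_like(probs)
    filtered[order[:n_keep]] = probs[order[:n_keep]]
    return filtered / filtered.sum()


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

p_default = softmax_with_temperature(logits, temperature=0.7)
p_low = softmax_with_temperature(logits, temperature=0.2)
nucleus = top_p_filter(p_default, top_p=0.9)
greedy_token = int(np.argmax(logits))  # greedy decoding ignores temperature and top_p

print("temp=0.7:", np.round(p_default, 3))
print("temp=0.2:", np.round(p_low, 3))
print("temp=0.7 + top_p=0.9:", np.round(nucleus, 3))
print("greedy token index:", greedy_token)
```

Lowering the temperature concentrates probability on the top-scoring token, and `top_p` removes the low-probability tail before sampling. Greedy decoding always picks the highest-scoring token, which is why neither parameter changes its output.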

## Experiment with Context Length Handling (If Necessary)

In [25]:
# Cell E: Experiment with Context Length Handling

print("\n--- Experiment 3: Handling Context Length ---")

# Determine the model's practical token limit for the full prompt.
# This value comes from Cell 2 of the "LLM Investigation" notebook, or from team discussion.
# Qwen2 models often have large theoretical limits (32k+) but may have been fine-tuned
# on shorter sequences, or may use a practical sliding-window size (e.g., 4096, 8192).
# tokenizer.model_max_length is a reasonable starting point, but some tokenizers report a
# very large sentinel value when no limit is configured, so sanity-check the printed number.
PRACTICAL_TOKEN_LIMIT = getattr(tokenizer, 'model_max_length', 4096) # Falls back to 4096 if the attribute is missing
print(f"Using practical token limit for full prompt: {PRACTICAL_TOKEN_LIMIT}")

# Pick a query that tends to get long context OR use a dummy long context
query_for_long_context_test = sample_queries[0] # Use the first sample query
retrieved_context_long = get_related_docs(query_for_long_context_test, k=5) # Try to get more context by increasing k

# Let's use your best prompt template
best_prompt_template = "Based ONLY on the following context, answer the question. If the answer is not in the context, state that.\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:\n"

# Strategy 1: Use as is (might truncate in tokenizer if too long)
print("\n--- Strategy 1: Using Potentially Long Context As Is ---")
prompt_long_as_is = best_prompt_template.format(context=retrieved_context_long, query=query_for_long_context_test)
# Check token count for the prompt
token_ids_as_is = tokenizer.encode(prompt_long_as_is)
print(f"  Approx. token count for 'as is' prompt: {len(token_ids_as_is)}")
if len(token_ids_as_is) > PRACTICAL_TOKEN_LIMIT:
    print(f"  WARNING: Prompt may exceed model's practical token limit of {PRACTICAL_TOKEN_LIMIT}!")

answer_as_is = generate_llm_answer_experimental(model, tokenizer, prompt_long_as_is)
print(f"  Generated Answer (Long Context As Is):\n    {answer_as_is}")


# Strategy 2: Truncate the context string to fit (simple truncation)
print("\n--- Strategy 2: Truncating Context String ---")
prompt_parts_text = best_prompt_template.format(context="[CONTEXT_PLACEHOLDER]", query=query_for_long_context_test)
prompt_template_tokens = tokenizer.tokenize(prompt_parts_text.replace("[CONTEXT_PLACEHOLDER]", ""))
# Rough estimate, subtract tokens for prompt structure and desired answer length
max_context_tokens_allowed = PRACTICAL_TOKEN_LIMIT - len(prompt_template_tokens) - 100 # Buffer for answer
print(f"  Max context tokens allowed: {max_context_tokens_allowed}")

context_tokens_original = tokenizer.tokenize(retrieved_context_long)
print(f"  Original context token count: {len(context_tokens_original)}")

if len(context_tokens_original) > max_context_tokens_allowed:
    print("  Context too long, truncating...")
    truncated_tokens = context_tokens_original[:max_context_tokens_allowed]
    truncated_context_str = tokenizer.decode(tokenizer.convert_tokens_to_ids(truncated_tokens), skip_special_tokens=True)
else:
    print("  Context is within limits, no truncation needed for this strategy.")
    truncated_context_str = retrieved_context_long

prompt_truncated = best_prompt_template.format(context=truncated_context_str, query=query_for_long_context_test)
token_ids_truncated = tokenizer.encode(prompt_truncated)
print(f"  Approx. token count for truncated prompt: {len(token_ids_truncated)}")

answer_truncated = generate_llm_answer_experimental(model, tokenizer, prompt_truncated)
print(f"  Generated Answer (Truncated Context):\n    {answer_truncated}")

print("\n>>> MANUAL EVALUATION: Compare answers. Does truncation hurt? Is 'as is' better if it fits? <<<")


--- Experiment 3: Handling Context Length ---
Using practical token limit for full prompt: 131072




--- Strategy 1: Using Potentially Long Context As Is ---
  Approx. token count for 'as is' prompt: 4004
  Generated Answer (Long Context As Is):
    OpenSSH is considered secure due to several key factors:

1. **Encryption**: OpenSSH encrypts all information, including usernames and passwords, which ensures that data transmitted between the client and server is protected from unauthorized access.

2. **Authentication**: SSH uses public key cryptography, where the private key is kept secret and the public key is shared. This allows for secure authentication without transmitting the private key over the network, reducing the risk of unauthorized access.

3. **Multiplexing**: OpenSSH allows multiple sessions to be multiplexed over a single SSH connection, which can improve performance and security. This means that multiple applications can share the same secure connection, reducing the overall attack surface.

4. **Forward Secrecy**: OpenSSH supports forward secrecy, meaning that even if
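
Strategy 2 above cuts the context at an arbitrary token boundary, which can end a document mid-sentence. One alternative, which this notebook does not implement, is to drop whole retrieved documents from the end of the ranked list until the remainder fits the budget. The sketch below assumes the retrieved documents are available as a list of strings in ranked order (the notebook's `get_related_docs` returns a single formatted string, so this would need adapting). `fit_documents_to_budget` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a stand-in for a real tokenizer.

```python
def fit_documents_to_budget(docs, count_tokens, max_tokens):
    """Keep documents in ranked order until the next one would exceed max_tokens.

    docs: list of document strings, most relevant first.
    count_tokens: callable returning the token count of a string.
    Returns (kept_documents, tokens_used).
    """
    kept, used = [], 0
    for doc in docs:
        n_tokens = count_tokens(doc)
        if used + n_tokens > max_tokens:
            break  # stop at the first document that does not fit
        kept.append(doc)
        used += n_tokens
    return kept, used


def whitespace_token_count(text):
    """Stand-in counter for illustration. With a Hugging Face tokenizer one
    could use len(tokenizer.encode(text, add_special_tokens=False)) instead."""
    return len(text.split())


# Made-up documents, ordered as a retriever might rank them.
ranked_docs = [
    "OpenSSH encrypts all traffic between client and server.",
    "SSH uses public key cryptography for authentication.",
    "Docker uses cgroups and kernel namespaces for isolation.",
]

kept_docs, tokens_used = fit_documents_to_budget(
    ranked_docs, whitespace_token_count, max_tokens=16
)
print(f"Kept {len(kept_docs)} of {len(ranked_docs)} documents ({tokens_used} tokens)")
```

Because the retriever ranks documents by relevance, dropping from the end discards the least relevant text first and keeps every remaining document intact.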

# Summary of LLM Experiments

**1. Baseline Performance:**
   - [Your observations from Cell B regarding overall quality with default settings and retriever output]

**2. Prompt Engineering (Cell C):**
   - Best performing prompt style: [e.g., "Style 2 (Direct Instruction)"]
   - Observations: [e.g., "More direct prompts led to more focused answers. Role-playing prompts sometimes added unnecessary fluff."]

**3. LLM Generation Parameters (Cell D):**
   - Best parameter combination found: [e.g., temperature=0.3, do_sample=True, max_new_tokens=250]
   - Observations: [e.g., "Lower temperature reduced randomness and improved factual grounding. `do_sample=False` was too repetitive. Shorter `max_new_tokens` sometimes cut off answers."]

**4. Context Length Handling (Cell E):**
   - Observations: [e.g., "Simple truncation of long contexts often removed crucial information. Using fewer, more relevant documents (if retriever allows precise control) might be better. The current model seems to handle up to X tokens reasonably well."]

**Overall LLM Performance after Initial Experiments:**
   - [Summarize if the quality has improved significantly, or if major issues persist.]
   - [Are the "terrible" results now "less terrible" or "good" for some queries?]

