# Retrieval Augmented Generation Playground
> This notebook is a playground for vectorizing documents and using a RAG architecture to improve the responses and capabilities of an AI (LLM) for a specific purpose. You can run the cells to load and vectorize your own documents, then test how chunking and retrieval parameters change the LLM's responses for your own context and subject matter.

## Load your model and set its generation parameters

In [1]:
from amas_manager_tools import *
from datasets import load_dataset
import pandas as pd
from langchain.schema import Document

from gradio_assistant_apps import *
import os
# Load an LLM model and set the generation parameters

# let's start with a small one
home_dir = os.path.expanduser("~")
# model_path = "/home/gerald/shared_space/models/MODELS/meta_llama_Llama_3p2_1B_Instruct/"
model_path = f"{home_dir}/shared_space/models/MODELS/meta_llama_Llama_3p2_1B_Instruct/"
LLM_LIST = [
    f"{home_dir}/shared_space/models/MODELS/Llama_3p1_8B_Instruct/",
    f"{home_dir}/shared_space/models/MODELS/Llama_3p2_3B_Instruct/",
    f"{home_dir}/shared_space/models/MODELS/meta_llama_Llama_3p2_1B_Instruct",
]


amas_assistant = AMAS_RAG_Assistant(
    model_path=model_path,
    # hf_login=True, creds_json=creds_json,
    hf_login=False, creds_json=None,
    max_tokens=4000, max_new_tokens=4000,
    verbose=True,
    # device_map=0,
    device_map="most_free",
)

2025-07-01 21:36:57.706827: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from .autonotebook import tqdm as notebook_tqdm


My home directory: /home/gerald
		--->Try count: 0
GRADIO_CACHE for gr.Image is: /home/gerald/gradio_cache
GRADIO_CACHE for gr.Image is: /home/gerald/gradio_cache
Second load model: most_free
thing: 0
gpu int: 0
allocated_mem: 512
reserved_mem: 2097152
total_mem: 47697362944
free_mem: 47695265280
thing: 1
gpu int: 1
allocated_mem: 0
reserved_mem: 0
total_mem: 47697362944
free_mem: 47697362944




			Assigning model to GPU 1 with 44.42 GB free memory.




Attempting to load model @: '/home/gerald/shared_space/models/MODELS/meta_llama_Llama_3p2_1B_Instruct/'


Device set to use cuda:1


Pipeline Callable Parameters: odict_keys(['text_inputs', 'kwargs'])
Pipeline Model Configuration: LlamaConfig {
  "_attn_implementation_autoset": true,
  "architectures": [
    "LlamaModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": true,
  "vocab_

## Load a document, vectorize and store it for later retrieval

> In the file view, navigate to your data folder (RAG_Workshop_Repo/data) and download a PDF you want to test with RAG.
> Then, in the cell below, type the path "../data/" and press Tab to see what's in there. When you have highlighted the file you wish to vectorize, press Enter.
>
> After you have set the path to the PDF you want to use, you can select one of the embedding models I have stored in the shared space. To try a different one, uncomment the line that sets that embedding model.

In [2]:
print(amas_assistant.device_map)
print(amas_assistant.assigned_gpu)

{'': 'cuda:1'}
{'': 'cuda:1'}


In [3]:
# now try to use the knexus manager to create and load a knexus

pdf_file = [
    # "../../data/knowledge_docs/FIST_3-16_06-2020_Maintenace of Power circuit breakers BOR.pdf",
    "../data/CUI_SPEC.pdf"
]

# the below line will get the path to your home directory
my_home_directory = os.path.expanduser("~")
print("My home: ", my_home_directory)
# embedding_model_name = my_home_directory + "/shared_space/models/MODELS/EMBEDING_MODELS/thenlper_gte_large/"

# only the last uncommented assignment takes effect, so leave exactly one active
embedding_model_name = my_home_directory + "/shared_space/models/MODELS/EMBEDING_MODELS/thenlper_gte_base/"
# embedding_model_name = "../models/MODELS/EMBEDING_MODELS/sentence_transformers_multi_qa_mpnet_base_dot_v1/" # decent
# embedding_model_name = "../models/MODELS/EMBEDING_MODELS/sentence_transformers_all_MiniLM_L6_v2/" # decent
documents, vector_store, embeddings_tool = amas_assistant.knexus_mngr.process_pdfs_to_vector_store_and_embeddings(
    pdf_paths=pdf_file, 
    embedding_model_name=embedding_model_name, 
    chunk_size=50000, chunk_overlap=200
)

My home:  /home/gerald
loading embedding model: /home/gerald/shared_space/models/MODELS/EMBEDING_MODELS/thenlper_gte_base/
✅ Using previously saved wrapped model at: /home/gerald/shared_space/models/MODELS/EMBEDING_MODELS/thenlper_gte_base/
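
> The `chunk_size` and `chunk_overlap` arguments in the call above control how each page is sliced before it is embedded. The sketch below is a minimal illustration of the idea using fixed-width character windows. It is an assumption about the general technique, not the splitter this notebook uses internally, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page = "abcdefghijklmnopqrstuvwxyz"  # stand-in for the text of one page

# Small windows: each chunk repeats the last 3 characters of the previous one.
small_chunks = chunk_text(page, chunk_size=10, chunk_overlap=3)
print(small_chunks)

# A chunk_size far larger than the page leaves the page as a single chunk.
one_chunk = chunk_text(page, chunk_size=50000, chunk_overlap=200)
print(len(one_chunk))
```

> With `chunk_size=50000`, as in the cell above, a page is shorter than one window, so each page stays whole. Lowering the value produces more, smaller chunks for retrieval to choose from.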


## Take a look at the embeddings
> You have now embedded the document into a set of vectors and stored them in a vector store. The embedding model uses its learned understanding of text, context, and semantic meaning to encode the contents of each page or chunk of your document as numbers. Below you can see what the embeddings actually are: a tensor, that is, a vector of vectors of floating-point numbers (these are not tokens). They are the encoded version of the text you just vectorized. When you send a query to the vector store tool (FAISS), the embedding model encodes the query text into a new vector of the same kind, and the Euclidean distance between that vector and each stored vector is calculated. The chunks with the smallest distances are returned, and the scores you see are those distances.

In [4]:
print(f"\t\tEmbeddings:\n{amas_assistant.knexus_mngr.base_embeddings}")

		Embeddings:
tensor([[ 0.1124,  0.4051, -0.4798,  ...,  0.4000,  0.0477,  0.4578],
        [ 0.1495,  0.0683, -0.5740,  ...,  0.1556,  0.5193,  0.2238],
        [-0.0180,  0.1775, -0.2900,  ...,  0.2379, -0.1973,  0.3429],
        ...,
        [ 0.1300,  0.6325, -0.4012,  ...,  0.1829, -0.0716,  0.4890],
        [ 0.1034,  0.3938, -0.3440,  ...,  0.1189, -0.1217,  0.2520],
        [ 0.1643,  0.3946, -0.3233,  ...,  0.3264, -0.2431,  0.4446]])
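
> To make the distance calculation described above concrete, here is a minimal pure-Python sketch of nearest-neighbour retrieval by Euclidean (L2) distance. The toy vectors, the function names, and the `max_distance` cutoff are illustrative assumptions. FAISS does the same kind of comparison far more efficiently, and this is not its API.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, stored_vecs, k, max_distance=None):
    """Return up to k (index, distance) pairs, closest first."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(stored_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance = most similar
    if max_distance is not None:
        scored = [pair for pair in scored if pair[1] <= max_distance]
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
query_vec = [0.0, 0.5]

top2 = nearest_chunks(query_vec, stored, k=2)
print(top2)

# Adding a cutoff drops matches that are too far away, even if k allows more.
close_only = nearest_chunks(query_vec, stored, k=3, max_distance=1.0)
print(close_only)
```

> Some vector indexes report the squared distance instead, which changes the numbers but not the ranking.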


## Take a look at the stored documents
> Take a look at the first few documents that have been vectorized. You may notice that some of the text is a little off. We can fix this later: for example, you can have an LLM go through the text, correct the spelling, and then re-vectorize the data.

In [5]:
print("There are ", len(documents), "vectorized and stored documents.")

# set the number of documents you want to see printed out
num_to_view = min(1,len(documents)) 
for i,d in enumerate(documents[:num_to_view]):
    print(f"\n\n\t\tDocument {i}")
    print(d.page_content)
    print(">>>>>>>>>>>>>>>>\n")

There are  28 vectorized and stored documents.


		Document 0
AVAILABLE ONLINE AT: INITIATED BY: 
www.directives.doe.gov Office of the Chief Information Officer 
U.S. Department of Energy ORDE R 
Washington, DC 
 Approved: 2-3-2022 
SUBJECT: CONTROLLED UNCLASSIFIED INFORMATION 
1. PURPOSE. Establish the Department of Energy’s (DOE) Controlled Unclassified
Information (CUI) Program and document a policy for designating and handling
information that qualifies as CUI. The CUI Program standardizes the way DOE handles
information that requires protection under laws, regulations, or Government-wide
policies (LRGWP), but that does not qualify as classified under Executive Order (EO)
13526, Classified National Security Information, 12-29-2009 (3 Code of Federal
Regulations (CFR), 2010 Comp., p. 298-327), or any predecessor or successor order, or
the Atomic Energy Act of 1954 (42 U.S.C. 2011, et seq.), as amended. This Directive
implements the requirements in EO 13556, Controlled Unclassified 

## Test Queries
> Below, if you wish, you can query your vector store to see how retrieval works and what the scores look like. The scores returned are L2 (Euclidean) distances between the embedding vector the model generates for your query text and the vectors stored in the vector store. The smaller the value, the more similar the returned document is to your query.

In [7]:
####################################################
# CUI_SPEC.pdf
####################################################
# enter some text or a question to see if what you get back is related...
query = (
    # "What is CUI Specified?\n"
    "tell me about When sending CUI via email to accounts outside of Federal IT"
)

k=2 # the number of the top most similar documents to return
min_score=200.0 # the minimum value for the similarity score
docs, scores = amas_assistant.knexus_mngr.query_similarity_search(
                                query=query, 
                                k=k, 
                                min_score=min_score, 
                                # verbose=True,
                                verbose=False,  # set to true to see more of what is going on behind the scenes
                               )

print(f"Query:\n{query}\n>>>>>>>>>>>\n\n")
for i,d in enumerate(docs):
    print(f"\t\t\tobject: {i}")
    print(d.page_content)
    print("\n>>>>>>>>>>>>>>>>\n")

Query:
tell me about When sending CUI via email to accounts outside of Federal IT
>>>>>>>>>>>


			object: 0
5 DOE O 471.7 
2-3-2022 
(FOIA), and Mandatory Declassification Reviews (MDR) procedures, of 
CUI must be approved by the Departmental Element Designated CUI 
Official. 
(4) When transmitting CUI by mail, it can be shipped through interagency
mail systems, United States Postal Service (USPS), or other commercial
delivery services, consistent with other provisions of law. In-transit
automated tracking and accountability tools can be used to record the
location of the CUI Specified. Furthermore, while CUI documents and
matter being mailed must be properly labeled, the wrapping or package
containing the CUI must not indicate the presence of CUI. The wrapping
or package should indicate "Open by Addressee Only" to ensure it is only
opened by the intended recipient.
(5) When sending CUI via email to accounts outside of Federal IT systems the
CUI must be in an attachment and protected 
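
> Once the most similar chunks come back from the vector store, a RAG pipeline places them in the prompt ahead of the user's question so the LLM can ground its answer in them. The sketch below shows one simple way to do that. The prompt wording, the `build_rag_prompt` name, and the `max_chars` budget are all hypothetical, since the template this assistant actually uses is not shown in the notebook.

```python
def build_rag_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Place retrieved chunks ahead of the question in a single prompt string."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[Source {i}]\n{chunk.strip()}"
        if used + len(block) > max_chars:
            break  # stop before the context budget is exceeded
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts) if context_parts else "(no documents retrieved)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Text copied from the retrieved chunk printed above (it is cut off there too).
retrieved = [
    "(5) When sending CUI via email to accounts outside of Federal IT systems the\n"
    "CUI must be in an attachment and protected",
]
prompt = build_rag_prompt("How should CUI be emailed outside Federal IT systems?", retrieved)
print(prompt)
```

> With RAG turned off, the prompt would contain only the question, so the model has to rely on its general knowledge.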

## Start a Gradio UI & test how the RAG architecture improves responses

### Add your assistant bot to the Gradio UI tool and create the app

In [8]:
gradio_assistant_app = RAGWorkshopUI(assistant_bot=amas_assistant, 
                                     embedding_model_name=embedding_model_name)

gradio_assistant_app.create_app_tabs()

## Start your app and start testing...

> **Note**: Since we are all running this on the same system, and the notebooks don't like it when you try to start these locally, each of us needs to set a unique port. Let's discuss which ports we are going to use so that we do not clash.
>
> If all goes well, you should see the app start up below. For a better view, you can click one of the links to open it in a window.
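
> If you are unsure whether a port is already taken, a quick check like the one below can help before you pick a value for `server_port`. This is a minimal sketch using only the standard library. The helper names and the 7860-7899 range are assumptions for illustration, and the check is not a guarantee, because someone else can still grab the port between the check and your launch.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
    except OSError:
        return False  # port in use, or binding is not permitted
    return True


def first_free_port(candidates: range, host: str = "127.0.0.1") -> Optional[int]:
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


chosen_port = first_free_port(range(7860, 7900))
print(chosen_port)
```

> You could then pass the printed value as `server_port` when launching the app.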

### How the app works:

> The left side is a control panel that lets you adjust some parameters and turn the RAG part on or off.
>
> * **LLM**: lists some of the local models we have available in the shared space.
> * **RAG Mode**: turns the RAG architecture on or off. On will try to use whatever document you have loaded; off will use the LLM's general knowledge. The app starts in 'Rag Disabled' mode so you can see what happens when you ask questions that are very specific to the document, and then see how the responses change after you enable RAG.
>
> * **RAG & Generation Parameters**
>   * **chunk_size**: slices your documents into larger or smaller chunks based on this value. If the value is larger than any single page, it is ignored.
>   * **chunk_overlap**: if set to a non-zero value, consecutive chunks share this much overlapping content.
>   * **K**: the maximum number of most similar documents returned in the retrieval step.
>   * **min_similiary**: can be used to filter out results below this "similarity" threshold, if desired.
>   * **Temperature**: controls how 'creative' the LLM is when generating responses. Higher temperatures mean more creative generation.
>   * **Top k**: controls how many of the most likely next tokens the model considers when generating each token in its response. For example, if k = 50, the model looks at only the 50 tokens with the highest probability scores and chooses from that subset, ignoring the rest.
>     * **Smaller k** → the model is more deterministic and repetitive, because it picks only from a very narrow set of high-probability options.
>     * **Larger k** → the model has more possible next tokens to choose from, which increases randomness and creativity but risks less coherent text if k is too large.
>   * **Top p**: top-p (also called nucleus sampling) controls diversity in a different way: instead of picking a fixed number of tokens, it picks the smallest possible set of top tokens whose cumulative probability adds up to p. For example, with p = 0.9, the model chooses from the smallest set of tokens whose combined probability is at least 90%.
>     * **Smaller p** → the model is more conservative and predictable, focusing on the highest-confidence choices.
>     * **Larger p** → the model allows more low-probability tokens in, increasing creativity and variety in responses.
>
>     Unlike top-k, top-p adapts to the shape of the probability distribution at each step.
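
> The three generation settings above can be sketched in a few lines. The example below is a simplified illustration over a toy vocabulary, not the code the model library runs. The function names and the numbers are made up for the example.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_top_p_filter(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize to sum to 1


# Temperature: the same logits give a sharper or flatter distribution.
logits = {"the": 2.0, "a": 1.0, "cat": 0.0}
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold["the"], hot["the"])

# Top-k and top-p: which tokens remain eligible for sampling.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
narrow = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
wide = top_k_top_p_filter(probs, top_k=3, top_p=0.9)
print(list(narrow), list(wide))
```

> The next token is then sampled from whatever remains after filtering, so a smaller k or p leaves fewer candidates and makes the output more predictable.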

> **Stopping the app safely**
> * If the app is not stopped safely, its port can stay blocked for a little while and be unusable. To stop the app, press the stop button on the top panel and re-run the cell below. You can also run the cell directly below this one; it will complain, but you can ignore that. If restarting the app fails, you can always restart the kernel (see the restart kernel button to the right of the stop button) and re-run all the cells above to get everything going again.

In [11]:
gradio_assistant_app.launch_app(share=True, debug=False, server_port=7871, server_name=None, system_directive=None, 
                   save_context=False, use_system_role=False)

* Running on local URL:  http://127.0.0.1:7871
* Running on public URL: https://b3d7166a1dffc2d20d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




do sample: False


Non-RAG mode!!




			inside: response_generation_process
------------------------


🔹 Using remote model inference, filtering invalid arguments...
updating conversation
do sample: False


Full RAG mode!!


Query in side knexus_gen:
Now use any documentation to tell me about When sending CUI via email to accounts outside of Federal IT

returned results:
 [(Document(metadata={'source': 'CUI_SPEC.pdf', 'page': 4}, page_content='5 DOE O 471.7 \n2-3-2022 \n(FOIA), and Mandatory Declassification Reviews (MDR) procedures, of \nCUI must be approved by the Departmental Element Designated CUI \nOfficial. \n(4) When transmitting CUI by mail, it can be shipped through interagency\nmail systems, United States Postal Service (USPS), or other commercial\ndelivery services, consistent with other provisions of law. In-transit\nautomated tracking and accountability tools can be used to record the\nlocation of the CUI Specified. Furthermore, while CUI documents and\nmatter being maile

In [9]:
gradio_assistant_app.kill_app()

# GPU-Check

> Below is a command that shows what is going on with the GPUs: how much memory is available, how much is used, and which processes are using it. This helps you anticipate when the GPUs may run out of memory, which causes the LLMs to fail since they rely on GPU memory to run.

In [7]:
!nvidia-smi

Mon Jun 23 18:40:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L40S                    Off |   00000000:61:00.0 Off |                    0 |
| N/A   34C    P0             85W /  350W |   27822MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00
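
> The model-loading logs in this notebook print `free_mem` for each GPU in bytes and then assign the model to the GPU with the most free memory. The sketch below reproduces that choice from the logged numbers. It is an assumption about what `device_map="most_free"` does, based only on those log lines, and `pick_most_free_gpu` is a hypothetical helper.

```python
def pick_most_free_gpu(free_bytes_by_gpu: dict[int, int]) -> tuple[int, float]:
    """Return (gpu_index, free_gib) for the GPU with the most free memory."""
    if not free_bytes_by_gpu:
        raise ValueError("no GPUs reported")
    gpu = max(free_bytes_by_gpu, key=free_bytes_by_gpu.get)
    return gpu, free_bytes_by_gpu[gpu] / 1024**3  # bytes -> GiB


# free_mem values copied from the first load log in this notebook (bytes).
free_mem = {0: 47695265280, 1: 47697362944}
gpu, free_gib = pick_most_free_gpu(free_mem)
print(f"GPU {gpu} with {free_gib:.2f} GB free")
```

> Dividing by 1024**3 reproduces the 44.42 figure in the log, so the "GB" printed there is counted in units of 1024**3 bytes.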

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


thing: 0
gpu int: 0
allocated_mem: 447589888
reserved_mem: 494927872
total_mem: 47697362944
free_mem: 46754845184
thing: 1
gpu int: 1
allocated_mem: 0
reserved_mem: 0
total_mem: 47697362944
free_mem: 47697362944




			Assigning model to GPU 1 with 44.42 GB free memory.




Attempting to load remote version of '/home/gerald/shared_space/models/MODELS/Llama_3p2_3B_Instruct/'


Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  7.01it/s]
Device set to use cpu


Pipeline Callable Parameters: odict_keys(['text_inputs', 'kwargs'])
Pipeline Model Configuration: LlamaConfig {
  "_attn_implementation_autoset": true,
  "architectures": [
    "LlamaModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.0",
  "use_cache": true,
  "vocab