<a href="https://colab.research.google.com/github/nepomucenoc/EdgeRAG/blob/main/EdgeRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight RAG Demo with Simulated LoRA & TinyLLM Optimizations



In [1]:
import torch
from transformers import (
    RagTokenizer,
    RagRetriever,
    RagTokenForGeneration
)

In [2]:
!pip install datasets



In [3]:
!pip install faiss-cpu



In [4]:
# Load a pre-trained RAG model (for demonstration)
MODEL_NAME = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(MODEL_NAME)
retriever = RagRetriever.from_pretrained(MODEL_NAME, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(MODEL_NAME, retriever=retriever)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

The repository for wiki_dpr contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wiki_dpr.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
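
The next cell only stands in for LoRA with a no-op `apply_lora`. As a rough picture of what a real adapter would do, LoRA keeps a weight matrix `W` frozen and trains a low-rank update `B @ A` scaled by `alpha / r`, so the effective weight is `W + (alpha / r) * B @ A`. The sketch below works through that arithmetic in plain Python; the matrix sizes, values, and helper names are illustrative assumptions and are not taken from this notebook or from the PEFT API.

```python
# Minimal LoRA arithmetic sketch in plain Python (no PEFT, no torch).
# All sizes, values, and helper names here are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(lora_a)                  # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(lora_b, lora_a)   # shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return d_out * d_in, r * (d_out + d_in)

# Toy 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[0.5], [0.0]]        # shape (d_out=2, r=1)
A = [[0.0, 0.0, 2.0]]     # shape (r=1, d_in=3)
W_eff = lora_effective_weight(W, B, A, alpha=1.0)
print(W_eff)

full, adapter = lora_param_counts(d_out=1024, d_in=1024, r=8)
print(f"full fine-tuning: {full} parameters, rank-8 adapter: {adapter} parameters")
```

For a 1024 x 1024 layer, the rank-8 adapter trains 16,384 parameters instead of 1,048,576, which is the main reason LoRA is attractive on constrained hardware.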


In [5]:
# Simulated function to apply LoRA (Low-Rank Adaptation)
def apply_lora(model):
    # In a real case, you would use a library like PEFT to tune the model parameters.
    print("Applying LoRA fine-tuning... (simulation)")
    return model

# Dummy function to apply TinyLLM optimizations (quantization, pruning, etc.)
def apply_tinyllm_optimization(model):
    print("Applying TinyLLM optimization... (simulation)")
    # For simulation, we convert the model to half precision (float16)
    model.half()
    return model

# Function to simulate a security mechanism (Guardrails) that filters unwanted content
def security_guardrails(text):
    banned_keywords = ["hack", "illegal", "fraud"]
    for keyword in banned_keywords:
        if keyword in text.lower():
            return "Content blocked due to security policies."
    return text

# Generation pipeline using RAG
def generate_response(query):
    # Tokenize the query for the RAG model by calling the tokenizer directly
    # (prepare_seq2seq_batch is deprecated in recent transformers releases)
    input_dict = tokenizer(query, return_tensors="pt")
    generated_ids = model.generate(input_ids=input_dict["input_ids"])
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Apply safety guardrails
    response = security_guardrails(response)
    return response

def main():
    global model

    # Apply the simulated LoRA and TinyLLM optimizations
    model = apply_lora(model)
    model = apply_tinyllm_optimization(model)

    # Example query (can be adjusted as needed)
    query = "What is the latest technology in embedded AI systems?"
    print("Query:", query)

    response = generate_response(query)
    print("Response:", response)
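
The `apply_tinyllm_optimization` function above calls `model.half()`, which casts the parameters to float16. That halves the storage per parameter but also reduces precision, and it is not the same thing as quantization or pruning. The sketch below uses only the standard-library `struct` module (format code `e` is IEEE 754 half precision) to show both effects on a single number; the helper name `to_float16` is a hypothetical one introduced for illustration.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Storage per parameter in bytes.
bytes_fp32 = struct.calcsize("<f")
bytes_fp16 = struct.calcsize("<e")
print(f"float32: {bytes_fp32} bytes, float16: {bytes_fp16} bytes")

# Precision loss: 0.1 is not exactly representable and float16 rounds it more coarsely.
value = 0.1
rounded = to_float16(value)
print(f"original: {value!r}, as float16: {rounded!r}, error: {abs(value - rounded):.2e}")

# Values that fit exactly in float16 survive the round trip unchanged.
print(to_float16(1.0), to_float16(0.5))
```

Small rounding errors of this kind accumulate across layers, so it is worth comparing outputs before and after the cast when trying this on a real model.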

In [7]:
main()

Applying LoRA fine-tuning... (simulation)
Applying TinyLLM optimization... (simulation)
Query: What is the latest technology in embedded AI systems?




Response:  iscsi
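
The `security_guardrails` filter in this notebook matches substrings, so the keyword "hack" would also block harmless text such as "hackathon". A minimal sketch of a whole-word variant is shown below, assuming the same banned list; the function name `security_guardrails_whole_word` and the module-level constants are hypothetical and not part of the original notebook.

```python
import re

# Same banned list as the notebook's security_guardrails function.
BANNED_KEYWORDS = ["hack", "illegal", "fraud"]

# One compiled pattern; \b restricts matches to whole words, case-insensitively.
_BANNED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in BANNED_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def security_guardrails_whole_word(text):
    """Hypothetical variant of security_guardrails that matches whole words only."""
    if _BANNED_PATTERN.search(text):
        return "Content blocked due to security policies."
    return text

print(security_guardrails_whole_word("Join our hackathon this weekend."))
print(security_guardrails_whole_word("How do I hack a router?"))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "hacking" or "fraudulent", so the keyword list would need to include them explicitly if they should be blocked.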
