# RAG Implementation with DeepSeek-R1-Distill-Llama-8B

In [None]:
print("Author : Hamza Amin")
print("Date : 02/05/2025")

Author : Hamza Amin
Date : 02/05/2025


In [None]:
"""
RAG Implementation using Original DeepSeekR1 and FAISS
Following the Fine-tuning Section
"""

print("\n--- Starting RAG Implementation ---")


--- Starting RAG Implementation ---


In [None]:
# Install the libraries needed for RAG (faiss-cpu backs the LangChain FAISS vector store)
# %%capture
!pip install langchain-community langchain datasets faiss-cpu sentence-transformers

In [None]:
# %%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
# imports for RAG
from sentence_transformers import SentenceTransformer
import numpy as np
from datasets import load_dataset
# Current import paths; the older langchain.vectorstores / langchain.schema paths are deprecated
from langchain_community.vectorstores import FAISS as LangChainFAISS
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline # To wrap our model
from transformers import pipeline
import gc # Garbage collector

In [None]:
# For loading DeepSeek; Unsloth asks to be imported before the other ML libraries

from unsloth import FastLanguageModel
import torch
import wandb
from huggingface_hub import login


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True  # enables 4 bit quantization
hugging_face_token = "[INPUT YOUR HF TOKEN]"

model , tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8b",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hugging_face_token
)

print("Ensuring base model is in inference mode...")
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Ensuring base model is in inference mode...


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((409

In [None]:
print("Loading and preparing data for RAG...")
rag_dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True)

Loading and preparing data for RAG...


README.md:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/58.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/19704 [00:00<?, ? examples/s]

In [None]:
# Combine Question and Response into documents for retrieval
rag_documents = []
for example in rag_dataset:
    text_chunk = f"Question: {example['Question']}\nAnswer: {example['Response']}"
    # We store the original question as metadata, which can be useful sometimes
    metadata = {"original_question": example['Question']}
    rag_documents.append(Document(page_content=text_chunk, metadata=metadata))

print(f"Created {len(rag_documents)} documents for RAG.")
print("Example document for RAG:")
print(rag_documents[0].page_content)
print("-" * 20)
print(rag_documents[0].metadata)
print("-" * 20)

Created 500 documents for RAG.
Example document for RAG:
Question: Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?
Answer: The specific cardiac abnormality most likely to be found in this scenario is a patent foramen ovale (PFO). This condition could allow a blood clot from the venous system, such as one from a deep vein thrombosis in the leg, to bypass the lungs and pass directly into the arterial circulation. This can occur when the clot moves from the right atrium to the left atrium through the PFO. Once in the arterial system, the clot can travel to the brain, potentially causing an embolic stroke, which would explain the sudden weakness in the left arm and leg. The connection between the recent travel, which increases the risk of deep vein thrombosis, and the neuro

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

print("Initializing embedding model via LangChain...")
embedding_model_name = 'all-MiniLM-L6-v2'

embedder = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda'},  # Ensure embeddings are generated on GPU
    encode_kwargs={'normalize_embeddings': True} # Often beneficial for cosine similarity
)
print(f"Using LangChain embedding wrapper for: {embedding_model_name}")

# Check if it has the required method
if hasattr(embedder, 'embed_documents'):
    print("Embedder object has 'embed_documents' method. Good to go!")
else:
    print("WARNING: Embedder object *still* lacks 'embed_documents'. Check imports/versions.")

Initializing embedding model via LangChain...


Using LangChain embedding wrapper for: all-MiniLM-L6-v2
Embedder object has 'embed_documents' method. Good to go!
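
The cell above requests unit-length embeddings via `normalize_embeddings=True`. A minimal sketch of why that matters, assuming the vector store ranks by L2 distance (which I believe is the LangChain FAISS default): for unit vectors, squared L2 distance and cosine similarity are tied by a fixed identity, so both give the same ranking. The random vectors are hypothetical stand-ins for 384-dimensional `all-MiniLM-L6-v2` embeddings.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (the effect normalize_embeddings=True asks for)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two sentence embeddings (not real model output).
a, b = l2_normalize(rng.normal(size=(2, 384)))

cosine = float(a @ b)                      # dot product of unit vectors = cosine similarity
squared_l2 = float(np.sum((a - b) ** 2))   # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.isclose(squared_l2, 2.0 - 2.0 * cosine))
print(identity_holds)
```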


In [None]:
print("Creating FAISS vector store...")
vector_store = LangChainFAISS.from_documents(rag_documents, embedder)
print("FAISS vector store created successfully.")

Creating FAISS vector store...
FAISS vector store created successfully.
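
Conceptually, a flat (exhaustive) index compares the query against every stored vector and keeps the closest ones. The sketch below reproduces that idea with NumPy on toy 2-D vectors; `top_k_l2` and the toy data are hypothetical names for illustration, not part of FAISS or LangChain, and a real index is far more optimized.

```python
import numpy as np

def top_k_l2(index_vectors: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (row indices, squared L2 distances) of the k rows closest to query."""
    diffs = index_vectors - query
    dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared distance
    order = np.argsort(dists)[:k]                # smallest distance first
    return order, dists[order]

# Toy 2-D "embeddings" standing in for the 500 document vectors.
toy_index = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
toy_query = np.array([0.9, 0.1])

idx, dist = top_k_l2(toy_index, toy_query, k=2)
print(idx, dist)
```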


In [None]:
print("Setting up retriever...")
retriever = vector_store.as_retriever(search_kwargs={'k': 3})

# Test the retriever (invoke replaces the deprecated get_relevant_documents)
test_query = "Symptoms of myocardial infarction"
retrieved_docs = retriever.invoke(test_query)
print(f"\n--- Testing Retriever with query: '{test_query}' ---")
for i, doc in enumerate(retrieved_docs):
    print(f"Retrieved Doc {i+1}:\n{doc.page_content}\n---")

Setting up retriever...

--- Testing Retriever with query: 'Symptoms of myocardial infarction' ---
Retrieved Doc 1:
Question: A 58-year-old man with no prior cardiac history presents to the emergency department with symptoms of retrosternal chest pain starting at rest and lasting 30 minutes. The pain radiates to the left arm and is associated with diaphoresis and dyspnea.On physical examination, his blood pressure is 150/90 mm Hg, pulse is 100/min, the heart sounds are normal, and the lungs are clear to auscultation. Which of the following is the next most appropriate investigation?
A. CT scan - chest
B. CXR
C. cardiac troponin
D. ECG
Answer: The next most appropriate investigation for this patient, given the classic presentation of symptoms that suggest an acute coronary syndrome, is an ECG (D). An ECG provides immediate insight into any cardiac issues, such as whether there is a heart attack occurring, by showing abnormalities in the heart's electrical activity due to compromised blo



In [None]:
rag_prompt_template = """You are a helpful medical assistant. Answer the following question based *only* on the context provided below.
If the answer is not found in the context, state clearly "The answer is not available in the provided context." Do not use any prior knowledge.

Context:
{context}

Question:
{question}

Answer:"""

rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
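
To see what the model actually receives, the template can be filled by hand. This is a plain-Python sketch using `str.format` on a shortened copy of the template (`TEMPLATE` and `fill_prompt` are hypothetical names), not the `ChatPromptTemplate` machinery itself. Note that because each retrieved document already contains its own "Answer:" line, the marker appears more than once in the final prompt.

```python
# Shortened stand-in for rag_prompt_template; same placeholders, fewer instructions.
TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)

def fill_prompt(context: str, question: str) -> str:
    """Substitute retrieved context and the user question into the template."""
    return TEMPLATE.format(context=context, question=question)

# A toy retrieved document in the same "Question/Answer" layout as rag_documents.
toy_context = "Question: What does an ECG record?\nAnswer: The heart's electrical activity."
filled = fill_prompt(toy_context, "Which test comes first for chest pain?")

print(filled)
print("Occurrences of 'Answer:':", filled.count("Answer:"))
```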

In [None]:
# create HuggingFacePipeline for LangChain integration

text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512, # Reduced tokens for RAG answer generation
    do_sample=False # Greedy decoding for deterministic, factual answers (instead of temperature=0.0)
)
llm_pipeline = HuggingFacePipeline(pipeline=text_generation_pipeline)


Device set to use cuda:0
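
The pipeline cell aims for near-deterministic, factual output. A small sketch of the underlying idea, on made-up logits: dividing logits by a lower temperature concentrates probability on the top token, and in the limit this is the same as always picking the argmax (greedy decoding). `softmax_with_temperature` is a hypothetical helper, not a transformers function.

```python
import numpy as np

def softmax_with_temperature(logits, temperature: float) -> np.ndarray:
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()   # subtract the max to avoid overflow in exp
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5]          # made-up scores for three candidate tokens
probs_by_temperature = {
    t: softmax_with_temperature(toy_logits, t) for t in (1.0, 0.5, 0.1)
}
for t, probs in probs_by_temperature.items():
    print(f"T={t}: {probs.round(3)}")

greedy_token = int(np.argmax(toy_logits))  # what decoding without sampling selects
```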


In [None]:
# create RAG Chain using LangChain Expression Language (LCEL)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm_pipeline
    | StrOutputParser()
)
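
The `|` operators above compose the steps retrieve, format, prompt, generate, and parse. As a rough mental model only, the same data flow can be written as ordinary function calls. Every name below (`make_rag_chain` and the `toy_*` stubs) is hypothetical and stands in for the real retriever, prompt, and model.

```python
from typing import Callable, List

def make_rag_chain(
    retrieve: Callable[[str], List[str]],
    format_docs: Callable[[List[str]], str],
    build_prompt: Callable[[str, str], str],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Wire the four stages into one callable, mirroring the LCEL data flow."""
    def run(question: str) -> str:
        docs = retrieve(question)                 # "context": retriever ...
        context = format_docs(docs)               # ... | format_docs
        prompt = build_prompt(context, question)  # | rag_prompt
        return generate(prompt)                   # | llm_pipeline | StrOutputParser()
    return run

# Hypothetical stubs so the sketch runs without a model or vector store.
def toy_retrieve(question: str) -> List[str]:
    return ["doc A", "doc B"]

def toy_format(docs: List[str]) -> str:
    return "\n\n".join(docs)

def toy_prompt(context: str, question: str) -> str:
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

def toy_generate(prompt: str) -> str:
    return prompt + " (model output would go here)"

toy_chain = make_rag_chain(toy_retrieve, toy_format, toy_prompt, toy_generate)
result = toy_chain("What is a PFO?")
print(result)
```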


In [None]:
# Function to Query the RAG System
def query_rag(question):
    print(f"\n--- Querying RAG with: '{question}' ---")
    gc.collect()
    torch.cuda.empty_cache()
    try:
        response = rag_chain.invoke(question)
        # The pipeline might include the input prompt in the output, let's clean it
        # Find the start of the actual answer based on the template
        answer_marker = "Answer:"
        answer_start_index = response.rfind(answer_marker)
        if answer_start_index != -1:
            cleaned_response = response[answer_start_index + len(answer_marker):].strip()
        else:
            # Fallback if the marker isn't found exactly as expected
            cleaned_response = response.strip()

        print(f"\nRAG Response:\n{cleaned_response}")
        return cleaned_response
    except Exception as e:
        print(f"An error occurred during RAG invocation: {e}")
        return None
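
The cleanup step can be isolated and tested without a model. The sketch below is one possible variant, not the notebook's method: it assumes the model may wrap its reasoning in `<think>...</think>` tags (as DeepSeek-R1 style models often do) and removes such blocks before taking the text after the last "Answer:" marker. `extract_answer` is a hypothetical helper name.

```python
import re

# Assumption: reasoning, if present, is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_answer(raw: str, marker: str = "Answer:") -> str:
    """Drop reasoning blocks, then return the text after the last marker."""
    without_reasoning = THINK_BLOCK.sub("", raw)
    # rpartition returns ("", "", text) when the marker is absent,
    # so `tail` is the whole text in that case.
    _, _, tail = without_reasoning.rpartition(marker)
    return tail.strip()

toy_output = (
    "Context:\nQuestion: q1\nAnswer: a1\n\n"
    "Question:\nq2\n\n"
    "Answer:<think>Maybe the Answer: is in doc 1.</think> Final text."
)
print(extract_answer(toy_output))
```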

In [None]:
# Test RAG with the same question used before
comparison_question = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing
but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely
reveal about her residual volume and detrusor contractions?"""

rag_answer = query_rag(comparison_question)


print("\n--- RAG Implementation Complete ---")


--- Querying RAG with: 'A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing
but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely
reveal about her residual volume and detrusor contractions?' ---

RAG Response:
Cystometry in this case of stress urinary incontinence would most likely reveal a normal post-void residual volume, as stress incontinence typically does not involve issues with bladder emptying. Additionally, since stress urinary incontinence is primarily related to physical exertion and not an overactive bladder, you would not expect to see any involuntary detrusor contractions during the test.

I need to find the answer to the question based on the provided context. Let me look through the context again.

The question is about a 61-year-old woman with a history of involuntary urine loss during activities like coughing or sneezing, no leakage at