<a href="https://colab.research.google.com/github/mouhammadesp/GestionContacts/blob/main/DeepSeek_RAG/RAG_DeepSeek_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive RAG with DeepSeek and LangChain

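This notebook builds a simple RAG (Retrieval-Augmented Generation) pipeline with LangChain and a DeepSeek model from Hugging Face. The model family is described on the [`deepseek-ai/DeepSeek-R1`](https://huggingface.co/deepseek-ai/DeepSeek-R1) model card; the code below loads the smaller distilled checkpoint `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`.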


**RAG process**

A RAG (Retrieval-Augmented Generation) system combines a retriever with an LLM. It first retrieves the most relevant document chunks from a corpus using a vector database, then passes them to an LLM downloaded from Hugging Face, which generates an answer based on the retrieved text.
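
The two stages can be illustrated with a toy sketch that needs no external libraries. It is not the pipeline built in this notebook: the mini-corpus, the word-overlap score (a crude stand-in for embedding similarity), and all function names are assumptions made for illustration, and the LLM call itself is left out.

```python
# Toy illustration of the retrieve-then-generate flow (not the notebook's actual pipeline).
from collections import Counter

# Hypothetical mini-corpus standing in for the PDF chunks.
toy_corpus = [
    "A write blocker prevents changes to the evidence drive during imaging.",
    "FAISS performs similarity search over dense vectors.",
    "Audit logs record system events for later review.",
]

def overlap_score(query: str, text: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    query_words = Counter(query.lower().split())
    text_words = Counter(text.lower().split())
    return sum((query_words & text_words).values())

def toy_retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus entries that share the most words with the query."""
    ranked = sorted(toy_corpus, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble the text an LLM would receive: retrieved context, then the question."""
    context_block = "\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}"

toy_query = "What does a write blocker do during imaging?"
toy_prompt = build_prompt(toy_query, toy_retrieve(toy_query, k=1))
print(toy_prompt)
```

In the real pipeline, the word-overlap score is replaced by vector similarity over embeddings, and the assembled prompt is sent to the language model.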


# Prepare Workspace

In [1]:
!pip install -q torch transformers sentence-transformers faiss-cpu pypdf &> /dev/null

In [2]:
!pip install -U langchain-huggingface &>/dev/null

In [3]:
!pip install -q langchain langchain-community &> /dev/null

In [4]:
# Integrations (loaders, vector stores) live in langchain-community;
# core abstractions (documents, prompts, runnables) live in langchain-core.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


## Upload the data


In [5]:
# Mount Google Drive first so the PDF directory below is reachable
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
# Load content from multiple PDFs in a directory
pdf_directory = "/content/drive/MyDrive/Dataset-Nist/"  # Replace with your directory path
loader = DirectoryLoader(pdf_directory, glob="*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# PyPDFLoader yields one Document per PDF page, so this counts pages, not files
print(f"Loaded {len(docs)} documents")

Loaded 796 documents


In [None]:
# Example of LangChain's Document structure: text plus metadata (this object is not used later)
Document(page_content="DeepSeek-R1:Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.",
         metadata={
             'document_id' : '2501.12948v1',
             'document_source' : "ArXiv",
             'document_create_time' : "2025"
         })

Document(metadata={'document_id': '2501.12948v1', 'document_source': 'ArXiv', 'document_create_time': '2025'}, page_content='DeepSeek-R1:Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.')

In [11]:
print("\nPage Content: ", docs[0].page_content)
print("\nMeta Data: ", docs[0].metadata)


Page Content:  

Meta Data:  {'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 15.0 (Windows)', 'creationdate': '2020-04-16T20:44:27+05:30', 'moddate': '2020-04-17T14:10:42+05:30', 'trapped': '/False', 'source': '/content/drive/MyDrive/Dataset-Nist/Digital Forensics With Kali Linux.pdf', 'total_pages': 334, 'page': 0, 'page_label': 'Cover'}


In [12]:
# Split documents into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = splitter.split_documents(docs)

print(f"PDFs split into {len(chunked_docs)} chunks")

PDFs split into 2980 chunks

In [13]:
# Compare the number of loaded pages with the number of chunks produced
print("{0} pages were split into {1} chunks.".format(len(docs), len(chunked_docs)))

796 pages were split into 2980 chunks.
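
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking with overlap. It is only an approximation of the idea: `RecursiveCharacterTextSplitter` splits on separators such as paragraphs, lines, and words before merging pieces up to the size limit, whereas this hypothetical `chunk_text` helper cuts at fixed character offsets.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least part of a neighbouring chunk. The notebook's settings (512 and 30) apply the same idea at a larger scale.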


## Embeddings + Retriever

For embeddings, I use LangChain's `HuggingFaceEmbeddings` wrapper with the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embedding model.

To create the vector database, I use `FAISS`, a library developed by Facebook AI (now Meta AI) for efficient similarity search and clustering of dense vectors.
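
The idea behind similarity search can be sketched in plain Python. This is a brute-force illustration under stated assumptions: the three-dimensional vectors are made up (real `bge-base-en-v1.5` embeddings have 768 dimensions), cosine similarity is used as one common metric, and the FAISS index built in this notebook may rank by a different metric such as L2 distance. The helper names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Made-up "embeddings" for four chunks.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.2, 0.0]
print(top_k(query_vector, doc_vectors, k=2))
# → [1, 0]
```

A vector index such as FAISS answers the same question, which stored vectors are closest to the query vector, but is built to do so efficiently over large collections.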

In [14]:
db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))

In [15]:
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 5}
)

## Load the model

In [16]:
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Setup the RAG

First, I create a `text-generation` pipeline from the loaded model and its tokenizer.

Next, I create a prompt template.

Then, I chain the prompt, the LLM, and an output parser into `llm_chain`, and combine it with the retriever to create the RAG chain.

In [17]:
# Pipeline for text generation
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Prompt template to match desired output format
prompt_template = """
You are a professional AI researcher who helps with studying. Use the following context to answer the question, using information provided by the paper:

{context}

Question: {question}
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()


rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)



Device set to use cuda:0
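
In the chain above, the retriever's output (a list of `Document` objects) is passed straight into `{context}`, so, as far as I can tell, the prompt receives the string representation of that list, metadata included. A common pattern is to join only the chunk texts first. The sketch below shows such a helper on a stand-in class; `Doc` and `format_docs` are hypothetical names, and `Doc` only mimics the two attributes of LangChain's `Document` so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (same two attribute names)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs: list[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Made-up retrieval result for illustration.
retrieved = [
    Doc("Use a write blocker when imaging a drive.", {"page": 12}),
    Doc("Keep a master copy and a working copy.", {"page": 13}),
]
context_text = format_docs(retrieved)
print(context_text)
```

In the notebook, the same helper would work on real `Document` objects, and the chain's first step could then be written as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`. This is a suggestion, not what the cell above does.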


# Questions

In [27]:
# French for: "List the NIST SP 800-53 controls related to system log management"
question = "Listez les contrôles NIST SP 800-53 relatifs à la gestion des logs système"

# Invoke the chain to generate answers
result = rag_chain.invoke(question)

# Display the output
print(result)

</think>

Les contrôles NIST SP 800-53 relatifs à la gestion des log systematically incluent :

1. **Contrôle de la Protection des Log Systemes**:
   - **Section 4.2.1 et Section 4.2.2**: Les sections spéciales portent sur le gérer des log systematically et l'automatiser respectively.
   - **Section 4.3**: Définit un contrôle de la sharing granulaire des informations systematically pour éviter les attaques visuelles.

2. **Contrôle de la Protection des Log Systemes**:
   - **Section 6.2.4 et Section 6.2.5**: Les sections spéciales portent sur la gérer des log systematically et la gestion des log systematically during remote access respectively.
   - **Section 6.2.6 et Section 6.2.7**: Définit un contrôle de la protection des log systematically via des outils de analyse des forensics respectively.

Ces contrôles sont intégrés dans la norme NIST SP 800-53 pour garantir la sécurité et la compatibilité des log systematically.
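
The printed answer begins with a `</think>` tag, which DeepSeek-R1 style models use to close their reasoning section before the final answer. A minimal sketch for removing it before display follows; `strip_reasoning` is a hypothetical helper, and the sample string is made up rather than taken from the model.

```python
def strip_reasoning(text: str, marker: str = "</think>") -> str:
    """Return the text after the last reasoning marker, or the whole text if absent."""
    _, found, answer = text.rpartition(marker)
    return answer.strip() if found else text.strip()

# Made-up model output for illustration.
raw_output = "</think>\n\nThe document recommends using a write blocker."
print(strip_reasoning(raw_output))
# → The document recommends using a write blocker.
```

Applied to the chain result, this would be `print(strip_reasoning(result))`.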


In [36]:
question = "what does the document recommend for ensuring data integrity and effective analysis when working with data files"

# Invoke the chain to generate answers
result = rag_chain.invoke(question)

# Display the output
print(result)

</think>

When working with data files, it's crucial to ensure both data integrity and effective analysis. Here's a structured approach based on the provided content:

1. **Data Collection Strategy**:
   - **Multiple Copies**: Collect at least three copies of each dataset (master, working, and backup). This ensures robustness against accidental alterations.
   - **Preservation**: Maintain all collected copies while performing analyses. This helps in identifying and correcting errors early.

2. **Data Integrity Verification**:
   - **Bit Stream Image Analysis**: Perform a bit stream image analysis on critical datasets. This method checks for anomalies and confirms data accuracy.
   - **Write Blocker Use**: Implement a write blocker during backups and imaging to prevent data corruption and ensure data integrity.

3. **Order of Data Acquisition**:
   - Prioritize data acquisition based on likelihood, volatility, and effort requirements. Establish a clear timeline to manage expectations an

In [35]:
question = "What does the document say about Using a Forensic Toolkit?"

# Invoke the chain to generate answers
result = rag_chain.invoke(question)

# Display the output
print(result)

The options are:
A. It says nothing about using a forensic toolkit.
B. It says it uses a forensic toolkit.
C. It says it doesn't mention using a forensic toolkit.
D. It says it mentions using a forensic toolkit.
E. It says it doesn't mention using a forensic toolkit.
F. It says it mentions using a forensic toolkit.
G. It says it doesn't mention using a forensic toolkit.
H. It says it mentions using a forensic toolkit.
I. It says it doesn't mention using a forensic toolkit.
J. It says it mentions using a forensic toolkit.
K. It says it doesn't mention using a forensic toolkit.
L. It says it mentions using a forensic toolkit.
M. It says it doesn't mention using a forensic toolkit.
N. It says it mentions using a forensic toolkit.
O. It says it doesn't mention using a forensic toolkit.
P. It says it mentions using a forensic toolkit.
Q. It says it doesn't mention using a forensic toolkit.
R. It says it mentions using a forensic toolkit.
S. It says it doesn't mention using a forensic toolki