## Local Retrieval-Augmented Generation (RAG) Chatbot

### 🔽 Step 1: Install Dependencies

In [1]:
pip install pymupdf langchain langchain-community unstructured


In [7]:
pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Downloading sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-4.1.0
Note: you may need to restart the kernel to use updated packages.


In [11]:
pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp312-cp312-win_amd64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0-cp312-cp312-win_amd64.whl (15.0 MB)

### 📄 Step 2: Use PyMuPDFLoader to Load PDFs

In [3]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF file
loader = PyMuPDFLoader("Statistics_Notes_and_Practice.pdf")  # Replace with your actual PDF path
documents = loader.load()

# Split the document into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
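
To make `chunk_size=500` and `chunk_overlap=100` concrete, here is a minimal sketch of overlapping character windows. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraphs, lines, and spaces. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample standing in for a PDF page
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The last 100 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.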

### 🧠 Step 3: Create Embeddings and FAISS Vector Store

In [13]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load sentence-transformers embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create FAISS index
db = FAISS.from_documents(docs, embedding_model)

# Save for Streamlit use later
db.save_local("faiss_index")
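
Conceptually, the vector store answers a query by finding the chunk embeddings closest to the query embedding. The sketch below does that by brute force in plain Python, assuming Euclidean (L2) distance, which I believe is the default for a LangChain FAISS store built this way. The vectors are toy 3-dimensional values invented for illustration (`all-MiniLM-L6-v2` produces 384 dimensions), and `top_k` is a hypothetical helper, not a FAISS API.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, distance) pairs for the k closest vectors, nearest first."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings" (hypothetical values, for illustration only)
chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

nearest = top_k(query_vec, chunk_vecs, k=2)
print(nearest)
```

FAISS performs the same kind of lookup with optimized index structures, which is what keeps retrieval fast as the number of chunks grows.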

### 🔁 Step 4: Create Retriever + Prompt Template

In [15]:
retriever = db.as_retriever()  # returns the 4 most similar chunks per query by default

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Load a local Hugging Face model
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512)
llm = HuggingFacePipeline(pipeline=pipe)

# Build the QA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)
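
The cell above imports `PromptTemplate` but relies on the chain's default prompt. As a rough sketch of what a prompt template does, the plain-Python version below fills a `{context}` and `{question}` slot with the retrieved chunks and the user query. The template wording and the `build_prompt` helper are hypothetical, and the example chunks paraphrase definitions from the statistics notes.

```python
# Hypothetical template text; a "stuff"-style QA prompt needs a {context}
# slot for the retrieved chunks and a {question} slot for the user query.
TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    [
        "Type I Error: rejecting a true null hypothesis.",
        "Type II Error: failing to reject a false null hypothesis.",
    ],
    "What is a Type I error?",
)
print(prompt_text)
```

With LangChain, the same text could be wrapped as `PromptTemplate(template=TEMPLATE, input_variables=["context", "question"])` and passed to `RetrievalQA.from_chain_type` through `chain_type_kwargs={"prompt": ...}`.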


### 💬 Step 5: Ask Questions

In [17]:
query = "What is the document about?"
result = qa_chain.invoke({"query": query})  # invoke() replaces the deprecated qa_chain(query) call style

print("Answer:", result["result"])

Token indices sequence length is longer than the specified maximum sequence length for this model (616 > 512). Running this sequence through the model will result in indexing errors


Answer: To test assumptions (hypotheses) using statistical methods. - Types: - Z-Test - T-Test - Chi-Square Test - ANOVA Errors in Hypothesis Testing: - Type I Error: Rejecting a true null hypothesis (false positive). - Type II Error: Failing to reject a false null hypothesis (false negative). Used in regression for parameter significance and in classification for threshold optimization. Methods and Python Code for Hypothesis Testing 1. Z-Test p-value in a t-Test from scipy.stats import t# Example: T-test t_score = 2.1 # Replace with your calculated t_score degrees_of_freedom = 10 # Replace with your sample's degrees of freedom p_value = 2 * (1 - t.cdf(abs(t_score), df=degrees_of_freedom)) # Two-tailed test print("P-value for T-test:", p_value) The degrees of freedom (df) for a sample are calculated using the formula: df=n1 Where n is the sample size (the total number of observations in the sample).
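
The warning above reports a 616-token input against a 512-token limit, which suggests the retrieved chunks plus the question did not fit the model's input window. One possible mitigation, sketched here under assumptions, is to keep only as many ranked chunks as fit a token budget. `fit_to_budget` and the `reserved` allowance for the question and instructions are hypothetical, and the default counter splits on whitespace, which only approximates a real tokenizer.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda s: len(s.split()), reserved=64):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    budget = max_tokens - reserved  # leave room for the question and instructions
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical ranked chunks of 200 whitespace-separated tokens each
ranked_chunks = ["alpha " * 200, "beta " * 200, "gamma " * 200]
kept = fit_to_budget(ranked_chunks, max_tokens=512)
print(len(kept))
```

For an exact count, pass `count_tokens=lambda s: len(tokenizer.encode(s))` using the tokenizer loaded in Step 4. A simpler alternative is to retrieve fewer chunks, for example `db.as_retriever(search_kwargs={"k": 2})`.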
