# Create and run a local RAG pipeline from scratch

## What is RAG?

RAG stands for Retrieval Augmented Generation.
The goal of RAG is to retrieve relevant information, pass it to an LLM, and have the LLM generate answers based on that information.
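
As a rough illustration of the "pass it to an LLM" part, the sketch below combines a query with already-retrieved text into a single prompt string. The function name `build_prompt` and the prompt wording are assumptions made for this example, not a fixed format required by any model.

```python
def build_prompt(query: str, context_items: list[str]) -> str:
    """Combine retrieved passages and the user query into one prompt string."""
    # Present each passage as a bullet so the model can tell them apart.
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        "Answer the query using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    query="What does RAG stand for?",
    context_items=["RAG stands for Retrieval Augmented Generation."],
)
print(prompt)
```

The resulting string is what would be sent to the LLM, so the model answers from the supplied context rather than from memory alone.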



## Steps to building the RAG system from scratch

1. Open a PDF document (or a group of PDF documents).
2. Format the text of the document(s) so it is suitable to feed into an embedding model.
3. Embed all of the chunks of text in the document(s), turning them into numerical representations that can be stored for later.
4. Build a retrieval system that uses vector search to find the relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.

All of these steps will run locally.
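
To make the vector search step more concrete, here is a minimal, dependency-free sketch. It assumes a toy bag-of-words "embedding" (`toy_embed`) as a stand-in for a real embedding model, and ranks chunks by cosine similarity. The names `toy_embed`, `cosine_similarity` and `retrieve` are hypothetical helpers for illustration only.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse bag-of-words count vector (NOT a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs ranked by similarity to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "Foundation models lower the barrier to building AI products.",
    "Vector search finds the text chunks closest to a query.",
    "Embeddings turn text into numerical representations.",
]
results = retrieve("How does vector search find chunks?", chunks, top_k=1)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

A real pipeline would replace `toy_embed` with dense vectors from an embedding model, but the ranking logic (score every chunk against the query, keep the best few) stays the same.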

## 1. Document processing and embedding creation.

Steps:
1. Import the PDF document(s)
2. Process text for embedding (e.g. split into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings for later
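
Step 2 above (splitting text into chunks of sentences) could look something like the sketch below. It assumes a naive regex-based sentence splitter and a default chunk size of 10 sentences. Both are illustrative choices, and the helper names `split_sentences` and `chunk_sentences` are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[str]:
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    return [
        " ".join(sentences[i : i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


text = "RAG retrieves text. It adds the text to a prompt. An LLM answers. Done!"
sentences = split_sentences(text)
chunks = chunk_sentences(sentences, chunk_size=2)
for chunk in chunks:
    # Rough heuristic: about 4 English characters per token.
    print(f"{len(chunk) / 4:.1f} tokens (est.) | {chunk}")
```

A regex splitter will mis-split abbreviations such as "e.g.", so a proper sentence segmenter may be preferable on real book text. The chunk size is a trade-off between fitting the embedding model's input limit and keeping enough context in each chunk.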

### Import the PDF

In [1]:
import os
import requests

# Get PDF document path
pdf_path = "./Reference Documents/AI Engineering.pdf"

# Check whether the PDF exists locally (nothing is downloaded in this cell)
if os.path.exists(pdf_path):
    print("[INFO] File exists")
else:
    print("[INFO] File does not exist.")

[INFO] File exists


### Opening the PDF

In [7]:
%pip uninstall fitz -y

Found existing installation: fitz 0.0.1.dev2
Uninstalling fitz-0.0.1.dev2:
  Successfully uninstalled fitz-0.0.1.dev2
Note: you may need to restart the kernel to use updated packages.


In [9]:
%pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.7-cp310-abi3-macosx_11_0_arm64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.7-cp310-abi3-macosx_11_0_arm64.whl (22.5 MB)
   22.5/22.5 MB 18.0 MB/s eta 0:00:00
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.7
Note: you may need to restart the kernel to use updated packages.


In [19]:
import fitz  # PyMuPDF, used to open PDFs
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Perform minor formatting on text extracted from a PDF page."""
    # Replace newlines with spaces and trim surrounding whitespace
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Open a PDF and return one dict of text and basic statistics per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # The -23 offset shifts the zero-based PDF index, presumably so that
        # page numbers line up with the book's printed page numbers
        pages_and_texts.append({"page_number": page_number - 23,
                               "page_char_count": len(text),
                               "page_word_count": len(text.split(" ")),
                               # Rough sentence count: splits on ". " only
                               "page_sentence_count_raw": len(text.split(". ")),
                               "page_token_count": len(text) / 4,  # 1 token is roughly 4 English characters
                               "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:25]


0it [00:00, ?it/s]

[{'page_number': -23,
  'page_char_count': 72,
  'page_word_count': 11,
  'page_sentence_count_raw': 1,
  'page_token_count': 18.0,
  'text': 'Chip Huyen  AI Engineering Building Applications  with Foundation Models'},
 {'page_number': -22,
  'page_char_count': 2339,
  'page_word_count': 385,
  'page_sentence_count_raw': 11,
  'page_token_count': 584.75,
  'text': '9 7 8 1 0 9 8 1 6 6 3 0 4 5 7 9 9 9 ISBN:   978-1-098-16630-4 US $79.99\t   CAN $99.99 DATA Foundation models have enabled many new AI use cases while lowering the barriers to entry for  building AI products. This has transformed AI from an esoteric discipline into a powerful development  tool that anyone can use—including those with no prior AI experience. In this accessible guide, author Chip Huyen discusses AI engineering: the process of building applications  with readily available foundation models. AI application developers will discover how to navigate  the AI landscape, including models, datasets, evaluation benchmar