## RAG 

* R --> Retrieval 
* A --> Augmented
* G --> Generation

* First, the document is split into chunks, embedded, and stored in a vector database.
* We take a prompt from the user.
* Relevant information is **retrieved** from the vector database based on the user's prompt.
* The prompt is **augmented** with the context from the retrieved chunks.
* This augmented prompt is used to **generate** the response from the LLM.
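
The flow above can be sketched end to end without any external libraries. This is only a toy illustration: it ranks documents by shared words instead of embeddings, and `generate` is a hypothetical stand-in for the LLM call. The function names and sample documents are made up for this sketch.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(question, retrieved):
    """Combine the retrieved text and the question into one prompt."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

def generate(augmented_prompt):
    """Stand-in for the LLM call; it only reports the prompt size."""
    return f"[LLM would answer here using a {len(augmented_prompt)}-character prompt]"

documents = [
    "FAISS is a library for similarity search over dense vectors.",
    "Chunking splits a long document into smaller overlapping pieces.",
    "The weather in Chennai is warm for most of the year.",
]
question = "What is FAISS used for?"

retrieved = retrieve(question, documents, k=1)       # R
augmented_prompt = augment(question, retrieved)      # A
answer = generate(augmented_prompt)                  # G

print(augmented_prompt)
print(answer)
```

The rest of the notebook follows the same three steps, with an embedding model and FAISS doing the retrieval and Gemini doing the generation.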

In [7]:
# Requirements (uncomment to install)
# ! pip install faiss-cpu langchain-community langchain-text-splitters sentence-transformers pypdf google-generativeai

In [None]:
# Import the libraries
import os
import google.generativeai as genai
from langchain_community.embeddings import HuggingFaceEmbeddings
from pypdf import PdfReader

from langchain_text_splitters import RecursiveCharacterTextSplitter   # for chunking

import faiss
from langchain_community.vectorstores import FAISS

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Step 1: Configure the models

# LLM: the API key is read from an environment variable
gemini_key = os.getenv('Google_API_Key2')
genai.configure(api_key=gemini_key)
model = genai.GenerativeModel('gemini-2.5-flash-lite')

# Embedding model. all-MiniLM-L6-v2 is a general sentence-transformers model,
# so the generic HuggingFaceEmbeddings wrapper is used here; the Bge wrapper
# is intended for BGE models and adds a BGE-specific query instruction.
embedding_model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

In [18]:
# Step 2: Load the document and extract its text

pdf_file = PdfReader(r"C:\Users\Harshal\Downloads\sample.pdf")

raw_text = ''
for page in pdf_file.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text = raw_text + text + '\n'

In [19]:
print(raw_text)

1  
MAJOR PROJECT REPORT 
at 
Sathyabama Institute of Science and Technology 
(Deemed to be University) 
 
Submitted in partial fulfillment of the requirements for the award of 
Bachelor of Engineering Degree in Computer Science and Engineering 
 
By 
Busupalli Harinath Reddy(Reg.No.38110063) 
Avala Pavan Kumar (Reg. No.38110058) 
 
 
 
 
 
 
 
 
 
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 
SCHOOL OF COMPUTING 
SATHYABAMA INSTITUTE OF SCIENCE AND TECHNOLOGY 
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, 
CHENNAI – 600119, TAMILNADU 
 
 
MARCH 2022 

2  
 
SATHYABAMA 
INSTITUTE OF SCIENCE AND TECHNOLOGY 
(DEEMED TO BE UNIVERSITY) 
Accredited with Grade “A” by NAAC 
(Established under Section 3 of UGC Act, 1956) 
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI– 600119 
www.sathyabamauniversity.ac.in 
 
 
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 
 
 
                                               BONAFIDE CERTIFICATE 
 
 
This is to certify that this Project Report is the bonafide work of Av
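
The extracted text above contains trailing spaces and long runs of blank lines, which take up room in each chunk without adding meaning. One optional way to tidy it before chunking is sketched below. The helper `normalize_whitespace` is hypothetical and not part of pypdf; the notebook itself chunks the raw text as-is.

```python
import re

def normalize_whitespace(text):
    """Trim line ends, collapse blank-line runs, and squeeze repeated spaces."""
    # Remove trailing spaces on every line
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse two or more blank lines into a single blank line
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    # Collapse runs of spaces or tabs into one space
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    return cleaned.strip()

sample = "MAJOR PROJECT REPORT \nat \n \n \n \n \nSCHOOL OF   COMPUTING \n"
cleaned_sample = normalize_whitespace(sample)
print(cleaned_sample)
```

Note that squeezing spaces would also flatten any column alignment in tables, so whether this helps depends on the document.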

In [24]:
# Step 3: Chunking
# Split the text into overlapping chunks of at most 1000 characters,
# with up to 200 characters shared between consecutive chunks

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(raw_text)

In [23]:
len(raw_text)

37306

In [26]:
len(chunks)

49
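
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than the limit. The function name `sliding_window_chunks` and the synthetic text are assumptions made for this sketch.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces

# Synthetic text with the same length as raw_text above (37306 characters)
sample_text = "".join(chr(97 + i % 26) for i in range(37306))
toy_chunks = sliding_window_chunks(sample_text)

print(len(toy_chunks))                              # 47
print(toy_chunks[0][-200:] == toy_chunks[1][:200])  # True
```

A plain sliding window would give 47 chunks for 37306 characters. The 49 reported above is plausibly a little higher because breaking on separators leaves some chunks below 1000 characters.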

In [38]:
# Step 4 : Create the vector DB (FAISS)

vector_store = FAISS.from_texts(chunks,embedding_model)

In [39]:
# Step 5 : Get the prompt from the user

prompt = 'Give the 5 line summary of the project'

In [40]:
# Step 6: Retrieval (R)
# Fetch the 3 chunks whose embeddings are closest to the prompt's embedding
retriever = vector_store.as_retriever(search_kwargs={'k':3})
retrieved_docs = retriever.invoke(prompt)
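
Conceptually, the retriever embeds the prompt with the same embedding model and returns the `k` stored chunks whose vectors are nearest to it. The brute-force sketch below shows that idea with made-up 2-D vectors and Euclidean (L2) distance, which is, as far as I know, the default metric of the LangChain FAISS wrapper. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and FAISS performs this search far more efficiently. The names `l2_distance` and `top_k` and all vectors are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vector, stored, k=3):
    """Return the k (text, distance) pairs nearest to the query, closest first.

    stored is a list of (text, vector) pairs.
    """
    scored = [(text, l2_distance(query_vector, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Made-up vectors standing in for chunk embeddings
stored = [
    ("abstract", [0.9, 0.1]),
    ("acknowledgement", [0.1, 0.9]),
    ("introduction", [0.8, 0.3]),
    ("table of contents", [0.5, 0.5]),
]
query_vector = [1.0, 0.0]

results = top_k(query_vector, stored, k=3)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

A smaller distance means a closer match, so inspecting the distances is a quick way to judge whether the retrieved chunks are actually relevant to the prompt.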

In [41]:
retrieved_docs

[Document(id='351bd497-6e0b-409f-b69d-65f4a2b532eb', metadata={}, page_content='Dr.S.Vigneshwari M.E., Ph.D., and Dr.L.Lakshmanan M.E., Ph.D., Heads of the  \nDepartment of Computer Science and Engineering for providing me necessary \nsupport and details at the right time during the progressive reviews. \nI would like to express my sincere and deep sense of gratitude to my Project Guide  Dr. \nR. AROUL CANESSANE  M.E., Ph.D.,  for her valuable guidance, suggestions and  \nconstant encouragement paved way for the successful completion of my project work. \nI wish to express my thanks to all Teaching and Non -teaching staff members of the  \nDepartment of Computer Science and Engineering who were helpful in many \nways for the completion of the project.\n5  \n \n                                 TABLE OF CONTENT \n \nINDEX \nNO \n                                             TITLE PAGE \nNO \n1.                                 ABSTRACT 6 \n2. INTRODUCTION 7 \n3.                            

In [42]:

context = '\n'.join([d.page_content for d in retrieved_docs])

In [43]:
print(context)

Dr.S.Vigneshwari M.E., Ph.D., and Dr.L.Lakshmanan M.E., Ph.D., Heads of the  
Department of Computer Science and Engineering for providing me necessary 
support and details at the right time during the progressive reviews. 
I would like to express my sincere and deep sense of gratitude to my Project Guide  Dr. 
R. AROUL CANESSANE  M.E., Ph.D.,  for her valuable guidance, suggestions and  
constant encouragement paved way for the successful completion of my project work. 
I wish to express my thanks to all Teaching and Non -teaching staff members of the  
Department of Computer Science and Engineering who were helpful in many 
ways for the completion of the project.
5  
 
                                 TABLE OF CONTENT 
 
INDEX 
NO 
                                             TITLE PAGE 
NO 
1.                                 ABSTRACT 6 
2. INTRODUCTION 7 
3.                                     AIM 13 
4.                                   SCOPE 13
2  
 
SATHYABAMA 
INSTITUTE OF SCIEN

In [44]:
# Step 7: Augmentation (A)

augmented_prompt = f'''
<Role> You are a helpful assistant using RAG.
<Goal> Answer the question asked by the user. Here is the question: {prompt}
<Context> Here are the documents retrieved from the vector database to support the answer you have to generate: {context}
'''
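
The same augmentation step can be wrapped in a small function so it is reusable for other questions. The sketch below is one possible variant, not the notebook's own prompt: it numbers each retrieved passage and asks the model to say so when the context is insufficient. The function name `build_augmented_prompt`, the `<Question>` tag, and the sample passages are assumptions.

```python
def build_augmented_prompt(question, passages):
    """Build a RAG prompt from a question and a list of retrieved passages."""
    # Number the passages so the model (and the reader) can tell them apart
    numbered = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "<Role> You are a helpful assistant using RAG.\n"
        "<Goal> Answer the user's question using only the context below. "
        "If the context is not sufficient, say so.\n"
        f"<Question> {question}\n"
        f"<Context>\n{numbered}\n"
    )

sample_passages = [
    "1. ABSTRACT 6\n2. INTRODUCTION 7",
    "The project classifies crime incidents.  ",
]
sample_prompt = build_augmented_prompt(
    "Give the 5 line summary of the project", sample_passages
)
print(sample_prompt)
```

In the notebook, the passages would be `[d.page_content for d in retrieved_docs]`.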

In [45]:
# Step 8 : Generation (G)

response = model.generate_content(augmented_prompt)
print(response.text)

The project, titled "A SYSTEMATIC APPROACH TOWARDS DESCRIPTION AND CLASSIFICATION OF CRIME INCIDENTS," was carried out by Avala Pavan Kumar and Busupalli Harinath Reddy from January 2022 to April 2022. The project's abstract, introduction, aim, and scope are documented. The project utilizes Unified Modeling Language (UML) for modeling and describes a process involving input data, preprocessing, feature extraction, training, and prediction/classification of crime incidents. The primary goals of the UML design, relevant to this project, include providing an expressive visual modeling language, extendibility, and independence from specific programming languages and development processes.
