# Retrieval-Augmented Generation (RAG)

## Initialize

In [1]:
import openai
from dotenv import load_dotenv 
import os
import rich
import numpy as np

## A Motivating Example

1. Large language models are limited by their training data: they have a fixed knowledge cutoff date and gaps in domain-specific knowledge.
2. Retraining a large language model to close those gaps is expensive. For example, training the model behind ChatGPT (launched on November 30, 2022) reportedly cost around $12 million, excluding fine-tuning.

In [17]:
# Set your OpenAI API key
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Create a chat completion request
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please give me a summary about the personality of Jin Meng, who is from the department of statistics and actuarial science, University of Iowa?"}
    ]
)

# Print the generated message
print(response.choices[0].message.content)

Jin Meng is a faculty member in the Department of Statistics and Actuarial Science at the University of Iowa. While I don't have specific information on their personal characteristics or personality traits, I can provide a general idea based on their professional background. Faculty members in such a department typically have strong analytical skills, attention to detail, and a deep interest in statistical theory and its applications. They are likely to be methodical, enjoy problem-solving, and have a passion for teaching and research in statistics and actuarial science. For specific information about Jin Meng's personality, you might consider reaching out to colleagues or students who have worked with them, or looking for any publicly available interviews or biographical information.


## RAG
### Step 1: Load document

In [18]:
# Create the PDF loader
from langchain_community.document_loaders import PyPDFLoader
uploads_folder = "../.local/uploads"
uploads_file_path = "cover_letter_jin_meng.pdf"
loader = PyPDFLoader(f"{uploads_folder}/{uploads_file_path}")

# Load the data from the PDF
data = loader.load()

In [19]:
# Explore loaded data
print(f"Number of documents loaded: {len(data)}")
rich.print("\nSample document structure:")
rich.print(f"Metadata: {data[0].metadata}")
rich.print(f"Content preview: {data[0].page_content[:500]}...")

Number of documents loaded: 1


### Step 2: Split document

In [20]:
# Initialize text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                             # Maximum number of characters in each chunk
    chunk_overlap=50,                           # Number of characters to overlap between chunks for context
    length_function=len,                        # Function to measure the length of each chunk (default: len for characters)
    separators=["\n\n", "\n", ". ", " ", ""]    # List of separators to split the text, tried in order
)

# Split the loaded data into smaller chunks
# text_chunks = []
# for doc in data:
#     text_chunks.extend(text_splitter.split_text(doc.page_content))

chunks = text_splitter.split_documents(data)
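
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on the listed separators and only then merges pieces up to the size limit. The function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    result = []
    for start in range(0, len(text), step):
        result.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return result


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, which is the role `chunk_overlap=50` plays in the splitter configured above.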

In [21]:
# Analyze chunk statistics
chunk_lengths = [len(chunk.page_content) for chunk in chunks]
print("\nChunk statistics:")
print(f"  - Total chunks: {len(chunks)}")
print(f"  - Average length: {np.mean(chunk_lengths):.0f} characters")
print(f"  - Min length: {min(chunk_lengths)} characters")
print(f"  - Max length: {max(chunk_lengths)} characters")

# Examine chunks
print("\nFirst 5 chunks:")
for i, chunk in enumerate(chunks[:5]):
    print(f"\nChunk {i+1} (from {chunk.metadata['source']}):")
    print(f"Length: {len(chunk.page_content)} chars")
    print(f"Content: {chunk.page_content}...")


Chunk statistics:
  - Total chunks: 7
  - Average length: 408 characters
  - Min length: 360 characters
  - Max length: 498 characters

First 5 chunks:

Chunk 1 (from ../.local/uploads/cover_letter_jin_meng.pdf):
Length: 498 chars
Content: Jin (Jeremy) Meng   
1650 Ranier Dr, Iowa City, IA 52236｜(319)333-9236｜jin.meng.uiowa@gmail.com 
 
To whom it may concern, 
I am currently a data scientist at the United Fire Group, Inc.  (UFG), a commercial property and casualty insurance company. 
My major responsibility is to help build and implement underwriting pricing models for different business lines, including 
commercial automobile, commercial property, workers compensation, and general liability business lines. Before working full-...

Chunk 2 (from ../.local/uploads/cover_letter_jin_meng.pdf):
Length: 372 chars
Content: time at UFG starting from August 2020, I was a data science intern and worked part-time since 2018. In the meantime, I was 
a Ph.D. student at the University of Iowa, ma

### Step 3: Store embeddings

In [22]:
# Initialize OpenAI embeddings
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(
    model="text-embedding-3-small",      # Model name: "text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"
    dimensions=1536,                     # Embedding vector size: 1536 for "text-embedding-3-small", 3072 for "text-embedding-3-large"
    organization=None                    # OpenAI organization ID (optional, default: None)
)
print(f"Embedding model: {embeddings_model.model}")

Embedding model: text-embedding-3-small


In [23]:
# # Create a small test to understand embeddings
# test_texts = ["Python programming", "Machine learning algorithms", "Database systems"]
# test_embeddings = embeddings_model.embed_documents(test_texts)

# print(f"\nEmbedding test:")
# print(f"  - Input texts: {len(test_texts)}")
# print(f"  - Output Embeddings: {len(test_embeddings)}")
# print(f"  - Embedding dimensions: {len(test_embeddings[0])}")
# print(f"  - Sample embedding values: {test_embeddings[0][:5]}...")

In [24]:
# Create vector store from chunks
from langchain_community.vectorstores import FAISS

# Create a FAISS index (LangChain's default is Euclidean/L2 distance, not cosine similarity)
vectorstore = FAISS.from_documents(chunks, embeddings_model)

print("Vector store created!")
print(f"  - Total vectors: {vectorstore.index.ntotal}")
print(f"  - Vector dimension: {vectorstore.index.d}")
print(f"  - Sample embedding values: {vectorstore.index.reconstruct(0)[:5]}...")

Vector store created!
  - Total vectors: 7
  - Vector dimension: 1536
  - Sample embedding values: [-0.03626532 -0.01408294  0.04419661  0.04318768 -0.01275872]...
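
LangChain's FAISS wrapper defaults to Euclidean (L2) distance, while embedding similarity is usually described in terms of cosine similarity. For unit-length vectors the two produce the same ranking, because `||a - b||^2 = 2 - 2 * cos(a, b)`. The sketch below checks this numerically in plain Python. The random vectors are stand-ins for embeddings (no API call is made), and the helper names are made up for illustration.

```python
import math
import random

random.seed(0)

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Random stand-ins for embeddings, normalized to unit length.
toy_vectors = [unit([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]
toy_query = unit([random.gauss(0, 1) for _ in range(8)])

for vec in toy_vectors:
    cos = dot(vec, toy_query)
    # The last two columns should match: ||a - b||^2 = 2 - 2 * cos(a, b)
    print(f"cos={cos:+.4f}  dist^2={squared_l2(vec, toy_query):.4f}  2-2cos={2 - 2 * cos:.4f}")

# Nearest-first by distance versus most-similar-first by cosine.
by_distance = sorted(range(5), key=lambda i: squared_l2(toy_vectors[i], toy_query))
by_cosine = sorted(range(5), key=lambda i: -dot(toy_vectors[i], toy_query))
print(by_distance == by_cosine)
```

This only holds when the vectors are normalized. OpenAI documents its embeddings as unit length, but other embedding models may not be, so it is worth checking before relying on the equivalence.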


### Step 4: Retrieve relevant chunks via similarity search

In [25]:
# query = "Help me conclude the personality of Jin Meng"

# # Get relevant chunks
# relevant_chunks_with_score = vectorstore.similarity_search_with_score(query, k=3)

# print(f"Found {len(relevant_chunks_with_score)} relevant chunks:")
# for i, (chunk, score) in enumerate(relevant_chunks_with_score):
#     print(f"\n{i+1}. Distance score (lower is more similar): {score:.4f}")
#     print(f"     From: {chunk.metadata['source']}")
#     print(f"     Content: {chunk.page_content}")
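
Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and returns the `k` closest chunks. The brute-force sketch below illustrates that idea with hand-written three-dimensional vectors. The texts, vectors, and the `top_k` helper are all made up for illustration, and a real FAISS index is far more optimized than this loop.

```python
import math

# Made-up (text, vector) pairs; real embeddings here have 1536 dimensions.
toy_index = [
    ("insurance pricing models", [0.9, 0.1, 0.0]),
    ("statistics coursework", [0.1, 0.9, 0.1]),
    ("weekend hiking trip", [0.0, 0.1, 0.9]),
]

def top_k(query_vec, index, k):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1])[:k]

toy_query_vec = [0.8, 0.3, 0.0]  # pretend embedding of a work-related question
toy_hits = top_k(toy_query_vec, toy_index, k=2)
for text, distance in toy_hits:
    print(f"{distance:.4f}  {text}")
```

With an L2 index, the score returned alongside each chunk is a distance like the one printed here, so a lower value means a closer match.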

### Step 5: Generate an answer via QA with an LLM

In [27]:
# Step 5-1: Initialize the chat model
from langchain_openai import ChatOpenAI
model = ChatOpenAI(
    model_name="gpt-4o"
)

In [28]:
# Step 5-2: Define the prompt template
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.

{context}
Question: {question}
Helpful Answer:"""
prompt = PromptTemplate.from_template(template)
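
`{context}` and `{question}` are placeholders that get filled at query time. With the default "stuff" strategy, the retrieved chunks are concatenated into a single string for `{context}`. The sketch below uses plain `str.format` to show what the model ends up receiving. The shortened template, the chunk texts, and the question are made up for illustration.

```python
toy_template = """Use the following pieces of context to answer the question at the end.

{context}
Question: {question}
Helpful Answer:"""

# Made-up chunk texts standing in for retrieved documents.
retrieved_texts = ["First retrieved chunk.", "Second retrieved chunk."]

# Join the chunks into one context string, then fill both placeholders.
filled_prompt = toy_template.format(
    context="\n\n".join(retrieved_texts),
    question="What does the document say?",
)
print(filled_prompt)
```

Because every retrieved chunk is pasted into the prompt, the number of chunks retrieved and the chunk size together determine how much of the model's context window is used.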

In [29]:
# Step 5-3: Create the RetrievalQA chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
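
As a rough mental model, the chain performs three steps: retrieve chunks for the query, place them in the prompt, and pass the prompt to the model. The sketch below mimics that control flow with injected callables so it runs offline. `toy_retrieval_qa`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the real chain does considerably more (prompt templating, callbacks, document objects).

```python
from typing import Callable

def toy_retrieval_qa(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> dict:
    """Retrieve chunks, put them into a prompt, and ask the model."""
    source_documents = retrieve(query)
    context = "\n\n".join(source_documents)
    prompt_text = f"{context}\nQuestion: {query}\nHelpful Answer:"
    return {
        "query": query,
        "result": generate(prompt_text),
        "source_documents": source_documents,
    }

# Stubs so the sketch runs offline; a real retriever and LLM would replace them.
def fake_retrieve(query: str) -> list[str]:
    return ["chunk A", "chunk B"]

def fake_generate(prompt_text: str) -> str:
    return f"(answer based on {prompt_text.count('chunk')} chunks)"

toy_response = toy_retrieval_qa("What is in the document?", fake_retrieve, fake_generate)
print(toy_response["result"])
print(toy_response["source_documents"])
```

The returned dictionary mirrors the `"result"` and `"source_documents"` keys that the chain exposes when `return_source_documents=True`.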

In [30]:
# Step 5-4: Run the chain with your query
query = "Help me conclude the personality of Jin Meng"
response = qa_chain.invoke({"query": query})

In [31]:
rich.print(response["result"])
rich.print(response["source_documents"])

### Notes
- The RAG pipeline above covers the business logic and data layer of a web application's backend.
- A full web application needs two more pieces:
    1. An HTTP request-handling layer (the backend API).
    2. A frontend web app.
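
As a rough illustration of the backend API layer, the sketch below wraps a question-answering function in a small HTTP endpoint using only the standard library. Everything in it is an assumption made for illustration: the `/ask` route, the JSON shape, and the `answer_question` stub, which stands in for the call to the RAG chain. In practice a web framework would usually take the place of `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer_question(query: str) -> str:
    """Hypothetical stand-in for the call to the RAG chain."""
    return f"(answer to: {query})"

def handle_ask(body: bytes) -> tuple[int, dict]:
    """Turn a JSON request body into an HTTP status code and a JSON-ready dict."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    query = payload.get("query") if isinstance(payload, dict) else None
    if not isinstance(query, str) or not query.strip():
        return 400, {"error": "missing 'query' string"}
    return 200, {"query": query, "result": answer_question(query)}

class AskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/ask":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, data = handle_ask(self.rfile.read(length))
        encoded = json.dumps(data).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

def make_server(port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), AskHandler)

# Exercise the request logic directly, without opening a socket.
print(handle_ask(b'{"query": "Who wrote the cover letter?"}'))
```

Keeping the request logic in `handle_ask` separate from the socket handling makes it testable without a running server. A frontend would then send a `POST` request to `/ask` with a body such as `{"query": "..."}` and render the returned `result`.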