# First RAG using OpenAI API 

## 1. Import the relevant libraries

In [2]:
from openai import OpenAI

In [3]:
from dotenv import load_dotenv, find_dotenv
import os
load_dotenv(find_dotenv(), override=True)

True

In [4]:
from IPython.display import Markdown, display

In [5]:
from pdf2image import convert_from_path

## 2. Test the connection to the API

In [6]:
# Load the API key
API_KEY = os.getenv("API_KEY")
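
If the variable is missing from the `.env` file, `os.getenv` returns `None` without complaint, and the resulting error only appears further down. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (not part of the original notebook) and a placeholder variable `DEMO_API_KEY`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Placeholder variable, set here only so the demonstration can run
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook this could be used as `API_KEY = require_env("API_KEY")`.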

In [7]:
prompt = "What is the difference between a Green Belt and a Black Belt?"
sys_message = "You are an expert in Lean Six Sigma."
MODEL = "gpt-4o-mini"

In [8]:
client = OpenAI(api_key=API_KEY)

In [9]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role":"system", "content":sys_message},
        {"role":"user", "content":prompt}
    ],
    temperature=0
)

In [67]:
display(Markdown(response.choices[0].message.content))

In Lean Six Sigma, both Green Belts and Black Belts play crucial roles in process improvement initiatives, but they differ in terms of their responsibilities, training, and expertise. Here are the key differences:

### 1. **Training and Certification:**
   - **Green Belt:**
     - Typically undergoes a shorter training program (usually 2-4 weeks).
     - Focuses on the basics of Lean Six Sigma methodologies, tools, and techniques.
     - Often certified after passing an exam that tests their understanding of fundamental concepts.

   - **Black Belt:**
     - Completes a more extensive training program (usually 4-6 months).
     - Gains in-depth knowledge of advanced statistical tools, project management, and leadership skills.
     - Certification usually requires passing a more comprehensive exam and demonstrating proficiency through project work.

### 2. **Role and Responsibilities:**
   - **Green Belt:**
     - Works on process improvement projects part-time while maintaining their primary job responsibilities.
     - Typically leads smaller projects or assists Black Belts in larger projects.
     - Focuses on data collection, analysis, and implementing solutions within their area of expertise.

   - **Black Belt:**
     - Works on process improvement projects full-time and often leads larger, more complex projects.
     - Trains and mentors Green Belts and other team members in Lean Six Sigma methodologies.
     - Responsible for project management, ensuring that projects align with organizational goals, and delivering measurable results.

### 3. **Statistical Knowledge:**
   - **Green Belt:**
     - Has a basic understanding of statistical tools and techniques, such as process mapping, root cause analysis, and basic hypothesis testing.
     - Uses these tools to support project work and data analysis.

   - **Black Belt:**
     - Possesses advanced statistical knowledge and is proficient in using complex tools such as regression analysis, design of experiments (DOE), and multivariate analysis.
     - Capable of interpreting data and making data-driven decisions to drive significant improvements.

### 4. **Project Scope:**
   - **Green Belt:**
     - Typically works on projects that have a limited scope and impact, often within a specific department or function.
     - Focuses on incremental improvements.

   - **Black Belt:**
     - Engages in projects that have a broader organizational impact and strategic significance.
     - Aims for transformational changes that can lead to substantial cost savings and efficiency gains.

### 5. **Leadership and Influence:**
   - **Green Belt:**
     - May have some leadership responsibilities but generally operates under the guidance of Black Belts or higher-level management.
     - Influences change within their immediate team or department.

   - **Black Belt:**
     - Acts as a leader and change agent within the organization.
     - Has a greater influence on strategic decisions and can drive cultural change towards continuous improvement.

In summary, while both Green Belts and Black Belts are essential to Lean Six Sigma initiatives, Black Belts have a deeper level of expertise, broader responsibilities, and a more significant role in leading and managing improvement projects.

## 3. Prepare the data

Our data comes as PDF files. To get good extraction results from the LLM, we first convert each PDF page into an image, which is what we do next.

### 3.1. Grabbing the files

In [10]:
from pathlib import Path

In [11]:
# Grab the current directory
current_directory = Path.cwd()

# Check for the PDF folder
pdf_folder = "pdf_files"
for item in current_directory.iterdir():
    if item.is_dir() and item.name == pdf_folder:
        print(f"Found the folder {pdf_folder}")
        break
else:
    # Runs only if the loop finished without finding the folder
    print(f"No folder named {pdf_folder} found ...")

Found the folder pdf_files


In [12]:
# Path to the files
path_to_files = current_directory / pdf_folder

In [13]:
# Loop through the folder and grab the PDF file names
list_files = [file.name for file in path_to_files.iterdir()]
list_files

['The white house cookbook.pdf',
 'The chinese cookbook.pdf',
 'Things mother used to make.pdf',
 'Southern Cookbook of Fine Recipes.pdf',
 'The Italian cookbook.pdf']
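
`Path.iterdir()` yields entries in arbitrary order and returns everything in the folder, so a fixed position in this list may point to a different file on another machine. A minimal sketch that keeps only PDF files and sorts them, assuming a hypothetical helper `list_pdf_names` and made-up file names in a temporary folder:

```python
import tempfile
from pathlib import Path


def list_pdf_names(folder: Path) -> list[str]:
    """Return the sorted names of the PDF files found directly in `folder`."""
    return sorted(
        item.name
        for item in folder.iterdir()
        if item.is_file() and item.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    for name in ["b.pdf", "a.PDF", ".DS_Store", "notes.txt"]:
        (demo_folder / name).touch()
    demo_names = list_pdf_names(demo_folder)
```

Selecting a book by name (for example with `list.index`) rather than by position would make the next cell more robust.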

In [14]:
# Grab the one that interests me
file_name = list_files[2]

### 3.2. Converting the files

Let's create a function that takes the location of our PDF, saves each page as an image in an output folder, and returns the list of image paths.


In [15]:
# Generate the full path to the PDF
path = path_to_files / file_name
output_folder_name = "images"

In [74]:
def convert_to_image(pdf_path=path, output=output_folder_name):
    # Create the output folder if it does not exist yet
    output_folder = Path(".") / output
    output_folder.mkdir(parents=True, exist_ok=True)
    images_path = list()
    # Render every page of the PDF as an image
    images = convert_from_path(pdf_path=pdf_path.as_posix())
    for idx, image in enumerate(images, start=1):
        img_path = output_folder / f"page_{idx}.jpg"
        image.save(img_path.as_posix())
        images_path.append(img_path.as_posix())
    return images_path

In [75]:
images_path = convert_to_image()
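
Converting a whole book takes a while, so it can be handy to rebuild `images_path` from files already on disk. A plain alphabetical sort would place `page_10.jpg` before `page_2.jpg`; the sketch below sorts by page number instead. The helpers `page_number` and `sort_pages` are hypothetical and assume the `page_N.jpg` naming used above:

```python
import re


def page_number(path: str) -> int:
    """Extract the integer N from a file name of the form page_N.jpg."""
    match = re.search(r"page_(\d+)\.jpg$", path)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return int(match.group(1))


def sort_pages(paths: list[str]) -> list[str]:
    """Return the paths ordered by page number rather than alphabetically."""
    return sorted(paths, key=page_number)


demo_paths = ["images/page_10.jpg", "images/page_2.jpg", "images/page_1.jpg"]
sorted_paths = sort_pages(demo_paths)
```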

## 4. Extract the data

We will use the GPT model `gpt-4o-mini` to extract the data.

First, we write a prompt (passed to the LLM as `sys_message`) to get the most out of the extraction. Then we extract the information from each page and store it in a dictionary with two keys: `image_path` and `info`.

### 4.1. Prompt engineering for system message

In [76]:
sys_message = """
Please analyze the content of this image and extract any related recipe information into structured components.
Specifically, extract the recipe title, list of ingredients and step-by-step instructions, cuisine type, dish type,
and any relevant tags or metadata.
The output must be formatted in a way suited for embedding in a Retrieval Augmented Generation (RAG) system.
"""

In [77]:
user_prompt = "Please proceed by extracting the information as required"

### 4.2. Extract the information

In [78]:
# generate a function to call the OpenAI API

def ask_gpt(img_data, model=MODEL, prompt=user_prompt, sys_message=sys_message):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role":"system", "content":sys_message},
            {"role":"user", "content":[
                {"type":"text", "text":prompt},
                {"type":"image_url", "image_url":{"url":f"data:image/jpeg;base64,{img_data}"}}
            ]}
        ],
        temperature = 0
    )
    return response.choices[0].message.content



For this, we will use the `images_path` list created earlier and pass the images one by one to the GPT model.

Before that, we need to encode each image in base64.

In [81]:
# Let's create a function for the encoding
import base64

def encode_base64(img_path):
    with open(img_path, mode="rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")
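
The encoded string ends up inside a `data:` URL in `ask_gpt`. The sketch below shows that construction in isolation and checks that decoding returns the original bytes. The helper `to_data_url` is hypothetical, and the byte string is a placeholder rather than a real image:

```python
import base64


def to_data_url(raw_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Build a data URL of the form data:<mime>;base64,<payload>."""
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for the content of a JPEG file
demo_bytes = b"\xff\xd8\xff\xe0 placeholder, not a real image"
demo_url = to_data_url(demo_bytes)

# Split the URL back into its header and its base64 payload
header, payload = demo_url.split(",", 1)
decoded_bytes = base64.b64decode(payload)
```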

In [82]:
# Extract the information using GPT
content = list()
for img_path in images_path:
    print(f"Processing {img_path} ...")
    img_data = encode_base64(img_path=img_path)
    answer = ask_gpt(img_data=img_data)
    content.append({"image_path":img_path,"info":answer})
    print(f"Ending processing {img_path}...")
print("Processing ended !")

Processing images/page_1.jpg ...
Ending processing images/page_1.jpg...
Processing images/page_2.jpg ...
Ending processing images/page_2.jpg...
Processing images/page_3.jpg ...
Ending processing images/page_3.jpg...
Processing images/page_4.jpg ...
Ending processing images/page_4.jpg...
Processing images/page_5.jpg ...
Ending processing images/page_5.jpg...
Processing images/page_6.jpg ...
Ending processing images/page_6.jpg...
Processing images/page_7.jpg ...
Ending processing images/page_7.jpg...
Processing images/page_8.jpg ...
Ending processing images/page_8.jpg...
Processing images/page_9.jpg ...
Ending processing images/page_9.jpg...
Processing images/page_10.jpg ...
Ending processing images/page_10.jpg...
Processing images/page_11.jpg ...
Ending processing images/page_11.jpg...
Processing images/page_12.jpg ...
Ending processing images/page_12.jpg...
Processing images/page_13.jpg ...
Ending processing images/page_13.jpg...
Processing images/page_14.jpg ...
Ending processing imag
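
The loop above makes one API call per page, so a single transient failure (a rate limit or a network hiccup) interrupts the whole run. A minimal sketch of a retry wrapper with exponential backoff, assuming a hypothetical helper `call_with_retries`; the `flaky` function only simulates a call that fails twice:

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the last error propagate
            # Wait 1x, 2x, 4x ... the base delay between attempts
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated call that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping
delays = []
demo_result = call_with_retries(flaky, sleep=delays.append)
```

In the loop, the call could then be written as `call_with_retries(lambda: ask_gpt(img_data=img_data))`.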

## Inspecting the data

In [85]:
# Check what the data looks like
from random import choice
example = choice(content)
example

{'image_path': 'images/page_44.jpg',
 'info': 'Here is the extracted recipe information structured for a Retrieval Augmented Generation (RAG) system:\n\n### Recipe Title\n- Butter Scotch\n\n### Ingredients\n- ½ Cupful of Molasses\n- ½ Cupful of Butter\n- ½ Cupful of Sugar\n\n### Instructions\n1. Boil until it strings.\n2. Pour into a buttered tin.\n3. When cold, break into pieces. This is very nice when cooled on snow.\n\n---\n\n### Recipe Title\n- Pop Corn Balls (very old recipe)\n\n### Ingredients\n- 1 Cupful of Molasses\n- Piece of Butter, half the size of an Egg\n- Pinch of soda\n- 1 quart dish full of popped corn\n\n### Instructions\n1. Boil together until it strings.\n2. Stir in a pinch of soda.\n3. Put this over a quart dish full of popped corn.\n4. When cool enough to handle, squeeze into balls the size of an orange.\n\n### Cuisine Type\n- Traditional American\n\n### Dish Type\n- Candy / Snack\n\n### Tags/Metadata\n- Old-fashioned recipes\n- Sweet treats\n- Homemade candy\n- Fa

In [86]:
# display the data 
display(Markdown(example["info"]))

Here is the extracted recipe information structured for a Retrieval Augmented Generation (RAG) system:

### Recipe Title
- Butter Scotch

### Ingredients
- ½ Cupful of Molasses
- ½ Cupful of Butter
- ½ Cupful of Sugar

### Instructions
1. Boil until it strings.
2. Pour into a buttered tin.
3. When cold, break into pieces. This is very nice when cooled on snow.

---

### Recipe Title
- Pop Corn Balls (very old recipe)

### Ingredients
- 1 Cupful of Molasses
- Piece of Butter, half the size of an Egg
- Pinch of soda
- 1 quart dish full of popped corn

### Instructions
1. Boil together until it strings.
2. Stir in a pinch of soda.
3. Put this over a quart dish full of popped corn.
4. When cool enough to handle, squeeze into balls the size of an orange.

### Cuisine Type
- Traditional American

### Dish Type
- Candy / Snack

### Tags/Metadata
- Old-fashioned recipes
- Sweet treats
- Homemade candy
- Family recipes

## Clean the data

In this section, we will remove the unnecessary information before embedding and indexing.

In our case we are dealing with recipes. So the relevant information is:

- ingredients
- recipe name
- instructions

In [92]:
# Filter the recipe information
filtered_recipe = list()
keyword_list = ["ingredients","instructions","recipe title"]
for recipe in content:
    if any(keyword in recipe["info"].lower() for keyword in keyword_list):
        filtered_recipe.append(recipe)
    else:
        print(f"{recipe} is not relevant ...")

{'image_path': 'images/page_9.jpg', 'info': 'Based on the provided text, here is the structured information extracted:\n\n### Recipe Information\n\n- **Title**: The Things Mother Used To Make\n- **Cuisine Type**: Traditional/New England\n- **Dish Type**: Old-fashioned recipes\n- **Introduction**: \n  - The recipes have been handed down through generations for nearly a hundred years.\n  - The author, a New England woman, has tested these recipes in her kitchen.\n  - The material was originally published in *Suburban Life* and has been preserved in book form.\n\n### Metadata\n- **Author**: Frank A. Arnold\n- **Publication Date**: September 15, 1913\n- **Series**: Countryside Manuals\n\n### Tags\n- Traditional recipes\n- Family recipes\n- Historical cooking\n\nThis format is suitable for embedding in a Retrieval Augmented Generation (RAG) system.'} is not relevant ...
{'image_path': 'images/page_10.jpg', 'info': "I'm unable to extract any information from the image as it appears to be bla
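
The filter above keeps a page as soon as any one of the keywords appears. A stricter variant, sketched here under the assumption that a usable recipe contains both an ingredients list and instructions, requires all keywords to be present. The helper `looks_like_recipe` and the two demo pages are hypothetical:

```python
REQUIRED_KEYWORDS = ("ingredients", "instructions")


def looks_like_recipe(text: str, required=REQUIRED_KEYWORDS) -> bool:
    """Return True only if every required keyword appears in the text."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in required)


# Made-up pages mimicking the structure of the extracted content
demo_pages = [
    {
        "image_path": "images/demo_1.jpg",
        "info": "### Recipe Title\n- Demo\n\n### Ingredients\n- Flour\n\n### Instructions\n1. Mix.",
    },
    {
        "image_path": "images/demo_2.jpg",
        "info": "### Recipe Information\n- **Title**: A preface page",
    },
]
kept = [page for page in demo_pages if looks_like_recipe(page["info"])]
```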

## Embedding

In [19]:
import json  # to store our cleaned data and reload it in case of a crash

In [96]:
output_data = Path.cwd() / "output_data.json"
with open(output_data.as_posix(), mode="w") as json_file:
    json.dump(filtered_recipe, json_file, indent=4) # indent = 4 for formatting.

In [16]:
# import libraries
import numpy as np

In [20]:
# Load the recipes back from disk (useful after a kernel restart)
def load_data(file_name="./output_data.json", load=False):
    # Without load=True nothing is read and an empty list is returned
    if not load:
        return []
    with open(file_name, mode="r") as json_file:
        return json.load(json_file)

In [21]:
filtered_recipes = load_data(load=True)

In [23]:
# Generate embedding for each recipe
recipe_texts = [recipe["info"] for recipe in filtered_recipes]

In [24]:
embedding_response = client.embeddings.create(
	input = recipe_texts,
	model = "text-embedding-3-large"
)

In [25]:
# Extract the embeddings 
embedding = [data.embedding for data in embedding_response.data]
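
Here all texts are sent in a single request. If the list grows, it may be necessary to split it into smaller requests. A minimal sketch of such batching, assuming a hypothetical helper `batched`; the batch size of 3 is arbitrary and the API call itself is only shown as a comment:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up texts standing in for recipe_texts
demo_texts = [f"recipe {i}" for i in range(7)]
demo_batches = list(batched(demo_texts, 3))

# Each batch would then be embedded separately, for example:
# for batch in batched(recipe_texts, 3):
#     response = client.embeddings.create(input=batch, model="text-embedding-3-large")
#     embedding.extend(data.embedding for data in response.data)
```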

In [26]:
# FAISS works with float32 vectors, so we convert the embeddings into a float32 NumPy array

embedding_matrix = np.array(embedding, dtype="float32")

## Creating the index (using FAISS)

In [27]:
import faiss

In [28]:
# create the index
index = faiss.IndexFlatL2(embedding_matrix.shape[1])
# add the embedding to the index
index.add(embedding_matrix)
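
`IndexFlatL2` performs an exact search and ranks results by squared Euclidean distance. For unit-length vectors this gives the same ordering as cosine similarity, because the squared distance equals `2 - 2 * cosine`. The sketch below checks that identity in plain Python on two made-up vectors; it assumes the vectors are normalized, and the helper names are hypothetical:

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; equals the cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
```

A smaller distance therefore means a more similar recipe, which is how the distances returned by the search are read.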

In [30]:
# Save the index (good thing to do when working with large data)
faiss.write_index(index, "filtered_recipes_index.index")

In [31]:
# Save the metadata
metadata = [
    {"recipe_info": recipe["info"], "image_path": recipe["image_path"]}
    for recipe in filtered_recipes
]

with open("recipe_metadata.json", mode="w") as json_file:
    json.dump(metadata, json_file, indent=4)

In [40]:
# Define a function to query the embeddings
def query_embeddings(query, index, metadata, k=5):
    # Generate the embedding for the query
    query_embedding = client.embeddings.create(
        input=[query],
        model="text-embedding-3-large"
    ).data[0].embedding
    # FAISS expects a 2D float32 array of shape (1, dimension)
    query_vector = np.array(query_embedding, dtype="float32").reshape(1, -1)
    # Search the FAISS index
    distances, indices = index.search(query_vector, min(k, len(metadata)))
    # Store the indices and distances
    stored_indices = indices[0].tolist()
    stored_distances = distances[0].tolist()
    # Return (recipe text, distance) pairs, skipping invalid indices
    result = [
        (metadata[i]["recipe_info"], dist)
        for i, dist in zip(stored_indices, stored_distances)
        if 0 <= i < len(metadata)
    ]
    return result
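
To make the search step concrete, here is a brute-force version of a nearest-neighbour lookup in plain Python: compute the squared L2 distance from the query to every stored vector, sort, and keep the `k` closest. This is only an illustration on made-up 2D vectors, with a hypothetical helper `top_k`; it is not how FAISS is implemented internally:

```python
def top_k(query_vector, vectors, k=5):
    """Return (index, squared L2 distance) pairs for the k nearest vectors."""
    scored = [
        (i, sum((q - x) ** 2 for q, x in zip(query_vector, vector)))
        for i, vector in enumerate(vectors)
    ]
    # Smallest distance first
    scored.sort(key=lambda pair: pair[1])
    return scored[:min(k, len(vectors))]


# Made-up vectors standing in for recipe embeddings
demo_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
demo_hits = top_k([0.9, 0.1], demo_vectors, k=2)
demo_indices = [index for index, _ in demo_hits]
```

The returned indices play the same role as `stored_indices` above: they are positions in `metadata`.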

In [41]:
# Test the retrieval system
query = "How to make bread?"
result = query_embeddings(query, index, metadata)

In [42]:
# Combine the results for the generation step (LLM)
def combine_retrieved_content(results):
    combined_content = "\n\n".join([result[0] for result in results])
    return combined_content

combine_content = combine_retrieved_content(result)
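
With larger values of `k` or long pages, the combined text can become long. A minimal sketch that joins the retrieved texts in ranking order and stops before a character budget is exceeded; the helper `combine_within_budget` and the default budget of 6000 characters are assumptions for illustration:

```python
def combine_within_budget(texts, max_chars=6000, separator="\n\n"):
    """Join texts in order, stopping before the total exceeds max_chars."""
    selected = []
    total = 0
    for text in texts:
        # The separator only counts once there is a previous text
        extra = len(text) + (len(separator) if selected else 0)
        if total + extra > max_chars:
            break
        selected.append(text)
        total += extra
    return separator.join(selected)


# Made-up texts: the third one does not fit in a budget of 10 characters
demo_combined = combine_within_budget(["aaaa", "bbbb", "cccc"], max_chars=10)
```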

## Define the generation

In [44]:
# Define the system prompt
system_prompt = f"""
You are a highly experienced expert chef specialized in providing cooking advice.
Your main task is to provide precise and accurate information based on the combined content.
Answer the query directly, using only information from the provided content: {combine_content}.
If you do not know the answer, just say that you do not know.
Your goal is to help the user answer the query: {query}.

"""

In [45]:
# Define the function that calls the chat completions API
def generate_response(query, combined_content, system_prompt):
	response = client.chat.completions.create(
		model = MODEL,
		messages = [
			{"role":"system", "content":system_prompt},
			{"role":"user","content": query},
			{"role":"assistant", "content":combined_content}
		],
		temperature = 0
	)
	return response.choices[0].message.content

In [47]:
# get the results
answer = generate_response(query=query, combined_content=combine_content, system_prompt=system_prompt)

In [49]:
display(Markdown(answer))

To make bread, you can follow one of these recipes:

### 1. Bannocks
**Ingredients:**
- 1 Cupful of Thick Sour Milk
- ½ Cupful of Sugar
- 1 Egg
- 2 Cupfuls of Flour
- ½ Cupful of Indian Meal
- 1 Teaspoonful of Soda
- A pinch of Salt

**Instructions:**
1. Make the mixture stiff enough to drop from a spoon.
2. Drop mixture, size of a walnut, into boiling fat.
3. Serve warm, with maple syrup.

---

### 2. Boston Brown Bread
**Ingredients:**
- 1 Cupful of Rye Meal
- 1 Cupful of Graham Meal
- ½ Cupful of Flour
- 1 Cupful of Indian Meal
- 1 Cupful of Sweet Milk
- 1 Cupful of Sour Milk
- 1 Cupful of Molasses
- ½ Teaspoonful of Salt
- 1 Heaping Teaspoonful of Soda

**Instructions:**
1. Stir the meals and salt together.
2. Beat the soda into the molasses until it foams.
3. Add sour milk, mix all together, and pour into a well-greased tin pail.

---

### 3. Nut Bread
**Ingredients:**
- 2½ Cupfuls of Flour
- 3 Teaspoonfuls of Baking Powder
- ¼ Teaspoonful of Salt
- ½ Cupful of Sugar
- 1 Egg
- 1 Cupful of Milk
- ¾ Cupful of English Walnut Meats, chopped fine

**Instructions:**
1. Beat egg and sugar together, then add milk and salt.
2. Sift the baking powder into the dry flour.
3. Combine all ingredients together, adding the nuts last while covering with a little flour to prevent falling.
4. Bake in a moderate oven for one hour.

---

### 4. Oatmeal Bread
**Ingredients:**
- 2 Cupfuls of Rolled Oats
- 3½ Cupfuls of Boiling Water
- ½ Cupful of Molasses
- 1 Yeast Cake
- Pinch of Salt
- Flour (enough to make a stiff dough)

**Instructions:**
1. Let the rolled oats and boiling water stand until cool.
2. Add the molasses, salt, and yeast cake (dissolved in cold water).
3. Stir in flour enough to make a stiff dough.
4. Let it rise overnight, then mold into loaves and let rise again.
5. Bake for forty-five minutes in a moderate oven.

---

### 5. Parker House Rolls
**Ingredients:**
- 1 Quart of Flour
- 1 Tablespoon of Lard
- 3 Tablespoons of Sugar
- ½ Teaspoon of Salt
- ½ Pint of Milk
- 1 Yeast Cake (dissolved in ½ cup of cold water)

**Instructions:**
1. Scald the milk. When nearly cold, add the yeast cake dissolved in cold water.
2. Rub the lard, sugar, and salt into the flour.
3. Stir all together with a knife and knead.
4. Let rise to twice its bulk and knead again.
5. Roll half an inch thick, cut into rounds, spread with butter, and double over.
6. Let rise again and bake for twenty minutes in a hot oven.

Choose any of these recipes to make your bread!

## Combine retriever and generation to get the RAG system

In [54]:
def rag_system(query, index, metadata, system_prompt, k=5):
    # Retrieval
    results = query_embeddings(query, index, metadata, k)
    # Merge the retrieved content
    combined_contents = combine_retrieved_content(results)
    # Generation
    response = generate_response(query, combined_contents, system_prompt)

    return response
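
One thing to keep in mind: `system_prompt` is an f-string, so `combine_content` and `query` are interpolated once, when that cell runs. Calling `rag_system` with a different query still sends the text retrieved for the first query inside the system message. A minimal sketch of building the prompt on each call instead, assuming a hypothetical template `PROMPT_TEMPLATE` and helper `build_system_prompt` (neither is in the original notebook); the context strings are placeholders:

```python
PROMPT_TEMPLATE = (
    "You are an expert chef who provides cooking advice.\n"
    "Answer using only the context below. "
    "If you do not know the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Query: {query}"
)


def build_system_prompt(query: str, context: str) -> str:
    """Fill the template with the query and the content retrieved for it."""
    return PROMPT_TEMPLATE.format(context=context, query=query)


# Placeholder contexts standing in for two different retrieval results
bread_prompt = build_system_prompt("How to make bread?", "placeholder context A")
vegan_prompt = build_system_prompt("I want something vegan", "placeholder context B")
```

Inside `rag_system`, the prompt would then be built from `combined_contents` right before the generation step.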

In [57]:
query2  = "I want something Vegan?"
answer2 = rag_system(query=query2, index=index, metadata=metadata, system_prompt=system_prompt)

In [59]:
display(Markdown(answer2))

The recipes provided are not specifically vegan, as they include ingredients like eggs, milk, and butter. However, you can modify some of the recipes to make them vegan-friendly by substituting animal products with plant-based alternatives. Here are some suggestions:

1. **Bannocks**: 
   - Substitute the egg with a flaxseed meal or chia seed mixture (1 tablespoon of flaxseed or chia seeds mixed with 2.5 tablespoons of water, let sit until it thickens).
   - Use a plant-based milk instead of sour milk.

2. **Boston Brown Bread**: 
   - Replace the egg with a flaxseed or chia seed mixture as mentioned above.
   - Use plant-based milk and ensure the molasses is vegan.

3. **Nut Bread**: 
   - Substitute the egg with a flaxseed or chia seed mixture.
   - Use plant-based milk.

4. **Oatmeal Bread**: 
   - This recipe can be made vegan by using plant-based milk and ensuring the yeast is vegan-friendly.

5. **Parker House Rolls**: 
   - Replace the lard with a vegan butter or coconut oil.
   - Use plant-based milk.

6. **Popovers**: 
   - This recipe is challenging to make vegan due to the nature of popovers, but you can experiment with a combination of aquafaba (chickpea water) and plant-based milk to create a similar texture.

If you need a specific vegan bread recipe, I can help create one based on common vegan ingredients. Let me know!