# Agentic RAG Example with LangChain, OpenAI, and Chroma

Source:

Vasilis Kalyvas, The easiest AI agent you will ever create!
https://generativeai.pub/a-very-simple-agentic-ai-implementation-28d59afb8096

### Process Flow:

The agent calls OpenAI’s API at every step, both to classify the question and to produce the answer:

1. The agent gets a question.

2. The agent analyzes whether the question is about **data_analysis** or **info_retrieval**.

3. If the question is about **info_retrieval**, the agent performs **RAG**: it retrieves the most relevant chunk of the PDF from the vector store **(Chroma)** and answers with the machine information found there.

4. If the question is about **data_analysis**, the agent asks the model to convert the question into a pandas expression and evaluates that expression against the dataset.
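
The flow above can be sketched without any API calls. This is only an illustration of the routing idea: `keyword_classifier` and the two placeholder handlers are hypothetical stand-ins for the LLM classifier, the RAG chain, and the pandas code generation used later in this notebook.

```python
from typing import Callable, Dict


def keyword_classifier(question: str) -> str:
    """Hypothetical stand-in for the LLM classifier: crude keyword matching."""
    analysis_words = ("how many", "total", "average", "count", "sum")
    q = question.lower()
    if any(word in q for word in analysis_words):
        return "data_analysis"
    return "info_retrieval"


def route(question: str,
          classify: Callable[[str], str],
          handlers: Dict[str, Callable[[str], str]]) -> str:
    """Classify the question, then dispatch it to the matching handler."""
    intent = classify(question)
    handler = handlers.get(intent)
    if handler is None:
        raise ValueError(f"Unknown intent: {intent!r}")
    return handler(question)


# Placeholder handlers (assumptions); the real ones call RAG or run pandas code
handlers = {
    "info_retrieval": lambda q: f"[RAG answer for: {q}]",
    "data_analysis": lambda q: f"[pandas result for: {q}]",
}

print(route("How many bottles were produced by machine A in total?",
            keyword_classifier, handlers))
print(route("Tell me more about machine B.", keyword_classifier, handlers))
```

Keeping the classifier and the handlers as plain callables makes it easy to swap the keyword stand-in for the LLM-based classifier later.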

In [1]:
import pandas as pd
import openai
import os
from pprint import pprint

# In newer LangChain releases, loaders and vector stores live in langchain_community
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

import warnings
warnings.filterwarnings("ignore")

# For simplicity the key is read from a local file here; in practice, prefer
# setting it as the OPENAI_API_KEY environment variable
with open("/Users/mjack6/.secrets/openai_mjack.apikey", "r") as key_file:
    openai_api_key = key_file.read().strip()
openai.api_key = openai_api_key
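
As the comment above suggests, an environment variable is the better home for the key. A minimal sketch, assuming the key is stored in `OPENAI_API_KEY` (the variable the OpenAI client reads by default); `load_api_key` and its `fallback_path` parameter are hypothetical helpers, not part of the original notebook.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_var: str = "OPENAI_API_KEY",
                 fallback_path: Optional[str] = None) -> str:
    """Return the key from the environment, else from an optional file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Hypothetical fallback: a local file that contains only the key
    if fallback_path and Path(fallback_path).is_file():
        return Path(fallback_path).read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or at {fallback_path!r}")
```

With this helper, the cell above could assign `openai_api_key = load_api_key()` and fail early with a clear message when no key is configured.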

### Decision Process

Decide if the question is about data_analysis or info_retrieval.

In [2]:
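# Ask the model to label a question as 'data_analysis' or 'info_retrieval'
def classify_intent(question):
    # Adjacent string literals avoid the stray indentation that a backslash
    # continuation would embed in the prompt
    prompt = (
        "Classify the intent of the following question as either "
        f"'data_analysis' or 'info_retrieval': {question} "
        "and reply with only the intent"
    )
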
    response = openai.chat.completions.create(
        model="gpt-4", 
        messages=[{"role": "system", "content": "You are an AI assistant that classifies questions."},
                  {"role": "user", "content": prompt}]
        )
    # Strip whitespace so the exact-match check in process_query is not thrown off
    return response.choices[0].message.content.strip()
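
The model does not always reply with the bare label; it may add quotes, capitals, or a prefix. A small, hedged sketch of how the reply could be normalized before the comparison in `process_query`: `normalize_intent` and `VALID_INTENTS` are hypothetical names, and the default of `data_analysis` mirrors the `else` branch used later in this notebook.

```python
VALID_INTENTS = ("data_analysis", "info_retrieval")


def normalize_intent(raw: str, default: str = "data_analysis") -> str:
    """Map a free-form model reply onto one of the known intent labels."""
    cleaned = raw.strip().strip("'\"`. ").lower()
    if cleaned in VALID_INTENTS:
        return cleaned
    # Handle replies such as "Intent: info_retrieval"
    for intent in VALID_INTENTS:
        if intent in cleaned:
            return intent
    return default


print(normalize_intent(" 'Info_Retrieval'.\n"))
print(normalize_intent("Intent: data_analysis"))
```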

In [3]:
# load and process PDF for RAG
def load_pdf_for_rag(pdf_path):
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    split_docs = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings(api_key=openai_api_key)
    vectorstore = Chroma.from_documents(split_docs, embeddings)
    return vectorstore
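
To get a feel for what `chunk_size=500` and `chunk_overlap=50` mean, here is a simplified sliding-window sketch in plain Python. It is an assumption-laden illustration, not how `CharacterTextSplitter` works internally: the real splitter first splits on a separator (`"\n\n"` by default) and then merges pieces, so its chunk boundaries will differ.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not made up of overlap only
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the vector store.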

### Load PDF

In [4]:
# The PDF must exist at this path (relative to the working directory);
# otherwise no retriever is created. k=1 returns only the single best-matching chunk.
pdf_path = "data/sample_machines.pdf"  # Replace with your PDF path

if os.path.exists(pdf_path):
    vectorstore = load_pdf_for_rag(pdf_path)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
else:
    retriever = None
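
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with toy three-dimensional vectors and cosine similarity; the vectors and chunk labels are made up for illustration, and Chroma's actual distance metric and index may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=1):
    """Return the texts of the k documents most similar to query_vec."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; real embeddings have many more dimensions
docs = [
    ("chunk about something else", [0.9, 0.1, 0.0]),
    ("chunk about Machine B (labeling)", [0.1, 0.9, 0.0]),
    ("chunk about Machine D (quality control scanner)", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.8, 0.1]  # pretend embedding of "Tell me more about machine B."

print(top_k(query_vec, docs, k=1))
```

With `k=1` the language model sees only one chunk, which keeps prompts short but can miss information that is spread across several chunks.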

### Define Pandas DataFrame

In [5]:
data = {
        'timestamp': [
            '2024-01-01 00:00:00',
            '2024-01-01 01:00:00',
            '2024-01-01 02:00:00',
            '2024-01-01 03:00:00',
            '2024-01-01 04:00:00',
            '2024-01-01 05:00:00',
            '2024-01-01 06:00:00',
            '2024-01-01 07:00:00',
            '2024-01-01 08:00:00',
            '2024-01-01 09:00:00',
            ],
        'machine': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B'],
        'bottles_produced': [100, 150, 120, 130, 110, 160, 115, 140, 125, 135],
        'operating_state': [
            'Running', 
            'Idle', 
            'Running', 
            'Failure',
            'Running', 
            'Idle', 
            'Running', 
            'Failure',
            'Running', 
            'Idle', 
            ],
        'failure': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
        'type_of_failure': [
            None, None, None, 'Mechanical',
            None, None, None, 'Mechanical',
            None, None,
            ],
    }
df = pd.DataFrame.from_dict(data)

### Reasoning by Agent

In [6]:
# function to handle user questions
def process_query(question):
    intent = classify_intent(question)
    
    if intent == "info_retrieval":
        if retriever:
            qa = RetrievalQA.from_chain_type(
                    llm=ChatOpenAI(api_key=openai_api_key), 
                    retriever=retriever)
            # invoke() replaces the deprecated run(); the answer is under "result"
            response = qa.invoke({"query": question})["result"]
            return response
        else:
            return "Machine information is unavailable."
        
    else:
        # Adjacent string literals keep stray indentation out of the prompt
        prompt = (
            "Convert the following question into a python "
            f"pandas command: {question}. Return only the python code "
            "in your response, without mentioning anything else, "
            "in plain text."
        )
                    
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are an AI assistant "
                        "that converts questions into python pandas commands. "
                        "You will be asked questions for a dataframe that "
                        "contains information about machines in a manufacturing "
                        "plant. The dataframe consists of: timestamp, machine, "
                        "bottles_produced, operating_state, failure, "
                        "type_of_failure. Remember you must return only the "
                        "python code."
                    ),
                },
                {"role": "user", "content": prompt}
            ]
        )
        
        code_snippet = response.choices[0].message.content
        pprint(f"This is the Python code snippet that was generated: \
            {code_snippet}")
        print()
        
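        # Caution: eval() runs model-generated code without a sandbox, so use
        # this only with trusted inputs in a throwaway environment
        try: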
            result = eval(code_snippet, {"df": df, "pd": pd})
            return result
        except Exception as e:
            return f"Error executing code: {e}"
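
Two things commonly go wrong at the `eval` step: the model wraps its answer in a Markdown code fence, or it returns statements (such as an `import`) instead of a single expression. A hedged sketch of a pre-check follows; `clean_snippet` and `is_single_expression` are hypothetical helpers, and passing this check does not make the code safe to run, since an expression can still call arbitrary functions.

```python
import ast

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep the example readable


def clean_snippet(text: str) -> str:
    """Strip surrounding whitespace and an optional Markdown code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()


def is_single_expression(code: str) -> bool:
    """True if the code parses as one Python expression, which is what eval() accepts."""
    try:
        ast.parse(code, mode="eval")
        return True
    except SyntaxError:
        return False


raw = FENCE + "python\ndf[df['machine'] == 'A']['bottles_produced'].sum()\n" + FENCE
code = clean_snippet(raw)
print(code)
print(is_single_expression(code))         # True
print(is_single_expression("import os"))  # False: a statement, not an expression
```

In `process_query`, such a check could run on `code_snippet` before the `eval` call, returning an error message when the snippet is not a single expression.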

### Execute Example User Queries

In [8]:
# example user queries
queries = [
    "How many bottles were produced by (machine) A in total?",
    "What machines in the dataset had failures?",
    "Tell me more about machine B.",
    "What is machine D doing? I see lots of failures."
]

In [11]:
for query in queries:
    result = process_query(query)
    print(result)
    print()

('This is the Python code snippet that was generated:             '
 "df[df['machine'] == 'A']['bottles_produced'].sum()")

335

('This is the Python code snippet that was generated:             '
 "df[df['failure'] == True]['machine'].unique()")

['D']

Machine B is a Labeling Machine responsible for applying labels to bottles as they move along the conveyor. It operates at a speed of 450 bottles per minute. Common operating states for Machine B include Running, Idle, and Adjustment. Typical failures that may occur with Machine B are misaligned labels and adhesive malfunctions.

Machine D, the Quality Control Scanner, is responsible for inspecting bottles for defects using an AI-powered vision system. It identifies issues such as incorrect labeling, improper sealing, and empty bottles. The failures mentioned in the context include sensor malfunctions and misclassification errors.



In [12]:
for query in queries:
    result = process_query(query)
    pprint(result)
    print()

('This is the Python code snippet that was generated:             '
 "df[df['machine'] == 'A']['bottles_produced'].sum()")

np.int64(335)

('This is the Python code snippet that was generated:             '
 "df[df['failure'] == True]['machine'].unique()")

array(['D'], dtype=object)

('Machine B is a Labeling Machine that applies labels to bottles as they move '
 'along the conveyor. It operates at a speed of 450 bottles per minute. The '
 'typical operating states of Machine B include Running, Idle, and Adjustment. '
 'Common failures that can occur with Machine B are misaligned labels and '
 'adhesive malfunctions.')

('Machine D, the Quality Control Scanner, is responsible for inspecting '
 'bottles for defects using an AI-powered vision system. It identifies issues '
 'such as incorrect labeling, improper sealing, and empty bottles. However, it '
 'is experiencing failures such as sensor malfunctions and misclassification '
 'errors.')



### Conclusion

The agent classified the first two questions as data analysis, so it generated the corresponding pandas code and executed it successfully. It returns only the bare result, but the output can be formatted further if needed.
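
One simple formatting step is to convert NumPy-style results such as `np.int64(335)` or `array(['D'], dtype=object)` into built-in Python types before displaying them. A minimal sketch, assuming the result exposes a `tolist()` method as NumPy scalars, NumPy arrays, and pandas Series do; `to_plain` is a hypothetical helper, and the standard-library `array` type stands in for NumPy here so the example has no dependencies.

```python
from array import array


def to_plain(value):
    """Convert objects that offer tolist() into built-in Python types."""
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


print(to_plain(array("i", [1, 2, 3])))  # [1, 2, 3]
print(to_plain(335))                    # 335 (already a plain int)
```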

The other two questions asked for information about the machines, so the agent retrieved the relevant passage of the PDF from the vector store and answered from it.

The agent thus follows the pattern **Plan ➡ Decision ➡ Action**.