diff --git a/langgraph_agent_with_genai/README.md b/langgraph_agent_with_genai/README.md
new file mode 100644
index 0000000..474e23d
--- /dev/null
+++ b/langgraph_agent_with_genai/README.md
@@ -0,0 +1,254 @@
+# Chat With Your Personal Files: A Real-Life Example of Using LangGraph and Oracle Generative AI to Summarize, Index, and Find Files Instantly
+
+
+We all know the frustration of searching endlessly for a document you know exists somewhere in your personal files. Traditional search often relies on exact file names or rigid folder structures, making it difficult to quickly find what you need. But what if you could simply chat with your documents and instantly locate the right file—summarized, indexed, and ready to use?
+
+In this article, I’ll share a real-life example of how I built a solution that does exactly that. By combining **LangGraph** with **Oracle Generative AI** and **Oracle Database 23ai**, I created an intelligent system that automatically extracts metadata, generates concise summaries, and indexes file contents for easy retrieval. The result is a conversational agent that transforms how we interact with our personal data: instead of searching through folders, you ask questions in natural language and the system finds the right document for you.
+
+>**Note**: This agent is intended for file search only. It does not have the ability to answer in-depth questions regarding the content of your files, as it can only access the summary of each document.
+
+# File Indexing: Discover what your files are about
+
+This blog demonstrates how to index PDF, image, DOCX, or TXT files using Python code, generative AI models, and Oracle Database 23ai.
+
+![T1_1](images/FileProcessingFlow.png "T1_1")
+
+
+>**IMPORTANT**: This blog is designed solely for educational and study purposes. It provides an environment for learners to experiment and gain practical experience in a controlled setting. It is crucial to note that the security configurations and practices employed in this lab might not be suitable for real-world scenarios.
+>
+> Security considerations for real-world applications are often far more complex and dynamic. Therefore, before implementing any of the techniques or configurations demonstrated here in a production environment, it is essential to conduct a comprehensive security assessment and review. This review should encompass all aspects of security, including access control, encryption, monitoring, and compliance, to ensure that the system aligns with the organization's security policies and standards.
+>
+> Security should always be a top priority when transitioning from a lab environment to a real-world deployment.
+
+
+## Technologies powering this solution
+ - [Oracle Autonomous Database 23ai](https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/autonomous-intro-adb.html#GUID-8EAA5AE6-397D-4E9A-9BD0-3E37A0345E24): Oracle 23ai is a converged database that seamlessly combines advanced vector search with proven relational capabilities, enabling AI-driven and traditional workloads in a single platform.
+ - [LangGraph](https://github.com/langchain-ai/langgraph): LangGraph is a low-level, stateful orchestration framework for building and managing long-running AI agents and complex multi-agent workflows using a graph-based architecture.
+ - [OCI Generative AI Services](https://docs.oracle.com/en-us/iaas/Content/generative-ai/getting-started.htm): OCI Generative AI is a managed service offering customizable LLMs for chat, text generation, and summarization.
+ - [Sentence Transformers](https://pypi.org/project/sentence-transformers/): The sentence-transformers package is a Python framework designed for creating and working with embeddings—dense vector representations of text that capture semantic meaning. Built on top of PyTorch and Hugging Face Transformers, it simplifies the process of encoding sentences, paragraphs, or documents into fixed-size vectors suitable for a wide range of Natural Language Processing (NLP) tasks.
+
+
+
+## Prerequisites - Oracle Cloud Infrastructure
+
+  - An Oracle Cloud account with admin-level access permissions; if you don't have admin access, apply the required policies described in [Getting Access to Generative AI](https://docs.oracle.com/en-us/iaas/Content/generative-ai/iam-policies.htm)
+  - The OCI CLI installed on your local machine; see details here: [Installing the CLI](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
+  - An Oracle Autonomous Database 23ai instance; see details here: [Provision an Autonomous Database Instance](https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/autonomous-provision.html)
+
+  - Python 3.12.10 installed; you can find more details here: [Simple Python Version Management: pyenv](https://github.com/pyenv/pyenv)
+  - OpenGL libraries installed on your machine. If you're using Oracle Linux, you can install them by running:
+    ```
+    # Install OpenGL
+    sudo dnf install -y mesa-libGL
+    ```
+    >**Note**: If you're using macOS, you don't need to install OpenGL.
+
+## Quick Start
+
+1. Collect all the needed database information:
+
+    1.1 - Go to your Oracle Autonomous Database 23ai instance, click **Database Connection**, and then click **Download Wallet**.
+    >**Note** Take note of your database wallet password!
+
+    1.2 - Click **Storage**; under **Object Storage & Archive Storage**, click **Buckets**. Create a new bucket to store your database wallet, then upload the downloaded wallet zip file.
+    >**Note** Take note of your namespace, bucket name, and wallet zip file name.
+
+
+    1.3 - Collect all that information and make a note of it — we’ll need it later.
+
+    | Variable | Value | Description |
+    |----------|--------|-------------------------------|
+    | DB_USER | "admin"| The administrative database username |
+    | DB_PASSWORD | "xxxxxx" | Your database password |
+    | DB_DSN | "xxxxxxx" | The database service name from your tnsnames.ora, e.g., yourdatabase_low |
+    | WALLET_PASSWORD | "xxxxxxxx" | Your wallet password |
+    | OCI_BUCKET_NAMESPACE | "your bucket namespace" | The namespace of your created bucket |
+    | OCI_BUCKET_NAME_WALLET | "your bucket name" | Your bucket name |
+    | OCI_WALLET_OBJECT_NAME | "your wallet zip file name" | The name of your database wallet zip file |
+    | OCI_COMPARTMENT_ID | "your compartment OCID" | Your compartment OCID |
+
+    > **Note** The tnsnames.ora file can be found inside the wallet zip file.
+
+
+
+2. Let's start indexing the files
+
+    2.1 - In a shell on your machine, get the code and create your .env file:
+    ````
+    git clone https://github.com/oracle-devrel/devrel-labs.git
+    cd devrel-labs/langgraph_agent_with_genai/src
+    ## Create your .env file from the template
+    cp .env_template .env
+    ````
+
+    2.2 - Set your values for the environment variables in **.env**
+    > **Note** Edit the .env file using your preferred text editor and fill in the values you collected in **step 1**.
+
+    ```
+    cat .env
+
+    OCI_CLI_PROFILE=DEFAULT
+    DB_USER="admin"
+    DB_PASSWORD="xxxxxxx"
+    DB_DSN="myrdbms_tp"
+    WALLET_PASSWORD="xxxxxxx"
+    OCI_BUCKET_NAMESPACE="xxxxxxxx"
+    OCI_BUCKET_NAME_WALLET="my-bucket"
+    OCI_WALLET_OBJECT_NAME="Wallet_MyRDBMS.zip"
+
+    OCI_COMPARTMENT_ID="ocid1.compartment.oc1..xxxxx"
+
+    #meta.llama-3.2-90b-vision-instruct
+    OCI_GENAI_IMAGE_MODEL_OCID="ocid1.generativeaimodel.oc1.sa-saopaulo-1.amaaaaaask7dceyalwceqwzlywqqxfzz3grpzjr42fej5qlybhu2d666oz4q"
+    OCI_IMAGE_MODEL_ENDPOINT="https://inference.generativeai.sa-saopaulo-1.oci.oraclecloud.com"
+
+    #meta.llama-3.3-70b-instruct
+    OCI_GENAI_ENDPOINT="https://inference.generativeai.sa-saopaulo-1.oci.oraclecloud.com"
+    OCI_GENAI_REASONING_MODEL_OCID="ocid1.generativeaimodel.oc1.sa-saopaulo-1.amaaaaaask7dceyarsn4m6k3aqvvgatida3omyprlcs3alrwcuusblru4jaa"
+
+    ```
+    > **IMPORTANT**: This sample code was validated using models available in the **sa-saopaulo-1** region. Model availability may vary across regions; please refer to the documentation for the latest details.
+    You can find more information here: [Pretrained Foundational Models in Generative AI](https://docs.oracle.com/en-us/iaas/Content/generative-ai/pretrained-models.htm#pretrained-models).
+    **If you use different models, you will need to adjust the distance thresholds in the search agent.**
+
+
+    We use different models for each requirement:
+    - OCI_GENAI_IMAGE_MODEL_OCID: This is used to extract text from images (OCR); you must choose a **multimodal** model.
+    - OCI_GENAI_REASONING_MODEL_OCID: This is the LLM used for generating metadata from, and reasoning over, the extracted text.
+
+
+
+    2.3 - Initialize the database table on Oracle 23ai.
+    Run the **init_database.py** script, which creates the table.
+
+    ```
+    ls -lrt
+    # Make sure you install all needed Python dependencies.
+    pip install -r requirements.txt
+    python --version
+    python init_database.py
+    ```
+
+    ![T02_03](images/T02_03_InitDatabase.png "T02_03")
+
+
+    2.4 - Run the AI file indexing.
+
+    > **Note**: For demonstration purposes, sample files have been created to support tool validation. These files simulate real documents, including lab test requisitions and results, a Brazilian driver’s license, and Brazilian invoices and receipts.
+    The files are located in the ./samples directory.
+
+    ```
+    python batch_process_samples.py
+
+    ```
+    ![T02_04](images/T02_04_IndexResultPart1.png "T02_04")
+    ![T02_04](images/T02_04_IndexResultPart2.png "T02_04")
+
+    The sample files from the **samples** directory are now successfully indexed and stored in your Oracle 23ai database.
+
+    2.5 - Let's validate our indexed data (this is not our Search Agent).
+    To validate the data, run the **validation.py** script. This script verifies the loaded data and checks the distances returned by the model when querying specific columns. The results can be used to confirm and refine the distance thresholds applied in the Search Agent.
+
+    ````
+    python validation.py
+    ````
+    ![T02_05](images/T02_05_Validation_1.png "T02_05")
+
+    ````
+    python validation.py PERSON_NAME_EMBEDDING "heloisa pires"
+    ````
+    ![T02_05](images/T02_05_Validation_2.png "T02_05")
+
+    > **Note**: As we can see, a simple vector search for a person's name returns a very small distance, which means our model is working as expected.
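+
+    For reference, the check that validation.py performs boils down to a vector-distance query. The sketch below is illustrative only and is not one of the repository's scripts; it assumes the repo's `jlibspython` helpers are importable and your `.env` embedding variables are set, and the lowercase result keys mirror how `execute_query` results are read elsewhere in this project:
+
+    ```
+    import os
+    from dotenv import load_dotenv
+    load_dotenv()
+
+    from jlibspython.oracledb_utils import execute_query
+    from jlibspython.proxy_embedding_helper import generate_embeddings_batch
+
+    # Embed the search term with the same model used at indexing time
+    vec = generate_embeddings_batch(
+        ["heloisa pires"],
+        compartment_id=os.environ["OCI_COMPARTMENT_ID"],
+        embedding_model=os.environ.get("OCI_EMBEDDING_MODEL_NAME", ""),
+        genai_endpoint=os.environ.get("OCI_EMBEDDING_ENDPOINT", ""),
+    )[0]
+    emb = "[" + ",".join(str(x) for x in vec) + "]"
+
+    # Smaller distance = closer semantic match
+    rows = execute_query(
+        """
+        SELECT SOURCE_FILE, PERSON_NAME,
+               VECTOR_DISTANCE(PERSON_NAME_EMBEDDING, VECTOR(:v)) AS DIST
+        FROM DOCUMENT_VECTORS
+        ORDER BY DIST
+        FETCH FIRST 5 ROWS ONLY
+        """,
+        {"v": emb},
+    )
+    for row in rows:
+        print(row.get("dist"), row.get("person_name"), row.get("source_file"))
+    ```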
+
+ +# AI Agent Search: Chat with your files + + The agent is designed to perform file retrieval based on both document summaries and associated metadata. Users can submit natural language queries about their files, and the agent will intelligently process the request to identify and return the most relevant documents. + + + ![T03_01](images/LangGraphPipelineSimplified.png "T03_01") + + ![T03_01](images/ToolPipeline_Simplified.png "T03_01") + ![T03_01](images/ToolPipeline_2_Simplified.png "T03_01") + + +
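+
+ For orientation, here is the LangGraph wiring from **AgentChat.py** (shown in full later in this diff) that implements the pipeline above: the agent decides whether a tool call is needed, tool hits are re-checked for actual relevance, and only then is the final answer written.
+
+ ```
+ # Condensed from AgentChat.py: the four-node pipeline shown in the diagrams
+ from langgraph.graph import StateGraph, END, MessagesState
+ from langgraph.prebuilt import ToolNode
+
+ graph = StateGraph(MessagesState)
+ graph.add_node("agent", agent_node)                     # decides whether a tool call is needed
+ graph.add_node("call_tools", ToolNode([search_documents, get_document_statistics]))
+ graph.add_node("analyze_relevance", analyze_relevance)  # filters hits by actual relevance
+ graph.add_node("synthesize", synthesize)                # writes the final user-facing answer
+ graph.set_entry_point("agent")
+ graph.add_conditional_edges("agent", router, {"call_tools": "call_tools", "end": END})
+ graph.add_edge("call_tools", "analyze_relevance")
+ graph.add_edge("analyze_relevance", "synthesize")
+ graph.add_edge("synthesize", END)
+ app = graph.compile()
+ ```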
+
+
+ ### Key Code Snippet: Leveraging OCI Generative AI in the Agent
+
+ ⚠️ **Warning**
+ When working with LangChain on OCI, make sure you use the official Python package **langchain_oci**. Many users get confused because there is also a community ChatOCIGenAI class, which is deprecated and should not be used. For more information, check [langchain_oci](https://docs.public.content.oci.oraclecloud.com/en-us/iaas/Content/generative-ai/langchain.htm).
+
+ The following snippet demonstrates the code currently used in this blog.
+
+ ```
+ # This is the correct package used in this agent
+ cat requirements.txt | grep langchain-oci==0.1.3
+ ```
+
+
+ ```
+ .
+ .
+ .
+ from langchain_oci.chat_models import ChatOCIGenAI
+
+ llm = ChatOCIGenAI(
+     model_id=model_id,
+     service_endpoint=endpoint,
+     compartment_id=compartment_id,
+     auth_type="API_KEY",
+     model_kwargs={"temperature": 0.2, "max_tokens": 1000}
+ )
+
+ ```
+
+
+
+
+ ### Run the agent by launching the **AgentChat.py** script and start asking questions...
+
+ ```
+ python AgentChat.py
+ ```
+ ![T03_01](images/AgentStart.png "T03_01")
+
+ Let's try an initial question: "**I need a document that is about receipt of physiotherapy**"
+
+ ![T03_01](images/AgentQuestion1_01.png "T03_01")
+ ![T03_01](images/AgentQuestion1_02.png "T03_01")
+
+ ### Playing with questions:
+
+ - **"Which documents talk about Brazilian beaches?"**
+ ![T03_01](images/Question2_result.png "T03_01")
+
+ - **"List test results from August 2025"**
+ ![T03_01](images/Question3_result.png "T03_01")
+
+ - **"List receipts that are from physiotherapy"**
+ ![T03_01](images/Question4_result.png "T03_01")
+
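+ ### Tuning tip: distance thresholds
+
+ As noted earlier, switching models means revisiting the distance thresholds. They are plain constants near the top of **agent_tools/search_tools.py**; use the distances reported by **validation.py** as a guide when adjusting them:
+
+ ```
+ # From agent_tools/search_tools.py: tune these when you change embedding models
+ FILTER_VECTOR_THRESHOLD = 0.08                # cutoff for DOC_TYPE / CATEGORY matches
+ FILTER_VECTOR_THRESHOLD_PERSON = 0.21         # cutoff for PERSON_NAME matches
+ FILTER_VECTOR_THRESHOLD_EMBEDDING_TEXT = 0.4  # declared for free-text matches
+ ```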
+ +## Conclusion + + +This project demonstrates how combining LangGraph, Oracle Generative AI, and Oracle Database 23ai can transform the way we interact with our files—making search faster, smarter, and more intuitive. While the implementation here focuses on document indexing and retrieval, the same principles can be extended to many other real-world scenarios. By experimenting with additional tools, refining thresholds, and expanding metadata extraction, you can evolve this agent into a powerful foundation for building AI-driven knowledge management systems. The journey does not end here—this is just the starting point for reimagining how we connect with our data. + +## Contributing + +This project is open source. Please submit your contributions by forking this repository and submitting a pull request! Oracle appreciates any contributions that are made by the open source community. + +## License + +Copyright (c) 2025 Oracle and/or its affiliates. + +Licensed under the Universal Permissive License (UPL), Version 1.0. + +See [LICENSE](../LICENSE) for more details. + +ORACLE AND ITS AFFILIATES DO NOT PROVIDE ANY WARRANTY WHATSOEVER, EXPRESS OR IMPLIED, FOR ANY SOFTWARE, MATERIAL OR CONTENT OF ANY KIND CONTAINED OR PRODUCED WITHIN THIS REPOSITORY, AND IN PARTICULAR SPECIFICALLY DISCLAIM ANY AND ALL IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. FURTHERMORE, ORACLE AND ITS AFFILIATES DO NOT REPRESENT THAT ANY CUSTOMARY SECURITY REVIEW HAS BEEN PERFORMED WITH RESPECT TO ANY SOFTWARE, MATERIAL OR CONTENT CONTAINED OR PRODUCED WITHIN THIS REPOSITORY. IN ADDITION, AND WITHOUT LIMITING THE FOREGOING, THIRD PARTIES MAY HAVE POSTED SOFTWARE, MATERIAL OR CONTENT TO THIS REPOSITORY WITHOUT ANY REVIEW. USE AT YOUR OWN RISK. 
diff --git a/langgraph_agent_with_genai/images/AgentChat_01.png b/langgraph_agent_with_genai/images/AgentChat_01.png new file mode 100644 index 0000000..a48afd3 Binary files /dev/null and b/langgraph_agent_with_genai/images/AgentChat_01.png differ diff --git a/langgraph_agent_with_genai/images/AgentQuestion1_01.png b/langgraph_agent_with_genai/images/AgentQuestion1_01.png new file mode 100644 index 0000000..16ad5fb Binary files /dev/null and b/langgraph_agent_with_genai/images/AgentQuestion1_01.png differ diff --git a/langgraph_agent_with_genai/images/AgentQuestion1_02.png b/langgraph_agent_with_genai/images/AgentQuestion1_02.png new file mode 100644 index 0000000..63164e0 Binary files /dev/null and b/langgraph_agent_with_genai/images/AgentQuestion1_02.png differ diff --git a/langgraph_agent_with_genai/images/AgentStart.png b/langgraph_agent_with_genai/images/AgentStart.png new file mode 100644 index 0000000..b4a1692 Binary files /dev/null and b/langgraph_agent_with_genai/images/AgentStart.png differ diff --git a/langgraph_agent_with_genai/images/FileProcessingFlow.png b/langgraph_agent_with_genai/images/FileProcessingFlow.png new file mode 100644 index 0000000..d301d13 Binary files /dev/null and b/langgraph_agent_with_genai/images/FileProcessingFlow.png differ diff --git a/langgraph_agent_with_genai/images/LangGraphPipelineSimplified.png b/langgraph_agent_with_genai/images/LangGraphPipelineSimplified.png new file mode 100644 index 0000000..241d9fa Binary files /dev/null and b/langgraph_agent_with_genai/images/LangGraphPipelineSimplified.png differ diff --git a/langgraph_agent_with_genai/images/Question2_result.png b/langgraph_agent_with_genai/images/Question2_result.png new file mode 100644 index 0000000..5ea9c63 Binary files /dev/null and b/langgraph_agent_with_genai/images/Question2_result.png differ diff --git a/langgraph_agent_with_genai/images/Question3_result.png b/langgraph_agent_with_genai/images/Question3_result.png new file mode 100644 index 0000000..178fea5 Binary files /dev/null and b/langgraph_agent_with_genai/images/Question3_result.png differ diff --git a/langgraph_agent_with_genai/images/Question4_result.png b/langgraph_agent_with_genai/images/Question4_result.png new file mode 100644 index 0000000..46a69c6 Binary files /dev/null and b/langgraph_agent_with_genai/images/Question4_result.png differ diff --git a/langgraph_agent_with_genai/images/T02_03_InitDatabase.png b/langgraph_agent_with_genai/images/T02_03_InitDatabase.png new file mode 100644 index 0000000..a83a358 Binary files /dev/null and b/langgraph_agent_with_genai/images/T02_03_InitDatabase.png differ diff --git a/langgraph_agent_with_genai/images/T02_04_IndexResultPart1.png b/langgraph_agent_with_genai/images/T02_04_IndexResultPart1.png new file mode 100644 index 0000000..a8385b5 Binary files /dev/null and b/langgraph_agent_with_genai/images/T02_04_IndexResultPart1.png differ diff --git a/langgraph_agent_with_genai/images/T02_04_IndexResultPart2.png b/langgraph_agent_with_genai/images/T02_04_IndexResultPart2.png new file mode 100644 index 0000000..b1816e1 Binary files /dev/null and b/langgraph_agent_with_genai/images/T02_04_IndexResultPart2.png differ diff --git a/langgraph_agent_with_genai/images/T02_05_Validation_1.png b/langgraph_agent_with_genai/images/T02_05_Validation_1.png new file mode 100644 index 0000000..15babb7 Binary files /dev/null and b/langgraph_agent_with_genai/images/T02_05_Validation_1.png differ diff --git a/langgraph_agent_with_genai/images/T02_05_Validation_2.png 
b/langgraph_agent_with_genai/images/T02_05_Validation_2.png new file mode 100644 index 0000000..e42b730 Binary files /dev/null and b/langgraph_agent_with_genai/images/T02_05_Validation_2.png differ diff --git a/langgraph_agent_with_genai/images/ToolPipeline_2_Simplified.png b/langgraph_agent_with_genai/images/ToolPipeline_2_Simplified.png new file mode 100644 index 0000000..4b1e34f Binary files /dev/null and b/langgraph_agent_with_genai/images/ToolPipeline_2_Simplified.png differ diff --git a/langgraph_agent_with_genai/images/ToolPipeline_Simplified.png b/langgraph_agent_with_genai/images/ToolPipeline_Simplified.png new file mode 100644 index 0000000..babf073 Binary files /dev/null and b/langgraph_agent_with_genai/images/ToolPipeline_Simplified.png differ diff --git a/langgraph_agent_with_genai/src/.env_template b/langgraph_agent_with_genai/src/.env_template new file mode 100644 index 0000000..919156a --- /dev/null +++ b/langgraph_agent_with_genai/src/.env_template @@ -0,0 +1,19 @@ +OCI_CLI_PROFILE=DEFAULT +DB_USER="admin" +DB_PASSWORD="xxxxxxx" +DB_DSN="myrdbms_tp" +WALLET_PASSWORD="xxxxxxx" +OCI_BUCKET_NAMESPACE="xxxxxxxx" +OCI_BUCKET_NAME_WALLET="my-bucket" +OCI_WALLET_OBJECT_NAME="Wallet_MyRDBMS.zip" + +OCI_COMPARTMENT_ID="ocid1.compartment.oc1..xxxxx" + +#meta.llama-3.2-90b-vision-instruct +OCI_GENAI_IMAGE_MODEL_OCID="ocid1.generativeaimodel.oc1.sa-saopaulo-1.amaaaaaask7dceyalwceqwzlywqqxfzz3grpzjr42fej5qlybhu2d666oz4q" +OCI_IMAGE_MODEL_ENDPOINT="https://inference.generativeai.sa-saopaulo-1.oci.oraclecloud.com" + +#meta.llama-3.3-70b-instruct +OCI_GENAI_ENDPOINT="https://inference.generativeai.sa-saopaulo-1.oci.oraclecloud.com" +OCI_GENAI_REASONING_MODEL_OCID="ocid1.generativeaimodel.oc1.sa-saopaulo-1.amaaaaaask7dceyarsn4m6k3aqvvgatida3omyprlcs3alrwcuusblru4jaa" +OCI_GENAI_REASONING_MODEL_NAME="meta.llama-3.3-70b-instruct" diff --git a/langgraph_agent_with_genai/src/AgentChat.py b/langgraph_agent_with_genai/src/AgentChat.py new file mode 100644 index 0000000..16b8206 --- /dev/null +++ b/langgraph_agent_with_genai/src/AgentChat.py @@ -0,0 +1,298 @@ +import os +import logging +from dotenv import load_dotenv +load_dotenv() + +from typing import List +from langchain_oci.chat_models import ChatOCIGenAI +from langchain_core.messages import SystemMessage, HumanMessage, BaseMessage, AIMessage, ToolMessage +from langgraph.graph import StateGraph, END, MessagesState +from langgraph.prebuilt import ToolNode +from agent_tools.search_tools import search_documents +from agent_tools.document_stats import get_document_statistics, load_document_statistics + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +compartment_id = os.environ.get("OCI_COMPARTMENT_ID") +endpoint = os.environ.get("OCI_GENAI_ENDPOINT") +model_id = os.environ.get("OCI_GENAI_REASONING_MODEL_NAME") + +llm = ChatOCIGenAI( + model_id=model_id, + service_endpoint=endpoint, + compartment_id=compartment_id, + auth_type="API_KEY", + model_kwargs={"temperature": 0.2, "max_tokens": 1000} +) + +llm_with_tools = llm.bind_tools([search_documents, get_document_statistics]) + +llm_final = ChatOCIGenAI( + model_id=model_id, + service_endpoint=endpoint, + compartment_id=compartment_id, + auth_type="API_KEY", + model_kwargs={"temperature": 0.2, "max_tokens": 1000} +) + +def build_app(doc_stats: str): + sysmsg = SystemMessage(content=f"""You are a document-search assistant for an Oracle 23ai repository. + +AVAILABLE DOCUMENTS SNAPSHOT: +{doc_stats} + +RULES: +- Use only the provided tools. 
Do not invent data or fetch from the internet.
+- Make sure all questions are related to searching documents; ignore unrelated questions such as "What is life?", "What is today?", or "What is a mountain?"
+- If search returns no items, answer exactly: No documents were found matching your search criteria
+- Do not carry filters across questions unless the user says also/additionally.
+- Prefer SUMMARY for content queries; PERSON_NAME for people; DOC_TYPE/CATEGORY only when explicitly requested; include original_query for date parsing.
+- When calling the search tool, pass a JSON string with keys: summary, person, doc_type, category, event_date_start, event_date_end, original_query.
+- Return file paths exactly as provided by the tool results.
+
+TOOL USAGE:
+- Use **search_documents** for content-based retrieval (questions like "documents about X", "files mentioning Y").
+- Use **get_document_statistics** for quantitative/statistical questions, such as:
+  * "How many documents do I have?"
+  * "How many documents are of type driver_license?"
+  * "What are the available doc types?"
+  * "What categories exist and how many documents in each?"
+  * "What document types are available?"
+- Never combine both tools in the same call. Pick the one that best matches the user question.
+- AFTER TOOLS: When tool results are present, produce a concise natural-language answer. Do not paste raw tool output. If results do not answer the question, say you don't know.
+""")
+
+    def agent_node(state: MessagesState) -> MessagesState:
+        msgs: List[BaseMessage] = state["messages"]
+        if not msgs or not isinstance(msgs[0], SystemMessage):
+            msgs = [sysmsg] + msgs
+        resp = llm_with_tools.invoke(msgs)
+        return {"messages": msgs + [resp]}
+
+    def router(state: MessagesState) -> str:
+        last = state["messages"][-1]
+        return "call_tools" if isinstance(last, AIMessage) and getattr(last, "tool_calls", None) else "end"
+
+    def analyze_relevance(state: MessagesState) -> MessagesState:
+        """
+        Analyzes tool results to determine which documents are actually relevant to the user's question.
+        This step filters and ranks documents based on actual relevance, not just semantic similarity.
+ """ + msgs: List[BaseMessage] = state["messages"] + + # Find the latest user question and gather conversation context + user_question = None + conversation_context = [] + + for msg in reversed(msgs): + if isinstance(msg, HumanMessage): + if user_question is None: # Get the most recent question + user_question = msg.content + conversation_context.insert(0, f"User: {msg.content}") + elif isinstance(msg, AIMessage) and not ("RELEVANCE_ANALYSIS:" in msg.content): + conversation_context.insert(0, f"Assistant: {msg.content}") + + # Keep only last few exchanges for context + conversation_context = conversation_context[-6:] # Last 3 user-assistant exchanges + + # Find the latest tool result + last_tool = None + for m in reversed(msgs): + if isinstance(m, ToolMessage) or getattr(m, "type", None) == "tool": + last_tool = m + break + + if not last_tool or not user_question: + return {"messages": msgs} + + raw_results = (last_tool.content or "").strip() + + # Check if we have search results (not statistics) + if not raw_results or raw_results == "[]" or "DOCUMENT STATISTICS" in raw_results: + return {"messages": msgs} + + # Create analysis prompt with conversation context + context_str = "\n".join(conversation_context) if conversation_context else "No previous context" + + analysis_prompt = f""" +You are analyzing search results to determine which documents actually answer the user's question. + +CONVERSATION CONTEXT: +{context_str} + +CURRENT USER QUESTION: "{user_question}" + +SEARCH RESULTS FROM DATABASE: +{raw_results} + +INSTRUCTIONS: +1. Consider the conversation context to understand references like "what about another category such as xxxx?", "what are the dates from these documents?", etc. +2. Carefully read each document's summary +3. Evaluate how well each document matches the user's intent and current question +4. Consider that results are ordered by semantic similarity, but similarity doesn't always mean relevance +5. Only include documents that genuinely relate to and can answer the user's question +6. If multiple documents are relevant, rank them by actual usefulness to answer the question +7. 
If NO documents actually answer the question, clearly state that + +ANALYSIS TASK: +- Identify which documents (if any) truly answer the user's question +- Explain why each selected document is relevant +- If none are relevant, explain why the search results don't match the question + +FORMAT YOUR RESPONSE AS: +RELEVANT DOCUMENTS: +[List only the truly relevant documents with their file_name and brief explanation of relevance] + +EXPLANATION: +[Brief explanation of your analysis and selection criteria] +""" + + # Get analysis from LLM + analysis_response = llm_final.invoke([HumanMessage(content=analysis_prompt)]) + + # Add analysis as AI message instead of ToolMessage to avoid tool_call_id issues + analysis_content = f"RELEVANCE_ANALYSIS:\n{analysis_response.content}\n\nORIGINAL_RESULTS:\n{raw_results}" + analysis_message = AIMessage(content=analysis_content) + + return {"messages": msgs + [analysis_message]} + + def synthesize(state: MessagesState) -> MessagesState: + msgs: List[BaseMessage] = state["messages"] + + # Look for analysis result first (AIMessage with RELEVANCE_ANALYSIS) + analysis_message = None + for m in reversed(msgs): + if isinstance(m, AIMessage) and m.content and "RELEVANCE_ANALYSIS:" in m.content: + analysis_message = m + break + + if analysis_message: + raw = analysis_message.content.strip() + else: + # Fallback to looking for tool messages + last_tool = None + for m in reversed(msgs): + if isinstance(m, ToolMessage) or getattr(m, "type", None) == "tool": + last_tool = m + break + if not last_tool: + return {"messages": msgs} + raw = (last_tool.content or "").strip() + + # Check if this is an analysis result or regular tool result + if raw.startswith("RELEVANCE_ANALYSIS:"): + # This is the analyzed result - use it directly + synth_instructions = SystemMessage(content=""" +You have received both the original search results and a relevance analysis. + +The analysis has already filtered which documents actually answer the user's question. + +INSTRUCTIONS: +1. Use the RELEVANCE_ANALYSIS section to understand which documents are truly relevant +2. Present the relevant documents in a clear, user-friendly format +3. If the analysis indicates no documents are relevant, clearly state that no matching documents were found +4. Include file paths exactly as provided +5. Be concise but informative + +FORMAT YOUR RESPONSE: +- If relevant documents were found: List them with file_name, summary, and why they're relevant +- If no relevant documents: Clearly state "No documents were found that match your search criteria" + +Do NOT paste the raw analysis - synthesize it into a natural response for the user. +""") + else: + # Fallback to original logic for statistics or other tools + if raw == "[]" or raw == "" or "No documents were found" in raw: + return {"messages": msgs + [AIMessage(content="No documents were found matching your search criteria")]} + + synth_instructions = SystemMessage(content=""" +You now have the tool results above. +Write the final answer to the user's last question using ONLY those tool results. +Do NOT paste raw JSON or echo the tool output verbatim. +Be concise and focus on what answers the question. +If results are insufficient, say you don't know. +Return file paths exactly as provided by the tool results when relevant. +If the user asks for a file's full path and it exists in the tool results, answer with the exact 'full_path'. +When listing search results, prefer showing both 'file_name' and 'full_path' if it helps the user. 
+Format your answer with a list of objects, like the sample: +- file_name: /full_path/of/the/file.txt +- summary: Document about amazon forest in 2024 forecast. +- person: John Silva +If a field is missing, use null. +""") + + resp = llm_final.invoke(msgs + [synth_instructions]) + return {"messages": msgs + [resp]} + + graph = StateGraph(MessagesState) + graph.add_node("agent", agent_node) + graph.add_node("call_tools", ToolNode([search_documents, get_document_statistics])) + graph.add_node("analyze_relevance", analyze_relevance) + graph.add_node("synthesize", synthesize) + graph.set_entry_point("agent") + graph.add_conditional_edges("agent", router, {"call_tools": "call_tools", "end": END}) + graph.add_edge("call_tools", "analyze_relevance") + graph.add_edge("analyze_relevance", "synthesize") + graph.add_edge("synthesize", END) + return graph.compile() + +def main(): + print("=== LangGraph Agent Chat ===") + print("Loading document statistics...") + doc_stats = load_document_statistics() + print("Statistics loaded!") + print("Type 'exit' to quit\n") + + app = build_app(doc_stats) + + # Simple in-memory conversation history + conversation_history = [] + max_history_length = 20 # Keep last 20 messages to avoid context overflow + + while True: + try: + user_input = input("You: ").strip() + if user_input.lower() in ["exit", "quit", "sair"]: + print("Ending chat...") + break + if not user_input: + continue + + # Add user message to history + user_message = HumanMessage(content=user_input) + conversation_history.append(user_message) + + # Create state with conversation history + state = {"messages": conversation_history.copy()} + out = app.invoke(state) + + # Find the latest AI response + reply = None + latest_ai_message = None + for m in reversed(out["messages"]): + if isinstance(m, AIMessage) and not ("RELEVANCE_ANALYSIS:" in m.content): + # Skip analysis messages, get only final response + reply = m.content + latest_ai_message = m + break + + # Add AI response to conversation history + if latest_ai_message: + conversation_history.append(latest_ai_message) + + # Keep history manageable - remove oldest messages if too long + if len(conversation_history) > max_history_length: + # Keep system message at beginning and remove oldest user/ai pairs + conversation_history = conversation_history[-max_history_length:] + + print(f"Agent: {reply or 'No response generated'}\n") + + except KeyboardInterrupt: + print("\nEnding chat...") + break + except Exception as e: + print(f"Error: {e}\n") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/langgraph_agent_with_genai/src/agent_tools/__init__.py b/langgraph_agent_with_genai/src/agent_tools/__init__.py new file mode 100644 index 0000000..201cab6 --- /dev/null +++ b/langgraph_agent_with_genai/src/agent_tools/__init__.py @@ -0,0 +1,9 @@ +""" +Agent Tools Package +Specialized tools for the document search LLM agent +""" + +from .search_tools import search_documents +from .document_stats import get_document_statistics + +__all__ = ['search_documents', 'get_document_statistics'] \ No newline at end of file diff --git a/langgraph_agent_with_genai/src/agent_tools/document_stats.py b/langgraph_agent_with_genai/src/agent_tools/document_stats.py new file mode 100644 index 0000000..9bbbf6c --- /dev/null +++ b/langgraph_agent_with_genai/src/agent_tools/document_stats.py @@ -0,0 +1,112 @@ +""" +Document statistics tool for LLM agent +Responsible for providing totals and counts of documents in the database +""" + +import logging +from typing 
import List +from langchain.tools import tool +from jlibspython.oracledb_utils import execute_query + +# Logging configuration +logger = logging.getLogger(__name__) + + +def load_document_statistics(): + """ + Loads document statistics for use in the agent prompt. + """ + try: + stats_result = get_document_statistics.invoke({}) + return "\n".join(stats_result) + except Exception as e: + logger.error(f"Error loading statistics: {e}") + return "Statistics not available" + +@tool(return_direct=True) +def get_document_statistics() -> List[str]: + """ + Returns statistics of documents available in the database. + + This tool provides quantitative information about stored documents: + - Total number of documents + - Number of documents by type (DOC_TYPE) + - Number of documents by category (CATEGORY) + + Useful for answering questions like: + - "How many documents do I have?" + - "How many receipt type documents?" + - "What document types are available?" + + Returns only totals, never specific document details. + + Returns: + List with formatted document statistics + """ + try: + logger.info("Loading document statistics...") + + # Query for general total + total_query = "SELECT COUNT(*) as total FROM DOCUMENT_VECTORS" + total_result = execute_query(total_query) + total_docs = total_result[0]['total'] if total_result else 0 + + # Query for count by DOC_TYPE + doc_type_query = """ + SELECT DOC_TYPE, COUNT(*) as quantidade + FROM DOCUMENT_VECTORS + WHERE DOC_TYPE IS NOT NULL + GROUP BY DOC_TYPE + ORDER BY quantidade DESC + """ + doc_type_results = execute_query(doc_type_query) + + # Query for count by CATEGORY + category_query = """ + SELECT CATEGORY, COUNT(*) as quantidade + FROM DOCUMENT_VECTORS + WHERE CATEGORY IS NOT NULL + GROUP BY CATEGORY + ORDER BY quantidade DESC + """ + category_results = execute_query(category_query) + + # Result formatting + stats_output = [] + + # Header + stats_output.append("=== DOCUMENT STATISTICS ===") + stats_output.append("") + + # General total + stats_output.append(f"TOTAL DOCUMENTS: {total_docs:,}") + stats_output.append("") + + # Documents by type + if doc_type_results: + stats_output.append("DOCUMENTS BY TYPE:") + for row in doc_type_results: + doc_type = row.get('doc_type', 'N/A') + quantidade = row.get('quantidade', 0) + stats_output.append(f"- {doc_type}: {quantidade:,} documents") + stats_output.append("") + + # Documents by category + if category_results: + stats_output.append("DOCUMENTS BY CATEGORY:") + for row in category_results: + category = row.get('category', 'N/A') + quantidade = row.get('quantidade', 0) + stats_output.append(f"- {category}: {quantidade:,} documents") + stats_output.append("") + + # Additional information + stats_output.append("Use the search_documents tool to find specific documents.") + + logger.info(f"Statistics loaded - {total_docs} total documents") + + return stats_output + + except Exception as e: + logger.error(f"Error loading statistics: {e}") + return [f"Error loading statistics: {str(e)}"] \ No newline at end of file diff --git a/langgraph_agent_with_genai/src/agent_tools/search_tools.py b/langgraph_agent_with_genai/src/agent_tools/search_tools.py new file mode 100644 index 0000000..fa1c96a --- /dev/null +++ b/langgraph_agent_with_genai/src/agent_tools/search_tools.py @@ -0,0 +1,358 @@ +""" +Search tools for LLM agent +Contains specific tools for searching documents in the DOCUMENT_VECTORS table +""" + +import logging +from typing import Optional, List +from langchain.tools import tool +from jlibspython.oracledb_utils 
import execute_query, filter_outliers_by_std_dev
+from jlibspython.proxy_embedding_helper import generate_embeddings_batch
+from jlibspython.llm_date_parser import parse_date_with_llm
+
+import os
+
+# Logging configuration
+logger = logging.getLogger(__name__)
+
+OCI_COMPARTMENT_ID = os.environ.get("OCI_COMPARTMENT_ID")
+OCI_EMBEDDING_ENDPOINT = os.environ.get("OCI_EMBEDDING_ENDPOINT")
+OCI_EMBEDDING_MODEL_NAME = os.environ.get("OCI_EMBEDDING_MODEL_NAME")
+FILTER_VECTOR_THRESHOLD = 0.08
+FILTER_VECTOR_THRESHOLD_PERSON = 0.21
+FILTER_VECTOR_THRESHOLD_EMBEDDING_TEXT = 0.4
+
+
+#@tool(return_direct=True)
+@tool()
+def search_documents(argumentos: str) -> List[str]:
+    """
+    Searches documents in the DOCUMENT_VECTORS table based on search criteria.
+
+    This tool allows filtering documents using different vector search criteria.
+    Arguments must be passed as a JSON string containing the desired criteria.
+
+    Available criteria:
+    - summary: Document content summary
+    - person: Name of person associated with document
+    - doc_type: Document type
+    - category: Document category
+    - event_date_start: Start date in DD/MM/YYYY format
+    - event_date_end: End date in DD/MM/YYYY format
+    - original_query: Original user query for intelligent date parsing
+
+    Returns a list of source files (SOURCE_FILE) that match the search criteria.
+
+
+    Args:
+        argumentos: JSON string with search criteria.
+                    Example: '{"doc_type": "receipt"}' or '{"summary": "contracts", "original_query": "show me contracts from last year"}'
+
+    Returns:
+        List of structured information from found documents
+    """
+    try:
+        logger.info(f"Current arguments: {argumentos}")
+        params_dict = parse_llm_json(argumentos)
+
+        sql, params = build_sql(params_dict, 'exact')
+
+        if len(params) == 0:
+            logger.info("No filters were specified, returning 0 documents.")
+            return []
+
+        results_raw = execute_query(sql, params)
+        logger.info(f"Query returned {len(results_raw)} documents")
+
+        if len(results_raw) == 0:
+            logger.info("Query returned no records, falling back to semantic search...")
+            sql, params = build_sql(params_dict, 'semantic')
+            results_raw = execute_query(sql, params)
+
+
+        # If a summary distance is present, keep the most relevant documents by removing distance outliers
+        if params_dict.get("summary"):
+            logger.info("Filtering outliers...")
+            results = filter_outliers_by_std_dev(results_raw, 'distance_summary')
+        else:
+            results = results_raw
+
+
+        # Format structured return with all information
+        formatted_results = []
+        for row in results:
+            if row.get('source_file'):
+                result_text = (
+                    f"file_name: {row.get('source_file', '')}\n"
+                    f"Summary: {row.get('summary', '')}\n"
+                    f"Doc Type: {row.get('doc_type', '')}\n"
+                    f"Category: {row.get('category', '')}\n"
+                    f"Person: {row.get('person_name', '')}\n"
+                    f"Event Date: {row.get('event_date', '')}\n"
+                    f"Distance: {row.get('distance_summary', 'N/A')}\n"
+                )
+
+                result_text += "---"
+                formatted_results.append(result_text)
+
+        logger.info(f"Found {len(formatted_results)} documents")
+
+
+        return formatted_results
+
+    except Exception as e:
+        logger.error(f"Error searching documents: {e}")
+        return [f"Search error: {str(e)}"]
+
+
+def _embed_literal(text: str) -> Optional[str]:
+    try:
+        vec = generate_embeddings_batch(
+            [text],
+            compartment_id=OCI_COMPARTMENT_ID,
+            embedding_model=OCI_EMBEDDING_MODEL_NAME,
+            genai_endpoint=OCI_EMBEDDING_ENDPOINT
+        )[0]
+        return "[" + ",".join(str(x) for x in vec) + "]"
+    except Exception as e:
+        logger.error(f"Embedding error for '{text}': {e}")
+        return None
+
+
+def build_sql(params_dict: dict, filter_mode: str) -> tuple:
+    try:
+
+        # Extract parameters from JSON and convert to lowercase
+        summary = params_dict.get("summary")
+        summary = summary.lower() if summary else None
+
+        person = params_dict.get("person")
+        person = person.lower() if person else None
+
+        doc_type = params_dict.get("doc_type")
+        doc_type = doc_type.lower() if doc_type else None
+
+        category = params_dict.get("category")
+        category = category.lower() if category else None
+
+        # Enhanced date parameter handling with LLM parsing only
+        event_date_start = params_dict.get("event_date_start")
+        event_date_end = params_dict.get("event_date_end")
+        original_query = params_dict.get("original_query")
+
+        # Try LLM date parsing if no structured dates provided
+        if not (event_date_start and event_date_end) and original_query:
+            llm_parsed_dates = parse_date_with_llm(original_query)
+            if llm_parsed_dates:
+                event_date_start = llm_parsed_dates.get("event_date_start")
+                event_date_end = llm_parsed_dates.get("event_date_end")
+                logger.info(f"Using LLM parsed dates: {event_date_start} to {event_date_end}")
+
+
+        logger.info("DEBUG - Extracted parameters:")
+        logger.info(f"  summary: {summary}")
+        logger.info(f"  person: {person}")
+        logger.info(f"  doc_type: {doc_type}")
+        logger.info(f"  category: {category}")
+        logger.info(f"  event_date_start: {event_date_start}")
+        logger.info(f"  event_date_end: {event_date_end}")
+        logger.info(f"  original_query: {original_query}")
+
+        # Build base SQL query
+        where_clause = []
+        params = {}
+
+        if person:
+            if filter_mode == 'semantic':
+                emb_str = _embed_literal(person)
+                where_clause.append(f"VECTOR_DISTANCE(PERSON_NAME_EMBEDDING, VECTOR(:person_embedding)) <= {FILTER_VECTOR_THRESHOLD_PERSON}")
+                params["person_embedding"] = emb_str
+            elif filter_mode == 'exact':
+                where_clause.append("lower(PERSON_NAME) like :person_name")
+                params["person_name"] = f"%{person}%"
+
+        if doc_type:
+            if filter_mode == 'semantic':
+                emb_str = _embed_literal(doc_type)
+                where_clause.append(f"VECTOR_DISTANCE(DOC_TYPE_EMBEDDING, VECTOR(:doc_type_embedding)) <= {FILTER_VECTOR_THRESHOLD}")
+                params["doc_type_embedding"] = emb_str
+            elif filter_mode == 'exact':
+                where_clause.append("lower(DOC_TYPE) = :doc_type")
+                params["doc_type"] = doc_type
+
+
+        if category:
+            if filter_mode == 'semantic':
+                emb_str = _embed_literal(category)
+                where_clause.append(f"VECTOR_DISTANCE(CATEGORY_EMBEDDING, VECTOR(:category_embedding)) <= {FILTER_VECTOR_THRESHOLD}")
+                params["category_embedding"] = emb_str
+            elif filter_mode == 'exact':
+                where_clause.append("lower(CATEGORY) = :category")
+                params["category"] = category
+
+
+        # Date filter using LLM-parsed dates only
+        if event_date_start and event_date_end:
+            where_clause.append('EVENT_DATE BETWEEN TO_DATE(:start_date, \'DD/MM/YYYY\') AND TO_DATE(:end_date, \'DD/MM/YYYY\')')
+            params["start_date"] = event_date_start
+            params["end_date"] = event_date_end
+            logger.info(f"Using date range filter: {event_date_start} to {event_date_end}")
+
+
+        # Dynamic SELECT construction
+        select_columns = [
+            "SOURCE_FILE",
+            "SUMMARY",
+            "DOC_TYPE",
+            "CATEGORY",
+            "PERSON_NAME",
+            "EVENT_DATE"
+        ]
+        if summary:
+            emb_str = _embed_literal(summary)
+            params["summary_embedding"] = emb_str
+            select_columns.append("VECTOR_DISTANCE(SUMMARY_EMBEDDING, VECTOR(:summary_embedding)) as DISTANCE_SUMMARY")
+
+        select_clause = ", ".join(select_columns)
+
+        where_sql = ""
+        if where_clause:
+            where_sql = "AND " + " AND ".join(where_clause)
+
+        # Conditional ORDER BY and LIMIT
+        if summary:
+            order_by = "ORDER BY DISTANCE_SUMMARY"
+            limit = "FETCH FIRST 20 ROWS ONLY"
+        else:
+            order_by = "ORDER BY SOURCE_FILE"
+            limit = "FETCH FIRST 20 ROWS ONLY"
+
+        sql = f"""
+            SELECT {select_clause}
+            FROM DOCUMENT_VECTORS
+            WHERE 1=1
+            {where_sql}
+            {order_by} {limit}
+        """
+
+        logger.info(f"Parameters: {params}")
+        logger.info(f"SQL (MODE: {filter_mode}) - {sql}")
+        return sql, params
+
+
+    except Exception as e:
+        logger.error(f"Error building SQL Query: {e}")
+        # Re-raise instead of returning a list: the caller unpacks "sql, params",
+        # so returning a list here would only mask the real error.
+        raise
+
+def parse_llm_json(json_input: str) -> dict:
+    """
+    Handles and cleans JSON sent by the LLM with different formatting issues.
+
+    Common problems:
+    - JSON with unnecessary escapes: "{\"category\": \"PIX\"}"
+    - External double quotes: "{"category": "PIX"}"
+    - Extra spaces or characters
+
+    Args:
+        json_input: Potentially malformed JSON string from LLM
+
+    Returns:
+        dict with parsed parameters or empty dict if it fails
+    """
+    if not json_input or not json_input.strip():
+        return {}
+
+    import json
+
+    # List of cleanup strategies in priority order
+    strategies = [
+        lambda x: json.loads(x),                                          # 1. Try direct
+        lambda x: json.loads(x.strip().strip('"')),                       # 2. Remove external quotes
+        lambda x: json.loads(x.replace('\\"', '"')),                      # 3. Remove escapes
+        lambda x: json.loads(x.strip().strip('"').replace('\\"', '"')),   # 4. Combined
+        lambda x: json.loads(x.strip().strip("'").replace("\\'", "'")),   # 5. Single quotes
+    ]
+
+    for i, strategy in enumerate(strategies, 1):
+        try:
+            result = strategy(json_input)
+            if isinstance(result, dict):
+                logger.info(f"JSON parsed successfully (strategy {i}): {result}")
+                return result
+        except json.JSONDecodeError:
+            continue
+        except Exception:
+            continue
+
+    # If no strategy worked
+    logger.error(f"Failed to parse JSON with all strategies: {json_input}")
+    return {}
+
+
+def parse_event_date(date_input: str) -> Optional[dict]:
+    """
+    Parses simple date formats and returns a SQL condition.
+
+    Supported formats:
+    - "2024-01-15" or "15/01/2024" -> specific date
+    - "2024-01-15 a 2024-01-31" or "01/01/2024 até 30/05/2024" -> period between dates
+
+    Returns:
+        dict with 'condition' (SQL string) and 'params' (parameter dict)
+        None if format not recognized
+    """
+    import re
+
+    date_str = date_input.strip()
+
+    try:
+        # Format: Brazilian period "01/01/2024 a 30/05/2024" or "01/01/2024 até 30/05/2024" ("a"/"até" means "to"/"until")
+        br_period_match = re.match(r'(\d{2}/\d{2}/\d{4})\s+(a|até)\s+(\d{2}/\d{2}/\d{4})', date_str)
+        if br_period_match:
+            start_date, _, end_date = br_period_match.groups()
+            return {
+                'condition': 'EVENT_DATE BETWEEN TO_DATE(:start_date, \'DD/MM/YYYY\') AND TO_DATE(:end_date, \'DD/MM/YYYY\')',
+                'params': {'start_date': start_date, 'end_date': end_date}
+            }
+
+        # Format: ISO period "2024-01-15 a 2024-01-31" or "2024-01-15 até 2024-01-31"
+        iso_period_match = re.match(r'(\d{4}-\d{2}-\d{2})\s+(a|até)\s+(\d{4}-\d{2}-\d{2})', date_str)
+        if iso_period_match:
+            start_date_iso, _, end_date_iso = iso_period_match.groups()
+
+            # Convert ISO dates to DD/MM/YYYY format
+            start_year, start_month, start_day = start_date_iso.split('-')
+            end_year, end_month, end_day = end_date_iso.split('-')
+
+            start_date = f"{start_day}/{start_month}/{start_year}"
+            end_date = f"{end_day}/{end_month}/{end_year}"
+
+            return {
+                'condition': 'EVENT_DATE BETWEEN TO_DATE(:start_date, \'DD/MM/YYYY\') AND TO_DATE(:end_date, \'DD/MM/YYYY\')',
+                'params': {'start_date': start_date, 'end_date': end_date}
+            }
+
+        # Format: specific date "15/01/2024"
+        if re.match(r'^\d{2}/\d{2}/\d{4}$', date_str):
+            # Already in DD/MM/YYYY format
+            return {
+                'condition': 'EVENT_DATE = TO_DATE(:event_date, \'DD/MM/YYYY\')',
+                'params': {'event_date': date_str}
+            }
+
+        # Format: ISO specific date "2024-01-15"
+        if re.match(r'^\d{4}-\d{2}-\d{2}$', date_str):
+            # Convert to DD/MM/YYYY format for Oracle
+            year, month, day = date_str.split('-')
+            oracle_date = f"{day}/{month}/{year}"
+            return {
+                'condition': 'EVENT_DATE = TO_DATE(:event_date, \'DD/MM/YYYY\')',
+                'params': {'event_date': oracle_date}
+            }
+
+        logger.warning(f"Date format not recognized: {date_input}")
+        return None
+
+    except Exception as e:
+        logger.error(f"Error parsing date '{date_input}': {e}")
+        return None
diff --git a/langgraph_agent_with_genai/src/app_specifics.py b/langgraph_agent_with_genai/src/app_specifics.py
new file mode 100644
index 0000000..fb8381d
--- /dev/null
+++ b/langgraph_agent_with_genai/src/app_specifics.py
@@ -0,0 +1,136 @@
+import logging
+from jlibspython.oracledb_utils import execute_query, parse_date
+from jlibspython.proxy_embedding_helper import generate_embeddings_batch
+import json
+from datetime import datetime
+
+logger = logging.getLogger(__name__)
+
+def file_already_exists(source_file: str) -> bool:
+    try:
+        sql = "SELECT COUNT(*) as count FROM document_vectors WHERE source_file = :source_file"
+        result = execute_query(sql, {"source_file": source_file})
+
+        if result and len(result) > 0 and result[0]['count'] > 0:
+            logger.info(f"File is already indexed, skipped: {source_file}")
+            return True
+
+        return False
+
+    except Exception as e:
+        logger.error(f"Error checking if file is already indexed: {e}")
+        return False
+
+def store_document_in_oracledb(
+    source_file: str,
+    chunk_text: str,
+    embedding_list: list[float],
+    metadata: dict,
+    created_on: datetime,
+    modified_on: datetime,
+    embedding_model: str,
+    compartment_id: str,
+    genai_endpoint: str
+):
+    logger.info("Storing data into Oracle")
+    try:
+        embedding_vector = embedding_list[0]
+        emb_str = "[" + ",".join(str(x) for x in embedding_vector) + "]"
+
+        sql = """
+            INSERT INTO document_vectors (
+                source_file,
+                chunk_text,
+                embedding,
+                summary,
+                doc_type,
+                category,
+                person_name,
+                event_date,
+                created_on,
+                modified_on,
+                full_metadata,
+                DOC_TYPE_EMBEDDING,
+                CATEGORY_EMBEDDING,
+                PERSON_NAME_EMBEDDING,
+                SUMMARY_EMBEDDING
+            ) VALUES (
+                :source_file,
+                :chunk_text,
+                :embedding,
+                :summary,
+                :doc_type,
+                :category,
+                :person_name,
+                :event_date,
+                :created_on,
+                :modified_on,
+                :full_metadata,
+                :doc_type_embedding,
+                :category_embedding,
+                :person_name_embedding,
+                :summary_embedding
+            )
+        """
+
+        logger.info(f"Detected metadata: {metadata}")
+        full_metadata_json = json.dumps(metadata, ensure_ascii=False)
+
+        # Guard against missing keys: calling .lower() on None would raise AttributeError
+        summary = (metadata.get("summary") or "").lower() or None
+        doc_type = (metadata.get("type") or "").lower() or None
+        category = (metadata.get("category") or "").lower() or None
+        event_date = parse_date(metadata.get("eventdate"))
+        person_name = (metadata.get("person") or "").lower() or None
+
+        logger.info("Embedding selected columns...")
+
+        doc_type_embed_str = ""
+        category_embed_str = ""
+        summary_embed_str = ""
+        person_name_embed_str = ""
+
+        if doc_type:
+            doc_type_embed = generate_embeddings_batch([doc_type], compartment_id=compartment_id, embedding_model=embedding_model, genai_endpoint=genai_endpoint)[0]
+            doc_type_embed_str = "[" + ",".join(str(x) for x in doc_type_embed) + "]"
+
+        if category:
+            category_embed = generate_embeddings_batch([category], compartment_id=compartment_id, embedding_model=embedding_model, genai_endpoint=genai_endpoint)[0]
+            category_embed_str = "[" + ",".join(str(x) for x in category_embed) + "]"
+
+        if summary:
+            summary_embed = generate_embeddings_batch([summary], compartment_id=compartment_id, embedding_model=embedding_model, genai_endpoint=genai_endpoint)[0]
+            summary_embed_str = "[" + ",".join(str(x) for x in summary_embed) + "]"
+
+        if person_name:
+            person_name_embed = generate_embeddings_batch([person_name], compartment_id=compartment_id, embedding_model=embedding_model, genai_endpoint=genai_endpoint)[0]
+            person_name_embed_str = "[" + ",".join(str(x) for x in person_name_embed) + "]"
+
+
+        execute_query(sql, {
+            "source_file": source_file,
+            "chunk_text": chunk_text,
+            "embedding": emb_str,
+            "summary": summary,
+            "doc_type": doc_type,
+            "category": category,
+            "person_name": person_name,
+            "event_date": event_date,
+            "created_on": created_on,
+            "modified_on": modified_on,
+            "full_metadata": full_metadata_json,
+            "doc_type_embedding": doc_type_embed_str,
+            "category_embedding": category_embed_str,
+            "person_name_embedding": person_name_embed_str,
+            "summary_embedding": summary_embed_str
+        })
+
+
+        return {
+            "status": "success",
+            **metadata
+        }
+
+    except Exception as e:
+        logger.error(f"Failed to store in Oracle DB: {str(e)}")
+        return {"status": "failed", "reason": f"SQL execution failed: {str(e)}"}
+
diff --git a/langgraph_agent_with_genai/src/batch_process_samples.py b/langgraph_agent_with_genai/src/batch_process_samples.py
new file mode 100644
index 0000000..90dedd3
--- /dev/null
+++ b/langgraph_agent_with_genai/src/batch_process_samples.py
@@ -0,0 +1,83 @@
+import logging
+import os
+import sys
+from dotenv import load_dotenv
+load_dotenv()
+
+from datetime import datetime
+from tzlocal import get_localzone
+from file_processor import processFile
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+def process_all_samples():
+    samples_dir = "samples"
+
+    if not os.path.exists(samples_dir):
+        print(f"Directory not found: {samples_dir}")
+        sys.exit(1)
+
+    files = [f for f in os.listdir(samples_dir) if os.path.isfile(os.path.join(samples_dir, f))]
+
+    if not files:
+        print(f"No files found in: {samples_dir}")
+        return
+
+    local_tz = get_localzone()
+    results = []
+
+    logger.info(f"Found {len(files)} files to process")
+
+    for filename in files:
+        file_path = os.path.join(samples_dir, filename)
+        local_file = os.path.abspath(file_path)
+
+        logger.info(f"Processing file: {filename}")
+
+        try:
+            created_on_ts = os.path.getctime(local_file)
+            modified_on_ts = os.path.getmtime(local_file)
+
+            created_on_dt = datetime.fromtimestamp(created_on_ts, tz=local_tz)
+            modified_on_dt = datetime.fromtimestamp(modified_on_ts, tz=local_tz)
+
+            print(f"\n=== Processing: {filename} ===")
+            print(f"Created on: {created_on_dt}")
+            print(f"Last modified on: {modified_on_dt}")
+
+            result = processFile(local_file, created_on_dt, modified_on_dt)
+
+            results.append({
+                "file": filename,
+                "result": result
+            })
+
+            logger.info(f"Results for {filename}: {result}")
+            print(f"Status: {result.get('status', 'unknown')}")
+
+        except Exception as e:
+            error_msg = f"Error processing file {filename}: {str(e)}"
+            logger.error(error_msg)
+            results.append({
+                "file": filename,
+                "result": {"status": "failed", "reason": error_msg}
+            })
+
+    # Summary
+    print(f"\n=== FINAL SUMMARY ===")
+    successful = sum(1 for r in results if r["result"].get("status") == "success")
+    failed = len(results) - successful
+
+    print(f"Total processed files: {len(results)}")
+    print(f"Success: {successful}")
+    print(f"Failed: {failed}")
+
+    if failed > 0:
+        print("\nFailed files:")
+        for r in results:
+            if r["result"].get("status") != "success":
+                print(f"- {r['file']}: {r['result'].get('reason', 'Unknown error')}")
+
+if __name__ == "__main__":
+    process_all_samples()
\ No newline at end of file
diff --git a/langgraph_agent_with_genai/src/db/DOCUMENT_VECTORS.sql b/langgraph_agent_with_genai/src/db/DOCUMENT_VECTORS.sql
new file mode 100644
index 0000000..b58395b
--- /dev/null
+++ b/langgraph_agent_with_genai/src/db/DOCUMENT_VECTORS.sql
@@ -0,0 +1,23 @@
+CREATE TABLE "DOCUMENT_VECTORS"
+   ("ID" NUMBER GENERATED BY DEFAULT AS IDENTITY MINVALUE 1 MAXVALUE 9999999999999999999999999999 INCREMENT BY 1 START WITH 1 CACHE 20 NOORDER NOCYCLE NOKEEP NOSCALE ,
+    "SOURCE_FILE" VARCHAR2(512 BYTE) COLLATE "USING_NLS_COMP",
+    "CHUNK_TEXT" CLOB COLLATE "USING_NLS_COMP",
+    "EMBEDDING" VECTOR(1024, *),
+    "SUMMARY" CLOB COLLATE "USING_NLS_COMP",
+    "DOC_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
+    "CATEGORY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
+    "PERSON_NAME" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
+    "FULL_METADATA" JSON,
+    "EVENT_DATE" DATE,
+    "CREATED_ON" DATE,
+    "MODIFIED_ON" DATE,
+    "DOC_TYPE_EMBEDDING" VECTOR(1024, *),
+    "CATEGORY_EMBEDDING" VECTOR(1024, *),
+    "PERSON_NAME_EMBEDDING" VECTOR(1024, *),
+    "SUMMARY_EMBEDDING" VECTOR(1024, *)
+   ) DEFAULT COLLATION "USING_NLS_COMP" ;
+
+ALTER TABLE "DOCUMENT_VECTORS" ADD PRIMARY KEY ("ID")
+  USING INDEX ENABLE;
+
+
diff --git a/langgraph_agent_with_genai/src/file_processor.py b/langgraph_agent_with_genai/src/file_processor.py
new file mode 100644
index 0000000..6b88ec8
--- /dev/null
+++ b/langgraph_agent_with_genai/src/file_processor.py
@@ -0,0 +1,192 @@
+import os
+import logging
+import numpy as np
+
+from jlibspython.file_utils import (pdf_has_image, extract_text_from_pdf_with_PyPDF, extract_text_from_doc, extract_text_from_txt,
+                                    normalize_text_list)
+from jlibspython.proxy_embedding_helper import (generate_embeddings_batch)
+from jlibspython.oci_utils_helpers import (extract_text_from_image_with_genAI, extract_metadata_from_chunks_GenAI)
+from app_specifics import store_document_in_oracledb, file_already_exists
+import datetime
+from pdf2image import convert_from_path
+import cv2
+
+logger = logging.getLogger(__name__)
+
+
+OCI_COMPARTMENT_ID = os.environ["OCI_COMPARTMENT_ID"]
+OCI_GENAI_ENDPOINT = os.environ["OCI_GENAI_ENDPOINT"]
+OCI_IMAGE_MODEL_ENDPOINT = os.environ["OCI_IMAGE_MODEL_ENDPOINT"]
+OCI_GENAI_IMAGE_MODEL_OCID = os.environ["OCI_GENAI_IMAGE_MODEL_OCID"]
+OCI_GENAI_REASONING_MODEL_OCID = os.environ["OCI_GENAI_REASONING_MODEL_OCID"]
+
+## If you decide to use a GenAI model for embedding instead of local embedding, you must set this variable
+if "OCI_EMBEDDING_MODEL_NAME" in os.environ:
+    OCI_EMBEDDING_MODEL_NAME = os.environ["OCI_EMBEDDING_MODEL_NAME"]
+else:
+    OCI_EMBEDDING_MODEL_NAME = ""
+## If you decide to use a GenAI model for embedding instead of local embedding, you must set this variable
+if "OCI_EMBEDDING_ENDPOINT" in os.environ:
+    OCI_EMBEDDING_ENDPOINT = os.environ["OCI_EMBEDDING_ENDPOINT"]
+else:
+    OCI_EMBEDDING_ENDPOINT = ""
+
+
+ENRICH_PROMPT="""
+You are an AI that extracts standardized metadata from document texts.
+The JSON fields to be returned, with examples:
+- type: try to identify the document type, based on the examples below:
+  - "voucher", "receipt", "invoice", "bill", "contract", "report", "payment_slip", "vehicle_document", "driver_license", "id_card", "taxpayer_id", "passport", "tax_form", "medical_prescription", "test_result", "exam_request"
+- category: one of the following — "PIX", "Payment Slip", "Health", "Work", "Tax", "Contract". If it doesn’t fit in any of these, set it as "Other".
+- person: the main person’s name in the document; return only the most important name.
+- eventdate: in the format YYYY-MM-DD
+- summary: a brief description of the content
+- Always respond with a **valid JSON**, **without markdown** or any extra text
+
+Examples:
+
+"Transfer receipt Pix by key May 2, 2025 R$ 600.00 debited account data name JOHN taxpayer_id 111.111.111-11"
+→ {"type": "voucher", "category": "PIX", "person": "JOHN SILVA", "eventdate": "2025-05-02", "summary": "PIX transfer receipt from Itau Bank, sent by JOHN SILVA to Larissa Manuela"}
+
+"Receipt, received from John Almada, taxpayer_id 111.111.111-11 the amount of R$ 23.00 for physiotherapy sessions"
+→ {"type": "receipt", "category": "Health", "person": "JOHN ALMADA", "eventdate": "2025-05-02", "summary": "Payment receipt for physiotherapy sessions on 2025-05-02, covering 5 sessions"}
+
+"Pix completed - Date and Time: 2025-06-02 - 13:52:05 - Name: FRANCISCO SILVA"
+→ {"type": "voucher", "category": "PIX", "person": "FRANCISCO SILVA", "eventdate": "2025-06-02", "summary": "PIX transfer receipt from FRANCISCO SILVA to EDUARDO SANTOS"}
+
+"Medical invoice April 30, 2025 R$ 600.00 Payer JOHN SILVA Beneficiary JOHN SILVA Professional CAMILA SILVA"
+→ {"type": "invoice", "category": "Health", "person": "JOHN SILVA", "eventdate": "2025-04-30", "summary": "Medical invoice for three psychotherapy sessions in April 2025 for John Silva, issued by Camila Silva"}
+
+"Digital Driver License - Name: CAROLINE SILVA - ISSUE DATE: 2021-07-05"
+→ {"type": "driver_license", "category": "Document", "person": "CAROLINE SILVA", "eventdate": "2021-07-05", "summary": "Driver License issued to Caroline Silva"}
+
+"Tax Form - Name CAROLINE SILVA - Period: 2018-12-31"
+→ {"type": "tax_form", "category": "Tax", "person": "CAROLINE SILVA", "eventdate": "2018-12-31", "summary": "Tax Form related to the fiscal period of 2018"}
+
+"Exam request - Name: CAROLINE SILVA - Date: 2021-01-13"
+→ {"type": "medical_prescription", "category": "Health", "person": "CAROLINE SILVA", "eventdate": "2021-01-13", "summary": "Medical prescription with exam requests for Caroline Silva"}
+
+Text:
+"""
+
+
+def extract_text_from_pdf_Images(pdf_path):
+    try:
+        logger.info(f"Converting PDF to images: {pdf_path}")
+        pages = convert_from_path(pdf_path)
+        logger.info(f"{len(pages)} pages found")
+        all_text = []
+
+        for i, page in enumerate(pages):
+            logger.info(f"Processing page {i + 1}")
+            img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2BGR)
+            page_text = extract_text_from_image_with_genAI(img_array=img, ocid_compartment_id=OCI_COMPARTMENT_ID,
+                                                           oci_genai_endpoint=OCI_IMAGE_MODEL_ENDPOINT,
+                                                           ocid_genai_model=OCI_GENAI_IMAGE_MODEL_OCID)
+            all_text.extend(page_text)
+        logger.info("Finished all pages")
+        return all_text
+    except Exception as e:
+        logger.error(f"Failed to process PDF: {e}")
+        return []
+
+
+def process_file_with_ocr(local_file: str):
+
+    if not os.path.exists(local_file):
+        logger.error(f"File not found: {local_file}")
+        raise FileNotFoundError(f"{local_file} not found")
+
+    file_ext = os.path.splitext(local_file)[1].lower()
+    extracted = ""
+
+    if file_ext == ".pdf":
+        logger.info(f"Detected PDF file: {local_file}")
+
+        extracted_text_from_image = []
+        if pdf_has_image(local_file):
+            logger.info(f"{local_file} contains images; extracting text from each page image...")
+            extracted_text_from_image = extract_text_from_pdf_Images(local_file)
+
+        logger.info(f"Extracting text from {local_file} with PyPDF")
+        extracted_text = extract_text_from_pdf_with_PyPDF(local_file)
+
+        extracted = extracted_text + extracted_text_from_image
+
+    elif 
file_ext in [".jpg", ".jpeg", ".png"]:
+        logger.info(f"Detected image file: {local_file}")
+        logger.info("Opening image")
+        img = cv2.imread(local_file)
+        extracted = extract_text_from_image_with_genAI(img_array=img, ocid_compartment_id=OCI_COMPARTMENT_ID,
+                                                       oci_genai_endpoint=OCI_IMAGE_MODEL_ENDPOINT,
+                                                       ocid_genai_model=OCI_GENAI_IMAGE_MODEL_OCID)
+
+    elif file_ext in [".doc", ".docx"]:
+        logger.info(f"Detected Word document: {local_file}")
+        extracted = extract_text_from_doc(local_file)
+
+    elif file_ext == ".txt":
+        logger.info(f"Detected TXT file: {local_file}")
+        extracted = extract_text_from_txt(local_file)
+
+    else:
+        logger.error("Unsupported file type. Only PDF, JPG, JPEG, PNG, DOC, DOCX, or TXT are supported.")
+        raise ValueError(f"Unsupported file type: {file_ext}")
+
+    return normalize_text_list(extracted)
+
+
+def processFile(source_file_path: str, created_on: datetime, modified_on: datetime):
+
+    # Check if file is already indexed
+    if file_already_exists(source_file_path):
+        return {"status": "skipped", "reason": "File already exists in database"}
+
+    try:
+        logger.info(f"Extract text chunks from file [{source_file_path}]...")
+        chunks = process_file_with_ocr(source_file_path)
+    except Exception as e:
+        return {"status": "failed", "reason": f"Text extraction error: {str(e)}"}
+
+    if not chunks:
+        return {"status": "failed", "reason": "No text extracted from file."}
+
+    try:
+        # Retry the LLM call up to 3 times: the model occasionally returns
+        # incomplete JSON, so only break out once "summary" and "type" are present.
+        for attempt in range(3):
+            logger.info(f"Capture metadata using LLM [attempt {attempt + 1}/3] - model {OCI_GENAI_REASONING_MODEL_OCID}")
+            metadata = extract_metadata_from_chunks_GenAI(chunks=chunks, prompt_text=ENRICH_PROMPT, ocid_compartment_id=OCI_COMPARTMENT_ID,
+                                                          ocid_genai_model=OCI_GENAI_REASONING_MODEL_OCID, oci_genai_endpoint=OCI_GENAI_ENDPOINT)
+            if metadata.get("summary") and metadata.get("type"):
+                logger.info("LLM completed!")
+                break
+            logger.info(f"Metadata output is incomplete, retrying: {metadata}")
+        else:
+            # The for/else branch runs only when the loop finished without a break
+            return {"status": "failed", "reason": "Failed to extract metadata after 3 attempts"}
+
+    except Exception as e:
+        return {"status": "failed", "reason": f"Metadata enrichment error: {str(e)}"}
+
+    # Save all chunks into a single CLOB column on DB
+    file_name = os.path.basename(source_file_path)
+    merged_clob_chunks = file_name + "\n" + "\n".join(chunks)
+
+    try:
+        logger.info("Starting to embed the extracted data...")
+        embeddings = generate_embeddings_batch(chunks, compartment_id=OCI_COMPARTMENT_ID, embedding_model=OCI_EMBEDDING_MODEL_NAME, genai_endpoint=OCI_EMBEDDING_ENDPOINT)
+        logger.info("Finished embedding!")
+    except Exception as e:
+        return {"status": "failed", "reason": f"Embedding generation failed: {str(e)}"}
+
+
+    result = store_document_in_oracledb(source_file_path, merged_clob_chunks, embeddings, metadata, created_on, modified_on,
+                                        compartment_id=OCI_COMPARTMENT_ID, embedding_model=OCI_EMBEDDING_MODEL_NAME, genai_endpoint=OCI_EMBEDDING_ENDPOINT)
+    return result
+
+
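+# --- Usage sketch (illustrative, not part of the pipeline) -------------------
+# A minimal way to try processFile on a single local file. It assumes the OCI
+# environment variables above are set and the database is reachable; the sample
+# path is hypothetical -- point it at any file under src/samples.
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.INFO)
+    _now = datetime.datetime.now()
+    print(processFile("samples/US_Invoice_01.png", _now, _now))
diff --git a/langgraph_agent_with_genai/src/init_database.py b/langgraph_agent_with_genai/src/init_database.py
new file mode 100644
index 0000000..5ba7683
--- /dev/null
+++ b/langgraph_agent_with_genai/src/init_database.py
@@ -0,0 +1,44 @@
+import os
+import logging
+from dotenv import load_dotenv
+load_dotenv()
+
+
+from jlibspython.oracledb_utils import execute_ddl
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+def init_database():
+    """Initialize the database by running the 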
DOCUMENT_VECTORS.sql DDL script"""
+
+    # Get the absolute path to the SQL file
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+    sql_file_path = os.path.join(script_dir, "db", "DOCUMENT_VECTORS.sql")
+
+    if not os.path.exists(sql_file_path):
+        logger.error(f"SQL file not found: {sql_file_path}")
+        return False
+
+    try:
+        # Read the SQL file
+        with open(sql_file_path, 'r') as file:
+            sql_content = file.read()
+
+        logger.info(f"Reading DDL script from: {sql_file_path}")
+        logger.info("Executing database initialization...")
+
+        # Execute the DDL
+        result = execute_ddl(sql_content)
+
+        logger.info("Database initialization completed successfully!")
+        logger.info(f"Result: {result}")
+        return True
+
+    except Exception as e:
+        logger.error(f"Failed to initialize database: {e}")
+        return False
+
+if __name__ == "__main__":
+    success = init_database()
+    exit(0 if success else 1)
\ No newline at end of file
diff --git a/langgraph_agent_with_genai/src/jlibspython/file_utils.py b/langgraph_agent_with_genai/src/jlibspython/file_utils.py
new file mode 100644
index 0000000..4633dd6
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/file_utils.py
@@ -0,0 +1,59 @@
+
+
+import fitz
+import logging
+from pypdf import PdfReader
+import docx
+from unicodedata import normalize
+
+logger = logging.getLogger(__name__)
+
+
+def pdf_has_image(file_path):
+    """Return True if at least one page of the PDF embeds an image."""
+    doc = fitz.open(file_path)
+    try:
+        for page in doc:
+            if page.get_images(full=True):
+                return True
+        return False
+    finally:
+        doc.close()
+
+
+
+def extract_text_from_pdf_with_PyPDF(file_path):
+    extracted_text = []
+    try:
+        reader = PdfReader(file_path)
+        for page in reader.pages:
+            text = page.extract_text()
+            if text and text.strip():
+                extracted_text.append(text.strip())
+    except Exception as e:
+        logger.error(f"Failed to extract text via PyPDF: {e}")
+
+    return extracted_text
+
+def extract_text_from_doc(file_path):
+    doc = docx.Document(file_path)
+    paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
+    if paragraphs:
+        return [" ".join(paragraphs)]
+    return []
+
+
+def extract_text_from_txt(file_path):
+    with open(file_path, "r", encoding="utf-8") as f:
+        content = f.read().strip()
+    return [content] if content else []
+
+def normalize_text_list(text_list):
+    def clean(t):
+        try:
+            return normalize('NFKC', t)
+        except Exception as e:
+            print(f"Error normalizing text: {e}")
+            return t
+
+    return [clean(t) for t in text_list if isinstance(t, str)]
diff --git a/langgraph_agent_with_genai/src/jlibspython/llm_date_parser.py b/langgraph_agent_with_genai/src/jlibspython/llm_date_parser.py
new file mode 100644
index 0000000..e943623
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/llm_date_parser.py
@@ -0,0 +1,178 @@
+"""
+LLM-based intelligent date parsing utility
+Converts natural language date expressions to structured date ranges
+"""
+
+import os
+import logging
+from datetime import datetime, timedelta
+from typing import Dict, Optional
+import json
+from langchain_oci.chat_models import ChatOCIGenAI
+
+
+logger = logging.getLogger(__name__)
+
+def get_current_date_context() -> Dict[str, str]:
+    """Get current date context for LLM parsing"""
+    now = datetime.now()
+    last_month = now.replace(day=1) - timedelta(days=1)
+
+    return {
+        "current_date": now.strftime("%d/%m/%Y"),
+        "current_month": now.strftime("%B %Y"),
+        "current_year": str(now.year),
+        "last_month": last_month.strftime("%B %Y"),
+        "last_year": str(now.year - 1)
+    }
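+
+# Illustrative example (assuming an English locale, since %B is locale-aware):
+# with a current date of 15/03/2024 the context above would be
+#   {"current_date": "15/03/2024", "current_month": "March 2024",
+#    "current_year": "2024", "last_month": "February 2024", "last_year": "2023"}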
+
+def validate_date_format(date_str: str) -> bool:
+    """Validate if date string is in DD/MM/YYYY format"""
+    try:
+        datetime.strptime(date_str, "%d/%m/%Y")
+        return True
+    except ValueError:
+        return False
+
+def parse_llm_json_response(json_input: str) -> dict:
+    """
+    Parse JSON response from LLM with cleanup strategies
+    Reuses logic from search_tools but focused on date parsing
+    """
+    if not json_input or not json_input.strip():
+        return {}
+
+    # List of cleanup strategies
+    strategies = [
+        lambda x: json.loads(x),
+        lambda x: json.loads(x.strip().strip('"')),
+        lambda x: json.loads(x.replace('\\"', '"')),
+        lambda x: json.loads(x.strip().strip('"').replace('\\"', '"')),
+        lambda x: json.loads(x.strip().strip("'").replace("\\'", "'")),
+    ]
+
+    for i, strategy in enumerate(strategies, 1):
+        try:
+            result = strategy(json_input)
+            if isinstance(result, dict):
+                logger.debug(f"Date JSON parsed successfully (strategy {i}): {result}")
+                return result
+        except Exception:
+            # json.JSONDecodeError is already a subclass of Exception
+            continue
+
+    logger.error(f"Failed to parse date JSON: {json_input}")
+    return {}
+
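+# Illustrative: the strategies above let slightly malformed model output through.
+# For instance, a response whose raw text is "{\"event_date_start\": \"01/01/2024\"}"
+# (extra surrounding quotes, escaped inner quotes) fails plain json.loads but
+# parses on the strip-quotes-and-unescape strategy, yielding
+# {"event_date_start": "01/01/2024"}.
+
+def parse_date_with_llm(user_query: str, current_date: Optional[str] = None) -> Dict[str, str]:
+    """
+    Uses LLM to extract date ranges from natural language queries.
+
+    Args:
+        user_query: Natural language query containing date references
+        current_date: Optional current date context (defaults to today)
+
+    Returns:
+        Dict with keys "event_date_start" and "event_date_end" in DD/MM/YYYY format
+        Empty dict if no dates found or parsing fails
+
+    Examples:
+        "documents from 2024" -> {"event_date_start": "01/01/2024", "event_date_end": "31/12/2024"}
+        "files from April 2024" -> {"event_date_start": "01/04/2024", "event_date_end": "30/04/2024"}
+        "last month reports" -> {"event_date_start": "01/02/2024", "event_date_end": "29/02/2024"}
+    """
+    try:
+        # Get date context
+        date_context = get_current_date_context()
+        if current_date:
+            date_context["current_date"] = current_date
+
+        # Create focused prompt for date extraction
+        prompt = f"""You are a date extraction expert. Extract date ranges from queries and return ONLY valid JSON. 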
+ +CONTEXT: +- Current date: {date_context['current_date']} +- Current month: {date_context['current_month']} +- Current year: {date_context['current_year']} +- Last month: {date_context['last_month']} +- Last year: {date_context['last_year']} + +RULES: +- Always return dates in DD/MM/YYYY format +- For year only: start = 01/01/YYYY, end = 31/12/YYYY +- For month/year: start = first day, end = last day of month +- For single dates: start = end = same date +- If no date found, return empty JSON: {{}} + +EXAMPLES: +"documents from 2024" → {{"event_date_start": "01/01/2024", "event_date_end": "31/12/2024"}} +"files from April 2024" → {{"event_date_start": "01/04/2024", "event_date_end": "30/04/2024"}} +"docs from January 15, 2024" → {{"event_date_start": "15/01/2024", "event_date_end": "15/01/2024"}} +"reports from last year" → {{"event_date_start": "01/01/{date_context['last_year']}", "event_date_end": "31/12/{date_context['last_year']}"}} +"contracts from Q1 2024" → {{"event_date_start": "01/01/2024", "event_date_end": "31/03/2024"}} + +QUERY: {user_query} + +Return only the JSON object:""" + + + llm = ChatOCIGenAI( + model_id=os.environ.get("OCI_GENAI_REASONING_MODEL_NAME"), + service_endpoint=os.environ.get("OCI_GENAI_ENDPOINT"), + compartment_id=os.environ.get("OCI_COMPARTMENT_ID"), + auth_type="API_KEY", + model_kwargs={ + "temperature": 0.1, # Low temperature for consistent parsing + "max_tokens": 150 + } + ) + + # Get LLM response + response = llm.invoke(prompt) + logger.debug(f"LLM date parsing response: {response.content}") + + # Parse JSON response + result = parse_llm_json_response(response.content.strip()) + + # Validate result structure and format + if result and "event_date_start" in result and "event_date_end" in result: + start_date = result["event_date_start"] + end_date = result["event_date_end"] + + # Validate date formats + if validate_date_format(start_date) and validate_date_format(end_date): + logger.info(f"LLM successfully parsed dates from '{user_query}': {start_date} to {end_date}") + return { + "event_date_start": start_date, + "event_date_end": end_date + } + else: + logger.warning(f"LLM returned invalid date format: {result}") + + logger.info(f"No valid dates extracted from: '{user_query}'") + return {} + + except Exception as e: + logger.error(f"Error in LLM date parsing for query '{user_query}': {e}") + return {} + +# Convenience function for quick testing +def test_date_parsing(): + """Test function to validate date parsing with common queries""" + test_queries = [ + "documents from 2024", + "files from April 2024", + "reports from last month", + "contracts from Q1 2024", + "docs from January 15, 2024", + "files from last year", + "documents from 15/01/2024 to 30/01/2024" + ] + + for query in test_queries: + result = parse_date_with_llm(query) + print(f"Query: '{query}' -> {result}") + +if __name__ == "__main__": + # Enable testing when run directly + logging.basicConfig(level=logging.INFO) + test_date_parsing() \ No newline at end of file diff --git a/langgraph_agent_with_genai/src/jlibspython/local_embedding_utils.py b/langgraph_agent_with_genai/src/jlibspython/local_embedding_utils.py new file mode 100644 index 0000000..d935c68 --- /dev/null +++ b/langgraph_agent_with_genai/src/jlibspython/local_embedding_utils.py @@ -0,0 +1,64 @@ +from sentence_transformers import SentenceTransformer +import logging +from typing import List +from pydantic import BaseModel + + +logger = logging.getLogger(__name__) + +# Global cache of the model to avoid reloading 
+
+_cached_model = None
+_model_name = "intfloat/multilingual-e5-large"
+
+class EmbedResponse(BaseModel):
+    embeddings: list[list[float]]
+
+
+class EmbedRequest(BaseModel):
+    texts: list[str]
+
+def _get_cached_model():
+    """Return the cached model or load it if necessary"""
+    global _cached_model
+
+    if _cached_model is None:
+        try:
+            logger.info(f"Loading embedding model: {_model_name}")
+            _cached_model = SentenceTransformer(_model_name)
+            logger.info("Model loaded successfully")
+        except Exception as e:
+            logger.error(f"Error loading embedding model: {e}")
+            raise RuntimeError(f"Failed to initialize embedding model: {e}")
+
+    return _cached_model
+
+def generate_embeddings_local(request_text: list[str]) -> List[List[float]]:
+
+    if not request_text:
+        raise ValueError("Text list cannot be empty")
+
+    # Remove empty texts
+    filtered_texts = [text.strip() for text in request_text if text and text.strip()]
+
+    if not filtered_texts:
+        raise ValueError("No valid text found after filtering")
+
+    try:
+        logger.info(f"Starting embedding for {len(filtered_texts)} texts...")
+
+        # Get cached model
+        embedding_model = _get_cached_model()
+
+        # Generate embeddings
+        embeddings = embedding_model.encode(filtered_texts)
+        embeddings_list = embeddings.tolist()
+
+        logger.info(f"Embeddings successfully generated - {len(embeddings_list)} vectors of dimension {len(embeddings_list[0])}")
+
+        return embeddings_list
+
+    except Exception as e:
+        logger.error(f"Error generating embeddings: {e}")
+        raise RuntimeError(f"Failed to generate embeddings: {e}")
\ No newline at end of file
diff --git a/langgraph_agent_with_genai/src/jlibspython/oci_embedding_utils.py b/langgraph_agent_with_genai/src/jlibspython/oci_embedding_utils.py
new file mode 100644
index 0000000..fd6068e
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/oci_embedding_utils.py
@@ -0,0 +1,67 @@
+
+import logging
+from typing import List
+import oci
+import os
+
+
+logger = logging.getLogger(__name__)
+
+def generate_embeddings_oci(texts: List[str], compartment_id:str, embedding_model:str, genai_endpoint:str ) -> List[List[float]]:
+    if not texts:
+        raise ValueError("Text list must not be empty")
+
+    # Remove empty texts
+    filtered_texts = [text.strip() for text in texts if text and text.strip()]
+
+    if not filtered_texts:
+        raise ValueError("No valid text after filtering")
+
+    try:
+
+        config_path = os.path.expanduser("~/.oci/config")
+        if os.path.exists(config_path):
+            config = oci.config.from_file("~/.oci/config",profile_name=os.environ.get("OCI_CLI_PROFILE"))
+            client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+                config=config,
+                service_endpoint=genai_endpoint
+            )
+        else:
+            signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
+            client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+                config={},
+                signer=signer,
+                service_endpoint=genai_endpoint
+            )
+
+        # Embedding request for multiple texts
+        embed_text_details = oci.generative_ai_inference.models.EmbedTextDetails(
+            inputs=filtered_texts,
+            serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
+                model_id=embedding_model
+            ),
+            compartment_id=compartment_id,
+            is_echo=False,
+            truncate="NONE"
+        )
+
+        # Call the API
+        embed_text_response = client.embed_text(embed_text_details)
+
+        # Extract all the embeddings
+        embedding_vectors = []
+        embeddings_data = embed_text_response.data.embeddings
+
+        for embedding_data in embeddings_data:
+            if hasattr(embedding_data, 
'embedding'):
+                embedding_vectors.append(embedding_data.embedding)
+            else:
+                # If it comes back as a plain list
+                embedding_vectors.append(embedding_data)
+
+        logger.info(f"Successfully generated embeddings for {len(embedding_vectors)} texts")
+        return embedding_vectors
+
+    except Exception as e:
+        logger.error(f"Error generating embeddings: {e}")
+        raise
diff --git a/langgraph_agent_with_genai/src/jlibspython/oci_utils_helpers.py b/langgraph_agent_with_genai/src/jlibspython/oci_utils_helpers.py
new file mode 100644
index 0000000..833deae
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/oci_utils_helpers.py
@@ -0,0 +1,251 @@
+import os, base64, json, mimetypes
+import logging
+import oci
+import tempfile
+import datetime
+import cv2
+
+logger = logging.getLogger(__name__)
+
+
+def download_file_from_objectStore(bucket, namespace, object_name):
+    try:
+        config_path = os.path.expanduser("~/.oci/config")
+        if os.path.exists(config_path):
+            config = oci.config.from_file("~/.oci/config",profile_name=os.environ.get("OCI_CLI_PROFILE"))
+            oci_client = oci.object_storage.ObjectStorageClient(config=config)
+        elif os.environ.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
+            logger.info("Using Resource Principal for authentication")
+            signer = oci.auth.signers.get_resource_principals_signer()
+            oci_client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
+        else:
+            logger.info("Using Instance Principal for authentication")
+            signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
+            oci_client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
+
+
+        logger.info(f"Downloading {object_name} from OCI bucket {bucket}")
+
+        # Create a temporary directory and construct local path with original filename
+        temp_dir = tempfile.gettempdir()
+        local_path = os.path.join(temp_dir, os.path.basename(object_name))
+
+        response = oci_client.get_object(namespace, bucket, object_name)
+
+        with open(local_path, 'wb') as f:
+            for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
+                f.write(chunk)
+
+        logger.info(f"File saved to {local_path}")
+
+        response_metadata = oci_client.head_object(namespace_name=namespace, bucket_name=bucket, object_name=object_name)
+
+        # Extract timestamps
+        last_modified_str = response_metadata.headers.get("last-modified")
+        time_created_str = response_metadata.headers.get("opc-meta-timecreated")
+
+        # Convert Last-Modified from HTTP header format to datetime
+        # (the module is imported as "datetime", so use datetime.datetime here)
+        last_modified_dt = datetime.datetime.strptime(last_modified_str, "%a, %d %b %Y %H:%M:%S %Z")
+
+        if time_created_str:
+            # If custom metadata exists for created time
+            created_on_dt = datetime.datetime.fromisoformat(time_created_str)
+        else:
+            # Fall back to last modified if timecreated was not set
+            created_on_dt = last_modified_dt
+
+        created_on = created_on_dt.strftime("%Y-%m-%dT%H:%M:%S")
+        modified_on = last_modified_dt.strftime("%Y-%m-%dT%H:%M:%S")
+
+        return local_path, created_on, modified_on
+
+    except Exception as e:
+        logger.error(f"Failed to download file from OCI: {e}")
+        raise
+
+
+
+def extract_metadata_from_chunks_GenAI(
+    chunks,
+    prompt_text: str,
+    ocid_compartment_id: str,
+    oci_genai_endpoint: str,
+    ocid_genai_model: str,
+    temperature: float = 0.0,
+):
+    logger.info("Starting metadata extraction with LLM...")
+    EMPTY_JSON = {"summary": "", "type": "", "category": "", "person": "", "eventdate": ""}
+
+    try:
+        if not prompt_text:
+            return EMPTY_JSON
+        config_path = os.path.expanduser("~/.oci/config")
+        if os.path.exists(config_path):
+            logger.info("Using local OCI config file for 
authentication")
+            config = oci.config.from_file("~/.oci/config",profile_name=os.environ.get("OCI_CLI_PROFILE"))
+            client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+                config=config,
+                service_endpoint=oci_genai_endpoint,
+                retry_strategy=oci.retry.NoneRetryStrategy(),
+                timeout=(10, 240),
+            )
+
+        else:
+            signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
+            client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+                config={},
+                signer=signer,
+                service_endpoint=oci_genai_endpoint,
+                retry_strategy=oci.retry.NoneRetryStrategy(),
+                timeout=(10, 240),
+            )
+
+        joined = " ".join(chunks or [])
+        user_prompt = f"{prompt_text}\n{joined}".strip()
+
+        # USER message with TextContent (follows the OCI sample code)
+        text_content = oci.generative_ai_inference.models.TextContent()
+        text_content.text = user_prompt
+
+        msg = oci.generative_ai_inference.models.Message()
+        msg.role = "USER"
+        msg.content = [text_content]
+
+        chat_request = oci.generative_ai_inference.models.GenericChatRequest()
+        chat_request.api_format = oci.generative_ai_inference.models.BaseChatRequest.API_FORMAT_GENERIC
+        chat_request.messages = [msg]
+        chat_request.max_tokens = 2048
+        chat_request.temperature = temperature
+        chat_request.frequency_penalty = 0
+        chat_request.presence_penalty = 0
+        chat_request.top_p = 1
+        chat_request.top_k = 1
+
+        chat_detail = oci.generative_ai_inference.models.ChatDetails()
+        chat_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(model_id=ocid_genai_model)
+        chat_detail.chat_request = chat_request
+        chat_detail.compartment_id = ocid_compartment_id
+
+        resp = client.chat(chat_detail)
+
+        logger.debug("client.chat() executed successfully")
+
+        # The JSON answer comes in choices[0].message.content[0].text
+        choices = getattr(resp.data.chat_response, "choices", []) or []
+        if not choices:
+            return EMPTY_JSON
+
+        msg_content = choices[0].message.content or []
+        if not msg_content or not hasattr(msg_content[0], "text"):
+            return EMPTY_JSON
+
+        raw_text = (msg_content[0].text or "").strip()
+        if not raw_text:
+            return EMPTY_JSON
+
+        # Try to parse directly; on failure, strip simple markdown fences and try again
+        try:
+            return json.loads(raw_text)
+        except Exception:
+            cleaned = raw_text
+            if cleaned.startswith("```"):
+                cleaned = cleaned.strip("`").strip()
+                # Handle a fence that starts with "json"
+                if cleaned.lower().startswith("json"):
+                    cleaned = cleaned[4:].strip()
+            try:
+                return json.loads(cleaned)
+            except Exception:
+                return EMPTY_JSON
+
+    except Exception as e:
+        logger.error(f"Error {e}")
+        return EMPTY_JSON
+
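+# Illustrative: for a PIX receipt the call above is expected to return JSON like
+#   {"type": "voucher", "category": "PIX", "person": "JOHN SILVA",
+#    "eventdate": "2025-05-02", "summary": "..."}
+# matching the few-shot examples in ENRICH_PROMPT; any failure collapses to
+# EMPTY_JSON so callers can retry.
+
+def extract_text_from_image_with_genAI(
+    img_array,
+    ocid_compartment_id: str,
+    oci_genai_endpoint: str,
+    ocid_genai_model: str
+) -> list[str]:
+    result_text: list[str] = []
+    prompt = (
+        "Extract all the text from the image exactly as it appears, without any modification.\n"
+        "Do not return markdown."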
+    )
+
+    # Convert the numpy array to an in-memory PNG
+    success, buffer = cv2.imencode(".png", img_array)
+    if not success:
+        return result_text
+
+    img_bytes = buffer.tobytes()
+    img_b64 = base64.b64encode(img_bytes).decode("utf-8")
+
+    config_path = os.path.expanduser("~/.oci/config")
+    if os.path.exists(config_path):
+        config = oci.config.from_file("~/.oci/config",profile_name=os.environ.get("OCI_CLI_PROFILE"))
+        client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+            config=config,
+            service_endpoint=oci_genai_endpoint,
+            retry_strategy=oci.retry.NoneRetryStrategy(),
+            timeout=(10, 240),
+        )
+    else:
+        signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
+        client = oci.generative_ai_inference.GenerativeAiInferenceClient(
+            config={},
+            signer=signer,
+            service_endpoint=oci_genai_endpoint,
+            retry_strategy=oci.retry.NoneRetryStrategy(),
+            timeout=(10, 240),
+        )
+
+    text_content = oci.generative_ai_inference.models.TextContent()
+    text_content.text = prompt
+
+    image_url = oci.generative_ai_inference.models.ImageUrl(
+        url=f"data:image/png;base64,{img_b64}"
+    )
+    image_content = oci.generative_ai_inference.models.ImageContent(image_url=image_url)
+
+    msg = oci.generative_ai_inference.models.Message()
+    msg.role = "USER"
+    msg.content = [text_content, image_content]
+
+    chat_request = oci.generative_ai_inference.models.GenericChatRequest()
+    chat_request.api_format = oci.generative_ai_inference.models.BaseChatRequest.API_FORMAT_GENERIC
+    chat_request.messages = [msg]
+    chat_request.max_tokens = 2048
+    chat_request.temperature = 0
+
+    chat_detail = oci.generative_ai_inference.models.ChatDetails()
+    chat_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(
+        model_id=ocid_genai_model
+    )
+    chat_detail.chat_request = chat_request
+    chat_detail.compartment_id = ocid_compartment_id
+
+    try:
+        resp = client.chat(chat_detail)
+        choices = getattr(resp.data.chat_response, "choices", [])
+        if not choices:
+            return result_text
+
+        msg_content = choices[0].message.content
+        if not msg_content:
+            return result_text
+
+        raw_text = getattr(msg_content[0], "text", None)
+        if not raw_text:
+            return result_text
+
+        result_text.append(raw_text.strip())
+        return result_text
+
+    except Exception as e:
+        logger.error(f"Error {e}")
+        return []
+
+
diff --git a/langgraph_agent_with_genai/src/jlibspython/oracledb_utils.py b/langgraph_agent_with_genai/src/jlibspython/oracledb_utils.py
new file mode 100644
index 0000000..d7ed717
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/oracledb_utils.py
@@ -0,0 +1,196 @@
+import zipfile
+import logging
+import oracledb
+import oci
+import os
+import re
+from datetime import datetime
+import numpy as np
+from typing import List, Dict, Any
+
+
+logger = logging.getLogger(__name__)
+
+_wallet_downloaded = False
+_WALLET_PATH = "/tmp/wallet"
+
+_oracle_pool_singleton = None
+
+
+def getOracleConnection():
+    global _oracle_pool_singleton
+
+    if _oracle_pool_singleton is None:
+        wallet_path = download_and_extract_wallet()
+        _oracle_pool_singleton = oracledb.SessionPool(
+            user=os.environ["DB_USER"],
+            password=os.environ["DB_PASSWORD"],
+            dsn=os.environ["DB_DSN"],
+            config_dir=wallet_path,
+            wallet_location=wallet_path,
+            wallet_password=os.environ["WALLET_PASSWORD"],
+            min=1,
+            max=5,
+            increment=1,
+            homogeneous=True
+        )
+        logger.info("Session pool created for Oracle ATP")
+
+    return _oracle_pool_singleton.acquire()
+
+
+
+def download_and_extract_wallet():
+    global _wallet_downloaded
+    if 
_wallet_downloaded:
+        return _WALLET_PATH
+
+    logger.info("Downloading wallet from OCI Object Storage...")
+    namespace = os.environ.get("OCI_BUCKET_NAMESPACE")
+    bucket = os.environ.get("OCI_BUCKET_NAME_WALLET")
+    object_name = os.environ.get("OCI_WALLET_OBJECT_NAME")
+    if not all([namespace, bucket, object_name]):
+        raise EnvironmentError("Missing OCI bucket environment variables")
+
+
+    config_path = os.path.expanduser("~/.oci/config")
+    if os.path.exists(config_path):
+        logger.info("Connecting using Local OCI Config....")
+        config = oci.config.from_file("~/.oci/config",profile_name=os.environ.get("OCI_CLI_PROFILE"))
+        client = oci.object_storage.ObjectStorageClient(config=config)
+    else:
+        logger.info("Connecting using Instance Principal...")
+        signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
+        client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
+
+
+    response = client.get_object(namespace, bucket, object_name)
+
+    with open("/tmp/wallet.zip", "wb") as f:
+        f.write(response.data.content)
+    os.makedirs(_WALLET_PATH, exist_ok=True)
+    with zipfile.ZipFile("/tmp/wallet.zip", "r") as zip_ref:
+        zip_ref.extractall(_WALLET_PATH)
+
+    _wallet_downloaded = True
+    logger.info(f"WALLET_PATH updated to {_WALLET_PATH}")
+    return _WALLET_PATH
+
+
+def execute_query(sql, params=None):
+    try:
+        conn = getOracleConnection()
+        cursor = conn.cursor()
+        cursor.execute(sanitize_sql(sql), params or {})
+
+        if cursor.description is None:
+            conn.commit()
+            return {"status": "success", "rows": 0}
+
+        columns = [col[0].lower() for col in cursor.description]
+        rows = [
+            {
+                col: val.read() if isinstance(val, oracledb.LOB) else val
+                for col, val in zip(columns, row)
+            } for row in cursor.fetchall()
+        ]
+        return rows
+    except Exception as e:
+        logger.exception(f"Error running query: {e}")
+        raise
+
+
+def execute_query_single_value(sql, params=None):
+    try:
+        conn = getOracleConnection()
+        cursor = conn.cursor()
+        cursor.execute(sanitize_sql(sql), params or {})
+        values = [
+            val.read() if isinstance(val, oracledb.LOB) else val
+            for val, *_ in cursor.fetchall()
+        ]
+        return values
+    except Exception as e:
+        logger.exception(f"Error running query: {e}")
+        raise
+
+
+def sanitize_sql(sql: str) -> str:
+    sql = sql.strip()
+    sql = re.sub(r"--.*?$", "", sql, flags=re.MULTILINE)
+    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.DOTALL)
+    sql = re.sub(r"```sql", "", sql, flags=re.IGNORECASE)
+    sql = sql.replace("```", "").replace("`", "")
+    sql = re.sub(r";\s*$", "", sql)
+    sql = re.sub(r"\s+", " ", sql)
+    return sql.strip()
+
+
+
+def parse_date(date_str):
+    if not date_str:
+        return None
+    input_formats = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")
+    for fmt in input_formats:
+        try:
+            return datetime.strptime(date_str, fmt)
+        except ValueError:
+            continue
+    logger.warning(f"Invalid date format: {date_str}")
+    return None
+
+def execute_ddl(sql):
+    try:
+        conn = getOracleConnection()
+        cursor = conn.cursor()
+
+        # Split multiple statements and execute each one, skipping fragments
+        # that contain only "--" comments or whitespace (this allows DDL
+        # scripts to carry trailing commented notes)
+        statements = []
+        for stmt in sql.split(';'):
+            meaningful = [line for line in stmt.splitlines() if line.strip() and not line.strip().startswith('--')]
+            if meaningful:
+                statements.append(stmt.strip())
+
+        for statement in statements:
+            logger.info(f"Executing DDL: {statement[:100]}...")
+            cursor.execute(statement)
+
+        conn.commit()
+        logger.info(f"Successfully executed {len(statements)} DDL statement(s)")
+        return {"status": "success", "statements_executed": len(statements)}
+    except Exception as e:
+        logger.exception(f"Error executing DDL: {e}")
+        raise
+
+
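+# Illustrative: execute_ddl('CREATE TABLE t (id NUMBER); CREATE INDEX i ON t(id)')
+# runs two statements and returns {"status": "success", "statements_executed": 2};
+# a chunk holding only comments (e.g. after a script's final ";") is skipped.
+
+def safe_float(value):
+    try:
+        return float(value)
+    except (TypeError, 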
ValueError):
+        return None
+
+def filter_outliers_by_std_dev(data: List[Dict[str, Any]], column_name: str) -> List[Dict[str, Any]]:
+
+    weight_value = 1.5
+
+    if not data:
+        return []
+
+    ## With 5 or fewer records, just return them all (too few records for reliable outlier detection)
+    if len(data) <= 5:
+        return data
+
+    valid_data = [item for item in data if safe_float(item.get(column_name)) is not None]
+    if not valid_data:
+        return []
+
+    distances = [safe_float(item.get(column_name)) for item in valid_data]
+
+    # Keep only records whose value is unusually LOW (below mean - 1.5 * std dev):
+    # for vector distances, these low outliers are the strongest matches.
+    mean_distance = np.mean(distances)
+    std_dev_distance = np.std(distances)
+    outlier_threshold = mean_distance - weight_value * std_dev_distance
+
+    outlier_results = [
+        item for item, dist in zip(valid_data, distances) if dist < outlier_threshold
+    ]
+
+    return outlier_results
diff --git a/langgraph_agent_with_genai/src/jlibspython/proxy_embedding_helper.py b/langgraph_agent_with_genai/src/jlibspython/proxy_embedding_helper.py
new file mode 100644
index 0000000..48a4fe1
--- /dev/null
+++ b/langgraph_agent_with_genai/src/jlibspython/proxy_embedding_helper.py
@@ -0,0 +1,25 @@
+"""
+This is a proxy helper.
+You can switch between local embeddings and OCI Generative AI embeddings
+by changing the imports and the active function below.
+"""
+
+
+import logging
+from typing import List
+from jlibspython.local_embedding_utils import generate_embeddings_local
+#from jlibspython.oci_embedding_utils import generate_embeddings_oci
+
+logger = logging.getLogger(__name__)
+
+# To switch between embedding types, just uncomment the parts below.
+
+# Using OCI
+#def generate_embeddings_batch(texts: List[str], compartment_id:str, embedding_model:str, genai_endpoint:str ) -> List[List[float]]:
+#    ret = generate_embeddings_oci(texts, compartment_id=compartment_id, embedding_model=embedding_model, genai_endpoint=genai_endpoint)
+#    return ret
+
+
+# Using Local
+def generate_embeddings_batch(texts: List[str], compartment_id:str, embedding_model:str, genai_endpoint:str ) -> List[List[float]]:
+    ret = generate_embeddings_local(texts)
+    return ret
diff --git a/langgraph_agent_with_genai/src/requirements.txt b/langgraph_agent_with_genai/src/requirements.txt
new file mode 100644
index 0000000..15bb769
--- /dev/null
+++ b/langgraph_agent_with_genai/src/requirements.txt
@@ -0,0 +1,22 @@
+dotenv==0.9.9
+oci==2.159.1
+python-docx==1.1
+pdf2image==1.17.0
+numpy==2.2.6
+pypdf==5.5.0
+opencv-python==4.12.0.88
+tzlocal==5.3.1
+PyMuPDF==1.26.0
+requests==2.32.5
+oracledb==3.3.0
+poppler-utils==0.1.0
+langchain==0.3.27
+langchain-community==0.3.29
+langchain-core==0.3.75
+langsmith==0.4.20
+langgraph==0.6.6
+langchain-oci==0.1.3
+sentence-transformers==2.7.0
+
+
+
diff --git a/langgraph_agent_with_genai/src/samples/BoletoCaixaSample.png b/langgraph_agent_with_genai/src/samples/BoletoCaixaSample.png
new file mode 100644
index 0000000..98af489
Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/BoletoCaixaSample.png differ
diff --git a/langgraph_agent_with_genai/src/samples/Brazilian_Beaches.docx b/langgraph_agent_with_genai/src/samples/Brazilian_Beaches.docx
new file mode 100644
index 0000000..c6333e5
Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/Brazilian_Beaches.docx differ
diff --git a/langgraph_agent_with_genai/src/samples/Comprovante_PIX.jpeg b/langgraph_agent_with_genai/src/samples/Comprovante_PIX.jpeg
new file mode 100644
index 0000000..0393889
Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/Comprovante_PIX.jpeg differ
diff --git 
a/langgraph_agent_with_genai/src/samples/DogsArticle.docx b/langgraph_agent_with_genai/src/samples/DogsArticle.docx new file mode 100644 index 0000000..55e0a63 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/DogsArticle.docx differ diff --git a/langgraph_agent_with_genai/src/samples/RECIPT_DOC_SAMPLE.docx b/langgraph_agent_with_genai/src/samples/RECIPT_DOC_SAMPLE.docx new file mode 100644 index 0000000..934ef70 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/RECIPT_DOC_SAMPLE.docx differ diff --git a/langgraph_agent_with_genai/src/samples/Receipt_Clothes.png b/langgraph_agent_with_genai/src/samples/Receipt_Clothes.png new file mode 100644 index 0000000..35e8ee0 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/Receipt_Clothes.png differ diff --git a/langgraph_agent_with_genai/src/samples/SampleCNH.png b/langgraph_agent_with_genai/src/samples/SampleCNH.png new file mode 100644 index 0000000..ff9cb6a Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/SampleCNH.png differ diff --git a/langgraph_agent_with_genai/src/samples/US_Invoice_01.png b/langgraph_agent_with_genai/src/samples/US_Invoice_01.png new file mode 100644 index 0000000..5261a2f Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/US_Invoice_01.png differ diff --git a/langgraph_agent_with_genai/src/samples/US_Receipt_1.png b/langgraph_agent_with_genai/src/samples/US_Receipt_1.png new file mode 100644 index 0000000..76b11ba Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/US_Receipt_1.png differ diff --git a/langgraph_agent_with_genai/src/samples/exame_1.pdf b/langgraph_agent_with_genai/src/samples/exame_1.pdf new file mode 100644 index 0000000..962d1f7 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_1.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_10.pdf b/langgraph_agent_with_genai/src/samples/exame_10.pdf new file mode 100644 index 0000000..20b53fb Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_10.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_2.pdf b/langgraph_agent_with_genai/src/samples/exame_2.pdf new file mode 100644 index 0000000..5f0ad2c Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_2.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_3.pdf b/langgraph_agent_with_genai/src/samples/exame_3.pdf new file mode 100644 index 0000000..f693a3b Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_3.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_4.pdf b/langgraph_agent_with_genai/src/samples/exame_4.pdf new file mode 100644 index 0000000..086c3d8 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_4.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_5.pdf b/langgraph_agent_with_genai/src/samples/exame_5.pdf new file mode 100644 index 0000000..617962f Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_5.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_6.pdf b/langgraph_agent_with_genai/src/samples/exame_6.pdf new file mode 100644 index 0000000..4fb9cca Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_6.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_7.pdf b/langgraph_agent_with_genai/src/samples/exame_7.pdf new file mode 100644 index 0000000..d862830 Binary files /dev/null and 
b/langgraph_agent_with_genai/src/samples/exame_7.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_8.pdf b/langgraph_agent_with_genai/src/samples/exame_8.pdf new file mode 100644 index 0000000..db074bb Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_8.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_9.pdf b/langgraph_agent_with_genai/src/samples/exame_9.pdf new file mode 100644 index 0000000..0841be8 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_9.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/exame_laboratorial.pdf b/langgraph_agent_with_genai/src/samples/exame_laboratorial.pdf new file mode 100644 index 0000000..56e94ca Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/exame_laboratorial.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_001.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_001.pdf new file mode 100644 index 0000000..935a9ce Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_001.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_002.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_002.pdf new file mode 100644 index 0000000..f9fa51a Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_002.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_003.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_003.pdf new file mode 100644 index 0000000..bf8c89b Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_003.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_004.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_004.pdf new file mode 100644 index 0000000..091e255 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_004.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_005.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_005.pdf new file mode 100644 index 0000000..d3fc197 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_005.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_006.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_006.pdf new file mode 100644 index 0000000..b157c1b Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_006.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_007.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_007.pdf new file mode 100644 index 0000000..89cac23 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_007.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_008.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_008.pdf new file mode 100644 index 0000000..3db78fb Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_008.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_009.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_009.pdf new file mode 100644 index 0000000..c0447d7 Binary files /dev/null and b/langgraph_agent_with_genai/src/samples/pedido_exame_009.pdf differ diff --git a/langgraph_agent_with_genai/src/samples/pedido_exame_010.pdf b/langgraph_agent_with_genai/src/samples/pedido_exame_010.pdf new file mode 100644 index 0000000..0e401ba Binary files /dev/null and 
b/langgraph_agent_with_genai/src/samples/pedido_exame_010.pdf differ
diff --git a/langgraph_agent_with_genai/src/validation.py b/langgraph_agent_with_genai/src/validation.py
new file mode 100644
index 0000000..9a6c9ca
--- /dev/null
+++ b/langgraph_agent_with_genai/src/validation.py
@@ -0,0 +1,142 @@
+"""
+Validation script for data indexed in the database.
+It helps you run a few spot-check queries against the DOCUMENT_VECTORS table.
+"""
+
+
+import sys
+import os
+import logging
+from jlibspython.oracledb_utils import filter_outliers_by_std_dev
+from dotenv import load_dotenv
+load_dotenv()
+
+from jlibspython.proxy_embedding_helper import generate_embeddings_batch
+from jlibspython.oracledb_utils import execute_query
+
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+OCI_COMPARTMENT_ID = os.environ["OCI_COMPARTMENT_ID"]
+
+## These two variables are only required if you decide to use a GenAI model
+## for embedding instead of local embedding (see jlibspython/proxy_embedding_helper.py)
+OCI_EMBEDDING_MODEL_NAME = os.environ.get("OCI_EMBEDDING_MODEL_NAME", "")
+OCI_EMBEDDING_ENDPOINT = os.environ.get("OCI_EMBEDDING_ENDPOINT", "")
+
+
+def display_document_stats():
+    """Display document statistics by type and category"""
+    try:
+        sql = "SELECT COUNT(*), DOC_TYPE, CATEGORY FROM DOCUMENT_VECTORS GROUP BY DOC_TYPE, CATEGORY ORDER BY 1,2"
+        results = execute_query(sql)
+
+        print("\nDOCUMENT STATISTICS")
+        print("=" * 50)
+        print(f"{'Count':<8} {'Document Type':<20} {'Category':<20}")
+        print("-" * 50)
+
+        for row in results:
+            count = row['count(*)'] if row.get('count(*)') else 0
+            doc_type = row['doc_type'] if row.get('doc_type') else 'N/A'
+            category = row['category'] if row.get('category') else 'N/A'
+            print(f"{count:<8} {doc_type:<20} {category:<20}")
+
+        print("-" * 50)
+        print()
+
+    except Exception as e:
+        logger.error(f"Error displaying document statistics: {e}")
+
+def display_available_names():
+    """Display available person names in the database"""
+    try:
+        sql = "SELECT UPPER(LISTAGG(PERSON_NAME, ', ') WITHIN GROUP (ORDER BY PERSON_NAME)) AS available_name FROM DOCUMENT_VECTORS"
+        results = execute_query(sql)
+
+        print("AVAILABLE PERSON NAMES")
+        print("=" * 50)
+
+        if results and results[0] and results[0].get('available_name'):
+            names = results[0]['available_name']
+            print(f"{names}")
+        else:
+            print("No person names found in the database.")
+
+        print("=" * 50)
+        print()
+
+    except Exception as e:
+        logger.error(f"Error displaying available names: {e}")
+
+def main():
+
+    display_document_stats()
+    display_available_names()
+
+
+    if len(sys.argv) < 3:
+        print("Usage: python validation.py COLUMN_NAME 'your text here'")
+        print("Available columns: SUMMARY_EMBEDDING, DOC_TYPE_EMBEDDING, CATEGORY_EMBEDDING, PERSON_NAME_EMBEDDING")
+        sys.exit(1)
+
+
+    column_name = sys.argv[1]
+    text = sys.argv[2]
+    text = text.lower()
+
+    try:
+        embeddings = generate_embeddings_batch(
+            [text],
+            compartment_id=OCI_COMPARTMENT_ID,
+            embedding_model=OCI_EMBEDDING_MODEL_NAME,
+            genai_endpoint=OCI_EMBEDDING_ENDPOINT
+        )
+
+        embedding_vector = embeddings[0]
+        emb_str = "[" + ",".join(str(x) for x in embedding_vector) + "]"
+
+        sql = f"""
+            SELECT SOURCE_FILE, PERSON_NAME, DOC_TYPE, CATEGORY, CHUNK_TEXT, SUMMARY,
+                   VECTOR_DISTANCE({column_name}, VECTOR(:embedding)) as DISTANCE
+            FROM 
DOCUMENT_VECTORS
+            ORDER BY DISTANCE
+        """
+
+        results = execute_query(sql, {"embedding": emb_str})
+
+        logger.debug(f"Embedding: {emb_str}")
+
+        print(f"{'File':<30} {'Person':<20} {'Doc Type':<20} {'Category':<15} {'Distance':<10} {'Summary':<50}")
+        print("-" * 145)
+
+        for row in results:
+            distance = f"{row['distance']:.4f}" if row['distance'] is not None else 'N/A'
+            file_name = row['source_file'].split('/')[-1] if row['source_file'] else 'N/A'
+            person = row['person_name'] or 'N/A'
+            doc_type = row['doc_type'] or 'N/A'
+            category = row['category'] or 'N/A'
+            summary = (row['summary'] or 'N/A')[:47] + "..." if row['summary'] and len(row['summary']) > 50 else (row['summary'] or 'N/A')
+
+            print(f"{file_name:<30} {person:<20} {doc_type:<20} {category:<15} {distance:<10} {summary:<50}")
+
+        outliers = filter_outliers_by_std_dev(results, 'distance')
+
+        print("\nClosest matches (low-distance outliers):")
+        for outlier in outliers:
+            print(f"- File: {outlier['source_file']}, Distance: {outlier['distance']}")
+
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        sys.exit(1)
+
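+# Example invocations (illustrative):
+#   python validation.py SUMMARY_EMBEDDING "physiotherapy receipt"
+#   python validation.py PERSON_NAME_EMBEDDING "caroline silva"
+if __name__ == "__main__":
+    main()
\ No newline at end of file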