<a href="https://colab.research.google.com/github/graphlit/graphlit-samples/blob/main/python/Notebook%20Examples/Graphlit_2024_12_08_CrewAI_Product_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Description**

This example shows how to integrate CrewAI with the Graphlit Agent Tools to extract structured data from web pages.

**Requirements**

Prior to running this notebook, you will need to [sign up](https://docs.graphlit.dev/getting-started/signup) for Graphlit and [create a project](https://docs.graphlit.dev/getting-started/create-project).

You will need the Graphlit organization ID, preview environment ID and JWT secret from your created project.

Assign these properties as Colab secrets: GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID and GRAPHLIT_JWT_SECRET.

For CrewAI, assign this property as a Colab secret: OPENAI_API_KEY.
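When running outside Colab, the same four values can be supplied as environment variables instead of Colab secrets. The following is a minimal sketch of a pre-flight check, not part of Graphlit or CrewAI; `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced here for illustration.

```python
import os

# Setting names taken from the requirements listed above.
REQUIRED_SETTINGS = [
    "GRAPHLIT_ORGANIZATION_ID",
    "GRAPHLIT_ENVIRONMENT_ID",
    "GRAPHLIT_JWT_SECRET",
    "OPENAI_API_KEY",
]

def missing_settings(env=None):
    """Return the required setting names that are absent or empty in `env`.

    Hypothetical helper: `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

# Example with a partially filled mapping (made-up values).
example_env = {"OPENAI_API_KEY": "sk-example", "GRAPHLIT_JWT_SECRET": ""}
print("Missing:", ", ".join(missing_settings(example_env)))
```

Calling `missing_settings()` with no argument checks the real environment, which makes a missing secret visible before any tool call is attempted.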

---

Install CrewAI

In [1]:
!pip install --upgrade crewai



Install LangChain OpenAI support

In [2]:
!pip install --upgrade langchain-openai



Install Graphlit Python agent tools SDK

In [3]:
!pip install --upgrade "graphlit-tools[crewai]"



Install isodate, used to parse ISO 8601 durations in usage records

In [4]:
!pip install --upgrade isodate



Initialize OpenAI for CrewAI

In [5]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

Initialize Graphlit

In [6]:
import os
from google.colab import userdata
from graphlit import Graphlit
from graphlit_api import input_types, enums, exceptions

os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

graphlit = Graphlit()

Define Graphlit helper functions

In [7]:
from datetime import datetime, timedelta
import isodate

async def lookup_usage(correlation_id: str):
    # Return the usage records for this correlation ID, or None on failure.
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.lookup_usage(correlation_id)

        return response.lookup_usage
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def lookup_credits(correlation_id: str):
    # Return the credit summary for this correlation ID, or None on failure.
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.lookup_credits(correlation_id)

        return response.lookup_credits
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

def dump_usage_record(record):
    print(f"{record.date}: {record.name}")

    if record.duration is not None:
        duration = isodate.parse_duration(record.duration)

        if record.workflow:
            print(f"- Workflow [{record.workflow}] took {duration}, used credits [{record.credits:.8f}]")
        else:
            print(f"- Operation took {duration}, used credits [{record.credits:.8f}]")
    else:
        if record.workflow:
            print(f"- Workflow [{record.workflow}] used credits [{record.credits:.8f}]")
        else:
            print(f"- Operation used credits [{record.credits:.8f}]")

    if record.entity_id:
        if record.entity_type:
            if record.entity_type == enums.EntityTypes.CONTENT and record.content_type:
                print(f"- {record.entity_type} [{record.entity_id}]: Content type [{record.content_type}], file type [{record.file_type}]")
            else:
                print(f"- {record.entity_type} [{record.entity_id}]")
        else:
            print(f"- Entity [{record.entity_id}]")

    if record.model_service:
        print(f"- Model service [{record.model_service}], model name [{record.model_name}]")

    if record.processor_name:
        if record.processor_name in ["Deepgram Audio Transcription", "Assembly.AI Audio Transcription"]:
            length = timedelta(milliseconds=record.count or 0)

            if record.model_name:
                print(f"- Processor name [{record.processor_name}], model name [{record.model_name}], length [{length}]")
            else:
                print(f"- Processor name [{record.processor_name}], length [{length}]")
        else:
            if record.count:
                if record.model_name:
                    print(f"- Processor name [{record.processor_name}], model name [{record.model_name}], units [{record.count}]")
                else:
                    print(f"- Processor name [{record.processor_name}], units [{record.count}]")
            else:
                if record.model_name:
                    print(f"- Processor name [{record.processor_name}], model name [{record.model_name}]")
                else:
                    print(f"- Processor name [{record.processor_name}]")

    if record.uri:
        print(f"- URI [{record.uri}]")

    if record.name == "Prompt completion":
        if record.prompt:
            print(f"- Prompt [{record.prompt_tokens} tokens (includes RAG context tokens)]:")
            print(record.prompt)

        if record.completion:
            print(f"- Completion [{record.completion_tokens} tokens (includes JSON guardrails tokens)], throughput: {record.throughput:.3f} tokens/sec:")
            print(record.completion)

    elif record.name == "Text embedding":
        if record.prompt_tokens is not None:
            print(f"- Text embedding [{record.prompt_tokens} tokens], throughput: {record.throughput:.3f} tokens/sec")

    elif record.name == "Document preparation":
        if record.prompt_tokens is not None and record.completion_tokens is not None:
            print(f"- Document preparation [{record.prompt_tokens} input tokens, {record.completion_tokens} output tokens], throughput: {record.throughput:.3f} tokens/sec")

    elif record.name == "Data extraction":
        if record.prompt_tokens is not None and record.completion_tokens is not None:
            print(f"- Data extraction [{record.prompt_tokens} input tokens, {record.completion_tokens} output tokens], throughput: {record.throughput:.3f} tokens/sec")

    elif record.name == "GraphQL":
        if record.request:
            print("- Request:")
            print(record.request)

        if record.variables:
            print("- Variables:")
            print(record.variables)

        if record.response:
            print("- Response:")
            print(record.response)

    if record.name.startswith("Upload"):
        print(f"- File upload [{record.count} bytes], throughput: {record.throughput:.3f} bytes/sec")

    print()

async def delete_all_contents():
    if graphlit.client is None:
        return None

    _ = await graphlit.client.delete_all_contents(is_synchronous=True)
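
`dump_usage_record` prints records one at a time, which gets long when a run produces many of them. As a rough sketch of an alternative view, the records could be grouped by `name` and their `credits` summed. `credits_by_name` is a hypothetical helper, and the sample records below are made-up stand-ins that carry only the two attributes the helper reads.

```python
from collections import defaultdict
from types import SimpleNamespace

def credits_by_name(records):
    """Sum `credits` per record `name`, largest total first.

    Hypothetical helper; assumes each record exposes `name` and `credits`,
    as the records handled by dump_usage_record do.
    """
    totals = defaultdict(float)
    for record in records:
        totals[record.name] += record.credits or 0.0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Stand-in records with made-up values; real ones would come from lookup_usage().
sample = [
    SimpleNamespace(name="Prompt completion", credits=0.5),
    SimpleNamespace(name="GraphQL", credits=0.0),
    SimpleNamespace(name="Prompt completion", credits=0.25),
    SimpleNamespace(name="Text embedding", credits=0.125),
]

for name, total in credits_by_name(sample):
    print(f"{name}: {total:.8f}")
```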


Execute Graphlit example

In [8]:
import nest_asyncio

nest_asyncio.apply()

# Remove any existing contents; only needed for notebook example
await delete_all_contents()

print('Deleted all contents.')

Deleted all contents.


In [9]:
from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    name: str
    description: str
    features: List[str]
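
The final task in this notebook asks an LLM agent to collate the extracted products into CSV. For comparison, here is a sketch of a deterministic conversion using only the standard library. `ProductRow` is a stand-in dataclass with the same three fields as `Product`, and joining the feature list with `"; "` is an assumption made for this example, not something the notebook prescribes.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

@dataclass
class ProductRow:
    # Stand-in with the same fields as the Product model above.
    name: str
    description: str
    features: List[str]

def products_to_csv(products):
    """Render products as CSV text, one row per product.

    The features list is flattened into a single cell joined by '; '
    (an assumption made for this sketch).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "description", "features"])
    for product in products:
        writer.writerow([product.name, product.description, "; ".join(product.features)])
    return buffer.getvalue()

# Made-up example values.
rows = [ProductRow("Model A", "Compact, electric", ["Fast charging", "Heat pump"])]
print(products_to_csv(rows))
```

The `csv` module quotes any cell that contains a comma, so descriptions with commas do not break the column layout.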

Define and kick off CrewAI crew

In [10]:
from crewai import Agent, Crew, Process, Task
from langchain_openai import ChatOpenAI
from graphlit_tools import WebSearchTool, WebMapTool, ExtractWebPageTool, CrewAIConverter
from datetime import datetime

company_name = input('Enter the automaker company name to be analyzed: ')

llm = ChatOpenAI(model="gpt-4o")

# NOTE: create a unique cost correlation ID
correlation_id = datetime.now().isoformat()

# NOTE: need to convert Graphlit tools to CrewAI tool schema
web_search_tool = CrewAIConverter.from_tool(WebSearchTool(graphlit, correlation_id=correlation_id))
web_map_tool = CrewAIConverter.from_tool(WebMapTool(graphlit, correlation_id=correlation_id))
extract_web_page_tool = CrewAIConverter.from_tool(ExtractWebPageTool(graphlit, correlation_id=correlation_id))

web_search_agent = Agent(
    role="Web Researcher",
    goal="Find the {company} website.",
    backstory="",
    verbose=True,
    allow_delegation=False,
    tools=[web_search_tool],
)

web_map_agent = Agent(
    role="Web Mapping Agent",
    goal="Enumerate all the web page URLs for the provided web site.",
    backstory="",
    verbose=True,
    allow_delegation=False,
    tools=[web_map_tool],
)

web_page_analyst_agent = Agent(
    role="Web Analyst Agent",
    goal="Extract structured data from the {company} web pages.",
    backstory="""You work for a major automotive manufacturer, and are doing competitive analysis on other automakers' websites.
    We are looking to gather structured product information about automotive models.
    """,
    verbose=True,
    allow_delegation=False,
    tools=[extract_web_page_tool],
)

writer_agent = Agent(
    role="Data Writer Agent",
    goal="Summarize all of the structured product data into CSV format.",
    backstory="You work for a major automotive manufacturer, and are doing competitive analysis on other automakers' websites.",
    verbose=True,
    llm=llm,
    allow_delegation=False
)

search_web_task = Task(
    description=(
        """Given company named {company}, search the web to find their home page.
        Return the root path for URLs, not individual web pages.
        For example return https://www.example.com, not https://www.example.com/index.html"""
    ),
    expected_output="A single URL for the {company} home page, prefer the US targeted home page",
    agent=web_search_agent,
)

fetch_web_pages_task = Task(
    description=(
        """Fetch the URLs at or beneath the given home page for further analysis.
        Filter the resulting URLs to locate pages which appear to be about automobile models and specifications.
        Select one most relevant page per automobile model.
        """
    ),
    expected_output="A list of web page URLs, maximum 10",
    agent=web_map_agent,
    context=[search_web_task],
)

extract_web_page_task = Task(
    description=(
        """Extract structured data from the provided web pages from the {company} website.

        Execute task once for each provided web page.
        """
    ),
    expected_output="A list of Product Pydantic data models",
    output_pydantic=Product,
    agent=web_page_analyst_agent,
    context=[fetch_web_pages_task],
)

writer_task = Task(
    description=(
        """Collate the extracted structured data from the provided {company} web pages.
        """
    ),
    expected_output="All the extracted automotive models in structured form, collated into CSV format.",
    agent=writer_agent,
    context=[extract_web_page_task],
)

crew = Crew(
    agents=[web_search_agent, web_map_agent, web_page_analyst_agent, writer_agent],
    tasks=[search_web_task, fetch_web_pages_task, extract_web_page_task, writer_task],
    manager_llm=llm,
    function_calling_llm=llm,
    planning_llm=llm,
    process=Process.sequential,
    planning=True,
    verbose=True,
)

# Kick off the crew and print the result
result = await crew.kickoff_async(inputs={"company": company_name})
print("Website Summary Process Completed:")
print(result)

Enter the automaker company name to be analyzed: Porsche

[2024-12-09 02:04:07][INFO]: Planning the crew execution
# Agent: Web Researcher
## Task: Given company named Porsche, search the web to find their home page.
        Return the root path for URLs, not individual web pages.
        For example return https://www.example.com, not https://www.example.com/index.html
1. Utilize the Graphlit web search tool by entering the search string 'Porsche official site' or 'Porsche homepage'. Set an appropriate search limit to ensure the tool returns a manageable number of results.
2. Analyze the returned results to identify the root URL of the Porsche homepage, prioritizing the US-targeted homepage if available.
3. Verify the URL by checking the official branding and content of the website to ensure authenticity.
4. Return the root URL, which is expected to be in the format 'https://www.porsche.com'.


# Agent:

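The search task relies on the LLM to return a site root rather than an individual page. If a deterministic check on that output is wanted, the root can be derived from any absolute URL with the standard library. This is a sketch under that assumption; `site_root` is a hypothetical helper and is not part of the Graphlit tools or CrewAI.

```python
from urllib.parse import urlsplit, urlunsplit

def site_root(url):
    """Return the scheme and host of `url`, dropping path, query and fragment.

    Hypothetical helper for illustration; raises ValueError for relative URLs.
    """
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {url!r}")
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

print(site_root("https://www.example.com/index.html"))
```

Note that reducing a URL to its host also discards locale paths such as `/usa/`, so this would not preserve a US-targeted home page that lives under a path.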
Calculate Graphlit credits & usage

In [12]:
import time
from IPython.display import display, HTML, JSON, Markdown

time.sleep(10) # NOTE: give some time to consume all billing events

credits = await lookup_credits(correlation_id)

if credits is not None:
    display(Markdown(f"### Credits used: {credits.credits:.6f}"))
    print(f"- storage [{credits.storage_ratio:.2f}%], compute [{credits.compute_ratio:.2f}%]")
    print(f"- embedding [{credits.embedding_ratio:.2f}%], completion [{credits.completion_ratio:.2f}%]")
    print(f"- ingestion [{credits.ingestion_ratio:.2f}%], indexing [{credits.indexing_ratio:.2f}%], preparation [{credits.preparation_ratio:.2f}%], extraction [{credits.extraction_ratio:.2f}%], enrichment [{credits.enrichment_ratio:.2f}%], publishing [{credits.publishing_ratio:.2f}%]")
    print(f"- search [{credits.search_ratio:.2f}%], conversation [{credits.conversation_ratio:.2f}%]")
    print()

usage = await lookup_usage(correlation_id)

if usage is not None:
    display(Markdown(f"### Usage records [{len(usage)}]:"))

    for record in usage:
        dump_usage_record(record)

    print()


### Credits used: 8.080167

- storage [0.21%], compute [3.08%]
- embedding [3.96%], completion [76.77%]
- ingestion [13.33%], indexing [0.00%], preparation [2.66%], extraction [0.00%], enrichment [0.00%], publishing [0.00%]
- search [0.00%], conversation [0.00%]



### Usage records [74]:

2024-12-09T02:07:19.185Z: GraphQL
- Operation took 0:00:00.011231, used credits [0.00000000]
- Request:
mutation ExtractText($prompt: String!, $text: String!, $textType: TextTypes, $specification: EntityReferenceInput, $tools: [ToolDefinitionInput!]!, $correlationId: String) { extractText(prompt: $prompt, text: $text, textType: $textType, specification: $specification, tools: $tools, correlationId: $correlationId) { specification { id } content { id } name value startTime endTime pageNumber error } }
- Variables:
{"prompt":"\"Extract the vehicle model name, description, and key features for the Porsche Taycan models.\"","text":"\"**Content ID:** e5ad3408-0214-4d19-8700-2a37a227a84d\\n**Type:** [PAGE]\\n**URI:** https:\\/\\/www.porsche.com\\/usa\\/models\\/taycan\\/\\n**Ingestion Date:** 2024-12-09T02:07:00.000Z\\n**Title:** Porsche All Taycan Models - Porsche USA\\n**WEB Link:** https:\\/\\/www.porsche.com\\/usa\\/\\n**WEB Link:** https:\\/\\/www.porsche.com\\/usa\\/models\\/taycan\\/j