# ScrapeGraph AI

## Overview

ScrapeGraph AI exists in two forms:

1. An [open-source project](https://github.com/ScrapeGraphAI/Scrapegraph-ai) (16k+ stars) that provides a Python library for web scraping using LLMs and direct graph logic.

2. A [commercial API service](https://scrapegraphai.com/) that offers production-ready endpoints for:
   - Extracting structured data from websites
   - Converting webpages to markdown
   - Processing local HTML content
   
This integration uses the commercial API service through LangChain tools.

## Installation

First, install the required packages:

In [14]:
%pip install --upgrade --quiet langchain-scrapegraph langchain-openai langchain-community

Note: you may need to restart the kernel to use updated packages.


## Instantiation

Set up your API keys and initialize the tools:

In [15]:
import os

os.environ["SGAI_API_KEY"] = "sgai-your-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    MarkdownifyTool,
    SmartScraperTool,
)

# Initialize tools
smartscraper = SmartScraperTool()
markdownify = MarkdownifyTool()
localscraper = LocalScraperTool()
credits = GetCreditsTool()
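
The tools above are constructed without arguments, so they presumably read `SGAI_API_KEY` from the environment. Below is a minimal sketch of a fail-fast check that could run before instantiation. `require_env` is a hypothetical helper written for this illustration, not part of `langchain-scrapegraph`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: surfaces a missing key at start-up instead of
    at the first API call.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage before creating the tools:
#   require_env("SGAI_API_KEY")
#   require_env("OPENAI_API_KEY")
```

Reading keys from the environment (or a secrets manager) rather than hard-coding them in a notebook cell also keeps them out of version control.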

## Invocation

### SmartScraper Example

In [16]:
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name, description, and any contact information",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:")
print(result)

SmartScraper Result:
{'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis.", 'contact_information': {'email': 'contact@scrapegraphai.com', 'linkedin': 'https://www.linkedin.com/company/101881123', 'twitter': 'https://x.com/scrapegraphai', 'github': 'https://github.com/ScrapeGraphAI/Scrapegraph-ai'}}
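
The keys in the returned dictionary appear to depend on the prompt and on the model's output: the SmartScraper result here uses `contact_information`, while the LocalScraper result later in this page uses `contact`. A small sketch of defensive access follows. `get_nested` is a hypothetical helper, and `sample` is an abbreviated copy of the output above.

```python
from typing import Any


def get_nested(data: dict, *keys: str, default: Any = None) -> Any:
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Abbreviated copy of the SmartScraper output shown above.
sample = {
    "company_name": "ScrapeGraphAI",
    "contact_information": {"email": "contact@scrapegraphai.com"},
}

email = get_nested(sample, "contact_information", "email")
phone = get_nested(sample, "contact_information", "phone", default="n/a")
print(email, phone)  # contact@scrapegraphai.com n/a
```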


### Markdownify Example

In [17]:
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("Markdownify Result (first 500 chars):")
print(markdown[:500] + "...")

Markdownify Result (first 500 chars):
[![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)

PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up

Open main menu

[Top 200 Global AI Open-Source Project](https://runacap.com/ross-index/q3-2024/)

# Get the data you need from any website

Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data.

Get started

Simple I...
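
Because the tool returns a plain Markdown string, ordinary text processing applies to it. As a sketch, the snippet below pulls `[label](url)` links out of such a string with a regular expression. `extract_links` is a hypothetical helper, the sample text is copied from the output above, and the pattern deliberately skips images and links that wrap an image.

```python
import re

# Matches [label](url); skips images (![alt](src)) and labels containing brackets.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\[\]]+)\]\(([^)\s]+)\)")


def extract_links(markdown_text: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs for simple inline Markdown links."""
    return LINK_PATTERN.findall(markdown_text)


# Two lines taken from the Markdownify output shown above.
sample_markdown = (
    "PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up\n"
    "\n"
    "[Top 200 Global AI Open-Source Project](https://runacap.com/ross-index/q3-2024/)\n"
)

for label, url in extract_links(sample_markdown):
    print(f"{label}: {url}")
```

A regular expression is enough for a quick pass; a full Markdown parser would be a safer choice for nested or reference-style links.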


### LocalScraper Example

In [18]:
html_content = """
<html>
    <body>
        <h1>AI Company</h1>
        <p>We specialize in artificial intelligence solutions.</p>
        <div class="contact">
            <p>Email: info@example.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""

result = localscraper.invoke(
    {
        "user_prompt": "Extract the company name, description, and all contact information",
        "website_html": html_content,
    }
)
print("LocalScraper Result:")
print(result)

LocalScraper Result:
{'company_name': 'AI Company', 'specialization': 'artificial intelligence solutions', 'contact': {'email': 'info@example.com', 'phone': '(555) 123-4567'}}
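
The example above passes the HTML as an inline string. When the HTML lives on disk, one option is to read the file first and pass its contents as `website_html`. The sketch below only builds the input dictionary. `build_localscraper_input` and the file path are hypothetical.

```python
from pathlib import Path


def build_localscraper_input(html_path: str, user_prompt: str) -> dict:
    """Read an HTML file and build the input dict used by localscraper.invoke()."""
    html = Path(html_path).read_text(encoding="utf-8")
    return {"user_prompt": user_prompt, "website_html": html}


# Possible usage (assumes a file named page.html exists):
#   payload = build_localscraper_input("page.html", "Extract the company name")
#   result = localscraper.invoke(payload)
```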


### GetCredits Example

In [19]:
credits_info = credits.invoke({})
print("Credits Info:")
print(credits_info)

Credits Info:
{'remaining_credits': 49722, 'total_credits_used': 871}
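
The credit information can gate a batch job before it starts. The sketch below assumes the dictionary shape shown in the output above. `has_enough_credits` is a hypothetical helper, and the `required` threshold is the caller's own estimate, since the cost per call is not stated here.

```python
def has_enough_credits(credits_info: dict, required: int) -> bool:
    """Return True if the reported remaining credits cover `required`."""
    remaining = int(credits_info.get("remaining_credits", 0))
    return remaining >= required


# Copy of the GetCredits output shown above.
sample_credits = {"remaining_credits": 49722, "total_credits_used": 871}

if has_enough_credits(sample_credits, required=1000):
    print("Enough credits to continue")
else:
    print("Not enough credits; skipping the batch")
```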


## Chaining

Use these tools with LangChain's agent framework for more complex workflows:

In [20]:
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [SmartScraperTool(), MarkdownifyTool(), LocalScraperTool(), GetCreditsTool()]

# Create prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful AI assistant that can analyze websites and extract information. "
                "You have access to tools that can help you scrape and process web content. "
                "Always explain what you're doing before using a tool."
            )
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

# Create agent
llm = ChatOpenAI(temperature=0)
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example usage
query = """Extract information about what ScrapeGraph AI does from their website https://scrapegraphai.com/"""

response = agent_executor.invoke({"input": query})
print("\nFinal Response:", response["output"])

> Entering new AgentExecutor chain...

Invoking: `SmartScraper` with `{'user_prompt': 'Extract information about what ScrapeGraph AI does', 'website_url': 'https://scrapegraphai.com/'}`


{'description': 'ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It is designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis.', 'features': ['Easily extract and gather information with just a few lines of code.', 'Support for multiple programming languages with no complex setup required.', 'Provides clean, structured JSON data ready for use in applications.', 'Effortless, cost-effective, and AI-powered data extraction.', 'Handles web scraping challenges like proxy rotation and rate limiting.', 'Equipped to handle dynamic content rendered with JavaScript.', '

## API Reference
The official ScrapeGraph AI SDK documentation is available [here](https://scrapegraphai.com/documentation).

### SmartScraperTool
- **Purpose**: Extract structured data from websites given a URL and prompt
- **Input**: `website_url` (str), `user_prompt` (str)
- **Output**: Dictionary containing extracted data

### MarkdownifyTool
- **Purpose**: Convert any webpage to clean markdown format
- **Input**: `website_url` (str)
- **Output**: Markdown string

### LocalScraperTool
- **Purpose**: Extract structured data from local HTML content passed as a string
- **Input**: `website_html` (str), `user_prompt` (str)
- **Output**: Dictionary containing extracted data

### GetCreditsTool
- **Purpose**: Check remaining ScrapeGraph AI credits
- **Input**: None
- **Output**: Dictionary with credit information