# AI Web Scraper Agent

## *Introduction*

This standalone project combines ChatGPT with web scraping so that users can specify what information they want to collect from a website. It targets a limitation of ChatGPT: its knowledge base is fixed and would have to be updated constantly to stay current. For example, when a new framework is released, ChatGPT cannot tell people how to use its specific features (e.g., which parameters a function takes and how they are used). Beyond that, the agent could also be used simply to scrape only the specific information the user wants.

## *Ideas*

- Could have ChatGPT scrape the documentation of an API or framework that requires a lot of reading, so that it can answer questions directly from the documentation's contents
- Could implement a vector database to build a knowledge base the agent can draw on for better-informed answers
	- Cosine similarity search for the most relevant information within the database
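
As a rough illustration of the cosine similarity idea, here is a minimal sketch using only the standard library. The three-dimensional vectors and the `most_relevant` helper are made up for the example; a real knowledge base would presumably get its vectors from an embedding model and store them in a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)

def most_relevant(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "embeddings" (hypothetical values, for illustration only).
store = [
    ("How to install the framework", [0.9, 0.1, 0.0]),
    ("Function parameters reference", [0.1, 0.9, 0.2]),
    ("Release notes", [0.0, 0.2, 0.9]),
]
print(most_relevant([0.2, 0.8, 0.1], store, k=1))
```

Only the top-scoring chunks would then be passed to the model, which keeps the prompt small.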

## *Pipeline*

```mermaid
flowchart LR
    A[Input] --> B[Agent] --formulate search parameter--> C[Search & Scraper] -- First run: Scrape relevant links within main page --> B[Agent] 
    C[Search & Scraper] -- Second run: Scrape relevant information --> B[Agent]
    B[Agent] --> D[Output]

```
This block diagram shows how the agent will perform scraping and searching.
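
The "first run" in the diagram (collecting relevant links from the main page) could look roughly like the sketch below. It uses the standard-library HTML parser rather than BeautifulSoup so that it is self-contained. The `LinkCollector` class, the `relevant_links` helper, and the keyword filter are assumptions made for illustration, not part of the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def relevant_links(html, base_url, keyword):
    """First run: return absolute links whose URL mentions the keyword."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.links]
    return [url for url in absolute if keyword.lower() in url.lower()]

# Hypothetical page content and base URL.
page = '<main><a href="/docs/agents">Agents</a> <a href="/blog">Blog</a></main>'
print(relevant_links(page, "https://example.com", "docs"))
```

The second run would then fetch each returned link and extract its text.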

## *Environment set-up*

In [129]:
from dotenv import load_dotenv, find_dotenv

In [130]:
load_dotenv(find_dotenv())

True
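
`load_dotenv` returning `True` only means a `.env` file was found and loaded, not that it contains the keys the notebook needs. A small check along these lines can catch a missing key early. The key names are an assumption: `OPENAI_API_KEY` for the OpenAI models and `SERPAPI_API_KEY` for `SerpAPIWrapper`. The `missing_keys` helper is hypothetical.

```python
import os

# Assumed key names: OPENAI_API_KEY for the OpenAI models,
# SERPAPI_API_KEY for SerpAPIWrapper.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SERPAPI_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in env."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]

# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "placeholder"}))
```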

In [131]:
from langchain.chat_models import ChatOpenAI
from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_planner
from langchain.llms import OpenAI
from langchain import SerpAPIWrapper
from langchain import LLMMathChain
from langchain.tools import tool
# Tool was imported twice in the original cell; one import is enough.
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.memory import ConversationBufferMemory

## *Scrape Tool*

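Test 1: collect the main content of a page to retrieve all of its information. The idea was to target `<article>` tags; the code below currently selects the `<main>` element instead.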

In [132]:
import requests
from bs4 import BeautifulSoup

search = SerpAPIWrapper()

@tool
def SerpScraper(query: str) -> str:
    """Fetches the page at the given URL and returns the text of its first <main> element."""
    # To start from a Google search instead of a URL, resolve the first result:
    # url = search.results(query)["organic_results"][0]["link"]
    url = query
    response = requests.get(url, timeout=10)

    if response.status_code != 200:
        # Return the failure as text so the agent sees it as an observation
        # (the original printed it and returned None despite the str annotation).
        return f"Failed to fetch the page. Status code: {response.status_code}"

    soup = BeautifulSoup(response.text, "html.parser")
    main = soup.find("main")
    if main is None:
        return "No <main> element found on the page."

    # Same result as before: the stripped text of the first <main> element.
    return main.text.strip()

In [133]:
# SerpScraper("scrape langchain website for all the text")

## *Agent set-up*

Let's set up the agent using LangChain's recent Plan-and-Execute agent, which is inspired by [BabyAGI](https://github.com/yoheinakajima/babyagi). This gives us a good starting point for the agent.

In [134]:
llm = OpenAI(temperature=0)
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)  # not used by the agent yet
scraper_tool = SerpScraper
tools = [
    Tool(
        name="Scraper",
        func=scraper_tool.run,
        # Spell out the input contract: this is a single-input tool.
        description="Useful for when you need detailed information from a website. Input must be a single URL."
    ),
]

In [135]:
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")

In [136]:
planner = load_chat_planner(model)

In [137]:
executor = load_agent_executor(model, tools, verbose=True)

In [138]:
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True, handle_parse_error=True)

In [143]:
agent.run("scrape langchain website for all the text without get request" )



> Entering new PlanAndExecute chain...
steps=[Step(value='Use a web scraping library like BeautifulSoup or Scrapy to scrape the Langchain website.'), Step(value='Inspect the HTML structure of the Langchain website to identify the elements that contain the desired text.'), Step(value='Write code to extract the text from the identified elements using the web scraping library.'), Step(value='Store the extracted text in a suitable data structure, such as a list or a string.'), Step(value='Repeat steps 2-4 for all the relevant pages of the Langchain website.'), Step(value='Concatenate the extracted text from all the pages into a single string or combine them into a single list.'), Step(value='Clean the extracted text by removing any unwanted characters or formatting.'), Step(value="Return the cleaned text as the final result.\n\nGiven the above steps taken, please respond to the user's original question. ")]

> Entering new AgentExecutor chain...
Action:
```
{
```

ToolException: Too many arguments to single-input tool Scraper. Args: ['inspect', 'https://www.langchain.com']

Next step: store the observations in a database first, so that the agent can fetch only the relevant information when answering questions instead of scraping again. This should let the agent answer faster and more efficiently, while also using fewer tokens.
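
A minimal sketch of that idea, assuming a plain SQLite table keyed by URL. The table name, the `open_store` and `get_or_scrape` helpers, and the stand-in `fake_scrape` function are all hypothetical. In the real project the `scrape` argument would be the scraper tool, and the store could just as well be a vector database.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create (or open) a table mapping each URL to its scraped text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations (url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn

def get_or_scrape(conn, url, scrape):
    """Return the stored text for url, calling scrape(url) only on a miss."""
    row = conn.execute(
        "SELECT content FROM observations WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]
    content = scrape(url)
    conn.execute(
        "INSERT INTO observations (url, content) VALUES (?, ?)", (url, content)
    )
    conn.commit()
    return content

# Stand-in scraper that records how often it is called.
calls = []
def fake_scrape(url):
    calls.append(url)
    return f"text of {url}"

conn = open_store()
get_or_scrape(conn, "https://example.com", fake_scrape)
get_or_scrape(conn, "https://example.com", fake_scrape)
print(len(calls))  # the second lookup is served from the database
```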