# Web scraping using OpenAI Functions Extraction chain

Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.

By using OpenAI Functions in an extraction chain, you can avoid constantly changing your code when websites change. It also saves setup time when you scrape a website for the first time, because you only need to define a simple schema.

In this notebook, we scrape The Wall Street Journal. You'll need to have an OpenAI API key for this example.

## Install dependencies

We need OpenAI and LangChain for the extraction part, and Playwright and Beautiful Soup for the scraping part.

In [None]:
! pip install -q openai langchain playwright beautifulsoup4
! playwright install

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()
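If the key is missing, the failure only shows up later, when the first request is sent. As a minimal sketch, the check below fails early with a clearer message. The helper name `require_api_key` is hypothetical and not part of the OpenAI or LangChain libraries.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `require_api_key` is a hypothetical helper for this notebook only.
    """
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or load it from a .env file")
    return key
```

Calling `require_api_key()` once at the top of the notebook is enough; the return value does not need to be stored.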

## Use AsyncChromiumLoader

This loader uses Chromium to scrape the web page. Chromium is one of the browsers supported by Playwright, a browser automation library. The loader launches a headless instance of Chromium, meaning the browser runs without a graphical user interface. Headless mode is commonly used for web scraping and for automated testing of web pages.

In [2]:
from langchain.document_loaders import AsyncChromiumLoader
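The loader returns each page as raw HTML in `page_content`, which includes markup, scripts, and styles that use up the model's limited context. The sketch below shows one way to reduce HTML to its visible text before extraction. It uses only the standard library's `html.parser` (Beautiful Soup, installed above, would be a common alternative), and the names `VisibleTextParser` and `html_to_text` are hypothetical helpers, not part of LangChain.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped tags.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Passing `html_to_text(docs[0].page_content)` to the LLM instead of the raw HTML would let more of the actual article text fit into the same character budget.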

## Define the LLM object and extraction function

Next, we define the LLM object using LangChain and the OpenAI Python SDK.

We're using `gpt-3.5-turbo-0613` to guarantee access to the OpenAI Functions feature (although it may already be available to everyone at the time of writing). We also keep `temperature` at `0` to reduce randomness in the LLM's output.

In [4]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

def extract(content: str, schema: dict):
    # Build an extraction chain for the given schema and run it on the content.
    return create_extraction_chain(schema=schema, llm=llm).run(content)

## Define a schema

Next, you define a schema to specify what kind of data you want to extract. The key names matter here, because they tell the LLM what kind of information you want, so make them as descriptive as possible. In this example, we want to scrape only each news article's title and summary from The Wall Street Journal website.

In [5]:
schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}
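Listing a key under `required` does not guarantee that every returned record contains it; the model may still omit a field. As a minimal sketch, the hypothetical helper `split_by_required` below separates complete records from incomplete ones after extraction. The schema is repeated as `example_schema` only to keep the block self-contained.

```python
# Same shape as the schema above, repeated so this block runs on its own.
example_schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}


def split_by_required(records, schema):
    """Split records into (complete, incomplete) based on the schema's required keys.

    A record is complete when every required key is present with a non-empty value.
    `split_by_required` is a hypothetical helper, not a LangChain function.
    """
    required = schema.get("required", [])
    complete, incomplete = [], []
    for record in records:
        if all(record.get(key) for key in required):
            complete.append(record)
        else:
            incomplete.append(record)
    return complete, incomplete
```

Whether to discard incomplete records or keep them for a second pass depends on how strict the downstream use is.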

## Run the web scraper

Finally, we combine everything and run the scraper. The model's token limit is around 4000, so to keep this example simple we pass only the first 4000 characters of the page to the LLM; otherwise we risk sending more tokens than OpenAI allows. Note that the cutoff counts characters, not tokens, so it is a rough proxy for the limit rather than an exact measure.
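Truncation discards everything after the cutoff. One possible alternative, sketched below, is to split the page into overlapping character windows and run the extraction on each window in turn. `chunk_text` and its `overlap` parameter are hypothetical, not part of LangChain, and the character budget remains only a rough proxy for tokens.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that an item cut at a
    boundary still appears whole in one of the windows.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once this window reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

Each chunk could then be passed to `extract(content=chunk, schema=schema)` and the resulting lists concatenated; duplicates coming from the overlapping regions would need to be removed afterwards.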

In [6]:
import asyncio
import pprint

token_limit = 4000
wsj_url = "https://www.wsj.com"

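def scrape_with_playwright(urls, schema):
    # Render each URL in headless Chromium and load the resulting HTML.
    loader = AsyncChromiumLoader(urls)
    docs = loader.load()
    # Only the first page is processed in this example.
    html_content = docs[0].page_content

    print("Extracting content with LLM")
    # Truncate by characters as a rough guard against exceeding the token limit.
    html_content_fits_context_window_llm = html_content[:token_limit]
    extracted_content = extract(schema=schema,
                                content=html_content_fits_context_window_llm)

    pprint.pprint(extracted_content)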
    
urls = [wsj_url]
scrape_with_playwright(urls, schema=schema)

Extracting content with LLM
[{'news_article_summary': 'It isn’t cool enough for the Fed just yet, but it '
                          'is getting there.',
  'news_article_title': 'Employment data showed America’s job market is '
                        'cooling.'},
 {'news_article_title': 'Icahn Enterprises is cutting its dividend in half and '
                        'winding down bets that the stock market would '
                        'collapse, which have inflicted heavy losses.'},
 {'news_article_title': 'The decision rejects an argument made by a bipartisan '
                        'group of 38 state attorneys general and comes ahead '
                        'of a trial that is slated to begin next month.'},
 {'news_article_title': 'Russian opposition leader Alexei Navalny, Vladimir '
                        'Putin’s most prominent critic, was charged with '
                        'inciting and funding extremism.'},
 {'news_article_title': 'John Lauro called the indictment an