# Web scraping using OpenAI Functions Extraction chain

Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.

By using OpenAI Functions in the extraction chain, you can avoid constantly changing your code when websites change. Defining a simple schema also saves setup time when you scrape a website for the first time.

In this notebook, we scrape The Wall Street Journal. You'll need to have an OpenAI API key for this example.

## Install dependencies

We need OpenAI and LangChain for the extraction part, and Playwright and Beautiful Soup for the scraping part.

In [5]:
pip install -q openai langchain playwright beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


## Create a simple scraper function

First, we create an asynchronous function (`ascrape_playwright()`) that takes a URL, loads the page in a headless Chromium browser, and returns the page's text content with the markup stripped out.

We also need a few utility functions to clean up the HTML.

In [6]:
import requests
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

def remove_unwanted_tags(html_content, unwanted_tags=("script", "style")):
    soup = BeautifulSoup(html_content, 'html.parser')
    for tag in unwanted_tags:
        for element in soup.find_all(tag):
            element.decompose()
    return str(soup)


def extract_tags(html_content, tags: list[str]):
    soup = BeautifulSoup(html_content, 'html.parser')
    text_parts = []
    for tag in tags:
        elements = soup.find_all(tag)
        for element in elements:
            if tag == "a":
                href = element.get('href')
                if href:
                    text_parts.append(f"{element.get_text()} ({href})")
                else:
                    text_parts.append(element.get_text())
            else:
                text_parts.append(element.get_text())
    
    return ' '.join(text_parts)


def scrape_by_url_raw(url: str):
    # Plain HTTP fetch without a browser; returns the raw HTML as text.
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def remove_unessesary_lines(content):
    lines = content.split("\n")
    stripped_lines = [line.strip() for line in lines]
    non_empty_lines = [line for line in stripped_lines if line]
    seen = set()
    deduped_lines = [line for line in non_empty_lines if not (line in seen or seen.add(line))]
    # Join with newlines so that text from adjacent lines does not run together.
    cleaned_content = "\n".join(deduped_lines)

    return cleaned_content


async def ascrape_playwright(url) -> str:
    print("Started scraping...")
    results = ""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)

            page_source = await page.content()

            results = remove_unessesary_lines(extract_tags(remove_unwanted_tags(
                page_source), ["p", "li", "div", "a"]))
            print("Content scraped")
        except Exception as e:
            results = f"Error: {e}"
        finally:
            # Close the browser whether or not scraping succeeded.
            await browser.close()
    return results
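
The line clean-up step relies on a compact but slightly cryptic idiom (`seen.add(line)` inside a list comprehension). As a minimal sketch of the same idea using only the standard library, `dict.fromkeys` gives order-preserving de-duplication in a more readable form. The function name `dedupe_lines` is hypothetical and not part of this notebook, and joining with a newline is a choice made for this sketch.

```python
def dedupe_lines(content: str) -> str:
    """Strip each line, drop empty lines, and keep only the first copy of each line."""
    stripped = (line.strip() for line in content.split("\n"))
    non_empty = (line for line in stripped if line)
    # Dict keys keep insertion order (Python 3.7+), so this removes
    # duplicates while preserving the order of first appearance.
    unique = dict.fromkeys(non_empty)
    return "\n".join(unique)


# Made-up sample text for illustration only.
sample = "  Markets \n\nMarkets\nTech\n   \nMarkets\nOpinion"
print(dedupe_lines(sample))
```

The sample prints `Markets`, `Tech`, and `Opinion` on separate lines: repeated and blank lines are dropped, and the original order is kept.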

## Define the LLM object and extraction function

Next, we define the LLM object using LangChain and the OpenAI Python SDK.

We're using `gpt-3.5-turbo-0613` to guarantee access to the OpenAI Functions feature (although it may already be available to everyone by the time you read this). We also set `temperature` to `0` to keep the LLM's output as deterministic as possible.

In [13]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

openai_api_key = "OPENAI_API_KEY"  # placeholder: replace with your own API key

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key=openai_api_key)


def extract(content: str, schema: dict):
    return create_extraction_chain(schema=schema, llm=llm).run(content)
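
Hard-coding the key is fine for a quick demo, but a common alternative is to read it from an environment variable. The following is a minimal sketch, assuming the key is stored in a variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The result could then be passed as `openai_api_key=load_api_key()` when constructing `ChatOpenAI`.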

## Define a schema

Next, you define a schema to specify what kind of data you want to extract. The key names matter because they tell the LLM what kind of information you want, so make them as descriptive as possible. In this example, we want to extract only each news article's title and summary from The Wall Street Journal website.

In [14]:
schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}
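
Because the model's output is not guaranteed to honour the schema, it can help to check each returned record before using it. The sketch below assumes the chain returns a list of dicts and filters on the schema's `required` keys; `filter_valid_records` is a hypothetical helper, not a LangChain function, and the sample records are made up for illustration.

```python
def filter_valid_records(records: list[dict], schema: dict) -> list[dict]:
    """Keep only records with a non-empty string for every required key."""
    required = schema.get("required", [])
    valid = []
    for record in records:
        if all(isinstance(record.get(key), str) and record[key].strip() for key in required):
            valid.append(record)
    return valid


# Same shape as the schema defined in this section.
example_schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}

# Made-up records for illustration only.
example_records = [
    {"news_article_title": "Example headline", "news_article_summary": "Example summary."},
    {"news_article_title": "Missing summary"},
    {"news_article_title": "", "news_article_summary": "Empty title."},
]

print(filter_valid_records(example_records, example_schema))
```

Only the first sample record survives the filter; the other two are dropped because a required field is missing or empty.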

## Run the web scraper

Finally, we combine everything and run the scraper. The model's context window is about 4,000 tokens, so to keep this example simple we pass only the first 4,000 characters of the scraped text to the LLM. Because an English token averages roughly four characters, this cutoff is conservative and leaves room for the prompt and the response; sending the full page would risk exceeding the token limit.
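
Truncating discards everything after the cutoff. One alternative is to split the text into pieces and run the extraction on each piece in turn. The following is a minimal sketch that assumes a character budget is an acceptable proxy for a token budget; `split_into_chunks` and `max_chars` are hypothetical names, and an exact count would need a tokenizer such as `tiktoken`.

```python
def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[start:start + max_chars] for start in range(0, len(text), max_chars)]


# A 9,500-character dummy string yields two full chunks and one partial chunk.
chunks = split_into_chunks("a" * 9500, max_chars=4000)
print([len(chunk) for chunk in chunks])  # [4000, 4000, 1500]
```

Each chunk could then be passed to `extract()` and the resulting lists concatenated. Splitting on a fixed character count may cut an article in half, so results near chunk boundaries deserve a second look.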

In [15]:
import asyncio
import pprint

token_limit = 4000  # applied below as a character cutoff, a rough stand-in for the token limit
wsj_url = "https://www.wsj.com"

async def scrape_with_playwright(url: str, schema: dict):
    html_content = await ascrape_playwright(url)

    print("Extracting content with LLM")

    html_content_fits_context_window_llm = html_content[:token_limit]

    extracted_content = extract(schema=schema,
                                content=html_content_fits_context_window_llm)

    pprint.pprint(extracted_content)

await scrape_with_playwright(url=wsj_url, schema=schema)

Started scraping...
Content scraped
Extracting content with LLM
[{'news_article_summary': 'Former President Donald Trump arrived at a '
                          'Washington federal courthouse to face criminal '
                          'charges related to his efforts to overturn the 2020 '
                          'election.',
  'news_article_title': 'Former President Donald Trump faces criminal charges '
                        'related to overturning the 2020 election'},
 {'news_article_summary': 'The tech giant’s revenue declined for the third '
                          'consecutive quarter as sales of iPhones missed '
                          'analyst estimates. But revenue in the services unit '
                          'reached a record high.',
  'news_article_title': "Tech giant's revenue declined, but services unit "
                        'reached a record high'},
 {'news_article_summary': 'The American Academy of Pediatrics plans to review '
                          'the evidence for gen