# Web scraping using OpenAI Functions Extraction chain

Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.

By using OpenAI Functions in the extraction chain, you can avoid constantly changing your code when websites change. It also saves setup time when you scrape a website for the first time, because you only need to define a simple schema.

In this notebook, we scrape The Wall Street Journal. You'll need to have an OpenAI API key for this example.

## Install dependencies

We need OpenAI and LangChain for the extraction part, and Playwright and Beautiful Soup for the scraping part.

In [None]:
! pip install -q openai langchain playwright beautifulsoup4
! playwright install

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()
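
Before calling the API, it can help to fail early if the key is missing. The following is a minimal sketch; `require_api_key` is a hypothetical helper name, not part of OpenAI or LangChain, and it only assumes the key is stored in the `OPENAI_API_KEY` environment variable as described above.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise with a readable message.

    Hypothetical helper for illustration; `env` is any mapping, which makes
    the function easy to exercise without touching the real environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or load it from a .env file.")
    return key


# Typical use, before creating the LLM object:
# api_key = require_api_key()
```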

## Use AsyncChromiumLoader

This loader uses Chromium to scrape the web page. Chromium is one of the browsers supported by Playwright, a browser automation library. The loader launches a headless instance of Chromium, meaning the browser runs without a graphical user interface; this mode is commonly used for web scraping and automated testing of web pages.

In [2]:
from langchain.document_loaders import AsyncChromiumLoader

## Define the LLM object and extraction function

Next, we define the LLM object using LangChain and OpenAI Python SDK.

We're using `gpt-3.5-turbo-0613` to guarantee access to the OpenAI Functions feature (although this might be available to everyone by the time of writing). We're also keeping `temperature` at `0` to reduce the randomness of the LLM's output.

## Define a schema

Next, define a schema that specifies the data you want to extract. The key names matter because they tell the LLM what kind of information you want, so make them as descriptive as possible. In this example, we extract only each news article's title and summary from The Wall Street Journal website.

In [3]:
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}

def extract(content: str, schema: dict):
    return create_extraction_chain(schema=schema, llm=llm).run(content)
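
The model is not guaranteed to honor the `required` list, so the chain may return records that omit a key. As a rough sketch, the records can be checked against the schema after extraction. `split_by_required` is a hypothetical helper, not a LangChain function, and the sample records are made up for illustration.

```python
# Hypothetical helper, not part of LangChain: splits extracted records by
# whether every key listed under "required" is present and non-empty.
def split_by_required(records, schema):
    required = schema.get("required", [])
    complete, incomplete = [], []
    for record in records:
        if all(record.get(key) for key in required):
            complete.append(record)
        else:
            incomplete.append(record)
    return complete, incomplete


# Same shape as the schema used in this notebook.
example_schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}

# Made-up records for illustration only.
sample_records = [
    {"news_article_title": "Example headline",
     "news_article_summary": "Example summary."},
    {"news_article_title": "Headline without a summary"},
]

complete, incomplete = split_by_required(sample_records, example_schema)
print(len(complete), "complete;", len(incomplete), "incomplete")
```

Keeping the incomplete records in a separate list, rather than dropping them, makes it easier to see how often the model skips a field.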

## Run the web scraper with Html2Text

* Grab the HTML / JS with AsyncChromiumLoader
* Transform the HTML to text (Markdown) with Html2TextTransformer

In [5]:
import asyncio
import pprint
from langchain.document_transformers import Html2TextTransformer

# NOTE: despite its name, this limit is applied to characters, not tokens.
token_limit = 4000
def scrape_with_playwright(urls, schema):
    
    loader = AsyncChromiumLoader(urls)
    docs = loader.load()
    html2text = Html2TextTransformer()
    docs_transformed = html2text.transform_documents(docs)
    html_content = docs_transformed[0].page_content
    
    print("Extracting content with LLM")
    html_content_fits_context_window_llm = html_content[:token_limit]
    extracted_content = extract(schema=schema,
                                content=html_content_fits_context_window_llm)
    pprint.pprint(extracted_content)
    
urls = ["https://www.wsj.com"]
scrape_with_playwright(urls, schema=schema)

Extracting content with LLM
[{'news_article_summary': "Trusted Insights On Today's Top Stories",
  'news_article_title': 'The Wall Street Journal'},
 {'news_article_summary': 'WSJ Pro, Bankruptcy, Central Banking, Private '
                          'Equity, Venture Capital',
  'news_article_title': 'Economy'},
 {'news_article_summary': 'Management, Journal Reports, The Future of '
                          'Everything, Obituaries, Tech/WSJ.D',
  'news_article_title': 'Business'},
 {'news_article_summary': 'CFO Journal, CIO Journal, CMO Today, Logistics '
                          'Report, Risk & Compliance, The Workplace Report',
  'news_article_title': 'C-Suite'},
 {'news_article_summary': 'CIO Journal, The Future of Everything, Personal '
                          'Tech',
  'news_article_title': 'Tech'},
 {'news_article_summary': 'Bonds, Commercial Real Estate, Commodities & '
                          'Futures, Stocks, Personal Finance, WSJ Money, '
                          'Stree
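
Note that `html_content[:token_limit]` keeps only the first 4000 characters of the page, so anything after that point never reaches the LLM. One possible workaround, sketched below, is to split the text into overlapping windows and run the extraction on each one. `split_into_chunks` and its `overlap` parameter are hypothetical names introduced here for illustration, and the sizes are in characters, not tokens.

```python
def split_into_chunks(text, chunk_size=4000, overlap=200):
    """Split `text` into character windows of at most `chunk_size`.

    Consecutive windows share `overlap` characters so that an article cut at
    a boundary still appears whole in at least one window (assuming it is
    shorter than the overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# Possible use with the `extract` function defined earlier (not run here):
# results = []
# for chunk in split_into_chunks(html_content):
#     results.extend(extract(chunk, schema))
```

Each chunk costs one LLM call, and overlapping windows can yield duplicate records, so the combined results may need deduplication.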

## Run the web scraper with BeautifulSoup

* Grab the HTML / JS with AsyncChromiumLoader
* Transform the HTML to text with BeautifulSoup

In [12]:
import requests
from bs4 import BeautifulSoup

def remove_unwanted_tags(html_content, unwanted_tags=["script", "style"]):
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in unwanted_tags:
        for element in soup.find_all(tag):
            element.decompose()
    return str(soup)

def extract_tags(html_content, tags: list[str]):
    soup = BeautifulSoup(html_content, "html.parser")
    text_parts = []
    for tag in tags:
        elements = soup.find_all(tag)
        for element in elements:
            if tag == "a":
                href = element.get("href")
                if href:
                    text_parts.append(f"{element.get_text()} ({href})")
                else:
                    text_parts.append(element.get_text())
            else:
                text_parts.append(element.get_text())

    return " ".join(text_parts)

def scrape_by_url_raw(url: str) -> str:
    """Fetch a page with a plain HTTP GET (no JavaScript rendering)."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def remove_unessesary_lines(content: str) -> str:
    lines = content.split("\n")
    stripped_lines = [line.strip() for line in lines]
    non_empty_lines = [line for line in stripped_lines if line]
    seen = set()
    deduped_lines = [
        line for line in non_empty_lines if not (line in seen or seen.add(line))
    ]
    # Join with a space so the last word of one line does not fuse with the
    # first word of the next.
    cleaned_content = " ".join(deduped_lines)

    return cleaned_content

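def bs4_transfomer(html):
    # Drop <script>/<style>, keep the text of selected tags, then tidy the lines.
    without_unwanted_tags = remove_unwanted_tags(html)
    text = extract_tags(without_unwanted_tags, ["p", "li", "div", "a"])
    return remove_unessesary_lines(text)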
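
The list comprehension in `remove_unessesary_lines` removes duplicate lines while keeping the first occurrence of each; it works because `set.add` returns `None`. A sketch of an equivalent that some readers may find easier to follow uses `dict.fromkeys`, relying on dictionaries preserving insertion order (Python 3.7+). `dedupe_preserving_order` is a hypothetical name, not a function from the notebook.

```python
def dedupe_preserving_order(lines):
    """Return `lines` without duplicates, keeping each first occurrence."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(lines))


print(dedupe_preserving_order(["Home", "World", "Home", "Tech", "World"]))
# → ['Home', 'World', 'Tech']
```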

Run

In [13]:
def scrape_with_playwright(urls, schema):
    
    loader = AsyncChromiumLoader(urls)
    docs = loader.load()
    html_content = bs4_transfomer(docs[0].page_content)

    print("Extracting content with LLM")
    html_content_fits_context_window_llm = html_content[:token_limit]
    extracted_content = extract(schema=schema,
                                content=html_content_fits_context_window_llm)
    pprint.pprint(extracted_content)
    
urls = ["https://www.wsj.com"]
scrape_with_playwright(urls, schema=schema)

Extracting content with LLM
[{'news_article_summary': 'It isn’t cool enough for the Fed just yet, but it '
                          'is getting there.',
  'news_article_title': 'Employment data showed America’s job market is '
                        'cooling.'},
 {'news_article_title': 'A bond selloff pushed up yields and drew traders away '
                        'from stocks in recent days.'},
 {'news_article_title': 'Icahn Enterprises is cutting its dividend in half and '
                        'winding down bets that the stock market would '
                        'collapse, which have inflicted heavy losses.'},
 {'news_article_title': 'The decision eliminates a sizable claim against '
                        'Google while preserving the core of the government’s '
                        'case against the search giant, clearing the way for '
                        'the antitrust trial that is slated to begin next '
                        'month.'},
 {'news_article_title': 'J