Added new use case docs for Web Scraping, Chromium loader, BS4 transformer #8732

trancethehuman · 2023-08-04T02:12:41Z

Description: Added a new use case category called "Web Scraping" to the docs, along with a tutorial on scraping websites using the OpenAI Functions Extraction chain.
Tag maintainer: @baskaryan @hwchase17
Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on LinkedIn mostly)

vercel · 2023-08-04T02:12:44Z

The latest updates on your projects:

Name	Status	Updated (UTC)
langchain	✅ Ready	Aug 11, 2023 5:29pm

hwchase17

i think the structure should be - keep index.mdx in docs/docs_skeleton/docs/use_cases/web_scraping but then move the notebook to /docs/extras/use_cases/web_scraping (and move the .mdx file, it will get built at build time)

trancethehuman · 2023-08-04T03:33:44Z

> i think the structure should be - keep index.mdx in docs/docs_skeleton/docs/use_cases/web_scraping but then move the notebook to /docs/extras/use_cases/web_scraping (and move the .mdx file, it will get built at build time)

I guess I don't need to run yarn build and push the markdown file generated from the notebook here, since that already happens in the build CI/CD?

rlancemartin

Great use case!

rlancemartin · 2023-08-04T18:11:29Z

docs/extras/use_cases/web_scraping/web_scraping_with_openai_functions.ipynb

+   "source": [
+    "# Web scraping using OpenAI Functions Extraction chain\n",
+    "\n",
+    "Web scraping is challenging for many reasons; one of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.\n",


Great use case!

rlancemartin · 2023-08-04T18:14:09Z

docs/extras/use_cases/web_scraping/web_scraping_with_openai_functions.ipynb

+   "id": "5ef7f514",
+   "metadata": {},
+   "source": [
+    "## Create a simple scraper function\n",


OK, this is cool.

1/ Let's create a new web loader for Chromium that wraps this logic.

2/ It can follow what we did here:

#8036

In particular, see.

Simply move this code to create a new loader (e.g., chromium_loader.py or similar).

It will launch a headless instance of Chromium to scrape.

I ran what you have and compared it to this (see docs here):

from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer

loader = AsyncHtmlLoader(url)
docs = loader.load()
html2text = Html2TextTransformer()
docs = html2text.transform_documents(docs)
html_content = docs[0].page_content

I found that Chromium is better in this case.

For some reason, html2text is losing the news article summaries.

We should add it as a new web loader and simply import here.

Nice. Will change mine

Just took care of this for you!
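For readers following the thread, here is a minimal sketch of what "launch a headless instance of Chromium to scrape" can look like with Playwright's async API. It is an illustration only, not the loader that was added in this PR: the helper names `fetch_rendered_html` and `load_urls` and the record shape are assumptions made for the example.

```python
import asyncio
from typing import Dict, List


async def fetch_rendered_html(url: str) -> str:
    """Return the page HTML after headless Chromium has rendered it (hypothetical helper)."""
    # Imported lazily so this module still loads where playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            # Always release the browser, even if navigation fails.
            await browser.close()


def load_urls(urls: List[str]) -> List[Dict]:
    """Scrape each URL in turn and return simple document-like records."""
    return [
        {"page_content": asyncio.run(fetch_rendered_html(u)), "metadata": {"source": u}}
        for u in urls
    ]
```

`asyncio.run` will not work inside an already running event loop (for example in a Jupyter notebook); there, `await fetch_rendered_html(url)` directly instead.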

rlancemartin · 2023-08-04T18:19:05Z

docs/extras/use_cases/web_scraping/web_scraping_with_openai_functions.ipynb

+    "\n",
+    "openai_api_key = \"OPENAI_API_KEY\"\n",
+    "\n",
+    "llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\", openai_api_key=openai_api_key)\n",


Functions work w/ default gpt3.5 / 4 now, AFAIK.
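A minimal sketch of the same setup without the pinned `-0613` model, assuming the LangChain import paths that were current at the time of this PR and an `OPENAI_API_KEY` set in the environment. The schema field names and the `build_extraction_chain` helper are illustrative assumptions, not code taken from the notebook.

```python
# Illustrative schema for title / summary extraction (field names are assumptions).
SCHEMA = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}


def build_extraction_chain(model: str = "gpt-3.5-turbo"):
    """Build an OpenAI Functions extraction chain using the default model name."""
    # Imported lazily so the schema can be inspected without langchain installed.
    from langchain.chains import create_extraction_chain
    from langchain.chat_models import ChatOpenAI

    # ChatOpenAI reads OPENAI_API_KEY from the environment when no key is passed.
    llm = ChatOpenAI(temperature=0, model=model)
    return create_extraction_chain(schema=SCHEMA, llm=llm)
```

Calling `build_extraction_chain().run(text)` would then send `text` to the model, so it needs network access and a valid key.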

rlancemartin · 2023-08-04T18:29:19Z

docs/extras/use_cases/web_scraping/web_scraping_with_openai_functions.ipynb

+    }
+   ],
+   "source": [
+    "pip install -q openai langchain playwright beautifulsoup4"


Also need to add:

! playwright install

to download the necessary browser binaries (Chromium, Firefox, WebKit).

rlancemartin · 2023-08-04T21:33:21Z

I cleaned this up a bit more.

Main issue: the extraction is sensitive to the transformation of raw HTML (HTML2Text vs BS4).

Have a look at the notebook.

Also title / summary extraction doesn't look quite right.
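To make the sensitivity concrete, here is a small standard-library sketch showing how the choice of transformation changes the text that reaches the extraction chain. It is a stand-in for illustration, not HTML2Text or BS4 themselves; `TextExtractor`, `extract_text`, and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set


class TextExtractor(HTMLParser):
    """Collect text, optionally only text nested inside the given tags."""

    def __init__(self, keep_tags: Optional[Set[str]] = None):
        super().__init__()
        self.keep_tags = keep_tags
        self.depth = 0  # number of kept tags we are currently inside
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.keep_tags and tag in self.keep_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.keep_tags and tag in self.keep_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # With no filter keep everything; otherwise keep only text inside kept tags.
        if text and (self.keep_tags is None or self.depth > 0):
            self.chunks.append(text)


def extract_text(html: str, keep_tags: Optional[Set[str]] = None) -> str:
    parser = TextExtractor(keep_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<div><h3>Headline A</h3><span>Summary of A.</span>"
    "<nav>Sign in</nav></div>"
)

print(extract_text(SAMPLE))                  # Headline A Summary of A. Sign in
print(extract_text(SAMPLE, {"span"}))        # Summary of A.
print(extract_text(SAMPLE, {"h3", "span"}))  # Headline A Summary of A.
```

Keeping only `span` drops the headline, while keeping everything lets navigation text through; either can degrade title / summary extraction downstream.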

rlancemartin · 2023-08-07T14:06:13Z

docs/extras/use_cases/web_scraping/web_scraping_with_openai_functions.ipynb

+   "id": "97f7de42",
+   "metadata": {},
+   "source": [
+    "# Run the web scraper w/ BeautifulSoup\n",


Can you update the notebook to use the BeautifulSoupTransformer here?
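A rough sketch of what that notebook step could look like, assuming the `BeautifulSoupTransformer` import path from the LangChain version of this PR and that `beautifulsoup4` is installed. The wrapper name `transform_with_bs4` and the default of extracting only `span` tags are assumptions for illustration.

```python
from typing import Sequence


def transform_with_bs4(html_docs, tags: Sequence[str] = ("span",)):
    """Keep only the text inside the given tags of each loaded HTML document."""
    # Imported lazily; requires langchain and beautifulsoup4 at call time.
    from langchain.document_transformers import BeautifulSoupTransformer

    transformer = BeautifulSoupTransformer()
    return transformer.transform_documents(html_docs, tags_to_extract=list(tags))
```

Here `html_docs` is the list of documents returned by a loader; which tags to extract depends on the target site's markup and usually needs some trial and error.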

@baskaryan

…ormer (langchain-ai#8732) - Description: Added a new use case category called "Web Scraping", and a tutorial to scrape websites using OpenAI Functions Extraction chain to the docs. - Tag maintainer:@baskaryan @hwchase17 , - Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on LinkedIn mostly) --------- Co-authored-by: Lance Martin <lance@langchain.dev>

added docs for web scraping

7e4a0ec

dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Aug 4, 2023

trancethehuman changed the title ~~added docs for web scraping~~ Added new use case category (Web Scraping) and a tutorial for using OpenAI Functions extraction chain for that Aug 4, 2023

vercel bot deployed to Preview – langchain August 4, 2023 02:21 View deployment

hwchase17 reviewed Aug 4, 2023


rlancemartin self-assigned this Aug 4, 2023

trancethehuman added 2 commits August 3, 2023 23:20

typos

0bfc404

moved files around

63e278c

vercel bot deployed to Preview – langchain August 4, 2023 03:32 View deployment

trancethehuman requested a review from hwchase17 August 4, 2023 05:15

rlancemartin reviewed Aug 4, 2023


Create new loader, update notebook

b434fc8

vercel bot deployed to Preview – langchain August 4, 2023 20:08 View deployment

rlancemartin added 2 commits August 4, 2023 14:30

Split loading and transformation (via html2text or bs4)

c62c189

fmt

d8401bd

added beautiful soup transformer

1157aa7

rlancemartin reviewed Aug 7, 2023


took out html2Text

eba9d9f

vercel bot deployed to Preview – langchain August 7, 2023 14:28 View deployment

trancethehuman and others added 3 commits August 8, 2023 08:14

added types

12f5e3a

Merge branch 'master' into trancethehuman/add-web-scraper-to-usecases

84584a5

Update, clean

993aa75

rlancemartin changed the title ~~Added new use case category (Web Scraping) and a tutorial for using OpenAI Functions extraction chain for that~~ Added new use case docs for Web Scraping, Chromium loader, BS4 transformer Aug 9, 2023

vercel bot deployed to Preview – langchain August 9, 2023 02:07 View deployment

fmt

b9c421a

rlancemartin force-pushed the add-web-scraper-to-usecases branch from 2edafe6 to e266287 Compare August 9, 2023 02:37

bs4 import check

3553df1

rlancemartin force-pushed the add-web-scraper-to-usecases branch from e266287 to 3553df1 Compare August 9, 2023 02:42

Update tags to get correct article names

8607346

vercel bot deployed to Preview – langchain August 9, 2023 03:19 View deployment

trancethehuman added 2 commits August 10, 2023 14:14

onlyspan

0b52d84

added CNN

67dbdf9

vercel bot deployed to Preview – langchain August 10, 2023 18:37 View deployment

Minor updates

d0cc529

vercel bot deployed to Preview – langchain August 10, 2023 21:13 View deployment

rlancemartin force-pushed the add-web-scraper-to-usecases branch 2 times, most recently from 8577adb to 976bd1c Compare August 10, 2023 21:55

fmt

7a96ae4

rlancemartin force-pushed the add-web-scraper-to-usecases branch from 976bd1c to 7a96ae4 Compare August 10, 2023 22:09

vercel bot deployed to Preview – langchain August 10, 2023 22:23 View deployment

Merge branch 'master' into trancethehuman/add-web-scraper-to-usecases

7a53441

vercel bot deployed to Preview – langchain August 11, 2023 17:29 View deployment

rlancemartin merged commit e4418d1 into langchain-ai:master Aug 11, 2023
23 checks passed
