# Web data extraction using ZYTE URL loader

In this notebook we present an example of question answering over some recent newspaper articles. To collect the articles we use the ZYTE API web scraper. First we collect the webpage content using the ZYTE URL Loader.

In [1]:
#!pip install zyte-api
#!pip install html2text
#!pip install nest_asyncio

In [2]:
from langchain_community.document_loaders import ZyteURLLoader
import nest_asyncio
import os

nest_asyncio.apply()

In [3]:
zyte_api_key = os.environ.get('ZYTE_API_KEY')
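
`os.environ.get` returns `None` when the variable is missing, so a forgotten key may only surface later, when the loader is actually used. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name introduced here for illustration, not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in this notebook (raises immediately if the key is missing):
# zyte_api_key = require_env('ZYTE_API_KEY')
```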

In [4]:
urls = [
    'https://www.rte.ie/sport/paris-2024/2024/0806/1463620-paris-2024-coyle-and-sweetnam-dont-figure-in-medal/'
]

## Check webpage content extracted by ZYTE_URL_Loader

Here we show the different modes in which a URL can be loaded. There are three options: 'html', 'html-text', and 'article'. For a single URL we try all three options and compare the results.

In [5]:
loader_html = ZyteURLLoader(urls, mode='html', api_key=zyte_api_key)
loader_html_text = ZyteURLLoader(urls, mode='html-text', api_key=zyte_api_key)
loader_article = ZyteURLLoader(urls, mode='article', api_key=zyte_api_key)

In [6]:
page_html = loader_html.load()
page_html_text = loader_html_text.load()
page_article = loader_article.load()
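
The three loaders differ only in the `mode` argument, so the same work could be expressed as a loop. The sketch below assumes a caller-supplied factory that builds a loader for a given mode; `load_in_modes` and `content_lengths` are hypothetical helper names, and the code only relies on the loader having a `load()` method that returns documents with a `page_content` attribute, as seen in this notebook.

```python
from typing import Any, Callable, Dict, List, Sequence

MODES = ("html", "html-text", "article")


def load_in_modes(
    make_loader: Callable[[str], Any],
    modes: Sequence[str] = MODES,
) -> Dict[str, List[Any]]:
    """Build one loader per mode with `make_loader` and collect the loaded documents."""
    return {mode: make_loader(mode).load() for mode in modes}


def content_lengths(pages_by_mode: Dict[str, List[Any]]) -> Dict[str, int]:
    """Length of the first document's page_content for each mode."""
    return {mode: len(docs[0].page_content) for mode, docs in pages_by_mode.items()}


# Possible usage with the objects defined in this notebook:
# pages_by_mode = load_in_modes(
#     lambda mode: ZyteURLLoader(urls, mode=mode, api_key=zyte_api_key)
# )
# content_lengths(pages_by_mode)
```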


We check the beginning of the page content and the length of the text in each mode to get an idea of what was extracted.

In [7]:
print(page_html[0].page_content[:400])





<!DOCTYPE html>
<html class="no-js" lang="en">
<head>

<script src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript" charset="UTF-8" data-domain-script="a58df52b-2812-4cc9-99c6-cf9bcfe5af8b"></script>
<script src="https://www.rte.ie/djstatic/dotie/privacy/cookie-functions.js?v=20240830v16729"></script>
<script type="text/javascript">
    var optanonCallbacks = [
 


In [8]:
len(page_html[0].page_content)

143489

In [9]:
print(page_html_text[0].page_content[:200])

skip to main content

Your browser does not support Javascript. Please turn Javascript on to get the
best experience from rte.ie

Menu

[ Weather ](https://www.rte.ie/weather/)

[ __ Ireland's Nationa


In [10]:
len(page_html_text[0].page_content)

19182

In [11]:
print(page_article[0].page_content[:200])

Paris 2024: No luck for Daniel Coyle and Shane Sweetnam in individual final

There was no joy for Ireland's Daniel Coyle and Shane Sweetnam in the show jumping individual final, with the gold medal wo


In [12]:
len(page_article[0].page_content)

3840

As the lengths show, the content extracted in 'article' mode is much smaller than in 'html-text' or 'html' mode, because it uses a machine learning model to locate the article within the webpage and extracts only its title and text. This is particularly helpful for (1) removing irrelevant content from the page and (2) reducing cost by cutting down the number of input tokens sent to the LLM in the next step.
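
To put numbers on this, the sketch below compares the three lengths printed above. The token estimate uses a rough rule of thumb of about 4 characters per token, which is an assumption and not a measurement; an actual tokenizer would give different counts.

```python
# Character counts reported above for the same URL in each mode.
lengths = {"html": 143489, "html-text": 19182, "article": 3840}

# Assumption: roughly 4 characters per token for English text (rule of thumb only).
CHARS_PER_TOKEN = 4


def reduction_vs(baseline: str, lengths: dict) -> dict:
    """Fraction of characters saved in each mode relative to the `baseline` mode."""
    base = lengths[baseline]
    return {mode: 1 - n / base for mode, n in lengths.items()}


for mode, n in lengths.items():
    print(f"{mode:>9}: {n:>7} chars, roughly {n // CHARS_PER_TOKEN} tokens")

savings = reduction_vs("html", lengths)
print(f"'article' is {savings['article']:.1%} smaller than 'html'")
```

With these figures the 'article' text is under 3% of the raw HTML and about a fifth of the 'html-text' output.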

## Page content with browser rendering enabled (JS off)

At times we need the page content as rendered by a browser. This can be done by setting the `browserHtml` argument. We disable JavaScript in this example.

In [13]:
urls = [
    'https://www.whatismybrowser.com/detect/is-javascript-enabled/'
]

In [14]:
kwargs = {
    'browserHtml': True,
    'javascript': False
}
loader_browser_text = ZyteURLLoader(urls, mode='html-text', api_key=zyte_api_key, download_kwargs=kwargs)

In [15]:
page_browser = loader_browser_text.load()

In [16]:
print(page_browser[0].page_content[:500])

[ ](/)

[WhatIsMyBrowser.com](/)

  * [My browser](/)
  * [Guides](/guides/)
  * [Detect my settings](/detect/)
  * [Tools](/developers/tools/)

  1. [ Homepage  ](/)
  2. [ Detect my settings  ](/detect/)
  3. [ Is JavaScript enabled?  ](/detect/is-javascript-enabled/)

# Is JavaScript enabled?

Updated at: Jun 26, 2024

No

JavaScript is enabled in your web browser. Congratulations; you're one step
closer to having a fully featured online experience.

## [Need help enabling JavaScript?](/guide


## Page content with JS enabled

In [17]:
kwargs = {
    'browserHtml': True,
    'javascript': True,
}
loader_browser_js_text = ZyteURLLoader(urls, mode='html-text', api_key=zyte_api_key, download_kwargs=kwargs)

In [18]:
page_browser_js = loader_browser_js_text.load()

In [19]:
print(page_browser_js[0].page_content[:500])

[ ](/)

[WhatIsMyBrowser.com](/)

  * [My browser](/)
  * [Guides](/guides/)
  * [Detect my settings](/detect/)
  * [Tools](/developers/tools/)

  1. [ Homepage  ](/)
  2. [ Detect my settings  ](/detect/)
  3. [ Is JavaScript enabled?  ](/detect/is-javascript-enabled/)

# Is JavaScript enabled?

Updated at: Jun 26, 2024

Yes

JavaScript is enabled in your web browser. Congratulations; you're one step
closer to having a fully featured online experience.

## [Need help enabling JavaScript?](/guid
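
The two outputs differ only in the "Yes"/"No" line under the heading; the sentence "JavaScript is enabled in your web browser" appears in both, so it is not a reliable indicator in the extracted text. A small sketch that pulls out the verdict line programmatically is shown below; `js_verdict` is a hypothetical helper, and it assumes the page layout seen in the outputs above.

```python
from typing import Optional


def js_verdict(text: str, heading: str = "# Is JavaScript enabled?") -> Optional[str]:
    """Return the first 'Yes' or 'No' line found after `heading`, or None if absent."""
    seen_heading = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == heading:
            seen_heading = True
        elif seen_heading and stripped in ("Yes", "No"):
            return stripped
    return None


# Possible usage with the pages loaded in this notebook:
# js_verdict(page_browser[0].page_content)     # JavaScript disabled
# js_verdict(page_browser_js[0].page_content)  # JavaScript enabled
```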


## Question answering from some recent articles

In this part we extract the content from a number of articles and then perform question answering.

In [30]:
#!pip install -qU chromadb
#!pip install -qU langchain-openai


In [8]:
openai_api_key = os.environ.get('OPENAI_API_KEY')

In [9]:
urls = [
    # Sports
    'https://www.rte.ie/sport/paris-2024/2024/0806/1463620-paris-2024-coyle-and-sweetnam-dont-figure-in-medal/', 
    'https://www.rte.ie/sport/paris-2024/2024/0806/1463576-kellie-harrington-olympics-yang/',
    'https://www.rte.ie/sport/paris-2024/2024/0806/1463615-paris-2024-healy-and-osullivan-into-1500m-repechage/',
    'https://www.rte.ie/sport/football/2024/0806/1463614-gaa-invites-bids-for-rights-to-gaago-broadcast-package/',
    # Business
    'https://www.rte.ie/news/business/2024/0806/1463642-cso-monthly-unemployment-figures/',
    'https://www.rte.ie/news/business/2024/0806/1463597-tokyos-nikkei-index-recovers/',
    'https://www.rte.ie/news/business/2024/0806/1463591-aib-services-pmi/',
]

In [10]:
loader_articles = ZyteURLLoader(urls, mode='article', api_key=zyte_api_key)


In [11]:
pages = loader_articles.load()

In [12]:
len(pages)

7

In [13]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

We use OpenAI embeddings to capture the semantic content of the articles and store them in a Chroma vector store.

In [14]:
embedding = OpenAIEmbeddings(api_key=openai_api_key)

In [15]:
vectordb = Chroma.from_documents(
    documents=pages,
    embedding=embedding,
    #persist_directory=persist_directory
)

In [16]:
len(vectordb)

7

In [17]:
question = 'Who won the gold medal in individual jumping in 2024 olympics?'

In [18]:
ret_docs = vectordb.similarity_search(question, k=1)
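
Conceptually, a similarity search embeds the question and returns the stored documents whose vectors are closest to it. The toy sketch below illustrates the idea with hand-made 3-dimensional vectors and cosine similarity; the vectors and labels are invented for illustration, and the distance metric a given vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Names of the k documents most similar to the query, best match first."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (made up); real embeddings have many more dimensions.
toy_docs = {
    "show jumping": [0.9, 0.1, 0.0],
    "unemployment": [0.1, 0.9, 0.2],
    "stock index": [0.0, 0.7, 0.7],
}
toy_query = [0.8, 0.2, 0.1]  # stands in for the embedded question

print(top_k(toy_query, toy_docs, k=1))
```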

In [19]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model_name='gpt-3.5-turbo', api_key=openai_api_key, temperature=0)

In [20]:
from langchain.chains import RetrievalQA

In [21]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    return_source_documents=True,
    retriever=vectordb.as_retriever()
)
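
When no `chain_type` is given, `RetrievalQA.from_chain_type` uses the "stuff" strategy: the retrieved documents are concatenated into a single prompt together with the question. The sketch below is a simplified illustration of that idea only; the prompt wording and the `build_stuffed_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_stuffed_prompt(question: str, contexts: list) -> str:
    """Join the retrieved texts and the question into one prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative call with placeholder texts standing in for retrieved articles.
prompt = build_stuffed_prompt(
    "Who won the gold medal?",
    ["First retrieved article text.", "Second retrieved article text."],
)
print(prompt)
```

Because every retrieved document goes into the prompt, shorter documents (such as those from 'article' mode) keep the prompt size down.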

In [24]:
result = qa_chain.invoke(question)

In [25]:
result['result']

"The gold medal in individual jumping at the 2024 Olympics was won by Germany's Christian Kukuk after a jump-off."

In [26]:
question2 = 'How is the employment rate changing for men in recent months?'

In [27]:
result = qa_chain.invoke(question2)

In [28]:
result['result']

'The unemployment rate for men in Ireland rose to 4.7% in July from the revised rate of 4.5% in June. This is higher than the rate of 4.5% in July of the previous year.'
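
Since the chain was built with `return_source_documents=True`, the result also carries the retrieved documents under the `'source_documents'` key, which helps verify where an answer came from. A minimal sketch for listing them follows; `summarize_sources` is a hypothetical helper, and the exact metadata keys depend on the loader, so the metadata is returned as-is rather than assuming specific fields.

```python
def summarize_sources(result: dict, max_chars: int = 80) -> list:
    """Return (metadata, snippet) pairs for the documents behind an answer."""
    rows = []
    for doc in result.get("source_documents", []):
        # Collapse whitespace so the snippet fits on one line.
        snippet = " ".join(doc.page_content.split())[:max_chars]
        rows.append((dict(doc.metadata), snippet))
    return rows


# Possible usage with the result obtained above:
# for metadata, snippet in summarize_sources(result):
#     print(metadata, '|', snippet)
```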