<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/Crawl_a_Website.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [82]:

!pip install -q llama-index-llms-openai newspaper3k==0.2.8 lxml_html_clean llama-index==0.10.57 openai==1.37.0 google-generativeai==0.5.4 httpx==0.27.2 cohere==5.6.2 tiktoken==0.7.0 --force-reinstall

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 2.2.3 which is incompatible.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
google-colab 1.0.0 requires google-auth==2.27.0, but you have google-auth 2.37.0 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 which is incompatible.
jupyter-server 1.24.0 requires anyio<4,>=3.1.0, but you have anyio 4.8.0 which is incompatible.
langchain 0.3.12 requires async-timeout<5.0.0,>=4.0.0; python_version < "3.11", but you have async-timeout 5.0.1 which is incompatible.

In [49]:
import os
from google.colab import userdata

# Set the following API Keys in the Python environment. Will be used later.
os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
USESCRAPER_API_KEY = userdata.get('usescraper_api_key')


There are two primary ways to extract webpage content. With scraping, you already have a list of URLs and iterate over it to retrieve the content of each page. With crawling, a script or service discovers the page URLs itself, either by reading the site's sitemap or by following the links on each page, until it has reached all the content. We start with scraping and then use a service, usescraper.com, to crawl a site.
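To make the two URL-discovery routes used by crawlers concrete, here is a minimal, standard-library sketch. The helper names `links_from_html` and `urls_from_sitemap` and the sample inputs are hypothetical and only for illustration; this is not how usescraper.com works internally.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(html, base_url):
    """Return absolute, same-host links in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop #fragments
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def urls_from_sitemap(sitemap_xml):
    """Return the text of every <loc> element, ignoring the XML namespace."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


# Hypothetical sample inputs, so the sketch runs without network access.
sample_html = (
    '<a href="/en/stable/understanding/">Docs</a>'
    '<a href="/en/stable/understanding/#key-steps">Key steps</a>'
    '<a href="https://example.org/x">External</a>'
)
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://docs.llamaindex.ai/en/stable/</loc></url>"
    "</urlset>"
)

print(links_from_html(sample_html, "https://docs.llamaindex.ai/en/stable/"))
print(urls_from_sitemap(sample_sitemap))
```

A real crawler would repeat the link-following step on each newly found page and keep track of visited URLs, which is the work the crawling service takes over later in this notebook.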


# 1. Scraping using `newspaper` Library


## Define URLs


In [62]:
urls = [
    "https://docs.llamaindex.ai/en/stable/understanding",
    "https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/",
    "https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/",
    "https://docs.llamaindex.ai/en/stable/understanding/querying/querying/",
]

## Get Page Contents


In [63]:
import newspaper

pages_content = []

# Retrieve the Content
for url in urls:
    try:
        article = newspaper.Article(url)
        article.download()
        article.parse()
        if len(article.text) > 0:
            pages_content.append(
                {"url": url, "title": article.title, "text": article.text}
            )
    except Exception:  # skip pages that fail to download or parse
        continue

In [64]:
pages_content[0]

{'url': 'https://docs.llamaindex.ai/en/stable/understanding',
 'title': 'Building an LLM Application',
 'text': "Building an LLM application#\n\nWelcome to the beginning of Understanding LlamaIndex. This is a series of short, bite-sized tutorials on every stage of building an LLM application to get you acquainted with how to use LlamaIndex before diving into more advanced and subtle strategies. If you're an experienced programmer new to LlamaIndex, this is the place to start.\n\nKey steps in building an LLM application#\n\nTip If you've already read our high-level concepts page you'll recognize several of these steps.\n\nThis tutorial has three main parts: Building a RAG pipeline, Building an agent, and Building Workflows, with some smaller sections before and after. Here's what to expect:\n\nUsing LLMs : hit the ground running by getting started working with LLMs. We'll show you how to use any of our dozens of supported LLMs, whether via remote API calls or running locally on your mac

In [65]:
len(pages_content)

4

## Convert to Document


In [66]:
from llama_index.core.schema import Document

# Convert the scraped pages to Document objects so the LlamaIndex framework can process them.
documents = [
    Document(text=row["text"], metadata={"title": row["title"], "url": row["url"]})
    for row in pages_content
]

# 2. Crawling using usescraper.com


## Submit the Crawler Job


In [67]:
import requests
import json

urls_to_crawl = [
    "https://docs.llamaindex.ai/en/stable/understanding/",  # add the URLs you want to crawl here
]

payload = {
    "urls": urls_to_crawl,  # list of urls to crawl
    "output_format": "markdown",  # text, html, markdown
    "output_expiry": 604800,  # Automatically delete after X seconds (604800 = 7 days)
    "min_length": 50,  # Skip pages with fewer than X characters
    "page_limit": 3,  # Maximum number of pages to crawl
    "force_crawling_mode": "link",  # "link" follows links in the page recursively, or "sitemap" to find pages from the website's sitemap
    "block_resources": True,  # skip loading images, stylesheets, or scripts
    "include_linked_files": False,  # include files (PDF, text, ...) in output
}
headers = {
    "Authorization": "Bearer " + USESCRAPER_API_KEY,
    "Content-Type": "application/json",
}

response = requests.post(
    "https://api.usescraper.com/crawler/jobs", json=payload, headers=headers
)

response = response.json()  # job description, including the job "id"

print(response)

{'org': '5582', 'id': '7YDEZ9VY07TDCKAAWHGDTM8RBD', 'urls': ['https://docs.llamaindex.ai/en/stable/understanding/'], 'exclude_globs': [], 'exclude_elements': 'nav, header, footer, script, style, noscript, svg, [role="alert"], [role="banner"], [role="dialog"], [role="alertdialog"], [role="region"][aria-label*="skip" i], [aria-modal="true"]', 'output_format': 'markdown', 'output_expiry': 604800, 'min_length': 50, 'page_limit': 3, 'force_crawling_mode': 'link', 'block_resources': True, 'include_linked_files': False, 'createdAt': 1736263731192, 'status': 'starting', 'use_browser': True, 'sitemapPageCount': 0, 'notices': []}


## Get the Status


In [68]:
url = "https://api.usescraper.com/crawler/jobs/{}".format(response["id"])

status_res = requests.get(url, headers=headers)

status_res = status_res.json()

print(status_res["status"])
print(status_res["progress"])

starting
{'scraped': 0, 'discarded': 0, 'failed': 0}
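The job above is still `starting`, so requesting its data right away can return an empty list. One way to handle this is to poll the status endpoint until the job finishes. The sketch below is a minimal version of that idea; the helper name `wait_for_job` is hypothetical, and the terminal status names (`succeeded`, `failed`, `cancelled`) are assumptions that are not confirmed by the output shown here, so check them against the usescraper.com documentation.

```python
import time


def wait_for_job(
    fetch_status,
    done_statuses=("succeeded", "failed", "cancelled"),  # assumed status names
    poll_seconds=5.0,
    timeout_seconds=600.0,
    sleep=time.sleep,
):
    """Call `fetch_status()` until it reports a terminal status.

    `fetch_status` must return the decoded job JSON, a dict with a "status" key.
    Raises TimeoutError if the job is not finished after `timeout_seconds`.
    """
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in done_statuses:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"job still '{job.get('status')}' after {waited:.0f} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# In this notebook the call could look like this (uses `requests`, `url`
# and `headers` from the cells above):
# status_res = wait_for_job(lambda: requests.get(url, headers=headers).json())
# print(status_res["status"], status_res["progress"])
```

Passing the fetch step in as a function keeps the loop independent of the HTTP client, so it can be tried out with canned responses before spending API credits.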


## Get the Data


In [69]:
url = "https://api.usescraper.com/crawler/jobs/{}/data".format(response["id"])

data_res = requests.get(url, headers=headers)

data_res = data_res.json()

print(data_res)

{'data': []}


In [70]:
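if data_res["data"]:  # the list stays empty until the crawl job has finished
    print("URL:", data_res["data"][0]["meta"]["url"])
    print("Title:", data_res["data"][0]["meta"]["meta"]["title"])
    print("Content:", data_res["data"][0]["text"][0:500], "...")
else:
    print("No pages yet; re-run the previous cell once the job has finished.")

No pages yet; re-run the previous cell once the job has finished.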

## Convert to Document


In [71]:
from llama_index.core.schema import Document

# Convert the crawled pages to Document objects so the LlamaIndex framework can process them.
documents = [
    Document(
        text=row["text"],
        metadata={"title": row["meta"]["meta"]["title"], "url": row["meta"]["url"]},
    )
    for row in data_res["data"]
]

In [72]:
def keyword_filter(documents, keywords):
    filtered_docs = []
    for doc in documents:
        if any(keyword.lower() in doc.text.lower() for keyword in keywords):
            filtered_docs.append(doc)
    return filtered_docs

ai_keywords = ["artificial intelligence", "machine learning", "neural networks", "deep learning"]
filtered_documents = keyword_filter(documents, ai_keywords)

In [73]:
from bs4 import BeautifulSoup

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Extract text from remaining tags
    text = soup.get_text()

    # Remove extra whitespace
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

cleaned_documents = [Document(text=clean_html(doc.text), metadata=doc.metadata) for doc in documents]

In [74]:
def truncate_document(doc, max_tokens=1000):
    # Note: this counts whitespace-separated words, a rough stand-in for model tokens.
    tokens = doc.text.split()
    if len(tokens) > max_tokens:
        truncated_text = ' '.join(tokens[:max_tokens])
        return Document(text=truncated_text, metadata=doc.metadata)
    return doc

truncated_documents = [truncate_document(doc) for doc in cleaned_documents]

# Create RAG Pipeline


In [75]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-4o-mini")

In [76]:
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [77]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)
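To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a simplified sketch of a sliding window with overlap. `SentenceSplitter` itself counts tokens and tries to respect sentence boundaries; the hypothetical `chunk_words` helper below counts whitespace-separated words instead, so it only illustrates how consecutive chunks share a few units of text.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=30):
    """Split `text` into word windows of `chunk_size` that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk, at the cost of embedding some text twice.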

In [78]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

In [79]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [80]:
query_engine = index.as_query_engine()

In [81]:
res = query_engine.query("What is a query engine?")

TypeError: Client.__init__() got an unexpected keyword argument 'proxies'

In [None]:
res.response

In [None]:
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("URL\t", src.metadata["url"])
    print("Score\t", src.score)
    print("-_" * 20)