In [1]:
import os
import asyncio
import json
import nest_asyncio
import dotenv
dotenv.load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

nest_asyncio.apply()

In [2]:
import newspaper
from newspaper import Config

# 1. Create a configuration object
config = Config()
# 2. Tell it to allow "binary" urls (bypassing the bad check)
config.allow_binary_content = True 

urls = [
    "https://developers.llamaindex.ai/python/framework/understanding/",
    "https://developers.llamaindex.ai/python/framework/understanding/using_llms/",
    "https://developers.llamaindex.ai/python/framework/understanding/rag/indexing/",
    "https://developers.llamaindex.ai/python/framework/understanding/rag/querying/",
]   


pages_content = []

# Retrieve the Content
for url in urls:
    try:
        article = newspaper.Article(url, config=config)
        article.download()
        article.parse()
        if len(article.text) > 0:
            pages_content.append(
                {"url": url, "title": article.title, "text": article.text}
            )
    except Exception as exc:
        # Report why the page failed, then move on to the next URL
        print(f"Failed to retrieve content from {url}: {exc}")
        continue

print(pages_content[0])
print(len(pages_content))

{'url': 'https://developers.llamaindex.ai/python/framework/understanding/', 'title': 'Building an LLM application', 'text': 'Using LLMs: hit the ground running by getting started working with LLMs. We’ll show you how to use any of our dozens of supported LLMs, whether via remote API calls or running locally on your machine.\n\nBuilding agents: agents are LLM-powered knowledge workers that can interact with the world via a set of tools. Those tools can retrieve information (such as RAG, see below) or take action. This tutorial includes:\n\nBuilding a single agent: We show you how to build a simple agent that can interact with the world via a set of tools.\n\nUsing existing tools: LlamaIndex provides a registry of pre-built agent tools at LlamaHub that you can incorporate into your agents.\n\nMaintaining state: agents can maintain state, which is important for building more complex applications.\n\nStreaming output and events: providing visibility and feedback to the user is important, a

In [3]:
pages_content

[{'url': 'https://developers.llamaindex.ai/python/framework/understanding/',
  'title': 'Building an LLM application',
  'text': 'Using LLMs: hit the ground running by getting started working with LLMs. We’ll show you how to use any of our dozens of supported LLMs, whether via remote API calls or running locally on your machine.\n\nBuilding agents: agents are LLM-powered knowledge workers that can interact with the world via a set of tools. Those tools can retrieve information (such as RAG, see below) or take action. This tutorial includes:\n\nBuilding a single agent: We show you how to build a simple agent that can interact with the world via a set of tools.\n\nUsing existing tools: LlamaIndex provides a registry of pre-built agent tools at LlamaHub that you can incorporate into your agents.\n\nMaintaining state: agents can maintain state, which is important for building more complex applications.\n\nStreaming output and events: providing visibility and feedback to the user is importa

Convert the scraped pages into `Document` objects so that LlamaIndex can work with them.

In [4]:
# Convert to Document
from llama_index.core.schema import Document

documents = [
    Document(text=row["text"], metadata={"title": row["title"], "url": row["url"]})
    for row in pages_content
]


## Now we will use **Crawl4AI** to scrape the pages
The code is given in the 04.1 py file.

### 1. What is Crawl4AI?

**Crawl4AI** is an asynchronous, browser-based web crawler built specifically for RAG and LLM applications. Unlike traditional crawlers that return raw HTML, Crawl4AI renders JavaScript (handling dynamic content like React apps) and converts the page directly into clean **Markdown**. This structure preserves headers, tables, and lists in a format that LLMs can easily understand, while stripping away "noise" like navigation bars and ads.

### 2. What does the code do?

The script configures a crawler to visit a list of URLs in parallel (concurrently) using a headless browser. It forces a fresh download (bypassing cache), filters out empty pages, and extracts the content as raw Markdown. It also includes a fallback mechanism: if the website's official metadata is missing a title, the script manually scans the Markdown headers (lines starting with `#`) to identify the document's topic. Finally, it structures the output into a clean JSON format separating the text payload from the metadata.
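The title fallback and the final JSON shape described above can be sketched in plain Python. This is a minimal illustration, not the code from the 04.1 file: the helper names `title_from_markdown` and `to_record`, the `"Untitled"` default, and the sample Markdown are all assumptions made for this example.

```python
import json
from typing import Optional


def title_from_markdown(markdown: str) -> Optional[str]:
    """Return the text of the first Markdown heading line, or None if there is none."""
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Drop the leading '#' characters and surrounding whitespace
            title = stripped.lstrip("#").strip()
            if title:
                return title
    return None


def to_record(url: str, markdown: str, metadata: Optional[dict]) -> dict:
    """Build a record that keeps the text payload separate from the metadata."""
    # Prefer the page's own title; fall back to the first heading (assumed behaviour)
    title = (metadata or {}).get("title") or title_from_markdown(markdown) or "Untitled"
    return {"text": markdown, "metadata": {"title": title, "url": url}}


# Hypothetical page whose official metadata has no title
sample_md = "Intro line\n\n## Indexing\n\nBody text."
record = to_record("https://example.com/indexing", sample_md, {"title": None})
print(json.dumps(record, indent=2))
```

Note that this simple scan would also pick up a `#` comment line inside a fenced code block, so a real script may need a stricter check.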

### 3. Why did we add the sync function?

The `crawl4ai` library is **asynchronous** (it uses `await` to handle network delays efficiently), but most standard Python scripts and notebooks run **synchronously** (line-by-line). The `crawl_sync` function acts as a **bridge or adapter** between these two worlds. It manages the complex "Event Loop" internally—starting the engine, running the async task, and then shutting it down—allowing you to call the crawler with a simple, blocking function call without rewriting your entire codebase to be async.
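The bridge pattern can be sketched with the standard library alone. This is an assumption about how such a wrapper might look, not the actual `crawl_sync` from the 04.1 file: `fake_fetch` and `fake_crawl` are hypothetical stand-ins for the real Crawl4AI calls, so the example runs without a browser or network access.

```python
import asyncio


async def fake_fetch(url: str) -> dict:
    """Hypothetical stand-in for one async page fetch."""
    await asyncio.sleep(0)  # simulates waiting on network I/O
    return {"url": url, "text": f"content of {url}"}


async def fake_crawl(urls: list) -> list:
    """Fetch all URLs concurrently; gather returns results in input order."""
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def crawl_sync(urls: list) -> list:
    """Blocking wrapper: create an event loop, run the coroutine, close the loop."""
    return asyncio.run(fake_crawl(urls))


pages = crawl_sync(["https://example.com/a", "https://example.com/b"])
print(len(pages))
```

In a plain script this works as written. Inside Jupyter an event loop is already running, so `asyncio.run` raises `RuntimeError` unless something like the `nest_asyncio.apply()` call from the first cell is in effect.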

In [20]:
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core import VectorStoreIndex
import google.genai.types as types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=0),
    max_output_tokens=1024,
    temperature=1,
)

llm = GoogleGenAI(
    model="gemini-2.5-flash",
    generation_config=config,
    )

Settings.embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_document",
    api_key=COHERE_API_KEY
)
Settings.llm = llm
Settings.text_splitter = SentenceSplitter(chunk_size=300, chunk_overlap=50)

In [21]:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [27]:
res = query_engine.query("What is a query engine?")
print(res.response)

The basis of all querying is the QueryEngine. The simplest way to get a QueryEngine is to get an index to create one for you.


In [28]:
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("URL\t", src.metadata["url"])
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 66bba379-ab53-4924-a7ff-4a6f5f2520bb
Title	 Querying
URL	 https://developers.llamaindex.ai/python/framework/understanding/rag/querying/
Score	 0.3982705347745195
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 05274416-3668-4677-b9aa-cbd1da89306c
Title	 Querying
URL	 https://developers.llamaindex.ai/python/framework/understanding/rag/querying/
Score	 0.3661506833164389
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_


In [26]:
res = query_engine.query("Can I pass show_progress parameter to the VectorStoreIndex.from_documents method?")
print(res.response)

The context provided does not contain information about the `show_progress` parameter for the `VectorStoreIndex.from_documents` method. Therefore, I cannot answer whether you can pass this parameter.


> This is a classic limitation of Newspaper4k (and its predecessor Newspaper3k): it does not extract code blocks or "Pro Tip" callouts.
The reason the query failed is that Newspaper4k is "opinionated". It is built on the assumption that "content" means paragraphs of narrative text (like a CNN or New York Times article), so code blocks and callouts are dropped before the text is indexed.

We will use Crawl4AI for this.
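One way to confirm this diagnosis is to search the scraped text for the term the query asked about. The sketch below uses a hypothetical `pages_mentioning` helper and a small stand-in list. In the notebook you would pass `pages_content` instead.

```python
def pages_mentioning(pages: list, term: str) -> list:
    """Return the URLs of scraped pages whose text contains `term`."""
    return [page["url"] for page in pages if term in page["text"]]


# Hypothetical stand-in for the notebook's pages_content list
sample_pages = [
    {
        "url": "https://example.com/indexing",
        "title": "Indexing",
        "text": "A VectorStoreIndex is by far the most frequent type of Index.",
    },
]

print(pages_mentioning(sample_pages, "show_progress"))  # prints []
```

An empty list means the term never made it into the scraped text, so no retrieval setting could have surfaced it.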

In [25]:
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("URL\t", src.metadata["url"])
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 4ca5cb27-b3e0-4592-bd4a-466b5853232c
Title	 Indexing
URL	 https://developers.llamaindex.ai/python/framework/understanding/rag/indexing/
Text	 With your data loaded, you now have a list of Document objects (or a list of Nodes). It’s time to build an Index over these objects so you can start querying them.

In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM. Your Index is designed to be complementary to your querying strategy.

LlamaIndex offers several different index types. We’ll cover the two most common here.

A VectorStoreIndex is by far the most frequent type of Index you’ll encounter. The Vector Store Index takes your Documents and splits them up into Nodes. It then creates vector embeddings of the text of every node, ready to be queried by an LLM.

Vector embeddings are central to how LLM applications function.

A vector embedding, often just called an embedding, is a numerical representation of the seman

Both methods allow you to leverage the extensive data available on the Internet to enhance your chatbot with current and high-quality information. The choice between them depends on your specific needs: use the newspaper library for simple, targeted scraping of known URLs, or employ Crawl4AI when you need more robust crawling capabilities, JavaScript support, or want to avoid external service dependencies.