# Part 4: Web Knowledge Source

In Parts 1-3, you worked with internal data sources (search indexes and SharePoint). In Part 4, you'll add public web content using `WebKnowledgeSource`. This lets you combine your internal knowledge with external information from the web.

## Step 1: Load Environment Variables

Run the cell below to load the configuration for your Azure resources, and choose the **.venv** environment that was created for you.

> **⚠️ Troubleshooting**
>
> If code cells get stuck and keep spinning, select **Restart** from the notebook toolbar at the top. If the issue persists after a couple of tries, close VS Code completely and reopen it.

In [None]:
import os

from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

load_dotenv(override=True) # take environment variables from .env.

# Azure AI Search configuration
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
# KEYLESS=true selects token-based auth via DefaultAzureCredential; otherwise the admin key is used
if os.getenv('KEYLESS','false').lower() == 'true':
    credential = DefaultAzureCredential()
else:
    credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])

# Knowledge base name
knowledge_base_name = "web-knowledge-base"

# Azure OpenAI configuration
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
# In keyless mode no Azure OpenAI key is read; an empty string is used instead
if os.getenv('KEYLESS','false').lower() == 'true':
    azure_openai_key = ''
else:
    azure_openai_key = os.environ["AZURE_OPENAI_KEY"]
azure_openai_chatgpt_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4.1")
azure_openai_chatgpt_model_name = os.getenv("AZURE_OPENAI_CHATGPT_MODEL_NAME", "gpt-4.1")

print("Environment variables loaded")
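
If the cell above fails with a `KeyError`, a required setting is probably missing from your `.env` file. The sketch below is one possible way to list missing settings up front. The variable names come from the cell above, while the `missing_settings` helper and the sample endpoint value are hypothetical and only for illustration.

```python
# Variable names are taken from the cell above; the helper itself is a hypothetical addition.
ALWAYS_REQUIRED = ("AZURE_SEARCH_SERVICE_ENDPOINT", "AZURE_OPENAI_ENDPOINT")
KEY_REQUIRED = ("AZURE_SEARCH_ADMIN_KEY", "AZURE_OPENAI_KEY")


def missing_settings(env):
    """Return the required variable names that are absent or empty in env."""
    keyless = env.get("KEYLESS", "false").lower() == "true"
    required = ALWAYS_REQUIRED if keyless else ALWAYS_REQUIRED + KEY_REQUIRED
    return [name for name in required if not env.get(name)]


# Example with a plain dict and a placeholder endpoint; in the notebook you would pass os.environ.
sample_env = {
    "AZURE_SEARCH_SERVICE_ENDPOINT": "https://example.search.windows.net",
    "KEYLESS": "true",
}
print(missing_settings(sample_env))  # ['AZURE_OPENAI_ENDPOINT']
```

Calling `missing_settings(os.environ)` after `load_dotenv` should return an empty list when everything this part needs is configured.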

## Step 2: Create Web Knowledge Source

A **WebKnowledgeSource** queries public web URLs in real time, much as a SharePoint knowledge source queries SharePoint documents. The difference is that a web source searches the public internet instead of your internal content.

The code below creates a web knowledge source without any URL restrictions, which means it can search across the entire web. Later in this part, you'll see how to restrict searches to specific domains.

In [None]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import WebKnowledgeSource

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

ks = WebKnowledgeSource(
    name="web-knowledge-source",
    description="Knowledge source for Web"
)
index_client.create_or_update_knowledge_source(knowledge_source=ks)
print(f"Knowledge source '{ks.name}' created or updated successfully.")

## Step 3: Create Web Knowledge Base

You'll now create a knowledge base that references the web knowledge source. The setup follows the same pattern as in previous parts: it configures the Azure OpenAI model, adds a reference to your knowledge source, and sets `output_mode=ANSWER_SYNTHESIS`.

The only difference is the knowledge source type. Instead of querying a search index or SharePoint, this knowledge base searches the web.

In [None]:
from azure.search.documents.indexes.models import (
    AzureOpenAIVectorizerParameters,
    KnowledgeBase,
    KnowledgeBaseAzureOpenAIModel,
    KnowledgeRetrievalOutputMode,
    KnowledgeSourceReference,
)

aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    deployment_name=azure_openai_chatgpt_deployment,
    model_name=azure_openai_chatgpt_model_name,
    api_key=azure_openai_key
)

knowledge_base = KnowledgeBase(
    name=knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=ks.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

index_client.create_or_update_knowledge_base(knowledge_base)
print(f"Knowledge base '{knowledge_base_name}' created or updated successfully.")

## Step 4: Query Web Content

Now you can ask questions that require web knowledge. The question "How tall is the Eiffel tower?" isn't something your internal HR or health documents would know, but the web does.

When you run the query, the knowledge base searches the open web, retrieves relevant content, and synthesizes an answer with citations pointing to the web pages used. Check the references to see which websites were consulted.

In [None]:
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import (
    KnowledgeBaseMessage,
    KnowledgeBaseMessageTextContent,
    KnowledgeBaseRetrievalRequest,
    WebKnowledgeSourceParams,
)
from IPython.display import display, Markdown

knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=knowledge_base_name, credential=credential)

web_ks_params = WebKnowledgeSourceParams(
    knowledge_source_name="web-knowledge-source",
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="How tall is the Eiffel tower?")])
    ],
    knowledge_source_params=[
        web_ks_params
    ],
    include_activity=True
)

result = knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

## Step 5: Review Response, References, and Activity

The two cells below show the citations and activity log from the web query.

The **references** reveal which websites were used to answer your question. Each citation includes the URL and the specific text snippet that contributed to the answer.

The **activity log** reveals what happened behind the scenes: which web searches were performed, which URLs were retrieved, and how the results were ranked.

In [None]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

In [None]:
activity_content = json.dumps([a.as_dict() for a in result.activity], indent=2)
print(activity_content)
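
The raw JSON can be long, so a per-website summary is sometimes easier to read. The sketch below counts references per host. It assumes each reference dictionary exposes the page address under a `url` key, which you should confirm against the JSON printed above. The `cited_hosts` helper and the sample records are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_hosts(reference_dicts, url_key="url"):
    """Count references per host name, skipping entries without a URL."""
    hosts = Counter()
    for ref in reference_dicts:
        url = ref.get(url_key)
        if url:
            hosts[urlparse(url).hostname or "(unknown)"] += 1
    return hosts


# Hypothetical records standing in for [ref.as_dict() for ref in result.references].
sample_refs = [
    {"id": "0", "url": "https://www.example.org/eiffel-tower"},
    {"id": "1", "url": "https://www.example.org/paris"},
    {"id": "2", "url": "https://travel.example.com/landmarks"},
    {"id": "3"},  # no URL, so it is skipped
]
for host, count in cited_hosts(sample_refs).most_common():
    print(f"{host}: {count}")
```

In the notebook, you could pass `[ref.as_dict() for ref in result.references]` instead of `sample_refs`, adjusting `url_key` if the field is named differently.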

## Step 6: Restrict to Specific Domains

You can control which websites the knowledge base searches by specifying allowed domains. This is useful when you want to combine internal data with specific trusted external sources, like industry documentation or regulatory websites.

The code below creates a new web knowledge source restricted to `britannica.com`. Setting `include_subpages=True` means the knowledge base can search any page on that domain, not just the homepage.

After creating the restricted source, you'll create a new knowledge base that uses it.

In [None]:
from azure.search.documents.indexes.models import (
    WebKnowledgeSourceDomain,
    WebKnowledgeSourceDomains,
    WebKnowledgeSourceParameters,
)

ks = WebKnowledgeSource(
    name="custom-web-knowledge-source",
    description="Custom knowledge source for Web",
    web_parameters = WebKnowledgeSourceParameters(
        domains = WebKnowledgeSourceDomains(
            allowed_domains=[
                WebKnowledgeSourceDomain(address="https://www.britannica.com/", include_subpages=True),
            ]
        )
    )
)
index_client.create_or_update_knowledge_source(knowledge_source=ks)
print(f"Knowledge source '{ks.name}' created or updated successfully.")

knowledge_base = KnowledgeBase(
    name="custom-web-knowledge-base",
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=ks.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

index_client.create_or_update_knowledge_base(knowledge_base)
print(f"Knowledge base '{knowledge_base.name}' created or updated successfully.")

## Step 7: Query Restricted Web Source

Now query the restricted knowledge base with the same question about the Eiffel Tower. This time, the knowledge base can only search britannica.com for the answer.

Compare the results with Step 4. The answer might differ slightly, because it can now draw only on pages from britannica.com.

In [None]:
custom_knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=knowledge_base.name, credential=credential)

web_ks_params = WebKnowledgeSourceParams(
    knowledge_source_name=ks.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="How tall is the Eiffel tower?")])
    ],
    knowledge_source_params=[
        web_ks_params
    ],
    include_activity=True
)

result = custom_knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))
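
To compare this answer with the one from Step 4 more systematically, a line-by-line diff can help. The sketch below uses Python's standard `difflib` module. The `answer_diff` helper and both sample answers are hypothetical placeholders, not real service output.

```python
import difflib


def answer_diff(open_web_answer, restricted_answer):
    """Return unified-diff lines between two answers, compared line by line."""
    return list(
        difflib.unified_diff(
            open_web_answer.splitlines(),
            restricted_answer.splitlines(),
            fromfile="open web (Step 4)",
            tofile="restricted (Step 7)",
            lineterm="",
        )
    )


# Hypothetical answer texts; real ones come from result.response[0].content[0].text.
first = "Placeholder answer from the open web.\nShared closing line."
second = "Placeholder answer from britannica.com only.\nShared closing line."
print("\n".join(answer_diff(first, second)))
```

Because this step reassigns `result`, you would need to keep the Step 4 answer text in its own variable before running the restricted query.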

## Step 8: Review Restricted Source Results

The three cells below show the citations, a summary of the activity steps, and the full activity log from the domain-restricted query.

Check the references to verify all citations come from britannica.com. The activity log shows how the knowledge base limited its web searches to the allowed domain.

In [None]:
references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)
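
Rather than scanning the JSON by eye, you could check the citations programmatically. The sketch below flags any cited URL whose host is neither the allowed domain nor one of its subdomains. As an assumption, it reads the address from a `url` key in each reference dictionary. The `off_domain_urls` helper and the sample records are hypothetical, and this is only a client-side check, not the service's own matching logic.

```python
from urllib.parse import urlparse


def off_domain_urls(reference_dicts, allowed_domain, url_key="url"):
    """Return cited URLs whose host is neither the domain nor a subdomain of it."""
    offenders = []
    for ref in reference_dicts:
        url = ref.get(url_key)
        if not url:
            continue
        host = (urlparse(url).hostname or "").lower()
        if host != allowed_domain and not host.endswith("." + allowed_domain):
            offenders.append(url)
    return offenders


# Hypothetical records; the paths are placeholders, not real citations.
sample_refs = [
    {"id": "0", "url": "https://www.britannica.com/placeholder-page"},
    {"id": "1", "url": "https://notbritannica.com/placeholder-page"},
]
print(off_domain_urls(sample_refs, "britannica.com"))
```

An empty list means every citation with a URL came from the allowed domain. Matching on `"." + allowed_domain` avoids accepting look-alike hosts that merely end with the same characters.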

In [None]:
import pandas as pd

activity_types = [{"type": a.type} for a in result.activity]

df = pd.DataFrame(activity_types)

print("Activity Log Steps")
df

In [None]:
activity_content = json.dumps([a.as_dict() for a in result.activity], indent=2)
print("Activity Details")
print(activity_content)

## Summary

You've now added web content to your knowledge bases using `WebKnowledgeSource`. This lets you combine internal data with public information from the web.

**Key concepts to remember:**
- `WebKnowledgeSource` queries public web URLs in real time
- You can search the entire web or restrict to specific domains
- `include_subpages=True` allows searching all pages within a domain
- Web citations include URLs instead of internal file paths

### What's Next?

➡️ Continue to [Part 5: Blob Knowledge Source](part5-blob-knowledge-source.ipynb) to learn how to upload documents from Azure Blob Storage and compare minimal vs. standard indexing.