# Intelligent Content Retrieval System
## Web Scraping and Vector Database Assignment

**Author:** Phillemon Senoamadi  
**Date:** December 2025


## Imports

In [1]:
import os
import time
import re
from datetime import datetime
from typing import List, Dict
import chromadb
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import pandas as pd
from tqdm import tqdm

## Configurations

In [2]:
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ContentRetrievalBot/1.0; +educational)"}
REQUEST_DELAY = 3  # seconds (rate limiting)
CHUNK_SIZE = 800
CHUNK_OVERLAP = 150

## Website Selection

Four websites were selected to ensure:
- Public accessibility
- English-language content
- Diverse categories
- Sufficient textual volume

Categories covered:
- Government
- Technology
- News
- Educational Resources


### List of Websites

In [3]:
websites = [
    {
        "name": "WHO",
        "url": "https://data.who.int/dashboards/covid19/cases?n=c",
        "category": "Government",
    },
    {
        "name": "Microsoft",
        "url": "https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns?toc=%2Fazure%2Fdeveloper%2Fai%2Ftoc.json&bc=%2Fazure%2Fdeveloper%2Fai%2Fbreadcrumb%2Ftoc.json",
        "category": "Technology",
    },
    {
        "name": "apnews",
        "url": "https://apnews.com/",
        "category": "News",
    },
    {
        "name": "towardsdatascience",
        "url": "https://towardsdatascience.com/solving-a-constrained-project-scheduling-problem-with-quantum-annealing-d0640e657a3b/",
        "category": "Educational Resources",
    },
]

### Fetch page Function

In [4]:
def fetch_page(url: str) -> str:
    """
    Fetch HTML content from a URL using ethical scraping practices.
    """
    response = requests.get(url, headers=HEADERS, timeout=15)
    response.raise_for_status()
    time.sleep(REQUEST_DELAY)
    return response.text
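
The fetch function identifies itself with a User-Agent and pauses between requests, but it does not consult `robots.txt`. The sketch below shows one way such a check could look using the standard library's `urllib.robotparser`. The helper name `is_allowed`, the `ContentRetrievalBot` token (taken from the `HEADERS` string), and the sample rules are assumptions for illustration; in practice the rules would be fetched from each site's `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent token, taken from the HEADERS string in the configuration cell.
USER_AGENT = "ContentRetrievalBot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used here so the example runs without network access.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_robots, "https://example.com/articles/1"))    # True
print(is_allowed(sample_robots, "https://example.com/private/page"))  # False
```

A check like this could run before `requests.get`, skipping any URL for which it returns `False`.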

### Clean Text Function

In [5]:
def extract_text(html: str) -> str:
    """
    Extract visible text from HTML and clean it.
    """
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript", "header", "footer", "nav"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    text = re.sub(r"\s+", " ", text).strip()
    return text

## Scrape Websites

In [6]:
scraped_data = []

for site in tqdm(websites, desc="Scraping websites"):
    try:
        html = fetch_page(site["url"])
        text = extract_text(html)
        
        scraped_data.append({
            "domain": site["name"],
            "url": site["url"],
            "category": site["category"],
            "timestamp": datetime.utcnow().isoformat(),
            "text": text,
            "char_count": len(text)})
        print(f"{site['name']} scraped: {len(text):,} characters")

    except Exception as e:
        print(f"Failed to scrape {site['name']}: {e}")


Scraping websites: 100%|██████████| 4/4 [00:13<00:00,  3.43s/it]

WHO scraped: 18,708 characters
Microsoft scraped: 41,278 characters
apnews scraped: 31,324 characters
towardsdatascience scraped: 38,014 characters

## Scraping Summary

The table below shows the character count for each website to verify that
each source exceeds the minimum 5,000-character requirement.


## Scraping Stats

In [7]:
df_scraped = pd.DataFrame(scraped_data)
df_scraped[["domain", "category", "char_count"]]


| | domain | category | char_count |
|---|---|---|---|
| 0 | WHO | Government | 18708 |
| 1 | Microsoft | Technology | 41278 |
| 2 | apnews | News | 31324 |
| 3 | towardsdatascience | Educational Resources | 38014 |


In [58]:
df_scraped['percentages'] = df_scraped['char_count']/df_scraped['char_count'].sum()

In [59]:
df_scraped

| | domain | url | category | timestamp | text | char_count | percentages |
|---|---|---|---|---|---|---|---|
| 0 | WHO | https://data.who.int/dashboards/covid19/cases?n=c | Government | 2026-01-11T08:20:48.200842 | COVID-19 cases \| WHO COVID-19 dashboard Skip t... | 18708 | 0.14466 |
| 1 | Microsoft | https://learn.microsoft.com/en-us/azure/archit... | Technology | 2026-01-11T08:20:51.755113 | AI Agent Orchestration Patterns - Azure Archit... | 41278 | 0.319183 |
| 2 | apnews | https://apnews.com/ | News | 2026-01-11T08:20:55.245993 | Associated Press News: Breaking News, Latest H... | 31324 | 0.242213 |
| 3 | towardsdatascience | https://towardsdatascience.com/solving-a-const... | Educational Resources | 2026-01-11T08:20:58.687296 | Solving a Constrained Project Scheduling Probl... | 38014 | 0.293944 |
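
The two checks in this section (every source above 5,000 characters, and each source's share of the corpus) can be reproduced without pandas. This is a minimal sketch using the character counts reported above; the names `MIN_CHARS`, `char_counts`, `too_short`, and `shares` are illustrative and not part of the notebook. Note that the `percentages` column holds fractions between 0 and 1 rather than values out of 100.

```python
# Character counts as reported in the scraping output above.
char_counts = {
    "WHO": 18_708,
    "Microsoft": 41_278,
    "apnews": 31_324,
    "towardsdatascience": 38_014,
}

MIN_CHARS = 5_000  # minimum size per source stated in the summary

# Sources that fall below the minimum (expected to be empty).
too_short = [name for name, count in char_counts.items() if count < MIN_CHARS]

# Each source's fraction of the total scraped text.
total = sum(char_counts.values())
shares = {name: count / total for name, count in char_counts.items()}

print("Below minimum:", too_short)
for name, share in shares.items():
    print(f"{name}: {share:.6f}")
```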


## Text Chunking Function

In [8]:
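def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    """
    Split text into overlapping chunks.

    Each chunk holds up to `chunk_size` characters and starts
    `chunk_size - overlap` characters after the previous one.
    """
    if not 0 <= overlap < chunk_size:
        # Without this guard the loop below would never advance.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step forward, keeping `overlap` characters of context

    return chunks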
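
To make the sliding window concrete, here is a small stand-alone sketch that applies the same stepping rule to a ten-character string. The function name `sliding_window_chunks` and the toy sizes are assumptions for illustration; the stepping logic is intended to mirror `chunk_text`.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Illustrative re-statement of the chunking loop using range()."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Each chunk repeats the last character of the previous one. The final chunk `'j'` is shorter than the overlap allows for and is already fully contained in `'ghij'`, which is a side effect of this stepping rule worth keeping in mind when reading the chunk statistics.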

## Create Chunked Corpus

In [9]:
all_chunks = []

for row in scraped_data:
    chunks = chunk_text(row["text"])

    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "domain": row["domain"],
            "url": row["url"],
            "category": row["category"],
            "timestamp": row["timestamp"],
            "chunk_id": i,
            "text": chunk,
            "chunk_length": len(chunk)
        })

df_chunks = pd.DataFrame(all_chunks)
print(f"Total chunks created: {len(df_chunks)}")

Total chunks created: 201


In [10]:
df_chunks.head(2)

| | domain | url | category | timestamp | chunk_id | text | chunk_length |
|---|---|---|---|---|---|---|---|
| 0 | WHO | https://data.who.int/dashboards/covid19/cases?n=c | Government | 2026-01-11T08:20:48.200842 | 0 | COVID-19 cases \| WHO COVID-19 dashboard Skip t... | 800 |
| 1 | WHO | https://data.who.int/dashboards/covid19/cases?n=c | Government | 2026-01-11T08:20:48.200842 | 1 | and Saba Bosnia and Herzegovina Botswana Braz... | 800 |


## Chunk Distribution

In [11]:
df_chunks["chunk_length"].describe()

count    201.000000
mean     790.288557
std       70.104895
min      124.000000
25%      800.000000
50%      800.000000
75%      800.000000
max      800.000000
Name: chunk_length, dtype: float64
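
These statistics follow directly from the chunking arithmetic. With a chunk size of 800 and an overlap of 150, a new chunk starts every 650 characters, so the chunk lengths can be derived from the scraped character counts alone. The sketch below is a cross-check under that assumption; the variable names are illustrative.

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 150
STEP = CHUNK_SIZE - CHUNK_OVERLAP  # 650 characters between chunk starts

# Character counts as reported in the scraping summary.
char_counts = {
    "WHO": 18_708,
    "Microsoft": 41_278,
    "apnews": 31_324,
    "towardsdatascience": 38_014,
}

# Length of every chunk: a full window, or whatever text remains near the end.
chunk_lengths = [
    min(CHUNK_SIZE, n_chars - start)
    for n_chars in char_counts.values()
    for start in range(0, n_chars, STEP)
]

print("count:", len(chunk_lengths))                      # 201
print("min:  ", min(chunk_lengths))                      # 124
print("mean: ", sum(chunk_lengths) / len(chunk_lengths))
```

The 124-character minimum is the final chunk of the apnews text. Because it is shorter than the 150-character overlap, its content is already fully contained in the chunk before it.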

## End of Scraping & Processing

## 2. Data Collection and Text Processing Complete

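At this stage:
- All websites were scraped with an identifying User-Agent and a 3-second delay after each request
- HTML noise (scripts, styles, headers, footers, navigation) was removed
- Text was chunked into segments of up to 800 characters with a 150-character overlap
- Metadata was preserved for each chunk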

The processed corpus is now ready for embedding generation.


# Embedding Generation

## Load Embedding Model

In [31]:
model = SentenceTransformer("all-MiniLM-L6-v2")

## Generate Embeddings

In [13]:
texts = df_chunks["text"].tolist()

embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    normalize_embeddings=True
)

print("Embedding shape:", embeddings.shape, "\n" ,"Embeddings:", embeddings)


Embedding shape: (201, 384) 
 Embeddings: [[ 0.00674995  0.02543921 -0.05318063 ... -0.03135065 -0.02791882
   0.03389486]
 [ 0.08636957  0.01803675 -0.05463221 ... -0.02771902  0.00385751
  -0.0168611 ]
 [ 0.0194258   0.0223157  -0.04385505 ... -0.0583318  -0.01312458
  -0.03841379]
 ...
 [-0.17394926 -0.08896809  0.07456005 ...  0.06617937 -0.0763445
  -0.03236045]
 [-0.09830879 -0.09961621  0.01288439 ...  0.04623834 -0.08831649
  -0.01794759]
 [-0.03202517 -0.10690522 -0.08798777 ...  0.02482592 -0.0690345
   0.04047528]]


# VECTOR DATABASE (ChromaDB)

## Create Vector DB

In [14]:
client = chromadb.Client()
collection = client.create_collection(
    name="content_retrieval",
    metadata={"hnsw:space": "cosine"}
)
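
The `"hnsw:space": "cosine"` setting determines how the search scores later in the notebook should be read. Assuming Chroma reports cosine distance as one minus cosine similarity, a distance of 0 means the vectors point the same way and larger values mean less similar text. The helper `cosine_distance` below is a plain-Python illustration of that formula, not a Chroma API.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)  # 0.0 1.0 2.0
```

Because the embeddings are generated with `normalize_embeddings=True`, both norms are 1 and the distance reduces to one minus the dot product.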

## Store Embeddings

In [16]:
collection.add(
    documents=df_chunks["text"].tolist(),
    embeddings=embeddings.tolist(),
    metadatas=df_chunks[["domain", "url", "category"]].to_dict("records"),
    ids=[str(i) for i in range(len(df_chunks))]
)

print("Stored documents:", collection.count())


Stored documents: 201


# Semantic Search Interface

### Search Function

In [32]:
def semantic_search(query: str, top_k: int = 5):
    query_embedding = model.encode([query], normalize_embeddings=True)

    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=top_k
    )

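    # Chroma can return fewer than top_k hits, so loop over what actually came back.
    # "Score" is the cosine distance reported by Chroma: lower means more similar.
    for i in range(len(results["documents"][0])):
        print(f"\nResult {i+1}")
        print("Score:", results["distances"][0][i])
        print("domain:", results["metadatas"][0][i]["domain"])
        print("Text:", results["documents"][0][i][:300], "...")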


In [49]:
import json
import os
from datetime import datetime

def semantic_search(query: str, top_k: int = 5, output_file: str = "system_results.json"):
    # Encode query
    query_embedding = model.encode([query], normalize_embeddings=True)

    # Query vector database
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=top_k
    )

    # Build this run's result
    run_results = {
        "query": query,
        "top_k": top_k,
        "timestamp": datetime.utcnow().isoformat(),
        "results": []
    }

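    # Chroma can return fewer than top_k hits, so loop over what actually came back.
    for i in range(len(results["documents"][0])):
        score = results["distances"][0][i]  # cosine distance: lower means more similar
        domain = results["metadatas"][0][i].get("domain")
        text = results["documents"][0][i]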

        print(f"\nResult {i+1}")
        print("Score:", score)
        print("domain:", domain)
        print("Text:", text[:300], "...")

        run_results["results"].append({
            "rank": i + 1,
            "score": score,
            "domain": domain,
            "text": text
        })

    # Load existing file safely
    if os.path.exists(output_file):
        with open(output_file, "r", encoding="utf-8") as f:
            try:
                existing_data = json.load(f)
            except json.JSONDecodeError:
                existing_data = []
    else:
        existing_data = []

    # 🔑 Normalize storage format
    if isinstance(existing_data, dict):
        all_results = [existing_data]   # migrate old single-run format
    elif isinstance(existing_data, list):
        all_results = existing_data
    else:
        all_results = []

    # Append new run
    all_results.append(run_results)

    # Write back
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)

    print(f"\n Results appended to {output_file}")
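
The load, normalize, append, and write-back steps in this function can be exercised on their own, without a model or a vector database. The sketch below isolates that pattern in a hypothetical helper, `append_json_record`, and runs it against a temporary file; the helper name and the demo records are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def append_json_record(path: str, record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Append `record` to the JSON list stored at `path` and return the full list."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        existing = []  # missing or unreadable file: start a fresh history

    if isinstance(existing, dict):
        history = [existing]  # migrate an old single-run file to a list
    elif isinstance(existing, list):
        history = existing
    else:
        history = []

    history.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(history, f, ensure_ascii=False, indent=2)
    return history


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_results.json")
    append_json_record(demo_path, {"query": "first"})
    history = append_json_record(demo_path, {"query": "second"})
    print([run["query"] for run in history])  # ['first', 'second']
```

Rewriting the whole file on every call is simple and keeps the output valid JSON, at the cost of growing slower as the history gets long.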


## Testing Semantic Search 

In [55]:
semantic_search("Trump")


Result 1
Score: 0.6415318250656128
domain: apnews
Text: g with oil executives in the East Room of the White House, Friday, Jan. 9, 2026, in Washington. (AP Photo/Alex Brandon) Trump signs executive order meant to protect the money from Venezuelan oil President Donald Trump has issued a new executive order to protect Venezuelan oil revenue from judicial p ...

Result 2
Score: 0.6860718131065369
domain: apnews
Text: n Español Deportes Donald Trump Most watched videos Standards Quizzes Press Releases My Account AP News Code of Conduct Sign in Search Query Submit Search Show Search Menu Submit Search Bob Weir dies Alex Bregman contract Iran protests T.K. Carter dies at 69 Anti-ICE protests Menu World SECTIONS Isr ...

Result 3
Score: 0.6862151622772217
domain: apnews
Text: 10, 2026. (UGC via AP) Iran warns US troops and Israel will be targets if America strikes over protests as death toll rises Protests challenging Iran’s theocracy have reached the two-week mark, with demonstrators flo


In [56]:
semantic_search("parallelized")


Result 1
Score: 0.5820498466491699
domain: Microsoft
Text: e to the same problem. This collaboration typically occurs in scenarios that feature the following multi-agent decision-making techniques: Brainstorming Ensemble reasoning Quorum and voting-based decisions Time-sensitive scenarios where parallel processing reduces latency. When to avoid concurrent o ...

Result 2
Score: 0.5907529592514038
domain: Microsoft
Text:  processing inefficient or impossible. Agents can't reliably coordinate changes to shared state or external systems while running simultaneously. There's no clear conflict resolution strategy to handle contradictory or conflicting results from each agent. Result aggregation logic is too complex or l ...

Result 3
Score: 0.6375733017921448
domain: Microsoft
Text: g, all agents work in parallel, which reduces overall run time and provides comprehensive coverage of the problem space. This orchestration pattern resembles the Fan-out/Fan-in cloud design pattern. The results


In [57]:
semantic_search("British")


Result 1
Score: 0.8164938688278198
domain: WHO
Text: reenland Grenada Guadeloupe Guam Guatemala Guernsey Guinea Guinea-Bissau Guyana Haiti Holy See Honduras Hungary Iceland India Indonesia Iran (Islamic Republic of) Iraq Ireland Isle of Man Israel Italy Jamaica Japan Jersey Jordan Kazakhstan Kenya Kiribati Kosovo (In accordance with UN Security Counci ...

Result 2
Score: 0.8229862451553345
domain: WHO
Text:  between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas). The mention of specific companies or of certain manufacturers’ products does not imply that they are endorsed or recommended by WHO in prefe ...

Result 3
Score: 0.8350211381912231
domain: WHO
Text: f accessing or utilizing the Datasets with or without prior notice to you. Maps The designations employed and the presentation of the material in this publication do not imply the expression of any opinion whatsoever on the pa
