# CivicHonorsKGv20: Enhanced Knowledge Graph with Advanced Link Collection and Page Search

This notebook is an improved version of CivicHonorsKGv19, featuring enhanced link collection and page search capabilities.

## Overview

This notebook implements a knowledge graph for Civic Honors content with the following steps:

1. **Install and Import Libraries**: Set up the required dependencies
2. **Define Knowledge Graph Class**: Create a flexible KG structure
3. **Define Enhanced Knowledge Reduction Class**: Implement advanced reduction techniques
4. **Define Advanced Link Collection Class**: Implement comprehensive link collection
5. **Define Advanced Page Search Class**: Implement intelligent page searching
6. **Scrape Website Data**: Collect information from relevant websites and follow links
7. **Populate Knowledge Graph**: Extract and structure facts
8. **Retrieve and Display Facts**: View the extracted knowledge
9. **Ensure Uniqueness**: Remove duplicate facts
10. **Advanced Cleaning**: Apply semantic similarity for redundancy reduction
11. **Enhanced Knowledge Reduction**: Apply transformer-based models and hierarchical clustering
12. **Serialization**: Save and load the knowledge graph

The main improvements in this version are enhanced link collection and page search, which let the knowledge graph discover and incorporate information from a broader range of relevant sources.
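
As a rough illustration of step 9 (Ensure Uniqueness), near-duplicate statements can be dropped by comparing each candidate against the statements already kept. The sketch below uses `difflib.SequenceMatcher`, which the notebook imports. The helper name `deduplicate`, the 0.9 threshold, and the sample sentences are assumptions made for illustration, and the notebook's own implementation may differ.

```python
from difflib import SequenceMatcher
from typing import List


def deduplicate(statements: List[str], threshold: float = 0.9) -> List[str]:
    """Keep each statement unless it closely matches one already kept."""
    kept: List[str] = []
    for statement in statements:
        # Compare case-insensitively against every statement kept so far.
        is_duplicate = any(
            SequenceMatcher(None, statement.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(statement)
    return kept


# Made-up placeholder sentences, not real scraped facts.
facts = [
    "The program recognizes community volunteers.",
    "The program recognizes community volunteers!",
    "An annual report is published every spring.",
]
print(deduplicate(facts))
```

This pairwise comparison is quadratic in the number of statements, which is usually acceptable for a few hundred facts but may need a different approach at larger scale.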

## Step 1: Install and Import Libraries

In [None]:
# Install required packages (difflib is part of the standard library and is not installed with pip)
!pip install requests beautifulsoup4 spacy sentence-transformers scikit-learn networkx
!python -m spacy download en_core_web_md

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import datetime
import enum
import networkx as nx
from difflib import SequenceMatcher
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re
import urllib.parse
import time
from urllib.robotparser import RobotFileParser
from collections import deque, defaultdict
import logging
from typing import List, Dict, Tuple, Any, Optional, Set, Union, Callable

## Step 2: Define Knowledge Graph Class

In [None]:
class ReliabilityRating(enum.Enum):
    UNKNOWN = "Unknown"
    UNLIKELY_FALSE = "Unlikely False"  # Despite the name, Step 3 scores this as the least reliable rating
    POSSIBLY_FALSE = "Possibly False"
    POSSIBLY_TRUE = "Possibly True"
    LIKELY_TRUE = "Likely True"
    VERIFIED_TRUE = "Verified True"

class KnowledgeGraph:
    def __init__(self):
        self.data = []
        
    def add_fact(self, 
                fact_id=None, 
                fact_statement=None, 
                category="General", 
                tags=None, 
                date_recorded=None, 
                last_updated=None, 
                reliability_rating=ReliabilityRating.UNKNOWN, 
                source_id=None, 
                source_title=None, 
                author_creator=None, 
                publication_date=None, 
                url_reference=None, 
                related_facts=None, 
                contextual_notes=None, 
                access_level="Public", 
                usage_count=0):
        
        if date_recorded is None:
            date_recorded = datetime.datetime.now()
        
        if last_updated is None:
            last_updated = datetime.datetime.now()
        
        if tags is None:
            tags = []
        
        if related_facts is None:
            related_facts = []
            
        fact = {
            "fact_id": fact_id,
            "fact_statement": fact_statement,
            "category": category,
            "tags": tags,
            "date_recorded": date_recorded,
            "last_updated": last_updated,
            "reliability_rating": reliability_rating,
            "source_id": source_id,
            "source_title": source_title,
            "author_creator": author_creator,
            "publication_date": publication_date,
            "url_reference": url_reference,
            "related_facts": related_facts,
            "contextual_notes": contextual_notes,
            "access_level": access_level,
            "usage_count": usage_count
        }
        
        self.data.append(fact)
        return fact
    
    def update_quality_score(self, fact_id, new_score):
        for fact in self.data:
            if fact["fact_id"] == fact_id:
                fact["quality_score"] = new_score
                fact["last_updated"] = datetime.datetime.now()
                return True
        return False
    
    def save_to_file(self, filename):
        with open(filename, 'w') as f:
            # Convert datetime objects to strings for JSON serialization
            serializable_data = []
            for fact in self.data:
                fact_copy = fact.copy()
                if isinstance(fact_copy["date_recorded"], datetime.datetime):
                    fact_copy["date_recorded"] = fact_copy["date_recorded"].isoformat()
                if isinstance(fact_copy["last_updated"], datetime.datetime):
                    fact_copy["last_updated"] = fact_copy["last_updated"].isoformat()
                if isinstance(fact_copy["publication_date"], datetime.datetime):
                    fact_copy["publication_date"] = fact_copy["publication_date"].isoformat()
                if isinstance(fact_copy["reliability_rating"], ReliabilityRating):
                    fact_copy["reliability_rating"] = fact_copy["reliability_rating"].value
                serializable_data.append(fact_copy)
            
            json.dump(serializable_data, f, indent=2)
    
    def load_from_file(self, filename):
        with open(filename, 'r') as f:
            data = json.load(f)
            
            # Convert string representations back to objects
            for fact in data:
                if "date_recorded" in fact and fact["date_recorded"]:
                    fact["date_recorded"] = datetime.datetime.fromisoformat(fact["date_recorded"])
                if "last_updated" in fact and fact["last_updated"]:
                    fact["last_updated"] = datetime.datetime.fromisoformat(fact["last_updated"])
                if "publication_date" in fact and fact["publication_date"]:
                    fact["publication_date"] = datetime.datetime.fromisoformat(fact["publication_date"])
                if "reliability_rating" in fact:
                    fact["reliability_rating"] = ReliabilityRating(fact["reliability_rating"])
            
            self.data = data
    
    def get_facts_by_tag(self, tag):
        return [fact for fact in self.data if tag in fact["tags"]]
    
    def get_facts_by_category(self, category):
        return [fact for fact in self.data if fact["category"] == category]
    
    def get_facts_by_source(self, source_id):
        return [fact for fact in self.data if fact["source_id"] == source_id]
    
    def get_fact_by_id(self, fact_id):
        for fact in self.data:
            if fact["fact_id"] == fact_id:
                return fact
        return None
    
    def update_fact(self, fact_id, **kwargs):
        for fact in self.data:
            if fact["fact_id"] == fact_id:
                for key, value in kwargs.items():
                    if key in fact:
                        fact[key] = value
                fact["last_updated"] = datetime.datetime.now()
                return True
        return False
    
    def delete_fact(self, fact_id):
        for i, fact in enumerate(self.data):
            if fact["fact_id"] == fact_id:
                del self.data[i]
                return True
        return False
    
    def get_all_tags(self):
        tags = set()
        for fact in self.data:
            tags.update(fact["tags"])
        return list(tags)
    
    def get_all_categories(self):
        return list(set(fact["category"] for fact in self.data))
    
    def get_all_sources(self):
        return list(set(fact["source_id"] for fact in self.data if fact["source_id"]))
    
    def get_fact_count(self):
        return len(self.data)
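
`save_to_file` and `load_from_file` convert `datetime` and `ReliabilityRating` values to strings and back, because `json` cannot serialize either type directly. A minimal standalone sketch of that round trip follows. It uses an in-memory JSON string instead of a file, and `Rating`, `to_jsonable`, and `from_jsonable` are hypothetical stand-ins, not names from the notebook.

```python
import datetime
import enum
import json


class Rating(enum.Enum):
    """Hypothetical two-member stand-in for ReliabilityRating."""
    UNKNOWN = "Unknown"
    LIKELY_TRUE = "Likely True"


def to_jsonable(fact: dict) -> dict:
    """Return a copy of `fact` with datetime and enum values as plain strings."""
    out = {}
    for key, value in fact.items():
        if isinstance(value, datetime.datetime):
            out[key] = value.isoformat()
        elif isinstance(value, enum.Enum):
            out[key] = value.value
        else:
            out[key] = value
    return out


def from_jsonable(record: dict) -> dict:
    """Rebuild the datetime and enum values from their string forms."""
    out = dict(record)
    out["date_recorded"] = datetime.datetime.fromisoformat(out["date_recorded"])
    out["reliability_rating"] = Rating(out["reliability_rating"])
    return out


fact = {
    "fact_id": "f1",
    "fact_statement": "Example statement.",
    "date_recorded": datetime.datetime(2024, 1, 15, 9, 30),
    "reliability_rating": Rating.LIKELY_TRUE,
}

encoded = json.dumps(to_jsonable(fact))
restored = from_jsonable(json.loads(encoded))
print(restored == fact)
```

Looking the enum up by value (`Rating("Likely True")`) only works while the stored strings match the enum definition, so renaming a rating's value would make older files fail to load.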

## Step 3: Define Enhanced Knowledge Reduction Class

In [None]:
class EnhancedKnowledgeReduction:
    def __init__(self, 
                 transformer_model='all-MiniLM-L6-v2',
                 spacy_model='en_core_web_md',
                 similarity_threshold=0.85,
                 cluster_distance_threshold=0.5,
                 importance_threshold=0.6):
        self.transformer_model = transformer_model
        self.spacy_model = spacy_model
        self.similarity_threshold = similarity_threshold
        self.cluster_distance_threshold = cluster_distance_threshold
        self.importance_threshold = importance_threshold
        
        # Load models
        self.sentence_transformer = SentenceTransformer(transformer_model)
        self.nlp = spacy.load(spacy_model)
        
        # Initialize graph for entity disambiguation
        self.graph = nx.Graph()
    
    def reduce_knowledge(self, knowledge_graph):
        # Step 1: Extract fact statements
        facts = [fact["fact_statement"] for fact in knowledge_graph.data]
        fact_ids = [fact["fact_id"] for fact in knowledge_graph.data]
        
        # Step 2: Compute embeddings
        embeddings = self.sentence_transformer.encode(facts)
        
        # Step 3: Perform hierarchical clustering
        clustering = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=self.cluster_distance_threshold,
            linkage='average'
        ).fit(embeddings)
        
        # Step 4: Extract entities and build entity graph
        self._build_entity_graph(facts, fact_ids)
        
        # Step 5: Calculate importance scores
        importance_scores = self._calculate_importance_scores(knowledge_graph)
        
        # Step 6: Select representative facts from each cluster
        clusters = {}
        for i, cluster_id in enumerate(clustering.labels_):
            if cluster_id not in clusters:
                clusters[cluster_id] = []
            clusters[cluster_id].append(i)
        
        # Step 7: Create reduced knowledge graph
        reduced_kg = KnowledgeGraph()
        
        for cluster_id, fact_indices in clusters.items():
            # A single-fact cluster is kept only if the fact meets the importance threshold
            if len(fact_indices) == 1:
                idx = fact_indices[0]
                if importance_scores[idx] >= self.importance_threshold:
                    reduced_kg.data.append(knowledge_graph.data[idx])
                continue
            
            # Find the most important fact in the cluster
            most_important_idx = max(fact_indices, key=lambda i: importance_scores[i])
            
            # Only keep if it meets the importance threshold
            if importance_scores[most_important_idx] >= self.importance_threshold:
                # Add the most important fact
                most_important_fact = knowledge_graph.data[most_important_idx].copy()
                
                # Add related facts
                related_facts = [fact_ids[i] for i in fact_indices if i != most_important_idx]
                most_important_fact["related_facts"] = related_facts
                
                reduced_kg.data.append(most_important_fact)
        
        return reduced_kg
    
    def _build_entity_graph(self, facts, fact_ids):
        self.graph.clear()
        
        # Extract entities from each fact
        for i, fact in enumerate(facts):
            doc = self.nlp(fact)
            
            # Add fact node
            fact_node = f"fact_{fact_ids[i]}"
            self.graph.add_node(fact_node, type="fact", text=fact)
            
            # Add entity nodes and edges
            for ent in doc.ents:
                entity_node = f"entity_{ent.text}_{ent.label_}"
                if not self.graph.has_node(entity_node):
                    self.graph.add_node(entity_node, type="entity", text=ent.text, label=ent.label_)
                
                self.graph.add_edge(fact_node, entity_node, weight=1.0)
    
    def _calculate_importance_scores(self, knowledge_graph):
        importance_scores = []
        
        # Compute degree centrality once rather than recomputing it for every fact
        centrality_by_node = nx.degree_centrality(self.graph)
        
        for fact in knowledge_graph.data:
            # Base score
            score = 0.5
            
            # Adjust based on reliability rating
            if fact["reliability_rating"] == ReliabilityRating.VERIFIED_TRUE:
                score += 0.3
            elif fact["reliability_rating"] == ReliabilityRating.LIKELY_TRUE:
                score += 0.2
            elif fact["reliability_rating"] == ReliabilityRating.POSSIBLY_TRUE:
                score += 0.1
            elif fact["reliability_rating"] == ReliabilityRating.POSSIBLY_FALSE:
                score -= 0.1
            elif fact["reliability_rating"] == ReliabilityRating.UNLIKELY_FALSE:
                score -= 0.2
            
            # Adjust based on centrality in entity graph
            fact_node = f"fact_{fact['fact_id']}"
            if fact_node in centrality_by_node:
                # Use degree centrality as a measure of importance
                score += centrality_by_node[fact_node] * 0.2
            
            # Adjust based on usage count
            if fact["usage_count"] > 0:
                score += min(0.2, fact["usage_count"] * 0.02)
            
            # Ensure score is between 0 and 1
            score = max(0.0, min(1.0, score))
            
            importance_scores.append(score)
        
        return importance_scores
    
    def save_graph(self, filename):
        # Pass edges="links" explicitly to avoid the warning about the changing default;
        # older NetworkX releases may not accept this keyword
        data = nx.node_link_data(self.graph, edges="links")
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
    
    def load_graph(self, filename):
        with open(filename, 'r') as f:
            data = json.load(f)
        # Explicitly set edges="links" to avoid future compatibility warnings
        self.graph = nx.node_link_graph(data, edges="links")
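
To see how the weights in `_calculate_importance_scores` interact with the default `importance_threshold` of 0.6, the formula can be reproduced in isolation. The sketch below is a hypothetical standalone helper: `importance_score` and the string rating keys are not part of the notebook, and the centrality value is passed in as a plain number instead of being read from the entity graph.

```python
# Rating adjustments copied from _calculate_importance_scores. The string keys
# are an assumption so that the sketch does not need the ReliabilityRating enum.
RATING_ADJUSTMENT = {
    "VERIFIED_TRUE": 0.3,
    "LIKELY_TRUE": 0.2,
    "POSSIBLY_TRUE": 0.1,
    "UNKNOWN": 0.0,
    "POSSIBLY_FALSE": -0.1,
    "UNLIKELY_FALSE": -0.2,
}


def importance_score(rating: str, centrality: float = 0.0, usage_count: int = 0) -> float:
    """Base 0.5, plus rating, centrality, and usage adjustments, clamped to [0, 1]."""
    score = 0.5 + RATING_ADJUSTMENT[rating]
    score += centrality * 0.2
    if usage_count > 0:
        score += min(0.2, usage_count * 0.02)
    return max(0.0, min(1.0, score))


threshold = 0.6  # default importance_threshold
examples = [
    ("UNKNOWN", 0.0, 0),
    ("LIKELY_TRUE", 0.0, 0),
    ("UNKNOWN", 0.0, 10),
    ("VERIFIED_TRUE", 1.0, 20),
]
for rating, centrality, usage in examples:
    score = importance_score(rating, centrality, usage)
    print(f"{rating:14s} centrality={centrality} usage={usage} "
          f"score={score:.2f} kept={score >= threshold}")
```

Under these weights, a fact with the default `UNKNOWN` rating, no usage count, and low centrality scores close to 0.5 and falls below the default threshold. Freshly scraped facts may therefore need an explicit rating or a lower `importance_threshold` to survive reduction.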

## Step 4: Define Advanced Link Collection Class

In [None]:
class LinkCollector:
    """Class for collecting links from websites with advanced features."""
    
    def __init__(self, 
                 start_urls: Optional[List[str]] = None,
                 allowed_domains: Optional[List[str]] = None,
                 max_depth: int = 2,
                 respect_robots_txt: bool = True,
                 rate_limit: float = 1.0,
                 user_agent: str = 'CivicHonorsKG/1.0',
                 max_urls: int = 100,
                 link_filters: Optional[List[Callable[[str], bool]]] = None):
        """
        Initialize the LinkCollector.
        
        Args:
            start_urls: List of URLs to start crawling from
            allowed_domains: List of domains to restrict crawling to
            max_depth: Maximum depth to crawl (0 = only start_urls)
            respect_robots_txt: Whether to respect robots.txt files
            rate_limit: Minimum time between requests in seconds
            user_agent: User agent string to use for requests
            max_urls: Maximum number of URLs to collect
            link_filters: List of filter functions that take a URL and return True if it should be followed
        """
        self.start_urls = start_urls or []
        self.allowed_domains = allowed_domains or []
        self.max_depth = max_depth
        self.respect_robots_txt = respect_robots_txt
        self.rate_limit = rate_limit
        self.user_agent = user_agent
        self.max_urls = max_urls
        self.link_filters = link_filters or []
        
        # Internal state
        self.url_queue = deque()  # Queue of (url, depth) tuples
        self.visited_urls = set()  # Set of visited URLs
        self.robots_parsers = {}  # Cache of RobotFileParser objects
        self.last_request_time = 0  # Time of last request
        self.collected_pages = {}  # Dictionary of {url: {'html': html, 'soup': soup, 'links': links, 'metadata': metadata, 'depth': depth}}
        
        # Initialize queue with start URLs
        for url in self.start_urls:
            self.url_queue.append((self._normalize_url(url), 0))
    
    def _normalize_url(self, url: str) -> str:
        """
        Normalize a URL to prevent duplicates.
        
        Args:
            url: URL to normalize
            
        Returns:
            Normalized URL
        """
        # Parse the URL
        parsed = urllib.parse.urlparse(url)
        
        # Normalize the path
        path = parsed.path
        if not path:
            path = '/'
        
        # Remove trailing slash except for root
        if path != '/' and path.endswith('/'):
            path = path[:-1]
        
        # Reconstruct the URL without the fragment
        normalized = urllib.parse.urlunparse((
            parsed.scheme,
            parsed.netloc,
            path,
            parsed.params,
            parsed.query,
            ''  # Remove fragment
        ))
        
        return normalized
    
    def _get_domain(self, url: str) -> str:
        """
        Extract the domain from a URL.
        
        Args:
            url: URL to extract domain from
            
        Returns:
            Domain name
        """
        parsed = urllib.parse.urlparse(url)
        # hostname is lowercased and excludes any port or credentials, unlike netloc
        return parsed.hostname or ''
    
    def _is_allowed_domain(self, url: str) -> bool:
        """
        Check if a URL's domain is in the allowed domains list.
        
        Args:
            url: URL to check
            
        Returns:
            True if domain is allowed or no restrictions set
        """
        if not self.allowed_domains:
            return True
        
        domain = self._get_domain(url)
        
        for allowed_domain in self.allowed_domains:
            if domain == allowed_domain or domain.endswith('.' + allowed_domain):
                return True
        
        return False
    
    def _can_fetch(self, url: str) -> bool:
        """
        Check if a URL can be fetched according to robots.txt.
        
        Args:
            url: URL to check
            
        Returns:
            True if URL can be fetched
        """
        if not self.respect_robots_txt:
            return True
        
        parsed = urllib.parse.urlparse(url)
        domain = parsed.netloc
        scheme = parsed.scheme
        
        # Get or create robots parser for this domain
        if domain not in self.robots_parsers:
            robots_url = f"{scheme}://{domain}/robots.txt"
            parser = RobotFileParser()
            parser.set_url(robots_url)
            try:
                parser.read()
                self.robots_parsers[domain] = parser
            except Exception as e:
                print(f"Error reading robots.txt for {domain}: {e}")
                # If we can't read robots.txt, assume we can fetch
                return True
        
        # Check if we can fetch this URL
        return self.robots_parsers[domain].can_fetch(self.user_agent, url)
    
    def _apply_rate_limit(self):
        """Apply rate limiting between requests."""
        current_time = time.time()
        time_since_last_request = current_time - self.last_request_time
        
        if time_since_last_request < self.rate_limit:
            time.sleep(self.rate_limit - time_since_last_request)
        
        self.last_request_time = time.time()
    
    def _extract_links(self, soup: BeautifulSoup, base_url: str) -> List[str]:
        """
        Extract all links from a BeautifulSoup object.
        
        Args:
            soup: BeautifulSoup object
            base_url: Base URL for resolving relative links
            
        Returns:
            List of absolute URLs
        """
        links = []
        
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            
            # Skip empty links and javascript:, mailto:, and tel: links
            if not href or href.startswith(('javascript:', 'mailto:', 'tel:')):
                continue
            
            # Resolve relative URLs
            absolute_url = urllib.parse.urljoin(base_url, href)
            
            # Normalize the URL
            normalized_url = self._normalize_url(absolute_url)
            
            # Apply custom filters
            if all(filter_func(normalized_url) for filter_func in self.link_filters):
                links.append(normalized_url)
        
        return links
    
    def _extract_metadata(self, soup: BeautifulSoup, url: str) -> Dict:
        """
        Extract metadata from a page.
        
        Args:
            soup: BeautifulSoup object
            url: URL of the page
            
        Returns:
            Dictionary of metadata
        """
        metadata = {
            'title': None,
            'description': None,
            'keywords': None,
            'author': None,
            'published_date': None,
        }
        
        # Extract title
        title_tag = soup.find('title')
        if title_tag:
            metadata['title'] = title_tag.get_text().strip()
        
        # Extract meta tags
        for meta in soup.find_all('meta'):
            name = meta.get('name', '').lower()
            prop = meta.get('property', '').lower()  # named prop to avoid shadowing the built-in property
            content = meta.get('content', '')
            
            if name == 'description' or prop == 'og:description':
                metadata['description'] = content
            elif name == 'keywords':
                metadata['keywords'] = content
            elif name == 'author':
                metadata['author'] = content
            elif name == 'article:published_time' or prop == 'article:published_time':
                metadata['published_date'] = content
        
        return metadata
    
    def fetch_url(self, url: str) -> Tuple[Optional[str], Optional[BeautifulSoup]]:
        """
        Fetch a URL and return its HTML content and BeautifulSoup object.
        
        Args:
            url: URL to fetch
            
        Returns:
            Tuple of (html_content, soup_object) or (None, None) if fetch failed
        """
        print(f"Fetching URL: {url}")
        
        # Apply rate limiting
        self._apply_rate_limit()
        
        # Check robots.txt
        if not self._can_fetch(url):
            print(f"Skipping URL {url} (disallowed by robots.txt)")
            return None, None
        
        # Fetch the URL
        try:
            headers = {'User-Agent': self.user_agent}
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            
            html = response.text
            soup = BeautifulSoup(html, 'html.parser')
            
            return html, soup
        
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None, None
    
    def collect_links(self) -> Dict[str, Dict]:
        """
        Collect links starting from the start_urls.
        
        Returns:
            Dictionary of collected pages
        """
        while self.url_queue and len(self.visited_urls) < self.max_urls:
            # Get the next URL and depth from the queue
            url, depth = self.url_queue.popleft()
            
            # Skip if already visited
            if url in self.visited_urls:
                continue
            
            # Skip if not in allowed domains
            if not self._is_allowed_domain(url):
                continue
            
            # Mark as visited
            self.visited_urls.add(url)
            
            # Fetch the URL
            html, soup = self.fetch_url(url)
            
            if not html or not soup:
                continue
            
            # Extract links
            links = self._extract_links(soup, url)
            
            # Extract metadata
            metadata = self._extract_metadata(soup, url)
            
            # Store the page
            self.collected_pages[url] = {
                'html': html,
                'soup': soup,
                'links': links,
                'metadata': metadata,
                'depth': depth
            }
            
            # If we haven't reached max depth, add links to queue
            if depth < self.max_depth:
                for link in links:
                    if link not in self.visited_urls:
                        self.url_queue.append((link, depth + 1))
            
            print(f"Collected {len(self.visited_urls)} URLs, {len(self.url_queue)} in queue")
        
        return self.collected_pages
    
    def get_collected_urls(self) -> List[str]:
        """
        Get list of collected URLs.
        
        Returns:
            List of URLs
        """
        return list(self.collected_pages.keys())
    
    def get_page_content(self, url: str) -> Optional[Dict]:
        """
        Get content for a specific URL.
        
        Args:
            url: URL to get content for
            
        Returns:
            Dictionary with page content or None if not found
        """
        return self.collected_pages.get(url)

# Example filter functions
def filter_pdf_urls(url: str) -> bool:
    """Filter out PDF URLs."""
    return not url.lower().endswith('.pdf')

def filter_image_urls(url: str) -> bool:
    """Filter out image URLs."""
    return not url.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg'))

def filter_by_keywords(url: str, keywords: List[str]) -> bool:
    """Keep only URLs that contain at least one keyword (bind keywords before using this in link_filters)."""
    url_lower = url.lower()
    return any(keyword.lower() in url_lower for keyword in keywords)
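
`LinkCollector` calls each entry of `link_filters` with a single URL argument, so `filter_by_keywords`, which also expects a keyword list, cannot be passed in directly. One way to adapt it is sketched below with `functools.partial`, using copies of two of the filters so the block runs on its own. The example URLs and the keyword list are made up.

```python
from functools import partial
from typing import Callable, List


def filter_pdf_urls(url: str) -> bool:
    """Keep URLs that do not point at a PDF."""
    return not url.lower().endswith('.pdf')


def filter_by_keywords(url: str, keywords: List[str]) -> bool:
    """Keep URLs that contain at least one keyword."""
    url_lower = url.lower()
    return any(keyword.lower() in url_lower for keyword in keywords)


# Bind the keyword list up front so the resulting callable takes only a URL.
keyword_filter: Callable[[str], bool] = partial(
    filter_by_keywords, keywords=['honors', 'civic']
)

link_filters: List[Callable[[str], bool]] = [filter_pdf_urls, keyword_filter]

candidate_urls = [
    'https://example.org/civic-honors/about',
    'https://example.org/civic-honors/report.pdf',
    'https://example.org/contact',
]

# Same rule as LinkCollector._extract_links: a URL is kept only if every filter accepts it.
kept = [url for url in candidate_urls if all(f(url) for f in link_filters)]
print(kept)
```

The resulting list could then be passed as `link_filters=link_filters` when constructing a `LinkCollector`. Because the filters are combined with `all`, adding a keyword filter narrows the crawl to URLs that match a keyword and also pass every other filter.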

## Step 5: Define Advanced Page Search Class

In [None]:
class PageSearcher:
    """Class for searching and extracting information from web pages with advanced features."""
    
    def __init__(self, 
                 spacy_model: str = 'en_core_web_md',
                 transformer_model: str = 'all-MiniLM-L6-v2',
                 min_content_length: int = 20,
                 relevance_threshold: float = 0.5,
                 context_window: int = 2):
        """
        Initialize the PageSearcher.
        
        Args:
            spacy_model: Name of spaCy model to use
            transformer_model: Name of sentence transformer model to use
            min_content_length: Minimum length of content to consider
            relevance_threshold: Minimum relevance score for content to be included
            context_window: Number of surrounding elements to include as context
        """
        self.min_content_length = min_content_length
        self.relevance_threshold = relevance_threshold
        self.context_window = context_window
        
        # Load NLP models
        print(f"Loading spaCy model: {spacy_model}")
        self.nlp = spacy.load(spacy_model)
        
        print(f"Loading sentence transformer model: {transformer_model}")
        self.transformer = SentenceTransformer(transformer_model)
        
        # Initialize storage for extracted information
        self.extracted_facts = []
        self.entity_index = defaultdict(list)  # Maps entity -> list of fact indices
        self.keyword_index = defaultdict(list)  # Maps keyword -> list of fact indices
        self.url_to_facts = defaultdict(list)  # Maps URL -> list of fact indices
        
    def extract_text_with_context(self, soup: BeautifulSoup) -> List[Dict]:
        """
        Extract text from HTML elements with surrounding context.
        
        Args:
            soup: BeautifulSoup object
            
        Returns:
            List of dictionaries with text and context
        """
        elements = []
        
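        # Get all elements that might contain content. Container tags (div, section, article)
        # are included, so their text can repeat the text of the elements nested inside them.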
        content_elements = soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'div', 'span', 'article', 'section'])
        
        for i, element in enumerate(content_elements):
            text = element.get_text().strip()
            
            # Skip empty or very short elements
            if not text or len(text) < self.min_content_length:
                continue
            
            # Get context (surrounding elements)
            start_idx = max(0, i - self.context_window)
            end_idx = min(len(content_elements), i + self.context_window + 1)
            
            context_elements = content_elements[start_idx:i] + content_elements[i+1:end_idx]
            context = [elem.get_text().strip() for elem in context_elements if elem.get_text().strip()]
            
            elements.append({
                'text': text,
                'element_type': element.name,
                'context': context,
                'html': str(element)
            })
        
        return elements
    
    def extract_structured_data(self, soup: BeautifulSoup) -> List[Dict]:
        """
        Extract structured data like tables and lists.
        
        Args:
            soup: BeautifulSoup object
            
        Returns:
            List of dictionaries with structured data
        """
        structured_data = []
        
        # Extract tables
        tables = soup.find_all('table')
        for table in tables:
            rows = []
            headers = []
            
            # Extract headers
            th_elements = table.find_all('th')
            if th_elements:
                headers = [th.get_text().strip() for th in th_elements]
            
            # Extract rows
            for tr in table.find_all('tr'):
                cells = [td.get_text().strip() for td in tr.find_all(['td', 'th'])]
                if cells and not all(cell == '' for cell in cells):
                    rows.append(cells)
            
            if rows:
                structured_data.append({
                    'type': 'table',
                    'headers': headers,
                    'rows': rows,
                    'html': str(table)
                })
        
        # Extract lists
        for list_tag in soup.find_all(['ul', 'ol']):
            items = [li.get_text().strip() for li in list_tag.find_all('li')]
            if items:
                structured_data.append({
                    'type': 'list',
                    'list_type': list_tag.name,
                    'items': items,
                    'html': str(list_tag)
                })
        
        return structured_data
    
    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """
        Extract named entities from text using spaCy.
        
        Args:
            text: Text to extract entities from
            
        Returns:
            Dictionary mapping entity types to lists of entities
        """
        doc = self.nlp(text)
        entities = defaultdict(list)
        
        for ent in doc.ents:
            entities[ent.label_].append(ent.text)
        
        return dict(entities)
    
    def calculate_relevance_score(self, text: str, query_embedding: np.ndarray) -> float:
        """
        Calculate relevance score of text to a query.
        
        Args:
            text: Text to score
            query_embedding: Embedding of the query
            
        Returns:
            Cosine similarity between the text and query embeddings (ranges from -1 to 1; higher means more relevant)
        """
        text_embedding = self.transformer.encode([text])[0]
        similarity = cosine_similarity([query_embedding], [text_embedding])[0][0]
        return float(similarity)
    
    def search_by_keywords(self, pages: Dict[str, Dict], keywords: List[str]) -> List[Dict]:
        """
        Search pages for specific keywords.
        
        Args:
            pages: Dictionary of pages from LinkCollector
            keywords: List of keywords to search for
            
        Returns:
            List of dictionaries with search results
        """
        results = []
        
        for url, page_data in pages.items():
            soup = page_data['soup']
            
            # Get all text elements
            elements = self.extract_text_with_context(soup)
            
            for element in elements:
                text = element['text'].lower()
                
                # Check if any keyword is in the text
                matches = [keyword for keyword in keywords if keyword.lower() in text]
                
                if matches:
                    results.append({
                        'url': url,
                        'text': element['text'],
                        'element_type': element['element_type'],
                        'context': element['context'],
                        'matched_keywords': matches,
                        'html': element['html']
                    })
        
        return results
    
    def search_by_regex(self, pages: Dict[str, Dict], pattern: str) -> List[Dict]:
        """
        Search pages using regular expression pattern.
        
        Args:
            pages: Dictionary of pages from LinkCollector
            pattern: Regular expression pattern
            
        Returns:
            List of dictionaries with search results
        """
        results = []
        regex = re.compile(pattern)
        
        for url, page_data in pages.items():
            soup = page_data['soup']
            
            # Get all text elements
            elements = self.extract_text_with_context(soup)
            
            for element in elements:
                text = element['text']
                
                # Find all matches
                matches = regex.findall(text)
                
                if matches:
                    results.append({
                        'url': url,
                        'text': element['text'],
                        'element_type': element['element_type'],
                        'context': element['context'],
                        'regex_matches': matches,
                        'html': element['html']
                    })
        
        return results
    
    def search_by_css_selector(self, pages: Dict[str, Dict], css_selector: str) -> List[Dict]:
        """
        Search pages using CSS selector.
        
        Args:
            pages: Dictionary of pages from LinkCollector
            css_selector: CSS selector string
            
        Returns:
            List of dictionaries with search results
        """
        results = []
        
        for url, page_data in pages.items():
            soup = page_data['soup']
            
            # Find elements matching the CSS selector
            matching_elements = soup.select(css_selector)
            
            for element in matching_elements:
                text = element.get_text().strip()
                
                if text and len(text) >= self.min_content_length:
                    results.append({
                        'url': url,
                        'text': text,
                        'element_type': element.name,
                        'css_selector': css_selector,
                        'html': str(element)
                    })
        
        return results
    
    def semantic_search(self, pages: Dict[str, Dict], query: str, top_k: int = 10) -> List[Dict]:
        """
        Perform semantic search on pages using sentence transformers.
        
        Args:
            pages: Dictionary of pages from LinkCollector
            query: Search query
            top_k: Number of top results to return
            
        Returns:
            List of dictionaries with search results, sorted by relevance
        """
        # Encode the query
        query_embedding = self.transformer.encode([query])[0]
        
        all_elements = []
        
        # Extract text elements from all pages
        for url, page_data in pages.items():
            soup = page_data['soup']
            elements = self.extract_text_with_context(soup)
            
            for element in elements:
                all_elements.append({
                    'url': url,
                    'text': element['text'],
                    'element_type': element['element_type'],
                    'context': element['context'],
                    'html': element['html']
                })
        
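        # Nothing to rank if no page produced any text elements
        if not all_elements:
            return []
        
        # Encode all texts in one batch and score them against the query
        text_embeddings = self.transformer.encode([element['text'] for element in all_elements])
        scores = cosine_similarity([query_embedding], text_embeddings)[0]
        for element, score in zip(all_elements, scores):
            element['relevance_score'] = float(score)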
        
        # Sort by relevance score and take top_k
        results = sorted(all_elements, key=lambda x: x['relevance_score'], reverse=True)[:top_k]
        
        return results
    
    def extract_facts_from_pages(self, pages: Dict[str, Dict], topics: List[str] = None) -> List[Dict]:
        """
        Extract facts from pages with optional topic filtering.
        
        Args:
            pages: Dictionary of pages from LinkCollector
            topics: Optional list of topics to filter by
            
        Returns:
            List of dictionaries with extracted facts
        """
        # Reset storage
        self.extracted_facts = []
        self.entity_index = defaultdict(list)
        self.keyword_index = defaultdict(list)
        self.url_to_facts = defaultdict(list)
        
        # Encode topics if provided
        topic_embeddings = None
        if topics:
            topic_embeddings = self.transformer.encode(topics)
        
        for url, page_data in pages.items():
            soup = page_data['soup']
            metadata = page_data['metadata']
            
            # Extract text elements
            elements = self.extract_text_with_context(soup)
            
            # Extract structured data
            structured_data = self.extract_structured_data(soup)
            
            # Process text elements
            for element in elements:
                text = element['text']
                
                # Skip if too short
                if len(text) < self.min_content_length:
                    continue
                
                # Check relevance to topics if provided
                if topic_embeddings is not None:
                    text_embedding = self.transformer.encode([text])[0]
                    similarities = cosine_similarity([text_embedding], topic_embeddings)[0]
                    # Cast to a plain float so the stored score is not a NumPy scalar
                    max_similarity = float(similarities.max())
                    
                    if max_similarity < self.relevance_threshold:
                        continue
                    
                    topic_relevance = {topics[i]: float(similarities[i]) for i in range(len(topics))}
                else:
                    topic_relevance = {}
                    max_similarity = 1.0
                
                # Extract entities
                entities = self.extract_entities(text)
                
                # Create fact
                fact = {
                    'text': text,
                    'url': url,
                    'element_type': element['element_type'],
                    'context': element['context'],
                    'entities': entities,
                    'topic_relevance': topic_relevance,
                    'relevance_score': max_similarity,
                    'metadata': metadata,
                    'html': element['html']
                }
                
                # Add to storage
                fact_idx = len(self.extracted_facts)
                self.extracted_facts.append(fact)
                self.url_to_facts[url].append(fact_idx)
                
                # Index entities (deduplicated so a repeated mention indexes this fact once)
                for entity in {e for entity_list in entities.values() for e in entity_list}:
                    self.entity_index[entity].append(fact_idx)
                
                # Index keywords (simple approach: unique lowercase word tokens)
                words = set(re.findall(r'\b\w+\b', text.lower()))
                for word in words:
                    if len(word) > 3:  # Skip very short words
                        self.keyword_index[word].append(fact_idx)
            
            # Process structured data
            for data in structured_data:
                if data['type'] == 'table':
                    # For tables, create a fact for each row
                    headers = data['headers']
                    for row in data['rows']:
                        if headers and len(headers) == len(row):
                            # Create a text representation of the row
                            row_text = '. '.join([f"{headers[i]}: {row[i]}" for i in range(len(headers))])
                        else:
                            row_text = '. '.join(row)
                        
                        # Skip if too short
                        if len(row_text) < self.min_content_length:
                            continue
                        
                        # Extract entities
                        entities = self.extract_entities(row_text)
                        
                        # Create fact
                        fact = {
                            'text': row_text,
                            'url': url,
                            'element_type': 'table_row',
                            'context': [],
                            'entities': entities,
                            'topic_relevance': {},
                            'relevance_score': 1.0,
                            'metadata': metadata,
                            'structured_data': {
                                'type': 'table_row',
                                'headers': headers,
                                'values': row
                            },
                            'html': data['html']
                        }
                        
                        # Add to storage
                        fact_idx = len(self.extracted_facts)
                        self.extracted_facts.append(fact)
                        self.url_to_facts[url].append(fact_idx)
                        
                        # Index entities (deduplicated so a repeated mention indexes this fact once)
                        for entity in {e for entity_list in entities.values() for e in entity_list}:
                            self.entity_index[entity].append(fact_idx)
                
                elif data['type'] == 'list':
                    # For lists, create a fact for the entire list
                    list_text = '. '.join(data['items'])
                    
                    # Skip if too short
                    if len(list_text) < self.min_content_length:
                        continue
                    
                    # Extract entities
                    entities = self.extract_entities(list_text)
                    
                    # Create fact
                    fact = {
                        'text': list_text,
                        'url': url,
                        'element_type': f"{data['list_type']}_list",
                        'context': [],
                        'entities': entities,
                        'topic_relevance': {},
                        'relevance_score': 1.0,
                        'metadata': metadata,
                        'structured_data': {
                            'type': 'list',
                            'list_type': data['list_type'],
                            'items': data['items']
                        },
                        'html': data['html']
                    }
                    
                    # Add to storage
                    fact_idx = len(self.extracted_facts)
                    self.extracted_facts.append(fact)
                    self.url_to_facts[url].append(fact_idx)
                    
                    # Index entities (deduplicated so a repeated mention indexes this fact once)
                    for entity in {e for entity_list in entities.values() for e in entity_list}:
                        self.entity_index[entity].append(fact_idx)
        
        return self.extracted_facts
    
    def search_facts_by_entity(self, entity: str) -> List[Dict]:
        """
        Search extracted facts by entity.
        
        Args:
            entity: Entity to search for
            
        Returns:
            List of facts containing the entity
        """
        fact_indices = self.entity_index.get(entity, [])
        return [self.extracted_facts[idx] for idx in fact_indices]
    
    def search_facts_by_keyword(self, keyword: str) -> List[Dict]:
        """
        Search extracted facts by keyword.
        
        Args:
            keyword: Keyword to search for
            
        Returns:
            List of facts containing the keyword
        """
        keyword = keyword.lower()
        fact_indices = self.keyword_index.get(keyword, [])
        return [self.extracted_facts[idx] for idx in fact_indices]
    
    def get_facts_by_url(self, url: str) -> List[Dict]:
        """
        Get all facts extracted from a specific URL.
        
        Args:
            url: URL to get facts for
            
        Returns:
            List of facts from the URL
        """
        fact_indices = self.url_to_facts.get(url, [])
        return [self.extracted_facts[idx] for idx in fact_indices]
    
    def find_related_facts(self, fact_idx: int, threshold: float = 0.7) -> List[Tuple[int, float]]:
        """
        Find facts related to a given fact based on semantic similarity.
        
        Args:
            fact_idx: Index of the fact to find related facts for
            threshold: Minimum similarity threshold
            
        Returns:
            List of tuples (fact_idx, similarity_score)
        """
        # Reject out-of-range indices, including negative ones
        if not 0 <= fact_idx < len(self.extracted_facts):
            return []
        
        # Encode every fact in one batch instead of one call per fact
        embeddings = self.transformer.encode([fact['text'] for fact in self.extracted_facts])
        similarities = cosine_similarity([embeddings[fact_idx]], embeddings)[0]
        
        related = []
        
        # Collect every other fact at or above the threshold
        for i, similarity in enumerate(similarities):
            if i == fact_idx:
                continue
            
            if similarity >= threshold:
                related.append((i, float(similarity)))
        
        # Sort by similarity
        related.sort(key=lambda x: x[1], reverse=True)
        
        return related
    
    def classify_content(self, text: str, categories: List[str]) -> Dict[str, float]:
        """
        Classify content into predefined categories using semantic similarity.
        
        Args:
            text: Text to classify
            categories: List of category names
            
        Returns:
            Dictionary mapping categories to cosine similarity scores
            (similarities in [-1, 1], not calibrated probabilities)
        """
        # Encode text and categories
        text_embedding = self.transformer.encode([text])[0]
        category_embeddings = self.transformer.encode(categories)
        
        # Calculate similarities
        similarities = cosine_similarity([text_embedding], category_embeddings)[0]
        
        # Create result dictionary
        result = {categories[i]: float(similarities[i]) for i in range(len(categories))}
        
        return result
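
The `keyword_index`, `entity_index`, and `url_to_facts` lookups used by `PageSearcher` are all inverted indexes: each key maps to the positions of matching facts in `extracted_facts`. The sketch below isolates that idea on plain strings. `build_keyword_index` and `lookup` are hypothetical helper names, not part of the class, and the sample sentences are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_keyword_index(texts: List[str], min_length: int = 4) -> Dict[str, List[int]]:
    """Map each lowercase word of at least min_length characters to the texts containing it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, text in enumerate(texts):
        # A set keeps a repeated word from indexing the same text twice
        for word in set(re.findall(r"\b\w+\b", text.lower())):
            if len(word) >= min_length:
                index[word].append(position)
    return index


def lookup(index: Dict[str, List[int]], texts: List[str], keyword: str) -> List[str]:
    """Return every text indexed under the keyword (case-insensitive)."""
    return [texts[position] for position in index.get(keyword.lower(), [])]


# Invented sample sentences; "service" appears twice in the first one
sample_texts = [
    "Volunteers earn civic honors for service, and service counts twice.",
    "The program tracks volunteer hours.",
    "Civic engagement grows over time.",
]
keyword_index = build_keyword_index(sample_texts)
print(lookup(keyword_index, sample_texts, "Civic"))
```

Matching is on exact tokens, so a lookup for "volunteer" does not find "volunteers". Stemming or lemmatization would be needed to merge such variants.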

## Step 6: Scrape Website Data with Advanced Link Collection

In [None]:
# Initialize the LinkCollector with improved capabilities
collector = LinkCollector(
    start_urls=[
        "https://civichonors.com/",
        "https://www.nelslindahl.com/"
    ],
    allowed_domains=["civichonors.com", "nelslindahl.com"],
    max_depth=1,  # Follow links one level deep
    respect_robots_txt=True,
    rate_limit=1.0,  # Wait 1 second between requests
    max_urls=10,  # Limit to 10 URLs for demonstration
    link_filters=[filter_pdf_urls, filter_image_urls]  # Filter out PDFs and images
)

# Collect links and pages
pages = collector.collect_links()

# Print summary of collected pages
print(f"\nCollected {len(pages)} pages:")
for url in pages.keys():
    print(f"  - {url}")
    
    # Print number of links found on this page
    links = pages[url]['links']
    print(f"    Found {len(links)} links on this page")
    
    # Print metadata
    metadata = pages[url]['metadata']
    print(f"    Title: {metadata.get('title')}")
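
The `allowed_domains` list above names bare domains, while one start URL uses a `www.` host. How `LinkCollector` matches hosts is defined in its own cell. The sketch below shows one way such a check could work, treating a host as allowed when it equals a listed domain or is a subdomain of it. `is_allowed_host` is a hypothetical helper, not the collector's actual method.

```python
from typing import Iterable
from urllib.parse import urlparse


def is_allowed_host(url: str, allowed_domains: Iterable[str]) -> bool:
    """Return True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # The leading dot prevents "notcivichonors.com" from matching "civichonors.com"
        if host == domain or host.endswith("." + domain):
            return True
    return False


allowed = ["civichonors.com", "nelslindahl.com"]
for candidate in [
    "https://www.nelslindahl.com/about",
    "https://civichonors.com/",
    "https://notcivichonors.com/",
    "https://example.org/civichonors.com",
]:
    print(candidate, is_allowed_host(candidate, allowed))
```

Comparing against the parsed host rather than the whole URL string avoids false matches where a domain name merely appears in a path or query.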

## Step 7: Populate Knowledge Graph with Advanced Page Search

In [None]:
import urllib.parse  # Used below to derive a source ID from each fact's URL

# Initialize the PageSearcher
searcher = PageSearcher()

# Extract facts from collected pages with topic filtering
facts = searcher.extract_facts_from_pages(pages, topics=["civic honors", "community service", "volunteering"])

print(f"Extracted {len(facts)} facts from {len(pages)} pages")

# Initialize the KnowledgeGraph
kg = KnowledgeGraph()

# Add extracted facts to the knowledge graph
for i, fact in enumerate(facts):
    # Create a unique ID for the fact
    fact_id = f"fact_{i}"
    
    # Get the source ID from the URL
    url = fact['url']
    domain = urllib.parse.urlparse(url).netloc
    source_id = domain.replace('www.', '').split('.')[0].capitalize()
    
    # Get metadata
    metadata = fact['metadata']
    
    # Add the fact to the knowledge graph
    kg.add_fact(
        fact_id=fact_id,
        fact_statement=fact['text'],
        category="General",
        tags=[source_id, "WebScraped"],
        date_recorded=datetime.datetime.now(),
        last_updated=datetime.datetime.now(),
        reliability_rating=ReliabilityRating.LIKELY_TRUE,
        source_id=source_id,
        source_title=metadata.get('title', f"{source_id} Website"),
        author_creator=metadata.get('author', "Web Scraping"),
        publication_date=datetime.datetime.now(),
        url_reference=url,
        related_facts=[],
        contextual_notes=f"Extracted from {source_id} website",
        access_level="Public",
        usage_count=0
    )

print(f"Added {kg.get_fact_count()} facts to the knowledge graph")
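
The loop above derives `source_id` from each fact's URL by dropping `www.`, taking the first label of the host, and capitalizing it. The sketch below isolates that step in a hypothetical `source_id_from_url` helper. Unlike the inline `replace`, it strips `www.` only at the start of the host. This is one possible convention, not the only reasonable one.

```python
from urllib.parse import urlparse


def source_id_from_url(url: str) -> str:
    """Derive a short source label from a URL's host, e.g. 'Civichonors'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # First DNS label, capitalized; a URL without a host yields "Unknown"
    label = host.split(".")[0]
    return label.capitalize() if label else "Unknown"


print(source_id_from_url("https://civichonors.com/"))
print(source_id_from_url("https://www.nelslindahl.com/"))
```

Because only the first host label is used, a subdomain such as `blog.example.com` would be labeled `Blog`. If distinct subdomains matter, the registered domain would be a better key.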

## Step 8: Retrieve and Display Facts

In [None]:
# Display a sample of facts from the knowledge graph
sample_size = min(5, kg.get_fact_count())
print(f"Sample of {sample_size} facts from the knowledge graph:\n")

for i in range(sample_size):
    fact = kg.data[i]
    print(f"Fact ID: {fact['fact_id']}")
    print(f"Statement: {fact['fact_statement']}")
    print(f"Source: {fact['source_id']}")
    print(f"URL: {fact['url_reference']}")
    print("\n" + "-"*80 + "\n")

## Step 9: Ensure Uniqueness

In [None]:
def remove_duplicates(knowledge_graph, similarity_threshold=0.9):
    """Remove duplicate facts from the knowledge graph based on string similarity."""
    unique_facts = []
    duplicate_count = 0
    
    for fact in knowledge_graph.data:
        is_duplicate = False
        for unique_fact in unique_facts:
            similarity = SequenceMatcher(None, fact["fact_statement"], unique_fact["fact_statement"]).ratio()
            if similarity >= similarity_threshold:
                is_duplicate = True
                duplicate_count += 1
                break
        
        if not is_duplicate:
            unique_facts.append(fact)
    
    # Create a new knowledge graph with unique facts
    unique_kg = KnowledgeGraph()
    unique_kg.data = unique_facts
    
    return unique_kg, duplicate_count

# Remove duplicates
unique_kg, duplicate_count = remove_duplicates(kg)

print(f"Removed {duplicate_count} duplicate facts")
print(f"Original knowledge graph: {kg.get_fact_count()} facts")
print(f"Unique knowledge graph: {unique_kg.get_fact_count()} facts")
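
`remove_duplicates` relies on `SequenceMatcher.ratio()`, which returns 2·M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below applies the same first-seen-wins rule to plain strings so the threshold's effect is easy to see. `dedupe_strings` and the sample statements are illustrative only.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def dedupe_strings(statements: List[str], threshold: float = 0.9) -> Tuple[List[str], int]:
    """Keep the first of each group of near-identical strings; return (kept, dropped_count)."""
    kept: List[str] = []
    dropped = 0
    for statement in statements:
        # Compare against every string kept so far; the first match marks a duplicate
        if any(SequenceMatcher(None, statement, seen).ratio() >= threshold for seen in kept):
            dropped += 1
        else:
            kept.append(statement)
    return kept, dropped


statements = [
    "Civic Honors recognizes community service.",
    "Civic Honors recognizes community service!",  # differs by one character
    "Volunteering builds local networks.",
]
kept, dropped = dedupe_strings(statements)
print(kept, dropped)
```

With the default threshold, the second statement is dropped as a near-duplicate of the first. Each statement is compared against every kept statement, so the cost grows quadratically with the number of facts. That is acceptable at demonstration scale but worth revisiting for larger crawls.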

## Step 10: Advanced Cleaning

In [None]:
def clean_knowledge_graph(knowledge_graph, similarity_threshold=0.85):
    """Clean the knowledge graph by removing semantically similar facts."""
    # Load the sentence transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
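    # Extract fact statements
    facts = [fact["fact_statement"] for fact in knowledge_graph.data]
    
    # Nothing to compare in an empty graph, so skip encoding entirely
    if not facts:
        return KnowledgeGraph(), 0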
    
    # Compute embeddings
    embeddings = model.encode(facts)
    
    # Compute similarity matrix
    similarity_matrix = cosine_similarity(embeddings)
    
    # Find groups of similar facts
    groups = []
    processed = set()
    
    for i in range(len(facts)):
        if i in processed:
            continue
        
        group = [i]
        processed.add(i)
        
        for j in range(i+1, len(facts)):
            if j in processed:
                continue
            
            if similarity_matrix[i, j] >= similarity_threshold:
                group.append(j)
                processed.add(j)
        
        groups.append(group)
    
    # Create a new knowledge graph with one fact from each group
    cleaned_kg = KnowledgeGraph()
    
    for group in groups:
        # Choose the fact with the highest reliability rating
        best_fact_idx = group[0]
        best_rating = knowledge_graph.data[best_fact_idx]["reliability_rating"]
        
        for idx in group[1:]:
            rating = knowledge_graph.data[idx]["reliability_rating"]
            if rating.value > best_rating.value:
                best_fact_idx = idx
                best_rating = rating
        
        cleaned_kg.data.append(knowledge_graph.data[best_fact_idx])
    
    return cleaned_kg, len(knowledge_graph.data) - len(cleaned_kg.data)

# Clean the knowledge graph
cleaned_kg, removed_count = clean_knowledge_graph(unique_kg)

print(f"Removed {removed_count} semantically similar facts")
print(f"Unique knowledge graph: {unique_kg.get_fact_count()} facts")
print(f"Cleaned knowledge graph: {cleaned_kg.get_fact_count()} facts")
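
The grouping loop in `clean_knowledge_graph` is greedy: each unprocessed fact seeds a group and claims every later fact whose similarity to the seed meets the threshold. Similarity to other group members is never checked, so the result depends on fact order. The sketch below runs the same loop on a small hand-written similarity matrix. The values are invented for illustration and are not model output.

```python
from typing import List


def greedy_groups(similarity: List[List[float]], threshold: float) -> List[List[int]]:
    """Group indices by similarity to the first unclaimed index (the seed)."""
    groups: List[List[int]] = []
    claimed = set()
    for i in range(len(similarity)):
        if i in claimed:
            continue
        group = [i]
        claimed.add(i)
        # Only later, unclaimed indices are compared, and only against the seed i
        for j in range(i + 1, len(similarity)):
            if j not in claimed and similarity[i][j] >= threshold:
                group.append(j)
                claimed.add(j)
        groups.append(group)
    return groups


# Invented symmetric matrix: 0~1 and 1~2 are similar, but 0~2 is not
similarity = [
    [1.00, 0.90, 0.60, 0.10],
    [0.90, 1.00, 0.88, 0.20],
    [0.60, 0.88, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
]
print(greedy_groups(similarity, threshold=0.85))
```

Here the groups are `[[0, 1], [2], [3]]`. Item 2 stays in its own group even though it is 0.88 similar to item 1, because it was only compared against seed 0. Clustering over the full matrix, as in hierarchical clustering, is one way to reduce this order dependence.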

## Step 11: Enhanced Knowledge Reduction

In [None]:
# Initialize the EnhancedKnowledgeReduction
reducer = EnhancedKnowledgeReduction()

# Reduce the knowledge graph
reduced_kg = reducer.reduce_knowledge(cleaned_kg)

print(f"Cleaned knowledge graph: {cleaned_kg.get_fact_count()} facts")
print(f"Reduced knowledge graph: {reduced_kg.get_fact_count()} facts")

## Step 12: Serialization

In [None]:
# Save the knowledge graph to a file
reduced_kg.save_to_file('civic_honors_kg.json')
print("Knowledge graph saved to civic_honors_kg.json")

# Save the entity graph
reducer.save_graph('entity_graph.json')
print("Entity graph saved to entity_graph.json")
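
Facts carry `datetime` values and a `ReliabilityRating` enum, neither of which the `json` module can encode by default. How `save_to_file` handles this is defined in the `KnowledgeGraph` class. The sketch below shows one common approach, using a `default` hook when writing and an explicit decode step when reading. The `Rating` enum, its members, and the reduced set of fields are stand-ins chosen for illustration.

```python
import datetime
import enum
import json


class Rating(enum.Enum):
    """Stand-in for the notebook's ReliabilityRating; members are assumptions."""
    UNVERIFIED = 1
    LIKELY_TRUE = 2


def encode_value(value):
    """json.dumps hook: turn datetimes into ISO strings and enums into their names."""
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    if isinstance(value, enum.Enum):
        return value.name
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def decode_fact(raw: dict) -> dict:
    """Restore the typed fields that encode_value flattened to strings."""
    fact = dict(raw)
    fact["date_recorded"] = datetime.datetime.fromisoformat(raw["date_recorded"])
    fact["reliability_rating"] = Rating[raw["reliability_rating"]]
    return fact


fact = {
    "fact_id": "fact_0",
    "fact_statement": "Example statement.",
    "date_recorded": datetime.datetime(2024, 1, 15, 9, 30),
    "reliability_rating": Rating.LIKELY_TRUE,
}
payload = json.dumps([fact], default=encode_value, indent=2)
restored = [decode_fact(item) for item in json.loads(payload)]
print(restored[0] == fact)
```

Checking that a loaded fact equals the original is a cheap way to confirm that a save and load pair is lossless before relying on the saved file.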

## Conclusion

This notebook demonstrated a knowledge graph for Civic Honors content with advanced link collection and page search. Instead of scraping only two predefined URLs, as the previous version did, it follows links from the start URLs to discover and incorporate additional relevant pages.

Key improvements include:

1. **Advanced Link Collection**: The LinkCollector class provides comprehensive link collection capabilities including URL normalization, depth control, domain filtering, robots.txt compliance, rate limiting, and metadata extraction.

2. **Intelligent Page Searching**: The PageSearcher class provides search and extraction capabilities including keyword, regex, CSS-selector, and semantic search, structured data extraction, entity recognition, and content classification.

3. **Enhanced Knowledge Reduction**: The EnhancedKnowledgeReduction class provides advanced techniques for reducing redundancy and improving the quality of the knowledge graph.

Together, these changes let the knowledge graph collect, search, and process information from more pages than the two start URLs, with the goal of a more comprehensive representation of Civic Honors content.