# CivicHonorsKGv19: Enhanced Knowledge Graph Reduction

This notebook is an improved version of CivicHonorsKGv18, featuring advanced knowledge reduction techniques.

## Overview

This notebook implements a knowledge graph for Civic Honors content with the following steps:

1. **Install and Import Libraries**: Set up the required dependencies
2. **Define Knowledge Graph Class**: Create a flexible KG structure
3. **Define Enhanced Knowledge Reduction Class**: Implement advanced reduction techniques
4. **Scrape Website Data**: Collect information from relevant websites
5. **Populate Knowledge Graph**: Extract and structure facts
6. **Retrieve and Display Facts**: View the extracted knowledge
7. **Ensure Uniqueness**: Remove duplicate facts
8. **Advanced Cleaning**: Apply string similarity (`difflib.SequenceMatcher`) for redundancy reduction
9. **Enhanced Knowledge Reduction**: Apply transformer-based models and hierarchical clustering
10. **Serialization**: Save and load the knowledge graph

The main improvement in this version is the enhanced knowledge reduction function that uses transformer models, hierarchical clustering, entity disambiguation, and fact importance scoring.
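
The transformer-based comparison mentioned above comes down to cosine similarity between embedding vectors. The following is a minimal sketch of that idea using only the standard library; the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine` is a hypothetical helper rather than a function from this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values; real models emit hundreds of dimensions)
fact_a = [0.9, 0.1, 0.0]
fact_b = [0.8, 0.2, 0.0]   # points in nearly the same direction as fact_a
fact_c = [0.0, 0.1, 0.9]   # points in a very different direction

SIMILARITY_THRESHOLD = 0.85  # same default value the reducer class uses

print(cosine(fact_a, fact_b) > SIMILARITY_THRESHOLD)  # True: treated as near-duplicates
print(cosine(fact_a, fact_c) > SIMILARITY_THRESHOLD)  # False: treated as distinct
```

Facts whose vectors point in almost the same direction score close to 1 and become candidates for merging; unrelated facts score close to 0.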

## Step 1: Install and Import Libraries

In [None]:
# Install required packages
!pip install requests beautifulsoup4 spacy sentence-transformers scikit-learn networkx
!python -m spacy download en_core_web_md

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import datetime
import enum
import networkx as nx
from difflib import SequenceMatcher
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from typing import List, Dict, Tuple, Any, Optional, Set, Union

## Step 2: Define Knowledge Graph Class

In [None]:
class ReliabilityRating(enum.Enum):
    UNKNOWN = "Unknown"
    LIKELY_FALSE = "Likely False"
    POSSIBLY_FALSE = "Possibly False"
    POSSIBLY_TRUE = "Possibly True"
    LIKELY_TRUE = "Likely True"
    VERIFIED_TRUE = "Verified True"

class KnowledgeGraph:
    def __init__(self):
        self.data = []
        
    def add_fact(self, 
                fact_id=None, 
                fact_statement=None, 
                category="General", 
                tags=None, 
                date_recorded=None, 
                last_updated=None, 
                reliability_rating=ReliabilityRating.UNKNOWN, 
                source_id=None, 
                source_title=None, 
                author_creator=None, 
                publication_date=None, 
                url_reference=None, 
                related_facts=None, 
                contextual_notes=None, 
                access_level="Public", 
                usage_count=0):
        
        if date_recorded is None:
            date_recorded = datetime.datetime.now()
        
        if last_updated is None:
            last_updated = datetime.datetime.now()
        
        if tags is None:
            tags = []
        
        if related_facts is None:
            related_facts = []
            
        fact = {
            "fact_id": fact_id,
            "fact_statement": fact_statement,
            "category": category,
            "tags": tags,
            "date_recorded": date_recorded,
            "last_updated": last_updated,
            "reliability_rating": reliability_rating,
            "source_id": source_id,
            "source_title": source_title,
            "author_creator": author_creator,
            "publication_date": publication_date,
            "url_reference": url_reference,
            "related_facts": related_facts,
            "contextual_notes": contextual_notes,
            "access_level": access_level,
            "usage_count": usage_count
        }
        
        self.data.append(fact)
        return fact
    
    def update_quality_score(self, fact_id, new_score):
        for fact in self.data:
            if fact["fact_id"] == fact_id:
                fact["quality_score"] = new_score
                fact["last_updated"] = datetime.datetime.now()
                return True
        return False
    
    def save_to_file(self, filename):
        with open(filename, 'w') as f:
            # Convert datetime objects to strings for JSON serialization
            serializable_data = []
            for fact in self.data:
                fact_copy = fact.copy()
                if isinstance(fact_copy["date_recorded"], datetime.datetime):
                    fact_copy["date_recorded"] = fact_copy["date_recorded"].isoformat()
                if isinstance(fact_copy["last_updated"], datetime.datetime):
                    fact_copy["last_updated"] = fact_copy["last_updated"].isoformat()
                if isinstance(fact_copy["publication_date"], datetime.datetime):
                    fact_copy["publication_date"] = fact_copy["publication_date"].isoformat()
                if isinstance(fact_copy["reliability_rating"], ReliabilityRating):
                    fact_copy["reliability_rating"] = fact_copy["reliability_rating"].value
                serializable_data.append(fact_copy)
            
            json.dump(serializable_data, f, indent=2)
    
    def load_from_file(self, filename):
        with open(filename, 'r') as f:
            self.data = json.load(f)
            
            # Convert string dates back to datetime objects
            for fact in self.data:
                for key in ("date_recorded", "last_updated", "publication_date"):
                    if isinstance(fact.get(key), str):
                        try:
                            fact[key] = datetime.datetime.fromisoformat(fact[key])
                        except ValueError:
                            pass  # leave unparseable date strings untouched
                # Convert the stored rating label back to the enum
                if isinstance(fact.get("reliability_rating"), str):
                    try:
                        fact["reliability_rating"] = ReliabilityRating(fact["reliability_rating"])
                    except ValueError:
                        fact["reliability_rating"] = ReliabilityRating.UNKNOWN
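
The save and load methods above convert `datetime` and enum values by hand. As an alternative sketch, the same round trip can be expressed with the `default=` hook of `json.dumps`, which is called for any object the encoder cannot handle natively. `Rating` and `encode_special` are hypothetical names used only for this illustration, and the sample fact is made up.

```python
import datetime
import enum
import json

class Rating(enum.Enum):  # hypothetical stand-in for ReliabilityRating
    UNKNOWN = "Unknown"
    LIKELY_TRUE = "Likely True"

def encode_special(value):
    """Called by json.dumps for any object it cannot serialize natively."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, enum.Enum):
        return value.value
    raise TypeError(f"Cannot serialize {type(value).__name__}")

fact = {
    "fact_id": "demo_0",
    "date_recorded": datetime.datetime(2024, 1, 15, 9, 30),
    "reliability_rating": Rating.LIKELY_TRUE,
}

# Serialize, then restore the two special fields by hand
text = json.dumps(fact, default=encode_special)
restored = json.loads(text)
restored["date_recorded"] = datetime.datetime.fromisoformat(restored["date_recorded"])
restored["reliability_rating"] = Rating(restored["reliability_rating"])

print(restored == fact)  # True
```

The hook keeps the conversion rules in one place, so a new date field would not need its own `isinstance` branch when saving.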

## Step 3: Define Enhanced Knowledge Reduction Class

In [None]:
class EnhancedKnowledgeReduce:
    def __init__(self, 
                 transformer_model='all-MiniLM-L6-v2',
                 spacy_model='en_core_web_md',
                 similarity_threshold=0.85,
                 short_fact_threshold=50,
                 importance_weight_semantic=0.4,
                 importance_weight_centrality=0.3,
                 importance_weight_length=0.3):
        """
        Initialize the EnhancedKnowledgeReduce class with configurable parameters.
        
        Args:
            transformer_model (str): The sentence transformer model to use for embeddings
            spacy_model (str): The spaCy model to use for NLP processing
            similarity_threshold (float): Threshold for considering facts as similar
            short_fact_threshold (int): Facts whose statement has this many characters or fewer are removed
            importance_weight_semantic (float): Weight for semantic richness in importance scoring
            importance_weight_centrality (float): Weight for centrality in importance scoring
            importance_weight_length (float): Weight for fact length in importance scoring
        """
        self.transformer = SentenceTransformer(transformer_model)
        self.nlp = spacy.load(spacy_model)
        self.similarity_threshold = similarity_threshold
        self.short_fact_threshold = short_fact_threshold
        self.importance_weights = {
            'semantic': importance_weight_semantic,
            'centrality': importance_weight_centrality,
            'length': importance_weight_length
        }
    
    def remove_short_facts(self, knowledge_graph: Dict) -> Dict:
        """
        Remove facts that are too short to be meaningful.
        
        Args:
            knowledge_graph (Dict): The knowledge graph to process
            
        Returns:
            Dict: The knowledge graph with short facts removed
        """
        knowledge_graph['data'] = [fact for fact in knowledge_graph['data'] 
                                  if len(fact['fact_statement']) > self.short_fact_threshold]
        return knowledge_graph
    
    def compute_embeddings(self, knowledge_graph: Dict) -> Tuple[Dict, np.ndarray]:
        """
        Compute embeddings for all facts in the knowledge graph.
        
        Args:
            knowledge_graph (Dict): The knowledge graph to process
            
        Returns:
            Tuple[Dict, np.ndarray]: The updated knowledge graph and the embeddings matrix
        """
        # Extract fact statements
        fact_statements = [fact['fact_statement'] for fact in knowledge_graph['data']]
        
        # Compute embeddings using transformer model
        embeddings = self.transformer.encode(fact_statements)
        
        # Store embeddings in the knowledge graph
        for i, fact in enumerate(knowledge_graph['data']):
            fact['embedding'] = embeddings[i].tolist()
            
        return knowledge_graph, embeddings
    
    def build_similarity_matrix(self, embeddings: np.ndarray) -> np.ndarray:
        """
        Build a similarity matrix based on cosine similarity of embeddings.
        
        Args:
            embeddings (np.ndarray): The embeddings matrix
            
        Returns:
            np.ndarray: The similarity matrix
        """
        return cosine_similarity(embeddings)
    
    def hierarchical_clustering(self, 
                               similarity_matrix: np.ndarray, 
                               distance_threshold: float = 0.3) -> List[int]:
        """
        Perform hierarchical clustering on the similarity matrix.
        Compatible with different scikit-learn versions.
        
        Args:
            similarity_matrix (np.ndarray): The similarity matrix
            distance_threshold (float): The distance threshold for clustering
            
        Returns:
            List[int]: The cluster assignments for each fact
        """
        # Convert similarity to distance; clip tiny negatives from floating-point error
        distance_matrix = np.clip(1 - similarity_matrix, 0, None)
        
        # Settings shared by every scikit-learn version
        common = dict(
            n_clusters=None,
            distance_threshold=distance_threshold,
            linkage='average'
        )
        
        try:
            # scikit-learn 1.2+ names the distance parameter `metric`
            clustering = AgglomerativeClustering(metric='precomputed', **common)
        except TypeError:
            # Older releases call the same parameter `affinity`
            clustering = AgglomerativeClustering(affinity='precomputed', **common)
        
        # Always cluster on the precomputed distances; dropping the parameter would
        # make scikit-learn treat each row of the matrix as a Euclidean feature vector
        return clustering.fit_predict(distance_matrix)
    
    def calculate_fact_importance(self, 
                                 knowledge_graph: Dict, 
                                 embeddings: np.ndarray,
                                 similarity_matrix: np.ndarray) -> Dict:
        """
        Calculate importance scores for facts based on multiple factors.
        
        Args:
            knowledge_graph (Dict): The knowledge graph to process
            embeddings (np.ndarray): The embeddings matrix
            similarity_matrix (np.ndarray): The similarity matrix
            
        Returns:
            Dict: The knowledge graph with importance scores added
        """
        # Create a graph for centrality calculation
        G = nx.Graph()
        for i in range(len(knowledge_graph['data'])):
            G.add_node(i)
            
        # Add edges based on similarity
        for i in range(len(similarity_matrix)):
            for j in range(i+1, len(similarity_matrix)):
                if similarity_matrix[i, j] > 0.5:  # Only connect if somewhat similar
                    G.add_edge(i, j, weight=similarity_matrix[i, j])
        
        # Calculate centrality measures
        centrality = nx.degree_centrality(G)
        
        # Calculate semantic richness (distinct entities plus distinct noun chunks)
        semantic_richness = []
        for fact in knowledge_graph['data']:
            doc = self.nlp(fact['fact_statement'])
            entities = {ent.text for ent in doc.ents}
            noun_chunks = {chunk.text for chunk in doc.noun_chunks}
            semantic_richness.append(len(entities) + len(noun_chunks))
        
        # Scale richness to [0, 1] so its weight is comparable with the other two factors
        max_richness = max(semantic_richness, default=0) or 1
        semantic_scores = [richness / max_richness for richness in semantic_richness]
        
        # Calculate length scores (normalized)
        lengths = [len(fact['fact_statement']) for fact in knowledge_graph['data']]
        max_length = max(lengths, default=0) or 1  # Avoid division by zero
        length_scores = [length / max_length for length in lengths]
        
        # Calculate combined importance score as a weighted sum of the three factors
        for i, fact in enumerate(knowledge_graph['data']):
            importance = (
                self.importance_weights['semantic'] * semantic_scores[i] +
                self.importance_weights['centrality'] * centrality.get(i, 0) +
                self.importance_weights['length'] * length_scores[i]
            )
            fact['importance_score'] = importance
            
        return knowledge_graph
    
    def select_representative_facts(self, 
                                   knowledge_graph: Dict, 
                                   cluster_assignments: List[int]) -> KnowledgeGraph:
        """
        Select the most important fact from each cluster.
        
        Args:
            knowledge_graph (Dict): The knowledge graph to process
            cluster_assignments (List[int]): The cluster assignments for each fact
            
        Returns:
            KnowledgeGraph: A new knowledge graph with only the representative facts
        """
        # Group facts by cluster
        clusters = {}
        for i, cluster_id in enumerate(cluster_assignments):
            if cluster_id not in clusters:
                clusters[cluster_id] = []
            clusters[cluster_id].append(i)
        
        # Select the most important fact from each cluster
        selected_facts = []
        for cluster_id, fact_indices in clusters.items():
            # Find the fact with the highest importance score in this cluster
            best_fact_idx = max(fact_indices, 
                               key=lambda idx: knowledge_graph['data'][idx].get('importance_score', 0))
            selected_facts.append(knowledge_graph['data'][best_fact_idx])
        
        # Create a new knowledge graph with only the selected facts
        reduced_kg = KnowledgeGraph()
        reduced_kg.data = selected_facts
        
        return reduced_kg
    
    def entity_disambiguation(self, knowledge_graph: Dict) -> Dict:
        """
        Identify when different terms refer to the same entity.
        
        Args:
            knowledge_graph (Dict): The knowledge graph to process
            
        Returns:
            Dict: The knowledge graph with entity disambiguation information added
        """
        # Extract all entities from facts
        all_entities = {}
        for fact in knowledge_graph['data']:
            doc = self.nlp(fact['fact_statement'])
            for ent in doc.ents:
                if ent.text not in all_entities:
                    all_entities[ent.text] = []
                all_entities[ent.text].append(ent)
                
        # Find similar entity names
        entity_clusters = {}
        entity_names = list(all_entities.keys())
        
        for i, name1 in enumerate(entity_names):
            if name1 not in entity_clusters:
                entity_clusters[name1] = [name1]
                
            for name2 in entity_names[i+1:]:
                # Check string similarity
                similarity = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
                
                # Check embedding similarity for more accuracy
                if similarity > 0.7:  # Initial string filter
                    emb1 = self.transformer.encode([name1])[0]
                    emb2 = self.transformer.encode([name2])[0]
                    cos_sim = cosine_similarity([emb1], [emb2])[0][0]
                    
                    if cos_sim > 0.85:  # High semantic similarity
                        entity_clusters[name1].append(name2)
                        entity_clusters[name2] = entity_clusters[name1]
        
        # Add entity disambiguation information to the knowledge graph
        for fact in knowledge_graph['data']:
            fact['disambiguated_entities'] = {}
            doc = self.nlp(fact['fact_statement'])
            for ent in doc.ents:
                if ent.text in entity_clusters:
                    fact['disambiguated_entities'][ent.text] = entity_clusters[ent.text]
                    
        return knowledge_graph
    
    def reduce_knowledge_graph(self, knowledge_graph: KnowledgeGraph) -> KnowledgeGraph:
        """
        Apply the enhanced knowledge reduction process to a knowledge graph.
        
        Args:
            knowledge_graph (KnowledgeGraph): The knowledge graph to reduce
            
        Returns:
            KnowledgeGraph: The reduced knowledge graph
        """
        # Convert KnowledgeGraph to dictionary for processing
        kg_dict = {'data': knowledge_graph.data}
        
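        # Step 1: Remove short facts
        kg_dict = self.remove_short_facts(kg_dict)
        
        # Clustering needs at least two facts; otherwise return what is left as-is
        if len(kg_dict['data']) < 2:
            reduced_kg = KnowledgeGraph()
            reduced_kg.data = kg_dict['data']
            return reduced_kg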
        
        # Step 2: Compute embeddings
        kg_dict, embeddings = self.compute_embeddings(kg_dict)
        
        # Step 3: Build similarity matrix
        similarity_matrix = self.build_similarity_matrix(embeddings)
        
        # Step 4: Perform entity disambiguation
        kg_dict = self.entity_disambiguation(kg_dict)
        
        # Step 5: Calculate fact importance
        kg_dict = self.calculate_fact_importance(kg_dict, embeddings, similarity_matrix)
        
        # Step 6: Perform hierarchical clustering
        cluster_assignments = self.hierarchical_clustering(
            similarity_matrix, 
            distance_threshold=1 - self.similarity_threshold
        )
        
        # Step 7: Select representative facts
        reduced_kg = self.select_representative_facts(kg_dict, cluster_assignments)
        
        return reduced_kg
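
To make the clustering and selection steps concrete, here is a minimal pure-Python sketch of average-linkage clustering with a distance threshold, followed by picking the highest-scoring member of each cluster. It is a simplified stand-in for illustration, not scikit-learn's implementation; `average_linkage_clusters`, the similarity values, and the importance scores are all hypothetical.

```python
def average_linkage_clusters(distance, threshold):
    """Merge clusters until the closest pair is `threshold` or more apart.

    distance: square, symmetric list of lists holding pairwise distances.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(distance))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters
        gap, x, y = min(
            ((linkage(clusters[x], clusters[y]), x, y)
             for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if gap >= threshold:
            break  # remaining clusters are too far apart to merge
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(clusters)

# Made-up cosine similarities for three facts: 0 and 1 are near-duplicates
similarity = [
    [1.00, 0.92, 0.10],
    [0.92, 1.00, 0.15],
    [0.10, 0.15, 1.00],
]
distance = [[1 - s for s in row] for row in similarity]
importance = [0.4, 0.7, 0.5]  # hypothetical importance scores

# A similarity threshold of 0.85 becomes a distance threshold of 0.15
clusters = average_linkage_clusters(distance, threshold=1 - 0.85)
representatives = [max(c, key=lambda i: importance[i]) for c in clusters]

print(clusters)         # [[0, 1], [2]]
print(representatives)  # [1, 2]
```

Facts 0 and 1 merge because their distance (about 0.08) is below the threshold, and fact 1 represents the pair because it has the higher importance score. Fact 2 stays in its own cluster.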

## Step 4: Scrape Website Data

In [None]:
# Function to extract text from HTML elements
def extract_text(element):
    if element is None:
        return ""
    return element.get_text().strip()

# Function to find facts in HTML content
def find_facts(soup):
    facts = []
    
    # Extract text from paragraphs
    for p in soup.find_all('p'):
        text = extract_text(p)
        if text:
            facts.append(text)
    
    # Extract text from headers
    for header_tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        for header in soup.find_all(header_tag):
            text = extract_text(header)
            if text:
                facts.append(text)
    
    # Extract text from list items
    for li in soup.find_all('li'):
        text = extract_text(li)
        if text:
            facts.append(text)
    
    return facts

# Function to scrape a website and return its BeautifulSoup object
def scrape_website(url):
    try:
        response = requests.get(url, timeout=10)  # fail fast instead of hanging on a slow server
        response.raise_for_status()
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        print(f"Error during requests to {url}: {e}")
        return None
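
The extraction logic can be tried without network access or third-party packages. The sketch below swaps BeautifulSoup for the standard library's `html.parser` purely for illustration; `FactTextParser` and the sample markup are hypothetical, it assumes well-formed HTML with explicit closing tags, and it emits text in document order rather than grouped by tag type as `find_facts` does.

```python
from html.parser import HTMLParser

FACT_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

class FactTextParser(HTMLParser):
    """Collect the stripped text of every paragraph, header, and list item."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._open = []  # stack of (tag, text chunks) for fact tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in FACT_TAGS:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every enclosing fact tag, not just the innermost one
        for _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, chunks = self._open.pop()
            text = "".join(chunks).strip()
            if text:
                self.facts.append(text)

sample_html = "<h1>Title</h1><p>First fact.</p><ul><li><p>Nested fact.</p></li></ul>"
parser = FactTextParser()
parser.feed(sample_html)
print(parser.facts)  # ['Title', 'First fact.', 'Nested fact.', 'Nested fact.']
```

The nested paragraph is reported twice, once for the `p` and once for the enclosing `li`. Collecting by tag type produces the same kind of repeat, which is one reason the later de-duplication step is needed.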

## Step 5: Populate Knowledge Graph

In [None]:
# Initialize the KnowledgeGraph
kg = KnowledgeGraph()

# URLs of the websites to scrape
urls = {
    "CivicHonors": "https://civichonors.com/",
    "NelsLindahl": "https://www.nelslindahl.com/"
}

# Add facts from each website to the KnowledgeGraph
for source_id, url in urls.items():
    soup = scrape_website(url)
    if soup:
        facts = find_facts(soup)
        for i, fact in enumerate(facts):
            kg.add_fact(
                fact_id=f"{source_id}_{i}",
                fact_statement=fact,
                category="General",
                tags=[source_id, "WebScraped"],
                date_recorded=datetime.datetime.now(),
                last_updated=datetime.datetime.now(),
                reliability_rating=ReliabilityRating.LIKELY_TRUE,
                source_id=source_id,
                source_title=f"{source_id} Website",
                author_creator="Web Scraping",
                publication_date=datetime.datetime.now(),
                url_reference=url,
                related_facts=[],
                contextual_notes=f"Extracted from {source_id} website",
                access_level="Public",
                usage_count=0
            )

# Save the facts to a file
filename = 'knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")

## Step 6: Retrieve and Display 10 Facts

In [None]:
# Print the total number of facts extracted
print(f"Total facts extracted: {len(kg.data)}")

# Display the first 10 facts, if available
for i in range(min(10, len(kg.data))):
    fact = kg.data[i]['fact_statement']  # Access the fact statement directly from the data list
    print(f"Fact {i+1}: {fact}")

## Step 7: Ensure Uniqueness of Facts in the Dataset

In [None]:
def remove_duplicate_facts(knowledge_graph):
    unique_facts = set()
    unique_data = []
    
    for fact_data in knowledge_graph.data:
        fact_statement = fact_data['fact_statement']
        if fact_statement not in unique_facts:
            unique_facts.add(fact_statement)
            unique_data.append(fact_data)
    
    # Replace the original data with the unique data
    knowledge_graph.data = unique_data
    
    return knowledge_graph

# Call the function to remove duplicate facts
remove_duplicate_facts(kg)

# Optional: Print the total number of unique facts remaining
print(f"Total unique facts remaining: {len(kg.data)}")

# Save the facts to a file
filename = 'unique_knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")
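
Exact string matching treats statements that differ only in case or spacing as different facts. A possible refinement, sketched here under the assumption that such differences are not meaningful, is to compare a normalized key instead. `normalize`, `dedupe_normalized`, and the sample facts are hypothetical.

```python
import re

def normalize(statement):
    """Collapse whitespace and ignore case so trivially different copies match."""
    return re.sub(r"\s+", " ", statement).strip().casefold()

def dedupe_normalized(facts):
    """Keep the first fact seen for each normalized statement."""
    seen = set()
    unique = []
    for fact in facts:
        key = normalize(fact["fact_statement"])
        if key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique

# Made-up facts: the first two differ only in case and spacing
facts = [
    {"fact_id": "a_0", "fact_statement": "Civic Honors is a community program."},
    {"fact_id": "a_1", "fact_statement": "civic honors  is a community program. "},
    {"fact_id": "b_0", "fact_statement": "Volunteers log their service hours."},
]

print([f["fact_id"] for f in dedupe_normalized(facts)])  # ['a_0', 'b_0']
```

The original statement text is kept on the surviving fact; only the comparison key is normalized.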

## Step 8: Advanced Cleaning and Combining of Facts

In [None]:
from difflib import SequenceMatcher

def advanced_cleaning(knowledge_graph, similarity_threshold=0.8, short_fact_threshold=50):
    # Remove short facts
    knowledge_graph.data = [fact for fact in knowledge_graph.data if len(fact['fact_statement']) > short_fact_threshold]
    
    # Remove similar facts
    unique_facts = []
    for fact in knowledge_graph.data:
        if not any(SequenceMatcher(None, f['fact_statement'], fact['fact_statement']).ratio() > similarity_threshold for f in unique_facts):
            unique_facts.append(fact)
    
    knowledge_graph.data = unique_facts
    
    return knowledge_graph

# Call the function for advanced cleaning
advanced_cleaning(kg)

# Optional: Print the total number of facts after advanced cleaning
print(f"Total facts after advanced cleaning: {len(kg.data)}")

# Save the facts to a file
filename = 'advanced_knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")
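
For a sense of what the 0.8 threshold means, the sketch below runs the same greedy filter on three made-up statements. `SequenceMatcher.ratio()` is a character-level measure (twice the number of matching characters divided by the combined length), so it catches near-identical wording but not paraphrases.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical statements: the first two differ only in the final character
statements = [
    "Civic Honors recognizes community volunteers.",
    "Civic Honors recognizes community volunteers!",
    "The program began as a graduate research project.",
]

kept = []
for statement in statements:
    # Keep a statement only if it is not too close to one already kept
    if not any(ratio(k, statement) > 0.8 for k in kept):
        kept.append(statement)

print(len(kept))  # 2
```

Every candidate is compared against every fact already kept, so the cost grows roughly quadratically with the number of facts.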

## Step 9: Enhanced Knowledge Reduction (Refined)

In [None]:
# Create an instance of the EnhancedKnowledgeReduce class
reducer = EnhancedKnowledgeReduce(
    transformer_model='all-MiniLM-L6-v2',
    similarity_threshold=0.85,
    short_fact_threshold=50
)

# Apply the enhanced knowledge reduction
reduced_kg = reducer.reduce_knowledge_graph(kg)

# Print the total number of facts after enhanced knowledge reduction
print(f"Total facts after enhanced knowledge reduction: {len(reduced_kg.data)}")

# Save the facts to a file
filename = 'reduced_facts.json'
reduced_kg.save_to_file(filename)
print(f"Facts saved to {filename}")
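
The importance score that decides which fact represents a cluster is a weighted sum of three factors. The worked example below uses the default weights from the reducer class and assumes each component has already been scaled to the range 0 to 1; the component values themselves are invented for illustration.

```python
WEIGHTS = {"semantic": 0.4, "centrality": 0.3, "length": 0.3}  # class defaults

def importance(semantic, centrality, length):
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    return (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["centrality"] * centrality
        + WEIGHTS["length"] * length
    )

# Hypothetical component scores for two facts in the same cluster
fact_a = importance(semantic=0.5, centrality=1.0, length=0.6)
fact_b = importance(semantic=1.0, centrality=0.5, length=1.0)

print(round(fact_a, 2), round(fact_b, 2))  # 0.68 0.85
```

Here the second fact would be kept: its richer content and greater length outweigh the first fact's higher centrality under these weights.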

## Step 10: Serialization and deserialization of the KnowledgeGraph for portability

In [None]:
import json
import networkx as nx

class KnowledgeGraphPortable:
    def __init__(self, knowledge_graph):
        # Check if knowledge graph is a networkx graph or a list-based structure
        if isinstance(knowledge_graph, (nx.Graph, nx.DiGraph, nx.MultiDiGraph)):
            self.graph = knowledge_graph
        elif hasattr(knowledge_graph, 'data') and isinstance(knowledge_graph.data, list):
            self.graph = self.convert_list_to_graph(knowledge_graph.data)
        else:
            raise ValueError("Unsupported knowledge graph format")
    
    def convert_list_to_graph(self, data):
        G = nx.DiGraph()
        
        # Add nodes for each fact
        for fact in data:
            G.add_node(fact['fact_id'], **fact)
        
        # Add edges for related facts
        for fact in data:
            for related_fact_id in fact.get('related_facts', []):
                if G.has_node(related_fact_id):
                    G.add_edge(fact['fact_id'], related_fact_id, relation_type='related')
        
        return G
    
    def to_json(self):
        # Convert networkx graph to JSON serializable format
        # Explicitly set edges="links" to avoid future compatibility warnings
        data = nx.node_link_data(self.graph, edges="links")
        
        # Handle datetime objects
        for node in data['nodes']:
            for key, value in node.items():
                if isinstance(value, datetime.datetime):
                    node[key] = value.isoformat()
                elif isinstance(value, enum.Enum):
                    node[key] = value.value
        
        return json.dumps(data, indent=2)
    
    def from_json(self, json_str):
        # Convert JSON string back to networkx graph
        # Explicitly set edges="links" to avoid future compatibility warnings
        data = json.loads(json_str)
        self.graph = nx.node_link_graph(data, edges="links")
        
        # Handle datetime objects
        for node_id in self.graph.nodes:
            node = self.graph.nodes[node_id]
            for key, value in list(node.items()):
                if key in ['date_recorded', 'last_updated', 'publication_date'] and isinstance(value, str):
                    try:
                        node[key] = datetime.datetime.fromisoformat(value)
                    except ValueError:
                        pass  # leave unparseable date strings untouched
                elif key == 'reliability_rating' and isinstance(value, str):
                    try:
                        node[key] = ReliabilityRating(value)
                    except ValueError:
                        node[key] = ReliabilityRating.UNKNOWN
        
        return self.graph
    
    def to_knowledge_graph(self):
        # Convert networkx graph back to KnowledgeGraph object
        kg = KnowledgeGraph()
        
        # Extract node data
        for node_id in self.graph.nodes:
            node_data = self.graph.nodes[node_id].copy()
            
            # Get related facts from edges
            related_facts = [target for source, target in self.graph.out_edges(node_id)]
            node_data['related_facts'] = related_facts
            
            # Add to knowledge graph
            kg.data.append(node_data)
        
        return kg

# Example usage
portable_kg = KnowledgeGraphPortable(reduced_kg)
json_data = portable_kg.to_json()

# Save to file
with open('portable_knowledge_graph.json', 'w') as f:
    f.write(json_data)
print("Portable knowledge graph saved to portable_knowledge_graph.json")

# Example of loading back
new_portable_kg = KnowledgeGraphPortable(KnowledgeGraph())
with open('portable_knowledge_graph.json', 'r') as f:
    loaded_graph = new_portable_kg.from_json(f.read())
    
# Convert back to KnowledgeGraph
loaded_kg = new_portable_kg.to_knowledge_graph()
print(f"Loaded knowledge graph has {len(loaded_kg.data)} facts")
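
The portable file uses the node-link layout: a list of node records plus a list of links between them. The sketch below hand-writes a small document in that shape, without networkx, to show how each fact's `related_facts` list can be rebuilt from the outgoing links. The facts and the helper `related_facts_by_node` are hypothetical.

```python
import json

# Hand-written document in the node-link shape (made-up facts)
document = {
    "directed": True,
    "multigraph": False,
    "graph": {},
    "nodes": [
        {"id": "demo_0", "fact_statement": "First fact."},
        {"id": "demo_1", "fact_statement": "Second fact."},
    ],
    "links": [
        {"source": "demo_0", "target": "demo_1", "relation_type": "related"},
    ],
}

def related_facts_by_node(doc):
    """Rebuild each node's related_facts list from the outgoing links."""
    related = {node["id"]: [] for node in doc["nodes"]}
    for link in doc["links"]:
        related[link["source"]].append(link["target"])
    return related

# The layout is plain JSON, so it survives a dump/load round trip unchanged
round_tripped = json.loads(json.dumps(document))
print(related_facts_by_node(round_tripped))  # {'demo_0': ['demo_1'], 'demo_1': []}
```

Because relationships live in the links list rather than inside each node, any tool that reads JSON can traverse the graph without knowing the `KnowledgeGraph` class.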

## Summary and Conclusion

This notebook has demonstrated an enhanced approach to knowledge graph construction and reduction for Civic Honors content. The key improvements include:

1. **Transformer-Based Semantic Similarity**: Using sentence transformers to better capture semantic relationships between facts
2. **Hierarchical Clustering**: Grouping similar facts together and selecting representatives
3. **Entity Disambiguation**: Identifying when different terms refer to the same entity
4. **Fact Importance Scoring**: Using multiple factors to determine which facts are most valuable
5. **Fact Embeddings**: Storing a sentence-transformer vector alongside each fact so that similarity can be computed numerically

Together, these steps collapse each cluster of near-duplicate facts to its highest-scoring member, which reduces redundancy while keeping the facts that the importance score ranks highest. Because similarity is computed on sentence embeddings rather than raw characters, the reduction in Step 9 can also catch paraphrases that the string matching in Step 8 misses.

The final knowledge graph contains a concise set of high-quality facts that capture the essential information about Civic Honors, making it more useful for applications like question answering, recommendation systems, and information retrieval.