<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsKGv18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Summary

This notebook implements an end-to-end pipeline that extracts text from specified web sources, cleans it, and organizes it into a structured knowledge graph. The workflow consists of the following steps:

### Step 1: Install Necessary Libraries
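- Installation of the required Python libraries: `requests` for fetching web pages, `beautifulsoup4` for HTML parsing, `networkx` for graph-based operations, `spacy` for natural language processing, and `python-louvain` for community detection, along with the `en_core_web_md` spaCy model.
- Code to restart the runtime so that the newly installed packages are loaded.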

### Step 2: Import Libraries and Define Classes
- Incorporation of essential Python modules and the establishment of two foundational classes:
  - `ReliabilityRating`: An enumeration (Enum) for classifying the reliability of the extracted information.
  - `KnowledgeGraph`: A versatile class for constructing and managing a knowledge graph, encompassing functionalities for adding facts, computing and updating quality scores, and fact retrieval.

### Step 3: Scrape Data from Websites
- Automated retrieval and parsing of HTML content from two distinct websites, methodically extracting textual data from common HTML elements like paragraphs, headings, and lists.

### Step 4: Populate the KnowledgeGraph
- Initialization and population of the `KnowledgeGraph` instance.
- Systematic incorporation of each fact into the graph, complete with comprehensive attributes including statement, category, tags, and more.

### Step 5: Display Extracted Facts
- Presentation of an initial subset of facts from the knowledge graph, providing insights into the nature of the extracted data and a count of the total facts obtained.

### Step 6: Ensure Data Uniqueness
- A focused effort to eliminate duplicate facts from the knowledge graph based on their statements, ensuring the uniqueness and quality of the dataset.

### Step 7: Implement Advanced Data Cleaning
- Advanced data cleaning procedures, including the removal of facts shorter than 50 characters and deduplication of near-identical facts through string similarity comparison (`difflib.SequenceMatcher` with a 0.8 threshold).

### Step 8: Execute Super Aggressive Advanced Cleaning
- Application of the `en_core_web_md` spaCy model for sophisticated NLP operations.
- Removal of every fact whose spaCy vector similarity to any other fact exceeds 0.85, which prunes the dataset far more heavily than the string-based pass.

### Step 9: Serialize the KnowledgeGraph for Portability
- Serialization of the Knowledge Graph into JSON format for portability, facilitating storage, transfer, and reconstruction across different environments. The cell shown in this notebook performs serialization only; the deserialization functions and the sharding into 100-record files mentioned for this step are not included in it.

### Step 10: List JSON Files
- A utility step for listing available JSON files, ensuring accessibility and management of serialized data.

This pipeline transforms web-sourced text into a structured, cleaned knowledge graph dataset. It suits scenarios that demand well-structured, deduplicated data, and it uses string-based and NLP-based similarity checks to remove redundant facts.

# Step 1: Install Necessary Libraries & restart the runtime

In [None]:
# Install the necessary libraries
# Remove the unrelated `community` package first so that `import community` resolves to python-louvain
!pip uninstall community -y
!pip install requests beautifulsoup4 networkx spacy python-louvain
!python -m spacy download en_core_web_md

# One of the installs needs a runtime restart

import os
import IPython

# Path for the marker file
marker_path = '/content/runtime_restarted.txt'

# First check if the marker file exists
if os.path.exists(marker_path):
    print("Runtime restarted successfully. All dependencies should now be loaded.")
    # Remove the marker file to clean up
    os.remove(marker_path)
else:
    # Create the marker file
    with open(marker_path, 'w') as f:
        f.write('Runtime will be restarted.')

    # Restart the runtime
    print("Restarting the runtime to load all dependencies...")
    os.kill(os.getpid(), 9)  # Send SIGKILL signal to the current process

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 15.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
# Second check if the marker file exists

import os
import IPython

# Path for the marker file
marker_path = '/content/runtime_restarted.txt'

# Check if the marker file exists
if os.path.exists(marker_path):
    print("Runtime restarted successfully. All dependencies should now be loaded.")
    # Remove the marker file to clean up
    os.remove(marker_path)
else:
    # Create the marker file
    with open(marker_path, 'w') as f:
        f.write('Runtime will be restarted.')

    # Restart the runtime
    print("Restarting the runtime to load all dependencies...")
    os.kill(os.getpid(), 9)  # Send SIGKILL signal to the current process

Runtime restarted successfully. All dependencies should now be loaded.
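
The two cells above rely on a marker file to tell a first pass from a post-restart pass. The logic can be sketched without killing the process; `check_restart_marker` and the temporary path below are hypothetical names used only for illustration.

```python
import os
import tempfile

def check_restart_marker(marker_path):
    """Return 'ready' if the marker exists (and remove it); otherwise create it and return 'restart'."""
    if os.path.exists(marker_path):
        os.remove(marker_path)   # clean up so the next session starts fresh
        return "ready"
    with open(marker_path, "w") as f:
        f.write("Runtime will be restarted.")
    return "restart"             # the notebook calls os.kill(os.getpid(), 9) at this point

# Simulate two consecutive runs against a throwaway marker file
marker = os.path.join(tempfile.mkdtemp(), "runtime_restarted.txt")
first_pass = check_restart_marker(marker)
second_pass = check_restart_marker(marker)
print(first_pass, second_pass)
```

Because the marker lives on disk, it survives the process kill, which is what lets the second run detect that a restart already happened.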


# Step 2: Import Libraries and Define Classes

In [2]:
# Importing necessary libraries
import requests
import difflib
import spacy
import json
import os
import glob
import networkx as nx
import matplotlib.pyplot as plt
import community as community_louvain  # python-louvain library

# Running the proper from statements
from bs4 import BeautifulSoup
from datetime import datetime
from enum import Enum

# Define the classes
class ReliabilityRating(Enum):
    UNVERIFIED = 1
    POSSIBLY_TRUE = 2
    LIKELY_TRUE = 3
    VERIFIED = 4

class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def calculate_quality_score(self, reliability_rating, usage_count):
        # Adjusted to handle string representation of Enum
        rating_value = ReliabilityRating[reliability_rating].value if isinstance(reliability_rating, str) else reliability_rating.value
        base_score = 10 * rating_value
        usage_bonus = 2 * usage_count
        return base_score + usage_bonus

    def add_fact(self, fact_id, fact_statement, category, tags, date_recorded, last_updated,
                 reliability_rating, source_id, source_title, author_creator,
                 publication_date, url_reference, related_facts, contextual_notes,
                 access_level, usage_count):
        # Convert list and datetime objects to strings
        tags_str = ', '.join(tags) if tags else ''
        date_recorded_str = date_recorded.isoformat() if isinstance(date_recorded, datetime) else date_recorded
        last_updated_str = last_updated.isoformat() if isinstance(last_updated, datetime) else last_updated
        publication_date_str = publication_date.isoformat() if isinstance(publication_date, datetime) else publication_date

        quality_score = self.calculate_quality_score(reliability_rating, usage_count)
        self.graph.add_node(fact_id,
                            fact_statement=fact_statement,
                            category=category,
                            tags=tags_str,
                            date_recorded=date_recorded_str,
                            last_updated=last_updated_str,
                            reliability_rating=reliability_rating,
                            quality_score=quality_score,
                            source_id=source_id,
                            source_title=source_title,
                            author_creator=author_creator,
                            publication_date=publication_date_str,
                            url_reference=url_reference,
                            contextual_notes=contextual_notes,
                            access_level=access_level,
                            usage_count=usage_count)

        for related_fact_id in related_facts:
            self.graph.add_edge(fact_id, related_fact_id)

    def update_quality_score(self, fact_id):
        if fact_id not in self.graph:
            raise ValueError("Fact ID not found in the graph.")
        fact = self.graph.nodes[fact_id]
        new_score = self.calculate_quality_score(fact['reliability_rating'], fact['usage_count'])
        self.graph.nodes[fact_id]['quality_score'] = new_score

    def get_fact(self, fact_id):
        if fact_id not in self.graph:
            raise ValueError("Fact ID not found in the graph.")
        return self.graph.nodes[fact_id]

    def save_to_file(self, filename):
        facts_to_save = []
        for fact_id, fact_data in self.graph.nodes(data=True):
            facts_to_save.append(fact_data)

        with open(filename, 'w') as file:
            json.dump(facts_to_save, file, indent=4)
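
As a worked example of the scoring rule in `calculate_quality_score`, the sketch below recomputes the score outside the class. The standalone `quality_score` helper is a hypothetical name introduced here; the formula (10 times the rating value plus 2 times the usage count) is the one defined above.

```python
from enum import Enum

class ReliabilityRating(Enum):
    UNVERIFIED = 1
    POSSIBLY_TRUE = 2
    LIKELY_TRUE = 3
    VERIFIED = 4

def quality_score(reliability_rating, usage_count):
    # Accept either an Enum member or its member name as a string, as the class method does
    if isinstance(reliability_rating, str):
        reliability_rating = ReliabilityRating[reliability_rating]
    return 10 * reliability_rating.value + 2 * usage_count

score_likely = quality_score(ReliabilityRating.LIKELY_TRUE, 0)   # 10 * 3 + 2 * 0
score_verified = quality_score("VERIFIED", 5)                    # 10 * 4 + 2 * 5
print(score_likely, score_verified)
```

The string lookup expects member names such as `"LIKELY_TRUE"`. A display string such as `'Likely True'` is not a member name and would raise a `KeyError`.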

# Step 3: Scrape Data from the Websites

In [3]:
# Function to extract text from a soup object
def extract_text(element):
    return ' '.join(element.stripped_strings)

# Generic function to find facts in common HTML structures
def find_facts(soup):
    facts = []

    # Look for paragraphs
    for p in soup.find_all('p'):
        text = extract_text(p)
        if text: facts.append(text)

    # Look for headings
    for header_tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        for header in soup.find_all(header_tag):
            text = extract_text(header)
            if text: facts.append(text)

    # Look for list items
    for li in soup.find_all('li'):
        text = extract_text(li)
        if text: facts.append(text)

    return facts

# URLs of the websites to scrape
urls = ["https://civichonors.com/", "https://www.nelslindahl.com/"]

# List to hold all facts from both websites
all_facts = []

# Iterate through each URL, scrape its content, and extract facts
for url in urls:
    response = requests.get(url, timeout=10)  # avoid hanging indefinitely on an unresponsive host
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        facts = find_facts(soup)
        all_facts.extend(facts)
    else:
        print(f"Failed to retrieve content from {url}")

# all_facts now contains facts from both websites
print(f"Total facts extracted: {len(all_facts)}")

Total facts extracted: 327


# Step 4: Store Extracted Data in the KnowledgeGraph

In [4]:
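# Plain class of string constants (not an Enum); this redefinition replaces the
# ReliabilityRating Enum from Step 2
class ReliabilityRating:
    LIKELY_TRUE = 'Likely True'
    # ... other reliability ratings ...

# List-backed KnowledgeGraph; this redefinition replaces the networkx-based class from Step 2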
class KnowledgeGraph:
    def __init__(self):
        self.data = []

    def add_fact(self, fact_id, fact_statement, category, tags, date_recorded, last_updated,
                 reliability_rating, source_id, source_title, author_creator, publication_date,
                 url_reference, related_facts, contextual_notes, access_level, usage_count):
        self.data.append({
            'fact_id': fact_id,
            'fact_statement': fact_statement,
            'category': category,
            'tags': tags,
            'date_recorded': date_recorded,
            'last_updated': last_updated,
            'reliability_rating': reliability_rating,
            'source_id': source_id,
            'source_title': source_title,
            'author_creator': author_creator,
            'publication_date': publication_date,
            'url_reference': url_reference,
            'related_facts': related_facts,
            'contextual_notes': contextual_notes,
            'access_level': access_level,
            'usage_count': usage_count
        })

    def save_to_file(self, filename):
        with open(filename, 'w') as file:
            json.dump(self.data, file, default=str, indent=4)

# Function to extract text from a soup object
def extract_text(element):
    return ' '.join(element.stripped_strings)

# Generic function to find facts in common HTML structures
def find_facts(soup):
    facts = []
    for p in soup.find_all('p'):
        text = extract_text(p)
        if text: facts.append(text)
    for header_tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        for header in soup.find_all(header_tag):
            text = extract_text(header)
            if text: facts.append(text)
    for li in soup.find_all('li'):
        text = extract_text(li)
        if text: facts.append(text)
    return facts

# Function to scrape a website and return its BeautifulSoup object
def scrape_website(url):
    try:
        response = requests.get(url, timeout=10)  # avoid hanging indefinitely on an unresponsive host
        response.raise_for_status()
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        print(f"Error during requests to {url}: {e}")
        return None

# Initialize the KnowledgeGraph
kg = KnowledgeGraph()

# URLs of the websites to scrape
urls = {
    "CivicHonors": "https://civichonors.com/",
    "NelsLindahl": "https://www.nelslindahl.com/"
}

# Add facts from each website to the KnowledgeGraph
for source_id, url in urls.items():
    soup = scrape_website(url)
    if soup:
        facts = find_facts(soup)
        for i, fact in enumerate(facts):
            kg.add_fact(
                fact_id=f"{source_id}_{i}",
                fact_statement=fact,
                category="General",
                tags=[source_id, "WebScraped"],
                date_recorded=datetime.now(),
                last_updated=datetime.now(),
                reliability_rating=ReliabilityRating.LIKELY_TRUE,
                source_id=source_id,
                source_title=f"{source_id} Website",
                author_creator="Web Scraping",
                publication_date=datetime.now(),
                url_reference=url,
                related_facts=[],
                contextual_notes=f"Extracted from {source_id} website",
                access_level="Public",
                usage_count=0
            )

# Save the facts to a file
filename = 'knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")

Facts saved to knowledge_graph_facts.json


# Step 5: Retrieve and Display 10 Facts

In [5]:
# Print the total number of facts extracted
print(f"Total facts extracted: {len(kg.data)}")

# Display the first 10 facts, if available
for i in range(min(10, len(kg.data))):
    fact = kg.data[i]['fact_statement']  # Access the fact statement directly from the data list
    print(f"Fact {i+1}: {fact}")

Total facts extracted: 327
Fact 1: Civic Honors
Fact 2: Graduation with Civic Honors: Unlock the Power of  Community Opportunity
Fact 3: This book was published in 2006. It has a formal copyright. You can buy a physical copy if you want or just read it online here. I was going to update this to be an online PDF, but it seemed like a better idea to just make it a very large single page of prose. Enjoy!
Fact 4: Dr. Nels Lindahl, June 14, 2020, Denver, Colorado
Fact 5: Believing in a dream like graduation with civic honors is only the first step in the process toward advocating the creation of a graduation with civic honors program.
Fact 6: The story behind graduation with civic honors began in an academic setting during the spring semester of 2002 at the University of Kansas. I took a class from Dr. H. George Frederickson entitled Concepts of Civil Society. During the Concepts of Civil Society class, my collegiate interests focused on civic engagement. At one point during the class, Dr. 

# Step 6: Ensure Uniqueness of Facts in the Dataset

In [6]:
def remove_duplicate_facts(knowledge_graph):
    unique_facts = set()
    unique_data = []

    for fact_data in knowledge_graph.data:
        fact_statement = fact_data['fact_statement']
        if fact_statement not in unique_facts:
            unique_facts.add(fact_statement)
            unique_data.append(fact_data)

    # Replace the original data with the unique data
    knowledge_graph.data = unique_data

# Call the function to remove duplicate facts
remove_duplicate_facts(kg)

# Optional: Print the total number of unique facts remaining
print(f"Total unique facts remaining: {len(kg.data)}")

# Save the facts to a file
filename = 'unique_knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")


Total unique facts remaining: 326
Facts saved to unique_knowledge_graph_facts.json
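
Exact string matching treats statements that differ only in case or spacing as distinct. A minimal sketch of a normalized comparison key follows; `normalize` and `remove_duplicates_normalized` are hypothetical helpers that are not part of the notebook, and the sample facts are made up for illustration.

```python
def normalize(statement):
    # Collapse runs of whitespace and ignore case when comparing statements
    return " ".join(statement.split()).casefold()

def remove_duplicates_normalized(facts):
    seen = set()
    unique = []
    for fact in facts:
        key = normalize(fact["fact_statement"])
        if key not in seen:
            seen.add(key)
            unique.append(fact)   # keep the first occurrence, preserving order
    return unique

sample = [
    {"fact_id": "a_0", "fact_statement": "Civic Honors"},
    {"fact_id": "a_1", "fact_statement": "civic  honors"},
    {"fact_id": "a_2", "fact_statement": "Graduation with Civic Honors"},
]
deduped = remove_duplicates_normalized(sample)
print([f["fact_id"] for f in deduped])
```

Whether such normalization is desirable depends on the data; it trades a small risk of merging distinct statements for fewer trivial duplicates.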


# Step 7: Advanced Cleaning and Combining of Facts

In [7]:
from difflib import SequenceMatcher

def advanced_cleaning(knowledge_graph, similarity_threshold=0.8, short_fact_threshold=50):
    # Remove short facts
    knowledge_graph.data = [fact for fact in knowledge_graph.data if len(fact['fact_statement']) >= short_fact_threshold]

    # Remove similar facts
    unique_facts = []
    for fact in knowledge_graph.data:
        if not any(SequenceMatcher(None, f['fact_statement'], fact['fact_statement']).ratio() > similarity_threshold for f in unique_facts):
            unique_facts.append(fact)

    knowledge_graph.data = unique_facts

# Call the function for advanced cleaning
advanced_cleaning(kg)

# Optional: Print the total number of facts after cleaning
print(f"Total facts after advanced cleaning: {len(kg.data)}")

# Save the facts to a file
filename = 'advanced_knowledge_graph_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")

Total facts after advanced cleaning: 268
Facts saved to advanced_knowledge_graph_facts.json
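
`SequenceMatcher.ratio()` returns `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The sketch below shows how that ratio interacts with the 0.8 threshold used above; the sample sentences are chosen only for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

a = "Graduation with civic honors is a program."
b = "Graduation with civic honors is a program!"
c = "This book was published in 2006."

near_duplicate = similarity(a, b)   # the strings differ in one character out of 42
unrelated = similarity(a, c)

print(f"near duplicate: {near_duplicate:.3f}")
print(f"unrelated:      {unrelated:.3f}")
print(near_duplicate > 0.8, unrelated > 0.8)
```

Because the cleaning loop keeps the first fact of each similar group and drops later ones, the order of `kg.data` determines which wording survives.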


# Step 8: Super Aggressive Advanced Cleaning (Refined)

In [8]:
import spacy

# Load spaCy English model (make sure to have it installed)
nlp = spacy.load("en_core_web_md")

def super_aggressive_cleaning(knowledge_graph, similarity_threshold=0.85):
    processed_facts = []
    unique_facts = []

    # Pre-process each fact with spaCy
    for fact in knowledge_graph.data:
        doc = nlp(fact['fact_statement'])
        processed_facts.append((fact, doc))

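    # Compare each fact to all others for similarity.
    # A fact is kept only if it is not similar to ANY other fact, so both members
    # of a similar pair are dropped (unlike Step 7, which keeps the first one).
    for fact, doc in processed_facts:
        if not any(doc.similarity(other_doc) > similarity_threshold for _, other_doc in processed_facts if other_doc != doc):
            unique_facts.append(fact)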

    knowledge_graph.data = unique_facts

# Call the function for super aggressive cleaning
super_aggressive_cleaning(kg)

# Print the total number of facts after super aggressive cleaning
print(f"Total facts after super aggressive cleaning: {len(kg.data)}")

# Save the facts to a file
filename = 'remaining_facts.json'
kg.save_to_file(filename)
print(f"Facts saved to {filename}")


Total facts after super aggressive cleaning: 6
Facts saved to remaining_facts.json
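
The drop from 268 facts to 6 is consistent with the rule used in this pass: a fact is discarded whenever any other fact is similar to it, so neither member of a similar pair survives. The toy sketch below contrasts that rule with a keep-first alternative. It uses hand-made 2-D vectors and a plain cosine function in place of spaCy document vectors, which is an assumption made purely for illustration.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def drop_all_similar(vectors, threshold=0.85):
    # Mirrors the rule above: keep an item only if no OTHER item is similar to it
    return [name for name, vec in vectors.items()
            if not any(cosine(vec, other) > threshold
                       for other_name, other in vectors.items() if other_name != name)]

def keep_first_similar(vectors, threshold=0.85):
    # Alternative: keep an item unless it is similar to one that was already kept
    kept = []
    for name, vec in vectors.items():
        if not any(cosine(vec, vectors[k]) > threshold for k in kept):
            kept.append(name)
    return kept

# A and B point in nearly the same direction; C is orthogonal to A
toy = {"A": (1.0, 0.0), "B": (0.99, 0.1), "C": (0.0, 1.0)}
dropped_all = drop_all_similar(toy)
kept_first = keep_first_similar(toy)
print(dropped_all, kept_first)
```

With the drop-all rule only `C` remains, while the keep-first rule retains `A` as the representative of the similar pair.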


# Step 9: Serialization and deserialization of the KnowledgeGraph for portability

In [9]:
import json
import networkx as nx

class KnowledgeGraphPortable:
    def __init__(self, knowledge_graph):
        # Check if knowledge_graph is a networkx graph or a list-based structure
        if isinstance(knowledge_graph, (nx.Graph, nx.DiGraph, nx.MultiGraph, nx.MultiDiGraph)):
            self.graph = knowledge_graph
        elif hasattr(knowledge_graph, 'data') and isinstance(knowledge_graph.data, list):
            self.graph = self.convert_list_to_graph(knowledge_graph.data)
        else:
            raise ValueError("Unsupported knowledge_graph structure")

    def convert_list_to_graph(self, data_list):
        # Convert a list of facts to a networkx graph (if necessary)
        G = nx.DiGraph()
        for item in data_list:
            # Assuming each item has 'fact_id' and 'fact_statement'
            G.add_node(item['fact_id'], **item)
        return G

    def serialize_graph_to_json(self, output_file):
        # Convert graph to a dictionary or suitable structure
        try:
            graph_data = nx.node_link_data(self.graph)
            with open(output_file, 'w') as file:
                json.dump(graph_data, file, indent=4)
            print(f"Graph serialized to JSON. File saved as {output_file}")
            return True
        except Exception as e:
            print(f"Error in serializing graph to JSON: {e}")
            return False

# Usage example
kg_portable = KnowledgeGraphPortable(kg)
output_file = 'serialized_kg.json'  # Specify the filename for the serialized graph
result = kg_portable.serialize_graph_to_json(output_file)

if result:
    print("Serialization of Knowledge Graph completed successfully.")
else:
    print("Serialization of Knowledge Graph failed.")


Error in serializing graph to JSON: Object of type datetime is not JSON serializable
Serialization of Knowledge Graph failed.
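
The error above arises because the fact records still hold `datetime` objects, which the `json` module cannot encode on its own. One possible fix, sketched under assumptions, passes `default=str` (as `save_to_file` in Step 4 already does) and also splits the records into shards of 100 so they can be reloaded later. `save_shards`, `load_shards`, and the `kg_shard_*.json` naming are hypothetical, and the records are made-up stand-ins for `kg.data`.

```python
import glob
import json
import os
import tempfile
from datetime import datetime

def save_shards(records, directory, shard_size=100):
    """Write records to numbered JSON files holding at most shard_size records each."""
    paths = []
    for start in range(0, len(records), shard_size):
        path = os.path.join(directory, f"kg_shard_{start // shard_size:04d}.json")
        with open(path, "w") as f:
            # default=str converts datetime (and other non-JSON types) to strings
            json.dump(records[start:start + shard_size], f, default=str, indent=4)
        paths.append(path)
    return paths

def load_shards(directory):
    """Read every shard back, in file-name order, into one list of records."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, "kg_shard_*.json"))):
        with open(path) as f:
            records.extend(json.load(f))
    return records

# Made-up records standing in for kg.data
records = [{"fact_id": f"Demo_{i}", "fact_statement": f"Fact {i}",
            "date_recorded": datetime(2024, 1, 1)} for i in range(250)]

out_dir = tempfile.mkdtemp()
shard_paths = save_shards(records, out_dir)
restored = load_shards(out_dir)
print(len(shard_paths), len(restored), restored[0]["date_recorded"])
```

The same `default=str` argument can be passed to the `json.dump` call in `serialize_graph_to_json`. Dates come back as plain strings after reloading; `datetime.fromisoformat` can parse them if `datetime` objects are needed again.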


# Step 10: Print a list of JSON files

In [10]:
# Specify the directory to search for JSON files, '.' for current directory
directory_to_search = '.'

# List all JSON files in the specified directory
json_files = glob.glob(os.path.join(directory_to_search, '*.json'))

# Print the list of JSON files
print("List of JSON files created:")
for file in json_files:
    print(file)

List of JSON files created:
./knowledge_graph_facts.json
./remaining_facts.json
./serialized_kg.json
./unique_knowledge_graph_facts.json
./advanced_knowledge_graph_facts.json
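
`glob.glob` returns files in arbitrary order, as the listing above shows. A small variant, assuming `pathlib` is acceptable, sorts the names and reports file sizes; `list_json_files` is a hypothetical helper and the demo files are created only for this example.

```python
import tempfile
from pathlib import Path

def list_json_files(directory="."):
    """Return (name, size_in_bytes) for each .json file in directory, sorted by name."""
    return [(p.name, p.stat().st_size) for p in sorted(Path(directory).glob("*.json"))]

# Demonstrate on a throwaway directory so the example does not depend on earlier cells
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "b.json").write_text("[]")
(demo_dir / "a.json").write_text("{}")
(demo_dir / "notes.txt").write_text("ignored")

listing = list_json_files(demo_dir)
for name, size in listing:
    print(f"{name}\t{size} bytes")
```

Reporting sizes also helps spot files such as `serialized_kg.json`, which appears in the listing even though the write in Step 9 failed and may therefore be empty or truncated.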
