<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsKGv9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Summary

This notebook is a Python script for Google Colab that builds a knowledge graph from data scraped from two websites. The steps and key functions are summarized below:

1. **Install Necessary Libraries**:
   - The script begins by installing required Python libraries: `networkx` for graph-related operations, `beautifulsoup4` and `lxml` for web scraping, and `requests` for handling HTTP requests.

2. **Import Libraries and Setup Logging**:
   - The required libraries are imported, and Python's `logging` module is configured to show messages at `INFO` level and above in a short `LEVEL:message` format.

3. **Web Scraping Functions**:
   - Two functions, `scrape_website_for_couplets` and `scrape_another_website_for_couplets`, are defined, one for each site: "https://civichonors.com/" and "https://www.nelslindahl.com/". Both currently look for `<article>` elements; the second carries a placeholder comment marking where a site-specific selector should go.
   - Each function takes the text of the first `<h2>` and the first `<p>` in every article and stores the pair as a "couplet" (a title with its summary or description). If a request or parse fails, the error is logged and an empty list is returned.

4. **Create the Knowledge Graph**:
   - The `create_knowledge_graph_from_couplets` function builds an undirected graph with the `networkx` library. Each couplet contributes two nodes (an entity and its attribute) and an edge between them; identical strings map to the same node.

5. **Export the Knowledge Graph**:
   - The `export_graph` function exports the constructed graph into GraphML and GEXF formats, ensuring the graph's portability and compatibility with various graph analysis tools.

6. **Evaluate the Knowledge Graph**:
   - `evaluate_knowledge_graph` reports the number of nodes and edges, the number of isolated nodes, and whether the graph is connected. If the graph is not connected, it reports the size of the largest connected component instead.

7. **Integrate Data from Another Website**:
   - Couplets scraped from "https://www.nelslindahl.com/" are added to the existing graph with `integrate_new_data`, which adds any missing nodes and one edge per couplet.

8. **Execution Flow**:
   - The cells run in this order: scrape the first website, create and evaluate the initial graph, scrape the second website, integrate the new couplets, re-evaluate the graph, and export the updated graph in both formats.

The notebook shows how to build a knowledge graph by extracting and structuring data from multiple web sources. It covers web scraping, graph construction, data integration, and basic structural analysis, and can serve as a template for similar data processing and knowledge representation projects.

# Step 1: Install Necessary Libraries

In [None]:
!pip install networkx beautifulsoup4 requests lxml



# Step 2: Import Libraries and Setup Logging

In [None]:
import requests
from bs4 import BeautifulSoup
import networkx as nx
import logging

# Configure logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

# Step 3: Define the Web Scraping Functions

In [None]:
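def scrape_website_for_couplets(url):
    """Return (title, summary) couplets from the <article> elements at url."""
    try:
        # A timeout (10 s is an arbitrary choice) keeps the cell from hanging
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')
        couplets = []
        for article in soup.find_all('article'):
            # First <h2> is the title, first <p> is the summary
            title, summary = article.find('h2'), article.find('p')
            if title is not None and summary is not None:
                couplets.append((title.get_text(strip=True), summary.get_text(strip=True)))
        return couplets
    except Exception as e:
        # Log the failure and return no couplets so later cells can still run
        logging.error(f"Error scraping {url}: {e}")
        return []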

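def scrape_another_website_for_couplets(url):
    """Return (title, summary) couplets from the second site at url."""
    try:
        # A timeout (10 s is an arbitrary choice) keeps the cell from hanging
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')
        # Replace the following line with the correct selector for www.nelslindahl.com
        articles = soup.find_all('article')
        couplets = []
        for article in articles:
            # First <h2> is the title, first <p> is the summary
            title, summary = article.find('h2'), article.find('p')
            if title is not None and summary is not None:
                couplets.append((title.get_text(strip=True), summary.get_text(strip=True)))
        return couplets
    except Exception as e:
        # Log the failure and return no couplets so later cells can still run
        logging.error(f"Error scraping {url}: {e}")
        return []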
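
The extraction logic can be tried without any network access. The following is a minimal sketch, assuming a hand-written HTML snippet (`SAMPLE_HTML`) and a helper name (`extract_couplets`) that are both hypothetical and not part of the notebook. It uses Python's built-in `html.parser` instead of `lxml` so it runs without extra parser dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a scraped page; not taken from either site.
SAMPLE_HTML = """
<html><body>
  <article><h2>First title</h2><p>First summary.</p></article>
  <article><h2>No summary here</h2></article>
  <article><h2> Second title </h2><p>Second summary.</p></article>
</body></html>
"""

def extract_couplets(html):
    """Return (title, summary) pairs from <article> elements that have both."""
    # Assumption: the stdlib parser is used here; the notebook itself uses 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    couplets = []
    for article in soup.find_all("article"):
        title = article.find("h2")
        summary = article.find("p")
        # Articles missing either element are skipped, as in the notebook.
        if title is not None and summary is not None:
            couplets.append((title.get_text(strip=True), summary.get_text(strip=True)))
    return couplets

sample_couplets = extract_couplets(SAMPLE_HTML)
print(sample_couplets)
# [('First title', 'First summary.'), ('Second title', 'Second summary.')]
```

The article without a `<p>` is dropped, and `get_text(strip=True)` removes the surrounding whitespace from the second title.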

# Step 4: Create the Knowledge Graph

In [None]:
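def create_knowledge_graph_from_couplets(couplets):
    """Build an undirected graph with one edge per (entity, attribute) couplet."""
    G = nx.Graph()
    for entity, attribute in couplets:
        # Nodes are keyed by their text, so repeated strings share one node
        G.add_node(entity)
        G.add_node(attribute)
        G.add_edge(entity, attribute)
    return G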
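
To see how couplets turn into nodes and edges, here is a small sketch with made-up couplets (the names `sample_couplets` and `sample_graph` are illustrative only). It relies on the fact that `networkx` keys nodes by value, so a title that appears in two couplets becomes a single node with two edges.

```python
import networkx as nx

# Hypothetical couplets; "Topic A" appears twice on purpose.
sample_couplets = [
    ("Topic A", "Summary one"),
    ("Topic A", "Summary two"),
    ("Topic B", "Summary three"),
]

sample_graph = nx.Graph()
# add_edges_from creates any endpoint that is not in the graph yet.
sample_graph.add_edges_from(sample_couplets)

print(sample_graph.number_of_nodes())             # 5
print(sample_graph.number_of_edges())             # 3
print(sorted(sample_graph.neighbors("Topic A")))  # ['Summary one', 'Summary two']
```

Three couplets give five nodes rather than six because the two "Topic A" entries collapse into one node.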

# Step 5: Export the Knowledge Graph

In [None]:
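def export_graph(graph, filename, format):
    """Write graph to '<filename>.<format>'; supports 'gexf' and 'graphml'."""
    try:
        file_path = f"{filename}.{format}"
        if format.lower() == 'gexf':
            nx.write_gexf(graph, file_path)
        elif format.lower() == 'graphml':
            nx.write_graphml(graph, file_path)
        else:
            # Reject unknown formats rather than logging a false success
            logging.error(f"Unsupported export format: {format}")
            return
        logging.info(f"Graph successfully exported as {file_path}.")
    except Exception as e:
        logging.error(f"Failed to export graph: {e}")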
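
One way to check an export is to read the file back and compare it with the original graph. The sketch below assumes a tiny hand-built graph and a temporary directory; the file names and variable names are illustrative and not part of the notebook.

```python
import os
import tempfile

import networkx as nx

demo_graph = nx.Graph()
demo_graph.add_edge("Title", "Summary")

# Write both formats into a throwaway directory, then load them again.
with tempfile.TemporaryDirectory() as tmp_dir:
    graphml_path = os.path.join(tmp_dir, "demo.graphml")
    gexf_path = os.path.join(tmp_dir, "demo.gexf")
    nx.write_graphml(demo_graph, graphml_path)
    nx.write_gexf(demo_graph, gexf_path)
    from_graphml = nx.read_graphml(graphml_path)
    from_gexf = nx.read_gexf(gexf_path)

# Both round trips should preserve the two nodes and the single edge.
graphml_ok = from_graphml.has_edge("Title", "Summary")
gexf_ok = from_gexf.has_edge("Title", "Summary")
print(graphml_ok, gexf_ok)
```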

# Step 6: Evaluate the Knowledge Graph


In [None]:
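def evaluate_knowledge_graph(graph):
    """Print basic structural statistics for graph."""
    num_nodes = graph.number_of_nodes()
    num_edges = graph.number_of_edges()
    print(f"The graph has {num_nodes} nodes and {num_edges} edges.")
    if num_nodes == 0:
        # nx.is_connected raises NetworkXPointlessConcept on an empty graph,
        # which is what results when scraping returns no couplets
        print("The graph is empty; skipping connectivity checks.")
        return
    isolated_nodes = list(nx.isolates(graph))
    print(f"Number of isolated nodes: {len(isolated_nodes)}")
    if nx.is_connected(graph):
        print("The graph is connected.")
    else:
        # Report the biggest group of mutually reachable nodes
        largest_cc = max(nx.connected_components(graph), key=len)
        print(f"Size of the largest connected component: {len(largest_cc)}")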
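
The same `networkx` checks can be explored on a small hand-built graph. This sketch uses made-up node names and also shows that `nx.is_connected` raises `NetworkXPointlessConcept` on a graph with no nodes, which is the situation that arises if scraping returns no couplets.

```python
import networkx as nx

demo_graph = nx.Graph()
demo_graph.add_edges_from([("A", "a1"), ("A", "a2"), ("B", "b1")])
demo_graph.add_node("orphan")  # a node with no edges

isolated = list(nx.isolates(demo_graph))
connected = nx.is_connected(demo_graph)
component_sizes = sorted(
    (len(c) for c in nx.connected_components(demo_graph)), reverse=True
)

print(isolated)         # ['orphan']
print(connected)        # False
print(component_sizes)  # [3, 2, 1]

# Connectivity is undefined for a graph without nodes, so networkx raises.
try:
    nx.is_connected(nx.Graph())
    empty_graph_raises = False
except nx.NetworkXPointlessConcept:
    empty_graph_raises = True
print(empty_graph_raises)  # True
```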

# Step 7: Integrate Data from Another Website

In [None]:
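def integrate_new_data(graph, new_data):
    """Add each (entity, attribute) couplet in new_data to graph, in place."""
    for entity, attribute in new_data:
        # Only create nodes that are not already present in the graph
        if not graph.has_node(entity):
            graph.add_node(entity)
        if not graph.has_node(attribute):
            graph.add_node(attribute)
        # Re-adding an existing edge leaves the graph unchanged
        graph.add_edge(entity, attribute)
    return graph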
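
The effect of integration depends on whether the new couplets share any text with existing nodes. The sketch below uses made-up data (`base_graph` and `new_data` are illustrative names) and calls `add_edge` directly, which creates missing endpoints on its own and does not duplicate an edge that already exists.

```python
import networkx as nx

# Existing graph with one couplet.
base_graph = nx.Graph([("Topic A", "Summary one")])

# Hypothetical new couplets: one shares a node, one is unrelated,
# and one repeats an edge that is already in the graph.
new_data = [
    ("Topic A", "Summary two"),
    ("Topic C", "Summary three"),
    ("Topic A", "Summary one"),
]

for entity, attribute in new_data:
    base_graph.add_edge(entity, attribute)

print(base_graph.number_of_nodes())                # 5
print(base_graph.number_of_edges())                # 3
print(nx.number_connected_components(base_graph))  # 2
```

The couplet that shares "Topic A" attaches to the existing component, while the unrelated couplet forms a second component. When the two sites share no titles or summaries, integration produces separate components in the same way.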

# Step 8: Execution Flow

Scrape Data from the First Website (https://civichonors.com/):

In [None]:
couplets = scrape_website_for_couplets("https://civichonors.com/")

Create the Initial Knowledge Graph:

In [None]:
G = create_knowledge_graph_from_couplets(couplets)

Evaluate the Initial Graph:

In [None]:
evaluate_knowledge_graph(G)

The graph has 2 nodes and 1 edges.
Number of isolated nodes: 0
The graph is connected.


Scrape Data from the Second Website (https://www.nelslindahl.com/):

In [None]:
new_couplets = scrape_another_website_for_couplets("https://www.nelslindahl.com/")

Integrate and Update the Graph:

In [None]:
G = integrate_new_data(G, new_couplets)

Re-evaluate the Updated Graph:

In [None]:
evaluate_knowledge_graph(G)

The graph has 4 nodes and 2 edges.
Number of isolated nodes: 0
Size of the largest connected component: 2


Export the Updated Graph:

In [None]:
export_graph(G, "updated_knowledge_graph", "gexf")
export_graph(G, "updated_knowledge_graph", "graphml")