<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsGraphML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

This notebook is written for Google Colab. It fetches a webpage, processes its text with NLP tools, and builds a knowledge graph from the content, which is then saved in GraphML format. The notebook does not render the graph itself; the GraphML file is intended for visualization and analysis in other tools. The main components and workflow are:

### 1. **Library Installation and Imports**
   - The code starts by installing necessary Python libraries (`nltk`, `spacy`, `textblob`) that are used for natural language processing tasks.
   - Essential libraries for web scraping (`requests`, `BeautifulSoup`), graph creation and manipulation (`networkx`), and standard utilities (`string`, `Counter` from `collections`) are imported.
   - NLTK resources for tokenization and stopword removal are downloaded, and spaCy's small English model (`en_core_web_sm`) is loaded for entity recognition.

### 2. **Function Definitions**
   - `fetch_webpage_content(url)`: Fetches the HTML content of a webpage using the `requests` library. It raises an exception if the request fails or the server returns an HTTP error status.
   - `parse_webpage_content(content)`: Parses the HTML to extract text, performs entity recognition using spaCy, sentiment analysis using TextBlob, and keyword extraction with NLTK. It filters out stopwords and punctuation, then keeps the ten most frequent remaining tokens as keywords.
   - `knowledge_reduce(entities, sentiment, keyword_freq)`: Constructs a knowledge graph using NetworkX. Nodes represent entities and keywords, each annotated with a `type` and a `label` (keywords also carry a `frequency`). The overall sentiment of the text is stored as a string in a graph-level attribute. An edge connects an entity to a keyword when the keyword appears in the entity's text; this rule can be replaced with custom logic.
   - `save_knowledge_graph_graphml(graph, file_name)`: Saves the generated knowledge graph in the GraphML format using NetworkX's `write_graphml` method. GraphML is a versatile format compatible with various graph analysis tools.
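
The keyword step in `parse_webpage_content` can be illustrated without any NLP dependencies. The following is a minimal sketch, not the notebook's implementation: the `top_keywords` helper, the regex tokenizer, and the short `STOPWORDS` set are assumptions standing in for `nltk.word_tokenize` and NLTK's English stopword list.

```python
import re
import string
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def top_keywords(text, limit=10):
    """Return the `limit` most frequent non-stopword tokens in `text`."""
    # A regex split stands in for nltk.word_tokenize in this sketch.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Trim stray punctuation (for example a leading apostrophe) from each token.
    words = [token.strip(string.punctuation) for token in tokens]
    # Drop empty strings and stopwords before counting.
    words = [word for word in words if word and word not in STOPWORDS]
    return Counter(words).most_common(limit)


sample = "Civic honors programs reward civic service, and service builds community."
print(top_keywords(sample, limit=3))
```

`Counter.most_common` returns `(word, count)` pairs, which is the shape `knowledge_reduce` expects for its `keyword_freq` argument.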

### 3. **Execution Flow**
   - **Fetching Webpage Content**: The script begins by retrieving the content from "https://civichonors.com/".
   - **Parsing Webpage Content**: The fetched content is then parsed to extract entities, the overall sentiment, and keywords.
   - **Creating and Saving the Knowledge Graph**: The script constructs a knowledge graph from the parsed data and saves it as a GraphML file. This file can be used for further analysis or visualization in graph software.
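
Each stage in this flow depends on the output of the previous one, so a failed fetch leaves nothing to parse. As a hedged sketch of one way to make that dependency explicit, the hypothetical `run_pipeline` helper below chains the stages and reports which one failed. The stage functions are passed in as arguments, so the example runs here with stubs instead of real network calls.

```python
def run_pipeline(url, out_path, fetch, parse, build, save):
    """Run fetch -> parse -> build -> save and stop at the first failure.

    Returns the name of the stage that failed, or None if all stages succeeded.
    """
    stage = "fetch"
    try:
        content = fetch(url)
        stage = "parse"
        entities, sentiment, keywords = parse(content)
        stage = "build"
        graph = build(entities, sentiment, keywords)
        stage = "save"
        save(graph, out_path)
    except Exception as exc:
        print(f"Stage '{stage}' failed: {exc}")
        return stage
    return None


def failing_fetch(url):
    # Stub that simulates an unreachable site.
    raise ConnectionError("network unreachable")


# Stubs for the remaining stages; they are never reached when the fetch fails.
failed_stage = run_pipeline(
    "https://civichonors.com/",
    "knowledge_graph.graphml",
    fetch=failing_fetch,
    parse=lambda content: ([], None, []),
    build=lambda entities, sentiment, keywords: {},
    save=lambda graph, path: None,
)
print(failed_stage)
```

In the notebook, the four callables would be `fetch_webpage_content`, `parse_webpage_content`, `knowledge_reduce`, and `save_knowledge_graph_graphml`.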

This notebook is useful in educational and data analysis contexts as a step-by-step walkthrough of web scraping, natural language processing, and knowledge graph construction. The final output, a GraphML file, is a structured representation of the entities and keywords found in the page and can be opened in graph visualization tools for further analysis.


# Step 1: Install and Import Libraries

In [6]:
!pip install nltk spacy textblob
import requests
from bs4 import BeautifulSoup
import networkx as nx
import nltk
import spacy
from textblob import TextBlob
from nltk.corpus import stopwords
from collections import Counter
import string

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Step 2: Define Functions

In [7]:
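def fetch_webpage_content(url):
    """Return the HTML of `url`; raise an Exception if the request fails."""
    try:
        # A timeout keeps the cell from hanging on an unresponsive server.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as e:
        raise Exception(f"HTTP Error: {e}") from e
    except Exception as e:
        raise Exception(f"Error fetching webpage: {e}") from e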

def parse_webpage_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    text_content = soup.get_text()

    doc = nlp(text_content)
    entities = [(ent.text.strip(), ent.label_) for ent in doc.ents if ent.text.strip()]

    blob = TextBlob(text_content)
    sentiment = blob.sentiment

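    # Build the stopword set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    # Require an alphanumeric character so multi-character punctuation is dropped too.
    words = [word.lower() for word in nltk.word_tokenize(text_content)
             if word.lower() not in stop_words
             and any(ch.isalnum() for ch in word)]
    keyword_freq = Counter(words).most_common(10)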

    return entities, sentiment, keyword_freq

def knowledge_reduce(entities, sentiment, keyword_freq):
    graph = nx.Graph()

    # `entity_type` avoids shadowing the built-in `type`.
    for entity, entity_type in entities:
        graph.add_node(entity, type=str(entity_type), label=str(entity))

    for word, freq in keyword_freq:
        graph.add_node(word, type='keyword', frequency=str(freq), label=str(word))

    # Convert sentiment to a string representation
    sentiment_data = f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}"
    graph.graph['sentiment'] = sentiment_data

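    # Link each entity to every keyword that occurs in its text.
    # Keywords are lowercased, so compare against the lowercased entity.
    for entity, _ in entities:
        for word, _ in keyword_freq:
            if word in entity.lower():
                graph.add_edge(entity, word)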

    return graph

def save_knowledge_graph_graphml(graph, file_name):
    nx.write_graphml(graph, file_name)
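
NetworkX's GraphML writer serializes scalar attribute values such as strings, numbers, and booleans, which is why `knowledge_reduce` flattens the sentiment into a single string. As a sketch of an alternative, assuming numeric access to the scores is wanted later, polarity and subjectivity could be stored as two separate float attributes. The toy graph and its values below are hypothetical sample data, and the round trip uses `nx.generate_graphml` and `nx.parse_graphml` so that no file is written.

```python
import networkx as nx

# Toy graph with hypothetical sample values, mirroring the node attributes
# that knowledge_reduce assigns.
graph = nx.Graph()
graph.add_node("Civic Honors", type="ORG", label="Civic Honors")
graph.add_node("civic", type="keyword", frequency=12, label="civic")
graph.add_edge("Civic Honors", "civic")

# Assumption: sentiment stored as two floats rather than one formatted string.
graph.graph["sentiment_polarity"] = 0.25
graph.graph["sentiment_subjectivity"] = 0.5

# Serialize to GraphML text in memory, then parse it back into a new graph.
graphml_text = "\n".join(nx.generate_graphml(graph))
restored = nx.parse_graphml(graphml_text)

print(restored.graph["sentiment_polarity"])
print(restored.nodes["civic"]["frequency"])
print(restored.has_edge("Civic Honors", "civic"))
```

Keeping the scores numeric means a downstream tool can filter or compare graphs by polarity without parsing a formatted string. The same applies to `frequency`, which the notebook currently stores as a string.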

# Step 3: Fetch Webpage Content

In [8]:
url = "https://civichonors.com/"
try:
    content = fetch_webpage_content(url)
    print("Webpage content fetched successfully.")
except Exception as e:
    print(f"Error fetching webpage: {e}")

Webpage content fetched successfully.


# Step 4: Parse the Webpage Content

In [9]:
try:
    entities, sentiment, keywords = parse_webpage_content(content)
    print("Webpage content parsed successfully.")
except Exception as e:
    print(f"Error parsing webpage content: {e}")

Webpage content parsed successfully.


# Step 5: Create and Save the Knowledge Graph

In [10]:
try:
    graph = knowledge_reduce(entities, sentiment, keywords)
    save_knowledge_graph_graphml(graph, '/content/knowledge_graph.graphml')
    print("Knowledge graph created and saved in GraphML format successfully.")
except Exception as e:
    print(f"Error creating or saving knowledge graph: {e}")

Knowledge graph created and saved in GraphML format successfully.
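
GraphML is plain XML, so the saved file can be inspected even without a graph library. The following is a minimal sketch using only the standard library. The `SAMPLE_GRAPHML` string is a hand-written, hypothetical stand-in for `/content/knowledge_graph.graphml`, not actual output from this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written GraphML snippet standing in for the saved file.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="type" attr.type="string"/>
  <graph edgedefault="undirected">
    <node id="Civic Honors"><data key="d0">ORG</data></node>
    <node id="civic"><data key="d0">keyword</data></node>
    <edge source="Civic Honors" target="civic"/>
  </graph>
</graphml>
"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

root = ET.fromstring(SAMPLE_GRAPHML)

# Map key ids such as "d0" to their attribute names such as "type".
key_names = {key.get("id"): key.get("attr.name") for key in root.findall("g:key", NS)}

# Collect each node's attributes under readable names.
nodes = {}
for node in root.findall("g:graph/g:node", NS):
    nodes[node.get("id")] = {
        key_names[data.get("key")]: data.text for data in node.findall("g:data", NS)
    }

# Collect edges as (source, target) pairs.
edges = [
    (edge.get("source"), edge.get("target"))
    for edge in root.findall("g:graph/g:edge", NS)
]

print(nodes)
print(edges)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring`, and `nx.read_graphml(path)` loads the same data back into a NetworkX graph.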
