<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsGraphML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

This Python code, written for Google Colab, extracts text from a webpage, analyzes it, and structures the results as a knowledge graph saved in GraphML format. This revision fixes a data-type incompatibility with GraphML: attribute values are converted to strings before the graph is written. The code and its workflow are summarized below:

### 1. **Library Installation and Imports**
   - The script begins by installing the required NLP libraries (`nltk`, `spacy`, `textblob`).
   - It imports libraries for web scraping (`requests`, `BeautifulSoup`), graph creation (`networkx`), and NLP (`nltk`, `spacy`, `TextBlob`), plus standard-library utilities (`string`, `Counter`).
   - The NLTK tokenizer and stopword resources are downloaded, and spaCy's English model (`en_core_web_sm`) is loaded.

### 2. **Function Definitions**
   - `fetch_webpage_content(url)`: Retrieves the HTML content of a given URL and raises an exception on HTTP or other request errors.
   - `parse_webpage_content(content)`: Extracts the text from the HTML, then runs entity recognition with spaCy, sentiment analysis with TextBlob, and keyword extraction with NLTK (the ten most frequent tokens after filtering out stopwords and punctuation).
   - `knowledge_reduce(entities, sentiment, keyword_freq)`: Builds a knowledge graph whose nodes are the entities and keywords, with an edge wherever a keyword appears within an entity's text. All node attributes are converted to strings for GraphML compatibility, and the overall sentiment is stored as a graph-level string attribute.
   - `save_knowledge_graph_graphml(graph, file_name)`: Writes the graph to a GraphML file with `nx.write_graphml`. It performs no conversion itself and relies on `knowledge_reduce` having already stored every attribute in a supported format.
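
To illustrate the string-conversion requirement described in the list above: GraphML attribute values are limited to a few primitive types (booleans, integers, floats, and strings), so values such as tuples need converting before the graph is written. The following is a minimal sketch using only the standard library. `to_graphml_safe` is a hypothetical helper, not part of the notebook, and unlike `knowledge_reduce` (which stringifies everything) it leaves already-supported primitives untouched.

```python
# Hypothetical helper (not part of the notebook): coerce attribute values
# into types that GraphML can store.
GRAPHML_PRIMITIVES = (bool, int, float, str)

def to_graphml_safe(attrs):
    """Return a copy of `attrs` with unsupported values converted to str."""
    return {
        key: value if isinstance(value, GRAPHML_PRIMITIVES) else str(value)
        for key, value in attrs.items()
    }

# Made-up sample attributes; the tuple stands in for a sentiment result.
raw = {"label": "Civic Honors", "frequency": 12, "sentiment": (0.1, 0.4)}
safe = to_graphml_safe(raw)
print(safe)
```

Keeping numeric values as numbers would let graph tools sort or filter on them, whereas converting everything to strings, as the notebook does, is the simpler way to guarantee the file can be written.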

### 3. **Execution Flow**
   - **Fetching Webpage Content**: The script fetches content from "https://civichonors.com/".
   - **Parsing Webpage Content**: The content is then parsed to extract entities, the overall sentiment, and keywords.
   - **Creating and Saving the Knowledge Graph**: The script creates a knowledge graph from the extracted data. The updated script ensures all graph attributes are in a format compatible with GraphML, and the graph is saved as a GraphML file.

This script is useful for educational and analytical work in Google Colab, where web scraping, natural language processing, and knowledge graph construction can be run interactively, cell by cell. The final output, a GraphML file, is a structured, machine-readable representation of the content's entities, keywords, and the links between them, ready to be analyzed or visualized in graph tools that read GraphML.


# Step 1: Install and Import Libraries

In [6]:
!pip install nltk spacy textblob
import requests
from bs4 import BeautifulSoup
import networkx as nx
import nltk
import spacy
from textblob import TextBlob
from nltk.corpus import stopwords
from collections import Counter
import string

nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Step 2: Define Functions

In [7]:
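def fetch_webpage_content(url):
    """Return the HTML of `url` as text; raise on HTTP or network errors."""
    try:
        # A timeout keeps the notebook from hanging on an unresponsive server.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as e:
        # `from e` preserves the original traceback for debugging.
        raise Exception(f"HTTP Error: {e}") from e
    except Exception as e:
        raise Exception(f"Error fetching webpage: {e}") from e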

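def parse_webpage_content(content):
    """Extract named entities, sentiment, and the top keywords from HTML."""
    soup = BeautifulSoup(content, 'html.parser')
    text_content = soup.get_text()

    # Named entities as (text, label) pairs, skipping whitespace-only spans.
    doc = nlp(text_content)
    entities = [(ent.text.strip(), ent.label_) for ent in doc.ents if ent.text.strip()]

    # TextBlob returns a (polarity, subjectivity) named tuple.
    blob = TextBlob(text_content)
    sentiment = blob.sentiment

    # Build the stopword set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    # Drop stopwords and tokens made up entirely of punctuation (e.g. "--").
    words = [word.lower() for word in nltk.word_tokenize(text_content)
             if word.lower() not in stop_words
             and not all(ch in string.punctuation for ch in word)]
    keyword_freq = Counter(words).most_common(10)

    return entities, sentiment, keyword_freq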

def knowledge_reduce(entities, sentiment, keyword_freq):
    """Build a graph of entity and keyword nodes with string-only attributes."""
    graph = nx.Graph()

    # `entity_type` avoids shadowing the built-in `type`.
    for entity, entity_type in entities:
        graph.add_node(entity, type=str(entity_type), label=str(entity))

    for word, freq in keyword_freq:
        graph.add_node(word, type='keyword', frequency=str(freq), label=str(word))

    # Convert sentiment to a string representation
    sentiment_data = f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}"
    graph.graph['sentiment'] = sentiment_data

    # Link each entity to every keyword that appears in its text.
    # Keywords are lowercased during parsing, so compare against the lowercased entity.
    for entity, _ in entities:
        for word, _ in keyword_freq:
            if word in entity.lower():
                graph.add_edge(entity, word)

    return graph

def save_knowledge_graph_graphml(graph, file_name):
    nx.write_graphml(graph, file_name)
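
The edge rule in `knowledge_reduce` links an entity to a keyword when the keyword appears inside the entity's text. Because keywords are lowercased during parsing, the comparison is easiest to reason about when the entity is lowercased as well. The following is a minimal sketch of that rule in isolation. `link_keywords` and the sample data are hypothetical and only mimic the shapes returned by `parse_webpage_content`.

```python
def link_keywords(entities, keywords):
    """Return (entity, keyword) pairs where the keyword occurs in the entity text."""
    return [
        (entity, word)
        for entity, _label in entities
        for word, _freq in keywords
        if word in entity.lower()  # case-insensitive substring match
    ]

# Hypothetical sample data in the shapes produced by parse_webpage_content.
sample_entities = [("Civic Honors", "ORG"), ("Kansas", "GPE")]
sample_keywords = [("civic", 8), ("community", 5)]

edges = link_keywords(sample_entities, sample_keywords)
print(edges)  # [('Civic Honors', 'civic')]
```

Substring matching is deliberately loose: a short keyword would also match inside a longer, unrelated word, so a stricter variant might compare whole tokens instead.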

# Step 3: Fetch Webpage Content

In [8]:
url = "https://civichonors.com/"
try:
    content = fetch_webpage_content(url)
    print("Webpage content fetched successfully.")
except Exception as e:
    print(f"Error fetching webpage: {e}")

Webpage content fetched successfully.


# Step 4: Parse the Webpage Content

In [9]:
try:
    entities, sentiment, keywords = parse_webpage_content(content)
    print("Webpage content parsed successfully.")
except Exception as e:
    print(f"Error parsing webpage content: {e}")

Webpage content parsed successfully.
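
To make the keyword step easier to follow, here is a minimal, standard-library-only sketch of the same idea: tokenize, drop stopwords, and count. It is not the notebook's implementation. A simple regular expression stands in for `nltk.word_tokenize`, and a tiny hand-written stopword set stands in for NLTK's English list, so results on real pages will differ. `top_keywords` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stopword set stands in for NLTK's English list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def top_keywords(text, n=10):
    """Return the n most common non-stopword tokens as (word, count) pairs."""
    # Assumption: a simple regex tokenizer stands in for nltk.word_tokenize.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    words = [tok for tok in tokens if tok not in STOP_WORDS]
    return Counter(words).most_common(n)

sample = "Civic honors and civic service: the honors of service to the community."
print(top_keywords(sample, n=3))
```

The result has the same shape as `keyword_freq` in the notebook, a list of `(word, count)` pairs, which is what `knowledge_reduce` expects.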


# Step 5: Create and Save the Knowledge Graph

In [10]:
try:
    graph = knowledge_reduce(entities, sentiment, keywords)
    save_knowledge_graph_graphml(graph, '/content/knowledge_graph.graphml')
    print("Knowledge graph created and saved in GraphML format successfully.")
except Exception as e:
    print(f"Error creating or saving knowledge graph: {e}")

Knowledge graph created and saved in GraphML format successfully.
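
One way to check what was written is to read the GraphML back and list its nodes and edges. The following is a minimal sketch using only the standard library's XML parser. `summarize_graphml` is a hypothetical helper, and the hand-written snippet stands in for the saved file so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

# XML namespace used by GraphML documents.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def summarize_graphml(xml_text):
    """Return (node_ids, edge_pairs) found in a GraphML document string."""
    root = ET.fromstring(xml_text)
    nodes = [n.get("id") for n in root.iterfind(".//g:node", NS)]
    edges = [(e.get("source"), e.get("target"))
             for e in root.iterfind(".//g:edge", NS)]
    return nodes, edges

# Hypothetical, hand-written GraphML standing in for the saved file.
sample_graphml = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="Civic Honors"/>
    <node id="civic"/>
    <edge source="Civic Honors" target="civic"/>
  </graph>
</graphml>"""

nodes, edges = summarize_graphml(sample_graphml)
print(nodes, edges)
```

For the real output, the contents of `/content/knowledge_graph.graphml` can be read and passed to the same function. When networkx is available, `nx.read_graphml` is the more direct route because it rebuilds the graph object along with its attributes.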
