<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsGEXF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

This notebook extracts and analyzes text from a webpage and stores the result as a knowledge graph in GEXF format, which graph visualization tools such as Gephi can open. The steps run one cell at a time in Google Colab. Its components are summarized below:

### 1. **Library Installation and Imports**
   - The script starts by installing the Python libraries it needs for natural language processing (`nltk`, `spacy`, `textblob`).
   - It imports libraries for web scraping (`requests`, `BeautifulSoup`), graph creation (`networkx`), NLP (`nltk`, `spacy`, `TextBlob`), and standard-library utilities (`string`, `collections.Counter`).
   - The NLTK `punkt` and `stopwords` resources are downloaded, and spaCy's English model (`en_core_web_sm`) is loaded for NLP tasks.

### 2. **Function Definitions**
   - `fetch_webpage_content(url)`: Retrieves the HTML content from a given URL using `requests`. It calls `raise_for_status()` so that HTTP error responses surface as exceptions, and it re-raises any other failure with a descriptive message.
   - `parse_webpage_content(content)`: Parses the HTML content to extract plain text and performs NLP tasks:
       - Entity recognition using spaCy to identify and label entities.
       - Sentiment analysis using TextBlob to score the polarity and subjectivity of the whole text.
       - Keyword extraction with NLTK, keeping the 10 most frequent tokens after filtering out stopwords and punctuation.
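   - `knowledge_reduce(entities, sentiment, keyword_freq)`: Constructs a knowledge graph with NetworkX. Nodes represent entities and keywords, each with `type` and `label` attributes (keywords also carry a `frequency`), and the overall sentiment of the text is attached to the graph as a graph-level attribute. The edge creation logic is elided in the listing, so the function as shown adds nodes only; edges can be added to suit the analysis.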
   - `save_knowledge_graph_gexf(graph, file_name)`: Saves the knowledge graph in the GEXF format, suitable for visualization and analysis in tools like Gephi.
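
The keyword extraction step described above can be sketched without any third-party dependency. This is a simplified illustration, not the notebook's implementation: `STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list, `top_keywords` is a name invented for this example, and whitespace splitting replaces `nltk.word_tokenize`.

```python
from collections import Counter
import string

# Hypothetical stand-in for NLTK's English stopword list (an assumption for
# this sketch; the notebook uses stopwords.words('english')).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def top_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    # Simplification: split on whitespace and trim surrounding punctuation
    # instead of calling nltk.word_tokenize.
    tokens = (raw.strip(string.punctuation).lower() for raw in text.split())
    words = [token for token in tokens if token and token not in STOPWORDS]
    return Counter(words).most_common(n)


sample = "Civic honors programs reward civic service, and service builds community."
print(top_keywords(sample, n=2))
```

The return value has the same shape as `keyword_freq` in the notebook: a list of `(word, count)` tuples ordered by descending count.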

### 3. **Execution Flow**
   - **Fetching Webpage Content**: The script fetches content from "https://civichonors.com/".
   - **Parsing Webpage Content**: The content is parsed to extract entities, sentiment, and keywords.
   - **Creating and Saving the Knowledge Graph**: A knowledge graph is constructed from the parsed data and saved in GEXF format.

The script is meant for educational or analytical use in a Google Colab notebook, where each step can be run and inspected on its own. The final output is a GEXF file that holds the entities and keywords extracted from the webpage as graph nodes. How much it reveals about the relationships between them depends on the edge creation logic used.

# Step 1: Install and Import Libraries

In [1]:
# Installing required packages
!pip install nltk spacy textblob

# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
import networkx as nx
import nltk
import spacy
from textblob import TextBlob
from nltk.corpus import stopwords
from collections import Counter
import string

# Downloading NLTK resources and loading spaCy's English model
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Step 2: Define Functions

In [2]:
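def fetch_webpage_content(url):
    """Return the HTML of `url`, raising an Exception on any failure."""
    try:
        # A timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as e:
        raise Exception(f"HTTP Error: {e}") from e
    except Exception as e:
        raise Exception(f"Error fetching webpage: {e}") from e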

def parse_webpage_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    text_content = soup.get_text()

    doc = nlp(text_content)
    entities = [(ent.text.strip(), ent.label_) for ent in doc.ents if ent.text.strip()]

    blob = TextBlob(text_content)
    sentiment = blob.sentiment

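    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in nltk.word_tokenize(text_content)
             if word.lower() not in stop_words
             and word not in string.punctuation]
    keyword_freq = Counter(words).most_common(10)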

    return entities, sentiment, keyword_freq

def knowledge_reduce(entities, sentiment, keyword_freq):
    graph = nx.Graph()
    for entity, entity_type in entities:
        # `entity_type` avoids shadowing the built-in `type`
        graph.add_node(entity, type=entity_type, label=entity_type)
    for word, freq in keyword_freq:
        graph.add_node(word, type='keyword', frequency=freq, label='keyword')
    graph.graph['sentiment'] = {'polarity': sentiment.polarity, 'subjectivity': sentiment.subjectivity}
    # Add edges between entities and keywords
    # ... [existing edge creation logic]
    return graph

def save_knowledge_graph_gexf(graph, file_name):
    nx.write_gexf(graph, file_name)
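
The edge creation logic is elided in `knowledge_reduce` above. As one possible approach, and not necessarily the one the notebook originally used, the sketch below links an entity to a keyword whenever both appear in the same sentence. The function name `cooccurrence_edges`, the naive regex sentence splitter, and the substring matching are all assumptions made for illustration.

```python
import re
from itertools import product


def cooccurrence_edges(text, entity_names, keywords):
    """Return sorted (entity, keyword) pairs that share at least one sentence."""
    edges = set()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Case-insensitive substring matching; a real tokenizer would be stricter.
        present_entities = [e for e in entity_names if e.lower() in lowered]
        present_keywords = [k for k in keywords if k.lower() in lowered]
        for entity, keyword in product(present_entities, present_keywords):
            if entity.lower() != keyword.lower():  # skip self-pairs
                edges.add((entity, keyword))
    return sorted(edges)


sample_text = "Civic Honors supports volunteers. Seattle hosts a community event."
sample_edges = cooccurrence_edges(
    sample_text,
    entity_names=["Civic Honors", "Seattle"],
    keywords=["volunteers", "community"],
)
print(sample_edges)
```

The resulting pairs could be handed to `graph.add_edges_from(...)`. Note that `knowledge_reduce` does not currently receive the page text, so it would need an extra parameter for this approach to work.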

# Step 3: Fetch Webpage Content

In [3]:
url = "https://civichonors.com/"
try:
    content = fetch_webpage_content(url)
    print("Webpage content fetched successfully.")
except Exception as e:
    # fetch_webpage_content already prefixes its message, so print it as-is
    print(e)

Webpage content fetched successfully.


# Step 4: Parse the Webpage Content

In [4]:
try:
    entities, sentiment, keywords = parse_webpage_content(content)
    print("Webpage content parsed successfully.")
except Exception as e:
    print(f"Error parsing webpage content: {e}")

Webpage content parsed successfully.
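
`parse_webpage_content` returns `entities` as a list of `(text, label)` tuples, and the same entity can appear many times. Because `graph.add_node` keys nodes by name, repeated mentions collapse into a single node and the mention count is lost. A minimal sketch of counting mentions first is shown below; `count_entity_mentions` and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_entity_mentions(entities):
    """Map each distinct (text, label) pair to the number of times it occurs."""
    return Counter(entities)


# Hypothetical sample in the same shape as the notebook's `entities` list.
sample_entities = [("Seattle", "GPE"), ("Civic Honors", "ORG"), ("Seattle", "GPE")]
mentions = count_entity_mentions(sample_entities)

for (text, label), count in mentions.most_common():
    print(f"{text} [{label}]: {count}")
```

The counts could then be stored on each entity node as an extra attribute, in the same way keyword nodes already carry `frequency`.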


# Step 5: Create and Save the Knowledge Graph

In [5]:
try:
    graph = knowledge_reduce(entities, sentiment, keywords)
    save_knowledge_graph_gexf(graph, '/content/knowledge_graph.gexf')
    print("Knowledge graph created and saved in GEXF format successfully.")
except Exception as e:
    print(f"Error creating or saving knowledge graph: {e}")

Knowledge graph created and saved in GEXF format successfully.
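
Since GEXF is an XML format, the saved file can be sanity-checked with the standard library alone. The sketch below counts `<node>` and `<edge>` elements while ignoring the XML namespace. `summarize_gexf` and `SAMPLE_GEXF` are names made up for this example, and the sample document is a hand-written approximation of a GEXF file, not the notebook's actual output.

```python
import io
import xml.etree.ElementTree as ET


def summarize_gexf(source):
    """Count <node> and <edge> elements in a GEXF document, ignoring namespaces."""
    root = ET.parse(source).getroot()
    counts = {"node": 0, "edge": 0}
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if local_name in counts:
            counts[local_name] += 1
    return counts


# Hand-written sample for illustration only.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected" mode="static">
    <nodes>
      <node id="Seattle" label="GPE" />
      <node id="community" label="keyword" />
    </nodes>
    <edges>
      <edge id="0" source="Seattle" target="community" />
    </edges>
  </graph>
</gexf>"""

summary = summarize_gexf(io.StringIO(SAMPLE_GEXF))
print(summary)
```

In Colab the same function can be called with the path used above, for example `summarize_gexf('/content/knowledge_graph.gexf')`. With the edge logic elided as in the listing, the edge count would be expected to be zero.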
