<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsSimpleKG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

The code fetches the content of a specified webpage, extracts named entities, sentiment, and keywords from it using natural language processing (NLP) techniques, and organizes the results into a simple knowledge graph. Each part of the code is summarized below:

1. **Installing and Importing Libraries**:
    - The code begins with the installation of necessary Python libraries (`nltk`, `spacy`, `textblob`) in a Google Colab environment, followed by importing other essential libraries (`requests`, `bs4`, `networkx`, `json`, `string`, `Counter` from `collections`).
    - `nltk` and `spacy` are used for NLP tasks like tokenization, stopwords removal, and entity recognition. `textblob` is used for sentiment analysis.
    - NetworkX is used for creating and manipulating the knowledge graph structure.

2. **Function Definitions**:
    - `fetch_webpage_content(url)`: Fetches the content of a webpage given its URL using the `requests` library. It returns the response body on HTTP 200 and raises an exception for any other status code.
    - `parse_webpage_content(content)`: Parses the fetched content using `BeautifulSoup` to extract text. It then uses `spacy` for entity recognition, `TextBlob` for sentiment analysis, and `nltk` for keyword extraction. The function returns the entities (text and label), the overall sentiment (polarity and subjectivity), and the 10 most frequent tokens that are neither stopwords nor punctuation.
    - `create_knowledge_graph(entities, sentiment, keyword_freq)`: Creates a knowledge graph using NetworkX. The graph includes nodes for identified entities and keywords, and stores the overall sentiment of the text as a graph-level attribute. No edges are created by default; the function leaves a placeholder where linking logic can be added based on specific criteria.
    - `save_knowledge_graph(graph, file_name)`: Saves the created knowledge graph into a JSON file, making it portable and easy to access later.

3. **Execution Flow**:
    - The process begins by fetching the webpage content using the specified URL.
    - The fetched content is then parsed to extract entities, sentiment, and keywords.
    - A knowledge graph is created from this extracted information: entities and keywords become nodes, and the overall sentiment is stored on the graph itself. Relationships between nodes (edges) are not added until linking logic is supplied.
    - Finally, the knowledge graph is saved as a JSON file, which can be used for further analysis or visualization.

This code is structured to run in a Google Colab environment, allowing for step-by-step execution and observation. It demonstrates the integration of web scraping, NLP, and graph theory to create a structured and meaningful representation of webpage content.

# Step 1: Install Additional Libraries

In [1]:
!pip install nltk spacy textblob
import requests
from bs4 import BeautifulSoup
import networkx as nx
import json
import nltk
import spacy
from textblob import TextBlob
from nltk.corpus import stopwords
from collections import Counter
import string

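# Recent NLTK releases may look for 'punkt_tab' rather than 'punkt'; if
# word_tokenize raises a LookupError, also run nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
# If the model is missing outside Colab, install it first with
# `python -m spacy download en_core_web_sm`
nlp = spacy.load('en_core_web_sm')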



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Step 2: Define Enhanced Functions

In [2]:
def fetch_webpage_content(url):
    # A timeout keeps the cell from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        raise Exception(f"Failed to fetch webpage content. Status Code: {response.status_code}")

def parse_webpage_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    text_content = soup.get_text()

    doc = nlp(text_content)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    blob = TextBlob(text_content)
    sentiment = blob.sentiment

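    # Build the stopword set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in nltk.word_tokenize(text_content)
             if word.lower() not in stop_words
             and word not in string.punctuation]
    # Keep the 10 most frequent remaining tokens as keywords
    keyword_freq = Counter(words).most_common(10)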

    return entities, sentiment, keyword_freq

def create_knowledge_graph(entities, sentiment, keyword_freq):
    G = nx.Graph()

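    # Each entity node carries its spaCy label; `label` avoids shadowing the built-in `type`
    for entity, label in entities:
        G.add_node(entity, type=label)

    for word, freq in keyword_freq:
        G.add_node(word, type='keyword', frequency=freq)

    # Sentiment is a (polarity, subjectivity) namedtuple; json.dump later writes it as a two-element list
    G.graph['sentiment'] = sentiment

    # No edges are added here; define logic for creating edges based on your criteria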

    return G

def save_knowledge_graph(graph, file_name):
    data = nx.readwrite.json_graph.node_link_data(graph)
    with open(file_name, 'w') as file:
        json.dump(data, file)
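
`create_knowledge_graph` leaves edge creation open. One possible criterion, offered only as a sketch and not as the notebook's method, is to link entities that appear in the same sentence. The helper below, `cooccurrence_edges`, is hypothetical and works on plain lists of entity strings (one list per sentence) so that it runs with the standard library alone. With spaCy, such lists could presumably be built from `doc.sents` and each sentence's `ents`; the sample data is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences_entities, min_count=1):
    """Count how often two entities share a sentence.

    sentences_entities: iterable of lists of entity strings, one list per
    sentence (hypothetical input format, not produced by the notebook).
    Returns a list of (entity_a, entity_b, count) tuples.
    """
    pair_counts = Counter()
    for ents in sentences_entities:
        # Deduplicate and sort so each unordered pair has one canonical key
        unique = sorted(set(ents))
        pair_counts.update(combinations(unique, 2))
    return [(a, b, n) for (a, b), n in pair_counts.items() if n >= min_count]


# Made-up example data, for illustration only
sample_sentences = [
    ["Acme", "Paris"],
    ["Paris", "Acme", "Globex"],
    ["Globex"],
]

for a, b, n in cooccurrence_edges(sample_sentences):
    # With the notebook's graph this could become: G.add_edge(a, b, weight=n)
    print(a, "--", b, "weight", n)
```

Raising `min_count` drops pairs that co-occur only rarely, which keeps the graph from filling up with incidental links.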

# Step 3: Execute the Process

Fetch Webpage Content

In [3]:
url = "https://civichonors.com/"
try:
    content = fetch_webpage_content(url)
    print("Webpage content fetched successfully.")
except Exception as e:
    print(f"Error fetching webpage: {e}")


Webpage content fetched successfully.


Parse Webpage and Create Knowledge Graph

In [4]:
try:
    entities, sentiment, keyword_freq = parse_webpage_content(content)
    graph = create_knowledge_graph(entities, sentiment, keyword_freq)
    print("Knowledge graph created successfully.")
except Exception as e:
    print(f"Error in parsing and graph creation: {e}")

Knowledge graph created successfully.
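
Before saving, it can help to check what the parser actually found. The sketch below counts entities per label; `summarize_entities` is a hypothetical helper, and it assumes `entities` is a list of `(text, label)` tuples as returned by `parse_webpage_content`. The sample list is made up so the snippet runs on its own.

```python
from collections import Counter


def summarize_entities(entities):
    """Count entities per label from a list of (text, label) tuples."""
    return Counter(label for _, label in entities)


# Made-up sample; in the notebook, pass the `entities` list from the cell above
sample_entities = [("Acme", "ORG"), ("Paris", "GPE"), ("Globex", "ORG")]

for label, count in summarize_entities(sample_entities).most_common():
    print(f"{label}: {count}")
```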


Save the Knowledge Graph

In [5]:
try:
    save_knowledge_graph(graph, '/content/knowledge_graph.json')
    print("Knowledge graph saved successfully.")
except Exception as e:
    print(f"Error saving knowledge graph: {e}")

Knowledge graph saved successfully.
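
The saved file can be inspected later without NetworkX, since node-link JSON is a plain dictionary. The following is a minimal sketch under stated assumptions: nodes are stored under `"nodes"` with an `"id"` key, graph-level attributes under `"graph"` as a dictionary, and edges under `"links"` or `"edges"` depending on the NetworkX version. `summarize_saved_graph` is a hypothetical helper, and the sample file is made up and written to a temporary directory.

```python
import json
import os
import tempfile


def summarize_saved_graph(path):
    """Load a node-link JSON file and return a few summary figures."""
    with open(path) as f:
        data = json.load(f)

    nodes = data.get("nodes", [])
    # The edge list key is assumed to be "links" or "edges", depending on version
    edges = data.get("links", data.get("edges", []))
    # Graph-level attributes (assumed to be a dict) hold the stored sentiment
    graph_attrs = data.get("graph", {})
    sentiment = graph_attrs.get("sentiment") if isinstance(graph_attrs, dict) else None

    # Map each keyword node to its stored frequency
    keywords = {n["id"]: n.get("frequency")
                for n in nodes if n.get("type") == "keyword"}
    return {"nodes": len(nodes), "edges": len(edges),
            "keywords": keywords, "sentiment": sentiment}


# Made-up sample in node-link layout, written to a temporary file for the demo
sample = {
    "directed": False, "multigraph": False,
    "graph": {"sentiment": [0.1, 0.5]},
    "nodes": [{"type": "ORG", "id": "Acme"},
              {"type": "keyword", "frequency": 7, "id": "community"}],
    "links": [],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knowledge_graph.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    print(summarize_saved_graph(path))
```

In the notebook, the same function could be pointed at `/content/knowledge_graph.json`; the sentiment is returned in whatever shape it was saved.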
