<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsAdvancedKGv2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Civic Honors Advanced Knowledge Graph v2

This notebook is a Python workflow, written for Google Colab, that analyzes the content of "https://civichonors.com/". It covers web scraping, natural language processing, knowledge graph construction, interactive visualization, and topic modeling. The steps are summarized below:

### 1. Environment Setup
- **1.1 Installation of Libraries**: Essential libraries like `requests`, `beautifulsoup4` (bs4), `spacy`, `networkx`, `matplotlib`, `textblob`, `plotly`, `gensim`, and `nltk` are installed. These libraries are fundamental for web scraping, text processing, natural language processing, and visualization.
- **1.2 Importing Libraries**: All the installed libraries are imported for use in the subsequent steps.

### 2. Data Acquisition and Processing
- **2.1 Fetch Webpage Content**: The script fetches the HTML content of "https://civichonors.com/" using the `requests` library.
- **2.2 Extract Text from HTML**: The `BeautifulSoup` library is used to parse the HTML content and extract the relevant textual data, stripping away any HTML tags, scripts, or styles.

### 3. Advanced Natural Language Processing
- **3.1 Custom Entity Recognition and Relationship Extraction**: Using `spacy` and its `Matcher` class, the script defines token patterns for awards, civic terms, and organizations and applies them to the text. The matches are combined with the named entities that the `en_core_web_sm` model finds on its own.
- **3.2 Sentiment Analysis**: The `TextBlob` library is used to perform sentiment analysis on the extracted text, providing insights into the overall sentiment conveyed in the content.

### 4. Knowledge Graph Creation and Management
- **4.1 Create Advanced Knowledge Graph**: The script constructs a knowledge graph using `networkx`. Each identified entity becomes a node that carries its label as an attribute and is connected by an edge to a shared hub node.

### 5. Visualization and Analysis
- **5.1 Interactive Knowledge Graph Visualization**: An interactive visualization of the knowledge graph is created using `plotly`, so the graph can be panned, zoomed, and inspected on hover.
- **5.2 Topic Modeling**: The `gensim` library is employed for topic modeling. The script treats each non-empty line of text as a document, removes stop words, builds a bag-of-words corpus, and fits an LDA (Latent Dirichlet Allocation) model with five topics.

### 6. Execution in Google Colab
- The cells are meant to be run in order in a Google Colab notebook, because each step depends on variables defined by earlier ones: the fetched HTML feeds text extraction, the extracted text feeds the NLP steps, and the resulting entities feed the knowledge graph and its visualization.

Taken together, the notebook shows how these Python libraries can be combined to extract, process, analyze, and visualize web content.

# Step 1: Environment Setup

1.1 Install Necessary Libraries

In [1]:
!pip install requests bs4 spacy networkx matplotlib textblob plotly gensim nltk
!python -m spacy download en_core_web_sm

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=2d6eef8f3fa9040da695ddcb225a4fce3f49e6bda97f265f8d893ae1dede029e
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
2023-12-15 18:46:35.776927: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-15 18:46:35.777030: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-15 18:46:35.779670: E

1.2 Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.matcher import Matcher
import networkx as nx
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from textblob import TextBlob
import gensim
from gensim import corpora
from gensim.models import LdaModel
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Step 2: Data Acquisition and Processing

2.1 Fetch Webpage Content

In [3]:
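def fetch_webpage_content(url, timeout=10):
    """Return the page HTML, or None if the request fails or is not HTTP 200."""
    try:
        # A timeout keeps the cell from hanging if the server does not respond
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None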

webpage_content = fetch_webpage_content('https://civichonors.com/')

2.2 Extract Text from HTML

In [4]:
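def extract_text_from_html(html_content):
    """Parse the HTML and return its visible text without script or style content."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove <script> and <style> elements so their contents do not leak into the text
    for script_or_style in soup(['script', 'style']):
        script_or_style.decompose()
    return soup.get_text()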

extracted_text = extract_text_from_html(webpage_content)
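
Text returned by `get_text()` usually keeps the page's blank lines and indentation, and the topic modeling step later splits the text on newlines. The following is a minimal sketch of a cleanup pass, not part of the original notebook; the helper name `normalize_extracted_text` and the sample string are assumptions made for illustration.

```python
import re

def normalize_extracted_text(text):
    """Collapse whitespace runs inside each line and drop empty lines."""
    # Hypothetical helper: not defined in the original notebook
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Made-up sample that imitates whitespace left behind by HTML extraction
sample = "  Civic   Honors \n\n\n\t\n  Community\tengagement  \n"
print(normalize_extracted_text(sample))
```

If such a pass were used, it could be applied to `extracted_text` before the NLP steps so that each remaining line holds one block of visible text.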

# Step 3: Advanced Natural Language Processing

3.1 Custom Entity Recognition and Relationship Extraction

In [5]:
nlp = spacy.load("en_core_web_sm")

def define_custom_patterns():
    matcher = Matcher(nlp.vocab)

    award_pattern = [{"LOWER": "award"}]
    civic_term_pattern = [{"LOWER": "civic"}, {"IS_ALPHA": True}]
    organization_pattern = [{"LOWER": "association"}, {"IS_ALPHA": True, "OP": "?"}]

    matcher.add("AWARD", [award_pattern])
    matcher.add("CIVIC_TERM", [civic_term_pattern])
    matcher.add("ORGANIZATION", [organization_pattern])

    return matcher

custom_matcher = define_custom_patterns()

def advanced_nlp_processing(text):
    doc = nlp(text)
    standard_entities = [(ent.text, ent.label_) for ent in doc.ents]
    matches = custom_matcher(doc)
    custom_entities = [(doc[start:end].text, nlp.vocab.strings[match_id]) for match_id, start, end in matches]
    return standard_entities + custom_entities

advanced_entities = advanced_nlp_processing(extracted_text)
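
`advanced_entities` is a list of `(text, label)` tuples, so a quick tally can show which labels and entity texts dominate before the graph is built. The sketch below uses only the standard library; `summarize_entities` and the sample list are hypothetical and are not taken from the notebook's actual output.

```python
from collections import Counter

def summarize_entities(entities):
    """Count entities per label and per case-folded entity text."""
    # Hypothetical helper: not defined in the original notebook
    label_counts = Counter(label for _, label in entities)
    text_counts = Counter(text.strip().lower() for text, _ in entities)
    return label_counts, text_counts

# Made-up sample in the same (text, label) shape as advanced_entities
sample_entities = [
    ("Civic Honors", "ORG"),
    ("civic honors", "CIVIC_TERM"),
    ("award", "AWARD"),
    ("New York", "GPE"),
    ("Award", "AWARD"),
]
label_counts, text_counts = summarize_entities(sample_entities)
print(label_counts.most_common())
print(text_counts.most_common(2))
```

Because the standard entities and the custom matches are simply concatenated, the same span can appear under two labels, which a tally like this makes visible.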

3.2 Sentiment Analysis

In [6]:
def analyze_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

# Example: Analyze the sentiment of the entire extracted text
sentiment_score = analyze_sentiment(extracted_text)
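
TextBlob reports polarity as a float between -1.0 (most negative) and 1.0 (most positive). One possible way to turn `sentiment_score` into a readable label is sketched below; the function name `describe_polarity` and the `neutral_band` threshold of 0.05 are assumptions chosen for illustration, not values from the notebook or from TextBlob.

```python
def describe_polarity(score, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to a coarse text label."""
    # Hypothetical helper: the 0.05 band is an arbitrary illustrative threshold
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity is expected to lie in [-1.0, 1.0]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# Made-up scores, not results from the analyzed page
for example_score in (0.3, 0.0, -0.2):
    print(example_score, describe_polarity(example_score))
```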

# Step 4: Knowledge Graph Creation and Management

4.1 Create Advanced Knowledge Graph

In [7]:
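def create_and_store_knowledge_graph(advanced_entities):
    """Build an undirected graph that links every entity to one shared hub node."""
    G = nx.Graph()
    for entity, label in advanced_entities:
        # Repeated entity text maps to the same node; the most recent label wins
        G.add_node(entity, label=label)
        # "extracted_from" is itself a node here, so it acts as the hub of a star graph
        G.add_edge(entity, "extracted_from", label=label)
    return G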

knowledge_graph = create_and_store_knowledge_graph(advanced_entities)
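
The function name mentions storing, but the cell only builds the graph in memory. One possible way to persist the same structure is a node-link JSON document, sketched here with the standard library only. The helper `entities_to_node_link`, the dictionary layout, and the sample entities are assumptions for illustration; the notebook itself does not save the graph.

```python
import json

def entities_to_node_link(entities, hub="extracted_from"):
    """Mirror the star-shaped graph as a JSON-serializable node-link dict."""
    # Hypothetical helper: not defined in the original notebook
    labels = {}
    for text, label in entities:
        labels[text] = label  # later duplicates overwrite earlier labels
    nodes = [{"id": hub}] + [{"id": t, "label": l} for t, l in labels.items()]
    links = [{"source": t, "target": hub, "label": l} for t, l in labels.items()]
    return {"nodes": nodes, "links": links}

# Made-up sample in the same (text, label) shape as advanced_entities
sample = [("Civic Honors", "ORG"), ("award", "AWARD"), ("award", "AWARD")]
payload = json.dumps(entities_to_node_link(sample), indent=2)
restored = json.loads(payload)
print(len(restored["nodes"]), len(restored["links"]))
```

The resulting string could be written to a file and reloaded later to rebuild the nodes and edges.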

# Step 5: Visualization and Analysis

5.1 Interactive Knowledge Graph Visualization

In [8]:
def interactive_visualize_knowledge_graph(knowledge_graph):
    edge_x = []
    edge_y = []
    for edge in knowledge_graph.edges():
        x0, y0 = knowledge_graph.nodes[edge[0]]['pos']
        x1, y1 = knowledge_graph.nodes[edge[1]]['pos']
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    node_x = []
    node_y = []
    for node in knowledge_graph.nodes():
        x, y = knowledge_graph.nodes[node]['pos']
        node_x.append(x)
        node_y.append(y)

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        text=list(knowledge_graph.nodes()),  # node names shown on hover
        hoverinfo='text',
        marker=dict(size=10, color='#1f77b4'))

    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=0,l=0,r=0,t=0)))

    fig.show()

# Example: Visualize the knowledge graph
pos = nx.spring_layout(knowledge_graph)  # Set the positions of the nodes
for node in knowledge_graph.nodes:
    knowledge_graph.nodes[node]['pos'] = list(pos[node])
interactive_visualize_knowledge_graph(knowledge_graph)
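
The edge loop in the cell above flattens all edges into two coordinate lists and appends `None` after each pair of endpoints, so that a single line trace is drawn as separate segments rather than one continuous path. The same idea is sketched below in plain Python; `edge_segments` and the three sample positions are hypothetical and stand in for the layout computed by `nx.spring_layout`.

```python
def edge_segments(positions, edges):
    """Flatten edges into x and y lists, with None separating the segments."""
    # Hypothetical helper: mirrors the edge loop in the visualization function
    xs, ys = [], []
    for start, end in edges:
        (x0, y0), (x1, y1) = positions[start], positions[end]
        xs.extend([x0, x1, None])
        ys.extend([y0, y1, None])
    return xs, ys

# Made-up positions standing in for a computed layout
positions = {"A": (0.0, 0.0), "B": (1.0, 0.5), "C": (0.5, 1.0)}
xs, ys = edge_segments(positions, [("A", "B"), ("A", "C")])
print(xs)
print(ys)
```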

5.2 Topic Modeling

In [10]:
nltk.download('stopwords')

def preprocess_text(text):
    # Load the stop-word list once per call instead of once per token
    stop_words = set(stopwords.words('english'))
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text.lower())
    return [token for token in tokens if token not in stop_words]

def perform_topic_modeling(text, num_topics=5, passes=15):
    documents = text.split('\n')
    processed_docs = [preprocess_text(doc) for doc in documents if doc.strip()]
    dictionary = corpora.Dictionary(processed_docs)
    corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=passes)
    return lda_model.print_topics(num_words=4)

topics = perform_topic_modeling(extracted_text)

# Print each topic line by line
for i, topic in enumerate(topics):
    print(f"Topic {i+1}:")
    words = topic[1].split('+')
    for word in words:
        print(f" - {word.strip()}")
    print()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Topic 1:
 - 0.052*"community"
 - 0.038*"civic"
 - 0.029*"within"
 - 0.028*"honors"

Topic 2:
 - 0.044*"program"
 - 0.034*"civic"
 - 0.031*"honors"
 - 0.023*"community"

Topic 3:
 - 0.027*"civic"
 - 0.017*"new"
 - 0.017*"york"
 - 0.016*"engagement"

Topic 4:
 - 0.071*"community"
 - 0.029*"civic"
 - 0.028*"program"
 - 0.024*"honors"

Topic 5:
 - 0.011*"technology"
 - 0.008*"future"
 - 0.007*"database"
 - 0.007*"people"
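
Each topic comes back as a tuple holding a topic id and a string of `weight*"word"` terms joined by `+`, which is why the printing loop splits on `+`. If numeric weights are needed, the string can be parsed with a regular expression, as in the sketch below. The helper `parse_topic` is an assumption, the sample string copies the Topic 1 output above, and the topic id `0` is assumed for illustration.

```python
import re

# Matches terms such as 0.052*"community"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    # Hypothetical helper: not defined in the original notebook
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

# Terms copied from the Topic 1 output; the id 0 is an assumption
sample_topic = (0, '0.052*"community" + 0.038*"civic" + 0.029*"within" + 0.028*"honors"')
pairs = parse_topic(sample_topic[1])
print(pairs)
```

LDA training is randomized, so topics and weights can differ between runs unless a seed is fixed.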

