<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsAdvanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

This notebook, written for Google Colab, analyzes the content of a webpage by combining web scraping, natural language processing (NLP), and a MapReduce-style organization of the results. The code runs step by step, which makes it easier to follow and debug. In summary:

Setup: The code installs the required Python packages: spaCy for NLP tasks and en_core_web_sm, a small spaCy pipeline for English.

Import Libraries: The notebook imports requests for fetching webpage content, BeautifulSoup for HTML parsing, spacy and its Matcher for NLP and pattern matching, and Counter from the collections module for counting occurrences.

Function Definitions:

* Fetch Webpage Content: This function requests the HTML at the specified URL (https://civichonors.com/), parses it with BeautifulSoup, and returns the text of the page's main element.

* Enhanced Map Phase Function: This function processes the extracted text with spaCy. It collects noun chunks, named entities, and verb lemmas, and uses spaCy's Matcher to find a specific pattern: a noun immediately followed by a verb. Together these give a picture of the structure and content of the page.

* Updated Shuffle and Sort Function: This function organizes the mapped data by counting occurrences of noun chunks, named entities, and verbs with Counter, and by collecting the matched relations into a set of unique values. This prepares the data for the reduce step.

* Advanced Reduce Phase Function: This phase reduces the data to its most significant elements: the ten most common noun chunks, named entities, and verbs, plus the set of unique relations. It gives a summarized view of the key elements in the webpage content.

Execution and Output:

* The final execution step involves running the content fetching, mapping, shuffling and sorting, and reducing functions in sequence.
* The output is displayed in an organized manner, with each category (relations, top noun chunks, named entities, and verbs) printed line by line, providing a clear and detailed overview of the webpage's content.

The result is an NLP-based summary of the webpage's key linguistic elements and patterns. It is particularly useful for content analysis, thematic exploration, and understanding the structure and context of web-based text.

# Step 1: Setup

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.matcher import Matcher
from collections import Counter

# Step 3: Define Functions

Fetch Webpage Content Function

In [3]:
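def fetch_webpage_content():
    """Return the text of the page's <main> element, '' if it is absent, or None on a non-200 response."""
    url = 'https://civichonors.com/'
    # A timeout keeps the notebook cell from hanging if the server does not respond
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        main_content = soup.find('main')
        return main_content.get_text() if main_content else ''
    else:
        return None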
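
The function has three possible results: None for a non-200 response, an empty string for a page without a main element, and the extracted text otherwise. The sketch below illustrates the extraction rule offline, without a network request. It swaps BeautifulSoup for the standard library's html.parser, and the names MainTextExtractor and extract_main_text and the sample HTML are all hypothetical, so treat it as an approximation of what soup.find('main').get_text() returns rather than the notebook's own code.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text that appears inside <main> ... </main> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <main> elements we are currently inside
        self.parts = []  # text fragments found inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_main_text(html):
    """Return the concatenated text inside <main>, or '' if there is none."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up sample pages, not the real site content
sample_html = (
    "<html><body><nav>Menu</nav>"
    "<main><h1>Civic Honors</h1><p>A program.</p></main>"
    "</body></html>"
)
main_text = extract_main_text(sample_html)
no_main_text = extract_main_text("<html><body><p>No main here</p></body></html>")

print(repr(main_text))
print(repr(no_main_text))
```

Because both '' and None are falsy, the `if content:` check in Step 4 reports "Failed to fetch content" in either case.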

Enhanced Map Phase Function with Matcher Patterns

In [4]:
def map_phase(content):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(content)

    # Extracting noun chunks, named entities, and verbs
    noun_chunks = [chunk.text for chunk in doc.noun_chunks]
    named_entities = [entity.text for entity in doc.ents]
    verbs = [token.lemma_ for token in doc if token.pos_ == 'VERB']

    # Matcher for relations: a noun immediately followed by a verb
    matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'NOUN'}, {'POS': 'VERB'}]
    matcher.add("NounVerbPattern", [pattern])
    relations = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        relations.append(span.text)

    data = {
        'noun_chunks': noun_chunks,
        'named_entities': named_entities,
        'verbs': verbs,
        'relations': relations
    }
    return data
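
To make the two-token pattern concrete, here is a plain-Python sketch of what "NOUN followed by VERB" selects. It does not call spaCy: the tokens and their part-of-speech tags are written by hand for illustration (they are not model output), and find_noun_verb_pairs is a hypothetical helper, not part of spaCy's Matcher API.

```python
# Hand-tagged (text, coarse POS) pairs; the tags are assumed, not produced by a model.
tagged_tokens = [
    ("The", "DET"), ("program", "NOUN"), ("allows", "VERB"),
    ("citizens", "NOUN"), ("to", "PART"), ("interact", "VERB"),
    ("with", "ADP"), ("organizations", "NOUN"),
]


def find_noun_verb_pairs(tokens):
    """Return 'noun verb' strings for every NOUN immediately followed by a VERB."""
    pairs = []
    # zip(tokens, tokens[1:]) walks over each pair of adjacent tokens
    for (text_a, pos_a), (text_b, pos_b) in zip(tokens, tokens[1:]):
        if pos_a == "NOUN" and pos_b == "VERB":
            pairs.append(f"{text_a} {text_b}")
    return pairs


relations = find_noun_verb_pairs(tagged_tokens)
print(relations)  # ['program allows']
```

Only strictly adjacent tokens match, so "citizens to interact" yields nothing. The results also depend entirely on the tagger's decisions: pairs such as "grass roots" and "mind set" in the notebook output suggest that "roots" and "set" were tagged as verbs in those sentences.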

Revised Shuffle and Sort Function

In [5]:
def shuffle_and_sort(mapped_data):
    organized_data = {
        'noun_chunks': Counter(mapped_data.get('noun_chunks', [])),
        # Entities are plain strings, so count them whole (item[0] would count first letters)
        'named_entities': Counter(mapped_data.get('named_entities', [])),
        'verbs': Counter(mapped_data.get('verbs', [])),
        'relations': set(mapped_data.get('relations', []))
    }

    return organized_data
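
Since map_phase stores each entity as a plain string, it helps to see how Counter treats whole strings versus their first characters. A minimal sketch with made-up entity names (sample_entities is illustrative data, not output from the page):

```python
from collections import Counter

# Hypothetical entity texts, shaped like the list that map_phase returns
sample_entities = ["Colorado", "Civic Honors", "Colorado", "Congress"]

# Counting the strings themselves keeps distinct entities apart
whole_counts = Counter(sample_entities)

# Indexing each string with [0] counts first characters instead
first_char_counts = Counter(item[0] for item in sample_entities)

print(whole_counts.most_common(1))       # [('Colorado', 2)]
print(first_char_counts.most_common(1))  # [('C', 4)]
```

Indexing item[0] only makes sense when each entry is a tuple such as (text, label). With plain strings it collapses distinct entities into single letters.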

Advanced Reduce Phase Function

In [6]:
def reduce_phase(organized_data, mapped_data):
    reduced_data = {
        'relations': set(mapped_data['relations']),
        'noun_chunks': organized_data['noun_chunks'].most_common(10),
        'named_entities': organized_data['named_entities'].most_common(10),
        'verbs': organized_data['verbs'].most_common(10),
        # Other reductions...
    }
    return reduced_data
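
The reduction relies on Counter.most_common(n), which returns up to n (item, count) pairs in descending order of count. A short sketch with invented verb lemmas (the data is hypothetical):

```python
from collections import Counter

# Made-up verb lemmas standing in for the 'verbs' list from the map phase
verb_counts = Counter(["have", "need", "have", "become", "need", "have"])

top_two = verb_counts.most_common(2)
top_ten = verb_counts.most_common(10)  # fewer than ten distinct items exist

print(top_two)  # [('have', 3), ('need', 2)]
print(top_ten)  # [('have', 3), ('need', 2), ('become', 1)]
```

Asking for more items than exist simply returns all of them, so short pages need no special handling. Items with equal counts keep the order in which they were first encountered.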

# Step 4: Execute the Process

In [7]:
content = fetch_webpage_content()
if content:
    mapped_data = map_phase(content)
    organized_data = shuffle_and_sort(mapped_data)
    reduced_data = reduce_phase(organized_data, mapped_data)

    print("Relations Found:")
    for relation in reduced_data['relations']:
        print(relation)

    print("\nTop Noun Chunks:")
    for noun_chunk, count in reduced_data['noun_chunks']:
        print(f"{noun_chunk} (Count: {count})")

    print("\nTop Named Entities:")
    for entity, count in reduced_data['named_entities']:
        print(f"{entity} (Count: {count})")

    print("\nTop Verbs:")
    for verb, count in reduced_data['verbs']:
        print(f"{verb} (Count: {count})")

    # Include additional categories as required
else:
    print("Failed to fetch content")

Relations Found:
reality grows
organizations used
program allows
Trends resulting
individuals participating
organizations motivated
organization working
material needs
citizens interact
organizations be
concept relies
organizations need
organization becomes
community understanding
program comes
groups have
members want
institution has
program appear
grass roots
situation exists
tasks driven
issues associated
potential based
mind set
program involves
interests focused
Responsibility lies
program relies
community developed
definition takes
level comes
trends emerging
community becomes
program based
organizations needing
values added
centralization becomes
communication has
issues related
change occurs
Individuals Becoming
message develops
community spreading
program exist
strategies conceived
people involved
program exists
department becomes
organizations perceive
community sends
institution running
People need
program has
problems need
community work
community raises
program enables
org