<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsAdvanced_v002.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Civic Honors Webpage Advanced Analysis v002

This version of the Google Colab notebook extends the earlier webpage analysis with additional Natural Language Processing (NLP) techniques. It keeps the original map, shuffle-and-sort, and reduce structure, and adds pattern matching along with placeholders for sentiment analysis, dependency parsing, and word embeddings. Here's a summary of the updated process:

Setup: The notebook begins by installing spaCy, an NLP library, and downloading its small English language model, en_core_web_sm.

Import Libraries: The notebook imports requests for fetching webpage content, BeautifulSoup for HTML parsing, spacy and its Matcher for NLP tasks, and Counter for counting occurrences.

Function Definitions:

* Fetch Webpage Content: Retrieves the HTML from https://civichonors.com/ and parses it with BeautifulSoup to extract the text of the page's main element.
* Enhanced Map Phase with Advanced NLP: Processes the text with spaCy, performing:
  * Noun Chunk, Named Entity, and Verb Extraction: Identifying key elements of the text.
  * Sentiment Analysis: Assessing the emotional tone of the text. spaCy has no built-in sentiment analyzer, so this requires an extension or another library such as TextBlob; the code below leaves it as a placeholder.
  * Dependency Parsing: Analyzing the grammatical structure to understand relationships between words in a sentence. This is also a placeholder in the code below.
  * Word Embeddings Utilization: Using word vectors to explore semantic similarities within the text. The en_core_web_sm model does not include word vectors, so a larger model such as en_core_web_md would be needed; this is also a placeholder.
  * Pattern Matching: A Matcher identifies specific patterns, here a noun immediately followed by a verb.
* Updated Shuffle and Sort Function: Counts the extracted noun chunks, entities, and verbs and de-duplicates the relations, preparing the data for the reduction phase.
* Advanced Reduce Phase Function: Keeps the most significant elements: the ten most common noun chunks, entities, and verbs, plus the set of relations.

Execution and Output: The final step runs the functions in sequence and prints the results: relations, top noun chunks, named entities, and verbs. Sentiment or other results can be added to the output once the corresponding placeholders are implemented.

Together, these steps give a frequency-based summary of the page's key terms and its noun-verb relations. Once implemented, sentiment analysis and dependency parsing would add insight into the emotional tone and the grammatical relationships within the text, and word embeddings would allow semantic similarities to be explored.

This approach is useful for detailed content analysis and thematic exploration of web-based text.
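The semantic-similarity idea behind word embeddings comes down to comparing vectors, typically by cosine similarity. The sketch below uses only the standard library and made-up three-dimensional vectors to show the computation. The `toy_vectors` table and its values are hypothetical and are not taken from any spaCy model, and `cosine_similarity` and `most_similar` are illustrative helpers rather than part of the notebook.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
toy_vectors = {
    "community": [0.9, 0.1, 0.3],
    "society":   [0.8, 0.2, 0.4],
    "software":  [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )

print(most_similar("community", toy_vectors))  # prints "society"
```

With a spaCy model that ships word vectors (for example en_core_web_md), `token.similarity(other_token)` gives a comparable score without any hand-written vectors.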

# Step 1: Setup

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.matcher import Matcher
from collections import Counter

# Step 3: Define Functions

Fetch Webpage Content Function

In [3]:
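def fetch_webpage_content():
    """Return the text of the page's <main> element, '' if absent, or None on failure."""
    url = 'https://civichonors.com/'
    try:
        # A timeout keeps the cell from hanging if the site does not respond
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    main_content = soup.find('main')
    return main_content.get_text() if main_content else ''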

Enhanced Map Phase Function with Advanced NLP

In [4]:
def map_phase(content):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(content)

    # Extracting noun chunks, named entities, verbs
    noun_chunks = [chunk.text for chunk in doc.noun_chunks]
    named_entities = [entity.text for entity in doc.ents]
    verbs = [token.lemma_ for token in doc if token.pos_ == 'VERB']

    # Not yet implemented in this version:
    # - Sentiment analysis (using TextBlob or another library)
    # - Dependency parsing
    # - Word embeddings

    # Matcher for relations: a noun immediately followed by a verb
    matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'NOUN'}, {'POS': 'VERB'}]
    matcher.add("NounVerbPattern", [pattern])
    relations = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        relations.append(span.text)

    data = {
        'noun_chunks': noun_chunks,
        'named_entities': named_entities,
        'verbs': verbs,
        'relations': relations,
        # Add other analysis results here
    }
    return data
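
The Matcher pattern above only catches a noun immediately followed by a verb. Once the dependency-parsing placeholder is filled in, one option is to pair each verb with its grammatical subject instead. The sketch below is an assumption about how that could look, not the notebook's implementation: `subject_verb_pairs` is a hypothetical helper that relies only on the `dep_`, `head`, `pos_`, `text`, and `lemma_` token attributes that spaCy exposes, and `parse_with_spacy` assumes the en_core_web_sm model from Step 1 is installed.

```python
def subject_verb_pairs(doc):
    """Collect (subject text, verb lemma) pairs from a dependency-parsed document.

    `doc` is any iterable of tokens exposing spaCy's `dep_`, `head`, `pos_`,
    `text`, and `lemma_` attributes, such as a spaCy Doc.
    """
    pairs = []
    for token in doc:
        # 'nsubj' marks a nominal subject; its head is the word it depends on.
        if token.dep_ == 'nsubj' and token.head.pos_ == 'VERB':
            pairs.append((token.text, token.head.lemma_))
    return pairs


def parse_with_spacy(text):
    """Run the helper on real text (assumes spaCy and en_core_web_sm are installed)."""
    import spacy  # imported lazily so the helper above has no hard dependency
    nlp = spacy.load('en_core_web_sm')
    return subject_verb_pairs(nlp(text))
```

For a sentence such as "The community quickly builds trust", the adjacency pattern misses the relation because an adverb separates the noun from the verb, whereas the subject-based pairing should still find it, provided the parser labels "community" as the subject of "builds".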

Revised Shuffle and Sort Function

In [5]:
def shuffle_and_sort(mapped_data):
    organized_data = {
        'noun_chunks': Counter(mapped_data.get('noun_chunks', [])),
        # Entities are plain strings, so count them whole (item[0] would count first characters)
        'named_entities': Counter(mapped_data.get('named_entities', [])),
        'verbs': Counter(mapped_data.get('verbs', [])),
        'relations': set(mapped_data.get('relations', []))
    }

    return organized_data
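
To see what this step produces, the following sketch runs the same Counter-based grouping on a small hand-written input. The `sample_mapped` values are made up for illustration, and `count_items` is a hypothetical stand-in for the function above. Note that entities arrive as plain strings, so they are counted whole; taking `item[0]` of a string would count first characters instead.

```python
from collections import Counter

# Hypothetical map-phase output, shaped like the dictionary map_phase returns.
sample_mapped = {
    'noun_chunks': ['the community', 'a program', 'the community'],
    'named_entities': ['Civic Honors', 'Colab', 'Civic Honors'],
    'verbs': ['build', 'create', 'build'],
    'relations': ['community builds', 'program creates', 'community builds'],
}

def count_items(mapped_data):
    """Group map-phase lists into frequency counts and a de-duplicated relation set."""
    return {
        'noun_chunks': Counter(mapped_data.get('noun_chunks', [])),
        'named_entities': Counter(mapped_data.get('named_entities', [])),
        'verbs': Counter(mapped_data.get('verbs', [])),
        'relations': set(mapped_data.get('relations', [])),
    }

organized = count_items(sample_mapped)
print(organized['named_entities'])     # counts whole entity strings
print(sorted(organized['relations']))  # duplicates collapse in the set

# Counting item[0] of each string tallies first characters, not entities.
first_chars = Counter(item[0] for item in sample_mapped['named_entities'])
print(first_chars)
```

Here "Civic Honors" is counted twice and "Colab" once, while the first-character version lumps all three entries under "C".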

Advanced Reduce Phase Function

In [6]:
def reduce_phase(organized_data, mapped_data):
    reduced_data = {
        'relations': set(mapped_data['relations']),
        'noun_chunks': organized_data['noun_chunks'].most_common(10),
        'named_entities': organized_data['named_entities'].most_common(10),
        'verbs': organized_data['verbs'].most_common(10),
        # Other reductions...
    }
    return reduced_data
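
`Counter.most_common(n)` returns the `n` highest counts as `(item, count)` tuples, which is why the execution step can unpack each result into two names. A minimal illustration with made-up verb lemmas (the `verbs` list is hypothetical):

```python
from collections import Counter

# Hypothetical verb lemmas, as the map phase might emit them.
verbs = ['build', 'have', 'build', 'need', 'have', 'build']

# The two most frequent lemmas, highest count first.
top_two = Counter(verbs).most_common(2)
for verb, count in top_two:
    print(f"{verb} (Count: {count})")
# prints:
# build (Count: 3)
# have (Count: 2)
```

Because relations are kept in a set, their print order is not guaranteed from run to run; wrapping them in `sorted()` before printing gives stable output.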

# Step 4: Execute the Process

In [7]:
content = fetch_webpage_content()
if content:
    mapped_data = map_phase(content)
    organized_data = shuffle_and_sort(mapped_data)
    reduced_data = reduce_phase(organized_data, mapped_data)

    print("Relations Found:")
    for relation in reduced_data['relations']:
        print(relation)

    print("\nTop Noun Chunks:")
    for noun_chunk, count in reduced_data['noun_chunks']:
        print(f"{noun_chunk} (Count: {count})")

    print("\nTop Named Entities:")
    for entity, count in reduced_data['named_entities']:
        print(f"{entity} (Count: {count})")

    print("\nTop Verbs:")
    for verb, count in reduced_data['verbs']:
        print(f"{verb} (Count: {count})")

    # Include additional categories as required
else:
    print("Failed to fetch content")

Relations Found:
Responsibility lies
community has
action asking
leaders give
information makes
people involved
program comes
community builds
information develops
process has
action expands
skepticism exists
change provides
information travels
community comes
movement requires
action becomes
leaders collaborating
mobilization supporting
community creates
change using
dream became
program appear
vision required
Organizations need
institution has
organization becomes
potential exists
organizations listing
media have
institution running
Participation occurs
organization steps
individuals feel
program showing
program does
issues related
community manages
service learning
organizations changes
software develops
access builds
message spreads
organizations needs
individuals trying
need identified
individuals have
people interact
activism outlasts
program raises
organization working
problems need
possibility develops
volunteers participating
program extends
society sets
individuals become
ind