<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/KnowledgeReduce_CivicHonorsContentExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Civic Honors Webpage: Step-by-Step Content Extraction Analysis

The updated code in this Google Colab notebook analyzes the content of the website https://civichonors.com/. It uses spaCy's natural language processing (NLP) pipeline to extract noun chunks, named entities, and verbs from the page text, count them, and report the most frequent of each. Here is a summary of the updated process:

Setup: The notebook begins by installing the necessary Python packages: requests for fetching webpage content, beautifulsoup4 for parsing HTML and extracting specific content, and spacy for natural language processing. The spaCy language model en_core_web_sm is also downloaded for English language processing.

Import Libraries: The required Python libraries are imported: requests for HTTP requests, BeautifulSoup from bs4 (the import name of the beautifulsoup4 package) for HTML parsing, Counter from collections for frequency counting, and spacy for NLP tasks.

Function Definitions:

* Fetch Webpage Content: This function retrieves the HTML content from https://civichonors.com/. Using BeautifulSoup, it parses the HTML and extracts the text of the page's main element. It returns an empty string if the page has no main element and None if the server does not return status 200.

* Map Phase: In this phase, the webpage's text is processed through spaCy's NLP pipeline. This function extracts noun chunks, named entities (with their types), and verbs (lemmatized) from the content.

* Shuffle and Sort: This function organizes the mapped data by counting the frequency of noun chunks, named entities, and verbs. It uses Counter to tally these elements, preparing the data for the next phase. Named entities are counted by their text only; their types are not used in the tally.

* Reduce Phase: This phase synthesizes the sorted data to derive insights. It identifies the top 10 most common noun chunks, named entities, and verbs, offering a summary of the most prominent themes or topics in the webpage content.

Execution and Output:

* The process is executed in sequence: fetching content, mapping (NLP analysis), shuffling and sorting (organizing data), and reducing (summarizing data).
* The results are printed in an organized manner, showcasing the top noun chunks, named entities, and verbs separately. This provides a multi-faceted view of the webpage's content, highlighting key phrases, important entities, and dominant actions or descriptions.

This approach uses NLP not just to extract phrases but also to identify and categorize entities and verbs, which gives a richer summary of the page than plain word counts and makes the results usable as input for further content analysis.
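To make the map, shuffle-and-sort, and reduce stages described above concrete, here is a minimal sketch that runs the same three steps on a tiny hand-written input. The tagged tokens and the function names toy_map, toy_shuffle, and toy_reduce are illustrative assumptions standing in for the spaCy output and the notebook's own functions; no NLP model is involved.

```python
from collections import Counter

# Hypothetical stand-in for tagged text: (lemma, part-of-speech) pairs.
TOKENS = [
    ("community", "NOUN"), ("develop", "VERB"), ("program", "NOUN"),
    ("community", "NOUN"), ("build", "VERB"), ("develop", "VERB"),
]

def toy_map(tokens):
    """Map: split the tagged tokens into per-category lists."""
    return {
        "nouns": [lemma for lemma, pos in tokens if pos == "NOUN"],
        "verbs": [lemma for lemma, pos in tokens if pos == "VERB"],
    }

def toy_shuffle(mapped):
    """Shuffle and sort: tally each category with a Counter."""
    return {key: Counter(values) for key, values in mapped.items()}

def toy_reduce(organized, n=1):
    """Reduce: keep the n most frequent items in each category."""
    return {key: counts.most_common(n) for key, counts in organized.items()}

result = toy_reduce(toy_shuffle(toy_map(TOKENS)))
print(result)  # {'nouns': [('community', 2)], 'verbs': [('develop', 2)]}
```

Each stage only depends on the output of the previous one, which is why the notebook can define them as separate functions and chain them in Step 4.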

# Step 1: Setup

In [9]:
!pip install requests spacy beautifulsoup4
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [10]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import spacy

# Step 3: Define Functions


Fetch Webpage Content Function

In [11]:
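def fetch_webpage_content():
    """Return the text of the page's main element, '' if it has none, or None on failure."""
    url = 'https://civichonors.com/'
    try:
        # A timeout keeps the cell from hanging if the server does not respond
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    main_content = soup.find('main')
    return main_content.get_text() if main_content else ''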
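The extraction step can be tried without a network call. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages, and it additionally collapses whitespace and skips script and style content. The class MainTextParser and the function extract_main_text are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class MainTextParser(HTMLParser):
    """Collect text found inside <main>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0   # > 0 while inside a <main> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.main_depth and not self.skip_depth:
            self.parts.append(data)

def extract_main_text(html):
    """Return whitespace-normalized text of <main>, or '' if there is none."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

sample_html = (
    "<html><body><nav>Menu</nav><main><h1>Civic Honors</h1>"
    "<script>var x = 1;</script><p>Community   service</p></main></body></html>"
)
print(extract_main_text(sample_html))  # Civic Honors Community service
```

Like the notebook's function, this returns an empty string when the page has no main element, so a caller can tell that case apart from a failed request.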

Expanded Map Phase Function

In [12]:
nlp = spacy.load('en_core_web_sm')

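def map_phase(content):
    # Run the text through the loaded spaCy pipeline
    doc = nlp(content)
    data = {
        'noun_chunks': [chunk.text for chunk in doc.noun_chunks],
        # Keep each entity's label alongside its text
        'named_entities': [(entity.text, entity.label_) for entity in doc.ents],
        # Lemmatize verbs so inflected forms are counted together
        'verbs': [token.lemma_ for token in doc if token.pos_ == 'VERB']
    }
    return data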

Enhanced Shuffle and Sort Function

In [13]:
def shuffle_and_sort(mapped_data):
    organized_data = {
        'noun_chunks': Counter(mapped_data['noun_chunks']),
        # Count entities by text only; the label in each (text, label) pair is dropped
        'named_entities': Counter(text for text, _label in mapped_data['named_entities']),
        'verbs': Counter(mapped_data['verbs'])
    }
    return organized_data
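Because the function above counts entities by their exact text and drops their labels, differently capitalized spellings are tallied separately and numeric entities compete with names for the top spots. A minimal sketch of one possible variant follows. The function name count_entities, the sample pairs, and the choice of labels to skip are assumptions made for illustration; CARDINAL, ORDINAL, and DATE are labels used by spaCy's English models, but which ones to exclude is a judgment call.

```python
from collections import Counter

# Illustrative (text, label) pairs shaped like map_phase's 'named_entities' list.
entities = [
    ("one", "CARDINAL"), ("One", "CARDINAL"), ("New York", "GPE"),
    ("first", "ORDINAL"), ("new york", "GPE"), ("1999", "DATE"),
]

# Assumption: numeric and date entities are not of interest here.
SKIP_LABELS = frozenset({"CARDINAL", "ORDINAL", "DATE"})

def count_entities(pairs, skip_labels=SKIP_LABELS):
    """Count entities case-insensitively, keeping labels and skipping some types."""
    counts = Counter()
    for text, label in pairs:
        if label in skip_labels:
            continue
        counts[(text.casefold(), label)] += 1
    return counts

print(count_entities(entities).most_common(10))
# [(('new york', 'GPE'), 2)]
```

Passing an empty skip_labels set keeps every entity type while still merging spellings that differ only in case.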

Advanced Reduce Phase Function

In [14]:
def reduce_phase(organized_data):
    reduced_data = {
        'top_noun_chunks': organized_data['noun_chunks'].most_common(10),
        'top_entities': organized_data['named_entities'].most_common(10),
        'top_verbs': organized_data['verbs'].most_common(10)
    }
    return reduced_data

# Step 4: Execute the Process

In [15]:
content = fetch_webpage_content()
if content:
    mapped_data = map_phase(content)
    organized_data = shuffle_and_sort(mapped_data)
    reduced_data = reduce_phase(organized_data)

    print("Top Noun Chunks:")
    for item in reduced_data['top_noun_chunks']:
        print(item)

    print("\nTop Named Entities:")
    for item in reduced_data['top_entities']:
        print(item)

    print("\nTop Verbs:")
    for item in reduced_data['top_verbs']:
        print(item)
else:
    print("Failed to fetch content (request failed or no main element found)")

Top Noun Chunks:
('the community', 489)
('the civic honors program', 156)
('individuals', 152)
('that', 135)
('organizations', 102)
('it', 98)
('the program', 83)
('what', 64)
('the potential', 64)
('the university', 61)

Top Named Entities:
('first', 20)
('one', 15)
('One', 15)
('New York', 15)
('1999', 6)
('Johnson County Community College', 5)
('2000', 5)
('1989', 5)
('two', 5)
('1984', 4)

Top Verbs:
('have', 187)
('develop', 121)
('become', 74)
('benefit', 70)
('allow', 70)
('work', 65)
('participate', 60)
('think', 52)
('build', 49)
('take', 47)
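In the output above, pronoun-only chunks such as 'that', 'it', and 'what' take up three of the ten noun-chunk slots. One way to keep them out of the summary is to filter them before counting, as in the sketch below. The filter list and the function name count_content_chunks are hypothetical, and lowercasing the chunks is an added assumption that would change the counts shown above. Inside the notebook itself, an alternative would be to skip chunks whose root token is a pronoun (chunk.root.pos_ == 'PRON').

```python
from collections import Counter

# Hypothetical filter list; extend it to suit the text being analyzed.
PRONOUN_CHUNKS = frozenset({"that", "it", "what", "which", "who", "they", "this"})

def count_content_chunks(chunks, drop=PRONOUN_CHUNKS):
    """Count noun chunks, ignoring case and skipping pronoun-only chunks."""
    normalized = (chunk.strip().lower() for chunk in chunks)
    return Counter(chunk for chunk in normalized if chunk and chunk not in drop)

# Illustrative input shaped like map_phase's 'noun_chunks' list.
sample = ["the community", "it", "The community", "that", "individuals", "what"]
top_chunks = count_content_chunks(sample).most_common(2)
print(top_chunks)  # [('the community', 2), ('individuals', 1)]
```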
