<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/KnowledgeReduce_CivicHonorsEnhancedAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Civic Honors Webpage: Step-by-Step Enhanced Analysis

The updated Google Colab notebook takes a more thorough approach to analyzing webpage content, adding sentiment analysis and topic modeling to basic natural language processing (NLP). The goal is to derive deeper insights from the content of https://civichonors.com/. The process is summarized below:

Setup: The first step installs the required Python packages (requests, spacy, beautifulsoup4, nltk, and scikit-learn), then downloads the spaCy en_core_web_sm model and the NLTK vader_lexicon. These libraries fetch and parse the webpage content and perform the NLP tasks.

Import Libraries: This step imports the installed libraries, which will be used in the subsequent steps for various tasks like web scraping, text processing, sentiment analysis, and topic modeling.

Function Definitions:

* Fetch Webpage Content: This function requests the HTML from the specified URL, parses it with BeautifulSoup, and returns the text inside the page's main element. It returns an empty string when the page has no main element and None when the request does not succeed.

* Enhanced Map Phase: The webpage's text is processed to extract meaningful data including noun chunks, named entities, verbs, and sentences. Additionally, this phase conducts sentiment analysis on each sentence using NLTK's SentimentIntensityAnalyzer and prepares data for topic modeling.

* Topic Modeling: Within the map phase, Latent Dirichlet Allocation (LDA) is used to identify five topics in the text. The sentences are first transformed into a bag-of-words matrix, LDA is fitted to that matrix, and each topic is reported as its ten highest-weighted words.

* Shuffle and Sort: The extracted data (noun chunks, entities, and verbs) is organized by frequency using Counter from the collections module.

* Advanced Reduce Phase: This phase aggregates the results, including the most common noun chunks, entities, verbs, the average sentiment score across all sentences, and the identified topics from the topic modeling.

Execution and Output:

* The process is executed in sequence: fetching content, mapping (detailed NLP analysis including sentiment analysis and topic modeling), shuffling and sorting (organizing the data), and reducing (summarizing and presenting key insights).
* The results are displayed in a structured format, highlighting top noun chunks, named entities, verbs, the average sentiment of the text, and the main topics identified. This offers a comprehensive view of the content, from its basic structure to its underlying themes and sentiment.

This enhanced notebook combines basic NLP tasks (noun chunk, entity, and verb extraction) with sentiment analysis and topic modeling. Together these capture not only the key elements of the text but also its general tone and main topics.
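
The map, shuffle-and-sort, and reduce structure summarized above can be sketched on toy data with only the standard library. This is a minimal illustration rather than the notebook's code: `toy_map`, `toy_shuffle_and_sort`, and `toy_reduce` are hypothetical names, and plain word splitting stands in for the spaCy-based extraction.

```python
from collections import Counter

def toy_map(text):
    # Map: turn raw text into a flat list of lowercase word tokens.
    # (Stand-in for the spaCy extraction; an assumption for illustration.)
    words = (word.strip('.,').lower() for word in text.split())
    return [word for word in words if word]

def toy_shuffle_and_sort(tokens):
    # Shuffle and sort: group identical tokens and count them.
    return Counter(tokens)

def toy_reduce(counts, n=2):
    # Reduce: keep only the n most frequent tokens.
    return counts.most_common(n)

text = "The community helps the program. The program helps individuals."
result = toy_reduce(toy_shuffle_and_sort(toy_map(text)))
print(result)
```

Each stage takes the previous stage's output as its only input, which is the same shape the notebook's functions follow.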

# Step 1: Setup

In [1]:
!pip install requests spacy beautifulsoup4 nltk scikit-learn
!python -m spacy download en_core_web_sm
!python -m nltk.downloader vader_lexicon

Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from collections import Counter
import numpy as np

# Step 3: Define Functions

Fetch Webpage Content Function

In [3]:
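def fetch_webpage_content():
    """Return the text of the <main> element, '' if there is none, or None on a non-200 response."""
    url = 'https://civichonors.com/'
    # A timeout keeps the cell from hanging if the server never responds.
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Only the <main> element is analyzed; navigation and footer text are skipped.
        main_content = soup.find('main')
        return main_content.get_text() if main_content else ''
    else:
        return None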
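
To see what the parsing step returns without making a network request, the sketch below applies the same idea to a small HTML string. It is an illustration under assumptions: the standard library's `HTMLParser` stands in for BeautifulSoup so the example runs offline with no third-party packages, and `MainTextExtractor` and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect the text that appears inside <main> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while the parser is inside <main>
        self.found = False  # True once a <main> tag has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'main':
            self.depth += 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == 'main' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

def extract_main_text(html):
    # Text when a <main> element exists, '' when it does not.
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts) if parser.found else ''

page = "<html><body><nav>Menu</nav><main><h1>Civic Honors</h1><p>Hello.</p></main></body></html>"
print(repr(extract_main_text(page)))
print(repr(extract_main_text("<html><body><p>No main here</p></body></html>")))
```

The text pieces are joined without a separator, so adjacent elements run together. Passing a separator when joining would be one way to keep them apart.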

Enhanced Map Phase Function

In [4]:
nlp = spacy.load('en_core_web_sm')
sia = SentimentIntensityAnalyzer()

def map_phase(content):
    doc = nlp(content)
    data = {
        'noun_chunks': [chunk.text for chunk in doc.noun_chunks],
        'named_entities': [(entity.text, entity.label_) for entity in doc.ents],
        'verbs': [token.lemma_ for token in doc if token.pos_ == 'VERB'],
        'sentences': [sent.text for sent in doc.sents]
    }

    sentiments = [sia.polarity_scores(sent) for sent in data['sentences']]
    data['sentiments'] = sentiments

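    # Bag-of-words over sentences: drop terms that occur in more than 95% of
    # sentences or in fewer than 2 sentences, plus English stop words.
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    dtm = vectorizer.fit_transform(data['sentences'])
    LDA = LatentDirichletAllocation(n_components=5, random_state=42)
    LDA.fit(dtm)

    # Look up the vocabulary once instead of on every loop iteration.
    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for index, topic in enumerate(LDA.components_):
        # argsort is ascending, so the last 10 indices are the 10 heaviest
        # words, listed from lowest to highest weight.
        top_words_indices = topic.argsort()[-10:]
        top_words = [feature_names[i] for i in top_words_indices]
        topics.append(f"Topic {index + 1}: {' '.join(top_words)}")
    data['topics'] = topics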

    return data
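
The bag-of-words step can be hard to picture, so here is a rough pure-Python sketch of a document-term matrix with document-frequency filtering. It is a simplification under stated assumptions: `document_term_counts` is a hypothetical helper, whitespace splitting replaces CountVectorizer's tokenizer, and stop-word removal is omitted (so 'the' survives here, whereas stop_words='english' would drop it).

```python
from collections import Counter

def document_term_counts(sentences, min_df=2, max_df=0.95):
    # Tokenize each sentence into lowercase words (a simplification;
    # no stop-word removal in this sketch).
    tokenized = [sentence.lower().split() for sentence in sentences]

    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Keep terms found in at least min_df sentences and in at most
    # max_df (a proportion) of all sentences.
    max_count = max_df * len(sentences)
    vocabulary = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

    # One row per sentence, one column per vocabulary term.
    rows = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows

sentences = [
    "civic program builds community",
    "community program helps individuals",
    "individuals join the civic program",
    "the university program supports community",
]
vocabulary, rows = document_term_counts(sentences)
print(vocabulary)
for row in rows:
    print(row)
```

In this toy input, 'program' occurs in every sentence and is removed by the max_df threshold, while words that occur in a single sentence are removed by min_df. A matrix of this shape is what the LDA model is fitted on.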

Enhanced Shuffle and Sort Function

In [5]:
def shuffle_and_sort(mapped_data):
    organized_data = {
        'noun_chunks': Counter(mapped_data['noun_chunks']),
        'named_entities': Counter([item[0] for item in mapped_data['named_entities']]),
        'verbs': Counter(mapped_data['verbs'])
    }
    return organized_data
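
One detail of counting with Counter is that keys are compared exactly, so strings that differ only in case are tallied separately. The sketch below shows this on hypothetical entity strings. The case-folding step is an optional normalization offered as an assumption, not something the notebook does.

```python
from collections import Counter

# Hypothetical entity strings, shaped like the first element of the
# (text, label) tuples the map phase produces.
entities = ['One', 'one', 'New York', 'one', 'One', 'New York', 'first']

# Counter keys are case-sensitive, so 'One' and 'one' are counted apart.
raw_counts = Counter(entities)

# Optional normalization (an assumption, not part of the notebook):
# case-fold each string before counting so the variants merge.
folded_counts = Counter(entity.casefold() for entity in entities)

print(raw_counts.most_common(3))
print(folded_counts.most_common(3))
```

Folding case also lowercases proper nouns such as 'New York', so whether it helps depends on how the counts will be read.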

Advanced Reduce Phase Function

In [6]:
def reduce_phase(organized_data, mapped_data):
    reduced_data = {
        'top_noun_chunks': organized_data['noun_chunks'].most_common(10),
        'top_entities': organized_data['named_entities'].most_common(10),
        'top_verbs': organized_data['verbs'].most_common(10),
        'average_sentiment': np.mean([s['compound'] for s in mapped_data['sentiments']]),
        'topics': mapped_data['topics']
    }
    return reduced_data
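
The average sentiment is the mean of the per-sentence compound scores, each of which lies between -1 and 1. A minimal sketch of that calculation follows, using hypothetical scores and a hypothetical helper, `average_compound`. It uses `statistics.fmean` and guards the empty case, since `np.mean` of an empty list returns nan with a warning.

```python
from statistics import fmean

def average_compound(sentiments):
    # Each item is a dict shaped like VADER's polarity_scores() output,
    # with 'neg', 'neu', 'pos', and 'compound' keys.
    compounds = [scores['compound'] for scores in sentiments]
    # Return None rather than nan when there are no sentences to average.
    return fmean(compounds) if compounds else None

# Hypothetical scores for three sentences (illustrative values, not real output).
sample = [
    {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.6},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.3},
]
print(average_compound(sample))
print(average_compound([]))
```

Because positive and negative sentences cancel out in a mean, an average near zero can describe either a neutral text or a mixed one.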

# Step 4: Execute the Process

In [7]:
content = fetch_webpage_content()
if content:
    mapped_data = map_phase(content)
    organized_data = shuffle_and_sort(mapped_data)
    reduced_data = reduce_phase(organized_data, mapped_data)

    print("Top Noun Chunks:", reduced_data['top_noun_chunks'])
    print("\nTop Named Entities:", reduced_data['top_entities'])
    print("\nTop Verbs:", reduced_data['top_verbs'])
    print("\nAverage Sentiment Score:", reduced_data['average_sentiment'])
    print("\nIdentified Topics:")
    for topic in reduced_data['topics']:
        print(topic)
else:
    print("Failed to fetch content, or the page has no <main> element")

Top Noun Chunks: [('the community', 489), ('the civic honors program', 156), ('individuals', 152), ('that', 135), ('organizations', 102), ('it', 98), ('the program', 83), ('what', 64), ('the potential', 64), ('the university', 61)]

Top Named Entities: [('first', 20), ('one', 15), ('One', 15), ('New York', 15), ('1999', 6), ('Johnson County Community College', 5), ('2000', 5), ('1989', 5), ('two', 5), ('1984', 4)]

Top Verbs: [('have', 187), ('develop', 121), ('become', 74), ('benefit', 70), ('allow', 70), ('work', 65), ('participate', 60), ('think', 52), ('build', 49), ('take', 47)]

Average Sentiment Score: 0.40203881278538817

Identified Topics:
Topic 1: model level individuals time society change university able participation community
Topic 2: information implementation message individuals university organizations community honors civic program
Topic 3: participation potential society graduation program engagement individuals honors community civic
Topic 4: understanding society o