<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

This notebook extracts and processes text from a single website, https://civichonors.com/, using a variation of the MapReduce model. The goal is to summarize the page's content as a ranked list of its most frequent noun phrases. Here's a summary of each step in the process:

Setup: Installs the required Python packages (requests for HTTP requests and spacy for natural language processing) and downloads spaCy's en_core_web_sm English model.

Import Libraries: Imports requests for fetching the webpage content, collections.Counter for frequency counting, and spacy for natural language processing.

Define Functions: Several functions are defined to handle different parts of the process:

*   fetch_webpage_content(): Fetches the HTML content from https://civichonors.com/ and returns None if the request does not succeed.
*   map_phase(content): Processes the fetched content to extract noun phrases (noun chunks) using spaCy.
*   shuffle_and_sort(mapped_data): Counts how often each extracted phrase occurs, using collections.Counter.
*   reduce_phase(sorted_data): Keeps the ten most frequent phrases together with their counts.

Execute the Process: The defined functions are executed in sequence. The content is fetched, passed through the map phase to extract phrases, counted in the shuffle-and-sort phase, and reduced to the ten most frequent phrases. Each result is printed on its own line for readability.

Together, these steps give an automated way to extract and summarize the key phrases on a webpage, offering a quick view of its most prominent themes. The approach is useful for exploratory data analysis, for natural language processing applications, or for anyone who wants a fast sense of what a page is mainly about.
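For readers new to MapReduce, the three phases can be sketched on a toy in-memory input. This is an illustration under stated assumptions, not the notebook's code: it splits text into words with a regular expression instead of extracting noun chunks with spaCy, and the names `toy_map`, `toy_shuffle`, and `toy_reduce` are hypothetical.

```python
import re
from collections import Counter


def toy_map(chunk):
    """Map: emit one lowercase word per match (stand-in for noun chunks)."""
    return re.findall(r"[a-z']+", chunk.lower())


def toy_shuffle(mapped):
    """Shuffle: group identical keys by counting them."""
    return Counter(mapped)


def toy_reduce(partial_counts, n=3):
    """Reduce: merge the per-chunk counts and keep the n most frequent keys."""
    total = sum(partial_counts, Counter())
    return total.most_common(n)


# Made-up sample sentences, one per "chunk" of input.
chunks = [
    "The community supports the program.",
    "The program rewards the community.",
    "Individuals join the program.",
]

partials = [toy_shuffle(toy_map(c)) for c in chunks]
result = toy_reduce(partials)
print(result)  # [('the', 5), ('program', 3), ('community', 2)]
```

In a distributed MapReduce job the map step runs on many chunks in parallel and the partial counts are merged afterwards. The notebook applies each phase once to the whole page, which is why it is described as a variation of the model.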

# Step 1: Setup

In [1]:
!pip install requests spacy
!python -m spacy download en_core_web_sm

2023-12-15 21:14:17.648658: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-15 21:14:17.648747: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-15 21:14:17.651023: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-15 21:14:17.665041: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [2]:
import requests
from collections import Counter
import spacy

# Step 3: Define Functions

Fetch Webpage Content Function

In [3]:
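def fetch_webpage_content():
    """Return the page HTML as a string, or None if the request fails."""
    url = 'https://civichonors.com/'
    try:
        # A timeout keeps the notebook from hanging on an unresponsive server.
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None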

Map Phase Function

In [4]:
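# Load the small English pipeline once so every call to map_phase reuses it.
nlp = spacy.load('en_core_web_sm')

def map_phase(content):
    """Return the text of every noun chunk spaCy finds in content."""
    doc = nlp(content)
    return [chunk.text for chunk in doc.noun_chunks]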
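Note that map_phase receives the raw HTML returned by fetch_webpage_content, so markup, scripts, and styles are parsed along with the visible text. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to strip tags first using the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample markup for illustration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Civic Honors</h1><p>The community matters.</p></body></html>"
)
cleaned = html_to_text(sample)
print(cleaned)
```

With a helper like this, the execution step could call `map_phase(html_to_text(content))` instead of `map_phase(content)`. The phrase counts would likely differ from the output shown in this notebook, which was produced from the raw HTML.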

Shuffle and Sort Function

In [5]:
def shuffle_and_sort(mapped_data):
    """Group identical phrases and count how often each one occurs."""
    return Counter(mapped_data)
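Counter treats keys as exact strings, so phrases that differ only in capitalization or spacing are counted separately. The sketch below shows the effect on a small made-up list and one way to normalize phrases before counting. The `normalize` helper is hypothetical and not part of the original notebook.

```python
from collections import Counter


def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace (hypothetical helper)."""
    return " ".join(phrase.lower().split())


# Made-up sample data: three spellings of the same phrase plus one other.
raw = ["The community", "the community", "the  community", "Individuals"]

unnormalized = Counter(raw)
normalized = Counter(normalize(p) for p in raw)

print(unnormalized.most_common(1))  # [('The community', 1)]
print(normalized.most_common(1))    # [('the community', 3)]
```

Whether to normalize depends on the analysis: lowercasing merges variants such as sentence-initial capitals, but it also merges proper nouns with common nouns.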

Reduce Phase Function

In [6]:
def reduce_phase(sorted_data):
    """Return the 10 most frequent phrases as (phrase, count) pairs."""
    return sorted_data.most_common(10)

# Step 4: Execute the Process

In [7]:
content = fetch_webpage_content()
if content:
    mapped_data = map_phase(content)
    sorted_data = shuffle_and_sort(mapped_data)
    reduced_data = reduce_phase(sorted_data)
    for item in reduced_data:
        print(item)
else:
    print("Failed to fetch content")

('the community', 448)
('the civic honors program', 152)
('individuals', 149)
('that', 137)
('organizations', 97)
('it', 95)
('the program', 82)
('what', 64)
('the potential', 64)
('the university', 59)
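The top ten includes 'that', 'it', and 'what', which spaCy returns as noun chunks but which say little about the page's topics. One possible refinement, offered as a sketch rather than as part of the original notebook, is to drop phrases that consist only of such filler words before taking the most common entries. The names `FILLER_WORDS` and `reduce_phase_filtered` are hypothetical, and the exclusion set is a hand-picked assumption.

```python
from collections import Counter

# Hypothetical exclusion set; extend it to suit the page being analyzed.
FILLER_WORDS = {"that", "it", "what", "which", "this", "they", "we", "who"}


def reduce_phase_filtered(sorted_data, n=10):
    """Drop filler-only phrases, then return the n most frequent ones."""
    kept = Counter({
        phrase: count
        for phrase, count in sorted_data.items()
        if phrase.strip().lower() not in FILLER_WORDS
    })
    return kept.most_common(n)


# Counts copied from the notebook output above.
counts = Counter({
    "the community": 448,
    "the civic honors program": 152,
    "individuals": 149,
    "that": 137,
    "organizations": 97,
    "it": 95,
    "the program": 82,
    "what": 64,
    "the potential": 64,
    "the university": 59,
})

top = reduce_phase_filtered(counts, n=5)
for item in top:
    print(item)
```

On these counts the filtered top five are 'the community', 'the civic honors program', 'individuals', 'organizations', and 'the program'.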
