<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/CivicHonorsAdvancedBERTQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Overview

The updated code for this Google Colab notebook adds a BERT-based question-answering (QA) system to the webpage content analysis, so the page's content can be queried interactively to extract specific information. Here is a summary of the updated process:

Setup: The code begins by installing the necessary Python packages: spacy for general NLP tasks and transformers for access to pre-trained BERT models. The spaCy model en_core_web_sm is also downloaded for processing English text.

Import Libraries: The script imports requests for fetching the webpage, BeautifulSoup for parsing the HTML and extracting text, spacy for basic NLP processing, and the pipeline helper from transformers for running the BERT model.

Function Definitions:

* Fetch Webpage Content: This function retrieves the HTML from the specified URL (in this case, https://civichonors.com/). It uses BeautifulSoup to parse the HTML and extract the text of the page's `<main>` element, which serves as the context for the QA system. It returns an empty string if the page has no `<main>` element and None if the request does not return status 200.

* BERT-Based Question Answering System: This function uses the Hugging Face transformers library to implement a QA system. It uses a pre-trained BERT model fine-tuned for question-answering tasks (bert-large-uncased-whole-word-masking-finetuned-squad). The function takes a question and the webpage's text as input and returns the span of that text the model rates as the most likely answer.

Execution and Output:

* The final execution step involves fetching the webpage content and then using the BERT QA system to answer a predefined question about the content.
* Users can modify the question variable to ask different questions, making the analysis interactive and adaptable to various informational needs.

This approach turns the notebook into an interactive tool for exploring webpage content: users ask a question in natural language and receive an answer drawn from the page's own text.

Unlike traditional keyword-based search and information extraction, the BERT-based QA system takes the wording of the question and the surrounding context into account when selecting an answer. This makes the notebook useful for researchers, data analysts, or anyone who wants to extract specific information from webpages.

Note: Step 4 took several minutes (10 or more) to run, so be prepared ;)
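
To see where that time goes (downloading the model, loading it, or running inference over the page text), each stage can be timed separately. The sketch below is a minimal, hypothetical helper; `timed` and the label are illustrative names that do not appear in the original notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    `results` is an optional dict that collects timings by label.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")

# Usage: wrap any stage of the notebook.
timings = {}
with timed("example stage", timings):
    sum(range(1000))  # stand-in for fetching the page or running the QA model
```

In Step 4, wrapping the `fetch_webpage_content` call and the `answer_question` call in separate `with timed(...)` blocks would show which of the two dominates.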

# Step 1: Setup

In [1]:
!pip install transformers spacy
!python -m spacy download en_core_web_sm

2023-12-15 20:55:55.754237: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-15 20:55:55.754299: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-15 20:55:55.755264: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-15 20:55:55.761276: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Collecting en-core-web-sm==3.6.0
  Downloading https:

# Step 2: Import Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import spacy
from transformers import pipeline

# Step 3: Define Functions

Fetch Webpage Content Function

In [3]:
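def fetch_webpage_content(url):
    # Fetch the page; the timeout keeps a stalled request from hanging the notebook
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Use the text of the <main> element as the QA context
        main_content = soup.find('main')
        # Empty string if the page has no <main> element
        return main_content.get_text() if main_content else ''
    else:
        # None signals that the request itself failed
        return None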
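
`get_text()` returns the element's text with whatever line breaks and indentation the HTML contained, so the context handed to the QA model can include long runs of whitespace. A small cleanup step could look like the sketch below. `clean_text` is a hypothetical helper, not part of the original notebook, and the sample string is made up for illustration.

```python
def clean_text(raw):
    """Collapse runs of whitespace within each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)

# Made-up sample resembling raw get_text() output
sample = "  First   heading \n\n\n   Some\tbody   text \n"
print(clean_text(sample))
```

A similar effect is available directly from BeautifulSoup by calling `get_text(separator=' ', strip=True)`, which joins the stripped text fragments with single spaces.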

BERT-Based Question Answering System

In [4]:
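def answer_question(question, context):
    # Load the SQuAD-fine-tuned BERT model (downloaded on first use)
    qa_pipeline = pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad')
    # The pipeline returns a dict with 'answer', 'score', 'start', and 'end'
    answer = qa_pipeline(question=question, context=context)
    return answer['answer']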
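
As written, `answer_question` constructs the pipeline on every call, so the model is loaded again for each question. One way to avoid that, sketched here under the assumption that the same model serves every question, is to cache the pipeline. The names `get_qa_pipeline` and `answer_question_cached`, and the optional `qa` parameter, are hypothetical and not part of the original notebook.

```python
from functools import lru_cache

MODEL_NAME = 'bert-large-uncased-whole-word-masking-finetuned-squad'

@lru_cache(maxsize=1)
def get_qa_pipeline():
    """Build the QA pipeline once; later calls return the cached object."""
    # Imported lazily so nothing heavy is loaded until it is first needed.
    from transformers import pipeline
    return pipeline('question-answering', model=MODEL_NAME)

def answer_question_cached(question, context, qa=None):
    """Answer a question, reusing the cached pipeline unless `qa` is given."""
    if qa is None:
        qa = get_qa_pipeline()
    result = qa(question=question, context=context)
    return result['answer']
```

Accepting `qa` as a parameter also makes it possible to try the function with a stand-in callable before committing to the 1.34 GB model download.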

# Step 4: Fetch Content and Set Up QA System

In [5]:
url = 'https://civichonors.com/'  # Webpage URL
content = fetch_webpage_content(url)
if content:
    # Example Question
    question = "What is the main topic of the webpage?"

    # Get the answer from BERT QA System
    answer = answer_question(question, content)

    print(f"Question: {question}")
    print(f"Answer: {answer}")
else:
    print("Failed to fetch content")

config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Question: What is the main topic of the webpage?
Answer: increasing active participation in the civic honors program
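
The pipeline's result also carries a confidence score, so several questions can be asked against the same content and weak answers flagged. The following is a sketch only: `ask_many`, `print_answers`, and the 0.1 threshold are illustrative assumptions, and `qa` stands for any callable that behaves like the Hugging Face question-answering pipeline.

```python
def ask_many(questions, context, qa, min_score=0.1):
    """Return (question, answer, score, confident) tuples, one per question."""
    rows = []
    for question in questions:
        result = qa(question=question, context=context)
        score = float(result['score'])
        rows.append((question, result['answer'], score, score >= min_score))
    return rows

def print_answers(rows):
    """Print each answer in the notebook's Question/Answer format."""
    for question, answer, score, confident in rows:
        flag = "" if confident else "  (low confidence)"
        print(f"Question: {question}")
        print(f"Answer: {answer} [score={score:.3f}]{flag}")
```

Here `qa` would be a pipeline created once with `pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad')`, and `context` the text returned by `fetch_webpage_content`.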
