# IE 7500 Applied NLP

## Question and Answer Bot using Pretrained BERT

## Final Exam

### 1. Data Preprocessing

In [None]:
!pip install pdfplumber PyPDF2 gradio

### 1.1. Preprocessing and Cleaning

In [None]:
import re
from transformers import pipeline
from PyPDF2 import PdfReader
import spacy
import gradio as gr

In [12]:
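# Step 1: Extract text from the PDF
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    # Guard against pages with no extractable text (e.g., image-only pages)
    pages = [page.extract_text() or "" for page in reader.pages]
    # Join with newlines so the last line of one page is not fused with the next page
    return "\n".join(pages)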

# Step 2: Clean the text
# License and distribution boilerplate to exclude (Project Gutenberg front matter)
BOILERPLATE_PATTERN = re.compile(
    r"(COPYRIGHT|PROJECT GUTENBERG|ELECTRONIC VERSION|SERVICE|DOWNLOAD|MEMBERSHIP|PERMISSION|MACHINE READABLE|PROHIBITED DISTRIBUTION)"
    r"|^\s*(DISTRIBUTED|PERSONAL USE ONLY|COMMERCIALLY)",
    re.IGNORECASE,
)

def clean_text(raw_text):
    # Basic cleaning: collapse whitespace and drop stray numbers
    text = re.sub(r"\n{2,}", "\n", raw_text)  # Replace multiple newlines with a single newline
    text = re.sub(r"\s{2,}", " ", text)  # Replace remaining whitespace runs with a single space
    text = re.sub(r"\b\d+\b", "", text)  # Remove isolated numbers (like page numbers)

    # Step 3: Keep only lines that look like content:
    # longer than 20 characters, starting with a letter, and free of boilerplate
    paragraphs = [
        para.strip() for para in text.split("\n")
        if len(para.strip()) > 20
        and re.search(r"^[A-Za-z]", para)
        and not BOILERPLATE_PATTERN.search(para)
    ]
    return " ".join(paragraphs)



# Main Implementation
pdf_path = "/content/Shakespeare-Complete-Works.pdf"  # Update this path as needed
raw_text = extract_text_from_pdf(pdf_path)
cleaned_text = clean_text(raw_text)


### Clean the Extracted Text

- **Function**: `clean_text(raw_text)`
- **Input**: Raw text extracted from the PDF.
- **Output**: Cleaned and formatted text as a single string.

#### Cleaning Steps:
1. **Remove Excessive Newlines**: Replace multiple consecutive newlines with a single newline.
2. **Remove Excessive Whitespace**: Replace multiple consecutive spaces with a single space.
3. **Remove Isolated Numbers**: Exclude numbers like page numbers that appear on their own.
4. **Filter Valid Paragraphs**: Retain only paragraphs that:
   - Have meaningful content (at least 20 characters).
   - Start with letters (ignore any lines starting with non-alphabetic characters).

This ensures the output is clean and structured for further processing.
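
As a rough illustration of the four steps listed above, the sketch below applies the same regular expressions to a short sample string. The sample text and the name `clean_text_demo` are assumptions made for this example and are not part of the notebook's data.

```python
import re

def clean_text_demo(raw_text):
    # 1. Collapse runs of newlines into a single newline
    text = re.sub(r"\n{2,}", "\n", raw_text)
    # 2. Collapse runs of whitespace into a single space
    text = re.sub(r"\s{2,}", " ", text)
    # 3. Drop isolated numbers such as page numbers
    text = re.sub(r"\b\d+\b", "", text)
    # 4. Keep lines longer than 20 characters that start with a letter
    lines = [
        line.strip() for line in text.split("\n")
        if len(line.strip()) > 20 and re.search(r"^[A-Za-z]", line)
    ]
    return " ".join(lines)

# Illustrative input: a page number, a short heading, and two lines of dialogue
sample = (
    "42\n\n\n"
    "ACT I\n"
    "In delivering my son from me,   I bury a second husband.\n"
    "- 17 -\n"
    "And I in going, madam, weep o'er my father's death anew.\n"
)
print(clean_text_demo(sample))
```

In this sample, the page numbers and the short heading `ACT I` are dropped, and the two dialogue lines are joined into one string.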

- The code is structured to:
  - Extract raw text from a PDF file.
  - Clean and preprocess the extracted text for further analysis.
  
- After this we will:
  - Segment the cleaned text into smaller chunks.
  - Use the cleaned text as input to a question-answering pipeline.
  


### 1.2. Data Segmentation into Smaller Logical Units

In [11]:
import spacy
# Step 3: Segment the text
def segment_text(text, segment_size=500):
    return [text[i:i + segment_size] for i in range(0, len(text), segment_size)]

# Step 4: Preprocess and Cache NER Entities
def extract_entities(text, nlp):
    doc = nlp(text)
    return {ent.text.lower() for ent in doc.ents}

def preprocess_segments_with_entities(segments, nlp):
    # Cache entities for all segments
    segment_entities = []
    for segment in segments:
        entities = extract_entities(segment, nlp)
        segment_entities.append((segment, entities))
    return segment_entities

# Step 5: Find Relevant Segment Using Preprocessed Entities
def find_relevant_segment(question, segment_entities, nlp):
    question_doc = nlp(question)
    question_entities = {ent.text.lower() for ent in question_doc.ents}

    best_segment = ""
    max_matches = 0
    for segment, entities in segment_entities:
        matches = len(question_entities.intersection(entities))
        if matches > max_matches:
            max_matches = matches
            best_segment = segment
    return best_segment

#### **Segment the Text**
- **Function**: `segment_text(text, segment_size=500)`
- **Purpose**: Break the cleaned text into smaller, fixed-sized chunks (default size: 500 characters).
- **Input**: The cleaned text.
- **Output**: A list of text segments.
- **Why?**: Short segments keep each spaCy call and each question-answering context small. Because the split is by character count, a word or sentence can be cut at a segment boundary.
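
One common way to soften hard cuts between fixed-size segments is to let consecutive segments overlap by a few characters. The sketch below is an assumption layered on top of the notebook, not the method it uses; the function name `segment_text_overlap` and the `overlap` parameter are hypothetical.

```python
def segment_text_overlap(text, segment_size=500, overlap=50):
    """Split text into fixed-size chunks whose start positions are segment_size - overlap apart."""
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must be non-negative and smaller than segment_size")
    step = segment_size - overlap
    return [text[i:i + segment_size] for i in range(0, len(text), step)]

text = "Bertram is the Count of Rousillon."

# Notebook-style slicing with a tiny segment size, for comparison
plain = [text[i:i + 12] for i in range(0, len(text), 12)]
# Overlapping windows: each chunk repeats the last 4 characters of the previous one
overlapped = segment_text_overlap(text, segment_size=12, overlap=4)

print(plain)
print(overlapped)
```

With overlap, a phrase that straddles a boundary in the plain split may appear whole in one of the overlapping windows, at the cost of more segments to process. The final window can be a short tail that is already covered by the previous one.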


#### **Step 4: Preprocess and Cache NER Entities**
1. **Entity Extraction**:
   - **Function**: `extract_entities(text, nlp)`
   - **Purpose**: Extract Named Entities (e.g., names, places, dates) from a given text segment using a spaCy NLP model.
   - **Input**: A single text segment and a spaCy NLP object (`nlp`).
   - **Output**: A set of extracted entity names in lowercase.

2. **Caching Entities for Segments**:
   - **Function**: `preprocess_segments_with_entities(segments, nlp)`
   - **Purpose**: For each text segment, extract entities and store them in a list for faster access during the relevance matching step.
   - **Input**: A list of text segments and a spaCy NLP object.
   - **Output**: A list of tuples, where each tuple contains:
     - The text segment.
     - The set of named entities extracted from that segment.


#### **Find Relevant Segment Using Preprocessed Entities**
- **Function**: `find_relevant_segment(question, segment_entities, nlp)`
- **Purpose**: Match a user's question to the most relevant text segment based on the overlap of named entities.

**How it Works**:
1. **Extract Entities from the Question**:
   - Uses spaCy to extract entities from the user's question.

2. **Match Entities with Cached Segment Entities**:
   - Compares the entities extracted from the question to the entities cached for each segment.
   - Calculates the number of matching entities between the question and each segment.

3. **Select the Best Segment**:
   - The segment with the highest number of matching entities is considered the most relevant.

**Input**:
   - The user's question (string).
   - The list of preprocessed `segment_entities`.
   - The spaCy NLP object.

**Output**:
   - The most relevant text segment (string).
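
The matching logic can be traced without spaCy by using hand-written entity sets in place of model output. The sketch below mirrors the loop in `find_relevant_segment`; the function name `pick_best_segment` and the sample segments are illustrative assumptions.

```python
def pick_best_segment(question_entities, segment_entities):
    """Return the segment sharing the most entities with the question, or "" if none match."""
    best_segment = ""
    max_matches = 0
    for segment, entities in segment_entities:
        matches = len(question_entities & entities)
        if matches > max_matches:  # strict '>' keeps the earliest segment on ties
            max_matches = matches
            best_segment = segment
    return best_segment

# Hand-written entity sets standing in for spaCy output
cached = [
    ("Enter Bertram, the Countess of Rousillon, Helena, and Lafeu.",
     {"bertram", "rousillon", "helena", "lafeu"}),
    ("Macbeth meets the witches upon the heath.", {"macbeth"}),
    ("Bertram departs for Paris.", {"bertram", "paris"}),
]

print(pick_best_segment({"bertram"}, cached))
print(pick_best_segment({"bertram", "paris"}, cached))
print(repr(pick_best_segment({"hamlet"}, cached)))
```

Two properties follow from the loop: when several segments tie, the earliest one wins, and a question whose entities match nothing yields an empty string, so callers need to handle that case.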



In [3]:
# Preprocess segments
segments = segment_text(cleaned_text)
nlp = spacy.load("en_core_web_sm")
segment_entities = preprocess_segments_with_entities(segments, nlp)

### Preprocessing Segments and Extracting Entities

This code preprocesses the cleaned text to prepare it for efficient question-answering by segmenting the text and extracting named entities from each segment.

#### **Code Breakdown**
1. **Segmenting the Text**:
   - **`segment_text(cleaned_text)`**:
     - Divides the cleaned text into smaller chunks of a fixed size (default: 500 characters).
     - This ensures each segment is manageable for further processing, such as Named Entity Recognition (NER) and matching.

2. **Loading spaCy's NLP Model**:
   - **`spacy.load("en_core_web_sm")`**:
     - Loads a pre-trained spaCy model capable of performing NER.
     - The model identifies named entities (e.g., people, locations, dates) in each segment.

3. **Extracting and Caching Named Entities**:
   - **`preprocess_segments_with_entities(segments, nlp)`**:
     - For each text segment:
       1. Extracts named entities using the spaCy model.
       2. Stores the segment and its corresponding entities as a tuple.
     - Results in a list called `segment_entities`, where each entry contains:
       - The text segment.
       - A set of named entities found in the segment.

#### **Advantages of Using This Process**
1. **Efficiency**:
   - By segmenting the text and precomputing named entities, we avoid reprocessing the entire text every time a question is asked.
   - This significantly reduces runtime for matching questions with relevant text segments.

2. **Scalability**:
   - Handles large texts (e.g., Shakespeare's Complete Works) by working with smaller, more manageable chunks.
   - Each spaCy and question-answering call processes a single 500-character segment rather than the full text, although all segments and their entity sets are still held in memory.

3. **Entity-Based Context Matching**:
   - Named Entity Recognition (NER) steers each question toward segments that mention the same entities (e.g., "Bertram" or "Rousillon").
   - Narrowing the context this way helps only when the question contains an entity that spaCy recognizes; otherwise `find_relevant_segment` returns an empty string.

4. **Reusability**:
   - Preprocessed segments and their entities are stored and reused, making the system efficient for multiple questions without recalculating entities.

5. **Flexibility**:
   - The process can handle a variety of texts and adapt to different domains or datasets by using spaCy's customizable NER models.


### 2. Model Selection and Setup

In [4]:
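# Step 3: Set up the Hugging Face Q&A pipeline
def setup_pipeline():
    # Note: "distilbert-base-uncased" is the base checkpoint; its QA head is not fine-tuned,
    # so the pipeline initializes that head with new weights when it loads
    return pipeline("question-answering", model="distilbert-base-uncased", tokenizer="distilbert-base-uncased")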

# Step 4: Ask questions and get answers
def ask_question(qa_pipeline, context, question):
    result = qa_pipeline(question=question, context=context)
    return result["answer"]

### Setting Up the Hugging Face Q&A Pipeline

This section sets up the question-answering (Q&A) pipeline using the Hugging Face Transformers library and defines a function to interact with the pipeline for retrieving answers to user queries.

#### **Set Up the Q&A Pipeline**
- **Function**: `setup_pipeline()`
- **Purpose**:
  - Initializes the Hugging Face Q&A pipeline.
  - Loads the pre-trained `distilbert-base-uncased` model and tokenizer. This base checkpoint is not fine-tuned for question answering, so its answer-span head is newly initialized when the pipeline loads.
- **Key Components**:
  - **Model**: `distilbert-base-uncased`:
    - A distilled variant of BERT that is smaller and faster than the original model.
  - **Pipeline**: Configures the question-answering workflow.

**Output**:
- Returns a pipeline object that can process user questions and provide answers based on a given context.


#### **Ask Questions and Get Answers**
- **Function**: `ask_question(qa_pipeline, context, question)`
- **Purpose**:
  - Interacts with the initialized Q&A pipeline to process user queries and retrieve answers.
- **Parameters**:
  - **`qa_pipeline`**: The Hugging Face pipeline initialized by `setup_pipeline()`.
  - **`context`**: A text segment containing relevant information for the question.
  - **`question`**: The user’s question as a string.
- **Workflow**:
  1. Passes the question and context to the Q&A pipeline.
  2. Extracts and returns the answer from the pipeline’s output.

**Output**:
- Returns the answer extracted from the context for the given question.
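
Besides `answer`, the dictionary returned by a Hugging Face question-answering pipeline also carries `score`, `start`, and `end`. The sketch below shows one way the score could be used to reject low-confidence answers. The function name `answer_with_threshold`, the 0.3 cutoff, and the stand-in `fake_pipeline` are assumptions for illustration and are not part of the notebook.

```python
def answer_with_threshold(qa_pipeline, context, question, min_score=0.3):
    """Return the answer text, or None when the pipeline's score is below min_score."""
    result = qa_pipeline(question=question, context=context)
    if result["score"] < min_score:
        return None
    return result["answer"]

# Stand-in for a real pipeline so the sketch runs without downloading a model.
# It returns the same keys as a question-answering pipeline result.
def fake_pipeline(question, context):
    return {"score": 0.12, "start": 0, "end": 7, "answer": context[0:7]}

context = "Bertram is the Count of Rousillon."
print(answer_with_threshold(fake_pipeline, context, "Who is Bertram?"))
print(answer_with_threshold(fake_pipeline, context, "Who is Bertram?", min_score=0.1))
```

A suitable threshold depends on the model and the data, so the default here should be treated as a placeholder to tune.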


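#### **Advantages of This Setup**
1. **Pre-Trained Model**:
   - Starting from the pre-trained `distilbert-base-uncased` checkpoint relies on transfer learning and avoids training a language model from scratch. For reliable answers, the question-answering head still needs to be fine-tuned on a dataset such as SQuAD.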

2. **Seamless Integration**:
   - The pipeline simplifies the interaction between the model and the user, abstracting away lower-level implementation details.

3. **Scalability**:
   - Works efficiently with the segmented and preprocessed text to handle large datasets like Shakespeare’s works.

4. **User-Friendly**:
   - By abstracting the Q&A task into a reusable function, the system becomes modular and easy to extend or modify.



In [5]:
# Set up the pipeline
qa_pipeline = setup_pipeline()

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
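
The warning above reports that `qa_outputs.bias` and `qa_outputs.weight` are newly initialized, which means the answer-span head has not been trained and its predictions are not reliable. One possible remedy, sketched here as an assumption and not as the notebook's method, is to load a DistilBERT checkpoint that has already been fine-tuned on SQuAD, such as `distilbert-base-uncased-distilled-squad` from the Hugging Face Hub, and to pass `device` explicitly. The helper names `build_pipeline_kwargs` and `setup_finetuned_pipeline` are hypothetical.

```python
QA_MODEL = "distilbert-base-uncased-distilled-squad"  # DistilBERT fine-tuned on SQuAD

def build_pipeline_kwargs(use_gpu=False):
    """Collect the arguments for transformers.pipeline in one place."""
    return {
        "task": "question-answering",
        "model": QA_MODEL,
        "tokenizer": QA_MODEL,
        "device": 0 if use_gpu else -1,  # 0 = first GPU, -1 = CPU
    }

def setup_finetuned_pipeline(use_gpu=False):
    # Imported lazily so the helper above can be inspected without transformers installed
    from transformers import pipeline
    return pipeline(**build_pipeline_kwargs(use_gpu))

print(build_pipeline_kwargs(use_gpu=True))
```

Calling `setup_finetuned_pipeline(use_gpu=True)` in a GPU runtime would download the checkpoint on first use and place the model on the first GPU.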


### 3. Designing the Q&A System

In [6]:
# Example usage
example_question = "Who is Bertram?"
relevant_segment = find_relevant_segment(example_question, segment_entities, nlp)
if relevant_segment:
    answer = qa_pipeline(question=example_question, context=relevant_segment)
    print(f"Question: {example_question}")
    print(f"Answer: {answer['answer']}")
else:
    print("No relevant segment found.")

Question: Who is Bertram?
Answer: and Servant to the Countess of Rousillon


In [8]:
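# Main Q&A Function
def shakespeare_qa(question):
    """Answer a question using the segment that best matches its named entities."""
    try:
        # Look up the context for this question instead of reusing the earlier example's segment
        relevant_segment = find_relevant_segment(question, segment_entities, nlp)
        if not relevant_segment:
            return "Sorry, I couldn't find a relevant passage for that question."
        answer = qa_pipeline(question=question, context=relevant_segment)
        return answer['answer']
    except Exception as e:
        return f"Sorry, I couldn't find an answer. Error: {str(e)}"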

### 5. Web Interface Implementation for the Q&A System

In [None]:
!pip install gradio

In [9]:
import gradio as gr

# Create Gradio Interface
iface = gr.Interface(
    fn=shakespeare_qa,
    inputs=gr.Textbox(label="Ask a question about Shakespeare's works"),
    outputs=gr.Textbox(label="Answer"),
    title="Shakespeare Works Q&A",
    description="Ask questions about characters, plots, and themes in Shakespeare's works.",
    examples=[
        "Who is Bertram?",
        "What happens in Romeo and Juliet?",
        "Describe Macbeth's character",
        "Who wrote these plays?"
    ]
)

# Launch the interface
iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://93d7e101e6c4f4125b.gradio.live



