Group Programming Assignment Two-QA System <br>
Kori Fogle, Michaela Herrick, Jackson Holland, Ishaan Indoori <br>
AIT 526 <br>
6/18/2025 <br>

<b> Problem to be Solved </b> <br>
The goal of this program is to create a question answering system that is executable from the command line. Questions must begin with "who," "what," "when," or "where," and each response must restate a portion of the question along with the answer. Unclear questions, and questions for which no answer can be found, must be addressed. The program must create a log file, and users must be able to exit the chat.
 <br>
<b> Algorithm and Flow</b><br>
Our program uses a class whose methods accomplish the goal above. First, we load all relevant packages; in particular, the program uses a Wikipedia API to retrieve answers. We then initialize a log file and provide a way for the user to exit. The program uses regular expressions and if statements to parse the user's input, which is categorized into one of five types of questions: a who question, a what question, a where question, a when question, or an unclear/unanswerable question. Every type except unclear/unanswerable is then queried using the Wikipedia API, and the answer returned to the user rephrases part of the original question. When the program is given a question it cannot answer, it alerts the user with a phrase such as "I'm sorry, I don't quite know the answer." Users can exit the program by typing the word "exit." The program also writes a log file that can be reviewed after the program ends to see each input and output. Finally, error handling addresses specific errors, such as pages not loading or too many possible answers being returned.
<br>
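The categorization step described above can be tried in isolation. The sketch below is only an illustration and is not the class used later in this notebook: the helper name `classify_and_extract` and its single regular expression are assumptions made for the example, and it handles only the simplest phrasings.

```python
import re

# Hypothetical helper for illustration only; the QA_System class uses separate methods.
# The \b after the question word rejects inputs such as "Whose ..." or "Whatever ...".
QUESTION_RE = re.compile(
    r"^(who|what|when|where)\b\s*(?:(?:is|was|are|were|did)\b)?\s*(.*?)\??$",
    re.IGNORECASE,
)

def classify_and_extract(question):
    """Return (question_type, topic), or (None, None) for unclear input."""
    match = QUESTION_RE.match(question.strip())
    if not match or not match.group(2).strip():
        return None, None
    return match.group(1).capitalize(), match.group(2).strip()

print(classify_and_extract("Who is Rosa Parks?"))
print(classify_and_extract("How tall is Mount Everest?"))
```

Anything that comes back as `(None, None)` corresponds to the fifth, unclear/unanswerable category and would never be sent to Wikipedia.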
<b> Example of Input and Output</b><br>

<b> Usage Instructions</b><br>
<ol>
<li>Execute the program</li>
<li>You'll be asked to input a name for the log file. Name this file as you would any other file. It will be accessible in the same working directory as this program.</li>
<li>Instructions will prompt you to start all questions with who, what, when, or where.</li>
<li>Ask a question and wait for the response</li>
<li>Stop asking questions by typing the word exit, with no trailing punctuation (quit, bye, and goodbye also work).</li>
</ol><br>
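After a session, the log file can be read back programmatically. The sketch below assumes the two-line "Question: ... / Answer: ..." entry format that `log_question` writes, and assumes each answer fits on a single line; the function name `read_log` is hypothetical and is not part of the program.

```python
def read_log(path):
    """Parse 'Question: ...' / 'Answer: ...' pairs into a list of tuples."""
    entries = []
    question = None
    with open(path, encoding="utf-8") as log:
        for line in log:
            line = line.rstrip("\n")
            if line.startswith("Question: "):
                question = line[len("Question: "):]
            elif line.startswith("Answer: ") and question is not None:
                # Pair the answer with the most recent unanswered question
                entries.append((question, line[len("Answer: "):]))
                question = None
    return entries
```

Calling `read_log` with the name entered at the log-file prompt returns one `(question, answer)` tuple per exchange, in the order they were asked.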

<b> References</b><br>
Anon. 2024. “Python Regex Cheat Sheet.” GeeksforGeeks. Retrieved May 31, 2025 (https://www.geeksforgeeks.org/python-regex-cheat-sheet/).<br>
Dib, Firas. n.d. “Regex101 - Online Regex Editor and Debugger.” Regex101. (https://regex101.com/).<br>
Gadiraju, Sai Surya. n.d. “Program Assignment 2 Demo Video.”<br>
Jurafsky, Daniel, and James H. Martin. 2025. Speech and Language Processing (3rd Ed. Draft).<br>
Liao, Duoduo. n.d. "Tips and Hints- QA System Programming Assignment-2."

In [20]:
#Load relevant packages and en_core_web_sm
import sys
import wikipedia
import wikipedia.exceptions as wiki_exceptions
import nltk
from nltk.tokenize import sent_tokenize
import re
import spacy
from difflib import SequenceMatcher


nlp = spacy.load("en_core_web_sm")

In [27]:
# QA_System: classifies each question, extracts its topic, queries Wikipedia,
# and logs every question/answer pair to the chosen log file.

from difflib import SequenceMatcher
import re
import wikipedia
from wikipedia import exceptions as wiki_exceptions
from nltk.tokenize import sent_tokenize
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

class QA_System:
    def __init__(self, logfile): 
        # Initialize the QA system with a log file for storing Q&A history
        self.logfile = logfile

    def run(self):
        # Main loop for interactive Q&A
        print("*** This is a QA system. I will try to answer questions that start with Who, What, When, and Where. Type Exit to quit. ***")
        while True:
            try:
                question = input("*?>").strip()
                if question.lower() in ["exit", "quit", "bye", "goodbye"]:
                    print("Thank you, goodbye")
                    break
                self.answer_question(question)
            except Exception as e:
                print(f"An error occurred: {e}")
                continue

    def answer_question(self, question):
        # Determine question type and extract a query to search
        doc = nlp(question)
        question_type = self.identify_question_type(question)
        if question_type is None:
            print("I'm sorry, I don't quite know the answer to this question.")
            self.log_question(question, "n/a")
            return 
        refined_query = self.extract_context(question)
        if not refined_query:
            refined_query = self.extract_dynamic_entity(doc, question_type)
        if refined_query: 
            print(f"Trying to search Wikipedia for the question: {refined_query}")
            self.search_wikipedia(refined_query, question_type, question)
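        else:
            print("I'm sorry, but I was unable to find an answer. Make sure you've phrased your question correctly.")
            # Log this failure too, so the log file covers every question asked
            self.log_question(question, "n/a")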

    def identify_question_type(self, question):
        # Categorize the question by its leading keyword. The word boundary
        # keeps inputs such as "Whose" or "Whatever" from being misclassified.
        match = re.match(r'\s*(who|what|when|where)\b', question, re.IGNORECASE)
        return match.group(1).capitalize() if match else None

    def extract_context(self, question):
        # Regex patterns to extract the relevant topic from the question.
        # Order matters: the first match wins, so specific patterns come before general ones.
        patterns = [
            r'Who (?:is|was|are)? (.+)',
            r'Who (made|makes|created|invented|discovered|wrote) (.+)',
            r'Who (owns|founded|leads|led) (.+)',
            r"What (?:is|was) (.+?)'s age\b",
            r'What (?:is|was)? (.+)',
            r'When (?:is|was) (.+) born',
            r'When (?:is|was) (.+) birthday',
            r'When did (.+)',
            r'When (.+) Born',
            r'When (.+) Birthday',
            r'Where (?:is|was|are|did)? (.+)',
            r'Where (.+)',
        ]
        for pattern in patterns:
            match = re.match(pattern, question, re.IGNORECASE)
            if match:
                # The topic is always the last capture group; drop any trailing "?"
                return match.group(len(match.groups())).strip(" ?")
        return None

    def extract_dynamic_entity(self, doc, question_type):
        # Try to dynamically extract named entities using spaCy
        entities = [ent.text for ent in doc.ents if ent.label_ in {"PERSON", "ORG", "GPE", "DATE"}]
        if entities:
            return " ".join(entities)
        # If no named entities are found, fallback to a cleaned version of the full input text
        return re.sub(r'[^a-zA-Z0-9\s]', '', doc.text).strip().lower()

    def fuzzy_match(self, subject, results):
        # Use string similarity to select the most relevant Wikipedia title
        # Fuzzy matching improves resilience to typos or Wikipedia title mismatches
        def similarity(a, b):
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()
        ranked = sorted(results, key=lambda title: similarity(subject, title), reverse=True)
        return ranked[0] if ranked else None

    def search_wikipedia(self, query, question_type, question):
        # Search Wikipedia for the topic and extract relevant content
        try:
            search_results = wikipedia.search(query)
            if not search_results:
                print("I am sorry I cannot seem to find the answer.")
                self.log_question(question, "n/a")
                return

            chosen_title = self.fuzzy_match(query, search_results) or search_results[0]
            try:
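                # The title comes from the search results, so skip auto-suggestion,
                # which can rewrite a valid title into one that has no page
                page = wikipedia.page(chosen_title, auto_suggest=False)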
            except wiki_exceptions.DisambiguationError:
                print("I'm sorry, I don't understand the question.")
                self.log_question(question, "ambiguous")
                return

            # Tokenize summary for concise and high-level info (improves answer relevance)
            summary_sentences = sent_tokenize(page.summary)
            # Tokenize beginning of full content for supporting context
            content_sentences = sent_tokenize(page.content)
            # Combine summary and first 25 content sentences
            combined_sentences = summary_sentences + content_sentences[:25]

            meaningful_summary = self.summarize_text(combined_sentences, question_type, query, question)
            if meaningful_summary:
                print(f"=> {meaningful_summary}")
                self.log_question(question, meaningful_summary)
            else:
                print("I am sorry, I cannot seem to find the answer.")
                self.log_question(question, "n/a")

        except wiki_exceptions.PageError:
            print("Unfortunately I could not find a page on that topic.")
            self.log_question(question, "no result")

        except wiki_exceptions.HTTPTimeoutError:
            print("There's a network error, check your internet connection and try again.")
            self.log_question(question, "timeout")

    def summarize_text(self, sentences, question_type, query, full_question):
        # Pick the sentence that best answers the question, based on its type
        # TODO: consider breaking these blocks into helpers to reduce complexity
        clean_name = self.clean_display_name(query)

        if question_type == "Who":
            for sentence in sentences:
                # Match "is"/"was" as whole words, not as substrings of words like "this"
                if query.lower() in sentence.lower() and re.search(r'\b(?:is|was)\b', sentence):
                    return sentence
            return sentences[0] if sentences else None

        elif question_type == "What":
            return sentences[0] if sentences else None

        elif question_type == "When":
            doc = nlp(query)
            is_person = any(ent.label_ == "PERSON" for ent in doc.ents)
            is_death = bool(re.search(r'\b(die|died|dies|death|passed away|dead)\b', query, re.IGNORECASE))

            full_date_pattern = r'([A-Z][a-z]+ \d{1,2}, \d{4})'

            for sentence in sentences:
                if is_death:
                    match = re.search(r'died on ' + full_date_pattern, sentence)
                    if match:
                        return f"{clean_name} died on {match.group(1)}."
                    match_alt = re.search(r'on ' + full_date_pattern + r' and died', sentence)
                    if match_alt:
                        return f"{clean_name} died on {match_alt.group(1)}."

                if is_person:
                    match = re.search(r'born on ' + full_date_pattern, sentence)
                    if match:
                        return f"{clean_name} was born on {match.group(1)}."
                    match_alt = re.search(r'on ' + full_date_pattern + r' and born', sentence)
                    if match_alt:
                        return f"{clean_name} was born on {match_alt.group(1)}."

                if not is_person:
                    match = re.search(
                        r'(?:started|began|occurred|established|founded|created|formed|launched)(?: in| on)? ' + full_date_pattern,
                        sentence, re.IGNORECASE
                    )
                    if match:
                        return f"{clean_name} was established on {match.group(1)}."

            # Fallback: if no exact sentence matched but a DATE entity exists, report it
            for sent in sentences:
                sent_doc = nlp(sent)
                for ent in sent_doc.ents:
                    if ent.label_ == "DATE" and re.search(r'\d{4}', ent.text):
                        # Heuristic for human names vs other topics
                        if is_death:
                            return f"While the exact date is unclear, it appears {clean_name} died in {ent.text}."
                        elif is_person:
                            return f"While the exact date is unclear, it appears {clean_name} was born in {ent.text}."
                        else:
                            return f"While the exact date is unclear, it appears {clean_name} was established in {ent.text}."

            return "Date or time information not found."

        elif question_type == "Where":
            for sentence in sentences:
                if "GPE" in [ent.label_ for ent in nlp(sentence).ents]:
                    return sentence
            return "Location information not found."

    def log_question(self, question, answer):
        # Append question and answer to the log file
        with open(self.logfile, 'a', encoding='utf-8') as log:
            log.write(f"Question: {question}\n")
            log.write(f"Answer: {answer}\n\n")

    def clean_display_name(self, query):
        # Construct clean names like 'Albert Einstein' by stripping auxiliary verbs and keywords
        query = re.sub(r'\b(when|was|did|is|born|die|died|created|founded|started|begin|began|occurred|took place|happen(ed)?)\b', '', query, flags=re.IGNORECASE)
        query = re.sub(r'\s+', ' ', query).strip("? ").strip()
        return query.title()

def main():
    log_filename = input("Enter the name of the log file: ").strip()
    try:
        qa_system = QA_System(log_filename)
        qa_system.run()
    except Exception as ex:
        print(ex)
    finally:
        print("Log is saved.")

if __name__ == "__main__":
    main()


*** This is a QA system. I will try to answer questions that start with Who, What, When, and Where. Type Exit to quit. ***
Trying to search Wikipedia for the question: Rosa Parks
Unfortunately I could not find a page on that topic.
Trying to search Wikipedia for the question: george washington
=> George Washington was born on February 22, 1732.
Trying to search Wikipedia for the question: Thomas Jefferson
=> Thomas Jefferson was born on April 13, 1743.
Thank you, goodbye
Log is saved.
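
The title-selection step in `fuzzy_match` can be tried offline. The sketch below reuses the same `SequenceMatcher` similarity ratio; the function name `best_title` and the hard-coded list of titles are assumptions for illustration and do not come from a real Wikipedia search.

```python
from difflib import SequenceMatcher

def best_title(subject, titles):
    """Return the title most similar to subject, or None for an empty list."""
    def similarity(a, b):
        # Ratio in [0, 1]; 1.0 means the lowercased strings are identical
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max(titles, key=lambda title: similarity(subject, title), default=None)

# Hypothetical search results, hard-coded so the example runs without a network
titles = ["Rosa Parks Museum", "Rosa Parks Day", "Rosa Parks"]
print(best_title("rosa parks", titles))
```

Because the comparison is case-insensitive, a lowercase query still selects the exact title even when it is not the first search result.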
