# QA Bot

## Task 0. Question Answering

Write a function `def question_answer(question, reference):` that finds a snippet of text within a reference document to answer a question:

- `question` is a string containing the question to answer
- `reference` is a string containing the reference document from which to find the answer
- Returns: a string containing the answer
- If no answer is found, return `None`
- Your function should use the `bert-uncased-tf2-qa` model from the `tensorflow-hub` library
- Your function should use the pre-trained `BertTokenizer`, `bert-large-uncased-whole-word-masking-finetuned-squad`, from the `transformers` library

In [1]:
#!/usr/bin/env python3
"""
Defines function that finds a snippet of text within a reference document
to answer a question
"""


import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    
    # Tokenizer that matches the BERT model fine-tuned on SQuAD
    # (Stanford Question Answering Dataset)
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')

    # QA model that predicts the start and end positions of the answer span
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    # Split the text into word-piece tokens that the model understands
    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    # Build the input sequence: [CLS] question [SEP] reference [SEP]
    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    # Convert the tokens to their numerical vocabulary IDs
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Mask of 1s marking real tokens; 0 would mark padding, which is not
    # used here
    input_mask = [1] * len(input_word_ids)

    # Segment IDs: 0 for [CLS] + question + [SEP], 1 for reference + [SEP]
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    # Convert each list to an int32 tensor and add a leading batch
    # dimension, the input shape the BERT model expects
    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    # call the bert model
    outputs = model([input_word_ids, input_mask, input_type_ids])

    # Find the start and end of the predicted answer span:
    # outputs[0] holds the start-position logits, outputs[1] the
    # end-position logits, and [0] selects the first (and only) batch entry.
    # [1:] skips the [CLS] token at index 0, so 1 is added back to each
    # argmax to recover the index into `tokens`.
    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    if not answer or question in answer:
        return None

    return answer




### Main (Test) File

In [3]:
with open('ZendeskArticles/ProfessionalTrack.md') as f:
    reference = f.read()

print(question_answer('What is Professional Track', reference))

"""
What is Professional Track --> ProfessionalTrack.md
What is speaker of the day? --> SpeakeroftheDay.md
Can I study and work at the same time? --> Specializations-FAQ1.md
"""

your professionalism will be equally as important as your actual engineering abilities . we want to make sure that our students leave school not only with great technical skills , but also the professional skills to help them throughout their careers . each trimester , the students are given a baseline professional track score of 100 %


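
The span-selection step in `question_answer` can be tried out without loading the model. The sketch below is only an illustration: the tokens and logits are made-up values (not model output), and `pick_span` is a hypothetical helper that mirrors the `tf.argmax(...[1:]) + 1` logic with plain Python lists. It also makes explicit that an end index before the start index yields no answer.

```python
def pick_span(start_logits, end_logits):
    """Return (start, end) token indices, or None if the span is empty.

    Hypothetical helper: mirrors tf.argmax(logits[1:]) + 1 from the
    solution cell, using plain Python lists instead of tensors.
    """
    # Search positions 1..n-1 only, which skips [CLS] at index 0
    start = max(range(1, len(start_logits)), key=start_logits.__getitem__)
    end = max(range(1, len(end_logits)), key=end_logits.__getitem__)
    if end < start:
        # An end before the start gives an empty slice, i.e. no answer
        return None
    return start, end


# Illustrative tokens and logits (assumed values, not real model output)
tokens = ['[CLS]', 'who', 'wrote', 'it', '[SEP]', 'ada', 'wrote', 'it', '[SEP]']
start_logits = [9.0, 0.1, 0.2, 0.1, 0.0, 4.0, 1.0, 0.5, 0.0]
end_logits = [9.0, 0.0, 0.1, 0.2, 0.0, 3.5, 1.2, 0.4, 0.1]

span = pick_span(start_logits, end_logits)
if span is not None:
    start, end = span
    print(" ".join(tokens[start:end + 1]))
```

Note that the high logit at index 0 is ignored in both lists, which is the effect of the `[1:]` slice in the original code.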

## Task 1. Create the Loop

Create a script that takes in input from the user with the prompt `Q:` and prints `A:` as a response. If the user inputs `exit`, `quit`, `goodbye`, or `bye`, case insensitive, print `A: Goodbye` and exit.

In [4]:
#!/usr/bin/env python3
"""
Script that takes in user input with the prompt 'Q:' and
prints 'A:' as the response.
"""


if __name__ == "__main__":
    while True:
        # Lowercase the input so the exit words are case insensitive
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        print("A:")

Q: Hello
A:
Q: Goodbye
A: Goodbye
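
Because the loop reads from `input`, it is awkward to check automatically. One possible approach is to move the exit test into a small function. The sketch below assumes hypothetical names `EXIT_WORDS` and `is_exit_command`; like the script above, it lowercases the input but does not strip surrounding whitespace.

```python
EXIT_WORDS = ('exit', 'quit', 'goodbye', 'bye')


def is_exit_command(text):
    """Return True if text is one of the exit words, ignoring case."""
    return text.lower() in EXIT_WORDS


# Try a few inputs without starting the interactive loop
for sample in ['Hello', 'Goodbye', 'QUIT', ' bye']:
    print(repr(sample), is_exit_command(sample))
```

The last sample has a leading space and is therefore not treated as an exit word, which matches the behavior of the original comparison.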


## Task 2. Answer Questions

Based on the previous tasks, write a function `def answer_loop(reference):` that answers questions from a reference text:

- `reference` is the reference text
- If the answer cannot be found in the reference text, respond with `Sorry, I do not understand your question`.

In [5]:
#!/usr/bin/env python3
"""
Defines function that answers questions from reference text on loop
"""


import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def answer_loop(reference):
    """
    Answers questions from a reference text on loop
    """
    while True:
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        answer = question_answer(user_input, reference)
        if answer is None:
            print("A: Sorry, I do not understand your question.")
        else:
            print("A: ", answer)


def question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    outputs = model([input_word_ids, input_mask, input_type_ids])

    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    if not answer or question in answer:
        return None

    return answer

### Main (Test) File

In [6]:
with open('ZendeskArticles/PeerLearningDays.md') as f:
    reference = f.read()

answer_loop(reference)

Q: When are PLDs?
A:  on - site days from 9 : 00 am to 3 : 00 pm
Q: Goodbye
A: Goodbye
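
In the cell above, `question_answer` calls `BertTokenizer.from_pretrained` and `hub.load` on every question, so each turn of the loop repeats the loading work. One possible way to avoid that is to cache the loaded objects. The sketch below shows the pattern with `functools.lru_cache`; `load_resources`, `answer`, and the placeholder strings are hypothetical stand-ins (not the real tokenizer or model) so that the example runs without TensorFlow.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def load_resources():
    """Stand-in for the expensive tokenizer and model loading."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Assumption: in real code these placeholders would be the objects
    # returned by BertTokenizer.from_pretrained(...) and hub.load(...)
    return ("tokenizer-placeholder", "model-placeholder")


def answer(question):
    """Toy answer function that reuses the cached resources."""
    tokenizer, model = load_resources()  # loaded once, then cached
    return "{} | {} | {}".format(tokenizer, model, question)


answer("first question")
answer("second question")
print(LOAD_COUNT)
```

Loading the resources once before the `while` loop and passing them in as arguments would achieve the same effect without a cache.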


## Task 3. Semantic Search

Write a function `def semantic_search(corpus_path, sentence):` that performs semantic search on a corpus of documents:

- `corpus_path` is the path to the corpus of reference documents on which to perform semantic search
- `sentence` is the sentence from which to perform semantic search
- Returns: the reference text of the document most similar to `sentence`

In [7]:
#!/usr/bin/env python3
"""
Defines function that performs semantic search on a corpus of documents
"""


import numpy as np
import os
import tensorflow_hub as hub


def semantic_search(corpus_path, sentence):
    """
    Performs semantic search on a corpus of documents
    """
    
    # 1. Start with the query sentence so it is embedded with the corpus
    documents = [sentence]

    # 2. Read every Markdown document in the corpus
    for filename in os.listdir(corpus_path):
        if not filename.endswith(".md"):
            continue
        path = os.path.join(corpus_path, filename)
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())

    # 3. Load the pre-trained Universal Sentence Encoder (USE) from
    # TensorFlow Hub. It encodes each text into a fixed-size embedding
    # that captures its meaning.
    model = hub.load(
        "https://tfhub.dev/google/universal-sentence-encoder-large/5")

    # 4. Embed the query sentence and every document in one call
    embeddings = model(documents)

    # 5. Inner product between every pair of embeddings; row 0 holds the
    # similarity of the query sentence to each entry
    correlation = np.inner(embeddings, embeddings)

    # 6. Skip column 0 (the sentence compared with itself) and take the
    # best-scoring document
    closest = np.argmax(correlation[0, 1:])

    # 7. Add 1 to undo the offset introduced by the slice
    similar = documents[closest + 1]

    return similar

### Main (Test) File

In [8]:
print(semantic_search('ZendeskArticles', 'What is a Stand Up?'))

"""
    When are PLDs?
    What is a Stand Up?
    Can I study and work at the same time?
    Have the Specializations been tested to ensure there aren’t any major bugs or technical discrepancies?
"""

Stand Up is a meeting that takes place daily on campus at the same specified time. It is an opportunity for staff and students to make announcements pertinent to the community. Each stand up will be live-streamed and available for viewing through the intranet. 
It is mandatory for all students on campus to attend Stand Up. Students who are on campus, but not present in the stand up area, are at risk for a deduction to their professionalism score.


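
The ranking step in `semantic_search` only needs the similarity between the query and each document, not the full pairwise matrix. The sketch below illustrates that step with small hand-written vectors in place of real embeddings. `normalize` and `most_similar` are hypothetical helpers, and the explicit normalization (which turns the inner product into cosine similarity) is an assumption added here; the cell above uses the raw inner product.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(query_vec, doc_vecs):
    """Return the index of the document vector closest to query_vec."""
    query = normalize(query_vec)
    scores = [
        sum(a * b for a, b in zip(query, normalize(doc)))
        for doc in doc_vecs
    ]
    # Index of the highest cosine similarity
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for sentence embeddings (assumed values)
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.1, 2.0],  # almost the same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

print(most_similar(query, docs))
```

Comparing the query against the documents directly also removes the need for the `+ 1` index correction, since the query is no longer stored in the same list.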

## Task 4. Multi-reference Question Answering

Based on the previous tasks, write a function `def question_answer(corpus_path):` that answers questions from multiple reference texts:

- `corpus_path` is the path to the corpus of reference documents

In [9]:
#!/usr/bin/env python3
"""
Defines function that answers questions from multiple reference texts on loop
"""


import numpy as np
import os
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def question_answer(corpus_path):
    """
    Answers questions from multiple reference texts
    """
    while True:
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        reference = semantic_search(corpus_path, user_input)
        answer = specific_question_answer(user_input, reference)
        if answer is None:
            print("A: Sorry, I do not understand your question.")
        else:
            print("A: ", answer)


def semantic_search(corpus_path, sentence):
    """
    Performs semantic search on a corpus of documents
    """
    documents = [sentence]

    for filename in os.listdir(corpus_path):
        if not filename.endswith(".md"):
            continue
        path = os.path.join(corpus_path, filename)
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())

    model = hub.load(
        "https://tfhub.dev/google/universal-sentence-encoder-large/5")

    embeddings = model(documents)
    correlation = np.inner(embeddings, embeddings)
    closest = np.argmax(correlation[0, 1:])
    similar = documents[closest + 1]

    return similar


def specific_question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    outputs = model([input_word_ids, input_mask, input_type_ids])

    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    if not answer or question in answer:
        return None

    return answer


### Main (Test) File

In [None]:
question_answer('ZendeskArticles')

"""
    When are PLDs?
    What is a Stand Up?
    Can I study and work at the same time?
    Have the Specializations been tested to ensure there aren’t any major bugs or technical discrepancies?
"""

Q: Can I study and work at the same time?
A: Sorry, I do not understand your question.
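
The loop in this task chains two stages: retrieve the most relevant document, then extract an answer from it. The sketch below shows that composition with both stages passed in as functions, which makes the flow checkable without loading any model. `answer_from_corpus`, `fake_search`, and `fake_reader` are hypothetical names, and the two stubs are toy stand-ins for `semantic_search` and `specific_question_answer`, not replacements for them.

```python
FALLBACK = "Sorry, I do not understand your question."


def answer_from_corpus(question, search_fn, reader_fn):
    """Retrieve a reference with search_fn, then read it with reader_fn."""
    reference = search_fn(question)
    answer = reader_fn(question, reference)
    return FALLBACK if answer is None else answer


def fake_search(question):
    """Toy retriever: keyword lookup instead of sentence embeddings."""
    docs = {
        "stand up": "Stand Up is a daily meeting.",
        "pld": "PLDs are on-site days.",
    }
    for keyword, text in docs.items():
        if keyword in question.lower():
            return text
    return ""


def fake_reader(question, reference):
    """Toy reader: return the whole reference, or None when it is empty."""
    return reference or None


print("A:", answer_from_corpus("What is a Stand Up?", fake_search, fake_reader))
print("A:", answer_from_corpus("Can I study and work?", fake_search, fake_reader))
```

With the real functions, the retriever always returns some document, so the fallback message depends entirely on whether the reader finds a span in it.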


## Happy Coding