<img src="../holberton_logo.png" alt="logo" width="500"/>

# QA Bot

## Task 0

Write a function `def question_answer(question, reference):` that finds a snippet of text within a reference document to answer a question:

- `question` is a string containing the question to answer
- `reference` is a string containing the reference document from which to find the answer
- Returns: a string containing the answer
- If no answer is found, return `None`
- Your function should use the `bert-uncased-tf2-qa` model from the `tensorflow-hub` library
- Your function should use the pre-trained `BertTokenizer`, `bert-large-uncased-whole-word-masking-finetuned-squad`, from the `transformers` library

In [1]:
pip install --upgrade tensorflow tensorflow-hub ml-dtypes

In [2]:
#!/usr/bin/env python3
"""
Defines function that finds a snippet of text within a reference document
to answer a question
"""


import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    # tokenize question and reference
    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    # add special tokens to include "classification" and "separator"
    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    # convert tokens to ids
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # a mask of 1s marks real tokens; it is used to
    # differentiate between tokens and padding
    input_mask = [1] * len(input_word_ids)

    # 0 for the question segment, 1 for the reference segment
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    # convert the inputs to TF tensors with a leading batch dimension,
    # which is the shape the BERT model expects
    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    # call the bert model
    outputs = model([input_word_ids, input_mask, input_type_ids])

    # find the positions of the start and end of the predicted
    # answer span in the model outputs, skipping position 0 ([CLS])
    short_start = int(tf.argmax(outputs[0][0][1:])) + 1
    short_end = int(tf.argmax(outputs[1][0][1:])) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    # an empty span, or one that contains the question itself, is no answer
    if not answer or question in answer:
        return None

    return answer
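
To make the input layout and the span selection concrete, here is a minimal, framework-free sketch. The helper names `build_inputs` and `pick_span`, the hand-written word pieces, and the logit values are all illustrative assumptions; they are not real tokenizer or model output.

```python
def build_inputs(quest_tokens, refer_tokens):
    """Return (tokens, input_mask, input_type_ids) for one question/reference pair."""
    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']
    # every position holds a real token, so the mask is all 1s
    input_mask = [1] * len(tokens)
    # segment 0 covers [CLS] + question + [SEP]; segment 1 covers reference + [SEP]
    input_type_ids = [0] * (len(quest_tokens) + 2) + [1] * (len(refer_tokens) + 1)
    return tokens, input_mask, input_type_ids


def pick_span(tokens, start_logits, end_logits):
    """Pick the argmax start and end positions, skipping position 0 ([CLS])."""
    start = max(range(1, len(start_logits)), key=start_logits.__getitem__)
    end = max(range(1, len(end_logits)), key=end_logits.__getitem__)
    # an end position before the start position yields an empty span
    return tokens[start:end + 1]


# hand-written word pieces (assumed, not produced by BertTokenizer)
quest = ['when', 'are', 'pl', '##ds', '?']
refer = ['pl', '##ds', 'are', 'on', 'fridays']
tokens, mask, type_ids = build_inputs(quest, refer)

# made-up logits standing in for the two model outputs
start_logits = [0.0] * len(tokens)
end_logits = [0.0] * len(tokens)
start_logits[0] = 9.0   # high [CLS] score, ignored by pick_span
start_logits[10] = 5.0  # position of 'on'
end_logits[11] = 5.0    # position of 'fridays'

span = pick_span(tokens, start_logits, end_logits)
print(span)
```

With these toy values the selected span is `['on', 'fridays']`, even though the highest start score sits on `[CLS]`.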




### Main (Test) File

In [3]:
with open('../data/ZendeskArticles/PeerLearningDays.md') as f:
    reference = f.read()

print(question_answer('When are PLDs?', reference))

FileNotFoundError: [Errno 2] No such file or directory: '../data/ZendeskArticles/PeerLearningDays.md'

## Task 1. Create the Loop

Create a script that takes in input from the user with the prompt `Q:` and prints `A:` as a response. If the user inputs `exit`, `quit`, `goodbye`, or `bye` (case insensitive), print `A: Goodbye` and exit.

In [None]:
#!/usr/bin/env python3
"""
Script that takes in user input with the prompt 'Q:' and
prints 'A:' as the response.
"""


if __name__ == "__main__":
    while True:
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        print("A:")
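
The loop is hard to check by hand because it reads from standard input. As a sketch of one way to make it testable, the version below passes the reader and writer in as parameters. The names `EXIT_WORDS`, `is_exit`, and `run_loop` are hypothetical, and stripping surrounding whitespace is an added assumption that the task does not require.

```python
EXIT_WORDS = {'exit', 'quit', 'goodbye', 'bye'}


def is_exit(text):
    """Return True when the input is one of the exit words, ignoring case."""
    # strip() is an assumption beyond the task statement
    return text.strip().lower() in EXIT_WORDS


def run_loop(read=input, write=print):
    """Prompt with 'Q: ' until an exit word is entered."""
    while True:
        if is_exit(read("Q: ")):
            write("A: Goodbye")
            break
        write("A:")


# drive the loop with scripted input instead of the keyboard
scripted = iter(["Hello", "QUIT"])
out = []
run_loop(read=lambda prompt: next(scripted), write=out.append)
print(out)
```

Calling `run_loop()` with no arguments behaves like the interactive script.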

## Task 2. Answer Questions

Based on the previous tasks, write a function `def answer_loop(reference):` that answers questions from a reference text:

- `reference` is the reference text
- If the answer cannot be found in the reference text, respond with `Sorry, I do not understand your question.`

In [None]:
#!/usr/bin/env python3
"""
Defines function that answers questions from reference text on loop
"""


import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def answer_loop(reference):
    """
    Answers questions from a reference text on loop
    """
    while True:
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        answer = question_answer(user_input, reference)
        if answer is None:
            print("A: Sorry, I do not understand your question.")
        else:
            print("A:", answer)


def question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    outputs = model([input_word_ids, input_mask, input_type_ids])

    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    if not answer or question in answer:
        return None

    return answer
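
As written, `question_answer` loads the tokenizer and the model on every call, and the loop above calls it once per question. One possible way to avoid the repeated loading is to cache the loaded objects. The sketch below uses `functools.lru_cache` with a stand-in loader; `expensive_load` and `get_resources` are hypothetical names, and the placeholder strings stand in for the real tokenizer and model.

```python
from functools import lru_cache

load_count = 0


def expensive_load():
    """Stand-in for BertTokenizer.from_pretrained(...) and hub.load(...)."""
    global load_count
    load_count += 1
    return {"tokenizer": "tokenizer-placeholder", "model": "model-placeholder"}


@lru_cache(maxsize=1)
def get_resources():
    """Load once on the first call and return the cached objects afterwards."""
    return expensive_load()


first = get_resources()
second = get_resources()
print(load_count, first is second)
```

Inside `question_answer`, the two loading lines would then be replaced by a single lookup of the cached objects.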

### Main (Test) File

In [None]:
with open('../data/ZendeskArticles/PeerLearningDays.md') as f:
    reference = f.read()

answer_loop(reference)

## Task 3. Semantic Search

Write a function `def semantic_search(corpus_path, sentence):` that performs semantic search on a corpus of documents:

- `corpus_path` is the path to the corpus of reference documents on which to perform semantic search
- `sentence` is the sentence from which to perform semantic search
- Returns: the reference text of the document most similar to `sentence`

In [None]:
#!/usr/bin/env python3
"""
Defines function that performs semantic search on a corpus of documents
"""


import numpy as np
import os
import tensorflow_hub as hub


def semantic_search(corpus_path, sentence):
    """
    Performs semantic search on a corpus of documents
    """
    documents = [sentence]

    for filename in os.listdir(corpus_path):
        if not filename.endswith(".md"):
            continue
        path = os.path.join(corpus_path, filename)
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())

    model = hub.load(
        "https://tfhub.dev/google/universal-sentence-encoder-large/5")

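    # embed the query (index 0) together with every document
    embeddings = model(documents)

    # pairwise inner products; row 0 holds the query's score against each entry
    correlation = np.inner(embeddings, embeddings)

    # skip column 0 (the query itself) and pick the best-scoring document
    closest = np.argmax(correlation[0, 1:])

    similar = documents[closest + 1]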

    return similar
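
The ranking step can be illustrated without downloading the encoder. The sketch below scores a query vector against a few document vectors with a plain inner product and picks the highest score, which is what `np.argmax(correlation[0, 1:])` does on the real embeddings. The two-dimensional vectors are made up for illustration, and `inner` and `most_similar` are hypothetical helper names. For unit-length vectors like these, the inner product equals the cosine similarity.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_similar(query_vec, doc_vecs):
    """Return the index of the best-scoring document and all scores."""
    scores = [inner(query_vec, d) for d in doc_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# toy unit-length vectors standing in for sentence embeddings
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]

best, scores = most_similar(query, docs)
print(best, scores)
```

Here the second document (index 1) wins because its vector points most nearly in the same direction as the query.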

### Main (Test) File

In [None]:
print(semantic_search('../data/ZendeskArticles', 'When are PLDs?'))

## Task 4. Multi-reference Question Answering

Based on the previous tasks, write a function `def question_answer(corpus_path):` that answers questions from multiple reference texts:

- `corpus_path` is the path to the corpus of reference documents

In [None]:
#!/usr/bin/env python3
"""
Defines function that answers questions from multiple reference texts on loop
"""


import numpy as np
import os
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def question_answer(corpus_path):
    """
    Answers questions from multiple reference texts
    """
    while True:
        user_input = input("Q: ").lower()
        if user_input in ('exit', 'quit', 'goodbye', 'bye'):
            print("A: Goodbye")
            break
        reference = semantic_search(corpus_path, user_input)
        answer = specific_question_answer(user_input, reference)
        if answer is None:
            print("A: Sorry, I do not understand your question.")
        else:
            print("A:", answer)


def semantic_search(corpus_path, sentence):
    """
    Performs semantic search on a corpus of documents
    """
    documents = [sentence]

    for filename in os.listdir(corpus_path):
        if not filename.endswith(".md"):
            continue
        path = os.path.join(corpus_path, filename)
        with open(path, "r", encoding="utf-8") as f:
            documents.append(f.read())

    model = hub.load(
        "https://tfhub.dev/google/universal-sentence-encoder-large/5")

    embeddings = model(documents)
    correlation = np.inner(embeddings, embeddings)
    closest = np.argmax(correlation[0, 1:])
    similar = documents[closest + 1]

    return similar


def specific_question_answer(question, reference):
    """
    Finds a snippet of text within a reference document to answer a question
    """
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')
    model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

    quest_tokens = tokenizer.tokenize(question)
    refer_tokens = tokenizer.tokenize(reference)

    tokens = ['[CLS]'] + quest_tokens + ['[SEP]'] + refer_tokens + ['[SEP]']

    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    input_type_ids = [0] * (
        1 + len(quest_tokens) + 1) + [1] * (len(refer_tokens) + 1)

    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(
            tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))

    outputs = model([input_word_ids, input_mask, input_type_ids])

    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    if not answer or question in answer:
        return None

    return answer
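
The function above chains two stages: retrieve the most relevant document, then extract an answer from it. As a rough sketch of that flow in isolation, the example below passes both stages in as parameters so it runs without any model. `respond`, `toy_search`, and `toy_read` are hypothetical names, and the toy functions return invented text purely for illustration.

```python
FALLBACK = "Sorry, I do not understand your question."


def respond(question, corpus_path, search, read):
    """Retrieve the best-matching document, then extract an answer from it."""
    reference = search(corpus_path, question)
    answer = read(question, reference)
    return FALLBACK if answer is None else answer


# stand-ins for semantic_search and specific_question_answer (toy behavior)
def toy_search(corpus_path, sentence):
    return "the lab opens at nine"


def toy_read(question, reference):
    return "at nine" if "open" in question else None


found = respond("when does the lab open?", "unused/path", toy_search, toy_read)
missing = respond("what is the wifi password?", "unused/path", toy_search, toy_read)
print(found)
print(missing)
```

Keeping the stages separate like this makes it possible to test the fallback message without loading either model.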


### Main (Test) File

In [None]:
question_answer('../data/ZendeskArticles')