## Part 3: Creating a customised chatbot

Based on Parts 1 and 2 in the earlier notebooks, I have extracted the data needed as context for building a customised chatbot.
The rest of this notebook covers:
- Part 3a: Creating a default chatbot without any system prompt
- Part 3b: Creating an improved chatbot with a system prompt
- Conclusion

In [73]:
import os
import openai

from llama_index import Document, GPTVectorStoreIndex, ServiceContext
from llama_index.readers import SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

In [63]:
# Set your own OpenAI API key here. The original key has been removed for security reasons.
os.environ['OPENAI_API_KEY'] = "< Include your own OpenAI API key>"
openai.api_key = os.getenv("OPENAI_API_KEY")
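
To keep the key out of the notebook entirely, one option is to read it from the environment and only prompt for it when it is missing. The sketch below assumes that approach; `load_api_key` is a hypothetical helper, not part of `openai` or `llama_index`.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it never appears in the cell output
        key = getpass(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

It could then be used as `openai.api_key = load_api_key()`.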

In [76]:
filename_fn = lambda filename: {'file_name': filename}
my_docs = SimpleDirectoryReader(input_dir="../data", exclude_hidden=True, file_metadata=filename_fn).load_data()

print(f"Loaded {len(my_docs)} docs")

Loaded 125 docs


### Part 3a: Creating a default chatbot without any system prompt

In [30]:
# This is the original without any prompts for the chatbot
original_service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0))

In [31]:
original_index = GPTVectorStoreIndex.from_documents(documents=my_docs, service_context=original_service_context)

In [32]:
original_query_engine = original_index.as_query_engine()

#### Testing out some questions with the original query engine

In [145]:
import time
start = time.time()
response = original_query_engine.query("How much salary must a candidate earn to be eligible for employment pass?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

A candidate must earn a fixed monthly salary starting from $5,000 to be eligible for an employment pass. However, candidates in the financial services sector need to earn higher salaries to qualify.

This query took: 8.021786212921143 secs.


In [146]:
start = time.time()
response = original_query_engine.query("How do I earn 20 points under the salary criteria?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

You can earn 20 points under the salary criteria by having a fixed monthly salary that is at or above the 90th percentile compared to the salary benchmarks by sector.

This query took: 8.367677211761475 secs.


In [147]:
start = time.time()
response = original_query_engine.query("Where can I find information about my company's diversity?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

You can find information about your company's diversity in the "Diversity" tab of the Workforce Insights tool, which can be accessed via the myMOM Portal.

This query took: 8.531136989593506 secs.


**Based on the three queries tested above, the chatbot gives succinct answers to our questions about qualifications, salary and diversity. Ideally, however, the chatbot would include more context on how it derived each answer.**

**As for the runtime, each query took about 8.3 secs on average. This compares well with a HubSpot study (screenshot below), which found the average chatbot response time across industries to be 9.3 seconds.**

![Average response time](../images/Average_response_time_of_chatbot.png)
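
The timing code is repeated in every test cell above. A small helper could wrap it once. This is only a sketch: `timed_query` and `fake_query` are hypothetical names, and `fake_query` stands in for `original_query_engine.query` so the example runs without an API key.

```python
import time


def timed_query(query_fn, question):
    """Run query_fn(question) and return (response, elapsed_seconds)."""
    start = time.perf_counter()
    response = query_fn(question)
    elapsed = time.perf_counter() - start
    return response, elapsed


# Stand-in for original_query_engine.query so the sketch runs on its own.
def fake_query(question):
    return f"(answer to: {question})"


response, elapsed = timed_query(fake_query, "How do I earn 20 points under the salary criteria?")
print(response)
print(f"This query took: {elapsed:.2f} secs.")
```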

#### Generating the questions to measure the performance of the original query engine

In [149]:
# Shuffle the documents
import random

random.seed(42)
random.shuffle(my_docs)

In [150]:
question_gen_query = (
    "You are working at Ministry of Manpower focusing on eligibility requirements of the employment pass. \
    There is a new complementarity assessment framework that will assess the eligibility of all prospective employment pass holders. \
    Your task is to setup all possible questions and requests, \
    using the provided context from documents on eligibility of employment pass, \
    formulate questions that capture important facts from the context. \
    Restrict the question to the context information provided."
)

In [152]:
dataset_generator = DatasetGenerator.from_documents(
    my_docs,
    question_gen_query=question_gen_query,
    service_context=original_service_context,
)

In [153]:
questions = dataset_generator.generate_questions_from_nodes(num=30)
print("Generated ", len(questions), " questions")

Generated  30  questions


In [154]:
with open("../qns_and_eval/original_evaluation_questions.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")

#### Generating the evaluation metrics to find out the performance of the original query engine

In [155]:
original_contexts = []
original_answers = []

for question in questions:
    response = original_query_engine.query(question)
    original_contexts.append([x.node.get_content() for x in response.source_nodes])
    original_answers.append(str(response))

In [157]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

original_ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": original_answers,
        "contexts": original_contexts,
    }
)

original_result = evaluate(original_ds, [answer_relevancy, faithfulness])
print(original_result)

evaluating with [answer_relevancy]


 50%|███████████████████████                       | 1/2 [01:58<01:58, 118.37s/it]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Wed, 08 Nov 2023 09:41:14 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '822cda7e7dee46f7-SIN', 'alt-svc': 'h3=":443"; ma=86400'}.
100%|██████████████████████████████████████████████| 2/2 [04:42<00:00, 141.28s/it]


evaluating with [faithfulness]


  0%|                                                       | 0/2 [00:00<?, ?it/s]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600).
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600).
100%|█████████████████████████████████████████████| 2/2 [34:36<00:00, 1038.20s/it]


{'ragas_score': 0.8767, 'answer_relevancy': 0.9227, 'faithfulness': 0.8350}


**Based on the evaluation metrics, the original query engine performs well enough to be deployed on Streamlit. All three scores are fairly close to 1, which indicates good performance.**
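
The printed `ragas_score` matches the harmonic mean of the two individual metrics, which is why it sits closer to the lower of the two. The check below is an observation from the printed numbers, not a statement about the library's internals.

```python
from statistics import harmonic_mean

# Values copied from the evaluation output above
answer_relevancy = 0.9227
faithfulness = 0.8350

combined = harmonic_mean([answer_relevancy, faithfulness])
print(f"{combined:.4f}")  # 0.8767
```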

### Part 3b: Improved chatbot with system prompt

In [91]:
# This is the improved service context with context_window and system prompt added for the chatbot
improved_service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0), 
    context_window=2048, 
    system_prompt = "You are an expert who understands the eligibility criteria of employment pass and your job is to answer questions related to the COMPASS and all relevant requirements. Keep your answers factual and provide more context. When asked about salary criteria or C1, include both the age and sector assumed if not provided before answering.")

In [92]:
improved_index = GPTVectorStoreIndex.from_documents(documents=my_docs, service_context=improved_service_context)

In [83]:
improved_query_engine = improved_index.as_query_engine()

#### Testing out some questions with the improved query engine

In [165]:
import time
start = time.time()
response = improved_query_engine.query("How much salary must a candidate earn to be eligible for employment pass?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

A candidate must have a fixed monthly salary starting from $5,000 to be eligible for an Employment Pass. The salary increases progressively with age, up to $10,500 for those in the mid-40s. However, candidates in the financial services sector must earn at least $5,500, with the salary also increasing progressively with age up to $11,500 for those in the mid-40s.

This query took: 18.43025517463684 secs.


In [167]:
start = time.time()
response = improved_query_engine.query("How do I earn 20 points under the salary criteria?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

To earn 20 points under the salary criteria, your candidate's fixed monthly salary should be at or above the 90th percentile of the salary benchmarks by sector.

This query took: 8.806796073913574 secs.


In [168]:
start = time.time()
response = improved_query_engine.query("Where can I find information about my company's diversity?")
print(response)
end = time.time()
print("")
print(f"This query took: {end-start} secs.")

You can find information about your company's diversity in the "Diversity" tab of the Workforce Insights tool on the myMOM Portal. This tab shows you the top nationalities of PMETs (Professionals, Managers, Executives, and Technicians) in your firm, allowing you to assess the diversity of your workforce.

This query took: 6.932214736938477 secs.


**Based on the three queries tested above, the chatbot gives more comprehensive answers to our questions about qualifications, salary and diversity. Ideally, however, it would still include more context on how it derived each answer.**

**As for the runtime, the three queries took about 11.4 secs on average, noticeably longer than the original query engine, although most of the gap comes from the first query (18.4 secs). While this is slower than the average response time in the HubSpot study, the improved chatbot provides more context for the same set of questions, which is more suitable for our use case.**
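
With only three samples per engine, the mean is sensitive to a single slow call, so it may help to look at the median as well. The sketch below recomputes both from the timings printed in this notebook.

```python
from statistics import mean, median

# Timings (in seconds) copied from the query outputs above
original_secs = [8.021786212921143, 8.367677211761475, 8.531136989593506]
improved_secs = [18.43025517463684, 8.806796073913574, 6.932214736938477]

for name, secs in [("original", original_secs), ("improved", improved_secs)]:
    print(f"{name}: mean {mean(secs):.1f} secs, median {median(secs):.1f} secs")
# original: mean 8.3 secs, median 8.4 secs
# improved: mean 11.4 secs, median 8.8 secs
```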

#### Generating the questions to measure the performance of the improved query engine

In [158]:
# Shuffle the documents
import random

random.seed(42)
random.shuffle(my_docs)

In [159]:
question_gen_query = (
    "You are working at Ministry of Manpower focusing on eligibility requirements of the employment pass. \
    There is a new complementarity assessment framework that will assess the eligibility of all prospective employment pass holders. \
    Your task is to setup all possible questions and requests, \
    using the provided context from documents on eligibility of employment pass, \
    formulate questions that capture important facts from the context. \
    Restrict the question to the context information provided."
)

In [160]:
improved_dataset_generator = DatasetGenerator.from_documents(
    my_docs,
    question_gen_query=question_gen_query,
    service_context=improved_service_context,
)

In [161]:
improved_questions = improved_dataset_generator.generate_questions_from_nodes(num=30)
print("Generated ", len(improved_questions), " questions")

Generated  30  questions


In [162]:
with open("../qns_and_eval/improved_evaluation_questions.txt", "w") as f:
    for question in improved_questions:
        f.write(question + "\n")
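
Note that the two engines are evaluated on separately generated question sets (`questions` and `improved_questions`). Since each set is saved one question per line, a saved file could be reloaded and reused for both engines to make the comparison like-for-like. A minimal sketch, assuming that file layout; the helper names and the two sample questions are purely illustrative.

```python
import os
import tempfile


def save_questions(path, questions):
    """Write one question per line, mirroring the cells above."""
    with open(path, "w", encoding="utf-8") as f:
        for question in questions:
            f.write(question + "\n")


def load_questions(path):
    """Read the questions back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Round trip in a temporary directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "evaluation_questions.txt")
    save_questions(path, ["What is COMPASS?", "Who needs an Employment Pass?"])
    shared_questions = load_questions(path)

print(shared_questions)
```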

#### Generating the evaluation metrics to find out the performance of the improved query engine

In [163]:
improved_eval_contexts = []
improved_eval_answers = []

for question in improved_questions:
    improved_eval_response = improved_query_engine.query(question)
    improved_eval_contexts.append([x.node.get_content() for x in improved_eval_response.source_nodes])
    improved_eval_answers.append(str(improved_eval_response))

In [164]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

improved_eval_ds = Dataset.from_dict(
    {
        "question": improved_questions,
        "answer": improved_eval_answers,
        "contexts": improved_eval_contexts,
    }
)

improved_eval_result = evaluate(improved_eval_ds, [answer_relevancy, faithfulness])
print(improved_eval_result) 

evaluating with [answer_relevancy]


100%|███████████████████████████████████████████████| 2/2 [03:05<00:00, 92.52s/it]


evaluating with [faithfulness]


  0%|                                                       | 0/2 [00:00<?, ?it/s]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600).
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600).
 50%|██████████████████████▌                      | 1/2 [27:56<27:56, 1676.33s/it]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600).
100%|█████████████████████████████████████████████| 2/2 [48:17<00:00, 1448.83s/it]


{'ragas_score': 0.8310, 'answer_relevancy': 0.9606, 'faithfulness': 0.7322}
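
The progress logs above show the evaluation retrying after timeouts. The same idea could be applied to the notebook's own query loops, so that one failed request does not discard a long run. The sketch below is a generic retry with exponential backoff; `retry_with_backoff` and `flaky` are hypothetical names, and the 4-second base delay simply mirrors the value in the logs.

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=4.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i seconds, then try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)


# Simulated call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record the waits instead of actually sleeping
print(retry_with_backoff(flaky, sleep=delays.append))  # ok
print(delays)  # [4.0, 8.0]
```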


### Conclusion

**The table below summarises the evaluation metrics and average response time of the two chatbots, which differ only in how their query engines are configured.**

| Query engine                                | RAGAS score | Answer relevancy | Faithfulness | Average response time |
|---------------------------------------------|-------------|------------------|--------------|-----------------------|
| Original (GPT-3.5-Turbo)                    | 0.8767      | 0.9227           | 0.8350       | 8.3 secs              |
| Improved with system prompt (GPT-3.5-Turbo) | 0.8310      | 0.9606           | 0.7322       | 11.4 secs             |

**At a glance, the original query engine scores better on every attribute except answer relevancy.**

**However, for our use case, we place slightly more weight on answer relevancy, because we care most about how appropriate and complete the responses are to the questions posed by the user. There is still room to improve faithfulness and the overall RAGAS score, since a lower faithfulness score implies that the responses are less factually consistent with the retrieved context.**
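
To make the trade-off concrete, the per-metric change from the original to the improved engine can be computed from the two printed results. This is simple arithmetic on the reported scores.

```python
# Scores copied from the two evaluation outputs above
original = {"ragas_score": 0.8767, "answer_relevancy": 0.9227, "faithfulness": 0.8350}
improved = {"ragas_score": 0.8310, "answer_relevancy": 0.9606, "faithfulness": 0.7322}

for metric in original:
    delta = improved[metric] - original[metric]
    print(f"{metric:>16}: {delta:+.4f}")
#      ragas_score: -0.0457
# answer_relevancy: +0.0379
#     faithfulness: -0.1028
```

The gain in answer relevancy is smaller than the loss in faithfulness, which is why the combined score drops.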

**We will use the improved query engine with system prompt for our chatbot demonstration.**

In [107]:
# Store the vectors of all the documents curated for the chatbot that is to be deployed on Streamlit
improved_index.storage_context.persist(persist_dir="../streamlit/improved_index.vecstore")

With the vectors of the documents stored, we are now ready to use them for our chatbot on Streamlit (https://chatbot-compass.streamlit.app/).
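
On the app side, the persisted index would need to be loaded back before it can answer queries. The sketch below shows one way this might look with the same `llama_index` generation used in this notebook. `load_persisted_query_engine` is a hypothetical helper. The system prompt is configured on the `ServiceContext`, so it is assumed here that the same context has to be supplied again at load time.

```python
def load_persisted_query_engine(persist_dir, service_context=None):
    """Rebuild a query engine from a directory written by storage_context.persist()."""
    # Imported inside the function so the sketch can be defined without llama_index installed
    from llama_index import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    # Assumption: passing service_context here re-applies the LLM settings and system prompt
    index = load_index_from_storage(storage_context, service_context=service_context)
    return index.as_query_engine()
```

For example: `load_persisted_query_engine("../streamlit/improved_index.vecstore", improved_service_context)`.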