# ChatDoctor (LLM App)

ChatDoctor is an LLM chatbot that assists patients by providing medical answers. Let's see how we can test and evaluate this LLM App with [Lynxius](https://www.lynxius.ai/).

## Setup

To set up [Lynxius](https://www.lynxius.ai/) you only need to set the `LYNXIUS_API_KEY` environment variable and install the [Lynxius](https://www.lynxius.ai/) library with `pip install lynxius`. In this tutorial we are going to run [Lynxius](https://www.lynxius.ai/) directly from its source code instead.

In [1]:
# First, set up the Lynxius API key
import os
import sys
from getpass import getpass

# Make the local Lynxius source tree importable (we run from source, not from pip)
sys.path.append("../")

if not (lynxius_api_key := os.getenv("LYNXIUS_API_KEY")):
    lynxius_api_key = getpass("🔑 Enter your Lynxius API key: ")

os.environ["LYNXIUS_API_KEY"] = lynxius_api_key
os.environ["LYNXIUS_BASE_URL"] = "https://REQUEST-THE-ENDPOINT-TO-GET-ACCESS"

🔑 Enter your Lynxius API key:  ········


Run these [Postman](https://www.postman.com/) collections to upload the datasets used in this notebook to your platform:

1. [Lynxius ChatDoctor Project (1/3)](./data/postman/Lynxius_ChatDoctor_Project_1_of_3.postman_collection.json) to upload **Dataset_v1**
2. [Lynxius ChatDoctor Project (2/3)](./data/postman/Lynxius_ChatDoctor_Project_2_of_3.postman_collection.json) to upload **Dataset_v2**
3. [Lynxius ChatDoctor Project (3/3)](./data/postman/Lynxius_ChatDoctor_Project_3_of_3.postman_collection.json) to upload **Dataset_v2-labeled**

## Let's evaluate ChatDoctor_v1 against our Dataset_v1

**Dataset_v1** is already stored in your platform and contains pairs of questions and ground-truth answers. Let's download it with the `get_dataset_details()` function.

We can then evaluate your **ChatDoctor_v1** LLM application by comparing its outputs for the **Dataset_v1** queries with the corresponding ground-truth reference answers. In this notebook we are going to use the `BertScore` and `AnswerCorrectness` metrics for the evaluation.

In [2]:
from lynxius.client import LynxiusClient

client = LynxiusClient()

# Downloading Dataset_v1 from Lynxius platform
dataset_details = client.get_dataset_details(dataset_id="DATASET_V1_UUID")

for entry in dataset_details.entries:
    print(entry.query)

How can I prevent the flu?
What are the early signs of diabetes?
How do I know if I have a food allergy?
What should I do if I get a sunburn?


In [3]:
from datasets_utils import chatdoctor_v1

from lynxius.evals.bert_score import BertScore
from lynxius.evals.answer_correctness import AnswerCorrectness

label = "PR #111"
tags = ["GPT-4", "chatdoctor_v1", "Dataset_v1"]
bert_score = BertScore(label=label, tags=tags, level="sentence", presence_threshold=0.65)
answer_correctness = AnswerCorrectness(label=label, tags=tags)

for entry in dataset_details.entries:
    actual_output = chatdoctor_v1(entry.query)
    
    bert_score.add_trace(reference=entry.reference, output=actual_output)
    answer_correctness.add_trace(query=entry.query, reference=entry.reference, output=actual_output)

# Run!
client.evaluate(bert_score)
client.evaluate(answer_correctness)

'fbcdd27d-a973-4b12-a7dc-9ece1e222cb0'
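
For intuition about what a BERTScore-style metric measures, here is a minimal sketch of the general greedy-matching idea: each reference token is matched to its most similar output token (recall), each output token to its most similar reference token (precision), and the two are combined into an F1. This is only an illustration under stated assumptions: the 2-D vectors are toy stand-ins for real contextual embeddings, the function names are hypothetical, and nothing here describes how the Lynxius `BertScore` evaluator is implemented or what its `level` and `presence_threshold` parameters do internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(reference_vecs, output_vecs):
    """Return (precision, recall, f1) using greedy best-match cosine similarity."""
    # Recall: how well each reference token is covered by the output.
    recall = sum(
        max(cosine(r, o) for o in output_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # Precision: how well each output token is supported by the reference.
    precision = sum(
        max(cosine(o, r) for r in reference_vecs) for o in output_vecs
    ) / len(output_vecs)
    if precision + recall <= 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings" (assumption): the reference has two tokens, the output only one.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
output_vecs = [[1.0, 0.0]]

precision, recall, f1 = greedy_match_scores(reference_vecs, output_vecs)
print(precision, recall, f1)
```

In this toy case the output covers only one of the two reference tokens, so precision is perfect while recall is halved, which is the kind of gap a reference-based metric is meant to expose.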

🚀🚀🚀 It looks like the evaluations scored pretty well! **ChatDoctor_v1** can be deployed to production! 🚀🚀🚀

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/chatdoctorv1_datasetv1.png" alt="chatdoctorv1_datasetv1" width="60%" />

## Production Monitoring

⚠️⚠️⚠️ From the [Lynxius](https://www.lynxius.ai/) platform you can monitor **ChatDoctor_v1** in production and quickly spot that its performance has been decreasing over the weeks. It seems your users are asking queries that **ChatDoctor_v1** cannot answer with a high level of correctness. ⚠️⚠️⚠️

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/chatdoctorv1_monitor.png" alt="chatdoctorv1_monitor.png" width="60%" />
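
The kind of downward trend shown in the monitoring view can also be checked programmatically. The sketch below is a hypothetical helper, not part of the Lynxius API: it compares the average of the earliest weeks with the average of the most recent weeks and flags a drop above a chosen threshold. The function name, the `window` and `min_drop` parameters, and the weekly scores are all made up for illustration.

```python
from statistics import mean


def is_degrading(weekly_scores, window=2, min_drop=0.1):
    """Flag a drop between the first and last `window` weeks of scores.

    Returns False when there are too few weeks to compare two full windows.
    """
    if len(weekly_scores) < 2 * window:
        return False
    baseline = mean(weekly_scores[:window])
    recent = mean(weekly_scores[-window:])
    return (baseline - recent) >= min_drop


# Illustrative weekly scores (assumption, not real ChatDoctor results).
toy_scores = [0.90, 0.88, 0.80, 0.70, 0.62]
print(is_degrading(toy_scores))
```

A window-based comparison is deliberately simple; it smooths out a single noisy week but will react slowly to a sudden regression.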

## Production Data Capturing

Thankfully [Lynxius](https://www.lynxius.ai/) automatically collects your users' queries and empowers you to efficiently debug edge cases. In this case **Dataset_v2** has been automatically collected.

It seems like your users are asking about the **symptoms** related to specific conditions and your chatbot is not able to provide correct answers to this new kind of query.

Your Subject Matter Experts (SMEs), such as medical doctors 👩‍⚕️👨‍⚕️, can use the [Lynxius](https://www.lynxius.ai/) UI to quickly spot this edge case and annotate the new queries with high-quality reference data ✅

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/datasetv2_spotted_symptoms.png" alt="datasetv2_spotted_symptoms" width="60%" />
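
To see why the new queries stand out, here is a small sketch that works outside the Lynxius UI and does not use any Lynxius API. It takes the query lists printed in this notebook for **Dataset_v1** and **Dataset_v2-labeled**, keeps only the queries that were not in **Dataset_v1**, and counts their leading words. The helper `leading_words` and the choice of five words are assumptions made for illustration.

```python
from collections import Counter

# Queries copied from the dataset printouts in this notebook.
dataset_v1_queries = [
    "How can I prevent the flu?",
    "What are the early signs of diabetes?",
    "How do I know if I have a food allergy?",
    "What should I do if I get a sunburn?",
]
dataset_v2_queries = dataset_v1_queries + [
    "What are the symptoms of a migraine headache?",
    "What are the symptoms of the common cold?",
    "What are the symptoms of a urinary tract infection (UTI)?",
]

# Keep only the queries that did not exist in Dataset_v1.
seen = set(dataset_v1_queries)
new_queries = [q for q in dataset_v2_queries if q not in seen]


def leading_words(query, n=5):
    """Return the first `n` lowercase words of a query as a rough topic key."""
    return " ".join(query.lower().split()[:n])


prefix_counts = Counter(leading_words(q) for q in new_queries)
most_common_prefix, count = prefix_counts.most_common(1)[0]

print(new_queries)
print(most_common_prefix, count)
```

All three new queries share the same opening words about symptoms, which matches the edge case described above.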

## Let's evaluate ChatDoctor_v1 against our Dataset_v2-labeled

Once your SMEs have correctly annotated the new data, you can evaluate **ChatDoctor_v1** again to see its real performance against **Dataset_v2-labeled**.

In [8]:
# Downloading Dataset_v2-labeled from Lynxius platform
dataset_details = client.get_dataset_details(dataset_id="DATASET_V2_LABELED_UUID")

for entry in dataset_details.entries:
    print(entry.query)

How can I prevent the flu?
What are the early signs of diabetes?
How do I know if I have a food allergy?
What should I do if I get a sunburn?
What are the symptoms of a migraine headache?
What are the symptoms of the common cold?
What are the symptoms of a urinary tract infection (UTI)?


In [9]:
from datasets_utils import chatdoctor_v1

from lynxius.evals.bert_score import BertScore
from lynxius.evals.answer_correctness import AnswerCorrectness

label = "PR #222"
tags = ["GPT-4", "chatdoctor_v1", "Dataset_v2-labeled"]
bert_score = BertScore(label=label, tags=tags, level="sentence", presence_threshold=0.65)
answer_correctness = AnswerCorrectness(label=label, tags=tags)

for entry in dataset_details.entries:
    actual_output = chatdoctor_v1(entry.query)
    
    bert_score.add_trace(reference=entry.reference, output=actual_output)
    answer_correctness.add_trace(query=entry.query, reference=entry.reference, output=actual_output)

# Run!
client.evaluate(bert_score)
client.evaluate(answer_correctness)

'7097779c-a2af-4a14-bfd0-9a534db466b3'

❌❌❌ It looks like the evaluation scores were not acceptable ❌❌❌

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/chatdoctorv1_datasetv2labeled.png" alt="chatdoctorv1_datasetv2labeled" width="60%" />

## Time to build ChatDoctor_v2

🔧🔨🔩 **ChatDoctor_v1** clearly cannot handle the new edge case and the team needs to work hard on the new **ChatDoctor_v2**. 🔧🔨🔩

## Let's evaluate ChatDoctor_v2 against our Dataset_v2-labeled

Once **ChatDoctor_v2** is ready we can test it against the dataset labeled by your SMEs, **Dataset_v2-labeled**.

In [10]:
# Downloading Dataset_v2-labeled from Lynxius platform
dataset_details = client.get_dataset_details(dataset_id="DATASET_V2_LABELED_UUID")

for entry in dataset_details.entries:
    print(entry.query)

How can I prevent the flu?
What are the early signs of diabetes?
How do I know if I have a food allergy?
What should I do if I get a sunburn?
What are the symptoms of a migraine headache?
What are the symptoms of the common cold?
What are the symptoms of a urinary tract infection (UTI)?


In [11]:
from datasets_utils import chatdoctor_v2

from lynxius.evals.bert_score import BertScore
from lynxius.evals.answer_correctness import AnswerCorrectness

label = "PR #333"
tags = ["GPT-4", "chatdoctor_v2", "Dataset_v2-labeled"]
bert_score = BertScore(label=label, tags=tags, level="sentence", presence_threshold=0.65)
answer_correctness = AnswerCorrectness(label=label, tags=tags)

for entry in dataset_details.entries:
    actual_output = chatdoctor_v2(entry.query)

    bert_score.add_trace(reference=entry.reference, output=actual_output)
    answer_correctness.add_trace(query=entry.query, reference=entry.reference, output=actual_output)

# Run!
client.evaluate(bert_score)
client.evaluate(answer_correctness)

'f934d5e0-fe25-4884-a51e-79edffd786d0'

🚀🚀🚀 Yes!!! It looks like the evaluations scored well! 🚀🚀🚀

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/chatdoctorv2_datasetv2labeled.png" alt="chatdoctorv2_datasetv2labeled" width="60%" />

🚀🚀🚀 We can also see that **ChatDoctor_v2** clearly outperforms **ChatDoctor_v1** 🚀🚀🚀

<img src="https://github-public-assets.s3.us-west-1.amazonaws.com/chatdoctorv2_monitor.png" alt="chatdoctorv2_monitor" width="60%" />

## Final Considerations

The [Lynxius](https://www.lynxius.ai/) platform helped the team evaluate their LLM Apps and decide when they were ready to deploy to production ✅✅✅. It also empowered the team to discover production issues quickly ✅✅✅ and to collect important end-user queries to further improve their product ✅✅✅