<a href="https://colab.research.google.com/github/meghamkpatel/physicaltherapyassistant/blob/main/PhysioPhrame_Test_and_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PhysioPhrame Test and Evaluation

### This notebook demonstrates how to use Retrieval-Augmented Generation (RAG) with Langchain and OpenAI. The goal is to create a chatbot that answers questions based on a YouTube video's transcription.

In [7]:
import os
import streamlit as st
from dotenv import load_dotenv
from pinecone import Pinecone
from langchain_openai.embeddings import OpenAIEmbeddings

# Load environment variables
OPENAI_API_KEY = st.secrets["OPENAI_API_KEY"]
MODEL = "gpt-3.5-turbo"

# Initialize Pinecone
PINECONE_API_KEY = st.secrets["PINECONE_API_KEY"]
pc = Pinecone(api_key=PINECONE_API_KEY)

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


## Section 1: Load and Split Transcription

In this section, we load the transcription from a file and split it into smaller chunks. This step is essential to handle large texts efficiently and avoid context length limitations when interacting with the model.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

def load_local_documents(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    
    # Replace newline characters with spaces in each document
    for doc in documents:
        doc.page_content = doc.page_content.replace("\n", " ")
    
    return documents

# Define the directory containing the documents
directory = 'content/Textbook'

# Load local documents
documents = load_local_documents(directory)

## Section 2: Compute Similarity and Find Relevant Chunks
Here, we compute the embeddings for each chunk and the question. Then, we find the most relevant chunk based on cosine similarity to provide as context for answering the question.

In [9]:
from pinecone import Pinecone
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize Pinecone
pinecone_api_key = st.secrets["PINECONE_API_KEY"]
pinecone = Pinecone(api_key=pinecone_api_key)

# Create or load the index
index_name = "physical-therapy"
pc_index = pc.Index(index_name)


vectorstore = pinecone.Index(name=index_name)

# Generate embeddings
embeddings = OpenAIEmbeddings()
vector_dict = {f"doc_{i}": embeddings.embed_query(doc.page_content) for i, doc in enumerate(documents)}


## Section 3: Create a Knowledge Base and Generate Test Set
In this section, we create a knowledge base using the split transcription and generate a test set of questions based on this knowledge base. The test set will be used to evaluate the chatbot's performance.

In [10]:
import pandas as pd

# Load the content in a pandas DataFrame
df = pd.DataFrame([doc.page_content for doc in documents], columns=["text"])

In [11]:
from giskard.rag import KnowledgeBase, generate_testset, evaluate

# Create a Knowledge Base
knowledge_base = KnowledgeBase(df)

In [13]:
# Generate the Test Set
testset = generate_testset(
    knowledge_base,
    num_questions=40,
    agent_description="A chatbot answering questions about physical therapy",
)

Generating questions: 100%|██████████| 40/40 [31:04<00:00, 46.61s/it]


In [14]:
test_set_df = testset.to_pandas()

for index, row in enumerate(test_set_df.head(3).iterrows()):
    print(f"Question {index + 1}: {row[1]['question']}")
    print(f"Reference answer: {row[1]['reference_answer']}")
    print("Reference context:")
    print(row[1]['reference_context'])
    print("******************", end="\n\n")

Question 1: What are the key factors that influence the rate at which a patient can progress through the rotator cuff physical therapy program?
Reference answer: The key factors that influence the rate at which a patient can progress through the rotator cuff physical therapy program include surgical technique, quality of the tissue repaired, size of the tear, timing of the repair, etiology of the tear, and the location of tear.
Reference context:
Document 7: Visit per week:  2 to 3 times  ROTATOR CUFF PHYSICAL THERAPY PRESCRIPTION (Advanced Program)  The following rotator cuff repair guidelines were developed by Dr. Xinning Li, which is based on the postoperative rehabilitation recommendations from the Hospital for Special Surgery. Progression is both criteria based and patient specific. Phases and time frames are designed to give the clinician and therapist a general sense of progression. The rehabilitation program following rotator cuff repair must take into account and allow for the

In [15]:
testset.save("test-set.jsonl")


In [16]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: Here is some context

Question: Here is a question



In [17]:
from langchain_openai.chat_models import ChatOpenAI

prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

## Section 4: Evaluate the Model on the Test Set
Finally, we evaluate the model's performance on the generated test set. The model is invoked with the relevant context and the question from the test set, and the evaluation report provides insights into how well the model performed.

In [18]:
from langchain_core.output_parsers import StrOutputParser
from scipy.spatial.distance import cosine

# Evaluate the model on the test set
def answer_fn(question, history=None):
    chain = prompt | model | StrOutputParser()
    
    return chain.invoke({"context": documents,"question": "what is physical therapy?"})

In [19]:
from giskard.rag import KnowledgeBase, generate_testset, evaluate

# Evaluate the model on the test set
report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)

Asking questions to the agent: 100%|██████████| 40/40 [09:20<00:00, 14.01s/it]
correctness evaluation: 100%|██████████| 40/40 [02:44<00:00,  4.12s/it]


### Evaluation Summary

The evaluation report provides insights into the performance of different components of the Retrieval-Augmented Generation (RAG) system:

#### Component Scores:

- **ROUTING (100.0%)**: Responsible for directing queries to the appropriate subsystems. A score of 100% indicates optimal performance.
  
- **KNOWLEDGE_BASE (100.0%)**: Reflects the knowledge base's effectiveness in providing relevant information. A score of 100% signifies a comprehensive and accurate knowledge base.

- **GENERATOR, RETRIEVER, REWRITER (0.0%)**: These components scored 0.0%, indicating poor performance. The system faces challenges in generating relevant responses, retrieving accurate information, and rewriting responses effectively.

#### Overall Correctness Score:

- **Overall Correctness Score**: 90%

#### Recommendation:

Given the low scores for the **GENERATOR**, **RETRIEVER**, and **REWRITER** components, the recommendation suggests a comprehensive review and improvement of these areas. While the system demonstrates proficiency in answering diverse questions and shows versatility across topics, indicating a robust knowledge base, there is a pressing need to enhance the generative, retrieval, and rewriting mechanisms to elevate the system's overall performance.


In [20]:
# Display the report
display(report)