In [5]:
import os
os.chdir('../../Library/CloudStorage/Box-Box/DISC_AI Project/Transripts Pilot')

In [6]:
os.listdir()

['.chroma',
 '.DS_Store',
 '6.3 Effects of Age and Disuse on Skeletal Muscle.pdf',
 '6.9 Precautions and Contraindications for Resistance Exercise.pdf',
 '6.4 Framework for Resistance Training.pdf',
 'd_db',
 '6.1 Principles of Resistance Exercise.pdf',
 '6.7 Case Application of Oddvar-Holten Method.pdf',
 '6.5 Velocity and Mode Resistsance Training Parameters.pdf',
 '6.6 Volume and Intensity of Resistance Exercise.pdf',
 '6.2 Muscle Adaptations to Resistance Exercise.pdf',
 'p_db',
 '~',
 '6.8 Order and Frequency of Resistance Exercise.pdf']

# Getting the data
In this notebook, I'll try two different techniques for getting the text and see which works best:
1. Using the `LangChain` PDF loader
    * Pros: easy to use, scalable
    * Cons: not sure how it will handle content that spans pages
2. Extracting the text of each PDF page as a string and then joining the pages at the document level
    * Pros: allows for further processing (e.g. removing timestamps)
    * Cons: not very scalable
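
The cleanup step in the second approach can be tried on a plain string before any PDF is read. This is a minimal sketch, assuming timestamps of the form `HH:MM:SS` and bracketed cues such as `[MUSIC]`, as they appear in these transcripts. The name `strip_transcript_markup` and the sample string are illustrative and not part of the notebook.

```python
import re

# Assumed patterns, based on the transcript excerpts shown in this notebook.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}")
CUE = re.compile(r"\[.*?\]")


def strip_transcript_markup(text):
    """Remove HH:MM:SS timestamps and bracketed cues, then collapse whitespace."""
    text = TIMESTAMP.sub(" ", text)
    text = CUE.sub(" ", text)
    return " ".join(text.split())


sample = "00:00:00[MUSIC] 00:00:08Hello, and welcome. 00:00:14Here is our course statement."
cleaned = strip_transcript_markup(sample)
print(cleaned)
```

Replacing each match with a space (rather than an empty string) keeps sentences from fusing together when a timestamp sits directly between two words.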

## `LangChain` PDF loader

In [7]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('./6.1 Principles of Resistance Exercise.pdf')
pages = loader.load_and_split()

In [8]:
pages[0]

Document(page_content="6.1 Principles of Resistance Exercise00:00:00[MUSIC] 00:00:08Hello, and welcome to our lecture on the principles of resistance exercise. 00:00:14Here is our course statement. 00:00:15By the end of this lecture, I am hoping that you'll be able to define resistance training, define the following terms associated with resistance training, and they include resistance training, muscle strength, power, and endurance. 00:00:31And finally, describe the principles of overload, SAID, and reversibility. 00:00:35Here's our patient case that'll be used for this lecture. 00:00:39On general physical examination, J.D a 14-year-old has an obese appearance and presents with difficulty in standing, walking, getting up from sitting positions, and climbing stairs. 00:00:51He also presents with proximal weakness, calf hypertrophy, hamstring muscle contracture, and a positive Gower's sign. 00:00:59So you've determined that this patient is appropriate for resistance training. 00:01:03So

In [9]:
import glob
pages = []
for file in glob.glob('*.pdf'):
    loader = PyPDFLoader(file)
    pages += loader.load_and_split()

In [10]:
len(pages)

61

## Single string per document and then `LangChain` string loader

In [11]:
from langchain.docstore.document import Document
import pypdf as pdf
import re

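with open('./6.1 Principles of Resistance Exercise.pdf', 'rb') as f:
    read_pdf = pdf.PdfReader(f)
    # Join the text of every page into one string for the whole document
    text = ' '.join([p.extract_text() for p in read_pdf.pages])
text = re.sub(r'\d+:\d+:\d+', '', text)  # drop HH:MM:SS timestamps
text = re.sub(r'\[.*?\]', '', text)  # drop bracketed cues such as [MUSIC]
text = text.replace('cielo24 | what’s in your video? | cielo24.com', '')  # drop the transcription footer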
text[:1000]

"6.1 Principles of Resistance Exercise Hello, and welcome to our lecture on the principles of resistance exercise. Here is our course statement. By the end of this lecture, I am hoping that you'll be able to define resistance training, define the following terms associated with resistance training, and they include resistance training, muscle strength, power, and endurance. And finally, describe the principles of overload, SAID, and reversibility. Here's our patient case that'll be used for this lecture. On general physical examination, J.D a 14-year-old has an obese appearance and presents with difficulty in standing, walking, getting up from sitting positions, and climbing stairs. He also presents with proximal weakness, calf hypertrophy, hamstring muscle contracture, and a positive Gower's sign. So you've determined that this patient is appropriate for resistance training. So what do you do? Well, first, let's understand what resistance training actually is.  Resistance training is 

In [12]:
doc = Document(page_content=text)
doc

Document(page_content="6.1 Principles of Resistance Exercise Hello, and welcome to our lecture on the principles of resistance exercise. Here is our course statement. By the end of this lecture, I am hoping that you'll be able to define resistance training, define the following terms associated with resistance training, and they include resistance training, muscle strength, power, and endurance. And finally, describe the principles of overload, SAID, and reversibility. Here's our patient case that'll be used for this lecture. On general physical examination, J.D a 14-year-old has an obese appearance and presents with difficulty in standing, walking, getting up from sitting positions, and climbing stairs. He also presents with proximal weakness, calf hypertrophy, hamstring muscle contracture, and a positive Gower's sign. So you've determined that this patient is appropriate for resistance training. So what do you do? Well, first, let's understand what resistance training actually is.  R

In [13]:
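def process_text(file):
    """Return the text of one PDF with timestamps, bracketed cues, and the footer removed."""
    with open(file, 'rb') as f:
        read_pdf = pdf.PdfReader(f)
        text = ' '.join([p.extract_text() for p in read_pdf.pages])
    text = re.sub(r'\d+:\d+:\d+', '', text)  # drop HH:MM:SS timestamps
    text = re.sub(r'\[.*?\]', '', text)  # drop bracketed cues such as [MUSIC]
    text = text.replace('cielo24 | what’s in your video? | cielo24.com', '')
    return text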

docs = []
for file in glob.glob('*.pdf'):
    text = process_text(file)
    doc = Document(page_content=text)
    docs.append(doc)

In [14]:
len(docs)

9

# Set up for Retrieval QA

In [31]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain import HuggingFaceHub
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory

# Read the token from the environment rather than hard-coding it in the notebook
huggingfacehub_api_token = os.environ['HUGGINGFACEHUB_API_TOKEN']
repo_id = "tiiuae/falcon-7b-instruct"
llm = HuggingFaceHub(huggingfacehub_api_token=huggingfacehub_api_token, 
                     repo_id=repo_id, 
                     model_kwargs={"temperature":0.1, "max_new_tokens":2000})

In [32]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# d_documents = text_splitter.split_documents(docs)
p_documents = text_splitter.split_documents(pages)
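
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character chunker. It is an assumption-laden simplification and not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a boundary is cut in two, which is one reason to experiment with a small overlap.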

In [33]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'mps'}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
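
The `normalize_embeddings` flag matters because a raw dot product grows with vector length, while cosine similarity only looks at direction. The sketch below illustrates the difference in plain Python with toy two-dimensional vectors; the helper names are illustrative and none of this is the `sentence-transformers` implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two unit-length versions of a and b."""
    return dot(l2_normalize(a), l2_normalize(b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(dot(a, b))                # depends on vector length
print(cosine_similarity(a, b))  # depends only on direction
```

If the vector store ranks by cosine distance, leaving embeddings unnormalized should not change the ranking; if it ranks by inner product or L2 distance, it can.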

In [34]:
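# '~' is not expanded automatically (note the literal '~' folder in the directory listing above),
# so expand it explicitly
d_persist_directory = os.path.expanduser('~/d_db')
p_persist_directory = os.path.expanduser('~/p_db')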

# d_docsearch = Chroma.from_documents(d_documents, hf, persist_directory=d_persist_directory)
p_docsearch = Chroma.from_documents(p_documents, hf, persist_directory=p_persist_directory)

In [35]:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, input_key='question', output_key='answer')

# Results

## By Page

In [36]:
pqa = ConversationalRetrievalChain.from_llm(llm, p_docsearch.as_retriever(), return_source_documents=True, memory=memory)

In [37]:
dict(memory.chat_memory)['messages']

[]

In [38]:
query = "What is resistance training?"
result = pqa({"question": query, "chat_history":memory.chat_memory})

In [39]:
result['answer']

' Resistance training is a form of active exercise in which dynamic or static muscle contractions are resisted by an outside force applied manually or mechanically.'

In [40]:
memory.chat_memory

ChatMessageHistory(messages=[HumanMessage(content='What is resistance training?', additional_kwargs={}, example=False), AIMessage(content=' Resistance training is a form of active exercise in which dynamic or static muscle contractions are resisted by an outside force applied manually or mechanically.', additional_kwargs={}, example=False)])

In [41]:
result['source_documents']

[Document(page_content="mechanically. 00:01:24[BLANK_AUDIO] 00:01:29Now, since we have established what the definition is, I would like to say before we do anything else that resists training within the literature is heavily, heavily covered. 00:01:40As a matter of fact, it is so ubiquitous that it's hardly worth mentioning the fact that, yes, there is literature supporting resistance training. 00:01:50As a matter of fact, when I did a very quick search term for resistance training and rehabilitation, I had almost 44,000 hits. 00:01:59So again, when we go through the material associated with resistance training, it's assuming that we all understand that resistance training does have therapeutic benefits. 00:02:12[BLANK_AUDIO] 00:02:14Okay, so let's talk about some of the elements of muscle performance, and they are strength, endurance, and power. 00:02:20The strength of a muscle is the ability of contractile tissue to produce tension and resultant force based on the demands placed on t

In [42]:
cites = result['source_documents']
# cites = [c.page_content for c in cites if c.page_content not in cites]
# cites

In [43]:
memory.chat_memory

ChatMessageHistory(messages=[HumanMessage(content='What is resistance training?', additional_kwargs={}, example=False), AIMessage(content=' Resistance training is a form of active exercise in which dynamic or static muscle contractions are resisted by an outside force applied manually or mechanically.', additional_kwargs={}, example=False)])

In [44]:
import pandas as pd
cite_df = pd.DataFrame(cites)

In [45]:
unique_cites = cite_df.drop_duplicates(subset=0)  # each cell holds a (field_name, value) pair
list(zip((c[1] for c in unique_cites[0]), (c[1] for c in unique_cites[1])))

[("mechanically. 00:01:24[BLANK_AUDIO] 00:01:29Now, since we have established what the definition is, I would like to say before we do anything else that resists training within the literature is heavily, heavily covered. 00:01:40As a matter of fact, it is so ubiquitous that it's hardly worth mentioning the fact that, yes, there is literature supporting resistance training. 00:01:50As a matter of fact, when I did a very quick search term for resistance training and rehabilitation, I had almost 44,000 hits. 00:01:59So again, when we go through the material associated with resistance training, it's assuming that we all understand that resistance training does have therapeutic benefits. 00:02:12[BLANK_AUDIO] 00:02:14Okay, so let's talk about some of the elements of muscle performance, and they are strength, endurance, and power. 00:02:20The strength of a muscle is the ability of contractile tissue to produce tension and resultant force based on the demands placed on the muscle. 00:02:30So
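
The same deduplication can be done without a DataFrame. This is a minimal sketch that keeps the first occurrence of each distinct chunk and preserves retrieval order. `Source` is a hypothetical stand-in for a retrieved document with `page_content` and `metadata` attributes, so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(sources):
    """Drop repeated chunks, keeping the first occurrence of each page_content."""
    seen = set()
    unique = []
    for source in sources:
        if source.page_content not in seen:
            seen.add(source.page_content)
            unique.append(source)
    return unique


retrieved = [
    Source("chunk A", {"page": 0}),
    Source("chunk B", {"page": 1}),
    Source("chunk A", {"page": 0}),
]
unique = unique_sources(retrieved)
print([(s.page_content, s.metadata) for s in unique])
```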

## By Document

In [63]:
dqa = ConversationalRetrievalChain.from_llm(llm, d_docsearch.as_retriever(), return_source_documents=True)

In [64]:
chat_history = []
query = "What is resistance training?"
result = dqa({"question": query, "chat_history": chat_history})

In [65]:
result['answer']

'\ns.'

In [68]:
result['source_documents']

[Document(page_content="6.1 Principles of Resistance Exercise Hello, and welcome to our lecture on the principles of resistance exercise. Here is our course statement. By the end of this lecture, I am hoping that you'll be able to define resistance training, define the following terms associated with resistance training, and they include resistance training, muscle strength, power, and endurance. And finally, describe the principles of overload, SAID, and reversibility. Here's our patient case that'll be used for this lecture. On general physical examination, J.D a 14-year-old has an obese appearance and presents with difficulty in standing, walking, getting up from sitting positions, and climbing stairs. He also presents with proximal weakness, calf hypertrophy, hamstring muscle contracture, and a positive Gower's sign. So you've determined that this patient is appropriate for resistance training. So what do you do? Well, first, let's understand what resistance training actually is.  

# Other models

In [46]:
import torch
import transformers
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline

tokenizer = LlamaTokenizer.from_pretrained("TheBloke/wizardLM-7B-HF")

model = LlamaForCausalLM.from_pretrained("TheBloke/wizardLM-7B-HF",
                                              load_in_8bit=True,
                                              device_map='auto',
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True
                                              )

In [None]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=1024,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
print(local_llm('What is the capital of England?'))

# Llama Index

In [1]:
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index import (
    SimpleDirectoryReader,
    load_index_from_storage,
    VectorStoreIndex,
    StorageContext,
)
import faiss
from llama_index.embeddings import resolve_embed_model
import os
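# OPENAI_API_KEY must already be set in the environment; do not hard-code it here
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook")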

embed_model = resolve_embed_model("local:BAAI/bge-small-en")
d = 384
faiss_index = faiss.IndexFlatL2(d)
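
A flat L2 index does an exhaustive comparison of the query against every stored vector. The sketch below shows that idea in plain Python on toy two-dimensional vectors, reporting squared L2 distances. It is an illustration only, not Faiss code, and `flat_l2_search` is a hypothetical name.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return the k closest stored vectors as (squared_distance, position) pairs."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]


stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = flat_l2_search(stored, [1.0, 2.0], k=2)
print(hits)
```

In the notebook the vectors have `d = 384` dimensions to match the embedding model, but the search logic is the same.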

In [2]:
documents = SimpleDirectoryReader("./data/all_weeks/").load_data()

In [3]:
vector_store = FaissVectorStore(faiss_index=faiss_index)

In [4]:
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(
    embed_model=embed_model
)

In [5]:
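# Pass the Faiss store explicitly; otherwise the default in-memory vector store is used
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)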

In [7]:
index.storage_context.persist('./llama_index_dbs/all_weeks_db')

In [22]:
vector_store = FaissVectorStore.from_persist_dir("./llama_index_dbs/all_weeks_db")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./llama_index_dbs/all_weeks_db"
)

index = load_index_from_storage(storage_context=storage_context)

In [8]:
query_engine = index.as_query_engine()

In [9]:
response = query_engine.query("What is resistance training? Give two examples of it.")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [11]:
print(str(response))

Resistance training is a type of exercise that involves working against a force or resistance to build strength, endurance, and muscle mass. It typically involves using weights, resistance bands, or bodyweight exercises to challenge the muscles. Two examples of resistance training exercises are weightlifting and push-ups.


## Small to big

In [12]:
from llama_index.node_parser import SentenceSplitter
node_parser = SentenceSplitter(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(documents)
# set node ids to be a constant
for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"

In [13]:
service_context = ServiceContext.from_defaults(
    embed_model=embed_model
)

In [14]:
sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=0) for c in sub_chunk_sizes
]

In [15]:
from llama_index.schema import IndexNode

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original base node to the node list
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)
all_nodes_dict = {n.node_id: n for n in all_nodes}
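
The idea behind small-to-big retrieval is to match the query against small chunks but hand the larger parent chunk to the LLM. The sketch below shows that lookup in plain Python. Substring matching stands in for vector similarity, and the names `split_fixed`, `parents`, and `retrieve_parent` are all illustrative assumptions rather than LlamaIndex APIs.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Parent chunks keyed by id, mirroring the "node-{idx}" ids used above.
parents = {
    "node-0": "Resistance training is exercise against an external force.",
    "node-1": "Strength is the ability of muscle to produce tension.",
}

# Every small piece remembers which parent it came from.
children = []
for parent_id, text in parents.items():
    for size in (16, 32):
        for piece in split_fixed(text, size):
            children.append((piece, parent_id))


def retrieve_parent(query_term):
    """Match on a small piece, then return the full parent text."""
    for piece, parent_id in children:
        if query_term.lower() in piece.lower():
            return parents[parent_id]
    return None


print(retrieve_parent("external"))
print(retrieve_parent("tension"))
```

Small pieces make the match more specific, while returning the parent restores the surrounding context that a short piece would lose.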

In [16]:
vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context
)

In [20]:
# Persist the small-to-big index built above (not the earlier `index`)
vector_index_chunk.storage_context.persist('./llama_index_dbs/all_weeks_db_small_to_big')

In [21]:
vector_store = FaissVectorStore.from_persist_dir("./llama_index_dbs/all_weeks_db_small_to_big")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./llama_index_dbs/all_weeks_db_small_to_big"
)
index = load_index_from_storage(storage_context=storage_context)

## From PGVectors

In [13]:
import os

base_path = '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care/'
pdfs = [f"{base_path}/{p}" for p in os.listdir(base_path) if 'through' not in p]
pdfs

['/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 8 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 3 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 7 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 9 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 2 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 6 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 1 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Primary Care//Week 5 Primary Care.pdf',
 '/Users/pnadel01/Library/CloudStorage/Box-Box/DISC_AI Project/Transcripts Prima

In [19]:
import pypdf
import re
from llama_index.node_parser import SimpleNodeParser
from llama_index import Document

def read_pdf(path):
    """Split one PDF into 256-token chunks, each tagged with its source file name."""
    reader = pypdf.PdfReader(path)
    source = os.path.basename(path)  # file name only, e.g. 'Week 8 Primary Care.pdf'
    full_text = ' '.join([page.extract_text() for page in reader.pages])
    node_parser = SimpleNodeParser.from_defaults(chunk_size=256, chunk_overlap=20)
    nodes = node_parser.get_nodes_from_documents([Document(text=full_text)], show_progress=False)
    return [Document(text=n.text, metadata={'source': source}) for n in nodes]

In [20]:
texts = [read_pdf(path) for path in pdfs]  # `path` avoids shadowing the earlier `pdf` alias for pypdf

In [21]:
documents = [item for sublist in texts for item in sublist]
documents[0]

Document(id_='c3d3ed57-4e84-43a8-8a98-fb3f54f87aed', embedding=None, metadata={'source': 'Week 8 Primary Care.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='69ebfc63d90439ab82de43018911aae5b718da9ab3b498047245c113e95012e6', text="8.1 Musculoskeletal Ultrasound: What is it?Effects00:00:00[MUSIC] 00:00:09So welcome to week eight, and the first lecture is on musculoskeletal ultrasound. 00:00:14So we're going to start with musculoskeletal ultrasound. 00:00:17What is musculoskeletal ultrasound? 00:00:21[BLANK_AUDIO] 00:00:24So our objectives are to describe how energy is converted to sound waves and how those sound waves are transformed into an image in the process of ultrasound imaging. 00:00:35Then we're going to identify the common equipment and functionality of the equipment used in ultrasound imaging. 00:00:41We'll discuss the difference in equipment shape and image between linear and curvilinear transducers, and describe how frequency af

In [1]:
import os
import openai

# Read the key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [22]:
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

db_name = "postgres"
conn = psycopg.connect(dbname=db_name, host="localhost", port="5432", autocommit=True)

conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

conn.execute('DROP TABLE IF EXISTS documents')
conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, source text, content text, embedding vector(1024))')

chunks = [d.text for d in documents]
sources = [d.metadata['source'] for d in documents]

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(chunks)

for source, chunk, embedding in zip(sources, chunks, embeddings):
    conn.execute('INSERT INTO documents (source, content, embedding) VALUES (%s, %s, %s)', (source, chunk, embedding))

In [26]:
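document_id = 3
# Five nearest neighbours of row `document_id` by cosine distance (`<=>`), excluding the row itself
sql = 'SELECT content, embedding FROM documents WHERE id != %(id)s ORDER BY embedding <=> (SELECT embedding FROM documents WHERE id = %(id)s) LIMIT 5'
conn.execute(sql, {'id': document_id}).fetchall()[0][1].shape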

(1024,)
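
The query above orders rows by cosine distance to a reference row and skips the reference row itself. Here is a minimal plain-Python sketch of the same logic on toy vectors, taking cosine distance as one minus cosine similarity. The `rows` dictionary and the function names are illustrative assumptions, not pgvector code.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_to_row(rows, row_id, limit):
    """Ids of the `limit` rows closest to `row_id`, excluding `row_id` itself."""
    target = rows[row_id]
    others = [
        (cosine_distance(vec, target), other_id)
        for other_id, vec in rows.items()
        if other_id != row_id
    ]
    return [other_id for _, other_id in sorted(others)[:limit]]


# Toy table: id -> embedding
rows = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0], 4: [2.0, 0.1]}
neighbours = nearest_to_row(rows, 1, limit=2)
print(neighbours)
```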

In [2]:
from llama_index.vector_stores import PGVectorStore
from llama_index import StorageContext, ServiceContext
from llama_index.indices.vector_store import VectorStoreIndex
from sqlalchemy import make_url
from llama_index.embeddings import resolve_embed_model
from llama_index import set_global_service_context

# url = make_url(connection_string)
# vector_store = PGVectorStore.from_params(
#     database=db_name,
#     host="localhost",
#     port='5432',
#     user="pnadel01",
#     table_name="documents",
#     embed_dim=1024,
# )

# Use the same model that produced the stored embeddings (BAAI/bge-large-en-v1.5)
embed_model = resolve_embed_model("local:BAAI/bge-large-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model)
set_global_service_context(service_context)

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer
from llama_index import Document

db_name = "postgres"
conn = psycopg.connect(dbname=db_name, host="localhost", port="5432", autocommit=True)

conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)


cursor = conn.cursor()
cursor.execute("SELECT * FROM documents;")

rows = cursor.fetchall()

docs_from_db = [Document(text=row[2],metadata={'source':row[1]},embedding=list(row[3])) for row in rows]

# index = VectorStoreIndex.from_documents(
#     documents
# )
# query_engine = index.as_query_engine(similarity_top_k=5)

In [3]:
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs_from_db
)

In [4]:
ret = index.as_retriever()
# ret.retrieve("depression")

In [5]:
from llama_index.query_engine import RetrieverQueryEngine

qe = RetrieverQueryEngine.from_args(retriever=ret)

In [6]:
qe.query('What are the best ways to screen for depression in adults? Use examples from the context.')

Response(response='The context information suggests that a 2-question initial screening is recommended for screening depression in adults. The Patient Health Questionnaire-2 (PHQ-2) is specifically mentioned as a recommended questionnaire to use with these patients. Additionally, the context mentions that the mnemonic SIGECAPS is helpful for remembering criteria for depression.', source_nodes=[NodeWithScore(node=TextNode(id_='a4e51531-bdac-48ed-97fe-025a04fd8db8', embedding=None, metadata={'source': 'Week 10 Primary Care.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='54c96c09-aade-4c5c-a5e6-dd90b553a541', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'source': 'Week 10 Primary Care.pdf'}, hash='23291d4fae3fe56d6b706d6d9b1f9169a1fe9405525a97b050d6ab5173f361f4'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='81955928-e779-45a1-b5ee-66fe78e61a53', node_type=<ObjectType.TEXT: '1'>

In [64]:
res = query_engine.query('What are the best ways to screen for depression in adults? Use examples from the context.')

In [65]:
print(str(res))

Context information is below.
---------------------
source: Week 10 Primary Care.pdf

00:03:43A flowchart for PTS showing the sequence of screening questions is above. 00:03:49Because of the high sensitivity value associated with the initial two questions, and no response to both, especially in an individual without a history of depression in the past year, makes it very  unlikely a major depressive episode is present. 00:04:03Because of the low specificity value, a yes response is not diagnostic, but requires that additional patient information to be collected. 00:04:11[BLANK_AUDIO] 00:04:21Additional Information for Recognizing Major Depressive Disorder. 00:04:24Major depressive disorder is the most common mental health condition seen in primary care. 00:04:30The presentation may include mood, cognitive, neurovegetative, or somatic symptoms. 00:04:37There are limited harms associated with screening, as long as a positive screen is followed up on. 00:04:42The Patient Health Questionna

In [55]:
[n.text for n in res.source_nodes]

['00:00:44The prevalence is as high as 13% for adults. 00:00:48Half of those will experience remission, but half of these will relapse the following year. 00:00:54Refer to your Boissonnault book Box 20.2 for important risk factors. 00:01:00A 2-question initial screening is recommended, so box 20.2 in your text provides a list of risk factors related to major depressive disorder.  00:01:10In a 2017 study, only 18% physical therapists in this study screened their patients for depressive disorder. 00:01:18[BLANK_AUDIO] 00:01:24Major Depressive Disorder Screening. 00:01:27[BLANK_AUDIO] 00:01:30To meet the criteria for major depressive episodes, an individual must have symptoms over a two-week period that represent a change from previous functioning, with at least one of the symptoms being a depressed mood or loss of interest or pleasure. 00:01:43The mnemonic SIGECAPS is helpful for remembering criteria for depression.',
 '00:03:43A flowchart for PTS showing the sequence of screening questi

### Query Rewrite

In [48]:
vector_retriever = index.as_retriever(
    similarity_top_k=5
)

In [51]:
from llama_index.retrievers import QueryFusionRetriever
fusion_retriever = QueryFusionRetriever(
    [vector_retriever],
    similarity_top_k=5,
    num_queries=6,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=False,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

In [52]:
from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(fusion_retriever, service_context=service_context)

In [53]:
res = query_engine.query(
    "What are the best ways to screen for depression in adults? Use examples from the context."
)

Generated queries:
1. What are the most effective screening tools for depression in adults?
2. Are there any validated questionnaires or surveys for screening depression in adults?
3. How do healthcare professionals typically screen for depression in adults?
4. What are the recommended guidelines for depression screening in adults?
5. Can you provide examples of screening methods used by mental health professionals to detect depression in adults?


In [54]:
print(str(res))

Context information is below.
---------------------
source: Week 10 Primary Care.pdf

00:00:44The prevalence is as high as 13% for adults. 00:00:48Half of those will experience remission, but half of these will relapse the following year. 00:00:54Refer to your Boissonnault book Box 20.2 for important risk factors. 00:01:00A 2-question initial screening is recommended, so box 20.2 in your text provides a list of risk factors related to major depressive disorder.  00:01:10In a 2017 study, only 18% physical therapists in this study screened their patients for depressive disorder. 00:01:18[BLANK_AUDIO] 00:01:24Major Depressive Disorder Screening. 00:01:27[BLANK_AUDIO] 00:01:30To meet the criteria for major depressive episodes, an individual must have symptoms over a two-week period that represent a change from previous functioning, with at least one of the symptoms being a depressed mood or loss of interest or pleasure. 00:01:43The mnemonic SIGECAPS is helpful for remembering criteria for 

In [1]:
from llama_index.llama_pack import download_llama_pack

FuzzyCitationEnginePack = download_llama_pack("FuzzyCitationEnginePack", "./fuzzy_pack")