<a href="https://colab.research.google.com/github/karamih/QA/blob/master/Medical_assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QA with custom documents: LangChain and OpenAI

### Install packages

In [None]:
!pip -q install langchain openai tiktoken chromadb pypdf InstructorEmbedding sentence_transformers

### Import libraries

In [None]:
import os

from langchain.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory
from langchain import OpenAI

import torch

from InstructorEmbedding import INSTRUCTOR

import textwrap

### Customization

Set `DATA_DIRECTORY_PATH` to the folder that holds the documents you want to query.

`VECTOR_DB_PATH` is the directory where the vector database is created and persisted.

Set `OPENAI_API_KEY` to your OpenAI API key.

In [None]:
# Documents used here: Gale Encyclopedia of Medicine vol. 1-5, Nurse's Drug Handbook 2022, Meyler's Side Effects of Endocrine and Metabolic
DATA_DIRECTORY_PATH = '/content/drive/MyDrive/docs/'
VECTOR_DB_PATH = '/content/drive/MyDrive/medical_knowledge/'

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

os.environ["OPENAI_API_KEY"] = ""
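
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch, assuming an interactive session, that reuses a key already present in the environment and prompts for one only when it is missing. `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced for illustration; they are not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in `var_name`, prompting only if it is missing or empty."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output.
        key = prompt_fn(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the direct assignment keeps the key out of the saved notebook.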

### Prepare data and database

In [3]:
class Data:
    docs = None

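    # Keys double as the file extension used in the glob pattern in `_split`,
    # so plain-text files are registered under "txt" to match `*.txt`.
    data_classes = {"pdf": PyPDFLoader,
                    "txt": TextLoader}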

    def __init__(self, *, document_dir, data_type, chunk_size, chunk_overlap):
        self.document_dir = document_dir
        self.data_type = data_type
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def __len__(self):
        return len(self.docs)

    def __getitem__(self, item):
        return self.docs[item]

    def _split(self):
        loader = DirectoryLoader(self.document_dir, glob=f"*.{self.data_type}",
                                 loader_cls=self.data_classes[self.data_type])

        content = loader.load()

        splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size,
                                                  chunk_overlap=self.chunk_overlap)

        self.docs = splitter.split_documents(content)

    def document(self):
        self._split()
        return self.docs


class VectorDb:
    def __init__(self, *, vectordb_path, embedding, device='cuda'):
        self.vectordb_path = vectordb_path
        self.device = device
        self.embedding = embedding

    def create_vectordb(self, *, document):
        vectordb = Chroma.from_documents(documents=document,
                                         embedding=self.embedding,
                                         persist_directory=self.vectordb_path)

        vectordb.persist()

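    def create_retriever(self, *, k_top_document=3):
        # Reopen the persisted store and return a retriever for the top-k chunks.
        vectordb = Chroma(persist_directory=self.vectordb_path,
                          embedding_function=self.embedding)
        retriever = vectordb.as_retriever(search_kwargs={'k': k_top_document})

        return retriever

    # Alias for the original misspelled name, so existing calls keep working.
    crate_retriever = create_retriever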




#### Splitting documents

In [4]:
data = Data(document_dir=DATA_DIRECTORY_PATH, chunk_size=1000, chunk_overlap=200, data_type='pdf')
document = data.document()

In [None]:
len(document), document[10]
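
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a plain character window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

With the values used above (1000 and 200), each chunk would repeat roughly the last 200 characters of the previous one, which helps keep sentences that straddle a boundary retrievable.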

#### Hugging Face embedding

In [7]:
embedding = HuggingFaceInstructEmbeddings(model_name='hkunlp/instructor-xl',
                                          model_kwargs={'device': DEVICE})


load INSTRUCTOR_Transformer
max_seq_length  512


#### Creating vector database and retriever

In [8]:
vector_db = VectorDb(vectordb_path=VECTOR_DB_PATH, embedding=embedding)

vector_db.create_vectordb(document=document)

In [20]:
retriever = vector_db.crate_retriever(k_top_document=5)

retriever
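
Conceptually, the retriever embeds the query and returns the `k` chunks whose embeddings are closest to it. The sketch below shows that idea with brute-force cosine similarity on toy two-dimensional vectors. It is an illustration only: Chroma uses its own index and distance settings, and `cosine_similarity` and `top_k` are hypothetical helpers, not library calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))  # [0, 2]
```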

#### Testing retriever


In [22]:
query1 = 'what is daiabet?'
query2 = 'what does amoksisilyn pill good for?'
# Persian: "What is skin eczema?"
query3 = "اگزمای پوستی چیست؟"
# Persian: "Which pill is good for stomach pain?"
query4 = "چه قرصی برای معده درد مفید است؟"

In [23]:
doc1 = retriever.get_relevant_documents(query1)
doc1

[Document(page_content='severity of the diabetes. Mild forms can be treated withdiet (decreasing the intake of sugars and fats, in particu-lar). Many women are put on strict, detailed diets , and are\nasked to stay within a certain range of calorie intake.Exercise is sometimes used to keep blood sugar levels\nlower. Patients are often asked to regularly measure theirblood sugar. This is done by poking a finger with a needlecalled a lancet, putting a drop of blood on a special type ofpaper, and feeding the paper into a meter which analyzesand reports the blood sugar level. When diet and exercisedo not keep blood glucose levels within an acceptablerange, a patient may need to take regular shots of insulin.KEY TERMS\nGlucose —A form of sugar. The final product of the\nbreakdown of carbohydrates (starches).\nInsulin —A hormone produced by the pancreas\nthat is central to the processing of sugars and car-bohydrates in the diet.\nPlacenta —An organ that is attached to the inside', metadata={

In [24]:
doc2 = retriever.get_relevant_documents(query2)
doc2

[Document(page_content='To treat endocervical, rectal, and urethral infections caused by\nChlamydia trachomatis\nCapsules, Delayed-Release T ablets, Oral Suspension, Syrup,\nTablets\nAdults.  100 mg (120 mg Doryx MPC) twice daily for 7 days.\nTo treat unco mplicated gonococcal infections except anorectal\ninfections in men\nCapsules, Delayed-Release T ablets, Oral Suspension, Syrup,\nTablets\nAdults.  100 mg (120 mg Doryx MPC) twice a day for 7 days.\nAlternatively , 300 mg (360 mg Doryx MPC) followed in 1 hr by a\nsecond 300 mg (360 mg Doryx MPC) dose.\nTo treat epididymoorchitis caused by C. trachomatis or Neisseria\ngonorrhoeae\nCapsules, Delayed-Release T ablets, Oral Suspension, Syrup,\nTablets\nAdults.  100 mg (120 mg Doryx MPC) twice daily for at least 10\ndays.\nTo prevent malaria\nCapsules, Delayed-Release T ablets, Oral Suspension, Syrup,\nTablets\nAdults.  100 mg (120 mg Doryx MPC) daily starting 1 to 2 days\nbefore travel, continued daily during travel, and then daily for 4

In [25]:
doc3 = retriever.get_relevant_documents(query3)
doc3

[Document(page_content='Lesion —A pathologic change in tissues.\nMalignancy —A locally invasive and destructive\ngrowth.White, C.S., C.A. Meyer, and P. A. Templeton. “CT Fluo-\nroscopy for Throacic Interventional Procedures.” Radio-\nlogic Clinics of North America 38 (March 2000).\nWhite, C.S., et. al. “Transbronchial Needle Aspiration: Guid-\nance with CT Fluoroscopy.” Chest 118 (December 2000).\nKim A. Sharp\nCT-myelogram seeMyelography\nCT scan seeComputed tomography scans\nCulture-fair test\nDefinition\nA culture-fair test is a test designed to be free of cul-\ntural bias, as far as possible, so that no one culture has anadvantage over another. The test is designed to not beinfluenced by verbal ability, cultural climate, or educa-tional level.\nPurpose\nThe purpose of a culture-fair test is to eliminate any', metadata={'page': 357, 'source': '/content/drive/MyDrive/docs/Gale Encyclopedia of Medicine Vol. 2 (C-F).pdf'}),
 Document(page_content='years old. The patient only needs the 

In [26]:
doc4 = retriever.get_relevant_documents(query4)
doc4

[Document(page_content='Salmonella paratyphi :S. paratyphi A ; S. schottmuelleri\n(also called S. paratyphi B ); or S. hirschfeldii (also called\nS. paratyphi C ). It can be transmitted from animals or\nanimal products to humans or from person to person. Theincubation period is one to two weeks but is often shorterin children. Symptom onset may be gradual in adults butis often sudden in children.\nParatyphoid fever is marked by high fever,\nheadache , loss of appetite, vomiting, and constipation\nor diarrhea . The patient typically develops an enlarged\nspleen. About 30% of patients have rose spots on thefront of the chest during the first week of illness. Therose spots develop into small hemorrhages that may behard to see in African or Native Americans.\nPatients with intestinal complications have symp-\ntoms resembling those of appendicitis : intense cramping\npain with soreness in the right lower quadrant of theabdomen.\nDiagnosis\nThe diagnosis is usually made on the basis of a his
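
The raw `Document` reprs above are hard to scan when judging whether a hit is relevant. A sketch for printing one line per hit, assuming each result exposes `page_content` and a `metadata` dict with `source` and `page` keys, as the output for `query3` shows. `summarize_hits` is a hypothetical helper, and the demo object and its path are made-up stand-ins for a retrieved document.

```python
import os
from types import SimpleNamespace


def summarize_hits(docs, preview_chars=60):
    """Return one 'file (page N): preview' line per retrieved document."""
    lines = []
    for doc in docs:
        source = os.path.basename(doc.metadata.get("source", "unknown"))
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# Stand-in for a retrieved document; the path is illustrative only.
demo = [SimpleNamespace(page_content="Lesion \nA pathologic change in tissues.",
                        metadata={"page": 357, "source": "/content/docs/Vol 2.pdf"})]
print("\n".join(summarize_hits(demo)))
```

In the notebook, `summarize_hits(doc3)` would list the source file and page of each chunk returned for the Persian eczema query, which makes off-topic hits easy to spot.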

### Chain

In [42]:
memory = ConversationBufferMemory()


class QARetrievalChain:
    def __init__(self, *, retriever, temperature=0, chain_type='stuff', memory=None):
        self.retriever = retriever
        self.temperature = temperature
        self.chain_type = chain_type
        self.memory = memory

        self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=self.temperature)

        self.chain = RetrievalQA.from_chain_type(llm=self.llm,
                                                 retriever=self.retriever,
                                                 chain_type=self.chain_type,
                                                 memory=self.memory)

    def __call__(self, query, *args, **kwargs):
        response = self.chain.run(query)
        return response


In [43]:
chain = QARetrievalChain(retriever=retriever, memory=memory)
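
With `chain_type='stuff'`, all retrieved chunks are placed into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The prompt wording is hypothetical and is not LangChain's actual template; `build_stuff_prompt` is a name introduced here for illustration.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "what is diabetes?",
    ["Glucose is a form of sugar.", "Insulin is a hormone produced by the pancreas."],
)
print(prompt)
```

Because every chunk goes into one prompt, the number of retrieved documents and the chunk size together determine how much of the model's context window is consumed.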

#### Testing QA chain

In [29]:
for q in [query1, query2, query3, query4]:
    response = chain(q)
    print(textwrap.fill(response, width=100))
    print("-" * 20)

Diabetes is a chronic disease that affects the body's ability to regulate blood sugar levels. There
are two main types of diabetes: Type 1 and Type 2. Type 1 diabetes is usually diagnosed in childhood
and occurs when the body does not produce enough insulin. Type 2 diabetes is more common and
typically develops in adulthood, often in individuals who are overweight and do not exercise
regularly. In Type 2 diabetes, the body either does not produce enough insulin or does not use it
effectively. Treatment for diabetes involves managing blood sugar levels through diet, exercise,
medication, and sometimes insulin injections.
--------------------
I'm sorry, but I don't have any information about a medication called "amoksisilyn." It's possible
that you may have misspelled the name or it could be a brand name specific to a certain country. If
you can provide more accurate information or clarify the name, I may be able to assist you further.
--------------------
من نمی‌دانم.
------------------

In [41]:
# Persian: "What is diabetes?"
response = chain("دیابت چیست؟")
print(textwrap.fill(response, width=100))

 Diabetes is a chronic condition characterized by high levels of sugar in the blood. It is caused by
the body's inability to produce or use insulin properly.


In [46]:
# Persian: "What is the drug cyclosporine good for, and what are the side effects of taking it?"
q = "داروی cyclosporine برای چه چیزی خوبه و عوارض مصرفش چی هست؟"

In [47]:
response = chain(q)
print(textwrap.fill(response, width=100))

داروی سیکلوسپورین برای پیشگیری از رد شدن اعضای پیوندی در تراشهای اعضای داخلی استفاده می شود. همچنین
برای درمان برخی از بیماری های التهابی مانند آرتریت روماتوئید و پسوریازیس نیز استفاده می شود.   عوارض
جانبی ممکن از مصرف سیکلوسپورین شامل افزایش خطر عفونت، افزایش فشار خون، افزایش سطح قند خون، اختلالات
کلیوی، تغییرات در پوست و مو، تغییرات در شکل بدن و افزایش خطر سرطان می باشد. همچنین ممکن است عوارض
دیگری نیز وجود داشته باشد. برای اطلاعات دقیقتر و جزئیات بیشتر، بهتر است با پزشک خود مشورت کنید.

English translation of the response above: Cyclosporine is used to prevent rejection of transplanted organs in internal organ transplants. It is also used to treat some inflammatory diseases such as rheumatoid arthritis and psoriasis. Possible side effects of taking cyclosporine include increased risk of infection, increased blood pressure, increased blood sugar levels, kidney disorders, changes in skin and hair, changes in body shape, and increased risk of cancer. Other side effects may also exist. For more precise information and further details, it is best to consult your doctor.
