### Query a PDF document read from a directory

This example uses **OpenAI** models, which require an API token.  
See the `Account setup` section of  
https://platform.openai.com/docs/quickstart?context=python

### Sample environment file holding API tokens
import os  
SERPAPI_API_KEY="**Google**"  
HF_TOKEN="**HuggingFaces**"  
GITHUB_TOKEN="**GitHub**"  
OPENAI_API_KEY="**OpenAI**"  
  
def set_environment():  
    variable_dict = globals().items()  
    for key, value in variable_dict:  
        if "API" in key or "ID" in key or "TOKEN" in key:  
            os.environ[key] = value  
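
A slightly more defensive variant of `set_environment` is sketched below. It takes the mapping explicitly instead of reading `globals()`, and it skips non-string values, which `os.environ` rejects with a `TypeError`. The name `export_secrets` and the `DEMO_` keys are illustrative assumptions, not part of the original config file.

```python
import os

def export_secrets(variables, markers=("API", "ID", "TOKEN")):
    """Copy string-valued entries whose name contains a marker into os.environ."""
    exported = []
    for key, value in variables.items():
        # os.environ only accepts strings, so anything else is skipped
        if isinstance(value, str) and any(m in key for m in markers):
            os.environ[key] = value
            exported.append(key)
    return exported

# Hypothetical placeholder values, for illustration only
demo = {"DEMO_API_KEY": "abc", "DEMO_TOKEN": "xyz", "DEMO_ID": 42, "OTHER": "skip"}
exported = export_secrets(demo)
print(exported)
```

Passing the mapping in explicitly also makes the function easy to exercise without touching real credentials.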
  

In [1]:
%load_ext autoreload
%autoreload 2
import os
from pathlib import Path
import sys
# Read OPENAI_API_KEY from the config file, or set it directly as an environment variable
# os.environ['OPENAI_API_KEY'] = "<your OpenAI access token>"
sys.path.append(r'..')
from configs import set_environment
set_environment()

In [2]:
# Initial imports
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

In [3]:
# load related packages to split, embed, store
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain

### Query PubMed documents

In [4]:
# import PubMedLoader
from langchain.document_loaders import PubMedLoader

In [None]:
## PubMed filters you can use
# PMC COVID-19 Collection
# obesity overweight hereditary children risk factors (4k+)
# mounica yanamandala

In [7]:
# load documents based on a filter
filter_name = "mounica yanamandala"
loader = PubMedLoader(filter_name)   # requires xmltodict package
docs = loader.load()

In [9]:
# check loaded docs
print(len(docs))
print(docs[1].metadata)
print(docs[2].page_content)

3
{'uid': '36378869', 'Title': 'Severe triple vessel disease secondary to IgG4-related coronary periarteritis.', 'Published': '2022-11-15', 'Copyright Information': '© 2022 Wiley Periodicals LLC.'}
Genome editing of primary human cells with CRISPR-Cas9 is a powerful tool to study gene function. For many cell types, there are efficient protocols for editing with optimized plasmids for Cas9 and sgRNA expression. Vascular cells, however, remain refractory to plasmid-based delivery of CRISPR machinery for in vitro genome editing due to low transfection efficiency, poor expression of the Cas9 machinery, and toxic effects of the selection antibiotics. Here, we describe a method for high-efficiency editing of primary human vascular cells in vitro using nucleofection for direct delivery of sgRNA:Cas9-NLS ribonucleoprotein complexes. This method is more rapid and its high editing efficiency eliminates the need for additional selection steps. The edited cells can be employed in diverse applicati

In [10]:
# split document into chunks - chunk_size = 1000, and chunk_overlap = 150
text_chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# split the documents using the text_chunks splitter
chunks = text_chunks.split_documents(docs)

# get OpenAI Embeddings
embeddings = OpenAIEmbeddings()

# Use Chroma to store vectors of embeddings
vector_store = Chroma.from_documents(chunks, embeddings)
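
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window over characters. It is a simplification, assumed for illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, newlines, and spaces, so its real chunks are usually shorter and end at different positions. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

# Synthetic 2500-character text, for illustration only
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample, chunk_size=1000, chunk_overlap=150)
print([len(p) for p in pieces])
```

Each window repeats the last 150 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.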

In [13]:
# some verification
print(len(chunks))
chunks[3]

4


Document(page_content='assays to assess various genetic perturbation effects in vitro. This method proves effective in vascular cells that are refractory to standard genome manipulation techniques using viral plasmid delivery. We anticipate that this technique will be applied to other non-vascular cell types that face similar barriers to efficient genome editing. © 2021 Wiley Periodicals LLC. Basic Protocol: CRISPR-Cas9 genome editing of primary human vascular cells in vitro.', metadata={'uid': '34748284', 'Title': 'CRISPR-Cas9 Genome Editing of Primary Human Vascular Cells In Vitro.', 'Published': '--', 'Copyright Information': '© 2021 Wiley Periodicals LLC.'})

In [24]:
# Embeddings are numerical representations of text data: high-dimensional vectors
print(type(embeddings))
str_ch2 = str(str(chunks[2]).split(' \\n')[0])   # split and convert to string
print(type(str_ch2), len(str_ch2))

## print the first 20 values of the chunk's embedding vector
print(embeddings.embed_query(str_ch2)[:20])

<class 'langchain_community.embeddings.openai.OpenAIEmbeddings'>
<class 'str'> 1201
[-0.024597116791355397, 0.01600828644121873, -0.014449125729533193, -0.0051210389643293035, -0.0017196139770701948, 0.01649216441729913, -0.016760985101200415, 0.0020531200222058026, -0.04526944638234542, -0.003555156897355884, -0.014220627589423536, 0.0179975617372625, -0.018508321409203984, -0.0003101941670524317, 0.0019455916089468972, 0.023024513369094092, 0.03475854683846292, -0.011445051513571726, -0.0031468850878686987, -0.009744759104647664]
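
Embedding vectors become useful once they are compared with each other. The sketch below computes cosine similarity, one common measure, on tiny hand-made vectors. Which distance metric the vector store actually applies depends on its configuration, so treat this as an illustration rather than a description of Chroma's internals. The three vectors are made-up values.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; real embeddings have many more dimensions
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
far = [-1.0, 0.5, 0.0]

print(cosine_similarity(query, close) > cosine_similarity(query, far))
```

A chunk whose vector points in nearly the same direction as the question's vector scores close to 1 and is a good retrieval candidate.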


### ChatGPT

In [25]:
# setup ChatOpenAI
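llm = ChatOpenAI(api_key=os.environ['OPENAI_API_KEY'],  # environment variable names are case-sensitive on Linux/macOS
                 model='gpt-3.5-turbo', 
                 temperature=0.1, 
                 max_tokens=100)  # caps reply length; longer answers are cut off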

In [26]:
# expose the vector store as a retriever
retriever = vector_store.as_retriever()

In [27]:
# setup retrieval chain
conv_ret_chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)
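
Conceptually, a conversational retrieval chain rewrites the new question using the chat history, fetches matching chunks, and asks the LLM to answer from them. The sketch below mimics that flow with plain-Python stubs. The functions `condense_question`, `retrieve`, `answer`, and `ask` are hypothetical stand-ins, not LangChain APIs, and word overlap replaces embedding similarity to keep the example self-contained.

```python
def condense_question(question, chat_history):
    """Stub: fold the previous turn into a standalone question (an LLM does this in practice)."""
    if not chat_history:
        return question
    previous_question, _ = chat_history[-1]
    return f"{question} (follow-up to: {previous_question})"

def retrieve(standalone_question, chunks, k=2):
    """Stub: rank chunks by shared words instead of embedding similarity."""
    words = set(standalone_question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def answer(standalone_question, context):
    """Stub: a real chain would send the context and question to the LLM."""
    return f"Answer based on {len(context)} chunk(s)"

def ask(question, chat_history, chunks):
    standalone = condense_question(question, chat_history)
    context = retrieve(standalone, chunks)
    response = answer(standalone, context)
    chat_history.append((question, response))  # history is a list of (question, answer) tuples
    return response

# Toy chunks, for illustration only
toy_chunks = ["obesity risk factors in adults",
              "genome editing of vascular cells",
              "coronary disease case report"]
history = []
print(ask("what are obesity risk factors", history, toy_chunks))
print(ask("does age matter", history, toy_chunks))
print(len(history))
```

Because each turn is appended to the history, a follow-up question can be interpreted in light of the earlier ones.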

In [None]:
# start asking questions
session_chat = {}
session_chat['history'] = []
# print(type(session_chat))
def ask_questions():
  # keep prompting until the user enters a message containing "Thank"
  while True:
    question = input("Enter your question: ")
    if "Thank" in question:
      print("Thank you for using ChatGPT")
      return
    response = conv_ret_chain.run({'question':question,
                                  'chat_history':session_chat['history']})
    print(response, end="\n\n")
    # store the turn so follow-up questions can refer back to it
    session_chat['history'].append((question, response))

In [30]:
ask_questions()

Enter your question:  in 3 sentences summarize work of mounica yanamandala


Dr. Mounica Yanamandala's work focuses on developing efficient methods for genome editing of primary human vascular cells using CRISPR-Cas9 technology. The key aspects of her work include utilizing nucleofection for direct delivery of sgRNA:Cas9-NLS ribonucleoprotein complexes to overcome the low transfection efficiency and poor expression of the Cas9 machinery in vascular cells. This method allows for high-efficiency editing without the need for additional selection steps, making it more rapid and



Enter your question:  Thank You


Thank you for using ChatGPT


### Querying with a different filter

In [31]:
# load documents based on a filter
filter_name = "obesity overweight hereditary children risk factors"
loader = PubMedLoader(filter_name)   # requires xmltodict package
docs = loader.load()

# check loaded docs
print(len(docs))
print(docs[1].metadata)
print(docs[2].page_content)

3
{'uid': '38291504', 'Title': 'Weight development from childhood to motherhood-embodied experiences in women with pre-pregnancy obesity: a qualitative study.', 'Published': '2024-01-30', 'Copyright Information': '© 2024. The Author(s).'}
Perioperative anaphylaxis (PA) is a severe condition that can be fatal, but data on PA mortality are scarce. The aim of this article is to review the epidemiology, elicitors and risk factors for PA mortality and identify knowledge gaps and areas for improvement regarding the management of severe PA. PA affects about 100 cases per million procedures. Mortality is rare, estimated at 3 to 5 cases per million procedures, but the PA mortality rate is higher than for other anaphylaxis aetiologies, at 1.4% to 4.8%. However, the data are incomplete. Published data mention neuromuscular blocking agents and antibiotics, mainly penicillin and cefazolin, as the main causes of fatal PA. Reported risk factors for fatal PA vary in different countries. Most frequentl

In [32]:
# split document into chunks - chunk_size = 1000, and chunk_overlap = 150
text_chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# split the documents using the text_chunks splitter
chunks = text_chunks.split_documents(docs)

# get OpenAI Embeddings
embeddings = OpenAIEmbeddings()

# Use Chroma to store vectors of embeddings
vector_store = Chroma.from_documents(chunks, embeddings)

In [33]:
# some verification
print(len(chunks))
chunks[3]

9


Document(page_content="BACKGROUND: Pre-pregnancy obesity increases the risk of perinatal complications. Post-pregnancy is a time of preparation for the next pregnancy and lifestyle advice in antenatal care and postpartum follow-up is therefore recommended. However, behavioral changes are difficult to achieve, and a better understanding of pregnant women's perspectives and experiences of pre-pregnancy weight development is crucial.\nMETHODS: We used a qualitative design and conducted semi-structured interviews with 14 women in Norway with pre-pregnancy obesity 3-12\xa0months postpartum. Data were analyzed using thematic analysis.", metadata={'uid': '38291504', 'Title': 'Weight development from childhood to motherhood-embodied experiences in women with pre-pregnancy obesity: a qualitative study.', 'Published': '2024-01-30', 'Copyright Information': '© 2024. The Author(s).'})

In [34]:
# Embeddings are numerical representations of text data: high-dimensional vectors
print(type(embeddings))
str_ch2 = str(str(chunks[2]).split(' \\n')[0])   # split and convert to string
print(type(str_ch2), len(str_ch2))

## print the first 20 values of the chunk's embedding vector
print(embeddings.embed_query(str_ch2)[:20])

<class 'langchain_community.embeddings.openai.OpenAIEmbeddings'>
<class 'str'> 899
[0.012037605021886673, 0.00026436873838712365, 0.009219996393745215, -0.04371477992526467, 0.00052742984157341, 0.005997879416407295, -0.01654298914306304, 0.021020475494417143, -0.011500585758494348, -0.034787702198410805, -0.026711486427699306, 0.004010210115533481, -0.004557690926193827, 0.006318696318782238, -0.0014079324796599502, 0.02162026404141759, 0.04603023800983173, -0.008878256862495553, 0.019597723692419244, -0.013034926777136033]


In [35]:
# expose the vector store as a retriever
retriever = vector_store.as_retriever()

# setup retrieval chain
conv_ret_chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)

In [None]:
# start asking questions - clear history
session_chat = {}
session_chat['history'] = []

In [36]:
ask_questions()

Enter your question:  summarize in 5 sentences obesity risk factors in children


I don't have specific information on obesity risk factors in children based on the context provided.



Enter your question:  summarize in 5 sentences obesity risk factors


Obesity risk factors identified in the study include unmet essential needs in childhood leading to poor diet and emotional eating patterns, genetic predisposition for obesity, challenging life course transitions, and turning points like illness or injuries, negative body awareness influencing behavioral patterns, and relationships, and struggles with food. Lifestyle factors such as consuming alcohol more frequently, smoking, and being overweight or obese were also associated with increased risk. The study suggests that lifestyle choices play a significant role in obesity risk, highlighting the importance of understanding



Enter your question:  what about age? does it affect obesity?


Based on the provided context, the study focused on pre-pregnancy obesity in women 3-12 months postpartum. The study did not specifically address the relationship between age and obesity risk factors. Therefore, based on this information, it is not possible to determine if age affects obesity risk factors.



Enter your question:  Thank You


Thank you for using ChatGPT


**Citation:**
PMC Open Access Subset [Internet].  
Bethesda (MD): National Library of Medicine. 2003 - [cited YEAR MONTH DAY].  
Available from [ncbi.nlm.nih.gov](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/).