### Data ingest

##### load environment variables

In [175]:
import os
from dotenv import load_dotenv

load_dotenv()

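# os.getenv returns None when the key is absent, and assigning None to
# os.environ raises a TypeError, so fail early with a clearer message.
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key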

##### load local text file

In [176]:
from langchain_community.document_loaders import TextLoader
loader=TextLoader("abstract.txt")
text_docs=loader.load()
text_docs[0]

Document(page_content='We define how chronic cigarette smoke-induced time-dependent epigenetic alterations can sensitize human bronchial epithelial cells for transformation by a single oncogene. The smoke-induced chromatin changes include initial repressive polycomb marking of genes, later manifesting abnormal DNA methylation by 10 months. At this time, cells exhibit epithelial-to-mesenchymal changes, anchorage-independent growth, and upregulated RAS/MAPK signaling with silencing of hypermethylated genes, which normally inhibit these pathways and are associated with smoking-related non-small cell lung cancer. These cells, in the absence of any driver gene mutations, now transform by introducing a single KRAS mutation and form adenosquamous lung carcinomas in mice. Thus, epigenetic abnormalities may prime for changing oncogene senescence to addiction for a single key oncogene involved in lung cancer initiation.', metadata={'source': 'abstract.txt'})

##### load local PDF file

In [177]:
from langchain_community.document_loaders import PyPDFLoader, OnlinePDFLoader
from pprint import pprint

loader=PyPDFLoader("cancercell.pdf")
# loader=OnlinePDFLoader("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5596892/pdf/nihms900592.pdf")
pdf_docs=loader.load()
print(len(pdf_docs))
pprint(pdf_docs)

38
[Document(page_content="Chronic Cigarette Smoke-Induced Epigenomic Changes Precede \nSensitization of Bronchial Epithelial Cells to Single Step \nTransformation by KRAS  Mutations\nMichelle Vaz1, Stephen Y Hwang1, Ioannis Kagiampakis1, Jillian Phallen1, Ashwini Patil2, \nHeather M O’Hagan3, Lauren Murphy1, Cynthia A Zahnow1, Edward Gabrielson4, Victor E \nVelculescu1, Hariharan P Easwaran1,5, and Stephen B Baylin1,5,6\n1Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, The Johns \nHopkins University School of Medicine, Baltimore, MD 21287, USA.\n2Krieger School of Arts and Sciences, Baltimore, MD 21218, USA\n3Medical Sciences, Indiana University School of Medicine, Bloomington, IN 47405, USA Melvin \nand Bren Simon Cancer Center, Indianapolis, IN 46202, USA\n4Department of Pathology, The Johns Hopkins University School of Medicine, Baltimore, MD \n21287\nSUMMARY\nWe define how chronic cigarette smoke-induced time-dependent epigenetic alterations can \nsensitize 

##### web-based loader

In [178]:
# https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base/
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load the page contents (this first loader is replaced by the one defined below)
loader=WebBaseLoader(web_paths=[
    "https://lilianweng.github.io/posts/2023-06-23-agent/"],
    # bs_kwargs=dict(parse_only=bs4.SoupStrainer("p")) # parse only paragraphs
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("post-title", "post-content", "post-header") # parse only content of blog posts
    ))
)

loader=WebBaseLoader(web_paths=[
    "https://weaviate.io/blog/hybrid-search-explained"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer( ("header", "p") ) )
)

docs=loader.load()
print(docs)

[Document(page_content='Hybrid Search ExplainedJanuary 3, 2023 · 7 min readErika CardenasTechnology Partner ManagerHybrid search is a technique that combines multiple search algorithms to improve the accuracy and relevance of search results. It uses the best features of both keyword-based search algorithms with vector search techniques. By leveraging the strengths of different algorithms, it provides a more effective search experience for users.The Hybrid search feature was introduced in Weaviate 1.17. It uses sparse and dense vectors to represent the semantic meaning and context of search queries and documents.In this blog post, you will learn about the implementation of hybrid search in Weaviate and how to use it.Sparse and dense vectors are calculated with distinct algorithms. Sparse vectors have mostly zero values with only a few non-zero values, while dense vectors mostly contain non-zero values. Sparse embeddings are generated from algorithms like BM25 and SPLADE. Dense embedding
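
The `parse_only` filter above keeps only the text found inside selected tags. As a rough illustration of that idea, here is a minimal sketch using only the standard library. It is a stand-in, not how `bs4.SoupStrainer` is implemented; the class name `TagTextExtractor` and the sample HTML string are made up for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0   # number of wanted tags currently open
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one wanted tag is open
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


# Made-up page: only <header> and <p> content should survive the filter
html = (
    "<html><header><h1>Title</h1></header><nav>Menu</nav>"
    "<p>First.</p><footer>Foot</footer><p>Second.</p></html>"
)

extractor = TagTextExtractor(("header", "p"))
extractor.feed(html)
extractor.close()

filtered_text = " ".join(extractor.parts)
print(filtered_text)
```

Filtering at parse time keeps navigation menus and footers out of the text that is later chunked and embedded.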

##### Other loader types:
https://python.langchain.com/v0.2/docs/integrations/document_loaders/

Example: PubMed loader
https://python.langchain.com/v0.2/docs/integrations/document_loaders/pubmed/

Other potentially useful loaders:
- HuggingFaceDatasetLoader

In [179]:
from langchain_community.document_loaders import PubMedLoader
from pprint import pprint

loader = PubMedLoader(query="Chronic Cigarette Smoke-Induced Epigenomic Changes")
docs = loader.load()

len(docs)
pprint(docs[0].metadata)
print(docs[0].page_content)




Too Many Requests, waiting for 0.20 seconds...
{'Copyright Information': 'Copyright © 2017 Elsevier Inc. All rights reserved.',
 'Published': '--',
 'Title': 'Chronic Cigarette Smoke-Induced Epigenomic Changes Precede '
          'Sensitization of Bronchial Epithelial Cells to Single-Step '
          'Transformation by KRAS Mutations.',
 'uid': '28898697'}
We define how chronic cigarette smoke-induced time-dependent epigenetic alterations can sensitize human bronchial epithelial cells for transformation by a single oncogene. The smoke-induced chromatin changes include initial repressive polycomb marking of genes, later manifesting abnormal DNA methylation by 10 months. At this time, cells exhibit epithelial-to-mesenchymal changes, anchorage-independent growth, and upregulated RAS/MAPK signaling with silencing of hypermethylated genes, which normally inhibit these pathways and are associated with smoking-related non-small cell lung cancer. These cells, in the absence of any driver gene 

### Transformation

##### Convert PDF to chunks

In [180]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # each chunk holds at most 1000 characters
    chunk_overlap=100 # consecutive chunks share up to 100 characters
)

documents=text_splitter.split_documents(pdf_docs)
pprint(documents[:3])

[Document(page_content='Chronic Cigarette Smoke-Induced Epigenomic Changes Precede \nSensitization of Bronchial Epithelial Cells to Single Step \nTransformation by KRAS  Mutations\nMichelle Vaz1, Stephen Y Hwang1, Ioannis Kagiampakis1, Jillian Phallen1, Ashwini Patil2, \nHeather M O’Hagan3, Lauren Murphy1, Cynthia A Zahnow1, Edward Gabrielson4, Victor E \nVelculescu1, Hariharan P Easwaran1,5, and Stephen B Baylin1,5,6\n1Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, The Johns \nHopkins University School of Medicine, Baltimore, MD 21287, USA.\n2Krieger School of Arts and Sciences, Baltimore, MD 21218, USA\n3Medical Sciences, Indiana University School of Medicine, Bloomington, IN 47405, USA Melvin \nand Bren Simon Cancer Center, Indianapolis, IN 46202, USA\n4Department of Pathology, The Johns Hopkins University School of Medicine, Baltimore, MD \n21287\nSUMMARY\nWe define how chronic cigarette smoke-induced time-dependent epigenetic alterations can', metadata={'so
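
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries to split on separators such as paragraph and line breaks first); the function `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slice text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, provided it is shorter than the overlap.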

##### Create embeddings from chunks
Store vectors in Chroma

In [181]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

db_chroma = Chroma.from_documents(documents, embedding=OpenAIEmbeddings())

##### Store vectors in FAISS

In [182]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

db_faiss = FAISS.from_documents(documents, embedding=OpenAIEmbeddings())

##### Store vectors in Qdrant

In [183]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient


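# Disabled: with no `location` or `url`, the client tries to reach a Qdrant server,
# and the call fails with the error shown below when none is running.
# For a quick local test, passing location=":memory:" keeps the collection in-process.
# db_qdrant = await Qdrant.afrom_documents(documents, embedding=OpenAIEmbeddings())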

ResponseHandlingException: [Errno 61] Connection refused

##### Query vector store

In [None]:
query="When do the cells become anchorage independent and form clones in soft agar?"
response1 = db_chroma.similarity_search(query)
print(f'response1: {response1[1].page_content}')

response2 = db_faiss.similarity_search(query)
print(f'response2: {response2[1].page_content}')

# response3 = await db_qdrant.asimilarity_search(query)
# print(f'response3: {response3[0].page_content}')

response1: Figure 6. Phenotypes of CSC exposed versus control cells before and after expression of mutant 
KRASV12
(A) Soft agar assays to assess the anchorage-independent growth ability of HBEC cells 
following 10 months of CSC exposure. (B) Quantitation of soft agar colonies in panel A. 
Values indicate average number of colonies ± SD from replicate 1 and 2 at the indicated time 
points. (C) Phase contrast photomicrographs showing morphological changes in control and 
CSC treated cells when switched from defined serum-free to serum-containing medium for 
96 hr. Scale bar = 50 µm. (D) Immunofluorescent staining of E-cadherin and Vimentin in 
cells grown in serum-containing medium for 96 hr. Scale bar = 100 µm. (E) Immunoblot for 
EMT markers in cells (CSC and Con) and soft agar clones obtained from 15 month CSC Vaz et al. Page 34
Cancer Cell . Author manuscript; available in PMC 2018 September 11.
Author Manuscript Author Manuscript Author Manuscript Author Manuscript
response2: and t
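
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it, best match first. The sketch below shows that idea as a brute-force ranking over toy three-dimensional vectors. The vectors, the texts, and the choice of cosine similarity are assumptions made for illustration; the metric and index a real store uses depend on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store of (text, vector) pairs; real embeddings have many more dimensions
toy_store = [
    ("soft agar colonies after CSC exposure", [0.9, 0.1, 0.0]),
    ("DNA methylation at 10 months", [0.2, 0.9, 0.1]),
    ("author affiliations", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_store, k=2)
print(results)
```

Because results are ordered by similarity, index `0` is the closest match; the cell above prints index `1`, the second-ranked chunk.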

### Querying

In [None]:
from langchain_community.llms import Ollama

llm=Ollama(model="llama3")

Ollama(model='llama3')

##### Design prompt template

In [None]:
from langchain_core.prompts import ChatPromptTemplate


prompt_template=ChatPromptTemplate.from_template("""
Answer the following question based only on the information provided in the context.
Think step-by-step before providing a detailed answer in 100 words or less.
The answer should be easy enough for a child to understand.
<context>{context}</context>
Question: {input}
""")

##### Creating document chain

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm, prompt_template)
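
A "stuff" chain puts all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that step with plain string formatting. The prompt is an abbreviated version of the template above, `stuff_documents` is a hypothetical helper, and the chunk strings are toy text; the real chain also calls the LLM, which is left out here.

```python
PROMPT = """Answer the following question based only on the information provided in the context.
<context>{context}</context>
Question: {input}"""


def stuff_documents(page_contents, question, separator="\n\n"):
    """Join all chunk texts into one context string and fill the prompt."""
    context = separator.join(page_contents)
    return PROMPT.format(context=context, input=question)


# Toy chunk texts standing in for retrieved documents
toy_chunks = [
    "Soft agar assays were run after 10 months of CSC exposure.",
    "Cells were grown in serum-containing medium for 96 hr.",
]

filled_prompt = stuff_documents(
    toy_chunks,
    "When do the cells become anchorage independent?",
)
print(filled_prompt)
```

Since every chunk lands in one prompt, this approach only works while the combined text fits in the model's context window.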

##### Adding retriever (interface to the vector store)

In [None]:
retriever = db_faiss.as_retriever()

##### Creating retriever chain

In [None]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(retriever, document_chain)

##### Invoke the chain

In [None]:
retrieval_chain.invoke({"input": "When do the cells become anchorage independent and form clones in soft agar?"})
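
The chain returns a dictionary rather than a bare string; the generated text sits under the `"answer"` key and the retrieved documents under `"context"`. The sketch below mimics that retrieve-then-answer flow with stand-in functions so the data flow is visible without an LLM or vector store. `make_retrieval_chain`, `toy_retrieve`, and `toy_answer` are hypothetical names, and the keyword match is only a placeholder for embedding-based retrieval.

```python
def make_retrieval_chain(retrieve, answer_from_docs):
    """Compose a retriever and an answering step into one callable."""

    def invoke(inputs):
        question = inputs["input"]
        docs = retrieve(question)  # step 1: fetch relevant chunks
        answer = answer_from_docs({"input": question, "context": docs})  # step 2: answer
        return {"input": question, "context": docs, "answer": answer}

    return invoke


# Toy chunk texts standing in for the indexed PDF
TOY_CHUNKS = [
    "Soft agar assays were run after 10 months of CSC exposure.",
    "Authors are affiliated with Johns Hopkins.",
]


def toy_retrieve(question):
    """Placeholder retriever: keep chunks that share a word with the question."""
    words = set(question.lower().split())
    return [c for c in TOY_CHUNKS if words & set(c.lower().split())]


def toy_answer(inputs):
    """Placeholder for the LLM call: report how many chunks were supplied."""
    return f"(stand-in answer built from {len(inputs['context'])} chunk(s))"


toy_chain = make_retrieval_chain(toy_retrieve, toy_answer)
result = toy_chain({"input": "When do cells grow in soft agar?"})
print(result["answer"])
```

With the real chain, the same pattern applies: store the return value of `retrieval_chain.invoke(...)` and read `["answer"]` for the text or `["context"]` to inspect which chunks were used.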