# Build a chatbot that can
1. Retrieve information from a knowledge base.
   - Load the liver cancer handbook and store it in a vector database.
   - Add a RetrievalQA chain to the agent.

2. Remember previous conversations.
   - Use conversation buffer memory.
   - Use a conversational agent.


### Build the external knowledge base

In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("liver-hp-patient.pdf")
pages = loader.load_and_split()

In [4]:
pages[3].page_content

'2\nNCCN Guidelines for Patients® \nLiver Cancer, 2021About\nThese NCCN Guidelines for Patients are based on the NCCN Guidelines® for Hepatobiliary Cancers (Version \n2.2021, April 16, 2021). \n© 2021 National Comprehensive Cancer Network, Inc. All rights reserved. \nNCCN Guidelines for Patients and illustrations herein may not be reproduced in \nany form for any purpose without the express written permission of NCCN. No \none, including doctors or patients, may use the NCCN Guidelines for Patients for \nany commercial purpose and may not claim, represent, or imply that the NCCN \nGuidelines for Patients that have been modified in any manner are derived \nfrom, based on, related to, or arise out of the NCCN Guidelines for Patients. \nThe NCCN Guidelines are a work in progress that may be redefined as often \nas new significant data become available. NCCN makes no warranties of any \nkind whatsoever regarding its content, use, or application and disclaims any \nresponsibility for its ap

In [15]:
import tiktoken

BASE_MODEL = "gpt-4"

def count_tokens(text):
    """Return the number of tokens BASE_MODEL's tokenizer produces for text."""
    tokenizer = tiktoken.encoding_for_model(BASE_MODEL)
    return len(tokenizer.encode(text))

In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

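def split_page_to_chunks(page: str) -> list[str]:
    """Split one page of text into chunks of about 400 tokens with a 50-token overlap."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=50,
        length_function=count_tokens,  # measure length in tokens, not characters
        separators=["\n\n", "\n", " ", ""],  # try paragraph breaks first, then finer splits
    )

    # Flatten the newlines left inside each chunk
    return [chunk.replace("\n", " ") for chunk in splitter.split_text(page)]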
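
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of a fixed-size sliding window over a token list. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also respects the separators); `sliding_window_chunks` and the integer "tokens" are hypothetical names and data used only for illustration.

```python
def sliding_window_chunks(tokens, chunk_size=400, chunk_overlap=50):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Hypothetical stand-in for real tokens: 1,000 integers
example_tokens = list(range(1000))
example_chunks = sliding_window_chunks(example_tokens)
print([len(chunk) for chunk in example_chunks])
```

With these settings each new window starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next.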

In [42]:
import openai
from getpass import getpass

OPENAI_API_KEY = getpass("Enter your OpenAI API key: ")

In [60]:
from langchain.embeddings.openai import OpenAIEmbeddings


embed = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=OPENAI_API_KEY)
    
example_texts = ["A happy moment"]
example_embedding = embed.embed_documents(example_texts)

example_length = len(example_embedding[0])
print(example_length)

1536


In [37]:
# Create the Pinecone index

import pinecone

PINECONE_API_KEY = getpass("PINECONE_API_KEY")
PINECONE_ENV = getpass("PINECONE_ENV")

index_name = 'liver-cancer-retriveval-augmentation'
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

pinecone.create_index(
    name=index_name, 
    metric="dotproduct",
    dimension=example_length)


In [39]:
index_name = 'liver-cancer-retriveval-augmentation'

index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [56]:
from uuid import uuid4

texts = []
for page in pages:
    texts.extend(split_page_to_chunks(page.page_content))

metadatas = [{"chunk": j, "text": text} for j, text in enumerate(texts)]
embeddings = embed.embed_documents(texts)
ids = [str(uuid4()) for _ in range(len(texts))]
combined = zip(ids, embeddings, metadatas)

In [57]:
index.upsert(vectors=combined)

{'upserted_count': 122}
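
The upsert above sends all 122 vectors in one request. For a larger document it may be safer to send them in smaller batches. The sketch below shows one way to do that; `batched` is a hypothetical helper, the batch size of 100 is an assumed value, and the toy records stand in for the real `(id, embedding, metadata)` tuples.

```python
from itertools import islice


def batched(iterable, batch_size=100):
    """Yield lists of at most batch_size items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for the (id, embedding, metadata) tuples built above
toy_records = [(str(i), [0.0, 0.0], {"chunk": i}) for i in range(122)]
batch_sizes = [len(batch) for batch in batched(toy_records, batch_size=100)]
print(batch_sizes)

# In the notebook, each batch could then be passed to the index, for example:
# for batch in batched(zip(ids, embeddings, metadatas)):
#     index.upsert(vectors=batch)
```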

In [71]:
from langchain.vectorstores import Pinecone

text_field = "text"
index = pinecone.Index(index_name)
vectorstore = Pinecone(
    index=index, 
    embedding_function=embed.embed_query, 
    text_key=text_field)
query = "What is liver cancer?"
top = vectorstore.similarity_search(query, k=3)


In [72]:
for t in top:
    print(t)

page_content='9 NCCN Guidelines for Patients®  Liver Cancer, 20211 Liver cancer basics  Liver cancer  1 Liver cancer basics  Liver cancer | How liver cancer spreads Liver cancer Cancer that starts in the liver is called primary  liver cancer. Secondary liver cancer is when  other cancer types spread to the liver. For  example, cancer can start in the intestines  (colon) and spread to the liver. This is called  metastatic colon cancer in the liver. There is more than one type of primary liver  cancer in adults. The most common type is  hepatocellular carcinoma (HCC). There is a  subtype of HCC called FLHC (fibrolamellar  hepatocellular carcinoma). FLHC affects very  few people and usually occurs at a younger  age.  The second most common type of primary  liver cancer in adults is called intrahepatic  cholangiocarcinoma, which is a cancer of the  bile ducts. Other primary liver cancers in adults  include rare types of sarcoma that start in the  blood vessel cells of the liver. Another ra
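
Conceptually, `similarity_search` embeds the query and asks the index for the `k` stored vectors that score highest under the index metric, which is `dotproduct` here. The sketch below reproduces that idea in plain Python over a few toy 2-dimensional vectors. It is only an illustration of the ranking step, not Pinecone's implementation (which uses approximate search); `dot`, `top_k_by_dot_product`, and the toy data are hypothetical.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(query_vector, stored, k=3):
    """Return the k (score, id, metadata) tuples with the highest dot product.

    stored is a list of (id, vector, metadata) tuples.
    """
    scored = (
        (dot(query_vector, vector), vector_id, metadata)
        for vector_id, vector, metadata in stored
    )
    return heapq.nlargest(k, scored, key=lambda item: item[0])


# Toy vectors standing in for 1536-dimensional embeddings
toy_vectors = [
    ("a", [1.0, 0.0], {"text": "alpha"}),
    ("b", [0.6, 0.8], {"text": "beta"}),
    ("c", [0.0, 1.0], {"text": "gamma"}),
    ("d", [-1.0, 0.0], {"text": "delta"}),
]
toy_hits = top_k_by_dot_product([0.8, 0.6], toy_vectors, k=2)
print([vector_id for _, vector_id, _ in toy_hits])
```

Because `text-embedding-ada-002` vectors are normalized to unit length, ranking by dot product should give the same order as ranking by cosine similarity.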

### Check the knowledge base

In [90]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    openai_api_key=OPENAI_API_KEY,
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    verbose=True,
    return_source_documents=True,
)
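
The `"stuff"` chain type places all retrieved documents into a single prompt and makes one call to the model. The sketch below shows that idea only; `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt for a single LLM call."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two sentences taken from the handbook text retrieved earlier
toy_documents = [
    "Cancer that starts in the liver is called primary liver cancer.",
    "Secondary liver cancer is when other cancer types spread to the liver.",
]
toy_prompt = build_stuff_prompt("What is liver cancer?", toy_documents)
print(toy_prompt)
```

The trade-off is that every retrieved chunk must fit in the model's context window at once, which is one reason to keep chunks small.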

In [None]:
result = qa_chain({"query": "What is liver cancer?"})

In [92]:
result["result"]

'Liver cancer is a type of cancer that starts in the liver, known as primary liver cancer. There are different types of primary liver cancer in adults, with the most common being hepatocellular carcinoma (HCC). There is also a subtype of HCC called fibrolamellar hepatocellular carcinoma (FLHC), which affects very few people and usually occurs at a younger age. The second most common type of primary liver cancer in adults is intrahepatic cholangiocarcinoma, a cancer of the bile ducts. Other types include rare types of sarcoma that start in the blood vessel cells of the liver, and a rare type made of both hepatocellular carcinoma and cholangiocarcinoma, called a mixed-type tumor. Secondary liver cancer is when other types of cancer spread to the liver, such as metastatic colon cancer in the liver.'

In [None]:
result = qa_chain({"query": "What are the causes for liver cancer?"})

In [97]:
print(result["result"])

The causes for liver cancer include:

1. Cirrhosis: This is a long-term liver disease where liver cells are replaced by scar tissue. It can be caused by Hepatitis B, Hepatitis C, alcohol, non-alcoholic fatty liver disease (NAFLD), genetic hemochromatosis, stage 4 primary biliary cholangitis, alpha-1-antitrypsin deficiency, and other causes.

2. Hepatitis: Hepatitis B (HBV) and Hepatitis C (HCV) can cause scarring of the liver (cirrhosis), liver failure, and liver cancer.

3. Alcohol: Drinking too much alcohol can cause damage to the liver.

4. Non-alcoholic fatty liver disease (NAFLD): This is seen in obese people or those with diabetes, high cholesterol, and a few other conditions. Having NAFLD may lead to cirrhosis in people who drink little or no alcohol.

5. Genetic hemochromatosis: This is an inherited condition that causes the liver to store too much iron from food.

6. Other risk factors: Diabetes, obesity, or other problems with processing sugar may put someone at risk for live

In [100]:
for d in result["source_documents"]:
    print(d.page_content)
    print()

15 NCCN Guidelines for Patients®  Liver Cancer, 20211 Liver cancer basics  Risk factors  Cirrhosis Cirrhosis is scarring of the liver. It is a type of  long-term (chronic) liver disease where liver  cells are replaced by scar tissue. If you have  cirrhosis, you should be screened for liver  cancer. Cirrhosis can be caused by:  Hepatitis B  Hepatitis C  Alcohol  Non-alcoholic fatty liver disease (NAFLD)  Genetic hemochromatosis  Stage 4 primary biliary cholangitis  Alpha-1-antitrypsin deficiency  Other causes of cirrhosisHepatitis Hepatitis is a type of liver disease. Viruses  called hepatitis A, hepatitis B (HBV), and  hepatitis C (HCV) are the most common  types of hepatitis. HBV and HCV are spread  by contact with blood and other bodily fluids.  HBV and HCV can cause scarring of the liver  (cirrhosis), liver failure, and liver cancer. If you  have chronic HBV, you should be screened for  liver cancer.  Other risk factors Drinking too much alcohol can cause damage  to the live

### Build the conversation agent with LangChain

In [102]:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

conversation_chain = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),
)

In [None]:
conversation_chain("What are the treatments for liver cancer?")
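
Buffer memory keeps the whole transcript and prepends it to each new prompt, which is how the model can refer back to earlier turns. A minimal sketch of that behavior follows; `BufferMemory`, `build_conversation_prompt`, and the placeholder reply are hypothetical and do not mirror LangChain's classes or prompt wording.

```python
class BufferMemory:
    """Store every turn of the conversation verbatim."""

    def __init__(self):
        self.turns = []

    def save(self, human_message, ai_message):
        """Record one exchange."""
        self.turns.append(("Human", human_message))
        self.turns.append(("AI", ai_message))

    def render(self):
        """Return the transcript as one string."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)


def build_conversation_prompt(memory, user_input):
    """Prepend the stored transcript to the new user input."""
    history = memory.render()
    prefix = f"{history}\n" if history else ""
    return f"{prefix}Human: {user_input}\nAI:"


toy_memory = BufferMemory()
toy_memory.save("What are the treatments for liver cancer?", "(model reply)")
toy_followup = build_conversation_prompt(toy_memory, "Which of them is most common?")
print(toy_followup)
```

Since the buffer grows with every turn, long conversations will eventually exceed the context window; that is the main limitation of this memory type.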

In [129]:
from langchain.agents import Tool, initialize_agent

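# qa_chain returns two outputs ("result" and "source_documents"), so
# qa_chain.run would raise; call the chain directly and keep only the answer.
def query_knowledge_base(query: str) -> str:
    return qa_chain({"query": query})["result"]

tools = [Tool(
    name="Knowledge Base",
    description="Use this tool when answering general knowledge queries to get more information about the topic.",
    func=query_knowledge_base,
)]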

In [130]:
conversational_agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="conversational-react-description",
    verbose=True,
    max_iterations=3,
    memory=ConversationBufferMemory(memory_key="chat_history"),
)
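
The agent works in a loop: the model either picks a tool and an input, or returns a final answer, and `max_iterations` caps the number of rounds. The sketch below imitates that control flow with a scripted decision function in place of the LLM. `run_agent`, `scripted_decide`, `toy_tools`, and the stop message are all hypothetical; this is not LangChain's agent executor.

```python
def run_agent(decide, tools, user_input, max_iterations=3):
    """Run a bounded tool-use loop.

    decide(user_input, observations) returns ("final", answer) or
    (tool_name, tool_input).
    """
    observations = []
    for _ in range(max_iterations):
        action, payload = decide(user_input, observations)
        if action == "final":
            return payload
        # Call the chosen tool and remember what it returned
        observations.append(tools[action](payload))
    return "Stopped: iteration limit reached."


def scripted_decide(user_input, observations):
    """Stand-in for the LLM: query the knowledge base once, then answer."""
    if not observations:
        return ("Knowledge Base", user_input)
    return ("final", observations[-1])


toy_tools = {"Knowledge Base": lambda query: f"(retrieved answer for: {query})"}
toy_answer = run_agent(scripted_decide, toy_tools, "What is liver cancer?")
print(toy_answer)
```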


In [None]:
conversational_agent("What is the website to learn more about sarcomas in NCCN")

In [None]:
print(conversational_agent.agent.llm_chain.prompt.template)