# Documentation chat bot
This notebook explores using an LLM to build a chatbot for documentation.
* The chatbot should answer basic questions and return the documents it used to generate each response.

LangChain has several good resources on this application:
* https://python.langchain.com/v0.2/docs/tutorials/rag/
* https://python.langchain.com/v0.1/docs/use_cases/question_answering/citations/
* https://www.youtube.com/watch?v=Vw52xyyFsB8&list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&index=4

In [1]:
! pip install langchain langchain_community langchain-openai langchain-chroma langchainhub tiktoken chromadb faiss-cpu wikipedia unstructured beautifulsoup4 python-dotenv


Collecting langchainhub
  Downloading langchainhub-0.1.20-py3-none-any.whl.metadata (659 bytes)
Collecting types-requests<3.0.0.0,>=2.31.0.2 (from langchainhub)
  Downloading types_requests-2.32.0.20240622-py3-none-any.whl.metadata (1.8 kB)
Downloading langchainhub-0.1.20-py3-none-any.whl (5.0 kB)
Downloading types_requests-2.32.0.20240622-py3-none-any.whl (15 kB)
Installing collected packages: types-requests, langchainhub
Successfully installed langchainhub-0.1.20 types-requests-2.32.0.20240622





In [2]:
%load_ext dotenv
%dotenv ../.env
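
The cells below call OpenAI through `ChatOpenAI` and `OpenAIEmbeddings`, which read `OPENAI_API_KEY` from the environment by default. The notebook does not show what `../.env` contains, so the following is only a minimal sketch for checking that the key was loaded; `missing_keys` is a hypothetical helper, not part of any library.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Hypothetical helper: defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumption: OPENAI_API_KEY is the only variable this notebook needs.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after the `%dotenv` cell surfaces a missing key before any chain is invoked.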

## Using information from the web
Based on [this LangChain guide on returning sources](https://python.langchain.com/v0.1/docs/use_cases/question_answering/sources/).

In [20]:
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
wiki = WikipediaRetriever(top_k_results=6, doc_content_chars_max=2000)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You're a helpful AI assistant. Given a user question and some Wikipedia article snippets, answer the user question. If none of the articles answer the question, just say you don't know.\n\nHere are the Wikipedia articles:{context}",
        ),
        ("human", "{question}"),
    ]
)
prompt.pretty_print()



You're a helpful AI assistant. Given a user question and some Wikipedia article snippets, answer the user question. If none of the articles answer the question, just say you don't know.

Here are the Wikipedia articles:{context}


{question}


In [6]:
from operator import itemgetter
from typing import List


from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import (
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)


def format_docs(docs: List[Document]) -> str:
    """Convert Documents to a single string."""
    formatted = [
        f"Article Title: {doc.metadata['title']}\nArticle Snippet: {doc.page_content}"
        for doc in docs
    ]
    return "\n\n" + "\n\n".join(formatted)


# Named format_context to avoid shadowing the built-in `format`.
format_context = itemgetter("docs") | RunnableLambda(format_docs)
# Subchain that generates an answer once retrieval is done.
answer = prompt | llm | StrOutputParser()
# Complete chain: call wiki -> format docs to a string -> run the answer subchain
# -> return only the answer and the retrieved docs.
chain = (
    RunnableParallel(question=RunnablePassthrough(), docs=wiki)
    .assign(context=format_context)
    .assign(answer=answer)
    .pick(["answer", "docs"])
)
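
The data flow described in the comments above can be traced without calling Wikipedia or OpenAI. The sketch below is a plain-Python imitation of the dictionary that moves through the chain; `FakeDoc`, `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins, and only `format_docs` mirrors the notebook's own function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_retriever(question):
    # Assumption: a real retriever would query Wikipedia; this returns a fixed document.
    return [FakeDoc("Cheetahs can run at 93 to 104 km/h.", {"title": "Cheetah"})]


def format_docs(docs):
    formatted = [
        f"Article Title: {d.metadata['title']}\nArticle Snippet: {d.page_content}"
        for d in docs
    ]
    return "\n\n" + "\n\n".join(formatted)


def fake_llm(question, context):
    # Assumption: a real model would generate text; this only reports what it received.
    count = context.count("Article Title:")
    return f"(answer to {question!r} based on {count} snippet(s))"


def run_chain(question):
    # RunnableParallel(question=..., docs=...): build the initial dict.
    state = {"question": question, "docs": fake_retriever(question)}
    # .assign(context=...): add a key computed from the current dict.
    state["context"] = format_docs(state["docs"])
    # .assign(answer=...): the answer subchain sees question and context.
    state["answer"] = fake_llm(state["question"], state["context"])
    # .pick(["answer", "docs"]): keep only the keys the caller needs.
    return {key: state[key] for key in ("answer", "docs")}


result = run_chain("How fast are cheetahs?")
print(sorted(result))
```

Each step only adds to or trims one dictionary, which is why the retrieved documents are still available next to the answer at the end.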

In [7]:
a = chain.invoke("How fast are cheetahs?")
a

{'answer': 'Cheetahs are capable of running at speeds between 93 to 104 km/h (58 to 65 mph). They have evolved specialized adaptations for speed, including a light build, long thin legs, and a long tail.',
 'docs': [Document(page_content='The cheetah (Acinonyx jubatus) is a large cat and the fastest land animal. It has a tawny to creamy white or pale buff fur that is marked with evenly spaced, solid black spots. The head is small and rounded, with a short snout and black tear-like facial streaks. It reaches 67–94 cm (26–37 in) at the shoulder, and the head-and-body length is between 1.1 and 1.5 m (3 ft 7 in and 4 ft 11 in). Adults weigh between 21 and 72 kg (46 and 159 lb). The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph); it has evolved specialized adaptations for speed, including a light build, long thin legs and a long tail.\nThe cheetah was first described in the late 18th century. Four subspecies are recognised today that are native to Africa and central Iran. An

In [19]:
for d in a['docs']: 
    print(d.metadata['source'])

https://en.wikipedia.org/wiki/Cheetah
https://en.wikipedia.org/wiki/Southeast_African_cheetah
https://en.wikipedia.org/wiki/Footspeed
https://en.wikipedia.org/wiki/Pursuit_predation
https://en.wikipedia.org/wiki/Fastest_animals
https://en.wikipedia.org/wiki/Cheetah_Chrome


## Using local documents
Based on [the LangChain RAG quickstart](https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/).

In [1]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")



In [3]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./text', glob="**/*.txt", show_progress=True)

docs = loader.load()

len(docs)

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:10<00:00,  2.11s/it]


5

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
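
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch using fixed-width character windows. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraph and line breaks first); `sliding_chunks` is a hypothetical helper that only illustrates the overlap idea.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.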

In [7]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
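
The vector store answers a query by embedding it and returning the stored chunks whose vectors are closest. The sketch below shows that idea with a toy word-count embedding and cosine similarity. It is an assumption-laden illustration, not what `OpenAIEmbeddings` or FAISS do internally: real embeddings are dense learned vectors, and FAISS uses its own index structures and distance metric. `embed`, `cosine`, and `top_k` are hypothetical helpers, and the sample texts mirror the small files used later in this notebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (assumption, not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, texts, k=2):
    """Return the k texts most similar to the query."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]


texts = [
    "I went to the park on monday",
    "I went to the school on tuesday",
    "I went to the office on wednesday",
]
print(top_k("where did I go tuesday?", texts, k=1))
```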



In [8]:
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


In [9]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)



In [11]:
rag_chain.invoke("where were the first inhabitants of the united states from ?")

'The first inhabitants of the United States migrated from Siberia across the Bering land bridge at least 12,000 years ago. These Paleo-Indians formed various civilizations and societies over time. Estimates of the native population before European arrival range from around 500,000 to nearly 10 million.'

In [12]:
from langchain_core.runnables import RunnableParallel

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke("where did I go tuesday?")



{'context': [Document(page_content='I went to the school on tuesday', metadata={'source': 'text\\text2.txt'}),
  Document(page_content='I went to the office  on wednesday', metadata={'source': 'text\\text3.txt'}),
  Document(page_content='I went to the park on monday', metadata={'source': 'text\\text1.txt'}),
  Document(page_content='I went to the doctors  on friday', metadata={'source': 'text\\text4.txt'})],
 'question': 'where did I go tuesday?',
 'answer': 'You went to the school on Tuesday.'}
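
Since the goal is a chatbot that shows which documents it used, the result dictionary can be turned into an answer followed by a source list. The following is a minimal sketch; `FakeDoc` and `answer_with_sources` are hypothetical names, and the sample data copies two entries from the output above. Note that it lists every retrieved document, not only those the model actually relied on.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def answer_with_sources(result):
    """Format the answer plus a de-duplicated list of retrieved sources."""
    sources = []
    for doc in result["context"]:
        source = doc.metadata.get("source", "unknown")
        if source not in sources:  # keep first-seen order, drop repeats
            sources.append(source)
    lines = [result["answer"], "", "Sources:"] + [f"- {s}" for s in sources]
    return "\n".join(lines)


# Sample data shaped like the output of rag_chain_with_source.invoke(...).
sample = {
    "context": [
        FakeDoc("I went to the school on tuesday", {"source": "text\\text2.txt"}),
        FakeDoc("I went to the office  on wednesday", {"source": "text\\text3.txt"}),
    ],
    "question": "where did I go tuesday?",
    "answer": "You went to the school on Tuesday.",
}
print(answer_with_sources(sample))
```

With the real chain, the same function should accept the dictionary returned by `rag_chain_with_source.invoke(...)`, since it only reads `answer`, `context`, and each document's `metadata`.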