## Understanding LangChain using LangChain

- scrape the LangChain documentation
- scrape the LangChain code repo
- load the scraped pages into a vector store
- query the vector store using an LLM

In [6]:
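# set up environment
import os
from constants import keys  # local module that holds the API keys
# set API keys here; the variable name must be upper-case OPENAI_API_KEY to be picked up
os.environ['OPENAI_API_KEY'] = keys['openai']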

In [2]:
# load documents
from langchain.document_loaders import UnstructuredHTMLLoader  # other loaders live in the document_loaders module
# raw string so the Windows backslashes are not read as escape sequences
loader = UnstructuredHTMLLoader(r'data\langchain_docs\langchain-harrison-docs-refactor-3-24\index.html')
data = loader.load()
data



In [31]:
# split text
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk with overlap, so text cut at a chunk boundary also appears at the start of the next chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = text_splitter.split_documents(data)
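
To get a feel for what `chunk_overlap` does, here is a minimal sketch of a fixed-window splitter. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk when the overlap is long enough.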

In [32]:
num_total_characters = sum(len(x.page_content) for x in docs)

print(f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 1286 documents that have an average of 1,346 characters (smaller pieces)


In [33]:
# The vectorstore we'll be using
from langchain.vectorstores import FAISS

# The embedding engine that will convert our text to vectors
from langchain.embeddings.openai import OpenAIEmbeddings

# Get your embeddings engine ready
embeddings = OpenAIEmbeddings()

# Embed your documents and combine with the raw text in a pseudo db. Note: This will make an API call to OpenAI
docsearch = FAISS.from_documents(docs, embeddings)
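
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors are closest. The sketch below does an exhaustive nearest-neighbor search by Euclidean distance in plain Python. The two-dimensional vectors and labels are toy values, not real embeddings, and the assumption that the index built above ranks by this same metric is not confirmed by the notebook.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k stored vectors closest to query_vec."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy data: labels and hand-picked 2-D vectors standing in for embeddings
toy_texts = ["chunk A", "chunk B", "chunk C"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vector = [1.0, 0.2]

nearest = top_k(query_vector, toy_vectors, k=2)
print([toy_texts[i] for i in nearest])
# ['chunk C', 'chunk A']
```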


In [34]:
# save vectorstore to disk

import faiss
faiss.write_index(docsearch.index, "./db/langchain_docs_short.index")

In [35]:
import pickle
with open('./db/langchain_docs_short.pkl', 'wb') as f:
    pickle.dump(docsearch, f)

In [36]:
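# reload an index from disk (note: this path differs from the langchain_docs_short.index written above)
index = faiss.read_index('./db/langchain_docs.index')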

In [46]:
# The LangChain component we'll use to get the documents
from langchain.chains import RetrievalQA
from langchain import OpenAI
llm = OpenAI(temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
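
The `"stuff"` chain type puts all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that idea is below. The prompt wording and the helper name `build_stuff_prompt` are hypothetical, not LangChain's actual template, and the chunks are placeholder strings.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for retriever output
toy_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("How do I call a web API?", toy_chunks)
print(prompt)
```

Because everything lands in one prompt, this approach is simple but can exceed the model's context window when the chunks are large or numerous.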

In [20]:
query = "how do you interact with web apis using langchain, and what does it do?"
qa.run(query)

' LangChain provides a module for interacting with web APIs. This module allows you to make requests to web APIs and use the response in your language model application. This can be used to fetch data from external sources, or to take actions based on the output of the language model.'

In [47]:
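# query2 is defined in a later cell (In [43]); it is repeated here so the notebook runs top to bottom
query2 = "how does RetrievalQAWithSourcesChain work?"
qa.run(query2)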

' RetrievalQAWithSourcesChain is a chain for doing question-answering with sources over an Index. It does this by using the RetrievalQAWithSourcesChain, which does the lookup of the documents from an Index. It then passes the documents to the LLM, which will return the answer to the question as well as the sources it used to answer the question.'

In [37]:
# alternate method: a chain that also returns its sources
from langchain.chains import RetrievalQAWithSourcesChain
# TODO: try a different text splitter

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=docsearch.as_retriever(),
    reduce_k_below_max_tokens=True,  # to avoid token-limit errors
)
result = chain({"question": query}, return_only_outputs=True)
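
As I understand it, `reduce_k_below_max_tokens=True` trims the retrieved documents so the stuffed prompt stays under a token budget. The sketch below shows one way to do that: drop the lowest-ranked chunks until the total fits. Counting tokens by whitespace-separated words is an assumption made for the example (a real chain would use the model's tokenizer), and `reduce_below_token_limit` is a made-up name.

```python
def reduce_below_token_limit(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks whose combined token count fits max_tokens."""
    # Assumption: approximate tokens as whitespace-separated words
    token_counts = [len(chunk.split()) for chunk in chunks]
    num_kept = len(chunks)
    total = sum(token_counts)
    # Drop chunks from the end (lowest rank) until the total fits the budget
    while num_kept > 0 and total > max_tokens:
        num_kept -= 1
        total -= token_counts[num_kept]
    return chunks[:num_kept]


# Chunks are assumed to be ordered best match first
ranked_chunks = ["one two three", "four five", "six seven eight nine"]
kept = reduce_below_token_limit(ranked_chunks, max_tokens=6)
print(kept)
# ['one two three', 'four five']
```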


In [41]:
print(result['answer'])

 LangChain is a framework for developing applications powered by language models. It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications. It also provides guidance and assistance in using the modules for personal assistants, question answering, chatbots, querying tabular data, interacting with APIs, extraction, summarization, evaluation, and question answering over documents.



In [43]:
query2 = "how does RetrievalQAWithSourcesChain work?"
result = chain({"question": query2}, return_only_outputs=True)

In [44]:
result 

{'answer': ' RetrievalQAWithSourcesChain is a chain for question-answering with sources over an index. It takes an LLM wrapper, a query, and a list of documents as input and returns the answer to the query and the sources involved.\n',
 'sources': 'data\\langchain_docs\\langchain-harrison-docs-refactor-3-24\\index.html'}