# LangChain: Q&A over Documents

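Question answering over documents lets you query your own text with an LLM. A typical example is a tool for querying a product catalog for items of interest; in this notebook the documents are SoilSense product pages.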

In [None]:
#pip install --upgrade langchain

In [1]:
import os
import openai

#from dotenv import load_dotenv, find_dotenv
#_ = load_dotenv(find_dotenv()) # read local .env file

#openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [3]:
#Import loader for txt file
from langchain.document_loaders import TextLoader

In [4]:
file = 'landing_page.txt'
loader = TextLoader(file_path=file)
document = loader.load()

In [5]:
from langchain.indexes import VectorstoreIndexCreator

In [6]:
#pip install docarray

In [8]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [9]:
query ="Please list the 4 primary benefits of using SoilSense"

In [10]:
response = index.query(query)

In [12]:
display(Markdown(response))

 1. Save water 2. Increase yield 3. Reduce risk 4. Save time

In [13]:
query ="What makes SoilSense different from competing products?"
response = index.query(query)

In [14]:
display(Markdown(response))

 SoilSense offers a reliable and easy-to-use soil sensor system with high accuracy TDT sensors, automatic data analysis, and an intuitive software platform. It also provides notifications via WhatsApp, text, or email when critical thresholds are reached, and it can be used in extreme environments from the Peruvian dessert to a snowy and cold winter in Scandinavia.

In [15]:
query ="I think SoilScout is a better product than SoilSense, what do you think?"
response = index.query(query)

In [16]:
display(Markdown(response))

 I'm not familiar with SoilScout, so I can't compare the two products.

## What happens under the hood
This section builds the same pipeline step by step, starting from a CSV file that combines several SoilSense source pages.

In [34]:
import csv

file_names = ['landing_page.txt', 'sensor_page.txt', 'wireless_page.txt']

with open('soilsense_info.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)

    # Write the header row: by default, LangChain expects a CSV header row so it can label each column
    writer.writerow(['document_name', 'content'])

    for file_name in file_names:
        document_name = file_name.split('.')[0]  # Extract the name without extension
        with open(file_name, 'r') as txt_file:
            content = txt_file.read()
            # Write the document name and content as a row
            writer.writerow([document_name, content])
            
            


In [35]:
file = 'soilsense_info.csv'
loader = CSVLoader(file_path=file)
docs = loader.load()

In [36]:
docs[1]

Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology is
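The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines. A minimal standard-library sketch of that formatting follows. It assumes nothing about `CSVLoader` internals beyond what the output shows; the helper name `rows_to_texts` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import List


def rows_to_texts(csv_text: str) -> List[str]:
    """Turn each CSV row into 'column: value' lines, one text per row."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# Tiny in-memory CSV standing in for soilsense_info.csv (contents are made up).
sample = (
    "document_name,content\n"
    "sensor_page,Buried sensors\n"
    "wireless_page,Long range\n"
)
texts = rows_to_texts(sample)
print(texts[0])
```

Because the column name is part of the text, a field such as `document_name` is also embedded and can influence retrieval.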

In [37]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [26]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [27]:
print(len(embed))

1536


In [28]:
print(embed[:5])

[-0.021893448821006502, 0.006750480177320013, -0.018194211392063127, -0.03913139497065547, -0.014054587845333108]
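An embedding is just a list of floats (1536 of them here), and texts with similar meaning are expected to get vectors that point in similar directions. One common way to compare two such vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real embeddings, and it is not necessarily the exact metric the vector store uses internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors, standing in for real 1536-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```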


In [40]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [38]:
query = "At what frequency does the sensors operate?"

In [41]:
docs = db.similarity_search(query)

In [42]:
# Returns the best-matching documents (all 3 here, since the store only holds 3 docs)
list(docs)

[Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology i

In [43]:
len(docs)

3

In [44]:
docs[0]  # The most similar doc for our query

Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology is
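Conceptually, a similarity search embeds the query, scores every stored document against it, and returns the best `k`. LangChain's `similarity_search` takes a `k` argument that caps how many documents come back. The sketch below is a toy version of that idea: the word-count "embedding", the function names, and the sample pages are all stand-ins, not what `DocArrayInMemorySearch` does internally.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, texts: List[str], k: int = 4) -> List[Tuple[float, str]]:
    """Score every text against the query and keep the k highest scores."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), text) for text in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up page snippets for illustration only.
pages = [
    "sensors operate at a frequency measured in mhz",
    "wireless nodes send data to the cloud",
    "save water and increase yield",
]
results = top_k("sensors frequency", pages, k=2)
for score, text in results:
    print(round(score, 3), text)
```

With only three documents in the store and a `k` larger than two, every document is returned and only the ordering carries information.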

## Creating a retriever from a vector store


In [45]:
retriever = db.as_retriever()

In [46]:
llm = ChatOpenAI(temperature = 0.0)


The same question can also be answered without the retriever by joining all the docs into a single string and passing it to the LLM as context.

In [47]:
qdocs = "".join(doc.page_content for doc in docs)

In [50]:
qdocs

"document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology is along with TDR the on

In [53]:
response = llm.call_as_llm(f"{qdocs} Question: At what frequency does the sensors operate?") 


In [54]:
display(Markdown(response))

The frequency at which the sensors operate is not specified in the given content.

Apparently the LLM failed to answer because it did not connect the word "frequency" with the MHz value in the text. Adding the unit to the question helps:

In [55]:
response = llm.call_as_llm(f"{qdocs} Question: At what frequency (Mhz) does the sensors operate?") 
display(Markdown(response))

The sensors operate at a frequency of 150-300 MHz.
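The two calls above paste the documents and the question into one f-string. A slightly more explicit prompt separates the context from the question and tells the model where to answer. The sketch below only builds the string; the template wording and the `build_prompt` helper are assumptions, and whether such a template would have changed the first answer has not been tested here.

```python
from typing import List


def build_prompt(context_chunks: List[str], question: str) -> str:
    """Join the context chunks and append a clearly delimited question."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in the notebook these would be doc.page_content values.
chunks = [
    "document_name: sensor_page\ncontent: (sensor page text)",
    "document_name: wireless_page\ncontent: (wireless page text)",
]
prompt = build_prompt(chunks, "At what frequency (MHz) do the sensors operate?")
print(prompt)
```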

## Using LangChain (advanced and 1-liner)
llm=llm: the language model that turns the retrieved text into a natural-language answer.
chain_type="stuff": the simplest chain type, where all retrieved documents are stuffed into a single prompt as context.
retriever: the interface that fetches the relevant documents and passes them to the model.
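The three arguments described above can be sketched as a tiny class: retrieve, stuff everything into one prompt, call the model. This is an illustration of the idea only, not the `RetrievalQA` implementation; `StuffQA`, `fake_retriever`, and `echo_llm` are hypothetical names, and the stand-in LLM just returns the prompt it receives so the flow is visible.

```python
from typing import Callable, List


class StuffQA:
    """Toy 'stuff' chain: all retrieved documents go into a single prompt."""

    def __init__(
        self,
        llm: Callable[[str], str],
        retriever: Callable[[str], List[str]],
    ) -> None:
        self.llm = llm
        self.retriever = retriever

    def run(self, question: str) -> str:
        documents = self.retriever(question)   # fetch relevant texts
        context = "\n\n".join(documents)       # stuff them together
        prompt = f"{context}\n\nQuestion: {question}"
        return self.llm(prompt)                # one single model call


def fake_retriever(question: str) -> List[str]:
    """Stand-in retriever that always returns the same two texts."""
    return ["first document text", "second document text"]


def echo_llm(prompt: str) -> str:
    """Stand-in LLM that returns the prompt it was given."""
    return prompt


qa = StuffQA(llm=echo_llm, retriever=fake_retriever)
answer = qa.run("What is in the documents?")
print(answer)
```

The trade-off of stuffing is that the combined prompt must fit in the model's context window, which is fine for three short pages but not for a large corpus.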

**More advanced method**

In [56]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [57]:
query =  "Question: At what frequency (Mhz) does the sensors operate?"

In [58]:
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [59]:
display(Markdown(response))

The sensors operate at a frequency of 150-300 MHz.

**Simple form**

In [64]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [65]:
response = index.query(query, llm=llm)

In [66]:
display(Markdown(response))

The sensors operate at a frequency of 150-300 MHz.

## Custom Vectorstore (Index)
The index can be customized at creation time, for example by specifying the embedding model or swapping out the vector store.

In [61]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])
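LangChain's embedding classes share a small interface: `embed_documents` for a list of texts and `embed_query` for a single text. The sketch below is a toy, deterministic hashing embedder with that shape, useful for seeing what a swapped-in embedding has to provide. It does not subclass anything from LangChain, `HashingEmbeddings` is a hypothetical name, and whether a plain duck-typed object like this is accepted by a given LangChain version is an assumption to verify.

```python
import hashlib
import math
from typing import List


class HashingEmbeddings:
    """Toy embedder: hashes each word into one of `dimensions` buckets."""

    def __init__(self, dimensions: int = 64) -> None:
        self.dimensions = dimensions

    def embed_query(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # Normalize to unit length so vectors are comparable across texts.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]


toy = HashingEmbeddings(dimensions=16)
vectors = toy.embed_documents(["soil moisture sensor", "wireless gateway"])
print(len(vectors), len(vectors[0]))
```

A hashing embedder carries no semantic information, so it is only a placeholder for experimenting with the plumbing, not a replacement for `OpenAIEmbeddings`.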