# LangChain: Q&A over Documents

An example would be a tool that lets you query a product catalog for items of interest.

In [None]:
#pip install --upgrade langchain

In [1]:
import os
import openai

#from dotenv import load_dotenv, find_dotenv
#_ = load_dotenv(find_dotenv()) # read local .env file

#openai.api_key = os.environ['OPENAI_API_KEY']
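
The commented-out lines above read the API key from a local .env file. As a minimal sketch of the same idea using only the standard library, the key can be looked up in the environment and checked before any API call is made. The helper name `get_api_key` is hypothetical and not part of openai or LangChain.

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or None if it is unset or blank."""
    # `env` defaults to the real process environment; a dict can be passed for testing
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or None


if get_api_key() is None:
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```

Checking for the key up front gives a clearer message than the `KeyError` that `os.environ['OPENAI_API_KEY']` would raise when the variable is missing.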

## Search example using VectorStoreIndex

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [3]:
#Import loader for txt file
from langchain.document_loaders import TextLoader

In [4]:
file = 'landing_page.txt'
loader = TextLoader(file_path=file)
document = loader.load()

In [5]:
from langchain.indexes import VectorstoreIndexCreator

In [6]:
#pip install docarray

In [8]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [9]:
query = "Please list the 4 primary benefits of using SoilSense"

In [10]:
response = index.query(query)

In [12]:
display(Markdown(response))

 1. Save water 2. Increase yield 3. Reduce risk 4. Save time

In [13]:
query = "What makes SoilSense different from competing products?"
response = index.query(query)

In [14]:
display(Markdown(response))

 SoilSense offers a reliable and easy-to-use soil sensor system with high accuracy TDT sensors, automatic data analysis, and an intuitive software platform. It also provides notifications via WhatsApp, text, or email when critical thresholds are reached, and it can be used in extreme environments from the Peruvian dessert to a snowy and cold winter in Scandinavia.

In [67]:
query = "I think SoilScout is a better product than SoilSense, what do you think?"
response = index.query(query)

In [68]:
display(Markdown(response))

 I'm sorry, I don't know enough about SoilScout to make a comparison.

# Building a Q&A system "from scratch"
### Loading the text

Here we combine several SoilSense sources (txt files) into a single CSV file containing 3 separate documents.

In [106]:
# Combine the txt files into one CSV; this only needs to run once
import csv

file_names = ['landing_page.txt', 'sensor_page.txt', 'wireless_page.txt']

with open('soilsense_info.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)

    # Write the header row; by default LangChain expects a CSV to have a header row to index it correctly
    writer.writerow(['document_name', 'content'])

    for file_name in file_names:
        document_name = file_name.split('.')[0]  # Extract the name without extension
        with open(file_name, 'r') as txt_file:
            content = txt_file.read()
            # Write the document name and content as a row
            writer.writerow([document_name, content])

In [107]:
#CSV containing 3 separate docs (pages from website)
file = 'soilsense_info.csv'
loader = CSVLoader(file_path=file)
docs = loader.load()

In [72]:
docs[1]

Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology is
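
The page_content above starts with `document_name: sensor_page\ncontent: ...`, which suggests that each CSV row becomes one document with a `column: value` line per field. The sketch below imitates that formatting with the standard library only. It is an approximation of what CSVLoader appears to do, not its implementation, and the two sample rows are made up.

```python
import csv
import io

# Hypothetical two-row CSV with the same header as soilsense_info.csv
csv_text = "document_name,content\nlanding_page,Save water\nsensor_page,TDT sensors\n"


def rows_to_page_content(handle):
    """Render each CSV row as 'column: value' lines, one string per row."""
    reader = csv.DictReader(handle)  # the header row supplies the column names
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


pages = rows_to_page_content(io.StringIO(csv_text))
print(pages[1])
```

This also shows why the header row matters: without it, the first data row would be consumed as column names.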

### Creating embeddings
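Using OpenAI embeddings (OpenAIEmbeddings). The 1536-dimensional vector printed below matches text-embedding-ada-002, which accepts up to 8191 input tokens.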

In [73]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [74]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [75]:
print(len(embed))

1536


In [76]:
print(embed[:5])

[-0.02191396093207838, 0.006774206755842608, -0.018190348816400977, -0.03914824936810449, -0.014089343366938916]
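
An embedding is just a list of floats, and two texts are compared by comparing their vectors. As a rough sketch of the cosine similarity used for this kind of comparison, here is a pure-Python version on made-up 3-dimensional vectors (real embeddings have 1536 entries, as shown above). The function and variable names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: v_close points in the same direction as v_query, v_far is orthogonal to it
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]
v_far = [0.0, 1.0, 0.0]

print(cosine_similarity(v_query, v_close))
print(cosine_similarity(v_query, v_far))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.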


Create the database from the documents and define which embedding to use.

In [77]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [78]:
query = "At what frequency does the sensors operate?"

Apply the similarity search on the db to return the k documents most similar to the query.

In [87]:
#k is the number of nearest neighbors to return (default is 4)
docs = db.similarity_search(query, k=2)
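
To make the idea of a k-nearest-neighbor lookup concrete, here is a small self-contained sketch. It assumes a toy "embedding" based on word counts instead of a real model, and the three sample pages are invented, so it only illustrates the ranking step and not how DocArrayInMemorySearch is implemented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, texts, k=4):
    """Return the k texts most similar to the query, best match first."""
    q = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, toy_embed(t)), reverse=True)
    return ranked[:k]


pages = [
    "sensors operate at a high frequency",
    "wireless dataloggers send data to the cloud",
    "save water and increase yield",
]
print(top_k("at what frequency do the sensors operate", pages, k=2))
print(len(top_k("frequency", pages, k=4)))
```

Asking for more neighbors than there are documents simply returns all of them, which is why a search with the default k of 4 can return only 3 results on a 3-document store.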

In [88]:
#Returns the k best-matching documents (we only have 3 docs in total)
list(docs)

[Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology i

In [89]:
#There is also an option to return a score with each document (per the LangChain docstring, lower is better)
docs = db.similarity_search_with_score(query)

In [90]:
#Returns the best-matching documents with their scores (we only have 3 docs in total)
list(docs)

[(Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology 

In [91]:
len(docs)

3

In [92]:
docs[0]  #The most similar doc based on our query

(Document(page_content="document_name: sensor_page\ncontent: Scientifically-validated sensors\nAn irrigation support system is only as good as the measurements it depends on.\nAccurate, reliable sensors are a must.\n\nBad data is worse than no data\nThat's why we set the industry standard with the best, scientifically-validated sensor technology\n\nBuried sensors\nMeasure the right thing\nWe use buried soil sensors because they provide the most realistic measurement of actual soil conditions.\nOur buried sensors provide the most correct measurement of the soil because they do not disturb water flow after installation. \nDrill-and-drop probes might be easier to install, but if you can not trust the measurement, what good are they? Sensors of the probe design provoke what is called preferential flow where water from rain and irrigation is more prone to travel along the surface of the stick, leading to misleadingly high measurements.\n\nTDT Technology\nMeasure accurately\nTDT technology i

### Splitting up text to smaller chunks of data (docs)
The above query works with relatively large chunks of text, which increases the cost of each call to the LLM API because more tokens are sent as context. Alternatively, one can use the CharacterTextSplitter to split the entire SoilSense document into smaller parts.

In [94]:
#import the langchain character splitter
from langchain.text_splitter import CharacterTextSplitter

In [111]:
#txt file that contains the 3 pages combined
file_txt = 'soilsense_info.txt'
loader_txt = TextLoader(file_path=file_txt)
document_txt = loader_txt.load()

#Split the text into small docs (target size of 300 characters each)
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=40)

#By default the CharacterTextSplitter only splits on its separator (a blank line, "\n\n"), so it does not cut in the middle of a paragraph;
#hence it might produce chunks that are larger than the specified chunk_size
docs_split = text_splitter.split_documents(document_txt)

db_split = DocArrayInMemorySearch.from_documents(docs_split, embeddings)

Created a chunk of size 626, which is longer than the specified 300
Created a chunk of size 589, which is longer than the specified 300
Created a chunk of size 719, which is longer than the specified 300
Created a chunk of size 814, which is longer than the specified 300
Created a chunk of size 335, which is longer than the specified 300
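
The warnings above appear because some paragraphs are longer than 300 characters and contain no separator to split on. The sketch below is a simplified, hypothetical splitter that reproduces this behavior. It ignores chunk_overlap and is not LangChain's implementation; `split_on_separator` and the sample text are made up for illustration.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole, since there is
    no separator inside it to split on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or this is the first piece of a new chunk
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short one\n\n" + "x" * 50 + "\n\nshort two\n\nshort three"
chunks = split_on_separator(sample, chunk_size=25)
print([len(chunk) for chunk in chunks])
```

The 50-character paragraph ends up as its own oversized chunk, while the two short paragraphs at the end are merged because together they still fit within chunk_size.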


In [112]:
len(docs_split)

29

We now have 29 separate docs, instead of the original 3.

In [119]:
docs_split = db_split.similarity_search(query, k=5, return_documents=True)

In [120]:
docs_split

[Document(page_content='Approved by science, Trust your measurements.\nWe only claim what research supports. At your request, we are happy to provide you with scientific validation of our claims.\nIf you are considering an alternative solution, we will gladly help you compare it to ours based on scientific literature\nUse the TDT technologyWhy is TDT a superior sensing technology? One of the reasons is measurement frequency.\nSoil moisture sensors estimate water indirectly from the permittivity of the soil. The problem is that at the low operating frequency of capacitive sensors, the permittivity is not linear in all soil types. As a consequence you will experience moisture variations not because of water changes, but because of soil variations.\nOur sensors perform accurately across any soil type since they operate at 150-300 MHz.', metadata={'source': 'soilsense_info.txt'}),
 Document(page_content='Minimal maintainence\nOur sensors can be left in the soil for 10 years, and the replac

From manually inspecting the data, it can be seen that the word "frequency" only occurs in the document with the highest match, and that the information is only available in this paragraph. In this case, splitting the docs into smaller chunks therefore gives an equally good answer and is more efficient.

Afterwards, we can have the LLM answer the query from the returned documents.
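
The manual inspection can also be done with a few lines of code. This is a minimal sketch, assuming the chunks are available as plain strings; the helper `chunks_containing` is hypothetical and the sample chunks are shortened stand-ins for the real SoilSense text.

```python
def chunks_containing(chunks, keyword):
    """Return the indices of the chunks that mention keyword, ignoring case."""
    needle = keyword.lower()
    return [i for i, chunk in enumerate(chunks) if needle in chunk.lower()]


# Shortened, made-up stand-ins for the split documents
sample_chunks = [
    "One of the reasons is measurement frequency. Our sensors operate at 150-300 MHz.",
    "Our sensors can be left in the soil for 10 years.",
    "Reliable infrastructure. Easy to set up.",
]
print(chunks_containing(sample_chunks, "Frequency"))
```

With the real data, the same check could be run over `[doc.page_content for doc in docs_split]`.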

### Coupling with the LLM for the QnA task

Now we can combine the returned docs, and have the LLM search for the answer

In [121]:
qdocs = "".join(doc.page_content for doc in docs_split)

In [122]:
qdocs

'Approved by science, Trust your measurements.\nWe only claim what research supports. At your request, we are happy to provide you with scientific validation of our claims.\nIf you are considering an alternative solution, we will gladly help you compare it to ours based on scientific literature\nUse the TDT technologyWhy is TDT a superior sensing technology? One of the reasons is measurement frequency.\nSoil moisture sensors estimate water indirectly from the permittivity of the soil. The problem is that at the low operating frequency of capacitive sensors, the permittivity is not linear in all soil types. As a consequence you will experience moisture variations not because of water changes, but because of soil variations.\nOur sensors perform accurately across any soil type since they operate at 150-300 MHz.Minimal maintainence\nOur sensors can be left in the soil for 10 years, and the replaceable AA batteries in the datalogger lasts for 2 years.Reliable infrastructure. Easy to set up

In [123]:
#The llm has to be defined before it can be called
llm = ChatOpenAI(temperature=0.0)
response = llm.call_as_llm(f"{qdocs} Question: At what frequency does the sensors operate?")


In [124]:
display(Markdown(response))

The sensors operate at a frequency of 150-300 MHz.

The first time I ran the code and followed the tutorial, qdocs simply contained all the text, and the LLM failed to identify the correct answer unless it was queried more specifically. Splitting the docs into smaller sections improved performance in this case, because the context fed to the LLM was smaller but more relevant.

That first time, the LLM failed to answer because it did not connect frequency with MHz, and the more specific query below was necessary.

In [125]:
response = llm.call_as_llm(f"{qdocs} Question: At what frequency (Mhz) does the sensors operate?") 
display(Markdown(response))

The sensors operate at a frequency range of 150-300 MHz.

# Alternatively: Using retriever chain
- llm=llm: the language model that generates the natural-language response
- chain_type="stuff": the simplest chain type, where all the retrieved documents are stuffed into the context
- retriever: the interface for fetching documents and passing them to the model

**Using the retriever**
RetrievalQA uses load_qa_chain under the hood: it retrieves the most relevant chunks of text and feeds these to the LLM.
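
To illustrate what "stuffing" means, the sketch below joins the retrieved chunks into a single prompt ahead of the question. The prompt wording and the helper name `build_stuff_prompt` are assumptions made for this example; LangChain uses its own prompt template.

```python
def build_stuff_prompt(chunks, question):
    """Put every retrieved chunk into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunks standing in for the retriever output
retrieved = [
    "Our sensors operate at 150-300 MHz.",
    "Our sensors can be left in the soil for 10 years.",
]
prompt = build_stuff_prompt(retrieved, "At what frequency does the sensors operate?")
print(prompt)
```

Because every chunk is sent in a single call, this approach is simple but only works while the combined chunks fit within the model's context window, which is one reason to keep k and the chunk size small.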

In [165]:
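db = DocArrayInMemorySearch.from_documents(
    #Re-split the text here, since docs_split was overwritten with the search results above
    text_splitter.split_documents(document_txt),
    embeddings
)

llm = ChatOpenAI(temperature=0.0)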

In [172]:
#Have the retriever find the 3 most similar docs
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [173]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
    verbose=True
)

In [174]:
query = "Question: At what frequency does the sensors operate?"

In [175]:
response = qa_stuff({"query": query})



> Entering new RetrievalQA chain...

> Finished chain.


In [176]:
response

{'query': 'Question: At what frequency does the sensors operate?',
 'result': 'The sensors operate at a frequency range of 150-300 MHz.',
 'source_documents': [Document(page_content='Approved by science, Trust your measurements.\nWe only claim what research supports. At your request, we are happy to provide you with scientific validation of our claims.\nIf you are considering an alternative solution, we will gladly help you compare it to ours based on scientific literature\nUse the TDT technologyWhy is TDT a superior sensing technology? One of the reasons is measurement frequency.\nSoil moisture sensors estimate water indirectly from the permittivity of the soil. The problem is that at the low operating frequency of capacitive sensors, the permittivity is not linear in all soil types. As a consequence you will experience moisture variations not because of water changes, but because of soil variations.\nOur sensors perform accurately across any soil type since they operate at 150-300 MH

In [177]:
display(Markdown(response['result']))

The sensors operate at a frequency range of 150-300 MHz.

# Or, VectorstoreIndexCreator - high level API
Basically a wrapper around the above steps that provides an even simpler interface

In [141]:
file = 'soilsense_info.csv'
loader = CSVLoader(file_path=file)
docs = loader.load()

In [142]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [143]:
response = index.query(query, llm=llm)

In [144]:
display(Markdown(response))

The sensors operate at a frequency of 150-300 MHz.

## Tweak VectorstoreIndexCreator
It is possible to customize the index when creating it, for example by specifying the embedding, swapping out the vectorstore, or splitting the documents in the process

In [178]:
file_txt = 'soilsense_info.txt'
loader_txt = TextLoader(file_path=file_txt)

In [179]:
index = VectorstoreIndexCreator(
    #Split the documents into chunks
    text_splitter=CharacterTextSplitter(chunk_size=300, chunk_overlap=40),
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=OpenAIEmbeddings(),
).from_loaders([loader_txt])

Created a chunk of size 626, which is longer than the specified 300
Created a chunk of size 589, which is longer than the specified 300
Created a chunk of size 719, which is longer than the specified 300
Created a chunk of size 814, which is longer than the specified 300
Created a chunk of size 335, which is longer than the specified 300


In [180]:
query = "Question: At what frequency does the sensors operate?"
index.query_with_sources(llm=llm, question=query, chain_type="stuff")

{'question': 'Question: At what frequency does the sensors operate?',
 'answer': 'The sensors operate at a frequency of 150-300 MHz.\n',
 'sources': 'soilsense_info.txt'}