# LangChain: Q&A over Documents

This notebook builds a question-answering tool over documents. An example might be a tool that lets you query a product catalog for items of interest.

In [None]:
#pip install --upgrade langchain

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the video.

In [2]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

In [4]:
from langchain.chains import RetrievalQA #A chain to do retrieval over documents
from langchain.chat_models import ChatOpenAI #LLM
from langchain.document_loaders import CSVLoader #Loads proprietary data (CSV)
from langchain.vectorstores import DocArrayInMemorySearch #In memory vector store requiring no database connections
from IPython.display import display, Markdown #Common utilities for displaying information in notebooks
from langchain.llms import OpenAI

In [5]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [6]:
from langchain.indexes import VectorstoreIndexCreator #Helps to easily create vectorStores

In [None]:
#pip install docarray

In [7]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [11]:
index

VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.docarray.in_memory.DocArrayInMemorySearch object at 0x7f96302f6ca0>)

In [15]:
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Note**:
- The notebook uses `langchain==0.0.179` and `openai==0.27.7`
- For these library versions, `VectorstoreIndexCreator` uses `text-davinci-003` as the base model, which has been deprecated since 1 January 2024.
- The replacement model, `gpt-3.5-turbo-instruct`, will be used instead for the `query`.
- The `response` format might differ from the video because of this replacement model.

In [16]:
llm_replacement_model = OpenAI(temperature=0, 
                               model='gpt-3.5-turbo-instruct')

response = index.query(query, 
                       llm = llm_replacement_model)

In [17]:
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, abrasion-resistant, fits over swimsuit | SPF 50+, blocks 98% of harmful UV rays |

## Step By Step

LLMs can only inspect a few thousand words at a time. Embeddings and vector stores help when documents are too large to fit in a single prompt.

### Embeddings

Embedding vectors capture context and meaning. Text with similar content will have similar embedding vectors. 
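To make "similar content, similar vectors" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their names are made up for illustration; real embeddings come from a model and have many more dimensions. Cosine similarity is one common way to compare embeddings, not necessarily the measure a given vector store uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" (assumption: purely illustrative values).
sun_shirt = [0.9, 0.1, 0.0]
uv_hoodie = [0.8, 0.2, 0.1]
dog_mat = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(sun_shirt, uv_hoodie)
sim_unrelated = cosine_similarity(sun_shirt, dog_mat)

# Vectors pointing in a similar direction score close to 1.
print(sim_related > sim_unrelated)
```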

### Vector databases

A vector database stores embedding vectors. We populate it with the embedding vector of each "chunk" of the original document, stored side by side with the chunk's text.

Chunks are useful because they let us pass only the relevant parts of a document to the LLM. Comparing the query's embedding vector with each chunk's embedding vector identifies the chunks most relevant to that query.
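The store-then-search idea can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: `toy_embed` uses word counts instead of a real embedding model, and `ToyVectorStore` is not the LangChain class, only an illustration of keeping each chunk next to its vector and ranking by similarity.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Keeps each chunk side by side with its (toy) embedding vector."""

    def __init__(self, chunks):
        self.entries = [(toy_embed(chunk), chunk) for chunk in chunks]

    def similarity_search(self, query, k=2):
        # Embed the query, then rank stored chunks by similarity to it.
        query_vector = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vector, entry[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]

chunks = [
    "sun shield shirt with upf 50+ sun protection",
    "waterproof storm pants for rain",
    "dog mat that protects floors from spills",
]
store = ToyVectorStore(chunks)
top = store.similarity_search("shirt with sun protection", k=1)
print(top)
```

Only the top-ranked chunks would then be passed to the LLM, rather than the whole document.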


In [19]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [20]:
docs = loader.load()

In [21]:
docs[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

In [24]:
import pandas as pd
df = pd.read_csv(file)
df

| Unnamed: 0.1 | Unnamed: 0 | name | description |
| --- | --- | --- | --- |
| 0 | 0 | Women's Campside Oxfords | This ultracomfortable lace-to-toe Oxford boast... |
| 1 | 1 | Recycled Waterhog Dog Mat, Chevron Weave | Protect your floors from spills and splashing ... |
| 2 | 2 | Infant and Toddler Girls' Coastal Chill Swimsu... | She'll love the bright colors, ruffles and exc... |
| 3 | 3 | Refresh Swimwear, V-Neck Tankini Contrasts | Whether you're going for a swim or heading out... |
| 4 | 4 | EcoFlex 3L Storm Pants | Our new TEK O2 technology makes our four-seaso... |
| ... | ... | ... | ... |
| 995 | 995 | Men's Classic Denim, Standard Fit | Crafted from premium denim that will last wash... |
| 996 | 996 | CozyPrint Sweater Fleece Pullover | The ultimate sweater fleece - made from superi... |
| 997 | 997 | Women's NRS Endurance Spray Paddling Pants | These comfortable and affordable splash paddli... |
| 998 | 998 | Women's Stop Flies Hoodie | This great-looking hoodie uses No Fly Zone Tec... |


In [25]:
from langchain.embeddings import OpenAIEmbeddings #OpenAI's embedding class
embeddings = OpenAIEmbeddings()

In [26]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [27]:
print(len(embed))

1536


In [28]:
print(embed[:5])

[-0.02199048176407814, 0.006746508646756411, -0.018174780532717705, -0.03918623551726341, -0.01404528971761465]


In [29]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
) # Creates a vector store from a list of documents and an embedding object

In [30]:
query = "Please suggest a shirt with sunblocking"

In [31]:
docs = db.similarity_search(query)

In [32]:
len(docs)

4

In [33]:
docs[0]

Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})

In [46]:
retriever = db.as_retriever() 
# A retriever is a generic interface, that can be underpinned
# by any method that takes in a query and returns documents. 
# Vector stores and embeddings are just one way to do this. 
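To illustrate that a retriever need not involve embeddings at all, here is a hedged sketch of a keyword-based one. `Doc` and `KeywordRetriever` are hypothetical classes written for this example and do not import LangChain; the method name `get_relevant_documents` is chosen to mirror the retriever method in this LangChain version, which is an assumption of the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class KeywordRetriever:
    """Hypothetical retriever: returns documents sharing a word with the query."""

    def __init__(self, docs):
        self.docs = docs

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        return [
            doc
            for doc in self.docs
            if query_words & set(doc.page_content.lower().split())
        ]

catalog = [
    Doc("sun shield shirt with upf 50+ rating", {"row": 0}),
    Doc("waterproof storm pants", {"row": 1}),
]
retriever_sketch = KeywordRetriever(catalog)
hits = retriever_sketch.get_relevant_documents("shirt for sun")
print([doc.metadata["row"] for doc in hits])
```

Anything with this shape (query in, documents out) could sit behind a question-answering chain; the vector store is simply one implementation.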

In [34]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

In [35]:
qdocs = "".join(doc.page_content for doc in docs)
# This combines the documents into a single piece of text

In [43]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection. Provide them in a markdown table with \
the shirt name in one column and a brief (20 word) description in a second column. ") 


In [44]:
display(Markdown(response))

| Shirt Name                           | Description                                                                 |
|--------------------------------------|-----------------------------------------------------------------------------|
| Sun Shield Shirt                     | High-performance sun shirt with UPF 50+ protection, moisture-wicking fabric.  |
| Men's Plaid Tropic Shirt             | Lightweight shirt with UPF 50+ coverage, wrinkle-free, and quick-drying fabric. |
| Men's TropicVibe Shirt                | Men’s sun-protection shirt with UPF 50+, wrinkle-resistant, and venting features. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ protection, wrinkle-resistant fabric. |

In [48]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", # Stuff is the simplest chain type - it "stuffs" all documents into context
    retriever=retriever, # interface for fetching documents to pass to the llm
    verbose=True
)
# The RetrievalQA chain does the above steps in one go. It performs
# retrieval and then does Q&A over the docs 

In [50]:
query =  "Please list all your shirts with sun protection. Provide them in a markdown table with the shirt name in one column and a brief (20 word) description in a second column."

In [51]:
response = qa_stuff.run(query)

> Entering new RetrievalQA chain...

> Finished chain.


In [52]:
display(Markdown(response))

| Shirt Name                                  | Description                                                                 |
|--------------------------------------------|-----------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt    | UPF 50+ rated, wrinkle-resistant, front and back venting, two front pockets.     |
| Men's Plaid Tropic Shirt, Short-Sleeve     | UPF 50+ rated, wrinkle-free, quick-drying, front and back venting, two front pockets. |
| Men's TropicVibe Shirt, Short-Sleeve       | UPF 50+ rated, wrinkle-resistant, front and back venting, two front pockets.         |
| Sun Shield Shirt                           | UPF 50+ rated, moisture-wicking, abrasion-resistant, fits comfortably over swimsuit. |

In [53]:
response = index.query(query, llm=llm)

In [54]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

Reminder: Download your notebook to your local computer to save your work.

## Stuff method

The stuff method is the simplest: you stuff all the data into the prompt as context to pass to the LLM.

**Pros** - Makes a single call to the LLM. The LLM has access to all the data at once.
**Cons** - Large documents or many documents may not fit in the LLM's context window.
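As a rough sketch of what "stuffing" means, the code below joins every chunk into one prompt and makes a single call. `fake_llm` is a hypothetical stand-in for a real model call, and the prompt wording is an assumption, not the template LangChain uses.

```python
def build_stuff_prompt(chunks, question):
    """Join every chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model call; records the prompt it receives."""
    calls.append(prompt)
    return f"answer based on {len(prompt)} characters of prompt"

chunks = ["Sun Shield Shirt: UPF 50+.", "Storm Pants: waterproof."]
prompt = build_stuff_prompt(chunks, "Which items offer sun protection?")
answer = fake_llm(prompt)

# One call in total, no matter how many chunks were stuffed in.
print(len(calls))
```

The prompt grows with every chunk added, which is where the context-window limit comes in.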

## Map_reduce method

Passes each chunk, along with the question, to an LLM in its own call, collects the responses, and then uses one more LLM call to summarise the individual responses into a final answer.

**Pros** - Can operate over any number of documents. Can process the individual questions in parallel. Useful for long documents.
**Cons** - Makes many LLM calls. Treats documents as independent, which may not always be desired.
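The map and reduce steps can be sketched as follows. `fake_llm` is a hypothetical stand-in that just numbers its replies, and the prompt strings are assumptions; the point is the call pattern, one call per chunk plus one combining call.

```python
calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model call; replies are numbered in call order."""
    calls.append(prompt)
    return f"reply {len(calls)}"

def map_reduce_answer(llm, chunks, question):
    # Map: one independent call per chunk (these could run in parallel).
    partials = [llm(f"Context: {chunk}\nQuestion: {question}") for chunk in chunks]
    # Reduce: one more call that merges the partial answers.
    summary_prompt = (
        "Combine these partial answers:\n"
        + "\n".join(partials)
        + f"\nQuestion: {question}"
    )
    return llm(summary_prompt)

chunks = ["chunk A", "chunk B", "chunk C"]
final = map_reduce_answer(fake_llm, chunks, "Which shirts block the sun?")

# Three map calls plus one reduce call.
print(len(calls))
```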

## Refine method

Loops over many documents iteratively, building upon the answer from a previous doc. 

**Pros** - Interesting answers. 
**Cons** - Slower, because the calls are not independent and must run in sequence. Can lead to longer answers.
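A minimal sketch of the iterative loop, again with a hypothetical `fake_llm` and made-up prompt wording. Each call after the first receives the previous answer, which is why the calls cannot run in parallel.

```python
calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model call; replies are numbered in call order."""
    calls.append(prompt)
    return f"reply {len(calls)}"

def refine_answer(llm, chunks, question):
    # Start with an answer based on the first chunk only.
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}")
    # Each later chunk refines the answer produced so far.
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            f"Refine the existing answer to: {question}"
        )
    return answer

chunks = ["chunk A", "chunk B", "chunk C"]
final = refine_answer(fake_llm, chunks, "Which shirts block the sun?")

# One call per chunk, each depending on the one before it.
print(len(calls))
```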

## Map_rerank method

More experimental. Makes a single call to the LLM for each document, asking it to return a score along with its response. You then select the responses with the highest scores.

**Pros** - Relatively fast. 
**Cons** - Relies on the LLM to know what the score should be. You often have to instruct it that a high score indicates relevance to the document, and refine those instructions carefully.
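The score-then-select loop (the approach LangChain names `map_rerank`) might look roughly like this. `fake_llm` is a hypothetical stand-in whose scoring rule is invented for the example, and the `Score:` line format is an assumption about how the model is asked to reply.

```python
def fake_llm(prompt):
    """Hypothetical model: scores high only if the context line mentions 'sun'."""
    context_line = prompt.split("\n")[0]
    score = 90 if "sun" in context_line.lower() else 10
    return f"Based on: {context_line}\nScore: {score}"

def map_rerank_answer(llm, chunks, question):
    scored = []
    for chunk in chunks:
        # One call per chunk; the reply must end with a 'Score:' line.
        reply = llm(
            f"Context: {chunk}\nQuestion: {question}\nEnd with 'Score: <0-100>'."
        )
        answer, _, score_text = reply.rpartition("Score:")
        scored.append((int(score_text.strip()), answer.strip()))
    # Keep the answer whose self-reported score is highest.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score

chunks = ["Storm Pants, waterproof", "Sun Shield Shirt, UPF 50+"]
best_answer, best_score = map_rerank_answer(
    fake_llm, chunks, "Which item offers UV protection?"
)
print(best_score)
```

Because the ranking depends entirely on the scores the model reports, the scoring instructions in the prompt carry most of the weight.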