## LangChain: Q&A over Documents
This notebook builds a question-answering tool over documents. The running example is a tool that lets you query a product catalog for items of interest.

In [None]:
#pip install --upgrade langchain

In [116]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True) # read local .env file

In [55]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
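
The date check above can be easier to reason about when wrapped in a small function. The following is a minimal sketch, not part of the original notebook: `pick_model` is a hypothetical helper name, and the cutoff date and model names are copied from the cell above. It makes the boundary behaviour explicit: because the comparison is a strict `>`, the cutoff day itself still selects the pinned snapshot.

```python
import datetime

# Hypothetical helper; cutoff date and model names are taken from the cell above.
def pick_model(today, cutoff=datetime.date(2024, 6, 12)):
    """Return the model name to use on the given date."""
    # Strictly after the cutoff: use the unpinned model name.
    if today > cutoff:
        return "gpt-3.5-turbo"
    # On or before the cutoff: keep the pinned snapshot.
    return "gpt-3.5-turbo-0301"

on_cutoff = pick_model(datetime.date(2024, 6, 12))
after_cutoff = pick_model(datetime.date(2024, 6, 13))
print(on_cutoff, after_cutoff)
```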

In [91]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

In [57]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [84]:
a = loader.load()
a = dict(a[0])
a['metadata']

{'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}

In [61]:
from langchain.indexes import VectorstoreIndexCreator

In [101]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [100]:
# Note: `query` is defined in cell In [63] further down, which was run before this cell
index.query(query, llm=ChatOpenAI(model=llm_model, temperature=0.0))

ValidationError: 2 validation errors for DocArrayDoc
text
  Field required [type=missing, input_value={'embedding': [0.00327484..., -0.02110229369594648]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing
metadata
  Field required [type=missing, input_value={'embedding': [0.00327484..., -0.02110229369594648]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing

### DocArrayInMemorySearch raised the ValidationError shown above, so we switch to FAISS

In [105]:
# Create FAISS-based index
index = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=OpenAIEmbeddings()
).from_loaders([loader])

In [63]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [65]:
index

VectorStoreIndexWrapper(vectorstore=<langchain_community.vectorstores.docarray.in_memory.DocArrayInMemorySearch object at 0x1656f8680>)

In [106]:
# Replace the default model
llm_replacement_model = OpenAI(temperature=0, model='gpt-3.5-turbo-instruct')

response = index.query(query, llm = llm_replacement_model)

In [107]:
#print(response)
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rated, wrinkle-resistant, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rated, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rated, wrinkle-resistant, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rated, moisture-wicking, fits comfortably over swimsuit, abrasion-resistant, imported | SPF

## Now let's do it step by step

In [108]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [109]:
docs = loader.load()
docs[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})
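
The `page_content` above is one CSV row flattened into `column: value` lines, with the row number kept in `metadata`. The sketch below imitates that shape using only the standard library. It is an illustration, not `CSVLoader`'s actual implementation: `rows_to_documents` and the two-row `sample_csv` are made up for this example, and plain dicts stand in for `Document` objects. The first column has an empty header, which is why the real output starts with `: 0`.

```python
import csv
import io

# Made-up stand-in for the catalog file; the first column has an empty header.
sample_csv = (
    ",name,description\n"
    "0,Sun Hat,Wide brim hat\n"
    "1,Rain Shell,Light waterproof jacket\n"
)

def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with page_content and metadata (hypothetical helper)."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined with newlines.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": page_content,
             "metadata": {"source": source, "row": row_number}}
        )
    return documents

docs_sketch = rows_to_documents(sample_csv)
print(docs_sketch[0]["page_content"])
print(docs_sketch[0]["metadata"])
```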

In [110]:
# Now do the embeddings
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [111]:
embed = embeddings.embed_query("Hi my name is Lalit")

In [146]:
print(len(embed))

1536


In [147]:
print(embed[:5])

[-0.00040702112951074743, 0.01678821693226887, -0.019203048913754805, -0.01338433151562012, -0.02303081277749172]


### Understand how the embedded vectors are similar or dissimilar using cosine similarity between two vectors

In [145]:
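# Alias the client so it does not shadow the langchain.llms.OpenAI class imported earlier
from openai import OpenAI as OpenAIClient
import os

client = OpenAIClient(api_key=os.getenv("OPENAI_API_KEY"))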

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=[
        "sun protection shirt",
        "UV resistant long sleeve"
    ]
)

vector_1 = response.data[0].embedding
vector_2 = response.data[1].embedding

# Optional: cosine similarity - Let's see how similar these two vectors are
import numpy as np

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print("Cosine similarity:", cosine_similarity(vector_1, vector_2))

Cosine similarity: 0.8993436396190205
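
To get a feel for the scale of that number, it may help to run the same formula on tiny hand-made vectors that need no API call. This is a minimal sketch with invented inputs. `cos_sim` is a hypothetical name chosen so that it does not overwrite the function defined above.

```python
import numpy as np

def cos_sim(v1, v2):
    """Cosine of the angle between two vectors (same formula as above)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

same_direction = cos_sim([1, 2, 3], [2, 4, 6])   # parallel vectors: close to 1
orthogonal = cos_sim([1, 0], [0, 1])             # perpendicular vectors: 0
opposite = cos_sim([1, 2], [-1, -2])             # opposite directions: close to -1

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Only the direction of the vectors matters, not their length, so scaling either input leaves the score unchanged.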


In [153]:
# Create embeddings for all the documents loaded from the CSV file and store them in FAISS
db = FAISS.from_documents(
    docs, 
    embeddings
)

In [160]:
# Retrieve similar documents based on a query from the FAISS vector store of embedded documents
query = "What is the best shirt for sun protection?"
docs = db.similarity_search(query, k=5)
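
Conceptually, a similarity search embeds the query, scores it against every stored vector, and returns the `k` best matches. The sketch below shows that idea as a brute-force scan over invented 2-D vectors. It is an assumption-laden illustration: `top_k_by_cosine` is a hypothetical helper, and FAISS uses its own index structures and may use a different distance metric depending on how the store is configured.

```python
import numpy as np

def top_k_by_cosine(query_vector, document_vectors, k=2):
    """Return (index, score) pairs for the k stored vectors most similar to the query."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.asarray(document_vectors, dtype=float)
    # Cosine similarity of the query against every row of the matrix at once.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Sort by descending score and keep the first k indices.
    ranked = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in ranked]

# Invented toy "embeddings": rows 0 and 1 point roughly the same way, row 2 does not.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hits = top_k_by_cosine([1.0, 0.0], toy_vectors, k=2)
print(hits)
```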

In [162]:
#list(docs)
docs

[Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}),
 Document(page_content=": 374\nname: Men's Plaid Tropic Shirt, Short-Sleeve\ndescription: Our Ultracomfortable sun protection is rated to UPF

## So how do we use this for question answering overall?

In [163]:
# First we need to create a retriever from the vector store
retriever = db.as_retriever()

In [165]:
# Next, since we want to generate a natural-language response, we create a chat language model
llm = ChatOpenAI(temperature=0.0)

In [167]:
# Next we combine the contents of all retrieved documents into a single piece of text
qdocs = "".join(doc.page_content for doc in docs)

In [194]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection and summarize each one.") 


In [196]:
display(Markdown(response))

1. Sun Shield Shirt: A high-performance sun shirt with UPF 50+ sun protection, moisture-wicking fabric, and abrasion resistance. Fits comfortably over swimsuits and blocks 98% of harmful UV rays.

2. Men's Plaid Tropic Shirt: A lightweight, wrinkle-free shirt with UPF 50+ sun protection, front and back venting, and two front pockets. Blocks 98% of UV rays and is machine washable.

3. Men's TropicVibe Shirt: A relaxed fit shirt with UPF 50+ sun protection, wrinkle resistance, front and back venting, and two front pockets. Blocks 98% of UV rays and is machine washable.

4. Men's Tropical Plaid Short-Sleeve Shirt: A traditional fit shirt with UPF 50+ sun protection, wrinkle resistance, front and back venting, and two front pockets. Blocks 98% of UV rays and is wrinkle-resistant.

5. Girls' Ocean Breeze Long-Sleeve Stripe Shirt: A long-sleeve rash guard with UPF 50+ sun protection, quick-drying fabric, fade resistance, and seawater resistance. Blocks 98% of UV rays and coordinates with swimsuits.

In [197]:
# With RetrievalQA, the retrieval, document combining, and LLM call above are chained into a single step
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
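
The `"stuff"` chain type does roughly what the manual cells above did: it places the text of every retrieved document into one prompt, followed by the question. The sketch below illustrates that assembly with plain strings. `build_stuff_prompt` is a hypothetical helper and the instruction wording is invented for this example, so the prompt LangChain actually sends will differ.

```python
def build_stuff_prompt(page_contents, question):
    """Concatenate all retrieved texts and the question into one prompt string."""
    # Separate documents with a blank line so their boundaries stay visible.
    context = "\n\n".join(page_contents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Invented stand-ins for the page_content of retrieved documents.
retrieved_texts = [
    "name: Sun Hat\ndescription: Wide brim hat",
    "name: Rain Shell\ndescription: Light waterproof jacket",
]
prompt = build_stuff_prompt(retrieved_texts, "Which item protects against sun?")
print(prompt)
```

Because every document goes into a single prompt, this approach only works while the combined text fits within the model's context window.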

In [199]:
query = "Question: Please list all your \
shirts with sun protection and summarize each one."
response = qa_stuff.run(query)
display(Markdown(response))



> Entering new RetrievalQA chain...

> Finished chain.


1. **Men's Tropical Plaid Short-Sleeve Shirt**
   - Description: Rated UPF 50+ for superior sun protection, made of 100% polyester, wrinkle-resistant, with front and back cape venting, and two front bellows pockets. Imported.
   - Sun Protection: SPF 50+, blocks 98% of harmful UV rays.

2. **Men's TropicVibe Shirt, Short-Sleeve**
   - Description: UPF 50+ sun-protection shirt with a lightweight feel, traditional fit, wrinkle-resistant, front and back cape venting, and two front bellows pockets. Imported.
   - Sun Protection: SPF 50+, blocks 98% of harmful UV rays.

3. **Men's Plaid Tropic Shirt, Short-Sleeve**
   - Description: UPF 50+ sun protection, designed for fishing, wrinkle-free, quick-drying, front and back cape venting, two front bellows pockets. Imported.
   - Sun Protection: UPF 50+, blocks 98% of harmful UV rays.

4. **Girls' Ocean Breeze Long-Sleeve Stripe Shirt**
   - Description: Long-sleeve sun-protection rash guard with full coverage, made of Nylon Lycra®-elastane blend, UPF 50+ rated, quick-drying, fade-resistant. Recommended by The Skin Cancer Foundation.
   - Sun Protection: SPF 50+, blocks 98% of harmful UV rays. Recommended by The Skin Cancer Foundation.