# QA over unstructured data

Unstructured data: e.g., txt, pdf

Unstructured data can be loaded from many sources.

Use the LangChain integration hub to browse the full set of loaders.

Each loader returns data as a LangChain Document.

**Documents are turned into a Chat or QA app following the general steps below:**

1. Splitting: Text splitters break Documents into splits of specified size

2. Storage: Storage (often a vectorstore) houses the splits and usually embeds them

3. Retrieval: The app retrieves splits from storage (often those whose embeddings are most similar to the embedding of the input question)

4. Output: An LLM produces an answer using a prompt that includes the question and the retrieved splits
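The four steps above can be sketched in plain Python without any LangChain calls. This is a toy illustration only: `split_text`, `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, and the bag-of-words "embedding" is an assumption standing in for a real embedding model. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size):
    """Step 1 (splitting): cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(store, question, k=2):
    """Step 3 (retrieval): return the k stored splits most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Step 4 (output): assemble the prompt an LLM would receive."""
    context = "\n".join(chunks)
    return f"{context}\nQuestion: {question}\nHelpful Answer:"


text = ("CSE has strong links with industry. "
        "The campus cafeteria serves lunch daily. "
        "CSE faculty are internationally recognized.")
splits = split_text(text, chunk_size=40)
store = [(s, embed(s)) for s in splits]          # Step 2 (storage): keep each split with its vector
top = retrieve(store, "Why join CSE?", k=2)
prompt = build_prompt("Why join CSE?", top)
print(prompt)
```

Note that fixed-size splitting cuts sentences in the middle, which is one reason real text splitters prefer to break on separators such as paragraphs and sentences.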

**There are different ways to do QA, with different levels of abstraction**
- load_doc > **VectorstoreIndexCreator**
- load_doc > split > store > **RetrievalQA**
- load_doc > split > store > retrieve > **load_qa_chain**

In [198]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
openai_api_key = os.getenv("OPENAI_API_KEY")
gpt4all_path = os.getenv("GPT4ALL_PATH")
# Never print the key itself; only confirm that it was loaded
print("OPENAI_API_KEY loaded:", openai_api_key is not None)
print(gpt4all_path)

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

question = "Why join CSE?"

OPENAI_API_KEY loaded: True
C:\Users\ASUS\AppData\Local\nomic.ai\GPT4All\ggml-model-gpt4all-falcon-q4_0.bin


In [192]:
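def print_str_as_blocks(text, char_limit=70) -> None:
    """Print text, ending a line once it grows past char_limit characters."""
    words = text.split(" ")
    line = ""
    block = ""
    for word in words:
        line += word
        if len(line) > char_limit:
            # Line is long enough: end it and start a new one
            block += line + "\n"
            line = ""
        elif "\n" in word:
            # The word already contains a newline: flush and restart the count
            block += line + " "
            line = ""
        else:
            line += " "
    # Flush whatever is left after the last word (exactly once)
    block += line
    print(block)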

### **VectorstoreIndexCreator**

In [109]:
# 1) load Document
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://cse.hkust.edu.hk/admin/welcome/")
data = loader.load()

In [110]:
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import OpenAIEmbeddings
# Document loader
loader = WebBaseLoader("https://cse.hkust.edu.hk/admin/welcome/")
# Index that wraps above steps
embedder = OpenAIEmbeddings(openai_api_key=openai_api_key)
index = VectorstoreIndexCreator(embedding=embedder).from_loaders([loader])
# Question-answering
question = "Why join HKUST?"
index.query(question)

' HKUST offers six undergraduate programs, popular research postgraduate programs, and is consistently ranked among the top-30 in Computer Science and Information Systems in the world. It is a research-oriented, student-centered Department with internationally recognized faculty, extensive links with the industry, and state-of-the-art computing infrastructure. It also has a new campus opening in 2022 that will provide collaboration and integration for both research and education between campuses.'

### **RetrievalQA**
- How the vectorstore is created is up to the programmer
- How Documents are retrieved is customized
- How retrieved Documents are presented to the LLM is abstracted

In [111]:
# 1) load Document
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://cse.hkust.edu.hk/admin/welcome/")
data = loader.load()
# 2) Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)
# 3) Store 
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
vectorstore = Chroma.from_documents(documents=all_splits,embedding=OpenAIEmbeddings())
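To make the `chunk_size` and `chunk_overlap` arguments used above more concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and sentences first); `chunk_with_overlap` is a hypothetical helper that only illustrates what the two numbers mean.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text
            break
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_with_overlap(alphabet, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_with_overlap(alphabet, chunk_size=10, chunk_overlap=3)
print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With `chunk_overlap = 0`, as in the cell above, no text is shared between neighbouring splits. A non-zero overlap repeats the tail of one split at the head of the next, which can help when an answer straddles a boundary, at the cost of storing more text.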

Basic usage of RetrievalQA

In [189]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever())
ans = qa_chain({"query": question})
print_str_as_blocks(ans['result'])

There are several reasons why you might consider joining the Department
of Computer Science and Engineering (CSE) at HKUST. 

Firstly, CSE is consistently ranked among the top 30 in Computer Science and Information
Systems in the world. This indicates the high quality of education and research
opportunities available at the department.

Secondly, CSE is a research-oriented department with internationally recognized faculty.
This means that you will have the opportunity to engage in cutting-edge
research and work with experts in the field.

Additionally, CSE has extensive links with the industry, providing you with opportunities
for internships, collaborations, and networking. This industry connection
can enhance your career prospects and provide real-world experience.

Furthermore,
CSE offers state-of-the-art computing infrastructure, ensuring that you
have access to the necessary tools and resources for your studies and research.

Lastly,
CSE is known for its innovative teaching metho

Customizing the prompt in RetrievalQA 👍

In [196]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use exactly three sentences with (sentence number) at the front. For example: (1) CSE means... .
Always say "thank you!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)

# Run chain
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectorstore.as_retriever(),
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})

result = qa_chain({"query": question})
print_str_as_blocks(result["result"])

 Joining CSE means joining a department with a rich history of cutting-edge
research and innovation. Our faculty, staff, and students are recognized
as international thought leaders and technology inventors in the field of
computer science and engineering. We offer state-of-the-art computing infrastructure,
extensive industry connections, and opportunities for research and enrichment.
Join us today! 


Returning the retrieved Documents (sources) used to produce the final answer

In [93]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever(),
                                       return_source_documents=True)
result = qa_chain({"query": question})
print(len(result['source_documents']))
result['source_documents'][0]

4


Document(page_content='Consistently ranked among top-30 in Computer Science and Information \nSystems in the world, CSE at HKUST is research-oriented, student-centered \nDepartment. With internationally recognized faculty, extensive links with \nthe industry, and state-of-the-art computing infrastructure, CSE is the \nplace for cutting-edge research and innovative teaching. CSE enjoys the \nreputation as international thought leaders and technology inventors in', metadata={'source': 'https://cse.hkust.edu.hk/admin/welcome/', 'title': 'Welcome from Head of Department | HKUST CSE', 'language': 'en'})

In [95]:
# Return the source URL along with the answer
from langchain.chains import RetrievalQAWithSourcesChain
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=vectorstore.as_retriever())
result = qa_chain({"question": question})
result

{'question': 'What is CSE?',
 'answer': 'CSE stands for Computer Science and Engineering.\n',
 'sources': 'https://cse.hkust.edu.hk/admin/welcome/'}

### **load_qa_chain**

- How the vectorstore is created is customized
- How retrieved documents are presented to the LLM is customized

Retrieved documents can be fed to an LLM for answer distillation in a few different ways.

The **stuff**, **refine**, **map-reduce**, and **map-rerank** chains for passing documents to an LLM prompt are summarized in the LangChain documentation.

- **stuff** is commonly used because it simply "stuffs" all retrieved documents into the prompt.
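The difference between **stuff** and **map-reduce** can be sketched with a stub in place of a real model. This is a rough illustration under stated assumptions, not LangChain's implementation: `RecordingLLM`, `stuff`, and `map_reduce` are hypothetical names, the prompts are simplified, and the map-reduce variant ignores the intermediate collapsing that a real chain may perform on long inputs.

```python
class RecordingLLM:
    """Stub standing in for a real LLM: it records each prompt and returns a placeholder."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer-{len(self.prompts)}"


def stuff(llm, docs, question):
    """Put every document into a single prompt and call the model once."""
    context = "\n\n".join(docs)
    return llm(f"{context}\nQuestion: {question}\nHelpful Answer:")


def map_reduce(llm, docs, question):
    """Ask the question of each document separately, then combine the partial answers."""
    partial = [llm(f"{doc}\nQuestion: {question}\nHelpful Answer:") for doc in docs]
    summary = "\n".join(partial)
    return llm(f"{summary}\nQuestion: {question}\nHelpful Answer:")


docs = ["doc one", "doc two", "doc three"]
stuff_llm, map_reduce_llm = RecordingLLM(), RecordingLLM()
stuff(stuff_llm, docs, "Why join CSE?")
map_reduce(map_reduce_llm, docs, "Why join CSE?")
print(len(stuff_llm.prompts), len(map_reduce_llm.prompts))  # 1 4
```

In this sketch, stuffing costs one model call but the prompt grows with the number of documents, while map-reduce keeps each prompt small at the price of one call per document plus a final combining call.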

In [100]:
# 1) load Document
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://cse.hkust.edu.hk/admin/welcome/")
data = loader.load()
# 2) Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)
# 3) Store 
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
vectorstore = Chroma.from_documents(documents=all_splits,embedding=OpenAIEmbeddings())
# 4) retrieve
docs = vectorstore.similarity_search(question)
len(docs)

4

Another way to retrieve: use an LLM to generate variations of the question, which makes similarity search less sensitive to how the question is worded.

In [116]:
# MultiQueryRetriever
import logging
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectorstore.as_retriever(),
                                                  llm=ChatOpenAI(temperature=0))
docs_llm = retriever_from_llm.get_relevant_documents(query=question)
len(docs_llm)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the benefits of joining HKUST?', '2. Can you tell me why I should consider joining HKUST?', '3. What makes HKUST a good choice for students?']


5
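The count above (5) is larger than the 4 documents returned by a single similarity search because the results for each generated query are merged and de-duplicated. A minimal sketch of that merging step follows. `multi_query_retrieve` and the `fake_index` lookup table are hypothetical stand-ins for illustration, not the `MultiQueryRetriever` code, and the document ids are made up.

```python
def multi_query_retrieve(retrieve, queries):
    """Run every query through `retrieve` and return the unique union, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical retriever: each query maps to its 4 nearest documents (ids only)
fake_index = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["a", "b", "c", "e"],
    "q3": ["b", "c", "d", "a"],
}
merged_docs = multi_query_retrieve(fake_index.__getitem__, ["q1", "q2", "q3"])
print(len(merged_docs), merged_docs)  # 5 ['a', 'b', 'c', 'd', 'e']
```

Each query alone returns 4 documents, but the union contains 5 because one of the rephrased queries surfaces a document the others miss.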

Basic usage

**load_qa_chain** is an easy way to pass documents to an LLM using these various approaches (e.g., see chain_type).

In [117]:
from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff")
chain({"input_documents": docs_llm, "question": question},return_only_outputs=False)

{'input_documents': [Document(page_content='Consistently ranked among top-30 in Computer Science and Information \nSystems in the world, CSE at HKUST is research-oriented, student-centered \nDepartment. With internationally recognized faculty, extensive links with \nthe industry, and state-of-the-art computing infrastructure, CSE is the \nplace for cutting-edge research and innovative teaching. CSE enjoys the \nreputation as international thought leaders and technology inventors in', metadata={'source': 'https://cse.hkust.edu.hk/admin/welcome/', 'title': 'Welcome from Head of Department | HKUST CSE', 'language': 'en'}),
  Document(page_content='Welcome from Head of Department | HKUST CSE\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMore about HKUST\n\nUniversity News\nAcademic Departments A-Z\nLife@HKUST\nLibrary\n\n\nMap & Directions\nCareers at HKUST\nFaculty Profiles\nAbout HKUST\n\n\n\n\n\n\n\n\n\n\n\n\nIntranet\n\n\nSchool of Engineering\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n