# Retrieval Question/Answering
---

This example showcases question answering over a Pinecone index built from a folder of PDFs. The documents are split into chunks with NLTK, and the retrieved source documents are returned alongside each answer.
- The Pinecone index is set up first.

## Set up Pinecone Index

In [1]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings 
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
# from langchain.llms import OpenAI
# from langchain.chains import RetrievalQA

Read all pdf files from the folder:
"G:\.shortcut-targets-by-id\1vE28d8xZuJXkpcinFbuku9FJgeaDd48K\ICOLD - CFRD New Bulletin 2023"

In [2]:
folder = r"C:\2023\streamlit-langchain\ICOLD - CFRD New Bulletin 2023"

In [3]:
from langchain_community.document_loaders import DirectoryLoader

In [8]:
# Load all PDFs in a folder recursively
loader = DirectoryLoader(folder, glob="**/*.pdf", show_progress=True, use_multithreading=True)

By default, `DirectoryLoader` parses files with the `unstructured` library, whose PDF support depends on Tesseract (OCR) and Poppler. Neither is a Python package, so both must be installed manually and be available on the PATH.
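Before loading, it can help to confirm that both tools are reachable from Python. The following is a minimal sketch using only the standard library. The helper name `missing_tools` is hypothetical, and the executable names `tesseract` and `pdftoppm` (one of the Poppler utilities) are assumptions about how the tools are installed on your system.

```python
import shutil

def missing_tools(executables):
    """Return the executables from the given list that are not found on PATH."""
    return [name for name in executables if shutil.which(name) is None]

# Assumed executable names: "tesseract" (Tesseract OCR) and "pdftoppm" (a Poppler utility)
missing = missing_tools(["tesseract", "pdftoppm"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler look available")
```

If either name is reported, install the tool or add its folder to PATH before calling `loader.load()`.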

In [9]:
docs = loader.load()

[Errno 13] Permission denied: 'C:\\Users\\JUAN_T~1\\AppData\\Local\\Temp\\tmpruup5_uw'
[Errno 13] Permission denied: 'C:\\Users\\JUAN_T~1\\AppData\\Local\\Temp\\tmp1ud7yecw'
PDF text extraction failed, skip text extraction...
PDF text extraction failed, skip text extraction...
[Errno 13] Permission denied: 'C:\\Users\\JUAN_T~1\\AppData\\Local\\Temp\\tmpocmpbxl4'
PDF text extraction failed, skip text extraction...
100%|██████████| 64/64 [04:33<00:00,  4.27s/it]


In [11]:
len(docs)

61
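The progress bar reports 64 files, while `len(docs)` is 61. That difference would be consistent with the three permission errors, although the log does not confirm it. One way to cross-check what the recursive glob matched is to list the files with `pathlib`. This is a sketch only: `list_pdfs` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than on the real folder.

```python
from pathlib import Path
import tempfile

def list_pdfs(folder):
    """Return all .pdf files under folder, searched recursively, sorted by path."""
    return sorted(Path(folder).rglob("*.pdf"))

# Demonstration on a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.pdf").touch()
    (root / "sub" / "b.pdf").touch()
    (root / "notes.txt").touch()  # ignored: not a PDF
    found = [p.relative_to(root).as_posix() for p in list_pdfs(root)]

print(found)
```

Calling `len(list_pdfs(folder))` on the real folder gives a file count to compare against `len(docs)`.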

In [12]:
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=2000)

In [13]:
texts = text_splitter.split_documents(docs)

Created a chunk of size 2048, which is longer than the specified 2000


Created a chunk of size 5291, which is longer than the specified 2000
Created a chunk of size 4206, which is longer than the specified 2000
Created a chunk of size 4691, which is longer than the specified 2000
Created a chunk of size 2138, which is longer than the specified 2000
Created a chunk of size 2023, which is longer than the specified 2000
Created a chunk of size 2016, which is longer than the specified 2000
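These warnings are expected with a sentence-based splitter: whole sentences are packed into chunks, and a single "sentence" longer than `chunk_size` (common in PDF text with tables or missing punctuation) cannot be split further. The sketch below is a simplified, standard-library illustration of that behavior, not LangChain's actual implementation; `merge_sentences` and the sample data are hypothetical.

```python
def merge_sentences(sentences, chunk_size, separator="\n\n"):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A sentence that is itself longer than chunk_size is emitted unsplit,
    which is how an oversized chunk can arise.
    """
    chunks, current = [], []
    current_len = 0
    for sentence in sentences:
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The current chunk is full: close it and start a new one
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks

sentences = ["a" * 30, "b" * 30, "c" * 120, "d" * 10]
chunks = merge_sentences(sentences, chunk_size=70)
oversized = [len(c) for c in chunks if len(c) > 70]
print([len(c) for c in chunks], oversized)
```

The 120-character sentence ends up as its own chunk even though the limit is 70, which mirrors the 5291-character chunk reported above.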


In [14]:
print(texts[10].page_content)

Face deflection after reservoir impounding, section B, D=(0.0 ÷ 908) mm.

Figure 37.

Face deflection after dam construction, section C, D=(0.0 ÷ 106.3) mm.

18

10th Benchmark Workshop on Numerical Aspects of dams, Paris, September, 2009

Figure 38.

Face deflection after reservoir impounding, section C, D=(0.0 ÷ 623.8) mm.

CONCLUSIONS

From the performed analysis following main conclusions could be drawn out:

Program package SOFiSTiK is powerful tool for complex three-dimensional analysis of dams.

It has rich possibilities for modeling of the dam body, and also possibilities for application of different constitutive laws, as well as for complex load influences.

From the analysis of dam behaviour for the loading states (state after dam construction and state after reservoir impounding), obtained values and distribution of the dam settlements and vertical stresses are usual for this type of dam.

The maximal vertical settlement is in the intermediate part of the dam, located approx

In [15]:
print(f"Number of chunks: {len(texts)}")

Number of chunks: 1799


In [29]:
# from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings 
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# docsearch = Chroma.from_documents(texts, embeddings)

In [30]:
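# Set up Pinecone
from pinecone import Pinecone 
# With no arguments, the client reads the API key from the PINECONE_API_KEY environment variable
pinecone = Pinecone()
# Legacy initialization (older pinecone-client); never hard-code or commit the key:
# pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-central1-gcp")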

In [31]:
pinecone.Index('cfrds')

<pinecone.data.index.Index at 0x17bda9bb650>

In [32]:
# check if 'icold' exists in the indexes
# if 'icold' in pinecone.list_indexes():
#     # Delete the index if it already exists
#     pinecone.delete_index('icold')
# # create a new index
# pinecone.create_index('icold', dimension=1536, metric='cosine')

In [33]:
# Upsert the documents into the index
from tqdm.autonotebook import tqdm
from langchain_community.vectorstores import Pinecone

In [34]:
len(texts)

1799

The next step calls the OpenAI embeddings API, so it requires an OpenAI API key. On a newly funded account, it appears to take some time after payment before requests are accepted.
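A quick pre-flight check can catch a missing key before the embedding call starts. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and it assumes the keys are supplied through the `OPENAI_API_KEY` and `PINECONE_API_KEY` environment variables.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumed variable names for the two services used in this notebook
REQUIRED = ["OPENAI_API_KEY", "PINECONE_API_KEY"]
missing = missing_env_vars(REQUIRED)
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Keeping keys in the environment also avoids pasting them into notebook cells, where they are easy to commit by accident.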

In [39]:
docsearch = Pinecone.from_documents(texts, embeddings, index_name='cfrds')

In [40]:
retriever = docsearch.as_retriever()

In [41]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(model="gpt-4-turbo-preview")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
chain.invoke("What is a cfrd?")


'A CFRD, or Concrete-Faced Rockfill Dam, is a type of dam that relies on the weight of rockfill to resist the force of water, utilizing a concrete face slab as an impervious barrier to prevent leakage. The concrete face slab is equipped with vertical and peripheral joints, which include waterstops to minimize water seepage. For taller CFRDs, horizontal construction joints are also sometimes employed to facilitate the placement of the concrete face slab in stages. This combination of rockfill and concrete slab is designed to offer excellent performance in terms of safety, economy, and environmental impact, making CFRDs a popular and competitive choice for dam construction.'

In [42]:
from operator import itemgetter
from langchain.schema import format_document

from langchain.prompts.prompt import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string

memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)
# First we add a step to load memory
# This adds a "chat_history" key to the input object
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)
# Now we calculate the standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | CONDENSE_QUESTION_PROMPT
    | ChatOpenAI(temperature=0)
    | StrOutputParser(),
}
# Now we retrieve the documents
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"],
}
# Now we construct the inputs for the final prompt
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

def _combine_documents(docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question"),
}
# And finally, we do the part that returns the answers
ANSWER_PROMPT = ChatPromptTemplate.from_template(template)
answer = {
    "answer": final_inputs | ANSWER_PROMPT | ChatOpenAI(),
    "docs": itemgetter("docs"),
}
# And now we put it all together!
final_chain = loaded_memory | standalone_question | retrieved_documents | answer
inputs = {"question": "How do we mitigate leakage once a concrete face rockfill dam is constructed?"}
result = final_chain.invoke(inputs)
result


{'answer': AIMessage(content='To prevent leakage in a concrete face rockfill dam, it is important to identify the leakage locations on the concrete face slab, understand the severity of leakage at each location, and take proper rehabilitation measures to repair the face slab. Additionally, the use of advanced technologies, such as fluorescent tracer tests, can be employed to detect dam leakages and address them promptly. Regular monitoring and maintenance of the dam structure can also help prevent excessive leakage in a concrete face rockfill dam.'),
 'docs': [Document(page_content='Rockﬁll has been observed to be able to accept high leakage safely: therefore diversion capacity can be reduced by allowing overtopping of the rockﬁll.\n\n(b) Construction: rockﬁll is suitable for wet weather placement; foundation grouting can be done in parallel with rockﬁll placement; multistage construction of the rockﬁll embankment is possible; foundation clean-up requires no hard work except at the pli
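
Note that the chain reads from `memory`, but no cell in this notebook writes the finished turn back, so `chat_history` would still be empty on a follow-up question. In LangChain the save step is typically `memory.save_context(inputs, {"answer": result["answer"].content})`. The standard-library sketch below only illustrates the load, answer, save cycle; `TurnBuffer` is a hypothetical stand-in, not a LangChain class.

```python
class TurnBuffer:
    """Hypothetical stand-in for a conversation memory: stores (role, text) pairs."""

    def __init__(self):
        self.turns = []

    def save_turn(self, question, answer):
        # Record both sides of the exchange so the next call can see them
        self.turns.append(("Human", question))
        self.turns.append(("AI", answer))

    def as_text(self):
        # Flatten to "Role: text" lines, the shape a condense-question prompt expects
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

buffer = TurnBuffer()
history_before = buffer.as_text()  # empty: nothing has been saved yet
buffer.save_turn("What is a CFRD?", "A concrete-faced rockfill dam.")
history_after = buffer.as_text()
print(history_after)
```

Only after a turn has been saved does a follow-up such as "How is leakage mitigated in one?" have history from which a standalone question can be built.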

In [45]:
print(result['answer'].content)

To prevent leakage in a concrete face rockfill dam, it is important to identify the leakage locations on the concrete face slab, understand the severity of leakage at each location, and take proper rehabilitation measures to repair the face slab. Additionally, the use of advanced technologies, such as fluorescent tracer tests, can be employed to detect dam leakages and address them promptly. Regular monitoring and maintenance of the dam structure can also help prevent excessive leakage in a concrete face rockfill dam.


In [53]:
for doc in result['docs']:
    print(doc.page_content)

Rockﬁll has been observed to be able to accept high leakage safely: therefore diversion capacity can be reduced by allowing overtopping of the rockﬁll.

(b) Construction: rockﬁll is suitable for wet weather placement; foundation grouting can be done in parallel with rockﬁll placement; multistage construction of the rockﬁll embankment is possible; foundation clean-up requires no hard work except at the plinth; and the plinth and face slab can be constructed quickly and economically using slip forming.

Introduction

1.

1.1 CFRDs and their advantages Concrete-face rockﬁll dams (CFRDs) are known to have origi- nated from the mining regions of Sierra Nevada in California in the 1850s; however, construction of modern dams started only in the late 1960s and early 1970s.

Since then, rapid progress has the been achieved in the construction of CFRDs throughout

(c) Performance: water load is transferred to the foundation

upstream of the dam axis, resulting in increased stability and a high f

In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.5, max_tokens=2000),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
    )

In [None]:
import textwrap

def pretty_print(response):
    # Wrap the response into lines of at most 80 characters
    return '\n'.join(textwrap.wrap(response, 80))


In [None]:
query = "tell me about campos novos dam"
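# "n" is not an input that RetrievalQA uses; to change how many chunks are retrieved,
# build the retriever with docsearch.as_retriever(search_kwargs={"k": ...}) instead
result = qa.invoke({"query": query})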

InvalidRequestError: This model's maximum context length is 4097 tokens. However, you requested 4128 tokens (2128 in the messages, 2000 in the completion). Please reduce the length of the messages or completion.
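
The error is a token-budget problem: the prompt (question plus retrieved chunks) and the requested completion must fit in the model's context window together. The arithmetic below uses only the numbers from the error message; the variable names are illustrative.

```python
MAX_CONTEXT = 4097           # model limit reported in the error
PROMPT_TOKENS = 2128         # tokens used by the messages (question + retrieved context)
REQUESTED_COMPLETION = 2000  # the max_tokens value passed to ChatOpenAI

requested_total = PROMPT_TOKENS + REQUESTED_COMPLETION
overshoot = requested_total - MAX_CONTEXT
largest_completion = MAX_CONTEXT - PROMPT_TOKENS

print(f"requested {requested_total}, over by {overshoot}")
print(f"largest max_tokens that fits this prompt: {largest_completion}")
```

For this prompt, a `max_tokens` value of 1969 or less would fit. Retrieving fewer or smaller chunks, or using a model with a larger context window, are the other options.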

In [None]:
print(pretty_print(result['result']))

The methodology described in the given context has been used to reproduce
various aspects of the behavior observed in different dams around the world,
including Campos Novos Dam in Brazil. However, the given context does not
provide detailed information about Campos Novos Dam. If you need more
information about it, please let me know, and I will try to help you find it.


In [None]:
result['source_documents'][3].metadata['source']

'G:\\.shortcut-targets-by-id\\1vE28d8xZuJXkpcinFbuku9FJgeaDd48K\\ICOLD - CFRD New Bulletin 2023\\B141_unlocked.pdf'

In [None]:
query = "What do you know about the Campos Novos Dam?"
# "n" is not an input that RetrievalQA uses, so it is omitted here
result = qa.invoke({"query": query})
print(result['result'])

According to the context, the Campos Novos Dam is a dam that has been studied using the methodology developed by INGETEC. The methodology has been consistently reproduced in various aspects of behavior observed in different dams around the world, including the Campos Novos Dam in Brazil. However, there is no further information provided about the specific details of the Campos Novos Dam.


In [None]:
pretty_print(result['result'])

'According to the context, the Campos Novos Dam is a dam that has been studied\nusing the methodology developed by INGETEC. The methodology has been\nconsistently reproduced in various aspects of behavior observed in different\ndams around the world, including the Campos Novos Dam in Brazil. However, there\nis no further information provided about the specific details of the Campos\nNovos Dam.'

In [None]:
result['source_documents'][3].metadata['source']

'G:\\.shortcut-targets-by-id\\1vE28d8xZuJXkpcinFbuku9FJgeaDd48K\\ICOLD - CFRD New Bulletin 2023\\Dam Response\\sdar.64119.381.pdf'
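
Rather than inspecting one source at a time, the file names behind an answer can be listed together. The sketch below assumes each retrieved document exposes a `metadata["source"]` path, as in the cells above. The helper `unique_source_names` is hypothetical, and the sample paths are shortened stand-ins for the real ones.

```python
from pathlib import PureWindowsPath

def unique_source_names(metadatas):
    """Return source file names in first-seen order, without duplicates."""
    seen, names = set(), []
    for meta in metadatas:
        # PureWindowsPath parses backslash-separated paths on any operating system
        name = PureWindowsPath(meta["source"]).name
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names

# Stand-in metadata; shortened, hypothetical versions of the paths printed above
metadatas = [
    {"source": r"G:\docs\B141_unlocked.pdf"},
    {"source": r"G:\docs\Dam Response\sdar.64119.381.pdf"},
    {"source": r"G:\docs\B141_unlocked.pdf"},
]
print(unique_source_names(metadatas))
```

With the real result, the input would be `[doc.metadata for doc in result['source_documents']]`.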