# Retrieval Question/Answering

This notebook loads the PDFs in a folder, splits them into chunks, indexes the chunks in Pinecone, and answers questions over them with a retrieval QA chain.

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI


## Load the documents

In [2]:
folder = r"G:\.shortcut-targets-by-id\1vE28d8xZuJXkpcinFbuku9FJgeaDd48K\ICOLD - CFRD New Bulletin 2023"
from langchain.document_loaders import DirectoryLoader
# Load all PDFs in a folder recursively
loader = DirectoryLoader(folder, "**/*.pdf", use_multithreading=True)

This next step takes some time: about 6 minutes on the last run.

In [3]:
docs = loader.load()

In [None]:
len(docs)

75

## Split the documents into texts

Here we create an NLTKTextSplitter with a chunk size of 2000 characters. Retrieval and answer quality depend on the chunk size, and the retrieved chunks together must fit within the LLM's maximum token length.
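To make the effect of chunk size concrete, here is a minimal sketch of greedy sentence packing, the general idea behind sentence-based splitters. It is not NLTKTextSplitter's implementation: `pack_sentences` is a hypothetical helper, and the regular-expression sentence split is a crude stand-in for NLTK's tokenizer.

```python
import re


def pack_sentences(text, chunk_size=2000):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single sentence longer than
    chunk_size is kept whole rather than cut.
    """
    # Crude sentence split on '.', '!' or '?' followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # The sentence still fits (or the chunk is empty): keep growing it.
            current = candidate
        else:
            # The chunk is full: close it and start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "The dam is a CFRD. It is located in Brazil. Cracks were mapped on the face slab."
print(len(pack_sentences(sample, chunk_size=2000)))
print(len(pack_sentences(sample, chunk_size=45)))
```

A smaller chunk size yields more, shorter chunks, so each retrieved hit carries less surrounding context but more hits fit into one prompt.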

In [3]:
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=2000)

In [4]:
texts = text_splitter.split_documents(docs)

NameError: name 'docs' is not defined

In [None]:
print(texts[10].page_content)

Limits on the location of the resultant are provided in Table 6 .

________________________________________________________________________________________________ Rev.

D 21-01-2021

Page 16 of 55

Name of Project: 0390801 - Pakal Dul, CFRD Design Consultancy

DOCUMENT No: 0390801-INF-SS-LT3.2--0001 - Structural Design of Concrete Face, Plinth and Parapet Wall Report

Table 6.

Requirements for Location of the Resultant

PARAMETER

USUAL CASE

UNUSUAL CASE

EXTREME CASE

Resultant location

Middle third

Middle half

Within the base

Compressive stresses shall not be larger than bearing capacity

Percentage of foundation in compression

100% in compression

75% in compression

Source: USACE EM-1110-2-2200

4.4.4.

Load cases

The dimensioning and reinforcement of the parapet walls takes into account the most critical loading conditions.

Usual case

Dead-weight, backfill pressures, water pressure at the normal operating level.

Unusual case

Dead-weight, backfill pressures, water pres

In [None]:
print(f"Number of chunks in the whole index: {len(texts)}")

Number of chunks in the whole index: 3028


In [None]:
# from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# docsearch = Chroma.from_documents(texts, embeddings)

## Set up the Vector Store

In [None]:
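# Set up Pinecone
import os
import pinecone
from tqdm.autonotebook import tqdm
# Read the API key from the environment instead of hard-coding it (the variable name is an assumption)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-central1-gcp")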



### Create the Pinecone Index

Again, this step takes some time.

In [None]:
# Check whether the 'cfrds' index already exists
if 'cfrds' in pinecone.list_indexes():
    # Delete the index if it already exists
    pinecone.delete_index('cfrds')
# create a new index
pinecone.create_index('cfrds', dimension=1536, metric='cosine')
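The index is created with metric='cosine', so matches are ranked by the angle between embedding vectors rather than by their length. As a minimal sketch of that measure, in plain Python with made-up 3-dimensional vectors (the index here stores 1536-dimensional vectors, and Pinecone performs the comparison on its side):

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative vectors only; they are not real embeddings.
query_vec = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```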

### Upsert the documents into the index


In [1]:
from tqdm.autonotebook import tqdm
from langchain.vectorstores import Pinecone



In [2]:
docsearch = Pinecone.from_documents(texts, embeddings, index_name='cfrds')

NameError: name 'texts' is not defined

## Set up the Large Language Model

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.5, max_tokens=2000, model='gpt-4-32k'),
    # llm=ChatOpenAI(temperature=0.5, max_tokens=2000, model='gpt-3.5-turbo-16k'),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
    )
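The "stuff" chain type places all retrieved chunks into a single prompt. The sketch below illustrates that idea only: `build_stuff_prompt` is a hypothetical helper, and the prompt wording is invented for this example rather than taken from LangChain's template.

```python
def build_stuff_prompt(question, chunks):
    """Hypothetical helper: join every retrieved chunk into one context block."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for retrieved document text.
example_chunks = [
    "Chunk A: text retrieved from the first document.",
    "Chunk B: text retrieved from the second document.",
]
print(build_stuff_prompt("tell me about campos novos dam", example_chunks))
```

Because every chunk ends up in one prompt, the number of retrieved chunks times the chunk size, plus the answer length, has to fit within the model's context window.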

In [34]:
import textwrap

def pretty_print(response):
    # Re-wrap the response to at most 80 characters per line.
    # Note: textwrap.wrap collapses existing newlines, so bullet lists are merged.
    return '\n'.join(textwrap.wrap(response, 80))


## Ask the LLM a question with retrieval

In [35]:
query = "tell me about campos novos dam"
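# The chain reads only the "query" key; an extra "n" entry does not change how many chunks are retrieved.
# To retrieve more chunks, configure the retriever, e.g. docsearch.as_retriever(search_kwargs={"k": 10}).
result = qa({"query": query})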

In [36]:
print(pretty_print(result['result']))

The Campos Novos Dam is a concrete-faced rockfill dam (CFRD) located in Brazil.
It was constructed in 2006 as part of the Campos Novos Hydroelectric Power Plant
project. The dam is situated on the Canoas River in the state of Santa Catarina.
The Campos Novos Dam has a height of 125 meters and a crest length of 840
meters. It was designed to store water for hydropower generation and to regulate
the flow of the Canoas River. The reservoir created by the dam has a storage
capacity of approximately 6.5 billion cubic meters.  During its construction,
the dam faced some challenges related to concrete failure and cracking. These
issues were addressed through careful design and construction techniques. The
dam was instrumented to monitor its behavior during construction, which helped
inform the final design and construction methodology.  The Campos Novos Dam is
an important infrastructure project in Brazil, contributing to the country's
energy generation and water management.


### Check the sources

In [42]:
import os
for source in result['source_documents']:
    print(os.path.basename(source.metadata['source']))

sdar.64119.381.pdf
ICOLD 2023 CFRD Workshop_ FINAL.pdf
ICOLD 2023 CFRD Workshop_ FINAL.pdf
sdar.64119.381.pdf
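The same file can appear several times in the sources when more than one of its chunks is retrieved. A minimal sketch for listing each file once, in first-seen order (`unique_in_order` is a hypothetical helper, and the names below are copied from the output above):

```python
def unique_in_order(names):
    """Drop repeated names while keeping the order of first appearance."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(names))


retrieved = [
    "sdar.64119.381.pdf",
    "ICOLD 2023 CFRD Workshop_ FINAL.pdf",
    "ICOLD 2023 CFRD Workshop_ FINAL.pdf",
    "sdar.64119.381.pdf",
]
for name in unique_in_order(retrieved):
    print(name)
```

In the notebook, the input would be the basenames taken from `result['source_documents']`, as in the cell above.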


In [38]:
query = "What do you know about the Campos Novos Dam?"
# Only the "query" key is read by the chain; the extra "n" entry is dropped here.
result = qa({"query": query})
print(result['result'])

Based on the provided context, here is what we know about the Campos Novos Dam:

- The Campos Novos Dam is mentioned in the context as one of the problematic cases in dam construction.
- There is a mention of "concrete failure mapping" related to the Campos Novos Dam.
- The dam experienced concrete cracks and failures, as indicated by the mapping of major cracks.
- The context includes a diagram showing the construction joint and the location of inclined and transversal cracks in slabs 16, 17, 19, 20, 21, and 22 of the dam.
- The context suggests that the Campos Novos Dam is located in Brazil and was built in 2006.

Unfortunately, the provided context does not offer detailed information about the specific characteristics or issues of the Campos Novos Dam.


In [39]:
pretty_print(result['result'])

'Based on the provided context, here is what we know about the Campos Novos Dam:\n- The Campos Novos Dam is mentioned in the context as one of the problematic\ncases in dam construction. - There is a mention of "concrete failure mapping"\nrelated to the Campos Novos Dam. - The dam experienced concrete cracks and\nfailures, as indicated by the mapping of major cracks. - The context includes a\ndiagram showing the construction joint and the location of inclined and\ntransversal cracks in slabs 16, 17, 19, 20, 21, and 22 of the dam. - The context\nsuggests that the Campos Novos Dam is located in Brazil and was built in 2006.\nUnfortunately, the provided context does not offer detailed information about\nthe specific characteristics or issues of the Campos Novos Dam.'
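The wrapped output above runs the bullet points together because `textwrap.wrap` treats newlines as ordinary whitespace. One possible alternative, sketched here under the assumption that each original line should be wrapped separately, is a helper such as the hypothetical `wrap_preserving_lines`:

```python
import textwrap


def wrap_preserving_lines(text, width=80):
    """Wrap each line of text separately so existing line breaks survive."""
    wrapped = []
    for line in text.splitlines():
        # textwrap.fill returns '' for an empty line, which keeps blank lines.
        wrapped.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped)


# Made-up response text with a header, a blank line, and two bullets.
example = "Header:\n\n- " + "word " * 30 + "\n- second"
print(wrap_preserving_lines(example, width=40))
```

Each bullet then starts on its own line, and only lines longer than the width are broken.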

In [40]:
result['source_documents'][3].metadata['source']

'G:\\.shortcut-targets-by-id\\1vE28d8xZuJXkpcinFbuku9FJgeaDd48K\\ICOLD - CFRD New Bulletin 2023\\Dam Response\\sdar.64119.381.pdf'