## Install libraries & dependencies

In [8]:
!pip install langchain --quiet
!pip install --upgrade openai==0.28.1 --quiet
!pip install pdf2image --quiet
!pip install pdfminer.six --quiet
!pip install singlestoredb --quiet
!pip install tiktoken --quiet
!pip install --upgrade unstructured==0.10.14 --quiet

## Import the libraries

In [9]:
from langchain.document_loaders import PyPDFLoader
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os

## Load your custom document

In [12]:
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://unctad.org/system/files/official-document/wesp2023_en.pdf")

data = loader.load()

## Use the LangChain framework to split the document into chunks

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 0)
texts = text_splitter.split_documents(data)

print(f"You have {len(texts)} chunks")

You have 1 document(s) in your data
There are 553192 characters in your document
You have 314 chunks
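
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal fixed-window splitter. This is only a sketch and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines and spaces, so its chunks are often shorter than `chunk_size`. That is one plausible reason the 553192 characters above yield 314 pieces rather than the roughly 277 that pure 2000-character windows would give. The helper name `split_fixed` and the sample string are made up for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 5  # 50 characters of toy text

no_overlap = split_fixed(sample, chunk_size=20)
with_overlap = split_fixed(sample, chunk_size=20, chunk_overlap=5)

print(len(no_overlap), len(with_overlap))
print(with_overlap[0][-5:], with_overlap[1][:5])  # the shared 5 characters
```

With `chunk_overlap = 0`, as in the cell above, the chunks partition the text exactly. A non-zero overlap repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more text.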


## Use OpenAI API to generate embeddings for the document chunks

In [15]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

OpenAI API Key:  ········
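
When the notebook is re-run, prompting for the key every time can get tedious. One possible pattern, sketched below, is to prompt only when the environment variable is not already set. The helper name `ensure_api_key`, the `DEMO_API_KEY` variable and the stand-in prompt functions are assumptions made for this example so that it runs without user input.

```python
import getpass
import os


def ensure_api_key(env_var: str = "OPENAI_API_KEY", ask=getpass.getpass) -> str:
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration.
    """
    key = os.environ.get(env_var)
    if not key:
        key = ask(f"{env_var}: ")
        os.environ[env_var] = key
    return key


# Demo with a made-up variable and fake prompt functions (no real key involved).
os.environ.pop("DEMO_API_KEY", None)
first = ensure_api_key("DEMO_API_KEY", ask=lambda prompt: "sk-demo")
second = ensure_api_key("DEMO_API_KEY", ask=lambda prompt: "never-called")

print(first == second)
```

In the notebook itself, a plain `ensure_api_key()` call would fall back to `getpass.getpass` exactly as the cell above does.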


## Store the document chunks in a SingleStore database table

Action required: Make sure you have selected the workspace and the database where you want to store your data.

In [17]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import SingleStoreDB

# Embedding model used to turn each chunk into a vector
# (reads OPENAI_API_KEY from the environment)
embedding = OpenAIEmbeddings()

# Embed every chunk and write the text plus its vector to the "pdf_wes" table.
# distance_strategy is left at its default here.
docsearch = SingleStoreDB.from_documents(
    texts,
    embedding,
    table_name = "pdf_wes",
)



## Let us check the text chunks and associated embeddings stored inside our database

In [261]:
%%sql
select * from pdf_wes limit 1;

RuntimeError: (singlestoredb.exceptions.ProgrammingError) 1146: Table 'database_c229c.pdf_wes' doesn't exist
[SQL: select * from pdf_wes limit 1;]
(Background on this error at: https://sqlalche.me/e/20/f405)


## Query your custom data (the PDF you loaded) with similarity search to retrieve the top-k closest chunks

In [28]:
query = "What India's GDP growth is projected to be?"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)

The outlook for South Asia has deteriorated and is subject to multiple downside risks amid global monetary tightening, fiscal vulnerabilities, rising inflation and extreme weather events. Regional GDP growth is expected to slow to 4.8 per cent in 2023 from an estimated 5.6 per cent expansion in 2022. Overall, weaker global demand, tighter monetary policy, additional supply disruptions, further escalation in commodity prices and the emergence of new COVID-19 variants pose significant risks in 2023.

India’s GDP growth rate is projected to moderate to 5.8 per cent in 2023 from an estimated 6.4 per cent in 2022 as higher interest rates and a global economic slowdown will weigh on investment and export performance (figure III.14). The outlook is more challenging for other countries in the region. In Pakistan, the economy is expected to expand by only 2.5 per cent in 2023 as devastating floods in 2022 caused significant damages, particularly for agriculture, with spillover effects on relate
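
Conceptually, a similarity search embeds the query, scores it against every stored chunk vector, and returns the k best matches. The call above passes no `k`, so the store's default applies. The sketch below shows the ranking step in plain Python with cosine similarity. The three-dimensional vectors and chunk labels are toy values invented for illustration (real embeddings have many more dimensions), and which metric the database actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=4):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "table" of (chunk text, embedding) rows; values are made up.
toy_rows = [
    ("chunk about India GDP", [0.9, 0.1, 0.0]),
    ("chunk about Pakistan floods", [0.6, 0.4, 0.1]),
    ("chunk about commodity prices", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # stand-in for the embedded user query

results = top_k(toy_query, toy_rows, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking inside the engine, so only the winning rows travel back to the notebook.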

## Here is the augmented response to the user query

In [29]:
import openai

prompt = f"The user asked: {query}. The most similar text from the document is: {docs[0].page_content}"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print(response['choices'][0]['message']['content'])

India’s GDP growth rate is projected to moderate to 5.8 per cent in 2023 from an estimated 6.4 per cent in 2022.
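
The prompt above passes only the single best match, `docs[0]`, to the model. A possible variation, sketched here under the assumption that several retrieved chunks fit in the model's context window, is to number the top chunks, cap their total length, and tell the model to stay within that context. The helper `build_prompt`, its `max_chars` budget and the instruction wording are illustrative choices, not part of the original notebook. The example chunks are shortened excerpts from the search result shown earlier.

```python
def build_prompt(query: str, chunks: list[str], max_chars: int = 6000) -> str:
    """Assemble a retrieval-augmented prompt from several chunks.

    Chunks are added in rank order until the character budget is used up.
    Hypothetical helper for illustration.
    """
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_chars:
            break  # stop before exceeding the budget
        context_parts.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


example_chunks = [
    "India's GDP growth rate is projected to moderate to 5.8 per cent in 2023.",
    "Regional GDP growth is expected to slow to 4.8 per cent in 2023.",
]
example_prompt = build_prompt("What India's GDP growth is projected to be?", example_chunks)
print(example_prompt)
```

With the notebook's objects, the call would look like `build_prompt(query, [d.page_content for d in docs])`, and the result can be sent as the user message in place of `prompt`.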


## Let’s test the model without the knowledge base (custom documents such as the PDF)

In [31]:
from langchain.llms import OpenAI
llm = OpenAI(temperature=0.8)



In [33]:
llm.predict("What India's GDP growth is projected to be?")

'\n\nAs an AI, I do not have access to current or future economic data. It is best to consult a reputable source such as the World Bank or International Monetary Fund for accurate GDP projections.'