## ThirdAI's NeuralDB

NeuralDB, as the name suggests, combines a neural network with a database. It provides a high-level API for inserting different types of files and searching their contents with natural language queries. The neural network enables semantic search, while the database stores the paragraphs of the inserted files.

First, let's install the dependencies.

In [10]:
!pip3 install thirdai --upgrade
!pip3 install thirdai[neural_db]
!pip3 install langchain --upgrade
!pip3 install openai --upgrade
!pip3 install paper-qa --upgrade



In [None]:
from thirdai import licensing, neural_db as ndb
licensing.deactivate()
licensing.activate("1FB7DD-CAC3EC-832A67-84208D-C4E39E-V3")

Now, let's create a NeuralDB instance using the module we just imported.

In [11]:
# Any username works; in the future, this username will let you push models to the model hub.
db = ndb.NeuralDB(user_id="my_user")

### You can even load a base DB from our Bazaar (optional but recommended)

We have a model Bazaar that provides domain-specific NeuralDBs to jumpstart search on your private documents. The Bazaar has two main types of DBs:

1. Base DBs: These come with models that have either general QnA capabilities or domain-specific capabilities, such as search on medical documents, financial documents, or contracts. They come with an empty data index into which users can insert their files.

2. Pre-Indexed DBs: These are ready-to-search DBs that come with pre-trained models and their corresponding datasets. They are meant for searching large public datasets such as PubMed, Amazon 3MM Products, or Stack Overflow issues.

In [12]:
# Set up a cache directory for models downloaded from the Bazaar
from pathlib import Path
from thirdai.neural_db import Bazaar

cache_dir = Path("bazaar_cache")
cache_dir.mkdir(exist_ok=True)  # no-op if the directory already exists
bazaar = Bazaar(cache_dir=cache_dir)


Call fetch() to refresh the list of available DBs.

In [13]:
bazaar.fetch() # Optional arg filter="model name" to filter by model name.


Below is the list of all DBs in the Bazaar.

In [14]:
print(bazaar.list_model_names())


['Contract Review', 'General QnA', 'Finance QnA']


Finally, load the DB. Note that this replaces the empty `db` created earlier.

In [15]:
db = bazaar.get_model("General QnA")

### Insert your files

Let's insert things into it!

Currently, we natively support adding CSV, PDF, and DOCX files. We also support automatically scraping and parsing URLs. All other file formats have to be converted into CSV files in which each row represents a paragraph or text chunk of the document.

#### Example 1: CSV files
The first example below shows how to insert a CSV file. Note that the CSV file must have a column named "DOC_ID" whose rows are numbered from 0 to n_rows-1.
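
For formats that are not natively supported, one way to produce a CSV in the required shape is sketched below. It uses only the Python standard library. The helper names `split_paragraphs` and `paragraphs_to_csv`, the `TEXT` column name, and the choice to split paragraphs on blank lines are assumptions made for illustration; they are not part of the ThirdAI API.

```python
import csv
import re


def split_paragraphs(raw_text):
    """Split text on blank lines, normalize whitespace, and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", raw_text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def paragraphs_to_csv(paragraphs, csv_path, text_column="TEXT"):
    """Write one paragraph per row, with DOC_ID running from 0 to n_rows-1."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["DOC_ID", text_column])
        for doc_id, paragraph in enumerate(paragraphs):
            writer.writerow([doc_id, paragraph])
    return len(paragraphs)
```

A file written this way could then be wrapped with `ndb.CSV`, using `id_column="DOC_ID"` and listing the hypothetical `TEXT` column among the weak and reference columns.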

In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [55]:
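insertable_docs = []
csv_files = ['Stocks_Dataset.csv']

for file in csv_files:
    csv_doc = ndb.CSV(
        path=file,
        id_column="DOC_ID",
        # strong_columns: columns that carry the strongest search signal
        strong_columns=["date", "open", "high", "low", "close", "volume", "Name"],
        # weak_columns: columns that carry weaker, supporting signal
        weak_columns=["high", "low"],
        # reference_columns: columns returned with each search result
        reference_columns=["date", "open", "high", "low", "close", "volume", "Name"])
    insertable_docs.append(csv_doc)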


#### Example 2: PDF files

In [42]:
insertable_docs = []
pdf_files = ['analysis.pdf']

for file in pdf_files:
    pdf_doc = ndb.PDF(file)
    insertable_docs.append(pdf_doc)

### Insert into NeuralDB

If you wish to insert without unsupervised training, you can set 'train=False' in the insert() method.

In [21]:
source_ids = db.insert(insertable_docs, train=False)

The above command is intended to be used with a base DB which already has reasonable knowledge of the domain. In general, we always recommend using 'train=True' as shown below.

#### Insert and Train

In [43]:
source_ids = db.insert(insertable_docs, train=True)

loaded data | source 'Documents:
analysis.pdf' | vectors 1326 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 2467 | train_hash_precision@5=0.384465  | train_batches 1 | time 7s

train | epoch 1 | train_steps 2468 | train_hash_precision@5=0.773152  | train_batches 1 | time 4s

train | epoch 2 | train_steps 2469 | train_hash_precision@5=0.935445  | train_batches 1 | time 13s

train | epoch 3 | train_steps 2470 | train_hash_precision@5=0.966817  | train_batches 1 | time 8s

train | epoch 4 | train_steps 2471 | train_hash_precision@5=0.982956  | train_batches 1 | time 9s

train | epoch 5 | train_steps 2472 | train_hash_precision@5=0.99095  | train_batches 1 | time 4s



If you call the insert() method multiple times, the documents will automatically be de-duplicated. If train=True, training will run again on each call.
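
One common way to make repeated inserts safe is to key each document by a hash of its contents and skip keys that are already present. The sketch below illustrates that idea with the standard library only. The `DocumentStore` class and its methods are hypothetical, and this is not necessarily how NeuralDB de-duplicates internally.

```python
import hashlib


class DocumentStore:
    """Toy store that ignores documents whose content was already inserted."""

    def __init__(self):
        self._docs = {}  # content hash -> document text

    def insert(self, documents):
        """Insert documents and return one content-hash ID per input document."""
        source_ids = []
        for text in documents:
            doc_id = hashlib.sha1(text.encode("utf-8")).hexdigest()
            # setdefault keeps the existing entry if this hash is already stored
            self._docs.setdefault(doc_id, text)
            source_ids.append(doc_id)
        return source_ids

    def __len__(self):
        return len(self._docs)


store = DocumentStore()
first_ids = store.insert(["analysis text", "stocks text"])
second_ids = store.insert(["analysis text"])  # duplicate of an earlier document
print(len(store))  # 2: the duplicate did not create a new entry
```

In this sketch the duplicate insert returns the same ID as the original, so callers can keep using the returned IDs without checking whether a document was new.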

### Search

Now let's start searching.

In [56]:
search_results = db.search(
    query="what was in the dataset?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

The Gold to Dow ratio reached an all-time-high of 1.01 in January 1980 when the price of gold hit $878 and the Dow Jones was trading in a range. Since stocks outperformed gold almost uninterruptedly for 2 decades until August 1999 when the ratio reached an all time low of 0.02.
************
According to Algorithm 1 we should build the forecast model for the unconstrained four- dimensional time series data {Yt}T t=1. Without loss of generality we first assume that all time series in Yt are stationary then a p-order (p  1) VAR model denoted by VAR(p) can be formulated as: Yt = a + A1Yt-1 + ... + ApYt-p + wt = a + p  j=1 AjYt-j + wt t = (p + 1) ... T (11) where Yt-j is the j-th lag of Yt; a = (a1 a2 a3 a4)T is a four-dementional vector of intercepts; Aj stands for the time-invariant 4 x 4 coefficient matrix; and wt = (w(1) t w(2) t w(3) t w(4) t )T is a four-dementional error term vector satisfying: (1) Mean zero: E(wt) = 0; (2) No correlation across time: E(wT t-kwt) = 0 for any non-zero

The search returns the two passages from the inserted documents that NeuralDB ranks as most relevant to the query.

In [57]:
search_results = db.search(
    query="General forecasting framework for OHLC data",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

If limit-up(limit-down) happens we firstly multiply x(c)(x(o)) and x(h) by 1.1 to make a relatively large interval. And then conduct measurements given in circumstances (2) and (3). In summary the general forecasting framework for OHLC data with T periods is described in Algorithm 1.
************
Algorithm 1 General forecasting framework for OHLC data 1: Get the raw candlestick charts with T periods from the capital market; 2: Extract the four-dimensional time series data of the candlestick charts record as {Xt}T t=1; 3: Conduct transformation method to {Xt}T t=1 and obtain {Yt}T t=1 according to Eq.
************


We can see that the search pulled up the right passages again: both results mention the "general forecasting framework for OHLC data".

Now let's ask a tricky question.

In [58]:
search_results = db.search(
    query="AIC is formulated as",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

A trade-off must be evaluated to choose p the common used criterions in practice are AIC BIC and HQ (Hannan-Quinn). In this paper we prefer AIC because of its conciseness which is formulated as AIC(p) = ln 4 i=1 T j=1 ^u2 ij T + 2pK2 T (15) 14 where T stands for the total period number of OHLC series p is VAR lag order K is the VAR dimension and ^uij = ^Y (i) j - Y (i) j (1 <= i <= 4 1 <= j <= T) represents for the residuals of the VAR model.
************
Finally the simulated OHLC data {Xt}T t=1 are generated by applying the inverse transformation formula in Eq. (9). In order to evaluate the performance of the proposed method with different variance com- ponent levels we consider the following scenarios: Scenario 1: p = 1 T = 220 Y1 = [4 0.7 -0.85 0]T and A1 =             0.55 0.12 0.12 0.12 0.12 0.55 0.12 0.12 0.12 0.12 0.55 0.12 0.12 0.12 0.12 0.55             and Sw is a 4 x 4 diagonal matrix with diagonal element being 0.052 i.e. Sw = diag{0.052 0.052 0.052 0.052}.
************


### Get Answers from OpenAI using LangChain

In this section, we show how to use LangChain to query OpenAI's chat model and generate an answer from the references retrieved from the DB above. You'll have to specify your own OpenAI key for this to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.
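
The retrieve-then-generate flow can be sketched independently of any particular model. In the minimal example below, the `build_prompt` helper and the prompt wording are hypothetical; only the idea of joining the top references with blank lines into one context comes from this notebook.

```python
def build_prompt(question, references, max_references=3):
    """Join the top references into a context block and append the question."""
    context = "\n\n".join(references[:max_references])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages stand in for the texts returned by a search.
prompt = build_prompt(
    "AIC is formulated as",
    ["passage one", "passage two", "passage three", "passage four"],
)
print(prompt)
```

The resulting string can be sent to whichever generative model you choose, which is why swapping OpenAI for an open-source model does not require changing the retrieval step.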

In [59]:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # replace with your own key; never commit a real key

In [36]:
from langchain.chat_models import ChatOpenAI
from paperqa.prompts import qa_prompt
from paperqa.chains import make_chain

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.1,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [37]:
def get_references(query):
    # Retrieve the text of the top 3 passages for the query
    search_results = db.search(query, top_k=3)
    return [result.text for result in search_results]

def get_answer(query, references):
    # Join the top 3 references into one context block for the QA chain
    return qa_chain.run(
        question=query,
        context='\n\n'.join(references[:3]),
        answer_length="about 50 words")

In [60]:
query = "AIC is formulated as"

references = get_references(query)
print(references)

['A trade-off must be evaluated to choose p the common used criterions in practice are AIC BIC and HQ (Hannan-Quinn). In this paper we prefer AIC because of its conciseness which is formulated as AIC(p) = ln 4 i=1 T j=1 ^u2 ij T + 2pK2 T (15) 14 where T stands for the total period number of OHLC series p is VAR lag order K is the VAR dimension and ^uij = ^Y (i) j - Y (i) j (1 <= i <= 4 1 <= j <= T) represents for the residuals of the VAR model.', 'Finally the simulated OHLC data {Xt}T t=1 are generated by applying the inverse transformation formula in Eq. (9). In order to evaluate the performance of the proposed method with different variance com- ponent levels we consider the following scenarios: Scenario 1: p = 1 T = 220 Y1 = [4 0.7 -0.85 0]T and A1 =             0.55 0.12 0.12 0.12 0.12 0.55 0.12 0.12 0.12 0.12 0.55 0.12 0.12 0.12 0.12 0.55             and Sw is a 4 x 4 diagonal matrix with diagonal element being 0.052 i.e. Sw = diag{0.052 0.052 0.052 0.052}.', 'Special offer: get a

In [50]:
answer = get_answer(query, references)

print(answer)

AIC is formulated as AIC(p) = ln 4 i=1 T j=1 ^u2 ij T + 2pK2 T (15) 14, where p is the VAR lag order, T is the total period number of OHLC series, K is the VAR dimension, and ^uij = ^Y (i) j - Y (i) j represents the residuals of the VAR model (1 <= i <= 4, 1 <= j <= T) (Example2012).


### Load and Save
As usual, saving and loading the DB are one-liners.

In [40]:
# save your db
db.save("data.db")

# from_checkpoint returns a new NeuralDB, so assign the result.
# The optional progress handler receives a fraction between 0 and 1.
db = ndb.NeuralDB.from_checkpoint("data.db", on_progress=lambda fraction: print(f"{fraction:.0%} done with loading."))

17% done with loading.
33% done with loading.
50% done with loading.
67% done with loading.
83% done with loading.
100% done with loading.