In [4]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-large-msmarco")
model = AutoModelForMaskedLM.from_pretrained("OpenMatch/cocodr-large-msmarco")

Some weights of the model checkpoint at OpenMatch/cocodr-large-msmarco were not used when initializing BertForMaskedLM: ['norm.bias', 'loss.count_cat', 'embeddingHead.weight', 'loss.sum_losses', 'classifier.weight', 'classifier.bias', 'embeddingHead.bias', 'norm.weight', 'loss.h_fun']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at OpenMatch/cocodr-large-msmarco and are newly initialized: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias'

In [18]:
from gpt_index import GPTSimpleVectorIndex

In [9]:
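import json
import os

# Read the API key from a local JSON file rather than hard-coding it in the notebook.
with open("creds.json", "r") as f:
    creds = json.load(f)

# Export the key through the environment variable that the index code reads.
os.environ['OPENAI_API_KEY'] = creds['OPENAI_API_KEY']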

In [20]:
from pathlib import Path
from gpt_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('../data/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf'))

In [13]:
index = GPTSimpleVectorIndex(documents)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 543794 tokens
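
The log above reports 543794 embedding tokens for building the index. A rough cost estimate can be sketched as below. The helper name `embedding_cost_usd` and the per-1K-token rate are assumptions for illustration only; the rate is a placeholder, not a quoted price, and should be replaced with the current rate for whichever embedding model is in use.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimate the cost in USD of embedding `total_tokens` tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Token count copied from the build_index_from_documents log line.
EMBEDDING_TOKENS = 543794

# ASSUMPTION: placeholder rate, not a quoted price. Check your provider's pricing.
ASSUMED_USD_PER_1K_TOKENS = 0.0004

cost = embedding_cost_usd(EMBEDDING_TOKENS, ASSUMED_USD_PER_1K_TOKENS)
print(f"Estimated embedding cost: ${cost:.4f}")
```

Because the cost scales linearly with the token count, saving the index to disk and reloading it avoids paying this again on every run.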


In [17]:
index.save_to_disk('index.json')

In [None]:
index = GPTSimpleVectorIndex.load_from_disk('index.json')
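
Since building the index re-embeds the whole PDF, it may be convenient to wrap the save and load steps in a single "load if present, otherwise build" helper. The sketch below is deliberately generic: `load_or_build` and its parameter names are hypothetical, and the `gpt_index` calls appear only in the usage comment, copied from the cells above.

```python
from pathlib import Path


def load_or_build(path, load, build, save):
    """Return a cached object from `path` if it exists, else build and save it.

    load(path_str)      -> object   reads the cached object from disk
    build()             -> object   creates the object from scratch
    save(obj, path_str) -> None     writes the object to disk
    """
    path = Path(path)
    if path.exists():
        return load(str(path))
    obj = build()
    save(obj, str(path))
    return obj


# Possible usage with the objects defined in this notebook (not executed here):
# index = load_or_build(
#     'index.json',
#     load=GPTSimpleVectorIndex.load_from_disk,
#     build=lambda: GPTSimpleVectorIndex(documents),
#     save=lambda idx, p: idx.save_to_disk(p),
# )
```

Note that the cache is keyed only on the file name, so a stale `index.json` has to be deleted by hand if the source PDF changes.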

In [36]:
all_responses = []

In [34]:
def args_to_dict(**kwargs):
    """Capture keyword arguments as a plain dict so each query's settings can be logged."""
    return dict(kwargs)

In [32]:
# Retrieve the 5 most similar chunks and synthesize an answer from them.
response = index.query(query_str="How do you set up support vector machine equations?", similarity_top_k=5)
# Run time: 1m 41.3s
print(response)



To set up a support vector machine equation, one must first define a kernel function k(x,x') and a Lagrange multiplier an for each data point xn. The equation is then expressed in terms of the kernel function as y(x) = N∑n=1(an - ̂an)k(x,xn) + b, where b is a bias parameter. The constraints for the Lagrange multipliers are an ≥ 0, ̂an ≥ 0, an ≤ C, and ̂an ≤ C, where C is a parameter that controls the complexity of the model. Additionally, the Karush-Kuhn-Tucker (KKT) conditions must be satisfied, which state that at the solution the product of the dual variables and the constraints must vanish. This is expressed as an(ε1 + ξn + yn - tn) = 0, ̂an(ε1 + ̂ξn - yn + tn) = 0, (C - an)ξn = 0, and (C - ̂an)̂ξn = 0. Furthermore, a prior distribution over the parameter vector w and a zero


In [37]:
all_responses.append({
    "query": args_to_dict(query_str="How do you set up support vector machine equations?", similarity_top_k=5),
    "response": response
})

In [33]:
# Same question, but only the single most similar chunk is retrieved.
response2 = index.query(query_str="How do you set up support vector machine equations?", similarity_top_k=1)
print(response2)
# Run time: 16.5s

INFO:root:> [query] Total LLM token usage: 4254 tokens
INFO:root:> [query] Total embedding token usage: 10 tokens




To set up support vector machine equations, one must first define a linear model of the form y(x) = wTφ(x), where w is the weight parameter and φ(x) is the nonlinear basis function. The conditional distribution for a real-valued target variable t, given an input vector x, is then given by p(t|x,w,β) = N(t|y(x),β-1). The prior distribution over the parameter vector w is a zero-mean Gaussian prior with a separate hyperparameter αi for each of the weight parameters wi, as expressed in equation (7.80). Finally, predictions are expressed as linear combinations of kernel functions that are centered on training data points and that are required to be positive definite.


In [38]:
all_responses.append({
    "query": args_to_dict(query_str="How do you set up support vector machine equations?", similarity_top_k=1),
    "response": response2
})
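
The query, print, and append steps above repeat for each `similarity_top_k` value, with the query arguments typed twice. One way to avoid that duplication is a small wrapper that runs the query, times it, and records everything in one call. This is only a sketch: `run_and_record` is a hypothetical helper, and it assumes nothing about the index beyond the `index.query(**kwargs)` call already used in this notebook. It stores `str(response)` rather than the response object, which is a design choice made here and differs from the cells above.

```python
import time


def run_and_record(index, store, **query_kwargs):
    """Run index.query(**query_kwargs), append a record to `store`, return the response."""
    start = time.perf_counter()
    response = index.query(**query_kwargs)
    elapsed = time.perf_counter() - start
    store.append({
        "query": dict(query_kwargs),      # the exact arguments that were used
        "response": str(response),        # plain text, independent of library classes
        "seconds": round(elapsed, 1),     # wall-clock time for the query call
    })
    return response


# Possible usage with the index from this notebook (not executed here):
# for k in (5, 1):
#     run_and_record(all_responses, store=all_responses, ...)  # see note below
# run_and_record(index, all_responses,
#                query_str="How do you set up support vector machine equations?",
#                similarity_top_k=5)
```

The first argument is always the index and the second is the list that collects records; every remaining keyword argument is forwarded to `index.query` unchanged.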

In [40]:
import pickle
with open('all_responses.pkl', 'wb') as f:
    pickle.dump(all_responses, f)
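
To compare the saved answers in a later session, the pickle can be read back and summarized. The sketch below assumes each record is a dict with `"query"` and `"response"` keys, as built above; `load_responses` and `summarize` are hypothetical helper names. If the records hold library response objects rather than strings, unpickling will need the same `gpt_index` version to be importable.

```python
import pickle


def load_responses(path):
    """Load the list of query/response records written with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(records, width=80):
    """Return one line per record: the top_k setting and the start of the answer."""
    lines = []
    for rec in records:
        k = rec["query"].get("similarity_top_k")
        # Collapse newlines and repeated spaces so each answer fits on one line.
        text = " ".join(str(rec["response"]).split())
        lines.append(f"top_k={k}: {text[:width]}")
    return lines


# Possible usage (not executed here):
# for line in summarize(load_responses('all_responses.pkl')):
#     print(line)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.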