## Vector DB with metadata filters

- prepare a list of keywords
- for each keyword, build a list of article links with their dates, and check that no link appears more than once
- for each article, generate a PDF named date + random suffix
- pass it straight to the DB, with the date as metadata
- don't save the PDF to disk, to save storage space
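
The link and naming steps above can be sketched as follows. This is a minimal sketch, not the final pipeline: the helper names `dedupe_links` and `make_pdf_name`, the sample URLs, and the `date_randomhex.pdf` naming pattern are all assumptions made for illustration.

```python
import secrets
from datetime import date


def dedupe_links(articles_by_keyword):
    """Keep only the first occurrence of each link across all keywords."""
    seen = set()
    unique = {}
    for keyword, articles in articles_by_keyword.items():
        kept = []
        for link, published in articles:
            if link not in seen:
                seen.add(link)
                kept.append((link, published))
        unique[keyword] = kept
    return unique


def make_pdf_name(published, n_random_bytes=4):
    """Build a 'date + random' name: ISO date, underscore, random hex suffix."""
    return f"{published.isoformat()}_{secrets.token_hex(n_random_bytes)}.pdf"


# Hypothetical sample data: the same article is found under two keywords.
articles_by_keyword = {
    "oil": [
        ("https://example.com/a", date(2024, 5, 1)),
        ("https://example.com/b", date(2024, 5, 2)),
    ],
    "OPEC": [
        ("https://example.com/a", date(2024, 5, 1)),
        ("https://example.com/c", date(2024, 5, 3)),
    ],
}
unique = dedupe_links(articles_by_keyword)
name = make_pdf_name(date(2024, 5, 1))
```

Since the PDF is not meant to be saved, the generated name would mainly serve as a document identifier in the metadata.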

# Setup

In [1]:
# pip install psycopg2-binary pgvector asyncpg "sqlalchemy[asyncio]" greenlet
# pip install llama-index-readers-file pymupdf
# pip install llama-index-vector-stores-postgres
# pip install llama-index-embeddings-huggingface
# pip install llama-index-llms-llama-cpp
# pip install llama-index-llms-openai

# need keywords for the energy, financials and tech sectors (for now); consumer cyclical is listed but not used yet
# I don't think I should add the names of CEOs etc., because they would almost always be mentioned together with the company, so it's inefficient
# energy: XOM, CVX, COP, EOG, SLB, Exxon, Chevron, ConocoPhillips, EOG Resources Inc, Schlumberger NV
# oil, gas, energy, OPEC, power, electricity, green, utilities
# financials: JPM, BAC, WFC, AXP, MS, JPMorgan, Bank of America, Wells Fargo, American Express, Morgan Stanley
# bank, interest rates, savings, investment, regulation, inflation, employment, stock, bond, FED, SEC, NYSE, NASDAQ, S&P500

# tech: MSFT, AAPL, NVDA, GOOGL, META, Microsoft, Apple, Nvidia, Alphabet, Meta
# AI, Google, cybersecurity, fintech, data, cloud


# CC: AMZN, TSLA, HD, MCD, DIS, Amazon, Tesla, Home Depot, McDonald's, Disney
# not included for now

# keywords = []

keywords = [
    # Energy Sector
    "XOM", "CVX", "COP", "EOG", "SLB",
    "Exxon", "Chevron", "ConocoPhillips", "EOG Resources Inc", "Schlumberger NV",
    "oil", "gas", "energy", "OPEC", "power",
    "electricity", "green", "utilities",

    # Financials Sector
    "JPM", "BAC", "WFC", "AXP", "MS",
    "JPMorgan", "Bank of America", "Wells Fargo", "American Express", "Morgan Stanley",
    "bank", "interest rates", "savings", "investment", "regulation",
    "inflation", "employment", "stock", "bond", "FED",
    "SEC", "NYSE", "NASDAQ", "S&P500",

    # Tech Sector
    "MSFT", "AAPL", "NVDA", "GOOGL", "META",
    "Microsoft", "Apple", "Nvidia", "Alphabet", "Meta",
    "AI", "Google", "cybersecurity", "fintech", "data", "cloud"
]



Sentence transformers setup

In [1]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

In [2]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

Postgres setup

In [4]:
import psycopg2

db_name = "vector_db"
host = "localhost"
password = "password"
port = "5432"
user = "maja2"
conn = psycopg2.connect(
    dbname="postgres",
    host=host,
    password=password,
    port=port,
    user=user,
)
conn.autocommit = True

# Recreate the database from scratch; this deletes any existing data in it.
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")

In [5]:
from sqlalchemy import make_url
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=host,
    password=password,
    port=port,
    user=user,
    table_name="llama2_paper",
    embed_dim=384,  # output dimension of BAAI/bge-small-en
)
# # hybrid
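# url = make_url(f"postgresql://{user}:{password}@{host}:{port}")  # make_url expects a connection string, not a psycopg2 connection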
# hybrid_vector_store = PGVectorStore.from_params(
#     database=db_name,
#     host=url.host,
#     password=url.password,
#     port=url.port,
#     user=url.username,
#     table_name="paul_graham_essay_hybrid_search",
#     embed_dim=1536,  # openai embedding dimension
#     hybrid_search=True,
#     text_search_config="english",
# )

# Ingestion Pipeline

Load data

In [6]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

In [7]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data2/llama2.pdf")

Use a Text Splitter to Split Documents

In [8]:
from llama_index.core.node_parser import SentenceSplitter

In [9]:
text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

In [10]:
text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

Manually Construct Nodes from Text Chunks / Add metadata

In [12]:
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
    
print(len(nodes))
print(nodes[0])

107
Node ID: d757e0c0-966b-41d3-a325-eecfd08cc82e
Text: Llama 2: Open Foundation and Fine-Tuned Chat Models Hugo
Touvron∗ Louis Martin† Kevin Stone† Peter Albert Amjad Almahairi
Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti
Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian
Fuller Cynthia Gao...
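
The plan at the top calls for storing the article date as metadata, while the loop above only copies the reader's metadata onto each node. A minimal sketch of adding the date, assuming a hypothetical helper `with_date_metadata`, a metadata key named `date`, and sample metadata values invented for illustration:

```python
from datetime import date


def with_date_metadata(doc_metadata, published):
    """Return a copy of the document metadata with an ISO date string added."""
    metadata = dict(doc_metadata)  # copy, so chunks do not share one dict
    metadata["date"] = published.isoformat()  # e.g. "2024-05-01"
    return metadata


# Hypothetical document metadata; the real keys depend on the reader.
doc_metadata = {"source": "1", "file_path": "./data2/llama2.pdf"}
chunk_metadata = with_date_metadata(doc_metadata, date(2024, 5, 1))
```

In the node loop this would replace the plain assignment, along the lines of `node.metadata = with_date_metadata(src_doc.metadata, published)`, where `published` is the article's date. Storing the date as a `YYYY-MM-DD` string is a design choice: such strings sort in the same order as the dates they represent, so range comparisons work even on text values.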


Generate Embeddings for each Node

In [15]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding



In [18]:
node = nodes[0]
print("Text:", node.text)
print("Metadata:", node.metadata)
print("Embedding:", getattr(node, 'embedding', 'No embedding found'))

Text: Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey 

Load Nodes into a Vector Store

In [15]:
vector_store.add(nodes)

['422406ad-f62c-49da-bf3b-c9ac7740d3fc',
 'bf46ed30-c6f2-4a96-a561-6d6a35870306',
 '4ec4d802-3658-4140-99e0-c5505ecb5196',
 '7c0934dc-f564-4630-8b2c-44e50dac9267',
 '7bea4802-4717-4e35-b919-840ce03fafed',
 '0815dee1-fd30-4d0e-a4ac-8af98a3a3f96',
 '240b44ff-0f64-4e15-84ad-b9757d98ca20',
 '243d9fc4-6afa-430e-8c7a-5ac14b419647',
 '1f2215a3-db8f-4bea-b784-36bf92a3fcb9',
 'd8f24aee-9876-4dd1-911a-8fce33eeb453',
 '3c04dd71-b449-4ba3-8874-6b3365195078',
 '7f5eae28-949a-404b-8ee7-3bf2d12dee29',
 '09c2a011-890e-4871-b6f2-fb4081d0f661',
 '26d85b6c-9ef3-4dcc-af7b-b5d9ae01cab1',
 'f852c12e-3606-4fbd-a8c2-067e7327d35d',
 '00242be6-5c1e-4d36-bda4-88340aaf852a',
 '3ccd93c8-ef20-413a-855f-efae7bd280fa',
 'abaa9e09-064e-4647-8048-25d681e8036e',
 '4661625f-28ec-4db6-b665-c4e89bd4250f',
 'e7b53c06-6c2e-472f-9db3-7b7b2672003e',
 '7d71b4e2-8f4f-4123-ab96-67532d0d9380',
 '2fd3eb09-6a51-47a2-89fe-b08c7bec8bb7',
 'a5c9e867-5c58-4e2f-bc9a-0197b298df15',
 'd3e219d7-c612-45b9-8d5a-c8d967265948',
 '55147c50-35e4-

# Retrieval Pipeline

In [16]:
query_str = "Can you tell me about the key concepts for safety finetuning"

Generate a Query Embedding

In [17]:
query_embedding = embed_model.get_query_embedding(query_str)

Query the Vector Database

In [18]:
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"


vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)

In [19]:
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

TruthfulQA ↑
ToxiGen ↓
MPT
7B
29.13
22.32
30B
35.25
22.61
Falcon
7B
25.95
14.53
40B
40.39
23.44
Llama 1
7B
27.42
23.00
13B
41.74
23.08
33B
44.19
22.57
65B
48.71
21.77
Llama 2
7B
33.29
21.25
13B
41.86
26.10
34B
43.45
21.19
70B
50.18
24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the
percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we
present the percentage of toxic generations (the smaller, the better).
Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general
patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have
on people or real-world outcomes; that would require study of end-to-end product deployments. Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed. For this, it may be necessary
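
The query above ranks every stored chunk. With a date in the metadata, the search could first be restricted to a date range. The sketch below shows in plain Python what such a filtered top-k query computes. It is an illustration only, not how pgvector or `PGVectorStore` is implemented, and it assumes cosine similarity as the score, a `date` metadata key holding `YYYY-MM-DD` strings, and made-up two-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(query_embedding, rows, start_date, end_date, k=2):
    """rows: (node_id, embedding, metadata) tuples. Filter by date, then rank."""
    candidates = [
        (cosine_similarity(query_embedding, embedding), node_id)
        for node_id, embedding, metadata in rows
        if start_date <= metadata["date"] <= end_date
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [(node_id, score) for score, node_id in candidates[:k]]


# Hypothetical stored rows: only "b" and "c" fall inside May 2024.
rows = [
    ("a", [1.0, 0.0], {"date": "2024-04-30"}),
    ("b", [0.9, 0.1], {"date": "2024-05-02"}),
    ("c", [0.0, 1.0], {"date": "2024-05-03"}),
    ("d", [1.0, 0.0], {"date": "2024-06-01"}),
]
result = filtered_top_k([1.0, 0.0], rows, "2024-05-01", "2024-05-31", k=2)
```

Rows "a" and "d" match the query vector exactly but are excluded by the date range, so the filter changes which chunks can be returned, not just their order.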

Parse Result into a Set of Nodes

In [20]:
from llama_index.core.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

Put into a Retriever

In [21]:
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List, Optional


class VectorDBRetriever(BaseRetriever):
    """Retriever over a postgres vector store."""

    def __init__(
        self,
        vector_store: PGVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        # Use the objects passed to __init__, not the notebook globals.
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        # Pair each returned node with its similarity score, if available.
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

In [22]:
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

# RetrieverQueryEngine Response

In [29]:

from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

query_str = "Describe the RLHF impact of the temperature"
response = query_engine.query(query_str)

print(str(response))



The temperature parameter plays a crucial role in the exploration process during the RLHF training. It influences the diversity of the outputs generated by the model. A higher temperature allows for more varied responses, which can lead to better exploration of the output space. The optimal temperature is not static and changes during iterative model updates, indicating that RLHF affects how temperature is adjusted. Specifically, for the Llama 2-Chat-RLHF model, the optimal temperature when sampling between 10 and 100 outputs ranges from approximately 1.2 to 1.3. This necessitates a progressive re-adjustment of the temperature within a finite compute budget to maximize performance.


In [27]:
print(response.source_nodes[0].get_content())

Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots)
GPT-3.5
GPT-4
PaLM
PaLM-2-L
Llama 2
MMLU (5-shot)
70.0
86.4
69.3
78.3
68.9
TriviaQA (1-shot)
–
–
81.4
86.1
85.0
Natural Questions (1-shot)
–
–
29.3
37.5
33.0
GSM8K (8-shot)
57.1
92.0
56.5
80.7
56.8
HumanEval (0-shot)
48.1
67.0
26.2
–
29.9
BIG-Bench Hard (3-shot)
–
–
52.3
65.7
51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI