# Vector Store Notebook


In [1]:
import os
import logging
import warnings
from loguru import logger

logger.add("logs/vector_store.log")
logger.info("Starting Vector Store Notebook")

from dotenv import load_dotenv, find_dotenv
import nest_asyncio
nest_asyncio.apply()
load_dotenv(find_dotenv())

warnings.filterwarnings('ignore')
logging.getLogger().setLevel(logging.ERROR)

2025-03-29 16:49:44.904 | INFO     | __main__:<module>:7 - Starting Vector Store Notebook


In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from IPython.display import Markdown, display

from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

In [3]:
QDRANT_HOSTED_URL = os.getenv("QDRANT_HOSTED_URL")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
client = qdrant_client.QdrantClient(
    QDRANT_HOSTED_URL,
    api_key=QDRANT_API_KEY,
)
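`os.getenv` returns `None` when a variable is unset, so a missing `.env` entry may only surface later as a connection error. A minimal sketch of a fail-fast lookup follows; `require_env` and `DEMO_QDRANT_URL` are hypothetical names used for illustration and are not part of the notebook or of `qdrant_client`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Demonstration with a hypothetical variable so the snippet runs anywhere.
os.environ["DEMO_QDRANT_URL"] = "https://example.invalid:6333"
print(require_env("DEMO_QDRANT_URL"))
```

In the cell above, the same helper could wrap the two lookups, for example `QDRANT_HOSTED_URL = require_env("QDRANT_HOSTED_URL")`.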

In [4]:
from pathlib import Path

notebook_dir = Path().absolute()
file_path = str(notebook_dir / ".." / "data" / "paper" / "stanford-cynicism.pdf")

documents = SimpleDirectoryReader(input_files=[file_path]).load_data()

In [5]:
len(documents)

11

In [6]:
vector_store = QdrantVectorStore(
    client=client, 
    collection_name="stanford",
)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store
)

for doc in documents:
    print(f"Inserting document {doc.text[:50]}")
    index.insert(doc)

Inserting document  Semantic Analysis of Diversity Rhetoric and 
 Ide
Inserting document  This study focuses specifically on opinion pieces
Inserting document  text-embedding-3-large  model.  In this embedding
Inserting document  MFT posits that moral judgments are often rooted 
Inserting document  per year, the regression line (first graph on the
Inserting document  while the linguistic markers of diversity prolife
Inserting document  class  MFTAnalysis  (BaseModel): 
 authority_subv
Inserting document  Let’s look at observations in  Figure 3  . 
 The 
Inserting document  Discussion 
 Figure 3  shows that measures of cyn
Inserting document  ●  If students feel like only certain viewpoints 
Inserting document  Haidt, J.  The Righteous Mind: Why Good People ar
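Some previews above break across two lines because `doc.text[:50]` keeps the newlines embedded in the extracted PDF text. A small sketch of a whitespace-collapsing preview is shown below; `preview` is a hypothetical helper, and the sample string is the first 50 characters printed for the first document.

```python
def preview(text: str, width: int = 50) -> str:
    """Collapse runs of whitespace (including newlines), then truncate."""
    return " ".join(text.split())[:width]


# First 50 characters of the first document, as printed above.
sample = " Semantic Analysis of Diversity Rhetoric and \n Ide"
print(f"Inserting document {preview(sample)}")
```

Using `preview(doc.text)` in the loop would keep each log entry on a single line.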


In [7]:
plain_query_engine = index.as_query_engine()

In [8]:
response = plain_query_engine.query("What is the main idea of the paper?")
print(response)

The main idea of the paper is to highlight the importance of promoting genuine inclusion and diverse perspectives in educational institutions to combat rising cynicism among students. The paper emphasizes the need for schools to go beyond superficial diversity efforts and ensure that different viewpoints are truly welcomed to maintain an open and trusting intellectual environment. It suggests strategies such as encouraging open debate, promoting transparency in campus media, and reevaluating diversity policies to address the underlying cynicism and foster a more inclusive atmosphere in higher education.


## Which RAG is better?


In [9]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = OpenAI(model="gpt-4o-mini", temperature=0.01)
embedding = OpenAIEmbedding(model="text-embedding-3-small")

In [10]:
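from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.response_synthesizers import get_response_synthesizer, ResponseMode

# Second engine: same index, different retrieval and synthesis settings.
more_query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=8,  # pass 8 retrieved nodes to the synthesizer
    # MMR (maximal marginal relevance) trades relevance against redundancy.
    vector_store_query_mode=VectorStoreQueryMode.MMR,
    vector_store_kwargs={
        # Candidate pool size before MMR re-ranking; support depends on the vector store.
        "mmr_prefetch_k": 16,
    },
    response_synthesizer=get_response_synthesizer(
        # Build the answer by recursively summarizing the retrieved chunks.
        response_mode=ResponseMode.TREE_SUMMARIZE,
    ),
)
response = more_query_engine.query("What is the main idea of the paper?")
print(response)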

The main idea of the paper is to analyze the growing cynicism among students, particularly at Stanford University, by studying the narrowing range of topics discussed in the opinion sections of the Stanford Daily over a 15-year period. The study aims to understand how cynical rhetoric has evolved and how discrepancies between the rhetoric of diversity and the actual space for diverse ideas may contribute to the development of cynical attitudes. The paper also suggests potential strategies for fostering more inclusive conversations and reducing cynicism within academic communities.
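The second engine requests MMR retrieval with a prefetch pool of 16 and a final `similarity_top_k` of 8. The sketch below illustrates the general idea of maximal marginal relevance on toy 2-D vectors. It is not LlamaIndex's or Qdrant's implementation, and `mmr_select`, `trade_off`, and `unit` are hypothetical names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, candidates, top_k, trade_off=0.5):
    """Greedily pick indices that are relevant to the query but not redundant.

    trade_off=1.0 ranks by relevance only; lower values penalize
    similarity to items that were already selected.
    """
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < top_k:

        def score(i):
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1.0 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(degrees):
    """Unit vector at the given angle, to keep the toy example readable."""
    rad = math.radians(degrees)
    return [math.cos(rad), math.sin(rad)]


query = unit(45)
prefetched = [unit(40), unit(38), unit(80)]  # two near-duplicates, one outlier

print(mmr_select(query, prefetched, top_k=2, trade_off=0.5))  # [0, 2]
print(mmr_select(query, prefetched, top_k=2, trade_off=1.0))  # [0, 1]
```

With `trade_off=0.5` the near-duplicate at index 1 is skipped in favor of the less similar candidate, which is the behavior a larger prefetch pool is meant to enable.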


## Synthetic Data Generation (SDG)


In [11]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

from ragas.testset import TestsetGenerator
from ragas.llms import LlamaIndexLLMWrapper
from ragas.embeddings import LlamaIndexEmbeddingsWrapper

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# generator with openai models
generator_llm = OpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbedding(model="text-embedding-3-small")

generator = TestsetGenerator.from_llama_index(
    llm=generator_llm,
    embedding_model=embeddings,
)

In [12]:
from pathlib import Path

notebook_dir = Path().absolute()
cities_path = str(notebook_dir / ".." / "data" / "cities")

documents = SimpleDirectoryReader(cities_path).load_data()
len(documents)

5

In [13]:
testset = generator.generate_with_llamaindex_docs(
    documents,
    testset_size=10,
)

Applying HeadlinesExtractor:   0%|          | 0/5 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/5 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/5 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/75 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/145 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [14]:
df = testset.to_pandas()
df.head()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What role did the Massachusett people play in ... | `[History == === Indigenous era === Prior to Eu...` | Before European colonization, the region surro... | single_hop_specifc_query_synthesizer |
| 1 | How did the Napoleonic Wars impact Boston's tr... | `[impressed with the effort of Rufus Putnam tha...` | The Napoleonic Wars significantly curtailed Bo... | single_hop_specifc_query_synthesizer |
| 2 | What historical significance does Massachusett... | `[Boston is the capital and most populous city ...` | Massachusetts, and specifically Boston, holds ... | single_hop_specifc_query_synthesizer |
| 3 | How did the acquisition of FleetBoston Financi... | `[Boston declined economically as factories bec...` | The acquisition of FleetBoston Financial by Ch... | single_hop_specifc_query_synthesizer |
| 4 | What cities are near Everett in relation to Bo... | `[Geography == Boston has an area of 89.63 sq m...` | Everett is bordered to the northeast by the ci... | single_hop_specifc_query_synthesizer |


## Build a `QueryEngine`


In [15]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()

In [16]:
response_vector = query_engine.query(df["user_input"][0])

print(response_vector)

The Massachusett people established small, seasonal communities in present-day Boston before European colonization. They constructed one of the oldest fishweirs in New England on Boylston Street, indicating their presence in the region as early as 7,000 years before European arrival in the Western Hemisphere.


## Evaluating the `QueryEngine`


In [17]:
# import metrics
from ragas.metrics import (
    ContextPrecision,
    ContextRecall,
    Faithfulness,
    AnswerRelevancy,
    AnswerCorrectness
)

# init metrics with evaluator LLM
from ragas.llms import LlamaIndexLLMWrapper

evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    AnswerCorrectness(llm=evaluator_llm)
]

In [18]:
# convert to Ragas Evaluation Dataset
ragas_dataset = testset.to_evaluation_dataset()
ragas_dataset

EvaluationDataset(features=['user_input', 'reference_contexts', 'reference'], len=10)

In [19]:
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ragas_dataset,
)

Running Query Engine:   0%|          | 0/10 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [20]:
from pprint import pprint
pprint(result, indent=4)

{'faithfulness': 0.4380, 'answer_relevancy': 0.9568, 'context_precision': 0.4500, 'context_recall': 0.5667, 'answer_correctness': 0.6157}


In [21]:
result.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What role did the Massachusett people play in ... | `[== History ==\n\n\n=== Indigenous era ===\nPr...` | `[History == === Indigenous era === Prior to Eu...` | The Massachusett people established small, sea... | Before European colonization, the region surro... | 1.0 | 0.904087 | 1.0 | 1.0 | 0.61818 |
| 1 | How did the Napoleonic Wars impact Boston's tr... | `[=== Post-revolution and the War of 1812 ===\n...` | `[impressed with the effort of Rufus Putnam tha...` | The Napoleonic Wars significantly curtailed Bo... | The Napoleonic Wars significantly curtailed Bo... | 0.8 | 0.949569 | 1.0 | 1.0 | 0.579665 |
| 2 | What historical significance does Massachusett... | `[Boston is the capital and most populous city ...` | `[Boston is the capital and most populous city ...` | Massachusetts, particularly Boston, holds sign... | Massachusetts, and specifically Boston, holds ... | 1.0 | 0.976116 | 1.0 | 1.0 | 0.719933 |
| 3 | How did the acquisition of FleetBoston Financi... | `[=== Post-revolution and the War of 1812 ===\n...` | `[Boston declined economically as factories bec...` | The acquisition of FleetBoston Financial by Ba... | The acquisition of FleetBoston Financial by Ch... | 0.0 | 0.993672 | 0.0 | 0.0 | 0.740763 |
| 4 | What cities are near Everett in relation to Bo... | `[=== Transportation ===\n\nLogan International...` | `[Geography == Boston has an area of 89.63 sq m...` | The cities near Everett in relation to Boston ... | Everett is bordered to the northeast by the ci... | 0.0 | 0.996257 | 0.5 | 1.0 | 0.419646 |
| 5 | What role did Alcatraz play in San Francisco's... | `[By 1880, Chinese made up 9.3% of the populati...` | `[<1-hop>\n\nthe city celebrated its rebirth at...` | Alcatraz served as a federal maximum security ... | Alcatraz, a former military stockade, began it... | 1.0 | 0.939577 | 0.0 | 0.666667 | 0.636899 |
| 6 | What role do Northeastern University and North... | `[=== Colleges and universities ===\n\nSince th...` | `[<1-hop>\n\nChicago is the most populous city ...` | Northeastern University and Northwestern Unive... | Northeastern University and Northwestern Unive... | 0.125 | 0.945373 | 0.0 | 0.0 | 0.575686 |
| 7 | What role did Alcatraz Island play in San Fran... | `[By 1880, Chinese made up 9.3% of the populati...` | `[<1-hop>\n\nthe city celebrated its rebirth at...` | Alcatraz Island played a significant role in S... | Alcatraz Island began its service as a federal... | 0.0 | 0.958103 | 1.0 | 0.5 | 0.518719 |
| 8 | What is the significance of Boston as a cultur... | `[== Arts and culture ==\n\nBoston shares many ...` | `[<1-hop>\n\nBoston is the capital and most pop...` | Boston serves as the cultural and financial ce... | Boston is significant as the cultural and fina... | 0.454545 | 0.940853 | 0.0 | 0.5 | 0.679919 |
| 9 | What role did the Dallas ISD play in the educa... | `[=== Libraries ===\n\nThe city is served by th...` | `[<1-hop>\n\nof the public school students with...` | The Dallas Independent School District (DISD) ... | The Dallas ISD provided educational services t... | 0.0 | 0.964443 | 0.0 | 0.0 | 0.667253 |
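The aggregate scores printed in cell 20 appear to be plain means of the per-sample columns in this table. The sketch below checks that for three of the metrics, using the values copied from the rows above; the `per_sample` dictionary is only an illustration and is not a Ragas data structure.

```python
from statistics import mean

# Per-sample scores copied from the result table (rows 0 to 9).
per_sample = {
    "faithfulness": [1.0, 0.8, 1.0, 0.0, 0.0, 1.0, 0.125, 0.0, 0.454545, 0.0],
    "context_precision": [1.0, 1.0, 1.0, 0.0, 0.5, 0.0, 0.0, 1.0, 0.0, 0.0],
    "context_recall": [1.0, 1.0, 1.0, 0.0, 1.0, 0.666667, 0.0, 0.5, 0.5, 0.0],
}

# Average each column; these match the aggregate values reported by evaluate().
aggregates = {name: round(mean(scores), 4) for name, scores in per_sample.items()}

for name, value in aggregates.items():
    print(f"{name}: {value:.4f}")
```

The means come out to 0.4380, 0.4500, and 0.5667. Averaging hides the spread: faithfulness is 0.0 on four of the ten samples, which is worth inspecting row by row.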
