# Vector stores

`InMemoryVectorStore`, the simplest vector store option for development and testing:
- Stores embeddings directly in RAM
- No external dependencies or setup
- Fast access but no persistence
- Perfect for prototyping and small datasets
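
To make the idea of an in-memory store concrete, here is a minimal sketch in plain Python. It is not LangChain's `InMemoryVectorStore`; the class `TinyInMemoryStore`, its methods, and the two-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class TinyInMemoryStore:
    """Hypothetical toy store: keeps (vector, text) pairs in a Python list."""

    def __init__(self):
        self._items = []  # lives only in RAM, lost when the process exits

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=2):
        """Return the k (similarity, text) pairs closest to the query."""
        scored = [(cosine_similarity(query_vector, vec), text)
                  for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyInMemoryStore()
store.add([1.0, 0.0], "radioactivity")
store.add([0.0, 1.0], "evolution")
store.add([0.9, 0.1], "polonium")
results = store.search([1.0, 0.0], k=2)
print(results)
```

Every search scans all stored vectors, which is fine for small datasets and is one reason this approach suits prototyping rather than large collections.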

Persistent vector storage with ChromaDB:
- Automatic disk persistence and collection detection
- Similarity search with distance scores (lower means more similar)
- Collection reloading demonstration
- Production-ready persistence patterns

Facebook's FAISS library for production-scale similarity search:
- Optimized for speed and memory efficiency
- Multiple index types available
- GPU acceleration support
- Best for read-heavy, large-scale applications

| Feature | InMemory | ChromaDB | FAISS |
|---------|----------|----------|-------|
| **Persistence** | ❌ No | ✅ Yes | ✅ Yes |
| **Setup Complexity** | 🟢 None | 🟡 Simple | 🟡 Simple |
| **Performance** | 🟢 Fast | 🟡 Good | 🟢 Very Fast |
| **Memory Usage** | 🔴 High | 🟡 Medium | 🟢 Low |
| **Metadata Support** | ✅ Basic | ✅ Rich | ❌ Limited |
| **Best For** | Development | General Purpose | Production Scale |

## Choosing the Right Vector Store

**Use InMemoryVectorStore when:**
- Prototyping or development
- Small datasets (< 1000 documents)
- No need for persistence
- Testing different chunking strategies

**Use ChromaDB when:**
- Need persistence and metadata
- General-purpose applications
- Medium datasets (1K-100K documents)
- Want built-in filtering capabilities

**Use FAISS when:**
- Production deployments
- Large datasets (> 100K documents)
- Performance is critical
- Primarily read-heavy workloads
- Need distance-based similarity metrics
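
The rules of thumb above can be sketched as a small decision helper. The function `suggest_vector_store` is hypothetical; it only encodes the dataset-size thresholds and the persistence criterion from the lists above and ignores the other factors (metadata filtering, read-heavy workloads).

```python
def suggest_vector_store(num_documents, needs_persistence):
    """Hypothetical helper encoding the size and persistence rules of thumb."""
    if num_documents > 100_000:
        return "FAISS"  # large, performance-critical workloads
    if needs_persistence or num_documents >= 1_000:
        return "ChromaDB"  # medium datasets or anything that must survive restarts
    return "InMemoryVectorStore"  # small, throwaway experiments

print(suggest_vector_store(500, needs_persistence=False))
print(suggest_vector_store(50_000, needs_persistence=True))
print(suggest_vector_store(2_000_000, needs_persistence=True))
```

Treat the result as a starting point for discussion, not a substitute for measuring on your own data.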

TASK: Work with the script `1_in_memory.py`

ChromaDB can be run as an in-memory database (data is not saved), from within Python, or as a server.

https://docs.trychroma.com/docs/run-chroma/client-server

## ChromaDB

In [1]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os

In [2]:
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [3]:
# Configuration
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "scientists_bios"

In [4]:
# Load and prepare documents
loader = DirectoryLoader("data/scientists_bios")
docs = loader.load()
print(f"Loaded {len(docs)} documents")

Loaded 5 documents


In [5]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

Created 22 chunks
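
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of a fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function `split_with_overlap` is a hypothetical illustration of the overlap idea only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into character windows; neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.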


In [6]:
# Create embeddings
embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-small")

In [7]:
def check_existing_collection():
    """Check if ChromaDB collection already exists."""
    if os.path.exists(CHROMA_DB_PATH):
        print(f"📁 Found existing ChromaDB at {CHROMA_DB_PATH}")
        return True
    return False

# Create or load ChromaDB collection
existing_db = check_existing_collection()
existing_db

False

In [8]:
chroma_store = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=CHROMA_DB_PATH
)

In [9]:
# Add documents if the collection is new or empty
if not existing_db:
    print("➕ Adding documents to new ChromaDB collection...")
    chroma_store.add_documents(documents=chunks)
    print(f"✅ Added {len(chunks)} chunks to ChromaDB")
else:
    # Check collection size
    collection_size = len(chroma_store.get()['ids'])
    print(f"📊 Existing collection has {collection_size} documents")

    if collection_size == 0:
        print("➕ Collection is empty, adding documents...")
        chroma_store.add_documents(documents=chunks)
        print(f"✅ Added {len(chunks)} chunks to ChromaDB")

➕ Adding documents to new ChromaDB collection...
✅ Added 22 chunks to ChromaDB


*A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store.*

Examples include an arXiv retriever, a Wikipedia retriever, and a vector store retriever.

https://docs.langchain.com/oss/python/integrations/retrievers/index#retrievers
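
The point that a retriever is more general than a vector store can be sketched without any library. The `Retriever` protocol and `KeywordRetriever` class below are hypothetical and are not LangChain classes; they only show that calling code depends on "query in, documents out", not on how the documents are found.

```python
from typing import List, Protocol

class Retriever(Protocol):
    """Anything that maps a free-text query to a list of document strings."""
    def retrieve(self, query: str) -> List[str]: ...

class KeywordRetriever:
    """Hypothetical retriever with no vectors: matches query words against documents."""

    def __init__(self, documents: List[str]):
        self.documents = documents

    def retrieve(self, query: str) -> List[str]:
        words = set(query.lower().split())
        # Keep documents that share at least one word with the query
        return [d for d in self.documents if words & set(d.lower().split())]

def build_context(retriever: Retriever, query: str) -> str:
    """Works with any retriever, regardless of how it finds documents."""
    return "\n".join(retriever.retrieve(query))

documents = ["Curie studied radioactivity", "Darwin studied finches"]
context = build_context(KeywordRetriever(documents), "radioactivity research")
print(context)
```

A vector store retriever would satisfy the same interface while using embedding similarity instead of word overlap.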

In [None]:
# Create retriever
retriever = chroma_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

In [12]:
# Test similarity search with scores
# Note: the returned score is a distance, so lower means more similar
print("\n🔍 Testing similarity search with scores...")
test_query = "What did Marie Curie discover about radioactivity?"
similar_docs_with_scores = chroma_store.similarity_search_with_score(test_query, k=3)

print(f"Query: {test_query}")
print("Results with similarity scores:")
for i, (doc, score) in enumerate(similar_docs_with_scores, 1):
    print(f"\nChunk {i} (Score: {score:.3f}):")
    print(f"{doc.page_content[:150]}...")


🔍 Testing similarity search with scores...
Query: What did Marie Curie discover about radioactivity?
Results with similarity scores:

Chunk 1 (Score: 0.556):
Scientific Achievements Discovery of Radioactivity: Working with her husband Pierre Curie, Marie discovered the elements polonium (named after her nat...

Chunk 2 (Score: 0.700):
Legacy and Death Marie Curie's work was crucial for the development of X-rays in surgery and cancer treatments. Despite her accomplishments, she faced...

Chunk 3 (Score: 0.730):
Marie Sklodowska - Curie (1867-1934) Marie Skłodowska Curie was a Polish and naturalized-French physicist and chemist who conducted pioneering researc...
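
The scores in the output above grow from chunk 1 to chunk 3 even though chunk 1 is the best match, which is consistent with the score being a distance rather than a similarity. The sketch below shows how the two relate under two assumptions that should be verified for your own collection: that the store uses squared Euclidean (L2) distance, and that the embeddings are unit-length. The vectors are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance (assumed metric, lower means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Dot product; equals cosine similarity only because inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))

query = normalize([1.0, 2.0, 3.0])
close = normalize([1.0, 2.0, 2.5])
far = normalize([3.0, -1.0, 0.5])

for name, vec in [("close", close), ("far", far)]:
    d = squared_l2(query, vec)
    # For unit vectors: cosine similarity = 1 - d / 2
    print(name, round(d, 3), round(cosine(query, vec), 3), round(1 - d / 2, 3))
```

Under these assumptions, ranking by ascending distance and ranking by descending cosine similarity give the same order.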


In [14]:
# Create RAG chain
llm = AzureChatOpenAI(model="gpt-5-nano")

prompt = ChatPromptTemplate.from_template("""
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}

Context: {context}

Answer:
""")

chroma_rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
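
The chain above hands the retriever's output to the prompt as a list of document objects, so the prompt likely receives their string representation, metadata included. A common refinement is to join only the text of the chunks first. The sketch below uses a hypothetical `FakeDocument` stand-in so that it runs without LangChain; `format_docs` is an illustrative name, not part of the original notebook.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a retrieved document (only page_content is used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join retrieved chunks into one plain-text context string."""
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [
    FakeDocument("Marie Curie won the Nobel Prize in Physics in 1903."),
    FakeDocument("She won the Nobel Prize in Chemistry in 1911."),
]
context = format_docs(retrieved)
print(context)

# Assumed usage inside the chain (not executed here):
# {"context": retriever | format_docs, "question": RunnablePassthrough()}
```

Keeping metadata out of the prompt usually saves tokens, though including a source field can be useful when answers need citations.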

In [15]:
# Demo questions
questions = [
    "What awards did Marie Curie receive?",
    "How did Charles Darwin develop his theory of evolution?",
    "What was Newton's contribution to mathematics?"
]


for i, question in enumerate(questions, 1):
    print(f"\nQ{i}: {question}")
    print("-" * 40)
    response = chroma_rag_chain.invoke(question)
    print(f"A{i}: {response}")


Q1: What awards did Marie Curie receive?
----------------------------------------
A1: She received two Nobel Prizes. In 1903 she won the Nobel Prize in Physics (shared with Pierre Curie and Henri Becquerel) for work on radiation phenomena, and in 1911 she won the Nobel Prize in Chemistry for the discovery of polonium and radium and the isolation of radium.

Q2: How did Charles Darwin develop his theory of evolution?
----------------------------------------
A2: Darwin developed his theory after years of careful observation and data collection from his voyage on the Beagle, including variations among Galápagos finches, tortoises, and fossils that suggested common ancestry and adaptation. He reasoned that individuals with favorable traits survive and reproduce, passing those traits to offspring - natural selection - as the mechanism driving evolution. He delayed publishing due to controversy and then released his ideas in 1859's On the Origin of Species after Alfred Russel Wallace indep

Load again to demonstrate persistence

In [16]:
print("\n🔄 Reloading ChromaDB to demonstrate persistence...")
chroma_store_reload = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=CHROMA_DB_PATH
)

# create a retriever for the reloaded store
retriever_reload = chroma_store_reload.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# create a RAG chain for the reloaded store
chroma_rag_chain_reload = (
    {"context": retriever_reload, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Perform a RAG query
response_reload = chroma_rag_chain_reload.invoke(questions[0])
print(f"\nAfter reload - Q1: {questions[0]}")
print(f"A1: {response_reload}")


🔄 Reloading ChromaDB to demonstrate persistence...

After reload - Q1: What awards did Marie Curie receive?
A1: She received two Nobel Prizes: Physics in 1903 (shared with Pierre Curie and Henri Becquerel) and Chemistry in 1911 (for the discovery of polonium and radium and the isolation of radium).


In [18]:
collection_info = chroma_store.get()
collection_info.keys()

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])

## FAISS (Facebook AI Similarity Search)

FAISS is not a complete vector database; it is a vector search library.

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

Task: Script `3_faiss_intro.py`

Note: the faiss package must be installed first:

`pip install faiss-cpu`

or (for GPU)

`pip install faiss-gpu`

## Loading documents from different sources

The `03_document_loading` folder contains scripts that show how to load data from text files, PDF files, and web pages.

PDF processing is simplified here, because the script creates the PDF files from text files at the start. In practice, PDF files may contain scans or photos, so extracting the data correctly requires OCR (Optical Character Recognition) AI models, for example those available in the Azure Document Intelligence service.

## Knowledge Graph

A knowledge graph is a way of representing information as connections between concepts. The data is presented as a graph in which nodes represent objects or concepts and edges describe the relations between them. This makes it possible to model knowledge in a way that is understandable to both people and machines.

![](https://www.atulhost.com/wp-content/uploads/2020/12/knowledge-graph-1536x864.jpg)

### Basic concepts

1. Node – represents an entity or concept (e.g. a person, city, film).

   Example: Warszawa, Polska, Adam Mickiewicz.

2. Edge – represents a relation between two nodes.

   Example: `leży_w` (is located in), `napisał` (wrote), `urodził_się_w` (was born in).

3. Triplet (RDF triple, Resource Description Framework) – the basic unit of knowledge in a graph. It consists of:

   `(subject, predicate, object)`

   Example:

   `(Adam_Mickiewicz, urodził_się_w, Nowogródek)`

4. Ontology – a formal description of the structure of knowledge in a given domain. It defines which types of entities and relations may occur and what properties they have.

   Example of an ontology: https://schema.org/Person

   Example of an ontology and a knowledge graph: https://dbpedia.org/page/Adam_Mickiewicz
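
The concepts above can be sketched in plain Python by storing each fact as a `(subject, predicate, object)` tuple. This is only an illustration, not an RDF library; the helper `match` is hypothetical and treats `None` as a wildcard.

```python
# Each fact is a (subject, predicate, object) tuple; identifiers are illustrative.
triples = [
    ("Adam_Mickiewicz", "urodził_się_w", "Nowogródek"),
    ("Adam_Mickiewicz", "napisał", "Pan_Tadeusz"),
    ("Warszawa", "leży_w", "Polska"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(triples, subject="Adam_Mickiewicz"))
print(match(triples, predicate="leży_w"))
```

Subjects and objects play the role of nodes, and each predicate is a labeled edge between them.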

### Naming best practices in knowledge graphs and Neo4j

| Element                        | Convention                                    | Example                           |
| ------------------------------ | --------------------------------------------- | --------------------------------- |
| **Node (Node label)**          | Capitalized (class / type name)               | `Osoba`, `Książka`, `Miasto`      |
| **Edge (Relationship type)**   | Uppercase, often a verb form                  | `NAPISAŁ`, `MIESZKA_W`, `LEŻY_W`  |
| **Attribute (Property)**       | Lowercase, sometimes camelCase                | `name`, `birthYear`, `population` |
| **Identifier (URI or ID)**     | Usually without spaces, e.g. with underscores | `Adam_Mickiewicz`, `Nowy_Sącz`, `https://projekt-neo4j.edu.pl/person/Adam_Mickiewicz` |


### Neo4j

Neo4j is one of the most popular graph databases; it stores data as nodes and edges with attributes.
It uses the Cypher query language, which makes it intuitive to search and analyze connections, e.g.:

```
MATCH (a:Author)-[:NAPISAŁ]->(b:Book)
WHERE a.name = "Adam Mickiewicz"
RETURN b.title
```

Side note: recent versions of Neo4j support **vector embeddings**, i.e. vector representations of text, along with searching nodes and edges by vector similarity.

https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/

### Inference and applications

Knowledge graphs make it possible to infer indirect relations. For example, with `jest_częścią` meaning "is part of", if:

`(A, jest_częścią, B)` and `(B, jest_częścią, C)`,

then one can logically infer that `(A, jest_częścią, C)`.
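
This kind of transitive inference can be sketched as a small fixed-point loop over triples. The function `transitive_closure` is a hypothetical illustration and is only meaningful for relations that really are transitive, such as "is part of".

```python
def transitive_closure(triples, predicate):
    """Add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) exist."""
    facts = {(s, o) for (s, p, o) in triples if p == predicate}
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for (a, b) in list(facts):
            for (b2, c) in list(facts):
                if b == b2 and (a, c) not in facts:
                    facts.add((a, c))
                    changed = True
    return {(s, predicate, o) for (s, o) in facts}

known = [
    ("A", "jest_częścią", "B"),
    ("B", "jest_częścią", "C"),
]
inferred = transitive_closure(known, "jest_częścią")
print(("A", "jest_częścią", "C") in inferred)  # prints True
```

Graph databases and reasoners typically do this with path queries or rule engines instead of nested loops, which would not scale to large graphs.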

For example:

given `(Jan, LUBI, Pan_Tadeusz)`, `(Adam_Mickiewicz, NAPISAŁ, Pan_Tadeusz)`, and `(Adam_Mickiewicz, NAPISAŁ, Dziady)`, Jan may also like "Dziady" (`LUBI` means "likes", `NAPISAŁ` means "wrote").

Cypher query:

```
MATCH (u:User {name: "Jan"})-[:LUBI]->(:Book)<-[:NAPISAŁ]-(a:Author)-[:NAPISAŁ]->(b:Book)
WHERE NOT (u)-[:LUBI]->(b)
RETURN DISTINCT b.title AS polecane
```
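
For readers without a Neo4j instance, the logic of the query above can be approximated over in-memory triples. This is a sketch of the same recommendation idea, not a replacement for Cypher; the function `recommend` is hypothetical and ignores node labels such as `User` and `Book`.

```python
triples = [
    ("Jan", "LUBI", "Pan_Tadeusz"),
    ("Adam_Mickiewicz", "NAPISAŁ", "Pan_Tadeusz"),
    ("Adam_Mickiewicz", "NAPISAŁ", "Dziady"),
]

def recommend(triples, user):
    """Books by authors of books the user likes, excluding books already liked."""
    liked = {o for (s, p, o) in triples if s == user and p == "LUBI"}
    authors = {s for (s, p, o) in triples if p == "NAPISAŁ" and o in liked}
    candidates = {o for (s, p, o) in triples if p == "NAPISAŁ" and s in authors}
    return sorted(candidates - liked)  # mirrors WHERE NOT (u)-[:LUBI]->(b)

print(recommend(triples, "Jan"))  # prints ['Dziady']
```

Each set comprehension corresponds to one hop in the Cypher pattern: liked books, their authors, then other books by those authors.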