# Notebook 1 — Elementary Vector Database Example

This notebook introduces the **core idea behind vector databases** in a minimal and intuitive way.

We start with a **small set of simple sentences**, convert them into numerical vectors using a pretrained language model, and then perform **semantic similarity search**.  
No files, no PDFs, no large code bases — just meaning mapped into a geometric space.

### Learning goals
- Understand how text is mapped into a vector space
- See how semantic similarity emerges from vector geometry
- Perform a simple similarity search without keywords

This notebook focuses on **intuition and concepts**.  
In the next notebook, we will apply the same ideas to **real documents and PDFs**.

In [1]:
# ------------------------------------------------------------
# Cell 1: Define a small set of example sentences
# ------------------------------------------------------------
# These sentences are intentionally simple and cover
# different semantic topics. Some are paraphrases, others
# are clearly unrelated. This makes similarity effects visible.

sentences = [
    "The sun is shining brightly today.",
    "It is a sunny day with clear skies.",
    "Heavy rain is falling over the city.",
    "The weather forecast predicts strong rainfall.",
    "Neural networks can learn complex patterns from data.",
    "Machine learning models improve through training.",
    "The cat is sleeping on the sofa.",
    "A dog is running quickly in the park.",
    "Transformers use self-attention mechanisms.",
    "Large language models are based on transformer architectures."
]

print(f"Number of sentences: {len(sentences)}")
for i, s in enumerate(sentences):
    print(f"{i:2d}: {s}")


Number of sentences: 10
 0: The sun is shining brightly today.
 1: It is a sunny day with clear skies.
 2: Heavy rain is falling over the city.
 3: The weather forecast predicts strong rainfall.
 4: Neural networks can learn complex patterns from data.
 5: Machine learning models improve through training.
 6: The cat is sleeping on the sofa.
 7: A dog is running quickly in the park.
 8: Transformers use self-attention mechanisms.
 9: Large language models are based on transformer architectures.


In [2]:
# ------------------------------------------------------------
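# Proxy configuration for DWD (ofsquid)
# ------------------------------------------------------------
# Site-specific: this routes HTTP(S) traffic through the DWD proxy.
# Outside the DWD network, this cell can likely be skipped.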

import os

PROXY = "http://ofsquid.dwd.de:8080"

os.environ["HTTP_PROXY"]  = PROXY
os.environ["HTTPS_PROXY"] = PROXY
os.environ["http_proxy"]  = PROXY
os.environ["https_proxy"] = PROXY

# Do not proxy local connections
os.environ["NO_PROXY"] = "localhost,127.0.0.1"

print("DWD proxy configured:", PROXY)

DWD proxy configured: http://ofsquid.dwd.de:8080


In [3]:
# ------------------------------------------------------------
# Cell 2: Load a sentence embedding model
# ------------------------------------------------------------
# We use a small and fast sentence transformer model.
# It maps each sentence to a fixed-size vector in R^d.
# No training is needed here; we only use the pretrained model.

from sentence_transformers import SentenceTransformer

# Load model (downloads once, then cached locally)
model = SentenceTransformer("all-MiniLM-L6-v2")

print("Model loaded.")
print("Embedding dimension:", model.get_sentence_embedding_dimension())

Model loaded.
Embedding dimension: 384


In [4]:
# ------------------------------------------------------------
# Cell 3: Encode sentences into embedding vectors
# ------------------------------------------------------------
# Each sentence is mapped to a vector z in R^d.
# The result is a matrix of shape (num_sentences, embedding_dim).

import numpy as np

embeddings = model.encode(sentences, convert_to_numpy=True)

print("Embedding array shape:", embeddings.shape)


Embedding array shape: (10, 384)


In [5]:
# ------------------------------------------------------------
# Cell 4: Inspect embedding values (sanity check)
# ------------------------------------------------------------
# We do NOT interpret individual numbers.
# The important point is: each sentence is now a point in space.

np.set_printoptions(precision=3, suppress=True)

print("First embedding vector (truncated):")
print(embeddings[0][:10])


First embedding vector (truncated):
[ 0.033  0.134  0.104  0.072  0.037 -0.041  0.095 -0.063 -0.017 -0.02 ]


In [6]:
# ------------------------------------------------------------
# Cell 5: Define example query sentences
# ------------------------------------------------------------
# These queries are paraphrases or conceptual variations
# of some of the original sentences.

queries = [
    "Today is bright and sunny.",
    "The forecast expects a lot of rain.",
    "How do neural networks learn from data?"
]

for q in queries:
    print("-", q)


- Today is bright and sunny.
- The forecast expects a lot of rain.
- How do neural networks learn from data?


In [7]:
# ------------------------------------------------------------
# Cell 6: Encode queries into vectors
# ------------------------------------------------------------
# Queries are embedded in exactly the same vector space
# as the original sentences.

query_embeddings = model.encode(queries, convert_to_numpy=True)

print("Query embedding shape:", query_embeddings.shape)

Query embedding shape: (3, 384)


In [8]:
# ------------------------------------------------------------
# Cell 7: Define cosine similarity function
# ------------------------------------------------------------
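# Cosine similarity is the cosine of the angle between two vectors.
# It ranges from -1 to 1; values close to 1 mean the vectors point in
# nearly the same direction, which we read as high semantic similarity.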

from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))
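
To make the geometric reading concrete, here is a small worked example. It uses hypothetical 2-D vectors (not model embeddings) chosen so the angles are obvious. It redefines the same function so the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 2-D vectors, chosen only to make the geometry easy to see.
east = np.array([1.0, 0.0])
far_east = np.array([5.0, 0.0])   # same direction as `east`, larger length
north = np.array([0.0, 1.0])      # perpendicular to `east`
west = np.array([-1.0, 0.0])      # opposite direction to `east`

sim_same = cosine_similarity(east, far_east)      # same direction
sim_orthogonal = cosine_similarity(east, north)   # 90 degree angle
sim_opposite = cosine_similarity(east, west)      # 180 degree angle

print(sim_same, sim_orthogonal, sim_opposite)
```

Vector length does not matter here: `east` and `far_east` differ by a factor of five but score as identical in direction. Only the angle between the vectors affects the result.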


In [9]:
# ------------------------------------------------------------
# Cell 8: Perform similarity search
# ------------------------------------------------------------
# For a given query vector, compute similarity to all
# sentence embeddings and return the top-k matches.

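def search_similar_sentences(query_embedding, sentences, embeddings, top_k=3):
    """Return the top_k (index, score) pairs, highest cosine similarity first.

    `sentences` is not used here; callers look up the text via the
    returned indices.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        score = cosine_similarity(query_embedding, emb)
        scores.append((i, score))

    # Sort by similarity (highest first)
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:top_k]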
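
The loop above is easy to follow but computes one similarity at a time. As a sketch of an alternative, the same ranking can be computed with a single matrix-vector product once all vectors are scaled to unit length. The function name `top_k_cosine` and the toy 3-D vectors below are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def top_k_cosine(query, matrix, top_k=3):
    """Return (row_index, score) pairs for the top_k rows most similar to query."""
    # Scale every row and the query to unit length, so that a plain
    # dot product equals the cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)

    # One matrix-vector product gives all similarities at once.
    scores = matrix_unit @ query_unit

    # argsort is ascending, so reverse it to get the highest scores first.
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical stand-ins for sentence embeddings (3-D instead of 384-D).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

toy_results = top_k_cosine(toy_query, toy_embeddings, top_k=2)
for idx, score in toy_results:
    print(f"row {idx}: {score:.3f}")
```

For ten sentences the difference is negligible. For larger collections, the vectorized form avoids the Python-level loop.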


In [12]:
# ------------------------------------------------------------
# Cell 9: Run similarity search for each query
# ------------------------------------------------------------
# This is the core "aha" moment of the notebook.

for q, q_emb in zip(queries, query_embeddings):
    print("\n" + "=" * 60)
    print("Query:", q)
    print("-" * 60)
    
    results = search_similar_sentences(q_emb, sentences, embeddings, top_k=4)
    
    for idx, score in results:
        print(f"Score: {score:6.3f} | Sentence: {sentences[idx]}")



Query: Today is bright and sunny.
------------------------------------------------------------
Score:  0.882 | Sentence: The sun is shining brightly today.
Score:  0.767 | Sentence: It is a sunny day with clear skies.
Score:  0.328 | Sentence: The weather forecast predicts strong rainfall.
Score:  0.211 | Sentence: Heavy rain is falling over the city.

Query: The forecast expects a lot of rain.
------------------------------------------------------------
Score:  0.786 | Sentence: The weather forecast predicts strong rainfall.
Score:  0.553 | Sentence: Heavy rain is falling over the city.
Score:  0.420 | Sentence: It is a sunny day with clear skies.
Score:  0.260 | Sentence: The sun is shining brightly today.

Query: How do neural networks learn from data?
------------------------------------------------------------
Score:  0.769 | Sentence: Neural networks can learn complex patterns from data.
Score:  0.531 | Sentence: Machine learning models improve through training.
Score:  0.232 | 

### What we observe

- Semantically related sentences are retrieved even when
  their wording differs from the query.
- In all three queries, the closest paraphrase ranks first.
- Sentences on other topics score much lower.

This demonstrates the core idea behind vector databases:
**meaning is represented geometrically**.
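
To connect this idea to what a vector database does, the following is a minimal sketch of an in-memory store that keeps each text next to its vector and answers nearest-neighbor queries. `TinyVectorStore` is a hypothetical class written for this illustration and is not part of any library. The hand-made 3-D vectors stand in for real embeddings.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store: unit-length vectors kept next to their texts."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim))  # one row per stored text
        self.texts = []

    def add(self, text, vector):
        vector = np.asarray(vector, dtype=float)
        if vector.shape != (self.dim,):
            raise ValueError(f"expected a vector of shape ({self.dim},)")
        # Store the unit-length vector so queries only need a dot product.
        unit = vector / np.linalg.norm(vector)
        self.vectors = np.vstack([self.vectors, unit])
        self.texts.append(text)

    def query(self, vector, top_k=3):
        vector = np.asarray(vector, dtype=float)
        unit = vector / np.linalg.norm(vector)
        scores = self.vectors @ unit              # cosine similarity per row
        order = np.argsort(scores)[::-1][:top_k]  # highest scores first
        return [(self.texts[i], float(scores[i])) for i in order]

# Hand-made vectors (assumption): real usage would pass model embeddings.
store = TinyVectorStore(dim=3)
store.add("sunny weather", [0.9, 0.1, 0.0])
store.add("heavy rain", [0.6, 0.8, 0.0])
store.add("neural networks", [0.0, 0.1, 0.9])

hits = store.query([1.0, 0.2, 0.0], top_k=2)
for text, score in hits:
    print(f"{score:.3f} | {text}")
```

This sketch compares the query against every stored vector. Production vector databases typically add persistence and approximate nearest-neighbor indexes so that search stays fast on large collections.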