# Semantic Search & Vector Analysis with Weaviate

This notebook demonstrates how to query a vector database to perform semantic retrieval. Unlike keyword search, which matches literal terms, vector search represents text as numeric vectors and ranks results by the distance between those vectors.

### Step 1: Load and Preview Data

In [None]:
import requests
import json

# Download the data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json', timeout=30)
resp.raise_for_status()  # Fail early on an HTTP error
data = json.loads(resp.text)  # Parse the JSON payload

# Preview the parsed data
print(type(data), len(data))
print(json.dumps(data[0], indent=2))

def jprint(obj):
    # Pretty-print any JSON-serializable object
    print(json.dumps(obj, indent=2))

### Step 2: Initialize Weaviate

We run Weaviate in embedded mode, which starts a local instance from the Python client. The `X-OpenAI-Api-Key` header is required because the `text2vec-openai` vectorizer calls OpenAI's embedding models to transform our text into high-dimensional vectors.
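
If `OPENAI_API_KEY` is unset, `os.environ.get` returns `None` and the problem may only surface later, when Weaviate tries to vectorize text. A minimal sketch of an up-front check follows; `require_env` is a hypothetical helper, not part of the Weaviate client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of the Weaviate client.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("OPENAI_API_KEY")` in place of `os.environ.get("OPENAI_API_KEY")` would stop the notebook at client creation rather than at import time.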

In [None]:
import weaviate
from weaviate import EmbeddedOptions
import os

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
    }
)

In [None]:
jprint(client.get_meta())

In [None]:
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

In [None]:
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
}

client.schema.create_class(class_obj)

In [None]:
with client.batch.configure() as batch:
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(data_object=properties, class_name="Question")
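
The import loop above assumes every record carries `Answer`, `Question`, and `Category` keys. As a hedged sketch, the mapping can be factored into a small function that reports missing keys before anything is sent to Weaviate. `to_properties` and the sample record are made up for illustration.

```python
REQUIRED_KEYS = ("Answer", "Question", "Category")


def to_properties(record: dict) -> dict:
    """Map one raw record to the lowercase property names used on import.

    Hypothetical helper; raises KeyError listing any missing source keys.
    """
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    return {
        "answer": record["Answer"],
        "question": record["Question"],
        "category": record["Category"],
    }


# Made-up record standing in for one element of `data`
sample_record = {
    "Answer": "example answer",
    "Question": "example question",
    "Category": "EXAMPLE",
}
sample_properties = to_properties(sample_record)
print(sample_properties)
```

Inside the batch loop, `properties = to_properties(d)` would then replace the inline dictionary.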

### Step 3: Extracting the Vector Representation

Every object in a vector database is stored with an underlying array of numbers (the vector). This vector represents the "semantic fingerprint" of the data.



In [None]:
# Extract the vector for a question using .with_additional(['vector'])
result = (
    client.query
    .get("Question", ["question", "answer"])
    .with_additional(["vector"])
    .with_limit(1)
    .do()
)

In [None]:
# Display the vector representation
vector = result['data']['Get']["Question"][0]['_additional']['vector']
print(f"Vector preview (first 5 numbers): {vector[:5]}")

In [None]:
# How many dimensions does this model use?
print(f"Vector dimensionality: {len(vector)}")
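
Besides its dimensionality, a vector's Euclidean length is a quick sanity check worth printing. The sketch below uses a made-up two-dimensional vector instead of a real embedding so that it runs without a Weaviate instance; `describe_vector` is a hypothetical helper.

```python
import math


def describe_vector(vec):
    """Return the dimensionality and Euclidean (L2) norm of a vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return {"dimensions": len(vec), "l2_norm": norm}


toy_vector = [3.0, 4.0]  # made-up stand-in for a real embedding
summary = describe_vector(toy_vector)
print(summary)
```

Passing the `vector` extracted above to `describe_vector` would report the same two figures for the stored embedding.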

### Step 4: Semantic Search and Vector Distance

When we search for "biology", Weaviate converts the query text into a vector using the same vectorizer as the stored objects, then returns the objects whose vectors are closest to it. Closeness is measured with **Cosine Distance**, which is Weaviate's default distance metric.
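
To make the metric concrete, here is a minimal pure-Python sketch of cosine distance, defined as one minus the cosine similarity of two vectors. The toy vectors are made up, and Weaviate computes this internally, so the function is for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D vectors: same direction, orthogonal, and opposite
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters here, not their length, which is why `[1.0, 0.0]` and `[2.0, 0.0]` are at distance zero.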



In [None]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_limit(2)
    .do()
)
jprint(response)

### Step 5: Understanding Distance Thresholds

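A cosine distance of `0` means the two vectors point in the same direction, which we interpret as near-identical meaning. As the distance increases, up to a maximum of `2`, the concepts become less related. We can set a `certainty` or `distance` threshold to filter out weak matches.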
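
For the cosine metric, Weaviate's documentation describes `certainty` as the distance rescaled into the range 0 to 1, that is, `certainty = 1 - distance / 2`. Assuming that relationship holds for the version in use, the two thresholds can be converted with a pair of hypothetical helpers:

```python
def distance_to_certainty(distance: float) -> float:
    """Convert a cosine distance in [0, 2] to a certainty in [0, 1].

    Assumes certainty = 1 - distance / 2 (cosine metric only).
    """
    if not 0.0 <= distance <= 2.0:
        raise ValueError("cosine distance must lie in [0, 2]")
    return 1.0 - distance / 2.0


def certainty_to_distance(certainty: float) -> float:
    """Inverse of distance_to_certainty under the same assumption."""
    if not 0.0 <= certainty <= 1.0:
        raise ValueError("certainty must lie in [0, 1]")
    return 2.0 * (1.0 - certainty)


print(distance_to_certainty(0.18))
```

Under this assumption, a distance threshold of `0.18` corresponds to a certainty of roughly `0.91`.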

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["biology"]})
    .with_additional(["distance"])
    .do()
)
jprint(response)

In [None]:
# Set a maximum distance so that only close matches are returned
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({
        "concepts": ["animals"],
        "distance": 0.18  # Only return very close matches
    })
    .with_additional(["distance"])
    .do()
)
jprint(response)
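
The same cutoff can also be applied client-side to a response that already includes `_additional.distance`, which is handy for trying several thresholds without re-querying. The sketch below assumes the response shape used earlier in this notebook (`data` → `Get` → class name). `hits_within` is a hypothetical helper, and the mock response uses made-up values.

```python
def hits_within(response: dict, class_name: str, max_distance: float) -> list:
    """Return the hits for class_name whose distance is <= max_distance."""
    get_block = (response.get("data") or {}).get("Get") or {}
    hits = get_block.get(class_name) or []
    return [
        hit for hit in hits
        if hit["_additional"]["distance"] <= max_distance
    ]


# Made-up response mimicking the structure returned by .do()
mock_response = {
    "data": {
        "Get": {
            "Question": [
                {"question": "example question A", "_additional": {"distance": 0.15}},
                {"question": "example question B", "_additional": {"distance": 0.21}},
            ]
        }
    }
}

close_hits = hits_within(mock_response, "Question", 0.18)
print([hit["question"] for hit in close_hits])
```

A response that carries only an `errors` key yields an empty list rather than raising, so it is still worth inspecting the raw response when nothing comes back.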