# Advanced Vector Search with Weaviate

This notebook demonstrates how to load a dataset of 1,000 Jeopardy! questions into Weaviate and run semantic queries against it. It covers schema definition, batch importing, counting objects, and combining semantic search with scalar filters. The code uses the v3 Python client API (`weaviate.Client`).

### Step 1: Loading the 1K Dataset

In [None]:
import requests
import json

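# Download the dataset
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/intro-workshop/main/data/jeopardy_1k.json')
resp.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page
data = json.loads(resp.text)  # Parse the JSON payload into Python objects

# Preview the container type, the number of records, and one sample record
print(type(data), len(data))
print(json.dumps(data[1], indent=2))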
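
Before importing, it can help to see which values the `Round` field takes, since a later step filters on it. The following is a minimal sketch: `count_by_round` is a hypothetical helper, and the `sample` records are made up to mimic the shape of the dataset (only the `Question`, `Answer`, and `Round` keys are taken from this notebook).

```python
from collections import Counter


def count_by_round(records):
    """Count how many records carry each value of the "Round" field."""
    # Records without a "Round" key are grouped under a placeholder label.
    return Counter(r.get("Round", "<missing>") for r in records)


# Made-up records for illustration; they are not taken from the real dataset.
sample = [
    {"Question": "q1", "Answer": "a1", "Round": "Jeopardy!"},
    {"Question": "q2", "Answer": "a2", "Round": "Double Jeopardy!"},
    {"Question": "q3", "Answer": "a3", "Round": "Double Jeopardy!"},
]

print(count_by_round(sample))
```

With the real data loaded, `count_by_round(data)` would report the distribution across rounds.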

### Understanding Schema and Properties

In Weaviate, a **Class** plays a role similar to a table in a relational database, and each **Property** of a class has a data type. Setting the class's vectorizer to `text2vec-openai` tells Weaviate to generate a vector for every imported object by calling OpenAI's embedding API, which is why the client is given an OpenAI API key.

In [None]:
import weaviate
from weaviate import EmbeddedOptions
import os

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
    }
)
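
`os.environ.get` returns `None` when the variable is unset, and the problem would then only surface later, when the vectorizer is first called. As a sketch, a small hypothetical helper (`openai_headers` is not part of the Weaviate client) can fail early with a clear message instead:

```python
import os


def openai_headers(env=os.environ):
    """Build the header dict for the OpenAI key, or raise if the key is absent."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before creating the client")
    return {"X-OpenAI-Api-Key": key}


# Demonstration with an explicit mapping and a placeholder key.
print(openai_headers({"OPENAI_API_KEY": "sk-placeholder"}))
```

The result could then be passed as `additional_headers=openai_headers()` when constructing the client.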

In [None]:
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

In [None]:
# Q1: Load up the dataset, keep question, answer and round properties.
class_definition = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "question", "dataType": ["text"]},
        {"name": "answer", "dataType": ["text"]},
        {"name": "round", "dataType": ["text"]}
    ]
}

client.schema.create_class(class_definition)

In [None]:
# Insert the data into Weaviate using Batch
with client.batch.configure(batch_size=100) as batch:
    for o in data:
        properties = {
            "question": o["Question"],
            "answer": o["Answer"],
            "round": o["Round"]
        }
        batch.add_data_object(properties, "Question")
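
The batch loop above indexes each record directly, so a record with a missing field would raise a bare `KeyError` partway through the import. One possible way to make that failure more informative is a hypothetical mapping helper such as `to_properties` below; it is a sketch, not part of the original notebook.

```python
def to_properties(record):
    """Map one raw record to the lowercase property names of the Question class."""
    # Keys are the class property names, values are the source field names.
    mapping = {"question": "Question", "answer": "Answer", "round": "Round"}
    missing = [src for src in mapping.values() if src not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return {dst: record[src] for dst, src in mapping.items()}


# Made-up record for illustration only.
example = {"Question": "q", "Answer": "a", "Round": "Double Jeopardy!", "Value": "$200"}
print(to_properties(example))
```

Inside the batch loop, `batch.add_data_object(to_properties(o), "Question")` would then replace the hand-built dictionary.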

### Step 2: Database Aggregation

To verify that all 1,000 objects were imported, we run an `Aggregate` query with `with_meta_count()`, which returns only the object count. This is more efficient than fetching every object just to count them.

In [None]:
# Q2. How do you check for the number of objects stored in the database?
count_response = client.query.aggregate("Question").with_meta_count().do()
print(f"Total objects in database: {count_response['data']['Aggregate']['Question'][0]['meta']['count']}")
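
The nested lookup in the `print` call can be wrapped in a small function that also checks for a GraphQL error payload. This is a sketch: `extract_count` is a hypothetical helper, the sample response is hand-written to match the shape accessed above, and the top-level `"errors"` key is assumed to be how a failed query is reported.

```python
def extract_count(aggregate_response, class_name):
    """Return the meta count for class_name from an Aggregate response dict."""
    # Assumption: a failed query carries a top-level "errors" entry.
    if "errors" in aggregate_response:
        raise RuntimeError(f"query failed: {aggregate_response['errors']}")
    groups = aggregate_response["data"]["Aggregate"][class_name]
    return groups[0]["meta"]["count"]


# Hand-written response in the same shape as the one indexed above.
sample_response = {"data": {"Aggregate": {"Question": [{"meta": {"count": 1000}}]}}}
print(extract_count(sample_response, "Question"))
```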

### Step 3: Pure Semantic Search

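`with_near_text` vectorizes the query string with the same model used for the stored objects and returns the objects whose vectors are closest to it (Weaviate uses cosine distance by default). Results are therefore ranked by similarity of meaning rather than by shared keywords.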
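
To make "closeness" concrete, here is a minimal pure-Python sketch of cosine distance, defined as one minus the cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings have many more dimensions and come from the vectorizer.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors: `close` points almost the same way as `query`, `far` is orthogonal.
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 0.0, 1.0]

print(cosine_distance(query, close), cosine_distance(query, far))
```

A smaller distance means the two vectors point in more similar directions, which is what the search ranks by.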

In [None]:
# Q3. Search for objects close to "spicy food recipes" and show 4 QnA
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["spicy food recipes"]})
    .with_limit(4)
    .do()
)

print(json.dumps(response, indent=2))
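
Dumping the raw response is handy for inspection, but the hits themselves sit under `data` → `Get` → the class name. The sketch below unpacks them into readable lines; `get_hits` is a hypothetical helper, the sample response is hand-written, and the top-level `"errors"` check is an assumption about how failures are reported.

```python
def get_hits(response, class_name):
    """Return the list of result objects for class_name from a Get response."""
    # Assumption: a failed query carries a top-level "errors" entry.
    if "errors" in response:
        raise RuntimeError(f"query failed: {response['errors']}")
    return response["data"]["Get"][class_name]


# Hand-written response for illustration; not real query output.
sample_response = {
    "data": {"Get": {"Question": [
        {"question": "made-up question 1", "answer": "made-up answer 1"},
        {"question": "made-up question 2", "answer": "made-up answer 2"},
    ]}}
}

for i, hit in enumerate(get_hits(sample_response, "Question"), start=1):
    print(f"{i}. Q: {hit['question']} | A: {hit['answer']}")
```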

### Step 4: Filtered Semantic Search

One of the most powerful features of vector databases is the ability to combine **semantic similarity** with **hard filters** (scalar search). Here we find spicy food questions, restricted to the "Double Jeopardy!" round. Note that this is a filtered vector search, not what Weaviate calls *hybrid search*, which combines keyword and vector scoring.

In [None]:
# Q4. Spicy food recipes related questions in Double Jeopardy rounds
where_filter = {
    "path": ["round"],
    "operator": "Equal",
    # `round` is declared with dataType `text`, so the filter value uses `valueText`
    "valueText": "Double Jeopardy!"
}

response = (
    client.query
    .get("Question", ["question", "answer", "round"])
    .with_near_text({"concepts": ["spicy food recipes"]})
    .with_where(where_filter)
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))
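
Filters can also be nested: Weaviate's `where` syntax accepts an `And` or `Or` operator with a list of `operands`. The helpers below are hypothetical conveniences for building such dictionaries, and the second condition (a `Like` match on `answer`) is purely illustrative, not something the original notebook queries.

```python
def text_filter(prop, operator, value):
    """Build a single where-filter clause on a text property."""
    return {"path": [prop], "operator": operator, "valueText": value}


def all_of(*filters):
    """Combine clauses so that every one of them must match."""
    return {"operator": "And", "operands": list(filters)}


# Illustrative combination: Double Jeopardy! round AND answer starting with "chil".
compound_filter = all_of(
    text_filter("round", "Equal", "Double Jeopardy!"),
    text_filter("answer", "Like", "chil*"),
)

print(compound_filter)
```

The resulting dictionary could be passed to `.with_where(compound_filter)` in place of the single-clause filter.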