# Building a Semantic Search Engine with Weaviate

In this lab, we use **Weaviate**, a **vector database**. Traditional relational databases (such as MySQL) typically retrieve rows by exact or keyword matching. A vector database instead captures the *context* and *meaning* of data by storing each item as a vector, a point in a high-dimensional space, so that items with similar meaning can be retrieved by proximity.

### Step 1: Load the Jeopardy Dataset
We start by fetching a small sample of Jeopardy questions. Notice that the data contains `Category`, `Question`, and `Answer` fields.

In [2]:
import requests
import json

# Download the data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
resp.raise_for_status()  # Fail early if the download did not succeed
data = json.loads(resp.text)  # Parse the JSON into a list of dicts

# Preview the type, the number of records, and the first record
print(type(data), len(data))
print(json.dumps(data[0], indent=2))

<class 'list'> 10
{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}


In [3]:
def json_print(data):
    """Pretty-print any JSON-serializable object with 2-space indentation."""
    print(json.dumps(data, indent=2))

### Understanding Embeddings

When we import this data into Weaviate, each question is sent to an embedding model (in this case, one hosted by OpenAI), which converts the text into an **embedding**.

An embedding is a long list of numbers (a vector) that acts like a fingerprint for the meaning of the text. Because "DNA" and "Biology" are related concepts, their vectors will be "close" to each other in this space, as measured by a metric such as cosine distance, even if the texts don't share the same words.
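To make "close" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hypothetical values chosen for illustration. They are not real OpenAI embeddings, which have far more dimensions. The helper name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: made-up numbers, not model output)
toy_vectors = {
    "DNA": [0.9, 0.8, 0.1],
    "Biology": [0.8, 0.9, 0.2],
    "Jazz": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["DNA"], toy_vectors["Biology"])
sim_unrelated = cosine_similarity(toy_vectors["DNA"], toy_vectors["Jazz"])
print(f"DNA vs Biology: {sim_related:.3f}")
print(f"DNA vs Jazz:    {sim_unrelated:.3f}")
```

A similarity near 1 means the vectors point in almost the same direction. With these toy values, the related pair scores higher than the unrelated pair.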

### Step 2: Initialize the Vector DB

In [6]:
import weaviate
from weaviate import EmbeddedOptions
import os

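# Start an embedded (local) instance of Weaviate.
# The OpenAI key is forwarded so Weaviate can call the embedding model;
# os.environ[...] raises KeyError if OPENAI_API_KEY is not set.
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]
    })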

Started /Users/linkedin/.cache/weaviate-embedded: process ID 26222


In [7]:
# Check that Weaviate is up and running
json_print(client.get_meta())

### Step 3: Defining the Schema

Before loading data, we define a "Class" (similar to a table in a relational database). We set its vectorizer to `text2vec-openai`, which tells Weaviate to convert our text into vectors automatically during the import process.

In [8]:
# Delete the class if it already exists so the notebook can be re-run
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

In [9]:
# Create the schema that will house our data
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
}

client.schema.create_class(class_obj)
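The class above declares no properties, so Weaviate infers them from the imported objects. As a hedged sketch, the properties could instead be declared explicitly. The helper `build_class_schema` is a hypothetical name of our own, and we assume every field should be stored as `text`.

```python
def build_class_schema(name, vectorizer, text_properties):
    """Build a Weaviate class definition with explicitly declared text properties."""
    return {
        "class": name,
        "vectorizer": vectorizer,
        "properties": [
            {"name": prop, "dataType": ["text"]} for prop in text_properties
        ],
    }

# Assumption: these three property names match the ones used during import
explicit_class_obj = build_class_schema(
    "Question", "text2vec-openai", ["question", "answer", "category"]
)
print(explicit_class_obj)
```

This dictionary could be passed to `client.schema.create_class(...)` in place of `class_obj`. Declaring properties up front makes the data types explicit rather than inferred.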

### Step 4: Batch Importing

We use **batching** to upload multiple objects at once. Sending objects in groups means fewer network round trips and less per-request overhead for the database. During this step, Weaviate calls the embedding model for each question added in order to generate its vector.

In [10]:
with client.batch.configure() as batch:
    for i, d in enumerate(data):  # Batch import data
        print(f"importing question: {i+1}")

        # Map the dataset's capitalized keys to lowercase property names
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(
            data_object=properties,
            class_name="Question")
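To show what batching means independently of the client, here is a conceptual sketch that groups records into fixed-size chunks. The Weaviate client does its own grouping and sending, so this is only an illustration. The helpers `to_properties` and `chunked`, the batch size of 4, and the placeholder records are all assumptions of ours.

```python
def to_properties(record):
    """Map a raw Jeopardy record to the lowercase property names used on import."""
    return {
        "answer": record["Answer"],
        "question": record["Question"],
        "category": record["Category"],
    }

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder records standing in for the downloaded dataset (not real data)
sample = [
    {"Category": "SCIENCE", "Question": f"Question {n}", "Answer": f"Answer {n}"}
    for n in range(10)
]

batches = [[to_properties(r) for r in chunk] for chunk in chunked(sample, 4)]
print([len(b) for b in batches])  # → [4, 4, 2]
```

Ten records with a batch size of 4 result in three requests instead of ten.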

In [11]:
# Check how many objects we've loaded into the database
json_print(client.query.aggregate("Question").with_meta_count().do())
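If we want the count as a plain integer rather than printed JSON, a small helper can pull it out of the response. This is a sketch that assumes the usual nested shape of an aggregate response (`data` → `Aggregate` → class name → `meta` → `count`). The name `extract_count` and the sample response are hypothetical.

```python
def extract_count(response, class_name):
    """Return the object count from an aggregate-with-meta-count response."""
    try:
        return response["data"]["Aggregate"][class_name][0]["meta"]["count"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected aggregate response: {response!r}") from exc

# Hand-written example shaped like an aggregate response (assumption, not real output)
sample_response = {"data": {"Aggregate": {"Question": [{"meta": {"count": 10}}]}}}
print(extract_count(sample_response, "Question"))
```

In the notebook this would be called as `extract_count(client.query.aggregate("Question").with_meta_count().do(), "Question")`, and the result could be compared against `len(data)`.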

### Step 5: Querying the Knowledge

Now that the data is vectorized, we can retrieve items. The query below simply fetches any three objects to confirm that the import worked; it does not rank by meaning. In a semantic search, a query for "biology" would return the "DNA" question even if the word "biology" never appears in its text, because their vectors are close together.

In [12]:
# Extract and show any 3 questions and answers
json_print(client.query.get("Question", ["question","answer"]).with_limit(3).do())
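The query above is a plain fetch. A semantic search adds a `near_text` filter, as in the sketch below. The wrapper `semantic_search` and its default limit are our own assumptions. It expects a client like the one created in Step 2 and assumes that GraphQL errors come back under an `errors` key in the response.

```python
def semantic_search(client, concept, limit=2):
    """Return up to `limit` Question objects whose vectors are nearest to `concept`."""
    response = (
        client.query
        .get("Question", ["question", "answer", "category"])
        .with_near_text({"concepts": [concept]})  # Rank results by vector proximity
        .with_limit(limit)
        .do()
    )
    # Assumption: query errors are reported in the response body, not raised
    if "errors" in response:
        raise RuntimeError(response["errors"])
    return response["data"]["Get"]["Question"]
```

In the notebook this could be used as `json_print(semantic_search(client, "biology"))`. Running it calls the embedding model to vectorize the query text, so it needs the same API key as the import step.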