# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
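
Since each file is expected to follow this shape, it can help to check a record before indexing it. The sketch below is one possible check and is not part of the starter code: `EXPECTED_FIELDS` and `validate_game` are hypothetical names, and the field list is inferred only from the single example above, so other files may differ.

```python
import json

# Field names and types inferred from the one example record (an assumption).
EXPECTED_FIELDS = {
    "Name": str,
    "Platform": str,
    "Genre": str,
    "Publisher": str,
    "Description": str,
    "YearOfRelease": int,
}


def validate_game(game):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in game:
            problems.append(f"missing field: {field}")
        elif not isinstance(game[field], expected_type):
            problems.append(f"{field} should be of type {expected_type.__name__}")
    return problems


# The example record shown above.
sample = json.loads("""
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
""")
print(validate_game(sample))  # []

# A deliberately broken record: four fields missing, year stored as a string.
incomplete = {"Name": "Gran Turismo", "YearOfRelease": "1997"}
print(validate_game(incomplete))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file in a single pass.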


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
from openai import OpenAI
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [3]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"
# Load the .env file

In [4]:
# TODO: Load environment variables
# load_dotenv()
load_dotenv()

# Verify that key environment variables are loaded
print(" Environment variables loaded")
print(f" OPENAI_API_KEY: {'Found' if os.getenv('OPENAI_API_KEY') else 'NOT FOUND'}")
print(f" TAVILY_API_KEY: {'Found' if os.getenv('TAVILY_API_KEY') else 'NOT FOUND'}")

# OpenAI client pointed at the Vocareum endpoint
client = OpenAI(base_url="https://openai.vocareum.com/v1", api_key=os.getenv("OPENAI_API_KEY"))

 Environment variables loaded
 OPENAI_API_KEY: Found
 TAVILY_API_KEY: Found
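
The cell above reports whether two of the keys were found but carries on either way. A minimal sketch of a stricter check follows, assuming the three variable names listed in the `.env` TODO; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, not part of the starter code.

```python
import os

# Variable names taken from the .env TODO earlier in this notebook.
REQUIRED_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping, so the result does not depend on the real environment.
example_env = {"OPENAI_API_KEY": "sk-example", "TAVILY_API_KEY": ""}
print(missing_env_vars(environ=example_env))
```

In the notebook, calling `missing_env_vars()` right after `load_dotenv()` and raising an error when the list is non-empty would stop the run before any API call is attempted.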


### VectorDB Instance

In [5]:
# TODO: Instantiate your ChromaDB Client
# Choose any path you want
# chroma_client = chromadb.PersistentClient(path="chromadb")
# chromadb itself is already imported in the Setup cell
from chromadb.config import Settings

# Create a persistent ChromaDB client
# This will store the database on disk so it persists between sessions
chroma_client = chromadb.PersistentClient(
    path="./chromadb",  # Directory where the database will be stored
    settings=Settings(
        anonymized_telemetry=False,  # Disable telemetry for privacy
        allow_reset=True  # Allow resetting the database if needed
    )
)

print(" ChromaDB Persistent Client created successfully")
print(f" Database will be stored in: ./chromadb")

 ChromaDB Persistent Client created successfully
 Database will be stored in: ./chromadb


### Collection

In [6]:
# TODO: Pick one embedding function
# If picking something different than openai, 
# make sure you use the same when loading it
# embedding_fn = embedding_functions.OpenAIEmbeddingFunction()

# Option 1: OpenAI Embedding Function (RECOMMENDED)
# Uses OpenAI's text-embedding-3-small model
# Make sure your OPENAI_API_KEY is set in .env
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"  # Fast and cost-effective
)

print(" Using OpenAI embedding function")
print(f" Model: text-embedding-3-small")

 Using OpenAI embedding function
 Model: text-embedding-3-small


In [7]:
# TODO: Create a collection
# Choose any name you want
# collection = chroma_client.create_collection(
#    name="udaplay",
#    embedding_function=embedding_fn
#)
collection_name = "udaplay_games"

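# Delete any existing collection so the notebook can be re-run from a clean state
try:
    chroma_client.delete_collection(name=collection_name)
    print(f" Deleted existing collection: '{collection_name}'")
except Exception:
    # Raised when the collection does not exist yet; the exact exception type
    # depends on the installed Chroma version
    print("ℹ No existing collection to delete")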

# Create new collection
collection = chroma_client.create_collection(
    name=collection_name,
    embedding_function=embedding_fn,
    metadata={
        "description": "Video game database for UdaPlay AI agent",
        "hnsw:space": "cosine"  # Use cosine similarity for search
    }
)

print(f" Created collection: '{collection_name}'")
print(f" Embedding function: OpenAI text-embedding-3-small")

 Deleted existing collection: 'udaplay_games'
 Created collection: 'udaplay_games'
 Embedding function: OpenAI text-embedding-3-small
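
The `"hnsw:space": "cosine"` setting tells Chroma to rank neighbours by cosine distance. As I understand it, that distance is 1 minus the cosine similarity, so 0 means the vectors point the same way and larger values mean less similar. The pure-Python sketch below illustrates the measure on toy vectors; `cosine_distance` is a hypothetical helper and not Chroma's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```

Because only direction matters, the third pair scores as close as the first, which is why cosine is a common choice for text embeddings whose magnitudes carry little meaning.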


### Add documents

In [8]:
# The path is relative: the notebook is expected to run from project/starter,
# so "games" resolves to project/starter/games
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use file name (like 001) as ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
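
The loop above calls `collection.add` once per file. An alternative is to build the three parallel lists first and add them in a single call. The sketch below shows only the list-building step so it runs without Chroma; `build_batch` is a hypothetical helper, and the text format and ID scheme are copied from the cell above.

```python
import os


def build_batch(games_by_file):
    """Turn {file_name: game_dict} into parallel ids, documents, and metadatas lists."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(games_by_file):
        game = games_by_file[file_name]
        # Same ID scheme as the loop above: file name without its extension.
        ids.append(os.path.splitext(file_name)[0])
        # Same indexed text as the loop above.
        documents.append(
            f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
        )
        metadatas.append(game)
    return ids, documents, metadatas


# One record with a shortened description, keyed by an illustrative file name.
example = {
    "001.json": {
        "Name": "Gran Turismo",
        "Platform": "PlayStation 1",
        "Genre": "Racing",
        "Publisher": "Sony Computer Entertainment",
        "Description": "A realistic racing simulator.",
        "YearOfRelease": 1997,
    }
}
ids, documents, metadatas = build_batch(example)
print(ids, documents)
```

The result could then be passed as `collection.add(ids=ids, documents=documents, metadatas=metadatas)`, which would likely reduce the number of embedding requests compared with adding one document at a time.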

Test: confirm that the vector database can be queried with semantic search.
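
Semantic search ranks by embedding similarity and does not enforce hard constraints such as a release decade. In the recorded run of the test cell that follows, "racing games from the 90s" also returns Gran Turismo 5 (2010). One possible remedy is to post-filter the hits on their metadata. The sketch below assumes the nested result shape the test cell reads (`results['metadatas'][0]`); `filter_by_year` is a hypothetical helper.

```python
def filter_by_year(results, start, end, query_index=0):
    """Keep the hits for one query whose YearOfRelease falls within [start, end]."""
    per_query = results.get("metadatas") or [[]]
    return [m for m in per_query[query_index] if start <= m["YearOfRelease"] <= end]


# Names and years copied from the recorded output of the first test query;
# the other metadata fields are omitted for brevity.
recorded = {
    "metadatas": [[
        {"Name": "Gran Turismo", "YearOfRelease": 1997},
        {"Name": "Gran Turismo 5", "YearOfRelease": 2010},
        {"Name": "Pokémon Ruby and Sapphire", "YearOfRelease": 2002},
    ]]
}
nineties = filter_by_year(recorded, 1990, 1999)
print([m["Name"] for m in nineties])  # ['Gran Turismo']
```

Chroma's `query` also accepts a `where` metadata filter, which may be a better fit than post-filtering because it narrows the candidates before ranking; check the documentation for the operators your installed version supports.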

In [9]:
# Test semantic search queries
# See https://docs.trychroma.com/docs/embeddings/embedding-functions for code samples
test_queries = [
    "racing games from the 90s",
    "action games for PlayStation"
]
print("*" * 50)
print("Testing Semantic Search on Vector Database")
print("*" * 50)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 50)
    
    results = collection.query(
        query_texts=[query],
        n_results=3
    )
    
    if results['metadatas'] and results['metadatas'][0]:
        for i, metadata in enumerate(results['metadatas'][0], 1):
            print(f"\n{i}. {metadata['Name']} ({metadata['YearOfRelease']})")
            print(f"   Platform: {metadata['Platform']}")
            print(f"   Description: {metadata['Description'][:100]}...")
    
    print("\n" + "=" * 50)

print("\n✓ Semantic search test complete!")

**************************************************
Testing Semantic Search on Vector Database
**************************************************

Query: 'racing games from the 90s'
--------------------------------------------------

1. Gran Turismo (1997)
   Platform: PlayStation 1
   Description: A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for t...

2. Gran Turismo 5 (2010)
   Platform: PlayStation 3
   Description: A comprehensive racing simulator featuring a vast selection of vehicles and tracks, with realistic d...

3. Pokémon Ruby and Sapphire (2002)
   Platform: Game Boy Advance
   Description: Third-generation Pokémon games set in the Hoenn region, featuring new Pokémon and double battles....


Query: 'action games for PlayStation'
--------------------------------------------------

1. Gran Turismo (1997)
   Platform: PlayStation 1
   Description: A realistic racing simulator featuring a wide array of cars and tracks, setting a 