# [STARTER] Udaplay Project

## Part 01 - Vector Database

In this part of the project, you'll build a vector database using ChromaDB.

Your vector database will be populated with video game data and used for semantic search and retrieval.

### Setup

In [9]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [10]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("Libraries imported successfully!")

Libraries imported successfully!


In [11]:
# Initialize ChromaDB client with persistent storage
chroma_client = chromadb.PersistentClient(path="chromadb")

# Set up OpenAI embedding function
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base="https://openai.vocareum.com/v1",
    model_name="text-embedding-ada-002"
)

print("ChromaDB client and embedding function initialized!")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


ChromaDB client and embedding function initialized!
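
The cell above passes `os.getenv("OPENAI_API_KEY")` straight to the embedding function, and `os.getenv` returns `None` when the variable is unset. A minimal sketch of failing early instead, assuming a hypothetical helper named `require_env` (not part of ChromaDB or python-dotenv) and a placeholder variable name used only for the demo:

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of ChromaDB or python-dotenv.
    """
    value = os.getenv(name)
    if value is None or not value.strip():
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Demo with a placeholder variable set in-process (not a real API key)
os.environ["UDAPLAY_DEMO_KEY"] = "placeholder-value"
print(require_env("UDAPLAY_DEMO_KEY"))
```

In the notebook, `api_key=require_env("OPENAI_API_KEY")` could replace the `os.getenv` call, so a missing key is reported at setup rather than when embeddings are first requested.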


In [12]:
# Create or get collection
collection_name = "udaplay"

try:
    # Try to get existing collection first
    collection = chroma_client.get_collection(
        name=collection_name,
        embedding_function=embedding_fn
    )
    print(f"Retrieved existing collection: {collection_name}")
    print(f"Current document count: {collection.count()}")
except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
    # Create new collection if it doesn't exist
    collection = chroma_client.create_collection(
        name=collection_name,
        embedding_function=embedding_fn
    )
    print(f"Created new collection: {collection_name}")

Retrieved existing collection: udaplay
Current document count: 30


### Data Loading

In [13]:
# Load game data from JSON files
data_dir = "games"

# Check if directory exists
if not os.path.exists(data_dir):
    print(f"Directory '{data_dir}' not found!")
    print("Please ensure the games directory exists with JSON files.")
else:
    json_files = [f for f in os.listdir(data_dir) if f.endswith('.json')]
    print(f"Found {len(json_files)} JSON files in '{data_dir}' directory")
    
    # Show first few files
    for i, filename in enumerate(sorted(json_files)[:5]):
        print(f"   {i+1}. {filename}")
    
    if len(json_files) > 5:
        print(f"   ... and {len(json_files) - 5} more files")

Found 15 JSON files in 'games' directory
   1. 001.json
   2. 002.json
   3. 003.json
   4. 004.json
   5. 005.json
   ... and 10 more files


In [14]:
# Add documents to ChromaDB
if os.path.exists(data_dir):
    documents_added = 0
    
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue
            
        file_path = os.path.join(data_dir, file_name)
        
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                game = json.load(f)
            
            # Create content string for embedding
            content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
            
            # Use filename (without extension) as document ID
            doc_id = os.path.splitext(file_name)[0]
            
            # Upsert so that re-running this cell updates existing IDs
            # instead of trying to add them a second time
            collection.upsert(
                ids=[doc_id],
                documents=[content],
                metadatas=[game]
            )
            
            documents_added += 1
            
        except Exception as e:
            print(f"Error processing {file_name}: {e}")
    
    print(f"Successfully added {documents_added} documents to ChromaDB")
    print(f"Total documents in collection: {collection.count()}")
else:
    print("Games directory not found. Cannot load data.")

Successfully added 15 documents to ChromaDB
Total documents in collection: 30
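
The output above reports 30 documents although only 15 JSON files were found, and the search results further down return the same game twice. One plausible explanation is that an earlier run stored the same games under different IDs. A minimal sketch for spotting such entries, assuming a hypothetical helper `find_duplicate_ids` that takes the parallel `ids` and `documents` lists returned by `collection.get(include=["documents"])`; the demo data, including the `game-1` ID, is invented for illustration:

```python
def find_duplicate_ids(ids, documents):
    """Return the IDs whose document text repeats an earlier entry.

    Hypothetical helper: `ids` and `documents` are parallel lists.
    The first occurrence of each text is kept; later ones are reported.
    """
    seen = set()
    duplicates = []
    for doc_id, doc in zip(ids, documents):
        if doc in seen:
            duplicates.append(doc_id)
        else:
            seen.add(doc)
    return duplicates


# Illustrative data only; "game-1" is an invented ID
demo_ids = ["001", "002", "game-1"]
demo_docs = [
    "[PlayStation 1] Gran Turismo (1997)",
    "[PlayStation 2] Grand Theft Auto: San Andreas (2004)",
    "[PlayStation 1] Gran Turismo (1997)",
]
print(find_duplicate_ids(demo_ids, demo_docs))
```

If the returned list looks right after inspection, it could be passed to `collection.delete(ids=...)` to remove the repeated entries.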


### Semantic Search Demonstration

Now that we have loaded all the game data into our vector database, let's demonstrate that semantic search works properly.

In [15]:
# Test semantic search functionality
def test_semantic_search(collection, query, n_results=3):
    """Test semantic search and display results"""
    print(f"\n=== Searching for: '{query}' ===")
    
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )
    
    if results['documents'][0]:
        print(f"Found {len(results['documents'][0])} results:")
        
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0], 
            results['metadatas'][0], 
            results['distances'][0]
        )):
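            # Rough score only: 1 - distance equals cosine similarity only if the
            # collection uses cosine distance; Chroma's default is squared L2
            similarity = 1 - distance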
            print(f"\n{i+1}. {metadata['Name']} ({metadata['YearOfRelease']})")
            print(f"   Platform: {metadata['Platform']}")
            print(f"   Genre: {metadata.get('Genre', 'N/A')}")
            print(f"   Publisher: {metadata.get('Publisher', 'N/A')}")
            print(f"   Similarity: {similarity:.3f}")
            print(f"   Description: {metadata['Description'][:100]}...")
    else:
        print("No results found.")
    
    return results

# Test with different types of queries
test_queries = [
    "Nintendo racing games",
    "RPG games with fantasy themes",
    "PlayStation games from the 1990s",
    "Action games with shooting mechanics",
    "Games suitable for families"
]

for query in test_queries:
    test_semantic_search(collection, query, n_results=2)


=== Searching for: 'Nintendo racing games' ===
Found 2 results:

1. Mario Kart 8 Deluxe (2017)
   Platform: Nintendo Switch
   Genre: Racing
   Publisher: Nintendo
   Similarity: 0.854
   Description: An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechani...

2. Mario Kart 8 Deluxe (2017)
   Platform: Nintendo Switch
   Genre: Racing
   Publisher: Nintendo
   Similarity: 0.854
   Description: An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechani...

=== Searching for: 'RPG games with fantasy themes' ===
Found 2 results:

1. Pokémon Ruby and Sapphire (2002)
   Platform: Game Boy Advance
   Genre: Role-playing
   Publisher: Nintendo
   Similarity: 0.804
   Description: Third-generation Pokémon games set in the Hoenn region, featuring new Pokémon and double battles....

2. Pokémon Ruby and Sapphire (2002)
   Platform: Game Boy Advance
   Genre: Role-playing
   Publisher: Nintendo
   Similarity: 0
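
The similarity values above come from `1 - distance`. Because the collection was created without specifying a distance metric, the distances are most likely Chroma's default squared L2. For unit-length vectors, squared L2 distance and cosine similarity are related by `d = 2 - 2 * cos`, so `cos = 1 - d / 2`. A minimal sketch with toy vectors, assuming unit-length embeddings (the function names are illustrative, not a ChromaDB API):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance):
    """Convert squared L2 distance to cosine similarity.

    Valid only when both vectors have unit length.
    """
    return 1.0 - distance / 2.0


# Two toy unit-length vectors
vec_a = [1.0, 0.0]
vec_b = [0.6, 0.8]
dist = squared_l2(vec_a, vec_b)
print(f"squared L2:            {dist:.3f}")
print(f"cosine (direct):       {cosine_similarity(vec_a, vec_b):.3f}")
print(f"cosine (from distance): {cosine_from_squared_l2(dist):.3f}")
print(f"1 - distance:          {1 - dist:.3f}")
```

Under these assumptions `1 - distance` still ranks results in the same order, but it understates the cosine similarity, so the printed scores are better read as relative than absolute.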

In [16]:
# Display collection statistics
print("=== ChromaDB Collection Statistics ===")
print(f"Collection name: {collection.name}")
print(f"Total documents: {collection.count()}")

# Get a sample of documents to verify data structure
sample_docs = collection.get(limit=3, include=['documents', 'metadatas'])

print("\n=== Sample Documents ===")
for i, (doc, metadata) in enumerate(zip(sample_docs['documents'], sample_docs['metadatas'])):
    print(f"\n{i+1}. Document ID: {sample_docs['ids'][i]}")
    print(f"   Game: {metadata['Name']}")
    print(f"   Platform: {metadata['Platform']}")
    print(f"   Year: {metadata['YearOfRelease']}")
    print(f"   Content: {doc[:100]}...")

=== ChromaDB Collection Statistics ===
Collection name: udaplay
Total documents: 30

=== Sample Documents ===

1. Document ID: 001
   Game: Gran Turismo
   Platform: PlayStation 1
   Year: 1997
   Content: [PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars an...

2. Document ID: 002
   Game: Grand Theft Auto: San Andreas
   Platform: PlayStation 2
   Year: 2004
   Content: [PlayStation 2] Grand Theft Auto: San Andreas (2004) - An expansive open-world game set in the ficti...

3. Document ID: 003
   Game: Gran Turismo 5
   Platform: PlayStation 3
   Year: 2010
   Content: [PlayStation 3] Gran Turismo 5 (2010) - A comprehensive racing simulator featuring a vast selection ...
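
The sample documents above carry `Platform` and `YearOfRelease` metadata, so a query such as "PlayStation games from the 1990s" could be narrowed with a metadata filter instead of relying on the embedding alone. A minimal sketch of building such a filter in Chroma's `where` syntax, assuming a hypothetical helper `build_where` and assuming `YearOfRelease` is stored as an integer (the notebook does not confirm the stored type):

```python
def build_where(platform=None, year_from=None, year_before=None):
    """Build a Chroma-style `where` filter from optional constraints.

    Hypothetical helper. Returns None when no constraint is given, a single
    clause when there is exactly one, and an "$and" of clauses otherwise.
    """
    clauses = []
    if platform is not None:
        clauses.append({"Platform": {"$eq": platform}})  # exact string match
    if year_from is not None:
        clauses.append({"YearOfRelease": {"$gte": year_from}})
    if year_before is not None:
        clauses.append({"YearOfRelease": {"$lt": year_before}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


# Filter for PlayStation 1 games released in the 1990s
where_filter = build_where(platform="PlayStation 1", year_from=1990, year_before=2000)
print(where_filter)
```

The resulting dictionary could be passed as `where=where_filter` to `collection.query(...)` alongside `query_texts`. Note that `$eq` matches the platform string exactly, so "PlayStation 1" would not match "PlayStation 2".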


## Part 1 Summary

✅ **Completed Tasks:**

1. **Vector Database Setup**: Successfully configured ChromaDB with persistent storage
2. **Embedding Configuration**: Set up OpenAI embeddings for semantic search
3. **Data Processing**: Loaded and processed game JSON files
4. **Data Indexing**: Added all games to the vector database with appropriate metadata
5. **Semantic Search Verification**: Demonstrated that the vector database can be queried for semantic search

**Key Features Implemented:**
- Persistent ChromaDB client with local storage
- OpenAI embedding function for text vectorization
- Structured document format with game metadata
- Semantic search capabilities with similarity scoring
- Game database built from 15 JSON files, covering multiple platforms and genres

**Ready for Part 2:** The vector database is now ready to be used by the AI agent in Part 2 for intelligent game information retrieval.