# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
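
Before indexing, it can help to confirm that each record carries the fields shown above. The following is a minimal sketch, assuming the six keys in the example are required in every file; `REQUIRED_KEYS` and `missing_keys` are hypothetical names, not part of the starter code.

```python
# Hypothetical helper: the key set is taken from the example record above.
REQUIRED_KEYS = {"Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease"}


def missing_keys(game: dict) -> list[str]:
    """Return the required keys absent from a game record, sorted by name."""
    return sorted(REQUIRED_KEYS - set(game))


example = {
    "Name": "Gran Turismo",
    "Platform": "PlayStation 1",
    "Genre": "Racing",
    "Publisher": "Sony Computer Entertainment",
    "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
    "YearOfRelease": 1997,
}
print(missing_keys(example))  # an empty list means the record is complete
```

Running this check over every file before building the collection surfaces incomplete records early, instead of as a `KeyError` in the middle of ingestion.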


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [3]:
# DONE: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"
# CHAT_API_BASE_URL="YOUR_BASE_URL"  (read by the embedding function below; may be left unset)

In [4]:
load_dotenv()

True
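
The `True` above does not guarantee that every key is defined. The following is a minimal sketch of an explicit check, assuming the three variable names from the `.env` template; `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names introduced for illustration.

```python
import os

# Variable names come from the .env template above.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None) -> list[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing `env` explicitly makes the helper easy to test without touching the real process environment.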

### VectorDB Instance

In [5]:
# DONE: Instantiate your ChromaDB Client
# Choose any path you want
chroma_client = chromadb.PersistentClient(path="chromadb")

### Collection

In [6]:
# DONE: Pick one embedding function
# If you pick something other than OpenAI,
# make sure you use the same function when loading the collection
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base=os.getenv("CHAT_API_BASE_URL"),
)

In [7]:
# DONE: Create a collection
# Choose any name you want
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn
)

### Add documents

In [8]:
# Assumes the notebook runs from project/starter, so the data is in ./games
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use file name (like 001) as ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
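
The loop above issues one `add` call per file. As an alternative, the records can be collected first and sent in a single batch. The following is a minimal sketch using the same document format; `load_games` is a hypothetical helper, not part of the starter code.

```python
import json
import os


def load_games(data_dir: str) -> tuple[list[str], list[str], list[dict]]:
    """Read every .json file in data_dir and return (ids, documents, metadatas)."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue  # skip anything that is not a game file
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        ids.append(os.path.splitext(file_name)[0])  # file name without extension, e.g. "001"
        documents.append(
            f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
        )
        metadatas.append(game)
    return ids, documents, metadatas
```

The three lists can then be passed in one call, for example `collection.upsert(ids=ids, documents=documents, metadatas=metadatas)`. Choosing `upsert` over `add` here is an assumption about the desired behavior: it overwrites existing IDs, so re-running the notebook does not try to insert the same files twice.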

In [9]:
from pydantic import BaseModel, Field, ConfigDict
import pandas as pd
from typing import List, Optional, Dict, Any

print("Additional libraries loaded successfully")

Additional libraries loaded successfully


## Database Status Verification

Check current state of the vector database:

In [10]:
# Verify current database state
print("=" * 70)
print("DATABASE STATUS")
print("=" * 70)

total_docs = collection.count()
print(f"\nCollection Name: {collection.name}")
print(f"Total Documents: {total_docs}")

# Sample document
if total_docs > 0:
    sample = collection.peek(limit=1)
    print(f"\nSample Document ID: {sample['ids'][0]}")
    print(f"Content Preview: {sample['documents'][0][:120]}...")
    print(f"Available Metadata: {list(sample['metadatas'][0].keys())}")

print("=" * 70)

DATABASE STATUS

Collection Name: udaplay
Total Documents: 30

Sample Document ID: 001
Content Preview: [PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a ...
Available Metadata: ['Publisher', 'Description', 'YearOfRelease', 'Name', 'Platform', 'Genre']


## Data Models with Pydantic

Define structured data models for type safety and validation:

In [11]:
from pydantic import ConfigDict

class GameMetadata(BaseModel):
    """Structured game metadata model."""
    model_config = ConfigDict(populate_by_name=True)
    
    name: str = Field(alias='Name')
    platform: str = Field(alias='Platform')
    year: int = Field(alias='YearOfRelease')
    genre: Optional[str] = Field(default='Unknown', alias='Genre')
    publisher: Optional[str] = Field(default='Unknown', alias='Publisher')

class SearchResult(BaseModel):
    """Structured search result model."""
    id: str
    name: str
    platform: str
    year: int
    genre: str
    distance: float
    relevance_score: float = Field(ge=0, le=100)

class SearchResponse(BaseModel):
    """Complete search response model."""
    query: str
    total_results: int
    results: List[SearchResult]

print("Pydantic models defined")

Pydantic models defined


## Semantic Search Implementation

Implement and test semantic search with structured outputs:

In [12]:
def semantic_search(query: str, n_results: int = 5) -> SearchResponse:
    """
    Perform semantic search with structured output.
    
    Args:
        query: Natural language search query
        n_results: Number of results to return
        
    Returns:
        SearchResponse object with structured results
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    
    search_results = []
    for i in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][i]
        distance = results['distances'][0][i]
        
        search_results.append(SearchResult(
            id=results['ids'][0][i],
            name=metadata['Name'],
            platform=metadata['Platform'],
            year=metadata['YearOfRelease'],
            genre=metadata.get('Genre', 'Unknown'),
            distance=distance,
            # Clamp so a distance above 1 cannot violate the 0-100 constraint
            relevance_score=round(max(0.0, min(100.0, (1 - distance) * 100)), 2)
        ))
    
    return SearchResponse(
        query=query,
        total_results=len(search_results),
        results=search_results
    )

print("Semantic search function ready")

Semantic search function ready
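
`SearchResult.relevance_score` is constrained to the range 0 to 100, but `(1 - distance) * 100` stays in that range only while the distance is at most 1. Whether that holds depends on the collection's distance metric, which this notebook leaves at Chroma's default. The following is a minimal sketch of a clamped conversion; `distance_to_relevance` is a hypothetical helper, not part of the starter code.

```python
def distance_to_relevance(distance: float) -> float:
    """Map a distance to a 0-100 score via (1 - distance) * 100, clamped to that range."""
    score = (1.0 - distance) * 100.0
    return round(max(0.0, min(100.0, score)), 2)


print(distance_to_relevance(0.25))  # 75.0
print(distance_to_relevance(1.4))   # 0.0 (negative without the clamp)
```

Without the clamp, a weak match with a distance above 1 would make Pydantic raise a validation error instead of returning a low score.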


### Test Search 1

In [13]:
# Test Case 1: Genre-based search
print("=" * 70)
print("TEST 1: Genre-Based Search")
print("=" * 70)

response = semantic_search("action adventure games with great graphics", n_results=5)

print(f"\nQuery: '{response.query}'")
print(f"Total Results: {response.total_results}\n")

for i, result in enumerate(response.results, 1):
    print(f"{i}. {result.name}")
    print(f"   Platform: {result.platform} | Year: {result.year}")
    print(f"   Genre: {result.genre}")
    print(f"   Relevance: {result.relevance_score}%\n")

TEST 1: Genre-Based Search

Query: 'action adventure games with great graphics'
Total Results: 5

1. Red Dead Redemption 2
   Platform: PlayStation 4 | Year: 2018
   Genre: Action-adventure
   Relevance: 81.19%

2. Kinect Adventures!
   Platform: Xbox 360 | Year: 2010
   Genre: Party
   Relevance: 81.04%

3. The Witcher 3: Wild Hunt
   Platform: PlayStation 4 | Year: 2015
   Genre: Action RPG
   Relevance: 80.82%

4. Super Mario 64
   Platform: Nintendo 64 | Year: 1996
   Genre: Platformer
   Relevance: 80.34%

5. Dark Souls III
   Platform: PlayStation 4 | Year: 2016
   Genre: Action RPG
   Relevance: 80.04%



### Results as DataFrame

In [14]:
# Convert results to DataFrame
df = pd.DataFrame([r.model_dump() for r in response.results])
df = df[['name', 'platform', 'year', 'genre', 'relevance_score']]
df = df.rename(columns={'relevance_score': 'relevance_%'})

print("\nResults as DataFrame:")
print("=" * 70)
display(df)


Results as DataFrame:


| | name | platform | year | genre | relevance_% |
|---|---|---|---|---|---|
| 0 | Red Dead Redemption 2 | PlayStation 4 | 2018 | Action-adventure | 81.19 |
| 1 | Kinect Adventures! | Xbox 360 | 2010 | Party | 81.04 |
| 2 | The Witcher 3: Wild Hunt | PlayStation 4 | 2015 | Action RPG | 80.82 |
| 3 | Super Mario 64 | Nintendo 64 | 1996 | Platformer | 80.34 |
| 4 | Dark Souls III | PlayStation 4 | 2016 | Action RPG | 80.04 |


### Test Search 2

In [22]:
# Test Case 2: Platform-specific search
print("=" * 70)
print("TEST 2: Platform-Specific Search")
print("=" * 70)

response2 = semantic_search("Nintendo exclusive Mario games", n_results=3)

print(f"\nQuery: '{response2.query}'")
print(f"Total Results: {response2.total_results}\n")

for i, result in enumerate(response2.results, 1):
    print(f"{i}. {result.name} ({result.year})")
    print(f"   Platform: {result.platform}")
    print(f"   Relevance: {result.relevance_score}%\n")

TEST 2: Platform-Specific Search

Query: 'Nintendo exclusive Mario games'
Total Results: 3

1. Super Mario 64 (1996)
   Platform: Nintendo 64
   Relevance: 84.92%

2. Mario Kart 8 Deluxe (2017)
   Platform: Nintendo Switch
   Relevance: 83.99%

3. Super Mario World (1990)
   Platform: Super Nintendo Entertainment System (SNES)
   Relevance: 83.26%



## Reusable Vector Store Manager

Create a reusable manager class that wraps the vector database:

In [16]:
class UdaPlayVectorStore:
    """
    Reusable vector store manager for video game data.
    Provides persistent storage, search, and analytics capabilities.
    """

    def __init__(self, persist_path: str = "chromadb", collection_name: str = "udaplay"):
        """Initialize the vector store with existing or new collection."""
        self.client = chromadb.PersistentClient(path=persist_path)
        self.embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.getenv("OPENAI_API_KEY"),
            api_base=os.getenv("CHAT_API_BASE_URL"),
        )
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_fn
        )

    def add_game(self, game_data: dict, doc_id: str) -> None:
        """Add a single game to the database."""
        content = (
            f"[{game_data['Platform']}] {game_data['Name']} "
            f"({game_data['YearOfRelease']}) - {game_data['Description']}"
        )

        self.collection.add(
            ids=[doc_id],
            documents=[content],
            metadatas=[game_data]
        )

    def search(
        self,
        query: str,
        n_results: int = 5,
        filter_dict: Optional[Dict[str, Any]] = None
    ) -> SearchResponse:
        """
        Search the database with optional metadata filtering.

        Args:
            query: Search query
            n_results: Number of results
            filter_dict: Optional exact-match metadata filter (e.g., {"Platform": "PlayStation 4"})

        Returns:
            SearchResponse with structured results
        """
        query_params = {
            "query_texts": [query],
            "n_results": n_results
        }

        if filter_dict:
            query_params["where"] = filter_dict

        results = self.collection.query(**query_params)

        search_results = []
        for i in range(len(results['ids'][0])):
            metadata = results['metadatas'][0][i]
            distance = results['distances'][0][i]

            search_results.append(SearchResult(
                id=results['ids'][0][i],
                name=metadata['Name'],
                platform=metadata['Platform'],
                year=metadata['YearOfRelease'],
                genre=metadata.get('Genre', 'Unknown'),
                distance=distance,
                # Clamp so a distance above 1 cannot violate the 0-100 constraint
                relevance_score=round(max(0.0, min(100.0, (1 - distance) * 100)), 2)
            ))

        return SearchResponse(
            query=query,
            total_results=len(search_results),
            results=search_results
        )

    def get_stats(self) -> Dict[str, Any]:
        """Get database statistics."""
        return {
            "total_games": self.collection.count(),
            "collection_name": self.collection.name
        }

    def to_dataframe(self, search_response: SearchResponse) -> pd.DataFrame:
        """Convert search results to pandas DataFrame."""
        columns = ['name', 'platform', 'year', 'genre', 'relevance_score']
        # Passing columns keeps the frame well-formed when there are no results
        return pd.DataFrame([r.model_dump() for r in search_response.results], columns=columns)

print("UdaPlayVectorStore class defined")

UdaPlayVectorStore class defined
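
The `filter_dict` argument of `search` is passed to Chroma unchanged as `where`, so it can hold more than a single equality test. The following is a minimal sketch of a filter builder, assuming Chroma's documented `$gte` and `$and` operators; `build_where` and its parameters are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, Optional


def build_where(
    platform: Optional[str] = None,
    min_year: Optional[int] = None,
) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style `where` filter from optional criteria."""
    conditions = []
    if platform is not None:
        # Equality filters are exact matches on the stored metadata value
        conditions.append({"Platform": platform})
    if min_year is not None:
        conditions.append({"YearOfRelease": {"$gte": min_year}})

    if not conditions:
        return None           # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}


print(build_where(platform="PlayStation 4", min_year=2016))
```

A call such as `vector_store.search("racing games", filter_dict=build_where(platform="PlayStation 4"))` would then restrict results to that platform. The value must match the stored metadata exactly, so use the full platform name as it appears in the data (for example "PlayStation 4").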


### Initialize the Manager

In [17]:
# Initialize the vector store manager
vector_store = UdaPlayVectorStore()

# Get statistics
stats = vector_store.get_stats()
print("=" * 70)
print("VECTOR STORE STATISTICS")
print("=" * 70)
print(f"Collection: {stats['collection_name']}")
print(f"Total Games: {stats['total_games']}")
print("=" * 70)

VECTOR STORE STATISTICS
Collection: udaplay
Total Games: 30


### Advanced Search

In [18]:
# Example 1: Simple search
print("\nExample 1: Simple Search")
print("-" * 70)
response = vector_store.search("multiplayer shooter games", n_results=3)

for i, result in enumerate(response.results, 1):
    print(f"{i}. {result.name} ({result.platform})")
    print(f"   Relevance: {result.relevance_score}%")

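# Example 2: Filtered search
# Metadata filters match exactly, and platforms are stored under their full
# names, so {"Platform": "PS4"} (used in the recorded run below) matches nothing.
print("\nExample 2: Filtered Search (PlayStation 4 only)")
print("-" * 70)
response = vector_store.search(
    "racing games",
    n_results=3,
    filter_dict={"Platform": "PlayStation 4"}
)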

for i, result in enumerate(response.results, 1):
    print(f"{i}. {result.name} ({result.year})")
    print(f"   Relevance: {result.relevance_score}%")


Example 1: Simple Search
----------------------------------------------------------------------
1. Overwatch 2 (Multi-platform)
   Relevance: 83.53%
2. Fortnite (Multi-platform)
   Relevance: 82.1%
3. Mortal Kombat 11 (PlayStation 4)
   Relevance: 79.1%

Example 2: Filtered Search (PS4 only)
----------------------------------------------------------------------


In [19]:
# Convert to DataFrame for analysis
print("\nSearch Results as DataFrame:")
print("=" * 70)

response = vector_store.search("open world RPG games", n_results=5)
df = vector_store.to_dataframe(response)

display(df.style.background_gradient(subset=['relevance_score'], cmap='YlGn'))


Search Results as DataFrame:


| | name | platform | year | genre | relevance_score |
|---|---|---|---|---|---|
| 0 | Elden Ring | PlayStation 5 | 2022 | Action RPG | 84.3 |
| 1 | The Witcher 3: Wild Hunt | PlayStation 4 | 2015 | Action RPG | 83.58 |
| 2 | The Legend of Zelda: Breath of the Wild | Nintendo Switch | 2017 | Action-adventure | 82.53 |
| 3 | Marvel's Spider-Man | PlayStation 4 | 2018 | Action-adventure | 81.93 |
| 4 | Minecraft | Xbox One | 2014 | Sandbox, Survival | 81.02 |


## Final Validation

In [20]:
# Final validation checklist
print("=" * 70)
print("PART 1 COMPLETION CHECKLIST")
print("=" * 70)

checklist = {
    "ChromaDB vector database setup": True,
    "Game data processed and embedded": collection.count() > 0,
    "Semantic search implemented": callable(semantic_search),
    "Vector store manager created": 'UdaPlayVectorStore' in dir(),
    "Pydantic models defined": 'SearchResponse' in dir(),
    "DataFrame export capability": hasattr(vector_store, 'to_dataframe')
}

all_passed = True
for item, status in checklist.items():
    status_symbol = "[PASS]" if status else "[FAIL]"
    print(f"{status_symbol} {item}")
    all_passed = all_passed and status

print("\n" + "=" * 70)
if all_passed:
    print("Part 1: COMPLETE")
else:
    print("Part 1: Some requirements not met")
print("=" * 70)

PART 1 COMPLETION CHECKLIST
[PASS] ChromaDB vector database setup
[PASS] Game data processed and embedded
[PASS] Semantic search implemented
[PASS] Vector store manager created
[PASS] Pydantic models defined
[PASS] DataFrame export capability

Part 1: COMPLETE
