# 🎬 CymbalFlix Discover - Database Setup

Welcome to the data engineering portion of CymbalFlix Discover! In this notebook, you'll set up your AlloyDB database with everything needed for an AI-powered movie discovery application.

## What We're Building

By the end of this notebook, your database will contain:

| Table | Records | Purpose |
|-------|---------|--------|
| `movies` | ~9,700 | Core catalog with AI-searchable summaries and vector embeddings |
| `genres` | 19 | Genre lookup table |
| `movie_genres` | ~22,000 | Many-to-many junction for movie genres |
| `users` | 610 | User profiles extracted from ratings data |
| `ratings` | 100,836 | Historical ratings for analytics |
| `tags` | 3,683 | User-generated tags for semantic analysis |
| `links` | ~9,700 | External IDs (IMDb, TMDb) for integration |
| `watchlist` | 0 | Ready for user watchlist operations |

## AlloyDB Extensions We'll Enable

- **`vector`** - PostgreSQL vector data type for embeddings
- **`alloydb_scann`** - Google's ScaNN index for lightning-fast vector search
- **`google_ml_integration`** - Direct Vertex AI access from SQL

## Security: IAM Authentication

Notice something missing? **No database passwords!** We're using IAM authentication, which means:
- Your Google Cloud identity is your database identity
- No passwords to manage, rotate, or accidentally commit to Git
- The AlloyDB Python Connector handles secure authentication automatically

Let's get started! 🚀

---
## Step 1: Configure Your Environment

First, let's set up the configuration for your specific AlloyDB cluster. Fill in the form fields below with values from your lab instructions.

**Tip:** The form fields appear when you click on this cell. Just fill them in and run the cell!

In [35]:
# @title Configuration - Fill in your lab details { display-mode: "form" }
# @markdown Enter your project and cluster information from the lab instructions:

PROJECT_ID = "qwiklabs-gcp-00-afbef25f8738"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
USER_EMAIL = "student-03-e495cafebc2d@qwiklabs.net"  # @param {type:"string"}
CLUSTER_ID = "cymbalflix-cluster"  # @param {type:"string"}
INSTANCE_ID = "cymbalflix-primary"  # @param {type:"string"}

# Database name we'll create
DB_NAME = "cymbalflix"

# GCS bucket with our MovieLens data
DATA_BUCKET = "gs://class-demo/ml-latest-small"

# Validate configuration
if not PROJECT_ID:
    print("❌ Please enter your PROJECT_ID in the form field above!")
    print("   You can find it in the lab instructions or Cloud Console.")
else:
    print("✅ Configuration set!")
    print(f"   Project:  {PROJECT_ID}")
    print(f"   Region:   {REGION}")
    print(f"   Cluster:  {CLUSTER_ID}")
    print(f"   Instance: {INSTANCE_ID}")
    print("\n🔐 Using IAM authentication (no password required!)")

✅ Configuration set!
   Project:  qwiklabs-gcp-00-afbef25f8738
   Region:   us-central1
   Cluster:  cymbalflix-cluster
   Instance: cymbalflix-primary

🔐 Using IAM authentication (no password required!)
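
The validation in this cell only looks at `PROJECT_ID`. A minimal sketch of checking every form field at once is shown here; `missing_config_fields` is a hypothetical helper (not part of the lab code), and the dictionary values are placeholders.

```python
def missing_config_fields(config):
    """Return the names of settings that are empty or whitespace-only."""
    return [
        name
        for name, value in config.items()
        if not isinstance(value, str) or not value.strip()
    ]


# Placeholder values; in the notebook you would pass the form variables instead.
example_config = {
    "PROJECT_ID": "my-project",
    "REGION": "us-central1",
    "USER_EMAIL": "",
    "CLUSTER_ID": "cymbalflix-cluster",
    "INSTANCE_ID": "cymbalflix-primary",
}

print(missing_config_fields(example_config))
```

Reporting all missing fields in one pass saves a round trip through the form for each typo.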


---
## Step 2: Install Dependencies & Connect to AlloyDB

We'll use the **AlloyDB Python Connector** to establish a secure connection. This connector:

- Handles IAM authentication automatically
- Creates encrypted connections without manual certificate management  
- Works seamlessly in Colab, Cloud Shell, or any Python environment
- Is the recommended approach for production applications

**Why not Auth Proxy?** The Python Connector is more reliable in notebook environments and eliminates the need to manage a separate proxy process.

In [36]:
# Install required packages
# (the quotes keep the shell from treating [pg8000] as a glob pattern)
!pip install -q "google-cloud-alloydb-connector[pg8000]" \
    pandas google-cloud-storage sqlalchemy

print("✅ Dependencies installed!")

✅ Dependencies installed!


In [37]:
import pandas as pd
from google.cloud import storage
from google.cloud.alloydb.connector import Connector, IPTypes
import pg8000
import sqlalchemy
from sqlalchemy import text
import io
import re
import json
from datetime import datetime

# Build the instance URI for the connector
INSTANCE_URI = f"projects/{PROJECT_ID}/locations/{REGION}/clusters/{CLUSTER_ID}/instances/{INSTANCE_ID}"

# Initialize the AlloyDB connector
connector = Connector()

def get_connection(database="postgres"):
    """
    Create a connection to AlloyDB using the Python Connector.

    With enable_iam_auth=True, your Google Cloud identity is used
    for authentication - no password needed!
    """
    conn = connector.connect(
        INSTANCE_URI,
        "pg8000",
        user=USER_EMAIL,
        db=database,
        enable_iam_auth=True,  # Use your Google Cloud identity!
        ip_type=IPTypes.PUBLIC,
    )
    return conn

# Test the connection
print(f"🔗 Connecting to: {INSTANCE_URI}")
print("⏳ Establishing secure connection...")

try:
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT version();")
    version = cursor.fetchone()[0]
    cursor.execute("SELECT current_user;")
    current_user = cursor.fetchone()[0]
    cursor.close()
    conn.close()

    print("\n✅ Successfully connected to AlloyDB!")
    print(f"\n🔐 Authenticated as: {current_user}")
    print("\n📊 Database version:")
    print(f"   {version[:60]}...")
except Exception as e:
    print(f"\n❌ Connection failed: {e}")
    print("\n🔍 Troubleshooting tips:")
    print("   1. Verify your PROJECT_ID is correct (check the form above)")
    print("   2. Make sure your AlloyDB cluster shows 'Ready' in Cloud Console")
    print("   3. Confirm the cluster and instance names match your Terraform output")
    print("   4. Check that your user has the AlloyDB IAM Database User role")

🔗 Connecting to: projects/qwiklabs-gcp-00-afbef25f8738/locations/us-central1/clusters/cymbalflix-cluster/instances/cymbalflix-primary
⏳ Establishing secure connection...

✅ Successfully connected to AlloyDB!

🔐 Authenticated as: student-03-e495cafebc2d@qwiklabs.net

📊 Database version:
   PostgreSQL 16.9 on x86_64-pc-linux-gnu, compiled by Debian c...


---
## Step 3: Create the CymbalFlix Database

We'll create a dedicated database for CymbalFlix rather than using the default `postgres` database. This is a best practice: it keeps your application data isolated and makes it easier to manage permissions, backups, and migrations.

In [38]:
# Create the cymbalflix database
# We need to use autocommit mode for CREATE DATABASE
conn = get_connection("postgres")
conn.autocommit = True
cursor = conn.cursor()

# Check if database exists
cursor.execute("SELECT 1 FROM pg_database WHERE datname = %s", (DB_NAME,))
exists = cursor.fetchone()

if not exists:
    cursor.execute(f"CREATE DATABASE {DB_NAME}")
    print(f"✅ Created database: {DB_NAME}")
else:
    print(f"ℹ️  Database '{DB_NAME}' already exists - continuing...")

cursor.close()
conn.close()

ℹ️  Database 'cymbalflix' already exists - continuing...
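
`CREATE DATABASE` cannot take the database name as a bound parameter, which is why the cell interpolates `DB_NAME` into the SQL string. That is fine for a hard-coded name, but if the name ever came from user input it would need validating first. A minimal sketch, assuming a hypothetical `quote_identifier` helper that accepts only plain identifiers:

```python
import re


def quote_identifier(name):
    """Validate a PostgreSQL identifier and wrap it in double quotes.

    Accepts only letters, digits, and underscores (max 63 characters),
    so nothing that could break out of the quoting gets through.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]{0,62}", name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return '"' + name + '"'


print(f"CREATE DATABASE {quote_identifier('cymbalflix')}")
```

Note that a quoted identifier is case-sensitive in PostgreSQL, so this sketch is best paired with all-lowercase names such as `cymbalflix`.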


---
## Step 4: Enable Extensions

This is where AlloyDB becomes more than just PostgreSQL! We'll enable three powerful extensions:

| Extension | What It Does |
|-----------|-------------|
| `vector` | Adds the VECTOR data type for storing embeddings |
| `alloydb_scann` | Enables Google's ScaNN algorithm for fast similarity search |
| `google_ml_integration` | Connects AlloyDB directly to Vertex AI |

In [39]:
# Enable AlloyDB extensions
conn = get_connection(DB_NAME)
conn.autocommit = True
cursor = conn.cursor()

extensions = [
    ("vector", "Vector data type for embeddings"),
    ("alloydb_scann", "ScaNN index for lightning-fast vector similarity search"),
    ("google_ml_integration", "Direct Vertex AI integration for AI SQL functions")
]

print("🔧 Enabling AlloyDB extensions...\n")

failed = []
for ext_name, description in extensions:
    try:
        cursor.execute(f"CREATE EXTENSION IF NOT EXISTS {ext_name}")
        print(f"✅ {ext_name}")
        print(f"   └─ {description}")
    except Exception as e:
        failed.append(ext_name)
        print(f"⚠️  Could not enable {ext_name}: {e}")

cursor.close()
conn.close()

# Only report success when every extension was enabled
if failed:
    print(f"\n⚠️  Extensions not enabled: {', '.join(failed)}")
else:
    print("\n🎉 Extensions enabled!")

🔧 Enabling AlloyDB extensions...

✅ vector
   └─ Vector data type for embeddings
✅ alloydb_scann
   └─ ScaNN index for lightning-fast vector similarity search
✅ google_ml_integration
   └─ Direct Vertex AI integration for AI SQL functions

🎉 Extensions enabled!


---
## Step 5: Create the Database Schema

Our schema is designed for both transactional operations (watchlists, ratings) and analytical queries (trending movies, genre analysis).

**Key design decisions:**

- **Normalized genres** - Instead of storing "Action|Comedy|Sci-Fi" as text, we use a proper junction table
- **Vector column** - The `movies.summary_embedding` stores 3072-dimensional vectors for semantic search
- **Foreign keys** - Enforce data integrity across related tables
- **Timestamps** - Enable temporal analysis and audit trails

```
┌─────────────┐       ┌──────────────┐       ┌─────────────┐
│   movies    │───────│ movie_genres │───────│   genres    │
│ (+ vector)  │       │  (junction)  │       │  (lookup)   │
└─────────────┘       └──────────────┘       └─────────────┘
       │
       ├─────────────────────┬─────────────────────┐
       │                     │                     │
       ▼                     ▼                     ▼
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│   ratings   │       │    tags     │       │   links     │
└─────────────┘       └─────────────┘       └─────────────┘
       │                     │
       └─────────┬───────────┘
                 ▼
          ┌─────────────┐
          │    users    │
          └─────────────┘
                 │
                 ▼
          ┌─────────────┐
          │  watchlist  │
          └─────────────┘
```

In [41]:
# Define our database schema
schema_sql = """
-- Core movie catalog with vector embeddings for semantic search
CREATE TABLE IF NOT EXISTS movies (
    movie_id INTEGER PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    year INTEGER,
    summary TEXT,
    summary_embedding VECTOR(3072)
);

-- Genre lookup table
CREATE TABLE IF NOT EXISTS genres (
    genre_id SERIAL PRIMARY KEY,
    genre_name VARCHAR(50) UNIQUE NOT NULL
);

-- Many-to-many junction table for movie genres
CREATE TABLE IF NOT EXISTS movie_genres (
    movie_id INTEGER REFERENCES movies(movie_id) ON DELETE CASCADE,
    genre_id INTEGER REFERENCES genres(genre_id) ON DELETE CASCADE,
    PRIMARY KEY (movie_id, genre_id)
);

-- User profiles (extracted from ratings data)
CREATE TABLE IF NOT EXISTS users (
    user_id INTEGER PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Historical ratings for analytics
CREATE TABLE IF NOT EXISTS ratings (
    rating_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,
    movie_id INTEGER REFERENCES movies(movie_id) ON DELETE CASCADE,
    rating NUMERIC(2,1) NOT NULL CHECK (rating >= 0.5 AND rating <= 5.0),
    rated_at TIMESTAMP
);

-- User-generated tags for semantic analysis
CREATE TABLE IF NOT EXISTS tags (
    tag_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,
    movie_id INTEGER REFERENCES movies(movie_id) ON DELETE CASCADE,
    tag_text VARCHAR(255) NOT NULL,
    tagged_at TIMESTAMP
);

-- User watchlists (for transactional operations)
CREATE TABLE IF NOT EXISTS watchlist (
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,
    movie_id INTEGER REFERENCES movies(movie_id) ON DELETE CASCADE,
    added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, movie_id)
);

-- External database links (IMDb, TMDb)
CREATE TABLE IF NOT EXISTS links (
    movie_id INTEGER PRIMARY KEY REFERENCES movies(movie_id) ON DELETE CASCADE,
    imdb_id VARCHAR(20),
    tmdb_id INTEGER
);
"""

# Execute the schema
conn = get_connection(DB_NAME)
cursor = conn.cursor()
cursor.execute(schema_sql)
conn.commit()
cursor.close()
conn.close()

print("✅ Database schema created!")
print("\n📋 Tables created:")
print("   • movies (with VECTOR(3072) for embeddings)")
print("   • genres")
print("   • movie_genres (junction table)")
print("   • users")
print("   • ratings")
print("   • tags")
print("   • watchlist")
print("   • links (IMDb/TMDb IDs)")

✅ Database schema created!

📋 Tables created:
   • movies (with VECTOR(3072) for embeddings)
   • genres
   • movie_genres (junction table)
   • users
   • ratings
   • tags
   • watchlist
   • links (IMDb/TMDb IDs)


---
## Step 6: Load Data from Google Cloud Storage

Now comes the fun part: loading our MovieLens data! We'll load data directly from GCS and transform it as we go:

1. **Movies** - Extract year from title, e.g., "Toy Story (1995)" → title="Toy Story", year=1995
2. **Summaries** - AI-generated movie descriptions (merge into movies)
3. **Embeddings** - Pre-computed 3072-dimensional vectors from Gemini
4. **Genres** - Parse pipe-delimited genres into a normalized structure
5. **Users** - Extract unique user IDs from ratings
6. **Ratings & Tags** - Load with timestamp conversion
7. **Links** - External database identifiers

Let's start with a helper function to load CSV files from GCS:

In [42]:
def load_csv_from_gcs(bucket_path, filename):
    """Load a CSV file from GCS into a pandas DataFrame."""
    # Parse the bucket path (handle gs:// prefix and nested paths)
    path = bucket_path
    if path.startswith("gs://"):
        path = path[5:]

    if "/" in path:
        parts = path.split("/", 1)
        bucket_name = parts[0]
        blob_path = f"{parts[1]}/{filename}"
    else:
        bucket_name = path
        blob_path = filename

    client = storage.Client(project=PROJECT_ID)
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_path)

    content = blob.download_as_text()
    return pd.read_csv(io.StringIO(content))

print("✅ GCS loader ready!")

✅ GCS loader ready!
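
The path handling inside `load_csv_from_gcs` can be checked without touching GCS. This sketch pulls the same logic into a hypothetical pure function, `split_gcs_path` (an illustrative name, not part of the lab code), and also tolerates a trailing slash on the bucket path:

```python
def split_gcs_path(bucket_path, filename):
    """Split 'gs://bucket/prefix' plus a filename into (bucket, blob_path)."""
    # Drop the scheme and any leading/trailing slashes
    path = bucket_path.removeprefix("gs://").strip("/")
    # Everything before the first "/" is the bucket; the rest is the prefix
    bucket_name, _, prefix = path.partition("/")
    blob_path = f"{prefix}/{filename}" if prefix else filename
    return bucket_name, blob_path


print(split_gcs_path("gs://class-demo/ml-latest-small", "movies.csv"))
print(split_gcs_path("class-demo", "movies.csv"))
```

Keeping the parsing separate from the download makes it easy to test edge cases before any network call is made. `str.removeprefix` needs Python 3.9 or later.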


### 6.1 Load and Transform Movies

The MovieLens dataset stores the year in the title (e.g., "Jumanji (1995)"). We'll extract it into a separate column for better querying and analytics.

In [43]:
# Load movies from GCS
print("📥 Loading movies.csv from GCS...")
movies_df = load_csv_from_gcs(DATA_BUCKET, "movies.csv")
print(f"   Loaded {len(movies_df):,} movies")

# Extract year from title using regex
# Pattern matches " (YYYY)" at the end of the title
YEAR_SUFFIX = re.compile(r'\s*\((\d{4})\)\s*$')

def extract_year_and_clean_title(title):
    title = str(title)  # guard against non-string values
    match = YEAR_SUFFIX.search(title)
    if match:
        # Everything before the matched suffix is the clean title
        return title[:match.start()].strip(), int(match.group(1))
    return title, None

# Apply the transformation
movies_df[['clean_title', 'year']] = movies_df['title'].apply(
    lambda x: pd.Series(extract_year_and_clean_title(x))
)
movies_df['title'] = movies_df['clean_title']
movies_df = movies_df.drop(columns=['clean_title'])

# Store genres for later processing
movies_with_genres = movies_df[['movieId', 'genres']].copy()

print("\n✅ Movies processed!")
print("\n📊 Sample data:")
display(movies_df[['movieId', 'title', 'year']].head())

📥 Loading movies.csv from GCS...
   Loaded 9,742 movies

✅ Movies processed!

📊 Sample data:

|   | movieId | title | year |
|---|---------|-------|------|
| 0 | 1 | Toy Story | 1995.0 |
| 1 | 2 | Jumanji | 1995.0 |
| 2 | 3 | Grumpier Old Men | 1995.0 |
| 3 | 4 | Waiting to Exhale | 1995.0 |
| 4 | 5 | Father of the Bride Part II | 1995.0 |


### 6.2 Load and Merge Summaries

The summaries were generated using Gemini to provide rich, searchable descriptions of each movie. These enable semantic search: finding movies based on meaning, not just keywords.

In [44]:
# Load summaries
print("📥 Loading summaries.csv from GCS...")
summaries_df = load_csv_from_gcs(DATA_BUCKET, "summaries.csv")
print(f"   Loaded {len(summaries_df):,} summaries")

# Merge summaries into movies
movies_df = movies_df.merge(summaries_df, on='movieId', how='left')

print("\n✅ Summaries merged!")

# Show a sample summary
sample_movie = movies_df.iloc[0]
if pd.notna(sample_movie.get('summary')):
    print(f"\n📝 Sample summary for '{sample_movie['title']}':")
    print(f"   {sample_movie['summary'][:250]}...")

📥 Loading summaries.csv from GCS...
   Loaded 9,742 summaries

✅ Summaries merged!

📝 Sample summary for 'Toy Story':
   "Toy Story," released in 1995, is an American animated adventure comedy film produced by Pixar Animation Studios and distributed by Walt Disney Pictures. It tells the story of a group of toys who come to life when humans are not present. The plot cen...


### 6.3 Load and Merge Embeddings

The embeddings are 3072-dimensional vectors generated by Gemini's embedding model. Each vector captures the semantic meaning of a movie's summary, enabling similarity search.

**Why 3072 dimensions?** That's the default output size of Gemini's `gemini-embedding-001` model. More dimensions can capture more nuance, but they also require more storage and computation.
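
To put "more storage" into numbers, here is a rough back-of-the-envelope estimate for this dataset. It assumes each dimension is stored as a 4-byte float plus an 8-byte per-vector header (the layout pgvector documents for its `vector` type); actual on-disk size in AlloyDB will differ somewhat because of row and page overhead.

```python
DIMENSIONS = 3072
MOVIES = 9_742          # row count reported when movies.csv is loaded
BYTES_PER_FLOAT = 4     # single-precision float per dimension
HEADER_BYTES = 8        # assumed per-vector overhead (pgvector layout)

bytes_per_vector = DIMENSIONS * BYTES_PER_FLOAT + HEADER_BYTES
total_mib = bytes_per_vector * MOVIES / (1024 ** 2)

print(f"{bytes_per_vector:,} bytes per vector")
print(f"{total_mib:.1f} MiB for {MOVIES:,} movies")
```

At roughly 12 KB per movie, the embeddings end up much larger than the titles and summaries they describe, which is one reason quantized indexes are attractive for larger catalogs.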

In [45]:
# Load embeddings
print("📥 Loading embeddings.csv from GCS...")
embeddings_df = load_csv_from_gcs(DATA_BUCKET, "embeddings.csv")
print(f"   Loaded {len(embeddings_df):,} embeddings")

# Merge embeddings into movies
movies_df = movies_df.merge(embeddings_df, on='movieId', how='left')
print("\n✅ Embeddings merged!")

# Verify embedding format
sample_embedding = movies_df.iloc[0].get('embedding')
if pd.notna(sample_embedding):
    # Parse the JSON array to check dimensions
    try:
        embedding_values = json.loads(sample_embedding)
        print("\n🔢 Embedding details:")
        print(f"   Dimensions: {len(embedding_values)}")
        print(f"   Sample values: [{embedding_values[0]:.6f}, {embedding_values[1]:.6f}, ...]")
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        print("   Sample is not valid JSON (format will be parsed during insert)")

📥 Loading embeddings.csv from GCS...
   Loaded 9,742 embeddings

✅ Embeddings merged!

🔢 Embedding details:
   Dimensions: 3072
   Sample values: [-0.012312, -0.015699, ...]


### 6.4 Insert Movies into AlloyDB

Now we'll insert our prepared movie data into AlloyDB. Each embedding arrives as a JSON array string such as `[0.1, 0.2, ...]`, which matches the text format the `vector` type accepts, so PostgreSQL converts the string to a `VECTOR` value on insert.
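
Because the database only checks the vector when the row is inserted, it can help to validate an embedding before sending it. A minimal sketch, assuming a hypothetical `to_vector_literal` helper that checks the dimension count and emits the bracketed text form:

```python
import json
import math


def to_vector_literal(raw, expected_dim):
    """Turn a JSON array string (or a sequence of numbers) into '[v1,v2,...]'."""
    values = json.loads(raw) if isinstance(raw, str) else list(raw)
    if len(values) != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {len(values)}")
    for v in values:
        # Reject strings, None, NaN, and infinities
        if not isinstance(v, (int, float)) or not math.isfinite(v):
            raise ValueError(f"Non-numeric or non-finite value: {v!r}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


print(to_vector_literal("[0.1, 0.2, 0.3]", 3))
```

In the notebook, `expected_dim` would be 3072 to match the `VECTOR(3072)` column.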

In [46]:
# Prepare movies for insertion
conn = get_connection(DB_NAME)
cursor = conn.cursor()

print(f"📤 Inserting {len(movies_df):,} movies into AlloyDB...")

# Insert movies with upsert logic
insert_count = 0
for _, row in movies_df.iterrows():
    try:
        # Handle embedding - it's stored as a JSON string
        embedding = None
        if pd.notna(row.get('embedding')):
            embedding = row['embedding']  # Keep as string for PostgreSQL

        cursor.execute("""
            INSERT INTO movies (movie_id, title, year, summary, summary_embedding)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (movie_id) DO UPDATE SET
                title = EXCLUDED.title,
                year = EXCLUDED.year,
                summary = EXCLUDED.summary,
                summary_embedding = EXCLUDED.summary_embedding
        """, (
            int(row['movieId']),
            row['title'],
            int(row['year']) if pd.notna(row['year']) else None,
            row.get('summary') if pd.notna(row.get('summary')) else None,
            embedding
        ))
        insert_count += 1

        # Progress indicator (commit in batches of 2,000)
        if insert_count % 2000 == 0:
            print(f"   Processed {insert_count:,} movies...")
            conn.commit()

    except Exception as e:
        # A failed statement aborts the open PostgreSQL transaction, so roll
        # back and stop rather than fail on every remaining row.
        conn.rollback()
        print(f"   ⚠️  Error inserting movie {row['movieId']}: {e}")
        raise

conn.commit()
cursor.close()
conn.close()

print(f"\n✅ Inserted {insert_count:,} movies successfully!")

📤 Inserting 9,742 movies into AlloyDB...
   Processed 2,000 movies...
   Processed 4,000 movies...
   Processed 6,000 movies...
   Processed 8,000 movies...

✅ Inserted 9,742 movies successfully!
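
The loop above sends one `INSERT` per row and commits every 2,000 rows. If you want to experiment with batching, a small chunking helper is a common building block. This is a sketch under assumptions: `chunked` is a hypothetical helper, and the commented usage relies on the standard DB-API `executemany` method.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Possible usage with a DB-API cursor (not executed here):
#   for batch in chunked(parameter_tuples, 500):
#       cursor.executemany(insert_sql, batch)
#       conn.commit()

print(list(chunked(range(5), 2)))
```

Committing once per batch keeps each transaction small while avoiding a commit per row.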


### 6.5 Process and Load Genres

MovieLens stores genres as pipe-delimited strings (e.g., "Action|Comedy|Sci-Fi"). We'll normalize this into a proper relational structure with:
- A `genres` lookup table with unique genre names
- A `movie_genres` junction table linking movies to their genres
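
The normalization described above can be sketched in plain Python before any SQL is involved. `normalize_genres` is a hypothetical helper, and the three sample rows are made up for illustration; in the database the IDs come from the `SERIAL` column rather than from `enumerate`.

```python
NO_GENRES = "(no genres listed)"


def normalize_genres(rows):
    """Split (movie_id, 'A|B|C') rows into a genre lookup and junction pairs."""
    rows = list(rows)
    # Lookup table: every distinct genre name gets one ID
    names = sorted({
        genre
        for _, genres_str in rows
        if genres_str and genres_str != NO_GENRES
        for genre in genres_str.split("|")
    })
    genre_ids = {name: i for i, name in enumerate(names, start=1)}
    # Junction table: one (movie_id, genre_id) pair per movie-genre link
    pairs = [
        (movie_id, genre_ids[genre])
        for movie_id, genres_str in rows
        if genres_str and genres_str != NO_GENRES
        for genre in genres_str.split("|")
    ]
    return genre_ids, pairs


sample_rows = [(1, "Adventure|Comedy"), (2, "Comedy"), (3, NO_GENRES)]
genre_ids, pairs = normalize_genres(sample_rows)
print(genre_ids)
print(pairs)
```

Movies marked `(no genres listed)` simply get no junction rows, which matches how the cells below skip them.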

In [47]:
# Extract unique genres
all_genres = set()
for genres_str in movies_with_genres['genres']:
    if pd.notna(genres_str) and genres_str != '(no genres listed)':
        all_genres.update(genres_str.split('|'))

print(f"🎬 Found {len(all_genres)} unique genres:")
print(f"   {', '.join(sorted(all_genres))}")

# Insert genres into lookup table
conn = get_connection(DB_NAME)
cursor = conn.cursor()

for genre in sorted(all_genres):
    cursor.execute(
        "INSERT INTO genres (genre_name) VALUES (%s) ON CONFLICT (genre_name) DO NOTHING",
        (genre,)
    )

conn.commit()

# Get genre IDs for the junction table
cursor.execute("SELECT genre_id, genre_name FROM genres")
genre_lookup = {name: gid for gid, name in cursor.fetchall()}

print("\n✅ Genres inserted into lookup table!")

🎬 Found 19 unique genres:
   Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, IMAX, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

✅ Genres inserted into lookup table!


In [48]:
# Create movie_genres junction records
print("📤 Creating movie-genre associations...")

junction_count = 0
for _, row in movies_with_genres.iterrows():
    if pd.notna(row['genres']) and row['genres'] != '(no genres listed)':
        movie_id = int(row['movieId'])
        for genre in row['genres'].split('|'):
            if genre in genre_lookup:
                try:
                    cursor.execute(
                        "INSERT INTO movie_genres (movie_id, genre_id) VALUES (%s, %s) ON CONFLICT DO NOTHING",
                        (movie_id, genre_lookup[genre])
                    )
                    junction_count += 1
                except Exception as e:
                    # A failed insert (e.g. an unknown movie_id) aborts the
                    # transaction, so roll back and surface the error instead
                    # of silently failing on every later row.
                    conn.rollback()
                    print(f"   ⚠️  Error linking movie {movie_id} to '{genre}': {e}")
                    raise

conn.commit()
cursor.close()
conn.close()

print(f"\n✅ Created {junction_count:,} movie-genre associations!")

📤 Creating movie-genre associations...

✅ Created 22,050 movie-genre associations!


### 6.6 Load Users and Ratings

The ratings dataset contains over 100,000 ratings from 610 users. We'll:
1. Extract unique user IDs and create user records
2. Load ratings with converted timestamps (Unix epoch → PostgreSQL timestamp)
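
One detail worth knowing about the timestamp conversion: `datetime.fromtimestamp` without a `tz` argument uses the machine's local time zone, so the same epoch value can produce different results on different hosts. A minimal sketch of a time-zone-independent conversion, using a hypothetical `epoch_to_utc` helper that returns a naive UTC value suitable for a `TIMESTAMP` column:

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert Unix epoch seconds to a naive datetime expressed in UTC."""
    aware = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    # Drop tzinfo because the schema uses TIMESTAMP (without time zone)
    return aware.replace(tzinfo=None)


print(epoch_to_utc(0))
print(epoch_to_utc(964982703))
```

On a runtime whose clock is already set to UTC the two approaches agree; the explicit version just removes the dependency on that setting.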

In [49]:
# Load ratings
print("📥 Loading ratings.csv from GCS...")
ratings_df = load_csv_from_gcs(DATA_BUCKET, "ratings.csv")
print(f"   Loaded {len(ratings_df):,} ratings")

# Extract and insert unique users
unique_users = ratings_df['userId'].unique()
print(f"\n👥 Found {len(unique_users):,} unique users")

conn = get_connection(DB_NAME)
cursor = conn.cursor()

for uid in unique_users:
    cursor.execute(
        "INSERT INTO users (user_id) VALUES (%s) ON CONFLICT DO NOTHING",
        (int(uid),)
    )

conn.commit()
print("✅ Users inserted!")

📥 Loading ratings.csv from GCS...
   Loaded 100,836 ratings

👥 Found 610 unique users
✅ Users inserted!


In [50]:
# Insert ratings with timestamp conversion
from datetime import timezone

print(f"📤 Inserting {len(ratings_df):,} ratings...")

rating_count = 0
for _, row in ratings_df.iterrows():
    try:
        # MovieLens timestamps are seconds since the Unix epoch (UTC);
        # convert explicitly so the result does not depend on the local time zone
        rated_at = datetime.fromtimestamp(
            int(row['timestamp']), tz=timezone.utc
        ).replace(tzinfo=None)
        cursor.execute("""
            INSERT INTO ratings (user_id, movie_id, rating, rated_at)
            VALUES (%s, %s, %s, %s)
        """, (
            int(row['userId']),
            int(row['movieId']),
            float(row['rating']),
            rated_at
        ))
        rating_count += 1

        if rating_count % 20000 == 0:
            print(f"   Processed {rating_count:,} ratings...")
            conn.commit()

    except Exception as e:
        # A failed insert (e.g. a rating for a missing movie) aborts the
        # transaction, so roll back and surface the error instead of
        # silently failing on every later row.
        conn.rollback()
        print(f"   ⚠️  Error inserting rating for movie {row['movieId']}: {e}")
        raise

conn.commit()
cursor.close()
conn.close()

print(f"\n✅ Inserted {rating_count:,} ratings!")

📤 Inserting 100,836 ratings...
   Processed 20,000 ratings...
   Processed 40,000 ratings...
   Processed 60,000 ratings...
   Processed 80,000 ratings...
   Processed 100,000 ratings...

✅ Inserted 100,836 ratings!


### 6.7 Load Tags

Tags are user-generated labels for movies, such as "twist ending", "based on a book", or "visually stunning". These are great for demonstrating AlloyDB's AI SQL functions!

In [51]:
# Load tags
from datetime import timezone

print("📥 Loading tags.csv from GCS...")
tags_df = load_csv_from_gcs(DATA_BUCKET, "tags.csv")
print(f"   Loaded {len(tags_df):,} tags")

conn = get_connection(DB_NAME)
cursor = conn.cursor()

print("📤 Inserting tags...")

tag_count = 0
for _, row in tags_df.iterrows():
    try:
        # Convert epoch seconds explicitly as UTC (independent of local time zone)
        tagged_at = datetime.fromtimestamp(
            int(row['timestamp']), tz=timezone.utc
        ).replace(tzinfo=None)
        cursor.execute("""
            INSERT INTO tags (user_id, movie_id, tag_text, tagged_at)
            VALUES (%s, %s, %s, %s)
        """, (
            int(row['userId']),
            int(row['movieId']),
            str(row['tag']),
            tagged_at
        ))
        tag_count += 1
    except Exception as e:
        # A failed insert aborts the transaction, so roll back and surface
        # the error instead of silently failing on every later row.
        conn.rollback()
        print(f"   ⚠️  Error inserting tag for movie {row['movieId']}: {e}")
        raise

conn.commit()
cursor.close()
conn.close()

print(f"\n✅ Inserted {tag_count:,} tags!")

📥 Loading tags.csv from GCS...
   Loaded 3,683 tags
📤 Inserting tags...

✅ Inserted 3,683 tags!


### 6.8 Load Links

The links file contains external database identifiers for each movie:
- **IMDb ID** - Used for linking to IMDb pages (format: tt0000000)
- **TMDb ID** - The Movie Database ID for accessing additional metadata
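
The numeric `imdbId` column loses its leading zeros when pandas reads it as an integer, so the `tt0000000` form has to be rebuilt by zero-padding. A minimal sketch with two hypothetical helpers (`format_imdb_id` and `imdb_url` are illustrative names):

```python
def format_imdb_id(raw):
    """Format a numeric IMDb ID as 'tt' plus at least seven zero-padded digits."""
    return f"tt{int(raw):07d}"


def imdb_url(raw):
    """Build the IMDb title page URL for a numeric IMDb ID."""
    return f"https://www.imdb.com/title/{format_imdb_id(raw)}/"


print(format_imdb_id(114709))     # Toy Story's ID from links.csv
print(format_imdb_id(10000000))   # longer IDs are left unpadded
print(imdb_url(114709))
```

The `07d` format only sets a minimum width, so IDs with eight or more digits pass through without truncation.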

In [52]:
# Load links
print("📥 Loading links.csv from GCS...")
links_df = load_csv_from_gcs(DATA_BUCKET, "links.csv")
print(f"   Loaded {len(links_df):,} links")

conn = get_connection(DB_NAME)
cursor = conn.cursor()

print("📤 Inserting external links...")

link_count = 0
for _, row in links_df.iterrows():
    try:
        # Format IMDb ID with leading zeros (tt0000000 format)
        imdb_id = None
        if pd.notna(row.get('imdbId')):
            imdb_id = f"tt{int(row['imdbId']):07d}"

        tmdb_id = None
        if pd.notna(row.get('tmdbId')):
            tmdb_id = int(row['tmdbId'])

        cursor.execute("""
            INSERT INTO links (movie_id, imdb_id, tmdb_id)
            VALUES (%s, %s, %s)
            ON CONFLICT (movie_id) DO UPDATE SET
                imdb_id = EXCLUDED.imdb_id,
                tmdb_id = EXCLUDED.tmdb_id
        """, (
            int(row['movieId']),
            imdb_id,
            tmdb_id
        ))
        link_count += 1
    except Exception as e:
        # A failed insert aborts the transaction, so roll back and surface
        # the error instead of silently failing on every later row.
        conn.rollback()
        print(f"   ⚠️  Error inserting link for movie {row['movieId']}: {e}")
        raise

conn.commit()
cursor.close()
conn.close()

print(f"\n✅ Inserted {link_count:,} external links!")

📥 Loading links.csv from GCS...
   Loaded 9,742 links
📤 Inserting external links...

✅ Inserted 9,742 external links!


---
## Step 7: Verify Your Data

Let's make sure everything loaded correctly with some verification queries.

In [54]:
# Verification queries
conn = get_connection(DB_NAME)
cursor = conn.cursor()

verification_queries = [
    ("movies", "SELECT COUNT(*) FROM movies"),
    ("  └─ with summaries", "SELECT COUNT(*) FROM movies WHERE summary IS NOT NULL"),
    ("  └─ with embeddings", "SELECT COUNT(*) FROM movies WHERE summary_embedding IS NOT NULL"),
    ("genres", "SELECT COUNT(*) FROM genres"),
    ("movie_genres", "SELECT COUNT(*) FROM movie_genres"),
    ("users", "SELECT COUNT(*) FROM users"),
    ("ratings", "SELECT COUNT(*) FROM ratings"),
    ("tags", "SELECT COUNT(*) FROM tags"),
    ("links", "SELECT COUNT(*) FROM links"),
]

print("📊 Data Verification Report")
print("=" * 45)

for name, query in verification_queries:
    cursor.execute(query)
    count = cursor.fetchone()[0]
    print(f"   {name}: {count:,}")

cursor.close()
conn.close()

print("=" * 45)
print("\n✅ All data loaded successfully!")

📊 Data Verification Report
   movies: 9,742
     └─ with summaries: 9,742
     └─ with embeddings: 9,742
   genres: 19
   movie_genres: 22,050
   users: 610
   ratings: 100,836
   tags: 3,683
   links: 9,742

✅ All data loaded successfully!
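
The report prints its success message regardless of the counts, so it is worth comparing the numbers against what you expect. A minimal sketch, assuming a hypothetical `count_mismatches` helper; the expected values are the counts shown in the report above.

```python
EXPECTED_COUNTS = {
    "movies": 9_742,
    "genres": 19,
    "movie_genres": 22_050,
    "users": 610,
    "ratings": 100_836,
    "tags": 3_683,
    "links": 9_742,
}


def count_mismatches(actual, expected=EXPECTED_COUNTS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    return {
        table: (want, actual.get(table))
        for table, want in expected.items()
        if actual.get(table) != want
    }


# Example: pretend the tags load stopped early
observed = dict(EXPECTED_COUNTS, tags=3_000)
print(count_mismatches(observed))
```

An empty dictionary means every table matched; anything else tells you which load step to rerun.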


In [55]:
# Sample query: Top-rated movies with their genres
sample_query = """
SELECT
    m.title,
    m.year,
    ROUND(AVG(r.rating)::numeric, 2) as avg_rating,
    -- Joining ratings with movie_genres repeats each rating once per genre,
    -- so this count (and the HAVING threshold) is ratings x genres.
    -- Use COUNT(DISTINCT r.rating_id) for the true number of ratings.
    COUNT(r.rating_id) as num_ratings,
    STRING_AGG(DISTINCT g.genre_name, ', ' ORDER BY g.genre_name) as genres
FROM movies m
JOIN ratings r ON m.movie_id = r.movie_id
JOIN movie_genres mg ON m.movie_id = mg.movie_id
JOIN genres g ON mg.genre_id = g.genre_id
GROUP BY m.movie_id, m.title, m.year
HAVING COUNT(r.rating_id) >= 50
ORDER BY avg_rating DESC, num_ratings DESC
LIMIT 10;
"""

conn = get_connection(DB_NAME)
result_df = pd.read_sql(sample_query, conn)
conn.close()

print("🏆 Top 10 Highest-Rated Movies (minimum 50 ratings):")
display(result_df)

🏆 Top 10 Highest-Rated Movies (minimum 50 ratings):

|   | title | year | avg_rating | num_ratings | genres |
|---|-------|------|------------|-------------|--------|
| 0 | Shawshank Redemption, The | 1994 | 4.43 | 634 | Crime, Drama |
| 1 | Sunset Blvd. (a.k.a. Sunset Boulevard) | 1950 | 4.33 | 81 | Drama, Film-Noir, Romance |
| 2 | Double Indemnity | 1944 | 4.32 | 51 | Crime, Drama, Film-Noir |
| 3 | Philadelphia Story, The | 1940 | 4.31 | 87 | Comedy, Drama, Romance |
| 4 | Once Upon a Time in the West (C'era una volta ... | 1968 | 4.31 | 54 | Action, Drama, Western |
| 5 | Lawrence of Arabia | 1962 | 4.3 | 135 | Adventure, Drama, War |
| 6 | Godfather, The | 1972 | 4.29 | 384 | Crime, Drama |
| 7 | Harold and Maude | 1971 | 4.29 | 78 | Comedy, Drama, Romance |
| 8 | Logan | 2017 | 4.28 | 50 | Action, Sci-Fi |
| 9 | Fight Club | 1999 | 4.27 | 872 | Action, Crime, Drama, Thriller |

---
## Step 8: Create the ScaNN Index

Now for the feature that makes AlloyDB special for AI workloads: the **ScaNN index**.

**What is ScaNN?** Scalable Nearest Neighbors is Google's algorithm for fast vector similarity search. It's the same technology that powers Google Search's ability to find similar content across billions of documents.

**Why do we need it?** Without an index, finding similar movies requires comparing your query vector against every single movie, which is about 9,700 comparisons per query. With ScaNN, the embeddings are grouped into partitions (`num_leaves` below), and a query only scans the partitions closest to it.

| Without ScaNN | With ScaNN |
|--------------|------------|
| Compares against all ~9,700 movies | Compares against the movies in a few nearby partitions |
| Exact results; cost grows linearly, O(n) | Approximate results; cost grows sublinearly with table size |
| Latency rises as the catalog grows | Latency stays low as the catalog grows |

In [56]:
# Create the ScaNN index
conn = get_connection(DB_NAME)
conn.autocommit = True
cursor = conn.cursor()

print("🔧 Creating ScaNN index on movie embeddings...")
print("   This may take a moment...\n")

try:
    cursor.execute("""
        CREATE INDEX IF NOT EXISTS movies_embedding_scann_idx
        ON movies USING scann (summary_embedding cosine)
        WITH (num_leaves = 50, quantizer = 'sq8');
    """)
    print("✅ ScaNN index created!")
    print("\n📊 Index configuration:")
    print("   • Distance metric: cosine (measures angle between vectors)")
    print("   • num_leaves: 50 (partitions for efficient search)")
    print("   • quantizer: sq8 (8-bit scalar quantization for speed)")
except Exception as e:
    if "already exists" in str(e).lower():
        print("ℹ️  ScaNN index already exists")
    else:
        print(f"⚠️  Could not create index: {e}")

cursor.close()
conn.close()

🔧 Creating ScaNN index on movie embeddings...
   This may take a moment...

✅ ScaNN index created!

📊 Index configuration:
   • Distance metric: cosine (measures angle between vectors)
   • num_leaves: 50 (partitions for efficient search)
   • quantizer: sq8 (8-bit scalar quantization for speed)
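
The `sq8` setting refers to 8-bit scalar quantization. As a rough illustration of what that means, the sketch below maps each float in a vector to an integer code between 0 and 255 and back again. The per-vector min/max scaling and the function names are assumptions for this example; the notebook does not describe how AlloyDB quantizes internally.

```python
def quantize_sq8(vector):
    """Map each float to an integer code in 0..255 using the vector's own range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def dequantize_sq8(codes, lo, scale):
    """Approximate the original floats from the 8-bit codes."""
    return [lo + c * scale for c in codes]


# Hypothetical 5-dimensional embedding (real ones here have 3072 dimensions)
embedding = [0.12, -0.53, 0.98, 0.0, -1.0]

codes, lo, scale = quantize_sq8(embedding)
restored = dequantize_sq8(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(embedding, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each value now fits in one byte instead of a 4-byte float, at the cost of a small rounding error of at most half a quantization step.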


---
## Step 9: Semantic Search Demo 🎯

This is the payoff! Let's see semantic search in action.

**How it works:**
1. Your search query gets converted to a 3072-dimensional vector using Gemini's embedding model
2. AlloyDB uses the ScaNN index to find movies with similar vectors
3. Results are ranked by cosine similarity, computed as 1 minus the cosine distance returned by the `<=>` operator (1.0 = same direction, values near 0.0 = unrelated)
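
The relationship between cosine similarity and cosine distance in step 3 can be sketched in plain Python. The function names and the tiny two-dimensional vectors below are illustrative assumptions, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ranges from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, the quantity the search query orders by."""
    return 1.0 - cosine_similarity(a, b)


# Vectors pointing the same way score close to 1, regardless of length
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite vectors score -1
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(round(same_direction, 3), orthogonal, opposite)
print(round(cosine_distance([1.0, 2.0], [2.0, 4.0]), 3))
```

Ordering by ascending distance is therefore the same as ordering by descending similarity, which is why the query sorts on the distance expression and reports `1 - distance` as the score.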

In [57]:
def semantic_search(query, limit=5):
    """
    Search for movies using semantic similarity.

    This converts your natural language query into a vector,
    then finds movies with similar vectors.
    """
    conn = get_connection(DB_NAME)

    # <=> is the cosine distance operator, so similarity = 1 - distance
    search_sql = """
    WITH query_embedding AS (
        SELECT embedding(
            'gemini-embedding-001',   -- no registration needed
            %s                        -- the query text from Python
        )::vector AS embedding
    )
    SELECT
        m.title,
        m.year,
        ROUND((1 - (m.summary_embedding <=> q.embedding))::numeric, 3) AS similarity,
        LEFT(m.summary, 150) || '...' AS summary_preview
    FROM movies m
    CROSS JOIN query_embedding q
    WHERE m.summary_embedding IS NOT NULL
    ORDER BY m.summary_embedding <=> q.embedding
    LIMIT %s;
    """

    try:
        result = pd.read_sql(search_sql, conn, params=(query, limit))
    finally:
        conn.close()  # release the connection even if the query fails
    return result

print("✅ Semantic search function ready!")

✅ Semantic search function ready!


In [58]:
# Demo 1: Conceptual search
print("🔍 Search: 'A movie about artificial intelligence becoming self-aware'")
print("=" * 70)
results = semantic_search("A movie about artificial intelligence becoming self-aware")
display(results)

🔍 Search: 'A movie about artificial intelligence becoming self-aware'

Unnamed: 0,title,year,similarity,summary_preview
0,Ex Machina,2015,0.631,"""Ex Machina"" (2015), written and directed by A..."
1,Transcendence,2014,0.621,"The 2014 science fiction thriller ""Transcenden..."
2,"I, Robot",2004,0.594,"The 2004 science fiction action film *I, Robot..."
3,Chappie,2015,0.584,*Chappie* is a 2015 science fiction action fil...
4,Autómata (Automata),2014,0.58,"""Autómata"" is a 2014 English-language Spanish-..."


In [59]:
# Demo 2: Emotional/thematic search
print("🔍 Search: 'Heartwarming story about unlikely friendship'")
print("=" * 70)
results = semantic_search("Heartwarming story about unlikely friendship")
display(results)

🔍 Search: 'Heartwarming story about unlikely friendship'

Unnamed: 0,title,year,similarity,summary_preview
0,Monsieur Ibrahim (Monsieur Ibrahim et les fleu...,2003,0.594,"""Monsieur Ibrahim et les fleurs du Coran"" is a..."
1,Hachiko: A Dog's Story (a.k.a. Hachi: A Dog's ...,2009,0.586,"""Hachiko: A Dog's Story"" (also known as ""Hachi..."
2,Somers Town,2008,0.584,"""Somers Town"" is a 2008 British independent co..."
3,Radio,2003,0.584,"The 2003 biographical sports drama film ""Radio..."
4,A Street Cat Named Bob,2016,0.583,The 2016 biographical drama film *A Street Cat...


In [60]:
# Demo 3: Compare semantic vs. what keyword search would find
print("🔍 Search: 'space adventure'")
print("=" * 70)
print("\n📊 Semantic Search Results (finds movies by MEANING):")
results = semantic_search("space adventure")
display(results)

# Now show what a simple keyword search would find
print("\n📊 Traditional Keyword Search (finds movies by EXACT WORDS):")
conn = get_connection(DB_NAME)
keyword_results = pd.read_sql("""
    SELECT title, year, LEFT(summary, 100) || '...' AS summary_preview
    FROM movies
    WHERE LOWER(title) LIKE '%space%'
       OR LOWER(summary) LIKE '%space adventure%'
    LIMIT 5;
""", conn)
conn.close()
display(keyword_results)

print("\n💡 Notice how semantic search finds thematically similar movies")
print("   even if 'space adventure' doesn't appear in the text!")

🔍 Search: 'space adventure'

📊 Semantic Search Results (finds movies by MEANING):

Unnamed: 0,title,year,similarity,summary_preview
0,"Mystery of the Third Planet, The (Tayna tretey...",1981,0.576,"""The Mystery of the Third Planet"" (1981), orig..."
1,"Trip to the Moon, A (Voyage dans la lune, Le)",1902,0.573,"Georges Méliès's ""A Trip to the Moon"" (Le Voya..."
2,A Cosmic Christmas,1977,0.571,"""A Cosmic Christmas,"" an animated television s..."
3,Cosmic Scrat-tastrophe,2015,0.57,"""Cosmic Scrat-tastrophe"" is a 2015 animated sh..."
4,Escape from Planet Earth,2013,0.565,*Escape from Planet Earth* is a 2013 animated ...


📊 Traditional Keyword Search (finds movies by EXACT WORDS):

Unnamed: 0,title,year,summary_preview
0,Jumanji,1995,"The 1995 American fantasy adventure film ""Juma..."
1,Lawnmower Man 2: Beyond Cyberspace,1996,"In ""Lawnmower Man 2: Beyond Cyberspace"" (1996)..."
2,Space Jam,1996,The 1996 live-action/animated sports comedy fi...
3,2001: A Space Odyssey,1968,Stanley Kubrick's 1968 epic science fiction fi...
4,Lost in Space,1998,The 1998 science fiction action-adventure film...


💡 Notice how semantic search finds thematically similar movies
   even if 'space adventure' doesn't appear in the text!


---
## Step 10: Verify Columnar Engine

AlloyDB's columnar engine accelerates analytical queries by up to 100x. It works automatically: AlloyDB identifies analytical query patterns and creates optimized columnar representations.

Let's verify it's enabled on your instance:

In [66]:
# Check columnar engine settings
conn = get_connection(DB_NAME)
cursor = conn.cursor()

print("🔧 Columnar Engine Configuration")
print("=" * 50)

# '%columnar%' already matches every google_columnar_engine.* setting,
# so a single LIKE pattern is enough
cursor.execute("""
    SELECT name, setting, short_desc
    FROM pg_settings
    WHERE name LIKE '%columnar%'
    ORDER BY name;
""")

results = cursor.fetchall()
if results:
    for name, setting, desc in results:
        print(f"   {name}: {setting}")
    print("\n✅ Columnar engine is configured!")
    print("   Analytical queries will be automatically accelerated.")
else:
    print("   No columnar settings found (may be auto-configured)")

cursor.close()
conn.close()

🔧 Columnar Engine Configuration
   google_columnar_engine.adaptive_auto_refresh_schedule: 
   google_columnar_engine.auto_columnarization_schedule: 
   google_columnar_engine.columnar_hash_joins_cost_factor: 100
   google_columnar_engine.enable_aggregate_distinct_in_aggregate_pushdown: off
   google_columnar_engine.enable_auto_columnarization: on
   google_columnar_engine.enable_auto_columnarization_storage_cache_spill: on
   google_columnar_engine.enable_auto_cu_selection: off
   google_columnar_engine.enable_columnar_scan: on
   google_columnar_engine.enable_hashed_inlist: off
   google_columnar_engine.enable_materialized_view: on
   google_columnar_engine.enable_select_distinct_in_aggregate_pushdown: on
   google_columnar_engine.enable_timestamptz_date: on
   google_columnar_engine.enable_vectorized_join: off
   google_columnar_engine.enable_vectorized_join_on_storage: off
   google_columnar_engine.enabled: on
   google_columnar_engine.enforce_new_defaults: off
   google_columnar
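
The cell above reports success whenever any columnar setting exists. A stricter check could look at the `google_columnar_engine.enabled` flag specifically. The helper below is a hypothetical sketch of that idea: the function name is invented here, and the sample rows reuse two values from the output above with placeholder descriptions.

```python
def is_columnar_engine_enabled(rows):
    """Return True only if google_columnar_engine.enabled is set to 'on'.

    rows: (name, setting, short_desc) tuples, shaped like the
    pg_settings query results in the cell above.
    """
    settings = {name: setting for name, setting, _ in rows}
    return settings.get("google_columnar_engine.enabled") == "on"


# Sample rows mirroring the output above (descriptions are placeholders)
sample_rows = [
    ("google_columnar_engine.enable_columnar_scan", "on", ""),
    ("google_columnar_engine.enabled", "on", ""),
]

print(is_columnar_engine_enabled(sample_rows))
```

In the notebook, the same function could be called on the `results` list returned by `cursor.fetchall()`.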

---
## 🎉 Congratulations!

Your CymbalFlix database is fully operational! Here's what you've accomplished:

### Database Setup
- ✅ Connected to AlloyDB using **IAM authentication** (no passwords!)
- ✅ Created a dedicated `cymbalflix` database
- ✅ Enabled vector, ScaNN, and ML integration extensions
- ✅ Registered Vertex AI model endpoints

### Data Loading
- ✅ Loaded ~9,700 movies with AI-generated summaries
- ✅ Added 3072-dimensional vector embeddings for semantic search
- ✅ Normalized genres into a proper relational structure
- ✅ Loaded 100,000+ ratings and 3,600+ tags
- ✅ Added external links (IMDb, TMDb)

### AI Features
- ✅ Created a ScaNN index for lightning-fast vector similarity
- ✅ Tested semantic search that finds movies by meaning
- ✅ Verified columnar engine for analytical acceleration

### Security Highlight 🔐

Notice how we never handled a database password? That's **IAM authentication** in action:
- Your Google Cloud identity IS your database identity
- The Python Connector handles secure token exchange automatically
- No credentials to rotate, leak, or accidentally commit to Git

This is the **production-ready** way to handle database authentication in Google Cloud.

---

### What's Next?

Return to the lab instructions for **Task 4**, where you'll build the CymbalFlix Discover web application using Streamlit. You'll create a user interface that lets anyone search for movies semantically and explore AI-powered recommendations!

🎬 Your database is ready to power an AI-driven movie discovery experience! 🤖

In [None]:
# Cleanup: Close the connector when done
# Uncomment the line below when you're finished with the notebook
# connector.close()
# print("✅ Connector closed.")