# Snowflake to Model Deployment Demo

In this demo, you'll walk through a complete machine learning pipeline from data ingestion to deployment and inference using containerized infrastructure.

## Demo Overview

This demo includes the following key steps:

1. **Data Ingestion from Snowflake**  
   Load the Titanic dataset from a CSV file, write it to a Snowflake table, and read it back.

2. **Feature Engineering**  
   Transform raw data into meaningful features for model training.

3. **Model Training with XGBoost**  
   Use XGBoost to train a classification model on the engineered dataset.

4. **Model Deployment**  
   Register and deploy the trained model.

5. **Batch Inference**  
   Call the deployed model to make predictions on new batches of data.


In [None]:
# Configuration
# Centralize all configuration variables here for easy management

# Model configuration
MODEL_NAME = "TITANIC"
TARGET_COLUMN = "SURVIVED"

# Data configuration
SOURCE_CSV = "data/titanic_snowflake.csv"
RAW_TABLE = "TITANIC_RAW"
PREDICT_TABLE = "TITANIC_PREDICT"
DYNAMIC_TABLE = "TITANIC_BATCH_INFERENCE"

# Model training configuration
TRAIN_SIZE = 0.70
RANDOM_STATE = 1234

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from snowflake.ml.registry import Registry
from snowflake.snowpark.context import get_active_session

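session = get_active_session()

# Warehouse used by the dynamic table later in the demo (referenced as {{current_wh}})
current_wh = session.get_current_warehouse()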

In [None]:
# Load and perform initial data cleaning
titanic = pd.read_csv(SOURCE_CSV)

print(f"Original shape: {titanic.shape}")
print(f"Columns: {list(titanic.columns)}")

# Drop columns not useful for modeling
titanic = titanic.drop(["AGE", "DECK", "ALIVE", "ADULT_MALE", 
                        "EMBARKED", "PCLASS", "ALONE", "SEX"], axis=1)

print(f"Shape after dropping columns: {titanic.shape}")
titanic.head()

## Step 1: Data Ingestion to Snowflake

This step demonstrates the typical workflow of writing data to Snowflake and reading it back.

**Pattern:**
1. Convert pandas DataFrame → Snowpark DataFrame
2. Write to Snowflake table
3. Read back from Snowflake table → pandas DataFrame

In [None]:
# Write pandas DataFrame to Snowflake table
titanic_sf = session.create_dataframe(titanic)
titanic_sf.write.mode("overwrite").save_as_table(RAW_TABLE)

print(f"✓ Data written to {RAW_TABLE}")

In [None]:
# Read table from Snowflake back to pandas
titanic_raw = session.table(RAW_TABLE).to_pandas()
print(f"✓ Loaded {len(titanic_raw)} rows from {RAW_TABLE}")
titanic_raw.head()

In [None]:
%%sql
-- You can also query directly with SQL
SELECT * FROM {{RAW_TABLE}} LIMIT 5;

## Step 2: Feature Engineering

Prepare data for machine learning by:
1. Handling missing values
2. Creating dummy variables for categorical features
3. Converting boolean columns to integers
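
As a minimal sketch of these three steps on a toy frame (the values below are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "FARE": [7.25, 71.28, None, 8.05],
    "CLASS": ["Third", "First", "Second", "Third"],
    "SURVIVED": [0, 1, 1, 0],
})

# 1. Drop rows that contain missing values (removes the row with FARE = None).
toy = toy.dropna()

# 2. One-hot encode categorical columns. drop_first=True drops the first
#    category of each column so the dummies are not perfectly collinear.
toy = pd.get_dummies(toy, drop_first=True)

# 3. Convert boolean dummy columns to integers
#    (recent pandas versions emit bool dummies; older ones emit uint8).
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})

print(toy)
```

After the row with the missing fare is dropped, only `First` and `Third` remain in `CLASS`, so a single `CLASS_Third` indicator column is enough to encode it.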

In [None]:
# Handle missing values
print(f"Missing values before: {titanic.isnull().sum().sum()}")
titanic.dropna(inplace=True)
print(f"Missing values after: {titanic.isnull().sum().sum()}")
print(f"Rows remaining: {len(titanic)}")

In [None]:
# Create dummy variables and convert booleans to integers
titanic = pd.get_dummies(titanic, drop_first=True)
titanic = titanic.apply(lambda x: x.astype(int) if x.dtype == 'bool' else x)

print("Final feature types:")
print(titanic.dtypes)

In [None]:
# Prepare features (X) and target (y)
x = titanic.drop(TARGET_COLUMN, axis=1)
y = titanic[TARGET_COLUMN]

print(f"Features shape: {x.shape}")
print(f"Target distribution:\n{y.value_counts()}")

In [None]:
# Split data into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, train_size=TRAIN_SIZE, random_state=RANDOM_STATE
)

print(f"Training set: {xtrain.shape}")
print(f"Test set: {xtest.shape}")

## Step 3: Model Training with Hyperparameter Tuning

Using GridSearchCV to find optimal XGBoost parameters.
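
To get a feel for the cost of the search before running it, the sketch below enumerates the grid used in the next cell with the standard library. It only counts combinations and does not train anything; `cv_folds` is a name introduced here to mirror `cv=3`.

```python
from itertools import product

# Same grid as the notebook cell that follows.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.5],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3],
}
cv_folds = 3  # mirrors cv=3 in GridSearchCV

# One candidate per combination of hyperparameter values.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set, which is the model exposed as `best_estimator_`.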

In [None]:
# Define hyperparameter grid
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.5],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3]
}

# Train model with grid search
model = XGBClassifier(objective='binary:logistic', eval_metric='logloss')
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1)

print("Training model with GridSearchCV...")
grid_search.fit(xtrain, ytrain)

In [None]:
# Evaluate best model
best_params = grid_search.best_params_
best_score = grid_search.best_score_
best_model = grid_search.best_estimator_
test_score = best_model.score(xtest, ytest)

print("=" * 50)
print("MODEL EVALUATION RESULTS")
print("=" * 50)
print(f"Best Parameters: {best_params}")
print(f"Best CV Score: {best_score:.4f}")
print(f"Test Score: {test_score:.4f}")
print("=" * 50)

In [None]:
# Prepare metrics for model registry
metrics = {
    "cv_accuracy": best_score,
    "test_accuracy": test_score,
    "best_params": str(best_params)
}

print("Metrics to log:")
for k, v in metrics.items():
    print(f"  {k}: {v}")

## Step 4: Model Registration

Register the trained model in Snowflake Model Registry with:
- **target_platforms=["WAREHOUSE"]** - Enables SQL-based inference
- **Version control** - Automatic versioning with metadata
- **Sample input data** - For schema inference

In [None]:
# Register model in Snowflake Model Registry
reg = Registry(session=session)

sample_input = xtrain.sample(n=1)

print(f"Registering model '{MODEL_NAME}' to Snowflake Model Registry...")

titanic_model = reg.log_model(
    model_name=MODEL_NAME,
    options={"relax_version": True},
    target_platforms=["WAREHOUSE"],  # Enables SQL-based inference
    model=best_model,
    sample_input_data=sample_input,
    metrics=metrics,
)

print(f"✓ Model registered: {titanic_model.model_name} v{titanic_model.version_name}")

In [None]:
# View all models in registry
models_df = reg.show_models()
models_df[models_df['name'] == MODEL_NAME]

In [None]:
# View all versions of this model
versions = reg.get_model(MODEL_NAME).show_versions()
versions.sort_values(by='created_on', ascending=False)

In [None]:
# Get the most recent model version
model_version = reg.get_model(MODEL_NAME).last()
version_name = model_version.version_name

print(f"Using model version: {version_name}")
model_version

## Step 5: Batch Inference - Method 1 (Python)

Use the model registry's `run()` method to generate predictions directly in Python.
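
For a binary classifier, `PREDICT_PROBA` returns one probability column per class. The sketch below shows one way to turn those probabilities into a class label. It assumes the output columns are named `output_feature_0` and `output_feature_1` and are ordered by class label, so that `output_feature_1` is the probability of `SURVIVED = 1`; the values and the `THRESHOLD` constant are invented for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by the inference call; values are invented.
proba = pd.DataFrame({
    "output_feature_0": [0.82, 0.35, 0.50],  # assumed: P(SURVIVED = 0)
    "output_feature_1": [0.18, 0.65, 0.50],  # assumed: P(SURVIVED = 1)
})

THRESHOLD = 0.5  # hypothetical decision cut-off

# Label a passenger as a survivor when P(SURVIVED = 1) reaches the threshold.
proba["predicted_survived"] = (proba["output_feature_1"] >= THRESHOLD).astype(int)

print(proba)
```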

In [None]:
# Run batch inference using Python
print(f"Running batch inference on {len(xtest)} test records...")

remote_prediction = model_version.run(xtest, function_name="PREDICT_PROBA")

print(f"✓ Predictions generated")
remote_prediction.head(10)

## Step 6: Batch Inference - Method 2 (SQL)

Write test data to Snowflake table and score it using SQL with model UDFs.
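
Outside a SQL cell, the same scoring query can be built as a string and passed to `session.sql`. The helper below is a hypothetical sketch, not part of the original notebook: `build_inference_sql` and its parameters are names introduced here, and the probability column is passed in explicitly because its name depends on the model signature.

```python
import re

# Accept only plain, unquoted identifiers so the values are safe to interpolate.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def build_inference_sql(model_name, table_name, proba_column, limit=10):
    """Return a query that scores rows of table_name with the model's default version."""
    for name in (model_name, table_name, proba_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    return (
        f"SELECT *, ROUND({model_name}!PREDICT_PROBA(*):{proba_column}, 2) "
        f"AS survival_probability FROM {table_name} LIMIT {int(limit)}"
    )


# "output_feature_1" is assumed to hold P(SURVIVED = 1).
query = build_inference_sql("TITANIC", "TITANIC_PREDICT", "output_feature_1")
print(query)

# With a live Snowpark session: session.sql(query).to_pandas()
```

Validating identifiers before interpolation keeps the helper from producing malformed or unsafe SQL if a name ever comes from user input.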

In [None]:
# Write test data to Snowflake for SQL-based inference
test_sf = session.create_dataframe(xtest.reset_index(drop=True))
test_sf.write.mode("overwrite").save_as_table(PREDICT_TABLE)

print(f"✓ Test data written to {PREDICT_TABLE}")
session.table(PREDICT_TABLE).show()

In [None]:
%%sql
-- SQL-based batch inference using model UDF
-- The model is called as: MODEL_NAME!FUNCTION_NAME(*)

SELECT 
    *,
    -- output_feature_1 is assumed to hold the probability of class 1 (SURVIVED = 1)
    ROUND({{MODEL_NAME}}!PREDICT_PROBA(*):output_feature_1, 2) AS survival_probability
FROM {{PREDICT_TABLE}}
LIMIT 10;

## Step 7: Batch Inference - Method 3 (Dynamic Tables)

**Dynamic Tables** provide automated, continuously updated batch inference:

- **Automated Refresh**: Runs inference automatically based on `target_lag`
- **Incremental Updates**: Processes only new or changed data when the query supports incremental refresh
- **Production Ready**: No scheduled jobs or orchestration needed

### Demo Workflow:
1. Create dynamic table with model inference
2. Insert new data into source table
3. Dynamic table automatically refreshes within the `target_lag` (1 minute)
4. View updated predictions
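
Rather than waiting a fixed minute in step 3, a notebook can poll until the new rows appear. The helper below is a hypothetical sketch: `wait_for_row_count` and its parameters are names introduced here, and the demo uses a fake counter with invented numbers in place of a real Snowflake query.

```python
import time
from typing import Callable


def wait_for_row_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_count() until it returns at least `expected` or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_count() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Demo with a fake counter standing in for the Snowflake query (numbers are invented).
fake_counts = iter([10, 10, 15])
refreshed = wait_for_row_count(lambda: next(fake_counts), expected=15, poll_s=0)
print(refreshed)
```

With a live Snowpark session, `get_count` could be `lambda: session.table(DYNAMIC_TABLE).count()`, with `expected` set to the row count before the insert plus the number of inserted rows.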

In [None]:
%%sql
-- Create dynamic table for automated batch inference
CREATE OR REPLACE DYNAMIC TABLE {{DYNAMIC_TABLE}}
TARGET_LAG = '1 minute' 
WAREHOUSE = {{current_wh}} 
AS
SELECT 
    *,
    -- output_feature_1 is assumed to hold the probability of class 1 (SURVIVED = 1)
    ROUND({{MODEL_NAME}}!PREDICT_PROBA(*):output_feature_1, 2) AS survival_probability
FROM {{PREDICT_TABLE}};

-- View initial results
SELECT * FROM {{DYNAMIC_TABLE}} LIMIT 10;

## Step 8: Test Dynamic Table with New Data

Insert new records and observe automatic inference within 1 minute.

In [None]:
# Check the most recent refresh of the dynamic table
refresh_info = session.sql(f"""
    SELECT 
        name,
        state,
        refresh_start_time,
        refresh_end_time,
        refresh_action
    FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
        NAME => '{DYNAMIC_TABLE}'
    ))
    ORDER BY refresh_start_time DESC
    LIMIT 1
""").to_pandas()

refresh_info

In [None]:
%%sql
-- Insert new records to trigger dynamic table refresh
INSERT INTO {{PREDICT_TABLE}} (
    SIBSP, PARCH, FARE, CLASS_SECOND, CLASS_THIRD,
    WHO_MAN, WHO_WOMAN,
    EMBARK_TOWN_QUEENSTOWN, EMBARK_TOWN_SOUTHAMPTON
) VALUES
(0, 0, 10.5, 0, 1, 1, 0, 1, 0),
(2, 1, 23.0, 1, 0, 0, 1, 0, 1),
(0, 2, 15.75, 1, 0, 0, 1, 1, 0),
(1, 1, 7.925, 0, 1, 1, 0, 0, 1),
(0, 0, 7.75, 0, 1, 1, 0, 0, 1);

-- Wait ~1 minute, then check the dynamic table for new predictions
SELECT COUNT(*) AS total_records FROM {{DYNAMIC_TABLE}};

## Step 9: Cleanup Resources

Run this cell to remove all demo resources when complete.

In [None]:
%%sql
-- Drop all demo resources
DROP DYNAMIC TABLE IF EXISTS {{DYNAMIC_TABLE}};
DROP TABLE IF EXISTS {{PREDICT_TABLE}};
DROP TABLE IF EXISTS {{RAW_TABLE}};

-- Note: The model remains in the registry for reuse.
-- To remove the model: reg.delete_model(MODEL_NAME)