# MLOps Data Versioning for Anime-Inspired CYOA Story Generator

This notebook demonstrates how to use MLflow for data versioning in our Choose Your Own Adventure story generation project. We'll load our anime-inspired dataset, log it with MLflow, apply preprocessing, and version the processed data.

## Key MLOps Concepts:
- **Data Versioning**: Track different versions of datasets
- **Artifact Logging**: Store datasets as MLflow artifacts
- **Parameter Tracking**: Log dataset metadata and configuration
- **Reproducibility**: Ensure experiments can be reproduced with exact data versions
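
One way to make "exact data versions" checkable is to record a content hash next to the version label. The sketch below is an illustration rather than part of this notebook: `file_sha256` is a hypothetical helper built on the standard library's `hashlib`. Its result could be logged with `mlflow.log_param`, so a later run can confirm it is reading the same bytes.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest have identical contents, so the digest works as a version identifier that does not depend on file names or timestamps.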

## Workflow Overview:
1. **Setup MLflow** tracking and experiment
2. **Load** the anime-inspired CYOA dataset
3. **Log original data** as MLflow artifacts with parameters/metrics
4. **Apply preprocessing** (lowercase prompts, trim whitespace, add a timestamp and IDs)
5. **Version processed data** and log as new artifacts

In [1]:
# Import required libraries (unused imports removed)
from datetime import datetime

import mlflow
import pandas as pd

# Set up MLflow tracking
mlflow.set_tracking_uri("file:./mlruns")  # Local MLflow tracking
mlflow.set_experiment("cyoa_data_versioning")

print(" MLflow setup complete!")
print(f" Tracking URI: {mlflow.get_tracking_uri()}")
print(f" Current experiment: {mlflow.get_experiment_by_name('cyoa_data_versioning')}")
print(f" Session started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

 MLflow setup complete!
 Tracking URI: file:./mlruns
 Current experiment: <Experiment: artifact_location='file:///C:/Users/jaide/OneDrive/Desktop/cyoa-mlops/mlruns/412490187678699798', creation_time=1751427449531, experiment_id='412490187678699798', last_update_time=1751427449531, lifecycle_stage='active', name='cyoa_data_versioning', tags={}>
 Session started: 2025-07-01 22:50:59


In [2]:
# Load the original dataset
data_path = "data/story_dataset.csv"

try:
    # Load the CSV file
    df = pd.read_csv(data_path)
    print(" Dataset loaded successfully!")
    print(f" Dataset shape: {df.shape}")
    print(f" Columns: {list(df.columns)}")
    print("\n First few rows:")
    display(df.head())
    
    # Display basic statistics
    print(f"\n Dataset Statistics:")
    print(f"    Total stories: {len(df)}")
    print(f"    Unique genres: {df['genre'].nunique()}")
    print(f"    Genre distribution:")
    for genre, count in df['genre'].value_counts().items():
        print(f"     - {genre}: {count}")
        
except FileNotFoundError:
    print(" Error: Could not find the dataset file. Please ensure 'data/story_dataset.csv' exists.")
    raise  # Later cells need df, so stop here instead of continuing without it
except Exception as e:
    print(f" Error loading dataset: {e}")
    raise

 Dataset loaded successfully!
 Dataset shape: (5, 5)
 Columns: ['prompt', 'response', 'choice_a', 'choice_b', 'genre']

 First few rows:


| | prompt | response | choice_a | choice_b | genre |
|---|---|---|---|---|---|
| 0 | You enter a mystical forest where ancient spir... | The moonlight filters through the ethereal lea... | Follow the glowing blue wisps deeper into the ... | Take the stone path toward the distant mountai... | fantasy |
| 1 | At the magical academy, you discover a forbidd... | The leather-bound tome pulses with mysterious ... | Open the book and risk unleashing its power | Report the forbidden tome to the headmaster | magic_school |
| 2 | Your train arrives at a small countryside stat... | The quiet platform holds only an elderly man f... | Approach the mysterious girl and ask about the... | Help the elderly man with the cats and learn a... | slice_of_life |
| 3 | In the virtual reality game world, you face th... | Your guild members have fallen, and only you r... | Use your ultimate skill despite the risk of be... | Try to negotiate with the dragon emperor using... | gaming_isekai |
| 4 | The time loop resets again as the school festi... | You remember everything from the previous 47 l... | Confront the person you suspect is causing the... | Try a completely different approach and help s... | time_loop |



 Dataset Statistics:
    Total stories: 5
    Unique genres: 5
    Genre distribution:
     - fantasy: 1
     - magic_school: 1
     - slice_of_life: 1
     - gaming_isekai: 1
     - time_loop : 1
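
Before a dataset is logged as a version, it can help to confirm that it has the columns shown above and no empty fields. The following is a minimal sketch under that assumption, using only the standard library's `csv` module; `check_story_csv` and `EXPECTED_COLUMNS` are hypothetical names, not part of this notebook.

```python
import csv

# Column names taken from the dataset loaded above.
EXPECTED_COLUMNS = ["prompt", "response", "choice_a", "choice_b", "genre"]


def check_story_csv(path, expected=EXPECTED_COLUMNS):
    """Return a list of problems found in the CSV; an empty list means none were found."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in expected if col not in header]
        if missing:
            # Without the expected columns, the row checks below cannot run.
            problems.append(f"missing columns: {missing}")
            return problems
        for row_no, row in enumerate(reader, start=1):
            # Short rows yield None for absent fields, hence the `or ""`.
            empty = [col for col in expected if not (row[col] or "").strip()]
            if empty:
                problems.append(f"row {row_no}: empty values in {empty}")
    return problems
```

Calling `check_story_csv("data/story_dataset.csv")` before starting the MLflow run would let the notebook stop early instead of versioning a malformed file.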


In [3]:
# Start MLflow run for data versioning
with mlflow.start_run(run_name="data_versioning_v1.0") as run:
    
    print(f" Starting MLflow run: {run.info.run_id}")
    
    # Log dataset parameters
    dataset_size = len(df)
    dataset_version = "v1.0"
    num_genres = df['genre'].nunique()
    
    # Log parameters to MLflow
    mlflow.log_param("dataset_size", dataset_size)
    mlflow.log_param("dataset_version", dataset_version)
    mlflow.log_param("num_genres", num_genres)
    mlflow.log_param("data_source", "anime_inspired_cyoa")
    mlflow.log_param("creation_date", datetime.now().strftime('%Y-%m-%d'))
    
    print(" Parameters logged:")
    print(f"    Dataset size: {dataset_size}")
    print(f"    Dataset version: {dataset_version}")
    print(f"    Number of genres: {num_genres}")
    
    # Log the original dataset as an MLflow artifact
    mlflow.log_artifact(data_path, "original_data")
    print(f" Original dataset logged as artifact: {data_path}")
    
    # Log additional metrics
    avg_prompt_length = df['prompt'].str.len().mean()
    avg_response_length = df['response'].str.len().mean()
    
    mlflow.log_metric("avg_prompt_length", avg_prompt_length)
    mlflow.log_metric("avg_response_length", avg_response_length)
    
    print(f" Metrics logged:")
    print(f"    Average prompt length: {avg_prompt_length:.1f} characters")
    print(f"    Average response length: {avg_response_length:.1f} characters")

 Starting MLflow run: e1b2bc34037f451da2b2fa48ea8018d2
 Parameters logged:
    Dataset size: 5
    Dataset version: v1.0
    Number of genres: 5
 Original dataset logged as artifact: data/story_dataset.csv
 Metrics logged:
    Average prompt length: 60.0 characters
    Average response length: 84.0 characters
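
The parameters and metrics logged above can also be computed in a plain function and sent to MLflow in two batched calls. This is a sketch, not the notebook's method: `summarize_stories` and `log_summary` are hypothetical helpers, while `mlflow.log_params` and `mlflow.log_metrics` are the dictionary-based counterparts of `log_param` and `log_metric`. Keeping the computation separate from the logging means it can be checked without a tracking server.

```python
def summarize_stories(rows, version):
    """Compute dataset parameters and metrics from a list of story dicts."""
    if not rows:
        raise ValueError("cannot summarize an empty dataset")
    params = {
        "dataset_size": len(rows),
        "dataset_version": version,
        "num_genres": len({row["genre"] for row in rows}),
    }
    metrics = {
        "avg_prompt_length": sum(len(r["prompt"]) for r in rows) / len(rows),
        "avg_response_length": sum(len(r["response"]) for r in rows) / len(rows),
    }
    return params, metrics


def log_summary(params, metrics):
    """Log a summary to the active MLflow run."""
    import mlflow  # imported here so summarize_stories runs without MLflow installed

    mlflow.log_params(params)
    mlflow.log_metrics(metrics)
```

With the DataFrame from this notebook, the rows could be obtained with `df.to_dict("records")` and the result passed to `log_summary` inside the `mlflow.start_run` block.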


In [4]:
# Apply preprocessing to the dataset
print(" Starting data preprocessing...")

# Create a copy for preprocessing
df_processed = df.copy()

# Preprocessing steps:
# 1. Convert prompts to lowercase for consistency
df_processed['prompt'] = df_processed['prompt'].str.lower()

# 2. Remove any leading/trailing whitespace from the text columns
for col in ['prompt', 'response', 'choice_a', 'choice_b']:
    df_processed[col] = df_processed[col].str.strip()

# 3. Add a preprocessing timestamp
df_processed['processed_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# 4. Add unique IDs for tracking
df_processed['story_id'] = range(1, len(df_processed) + 1)

print(" Preprocessing completed!")
print(f" Processed dataset shape: {df_processed.shape}")

print("\n Sample of processed data:")
display(df_processed[['story_id', 'prompt', 'genre', 'processed_at']].head(3))

# Show the difference in prompts (original vs processed)
print("\n Preprocessing examples:")
for i in range(min(3, len(df))):
    print(f"Original prompt {i+1}: {df.iloc[i]['prompt'][:50]}...")
    print(f"Processed prompt {i+1}: {df_processed.iloc[i]['prompt'][:50]}...")
    print("---")

 Starting data preprocessing...
 Preprocessing completed!
 Processed dataset shape: (5, 7)

 Sample of processed data:


| | story_id | prompt | genre | processed_at |
|---|---|---|---|---|
| 0 | 1 | you enter a mystical forest where ancient spir... | fantasy | 2025-07-01 22:51:25 |
| 1 | 2 | at the magical academy, you discover a forbidd... | magic_school | 2025-07-01 22:51:25 |
| 2 | 3 | your train arrives at a small countryside stat... | slice_of_life | 2025-07-01 22:51:25 |



 Preprocessing examples:
Original prompt 1: You enter a mystical forest where ancient spirits ...
Processed prompt 1: you enter a mystical forest where ancient spirits ...
---
Original prompt 2: At the magical academy, you discover a forbidden s...
Processed prompt 2: at the magical academy, you discover a forbidden s...
---
Original prompt 3: Your train arrives at a small countryside station ...
Processed prompt 3: your train arrives at a small countryside station ...
---
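
Because the preprocessing stamps each row with the current time, two runs over the same input produce different files. One possible variation, sketched below under the assumption that stories are handled as plain dicts, passes the timestamp in as an argument so the output depends only on the inputs. `preprocess_stories` is a hypothetical name and mirrors the steps of the cell above.

```python
def preprocess_stories(rows, processed_at):
    """Return new story dicts with the notebook's preprocessing steps applied."""
    processed = []
    for story_id, row in enumerate(rows, start=1):
        new_row = dict(row)  # copy so the input rows are left unchanged
        # Steps 1 and 2: lowercase the prompt, then trim whitespace.
        new_row["prompt"] = row["prompt"].lower().strip()
        for col in ("response", "choice_a", "choice_b"):
            new_row[col] = row[col].strip()
        # Steps 3 and 4: caller-supplied timestamp and a 1-based ID.
        new_row["processed_at"] = processed_at
        new_row["story_id"] = story_id
        processed.append(new_row)
    return processed
```

Calling the function twice with the same rows and the same `processed_at` value yields identical output, which makes the processed version reproducible.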


In [5]:
# Save processed dataset and log as MLflow artifact
processed_data_path = "data/processed_story_dataset.csv"

# Save the processed dataset
df_processed.to_csv(processed_data_path, index=False)
print(f" Processed dataset saved to: {processed_data_path}")

# Continue the same MLflow run and log the processed dataset
with mlflow.start_run(run_id=run.info.run_id):
    
    # Log the processed dataset as an artifact
    mlflow.log_artifact(processed_data_path, "processed_data")
    
    # Log preprocessing parameters
    mlflow.log_param("preprocessing_applied", "lowercase_prompts,whitespace_trim,add_ids")
    mlflow.log_param("processed_dataset_version", "v1.0_processed")
    
    # Log preprocessing metrics
    mlflow.log_metric("processing_timestamp", datetime.now().timestamp())
    
    print(" Processed dataset logged to MLflow!")
    print(f" Run ID: {run.info.run_id}")
    print(f" Artifacts logged:")
    print(f"    Original dataset: original_data/story_dataset.csv")
    print(f"    Processed dataset: processed_data/processed_story_dataset.csv")

print("\n Data versioning workflow completed successfully!")
print(f" To view results, start MLflow UI with: mlflow ui")
print(f" Then navigate to: http://localhost:5000")

 Processed dataset saved to: data/processed_story_dataset.csv
 Processed dataset logged to MLflow!
 Run ID: e1b2bc34037f451da2b2fa48ea8018d2
 Artifacts logged:
    Original dataset: original_data/story_dataset.csv
    Processed dataset: processed_data/processed_story_dataset.csv

 Data versioning workflow completed successfully!
 To view results, start MLflow UI with: mlflow ui
 Then navigate to: http://localhost:5000
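
A versioned dataset is only useful if it can be retrieved later. The sketch below shows one way this might look, assuming the artifact paths logged above and the same tracking URI as this notebook. `artifact_uri` and `fetch_processed_data` are hypothetical helpers; `mlflow.artifacts.download_artifacts` and the `runs:/` URI scheme are part of MLflow.

```python
def artifact_uri(run_id, artifact_path):
    """Build a runs:/ URI for an artifact logged under the given run."""
    return f"runs:/{run_id}/{artifact_path.lstrip('/')}"


def fetch_processed_data(run_id, dst_dir=None):
    """Download the processed CSV logged by a run and return its local path."""
    import mlflow.artifacts  # imported lazily so artifact_uri works without MLflow

    uri = artifact_uri(run_id, "processed_data/processed_story_dataset.csv")
    # dst_path=None lets MLflow choose a temporary download directory.
    return mlflow.artifacts.download_artifacts(artifact_uri=uri, dst_path=dst_dir)
```

For example, `pd.read_csv(fetch_processed_data(run.info.run_id))` would reload the processed data from the run recorded in this notebook.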
