# DDP_KBIT Jupyter Notebook Interface

This notebook provides a simple interface to run the DDP_KBIT distributed deep learning system without using command line arguments. It wraps the existing `main.py` functionality for easy experimentation.

## Setup and Imports

## Session Initialization (Run Every Session)

Run the cell below first in every session to connect to the local modules.

In [2]:
import os
import sys
import json
from pathlib import Path

try:
    import ddp_kbit
    from ddp_kbit.main import (
        setup_logging, 
        load_external_config,
        run_training_mode,
        run_experiment_mode, 
        create_sample_config
    )
    print("✓ Successfully imported DDP_KBIT modules")
    print("🎉 DDP_KBIT notebook interface ready!")
except ImportError as e:
    print(f"❌ Error importing DDP_KBIT modules: {e}")
    print("⚠️  DDP_KBIT setup incomplete. Some features may not work.")



✓ Successfully imported DDP_KBIT modules
🎉 DDP_KBIT notebook interface ready!
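
If the import above fails because the package is not installed, one option is to put the directory that contains the `ddp_kbit` package on `sys.path` before importing. The sketch below assumes such a layout, which the project may not prescribe: `ensure_on_path` is a hypothetical helper, and the path in the commented usage line is a placeholder.

```python
import sys
from pathlib import Path


def ensure_on_path(directory):
    """Prepend `directory` to sys.path if it is not already there.

    Returns True when the path was added, False when it was already present.
    (Hypothetical helper, not part of DDP_KBIT.)
    """
    resolved = str(Path(directory).expanduser().resolve())
    if resolved in sys.path:
        return False
    sys.path.insert(0, resolved)
    return True


# Placeholder location; replace with the directory that holds the package.
# ensure_on_path("/path/to/project/root")
```

Calling the helper before the `import ddp_kbit` line keeps the path change idempotent, so re-running the cell does not add duplicate entries.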


## Configuration Setup

In [3]:
# Setup logging
setup_logging("INFO")

# Create a mock args object to simulate command line arguments
class NotebookArgs:
    def __init__(self):
        self.config_path = "sample_config.json"
        self.distributed = False
        self.experiment_type = "single"
        self.iterations = 3
        self.log_level = "INFO"

# Initialize default arguments
args = NotebookArgs()

print("✓ Configuration setup complete")
print(f"Config path: {args.config_path}")
print(f"Distributed: {args.distributed}")
print(f"Iterations: {args.iterations}")

✓ Configuration setup complete
Config path: sample_config.json
Distributed: False
Iterations: 3
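
Since the functions imported from `main.py` are called with an object that carries the same attributes as parsed command-line arguments, a dataclass could stand in for the hand-written class. This is only a sketch: `NotebookArgsSketch` is a hypothetical name, the field list mirrors the attributes set above, and the check on `experiment_type` assumes that "single" and "multiple" are the only valid values, which this notebook does not confirm.

```python
from dataclasses import dataclass, replace


@dataclass
class NotebookArgsSketch:
    """Hypothetical stand-in for the NotebookArgs class defined above."""
    config_path: str = "sample_config.json"
    distributed: bool = False
    experiment_type: str = "single"
    iterations: int = 3
    log_level: str = "INFO"

    def __post_init__(self):
        # Assumption: only the two experiment types used in this notebook exist.
        if self.experiment_type not in ("single", "multiple"):
            raise ValueError(f"unknown experiment_type: {self.experiment_type!r}")
        if self.iterations < 1:
            raise ValueError("iterations must be at least 1")


defaults = NotebookArgsSketch()
# replace() returns a modified copy and leaves `defaults` untouched.
five_runs = replace(defaults, experiment_type="multiple", iterations=5)
print(defaults)
print(five_runs)
```

The validation in `__post_init__` also runs on copies made with `replace()`, so a mistyped experiment type fails immediately instead of inside a long-running job.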


## Create Sample Configuration (Run Before Training)

In [4]:
# Create a sample configuration file

create_sample_config()
print("✓ Sample configuration created!")

# Display the configuration
if os.path.exists("sample_config.json"):
    with open("sample_config.json", 'r') as f:
        config = json.load(f)
    print("\nCurrent configuration:")
    print(json.dumps(config, indent=2))


Created sample_config.json - customize this file for your needs.
✓ Sample configuration created!

Current configuration:
{
  "training_config": {
    "base_model_type": "NeuralNetwork",
    "optimizer_class": "torch.optim.adam.Adam",
    "optimizer_params": {
      "lr": 0.001
    },
    "loss_fn": "torch.nn.modules.loss.CrossEntropyLoss",
    "perform_validation": true,
    "num_epochs": 1,
    "batch_size": 32,
    "metrics": {
      "loss": "Loss",
      "accuracy": "Accuracy"
    }
  },
  "mongo_config": {
    "connection_id": "my-mongo-1",
    "mongo_database": "kbit-db",
    "collection": "mnist_train_avro"
  },
  "kafka_config": {
    "bootstrap_servers": [
      "155.230.35.200:32100",
      "155.230.35.213:32100",
      "155.230.35.215:32100"
    ],
    "data_load_topic": "kbit-p3r1"
  },
  "data_loader_config": {
    "data_loader_type": "kafka",
    "local_data_path": "/root/processed_mnist",
    "offsets_data": [
      "0:0:19999",
      "1:0:19999",
      "2:0:19999"
    ],
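
The generated file is meant to be customized. One way to do that without editing it by hand is to write a modified copy, as in the sketch below. `write_config_variant` is a hypothetical helper, the key names follow the sample output above, and the demonstration runs on a throwaway file with only two keys rather than on the real `sample_config.json`.

```python
import json
import tempfile
from pathlib import Path


def write_config_variant(source_path, target_path, training_overrides):
    """Copy a JSON config, updating keys under "training_config".

    Hypothetical helper; the key names follow the sample output above.
    """
    config = json.loads(Path(source_path).read_text(encoding="utf-8"))
    config.setdefault("training_config", {}).update(training_overrides)
    Path(target_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstration on a throwaway file rather than the real sample_config.json.
workdir = Path(tempfile.mkdtemp())
source = workdir / "sample_config.json"
source.write_text(
    json.dumps({"training_config": {"num_epochs": 1, "batch_size": 32}}),
    encoding="utf-8",
)

variant = write_config_variant(
    source, workdir / "my_config.json", {"num_epochs": 3, "batch_size": 64}
)
print(json.dumps(variant, indent=2))
```

Pointing `args.config_path` at the new file should then make later cells use the variant, assuming `run_training_mode` reads whatever `config_path` names.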

## Training Mode

Run single-node or distributed training.

In [6]:
# Single node training
print("🚀 Starting single node training...")
args.distributed = False

try:
    run_training_mode(args)
    print("✅ Training completed successfully!")
except Exception as e:
    print(f"❌ Training failed: {e}")

2025-08-20 12:49:07 - root - INFO - Starting training mode...


🚀 Starting single node training...
FILES IN THIS DIRECTORY
['DDP_KBIT', 'jars', 'config.json', 'mnist_pb2.py', 'spark_DL_checkpoints', 'sample_config.json']
Current Spark configuration:
spark.app.id = app-20250820124907-0047
spark.app.initial.file.urls = spark://192.168.141.56:39337/files/mnist_pb2.py
spark.app.initial.jar.urls = spark://192.168.141.56:39337/jars/jsr305-3.0.0.jar,spark://192.168.141.56:39337/jars/spark-sql-kafka-0-10_2.12-3.5.1.jar,spark://192.168.141.56:39337/jars/commons-pool2-2.11.1.jar,spark://192.168.141.56:39337/jars/rapids-4-spark_2.12-24.06.1.jar,spark://192.168.141.56:39337/jars/commons-logging-1.1.3.jar,spark://192.168.141.56:39337/jars/spark-token-provider-kafka-0-10_2.12-3.5.1.jar,spark://192.168.141.56:39337/jars/hadoop-client-runtime-3.3.4.jar,spark://192.168.141.56:39337/jars/lz4-java-1.8.0.jar,spark://192.168.141.56:39337/jars/kafka-clients-3.4.1.jar,spark://192.168.141.56:39337/jars/spark-streaming-kafka-0-10_2.12-3.5.1.jar,spark://192.168.141.56:39337

2025-08-20 12:49:08 - TorchDistributor - INFO - Started distributed training with 3 executor processes
[W820 12:49:33.964632073 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=6, addr=[192-168-177-154.spark-worker-ui-service.spark.svc.cluster.local]:37864, remote=[192-168-109-174.spark-worker-ui-service.spark.svc.cluster.local]:40399): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fc7e3ca3eb0 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7fc8257254d1 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d6a40d (0x7fc82572640d in /usr/local/lib/python3.9/dist-package

❌ Training failed: An error occurred while calling o263.collectToPython
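
The cell above prints only the exception message, which names the failed call but not its cause. Below is a minimal sketch of a wrapper that keeps the full traceback, using only the standard library. `run_and_report` and `failing_step` are hypothetical names for illustration, and the wrapper is not part of DDP_KBIT.

```python
import traceback


def run_and_report(step, *args, **kwargs):
    """Call `step`; on failure print the full traceback and return it.

    Returns (result, None) on success and (None, traceback_text) on failure.
    """
    try:
        return step(*args, **kwargs), None
    except Exception:
        details = traceback.format_exc()
        print("❌ Step failed. Full traceback:")
        print(details)
        return None, details


def failing_step():
    # Stand-in for a call such as run_training_mode(args).
    raise RuntimeError("simulated failure")


result, details = run_and_report(failing_step)
```

Returning the traceback text as well as printing it makes it possible to save the failure to a file or inspect it in a later cell.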


In [None]:
# Distributed training (uncomment to run)
# print("🚀 Starting distributed training...")
# args.distributed = True

# try:
#     run_training_mode(args)
#     print("✅ Distributed training completed successfully!")
# except Exception as e:
#     print(f"❌ Distributed training failed: {e}")

## Experiment Mode

Run single experiments or multiple iterations with statistical analysis.

In [None]:
# Single experiment
print("🧪 Running single experiment...")
args.experiment_type = "single"

try:
    run_experiment_mode(args)
    print("✅ Single experiment completed successfully!")
except Exception as e:
    print(f"❌ Single experiment failed: {e}")

In [None]:
# Multiple experiments with statistical analysis
print("🧪 Running multiple experiments...")
args.experiment_type = "multiple"
args.iterations = 5  # You can change this number

try:
    run_experiment_mode(args)
    print(f"✅ {args.iterations} experiments completed successfully!")
except Exception as e:
    print(f"❌ Multiple experiments failed: {e}")
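
The notebook does not show which statistics `run_experiment_mode` reports for multiple iterations. As a rough illustration of the kind of summary a repeated run could produce, the sketch below times a stand-in workload several times and computes the mean and sample standard deviation of the durations. `time_iterations` and `summarize` are hypothetical helpers, and the workload is a placeholder, not a DDP_KBIT experiment.

```python
import statistics
import time


def time_iterations(task, iterations):
    """Run `task` repeatedly and return the duration of each run in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return mean and sample standard deviation (0.0 for a single run)."""
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {"runs": len(durations), "mean": statistics.mean(durations), "stdev": spread}


# Stand-in workload; in the notebook this would be one experiment run.
durations = time_iterations(lambda: sum(range(100_000)), iterations=5)
print(summarize(durations))
```

With only a handful of iterations the standard deviation is a coarse estimate, so it is worth raising `iterations` before comparing configurations.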

## Utility Functions

Helper functions for notebook usage.

In [None]:
def quick_train(distributed=False, config_path="sample_config.json"):
    """Run training with the shared `args` object, overriding `distributed` and `config_path`."""
    args.distributed = distributed
    args.config_path = config_path
    
    print(f"🚀 Quick training - Distributed: {distributed}")
    try:
        run_training_mode(args)
        print("✅ Training completed!")
    except Exception as e:
        print(f"❌ Training failed: {e}")

def quick_experiment(experiment_type="single", iterations=3):
    """Run experiments with the shared `args` object, overriding `experiment_type` and `iterations`."""
    args.experiment_type = experiment_type
    args.iterations = iterations
    
    print(f"🧪 Quick experiment - Type: {experiment_type}, Iterations: {iterations}")
    try:
        run_experiment_mode(args)
        print("✅ Experiment completed!")
    except Exception as e:
        print(f"❌ Experiment failed: {e}")

print("✓ Utility functions loaded!")
print("\nUse these functions for quick execution:")
print("- quick_train(distributed=False)")
print("- quick_experiment(experiment_type='multiple', iterations=5)")
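
Both helpers above modify the shared `args` object, so settings from one call carry over to the next. If that is not wanted, a copy with overrides is one alternative. The sketch below is an assumption about how this could be done, not the notebook's approach: `with_overrides` is a hypothetical helper, and a `SimpleNamespace` stands in for `args`.

```python
import copy
from types import SimpleNamespace


def with_overrides(base_args, **overrides):
    """Return a shallow copy of `base_args` with the given attributes replaced."""
    unknown = [name for name in overrides if not hasattr(base_args, name)]
    if unknown:
        raise AttributeError(f"unknown argument(s): {', '.join(unknown)}")
    updated = copy.copy(base_args)
    for name, value in overrides.items():
        setattr(updated, name, value)
    return updated


# SimpleNamespace stands in for the notebook's `args` object.
base = SimpleNamespace(config_path="sample_config.json", distributed=False, iterations=3)
distributed_args = with_overrides(base, distributed=True)
print(base.distributed, distributed_args.distributed)
```

Rejecting unknown names catches typos such as `distibuted=True`, which would otherwise silently create a new attribute that nothing reads.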

## Quick Execution Examples

Use the utility functions for quick execution.

In [None]:
# Example: Quick single training
# quick_train()

# Example: Quick multiple experiments
# quick_experiment(experiment_type="multiple", iterations=3)

print("💡 Uncomment the lines above to run quick examples!")