# Hadoop Docker Stack Troubleshooting Guide

This notebook helps troubleshoot Docker Compose deployment issues for the Hadoop Big Data stack. We'll diagnose the missing image errors and provide solutions.

## Problem Overview
The error `docker.io/apache/hadoop:3.3.4: not found` means Docker could not find that image tag on Docker Hub. We'll fix this by switching to alternative images that are available.

## What You'll Learn
- How to check Docker image availability
- How to fix Docker Compose configuration issues
- How to monitor container health and troubleshoot deployment problems
- How to use alternative Docker images for Hadoop components

## Section 1: Import Required Libraries

Let's start by importing the necessary Python libraries for Docker API interaction and system monitoring.

In [1]:
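# Requires the Docker SDK for Python plus PyYAML and requests:
#   pip install docker pyyaml requests
import docker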
import subprocess
import json
import os
import yaml
import requests
import time
from datetime import datetime

# Initialize Docker client
try:
    client = docker.from_env()
    print("✅ Docker client connected successfully")
    print(f"Docker version: {client.version()['Version']}")
except Exception as e:
    print(f"❌ Failed to connect to Docker: {e}")
    print("Make sure Docker is running and its socket is reachable from this notebook")

# Check Docker system info
try:
    info = client.info()
    print(f"📊 Containers: {info['Containers']} | Images: {info['Images']} | Memory: {info['MemTotal'] / (1024**3):.1f}GB")
except Exception as e:
    print(f"⚠️ Could not get Docker system info: {e}")

ModuleNotFoundError: No module named 'docker'
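
The `ModuleNotFoundError` above means the Docker SDK for Python is not installed in the notebook kernel. As a minimal sketch, the check below reports which of the third-party packages imported in this section are missing. The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the original notebook.

```python
import importlib.util

# Import names used in this notebook, mapped to their pip package names.
REQUIRED = {"docker": "docker", "yaml": "pyyaml", "requests": "requests"}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable")
```

Running this before the import cell avoids a partially initialized session in which later cells fail on an undefined `client`.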

## Section 2: Check Docker Image Availability

Let's check if the problematic images exist and find alternatives.

In [None]:
def check_image_exists(image_name):
    """Check if a Docker image exists locally or on Docker Hub"""
    try:
        # Check locally first
        client.images.get(image_name)
        return "local", "✅ Available locally"
    except docker.errors.ImageNotFound:
        # Check if it exists on Docker Hub
        try:
            if ":" in image_name:
                repo, tag = image_name.split(":", 1)
            else:
                repo, tag = image_name, "latest"

            # Official images (e.g. postgres) live under the "library" namespace
            if "/" not in repo:
                repo = f"library/{repo}"

            # Simple Docker Hub API check
            url = f"https://registry.hub.docker.com/v2/repositories/{repo}/tags/{tag}"
            response = requests.get(url, timeout=5)

            if response.status_code == 200:
                return "remote", "✅ Available on Docker Hub"
            elif response.status_code == 404:
                return "missing", "❌ Not found on Docker Hub"
            else:
                # e.g. rate limiting; do not report this as a missing image
                return "error", f"⚠️ Unexpected HTTP status {response.status_code}"
        except Exception as e:
            return "error", f"⚠️ Error checking: {e}"

# Check problematic images from your docker-compose
problematic_images = [
    "apache/hadoop:3.3.4",
    "apache/hive:3.1.3",
    "bitnami/spark:3.4.1"
]

print("🔍 Checking image availability:\n")
for image in problematic_images:
    status, message = check_image_exists(image)
    print(f"{image:<30} | {message}")

print("\n" + "="*60)
print("💡 SOLUTION: Use alternative working images")
print("="*60)
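
When a tag is missing, it helps to see which tags a repository does publish. The sketch below builds a tag-listing URL on the same Docker Hub host used above and extracts tag names from a decoded response. It assumes the listing endpoint returns a JSON object with a `results` array whose entries carry a `name` field. The helper names and the sample payload are hypothetical.

```python
import json
from urllib.parse import quote

HUB_API = "https://registry.hub.docker.com/v2/repositories"

def tags_url(repo, page_size=25):
    """Build the tag-listing URL for a repository such as 'apache/hadoop'."""
    if "/" not in repo:
        repo = f"library/{repo}"  # official images use the 'library' namespace
    return f"{HUB_API}/{quote(repo)}/tags?page_size={page_size}"

def tag_names(payload):
    """Extract tag names from a decoded tag-listing response."""
    return [entry["name"] for entry in payload.get("results", [])]

# Hypothetical response body, trimmed to the fields used here.
sample = json.loads(
    '{"count": 2, "results": [{"name": "example-tag-1"}, {"name": "example-tag-2"}]}'
)
print(tags_url("apache/hadoop"))
print(tag_names(sample))
```

In the notebook, something like `tag_names(requests.get(tags_url("apache/hadoop"), timeout=5).json())` would fetch the real list, assuming the endpoint behaves as described.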

## Section 3: Analyze Docker Compose Configuration

Let's analyze the current docker-compose.yml file and identify issues.

In [None]:
# Read and analyze docker-compose.yml
compose_file = "/home/jovyan/work/../docker-compose.yml"

try:
    with open(compose_file, 'r') as f:
        compose_content = yaml.safe_load(f)
    
    print("📄 Docker Compose Analysis:")
    print("="*40)
    
    # Check version
    if 'version' in compose_content:
        print(f"⚠️  Version field found: {compose_content['version']} (deprecated in newer Docker Compose)")
    
    # Check services and their images
    services = compose_content.get('services', {})
    print(f"📊 Found {len(services)} services")
    
    print("\n🔍 Service Images:")
    for service_name, service_config in services.items():
        image = service_config.get('image', 'No image specified')
        print(f"  {service_name:<20} → {image}")
        
        # Check if image exists
        if image != 'No image specified':
            status, message = check_image_exists(image)
            if status == "missing":
                print(f"    ❌ {message}")
    
except FileNotFoundError:
    print(f"❌ Docker compose file not found: {compose_file}")
    print("💡 We'll create a working version")

print("\n" + "="*60)
print("🔧 FIXES NEEDED:")
print("1. Remove version field (deprecated)")
print("2. Replace non-existent images with working alternatives")
print("3. Update configuration for compatibility")
print("="*60)
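
The analysis above can also be written as a small pure function, which makes it easy to try on sample data without a file on disk. This is a sketch only: `summarize_compose` and the `example` dictionary are hypothetical, and the input is assumed to have the shape that `yaml.safe_load` returns for a compose file.

```python
def summarize_compose(compose):
    """Return (has_version_field, {service: image or None}) for a parsed compose file."""
    services = compose.get("services") or {}
    images = {name: (cfg or {}).get("image") for name, cfg in services.items()}
    return "version" in compose, images

# Hypothetical parsed content, shaped like yaml.safe_load() output.
example = {
    "version": "3.8",
    "services": {
        "namenode": {"image": "apache/hadoop:3.3.4"},
        "custom": {"build": "./custom"},  # built locally, so no image key
    },
}

has_version, images = summarize_compose(example)
print("version field present:", has_version)
for service, image in images.items():
    print(f"  {service:<20} → {image or 'No image specified'}")
```

Returning data instead of printing inside the loop lets the same result feed `check_image_exists` or a later report.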

## Section 4: Create Working Docker Configuration

Based on the analysis, we'll create a working configuration using images from the bde2020 (Big Data Europe) project, which publishes widely used Hadoop ecosystem containers.

In [None]:
# Create a working Docker Compose configuration
working_compose = """
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    restart: always
    ports:
      - 9870:9870
      - 9000:9000
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env

  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    container_name: datanode
    restart: always
    ports:
      - 9864:9864
    volumes:
      - hadoop_datanode:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    env_file:
      - ./hadoop.env

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8
    container_name: resourcemanager
    restart: always
    ports:
      - 8088:8088
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864"
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8
    container_name: nodemanager
    restart: always
    ports:
      - 8042:8042
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
    env_file:
      - ./hadoop.env

  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8
    container_name: historyserver
    restart: always
    ports:
      - 8188:8188
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088"
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env

  postgres:
    image: postgres:13
    container_name: postgres-hive
    restart: always
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive123
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  hive-metastore:
    image: apache/hive:4.0.0
    container_name: hive-metastore
    restart: always
    ports:
      - "9083:9083"
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864 postgres:5432"
      DB_DRIVER: postgres
      SERVICE_NAME: 'metastore'
    depends_on:
      - postgres
    volumes:
      - ./init/init-hive-db.sql:/docker-entrypoint-initdb.d/init-hive-db.sql

  spark-master:
    image: bitnami/spark:3.3
    container_name: spark-master
    restart: always
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

  spark-worker:
    image: bitnami/spark:3.3
    container_name: spark-worker
    restart: always
    ports:
      - "8081:8081"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    depends_on:
      - spark-master

volumes:
  hadoop_namenode:
  hadoop_datanode:
  hadoop_historyserver:
  postgres_data:
"""

# Write the working configuration
output_file = "/home/jovyan/work/docker-compose-working.yml"
with open(output_file, 'w') as f:
    f.write(working_compose)

print("✅ Created working Docker Compose configuration!")
print(f"📁 Saved to: {output_file}")
print("\n🔧 Key improvements:")
print("• Uses verified bde2020 Hadoop images")
print("• Uses official PostgreSQL for Hive Metastore")
print("• Uses Bitnami Spark images (well-maintained)")
print("• Removed deprecated version field")
print("• Added proper service dependencies")
print("• Configured persistent volumes")
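
Two common mistakes in a hand-written compose file are publishing the same host port twice and referencing a named volume that is never declared. The sketch below checks a parsed compose dictionary for both. It assumes short-syntax entries only (`host:container` ports and `source:target` volumes). The function names and the `example` data are hypothetical.

```python
from collections import Counter

def host_ports(compose):
    """Collect the host port of every short-syntax 'host:container' mapping."""
    ports = []
    for cfg in (compose.get("services") or {}).values():
        for mapping in cfg.get("ports", []):
            if not isinstance(mapping, str):
                continue  # long syntax or bare container port: not handled here
            parts = mapping.split("/")[0].split(":")
            if len(parts) >= 2 and parts[-2].isdigit():
                ports.append(int(parts[-2]))
    return ports

def duplicate_host_ports(compose):
    """Return host ports that more than one mapping tries to publish."""
    counts = Counter(host_ports(compose))
    return sorted(port for port, n in counts.items() if n > 1)

def undeclared_volumes(compose):
    """Return named volumes used by services but missing from top-level 'volumes'."""
    declared = set(compose.get("volumes") or {})
    missing = set()
    for cfg in (compose.get("services") or {}).values():
        for entry in cfg.get("volumes", []):
            source = str(entry).split(":")[0]
            # Paths (bind mounts, anonymous volumes) start with '.', '/' or '~'.
            if not source.startswith((".", "/", "~")) and source not in declared:
                missing.add(source)
    return sorted(missing)

# Hypothetical parsed compose content with one problem of each kind.
example = {
    "services": {
        "spark-master": {"ports": ["8080:8080"], "volumes": ["spark_data:/data"]},
        "other-ui": {"ports": ["8080:80"], "volumes": ["./conf:/conf"]},
    },
    "volumes": {"hadoop_namenode": None},
}
print("Duplicate host ports:", duplicate_host_ports(example))
print("Undeclared volumes:", undeclared_volumes(example))
```

Passing `yaml.safe_load(working_compose)` to these functions would run the same checks on the configuration written above.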

## Section 5: Environment Configuration

We need a proper environment file for Hadoop configuration variables.

In [None]:
# Create Hadoop environment configuration
hadoop_env = """
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_nameservices=cluster1
HDFS_CONF_dfs_ha_namenodes_cluster1=nn1,nn2
HDFS_CONF_dfs_namenode_rpc_address_cluster1_nn1=namenode:9000
HDFS_CONF_dfs_namenode_http_address_cluster1_nn1=namenode:9870
HDFS_CONF_dfs_replication=1

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=1400
YARN_CONF_yarn_scheduler_maximum___allocation___mb=1400
YARN_CONF_yarn_scheduler_minimum___allocation___mb=128
YARN_CONF_yarn_nodemanager_vmem___check___enabled=false
"""

# Write the environment file
env_file = "/home/jovyan/work/hadoop.env"
with open(env_file, 'w') as f:
    f.write(hadoop_env)

print("✅ Created Hadoop environment configuration!")
print(f"📁 Saved to: {env_file}")
print("\n🔧 Key configurations:")
print("• HDFS default filesystem: hdfs://namenode:9000")
print("• Web UI enabled on port 9870")
print("• YARN ResourceManager on port 8032")
print("• Memory limits set for single-node setup")
print("• Replication factor set to 1 (for testing)")
print("• Timeline service enabled for job history")
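
The variable names in `hadoop.env` encode Hadoop property names. The sketch below decodes them under the naming convention that the bde2020 image entrypoint is assumed to use: the part before `_CONF_` selects the configuration file, a triple underscore becomes a dash, a double underscore becomes a single underscore, and a single underscore becomes a dot. `env_to_property` is a hypothetical helper for checking spellings, not part of the images.

```python
def env_to_property(env_name):
    """Translate an env variable name into (prefix, Hadoop property name)."""
    prefix, sep, rest = env_name.partition("_CONF_")
    if not sep:
        raise ValueError(f"{env_name!r} has no _CONF_ marker")
    name = (
        rest.replace("___", "-")   # triple underscore -> dash
            .replace("__", "\0")   # protect double underscores
            .replace("_", ".")     # single underscore -> dot
            .replace("\0", "_")    # restore as a literal underscore
    )
    return prefix, name

for variable in (
    "CORE_CONF_fs_defaultFS",
    "YARN_CONF_yarn_log___aggregation___enable",
    "YARN_CONF_yarn_nodemanager_resource_memory___mb",
):
    print(variable, "->", env_to_property(variable))
```

Decoding a name this way is a quick check for spelling mistakes. For example, under this convention `HDFS_CONF_dfs_namenode_rpc_address_cluster1_nn1` decodes to `dfs.namenode.rpc.address.cluster1.nn1`, while Hadoop spells that property with a dash (`dfs.namenode.rpc-address...`), so that entry may not take effect as written.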

## Section 6: Deploy and Monitor

Now let's start the working environment and monitor the services as they come online.

In [None]:
# Function to monitor Docker containers
def monitor_containers():
    """Monitor the status of all containers in our stack"""
    try:
        result = subprocess.run(['docker', 'ps', '--format', 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'], 
                              capture_output=True, text=True, check=True)
        print("🐳 CONTAINER STATUS:")
        print("="*80)
        print(result.stdout)
        
        # Check if all expected containers are running
        expected_containers = [
            'namenode', 'datanode', 'resourcemanager', 'nodemanager', 
            'historyserver', 'postgres-hive', 'hive-metastore', 
            'spark-master', 'spark-worker'
        ]
        
        running_containers = []
        for line in result.stdout.split('\n')[1:]:  # Skip header
            if line.strip():
                # Table output is padded with spaces, so split on any whitespace
                container_name = line.split()[0]
                running_containers.append(container_name)
        
        print("\n📊 SERVICE SUMMARY:")
        print("-" * 40)
        for container in expected_containers:
            status = "✅ Running" if container in running_containers else "❌ Not Running"
            print(f"{container:<20} {status}")
            
    except subprocess.CalledProcessError as e:
        print(f"❌ Error checking containers: {e}")
        return False
    
    return len(running_containers) > 0

# Function to check service health
def check_service_health():
    """Check if web UIs are accessible"""
    services = {
        'Hadoop NameNode': 'http://localhost:9870',
        'YARN ResourceManager': 'http://localhost:8088', 
        'Spark Master': 'http://localhost:8080',
        'Spark Worker': 'http://localhost:8081'
    }
    
    print("\n🌐 WEB UI HEALTH CHECK:")
    print("="*50)
    
    for service_name, url in services.items():
        try:
            response = requests.get(url, timeout=5)
            if response.status_code == 200:
                print(f"✅ {service_name:<25} → {url}")
            else:
                print(f"⚠️  {service_name:<25} → {url} (Status: {response.status_code})")
        except requests.exceptions.RequestException:
            print(f"❌ {service_name:<25} → {url} (Not accessible)")

# Start monitoring
print("🚀 STARTING DOCKER ENVIRONMENT MONITORING")
print("="*60)
monitor_containers()
check_service_health()
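
The health check above runs once, but the services take a while to come up. A minimal polling sketch is shown below: it calls a probe repeatedly until the probe succeeds or a deadline passes. `wait_until` and `fake_probe` are hypothetical names, and the fake probe stands in for a real HTTP check so the example runs without Docker.

```python
import time

def wait_until(probe, timeout=180.0, interval=5.0):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in probe that succeeds on its third call.
attempts = {"count": 0}

def fake_probe():
    attempts["count"] += 1
    return attempts["count"] >= 3

ready = wait_until(fake_probe, timeout=5.0, interval=0.01)
print("ready:", ready, "after", attempts["count"], "attempts")
```

A real probe could wrap `requests.get("http://localhost:9870", timeout=5)` in a `try`/`except requests.exceptions.RequestException` and return whether the status code is 200.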

## Quick Start Guide

**To use this troubleshooting notebook:**

1. **Copy the working files to your Docker directory:**
   ```bash
   cp /home/jovyan/work/docker-compose-working.yml /path/to/your/docker/directory/docker-compose.yml
   cp /home/jovyan/work/hadoop.env /path/to/your/docker/directory/hadoop.env
   ```

2. **Start the environment:**
   ```bash
   cd /path/to/your/docker/directory
   docker-compose up -d
   ```

3. **Monitor progress:** Run the monitoring cell above to check container status

4. **Access web interfaces:**
   - Hadoop NameNode: http://localhost:9870
   - YARN ResourceManager: http://localhost:8088
   - Spark Master: http://localhost:8080
   - Spark Worker: http://localhost:8081

**Troubleshooting tips:**
- If containers fail to start, check `docker-compose logs [service-name]`
- Ensure the host ports published by the stack are not in use: 9870, 9000, 9864, 8088, 8042, 8188, 5432, 9083, 8080, 7077, 8081
- Wait 2-3 minutes for all services to fully initialize
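- Services start in a dependency order: the Hadoop services wait for each other via `SERVICE_PRECONDITION`, `hive-metastore` starts after `postgres`, and `spark-worker` starts after `spark-master`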
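
The port tip above can be checked from Python before starting the stack. The sketch below tries a TCP connection to each host port published in the compose file from Section 4 and reports the ones that already accept connections. `port_in_use` and `STACK_PORTS` are hypothetical names, and the check assumes the ports would be bound on `127.0.0.1`.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0

# Host ports published by the working compose configuration.
STACK_PORTS = [9870, 9000, 9864, 8088, 8042, 8188, 5432, 9083, 8080, 7077, 8081]

busy = [port for port in STACK_PORTS if port_in_use(port)]
print("Ports already in use:", busy or "none")
```

If the stack is already running, its own ports will show up as busy, so this check is most useful before `docker-compose up -d`.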