# Week 4 — Part 01: Timeouts and Failures Lab

**Estimated Time:** 60-90 minutes

**Prerequisites:** Python, basic understanding of HTTP requests, error handling

---

## Learning Objectives

By completing this lab, you will:

- ✅ Understand different types of failures in distributed systems
- ✅ Learn to implement timeout mechanisms for API calls
- ✅ Practice handling various failure scenarios gracefully
- ✅ Explore techniques for failure detection and recovery
- ✅ Develop robust error handling patterns

## Key Concepts

- **Timeouts**: Limits on how long to wait for operations
- **Failure Handling**: Strategies for dealing with system failures
- **Retry Logic**: Automatically retrying failed operations
- **Circuit Breakers**: Preventing cascading failures
- **Graceful Degradation**: Maintaining functionality during partial failures

---

## Exercise 1: Understanding Timeouts

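Let's start with what a timeout is and why it matters. A timeout caps how long a caller waits for an operation; without one, a single stalled dependency can block the caller indefinitely and tie up threads, connections, and memory.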

In [None]:
# Import required libraries
import time
import random
import requests
from requests.exceptions import Timeout, ConnectionError, RequestException

print("Timeout and failure handling environment set up successfully!")

### Task 1.1: Implementing Basic Timeouts

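Let's implement a basic timeout check for function calls. Note that the helper below does not interrupt the call: it measures how long the function took and reports a timeout only after the function has returned.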

In [None]:
# Function that simulates a slow operation
def slow_operation(duration):
    """
    Simulate a slow operation.
    
    Args:
        duration (float): How long to sleep in seconds
    
    Returns:
        str: Success message
    """
    time.sleep(duration)
    return f"Operation completed after {duration} seconds"

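# Function with a timeout check using a simple approach.
# The check runs only after func returns, so a slow call is reported as
# timed out but is never cut short.
def call_with_timeout(func, args=(), kwargs=None, timeout=5):
    """
    Call a function and report a timeout if it took longer than allowed.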
    
    Args:
        func (function): Function to call
        args (tuple): Positional arguments for the function
        kwargs (dict): Keyword arguments for the function
        timeout (float): Timeout in seconds
    
    Returns:
        tuple: (success, result_or_error)
    """
    if kwargs is None:
        kwargs = {}
    
    start_time = time.time()
    
    try:
        result = func(*args, **kwargs)
        elapsed = time.time() - start_time
        
        if elapsed > timeout:
            return False, f"Operation timed out after {elapsed:.2f} seconds"
        
        return True, result
    except Exception as e:
        elapsed = time.time() - start_time
        return False, f"Operation failed after {elapsed:.2f} seconds: {e}"

# Test timeout functionality
print("Testing Timeout Implementation:")
print("============================")

# Test with a quick operation
success, result = call_with_timeout(slow_operation, args=(1,), timeout=5)
print(f"\nQuick operation (1s): Success={success}, Result={result}")

# Test with a slow operation that exceeds timeout
success, result = call_with_timeout(slow_operation, args=(10,), timeout=5)
print(f"\nSlow operation (10s): Success={success}, Result={result}")
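
The simple helper above always waits for the function to finish. One way to stop waiting once the limit is reached is to run the call in a worker thread and bound the wait on its result. The following is a minimal sketch, not a definitive implementation: `call_with_enforced_timeout` and `nap` are hypothetical names introduced for illustration, and the durations are shortened so the cell runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_enforced_timeout(func, args=(), kwargs=None, timeout=5.0):
    """Hypothetical helper: stop waiting for func after `timeout` seconds.

    Returns (success, result_or_error), mirroring call_with_timeout.
    """
    if kwargs is None:
        kwargs = {}

    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # result() raises a timeout error if the worker is not done in time
        return True, future.result(timeout=timeout)
    except FutureTimeout:
        return False, f"Operation timed out after {timeout:.2f} seconds"
    except Exception as e:
        return False, f"Operation failed: {e}"
    finally:
        # Do not block on the worker. A Python thread cannot be killed,
        # so an overrunning call keeps running in the background.
        executor.shutdown(wait=False)


def nap(duration):
    """Stand-in for slow_operation so this sketch is self-contained."""
    time.sleep(duration)
    return f"Operation completed after {duration} seconds"


fast = call_with_enforced_timeout(nap, args=(0.05,), timeout=1.0)
slow = call_with_enforced_timeout(nap, args=(1.0,), timeout=0.2)
print(fast)
print(slow)
```

The caller gets control back after `timeout` seconds, but the abandoned work still consumes a thread until it finishes. For I/O calls, a timeout supported by the library itself (as in the next task) is usually the better choice.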

### Task 1.2: HTTP Request Timeouts

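Let's explore timeouts for HTTP requests. With `requests`, pass `timeout` on every call; without it the library waits indefinitely. The value bounds the connection attempt and each wait for data from the server, not the total download time, so a response that trickles in slowly can take longer than `timeout` overall.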

In [None]:
# HTTP request with timeout
def http_request_with_timeout(url, timeout=5):
    """
    Make an HTTP request with timeout.
    
    Args:
        url (str): URL to request
        timeout (float): Timeout in seconds
    
    Returns:
        dict: Request results
    """
    # Start the clock before the try block so every handler can use it
    start_time = time.time()
    try:
        response = requests.get(url, timeout=timeout)
        elapsed = time.time() - start_time
        
        return {
            'success': True,
            'status_code': response.status_code,
            'elapsed_time': elapsed,
            'content_length': len(response.content)
        }
    except Timeout:
        return {
            'success': False,
            'error': 'Request timed out',
            'elapsed_time': time.time() - start_time
        }
    except ConnectionError:
        return {
            'success': False,
            'error': 'Connection error',
            'elapsed_time': time.time() - start_time
        }
    except RequestException as e:
        return {
            'success': False,
            'error': f'Request failed: {e}',
            'elapsed_time': time.time() - start_time
        }

# Test HTTP request timeouts
print("Testing HTTP Request Timeouts:")
print("============================")

# Test with a fast responding server
result = http_request_with_timeout('https://httpbin.org/delay/1', timeout=5)
print(f"\nFast request: Success={result['success']}, Status={result.get('status_code', 'N/A')}, Time={result['elapsed_time']:.2f}s")

# Test with a slow responding server
result = http_request_with_timeout('https://httpbin.org/delay/10', timeout=5)
print(f"\nSlow request: Success={result['success']}, Error={result.get('error', 'N/A')}, Time={result['elapsed_time']:.2f}s")

---

## Exercise 2: Handling Different Failure Types

Let's explore different types of failures and how to handle them.

### Task 2.1: Simulating Different Failure Types

Let's create functions that simulate different types of failures.

In [None]:
# Functions that simulate different failure types
class SimulatedFailure(Exception):
    """Base class for simulated failures."""
    pass

class NetworkFailure(SimulatedFailure):
    """Simulates network failures."""
    pass

class ServiceFailure(SimulatedFailure):
    """Simulates service failures."""
    pass

class RateLimitFailure(SimulatedFailure):
    """Simulates rate limit failures."""
    pass

def operation_with_failures(failure_type=None, failure_probability=0.3):
    """
    Simulate an operation that may fail.
    
    Args:
        failure_type (str): 'network', 'service', or 'rate_limit'; any other
            value (including None or 'random') picks a failure type at random
        failure_probability (float): Probability of failure (0.0 to 1.0)
    
    Returns:
        str: Success message
    
    Raises:
        SimulatedFailure: Various failure types
    """
    # Simulate processing time
    time.sleep(random.uniform(0.1, 1.0))
    
    # Check if we should fail
    if random.random() < failure_probability:
        if failure_type == 'network':
            raise NetworkFailure("Network connection lost")
        elif failure_type == 'service':
            raise ServiceFailure("Service temporarily unavailable")
        elif failure_type == 'rate_limit':
            raise RateLimitFailure("Rate limit exceeded")
        else:
            # Random failure type
            failure_types = [NetworkFailure, ServiceFailure, RateLimitFailure]
            failure_class = random.choice(failure_types)
            raise failure_class("Random simulated failure")
    
    return "Operation completed successfully"

# Test different failure types
print("Testing Different Failure Types:")
print("===============================")

failure_types = ['network', 'service', 'rate_limit', 'random']

for failure_type in failure_types:
    try:
        result = operation_with_failures(failure_type=failure_type, failure_probability=0.5)
        print(f"{failure_type.capitalize()} failure: Success - {result}")
    except SimulatedFailure as e:
        print(f"{failure_type.capitalize()} failure: Failed - {e}")
    except Exception as e:
        print(f"{failure_type.capitalize()} failure: Unexpected error - {e}")

### Task 2.2: Comprehensive Error Handling

Let's implement comprehensive error handling for different failure types.

In [None]:
# Comprehensive error handling
def handle_operation_with_failures(operation_func, max_retries=3, base_delay=1.0):
    """
    Handle an operation with comprehensive error handling and retries.
    
    Args:
        operation_func (function): Function to call
        max_retries (int): Maximum number of retry attempts
        base_delay (float): Base delay between retries (with exponential backoff)
    
    Returns:
        dict: Operation results
    """
    for attempt in range(max_retries + 1):
        try:
            start_time = time.time()
            result = operation_func()
            elapsed = time.time() - start_time
            
            return {
                'success': True,
                'result': result,
                'attempts': attempt + 1,
                'elapsed_time': elapsed
            }
        
        except NetworkFailure as e:
            if attempt < max_retries:
                delay = base_delay * (2 ** attempt)  # Exponential backoff
                print(f"Network failure on attempt {attempt + 1}: {e}. Retrying in {delay:.2f}s...")
                time.sleep(delay)
            else:
                return {
                    'success': False,
                    'error': f'Network failure after {max_retries + 1} attempts: {e}',
                    'error_type': 'network',
                    'attempts': max_retries + 1
                }
        
        except ServiceFailure as e:
            if attempt < max_retries:
                delay = base_delay * (2 ** attempt)
                print(f"Service failure on attempt {attempt + 1}: {e}. Retrying in {delay:.2f}s...")
                time.sleep(delay)
            else:
                return {
                    'success': False,
                    'error': f'Service failure after {max_retries + 1} attempts: {e}',
                    'error_type': 'service',
                    'attempts': max_retries + 1
                }
        
        except RateLimitFailure as e:
            if attempt < max_retries:
                # For rate limits, wait longer
                delay = base_delay * (2 ** attempt) * 2
                print(f"Rate limit exceeded on attempt {attempt + 1}: {e}. Waiting {delay:.2f}s...")
                time.sleep(delay)
            else:
                return {
                    'success': False,
                    'error': f'Rate limit failure after {max_retries + 1} attempts: {e}',
                    'error_type': 'rate_limit',
                    'attempts': max_retries + 1
                }
        
        except Exception as e:
            return {
                'success': False,
                'error': f'Unexpected error: {e}',
                'error_type': 'unexpected',
                'attempts': attempt + 1
            }
    
    # This should never be reached
    return {
        'success': False,
        'error': 'Unknown error in retry loop',
        'error_type': 'unknown',
        'attempts': max_retries + 1
    }

# Test comprehensive error handling
print("Testing Comprehensive Error Handling:")
print("====================================")

# Test with different failure types
failure_scenarios = [
    ('No failure', lambda: operation_with_failures(failure_type=None, failure_probability=0.0)),
    ('Network failure', lambda: operation_with_failures(failure_type='network', failure_probability=0.7)),
    ('Service failure', lambda: operation_with_failures(failure_type='service', failure_probability=0.7)),
    ('Rate limit failure', lambda: operation_with_failures(failure_type='rate_limit', failure_probability=0.7))
]

for scenario_name, operation in failure_scenarios:
    print(f"\n{scenario_name}:")
    result = handle_operation_with_failures(operation, max_retries=3)
    
    if result['success']:
        print(f"  Success: {result['result']} (Attempts: {result['attempts']})")
    else:
        print(f"  Failed: {result['error']} (Attempts: {result['attempts']})")
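
The handler above waits exactly `base_delay * (2 ** attempt)` between attempts. When many clients fail at the same moment, identical delays make them all retry at the same moment too. A common refinement is to cap the delay and randomize it. The sketch below shows one such variant, often called "full jitter"; `backoff_delay`, `max_delay`, and `rng` are hypothetical names that are not part of the lab code.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, rng=random):
    """Hypothetical helper: capped exponential backoff with full jitter.

    attempt is zero-based, matching the retry loop above.
    """
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the capped exponential delay
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # Seeded so the run is repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(6)]
ceilings = [min(30.0, 1.0 * (2 ** attempt)) for attempt in range(6)]

for attempt, (delay, ceiling) in enumerate(zip(delays, ceilings)):
    print(f"attempt {attempt}: sleep {delay:.2f}s (ceiling {ceiling:.0f}s)")
```

To try it in the handler, replace the `delay = base_delay * (2 ** attempt)` lines with a call to `backoff_delay(attempt, base_delay)`. The cap keeps late attempts from waiting for minutes, and the randomness spreads retries out over time.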

---

## Exercise 3: Advanced Failure Handling Patterns

Let's explore advanced patterns for handling failures.

### Task 3.1: Circuit Breaker Pattern

Let's implement a circuit breaker pattern.

In [None]:
# Circuit breaker implementation
class CircuitBreaker:
    """
    Implements the circuit breaker pattern.
    """
    
    def __init__(self, failure_threshold=5, timeout=60):
        """
        Initialize the circuit breaker.
        
        Args:
            failure_threshold (int): Number of failures before opening circuit
            timeout (int): Seconds to stay OPEN before allowing one trial call (HALF_OPEN)
        """
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
    
    def call(self, func, *args, **kwargs):
        """
        Call a function through the circuit breaker.
        
        Args:
            func (function): Function to call
            *args: Positional arguments for the function
            **kwargs: Keyword arguments for the function
        
        Returns:
            Result of the function call
        
        Raises:
            Exception: If circuit is open or function fails
        """
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
                print("Circuit breaker: HALF_OPEN - testing service")
            else:
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise  # Re-raise the original exception with its traceback intact
    
    def on_success(self):
        """
        Called when a function call succeeds.
        """
        self.failure_count = 0
        self.state = 'CLOSED'
    
    def on_failure(self):
        """
        Called when a function call fails.
        """
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
            print(f"Circuit breaker: OPEN - {self.failure_count} failures")
    
    def get_state(self):
        """
        Get the current state of the circuit breaker.
        
        Returns:
            str: Current state (CLOSED, OPEN, HALF_OPEN)
        """
        return self.state

# Test circuit breaker
print("Testing Circuit Breaker Pattern:")
print("===============================")

# Create a circuit breaker with a low threshold and a short timeout so the
# OPEN -> HALF_OPEN transition can show up within the 10 attempts below
cb = CircuitBreaker(failure_threshold=3, timeout=2)

# Test function that fails most of the time
def unreliable_function():
    if random.random() < 0.8:  # 80% failure rate
        raise ServiceFailure("Service unavailable")
    return "Success!"

# Test the circuit breaker
for i in range(10):
    try:
        result = cb.call(unreliable_function)
        print(f"Attempt {i+1}: Success - {result} (State: {cb.get_state()})")
    except Exception as e:
        print(f"Attempt {i+1}: Failed - {e} (State: {cb.get_state()})")
    
    time.sleep(0.5)  # Small delay between attempts
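
The demo above depends on random failures and on real elapsed time, so the sequence of states differs from run to run. To see every transition in a fixed order, the clock can be passed in and advanced by hand. The sketch below is a compact variant written for illustration only: `FakeClock`, `ClockedCircuitBreaker`, and the `clock` parameter are hypothetical names, not part of the `CircuitBreaker` class above.

```python
import time


class FakeClock:
    """Hypothetical test clock that only moves when told to."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ClockedCircuitBreaker:
    """Compact circuit breaker variant that reads time from a callable."""

    def __init__(self, failure_threshold=3, timeout=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock  # Assumption: zero-argument callable returning seconds
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func):
        if self.state == 'OPEN':
            if self.clock() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'  # Allow one trial call
            else:
                raise RuntimeError("Circuit breaker is OPEN")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = self.clock()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise
        self.failure_count = 0
        self.state = 'CLOSED'
        return result


def failing():
    raise RuntimeError("Service unavailable")


def healthy():
    return "Success!"


clock = FakeClock()
demo_cb = ClockedCircuitBreaker(failure_threshold=3, timeout=10, clock=clock)
states = []


def attempt(func):
    """Call through the breaker and record the state afterwards."""
    try:
        demo_cb.call(func)
    except Exception:
        pass
    states.append(demo_cb.state)


for _ in range(3):
    attempt(failing)   # The third failure opens the circuit
attempt(healthy)       # Rejected while OPEN; healthy() is never called
clock.advance(11)      # Move past the timeout
attempt(failing)       # Trial call fails, so the circuit re-opens
clock.advance(11)
attempt(healthy)       # Trial call succeeds, so the circuit closes
print(states)
# → ['CLOSED', 'CLOSED', 'OPEN', 'OPEN', 'OPEN', 'CLOSED']
```

Note that a failed trial call re-opens the circuit immediately because `failure_count` is already at the threshold, which matches the behavior of the class above.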

### Task 3.2: Graceful Degradation

Let's implement graceful degradation strategies.

In [None]:
# Graceful degradation implementation
class ServiceClient:
    """
    Service client with graceful degradation.
    """
    
    def __init__(self):
        self.primary_service_available = True
        self.secondary_service_available = True
    
    def primary_service(self):
        """
        Primary service that may fail.
        
        Returns:
            str: Service result
        
        Raises:
            ServiceFailure: If service is unavailable
        """
        if not self.primary_service_available:
            raise ServiceFailure("Primary service unavailable")
        
        if random.random() < 0.3:  # 30% failure rate
            raise ServiceFailure("Primary service temporarily unavailable")
        
        return "Primary service result"
    
    def secondary_service(self):
        """
        Secondary service as fallback.
        
        Returns:
            str: Service result
        
        Raises:
            ServiceFailure: If service is unavailable
        """
        if not self.secondary_service_available:
            raise ServiceFailure("Secondary service unavailable")
        
        if random.random() < 0.1:  # 10% failure rate
            raise ServiceFailure("Secondary service temporarily unavailable")
        
        return "Secondary service result (degraded)"
    
    def fallback_result(self):
        """
        Fallback result when all services fail.
        
        Returns:
            str: Fallback result
        """
        return "Fallback cached result"
    
    def get_data_with_degradation(self):
        """
        Get data with graceful degradation.
        
        Returns:
            dict: Result with quality indicator
        """
        # Try primary service first
        try:
            result = self.primary_service()
            return {
                'data': result,
                'quality': 'high',
                'source': 'primary'
            }
        except ServiceFailure as e:
            print(f"Primary service failed: {e}. Trying secondary service...")
        
        # Try secondary service
        try:
            result = self.secondary_service()
            return {
                'data': result,
                'quality': 'medium',
                'source': 'secondary'
            }
        except ServiceFailure as e:
            print(f"Secondary service failed: {e}. Using fallback...")
        
        # Use fallback
        result = self.fallback_result()
        return {
            'data': result,
            'quality': 'low',
            'source': 'fallback'
        }

# Test graceful degradation
print("Testing Graceful Degradation:")
print("===========================")

client = ServiceClient()

for i in range(5):
    result = client.get_data_with_degradation()
    print(f"\nAttempt {i+1}:")
    print(f"  Data: {result['data']}")
    print(f"  Quality: {result['quality']}")
    print(f"  Source: {result['source']}")
    
    time.sleep(0.5)
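
`get_data_with_degradation` hard-codes two tiers plus a fallback. The same idea can be written as a loop over an ordered list of providers, which makes it easy to add or reorder tiers. This is a rough sketch under assumed names: `first_available`, `broken_primary`, and `working_secondary` are hypothetical and exist only for this example.

```python
def first_available(providers, default):
    """Hypothetical helper: try providers in order, then fall back to default.

    providers is a list of (source_name, quality, zero-argument callable).
    """
    errors = []
    for source, quality, provider in providers:
        try:
            return {'data': provider(), 'quality': quality,
                    'source': source, 'errors': errors}
        except Exception as e:  # A real client would catch narrower types
            errors.append(f"{source}: {e}")
    return {'data': default, 'quality': 'low',
            'source': 'fallback', 'errors': errors}


def broken_primary():
    raise RuntimeError("Primary service unavailable")


def working_secondary():
    return "Secondary service result (degraded)"


degraded = first_available(
    [('primary', 'high', broken_primary),
     ('secondary', 'medium', working_secondary)],
    default="Fallback cached result",
)
print(degraded['source'], degraded['quality'], degraded['errors'])

all_down = first_available(
    [('primary', 'high', broken_primary)],
    default="Fallback cached result",
)
print(all_down['source'], all_down['data'])
```

Returning the collected errors alongside the data lets the caller log why a lower tier was used without having to print inside the helper.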

---

## Exercise 4: Practice Challenges

Now it's your turn to apply what you've learned. Try to complete the following challenges:

### Challenge 4.1: Create a Resilient HTTP Client

Create a resilient HTTP client that handles timeouts, retries, and circuit breaking.

In [None]:
# TODO: Create a resilient HTTP client

class ResilientHTTPClient:
    """
    Resilient HTTP client with timeouts, retries, and circuit breaking.
    """
    
    def __init__(self, timeout=5, max_retries=3, circuit_breaker=None):
        """
        Initialize the client.
        
        Args:
            timeout (float): Request timeout in seconds
            max_retries (int): Maximum number of retries
            circuit_breaker (CircuitBreaker): Circuit breaker instance
        """
        # YOUR CODE HERE
        pass
    
    def get(self, url, **kwargs):
        """
        Make a GET request with resilience features.
        
        Args:
            url (str): URL to request
            **kwargs: Additional arguments for requests
        
        Returns:
            dict: Request results
        """
        # YOUR CODE HERE
        # Implement with timeouts, retries, and circuit breaking
        pass
    
    def post(self, url, **kwargs):
        """
        Make a POST request with resilience features.
        
        Args:
            url (str): URL to request
            **kwargs: Additional arguments for requests
        
        Returns:
            dict: Request results
        """
        # YOUR CODE HERE
        pass

print("Resilient HTTP client class template created.")
print("Implement the class with timeout, retry, and circuit breaker features.")

### Challenge 4.2: Failure Simulation Framework

Create a framework for simulating and testing different failure scenarios.

In [None]:
# TODO: Create a failure simulation framework

class FailureSimulationFramework:
    """
    Framework for simulating and testing failure scenarios.
    """
    
    def __init__(self):
        """
        Initialize the framework.
        """
        self.scenarios = {}
        # YOUR CODE HERE
        pass
    
    def add_scenario(self, name, failure_function, probability=0.5):
        """
        Add a failure scenario.
        
        Args:
            name (str): Name of the scenario
            failure_function (function): Function that may fail
            probability (float): Probability of failure (0.0 to 1.0)
        """
        # YOUR CODE HERE
        pass
    
    def run_scenario(self, name, iterations=100):
        """
        Run a failure scenario multiple times.
        
        Args:
            name (str): Name of the scenario
            iterations (int): Number of iterations
        
        Returns:
            dict: Statistics about the scenario runs
        """
        # YOUR CODE HERE
        # Implement scenario execution and statistics collection
        pass
    
    def compare_strategies(self, scenarios, strategies):
        """
        Compare different failure handling strategies.
        
        Args:
            scenarios (list): List of scenario names
            strategies (list): List of handling strategy functions
        
        Returns:
            dict: Comparison results
        """
        # YOUR CODE HERE
        pass

print("Failure simulation framework class template created.")
print("Implement the framework for simulating and testing failure scenarios.")

### Challenge 4.3: Monitoring and Alerting

Create a monitoring system that tracks failures and sends alerts.

In [None]:
# TODO: Create a monitoring and alerting system

class FailureMonitor:
    """
    Monitors failures and sends alerts.
    """
    
    def __init__(self, alert_threshold=10, alert_window=300):
        """
        Initialize the monitor.
        
        Args:
            alert_threshold (int): Number of failures before alerting
            alert_window (int): Time window in seconds for counting failures
        """
        self.alert_threshold = alert_threshold
        self.alert_window = alert_window
        self.failure_log = []
        # YOUR CODE HERE
        pass
    
    def record_failure(self, failure_type, details=None):
        """
        Record a failure.
        
        Args:
            failure_type (str): Type of failure
            details (dict): Additional details about the failure
        
        Returns:
            bool: True if alert should be sent
        """
        # YOUR CODE HERE
        # Implement failure recording and alert checking
        pass
    
    def send_alert(self, failure_summary):
        """
        Send an alert about failures.
        
        Args:
            failure_summary (dict): Summary of recent failures
        """
        # YOUR CODE HERE
        # Implement alert sending (could be email, Slack, etc.)
        print(f"ALERT: {failure_summary}")
        pass
    
    def get_failure_stats(self, time_window=3600):
        """
        Get failure statistics for a time window.
        
        Args:
            time_window (int): Time window in seconds
        
        Returns:
            dict: Failure statistics
        """
        # YOUR CODE HERE
        pass

print("Failure monitor class template created.")
print("Implement the monitoring system for tracking failures and sending alerts.")

---

## Summary and Key Takeaways

### Concepts Practiced

- **Timeouts**: Limits on how long to wait for operations
- **Failure Handling**: Strategies for dealing with system failures
- **Retry Logic**: Automatically retrying failed operations
- **Circuit Breakers**: Preventing cascading failures
- **Graceful Degradation**: Maintaining functionality during partial failures

### Best Practices

- ✅ Always implement timeouts for external calls
- ✅ Use exponential backoff for retries
- ✅ Implement circuit breakers for failing services
- ✅ Provide graceful degradation when possible
- ✅ Monitor and alert on failure patterns

### Next Steps

- Practice implementing more sophisticated retry strategies
- Learn about advanced circuit breaker patterns
- Explore tools for distributed tracing and monitoring
- Investigate service mesh technologies for failure handling