# s5cmd Progress in Jupyter

This notebook documents the working solution for displaying s5cmd progress in Jupyter notebooks.

**Problem**: `s5cmd cp --show-progress` uses `\r` (carriage return) to update the progress bar in place.
When run via subprocess, Jupyter doesn't interpret `\r` correctly, causing progress lines to stack.
Additionally, s5cmd disables progress output when it detects stdout is not a TTY.

**Solution**: Use a pseudo-TTY (`pty`) to trick s5cmd into outputting progress, then handle `\r` manually
with `IPython.display.clear_output()` for clean single-line updates.
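
The TTY detection described above can be observed without s5cmd. The following is a minimal sketch that uses a small Python child process as a stand-in (an assumption for illustration; s5cmd's own detection logic is not shown here). The helper names `child_sees_tty_via_pipe` and `child_sees_tty_via_pty` are hypothetical. The `pty` module is only available on Unix-like systems.

```python
import os
import pty
import subprocess
import sys

# Stand-in child process: reports whether its stdout looks like a terminal.
CHILD = [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"]


def child_sees_tty_via_pipe():
    """Run the child with stdout connected to an ordinary pipe."""
    result = subprocess.run(CHILD, stdout=subprocess.PIPE, text=True, check=True)
    return result.stdout.strip() == "True"


def child_sees_tty_via_pty():
    """Run the child with stdin/stdout/stderr connected to a pseudo-terminal."""
    master_fd, slave_fd = pty.openpty()
    process = subprocess.Popen(
        CHILD, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd, close_fds=True
    )
    os.close(slave_fd)  # the parent only needs the master end
    output = b""
    try:
        while True:
            try:
                data = os.read(master_fd, 1024)
            except OSError:
                # Linux raises EIO once the child has closed the slave end
                break
            if not data:
                break
            output += data
    finally:
        os.close(master_fd)
        process.wait()
    return output.decode().strip() == "True"
```

Calling both helpers should show the child reporting no terminal through the pipe and a terminal through the PTY, which is the property the solution relies on.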

**Key flags**:
- `--show-progress`: Show byte-level progress bar
- `--if-size-differ`: Only copy if file size differs (sync-like behavior)
- `--if-source-newer`: Only copy if source is newer (sync-like behavior)

In [None]:
import shutil
from pathlib import Path

print(f"s5cmd path: {shutil.which('s5cmd')}")
!s5cmd version

In [None]:
# Test paths
TEST_BUCKET = "s3://visionlab-datasets/slipstream-cache/debug-progress/"
LOCAL_CACHE = Path.home() / ".lightning/chunks/6088ed39051d2929849d270c9e7a505a/1743941646.7863863/slipcache-6088ed39/.slipstream"

print(f"Local cache: {LOCAL_CACHE}")
print(f"Exists: {LOCAL_CACHE.exists()}")

if LOCAL_CACHE.exists():
    print("\nFiles:")
    total_size = 0
    for f in sorted(LOCAL_CACHE.iterdir()):
        size = f.stat().st_size
        total_size += size
        if size > 1e9:
            size_str = f"{size / 1e9:.2f} GB"
        elif size > 1e6:
            size_str = f"{size / 1e6:.1f} MB"
        else:
            size_str = f"{size / 1e3:.1f} KB"
        print(f"  {f.name}: {size_str}")
    print(f"\nTotal: {total_size / 1e9:.2f} GB")

## The Solution: PTY-based Progress

This function uses a pseudo-TTY to get real-time progress from s5cmd, then handles `\r` manually.

In [None]:
import subprocess
import os
import pty
import select
import re
from IPython.display import clear_output

def s5cmd_with_pty_progress(cmd, verbose=True):
    """
    Run s5cmd with a pseudo-TTY to get real progress output in Jupyter.
    
    Uses PTY to trick s5cmd into thinking it's connected to a real terminal,
    then handles carriage returns manually for clean Jupyter display.
    
    Args:
        cmd: Command list (e.g., ["s5cmd", "cp", "--show-progress", ...])
        verbose: If True, display progress. If False, run silently.
    
    Returns:
        Process return code
    """
    # Create pseudo-terminal
    master_fd, slave_fd = pty.openpty()
    
    process = subprocess.Popen(
        cmd,
        stdout=slave_fd,
        stderr=slave_fd,
        stdin=slave_fd,
        close_fds=True
    )
    
    os.close(slave_fd)  # Close slave in parent
    
    # Strip ANSI CSI escape sequences (colors, cursor movement, line erase) for cleaner display
    ansi_escape = re.compile(r'\x1b\[[0-9;?]*[A-Za-z]')
    line_buffer = ""
    
    try:
        while True:
            # Check if there's data to read (with timeout)
            if select.select([master_fd], [], [], 0.1)[0]:
                try:
                    data = os.read(master_fd, 1024)
                    if not data:
                        break
                    
                    text = data.decode('utf-8', errors='replace')
                    
                    for char in text:
                        if char == '\r':
                            # Carriage return - update display in place
                            if line_buffer.strip() and verbose:
                                clean = ansi_escape.sub('', line_buffer)
                                clear_output(wait=True)
                                print(clean)
                            line_buffer = ""
                        elif char == '\n':
                            # Newline - print and continue
                            if line_buffer.strip() and verbose:
                                clean = ansi_escape.sub('', line_buffer)
                                print(clean)
                            line_buffer = ""
                        else:
                            line_buffer += char
                            
                except OSError:
                    break
            
            # Check if process finished
            if process.poll() is not None:
                # Drain any remaining output
                try:
                    while select.select([master_fd], [], [], 0.1)[0]:
                        data = os.read(master_fd, 1024)
                        if not data:
                            break
                        if verbose:
                            text = data.decode('utf-8', errors='replace')
                            clean = ansi_escape.sub('', text)
                            print(clean, end='')
                except OSError:
                    pass
                break
                
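    except KeyboardInterrupt:
        process.terminate()
        print("\nInterrupted by user")
    finally:
        os.close(master_fd)
    
    # wait() reaps the child and guarantees a real exit code; process.returncode
    # can still be None here if the read loop exited before poll() ran.
    return process.wait()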
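
One caveat with reading fixed-size chunks via `os.read(master_fd, 1024)`: a multi-byte UTF-8 character can be split across two reads, and decoding each chunk separately with `errors='replace'` then yields replacement characters. A possible refinement, sketched here under the assumption that the output is UTF-8, is to use an incremental decoder from the standard-library `codecs` module. The helper name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks):
    """Decode a sequence of byte chunks as UTF-8, tolerating characters
    that are split across chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes any incomplete trailing bytes as a replacement character
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# A block character (3 bytes in UTF-8) split across two reads
chunks = [b"50% \xe2\x96", b"\x88 done"]
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)
robust = decode_chunks(chunks)
```

In the read loop, the same idea would mean creating the decoder once before the loop and calling `decoder.decode(data)` in place of `data.decode(...)`.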

## Test Upload with Progress

In [None]:
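import os

# Leave one core free for the notebook kernel, but never drop below one worker.
if hasattr(os, "sched_getaffinity"):
    n_cpus = len(os.sched_getaffinity(0))
else:
    n_cpus = os.cpu_count() or 1  # cpu_count() may return None
n_workers = max(1, n_cpus - 1)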

In [None]:
# Delete existing files to force actual upload
!s5cmd rm "{TEST_BUCKET}test-upload/*" 2>/dev/null || echo "Nothing to delete"

local_path = str(LOCAL_CACHE) + "/*"
remote_path = TEST_BUCKET + "test-upload/"

cmd = ["s5cmd", "cp", "--show-progress", "--if-size-differ", "--if-source-newer", "--concurrency", str(n_workers),
       local_path, remote_path]

print(f"Uploading: {local_path}")
print(f"To: {remote_path}")
print("---")
rc = s5cmd_with_pty_progress(cmd)
print(f"---\nDone! Return code: {rc}")

In [None]:
# Run again - should skip all files (already uploaded)
print("Running again (should skip unchanged files):")
print("---")
rc = s5cmd_with_pty_progress(cmd)
print(f"---\nDone! Return code: {rc}")

## Test Download with Progress

In [None]:
import tempfile

download_dir = Path(tempfile.mkdtemp()) / "downloaded"
download_dir.mkdir()

remote_source = TEST_BUCKET + "test-upload/*"

cmd = ["s5cmd", "cp", "--show-progress", "--concurrency", str(n_workers), remote_source, str(download_dir) + "/"]

print(f"Downloading: {remote_source}")
print(f"To: {download_dir}")
print("---")
rc = s5cmd_with_pty_progress(cmd)
print(f"---\nDone! Return code: {rc}")

# Verify
print("\nDownloaded files:")
for f in sorted(download_dir.iterdir()):
    print(f"  {f.name}")

## Cleanup

In [None]:
# List test files
print("Test files in S3:")
!s5cmd ls {TEST_BUCKET}

In [None]:
# Uncomment to delete all test files
# !s5cmd rm "{TEST_BUCKET}*"

## Summary

The `s5cmd_with_pty_progress()` function provides real-time progress display for s5cmd operations:

1. **Uses PTY**: Tricks s5cmd into thinking it's connected to a real terminal
2. **Handles `\r`**: Uses `clear_output(wait=True)` to update progress in place
3. **Strips ANSI codes**: Removes color codes for cleaner display
4. **Runs on Unix-like systems**: Works in Jupyter notebooks and in terminal scripts on Linux and macOS; the standard-library `pty` module is not available on Windows
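
The `\r` handling and ANSI stripping from points 2 and 3 can also be isolated into a pure function, which makes the logic testable outside Jupyter. This is a minimal sketch, not the notebook's exact code: the name `split_progress_events` and the event labels are hypothetical, and the regex here covers any CSI sequence, which is a slightly broader pattern than color codes alone.

```python
import re

# Matches CSI escape sequences such as colors (ESC[32m) and erase-line (ESC[K).
CSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")


def split_progress_events(text, carry=""):
    """Split terminal output into ("update" | "line", text) events.

    "update" marks text terminated by a carriage return (redraw in place);
    "line" marks text terminated by a newline. Returns the list of events
    and the unterminated remainder to pass as `carry` on the next call.
    """
    events = []
    buffer = carry
    for char in text:
        if char in "\r\n":
            clean = CSI_ESCAPE.sub("", buffer)
            if clean.strip():
                kind = "update" if char == "\r" else "line"
                events.append((kind, clean))
            buffer = ""
        else:
            buffer += char
    return events, buffer


events, rest = split_progress_events("10%\r\x1b[32m50%\x1b[0m\rdone\npart")
```

In a notebook, an `"update"` event would map to `clear_output(wait=True)` followed by `print`, and a `"line"` event to a plain `print`.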

**Key flags for sync-like behavior**:
- `--if-size-differ`: Only copy if file size differs
- `--if-source-newer`: Only copy if source is newer
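
Since the same flags recur in every command in this notebook, a small builder can keep them in one place. This is a sketch only: `build_cp_command` and its parameters are hypothetical names, and it uses only the flags that already appear in the cells above.

```python
def build_cp_command(source, destination, sync=True, concurrency=None):
    """Build an s5cmd cp command list with progress output enabled.

    sync:        add the two conditional flags for sync-like behavior
    concurrency: optional value passed through to --concurrency
    """
    cmd = ["s5cmd", "cp", "--show-progress"]
    if sync:
        cmd += ["--if-size-differ", "--if-source-newer"]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    cmd += [source, destination]
    return cmd


upload_cmd = build_cp_command("/tmp/cache/*", "s3://example-bucket/prefix/", concurrency=4)
```

The resulting list can be passed directly to the progress function defined earlier in this notebook.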

**Usage in slipstream**:
```python
from slipstream.s3_sync import run_s5cmd_with_progress

cmd = ["s5cmd", "cp", "--show-progress", "--if-size-differ", "--if-source-newer",
       f"{local_path}/*", f"{remote_path}/"]
run_s5cmd_with_progress(cmd)
```