# MemoryLane: Line-by-Line Memory Profiler

MemoryLane is a powerful Python profiler that shows memory usage **after each executed source line**. It helps you identify memory bottlenecks, understand memory allocation patterns, and optimize your code's memory efficiency.

## Key Features

- **Line-by-line tracking**: See memory usage after every executed line
- **Delta reporting**: Track memory changes between lines
- **Peak memory tracking**: Monitor high-water memory usage
- **Nested function support**: Profile through function calls
- **PyTorch integration**: Built-in support for CUDA and CPU memory tracking
- **VS Code integration**: Ctrl+click line numbers to jump to source code

Let's dive into some examples!


In [1]:
import torch
import torch.nn as nn
from memorylane import profile

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Clear any existing CUDA memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()


Using device: cuda


## Basic Demo

Let's start with a simple example that shows how MemoryLane works. The `@profile` decorator traces every executed line and reports memory usage after it.

### Understanding the Output

Each line shows:
- **Mem**: Current total allocated memory
- **ΔMem**: Change in memory since the previous traced line
- **Peak**: Peak memory usage seen so far
- **ΔPeak**: Change in peak memory
- **Line number**: Clickable line reference (Ctrl+click in VS Code to jump to source)
- **Source code**: The actual line that was executed

⚠️ **Important**: You only see lines that actually execute, in the order they execute. This means:
- Conditional code that doesn't run won't appear
- Loop iterations will be "unrolled" showing each iteration
- Multi-line expressions are reported piece by piece: the lines holding sub-expressions appear before the line that performs the assignment
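
To see why only executed lines appear, and why loops look unrolled, here is a minimal sketch built on Python's standard `sys.settrace` hook. It is not MemoryLane's implementation (whether MemoryLane relies on this hook is an assumption), and `trace_executed_lines` and `sample` are hypothetical names used only for illustration.

```python
import sys

def trace_executed_lines(func, *args):
    """Run func and return executed line offsets (relative to its `def` line), in order."""
    code = func.__code__
    executed = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            executed.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return executed

def sample(flag):
    total = 0              # offset 1
    for i in range(3):     # offset 2
        total += i         # offset 3
    if flag:               # offset 4
        total = -1         # offset 5, reached only when flag is true
    return total           # offset 6

offsets = trace_executed_lines(sample, False)
print(offsets)
```

With `flag=False`, offset 5 never shows up, while offset 3 (the loop body) shows up once per iteration.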


In [2]:
@profile
def basic_demo():
    """Basic example showing memory allocation patterns."""
    
    # Create a large tensor - this will allocate significant memory
    N = 5000  # 5000x5000 = 25M elements * 4 bytes = ~100MB
    x = torch.randn(N, N, device=device)
    
    # Matrix multiplication temporarily uses more memory
    y = x @ x  # Peak memory will spike during computation
    
    # In-place operation should not increase memory
    y.relu_()
    
    # Memory reduction through aggregation
    result = y.sum()
    
    return result

basic_demo()


 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'basic_demo' (file: /tmp/ipykernel_1960718/1200114249.py:1):
 Mem:      0 MB | ΔMem:      0 MB | Peak:      0 MB | ΔPeak:      0 MB | 1200114249.py:6    |     N = 5000  # 5000x5000 = 25M elements * 4 bytes = ~100MB
 Mem:     96 MB | ΔMem:     96 MB | Peak:     96 MB | ΔPeak:     96 MB | 1200114249.py:7    |     x = torch.randn(N, N, device=device)
 Mem:    200 MB | ΔMem:    104 MB | Peak:    200 MB …

tensor(7.0548e+08, device='cuda:0')

Note that the `Peak` column is *not* cleared between runs; it reflects the actual cached memory amount. In effect, `Peak` is the *process-wide running maximum* of memory usage.

This is useful because it captures peaks that occur inside inner-scope expressions. It can also be confusing, because the value is not limited to the scope of the decorated function. If we re-run `basic_demo`, the peak is *not* reset:

In [3]:
basic_demo()

 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'basic_demo' (file: /tmp/ipykernel_1960718/1200114249.py:1):
 Mem:      8 MB | ΔMem:      0 MB | Peak:    200 MB | ΔPeak:      0 MB | 1200114249.py:6    |     N = 5000  # 5000x5000 = 25M elements * 4 bytes = ~100MB
 Mem:    104 MB | ΔMem:     96 MB | Peak:    200 MB | ΔPeak:      0 MB | 1200114249.py:7    |     x = torch.randn(N, N, device=device)
 Mem:    200 MB | ΔMem:     96 MB | Peak:    200 MB | Δ…

tensor(7.0527e+08, device='cuda:0')
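
The carry-over of `Peak` between the two runs can be mimicked with plain numbers. The sketch below is only an illustration of a running maximum, not MemoryLane's code: `report` is a hypothetical helper, the readings are the `Mem` values from the first three rows of each trace above, and it only sees one reading per line, so it cannot capture spikes inside a line.

```python
def report(readings_mb, baseline_mb, peak_mb):
    """Return (rows, peak); each row is (mem, d_mem, peak, d_peak) in MB."""
    rows = []
    previous = baseline_mb
    for mem in readings_mb:
        new_peak = max(peak_mb, mem)  # running maximum, never lowered
        rows.append((mem, mem - previous, new_peak, new_peak - peak_mb))
        previous, peak_mb = mem, new_peak
    return rows, peak_mb

# First run: nothing has been allocated yet, so the peak starts at 0 MB.
first_rows, peak = report([0, 96, 200], baseline_mb=0, peak_mb=0)

# Second run: the peak from the first run is passed in rather than reset.
second_rows, peak = report([8, 104, 200], baseline_mb=8, peak_mb=peak)

for row in first_rows + second_rows:
    print(row)
```

In the second run every `d_peak` entry is 0 because no reading exceeds the 200 MB maximum left behind by the first run.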

## Memory Optimization Example

Let's look at a function with suboptimal memory usage and then optimize it. This shows how MemoryLane helps identify memory bottlenecks.
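
Before running anything, a rough estimate shows what is at stake. The sketch below assumes float32 tensors of shape `N x N` with `N = 5000`, and reuses two figures read off the basic demo trace above: 96 MB reported per tensor and an 8 MB baseline. Whether those figures carry over unchanged to other functions is an assumption, and `held_mb` is a hypothetical helper.

```python
N = 5000
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

# Raw size of one N x N float32 tensor, about 95.4 MiB.
tensor_mib = N * N * BYTES_PER_FLOAT32 / MIB

# Figures read off the basic demo trace (assumed to apply here as well).
REPORTED_TENSOR_MB = 96  # reported size of one N x N tensor
BASELINE_MB = 8          # memory already held before the function runs

def held_mb(live_tensors):
    """Estimated memory held while `live_tensors` full-size tensors stay referenced."""
    return BASELINE_MB + live_tensors * REPORTED_TENSOR_MB

all_named = held_mb(6)   # an input plus five named intermediates
only_input = held_mb(1)  # only the input is still referenced

print(f"{tensor_mib:.1f} MiB per tensor, {all_named} MB vs. {only_input} MB held")
```

This estimate covers only what stays referenced at the end. A chained expression can still allocate temporaries while it is being evaluated.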


In [4]:

# Seeded randomness for reproducibility
N = 5000
_seed: int = 42  # Chosen seed for reproducibility

@profile
def inefficient_computation():
    """Example with poor memory efficiency - creating unnecessary intermediate tensors.
    """
    # Set the random seed for deterministic results
    torch.manual_seed(_seed)
    x = torch.randn(N, N, device=device)

    # BAD: Creating many intermediate tensors that all stay in memory
    temp1 = x * 2
    temp2 = temp1 + 1
    temp3 = temp2.pow(2)
    temp4 = temp3 - x
    temp5 = temp4.relu()
    result = temp5.mean()

    return result

@profile
def efficient_computation():
    """Optimized version using in-place operations and chaining.
    """
    # Set the random seed for deterministic results
    torch.manual_seed(_seed)
    x = torch.randn(N, N, device=device)

    # BETTER: Chain operations and use in-place when possible
    # This reduces intermediate tensor allocations
    result = ((x * 2 + 1).pow_(2) - x).relu_().mean()

    return result

result1 = inefficient_computation()
result2 = efficient_computation()

print("\nComparison:")
print(f"Both functions computed the same result: {torch.all(result1 == result2)}")
print("Notice how the optimized version ends up holding ~1/6 the memory (104 MB vs. 584 MB)!")


 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'inefficient_computation' (file: /tmp/ipykernel_1960718/260837587.py:5):
 Mem:      8 MB | ΔMem:      0 MB | Peak:    200 MB | ΔPeak:      0 MB | 260837587.py:10   |     torch.manual_seed(_seed)
 Mem:    104 MB | ΔMem:     96 MB | Peak:    200 MB | ΔPeak:      0 MB | 260837587.py:11   |     x = torch.randn(N, N, device=device)
 Mem:    200 MB | ΔMem:     96 MB | Peak:    200 MB | ΔPeak:      0 MB | 260837…

## Nested Function Tracing

MemoryLane can trace through nested function calls. When a decorated function calls another decorated function, both are traced with proper indentation to show the call hierarchy.
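
One way to picture the indentation is a decorator that keeps a depth counter. This is a minimal sketch of the idea only; it says nothing about how MemoryLane is actually implemented, and `traced`, `child`, and `parent` are hypothetical names.

```python
import functools

_depth = 0  # current nesting level of traced calls
log = []    # collected trace lines

def traced(func):
    """Record each call, indented by how deeply it is nested in other traced calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _depth
        log.append("    " * _depth + f"-> {func.__name__}")
        _depth += 1
        try:
            return func(*args, **kwargs)
        finally:
            _depth -= 1  # restore the depth even if func raises
    return wrapper

@traced
def child(values):
    return sum(v * v for v in values)

@traced
def parent():
    data = list(range(4))
    return child(data[:2]) + child(data[2:])

total = parent()
print("\n".join(log))
```

Each call to `child` is logged one level deeper than `parent`, which mirrors the call hierarchy.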


In [5]:
@profile
def child_function(x):
    y = torch.zeros_like(x)
    result = x.pow(2) + y
    return result.sum(dim=-1)

@profile  
def main_function():
    
    # Initial allocation
    data = torch.randn(2000, 2000, device=device)
    
    # Process in chunks to demonstrate nested calls
    chunk1 = data[:1000]
    chunk2 = data[1000:]
    
    # These calls will be traced with indentation
    result1 = child_function(chunk1)
    result2 = child_function(chunk2)
    
    # Combine results
    final_result = result1.mean() + result2.mean()
    
    return final_result

print("=== NESTED FUNCTION TRACING ===")
output = main_function()
print(f"\nFinal output: {output.item():.4f}")
print("Notice how child function calls are indented to show the call hierarchy!")


=== NESTED FUNCTION TRACING ===
 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'main_function' (file: /tmp/ipykernel_1960718/1308085490.py:7):
 Mem:     24 MB | ΔMem:     16 MB | Peak:    584 MB | ΔPeak:      0 MB | 1308085490.py:11   |     data = torch.randn(2000, 2000, device=device)
 Mem:     24 MB | ΔMem:      0 MB | Peak:    584 MB | ΔPeak:      0 MB | 1308085490.py:14   |     chunk1 = data[:1000]
 Mem:     24 MB | ΔMem:      0 MB | …

## PyTorch nn.Module Integration

MemoryLane works seamlessly with PyTorch modules. You can decorate the `forward` method to trace memory usage during neural network computations. This is particularly useful for understanding memory bottlenecks in deep learning models.
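
It helps to know roughly what to expect before reading a trace. The sketch below estimates tensor sizes for the example in this section, assuming float32 values, a batch of 256, and layer widths of 1000, 2048, and 512 (the defaults used in the cell below). `tensor_mib` is a hypothetical helper, and real allocators may round these figures.

```python
def tensor_mib(*shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 1024 ** 2

batch_size, input_size, hidden_size, output_size = 256, 1000, 2048, 512

hidden_activation = tensor_mib(batch_size, hidden_size)  # one hidden-layer output
output_activation = tensor_mib(batch_size, output_size)  # final layer output

# Weights plus biases of three fully connected layers.
parameters = (
    tensor_mib(hidden_size, input_size) + tensor_mib(hidden_size)
    + tensor_mib(hidden_size, hidden_size) + tensor_mib(hidden_size)
    + tensor_mib(output_size, hidden_size) + tensor_mib(output_size)
)

print(f"hidden activation: {hidden_activation} MiB")
print(f"output activation: {output_activation} MiB")
print(f"parameters: {parameters:.1f} MiB")
```

Each hidden activation comes to 2 MiB, which is consistent with the 2 MB steps visible in the trace for this section.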


In [6]:
# Clear memory before neural network example
if torch.cuda.is_available():
    torch.cuda.empty_cache()

class SimpleNetwork(nn.Module):
    """A simple neural network to demonstrate memory profiling."""
    
    def __init__(self, input_size=1000, hidden_size=2048, output_size=512):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.1)
        
    @profile  # Decorate the forward method
    def forward(self, x):
        """Forward pass with memory profiling."""
        
        # Layer 1: input -> hidden
        h1 = self.linear1(x)
        h1_activated = torch.relu(h1)
        
        # Layer 2: hidden -> hidden (with residual connection)
        h2 = self.linear2(h1_activated)
        h2_activated = torch.relu(h2)
        
        # Add residual connection (this creates temporary tensors)
        h2_residual = h2_activated + h1_activated
        
        # Apply dropout
        h2_dropped = self.dropout(h2_residual)
        
        # Layer 3: hidden -> output
        output = self.linear3(h2_dropped)
        
        return output

# Create and move model to device
model = SimpleNetwork().to(device)

# Create input data - larger batch to see memory effects
batch_size = 256
input_data = torch.randn(batch_size, 1000, device=device)

print("=== NEURAL NETWORK FORWARD PASS ===")
with torch.no_grad():  # Disable gradients for inference
    predictions = model(input_data)

print(f"\nOutput shape: {predictions.shape}")
print(f"Output mean: {predictions.mean().item():.4f}")
print(f"Output std: {predictions.std().item():.4f}")
print("\nTip: Drop torch.no_grad() to see the activations kept for backward; to trace backward itself, decorate the function that calls .backward().")


=== NEURAL NETWORK FORWARD PASS ===
 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'forward' (file: /tmp/ipykernel_1960718/731318301.py:15):


 Mem:     39 MB | ΔMem:      2 MB | Peak:    584 MB | ΔPeak:      0 MB | 731318301.py:20   |     h1 = self.linear1(x)
 Mem:     41 MB | ΔMem:      2 MB | Peak:    584 MB | ΔPeak:      0 MB | 731318301.py:21   |     h1_activated = torch.relu(h1)
 Mem:     43 MB | ΔMem:      2 MB | Peak:    584 MB | ΔPeak:      0 MB | 731318301.py:24   |     h2 = self.linear2(h1_activated)
 Mem:     45 MB | ΔMem:      2 MB | Peak:    584 MB | Δ…

## Advanced Features and Tips

### Profiler Configuration

MemoryLane supports several configuration options for different use cases:


In [7]:
# Different memory tracking modes
@profile(memory_type="torch_cuda")  # Track CUDA memory (default)
def cuda_example():
    return torch.randn(100, 100, device="cuda" if torch.cuda.is_available() else "cpu")

@profile(memory_type="torch_cpu")   # Track CPU memory
def cpu_example():
    return torch.randn(100, 100, device="cpu")

@profile(memory_type="python")      # Track Python memory
def python_example():
    return [i**2 for i in range(1000)]

# Threshold filtering - only show significant changes
@profile(threshold=10*1024**2, only_show_significant=True)  # Only show changes > 10MB
def filtered_example():
    small_tensor = torch.randn(10, 10, device=device)  # Won't show (too small)
    large_tensor = torch.randn(5000, 5000, device=device)  # Will show
    return large_tensor.sum()

print("=== FILTERED PROFILING (only significant changes) ===")
filtered_result = filtered_example()

print(f"\nResult: {filtered_result.item():.2f}")
print("Notice: Only the large tensor allocation was shown because we set only_show_significant=True")

=== FILTERED PROFILING (only significant changes) ===
 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'filtered_example' (file: /tmp/ipykernel_1960718/4048755409.py:15):
 Mem:    133 MB | ΔMem:     96 MB | Peak:    584 MB | ΔPeak:      0 MB | 4048755409.py:18   |     large_tensor = torch.randn(5000, 5000, device=device)  # Will show

Result: -9510.14
Notice: Only the large tensor allocation was shown because we set only_show_significant=True
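
The effect of a threshold can be sketched with plain byte counts. This illustrates the filtering idea only: the comparison rule (absolute change strictly greater than the threshold) is an assumption about MemoryLane's behaviour, and `significant_rows` is a hypothetical helper.

```python
MIB = 1024 ** 2

def significant_rows(rows, threshold_bytes):
    """Keep (label, delta_bytes) rows whose absolute change exceeds the threshold."""
    return [(label, delta) for label, delta in rows if abs(delta) > threshold_bytes]

rows = [
    ("small_tensor", 10 * 10 * 4),           # 400 bytes of float32 data
    ("large_tensor", 5000 * 5000 * 4),       # 100,000,000 bytes of float32 data
    ("del large_tensor", -5000 * 5000 * 4),  # freeing counts as a change too
]

shown = significant_rows(rows, threshold_bytes=10 * MIB)
for label, delta in shown:
    print(f"{label}: {delta / MIB:+.1f} MiB")
```

The 400-byte allocation falls far below the 10 MiB threshold and is dropped, while the large allocation and its release both pass.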


### Best Practices

1. **Start with broad profiling**: Use `@profile` without arguments to get the full picture
2. **Use thresholds for noisy code**: Set `threshold` and `only_show_significant=True` for cleaner output
3. **Profile incrementally**: Start with main functions, then dive into bottlenecks
4. **Check peak memory**: Watch for spikes in the Peak column - these indicate temporary high memory usage
5. **Use VS Code integration**: Ctrl+click line numbers to jump directly to problematic code
6. **Clear memory between runs**: Use `torch.cuda.empty_cache()` for consistent baselines

### What to Look For

- **Large ΔMem values**: Indicate significant memory allocations
- **Peak spikes**: Show temporary memory usage that might cause OOM errors
- **Memory not decreasing**: May indicate memory leaks or inefficient cleanup
- **Unexpected execution order**: Remember that loops unroll and expressions evaluate in pieces
- **Gradual memory accumulation**: Could indicate memory leaks in loops

### Common Debugging Scenarios

MemoryLane excels at debugging:
- **Out-of-memory errors**: Find which line pushes you over the limit
- **Memory leaks**: Identify where memory doesn't get freed as expected
- **Inefficient algorithms**: Compare different implementations side-by-side
- **Model optimization**: Profile different model architectures or batch sizes
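
The leak pattern can be reproduced without a GPU. The sketch below uses the standard-library `tracemalloc` module rather than MemoryLane, so the numbers come from Python's own allocator tracking; `run_loop` is a hypothetical helper, and the 1 MB buffer size is arbitrary.

```python
import tracemalloc

def run_loop(keep_references):
    """Return traced memory growth (bytes above baseline) after each of 3 iterations."""
    kept = []
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    growth = []
    for _ in range(3):
        chunk = bytearray(1_000_000)  # stand-in for a large intermediate
        if keep_references:
            kept.append(chunk)        # the reference survives the iteration
        del chunk                     # frees the buffer only if nothing else holds it
        current, _ = tracemalloc.get_traced_memory()
        growth.append(current - baseline)
    tracemalloc.stop()
    return growth

growing = run_loop(keep_references=True)
flat = run_loop(keep_references=False)
print("kept:    ", growing)
print("released:", flat)
```

When references are kept, the growth climbs by roughly one buffer per iteration. When they are released, it stays close to zero.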


In [8]:
@profile(threshold=5*1024**2, only_show_significant=True)  # Show changes > 5MB
def debug_memory_accumulation():
    """Example showing how to debug memory accumulation in loops."""
    
    data_list = []
    
    for i in range(3):  # Small loop for demo
        # Each iteration creates a large tensor
        batch_data = torch.randn(1000, 2000, device=device)
        
        # Process the data
        processed = batch_data.pow(2).mean(dim=1)
        
        # Store results - this might accumulate memory!
        data_list.append(processed)
        del batch_data  # Explicitly free the large intermediate
    
    # Combine all results
    final_result = torch.stack(data_list).sum()
    
    return final_result

print("=== DEBUGGING MEMORY ACCUMULATION ===")
result = debug_memory_accumulation()
print(f"\nFinal result: {result.item():.2f}")
print("\nTry commenting out the 'del batch_data' line to see the difference!")
print("This shows how MemoryLane helps identify memory accumulation patterns.")


=== DEBUGGING MEMORY ACCUMULATION ===
 ━━━━━━ MemoryLane: Line-by-Line Memory Profiler ━━━━━━
 Tracing 'debug_memory_accumulation' (file: /tmp/ipykernel_1960718/758106828.py:1):
 Mem:     45 MB | ΔMem:      8 MB | Peak:    584 MB | ΔPeak:      0 MB | 758106828.py:9    |         batch_data = torch.randn(1000, 2000, device=device)
 Mem:     38 MB | ΔMem:     -8 MB | Peak:    584 MB | ΔPeak:      0 MB | 758106828.py:16   |         del batch_data
 Mem:     45 MB | ΔMem:      8 MB | Peak:    584 MB | ΔPeak: …