# Assignment 1 - Algorithms and Data Structures Analysis

## Overview

This notebook performs empirical analysis of algorithm performance to verify theoretical Big O complexity predictions. We test two fundamental algorithmic problems with multiple implementations to demonstrate how different approaches scale with input size.

## Summary

This notebook tests and visualizes the performance of:

**UnionFind Algorithms** (4 implementations):
- **Quick Find**: O(1) find, O(N) union - simple but inefficient for many unions
- **Quick Union**: O(N) worst-case find, O(N) worst-case union (two root lookups plus an O(1) link) - cheap while trees stay shallow
- **Weighted Quick Union**: O(log N) find, O(log N) union - balanced approach with tree size tracking
- **Weighted Quick Union with Path Compression**: O(α(N)) amortized per operation, where α is the inverse Ackermann function - effectively constant for any practical N
- Tested with varying input sizes (1K-100K elements) and proportional operations
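
To make the last variant concrete, here is a minimal sketch of weighting combined with path compression. The class name `WeightedQuickUnionPCSketch` is hypothetical and the code is illustrative only; it is not the implementation imported from `../src`, which may differ in detail.

```python
class WeightedQuickUnionPCSketch:
    """Illustrative union-find; not the implementation in ../src."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))  # parent[i] == i marks a root
        self.size = [1] * n           # size[r] is meaningful only for roots

    def find(self, p: int) -> int:
        # First pass: walk up to the root.
        root = p
        while root != self.parent[root]:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while p != root:
            self.parent[p], p = root, self.parent[p]
        return root

    def union(self, p: int, q: int) -> None:
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        # Weighting: attach the smaller tree under the larger one.
        if self.size[root_p] < self.size[root_q]:
            root_p, root_q = root_q, root_p
        self.parent[root_q] = root_p
        self.size[root_p] += self.size[root_q]

    def connected(self, p: int, q: int) -> bool:
        return self.find(p) == self.find(q)


uf = WeightedQuickUnionPCSketch(6)
uf.union(0, 1)
uf.union(2, 3)
uf.union(1, 3)
print(uf.connected(0, 2), uf.connected(0, 4))  # True False
```

Weighting keeps tree height at most logarithmic in N, and compression flattens every path it touches, which is where the near-constant amortized cost comes from.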

**3Sum Algorithms** (3 implementations):
- **Brute Force**: O(N³) - checks all possible triplets
- **Optimized Two Pointers**: O(N²) - sorts array and uses two-pointer technique
- **Hash Set**: O(N²) expected - uses a hash table for average constant-time lookups
- Tested with different array sizes (80-8K elements)
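
The two-pointer idea can be sketched as follows. This assumes the classic formulation of 3Sum (unique value triplets that sum to zero); the function name `three_sum_two_pointers_sketch` is hypothetical, and the version in `../src/threesum` may handle targets, indices, or duplicates differently.

```python
def three_sum_two_pointers_sketch(nums: list[int]) -> list[tuple[int, int, int]]:
    """Return unique value triplets summing to zero (illustrative only)."""
    values = sorted(nums)  # O(N log N), dominated by the O(N^2) scan below
    triplets = []
    for i in range(len(values) - 2):
        if i > 0 and values[i] == values[i - 1]:
            continue  # same anchor value as the previous iteration
        lo, hi = i + 1, len(values) - 1
        while lo < hi:
            total = values[i] + values[lo] + values[hi]
            if total < 0:
                lo += 1  # sum too small: move the left pointer right
            elif total > 0:
                hi -= 1  # sum too large: move the right pointer left
            else:
                triplets.append((values[i], values[lo], values[hi]))
                lo += 1
                hi -= 1
                # Skip repeated values so each triplet is reported once.
                while lo < hi and values[lo] == values[lo - 1]:
                    lo += 1
                while lo < hi and values[hi] == values[hi + 1]:
                    hi -= 1
    return triplets


print(three_sum_two_pointers_sketch([-1, 0, 1, 2, -1, -4]))
# [(-1, -1, 2), (-1, 0, 1)]
```

For each of the N anchors the two pointers together move at most N steps, which gives the O(N²) bound.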

**Analysis Methodology**:
- Performance plots comparing execution times across input sizes
- Log-log scale plots to verify Big O complexity slopes (slope = complexity exponent)
- Statistical timing measurements using %timeit for accuracy
- Complexity verification through slope analysis
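
The slope argument rests on the fact that if time ≈ a·N^b, then log(time) = log(a) + b·log(N), so the points fall on a straight line with slope b in a log-log plot. A small sketch of that check on synthetic timings (the helper name `loglog_slope` and the numbers are made up for illustration, and the code uses only the standard library):

```python
import math


def loglog_slope(sizes, times) -> float:
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic timings that follow exact power laws (not measured data).
sizes = [500, 1000, 2000, 4000, 8000]
quadratic_times = [3e-8 * n**2 for n in sizes]
cubic_times = [5e-9 * n**3 for n in sizes]

print(f"quadratic data: slope = {loglog_slope(sizes, quadratic_times):.2f}")
print(f"cubic data:     slope = {loglog_slope(sizes, cubic_times):.2f}")
```

Real measurements include constant overheads and lower-order terms, so measured slopes usually approach the theoretical exponent only at larger sizes.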


In [None]:
import sys

# Add src directory to path to import custom algorithm implementations
sys.path.append("../src")

import random

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.optimize import curve_fit

# Import custom algorithm implementations
from threesum import (
    generate_test_data,
    three_sum_brute_force,
    three_sum_optimized,
    three_sum_optimized_with_hash,
)
from unionfind import (
    QuickFind,
    QuickUnion,
    WeightedQuickUnion,
    WeightedQuickUnionPathCompression,
)

# Configure plotting style for pretty-looking graphs
plt.style.use("seaborn-v0_8")
sns.set_palette("husl")

# Seed the random number generators so results are reproducible across runs
random.seed(42)
np.random.seed(42)

print("Libraries imported successfully!")
print(f"Python version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")

In [None]:
# Helper functions for performance testing and data generation


def setup_unionfind_test(uf_class, n: int, operations: list[tuple[int, int]]):
    """
    Setup UnionFind operations for %timeit testing.

    This function creates a closure over the operations to be performed.
    %timeit calls the closure many times, so each call builds a fresh
    UnionFind instance; reusing a single instance would make every call
    after the first run on an already-merged structure. The measured
    time therefore includes constructing the instance.
    """

    def run_operations():
        uf = uf_class(n)  # fresh structure for every timed call
        for p, q in operations:
            uf.union(p, q)

    return run_operations


def generate_unionfind_operations(n: int, num_operations: int) -> list[tuple[int, int]]:
    """
    Generate random union operations for testing.

    Creates a list of random (p, q) pairs where both p and q are
    valid indices in the range [0, n-1]. The number of operations
    is typically proportional to n to ensure meaningful performance
    comparisons across different input sizes.
    """
    operations = []
    for _ in range(num_operations):
        p = random.randint(0, n - 1)
        q = random.randint(0, n - 1)
        operations.append((p, q))
    return operations


def setup_threesum_test(func, nums: list[int]):
    """
    Setup 3Sum function for %timeit testing.

    Creates a closure that contains the test data and function
    to be tested, allowing %timeit to measure just the algorithm
    execution time without data generation overhead.

    Uses value-based deduplication to avoid inflated result counts.
    """

    def run_threesum():
        return func(nums, return_values=True)

    return run_threesum


print("Performance measurement functions defined!")
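
Because `%timeit` calls the timed function repeatedly, any helper that mutates its input needs fresh state for each call. Outside IPython, one way to keep setup out of the measurement is the standard-library `timeit.Timer`, rebuilding the state before each single-shot timing. The sketch below is an assumption-laden illustration: `time_with_fresh_state` and `make_list` are hypothetical names, and list sorting stands in for a mutating workload.

```python
import timeit


def time_with_fresh_state(make_state, run, repeat: int = 5) -> float:
    """Best wall-clock time of run(state), with a new state per repetition."""
    timings = []
    for _ in range(repeat):
        state = make_state()  # setup happens here, outside the timed call
        timer = timeit.Timer(lambda: run(state))
        timings.append(timer.timeit(number=1))  # time exactly one execution
    return min(timings)


def make_list() -> list[int]:
    # Hypothetical stand-in workload: a reversed list that sort() mutates.
    return list(range(10_000, 0, -1))


best = time_with_fresh_state(make_list, lambda xs: xs.sort())
print(f"best of 5: {best:.6f}s")
```

Single-shot timings are noisier than `%timeit`'s averaged loops, so this trade-off suits workloads that run long enough for timer resolution not to matter.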

In [None]:
# =============================================================================
# UnionFind Performance Analysis
# =============================================================================
# This section tests the performance of four different UnionFind implementations
# across varying input sizes to verify their theoretical time complexities.

print("=== UnionFind Performance Analysis ===")

# Test parameters: Input sizes from 1K to 100K elements
# These sizes are chosen to show clear performance differences between algorithms
# and allow for meaningful slope analysis in log-log plots
n_values = [1000, 5000, 10000, 50000, 100000]

# UnionFind algorithms to test - ordered from least to most efficient
uf_algorithms = [
    ("Quick Find", QuickFind),  # O(1) find, O(N) union
    ("Quick Union", QuickUnion),  # O(N) find, O(N) union (worst case)
    ("Weighted Quick Union", WeightedQuickUnion),  # O(log N) find, O(log N) union
    (
        "Weighted Quick Union with Path Compression",
        WeightedQuickUnionPathCompression,
    ),  # O(α(N)) amortized
]

# Store performance results for analysis and visualization
uf_results = []

# Test each algorithm with each input size
for n in n_values:
    # Scale operations proportionally to N
    # 0.9 ratio ensures sufficient operations while avoiding complete connectivity
    # This creates a realistic workload where most elements get connected
    num_operations = int(0.9 * n)
    print(f"\nTesting with N = {n}, Operations = {num_operations}")
    operations = generate_unionfind_operations(n, num_operations)

    # Test each UnionFind implementation with the same set of operations
    for name, uf_class in uf_algorithms:
        test_func = setup_unionfind_test(uf_class, n, operations)

        # Use %timeit for accurate timing measurements
        # -q flag suppresses output, -o flag returns timing object
        result = %timeit -q -o test_func() # pyright: ignore

        # Store results for later analysis
        uf_results.append(
            {
                "Algorithm": name,
                "N": n,
                "Operations": num_operations,
                "Time (s)": result.best,  # Best time from multiple runs
                "Average (s)": result.average,  # Average time from multiple runs
            }
        )
        print(f"  {name}: {result.best:.6f}s (best), {result.average:.6f}s (avg)")

print(f"\nUnionFind analysis completed! {len(uf_results)} measurements taken.")

In [None]:
# =============================================================================
# 3Sum Performance Analysis
# =============================================================================
# This section tests three different approaches to the 3Sum problem:
# 1. Brute Force: O(N³) - checks all possible triplets
# 2. Optimized Two Pointers: O(N²) - sorts array and uses two-pointer technique
# 3. Hash Set: O(N²) - uses hash table for constant-time lookups
#
# NOTE: All algorithms use value-based deduplication (return_values=True)
# to avoid inflated result counts from duplicate values in random test data.

print("=== 3Sum Performance Analysis ===")

# Test parameters for 3Sum algorithms
# Different array sizes are used for different algorithms based on their complexity:
# - Brute force: smaller sizes (80-400) due to O(N³) complexity
# - Optimized algorithms: larger sizes (500-8000) due to O(N²) complexity
array_sizes_brute = [80, 120, 200, 300, 400]  # Smaller sizes for O(N³) algorithm
array_sizes_optimized = [
    500,
    1000,
    2000,
    5000,
    8000,
]  # Larger sizes for O(N²) algorithms

# 3Sum algorithms to test - ordered by theoretical efficiency
threesum_algorithms = [
    ("Brute Force", three_sum_brute_force),  # O(N³) - checks all triplets
    ("Optimized Two Pointers", three_sum_optimized),  # O(N²) - two-pointer technique
    ("Hash Set", three_sum_optimized_with_hash),  # O(N²) - hash table approach
]

# Store performance results for analysis and visualization
threesum_results = []

# Test brute force algorithm with smaller array sizes
# Due to O(N³) complexity, we use smaller sizes to keep execution time reasonable
print("\n--- Testing Brute Force with smaller sizes ---")
for size in array_sizes_brute:
    print(f"\nTesting Brute Force with array size = {size}")

    # Generate random test data for this array size
    test_data = generate_test_data(size)

    name, func = threesum_algorithms[0]  # Brute Force algorithm
    test_func = setup_threesum_test(func, test_data)

    # Run the algorithm once, untimed, to record the number of solutions found
    sample_result = func(test_data, return_values=True)
    result = %timeit -q -o test_func()  # pyright: ignore

    # Store results including number of solutions found
    threesum_results.append(
        {
            "Algorithm": name,
            "Array Size": size,
            "Time (s)": result.best,
            "Average (s)": result.average,
            "Solutions Found": len(sample_result),
        }
    )
    print(
        f"  {name}: {result.best:.6f}s (best), "
        f"{result.average:.6f}s (avg), {len(sample_result)} unique value triplets"
    )

# Test optimized algorithms with larger array sizes
# With O(N²) complexity, larger sizes can be tested while keeping execution time reasonable
print("\n--- Testing Optimized algorithms with larger sizes ---")
for size in array_sizes_optimized:
    print(f"\nTesting with array size = {size}")

    # Generate random test data for this array size
    test_data = generate_test_data(size)

    # Test both optimized algorithms (skip brute force)
    for name, func in threesum_algorithms[1:]:  # Skip brute force
        test_func = setup_threesum_test(func, test_data)

        # Run the algorithm once, untimed, to record the number of solutions found
        sample_result = func(test_data, return_values=True)
        result = %timeit -q -o test_func()  # pyright: ignore

        # Store results including number of solutions found
        threesum_results.append(
            {
                "Algorithm": name,
                "Array Size": size,
                "Time (s)": result.best,
                "Average (s)": result.average,
                "Solutions Found": len(sample_result),
            }
        )
        print(
            f"  {name}: {result.best:.6f}s (best), "
            f"{result.average:.6f}s (avg), {len(sample_result)} unique value triplets"
        )

print(f"\n3Sum analysis completed! {len(threesum_results)} measurements taken.")

In [None]:
# =============================================================================
# Power-Law Curve Fitting - Simple Constants Extraction
# =============================================================================

# Convert results to DataFrames for easier analysis
threesum_df = pd.DataFrame(threesum_results)


def power_law(x, a, b):
    return a * (x**b)


def power_law_with_offset(x, a, b, c):
    """Power law with constant offset: c + a * x^b"""
    return c + a * (x**b)


print("=== Power-Law Constants: time ≈ a·N^b ===\n")

# Store fitting results for overlay plotting
fitted_params = {}

MIN_FITTING_POINTS = 2  # Minimum data points required for curve fitting
MIN_LARGE_POINTS = 3    # Minimum points for large-size fitting strategy

for algorithm in threesum_df["Algorithm"].unique():
    data = threesum_df[threesum_df["Algorithm"] == algorithm].sort_values(
        "Array Size"
    )
    print(f"\nFitting {algorithm}:")
    print(f"  Data points: {len(data)}")
    print(f"  Size range: {data['Array Size'].min()}-{data['Array Size'].max()}")
    print(
        f"  Time range: {data['Time (s)'].min():.6f}-"
        f"{data['Time (s)'].max():.6f}"
    )

    # Show actual data points to understand the trend
    for _, row in data.iterrows():
        print(f"    Size {row['Array Size']:4d}: {row['Time (s)']:.6f}s")

    if len(data) >= MIN_FITTING_POINTS:
        # Try simple power law first
        try:
            (a, b), _ = curve_fit(power_law, data["Array Size"], data["Time (s)"])
            simple_fit = (a, b)
            print(f"  Simple fit: a={a:.2e}, b={b:.2f}")
        except Exception as e:
            print(f"  Simple fitting failed: {e}")
            simple_fit = None

        # For Two Pointers, try fitting only larger sizes where scaling is clear
        if "Two Pointers" in algorithm and len(data) >= MIN_LARGE_POINTS:
            try:
                # Use only the largest 3 data points where true scaling emerges
                large_data = data.tail(MIN_LARGE_POINTS)
                print(
                    f"  Trying fit on large sizes only: "
                    f"{large_data['Array Size'].tolist()}"
                )
                (a_large, b_large), _ = curve_fit(
                    power_law, large_data["Array Size"], large_data["Time (s)"]
                )
                print(f"  Large-size fit: a={a_large:.2e}, b={b_large:.2f}")

                # Prefer the large-size fit when no simple fit exists or when
                # its exponent is higher than the simple fit's exponent
                if simple_fit is None or b_large > simple_fit[1]:
                    fitted_params[algorithm] = (a_large, b_large, 'large_only')
                    print("  → Using large-size fit (better scaling)")
                elif simple_fit:
                    fitted_params[algorithm] = (a, b, 'simple')
                    print("  → Using simple fit")
            except Exception as e:
                print(f"  Large-size fitting failed: {e}")
                if simple_fit:
                    fitted_params[algorithm] = (a, b, 'simple')
        elif simple_fit:
            fitted_params[algorithm] = (a, b, 'simple')

print(f"\nSuccessfully fitted {len(fitted_params)} algorithms")
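
One way to see why fitting only the larger sizes can give a cleaner exponent is to look at the exponent implied by each consecutive pair of measurements, b ≈ log(T2/T1) / log(N2/N1). The sketch below uses synthetic timings with a fixed overhead plus a quadratic term; the helper name `local_exponents` and all numbers are made up for illustration and are not results from this notebook.

```python
import math


def local_exponents(sizes, times):
    """Exponent implied by each consecutive pair: log(T2/T1) / log(N2/N1)."""
    pairs = zip(zip(sizes, times), zip(sizes[1:], times[1:]))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in pairs
    ]


# Synthetic timings: 1 ms of constant overhead plus a quadratic term.
sizes = [500, 1000, 2000, 5000, 8000]
times = [1e-3 + 2e-8 * n**2 for n in sizes]

for n, b in zip(sizes[1:], local_exponents(sizes, times)):
    print(f"up to N={n}: local exponent is about {b:.2f}")
```

With these numbers the local exponent rises from roughly 1.8 toward 2.0 as N grows, since the constant overhead matters less at larger sizes.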

In [None]:
# =============================================================================
# Performance Visualization with Fitted Curve Overlays
# =============================================================================

print("=== Creating Performance Visualizations ===")

# Convert UnionFind results to DataFrame for easier plotting and analysis
uf_df = pd.DataFrame(uf_results)

# Slope verification for complexity analysis
print("\n=== Slope Analysis for Complexity Verification ===")

# Thresholds shared by the slope analysis and the fitted-curve overlays
MIN_DATA_POINTS = 2  # Minimum data points required for slope analysis
LARGE_SIZE_TAIL = 3  # Number of largest points to use for large-size fits
FIT_TYPE_INDEX = 2  # Index for fit type in fitted_params tuple

# Plot styling constants
LINE_WIDTH = 2  # Line width for data plots
FIT_LINE_WIDTH = 1.5  # Line width for fitted curves
FIT_ALPHA = 0.8  # Alpha transparency for fitted curves
MARKER_SIZE_SMALL = 5  # Small marker size
MARKER_SIZE_MEDIUM = 6  # Medium marker size
MARKER_SIZE_LARGE = 7  # Large marker size
GRID_ALPHA = 0.3  # Grid transparency
FIT_POINTS = 100  # Number of points for full-range fits
PARTIAL_FIT_POINTS = 50  # Number of points for partial-range fits

# UnionFind slope analysis (endpoint slope: uses only the smallest and largest N)
print("\nUnionFind Complexity Verification:")
for algorithm in uf_df["Algorithm"].unique():
    data = uf_df[uf_df["Algorithm"] == algorithm].sort_values("N")
    if len(data) >= MIN_DATA_POINTS:
        log_n = np.log10(data["N"].values)
        log_time = np.log10(data["Time (s)"].values)
        slope = (log_time[-1] - log_time[0]) / (log_n[-1] - log_n[0])
        print(f"  {algorithm}: slope = {slope:.2f}")

# 3Sum slope analysis (endpoint slope: uses only the smallest and largest array size)
print("\n3Sum Complexity Verification:")
for algorithm in threesum_df["Algorithm"].unique():
    data = threesum_df[threesum_df["Algorithm"] == algorithm].sort_values(
        "Array Size"
    )
    if len(data) >= MIN_DATA_POINTS:
        log_size = np.log10(data["Array Size"].values)
        log_time = np.log10(data["Time (s)"].values)
        slope = (log_time[-1] - log_time[0]) / (log_size[-1] - log_size[0])
        print(f"  {algorithm}: slope = {slope:.2f}")

# Create comprehensive performance visualization with 4 subplots
plt.figure(figsize=(15, 10))

# Define distinctive markers and line styles for each algorithm
uf_styles = {
    "Quick Find": {
        "marker": "o",
        "linestyle": "-",
        "markersize": MARKER_SIZE_MEDIUM
    },
    "Quick Union": {
        "marker": "s",
        "linestyle": "--",
        "markersize": MARKER_SIZE_MEDIUM
    },
    "Weighted Quick Union": {
        "marker": "^",
        "linestyle": "-.",
        "markersize": MARKER_SIZE_LARGE
    },
    "Weighted Quick Union with Path Compression": {
        "marker": "D",
        "linestyle": ":",
        "markersize": MARKER_SIZE_SMALL
    }
}

threesum_styles = {
    "Brute Force": {
        "marker": "o",
        "linestyle": "-",
        "markersize": MARKER_SIZE_MEDIUM
    },
    "Optimized Two Pointers": {
        "marker": "s",
        "linestyle": "--",
        "markersize": MARKER_SIZE_MEDIUM
    },
    "Hash Set": {
        "marker": "^",
        "linestyle": "-.",
        "markersize": MARKER_SIZE_LARGE
    }
}

# Subplot 1: UnionFind performance on linear scale
plt.subplot(2, 2, 1)
for algorithm in uf_df["Algorithm"].unique():
    data = uf_df[uf_df["Algorithm"] == algorithm]
    style = uf_styles[algorithm]
    plt.plot(
        data["N"], data["Time (s)"], label=algorithm,
        marker=style["marker"], linestyle=style["linestyle"],
        markersize=style["markersize"], linewidth=LINE_WIDTH
    )
plt.xlabel("Number of Elements (N)")
plt.ylabel("Time (seconds)")
plt.title("UnionFind Performance Comparison")
plt.legend(fontsize=8)
plt.grid(True, alpha=GRID_ALPHA)

# Subplot 2: UnionFind performance on log-log scale
plt.subplot(2, 2, 2)
for algorithm in uf_df["Algorithm"].unique():
    data = uf_df[uf_df["Algorithm"] == algorithm]
    style = uf_styles[algorithm]
    plt.loglog(
        data["N"], data["Time (s)"], label=algorithm,
        marker=style["marker"], linestyle=style["linestyle"],
        markersize=style["markersize"], linewidth=LINE_WIDTH
    )
plt.xlabel("Number of Elements (N) - Log Scale")
plt.ylabel("Time (seconds) - Log Scale")
plt.title("UnionFind Performance (Log-Log Scale)")
plt.legend(fontsize=8)
plt.grid(True, alpha=GRID_ALPHA)

# Subplot 3: 3Sum performance with fitted curves
plt.subplot(2, 2, 3)
for algorithm in threesum_df["Algorithm"].unique():
    data = threesum_df[threesum_df["Algorithm"] == algorithm]
    style = threesum_styles[algorithm]
    plt.plot(
        data["Array Size"], data["Time (s)"], label=f"{algorithm} (data)",
        marker=style["marker"], linestyle=style["linestyle"],
        markersize=style["markersize"], linewidth=LINE_WIDTH
    )

    # Add fitted curve overlay with appropriate range
    if algorithm in fitted_params:
        a, b = fitted_params[algorithm][:FIT_TYPE_INDEX]
        fit_type = (
            fitted_params[algorithm][FIT_TYPE_INDEX]
            if len(fitted_params[algorithm]) > FIT_TYPE_INDEX
            else 'simple'
        )

        if fit_type == 'large_only' and "Two Pointers" in algorithm:
            # For Two Pointers large-only fit, show curve only for larger sizes
            large_data = data.tail(LARGE_SIZE_TAIL)
            x_fit = np.linspace(
                large_data["Array Size"].min(),
                data["Array Size"].max(),
                PARTIAL_FIT_POINTS
            )
            fit_label = f"{algorithm} (fit: N^{b:.1f}, large sizes)"
        else:
            # Standard full-range fitting
            x_fit = np.linspace(
                data["Array Size"].min(),
                data["Array Size"].max(),
                FIT_POINTS
            )
            fit_label = f"{algorithm} (fit: N^{b:.1f})"

        y_fit = power_law(x_fit, a, b)
        plt.plot(
            x_fit, y_fit, "--", alpha=FIT_ALPHA,
            linewidth=FIT_LINE_WIDTH, label=fit_label
        )

plt.xlabel("Array Size")
plt.ylabel("Time (seconds)")
plt.title("3Sum Performance with Fitted Curves")
plt.legend(fontsize=7, loc='upper left')
plt.grid(True, alpha=GRID_ALPHA)

# Subplot 4: 3Sum performance on log-log scale with fitted curves
plt.subplot(2, 2, 4)
for algorithm in threesum_df["Algorithm"].unique():
    data = threesum_df[threesum_df["Algorithm"] == algorithm]
    style = threesum_styles[algorithm]
    plt.loglog(
        data["Array Size"], data["Time (s)"], label=f"{algorithm} (data)",
        marker=style["marker"], linestyle=style["linestyle"],
        markersize=style["markersize"], linewidth=LINE_WIDTH
    )

    # Add fitted curve overlay with appropriate range
    if algorithm in fitted_params:
        a, b = fitted_params[algorithm][:FIT_TYPE_INDEX]
        fit_type = (
            fitted_params[algorithm][FIT_TYPE_INDEX]
            if len(fitted_params[algorithm]) > FIT_TYPE_INDEX
            else 'simple'
        )

        if fit_type == 'large_only' and "Two Pointers" in algorithm:
            # For Two Pointers large-only fit, show curve only for larger sizes
            large_data = data.tail(LARGE_SIZE_TAIL)
            x_fit = np.linspace(
                large_data["Array Size"].min(),
                data["Array Size"].max(),
                PARTIAL_FIT_POINTS
            )
            fit_label = f"{algorithm} (fit: N^{b:.1f}, large sizes)"
        else:
            # Standard full-range fitting
            x_fit = np.linspace(
                data["Array Size"].min(),
                data["Array Size"].max(),
                FIT_POINTS
            )
            fit_label = f"{algorithm} (fit: N^{b:.1f})"

        y_fit = power_law(x_fit, a, b)
        plt.loglog(
            x_fit, y_fit, "--", alpha=FIT_ALPHA,
            linewidth=FIT_LINE_WIDTH, label=fit_label
        )

plt.xlabel("Array Size - Log Scale")
plt.ylabel("Time (seconds) - Log Scale")
plt.title("3Sum Performance (Log-Log) with Fitted Curves")
plt.legend(fontsize=7, loc='upper left')
plt.grid(True, alpha=GRID_ALPHA)

plt.tight_layout()
plt.show()

print("Enhanced visualizations with improved curve fitting created!")