## Code Profiling vs. Benchmarking

### 1. Profiling

* The objective of profiling is to analyze where time and resources are spent within a program.

* Examines the runtime behavior of your program, e.g.,
  - How long each function takes to execute
  - How many times a function is called
  - Memory usage of certain variables or data structures

* Output: Detailed report showing the time spent in each function, the number of calls, and possibly the call hierarchy (which functions call which).

* Main focus of courses like ICS432




### 2. Benchmarking

* The objective of benchmarking is to measure the overall performance of a program or system.

* Involves running a program under specific conditions and measuring metrics such as:
  - Execution time (runtime): The total time it takes for a program or a specific operation to complete from start to finish.
  - Throughput: The amount of work a system can perform or data it can process in a given amount of time.
  - Latency: The time it takes to initiate or complete a specific operation or request.

* Focuses on the end-to-end performance of an entire program or system.

* Output: Measurements for a set of performance metrics, such as the total time taken to complete a task, the number of operations performed per second, or the memory usage during execution.
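
The metrics listed above can all be derived from a handful of timestamps. The following is a minimal sketch, assuming a hypothetical `process_record` function that stands in for one unit of real work; the function names and the workload are illustrative only.

```python
import statistics
import time


def process_record(record):
    """Hypothetical unit of work: stands in for one request or record."""
    return sum(i * i for i in range(record))


def benchmark(records):
    """Return execution time, throughput, and mean latency for a batch."""
    latencies = []
    start = time.perf_counter()
    for record in records:
        op_start = time.perf_counter()
        process_record(record)
        latencies.append(time.perf_counter() - op_start)
    total = time.perf_counter() - start
    return {
        "execution_time_s": total,                      # end-to-end runtime
        "throughput_ops_per_s": len(records) / total,   # work per unit time
        "mean_latency_s": statistics.mean(latencies),   # time per operation
    }


results = benchmark([1_000] * 200)
for name, value in results.items():
    print(f"{name}: {value:.6f}")
```

Real benchmarks would typically repeat the run several times and control the conditions (input size, hardware, system load) so that the numbers are comparable.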



## Relevance to BDA

* Process terabytes/petabytes efficiently
  * Avoid prohibitive slowdowns
* Efficiently manage the cost of working with large datasets
  * Optimize configurations or platform choices for cost-effectiveness


## Importance of Profiling

* **Resource Efficiency & Algorithm Optimization**: 
  * Helps identify inefficient code sections that consume excessive CPU, memory, or I/O resources
  * Reveals which parts of algorithms are the most time-consuming
    * Needed for targeted optimizations
    * Main focus of ICS432

* **Bottleneck Identification**:  
  * When working with systems like Hadoop or Spark, profiling can pinpoint where bottlenecks occur
  * Essential for optimizing workflows and improving overall system performance

## Importance of Benchmarking

* **Performance Measurement**
   * Evaluates entire data processing pipeline
   * Helps understand system behavior under various loads

* **System Comparison**
   * Enables comparison of frameworks, storage solutions, and hardware
   * Facilitates informed decisions based on quantitative data

* **Capacity Planning**
   * Helps determine system limits
   * Allows planning for future scaling needs

## Code Execution Time Profiling with `time`

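* Using the `time` module
* Simple to use for quick measurements
* Less accurate for short intervals because of system clock resolution
* Measures wall-clock time only; timing a whole script from the shell with the `time` command would also include Python interpreter startup time
* Suitable for longer-running processes

* Why the user/system time breakdown matters (reported by tools such as `%time`)
  - High User Time: Optimize your code
  - High System Time: Optimize system calls or I/O operations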


In [None]:
import time  # needed by time.sleep() inside the function


def long_running_func(num_iterations=10, sleep_time=0.1):
    """
    A function that simulates a long-running process.

    Args:
    num_iterations (int): Number of iterations to run. Default is 10.
    sleep_time (float): Time to sleep in each iteration in seconds. Default is 0.1.

    Returns: None
    """
    counter = 0
    for _ in range(num_iterations): 
        counter += 1
        time.sleep(sleep_time)  
    


In [None]:
import time

start_time = time.time()
long_running_func()
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
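
For short intervals, `time.perf_counter()` is generally a better clock than `time.time()`: it is monotonic and uses the highest-resolution timer available. The sketch below wraps it in a small helper; `timed` is a hypothetical name introduced here for illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Time a short sleep as a stand-in for real work
_, elapsed = timed(time.sleep, 0.05)
print(f"Elapsed: {elapsed:.4f} seconds")
```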


## Profiling with `IPython` Magic Command: `%time`

- Measures the execution time of a single run of a function or statement

```python
%time long_running_func()
```

* What It Does
  * Executes `long_running_func()` once and reports user, sys, and total CPU time, plus wall time

* Measures:
  * User Time: time spent running your program's code
  * System Time: time spent by the operating system working for your program, for example:
    - Reading or writing files
    - Allocating memory
    - Handling network operations
  * Wall Time: elapsed real time, including time spent sleeping or waiting
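
The same user/system/wall split can be observed in plain Python. This is a minimal sketch using `os.times()` and `time.perf_counter()`; the helper `cpu_and_wall` and the two sample workloads are hypothetical names chosen for illustration.

```python
import os
import time


def cpu_and_wall(func):
    """Return (user_s, system_s, wall_s) consumed while running func()."""
    t0 = os.times()
    w0 = time.perf_counter()
    func()
    w1 = time.perf_counter()
    t1 = os.times()
    return t1.user - t0.user, t1.system - t0.system, w1 - w0


def busy():
    # CPU-bound work: mostly shows up as user time
    sum(i * i for i in range(200_000))


def idle():
    # Sleeping uses almost no CPU, so wall time far exceeds user + system
    time.sleep(0.2)


for name, func in [("busy", busy), ("idle", idle)]:
    user, system, wall = cpu_and_wall(func)
    print(f"{name}: user={user:.3f}s sys={system:.3f}s wall={wall:.3f}s")
```

A function dominated by `time.sleep()`, like `long_running_func`, should therefore show a wall time close to its total sleep duration while user and system time stay near zero.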





In [None]:
%time long_running_func(sleep_time=0.1)


## Using `timeit` module

* Designed for accurate timing of small code snippets
* Runs code multiple times for statistical accuracy
* Attempts to factor out Python interpreter overhead
* Preferred for benchmarking and comparing code snippets


In [None]:
import timeit

# Single run
single_run_time = timeit.timeit(long_running_func, number=1)
print(f"Single run time: {single_run_time} seconds")



In [None]:
average_time = timeit.timeit(long_running_func, number=10) / 10
print(f"Average time over 10 runs: {average_time} seconds")
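
An average can be skewed by a single slow run caused by other activity on the machine. A common alternative is `timeit.repeat`, which runs several independent trials so that the minimum can be reported. The sketch below assumes a hypothetical `build_squares` snippet as the code under test.

```python
import timeit


def build_squares(n=1_000):
    """Hypothetical snippet to time: build a list of squares."""
    return [i * i for i in range(n)]


# 5 independent trials, each calling the function 1,000 times
trials = timeit.repeat(build_squares, repeat=5, number=1_000)

# The fastest trial is usually the least disturbed by background noise
best_per_call = min(trials) / 1_000
print(f"Best of 5 trials: {best_per_call * 1e6:.2f} microseconds per call")
```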



### Using `cProfile` for detailed profiling
- A built-in profiling library in Python to produce detailed statistics about function calls in a Python program
- Useful for identifying bottlenecks in larger programs
- Can sort results by various metrics (e.g., cumulative time, call count)



In [None]:
import cProfile

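cProfile.run('long_running_func()')  # the parentheses matter: the string must call the function

### Output Explanation

The sample output below was captured from `cProfile.run('long_running_func')`, without the call parentheses. That statement only looks up the function name, so the function body never runs:

```
         3 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

- Only 3 function calls were recorded, and neither `long_running_func` nor `time.sleep` is among them
- Total time rounds to 0.000 seconds because the function was never called
  * By default cProfile uses a wall-clock timer, so a run that actually calls the function should attribute about 1 second (10 iterations x 0.1 s) to `time.sleep`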

1. ncalls: Number of calls to the function
2. tottime: Total time spent in the function (excluding time in subfunctions)
3. percall: Average time spent per call (tottime / ncalls)
4. cumtime: Cumulative time spent in the function and all subfunctions
5. percall: Average cumulative time per call
6. filename:lineno(function): Location and name of the function
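
To sort the report by one of these columns, the profiler can be combined with the standard-library `pstats` module. The following is a minimal sketch; `slow_step` and `pipeline` are hypothetical functions used only to give the profiler something to measure.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    time.sleep(0.05)


def pipeline():
    for _ in range(3):
        slow_step()


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Sort by cumulative time and print only the top 5 entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by `"cumulative"` surfaces the functions whose call trees take the longest, while `"tottime"` highlights functions that are slow in their own body.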



### Using `psutil` for overall process memory
* Provides overall memory usage of the Python process
* Faster than memory_profiler for whole-program memory tracking
* Can track other system resources as well (CPU, disk I/O, etc.)


In [None]:
import psutil
import os

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # in MB

print(f"Current memory usage: {get_memory_usage()} MB")

## Data Size Profiling

### Using `sys.getsizeof()`
* Provides basic size information for Python objects
* Does not account for referenced objects (e.g., list contents)
* Quick and simple for basic data structures


In [None]:
import sys

data = [1, 2, 3, 4, 5]
print(f"Size of data: {sys.getsizeof(data)} bytes")


In [None]:
import sys

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(f"Size of data: {sys.getsizeof(data)} bytes")
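
Because `sys.getsizeof()` reports only the container itself, a nested structure can look much smaller than it really is. The sketch below contrasts the shallow size with a recursive total; `deep_getsizeof` is a hypothetical helper written for illustration and handles only the common built-in containers.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Hypothetical helper: recursively sum sizes of an object and its contents.

    Handles only common containers (dict, list, tuple, set, frozenset).
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


nested = [[0] * 1_000 for _ in range(10)]
print(f"Shallow size: {sys.getsizeof(nested)} bytes")
print(f"Deep size:    {deep_getsizeof(nested)} bytes")
```

The shallow figure covers only the outer list and its 10 references, while the deep figure also includes the inner lists.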

### Using `pympler` for complex objects
* More accurate size estimation for complex, nested objects
* Accounts for the full object graph
* Slower than sys.getsizeof() but more comprehensive


In [None]:
!pip install pympler

In [None]:

from pympler import asizeof

complex_data = {'a': [1, 2, 3], 'b': {'c': 4, 'd': 5}}
print(f"Total size of complex_data: {asizeof.asizeof(complex_data)} bytes")

## RAM Usage Profiling Using `memory_profiler`
* Tracks line-by-line memory usage
* Useful for identifying memory-intensive operations
* Can be slower due to frequent memory measurements
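* Install with: `pip install memory_profiler`
* `%mprun` generally expects the profiled function to be defined in a file on disk (for example, a module imported into the notebook) rather than directly in a notebook cell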

In [None]:
# !pip install memory_profiler

In [None]:
from memory_profiler import profile


In [None]:
%load_ext memory_profiler

In [None]:
@profile
def memory_hungry_function():
    big_list = [i for i in range(1000000)]
    del big_list



In [None]:
%mprun -f memory_hungry_function memory_hungry_function()