# Dask

**Bag**: Parallelize computation on unstructured data (JSON, text files, Python objects).

**Array**: For larger-than-RAM NumPy arrays.

**DataFrame**: For larger-than-RAM pandas DataFrames.

**Delayed**: Parallelize chains of operations.

**Dask-ML**: Machine learning with a scikit-learn-like interface.
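
As a rough illustration of what "chains of operations" means for Delayed, the sketch below builds three small pipelines lazily and runs them with a single `compute()` call. The functions `load`, `clean`, and `summarize` are hypothetical stand-ins for real work, not part of Dask.

```python
from dask import delayed


@delayed
def load(n):
    """Hypothetical loader: pretend to read n records."""
    return list(range(n))


@delayed
def clean(records):
    """Hypothetical cleaning step: keep only the even values."""
    return [r for r in records if r % 2 == 0]


@delayed
def summarize(records):
    """Hypothetical reduction step: add up the remaining values."""
    return sum(records)


# Nothing runs yet: each call only adds a node to the task graph.
parts = [summarize(clean(load(n))) for n in (4, 6, 8)]
total = delayed(sum)(parts)

# compute() executes the graph; the three chains are independent,
# so the scheduler is free to run them in parallel.
result = total.compute()
```

Because each chain depends only on its own `load` call, Dask can schedule the three chains side by side and then combine them in the final `sum`.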

## Note on choosing a scheduler

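**Choose scheduler="processes" when your computation is CPU-bound, pure-Python code** that can benefit from running in parallel across multiple CPU cores. **The default scheduler for Dask Array, DataFrame, and Delayed is the threaded scheduler (scheduler="threads"), which is generally efficient for I/O-bound tasks** and for operations that release the Global Interpreter Lock (GIL), such as most NumPy and pandas routines. (Bag defaults to the multiprocessing scheduler instead.) The threaded scheduler is often a poor choice for CPU-bound Python code, because the GIL prevents multiple threads from executing Python bytecode at the same time. The processes scheduler sidesteps the GIL but has to serialize data between processes, so the gain depends on the workload.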
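
The scheduler can be chosen per call or for a whole block of code. The following is a minimal sketch of both forms; `slow_square` and `total_of_squares` are hypothetical names used only for illustration, and the example uses the "threads" and "synchronous" schedulers so that it runs anywhere.

```python
import dask
from dask import delayed


def slow_square(n):
    """Hypothetical pure-Python task, used only for illustration."""
    return n * n


def total_of_squares(values, scheduler="threads"):
    """Sum the squares of `values`, running the tasks on the given scheduler."""
    tasks = [delayed(slow_square)(v) for v in values]
    total = delayed(sum)(tasks)
    # Per-call selection: the keyword applies to this compute() only.
    return total.compute(scheduler=scheduler)


# Block-scoped selection: compute() calls inside the `with` block use
# the single-threaded "synchronous" scheduler, which is handy for debugging.
with dask.config.set(scheduler="synchronous"):
    tasks = [delayed(slow_square)(v) for v in range(5)]
    debug_total = delayed(sum)(tasks).compute()
```

Passing `scheduler="processes"` to `total_of_squares` would select the multiprocessing scheduler. When doing so from a standalone script rather than a notebook, it is usually safest to place the call under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module.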

In [1]:
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Create a pandas DataFrame with 1,000 rows of random values
df = pd.DataFrame({
    'x': np.random.rand(1000), 
    'y': np.random.rand(1000),
})

# Convert the pandas dataframe to a dask dataframe
ddf = dd.from_pandas(df, npartitions=20)


In [2]:
def complex_computation(row):
    # Simulate a CPU-intensive task
    result = 0
    for _ in range(1000):
        result += np.sin(row['x']) + np.cos(row['y'])
    return result

In [3]:
import time

# Apply the function with the default scheduler
start_time = time.time()
result_threaded = ddf.apply(complex_computation, axis=1, meta=('x', 'float')).compute()
end_time = time.time()
print(f"Time with the default (threaded) scheduler: {end_time - start_time} seconds")

# Apply the function with the multiprocessing scheduler
start_time = time.time()
result_processes = ddf.apply(complex_computation, axis=1, meta=('x', 'float')).compute(scheduler='processes')
end_time = time.time()
print(f"Time with the 'processes' scheduler: {end_time - start_time} seconds")

Time with the default (threaded) scheduler: 7.970299243927002 seconds
Time with the 'processes' scheduler: 5.263658285140991 seconds
