### How big is my DataFrame?
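`df.info()` prints a summary of the DataFrame that includes its total memory usage; `df.memory_usage()` returns the number of bytes used by each column.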
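
As a rough illustration of the two calls, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not part of the flights data used later. Passing `deep=True` asks pandas to also count the memory held by string objects, which the default estimate may understate.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: 1,000 rows, two numeric columns and one string column
df = pd.DataFrame({
    'flight_id': np.arange(1000, dtype='int64'),
    'dep_delay': np.zeros(1000, dtype='float64'),
    'carrier': ['AA'] * 1000,
})

# Print dtypes, non-null counts and the total memory usage
df.info(memory_usage='deep')

# Bytes used by the index and by each column
per_column = df.memory_usage(deep=True)
print(per_column)

# Total size of the DataFrame in MB
total_mb = per_column.sum() / 1024**2
print(total_mb)
```

Each 8-byte numeric column of 1,000 rows accounts for 8,000 bytes; the size of the string column depends on the pandas version and string storage in use.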

### NumPy transformations
- Many NumPy transformations, while fast, use one or more temporary arrays. Those transformations therefore need more memory than the original array alone.
- The function `memory_footprint()` is provided to return the total amount of memory (in megabytes, MB) currently in use by our program. It relies on the `psutil` and `os` modules.

In [2]:
import numpy as np
import psutil, os

def memory_footprint():
    '''Returns memory (in MB) being used by Python process'''
    mem = psutil.Process(os.getpid()).memory_info().rss
    return mem / 1024**2

In [4]:
N = (1024**2) // 8  # number of 8-byte floats that fill 1 MB
celsius = np.random.randn(50 * N) # Random array filling 50MB

# Print the size in MB of the celsius array
print(celsius.nbytes / 1024**2)

# Call memory_footprint(): before
before = memory_footprint()

# Convert celsius by multiplying by 9/5 and adding 32: fahrenheit
fahrenheit = celsius * 9/5 + 32

# Call memory_footprint(): after
after = memory_footprint()

# Print the difference between after and before
print(after - before)

50.0
50.09375
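
The expression `celsius * 9/5 + 32` is evaluated step by step, and each step may allocate an intermediate array before the final result is stored. One possible way to limit that, shown here as a sketch and not as the method used in these notes, is to allocate the result once and pass it through NumPy's `out=` argument so each step writes into the same buffer.

```python
import numpy as np

N = (1024**2) // 8                 # number of 8-byte floats that fill 1 MB
celsius = np.random.randn(50 * N)  # random array filling 50 MB

# Allocate the result buffer once (another 50 MB)
fahrenheit = np.empty_like(celsius)

# Write each step into the existing buffer instead of creating new arrays
np.multiply(celsius, 9 / 5, out=fahrenheit)
np.add(fahrenheit, 32, out=fahrenheit)

# Size of the result in MB
print(fahrenheit.nbytes / 1024**2)
```

The values agree with `celsius * 9/5 + 32` up to floating-point rounding, since multiplying by `9 / 5` is not bit-for-bit identical to multiplying by 9 and then dividing by 5.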


### Building a pipeline with delayed
- If we use `dask.delayed`, we don't need to use generators; the dask scheduler will manage memory usage.
- **Task**: define three decorated functions to complete the pipeline: a function to total the number of flights, a function to count the number of delayed flights, and a function to aggregate the results.

In [1]:
from dask import delayed


In [2]:
# Define count_flights
@delayed
def count_flights(df):
    return len(df)

# Define count_delayed
@delayed
def count_delayed(df):
    return (df['DEP_DELAY'] > 0).sum()

# Define pct_delayed
@delayed
def pct_delayed(n_delayed, n_flights):
    return 100 * sum(n_delayed) / sum(n_flights)

- These functions constitute the pieces of the pipeline for our flight-delay analysis.

### Computing pipelined results
- Now that the `dask.delayed` functions are defined, we can use them to construct the pipeline of delayed tasks.
- We loop over the file names, store the temporary information in lists, and aggregate the final result.
- The distinction here is that we are working with `dask.delayed` functions and objects, not real, computed values. The computation will only be executed when we call `.compute()` on the final result.
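
To make the lazy behaviour concrete, here is a minimal, self-contained sketch of the loop-and-aggregate pattern. It is an illustration under assumptions: the two small in-memory DataFrames in `frames` are hypothetical stand-ins for the per-file data, so no files are read. The three decorated functions are repeated from above so the block runs on its own.

```python
import pandas as pd
from dask import delayed

@delayed
def count_flights(df):
    return len(df)

@delayed
def count_delayed(df):
    return (df['DEP_DELAY'] > 0).sum()

@delayed
def pct_delayed(n_delayed, n_flights):
    return 100 * sum(n_delayed) / sum(n_flights)

# Hypothetical stand-ins for the DataFrames that would be loaded per file
frames = [
    pd.DataFrame({'DEP_DELAY': [5, -2, 0, 12]}),
    pd.DataFrame({'DEP_DELAY': [-1, 30, 7, -4]}),
]

# Collect Delayed objects; nothing is computed inside the loop
n_delayed, n_flights = [], []
for df in frames:
    n_delayed.append(count_delayed(df))
    n_flights.append(count_flights(df))

# Still a Delayed object describing the whole task graph
result = pct_delayed(n_delayed, n_flights)
print(type(result).__name__)

# The graph is executed only here
print(result.compute())
```

With these made-up values, 4 of the 8 flights have a positive departure delay, so the computed percentage is 50.0.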

In [None]:
filenames = 