# Part 4 Optimization Essentials

- Install essential libraries
- Leverage vectorized operations where possible
- Optimize data structures (Right-sizing)
- Using eval()
- Parallelized computations

## Install essential libraries
- numexpr (for better computation performance)
- pyarrow (for faster reading of CSV data)
- dask (for parallel processing)
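
Before relying on these optional libraries, it can help to check which of them are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `available_libs` is hypothetical and not part of any of the packages listed above.

```python
from importlib.util import find_spec

# Optional libraries named in this section.
OPTIONAL_LIBS = ("numexpr", "pyarrow", "dask")


def available_libs(names=OPTIONAL_LIBS) -> dict:
    """Map each top-level package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


for name, installed in available_libs().items():
    hint = "installed" if installed else f"missing (try: pip install {name})"
    print(f"{name}: {hint}")
```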

## Leverage vectorized operations where possible
- Prefer vectorized computations over explicit Python loops (https://realpython.com/numpy-array-programming/)
- for a list of available vector methods, please refer to


In [3]:
import numpy as np
import pandas as pd

In [6]:
# 2 different ways of counting a given integer's occurrence
large_df = pd.DataFrame({
    'A': np.random.randint(0, 1000, 10 ** 7),
})


def count_loop(df: pd.DataFrame, target: int) -> int:
    return sum(x == target for x in df["A"])


def count_vectorized(df: pd.DataFrame, target: int) -> int:
    return (df["A"] == target).sum()

In [9]:
%%timeit
count_loop(large_df, 500)

1.08 s ± 5.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%timeit
count_vectorized(large_df, 500)

11.9 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Vectorizing this computation yields a speedup of roughly 90x (1.08 s vs. 11.9 ms)!

## Optimize data structures (Right-sizing)
By default, Pandas generally stores numeric columns in the widest available type (typically int64 or float64), regardless of the actual range of the values.
For instance, a column that Pandas interprets as integer-valued could be stored as any of four signed types:

| dtype | width  | range           |
|-------|--------|-----------------|
| int8  | 8-bit  | [-2⁷, 2⁷ - 1]   |
| int16 | 16-bit | [-2¹⁵, 2¹⁵ - 1] |
| int32 | 32-bit | [-2³¹, 2³¹ - 1] |
| int64 | 64-bit | [-2⁶³, 2⁶³ - 1] |

Float-valued columns come in analogous widths: float16, float32 and float64.
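
The exact limits of each type can be looked up with NumPy rather than memorized. This is a small sketch using `np.iinfo` and `np.finfo`; the helper `fits` is a hypothetical name introduced here only to illustrate checking a value range against a candidate dtype.

```python
import numpy as np


def fits(lo: int, hi: int, dtype) -> bool:
    """Return True if every integer in [lo, hi] is representable in dtype."""
    info = np.iinfo(dtype)
    return info.min <= lo and hi <= info.max


# Print the representable range of each signed integer type.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__}: [{info.min}, {info.max}]")

# Floats trade range for precision; finfo reports both.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max}, about {info.precision} decimal digits")

# Values in [0, 999] do not fit into int8 but do fit into int16.
print(fits(0, 999, np.int8), fits(0, 999, np.int16))
```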

- For str values, consider converting to categorical data. The .str accessor remains available on such columns, but its results come back as ordinary (non-categorical) Series.
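
As a rough illustration of the categorical conversion, the sketch below compares the memory footprint of a repetitive string column before and after `astype("category")`. The column contents are made up for the example, and the size of the saving depends on how many distinct values the column holds.

```python
import pandas as pd

# A string column with only three distinct values (illustrative data).
colors = pd.Series(["red", "green", "blue"] * 100_000)
as_category = colors.astype("category")

# The categorical column stores small integer codes plus one copy of each label.
print("str bytes:     ", colors.memory_usage(deep=True))
print("category bytes:", as_category.memory_usage(deep=True))

# String methods still work, but the result is no longer categorical.
upper = as_category.str.upper()
print(upper.dtype)
```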

In [11]:
large_df_default = pd.DataFrame({
    'A': np.random.randint(0, 1000, 10 ** 7),
})
large_df_default.memory_usage(deep=True)

Index         128
A        40000000
dtype: int64

In [13]:
# values lie in [0, 1000), which overflows int8 (-128..127); int16 is the smallest safe type
large_df_default.A = large_df_default.A.astype(np.int16)
large_df_default.memory_usage(deep=True)

Index         128
A        20000000
dtype: int64
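
Instead of choosing the target type by hand, Pandas can pick the smallest integer type that holds the data. The following is a sketch using `pd.to_numeric` with its `downcast` argument, on illustrative data covering the same [0, 1000) value range; it is one possible approach, not necessarily the one intended by the original notebook.

```python
import numpy as np
import pandas as pd

# Integers in [0, 1000), stored as int64 to start with (illustrative data).
df = pd.DataFrame({"A": np.arange(10 ** 6, dtype=np.int64) % 1000})
before = df["A"].memory_usage(deep=True)

# downcast="integer" selects the smallest signed integer dtype that fits all values.
df["A"] = pd.to_numeric(df["A"], downcast="integer")
after = df["A"].memory_usage(deep=True)

print(df["A"].dtype)
print(f"{before} bytes -> {after} bytes")
```

Because the choice is based on the values actually present, the column keeps its contents intact, whereas a manual cast to a type that is too narrow would silently wrap around.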

## Using eval()
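
This section is empty in the source, so the following is only a minimal sketch of what using `DataFrame.eval` might look like. The column names and the expression are made up for illustration. Pandas can evaluate such expressions with numexpr when it is installed and otherwise falls back to its Python engine; whether this is faster than the plain expression depends on the data size and should be measured.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 3)), columns=["A", "B", "C"])

# Plain Pandas arithmetic: each operator allocates an intermediate Series.
plain = df["A"] + df["B"] * df["C"]

# The same expression passed as a string; column names are resolved by eval().
evaluated = df.eval("A + B * C")

# An assignment expression returns a new DataFrame with the extra column.
with_new_col = df.eval("D = A + B * C")

print(np.allclose(plain, evaluated))
print(list(with_new_col.columns))
```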

## Parallelized computations
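
This section is also empty in the source. The library list above names dask for parallel processing; the sketch below does not use dask's API but illustrates the same split-compute-combine idea with the standard library's `concurrent.futures`, applied to the counting task from the vectorization example. The function names and the worker count are assumptions made for this example, and no speedup is claimed: for small inputs the overhead of splitting can outweigh any gain.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def count_chunk(chunk: np.ndarray, target: int) -> int:
    """Count occurrences of target in one chunk using a vectorized comparison."""
    return int((chunk == target).sum())


def count_parallel(values: np.ndarray, target: int, n_workers: int = 4) -> int:
    """Split values into chunks, count each chunk in a worker thread, then sum."""
    chunks = np.array_split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks, [target] * len(chunks))
        return sum(partial_counts)


values = np.random.randint(0, 1000, 10 ** 6)
print(count_parallel(values, 500))
```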