# Scalability

### Check 4 different Python packages

With large datasets, checking how well our solutions scale is imperative.

Although `pandas` is one of the easiest and most powerful tools available, it runs into performance problems with very large datasets that do not fit into memory.

Here are the four libraries compared:

- **Pandas** is excellent for datasets that fit comfortably in memory, providing rich functionality for data manipulation and analysis.
- **Dask** scales pandas workflows to larger datasets by parallelizing operations and working with data that doesn’t fit into a single machine’s memory. It does this by breaking the dataset into chunks and processing these chunks in parallel across multiple threads or machines.
- **Vaex** uses memory mapping, efficient algorithms, and lazy evaluation to handle very large datasets ($> 10^9$ rows) without loading the entire dataset into memory. It is particularly good at out-of-core computations and at streaming data to create visualizations and statistical summaries.
- **Modin** aims to speed up pandas operations by transparently parallelizing and distributing them.
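
The chunking idea mentioned for Dask and for out-of-core processing can be illustrated without any of these libraries. The sketch below uses only the Python standard library and is not how Dask or Vaex are implemented; it only shows that an aggregate such as a mean can be computed while holding one chunk of rows in memory at a time. The helper names `iter_chunks` and `chunked_mean` and the column names `timestamp` and `power` are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from an iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(csv_file, column, chunk_size=1000):
    """Mean of one numeric column, keeping only one chunk in memory at a time."""
    reader = csv.DictReader(csv_file)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        # Skip empty cells; only running totals survive between chunks.
        values = [float(row[column]) for row in chunk if row[column] != ""]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


# Tiny in-memory stand-in for a large CSV file (hypothetical columns).
sample = io.StringIO("timestamp,power\n0,1.0\n1,2.0\n2,3.0\n3,\n4,6.0\n")
mean_power = chunked_mean(sample, "power", chunk_size=2)
print(mean_power)
```

Libraries such as Dask add scheduling and parallel execution on top of this pattern, which also explains why they can be slower than pandas on files that already fit in memory.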

The results below compare these packages on the four CSV files listed in the first table. They were obtained on macOS with a 3.1 GHz dual-core Intel Core i5 and 8 GB of 2133 MHz memory; exact figures vary from run to run.

| File       | Size (MB) |
|------------|-----------|
| R80736.csv | 49.30     |
| R80721.csv | 48.49     |
| R80790.csv | 48.55     |
| R80711.csv | 48.85     |

| Library | Time taken (sec) | Max memory usage (MB) |
|---------|------------------|-----------------------|
| pandas  | 6.13             | 735.23                |
| dask    | 18.38            | 603.20                |
| vaex    | 3.11             | 268.40                |
| modin   | 11.60            | 245.38                |

**Conclusion**:
  - Time efficiency: `vaex` and `pandas` show the best results.
  - Memory efficiency: `modin` and `vaex` show the best results.
  - Given its ease of use, I will use `pandas` for this tutorial.

In [1]:
from pathlib import Path

from windml.core.functions import compare_data_libraries

# Directory containing the CSV files to benchmark
data_dir = Path.cwd().parent / 'windml' / 'data'

compare_data_libraries(data_dir)

Found 4 CSV files.
File: R80711.csv, Size: 49.30 MB
File: R80790.csv, Size: 48.49 MB
File: R80721.csv, Size: 48.55 MB
File: R80736.csv, Size: 48.85 MB
Library: pandas, Time taken: 5.46 seconds, Max memory usage: 664.48 MB
Library: dask, Time taken: 8.04 seconds, Max memory usage: 556.34 MB
Library: vaex, Time taken: 1.68 seconds, Max memory usage: 318.00 MB
Library: modin, Time taken: 4.46 seconds, Max memory usage: 299.65 MB
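
The source of `compare_data_libraries` is not shown here, so the following is only a sketch of one way a "time taken" and "max memory usage" pair could be collected for a single loader, using the standard library's `time.perf_counter` and `tracemalloc`. The names `measure` and `dummy_loader` are hypothetical, and `tracemalloc` only sees allocations made through Python's memory allocator, so its peak can differ from process-level figures such as those reported above.

```python
import time
import tracemalloc


def measure(loader, *args, **kwargs):
    """Run loader(*args, **kwargs) and return (result, seconds, peak_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = loader(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 1024**2


def dummy_loader(n_rows):
    # Hypothetical stand-in for a real loader such as a CSV reading function.
    return [float(i) for i in range(n_rows)]


rows, seconds, peak_mb = measure(dummy_loader, 100_000)
print(f"Time taken: {seconds:.2f} seconds, Max memory usage: {peak_mb:.2f} MB")
```

Passing each library's loading function to the same wrapper keeps the comparison consistent, although a fair benchmark would also repeat each measurement several times.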
