DOCS-#3692: Improve Out-of-Core docs (#3705)
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: Doris Lee <dorisjunglinlee@gmail.com>
4 people committed Dec 10, 2021
1 parent 2cfabaf commit cf426c4
Showing 1 changed file with 40 additions and 33 deletions.
73 changes: 40 additions & 33 deletions docs/getting_started/out_of_core.rst
@@ -1,51 +1,58 @@
Out-of-memory data with Modin
=============================

When using pandas, you might run into a memory error if you are working with large datasets that cannot fit in memory, or when you perform certain memory-intensive operations (e.g., joins).

Modin solves this problem by spilling over to disk: it uses your disk as an overflow for memory, so you can work with datasets that are too large to fit in memory. By default, Modin leverages out-of-core methods to handle datasets that don't fit in memory for both the Ray and Dask engines.
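
Because Modin is a drop-in replacement for pandas, no special setup is required to take advantage of this. As a minimal sketch (assuming the engine is installed; :code:`modin.config.Engine` also accepts :code:`"dask"`), you can optionally select the execution engine explicitly before creating any DataFrames:

.. code-block:: python

  import modin.config as modin_cfg

  # Optional: explicitly pick the execution engine ("ray" or "dask").
  # Out-of-core spilling is enabled by default for both.
  modin_cfg.Engine.put("ray")

  import modin.pandas as pd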

Motivating Example: Memory error with pandas
--------------------------------------------

pandas makes use of in-memory data structures to store and operate on data, which means that if you have a dataset that is too large to fit in memory, pandas will raise an error. As an example, let's create an 80GB DataFrame by appending together 40 different 2GB DataFrames.

.. code-block:: python

  import pandas
  import numpy as np

  df = pandas.concat([pandas.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))) for _ in range(40)]) # Memory Error!

When we run this on a laptop with 32GB of RAM, pandas will run out of memory and throw an error (e.g., :code:`MemoryError`, :code:`Killed: 9`).

The `pandas documentation <https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html>`_ has a great section on recommendations for scaling your analysis to these larger datasets. However, this generally involves loading in less data or rewriting your pandas code to process the data in smaller chunks.
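
As a rough sketch of that chunked approach (the file name and column used here are purely illustrative), a single :code:`read_csv` call becomes a loop that aggregates one chunk at a time:

.. code-block:: python

  import pandas

  # Hypothetical example: process a large CSV one million rows at a time
  # instead of loading the whole file into memory at once.
  total = 0
  for chunk in pandas.read_csv("some_large.csv", chunksize=1_000_000):
      total += chunk["col0"].sum()  # aggregate each chunk separately
  print(total)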

Operating on out-of-memory data with Modin
------------------------------------------

In order to work with data that exceeds memory constraints, you can use Modin to handle these large datasets.

.. code-block:: python

  import modin.pandas as pd
  import numpy as np

  df = pd.concat([pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))) for _ in range(40)]) # 40x2GB frames -- Working!
  df.info()

Not only does Modin let you work with datasets that are too large to fit in memory, it also lets you perform various operations on them without worrying about memory constraints.
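
For instance, continuing with the :code:`df` built above, element-wise and group-by operations work the same way they do in pandas (the integer column label :code:`0` is simply the first column generated above):

.. code-block:: python

  # A simple element-wise (map-like) operation on the large frame.
  nan_df = df.isna()

  # A group-by aggregation on the same out-of-memory DataFrame.
  print(df.groupby(0).count())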

Advanced: Configuring out-of-core settings
------------------------------------------

.. why would you want to disable out of core?
By default, out-of-core functionality is enabled by the selected compute engine.
To disable it, start your preferred compute engine with the appropriate arguments. For example:

.. code-block:: python

  import modin.pandas as pd
  import ray

  ray.init(_plasma_directory="/tmp") # setting to disable out of core in Ray
  df = pd.read_csv("some.csv")

If you are using Dask, you have to modify local configuration files. Visit the
Dask documentation_ on object spilling for more details.
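
As a rough, non-authoritative sketch, the spilling-related worker settings described there can also be set programmatically before the Dask workers are started; the configuration keys below are taken from Dask's memory-management documentation and may differ between versions:

.. code-block:: python

  import dask
  from distributed import Client

  # Sketch: disable spilling to disk for workers created after this call.
  dask.config.set({
      "distributed.worker.memory.spill": False,
      "distributed.worker.memory.target": False,
  })

  client = Client()  # start local Dask workers with the settings above

  import modin.pandas as pd
  df = pd.read_csv("some.csv")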


.. _documentation: https://distributed.dask.org/en/latest/worker.html#memory-management
