Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: Doris Lee <dorisjunglinlee@gmail.com>
Commit cf426c4 (parent 2cfabaf): 1 changed file with 40 additions and 33 deletions.
Out-of-memory data with Modin
=============================

When using pandas, you might run into a memory error if you are working with large datasets that cannot fit in memory, or if you perform certain memory-intensive operations (e.g., joins).

Modin solves this problem by spilling over to disk: it uses your disk as an overflow for memory, so you can work with datasets that are too large to fit in memory. By default, Modin leverages out-of-core methods to handle datasets that don't fit in memory for both the Ray and Dask engines.

Motivating Example: Memory error with pandas
--------------------------------------------

pandas uses in-memory data structures to store and operate on data, which means a dataset too large to fit in memory will cause an error. As an example, let's create an 80GB DataFrame by appending together 40 different 2GB DataFrames:

.. code-block:: python

  import numpy as np
  import pandas

  df = pandas.concat([pandas.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))) for _ in range(40)])  # MemoryError!

When we run this on a laptop with 32GB of RAM, pandas runs out of memory and throws an error (e.g., :code:`MemoryError`, :code:`Killed: 9`).
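A quick back-of-the-envelope check of the sizes quoted above (assuming 8-byte ``int64`` values, which is what ``np.random.randint`` produces by default on most platforms):

.. code-block:: python

  # Each frame is 2**20 rows x 2**8 columns of 8-byte integers.
  rows, cols, bytes_per_value = 2**20, 2**8, 8
  frame_gib = rows * cols * bytes_per_value / 2**30
  print(frame_gib)       # 2.0 GiB per frame
  print(40 * frame_gib)  # 80.0 GiB total -- far beyond 32GB of RAM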

The `pandas documentation <https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html>`_ has a great section on recommendations for scaling your analysis to these larger datasets. However, this generally involves loading in less data or rewriting your pandas code to process the data in smaller chunks.
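For instance, one such chunking workaround uses the ``chunksize`` parameter of ``pandas.read_csv`` to stream a file instead of loading it all at once. A minimal sketch (the CSV here is a small stand-in generated on the fly, not a real large dataset):

.. code-block:: python

  import numpy as np
  import pandas as pd

  # Generate a small stand-in CSV so the sketch is self-contained.
  pd.DataFrame(np.random.randint(0, 100, size=(1000, 4))).to_csv("example.csv", index=False)

  # Stream the file in 100-row chunks and aggregate incrementally,
  # so only one chunk is held in memory at a time.
  total_rows = 0
  for chunk in pd.read_csv("example.csv", chunksize=100):
      total_rows += len(chunk)  # replace with your real per-chunk computation
  print(total_rows)  # 1000

This works, but it forces you to restructure your analysis around the chunk loop, which is exactly the rewriting effort Modin aims to avoid.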

Operating on out-of-memory data with Modin
------------------------------------------

To work with data that exceeds memory constraints, you can use Modin to handle these large datasets:

.. code-block:: python

  import modin.pandas as pd
  import numpy as np

  df = pd.concat([pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))) for _ in range(40)])  # 40x2GB frames -- Working!
  df.info()

Not only does Modin let you work with datasets that are too large to fit in memory, it also lets you perform various operations on them without worrying about memory constraints.

Advanced: Configuring out-of-core settings
------------------------------------------

.. why would you want to disable out of core?

Out-of-core functionality is enabled by default by the compute engine you select. To disable it, start your preferred compute engine with the appropriate arguments. For example, with Ray:

.. code-block:: python

  import modin.pandas as pd
  import ray

  ray.init(_plasma_directory="/tmp")  # setting to disable out of core in Ray
  df = pd.read_csv("some.csv")

If you are using Dask, you have to modify local configuration files. Visit the Dask documentation_ on object spilling for more details.

.. _documentation: https://distributed.dask.org/en/latest/worker.html#memory-management
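As a concrete illustration of the kind of local configuration change the Dask documentation describes, a worker's memory thresholds live in ``~/.config/dask/distributed.yaml``. The fractions below are a sketch of the documented defaults, not recommendations; check the linked Dask docs for the settings that apply to your version:

.. code-block:: yaml

  # Sketch of ~/.config/dask/distributed.yaml (illustrative values).
  distributed:
    worker:
      memory:
        target: 0.60     # fraction of memory at which data starts moving to disk
        spill: 0.70      # fraction at which the worker spills to disk
                         # (set spill to false to disable spilling)
        pause: 0.80      # fraction at which the worker pauses accepting work
        terminate: 0.95  # fraction at which the worker is restarted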