
FIX-#6879: Convert the right DF to single partition before broadcasting in query_compiler.merge #6880

Merged
merged 16 commits into modin-project:master from dfToSinglePartition on Feb 13, 2024

Conversation

@arunjose696 (Collaborator) commented Jan 25, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #6879: The query_compiler.merge reconstructs the right dataframe for every partition of the left dataframe
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@arunjose696 (Collaborator, Author) commented Jan 25, 2024

A few issues I have with this approach:
1) Peak memory consumption in the worker that converts the dataframe to a single partition would still be as high as in the previous approach.
2) Since we broadcast the right df as a Modin DF, this is slightly heavier in memory than converting the dataframe to pandas. Comparing peak worker memory consumption for the snippet below, the observation was:
right.to_pandas approach < current PR < master

import modin.pandas as pd
import numpy as np

modin_df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1000)))
modin_df2 = pd.DataFrame(np.random.randint(0, 100, size=(10000, 1000)))
modin_result = pd.merge(modin_df, modin_df2, how="left")

Can I have some suggestions on how to improve this?

@arunjose696 changed the title dfToSinglePartition → FIX:6879 Convert the right DF to single partition before broadcasting in query_compiler.merge Jan 25, 2024
@arunjose696 changed the title FIX:6879 Convert the right DF to single partition before broadcasting in query_compiler.merge → FIX-#6879 Convert the right DF to single partition before broadcasting in query_compiler.merge Jan 25, 2024
@arunjose696 changed the title FIX-#6879 Convert the right DF to single partition before broadcasting in query_compiler.merge → FIX-#6879: Convert the right DF to single partition before broadcasting in query_compiler.merge Jan 25, 2024
@arunjose696 force-pushed the dfToSinglePartition branch 2 times, most recently from 5331555 to 0a872b3 on January 25, 2024 12:18
@anmyachev (Collaborator) commented:
A few issues I have with this approach: 1) Peak memory consumption in the worker that converts the dataframe to a single partition would still be as high as in the previous approach. 2) Since we broadcast the right df as a Modin DF, this is slightly heavier in memory than converting the dataframe to pandas. Comparing peak worker memory consumption for the snippet below, the observation was: right.to_pandas approach < current PR < master

import modin.pandas as pd
import numpy as np

modin_df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1000)))
modin_df2 = pd.DataFrame(np.random.randint(0, 100, size=(10000, 1000)))
modin_result = pd.merge(modin_df, modin_df2, how="left")

Can I have some suggestions on how to improve this?

Might it be possible to call right.to_pandas in the worker process instead of the main one? (with some changes)
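
For illustration, a minimal sketch of the idea with Ray (the helper name and the way the right frame's row partitions are passed in are assumptions, not existing Modin API):

import pandas
import ray

@ray.remote
def right_to_pandas(*row_parts):
    # Runs inside a single worker: stitch the right frame's row
    # partitions (plain pandas DataFrames) into one DataFrame, so the
    # driver process never holds the fully materialized frame.
    return pandas.concat(row_parts, axis=0)

The resulting object reference could then be broadcast to every partition of the left frame.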

@arunjose696 (Collaborator, Author) commented:
Might it be possible to call right.to_pandas in the worker process instead of the main one? (with some changes)

To call right.to_pandas on a worker, we would still need to send the right Modin dataframe to that worker. Wouldn't that increase memory consumption just as much as in this case? Why would it be better than the current approach?

@YarShev (Collaborator) commented Jan 27, 2024

@arunjose696, I think @anmyachev means calling to_pandas in a single worker process to get a single-partition Modin DataFrame out of all the partitions of the right Modin DataFrame. It seems we could do something like the following, though I am not sure whether it works as is; I guess some changes would be required for the implementation to work.

def force_materialization(self) -> "PandasDataframe":
    row_partitions = self._partition_mgr_cls.row_partitions(self._partitions)
    col_partition = self._partition_mgr_cls.column_partitions(row_partitions)
    new_frame = np.array([col_partition[0].apply(lambda df: df, num_splits=1)])
    # NOTE: new_frame is a raw partition array at this point; it would
    # still need to be wrapped back into a PandasDataframe to match
    # the annotation.
    return new_frame

@arunjose696 (Collaborator, Author) commented Jan 29, 2024

def force_materialization(self) -> "PandasDataframe":
    row_partitions = self._partition_mgr_cls.row_partitions(self._partitions)
    col_partition = self._partition_mgr_cls.column_partitions(row_partitions)
    new_frame = np.array([col_partition[0].apply(lambda df: df, num_splits=1)])
    return new_frame

I tried this approach; by making a small change, it converts the dataframe to a single-partition df. However, memory consumption increases in several workers during force_materialization, even though only one remote worker is utilized. I checked the workers' memory consumption with the script below, which logs memory use before and after the force_materialization call: the memory consumption of multiple workers went up.

force_materialization.py
import re
import platform
import warnings

import numpy as np
import psutil

import modin.pandas as pd
from modin.utils import execute

_VM_HWM_PATTERN = r"VmHWM:\s+(\d+)"

def get_max_memory_usage(proc=psutil.Process()):
    """Read the peak memory usage (VmHWM) of a process tree in MB.

    Returns 0 on non-Linux systems or if the process is not alive.
    """
    max_mem = 0
    try:
        with open(f"/proc/{proc.pid}/status", "r") as stat:
            match = re.search(_VM_HWM_PATTERN, stat.read())
            if match:
                # VmHWM is reported in kB; convert to MB.
                max_mem = int(float(match.group(1)) / 1024)
    except FileNotFoundError:
        if platform.system() == "Linux":
            warnings.warn(f"Couldn't open `/proc/{proc.pid}/status` file. Is the process alive?")
        else:
            warnings.warn("Couldn't get the max memory usage on a non-Linux platform.")
        return 0
    children = proc.children()
    max_mem_used = max_mem + sum(get_max_memory_usage(c) for c in children)
    print(f"for process with name {proc.name()} and {len(children)} children the mem used is {max_mem}")
    return max_mem_used

modin_df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1000)))

execute(modin_df)
print("before force_materialization")
print(f"total memory consumed = {get_max_memory_usage()}")

xf = modin_df._query_compiler._modin_frame.force_materialization()

print("\n\nafter force_materialization")
print(f"total memory consumed = {get_max_memory_usage()}")
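
The script prints a peak for each worker process before and after the call, so per-worker growth is visible. Note that VmHWM is only exposed via /proc on Linux; on other platforms the script reports 0.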

@anmyachev (Collaborator) commented:
def force_materialization(self) -> "PandasDataframe":
    row_partitions = self._partition_mgr_cls.row_partitions(self._partitions)
    col_partition = self._partition_mgr_cls.column_partitions(row_partitions)
    new_frame = np.array([col_partition[0].apply(lambda df: df, num_splits=1)])
    return new_frame

I tried this approach; by making a small change, it works. However, memory consumption increases in several workers during force_materialization, even though only one remote worker is utilized. I checked the workers' memory consumption with the script below, which logs memory use before and after the force_materialization call: the memory consumption of multiple workers went up.

As far as I remember, with this approach intermediate partitions can be created inside force_materialization, which may explain the increased memory consumption:

# If this axis partition is made of axis partitions
# for the other axes, squeeze such partitions into a single
# block so that this partition only holds a one-dimensional
# list of blocks. We could change this implementation to
# hold a 2-d list of blocks, but that would complicate the
# code quite a bit.
self._list_of_block_partitions.append(
    partition.force_materialization().list_of_block_partitions[0]
)

I would like to consider the possibility of creating a pandas dataframe in a worker process, without creating intermediate objects; roughly, something like the sketch below.
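
(A rough sketch with Ray; the task and the way the grid of block partitions is passed in are illustrative assumptions, not the final implementation. A single remote task rebuilds the full pandas frame directly from the blocks, with no intermediate virtual partitions.)

import pandas
import ray

@ray.remote
def blocks_to_pandas(n_cols, *blocks):
    # `blocks` is the flattened, row-major 2-D grid of block partitions
    # (plain pandas DataFrames). Glue columns first, then rows, entirely
    # inside this single worker.
    rows = [
        pandas.concat(blocks[i:i + n_cols], axis=1)
        for i in range(0, len(blocks), n_cols)
    ]
    return pandas.concat(rows, axis=0)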

@arunjose696 force-pushed the dfToSinglePartition branch 6 times, most recently from 3880889 to 07522fb on February 5, 2024 15:15
@arunjose696 (Collaborator, Author) commented Feb 5, 2024

I would like to consider the possibility of creating a pandas dataframe in a worker process, without creating intermediate objects.

I have done an implementation that makes use of to_pandas, calling it in a remote function. Could you check it once?
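
For reference, the broadcast semantics in plain pandas terms (a simplified illustration, not the actual diff): the right frame is materialized once, and every partition of the left frame merges against that single object:

import pandas

left = pandas.DataFrame({"key": range(10), "a": range(10)})
right = pandas.DataFrame({"key": range(0, 10, 2), "b": range(5)})

# Pretend these are the left frame's row partitions.
chunks = [left.iloc[i:i + 4] for i in range(0, len(left), 4)]

# One materialized right frame, reused by every chunk.
parts = [chunk.merge(right, on="key", how="left") for chunk in chunks]
result = pandas.concat(parts, ignore_index=True)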

@arunjose696 force-pushed the dfToSinglePartition branch 2 times, most recently from 7813297 to d6e62ba on February 7, 2024 12:31
@arunjose696 force-pushed the dfToSinglePartition branch 2 times, most recently from ff5639d to 942a2a9 on February 12, 2024 13:01
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
@anmyachev (Collaborator) previously approved these changes Feb 12, 2024, leaving a comment:

LGTM!

@arunjose696 force-pushed the dfToSinglePartition branch 4 times, most recently from 09f5d6c to b6c2f3b on February 12, 2024 15:03
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
@YarShev (Collaborator) commented Feb 13, 2024

@anmyachev, any comments?

@anmyachev merged commit 9ff1c15 into modin-project:master Feb 13, 2024
37 checks passed