Concat is slow #6864

Open
river7816 opened this issue Jan 18, 2024 · 2 comments
Labels: bug 🦗 Something isn't working · External Pull requests and issues from people who do not regularly contribute to modin · Memory 💾 Issues related to memory · P3 Very minor bugs, or features we can hopefully add some day.

Comments

@river7816

Hi, I want to concatenate 5100 dataframes into one big dataframe. Together these 5100 dataframes use about 160 GB of memory. However, processing is really slow: I have been waiting for about 40 minutes and it still hasn't finished. Here is my code:

import modin.pandas as pd
import os
import ray
from tqdm import tqdm

ray.init()

def merge_feather_files(folder_path, suffix):
    # Collect all file names matching the given suffix
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]

    # Read all files and concatenate them
    df_list = [pd.read_feather(os.path.join(folder_path, file)) for file in tqdm(files)]
    merged_df = pd.concat(df_list)

    # Sort by the date column (currently disabled)
    # merged_df.sort_values(by='date', inplace=True)
    merged_df.to_feather('factors_date/factors'+suffix+'.feather')
    

# Example usage
folder_path = 'factors_batch'  # folder containing the input files
for i in range(1, 25):
    merge_feather_files(folder_path, f'_{i}')

My machine has 128 cores, and I am certain that they are not all being utilized for the processing.
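A quick way to sanity-check the "not all 128 cores are used" observation is to print Modin's runtime configuration before the read/concat loop. This is a minimal diagnostic sketch, assuming the modin.config entries Engine, CpuCount and NPartitions are available in the installed Modin version:

import modin.config as cfg

# Print what Modin believes it can parallelize over (entry names assumed, see above).
print("Engine:     ", cfg.Engine.get())        # expected to be "Ray" for this setup
print("CPU count:  ", cfg.CpuCount.get())      # should report 128 on this machine
print("NPartitions:", cfg.NPartitions.get())   # default number of partitions per axis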

@river7816 added the "question ❓ Questions about Modin" and "Triage 🩹 Issues that need triage" labels on Jan 18, 2024
@river7816
Author

Here is the bug report from running the code above:

UserWarning: `DataFrame.to_feather` is not currently supported by PandasOnRay, defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
(raylet) [2024-01-18 12:25:56,935 E 1383634 1383634] (raylet) node_manager.cc:3022: 6 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:26:59,006 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:28:01,160 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:29:07,013 E 1383634 1383634] (raylet) node_manager.cc:3022: 5 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

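The raylet log above already names the relevant knobs: RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms, both of which must be in place before Ray starts. A hedged sketch of setting them from the driver script, assuming a local cluster started by ray.init() inherits the driver's environment (for a cluster launched separately with ray start, set the variables in that shell instead); the 0.98 value is purely illustrative, not a recommendation:

import os

# These must be in the environment before any Ray processes are started.
os.environ["RAY_memory_usage_threshold"] = "0.98"    # raise the OOM-kill threshold (illustrative value)
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # or disable the memory monitor's worker killing

import ray
ray.init()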

@anmyachev added the "External Pull requests and issues from people who do not regularly contribute to modin", "Memory 💾 Issues related to memory", "bug 🦗 Something isn't working" and "P3 Very minor bugs, or features we can hopefully add some day." labels, and removed the "Triage 🩹 Issues that need triage" and "question ❓ Questions about Modin" labels, on Jan 19, 2024
@anmyachev
Collaborator

The reason seems to be the same as in #6865 (`to_feather` defaulting to pandas).
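One possible workaround that follows from this diagnosis, not an official Modin fix: since DataFrame.to_feather defaults to pandas, the full ~160 GB result is materialized on the driver anyway, while the source partitions also occupy Ray's object store on the same node. For this particular merge-and-write step, doing everything in plain pandas keeps only one copy of the data alive; a minimal sketch under that assumption:

import os
import pandas

def merge_feather_files_pandas(folder_path, suffix):
    # Same folder layout and suffix convention as in the original report.
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
    merged = pandas.concat(
        (pandas.read_feather(os.path.join(folder_path, f)) for f in files),
        ignore_index=True,  # feather requires a default RangeIndex to write
    )
    merged.to_feather('factors_date/factors' + suffix + '.feather')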
