
There is one error in GUI with multiprocess #12

Open · szwfive opened this issue Mar 7, 2024 · 7 comments

szwfive commented Mar 7, 2024

[screenshot of the error]

The hardware environment is a single RTX 3090, and the timing of this error is unpredictable. Although the SLAM process keeps running afterwards, the stuck GUI causes a memory overflow. This may be the reason why the case reported in issue #7 cannot run.

muskie82 (Owner) commented Mar 7, 2024

Hi,

We previously encountered a similar issue randomly in our environment, though it hasn't appeared recently.
If you happen to find a way to reproduce the error deterministically, could you share it with us?

The problem seems to be:

  1. PyTorch's multiprocessing communication between the main process and the GUI fails for some reason, causing unpredictable GUI crashes.
  2. The main process continues to send messages to the queue (q_main2vis), but with the GUI down, these items are never consumed, leading to memory overflow.

This issue occurs only when the GUI is enabled, so it should not be a problem when running in headless mode.
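
For illustration, the failure mode in point 2 is an unbounded producer pattern. A minimal sketch of a producer-side guard (not the repo's actual code; send_to_gui and the maxsize value are illustrative) would bound the queue and drop packets rather than accumulate them:

    import queue

    import torch.multiprocessing as mp

    # create a bounded queue instead of an unbounded one,
    # e.g. q_main2vis = mp.Queue(maxsize=8)

    def send_to_gui(q_main2vis, packet):
        try:
            q_main2vis.put_nowait(packet)
        except queue.Full:
            # the GUI has stopped draining the queue (stalled or crashed);
            # drop this packet rather than let memory grow without bound
            pass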

szwfive (Author) commented Mar 9, 2024

Thank you for your answer.
I have located the source of the issue: line 400 of slam_frontend.py, "gaussians=clone_obj(self.gaussians)". It seems an unknown error occurs during the copy.deepcopy step inside clone_obj. I am trying to pin down the problem more precisely.
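
For context, clone_obj is a deepcopy-based helper; a rough sketch of the pattern (an approximation for readability, not the repo's exact code) is:

    import copy

    import torch

    def clone_obj(obj):
        # deep-copy the object, then detach and re-clone any tensor
        # attributes so the copy carries no autograd state when it is
        # pickled and sent to the GUI process
        cloned = copy.deepcopy(obj)
        for attr, value in list(vars(cloned).items()):
            if isinstance(value, torch.Tensor):
                setattr(cloned, attr, value.detach().clone())
        return cloned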

@Tianci-Wen

I also have the same problem while running this repo on a single RTX 4090, and I am also trying to solve it.
[screenshot of the same error]

whwh747 commented Apr 3, 2024

Same error.

hnglp commented Apr 4, 2024

> Thank you for your answer. I have located the source of the issue: line 400 of slam_frontend.py, "gaussians=clone_obj(self.gaussians)". It seems an unknown error occurs during the copy.deepcopy step inside clone_obj. I am trying to pin down the problem more precisely.

Hello, is there any progress on solving this issue? I would also like to ask whether I can run this project with an RTX 3070 (39GB). Regards.

whwh747 commented Apr 4, 2024

I'm sorry, I can't resolve it.

JIAZHAOQIAN commented Jul 5, 2024

I modified the get_latest_queue function in gui_utils.py as follows:

import queue

def get_latest_queue(q):
    # Drain the queue, keeping only the most recent message.
    message = None
    while True:
        try:
            message_latest = q.get_nowait()
            if message is not None:
                del message  # discard the stale message
            message = message_latest
        except queue.Empty:
            # double-check emptiness before giving up
            if q.qsize() < 1:
                break
        # zajia: workaround for an unsolved bug related to
        # "torch.storage._UntypedStorage" raised during deserialization
        except TypeError:
            print("got a torch.storage._UntypedStorage error!")
            break
    return message

I catch the error and leave the 'message' variable unmodified. It seems to work properly.
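
For anyone applying this patch, a hypothetical GUI-side polling loop (gui_running, render, and q_main2vis are illustrative names, not the repo's actual ones) would use it like this:

    while gui_running:
        # returns None when the queue was empty, otherwise the newest packet
        packet = get_latest_queue(q_main2vis)
        if packet is not None:
            render(packet)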
