runtime is less than 10 hours for colab pro + User #3451

Closed
CathySunshine opened this issue Mar 1, 2023 · 30 comments

@CathySunshine

CathySunshine commented Mar 1, 2023

I am a Google Colab Pro+ user.

I could run my work for 24 continuous hours in January 2023. However, since the beginning of February, my job has timed out after running for less than 10 hours. Although the website says that Colab Pro+ supports continuous code execution for up to 24 hours if you have sufficient compute units, I haven't been able to run my job for more than 10 hours at any point in the past four weeks.

As a paying customer, I find this issue quite frustrating, as it limits my ability to complete my work efficiently. I rely on Google Colab Pro+ to run my workloads, and I need the ability to run them continuously for up to 24 hours. I would appreciate it if you could investigate this issue and provide me with a solution. Please let me know what could be causing this problem and what steps I can take to prevent it from happening again.

For your reference, I am using Chrome and have more than 300 compute units left, so compute units shouldn't be the problem. Besides, many of my compute units have been wasted due to the disconnect.

[Screenshot: compute units]

[Screenshot: disconnect problem]

@cperry-goog

Can you submit feedback in product? Can't diagnose from GitHub.

@CathySunshine
Author

> Can you submit feedback in product

I submitted feedback in Colab a few days ago and again yesterday.

@colaboratory-team
Contributor

Thanks for letting us know. We found your feedback report (b/271299210) and will get back to you soon.

@colaboratory-team
Contributor

We checked your activity history and noted that you also reported #3436. This issue appears to be a consequence of the mounted drive getting disconnected after ~10h (#3436). Please note that the 24h background execution applies only while there is an active execution in your session. If your notebook session is no longer executing any code (including when saving your checkpoint results fails because the mounted drive is disconnected), the idle timeout starts to apply and your session will be disconnected after the timeout (as mentioned here).

We will continue investigating the disconnecting mounted drive issue. In the meantime, as a workaround, retrying failed checkpoint-saving steps might help (you may need to unmount and remount your Google Drive filesystem on a retry). Another workaround is to save your checkpoint results to a directory on your VM's local filesystem and then copy them over to your mounted Google Drive filesystem. This requires your entire execution to finish within 24 hours, and you may still need to unmount and remount your Google Drive filesystem before copying the checkpoint results from the local directory to Google Drive. We apologize for any inconvenience this might have caused.
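
A minimal sketch of the second workaround described above, not an official Colab recipe: checkpoints are written to local disk first, then copied to the Drive mount with a retry loop. The function name `copy_with_retry`, its parameters, and the `remount` hook are all hypothetical names chosen for illustration; the sketch only assumes that a dropped mount surfaces as an `OSError` when writing.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Optional


def copy_with_retry(
    src: Path,
    dst_dir: Path,
    remount: Optional[Callable[[], None]] = None,
    attempts: int = 3,
    delay_s: float = 5.0,
) -> Path:
    """Copy a locally saved checkpoint into dst_dir, retrying on OSError.

    `remount` is a hypothetical hook that is called between failed attempts,
    for example to remount the Drive filesystem.
    """
    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            dst_dir.mkdir(parents=True, exist_ok=True)
            # copy2 returns the path of the newly written file
            return Path(shutil.copy2(src, dst_dir / src.name))
        except OSError as err:  # assumed failure mode of a dropped mount
            last_error = err
            if attempt == attempts:
                break
            if remount is not None:
                remount()
            time.sleep(delay_s)
    raise RuntimeError(
        f"could not copy {src} after {attempts} attempts"
    ) from last_error
```

In a Colab notebook, the hook could be something like `lambda: drive.mount('/content/drive', force_remount=True)` after `from google.colab import drive`, with `/content/drive` being the usual but here assumed mount point. Whether a remount actually restores the connection in this failure case is not confirmed anywhere in this thread.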

@ioritree

ioritree commented Mar 8, 2023

@colaboratory-team This happens once a day, and you should compensate Pro+ users for the lost compute units, because this is not our fault.

@LeapGamer

> @colaboratory-team This happens once a day, and you should compensate Pro+ users for the lost compute units, because this is not our fault.

Agree, I've probably lost a few hundred credits + lost training time due to this bug, happens every time I leave it running overnight or while I'm out.

@ranjitkathiriya

looks like I'm getting the same problem

@CathySunshine
Author

@colaboratory-team Could you please provide an update on the investigation into the disconnecting mounted drive issue? It has been over two weeks, and it appears that more paying users are encountering the same problem.

@Great-Bucket

I am seeing something similar, described here #3525

I thought maybe it was the version of StyleGAN3 that I was using, so I changed to use the official NVIDIA repo - but the result was exactly the same: at right around 10 hours, the cell execution for train.py stops. In my case the error message is "error: Traceback (most recent call last): Exception ignored in: Exception ignored in sys.unraisablehook".

The runtime, though, continues to be active which means I then pay for compute points that I'm not using. But though the runtime is still active, following the error, I am unable to run even a simple command such as "pwd" in any notebook cell. The only thing I can do, as far as I know, is stop the runtime and start the training process over from the top.

Maybe it's time I started using Paperspace...

@ioritree

> I am seeing something similar, described here #3525
>
> I thought maybe it was the version of StyleGAN3 that I was using, so I changed to use the official NVIDIA repo - but the result was exactly the same: at right around 10 hours, the cell execution for train.py stops. In my case the error message is "error: Traceback (most recent call last): Exception ignored in: Exception ignored in sys.unraisablehook".
>
> The runtime, though, continues to be active which means I then pay for compute points that I'm not using. But though the runtime is still active, following the error, I am unable to run even a simple command such as "pwd" in any notebook cell. The only thing I can do, as far as I know, is stop the runtime and start the training process over from the top.
>
> Maybe it's time I started using Paperspace...

I see, all the problems begin "after 10 hours...", and then something happens.

@cody151

cody151 commented Apr 7, 2023

This is actually a ridiculous scam. Why did I pay for compute units if I can't even use them? I am still being shown unnecessary captchas even though I have paid for units, and my output isn't saved because the stupid Colab instance gets disconnected and lost even while code is running. This is ridiculous and I would like my money back!

@truongtankhoa90

truongtankhoa90 commented Apr 8, 2023

I have just subscribed to Colab Pro+ and now face the same issue. The Colab notebook disconnects from Google Drive after about 10 hours. Please help us fix it, @colaboratory-team. Thanks

@jridevapp

Facing the same issue... Lost days and money to this...

@Daiiszuki

Any workarounds been found?

@arita37

arita37 commented May 5, 2023

@colaboratory-team
This is a critical issue for PAID USERS (vs. non-paying users).
It cancels out the benefit of paying.

@kitamuratomokazu

@colaboratory-team
I have the same issue.

@Furkaragoz

@colaboratory-team
I have the same issue

@semusings

semusings commented Jun 4, 2023

@colaboratory-team I also have the same issue. I lost my compute units.

I need a refund for this.

@apavlo89

apavlo89 commented Jun 9, 2023

Also having the same issue. Please fix asap

@xerxes-k

why is it closed? google drive mounting issue keeps popping up

@danielathome19

This issue happens to me after running for only 4 hours on average.

@Nayrouzzz

it happened after running it for only 4 hours :(

@zjemily

zjemily commented Aug 25, 2023

I experience this as well with Pro+. If I hadn't noticed, there could have been a significant loss of compute units, especially when leaving it running with an A100, which I avoid doing specifically because of this.

@MuharremcanGulye

I have the same issue. When I leave my computer at night, after 4 or 5 hours I get this error: "OSError: [Errno 107] Transport endpoint is not connected". I'm using Colab Pro as well.

@wer-kle

wer-kle commented Oct 8, 2023

Same issue, using V100 High RAM. I want my $50 back.

@TonojiKiobya

It happens after executing for only 4 hours. This is a scam. Shame, shame, shame for the dishonesty!

@hrishi-04

> It happens after executing for only 4 hours. This is a scam. Shame, shame, shame for the dishonesty!

Same with me; I use Colab Pro and it stopped my execution at 4 hours. Google Colab Pro is a damn scam.

@s4lm-xi

s4lm-xi commented Feb 27, 2024

Keep this issue active; don't mark it closed. I am still experiencing the same timeout issue even as a Pro user. What's the point of paying?!?!

@IssaAljanabi

@colaboratory-team
I am paying for the service as a Colab user and got disconnected after only 5 hours. I need to train a large model for 30 continuous hours without interruptions.

@izidorg

izidorg commented May 27, 2024

I have exactly the same issue. Need a solution ASAP.
I'm using the default Firefox browser on Ubuntu 22.04.
I'm running the same notebook in Google Chrome now, and will update here.
