
Waiting Executions #8136

Closed

arthurrferroni opened this issue Dec 21, 2023 · 61 comments

Comments

@arthurrferroni

Describe the bug
If a workflow has a Wait node and the execution is in WAITING status for more than 65 seconds, n8n does not resume it and errors appear in the log.

To Reproduce
1 - Create a basic workflow with a Wait node set to 5 minutes.
2 - The workflow will not resume after the wait and an error is printed in the log.

Expected behavior
The workflow is expected to resume at the set time.

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • n8n Version: Tested on 1.22.1, 1.22.0, 1.21.1, 1.21.0, 1.18.4 and others
  • Node.js Version: v20.5.1
  • Database system: Postgres
  • Operation mode: main

Additional context
[Screenshots attached: n8n logs (×3) and the waiting executions view]

@arthurrferroni
Author

Timezone configs:
GENERIC_TIMEZONE="America/Sao_Paulo"
TZ="America/Sao_Paulo"

The workflow timezone is left on the default.

@krynble
Contributor

krynble commented Dec 22, 2023

Hey @arthurrferroni, thank you for reporting - we have seen this issue before but could never reproduce it.

Do you have database access? Can you check the execution_entity row for execution ID 594672 and check the value of the waitTill column? My guess is that n8n is saving an invalid date to this field, but we could never reproduce it.
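
For anyone who wants to run that check themselves, a minimal sketch using node-postgres (the connection string is a placeholder and the table/column names assume the standard n8n Postgres schema):

import { Client } from "pg";

// Placeholder connection string - point this at the database n8n uses.
const client = new Client({ connectionString: "postgres://user:pass@localhost:5432/n8n" });

async function checkWaitTill(executionId: string): Promise<void> {
  await client.connect();
  const { rows } = await client.query(
    'SELECT id, status, "waitTill" FROM execution_entity WHERE id = $1',
    [executionId],
  );
  for (const row of rows) {
    // If the driver hands back something that is not a valid Date, resuming the execution can fail.
    console.log(row.id, row.status, row.waitTill, row.waitTill instanceof Date);
  }
  await client.end();
}

checkWaitTill("594672").catch(console.error);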

@arthurrferroni
Author

Yes I do, but I think it is saving correctly: if I restart n8n's pm2 process, the executions that have already passed the scheduled time and are in waiting status are executed on startup.

I've deleted the execution with ID 594672, but I created a new one with ID 596515, and this is its row in the database:

596515 false "manual" "2023-12-22 12:10:38.888+00" "2023-12-22 12:10:38.896+00" "2023-12-22 08:59:38.896+00" "waiting" "TB3PVn7m7Tdf6IVU"

@arthurrferroni
Author

#6624

@Joffcom
Member

Joffcom commented Jan 2, 2024

Hey @arthurrferroni,

Out of interest, do you get the same issue if you use the Docker image or Node 18, which is our currently recommended version?

Can you share the configuration for Postgres as well? Are you self-hosting that too, and does it have any timezone configuration, or was it just an apt install postgres?

@luizeof
Contributor

luizeof commented Jan 3, 2024

Hello @janober @Joffcom, I hope 2024 is another year of health, achievements and success for you and your families.

I have the same problem here. I was able to test versions back to 1.14 and the problem is the same.

I am using Docker Swarm, with n8n in queue mode. If the Wait is less than 60 seconds it works perfectly; if you set it to 65 seconds or more, it waits 60 seconds and considers the workflow completed, but the nodes following the Wait are not executed.

I did the same test using RabbitMQ (I use it in all my workflows): if there is a Wait inside a sub-workflow, it considers it completed as well.

No errors are displayed or logs are generated.

Last week I ran tests with my friend @cbalbinos and we came to the conclusion that if the wait takes more than 65 seconds, n8n considers the workflow to have been executed successfully and does not execute the rest of the nodes. He opened #8167.

I believe it is related to #7699.

@RuanMD

RuanMD commented Jan 3, 2024

I have the same problem; here is the post:
https://community.n8n.io/t/wait-does-not-work-after-65-seconds/34776/4

@Joffcom
Member

Joffcom commented Jan 3, 2024

Hey @luizeof,

If a wait is over a certain time it goes into the database in a waiting state, but the workflow isn't really considered "finished". From the testing I did this morning and from the images in #8167 we can see that the waiting executions are working.

In your test, can you share any error output you are getting in the logs? The issue with RabbitMQ is not the same as the issue reported here, and I don't want this issue to get mixed up with different issues being reported.

@Joffcom
Member

Joffcom commented Jan 3, 2024

@RuanMD do you see the same error message in your log?

@RuanMD

RuanMD commented Jan 3, 2024

Do you see the same error message in your log?

I don't know where the error is, but I did several tests: when the wait is longer than 65 seconds it just stays waiting and doesn't execute the automation.

@Joffcom
Member

Joffcom commented Jan 3, 2024

Hey @RuanMD,

The log will be in the Docker output of n8n. I would not be surprised if there are a couple of different issues here.

@arthurrferroni
Author

I haven't tested with Node 18, only 20, running on pm2. The execution was saved with waiting status, not finished, but when the set time expires it doesn't execute the next nodes, and it logs the error in the pm2 logs.

@Joffcom
Member

Joffcom commented Jan 4, 2024

Hey @arthurrferroni,

Are you able to test with Node 18, which is the version we officially support?

Do you also have your database configuration handy?

@arthurrferroni
Author

I'll spin up another VPS and test this. I'm running on PostgreSQL, with 6 CPUs and 16 GB of RAM.

@RuanMD

RuanMD commented Jan 4, 2024

I'll spin up another VPS and test this. I'm running on PostgreSQL, with 6 CPUs and 16 GB of RAM.

I'm using Node 18, on Docker Swarm with Postgres 14, on a machine with 4 GB of RAM and 3 CPUs.

@Joffcom
Member

Joffcom commented Jan 4, 2024

@RuanMD if you are using Docker it will be Node 18 already. In your case, what I want to know is whether you see the same error in the Docker logs. We already have someone internally looking at your report, so it may be worth keeping to your post, as it could be a different issue and we may end up repeating things.

@luizeof
Contributor

luizeof commented Jan 4, 2024

Hello @Joffcom,

I was able to reproduce this on the latest version:


I set the wait to 85 seconds. The workflow run from the editor works well.

In production it is not executed.

I'm running in queue mode:

  • In the Worker container, no execution log appears.
  • In the editor container, the log below appears:
Started with job ID: 65119 (Execution ID: 66643)
Started with job ID: 65120 (Execution ID: 66644)
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66644": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66644": "Only saved workflows can be resumed."
Started with job ID: 65121 (Execution ID: 66646)
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66644": "Only saved workflows can be resumed."
Started with job ID: 65122 (Execution ID: 66647)
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66646": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66644": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66646": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66647": "Only saved workflows can be resumed."
Started with job ID: 65123 (Execution ID: 66649)
Started with job ID: 65124 (Execution ID: 66650)
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66644": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66647": "Only saved workflows can be resumed."
ApplicationError: Only saved workflows can be resumed.
There was a problem starting the waiting execution with id "66646": "Only saved workflows can be resumed."
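
For context on what that message means: judging from the error text (and the question below about whether the workflow was saved), the resume logic appears to refuse to re-queue a waiting execution whose stored data does not point at a saved workflow. A hypothetical sketch of that kind of guard, with made-up names - this is not the actual n8n source:

// Illustrative only: the sort of check that would produce "Only saved workflows can be resumed."
interface StoredExecution {
  id: string;
  waitTill: Date | null;
  workflowData: { id?: string; name: string };
}

function assertResumable(execution: StoredExecution): void {
  // A waiting execution can only be resumed if its workflow has been saved,
  // i.e. it has a database ID the resume process can reload the workflow from.
  if (!execution.workflowData.id) {
    throw new Error("Only saved workflows can be resumed.");
  }
}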

@Joffcom
Member

Joffcom commented Jan 5, 2024

Hey @luizeof,

If you look at your error message, it is not the same as the one earlier in this topic. But looking at that error I have to ask... was the workflow saved before you ran it?

@luizeof
Contributor

luizeof commented Jan 5, 2024

Hey @luizeof,

If you look at your error message, it is not the same as the one earlier in this topic. But looking at that error I have to ask... was the workflow saved before you ran it?

The workflow was saved.

@Joffcom
Member

Joffcom commented Jan 5, 2024

Hey @luizeof,

That is interesting. I can see your timezone is Sao Paulo, and I know another report of this also uses that timezone. I wonder if that is related, as I am not able to reproduce this with Europe/London with either the unsaved-workflow error or the other error message.

I will do some testing on Monday and see if it is timezone related.

@luizeof
Contributor

luizeof commented Jan 5, 2024

@Joffcom try making some tests changing this:

if (waitValue < 65000) {
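
That line is the threshold check in the Wait node: as described earlier in this thread, waits under roughly 65 seconds are handled in-process, while longer waits are persisted with a waitTill timestamp that is picked up again later. A rough sketch of that branching, paraphrased from the reported behaviour rather than copied from the n8n source (putExecutionToWait is a stand-in name):

// Rough sketch of the Wait node's branching, based on the behaviour described in this thread.
const WAIT_THRESHOLD_MS = 65_000;

async function executeWait(
  waitMs: number,
  items: unknown[],
  putExecutionToWait: (waitTill: Date) => Promise<void>,
): Promise<unknown[]> {
  if (waitMs < WAIT_THRESHOLD_MS) {
    // Short wait: sleep inside the running execution and continue.
    await new Promise((resolve) => setTimeout(resolve, waitMs));
    return items;
  }
  // Long wait: persist a waitTill date so the execution goes into the "waiting" state
  // and is resumed from the database once that time has passed.
  const waitTill = new Date(Date.now() + waitMs);
  await putExecutionToWait(waitTill);
  return items;
}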

@Joffcom
Member

Joffcom commented Jan 8, 2024

@luizeof We already know one of the issues comes from where the data is saved to the database to be picked up again later. The bit we don't know is why it can't always be reproduced and why there are different errors: for example, your error is n8n saying the workflow was not saved, while the original error in this issue is about something not being a function when it should be.

Once we can reproduce the different errors in here we will be able to resolve it.

@luizeof
Contributor

luizeof commented Jan 9, 2024

@Joffcom @RuanMD I updated to the latest version (1.22.4) and so far the problem has stopped.

@RuanMD

RuanMD commented Jan 9, 2024


thank you luiz

@Joffcom
Member

Joffcom commented Jan 9, 2024

The fix in 1.22.4 fixes something that I think was introduced in 1.22.2 or 1.22.3, so if you were seeing this before that, this change should make no difference 🤔

@netroy
Member

netroy commented Jan 12, 2024

@luizeof what version of postgres are you running?

@luizeof
Contributor

luizeof commented Jan 12, 2024

@luizeof what version of postgres are you running?

postgres 16

@netroy
Member

netroy commented Jan 12, 2024

Thanks. Can you also please share the output of this query 🙏🏽

SELECT status, "waitTill" FROM "execution_entity" WHERE "waitTill" IS NOT NULL;

@Joffcom
Member

Joffcom commented Jan 14, 2024

Quick update: the issue originally reported in this post has been seen in 1.22.6, so there is a possibility that there are 2 different issues being reported in this thread.

If your log file does not mention that waitTill.getTime is not a function, can you create a new issue and let us know, so we can filter out what is relevant to this problem.
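
For reference, that error usually means the value read back for waitTill is not a JavaScript Date (for example, a string coming out of the database driver), so calling .getTime() on it throws. A small illustrative guard - not n8n's actual code:

// Illustrative only: why "waitTill.getTime is not a function" can appear.
function toWaitTillDate(raw: unknown): Date {
  if (raw instanceof Date) return raw;
  // Defensive conversion: parse strings/numbers into a Date instead of assuming one.
  const parsed = new Date(raw as string | number);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error(`Invalid waitTill value: ${String(raw)}`);
  }
  return parsed;
}

// Without such a guard, calling raw.getTime() on a string throws exactly that error.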

@DRIMOL

DRIMOL commented Jan 18, 2024

@luizeof and @Joffcom
It worked for me using both parameters:

  - GENERIC_TIMEZONE=America/Recife
  - TZ=América/Recife

@krynble
Contributor

krynble commented Jan 18, 2024

Hey @DRIMOL and @luizeof, it's interesting that you mention using Recife's timezone solved it - it leads me to think there's something wrong with the São Paulo timezone. Daylight saving time has been abolished there since 2019. I'm intrigued whether this could be the issue, with times being broken because of it.

@Joffcom
Member

Joffcom commented Jan 18, 2024

The report I have been looking into is using Berlin, but let's not forget again that there is likely more than one issue here.

@DRIMOL

DRIMOL commented Jan 18, 2024

Hey @DRIMOL and @luizeof, it's interesting that you mention using Recife's timezone solved it - it leads me to think there's something wrong with the São Paulo timezone. Daylight saving time has been abolished there since 2019. I'm intrigued whether this could be the issue, with times being broken because of it.

I didn't test SP, I just used Recife because it's closer to me. What I changed that resolved it: before, I only used the parameter GENERIC_TIMEZONE=America/Recife; when I added the second one, TZ=América/Recife, it worked. You can call me on WhatsApp to exchange ideas if you want: +557999977-1721

@DRIMOL

DRIMOL commented Jan 18, 2024

The report I have been looking into is using Berlin, but let's not forget again that there is likely more than one issue here.

I tested it again now and it's not working. This has happened other times: apparently as soon as I deploy it to my n8n queue stack it works, but after a few hours it stops working again.

@Joffcom
Member

Joffcom commented Jan 19, 2024

So, an update on this issue...

Can you upgrade to 1.25.0, which contains a temporary fix for the issue originally reported here? If after using 1.25.0 you are still seeing the same issue we will need debug log files, as it means your issue is likely caused by something else.

@DRIMOL

DRIMOL commented Jan 20, 2024

My editor doesn't stop loading when I put it on version 1.25.0; I had to revert to version 1.22.6 to get it working again. About the logs, I have a question: in my case, as I use n8n in queue mode, I have 6 workers - how will I know which one has the error log?

Gravacao.de.Tela.2024-01-20.as.11.09.01.mov

@Joffcom
Member

Joffcom commented Jan 22, 2024

Hey @DRIMOL,

That error is odd and unexpected. Does the browser dev console show any errors?

When it comes to checking the log output, I would check all of the workers and the main instance to see what is there.

@directpromo

Can you upgrade to 1.25.0, which contains a temporary fix for the issue originally reported here? If after using 1.25.0 you are still seeing the same issue we will need debug log files, as it means your issue is likely caused by something else.

I've upgraded to version 1.27.1.
The bug hasn't gone away.

What do I need to do? I am not a programmer :(

@DRIMOL

DRIMOL commented Feb 2, 2024

I noticed that the error also happens on the regular (non-queue) n8n: version 1.18.2 shows the error 'waitTill.getTime is not a function' in the log.

After updating to version 1.25.1, two flows that were waiting were marked as success, but then I did more tests and it still doesn't work - this time it doesn't return any log, as you can see in the image below.
Captura de Tela 2024-02-02 às 15 04 26

Is there anything I can do to help discover the bug?

@Joffcom
Member

Joffcom commented Feb 2, 2024

Can you enable debug logging and share the output?

@DRIMOL

DRIMOL commented Feb 2, 2024

Here is the log: https://gist.github.com/DRIMOL/6e963cccbac8c814b9f6203b1e119ecd

The selected wait time was 70 seconds:
Captura de Tela 2024-02-02 às 18 42 22

See the expected resume time:
Captura de Tela 2024-02-02 às 18 42 05

@DRIMOL

DRIMOL commented Feb 2, 2024

1 - I just did a test here, and setting the workflow timezone to UTC made it work correctly.

2 - A time-difference error shows in the editor:
Captura de Tela 2024-02-02 às 18 42 05
It is related to version 1.25.1; I went back to version 1.24.1 and the time shows correctly, being only 70 seconds after the start time.

3 - I believe the errors reported in the link below are also related to the same bug:
https://community.n8n.io/t/editor-does-not-show-executions-correctly/39659

@Joffcom
Member

Joffcom commented Feb 3, 2024

I think there are still different issues here, as in your log snippet I couldn't see the message originally reported or the text from the fix we put in place, so I suspect the wait has not executed yet or there was no issue.

The times shown in the UI can sometimes vary if n8n is set to one timezone with the TZ env option and an external database is in use with a different timezone set. There is another issue open about this.
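
As a small illustration of how that happens (plain TypeScript, nothing n8n-specific): the same stored instant renders as different wall-clock times depending on the timezone used to format it.

// The same instant, formatted with two different timezones.
const waitTill = new Date("2024-02-02T21:42:05Z"); // a UTC timestamp as stored in the database

console.log(waitTill.toLocaleString("en-GB", { timeZone: "UTC" }));               // "02/02/2024, 21:42:05"
console.log(waitTill.toLocaleString("en-GB", { timeZone: "America/Sao_Paulo" })); // "02/02/2024, 18:42:05"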

@DRIMOL

DRIMOL commented Feb 3, 2024

I believe it's the same error; only the log stopped displaying the same message. My regular n8n has apparently been working correctly since yesterday, after I went back to version 1.24.1 - but for that to happen I had to put it on version 1.25.1 and then go back to 1.24.1. I don't understand how this happened, since before, on version 1.24.1, I had the error.

As for the n8n queue setup, I haven't been able to test version 1.25.1 yet, because this error occurred when updating and I can't touch it anymore. Luckily I didn't update the one I use in production, just a test instance.

https://community.n8n.io/t/erro-ao-atualizar-para-1-25-1/39631

@Joffcom
Member

Joffcom commented Feb 3, 2024

The fix we put in for the original issue outputs a special log message if that specific condition is hit, which I didn't see in your log file - that is what I am basing my thoughts on. So it makes sense to me that if we are not seeing that message, it is not the same cause or issue.

Annoyingly, with this one we are not able to actually reproduce it, and we have not yet been given a configuration that reproduces it - the 3 I have tried all work as expected, which is very odd.

@DRIMOL

DRIMOL commented Feb 4, 2024

https://gist.github.com/DRIMOL/d86ff759ade0b80098f21d408650d78d

It didn't work again on the regular n8n. As you can see in the logs above, it is on version 1.24.1, but after I downgraded it took more than 24 hours to give the error again. I believe it's not a matter of time but of the amount of work on the n8n instance: something causes it to stop working after some amount of use. Perhaps this is why you have not been able to reproduce the error - the n8n needs to be in production and in constant use to reproduce it.

I will update to version 1.25.1.

@Joffcom
Member

Joffcom commented Feb 4, 2024

@DRIMOL I would expect the issue to be in 1.24.1, as we didn't put the fix in until after that release.

My instance of n8n handles thousands of production executions a day, and our internal instance handles a lot more, so I suspect I would see a usage limit if that were the case. But as I mentioned, your log from the newer version does not contain the log output for the fix for the original issue.

This is why I have been saying there are likely to be 2 different issues here.

As you are able to reproduce this, can you provide your n8n configuration so we can test it, along with the external database configuration?

We know from the original message what the error is, which is how we were able to fix it, but if your logs do not output that error or the message about the fix then it has to be something else. The more information we can get, the more likely it is that we can actually fix this second issue.

@DRIMOL

DRIMOL commented Feb 4, 2024

https://gist.github.com/DRIMOL/a08f6c0bdda5c74e27b292024c5828a1

This is the stack I use for my n8n queue, but I can't update to version 1.25.1, because in some tests I did on another n8n in the same stack I got an error: if I need to downgrade, the worker doesn't start, as I reported here:
https://community.n8n.io/t/erro-ao-atualizar-para-1-25-1/39631

@netroy
Member

netroy commented Feb 4, 2024

@DRIMOL if you can reliably reproduce this, would it be possible for you to reproduce it locally on your personal computer? If yes, would it also be possible for you to jump on a call with us to help us debug it?

This is likely a weird random bug in the postgres driver code, which is very difficult to fix without being able to debug it.

@DRIMOL

DRIMOL commented Feb 5, 2024

The problem is the following: I did another installation of n8n with the same stack and on the same VPS (from the company HOSTHATCH), with only a few flows, to try to catch the error, but it doesn't reproduce even on versions prior to 1.25.0. So I believe a local installation on my Mac also won't reproduce the error. Would it be possible to set up a video call so I can show you everything you need on the two n8n instances that are in production, one regular and one queued? Or would the only way be for me to reproduce the bug on my local machine?

@netroy
Member

netroy commented Feb 5, 2024

The issue isn't that I don't believe you - I actually do.
The issue is that this is one of those problems that is so deep down in the code (somewhere in a third-party postgres driver) that the only way to fix it is to be able to reproduce it in an environment where one of us can attach a debugger and live-inspect the code.
We can't do that in a production environment, not just because that's very complicated, but mostly because during debugging the entire application can get frozen, and any incoming triggers might get stuck (or time out).

We really need a way to reproduce this in a non-production environment.

@netroy
Member

netroy commented Feb 5, 2024

@DRIMOL Would it be possible for you to run a custom docker image?

@DRIMOL

DRIMOL commented Feb 5, 2024

I tried to update my n8n in production and now it only shows this error:
Screenshot 2024-02-05 at 16 43 09

and I can't revert it in any way.

@netroy
Member

netroy commented Feb 7, 2024

@DRIMOL Can you please try setting PGTZ=UTC env variable on your n8n containers? This will force the application to use UTC timestamps.

@krynble
Contributor

krynble commented Feb 9, 2024

Hey @DRIMOL can you send me an email to omar@n8n.io? We can try to set up a call.

You can write in Portuguese, as I am also Brazilian :)

@netroy
Member

netroy commented Feb 16, 2024

For anyone still seeing this error, please upgrade to 1.29.1, which includes a fix for a bug in the postgres driver where timestamps were sometimes not converted to a valid Date object.

@netroy netroy closed this as completed Feb 16, 2024
@DRIMOL

DRIMOL commented Feb 16, 2024

@krynble and @netroy I'm going to wait for 1.29.1 to become the latest version (or for a later one); I've already wasted a lot of time with this bug and I'm a few projects behind. But if this new version doesn't resolve it, I'll send you an email. Thanks for your attention.
