
Failing view was not added #113

Closed
burdiyan opened this issue Mar 20, 2018 · 8 comments

@burdiyan
Contributor

This is a really weird thing that started happening for no apparent reason.

We have one processor with 1 input, 1 join, and 6 lookup edges in the group graph.

When the processor starts, there's no error in the log and everything seems to be working fine. There are messages about the corresponding views being opened, and so on.

But for some reason the consumer is not consuming any messages.

When we stop the processor (we handle SIGTERM and SIGINT gracefully and then call Stop() on the processor), it throws errors like: `failing: error in view: Errors: view error opening partition ...: error removing partition ....: partition was not added`.

It does not depend on how long I wait after the consumer is rebalanced. Those topics are actually empty on my local machine, and the same thing happens there.

A real mystery to me.

@db7
Collaborator

db7 commented Mar 20, 2018

That is weird. The table topics exist and have partitions, but no data, right?

Could it be that the processor is not consuming any messages from the input stream because it's waiting for the views to recover? I think we handle empty topics, though.

Have you tried adding the monitoring interface to check what the partitions are doing? It's sometimes useful (https://github.com/lovoo/goka/tree/master/examples/monitoring).

Can you try adding something to the tables in each partition? Does the input start being consumed then?

db7 added the bug label Mar 20, 2018
@burdiyan
Contributor Author

There's no data only on my local machine; in our dev environment the input topic actually gets constant updates.

The weirdest thing is that it was working fine initially and then, after some hours, started behaving like this. I then tried restarting the application, Kafka, and ZooKeeper, changing the group name, etc., and nothing helps. It's just stuck for no noticeable reason.

Regarding the monitoring interface: it seems like lookup tables are not displayed there, only inputs and joins, so I couldn't extract any valuable information from it.

@db7
Collaborator

db7 commented Mar 21, 2018

So the problem happens on your local machine as well as in your dev environment?
I think you'll need to trace the application to find this one out.

And yes, we need to add the lookup tables to the monitoring; sorry for the useless pointer. I was just wondering whether all joined and lookup tables were recovered. Perhaps you can manually call the processor's Stats() and log.Printf the state of all partitions, if you haven't done so yet. If one of the empty tables has a stuck partition, then it's a bug in goka.
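Something like this sketch, for instance. The `PartitionStats` shape here just mirrors the recovery-related fields goka's stats print (the exact return type of `Stats()` may differ), the partition numbers and offsets are made up, and I'm assuming `Hwm` is the offset the next produced record will get:

```go
package main

import "log"

// TableStats mirrors the recovery-related fields shown by goka's stats.
type TableStats struct {
	Stalled bool
	Offset  int64 // last consumed offset of the table topic
	Hwm     int64 // assumed: offset the next produced record will get
}

// PartitionStats is a stand-in for goka.PartitionStats.
type PartitionStats struct {
	Table TableStats
}

// logStuck reports partitions whose table is stalled before catching up.
// Under the Hwm assumption above, a partition has caught up once
// Offset reaches Hwm-1.
func logStuck(stats map[int32]*PartitionStats) (stuck []int32) {
	for p, s := range stats {
		if s.Table.Stalled && s.Table.Offset < s.Table.Hwm-1 {
			log.Printf("partition %d stuck: offset=%d hwm=%d", p, s.Table.Offset, s.Table.Hwm)
			stuck = append(stuck, p)
		}
	}
	return stuck
}

func main() {
	stats := map[int32]*PartitionStats{
		0: {Table: TableStats{Stalled: true, Offset: 100, Hwm: 500}}, // genuinely stuck
		1: {Table: TableStats{Stalled: true, Offset: 499, Hwm: 500}}, // caught up, just idle
	}
	logStuck(stats) // logs only partition 0
}
```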

@burdiyan
Contributor Author

@db7 Yes, the problem happens on my machine with or without any data, and in our dev environment where all mentioned topics have data.

I tried printing the processor's stats, and at some point the lookup partitions get stalled:

```go
7: &goka.PartitionStats{
    Now: time.Time{},
    Table: struct { Status goka.PartitionStatus; Stalled bool; Offset int64; Hwm int64; StartTime time.Time; RecoveryTime time.Time }{
        Status: 0,
        Stalled: true,
        Offset: 5133,
        Hwm: 7524,
        StartTime: time.Time{},
        RecoveryTime: time.Time{},
    },
    Input: map[string]goka.InputStats{
        "bm-core.phoenix_db.vehicles": goka.InputStats{
            Count: 5134,
            Bytes: 4385033,
            Delay: 62823458663513,
        },
    },
    Output: map[string]goka.OutputStats{},
},
8: &goka.PartitionStats{
    Now: time.Time{},
    Table: struct { Status goka.PartitionStatus; Stalled bool; Offset int64; Hwm int64; StartTime time.Time; RecoveryTime time.Time }{
        Status: 0,
        Stalled: true,
        Offset: 5107,
        Hwm: 6686,
        StartTime: time.Time{},
        RecoveryTime: time.Time{},
    },
    Input: map[string]goka.InputStats{
        "bm-core.phoenix_db.vehicles": goka.InputStats{
            Count: 5108,
            Bytes: 4368563,
            Delay: 43727458739095,
        },
    },
    Output: map[string]goka.OutputStats{},
},
9: &goka.PartitionStats{
    Now: time.Time{},
    Table: struct { Status goka.PartitionStatus; Stalled bool; Offset int64; Hwm int64; StartTime time.Time; RecoveryTime time.Time }{
        Status: 2,
        Stalled: true,
        Offset: 7109,
        Hwm: 7110,
        StartTime: time.Time{},
        RecoveryTime: time.Time{},
    },
    Input: map[string]goka.InputStats{
        "bm-core.phoenix_db.vehicles": goka.InputStats{
            Count: 7110,
            Bytes: 6277190,
            Delay: 6651032697,
        },
    },
    Output: map[string]goka.OutputStats{},
},
```

Some partitions get stalled just 1 record before the HWM.

@db7
Collaborator

db7 commented Mar 21, 2018

Partition 9 is indeed stalled, but it is recovered. Are there records above 7109 in the Kafka topic?
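(Assuming `Hwm` in the stats is the offset the next record will get, the distinction between the two cases in your dump can be sketched as:)

```go
package main

import "fmt"

// recovered reports whether a table partition has caught up with Kafka,
// assuming hwm is the offset the next produced record will get.
func recovered(offset, hwm int64) bool {
	return offset >= hwm-1
}

func main() {
	fmt.Println(recovered(7109, 7110)) // partition 9: true, caught up and merely idle
	fmt.Println(recovered(5133, 7524)) // partition 7: false, genuinely stuck
}
```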

@burdiyan
Contributor Author

@db7 Yes, right now the latest offset of this partition is 7293. This topic is updated fairly often.

@db7
Collaborator

db7 commented Mar 26, 2018

@burdiyan any progress with this problem? Could you identify whether it occurs inside goka or not?

If this is still a problem, could you come up with a minimal example where it occurs? I could then give it a try.

@burdiyan
Contributor Author

burdiyan commented Apr 3, 2018

It seems like this behavior was some weird mix of several things: our Kafka setup, an old version of Goka, network issues, and custom Sarama configs. I'll reopen if it happens again.

burdiyan closed this as completed Apr 3, 2018