Yet another NPE #35

Open
jay7x opened this issue Mar 17, 2020 · 16 comments
Labels: bug (Something isn't working)

Comments

jay7x (Contributor) commented Mar 17, 2020

Just hit the NPE below yesterday (using commit 3c95089). I tried again today with the latest commit from master (f30b9ad), but it's still there. The output below is from the latest version.

What changed: I migrated to eCryptfs. I stopped kafka-backup, renamed the target dir, and emptied and chattr +i'd the backup sink config (to prevent kafka-backup from being started by Puppet again). Then I deployed the eCryptfs changes, rsynced the data back, removed the chattr +i, and reapplied Puppet.

Now the main question: should we try to debug this at all, or should I just wipe it and take another fresh backup? This is QA, so we have some time if needed.

[2020-03-17 02:23:47,321] INFO [Consumer clientId=connector-consumer-chrono_qa-backup-sink-0, groupId=connect-chrono_qa-backup-sink] Setting offset for partition [redacted].chrono-billable-datasink-0 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafka5.node:9093 (id: 5 rack: null), epoch=187}} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:762)
[2020-03-17 02:23:47,697] ERROR WorkerSinkTask{id=chrono_qa-backup-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:179)
java.lang.NullPointerException
        at de.azapps.kafkabackup.sink.BackupSinkTask.close(BackupSinkTask.java:122)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:397)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:591)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
        at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
        at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
[2020-03-17 02:23:47,705] ERROR WorkerSinkTask{id=chrono_qa-backup-sink-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:180)
[2020-03-17 02:23:47,705] INFO Stopped BackupSinkTask (de.azapps.kafkabackup.sink.BackupSinkTask:139)
jay7x (Contributor) commented Mar 20, 2020

JFYI, I wiped it out and started with a fresh backup. Feel free to close this issue.

itadventurer (Owner) commented:
Hmm… This is weird… The relevant lines of code are these:

PartitionWriter partitionWriter = partitionWriters.get(topicPartition);
partitionWriter.close();

partitionWriters is filled on the open() call in

this.partitionWriters.put(topicPartition, partitionWriter);

which is called for every TopicPartition that is opened… It does not make sense to me why this happens :(

Do you still have your data available for further debugging? It would be interesting to try to reproduce this. We should at least show a meaningful error instead of just throwing an NPE…
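
For illustration, here is a minimal sketch (not the actual kafka-backup code; the class is hypothetical and the field and type names are assumed from the snippets above) of how close() could guard against a missing PartitionWriter and raise a descriptive error instead of an NPE:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.ConnectException;

// Sketch only: this PartitionWriter interface stands in for kafka-backup's own class.
class BackupSinkCloseSketch {
    interface PartitionWriter {
        void close();
    }

    private final Map<TopicPartition, PartitionWriter> partitionWriters = new HashMap<>();

    // Fail fast with context when asked to close a partition that was never open()ed,
    // instead of dereferencing a null PartitionWriter.
    public void close(Collection<TopicPartition> partitions) {
        for (TopicPartition topicPartition : partitions) {
            PartitionWriter partitionWriter = partitionWriters.get(topicPartition);
            if (partitionWriter == null) {
                throw new ConnectException("close() called for partition " + topicPartition
                        + " which was never opened; known partitions: " + partitionWriters.keySet());
            }
            partitionWriter.close();
        }
    }
}

That would not fix the underlying open()/close() mismatch, but it would make the failure self-explanatory in the Connect logs.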

itadventurer added the bug label Mar 21, 2020
jay7x (Contributor) commented Mar 23, 2020

Hi! Yes, I still have the old backup directory. Though using it may affect the current cluster's backup state, I guess, as I didn't change the sink name…

jay7x (Contributor) commented Mar 23, 2020

My guess is that the target dir was broken in some way during the eCryptfs enablement attempts. Maybe some file was changed accidentally, or something like that.

itadventurer (Owner) commented:
Hmm… Does it contain sensitive information? If not, feel free to upload it to https://send.firefox.com/ and send me a link; I would try to reproduce the error.
Otherwise, you could try to reproduce it with a fresh cluster, or we close the issue for now, hoping you are right ;)

jay7x (Contributor) commented Mar 24, 2020

This happened today on another cluster too...
I have an Azure backup cronjob which stops kafka-backup, then unmounts eCryptfs, then runs azcopy sync, then mounts eCryptfs back and starts kafka-backup.
Tonight the umount step failed, so the script failed too (set -e). I guess that's when the issue happened, though I need to recheck the timeline carefully. I will update this issue later.

UPD: I just checked the logs. The NPE actually happened earlier. kafka-backup was OOM-killed multiple times... It seems -Xmx1024M (or the Docker memory_limit=1152M) is not enough for this cluster :( Any ideas on how to calculate the heap/RAM size for kafka-backup?

Do you want me to do some debugging on this data? I cannot upload it, as it contains the company's sensitive data...

jay7x (Contributor) commented Mar 27, 2020

BTW, can a failed sink cause kafka-connect to exit? It'd be nice if the whole standalone Connect process failed on a single sink failure (when no other sink/connector exists).

itadventurer (Owner) commented:
> BTW, can a failed sink cause kafka-connect to exit? It'd be nice if the whole standalone Connect process failed on a single sink failure (when no other sink/connector exists).
See #46

> UPD: I just checked the logs. The NPE actually happened earlier. kafka-backup was OOM-killed multiple times... It seems -Xmx1024M (or the Docker memory_limit=1152M) is not enough for this cluster :( Any ideas on how to calculate the heap/RAM size for kafka-backup?

Sorry, I missed the update. Feel free to just add another comment ;)

I have no idea how to calculate that right now. I have opened a new ticket, #47, for that discussion.

> Do you want me to do some debugging on this data? I cannot upload it, as it contains the company's sensitive data...

Yes please! That would be awesome!

jay7x (Contributor) commented Mar 31, 2020

> Do you want me to do some debugging on this data? I cannot upload it, as it contains the company's sensitive data...

> Yes please! That would be awesome!

I'm not skilled in Java debugging, unfortunately... though I can run something if you guide me through it.

itadventurer (Owner) commented:
OK, I will try to think about how to do that over the next days; maybe I'll find the issue myself by chance 😂 (I want to write more tests).
If it is possible to reproduce it with non-company data, that would be awesome!

jay7x (Contributor) commented Mar 31, 2020

Based on what I saw here, I'd suggest killing the kafka-backup process with kill -9 a few times. I guess you may get into that state :) I really hope it's not related to eCryptfs...

itadventurer (Owner) commented:
I have seen it today in my test setup too… Currently I am failing to reproduce it. I will try again over the next days…

jay7x (Contributor) commented May 30, 2020

Hit this again after a few hours of #88 and one OOM…

jay7x (Contributor) commented May 31, 2020

Saw this tonight on kafka-backup service shutdown, before running the Azure blobstore backup:

May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: [2020-05-30 19:19:24,572] INFO WorkerSinkTask{id=chrono_prod-backup-sink-0} Committing offsets synchronously using sequence number 2782: {xxx-4=OffsetAndMetadata{offset=911115, leaderEpoch=null, metadata=''}, yyy-5=OffsetAndMetadata{offset=11850053, leaderEpoch=null, metadata=''}, [...]
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: [2020-05-30 19:19:24,622] ERROR WorkerSinkTask{id=chrono_prod-backup-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:179)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: org.apache.kafka.common.errors.WakeupException
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeTriggerWakeup(ConsumerNetworkClient.java:511)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:275)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:937)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1473)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1431)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommitSync(WorkerSinkTask.java:333)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerSinkTask.doCommit(WorkerSinkTask.java:361)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:432)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:591)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: #011at java.base/java.lang.Thread.run(Unknown Source)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: [2020-05-30 19:19:24,634] ERROR WorkerSinkTask{id=chrono_prod-backup-sink-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:180)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: [2020-05-30 19:19:24,733] INFO Stopped BackupSinkTask (de.azapps.kafkabackup.sink.BackupSinkTask:139)
May 30 19:19:24 backupmgrp1 docker/kafka-backup-chrono_prod[16472]: [2020-05-30 19:19:24,771] INFO [Consumer clientId=connector-consumer-chrono_prod-backup-sink-0, groupId=connect-chrono_prod-backup-sink] Revoke previously assigned partitions [...]

itadventurer (Owner) commented:
🤔 Maybe I should try to reproduce it by letting Kafka Backup run for a few hours while producing a lot of data… I need to think about how to debug that in the most meaningful way…

I think it would help a bit if you could monitor your Kafka Backup setup. Maybe we will see something useful in the metrics 🤔

andresgomezfrr commented:
I reproduced the same error:

[2020-07-10 11:05:21,755] ERROR WorkerSinkTask{id=backup-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
java.lang.NullPointerException
	at de.azapps.kafkabackup.sink.BackupSinkTask.close(BackupSinkTask.java:122)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:397)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:591)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I am using a pod setup in k8s with an Azure File Share filesystem to store the backup. I will try to add some logs at this point.
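
A possible shape for that logging, as a hedged sketch only (the class here is hypothetical; it assumes an SLF4J logger like the one the Connect runtime already uses, plus the partitionWriters map from the snippets earlier in this thread):

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: logs the state that currently surfaces as an NPE, then skips the broken partition.
class BackupSinkCloseLoggingSketch {
    private static final Logger log = LoggerFactory.getLogger(BackupSinkCloseLoggingSketch.class);

    interface PartitionWriter {
        void close();
    }

    private final Map<TopicPartition, PartitionWriter> partitionWriters = new HashMap<>();

    public void close(Collection<TopicPartition> partitions) {
        for (TopicPartition topicPartition : partitions) {
            PartitionWriter partitionWriter = partitionWriters.get(topicPartition);
            if (partitionWriter == null) {
                // A missing writer here is what the stack traces above report as an NPE in close().
                log.error("No PartitionWriter for {}; currently open partitions: {}",
                        topicPartition, partitionWriters.keySet());
                continue;
            }
            partitionWriter.close();
        }
    }
}

Logging the assigned partitions in open() as well would show whether close() is ever handed a partition that open() never saw.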
