Write errors when running vs HDP 2.3 (==Hadoop 2.7) #15
Hi Alex,
Let me know if you want to try this patch and, if so, whether it solves your issue. Take care and all the best,
Björn

Hi! Thanks for the reply and the effort in crafting the patch (plus the nice, simple workaround!). I'll have a play as soon as I get a moment and report back.
Alex

I just tested that and it worked perfectly, thanks. I wonder if you should put the thread id in @metadata.

Glad to hear that ;) Putting the thread id in @metadata is a good idea. I will do this and then push it to master. Thanks again for your effort and help in solving this issue!
…thread is set to true. Appending to the same file from multiple threads via webhdfs may cause issues with locked file leases. Using the thread_id as part of the path configuration value ensures that only one thread appends to a given file while still running with multiple pipeline workers.
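A sketch of what the resulting output configuration could look like, using the single_file_per_thread option and the thread_id path placeholder discussed in this thread (host, port, and path values here are hypothetical):

```
output {
  webhdfs {
    host => "namenode.example.com"   # hypothetical namenode host
    port => 50070
    path => "/logs/dt=%{+YYYY-MM-dd}/logstash-%{[@metadata][thread_id]}.log"
    single_file_per_thread => true
  }
}
```

With the thread id in the path, each pipeline worker appends to its own file, so no two threads ever contend for the same WebHDFS file lease.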
Hi Alex, I merged the bugfix into master with #16.
I will close this issue if it fixes your problem. Take care and all the best,
Hello @dstore-dbap, sorry to ask this here, but where did you find information about webhdfs file locks? I plan to scale Logstash horizontally to load-balance the high CPU load, so this looks like a bad idea :/ Is there any configuration to do on the Hadoop side, maybe? Many thanks,
Hi ebuildy,
Thanks for the fantastic explanation. So I cannot set up 3x Logstash instances getting events from a queue and writing them to webhdfs (the same file)?
It depends on how frequently you are going to append. Though judging from the fact that you want to use multiple Logstash instances, it might be quite frequent ;)
@ebuildy what's the end client? Most Hadoop-y applications work on directories of files anyway |
That's true, it shouldn't be a problem actually. I plan to use Hive, Apache Drill and Apache Spark.
I've used a similar stack with directories; you should be fine. You can just put a wildcard in the RDD builder, and it gets handled pretty efficiently.
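The wildcard approach above means the per-thread files never need to be merged: downstream readers treat the directory as one dataset. A minimal stand-in illustration using only the Python standard library's glob (directory layout and file names here are hypothetical; in Spark the equivalent would be passing the same glob pattern to the RDD builder):

```python
import glob
import os
import tempfile

# Simulate the per-thread part-files the webhdfs plugin writes when
# thread_id is part of the path (names are hypothetical).
d = tempfile.mkdtemp()
for thread_id in ("a1b2", "c3d4"):
    with open(os.path.join(d, f"logstash-{thread_id}.log"), "w") as f:
        f.write(f"event from thread {thread_id}\n")

# A downstream reader globs the directory the same way and sees
# one logical dataset spread across many small files.
paths = sorted(glob.glob(os.path.join(d, "logstash-*.log")))
lines = [line.strip() for p in paths for line in open(p)]
print(len(paths), len(lines))  # → 2 2
```

Hive, Drill, and Spark all resolve such patterns natively, so writing one file per thread costs nothing on the read side.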
Nice to hear, thanks.
Yes, that's exactly what we do with our Hive queries. Works fine ;)
Closing this issue now. |
Hi @dstore-dbap, I know this bug has been closed, but I am investigating issues on the WebHDFS side whereby the plugin throws exceptions like:
I have already set the suggested option. My config is:
and the pipeline configuration is
Interestingly, even if I keep a single worker, two distinct thread ID files are created.
I get the error as well. |
@jvosu2 In our experience it isn't a problem with the plugin.
@jvosu2 Hello, I want to know what config you finally used to fix this problem. Did you set pipeline.workers = 1?
No, it still happens even then. The default behavior of the hdfs plugin is to retry 5 times and then drop the message. Instead of dropping, I write to a file: I ended up writing my own plugin by combining the file output plugin with the hdfs output plugin, and if it fails after the number of retries, I write to a local file. I later use Filebeat to upload the files in the error queue. In my setup we have SSL and HttpFS (not WebHDFS), so the firewall forces me to send all messages to only one node, which is a bottleneck. Eventually I will instead need to write my own HDFS client.
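The retry-then-fall-back pattern described above can be sketched generically. This is not the author's actual plugin code, just an illustration of the control flow, with all names and the retry count chosen for the example:

```python
import time

def send_with_fallback(event, send, fallback, retries=5, delay=0.0):
    """Try the primary sink up to `retries` times; on exhaustion,
    hand the event to a fallback sink instead of dropping it."""
    for _ in range(retries):
        try:
            send(event)
            return True
        except OSError:
            time.sleep(delay)  # back off before the next attempt
    # e.g. append to a local error-queue file that Filebeat later ships
    fallback(event)
    return False

# Usage: a sink that always fails (e.g. a held WebHDFS lease)
error_queue = []
def always_locked(event):
    raise OSError("lease held by another client")

send_with_fallback({"msg": "hello"}, always_locked, error_queue.append, retries=3)
print(len(error_queue))  # → 1
```

The key point is that a delivery failure after exhausting retries becomes a spill to local disk rather than silent data loss.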
@jvosu2 Sorry to bother you. After many attempts to fix this problem, I have to agree with your opinion. I want to write a plugin similar to yours, but I am not good at Ruby.
I'm running a very boring configuration in Logstash 2.1.2 that reads from a local CSV file and writes to webhdfs (connecting to an 8-node HDFS cluster created by a vanilla HDP install).
On a ~10MB file with ~200K lines,
I get a non-stop stream of the following errors:
A random number of the events are discarded each run, ~25-50%.
Looking at the server logs, they are filled with things like the following:
This appears, according to Google, to be caused by multiple threads accessing the same file. And there's a related but cryptic comment in the source code here (sic):
So:
- reduce the number of workers, but that presumably means the global number of worker threads, which would cripple the performance of the other Logstash configs.

Thanks in advance for any insight/help!
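The root cause discussed in this thread is that HDFS grants a single-writer lease per file, so concurrent appends from multiple worker threads collide. A minimal sketch of the fix (one file per thread) using Python's threading module as a stand-in for Logstash pipeline workers; all names here are illustrative:

```python
import os
import tempfile
import threading

d = tempfile.mkdtemp()
barrier = threading.Barrier(4)

def append_events(n):
    # Wait until all workers are alive so their thread ids are distinct.
    barrier.wait()
    # Each worker appends only to its own file, keyed by its thread id,
    # mirroring the plugin's thread_id-in-path workaround: no two
    # writers ever contend for the same file (or file lease).
    path = os.path.join(d, f"out-{threading.get_ident()}.log")
    with open(path, "a") as f:
        for i in range(n):
            f.write(f"event {i}\n")

threads = [threading.Thread(target=append_events, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(os.listdir(d)))  # → 4 (one file per worker thread)
```

With local files the OS serializes concurrent appends anyway; against WebHDFS the same sharing would instead surface as lease errors, which is exactly why the per-thread file split matters there.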