Big Data in the Linode Cloud - Part 1: Streaming Data Processing using Apache Storm (#419)
Conversation
|
Any news on this? Thanks in advance
|
Sorry for the delay @pathbreak. Thanks for your patience. We're currently reviewing the topic and your submission. It shouldn't be much longer now before we can get back to you with an answer.
|
Hi @pathbreak, I've been assigned to test and edit your guide to start preparing it for publication. Because this is such an extensive setup, this process could take a few days, but if you have any questions please feel free to reach out. Once finished, I'll make a pull request against this branch with any proposed changes. Thanks, and I look forward to working with you.
|
Hi @pbzona, Thank you, I look forward to it too. I do have some questions about the changes required, because some things have changed since I drafted this article back in Feb-March 2016:
I had held off on incorporating these changes because I wasn't certain whether the article was being considered for publication. Now I'm thinking of handling the first change immediately, before you start your tests and review. For the second change, which is more extensive in scope, I suggest retaining Ubuntu 14.04 as the only supported Ubuntu version for now, while I investigate Ubuntu 16 and add support for it. I can send those changes as an update to the published article later on. What do you think?
|
Hi @pathbreak Good call on the Linode plans and sizes - since those changes will be trivial (in the scope of editing the guide, at least), and since I've already gotten started, I'll handle them. Of course, the scripts themselves will need to be updated on your end, since they're hosted on your GitHub account. You're correct that Ubuntu 16.04 has been our preferred distro for new guides. I agree that we can proceed with Ubuntu 14.04 and Debian 8 for now just to work out the broad details, but we'd love to get support for 16.04 on this guide. If you'd be kind enough to make those changes, we can add them in at some point, whether that's before or after publication. I'll speak with our docs team and get back to you about how we'd like to implement those updates. Thanks!
|
@pathbreak I spoke with the team and we're okay with publishing the guide for Ubuntu 14.04 and Debian 8, then updating it at a later time. So once we've gone through the editing process and published this guide, we can work on updating it for Ubuntu 16.04 and arrange the details then. For now, I'll continue testing and editing the guide as-is.
|
@pbzona Thank you for clarifying that. The Ubuntu 16 support will have to go at a slower pace on my side because of my other projects. I'll target the end of this month to make and test the script changes.
|
@pathbreak Sounds great. There's no rush for the Ubuntu 16.04 update, so it's okay if that takes a bit longer. This is an extensive guide already, so the initial review will take some time on our part. I would, however, ask that you make any updates in a separate branch to avoid conflicts here as we work through it. Once this is merged, we can integrate the two.
|
@pathbreak I've noticed when creating the Zookeeper cluster that your config file comments suggest a cluster size of 5-9 nodes, but the default value is 3. To keep things simpler, would you be willing to specify 3-9 as the optimal number of nodes instead? Of course a 5-node minimum cluster would allow for more redundancy, but I'd like to make sure that no one reads that line and feels that 5 nodes are too expensive to continue. Since you'll be updating the scripts, what are your thoughts on making that change?
Good catch! I'll update the config file template to suggest 3-9. |
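For reference, a hedged sketch of how the updated comment and default might read in the conf template (the variable name below is illustrative and may not match the actual template in the scripts repo):

```sh
# Number of nodes in the Zookeeper ensemble. 3-9 nodes is optimal; use an odd
# number so the ensemble can still form a majority quorum if a node fails.
CLUSTER_SIZE=3    # illustrative variable name - check the template in the scripts repo
```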
|
@pathbreak I'm receiving an internal server error at the point where I try to access the web UI for monitoring the Storm cluster. It appears Nimbus is running fine on its node, and the Zookeeper nodes are all up and running. Here's what the error looks like: I've checked the ipsets on the Storm client, and the syntax looks fine to me. The output indicates they're flushing and loading just fine. Admittedly, I am not that familiar with Storm or Zookeeper, but I've checked what seem to be the most important files (storm.yaml, zoo.cfg) and everything seems to be configured properly. Any idea where else this error might be coming from?
|
@pathbreak To give some additional information, I think I've narrowed the problem down to the Zookeeper cluster. I'm seeing a lot of connection exceptions in the logs, accompanied by the message "Cannot open channel to at election address c1-pri-zknode/xx.xx.xx.xx:3888" and then "connection refused" or "connect timed out." This would explain the issue with the Storm UI failing to connect. So far I have adjusted
|
@pbzona, Thank you for those symptoms and details. Briefly, the Storm application flow is:
Based on this flow:
I think I'll add a diagnostics script to do some quick checks on a cluster, since any problem you run into, other users may run into too. It'll take some time for me to write, so I've noted some of the steps here in this comment so that you can try to proceed with them immediately.
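In the meantime, one quick check worth running is whether each Zookeeper node can reach its peers on the ports listed in zoo.cfg - the standard quorum port (2888) and leader-election port (3888). A minimal sketch, with placeholder host names and a config path that may differ in this setup:

```sh
# On any Zookeeper node: confirm the ensemble entries, then test peer reachability.
grep '^server\.' /etc/zookeeper/conf/zoo.cfg   # adjust the path to your installation
nc -zv c1-pri-zknode2 2888                     # quorum port
nc -zv c1-pri-zknode2 3888                     # leader-election port (the one failing above)
```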
|
@pathbreak Thanks for the detailed response. My Zookeeper nodes are aware of each other and Zookeeper is running, but there are no members in the all-whitelists set. This would explain why they're not connecting to anything. Rather than add the ipsets manually and have them possibly be overwritten, is there a simple way to add whitelists to the Zookeeper nodes? I'm not sure I understand how your script handles that - it seems that it loads a set of iptables rules upon creating each node in the cluster, but I don't see where the ipset
Yes, there is an "update-firewall" command to recreate and distribute the ipsets and iptables rules to all Zookeeper nodes: A correct configuration should have these ipsets and iptables rules:
It's done in create_cluster_security_configurations. The Zookeeper firewall configuration is touched during these actions:
However, it is not touched when a user is added to the Storm client whitelist. I would like to understand the sequence of actions that led to this configuration failure, and see how to prevent it from happening again.
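For reference while diagnosing, a hedged sketch of how to inspect the current firewall state on a Zookeeper node (the set names will depend on your cluster name):

```sh
# List all ipsets and their members; the whitelist sets created by the scripts
# should show the cluster nodes' IP addresses as members.
sudo ipset list

# Dump the iptables rules; rules that use an ipset reference it with
# "-m set --match-set <setname>".
sudo iptables -S
sudo iptables -L -n -v
```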
|
@pathbreak Thanks, this clears things up quite a bit. I tried the However, here's an interesting piece of output on the command you suggested: I've obfuscated the IP addresses, but I did confirm they match the private IPs of the Zookeeper nodes. Unfortunately I've stopped and started the clusters so many times that I can't recall every action I took. While trying to resolve the problem, I've always shut down the Storm cluster prior to the Zookeeper cluster (per the guide) and any configuration changes on the cluster nodes have been made only to One thing that stands out, and I suspect this might be the problem, is that Here are my .ipsets and .info files: I should mention that the .info file was named
|
@pbzona Thank you very much for those details and observations - they helped me realize that the root cause of all the Zookeeper failures is buggy cluster name handling in my scripts. Many areas of the scripts assume that the value of the CLUSTER_NAME variable in the conf files and the cluster directory name are the same (example: if CLUSTER_NAME is "zk-cluster-example", the directory should be "/zk-cluster-example/"), but this is not enforced in the scripts or the default values, nor made clear in my article. This mismatch leads to those file-not-found errors, because the scripts expect to find those files under a "./CLUSTER_NAME/" directory, but that directory is named something else. I remember I had to change the cluster name handling at one point because the same name is used for ipsets, and ipset imposes some restrictions on names. It looks like these bugs originated at that point. Since I was in the habit of always setting CLUSTER_NAME to the directory name, I failed to discover this bug (a good example of being too familiar with one's own code while testing 😃 ). One possible solution I can think of that doesn't require changing the article is to impose those name restrictions at the very start when creating the new cluster conf directory, and to use the directory name as the cluster name. However, I'll have to analyze the code to make sure it doesn't impact anything else or force too many changes in the article. For now, I think you have two options to proceed in a way that this does not become a recurring blocker for further testing:
|
|
@pathbreak Thanks for taking a look, that explains a lot! So if I recreate the clusters, I should use the default
|
The directory name given with "... new-cluster-conf DIRECTORY_NAME" and the value of CLUSTER_NAME should be the same. If both are "zk-cluster-example" or "storm-cluster-example", it'll work. Any name should be <= 25 characters (due to ipset restrictions).
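A hedged pre-flight check along those lines; the conf file name below is a placeholder for whichever file in the cluster directory defines CLUSTER_NAME:

```sh
# Verify that the cluster directory name matches CLUSTER_NAME and respects the
# 25-character limit mentioned above.
CONF_DIR="zk-cluster-example"              # placeholder directory name
CONF_FILE="$CONF_DIR/zk-cluster.conf"      # placeholder conf file name

CLUSTER_NAME=$(grep -m1 '^CLUSTER_NAME=' "$CONF_FILE" | cut -d= -f2- | tr -d '"')

if [ "$CLUSTER_NAME" != "$(basename "$CONF_DIR")" ]; then
    echo "Warning: CLUSTER_NAME '$CLUSTER_NAME' does not match directory '$CONF_DIR'" >&2
fi
if [ "${#CLUSTER_NAME}" -gt 25 ]; then
    echo "Warning: CLUSTER_NAME is longer than 25 characters; ipset names may be rejected" >&2
fi
```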
|
@pathbreak I tried naming the clusters after the default filenames as a workaround ( Once I started Nimbus manually, I was still receiving a 503 response. Going back to when I brought up the Storm cluster initially, it doesn't appear Nimbus started properly then either. I ignored it at the time, since the output indicated that this was not necessarily an error. After doing some digging in The Nimbus node only has active connections on ports 22 and 6627 (java, I'm assuming this is Zookeeper). I ran the checks you mentioned above to check the status of the other cluster nodes, so I am pretty sure it's Nimbus causing the issue now. I'm not sure when and how it forms the connection to the client node, and I am thinking this is probably where the issue is, since the Zookeeper->Nimbus and browser->client node connections seem okay. Any suggestions? EDIT: When checking the health of the cluster hosts, I performed the checks you mentioned previously. Everything looked normal, and it seems that renaming the clusters did in fact resolve the ipsets issue I was having before. I've tried restarting Nimbus via And now it seems the browser->client may not be fine after all. Pings sent to the client's public IP are timing out. A traceroute shows 100% packet loss at the Storm client. Sorry for so many edits, this should be my last one until I hear back.
|
@pathbreak I've found the source of the issue: the Storm UI was not starting on the client node (I should have thought to check for this in the running processes, but didn't expect it would be separate from Apache). I had to call the I am still not certain how the connection is being formed between the client and Nimbus, as there are still only two active listening ports on the Nimbus node. Are all Java processes being run on the same port? Additionally, the Storm UI shows only one supervisor node, and I have not made any configuration changes to either of the supervisors. I did not touch the Zookeeper clusters (aside from checking their hosts and running processes), and I can't think what else could have impacted the supervisors. Any idea what may have caused this? One last question: how much of the starting/stopping of clusters and services is handled by your scripts, as opposed to the services themselves (meaning Zookeeper and Storm)? If there are bugs in the scripts I can come back to this project once they've been fixed, but if these errors are being caused by the actual services, it may require some additional explanation in the guide.
|
@pbzona, I tried to replicate all the failures as best as I could based on the sequence of actions described so far, to discover what needs to be corrected or changed.
Can you please check? The Storm-ui daemon, once started, is actually quite fault-tolerant. For example, I experimented with shutting down the Nimbus daemon and the Zookeeper cluster, and then starting storm-ui. It did come up correctly, and showed only the expected internal server (ThriftException) errors we have seen in one of your earlier gists. It doesn't contact the Zookeeper cluster directly at all, so the state of the Zookeeper cluster won't affect Storm-ui directly; it gets all its information from the Nimbus daemon over port 6627. So if Nimbus is down or waiting for ZK, Storm-UI's requests will fail with that internal exception. But even if Storm-ui goes down, it'll be brought up again because it runs under Supervisord's monitoring. One way I was able to prevent Storm-ui from coming up at all, and thus reproduce the 503 error, was by modifying "storm.yaml" to make it malformed YAML. Otherwise I can't think of any reason for storm-ui not to come up at all, or for a user to have to restart it manually. Since the article does mention that storm.yaml can be customized, perhaps a warning or troubleshooting section can be added recommending that users validate the file with https://yaml-online-parser.appspot.com/ after modifications (a quick local check is sketched at the end of this comment).
There is only one Java process on the Nimbus node, and that is the Nimbus daemon. Having only two listening ports - 22 and 6627 - on Nimbus is as expected. When a browser request is made to the storm-ui webapp, it opens a temporary connection to Nimbus on port 6627, gets the requested information, closes that connection, and renders the page.
This means that, for some reason, the storm-supervisor daemon running on that supervisor node failed to start up correctly, or failed to connect to the Zookeeper cluster and announce its state. Storm UI gets all its info from the Nimbus daemon, and Nimbus in turn gets the current state of the cluster from Zookeeper. So if a supervisor node couldn't announce itself to Zookeeper, Nimbus won't know about it and Storm UI won't see it. Can you please check
All services are started and stopped explicitly by my scripts. Supervisord is used to start them under monitoring, so that if they go down for some reason after having started, they're brought back up automatically. The services themselves don't start or stop other services. I did try to reproduce all the failures here, but some actions turned out to be successful for me. For example, I was able to rename cluster directories to "...-example" and still get the clusters to work correctly, with the only change being to update the ZOOKEEPER_CLUSTER path in Also, I'm implementing getting rid of the CLUSTER_NAME variable altogether from the conf files, and making the directory name itself the cluster name. It won't change any of the command-line code in the article. I am doing this in the 'review1' branch of my scripts repo and docs repo. I'll ensure the change is thoroughly tested. Is this workflow OK with you while you proceed by freshly creating the clusters, or would you prefer pausing your tests till I make these changes and merge them to master?
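To make the checks above concrete, here is a hedged diagnostic sketch; the storm.yaml location, log path, and Supervisord program names are placeholders that may differ on your nodes:

```sh
# 1. Validate storm.yaml locally instead of the online parser (requires PyYAML):
python -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' /path/to/storm.yaml \
    && echo "storm.yaml parses as valid YAML" \
    || echo "storm.yaml is malformed"

# 2. On the Nimbus node, confirm the expected listening ports (22 and 6627):
sudo ss -tlnp | grep -E ':(22|6627)'

# 3. On the supervisor node missing from the UI, check the daemons running
#    under Supervisord and look for Zookeeper connection errors in the logs:
sudo supervisorctl status
sudo tail -n 50 /path/to/storm/logs/supervisor.log
```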
|
@pathbreak Thanks for the details. There are no errors in Upon checking the stderr and stdout logs on the failed supervisor node, both are empty. I checked on the working supervisor, and it seems those are empty as well. Neither of these nodes (or any other) was restarted over the weekend, so I am sure there was no data loss. When I got the 503 error, this was after creating the clusters freshly with names matching CLUSTER_NAME, although I did create everything from the same workstation that I set up when I originally tested (and got the thrift exception). I'll try again from scratch and let you know if I run into any issues. You can go ahead and merge your changes once they're finished - I've just been trying to get things running smoothly, so I can reimplement most of my changes on your updated branch. Do you have an expected timeframe? If it'll only be a day or two, I'll continue testing the current version to see if I can get it running - any changes you make should only simplify things in preparing the draft. If it'll be longer, that's no problem, but I'll probably put testing on hold and come back to it.
|
@pbzona, Thank you for checking those logs. I didn't realize you had gotten 503 errors even after recreating the clusters freshly; I mistakenly thought it had happened during the previous test sequence, sorry about that. It's strange that storm-ui did not start even with default configurations; I'm not really sure what's going on there. I'll retest and see if I can reproduce it somehow. As for the changes, I estimate I'll need until this weekend because I'm able to work on this only for a couple of hours at night. I'll make it available on Monday (the 29th).
|
@pathbreak I recreated the entire setup yesterday (including the Linode I was using as a workstation to create the cluster manager) and everything seemed to work fine with no errors. I'm not sure what I did differently, but it's entirely possible there was some silly error on my part the first few times. Thanks for the update on your changes; I'll continue testing the other features of your scripts and resume full editing and testing once your changes have been implemented.
|
@pbzona Thank you for testing; I'm glad it worked without hitches. Please do let me know if you feel some steps can be simplified or further automated. I have now implemented the cluster name simplifications and completed testing on the Linode cloud with both Ubuntu 14.04 and Debian 8 (I've not yet started on the Ubuntu 16.04 support).
example: The software release is available in master and tagged as 'release-0.2.1-alpha' (https://github.com/pathbreak/storm-linode/tree/release-0.2.1-alpha). You need to recreate a cluster manager from scratch for this release, because the cluster manager setup script has been changed to detect and download whatever is the latest 'release*' version of the software. I've also made related changes in the README as well as my article, to serve as a guide in case you need them.
|
@pbzona Thanks for the PR. I'll read through the changed version tonight and tomorrow, and get back to you tomorrow.
|
@pbzona I've completed merging all the review comments and reformatting into this PR branch. All the corresponding code changes (except the Ubuntu 16 changes) have also been committed to the latest release, release-0.3.0 (https://github.com/pathbreak/storm-linode/releases/tag/release-0.3.0).
|
|
> This guide explains how to create Storm clusters on the Linode cloud using a set of shell scripts that use Linode's Application Programming Interfaces (APIs) to programmatically create and configure large clusters. The deployed architecture will look like this:

Concerning the iconography used in this and your other graphics, where did they come from? What licensing is associated with them?

@alexfornuto Thank you for the review comments.
This was written a couple of months ago and I didn't record all the sources at the time. But when I need icons, I usually get them from http://www.iconarchive.com and http://findicons.com with open/libre licenses. Here are three of the icons I used and could find again today by searching:
- http://www.iconarchive.com/show/unleashed-icons-by-pc-unleashed/notebook-icon.html
- http://www.iconarchive.com/show/oxygen-icons-by-oxygen-icons.org/Devices-computer-laptop-icon.html
- http://findicons.com/icon/185522/network_server?id=185909
However, since there are about a dozen more icons used and I am not keen on searching them all again to verify, I have simply removed all icons from my illustrations and committed the new diagrams with the same filenames.

Thanks, but that wasn't necessary, since we'll probably just replace the graphics with ones made in-house. If you'd like to roll back that commit, it will make it easier to merge my copy edit PR once it's complete.
|
|
> The first step is setting up a central *Cluster Manager* to store details of all Storm clusters, and enable authorized users to create, manage or access those clusters. This can be a local workstation or a Linode.

> You can set up multiple Cluster Manager nodes if necessary to satisfy organizational or security needs. For example, two departments can each have their own Cluster Manager nodes to manage their respective clusters and keep data access restricted to their own members.

In the interest of clarity and streamlining, I am considering removing this paragraph. Unless there is some over-arching system that I don't yet grasp, wouldn't another cluster manager and cluster be a separate, unique entity? In other words, wouldn't it just be 'repeating this guide again, for a different department in your organization on a separate environment'?

@alexfornuto Yes, the removal is OK with me. There is no over-arching concept here, and another department can simply repeat the guide, that is correct.
I don't mind making this change myself, but since I'm not clear about the workflow here, do let me know whether I should go ahead and make it, or whether you plan to do it as a PR.

I'll include it in mine, thanks for asking!
> ~/storm-linode/linode_api.py datacenters table

> - `DISTRIBUTION`:
>   This is the ID of the distribution to install on the Cluster Manager Linode. This guide has been tested only on Ubuntu 14.04 or Debian 8; other distributions are not supported.

These IDs are not necessarily static; they may change in the future. Can your linode_api.py also be used to return distribution IDs?

@alexfornuto Yes, it can already print out available distributions and kernels by querying the Linode API.
The syntax is the same:
source ~/storm-linode/api_env_linode.conf
~/storm-linode/linode_api.py distributions table
and
source ~/storm-linode/api_env_linode.conf
~/storm-linode/linode_api.py kernels table
Again, do let me know if I can go ahead with adding this.

UPDATE: I see the option in your .py file, and will include the additional information in my PR.
|
|
> - `IMAGE_DISK_SIZE`
>
>   The size of the image disk in MB. The default value of 5000MB is generally sufficient, since the installation only consists of the OS with Java and Zookeeper software installed.

Is there any reason not to set this as large as the smallest Linode plan will allow, less swap disk space?

@alexfornuto There were a couple of reasons for this decision:
- The most important reason was faster cloning. Since an image is cloned for every node in the cluster, I noticed cluster creation is relatively faster if disk sizes are small. A related observation was that cloning in general was slow when the image was created in a different datacenter from the one where the clones are created. Small disk sizes help in that scenario too. The software allows creating the image in one DC and creating clusters in another.
- Linode allocates a maximum of 10 GB for all images of an account. Experimentally, I've seen it go above that, up to around 12 GB, before image creation failed. Although a 5000 MB disk image doesn't result in 5000 MB consumption of the total space for images, I thought it prudent to default to small disk sizes to avoid image creation failures due to large disks and to enable users to create many images if needed.
- It provides flexibility to users - for example, if they prefer to store data on separate disks on the cluster nodes, separated from the OS and software. This is the recommended Zookeeper configuration for performance reasons: keeping its data on a separate disk so that OS or other software writes don't hamper Zookeeper's write queues.

Works for me, thanks for clarifying!
|
|
> For example:
>
> ./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"

Can we please replace this example with one that uses example files, and not just the ~ alias? I feel it may be confusing.

@alexfornuto I didn't understand this comment and the one below. Please clarify what example files you are referring to, and what is "~ alias". Thank you.
|
|
> For example:
>
> ./zookeeper-cluster-linode.sh cp zk-cluster1 "~" "~/*.data"

As above: Can we please replace this example with one that uses example files, and not just the ~ alias? I feel it may be confusing.
> external_resources:
> - '[Apache Storm project website](http://storm.apache.org/)'
> - '[Apache Storm documentation](http://storm.apache.org/documentation.html)'
> - '[Storm - Distributed and Fault-Tolerant Real-time Computation](http://www.infoq.com/presentations/Storm-Introduction)'

Why is this third-party presentation a trusted resource?

@alexfornuto InfoQ is a reputed and respected resource, the presenter Nathan Marz was the author and lead of the Apache Storm project, and it's a good introductory talk on Storm. I assumed the external resources section was meant to list useful links for readers to learn more about a topic. If it's against policy or anything, I'm fine with removing it and the other links.

It's fine, I just wanted to make sure that it was a trusted source, and based on your reply I'm confident that it is.
|
@pathbreak I've staged the copy edits here
|
|
> For example:
>
> ./zookeeper-cluster-linode.sh cp zk-cluster1 "~" "~/*.data"

@pathbreak I don't know where my other comment went, but I saw your reply to it via email. I was referring to the ~ on this line, which seems to indicate that we're copying the entire contents of the local home folder to the clusters, and changing the output on the other side to *.data. Can you clarify this?

For some reason, GitHub is not showing me the correct line in the document. From your comment, I believe you are referring to this command? If so, then
./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"
is actually copying all *.data files from the cluster manager node to the home directory on each node (which happens to be /root, since the scripts run as root). The command syntax is [CLUSTER] [TARGET DIR] [SOURCE FILES]. It's the reverse of the usual cp/scp argument order, but that's because this command accepts one target directory and multiple source files.
The idea here is to enable a user to transfer their application-specific data files to all nodes with a single command. Since the software does not install any data files, it's left to the user to infer that "*.data" is just a placeholder for their own data files related to the problem they are solving using Storm.
Perhaps it would be clearer if the sentence were reworded to:
For example, if your data files are named "*.data" and your topology requires them for processing, you can copy them to /root on all cluster nodes with:
./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"
The alternative is to remove the example completely, because there are no concrete data files to transfer. The software doesn't install any.

I think we should leave the example in, but explain, as you just did, that source and target are reversed from normal cp, and maybe include the syntax example you noted. Since I've already given you a PR, I'll leave that to you to add after you merge it. I hate dealing with merge conflicts.
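A hedged sketch of how that added explanation and example might read (the data file names are purely illustrative):

```sh
# Syntax: ./storm-cluster-linode.sh cp <CLUSTER-NAME> <TARGET-DIR-ON-NODES> <SOURCE-FILES...>
# Note the order: the target directory comes before the source files (the reverse
# of cp/scp), because one target directory can be followed by multiple source files.
# This copies two illustrative data files from the cluster manager to /root on
# every node of the cluster named storm-cluster1:
./storm-cluster-linode.sh cp storm-cluster1 "~" "~/ratings.data" "~/movies.data"
```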
|
@alexfornuto Thank you for the thorough review and copy edit PR. I have merged all your changes in commit 2cd69d4 and added the
|
Thanks @pathbreak, this looks great! Our graphics team is working on updating the graphics for this guide, then we can publish! In the meantime, we need to know if you prefer the bounty as Linode credit or a PayPal payment, and which email address to use. If you like, you can email these details to contribute@linode.com
|
@alexfornuto Thank you. I have emailed details to contribute@linode.com from "my name at my github username dot com".
|
Thanks @pathbreak, I've got new images staged here. Once we process your payment details I'll get this published. It's almost done!
|
@alexfornuto Great! Thank you for that update. I liked the new images from your graphics team; please convey my thanks for the effort they have put in.
|
@pathbreak this PR is about to be marked as merged, as I pull #578 into master. Thanks for your submission! I've passed along your payment details to our payment department, and you should receive your bounty in 24-48 hours. EDIT: ok... I guess you would need the latest version in your fork for this to show as merged. I'll give you a PR to merge to sync everything up. Not that it matters too much, your guide is in the repo, it's just nice to see the "Merged" icon next to one's own PR, eh?
Syncing up.
|
I saw the published article on your docs site and it's looking fantastic! A special thank you, @pbzona, for your rigorous checking and excellent suggestions for both the article and the scripts. Thank you @alexfornuto and @EdwardAngert for your thorough reviews and feedback. You guys are rockstars! I enjoyed the entire experience of learning and working with you all. Thanks a lot to all of you again, and keep up your excellent work! @alexfornuto Very thoughtful of you for that PR, and yes, it does look neat now 😄!
|
Thanks @pathbreak! This was by far one of the most interesting projects I've gotten to test, and it was truly a pleasure working with you. Great work!
|
Hrmmm... @pathbreak I'm not sure why this still shows as different from master. Maybe you want to go to your branch and git pull --recursive from the main repo?
|
@alexfornuto I've pulled from linode/docs master into this PR branch, bigdata-storm. But after doing so, I noticed that my last commit about the 'cp' command explanation, b546e5a, has not made it into the merge and is missing from the published article's Copy Files to all nodes section. It seems to be missing in the main PR #578 too. Can you please check why that commit is missing from the merge? I probably don't need to do anything more from my side at this point, but if there is anything, please let me know.
|
@pathbreak Yup, there it is. I'll merge now. Thanks again! |
I'd like to contribute this article I wrote, which describes how to create Storm clusters on the Linode cloud using a set of shell scripts that use Linode's powerful Application Programming Interfaces (APIs) to programmatically create and configure large clusters.