
Big Data in the Linode cloud - Part 1: Streaming Data Processing using Apache Storm #419

Merged
alexfornuto merged 4 commits into linode:master from pathbreak:bigdata-storm
Oct 31, 2016

Conversation

@pathbreak
Contributor

I'd like to contribute this article I wrote, which describes how to create Storm clusters on the Linode cloud using a set of shell scripts that call Linode's Application Programming Interfaces (APIs) to programmatically create and configure large clusters.

@pbzona pbzona self-assigned this Jun 2, 2016
@pathbreak
Contributor Author

Any news on this? Thanks in advance.

@alexfornuto
Contributor

Sorry for the delay @pathbreak. Thanks for your patience. We're currently reviewing the topic and your submission. It shouldn't be much longer now before we can get back to you with an answer.

@pbzona
Contributor

pbzona commented Aug 10, 2016

Hi @pathbreak,

I've been assigned to test and edit your guide to start preparing it for publication. Because this is such an extensive setup, this process could take a few days, but if you have any questions please feel free to reach out. Once finished, I'll make a pull request against this branch with any proposed changes.

Thanks, and I look forward to working with you.

@pathbreak
Contributor Author

pathbreak commented Aug 11, 2016

Hi @pbzona,

Thank you, I look forward to it too.

I do have some questions about changes required, because some things have changed since I drafted this article back in February-March 2016:

  • Linode instance RAM sizes have increased and plan names have changed. For example, the old "Linode 1GB" no longer exists, "Linode 2GB" should become the default Linode size, and newer instances like "Linode 120GB" should be included. This involves some changes to both the scripts and the article, but they are mostly trivial.
  • From a couple of comments in other articles here, I understand Ubuntu 16.04 LTS is being preferred over Ubuntu 14.04 LTS for these guides. I haven't investigated Ubuntu 16.04 in detail, but I think it involves more extensive changes to both the scripts and the article because of the shift to systemd-based daemons.

I had held off on incorporating these changes because I wasn't certain the article was being considered for publication. I'm now thinking of handling the first change immediately, before you start your tests and review. For the second, more extensive change, I suggest retaining Ubuntu 14.04 as the only supported Ubuntu version for now, while I investigate Ubuntu 16.04 and add support for it. I can send those changes as an update to the published article later on. What do you think?

@pbzona
Contributor

pbzona commented Aug 11, 2016

Hi @pathbreak

Good call on the Linode plans and sizes - since those changes will be trivial (in the scope of editing the guide, at least), and since I've already gotten started, I'll handle those changes. Of course, the scripts themselves will need to be updated on your end since they're hosted on your GitHub account.

You're correct, Ubuntu 16.04 has been our preferred distro for new guides. I agree that we can proceed with Ubuntu 14.04 and Debian 8 for now just to work out the broad details, but we'd love to get support for 16.04 on this guide. If you'd be kind enough to make those changes we can add them in at some point, whether that's before or after publication. I'll speak with our docs team and get back to you about how we'd like to implement those updates.

Thanks!

@pbzona
Contributor

pbzona commented Aug 11, 2016

@pathbreak I spoke with the team and we're okay with publishing the guide for Ubuntu 14.04 and Debian 8, then updating it at a later time. So once we've gone through the editing process and published this guide, we can work on updating it for Ubuntu 16.04 and arrange details then. For now, I'll continue testing and editing the guide as-is.

@pathbreak
Contributor Author

@pbzona Thank you for clarifying that.
I'll make and test the instance related changes in my scripts this weekend, and let you know when it's done.

The Ubuntu 16.04 support will have to go at a slower pace on my side because of my other projects. I'll target the end of this month to make and test the script changes.

@pbzona
Contributor

pbzona commented Aug 11, 2016

@pathbreak Sounds great. There's no rush for the Ubuntu 16.04 update, so if that takes a bit longer it's okay. This is an extensive guide already, so the initial review will take some time on our part.

I would, however, ask that you make any updates in a separate branch to avoid conflicts here as we work through it. Once this is merged, we can integrate the two.

@pbzona
Contributor

pbzona commented Aug 11, 2016

@pathbreak I've noticed when creating the Zookeeper cluster that your config file comments suggest a cluster size of 5-9 nodes, but the default value is 3. In order to keep things simpler, would you be willing to specify 3-9 as an optimal number of nodes instead? Of course a 5 node minimum cluster would allow for more redundancy, but I'd like to make sure that no one reads that line and feels that 5 nodes are too expensive to continue.

Since you'll be updating the scripts, what are your thoughts on making that change?

@pathbreak
Contributor Author

@pbzona,

would you be willing to specify 3-9 as an optimal number of nodes instead

Good catch! I'll update the config file template to suggest 3-9.
I'll be making these, and any other changes arising from these discussions, in "review1" branches of my repos.
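For background on why 3-9 is a sensible range: a ZooKeeper ensemble needs a strict majority of nodes alive, so an ensemble of n nodes tolerates floor((n-1)/2) failures. A quick sketch of that arithmetic:

```shell
# ZooKeeper needs a strict majority alive, so n nodes tolerate
# floor((n - 1) / 2) failures. Even sizes add cost but no extra fault
# tolerance (4 tolerates no more than 3), which is why odd sizes are
# usually recommended.
for n in 3 4 5 7 9; do
  echo "$n nodes tolerate $(( (n - 1) / 2 )) failure(s)"
done
```

So 3 nodes already tolerate one failure; 5 and above add redundancy at extra cost, which is consistent with suggesting 3 as the default and 3-9 as the documented range.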

@pbzona
Contributor

pbzona commented Aug 16, 2016

@pathbreak I'm receiving an internal server error at the point where I try to access the web UI for monitoring the Storm cluster. It appears Nimbus is running fine on its node, and the Zookeeper nodes are all up and running.

Here's what the error looks like:
https://gist.github.com/pbzona/065b43ab544f423df26c502726043523

I've checked the ipsets on the Storm client, and the syntax looks fine to me. The output indicates they're flushing and loading just fine. Admittedly I'm not that familiar with Storm or Zookeeper, but I've checked what seem to be the most important files (storm.yaml, zoo.cfg) and everything seems to be configured properly. Any idea where else this error might be coming from?

@pbzona
Contributor

pbzona commented Aug 16, 2016

@pathbreak To give some additional information, I think I've narrowed the problem down to the Zookeeper cluster. I'm seeing a lot of connection exceptions in the logs, accompanied by the message "Cannot open channel to at election address c1-pri-zknode/xx.xx.xx.xx:3888" and then "connection refused" or "connect timed out."

This would explain the issue with the Storm UI failing to connect. So far I have adjusted syncLimit and initLimit in zoo.cfg to give it about 5 seconds to establish the connection, but no luck. I can give you more information if you need it, just looking for some direction as to where this may be coming from.

@pathbreak
Contributor Author

pathbreak commented Aug 16, 2016

@pbzona, thank you for those symptoms and details.

Briefly, the Storm application flow is:

  • Storm UI on client server => contacts Nimbus
  • Nimbus => gets current state of cluster (list of supervisors, list of topologies, etc) from Zookeeper cluster
  • Supervisors => update their states to Zookeeper cluster

Based on this flow:

  • I suspect the Nimbus daemon on the nimbus node is down because it couldn't connect to the Zookeeper cluster. That explains the exception in your gist. UPDATE: After some testing, it turns out Nimbus does stay up even if Zookeeper is down and keeps retrying the connection, but in this state it does not respond to requests from the Storm UI on port 6627.
  • From your comment above, it may be that the Zookeeper cluster itself isn't in a healthy state.
    Some configuration checks for zookeeper:
    1. Does /etc/hosts on each zookeeper node know about all the other nodes?
       It should contain something like this:

        #START:zk-cluster1
        xx.xx.xx.xx c1-pub-zknode1
        xx.xx.xx.xx c1-pri-zknode1
        xx.xx.xx.xx c1-pub-zknode2
        xx.xx.xx.xx c1-pri-zknode2
        xx.xx.xx.xx c1-pub-zknode3
        xx.xx.xx.xx c1-pri-zknode3
        #END:zk-cluster1

    2. On each zookeeper node, 'sudo ipset list' should show all zookeeper nodes as part of
       one of the ipsets:

        root@c1-pri-zknode1:~# sudo ipset list
        Name: zk-cluster1-wl
        Type: hash:ip
        Revision: 2
        Header: family inet hashsize 1024 maxelem 65536
        Size in memory: 16552
        References: 1
        Members:
        xx.xx.xx.xx
        xx.xx.xx.xx
        xx.xx.xx.xx

        Name: all-whitelists
        Type: list:set
        Revision: 2
        Header: size 32
        Size in memory: 160
        References: 1
        Members:
        zk-cluster1-wl

    3. Verify that the zookeeper daemon is running on each zk node:

        ps -ef | grep java

    4. I have tested on Linode's cloud with the default settings for syncLimit and initLimit, so I don't think they're the problem; it's probably better to reset them to defaults to avoid complicating the diagnosis.
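One hedged aside on the `ps -ef | grep java` check above: that pattern also matches the grep process itself, which can look like a false positive. A variant that avoids this:

```shell
# pgrep matches against the full command line (-f) and prints it (-a),
# and never matches its own process the way `ps -ef | grep java` can.
pgrep -af java || echo "no java process found: zookeeper daemon is not running"
```

The classic alternative, `ps -ef | grep '[j]ava'`, achieves the same thing by making the pattern not match its own command line.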

I think I'll add a diagnostics script to do some quick checks on a cluster, since any problem you run into, other users may run into too. It'll take some time for me to write, so I've written some of the steps here in the comment so that you can proceed with them immediately.
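The hop-by-hop flow described earlier (Storm UI to Nimbus, Nimbus and supervisors to Zookeeper) can also be probed directly. A minimal sketch using bash's built-in /dev/tcp; the hostnames are the example cluster's placeholder names, so substitute your own:

```shell
# probe HOST PORT -- reports whether a TCP connection can be opened.
# Uses bash's /dev/tcp redirection, so no extra tools are needed.
probe() {
  if (: < "/dev/tcp/$1/$2") 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

probe c1-pub-nimbus  6627   # Storm UI -> Nimbus (Thrift)
probe c1-pub-zknode1 2181   # Nimbus / supervisors -> ZooKeeper client port
```

Run from the client node, an UNREACHABLE result on 6627 points at Nimbus (or the firewall), while an UNREACHABLE 2181 points at the Zookeeper cluster.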

@pbzona
Contributor

pbzona commented Aug 16, 2016

@pathbreak Thanks for the detailed response. My Zookeeper nodes are aware of each other and Zookeeper is running, but there are no members in the all-whitelists set. This would explain why they're not connecting to anything.

Rather than add the ipsets manually and have them possibly be overwritten, is there a simple way to add whitelists to the Zookeeper nodes? I'm not sure I understand how your script handles that - it seems that it loads a set of iptables rules upon creating each node in the cluster, but I don't see where the ipset zk-cluster1-wl is added to the all-whitelists set.

@pathbreak
Contributor Author

no members in the all-whitelists set...is there a simple way to add whitelists to the Zookeeper nodes?

Yes, there is an "update-firewall" command to recreate and distribute the ipsets and iptable rules to all zookeeper nodes:

  ./zookeeper-cluster-linode.sh update-firewall zk-cluster1/zkcluster1.conf

A correct configuration should have these ipsets and iptables:

    root@c1-pri-zknode1:~# sudo ipset list
    Name: zk-cluster1-wl
    Members:
    ...

    Name: storm-cluster1-wl
    Members:
    ...

    Name: all-whitelists
    Members:
    zk-cluster1-wl  
    storm-cluster1-wl



    root@c1-pri-zknode1:~# sudo iptables -L -n -v
    Chain INPUT (policy DROP 0 packets, 0 bytes)
     pkts bytes target     prot opt in     out     source               destination         
        0     0 ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0           
      161  9540 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set all-whitelists src
      169 21722 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
        2   120 ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:22

    Chain FORWARD (policy DROP 0 packets, 0 bytes)
     pkts bytes target     prot opt in     out     source               destination         

    Chain OUTPUT (policy ACCEPT 237 packets, 23186 bytes)
     pkts bytes target     prot opt in     out     source               destination

how your script handles that...

It's done in create_cluster_security_configurations.

The Zookeeper firewall configuration is touched during these actions:

  • Zookeeper cluster creation
  • Any Storm cluster creation/deletion that refers to this Zookeeper cluster
  • When additional nodes are added to referring Storm clusters
  • 'update-firewall' command to either the Zookeeper cluster itself, or to any of the referring Storm clusters. This command should be run
    when an admin adds custom firewall rules to "zk-cluster1/zk-cluster1-.rules" or "storm-cluster/storm-cluster1-.rules".

However, it is not touched when a user is added to the Storm client whitelist.
So one of the actions above must have failed.

I would like to understand the sequence of actions that led to this configuration failure, and see how to avoid it in the future.
Do you remember the sequence of actions you performed?
Can you take a quick look at the "zk-cluster1/.ipsets", "zk-cluster1/.rules" and "zk-cluster1/zk-cluster1.info" files and see if anything seems wrong? Or post them here so I can try to spot what has gone wrong? (It's fine if the IP addresses are obfuscated.)

@pbzona
Contributor

pbzona commented Aug 17, 2016

@pathbreak Thanks, this clears things up quite a bit. I tried the update-firewall command and it correctly adds all Zookeeper nodes to the whitelist, and my ipsets and iptables, including the all-whitelists set, resemble the example you provided.

However, here's an interesting piece of output from the command you suggested:
https://gist.github.com/pbzona/5a5a856a3515b02131fd5dd31b9315f2

I've obfuscated the IP addresses, but I did confirm they match the private IPs of the Zookeeper nodes.

Unfortunately I've stopped and started the clusters so many times that I can't recall every action I took. While trying to resolve the problem, I've always shut down the Storm cluster prior to the Zookeeper cluster (per the guide) and any configuration changes on the cluster nodes have been made only to storm.yaml and zoo.cfg respectively. These are now returned to their default values.

One thing that stands out, and I suspect this might be the problem, is that zk-cluster1-whitelist.ipsets was not created upon creating the cluster, and I had to copy it from zk-cluster-example-whitelist.ipsets manually. I did not make any changes after copying it, aside from changing the hostnames (zk-cluster-example-wl to zk-cluster1). This should have been created with the cluster, correct? If so, it's possible that I missed a step or failed to fill in an image variable, so we can probably attribute that to user error.

Here are my .ipsets and .info files:
https://gist.github.com/pbzona/13666643c3eb3c51b508db4c14e073e0
https://gist.github.com/pbzona/dc70de2c531a0bcc5925dd0bccd4c4cc

I should mention that the .info file was named zk-cluster-example.info rather than zk-cluster1.info. Is this expected?

@pathbreak
Contributor Author

pathbreak commented Aug 17, 2016

@pbzona Thank you very much for those details and observations, because they helped me realize that the root cause of all the zookeeper failures is buggy cluster name handling in my scripts.

Many areas in the scripts assume that the value of the CLUSTER_NAME variable in a conf file and the cluster directory name are the same (for example, if CLUSTER_NAME is "zk-cluster-example", the directory should be "./zk-cluster-example/"), but this is not enforced by the scripts or the default values, nor made clear in my article. The mismatch leads to those file-not-found errors, because the scripts expect to find those files under a "./CLUSTER_NAME/" directory but that directory is named something else.

I remember I had to change the cluster name handling at one point because the same name is used for ipsets, and ipset imposes some restrictions on names. It looks like these bugs originated at that point. Since I was in the habit of always setting CLUSTER_NAME to the directory name, I failed to discover this bug (a good example of being too familiar with one's own code while testing 😃).

One possible solution that doesn't require changing the article is to impose those name restrictions at the very start, when creating a new cluster conf directory, and use the directory name as the cluster name. However, I'll have to analyze the code to make sure this doesn't impact something else or impose too many changes on the article.

For now, I think you have two options to proceed, so that this doesn't become a recurring blocker for further testing:

  • change the CLUSTER_NAME variables of both ZK and Storm clusters to match their respective directories, change all the filenames to start with that CLUSTER_NAME, and change their contents using "sed" to replace "storm-cluster-example/zk-cluster-example" with "storm-cluster1/zk-cluster1".
  • or, probably easier, simply recreate both clusters freshly and change CLUSTER_NAME to match their directory names.
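The second (recreate) option needs no tooling, but the first can be sketched as a rename-and-sed pass. Demonstrated here on a scratch directory so it's safe to run anywhere; the conf filename and contents are illustrative, not the scripts' exact layout, and you would point it at the real cluster conf directory:

```shell
# Scratch demo of option 1: make every occurrence of the old example
# cluster name match the directory name. The file layout here is an
# assumption based on the names used in this thread.
dir=$(mktemp -d)/zk-cluster1
mkdir -p "$dir"
printf 'CLUSTER_NAME=zk-cluster-example\n' > "$dir/zk-cluster1.conf"

# Rewrite the old name in every file that mentions it (GNU sed -i):
grep -rl 'zk-cluster-example' "$dir" | xargs sed -i 's/zk-cluster-example/zk-cluster1/g'

cat "$dir/zk-cluster1.conf"   # CLUSTER_NAME=zk-cluster1
```

The same pass, run once for the ZK cluster and once for the Storm cluster, covers the variable, the filenames, and any cross-references.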

@pbzona
Contributor

pbzona commented Aug 18, 2016

@pathbreak Thanks for taking a look, that explains a lot!

So if I recreate the clusters, I should use the default zk-cluster-example and storm-cluster-example as the cluster names? I'll give that a try and let you know if I encounter any errors.

@pathbreak
Contributor Author

The directory name given with "... new-cluster-conf DIRECTORY_NAME" and the value of CLUSTER_NAME should be the same. If both are "zk-cluster-example" or "storm-cluster-example", it'll work. Any name should be <= 25 characters (due to ipset restrictions).
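Since the 25-character limit comes from ipset's set-name length restriction, a quick pre-flight check before picking a cluster name might look like this (the function name is just an illustration, not part of the scripts):

```shell
# Reject cluster names that would later fail as ipset set names.
check_name() {
  if [ "${#1}" -le 25 ]; then
    echo "ok: $1 (${#1} chars)"
  else
    echo "too long: $1 (${#1} chars, max 25)"
  fi
}

check_name zk-cluster-example       # ok: 18 chars
check_name storm-cluster-example    # ok: 21 chars
```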

@pbzona
Contributor

pbzona commented Aug 19, 2016

@pathbreak I tried naming the clusters after the default filenames as a workaround (zk-cluster-example and storm-cluster-example) and I'm receiving a 503 Service Unavailable. After checking the status of the hosts in each cluster, I discovered that Nimbus failed to start properly.

Once I started Nimbus manually, I was still receiving a 503 response. Going back to when I brought up the Storm cluster initially, it doesn't appear Nimbus started properly then either. I ignored it at the time, since the output indicated that this was not necessarily an error.

After doing some digging in nimbus.log, I failed to see any lines indicating that it connected properly to the client node (I don't know what this would look like, but it was easy to spot where Zookeeper connections were formed). It appears the client node only has active connections on ports 80 and 22.

The Nimbus node only has active connections on ports 22 and 6627 (java, I'm assuming this is Zookeeper).

I ran the checks you mentioned above on the other cluster nodes, so I am pretty sure it's Nimbus causing the issue now. I'm not sure when and how it forms the connection to the client node, and I am thinking this is probably where the issue is, since the Zookeeper->Nimbus and browser->client node connections seem okay. Any suggestions?

EDIT: When checking the health of the cluster hosts, I performed the checks you mentioned previously. Everything looked normal, and it seems that renaming the clusters did in fact resolve the ipsets issue I was having before.

I've tried restarting Nimbus via supervisorctl restart storm-nimbus. It reestablishes its connection with Zookeeper just fine, but not with the client node.

And now it seems the browser->client connection may not be fine after all. Pings sent to the client's public IP are timing out, and a traceroute shows 100% packet loss at the Storm client. Sorry for so many edits; this should be my last one until I hear back.

@pbzona
Contributor

pbzona commented Aug 19, 2016

@pathbreak I've found the source of the issue: the Storm UI was not starting on the client node (I should have thought to check for this in the running processes, but didn't expect it would be separate from Apache). I had to call the storm ui method manually to get the UI to serve properly. This was unexpected since it was partially loading with my last configuration, although it was throwing an error.

I am still not certain how the connection is being formed between the client and Nimbus, as there are still only two active listening ports on the Nimbus node. Are all Java processes being run on the same port?

Additionally, the Storm UI shows only one supervisor node and I have not made any configuration changes to either of the supervisors. I did not touch the Zookeeper clusters (aside from checking their hosts and running processes), and I can't think what else could have impacted the supervisors. Any idea what may have caused this?

One last question: how much of the starting/stopping of clusters and services is handled by your scripts as opposed to the services (meaning Zookeeper and Storm themselves)? If there are bugs in the scripts I can come back to this project once they've been fixed, but if these errors are being caused by the actual services, it may require some additional explanation in the guide.

@pathbreak
Contributor Author

@pbzona, I tried to replicate all the failures as best I could, based on the sequence of actions described so far, to discover what needs to be corrected or changed.

Regarding the Storm UI 503 error:

Can you please check /var/log/supervisor/storm-ui-stderr---*.log on Client node to see if there are any errors and let me know?

The storm-ui daemon, once started, is actually quite fault-tolerant. For example, I experimented with shutting down the nimbus daemon and the zookeeper cluster, and then starting storm-ui. It came up correctly, and showed only the expected internal server (ThriftException) errors we saw in one of your earlier gists.

It doesn't contact Zookeeper cluster directly at all, so the state of Zookeeper cluster won't affect Storm-ui directly.

What it does is get all its information from the Nimbus daemon over port 6627. So if Nimbus is down or waiting for ZK, Storm-UI's requests will fail with that internal exception.

But even if Storm-ui goes down, it'll be brought up again because it runs under Supervisord's monitoring.

One way I was able to prevent storm-ui from coming up at all, and thus reproduce the 503 error, was by modifying "storm.yaml" to make it malformed YAML.
YAML is a rather finicky format: even small, easily missed things like a missing space after a colon, a tab instead of a space, or extra indentation can make it malformed. For example, all I did was remove the space after the colon on the line storm.log.dir: "/var/log/storm", and it prevented storm-ui from coming up (more precisely, supervisord kept trying to restart and fork it, and failing).
You mentioned previously that some configuration changes were made to storm.yaml and then reverted. Is there a possibility that it inadvertently became malformed while being reverted? If you remember the change or have the file somewhere, you can try validating it at https://yaml-online-parser.appspot.com/.

Otherwise I can't think of any reason for storm-ui not to come up at all, or for a user to have to restart it manually. Since the article does mention that storm.yaml can be customized, perhaps a warning or troubleshooting section can be added recommending that users validate the file at https://yaml-online-parser.appspot.com/ after modifications.
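In addition to an online validator, the specific mistake described above (no space after a key's colon) is easy to grep for locally. This is a rough heuristic only, not a YAML parser, and the function name is just for illustration:

```shell
# Flags top-level "key:value" lines where the colon is not followed by
# a space; prints nothing for well-formed lines. Heuristic only -- it
# does not catch tabs, indentation problems, or other YAML errors.
check_colon_space() {
  echo "$1" | grep -E '^[A-Za-z][^:]*:[^ ]' || true
}

check_colon_space 'storm.log.dir:"/var/log/storm"'    # flagged (line printed)
check_colon_space 'storm.log.dir: "/var/log/storm"'   # fine (no output)
```

Running the same grep over the whole file (`grep -nE '^[A-Za-z][^:]*:[^ ]' storm.yaml`) gives line numbers for anything suspicious.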

how the connection is being formed between the client and Nimbus, as there are still only two active listening ports on the Nimbus node. Are all Java processes being run on the same port?

There is only one java process on the Nimbus node, and that is the Nimbus daemon. The two listening ports on Nimbus (22 and 6627) are as expected. When a browser request is made to the storm-ui webapp, it opens a temporary connection to Nimbus port 6627, gets the requested information, closes that connection, and renders the page.
The connection is always initiated by storm-ui to the Nimbus daemon; Nimbus never initiates one to storm-ui.

the Storm UI shows only one supervisor node

This means that for some reason the storm-supervisor daemon on that supervisor node failed to start up correctly, or failed to connect to the Zookeeper cluster and announce its state.

Storm UI gets all its info from Nimbus daemon, and Nimbus in turn gets current state of cluster from Zookeeper. So if a supervisor node couldn't announce itself to Zookeeper, Nimbus won't know about it and Storm UI won't see it.

Can you please check /var/log/supervisor/storm-supervisor-stderr---*.log and /var/log/supervisor/storm-supervisor-stdout---*.log on that Supervisor node to see if there are any errors and let me know?

how much of the starting/stopping of clusters and services is handled by your scripts as opposed to the services (meaning Zookeeper and Storm themselves)?

All services are started and stopped explicitly by my scripts. Supervisord is used to start them under monitoring, so that if they go down for some reason after having started, they're brought back up automatically. The services themselves don't start or stop other services.
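For reference, because the daemons run under supervisord, their state (and restarts, like the `supervisorctl restart storm-nimbus` used earlier in this thread) is inspected with supervisorctl on the node in question. Guarded here so the sketch is safe to run on a machine without supervisord:

```shell
# On a cluster node you would run supervisorctl directly; the guard
# just keeps this sketch runnable anywhere.
if command -v supervisorctl >/dev/null 2>&1; then
  supervisorctl status || true           # e.g. "storm-nimbus  RUNNING ..."
  # supervisorctl restart storm-nimbus   # restart a single daemon
else
  echo "supervisorctl not installed on this machine"
fi
```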

I did try to reproduce all the failures here, but some actions turned out to be successful for me. For example, I was able to rename the cluster directories to "...-example" and still get the clusters to work correctly, with the only change being to update the ZOOKEEPER_CLUSTER path in storm-cluster-example.conf. I think all the errors seen here are cascaded errors due to the initial difference between directory names and cluster names, and renaming or editing some files manually. Normally, there is no need for a user to start storm-ui or storm-nimbus manually. Is it possible for you to recreate the clusters freshly, this time with the directory names and CLUSTER_NAME set to the same value?

Also, I'm getting rid of the CLUSTER_NAME variable altogether from the conf files and making the directory name itself the cluster name. It won't change any of the command line code in the article. I am doing this in the 'review1' branch of my scripts repo and docs repo, and I'll ensure the change is thoroughly tested. Is this workflow OK with you while you proceed by freshly creating the clusters, or would you prefer pausing your tests till I make these changes and merge to master?

@pbzona
Contributor

pbzona commented Aug 22, 2016

@pathbreak Thanks for the details. There are no errors in /var/log/supervisor/storm-ui-stderr---*.log, and I've checked storm.yaml for extra whitespace and the like. I've recently had YAML problems with other applications, so I recreated the cluster from scratch to make sure the default settings were exactly as they should be. I double-checked them in less (to avoid any mistaken edits on my part) and everything seems to be fine.

Upon checking the stderr and stdout logs on the failed supervisor node, both are empty. I checked on the working supervisor, and those are empty as well. Neither of these nodes (nor any other) was restarted over the weekend, so I am sure there was no data loss.

When I got the 503 error, this was after creating the clusters freshly with names matching CLUSTER_NAME, although I did create everything from the same workstation that I set up when I originally tested (and got the thrift exception). I'll try again from scratch and let you know if I run into any issues.

You can go ahead and merge your changes once they're finished - I've just been trying to get things running smoothly, so I can just reimplement most of my changes on your updated branch. Do you have an expected timeframe? If it'll only be a day or two I'll continue testing the current version to see if I can get it running - any changes you make should only simplify things in preparing the draft. If it'll be longer, that's no problem, but I'll probably put testing on hold and come back to it.

@pathbreak
Contributor Author

@pbzona, Thank you for checking those logs.

I didn't realize you had gotten 503 errors even after recreating the clusters freshly; I mistakenly thought that had happened during the previous test sequence, sorry about that. It's strange that storm-ui did not start even with default configurations, and I'm not really sure what's going on there. I'll retest and see if I can reproduce it somehow.

As for the changes, I estimate I'll need till this weekend because I'm able to work on this only for a couple of hours at night. I'll make it available on Monday (29th).

@pbzona
Contributor

pbzona commented Aug 23, 2016

@pathbreak I recreated the entire setup yesterday (including the Linode I was using as a workstation to create the cluster manager) and everything seemed to work fine with no errors. I'm not sure what I did differently, but it's entirely possible there was some silly error on my part the first few times.

Thanks for the update on your changes, I'll continue testing the other features of your scripts and resume full editing and testing once your changes have been implemented.

@pathbreak
Contributor Author

@pbzona Thank you for testing, I'm glad it worked without hitches. Please do let me know if you feel some steps can be simplified or further automated.

I have now implemented the cluster name simplifications and completed testing on Linode cloud with both Ubuntu 14.04 and Debian 8 (I've not yet started the Ubuntu 16.04 support).

  • CLUSTER_NAME variable has been removed.
  • The name supplied with new-image-conf or new-cluster-conf is treated as the cluster name.
  • Users can now supply just the cluster or image names everywhere instead of conf file paths.

example:
$ ./storm-cluster-linode.sh create storm-cluster1 api_env_linode.conf

The software release is available in master and tagged as 'release-0.2.1-alpha' (https://github.com/pathbreak/storm-linode/tree/release-0.2.1-alpha)

You need to recreate a cluster manager from scratch for this release, because the cluster manager setup script has been changed to detect and download whatever is the latest 'release*' version of the software.

I've also made related changes in the README as well as my article, to serve as a guide in case you need them.

https://github.com/pathbreak/docs/compare/bigdata-storm...pathbreak:bigdata-storm-review1?diff=split&expand=1&name=bigdata-storm-review1

@pathbreak
Contributor Author

@pbzona Thanks for the PR. I'll read through the changed version tonight and tomorrow, and get back to you tomorrow.

@pathbreak
Contributor Author

pathbreak commented Sep 16, 2016

@pbzona I've completed merging all the review comments and reformatting into this PR branch.

All the corresponding code changes (except Ubuntu 16 changes) have also been committed into latest release-0.3.0 (https://github.com/pathbreak/storm-linode/releases/tag/release-0.3.0).


This guide explains how to create Storm clusters on the Linode cloud using a set of shell scripts that use Linode's Application Programming Interfaces (APIs) to programmatically create and configure large clusters. The deployed architecture will look like this:

![Architecture](/docs/assets/storm-architecture-650px.png)
Contributor


Concerning the iconography used in this and your other graphics, where did they come from? What licensing is associated with them?

Contributor Author


@alexfornuto Thank you for the review comments.

This was written a couple of months ago and I didn't record all the sources at the time. But when I need icons, I usually get them from http://www.iconarchive.com and http://findicons.com with open/libre licenses. Here are 3 of the icons I used and could find again today by searching

However, since about a dozen more icons are used and I am not keen on searching for them all again to verify, I have simply removed all icons from my illustrations and committed the new diagrams with the same filenames.

Contributor

Thanks, but that wasn't necessary, since we'll probably just replace the graphics with ones made in-house. If you'd like to roll back that commit, it will make it easier to merge my copy edit PR once it's complete.


The first step is setting up a central *Cluster Manager* to store details of all Storm clusters, and enable authorized users to create, manage or access those clusters. This can be a local workstation or a Linode.

You can set up multiple Cluster Manager nodes if necessary to satisfy organizational or security needs. For example, two departments can each have their own Cluster Manager nodes to manage their respective clusters and keep data access restricted to their own members.
Contributor

In the interest of clarity and streamlining, I am considering removing this paragraph. Unless there is some over-arching system that I don't yet grasp, wouldn't another cluster manager and cluster be a separate, unique entity? In other words, wouldn't it just be 'repeating this guide again, for a different department in your organization on a separate environment'?

Contributor Author

@pathbreak pathbreak Oct 11, 2016

@alexfornuto Yes, the removal is ok with me. There is no over-arching concept here, and another department can simply repeat the guide, that is correct.

I don't mind making this change myself, but since I'm not clear about the workflow here, do let me know whether I should go ahead and make it, or whether you plan to do it as a PR.

Contributor

I'll include it in mine, thanks for asking!

~/storm-linode/linode_api.py datacenters table

- `DISTRIBUTION`:
This is the ID of the distribution to install on the Cluster Manager Linode. This guide has been tested only on Ubuntu 14.04 or Debian 8; other distributions are not supported.
Contributor

These IDs are not necessarily static, they may change in the future. Can your linode_api.py also be used to return distribution IDs?

Contributor Author

@pathbreak pathbreak Oct 11, 2016

@alexfornuto Yes, it can already print the available distributions and kernels by querying the Linode API.

The syntax is the same:

        source ~/storm-linode/api_env_linode.conf
        ~/storm-linode/linode_api.py distributions table

and

        source ~/storm-linode/api_env_linode.conf
        ~/storm-linode/linode_api.py kernels table

Again, do let me know if I can go ahead with adding this.
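For illustration, the legacy Linode API v3 that linode_api.py queries answers these listing actions with JSON. The sketch below is not the actual script: the sample response (including its IDs and labels) is a hypothetical stand-in, and it simply shows how such a response maps to an ID/label table.

```shell
# Sketch only -- linode_api.py does this for you. The JSON below is a
# hypothetical sample of an avail.distributions response, not real data.
sample='{"ACTION":"avail.distributions","DATA":[
  {"DISTRIBUTIONID":130,"LABEL":"Debian 8"},
  {"DISTRIBUTIONID":124,"LABEL":"Ubuntu 14.04 LTS"}]}'

# extract ID/label pairs into a simple table
echo "$sample" | python3 -c '
import json, sys
for d in json.load(sys.stdin)["DATA"]:
    print(d["DISTRIBUTIONID"], d["LABEL"])
'
```

The IDs printed this way are what go into the `DISTRIBUTION` setting in the configuration file.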

Contributor

UPDATE: I see the option in your .py file, and will include the additional information in my PR.


- `IMAGE_DISK_SIZE`

The size of the image disk in MB. The default value of 5000MB is generally sufficient, since the installation only consists of the OS with Java and Zookeeper software installed.
Contributor

Is there any reason not to set this as large as the smallest Linode plan will allow, less swap disk space?

Contributor Author

@pathbreak pathbreak Oct 12, 2016

@alexfornuto There were a couple of reasons for this decision:

  • The most important reason was faster cloning. Since an image is cloned for every node in the cluster, I noticed that cluster creation is noticeably faster when disk sizes are small. A related observation was that cloning in general was slow when the image was created in a different datacenter from the one where the clones are created; small disk sizes help in that scenario too. The software allows creating an image in one DC and creating clusters in another.
  • Linode allocates a maximum of 10 GB for all images of an account. Experimentally, I've seen it go above that, up to around 12 GB, before image creation failed. Although a 5000 MB disk image doesn't consume 5000 MB of the total image space, I thought it prudent to default to small disk sizes, both to avoid image creation failures due to large disks and to let users create many images if needed.
  • It gives users flexibility, for example, if they prefer to store data on separate disks on the cluster nodes, apart from the OS and software. This is the recommended ZooKeeper configuration for performance reasons: keeping its data on a separate disk so that OS or other software writes don't hamper ZooKeeper's write queues.
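That last point, keeping ZooKeeper's write path on its own disk, comes down to one setting in zoo.cfg. A minimal sketch, assuming a dedicated volume is already mounted; the paths, filenames, and the helper function are illustrative, not part of the guide's scripts:

```shell
# Sketch (paths are assumptions): ZooKeeper performs best when its
# transaction log lives on its own disk, away from OS and other writes.
set_datalog_dir() {
    local conf="$1" dir="$2"
    mkdir -p "$dir"
    if grep -q '^dataLogDir=' "$conf" 2>/dev/null; then
        # replace an existing setting in place
        sed -i "s|^dataLogDir=.*|dataLogDir=$dir|" "$conf"
    else
        echo "dataLogDir=$dir" >> "$conf"
    fi
}

# demo against a scratch copy of zoo.cfg
conf=$(mktemp)
printf 'tickTime=2000\ndataDir=/var/lib/zookeeper\n' > "$conf"
set_datalog_dir "$conf" /tmp/zk-datalog
cat "$conf"
```

With `dataLogDir` separated from `dataDir`, snapshots and the latency-sensitive transaction log no longer compete for the same spindle.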

Contributor

Works for me, thanks for clarifying!


For example:

./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"
Contributor

Can we please replace this example with one that uses example files, and not just the ~ alias? I feel it may be confusing.

Contributor Author

@alexfornuto I didn't understand this comment and the one below. Please clarify which example files you are referring to, and what the "~ alias" is. Thank you.


For example:

./zookeeper-cluster-linode.sh cp zk-cluster1 "~" "~/*.data"
Contributor

As above: Can we please replace this example with one that uses example files, and not just the ~ alias? I feel it may be confusing.

external_resources:
- '[Apache Storm project website](http://storm.apache.org/)'
- '[Apache Storm documentation](http://storm.apache.org/documentation.html)'
- '[Storm - Distributed and Fault-Tolerant Real-time Computation](http://www.infoq.com/presentations/Storm-Introduction)'
Contributor

Why is this third-party presentation a trust resource?

Contributor Author

@pathbreak pathbreak Oct 14, 2016

@alexfornuto InfoQ is a reputable and respected resource, the presenter Nathan Marz was the author and lead of the Apache Storm project, and it's a good introductory talk on Storm. I assumed the external resources section was for listing useful links so your readers can learn more about the topic. If it's against policy or something, I'm fine with removing it and the other links.

Contributor

It's fine, I just wanted to make sure that it was a trusted source, and based on your reply I'm confident that it is.

@alexfornuto
Contributor

@pathbreak I've staged the copy edits here


For example:

./zookeeper-cluster-linode.sh cp zk-cluster1 "~" "~/*.data"
Contributor

@pathbreak I don't know where my other comment went, but I saw your reply to it via email. I was referring to the ~ on this line, which seems to indicate that we're copying the entire contents of the local home folder to the clusters, and changing the output on the other side to *.data. Can you clarify this?

Contributor Author

@pathbreak pathbreak Oct 14, 2016

For some reason, GitHub is not showing me the correct line in the document. From your comment, I believe you are referring to this command? If so, then

 ./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"

is actually copying all *.data files from the cluster manager node to the home directory on each node (which happens to be /root, since the scripts run as root). The command syntax is [CLUSTER] [TARGET DIR] [SOURCE FILES]. This is the reverse of the usual cp/scp order, because the command accepts one target directory but multiple source files.

The idea here is to enable a user to transfer their application specific data files to all nodes with a single command. Since the software does not install any data files, it's left to the user to infer here that "*.data" is just a placeholder for their own data files related to the problem they are solving using Storm.

Perhaps it can be clearer if the sentence is reworded to:

For example, if your data files are named "*.data" and your topology requires them for processing, you can copy them to /root on all cluster nodes with:

    ./storm-cluster-linode.sh cp storm-cluster1 "~" "~/*.data"

The alternative is to remove the example completely, because there are no concrete data files to transfer. The software doesn't install any.
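To make the reversed [CLUSTER] [TARGET DIR] [SOURCE FILES...] order concrete: a rough sketch of how such a cp subcommand could loop scp over the cluster's nodes. This is not the actual storm-cluster-linode.sh; the `<cluster>.nodes` file, the `CLUSTER_CP_CMD` override, and the filenames are assumptions for illustration only.

```shell
# Rough sketch, not the real script: target directory comes first because
# there is exactly one target but possibly many source files.
cluster_cp() {
    local cluster="$1" target_dir="$2"; shift 2
    # hypothetical layout: one node IP per line in <cluster>.nodes
    while read -r node_ip; do
        # the real tool would use scp; the override allows a dry run with echo
        ${CLUSTER_CP_CMD:-scp} "$@" "root@$node_ip:$target_dir"
    done < "$cluster.nodes"
}

# dry run: show what would be copied to two hypothetical nodes
printf '192.0.2.10\n192.0.2.11\n' > demo.nodes
CLUSTER_CP_CMD=echo cluster_cp demo "~" "report1.data" "report2.data"
```

The dry run prints one scp-style line per node, with every source file bound for the single target directory on that node.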

Contributor

I think we should leave the example in, but explain as you just did that source and target are reversed from normal cp, and maybe include the syntax example you noted. Since I've already given you a PR, I'll leave that to you to add after you merge it. I hate dealing with merge conflicts.

@pathbreak
Contributor Author

pathbreak commented Oct 17, 2016

@alexfornuto Thank you for the thorough review and copy edit PR.

I have merged all your changes in commit 2cd69d4 and added the cp command explanation in commit b546e5a

@alexfornuto
Contributor

Thanks @pathbreak, this looks great! Our graphics team is working on updating the graphics for this guide, then we can publish! In the meantime, we need to know if you prefer the bounty as Linode credit or a PayPal payment, and which email address to use. If you like, you can email these details to contribute@linode.com

@pathbreak
Contributor Author

pathbreak commented Oct 21, 2016

@alexfornuto Thank you. I have emailed details to contribute@linode.com from "my name at my github username dot com".
If your graphics team requires my original Inkscape SVG illustrations, let me know. I'll upload them somewhere.

@alexfornuto
Contributor

Thanks @pathbreak, I've got new images staged here. Once we process your payment details I'll get this published. It's almost done!

@pathbreak
Contributor Author

@alexfornuto Great! Thank you for that update. I liked the new images from your graphics team; please convey my thanks for the effort they have put in.

@alexfornuto
Contributor

alexfornuto commented Oct 25, 2016

@pathbreak this PR is about to be marked as merged, as I pull #578 into master. Thanks for your submission! I've passed along your payment details to our payment department, and you should receive your bounty in 24-48 hours.

EDIT: ok... I guess you would need the latest version in your fork for this to show as merged. I'll give you a PR to merge to sync everything up. Not that it matters too much, your guide is in the repo; it's just nice to see the "Merged" icon next to one's own PR, eh?

@pathbreak
Contributor Author

I saw the published article in your docs site and it's looking fantastic!

A special thank you, @pbzona, for your rigorous checking and excellent suggestions for both the article and the scripts.

Thank you @alexfornuto and @EdwardAngert for your thorough reviews and feedback.

You guys are rockstars! I enjoyed the entire experience of learning and working with you all. Thanks a lot to all of you again, and keep up your excellent work!


@alexfornuto Very thoughtful of you for that PR and yes, it does look neat now 😄 !

@pbzona
Contributor

pbzona commented Oct 26, 2016

Thanks @pathbreak! This was by far one of the most interesting projects I've gotten to test, and it was truly a pleasure working with you. Great work!

@alexfornuto
Contributor

Hrmmm... @pathbreak I'm not sure why this still shows as different from master. Maybe you want to go to your branch and git pull --recursive from the main repo?

@pathbreak
Contributor Author

@alexfornuto I've pulled from linode/docs master into this PR branch, bigdata-storm.

But after doing so, I notice that my last commit with the 'cp' command explanation, b546e5a, has not made it into the merge and is missing from the published article's Copy Files to all nodes section. It seems to be missing from the main PR #578 too.

Can you please check why that commit is missing from the merge? I probably need not do anything more from my side at this point, but if there is, please let me know.

@alexfornuto
Contributor

@pathbreak Yup, there it is. I'll merge now. Thanks again!

@alexfornuto alexfornuto merged commit 466ed10 into linode:master Oct 31, 2016