Apache Storm - Big Data #1
|
@pbzona, I don't have a ready diagram to illustrate application flow, but can come up with something. I'll send it tomorrow or this weekend. As for a practical topology, I understand your point. In production, Storm is usually deployed as backend middleware in a bigger system, and not as a standalone system. It takes in some data stream, such as web server log files or server performance indicators; processes it to calculate something or find some patterns; and writes the output of that processing to a database like MySQL or MongoDB, or to a big data file system like Hadoop's HDFS. This output is then displayed, either as-is or combined with other processing, by business-specific web application dashboards. But for a tutorial, such an end-to-end system is too much work, and would require extensive changes in both code and article. Instead, I'll try to come up with a simple topology with a meaningful output that does not require the user to do additional configuration steps. I need to find a high-velocity data set that general users can relate to, do some processing on it using Storm, and write something meaningful to a text file somewhere. Tweet processing is a typical example quoted for Storm, and there are examples of it around, but even that requires a couple of Twitter API authorization steps by the end user. As for the article name change, it's fine. I did plan it as a series initially with additional articles on Kafka, Spark and Flink, but got busy with other projects. |
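The flow described above (take in a stream, process it, write the result somewhere a dashboard can read it) can be sketched roughly as follows. This is plain Python rather than Storm's API, and every name in it (`log_lines`, `extract_path`, `count_paths`, `write_report`) as well as the sample log lines is hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def log_lines() -> Iterator[str]:
    """Stand-in for a stream source; a real system would read a live feed."""
    # Hypothetical sample records in "METHOD PATH STATUS" form.
    yield from [
        "GET /index.html 200",
        "GET /about.html 404",
        "GET /index.html 200",
    ]


def extract_path(lines: Iterable[str]) -> Iterator[str]:
    """Processing step 1: pull the request path out of each record."""
    for line in lines:
        _method, path, _status = line.split()
        yield path


def count_paths(paths: Iterable[str]) -> Counter:
    """Processing step 2: aggregate a count per path."""
    return Counter(paths)


def write_report(counts: Counter, out_path: str) -> None:
    """Output step: write tab-separated counts to a text file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for path, n in counts.most_common():
            f.write(f"{path}\t{n}\n")


if __name__ == "__main__":
    write_report(count_paths(extract_path(log_lines())), "path_counts.txt")
```

In a real Storm deployment each of these steps would run as a separate, distributed component and the source would never end; the sketch only shows how data moves from input to a readable output file.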
|
@pathbreak Thanks for the clarification, I didn't realize the emitted words were the actual count - maybe that could be clarified in the guide? It would be nice to have something practical, but based on the examples I've seen I agree it would be too much work to develop a full system right now. If you'd like to do a sample topology that would be great, but don't worry about it if you don't have time. You've already contributed a ton of information here, and I certainly don't want to ask for too much. I think the Twitter API setup would add too much length to this guide, so we can come back to that in the future and possibly do a follow-up guide on it. Please let me know when you've had a chance to review and merge, and I'll pass this along for a final test run and some finishing touches. Thanks! |
|
@pbzona Sorry about the delays, I got caught up in another project and couldn't finish anything by Friday. Some updates
An unrelated question I have is regarding branching. I had made my initial PR to your repo to pull from my "bigdata-storm" branch. As per my understanding of your PR workflow, I should now merge this "bigdata-storm-review1" branch into the "bigdata-storm" branch, and not into "master"; and there's no need for a second PR. Is this correct? Additionally, should I also rebase my "bigdata-storm" branch onto your latest master? |
|
@pbzona Forgot to ask: Do you think the architecture diagram at 5119ea5#diff-26a5c925040ccb2f549a2be8326fd492 is clear enough for users? |
|
@pathbreak Sorry for the wait, I was unexpectedly out of the office for a few days. The new diagram looks great! I think it's clear and helpful, thanks for putting that together. I also agree with the other changes you made. I added a brief paragraph about the Reddit comment topology, and will be submitting a PR to your review branch. Very nice job with that, I think it gives a more helpful illustration of what Storm is actually capable of. To address your branching question, once you've reviewed and merged my final PR to Once this is done, please let me know (you can comment in the original PR, linode#419) and I'll get this over to one of our other team members for a final test and edit. Thank you again for all your work on this, it's turning out to be an excellent guide! |
|
@pbzona, Thank you for the kind compliments; your contributions and comments have been invaluable in improving both the guide and code. I have now merged your latest PR. Can you clarify how to proceed with Ubuntu 16.04 support? If it's not to be included in the guide this time around, then I'll go ahead with the final merge and notify you in the original PR's discussion once done. |
|
@pathbreak Since we've already spent quite a bit of time, let's move forward without Ubuntu 16.04 support for now. I'd like to get the guide published as soon as we can, and once it's up, we can come back to that as an update. |
|
@pbzona That's fine with me. I've merged all these changes now into the main PR. Corresponding code changes have been committed (except any Ubuntu 16 changes) into release-0.3.0 (https://github.com/pathbreak/storm-linode/releases/tag/release-0.3.0). |
"#installing-linode-s-public-ssh-key" is a non-existent link. The right one is "#adding-the-public-key"
Configuration checklist #1 internal checklist link broken
duplicated @ravenx99's changes in cpanel dns guide
minor edits to guide; fixed index
@pbzona This update looks good to me. Merged. Thank you.
Copyediting and style fixes
First round edits to GPG for SSH Auth Guide.
I've done some significant reformatting of this guide to address a couple of things:
There are relatively minor copyedits throughout the guide, mostly to make it more in line with our style guide. I've also reworked a few sections to avoid redundant explanations, and reduce the overall length of the guide.
I did add some additional information at the beginning of the guide based on your description of the application flow and my research. Please feel free to review this and let me know if anything I added is inaccurate, misleading, or if you think there's a clearer way to explain it.
A couple of additional things I think the guide could benefit from:
Do you have a chart or graphic showing the application flow? You've included some really nice diagrams throughout the guide, so if you have one or can easily create one, I think that would be a good supplement to the explanation, and help the user understand exactly what sort of configuration they're creating. The architecture diagram helps in part with this, but it doesn't necessarily show how data moves through the final product.
Is there a simple way to show the output of the wordcount topology, rather than just its execution? Alternatively, is there a freely available custom topology that could give users something more practical that they could look at when following along? The Storm UI shows that the topology is running, but not where the data is actually going, and I wasn't able to find much on this (I'm also not a Java developer, so I don't understand 100% of what's happening in the samples).
Finally, a note on the title. I noticed that you had included "Part 1" in the title. If this is to be part of a series of guides, we can always go back later to include that either in the title or the guide itself. I just took it out since this will stand alone for now.