Apache Storm - Big Data by pbzona · Pull Request #1 · pathbreak/docs

pbzona · 2016-09-06T22:41:20Z

I've done some significant reformatting of this guide to address a couple things:

To ensure that it will load properly on our site
To ensure consistency with our house style

There are relatively minor copyedits throughout the guide, mostly to make it more in line with our style guide. I've also reworked a few sections to avoid redundant explanations, and reduce the overall length of the guide.

I did add some additional information at the beginning of the guide based on your description of the application flow and my research. Please feel free to review this and let me know if anything I added is inaccurate, misleading, or if you think there's a clearer way to explain it.

A couple additional things I think the guide could benefit from:

Do you have a chart or graphic showing the application flow? You've included some really nice diagrams throughout the guide, so if you have one or can easily create one, I think that would be a good supplement to the explanation, and help the user understand exactly what sort of configuration they're creating. The architecture diagram helps in part with this, but it doesn't necessarily show how data moves through the final product.
Is there a simple way to show the output of the wordcount topology, rather than just its execution? Alternatively, is there a freely available custom topology that could give users something more practical that they could look at when following along? The Storm UI shows that the topology is running, but not where the data is actually going, and I wasn't able to find much on this (I'm also not a Java developer, so I don't understand 100% of what's happening in the samples).

Finally, a note on the title. I noticed that you had included "Part 1" in the title. If this is to be part of a series of guides, we can always go back later to include that either in the title or the guide itself. I just took it out since this will stand alone for now.

pathbreak · 2016-09-07T13:27:24Z

@pbzona, I don't have a ready diagram to illustrate the application flow, but I can come up with something. I'll send it tomorrow or this weekend.

As for a practical topology, I understand your point.
The word count topology is a "hello world" kind of example for Storm. It just counts words in generated sentences, and emits each word as a separate message. So, the number of emitted messages in the table shown in the Storm UI is also the actual word count. That value is not written out anywhere other than the supervisor logs.
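
As a rough illustration of why the emitted count equals the word count, here is a minimal sketch in plain Python. It does not use the Storm API or the actual WordCountTopology code; the sentence list and the function names are hypothetical stand-ins for the sentence generator and the splitting step.

```python
from collections import Counter

# Hypothetical stand-in for the sentence generator; the real topology
# produces its own sentences.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]


def split_sentence(sentence):
    """Emit each word of a sentence as a separate message."""
    for word in sentence.split():
        yield word


def run_word_count(sentences):
    """Count emitted messages and per-word totals in a single pass."""
    counts = Counter()
    emitted = 0
    for sentence in sentences:
        for word in split_sentence(sentence):
            emitted += 1          # one emitted message per word
            counts[word] += 1     # running count for that word
    return emitted, counts


emitted, counts = run_word_count(SENTENCES)
print("emitted messages:", emitted)
print("total words counted:", sum(counts.values()))
```

Because every word is emitted exactly once, the two printed totals always match, which is the relationship the Storm UI table relies on.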

In production, Storm is usually deployed as backend middleware in a bigger system, and not as a standalone system. It takes in some data stream - like web server log files or server performance indicators, for example; processes them to calculate something or find some patterns; and writes the output of that processing to a database like MySQL or MongoDB, or to a big data file system like Hadoop's HDFS. This output is then displayed, either as-is or by combining it with other processing, by business specific web application dashboards.
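
To make that flow concrete, here is a small sketch of the same three stages (read a stream, process it, write results to a database) in plain Python. It is only an assumption-laden illustration, not Storm code: the log format, the function names, and the table name are invented for the example, and an in-memory SQLite database stands in for MySQL or MongoDB.

```python
import sqlite3
from collections import Counter

# Hypothetical web server log lines; a real deployment would read a live stream.
LOG_LINES = [
    '10.0.0.1 "GET /index.html" 200',
    '10.0.0.2 "GET /missing" 404',
    '10.0.0.1 "POST /login" 200',
]


def read_stream(lines):
    """Source stage: yield one log line at a time."""
    for line in lines:
        yield line


def extract_status(line):
    """Processing step: take the HTTP status code from the end of the line."""
    return line.rsplit(" ", 1)[-1]


def process(stream):
    """Processing stage: count requests per status code."""
    counts = Counter()
    for line in stream:
        counts[extract_status(line)] += 1
    return counts


def write_results(conn, counts):
    """Sink stage: store the counts where a dashboard could query them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS status_counts "
        "(status TEXT PRIMARY KEY, hits INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO status_counts VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite used here only as a stand-in database
write_results(conn, process(read_stream(LOG_LINES)))
rows = conn.execute(
    "SELECT status, hits FROM status_counts ORDER BY status"
).fetchall()
print(rows)
```

In a real topology each stage would run as a separate, parallel component, and a dashboard application would query the output table.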

But for a tutorial, such an end-to-end system is too much work, and would require extensive changes in both code and article.

Instead, I'll try to come up with a simple topology with a meaningful output that does not require the user to do additional configuration steps. I need to find a high velocity data set that general users can relate to, do some processing on it using Storm, and write something meaningful to a text file somewhere. Tweet processing is a typical example quoted for Storm, and there are examples of it around, but even that requires a couple of Twitter API authorization steps by the end user.

As for the article name change, it's fine. I did plan it as a series initially with additional articles on Kafka, Spark and Flink, but got busy with other projects.

pbzona · 2016-09-07T13:52:10Z

@pathbreak Thanks for the clarification, I didn't realize the number of emitted words was the actual count - maybe that could be clarified in the guide? It would be nice to have something practical, but based on the examples I've seen I agree it would be too much work to develop a full system right now.

If you'd like to do a sample topology that would be great, but don't worry about it if you don't have time. You've already contributed a ton of information here, and I certainly don't want to ask for too much. I think the Twitter API setup would add too much length to this guide, so we can come back to that in the future and possibly do a follow-up guide on it.

Please let me know when you've had a chance to review and merge, and I'll pass this along for a final test run and some finishing touches. Thanks!

pathbreak · 2016-09-11T20:51:17Z

@pbzona Sorry about the delays, I got caught up in another project and couldn't finish anything by Friday. Some updates:

I have reviewed your reworked version, and like it. The introductory information you've written is all accurate. I've merged your PR verbatim in commit 59ab835
I made a couple of changes to your reworked version in commit 51390aa. Please review them.

The main change is that ZK_IMAGE_CONF and STORM_IMAGE_CONF can take just the respective image directory names (absolute or relative); no need to specify their full ".conf" filenames.
The other changes are just typo corrections.
I added a more detailed architecture diagram in 5119ea5

In the same commit, I corrected a mistake in the "Start a new topology" section's command line that I had made from the beginning. The topology name argument is specific to the WordCountTopology class; it is not a general argument that can be passed to any topology class.
I wrote a more practical topology to analyze the sentiment of Reddit comments, and a viewer app to display the results. It's at https://github.com/pathbreak/reddit-sentiment-storm. I was thinking of including the steps from its Usage section in this guide's "Start a new topology" section as an example. What do you think? Is it an acceptable example for Linode's audience if included? And is that the right section to include it in?

An unrelated question I have is regarding branching. I had made my initial PR to your repo to pull from my "bigdata-storm" branch. As per my understanding of your PR workflow, I should now merge this "bigdata-storm-review1" branch to "bigdata-storm" branch, and not to "master"; and there's no need for a second PR. Is this correct? Additionally, should I also rebase my "bigdata-storm" branch to your latest master?

pathbreak · 2016-09-12T05:28:12Z

@pbzona Forgot to ask: Do you think the architecture diagram at 5119ea5#diff-26a5c925040ccb2f549a2be8326fd492 is clear enough to users?

pbzona · 2016-09-15T21:28:22Z

@pathbreak Sorry for the wait, I was unexpectedly out of the office for a few days. The new diagram looks great! I think it's clear and helpful, thanks for putting that together. I also agree with the other changes you made.

I added a brief paragraph about the Reddit comment topology, and will be submitting a PR to your review branch. Very nice job with that, I think it gives a more helpful illustration of what Storm is actually capable of.

To address your branching question, once you've reviewed and merged my final PR to bigdata-storm-review1, please do merge that branch to your bigdata-storm branch. However, you won't need to merge it to your master, or rebase to our latest master. Since this is a new file, there should be no conflicts when it's ultimately merged into our master branch.

Once this is done, please let me know (you can comment in the original PR, linode#419) and I'll get this over to one of our other team members for a final test and edit. Thank you again for all your work on this, it's turning out to be an excellent guide!

pathbreak · 2016-09-16T12:37:08Z

@pbzona, thank you for the kind compliments; your contributions and comments have been invaluable in improving both the guide and the code.

I have now merged your latest PR. Can you clarify how to proceed regarding Ubuntu 16.04 support? If it's not to be included in the guide this time around, then I'll go ahead with the final merge and notify you in the original PR's discussion once done.

pbzona · 2016-09-16T13:51:32Z

@pathbreak Since we've already spent quite a bit of time, let's move forward without Ubuntu 16.04 support for now. I'd like to get the guide published as soon as we can, and once it's up, we can come back to that as an update.

pathbreak · 2016-09-16T16:04:15Z

@pbzona That's fine with me. I've merged all these changes now into the main PR. Corresponding code changes have been committed (except any Ubuntu 16 changes) into release-0.3.0 (https://github.com/pathbreak/storm-linode/releases/tag/release-0.3.0).

Phil Zona added 5 commits August 29, 2016 11:14

Renamed file with our conventions and created index file

2914390

Updated metadata for proper page loading

86cc58a

Formatting changed

cb7c766

Added big-data directory to applications index page for display of the new category

334f0de

Formatting changes and copyedits

1fba69e

pbzona mentioned this pull request Sep 6, 2016

Big Data in the Linode cloud - Part 1: Streaming Data Processing using Apache Storm linode/docs#419

Merged

pathbreak merged commit 1fba69e into pathbreak:bigdata-storm-review1 Sep 11, 2016

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from afornuto/dns-manager-overhaul

c0fca3d

Dns manager overhaul

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Configuration checklist #1 internal checklist link broken

83657d4

"#installing-linode-s-public-ssh-key" is a non-existent link. The right one is "#adding-the-public-key"

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from EdwardAngert/nagios414

41ee814

Clarifying and formatting

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from Swashy/Swashy-patch-1

02ceb43

Configuration checklist #1 internal checklist link broken

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from alexfornuto/redis

5a9d514

Redis

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from alexfornuto/dns-slave-updates

15dc202

duplicated @ravenx99's changes in cpanel dns guide

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from EdwardAngert/backupfix

bf31425

minor edits to guide; fixed index

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from pbzona/unbundle-nginx-gitlab-2

2ca73c8

@pbzona This update looks good to me. Merged. Thank you.

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from afornuto/update-ssl

1542b1b

Update ssl

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from pbzona/shared-hosting

da82c56

Copyediting and style fixes

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from pbzona/alpine

b98f05d

Alpine Edits

pathbreak pushed a commit that referenced this pull request Oct 27, 2016

Merge pull request #1 from alexfornuto/gpg-key-ssh-auth

9264257

First round edits to GPG for SSH Auth Guide.

pathbreak merged 5 commits into pathbreak:bigdata-storm-review1 from pbzona:bigdata
