Apache Storm - Big Data #1

pathbreak merged 5 commits into pathbreak:bigdata-storm-review1 from pbzona:bigdata on Sep 11, 2016

pbzona commented Sep 6, 2016

I've done some significant reformatting of this guide to address a couple things:

  1. To ensure that it will load properly on our site
  2. To ensure consistency with our house style

There are relatively minor copyedits throughout the guide, mostly to make it more in line with our style guide. I've also reworked a few sections to avoid redundant explanations, and reduce the overall length of the guide.

I did add some additional information at the beginning of the guide based on your description of the application flow and my research. Please feel free to review this and let me know if anything I added is inaccurate, misleading, or if you think there's a clearer way to explain it.

A couple additional things I think the guide could benefit from:

  1. Do you have a chart or graphic showing the application flow? You've included some really nice diagrams throughout the guide, so if you have one or can easily create one, I think that would be a good supplement to the explanation, and help the user understand exactly what sort of configuration they're creating. The architecture diagram helps in part with this, but it doesn't necessarily show how data moves through the final product.

  2. Is there a simple way to show the output of the wordcount topology, rather than just its execution? Alternatively, is there a freely available custom topology that could give users something more practical that they could look at when following along? The Storm UI shows that the topology is running, but not where the data is actually going, and I wasn't able to find much on this (I'm also not a Java developer, so I don't understand 100% of what's happening in the samples).

Finally, a note on the title. I noticed that you had included "Part 1" in the title. If this is to be part of a series of guides, we can always go back later to include that either in the title or the guide itself. I just took it out since this will stand alone for now.

pathbreak (Owner) commented

@pbzona, I don't have a ready diagram to illustrate the application flow, but I can come up with something. I'll send it tomorrow or this weekend.

As for a practical topology, I understand your point.
The word count topology is a "hello world" kind of example for Storm. It just counts words in generated sentences and emits each word as a separate message. So the number of emitted messages in the table shown in the Storm UI is also the actual word count. That value is not written anywhere other than the supervisor logs.
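
To make the link between emitted messages and the word count concrete, here is a minimal plain-Python sketch of the idea. It does not use the Storm API; `split_sentence`, `run_word_count`, and the sample sentences are hypothetical stand-ins for the topology's spout and bolts.

```python
from collections import Counter

# Hypothetical stand-in for the sentences a spout would generate.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]


def split_sentence(sentence):
    """Emit each word of a sentence as a separate message."""
    for word in sentence.split():
        yield word


def run_word_count(sentences):
    """Return (number of emitted messages, per-word counts)."""
    emitted = 0
    counts = Counter()
    for sentence in sentences:
        for word in split_sentence(sentence):
            emitted += 1       # one message emitted per word
            counts[word] += 1  # running count kept by the counting step
    return emitted, counts


emitted, counts = run_word_count(SENTENCES)
# Every word is emitted exactly once, so the emitted total equals the word count.
assert emitted == sum(counts.values())
print(emitted, counts["the"])
```

Because each word produces exactly one message, the emitted total and the sum of the per-word counts always agree, which is why the Storm UI figure can be read as the word count.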

In production, Storm is usually deployed as backend middleware in a bigger system, and not as a standalone system. It takes in some data stream - like web server log files or server performance indicators, for example; processes them to calculate something or find some patterns; and writes the output of that processing to a database like MySQL or MongoDB, or to a big data file system like Hadoop's HDFS. This output is then displayed, either as-is or by combining it with other processing, by business specific web application dashboards.
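
As a rough illustration of that ingest, process, and store flow, here is a small plain-Python sketch. It is not Storm code and runs as a simple batch rather than a stream. The log lines, the function names, and the `status_counts` table are all hypothetical, and an in-memory SQLite database stands in for MySQL or MongoDB.

```python
import sqlite3
from collections import Counter

# Hypothetical input: a few web server log lines (method, path, HTTP status).
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
]


def parse_status(line):
    """Extract the HTTP status code from one log line."""
    return int(line.rsplit(" ", 1)[1])


def process(lines):
    """Processing step: count requests per status code."""
    return Counter(parse_status(line) for line in lines)


def store(counts, conn):
    """Output step: write the results to a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS status_counts "
        "(status INTEGER PRIMARY KEY, hits INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO status_counts VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a production database
store(process(LOG_LINES), conn)
rows = conn.execute(
    "SELECT status, hits FROM status_counts ORDER BY status"
).fetchall()
print(rows)
```

A dashboard application would then query the table rather than talk to the processing layer directly.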

But for a tutorial, such an end-to-end system is too much work, and would require extensive changes in both code and article.

Instead, I'll try to come up with a simple topology with meaningful output that does not require the user to perform additional configuration steps. I need to find a high-velocity data set that general users can relate to, do some processing on it using Storm, and write something meaningful to a text file somewhere. Tweet processing is a typical example quoted for Storm, and there are examples of it around, but even that requires a couple of Twitter API authorization steps by the end user.

As for the article name change, it's fine. I did plan it as a series initially with additional articles on Kafka, Spark and Flink, but got busy with other projects.

pbzona (Author) commented Sep 7, 2016

@pathbreak Thanks for the clarification. I didn't realize the number of emitted words was the actual count; maybe that could be clarified in the guide? It would be nice to have something practical, but based on the examples I've seen, I agree it would be too much work to develop a full system right now.

If you'd like to do a sample topology, that would be great, but don't worry about it if you don't have time. You've already contributed a ton of information here, and I certainly don't want to ask for too much. I think the Twitter API setup would add too much length to this guide, so we can come back to that in the future and possibly do a follow-up guide on it.

Please let me know when you've had a chance to review and merge, and I'll pass this along for a final test run and some finishing touches. Thanks!

@pathbreak pathbreak merged commit 1fba69e into pathbreak:bigdata-storm-review1 Sep 11, 2016

pathbreak (Owner) commented Sep 11, 2016

@pbzona Sorry about the delays. I got caught up in another project and couldn't finish anything by Friday. Some updates:

  1. I have reviewed your reworked version, and like it. The introductory information you've written is all accurate. I've merged your PR verbatim in commit 59ab835.

  2. I made a couple of changes to your reworked version in commit 51390aa. Please review them.

    The main change is that ZK_IMAGE_CONF and STORM_IMAGE_CONF can take just the respective image directory names (absolute or relative); no need to specify their full ".conf" filenames.
    The other changes are just typo corrections.

  3. I added a more detailed architecture diagram in 5119ea5.

    In the same commit, I corrected a mistake that had been in the "Start a new topology" section's command line from the beginning: the topology name argument is specific to the WordCountTopology class, and is not a general argument that can be passed to any topology class.

  4. I wrote a more practical topology that analyzes the sentiment of Reddit comments, along with a viewer app to display the results. It's at https://github.com/pathbreak/reddit-sentiment-storm. I was thinking of including the steps from its Usage section in this guide's "Start a new topology" section as an example. What do you think? Is it an acceptable example for Linode's audience? And is that the right section to include it in?
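
For item 2 above, the following Python sketch shows one way a setting like ZK_IMAGE_CONF could accept either an image directory or a full ".conf" filename. It is only an illustration and not the project's actual script logic. The `resolve_image_conf` helper is hypothetical, and so is the assumed layout in which a directory named `zk-image1` holds `zk-image1/zk-image1.conf`.

```python
import os


def resolve_image_conf(value):
    """Return the ".conf" path for a ZK_IMAGE_CONF-style value.

    Assumption (not confirmed by this thread): an image directory named
    "zk-image1" stores its configuration in "zk-image1/zk-image1.conf".
    """
    path = os.path.normpath(value)
    if path.endswith(".conf"):
        return path  # a full ".conf" filename was given; use it as-is
    # Otherwise treat the value as an image directory (absolute or relative).
    name = os.path.basename(path)
    return os.path.join(path, name + ".conf")


print(resolve_image_conf("./zk-image1"))
print(resolve_image_conf("/opt/images/storm-image1/storm-image1.conf"))
```

Both forms resolve to a single configuration file path, so the rest of a script can work with one representation regardless of which form the user supplied.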

An unrelated question I have is regarding branching. I had made my initial PR to your repo to pull from my "bigdata-storm" branch. As per my understanding of your PR workflow, I should now merge this "bigdata-storm-review1" branch to "bigdata-storm" branch, and not to "master"; and there's no need for a second PR. Is this correct? Additionally, should I also rebase my "bigdata-storm" branch to your latest master?

pathbreak (Owner) commented

@pbzona Forgot to ask: Do you think the architecture diagram at 5119ea5#diff-26a5c925040ccb2f549a2be8326fd492 is clear enough for users?

pbzona (Author) commented Sep 15, 2016

@pathbreak Sorry for the wait, I was unexpectedly out of the office for a few days. The new diagram looks great! I think it's clear and helpful, thanks for putting that together. I also agree with the other changes you made.

I added a brief paragraph about the Reddit comment topology, and will be submitting a PR to your review branch. Very nice job with that, I think it gives a more helpful illustration of what Storm is actually capable of.

To address your branching question, once you've reviewed and merged my final PR to bigdata-storm-review1, please do merge that branch to your bigdata-storm branch. However, you won't need to merge it to your master, or rebase to our latest master. Since this is a new file, there should be no conflicts when it's ultimately merged into our master branch.

Once this is done, please let me know (you can comment in the original PR, linode#419) and I'll get this over to one of our other team members for a final test and edit. Thank you again for all your work on this, it's turning out to be an excellent guide!

pathbreak (Owner) commented

@pbzona, thank you for the kind compliments; your contributions and comments have been invaluable in improving both the guide and the code.

I have now merged your latest PR. Can you clarify how to proceed with Ubuntu 16.04 support? If it's not to be included in the guide this time around, then I'll go ahead with the final merge and notify you in the original PR's discussion once done.

pbzona (Author) commented Sep 16, 2016

@pathbreak Since we've already spent quite a bit of time, let's move forward without Ubuntu 16.04 support for now. I'd like to get the guide published as soon as we can, and once it's up, we can come back to that as an update.

pathbreak (Owner) commented

@pbzona That's fine with me. I've merged all these changes now into the main PR. Corresponding code changes have been committed (except any Ubuntu 16 changes) into release-0.3.0 (https://github.com/pathbreak/storm-linode/releases/tag/release-0.3.0).

pathbreak pushed 12 commits that referenced this pull request on Oct 27, 2016. The commit messages shown are:

  1. "#installing-linode-s-public-ssh-key" is a non-existent link. The right one is "#adding-the-public-key"
  2. Configuration checklist #1 internal checklist link broken
  3. duplicated @ravenx99's changes in cpanel dns guide
  4. minor edits to guide; fixed index
  5. @pbzona This update looks good to me. Merged. Thank you.
  6. Copyediting and style fixes
  7. First round edits to GPG for SSH Auth Guide.