Tutorial 1: Spark Line Count
- Input: Here is a screenshot of the input file. Some sentences are deliberately repeated so that the counts can be checked.
- Workflow: The basic workflow used for counting lines and sorting them by count is shown here.
- Code Snippet: The Spark code for the above workflow is given below.
- Explanation:
Consider the following input for a simple explanation.
This is apache storm line count example. It is map reduce procedure.
This is apache storm line count example.
It is very fast and accurate.
It is 100 times faster than Hadoop.
It is map reduce procedure.
#### Procedure for Line Count and Sorting
1. Split the text into sentences at periods (.)
Output of this stage is:
This is apache storm line count example
It is map reduce procedure
This is apache storm line count example
It is very fast and accurate
It is 100 times faster than Hadoop
It is map reduce procedure
This is the input for the next stage, where mapping is performed.
2. Generate key-value pairs using map
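The splitting step has two small details worth seeing in code: the fragment after the final period is empty, and the second sentence on a line starts with a space. The sketch below uses plain Python rather than Spark so it can be run anywhere; the helper name `split_sentences` is an assumption made for illustration. In Spark the same function would typically be passed to `flatMap`.

```python
# Plain-Python sketch of step 1 (no Spark needed).
# `split_sentences` is a hypothetical helper name, not from the tutorial.
lines = [
    "This is apache storm line count example. It is map reduce procedure.",
    "This is apache storm line count example.",
    "It is very fast and accurate.",
    "It is 100 times faster than Hadoop.",
    "It is map reduce procedure.",
]

def split_sentences(line):
    """Split one line at periods, trim whitespace, drop empty fragments."""
    return [part.strip() for part in line.split(".") if part.strip()]

# Flatten the per-line results into one list, as flatMap would.
sentences = [s for line in lines for s in split_sentences(line)]

for s in sentences:
    print(s)
```

Without the `strip()` calls, the second sentence of the first line would keep its leading space and would not match the identical sentence on the last line when counting.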
Output of this stage is:
(This is apache storm line count example,1)
(It is map reduce procedure,1)
(This is apache storm line count example,1)
(It is very fast and accurate,1)
(It is 100 times faster than Hadoop,1)
(It is map reduce procedure,1)
3. Apply a reduce operation to count identical lines
Output:
(This is apache storm line count example,2)
(It is very fast and accurate,1)
(It is 100 times faster than Hadoop,1)
(It is map reduce procedure,2)
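Steps 2 and 3 can be sketched in plain Python to show what the map and reduce stages compute. The function `reduce_by_key` is a hypothetical stand-in written for this example; it mimics what Spark's `reduceByKey` does when given addition, but on a single machine.

```python
# Plain-Python sketch of steps 2 and 3 (no Spark needed).
sentences = [
    "This is apache storm line count example",
    "It is map reduce procedure",
    "This is apache storm line count example",
    "It is very fast and accurate",
    "It is 100 times faster than Hadoop",
    "It is map reduce procedure",
]

# Step 2: pair every sentence with an initial count of 1.
pairs = [(sentence, 1) for sentence in sentences]

def reduce_by_key(pairs):
    """Sum the values of all pairs that share the same key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())

# Step 3: merge pairs with identical sentences by adding their counts.
counts = reduce_by_key(pairs)

for sentence, count in counts:
    print((sentence, count))
```

The order of the resulting pairs is not meaningful at this stage. In Spark it depends on how keys are partitioned, which is why the later sorting steps are needed.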
4. Swap keys and values
Output:
(2,This is apache storm line count example)
(1,It is very fast and accurate)
(1,It is 100 times faster than Hadoop)
(2,It is map reduce procedure)
5. Sort by count (now the key) using the sortByKey function
Output:
(1,It is very fast and accurate)
(1,It is 100 times faster than Hadoop)
(2,This is apache storm line count example)
(2,It is map reduce procedure)
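The reason for the swap is that `sortByKey` orders pairs by their key, so the count has to become the key first. A plain-Python sketch of steps 4 and 5 follows; it uses the built-in `sorted` as a stand-in for `sortByKey`, which is an assumption made so the example runs without Spark.

```python
# Plain-Python sketch of steps 4 and 5 (no Spark needed).
counts = [
    ("This is apache storm line count example", 2),
    ("It is very fast and accurate", 1),
    ("It is 100 times faster than Hadoop", 1),
    ("It is map reduce procedure", 2),
]

# Step 4: swap so that the count becomes the key.
swapped = [(count, sentence) for sentence, count in counts]

# Step 5: sort by key only, in ascending order.
# Python's sorted() is stable, so pairs with equal counts keep their
# input order here; in Spark the order among equal keys should not be
# relied on.
by_count = sorted(swapped, key=lambda pair: pair[0])

for pair in by_count:
    print(pair)
```

To list the most frequent sentences first, `sorted` accepts `reverse=True`; in PySpark the corresponding option is `sortByKey(ascending=False)`.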
6. Swap keys and values back again
Output:
(It is very fast and accurate,1)
(It is 100 times faster than Hadoop,1)
(This is apache storm line count example,2)
(It is map reduce procedure,2)
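Put together, the six steps map onto a short chain of RDD operations. The following is a minimal PySpark sketch, not the tutorial's original snippet: the function names, the application name `LineCount`, and the file name `input.txt` are all assumptions chosen for illustration. The `pyspark` import is placed inside `main` so that the helper functions can be read and tried without a Spark installation.

```python
def split_sentences(line):
    """Step 1: split a line at periods, trim, and drop empty fragments."""
    return [part.strip() for part in line.split(".") if part.strip()]

def swap(pair):
    """Steps 4 and 6: exchange key and value."""
    return (pair[1], pair[0])

def line_count(sc, path):
    """Run the six steps on the text file at `path`.

    `sc` is expected to be a pyspark SparkContext.
    Returns a list of (sentence, count) pairs sorted by ascending count.
    """
    return (
        sc.textFile(path)                  # one record per line of input
        .flatMap(split_sentences)          # step 1: sentences
        .map(lambda s: (s, 1))             # step 2: (sentence, 1)
        .reduceByKey(lambda a, b: a + b)   # step 3: (sentence, count)
        .map(swap)                         # step 4: (count, sentence)
        .sortByKey()                       # step 5: ascending by count
        .map(swap)                         # step 6: (sentence, count)
        .collect()
    )

def main():
    # Imported here so the functions above do not require pyspark.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LineCount")  # hypothetical app name
    try:
        # "input.txt" is a placeholder path for the input file.
        for sentence, count in line_count(sc, "input.txt"):
            print((sentence, count))
    finally:
        sc.stop()
```

Calling `main()` runs the job on a local Spark master. The two swaps could be avoided with `sortBy(lambda pair: pair[1])`, which sorts on the count directly; the swap version is kept here because it follows the steps as described.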
- Output: Here is a screenshot of the output for the input given above.