
Tutorial 1: Spark Line Count

Marmik edited this page Sep 1, 2016 · 2 revisions

Program for Spark Line Count and Sort

  • Input: Here is a screenshot of the input file. It repeats some sentences on purpose, so the counts show whether the program works correctly.

Input1

  • Workflow: The basic workflow I used for counting lines and sorting the counts is shown here.

Workflow

  • Code Snippet: The Spark code for the above workflow is given below.

Code Snippet

  • Explanation:

Consider the following input for a simple explanation.

This is apache storm line count example. It is map reduce procedure.

This is apache storm line count example.

It is very fast and accurate.

It is 100 times faster than Hadoop.

It is map reduce procedure.

#### Procedure of Line Count and Sorting

1. Split the text into sentences at each period (.)

Output of this stage is:

This is apache storm line count example

It is map reduce procedure

This is apache storm line count example

It is very fast and accurate

It is 100 times faster than Hadoop

It is map reduce procedure

This is the input for the next stage, where the mapping is performed.
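The split in step 1 can be sketched in plain Python. This is only an illustration and not the Spark code from the screenshot: the helper name `split_sentences` is hypothetical, and the input lines are assumed to be laid out exactly as listed above. In Spark, a one-to-many split like this is commonly expressed with `flatMap`, though whether the original code does so is not visible here.

```python
# Plain-Python sketch of step 1 (hypothetical helper, not the author's Spark code).
text_lines = [
    "This is apache storm line count example. It is map reduce procedure.",
    "This is apache storm line count example.",
    "It is very fast and accurate.",
    "It is 100 times faster than Hadoop.",
    "It is map reduce procedure.",
]


def split_sentences(lines):
    """Split every line at periods and drop the empty fragments."""
    sentences = []
    for line in lines:
        for fragment in line.split("."):
            fragment = fragment.strip()  # remove the space left after a period
            if fragment:  # the text after the last period is empty
                sentences.append(fragment)
    return sentences


sentences = split_sentences(text_lines)
for sentence in sentences:
    print(sentence)
```

The first input line holds two sentences, which is why five input lines produce six sentences.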

2. Generate key-value pairs using map

Output of this stage is:

(This is apache storm line count example,1)

(It is map reduce procedure,1)

(This is apache storm line count example,1)

(It is very fast and accurate,1)

(It is 100 times faster than Hadoop,1)

(It is map reduce procedure,1)
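As a rough plain-Python equivalent of this mapping stage (a sketch, not the Spark snippet itself), each sentence becomes the key of a pair whose value is 1. The variable names are chosen for illustration only.

```python
# Plain-Python sketch of step 2: pair every sentence with an initial count of 1.
sentences = [
    "This is apache storm line count example",
    "It is map reduce procedure",
    "This is apache storm line count example",
    "It is very fast and accurate",
    "It is 100 times faster than Hadoop",
    "It is map reduce procedure",
]

# One (key, value) tuple per sentence; duplicates are kept on purpose,
# because they are only merged in the reduce stage.
pairs = [(sentence, 1) for sentence in sentences]

for pair in pairs:
    print(pair)
```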

3. Reduce operation to count identical lines

Output:

(This is apache storm line count example,2)

(It is very fast and accurate,1)

(It is 100 times faster than Hadoop,1)

(It is map reduce procedure,2)
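The reduce stage adds up the values of all pairs that share a key. A minimal plain-Python sketch of that idea follows. The function `reduce_by_key` is a hypothetical stand-in written for this page, not Spark's `reduceByKey`, and the order of its result need not match the order Spark produces.

```python
# Plain-Python sketch of step 3: sum the values of pairs that share a key.
pairs = [
    ("This is apache storm line count example", 1),
    ("It is map reduce procedure", 1),
    ("This is apache storm line count example", 1),
    ("It is very fast and accurate", 1),
    ("It is 100 times faster than Hadoop", 1),
    ("It is map reduce procedure", 1),
]


def reduce_by_key(key_value_pairs):
    """Merge pairs with the same key by adding their values."""
    totals = {}
    for key, value in key_value_pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


counts = reduce_by_key(pairs)
for sentence, count in counts:
    print((sentence, count))
```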

4. Swap keys and values with each other

Output:

(2,This is apache storm line count example)

(1,It is very fast and accurate)

(1,It is 100 times faster than Hadoop)

(2,It is map reduce procedure)
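The swap is needed because sorting by key can only order the pairs by count once the count is the key. A small plain-Python sketch of the swap, with illustrative variable names:

```python
# Plain-Python sketch of step 4: turn (sentence, count) into (count, sentence).
counts = [
    ("This is apache storm line count example", 2),
    ("It is very fast and accurate", 1),
    ("It is 100 times faster than Hadoop", 1),
    ("It is map reduce procedure", 2),
]

# After the swap, the count sits in the key position and can drive the sort.
swapped = [(count, sentence) for sentence, count in counts]

for pair in swapped:
    print(pair)
```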

5. Use the sortByKey function for sorting

Output:

(1,It is very fast and accurate)

(1,It is 100 times faster than Hadoop)

(2,This is apache storm line count example)

(2,It is map reduce procedure)
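Sorting by key puts the pairs in ascending order of count, which is the default direction of `sortByKey`. The plain-Python sketch below imitates this with `sorted`. One difference to keep in mind: Python's sort is stable, so entries with equal counts keep their input order, whereas in Spark the relative order of entries with equal keys should not be relied on.

```python
# Plain-Python sketch of step 5: sort (count, sentence) pairs by count, ascending.
swapped = [
    (2, "This is apache storm line count example"),
    (1, "It is very fast and accurate"),
    (1, "It is 100 times faster than Hadoop"),
    (2, "It is map reduce procedure"),
]

# Only the count (first element) is compared, mirroring a sort by key.
sorted_pairs = sorted(swapped, key=lambda pair: pair[0])

for pair in sorted_pairs:
    print(pair)
```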

6. Swap keys and values with each other again

Output:

(It is very fast and accurate,1)

(It is 100 times faster than Hadoop,1)

(This is apache storm line count example,2)

(It is map reduce procedure,2)
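Putting the six steps together, the whole procedure can be sketched end to end in plain Python. This is an illustration under assumptions and not the Spark program shown in the screenshot: the function name `line_count_sorted` is hypothetical, and the input lines are assumed to match the example above.

```python
# Plain-Python sketch of the full procedure (hypothetical function, not Spark code).
def line_count_sorted(lines):
    """Count identical sentences and return them sorted by count, ascending."""
    # Step 1: split each line at periods and drop empty fragments.
    sentences = [
        fragment.strip()
        for line in lines
        for fragment in line.split(".")
        if fragment.strip()
    ]
    # Step 2: pair every sentence with 1.
    pairs = [(sentence, 1) for sentence in sentences]
    # Step 3: add up the values per sentence.
    totals = {}
    for sentence, one in pairs:
        totals[sentence] = totals.get(sentence, 0) + one
    # Step 4: swap so that the count becomes the key.
    swapped = [(count, sentence) for sentence, count in totals.items()]
    # Step 5: sort by the count only.
    swapped.sort(key=lambda pair: pair[0])
    # Step 6: swap back to (sentence, count).
    return [(sentence, count) for count, sentence in swapped]


text_lines = [
    "This is apache storm line count example. It is map reduce procedure.",
    "This is apache storm line count example.",
    "It is very fast and accurate.",
    "It is 100 times faster than Hadoop.",
    "It is map reduce procedure.",
]

result = line_count_sorted(text_lines)
for pair in result:
    print(pair)
```

The final swap restores the readable (sentence, count) form while keeping the order produced by the sort.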

  • Output: Here are the screenshots of the output for the input given above.

output1

output2
