This is a Spark Streaming demo framework for counting and aggregating streaming data, similar to Twitter's Rainbird but with more features:

- It supports several kinds of operators, not only a counter operator (as in Rainbird).
- It can process different Kafka message topics, each with a topic-specific parser, in one framework.
- It is highly configurable through an XML file.
The overall architecture of the streaming demo is shown below:

The UML class diagram:
To use this framework, configure the XML file `conf/properties.xml`. Here is a clickstream example:
```xml
<applications>
  <application>
    <category>clickstream</category>
    <parser>example.clickstream.ClickEventParser</parser>
    <items>
      <item>source_ip</item>
      <item>dest_url</item>
      <item>visit_date</item>
      <item>ad_revenue</item>
      <item>user_agent</item>
      <item>c_code</item>
      <item>l_code</item>
      <item>s_keyword</item>
      <item>avg_time_onsite</item>
    </items>
    <properties>
      <property window="30" slide="10" hierarchy="/" type="count">
        <key>dest_url</key>
        <output>stream.framework.output.TachyonStrToLongOutput</output>
      </property>
      <property window="30" slide="10" hierarchy="/" type="aggregate">
        <key>dest_url</key>
        <value>source_ip</value>
        <output>stream.framework.output.TachyonStrToStrOutput</output>
      </property>
      <property window="30" slide="10" hierarchy="/" type="distinct_aggregate_count">
        <key>dest_url</key>
        <value>source_ip</value>
        <output>stream.framework.output.TachyonStrToLongOutput</output>
      </property>
    </properties>
  </application>
</applications>
```
Several `application` elements can co-exist in one `applications` element, and the user can configure application-specific parameters as above:

- `category` is the category of the Kafka message; it is also used as the topic.
- `parser` is the user-defined parser class; users should extend `AbstractParser` to write their own.
- `items` is the schema of the input message; when the input message has several items, these are the names of each item.
- `properties` lists the operations you want to run; you can specify several kinds of operators, each with its own properties.
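As a rough illustration of what a topic-specific parser does (the real `AbstractParser` interface is not shown here, so the function name and the tab delimiter below are assumptions, not the framework's API), a clickstream parser maps each raw message onto the configured `items` schema:

```python
# Hypothetical sketch: split one raw clickstream record into the fields
# declared under <items>. Field order follows the config; the tab
# delimiter is an assumption for illustration only.
ITEMS = ["source_ip", "dest_url", "visit_date", "ad_revenue",
         "user_agent", "c_code", "l_code", "s_keyword", "avg_time_onsite"]

def parse_click_event(raw_line):
    """Map one raw message to a dict keyed by the configured item names."""
    fields = raw_line.split("\t")
    return dict(zip(ITEMS, fields))

record = parse_click_event("10.0.0.1\thttp://xyz.com/aaa\t2014-01-01\t0.5\t"
                           "Mozilla\tUS\ten\tspark\t12.3")
# record["dest_url"] is "http://xyz.com/aaa"
```

Inside the framework, the operators then pick out fields such as `key` and `value` from this parsed record by the item names.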
Currently the framework supports 3 operators:
- `CountOperator`: counts the occurrences of a specific `key`, e.g. PV (page views). The parameters related to `CountOperator` are:
  - `window`: the time window in which Spark Streaming collects data and calculates.
  - `slide`: the sliding interval of the window; with a slide of `10`, Spark Streaming processes the 30-second window of data every 10 seconds.
  - `hierarchy`: whether to split the data on the specified delimiter, such as `/`. For page-view analysis, each URL is split like this: `http://xyz.com/aaa/bbb/ccc => xyz.com/aaa/bbb/ccc xyz.com/aaa/bbb xyz.com/aaa xyz.com`
  - `type`: the operator you choose.
  - `output`: the output class you implemented to store this kind of output data; `output` should extend `AbstractEventOutput`.
- `AggregateOperator`: aggregates the `value` by the `key` you specified; its parameters are the same as `CountOperator`'s.
- `DistinctAggregateCountOperator`: counts the distinct `value`s for the specified `key`; its parameters are also the same as above.

Besides these, users can create their own `Operator` by extending `AbstractOperator`, which is easy and straightforward.
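To make the hierarchy splitting and the three operator types concrete, here is a small, Spark-free Python sketch of their per-window semantics (the function names and sample data are ours for illustration, not part of the framework):

```python
from collections import Counter, defaultdict

def expand_hierarchy(url, delimiter="/"):
    """Expand a URL into every prefix along the delimiter, mirroring
    hierarchy="/": deepest path first, bare host last."""
    path = url.split("://", 1)[-1]                 # drop the scheme, if any
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(len(parts), 0, -1)]

# Sample (dest_url, source_ip) pairs from one 30-second window.
events = [
    ("http://xyz.com/aaa/bbb", "10.0.0.1"),
    ("http://xyz.com/aaa/bbb", "10.0.0.2"),
    ("http://xyz.com/aaa/ccc", "10.0.0.1"),
]

# type="count" with hierarchy="/": page views at every URL level.
pv = Counter(level for url, _ in events for level in expand_hierarchy(url))

# type="aggregate": collect the values seen for each key.
visitors = defaultdict(list)
for url, ip in events:
    visitors[url].append(ip)

# type="distinct_aggregate_count": distinct values per key (unique visitors).
uv = {url: len(set(ips)) for url, ips in visitors.items()}
```

With the data above, `pv["xyz.com"]` is 3 (every event rolls up to the host level) while `pv["xyz.com/aaa/bbb"]` is 2, and `uv` reports 2 unique visitors for `/aaa/bbb` and 1 for `/aaa/ccc`. In the real framework these computations run per window over the Kafka stream and are written out through the configured `output` class.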