
Spark multiple job Examples

Download Spark-multiple-job-Examples

Dependencies

simplevoronoi

Download simplevoronoi and place simplevoronoi-version-SNAPSHOT.jar in the Spark-multiple-job-Examples/lib directory (sbt picks up jars placed in lib/ as unmanaged dependencies).

Create the jar file

cd into Spark-multiple-job-Examples and run:

sbt
> package

After a successful build, target/scala-2.10/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar should be created. We will now run classes from this jar as Spark jobs.

Spark dependencies

twitter4j-core
twitter4j-stream
spark-streaming
spark-streaming-twitter
simplevoronoi

Download the listed libraries somewhere on disk and add them to SPARK_CLASSPATH in your user .bashrc file or in your spark-env.sh file.

In this tutorial we export them in spark-env.sh, so add these lines to SparkHome/conf/spark-env.sh:


export SPARK_CLASSPATH=PathToFile/twitter4j-core-3.0.3.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=PathToFile/twitter4j-stream-3.0.3.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=PathToFile/spark-streaming-twitter_2.10-1.4.1.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=PathToFile/simplevoronoi-0.2-SNAPSHOT.jar:$SPARK_CLASSPATH

Examples

Write tweets from the Twitter live stream to a CSV file


1.1: We need Twitter API credentials; if you don't have them, create them first here.

1.2: Create a file for the Twitter API credentials, e.g. twitter-credentials.txt.

1.3: Enter the credentials:

TWITTER_API_KEY=ApiKey
TWITTER_API_SECRET=ApiSecret
TWITTER_ACCESS_TOKEN=AccessToken
TWITTER_ACCESS_TOKEN_SECRET=AccessTokenSecret
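For reference, twitter4j reads its OAuth configuration from Java system properties, so a file like this can be wired in with a small loader along these lines (a sketch; the helper name and parsing here are assumptions, not necessarily what FetchTweets does):

import scala.io.Source

// Hypothetical loader: read KEY=VALUE pairs from the credentials file and
// map them onto the system properties that twitter4j reads its OAuth
// configuration from.
def loadTwitterCredentials(path: String): Unit = {
  val props = Source.fromFile(path).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && line.contains("="))
    .map { line =>
      val Array(key, value) = line.split("=", 2)
      key -> value
    }
    .toMap

  System.setProperty("twitter4j.oauth.consumerKey",       props("TWITTER_API_KEY"))
  System.setProperty("twitter4j.oauth.consumerSecret",    props("TWITTER_API_SECRET"))
  System.setProperty("twitter4j.oauth.accessToken",       props("TWITTER_ACCESS_TOKEN"))
  System.setProperty("twitter4j.oauth.accessTokenSecret", props("TWITTER_ACCESS_TOKEN_SECRET"))
}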

1.4: Run the sbt job to write the live stream from Twitter to a CSV file:

sbt "run-main xulu.FetchTweets twitter-credentials.txt tweets.csv"

This job writes tweets in the format "Longitude,Latitude,Text". To change this, modify FetchTweets.scala and run again.
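The core of that transformation looks roughly like this (a sketch, not the exact FetchTweets.scala code; only tweets that carry a geolocation are kept):

import twitter4j.Status

// Sketch: turn a twitter4j Status into a "Longitude,Latitude,Text" CSV line.
// Tweets without a geolocation are dropped.
def toCsvLine(status: Status): Option[String] =
  Option(status.getGeoLocation).map { geo =>
    val text = status.getText.replaceAll("[\r\n]+", " ") // keep one tweet per line
    s"${geo.getLongitude},${geo.getLatitude},$text"
  }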

Shell output:

-56.544541,-29.089541,Por que ni estamos jugando,
-69.922686,18.462675,Aprenda hablar amigo
-118.565107,34.280215,today a boy told me I'm pretty
121.039399,14.72272,@Kringgelss labuyoo. Hahaha
-34.875339,-7.158832@keithmeneses_ oi td bem? sdds 😔💚
103.766123,1.380696,Xian Lim on iShine 3 2
......

When you have enough tweets, stop the program by pressing CTRL + C.

Spark live Twitter stream

2.1: Edit the file Spark-multiple-job-Examples/src/main/scala/xulu/TwitterLiveStreaming.scala, find the line ssc.checkpoint("hdfs://HDFS_IP:9000/checkpoint") at the end of the file, replace HDFS_IP with your Hadoop master IP, and rebuild the jar file.

To rebuild, cd into Spark-multiple-job-Examples and run:

sbt
> package

Then cd to your Spark home and run:

bin/spark-submit --class xulu.TwitterLiveStreaming --master spark://sparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar (optional keywords separated by spaces)

This example only prints the tweet text from the Twitter live stream to the shell.

You can write it anywhere else or change the query: modify TwitterLiveStreaming.scala, rebuild your jar, and run again.
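For orientation, here is a minimal, hypothetical version of such a streaming job, assuming the standard spark-streaming-twitter API (the actual TwitterLiveStreaming.scala may differ):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Sketch: open a Twitter DStream (optionally filtered by keywords passed as
// program arguments) and print the tweet text to the shell.
object TwitterLiveStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterLiveStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://HDFS_IP:9000/checkpoint") // replace HDFS_IP as in 2.1

    // Passing None makes twitter4j fall back to the system properties for auth.
    val stream = TwitterUtils.createStream(ssc, None, args.toSeq)
    stream.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}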

Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib

In this example, we will use the k-means clustering algorithm implemented in the Spark Machine Learning Library (MLlib) to segment the dataset by geolocation.
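The MLlib side of such a job boils down to a few lines. Here is a sketch assuming the input is the "Longitude,Latitude,Text" CSV produced by example one; the object name and cluster count are illustrative, and KMeansApp itself also renders its result to a PNG, which is omitted here:

import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch: cluster tweet coordinates with MLlib k-means.
object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))

    // Parse "Longitude,Latitude,Text" lines into 2D points, skipping bad rows.
    val points = sc.textFile(args(0))
      .flatMap { line =>
        val f = line.split(",", 3)
        Try(Vectors.dense(f(0).toDouble, f(1).toDouble)).toOption
      }
      .cache()

    // 10 clusters, 20 iterations (arbitrary values for the sketch).
    val model = KMeans.train(points, 10, 20)
    model.clusterCenters.foreach(println) // one (longitude, latitude) center per segment
    sc.stop()
  }
}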

3.1: We need Twitter data for this example.

Example one shows how to write Twitter data to a file, so run example one to fetch some data from Twitter.

For your convenience, we provide the file tweets_drink.csv.

3.2: Upload the Twitter data tweets_drink.csv into Hadoop.

3.3: Run the job:

bin/spark-submit --class xulu.KMeansApp --master spark://SparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar hdfs://HDFS_IP:9000/UploadedLocation/tweets_drink.csv results/KmeanApp.png

Result:

[image: KMeans App result]

Analyzing your audience location with Twitter Streams and Heat Maps

4.1: Download Twitter data about drinks

You can run example one with drink-related keywords to get tweets containing a word related to a drink:

sbt "run-main xulu.FetchTweets twitter-credentials.txt tweets_drink.csv \ redbull schweppes coke cola pepsi fanta orangina soda \ coffee cafe expresso latte tea \ alcohol booze alcoholic whiskey tequila vodka booze cognac baccardi \ drink beer rhum liquor gin ouzo brandy mescal alcoholic wine drink"

When you have enough tweets, stop the program by pressing CTRL + C.

Unfortunately, only a fraction of all tweets have geolocation information (the publisher has to tweet from a phone and has to opt in to send their position), so you might need to wait several hours (even days if the words are not popular) to get enough tweets to draw on a map. For your convenience, we provide the file tweets_drink.csv, which already contains the tweets that we collected using those keywords.

Each line of the tweet file contains the tweet longitude, latitude, and message, in the same format as the shell output shown in example one.
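Because the message itself may contain commas, a parser should split only on the first two commas; a small sketch (hypothetical helper, not code from the repo):

import scala.util.Try

// Sketch: split a "Longitude,Latitude,Text" line into its three fields.
// The split limit of 3 keeps commas inside the tweet text intact.
def parseLine(line: String): Option[(Double, Double, String)] = {
  val f = line.split(",", 3)
  if (f.length == 3) Try((f(0).toDouble, f(1).toDouble, f(2))).toOption
  else None
}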

4.2: Upload the Twitter data tweets_drink.csv into Hadoop.

4.3: Run the job:

bin/spark-submit --class xulu.HeatMap --master spark://SparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar hdfs://HDFS_IP:9000/UploadedLocation/tweets_drink.csv results/heatmapApp.png 0.5 coke pepsi

Result:

[image: HeatMap App result]

On this map, we can clearly see that the word 'coke' (in green) is used much more in tweets than the word 'pepsi' (in red). Some places are yellow, which means there are tweets about both coke and pepsi (yellow = red + green). Interestingly enough, coke is not used much in South America, unlike pepsi, which is used in Brazil and Argentina.
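The yellow = red + green observation is just additive RGB mixing; conceptually, the per-cell coloring works along these lines (a sketch of the idea, not HeatMap's actual rendering code; intensities are assumed normalized to [0, 1]):

import java.awt.Color

// Sketch: scale the green channel by the 'coke' intensity and the red channel
// by the 'pepsi' intensity; cells where both are high come out yellow.
def blend(cokeIntensity: Double, pepsiIntensity: Double): Color = {
  val green = (cokeIntensity.max(0.0).min(1.0) * 255).toInt
  val red   = (pepsiIntensity.max(0.0).min(1.0) * 255).toInt
  new Color(red, green, 0)
}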

Following

https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/
https://chimpler.wordpress.com/2014/06/26/analyzing-your-audience-location-with-twitter-streams-and-heat-maps/
