#Twitter + Watson Tone Analyzer sample Notebook Part 1: Loading the data
In this Notebook, we show how to load the custom library generated as part of the Twitter + Watson Tone Analyzer streaming application. The code can be found here: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter.
The following code uses a pre-built jar that has been posted on the GitHub project, but you can replace it with your own URL if needed.

In [1]:
%AddJar https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.1.jar -f

Starting download from https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.1.jar
Finished download of streaming-twitter-assembly-1.1.jar


##Set up the Twitter and Watson credentials
Please refer to the tutorial for details on how to find the Twitter and Watson credentials, then replace the placeholders in the code below with your own values.

In [2]:
val demo = com.ibm.cds.spark.samples.StreamingTwitter

demo.setConfig("twitter4j.oauth.consumerKey","XXXXXXXXXXXXXXXXXX")
demo.setConfig("twitter4j.oauth.consumerSecret","XXXXXXXXXXXXXXXXXX")
demo.setConfig("twitter4j.oauth.accessToken","XXXXXXXXXXXXXXXXXX")
demo.setConfig("twitter4j.oauth.accessTokenSecret","XXXXXXXXXXXXXXXXXX")
demo.setConfig("watson.tone.url","https://gateway.watsonplatform.net/tone-analyzer-experimental/api")
demo.setConfig("watson.tone.password","XXXXXXXXXXXXXXXXXX")
demo.setConfig("watson.tone.username","XXXXXXXXXXXXXXXXXX")
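
Hard-coding secrets in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming the credentials are exported as environment variables visible to the notebook kernel (the variable names and the `loadSettings` helper below are hypothetical and not part of the sample project), the settings could be collected and validated before being passed to `setConfig`:

```scala
// Hypothetical mapping from environment variable names (assumed, not defined
// by the sample project) to the configuration keys used in the cell above.
val envToConfigKey: Seq[(String, String)] = Seq(
  "TWITTER_CONSUMER_KEY"        -> "twitter4j.oauth.consumerKey",
  "TWITTER_CONSUMER_SECRET"     -> "twitter4j.oauth.consumerSecret",
  "TWITTER_ACCESS_TOKEN"        -> "twitter4j.oauth.accessToken",
  "TWITTER_ACCESS_TOKEN_SECRET" -> "twitter4j.oauth.accessTokenSecret",
  "WATSON_TONE_URL"             -> "watson.tone.url",
  "WATSON_TONE_USERNAME"        -> "watson.tone.username",
  "WATSON_TONE_PASSWORD"        -> "watson.tone.password"
)

// Returns Right(settings) when every variable is present and non-empty,
// otherwise Left(names of the missing variables).
def loadSettings(env: Map[String, String]): Either[Seq[String], Map[String, String]] = {
  val missing = envToConfigKey.collect {
    case (envName, _) if env.get(envName).forall(_.trim.isEmpty) => envName
  }
  if (missing.nonEmpty) Left(missing)
  else Right(envToConfigKey.map { case (envName, key) => key -> env(envName) }.toMap)
}

// Report what was found without printing any secret values.
loadSettings(sys.env) match {
  case Right(settings) => println(s"Loaded ${settings.size} settings")
  case Left(missing)   => println("Missing environment variables: " + missing.mkString(", "))
}
```

In the notebook, a `Right(settings)` result could then be applied with `settings.foreach { case (key, value) => demo.setConfig(key, value) }`.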

##Start the Spark Stream to collect live tweets
Start a new Twitter Stream that collects live tweets and enriches them with Sentiment Analysis scores. The stream runs for the duration given as the second argument of the **startTwitterStreaming** method.
Note: if no duration is specified, the stream runs until the **stopTwitterStreaming** method is called.

In [3]:
import org.apache.spark.streaming._
demo.startTwitterStreaming(sc, Seconds(100))

Twitter stream started
Tweets are collected real-time and analyzed
To stop the streaming and start interacting with the data use: StreamingTwitter.stopTwitterStreaming
Stopping Twitter stream. Please wait this may take a while
Twitter stream stopped
You can now create a sqlContext and DataFrame with 1465 Tweets created. Sample usage: 
val (sqlContext, df) = com.ibm.cds.spark.samples.StreamingTwitter.createTwitterDataFrames(sc)
df.printSchema
sqlContext.sql("select author, text from tweets").show
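
The note above about omitting the duration can be sketched as follows. This is an illustration only: it assumes the `sc` and `demo` values from the earlier cells, that the second argument of `startTwitterStreaming` is optional as the note describes, and that the call returns while the stream keeps running in the background.

```scala
// Assumes `sc` and `demo` from the cells above.
// Start the stream with no duration: it keeps collecting tweets until stopped.
demo.startTwitterStreaming(sc)

// Later, typically from a separate cell once enough tweets have been collected:
demo.stopTwitterStreaming
```

Stopping manually is convenient when you do not know in advance how long it will take to gather a useful sample.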


##Create a SQLContext and a dataframe with all the tweets
Note: this method registers a SparkSQL table called tweets.

In [4]:
val (sqlContext, df) = demo.createTwitterDataFrames(sc)

A new table named tweets with 1465 records has been correctly created and can be accessed through the SQLContext variable
Here's the schema for tweets
root
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- long: double (nullable = true)
 |-- Cheerfulness: double (nullable = true)
 |-- Negative: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Analytical: double (nullable = true)
 |-- Confident: double (nullable = true)
 |-- Tentative: double (nullable = true)
 |-- Openness: double (nullable = true)
 |-- Agreeableness: double (nullable = true)
 |-- Conscientiousness: double (nullable = true)
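
Because the table is registered with the SQLContext, it can be queried by name right away. A minimal sketch, assuming the `sqlContext` value from the cell above (the query itself is only an illustration and is not part of the original notebook):

```scala
// Assumes `sqlContext` from the cell above, with the tweets table registered.
// Confirm that the table is visible to SparkSQL.
println(sqlContext.tableNames().contains("tweets"))

// Count the collected tweets per language, using the lang column of the schema.
val byLang = sqlContext.sql("select lang, count(*) as tweetCount from tweets group by lang")
byLang.show
```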



##Execute a SparkSQL query that returns all the data
Note: There is a temporary issue on the current Spark Beta system that prevents data from being correctly serialized to a parquet file. It is important to use a "limit" clause in the SQL query, as it provides a workaround.

In [5]:
val fullSet = sqlContext.sql("select * from tweets limit 100000")  //Select all columns
fullSet.show

author               date                 lang text                 lat long Cheerfulness Negative Anger Analytical Confident Tentative Openness Agreeableness Conscientiousness
Griffin              Sun Sep 27 21:43:... en   If they add the a... 0.0 0.0  67.0         0.0      0.0   52.0       0.0       0.0       8.0      97.0          8.0              
Ruth Roldan          Sun Sep 27 21:43:... en   Nakakasuya. Busit.   0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       97.0     0.0           68.0             
Kayley               Sun Sep 27 21:43:... en   RT @GirIsloveboys... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       47.0     0.0           1.0              
Destiny Apocalypse   Sun Sep 27 21:43:... en   RT @FoodP0rnn_: P... 0.0 0.0  0.0          0.0      0.0   0.0        0.0       0.0       100.0    1.0           98.0             
? dd ?               Sun Sep 27 21:43:... en   As if dads camera... 0.0 0.0  0.0          100.0    100.0 0.0       

##Persist the dataset into a parquet file on Object Storage service
The parquet file will be reloaded in the IPython Part 2 Notebook.
Note: you can disregard the warning messages related to SLF4J.

In [6]:
fullSet.saveAsParquetFile("swift://twitter2.spark/tweetsFull7.parquet")

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
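
Before moving on to Part 2, it can be useful to check that the file was written correctly by reading it back. The sketch below assumes the `sqlContext` and `fullSet` values from the cells above and that the notebook can read from the same Object Storage container it just wrote to. `parquetFile` is the reader counterpart of `saveAsParquetFile` in the Spark 1.x API used here; newer Spark releases use `sqlContext.read.parquet` instead.

```scala
// Assumes `sqlContext` and `fullSet` from the cells above.
// Read back the parquet file that was just saved.
val reloaded = sqlContext.parquetFile("swift://twitter2.spark/tweetsFull7.parquet")

// Compare the reloaded data with what was saved: same columns, same row count.
reloaded.printSchema
println(reloaded.count == fullSet.count)
```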


##SparkSQL query example on the data.
Select all the tweets that have an Anger score greater than 70%.

In [7]:
val angerSet = sqlContext.sql("select author, text, Anger from tweets where Anger > 70")
println(angerSet.count)
angerSet.show

126
author          text                 Anger
? dd ?          As if dads camera... 100.0
Richard Hills   @TraceyMartinMP e... 100.0
Julia           @versacesachi @Co... 100.0
Maria           RT @lyssadeIrey: ... 100.0
Sophia kapoor   What 2020 Will Br... 100.0
J.J Idoko       RT @ManagersDiary... 100.0
lil yon. ??     @taycuzz_ yeah I'... 100.0
emilee          I know I'm annoyi... 100.0
Antoinette B.   When I get home..... 100.0
elizabeth       @Jamboogy_ kill i... 100.0
Madame          RT @TommyPickles9... 100.0
? ??            RT @tinatbh: netf... 100.0
Isaiah Marrs    #bokep #fucking #... 100.0
Gino Martinez   RT @DrrakeTheType... 100.0
Katie           RT @hotlinetao: w... 100.0
Emily Ciocca    For real love is ... 100.0
college dropout @Rebel_Ron94 aw i... 100.0
T.              RT @3famousamos: ... 100.0
Alistair Noble  @TheSealBoy no sh... 100.0
Kevin Cordovi   RT @vuhsace: It s... 100.0
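
The same selection can also be expressed with the DataFrame API instead of a SQL string. This is a sketch of an equivalent formulation, assuming the `df` and `angerSet` values from the cells above; it is not part of the original notebook.

```scala
// Assumes `df` and `angerSet` from the cells above.
// Keep the rows whose Anger score exceeds 70, then project the same three columns.
val angerDf = df.filter(df("Anger") > 70).select("author", "text", "Anger")

// Check that both formulations select the same number of tweets.
println(angerDf.count == angerSet.count)
angerDf.show
```

The DataFrame form lets the compiler catch malformed expressions early, while the SQL form is often easier to read for ad hoc exploration.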
