<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align = "center"> Spark Fundamentals 1 - Introduction to Spark</h1>
<h2 align = "center"> Lab 4. Scala - Working with Scala Libraries</h2>
<br align = "left">

**Related free online courses:**

Related courses can be found in the following learning paths:

- [Spark Fundamentals path](https://cognitiveclass.ai/learn/spark/)
- [Big Data Fundamentals path](https://cognitiveclass.ai/learn/big-data/) 

<img src = "http://spark.apache.org/images/spark-logo.png" height = 100 align = 'left'>

## Creating a Spark application using Spark SQL

Spark SQL lets you write relational queries that run on Spark. Its original abstraction, the SchemaRDD, is an RDD with an attached schema that you can query with SQL or HiveQL, or manipulate from Scala. In this lab section, you will use SQL to query the daily temperature and precipitation recorded in New York over a given time period. The purpose is to demonstrate how to use the Spark SQL libraries on Spark.

### Please note that as of Spark 1.3, DataFrames have replaced SchemaRDDs; however, it is still possible to switch between the two to support legacy systems. DataFrames are the recommended method going forward.

### Let's first download the data that we will be working with in this lab

In [1]:
// import the module used to run shell commands within this notebook
import sys.process._

In [2]:
// download data from IBM server
// this may take ~30 seconds depending on your internet speed
"wget --quiet https://ibm.box.com/shared/static/j8skrriqeqw66f51iyz911zyqai64j2g.zip" !

println("Data Downloaded!")



Data Downloaded!


In [3]:
// unzip the folder's content into "resources" directory
"unzip -q -o -d ./resources j8skrriqeqw66f51iyz911zyqai64j2g.zip" !

println("Data Extracted!")



Data Extracted!


In [4]:
// list the extracted files
"ls -1 ./resources/LabData/" !

README.md
derby.log
followers.txt
metastore_db
notebook.log
nyctaxi.csv
nyctaxi100.csv
nyctaxisub.csv
nycweather.csv
pom.xml
taxistreams.py
users.txt




0

Let's take a look at the nycweather data by running the following code:

In [5]:
val lines = scala.io.Source.fromFile("./resources/LabData/nycweather.csv").mkString
println(lines)

"2013-01-01",1,0
"2013-01-02",-2,0
"2013-01-03",-2,0
"2013-01-04",1,0
"2013-01-05",3,0
"2013-01-06",4,0
"2013-01-07",5,0
"2013-01-08",6,0
"2013-01-09",7,0
"2013-01-10",7,0
"2013-01-11",6,13.97
"2013-01-12",7,0.51
"2013-01-13",8,0
"2013-01-14",8,2.29
"2013-01-15",3,3.05
"2013-01-16",2,17.53
"2013-01-17",4,0
"2013-01-18",-1,0
"2013-01-19",5,0
"2013-01-20",6,0
"2013-01-21",-2,0
"2013-01-22",-7,0
"2013-01-23",-9,0
"2013-01-24",-8,0
"2013-01-25",-7,1.78
"2013-01-26",-6,0
"2013-01-27",-3,0
"2013-01-28",1,5.59
"2013-01-29",6,1.52
"2013-01-30",9,1.02
"2013-01-31",8,22.86
"2013-02-01",-2,0
"2013-02-02",-4,0.51
"2013-02-03",-3,0.51
"2013-02-04",-3,0
"2013-02-05",-1,0.51
"2013-02-06",1,0
"2013-02-07",-2,0
"2013-02-08",-1,29.21
"2013-02-09",-3,9.65
"2013-02-10",-3,0
"2013-02-11",4,12.45
"2013-02-12",4,0
"2013-02-13",4,0.76
"2013-02-14",4,0
"2013-02-15",8,0
"2013-02-16",2,0.51
"2013-02-17",-4,0
"2013-02-18",-3,0
"2013-02-19",5,3.81
"2013-02-20",0,0
"2013-02-21",-2,0
"2013-02-22",0,0
"2013-02-23",4,

lines = 


""2013-01-01",1,0
"2013-01-02",-2,0
"2013-01-03",-2,0
"2013-01-04",1,0
"2013-01-05",3,0
"2013-01-06",4,0
"2013-01-07",5,0
"2013-01-08",6,0
"2013-01-09",7,0
"2013-01-10",7,0
"2013-01-11",6,13.97
"2013-01-12",7,0.51
"2013-01-13",8,0
"2013-01-14",8,2.29
"2013-01-15",3,3.05
"2013-01-16",2,17.53
"2013-01-17",4,0
"2013-01-18",-1,0
"2013-01-19",5,0
"2013-01-20",6,0
"2013-01-21",-2,0
"2013-01-22",-7,0
"2013-01-23",-9,0
"2013-01-24",-8,0
"2013-01-25",-7,1.78
"2013-01-26",-6,0
"2013-01-27",-3,0
"2013-01-28",1,5.59
"2013-01-29",6,1.52
"2013-01-30",9,1.02
"2013-01-31",8,22.86
"2013-02-01",-2,0
"2013-02-02",-4,0.51
"2013-02-03",-3,0.51
"2013-02-04",-3,0
"2013-02-05",-1,0.51
"2013-02-06",1,0
"2013-02-07",-2,0
"2013-02-08",-1,29.21
"2013-02-09",-3,9.65
"2013-02-10",-3,0
"2013-02-11",4,...


There are three columns in the dataset: the date, the mean temperature in Celsius, and the precipitation for the day. Since we already know the schema, we will describe it with a case class and let Spark infer the schema from that class using reflection.

You will first need to define the SparkSQL context. Do so by creating it from an existing SparkContext. Type in:

In [6]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext = org.apache.spark.sql.SQLContext@4f53832




org.apache.spark.sql.SQLContext@4f53832

Next, you need to import the implicit conversions that let you turn an RDD into a DataFrame (formerly a SchemaRDD). Type this:

In [7]:
import sqlContext.implicits._

Create a case class in Scala that defines the schema of the table. Type in:

In [8]:
case class Weather(date: String, temp: Int, precipitation: Double)

defined class Weather


Create the RDD of the Weather object:

In [9]:
// read from ./resources, the directory the data was extracted into above
val weather = sc.textFile("./resources/LabData/nycweather.csv").
    map(_.split(",")).
    map(w => Weather(w(0), w(1).trim.toInt, w(2).trim.toDouble)).
    toDF()

weather = [date: string, temp: int ... 1 more field]


[date: string, temp: int ... 1 more field]

You first load the file, then split each line on the commas, and finally map the resulting fields into the Weather class before converting the RDD to a DataFrame with toDF().
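The split-and-map step above assumes every line has three well-formed fields. As a minimal sketch in plain Scala (no Spark required), the same parsing can be written so that a short or non-numeric row yields `None` instead of throwing. The `WeatherParsing` object and its `parse` method are hypothetical names introduced here for illustration; the quote stripping is an assumption based on the quoted dates in the listing above.

```scala
import scala.util.Try

// Mirrors the lab's Weather schema: date, mean temperature (Celsius), precipitation.
case class Weather(date: String, temp: Int, precipitation: Double)

// Hypothetical helper, not part of the lab's code.
object WeatherParsing {
  // Parse one CSV line; returns None for short or non-numeric rows
  // instead of throwing an exception.
  def parse(line: String): Option[Weather] = {
    // A limit of -1 keeps trailing empty fields, so "a,b," yields three fields.
    val fields = line.split(",", -1).map(_.trim)
    if (fields.length != 3) None
    else Try {
      // Strip the surrounding double quotes from the date field.
      val date = fields(0).stripPrefix("\"").stripSuffix("\"")
      Weather(date, fields(1).toInt, fields(2).toDouble)
    }.toOption
  }
}
```

In the notebook, something like `sc.textFile(path).flatMap(line => WeatherParsing.parse(line))` followed by `toDF()` should drop the unparsable rows, although that combination has not been run in this lab environment.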

Next you need to register the DataFrame as a temporary table so that it can be queried with SQL. Type in:

In [10]:
weather.registerTempTable("weather")



At this point, you are ready to create and run some queries on the table. You want to get a list of the hottest dates with some precipitation. Type in:

In [11]:
val hottest_with_precip = sqlContext.sql("SELECT * FROM weather WHERE precipitation > 0.0 ORDER BY temp DESC")

hottest_with_precip.collect()


Normal RDD operations work on the RDD that underlies the query result. Print the ten hottest days with some precipitation to the console:

In [12]:
// .rdd exposes the rows as an RDD; take(10) keeps the ORDER BY temp DESC order,
// whereas top(10) would re-sort the tuples by their own (string) ordering
hottest_with_precip.rdd.map(x => ("Date: " + x(0), "Temp : " + x(1), "Precip: " + x(2))).take(10).foreach(println)
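To make the meaning of the query concrete, its `WHERE` and `ORDER BY` clauses can be mirrored on an ordinary Scala collection. This is only an illustrative sketch: the `HottestWithPrecip` object is a hypothetical name, and the sample rows are a handful of entries copied from the nycweather.csv listing shown earlier.

```scala
// Hypothetical helper, not part of the lab's code.
object HottestWithPrecip {
  // (date, temp, precipitation) rows copied from the nycweather.csv listing
  val sample: Seq[(String, Int, Double)] = Seq(
    ("2013-01-10", 7, 0.0),
    ("2013-01-11", 6, 13.97),
    ("2013-01-12", 7, 0.51),
    ("2013-01-30", 9, 1.02),
    ("2013-01-31", 8, 22.86)
  )

  // Equivalent of: WHERE precipitation > 0.0 ORDER BY temp DESC
  def query(rows: Seq[(String, Int, Double)]): Seq[(String, Int, Double)] =
    rows.filter(_._3 > 0.0).sortBy(r => -r._2)
}
```

Calling `HottestWithPrecip.query(HottestWithPrecip.sample)` drops the dry day and lists the remaining days from warmest to coldest. Note that on an RDD of tuples, `top(n)` sorts by the tuples' own ordering, while `take(n)` keeps the order the query produced.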

## Creating a Spark application using MLlib

In this section, Spark will be used to compute a K-Means clustering of taxi drop-off latitudes and longitudes with 3 clusters. The sample data contains a subset of taxi trips with hack license, medallion, pickup date/time, drop-off date/time, pickup/drop-off latitude/longitude, passenger count, trip distance, trip time and other information. As such, the cluster centers may give a good indication of where best to hail a cab.

Remember, this is only a subset of the file that you used in a previous exercise. Running this exercise on the full dataset would take a long time, because we are running in a test environment with limited resources.

Import the needed packages for K-Means algorithm and Vector packages:

In [13]:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

Create an RDD

In [14]:
val taxiFile = sc.textFile("./resources/LabData/nyctaxisub.csv")

taxiFile = ./resources/LabData/nyctaxisub.csv MapPartitionsRDD[8] at textFile at <console>:37


./resources/LabData/nyctaxisub.csv MapPartitionsRDD[8] at textFile at <console>:37

Determine the number of rows in taxiFile.

In [15]:
taxiFile.count()


Cleanse the data.

In [16]:
val taxiData=taxiFile.filter(_.contains("2013")).
    filter(_.split(",")(3)!="" ).    //dropoff_latitude
    filter(_.split(",")(4)!="")      //dropoff_longitude

taxiData = MapPartitionsRDD[11] at filter at <console>:41



MapPartitionsRDD[11] at filter at <console>:41

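The first filter limits the rows to those that contain "2013", that is, trips that occurred in the year 2013; this also removes any header line in the file. The fields at indices 3 and 4 (the fourth and fifth columns, counting from one) contain the drop-off latitude and longitude. The remaining two filters drop rows where these values are empty, because the later conversion to Double would throw an exception on an empty string.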
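The three filters can also be read as a single predicate over one line of text. The sketch below is plain Scala and makes two assumptions beyond the lab's code: it also rejects lines that have fewer than five fields, and it treats whitespace-only values as empty. `TaxiCleansing` and `isUsable` are hypothetical names.

```scala
// Hypothetical helper, not part of the lab's code.
object TaxiCleansing {
  // True when the line mentions 2013 and has non-empty values at
  // indices 3 (dropoff_latitude) and 4 (dropoff_longitude).
  def isUsable(line: String): Boolean = {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val fields = line.split(",", -1)
    line.contains("2013") &&
      fields.length > 4 &&
      fields(3).trim.nonEmpty &&
      fields(4).trim.nonEmpty
  }
}
```

With this in place, `taxiFile.filter(line => TaxiCleansing.isUsable(line))` would be one way to express the cleansing step, splitting each line once rather than once per filter.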

Do another count to see what was removed.

In [17]:
taxiData.count()


In this case, if we had used the full set of data, it would have filtered out a great many more lines.

To fence the area roughly to New York City use this command:

In [18]:
val taxiFence=taxiData.
    filter(_.split(",")(3).toDouble>40.70).
    filter(_.split(",")(3).toDouble<40.86).
    filter(_.split(",")(4).toDouble>(-74.02)).
    filter(_.split(",")(4).toDouble<(-73.93))

taxiFence = MapPartitionsRDD[15] at filter at <console>:45



MapPartitionsRDD[15] at filter at <console>:45

Determine how many are left in taxiFence:

In [19]:
taxiFence.count()


Approximately 43,354 rows were dropped because their drop-off points are outside of New York City.
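The fence is a simple bounding box on latitude and longitude. As a rough sketch in plain Scala, the four filters can be folded into one test that splits each line only once; the bounds are copied from the filters above, while `TaxiFence`, `dropoff` and `inFence` are hypothetical names, and the tolerant parsing with `Try` is an addition not present in the lab's code.

```scala
import scala.util.Try

// Hypothetical helper, not part of the lab's code.
object TaxiFence {
  // Bounding box copied from the fence filters (degrees)
  val MinLat = 40.70
  val MaxLat = 40.86
  val MinLon = -74.02
  val MaxLon = -73.93

  // Extract (latitude, longitude) from indices 3 and 4, if both parse as Double.
  def dropoff(line: String): Option[(Double, Double)] = {
    val f = line.split(",", -1)
    if (f.length <= 4) None
    else Try((f(3).toDouble, f(4).toDouble)).toOption
  }

  // True when the drop-off point lies strictly inside the bounding box.
  def inFence(line: String): Boolean =
    dropoff(line).exists { case (lat, lon) =>
      lat > MinLat && lat < MaxLat && lon > MinLon && lon < MaxLon
    }
}
```

A point such as (40.75, -73.99) passes the test, while one at (40.64, -73.78) falls outside the box and is rejected.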

Create Vectors with the latitudes and longitudes that will be used as input to the K-Means algorithm.

In [28]:
val taxi=taxiFence.
    map{
        line=>Vectors.dense(
            line.split(',').slice(3,5).map(_ .toDouble)
        )
    }

taxi = MapPartitionsRDD[16] at map at <console>:44



MapPartitionsRDD[16] at map at <console>:44

In [20]:
val iterationCount=10
val clusterCount=3

val model=KMeans.train(taxi,clusterCount,iterationCount)
val clusterCenters=model.clusterCenters.map(_.toArray)

clusterCenters.foreach(lines=>println(lines(0),lines(1)))


Now we know the map coordinates of the cluster centers. Not surprisingly, the second point is between the Theater District and Grand Central. The third point is in the area of The Village, NYU, SoHo and Little Italy. The first point is the Upper East Side, presumably where people are more likely to take cabs than subways.
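For intuition about what `KMeans.train` computes, here is a minimal sketch of the basic K-Means (Lloyd) iteration on 2-D points in plain Scala. It is not MLlib's implementation: MLlib runs distributed and chooses its own initial centers, whereas this sketch takes the initial centers as an explicit argument. `KMeansSketch`, `step` and `run` are hypothetical names.

```scala
// Hypothetical helper, not part of the lab's code or of MLlib.
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points
  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // One iteration: assign each point to its nearest center, then move
  // every center to the mean of the points assigned to it.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
    centers.indices.map { i =>
      assigned.get(i) match {
        case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
        case None     => centers(i) // keep a center that attracted no points
      }
    }
  }

  // Repeat the step a fixed number of times, starting from the given centers.
  def run(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```

Because the result depends on the starting centers, the order in which cluster centers are reported can differ from one run to another.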



## Creating a Spark application using Spark Streaming

This section focuses on Spark Streaming, an easy-to-build, scalable, stateful (e.g. sliding windows) stream processing library. Streaming jobs are written the same way as Spark batch jobs and can be coded in Java, Scala and Python. In this exercise, taxi trip data will be streamed over a socket connection and then analyzed to provide a summary of the number of passengers by taxi vendor. This will be implemented in the Spark shell using Scala.

There are two relevant files for this section. The first one is nyctaxi100.csv, which will serve as the source of the stream. The other is a Python file, taxistreams.py, which will feed the CSV file through a socket connection to simulate a stream.

### <span style="color: red">IN ORDER TO START THE STREAM PLEASE OPEN A NEW PYTHON NOTEBOOK AND RUN THE CODE BELOW IN IT:</span> 

To open a new Python notebook, click on the blue notebook button at the top right of this page, next to the search box. Choose PYTHON 2 and then copy and paste the code below into the cell in the new Python notebook. Run the cell as normal. To interrupt the kernel, hit the STOP button in the Action buttons above.

```
!python /resources/LabData/taxistreams.py

```

Once started, the program will bind to and listen on localhost port 7777. When a connection is made, it will read nyctaxi100.csv and send it across the socket. The sleep is set such that one line will be sent every 0.5 seconds, or 2 rows a second. This delay was intentionally set to a high value to make it easier to view the data during execution.

Turn off logging so that you can see the output of the application, and import the required libraries:

In [21]:
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

Create the StreamingContext by using the existing SparkContext (sc). It will use a 1-second batch interval, which means the stream is divided into 1-second batches and each batch becomes an RDD. This interval is intentionally long to make it easier to read the data during execution.

In [22]:
val ssc = new StreamingContext(sc,Seconds(1))

ssc = org.apache.spark.streaming.StreamingContext@10344997


org.apache.spark.streaming.StreamingContext@10344997

Create the socket stream that connects to the localhost socket 7777. This matches the port that the Python script is listening on. Each batch from the stream will be an RDD of lines.

In [23]:
val lines = ssc.socketTextStream("localhost",7777)

lines = org.apache.spark.streaming.dstream.SocketInputDStream@27b2dad9


org.apache.spark.streaming.dstream.SocketInputDStream@27b2dad9

Next, put in the business logic: split each line on the commas and map it to a pair of pass(15), which is the vendor, and pass(7), which is the passenger count. The pairs are then reduced by key, resulting in a summary of the number of passengers by vendor for each batch.

In [24]:
val pass = lines.map(_.split(",")).
    map(pass=>(pass(15),pass(7).toInt)).
    reduceByKey(_+_)

pass = org.apache.spark.streaming.dstream.ShuffledDStream@4c8a0136


org.apache.spark.streaming.dstream.ShuffledDStream@4c8a0136
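To see what this pipeline computes for a single batch, the same split, map and reduce-by-key steps can be applied to an ordinary sequence of lines. The sketch below is plain Scala with no streaming involved; the field positions 15 and 7 come from the cell above, while `PassengersByVendor`, `summarize` and the made-up 16-field `sampleRow` are hypothetical and exist only for illustration.

```scala
// Hypothetical helper, not part of the lab's code.
object PassengersByVendor {
  // Field positions taken from the lab's code: 15 = vendor, 7 = passenger count.
  def summarize(batch: Seq[String]): Map[String, Int] =
    batch
      .map(_.split(","))
      .map(f => (f(15), f(7).toInt))
      .groupBy(_._1)                     // collect the pairs of each vendor
      .map { case (vendor, pairs) => vendor -> pairs.map(_._2).sum }

  // Made-up 16-field row for trying the function without the real file.
  def sampleRow(vendor: String, passengers: Int): String =
    (0 until 16).map {
      case 7  => passengers.toString
      case 15 => vendor
      case i  => s"f$i"
    }.mkString(",")
}
```

For a batch holding two VTS rows with 2 and 3 passengers and one CMT row with 1 passenger, `summarize` yields a total of 5 for VTS and 1 for CMT, which is the kind of per-batch summary the stream prints.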

Print out to the console:

In [25]:
pass.print()

The next two lines start the stream and block until it is terminated.

In [26]:
ssc.start()
ssc.awaitTermination()

-------------------------------------------
Time: 1575346559000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346560000 ms
-------------------------------------------
("VTS",2)

-------------------------------------------
Time: 1575346561000 ms
-------------------------------------------
("CMT",1)
("VTS",5)

-------------------------------------------
Time: 1575346562000 ms
-------------------------------------------
("CMT",5)
("VTS",1)

-------------------------------------------
Time: 1575346563000 ms
-------------------------------------------
("CMT",1)
("VTS",2)

-------------------------------------------
Time: 1575346564000 ms
-------------------------------------------
("VTS",4)

-------------------------------------------
Time: 1575346565000 ms
-------------------------------------------
("CMT",2)
("VTS",1)

-------------------------------------------
Time: 1575346566000 ms
-------------------------------------------
("CMT

Name: java.lang.InterruptedException
Message: null
StackTrace:   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
  at org.apache.spark.streaming.ContextWaiter.waitForStopOrError(ContextWaiter.scala:63)
  at org.apache.spark.streaming.StreamingContext.awaitTermination(StreamingContext.scala:618)

-------------------------------------------
Time: 1575346571000 ms
-------------------------------------------
("VTS",6)

-------------------------------------------
Time: 1575346572000 ms
-------------------------------------------
("CMT",4)

-------------------------------------------
Time: 1575346573000 ms
-------------------------------------------
("CMT",2)
("VTS",1)

-------------------------------------------
Time: 1575346574000 ms
-------------------------------------------
("CMT",2)
("VTS",1)

-------------------------------------------
Time: 1575346575000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346576000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346577000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346578000 ms
-------------------------------------------

-------------------------------------------


-------------------------------------------
Time: 1575346644000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346645000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346646000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346647000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346648000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346649000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346650000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346651000 ms
-------------------------------------------

-------------------------------------------
Time: 1575346652000 ms
-------------------------------------

It will take a few cycles for the connection to be recognized, and then the data is sent. In this case, 2 rows per second of taxi trip data are received in a 1-second batch interval.
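The relationship between the send rate and the batch interval can be checked with a little arithmetic. The sketch below is a deliberately simplified model in plain Scala: it assigns each row to a batch purely by its arrival time, which ignores how Spark Streaming actually buffers received data into blocks. `MicroBatches` and `countsPerBatch` are hypothetical names.

```scala
// Hypothetical helper; a simplified model, not Spark's batching logic.
object MicroBatches {
  // Group arrival times (milliseconds since the stream started) into
  // fixed-length batches: batch index -> number of rows in that batch.
  def countsPerBatch(arrivalsMs: Seq[Long], batchMs: Long): Map[Long, Int] =
    arrivalsMs.groupBy(_ / batchMs).map { case (batch, xs) => batch -> xs.size }
}
```

With one row every 500 ms and a 1000 ms batch interval, each batch in this model holds two rows. In practice the number of rows per batch varies, as the output above shows, because arrivals do not line up exactly with batch boundaries.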

In the Python terminal, the contents of the file are printed as they are streamed.

**Note: TO STOP THE STREAM PLEASE INTERRUPT THE KERNEL IN BOTH THE OTHER PYTHON NOTEBOOK AND THIS NOTEBOOK. THEN RESTART THIS NOTEBOOK'S KERNEL TO CONTINUE ONTO THE GRAPHX APPLICATION**

This is just a simple example showing how you can take streaming data into Spark and do some type of processing on it. In the case here, the taxi vendor and the number of passengers were extracted from the data stream.

## Creating a Spark application using GraphX

users.txt is a set of users and followers.txt describes the relationships between the users. Take a look at the contents of these two files.

In [29]:
println("Users: ")
println(scala.io.Source.fromFile("./resources/LabData/users.txt").mkString)

println("Followers: ")
println(scala.io.Source.fromFile("./resources/LabData/followers.txt").mkString)

Users: 
1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys

Followers: 
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7



Import the GraphX package:

In [30]:
import org.apache.spark.graphx._

Create the users RDD and parse into tuples of user id and attribute list:

In [32]:
val users = (sc.textFile("./resources/LabData/users.txt").map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))

users.take(5).foreach(println)

(1,[Ljava.lang.String;@f264a68)
(2,[Ljava.lang.String;@3640ae82)
(3,[Ljava.lang.String;@6e7fbcd0)
(4,[Ljava.lang.String;@13707999)
(6,[Ljava.lang.String;@2e381495)


users = MapPartitionsRDD[197] at map at <console>:54



MapPartitionsRDD[197] at map at <console>:54

Parse the edge data, which is already in userId -> userId format

In [33]:
val followerGraph = GraphLoader.edgeListFile(sc, "./resources/LabData/followers.txt")

followerGraph = org.apache.spark.graphx.impl.GraphImpl@51cdcea3


org.apache.spark.graphx.impl.GraphImpl@51cdcea3

Attach the user attributes

In [34]:
val graph = followerGraph.outerJoinVertices(users) {
    case (uid, deg, Some(attrList)) => attrList
    case (uid, deg, None) => Array.empty[String]
}

graph = org.apache.spark.graphx.impl.GraphImpl@19e0c5b8


org.apache.spark.graphx.impl.GraphImpl@19e0c5b8

Restrict the graph to users with usernames and names:

In [35]:
val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)

subgraph = org.apache.spark.graphx.impl.GraphImpl@185a37ed


org.apache.spark.graphx.impl.GraphImpl@185a37ed
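The join and the subgraph restriction can be pictured on plain Scala maps. This is only an analogy under simplifying assumptions: vertices are represented by a set of ids and attributes by a map, with no edges involved. `AttachAttributes`, `attach` and `withFullProfile` are hypothetical names.

```scala
// Hypothetical helper, not part of the lab's code or of GraphX.
object AttachAttributes {
  // Left outer join: every vertex id keeps an entry; ids without a matching
  // user get an empty attribute array, like the None case in the join above.
  def attach(vertexIds: Set[Long], users: Map[Long, Array[String]]): Map[Long, Array[String]] =
    vertexIds.map(id => id -> users.getOrElse(id, Array.empty[String])).toMap

  // Keep only vertices that have both a username and a name.
  def withFullProfile(attrs: Map[Long, Array[String]]): Map[Long, Array[String]] =
    attrs.filter { case (_, a) => a.length == 2 }
}
```

A user such as `8,anonsys`, which has a username but no name, ends up with a one-element attribute array and is therefore excluded by the size check.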

Compute the PageRank

In [36]:
val pagerankGraph = subgraph.pageRank(0.001)

pagerankGraph = org.apache.spark.graphx.impl.GraphImpl@3850b08e


org.apache.spark.graphx.impl.GraphImpl@3850b08e
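As a rough illustration of what PageRank computes on this small graph, the sketch below iterates the un-normalized update rule in plain Scala. It rests on several assumptions: a reset probability of 0.15, a fixed number of iterations rather than the convergence tolerance passed to `pageRank(0.001)`, and each line of followers.txt read as a source id followed by a destination id. `PageRankSketch` and `run` are hypothetical names, and the resulting values are only expected to be close to those GraphX reports.

```scala
// Hypothetical helper, not GraphX's implementation.
object PageRankSketch {
  // Edges from followers.txt, read as (source, destination)
  val edges: Seq[(Long, Long)] =
    Seq((2L, 1L), (4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L))

  // rank(v) = reset + (1 - reset) * sum over in-edges of rank(src) / outDegree(src)
  def run(edges: Seq[(Long, Long)], iterations: Int, reset: Double = 0.15): Map[Long, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    val initial = vertices.map(v => v -> 1.0).toMap
    (1 to iterations).foldLeft(initial) { (ranks, _) =>
      // Contribution each destination receives along its incoming edges
      val incoming = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (d, cs) => d -> cs.map(_._2).sum }
      vertices.map(v => v -> (reset + (1 - reset) * incoming.getOrElse(v, 0.0))).toMap
    }
  }
}
```

Running `PageRankSketch.run(PageRankSketch.edges, 50)` ranks user 1 highest and user 4, who has no followers in this file, lowest.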

Get the attributes of the top pagerank users

In [37]:
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
    case (uid, attrList, Some(pr)) => (pr, attrList.toList)
    case (uid, attrList, None) => (0.0, attrList.toList)
}

userInfoWithPageRank = org.apache.spark.graphx.impl.GraphImpl@1ae21320


org.apache.spark.graphx.impl.GraphImpl@1ae21320

Print the line out:

In [38]:
println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))

(1,(1.4610558475474507,List(BarackObama, Barack Obama)))
(2,(1.3926425103962674,List(ladygaga, Goddess of Love)))
(7,(1.2956193310217194,List(odersky, Martin Odersky)))
(3,(0.9985540153884633,List(jeresig, John Resig)))
(6,(0.7013832556651652,List(matei_zaharia, Matei Zaharia)))


<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: Enjoyed using Jupyter notebooks with Spark? Get yourself a free 
    <a href="http://cocl.us/DSX_on_Cloud">IBM Cloud</a> account where you can use Data Science Experience notebooks
    and have *two* Spark executors for free!
</div>

## Summary

Having completed this exercise, you should have some familiarity with using the Spark libraries. In particular, you used Spark SQL to query data inside of Spark. You used Spark Streaming to process an incoming stream of data in small batches. You used Spark's MLlib to run the *k*-means algorithm to find the best place to hail a cab. Finally, you used Spark's GraphX library to perform parallel graph calculations on a dataset to find the attributes of the top users.

This notebook is part of the free course on **Cognitive Class** called *Spark Fundamentals I*. If you accessed this notebook outside the course, you can take this free self-paced course online by going to: https://cognitiveclass.ai/courses/what-is-spark/

### About the Authors:  
Hi! It's [Alex Aklson](https://www.linkedin.com/in/aklson/), one of the authors of this notebook. I hope you found this lab educational! There is much more to learn about Spark but you are well on your way. Feel free to connect with me if you have any questions.
<hr>