
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext that runs its executors on SURFsara's Hadoop cluster. The cluster is managed by the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster); note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link.

First initialize Kerberos from a Jupyter terminal.
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials to confirm that a ticket was issued:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
04/29/2016 07:08:10  04/30/2016 07:08:10  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 04/29/2016 07:08:10
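
The ticket lifetime can also be read programmatically. The following is a minimal sketch, not part of the original notebook, that parses `klist` output like the listing above. The helper name `ticket_expiry` is hypothetical, and the `MM/DD/YYYY HH:MM:SS` date format is an assumption taken from this one listing; other systems may print dates differently.

```python
from datetime import datetime

def ticket_expiry(klist_output):
    """Return the expiry time of the first krbtgt ticket, or None if absent."""
    for line in klist_output.splitlines():
        fields = line.split()
        # Ticket lines: <start date> <start time> <end date> <end time> <principal>
        if len(fields) == 5 and fields[4].startswith("krbtgt/"):
            # Assumed date format, based on the listing in this notebook.
            return datetime.strptime(fields[2] + " " + fields[3],
                                     "%m/%d/%Y %H:%M:%S")
    return None

sample = """Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
04/29/2016 07:08:10  04/30/2016 07:08:10  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
\trenew until 04/29/2016 07:08:10
"""

expiry = ticket_expiry(sample)
print(expiry)
```

Comparing `expiry` with `datetime.now()` before starting a long job would show whether the ticket is likely to run out while the notebook is still active.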


In [2]:
! hdfs dfs -ls 
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-04-29 12:19 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-04-29 14:02 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition


The listing above confirms that we can browse HDFS.

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [1]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Master is now yarn-client. The YARN and hadoop config is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. This one sets the number of executors requested on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "20g")

# UFW (firewall) is active on the VM. Only these ports were opened, so Spark must use them instead of binding to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

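# sendNotificationToMattia is not defined in this notebook; it is presumably
# provided by ../spark-scripts/bullet.py, so that script must be loaded first.
try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    print str(err)
    sendNotificationToMattia("Spark Context failed", str(err))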

NameError: name 'sendNotificationToMattia' is not defined
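
The `NameError` above occurs when the notification helper has not been loaded into the running kernel. One way to keep the cell usable in that situation is to fall back to printing. This is a minimal sketch under that assumption: `get_notifier` and `fallback` are hypothetical names, and the real helper is only assumed to take a title and a body, as in the calls above.

```python
def get_notifier(namespace, name="sendNotificationToMattia"):
    """Return namespace[name] if it is callable, else a print-based stand-in."""
    candidate = namespace.get(name)
    if callable(candidate):
        return candidate

    def fallback(title, body):
        # Stand-in used when the helper script was not loaded in this kernel.
        print("[%s] %s" % (title, body))

    return fallback

# Look the helper up in the notebook's global namespace.
notify = get_notifier(globals())
notify("Spark Context", "Ready!")
```

With this in place, the `try`/`except` block could call `notify(...)` instead of the helper directly, so a missing helper would no longer mask the actual Spark error.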


# Now you can run your code

Pick a clustering algorithm. CLUSTER_ALGO is the name of the file that provides a classify(x, y[, threshold]) function.

In [5]:
execfile('../spark-scripts/conventions.py')
execfile('../spark-scripts/splitCluster.py')
execfile('../spark-scripts/eval.py')
execfile('../spark-scripts/implicitPlaylistAlgoFunctions.py')
execfile('../spark-scripts/implicitPlaylistAlgoMain.py')

CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.9

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'
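
The load step in the next section builds the cluster path from `CLUSTER_ALGO` and the digits of `THRESHOLD` after the decimal point. The sketch below reproduces that string manipulation on its own; `cluster_path` is a hypothetical helper introduced only for illustration.

```python
HDFS_PATH = "hdfs://hathi-surfsara"
BASE_PATH = HDFS_PATH + "/user/robertop/mattia"

def cluster_path(base_path, algo, threshold):
    # str(0.9) is "0.9"; dropping the first two characters ("0.") leaves "9".
    suffix = str(threshold)[2:]
    return base_path + "/clusters/" + algo + suffix

print(cluster_path(BASE_PATH, "jaccardBase", 0.9))
print(cluster_path(BASE_PATH, "jaccardBase", 0.75))
```

Note that this naming scheme only distinguishes thresholds strictly between 0 and 1: both `0.0` and `1.0` produce the suffix `0`.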


# Load data 

In [82]:
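import json

# Cluster files are named <algorithm><threshold digits>, e.g. jaccardBase9 for THRESHOLD = 0.9
clusterSongsFileRDD = sc.pickleFile(BASE_PATH + '/clusters/' + CLUSTER_ALGO + str(THRESHOLD)[2:])

# Invert (cluster_id, [song_id, ...]) records into (song_id, cluster_id) pairs
songToClusterRDD = clusterSongsFileRDD.flatMap(lambda x: [(int(i), x[0]) for i in x[1]])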


playlistRDD = sc.textFile(BASE_PATH + '/30Mdataset/entities/playlist.idomaar')
playlistRDD = playlistRDD.map(lambda x: json.loads(x.split('\t')[4]))
playlistRDD = playlistRDD.map(lambda x: x['objects'])
print "Total playlist: " + str(playlistRDD.count())

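def filter_bad_entry(x):
    # Return (song_ids, 0) for a well-formed playlist, or (x, 1) if the ids cannot be extracted
    try:
        result = [i['id'] for i in x]
        return (result, 0)
    except Exception:
        return (x, 1)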

playlistRDD = playlistRDD.map(filter_bad_entry)
bad_n = playlistRDD.filter(lambda x: x[1] == 1 or len(x[0]) == 0).count()
print "Bad playlists: " + str(bad_n)

playlistOkRDD = playlistRDD.filter(lambda x: x[1] == 0 and len(x[0]) > 0)
playlist_count = float(playlistOkRDD.count())
print "Good playlists: " + str(playlist_count)

Total playlist: 57561
Bad playlists: 9139
Good playlists: 48422.0
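
To see how the filter separates good playlists from bad ones, here is a minimal pure-Python sketch that needs no Spark. The sample entries are hypothetical; the actual shape of the malformed records in the dataset is not shown in this notebook.

```python
def filter_bad_entry(x):
    # Same logic as the notebook cell: flag 0 = ids extracted, flag 1 = failure.
    try:
        return ([i['id'] for i in x], 0)
    except Exception:
        return (x, 1)

# Hypothetical sample records standing in for the 'objects' field.
samples = [
    [{'id': 10}, {'id': 11}],  # well-formed playlist
    [],                        # empty playlist: extracted fine, but has no songs
    [{'name': 'no id'}],       # missing 'id' key
    None,                      # not iterable
]

flagged = [filter_bad_entry(s) for s in samples]

# Same predicates as the two RDD filters.
bad = [f for f in flagged if f[1] == 1 or len(f[0]) == 0]
good = [f for f in flagged if f[1] == 0 and len(f[0]) > 0]

print("Bad playlists: " + str(len(bad)))
print("Good playlists: " + str(len(good)))
```

The two predicates are complementary, which is why the bad and good counts in the output above add up to the total.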


In [83]:
def unique_percentage(x):
    # Fraction (between 0 and 1) of distinct song ids in the playlist x[0]
    playlist_length = float(len(x[0]))
    play_unique = float(len(set(x[0])))

    percentage = play_unique / playlist_length

    return (x, percentage)

playlistUniRDD = playlistOkRDD.map(unique_percentage)
playlistUniRDD.count()

48422

In [84]:
unique_total = playlistUniRDD.map(lambda x: x[1]).sum()
print unique_total

playlist_avg_uniqueness = unique_total / playlist_count

print "Average Percentage of Uniqueness: " + str(playlist_avg_uniqueness*100) + ' %'

48407.3316817
Average Percentage of Uniqueness: 0.999697073265
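
As a worked example of the uniqueness measure, the sketch below applies the same computation to three hypothetical playlists in plain Python. The recorded sum of 48407.3316817 over 48422 playlists corresponds to an average fraction of roughly 0.9997, i.e. about 99.97 %.

```python
def unique_fraction(song_ids):
    # Distinct songs divided by playlist length, as a float between 0 and 1.
    return float(len(set(song_ids))) / len(song_ids)

# Hypothetical playlists of song ids.
playlists = [[1, 2, 3, 4], [5, 5, 6, 7], [8, 8]]

fractions = [unique_fraction(p) for p in playlists]
average = sum(fractions) / len(fractions)

print(fractions)
print("Average uniqueness: " + str(average * 100) + " %")
```

Empty playlists would cause a division by zero here, which is one reason the notebook filters them out before this step.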


In [76]:
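# Number the good playlists and emit one (song_id, playlist_index) pair per playlist entry
playlistFlatRDD = playlistOkRDD.map(lambda x: x[0]).zipWithIndex().flatMap(lambda x: [(i,x[1]) for i in x[0]])

# Sanity check on the number of songs and clusters: the join should keep every distinct song
uniqueSongsRDD = playlistFlatRDD.groupByKey()
uniqueClusterRDD = uniqueSongsRDD.join(songToClusterRDD)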

n_songs = uniqueSongsRDD.count()
n_clus = uniqueClusterRDD.count()

print "Songs: " + str(n_songs)
print "Clusters: " + str(n_clus)
if n_songs == n_clus:
    print "OK"
else:
    print "BAD!"

Songs: 466244
Clusters: 466244
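
The sanity check can be traced on a tiny example without Spark. The sketch below emulates `zipWithIndex`, `flatMap`, `groupByKey` and `join` with plain Python containers; all ids are hypothetical, and the dictionary-based join assumes each song belongs to exactly one cluster (a real RDD join would emit one row per matching pair).

```python
from collections import defaultdict

# Hypothetical inputs: playlists of song ids and (cluster_id, song_ids) records.
playlists = [[10, 11], [11, 12]]
clusters = [(100, [10, 11]), (101, [12])]

# flatMap over clusters: song_id -> cluster_id
song_to_cluster = dict((int(s), cid) for cid, songs in clusters for s in songs)

# zipWithIndex + flatMap: (song_id, playlist_index) pairs
playlist_flat = [(s, idx) for idx, p in enumerate(playlists) for s in p]

# groupByKey: song_id -> list of playlist indices
songs_grouped = defaultdict(list)
for s, idx in playlist_flat:
    songs_grouped[s].append(idx)

# join: keep only the songs that have a cluster
joined = dict((s, (idxs, song_to_cluster[s]))
              for s, idxs in songs_grouped.items() if s in song_to_cluster)

n_songs = len(songs_grouped)
n_clus = len(joined)
print("Songs: " + str(n_songs))
print("Clusters: " + str(n_clus))
print("OK" if n_songs == n_clus else "BAD!")
```

If a song appeared in a playlist but in no cluster, it would be dropped by the join and the two counts would differ.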


In [80]:
# Replace each song by its cluster, then regroup: (playlist_index, [cluster_id, ...])
playlistJoinedRDD = playlistFlatRDD.join(songToClusterRDD).map(lambda x: x[1]).groupByKey()
playlistJoinedRDD.take(10)

[(0, <pyspark.resultiterable.ResultIterable at 0x7f50358d5c50>),
 (43014, <pyspark.resultiterable.ResultIterable at 0x7f50358d5850>),
 (37386, <pyspark.resultiterable.ResultIterable at 0x7f50358d5d10>),
 (31758, <pyspark.resultiterable.ResultIterable at 0x7f50358d5790>),
 (26130, <pyspark.resultiterable.ResultIterable at 0x7f5035843dd0>),
 (20502, <pyspark.resultiterable.ResultIterable at 0x7f5035843b50>),
 (14874, <pyspark.resultiterable.ResultIterable at 0x7f5035843950>),
 (9246, <pyspark.resultiterable.ResultIterable at 0x7f5035843d10>),
 (3618, <pyspark.resultiterable.ResultIterable at 0x7f5035843cd0>),
 (46632, <pyspark.resultiterable.ResultIterable at 0x7f50358434d0>)]
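
The `ResultIterable` objects above hide the grouped values. The following plain-Python sketch shows what the join, map and regrouping produce for a few hypothetical (song_id, playlist_index) pairs, with the groups materialised as lists.

```python
from collections import defaultdict

# Hypothetical (song_id, playlist_index) pairs and song -> cluster mapping.
playlist_flat = [(10, 0), (11, 0), (11, 1), (12, 1)]
song_to_cluster = {10: 100, 11: 100, 12: 101}

# join + map(lambda x: x[1]): keep only (playlist_index, cluster_id)
pairs = [(idx, song_to_cluster[s]) for s, idx in playlist_flat
         if s in song_to_cluster]

# groupByKey: playlist_index -> list of cluster ids
playlist_clusters = defaultdict(list)
for idx, cid in pairs:
    playlist_clusters[idx].append(cid)

for idx in sorted(playlist_clusters):
    print((idx, playlist_clusters[idx]))
```

On the actual RDD, `playlistJoinedRDD.mapValues(list).take(10)` should display the cluster ids in the same way instead of the iterable objects.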

In [None]:
def unique_percentage_cluster(x):
    # x is (playlist_index, iterable of cluster ids); materialise the iterable first
    x_list = list(x[1])
    playlist_length = float(len(x_list))
    play_unique = float(len(set(x_list)))

    # Fraction (between 0 and 1) of distinct clusters in the playlist
    percentage = play_unique / playlist_length

    return (x, percentage)
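
A short sketch of what this function returns for two hypothetical rows. It illustrates that a playlist whose songs are all distinct can still score below 1 at cluster level when several of its songs fall into the same cluster.

```python
def unique_percentage_cluster(x):
    # x is (playlist_index, iterable of cluster ids)
    cluster_ids = list(x[1])
    return (x, float(len(set(cluster_ids))) / len(cluster_ids))

# Hypothetical rows: (playlist_index, cluster ids of its songs)
rows = [(0, [100, 100, 101]), (1, [100, 101])]

results = [unique_percentage_cluster(r) for r in rows]
for row, fraction in results:
    print("playlist %d: %.3f" % (row[0], fraction))
```

Since the cell above has not been executed in this notebook, applying it as `playlistJoinedRDD.map(unique_percentage_cluster)` is an assumption about the intended next step, mirroring how `unique_percentage` was used on the song-level RDD.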