
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster through YARN. You can follow the cluster in the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster); note that your machine must be authenticated via Kerberos to open that link.

First initialize Kerberos from a Jupyter terminal.
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials to check that a ticket was granted:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
04/26/2016 06:57:12  04/27/2016 06:57:12  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 04/26/2016 06:57:12
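
The `klist` output above can also be checked programmatically before starting a long job. The following is only a minimal sketch: the helper name `ticket_expiry` is hypothetical, and the date format is an assumption taken from the output shown above (it can differ with locale or Kerberos version). It is written to run under both Python 2.7 and Python 3.

```python
from datetime import datetime

# Assumption: dates are printed as in the klist output above (MM/DD/YYYY HH:MM:SS).
KLIST_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def ticket_expiry(klist_output, service_prefix="krbtgt/"):
    """Return the expiry time of the first ticket whose service principal
    starts with service_prefix, or None if no such ticket line is found."""
    for line in klist_output.splitlines():
        fields = line.split()
        # Ticket lines have five fields:
        # start date, start time, expiry date, expiry time, service principal
        if len(fields) == 5 and fields[4].startswith(service_prefix):
            return datetime.strptime(fields[2] + " " + fields[3], KLIST_TIME_FORMAT)
    return None


# Sample text copied from the klist output above
sample = """Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
04/26/2016 06:57:12  04/27/2016 06:57:12  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
\trenew until 04/26/2016 06:57:12
"""

expiry = ticket_expiry(sample)
print(expiry)
```

Comparing the returned value with `datetime.now()` gives a quick indication of whether `kinit` needs to be run again.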


Verify that we can browse HDFS:

In [2]:
! hdfs dfs -ls
# Load the notification helper (sendNotificationToMattia is used in later cells)
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-04-26 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-04-26 06:57 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-19 14:35 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Run in yarn-client mode. The YARN and Hadoop configuration is read from the environment.
sconf.setMaster("yarn-client")

# Many Spark settings can be controlled via the SparkConf. This one sets the number of executors on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "20g")

# UFW (firewall) is active on the VM. Only these ports were opened explicitly, so Spark must not bind to random ports:
sconf.set("spark.driver.port", "51800")
sconf.set("spark.fileserver.port", "51801")
sconf.set("spark.broadcast.port", "51802")
sconf.set("spark.replClassServer.port", "51803")
sconf.set("spark.blockManager.port", "51804")
sconf.set("spark.authenticate", "true")
# Keytab and namenode used to access HDFS with Kerberos credentials
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", str(sc))
except Exception as err:
    # Report the failure both as a notification and in the notebook output
    sendNotificationToMattia("Spark Context failed", str(err))
    print(str(err))
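
The fixed port assignments above follow a simple consecutive pattern, which could be generated instead of written out by hand. This is a minimal sketch, not part of the original notebook: `fixed_port_settings` and `PORT_SETTINGS` are hypothetical names, and the base port 51800 is taken from the cell above. The resulting pairs are in the form accepted by `SparkConf.setAll`.

```python
# Spark settings that must bind to a known port because of the firewall
# (same keys and order as in the configuration cell above).
PORT_SETTINGS = [
    "spark.driver.port",
    "spark.fileserver.port",
    "spark.broadcast.port",
    "spark.replClassServer.port",
    "spark.blockManager.port",
]


def fixed_port_settings(base_port=51800):
    """Return (key, value) pairs assigning consecutive ports from base_port."""
    return [(key, str(base_port + offset))
            for offset, key in enumerate(PORT_SETTINGS)]


settings = fixed_port_settings()
for key, value in settings:
    print(key + " = " + value)

# With a real SparkConf the pairs could be applied in one call:
# sconf.setAll(settings)
```

Keeping the ports in one list makes it easier to move the whole range if the firewall rules change.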


# Now you can run your code

Pick a clustering algorithm: `CLUSTER_ALGO` is the name of the file that provides a `classify(x, y[, threshold])` function.

In [5]:
execfile('../spark-scripts/evalCluster.py')
execfile('../spark-scripts/utilsCluster.py')

CLUSTER_ALGO = 'plain'
THRESHOLD = 0.0
# Drop the leading "0." so the threshold can be used as a name suffix (0.0 -> '0')
THRESHOLD_STR = str(THRESHOLD)[2:]

import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
clusterSim = 0.1
sessionJaccardShrinkage = 5
expDecay = 0.7

# Overrides the boolean set earlier in this cell with the string 'True'
conf['split']['excludeAlreadyListenedTest'] = str(True)
conf['algo']['name'] = CLUSTER_ALGO + THRESHOLD_STR + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
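
Two values in the configuration above are derived rather than literal: the algorithm name and the split timestamp `conf['split']['ts']`. The sketch below reproduces both expressions in isolation so their results can be inspected. The function and parameter names are hypothetical; in particular, reading the two constants as the first and last timestamps of the data is an assumption, since the notebook does not say what they represent.

```python
def threshold_suffix(threshold):
    """Same expression as THRESHOLD_STR: drop the first two characters."""
    return str(threshold)[2:]


def algo_name(cluster_algo, threshold, shrinkage, cluster_sim, exp_decay):
    """Assemble the name the same way as conf['algo']['name']."""
    return cluster_algo + threshold_suffix(threshold) + \
        '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
        (shrinkage, cluster_sim, exp_decay)


def split_timestamp(first_ts, last_ts, fraction=0.75, offset=10000):
    """Point `fraction` of the way from first_ts to last_ts, minus `offset` seconds."""
    return int(fraction * (last_ts - first_ts) + first_ts) - offset


name = algo_name('plain', 0.0, 5, 0.1, 0.7)
ts = split_timestamp(1390209860, 1421745857)
print(name)  # plain0_ImplicitPlaylist_shk_5_clustSim_0.100_decay_0.700
print(ts)    # 1413851857
```

Note that the suffix expression only behaves as intended for thresholds below 10: `str(0.25)[2:]` gives `'25'`, but `str(1.5)[2:]` gives `'5'`, which silently drops the integer part.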

# Load recommendations

In [5]:
recRDD = loadRecommendations(conf).map(json.loads)
recRDD.take(3)

[{u'id': 1983301421578074,
  u'linkedinfo': {u'objects': [{u'action': u'play',
     u'id': 2524537,
     u'playratio': 0.52,
     u'playstart': 0,
     u'playtime': 105,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 2524546,
     u'playratio': 0.61,
     u'playstart': 105,
     u'playtime': 115,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 2524532,
     u'playratio': 0.63,
     u'playstart': 220,
     u'playtime': 99,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 2524538,
     u'playratio': 2.83,
     u'playstart': 319,
     u'playtime': 486,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 1735774,
     u'playratio': 0.72,
     u'playstart': 1177,
     u'playtime': 192,
     u'type': u'track'}],
   u'response': [{u'id': 1721969, u'rank': 0, u'type': u'track'},
    {u'id': 1704967, u'rank': 1, u'type': u'track'},
    {u'id': 1704967, u'rank': 2, u'type': u'track'},
    {u'id': 2055345, u'rank': 3, u'type': u'track'},
    {u'
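
Each record in the output above is a nested dictionary. As a minimal sketch of how such a record might be unpacked, the example below uses an abbreviated copy of the first record. The helper names are hypothetical, and the interpretation of `objects` as the events of a session and `response` as the ranked recommendation list is an assumption based on the field names.

```python
# Abbreviated copy of the first record shown above (only the fields used here)
record = {
    'id': 1983301421578074,
    'linkedinfo': {
        'objects': [
            {'id': 2524537, 'type': 'track', 'action': 'play'},
            {'id': 2524546, 'type': 'track', 'action': 'play'},
            {'id': 2524532, 'type': 'track', 'action': 'play'},
        ],
        'response': [
            {'id': 1721969, 'rank': 0, 'type': 'track'},
            {'id': 1704967, 'rank': 1, 'type': 'track'},
            {'id': 1704967, 'rank': 2, 'type': 'track'},
            {'id': 2055345, 'rank': 3, 'type': 'track'},
        ],
    },
}


def object_ids(rec):
    """Ids listed under linkedinfo.objects, in their original order."""
    return [obj['id'] for obj in rec['linkedinfo']['objects']]


def ranked_response_ids(rec):
    """Ids listed under linkedinfo.response, ordered by their rank field."""
    response = sorted(rec['linkedinfo']['response'], key=lambda r: r['rank'])
    return [item['id'] for item in response]


print(object_ids(record))
print(ranked_response_ids(record))
```

On an RDD, the same functions could be applied with `recRDD.map(ranked_response_ids)`. Note that the sample response contains the id 1704967 at two consecutive ranks, so downstream code may need to handle duplicates.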

# Load clusters mapping

In [6]:
cluster_path = BASE_PATH + "/clusters/" + CLUSTER_ALGO + THRESHOLD_STR
clustersRDD = sc.pickleFile(cluster_path)
clustersRDD.take(10)

[(0, [1730002, 1730001]),
 (1, [1567745]),
 (2, [1905460]),
 (3, [287144]),
 (4, [3155900]),
 (5, [64357]),
 (6, [1135504]),
 (7, [882710]),
 (8, [1091211]),
 (9, [1707402])]
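
The `(cluster id, song list)` pairs shown above can be turned into a lookup table and used to expand a ranked list of cluster ids into songs. The sketch below illustrates the idea in plain Python. It is not the implementation of `mapClusterRecToListOfSongs`, and the `expand` helper with its `per_cluster` parameter is hypothetical; the notebook does not show how the real strategies (`plug_songs`, `all_cluster`, `plug_one_song`) behave.

```python
# First few pairs from the clustersRDD output above
cluster_pairs = [
    (0, [1730002, 1730001]),
    (1, [1567745]),
    (2, [1905460]),
    (3, [287144]),
]
clusters = dict(cluster_pairs)


def expand(ranked_cluster_ids, clusters, per_cluster=None):
    """Replace each cluster id by its songs, keeping rank order.

    per_cluster limits how many songs are taken from each cluster
    (None means all). Songs already emitted are skipped, and unknown
    cluster ids are ignored.
    """
    songs, seen = [], set()
    for cluster_id in ranked_cluster_ids:
        members = clusters.get(cluster_id, [])
        if per_cluster is not None:
            members = members[:per_cluster]
        for song in members:
            if song not in seen:
                seen.add(song)
                songs.append(song)
    return songs


all_songs = expand([2, 0, 2], clusters)
one_song_each = expand([2, 0, 2], clusters, per_cluster=1)
print(all_songs)      # [1905460, 1730002, 1730001]
print(one_song_each)  # [1905460, 1730002]
```

For a mapping small enough to fit in driver memory, a dictionary like `clusters` could be built with `dict(clustersRDD.collect())`; for larger mappings a join between the two RDDs would be the usual alternative.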

# Substitute clusters with lists of songs and compute metrics

In [23]:
options = [plug_songs, all_cluster, plug_one_song]

for option in options:
    conf['evaluation']['name'] = 'recall@N_' + option.__name__

    # NOTE: as written, `option` only changes the metric name. It is not passed to
    # mapClusterRecToListOfSongs, so every iteration appears to evaluate the same mapping.
    rec = mapClusterRecToListOfSongs(recRDD, clustersRDD)
    print(rec.take(1))

    try:
        computeMetrics(conf, rec)
        sendNotificationToMattia("Compute Metrics " + option.__name__, "Good")
    except Exception as err:
        er_str = str(err)
        print(er_str)
        sendNotificationToMattia("Compute Metrics failed", er_str)

sendNotificationToMattia("Finished", "Check me")

recall@N_all_cluster successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase9_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/recall@N_all_cluster/metrics
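
The metric written above is recall@N, evaluated at the cutoffs listed in `conf['evaluation']['metric']['prop']['N']`. How `computeMetrics` calculates it is not shown in this notebook, so the following is only a sketch of the standard definition: the fraction of ground-truth items that appear among the top N recommendations, averaged over sessions. The function names and the sample data are made up for illustration.

```python
def recall_at_n(ranked_ids, ground_truth, n):
    """Fraction of distinct ground-truth items found in the first n recommendations."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    top_n = set(ranked_ids[:n])
    hits = len(truth & top_n)
    return hits / float(len(truth))


def mean_recall_at_n(sessions, cutoffs):
    """Average recall@n over (ranked_ids, ground_truth) pairs for each cutoff n."""
    sessions = list(sessions)
    result = {}
    for n in cutoffs:
        scores = [recall_at_n(ranked, truth, n) for ranked, truth in sessions]
        result[n] = sum(scores) / float(len(scores)) if scores else 0.0
    return result


# Made-up example: two sessions, each with a single ground-truth item
sessions = [
    ([10, 20, 30], [20]),  # hit at rank 2
    ([10, 20, 30], [40]),  # never recommended
]
metrics = mean_recall_at_n(sessions, [1, 2, 5])
print(metrics)
```

With a single ground-truth item per session, as `conf['split']['GTlength'] = 1` suggests, recall@N reduces to the share of sessions whose held-out item appears in the top N.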
