# <hr style="clear: both" />

# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext that uses SURFsara's Hadoop cluster for its executors. You can monitor the cluster via the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster) (note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link).

First, initialize Kerberos via a Jupyter terminal.
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
05/02/2016 08:44:09  05/03/2016 08:44:09  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 05/02/2016 08:44:09
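
To check the ticket from code rather than by eye, the `klist` output can be parsed for the expiry time. The following is a minimal sketch, assuming the `MM/DD/YYYY HH:MM:SS` timestamp layout shown above (it may differ on other systems); the helper names `tgt_expiry` and `ticket_is_valid` are hypothetical and not part of this notebook's scripts.

```python
import re
from datetime import datetime

# Sample `klist` output, copied from the cell above.
KLIST_OUTPUT = """\
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
05/02/2016 08:44:09  05/03/2016 08:44:09  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
"""

# Assumption: timestamps are printed as MM/DD/YYYY HH:MM:SS, as in the output above.
TIMESTAMP = r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}"
TICKET_LINE = re.compile(r"^(%s)\s+(%s)\s+(krbtgt/\S+)" % (TIMESTAMP, TIMESTAMP), re.M)


def tgt_expiry(klist_text):
    """Return the expiry time of the first krbtgt ticket, or None if there is none."""
    match = TICKET_LINE.search(klist_text)
    if match is None:
        return None
    return datetime.strptime(match.group(2), "%m/%d/%Y %H:%M:%S")


def ticket_is_valid(klist_text, now):
    """True if a krbtgt ticket is listed and has not expired at time `now`."""
    expiry = tgt_expiry(klist_text)
    return expiry is not None and now < expiry


print(tgt_expiry(KLIST_OUTPUT))
print(ticket_is_valid(KLIST_OUTPUT, datetime(2016, 5, 2, 12, 0, 0)))
```

In the notebook, the text could be obtained with `subprocess.check_output(["klist"])` and compared against `datetime.now()`.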


Verify that we can browse HDFS:

In [2]:
! hdfs dfs -ls
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-05-01 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-05-02 08:00 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# The master is yarn-client; the YARN and Hadoop configuration is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. These two set the number of executors
# on the cluster and the memory available to each executor:
sconf.set("spark.executor.instances", "200")
sconf.set("spark.executor.memory", "10g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", str(sc))
except Exception as err:
    # Report the failure both as a notification and in the notebook output
    sendNotificationToMattia("Spark Context failed", str(err))
    print(str(err))
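
The five fixed ports configured above are consecutive, so they can also be derived from a single base port. This is only a sketch of that idea: `PORT_SETTINGS` and `fixed_port_settings` are hypothetical names, and the base port 51800 is the one used in the cell above.

```python
# Spark settings that need a fixed port, in the order used in the cell above.
PORT_SETTINGS = [
    "spark.driver.port",
    "spark.fileserver.port",
    "spark.broadcast.port",
    "spark.replClassServer.port",
    "spark.blockManager.port",
]


def fixed_port_settings(base_port=51800):
    """Return (key, value) pairs assigning consecutive ports starting at base_port."""
    return [(key, str(base_port + offset)) for offset, key in enumerate(PORT_SETTINGS)]


pairs = fixed_port_settings()
for key, value in pairs:
    print("%s = %s" % (key, value))

# With a SparkConf at hand, the pairs could be applied in one call:
# sconf.setAll(pairs)
```

Keeping the list in one place makes it easier to see which ports have to be opened in the firewall when the base port changes.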

# <hr style="clear: both" />

# Now you can run your code

Pick a clustering algorithm (the name of the file that provides a classify(x, y[, threshold]) function):

In [4]:
execfile('../spark-scripts/evalCluster.py')
execfile('../spark-scripts/utilsCluster.py')

CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751
# Digits after the decimal point ('751'), used as a suffix in names and paths
THRESHOLD_STR = str(THRESHOLD)[2:]

import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
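# Same values as in conf['algo']['props'] above; used to build the algorithm name
clusterSim = 0.1
sessionJaccardShrinkage = 5
expDecay = 0.7

# Override the boolean set earlier with its string form
conf['split']['excludeAlreadyListenedTest'] = str(True)
conf['algo']['name'] = CLUSTER_ALGO + THRESHOLD_STR + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay)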
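
Two values in the configuration above are computed rather than written out: the split timestamp `conf['split']['ts']` and the algorithm name. The sketch below reproduces both in isolation so the results can be inspected. The function names are hypothetical, and reading the two large constants as the first and last timestamps of the dataset is an assumption.

```python
import time


def threshold_suffix(threshold):
    """Digits after the decimal point, as str(threshold)[2:] does in the cell above."""
    return str(threshold)[2:]


def algo_name(cluster_algo, threshold, shrinkage, cluster_sim, exp_decay):
    """Build the algorithm name with the same format string as the cell above."""
    return cluster_algo + threshold_suffix(threshold) + \
        '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % (shrinkage, cluster_sim, exp_decay)


def split_timestamp(start, end, fraction=0.75, margin=10000):
    """Point at `fraction` of the [start, end] interval, moved back by `margin` seconds."""
    return int(fraction * (end - start) + start) - margin


name = algo_name('jaccardBase', 0.751, 5, 0.1, 0.7)
ts = split_timestamp(1390209860, 1421745857)

print(name)
print(ts)
print(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(ts)))  # split point in UTC
```

Note that `str(threshold)[2:]` only behaves as intended for thresholds strictly between 0 and 1; a value such as 1.0 yields `'0'`. The HDFS paths printed later in this notebook show `clustSim_0100_decay_0700`, so the dots appear to be stripped further downstream, in a step that is not shown here.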

# Load a recommendation

In [33]:
recRDD = loadRecommendations(conf).map(json.loads)
recRDD.take(1)

[{u'id': 3457001420907355,
  u'linkedinfo': {u'objects': [{u'action': u'play',
     u'id': 3228363,
     u'playratio': 1.11,
     u'playstart': 0,
     u'playtime': 244,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3873665,
     u'playratio': 0.98,
     u'playstart': 244,
     u'playtime': 155,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3720190,
     u'playratio': 0.51,
     u'playstart': 399,
     u'playtime': 128,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3093573,
     u'playratio': None,
     u'playstart': 527,
     u'playtime': 343,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 1056393,
     u'playratio': 0.99,
     u'playstart': 1276,
     u'playtime': 134,
     u'type': u'track'}],
   u'response': [{u'id': 3467179, u'rank': 0, u'type': u'track'},
    {u'id': 1727008, u'rank': 1, u'type': u'track'},
    {u'id': 2212447, u'rank': 2, u'type': u'track'},
    {u'id': 2006573, u'rank': 3, u'type': u'track'},
    {u

# Load clusters mapping

In [6]:
cluster_path = BASE_PATH + "/clusters/" + CLUSTER_ALGO + THRESHOLD_STR
clustersRDD = sc.pickleFile(cluster_path)
clustersRDD.take(10)

[(0, [221540]),
 (1, [287144]),
 (2, [41679]),
 (3, [1730002, 1730001]),
 (4, [3155900]),
 (5, [64357]),
 (6, [1135504]),
 (7, [549305]),
 (8, [1707402]),
 (9, [2797192])]

# Substitute cluster with list of songs and compute metrics
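
The substitution step can be pictured in plain Python as follows. This is an illustration of the general idea only, not the implementation in `utilsCluster.py`: `expand_clusters` is a hypothetical helper, and how the `plug_songs` and `all_cluster` options differ is not covered here.

```python
def expand_clusters(ranked_cluster_ids, clusters):
    """Replace each cluster id by its track ids, keeping rank order and dropping repeats."""
    seen = set()
    tracks = []
    for cluster_id in ranked_cluster_ids:
        # Unknown cluster ids contribute no tracks
        for track_id in clusters.get(cluster_id, []):
            if track_id not in seen:
                seen.add(track_id)
                tracks.append(track_id)
    return tracks


# Cluster mapping in the (cluster_id, [track_ids]) shape shown above.
clusters = dict([(0, [221540]), (3, [1730002, 1730001]), (4, [3155900])])

# A hypothetical ranked list of recommended cluster ids.
print(expand_clusters([3, 0, 4], clusters))  # [1730002, 1730001, 221540, 3155900]
```

On the cluster, the same lookup would more likely be done with a join against `clustersRDD` or with a broadcast copy of the mapping.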

In [8]:
execfile('../spark-scripts/evalCluster.py')
execfile('../spark-scripts/utilsCluster.py')

options = [ 'plug_songs', 'all_cluster']

for name in options:
    
    rec = mapClusterRecToListOfSongs(recRDD, clustersRDD, name)
    
    try:
        conf['evaluation']['name'] = 'recall@N/' + name
        computeMetrics(conf, rec)
        sendNotificationToMattia("Compute Metrics " + conf['evaluation']['name'] , "Good")
        
        conf['evaluation']['name'] = 'precision@N/' + name
        computeMetrics_precision(conf, rec)
        sendNotificationToMattia("Compute Metrics " + conf['evaluation']['name'] , "Good")
        
    except Exception as err:
        er_str = str(err)
        print(er_str)
        print("Skipping...")
        sendNotificationToMattia("Compute Metrics failed", er_str)
    

sendNotificationToMattia("Finished", "Check me")

recall@N/plug_songs successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/recall@N/plug_songs/metrics
precision@N/plug_songs successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/precision@N/plug_songs/metrics
recall@N/all_cluster successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/recall@N/all_cluster/metrics
precision@N/all_cluster successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/precision@N/all_cluster/metrics
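
For reference, recall@N and precision@N for a single session can be sketched as below. These are the standard definitions; that `computeMetrics` and `computeMetrics_precision` compute exactly this is an assumption, and the ground-truth track in the example is hypothetical.

```python
def recall_at_n(recommended, ground_truth, n):
    """Fraction of ground-truth items that appear in the top-n recommendations."""
    if not ground_truth:
        return 0.0
    hits = len(set(recommended[:n]) & set(ground_truth))
    return hits / float(len(ground_truth))


def precision_at_n(recommended, ground_truth, n):
    """Fraction of the top-n slots filled by ground-truth items."""
    hits = len(set(recommended[:n]) & set(ground_truth))
    return hits / float(n)


# First recommended track ids from the output of `recRDD.take(1)`;
# the ground-truth track is made up for the example.
recommended = [3467179, 1727008, 2212447, 2006573]
ground_truth = [2212447]

for n in [1, 2, 5]:
    print(n, recall_at_n(recommended, ground_truth, n), precision_at_n(recommended, ground_truth, n))
```

With a ground truth of length 1 (`GTlength` is 1 in the configuration), recall@N is either 0 or 1 per session, and precision@N is at most 1/N.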


In [10]:
conf['evaluation']['name'] = 'recall@N' 
computeMetrics(conf, rec)
sendNotificationToMattia("Compute Metrics " + conf['evaluation']['name'] , "Good")
        
conf['evaluation']['name'] = 'precision@N'
computeMetrics_precision(conf, rec)
sendNotificationToMattia("Compute Metrics " + conf['evaluation']['name'] , "Good")

recall@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/plain0_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/recall@N/metrics
precision@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/plain0_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/precision@N/metrics


# Compute NEW Metrics for Clustering!

In [None]:
execfile('../spark-scripts/utilsCluster.py')

rec = mapClusterRecToListOfSongs(recRDD, clustersRDD)

In [None]:
execfile('../spark-scripts/evalClusterNew.py')

computeNewRecallPrecision(conf, rec, loss=False)
computeNewRecallPrecision(conf, rec, loss=True)

newRecall@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/newRecall@N/metrics
newPrecision@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/newPrecision@N/metrics
