
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext that uses SURFsara's Hadoop cluster for its executors. The cluster status is visible in the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster); note that you need to be authenticated via Kerberos on your machine to open that link.

First, initialize Kerberos from a Jupyter terminal.
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials to check that a ticket was issued:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
05/06/2016 08:31:29  05/07/2016 08:31:29  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 05/06/2016 08:31:29
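
Since the SparkContext created below cannot reach YARN or HDFS without a valid ticket, it may help to check for one programmatically before continuing. The following is a minimal sketch and not part of the original notebook: the helper name `has_valid_ticket` is hypothetical, and it assumes an MIT Kerberos `klist` that supports the `-s` flag (silent mode, exit status 0 when a valid ticket cache exists).

```python
import subprocess


def has_valid_ticket():
    """Return True if `klist -s` reports a valid Kerberos ticket cache."""
    try:
        # -s: run silently and report the result through the exit status only.
        return subprocess.call(["klist", "-s"]) == 0
    except OSError:
        # klist is not installed or not on the PATH.
        return False


if not has_valid_ticket():
    print("No valid Kerberos ticket found; run kinit in a terminal first.")
```

If the check fails, rerun the `kinit` command above before starting Spark.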


Verify that we can browse HDFS:

In [1]:
! hdfs dfs -ls
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-05-06 12:54 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-05-06 14:36 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [2]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Master is now yarn-client. The YARN and Hadoop configuration is read from the environment
sconf.setMaster("yarn-client")

# Many Spark settings can be controlled via the SparkConf. This one sets the number of executors on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "10g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc) 
    sendNotificationToMattia("Spark Context", str(sc))
except Exception as err:
    sendNotificationToMattia("Spark Context failed", str(err))
    print(str(err))


# Now you can run your code

Pick a clustering algorithm. `CLUSTER_ALGO` is the name of the file that provides a `classify(x, y[, threshold])` function.

In [3]:
execfile('../spark-scripts/evalCluster.py')
execfile('../spark-scripts/utilsCluster.py')

CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751
# Drop the leading "0." so the threshold can be embedded in path names (0.751 -> "751")
THRESHOLD_STR = str(THRESHOLD)[2:]

import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
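# The same values as plain variables, used to build the algorithm name below
clusterSim = 0.1
sessionJaccardShrinkage = 5
expDecay = 0.7

# Overrides the boolean set earlier: the flag is stored as the string 'True'
conf['split']['excludeAlreadyListenedTest'] = str(True)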
conf['algo']['name'] = CLUSTER_ALGO + THRESHOLD_STR + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
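
The `ts` value in the split configuration is easier to read once it is decomposed. The sketch below assumes, without confirmation from the scripts, that the two constants are the first and last event timestamps of the dataset in Unix seconds, so the cutoff sits at 75% of the time span minus a 10000-second margin. The names `T_START`, `T_END` and `split_ts` are illustrative only.

```python
import time

T_START = 1390209860  # assumed: earliest event timestamp in the dataset
T_END = 1421745857    # assumed: latest event timestamp in the dataset

# Same arithmetic as conf['split']['ts'], with the steps spelled out.
span = T_END - T_START
split_ts = int(0.75 * span + T_START) - 10000

print(split_ts)  # 1413851857
# Human-readable UTC form of the cutoff.
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(split_ts)))
```

Under that reading, events before `split_ts` would fall in the training part and later events in the test part.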

# Load a recommendation

In [4]:
recRDD = loadRecommendations(conf).map(json.loads)
recRDD.take(1)

[{u'id': 3457001420907355,
  u'linkedinfo': {u'objects': [{u'action': u'play',
     u'id': 3228363,
     u'playratio': 1.11,
     u'playstart': 0,
     u'playtime': 244,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3873665,
     u'playratio': 0.98,
     u'playstart': 244,
     u'playtime': 155,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3720190,
     u'playratio': 0.51,
     u'playstart': 399,
     u'playtime': 128,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 3093573,
     u'playratio': None,
     u'playstart': 527,
     u'playtime': 343,
     u'type': u'track'},
    {u'action': u'play',
     u'id': 1056393,
     u'playratio': 0.99,
     u'playstart': 1276,
     u'playtime': 134,
     u'type': u'track'}],
   u'response': [{u'id': 3467179, u'rank': 0, u'type': u'track'},
    {u'id': 1727008, u'rank': 1, u'type': u'track'},
    {u'id': 2212447, u'rank': 2, u'type': u'track'},
    {u'id': 2006573, u'rank': 3, u'type': u'track'},
    {u

# Load clusters mapping

In [5]:
cluster_path = BASE_PATH + "/clusters/" + CLUSTER_ALGO + THRESHOLD_STR
clustersRDD = sc.pickleFile(cluster_path)
clustersRDD.take(10)

[(0, [221540]),
 (1, [287144]),
 (2, [41679]),
 (3, [1730002, 1730001]),
 (4, [3155900]),
 (5, [64357]),
 (6, [1135504]),
 (7, [549305]),
 (8, [1707402]),
 (9, [2797192])]
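
Each row maps a cluster id to the track ids it contains; most clusters in this sample are singletons, while cluster 3 groups two tracks. The notebook inverts this mapping further down with `flatMap`. A plain-Python sketch of that inversion on the first sample rows, with illustrative variable names, looks like this:

```python
# First rows of the clustersRDD sample shown above: (cluster id, [track ids]).
clusters = [(0, [221540]), (1, [287144]), (2, [41679]), (3, [1730002, 1730001])]

# Invert to track id -> cluster id, one entry per track.
song_to_cluster = {}
for cluster_id, track_ids in clusters:
    for track_id in track_ids:
        song_to_cluster[int(track_id)] = cluster_id

print(song_to_cluster[1730001])  # 3
print(song_to_cluster[1730002])  # 3
```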

# Substitute cluster with list of songs and compute metrics

In [7]:
execfile('../spark-scripts/evalClusterNew.py')
execfile('../spark-scripts/utilsCluster.py')

options = ['plug_songs']

for name in options:
    rec = mapClusterRecToListOfSongs(recRDD, clustersRDD, name)

    try:
        conf['algo']['name'] += '/' + name
        computeNewRecallPrecision(conf, rec)
        sendNotificationToMattia("Compute Metrics " + conf['evaluation']['name'], "Good")
    except Exception as err:
        er_str = str(err)
        print(er_str)
        print("Skipping...")
        sendNotificationToMattia("Compute Metrics failed", er_str)
    finally:
        # Always restore the base name so a failed option does not leak its suffix
        conf['algo']['name'] = CLUSTER_ALGO + THRESHOLD_STR + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay)

sendNotificationToMattia("Finished", "Check me")

newRecall@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700plug_songsplug_songs_5#07#01/newRecall@N/metrics
newPrecision@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700plug_songsplug_songs_5#07#01/newPrecision@N/metrics
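
The directory name in the paths above is the algorithm name after sanitization. The sketch below reproduces that step in plain Python, based on the `re.sub` call that appears later in this notebook when the ground-truth path is built. Whether the evaluation scripts use exactly the same code is an assumption, and `sanitize` is an illustrative name. The second call shows that the `plug_songsplug_songs` fragment is what one would get if the `/plug_songs` suffix had been appended twice, since the `/` separators are stripped.

```python
import re


def sanitize(algo_name, prop_values):
    """Mimic how the notebook derives a directory name from the algorithm name."""
    raw = algo_name + '_' + '#'.join(str(v) for v in prop_values)
    # Keep only letters, digits, '#' and '_'; characters such as '.' and '/' are removed.
    return re.sub(r'[^A-Za-z0-9#_]', '', raw)


base = 'jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0.100_decay_0.700'
props = [5, 0.7, 0.1]  # property values in the order seen in the output paths

print(sanitize(base, props))
print(sanitize(base + '/plug_songs' + '/plug_songs', props))
```

Because `.` and `/` disappear, distinct names can collapse into the same directory, so it is worth checking the printed path after each run.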


# Compute NEW Metrics for Clustering!

In [8]:
execfile('../spark-scripts/utilsCluster.py')

rec = mapClusterRecToListOfSongs(recRDD, clustersRDD)
rec.take(3)

['{"linkedinfo": {"objects": [{"playratio": 1.04, "playstart": 0, "action": "play", "playtime": 335, "type": "track", "id": 708455}, {"playratio": 1.08, "playstart": 335, "action": "play", "playtime": 309, "type": "track", "id": 3585145}, {"playratio": 1.0, "playstart": 644, "action": "play", "playtime": 349, "type": "track", "id": 2319201}, {"playratio": 1.0, "playstart": 993, "action": "play", "playtime": 217, "type": "track", "id": 524783}, {"playratio": 1.0, "playstart": 1210, "action": "play", "playtime": 523, "type": "track", "id": 2544712}], "subjects": [{"type": "user", "id": 31444}], "response": [{"type": "track", "id": 1767905, "rank": 0}, {"type": "track", "id": 144315, "rank": 1}, {"type": "track", "id": 144314, "rank": 1}, {"type": "track", "id": 144129, "rank": 2}, {"type": "track", "id": 144092, "rank": 2}, {"type": "track", "id": 3273620, "rank": 3}, {"type": "track", "id": 2935585, "rank": 4}, {"type": "track", "id": 2935590, "rank": 5}, {"type": "track", "id": 415779,

In [21]:
execfile('../spark-scripts/evalClusterNew.py')

computeNewRecallPrecision(conf,rec, loss = True)

newLossRecall@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/newLossRecall@N/metrics
newLossPrecision@N successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/newLossPrecision@N/metrics


# Compute Cluster Loss

In [None]:
execfile('../spark-scripts/evalClusterNew.py')
computeClusterLoss(conf,rec)

# Compute performance with clustering applied only at evaluation time

In [8]:
conf['algo']['name'] = 'plain0_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
    
plainRDD = loadRecommendations(conf)

execfile('../spark-scripts/evalClusterNew.py')
#computeNewRecallPrecision(conf, plainRDD, loss = False, plain = True)

plainRDD = plainRDD.map(json.loads)

conf['algo']['name'] = CLUSTER_ALGO + THRESHOLD_STR + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
    
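# Invert the cluster mapping: (cluster id, [track ids]) -> (track id, cluster id)
songToClusterRDD = clustersRDD.flatMap(lambda x: [(int(i), x[0]) for i in x[1]])

# One row per recommended track: (track id, (rank, full recommendation record))
plainFlatRDD = plainRDD.flatMap(lambda x: [(i['id'], (i['rank'], x)) for i in x['linkedinfo']['response']])

# Join on track id and re-key by the serialized record: (record JSON, (cluster id, rank))
plainJoinRDD = plainFlatRDD.join(songToClusterRDD).map(lambda x: (json.dumps(x[1][0][1]), (x[1][1], x[1][0][0])))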



In [9]:
from operator import itemgetter
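def plug_clusters(x):
    """Rebuild a record's response as a ranked list of distinct cluster ids.

    x is (record JSON, iterable of (cluster id, rank)) as produced by groupByKey.
    """
    row_dic = json.loads(x[0])
    # Visit clusters in the order of the original track ranks
    to_be_plugged = sorted(list(x[1]), key=itemgetter(1))
    plugged = set()
    rank = 0
    row_dic['linkedinfo']['response'] = []

    for i in to_be_plugged:
        cl_id = i[0]
        # Keep only the first (best-ranked) occurrence of each cluster
        if cl_id not in plugged:
            entry = {"type": "track", "id": cl_id, "rank": rank}
            row_dic['linkedinfo']['response'].append(entry)
            plugged.add(cl_id)
            rank += 1

    return json.dumps(row_dic)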

plainGroupRDD = plainJoinRDD.groupByKey().map(plug_clusters)

recPlainClusterRDD = plainGroupRDD.map(json.loads)
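
To make the effect of the join, `groupByKey` and `plug_clusters` steps concrete, here is a plain-Python sketch on toy data. The track ids, the track-to-cluster mapping and the function name `collapse_to_clusters` are all made up for illustration; only the logic (sort by rank, keep the first occurrence of each cluster, renumber ranks from 0) follows the cells above.

```python
def collapse_to_clusters(response, song_to_cluster):
    """Replace ranked tracks by ranked clusters, keeping each cluster once."""
    seen = set()
    ranked_clusters = []
    for track_id, _rank in sorted(response, key=lambda pair: pair[1]):
        cluster_id = song_to_cluster.get(track_id)
        # Tracks without a cluster are dropped, as an inner join would do.
        if cluster_id is None or cluster_id in seen:
            continue
        seen.add(cluster_id)
        ranked_clusters.append({"type": "track", "id": cluster_id,
                                "rank": len(ranked_clusters)})
    return ranked_clusters


# Toy recommendation list of (track id, rank) pairs.
toy_response = [(10, 0), (11, 1), (20, 2), (12, 3), (30, 4)]
# Hypothetical mapping: tracks 10, 11 and 12 belong to the same cluster.
toy_mapping = {10: 0, 11: 0, 12: 0, 20: 1, 30: 2}

for entry in collapse_to_clusters(toy_response, toy_mapping):
    print(entry)
```

The five recommended tracks collapse to three cluster entries with ranks 0, 1 and 2, because tracks 11 and 12 repeat a cluster that is already ranked.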

In [10]:
execfile('../spark-scripts/utilsCluster.py')

recPlainCluster = mapClusterRecToListOfSongs(recPlainClusterRDD, clustersRDD)


In [None]:
execfile('../spark-scripts/evalClusterNew.py')

computeNewRecallPrecision(conf,recPlainCluster, loss = True, plain = True)

In [19]:
sendNotificationToMattia("Finished", "Check")

In [10]:
import re
from os import path

splitPath = path.join(conf['general']['bucketName'], conf['general']['clientname'])
# basePath = "s3n://" + conf['general']['bucketName'] + "/"+conf['general']['clientname']+"/"
GTpath = path.join(splitPath, "GT")
# GTpath = splitPath+"GT"

algo_conf = conf['algo']['name'] + '_' + \
                '#'.join([str(v) for k, v in conf['algo']['props'].iteritems()])
algo_conf = re.sub(r'[^A-Za-z0-9#_]', '', algo_conf)

confPath = path.join(splitPath, 'Rec', algo_conf)
recPath = path.join(confPath, "recommendations")
# recPath = splitPath+"/Rec/"+ conf['algo']['name']+"/recommendations/"

# Ground truth: one JSON record per test session
gtRDD = sc.textFile(GTpath).map(lambda x: json.loads(x))

# Keep (held-out track id, [ids of the tracks in the session])
gtRDD = gtRDD.map(lambda x: (x['linkedinfo']['gt'][0]['id'], [i['id'] for i in x['linkedinfo']['objects']]))

gtRDD.count()

302045
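
With `GTlength` set to 1, each test session has a single held-out track, so recall@N reduces to the fraction of sessions whose held-out track appears in the top N recommendations. The sketch below implements that standard definition on toy data. The actual metric code lives in `evalClusterNew.py`, which is not shown here and may differ (in particular for the "new" and "loss" variants), so treat the function name `recall_at_n` and its inputs as assumptions.

```python
def recall_at_n(recommendations, ground_truth, cutoffs):
    """Fraction of sessions whose held-out id is within the top n, for each n.

    recommendations: list of ranked id lists, one per session.
    ground_truth: list of held-out ids, aligned with recommendations.
    """
    hits = dict((n, 0) for n in cutoffs)
    for ranked_ids, gt_id in zip(recommendations, ground_truth):
        for n in cutoffs:
            if gt_id in ranked_ids[:n]:
                hits[n] += 1
    total = float(len(ground_truth))
    return dict((n, hits[n] / total) for n in cutoffs)


# Toy data: three sessions, three recommendations each.
toy_recs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
toy_gt = [2, 6, 0]

scores = recall_at_n(toy_recs, toy_gt, [1, 2, 5])
for n in sorted(scores):
    print(n, scores[n])
```

In this toy case no held-out track is ranked first, one is in the top 2 and two are in the top 5, giving recalls of 0, 1/3 and 2/3.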

In [20]:
! pwd

/home/jovyan/work/notebooks/mattia/PlaylistRecMicro/playlistrec-master/notebooks


In [42]:
DATA_PATH = '/home/jovyan/work/data/mattia'
TEST_PATH = BASE_PATH + '/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/plain/newRecall@N/metrics'
# collect() brings the metric lines to the driver as a plain Python list
test_results = sc.textFile(TEST_PATH).collect()

with open(DATA_PATH + '/results/jaccardBase751/eval/ideal/recall@N', 'w') as f:
    for line in test_results:
        f.write(line + '\n')
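
The `open(..., 'w')` call above fails if the target directory does not exist yet. A small helper that creates the parent directory first avoids that. This is a sketch with a hypothetical name, `write_lines`, demonstrated on a temporary directory that stands in for `DATA_PATH`; the two lines written are placeholders, not real metrics.

```python
import os
import tempfile


def write_lines(lines, target_path):
    """Write one line per entry, creating the parent directory if needed."""
    parent = os.path.dirname(target_path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    with open(target_path, 'w') as f:
        for line in lines:
            f.write(line + '\n')


# Usage with a temporary directory standing in for DATA_PATH.
demo_root = tempfile.mkdtemp()
demo_target = os.path.join(demo_root, 'results', 'eval', 'recall@N')
write_lines(['first placeholder line', 'second placeholder line'], demo_target)

with open(demo_target) as f:
    print(f.read())
```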

In [17]:
os.getcwd()

'/home/jovyan/work/notebooks/mattia/PlaylistRecMicro/playlistrec-master/notebooks'