
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster, managed by the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster). Note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link.

First, initialize Kerberos from a Jupyter terminal.
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials to confirm that a ticket was granted:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
04/26/2016 14:23:44  04/27/2016 14:23:44  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 04/26/2016 14:23:44
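
Before starting a long-running Spark job, it can be convenient to check from Python whether a valid ticket is present. The sketch below is not part of the original notebook: the helper name `has_valid_ticket` is hypothetical, and it assumes an MIT Kerberos `klist` on the PATH, whose `-s` flag runs silently and exits with status 0 only when the credential cache is readable and unexpired.

```python
import os
import subprocess

def has_valid_ticket():
    """Return True if `klist -s` reports a readable, unexpired ticket cache.

    Hypothetical helper; assumes an MIT Kerberos `klist` is installed.
    """
    try:
        with open(os.devnull, "w") as devnull:
            status = subprocess.call(["klist", "-s"],
                                     stdout=devnull, stderr=devnull)
    except OSError:
        # klist is missing or cannot be executed
        return False
    return status == 0
```

If the function returns False, running the `kinit` command above again should refresh the ticket.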


Verify that we can browse HDFS:

In [2]:
! hdfs dfs -ls
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-04-27 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-04-28 06:11 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Master is now yarn-client. The YARN and hadoop config is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. This one sets the number of executors requested on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "20g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    # Report the failure both by notification and in the notebook output
    sendNotificationToMattia("Spark Context failed", str(err))
    print str(err)


# Now you can run your code

Pick a clustering algorithm (the name of the file that provides a classify(x, y[, threshold]) function) and its threshold:

In [4]:
execfile('../spark-scripts/conventions.py')
execfile('../spark-scripts/splitCluster.py')
execfile('../spark-scripts/eval.py')
execfile('../spark-scripts/implicitPlaylistAlgoFunctions.py')
execfile('../spark-scripts/implicitPlaylistAlgoMain.py')

CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751
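
The clustering scripts themselves are not shown in this notebook. Purely to illustrate the interface described above, a `classify(x, y, threshold)` function based on Jaccard similarity might look like the sketch below. Treating `x` and `y` as collections of items, the helper name `jaccard`, and the default threshold are all assumptions, not the contents of the actual `jaccardBase` file.

```python
def jaccard(x, y):
    """Jaccard similarity |x & y| / |x | y| of two collections (hypothetical helper)."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(x & y) / float(len(union))

def classify(x, y, threshold=0.751):
    """Return True when x and y are similar enough to share a cluster (assumed semantics)."""
    return jaccard(x, y) >= threshold

same = classify([1, 2, 3, 4], [1, 2, 3, 4])   # identical sets, similarity 1.0
different = classify([1, 2, 3], [2, 3, 4])    # similarity 0.5, below 0.751
```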



# Building the configuration

In [5]:
import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['name'] = 'ClusterBase'
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
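
The value of `conf['split']['ts']` is easier to follow when the arithmetic is spelled out. The two constants are about one year apart (31,535,997 seconds), and the formula picks the point 75% of the way through that range, then moves it 10,000 seconds earlier. The names `START_TS` and `END_TS` below are introduced only for this sketch; that they are the first and last timestamps of the dataset is an assumption.

```python
START_TS = 1390209860  # lower bound used in the notebook's formula (assumed: first event)
END_TS = 1421745857    # upper bound used in the notebook's formula (assumed: last event)

span = END_TS - START_TS                      # 31535997 seconds
three_quarters = int(0.75 * span + START_TS)  # 1413861857
split_ts = three_quarters - 10000             # 1413851857
```

With `mode` set to `'session'`, this timestamp presumably separates training sessions from test sessions, but the splitter code is not shown here.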



In [6]:
SPLIT_NEEDED = False
if SPLIT_NEEDED:
    try:
        splitter(conf)
        sendNotificationToMattia("Split done", "Train and Test")
    except Exception as err:
        print str(err)
        sendNotificationToMattia("Split failed", str(err))

# Load data 

In [7]:
clusterSongsFileRDD = sc.pickleFile(BASE_PATH + '/clusters/' + CLUSTER_ALGO + str(THRESHOLD)[2:])

songToClusterRDD = clusterSongsFileRDD.flatMap(lambda x: [(int(i), x[0]) for i in x[1]] )

print songToClusterRDD.take(3)

import json
execfile('../spark-scripts/utilsCluster.py')
train, test = loadDataset(conf)

train_count = train.count()
test_count = test.count()
print train_count
print test_count
sendNotificationToMattia("Train and Test Loaded", "Train: " + str(train_count) + "\nTest: " + str(test_count))

[(221540, 0), (287144, 1), (41679, 2)]
[u'{"id": "1880599", "linkedinfo": {"subjects": [{"type": "user", "id": 25088}], "objects": [{"playratio": null, "playstart": 0, "action": "play", "playtime": 217, "type": "track", "id": 4463481}, {"playratio": 0.99, "playstart": 217, "action": "play", "playtime": 193, "type": "track", "id": 3721462}, {"playratio": 0.99, "playstart": 410, "action": "play", "playtime": 256, "type": "track", "id": 1918267}, {"playratio": 0.99, "playstart": 666, "action": "play", "playtime": 151, "type": "track", "id": 1320280}, {"playratio": 0.99, "playstart": 817, "action": "play", "playtime": 205, "type": "track", "id": 2468464}, {"playratio": 0.99, "playstart": 1022, "action": "play", "playtime": 320, "type": "track", "id": 1619933}, {"playratio": 1.0, "playstart": 1342, "action": "play", "playtime": 266, "type": "track", "id": 2228799}, {"playratio": 0.99, "playstart": 1608, "action": "play", "playtime": 292, "type": "track", "id": 1879248}, {"playratio": 0.99, 
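
The load step above relies on two small conventions that are easy to miss. First, `str(THRESHOLD)[2:]` drops the leading `0.`, so 0.751 becomes the suffix `751`. Second, the `flatMap` turns each `(cluster_id, [song_ids])` record into one `(song_id, cluster_id)` pair per song. The record shape is inferred from the lambda, the sample clusters are made up, and the sketch uses plain Python lists instead of RDDs.

```python
CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751

suffix = str(THRESHOLD)[2:]                      # '0.751' -> '751'
cluster_dir = '/clusters/' + CLUSTER_ALGO + suffix

# Made-up records in the assumed (cluster_id, [song_ids]) shape
clusters = [(0, ['221540', '7']), (1, ['287144'])]

def invert(record):
    """Emit one (song_id, cluster_id) pair per song, like the notebook's lambda."""
    cluster_id, songs = record
    return [(int(song), cluster_id) for song in songs]

# Plain-Python equivalent of clusters.flatMap(invert)
song_to_cluster = [pair for record in clusters for pair in invert(record)]
```

The slicing trick depends on how the float prints: a threshold of 0.5 would give the suffix `5`, so fixed-precision formatting may be safer when several thresholds are compared.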

In [16]:
train_count = train.count()
test_count = test.count()
print train_count
print test_count
sendNotificationToMattia("Train and Test Loaded", "Train: " + str(train_count) + "\nTest: " + str(test_count))

507817
302045


# Flatten to (TrackID, (Index, Rec)) and Join with Song -> Cluster

In [14]:
def flat_map_tracks_ids(x):
    # Emit one (track_id, (position, record)) pair per track in the session
    objects = x['linkedinfo']['objects']
    return [(obj['id'], (i, x)) for i, obj in enumerate(objects)]

trainFlat = train.map(lambda x: json.loads(x)).flatMap(flat_map_tracks_ids)
train_pre_count = trainFlat.count()

trainJoin = trainFlat.join(songToClusterRDD)
train_post_count = trainJoin.count()

print "Number of entries in train before: " + str(train_pre_count)
print "Number of entries in train after: " + str(train_post_count)
sendNotificationToMattia("TRAIN" , "Number of entries in train before: " + str(train_pre_count) + "\nNumber of entries in train after: " + str(train_post_count))



testFlat = test.map(lambda x: json.loads(x)).flatMap(flat_map_tracks_ids)
test_pre_count = testFlat.count()

testJoin = testFlat.join(songToClusterRDD)
test_post_count = testJoin.count()

print "Number of entries in test before: " + str(test_pre_count)
print "Number of entries in test after: " + str(test_post_count)
sendNotificationToMattia("TEST" , "Number of entries in test before: " + str(test_pre_count) + "\nNumber of entries in test after: " + str(test_post_count))


Number of entries in train before: 14579340
Number of entries in train after: 14579340
Number of entries in test before: 1490264
Number of entries in test after: 1490264
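
To make the shapes in this step concrete, here is a small plain-Python sketch of the flatten-then-join logic. The session record and the `song_to_cluster` dictionary are made-up stand-ins for the RDDs, and the list comprehension only imitates the inner-join semantics of `RDD.join` for the case where each key appears once on the right-hand side.

```python
import json

def flat_map_tracks_ids(x):
    """One (track_id, (position, record)) pair per track in the session."""
    objects = x['linkedinfo']['objects']
    return [(obj['id'], (i, x)) for i, obj in enumerate(objects)]

# Made-up session with two tracks
session = json.loads('{"id": "s1", "linkedinfo": {"objects": [{"id": 10}, {"id": 20}]}}')
song_to_cluster = {10: 3, 20: 4}  # hypothetical track_id -> cluster_id mapping

flat = flat_map_tracks_ids(session)

# Inner join on track_id: rows without a cluster would be dropped
joined = [(track, (value, song_to_cluster[track]))
          for track, value in flat if track in song_to_cluster]
```

Each joined row has the form `(track_id, ((position, record), cluster_id))`. The identical before and after counts in the output above are consistent with every track ID having exactly one cluster.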


# Extract Rec and Group by key (Rec)

In [15]:
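# Key each joined row by its serialized record; the value is (position, cluster_id)
trainSub = trainJoin.map(lambda x: (json.dumps(x[1][0][1]), (x[1][0][0], x[1][1])))
# Collect all (position, cluster_id) pairs that belong to the same record
trainAgg = trainSub.groupByKey().mapValues(list)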

train_sub_count = trainSub.count()
train_agg_count = trainAgg.count()
print "Number of entries in train before: " + str(train_sub_count)
print "Number of entries in train after: " + str(train_agg_count)
print "Equal to original train: " + str(train_count == train_agg_count)

testSub = testJoin.map(lambda x: (json.dumps(x[1][0][1]), (x[1][0][0], x[1][1])))
testAgg = testSub.groupByKey().mapValues(list)

test_sub_count = testSub.count()
test_agg_count = testAgg.count()
print "\nNumber of entries in test before: " + str(test_sub_count)
print "Number of entries in test after: " + str(test_agg_count)
print "Equal to original test: " + str(test_count == test_agg_count)

Number of entries in train before: 14579340
Number of entries in train after: 507817
Number of entries in test before: 1490264
Number of entries in test after: 302045
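
The record is serialized with `json.dumps` because a dict is not hashable and so cannot serve as a grouping key. The sketch below imitates the step in plain Python with a made-up record. Note one deliberate difference: it passes `sort_keys=True` so that equal records always serialize to the same string, whereas the notebook calls `json.dumps` without that argument.

```python
import json
from collections import defaultdict

record = {"id": "s1", "linkedinfo": {"objects": [{"id": 10}, {"id": 20}]}}

# Made-up joined rows: (track_id, ((position, record), cluster_id))
joined = [(10, ((0, record), 3)), (20, ((1, record), 4))]

# Re-key by serialized record, keeping (position, cluster_id) as the value
sub = [(json.dumps(v[0][1], sort_keys=True), (v[0][0], v[1])) for _, v in joined]

# Plain-Python stand-in for groupByKey().mapValues(list)
grouped = defaultdict(list)
for key, pair in sub:
    grouped[key].append(pair)
```

After grouping there is one entry per original record, which matches the drop from 14579340 to 507817 rows for the training set.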


# Plug in Cluster IDs

In [19]:
def plug_clusters(x):
    # x is (serialized record, [(position, cluster_id), ...]);
    # overwrite each track ID in the record with its cluster ID
    row_dic = json.loads(x[0])
    for index, cl_id in x[1]:
        row_dic['linkedinfo']['objects'][index]['id'] = cl_id
    return json.dumps(row_dic)
    
trainRDD = trainAgg.map(plug_clusters)
trainRDD_count = trainRDD.count()
print "TrainRDD: " + str(trainRDD_count)

testRDD = testAgg.map(plug_clusters)
testRDD_count = testRDD.count()
print "TestRDD: " + str(testRDD_count)

sendNotificationToMattia("Train and Test RDDs", "TrainRDD: " + str(trainRDD_count) + "\nTestRDD: " + str(testRDD_count))

TrainRDD: 507817
TestRDD: 302045
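
As a quick sanity check of what this step does to a single row, the sketch below runs the same logic on a made-up record outside Spark. The session and the cluster assignments are invented for illustration.

```python
import json

def plug_clusters(x):
    """Replace each track ID in the serialized record with its cluster ID."""
    row_dic = json.loads(x[0])
    for index, cl_id in x[1]:
        row_dic['linkedinfo']['objects'][index]['id'] = cl_id
    return json.dumps(row_dic)

key = json.dumps({"id": "s1", "linkedinfo": {"objects": [{"id": 10}, {"id": 20}]}})

# The (position, cluster_id) pairs may arrive in any order after grouping
result = json.loads(plug_clusters((key, [(1, 4), (0, 3)])))
cluster_ids = [obj['id'] for obj in result['linkedinfo']['objects']]
```

Because each pair carries its own position, the order in which the grouped values arrive does not matter, and the rest of the record (here the session `id`) is left untouched.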


In [20]:
from os import path
basePath = path.join(conf['general']['bucketName'], conf['general']['clientname'])
splitPath = path.join(basePath, conf['split']['name'])

clusterSim = 0.1
sessionJaccardShrinkage = 5
expDecay = 0.7

conf['split']['excludeAlreadyListenedTest'] = str(True)
conf['algo']['props']["sessionJaccardShrinkage"] = sessionJaccardShrinkage
conf['algo']['props']["clusterSimilarityThreshold"] = clusterSim
conf['algo']['props']["expDecayFactor"] = expDecay
conf['algo']['name'] = CLUSTER_ALGO + str(THRESHOLD)[2:] + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
            

try:
    playlists = extractImplicitPlaylists(trainRDD, conf).cache()
    sendNotificationToMattia("Playlist extracted", "Let's go!") 
    
    recJsonRDD = executeImplicitPlaylistAlgo(playlists, testRDD, conf)
    sendNotificationToMattia("Recommendation done", "Let's go!") 
    
    saveRecommendations(conf, recJsonRDD, overwrite=True)
    sendNotificationToMattia("Written!!!", "Let's go!")    
except Exception as err:
    print str(err)
    sendNotificationToMattia("Recommendation failed", str(err))

Recommendations successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01/recommendations
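
The directory name in the message above can be traced back to how `conf['algo']['name']` is assembled. The sketch below reproduces only that string formatting. The final `replace` call is an assumption made to mirror the log line: the dots are evidently stripped somewhere before writing, and an extra `_5#07#01` suffix is appended, presumably inside `saveRecommendations` or a helper it calls, which is not shown in this notebook.

```python
CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751

sessionJaccardShrinkage = 5
clusterSim = 0.1
expDecay = 0.7

# '%' binds tighter than '+', so only the last string literal is formatted
name = CLUSTER_ALGO + str(THRESHOLD)[2:] + \
    '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
    (sessionJaccardShrinkage, clusterSim, expDecay)

# Assumed post-processing, inferred from the logged path only
name_without_dots = name.replace('.', '')

logged_dir = 'jaccardBase751_ImplicitPlaylist_shk_5_clustSim_0100_decay_0700_5#07#01'
matches_log = logged_dir.startswith(name_without_dots)
```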
