# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster, managed by the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster). Note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link.

First initialize Kerberos from a Jupyter terminal. In the terminal, execute:<BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials to confirm that a ticket was issued:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
06/21/2016 07:57:32  06/22/2016 07:57:31  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 06/21/2016 07:57:32


In [2]:
! hdfs dfs -ls 
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-06-20 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-06-21 09:00 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-05-25 06:28 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition


The listing above confirms that we can browse HDFS.

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf(False)

sconf.setAppName("micro-clustering")

# Master is now yarn-client. The YARN and hadoop config is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. These determine the number of executors on the cluster and the memory per executor:
sconf.set("spark.executor.instances", "100")
sconf.set("spark.executor.memory", "20g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    #sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    #sendNotificationToMattia("Error", str(err))
    print str(err)

# Now you can run your code

Pick a clustering algorithm (the name of the file that provides a classify(x, y[, threshold]) function):

In [4]:
import json
execfile('../spark-scripts/utilsCluster.py')
execfile('../spark-scripts/conventions.py')
execfile('../spark-scripts/splitCluster.py')
execfile('../spark-scripts/implicitPlaylistAlgoFunctions.py')
execfile('../spark-scripts/implicitPlaylistAlgoMain.py')

J = '25'
GROUP = 'avgShrink2'
FILTER = '3600'

CLUSTER_ALGO = 'collaborative/'
THRESHOLD = '0.min_j_25_avgShrink2_Switch42'
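
As a quick illustration of how these two settings are combined further down: the load cell builds the cluster path as `BASE_PATH + '/clusters/' + CLUSTER_ALGO + str(THRESHOLD)[2:]`, so the leading `0.` of `THRESHOLD` is dropped. A minimal sketch in plain Python, using the values from this notebook (no Spark required; the name `cluster_name` is introduced here for illustration only):

```python
# Values copied from the cells of this notebook.
HDFS_PATH = "hdfs://hathi-surfsara"
BASE_PATH = HDFS_PATH + '/user/robertop/mattia'
CLUSTER_ALGO = 'collaborative/'
THRESHOLD = '0.min_j_25_avgShrink2_Switch42'

# str(THRESHOLD)[2:] removes the first two characters, i.e. the "0." prefix.
cluster_name = str(THRESHOLD)[2:]
cluster_path = BASE_PATH + '/clusters/' + CLUSTER_ALGO + cluster_name

print(cluster_name)
print(cluster_path)
```

Printing the path before calling `sc.pickleFile` is a cheap way to catch a mistyped algorithm or threshold name.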



# Building the configuration

In [5]:
import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['name'] = 'ClusterBase'
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 7.5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.2
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
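
The `conf['split']['ts']` expression above is easier to read when unpacked. A small sketch, assuming (this is not stated in the notebook) that the two constants are the earliest and latest event timestamps of the dataset in Unix seconds; the names `t_start`, `t_end` and `split_ts` are introduced here for illustration only:

```python
import time

# Constants copied from conf['split']['ts']; their interpretation as the
# first and last event timestamps (Unix seconds) is an assumption.
t_start = 1390209860
t_end = 1421745857

# 75% of the way through the covered interval, minus 10000 seconds.
split_ts = int(0.75 * (t_end - t_start) + t_start) - 10000

print(split_ts)  # 1413851857
# Human-readable form of the split point, in UTC.
print(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(split_ts)))
```

Under that reading, events before the split timestamp would fall on the training side and later events on the test side; how the timestamp is actually used is determined by `splitter`, which is not shown here.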



In [6]:
execfile('../spark-scripts/splitCluster.py')
SPLIT_NEEDED = False
if SPLIT_NEEDED:
    try:
        splitter(conf)
        sendNotificationToMattia("Split done", "Train and Test")
    except Exception as err:
        print str(err)
        sendNotificationToMattia("Error", str(err))

# Load data 

In [7]:
clusterSongsFileRDD = sc.pickleFile(BASE_PATH + '/clusters/' + CLUSTER_ALGO + str(THRESHOLD)[2:])

songToClusterRDD = clusterSongsFileRDD.flatMap(lambda x: [(int(i), x[0]) for i in x[1]] )

print songToClusterRDD.take(3)

train, test = loadDataset(conf)

train_count = train.count()
test_count = test.count()
print train_count
print test_count
print songToClusterRDD.count()

[(2247393, 9846), (892234, 9846), (1327890, 44313)]
487637
290262
3893303
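
The `flatMap` above inverts a cluster-to-songs mapping into one (song, cluster) pair per song. A minimal sketch of the same transformation on a plain Python list, assuming each pickled record has the shape `(cluster_id, iterable_of_song_ids)`; the sample records and the helper name `song_to_cluster` are hypothetical, chosen so that the result matches the `take(3)` output shown above:

```python
# Hypothetical records in the assumed (cluster_id, song_ids) shape.
cluster_songs = [
    (9846, ['2247393', '892234']),
    (44313, ['1327890']),
]

def song_to_cluster(record):
    # Same logic as the lambda passed to flatMap: one (song_id, cluster_id)
    # pair per song, with the song id cast to int.
    cluster_id, song_ids = record
    return [(int(song_id), cluster_id) for song_id in song_ids]

# flatMap = map each record to a list, then concatenate the lists.
pairs = [pair for record in cluster_songs for pair in song_to_cluster(record)]
print(pairs)
```

Keying the pairs by song id is what allows the later `join` against the flattened train and test records.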


# Flatten to (TrackID, (Index, Rec)) and join with Song -> Cluster

In [8]:
def flat_map_tracks_ids(x):
    # Emit one (track_id, (position, record)) pair per track in the record,
    # so that the track id can serve as the join key.
    objects = x['linkedinfo']['objects']
    return [(obj['id'], (i, x)) for i, obj in enumerate(objects)]

trainFlat = train.map(lambda x: json.loads(x)).flatMap(flat_map_tracks_ids)
trainFlat_count = trainFlat.count()
trainJoin = trainFlat.join(songToClusterRDD)
trainJoin_count = trainJoin.count()

testFlat = test.map(lambda x: json.loads(x)).flatMap(flat_map_tracks_ids)
testFlat_count = testFlat.count()
testJoin = testFlat.join(songToClusterRDD)
testJoin_count = testJoin.count()

print "Train: " + str(trainFlat_count) + ' - ' + str(trainJoin_count)
print "Test : " + str(testFlat_count) + ' - ' + str(testJoin_count)

Train: 13813899 - 13813899
Test : 1431864 - 1431864
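
To make the flatten-and-join step concrete, here is a sketch on one hand-written record, without Spark. The record layout (a `linkedinfo.objects` list whose entries carry an `id`) follows what `flat_map_tracks_ids` accesses; the concrete ids, the `song_to_cluster` dictionary and the restated function are illustrative assumptions. The inner join on the track id is emulated with a dictionary lookup, so tracks without a cluster would be dropped, as with `RDD.join`.

```python
import json

# Hypothetical session record; only the fields used by the notebook are shown.
line = json.dumps({'id': 1, 'linkedinfo': {'objects': [{'id': 10}, {'id': 20}]}})

# Hypothetical song -> cluster mapping (stands in for songToClusterRDD).
song_to_cluster = {10: 7, 20: 9}

def flat_map_tracks_ids(x):
    # Restated from the notebook: emit (track_id, (position, record)).
    objects = x['linkedinfo']['objects']
    return [(obj['id'], (i, x)) for i, obj in enumerate(objects)]

record = json.loads(line)
flat = flat_map_tracks_ids(record)

# Emulate the inner join: (track_id, ((position, record), cluster_id)).
joined = [(track_id, (value, song_to_cluster[track_id]))
          for track_id, value in flat if track_id in song_to_cluster]

for track_id, ((position, _), cluster_id) in joined:
    print("%d %d %d" % (track_id, position, cluster_id))
```

In the run above, the counts before and after the join are equal for both train and test, so no track was dropped by the join.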


# Extract Rec and Group by key (Rec)

In [9]:
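# Each joined element is (track_id, ((index, rec), cluster_id)).
# Key by the serialized record and keep (index, cluster_id) as the value.
trainSub = trainJoin.map(lambda x: ( json.dumps(x[1][0][1]), (x[1][0][0], x[1][1])))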
trainAgg = trainSub.groupByKey()
#train_agg_count = trainAgg.count()

#print "Number of entries in train after: " + str(train_agg_count)
#print "Equal to original train: " + str(train_count == train_agg_count)

testSub = testJoin.map(lambda x: ( json.dumps(x[1][0][1]), (x[1][0][0], x[1][1])))
testAgg = testSub.groupByKey()
#test_agg_count = testAgg.count()
#print "Number of entries in test after: " + str(test_agg_count)
#print "Equal to original test: " + str(test_count == test_agg_count)

# Plug Cluster IDs 

In [None]:
def plug_clusters(x):
    # x = (serialized record, iterable of (index, cluster_id)).
    # Overwrite each track id with the id of the cluster it belongs to.
    row_dic = json.loads(x[0])
    for index, cl_id in x[1]:
        row_dic['linkedinfo']['objects'][index]['id'] = cl_id
    return json.dumps(row_dic)

    
trainRDD = trainAgg.map(plug_clusters)
trainRDD_count = trainRDD.count()
print "TrainRDD: " + str(trainRDD_count)

testRDD = testAgg.map(plug_clusters)
testRDD_count = testRDD.count()
print "TestRDD: " + str(testRDD_count)


TrainRDD: 487637
TestRDD: 290262
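
The group-and-plug steps can be checked the same way on a toy input. The sketch below restates `plug_clusters` and applies it to one hand-written grouped entry; the record and the cluster ids are hypothetical.

```python
import json

def plug_clusters(x):
    # x = (serialized record, iterable of (position, cluster_id)).
    row_dic = json.loads(x[0])
    for index, cl_id in x[1]:
        # Replace the track id at this position with its cluster id.
        row_dic['linkedinfo']['objects'][index]['id'] = cl_id
    return json.dumps(row_dic)

# Hypothetical record with two tracks, and the pairs groupByKey would collect
# for it (deliberately listed out of order).
record = {'id': 1, 'linkedinfo': {'objects': [{'id': 10}, {'id': 20}]}}
grouped = (json.dumps(record), [(1, 9), (0, 7)])

plugged = json.loads(plug_clusters(grouped))
print([obj['id'] for obj in plugged['linkedinfo']['objects']])
```

Because every pair carries its position in the record, the order in which the grouped values arrive does not affect the result.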


In [None]:
from os import path
basePath = path.join(conf['general']['bucketName'], conf['general']['clientname'])
splitPath = path.join(basePath, conf['split']['name'])

clusterSim = 0.2
sessionJaccardShrinkage = 7.5
expDecay = 0.7

conf['split']['excludeAlreadyListenedTest'] = True
conf['algo']['props']["sessionJaccardShrinkage"] = sessionJaccardShrinkage
conf['algo']['props']["clusterSimilarityThreshold"] = clusterSim
conf['algo']['props']["expDecayFactor"] = expDecay
conf['algo']['name'] = CLUSTER_ALGO + str(THRESHOLD)[2:] + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
            

try:
    playlists = extractImplicitPlaylists(trainRDD, conf).cache()
    sendNotificationToMattia("Playlist extracted", "Let's go!") 
    
    recJsonRDD = executeImplicitPlaylistAlgo(playlists, testRDD, conf)
    sendNotificationToMattia("Recommendation done", "Let's go!") 
    
    saveRecommendations(conf, recJsonRDD, overwrite=True)
    sendNotificationToMattia("Written!!!", "Let's go!")    
except Exception as err:
    print str(err)
    sendNotificationToMattia("Error", str(err))

Recommendations successfully written to hdfs://hathi-surfsara/user/robertop/mattia/clusterBase.split/Rec/collaborativemin_j_25_avgShrink2_Switch42_ImplicitPlaylist_shk_7_clustSim_0200_decay_0700_75#07#02/recommendations
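
A note on the run name: `%d` formats only the integer part of a number, so a shrinkage of 7.5 appears as `shk_7` in `conf['algo']['name']`, which is consistent with the path printed above. The sketch below reproduces the string formatting with the values from this notebook; how `saveRecommendations` turns the name into the final directory (for example, the missing `/` and `.` in the printed path) is not shown in the notebook and is not modelled here. The name `suffix` is introduced for illustration only.

```python
# Values copied from the cells of this notebook.
CLUSTER_ALGO = 'collaborative/'
THRESHOLD = '0.min_j_25_avgShrink2_Switch42'
sessionJaccardShrinkage = 7.5
clusterSim = 0.2
expDecay = 0.7

# %d truncates 7.5 to 7; %.3f keeps three decimals.
suffix = '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % (
    sessionJaccardShrinkage, clusterSim, expDecay)
name = CLUSTER_ALGO + str(THRESHOLD)[2:] + suffix

print(suffix)  # _ImplicitPlaylist_shk_7_clustSim_0.200_decay_0.700
print(name)
```

If runs with shrinkage 7 and 7.5 ever need distinct names, formatting the shrinkage with `%.1f` would keep the fractional part; that is a suggestion, not something the notebook does.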


In [None]:
sc

# For a plain run (without clustering), run this

In [13]:
trainRDD, testRDD = loadDataset(conf)

trainRDD = trainRDD.repartition(200)
testRDD = testRDD.repartition(200)

In [14]:
trainRDD.flatMap(lambda x: [i['id'] for i in json.loads(x)['linkedinfo']['objects']]).filter(lambda x: x >= 3893303).take(10)

[]

In [11]:

from os import path
basePath = path.join(conf['general']['bucketName'], conf['general']['clientname'])
splitPath = path.join(basePath, conf['split']['name'])

clusterSim = 0.1
sessionJaccardShrinkage = 10
expDecay = 0.7

conf['split']['excludeAlreadyListenedTest'] = True
conf['algo']['props']["sessionJaccardShrinkage"] = sessionJaccardShrinkage
conf['algo']['props']["clusterSimilarityThreshold"] = clusterSim
conf['algo']['props']["expDecayFactor"] = expDecay
conf['algo']['name'] = CLUSTER_ALGO + str(THRESHOLD)[2:] + '_ImplicitPlaylist_shk_%d_clustSim_%.3f_decay_%.3f' % \
                    (sessionJaccardShrinkage, clusterSim, expDecay )
            

try:
    playlists = extractImplicitPlaylists(trainRDD, conf)
    sendNotificationToMattia("Playlist extracted", "Let's go!") 
    
    recJsonRDD = executeImplicitPlaylistAlgo(playlists, testRDD, conf)
    sendNotificationToMattia("Recommendation done", "Let's go!") 
    
    saveRecommendations(conf, recJsonRDD, overwrite=True)
    sendNotificationToMattia("Written!!!", "Let's go!")    
except Exception as err:
    print str(err)
    sendNotificationToMattia("Error", str(err))

KeyboardInterrupt: 