
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster, managed by the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster). Note that you must be authenticated via Kerberos on your machine to open the resourcemanager link.

First, initialize Kerberos from a Jupyter terminal. 
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
06/02/2016 06:33:55  06/03/2016 06:33:55  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 06/02/2016 06:33:55


In [2]:
! hdfs dfs -ls 
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-05-26 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-06-02 15:38 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-05-25 06:28 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition


The listing above verifies that we can browse HDFS.

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running for as long as the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf(False)

sconf.setAppName("micro-clustering")

# Master is now yarn-client. The YARN and Hadoop configuration is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. These determine the number of executors on the cluster and the memory per executor:
sconf.set("spark.executor.instances", "200")
sconf.set("spark.executor.memory", "10g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    # Report the failure both as a notification and in the notebook output
    sendNotificationToMattia("Spark Context failed", str(err))
    print str(err)


# Now you can run your code

Pick a clustering algorithm, i.e. the name of the file that provides a classify(x, y[, threshold]) function.

In [4]:
import json
execfile('../spark-scripts/utilsCluster.py')
execfile('../spark-scripts/conventions.py')
execfile('../spark-scripts/splitCluster.py')
execfile('../spark-scripts/implicitPlaylistAlgoFunctions.py')
execfile('../spark-scripts/implicitPlaylistAlgoMain.py')



# Reading the conf file

In [5]:
import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'SenzaRipetizioni_1'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = None
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['name'] = 'ClusterBase'
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
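The `conf['split']['ts']` expression above can be checked by hand. The sketch below repeats the same arithmetic in plain Python and converts the result to a UTC date. The names `T_START`, `T_END`, `TRAIN_FRACTION` and `SAFETY_MARGIN` are illustrative only, and reading the two constants as the first and last event timestamps of the dataset is an assumption, not something the notebook states.

```python
from datetime import datetime, timedelta

# Assumption: these are the first and last event timestamps (Unix seconds).
T_START = 1390209860
T_END = 1421745857
TRAIN_FRACTION = 0.75   # the split point sits at 75% of the time range
SAFETY_MARGIN = 10000   # seconds subtracted afterwards, as in the conf cell

# Same arithmetic as conf['split']['ts'].
split_ts = int(TRAIN_FRACTION * (T_END - T_START) + T_START) - SAFETY_MARGIN

# Convert to a naive UTC datetime without depending on the local timezone.
split_utc = datetime(1970, 1, 1) + timedelta(seconds=split_ts)

print(split_ts)
print(split_utc.isoformat())
```

Under that reading, events before the printed date would fall in the training period and later ones in the test period.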



# Load data 

In [6]:
train, test = loadDataset(conf)

print train.take(1)
print test.take(1)

[u'{"id": "1880568", "linkedinfo": {"subjects": [{"type": "user", "id": 25088}], "objects": [{"playratio": 0.67, "playstart": 0, "action": "play", "playtime": 163, "type": "track", "id": 2895012}, {"playratio": 1.0, "playstart": 163, "action": "play", "playtime": 215, "type": "track", "id": 548739}, {"playratio": 0.99, "playstart": 378, "action": "play", "playtime": 190, "type": "track", "id": 1830769}, {"playratio": 1.0, "playstart": 568, "action": "play", "playtime": 261, "type": "track", "id": 2363674}, {"playratio": 0.87, "playstart": 829, "action": "play", "playtime": 188, "type": "track", "id": 325757}, {"playratio": 0.99, "playstart": 1017, "action": "play", "playtime": 212, "type": "track", "id": 2833842}, {"playratio": 1.0, "playstart": 1229, "action": "play", "playtime": 207, "type": "track", "id": 1885477}, {"playratio": 0.99, "playstart": 1436, "action": "play", "playtime": 182, "type": "track", "id": 2988740}, {"playratio": 1.04, "playstart": 1618, "action": "play", "playt

In [7]:
trainRDD = train.map(json.loads).flatMap(lambda x: [(int(x['id']), i['id']) for i in x['linkedinfo']['objects']])
testRDD = test.map(json.loads).flatMap(lambda x: [(int(x['id']), i['id']) for i in x['linkedinfo']['objects']])

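# "2mld" stands for 2 billion: ids above that value are remapped to small consecutive ids
max_under_2mld = 2764449
min_above_2mld = 101418988229

# Assign an index (0, 1, 2, ...) to every distinct id above 2 billion and broadcast the mapping
mapBigId = testRDD.filter(lambda x: x[0] > 2*10**9).map(lambda x: x[0]).distinct().zipWithIndex().collect()
dictBigIds = sc.broadcast(dict(mapBigId))

def substitute_big_ids(x, dic):
    # Ids below 2 billion are kept; larger ones continue right after max_under_2mld
    if x[0] < 2*10**9:
        return x
    else:
        new_id = dic.value[x[0]] + max_under_2mld + 1
        return (new_id, x[1])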

testRDD = testRDD.map(lambda x: substitute_big_ids(x, dictBigIds))
testRDD.filter(lambda x: x[0] > 2*10**9).count()


0
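The remapping above is presumably needed because MLlib's `Rating` stores user and product ids as 32-bit integers, so ids such as 101418988229 would not fit. The following plain-Python sketch shows the same idea on a toy list, without Spark. The helper names `build_id_map` and `remap` are hypothetical, and the ids are sorted here only to make the result deterministic; the notebook relies on `zipWithIndex` order instead.

```python
INT32_MAX = 2**31 - 1

def build_id_map(ids, threshold, offset):
    """Map each distinct id above `threshold` to offset + 1, offset + 2, ..."""
    big_ids = sorted(set(i for i in ids if i > threshold))
    return dict((big, offset + 1 + idx) for idx, big in enumerate(big_ids))

def remap(pair, id_map):
    """Replace the first element of the pair if it has an entry in id_map."""
    first, second = pair
    # Unlike the notebook's function, unknown ids are passed through unchanged.
    return (id_map.get(first, first), second)

# Toy (id, track) pairs: two distinct ids exceed the 32-bit range.
pairs = [(10, 1), (101418988229, 2), (101418988230, 3), (101418988229, 4)]
id_map = build_id_map([p[0] for p in pairs], threshold=2 * 10**9, offset=2764449)
remapped = [remap(p, id_map) for p in pairs]

print(remapped)
```

Every remapped id stays below `INT32_MAX`, and equal original ids still map to the same new id.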

In [8]:
totalTrainRDD = trainRDD.union(testRDD).distinct()
totalTrainRDD.count()

12575662

# Matrix Factorization (Implicit Ratings)

In [9]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

ratings = totalTrainRDD.map(lambda x: Rating(x[0], x[1], 1))
ratings.take(1)


[Rating(user=182472, product=1212718, rating=1.0)]

# Load Ground Truth For Sessions

In [10]:
splitPath = os.path.join(conf['general']['bucketName'], conf['general']['clientname'])
GTpath = os.path.join(splitPath, "GT")
gtRDD = sc.textFile(GTpath).map(lambda x: json.loads(x))

n_rec = float(gtRDD.count())

groundTruthRDD = gtRDD.flatMap(lambda x: ([(x['linkedinfo']['gt'][0]['id'], k['id']) for k in x['linkedinfo']['objects']]))
groundTruthRDD = groundTruthRDD.map(lambda x: substitute_big_ids(x, dictBigIds))
print groundTruthRDD.filter(lambda x: x[0] > 2*10**9).count()

totRec = float(groundTruthRDD.count())

print n_rec
print totRec

0
290262.0
1370128.0


# Load Ground Truth For Users

In [71]:
splitPath = os.path.join(conf['general']['bucketName'], conf['general']['clientname'])
GTpath = os.path.join(splitPath, "GT")
gtRDD = sc.textFile(GTpath).map(lambda x: json.loads(x)).map(lambda x: (x['linkedinfo']['subjects'][0]['id'], [i['id']  for i in x['linkedinfo']['objects']]))


groundTruthRDD = gtRDD.reduceByKey(lambda x,y: list(set(y) | set(x)))
n_rec = float(groundTruthRDD.count()) 
groundTruthRDD = groundTruthRDD.flatMap(lambda x: [(x[0], i) for i in x[1]])
print float(groundTruthRDD.count())
groundTruthRDD = groundTruthRDD.distinct()
print float(groundTruthRDD.count())

1259548.0
1259548.0
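The two identical counts suggest that the `distinct()` call changes nothing here: the set union inside `reduceByKey` already removes duplicate tracks per user. A minimal plain-Python sketch of the same merge, on made-up records (the user and track ids are invented for illustration):

```python
from collections import defaultdict

# Toy ground-truth records: (user id, list of track ids), one per session.
records = [
    ("u1", [10, 11]),
    ("u1", [11, 12]),
    ("u2", [20]),
]

# Equivalent of reduceByKey with a set union: one track set per user.
merged = defaultdict(set)
for user, tracks in records:
    merged[user] |= set(tracks)

n_users = len(merged)

# Equivalent of the flatMap step: one (user, track) pair per distinct track.
user_track_pairs = sorted((u, t) for u, ts in merged.items() for t in ts)

print(n_users)
print(user_track_pairs)
```

Track 11 appears in two sessions of the same user but only once in the flattened pairs.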


# Evaluate

In [None]:
import os
import time
DATA_PATH = '/home/jovyan/work/data/mattia/resultsNew'

factors = [100, 200]
numIterations = [10]

for f in factors:
    for n_i in numIterations:
        
        model = ALS.trainImplicit(ratings, f, n_i)

        products_for_users = model.recommendProductsForUsers(conf['split']['reclistSize'])
        recRDD = products_for_users.flatMap(lambda x: [(x[0], (i.product, k)) for k,i in enumerate(x[1])])
        
        hitRDDPart = recRDD.join(groundTruthRDD).filter(lambda x: x[1][0][0] == x[1][1])

        hitRDD = hitRDDPart.map(lambda x: (x[0], x[1][0][1], 1.0))
        
        path = os.path.join('MF_Sessions', 'factors'+str(f), 'iterations'+str(n_i))
        directory = os.path.join(DATA_PATH, path)
        if not os.path.exists(directory):
            os.makedirs(directory)

        values = {}
        for n in conf['evaluation']['metric']['prop']['N']:
            values[n] = hitRDD.filter(lambda x: x[1] < n).map(lambda x: x[2]).sum()
        
        result = []
        for n in conf['evaluation']['metric']['prop']['N']:
            temp = {}
            temp['type'] = 'metric'
            temp['id'] = -1
            temp['ts'] = time.time()
            temp['properties'] = {}
            temp['properties']['name'] = conf['evaluation']['name']
            temp['evaluation'] = {}
            temp['evaluation']['N'] = n
            temp['evaluation']['value'] = float(values[n]) / totRec
            temp['linkedinfo'] = {}
            temp['linkedinfo']['subjects'] = []
            temp['linkedinfo']['subjects'].append({})
            temp['linkedinfo']['subjects'][0]['splitName'] = conf['split']['name']
            temp['linkedinfo']['subjects'][0]['algoName'] = conf['algo']['name']
            result.append(temp)
    
        file_recall = os.path.join(DATA_PATH, path, 'recall@N')
        # Use a separate name for the file handle so it does not shadow the loop variable f
        with open(file_recall, 'w') as out:
            for i in result:
                line = json.dumps(i)
                out.write(line + '\n')
        print 'Written ' + file_recall

        
        result = []
        for n in conf['evaluation']['metric']['prop']['N']:
            temp = {}
            temp['type'] = 'metric'
            temp['id'] = -1
            temp['ts'] = time.time()
            temp['properties'] = {}
            temp['properties']['name'] = 'precision@N'  # conf['evaluation']['name'] is 'recall@N'
            temp['evaluation'] = {}
            temp['evaluation']['N'] = n
            temp['evaluation']['value'] = float(values[n]) / (n*n_rec)
            temp['linkedinfo'] = {}
            temp['linkedinfo']['subjects'] = []
            temp['linkedinfo']['subjects'].append({})
            temp['linkedinfo']['subjects'][0]['splitName'] = conf['split']['name']
            temp['linkedinfo']['subjects'][0]['algoName'] = conf['algo']['name']
            result.append(temp)
    
        file_precision = os.path.join(DATA_PATH, path, 'precision@N')
        # Use a separate name for the file handle so it does not shadow the loop variable f
        with open(file_precision, 'w') as out:
            for i in result:
                line = json.dumps(i)
                out.write(line + '\n')
        print 'Written ' + file_precision
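The two metrics written above follow the same pattern: count the hits among the first N recommendations, then divide by the total number of ground-truth pairs (recall) or by N times the number of recommendation lists (precision). A small Spark-free sketch of that arithmetic, with invented recommendation lists and a hypothetical helper `hits_at_n`:

```python
def hits_at_n(recs, ground_truth, n):
    """Count top-n recommended items that appear in the ground truth."""
    hits = 0
    for key, ranked in recs.items():
        relevant = ground_truth.get(key, set())
        hits += sum(1 for item in ranked[:n] if item in relevant)
    return hits

# Toy data: ranked recommendations and relevant items per list.
recs = {"s1": [5, 7, 9], "s2": [1, 2, 3]}
ground_truth = {"s1": {7, 8}, "s2": {3}}

total_gt = sum(len(v) for v in ground_truth.values())  # plays the role of totRec
n_lists = len(ground_truth)                            # plays the role of n_rec

metrics = {}
for n in (1, 2, 3):
    h = hits_at_n(recs, ground_truth, n)
    recall = h / float(total_gt)
    precision = h / float(n * n_lists)
    metrics[n] = (recall, precision)

print(metrics)
```

With these toy lists, no hit falls in the top 1, one hit in the top 2 and two hits in the top 3, so recall grows with N while precision depends on how fast the hits accumulate.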
        


In [None]:
sendNotificationToMattia("Matrix Factorization", "Done.")

<pyspark.context.SparkContext at 0x7fb6c92aa710>