# <hr style="clear: both" />

# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster through the [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster). Note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link.

First, initialize Kerberos via a Jupyter terminal. 
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
05/11/2016 07:06:30  05/12/2016 07:06:30  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 05/11/2016 07:06:30


Verify that we can browse HDFS:

In [2]:
! hdfs dfs -ls 
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-05-07 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-05-10 16:17 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that remains running while the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while; you can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Master is now yarn-client. The YARN and hadoop config is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. This one sets the number of executors on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "10g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

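try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc)
    sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    # Report the failure through the same notification channel, then echo it in the notebook.
    sendNotificationToMattia("Spark Context failed", str(err))
    print str(err)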
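
The firewall comment in the cell above implies that every Spark listening port has to be pinned to a known value. As a minimal sketch, the five port settings could be generated from a single base port instead of being written out one by one. The helper name `fixed_port_settings` and the idea of deriving the ports from a base value are illustrative assumptions; only the setting keys and the port numbers come from the cell above.

```python
def fixed_port_settings(base_port=51800):
    """Return (key, value) pairs that pin Spark's listening ports.

    Hypothetical helper: the keys are the ones set in the cell above,
    the consecutive numbering from base_port is an assumption.
    """
    keys = [
        "spark.driver.port",
        "spark.fileserver.port",
        "spark.broadcast.port",
        "spark.replClassServer.port",
        "spark.blockManager.port",
    ]
    # Spark stores every setting as a string, so convert explicitly.
    return [(key, str(base_port + offset)) for offset, key in enumerate(keys)]


settings = fixed_port_settings()
for key, value in settings:
    print("%s=%s" % (key, value))
```

The resulting list of pairs has the shape accepted by `SparkConf.setAll`, so it could be applied with `sconf.setAll(settings)` before the SparkContext is created.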

# <hr style="clear: both" />

# Now you can run your code

Pick a clustering algorithm (name of the file that provides a classify(x,y [,threshold]) function)

# Reading the conf file

In [4]:
import json
import copy

BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

conf = {}

conf['split'] = {}
conf['split']['reclistSize'] = 100
conf['split']['callParams'] = {}
conf['split']['excludeAlreadyListenedTest'] = True
conf['split']['name'] = 'test'
conf['split']['split'] = conf['split']['name']
conf['split']['minEventsPerUser'] = 5
conf['split']['inputData'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/SenzaRipetizioni_1'
#conf['split']['inputData'] = 's3n://contentwise-research-poli/30musicdataset/newFormat/relations/sessions.idomaar'
conf['split']['bucketName'] = BASE_PATH
conf['split']['percUsTr'] = 0.05
conf['split']['ts'] = int(0.75 * (1421745857 - 1390209860) + 1390209860) - 10000
conf['split']['minEventPerSession'] = 5
conf['split']['onlineTrainingLength'] = 5
conf['split']['GTlength'] = 1
conf['split']['minEventPerSessionTraining'] = 10
conf['split']['minEventPerSessionTest'] = 11
conf['split']['mode'] = 'session'
conf['split']['forceSplitCreation'] = False
conf['split']["prop"] = {'reclistSize': conf['split']['reclistSize']}
conf['split']['type'] = list
conf['split']['out'] = HDFS_PATH + '/user/robertop/mattia/clusterBase.split/'
conf['split']['location'] = '30Mdataset/relations/sessions'

conf['evaluation'] = {}
conf['evaluation']['metric'] = {}
conf['evaluation']['metric']['type'] = 'recall'
conf['evaluation']['metric']['prop'] = {}
conf['evaluation']['metric']['prop']['N'] = [1,2,5,10,15,20,25,50,100]
conf['evaluation']['name'] = 'recall@N'

conf['general'] = {}
conf['general']['clientname'] = "clusterBase.split"
conf['general']['bucketName'] = BASE_PATH
conf['general']['tracksPath'] = '30Mdataset/entities/tracks.idomaar.gz'

conf['algo'] = {}
conf['algo']['name'] = 'ClusterBase'
conf['algo']['props'] = {}
# ***** EXAMPLE OF CONFIGURATION *****#
conf['algo']['props']["sessionJaccardShrinkage"] = 5
conf['algo']['props']["clusterSimilarityThreshold"] = 0.1
conf['algo']['props']["expDecayFactor"] = 0.7
# ****** END EXAMPLE ****************#
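
The expression used for `conf['split']['ts']` places the train/test cut at 75% of a time interval and then moves it back by 10000. The sketch below spells out that arithmetic. Reading the two constants as the first and last Unix timestamps of the dataset, and the 10000 as a margin in seconds, is an assumption; the function and constant names are illustrative.

```python
FIRST_TS = 1390209860   # assumed: earliest event timestamp (Unix seconds)
LAST_TS = 1421745857    # assumed: latest event timestamp (Unix seconds)
TRAIN_FRACTION = 0.75   # share of the interval that falls before the cut
MARGIN = 10000          # assumed: safety margin in seconds


def split_timestamp(first_ts, last_ts, fraction, margin):
    # Same arithmetic as conf['split']['ts'] in the cell above.
    return int(fraction * (last_ts - first_ts) + first_ts) - margin


ts = split_timestamp(FIRST_TS, LAST_TS, TRAIN_FRACTION, MARGIN)
print(ts)  # 1413851857
```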




# Load Sessions 


In [5]:
execfile('../spark-scripts/conventions.py')
execfile('../spark-scripts/splitCluster.py')
execfile('../spark-scripts/eval.py')
execfile('../spark-scripts/implicitPlaylistAlgoFunctions.py')
execfile('../spark-scripts/implicitPlaylistAlgoMain.py')

import json
execfile('../spark-scripts/utilsCluster.py')

train, test = loadDataset(conf)

train.take(3)

[u'{"id": "1880599", "linkedinfo": {"subjects": [{"type": "user", "id": 25088}], "objects": [{"playratio": null, "playstart": 0, "action": "play", "playtime": 217, "type": "track", "id": 4463481}, {"playratio": 0.99, "playstart": 217, "action": "play", "playtime": 193, "type": "track", "id": 3721462}, {"playratio": 0.99, "playstart": 410, "action": "play", "playtime": 256, "type": "track", "id": 1918267}, {"playratio": 0.99, "playstart": 666, "action": "play", "playtime": 151, "type": "track", "id": 1320280}, {"playratio": 0.99, "playstart": 817, "action": "play", "playtime": 205, "type": "track", "id": 2468464}, {"playratio": 0.99, "playstart": 1022, "action": "play", "playtime": 320, "type": "track", "id": 1619933}, {"playratio": 1.0, "playstart": 1342, "action": "play", "playtime": 266, "type": "track", "id": 2228799}, {"playratio": 0.99, "playstart": 1608, "action": "play", "playtime": 292, "type": "track", "id": 1879248}, {"playratio": 0.99, "playstart": 1900, "action": "play", "p

# Filter "skipped" songs and "problematic" songs

In [None]:
tracksRDD = sc.textFile(BASE_PATH + '/30Mdataset/entities/tracks.idomaar.gz')
tracksRDD = tracksRDD.repartition(200)
tracksRDD = tracksRDD.map(lambda x: x.split('\t')).map(lambda x: (x[1], json.loads(x[3])['name'].split('/') ) )
tracksRDD = tracksRDD.map(lambda x: (x[0], " ".join( (x[1][0], x[1][2]) ) )).distinct()

def my_replace_punct(x):
    # Track names use '+' in place of spaces; turn them back into spaces.
    return x.replace('+', ' ')

exceptionRDD = tracksRDD.map(lambda x : (int(x[0]), my_replace_punct(x[1]))).filter(lambda x: x[1] == 'ZZ Top She Loves My Automobile')
exceptions = exceptionRDD.map(lambda x: x[0]).collect()
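
The parsing chain above can be tried on a single line without Spark. The sketch below assumes a tab-separated record whose second field is the track id and whose fourth field is a JSON object with a `name` of the form `artist/<something>/title`, which is what the indices in the cell above suggest. The sample line, including its id, is made up for illustration and is not taken from the dataset.

```python
import json


def parse_track(line):
    """Turn one tab-separated track record into (track_id, 'artist title')."""
    fields = line.split('\t')
    track_id = int(fields[1])
    name_parts = json.loads(fields[3])['name'].split('/')
    # Keep the first and third parts of the name and decode '+' as a space.
    title = " ".join((name_parts[0], name_parts[2])).replace('+', ' ')
    return track_id, title


# Hypothetical record, only meant to exercise the parser.
sample = 'track\t42\t-1\t{"name": "ZZ+Top/_/She+Loves+My+Automobile"}\t{}'
print(parse_track(sample))
```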


In [34]:
trainRDD = train.map(json.loads)

MIN_LISTENING = 5

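def filter_skipped(x, exceptions):
    # `del i` inside a for loop only unbinds the loop variable and leaves the
    # list untouched, so the list of events is rebuilt instead.
    exceptions = exceptions.value
    x['linkedinfo']['objects'] = [
        i for i in x['linkedinfo']['objects']
        if not (i['playtime'] < MIN_LISTENING and i['playtime'] != -1)
        and i['id'] not in exceptions
    ]
    return x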

exceptions = set(exceptions)
exBroad = sc.broadcast(exceptions)
trainFilteredRDD = trainRDD.map(lambda x: filter_skipped(x, exBroad)) 
trainFilteredRDD.count()

507817
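
One pitfall with this kind of filter is that `del` applied to a loop variable only unbinds the name and does not remove the element from the list being iterated. A minimal Spark-free sketch of the intended behaviour, which rebuilds the list of events, could look as follows. The helper names and the toy session are illustrative assumptions; the threshold and the `-1` convention for unknown playtime follow the cell above.

```python
MIN_LISTENING = 5  # same threshold as in the cell above


def keep_event(event, excluded_ids):
    # An event counts as skipped when its playtime is known (not -1)
    # and shorter than the threshold.
    too_short = event['playtime'] < MIN_LISTENING and event['playtime'] != -1
    return not too_short and event['id'] not in excluded_ids


def filter_session(session, excluded_ids):
    events = session['linkedinfo']['objects']
    session['linkedinfo']['objects'] = [e for e in events if keep_event(e, excluded_ids)]
    return session


# Toy session, made up for illustration.
session = {'linkedinfo': {'objects': [
    {'id': 1, 'playtime': 217},
    {'id': 2, 'playtime': 3},    # skipped: shorter than MIN_LISTENING
    {'id': 3, 'playtime': -1},   # unknown playtime: kept
    {'id': 4, 'playtime': 180},  # removed because its id is excluded
]}}
filtered = filter_session(session, excluded_ids={4})
print([e['id'] for e in filtered['linkedinfo']['objects']])  # [1, 3]
```

Because the filter is applied with `map`, the number of sessions stays the same; only the events inside each session change.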

# Songs co-occurrences

In [40]:
def extract_co_occur(x):
    result = []
    for song in x['linkedinfo']['objects']:
        id1 = song['id']
        co_occ = []
        for song2 in x['linkedinfo']['objects']:
            id2 = song2['id']
            if id1 != id2:
                co_occ.append(id2)
        result.append( (id1, co_occ) )
    
    return result

co_occRDD = trainFilteredRDD.flatMap(extract_co_occur).reduceByKey(lambda x, y: list( set(x).union(set(y)) ) )
co_occRDD.take(3)

[(1605632,
  [1605633,
   1605634,
   969219,
   1197573,
   1605641,
   1605642,
   1605643,
   1605644,
   1605645,
   348497,
   1605649,
   1605650,
   1605651,
   1624596,
   1605653,
   1605654,
   1605655,
   1605656,
   1605657,
   3044381,
   617477,
   1298979,
   3142694,
   3583016,
   86569,
   600620,
   2497117,
   2831416,
   2341946,
   2722363,
   181822,
   2959937,
   3567171,
   3580997,
   300616,
   373260,
   1078346,
   2145867,
   3150095,
   2646109,
   395532,
   848483,
   2804324,
   1161317,
   1515537,
   2428522,
   1385067,
   2419218,
   1041009,
   497267,
   505974,
   1605652,
   3494010,
   2697851,
   2145917,
   3233227,
   1385087,
   2507906,
   2328200,
   979593,
   2134667,
   3126416,
   3041048,
   2877590,
   4955292,
   1385126,
   153260,
   3622062,
   352285,
   1523893,
   3107000,
   2801849,
   1384639,
   1197763,
   396485,
   1317062,
   2264264,
   3089097,
   2517194,
   3700940,
   2882850,
   1078479,
   3150032,
   807459,

# Similar Sessions

In [41]:
# Map to (userID, session) and Group by Key
def session_to_songs(x):
    user = x['linkedinfo']['subjects'][0]['id']
    songs = [i['id'] for i in x['linkedinfo']['objects']]
    return (user, songs)

trainUserRDD = trainFilteredRDD.map( session_to_songs ).groupByKey()
trainUserRDD.count()

30583

In [42]:
SIM_THR = conf['algo']['props']["clusterSimilarityThreshold"]

def similar_sessions(x):
    sessions = list(x[1])
    l = len(sessions)
    couples = []
    
    for i in range(l):
        sess_1 = set(sessions[i])
        for j in range(i):
            sess_2 = set(sessions[j])
            
            sim_j = compute_jaccard_index(sess_1, sess_2)
            if sim_j > SIM_THR:
                couples.append((sess_1, sess_2, sim_j))
    
    return couples


userSimSessRDD = trainUserRDD.flatMap(similar_sessions)
userSimSessRDD.count()

439464
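
`compute_jaccard_index` is loaded from the external scripts and its definition is not shown in this notebook. Assuming it is the plain Jaccard index, that is the size of the intersection divided by the size of the union with no shrinkage term, a minimal sketch would be:

```python
def jaccard_index(a, b):
    """Plain Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sessions: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / float(len(union))


# Two sessions sharing 2 of 4 distinct songs.
print(jaccard_index([1, 2, 3], [2, 3, 4]))  # 0.5
```

With the threshold of 0.1 used above, two sessions of the same user are paired as soon as slightly more than one tenth of their distinct songs are shared.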

# Extract Couples of Candidate Songs from Similar Sessions

In [43]:
def extract_couples(x):
    sim = x[2]
    result = []
    
    for i in x[0]:
        if i in x[1]:
            continue
            
        for j in x[1]:
            if i != j:
                candidate = (i,j, sim) if i<j else (j,i, sim)
                result.append(candidate)
    
    return result


couplesCandidateRDD = userSimSessRDD.flatMap(extract_couples)
couplesCandidateRDD.count()


1137152798
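
To see why the number of candidates grows so quickly, it helps to run the pairing logic on one small pair of sessions. The sketch below mirrors `extract_couples` under a different, illustrative name: every song that appears only in the first session is paired with every song of the second session, and each pair is stored with the smaller id first.

```python
def candidate_pairs(sess_1, sess_2, sim):
    pairs = []
    for i in sess_1:
        if i in sess_2:
            # Songs present in both sessions are not candidates.
            continue
        for j in sess_2:
            # Order the ids so that (a, b) and (b, a) become the same key.
            pairs.append((min(i, j), max(i, j), sim))
    return pairs


pairs = sorted(candidate_pairs({1, 2, 3}, {2, 3, 4}, 0.5))
print(pairs)  # [(1, 2, 0.5), (1, 3, 0.5), (1, 4, 0.5)]
```

Each pair of similar sessions therefore contributes (songs only in the first session) times (songs in the second session) candidates, which explains the jump from a few hundred thousand session pairs to over a billion candidate couples.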

In [46]:
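# Group the similarities of each (song1, song2) pair, then key the result by song1
# so that it can be joined with co_occRDD: (song1, (song2, similarities)).
couplesCandidateStringRDD = couplesCandidateRDD.map(lambda x: (str(x[0]) + ' ' + str(x[1]), x[2]) ).groupByKey()
couplesCandidateNewRDD = couplesCandidateStringRDD.map(lambda x: ( int(x[0].split(' ')[0]), ( int(x[0].split(' ')[1]), x[1] ) ))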
couplesCandidateNewRDD.count()

90238090

# Join with Co-Occurrences and Filter

In [51]:
def check_co_occur(x):
    co_occ = set(x[1][1])
    if x[1][0][0] in co_occ:
        return False
    return True

couplesJoinOccDD = couplesCandidateNewRDD.join(co_occRDD)
goodCandidatesRDD = couplesJoinOccDD.filter(check_co_occur).map(lambda x: ((x[0], x[1][0][0]), x[1][0][1] ))
goodCandidatesRDD.count()

[(4816896,
  ((4816928, <pyspark.resultiterable.ResultIterable at 0x7f07b67834d0>),
   [1195465,
    4816864,
    4816883,
    4816884,
    1195466,
    4816885,
    4816885,
    4816886,
    1195467,
    4816860,
    4816858,
    1195468,
    2471162,
    4816887,
    4816888,
    4816889,
    4816890,
    2471140,
    4816871,
    4816891,
    4816892,
    4816893,
    4816894,
    4816895,
    4816897,
    4816898,
    4816899,
    4816900,
    4816861]))]

# Keep couples that appear together more than average (2)

In [63]:
MEAN_LEN = goodCandidatesRDD.map(lambda x: len(x[1])).mean()
print MEAN_LEN

robustCandidateRDD = goodCandidatesRDD.filter(lambda x: len(list(x[1])) > MEAN_LEN)
robustCandidateRDD.count()

8096331

# Compute Average Similarity with Shrink

In [68]:
SHRINK = 5

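def average_sim_shrink(x, shrink):
    # Average similarity of a couple, damped by `shrink` so that couples
    # supported by only a few session pairs get a lower score.
    values = list(x[1])
    num = float(sum(values))
    den = float(len(values) + shrink)
    avg = num / den
    return (x[0], avg)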

candidateScoresRDD = robustCandidateRDD.map(lambda x: average_sim_shrink(x, SHRINK))
candidateScoresRDD.take(3)

[((3615416, 3624245), 0.07897524531668154),
 ((98392, 3850680), 0.07692070508352138),
 ((98392, 869198), 0.07692070508352138)]
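
The shrunk average divides the sum of the similarities by the number of observations plus a constant, so a couple seen only a few times is pulled towards zero while a couple seen many times keeps a score close to its plain mean. A small Spark-free sketch, with made-up similarity values and an illustrative function name:

```python
def shrunk_average(values, shrink):
    """sum(values) / (len(values) + shrink)"""
    return sum(values) / float(len(values) + shrink)


few = shrunk_average([0.25, 0.25], 5)    # 0.5 / 7, well below the plain mean of 0.25
many = shrunk_average([0.25] * 20, 5)    # 5.0 / 25 = 0.2, close to the plain mean
print(few)
print(many)
```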

# Statistics on Scores

In [70]:
scoresRDD = candidateScoresRDD.map(lambda x: x[1])

stats = scoresRDD.stats().asDict()
print stats


# Score > $\mu$ + $\sigma$ 

In [82]:
MEAN_SCORE = stats['mean']
STD_SCORE = stats['stdev']

bestCandidatesRDD = candidateScoresRDD.filter(lambda x: x[1] > MEAN_SCORE + STD_SCORE)
bestCandidatesRDD.count()

1101560
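
The filter keeps only the scores that lie more than one standard deviation above the mean. The sketch below reproduces that rule on a toy list in plain Python, using the population standard deviation (dividing by n), which is what the `stdev` entry of Spark's `StatCounter` reports. The scores and the helper name are made up for illustration.

```python
import math


def mean_and_stdev(values):
    n = float(len(values))
    mean = sum(values) / n
    # Population variance: divide by n, not n - 1.
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


scores = [1.0, 2.0, 3.0, 4.0, 10.0]
mean, stdev = mean_and_stdev(scores)
best = [s for s in scores if s > mean + stdev]
print(best)  # [10.0]
```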

# Create clusters for each song

In [85]:
flatCouplesRDD = bestCandidatesRDD.flatMap(lambda x: [(x[0][0], x[0]), (x[0][1], x[0])])

#Group by key (song). Each song has now one cluster
def merge_couples(x, y):
    return list(set(x) | set(y))

songClusterRDD = flatCouplesRDD.reduceByKey(merge_couples)
songClusterRDD.take(3)


[(2684416,
  [2684416,
   2537601,
   2537602,
   2537604,
   2537606,
   2537607,
   2537608,
   1202188,
   2537610,
   2537611,
   2537612,
   2537613,
   2537614,
   2537615,
   2537616,
   2624274,
   1202195,
   1202197,
   1202202,
   1202191,
   707039,
   1349196,
   2911686,
   1202153,
   1202157,
   1202158,
   2812719,
   1202163,
   1188469,
   1202169,
   1202182,
   1202171]),
 (702464,
  [702464,
   262149,
   262151,
   262152,
   262153,
   262157,
   262159,
   262161,
   262162,
   262163,
   262166,
   262167,
   262170,
   262171,
   262173,
   262175,
   262179,
   262180,
   262182,
   262189,
   262190,
   553651,
   855220,
   855102,
   855104,
   855106,
   855107,
   553654,
   853064,
   553561,
   553658,
   553659,
   553660,
   553661,
   553662,
   855163,
   855165,
   855167,
   855169,
   853127,
   853129,
   4453543,
   4453544,
   4453545,
   4453546,
   4453547,
   4453548,
   4453549,
   4453550,
   4453551,
   4453552,
   4453553,
   553650,


In [86]:
songClusterRDD.count()

82082
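
The cluster construction above gives every song the set of songs it was paired with, plus the song itself. A minimal sketch of the same idea on toy couples, without Spark and with an illustrative function name:

```python
def build_song_clusters(pairs):
    """Map each song id to the set of songs appearing with it in any couple."""
    clusters = {}
    for a, b in pairs:
        # Both members of a couple join each other's cluster.
        clusters.setdefault(a, set()).update((a, b))
        clusters.setdefault(b, set()).update((a, b))
    return clusters


# Toy couples, made up for illustration.
pairs = [(1, 2), (1, 3), (4, 5)]
clusters = build_song_clusters(pairs)
print(sorted(clusters[1]))  # [1, 2, 3]
```

Note that clusters are not transitive in this construction: song 2 is in the cluster of song 1, but song 3 is not in the cluster of song 2 unless the couple (2, 3) was also selected.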