
# Running Spark in YARN-client mode

This notebook demonstrates how to set up a SparkContext whose executors run on SURFsara's Hadoop cluster via its [YARN resourcemanager](http://head05.hathi.surfsara.nl:8088/cluster). Note that you need to be authenticated via Kerberos on your machine to visit the resourcemanager link.

First initialize Kerberos via a Jupyter terminal. 
In the terminal, execute: <BR>
<i>kinit -k -t data/robertop.keytab robertop@CUA.SURFSARA.NL</i><BR>
Then print your credentials:


In [1]:
! klist

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertop@CUA.SURFSARA.NL

Valid starting       Expires              Service principal
05/09/2016 08:04:01  05/10/2016 08:04:01  krbtgt/CUA.SURFSARA.NL@CUA.SURFSARA.NL
	renew until 05/09/2016 08:04:01
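
If you prefer to verify the ticket from Python before starting Spark, a check along these lines may help. It is a minimal sketch: `has_valid_ticket` is a hypothetical helper, and it assumes the MIT Kerberos `klist` client, whose `-s` flag suppresses output and reports the ticket status through the exit code.

```python
import subprocess

def has_valid_ticket(command=("klist", "-s")):
    """Return True if `command` exits with status 0.

    With MIT Kerberos, `klist -s` exits 0 only when the credentials cache
    holds a valid (non-expired) ticket. The helper name is hypothetical.
    """
    try:
        return subprocess.call(list(command)) == 0
    except OSError:
        # The klist binary is not installed or not on PATH
        return False
```

Calling `has_valid_ticket()` before creating the SparkContext would let the notebook stop early with a clear message instead of failing later inside YARN.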


In [2]:
! hdfs dfs -ls 
execfile('../spark-scripts/bullet.py')

Found 5 items
drwx------   - robertop hdfs          0 2016-05-07 06:00 .Trash
drwxr-xr-x   - robertop hdfs          0 2016-05-09 01:39 .sparkStaging
drwx------   - robertop hdfs          0 2016-04-06 15:54 .staging
drwxr-xr-x   - robertop hdfs          0 2016-04-27 13:07 mattia
drwxr-xr-x   - robertop hdfs          0 2016-04-13 10:00 recsys2016Competition


The listing above verifies that we can browse HDFS.

Next, initialize Spark. Note that the code below starts a job on the Hadoop cluster that keeps running while the notebook is active, so please close and halt the notebook when you are done. Starting the SparkContext can take a while. You can check the YARN resourcemanager to see the current status and usage of the cluster.

In [3]:
import os
import json
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python2.7'

HDFS_PATH = "hdfs://hathi-surfsara"
BASE_PATH = HDFS_PATH + '/user/robertop/mattia'

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sconf = SparkConf()

# Master is now yarn-client. The YARN and Hadoop config is read from the environment
sconf.setMaster("yarn-client")

# You can control many Spark settings via the SparkConf. This sets the number of executors on the cluster:
sconf.set("spark.executor.instances", "200")
#sconf.set("spark.executor.memory", "20g")

# UFW (firewall) is active on the VM. We explicitly opened these ports and Spark should not bind to random ports:
sconf.set("spark.driver.port", 51800)
sconf.set("spark.fileserver.port", 51801)
sconf.set("spark.broadcast.port", 51802)
sconf.set("spark.replClassServer.port", 51803)
sconf.set("spark.blockManager.port", 51804)
sconf.set("spark.authenticate", True)
sconf.set("spark.yarn.keytab", "/home/jovyan/work/data/robertop.keytab")
sconf.set("spark.yarn.access.namenodes", HDFS_PATH + ":8020")

try:
    sc = SparkContext(conf=sconf)
    sqlCtx = SQLContext(sc) 
    sendNotificationToMattia("Spark Context", "Ready!")
except Exception as err:
    sendNotificationToMattia("Spark Context failed", str(err)) 
    print str(err)

# <hr style="clear: both" />

# Now you can run your code

Pick a clustering algorithm by setting `CLUSTER_ALGO` to the name of the file under `../spark-scripts/` that provides a `classify(x, y[, threshold])` function.
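
The scripts under `../spark-scripts/` are not shown in this notebook, so the following is only a guess at what such a file might contain: a minimal token-based Jaccard `classify`, assuming names are compared as sets of lowercase, whitespace-separated words. The tokenization and the default threshold are assumptions, not the contents of `jaccardBase.py`.

```python
def jaccard(x, y):
    """Jaccard similarity between the word sets of two strings."""
    a = set(x.lower().split())
    b = set(y.lower().split())
    if not a and not b:
        return 1.0  # two empty names are treated as identical
    return len(a & b) / float(len(a | b))

def classify(x, y, threshold=0.5):
    """Return True when the two names are at least `threshold` similar."""
    return jaccard(x, y) >= threshold
```

With this definition, two names that share half of their distinct words pass a threshold of 0.5 but fail a stricter one such as 0.751.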

# Load data 

In [13]:
playlistRDD = sc.textFile(BASE_PATH + '/30Mdataset/entities/playlist.idomaar')
playlistIdsRDD = playlistRDD.map(lambda x: json.loads(x.split('\t')[4]))
playlistIdsRDD = playlistIdsRDD.map(lambda x: x['objects'])
total_playlists = playlistIdsRDD.count()
print "Total playlist: " + str(total_playlists)

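def filter_bad_entry(x):
    # Tag each playlist: (list of song IDs, 0) if well-formed, (raw entry, 1) otherwise
    try:
        result = [i['id'] for i in x]
        return (result, 0)
    except Exception:
        # Entry is missing, not iterable, or its items lack an 'id' field
        return (x, 1)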

playlistIdsRDD = playlistIdsRDD.map(filter_bad_entry)
bad_n = playlistIdsRDD.filter(lambda x: x[1] == 1 or len(x[0]) == 0).count()
print "Bad playlists: " + str(bad_n)

playlistOkRDD = playlistIdsRDD.filter(lambda x: x[1] == 0 and len(x[0]))
playlist_count = float(playlistOkRDD.count())
print "Good playlists: " + str(int(playlist_count)) + " (" + str(playlist_count/total_playlists * 100) + "% )"


Total playlist: 57561
Bad playlists: 9139
Good playlists: 48422 (84.1229304564% )
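
To see what `filter_bad_entry` flags, the same logic can be tried on a few hand-made records without Spark. The sample records below are invented for illustration; only the function mirrors the cell above.

```python
def filter_bad_entry(x):
    # Same logic as the notebook cell: tag the entry 0 (ok) or 1 (bad)
    try:
        return ([i['id'] for i in x], 0)
    except Exception:
        return (x, 1)

# Invented sample 'objects' fields: a well-formed list, an empty list,
# a list whose items lack an 'id', and a missing value
samples = [
    [{'id': 1}, {'id': 2}],
    [],
    [{'name': 'no id'}],
    None,
]

tagged = [filter_bad_entry(s) for s in samples]
# Same predicates as the two filter() calls in the notebook
bad = [t for t in tagged if t[1] == 1 or len(t[0]) == 0]
good = [t for t in tagged if t[1] == 0 and len(t[0])]
```

Note that an empty playlist is tagged `0` by the function but still counted as bad by the length check in the filter.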


In [19]:
playlistFlatRDD = playlistOkRDD.zipWithIndex().flatMap(lambda x: [(i, x[1]) for i in x[0][0] ] )
print playlistFlatRDD.take(3)
total_songs = playlistFlatRDD.count()
print total_songs

[(3006631, 0), (1885124, 0), (2548942, 0)]
1603040
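
`zipWithIndex` pairs every playlist with its position, and the `flatMap` then emits one `(songID, playlistIndex)` pair per song. A plain-Python equivalent, using invented song IDs, might look like this:

```python
# Invented stand-in for playlistOkRDD: (list of song IDs, bad flag) tuples
playlists_ok = [([10, 11, 12], 0), ([11, 13], 0)]

# enumerate plays the role of zipWithIndex; the nested comprehension
# flattens the playlists the way flatMap does
playlist_flat = [
    (song_id, index)
    for index, (song_ids, _flag) in enumerate(playlists_ok)
    for song_id in song_ids
]
```

A song that appears in several playlists, like `11` here, yields one pair per playlist.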


# Load songs

In [20]:
def my_replace_punct(x):
    # Replace every '+' separator with a space
    return x.replace('+', ' ')

tracksRDD = sc.textFile(BASE_PATH + '/30Mdataset/entities/tracks.idomaar.gz')
tracksRDD = tracksRDD.repartition(200)
tracksRDD = tracksRDD.map(lambda x: x.split('\t')).map(lambda x: (x[1], json.loads(x[3])['name'].split('/') ) )
tracksRDD = tracksRDD.map(lambda x: (x[0], " ".join( (x[1][0], x[1][2]) ) )).distinct()
tracksRDD = tracksRDD.map(lambda x : (int(x[0]), my_replace_punct(x[1])))
tracksRDD.take(3)

[(u'2740924', u'Talk+Talk Such+A+Shame+(Dub+Mix)'),
 (u'4573230', u'ZZ+Top She+Loves+My+Automobile'),
 (u'2260352', u'P.SUS Live+Today')]
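
The sample output contains `+` separators, and track names elsewhere in this notebook still show percent-escapes such as `%C3%B6`. If the names are URL-encoded, which is an assumption based only on those outputs, the standard library can decode both in one step. `decode_name` is a hypothetical helper, not part of the original scripts.

```python
try:
    from urllib.parse import unquote_plus  # Python 3

    def decode_name(s):
        """Turn '+' into spaces and decode UTF-8 percent-escapes."""
        return unquote_plus(s)
except ImportError:
    from urllib import unquote_plus  # Python 2

    def decode_name(s):
        """Turn '+' into spaces and decode UTF-8 percent-escapes."""
        # Work on bytes so multi-byte escapes decode as UTF-8
        return unquote_plus(s.encode('utf-8')).decode('utf-8')
```

Such a helper could be used in place of `my_replace_punct` in the last `map`, provided the assumption about the encoding holds for the whole dataset.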

# Join songs from playlists with their names

In [25]:
# Join  -> (songID, (playlistID, name))
# Map   -> (playlistID, (songID, name))
# Group -> (playlistID, [list of (songID, name) ] )

playSongsRDD = playlistFlatRDD.join(tracksRDD).map(lambda x: (x[1][0], (x[0], x[1][1])) ).groupByKey()
playSongsRDD.take(3)

[(1048582, (9229, u'Eighteen Nightmares At The Lux Mother Of Girl')),
 (131098, (42518, u'Al Gromer Khan Mahal')),
 (262196, (35822, u'ASP Ich bin ein wahrer Satan (Dope Stars Inc. Remix)'))]
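
The three steps named in the comments (join, map, group) can be traced on a small in-memory example. This sketch uses plain lists and dictionaries instead of RDDs, with invented IDs and names.

```python
from collections import defaultdict

# (songID, playlistID) pairs and a songID -> name lookup, both invented
playlist_flat = [(10, 0), (11, 0), (11, 1), (99, 1)]
tracks = {10: 'Artist A Song X', 11: 'Artist A Song Y'}

# Join: keep only songs present on both sides -> (songID, (playlistID, name))
joined = [(s, (p, tracks[s])) for s, p in playlist_flat if s in tracks]

# Map: re-key by playlist -> (playlistID, (songID, name))
rekeyed = [(p, (s, name)) for s, (p, name) in joined]

# Group: collect the songs of each playlist -> {playlistID: [(songID, name)]}
play_songs = defaultdict(list)
for p, song in rekeyed:
    play_songs[p].append(song)
```

Song `99` has no entry in `tracks`, so the inner join drops it, just as `join` drops keys that are missing from either RDD.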

# Co-occurring songs in a playlist with high similarity (not duplicates)

In [36]:
CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.5
execfile('../spark-scripts/' + CLUSTER_ALGO + '.py')

def classify_couple(x):
    x_list = list(x[1])
    inserted = set()
    l = len(x_list)
    
    result = []
    
    for i in range(l):
        song_1 = x_list[i]
        id_1 = song_1[0]
        
        for j in range(i):
            song_2 = x_list[j]
            id_2 = song_2[0]
            
            if id_1 == id_2:
                continue
                
            couple_id = (id_1, id_2) if id_1 < id_2 else (id_2, id_1)
            if couple_id in inserted:
                continue
            # Remember the pair so it is evaluated only once per playlist
            inserted.add(couple_id)
            
            if  song_1[1] != song_2[1] and classify(song_1[1], song_2[1]):
                couple = (song_1, song_2) if id_1 < id_2 else (song_2, song_1)
                result.append( couple )
            
    return result
            
            

couplesRDD = playSongsRDD.flatMap(classify_couple).distinct()
print couplesRDD.count()
couplesRDD.take(3)

[((1545422, u'Late of the Pier VW'),
  (1545425, u'Late of the Pier White Snake')),
 ((1121889, u'Esbj%C3%B6rn Svensson Trio Evening in Atlantis'),
  (1121905, u'Esbj%C3%B6rn Svensson Trio In My Garage')),
 ((2945679, u'The Prodigy Climbatize'), (3883172, u'The Prodigy Skylined')),
 ((2279307, u'Queens of the Stone Age Leg of Lamb'),
  (2279513, u'Queens of the Stone Age The Blood Is Love')),
 ((2890993, u'The Jesus and Mary Chain Deep One Perfect Morning'),
  (2891175, u'The Jesus and Mary Chain Snakedriver')),
 ((2957864, u'The Rolling Stones Connection'),
  (2958651, u'The Rolling Stones Little By Little')),
 ((2796936, u'The Beach Boys Surfin%27 U.S.A.'),
  (2797197, u'The Beach Boys Wonderful')),
 ((3076027, u'Tori Amos Girl'), (3076491, u'Tori Amos Spark')),
 ((2756515, u'Tears for Fears Change'), (2756711, u'Tears for Fears Secrets')),
 ((3518148, u'Ill Ni%C3%B1o Liar'), (3518164, u'Ill Ni%C3%B1o Numb')),
 ((247314, u'Art Blakey A Night in Tunisia'),
  (3366460, u'Dizzy Gillespi
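
The nested loops in `classify_couple` visit every unordered pair of songs in a playlist once. `itertools.combinations` expresses the same enumeration more compactly. The sketch below is an alternative formulation, not the original cell: `similar_pairs` is a hypothetical name, and it takes the `classify` function as a parameter (here an invented stand-in) instead of loading it with `execfile`.

```python
from itertools import combinations

def similar_pairs(songs, classify):
    """Return the (song, song) pairs of one playlist that `classify` accepts.

    `songs` is a list of (songID, name) tuples. Each pair is ordered by
    song ID, and pairs with equal IDs or equal names are skipped.
    """
    result = set()
    for song_1, song_2 in combinations(songs, 2):
        if song_1[0] == song_2[0] or song_1[1] == song_2[1]:
            continue
        if classify(song_1[1], song_2[1]):
            result.add(tuple(sorted((song_1, song_2))))
    return sorted(result)

# Invented stand-in for classify: names sharing their first word are similar
def same_first_word(x, y):
    return x.split()[0] == y.split()[0]
```

Collecting the pairs in a set removes repeats within a playlist, which is the role the `inserted` set plays in the original function.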

# Songs in the same cluster that do not co-occur -> duplicates (high probability)

In [45]:
CLUSTER_ALGO = 'jaccardBase'
THRESHOLD = 0.751

clusterSongsFileRDD = sc.pickleFile(BASE_PATH + '/clusters/' + CLUSTER_ALGO + str(THRESHOLD)[2:])
clusterSongsFileRDD = clusterSongsFileRDD.filter(lambda x: len(x[1]) >1)
clusterSongsFileRDD.take(3)


[(3833856, <pyspark.resultiterable.ResultIterable at 0x7f38a7619750>),
 (2232320, <pyspark.resultiterable.ResultIterable at 0x7f38a75601d0>),
 (1540098, <pyspark.resultiterable.ResultIterable at 0x7f38a75600d0>)]

# Song to list of songs in clusters

In [50]:
songToClusterRDD = clusterSongsFileRDD.flatMap(lambda x: [(i, x[1]) for i in x[1]] )
songToClusterRDD = songToClusterRDD.reduceByKey(lambda x, y: list(set(x) | set(y)))
songToClusterRDD.take(10)

[(0, [0, 1937715]),
 (3635200, [3635200, 3635196]),
 (1902400,
  [1902402,
   1902400,
   1902404,
   1902399,
   1902401,
   1902406,
   1902405,
   1902403,
   1902397,
   1902398]),
 (157600, [157601, 157600]),
 (1144800, [1144800, 1144794]),
 (1885200, [1885202, 1885200, 1885201, 1885209, 1885243]),
 (2378800, [2378800, 2378801, 2378802]),
 (1234000, [1233998, 1233999, 1234000]),
 (582800, [582795, 582800]),
 (3502000, [3502001, 3502000])]
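
The `flatMap` plus `reduceByKey` step maps every song to the union of all clusters it belongs to. A plain-Python sketch with invented cluster data:

```python
# Invented stand-in for clusterSongsFileRDD: (clusterKey, [songIDs])
clusters = [(1, [1, 2]), (5, [2, 5, 6])]

song_to_cluster = {}
for _key, members in clusters:
    for song_id in members:
        # Union with clusters seen earlier, like the reduceByKey lambda
        song_to_cluster.setdefault(song_id, set()).update(members)
```

Song `2` belongs to both invented clusters, so its entry ends up holding the members of both.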

# Song to list of songs in playlists

In [51]:
songToPlaylistRDD = playlistOkRDD.flatMap(lambda x: [(i, x[0]) for i in x[0]] ).groupByKey()
songToPlaylistRDD.take(30)

[(3497984, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0c90>),
 (3765596, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0d90>),
 (3760130, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0c10>),
 (1400836, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0e10>),
 (2286934, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0590>),
 (1703942, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0690>),
 (1810440, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0110>),
 (875180, <pyspark.resultiterable.ResultIterable at 0x7f38a76d04d0>),
 (483330, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0bd0>),
 (2154510, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0610>),
 (1458504, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0950>),
 (2834452, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0410>),
 (637614, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0d50>),
 (3022872, <pyspark.resultiterable.ResultIterable at 0x7f38a76d0c50>),
 (1261594

# First, check the songs that appear in playlists

In [55]:
ids_playlist = songToPlaylistRDD.map(lambda x: x[0]).collect()
ids_playlist

[98308,
 3120632,
 573448,
 953004,
 1400844,
 3682306,
 1671182,
 2457616,
 4842840,
 3089808,
 622612,
 2615982,
 901124,
 983066,
 294944,
 2170914,
 3719204,
 262182,
 1351720,
 3659100,
 213036,
 163886,
 901128,
 1138740,
 1433654,
 3842104,
 1638458,
 1982524,
 524350,
 3047488,
 353632,
 1982530,
 1417284,
 671814,
 114764,
 3113038,
 655440,
 1419960,
 852050,
 1138772,
 3489806,
 2482264,
 1876058,
 360540,
 3721914,
 1114206,
 483344,
 2859106,
 3343718,
 4767846,
 2849468,
 1269866,
 2728044,
 2121838,
 3154032,
 3669474,
 162582,
 1022654,
 3293204,
 1720444,
 106622,
 2953966,
 221316,
 689516,
 352394,
 1319056,
 2437144,
 3522706,
 497662,
 3661972,
 3581294,
 2089112,
 1088196,
 245914,
 3309724,
 1083770,
 319656,
 1564842,
 3487090,
 2425008,
 4833458,
 1212596,
 1871902,
 1605814,
 2051876,
 229560,
 3784890,
 3060626,
 3153952,
 2900162,
 2785476,
 91510,
 2506950,
 531114,
 319688,
 1876170,
 3424460,
 520226,
 2908366,
 1130704,
 614610,
 3752150,
 2179288,
 2117

In [56]:
songClusterPlaylistRDD = songToClusterRDD.join(songToPlaylistRDD)
songClusterPlaylistRDD.take(3)

[(1848396,
  ([1848394, 1848396],
   <pyspark.resultiterable.ResultIterable at 0x7f38a752fc90>)),
 (2182860,
  ([2182860, 2182862],
   <pyspark.resultiterable.ResultIterable at 0x7f38a752f390>)),
 (937464,
  ([937465, 937464],
   <pyspark.resultiterable.ResultIterable at 0x7f38a752f450>))]

In [57]:
def cluster_not_in_playlist(x):
    cluster = x[1][0]
    playlist = list(x[1][1])
    result = set()
    
    for i, cl_1 in enumerate(cluster):
        for j in range(i):
            cl_2 = cluster[j]
            if cl_1 in playlist and cl_2 in playlist:
                continue
            
            couple = (cl_1, cl_2) if cl_1 < cl_2 else (cl_2, cl_1)
            

60889
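
The cell above is cut off before the function returns. One plausible completion, offered as an assumption and not as the original code, collects every pair of cluster members that do not both occur in the song's playlists. Note that `songToPlaylistRDD` groups whole playlists per song, so this sketch first flattens them into a single set of song IDs before testing membership.

```python
def cluster_not_in_playlist(x):
    """Pairs of cluster members that do not both occur in the playlists.

    `x` is (songID, (cluster, playlists)), where `playlists` is an iterable
    of song-ID lists. The lines after `couple = ...` are an assumed
    completion of the truncated cell.
    """
    cluster = x[1][0]
    # Flatten the grouped playlists into one set of song IDs
    playlist = set(s for p in x[1][1] for s in p)
    result = set()

    for i, cl_1 in enumerate(cluster):
        for j in range(i):
            cl_2 = cluster[j]
            if cl_1 in playlist and cl_2 in playlist:
                continue
            couple = (cl_1, cl_2) if cl_1 < cl_2 else (cl_2, cl_1)
            result.add(couple)  # assumed completion

    return sorted(result)  # assumed completion
```

Applied with `flatMap` over `songClusterPlaylistRDD`, a function of this shape would yield candidate duplicate pairs, though how the original cell continued is not recoverable from the notebook.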