# Graph based Music Recommender. Task 1

Build the “track-track” edges. To do this, compute the collaborative similarity between all tracks: if a user started listening to track B within 7 minutes of starting track A, add 1 to the weight of the edge from vertex A to vertex B (the initial weight is 0).

Example:

     userId artistId trackId timestamp
     7        12        1          1534574189
     7        13        4          1534574289 
     5        12        1          1534574389 
     5        13        4          1534594189 
     6        12        1          1534574489 
     6        13        4          1534574689 

Track 1 is similar to track 4 with weight 2 (before normalization): users 7 and 6 each listened to both tracks within a 7-minute window, while user 5 did not:

* userId 7: 1534574289 - 1534574189 = 100 seconds = 1 min 40 seconds < 7 minutes
* userId 6: 1534574689 - 1534574489 = 200 seconds = 3 min 20 seconds < 7 minutes
* userId 5: 1534594189 - 1534574389 = 19800 seconds = 5 h 30 min > 7 minutes, so this pair is not counted

Note that track 4 is similar to track 1 with the same weight, 2.
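The counting rule can be checked by hand on the example table. The following is a minimal plain-Python sketch, not the Spark solution; the helper name `co_listen_weights` is made up for illustration, and it uses the same symmetric `abs(...) <= 7 * 60` test as the join in the notebook cell further down:

```python
from collections import Counter
from itertools import permutations

WINDOW_SECONDS = 7 * 60

# (userId, artistId, trackId, timestamp) rows copied from the example table
rows = [
    (7, 12, 1, 1534574189),
    (7, 13, 4, 1534574289),
    (5, 12, 1, 1534574389),
    (5, 13, 4, 1534594189),
    (6, 12, 1, 1534574489),
    (6, 13, 4, 1534574689),
]

def co_listen_weights(rows, window=WINDOW_SECONDS):
    """Count ordered track pairs that one user started within `window` seconds."""
    plays_by_user = {}
    for user_id, _artist_id, track_id, ts in rows:
        plays_by_user.setdefault(user_id, []).append((track_id, ts))
    weights = Counter()
    for plays in plays_by_user.values():
        # permutations yields both (A, B) and (B, A), so the graph is symmetric
        for (t1, ts1), (t2, ts2) in permutations(plays, 2):
            if t1 != t2 and abs(ts2 - ts1) <= window:
                weights[(t1, t2)] += 1
    return weights

weights = co_listen_weights(rows)
print(weights[(1, 4)], weights[(4, 1)])  # 2 2
```

Only users 7 and 6 contribute, which matches the weight of 2 in both directions.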

Tip: consider joining the graph to itself on userId and removing pairs that consist of the same track.

For each track, choose the top 50 most similar tracks ordered by weight and normalize the weights of its edges (divide the weight of each edge by the sum of the weights of all the track’s edges). Use rank() to choose top 40 tracks as is done in the demo.
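To make the "top N, then normalize" step concrete, here is a small plain-Python sketch. The function name `top_n_normalized` and the toy edges are hypothetical. It mimics SQL `rank()` semantics, where tied weights share a rank, so more than N edges can survive when there are ties:

```python
def top_n_normalized(weights, n):
    """weights: {(id1, id2): weight}.

    Keep, per id1, the edges whose rank is <= n (ties share a rank, as with
    SQL rank()), then divide each kept weight by the sum of the kept weights.
    """
    edges_by_source = {}
    for (id1, id2), w in weights.items():
        edges_by_source.setdefault(id1, []).append((id2, w))

    result = {}
    for id1, edges in edges_by_source.items():
        # rank = 1 + number of edges with a strictly larger weight
        kept = [
            (id2, w) for id2, w in edges
            if 1 + sum(1 for _, other in edges if other > w) <= n
        ]
        total = sum(w for _, w in kept)
        for id2, w in kept:
            result[(id1, id2)] = w / total
    return result

example = {("A", "B"): 3, ("A", "C"): 1, ("A", "D"): 1, ("B", "A"): 3}
norm = top_n_normalized(example, n=2)
# "C" and "D" tie for rank 2, so all three edges of "A" are kept
print(norm[("A", "B")], norm[("A", "C")], norm[("B", "A")])
```

The quadratic rank computation is only meant for reading; a window function does the same job at scale.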

Sort the resulting DataFrame in descending order by the column norm_weight, then in ascending order first by “id1” and then by “id2”. Take the top 40 rows and print only the columns “id1” and “id2”.
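The mixed ordering (weight descending, ids ascending) can be sketched in plain Python with a composite sort key. The ids and weights here are made up for illustration:

```python
# (id1, id2, norm_weight) triples; values are illustrative only
edges = [(2, 9, 1.0), (1, 7, 0.5), (2, 3, 1.0), (1, 5, 1.0)]

# Negating the weight sorts it in descending order while ids stay ascending
ordered = sorted(edges, key=lambda e: (-e[2], e[0], e[1]))
top = [(id1, id2) for id1, id2, _ in ordered[:40]]

for id1, id2 in top:
    print("{}\t{}".format(id1, id2))
```

Edges with equal weight are ordered by id1 and then id2, so `(1, 5)` comes before `(2, 3)` and `(2, 9)`.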

Output example:

     54719		767867
     54719		767866
     50787		327676


In [1]:
import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYTHONHASHSEED'] = '42'

In [2]:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.enableHiveSupport().master("local[2]").getOrCreate()

import pyspark.sql.functions as f
from pyspark.sql import Window

In [3]:
data = sparkSession.read.parquet("/data/sample264").alias("data")
meta = sparkSession.read.parquet("/data/meta").alias("meta")

In [4]:
data.describe()

DataFrame[summary: string, userId: string, trackId: string, artistId: string, timestamp: string]

In [5]:
# Test data for testing

#import pyspark.sql.types as t
s_test_data = """
 7        12        1          1534574189
 7        13        4          1534574289 
 5        12        1          1534574389 
 5        13        4          1534594189 
 6        12        1          1534574489 
 6        13        4          1534574689 
 6        17        8          1534574789 
"""
parsed_test_data = [s.strip().split() for s in s_test_data.strip().split("\n")]
test_data = sparkSession.createDataFrame(
    parsed_test_data,
    ["userId", "artistId", "trackId", "timestamp"])
# test_data.toPandas()

In [6]:
use_data = data
# use_data = test_data # FIXME
data1 = use_data.select(["userId", "trackId", "timestamp"]).alias("data1")
data2 = data1.alias("data2")

played_together_cnt = data1 \
    .join(data2,
          (f.col("data1.userId") == f.col("data2.userId")) \
          & (f.col("data1.trackId") != f.col("data2.trackId")) \
          & (f.abs(f.col("data2.timestamp") - f.col("data1.timestamp")) <= 7 * 60) \
         ) \
    .select(f.col("data1.trackId").alias("id1"), f.col("data2.trackId").alias("id2")) \
    .groupBy(["id1", "id2"]) \
    .agg(f.count('*').alias("cnt"))

In [7]:
# played_together_cnt.toPandas()

In [8]:
# For each track choose top 50
window = Window.partitionBy("id1").orderBy(f.col("cnt").desc())
played_together_each_top50 = played_together_cnt \
    .select(["id1", "id2", "cnt", f.rank().over(window).alias("rank")]) \
    .where(f.col("rank") < 51)

In [9]:
# played_together_each_top50.toPandas()

In [10]:
window = Window.partitionBy("id1")

played_together = played_together_each_top50 \
    .select(["id1", "id2", (f.col("cnt") / f.sum("cnt").over(window)).alias("weight")])

In [11]:
# played_together.toPandas()

In [12]:
played_together.persist()

DataFrame[id1: int, id2: int, weight: double]

In [13]:
top40 = played_together.orderBy(\
    f.col("weight").desc(), f.col("id1").asc(),  f.col("id2").asc()).take(40)

In [14]:
for row in top40:
    print("{}\t{}".format(row[0], row[1]))

798256	923706
798319	837992
798322	876562
798331	827364
798335	840741
798374	816874
798375	810685
798379	812055
798380	840113
798396	817687
798398	926302
798405	867217
798443	905923
798457	918918
798460	891840
798461	940379
798470	840814
798474	963162
798477	883244
798485	955521
798505	905671
798545	949238
798550	936295
798626	845438
798691	818279
798692	898823
798702	811440
798704	937570
798725	933147
798738	894170
798745	799665
798782	956938
798801	950802
798820	890393
798833	916319
798865	962662
798931	893574
798946	946408
799012	809997
799024	935246


# Graph based Music Recommender. Task 2

Build the “user-track” edges. Take the number of times the user listened to the track as the weight of the edge from the user’s vertex to the track’s vertex.

Tip: group the dataframe by the columns userId and trackId and use the “count” function of the DataFrame API.

For each user, take the top 1000 tracks by weight and normalize the weights.
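As a rough illustration of this task in plain Python (the function name `user_track_edges` and the toy data are hypothetical), the sketch below counts plays, keeps exactly `top_n` tracks per user in the manner of `row_number()`, and normalizes. Breaking ties by trackId is an assumption made here so the result is deterministic; a window ordered only by count leaves the tie order unspecified.

```python
from collections import Counter, defaultdict

def user_track_edges(plays, top_n):
    """plays: iterable of (userId, trackId). Returns {(userId, trackId): weight}."""
    counts = Counter(plays)
    tracks_by_user = defaultdict(list)
    for (user_id, track_id), cnt in counts.items():
        tracks_by_user[user_id].append((track_id, cnt))

    edges = {}
    for user_id, tracks in tracks_by_user.items():
        # row_number-like cut: exactly top_n rows, ties broken by trackId (assumption)
        kept = sorted(tracks, key=lambda t: (-t[1], t[0]))[:top_n]
        total = sum(cnt for _, cnt in kept)
        for track_id, cnt in kept:
            edges[(user_id, track_id)] = cnt / total
    return edges

plays = [(1, 10), (1, 10), (1, 10), (1, 11), (1, 12), (2, 10)]
edges = user_track_edges(plays, top_n=2)
print(edges)
```

A user who has played only one track ends up with a single edge of weight 1.0.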

Sort the resulting DataFrame in descending order by the column norm_weight, then in ascending order first by “id1” and then by “id2”. Take the top 40 rows and print only the columns “id1” and “id2”.

The part of the result on the sample dataset:

    ...
    195 946408
    215 860111
    235 897176
    300 857973
    321 915545
    ...

In [15]:
user_track_cnt = data \
    .groupBy(["userId", "trackId"]) \
    .count()

In [16]:
window = Window.partitionBy("userId").orderBy(f.col("count").desc())

user_track_top1000 = user_track_cnt \
    .select(["*", f.row_number().over(window).alias("rank")]) \
    .where(f.col("rank") <= 1000)

In [17]:
window = Window.partitionBy("userId")

user_track = user_track_top1000 \
    .select(["*", (f.col("count") / f.sum("count").over(window)).alias("weight")])

In [18]:
user_track.persist()

DataFrame[userId: int, trackId: int, count: bigint, rank: int, weight: double]

In [19]:
top40 = user_track.select("userId", "trackId", "weight").orderBy(\
    f.col("weight").desc(), f.col("userId").asc(), f.col("trackId").asc()).take(40)

for row in top40:
    print("{}\t{}".format(row[0], row[1]))

66	965774
116	867268
128	852564
131	880170
195	946408
215	860111
235	897176
300	857973
321	915545
328	943482
333	818202
346	864911
356	961308
428	943572
431	902497
445	831381
488	841340
542	815388
617	946395
649	901672
658	937522
662	881433
698	935934
708	952432
746	879259
747	879259
776	946408
784	806468
806	866581
811	948017
837	799685
901	871513
923	879322
934	940714
957	945183
989	878364
999	967768
1006	962774
1049	849484
1057	920458


# Graph based Music Recommender. Task 3

Build the “user-artist” edges. Take the number of times the user listened to the artist’s tracks as the weight of the edge from the user’s vertex to the artist’s vertex.

Tip: group the dataframe by the columns userId and artistId and use the “count” function of the DataFrame API. For each user, take the top 100 artists and normalize the weights.

Sort the resulting DataFrame in descending order by the column norm_weight, then in ascending order first by “id1” and then by “id2”. Take the top 40 rows and print only the columns “id1” and “id2”.

The part of the result on the sample dataset:

    ...
    131 983068
    195 997265
    215 991696
    235 990642
    288 1000564
    ...


In [20]:
user_artist_cnt = data \
    .groupBy(["userId", "artistId"]) \
    .count()

In [21]:
window = Window.partitionBy("userId").orderBy(f.col("count").desc())

user_artist_top100 = user_artist_cnt \
    .select(["*", f.row_number().over(window).alias("rank")]) \
    .where(f.col("rank") <= 100)

In [22]:
window = Window.partitionBy("userId")

user_artist = user_artist_top100 \
    .select(["*", (f.col("count") / f.sum("count").over(window)).alias("weight")])

In [23]:
user_artist.persist()

DataFrame[userId: int, artistId: int, count: bigint, rank: int, weight: double]

In [24]:
top40 = user_artist.select("userId", "artistId", "weight").orderBy(\
    f.col("weight").desc(), f.col("userId").asc(), f.col("artistId").asc()).take(40)

for row in top40:
    print("{}\t{}".format(row[0], row[1]))

66	993426
116	974937
128	1003021
131	983068
195	997265
215	991696
235	990642
288	1000564
300	1003362
321	986172
328	967986
333	1000416
346	982037
356	974846
374	1003167
428	993161
431	969340
445	970387
488	970525
542	969751
612	987351
617	970240
649	973851
658	973232
662	975279
698	995788
708	968848
746	972032
747	972032
776	997265
784	969853
806	995126
811	996436
837	989262
901	988199
923	977066
934	990860
957	991171
989	975339
999	968823


# Graph based Music Recommender. Task 4

Build the “artist-track” edges. Take the number of times the track has been listened to by all users as the weight of the edge from the artist’s vertex to the track’s vertex.

Tip: group the dataframe by the columns “artistId” and “trackId” and use the “count” function of the DataFrame API. For each artist, take the top 100 tracks and normalize the weights.

Sort the resulting DataFrame in descending order by the column norm_weight, then in ascending order first by “id1” and then by “id2”. Take the top 40 rows and print only the columns “id1” and “id2”.

    ...
    968017 859321
    968022 852786
    968034 807671
    968038 964150
    968042 835935
    ...

In [25]:
artist_track_cnt = data \
    .groupBy(["artistId", "trackId"]) \
    .count()

In [26]:
window = Window.partitionBy("artistId").orderBy(f.col("count").desc())

artist_track_top100 = artist_track_cnt \
    .select(["*", f.row_number().over(window).alias("rank")]) \
    .where(f.col("rank") <= 100)

In [27]:
window = Window.partitionBy("artistId")

artist_track = artist_track_top100 \
    .select(["*", (f.col("count") / f.sum("count").over(window)).alias("weight")])

In [28]:
artist_track.persist()

DataFrame[artistId: int, trackId: int, count: bigint, rank: int, weight: double]

In [29]:
top40 = artist_track.select("artistId", "trackId", "weight").orderBy(\
    f.col("weight").desc(), f.col("artistId").asc(), f.col("trackId").asc()).take(40)

for row in top40:
    print("{}\t{}".format(row[0], row[1]))

967993	869415
967998	947428
968004	927380
968017	859321
968022	852786
968034	807671
968038	964150
968042	835935
968043	913568
968046	935077
968047	806127
968065	907906
968073	964586
968086	813446
968092	837129
968118	914441
968125	821410
968140	953008
968148	877445
968161	809793
968163	803065
968168	876119
968189	858639
968221	896937
968224	892880
968232	825536
968237	932845
968238	939177
968241	879045
968242	911250
968248	953554
968255	808494
968259	880230
968265	950148
968266	824437
968269	913243
968272	816049
968278	946743
968285	847460
968286	940006


# Graph based Music Recommender. Task 5

For the user with Id 776748, find all the tracks and artists connected to that user. Use the original dataframe, not a normalized one. Sort the found items first by artist and then by name in ascending order, keep only the columns “Artist” and “Name”, and print the top 40.

Each output line can take one of the following forms:

1. Artist: <artist-name> <track-name>
2. Artist: <artist-name> Artist: <artist-name>

These two forms help distinguish “user-track” suggestions (as shown in 1) from “user-artist” suggestions (as shown in 2).
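One way to produce the two line forms is a generator that emits an artist line whenever the artist changes, combined with `itertools.islice` to cap the output at 40 lines. This is only a sketch: the name `output_lines` is hypothetical, and it assumes the artist field in the metadata already carries the `Artist: ` prefix, as the sample output suggests.

```python
from itertools import islice

def output_lines(rows):
    """rows: (artist, track name) pairs already sorted by artist, then name."""
    last_artist = None
    for artist, name in rows:
        if artist != last_artist:
            last_artist = artist
            yield "{} {}".format(artist, artist)  # form 2: user-artist
        yield "{} {}".format(artist, name)        # form 1: user-track

rows = [
    ("Artist: Blur", "Girls and Boys"),
    ("Artist: Green Day", "21 Guns"),
    ("Artist: Green Day", "Kill The DJ"),
]
lines = list(islice(output_lines(rows), 40))
for line in lines:
    print(line)
```

Using `islice` keeps the 40-line limit out of the formatting logic, so no global counter is needed.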

The part of the result on the sample dataset:

    ...
    Artist: Blur Artist: Blur
    Artist: Blur Girls and Boys
    Artist: Clawfinger Artist: Clawfinger
    Artist: Clawfinger Nothing Going On
    Artist: Disturbed Artist: Disturbed
    ...

In [30]:
USERID = 776748

In [31]:
one_user_tracks = data\
    .where(f.col("userId") == USERID) \
    .select("trackId").distinct() \
    .join(meta, \
          f.col("trackId") == f.col("meta.Id") \
         ) \
    .orderBy("Artist", "Name")

In [32]:
# one_user_tracks.toPandas()

In [33]:
result = one_user_tracks.select("Artist", "Name").collect()

last_artist = ''
n = 0
MAX_N = 40

def print_next(s):
    global n
    if n < MAX_N:
        print(s)
    n += 1

for row in result:
    (artist, title) = row
    if artist != last_artist:
        last_artist = artist
        print_next("{} {}".format(artist, artist))
    print_next("{} {}".format(artist, title))

Artist: 3 Doors Down Artist: 3 Doors Down
Artist: 3 Doors Down Kryptonite
Artist: 311 Artist: 311
Artist: 311 Beautiful disaster
Artist: Blur Artist: Blur
Artist: Blur Girls and Boys
Artist: Clawfinger Artist: Clawfinger
Artist: Clawfinger Nothing Going On
Artist: Disturbed Artist: Disturbed
Artist: Disturbed The Vengeful One
Artist: Gotthard Artist: Gotthard
Artist: Gotthard Eagle
Artist: Green Day Artist: Green Day
Artist: Green Day 21 Guns
Artist: Green Day Kill The DJ
Artist: Iggy Pop Artist: Iggy Pop
Artist: Iggy Pop Sunday
Artist: Korn Artist: Korn
Artist: Korn Here To Stay
Artist: Linkin Park Artist: Linkin Park
Artist: Linkin Park In The End
Artist: Linkin Park Numb
Artist: Lordi Artist: Lordi
Artist: Lordi Hard Rock Hallelujah
Artist: Nickelback Artist: Nickelback
Artist: Nickelback She Keeps Me Up
Artist: Nomy Artist: Nomy
Artist: Nomy Cocaine
Artist: Papa Roach Artist: Papa Roach
Artist: Papa Roach Getting Away With Murder
Artist: Rise Against Artist: Rise Against
Artist: Ri

# Graph based Music Recommender. Task 6

For the user with Id 776748, print the top-40 recommended tracks. Build the recommendations with the algorithm described in lesson 3 of week 5. Initialize the coordinates of the vector x_0 that correspond to the user’s vertex and to all the vertices from Task 5 with ones, and all other coordinates with zeros. Do 5 iterations of the algorithm’s update step.

Take alpha = 0.15 and the following balancing functions:

* beta(user, user → artist) = 0.5
* beta(user, user → track) = 0.5
* beta(track, track → track) = 1
* beta(artist, artist → track) = 1
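The update formula is not reproduced in this text. Reading it off the `step_alpha` and `step_tgt` cells further down, one iteration of this notebook's implementation appears to compute the following, where $u$ is the user's vertex, $w_{s \to v}$ is the normalized edge weight from Tasks 1 to 4, and $\beta$ is the balancing function listed above. This is an interpretation of the code, not a quotation of the lesson's formula:

$$
x^{(k+1)}_v \;=\; \alpha \,\mathbb{1}[v = u] \;+\; (1 - \alpha) \sum_{s \to v} \beta(s,\, s \to v)\; w_{s \to v}\; x^{(k)}_s
$$

The first term restarts the walk at the user's vertex with weight $\alpha$; the sum runs over all edges that point to $v$, whatever their type.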

You should receive a table with 3 columns: “name”, “artist” and “rank”. Sort the resulting dataframe in descending order by “rank”, select top 40 recommended tracks, select only the columns “name”, “artist” and “rank”, leave 5 digits after the decimal point in “rank” and print the resulting dataframe.
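A small note on the "5 digits after the decimal point" requirement: Python's `round()` returns a float, and printing it drops trailing zeros, whereas a format specifier always shows exactly five decimals. A short illustration with made-up values:

```python
values = [1.35278102029, 0.8]

rounded = [str(round(v, 5)) for v in values]      # trailing zeros are dropped
formatted = ["{:.5f}".format(v) for v in values]  # always five decimals

print(rounded)    # ['1.35278', '0.8']
print(formatted)  # ['1.35278', '0.80000']
```

Which of the two the grader expects is not stated in the task, so this is only something to keep in mind if outputs fail to match.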

The part of the result on the sample dataset:

    ...
    Prayer Of The Refugee Artist: Rise Against 1.35278102029
    Eagle Artist: Gotthard 1.21412311013
    21 Guns Artist: Green Day 1.17301653219
    Wait And Bleed Artist: Slipknot 0.921552328559
    Beautiful disaster Artist: 311 0.921552328559
    ...

In [34]:
USERID = 776748
ALPHA = 0.15
BETA_USER_ARTIST = 0.5
BETA_USER_TRACK = 0.5
BETA_TRACK_TRACK = 1
BETA_ARTIST_TRACK = 1

### Weight update with test data

In [35]:
# vertices: of only one type
# edges: only of one type
#
# I don't need to rewrite edge weight. But let's retain this code
# for later reference
def step(vertices, edges, alpha, beta):
    alpha_rdd = vertices.select(["id", (f.col("weight") * alpha).alias("weight")])
    mia = (1 - alpha) * beta
    edge_rdd = edges.alias("edges") \
        .join(vertices.alias("vertices"), f.col("id") == f.col("id1"), "left_outer") \
        .select(["edges.type", "id1", "id2", \
                 f.when(f.isnull(f.col("id")), f.col("edges.weight")) \
                     .otherwise(mia * f.col("vertices.weight") * f.col("edges.weight")) \
                     .alias("weight") \
                ])
    return (alpha_rdd, edge_rdd)

In [36]:
# Only the user vertex needs to be created
# def step_alpha(vertices, alpha):
#     alpha_rdd = vertices.select(["id", (f.col("weight") * alpha).alias("weight")])
#     return alpha_rdd
def step_alpha(user_id, alpha):
    return sparkSession.createDataFrame([[user_id, alpha]], ["id", "weight"])

In [37]:
def step_tgt(vertices, edges, alpha, beta):
    mia = (1 - alpha) * beta
    tgt_rdd = vertices.alias("vertices") \
        .join(edges.alias("edges"), f.col("id") == f.col("id1")) \
        .select([f.col("id2").alias("id"), \
                (mia * f.col("vertices.weight") * f.col("edges.weight")).alias("weight") \
                ])
    return tgt_rdd

In [38]:
def unite_weights(df1, df2, df3, df4, df5):
    df = df1.union(df2).union(df3).union(df4).union(df5)
    df_sum = df.groupBy("id").agg(f.sum("weight").alias("weight"))
    return df_sum

In [39]:
test_vertices = sparkSession.createDataFrame(
    [["user", 1, 1.1], ["user", 2, 2.2], ["user", 3, 3.3]],
    ["type", "id", "weight"])
test_edges = sparkSession.createDataFrame(
    [["ut", 1, 11, 11.1], ["ut", 2,  11, 22.2], ["ut", 1, 3, 32.3], ["ut", 7, 73, 77.3]],
    ["type", "id1", "id2", "weight"])

In [40]:
x = step_alpha(USERID, 0.5)
x.toPandas()

       id  weight
0  776748     0.5


In [41]:
e = step_tgt(test_vertices, test_edges, 0.5, 0.25)
e.toPandas()

   id   weight
0  11  1.52625
1   3  4.44125
2  11  6.105


In [42]:
e = step_tgt(test_vertices, test_edges, 0.5, 0.25)
ff = unite_weights(e, e, e, e, e)
ff.toPandas()

   id    weight
0   3  22.20625
1  11  38.15625
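The test outputs above can be cross-checked without Spark. The sketch below mirrors `step_tgt` and `unite_weights` in plain Python on the same test vertices and edges; the names `propagate` and `sum_by_id` are hypothetical stand-ins, not part of the notebook:

```python
from collections import defaultdict

def propagate(vertices, edges, alpha, beta):
    """vertices: {id: weight}; edges: [(id1, id2, weight)].

    Like the inner join in step_tgt: one contribution per edge whose
    source vertex is present in `vertices`.
    """
    factor = (1 - alpha) * beta
    return [(id2, factor * vertices[id1] * w)
            for id1, id2, w in edges if id1 in vertices]

def sum_by_id(*contribution_lists):
    """Like unite_weights: add up all contributions per target vertex."""
    total = defaultdict(float)
    for contributions in contribution_lists:
        for vertex_id, w in contributions:
            total[vertex_id] += w
    return dict(total)

vertices = {1: 1.1, 2: 2.2, 3: 3.3}
edges = [(1, 11, 11.1), (2, 11, 22.2), (1, 3, 32.3), (7, 73, 77.3)]

contrib = propagate(vertices, edges, alpha=0.5, beta=0.25)
summed = sum_by_id(contrib, contrib, contrib, contrib, contrib)
print(contrib)
print(summed)
```

The edge from vertex 7 contributes nothing because vertex 7 has no weight, which is why id 73 never appears in the outputs.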


### Reproduce the demo from the lecture

In [43]:
d_b_u_t = 0.4
d_b_u_a = 0.6
d_b_a_a = 1.0
d_b_t_t = 0.3
d_b_t_a = 0.7

In [44]:
d_vertices = sparkSession.createDataFrame(
    [["Na", 0.2], ["Ff", 0.2], ["Gd", 0.2], ["21", 0.2], ["Tm", 0.2]],
    ["id", "weight"])

In [45]:
d_edges = sparkSession.createDataFrame(
    [
        ["Na", "Gd", 0.4], # ua
        ["Na", "Ff", 0.6], # ua
        ["Na", "21", 1.0], # ut
        ["Ff", "Gd", 1.0], # aa
        ["Gd", "Ff", 1.0], # aa
    ],
    ["id1", "id2", "weight"])

### Prepare initial data

In [46]:
edges_u_a = user_artist.select([
    f.col("userId").alias("id1"), f.col("artistId").alias("id2"), "weight"]).persist()
edges_t_t = played_together
edges_u_t = user_track.select([
    f.col("userId").alias("id1"), f.col("trackId").alias("id2"), "weight"]).persist()
edges_a_t = artist_track.select([
    f.col("artistId").alias("id1"), f.col("trackId").alias("id2"), "weight"]).persist()

In [47]:
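# Vertices from Task 5: every track and artist the user has listened to, each with weight 1.
# (The chained assignment to one_user_tracks was dropped so the Task 5 result is not overwritten.)
task5_t = data \
    .where(f.col("userId") == USERID) \
    .select(f.col("trackId").alias("id"), f.lit(1.0).alias("weight")).distinct()
task5_a = data \
    .where(f.col("userId") == USERID) \
    .select(f.col("artistId").alias("id"), f.lit(1.0).alias("weight")).distinct()
# Initial vector x_0: the user's vertex plus all Task 5 vertices, all set to 1.0
initial_state = sparkSession.createDataFrame([[USERID, 1.0]], ["id", "weight"]) \
    .union(task5_t).union(task5_a)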

In [48]:
# initial_state.toPandas()
1

1

### Iterate five times

In [49]:
def make_step(vertices_in):
    v_self = step_alpha(USERID, ALPHA)
    v_u_a = step_tgt(vertices_in, edges_u_a, ALPHA, BETA_USER_ARTIST)
    v_u_t = step_tgt(vertices_in, edges_u_t, ALPHA, BETA_USER_TRACK)
    v_t_t = step_tgt(vertices_in, edges_t_t, ALPHA, BETA_TRACK_TRACK)
    v_a_t = step_tgt(vertices_in, edges_a_t, ALPHA, BETA_ARTIST_TRACK)
    vertices_out = unite_weights(v_self, v_u_a, v_u_t, v_t_t, v_a_t)
    return vertices_out.persist()

In [50]:
step1 = make_step(initial_state)

In [51]:
step2 = make_step(step1)

In [52]:
step3 = make_step(step2)

In [53]:
step4 = make_step(step3)

In [54]:
step5 = make_step(step4)
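The five chained cells follow one repeated pattern, which could also be written as a loop. The plain-Python sketch below shows that pattern on a made-up four-vertex graph. The function `iterate`, the `edge_groups` structure (pairs of a beta value and an edge list), and the vertex names are all hypothetical. As in `make_step`, the user's vertex is reset to alpha on every step, and a vertex with no current weight contributes nothing.

```python
def iterate(initial, edge_groups, user_id, alpha, n_iter):
    """initial: {id: weight}; edge_groups: [(beta, [(id1, id2, weight), ...])]."""
    x = dict(initial)
    for _ in range(n_iter):
        nxt = {user_id: alpha}  # restart term, as in step_alpha
        for beta, edges in edge_groups:
            for id1, id2, w in edges:
                if id1 in x:  # inner-join behaviour, as in step_tgt
                    nxt[id2] = nxt.get(id2, 0.0) + (1 - alpha) * beta * x[id1] * w
        x = nxt
    return x

# Toy graph: one user "u", one artist "a", two tracks "t1" and "t2"
edge_groups = [
    (0.5, [("u", "a", 1.0)]),                      # user -> artist
    (0.5, [("u", "t1", 1.0)]),                     # user -> track
    (1.0, [("t1", "t2", 1.0), ("t2", "t1", 1.0)]), # track -> track
    (1.0, [("a", "t1", 0.5), ("a", "t2", 0.5)]),   # artist -> track
]
initial = {"u": 1.0, "a": 1.0, "t1": 1.0}

after_one = iterate(initial, edge_groups, "u", alpha=0.15, n_iter=1)
after_five = iterate(initial, edge_groups, "u", alpha=0.15, n_iter=5)
print(after_one)
print(after_five)
```

In the Spark version the intermediate DataFrames are persisted between steps; the loop form would need the same care to avoid recomputing earlier iterations.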

In [55]:
"""
step5.orderBy(f.col("weight").desc()).toPandas()
"""

'\nstep5.orderBy(f.col("weight").desc()).toPandas()\n'

### Generate the answer in the required format

In [56]:
step1.unpersist()
step2.unpersist()
step3.unpersist()
step4.unpersist()
recommendation = step5

details = recommendation.alias("r").join(
        meta.alias("m"), f.col("r.id") == f.col("m.id")
    ) \
    .orderBy(f.col("weight").desc())

In [57]:
top40 = details.select("Name", "Artist", "weight").take(40)
for row in top40:
    (n, a, w) = row
    print("{} {} {}".format(n, a, round(w, 5)))

Kill The DJ Artist: Green Day 1.42809
Come Out and Play Artist: The Offspring 1.37473
I Hate Everything About You Artist: Three Days Grace 1.37362
Prayer Of The Refugee Artist: Rise Against 1.35278
Eagle Artist: Gotthard 1.21412
21 Guns Artist: Green Day 1.17302
Wait And Bleed Artist: Slipknot 0.92155
Beautiful disaster Artist: 311 0.92155
Here To Stay Artist: Korn 0.91653
Hard Rock Hallelujah Artist: Lordi 0.91653
Nothing Going On Artist: Clawfinger 0.80983
In The End Artist: Linkin Park 0.80292
Numb Artist: Linkin Park 0.80292
Kryptonite Artist: 3 Doors Down 0.68799
Sky is Over Artist: Serj Tankian 0.68799
Take It Out On Me Artist: Thousand Foot Krutch 0.47024
Girls and Boys Artist: Blur 0.40245
Cocaine Artist: Nomy 0.20893
Getting Away With Murder Artist: Papa Roach 0.20648
Artist: Green Day Artist: Green Day 0.01181
Artist: Linkin Park Artist: Linkin Park 0.00472
Artist: The Offspring Artist: The Offspring 0.00472
Artist: Clawfinger Artist: Clawfinger 0.00472
The Vengeful One Artis