### Graph-based Music Recommender. Task 1

#### Task description
##### Data description (DataFrames in parquet format)

Location - /data/sample264

Fields: trackId, userId, timestamp, artistId

- trackId: id of the track
- userId: id of the user
- artistId: id of the artist
- timestamp: timestamp of the moment the user starts listening to a track

Location - /data/meta

Fields: type, Name, Artist, Id

- type: either “track” or “artist”
- Name: the title of the track if type == “track”, or the name of the musician or group if type == “artist”
- Artist: the creator of the track if type == “track”, or the name of the musician or group if type == “artist”
- Id: id of the item

##### Task
Build the edges of the type “track-track”. To do so, compute the collaborative similarity between all pairs of tracks: if a user started listening to tracks A and B within a limited time interval (7 minutes), add 1 to the weight of the edge from vertex A to vertex B (the initial weight is 0). For each track, choose the top 40 most similar tracks ordered by weight and normalize the weights of those edges (divide the weight of each edge by the sum of the weights of all those edges).
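The counting rule can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the helper name `count_co_listens`, the `(userId, trackId, timestamp)` tuple layout, and the assumption that timestamps are expressed in seconds are all chosen for illustration. It stores each pair with the smaller id first, so every unordered pair of tracks maps to a single edge.

```python
from collections import defaultdict
from itertools import combinations

WINDOW_SECONDS = 7 * 60  # assumption: timestamps are expressed in seconds


def count_co_listens(events, window=WINDOW_SECONDS):
    """Count, for every pair of distinct tracks, how many times the same user
    started both within `window` seconds of each other.

    `events` is an iterable of (userId, trackId, timestamp) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Group the listening events by user.
    by_user = defaultdict(list)
    for user_id, track_id, ts in events:
        by_user[user_id].append((track_id, ts))

    weights = defaultdict(int)
    for plays in by_user.values():
        # Every unordered pair of events of the same user is inspected once.
        for (track_a, ts_a), (track_b, ts_b) in combinations(plays, 2):
            if track_a != track_b and abs(ts_a - ts_b) <= window:
                # Smaller id first, so (A, B) and (B, A) share one edge.
                edge = (min(track_a, track_b), max(track_a, track_b))
                weights[edge] += 1
    return dict(weights)


events = [
    (1, 10, 0), (1, 20, 100), (1, 30, 1000),  # user 1: 10 and 20 are close, 30 is too late
    (2, 20, 50), (2, 10, 400),                # user 2: 10 and 20 again
]
weights = count_co_listens(events)
print(weights)
```

In this toy input only tracks 10 and 20 are ever started within 7 minutes of each other, once per user, so the single edge between them gets weight 2.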

Sort the resulting DataFrame in descending order by the column norm_count, take the top 40 rows, select only the columns “id1” and “id2”, sort them in ascending order, first by “id1” and then by “id2”, and print the columns “id1” and “id2” of the resulting DataFrame.
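The output step amounts to two sorts with a cut in between. Here is a hedged plain-Python sketch of that sequence. The function name `top_pairs` and the `(id1, id2, norm_count)` row layout are hypothetical, and breaking ties by `(id1, id2)` before the cut is an assumption of this sketch, since the task text does not say how ties should be resolved.

```python
def top_pairs(rows, k=40):
    """rows: iterable of (id1, id2, norm_count) tuples (hypothetical layout)."""
    # 1. Sort by norm_count descending; ties are broken by (id1, id2)
    #    so that the cut below is deterministic (an assumption of this sketch).
    ranked = sorted(rows, key=lambda r: (-r[2], r[0], r[1]))
    # 2. Keep the k best rows and drop the weight column.
    best = [(id1, id2) for id1, id2, _ in ranked[:k]]
    # 3. Sort the surviving pairs ascending by id1, then by id2.
    return sorted(best)


rows = [(5, 9, 0.2), (1, 7, 0.5), (3, 4, 0.9), (2, 8, 0.5), (6, 6, 0.1)]
result = top_pairs(rows, k=3)
for id1, id2 in result:
    print("%s %s" % (id1, id2))
```

Note that the final ascending sort can reorder the rows relative to their weight ranking: the pair with the highest weight, `(3, 4)`, ends up last here.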

In [1]:
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import Window

execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
SparkSession available as 'spark'.


In [2]:
sparkSession = SparkSession.builder.enableHiveSupport().master("local[2]").getOrCreate()

data = sparkSession.read.parquet("/data/sample264")
meta = sparkSession.read.parquet("/data/meta")

In [3]:
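def norm(df, key1, key2, field, n):
    # For each value of key1, rank the rows by `field` in descending order.
    # Note: key2 is part of the signature but is not used in the body.
    window = Window.partitionBy(key1).orderBy(col(field).desc())

    # Keep only the n top-ranked rows per key1.
    topsDF = df.withColumn("row_number", row_number().over(window)) \
        .filter(col("row_number") <= n) \
        .drop(col("row_number"))

    # Sum of `field` over the kept rows of each key1.
    tmpDF = topsDF.groupBy(col(key1)).agg(col(key1), sum(col(field)).alias("sum_" + field))

    # Divide each kept value by its group's sum to get the normalized weight.
    normalizedDF = topsDF.join(tmpDF, key1, "inner") \
        .withColumn("norm_" + field, col(field) / col("sum_" + field)) \
        .cache()

    return normalizedDF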
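
To make the effect of `norm` easier to check by hand, here is a plain-Python sketch of the same idea: keep the `n` heaviest edges per key, then divide each kept weight by the sum of the kept weights. It is an illustration under assumptions, not a drop-in replacement: the name `normalize_top_n` and the dict layout are hypothetical, and ties are broken by the second key here, whereas `row_number()` in Spark gives no such guarantee.

```python
from collections import defaultdict


def normalize_top_n(weights, n):
    """weights: dict mapping (key1, key2) -> count (hypothetical layout).

    Returns a dict mapping (key1, key2) -> count / (sum of kept counts of key1).
    """
    # Group the edges by their first key.
    by_key = defaultdict(list)
    for (key1, key2), count in weights.items():
        by_key[key1].append((key2, count))

    normalized = {}
    for key1, edges in by_key.items():
        # Keep the n heaviest edges; ties are broken by key2 for determinism.
        top = sorted(edges, key=lambda e: (-e[1], e[0]))[:n]
        total = sum(count for _, count in top)
        for key2, count in top:
            normalized[(key1, key2)] = count / float(total)
    return normalized


weights = {("A", "B"): 3, ("A", "C"): 1, ("A", "D"): 4, ("B", "C"): 2}
result = normalize_top_n(weights, n=2)
print(result)
```

With `n=2`, the edge `("A", "C")` is dropped before normalization, so the two remaining edges of `"A"` get weights 4/7 and 3/7, which sum to 1. A key with a single edge, such as `"B"`, always gets a normalized weight of 1.0.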

In [4]:
data2 = data.select(col('userId').alias('userId'), 
                    col('trackId').alias('trackId2'), 
                    col('artistId').alias('artistId2'), 
                    col('timestamp').alias('timestamp2'))

In [5]:
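# Self-join the listening events per user. The condition trackId < trackId2
# keeps each unordered pair of tracks once, and the timestamp condition keeps
# pairs started within 7 minutes (420 seconds) of each other.
trackToTrack = data.join(data2, 'userId', 'inner') \
                   .filter((col('trackId') < col('trackId2')) & (abs(col('timestamp') - col('timestamp2')) < 421)) \
                   .groupBy(col('trackId'), col('trackId2')) \
                   .agg(count(lit(1)).alias('count'))

# Keep the top 40 edges per track, normalize them, take the 40 rows with the
# largest norm_count, and sort the selected pairs by id1 and then id2.
trackToTrackList = norm(trackToTrack, "trackId", "trackId2", "count", 40) \
        .withColumn("id1", col("trackId")) \
        .withColumn("id2", col("trackId2")) \
        .withColumn("norm_count", col("norm_count") * 0.5) \
        .orderBy(desc("norm_count"), asc("id1"), asc("id2")) \
        .limit(40) \
        .select(col("id1"), col("id2")) \
        .orderBy(asc("id1"), asc("id2")) \
        .collect()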

In [6]:
for val in trackToTrackList:
    print("%s %s" % val)

798256 923706
798319 837992
798322 876562
798331 827364
798335 840741
798374 816874
798375 810685
798379 812055
798380 840113
798396 817687
798398 926302
798405 867217
798443 905923
798457 918918
798460 891840
798461 940379
798470 840814
798474 963162
798477 883244
798485 955521
798505 905671
798545 949238
798550 936295
798626 845438
798691 818279
798692 898823
798702 811440
798704 937570
798725 933147
798738 894170
798745 799665
798782 956938
798801 950802
798820 890393
798833 916319
798865 962662
798931 893574
798946 946408
799012 809997
799024 935246
