<a href="https://colab.research.google.com/github/nivedita-rajesh/Song-Reccomendation-System/blob/main/Song_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music Recommender System using Apache Spark and Python

Spark is an open-source cluster computing framework. It is used to handle real-time generated data. It is built on top on Hadoop MapReduce. It is found to be quicker than most of the alternatives.

## Necessary Package Imports

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
[K     |████████████████████████████████| 281.4 MB 34 kB/s 
[?25hCollecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
[K     |████████████████████████████████| 198 kB 54.3 MB/s 
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=49af0e3edf987ec608ee8a0353c5af276f4c40e42f75ac81997e727cb11590ce
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Loading data

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. 

In [2]:
from pyspark import SparkContext
sc = SparkContext()

## Getting the Emotion

The result obtained from the speech emotion recogniser is fed here so that the songs that goes well with the mood are played.

In [3]:
MOOD = input("The emotion detected from the voice is: ")

The emotion detected from the voice is: sad


## Loading Data
The song dataset is read into a RDD and each record is cleaned.

In [None]:
# Read file into RDD
lines = sc.textFile("song.txt")

# Call collect() to get all data into a list - removing the first header line 
llist = lines.collect()[1:]

lst=[]
for line in llist:
  lst.append(line.strip().replace('"',"").split(";"))


Now that we got the complete dataset cleaned, we assign each emotion to every song based on the genre. This data is written to the dataset text file and will be used for further processing.

In [39]:
file=open('dataset.txt','a')
for track in lst :

  if 'spotify_id' in track:
    continue
  genre = track[-1]
  if(genre in ['tone' ,'wonky',"modern classical","deep liquid", "bounce", "metal", "australian dance", "lo-fi","trap music", "soundtrack", "chillwave"]):
    track.append('angry')
    file.write(str(track)+"\n")
  
  elif(genre in ['anime', "comedy rock" ,"boogie woogie","bubblegum dance","preverb", "chicago soul", "alternative emo", "pop emo", "downtempo","worship","compositional ambient"]):
    track.append('sad')
    file.write(str(track)+"\n")

  elif("pop" in genre or genre in ["spanish hip hop","salsa"]  or "rock" in genre or "metal" in genre or "funk" in genre):
    track.append('happy')
    file.write(str(track)+"\n")

  else:
    track.append("neutral")
    file.write(str(track)+"\n")



`user_history.txt` is a file that contains the songs that a user has listened to. Every time a user listens to a song, its metadata is being appended to the txt file. Hence, if a user listens to a song more than once, the file will have repeated records.

The cell below is a simulation of a user listening to a song.

In [90]:
import random
user_lst=[]

datasetRdd = sc.textFile('dataset.txt')
datasetList =(datasetRdd.collect())
random.shuffle(datasetList)

num_songs=0
with open('user_history.txt','w') as f:
  while num_songs <20:
    rand_int = random.randint(0,datasetRdd.count())
    num_songs+=1
    for _ in range(0,35):
      #user listening to this song
      f.write(str(random.choice(datasetList[rand_int:rand_int+20])+"\n"))

## MapReduce

The typical map reduce is implemented here.
First, each metadata is made into a tuple with second element as 1 to show the count. The output of the map function is of the form (record,1). 
Then, we reduce the data using `reduceByKey` function - the output will have unique keys, with values being the count or the number of times the user has listened to the song - (record, count)

In [76]:
dataset = sc.textFile("user_history.txt")
a = dataset.map(lambda x:(x,1)).reduceByKey(lambda x,y : x + y).collect()

fin_list = []

for i in a[1:]:

  # data from file is string
  # converting list-like string to list
  res = i[0].strip('][').replace("'","").split(', ') 

  if(len(res)<4):
    #removing data having wrong schema
    continue
  
  # when there are multiple artists for a song separated by comma, the rdd considers the comma as a column delimiter.
  # so, we replace the comma in between the artists' names and replace it with &
  # this code works with any number of artists
  if res[-3].isdigit():
    ans = "&".join(res[2:-2])

    while(not res[3].isdigit()):
      del res[3]
    res[2] = ans

  res.append(i[1])

  # another filter to remove all the invalid schema
  if (len(res)==7):
    fin_list.append(res)

In [91]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand 

# converting fin_list to rdd
rdd = sc.parallelize(fin_list)
spark = SparkSession(sc)

print("Actual Dataset")
#the column names
schema = ["spotify_id","title","artist(s)","popularity","genre", "emotion", "count"]
# shuffling
rdd.toDF(schema).orderBy(rand()).show()

Actual Dataset
+--------------------+--------------------+--------------------+----------+--------------------+-------+-----+
|          spotify_id|               title|           artist(s)|popularity|               genre|emotion|count|
+--------------------+--------------------+--------------------+----------+--------------------+-------+-----+
|4RG9Ulx2XrTg2achB...|           Lucky You|The Lightning See...|         5|             britpop|  happy|    3|
|5qk1xXcERl8RW645z...|    Rebellion (Lies)|     Arcade Fire&171|       171|        canadian pop|  happy|    1|
|5eo4Fjw9duI5NbvSR...|Pra de Pé Me Apla...|        Mc Danado&40|        40|   deep funk carioca|  happy|    2|
|4kgD89hUKCdmnSlPO...|            Me Again|         J Mascis&64|        64|               lo-fi|  angry|    3|
|4LrOmj6MoxMdCLf2z...|Murder On The Dan...|Sophie Ellis-Bext...|       101|        new wave pop|  happy|    2|
|1pWscEL17ssjbKXIS...|           On My Way|Tiësto&Bright Spa...|        12|                 edm|n

In [85]:
print("The number of songs user has listened to based on genre")
emotions = (rdd.toDF().select('_6').rdd.flatMap(lambda x: x).collect())
print(f'Sad: {emotions.count("sad")}')
print(f'Happy: {emotions.count("happy")}')
print(f'Neutral: {emotions.count("neutral")}')
print(f'Angry: {emotions.count("angry")}')
#to show the number of songs each emotion has

The number of songs user has listened to based on genre
Sad: 4
Happy: 156
Neutral: 241
Angry: 10


Filtering the dataset using the `MOOD` variable we got from emotion detector

In [92]:
print("Filtered Dataset")
#filtering based on second last element: MOOD
rdd=rdd.filter(lambda x: x[-2]==MOOD)

if(rdd.count()==0):
  print(f"Sorry, we could not find {MOOD} songs from your playlist")
else:
  rdd.toDF(schema).orderBy(rand()).show()


Filtered Dataset
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+
|          spotify_id|               title|         artist(s)|popularity|               genre|emotion|count|
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+
|00z9Rax2KTHqKLKYn...|Water From A Vine...|  William Orbit&68|        68|           downtempo|    sad|    2|
|0P8ANiSwsjYPx3zrm...|             Ever Be|     Aaron Shust&8|         8|             worship|    sad|    3|
|5S1ARzZSleRh8Vpbw...|              orange|Quentin Sirjacq&84|        84|compositional amb...|    sad|    1|
|1l3KYsTzEKQBpkHEx...|      Missing Photos|     Last Days&124|       124|compositional amb...|    sad|    3|
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+



Sorting the filtered dataset

In [93]:
print("Sorted Dataset")

# sorting rdd based on the last element - count, in descending order
rdd = rdd.sortBy(lambda x:x[-1], ascending=False)
rdd.toDF(schema).show()


Sorted Dataset
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+
|          spotify_id|               title|         artist(s)|popularity|               genre|emotion|count|
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+
|1l3KYsTzEKQBpkHEx...|      Missing Photos|     Last Days&124|       124|compositional amb...|    sad|    3|
|0P8ANiSwsjYPx3zrm...|             Ever Be|     Aaron Shust&8|         8|             worship|    sad|    3|
|00z9Rax2KTHqKLKYn...|Water From A Vine...|  William Orbit&68|        68|           downtempo|    sad|    2|
|5S1ARzZSleRh8Vpbw...|              orange|Quentin Sirjacq&84|        84|compositional amb...|    sad|    1|
+--------------------+--------------------+------------------+----------+--------------------+-------+-----+



Displaying the top-k songs based on the number of times the user has listened to it. If k is greater than the number of rows of the RDD, all the elements in the RDD are printed.

In [94]:
k = int(input("How many songs do you want?"))

top_k = rdd.take(k)
print(f"Our top {rdd.count() if k>rdd.count()  else k} recommendations are: \n")
for song in top_k:
  print(f"- {song[1]} by {song[2]} has a popularity of {song[3]}. You have listened to it {song[-1]} time(s)")
  print(f"  Listen song at https://open.spotify.com/track/{song[0]}")
  print()

How many songs do you want?10
Our top 4 recommendations are: 

- Missing Photos by Last Days&124 has a popularity of 124. You have listened to it 3 time(s)
  Listen song at https://open.spotify.com/track/1l3KYsTzEKQBpkHExOMR0d

- Ever Be by Aaron Shust&8 has a popularity of 8. You have listened to it 3 time(s)
  Listen song at https://open.spotify.com/track/0P8ANiSwsjYPx3zrmgN2Pv

- Water From A Vine Leaf by William Orbit&68 has a popularity of 68. You have listened to it 2 time(s)
  Listen song at https://open.spotify.com/track/00z9Rax2KTHqKLKYnGHzNk

- orange by Quentin Sirjacq&84 has a popularity of 84. You have listened to it 1 time(s)
  Listen song at https://open.spotify.com/track/5S1ARzZSleRh8VpbwsLdAk

