# Spotify Songs Study - Similarity

## Introduction

Have you ever asked yourself how songs can be recommended based on your taste? **Similarity** is the answer.
Similarity measures how alike two objects are, for example by comparing their shapes, their values, or the distance between them.
Thus, we can use similarity to find similar songs and create a good recommendation for users based on previously listened songs.

Dataset: [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification) - An attempt to build a classifier that can predict whether or not I like a song.

**Disclaimer**: This is a simple case study of similarity. There are many state-of-the-art algorithms for song recommendation. Still, this notebook can be used as a first step for this study, and also as a baseline algorithm for your experiments.

## Pre Definitions

Import packages, and create useful functions (code hidden).

In [1]:
# Installing youtube tool
!pip install youtube-search-python=='1.3.1'

Collecting youtube-search-python==1.3.1
  Downloading youtube_search_python-1.3.1-py3-none-any.whl (11 kB)
Installing collected packages: youtube-search-python
Successfully installed youtube-search-python-1.3.1
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.


In [2]:
# Import the needs
import numpy as np # linear algebra
import pandas as pd # data processing

import json
from youtubesearchpython import SearchVideos # YouTube search tool

In [3]:
# Get a Song str search
def getMusicName(elem):
    return '{} - {}'.format(elem['artist'], elem['song_title'])


# Function to search a YouTube Video
def youtubeSearchVideo(music, results=1):
    searchJson = SearchVideos(music, offset=1, mode="json", max_results=results).result()
    searchParsed = json.loads(searchJson)
    searchParsed = searchParsed['search_result'][0]
    return {'title': searchParsed['title'],
            'duration': searchParsed['duration'],
            'views': searchParsed['views'],
            'url': searchParsed['link']}

## Loading data

How many songs do we have?

In [4]:
# Load dataset
dfSongs = pd.read_csv('/kaggle/input/spotifyclassification/data.csv', index_col=0)

# Number of rows and columns
rows, cols = dfSongs.shape
print('Number of songs: {}'.format(rows))
print('Number of attributes per song: {}'.format(cols))

Number of songs: 2017
Number of attributes per song: 16


What are the song attributes?

In [5]:
# Print the columns
display(dfSongs.columns)

Index(['acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'target',
       'song_title', 'artist'],
      dtype='object')

In [6]:
# Print the attributes type
dfSongs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2017 entries, 0 to 2016
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      2017 non-null   float64
 1   danceability      2017 non-null   float64
 2   duration_ms       2017 non-null   int64  
 3   energy            2017 non-null   float64
 4   instrumentalness  2017 non-null   float64
 5   key               2017 non-null   int64  
 6   liveness          2017 non-null   float64
 7   loudness          2017 non-null   float64
 8   mode              2017 non-null   int64  
 9   speechiness       2017 non-null   float64
 10  tempo             2017 non-null   float64
 11  time_signature    2017 non-null   float64
 12  valence           2017 non-null   float64
 13  target            2017 non-null   int64  
 14  song_title        2017 non-null   object 
 15  artist            2017 non-null   object 
dtypes: float64(10), int64(4), object(2)
memory

Printing the first rows.

In [7]:
dfSongs[['song_title', 'artist']].head(5)

| | song_title | artist |
|---|---|---|
| 0 | Mask Off | Future |
| 1 | Redbone | Childish Gambino |
| 2 | Xanny Family | Future |
| 3 | Master Of None | Beach House |
| 4 | Parallel Lines | Junior Boys |


For example, let's search for a video of one song.

In [8]:
# Select a song
anySong = dfSongs.loc[0]
# Get the song name
anySongName = getMusicName(anySong)
print('name:', anySongName)

# Search in YouTube
youtubeSearchVideo(anySongName)

name: Future - Mask Off


{'title': 'Future - Mask Off (Official Music Video)',
 'duration': '4:50',
 'views': 441952777,
 'url': 'https://www.youtube.com/watch?v=xvZqHgFz51I'}

## Similarity Queries

We created queries to retrieve the most similar elements based on [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).
"In mathematics, the Euclidean distance between two points is a number, the length of a line segment between the two points."
In this sense, the closer the distance is to 0, the more similar the songs are.
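
As a small worked example, the sketch below computes the Euclidean distance between two feature vectors, first from the definition and then with `np.linalg.norm`. The vectors `song_a` and `song_b` hold made-up values, not rows from the dataset.

```python
import numpy as np

# Two hypothetical songs described by (danceability, energy, valence)
song_a = np.array([0.8, 0.6, 0.9])
song_b = np.array([0.5, 0.2, 0.9])

# Euclidean distance: square root of the sum of squared differences
dist_manual = np.sqrt(np.sum((song_a - song_b) ** 2))

# Same value, computed as the norm of the difference vector
dist_norm = np.linalg.norm(song_a - song_b)

print(dist_manual, dist_norm)  # both are approximately 0.5
```

The differences are 0.3, 0.4, and 0, so the distance is $\sqrt{0.3^2 + 0.4^2 + 0^2} = 0.5$.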

### k-nearest neighbors algorithm (k-NN)

The [k-NN algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) searches for the $k$ elements most similar to a query point. Alternatively, we can search for all elements within a predefined radius (a threshold distance) around the query point. Thus, we have two kinds of queries:

* $k$ query: returns the $k$ closest songs.
* Range query: returns all songs with 'distance' $\leq$ 'threshold'.
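
To make the difference between the two queries concrete, here is a minimal NumPy sketch on four made-up 2-D points. The names `points`, `query`, `k`, and `radius` are illustrative assumptions, and the vectorized style is just one possible way to write it.

```python
import numpy as np

# Hypothetical 2-D points (one row per element) and a query point
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 2.0],
                   [3.0, 4.0]])
query = np.array([0.0, 0.0])

# Euclidean distance from every point to the query point
dists = np.linalg.norm(points - query, axis=1)  # [0., 1., 2., 5.]

# k query: positions of the k closest points
k = 2
k_nearest = np.argsort(dists)[:k]

# Range query: positions of all points with distance <= radius
radius = 2.0
in_range = np.flatnonzero(dists <= radius)

print(k_nearest)  # [0 1]
print(in_range)   # [0 1 2]
```

The $k$ query returns exactly $k$ results whenever at least $k$ points exist, while the range query can return any number of results, including none.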

In [9]:
# K-query
def knnQuery(queryPoint, arrCharactPoints, k):
    tmp = arrCharactPoints.copy(deep=True)
    tmp['dist'] = tmp.apply(lambda x: np.linalg.norm(x-queryPoint), axis=1)
    tmp = tmp.sort_values('dist')
    return tmp.head(k).index

# Range query
def rangeQuery(queryPoint, arrCharactPoints, radius):
    tmp = arrCharactPoints.copy(deep=True)
    tmp['dist'] = tmp.apply(lambda x: np.linalg.norm(x-queryPoint), axis=1)
    tmp['radius'] = tmp.apply(lambda x: 1 if x['dist'] <= radius else 0, axis=1)
    return tmp.query('radius == 1').index

In [10]:
# Execute k-NN removing the 'query point'
def querySimilars(df, columns, idx, func, param):
    arr = df[columns].copy(deep=True)
    queryPoint = arr.loc[idx]
    arr = arr.drop([idx])
    response = func(queryPoint, arr, param)
    return response

#### $k$ query

Trying a query using `knnQuery`.

For example, let's search for $k=3$ similar songs to a query point `songIndex=5` (music: `"Drake - Sneakin"`).

In [11]:
# Selecting song and attributes
songIndex = 5 # query point, selected song
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']

# Selecting query parameters
func, param = knnQuery, 3 # k=3

# Querying
response = querySimilars(dfSongs, columns, songIndex, func, param)

In [12]:
# Select a song
anySong = dfSongs.loc[songIndex]
# Get the song name
anySongName = getMusicName(anySong)
# Retrieve a YouTube link
youtube = youtubeSearchVideo(anySongName)

# Print
print('# Query Point')
print(songIndex, anySongName)
print(youtube['url'])

# Query Point
5 Drake - Sneakin’
https://www.youtube.com/watch?v=YlxMgXacV-A


In [13]:
print('# Similar songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

# Similar songs
732 Tyga - Dope
https://www.youtube.com/watch?v=q9xpsmDXy48
6 Drake - Childs Play
https://www.youtube.com/watch?v=2F2GlWMb_l4
799 Pete Wingfield - 18 With A Bullet
https://www.youtube.com/watch?v=3toBfCJt67w


#### Range query

Trying a query using `rangeQuery`.

For example, let's search for similar songs using $dist \leq 0.15$ and the query point `songIndex=10` (music: `"The Avalanches - Subways - In Flagranti Extended Edit"`).

In [14]:
# Selecting song and attributes
songIndex = 10 # query point, selected song
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence']

# Selecting query parameters
func, param = rangeQuery, 0.15 # threshold distance

# Querying
response = querySimilars(dfSongs, columns, songIndex, func, param)

In [15]:
# Select a song
anySong = dfSongs.loc[songIndex]
# Get the song name
anySongName = getMusicName(anySong)
# Retrieve a YouTube link
youtube = youtubeSearchVideo(anySongName)

# Print
print('# Query Point')
print(songIndex, anySongName)
print(youtube['url'])

# Query Point
10 The Avalanches - Subways - In Flagranti Extended Edit
https://www.youtube.com/watch?v=ZcT0Q2tRnvs


In [16]:
print('# Similar songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

# Similar songs
1132 Fall Out Boy - Immortals
https://www.youtube.com/watch?v=l9PxOanFjxQ
1279 Britt Nicole - Be The Change
https://www.youtube.com/watch?v=wLikOq_ESr8
1426 Rick Ross - Game Ain't Based On Sympathy
https://www.youtube.com/watch?v=lws_tnf5DjE
1676 Bassnectar - Mind Tricks
https://www.youtube.com/watch?v=Xmj1lfvLKMQ
1842 Natalie Imbruglia - Torn
https://www.youtube.com/watch?v=VV1XWJN3nJo


### Questions - Study

So far, we have been able to search for similar songs based on their distance to a query point, using `knnQuery` and `rangeQuery`. In this way, it is possible to find similar songs based on a user's tastes.
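
One possible way to turn a user's taste into a query point, sketched here as an assumption rather than as this notebook's method, is to average the attributes of the songs the user liked and then rank the remaining songs by their distance to that average. The toy DataFrame `toy`, its values, and the helper `tasteQueryPoint` are hypothetical, and `target == 1` is assumed to mean "liked".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two attributes and a 'target' flag (1 = liked, assumed)
toy = pd.DataFrame({
    'energy':  [0.9, 0.7, 0.2, 0.8, 0.1],
    'valence': [0.8, 0.6, 0.1, 0.7, 0.3],
    'target':  [1,   1,   0,   0,   0],
})
columns = ['energy', 'valence']

# Hypothetical helper: mean attribute vector of the liked songs
def tasteQueryPoint(df, columns):
    return df.loc[df['target'] == 1, columns].mean()

queryPoint = tasteQueryPoint(toy, columns)  # roughly energy 0.8, valence 0.7

# Rank the songs that are not liked yet by Euclidean distance to the taste point
candidates = toy.loc[toy['target'] == 0, columns]
dist = np.linalg.norm(candidates.to_numpy() - queryPoint.to_numpy(), axis=1)
ranked = candidates.assign(dist=dist).sort_values('dist')

print(ranked.index.tolist())  # [3, 4, 2]
```

Averaging is only one simple choice; it can blur a user's taste when the liked songs fall into very different groups.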

We can also create our own personalized query points and modify the columns to explore other options. For example, to find the most cheerful songs, we can select a specific set of song attributes, `columns = ['danceability','energy','valence']`, and search for the $k$ songs closest to the highest values, `'danceability'=1,'energy'=1,'valence'=1`. Thus, **question**: _What are the top 5 active, cheerful songs on our list?_

In [17]:
# Defining the query point and the attributes
k = 5
queryPoint = [1, 1, 1] # query point
columns = ['danceability','energy','valence']

# Searching for the songs
arr = dfSongs[columns].copy(deep=True)
response = knnQuery(queryPoint, arr, k)

# Printing
print('# Active, cheerful songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)
    
    print(idx, anySongName)
    print(youtube['url'])

# Active, cheerful songs
1900 Gwen Stefani - Hollaback Girl
https://www.youtube.com/watch?v=Kgjkth6BRRY
1586 Imagination Movers - My Favorite Snack
https://www.youtube.com/watch?v=tJssLXJhWrc
405 Chuck Brown and the Soul Searchers - Bustin' Loose
https://www.youtube.com/watch?v=wwHi10qX8u8
1967 2 LIVE CREW - Hoochie Mama
https://www.youtube.com/watch?v=MEkVI_7eRO4
2013 Dillon Francis - Candy
https://www.youtube.com/watch?v=ZYPzmsLJ52o


We can also change perspective. **Question**: _What are the top 5 least active, least cheerful songs on our list?_ We just need to change our query point to `'danceability'=0,'energy'=0,'valence'=0`.

In [18]:
# Defining the query point and the attributes
k = 5
queryPoint = [0, 0, 0] # query point
columns = ['danceability','energy','valence']

# Searching for the songs
arr = dfSongs[columns].copy(deep=True)
response = knnQuery(queryPoint, arr, k)

# Printing
print('# Least active songs')
for idx in response:
    anySong = dfSongs.loc[idx]
    anySongName = getMusicName(anySong)
    youtube = youtubeSearchVideo(anySongName)

    print(idx, anySongName)
    print(youtube['url'])

# Least active songs
817 Nikolaus Harnoncourt - Mozart: Requiem in D Minor, K. 626: VIII. Lacrimosa
https://www.youtube.com/watch?v=PZddaC_gBHE
1598 Robert Schumann - Piano Quartet in E flat, Op.47: 3. Andante cantabile
https://www.youtube.com/watch?v=u4EDwHCR-PI
1600 Carl Philipp Emanuel Bach - Trio Sonata in G Major, Wq. 144: I. Adagio
https://www.youtube.com/watch?v=E0zG_j1Duxo
1602 Ludwig van Beethoven - String Quintet in C Major, Op. 29: II. Adagio molto espressivo
https://www.youtube.com/watch?v=gaOpQ7ZZ1H8
1876 Frédéric Chopin - Nocturne No.1 In B Flat Minor, Op.9 No.1
https://www.youtube.com/watch?v=ZtIW2r1EalM
