<div style="display: flex; align-items: center;">
    <img src="static/logo.png" width="120" height="120">
    <h1>Project on Spotify Recommendation Systems</h1>
</div>
<p>
    Spotify's recommendation system is a key feature of the service, designed to suggest new music tracks to users based on their personal preferences. 
    The platform employs a hybrid approach, combining collaborative filtering and content-based techniques. This approach enables users to discover new music aligned with their tastes, enhancing the listening experience.
</p>
<h2>Table of Contents</h2>
<ol>
    <li><a href="#theoretical-overview">Theoretical Overview</a></li>
    <li><a href="#introduction">Introduction</a></li>
    <li><a href="#data">Data</a></li>
    <li><a href="#proposed-method">Proposed Method</a></li>
    <li><a href="#results">Results</a></li>
    <li><a href="#conclusion">Conclusion</a></li>
</ol>

<section id="theoretical-overview">
    <h2>1. Theoretical Overview</h2>
</section>

<div style="display: flex; align-items: center;">
    <h3>Before We Begin ... What Exactly is a Recommendation System?!</h3>
    <img src="static/emoji.png" width="80" height="80">
</div>

A recommendation system is a software application or algorithm designed to suggest specific items to users, such as products, services, digital content, or information, based on their preferences, past behaviors, or individual profiles. These systems are widely used on platforms like Netflix, Amazon, Spotify, and others.

The primary challenge these systems face is filtering through vast amounts of content (such as millions of tracks on Spotify) to provide users with only what truly interests them. Users interact with a set of items (in our case, music tracks), and the goal is to predict their preferences in order to highlight personalized recommendations that align with their tastes.

There are "non-intelligent" recommendation systems that are not tailored to specific user interests, such as lists of favorites, wishlists, top-10 rankings, or most popular items. On the other hand, there are customized recommendation systems based on user preferences, and Spotify's recommendation system falls into this advanced category.

<h3>Utility Function and Utility Matrix</h3><br>

**Utility Function and User Scores**
The process of predicting user preferences for a product is based on a **utility function**. This function determines a score representing the user's preference for the item in question. The objective is to predict this score for each user-item pair, allowing the system to evaluate the utility of the product for the user.

**Utility Matrix for Preference Analysis**
The utility matrix is an essential tool for analyzing the preferences of multiple users. Columns represent items, while rows correspond to users. Each element at position (i,j) represents the rating indicating user i's preference for item j, obtained through the utility function. This matrix, often sparse, is used to predict missing ratings.

<table>
    <thead>
        <tr>
            <th rowspan="2"></th>
            <th colspan="3">Harry Potter</th>
            <th rowspan="2">Twilight</th>
            <th colspan="3">Star Wars</th>
        </tr>
        <tr>
            <th>HP1</th>
            <th>HP2</th>
            <th>HP3</th>
            <th>SW1</th>
            <th>SW2</th>
            <th>SW3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Anita (A)</td>
            <td>4</td>
            <td>5</td>
            <td></td>
            <td>5</td>
            <td>1</td>
            <td>O</td>
            <td></td>
        </tr>
        <tr>
            <td>Beyonce (B)</td>
            <td>5</td>
            <td>5</td>
            <td>4</td>
            <td></td>
            <td></td>
            <td></td>
            <td></td>
        </tr>
        <tr>
            <td>Calvin (C)</td>
            <td></td>
            <td></td>
            <td>2</td>
            <td>4</td>
            <td></td>
            <td>5</td>
            <td></td>
        </tr>
        <tr>
            <td>David (D)</td>
            <td>3</td>
            <td></td>
            <td></td>
            <td></td>
            <td></td>
            <td></td>
            <td>3</td>
        </tr>
    </tbody>
</table>
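
As a rough illustration of how such a sparse matrix can be held in memory, the sketch below stores only the observed ratings in a nested dictionary and measures how sparse the matrix is. The users, items, and ratings are made-up toy values (not the ones in the table above), and `build_utility_matrix` and `sparsity` are hypothetical helper names.

```python
from collections import defaultdict


def build_utility_matrix(ratings):
    """Map (user, item, rating) triples to {user: {item: rating}}."""
    matrix = defaultdict(dict)
    for user, item, rating in ratings:
        matrix[user][item] = rating
    return dict(matrix)


def sparsity(matrix, n_items):
    """Fraction of user-item cells that have no rating."""
    observed = sum(len(row) for row in matrix.values())
    return 1 - observed / (len(matrix) * n_items)


# Toy data (hypothetical): 3 users, 4 items, 5 observed ratings
ratings = [
    ("u1", "s1", 5), ("u1", "s3", 2),
    ("u2", "s1", 4), ("u2", "s2", 3),
    ("u3", "s4", 1),
]
utility = build_utility_matrix(ratings)
print(f"Sparsity: {sparsity(utility, n_items=4):.2f}")
```

Storing only the observed cells keeps memory proportional to the number of ratings rather than to users × items, which matters when most cells are empty.
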

<br>
**Key Challenges**
• Populating the matrix, which is difficult for new users because no data is available for them yet (the cold start problem);<br>
• Estimating missing ratings;<br>
• Evaluating the recommendation system.<br>

**Populating the Utility Matrix**
The utility matrix can be populated explicitly or implicitly:
- **Explicitly**: Requires users to rate items, but this approach is often ineffective due to low participation or user resistance. Incentives, such as financial rewards, may be necessary to encourage users to leave reviews.
- **Implicitly**: Infers user preferences based on their behavior on the platform. However, assigning ratings to new items remains a challenge.

**Challenges with Extrapolating Utilities**
- **Sparsity of the Matrix**: Not all users provide ratings for every item.
- **New Items**: New objects may not have any assigned ratings.
- **New Users**: Newly registered users may lack a rating history.

<h2>Content-Based vs Collaborative Filtering</h2>

**Content-Based**: Recommends new items based on the content of items the user has interacted with.<br>
**Collaborative Filtering**: Suggests new items based on the preferences of other users with similar tastes.

<h2>Content-Based</h2>

A content-based recommendation system suggests new items to users based on the characteristics of items they have previously interacted with. This approach is commonly used for recommending movies, websites, blogs, and news articles. Each item is profiled by extracting features, manually or automatically, such as author, title, keywords, or image tags. A user who likes certain items is profiled from those same features, and the system recommends the available items that best match the profile.

**Feature Extraction**
- **Manual**: Through APIs to obtain specific information.
- **Automatic**: Using techniques like TF-IDF to extract important words or image tags.

**Similarity Calculation**
Movies, for example, are profiled using fixed-length feature vectors, whose components can be binary (0 or 1) or numeric, such as ratings from 0 to 5. Cosine similarity between feature vectors is used to calculate the similarity between items. When binary and numeric components are mixed, the numeric ones are typically multiplied by a scaling factor alpha, and the choice of alpha affects the similarity measure.
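
The effect of the scaling factor can be seen in a small sketch. It assumes each item vector is a list of binary tags followed by one numeric component (an average rating) multiplied by alpha. The two movies, their tags, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def item_vector(binary_features, avg_rating, alpha):
    """Binary tags followed by the average rating scaled by alpha."""
    return list(binary_features) + [alpha * avg_rating]


# Two hypothetical movies that share one of three binary tags
movie_a = ([1, 1, 0], 3.0)
movie_b = ([0, 1, 1], 4.0)

for alpha in (0.0, 1.0, 2.0):
    sim = cosine_similarity(item_vector(*movie_a, alpha),
                            item_vector(*movie_b, alpha))
    print(f"alpha={alpha}: similarity={sim:.3f}")
```

With alpha set to 0 only the binary tags count. As alpha grows, the numeric component dominates and the two vectors look increasingly alike, regardless of their tags.
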

**User Profiling**
A user is profiled as an aggregate of the items they have shown interest in. The user vector uses the same features as the items and is obtained by averaging the feature vectors of the items the user has interacted with. An item whose vector has a high similarity to the user's vector is recommended.

Pros:
- No need to know about other users to make recommendations.
- Performs well even for users with very specific tastes.
- Can recommend new items even if they haven't been rated by other users.

Cons:
- Manually finding the right features for items is difficult.
- Profiling a new user without prior information is problematic.
- Tends to specialize in the user's existing tastes.

<h2>Collaborative Filtering</h2>

Seeks similar users to recommend items that have been liked by similar users.

<h4>User-User</h4>

In User-User Collaborative Filtering, the key idea is that similar users have similar preferences. Considering a user X, a set of N users with similar tastes to X is identified. Jaccard similarity, which returns a value between 0 and 1, is used to evaluate user similarity.

$$
\text{Jaccard Similarity: } J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
$$
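
The formula above translates directly into set operations. This is a minimal sketch in which each user is represented by a hypothetical set of liked song IDs.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical sets of songs liked by two users
liked_by_x = {"s1", "s2", "s3"}
liked_by_y = {"s2", "s3", "s4", "s5"}

# 2 shared songs out of 5 distinct songs
print(jaccard_similarity(liked_by_x, liked_by_y))
```

Note that Jaccard similarity only looks at which items were rated, not at the rating values themselves.
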

Each user is profiled as an aggregate of the items they are interested in, resulting in a **set** for each user. Similarity measures between all pairs of sets (and thus between users) are evaluated using either **Jaccard similarity** or **cosine similarity**. However, using these metrics to evaluate set similarity can yield insignificant results, especially when a user assigns the same rating to all items.

$$r_x = \{1, 0, 0, 1, 3\}$$

**Similar Users with Overlapping-Item Mean Centering**
To address the issues mentioned above, the mean of each user's ratings is calculated, and this mean is subtracted from each rating in the matrix. This increases the disparity between cosine similarities obtained, nullifying values for users who assigned the same rating to all items.

The formula for calculating cosine similarity between users with mean centering becomes:

$$
\text{sim}(x,y) \;=\; 
\frac{\displaystyle \sum_{s \in S_{xy}} \bigl(r_{xs} - \bar{r}_x\bigr)\,\bigl(r_{ys} - \bar{r}_y\bigr)}
{\sqrt{\displaystyle \sum_{s \in S_{xy}} \bigl(r_{xs} - \bar{r}_x\bigr)^2}
 \;\sqrt{\displaystyle \sum_{s \in S_{xy}} \bigl(r_{ys} - \bar{r}_y\bigr)^2}}
$$

Where:  

- **$S_{xy}$**: Set of items rated by both users $x$ and $y$
- **$r_{xs}$**: Rating of user $x$ for item $s$
- **$r_{ys}$**: Rating of user $y$ for item $s$
- **$\bar{r}_x$**: Mean rating of user $x$
- **$\bar{r}_y$**: Mean rating of user $y$

This approach favors recommendations based only on items rated by both users, improving the relevance of recommendations.
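
A minimal sketch of the mean-centered similarity follows. It assumes each user's ratings are stored as a dictionary from item ID to rating, and that each user's mean is taken over all of that user's ratings (some variants use only the co-rated items). The function name and the toy ratings are hypothetical.

```python
import math


def centered_cosine(ratings_x, ratings_y):
    """Mean-centered cosine similarity over the items both users rated."""
    shared = set(ratings_x) & set(ratings_y)  # S_xy
    if not shared:
        return 0.0
    # Assumption: each mean is computed over all ratings of that user
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    num = sum((ratings_x[s] - mean_x) * (ratings_y[s] - mean_y) for s in shared)
    den_x = math.sqrt(sum((ratings_x[s] - mean_x) ** 2 for s in shared))
    den_y = math.sqrt(sum((ratings_y[s] - mean_y) ** 2 for s in shared))
    if den_x == 0 or den_y == 0:
        return 0.0  # e.g. a user who gave every item the same rating
    return num / (den_x * den_y)


# Toy ratings (hypothetical)
x = {"s1": 5, "s2": 1, "s3": 3}
y = {"s1": 4, "s2": 2, "s4": 3}
flat = {"s1": 3, "s2": 3, "s3": 3}  # same rating for every item

print(centered_cosine(x, y))
print(centered_cosine(x, flat))
```

Users `x` and `y` agree on which of the shared songs is better, so their similarity is high. The `flat` user carries no preference information after centering, so the similarity is set to zero.
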

<h4>Item-Item</h4>

In Item-Item Collaborative Filtering, the goal is to predict the score for an item I by finding similar items and evaluating the ratings users have assigned to those similar items. The approach is based on the concept that similar items receive similar ratings from users.

$$
r_{xi} \;=\; \frac{\displaystyle \sum_{j \in N(i, x)} s_{ij}\,r_{xj}}
{\displaystyle \sum_{j \in N(i, x)} s_{ij}}
$$

Where:
- **$s_{ij}$**: Similarity between items $i$ and $j$
- **$r_{xj}$**: Rating of user $x$ for item $j$
- **$N(i, x)$**: Set of items rated by user $x$ that are most similar to item $i$

The procedure is similar to User-User Collaborative Filtering. The neighborhood of items (N) is calculated, the mean is subtracted from all elements of each item vector, and similarity values between all rows are computed. Only the k most similar items to the target item are considered.

The average of the scores that user X has given to the k items is calculated to predict the score that user X will give to item i. A weighted average is preferable, where the weight is determined by the similarity measure between the j-th item and the target item i.
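
The weighted average can be sketched as follows. The sketch assumes the item-item similarities to the target item have already been computed, and it keeps only neighbours with positive similarity (an assumption, so that the weights cannot cancel out). The names and values are hypothetical.

```python
def predict_rating(user_ratings, similarities, k=2):
    """Predict a user's rating for a target item.

    user_ratings: item id -> rating given by the user (r_xj)
    similarities: item id -> similarity to the target item (s_ij)
    Returns None when no usable neighbour exists.
    """
    # N(i, x): the k rated items most similar to the target item
    rated = [j for j in user_ratings if similarities.get(j, 0) > 0]
    neighbours = sorted(rated, key=lambda j: similarities[j], reverse=True)[:k]
    total_sim = sum(similarities[j] for j in neighbours)
    if total_sim == 0:
        return None
    weighted = sum(similarities[j] * user_ratings[j] for j in neighbours)
    return weighted / total_sim


# Toy values (hypothetical): the user rated three items
user_ratings = {"s1": 4, "s2": 2, "s3": 5}
similarities = {"s1": 0.8, "s2": 0.1, "s3": 0.4}

print(predict_rating(user_ratings, similarities, k=2))
```

With `k=2` only `s1` and `s3` are used, so the prediction is (0.8·4 + 0.4·5) / (0.8 + 0.4), which is pulled toward the rating of the more similar item.
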

Pros:
- Works well with various types of items.

Cons:
- The matrix is sparse.
- Cold start problems, especially when the matrix is empty.
- Difficulty recommending items to users with very specific tastes (popularity bias).


<section id="introduction">
    <h2>2. Introduction</h2>
</section>

<p>
The recommendation system being built is a content-based system. This type of recommendation system suggests items similar to those that a user has enjoyed in the past. In this specific case, the "similarities" between songs are determined by their characteristics, such as:
<ul>
  <li><b>id</b>: The unique ID of the song.</li>
  <li><b>name</b>: The name of the song.</li>
  <li><b>valence</b>: A measure from 0.0 to 1.0 describing the musical positivity conveyed by a song.</li>
  <li><b>year</b>: The year the song's album was released.</li>
  <li><b>acousticness</b>: A measure from 0.0 to 1.0 of how acoustic a song is.</li>
  <li><b>danceability</b>: A measure from 0.0 to 1.0 of how suitable a song is for dancing.</li>
  <li><b>duration_ms</b>: The duration of the song in milliseconds.</li>
  <li><b>energy</b>: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity.</li>
  <li><b>explicit</b>: A binary indicator showing whether a song contains explicit content.</li>
  <li><b>instrumentalness</b>: A measure from 0.0 to 1.0 of how vocal-free a song is.</li>
  <li><b>key</b>: The key of the song.</li>
  <li><b>liveness</b>: A measure from 0.0 to 1.0 representing the presence of an audience in the recording.</li>
  <li><b>loudness</b>: The overall loudness of a song in decibels.</li>
  <li><b>mode</b>: The tonal mode of the song.</li>
  <li><b>popularity</b>: The popularity of the song.</li>
  <li><b>speechiness</b>: A measure from 0.0 to 1.0 indicating the presence of spoken words in a song.</li>
  <li><b>tempo</b>: The overall tempo of the song in beats per minute (BPM).</li>
</ul>
</p>
<p>
Rather than relying on an existing dataset, one was created using Spotify's REST APIs. This will be discussed in more detail in the dedicated <a href="#data">data section</a>. 
The recommendation system will be built in several phases. First, a Spotify App will be created, which will be used for authentication via OAuth. Once logged into a personal Spotify account, the app can access the user data covered by the granted scopes.
</p>
<h3>Why Clustering?</h3>
<p>
Evaluating the recommendation system is a challenge: "true labels" are required as a benchmark for the predictions made by the system. Since no such labels were available, they were created using a clustering algorithm that groups songs with similar audio features into a single cluster. Each song was then labeled with the label of its corresponding cluster.
A content-based system like this provides recommendations by analyzing the structure of items and tends to perform well, especially when user preferences are very specific. Given this strength of the system, the following assumption is made: all songs in a cluster reflect the tastes of a specific user.
This strong assumption allows not only the acquisition of real labels (i.e., the user who likes the song) but also enables testing the system on users with specific tastes. This will be further discussed in the <a href='#results'>results section</a>.
</p>
<h3>PCA</h3>
<p>
It was decided to perform PCA before clustering. Principal Component Analysis (PCA) is a statistical technique widely used in data analysis and machine learning. It reduces the dimensionality of a multivariate dataset by projecting it onto the directions of greatest variance, simplifying the representation while retaining as much of the variance as possible.
To represent items (songs) in a two-dimensional space, the first two principal components are selected, which allows the data to be projected onto a plane. How much information is preserved depends on the share of variance these two components explain.
</p>
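
How much information a two-component projection keeps can be checked from the explained variance. The sketch below is not the project's code: it implements PCA with NumPy's SVD on synthetic data standing in for the audio features, and the function name is hypothetical.

```python
import numpy as np


def pca_explained_variance(X, n_components=2):
    """Project standardised data onto its top principal components.

    Returns the projected data and the fraction of total variance
    explained by each kept component. Assumes no column is constant.
    """
    X = np.asarray(X, dtype=float)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardised data: the rows of Vt are the principal axes
    _, S, Vt = np.linalg.svd(X_std, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    projected = X_std @ Vt[:n_components].T
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic stand-in for audio features: 200 "songs", 5 features,
# the first two of which are strongly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 3)),
])

projected, ratio = pca_explained_variance(X)
print(projected.shape)
print(f"Variance explained by 2 components: {ratio.sum():.2%}")
```

If the two components explain only a small share of the variance, the 2-D plot is still useful for visualisation but should not be read as a faithful picture of distances between songs.
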
<h3>How the Recommendation System Works</h3>
<p>
The input to the model consists of a list of songs considered "liked" or appreciated by a specific user. This list of songs represents the user's musical preferences and serves as an aggregated profile of their tastes.
The recommendation system uses this list of songs as a starting point to identify other similar songs that the user might enjoy. By analyzing the musical characteristics of the songs in the user-provided list, the system creates an aggregated profile representing the user's preferences. This aggregated profile is then used to find songs in the broader catalog that are similar to the songs liked by the user.
This comparison is performed using cosine distance, a dissimilarity measure based on the angle between two vectors (one minus their cosine similarity). Songs in the dataset are ranked based on their distance from the aggregated profile of the songs in the provided list. The most similar songs, i.e., those with the smallest distance, are selected as recommendations.
Finally, the recommendations are returned to the user, providing a list of songs that are similar to those in the initial list. If the number of unique recommendations is less than the desired amount, additional songs are added until the required quantity is reached.
</p>
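
The steps described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the project's implementation: the catalogue is a small made-up feature matrix, the user profile is the plain mean of the liked songs' vectors, and the step that tops up the list when there are too few unique recommendations is omitted. The `recommend` function is hypothetical.

```python
import numpy as np


def recommend(features, liked_idx, n=3):
    """Return indices of the n songs closest to the mean of the liked songs."""
    features = np.asarray(features, dtype=float)
    # Aggregated user profile: average feature vector of the liked songs
    profile = features[liked_idx].mean(axis=0)
    # Cosine distance = 1 - cosine similarity
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    cos_dist = 1 - (features @ profile) / np.where(norms == 0, 1, norms)
    # Never recommend songs the user already provided
    cos_dist[list(liked_idx)] = np.inf
    return np.argsort(cos_dist)[:n].tolist()


# Toy catalogue (hypothetical): 5 songs described by 2 features
catalogue = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.85, 0.15],
    [0.20, 0.80],
]

# The user likes songs 0 and 1
print(recommend(catalogue, liked_idx=[0, 1], n=2))
```

On real data the features would normally be standardised first, since columns such as `duration_ms` or `tempo` are on much larger scales than the 0.0 to 1.0 measures and would otherwise dominate the angle between vectors.
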
<h2>Why Is a System Like This Interesting?</h2>
<p>
A recommendation system like the one described plays a fundamental role in various contexts for several reasons. First, it offers a level of user personalization that goes beyond simply providing a wide range of content. The ability to suggest songs based on individual user preferences enhances engagement and improves the listening experience.
A key aspect of this personalization is the discovery of new content. Thanks to recommendation systems, users can be exposed to songs and artists they might never have encountered otherwise. This not only enriches their musical experience but also promotes diversity and exploration of musical genres that might otherwise be overlooked.
</p>

<section id="data">
    <h2>3. Data</h2>
</section>

<p>
During the dataset preparation phase for the Spotify recommendation system project, various online data sources were explored. Many ready-to-use and reliable datasets were available, but it was decided to build a dataset from scratch rather than rely on the work of others, accepting the additional effort this requires.
To create the Spotify track dataset, several playlists on Spotify were subscribed to. 
To increase the <b>heterogeneity</b> of the available data, playlists containing tracks from different decades and with generally varied characteristics were selected. This reduces the risk that the observed data are overly specific to a single type of track and bias the predictions of the recommendation system, so that the system can provide accurate and diverse suggestions to the user.
Subsequently, all tracks within the playlists were collected using the Spotify APIs.
For each playlist (approximately 150) that the user added to their favorites, all tracks were extracted, keeping the ‘id’ and ‘name’ fields, which represent the identifier and title of the track. The Spotify APIs were then queried again, track by track, to retrieve the features that are not returned with the playlists themselves.
The concatenation of all tracks extracted in this manner resulted in the construction of the dataset “data.csv”, which can be found in the project directory “datasets/data.csv”. The decision to build the dataset from scratch unfortunately limits its size; the collected data amount to approximately 24k records.
</p>

<section id="proposed-method">
    <h2>4. Proposed Method</h2>
</section>

<div style="display: flex; align-items: center;">
    <img src="static/Screen/Screen_2.png">
</div>

- Once the application is created, it will be accessible from the Dashboard.

<div style="display: flex; align-items: center;">
    <img src="static/Screen/Screen_3.png">
</div>

- After creating the app, you will have access to the app credentials. These will be required for API authorization to obtain an access token.

<div style="display: flex; align-items: center;">
    <img src="static/Screen/Screen_4.png">
</div>

- On the app settings page, add a <strong>Redirect URI</strong>. This is the URL to which Spotify will redirect the user after authentication.<br>
You can use http://localhost:5000/callback for development purposes.

<div style="display: flex; align-items: center;">
    <img src="static/Screen/Screen_5.png">
</div>

<h3>Dataset Construction</h3>
<p>

Spotipy, a Python client for the Spotify Web API, was used to facilitate data retrieval and querying of Spotify's catalog for songs. As described at the beginning of the project, the Spotify app has already been created. The next step will be configuring Spotipy with the client ID and secret key of the app created on the Spotify Developer page.

The Spotipy module is installed as follows:
<code>pip install spotipy</code>
</p>

In [12]:
import os
import warnings

import spotipy
from dotenv import load_dotenv
from spotipy.oauth2 import SpotifyOAuth

# Ignore warnings to keep the output clean
warnings.filterwarnings("ignore")

# Load environment variables from the .env file
load_dotenv()

# Retrieve credentials from the environment rather than hard-coding them
client_id = os.getenv('SPOTIPY_CLIENT_ID')  # Spotify Client ID
client_secret = os.getenv('SPOTIPY_CLIENT_SECRET')  # Spotify Client Secret
redirect_uri = os.getenv('SPOTIPY_REDIRECT_URI')  # Redirect URI for OAuth flow

# Fail early if a variable is missing; never print the client ID or secret
if not all([client_id, client_secret, redirect_uri]):
    raise RuntimeError("Missing Spotify credentials: check the .env file")
print(f'Redirect URI: {redirect_uri}')

# Define the scope of permissions required for the application
scope = 'playlist-read-private'

# Initialize the Spotify object with OAuth authentication
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=client_id,  # Use the Client ID from the environment variable
    client_secret=client_secret,  # Use the Client Secret from the environment variable
    redirect_uri=redirect_uri,  # Use the Redirect URI from the environment variable
    scope=scope  # Specify the required permissions
))

Redirect URI: http://localhost:5000/callback


<p>
Support functions were written to keep the code clean, modular, and reusable. These functions were designed to:
<ul>
    <li>extract a URL linking to the track;</li>
    <li>make the link clickable;</li>
    <li>display the album cover image;</li>
    <li>show the duration in minutes instead of milliseconds;</li>
</ul>
</p>
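
The support functions themselves are not shown here. The sketch below is one possible reconstruction, inferred only from how `extract_spotify_url`, `make_clickable`, and `extract_href` are called in the next cell; the bodies are assumptions, and `ms_to_minutes` is a hypothetical name for the duration helper.

```python
import re


def extract_spotify_url(external_urls):
    """Return the 'spotify' entry of an external_urls dict, or '' if absent."""
    return external_urls.get('spotify', '')


def make_clickable(url):
    """Wrap a URL in an HTML anchor so it renders as a clickable link."""
    return f'<a href="{url}" target="_blank">{url}</a>'


def extract_href(html):
    """Recover the URL from the href attribute of an anchor tag."""
    match = re.search(r'href="([^"]*)"', html)
    return match.group(1) if match else ''


def ms_to_minutes(duration_ms):
    """Format a duration in milliseconds as m:ss."""
    minutes, seconds = divmod(int(duration_ms) // 1000, 60)
    return f"{minutes}:{seconds:02d}"


url = extract_spotify_url({'spotify': 'https://open.spotify.com/playlist/abc'})
link = make_clickable(url)
print(link)
print(extract_href(link))
print(ms_to_minutes(215000))
```
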

In [None]:
from IPython.display import display, HTML

# Each call returns a single page of the user's playlists
playlists = sp.current_user_playlists()

for idx, playlist in enumerate(playlists['items']):
    print(f"{idx+1}) Playlist name: {playlist['name']}")
    # Show the cover image when the playlist has one
    if playlist['images']:
        display(HTML(f'<img src="{playlist["images"][0]["url"]}" alt="Cover Image" style="max-height: 120px; max-width: 120px;">'))
    # Build a clickable link, then recover the plain URL for printing
    spotify_url_html = make_clickable(extract_spotify_url(playlist['external_urls']))
    print(f"Spotify URL: {extract_href(spotify_url_html)}")
    print(f"No. Tracks: {playlist['tracks']['total']}")
    print(f"Owner: {playlist['owner']['display_name']}")
    print("\n")

In [None]:
from typing import Dict, List

def get_song_features(sp: spotipy.Spotify, song_ids: List[str]) -> List[Dict]:
    """Fetch track metadata and audio features for each song ID."""
    features_list = []
    for song_id in song_ids:
        track_info = sp.track(song_id)
        audio_features = sp.audio_features([song_id])[0]
        if audio_features is None:
            # No audio features are available for this track: skip it
            continue

        # Key order defines the column order of the resulting dataset
        features = {
            'id': song_id,
            'name': track_info['name'],
            'valence': audio_features['valence'],
            'year': int(track_info['album']['release_date'][:4]),
            'acousticness': audio_features['acousticness'],
            'danceability': audio_features['danceability'],
            'duration_ms': audio_features['duration_ms'],
            'energy': audio_features['energy'],
            'explicit': int(track_info['explicit']),
            'instrumentalness': audio_features['instrumentalness'],
            'key': audio_features['key'],
            'liveness': audio_features['liveness'],
            'loudness': audio_features['loudness'],
            'mode': audio_features['mode'],
            'popularity': track_info['popularity'],
            'speechiness': audio_features['speechiness'],
            'tempo': audio_features['tempo'],
        }
        features_list.append(features)
    return features_list

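In [None]:
songs = []
playlists = sp.current_user_playlists()
# Follow the 'next' links so that every page of playlists and tracks is read
while playlists:
    for playlist in playlists['items']:
        results = sp.playlist(playlist['id'], fields="tracks,next")
        tracks = results['tracks']
        while tracks:
            for item in tracks['items']:
                track = item['track']
                # Skip removed tracks and local files, which have no Spotify ID
                if track and track['id']:
                    songs.append({'Title': track['name'], 'ID': track['id']})
            tracks = sp.next(tracks)  # None when there are no more pages
    playlists = sp.next(playlists)

for idx, song in enumerate(songs):
    print(f"{idx}) ID: {song['ID']}, Title: {song['Title']}")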

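In [None]:
import pandas as pd

song_ids = [song['ID'] for song in songs]
songs = get_song_features(sp, song_ids)

df = pd.DataFrame(songs)
# Append to the CSV without a header row; columns follow the order set in get_song_features
df.to_csv('datasets/songs.csv', mode='a', header=False, index=False)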

<h2>Analysis of obtained dataset</h2>

In [15]:
import pandas as pd

data = pd.read_csv("datasets/data.csv")
data.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24722 entries, 0 to 24721
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                24722 non-null  object 
 1   name              24722 non-null  object 
 2   valence           24722 non-null  float64
 3   year              24722 non-null  int64  
 4   acousticness      24722 non-null  float64
 5   danceability      24722 non-null  float64
 6   duration_ms       24722 non-null  float64
 7   energy            24722 non-null  float64
 8   explicit          24722 non-null  float64
 9   instrumentalness  24722 non-null  float64
 10  key               24722 non-null  float64
 11  liveness          24722 non-null  float64
 12  loudness          24722 non-null  float64
 13  mode              24722 non-null  float64
 14  popularity        24722 non-null  float64
 15  speechiness       24722 non-null  float64
 16  tempo             24722 non-null  float64

In [16]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv('datasets/data.csv')

# Numeric audio features used for scaling, PCA, and clustering
numeric_columns = ['valence', 'year', 'acousticness', 'danceability', 'duration_ms', 'energy', 'explicit', 'instrumentalness', 'key', 'liveness', 'loudness', 'popularity', 'speechiness', 'tempo']
numeric_data = data[numeric_columns]

# Standardise the data so that every feature has zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Project onto the first two principal components (used for the 2-D plot)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Cluster the standardised features into k groups, one per simulated user
k = 10
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(scaled_data)

data['user'] = kmeans.labels_ # Assign the cluster's label to the record

# Plot the songs in PCA space, coloured by cluster
plt.figure(figsize=(10, 6))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=data['user'], cmap='viridis')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('Clustering Visualization')
plt.colorbar(label='User')
plt.show()


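In [None]:
# Replace the numeric cluster labels with readable user names (user_1 ... user_10)
data['user'] = [f'user_{i+1}' for i in kmeans.labels_]

def custom_sort(user_label):
    """Return the numeric part of a label such as 'user_3'."""
    return int(user_label.split('_')[1])

# Sort by user number; sorting the labels as text would put user_10 before user_2
data['user_numeric'] = data['user'].apply(custom_sort)
data = data.sort_values(by='user_numeric').drop(columns='user_numeric').reset_index(drop=True)

# index=False avoids writing the row index as an extra unnamed column
data.to_csv('datasets/data.csv', index=False)
data.sample(5)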

<h3>Calculation of the Corresponding Decade</h3>
<p>
A function named <strong>get_decade</strong> has been implemented, which calculates the start of the corresponding decade based on the provided year. The result was assigned to a new column named 'decade' in the dataset.
</p>
<p>
Subsequently, a bar chart (countplot) was created using the Seaborn library to visually represent the count of tracks in each decade. This visualization allows for a clear identification of the data distribution across different decades, providing a snapshot of the relative frequencies of decades in the dataset.
</p>

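In [None]:
import seaborn as sns

def get_decade(year):
    """Return the decade label for a year, e.g. 1987 -> '1980s'."""
    period_start = (int(year) // 10) * 10
    return f'{period_start}s'

data['decade'] = data['year'].apply(get_decade)

sns.set_theme(rc={'figure.figsize': (10, 6)})
# Pass the column by name and order the bars chronologically
sns.countplot(data=data, x='decade', order=sorted(data['decade'].unique()));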