# **Music Recommendation System**

## **Problem Definition**

### **The Context:**

The rapid advancement of technology has greatly streamlined our daily activities, yet it has also made our lives busier and more distracted. This shift has affected our interaction with art and entertainment, making it challenging to discover and engage with new content. In this digital age, the ability to quickly find content that resonates with our personal tastes is more valuable than ever.

Internet-based platforms, particularly in the entertainment sector, thrive on the time users spend engaged with their content. A prime example is Spotify, a leading audio streaming service with a vast global presence. Spotify has mastered the art of music discovery through its advanced recommendation systems, which utilize a vast database of user preferences to suggest the 'next best song.' This approach not only enhances user experience by making music discovery seamless but also contributes to the platform's success by encouraging longer engagement times.

In the face of an ever-growing online music library, the challenge for Spotify and similar platforms is to efficiently guide users to content that matches their tastes, thus ensuring sustained engagement and satisfaction. The development and refinement of recommendation algorithms are central to this effort, highlighting the critical role of data science in shaping the future of digital entertainment.



### **The objective:**

The main aim is to leverage the extensive collection of music available to improve the user's experience by seamlessly directing them to new songs that match their individual tastes. This objective seeks not only to boost user engagement and satisfaction but also to encourage users to spend more time on the platform, thereby strengthening the bond between listeners and the music they enjoy.


### **The key questions:**

- How does the number of song plays correlate with user engagement, and at what point does a song become a significant part of a user's playlist?
- Can we identify specific trends in song preferences across different generations, and how do these preferences influence the discovery of new music?
- When users prefer a cover version of a song, does this indicate a preference for the artist's style over the original song's composition, and how does this preference affect song recommendations?
- What is the likelihood that enjoying one song from an album indicates a predisposition to like other songs from the same album, and how can we leverage this to enhance music discovery?
- Given our objective to maximize user session length and satisfaction, how do we balance the importance of recall and precision in our recommendation system to optimize the user experience?



### **The problem formulation:**

As Data Science professionals working on our music streaming platform, we are tasked with leveraging advanced analytics to significantly enhance the user experience, encouraging them to spend more time engaged with our service. Our primary objective is to craft a state-of-the-art recommendation system. This system is designed to offer personalized and highly relevant music suggestions, drawing from a deep analysis of user listening habits, musical preferences, and interactions within the platform.

To tackle this challenge from various perspectives, we will employ a diverse set of sophisticated recommendation models, including similarity-based collaborative filtering, matrix factorization techniques, clustering-based systems, and content-based recommendations. This multifaceted approach ensures that every user receives tailor-made recommendations that not only align with their existing musical tastes but also help them discover new songs and artists that they're likely to enjoy.

Given the company's ambition to maximize the time users spend on the platform, prioritizing recall becomes essential. By focusing on recall, we ensure the recommendation system covers a broad spectrum of relevant musical options, thus enhancing the potential for user discovery and sustained engagement with the platform. Nonetheless, it's crucial to strike a careful balance with precision to maintain the relevance of these recommendations. Achieving this equilibrium is key to delivering an optimal user experience that fosters user retention while boosting satisfaction and loyalty to our platform.
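One common way to express a recall-leaning balance between the two metrics is the F-beta score, where a beta above 1 weights recall more heavily than precision. The sketch below is illustrative only: the `f_beta` helper, the choice of beta = 2, and the two example models are assumptions, not part of the analysis in this notebook.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Two hypothetical models with the same F1 but opposite strengths.
model_a = f_beta(precision=0.6, recall=0.4)  # precision-leaning
model_b = f_beta(precision=0.4, recall=0.6)  # recall-leaning
print(round(model_a, 3), round(model_b, 3))
```

With beta = 2 the recall-leaning model scores higher, whereas with beta = 1 (the usual F1) the two would tie.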

# **Final Submission**


This notebook constitutes the final submission and primarily focuses on highlighting the key conclusions derived from the project. It is conceived as the definitive proposal aimed at business leaders and decision-makers, providing tangible and actionable solutions. The report is structured into three fundamental parts: an Executive Summary, summarizing the most significant findings from the analysis; a Problem and Solution Summary, detailing the identified problem and justifying the proposed solution design and its potential impact; and Recommendations for Implementation, offering key guidelines and expected benefits, as well as considerations on costs, risks, and challenges.

Additionally, the document includes a section dedicated to exploring improvements to the matrix factorization-based collaborative filtering model and the content-based model. Although these investigations did not yield significant improvements, they have been kept in the notebook to document the effort invested and to provide a comprehensive view of the work. This reflects our commitment to transparency and continuous learning, even when the results do not meet initial expectations.

## **Exploration of Model Enhancements from the Milestone Submission**


This section is dedicated to a detailed exploration of two advanced approaches in recommendation systems: a hybrid model and a neural network-based model utilizing SVD. This part of the notebook starts from the collaborative filtering model based on matrix factorization and the content-based model presented in the previous milestone. These models remain unchanged, serving as a reference and foundation for the new architectures to be explored. The section provides a comprehensive framework to understand how these new models can build upon the previously established foundations and how they might offer potential improvements in the personalization and accuracy of song recommendations.

### Importing Libraries and the Dataset

In [None]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd


# Import Matplotlib, the basic library for data visualization
import matplotlib.pyplot as plt

# Import seaborn - Slightly advanced library for data visualization
import seaborn as sns

# Import the required library to compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# Import defaultdict from collections: a dictionary that returns a default value instead of raising a KeyError
from collections import defaultdict

# Import mean_squared_error: a performance metric in sklearn
from sklearn.metrics import mean_squared_error


### Load the dataset

In [None]:
# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/Curso_MIT/Capstone Project/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/Curso_MIT/Capstone Project/song_data.csv')

In [None]:
# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
df = count_df.merge(song_df.drop_duplicates(), how = "left", on = "song_id")
# Drop the column 'Unnamed: 0'
df = df.drop('Unnamed: 0', axis = 1)
## Name the obtained dataframe as "df"
df

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999
...,...,...,...,...,...,...,...
2054529,d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92,SOJEYPO12AAA8C6B0E,2,Ignorance (Album Version),Ignorance,Paramore,0
2054530,d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92,SOJJYDE12AF729FC16,4,Two Is Better Than One,Love Drunk,Boys Like Girls featuring Taylor Swift,2009
2054531,d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92,SOJKQSF12A6D4F5EE9,3,What I've Done (Album Version),What I've Done,Linkin Park,2007
2054532,d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92,SOJUXGA12AC961885C,1,Up,My Worlds,Justin Bieber,2010


With roughly 2 million rows and 7 columns, the dataset is large enough that processing it demands substantial computing resources, which lengthens processing times and makes it harder to train and evaluate models efficiently.
To address this, we trim the dataset to a more manageable size by keeping only users and songs with enough interactions.

In [None]:
# Get the column containing the users
users = df.user_id
# Create a dictionary from users to their number of songs
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

In [None]:
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
df = df.loc[~df.user_id.isin(remove_users)]

In [None]:
# Get the column containing the songs
songs = df.song_id
# Create a dictionary from songs to their number of users
ratings_count = dict()
for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1

In [None]:
# We want each song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120
remove_songs = []
for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)
df_final= df.loc[~df.song_id.isin(remove_songs)]
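The two counting loops and cutoff filters above can also be expressed with `value_counts`. The following is a minimal vectorized sketch on a tiny made-up frame; the helper name `filter_by_min_count` and the toy cutoffs are assumptions for illustration.

```python
import pandas as pd


def filter_by_min_count(frame: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep rows whose value in `column` appears at least `min_count` times."""
    counts = frame[column].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame[column].isin(keep)]


# Toy data: user "u1" has 3 rows and "u2" has 1; song "s1" appears for 2 users.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s3", "s1"],
})

# Same order as the notebook: filter users first, then songs on the result.
by_user = filter_by_min_count(toy, "user_id", min_count=2)
by_song = filter_by_min_count(by_user, "song_id", min_count=1)
print(by_song)
```

In this notebook, the equivalent calls would be `filter_by_min_count(df, "user_id", 90)` followed by `filter_by_min_count(..., "song_id", 120)` on the result.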

In [None]:
# Drop records with play_count more than(>) 5
df_final=df_final[df_final.play_count<=5]

In [None]:
# Check the shape of the data
df_final.shape

(138301, 7)

### Let's check for duplicate observations

In [None]:
# Group the dataframe by 'user_id' and 'song_id', counting the occurrences of each combination
df_final_agrupado = df_final.groupby(['user_id', 'song_id']).count()
# Identify songs for which at least one (user_id, song_id) pair appears more than once
songs_duplicates = df_final_agrupado[df_final_agrupado['title'] > 1].reset_index()['song_id'].value_counts().index.tolist()
len(songs_duplicates)

57

We have obtained 57 songs with repeated (user_id, song_id) pairs. Next, we check whether these songs differ in their titles. The goal is to remove duplicates based on song_id, user_id, and title, which requires that these 57 songs map to no more than 57 distinct titles.


In [None]:
# Display the titles of the duplicate songs
title_duplicate_songs = df_final[df_final['song_id'].isin(songs_duplicates)].groupby(['song_id', 'user_id', 'title']).count().reset_index()['title'].unique()
print("The number of titles different from duplicate songs:", len(title_duplicate_songs))
title_duplicate_songs

The number of titles different from duplicate songs: 68


array(["Adam's Song", 'Piggy', 'Lemme Get That', 'Seven Nation Army',
       'Seven Nation Army (Album Version)', 'Invincible', 'So Lonely',
       'Fake Tales Of San Francisco',
       'Fake Tales Of San Francisco (Explicit)', 'Your Woman',
       'Somebody To Love', "Hips Don't Lie",
       "Hips Don't Lie (featuring Wyclef Jean)", '22', 'I Might Be Wrong',
       'Not Fair', 'Not Fair (Clean Radio Edit)', 'There_ There',
       'Crack A Bottle', "Road Trippin' (Album Version)",
       'Message In A Bottle', 'Do We Need This?', 'My Immortal',
       'My Immortal (Album Version)', 'Brianstorm',
       'Supermassive Black Hole (Album Version)',
       'Supermassive Black Hole (Twilight Soundtrack Version)',
       'Dance_ Dance', 'Did It Again',
       'Did It Again (featuring Kid Cudi)', 'Did it Again',
       'Too Far Gone', 'The Real Slim Shady', "Don't Stop The Music",
       'Always', 'Live And Let Die', 'The Trouble With Love Is',
       'Teddy Picker', 'Teddy Picker (Explicit)',

We get 68 distinct titles for 57 songs, that is, 11 extra title variants. We therefore clean the titles by removing any text in parentheses and converting them to lowercase.

In [None]:
import re
def normalize_titles(title):
    # Convert to lowercase
    title = title.lower()
    # Remove information in parentheses
    title = re.sub(r'\(.*?\)', '', title)
    # Collapse extra whitespace and trim the ends
    title = re.sub(r'\s+', ' ', title).strip()
    return title

df_final['title_normalized'] = df_final['title'].apply(normalize_titles)
title_duplicate_songs = df_final[df_final['song_id'].isin(songs_duplicates)].groupby(['song_id', 'user_id', 'title_normalized']).count().reset_index()['title_normalized'].unique()
print("The number of titles different from duplicate songs:", len(title_duplicate_songs))

The number of titles different from duplicate songs: 57


In [None]:
rows_to_delete = df_final[df_final.duplicated(subset= ['song_id', 'user_id', 'title'])].shape[0]
print("Number of rows to delete:", rows_to_delete)
df_final_no_duplicates = df_final.drop_duplicates(subset= ['song_id', 'user_id', 'title']).drop('title_normalized', axis = 1)
df_final_no_duplicates

Number of rows to delete: 7909


Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
206,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBDVAK12AC90759A2,1,Daisy And Prudence,Distillation,Erin McKeown,2000
208,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBIMTY12A6D4F931F,1,The Ballad of Michael Valentine,Sawdust,The Killers,2004
209,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBKRVG12A8C133269,1,I Stand Corrected (Album),Vampire Weekend,Vampire Weekend,2007
210,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBUBLL12A58A795A8,1,They Might Follow You,Tiny Vipers,Tiny Vipers,2007
211,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBVKFF12A8C137A79,1,Monkey Man,You Know I'm No Good,Amy Winehouse,2007
...,...,...,...,...,...,...,...
2054259,9fb0717a34c90c91ce09ab460969a8a428d3ac87,SOXNZOW12AB017F756,1,Half Of My Heart,Battle Studies,John Mayer,0
2054261,9fb0717a34c90c91ce09ab460969a8a428d3ac87,SOXQYSC12A6310E908,1,Bitter Sweet Symphony,Bitter Sweet Symphony,The Verve,1997
2054270,9fb0717a34c90c91ce09ab460969a8a428d3ac87,SOYDTIW12A67ADAFC9,2,The Police And The Private,Live It Out,Metric,2005
2054280,9fb0717a34c90c91ce09ab460969a8a428d3ac87,SOYQQAC12A6D4FD59E,1,Just Friends,Back To Black,Amy Winehouse,2006


We have removed 7,909 duplicate observations.
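Note that the cell above removes duplicates keyed on the raw `title`, so title variants of the same song for the same user remain as separate rows. If one row per normalized title were wanted instead, the key could use `title_normalized`. The sketch below shows the difference on made-up rows rather than the notebook's data; `normalize_title` mirrors `normalize_titles` above.

```python
import re

import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase, drop parenthesised text, and collapse whitespace."""
    title = re.sub(r"\(.*?\)", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "song_id": ["s1", "s1", "s1"],
    "title": ["My Immortal", "My Immortal (Album Version)", "My Immortal"],
})
toy["title_normalized"] = toy["title"].apply(normalize_title)

# Raw-title key, as in the cell above: the "(Album Version)" row survives.
by_title = toy.drop_duplicates(subset=["song_id", "user_id", "title"])

# Normalized-title key: all three rows collapse into one.
by_normalized = toy.drop_duplicates(subset=["song_id", "user_id", "title_normalized"])
print(len(by_title), len(by_normalized))
```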

In [None]:
df_final_no_duplicates.to_csv('df_final.csv')

### Preparatory Steps: Library Importation, Utility Function Creation, and Data Preparation

In [None]:
# Install the surprise package using pip. Uncomment and run the below code to do the same

!pip install surprise



In [None]:
# Import necessary libraries

# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

In [None]:
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)

    # Making predictions on the test data
    predictions=model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x : x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the predicted precisions are calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    accuracy.rmse(predictions)

    # Command to print the overall precision
    print('Precision: ', precision)

    # Command to print the overall recall
    print('Recall: ', recall)

    # Formula to compute the F-1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))
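To make the counting inside `precision_recall_at_k` concrete, the sketch below works through a single hypothetical user by hand. The five (estimated, true) pairs and the choice of `k = 3` are made up for illustration; the threshold of 1.5 matches the function's default.

```python
# (estimated, true) play counts for one hypothetical user.
user_ratings = [(2.4, 3.0), (1.9, 1.0), (1.6, 2.0), (1.2, 2.0), (0.8, 1.0)]
k, threshold = 3, 1.5

# Rank by estimated play count, highest first, and keep the top k.
user_ratings.sort(key=lambda pair: pair[0], reverse=True)
top_k = user_ratings[:k]

# Relevant items overall, recommended items in the top k, and their overlap.
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
n_rec_k = sum(est >= threshold for est, _ in top_k)
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall = n_rel_and_rec_k / n_rel if n_rel else 0
print(precision, recall)
```

Here three items are relevant, three are recommended in the top 3, and two are both, so precision and recall are each 2/3 for this user.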

In [None]:
# Instantiate the Reader with the expected rating scale (0, 5)
reader = Reader(rating_scale = (0, 5))

# Load the dataset, taking only "user_id", "song_id", and "play_count"
data = Dataset.load_from_df(df_final_no_duplicates[['user_id', 'song_id', 'play_count']], reader)

# Split the data into train and test sets with test_size = 0.4 and random_state = 42
trainset, testset = train_test_split(data, test_size = 0.4, random_state = 42)

In [None]:
# Split each user in test_df into two subsets: 60% for generation, 40% for evaluation.
def split_user_data(df, test_generation_frac=0.6):
    test_generation = pd.DataFrame()
    test_evaluation = pd.DataFrame()

    for user_id in df['user_id'].unique():
        user_data = df[df['user_id'] == user_id]
        split_idx = int(len(user_data) * test_generation_frac)

        user_test_generation = user_data.iloc[:split_idx]
        user_test_evaluation = user_data.iloc[split_idx:]

        test_generation = pd.concat([test_generation, user_test_generation])
        test_evaluation = pd.concat([test_evaluation, user_test_evaluation])

    return test_generation, test_evaluation

# Convert testset from Surprise to DataFrame
test_df = pd.DataFrame(testset, columns=['user_id', 'song_id', 'play_count'])

# Split the test dataset using the function split_user_data()
test_generation_df, test_evaluation_df = split_user_data(test_df)

test_generation = [(row['user_id'], row['song_id'], row['play_count']) for index, row in test_generation_df.iterrows()]
test_evaluation = [(row['user_id'], row['song_id'], row['play_count']) for index, row in test_evaluation_df.iterrows()]
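The per-user split above concatenates DataFrames inside a loop, which can be slow with many users. A possible vectorized alternative is sketched below; the function name `split_user_data_vectorized` and the toy data are assumptions. It applies the same positional cutoff per user, but keeps rows in the input order rather than grouping them by user.

```python
import pandas as pd


def split_user_data_vectorized(df: pd.DataFrame, test_generation_frac: float = 0.6):
    """Per-user positional split without concatenating inside a loop."""
    # Position of each row within its user: 0, 1, 2, ...
    position = df.groupby("user_id").cumcount()
    # Number of rows per user, broadcast back to every row.
    user_size = df.groupby("user_id")["user_id"].transform("size")
    # Same truncation as int(len(user_data) * test_generation_frac).
    cutoff = (user_size * test_generation_frac).astype(int)
    in_generation = position < cutoff
    return df[in_generation], df[~in_generation]


# Toy data: user "a" has 5 rows, user "b" has 2.
toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "a", "b", "b"],
    "song_id": ["s1", "s2", "s3", "s4", "s5", "s1", "s2"],
    "play_count": [1, 2, 1, 3, 1, 2, 1],
})
generation, evaluation = split_user_data_vectorized(toy)
print(len(generation), len(evaluation))
```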



### **Hybrid Model**


In line with the proposals outlined in the milestone submission, we will proceed to develop and evaluate a hybrid model that integrates the capabilities of optimized SVD with a content-based approach. This strategy aims to capitalize on the precision in rating prediction offered by SVD, while enriching the personalization of recommendations through detailed analysis of the textual characteristics of the songs. By merging these two methods, we aspire to create a more robust and adaptive recommendation system that can provide highly relevant and personalized recommendations to each user.

For the evaluation of this hybrid model, we have adopted a meticulous approach to data division. Initially, we divided our dataset into two: a training set and a test set, using 60% for training and 40% for testing. Subsequently, for a more detailed evaluation, we divided the test set into two subsets: one for generating recommendations (60% of the test data) and another for evaluating these recommendations (the remaining 40%). This strategy allows us not only to generate recommendations based on a significant portion of user behavior but also to evaluate the precision and relevance of these recommendations in a distinct set of user interactions, thus ensuring a robust and representative evaluation of the hybrid model's performance. The metrics of precision, recall, and F1 score, along with the RMSE analysis for rating estimations, will form the basis of our evaluation, allowing us to measure the model's effectiveness in providing recommendations that are both accurate and relevant to users.

#### Model Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a **personalized recommendation system**: the recommendations are based on the user's past behavior and do not depend on any additional information. We use **latent features** to find recommendations for each user.
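As a rough illustration of how latent features produce an estimate, Surprise's `SVD` predicts a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The NumPy sketch below uses made-up numbers (the mean, the biases, `n_factors = 4`, and the random factors are all assumptions), not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 4                      # hypothetical; far smaller than a real model

global_mean = 1.7                  # mu: average play count in the training set
user_bias, item_bias = 0.2, -0.1   # b_u and b_i
p_u = rng.normal(scale=0.1, size=n_factors)  # latent user factors
q_i = rng.normal(scale=0.1, size=n_factors)  # latent item factors

# Predicted play count: baseline terms plus the user-item interaction term.
raw_estimate = global_mean + user_bias + item_bias + q_i @ p_u

# Clip to the rating scale, as Surprise does by default when predicting.
estimate = float(np.clip(raw_estimate, 0, 5))
print(round(estimate, 3))
```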

In [None]:
# Building the optimized SVD model using optimal hyperparameters
svd_optimized = SVD(n_epochs = 30, lr_all = 0.01, reg_all = 0.1, random_state = 1)

# Training the algorithm on the train set
svd_optimized = svd_optimized.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score
precision_recall_at_k(svd_optimized)

RMSE: 1.0021
Precision:  0.421
Recall:  0.629
F_1 score:  0.504


#### Content Based Recommendation Systems

In [None]:
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
# Work on a copy so that df_final_no_duplicates is not modified in place
df_small = df_final_no_duplicates.copy()
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']

In [None]:
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]
# Drop the duplicates from the title column
df_small = df_small.drop_duplicates(subset = ['title'])
# Set the title column as the index
df_small = df_small.set_index('title')
# See the first 5 records of the df_small dataset
df_small.head()

Unnamed: 0_level_0,user_id,song_id,play_count,text
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Daisy And Prudence,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBDVAK12AC90759A2,1,Daisy And Prudence Distillation Erin McKeown
The Ballad of Michael Valentine,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBIMTY12A6D4F931F,1,The Ballad of Michael Valentine Sawdust The Ki...
I Stand Corrected (Album),17aa9f6dbdf753831da8f38c71b66b64373de613,SOBKRVG12A8C133269,1,I Stand Corrected (Album) Vampire Weekend Vamp...
They Might Follow You,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBUBLL12A58A795A8,1,They Might Follow You Tiny Vipers Tiny Vipers
Monkey Man,17aa9f6dbdf753831da8f38c71b66b64373de613,SOBVKFF12A8C137A79,1,Monkey Man You Know I'm No Good Amy Winehouse


In [None]:
# Create the series of indices from the data
title_series = pd.Series(df_small.index)
title_series

0                   Daisy And Prudence
1      The Ballad of Michael Valentine
2            I Stand Corrected (Album)
3                They Might Follow You
4                           Monkey Man
                    ...               
624                      The Last Song
625                         Invincible
626                      Paper Gangsta
627                          Starlight
628         Tangerine  (Album Version)
Name: title, Length: 629, dtype: object

In [None]:
# Importing necessary packages to work with text data
import nltk

# Download punkt library
nltk.download('punkt')

# Download stopwords library
nltk.download('stopwords')

# Download wordnet
nltk.download('wordnet')

# Import regular expression
import re

# Import word_tokenizer
from nltk import word_tokenize

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Import stopwords
from nltk.corpus import stopwords

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We will create a **function to pre-process the text data:**

In [None]:
# Create a function to tokenize the text
def tokenize(text):

    # Lowercase the text and replace non-alphabetical characters with spaces
    text = re.sub(r"[^a-zA-Z]"," ", text.lower())

    # Extracting each word in the text
    tokens = word_tokenize(text)

    # Removing stopwords
    words = [word for word in tokens if word not in stopwords.words("english")]

    # Lemmatize the words
    text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]

    return text_lems

In [None]:
# Create tfidf vectorizer
tfidf = TfidfVectorizer(tokenizer = tokenize)

# Fit and transform the above vectorizer on the text column, then convert the output into an array
song_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()
pd.DataFrame(song_tfidf)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1511,1512,1513,1514,1515,1516,1517,1518,1519,1520
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
626,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
627,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


In [None]:
# Compute the pairwise cosine similarity of the TF-IDF vectors above
similar_songs = cosine_similarity(song_tfidf, song_tfidf)

# Let us see the above array
similar_songs


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.03257364],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.03257364, ..., 0.        , 0.        ,
        1.        ]])
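To make the TF-IDF and cosine-similarity step concrete, here is a small sketch on three illustrative "title release artist" strings. It uses the default `TfidfVectorizer` tokenizer for brevity rather than the `tokenize` function above, so the exact values would differ from the notebook's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative strings in the same spirit as the `text` column.
texts = [
    "Learn To Fly There Is Nothing Left To Lose Foo Fighters",
    "Everlong The Colour And The Shape Foo Fighters",
    "Stronger Graduation Kanye West",
]

# Rows are songs, columns are vocabulary terms weighted by TF-IDF.
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# Entry [i, j] is the cosine similarity between song i and song j.
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(2))
```

The first two strings share the artist tokens and therefore get a positive similarity, while the third shares no tokens with the first and gets zero. This is why the recommendations for a title tend to surface songs by the same artist or from the same release.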

Finally, let's create a function to find the most similar songs to recommend for a given song.

In [None]:
# Function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):

    recommended_songs = []

    # Getting the index of the song that matches the title
    idx = title_series[title_series == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # Getting the indexes of the 10 most similar songs
    top_10_indexes = list(score_series.iloc[1 : 11].index)

    # Populating the list with the titles of the best 10 matching songs
    for i in top_10_indexes:
        recommended_songs.append(list(df_small.index)[i])

    return recommended_songs


Recommending 10 songs similar to 'Learn To Fly':

In [None]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', similar_songs)


['Big Me',
 'Everlong',
 'The Pretender',
 'Nothing Better (Album)',
 'From Left To Right',
 'Lifespan Of A Fly',
 'Daisy And Prudence',
 "Ghosts 'n' Stuff (Original Instrumental Mix)",
 'Closer',
 'No Cars Go']

#### Hybrid Model Implementation

In [None]:
def hybrid_recommendations_for_user(user_id, svd_model, similar_songs, df_small, n_recommendations=10):
    if 'title' not in df_small.columns:
        df_small = df_small.reset_index()

    similar_songs_df = pd.DataFrame(similar_songs, index=df_small['title'], columns=df_small['title'])

    user_interactions = df_small[df_small['user_id'] == user_id]

    user_rated_songs = user_interactions['song_id'].unique()

    predictions = []
    for _, row in df_small.iterrows():
        song_id = row['song_id']
        if song_id not in user_rated_songs:
            svd_prediction = svd_model.predict(str(user_id), str(song_id)).est
            predictions.append((row['song_id'], row['title'], svd_prediction))

    predictions.sort(key=lambda x: x[2], reverse=True)

    hybrid_recommendations = []
    for song_id, song_title, svd_prediction in predictions[:n_recommendations]:
        content_similarities = [similar_songs_df.loc[song_title, listened_title] for listened_title in user_interactions['title'] if listened_title in similar_songs_df.index]
        content_adjustment = np.mean(content_similarities) if content_similarities else 0

        # Adjust this formula as needed to handle the score
        hybrid_score = svd_prediction + content_adjustment  # Modified to adjust the score calculation

        hybrid_recommendations.append((song_id, song_title, hybrid_score))

    hybrid_recommendations.sort(key=lambda x: x[2], reverse=True)
    return hybrid_recommendations[:n_recommendations]


In [None]:
def get_user_interactions_from_test_set(user_id, test_evaluation_df):

    # Filter the DataFrame to keep only the rows for the given user_id
    user_interactions = test_evaluation_df[test_evaluation_df['user_id'] == user_id]

    # Extract the song_id values from those interactions
    interacted_songs = user_interactions['song_id'].unique().tolist()

    return interacted_songs


def evaluate_hybrid_model(test_evaluation_df, svd_model, similar_songs, df_small, n_recommendations=30, k=30, threshold=1.5):
    user_precisions, user_recalls, user_f1s = [], [], []

    # Iterate over each user in the evaluation set
    for user_id in test_evaluation_df['user_id'].unique():
        actual_songs = get_user_interactions_from_test_set(user_id, test_evaluation_df)
        hybrid_recs = hybrid_recommendations_for_user(user_id, svd_model, similar_songs, df_small, n_recommendations)

        # Keep only the IDs of the recommended songs
        recommended_songs = [rec[0] for rec in hybrid_recs]

        true_positives = set(recommended_songs) & set(actual_songs)
        false_positives = set(recommended_songs) - set(actual_songs)
        false_negatives = set(actual_songs) - set(recommended_songs)

        precision = len(true_positives) / (len(true_positives) + len(false_positives)) if true_positives else 0
        recall = len(true_positives) / (len(true_positives) + len(false_negatives)) if true_positives else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        user_precisions.append(precision)
        user_recalls.append(recall)
        user_f1s.append(f1)

    # Compute averages
    avg_precision = np.mean(user_precisions)
    avg_recall = np.mean(user_recalls)
    avg_f1 = np.mean(user_f1s)

    print(f'Average Precision: {avg_precision:.3f}')
    print(f'Average Recall: {avg_recall:.3f}')
    print(f'Average F1: {avg_f1:.3f}')




In [None]:
# Reset index of df_small
df_small = df_small.reset_index()
# Build a title-indexed similarity DataFrame; assign it so the result is not discarded
similar_songs_df = pd.DataFrame(similar_songs, index=df_small['title'], columns=df_small['title'])

# Evaluate the hybrid model
evaluate_hybrid_model(test_evaluation_df, svd_optimized, similar_songs, df_small)

Average Precision: 0.015
Average Recall: 0.066
Average F1: 0.022


The hybrid model yields an average precision of 0.015, an average recall of 0.066, and an average F1 score of 0.022. These figures are far below those of the SVD model, which achieved a precision of 0.421, a recall of 0.629, and an F1 score of 0.504. The hybrid metrics are computed by comparing each user's top 30 recommendations with the songs that user interacted with in the test set, and the low values suggest the hybrid model is not capturing user preferences effectively. The SVD model's RMSE of 1.0021 also indicates a reasonable level of accuracy in predicting actual ratings, which further emphasizes the gap between the two models. The hybrid model would need to be refined before it could compete on predictive accuracy or recommendation relevance.

### **Collaborative Filtering using Neural Network**

The hybrid model fell short of the desired results, so we next explore whether a neural network can improve the accuracy and personalization of the recommendations.

Neural networks built for collaborative filtering can capture complex user-item interactions. The network below uses the latent feature matrices of the optimized SVD model as the initial weights of its user and song embeddings, and applies L2 regularization and dropout to limit overfitting. The predicted interaction for a user-song pair is the dot product of the two flattened embedding vectors. The model is compiled with the Adam optimizer and a mean squared error loss, and is trained for 15 epochs.
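
As a rough illustration of the prediction step described above, the following NumPy sketch looks up one latent vector per user and per song and takes their row-wise dot product. The factor values and the helper name `predict_scores` are made up for illustration; they are not taken from the trained SVD model.

```python
import numpy as np

# Hypothetical latent factors: 3 users and 4 songs, 2 factors each.
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.5],
                         [0.0, 1.0]])
song_factors = np.array([[1.0, 0.0],
                         [0.5, 1.0],
                         [0.0, 2.0],
                         [1.0, 1.0]])

def predict_scores(user_ids, song_ids):
    """Look up one factor row per ID and take the row-wise dot product."""
    u = user_factors[np.asarray(user_ids)]  # shape (n_pairs, n_factors)
    s = song_factors[np.asarray(song_ids)]  # shape (n_pairs, n_factors)
    return np.sum(u * s, axis=1)            # shape (n_pairs,)

# Score the pairs (user 0, song 1), (user 1, song 2), (user 2, song 3).
scores = predict_scores([0, 1, 2], [1, 2, 3])
print(scores)
```

One point worth keeping in mind: Surprise's SVD prediction also adds a global mean and user and item bias terms to this dot product, so a network that uses only the factor matrices does not reproduce the SVD estimates exactly at initialization.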

In [None]:
from collections import defaultdict
import numpy as np
from sklearn.metrics import mean_squared_error

def precision_recall_at_k(model, test_user_ids, test_item_ids, test_ratings, k=30, threshold=1.5):
    """Calculate RMSE, precision@k, recall@k, and F1 score for a Keras model on the test set."""
    # Model predictions
    y_pred = model.predict([test_user_ids, test_item_ids]).flatten()

    # RMSE
    rmse = np.sqrt(mean_squared_error(test_ratings, y_pred))
    print(f"RMSE: {rmse:.3f}")

    user_est_true = defaultdict(list)
    for uid, iid, true_r, est in zip(test_user_ids, test_item_ids, test_ratings, y_pred):
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()

    for uid, user_ratings in user_est_true.items():
        user_ratings.sort(key=lambda x: x[0], reverse=True)  # Sort by estimation
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)  # Number of relevant items
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])  # Number of items recommended in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold)) for (est, true_r) in user_ratings[:k])

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    precision = sum(prec for prec in precisions.values()) / len(precisions)
    recall = sum(rec for rec in recalls.values()) / len(recalls)

    # Print metrics
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")

    # Calculate F1 score if both precision and recall are greater than 0
    if precision and recall:
        f1 = 2 * (precision * recall) / (precision + recall)
        print(f"F1: {f1:.3f}")
    else:
        print("F1: Cannot be calculated due to precision and recall being 0")
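
To make the per-user logic of `precision_recall_at_k` easier to follow, here is a small self-contained sketch that applies the same counting rules to one user's ratings. The helper name `precision_recall_for_user` and the toy (estimated, true) pairs are hypothetical.

```python
def precision_recall_for_user(pairs, k=30, threshold=1.5):
    """pairs: list of (estimated, true) ratings for a single user."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    # Relevant items: true rating at or above the threshold
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended items: estimate at or above the threshold, within the top k
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    # Items that are both relevant and recommended within the top k
    n_rel_and_rec_k = sum(
        est >= threshold and true_r >= threshold for est, true_r in ranked[:k]
    )
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall

# Made-up (estimated, true) play counts for one user.
toy_pairs = [(2.4, 3.0), (1.9, 1.0), (1.6, 2.0), (1.2, 2.0), (0.8, 1.0)]
precision, recall = precision_recall_for_user(toy_pairs, k=3)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

With `k=3`, all three top-ranked items are recommended, two of them are relevant, and one relevant item falls outside the top 3, so both precision and recall come out at 2/3.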


In [None]:
import numpy as np

# Initialize lists to store the internal IDs and ratings
train_user_ids = []
train_item_ids = []
train_ratings = []

# Iterate over all ratings in the training set
for uid, iid, rating in trainset.all_ratings():
    # Convert from internal indices to raw IDs (if necessary)
    # In this case, we are working directly with internal indices
    train_user_ids.append(uid)
    train_item_ids.append(iid)
    train_ratings.append(rating)

# Convert lists to NumPy arrays for training
train_user_ids = np.array(train_user_ids)
train_item_ids = np.array(train_item_ids)
train_ratings = np.array(train_ratings)


In [None]:
# Prepare test data considering some users or items may not be in the training set
test_user_ids = []
test_item_ids = []
test_ratings = []

for uid, iid, rating in testset:
    try:
        inner_uid = trainset.to_inner_uid(uid)
        inner_iid = trainset.to_inner_iid(iid)
        test_user_ids.append(inner_uid)
        test_item_ids.append(inner_iid)
        test_ratings.append(rating)
    except ValueError:
        # Skip this user-item pair or handle it differently
        continue

# Convert to NumPy arrays for validation and evaluation
test_user_ids = np.array(test_user_ids)
test_item_ids = np.array(test_item_ids)
test_ratings = np.array(test_ratings)


In [None]:
# Retrieve the latent feature matrices from SVD
user_features_svd = svd_optimized.pu  # Latent user features
song_features_svd = svd_optimized.qi  # Latent item features


In [None]:
# Define the input dimensions
input_dim_users = user_features_svd.shape[0]
input_dim_songs = song_features_svd.shape[0]


In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam

# Regularization hyperparameters
l2_reg = 0.00001  # L2 regularization factor
dropout_rate = 0.05  # Dropout rate

# Dimension of the latent features
n_factors = user_features_svd.shape[1]

# Inputs
user_input = Input(shape=(1,), dtype='int32', name='user_input')
song_input = Input(shape=(1,), dtype='int32', name='song_input')

# Embeddings, initialized with SVD
user_embedding = Embedding(input_dim=input_dim_users, output_dim=n_factors, embeddings_initializer=Constant(user_features_svd), embeddings_regularizer=l2(l2_reg), input_length=1, name='user_embedding')(user_input)
item_embedding = Embedding(input_dim=input_dim_songs, output_dim=n_factors, embeddings_initializer=Constant(song_features_svd), embeddings_regularizer=l2(l2_reg), input_length=1, name='song_embedding')(song_input)

# Flatten the embeddings
user_vec = Flatten(name='flatten_users')(user_embedding)
song_vec = Flatten(name='flatten_songs')(item_embedding)
# Apply Dropout to the flattened vectors
user_vec = Dropout(dropout_rate)(user_vec)
song_vec = Dropout(dropout_rate)(song_vec)
# Dot product of the vectors to calculate similarity
prod = Dot(name='dot_product', axes=1)([user_vec, song_vec])

# Final model
model = Model(inputs=[user_input, song_input], outputs=prod)

# Compiling the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')


In [None]:
# Model training
history = model.fit([train_user_ids, train_item_ids], train_ratings,
                    batch_size=64, epochs=15,
                    validation_data=([test_user_ids, test_item_ids], test_ratings))


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [None]:
# Model evaluation
precision_recall_at_k(model, test_user_ids, test_item_ids, test_ratings, k=30, threshold=1.5)


RMSE: 1.106
Precision: 0.408
Recall: 0.404
F1: 0.406


The neural network model yields an RMSE of 1.106. In its current state, its predictions are therefore less accurate than those of the standalone SVD model, which had an RMSE of 1.0021.

The neural network reaches a precision of 0.408, a recall of 0.404, and an F1 score of 0.406. These values are close to one another, but they leave room for improvement. Compared with the SVD model's precision of 0.421, recall of 0.629, and F1 score of 0.504, the neural network falls short. The SVD model predicts with slightly higher precision and has a much better recall, meaning it retrieves a larger proportion of the relevant items.

## **Executive Summary**

### **Key Insights from the Milestone Analysis**

The most significant findings from the milestone analysis of the music recommendation system are detailed below:

**1. Preference for More Recent Music**

Data analysis has revealed a clear user preference for more recent music. This insight is crucial for the design and enhancement of recommendation systems, as it underscores the importance of incorporating and highlighting recent releases and current trends in recommendations. Adapting the system to emphasize this type of content not only satisfies current user preferences but also fosters greater engagement by keeping the platform up-to-date with the contemporary musical landscape.


**2. Identification of a Model with Reasonable Ratios from a Limited Dataset**

Despite the limitations of the provided dataset, it has been possible to identify a model that offers reasonable performance in terms of precision and recall. This model, the optimized Singular Value Decomposition (SVD), has proven effective at analyzing user preferences, reaching a precision of 0.421 and a recall of 0.629. Its ability to provide personalized and relevant recommendations, even with a restricted dataset, highlights its potential for use in efficient recommendation systems that can adapt and scale as more data becomes available.

**3. Improvement of Economic Outcomes by Incorporating the Model**

Incorporating the optimized SVD model into the recommendation functionality presents a notable opportunity to improve the economic outcomes of the platform. By enhancing the accuracy of music recommendations, an increase in the time users spend on the platform is anticipated, which in turn could translate into higher revenue from advertising, subscriptions, and other monetization avenues. This approach not only has the potential to increase user satisfaction and retention by offering a more personalized and aligned experience with their musical tastes but also strengthens the platform's position as a leader in musical discovery. By providing significant added value to users, greater loyalty is fostered and a solid foundation for sustained economic growth is established.



### **Specifications of the Proposed Final Model**

The final proposed model is a collaborative filtering system based on Matrix Factorization, specifically utilizing the Singular Value Decomposition (SVD) algorithm from the Surprise library. This model has been optimized through careful selection of hyperparameters to enhance its predictive accuracy and recommendation quality. The selected hyperparameters for the optimized SVD model are as follows:

- *n_epochs*: 30, indicating the number of iterations over the entire dataset to perform during the training process. This allows the model enough time to converge on an optimal set of latent features for users and items.
- *lr_all*: 0.01, specifying the learning rate for all parameters. This controls the step size during the gradient descent optimization, balancing the speed and stability of convergence.
- *reg_all*: 0.1, setting the regularization term for all parameters. This helps to prevent overfitting by penalizing larger model parameters, thus encouraging a model that generalizes better to unseen data.
- *random_state*: 1, ensuring reproducibility of results by providing a fixed seed for the random number generator.

This optimized SVD model, after being trained on a designated dataset, demonstrated its effectiveness in decomposing the user-item interaction matrix to capture latent factors. Upon evaluation, it achieved an RMSE of 1.0021, a reasonable level of accuracy in its predictions. Its precision stood at 0.421, indicating that about 42% of the top-k recommendations were relevant to the users. The recall was 0.629, demonstrating the model's ability to retrieve a significant portion of relevant items, and the F1 score was 0.504, reflecting the trade-off between precision and recall. These metrics underscore the model's capacity to balance the relevance of its recommendations with the ability to discover as many relevant items as possible.
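
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall as their harmonic mean. The sketch below uses only the figures quoted in this document; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the optimized SVD model.
svd_f1 = f1_score(0.421, 0.629)
# Precision and recall printed in the neural network evaluation output.
nn_f1 = f1_score(0.408, 0.404)

print(f"SVD F1: {svd_f1:.3f}")
print(f"Neural network F1: {nn_f1:.3f}")
```

Both values round to the F1 scores shown in the evaluation outputs (0.504 and 0.406). The hybrid model's F1 cannot be checked this way, because it is the average of per-user F1 scores, not the harmonic mean of the averaged precision and recall.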

## **Problem and Solution Summary**

**Problem Summary**

The music streaming industry faces a continual challenge: maintaining and growing its user base in a highly competitive market. Spotify, as a leader in this sector, is always looking for ways to enhance user experience and foster both the retention of existing users and the acquisition of new ones. In this context, the primary issue lies in how to increase users' engagement with the platform by improving their musical experience, thus promoting greater loyalty and potentially encouraging users to opt for paid subscriptions. The vast availability of music online and the diversity of users' tastes and preferences make the task of discovering new relevant music overwhelming for many. While users are eager to explore new genres and artists, they often encounter time constraints and a lack of knowledge that hinder their ability to discover music aligning with their personal preferences. This scenario presents a significant opportunity for Spotify: the implementation of an advanced recommendation system that can personalize listening experiences by suggesting songs and artists based on individual user preferences.

**Solution Summary**

The proposed solution focuses on integrating a sophisticated recommendation system within Spotify's app, designed to understand and predict users' musical preferences. By analyzing listening patterns, interactions with various songs and artists, and considering factors like the recent popularity of certain genres or releases, this system can provide highly personalized recommendations. The goal is to enrich the user experience by introducing them to a wider array of music they are likely to enjoy but have yet to discover. This personalization not only enhances user satisfaction by ensuring a continuous stream of new music aligned with their tastes but also encourages more profound engagement with the platform. By spending more time on Spotify and exploring a broader range of content, users develop a deeper and more meaningful connection with the platform. This increase in customer loyalty is crucial for improving Spotify's business outcomes in several ways:

- **User Retention**: Users satisfied with the quality of the music recommendations are more likely to remain loyal to Spotify over other streaming platforms, thus reducing the churn rate.
- **Subscription Conversion**: A rewarding and personalized user experience can motivate free version users to upgrade to paid subscriptions, seeking an even more tailored and ad-free experience. This conversion can significantly increase Spotify's subscription revenue.
- **Engagement Increase**: By enhancing both the amount and quality of time users spend on the platform, Spotify can generate higher revenues through targeted advertising and, for subscribed users, justify the cost of the subscription fee.

### **Reasons for the Proposed Solution Design**

- **Low Operational Cost Implementation**: A shift towards a Matrix Factorization-based collaborative filtering model was made due to its high efficiency and low operational cost. This strategic choice allows for a cost-effective proof of concept, enabling Spotify to assess the recommendation feature's effectiveness without significant initial expenses.

- **High Performance in Precision and Recall**: The Matrix Factorization model, particularly when optimized, provided the best balance of precision and recall among the models considered. It retrieves a broad share of the songs likely to appeal to the user (recall of 0.629) while keeping the recommendations relevant to their specific tastes (precision of 0.421). This predictive accuracy is crucial for effectively personalizing the user experience.

- **Ease of Update**: The simplicity of the dataset used for this model, along with the straightforward nature of the Matrix Factorization approach, facilitates easy and regular updates. This ensures that the recommendations remain fresh and relevant, reflecting the latest musical trends and shifts in user preferences, while keeping operational costs low.

### **Impact on the Problem and Business**

The implementation of this recommendation model is expected to have a direct and positive impact on Spotify's business outcomes in the following ways:

- **Minimization of Subscription User Churn Rate**: By enhancing the user experience with personalized and relevant recommendations, an increase in satisfaction and engagement among subscribed users is anticipated. This, in turn, can reduce the likelihood of them canceling their subscriptions, thereby lowering the churn rate. A reduced churn rate not only improves revenue retention but also strengthens the long-term loyal user base.

- **Increase in the Freemium to Subscription User Conversion Ratio**: An improved user experience, thanks to precise and personalized recommendations, may encourage freemium users to value the platform more and consider a subscription as a valuable investment for accessing an even richer and ad-free experience. By increasing this conversion ratio, Spotify can significantly boost its subscription revenue.

## **Recommendations for implementation**

### **Implementation Recommendations**

To ensure an effective implementation of the proposed **matrix factorization-based collaborative filtering** recommendation system solution in Spotify, the following key strategies are recommended:

1. **User Interface Enhancement**
The Spotify app interface must be optimized to effectively integrate the new recommendation functionality. This involves designing and developing a section within the app that presents a playlist generated by the recommendation model, highlighting the suggested songs in an attractive and user-accessible manner. This enhanced interface should be intuitive and designed to facilitate the discovery of new music, encouraging users to explore and listen to the recommended songs.

2. **Simple and Automated Implementation**
The update of the recommendation playlist should be carried out automatically, preferably overnight, through bash jobs that execute the model and update the database with the new recommendations for each user. This simple implementation ensures that users always have access to a fresh and relevant list of recommendations every time they open the app, improving their experience without requiring additional interactions or waiting.

3. **KPI Monitoring and Control System**
To continuously evaluate and optimize the effectiveness of the recommendation functionality, it is crucial to implement a control system that monitors key performance indicators (KPIs) related to:

    - **Model Predictive Capacity:** The precision and recall of the model should be measured and analyzed in real-time, adjusting and refining the algorithm as necessary to maintain or improve its performance. This may involve periodic reevaluation of training data, model structure, and used parameters.

    - **Recommendation Functionality Usage:** It is important to track how users interact with the recommendation list, including metrics such as usage frequency, listening time of recommended tracks, and the conversion rate of recommendations to songs saved or added to other playlists by the users. These data can provide valuable insights into the relevance and appeal of the provided recommendations.

### **Key Actionables for Stakeholders**

The recommended steps to maximize the success of the matrix factorization-based collaborative filtering recommendation system on Spotify are as follows:

1. **Development Approval**:
The first essential step is the formal approval of the development of the recommendation functionality associated with this model by stakeholders. This involves:
    - Resource Commitment: Ensuring that the necessary financial, human, and technological resources are allocated for the development, testing, and implementation of the functionality.
    - Objective Definition: Setting clear business and technical objectives for the recommendation functionality, including specific goals to improve user experience, increase customer loyalty, and have positive effects on conversion and churn rates.

2. **Implementation of Specific KPIs**:
To measure the impact of the recommendation functionality and its value to the business, it is crucial to implement key performance indicators (KPIs) that allow for an objective and continuous evaluation. These KPIs should include:
    - Freemium to Subscription Conversion Rate: Measuring how the introduction of the recommendation functionality affects the propensity of freemium users to become paying subscribers.
    - Subscriber Churn Rate: Assessing whether the improvement in the personalization of the musical experience has a positive impact on subscriber retention, thus reducing the subscription cancellation rate.

3. **Ongoing Evaluation and Improvement of the Predictive Model**:
Based on the analysis of the mentioned KPIs, stakeholders should be prepared to:
    - Review Results: Regularly evaluate the performance of the recommendation functionality in relation to the established KPIs to determine its success and areas for improvement.
    - Approve Improvements: If the results are positive, proceed with the approval of additional projects to enhance and refine the predictive model. This may include exploring new modeling techniques, expanding the dataset, or incorporating user feedback for adjustments.


### **Expected Benefits and Costs**

Here is an estimated analysis of the benefits and costs associated with this implementation, based on rational assumptions and available data.

**Necessary Investment**

Development and Implementation of the Functionality and Model: The initial investment required for the development and implementation of the recommendation system is estimated to be `$500,000`. This figure includes the costs of research and development, programming, testing, and deployment of the system.

**Estimated Revenues**

- Reduction in Churn Rate: The target is a relative reduction of 0.5% in the churn rate. With 220 million subscribers and an annual churn rate of 45%, this means moving from 45% to 44.775% and preventing the loss of approximately half a million users (220 million × 0.225% = 495,000). With an average revenue per user (ARPU) of `$54` annually, this would represent approximately `$27` million in retained revenue.
- Increase in Conversion Ratio: The recommendation system is expected to increase the conversion ratio of freemium users to paying subscribers by 0.5% in relative terms, moving from the current 15% to 15.075%. With 500 million freemium users, this would mean 375,000 new subscribers, translating at the same ARPU of `$54` into additional revenue of approximately `$20.25 million`.

Therefore, the total estimated contribution to revenue would be `$47.25 million`.

**Operational Costs**
- Cloud: The cost associated with cloud services, including running the model and supporting the functionality for users, is anticipated not to exceed `$1 million`.
- Personnel Cost: For the continuous monitoring and maintenance of the system, the need for 3 people is estimated, with a total annual cost of `$225,000`.
This brings the total operational expenses to `$1,225,000`.

**Estimated Operational Profit**

The difference between the additional revenues generated and the operational costs results in an estimated operational profit of `$46,025,000` (`$47,250,000 - $1,225,000`).
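
The arithmetic behind these estimates can be reproduced with a short sketch. All inputs are the assumptions stated in this section; the variable names are illustrative, and the 0.5% changes are read as relative changes, which is the reading that matches the figures given.

```python
ARPU = 54  # average annual revenue per user, in dollars

# Churn: 220M subscribers, 45% annual churn, 0.5% relative reduction
retained_users_exact = round(220_000_000 * 0.45 * 0.005)
retained_users = 500_000  # the text rounds 495,000 to half a million

# Conversion: 500M freemium users, 15% conversion, 0.5% relative increase
new_subscribers = round(500_000_000 * 0.15 * 0.005)

retained_revenue = retained_users * ARPU
conversion_revenue = new_subscribers * ARPU
total_revenue = retained_revenue + conversion_revenue

# Operational costs: cloud services plus a three-person team
operational_costs = 1_000_000 + 225_000
operational_profit = total_revenue - operational_costs

print(f"Retained users (exact): {retained_users_exact:,}")
print(f"Total estimated revenue: ${total_revenue:,}")
print(f"Operational profit: ${operational_profit:,}")
```

The operational profit excludes the one-off initial investment of `$500,000`; subtracting it as well would leave `$45,525,000` in the first year under the same assumptions.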


### **Potential Risks and Challenges**

The implementation of an advanced recommendation system in Spotify, while promising to enhance user experience and the company's economic outcomes, is not without potential risks and challenges. These challenges, if not properly addressed, could limit the effectiveness of the solution and its anticipated positive impact. The main risks associated with the design of the proposed solution include:

1. **Lack of Predictive Quality of the Model**
One of the most significant risks is that the recommendation model fails to achieve sufficient predictive quality. This could be due to various reasons, such as insufficient training data, poor feature selection, or the model's inability to capture the complexity and dynamism of users' musical preferences. If the model cannot accurately predict user preferences, the recommendations are likely to be irrelevant or uninteresting to them, which could lead to low adoption and use of the recommendation functionality.

2. **Lack of Real-Time Personalization**
Another major challenge is ensuring that recommendations are personalized in real-time. Users' music tastes and preferences can change rapidly, influenced by trends, moods, or specific events. If the system cannot adapt and update recommendations in real-time or near real-time, users may perceive the recommendations as outdated or not relevant, diminishing the perceived value of the functionality.

3. **Inadequate User Interface**
The effectiveness of the solution also heavily depends on the interface through which recommendations are presented. A poorly designed user interface, which is not intuitive, attractive, or easy to use, can discourage users from interacting with the recommendations, regardless of their relevance or accuracy. User experience should be a primary consideration in the design of the recommendation functionality to ensure that suggestions are easily accessible and appealing to users.

**Mitigation Strategies**

To address these risks, Spotify should consider various mitigation strategies, including:

- Continuous Model Improvement: Implement a process of iteration and continuous improvement of the recommendation model, using user feedback and performance analysis to adjust and refine the system.

- Dynamic Personalization: Develop mechanisms that allow for updating recommendations in real-time or near real-time, based on the user's recent interaction with the platform and changes in music trends.

- User Testing and User-Centered Design: Conduct usability testing and iterative design with real users to ensure that the interface of the recommendation functionality is attractive, intuitive, and functional.


### **Further Analysis and Associated Problem Solving**

To optimize the recommendation system on Spotify, it is crucial to address key areas for future analysis and improvements:

**Improving the Predictive Model**
- Data Expansion: To improve the accuracy and relevance of recommendations, it is essential to expand the dataset used to train the predictive model. This includes increasing the number of users whose data are analyzed, as well as expanding the range of features considered for each user. Additional data can provide a deeper and more nuanced understanding of user preferences and behaviors.

- Inclusion of Listening Date and Time: Incorporating temporal variables, such as the specific date and time of listening, can allow the model to capture and learn time-based patterns, such as musical preferences changing according to the time of day, day of the week, or season.

- New Song Features: Enriching the model with more details about the songs, including aspects such as genre, era, rhythm, instruments, etc., could significantly improve the system's ability to make personalized and relevant recommendations.

- External Variables: Integrating external variables that may influence the user's mood, such as weather conditions, local events, holidays, and other cultural observances, can offer an additional dimension of personalization, aligning recommendations not only with the user's musical tastes but also with their current context and emotional state.

- Researching Adapted Methodologies: Exploring and adopting advanced analytical and modeling methodologies, specifically designed to handle the complexity and dynamic nature of musical and user data, is fundamental for the continuous development of the model.
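
As one possible starting point for the listening date and time item above, the sketch below derives a few time-based features from a timestamp using only the standard library. The event records, the `listened_at` field, the helper name `temporal_features`, and the part-of-day boundaries are all hypothetical, since timestamps are proposed here as a future addition to the data.

```python
from datetime import datetime

def temporal_features(timestamp):
    """Derive simple time-based features from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    hour = dt.hour
    # Illustrative part-of-day boundaries; these are an assumption.
    if 6 <= hour < 12:
        part_of_day = "morning"
    elif 12 <= hour < 18:
        part_of_day = "afternoon"
    elif 18 <= hour < 23:
        part_of_day = "evening"
    else:
        part_of_day = "night"
    return {
        "hour": hour,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": dt.weekday() >= 5,
        "month": dt.month,
        "part_of_day": part_of_day,
    }

# Made-up listening events for illustration.
events = [
    {"user_id": "u1", "song_id": "s1", "listened_at": "2024-03-09T08:15:00"},
    {"user_id": "u1", "song_id": "s2", "listened_at": "2024-03-11T23:40:00"},
]
features = [temporal_features(e["listened_at"]) for e in events]
for event, feats in zip(events, features):
    print(event["song_id"], feats)
```

Features like these could be joined to the user-song interactions so that a model can learn, for example, whether a user's preferences differ between weekday mornings and weekend nights.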


**Implementing a Real-Time Model**

The ability to update recommendations in real-time based on recent user interaction and contextual changes is an ambitious but critical goal for the evolution of the recommendation system. Implementing a model that operates in real-time requires:

- Robust Technological Infrastructure: Developing and maintaining an infrastructure capable of processing large volumes of data in real-time, ensuring the speed and scalability of the recommendation system.

- Efficient Algorithms: Using or developing algorithms that can run quickly and update recommendations almost instantly, based on the user's most recent activity and changes in the aforementioned external variables.