# COGS 118B - Project Proposal

# Names

- Valeria Gonzalez Perez A16366104
- Gemma Luengo-Woods A17622576
- Aarohi Zade A16222196
- Nick Campos A17621673

# Abstract 
In the age of the Internet and streaming platforms such as Spotify, music has become widely accessible. However, even with access to such a wide array of music, it is easy to become confined within narrow music bubbles that limit a user’s exposure to diverse sounds. Conventional recommendation systems, despite employing ML algorithms, often fall short of providing tailored suggestions that transcend traditional genre boundaries. We intend to address these issues by implementing our own unsupervised recommender based on a Gaussian Mixture Model (GMM). We will fit the model to a dataset of tens of thousands of songs and their corresponding features, which include the key the song is in, the genre, and elements such as valence, BPM, and danceability. Applying GMM to this dataset will create n-dimensional clusterings that represent relationships between songs, with an emphasis on soft, flexible clusterings that suitably reflect the multifaceted nature of music. The success of our recommendations will be based on our model’s confidence that a user’s listening profile is similar to the model’s recommendations.

# Background



A recommendation system can be the difference between a mediocre music platform and a great one. Apple Music, for instance, has become almost infamous for having a subpar recommendation system <a name="applemusic"></a>[<sup>[1]</sup>](#applemusicnote). Inspired by Spotify’s “Discover Weekly” <a name="discoverweekly"></a>[<sup>[2]</sup>](#discoverweeklynote), our goal is to use unsupervised machine learning as the basis of a successful recommendation algorithm. While an optimal system would use supervised and reinforcement learning alongside unsupervised methods, we want to examine how an unsupervised approach stacks up against well-accepted recommendation systems.

A helpful overview of current research into these systems is “A systematic review and research perspective on recommender systems” <a name="research"></a>[<sup>[3]</sup>](#researchnote) by Deepjyoti Roy and Mala Dutta. It discusses the development and evaluation of recommender systems, emphasizing the importance of algorithmic analysis. It also evaluates the performance metrics of recent contributions and identifies existing research gaps, aiming to guide future developments in efficient recommender system design. In line with the paper, we will use a collaborative filtering approach, specifically a model-based filtering system built on clustering techniques.

1. <a name="applemusicnote"></a> [^](#applemusic) Anonymous User. (16 Feb 2018) Bad Music Recommendations. **Apple Support Community.** https://discussions.apple.com/thread/8284099?sortBy=best
2. <a name="discoverweeklynote"></a> [^](#discoverweekly) The Data School. (2 Apr 2023) Machine Learning 101 and The Spotify Case. **The Data School.** https://www.thedataschool.com.au/mipadmin/machine-learning-101-and-the-spotify-case/
3. <a name="researchnote"></a> [^](#research) Roy, D., Dutta, M. (3 May 2022) A systematic review and research perspective on recommender systems. **Journal of Big Data 9, 59.** https://doi.org/10.1186/s40537-022-00592-5



# Problem Statement

In the context of digital music services, users often encounter difficulty in discovering new songs that align with their individual preferences. With a vast number of options and diverse selection of tracks available to listen to, this is not a simple task for people with complex preferences. This project aims to refine the process of music recommendation by developing a system capable of accurately predicting and suggesting songs that users are likely to enjoy, based on a quantifiable profile of their musical preferences.

This profile is constructed using measurable attributes of songs they have previously listened to and favored, including but not limited to danceability, energy, loudness, valence, and genre. The core of the problem lies in effectively grouping songs using these attributes, for which we propose the application of a Gaussian Mixture Model (GMM). By clustering the dataset of songs with GMM on these attributes, we can then match a user's profile against the clusters to recommend new songs that share characteristics with those the user has shown a preference for.

The effectiveness of our recommendation system will be measured using precision and recall metrics, evaluating the system’s ability to identify songs that meet the user's taste while minimizing irrelevant suggestions. This problem is both quantifiable, through the use of audio feature metrics, and replicable, as users consistently seek new music that fits their established preferences.
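
As a rough illustration of how precision and recall could be computed for a single user, here is a minimal sketch. It assumes we have a set of recommended track IDs and a held-out set of tracks the user is known to like; the function name `precision_recall` and the example track IDs are hypothetical, and the proposal does not yet specify where the "relevant" set would come from.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for one user's recommendations.

    recommended: iterable of track IDs the system suggested.
    relevant: iterable of track IDs the user actually likes (assumed known).
    """
    recommended = set(recommended)
    relevant = set(relevant)
    hits = len(recommended & relevant)
    # Precision: share of suggestions that were relevant.
    precision = hits / len(recommended) if recommended else 0.0
    # Recall: share of relevant tracks that were suggested.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 suggestions, 3 liked tracks, 2 in common.
precision, recall = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
```

High precision means few irrelevant suggestions, while high recall means few missed songs the user would have enjoyed.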

# Data

We’ll be using the Spotify Tracks Dataset found on Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download). There are 114,000 observations and 21 variables. Each observation is a song on Spotify and consists of an index in the dataset, the Spotify track ID, the artist(s) who performed it, the name of the album that the song comes from, the name of the track, its popularity, the duration, whether it’s explicit, its danceability, how energetic it is, what key it’s in, how loud it is in decibels, the mode (major or minor), its speechiness, acousticness, instrumentalness, the probability of the track being a live recording, the valence, tempo, time signature, and genre.

We’ll preprocess the data by removing any incomplete or repeated observations. We’ll also remove the explicitness, duration, popularity, and liveness variables since they aren’t particularly critical. Any observations where no key is detected (key = -1) or where speechiness is above 0.65 (predicted to be exclusively speech) will be removed from the dataset, and we’ll also make adjustments to the genre classifications. All variables with the datatype 'object' will be cast to strings. Songs that are labeled as 'acoustic', 'songwriter', 'singer-songwriter', 'happy' and 'sad' will be removed, songs labeled as 'electro' will be relabeled as 'electronic', and songs labeled as 'latino' will be relabeled as 'latin'.
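
The preprocessing steps above could be sketched with pandas roughly as follows. This is an illustration rather than our final pipeline: the helper name `preprocess_tracks` is hypothetical, and the column names `explicit` and `duration_ms` are assumptions about how the Kaggle file labels those variables.

```python
import pandas as pd

# Assumed column names for the variables we plan to drop.
DROPPED_COLUMNS = ["explicit", "duration_ms", "popularity", "liveness"]
DROPPED_GENRES = {"acoustic", "songwriter", "singer-songwriter", "happy", "sad"}
GENRE_RENAMES = {"electro": "electronic", "latino": "latin"}


def preprocess_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw tracks table."""
    # Remove incomplete rows and exact repeats.
    df = df.dropna().drop_duplicates()
    df = df.drop(columns=[c for c in DROPPED_COLUMNS if c in df.columns])
    # Remove tracks with no detected key or predicted to be pure speech.
    df = df[(df["key"] != -1) & (df["speechiness"] <= 0.65)]
    # Drop excluded genres, then merge near-duplicate genre labels.
    df = df[~df["track_genre"].isin(DROPPED_GENRES)].copy()
    df["track_genre"] = df["track_genre"].replace(GENRE_RENAMES)
    # Cast generic 'object' columns to explicit strings.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str)
    return df.reset_index(drop=True)
```

Calling `preprocess_tracks(pd.read_csv(path))` on the downloaded file (with `path` pointing to wherever the CSV is saved) would return the cleaned table used in the rest of the project.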

The remaining critical variables are track_id (string), artists (string), album_name (string), track_name (string), danceability (float of 0.0 to 1.0 that describes how “danceable” a song is based on combination of musical elements), energy (float), key (int that maps to pitches using Pitch Class notation), loudness (float), mode (int where 0 is minor and 1 is major), speechiness (float where <= 0.33 is probably exclusively music and between 0.33 and 0.65 is a mix like rap), instrumentalness (float where the closer the value is to 1, the less vocal content), valence (float where high valence is more positive sentiment and low valence is more negative sentiment), tempo (float that represents BPM), time_signature (int from 3 to 7 that represents the beats per measure), and track_genre (string).

# Proposed Solution
To build accurate preference prediction into a music recommendation system, we will rank potential suggestions by how similar their features (such as energy, tempo, valence, danceability, genre, and related artists) are to the user’s music preference patterns over time. We will group over 70,000 different songs into clusters of songs with related measurable features by fitting a Gaussian Mixture Model (GMM) to the data. We chose GMM over k-means because it produces soft, probabilistic cluster assignments and can model clusters of different sizes and elongated shapes, whereas k-means makes hard assignments and favors roughly spherical clusters. From there, we will compare a representation of the user’s music preference (or the song the user is currently listening to) with the clustered representations of the Spotify tracks dataset in vector space. Whichever cluster is closest in vector space to the user’s music will be treated as the nearest neighbor, that is, the type of music the user is most likely to want from the recommendation system. We will score candidates using Euclidean distance or cosine similarity and use a confidence threshold of >= 0.7 to decide whether particular songs are a good fit to recommend to the user.
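
To make the clustering step concrete, here is a minimal sketch using scikit-learn's `GaussianMixture`. It is an illustration under assumptions, not our final model: the helper names `fit_song_clusters` and `cluster_memberships` are hypothetical, the number of components is a placeholder we would still need to tune, and the two synthetic blobs stand in for real audio features such as danceability and energy.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def fit_song_clusters(features, n_components, seed=0):
    """Standardize the features and fit a GMM with full covariances."""
    scaler = StandardScaler().fit(features)
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="full", random_state=seed
    )
    gmm.fit(scaler.transform(features))
    return scaler, gmm


def cluster_memberships(scaler, gmm, song_features):
    """Soft assignment: probability of each song belonging to each cluster."""
    return gmm.predict_proba(scaler.transform(np.atleast_2d(song_features)))


# Synthetic stand-in for two groups of songs in a 2-feature space.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.2, 0.2], scale=0.02, size=(100, 2))
blob_b = rng.normal(loc=[0.8, 0.8], scale=0.02, size=(100, 2))
songs = np.vstack([blob_a, blob_b])

scaler, gmm = fit_song_clusters(songs, n_components=2)
# Each row sums to 1 and gives the user's song a weight in every cluster.
probs = cluster_memberships(scaler, gmm, [[0.2, 0.2], [0.8, 0.8]])
```

The rows of `probs` are the soft assignments that distinguish GMM from k-means: a song that sits between two clusters receives meaningful weight in both instead of being forced into one.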

# Evaluation Metrics

We intend to use confidence as the evaluation metric to quantify the certainty of our model’s recommendations. There are a few attributes of our model’s clustering choices, as well as the feature space they occupy, that we can base our confidence scores on. For example, one approach we are interested in derives confidence from the distance between the user-inputted song and its nearest neighbor (which would serve as our output, the song recommended to the user). If these songs are very close in feature space, it suggests a higher level of confidence in the recommendation, which should translate into a higher confidence score. This score, which would be a function of distance, could be quantified in several ways, such as Euclidean distance or cosine similarity. In application, we intend to employ the Python library FRONNI to generate confidence interval breakdowns to determine if our performance is acceptable. If the confidence score is >= 0.7, we will consider the recommendation a success.
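
As a sketch of how a distance-based confidence score might work, the example below scores candidates by cosine similarity and applies the 0.7 threshold. The function names are hypothetical, and the mapping `1 / (1 + distance)` from Euclidean distance to a score between 0 and 1 is only one possible choice that we are assuming for illustration. FRONNI is not used here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def distance_confidence(a, b):
    """Assumed mapping: 1.0 at distance 0, decaying toward 0 as distance grows."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(1.0 / (1.0 + np.linalg.norm(diff)))


def recommend_nearest(query, candidates, threshold=0.7):
    """Return (index, score, accepted) for the most similar candidate."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best], scores[best] >= threshold


# Hypothetical 2-feature example: the second candidate points the same way.
best, score, accepted = recommend_nearest([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]])
```

Because audio features such as danceability and energy are all non-negative, raw cosine similarities tend to be high, so in practice the features would likely need to be standardized before a fixed threshold is meaningful.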

# Ethics & Privacy

Sampling Bias - Although this is a randomized Spotify music dataset, we cannot ensure that it accurately captures the full range of music on Spotify or the numerous genres that exist in music. Moreover, we acknowledge that a recommendation system built on Spotify tracks may not accurately predict a user’s next preferred choice of music on other streaming platforms, or in general outside the scope of the songs available in the dataset.

Anonymity - We have controlled for the privacy concern of anonymity by using song tracks that are tied to no identifiable user data and solely represent the physical properties and measurable features of music. There is no information about name, age, gender, location, or any other revealing information.

Licensing - There are no licensing conflicts with artists or music labels, as the dataset consists of Spotify tracks, which only become publicly available once an agreement has been established between the artist and Spotify. We have no knowledge of, nor are we involved in, this agreement between the artist and Spotify. Our research is based on a public dataset on Kaggle.

Affiliations - We are not receiving any funds from Spotify, nor is our research sponsored by Spotify. We have no affiliation with Spotify, and our research is not considering, and will not consider, any suggestions or recommendations from Spotify if communication were to be established. Our research is not intended to benefit a particular party and is solely focused on the investigative process of predicting potential music recommendations.

# Team Expectations 

* *We will meet every week on Monday to discuss any updates to the project.*
* *We will use WhatsApp as our primary form of communication.*
* *We will be respectful of each other and be willing to offer help whenever needed.*
* *We will make decisions by rule of majority vote.*
* *We will give each other three days of time to solve any issues/bugs/problems before asking for professorial assistance or constructive criticism.*

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/10  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/16  |  10 AM |  Do background research on topic (all) | Discuss ideal dataset(s) and ethics; draft project proposal (all) | 
| 2/20  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (all)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part (all)  |
| 2/25  | 6 PM  | Import & wrangle data, do some EDA (Gemma, Valeria) | Review/Edit wrangling/EDA; Discuss Analysis Plan (all)  |
| 3/05  | 12 PM  | Finalize wrangling/EDA (Nick, Valeria); Begin programming for project (Aarohi, Nick) | Discuss/edit project code; Complete project (all) |
| 3/18  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Aarohi, Gemma)| Discuss/edit full project (all) |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

