# COGS 118B - Final Project

# Insert title here

## Group members

- Valeria Gonzalez Perez A16366104
- Gemma Luengo-Woods A17622576
- Aarohi Zade A16222196
- Nick Campos A17621673

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents 
- the solution/what you did
- major results you came up with (mention how results are measured) 

__NB:__ this final project form is much more report-like than the proposal and the checkpoint. Think in terms of writing a paper with bits of code in the middle to make the plots/tables

# Background

A music platform’s recommendation system can be what separates a mediocre service from a great one. Apple Music, for instance, has drawn user complaints about the quality of its recommendations<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1). Inspired by Spotify’s “Discover Weekly”<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2), our goal is to use unsupervised machine learning as the basis of a successful recommendation algorithm. While an optimal system would use supervised and reinforcement learning alongside unsupervised methods, we want to examine how a purely unsupervised approach stacks up against well-accepted recommendation systems.

A helpful overview of current research into these systems is “A systematic review and research perspective on recommender systems”<a name="cite_ref-3"></a>[<sup>[3]</sup>](#cite_note-3) by Deepjyoti Roy and Mala Dutta. It discusses the development and evaluation of recommender systems, emphasizing the importance of algorithmic analysis. It also evaluates the performance metrics of recent contributions and identifies existing research gaps, aiming to guide future work on efficient recommender system design. In line with the paper, we’re going to use a collaborative filtering approach, specifically a model-based filtering system built on clustering techniques.

# Problem Statement

Users of digital music services often struggle to discover new songs that align with their individual preferences. With such a vast and diverse selection of tracks available, this is not a simple task for people with complex preferences. This project aims to refine the process of music recommendation by developing a system capable of accurately predicting and suggesting songs that users are likely to enjoy, based on a quantifiable profile of their musical preferences.

This profile is constructed using measurable attributes of songs they have previously listened to and favored, including but not limited to danceability, energy, loudness, valence, and genre. The core of the problem lies in effectively grouping songs using these attributes, for which we propose the application of a Gaussian Mixture Model (GMM). By clustering the dataset of songs with GMM based on the mentioned metrics, we can then match a user's profile against these clusters to recommend new songs that share characteristics with those the user has shown a preference for.
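The clustering-and-matching idea above can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `GaussianMixture`; the toy feature matrix, the choice of two components, and the helper name `recommend` are placeholders for illustration and not the project's final settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled audio features (e.g. danceability, energy):
# two well-separated groups of 50 "songs" each.
songs = np.vstack([
    rng.normal(loc=0.2, scale=0.05, size=(50, 2)),
    rng.normal(loc=0.8, scale=0.05, size=(50, 2)),
])

# Fit a GMM; n_components=2 is an assumption chosen to match the toy data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(songs)
song_clusters = gmm.predict(songs)


def recommend(liked_idx, n=5):
    """Return up to n unheard song indices from the user's most likely cluster."""
    # The user profile is the mean feature vector of the songs they liked.
    profile = songs[liked_idx].mean(axis=0, keepdims=True)
    cluster = gmm.predict(profile)[0]
    candidates = np.flatnonzero(song_clusters == cluster)
    unheard = np.setdiff1d(candidates, liked_idx)
    # Rank unheard candidates by distance to the profile (closest first).
    order = np.argsort(np.linalg.norm(songs[unheard] - profile, axis=1))
    return unheard[order][:n].tolist()


recs = recommend(liked_idx=[0, 1, 2])
```

Averaging the liked songs into a single profile is only one possible design; a user with several distinct tastes might be better served by assigning each liked song to a cluster separately.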

The effectiveness of our recommendation system will be measured using precision and recall metrics, evaluating the system’s ability to identify songs that meet the user's taste while minimizing irrelevant suggestions. This problem is both quantifiable, through the use of audio feature metrics, and replicable, as users consistently seek new music that fits their established preferences.
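To make the two metrics concrete, here is a small sketch of how precision and recall could be computed for one user. It assumes that "relevant" means songs the user actually liked in some held-out set; that definition, the function name `precision_recall`, and the track IDs are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the relevant songs."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    # Precision: fraction of recommended songs that are relevant.
    precision = hits / len(recommended) if recommended else 0.0
    # Recall: fraction of relevant songs that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical track IDs: 4 songs are recommended, 3 of them are among
# the 6 songs the user liked in the held-out set.
recommended = ["t1", "t2", "t3", "t4"]
relevant = ["t1", "t2", "t3", "t7", "t8", "t9"]
precision, recall = precision_recall(recommended, relevant)
print(precision, recall)  # 0.75 0.5
```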

# Data

The group used the Spotify Tracks Dataset found on Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download). The dataset contained `114,000` observations and `21` feature variables. Each observation is a song from Spotify and consists of features representing an index in the dataset, the Spotify track ID, the artist(s) who wrote it, the name of the album that the song comes from, the name of the track, its popularity, the duration, if it’s explicit, its danceability, how energetic it is, what key it’s in, how loud it is in decibels, the mode (major or minor), its speechiness, acousticness, instrumentalness, the probability of the track being a live recording, the valence, tempo, time signature, and genre.

We’ll preprocess the data by removing any incomplete or repeat observations, where a repeat is either an exact duplicate row or a song with the same track name and artists as another. We’ll also remove the explicitness, duration, popularity, liveness, and album name variables since they aren’t particularly critical. Any observations where no key is detected (key = -1) or where speechiness is above 0.65 (predicted to be exclusively speech) will be removed from the dataset, as will tracks credited to multiple artists. We’ll also make adjustments to the genre classifications: songs labeled 'acoustic', 'songwriter', 'singer-songwriter', 'happy', and 'sad' will be removed, songs labeled 'electro' will be relabeled as 'electronic', and songs labeled 'latino' will be relabeled as 'latin'. All variables with the datatype 'object' will be cast to strings.

The remaining critical variables are `artists` (string), `track_name` (string), `danceability` (float from 0.0 to 1.0 that describes how “danceable” a song is based on a combination of musical elements), `energy` (float), `key` (int that maps to pitches using Pitch Class notation), `loudness` (float, in decibels), `mode` (int where 0 is minor and 1 is major), `speechiness` (float where <= 0.33 is probably exclusively music and between 0.33 and 0.65 is a mix of music and speech, like rap), `instrumentalness` (float where the closer the value is to 1, the less vocal content), `valence` (float where high valence is more positive sentiment and low valence is more negative sentiment), `tempo` (float that represents BPM), `time_signature` (int from 3 to 7 that represents the beats per measure), and `track_genre` (string).
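As a reading aid for the encoded variables, the sketch below decodes `key`, `mode`, and `speechiness` into human-readable labels using the conventions described above. The helper name `describe_track` and the label strings are our own illustration and are not part of the dataset.

```python
# Standard Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_track(key, mode, speechiness):
    """Translate encoded key, mode, and speechiness values into labels."""
    key_name = PITCH_CLASSES[key] if 0 <= key < 12 else "no key detected"
    mode_name = "major" if mode == 1 else "minor"
    if speechiness <= 0.33:
        content = "music"
    elif speechiness <= 0.65:
        content = "mixed music and speech"
    else:
        content = "speech"
    return key_name, mode_name, content


print(describe_track(key=9, mode=0, speechiness=0.05))
# ('A', 'minor', 'music')
```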


```python
import pandas as pd

df = pd.read_csv('./data/dataset.csv')

# Remove incomplete observations and repeat observations
# (exact duplicate rows, then songs with the same track name and artists)
df = df.dropna()
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["track_name", "artists"])

# Remove observations without a detected key (key = -1) and observations
# predicted to be exclusively speech (speechiness > 0.65)
df = df[(df['key'] != -1) & (df['speechiness'] <= 0.65)]

# Remove tracks with multiple artists (artist names are separated by ';')
df = df[~df['artists'].str.contains(';')]

# Drop any observations with uninformative genre classifications
labels_to_remove = ['acoustic', 'songwriter', 'singer-songwriter', 'happy', 'sad']
df = df[~df['track_genre'].isin(labels_to_remove)]

# Combine genres with similar classifications
df['track_genre'] = df['track_genre'].replace({'electro': 'electronic', 'latino': 'latin'})

# Save an independent copy of our df before dropping columns
not_dropped_df = df.copy()

# Remove variables that aren't critical
df = df.drop(columns=['explicit', 'duration_ms', 'popularity', 'liveness', 'album_name'])

# One-hot encode the categorical variables
df_encoded = pd.get_dummies(df, columns=['artists', 'track_genre'])
```

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

While we tried a variety of methods to understand the patterns within our Spotify data, there are always limitations and potential areas for improvement. One of the first limitations comes at the initial step of preprocessing. To decrease dimensionality and improve computational efficiency, we removed and adjusted several features. For instance, in the context of this problem, we decided to remove the explicitness, duration, popularity, and liveness variables. We also removed any observations where no key was detected or where a track was predicted to be exclusively speech. While we initially considered these to be non-critical attributes, they could very well carry important information that would affect clustering and ultimately the success of a music recommendation system. More data could also help determine how well our clustering approaches scale to unseen tracks and might reveal patterns that are not visible in the current dataset.

There is also always a risk of losing information when any dimensionality reduction technique is performed, including UMAP. UMAP’s outcome is highly sensitive to its parameters, but fine-tuning those parameters can take a lot of time and computational resources. More exploration in this area could reveal patterns that were otherwise undiscovered. As for GMM, all of the data points are assumed to have been generated from a finite mixture of Gaussian distributions with unknown parameters. This might not always be the case with song tracks, especially given how complex their structure can be. With regard to our evaluation metrics, although the silhouette score provides a good measure of intra-cluster and inter-cluster similarity, it might not capture all aspects of the relationships between clusters. Its effectiveness also depends on the distance metric, which we attempted to align with the data’s geometry.
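To show what the silhouette score measures and why it depends on the distance metric, here is a from-scratch sketch of the standard definition: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the point's score is `(b - a) / max(a, b)`. Euclidean distance and the function name `mean_silhouette` are assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette score using Euclidean distance (assumed metric)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters that mix the two groups
```

In this toy case `good` is close to 1 and `bad` is negative, matching the usual reading of the score. Swapping `math.dist` for another distance function would change both values, which is the metric dependence noted above.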

### Ethics & Privacy

Sampling Bias - Although this is a randomized Spotify music dataset, we cannot ensure that it accurately captures the wide range of music on Spotify or the numerous genres that exist in music. Moreover, we acknowledge that a recommendation system built on Spotify tracks may not accurately predict a user’s next preferred choice of music on other music streaming platforms, or more generally outside the scope of the songs available in the dataset.

Anonymity - We have addressed the privacy concern of anonymity by using song tracks that are tied to no identifiable user data and that solely represent the physical properties and measurable features of music. The dataset contains no names, ages, genders, locations, or any other revealing information.

Licensing - There are no licensing conflicts with artists or music labels, as the dataset consists of Spotify tracks, which become publicly available only once an agreement has been established between the artist and Spotify. We have no knowledge of, and are not involved in, this agreement between the artist and Spotify. Our research is based on a public dataset on Kaggle.

Affiliations - We are not receiving any funds from Spotify, nor is our research sponsored by Spotify. We have no affiliation with Spotify, and our research is not considering, and will not consider, any suggestions or recommendations from Spotify if communication were to be established. Our research is not intended to benefit a particular party and is solely focused on investigating potential music recommendation predictions.

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Anonymous User. (16 Feb 2018) Bad Music Recommendations **Apple Support Community.** https://discussions.apple.com/thread/8284099?sortBy=best 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) The Data School. (2 Apr 2023) Machine Learning 101 and The Spotify Case **The Data School.** https://www.thedataschool.com.au/mipadmin/machine-learning-101-and-the-spotify-case/#:~:text=Supervised%20Learning%20%E2%80%93%20Music%20Recommendations%3A%20Spotify's,you're%20likely%20to%20enjoy._ 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Roy, D., Dutta, M. (3 May 2022) A systematic review and research perspective on recommender systems **Journal of Big Data 9, 59.** https://doi.org/10.1186/s40537-022-00592-5