# <a id='toc1_'></a>[The ***Best*** Albums of All Time](#toc0_)

This notebook presents my work on recommender systems that suggest music albums similar to a given album, based on musical characteristics such as acousticness, danceability, and energy.

> We build a content-based filter that recommends music albums using several Spotify features of the best albums of all time.

<img src="/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/My Github/Recommender-Systems/img/albums.jpeg" width="650" height="350">

**Table of contents**<a id='toc0_'></a>    
- [Overview](#toc2_)    
  - [Data](#toc2_1_)    
- [Data Exploration](#toc3_)    
  - [Descriptive Statistics](#toc3_1_)    
  - [Data Visualization](#toc3_2_)    
    - [K-Means](#toc3_2_1_)    
    - [PAM Clustering](#toc3_2_2_)    
  - [Correlation Matrix](#toc3_3_)    
  - [Missing Values](#toc3_4_)    
- [Content Based Filtering](#toc4_)    
  - [Model 1: Standard CBF](#toc4_1_)    
  - [Model 2: CBF with Weighted Similarity Measure](#toc4_2_)    
- [Results and Conclusion](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->


# <a id='toc2_'></a>[Overview](#toc0_)

To build a content-based recommender system for music albums, we use a dataset containing information about thousands of albums, including danceability, energy, and other musical features. The process followed and the models generated are outlined below.

1. Data Exploration: Identify initial trends and get a better understanding of the data. 
2. Data Preprocessing: Clean and pre-process the data to remove any missing or inconsistent values. We may also need to normalize the musical features to ensure that they are on a common scale.
3. Recommender System Generation: Compute a similarity matrix between albums, then generate recommendations for a given album by selecting the albums most similar to it.
    * Model 1: Content Based Recommender System using only musical features
    * Model 2: Content Based Recommender System using musical features and album ratings

To achieve Model 2, we use a weighted similarity measure, such as weighted cosine similarity or weighted Pearson correlation, to measure the similarity between albums. This lets us take album ratings into account so that highly rated albums are ranked higher. We then compare the results of Model 1 and Model 2.

## <a id='toc2_1_'></a>[Data](#toc0_)

The dataset of albums and music features was taken from [Kaggle](https://www.kaggle.com/datasets/lucascantu/top-5000-albums-of-all-time-spotify-features). It contains information about 4402 albums, including Spotify features that can be retrieved through the Spotify API.

# <a id='toc3_'></a>[Data Exploration](#toc0_)

The goal here is to get a better understanding of the data we are working with, identify any patterns, anomalies, or relationships between the features, and inform the modeling process.

## <a id='toc3_1_'></a>[Descriptive Statistics](#toc0_)

We calculate descriptive statistics such as the mean, median, mode, standard deviation, and quartiles for each feature to understand how the data is distributed. This helps us identify any outliers or skewness that may need to be addressed.
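As a minimal sketch of these statistics for a single feature, the snippet below uses only Python's standard `statistics` module. The `danceability` values are made up for illustration and are not taken from the dataset; in the notebook itself a dataframe summary such as pandas' `DataFrame.describe` would likely be more convenient.

```python
import statistics

# Hypothetical danceability values for a handful of albums (not real data).
danceability = [0.42, 0.55, 0.61, 0.48, 0.55, 0.93]


def describe(values):
    """Return the summary statistics discussed above for one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
    }


summary = describe(danceability)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean noticeably above the median, as in this toy sample, hints at right skew caused by a few large values.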

## <a id='toc3_2_'></a>[Data Visualization](#toc0_)

We visualize the data using plots such as histograms, scatter plots, and box plots to better understand the relationships between the features. We also explore two non-hierarchical clustering techniques, K-Means and PAM. Clustering the albums on their musical features can reveal patterns or structure in the data that inform the modeling process.

### <a id='toc3_2_1_'></a>[K-Means](#toc0_)

K-means is a popular clustering algorithm that is used to group similar data points into clusters. The goal of k-means is to partition the data into k clusters, where k is a pre-specified number, such that the data points within each cluster are as similar as possible, and the data points across different clusters are as dissimilar as possible.

The algorithm works by initializing k random centroids, and then assigning each data point to the closest centroid. The centroids are then recomputed as the mean of all data points assigned to that centroid. The process of assigning data points to centroids and recomputing centroids is repeated until the centroids stop changing, or until a maximum number of iterations is reached.
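The assign-and-update loop described above can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (squared Euclidean distance, initial centroids sampled from the data points, a fixed seed), and the two-feature `points` are invented for the example. A library implementation would normally be used on the full dataset.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k random data points
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical (danceability, energy) pairs forming two loose groups.
points = [
    (0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.95),
]
centroids, clusters = k_means(points, k=2)
print(centroids)
```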

### <a id='toc3_2_2_'></a>[PAM Clustering](#toc0_)

PAM (Partitioning Around Medoids) is closely related to the k-means algorithm. The main difference is that in k-means each cluster is represented by a centroid, the mean of the data points assigned to it, whereas in PAM each cluster is represented by a medoid, which is an actual data point. This makes PAM more robust to outliers, because a medoid must be one of the observed points, while a mean can be pulled far from the bulk of the cluster by a few extreme values.
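To make the medoid idea concrete, here is a simplified sketch of PAM's swap phase in plain Python. It is not the full algorithm: the usual BUILD initialisation is replaced by naively taking the first `k` points, swaps are accepted as soon as they lower the cost, and Manhattan distance is an arbitrary choice. The `points` are hypothetical and include one mild outlier.

```python
from itertools import product


def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoid_idx, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoid_idx) for p in points)


def pam(points, k, dist=manhattan):
    medoids = list(range(k))  # assumption: naive start, BUILD phase omitted
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid point.
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids.copy()
            trial[slot] = candidate
            cost = total_cost(points, trial, dist)
            if cost < best:  # keep any swap that lowers the total cost
                medoids, best, improved = trial, cost, True
    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


# Hypothetical feature pairs: two groups plus a mild outlier at the end.
points = [
    (0.10, 0.10), (0.12, 0.11), (0.11, 0.13),
    (0.90, 0.90), (0.88, 0.91), (0.91, 0.87),
    (1.50, 1.50),
]
medoids, labels = pam(points, k=2)
print([points[m] for m in medoids])
```

Because each medoid is an index into `points`, the cluster representatives are always real albums, even with the outlier present.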

## <a id='toc3_3_'></a>[Correlation Matrix](#toc0_)

We also compute the correlation matrix to identify any strong relationships between the features. Strongly correlated features carry overlapping information, so this helps us decide which features to keep for the recommendations and informs the modeling process.
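A small sketch of a Pearson correlation matrix, written out by hand so the calculation is visible. The feature names come from the description above, but the values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature columns for four albums (not real data).
features = {
    "energy": [0.2, 0.4, 0.6, 0.8],
    "danceability": [0.3, 0.5, 0.4, 0.7],
    "acousticness": [0.9, 0.6, 0.5, 0.1],
}

names = list(features)
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

In this toy data, energy and acousticness are strongly negatively correlated, which is the kind of relationship the matrix is meant to surface.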

## <a id='toc3_4_'></a>[Missing Values](#toc0_)

We now check for missing values in the data and decide how to handle them. Depending on the amount of missing data, we may remove the affected rows, fill in the gaps using imputation techniques such as mean or median imputation, or use a modeling technique that can handle missing data.
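The first two options can be sketched as below. This assumes missing values are represented as `None` and that each album is a dictionary of features; both are choices made for this example, and the rows themselves are hypothetical.

```python
import statistics


def count_missing(rows):
    """Number of missing (None) entries per feature."""
    return {key: sum(r[key] is None for r in rows) for key in rows[0]}


def drop_incomplete(rows):
    """Option 1: remove every row that has at least one missing value."""
    return [r for r in rows if None not in r.values()]


def impute_median(column):
    """Option 2: replace missing entries with the median of observed ones."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical albums; None marks a missing feature value.
rows = [
    {"energy": 0.2, "danceability": 0.5},
    {"energy": None, "danceability": 0.7},
    {"energy": 0.6, "danceability": 0.4},
    {"energy": 0.4, "danceability": None},
]

missing = count_missing(rows)
complete_rows = drop_incomplete(rows)
energy_filled = impute_median([r["energy"] for r in rows])
print(missing, len(complete_rows), energy_filled)
```

Dropping rows is simplest when few values are missing, while imputation keeps every album available for recommendation.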

# <a id='toc4_'></a>[Content Based Filtering](#toc0_)

## <a id='toc4_1_'></a>[Model 1: Standard CBF](#toc0_)
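As a rough sketch of what a feature-only content-based recommender can look like, the snippet below scales each feature to a common range, computes cosine similarity, and returns the most similar albums. Cosine similarity and min-max scaling are assumptions made for this illustration, and the album names and feature values are hypothetical.

```python
import math


def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(albums, features, query, top_n=2):
    """Return the top_n albums most similar to `query`."""
    scaled = min_max_scale(features)
    q = albums.index(query)
    scores = [
        (cosine(scaled[q], scaled[i]), albums[i])
        for i in range(len(albums))
        if i != q
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_n]]


# Hypothetical albums with (acousticness, danceability, energy) values.
albums = ["Album A", "Album B", "Album C", "Album D"]
features = [
    [0.9, 0.30, 0.20],
    [0.8, 0.35, 0.25],
    [0.1, 0.80, 0.90],
    [0.2, 0.70, 0.85],
]

recommendations = recommend(albums, features, "Album A")
print(recommendations)
```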

## <a id='toc4_2_'></a>[Model 2: CBF with Weighted Similarity Measure](#toc0_)

To incorporate album ratings into the content-based recommender system, we use a weighted similarity measure that takes into account both the musical features and the album ratings. The process is very similar to that of Model 1. We first clean and preprocess the data to remove any missing or inconsistent values, and normalize the musical features and album ratings so that they are on a common scale. We then compute the similarity between each pair of albums using a weighted similarity measure, such as weighted cosine similarity or weighted Pearson correlation.

Based on the resulting similarity matrix, we generate recommendations for a given album by selecting the albums most similar to it, ranked on both musical features and album ratings. Albums with higher average ratings should be ranked higher, since a high average rating indicates that users rate the album well.

***Note***: Incorporating album ratings into the recommendations ranks albums with higher average ratings higher, which may provide a better overall experience for users.
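One possible way to combine the two signals is sketched below: a weighted cosine similarity over the features, blended with a normalized rating. The blending parameter `alpha`, the per-feature `weights`, and all album data are assumptions introduced for this example, so this should be read as one option rather than the notebook's definitive scoring rule.

```python
import math


def weighted_cosine(a, b, w):
    """Cosine similarity where feature i contributes with weight w[i]."""
    dot = sum(wi * x * y for wi, x, y in zip(w, a, b))
    norm_a = math.sqrt(sum(wi * x * x for wi, x in zip(w, a)))
    norm_b = math.sqrt(sum(wi * y * y for wi, y in zip(w, b)))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_with_ratings(albums, features, ratings, weights, query, alpha=0.7):
    """Rank albums by alpha * similarity + (1 - alpha) * scaled rating."""
    q = albums.index(query)
    lo, hi = min(ratings), max(ratings)
    results = []
    for i, name in enumerate(albums):
        if i == q:
            continue
        sim = weighted_cosine(features[q], features[i], weights)
        rating = (ratings[i] - lo) / (hi - lo) if hi > lo else 0.0
        results.append((alpha * sim + (1 - alpha) * rating, name))
    results.sort(reverse=True)
    return [name for _, name in results]


# Hypothetical albums: features already scaled to [0, 1], ratings made up.
albums = ["Album A", "Album B", "Album C", "Album D"]
features = [
    [0.90, 0.30, 0.20],
    [0.85, 0.35, 0.25],
    [0.80, 0.40, 0.30],
    [0.10, 0.80, 0.90],
]
ratings = [3.5, 3.6, 4.4, 4.5]
weights = [1.0, 1.0, 1.0]  # equal feature weights for this example

features_only = rank_with_ratings(albums, features, ratings, weights, "Album A", alpha=1.0)
with_ratings = rank_with_ratings(albums, features, ratings, weights, "Album A", alpha=0.7)
print(features_only, with_ratings)
```

With `alpha=1.0` the ranking depends on features alone. Lowering `alpha` lets the better-rated Album C overtake the slightly more similar Album B, which is the behaviour the note above describes.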

# <a id='toc5_'></a>[Results and Conclusion](#toc0_)