<a href="https://colab.research.google.com/github/jmgang/SpoTwoFy-project-notebooks/blob/main/notebooks/4a_SpoTwoFy_Project_Create_Recommender_Engine_Pool_nonML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Recommender Engine Pool

*What is a recommender engine?*

A recommender engine is an information filtering system that predicts a user's preferences for a set of items (such as products, movies, or music) based on their previous interactions with those items or similar items, and provides personalized recommendations for new items.

*What is a recommender engine pool?*

The recommender engine pool is the data source from which the recommender engine draws its recommendations. It consists of (1) the items considered for recommendation and (2) one or more measures that determine each item's fitness to be recommended.

| item | measure1 | measure2 |  
|------|----------|----------|
| 1    | 0.1      | 0.5      |  
| 2    | 0.2      | 0.6      |
| 3    | 0.3      | 0.7      |
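
As a minimal sketch of how such a pool could be used, the toy table above can be held in a pandas DataFrame and ranked by one of its measures. The column names mirror the table. Treating a higher `measure1` as a better fit is an assumption made only for illustration, not the method used elsewhere in this project.

```python
import pandas as pd

# Toy pool mirroring the table above: one row per item, one column per measure
pool_df = pd.DataFrame({
    "item": [1, 2, 3],
    "measure1": [0.1, 0.2, 0.3],
    "measure2": [0.5, 0.6, 0.7],
})

# Assumption: a higher measure1 means a better fit, so sort in descending order
top_items = pool_df.sort_values("measure1", ascending=False).head(2)
print(top_items["item"].tolist())
```

Any other measure, or a combination of measures, could be used for the ranking in the same way.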


In this notebook, we will create a track recommendation pool from playlist track data for a set of genres, clean it, and save it for use by the recommender engine.

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import joblib # joblib==1.2.0, install if needed

In [3]:
# Mount GDrive folders
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Set home directory
import os
home_dir = "/content/drive/MyDrive/Colab Notebooks/Sprint 3/"
os.chdir(home_dir)

In [5]:
DATA_DIRECTORY = 'https://raw.githubusercontent.com/jmgang/SpoTwoFy-project-notebooks/main/data/'

In [6]:
def load_playlist_data(genre_names):
    playlist_df_list = []
    for genre in tqdm(genre_names):
        filename = DATA_DIRECTORY + 'playlists/' + genre.lower() + '_playlist_data.csv'
        print(filename)
        pdf = pd.read_csv(filename)
        pdf['genre'] = genre.lower()
        playlist_df_list.append(pdf)
    return pd.concat(playlist_df_list)

In [7]:
def load_track_data(genre_names):
    track_df_list = []
    for genre in tqdm(genre_names):
        filename = DATA_DIRECTORY + 'playlists/' + genre.lower() + '_playlist_tracks_data.csv'
        tdf = pd.read_csv(filename)
        # Remove duplicates based on 'track_id'; copy so the column assignment below acts on an independent frame
        tdf = tdf.drop_duplicates(subset=['track_id']).copy()
        tdf['genre'] = genre.lower()
        track_df_list.append(tdf)
    return pd.concat(track_df_list)

## 1. Read data

Read playlist data

In [8]:
# chart_tracks_df = pd.read_csv('data/ph_spotify_daily_charts_tracks.csv')
# chart_tracks_df.head()

genre_names = ['alternative_rock', 'pop', 'sad_opm']
playlist_df = load_playlist_data(genre_names)

  0%|          | 0/3 [00:00<?, ?it/s]

https://raw.githubusercontent.com/jmgang/SpoTwoFy-project-notebooks/main/data/playlists/alternative_rock_playlist_data.csv


 67%|██████▋   | 2/3 [00:00<00:00,  2.78it/s]

https://raw.githubusercontent.com/jmgang/SpoTwoFy-project-notebooks/main/data/playlists/pop_playlist_data.csv
https://raw.githubusercontent.com/jmgang/SpoTwoFy-project-notebooks/main/data/playlists/sad_opm_playlist_data.csv


100%|██████████| 3/3 [00:00<00:00,  3.08it/s]


Read tracks data

In [9]:
tracks_df = load_track_data(genre_names)

100%|██████████| 3/3 [00:01<00:00,  2.33it/s]


Remove tracks with missing values (for example, unavailable or incomplete audio features)

In [10]:
print(len(tracks_df))
tracks_df = tracks_df.dropna()
print(len(tracks_df))

11977
11977
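
Note that `dropna()` without arguments drops a row when any column is missing, not only the audio features. If only the audio features should decide whether a track is kept, the check can be restricted with `subset`. The sketch below uses a toy frame. The column names `album_name`, `danceability`, and `energy` are assumptions for illustration, since the real dataset's columns are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns), not the real playlist tracks
toy_tracks_df = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "album_name": ["x", None, "z"],       # missing value in a non-feature column
    "danceability": [0.7, 0.5, np.nan],   # missing audio feature
    "energy": [0.8, 0.6, 0.4],
})

feature_cols = ["danceability", "energy"]  # hypothetical audio feature columns

# Drops rows with a missing value in ANY column
all_cols_clean = toy_tracks_df.dropna()

# Drops rows only when one of the listed feature columns is missing
features_clean = toy_tracks_df.dropna(subset=feature_cols)

print(len(all_cols_clean), len(features_clean))
```

In the run above, both counts are 11977, so no row had a missing value and the two approaches would give the same result on this data.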


Remove duplicate tracks (same artist and track name)

In [11]:
print(len(tracks_df))
tracks_df = tracks_df.drop_duplicates(subset=['artist_id','track_name'])
print(len(tracks_df))

11977
10305
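
`drop_duplicates` keeps the first occurrence of each `artist_id` and `track_name` pair by default, so when a track appears under more than one genre, the genre that survives depends on row order. A minimal sketch on toy data (the IDs and names below are made up for illustration):

```python
import pandas as pd

# Toy data: t1 and t2 are the same song by the same artist, tagged with different genres
toy_tracks_df = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "artist_id": ["a1", "a1", "a2"],
    "track_name": ["Song A", "Song A", "Song A"],
    "genre": ["pop", "sad_opm", "pop"],
})

# Default keep='first': the earliest row of each (artist_id, track_name) pair is kept
deduped_df = toy_tracks_df.drop_duplicates(subset=["artist_id", "track_name"])
print(deduped_df[["track_id", "genre"]])
```

Here `t2` is dropped and `t3` is kept, because `t3` shares only the track name with `t1`, not the artist.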


## 2. Save recommender engine pool

In [12]:
# save data
tracks_df.to_csv("data/spotify_tracks_hale_no_ml_rec_pool.csv", index=False, encoding='utf-8')
# Optional: download the saved file from Colab
# from google.colab import files
# files.download('data/spotify_tracks_hale_no_ml_rec_pool.csv')
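
A quick way to confirm that a saved pool can be read back is a round-trip check. The sketch below writes a toy frame to a temporary directory rather than the project's `data/` folder. The file name `toy_rec_pool.csv` and the sample rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy pool with a non-ASCII title to exercise the UTF-8 encoding
toy_pool_df = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "track_name": ["Song A", "Canción B"],
    "energy": [0.4, 0.8],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_rec_pool.csv")  # hypothetical file name
    # index=False keeps the row index out of the file, as in the save cell above
    toy_pool_df.to_csv(path, index=False, encoding="utf-8")
    reloaded_df = pd.read_csv(path, encoding="utf-8")

print(list(reloaded_df.columns), len(reloaded_df))
```

Because `index=False` is used, the reloaded frame has the same columns as the original and no extra unnamed index column.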
