# Introduction

Looking to incorporate some AI into your hack? Look no further. In this guide, I’ll walk you through a simple, hands-on introduction to recommender systems, where we build a playlist recommender.

This guide is primarily aimed at beginners, but I’ve also linked some more advanced resources at the bottom if you want to go deeper.

---

# Prerequisites

We’ll be using a combination of **pandas** (for data preprocessing) and **scikit-learn** (`sklearn`) for the actual learning and predictions.

No deep understanding of the underlying math is required; this guide takes a practical, hands-on approach. If you’re not familiar with pandas, though, I’d recommend taking a quick look at [this W3Schools pandas tutorial](https://www.w3schools.com/python/pandas/default.asp).


In [1]:
%pip install pandas


In [2]:
%pip install scikit-learn


In [3]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

Next, we load some datasets that I got from [Kaggle](https://www.kaggle.com/). It's a great source for open data.

In [4]:
# Load in the datasets

artist_df = pd.read_csv("SpotGenTrack/Data Sources/spotify_artists.csv")
tracks_df = pd.read_csv("SpotGenTrack/Data Sources/spotify_tracks.csv")
low_level_audio_df = pd.read_csv("SpotGenTrack/Features Extracted/low_level_audio_features.csv")

We're going to use low-level audio features to make the actual predictions.

In [8]:
print(list(low_level_audio_df.columns))

['Unnamed: 0', 'Chroma_1', 'Chroma_10', 'Chroma_11', 'Chroma_12', 'Chroma_2', 'Chroma_3', 'Chroma_4', 'Chroma_5', 'Chroma_6', 'Chroma_7', 'Chroma_8', 'Chroma_9', 'MEL_1', 'MEL_10', 'MEL_100', 'MEL_101', 'MEL_102', 'MEL_103', 'MEL_104', 'MEL_105', 'MEL_106', 'MEL_107', 'MEL_108', 'MEL_109', 'MEL_11', 'MEL_110', 'MEL_111', 'MEL_112', 'MEL_113', 'MEL_114', 'MEL_115', 'MEL_116', 'MEL_117', 'MEL_118', 'MEL_119', 'MEL_12', 'MEL_120', 'MEL_121', 'MEL_122', 'MEL_123', 'MEL_124', 'MEL_125', 'MEL_126', 'MEL_127', 'MEL_128', 'MEL_13', 'MEL_14', 'MEL_15', 'MEL_16', 'MEL_17', 'MEL_18', 'MEL_19', 'MEL_2', 'MEL_20', 'MEL_21', 'MEL_22', 'MEL_23', 'MEL_24', 'MEL_25', 'MEL_26', 'MEL_27', 'MEL_28', 'MEL_29', 'MEL_3', 'MEL_30', 'MEL_31', 'MEL_32', 'MEL_33', 'MEL_34', 'MEL_35', 'MEL_36', 'MEL_37', 'MEL_38', 'MEL_39', 'MEL_4', 'MEL_40', 'MEL_41', 'MEL_42', 'MEL_43', 'MEL_44', 'MEL_45', 'MEL_46', 'MEL_47', 'MEL_48', 'MEL_49', 'MEL_5', 'MEL_50', 'MEL_51', 'MEL_52', 'MEL_53', 'MEL_54', 'MEL_55', 'MEL_56', 'MEL

In [9]:
artist = "Arctic Monkeys"
song = "Mardy Bum"

The dataset is split across multiple CSV files, so we'll do some light preprocessing to locate the song the user is interested in. Once we’ve identified it, we can generate a playlist of similar tracks.

In [11]:
# Extract the correct track

# Look up the artist's ID by name
artist_data = artist_df.loc[artist_df["name"] == artist]
artist_id = artist_data["id"].iloc[0]

# Find songs with a matching name
name_matches = tracks_df[tracks_df["name"] == song]

# Keep only the matches whose artist list contains our artist's ID
is_by_artist = name_matches["artists_id"].str.contains(artist_id, regex=False, na=False)
song_data = name_matches[is_by_artist]

track_id = song_data["id"].iloc[0]
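If the artist or song isn't in the dataset, the `.iloc[0]` calls above fail with an `IndexError`. Here's a minimal sketch of a more forgiving lookup. The helper name `find_track_id` and the toy dataframes are made up for illustration; only the column names (`id`, `name`, `artists_id`) come from the cells above.

```python
import pandas as pd

def find_track_id(artists, tracks, artist_name, song_name):
    """Return the ID of the first track matching both names, or None."""
    artist_rows = artists.loc[artists["name"] == artist_name]
    if artist_rows.empty:
        return None  # unknown artist
    artist_id = artist_rows["id"].iloc[0]

    # Songs with the right title, narrowed down to the right artist
    name_matches = tracks.loc[tracks["name"] == song_name]
    is_by_artist = name_matches["artists_id"].str.contains(
        artist_id, regex=False, na=False
    )
    by_artist = name_matches[is_by_artist]
    if by_artist.empty:
        return None  # no such song by this artist
    return by_artist["id"].iloc[0]

# Toy data (hypothetical) with two songs that share a title
toy_artists = pd.DataFrame({"id": ["a1", "a2"], "name": ["Artist One", "Artist Two"]})
toy_tracks = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "name": ["Same Title", "Same Title", "Other"],
    "artists_id": ["['a1']", "['a2']", "['a1']"],
})

print(find_track_id(toy_artists, toy_tracks, "Artist Two", "Same Title"))
print(find_track_id(toy_artists, toy_tracks, "Nobody", "Same Title"))
```

Returning `None` lets you show a friendly "not found" message in your hack instead of crashing.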


Here, we extract the audio features for our target song, as well as the full feature dataframe we'll compare it against.

In [12]:
low_level_audio = low_level_audio_df.loc[low_level_audio_df["track_id"] == track_id]
target = low_level_audio.drop(columns=["track_id"])
features = low_level_audio_df.drop(columns=["track_id"])
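One thing to watch: the two `drop` calls above remove only `track_id`, so any other non-audio column stays in the comparison. The column list printed earlier includes `Unnamed: 0`, which I'm assuming is a leftover row index from the CSV export (I haven't verified this for this dataset). A rough sketch of a helper that strips such columns is below; the names `NON_FEATURE_COLUMNS` and `feature_matrix` are mine, not part of the dataset or any library.

```python
import pandas as pd

# Columns that are bookkeeping rather than audio features (an assumption)
NON_FEATURE_COLUMNS = ["Unnamed: 0", "track_id", "similarity"]

def feature_matrix(df):
    """Return only the feature columns, skipping any listed column that is absent."""
    return df.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")

# Toy frame (hypothetical values) shaped like the audio-features table
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Chroma_1": [0.2, 0.5],
    "MEL_1": [0.1, 0.3],
    "track_id": ["t1", "t2"],
})

print(list(feature_matrix(toy).columns))
```

Including `similarity` in the list also keeps the helper safe to re-run after the scores have been added to the dataframe. Keep in mind that changing which columns go into the comparison can change the resulting playlist.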

This is where the actual magic happens. The `cosine_similarity` function calculates the cosine of the angle between feature vectors (the rows in our dataset). Songs with a smaller angle to the target have a higher cosine similarity and are considered more similar.

In [13]:
similarities = cosine_similarity(target, features)[0]
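To make the idea concrete, here's a small sketch that computes cosine similarity by hand (dot product divided by the product of the vector lengths) and checks it against scikit-learn. The vectors are toy values chosen for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One "target" vector and three candidates (toy values)
a = np.array([[1.0, 0.0]])
b = np.array([[1.0, 1.0],   # 45 degrees away from a
              [0.0, 2.0],   # 90 degrees away from a
              [3.0, 0.0]])  # same direction as a, just longer

# cos(angle) = (a . b) / (|a| * |b|), computed for every row of b
dots = a @ b.T
lengths = np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1)
manual = dots / lengths

library = cosine_similarity(a, b)

print(manual)
print(np.allclose(manual, library))
```

Note that the third candidate scores a perfect 1.0 even though it is three times as long: cosine similarity only looks at direction, not magnitude.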

Now we’ll add the similarity scores back into the dataframe and sort the songs by their similarity to the target track.

In [14]:
low_level_audio_df['similarity'] = similarities

# Get top recommendations
# TODO: remove the exact match (the target track itself currently ranks first)
recommended = low_level_audio_df.sort_values(by='similarity', ascending=False)

best_10_track_ids = recommended.head(10)["track_id"].tolist()
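The target song is always most similar to itself, so it ends up at the top of the list. One possible way to leave it out is sketched below. The helper name `top_n_similar` and the toy scores are hypothetical; the `track_id` and `similarity` column names match the cell above.

```python
import pandas as pd

def top_n_similar(scored, target_track_id, n=10):
    """Return the IDs of the n most similar tracks, excluding the target itself."""
    others = scored[scored["track_id"] != target_track_id]
    return others.nlargest(n, "similarity")["track_id"].tolist()

# Toy scores (hypothetical): t1 is the target, so it matches itself perfectly
toy_scores = pd.DataFrame({
    "track_id": ["t1", "t2", "t3", "t4"],
    "similarity": [1.0, 0.4, 0.9, 0.7],
})

print(top_n_similar(toy_scores, "t1", n=2))
```

In the notebook this would be called with `low_level_audio_df` and `track_id`, which gives ten recommendations that don't include the song you started from.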


All that's left is to retrieve the song names using their track IDs and enjoy the playlist!

In [15]:
playlist = []
for rec_id in best_10_track_ids:  # avoid shadowing the built-in `id`
    track_row = tracks_df[tracks_df["id"] == rec_id]
    song_name = track_row["name"].iloc[0]
    playlist.append(song_name)
    print(song_name)


Mardy Bum
Rage
Sticky
Siempre Soñe
Wanna Be Like You
Tu Salto
New Beginning
Is
Be Your Shadow
Bad Blood
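The name lookup can also be done without a loop by indexing the track table by ID. This is just a sketch: `names_for_ids` and the toy table are made up for illustration, and I'm assuming that keeping the first row is acceptable if an ID appears more than once.

```python
import pandas as pd

def names_for_ids(tracks, track_ids):
    """Return track names in the same order as track_ids."""
    name_by_id = tracks.drop_duplicates("id").set_index("id")["name"]
    return name_by_id.reindex(track_ids).tolist()

# Toy track table (hypothetical)
toy_tracks = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "name": ["Song A", "Song B", "Song C"],
})

print(names_for_ids(toy_tracks, ["t3", "t1"]))
```

`reindex` keeps the ranking order of the recommendations, which a plain `isin` filter would not.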


Well done! You've just built your first recommender system. If you want to take it a step further, I recommend checking out the following resources:

- [The difference between content and collaborative filtering](https://thecleverprogrammer.com/2023/04/20/content-based-filtering-and-collaborative-filtering-difference/)
- [An in-depth guide to best practices for recommender systems](https://github.com/recommenders-team/recommenders)
- [Theory behind recommender systems](https://nafeea3000.medium.com/recommender-systems-c8db209dd0d3)

