# Fallback/Basic Recommender

A content-based music recommender built on k-nearest neighbors over Spotify audio features. Given a seed track, this notebook returns the tracks whose audio characteristics are most similar to it.

## 1. Setup & Dependencies

Import required libraries for data manipulation, preprocessing, and machine learning.

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

## 2. Data Loading

Load the Spotify dataset, which contains track metadata and audio features, from Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

# Define the path where you saved the file in Google Drive
drive_path = '/content/drive/My Drive/Colab Notebooks/musix'


# Load the CSV file into a pandas DataFrame
spotify_df = pd.read_csv(drive_path + '/spotify.csv')

# Print the DataFrame shape to verify the load
print(f"Successfully loaded 'spotify.csv' from Google Drive. Shape: {spotify_df.shape}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Successfully loaded 'spotify.csv' from Google Drive. Shape: (114000, 20)


In [36]:
df = spotify_df.copy()
df.shape

(114000, 20)

## 3. Exploratory Data Analysis

Examine the dataset structure, identify missing values, and prepare features for the recommendation model.

In [37]:
df

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.4610,1,-6.746,0,0.1430,0.0322,0.000001,0.3580,0.7150,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.420,0.1660,1,-17.235,1,0.0763,0.9240,0.000006,0.1010,0.2670,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.3590,0,-9.734,1,0.0557,0.2100,0.000000,0.1170,0.1200,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.9050,0.000071,0.1320,0.1430,181.740,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.4430,2,-9.681,1,0.0526,0.4690,0.000000,0.0829,0.1670,119.949,4,acoustic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113995,2C3TZjDRiAzdyViavDJ217,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Sleep My Little Boy,21,384999,False,0.172,0.2350,5,-16.393,1,0.0422,0.6400,0.928000,0.0863,0.0339,125.995,5,world-music
113996,1hIz5L4IB9hN3WRYPOCGPw,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Water Into Light,22,385000,False,0.174,0.1170,0,-18.318,0,0.0401,0.9940,0.976000,0.1050,0.0350,85.239,4,world-music
113997,6x8ZfSoqDjuNa5SVP5QjvX,Cesária Evora,Best Of,Miss Perfumado,22,271466,False,0.629,0.3290,0,-10.895,0,0.0420,0.8670,0.000000,0.0839,0.7430,132.378,4,world-music
113998,2e6sXL2bYv4bSz6VTdnfLs,Michael W. Smith,Change Your World,Friends,41,283893,False,0.587,0.5060,7,-10.889,1,0.0297,0.3810,0.000000,0.2700,0.4130,135.960,4,world-music


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          114000 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        114000 non-null  int64  
 5   duration_ms       114000 non-null  int64  
 6   explicit          114000 non-null  bool   
 7   danceability      114000 non-null  float64
 8   energy            114000 non-null  float64
 9   key               114000 non-null  int64  
 10  loudness          114000 non-null  float64
 11  mode              114000 non-null  int64  
 12  speechiness       114000 non-null  float64
 13  acousticness      114000 non-null  float64
 14  instrumentalness  114000 non-null  float64
 15  liveness          114000 non-null  float64
 16  valence           11

In [39]:
df["track_id"] = df["track_id"].astype(str)
df = df.drop_duplicates(subset=["track_id"]).reset_index(drop=True)
df.shape


(89741, 20)

In [None]:
# Ensure track_id is string type and remove duplicates
df["track_id"] = df["track_id"].astype(str)
df = df.drop_duplicates(subset=["track_id"]).reset_index(drop=True)

# Define required columns for the recommendation model
required_cols = [
    "track_id","track_name","artists","album_name","track_genre",
    "popularity","duration_ms","explicit","danceability","energy","loudness",
    "mode","speechiness","acousticness","instrumentalness","liveness","valence",
    "tempo","time_signature","key"
]

# Validate all required columns exist
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

# Remove rows with missing values in required columns
df = df.dropna(subset=required_cols).reset_index(drop=True)

print("After cleaning:", df.shape)
df[required_cols].head()

After cleaning: (89740, 20)


Unnamed: 0,track_id,track_name,artists,album_name,track_genre,popularity,duration_ms,explicit,danceability,energy,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,key
0,5SuOikwiRyPMVoIQDJUgSV,Comedy,Gen Hoshino,Comedy,acoustic,73,230666,False,0.676,0.461,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,1
1,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,Ben Woodward,Ghost (Acoustic),acoustic,55,149610,False,0.42,0.166,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,1
2,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,Ingrid Michaelson;ZAYN,To Begin Again,acoustic,57,210826,False,0.438,0.359,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,0
3,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,acoustic,71,201933,False,0.266,0.0596,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,0
4,5vjLSffimiIP26QG5WcN2K,Hold On,Chord Overstreet,Hold On,acoustic,82,198853,False,0.618,0.443,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,2
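
The cleaning steps above (deduplicate on `track_id`, then drop rows with missing required fields) took the frame from 114000 rows to 89741 and then to 89740. A minimal sketch of the same two steps on a toy frame follows; the toy data and the name `clean` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical data): track "a" is listed twice under different
# genres, and track "b" has no name.
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c"],
    "track_name": ["Song A", "Song A", None, "Song C"],
    "track_genre": ["pop", "dance", "rock", "jazz"],
})

# Count missing values per column before cleaning
print(toy.isna().sum())

# Keep the first row per track_id, then drop rows missing a required field
clean = (
    toy.drop_duplicates(subset=["track_id"])
       .dropna(subset=["track_name"])
       .reset_index(drop=True)
)
print(clean)
```

Note that `drop_duplicates` keeps the first occurrence by default, so a track that appears under several genres retains only the genre label of its first row.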


In [None]:
# Define the features used for similarity: audio features plus duration and popularity
FEATURES = [
    "danceability","energy","loudness","speechiness","acousticness",
    "instrumentalness","liveness","valence","tempo","duration_ms",
    "popularity","mode","key","time_signature"
]

# Create interaction features to capture more complex relationships
df["energy_valence"] = df["energy"] * df["valence"]  # Energy-mood interaction
df["tempo_loudness_ratio"] = df["tempo"] / (np.abs(df["loudness"]) + 1e-6)  # Tempo-loudness balance

FEATURES_EXT = FEATURES + ["energy_valence","tempo_loudness_ratio"]

# Build feature matrix
X = df[FEATURES_EXT].astype(float).values
print("Embedding matrix shape:", X.shape)

Embedding matrix shape: (89740, 16)


## 4. Feature Engineering & Normalization

Build the feature matrix from the audio characteristics, then standardize each column with StandardScaler (zero mean, unit variance) so that large-valued features such as `duration_ms` do not dominate the distance computation.

In [35]:
X[1]

array([ 4.20000000e-01,  1.66000000e-01, -1.72350000e+01,  7.63000000e-02,
        9.24000000e-01,  5.56000000e-06,  1.01000000e-01,  2.67000000e-01,
        7.74890000e+01,  1.49610000e+05,  5.50000000e+01,  1.00000000e+00,
        1.00000000e+00,  4.00000000e+00,  4.43220000e-02,  4.49602527e+00])

In [None]:
# Normalize features using StandardScaler for consistent scale across dimensions
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  
print("Scaled embedding matrix shape:", X_scaled.shape)

Scaled embedding matrix shape: (89740, 16)
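
To make the effect of standardization concrete, here is a small NumPy-only sketch of the z-score that `StandardScaler` computes with its default settings (per-column mean and population standard deviation). The toy matrix is an assumption chosen to mimic one 0-1 feature next to a millisecond-scale feature.

```python
import numpy as np

# Toy matrix (hypothetical values): column 0 is on a 0-1 scale,
# column 1 is a duration in milliseconds.
toy = np.array([
    [0.2, 150_000.0],
    [0.5, 200_000.0],
    [0.8, 250_000.0],
])

# z = (x - mean) / std, computed per column with the population
# standard deviation (ddof=0)
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled)
print("column means:", toy_scaled.mean(axis=0))
print("column stds: ", toy_scaled.std(axis=0))
```

After scaling, both columns carry the same spread, so neither one dominates a distance computed across them.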


## 5. Model Training

Build a k-nearest neighbors index on the scaled features, using cosine distance (1 minus cosine similarity) as the metric.

In [None]:
# Build the KNN index with cosine distance on the standardized features
knn = NearestNeighbors(
    n_neighbors=50,  # Default neighbor count; can be overridden per query in kneighbors()
    metric='cosine',  # Compare the direction of feature vectors rather than their magnitude
    algorithm='auto',  # With cosine distance scikit-learn falls back to brute-force search
)

knn.fit(X_scaled)
print("Similarity index built.")

Similarity index built.
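
For reference, the cosine distance used by the index can be written out directly. The helper below is a hypothetical illustration (the name `cosine_distance` is not from the notebook). It shows that the distance depends only on direction, and that on standardized features, which can be negative, it ranges from 0 to 2.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 2.0, 3.0])

print(cosine_distance(u, 2 * u))        # same direction, different magnitude
print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_distance(u, -u))           # opposite direction
```

Because the distance can exceed 1, a similarity computed as `1 - distance` can be negative for tracks that point in opposite directions in the standardized feature space.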


In [None]:
def search_tracks(query, n=10):
    """
    Search for tracks by genre, name, or artist using case-insensitive string matching.
    
    Args:
        query (str): Search query to match against track genres, names, or artists
        n (int): Maximum number of results to return (default: 10)
    
    Returns:
        pd.DataFrame: Matching tracks with their metadata
    """
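    q = query.lower()
    # Multi-field search across genre, track name, and artist.
    # regex=False treats the query as a literal substring, so characters
    # such as "(" or "+" in a query do not raise a regex error.
    mask = (
        df["track_genre"].str.lower().str.contains(q, na=False, regex=False) |
        df["track_name"].str.lower().str.contains(q, na=False, regex=False) |
        df["artists"].str.lower().str.contains(q, na=False, regex=False)
    )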
    return df.loc[mask, ["track_id","track_name","artists","track_genre"]].head(n)

# Test search functionality
search_tracks("Hip Hop")

Unnamed: 0,track_id,track_name,artists,track_genre
23156,0RY5YUV6cCwFPxg5X2rIv0,Hip Hop Jazz,Sean Deason,detroit-techno
25869,68k74wocnITv3Sjw08dqYQ,It's Bigger Than Hip Hop UK Ft Dead Prez - Ful...,Adam F;Dead Prez;DJ Fresh,drum-and-bass
42444,6rG27UkyWVcBXIHtroteuu,Hip Hop Rio,Planet Hemp,hard-rock
42545,0t2NKkUqJAA72D8WabGTsN,One More Chance - Hip Hop Mix,The Notorious B.I.G.,hardcore
42728,1SyQ6t9RdRBK0QUCS6a797,Hip Hop Hooray,Naughty By Nature,hardcore
56102,7bSzCK0n6KFe8s7PDbKPFG,Hip Hop Harry Theme Song,Hip Hop Harry,kids
56109,2rzjmdqn09eDgtxLjP1xfO,Head Shoulders Knees and Toes,Hip Hop Harry,kids
56200,77vI7Xn9UzqdPEvC2xwQfR,Wheels on the Bus,Hip Hop Harry,kids
56203,5PXBSvTtRYkXA6JR76CbJp,Do The Harry,Hip Hop Harry,kids
56253,71bw4oNIUNz4bjdP0apio3,Holiday Cheer,Hip Hop Harry,kids


## 6. Utility Functions

Define helper functions for searching and recommending tracks based on similarity scores.

In [61]:
id_to_index = {tid: i for i, tid in enumerate(df["track_id"])}

def recommend_by_track_id(track_id, k=10):
    """Return the k tracks closest to `track_id` by cosine similarity."""
    if track_id not in id_to_index:
        raise ValueError(f"Track ID not found: {track_id}")

    idx = id_to_index[track_id]
    vec = X_scaled[idx].reshape(1, -1)

    # Request one extra neighbor because the seed itself is normally returned
    distances, indices = knn.kneighbors(vec, n_neighbors=k + 1)

    distances = distances.flatten()
    indices = indices.flatten()

    # Drop the seed by index rather than by position: when distances tie,
    # the seed is not guaranteed to be the first result
    keep = indices != idx
    rec_indices = indices[keep][:k]
    similarity = 1 - distances[keep][:k]  # convert cosine distance to similarity

    recommendations = df.iloc[rec_indices].copy()
    recommendations["similarity"] = similarity

    return recommendations[
        ["track_name","artists","track_genre","popularity","similarity"]
    ].reset_index(drop=True)
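
The lookup performed by `recommend_by_track_id` can also be sketched without scikit-learn. The following is a brute-force NumPy version on toy data, under the assumption that no row is all zeros; the function name `top_k_cosine` and the toy matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_cosine(X, idx, k):
    """Return (indices, similarities) of the k rows most similar to row idx."""
    # Normalize rows to unit length so a dot product equals cosine similarity
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf              # exclude the seed explicitly
    order = np.argsort(-sims)[:k]    # highest similarity first
    return order, sims[order]

toy = np.array([
    [1.0, 0.0],    # seed
    [2.0, 0.1],    # nearly the same direction as the seed
    [0.0, 1.0],    # orthogonal to the seed
    [-1.0, 0.0],   # opposite to the seed
])

indices, sims = top_k_cosine(toy, idx=0, k=2)
print(indices, sims)
```

Excluding the seed by index, as done here, avoids relying on the seed being the first neighbor returned.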


In [None]:
# Use the first track whose artists field is exactly "Drake" as the seed
track_id = df[df["artists"] == "Drake"]["track_id"].values[0]

print("Seed track:")
print(df[df["track_id"] == track_id][["track_name","artists","track_genre"]])
print("\n" + "="*60)
print("Top 10 Similar Tracks:")
print("="*60 + "\n")

# Get and display recommendations
recommend_by_track_id(track_id, k=10)

Seed track:
       track_name artists track_genre
45421  God's Plan   Drake     hip-hop


Unnamed: 0,track_name,artists,track_genre,popularity,similarity
0,Just Friends,JORDY,singer-songwriter,58,0.955187
1,Gangsta's Paradise,Coolio;L.V.,funk,89,0.947709
2,Closer To You,Rasmus Hagen;Nora Andersson,singer-songwriter,64,0.914954
3,Take My Breath Away,EZI,electro,65,0.912009
4,I Wanna Fuck You,Snoop Dogg;Akon,funk,67,0.899412
5,Oh shit…are we in love?,Valley,electro,65,0.894954
6,Jawani,Arjan Dhillon,pop,69,0.88445
7,Streets,Doja Cat,dance,83,0.881909
8,Karıncalar,Hidra,turkish,48,0.873278
9,Sarcoma,Killstation,emo,62,0.86762


## 7. Example Usage

Test the recommender with a real track from the dataset and get the top 10 most similar recommendations.