# 🌟 Spotify Music Recommendation System Using Lyrics + Metadata 🌟
# A fun, beginner-friendly deep learning project that mixes NLP + audio metadata + neural embeddings!


# 🧩 About This Notebook

This notebook walks through building a **hybrid music recommendation system** using:

- 🎵 **Song lyrics** (processed with NLP + vector embeddings)  
- 🎶 **Song metadata** (genre, subgenre, artist, popularity, etc.)  
- 🧠 **A neural network regressor** that predicts lyric-embedding vectors from song metadata  
- 📐 **Cosine similarity** to generate final recommendations  

Even if you're new to **NLP**, **embeddings**, or frameworks like **TensorFlow**, don't worry:  
this notebook is designed to be **clean, beginner-friendly, and fun**.  
By the end, you'll understand how to build your own mini version of **Spotify's recommendation engine**.

# 🎯 Our Goal

Our mission is simple:

✨ **Build a recommendation engine that suggests songs matching your vibe**  
using a combination of **lyrics embeddings + song metadata features**.

Think of it as creating your own mini version of **Spotify's "Recommended for You"**, but fully explainable, customizable, and built from scratch.

# 🧙‍♂️✨ Importing The Almighty Python Libraries  
# (Thou shalt not run ML without these sacred imports)

In [1]:
import pandas as pd
import json

# 🎵 Loading the Spotify Songs Dataset 
# 👀 Take a Sneak Peek at the Dataset  
# Let's see what the Spotify gods blessed (or cursed) us with.


In [2]:
data = pd.read_csv("spotify_songs.csv")   # Link https://www.kaggle.com/datasets/imuhammad/audio-features-and-lyrics-of-spotify-songs?resource=download
data.head()

|   | track_id | track_name | track_artist | lyrics | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0017A6SJgTbfQVU2EtsPNo | Pangarap | Barbie's Cradle | Minsan pa Nang ako'y napalingon Hindi ko alam ... | 41 | 1srJQ0njEQgd8w4XSqI4JQ | Trip | 2001-01-01 | Pinoy Classic Rock | 37i9dQZF1DWYDQ8wBxd7xt | ... | -10.068 | 1 | 0.0236 | 0.279 | 0.0117 | 0.0887 | 0.566 | 97.091 | 235440 | tl |
| 1 | 004s3t0ONYlzxII9PLgU6z | I Feel Alive | Steady Rollin | The trees, are singing in the wind The sky blu... | 28 | 3z04Lb9Dsilqw68SHt6jLB | Love & Loss | 2017-11-21 | Hard Rock Workout | 3YouF0u7waJnolytf9JCXf | ... | -4.739 | 1 | 0.0442 | 0.0117 | 0.00994 | 0.347 | 0.404 | 135.225 | 373512 | en |
| 2 | 00chLpzhgVjxs1zKC9UScL | Poison | Bell Biv DeVoe | NA Yeah, Spyderman and Freeze in full effect U... | 0 | 6oZ6brjB8x3GoeSYdwJdPc | Gold | 2005-01-01 | Back in the day - R&B, New Jack Swing, Swingbe... | 3a9y4eeCJRmG9p4YKfqYIx | ... | -7.504 | 0 | 0.216 | 0.00432 | 0.00723 | 0.489 | 0.65 | 111.904 | 262467 | en |
| 3 | 00cqd6ZsSkLZqGMlQCR0Zo | Baby It's Cold Outside (feat. Christina Aguilera) | CeeLo Green | I really can't stay Baby it's cold outside I'v... | 41 | 3ssspRe42CXkhPxdc12xcp | CeeLo's Magic Moment | 2012-10-29 | Christmas Soul | 6FZYc2BvF7tColxO8PBShV | ... | -5.819 | 0 | 0.0341 | 0.689 | 0.0 | 0.0664 | 0.405 | 118.593 | 243067 | en |
| 4 | 00emjlCv9azBN0fzuuyLqy | Dumb Litty | KARD | Get up out of my business You don't keep me fr... | 65 | 7h5X3xhh3peIK9Y0qI5hbK | KARD 2nd Digital Single ‘Dumb Litty’ | 2019-09-22 | K-Party Dance Mix | 37i9dQZF1DX4RDXswvP6Mj | ... | -1.993 | 1 | 0.0409 | 0.037 | 0.0 | 0.138 | 0.24 | 130.018 | 193160 | en |


In [3]:
data.shape

(18454, 25)

In [4]:
data.columns

Index(['track_id', 'track_name', 'track_artist', 'lyrics', 'track_popularity',
       'track_album_id', 'track_album_name', 'track_album_release_date',
       'playlist_name', 'playlist_id', 'playlist_genre', 'playlist_subgenre',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'language'],
      dtype='object')

# üóëÔ∏èüßπ Removing Useless Columns  
# If it's not helping the model ‚Üí we throw it in the bin.

In [5]:
df1 = data.drop(columns = ["track_id", "valence", "tempo", "track_album_release_date", "danceability", "playlist_name", "track_album_id", "track_album_name", "playlist_id", "key", "loudness", "acousticness", "liveness", "duration_ms"])
df1.head(3)

|   | track_name | track_artist | lyrics | track_popularity | playlist_genre | playlist_subgenre | energy | mode | speechiness | instrumentalness | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pangarap | Barbie's Cradle | Minsan pa Nang ako'y napalingon Hindi ko alam ... | 41 | rock | classic rock | 0.401 | 1 | 0.0236 | 0.0117 | tl |
| 1 | I Feel Alive | Steady Rollin | The trees, are singing in the wind The sky blu... | 28 | rock | hard rock | 0.88 | 1 | 0.0442 | 0.00994 | en |
| 2 | Poison | Bell Biv DeVoe | NA Yeah, Spyderman and Freeze in full effect U... | 0 | r&b | new jack swing | 0.652 | 0 | 0.216 | 0.00723 | en |


In [6]:
df1.shape

(18454, 11)

# 🚨 Scanning for Missing Values  
# Because NaN stands for "Not A Number" but also "Not Allowed Now".

In [7]:
df1.isna().sum()

track_name             0
track_artist           0
lyrics               260
track_popularity       0
playlist_genre         0
playlist_subgenre      0
energy                 0
mode                   0
speechiness            0
instrumentalness       0
language             260
dtype: int64

In [8]:
df1.dropna(inplace = True)

# ✔️ Confirming Zero Missing Values  
# The dataset is now cleaner than your search history.

In [9]:
df1.isna().sum()

track_name           0
track_artist         0
lyrics               0
track_popularity     0
playlist_genre       0
playlist_subgenre    0
energy               0
mode                 0
speechiness          0
instrumentalness     0
language             0
dtype: int64

In [10]:
df1.shape

(18194, 11)

# 🔠 Converting Text Columns to Lowercase  
# Because 'Hello' and 'hello' should not be treated like different species.

In [11]:
for col in ["track_name", "track_artist", "lyrics", "playlist_genre", "playlist_subgenre"]:
    df1[col] = df1[col].str.lower()

In [12]:
df2 = df1.copy()

# 🌐 Keep Only English Songs  

In [13]:
df2 = df2[df2["language"] == "en"]
df2.shape

(15405, 11)

In [14]:
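# Reassign instead of dropping in place: df2 is a filtered slice, and in-place
# edits on a slice can trigger pandas' SettingWithCopyWarning.
df2 = df2.drop(columns=["language"])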

In [15]:
genres = df2["playlist_genre"].unique().tolist()

## 💾 Saving Artists, Song Names & Genre Dictionaries  
## Because we love future convenience and hate re-processing.


In [16]:
genres_subgenre = {}

for g in genres:
    subgenres = df2[df2["playlist_genre"] == g]["playlist_subgenre"].dropna().unique().tolist()
    genres_subgenre[g] = subgenres

In [17]:
genres_subgenre

{'rock': ['hard rock', 'album rock', 'permanent wave', 'classic rock'],
 'r&b': ['new jack swing', 'neo soul', 'urban contemporary', 'hip pop'],
 'pop': ['dance pop', 'indie poptimism', 'post-teen pop', 'electropop'],
 'edm': ['big room', 'progressive electro house', 'pop edm', 'electro house'],
 'rap': ['gangster rap', 'trap', 'southern hip hop', 'hip hop'],
 'latin': ['tropical', 'latin hip hop', 'latin pop', 'reggaeton']}

In [18]:
artists = df2["track_artist"].unique().tolist()
songNames = df2["track_name"].unique().tolist()

In [19]:
with open("artist.json", "w") as f:
    json.dump(artists, f)

In [20]:
with open("songName.json", "w") as f:
    json.dump(songNames, f)

In [21]:
with open("genres_subgenre.json", "w") as f:
    json.dump(genres_subgenre, f)
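
As a rough sketch of how these JSON files might be read back later (for example, to fill dropdown menus in an app), the snippet below round-trips a small mapping through `json.dump` and `json.load`. The temporary directory and the trimmed `genres_subgenre` sample are assumptions made so the example is self-contained; in the notebook the files land in the working directory.

```python
import json
import tempfile
from pathlib import Path

# Trimmed, illustrative copy of the genre -> subgenre mapping built above
genres_subgenre = {
    "rock": ["hard rock", "album rock"],
    "pop": ["dance pop", "electropop"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "genres_subgenre.json"

    # Write the mapping the same way the notebook does
    with open(path, "w") as f:
        json.dump(genres_subgenre, f)

    # Read it back; JSON objects come back as plain dicts and lists
    with open(path) as f:
        loaded = json.load(f)

print(loaded["pop"])
```

Since JSON only stores plain lists, dicts, strings, and numbers, the reloaded object compares equal to the original mapping.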

In [22]:
df3 = df2.copy()

## NLTK (Natural Language Toolkit): used for preprocessing the text data
## Importing NLTK tools: stopwords, tokenizers, lemmatizers.

In [23]:
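import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (newer NLTK releases may also need "punkt_tab")
for resource in ("stopwords", "punkt", "wordnet"):
    nltk.download(resource)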

In [24]:
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

# ✨ Cleaning All Lyrics  
# Goodbye fluff, punctuation, and questionable words.

In [25]:
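def cleanText(text):
    """Lowercase the lyrics and keep alphabetic, non-stop-word tokens longer than 2 characters."""
    text = text.lower()

    # NOTE: lemmatize() expects a single word. Called on the whole lyric string it
    # effectively returns the text unchanged, so plurals such as "angels" survive.
    text = lemma.lemmatize(text)

    tokens = word_tokenize(text)

    # The text is already lowercase, so the tokens need no further lowercasing
    words = [token for token in tokens if token not in stop_words and token.isalpha() and len(token) > 2]

    return " ".join(words)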
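
`WordNetLemmatizer.lemmatize` works on one word at a time, so a per-token variant of the cleaning step might look like the sketch below. To keep it runnable without NLTK data, it uses a regex tokenizer, a tiny hand-written stop-word set, and a pluggable `lemmatize` callable. All three are assumptions for illustration (as are the names `clean_text_per_token` and `toy_lemmatize`); in the notebook you would pass `lemma.lemmatize` and the NLTK stop-word set instead. Switching to this variant would change the cleaned lyrics, so the outputs recorded in the rest of the notebook would no longer match.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"the", "and", "are", "in", "at", "it", "as", "can", "be", "me"}

def clean_text_per_token(text, stop_words=STOP_WORDS, lemmatize=lambda w: w):
    """Lowercase, tokenize, filter, then lemmatize each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())  # alphabetic tokens only
    kept = [t for t in tokens if t not in stop_words and len(t) > 2]
    return " ".join(lemmatize(t) for t in kept)

def toy_lemmatize(word):
    """Hypothetical stand-in for lemma.lemmatize: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(clean_text_per_token("The trees, are singing in the wind", lemmatize=toy_lemmatize))
# prints: tree singing wind
```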

In [26]:
df3["lyrics"] = df3["lyrics"].apply(cleanText)

In [27]:
df2["lyrics"][1]

'the trees, are singing in the wind the sky blue, only as it can be and the angels, smiled at me i saw you, in that lonely bench at half past four, i kissed your soft soft hands and at 6 i kissed your lips and the angels smiled, i thought hey i feel alive! the park sign, said it was closed and we jumped that fence with no cares at all and we kissed under a tree we danced, under the midnight sun and i loved you, without knowing you at all and we laughed and felt so free and the angels they smiled, i thought hey, i feel alive!'

In [28]:
df3["lyrics"][1]

'trees singing wind sky blue angels smiled saw lonely bench half past four kissed soft soft hands kissed lips angels smiled thought hey feel alive park sign said closed jumped fence cares kissed tree danced midnight sun loved without knowing laughed felt free angels smiled thought hey feel alive'

In [29]:
df3 = df3.reset_index(drop=True)
df3.columns

Index(['track_name', 'track_artist', 'lyrics', 'track_popularity',
       'playlist_genre', 'playlist_subgenre', 'energy', 'mode', 'speechiness',
       'instrumentalness'],
      dtype='object')

In [30]:
from sklearn.preprocessing import LabelEncoder

songName_encoder = LabelEncoder()
artist_encoder = LabelEncoder()
genre_encoder = LabelEncoder()
subgenre_encoder = LabelEncoder()

In [31]:
y = df3["track_name"]
df4 = df3.copy()
df4['track_artist'] = artist_encoder.fit_transform(df3['track_artist'])
df4['playlist_genre'] = genre_encoder.fit_transform(df3['playlist_genre'])
df4['playlist_subgenre'] = subgenre_encoder.fit_transform(df3['playlist_subgenre'])
df4.drop(columns = ["track_name"], inplace = True)
df4.head()

|   | track_artist | lyrics | track_popularity | playlist_genre | playlist_subgenre | energy | mode | speechiness | instrumentalness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4044 | trees singing wind sky blue angels smiled saw ... | 28 | 5 | 7 | 0.88 | 1 | 0.0442 | 0.00994 |
| 1 | 438 | yeah spyderman freeze full effect ready ron re... | 0 | 3 | 14 | 0.652 | 0 | 0.216 | 0.00723 |
| 2 | 779 | really stay baby cold outside got away baby co... | 41 | 3 | 13 | 0.378 | 0 | 0.0341 | 0.0 |
| 3 | 2312 | get business keep turning witness 척하는 criminal... | 65 | 2 | 3 | 0.887 | 1 | 0.0409 | 0.0 |
| 4 | 2056 | hold breath look keep trying darling okay scar... | 70 | 3 | 23 | 0.639 | 1 | 0.055 | 0.0 |


In [32]:
y.reset_index(inplace = True , drop = True)
y_encoded = songName_encoder.fit_transform(y) 
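
For readers new to label encoding, the sketch below mimics in plain Python what `LabelEncoder` does: each distinct label gets an integer according to its sorted position. The helper names `fit_labels` and `encode` are hypothetical and exist only for illustration. Because the integers are arbitrary identifiers rather than quantities, a label that was not present at fit time has no code, which is why the recommendation function rejects unseen artists and genres.

```python
def fit_labels(values):
    """Return the sorted distinct labels, like LabelEncoder.classes_."""
    return sorted(set(values))

def encode(classes, values):
    """Map each label to its index in classes; unseen labels raise ValueError."""
    lookup = {label: i for i, label in enumerate(classes)}
    unseen = [v for v in values if v not in lookup]
    if unseen:
        raise ValueError(f"unseen labels: {unseen}")
    return [lookup[v] for v in values]

genres = ["rock", "r&b", "pop", "edm", "rap", "latin", "pop"]
classes = fit_labels(genres)
print(classes)                           # ['edm', 'latin', 'pop', 'r&b', 'rap', 'rock']
print(encode(classes, ["rock", "edm"]))  # [5, 0]
```

This matches the encoded table above, where a rock track carries genre code 5 and an r&b track carries code 3.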

In [33]:
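from sentence_transformers import SentenceTransformer

# Named lyrics_encoder so it is not overwritten by the Keras `model` defined later
lyrics_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# One 384-dimensional sentence embedding per song
embeddings = lyrics_encoder.encode(df4["lyrics"].fillna("").tolist(), show_progress_bar=True)

embedding_df = pd.DataFrame(
    embeddings,
    columns=[f"token_{i}" for i in range(embeddings.shape[1])]
)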

Batches: 100%|██████████| 482/482 [11:09<00:00,  1.39s/it]


## 📉 Reducing Embedding Dimensions with PCA  
## The embeddings are going keto: from 384 → 50.

In [34]:
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
reduced_embeddings = pca.fit_transform(embedding_df)
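
To give a feel for what the PCA step does under the hood, here is a minimal NumPy sketch: center the data, take an SVD, and project onto the leading directions. The function name `pca_reduce` and the random toy matrix are assumptions for illustration; this is not scikit-learn's implementation, and signs or numerical details may differ from `PCA.fit_transform`.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (SVD-based sketch)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    reduced = X_centered @ Vt[:n_components].T
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return reduced, explained_ratio[:n_components]

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 10))  # hypothetical stand-in for the 384-d embeddings
reduced, ratio = pca_reduce(toy, n_components=3)

print(reduced.shape)  # (200, 3)
print(ratio.sum())    # fraction of the total variance kept by 3 components
```

The summed ratio is a handy sanity check: if it is low, the reduced vectors have discarded much of the information in the original embeddings. On the fitted scikit-learn object, the comparable quantity is `pca.explained_variance_ratio_.sum()`.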

In [35]:
reduced_embeddings

array([[-0.25870818,  0.03268401, -0.18768808, ...,  0.04991819,
         0.04905635,  0.08152491],
       [ 0.21775162, -0.13479902,  0.0174536 , ..., -0.05334386,
         0.06618114,  0.00202239],
       [ 0.07024334, -0.04807389, -0.07521815, ..., -0.08242086,
         0.01888034, -0.04129078],
       ...,
       [ 0.30664182,  0.09471532,  0.09235753, ...,  0.02001332,
         0.07811463,  0.01665414],
       [-0.23448408,  0.08493301, -0.0723003 , ...,  0.09799903,
         0.07150644,  0.02948574],
       [-0.1537104 , -0.1877923 , -0.0969812 , ...,  0.01770057,
         0.10677427, -0.03666512]], dtype=float32)

# 📊 Scaling Numeric Features  
# Neural networks prefer everything standardized like an Indian exam pattern.

In [36]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

num_cols = [
    "track_artist",
    "track_popularity",
    "playlist_genre",
    "playlist_subgenre",
    "energy",
    "mode",
    "speechiness",
    "instrumentalness"
]

scaled_num = pd.DataFrame(
    scaler.fit_transform(df4[num_cols].fillna(0)),
    columns=num_cols
)
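
As a small illustration of what standardization does to a single column, the plain-Python sketch below computes z-scores with the population standard deviation, which is the convention `StandardScaler` follows. The helper name `standardize` is hypothetical, and the five popularity values are just the first rows shown earlier, so the numbers will not match the notebook's output (which uses the mean and spread of the whole dataset).

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

popularity = [28, 0, 41, 65, 70]  # first five track_popularity values shown earlier
z = standardize(popularity)
print([round(v, 3) for v in z])
```

After the transformation the column has mean 0 and standard deviation 1. One caveat worth keeping in mind: the label-encoded artist, genre, and subgenre columns are scaled here as if they were quantities, even though their integer codes are only identifiers.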

In [37]:
# Start from a copy of the scaled features (concatenating a single frame is just a copy)
df5 = scaled_num.copy()

df5["lyrics_embedding"] = list(reduced_embeddings)

In [38]:
df5.head(3)

|   | track_artist | track_popularity | playlist_genre | playlist_subgenre | energy | mode | speechiness | instrumentalness | lyrics_embedding |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.126519 | -0.562509 | 1.317674 | -0.618344 | 1.038898 | 0.83104 | -0.577054 | -0.229535 | [-0.25870818, 0.032684013, -0.18768808, 0.1302... |
| 1 | -1.398234 | -1.697676 | 0.059187 | 0.418931 | -0.203797 | -1.203311 | 1.141287 | -0.247652 | [0.21775162, -0.13479902, 0.017453596, 0.05266... |
| 2 | -1.159482 | -0.035468 | 0.059187 | 0.270749 | -1.697211 | -1.203311 | -0.678074 | -0.295986 | [0.070243336, -0.048073888, -0.07521815, 0.029... |


## 🎯 Preparing Input Features (X) and Target Embeddings (y)
### X = numbers  
### y = emotions (kind of)

In [39]:
X = scaled_num          # scaled metadata features (model input)
y = reduced_embeddings  # PCA-reduced lyric embeddings, 50 values per song (model target)

df6 = df5.copy()  # scaled metadata features + the lyrics_embedding column
df6["track_name"] = df3["track_name"]  # add back song names (both frames use a fresh 0..n-1 index)

# 🧩 Building the Neural Network Regressor  
# This baby predicts embedding vectors like a champ.

In [40]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Input, Model, Sequential

In [41]:
model = Sequential([
    Input(shape=(len(num_cols),)),
    layers.Dense(128, activation='relu'),
    layers.Dense(256, activation='relu'),
    layers.Dense(y.shape[1], activation='linear')  
])




# 🏗️ Model Summary  
# Let's admire our neural creation.

In [42]:
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               1152      
                                                                 
 dense_1 (Dense)             (None, 256)               33024     
                                                                 
 dense_2 (Dense)             (None, 50)                12850     
                                                                 
Total params: 47026 (183.70 KB)
Trainable params: 47026 (183.70 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
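
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [8, 128, 256, 50]  # 8 input features, then the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [1152, 33024, 12850]
print(sum(per_layer))  # 47026
```

Both the per-layer values and the total agree with the `model.summary()` output above.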


# 🏋️ Training the Neural Network  
# Epochs = 5 (because GPUs are expensive)

In [43]:
history = model.fit(
    X,
    y,
    epochs=5,
    shuffle=True,
    validation_split=0.1,
    verbose=1
)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
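
The model is compiled with `loss='mse'` and `metrics=['mae']`. As a rough illustration of what those two numbers measure when the targets are vectors, here is a plain-Python sketch. The `mse` and `mae` functions are hypothetical helpers rather than Keras functions, and the toy vectors are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error averaged over every element of every vector."""
    diffs = [(t - p) ** 2 for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp)]
    return sum(diffs) / len(diffs)

def mae(y_true, y_pred):
    """Mean absolute error averaged over every element of every vector."""
    diffs = [abs(t - p) for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp)]
    return sum(diffs) / len(diffs)

y_true = [[0.0, 1.0], [2.0, 3.0]]  # toy target embeddings
y_pred = [[0.0, 0.0], [2.0, 5.0]]  # toy predictions

print(mse(y_true, y_pred))  # 1.25
print(mae(y_true, y_pred))  # 0.75
```

MSE punishes large misses more heavily than MAE, which is why the two can move differently from epoch to epoch.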


# 🛠️ Importing Similarity Tools  
## Because "how similar are two songs?" is the BIG question.
## 🎧 Defining the Recommendation Function  
## This is where the magic happens → recommending bangers.


In [44]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

def safe_inverse_transform(encoder, values):
    """Map integer codes back to labels; out-of-range codes become "Unknown"."""
    output = []
    for v in values:
        if 0 <= v < len(encoder.classes_):
            output.append(encoder.classes_[v])
        else:
            output.append("Unknown")  # or np.nan
    return np.array(output)


def get_recommendations(user_df, scaler, model, num_cols, df,
                        artist_encoder, genre_encoder, subgenre_encoder,
                        songNames, top_n=5):

    user_df = user_df.copy()

    # --- Validate seen labels ---
    # The encoders only know labels present at fit time, so reject anything else up front.
    unseen = []
    if "track_artist" in user_df:
        unseen += [a for a in user_df["track_artist"] if a not in artist_encoder.classes_]
    if "playlist_genre" in user_df:
        unseen += [g for g in user_df["playlist_genre"] if g not in genre_encoder.classes_]
    if "playlist_subgenre" in user_df:
        unseen += [s for s in user_df["playlist_subgenre"] if s not in subgenre_encoder.classes_]

    if unseen:
        print(f"⚠️ Unseen labels detected: {set(unseen)}")
        print("Only labels seen during training are supported. Retry with a valid artist/genre/subgenre.")
        return None

    # --- Safe encoding ---
    user_df["track_artist"] = artist_encoder.transform(user_df["track_artist"])
    user_df["playlist_genre"] = genre_encoder.transform(user_df["playlist_genre"])
    user_df["playlist_subgenre"] = subgenre_encoder.transform(user_df["playlist_subgenre"])

    # --- Scale numeric features ---
    X_scaled = scaler.transform(user_df[num_cols])

    # --- Predict embedding ---
    predicted_emb = model.predict(X_scaled)

    # --- Compute cosine similarity ---
    all_embs = np.vstack(df["lyrics_embedding"].values)
    sims = cosine_similarity(predicted_emb, all_embs)[0]

    # Indices of the top_n most similar songs, best first
    top_idx = np.argsort(sims)[::-1][:top_n]

    # --- Handle track names ---
    if "track_name" in df.columns:
        track_names = df.iloc[top_idx]["track_name"].values
    else:
        # Fallback only: songNames is a de-duplicated list and is not row-aligned with df
        track_names = [songNames[i] for i in top_idx if i < len(songNames)]

    # --- Undo the scaling before decoding ---
    # df stores standardized columns, so map the z-scores back to the original
    # integer codes and popularity values before reverse-transforming them.
    raw = pd.DataFrame(
        scaler.inverse_transform(df.iloc[top_idx][num_cols]),
        columns=num_cols
    )
    codes = raw[["track_artist", "playlist_genre", "playlist_subgenre"]].round().astype(int)

    # --- Reverse-transform to readable values ---
    track_artists = safe_inverse_transform(artist_encoder, codes["track_artist"])
    playlist_genres = safe_inverse_transform(genre_encoder, codes["playlist_genre"])
    playlist_subgenres = safe_inverse_transform(subgenre_encoder, codes["playlist_subgenre"])

    # --- Final recommendations ---
    recommendations = pd.DataFrame({
        "track_name": track_names,
        "track_artist": track_artists,
        "playlist_genre": playlist_genres,
        "playlist_subgenre": playlist_subgenres,
        "track_popularity": raw["track_popularity"].round().astype(int).values,
        "similarity": sims[top_idx]
    })

    return recommendations.reset_index(drop=True)
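
The ranking step at the heart of the function can be sketched without any libraries: score every catalog vector against the query by cosine similarity and keep the best few. The names `cosine`, `top_n_similar`, and the three-song `catalog` are hypothetical, and the 2-d vectors are toy stand-ins for the 50-d lyric embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_similar(query, catalog, n=2):
    """Return the n (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine(query, emb)) for name, emb in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

catalog = {
    "song_a": [1.0, 0.0],
    "song_b": [0.0, 1.0],
    "song_c": [1.0, 1.0],
}

best = top_n_similar([2.0, 0.1], catalog)
print([name for name, _ in best])  # ['song_a', 'song_c']
```

Cosine similarity ignores vector length and compares direction only, so the query `[2.0, 0.1]` ranks closest to `song_a` even though their magnitudes differ.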


# 📝 Creating a Sample User Input  
# This is YOU telling the model: "Show me some good music."

In [51]:
user_input = pd.DataFrame([{
    "track_artist": "steady rollin",
    "track_popularity": 40,
    "playlist_genre": "pop",
    "playlist_subgenre": "dance pop",
    "energy": 0.1,
    "mode": 0.3,
    "speechiness": 0.43,
    "instrumentalness": 0
}])

recommendations = get_recommendations(
    user_df=user_input,
    scaler=scaler,
    model=model,
    num_cols=num_cols,
    df=df6,  
    artist_encoder=artist_encoder,
    genre_encoder=genre_encoder,
    subgenre_encoder=subgenre_encoder,
    songNames = songNames,
    top_n=5
)

recommendations



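|   | track_name | track_artist | playlist_genre | playlist_subgenre | track_popularity | similarity |
|---|---|---|---|---|---|---|
| 0 | he said she said | Unknown | edm | album rock | 0.451032 | 0.618621 |
| 1 | unpredictable - main | !deladap | edm | big room | -0.157093 | 0.614215 |
| 2 | sweet sweet | $uicideboy$ | edm | big room | 0.288866 | 0.591291 |
| 3 | don't waste my time (feat. ella mai) | $uicideboy$ | Unknown | album rock | 1.261865 | 0.587236 |
| 4 | that girl | !deladap | edm | big room | -0.400343 | 0.582631 |

Note: this output was recorded with the labels decoded directly from the standardized columns of `df6`. That is why `track_popularity` shows z-scores and why the artist, genre, and subgenre labels are `Unknown` or do not belong together (for example `edm` paired with `album rock`); only `track_name` and `similarity` can be taken at face value.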


# 💾 Saving Model, Scaler & Encoders  
# Future you will thank present you.

In [None]:
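import os
import joblib

os.makedirs("model", exist_ok=True)  # joblib.dump does not create missing directories

# NOTE: pickling a Keras model through joblib may not work on every TensorFlow version;
# model.save(...) is the Keras-native way to persist the model if this line fails.
joblib.dump(model, "model/model.joblib")
joblib.dump(scaler, "model/scaler.joblib")
joblib.dump(artist_encoder, "model/artist_encoder.joblib")
joblib.dump(genre_encoder, "model/genre_encoder.joblib")
joblib.dump(subgenre_encoder, "model/subgenre_encoder.joblib")
joblib.dump(df6, "model/songs_dataset.joblib")
joblib.dump(songNames, "model/songNames.joblib")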

['songNames.joblib']

# 🎉 Done!
# Congratulations, you built a hybrid recommendation engine from scratch.
