<a href="https://colab.research.google.com/github/kennywong524/spotify-swipe-based-recommendation-system/blob/main/Content_based_engine_using_KNN_%26_Logistic_regression_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Content-based Filtering**

I want to identify the best solutions for music recommendation. I intend to build and test a variety of models and determine which performs best with little user input data. The models will be trained on half the listening history of many users and will attempt to predict the remaining half.

In this notebook, we focus solely on content-based filtering, which relies on the similarity between songs, using two ML methods: KNN and Logistic Regression. The system finds songs that share similar features and recommends them to the user. This method does not require any user input data, which may limit the variety of the songs it recommends.

For example, if a user enjoys pop music, a content-based filtering model will continue to recommend pop music. While this ensures that recommended items stay close to the user's tastes, the model cannot recommend new types or genres of items to the user.

We will start exploring the hybrid and collaborative-filtering models in the next notebook.

In [1]:
from copy import deepcopy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [3]:
spotifysongs_cleaned = pd.read_csv('cleaned_spotify_songs.csv')

In [4]:
spotifysongs_cleaned

Unnamed: 0,track_name,track_artist,track_popularity,track_album_name,track_album_release_date,playlist_name,playlist_genre,playlist_subgenre,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,duration_min
0,"""This Is Seagull….""",The Snake Corps,34,Smother Earth,1990-01-01,"Maxi Pop GOLD (New Wave, Electropop, Synth Po...",pop,electropop,0.516,0.580,...,-13.288,0,0.0295,0.000002,0.857000,0.1100,0.235,135.903,238227,3.970450
1,(I Just) Died In Your Arms,Cutting Crew,71,Broadcast,1986-01-01,80's Songs | Top 💯 80s Music Hits,pop,electropop,0.625,0.726,...,-11.402,0,0.0444,0.015800,0.000169,0.0625,0.507,124.945,280400,4.673333
2,(No One Knows Me) Like the Piano,Sampha,63,Process,2017-02-03,Ultimate Indie Presents... Best Indie Tracks o...,pop,dance pop,0.621,0.199,...,-13.788,1,0.0344,0.976000,0.001860,0.1070,0.178,128.905,218160,3.636000
3,(You Drive Me) Crazy,Britney Spears,2,Baby One More Time,1999-01-12,90s Dance Hits,pop,dance pop,0.755,0.944,...,-3.936,1,0.0342,0.059900,0.000000,0.3320,0.962,104.006,198093,3.301550
4,(You Drive Me) Crazy [The Stop Remix!] - Remas...,Britney Spears,40,The Essential Britney Spears,2014-07-29,90s Dance Hits,pop,dance pop,0.727,0.951,...,-4.004,1,0.0375,0.211000,0.000701,0.0537,0.774,101.078,197653,3.294217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3679,留学生,Monkey Majik,49,COLLABORATED,2019-03-06,Best of 2019 Dance Pop: Japan,pop,dance pop,0.650,0.878,...,-4.768,0,0.0451,0.148000,0.000000,0.0499,0.781,120.001,214773,3.579550
3680,真っ赤な太陽,RAMMELLS,29,真っ赤な太陽,2019,Best of 2019 Dance Pop: Japan,pop,dance pop,0.612,0.843,...,-4.552,1,0.0431,0.152000,0.001630,0.4020,0.322,125.997,181187,3.019783
3681,달라달라 DALLA DALLA,ITZY,52,IT'z Different,2019-02-13,Best of 2019 Dance Pop: Japan,pop,dance pop,0.790,0.853,...,-4.564,0,0.0665,0.001160,0.000042,0.3290,0.713,124.998,199874,3.331233
3682,피카부 Peek-A-Boo,Red Velvet,70,Perfect Velvet - The 2nd Album,2017-11-17,K-Party Dance Mix,pop,dance pop,0.839,0.902,...,-3.612,0,0.0536,0.086800,0.002570,0.2720,0.639,114.953,189050,3.150833


This notebook uses K-Nearest Neighbors and Logistic Regression to recommend N other tracks based on an input track.


**Steps to follow:**

1. Preprocess the Data
We'll focus on relevant features for creating embeddings and combining other features.

2. Create Embeddings
We'll use TensorFlow to create song embeddings based on their categorical features.

3. Combine Features
We'll use song embeddings combined with other numerical features.

4. Train a KNN Model
We'll train a KNN model on the combined feature vectors.

5. Recommend Songs
We'll use the trained KNN model to recommend songs based on the similarity of feature vectors.
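The steps above can be sketched end to end on a toy table. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the five rows, the three-column feature subset, and the helper name recommend_similar are invented for the example. It skips a learned embedding and uses standardized numeric features as the song vectors, and it uses scikit-learn's unsupervised NearestNeighbors so that neighbor indices line up with DataFrame rows.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy catalogue; all values are invented for illustration.
toy_songs = pd.DataFrame({
    "track_name": ["A", "B", "C", "D", "E"],
    "danceability": [0.80, 0.78, 0.30, 0.32, 0.55],
    "energy": [0.90, 0.85, 0.20, 0.25, 0.50],
    "tempo": [128.0, 126.0, 70.0, 72.0, 100.0],
})
toy_features = ["danceability", "energy", "tempo"]

# Preprocess and combine: standardize the numeric features into one vector per song.
toy_scaler = StandardScaler()
toy_vectors = toy_scaler.fit_transform(toy_songs[toy_features])

# Train: index every song so that neighbor indices match DataFrame row positions.
toy_index = NearestNeighbors(n_neighbors=3).fit(toy_vectors)


def recommend_similar(row_position, n_recommendations=2):
    """Return the n songs closest to the song at row_position, excluding itself."""
    query = toy_vectors[row_position].reshape(1, -1)
    # Ask for one extra neighbor because the query song is its own nearest neighbor.
    _, indices = toy_index.kneighbors(query, n_neighbors=n_recommendations + 1)
    neighbor_rows = [i for i in indices[0] if i != row_position][:n_recommendations]
    return toy_songs.iloc[neighbor_rows]


print(recommend_similar(0, n_recommendations=1)["track_name"].tolist())
```

In this toy data, song A's closest neighbor is song B, since the two rows differ only slightly in every feature.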

In [5]:
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, concatenate


## Embeddings & model for KNN

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

In [36]:
# Select relevant features for embeddings
features = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
            'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']

In [37]:
# Scale feature - Standardize the features to have zero mean and unit variance using StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(spotifysongs_cleaned[features])
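To see why standardizing matters for a distance-based model such as KNN, the sketch below compares nearest neighbors before and after scaling. The four rows and the helper nearest_other are made up for illustration; only StandardScaler itself comes from the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: [danceability, duration_ms]
raw = np.array([
    [0.90, 200_000.0],  # song 0
    [0.10, 201_000.0],  # song 1: very different danceability, almost the same length
    [0.85, 215_000.0],  # song 2: similar danceability, somewhat longer
    [0.50, 260_000.0],  # song 3
])


def nearest_other(vectors, row):
    """Index of the row closest to `row` by Euclidean distance, excluding itself."""
    dists = np.linalg.norm(vectors - vectors[row], axis=1)
    dists[row] = np.inf
    return int(np.argmin(dists))


scaled = StandardScaler().fit_transform(raw)

print(nearest_other(raw, 0))     # raw units: duration_ms dominates the distance
print(nearest_other(scaled, 0))  # standardized: both features contribute
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

In raw units the millisecond column swamps the 0-to-1 column, so song 1 looks closest to song 0; after standardization song 2 is closest, which matches the intuition that the two songs sound alike.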

In [38]:
# Combine features into a single feature vector for each song
song_embeddings = scaled_features

In [39]:
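# Simulated label (also created in the Logistic Regression section): popularity > 50 counts as a "like"
spotifysongs_cleaned['user_feedback'] = (spotifysongs_cleaned['track_popularity'] > 50).astype(int)
X_train, X_test, y_train, y_test = train_test_split(scaled_features, spotifysongs_cleaned['user_feedback'], test_size=0.2, random_state=42)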

In [40]:
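# Fit a KNN classifier on the training split (used for the evaluation below)
knn_classifier = KNeighborsClassifier(n_neighbors=10, algorithm='ball_tree')
knn_classifier.fit(X_train, y_train)
# Assumption: knn_model (used by the recommenders below) is a neighbor index fitted on all songs
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(n_neighbors=10, algorithm='ball_tree').fit(scaled_features)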

Because each song is described by several features (danceability, energy, loudness, and so on), the data is moderately high-dimensional, so I chose Ball Tree as the neighbor-search algorithm for this recommender system. If the song data had a relatively low number of dimensions, KD Tree could also be effective.
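As a rough check of this choice, the sketch below fits the same neighbor search with both tree types on synthetic data. The random vectors are a stand-in for ten scaled audio features and are not the notebook's data. The algorithm parameter changes only how the search is organized, so both settings should return the same neighbor distances.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
synthetic_vectors = rng.normal(size=(500, 10))  # stand-in for 10 scaled audio features
queries = synthetic_vectors[:5]

results = {}
for algorithm in ("ball_tree", "kd_tree"):
    index = NearestNeighbors(n_neighbors=10, algorithm=algorithm).fit(synthetic_vectors)
    distances, _ = index.kneighbors(queries)
    results[algorithm] = distances

# Same neighbors either way; only the search structure (and its speed) differs.
print(np.allclose(results["ball_tree"], results["kd_tree"]))
```

Which structure is faster depends on the number of samples and dimensions, so timing both on the real feature matrix is the safest way to decide.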

In [43]:
# Predict the test set results
y_pred_knn = knn_classifier.predict(X_test)

# Evaluate the KNN classifier
knn_accuracy = accuracy_score(y_test, y_pred_knn)
knn_precision = precision_score(y_test, y_pred_knn)
knn_recall = recall_score(y_test, y_pred_knn)
knn_conf_matrix = confusion_matrix(y_test, y_pred_knn)

print("KNN Classifier Evaluation")
print("Accuracy:", knn_accuracy)
print("Precision:", knn_precision)
print("Recall:", knn_recall)
print("Confusion Matrix:\n", knn_conf_matrix)

KNN Classifier Evaluation
Accuracy: 0.55359565807327
Precision: 0.6024390243902439
Recall: 0.5980629539951574
Confusion Matrix:
 [[161 163]
 [166 247]]


In [12]:
# Function to collect the user's preferred value for each feature
def get_user_preferences():
    print("Please input your preferred values for the following features (between 0 and 1 for normalized values):")

    user_preferences = {}
    for feature in features[:-1]:  # Excluding 'duration_ms' for normalized input
        user_input = float(input(f"{feature}: "))
        user_preferences[feature] = user_input

    user_preferences['duration_ms'] = float(input("duration_ms (e.g., 200000 for 3 minutes 20 seconds): "))

    return user_preferences

In [13]:
# Function to recommend songs based on user_preferences

def recommend_songs(user_preferences, n_recommendations=10):
    # Create a DataFrame from user preferences
    user_df = pd.DataFrame(user_preferences, index=[0]) # each row = each user input

    # Scale the user input so that it's in the same range as the training data
    scaled_user_input = scaler.transform(user_df[features])

    # Use KNN to find the closest songs
    distances, indices = knn_model.kneighbors(scaled_user_input, n_neighbors=n_recommendations)

    return spotifysongs_cleaned.iloc[indices[0]]

The kneighbors method of the KNN model finds the songs closest to the user's input. It returns two arrays:

1. distances: The distances between the user's input and the nearest neighbors (songs).

2. indices: The indices of the nearest neighbors (songs) in the dataset.

scaled_user_input is passed to the kneighbors method to find the nearest neighbors for the user's preferences.

n_neighbors=n_recommendations specifies the number of recommendations to return.

The indices of the nearest neighbors are used to extract the corresponding songs from the original dataset: spotifysongs_cleaned.iloc[indices[0]] returns the rows of the dataset that correspond to those indices.
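The return values described above can be inspected on a tiny example. The four 2-D points and the name toy_knn are invented for illustration; the calls themselves are the standard scikit-learn NearestNeighbors API.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four invented 2-D points standing in for scaled song vectors.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
toy_knn = NearestNeighbors(n_neighbors=2).fit(points)

# One query row gives arrays of shape (1, n_neighbors), ordered nearest first.
query = np.array([[0.9, 0.1]])
distances, indices = toy_knn.kneighbors(query, n_neighbors=3)

print(distances.shape, indices.shape)
print(indices[0])          # row positions into `points`
print(points[indices[0]])  # the neighbors themselves
```

Because the arrays are two-dimensional (one row per query), indices[0] selects the neighbors of the first and only query, which is why the recommender indexes with indices[0] before calling iloc.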

In [14]:
#demo

In [16]:
get_user_preferences()

Please input your preferred values for the following features (between 0 and 1 for normalized values):
danceability: 0.5
energy: 0.5
loudness: 0.5
speechiness: 0.5
acousticness: 0.5
instrumentalness: 0.5
liveness: 0.5
valence: 0.5
tempo: 0.5
duration_ms (e.g., 200000 for 3 minutes 20 seconds): 20000


{'danceability': 0.5,
 'energy': 0.5,
 'loudness': 0.5,
 'speechiness': 0.5,
 'acousticness': 0.5,
 'instrumentalness': 0.5,
 'liveness': 0.5,
 'valence': 0.5,
 'tempo': 0.5,
 'duration_ms': 20000.0}

In [45]:
recommend_songs(get_user_preferences(), n_recommendations=10)

Please input your preferred values for the following features (between 0 and 1 for normalized values):
danceability: 0.4
energy: 0.5
loudness: 0.9
speechiness: 0.2
acousticness: 0.4
instrumentalness: 0.5
liveness: 0.6
valence: 0.7
tempo: 0.9
duration_ms (e.g., 200000 for 3 minutes 20 seconds): 20000


Unnamed: 0,track_name,track_artist,track_popularity,track_album_name,track_album_release_date,playlist_name,playlist_genre,playlist_subgenre,danceability,energy,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,duration_min,user_feedback
3641,like that,Bea Miller,73,aurora,2018-02-23,The Sound of Post-Teen Pop,pop,post-teen pop,0.573,0.441,...,0,0.0758,0.186,0.0,0.61,0.389,61.657,185547,3.09245,1
2543,Run Wild,Thutmose,54,Run Wild - EP,2018-08-08,Electropop,pop,electropop,0.582,0.962,...,1,0.132,0.011,0.0,0.628,0.576,76.014,159474,2.6579,1
1525,"It's Party Time (From ""Hotel Transylvania 3"")",Joe Jonas,44,"It's Party Time (From ""Hotel Transylvania 3"")",2018-07-06,The Sound of Post-Teen Pop,pop,post-teen pop,0.565,0.748,...,1,0.0576,0.425,0.0,0.34,0.653,75.972,149158,2.485967,0
248,Back to You (feat. Bebe Rexha & Digital Farm A...,Louis Tomlinson,74,Back to You (feat. Bebe Rexha & Digital Farm A...,2017-07-21,Post Teen Pop,pop,post-teen pop,0.683,0.53,...,0,0.142,0.207,0.0,0.394,0.645,75.016,190428,3.1738,1
402,Break The Sky,Dust of Apollon,29,Break The Sky,2020-01-13,The Edge of Indie Poptimism,pop,indie poptimism,0.319,0.536,...,1,0.0478,0.376,0.269,0.408,0.223,75.984,190120,3.168667,0
2856,Still Don't Know My Name,Labrinth,75,Euphoria (Original Score from the HBO Series),2019-10-04,Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s,pop,dance pop,0.314,0.63,...,1,0.116,0.471,0.266,0.205,0.312,88.244,153294,2.5549,1
1342,"I Can’t Get Enough (benny blanco, Selena Gomez...",benny blanco,79,"I Can’t Get Enough (benny blanco, Selena Gomez...",2019-02-28,Dance Pop Tunes,pop,dance pop,0.541,0.468,...,0,0.362,0.404,4e-06,0.358,0.69,95.266,158027,2.633783,1
701,Dear Future Husband,Meghan Trainor,71,Title (Deluxe),2015-01-09,post teen pop,pop,post-teen pop,0.655,0.782,...,1,0.185,0.375,0.0,0.317,0.832,79.427,184227,3.07045,1
1244,"Hey Mama (feat. Nicki Minaj, Bebe Rexha & Afro...",David Guetta,71,Listen,2014-11-21,ELECTROPOP,pop,electropop,0.596,0.73,...,1,0.151,0.24,0.0,0.325,0.525,85.979,192560,3.209333,1
2441,Ready,Alessia Cara,77,Ready,2019-07-22,"post-teen alternative, indie, pop (large variety)",pop,post-teen pop,0.529,0.496,...,0,0.151,0.214,0.0,0.212,0.588,74.619,178747,2.979117,1


In [18]:
# More user-friendly version: find a song by name
def find_and_refine_song_by_name(song_name):
    def find_song_by_name(song_name):
        results = spotifysongs_cleaned[spotifysongs_cleaned['track_name'].str.contains(song_name, case=False, na=False)]
        if len(results) == 0:
            print(f"No song found with name: {song_name}")
            return None
        elif len(results) == 1:
            return results.iloc[0]
        else:
            print(f"Multiple songs found with name: {song_name}. Please refine your search.")
            return results

    def refine_search(initial_results):
        while len(initial_results) > 1:
            print("Multiple songs found. Please provide more details to refine your search.")

            artist_name = input("Enter the artist name (or leave blank to skip): ").strip()
            album_name = input("Enter part of the album name (or leave blank to skip): ").strip()

            if artist_name:
                initial_results = initial_results[initial_results['track_artist'].str.contains(artist_name, case=False, na=False)]

            if album_name:
                initial_results = initial_results[initial_results['track_album_name'].str.contains(album_name, case=False, na=False)]

            if len(initial_results) == 0:
                print("No songs match the refined criteria. Please try again.")
                return None
            elif len(initial_results) == 1:
                return initial_results.iloc[0]

        return initial_results

    # Initial search
    initial_results = find_song_by_name(song_name)

    if initial_results is None or isinstance(initial_results, pd.Series):
        return initial_results

    # Refine search if multiple results are found
    refined_result = refine_search(initial_results)

    return refined_result

This uses pandas' .str.contains() method to search spotifysongs_cleaned for rows whose track_name column contains the string given by song_name.
case=False makes the search case-insensitive.
na=False ensures that any NaN values in track_name are treated as non-matching, avoiding errors in the search.

The function also uses nested helper functions: find_song_by_name and refine_search are defined inside find_and_refine_song_by_name.
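The search behaviour can be tried on a small table. The toy_tracks frame below is made up for illustration. The second call uses regex=False, which is an optional variation rather than something the function above does: by default pandas treats the pattern as a regular expression, so search text containing characters such as an opening parenthesis may fail or match unexpectedly.

```python
import numpy as np
import pandas as pd

toy_tracks = pd.DataFrame({
    "track_name": ["Stay With Me", "stay", "(You Drive Me) Crazy", np.nan],
})

# Case-insensitive substring match; NaN names count as non-matching.
mask = toy_tracks["track_name"].str.contains("stay", case=False, na=False)
print(toy_tracks[mask]["track_name"].tolist())

# With the default regex=True, "(You" would be an unterminated group.
# regex=False treats the search text literally instead.
literal_mask = toy_tracks["track_name"].str.contains(
    "(You", case=False, na=False, regex=False
)
print(toy_tracks[literal_mask]["track_name"].tolist())
```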

In [19]:
# recommendation function
def recommend_songs_by_name(song_name, n_recommendations=10):
    song = find_and_refine_song_by_name(song_name)
    if song is None or isinstance(song, pd.DataFrame):
        return song

    song_features = song[features].values.reshape(1, -1)
    scaled_song_features = scaler.transform(song_features)

    distances, indices = knn_model.kneighbors(scaled_song_features, n_neighbors=n_recommendations)

    return spotifysongs_cleaned.iloc[indices[0]]

In [21]:
# Implementation
# Main loop for user interaction
while True:
    song_name = input("Enter a song name you like (or type 'exit' to quit): ")
    if song_name.lower() == 'exit':
        break

    recommended_songs = recommend_songs_by_name(song_name, 10)

    # recommend_songs_by_name returns None or a DataFrame (recommendations, or the unresolved matches)
    if recommended_songs is not None:
        print(recommended_songs[['track_name', 'track_artist', 'track_popularity']])

    reset = input("Would you like to search for another song? (yes/no): ").strip().lower()
    if reset != 'yes':
        break

print("Thank you for using the recommendation system!")

Enter a song name you like (or type 'exit' to quit): lay me down
No song found with name: lay me down
Would you like to search for another song? (yes/no): yes
Enter a song name you like (or type 'exit' to quit): stay with me




                              track_name      track_artist  track_popularity
2847                        Stay With Me            ayokay                68
1293                                Hope  The Chainsmokers                74
1602                Kissing Other People     Lennon Stella                74
3266                             Undrunk          FLETCHER                74
1829                        Lucky Strike       Troye Sivan                 6
2043  Never Go Back - Robin Schulz Remix      Dennis Lloyd                70
3020                          Tessellate             alt-J                59
2837                            Starving  Hailee Steinfeld                77
1354      I Don't Wanna Love You Anymore              LANY                 5
2253                        Out of Place     Brooke Sierra                33
Would you like to search for another song? (yes/no): no
Thank you for using the recommendation system!


## **Logistic Regression model**

1. Data Preparation:
We'll simulate user feedback by creating a binary label for whether a user likes a song. For simplicity, we'll assume that songs with a popularity score greater than 50 are liked by the user.

2. Train the Logistic Regression Model:
We'll use the prepared dataset to train the Logistic Regression model.

3. Generate Recommendations:
We'll predict the probability that the user will like each song and recommend the top N songs with the highest probabilities.
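The three steps above can be sketched on synthetic data before applying them to the real dataset. Everything prefixed with toy_ is invented for the example, including the way the fake popularity score is generated; only the popularity > 50 rule mirrors the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins: 200 "songs", 3 numeric features, and a popularity score.
toy_X = rng.normal(size=(200, 3))
toy_popularity = 50 + 15 * toy_X[:, 0] + rng.normal(scale=5, size=200)

# Step 1: binary label, mirroring the popularity > 50 rule.
toy_liked = (toy_popularity > 50).astype(int)

# Step 2: scale the features and fit the model.
toy_scaled = StandardScaler().fit_transform(toy_X)
toy_logreg = LogisticRegression().fit(toy_scaled, toy_liked)

# Step 3: rank all songs by predicted like-probability, highest first.
like_probability = toy_logreg.predict_proba(toy_scaled)[:, 1]
top_n = 5
top_rows = np.argsort(like_probability)[-top_n:][::-1]

print(top_rows)
print(like_probability[top_rows].round(3))
```

Note that np.argsort sorts in ascending order, so the slice takes the last top_n positions and then reverses them to put the most likely song first.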

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

In [25]:
# selecting relevant features for the model

features = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
            'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']

# Simulate user feedback (1 for like, 0 for dislike)
# For simplicity, let's assume users like songs with popularity > 50. Create a new column called user feedback

spotifysongs_cleaned['user_feedback'] = (spotifysongs_cleaned['track_popularity'] > 50).astype(int)

In [26]:
spotifysongs_cleaned.head()

Unnamed: 0,track_name,track_artist,track_popularity,track_album_name,track_album_release_date,playlist_name,playlist_genre,playlist_subgenre,danceability,energy,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,duration_min,user_feedback
0,"""This Is Seagull….""",The Snake Corps,34,Smother Earth,1990-01-01,"Maxi Pop GOLD (New Wave, Electropop, Synth Po...",pop,electropop,0.516,0.58,...,0,0.0295,2e-06,0.857,0.11,0.235,135.903,238227,3.97045,0
1,(I Just) Died In Your Arms,Cutting Crew,71,Broadcast,1986-01-01,80's Songs | Top 💯 80s Music Hits,pop,electropop,0.625,0.726,...,0,0.0444,0.0158,0.000169,0.0625,0.507,124.945,280400,4.673333,1
2,(No One Knows Me) Like the Piano,Sampha,63,Process,2017-02-03,Ultimate Indie Presents... Best Indie Tracks o...,pop,dance pop,0.621,0.199,...,1,0.0344,0.976,0.00186,0.107,0.178,128.905,218160,3.636,1
3,(You Drive Me) Crazy,Britney Spears,2,Baby One More Time,1999-01-12,90s Dance Hits,pop,dance pop,0.755,0.944,...,1,0.0342,0.0599,0.0,0.332,0.962,104.006,198093,3.30155,0
4,(You Drive Me) Crazy [The Stop Remix!] - Remas...,Britney Spears,40,The Essential Britney Spears,2014-07-29,90s Dance Hits,pop,dance pop,0.727,0.951,...,1,0.0375,0.211,0.000701,0.0537,0.774,101.078,197653,3.294217,0


In [27]:
# Standardize the features (zero mean, unit variance)

scaler = StandardScaler()
scaled_features = scaler.fit_transform(spotifysongs_cleaned[features])

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
# Split the data into training and test sets, then train the logistic regression model
X_train, X_test, y_train, y_test = train_test_split(scaled_features, spotifysongs_cleaned['user_feedback'], test_size=0.2, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [31]:
def recommend_songs_by_name(song_name, n_recommendations=10):
    # Look the song up first so an unknown or ambiguous name is reported to the caller
    song = find_and_refine_song_by_name(song_name)
    if song is None or isinstance(song, pd.DataFrame):
        return song

    # Predict the probability of a "like" for every song in the dataset.
    # Note: this ranking depends only on the fitted model, not on the features of the chosen song.
    probabilities = logreg.predict_proba(scaled_features)[:, 1]

    # Get the top N recommendations, highest probability first
    recommendations_indices = np.argsort(probabilities)[-n_recommendations:][::-1]

    return spotifysongs_cleaned.iloc[recommendations_indices]

# Main loop for user interaction
while True:
    song_name = input("Enter a song name you like (or type 'exit' to quit): ")
    if song_name.lower() == 'exit':
        break

    recommended_songs = recommend_songs_by_name(song_name, 10)

    # recommend_songs_by_name returns None or a DataFrame (recommendations, or the unresolved matches)
    if recommended_songs is not None:
        print(recommended_songs[['track_name', 'track_artist', 'track_popularity']])

    reset = input("Would you like to search for another song? (yes/no): ").strip().lower()
    if reset != 'yes':
        break

print("Thank you for using the recommendation system!")

Enter a song name you like (or type 'exit' to quit): believer




                                             track_name         track_artist  \
611                                             Cradles            Sub Urban   
1342  I Can’t Get Enough (benny blanco, Selena Gomez...         benny blanco   
1537                                           Jalapeño        Janelle Monáe   
3005                                              Teeth  5 Seconds of Summer   
2602                                  Say It to My Face         Madison Beer   
236                                        BTSTU (Edit)             Jai Paul   
311                             Better Not (with Wafia)      Louis The Child   
3652                                            shut up       Greyson Chance   
3461                                 Why Do You Love Me   Charlotte Lawrence   
75                                          Acid Dreams                  MAX   

      track_popularity  
611                 78  
1342                79  
1537                33  
3005               

# Evaluating the models

In [33]:
# Logistic Regression

y_pred = logreg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.6132971506105834
Precision: 0.6216730038022814
Recall: 0.7917675544794189
Confusion Matrix:
 [[125 199]
 [ 86 327]]
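The reported scores can be re-derived from the confusion matrices, which helps when comparing the two models. The helper name scores_from_confusion is made up for this sketch; it relies on scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, that is [[TN, FP], [FN, TP]].

```python
import numpy as np


def scores_from_confusion(matrix):
    """Accuracy, precision, and recall for a binary matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(matrix)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Confusion matrices copied from the two evaluation outputs in this notebook.
knn_scores = scores_from_confusion([[161, 163], [166, 247]])
logreg_scores = scores_from_confusion([[125, 199], [86, 327]])

for name, scores in (("KNN", knn_scores), ("Logistic Regression", logreg_scores)):
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```

Both matrices sum to 737 test songs with the same row totals (324 and 413), so the two models are being compared on the same split.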


## Conclusion

In this case, Logistic Regression is the more accurate model: on the same test split it reaches an accuracy of about 0.61 and a recall of about 0.79, compared with about 0.55 and 0.60 for KNN, while precision is similar (about 0.62 versus 0.60).

Logistic Regression: Easy to implement and interpret, and computationally efficient, especially for large datasets. It provides probabilities for class membership, which can be useful for ranking recommendations. However, it assumes a linear relationship between the features and the log odds of the outcome, so it may not capture complex patterns in the data.


K-Nearest Neighbors (KNN): Can capture complex, non-linear relationships in the data. It is simple to implement because it has no explicit training phase, works well with multi-modal distributions, and can adapt to the local structure of the data. However, prediction can be slow, especially with large datasets, because it involves computing distances to the training samples, and the entire training dataset must be stored. It is also subject to the curse of dimensionality: performance can degrade with high-dimensional data.