<a href="https://colab.research.google.com/github/philnumpy/PRML-PROJECT/blob/main/prmlknn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

KNN


import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the dataset
file_path = "spotify_tracks.csv"
df = pd.read_csv(file_path)

# Selecting relevant numerical features for KNN
features = [
    "acousticness", "danceability", "duration_ms", "energy",
    "instrumentalness", "key", "liveness", "loudness", "mode",
    "speechiness", "tempo", "time_signature", "valence"
]

# Extract feature matrix (values are used as-is; StandardScaler is imported above but not applied)
X = df[features].values


def knn_recommend(X, df, song_features, k=5, distance_metric="euclidean", exclude_name=None):
    """
    Recommends k similar songs using K-Nearest Neighbors.

    Args:
        X: Feature matrix of all songs.
        df: Corresponding song metadata (row i of df describes row i of X).
        song_features: Feature vector of the input song.
        k: Number of nearest neighbors to recommend.
        distance_metric: Distance metric ('euclidean' or 'manhattan').
        exclude_name: Track name to leave out of the results (case-insensitive),
            typically the input song itself.

    Returns:
        List of recommended songs.
    """
    distances = []

    # Compute the distance from the input song to every song in X
    for i, train_features in enumerate(X):
        if distance_metric == "euclidean":
            dist = np.linalg.norm(train_features - song_features)
        elif distance_metric == "manhattan":
            dist = np.sum(np.abs(train_features - song_features))
        else:
            raise ValueError("Invalid metric. Use 'euclidean' or 'manhattan'.")

        song = df.iloc[i]
        # Ensure that the input song is NOT included in recommendations
        if exclude_name is None or song["track_name"].lower() != exclude_name.lower():
            distances.append((dist, song))

    # Sort by distance, nearest first
    distances.sort(key=lambda x: x[0])
    recommendations = []
    seen_tracks = set()

    # Take the k nearest songs, skipping duplicate track names
    for _, song in distances:
        track_name = song["track_name"]
        if track_name not in seen_tracks:
            recommendations.append(song)
            seen_tracks.add(track_name)
        if len(recommendations) >= k:
            break

    return recommendations

def recommend_songs(input_song, k=5):
    """
    Recommends similar songs based on user input.

    Args:
        input_song: Name of the song provided by the user.
        k: Number of recommendations.

    Returns:
        None (prints formatted output).
    """
    # Find the song in the dataset (case-insensitive match on track name)
    song_row = df[df["track_name"].str.lower() == input_song.lower()]

    if song_row.empty:
        print("Song not found in the dataset.")
        return

    # Use the feature vector of the first matching row
    song_features = song_row[features].values[0]
    # Pass the song name explicitly instead of relying on a global variable
    recommendations = knn_recommend(X, df, song_features, k=k, exclude_name=input_song)

    print("\n" + "=" * 70)
    print(f"🎵 **Input Song:** {input_song}\n")
    print("🎧 **Recommended Songs:**\n")

    # Creating a formatted output using tabular format
    print(f"{'No.':<5} {'Track Name':<40} {'Artist Name':<30} {'Track URL'}")
    print("-" * 130)

    for idx, song in enumerate(recommendations, start=1):
        track_name = song["track_name"][:37] + "..." if len(song["track_name"]) > 40 else song["track_name"]
        artist_name = song["artist_name"][:27] + "..." if len(song["artist_name"]) > 30 else song["artist_name"]
        track_url = song["track_url"]

        print(f"{idx:<5} {track_name:<40} {artist_name:<30} {track_url}")

    print("=" * 70 + "\n")

# User inputs a song name
input_song = input("Enter a song name: ")
recommend_songs(input_song, k=5)


Enter a song name: Leo Das Entry (From "Leo")

🎵 **Input Song:** Leo Das Entry (From "Leo")

🎧 **Recommended Songs:**

No.   Track Name                               Artist Name                    Track URL
----------------------------------------------------------------------------------------------------------------------------------
1     Intro - Stand Up                         BIGBANG                        https://open.spotify.com/track/4bfKZmzdoP3mJdAYZ8tqw7
2     Ratatapata - Boom Bap Mix                Arivu, Ranina Reddy, C. Sat... https://open.spotify.com/track/0nbQn4L8uCdMjIYnP2V6U5
3     Luke and Cassie                          Blake Neely, Tony Kanal        https://open.spotify.com/track/2rEkpAXFAZWYx2DXzGTZgL
4     Blue Gold                                Ramin Djawadi                  https://open.spotify.com/track/3TEC1h2U9rigkJfb1MDTEd
5     Viewing Time                             Ramin Djawadi                  https://open.spotify.com/track/26E4rr3d0Cl1PvQHftQ0C9



K-Nearest Neighbors (KNN) is a simple yet effective machine learning algorithm used for classification and regression tasks. It works by finding the k most similar data points (neighbors) to a given input based on a chosen distance metric, such as Euclidean or Manhattan distance.

How KNN Works:

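Feature Selection & Normalization: Each data point is represented as a feature vector, and a scaling technique (e.g., StandardScaler) is normally applied so that no single feature dominates the distance. Note that the code above imports StandardScaler but never applies it, so large-valued features such as duration_ms dominate the computed distances.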


Distance Calculation: For a given input, the algorithm computes the distance to all training data points.


Neighbor Selection: The k closest data points are identified.


Prediction/Recommendation:

For classification, the majority class among neighbors is assigned to the input.

For recommendation systems (as in this case), the closest songs are suggested, with duplicate track names skipped.
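
To illustrate why the normalization step matters, here is a minimal sketch on three made-up tracks. The values, the three-feature layout, and the helper name `nearest` are assumptions for illustration only; the z-score line mirrors what StandardScaler does with its default settings.

```python
import numpy as np

# Three hypothetical tracks: [danceability, energy, duration_ms]
X_demo = np.array([
    [0.80, 0.90, 200_000.0],   # query track
    [0.82, 0.88, 260_000.0],   # similar sound, different length
    [0.10, 0.15, 201_000.0],   # different sound, similar length
])

def nearest(X, query_idx):
    """Index of the row closest (Euclidean) to X[query_idx], excluding itself."""
    dists = np.linalg.norm(X - X[query_idx], axis=1)
    dists[query_idx] = np.inf  # never return the query itself
    return int(np.argmin(dists))

# Z-score standardization: zero mean and unit variance per column
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

raw_neighbor = nearest(X_demo, 0)       # duration_ms dominates the raw distance
scaled_neighbor = nearest(X_scaled, 0)  # all three features contribute
print(raw_neighbor, scaled_neighbor)    # prints: 2 1
```

On the raw values the nearest neighbor is the track with a similar length; after scaling it is the track with a similar sound.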

Advantages:

Simple & Intuitive: Easy to understand and implement.

No Training Phase: Unlike many algorithms, KNN does not require model training.

Adaptability: Works with any data for which a meaningful distance between points can be defined.

Limitations:

Computationally Expensive: For large datasets, calculating distances for all data points can be slow.

Sensitive to Noise & Outliers: Noisy data can impact accuracy.

Choice of k Matters: A poorly chosen k can lead to suboptimal recommendations.
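
The cost of the distance computation can be reduced somewhat by replacing the Python loop with vectorized NumPy operations and by selecting the k smallest distances without sorting the whole array. The following is a sketch under assumptions, not the notebook's implementation: `top_k_indices` and `X_demo` are hypothetical names, and the random matrix only stands in for the 13 song features.

```python
import numpy as np

def top_k_indices(X, query, k=5, metric="euclidean"):
    """Return the indices of the k rows of X closest to query, nearest first."""
    diff = X - query  # broadcasting: one difference vector per row
    if metric == "euclidean":
        dists = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        dists = np.abs(diff).sum(axis=1)
    else:
        raise ValueError("Invalid metric. Use 'euclidean' or 'manhattan'.")

    k = min(k, len(dists))
    # argpartition places the k smallest distances first without a full sort;
    # only those k candidates are then sorted by distance.
    candidates = np.argpartition(dists, k - 1)[:k]
    return candidates[np.argsort(dists[candidates])]

rng = np.random.default_rng(0)
X_demo = rng.random((1000, 13))          # stand-in for 1000 songs x 13 features
idx = top_k_indices(X_demo, X_demo[0], k=5)
print(idx[0])                            # the query row itself, at distance 0
```

This still compares the query against every row, so it lowers the constant factor rather than the overall linear cost per query.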

WORKING OF THE CODE:

1. Loading and Preprocessing the Dataset:

Step 1: Load the Dataset

    file_path = "spotify_tracks.csv"
    df = pd.read_csv(file_path)

Reads the dataset from the CSV file into a Pandas DataFrame (df).

Step 2: Selecting Relevant Features

    features = [
    "acousticness", "danceability", "duration_ms", "energy",
    "instrumentalness", "key", "liveness", "loudness", "mode",
    "speechiness", "tempo", "time_signature", "valence"
    ]
    X = df[features].values

Extracts numerical features that influence music similarity, such as acousticness, energy, danceability, etc.

X contains only these selected features for training.



Step 3: Splitting the Dataset

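Splits the dataset into a training set (50,000 samples) and a test set (the remaining samples). The code cell at the top of this notebook omits this split and searches the full feature matrix X instead.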

X_train and X_test contain numerical feature values.

df_train and df_test store the metadata (song names, artists, URLs, etc.).
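
The splitting code itself is not shown here. One way it could look is sketched below, assuming a simple sequential split (the first `train_size` rows for training); whether the original split was sequential or shuffled is not stated. The helper name `split_dataset` and the tiny demo frame are hypothetical.

```python
import pandas as pd

def split_dataset(df, features, train_size=50_000):
    """Split df into train/test parts and return aligned feature matrices."""
    # Assumption: sequential split, first train_size rows are the training set
    df_train = df.iloc[:train_size].reset_index(drop=True)
    df_test = df.iloc[train_size:].reset_index(drop=True)
    X_train = df_train[features].values
    X_test = df_test[features].values
    return X_train, X_test, df_train, df_test

# Tiny stand-in for the Spotify dataset
demo_df = pd.DataFrame({
    "track_name": ["A", "B", "C", "D", "E"],
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5],
    "tempo": [90.0, 100.0, 110.0, 120.0, 130.0],
})
X_train, X_test, df_train, df_test = split_dataset(
    demo_df, ["energy", "tempo"], train_size=3
)
print(X_train.shape, X_test.shape)  # prints: (3, 2) (2, 2)
```

Resetting the index keeps row i of each feature matrix aligned with row i of the matching metadata frame.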

2. Implementing the KNN Recommendation System:

Step 4: Compute Distances to Find Nearest Neighbors

    def knn_recommend(X_train, df_train, song_features, k=5, distance_metric="euclidean"):

        distances = []

        for i, train_features in enumerate(X_train):
            if distance_metric == "euclidean":
                dist = np.linalg.norm(train_features - song_features)
            elif distance_metric == "manhattan":
                dist = np.sum(np.abs(train_features - song_features))
            else:
                raise ValueError("Invalid metric. Use 'euclidean' or 'manhattan'.")

            # Row i of df_train holds the metadata for row i of X_train
            song = df_train.iloc[i]
            # Ensure that the input song is NOT included in recommendations
            # (input_song is the song name read from user input at module level)
            if song["track_name"].lower() != input_song.lower():
                distances.append((dist, song))
        
        
Takes a song's feature vector (song_features) and compares it with all songs in the training set (X_train).

Computes the distance using either:

Euclidean Distance (default) → Measures straight-line distance.

Manhattan Distance → Measures distance along coordinate axes.

Stores distances and corresponding song metadata.
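
As a small worked example of the two metrics, consider the points (0, 0) and (3, 4); these values are chosen only for illustration. The calls are the same NumPy expressions used in knn_recommend().

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean: sqrt(3^2 + 4^2) = 5
euclidean = np.linalg.norm(a - b)
# Manhattan: |3| + |4| = 7
manhattan = np.sum(np.abs(a - b))

print(euclidean, manhattan)  # prints: 5.0 7.0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so the two metrics can rank neighbors differently.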

Step 5: Sort and Select the k Nearest Neighbors

    distances.sort(key=lambda x: x[0])
    recommendations = []
    seen_tracks = set()

    for _, song in distances:
        track_name = song["track_name"]
        if track_name not in seen_tracks:
            recommendations.append(song)
            seen_tracks.add(track_name)
        if len(recommendations) >= k:
            break

Sorts all songs based on computed distance.

Ensures diversity by avoiding duplicate track names.

Selects the top k nearest neighbors as recommendations.
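
The sort-and-deduplicate step can be tried in isolation on a few made-up (distance, track name) pairs. The helper `pick_unique` and the demo data are hypothetical; the notebook stores whole metadata rows rather than bare names.

```python
def pick_unique(distances, k):
    """Return up to k track names, nearest first, skipping repeated names."""
    picked, seen = [], set()
    for _, name in sorted(distances, key=lambda pair: pair[0]):
        if name not in seen:
            picked.append(name)
            seen.add(name)
        if len(picked) >= k:
            break
    return picked

demo = [
    (0.4, "Song B"),
    (0.1, "Song A"),
    (0.2, "Song A"),   # same name as the nearest entry, so it is skipped
    (0.9, "Song C"),
    (0.5, "Song D"),
]
result = pick_unique(demo, k=3)
print(result)  # prints: ['Song A', 'Song B', 'Song D']
```

Because duplicates are detected by exact track name, two different songs that share a title count as one, while the same song under slightly different titles does not.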

3. Recommending Similar Songs

Step 6: Wrapper Function for Multiple Test Songs

    def recommend_songs(sample_indices, k=5):
        for test_index in sample_indices:
            test_song = df_test.iloc[test_index]
            test_features = X_test[test_index]

            print("=" * 50)
            print(f"Input Song: {test_song['track_name']} by {test_song['artist_name']}")
            print(f"Listen: {test_song['track_url']}\n")
            print("Recommended Songs:")

            recommendations = knn_recommend(X_train, df_train, test_features, k=k)
            rec_df = pd.DataFrame(recommendations)[["track_name", "artist_name", "track_url"]]
            print(rec_df.to_string(index=False))
            print("=" * 50 + "\n")

Loops over multiple test song indices.

Fetches metadata of the test song.

Calls knn_recommend() to get k recommendations.

Formats the output for better readability.


4. Running the Recommendation System

    sample_indices = [0, 1, 2, 3, 4]

    recommend_songs(sample_indices, k=5)

Selects the first 5 test songs (indices 0 to 4); the selection is fixed, not random.

Prints 5 recommendations for each test song.


PROBLEM WITH THE APPROACH

Potential Overfitting with Low k Values:

Any evaluation metrics are likely to look inflated, because the recommendations are by construction the closest songs in the dataset.

Low k values, such as 5, may produce an effect similar to overfitting: the system recommends nearly identical songs rather than diverse yet relevant ones.

Bias Towards Popular Songs:

The dataset includes a "popularity" column. It is not among the 13 features selected in the code above, but if it is added to the feature set, the model might recommend more popular songs rather than truly similar ones.

This can reduce diversity in recommendations, limiting the discovery of less-known but relevant songs.
