<a href="https://colab.research.google.com/github/maximalsteel/Capstone-Project--Predicting-Spotify-Song-Popularity/blob/main/Spotify_Popularity_Testing_%26_Deployment_by_Divyansh_Taneja.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Spotify Track Data Collection for Model Validation**

> This script leverages the Spotipy library to gather comprehensive track details from Spotify, such as audio features and popularity scores, across various genres and years. The collected data is exported as a CSV file to test the model prior to deployment.



In [None]:
! pip install spotipy

Collecting spotipy
  Downloading spotipy-2.24.0-py3-none-any.whl.metadata (4.9 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-5.0.8-py3-none-any.whl.metadata (9.2 kB)
Downloading spotipy-2.24.0-py3-none-any.whl (30 kB)
Downloading redis-5.0.8-py3-none-any.whl (255 kB)
Installing collected packages: redis, spotipy
Successfully installed redis-5.0.8 spotipy-2.24.0


In [None]:
import os
import time

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Read credentials from the environment instead of hard-coding secrets in the notebook
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']

# Set up authorization
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# Function to get track details
def get_track_details(track_id, genre):
    try:
        track = sp.track(track_id)
        features = sp.audio_features(track_id)[0]

        # Popularity score (0-100) comes from the track object already fetched above
        popularity = track['popularity']

        return {
            'track_id': track['id'],
            'artist': track['artists'][0]['name'],
            'album': track['album']['name'],
            'track_name': track['name'],
            'popularity': popularity,
            'duration_ms': features['duration_ms'],
            'explicit': track['explicit'],
            'danceability': features['danceability'],
            'energy': features['energy'],
            'key': features['key'],
            'loudness': features['loudness'],
            'mode': features['mode'],
            'speechiness': features['speechiness'],
            'acousticness': features['acousticness'],
            'instrumentalness': features['instrumentalness'],
            'liveness': features['liveness'],
            'valence': features['valence'],
            'tempo': features['tempo'],
            'time_signature': features['time_signature'],
            'genre': genre,
        }
    except Exception as e:
        print(f"Error getting details for track ID {track_id}: {e}")
        return None

# Function to search and collect tracks
def collect_tracks(query, year_start, year_end, max_tracks_to_collect):
    all_tracks = []
    limit = 50  # Max tracks per request
    total_collected = 0

    for year in range(year_start, year_end + 1):
        for offset in range(0, max_tracks_to_collect, limit):
            try:
                results = sp.search(q=f'{query} year:{year}', type='track', limit=limit, offset=offset)
                if not results['tracks']['items']:
                    break

                track_ids = [track['id'] for track in results['tracks']['items']]
                # Fetch each track's details once and drop the tracks that failed
                fetched = (get_track_details(track_id, query) for track_id in track_ids)
                track_data = [details for details in fetched if details is not None]
                all_tracks.extend(track_data)
                total_collected += len(track_data)

                # Print progress
                print(f"Collected {total_collected} tracks so far.")

                # Check if we have collected enough tracks
                if total_collected >= max_tracks_to_collect:
                    return all_tracks

                # Respect rate limits
                time.sleep(1)

            except Exception as e:
                print(f"An error occurred: {e}")
                time.sleep(5)  # Back off, then continue with the next page (the failed page is not retried)

    return all_tracks

# Collect data using different queries (genres)
queries = ['happy','romance','folk','alt-rock','german','groove']
year_start = 2019
year_end = 2023
max_tracks_to_collect = 150  # Set the maximum number of tracks to collect

all_track_data = []

for query in queries:
    print(f"Collecting up to {max_tracks_to_collect} tracks for query: {query}")
    track_data = collect_tracks(query, year_start, year_end, max_tracks_to_collect // len(queries))
    all_track_data.extend(track_data)

    # Check if we have collected enough tracks
    if len(all_track_data) >= max_tracks_to_collect:
        break

# Convert to pandas DataFrame for easy manipulation
valid_track_data = [track for track in all_track_data if track is not None]
df = pd.DataFrame(valid_track_data)

# Save to CSV
df.to_csv('/content/drive/MyDrive/Capstone/Deployement_spotify_test_.csv', index=False)
print(f"Saved {len(df)} collected tracks to Deployement_spotify_test_.csv")


Collecting up to 150 tracks for query: happy
Collected 50 tracks so far.
Collecting up to 150 tracks for query: romance
Collected 50 tracks so far.
Collecting up to 150 tracks for query: folk
Collected 50 tracks so far.
Saved 150 collected tracks to Deployement_spotify_test_.csv
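
The collection loop above pauses after an error and then moves on to the next page, so a failed page is skipped rather than retried. The following is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retries`, its parameters, and the delay schedule are hypothetical and not part of Spotipy; the demo uses a stand-in function instead of a real API call.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again.

    Hypothetical helper: the name, defaults, and backoff schedule are
    assumptions for illustration, not part of Spotipy.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as error:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return {"tracks": {"items": []}}


waits = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_search, sleep=waits.append)
```

In the notebook, the wrapped call could be something like `lambda: sp.search(q=f'{query} year:{year}', type='track', limit=limit, offset=offset)`, so that a temporary failure repeats the same page instead of skipping it.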


In [None]:
# IMPORT LIBRARIES
# Data Manipulation
import numpy as np
import pandas as pd

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Preprocessing
from sklearn.preprocessing import LabelEncoder

### **Preprocessing Spotify Track Data for Model Testing**
> This step assigns popularity classes and categories based on popularity scores and selects the relevant features for model testing. The processed data is used to validate the prediction model prior to deployment.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [None]:
# Load the collected tracks once Drive is mounted
data = pd.read_csv('/content/drive/MyDrive/Capstone/EDA/Deployement_spotify_test_.csv')


In [None]:
data['popularity_class'] = 0
data['popularity_category'] = ''

# Assign classes and categories based on popularity ranges
data.loc[data['popularity'] <= 25, 'popularity_class'] = 0
data.loc[(data['popularity'] > 25) & (data['popularity'] <= 50), 'popularity_class'] = 1
data.loc[(data['popularity'] > 50) & (data['popularity'] <= 75), 'popularity_class'] = 2
data.loc[(data['popularity'] > 75) & (data['popularity'] <= 100), 'popularity_class'] = 3

data.loc[data['popularity'] <= 25, 'popularity_category'] = 'Low Popularity'
data.loc[(data['popularity'] > 25) & (data['popularity'] <= 50), 'popularity_category'] = 'Medium Popularity'
data.loc[(data['popularity'] > 50) & (data['popularity'] <= 75), 'popularity_category'] = 'High Popularity'
data.loc[(data['popularity'] > 75) & (data['popularity'] <= 100), 'popularity_category'] = 'Very High Popularity'
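
The four ranges above share three cut points (25, 50, 75), so the class and its label can be derived from a single lookup. The following is a minimal sketch using only the standard library. `classify_popularity` is a hypothetical helper name, and the sketch assumes scores lie in the 0 to 100 range.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the first three classes, as in the ranges above.
CUT_POINTS = [25, 50, 75]
CATEGORIES = [
    'Low Popularity',
    'Medium Popularity',
    'High Popularity',
    'Very High Popularity',
]


def classify_popularity(score):
    """Return (popularity_class, popularity_category) for a 0-100 score.

    Hypothetical helper for illustration; bisect_left keeps each upper
    bound inside its own class (25 -> class 0, 26 -> class 1).
    """
    popularity_class = bisect_left(CUT_POINTS, score)
    return popularity_class, CATEGORIES[popularity_class]


examples = {score: classify_popularity(score) for score in (0, 25, 26, 50, 75, 76, 100)}
for score, (cls, category) in examples.items():
    print(score, cls, category)
```

Applied to the DataFrame, `data['popularity'].map(lambda s: classify_popularity(s)[0])` would produce the class column in one pass.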

In [None]:
# Copy the selected columns so later assignments do not write to a view of `data`
test = data[['duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'genre']].copy()

In [None]:
# label encoding explicit
le = LabelEncoder()
test['explicit'] = le.fit_transform(test['explicit'])

In [None]:
test.sample(3)

Unnamed: 0,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,genre
146,172000,0,0.597,0.485,6,-7.899,0,0.0249,0.668,0.0117,0.116,0.349,96.124,4,folk
70,222999,0,0.659,0.663,11,-6.823,1,0.0811,0.205,0.0,0.105,0.333,140.013,4,romance
11,161067,0,0.787,0.503,1,-5.23,0,0.0606,0.0743,1.5e-05,0.287,0.659,100.05,4,happy


### **Testing and Validating Spotify Track Popularity Predictions**


> This code preprocesses Spotify track data by encoding genres using a predefined mapping, then applies a pre-trained Random Forest model to predict the popularity category of each track. The predicted popularity is added to the DataFrame for further analysis.

In [None]:
import pandas as pd
import joblib

# Load the model
model_path = '/content/drive/MyDrive/Capstone/EDA/Final_random_forest_model.joblib'
with open(model_path, 'rb') as file:
    Final_Model = joblib.load(file)

# Define the complete genre mapping
genre_mapping = {
    'Tamil': 5, 'Telugu': 6, 'Kannada': 2, 'Bollywood': 0, 'Rap': 3, 'Romance': 4, 'Indian pop': 1,
    'acoustic': 7, 'afrobeat': 8, 'alt-rock': 9, 'alternative': 10, 'ambient': 11, 'anime': 12,
    'black-metal': 13, 'bluegrass': 14, 'blues': 15, 'brazil': 16, 'breakbeat': 17, 'british': 18,
    'cantopop': 19, 'chicago-house': 20, 'children': 21, 'chill': 22, 'classical': 23, 'club': 24,
    'comedy': 25, 'country': 26, 'dance': 27, 'dancehall': 28, 'death-metal': 29, 'deep-house': 30,
    'detroit-techno': 31, 'disco': 32, 'disney': 33, 'drum-and-bass': 34, 'dub': 35, 'dubstep': 36,
    'edm': 37, 'electro': 38, 'electronic': 39, 'emo': 40, 'folk': 41, 'forro': 42, 'french': 43,
    'funk': 44, 'garage': 45, 'german': 46, 'gospel': 47, 'goth': 48, 'grindcore': 49, 'groove': 50,
    'grunge': 51, 'guitar': 52, 'happy': 53, 'hard-rock': 54, 'hardcore': 55, 'hardstyle': 56,
    'heavy-metal': 57, 'hip-hop': 58, 'honky-tonk': 59, 'house': 60, 'idm': 61, 'indian': 62,
    'indie-pop': 64, 'indie': 63, 'industrial': 65, 'iranian': 66, 'j-dance': 67, 'j-idol': 68,
    'j-pop': 69, 'j-rock': 70, 'jazz': 71, 'k-pop': 72, 'kids': 73, 'latin': 74, 'latino': 75,
    'malay': 76, 'mandopop': 77, 'metal': 78, 'metalcore': 79, 'minimal-techno': 80, 'mpb': 81,
    'new-age': 82, 'opera': 83, 'pagode': 84, 'party': 85, 'piano': 86, 'pop-film': 88, 'pop': 87,
    'power-pop': 89, 'progressive-house': 90, 'psych-rock': 91, 'punk-rock': 93, 'punk': 92,
    'r-n-b': 94, 'reggae': 95, 'reggaeton': 96, 'rock-n-roll': 98, 'rock': 97, 'rockabilly': 99,
    'romance': 100, 'sad': 101, 'salsa': 102, 'samba': 103, 'sertanejo': 104, 'show-tunes': 105,
    'singer-songwriter': 106, 'ska': 107, 'sleep': 108, 'songwriter': 109, 'soul': 110, 'spanish': 111,
    'study': 112, 'swedish': 113, 'synth-pop': 114, 'tango': 115, 'techno': 116, 'trance': 117,
    'trip-hop': 118, 'turkish': 119, 'world-music': 120}

df = test.copy()

# Encode 'genre'
df['genre'] = df['genre'].map(genre_mapping).fillna(-1).astype(int)

predictions = Final_Model.predict(df)

# Map the prediction to categories
categories = ['Low Popularity', 'Medium Popularity', 'High Popularity', 'Very High Popularity']
df['Predicted Popularity'] = [categories[pred] for pred in predictions]
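
Because unmapped genres are encoded as -1, a spelling or capitalization mismatch passes silently (the mapping contains both 'Romance' and 'romance', with different codes). The following is a minimal sketch of an encoder that also reports which labels were not found. `encode_genres` is a hypothetical helper, and `genre_subset` is a small excerpt of the mapping above used only for illustration.

```python
def encode_genres(genres, mapping, unknown=-1):
    """Encode genre labels and report the ones missing from the mapping.

    Hypothetical helper: returns (codes, unmapped), where `unmapped` is the
    sorted list of distinct labels that fell back to `unknown`.
    """
    codes = [mapping.get(genre, unknown) for genre in genres]
    unmapped = sorted({genre for genre in genres if genre not in mapping})
    return codes, unmapped


# Excerpt of the genre mapping defined above, for illustration only.
genre_subset = {'Romance': 4, 'folk': 41, 'happy': 53, 'romance': 100}

codes, unmapped = encode_genres(['happy', 'romance', 'Folk', 'folk'], genre_subset)
print(codes)
print(unmapped)
```

Checking that `unmapped` is empty before calling `Final_Model.predict` would surface labels the model never saw during training.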


In [None]:
import sklearn
print(sklearn.__version__)

1.2.2


In [None]:
!pip install scikit-learn==1.3.2

Collecting scikit-learn==1.3.2
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.3.2


### **Comparing Predicted and Actual Popularity**

> This step compares the predicted popularity category of each track with its actual category, samples five entries, and counts the matching predictions to evaluate model performance.



In [None]:
df1 = df['Predicted Popularity']
df2 = data['popularity_category']
# Place predicted and actual categories side by side, aligned on the shared row index
comparison_df = pd.concat([df1, df2], axis=1)
comparison_df.sample(5)

Unnamed: 0,Predicted Popularity,popularity_category
79,Low Popularity,Medium Popularity
1,Low Popularity,High Popularity
58,Medium Popularity,Medium Popularity
81,Medium Popularity,Medium Popularity
91,Low Popularity,Low Popularity


In [None]:
# The actual categories
data['popularity_category'].value_counts()

popularity_category,count
Medium Popularity,93
Low Popularity,40
High Popularity,16
Very High Popularity,1


In [None]:
# Count and display the instances where predictions match the actual categories
comparison_df[comparison_df['Predicted Popularity']==comparison_df['popularity_category']].value_counts()

Predicted Popularity,popularity_category,count
Medium Popularity,Medium Popularity,39
Low Popularity,Low Popularity,21
High Popularity,High Popularity,2
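
Counting only the matches hides where the misses go. The following is a minimal sketch that tallies every (predicted, actual) pair with the standard library, using the five sampled rows shown earlier as input. The variable names are illustrative.

```python
from collections import Counter

# (predicted, actual) pairs copied from the five sampled rows shown earlier.
pairs = [
    ('Low Popularity', 'Medium Popularity'),
    ('Low Popularity', 'High Popularity'),
    ('Medium Popularity', 'Medium Popularity'),
    ('Medium Popularity', 'Medium Popularity'),
    ('Low Popularity', 'Low Popularity'),
]

# Count each distinct (predicted, actual) combination.
confusion = Counter(pairs)

# Matches are the pairs whose two labels agree.
matches = sum(
    count for (predicted, actual), count in confusion.items() if predicted == actual
)

for (predicted, actual), count in sorted(confusion.items()):
    print(f"predicted={predicted!r:<24} actual={actual!r:<24} count={count}")
print(f"matches: {matches} of {len(pairs)}")
```

With the full data, the same pairs could come from `zip(comparison_df['Predicted Popularity'], comparison_df['popularity_category'])`, and `pd.crosstab` offers a tabular equivalent.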


### **Inference:**
Out of a total of 150 tracks, the model's predicted category matched the actual category for 62 tracks (41.3%). By category, it correctly identified:

> * Medium Popularity: 39 of 93 tracks (41.9%)
> * Low Popularity: 21 of 40 tracks (52.5%)
> * High Popularity: 2 of 16 tracks (12.5%)
> * Very High Popularity: 0 of 1 track

Most of the correct predictions fall in 'Medium Popularity', which makes up the majority of the dataset, while the hit rate is highest for 'Low Popularity'. Further refinement is needed for less frequent categories like 'High Popularity' and 'Very High Popularity', which have limited representation in the data. Overall, the model provides a foundation for deployment, with targeted improvements needed for the rarer categories.
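
The per-category figures can be turned into rates directly from the counts reported above. The following is a minimal sketch under two assumptions: the 'Very High Popularity' match count of 0 is inferred from its absence in the matching-predictions output, and the majority-class baseline is an added point of comparison that the notebook itself does not compute.

```python
# Actual category counts and correct-prediction counts from the outputs above.
actual_counts = {
    'Medium Popularity': 93,
    'Low Popularity': 40,
    'High Popularity': 16,
    'Very High Popularity': 1,
}
correct_counts = {
    'Medium Popularity': 39,
    'Low Popularity': 21,
    'High Popularity': 2,
    'Very High Popularity': 0,  # assumption: not listed among the matches
}

total = sum(actual_counts.values())

# Overall accuracy: share of all tracks whose category was predicted correctly.
accuracy = sum(correct_counts.values()) / total

# Per-category recall: share of each actual category that was predicted correctly.
recall = {name: correct_counts[name] / actual_counts[name] for name in actual_counts}

# Baseline for comparison: always predicting the most frequent category.
baseline = max(actual_counts.values()) / total

print(f"accuracy: {accuracy:.3f}")
print(f"majority-class baseline: {baseline:.3f}")
for name, value in recall.items():
    print(f"recall[{name}]: {value:.3f}")
```

Comparing `accuracy` against `baseline` gives a quick sense of how much the model adds over a trivial predictor on this test set.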




# **Spotify Song Popularity Predictor Deployment**





In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.40.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.111.1-py3-none-any.whl.metadata (26 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.2.0 (from gradio)
  Downloading gradio_client-1.2.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.9 (from gradi

In [None]:
import gradio as gr
import joblib
import warnings
import pandas as pd

# Ignore warnings
warnings.filterwarnings('ignore')

# Load the model
model_path = '/content/drive/MyDrive/Capstone/EDA/Final_random_forest_model.joblib'
with open(model_path, 'rb') as file:
  Final_Model = joblib.load(file)

# Define genre mapping
genre_mapping = {
    'Tamil': 5, 'Telugu': 6, 'Kannada': 2, 'Bollywood': 0, 'Rap': 3, 'Romance': 4, 'Indian pop': 1,
    'acoustic': 7, 'afrobeat': 8, 'alt-rock': 9, 'alternative': 10, 'ambient': 11, 'anime': 12,
    'black-metal': 13, 'bluegrass': 14, 'blues': 15, 'brazil': 16, 'breakbeat': 17, 'british': 18,
    'cantopop': 19, 'chicago-house': 20, 'children': 21, 'chill': 22, 'classical': 23, 'club': 24,
    'comedy': 25, 'country': 26, 'dance': 27, 'dancehall': 28, 'death-metal': 29, 'deep-house': 30,
    'detroit-techno': 31, 'disco': 32, 'disney': 33, 'drum-and-bass': 34, 'dub': 35, 'dubstep': 36,
    'edm': 37, 'electro': 38, 'electronic': 39, 'emo': 40, 'folk': 41, 'forro': 42, 'french': 43,
    'funk': 44, 'garage': 45, 'german': 46, 'gospel': 47, 'goth': 48, 'grindcore': 49, 'groove': 50,
    'grunge': 51, 'guitar': 52, 'happy': 53, 'hard-rock': 54, 'hardcore': 55, 'hardstyle': 56,
    'heavy-metal': 57, 'hip-hop': 58, 'honky-tonk': 59, 'house': 60, 'idm': 61, 'indian': 62,
    'indie-pop': 64, 'indie': 63, 'industrial': 65, 'iranian': 66, 'j-dance': 67, 'j-idol': 68,
    'j-pop': 69, 'j-rock': 70, 'jazz': 71, 'k-pop': 72, 'kids': 73, 'latin': 74, 'latino': 75,
    'malay': 76, 'mandopop': 77, 'metal': 78, 'metalcore': 79, 'minimal-techno': 80, 'mpb': 81,
    'new-age': 82, 'opera': 83, 'pagode': 84, 'party': 85, 'piano': 86, 'pop-film': 88, 'pop': 87,
    'power-pop': 89, 'progressive-house': 90, 'psych-rock': 91, 'punk-rock': 93, 'punk': 92,
    'r-n-b': 94, 'reggae': 95, 'reggaeton': 96, 'rock-n-roll': 98, 'rock': 97, 'rockabilly': 99,
    'romance': 100, 'sad': 101, 'salsa': 102, 'samba': 103, 'sertanejo': 104, 'show-tunes': 105,
    'singer-songwriter': 106, 'ska': 107, 'sleep': 108, 'songwriter': 109, 'soul': 110, 'spanish': 111,
    'study': 112, 'swedish': 113, 'synth-pop': 114, 'tango': 115, 'techno': 116, 'trance': 117,
    'trip-hop': 118, 'turkish': 119, 'world-music': 120
}

# Define the prediction function
def Prediction(duration_ms, explicit, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time_signature, genre):
    # Convert 'explicit' and 'mode' to numeric if they are in boolean format
    explicit = 1 if explicit else 0
    mode = 1 if mode else 0

    # Encode 'genre' using the mapping dictionary
    genre = genre_mapping.get(genre, -1)

    # Prepare input data
    input_data = [[duration_ms, explicit, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time_signature, genre]]

    # Convert to DataFrame if your model expects DataFrame input
    input_df = pd.DataFrame(input_data, columns=['duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'genre'])

    # Get prediction
    prediction = Final_Model.predict(input_df)[0]

    # Map the prediction to categories
    categories = ['Low Popularity', 'Medium Popularity', 'High Popularity', 'Very High Popularity']
    return categories[prediction]

# Define the Gradio interface
iface = gr.Interface(
    fn=Prediction,
    inputs=[
        gr.Number(label='Duration (ms)', value=0, precision=0, minimum=0, maximum=471556),
        gr.Checkbox(label='Explicit'),  # Assuming boolean input
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Danceability'),
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Energy'),
        gr.Number(label='Key', value=0, precision=0),  # Adjust according to key values
        gr.Slider(minimum=-50, maximum=50, step=0.01, label='Loudness'),
        gr.Checkbox(label='Mode'),  # Assuming boolean input
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Speechiness'),
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Acousticness'),
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Instrumentalness'),
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Liveness'),
        gr.Slider(minimum=0, maximum=1, step=0.01, label='Valence'),
        gr.Slider(minimum=0, maximum=242.1835, step=0.01, label='Tempo'),
        gr.Number(label='Time Signature', value=4, precision=0),  # Adjust according to time signature values
        gr.Dropdown(choices=list(genre_mapping.keys()), label='Genre')  # Dropdown for genre selection
    ],
    outputs=gr.Textbox(label='Predicted Popularity Category'),
    title="Spotify Song Popularity Predictor",
    description='This Application Predicts the Popularity Category of a Spotify Song',
    allow_flagging='never'
)

# Launch the interface
iface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://e55d2f37becc3e5521.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [None]:
Prediction(duration_ms=219742.0,explicit=0,danceability=0.377,energy=0.651,key=4,loudness=-5.437,mode=0,speechiness=0.0589,acousticness=0.01970,instrumentalness=0.000053,liveness=0.1740,valence=0.0851,tempo=129.607,time_signature=4,genre='Romance')

'Medium Popularity'

In [None]:
Prediction(duration_ms=299960.0,explicit=0,danceability=0.705,energy=0.712,key=6,loudness=-6.156,mode=1,speechiness=0.0385,acousticness=0.01020,instrumentalness=0.000855,liveness=0.1000,valence=0.6200,tempo=97.512,time_signature=4,genre='alt-rock')

'Very High Popularity'

In [None]:
Prediction(duration_ms=138680.0,explicit=0,danceability=0.226,energy=0.0803,key=2,loudness=-21.048,mode=0,speechiness=0.0484,acousticness=0.920,instrumentalness=0.170100,liveness=0.0969,valence=0.137,tempo=176.020,time_signature=3,genre=5)

'Low Popularity'

In [None]:
Prediction(duration_ms=275504.0,explicit=0,danceability=0.331,energy=0.2350,key=4,loudness=-13.816,mode=0,speechiness=0.0411,acousticness=0.891,instrumentalness=0.000003,liveness=0.1070,valence=0.252,tempo=82.511,time_signature=4,genre='Bollywood')

'High Popularity'
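
In one of the calls above, `genre=5` is an integer, so `genre_mapping.get(genre, -1)` finds no matching string key and the function encodes the genre as -1. The following is a minimal sketch of a stricter lookup that accepts either a label or an already-encoded value and rejects anything else. `resolve_genre` is a hypothetical helper, raising an error instead of falling back to -1 is a design assumption, and `genre_subset` is an excerpt of the mapping used only for illustration.

```python
def resolve_genre(genre, mapping):
    """Return the integer code for a genre label or an already-encoded value.

    Hypothetical helper: raises instead of silently falling back to -1.
    """
    if isinstance(genre, str):
        if genre in mapping:
            return mapping[genre]
        raise ValueError(f"Unknown genre label: {genre!r}")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(genre, int) and not isinstance(genre, bool):
        if genre in mapping.values():
            return genre
        raise ValueError(f"Unknown genre code: {genre}")
    raise TypeError(f"Genre must be a label or an integer code, got {type(genre).__name__}")


# Excerpt of the genre mapping defined above, for illustration only.
genre_subset = {'Bollywood': 0, 'Romance': 4, 'Tamil': 5, 'alt-rock': 9}

print(resolve_genre('Tamil', genre_subset))
print(resolve_genre(5, genre_subset))

try:
    resolve_genre('tamil', genre_subset)
except ValueError as error:
    message = str(error)
    print(message)
```

Whether to raise or to keep the -1 fallback depends on how the model was trained; the fallback only makes sense if -1 was also present in the training data.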



> *The Gradio app interface enables users to predict the popularity category of Spotify tracks based on various audio features. The model has been deployed to provide real-time predictions, with input fields tailored for song attributes and genre. The app offers a user-friendly experience to evaluate song popularity quickly.*

