## MSDS 696 Notebook3 Spotify Track Data Feature Engineering with Playlist Generation Test Code 

## Project Title:
Create and Build a Data Engineering Pipeline to Collect, Process, and Store Spotify Data. This is intended to be a fun project that looks at who the most popular artists are, what their most popular tracks are, and some characteristics of the songs.

## Notebook Description:
This notebook creates new features on the Spotify track data. Once the features are created and applied, the data is checked for NaNs. After cleaning, a Euclidean distance is calculated from one of the engineered features, called "custom_score", to determine song similarity. The resulting song list is then sent to my personal Spotify account for listening.

### Mary J Hollon
### Due 8-22-2024

In [1]:
import pandas as pd
import numpy as np


# Let's Load the data to examine its structure and contents

df = pd.read_csv('updated_tracks_2015.csv')

# Display basic information and the first few rows of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                400 non-null    object 
 1   name              400 non-null    object 
 2   artist_id         400 non-null    object 
 3   year              400 non-null    int64  
 4   popularity        400 non-null    int64  
 5   release_date      400 non-null    object 
 6   energy            398 non-null    float64
 7   danceability      398 non-null    float64
 8   instrumentalness  398 non-null    float64
 9   loudness          398 non-null    float64
 10  tempo             398 non-null    float64
 11  valence           398 non-null    float64
dtypes: float64(6), int64(2), object(4)
memory usage: 37.6+ KB


This dataset contains 400 entries with the following columns:

- id: Unique identifier for each track.
- name: Name of the track.
- artist_id: Identifier for the artist.
- year: The year the song was released.
- popularity: The popularity score of the artist. The higher the score, the more popular the artist.
- release_date: The song's actual release date.
- energy: A float representing the intensity and activity of the track or how "loud" and "noisy" a track is. Values close to 0 are low energy and values close to 1 are high energy.
- danceability: A float indicating how suitable a track is for dancing based on a combination of musical elements. A value of 0 is the least danceable, a value of 1 is the most danceable.
- instrumentalness: A float predicting whether a track has no vocals. Values close to 0 indicate more vocal content; values close to 1 indicate little or no vocal content.
- loudness: A float representing the overall loudness of the track in decibels. The closer the value is to 0, the louder the track; the larger the absolute value, the quieter the track.
- tempo: A float indicating the tempo of the track in beats per minute, that is, the "speed" or "pace" of the music.
- valence: A float describing the musical positiveness conveyed by a track. Values closer to 0 are more sad or angry, while values closer to 1 are more happy or cheerful. 


Source: https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features

In [2]:
df.head(10)

Unnamed: 0,id,name,artist_id,year,popularity,release_date,energy,danceability,instrumentalness,loudness,tempo,valence
0,3fqwjXwUGN6vbzIwvyFMhx,Tennessee Whiskey,4YLtscXsxbVgi031ovDDdh,2015,83,2015-05-04,0.37,0.392,0.0096,-10.888,48.718,0.512
1,3pXF1nA74528Edde4of9CC,Don't,2EMAnMvWE2eb56ToJVfCWs,2015,83,2015-10-02,0.356,0.765,0.0,-5.556,96.991,0.189
2,0QZ5yyl6B6utIWkxeBDxQN,The Night We Met,6ltzsmQQbmdoHHbLZ4ZN25,2015,78,2015-04-07,0.366,0.545,0.267,-9.51,86.997,0.1
3,6K4t31amVTZDgR3sKmwUJJ,The Less I Know The Better,5INjqkS1o8h1imAzPqGZBb,2015,85,2015-07-17,0.74,0.64,0.00678,-4.083,116.879,0.785
4,43PuMrRfbyyuz4QpZ3oAwN,Exchange,2EMAnMvWE2eb56ToJVfCWs,2015,81,2015-10-02,0.433,0.525,0.0,-10.598,160.108,0.276
5,6FBzhcfgGacfXF3AmtfEaX,C U Girl,57vWImR43h4CaDao012Ofp,2015,81,2015-02-15,0.473,0.414,0.0523,-8.911,100.0,0.409
6,5E30LdtzQTGqRvNd7l6kG5,Daddy Issues,77SW9BnxLY8rJ0RciFqkHh,2015,85,2015-10-30,0.521,0.588,0.149,-9.461,85.012,0.337
7,7fBv7CLKzipRk6EC6TWHOB,The Hills,1Xyo4u8uXC1ZmMpatF05PJ,2015,85,2015-08-28,0.564,0.585,0.0,-7.063,113.003,0.137
8,7H0ya83CMmgFcOhw0UB6ow,Space Song,56ZTgzPBDge0OvCGgMO3OY,2015,77,2015-08-28,0.792,0.508,0.124,-7.311,147.067,0.601
9,3iVcZ5G6tvkXZkZKlMpIUs,Alright,2YZyLoL8N0Wb9xBt1NhZWg,2015,80,2015-03-16,0.766,0.796,0.0,-5.974,110.034,0.558


Let's engineer some features to add to this data set.

- Feature 1 - Categorize energy
- Feature 2 - Categorize tempo
- Feature 3 - Categorize danceability
- Feature 4 - Categorize valence
- Feature 5 - Create Interaction Term: Danceability and Valence
- Feature 6 - Scale loudness with MinMaxScaler
- Feature 7 - Categorize scaled loudness
- Feature 8 - Categorize instrumentalness
- Feature 9 - Calculate a custom score for each track, which can be used to determine similar tracks

The danceability_valence_interaction is the product of danceability and valence for each track. This term provides a combined measure of how danceable and positive a track is. A higher interaction value indicates a track that is both highly danceable and conveys a positive mood. Conversely, a lower value indicates a track that is not very danceable, conveys a negative mood, or both.
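As a quick check of the interaction term, the sketch below recomputes it for two tracks using danceability and valence values copied from the processed 2020 output shown later in this notebook. The `tracks` list and its keys are illustrative only and are not part of the pipeline.

```python
# Illustrative only: recompute danceability * valence for two tracks.
# Feature values are copied from the processed 2020 data in this notebook.
tracks = [
    {"name": "Heartless (feat. Morgan Wallen)", "danceability": 0.765, "valence": 0.274},
    {"name": "You Should Probably Leave", "danceability": 0.602, "valence": 0.552},
]

for track in tracks:
    # Round to six decimals to match the precision displayed in the notebook output
    track["interaction"] = round(track["danceability"] * track["valence"], 6)
    print(f'{track["name"]}: {track["interaction"]}')
```

The first track is the more danceable of the two, but its low valence pulls its interaction value below that of the second track.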


In [3]:
from sklearn.preprocessing import MinMaxScaler

def apply_feature_engineering(df):
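    # Feature 1: Energy Categories
    # pd.cut with bins=3 splits the observed energy range of this file into
    # three equal-width bins, so the cut points can differ from year to year
    df['energy_category'] = pd.cut(df['energy'], bins=3, labels=['Low', 'Medium', 'High'])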

    # Feature 2: Tempo Ranges
    def categorize_tempo(tempo):
        if tempo < 100:
            return 'Slow'
        elif tempo < 140:
            return 'Medium'
        else:
            return 'Fast'

    df['tempo_category'] = df['tempo'].apply(categorize_tempo)

    # Feature 3: Danceability Categories
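    def categorize_danceability(danceability):
        # Missing values arrive here as NaN rather than None, so this check does not
        # catch them; NaN fails both comparisons below and is labeled 'High'.
        # Those rows are removed later when rows containing NaNs are dropped.
        if danceability is None:
            return 'Unknown'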
        if danceability < 0.4:
            return 'Low'
        elif danceability < 0.7:
            return 'Medium'
        else:
            return 'High'

    df['danceability_category'] = df['danceability'].apply(categorize_danceability)
    
    # Feature 4: Classify Valence (or Mood) Categories
    def categorize_valence(valence):
        if valence < 0.3:
            return 'Sad'
        elif valence < 0.6:
            return 'Neutral'
        else:
            return 'Happy'

    df['valence_category'] = df['valence'].apply(categorize_valence)

    # Feature 5: Interaction Term: Danceability and Valence
    df['danceability_valence_interaction'] = df['danceability'] * df['valence']

    # Feature 6: Loudness Scaled With MinMax Scaler
    min_max_scaler = MinMaxScaler()
    df['loudness_scaled'] = min_max_scaler.fit_transform(df[['loudness']])

    # Feature 7: Define categories based on the min-max scaled loudness with four categories
    def categorize_loudness_scaled(loudness_scaled):
        if loudness_scaled <= 0.25:
            return 'Very Low'
        elif loudness_scaled <= 0.5:
            return 'Low'
        elif loudness_scaled <= 0.75:
            return 'High'
        else:
            return 'Very High'

    df['loudness_category'] = df['loudness_scaled'].apply(categorize_loudness_scaled)

    # Feature 8: Instrumentalness Categories
    def categorize_instrumentalness(instrumentalness):
        if instrumentalness == 0:
            return 'None'
        elif instrumentalness < 0.3:
            return 'Low'
        elif instrumentalness < 0.7:
            return 'Medium'
        else:
            return 'High'

    df['instrumentalness_category'] = df['instrumentalness'].apply(categorize_instrumentalness)

    return df

def calculate_custom_score(df, danceability_weight=0.30, energy_weight=0.25, valence_weight=0.25, loudness_weight=0.20):
    """
    Calculate a custom score for each track in the dataframe based on specified feature weights.

    Parameters:
    df (pd.DataFrame): The dataframe containing track features.
    danceability_weight (float): Weight for danceability feature.
    energy_weight (float): Weight for energy feature.
    valence_weight (float): Weight for valence feature.
    loudness_weight (float): Weight for loudness feature.

    Returns:
    pd.DataFrame: Dataframe with a new column 'custom_score'.
    """
    # danceability, energy, and valence are already on a 0-1 scale, so no extra
    # scaling is needed; loudness_scaled comes from apply_feature_engineering

    # Calculate the custom score
    df['custom_score'] = (df['danceability'] * danceability_weight +
                          df['energy'] * energy_weight +
                          df['valence'] * valence_weight +
                          df['loudness_scaled'] * loudness_weight)

    return df



In [4]:
# Let's apply the feature engineering functions above to the track data
# Process files from 2015 to 2024 

for year in range(2015, 2025):
    file_path = f'updated_tracks_{year}.csv'
    
    try:
        df = pd.read_csv(file_path)
        df = apply_feature_engineering(df)
        df = calculate_custom_score(df)
        
        # Save or display the processed data
        df.to_csv(f'processed_tracks_{year}.csv', index=False)
        
        # Optionally display the first few rows
        print(f"Processed tracks_{year}.csv")
        print(df.head())
    except FileNotFoundError:
        print(f"File {file_path} not found.")


Processed tracks_2015.csv
                       id                        name               artist_id  \
0  3fqwjXwUGN6vbzIwvyFMhx           Tennessee Whiskey  4YLtscXsxbVgi031ovDDdh   
1  3pXF1nA74528Edde4of9CC                       Don't  2EMAnMvWE2eb56ToJVfCWs   
2  0QZ5yyl6B6utIWkxeBDxQN            The Night We Met  6ltzsmQQbmdoHHbLZ4ZN25   
3  6K4t31amVTZDgR3sKmwUJJ  The Less I Know The Better  5INjqkS1o8h1imAzPqGZBb   
4  43PuMrRfbyyuz4QpZ3oAwN                    Exchange  2EMAnMvWE2eb56ToJVfCWs   

   year  popularity release_date  energy  danceability  instrumentalness  \
0  2015          83   2015-05-04   0.370         0.392           0.00960   
1  2015          83   2015-10-02   0.356         0.765           0.00000   
2  2015          78   2015-04-07   0.366         0.545           0.26700   
3  2015          85   2015-07-17   0.740         0.640           0.00678   
4  2015          81   2015-10-02   0.433         0.525           0.00000   

   loudness  ...  valence  ene

In [5]:
# Let's check one of the files
df = pd.read_csv('processed_tracks_2020.csv')

In [6]:
df.head()

Unnamed: 0,id,name,artist_id,year,popularity,release_date,energy,danceability,instrumentalness,loudness,...,valence,energy_category,tempo_category,danceability_category,valence_category,danceability_valence_interaction,loudness_scaled,loudness_category,instrumentalness_category,custom_score
0,0zirWZTcXBBwGsevrsIpvT,Clean Baby Sleep White Noise (Loopable),6Cqtx9fpxzggIMuKn0RGCp,2020,93,2020-04-29,,,,,...,,,Fast,High,Happy,,,Very High,High,
1,3FU6urUVsgXa6RBuV2PdRk,Heartless (feat. Morgan Wallen),5fMUXHkw8R8eOP2RNVYEZX,2020,82,2020-05-29,0.556,0.765,0.0,-6.417,...,0.274,Medium,Medium,High,Sad,0.20961,0.820886,Very High,,0.601177
2,2UikqkwBv7aIvlixeVXHWt,You Should Probably Leave,4YLtscXsxbVgi031ovDDdh,2020,83,2020-11-13,0.477,0.602,3.1e-05,-8.425,...,0.552,Medium,Fast,Medium,Neutral,0.332304,0.752626,Very High,Low,0.588375
3,3hxIUxnT27p5WcmjGUXNwx,Shut up My Moms Calling,35WVTyRnKAoaGExqgktVyb,2020,87,2020-02-10,0.409,0.485,0.0,-10.711,...,0.376,Medium,Medium,Medium,Neutral,0.18236,0.674916,High,,0.476733
4,3jHdKaLCkuNEkWcLVmQPCX,BEST INTEREST,4V8LLVI7PbaPR0K2TGSxFF,2020,82,2020-01-25,0.575,0.596,0.000153,-8.325,...,0.34,Medium,Slow,Medium,Neutral,0.20264,0.756025,Very High,Low,0.558755


In [7]:
# Let's check for NaNs in the processed_tracks_{year} files from 2015 to 2024

for year in range(2015, 2025):
    file_path = f'processed_tracks_{year}.csv'
    try:
        df = pd.read_csv(file_path)
        nan_summary_before = df.isna().sum()
        print(f"NaN summary before dropping rows for {file_path}:\n{nan_summary_before}\n")
        
        # Drop rows with NaN values
        df_clean = df.dropna()
        
        nan_summary_after = df_clean.isna().sum()
        print(f"NaN summary after dropping rows for {file_path}:\n{nan_summary_after}\n")
        
        # Save the cleaned dataframe back to the CSV file
        df_clean.to_csv(file_path, index=False)
        
    except FileNotFoundError:
        print(f"{file_path} not found.")


NaN summary before dropping rows for processed_tracks_2015.csv:
id                                  0
name                                0
artist_id                           0
year                                0
popularity                          0
release_date                        0
energy                              2
danceability                        2
instrumentalness                    2
loudness                            2
tempo                               2
valence                             2
energy_category                     2
tempo_category                      0
danceability_category               0
valence_category                    0
danceability_valence_interaction    2
loudness_scaled                     2
loudness_category                   0
instrumentalness_category           0
custom_score                        2
dtype: int64

NaN summary after dropping rows for processed_tracks_2015.csv:
id                                  0
name                   

We have handled the missing values. Let's use Euclidean distance and the song custom score to determine similar tracks (just for fun).
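A minimal sketch of the idea, assuming the default weights from `calculate_custom_score`. Because `custom_score` is a single number per track, the Euclidean distance between two tracks reduces to the absolute difference of their scores. The helper names `custom_score` and `rank_by_similarity` are hypothetical, and the feature values and scores are copied from the processed 2020 rows shown above.

```python
import math

# Default weights from calculate_custom_score
WEIGHTS = {"danceability": 0.30, "energy": 0.25, "valence": 0.25, "loudness_scaled": 0.20}

def custom_score(features):
    """Weighted sum of the four features (hypothetical helper for illustration)."""
    return sum(features[name] * weight for name, weight in WEIGHTS.items())

def rank_by_similarity(reference, candidates):
    """Sort (name, score) pairs by 1-D Euclidean distance to the reference score."""
    return sorted(candidates, key=lambda item: math.dist([item[1]], [reference]))

# Feature values for "You Should Probably Leave" (processed_tracks_2020.csv)
reference = custom_score(
    {"danceability": 0.602, "energy": 0.477, "valence": 0.552, "loudness_scaled": 0.752626}
)
print(round(reference, 6))  # agrees with the custom_score column up to rounding

# custom_score values copied from the same file
candidates = [
    ("Heartless (feat. Morgan Wallen)", 0.601177),
    ("Shut up My Moms Calling", 0.476733),
    ("BEST INTEREST", 0.558755),
]
for name, score in rank_by_similarity(reference, candidates):
    # In one dimension, Euclidean distance equals the absolute difference
    assert math.isclose(math.dist([score], [reference]), abs(score - reference))
    print(name, round(abs(score - reference), 6))
```

Since the score collapses four features into one number, two tracks with quite different feature mixes can still end up close together; the ranking only reflects the weighted total.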

In [8]:
df=pd.read_csv('processed_tracks_2017.csv')
df.head()

Unnamed: 0,id,name,artist_id,year,popularity,release_date,energy,danceability,instrumentalness,loudness,...,valence,energy_category,tempo_category,danceability_category,valence_category,danceability_valence_interaction,loudness_scaled,loudness_category,instrumentalness_category,custom_score
0,7KA4W4McWYRpgf0fWsJZWB,See You Again (feat. Kali Uchis),4V8LLVI7PbaPR0K2TGSxFF,2017,89,2017-07-21,0.559,0.558,7e-06,-9.222,...,0.62,Medium,Slow,Medium,Happy,0.34596,0.614083,High,Low,0.584967
1,1mMLMZYXkMueg65jRRWG1l,When It Rains It Pours,718COspgdWOnwOFpJHRZHS,2017,84,2017-06-02,0.801,0.551,6e-06,-5.069,...,0.625,High,Medium,Medium,Happy,0.344375,0.823862,Very High,Low,0.686572
2,6me7F0aaZjwDo6RJ5MrfBD,Evergreen,4qU7IJSReZnsLy5907Mtau,2017,89,2017-05-17,0.216,0.555,0.00416,-11.661,...,0.504,Low,Slow,Medium,Neutral,0.27972,0.490882,Low,Low,0.444676
3,7sO5G9EABYOXQKNPNiE9NR,Ric Flair Drip (with Metro Boomin),4DdkRBBYG6Yk9Ka8tdJ9BW,2017,85,2017-10-30,0.428,0.88,5.1e-05,-8.28,...,0.333,Medium,Medium,High,Neutral,0.29304,0.661666,High,Low,0.586583
4,3EaJDYHA0KnX88JvDhL9oa,Dark Red,57vWImR43h4CaDao012Ofp,2017,86,2017-02-20,0.784,0.603,8e-06,-4.023,...,0.769,High,Fast,Medium,Happy,0.463707,0.876698,Very High,Low,0.74449


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 21 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                397 non-null    object 
 1   name                              397 non-null    object 
 2   artist_id                         397 non-null    object 
 3   year                              397 non-null    int64  
 4   popularity                        397 non-null    int64  
 5   release_date                      397 non-null    object 
 6   energy                            397 non-null    float64
 7   danceability                      397 non-null    float64
 8   instrumentalness                  397 non-null    float64
 9   loudness                          397 non-null    float64
 10  tempo                             397 non-null    float64
 11  valence                           397 non-null    float64
 12  energy_c

In [10]:
from scipy.spatial.distance import euclidean

# Function to find similar tracks based on custom score
def find_similar_tracks(df, track_name, top_n=5):
    
    # Check if the track name exists in the dataframe
    if track_name not in df['name'].values:
        raise ValueError(f"Track name '{track_name}' not found in the dataset.")
    
    # Get the custom score of the reference track
    reference_score = df.loc[df['name'] == track_name, 'custom_score'].values[0]
    
    # Calculate the Euclidean distance between the reference score and all other scores
    df['similarity'] = df['custom_score'].apply(lambda x: euclidean([x], [reference_score]))
    
    # Sort by similarity (ascending) and exclude the reference track itself
    similar_tracks = df[df['name'] != track_name].sort_values(by='similarity', ascending=True).head(top_n)
    
    # Return the most similar tracks
    return similar_tracks[['id','name', 'artist_id', 'custom_score', 'similarity']]



In [11]:
# Let's call the function and try it out!
# It may be useful later!

df = pd.read_csv('processed_tracks_2020.csv')

# Example: Find similar tracks to a given track name
similar_tracks = find_similar_tracks(df, track_name="You Should Probably Leave", top_n=25)

similar_tracks


Unnamed: 0,id,name,artist_id,custom_score,similarity
259,4HBZA5flZLE435QTztThqH,Stuck with U (with Justin Bieber),66CXWjxzNUsdJxJ2JdwvnR,0.588389,1.3e-05
15,7s5VQqrjBtrBgZL4pEa46S,Romantic Lover,3XxNRirzbjfLdDli06zMaB,0.588233,0.000142
7,0VjIjW4GlUZAMYd2vXMi3b,Blinding Lights,1Xyo4u8uXC1ZmMpatF05PJ,0.58732,0.001055
242,40TPiJpvwGIyvPjJMDTKfy,Rags2Riches 2 (feat. Lil Baby),45TgXXqMDdF8BkjA83OM7z,0.589796,0.001421
233,4N7i6RfJXMWfkx9Zr6pzkJ,E-ER (feat. Lil Yachty),1m7LSAMIB1BErIHYSOn32W,0.586498,0.001878
173,67q8yivDoOPXCYodi1zTix,Glue Myself Shut,2RQXRUsr4IW1f3mKyKsy4B,0.591069,0.002694
355,1DRLIUG8HFTBfJOFYaByZn,FORMATIONS,3Ka3k9K2WStR52UJVtbJZW,0.585671,0.002704
324,4R5GN0mBvb6jkRj3Zvyhkl,Will I See You Again?,0oK5D6uPhGu4Jk2dbZfodU,0.585595,0.00278
103,01JMnRUs2YOK6DDpdQASGY,Grace (feat. 42 Dugg),5f7VJjfbwm532GiveGC0ZK,0.585424,0.002951
111,2xB46Bj9HZ4cr058yN4Pla,Secrets,31W5EY0aAly4Qieq6OFu6I,0.591408,0.003033


Let's develop some code to create a public playlist in my Spotify account.

### NOTE: You MUST have Spotify Developer Credentials for this code to run!
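The Spotify Web API accepts at most 100 items per add-to-playlist request, so a longer track list would need to be sent in batches. The sketch below shows one way to split a list of track IDs; `chunked` is a hypothetical helper, not part of spotipy, and the sample IDs are placeholders. With the 25-track lists used in this notebook, a single request is enough.

```python
def chunked(items, size=100):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs for illustration; real IDs would come from the 'id' column
# of the similar-tracks DataFrame
track_ids = [f"track_{i}" for i in range(250)]

batches = list(chunked(track_ids))
print([len(batch) for batch in batches])  # [100, 100, 50]

# Each batch could then be passed to the playlist "add items" call in turn
```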

In [28]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotify_credentials import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

# Authenticate with Spotify using OAuth
# Replace "your_redirect_uri" with the redirect URI registered for your Spotify app
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIFY_CLIENT_ID,
                                               client_secret=SPOTIFY_CLIENT_SECRET,
                                               redirect_uri="your_redirect_uri",
                                               scope="playlist-modify-public"))

def write_playlist_to_spotify(similar_tracks_df, playlist_name="Similar Songs Playlist"):
    user_id = sp.current_user()["id"]

    # Create a new playlist
    playlist = sp.user_playlist_create(user=user_id, name=playlist_name, public=True, description="Playlist based on custom_score similarity")

    # Get track IDs
    track_ids = similar_tracks_df['id'].tolist()

    # Add tracks to the playlist
    sp.playlist_add_items(playlist_id=playlist['id'], items=track_ids)
    print(f"Playlist '{playlist_name}' created successfully with {len(track_ids)} tracks!")
    
 

In [20]:
# Example Usage
df = pd.read_csv('processed_tracks_2020.csv')

# Find similar tracks to a given track name
similar_tracks = find_similar_tracks(df, track_name="You Should Probably Leave", top_n=25)

# Write the similar_tracks DataFrame to Spotify
write_playlist_to_spotify(similar_tracks, playlist_name="Similar Tracks to 'You Should Probably Leave'")

Playlist 'Similar Tracks to 'You Should Probably Leave'' created successfully with 25 tracks!


Now, let's use the similar tracks based on Euclidean distance to generate a playlist.

In [23]:
# let's look at another example

df = pd.read_csv('processed_tracks_2024.csv')

df.head(25)

Unnamed: 0,id,name,artist_id,year,popularity,release_date,energy,danceability,instrumentalness,loudness,...,valence,energy_category,tempo_category,danceability_category,valence_category,danceability_valence_interaction,loudness_scaled,loudness_category,instrumentalness_category,custom_score
0,0WbMK4wrZ1wFSty9F7FCgu,"Good Luck, Babe!",7GlBOeep6PqTfFi59PTUUN,2024,96,2024-04-05,0.582,0.7,0.0,-5.96,...,0.785,Medium,Medium,High,Happy,0.5495,0.669214,High,,0.685593
1,6AI3ezQ4o3HUoP6Dhudph3,Not Like Us,2YZyLoL8N0Wb9xBt1NhZWg,2024,96,2024-05-04,0.472,0.898,0.0,-7.001,...,0.214,Medium,Medium,High,Sad,0.192172,0.591972,High,,0.559294
2,6dOtVTDdiauQNBQEDOtlAB,BIRDS OF A FEATHER,6qqNVTkY8uBg9cP3Jd7DAH,2024,100,2024-05-17,0.507,0.747,0.0608,-10.171,...,0.438,Medium,Medium,High,Neutral,0.327186,0.356756,Low,Low,0.531701
3,7221xIgOnuakPdLqT0F3nP,I Had Some Help (Feat. Morgan Wallen),246dkjvS1zLTtiykXe5h60,2024,94,2024-05-10,0.855,0.638,0.0,-4.86,...,0.731,High,Medium,Medium,Happy,0.466378,0.750835,Very High,,0.738067
4,5N3hjp1WNayUPZrA8kJmJP,Please Please Please,74KM79TiuVKeVCqs8QtB0B,2024,97,2024-06-06,0.586,0.669,0.0,-6.073,...,0.579,Medium,Medium,Medium,Neutral,0.387351,0.66083,High,,0.624116
5,2qSkIjg1o9h3YT9RAgYN75,Espresso,74KM79TiuVKeVCqs8QtB0B,2024,98,2024-04-12,0.76,0.701,6.5e-05,-5.478,...,0.69,High,Medium,High,Happy,0.48369,0.704979,High,Low,0.713796
6,5AJ9hqTS2wcFQCELCFRO7A,MILLION DOLLAR BABY,1WaFQSHVGZQJTbf0BdxdNo,2024,94,2024-04-26,0.697,0.852,0.00037,-5.52,...,0.919,Medium,Medium,High,Happy,0.782988,0.701862,High,Low,0.799972
7,2FQrifJ1N335Ljm3TjTVVf,A Bar Song (Tipsy),3y2cIKLjiOlp1Np37WiUdH,2024,92,2024-04-12,0.709,0.722,0.0,-4.95,...,0.604,High,Slow,High,Happy,0.436088,0.744157,High,,0.693681
8,4ZJ4vzLQekI0WntDbanNC7,Pink Skies,40ZNYROS4zLfyyBSs2PGe2,2024,88,2024-05-24,0.488,0.525,5.1e-05,-6.909,...,0.253,Medium,Slow,Medium,Sad,0.132825,0.598798,High,Low,0.46251
9,7Fzl7QaTu47WyP9R5S5mh5,Lies Lies Lies,4oUHIQIBe0LHzYfvXNW4QM,2024,87,2024-07-05,0.702,0.486,7.5e-05,-5.758,...,0.382,High,Slow,Medium,Neutral,0.185652,0.684203,High,Low,0.553641


### I will select the song "Please Please Please" by Sabrina Carpenter and generate similar tracks based on my similarity score

In [26]:
# Let's call the function to generate a list of similar songs in a DataFrame before writing it to Spotify

df = pd.read_csv('processed_tracks_2024.csv')

# Example: Find similar tracks to a given track name
similar_tracks = find_similar_tracks(df, track_name="Please Please Please", top_n=25)

similar_tracks


Unnamed: 0,id,name,artist_id,custom_score,similarity
152,4S8PxReB1UiDR2F5x1lyIR,meet the grahams,2YZyLoL8N0Wb9xBt1NhZWg,0.62421,9.4e-05
351,3dzRwqd1L3HqxmViUJt20A,Sweet Dreams,1Tie3AZgLQZqYEp8Fv4zOZ,0.625238,0.001122
188,2PgjJ90q1zETqCX68dxgyd,Sweet Dreams,1Tie3AZgLQZqYEp8Fv4zOZ,0.625238,0.001122
36,6XjDF6nds4DE2BBbagZol6,Gata Only,7CvTknweLr9feJtRGrpDBy,0.625865,0.001749
177,0SdBkFh6u5IHIXqlBu0NyI,Yeah Glo!,2qoQgPAilErOKCwE2Y8wOG,0.62136,0.002756
94,5A8xI7PN4WDe9e61xEdt94,Yeah Glo!,2qoQgPAilErOKCwE2Y8wOG,0.62136,0.002756
325,150w1WjVxUWycZqTflcDAe,Goes Without Saying (Feat. Brad Paisley),246dkjvS1zLTtiykXe5h60,0.620203,0.003913
99,7Mg5CBO37Rifk2RyDJ8fzd,Boogieman,7iKgSlIINjat3bsCYiNMYX,0.628159,0.004044
362,6XaJfhwof7qIgbbXO5tIQI,Igual Que Un Ángel (with Peso Pluma),1U1el3k54VvEUzo3ybLPlM,0.620044,0.004072
308,6slyBSVe8V4sZ6smTNukQJ,What Don't Belong To Me,246dkjvS1zLTtiykXe5h60,0.619597,0.004519
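
The output above lists "Sweet Dreams" and "Yeah Glo!" twice each, under different track IDs but with the same artist and score. If repeats are not wanted in the playlist, one option is to keep only the first occurrence of each (name, artist_id) pair before writing to Spotify. This is a sketch under that assumption; `drop_repeated_tracks` is a hypothetical helper and not part of this notebook's pipeline.

```python
def drop_repeated_tracks(rows):
    """Keep the first row for each (name, artist_id) pair, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = (row["name"], row["artist_id"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Rows copied from the similar_tracks output above
rows = [
    {"id": "4S8PxReB1UiDR2F5x1lyIR", "name": "meet the grahams", "artist_id": "2YZyLoL8N0Wb9xBt1NhZWg"},
    {"id": "3dzRwqd1L3HqxmViUJt20A", "name": "Sweet Dreams", "artist_id": "1Tie3AZgLQZqYEp8Fv4zOZ"},
    {"id": "2PgjJ90q1zETqCX68dxgyd", "name": "Sweet Dreams", "artist_id": "1Tie3AZgLQZqYEp8Fv4zOZ"},
    {"id": "6XjDF6nds4DE2BBbagZol6", "name": "Gata Only", "artist_id": "7CvTknweLr9feJtRGrpDBy"},
]

unique_rows = drop_repeated_tracks(rows)
print([row["id"] for row in unique_rows])
```

With a pandas DataFrame, `similar_tracks.drop_duplicates(subset=['name', 'artist_id'])` keeps the first occurrence of each pair in the same way.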


In [24]:
# Example Usage
df = pd.read_csv('processed_tracks_2024.csv')

# Find similar tracks to a given track name
similar_tracks = find_similar_tracks(df, track_name="Please Please Please", top_n=25)

# Write the similar_tracks DataFrame to Spotify
write_playlist_to_spotify(similar_tracks, playlist_name="Similar Tracks to 'Please Please Please'")


Playlist 'Similar Tracks to 'Please Please Please'' created successfully with 25 tracks!


In [29]:
# Updated version of the playlist cell: it also prints the playlist link
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotify_credentials import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET

# Authenticate with Spotify using OAuth
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIFY_CLIENT_ID,
                                               client_secret=SPOTIFY_CLIENT_SECRET,
                                               redirect_uri="your_redirect_uri",
                                               scope="playlist-modify-public"))

def write_playlist_to_spotify(similar_tracks_df, playlist_name="Similar Songs Playlist"):
    user_id = sp.current_user()["id"]

    # Create a new playlist
    playlist = sp.user_playlist_create(user=user_id, name=playlist_name, public=True, description="Playlist based on custom_score similarity")

    # Get track IDs
    track_ids = similar_tracks_df['id'].tolist()

    # Add tracks to the playlist
    sp.playlist_add_items(playlist_id=playlist['id'], items=track_ids)
    
    # Print the success message with the playlist link
    playlist_url = playlist['external_urls']['spotify']
    print(f"Playlist '{playlist_name}' created successfully with {len(track_ids)} tracks!")
    print(f"Spotify Playlist Link: {playlist_url}")



### Summary

This notebook processes the updated_tracks_{year} files, which were previously updated with track audio data. Feature engineering is applied and rows containing NaNs are removed. The feature-engineered and cleaned files are:

 - processed_tracks_2015
 - processed_tracks_2016
 - processed_tracks_2017
 - processed_tracks_2018
 - processed_tracks_2019
 - processed_tracks_2020
 - processed_tracks_2021
 - processed_tracks_2022
 - processed_tracks_2023
 - processed_tracks_2024
 
 
 
The test code to build a playlist based on the custom_score feature has been developed and tested successfully.


## END of Notebook