## Vectorization Method

Of the two models I (Jack) built, this is the one I prefer. The model works as follows:
1. Load the processed data produced by the processing notebook
2. Accept user inputs of songs
3. Calculate the variance of each feature across the user-selected songs and compare it to the variances of randomly selected songs from a Monte Carlo simulation to find significant features
4. Treat the significant features of the user's songs as a vector and use Euclidean distances to find the most similar songs across the entire dataset

In [2]:
# Import Statements
import pandas as pd
import statistics
import numpy as np
import json
from sklearn.metrics.pairwise import euclidean_distances
import ipywidgets as widgets
from IPython.display import display, clear_output

In [3]:
df = pd.read_csv(r'C:\Users\jsull\UW Work\Stat 451\Project\processed_data.csv') # read in the processed data that was created in the data_processing.ipynb
df = df.drop(['Unnamed: 0', 'key', 'mode', 'time_signature'], axis = 1) # drop a few columns I will not be using (kept in during processing because my team members might need them)
# Scale the genre indicator columns by 0.4
genre_cols = ['Acoustic', 'Classical and Opera', 'Country and Folk', 'Electronic and Dance',
              'Hip-Hop and Urban', 'Instrumental', 'Jazz and Blues', 'Latin', 'Miscellaneous',
              'Pop', 'Reggae and Tropical', 'Rock', 'Spiritual and Religious', 'Theater', 'World']
df[genre_cols] = df[genre_cols] * 0.4
df['popularity'] = df['popularity']  # no-op: popularity keeps its original scale
df.head()

Unnamed: 0,master_idx,artist_name,track_name,track_id,year,genre,popularity,danceability,energy,loudness,...,Instrumental,Jazz and Blues,Latin,Miscellaneous,Pop,Reggae and Tropical,Rock,Spiritual and Religious,Theater,World
0,0,Jason Mraz,I Won't Give Up,53QF56cjZA9RTuuMZDrSA6,2012,Acoustic,0.68,0.483,0.303,0.739361,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jason Mraz,93 Million Miles,1s8tP3jP4GZcyHDsjvw218,2012,Acoustic,0.5,0.572,0.454,0.735699,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Joshua Hyslop,Do Not Let Me Go,7BRCa8MPiyuvr2VU3O9W0F,2012,Acoustic,0.57,0.409,0.234,0.680697,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Boyce Avenue,Fast Car,63wsZUhUZLlh1OsyrZq7sz,2012,Acoustic,0.58,0.392,0.251,0.742781,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Andrew Belle,Sky's Still Blue,6nXIYClvJAfi6ujLiKqEq8,2012,Acoustic,0.54,0.43,0.791,0.813859,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
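
One effect of scaling the genre columns is that a genre mismatch contributes less to the Euclidean distance used later. The sketch below illustrates this with two hypothetical songs, assuming the genre columns are one-hot indicators (0 or 1) before scaling; the two-column layout is invented for the example.

```python
import math

GENRE_WEIGHT = 0.4  # same scaling factor applied to the genre columns above

# Hypothetical one-hot genre vectors: [Country and Folk, Pop]
country = [1.0, 0.0]
pop = [0.0, 1.0]

# Distance contributed by a genre mismatch, before and after scaling
unweighted = math.dist(country, pop)
weighted = math.dist([GENRE_WEIGHT * g for g in country],
                     [GENRE_WEIGHT * g for g in pop])

print(round(unweighted, 3))  # 1.414
print(round(weighted, 3))    # 0.566
```

Because every other feature lies roughly in the 0 to 1 range, a weight below 1 keeps genre from dominating the distance while still favoring songs of the same genre.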


In [4]:
# This cell opens the results from a Monte Carlo simulation (in monte_carlo_code.ipynb) to determine the 0.05 quantile of the variance of the 10 values selected from each column

with open(r'C:\Users\jsull\UW Work\Stat 451\Project\monte_carlo.json', 'r') as f:
    results_dict = json.load(f)
results_dict

{'danceability': 0.013382394999999998,
 'energy': 0.025014266666666667,
 'speechiness': 0.00042638024999999994,
 'acousticness': 0.0484138911527198,
 'instrumentalness': 0.03146510005767226,
 'liveness': 0.004487286666666666,
 'valence': 0.03284678672777778,
 'loudness': 0.001398039625570148,
 'tempo': 0.005009213793574107,
 'duration_ms': 6.857034893928311e-05}
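
The simulation itself lives in monte_carlo_code.ipynb and is not shown here. As a rough sketch of how such a threshold could be computed, the code below repeatedly draws 10 values from a column, records the sample variance, and takes the 0.05 quantile. The function name `variance_threshold`, the number of trials, and the uniform stand-in column are all assumptions for illustration, not the actual simulation code.

```python
import random
import statistics

def variance_threshold(values, sample_size=10, n_trials=2000, seed=0):
    """Estimate the 0.05 quantile of the variance of random samples (hypothetical helper)."""
    rng = random.Random(seed)
    variances = [
        statistics.variance(rng.sample(values, sample_size))  # sample variance of one random draw
        for _ in range(n_trials)
    ]
    # quantiles(n=20) returns 19 cut points; the first is the 5th percentile
    return statistics.quantiles(variances, n=20)[0]

# Stand-in for one feature column; real columns would come from df
rng = random.Random(1)
hypothetical_column = [rng.random() for _ in range(1000)]

threshold = variance_threshold(hypothetical_column)
print(threshold)
```

Only about 5% of random 10-song samples have a variance below this threshold, which is what makes a low user variance notable.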

Below is a basic Jupyter notebook interface that lets a user enter songs they like from the dataset. Without going into technical detail, the code:
1. Prompts the user to input how many songs they want to enter (minimum of 10).
2. Prompts the user to input a song title, an artist name, or both to search in the dataframe.
3. Filters the dataframe by the user input, and provides a dropdown menu to select the song they want to enter.
4. Appends the selected song to a new dataframe, `user_songs`.

In [5]:
user_songs = pd.DataFrame(columns=df.columns)  # Initialize the user_songs DataFrame

def find_songs(song_title, artist_name):
    conditions = []
    if song_title:
        conditions.append(df['track_name'].str.contains(song_title, case=False, na=False))
    if artist_name:
        conditions.append(df['artist_name'].str.contains(artist_name, case=False, na=False))
    
    if not conditions:
        return pd.DataFrame(columns=df.columns)

    # Combine the masks with a logical AND so each one is applied to the full dataframe
    mask = conditions[0]
    for condition in conditions[1:]:
        mask = mask & condition
    return df[mask].sort_values(by='popularity', ascending=False)

def display_song_selection(songs_df):
    if songs_df.empty:
        print("No matching songs found. Try different keywords.")
        return

    options = [(f"{row['track_name']} by {row['artist_name']}", row['track_id']) for index, row in songs_df.iterrows()]
    song_dropdown = widgets.Dropdown(options=options, description='Select Song:')
    confirm_button = widgets.Button(description="Confirm Selection")
    display(song_dropdown, confirm_button)
    
    def on_confirm_button_clicked(b):
        add_song_to_user_songs(song_dropdown.value)
    confirm_button.on_click(on_confirm_button_clicked)

def add_song_to_user_songs(song_id):
    global user_songs
    selected_song = df.loc[df['track_id'] == song_id]
    user_songs = pd.concat([user_songs, selected_song], ignore_index=True)
    clear_output(wait=True)
    display(user_songs[['track_name', 'artist_name', 'year', 'genre']])
    if len(user_songs) < N:
        print(f"Add more songs. {N - len(user_songs)} more to go.")
        initiate_song_selection()
    else:
        print("Song selection complete.")

def initiate_song_selection():
    song_input = widgets.Text(value='', placeholder='Enter song title', description='Song:', disabled=False)
    artist_input = widgets.Text(value='', placeholder='Enter artist name', description='Artist:', disabled=False)
    search_button = widgets.Button(description="Search")
    display(song_input, artist_input, search_button)
    
    def on_search_button_clicked(b):
        song_title = song_input.value.strip()
        artist_name = artist_input.value.strip()
        filtered_songs = find_songs(song_title, artist_name)
        clear_output(wait=True)
        display(song_input, artist_input, search_button)
        display_song_selection(filtered_songs)

    search_button.on_click(on_search_button_clicked)

def set_song_count():
    print("Enter how many songs you would like to add (minimum 10):")
    N_widget = widgets.IntText(value=10, description='Number of songs:', disabled=False)
    display(N_widget)

    def on_N_set(change):
        global N
        N = N_widget.value
        if N >= 10:
            clear_output(wait=True)
            initiate_song_selection()
        else:
            print("Please enter a value of 10 or higher.")

    N_widget.observe(on_N_set, names='value')

set_song_count()


Enter how many songs you would like to add (minimum 10):


IntText(value=10, description='Number of songs:')

The songs below come from my country playlist. I have been using them as a quick user data set to evaluate my model after changes. Run the cell below to generate predictions from these songs instead of an interactive selection.

In [9]:
test_id = ['67AdiJcurlf6gocGobfaXs',
 '1mMLMZYXkMueg65jRRWG1l',
 '0w3Q14i073jLoew1hgJkwD',
 '4ly1QBXEwYoDmje9rmEgC4',
 '73zawW1ttszLRgT9By826D',
 '3oZ6dlSfCE9gZ55MGPJctc',
 '4rW9EUFaMSNVY8JhbqrB6z',
 '3OjNkFFZavF89xvRqWCXmU',
 '0OWhKvvsHptt6vnnNUSM9a',
 '2HbpYFQbairMoU2YFyOP2x',
 '7G6l2FtQyRhQgYgut2I6i8']

user_songs = df[df['track_id'].isin(test_id)]
user_songs

Unnamed: 0,master_idx,artist_name,track_name,track_id,year,genre,popularity,danceability,energy,loudness,...,Instrumental,Jazz and Blues,Latin,Miscellaneous,Pop,Reggae and Tropical,Rock,Spiritual and Religious,Theater,World
117044,118810,Cody Johnson,Me and My Kind,4rW9EUFaMSNVY8JhbqrB6z,2014,Country and Folk,0.68,0.588,0.807,0.845865,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
220218,223573,Jon Pardi,Head Over Boots,4ly1QBXEwYoDmje9rmEgC4,2016,Country and Folk,0.72,0.563,0.688,0.829035,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
220219,223574,Jon Pardi,Heartache On The Dance Floor,0w3Q14i073jLoew1hgJkwD,2016,Country and Folk,0.71,0.596,0.834,0.831701,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
260522,263877,Luke Combs,When It Rains It Pours,1mMLMZYXkMueg65jRRWG1l,2017,Country and Folk,0.78,0.551,0.801,0.81948,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
371574,376777,Cody Johnson,On My Way to You,3OjNkFFZavF89xvRqWCXmU,2019,Country and Folk,0.63,0.443,0.538,0.812928,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
532859,540774,Luke Combs,"Going, Going, Gone",67AdiJcurlf6gocGobfaXs,2022,Country and Folk,0.77,0.565,0.56,0.803983,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
532864,540779,Zach Bryan,Oklahoma Smokeshow,0OWhKvvsHptt6vnnNUSM9a,2022,Country and Folk,0.75,0.544,0.573,0.809443,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
532898,540813,Zach Bryan,Motorcycle Drive By,2HbpYFQbairMoU2YFyOP2x,2022,Country and Folk,0.7,0.655,0.682,0.827622,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581921,590798,Morgan Wallen,Man Made A Bar (feat. Eric Church),73zawW1ttszLRgT9By826D,2023,Country and Folk,0.8,0.498,0.764,0.820491,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581922,590799,Morgan Wallen,’98 Braves,3oZ6dlSfCE9gZ55MGPJctc,2023,Country and Folk,0.79,0.488,0.67,0.808399,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, I calculate the variance of each audio feature across the user's selected songs.

In [10]:
var_dict = {}
for col in ["danceability", "energy", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", 'loudness', 'tempo', 'duration_ms']:
    var_dict[col] = statistics.variance(user_songs[col])

# Sort the dictionary by variance (its values), smallest first
sorted_var_dict = dict(sorted(var_dict.items(), key=lambda item: item[1]))
sorted_var_dict

{'instrumentalness': 4.3167151e-09,
 'duration_ms': 2.127962209405894e-05,
 'speechiness': 5.825399999999999e-05,
 'loudness': 0.00015541076294418976,
 'danceability': 0.004858200000000001,
 'tempo': 0.005487165229823666,
 'energy': 0.013525490909090908,
 'acousticness': 0.025111042909090914,
 'valence': 0.026394818181818182,
 'liveness': 0.03060922963636364}

Below, I compare the variances of the user sample to the 0.05 quantiles from the Monte Carlo simulation. If the user sample's variance for a feature is at or below the corresponding threshold, the user's songs are more consistent in that feature than about 95% of random samples, so we can reasonably assume the feature predicts the user's enjoyment of a song better than chance. This lets us pick the features for our final model.

In [11]:
training_features = []
for key in sorted_var_dict.keys():
    if sorted_var_dict[key] <= results_dict[key]:
        training_features.append(key)

# Fall back to all audio features if fewer than three pass the threshold
if len(training_features) < 3:
    training_features = list(sorted_var_dict.keys())

new_features = ['popularity', 'Acoustic', 'Classical and Opera', 'Country and Folk', 'Electronic and Dance', 'Hip-Hop and Urban', 
                'Instrumental', 'Jazz and Blues', 'Latin', 'Miscellaneous', 'Pop', 'Reggae and Tropical', 'Rock', 
                'Spiritual and Religious', 'Theater', 'World']

training_features.extend(new_features)
print(training_features)

['instrumentalness', 'duration_ms', 'speechiness', 'loudness', 'danceability', 'energy', 'acousticness', 'valence', 'popularity', 'Acoustic', 'Classical and Opera', 'Country and Folk', 'Electronic and Dance', 'Hip-Hop and Urban', 'Instrumental', 'Jazz and Blues', 'Latin', 'Miscellaneous', 'Pop', 'Reggae and Tropical', 'Rock', 'Spiritual and Religious', 'Theater', 'World']
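
As a check on the selection rule, the comparison can be reproduced by hand from the variances and thresholds printed earlier (rounded here). The dictionary names `thresholds` and `user_variances` are introduced only for this example.

```python
# Monte Carlo 0.05-quantile thresholds, rounded from the output above
thresholds = {
    'danceability': 0.013382, 'energy': 0.025014, 'speechiness': 0.000426,
    'acousticness': 0.048414, 'instrumentalness': 0.031465, 'liveness': 0.004487,
    'valence': 0.032847, 'loudness': 0.001398, 'tempo': 0.005009,
    'duration_ms': 0.0000686,
}

# Variances of the user sample, rounded from the output above (sorted ascending)
user_variances = {
    'instrumentalness': 4.3e-09, 'duration_ms': 0.0000213, 'speechiness': 0.0000583,
    'loudness': 0.000155, 'danceability': 0.004858, 'tempo': 0.005487,
    'energy': 0.013525, 'acousticness': 0.025111, 'valence': 0.026395,
    'liveness': 0.030609,
}

# A feature is kept when the user's variance does not exceed the threshold
selected = [k for k, v in user_variances.items() if v <= thresholds[k]]
excluded = [k for k in user_variances if k not in selected]

print(excluded)  # ['tempo', 'liveness']
```

Tempo narrowly misses its threshold (0.005487 versus 0.005009), while liveness varies far more in this sample than the threshold allows, so neither appears in the printed feature list.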


Finally, I use the significant features from the user input songs to measure similarity to the rest of the songs in the dataset, with Euclidean distance as my chief metric. The distances are tracked in a matrix with one row per user song and one column per candidate song. A more in-depth explanation of the linear algebra will be included in the presentation and report of this project.

In [12]:
mask1 = df['track_id'].isin(user_songs['track_id'])
df_without_user = df[~mask1].copy()  # copy so the new column is not written to a slice of df
matrix = euclidean_distances(X=user_songs[training_features], Y=df_without_user[training_features])
distances = matrix.sum(axis=0)  # total distance from each candidate song to all user songs
df_without_user['distances'] = distances
recommendations = df_without_user.sort_values(by='distances', ascending=True).head(len(user_songs))[['artist_name', 'track_name', 'year', 'genre']]
recommendations


Unnamed: 0,artist_name,track_name,year,genre
532880,Tyler Hubbard,5 Foot 9,2022,Country and Folk
260555,Russell Dickerson,Yours,2017,Country and Folk
315816,Luke Combs,"Houston, We Got a Problem",2018,Country and Folk
117059,Sam Hunt,Take Your Time,2014,Country and Folk
532911,Zach Bryan,Quittin' Time,2022,Country and Folk
426297,Koe Wetzel,Good Die Young,2020,Country and Folk
426294,Lee Brice,Memory I Don't Mess With,2020,Country and Folk
260535,Midland,Drinkin' Problem,2017,Country and Folk
581926,Morgan Wallen,Whiskey Friends,2023,Country and Folk
1022521,Funda Arar,Yak Gel,2009,Country and Folk
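
To make the summed-distance ranking concrete, here is a toy sketch with hypothetical 2-D feature vectors (none of these values come from the dataset). It uses `math.dist` in place of `euclidean_distances` so that it runs with the standard library alone.

```python
import math

# Hypothetical 2-D feature vectors, invented for illustration
user_vectors = [(0.0, 0.0), (0.0, 2.0)]
candidates = {"A": (0.0, 1.0), "B": (3.0, 4.0), "C": (0.5, 0.0)}

# Total distance from each candidate to every user song
# (equivalent to summing each column of the distance matrix)
totals = {
    name: sum(math.dist(vec, u) for u in user_vectors)
    for name, vec in candidates.items()
}

# Smallest total distance first
ranking = sorted(totals, key=totals.get)
print(ranking)  # ['A', 'C', 'B']
```

Candidate A sits between the two user songs, so it wins on total distance even though C is closer to one of them.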


This final cell is very similar to the previous one, but rather than summing the distances across all user input songs, I select the song in the dataset that is most similar to each user input song. In theory, this should be more accurate for users who enter music from many different genres, where a total distance from all songs is less useful.

In [13]:
mask1 = df['track_id'].isin(user_songs['track_id'])
df_without_user = df[~mask1]
matrix = euclidean_distances(X=user_songs[training_features], Y=df_without_user[training_features])
closest_indices = np.argmin(matrix, axis=1)
closest_songs = df_without_user.iloc[closest_indices]
closest_songs = closest_songs.drop_duplicates()
recommendations = closest_songs[['artist_name', 'track_name', 'year', 'genre']]
recommendations

Unnamed: 0,artist_name,track_name,year,genre
966361,Darius Rucker,Alright,2008,Country and Folk
480058,Chayce Beckham,23 - Steel Mix,2021,Country and Folk
426286,Tenille Arts,Somebody Like That,2020,Country and Folk
581927,Morgan Wallen,Born With A Beer In My Hand,2023,Country and Folk
371635,Thomas Rhett,Remember You Young,2019,Country and Folk
325278,George Ezra,Hold My Girl,2018,Country and Folk
581957,Thomas Rhett,Angels (Don’t Always Have Wings),2023,Country and Folk
426297,Koe Wetzel,Good Die Young,2020,Country and Folk
480010,Cody Johnson,'Til You Can't,2021,Country and Folk
426273,Luke Combs,Forever After All,2020,Country and Folk
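
The per-song variant can be sketched with the same kind of toy data: hypothetical 2-D feature vectors, with `math.dist` standing in for `euclidean_distances`. None of the values come from the dataset.

```python
import math

# Hypothetical 2-D feature vectors, invented for illustration
user_vectors = [(0.0, 0.0), (0.0, 2.0)]
candidates = {"A": (0.0, 1.0), "B": (3.0, 4.0), "C": (0.5, 0.0)}

# For each user song, pick the single closest candidate
# (equivalent to taking the argmin of each row of the distance matrix)
nearest = [
    min(candidates, key=lambda name: math.dist(candidates[name], u))
    for u in user_vectors
]

# Drop repeated picks while keeping their order
recommendations = list(dict.fromkeys(nearest))
print(recommendations)  # ['C', 'A']
```

Here candidate C is recommended because it is the closest match to the first user song, even though its total distance to both user songs is larger than A's. That is the behavior that should help when a user's songs fall into separate clusters.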
