![Add a relevant banner image here](path_to_image)

# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

The customer of this project is FutureProduct Advisors, a consultancy that helps their customers develop innovative and new consumer products. FutureProduct’s customers are increasingly seeking help from their consultants in go-to-market activities. 

FutureProduct’s consultants can support these go-to-market activities, but the business does not have all the infrastructure needed to support it. Their biggest ask is for a tool to help them find interesting, up-and-coming music to accompany social posts and online ads for go-to-market promotions. 

**Stakeholders**

- FutureProduct Managing Director: oversees their consulting practice and is sponsoring this project.
- FutureProduct Senior Consultants: the actual users of the prospective tool. A small subset of the consultants will pilot the prototype tool.
- My consulting leadership: sponsors of this effort; will provide oversight and technical input of the project as needed.

**Primary Goals**

1.	Build a data tool that can evaluate any song in the Billboard Hot 100 list and make predictions about:
    -	The song’s position on the Hot 100 list 4 weeks in the future
    -	The song’s highest position on the list in the next 6 months
2.	Create a rubric that lists the 3 most important factors for songs’ placement on the Hot 100 list for each hear from 2000 to 2021.


## Data Understanding

Billboard Hot 100 weekly charts (Kaggle): https://www.kaggle.com/datasets/thedevastator/billboard-hot-100-audio-features

I’ve chosen this dataset because it has a direct measurement of song popularity (the Hot 100 list) and because its long history gives significant context to a song’s positioning in a given week.
The features list gives a wide range of song attributes to explore and enables me to determine what features most significantly contribute to a song’s popularity and how that changes over time.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import ast
from collections import Counter

import xgboost as xgb

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay, mean_squared_error, r2_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

np.random.seed(42)

In [2]:
df_hotlist_all = pd.read_csv('Data/Hot Stuff.csv')
df_features_all = pd.read_csv('Data/Hot 100 Audio Features.csv')

In [3]:
# exploring hotlist df
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   index                   327895 non-null  int64  
 1   url                     327895 non-null  object 
 2   WeekID                  327895 non-null  object 
 3   Week Position           327895 non-null  int64  
 4   Song                    327895 non-null  object 
 5   Performer               327895 non-null  object 
 6   SongID                  327895 non-null  object 
 7   Instance                327895 non-null  int64  
 8   Previous Week Position  295941 non-null  float64
 9   Peak Position           327895 non-null  int64  
 10  Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 27.5+ MB


In [4]:
# exploring features df
df_features_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      29503 non-null  int64  
 1   SongID                     29503 non-null  object 
 2   Performer                  29503 non-null  object 
 3   Song                       29503 non-null  object 
 4   spotify_genre              27903 non-null  object 
 5   spotify_track_id           24397 non-null  object 
 6   spotify_track_preview_url  14491 non-null  object 
 7   spotify_track_duration_ms  24397 non-null  float64
 8   spotify_track_explicit     24397 non-null  object 
 9   spotify_track_album        24391 non-null  object 
 10  danceability               24334 non-null  float64
 11  energy                     24334 non-null  float64
 12  key                        24334 non-null  float64
 13  loudness                   24334 non-null  flo

#### Exploratory Data Analysis

## Data Preparation

### Initial Data Selection and Feature Engineering

In [5]:
# removing hotlist df attributes that will not be used in cleaning or analysis
df_hotlist_all = df_hotlist_all.drop(['index', 'url', 'Song', 'Performer'], axis=1)
# converting WeekID to datetime
df_hotlist_all['WeekID'] = pd.to_datetime(df_hotlist_all['WeekID'], errors='coerce')
df_hotlist_all = df_hotlist_all.sort_values(by='WeekID')

# creating a new hotlist df with only complete year data from 2000 - 2020, the time period being studied
df_hotlist_2000s = df_hotlist_all.loc[(df_hotlist_all['WeekID'] > '1999-12-31') & (df_hotlist_all['WeekID'] < '2021-01-01')]

# adding a column to calculate the week over week change in rank
def diff(a, b):
    return a - b

df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s.apply(lambda x: diff(x['Week Position'], x['Previous Week Position']), axis=1)
# replacing NaNs with 0
df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Rank_Change'].fillna(0)

# removing features df attributes that will not be used in cleaning or analysis
df_features_all = df_features_all.drop(['index', 'Performer', 'Song', 'spotify_track_album', 
                                        'spotify_track_preview_url', 'spotify_track_explicit', 
                                        'spotify_track_popularity'], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s.apply(lambda x: diff(x['Week Position'], x['Previous Week Position']), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Rank_Change'].fillna(0)


In [6]:
# new df with the max weekly rank change for each song in df_hotlist_2000s
df_max_rank_change = df_hotlist_2000s.groupby('SongID', as_index=False)['Rank_Change'].max()
df_max_rank_change.rename(columns={'Rank_Change': 'Max_Rank_Change'}, inplace=True)
df_max_rank_change.set_index('SongID', inplace=True)

# new df with the max peak rank for each song in df_hotlist_2000s
df_max_peak_pos = df_hotlist_2000s.groupby('SongID', as_index=False)['Peak Position'].max()
df_max_peak_pos.rename(columns={'Peak Position': 'Max_Peak_Position'}, inplace=True)
df_max_peak_pos.set_index('SongID', inplace=True)

# ensuring these new dfs have no null values
df_max_rank_change['Max_Rank_Change'].isna().sum(), df_max_peak_pos['Max_Peak_Position'].isna().sum()

(0, 0)

In [7]:
# extracting full list of songs in the time period being studied
songid_list = df_hotlist_2000s['SongID'].unique()

# creating a features df with only songs in df_hotlist_2000s
df_features_2000s = df_features_all[df_features_all['SongID'].isin(songid_list)]

# checking for duplicates
print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

8781
8664


In [8]:
# removing duplicates and rechecking
df_features_2000s = df_features_2000s.drop_duplicates(subset='SongID')

print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

8664
8664


In [9]:
# adding max peak position to features df
df_2000s_data = df_features_2000s.join(df_max_peak_pos, on='SongID')

# adding max rank change to features df
df_2000s_data = df_2000s_data.join(df_max_rank_change, on='SongID')

# removing entries with missing values and defining as a new df
df_cleaned = df_2000s_data[df_2000s_data.notna().all(axis=1)]
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7798 entries, 5 to 29499
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     7798 non-null   object 
 1   spotify_genre              7798 non-null   object 
 2   spotify_track_id           7798 non-null   object 
 3   spotify_track_duration_ms  7798 non-null   float64
 4   danceability               7798 non-null   float64
 5   energy                     7798 non-null   float64
 6   key                        7798 non-null   float64
 7   loudness                   7798 non-null   float64
 8   mode                       7798 non-null   float64
 9   speechiness                7798 non-null   float64
 10  acousticness               7798 non-null   float64
 11  instrumentalness           7798 non-null   float64
 12  liveness                   7798 non-null   float64
 13  valence                    7798 non-null   float64
 

### Feature Engineering for Genre

The dataset has genre in a single column; the entry for each song has a variety of genres listed in that single column. In order to explore genre, I'll need to break this field out.

In [10]:
# generating a df with unique genre names
unique_genres = list(set(
    genre 
    for genre_string in df_cleaned['spotify_genre'] 
    if pd.notna(genre_string)
    for genre in ast.literal_eval(genre_string)
))

df_unique_genres = pd.DataFrame(unique_genres, columns=['genre'])

# adding counts of each unique genre name
# Extract all genres (with duplicates) and count them
all_genres_list = []
for genre_string in df_cleaned['spotify_genre']:
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        all_genres_list.extend(genre_list)

# Count occurrences
genre_counts = Counter(all_genres_list)

# Map counts to genres dataframe
df_unique_genres['count'] = df_unique_genres['genre'].map(genre_counts)
df_unique_genres = df_unique_genres.sort_values('count', ascending=False)

In [None]:
# writing to csv for easier review of the data
df_unique_genres.to_csv('genre_counts.csv', index=False)

After reviewing the full set of genre counts, I created a new csv that contains genres which appear in 50 or more song entries.

In [11]:
# loading list of genres with 50 or more instances in df_cleaned
df_genres_50_up = pd.read_csv('genre_counts_50+inst.csv')

# converting df to list
final_genres_list = df_genres_50_up['genre'].tolist()

# manually one-hot encoding each genre

# creating each new genre column and initializing to 0
for genre in final_genres_list:
    df_cleaned[genre] = 0

# iterating through rows to set values to 1 when genre column appears in original spotify_genre column
for idx, genre_string in enumerate(df_cleaned['spotify_genre']):
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        for genre in genre_list:
            df_cleaned.at[idx, genre] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats

In [12]:
# reviewing full df 
pd.set_option('display.max_columns', None)
df_cleaned.head(3)

Unnamed: 0,SongID,spotify_genre,spotify_track_id,spotify_track_duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,Max_Peak_Position,Max_Rank_Change,pop,rap,pop rap,dance pop,post-teen pop,hip hop,trap,contemporary country,country road,country,southern hip hop,modern country rock,atl hip hop,r&b,canadian pop,melodic rap,urban contemporary,pop rock,hollywood,glee club,neo mellow,canadian hip hop,toronto rap,edm,gangster rap,country pop,tropical house,hip pop,modern rock,miami hip hop,latin,electropop,chicago rap,conscious hip hop,viral pop,country dawn,uk pop,dirty south rap,philly rap,detroit hip hop,alternative r&b,post-grunge,oklahoma country,talent show,canadian contemporary r&b,neo soul,boy band,reggaeton,electro house,rock,atl trap,nc hip hop,new orleans rap,emo rap,australian country,alternative metal,canadian metal,canadian rock,nu metal,alternative rock,permanent wave,dance rock,new romantic,new wave,new wave pop,soft rock,synthpop,candy pop,europop,adult standards,brill building pop,easy listening,vocal jazz,dancehall,glam metal,plugg,underground hip hop,deep pop r&b,pop punk,pittsburgh rap,acoustic pop,piano rock,art pop,canadian indie,chamber pop,indie pop,indie rock,slow core,stomp and holler,cali rap,slow game,alternative dance,dance-punk,indietronica,new rave,indie pop rap,comic,texas pop punk,bachata,latin pop,tropical,crunk,metropopolis,baton rouge rap,brooklyn drill,nyc rap,australian pop,punk,east coast hip hop,queens hip hop,west coast trap,g funk,complextro,german techno,new jack swing,escape room,indie r&b,indie soul,dmv rap,memphis hip hop,new jersey rap,british soul,danish pop,scandipop,texas country,idol,rap kreyol,dfw rap,deep southern trap,deep talent show,country rock,redneck,american folk revival,cantautor,latin arena pop,mexican pop,rock en espanol,spanish pop,country gospel,alberta country,canadian contemporary country,canadian country,emo,funk,soul,classic soul,disco,motown,post-disco,quiet storm,lilith,folk-pop,ny roots,ethiopian pop,trap queen,canadian pop punk,canadian punk,pop reggaeton,downtempo,electronic trap,shiver pop,latin hip hop,reggaeton flow,rap metal,socal pop punk,alaska indie,singer-songwriter,sertanejo,sertanejo pop,sertanejo universitario,pixie,pop emo,swedish electropop,swedish pop,garage rock,punk blues,big room,brostep,catstep,electra,funk metal,rap rock,australian dance,christian alternative rock,christian rock,canadian latin,bronx hip hop,hardcore hip hop,lounge,girl group,wrestling,west coast rap,show tunes,etherpop,indie poptimism,belgian dance,belgian pop,eurodance,progressive electro house,modern uplift,australian hip hop,kentucky hip hop,folk,mellow gold,heartland rock,art rock,experimental,experimental rock,melancholia,post-punk,psychedelic rock,jam band,barbadian pop,puerto rican pop,trap latino,deep contemporary country,lds youth,reggae fusion,progressive house,ohio hip hop,arkansas country,blues rock,modern blues rock,small room,bubblegum dance,deep big room,dutch house,smooth jazz,smooth saxophone,christian music,lgbtq+ hip hop,reggaeton colombiano,rap latina,houston rap,modern folk rock,uk americana,alternative hip hop,chicano rap,cartoon,children's music,old school hip hop,bounce,electro,disco house,canadian ccm,christian punk,indiecoustica,ectofolk,irish rock,anthem worship,ccm,christian pop,worship,bassline,social media pop,norwegian hip hop,outlaw country,hawaiian hip hop,vapor trap,bhangra,desi hip hop,desi pop,scottish singer-songwriter,grunge,hard rock,k-pop,k-pop boy group,electropowerpop,neon pop punk,trancecore,album rock,classic rock,glam rock,protopunk,north carolina hip hop,house,uk dance,nu-metalcore,trap soul,italian pop,italo dance,rock-and-roll,rockabilly,groove metal,rap conscient,drill,baroque pop,uk contemporary r&b,celtic rock,harlem hip hop,electronica,nu jazz,trip hop,bow pop,country rap,san diego rap,canadian trap,south african rock,christian indie,moombahton,neo-singer-songwriter,neo-synthpop,neo-traditional country,funk rock,aussietronica,disney,florida rap,colombian pop,a cappella,latin viral pop,antiviral pop,comedy rock,parody,viral rap,alternative pop rock,la indie,movie tunes,indie electropop,la pop,pop edm,portland hip hop,viral trap,bubble trance,hopebeat,gospel r&b,k-hop,teen pop,modern alternative rock,nu gaze,ghanaian hip hop,new americana,southern soul,pop soul,swedish synthpop,classic country pop,nashville sound,sleaze rock,kids dance party,metal,old school thrash,speed metal,thrash metal,k-rap,folk rock,meme rap,lo-fi,washington indie,brooklyn indie,shimmer pop,big beat,skate punk,nyc pop,sheffield indie,scottish rock,uk alternative pop,vocal house,contemporary vocal jazz,norwegian pop,alabama metal,yacht rock,soca,experimental pop,icelandic experimental,icelandic pop,latin rock,mexican rock,melbourne bounce international,cyberpunk,electronic rock,industrial,industrial metal,industrial rock,alternative country,indie folk,roots rock,canadian singer-songwriter,comedy,screamo,j-pop,japanese singer-songwriter,post-metal,progressive metal,progressive rock,detroit trap,battle rap,balkan brass,transpop,roots americana,hyphy,australian indie,filter house,neo r&b,lovers rock,old school dancehall,riddim,anti-folk,nz pop,world worship,cedm,minnesota hip hop,philly soul,canadian folk,bass trap,vapor twitch,san marcos tx indie,swedish garage rock,swedish hard rock,swedish indie rock,alt z,bedroom pop,chicago hardcore,chicago punk,hardcore punk,southern rock,cowboy western,traditional country,yodeling,chicago indie,alabama indie,queer country,mexican hip hop,deep vocal house,k-pop girl group,cello,power metal,chicago drill,birmingham metal,operatic pop,uk funky,modern salsa,salsa,gospel,grunge pop,french pop,minneapolis sound,bubblegum pop,classic uk pop,boston hip hop,modern southern rock,turntablism,trance,swedish alternative rock,trap boricua,modern blues,champeta,vallenato,deep norteno,duranguense,musica potosina,norteno,norteno-sax,cumbia,liquid funk,chicago house,bluegrass gospel,wu fam,chinese hip hop,chinese idol pop,palm desert scene,stoner metal,stoner rock,virgin islands reggae,destroy techno,tennessee hip hop,french shoegaze,miami bass,romanian pop,electroclash,easycore,milwaukee hip hop,bluegrass,traditional folk,pop r&b,deep progressive trance,australian electropop,israeli pop,rebel blues,banda,regional mexican pop,oxford indie,celtic,middle earth,panamanian pop,albuquerque indie,portland indie,australian rock,lo star,drum and bass,azonto,deep latin christian,grupera,canadian electronic,quebec indie,stomp pop,pop dance,slap house,broadway,alabama rap,finnish edm,uk garage,indie anthem-folk,charlottesville indie,progressive trance,uplifting trance,classic girl group,irish singer-songwriter,power pop,strut,ninja,indy indie,german pop,deep euro house,deep house,german dance,bedroom soul,chutney,scorecore,soundtrack,comic metal,britpop,madchester,el paso indie,grime,deep underground hip hop,vapor pop,pinoy hip hop,dutch hip hop,icelandic indie,icelandic rock,australian house,bass house,jazz rap,seattle indie,reggae,bmore,arkansas hip hop,brazilian death metal,brazilian metal,brazilian thrash metal,crossover thrash,new wave of thrash metal,fake,ann arbor indie,bahamian pop,alternative pop
5,...Ready For It?Taylor Swift,"['pop', 'post-teen pop']",2yLa0QULdQr0qAIvVwN6B5,208186.0,0.613,0.764,2.0,-6.509,1.0,0.136,0.0527,0.0,0.197,0.417,160.015,4.0,4.0,22.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
13,'Til Summer Comes AroundKeith Urban,"['australian country', 'contemporary country',...",1CKmI1IQjVEVB3F7VmJmM3,331466.0,0.57,0.629,9.0,-7.608,0.0,0.0331,0.593,0.000136,0.77,0.308,127.907,4.0,92.0,14.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,'Tis The Damn SeasonTaylor Swift,"['pop', 'post-teen pop']",7dW84mWkdWE5a6lFWxJCBG,229840.0,0.575,0.434,5.0,-8.193,1.0,0.0312,0.735,6.6e-05,0.105,0.348,145.916,4.0,39.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


I created two datasets: one containing genre and one without. This will allow me to model this data with and without genre.

In [13]:
# removing fields used for prep/cleaning but not needed for analysis
df_cleaned = df_cleaned.drop(['SongID', 'spotify_genre', 'spotify_track_id'], axis=1)

# my code added columns for all genres in spotify_genre, removing unwanted columns and creating a clean df with genre
last_col_to_keep_genre = 'emo rap'
df_cleaned_genre = df_cleaned.loc[:, :last_col_to_keep_genre]
# removing NaN rows
df_cleaned_genre = df_cleaned_genre.dropna()

# creating a clean df for analysis without genre
last_col_to_keep_no_genre = 'Max_Rank_Change'
df_cleaned_no_genre = df_cleaned.loc[:, :last_col_to_keep_no_genre]
# removing NaN rows
df_cleaned_no_genre = df_cleaned_no_genre.dropna()

df_cleaned_genre.info(), df_cleaned_no_genre.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7798 entries, 5 to 29499
Data columns (total 69 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   spotify_track_duration_ms  7798 non-null   float64
 1   danceability               7798 non-null   float64
 2   energy                     7798 non-null   float64
 3   key                        7798 non-null   float64
 4   loudness                   7798 non-null   float64
 5   mode                       7798 non-null   float64
 6   speechiness                7798 non-null   float64
 7   acousticness               7798 non-null   float64
 8   instrumentalness           7798 non-null   float64
 9   liveness                   7798 non-null   float64
 10  valence                    7798 non-null   float64
 11  tempo                      7798 non-null   float64
 12  time_signature             7798 non-null   float64
 13  Max_Peak_Position          7798 non-null   float64
 

(None, None)

### Features and Target Variables

I'm prepping 4 versions for XGBoost and k-NN:

1. Max Peak Position, no genre (1__1 variables)
2. Max Peak Position, with genre (1_2 variables)
3. Max Rank Change, no genre (2_1 variables)
4. Max Rank Change, with genre (2_2 variables)

In [14]:
# Prepare features and target for XGBoost and k-NN max_peak_position analysis, no genre
X1_1 = df_cleaned_no_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y1_1 = df_cleaned_no_genre['Max_Peak_Position']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X1_1_train, X1_1_test, y1_1_train, y1_1_test = train_test_split(X1_1, y1_1, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X1_1_train_scaled = scaler.fit_transform(X1_1_train)
X1_1_test_scaled = scaler.fit_transform(X1_1_test)

In [15]:
# Prepare features and target for XGBoost and k-NN max_peak_position analysis, including genre
X1_2 = df_cleaned_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y1_2 = df_cleaned_genre['Max_Peak_Position']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X1_2_train, X1_2_test, y1_2_train, y1_2_test = train_test_split(X1_2, y1_2, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X1_2_train_scaled = scaler.fit_transform(X1_2_train)
X1_2_test_scaled = scaler.fit_transform(X1_2_test)

In [16]:
# Prepare features and target for XGBoost and k-NN max_rank_change analysis, no genre
X2_1 = df_cleaned_no_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y2_1 = df_cleaned_no_genre['Max_Rank_Change']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X2_1_train, X2_1_test, y2_1_train, y2_1_test = train_test_split(X2_1, y2_1, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X2_1_train_scaled = scaler.fit_transform(X2_1_train)
X2_1_test_scaled = scaler.fit_transform(X2_1_test)

In [17]:
# Prepare features and target for XGBoost and k-NN max_rank_change analysis, including genre
X2_2 = df_cleaned_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y2_2 = df_cleaned_genre['Max_Rank_Change']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X2_2_train, X2_2_test, y2_2_train, y2_2_test = train_test_split(X2_2, y2_2, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X2_2_train_scaled = scaler.fit_transform(X2_2_train)
X2_2_test_scaled = scaler.fit_transform(X2_2_test)

Another 4 versions of the data the deep learning model

1. Max Peak Position, no genre (3__1 variables)
2. Max Peak Position, with genre (3_2 variables)
3. Max Rank Change, no genre (4_1 variables)
4. Max Rank Change, with genre (4_2 variables)

In [18]:
# features and target for deep learning max_peak_position analysis, no genre
X3_1 = df_cleaned_no_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y3_1 = df_cleaned_no_genre['Max_Peak_Position']

# Splitting the data into training and testing sets 
X3_1_train, X3_1_test, y3_1_train, y3_1_test = train_test_split(X3_1, y3_1, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X3_1_train_final, X3_1_val, y3_1_train_final, y3_1_val = train_test_split(X3_1_train, y3_1_train, test_size=0.2, random_state=42)

# normalizing 
scaler.fit(X3_1_train_final)
X3_1_train_scaled = scaler.transform(X3_1_train_final)
X3_1_val_scaled = scaler.transform(X3_1_val)
X3_1_test_scaled = scaler.transform(X3_1_test)

In [19]:
# features and target for deep learning max_peak_position analysis, including genre
X3_2 = df_cleaned_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y3_2 = df_cleaned_genre['Max_Peak_Position']

# Splitting the data into training and testing sets 
X3_2_train, X3_2_test, y3_2_train, y3_2_test = train_test_split(X3_2, y3_2, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X3_2_train_final, X3_2_val, y3_2_train_final, y3_2_val = train_test_split(X3_2_train, y3_2_train, test_size=0.2, random_state=42)

# normalizing
scaler.fit(X3_2_train_final)
X3_2_train_scaled = scaler.transform(X3_2_train_final)
X3_2_val_scaled = scaler.transform(X3_2_val)
X3_2_test_scaled = scaler.transform(X3_2_test)

In [20]:
# features and target for deep learning max_rank_change analysis, no genre
X4_1 = df_cleaned_no_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y4_1 = df_cleaned_no_genre['Max_Rank_Change']

# Splitting the data into training and testing sets 
X4_1_train, X4_1_test, y4_1_train, y4_1_test = train_test_split(X4_1, y4_1, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X4_1_train_final, X4_1_val, y4_1_train_final, y4_1_val = train_test_split(X4_1_train, y4_1_train, test_size=0.2, random_state=42)

# normalizing
scaler.fit(X4_1_train_final)
X4_1_train_scaled = scaler.transform(X4_1_train_final)
X4_1_val_scaled = scaler.transform(X4_1_val)
X4_1_test_scaled = scaler.transform(X4_1_test)

In [21]:
# Prepare features and target for simple deep learning max_rank_change analysis, including genre
X4_2 = df_cleaned_genre.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y4_2 = df_cleaned_genre['Max_Rank_Change']

# Splitting the data into training and testing sets 
X4_2_train, X4_2_test, y4_2_train, y4_2_test = train_test_split(X4_2, y4_2, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X4_2_train_final, X4_2_val, y4_2_train_final, y4_2_val = train_test_split(X4_2_train, y4_2_train, test_size=0.2, random_state=42)

# normalizing
scaler.fit(X4_2_train_final)
X4_2_train_scaled = scaler.transform(X4_2_train_final)
X4_2_val_scaled = scaler.transform(X4_2_val)
X4_2_test_scaled = scaler.transform(X4_2_test)

## Analysis

### XGBoost | Max Peak Position - No Genre

In [22]:
# XGBoost for max_peak_position, no genre

xgb_model1_1 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model1_1.fit(X1_1_train, y1_1_train)
y1_1_pred = xgb_model1_1.predict(X1_1_test)
y1_1_pred = np.clip(np.round(y1_1_pred), 1, 100)

rmse1_1 = np.sqrt(mean_squared_error(y1_1_test, y1_1_pred))
r2_1_1 = r2_score(y1_1_test, y1_1_pred)

print(f'RMSE: {rmse1_1:.3f}')
print(f'R²: {r2_1_1:.3f}')

RMSE: 26.079
R²: -0.174


In [23]:
# hyperparameter tuning 1

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb1_1 = GridSearchCV(estimator=xgb_model1_1,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_1.fit(X1_1_train, y1_1_train)

print("Best parameters:", grid_search_xgb1_1.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 6, 'subsample': 0.8}


In [24]:
# hyperparameter tuning 2

param_grid2 = {
    'max_depth': [5, 6, 7],
    'learning_rate': [0.005, 0.01, 0.015,],
    'subsample': [0.75, 0.8, 0.85],
    'colsample_bytree': [1.0, 1.1, 1.2],
}

grid_search_xgb1_1 = GridSearchCV(estimator=xgb_model1_1,
                            param_grid=param_grid2,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_1.fit(X1_1_train, y1_1_train)

print("Best parameters:", grid_search_xgb1_1.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


270 fits failed out of a total of 405.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
135 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\sklearn.py", line 1170, in fit
    self._Booster = train(
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwar

Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.015, 'max_depth': 5, 'subsample': 0.75}


In [27]:
# hyperparameter tuning 3

param_grid3 = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.015],
    'subsample': [0.73, 0.74, 0.75],
    'colsample_bytree': [1.0],
}

grid_search_xgb1_1 = GridSearchCV(estimator=xgb_model1_1,
                            param_grid=param_grid3,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_1.fit(X1_1_train, y1_1_train)

print("Best parameters:", grid_search_xgb1_1.best_params_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.015, 'max_depth': 5, 'subsample': 0.75}


In [29]:
# Extract best/final model for max_peak_position
best_xgb1_1 = grid_search_xgb1_1.best_estimator_

# predictions
y1_1_pred_best = best_xgb1_1.predict(X1_1_test)
y1_1_pred_best = np.clip(np.round(y1_1_pred_best), 1, 100)

# Evaluate XGBoost model
rmse1_1_best = np.sqrt(mean_squared_error(y1_1_test, y1_1_pred_best))
r2_1_1_best = r2_score(y1_1_test, y1_1_pred_best)

print(f'RMSE: {rmse1_1_best:.3f}')
print(f'R²: {r2_1_1_best:.3f}')

RMSE: 23.925
R²: 0.012


### XGBoost | Max Peak Position - With Genre

In [30]:
# XGBoost for max_peak_position, with genre

xgb_model1_2 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model1_2.fit(X1_2_train, y1_2_train)
y1_2_pred = xgb_model1_2.predict(X1_2_test)
y1_2_pred = np.clip(np.round(y1_2_pred), 1, 100)

rmse1_2 = np.sqrt(mean_squared_error(y1_2_test, y1_2_pred))
r2_1_2 = r2_score(y1_2_test, y1_2_pred)

print(f'RMSE: {rmse1_2:.3f}')
print(f'R²: {r2_1_2:.3f}')

RMSE: 26.000
R²: -0.167


In [31]:
# hyperparameter tuning 1

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb1_2 = GridSearchCV(estimator=xgb_model1_2,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_2.fit(X1_2_train, y1_2_train)

print("Best parameters:", grid_search_xgb1_2.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 9, 'subsample': 0.8}


In [32]:
# hyperparameter tuning 2

param_grid4 = {
    'max_depth': [8, 9, 10],
    'learning_rate': [0.005, 0.01, 0.015],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.6, 0.7, 0.8],
}

grid_search_xgb1_2 = GridSearchCV(estimator=xgb_model1_2,
                            param_grid=param_grid4,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_2.fit(X1_2_train, y1_2_train)

print("Best parameters:", grid_search_xgb1_2.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.015, 'max_depth': 8, 'subsample': 0.6}


In [None]:
# hyperparameter tuning 3

param_grid5 = {
    'max_depth': [7, 8],
    'learning_rate': [0.013, 0.015, 0.017],
    'subsample': [0.5, 0.6],
    'colsample_bytree': [0.8],
}

grid_search_xgb1_2 = GridSearchCV(estimator=xgb_model1_2,
                            param_grid=param_grid5,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1_2.fit(X1_2_train, y1_2_train)

print("Best parameters:", grid_search_xgb1_2.best_params_)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.015, 'max_depth': 8, 'subsample': 0.6}


In [35]:
# Extract best/final model 
best_xgb1_2 = grid_search_xgb1_2.best_estimator_

# predictions
y1_2_pred_best = best_xgb1_2.predict(X1_2_test)
y1_2_pred_best = np.clip(np.round(y1_2_pred_best), 1, 100)

# Evaluate XGBoost model
rmse1_2_best = np.sqrt(mean_squared_error(y1_2_test, y1_2_pred_best))
r2_1_2_best = r2_score(y1_2_test, y1_2_pred_best)

print(f'RMSE: {rmse1_2_best:.3f}')
print(f'R²: {r2_1_2_best:.3f}')

RMSE: 23.963
R²: 0.008


### XGBoost | Max Rank Change - No Genre

In [36]:
# XGBoost for max_rank_change

xgb_model2_1 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model2_1.fit(X2_1_train, y2_1_train)
y2_1_pred = xgb_model2_1.predict(X2_1_test)
y2_1_pred = np.clip(np.round(y2_1_pred), 1, 100)

rmse2_1 = np.sqrt(mean_squared_error(y2_1_test, y2_1_pred))
r2_2_1 = r2_score(y2_1_test, y2_1_pred)

print(f'RMSE: {rmse2_1:.3f}')
print(f'R²: {r2_2_1:.3f}')

RMSE: 12.816
R²: -0.191


In [37]:
# hyperparameter tuning 1 for max_rank_change

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb2_1 = GridSearchCV(estimator=xgb_model2_1,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_1.fit(X2_1_train, y2_1_train)

print("Best parameters:", grid_search_xgb2_1.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'subsample': 1.0}


In [38]:
# hyperparameter tuning 2 for max_rank_change

param_grid4 = {
    'max_depth': [3],
    'learning_rate': [0.005, 0.01, 0.015],
    'subsample': [1.0, 1.1, 1.2],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

grid_search_xgb2_1 = GridSearchCV(estimator=xgb_model2_1,
                            param_grid=param_grid4,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_1.fit(X2_1_train, y2_1_train)

print("Best parameters:", grid_search_xgb2_1.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best parameters: {'colsample_bytree': 0.9, 'learning_rate': 0.005, 'max_depth': 3, 'subsample': 1.0}


90 fits failed out of a total of 135.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\sklearn.py", line 1170, in fit
    self._Booster = train(
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs

In [39]:
# hyperparameter tuning 3 for max_rank_change

param_grid5 = {
    'max_depth': [3],
    'learning_rate': [0.003, 0.005, 0.007],
    'subsample': [1.0],
    'colsample_bytree': [0.9],
        }

grid_search_xgb2_1 = GridSearchCV(estimator=xgb_model2_1,
                            param_grid=param_grid5,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_1.fit(X2_1_train, y2_1_train)

print("Best parameters:", grid_search_xgb2_1.best_params_)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters: {'colsample_bytree': 0.9, 'learning_rate': 0.005, 'max_depth': 3, 'subsample': 1.0}


In [40]:
# Extract best/final model  for max_rank_change
best_xgb2_1 = grid_search_xgb2_1.best_estimator_

y2_1_pred_best = best_xgb2_1.predict(X2_1_test)
y2_1_pred_best = np.clip(np.round(y2_1_pred_best), 1, 100)

# Evaluate XGBoost model
rmse2_1_best = np.sqrt(mean_squared_error(y2_1_test, y2_1_pred_best))
r2_2_1_best = r2_score(y2_1_test, y2_1_pred_best)

print(f'RMSE: {rmse2_1_best:.3f}')
print(f'R²: {r2_2_1_best:.3f}')

RMSE: 11.733
R²: 0.002


### XGBoost | Max Rank Change - With Genre

In [41]:
# XGBoost for max_rank_change

xgb_model2_2 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model2_2.fit(X2_2_train, y2_2_train)
y2_2_pred = xgb_model2_2.predict(X2_2_test)
y2_2_pred = np.clip(np.round(y2_2_pred), 1, 100)

rmse2_2 = np.sqrt(mean_squared_error(y2_2_test, y2_2_pred))
r2_2_2 = r2_score(y2_2_test, y2_2_pred)

print(f'RMSE: {rmse2_2:.3f}')
print(f'R²: {r2_2_2:.3f}')

RMSE: 12.695
R²: -0.169


In [42]:
# hyperparameter tuning 1

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb2_2 = GridSearchCV(estimator=xgb_model2_2,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_2.fit(X2_2_train, y2_2_train)

print("Best parameters:", grid_search_xgb2_2.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'subsample': 1.0}


In [43]:
# hyperparameter tuning 2

param_grid6 = {
    'max_depth': [2, 3, 4],
    'learning_rate': [0.005, 0.01, 0.015],
    'subsample': [1.0, 1.1, 1.2],
    'colsample_bytree': [0.6, 0.7, 0.8],
}

grid_search_xgb2_2 = GridSearchCV(estimator=xgb_model2_2,
                            param_grid=param_grid6,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_2.fit(X2_2_train, y2_2_train)

print("Best parameters:", grid_search_xgb2_2.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.005, 'max_depth': 3, 'subsample': 1.0}


270 fits failed out of a total of 405.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
135 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\sklearn.py", line 1170, in fit
    self._Booster = train(
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwar

In [44]:
# hyperparameter tuning 3

param_grid7 = {
    'max_depth': [2],
    'learning_rate': [0.003, 0.005, 0.007],
    'subsample': [1.0],
    'colsample_bytree': [0.8],
}

grid_search_xgb2_2 = GridSearchCV(estimator=xgb_model2_2,
                            param_grid=param_grid7,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_2.fit(X2_2_train, y2_2_train)

print("Best parameters:", grid_search_xgb2_2.best_params_)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.007, 'max_depth': 2, 'subsample': 1.0}


In [45]:
# Extract best/final model  for max_rank_change
best_xgb2_2 = grid_search_xgb2_2.best_estimator_

y2_2_pred_best = best_xgb2_2.predict(X2_2_test)
y2_2_pred_best = np.clip(np.round(y2_2_pred_best), 1, 100)

# Evaluate XGBoost model
rmse2_2_best = np.sqrt(mean_squared_error(y2_2_test, y2_2_pred_best))
r2_2_2_best = r2_score(y2_2_test, y2_2_pred_best)

print(f'RMSE: {rmse2_2_best:.3f}')
print(f'R²: {r2_2_2_best:.3f}')

RMSE: 11.736
R²: 0.001


### k-Nearest Neighbors | Max Peak Position - No Genre

In [46]:
# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X1_1_train_scaled, y1_1_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

{'euclidean': 0.011286584340476555, 'manhattan': 0.013338451302523157, 'chebyshev': 0.01196961388578155}

Best metric: manhattan with accuracy: 0.0133


### k-Nearest Neighbors | Max Peak Position - With Genre

In [47]:
# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X1_2_train_scaled, y1_2_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

{'euclidean': 0.009576159037236881, 'manhattan': 0.013508367879625366, 'chebyshev': 0.011798235031767966}

Best metric: manhattan with accuracy: 0.0135


### k-Nearest Neighbors | Max Rank Change - No Genre

In [48]:
# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X2_1_train_scaled, y2_1_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")



{'euclidean': 0.1793772162634438, 'manhattan': 0.18125814305455024, 'chebyshev': 0.18143039927471066}

Best metric: chebyshev with accuracy: 0.1814


### k-Nearest Neighbors | Max Rank Change - With Genre

In [49]:
# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X2_2_train_scaled, y2_2_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")



{'euclidean': 0.1747608080542212, 'manhattan': 0.17937809362959062, 'chebyshev': 0.17766869192018891}

Best metric: manhattan with accuracy: 0.1794


### Deep Learning | Max Peak Position - No Genre

In [None]:
# deep learning model, no regularization or dropout

baseline_model3_1 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X3_1_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model3_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline3_1 = baseline_model3_1.fit(
    X3_1_train_scaled, y3_1_train_final,
    validation_data=(X3_1_val_scaled, y3_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [None]:
# deep learning model with batch normalization

bnorm_model3_1 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X3_1_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),

    layers.Dense(1)  # Single output for regression
])

bnorm_model3_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model3_1 = bnorm_model3_1.fit(
    X3_1_train_scaled, y3_1_train_final,
    validation_data=(X3_1_val_scaled, y3_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [None]:
# deep learning model regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model3_1 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X3_1_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model3_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model3_1 = reg_model3_1.fit(
    X3_1_train_scaled, y3_1_train_final,
    validation_data=(X3_1_val_scaled, y3_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [62]:
# model evaluation
print("MODEL EVALUATION: MAX PEAK POSITION, NO GENRE")

print("\n=== Baseline Model ===")
train_scores3_1 = baseline_model3_1.evaluate(X3_1_train_scaled, y3_1_train_final, verbose=0)
val_scores3_1   = baseline_model3_1.evaluate(X3_1_val_scaled, y3_1_val, verbose=0)
print(f"Train MAE: {train_scores3_1[1]:.4f}, Train MSE: {train_scores3_1[2]:.4f}")
print(f"Val   MAE: {val_scores3_1[1]:.4f}, Val   MSE: {val_scores3_1[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn3_1 = bnorm_model3_1.evaluate(X3_1_train_scaled, y3_1_train_final, verbose=0)
val_scores_bn3_1   = bnorm_model3_1.evaluate(X3_1_val_scaled, y3_1_val, verbose=0)
print(f"Train MAE: {train_scores_bn3_1[1]:.4f}, Train MSE: {train_scores_bn3_1[2]:.4f}")
print(f"Val   MAE: {val_scores_bn3_1[1]:.4f}, Val   MSE: {val_scores_bn3_1[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg3_1 = reg_model3_1.evaluate(X3_1_train_scaled, y3_1_train_final, verbose=0)
val_scores_reg3_1   = reg_model3_1.evaluate(X3_1_val_scaled, y3_1_val, verbose=0)
print(f"Train MAE: {train_scores_reg3_1[1]:.4f}, Train MSE: {train_scores_reg3_1[2]:.4f}")
print(f"Val   MAE: {val_scores_reg3_1[1]:.4f}, Val   MSE: {val_scores_reg3_1[2]:.4f}")


MODEL EVALUATION: MAX PEAK POSITION, NO GENRE

=== Baseline Model ===
Train MAE: 15.8960, Train MSE: 449.0920
Val   MAE: 20.8504, Val   MSE: 766.0729

=== BatchNorm Model ===
Train MAE: 14.2694, Train MSE: 357.2459
Val   MAE: 20.7950, Val   MSE: 741.8729

=== Regularized Model (L2 + Dropout) ===
Train MAE: 18.8158, Train MSE: 566.6469
Val   MAE: 19.4852, Val   MSE: 627.6425


### Deep Learning | Max Peak Position - With Genre

In [59]:
# deep learning model, no regularization or dropout

baseline_model3_2 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X3_2_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model3_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline3_2 = baseline_model3_2.fit(
    X3_2_train_scaled, y3_2_train_final,
    validation_data=(X3_2_val_scaled, y3_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [60]:
# deep learning model with batch normalization

bnorm_model3_2 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X3_2_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),

    layers.Dense(1)  # Single output for regression
])

bnorm_model3_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model3_2 = bnorm_model3_2.fit(
    X3_2_train_scaled, y3_2_train_final,
    validation_data=(X3_2_val_scaled, y3_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [61]:
# deep learning model with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model3_2 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X3_2_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model3_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model3_2 = reg_model3_2.fit(
    X3_2_train_scaled, y3_2_train_final,
    validation_data=(X3_2_val_scaled, y3_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [63]:
# model evaluation
print("MODEL EVALUATION: MAX PEAK POSITION, WITH GENRE")

print("\n=== Baseline Model ===")
train_scores3_2 = baseline_model3_2.evaluate(X3_2_train_scaled, y3_2_train_final, verbose=0)
val_scores3_2   = baseline_model3_2.evaluate(X3_2_val_scaled, y3_2_val, verbose=0)
print(f"Train MAE: {train_scores3_2[1]:.4f}, Train MSE: {train_scores3_2[2]:.4f}")
print(f"Val   MAE: {val_scores3_2[1]:.4f}, Val   MSE: {val_scores3_2[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn3_2 = bnorm_model3_2.evaluate(X3_2_train_scaled, y3_2_train_final, verbose=0)
val_scores_bn3_2   = bnorm_model3_2.evaluate(X3_2_val_scaled, y3_2_val, verbose=0)
print(f"Train MAE: {train_scores_bn3_2[1]:.4f}, Train MSE: {train_scores_bn3_2[2]:.4f}")
print(f"Val   MAE: {val_scores_bn3_2[1]:.4f}, Val   MSE: {val_scores_bn3_2[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg3_2 = reg_model3_2.evaluate(X3_2_train_scaled, y3_2_train_final, verbose=0)
val_scores_reg3_2   = reg_model3_2.evaluate(X3_2_val_scaled, y3_2_val, verbose=0)
print(f"Train MAE: {train_scores_reg3_2[1]:.4f}, Train MSE: {train_scores_reg3_2[2]:.4f}")
print(f"Val   MAE: {val_scores_reg3_2[1]:.4f}, Val   MSE: {val_scores_reg3_2[2]:.4f}")


MODEL EVALUATION: MAX PEAK POSITION, WITH GENRE

=== Baseline Model ===
Train MAE: 13.9684, Train MSE: 361.5233
Val   MAE: 21.6882, Val   MSE: 812.5523

=== BatchNorm Model ===
Train MAE: 12.9813, Train MSE: 313.7757
Val   MAE: 21.3910, Val   MSE: 784.8604

=== Regularized Model (L2 + Dropout) ===
Train MAE: 18.6040, Train MSE: 545.2108
Val   MAE: 19.9859, Val   MSE: 643.9786


### Deep Learning | Max Rank Change - No Genre

In [64]:
# deep learning model, no regularization or dropout

baseline_model4_1 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X4_1_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model4_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline4_1 = baseline_model4_1.fit(
    X4_1_train_scaled, y4_1_train_final,
    validation_data=(X4_1_val_scaled, y4_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [65]:
# deep learning model with batch normalization

bnorm_model4_1 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X4_1_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),

    layers.Dense(1)  # Single output for regression
])

bnorm_model4_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model4_1 = bnorm_model4_1.fit(
    X4_1_train_scaled, y4_1_train_final,
    validation_data=(X4_1_val_scaled, y4_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [66]:
# deep learning model with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model4_1 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X4_1_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model4_1.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model4_1 = reg_model4_1.fit(
    X4_1_train_scaled, y4_1_train_final,
    validation_data=(X4_1_val_scaled, y4_1_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [67]:
# model evaluation
print("MODEL EVALUATION: MAX PEAK POSITION, NO GENRE")

print("\n=== Baseline Model ===")
train_scores4_1 = baseline_model4_1.evaluate(X4_1_train_scaled, y4_1_train_final, verbose=0)
val_scores4_1   = baseline_model4_1.evaluate(X4_1_val_scaled, y4_1_val, verbose=0)
print(f"Train MAE: {train_scores4_1[1]:.4f}, Train MSE: {train_scores4_1[2]:.4f}")
print(f"Val   MAE: {val_scores4_1[1]:.4f}, Val   MSE: {val_scores4_1[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn4_1 = bnorm_model4_1.evaluate(X4_1_train_scaled, y4_1_train_final, verbose=0)
val_scores_bn4_1   = bnorm_model4_1.evaluate(X4_1_val_scaled, y4_1_val, verbose=0)
print(f"Train MAE: {train_scores_bn4_1[1]:.4f}, Train MSE: {train_scores_bn4_1[2]:.4f}")
print(f"Val   MAE: {val_scores_bn4_1[1]:.4f}, Val   MSE: {val_scores_bn4_1[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg4_1 = reg_model4_1.evaluate(X4_1_train_scaled, y4_1_train_final, verbose=0)
val_scores_reg4_1   = reg_model4_1.evaluate(X4_1_val_scaled, y4_1_val, verbose=0)
print(f"Train MAE: {train_scores_reg4_1[1]:.4f}, Train MSE: {train_scores_reg4_1[2]:.4f}")
print(f"Val   MAE: {val_scores_reg4_1[1]:.4f}, Val   MSE: {val_scores_reg4_1[2]:.4f}")


MODEL EVALUATION: MAX PEAK POSITION, NO GENRE

=== Baseline Model ===
Train MAE: 5.7779, Train MSE: 59.3154
Val   MAE: 10.5114, Val   MSE: 210.9423

=== BatchNorm Model ===
Train MAE: 5.8212, Train MSE: 60.9606
Val   MAE: 9.6852, Val   MSE: 183.9711

=== Regularized Model (L2 + Dropout) ===
Train MAE: 7.9112, Train MSE: 122.9827
Val   MAE: 8.8412, Val   MSE: 159.3799


### Deep Learning | Max Rank Change - With Genre

In [68]:
# deep learning model, no regularization or dropout

baseline_model4_2 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X4_2_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model4_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline4_2 = baseline_model4_2.fit(
    X4_2_train_scaled, y4_2_train_final,
    validation_data=(X4_2_val_scaled, y4_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [69]:
# deep learning model with batch normalization

bnorm_model4_2 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X4_2_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),

    layers.Dense(1)  # Single output for regression
])

bnorm_model4_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model4_2 = bnorm_model4_2.fit(
    X4_2_train_scaled, y4_2_train_final,
    validation_data=(X4_2_val_scaled, y4_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [70]:
# deep learning model with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model4_2 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X4_2_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model4_2.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model4_2 = reg_model4_2.fit(
    X4_2_train_scaled, y4_2_train_final,
    validation_data=(X4_2_val_scaled, y4_2_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [71]:
# model evaluation
print("MODEL EVALUATION: MAX PEAK POSITION, NO GENRE")

print("\n=== Baseline Model ===")
train_scores4_2 = baseline_model4_2.evaluate(X4_2_train_scaled, y4_2_train_final, verbose=0)
val_scores4_2   = baseline_model4_2.evaluate(X4_2_val_scaled, y4_2_val, verbose=0)
print(f"Train MAE: {train_scores4_2[1]:.4f}, Train MSE: {train_scores4_2[2]:.4f}")
print(f"Val   MAE: {val_scores4_2[1]:.4f}, Val   MSE: {val_scores4_2[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn4_2 = bnorm_model4_2.evaluate(X4_2_train_scaled, y4_2_train_final, verbose=0)
val_scores_bn4_2   = bnorm_model4_2.evaluate(X4_2_val_scaled, y4_2_val, verbose=0)
print(f"Train MAE: {train_scores_bn4_1[1]:.4f}, Train MSE: {train_scores_bn4_1[2]:.4f}")
print(f"Val   MAE: {val_scores_bn4_1[1]:.4f}, Val   MSE: {val_scores_bn4_1[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg4_2 = reg_model4_2.evaluate(X4_2_train_scaled, y4_2_train_final, verbose=0)
val_scores_reg4_2   = reg_model4_2.evaluate(X4_2_val_scaled, y4_2_val, verbose=0)
print(f"Train MAE: {train_scores_reg4_2[1]:.4f}, Train MSE: {train_scores_reg4_2[2]:.4f}")
print(f"Val   MAE: {val_scores_reg4_2[1]:.4f}, Val   MSE: {val_scores_reg4_2[2]:.4f}")


MODEL EVALUATION: MAX PEAK POSITION, NO GENRE

=== Baseline Model ===
Train MAE: 4.7415, Train MSE: 44.5522
Val   MAE: 10.9702, Val   MSE: 234.2391

=== BatchNorm Model ===
Train MAE: 5.8212, Train MSE: 60.9606
Val   MAE: 9.6852, Val   MSE: 183.9711

=== Regularized Model (L2 + Dropout) ===
Train MAE: 7.6500, Train MSE: 111.4738
Val   MAE: 9.0667, Val   MSE: 163.0826


## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here