# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

The customer for this project is FutureProduct Advisors, a consultancy that helps its clients develop innovative new consumer products. FutureProduct’s clients are increasingly asking its consultants for help with go-to-market activities.

FutureProduct’s consultants can support these go-to-market activities, but the business lacks some of the infrastructure needed to do so. Their biggest ask is for a tool that helps them find interesting, up-and-coming music to accompany social posts and online ads for go-to-market promotions.

**Stakeholders**

- FutureProduct Managing Director: oversees their consulting practice and is sponsoring this project.
- FutureProduct Senior Consultants: the actual users of the prospective tool. A small subset of the consultants will pilot the prototype tool.
- My consulting leadership: sponsors of this effort; will provide oversight and technical input on the project as needed.

**Primary Goals**

1. Build a data tool that can evaluate any song on the Billboard Hot 100 list and predict:
    - The song’s position on the Hot 100 list 4 weeks in the future
    - The song’s highest position on the list in the next 6 months
2. Create a rubric that lists the 3 most important factors for songs’ placement on the Hot 100 list for each year from 2000 to 2021.
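
To make the first goal concrete, here is a minimal sketch of how a "position 4 weeks in the future" label could be derived from weekly chart rows. The data is invented toy data, and `add_future_position` and the `Position_In_4_Weeks` column are hypothetical names used only for illustration; they are not part of the dataset or of the analysis in this notebook.

```python
import pandas as pd

# Toy weekly chart data; column names mirror the Hot 100 file, values are invented.
chart = pd.DataFrame({
    "SongID": ["A"] * 6 + ["B"] * 3,
    "WeekID": pd.to_datetime([
        "2020-01-04", "2020-01-11", "2020-01-18",
        "2020-01-25", "2020-02-01", "2020-02-08",
        "2020-01-04", "2020-01-11", "2020-01-18",
    ]),
    "Week Position": [50, 40, 30, 20, 10, 5, 90, 95, 99],
})


def add_future_position(df, weeks_ahead=4):
    """Attach each song's chart position `weeks_ahead` weeks later (NaN if not charting)."""
    future = df[["SongID", "WeekID", "Week Position"]].copy()
    # Move the future rows back in time so they line up with the week we predict from.
    future["WeekID"] = future["WeekID"] - pd.Timedelta(weeks=weeks_ahead)
    future = future.rename(
        columns={"Week Position": f"Position_In_{weeks_ahead}_Weeks"}
    )
    return df.merge(future, on=["SongID", "WeekID"], how="left")


labeled = add_future_position(chart)
print(labeled)
```

Matching on the calendar date, rather than shifting rows, keeps the label correct when a song drops off the chart for a week and later returns.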


## Data Understanding

Billboard Hot 100 weekly charts (Kaggle): https://www.kaggle.com/datasets/thedevastator/billboard-hot-100-audio-features

I’ve chosen this dataset because it directly measures song popularity (the Hot 100 list) and because its long history gives significant context for a song’s position in a given week.
The audio features file covers a wide range of song attributes, which lets me determine which features contribute most to a song’s popularity and how that changes over time.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import ast
from collections import Counter

import xgboost as xgb

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay, mean_squared_error, r2_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

np.random.seed(42)

In [2]:
df_hotlist_all = pd.read_csv('Data/Hot Stuff.csv')
df_features_all = pd.read_csv('Data/Hot 100 Audio Features.csv')

In [3]:
# exploring hotlist data
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   index                   327895 non-null  int64  
 1   url                     327895 non-null  object 
 2   WeekID                  327895 non-null  object 
 3   Week Position           327895 non-null  int64  
 4   Song                    327895 non-null  object 
 5   Performer               327895 non-null  object 
 6   SongID                  327895 non-null  object 
 7   Instance                327895 non-null  int64  
 8   Previous Week Position  295941 non-null  float64
 9   Peak Position           327895 non-null  int64  
 10  Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 27.5+ MB


In [4]:
# exploring features df
df_features_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      29503 non-null  int64  
 1   SongID                     29503 non-null  object 
 2   Performer                  29503 non-null  object 
 3   Song                       29503 non-null  object 
 4   spotify_genre              27903 non-null  object 
 5   spotify_track_id           24397 non-null  object 
 6   spotify_track_preview_url  14491 non-null  object 
 7   spotify_track_duration_ms  24397 non-null  float64
 8   spotify_track_explicit     24397 non-null  object 
 9   spotify_track_album        24391 non-null  object 
 10  danceability               24334 non-null  float64
 11  energy                     24334 non-null  float64
 12  key                        24334 non-null  float64
 13  loudness                   24334 non-null  flo

## Data Preparation
Text here

In [5]:
# removing attributes that will not be used in cleaning or analysis
df_hotlist_all = df_hotlist_all.drop(['index', 'url', 'Song', 'Performer'], axis=1)
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 7 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   WeekID                  327895 non-null  object 
 1   Week Position           327895 non-null  int64  
 2   SongID                  327895 non-null  object 
 3   Instance                327895 non-null  int64  
 4   Previous Week Position  295941 non-null  float64
 5   Peak Position           327895 non-null  int64  
 6   Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 17.5+ MB


In [6]:
# removing attributes that will not be used in cleaning or analysis
df_features_all = df_features_all.drop(['index', 'Performer', 'Song', 'spotify_track_album', 'spotify_track_preview_url', 'spotify_track_explicit', 'spotify_track_popularity'], axis=1)
df_features_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     29503 non-null  object 
 1   spotify_genre              27903 non-null  object 
 2   spotify_track_id           24397 non-null  object 
 3   spotify_track_duration_ms  24397 non-null  float64
 4   danceability               24334 non-null  float64
 5   energy                     24334 non-null  float64
 6   key                        24334 non-null  float64
 7   loudness                   24334 non-null  float64
 8   mode                       24334 non-null  float64
 9   speechiness                24334 non-null  float64
 10  acousticness               24334 non-null  float64
 11  instrumentalness           24334 non-null  float64
 12  liveness                   24334 non-null  float64
 13  valence                    24334 non-null  flo

In [7]:
# converting WeekID to datetime
df_hotlist_all['WeekID'] = pd.to_datetime(df_hotlist_all['WeekID'], errors='coerce')
df_hotlist_all = df_hotlist_all.sort_values(by='WeekID')
df_hotlist_all.head(3)

| | WeekID | Week Position | SongID | Instance | Previous Week Position | Peak Position | Weeks on Chart |
|---|---|---|---|---|---|---|---|
| 18553 | 1958-08-02 | 63 | High School ConfidentialJerry Lee Lewis And Hi... | 1 | | 63 | 1 |
| 103337 | 1958-08-02 | 98 | Little SerenadeThe Ames Brothers | 1 | | 98 | 1 |
| 146293 | 1958-08-02 | 68 | Volare (Nel Blu Dipinto Di Blu)Dean Martin | 1 | | 68 | 1 |


In [None]:
# creating a new df with only complete year data from 2000 - 2020, the time period being studied
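# .copy() makes this an independent DataFrame, so columns can be added to it later without a SettingWithCopyWarning
df_hotlist_2000s = df_hotlist_all.loc[(df_hotlist_all['WeekID'] > '1999-12-31') & (df_hotlist_all['WeekID'] < '2021-01-01')].copy()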
df_hotlist_2000s.head(2), df_hotlist_2000s.tail(2)

(           WeekID  Week Position                                   SongID  \
 72674  2000-01-01             69                   Deck The HallsSHeDAISY   
 239827 2000-01-01             83  Guerrilla RadioRage Against The Machine   
 
         Instance  Previous Week Position  Peak Position  Weeks on Chart  
 72674          1                    97.0             69               2  
 239827         1                    87.0             69              10  ,
            WeekID  Week Position                    SongID  Instance  \
 7909   2020-12-26             40     Gold RushTaylor Swift         1   
 320975 2020-12-26             65  HawaiMaluma & The Weeknd         1   
 
         Previous Week Position  Peak Position  Weeks on Chart  
 7909                       NaN             40               1  
 320975                    55.0             12              17  )

In [9]:
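# adding a column to calculate the week over week change in rank
# (current minus previous position: negative values mean the song climbed the chart;
# weeks with no previous position are set to 0)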
def diff(a, b):
    return a - b

df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s.apply(lambda x: diff(x['Week Position'], x['Previous Week Position']), axis=1)
df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Rank_Change'].fillna(0)
df_hotlist_2000s.head(3), df_hotlist_2000s.tail(3), df_hotlist_2000s.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109600 entries, 72674 to 320975
Data columns (total 8 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   WeekID                  109600 non-null  datetime64[ns]
 1   Week Position           109600 non-null  int64         
 2   SongID                  109600 non-null  object        
 3   Instance                109600 non-null  int64         
 4   Previous Week Position  99290 non-null   float64       
 5   Peak Position           109600 non-null  int64         
 6   Weeks on Chart          109600 non-null  int64         
 7   Rank_Change             109600 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(4), object(1)
memory usage: 7.5+ MB



(           WeekID  Week Position                                    SongID  \
 72674  2000-01-01             69                    Deck The HallsSHeDAISY   
 239827 2000-01-01             83   Guerrilla RadioRage Against The Machine   
 253976 2000-01-01             60  HeartbreakerMariah Carey Featuring Jay-Z   
 
         Instance  Previous Week Position  Peak Position  Weeks on Chart  \
 72674          1                    97.0             69               2   
 239827         1                    87.0             69              10   
 253976         1                    51.0              1              18   
 
         Rank_Change  
 72674         -28.0  
 239827         -4.0  
 253976          9.0  ,
            WeekID  Week Position                            SongID  Instance  \
 265214 2020-12-26             93  Ain't Always The CowboyJon Pardi         1   
 7909   2020-12-26             40             Gold RushTaylor Swift         1   
 320975 2020-12-26             65       
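
The date filter and rank-change steps can also be written without row-wise `apply`. The sketch below uses invented toy rows and assumes the same column names as the chart file; taking an explicit `.copy()` of the filtered rows is what lets the new column be assigned without pandas warning about writing to a slice.

```python
import pandas as pd

# Toy rows; the first and last fall outside the 2000-2020 window.
weeks = pd.DataFrame({
    "WeekID": pd.to_datetime(["1999-12-25", "2000-01-01", "2000-01-08", "2021-01-02"]),
    "Week Position": [97, 69, 40, 12],
    "Previous Week Position": [float("nan"), 97.0, float("nan"), 15.0],
})

# Boolean mask for the study period, then an explicit copy of the matching rows.
in_range = (weeks["WeekID"] > "1999-12-31") & (weeks["WeekID"] < "2021-01-01")
subset = weeks.loc[in_range].copy()

# Column arithmetic is vectorized; rows with no previous position become 0.
subset["Rank_Change"] = (
    subset["Week Position"] - subset["Previous Week Position"]
).fillna(0)

print(subset)
```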

In [10]:
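# new df with the max weekly rank change for each song in df_hotlist_2000s
# (Rank_Change is current minus previous position, so the max is the song's largest single-week fall)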
df_max_rank_change = df_hotlist_2000s.groupby('SongID', as_index=False)['Rank_Change'].max()
df_max_rank_change.rename(columns={'Rank_Change': 'Max_Rank_Change'}, inplace=True)
df_max_rank_change.set_index('SongID', inplace=True)
df_max_rank_change.head(3), df_max_rank_change.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8669 entries, #1Nelly to www.memoryAlan Jackson
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Max_Rank_Change  8669 non-null   float64
dtypes: float64(1)
memory usage: 135.5+ KB


(                                         Max_Rank_Change
 SongID                                                  
 #1Nelly                                             13.0
 #BeautifulMariah Carey Featuring Miguel             17.0
 #SELFIEThe Chainsmokers                             18.0,
 None)

In [11]:
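# new df with the max peak position for each song in df_hotlist_2000s
# note: chart positions are ranks, so lower is better; .max() keeps the numerically largest
# (weakest) 'Peak Position' value recorded for a song, while .min() would give its best chart position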
df_max_peak_pos = df_hotlist_2000s.groupby('SongID', as_index=False)['Peak Position'].max()
df_max_peak_pos.rename(columns={'Peak Position': 'Max_Peak_Position'}, inplace=True)
df_max_peak_pos.set_index('SongID', inplace=True)
df_max_peak_pos.head(3), df_max_peak_pos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8669 entries, #1Nelly to www.memoryAlan Jackson
Data columns (total 1 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Max_Peak_Position  8669 non-null   int64
dtypes: int64(1)
memory usage: 135.5+ KB


(                                         Max_Peak_Position
 SongID                                                    
 #1Nelly                                                 75
 #BeautifulMariah Carey Featuring Miguel                 24
 #SELFIEThe Chainsmokers                                 55,
 None)
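
Because chart positions are ranks, the choice between `max` and `min` changes what the aggregated column means. The toy example below assumes, as the chart rows shown earlier suggest, that `Peak Position` holds a song's best rank up to that week; the data and the output column names (`Largest_Peak_Value`, `Best_Peak`) are invented for illustration.

```python
import pandas as pd

# Toy chart history: 'Peak Position' is the running best rank up to that week.
history = pd.DataFrame({
    "SongID": ["A", "A", "A", "B", "B"],
    "Week Position": [75, 40, 52, 24, 30],
    "Peak Position": [75, 40, 40, 24, 24],
})

# Named aggregation: compute both summaries side by side for comparison.
summary = history.groupby("SongID")["Peak Position"].agg(
    Largest_Peak_Value="max",  # weakest running peak (often the debut week)
    Best_Peak="min",           # best rank the song ever reached
)

print(summary)
```

For song `A`, `max` returns 75 (its debut position) while `min` returns 40 (the best rank it reached), so the two aggregations answer different questions.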

In [12]:
# ensuring these new dfs have no null values
df_max_peak_pos['Max_Peak_Position'].isna().sum(), df_max_rank_change['Max_Rank_Change'].isna().sum()

(0, 0)

In [13]:
# extracting full list of songs in the time period being studied
songid_list = df_hotlist_2000s['SongID'].unique()

# creating a features df with only songs in df_hotlist_2000s
df_features_2000s = df_features_all[df_features_all['SongID'].isin(songid_list)]

In [14]:
# checking for duplicates
print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

8781
8664


In [15]:
# removing duplicates and rechecking
df_features_2000s = df_features_2000s.drop_duplicates(subset='SongID')

print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

8664
8664


In [16]:
# adding max peak position to main df
df_2000s_data = df_features_2000s.join(df_max_peak_pos, on='SongID')
# adding max rank change to main df
df_2000s_data = df_2000s_data.join(df_max_rank_change, on='SongID')

df_2000s_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8664 entries, 5 to 29500
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     8664 non-null   object 
 1   spotify_genre              8250 non-null   object 
 2   spotify_track_id           7882 non-null   object 
 3   spotify_track_duration_ms  7882 non-null   float64
 4   danceability               7849 non-null   float64
 5   energy                     7849 non-null   float64
 6   key                        7849 non-null   float64
 7   loudness                   7849 non-null   float64
 8   mode                       7849 non-null   float64
 9   speechiness                7849 non-null   float64
 10  acousticness               7849 non-null   float64
 11  instrumentalness           7849 non-null   float64
 12  liveness                   7849 non-null   float64
 13  valence                    7849 non-null   float64
 

In [17]:
# removing entries with missing values
df_cleaned = df_2000s_data[df_2000s_data.notna().all(axis=1)]
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7798 entries, 5 to 29499
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     7798 non-null   object 
 1   spotify_genre              7798 non-null   object 
 2   spotify_track_id           7798 non-null   object 
 3   spotify_track_duration_ms  7798 non-null   float64
 4   danceability               7798 non-null   float64
 5   energy                     7798 non-null   float64
 6   key                        7798 non-null   float64
 7   loudness                   7798 non-null   float64
 8   mode                       7798 non-null   float64
 9   speechiness                7798 non-null   float64
 10  acousticness               7798 non-null   float64
 11  instrumentalness           7798 non-null   float64
 12  liveness                   7798 non-null   float64
 13  valence                    7798 non-null   float64
 

In [18]:
# generating a df with unique genre names
unique_genres = list(set(
    genre 
    for genre_string in df_cleaned['spotify_genre'] 
    if pd.notna(genre_string)
    for genre in ast.literal_eval(genre_string)
))

df_unique_genres = pd.DataFrame(unique_genres, columns=['genre'])

In [19]:
# adding counts of each unique genre name
# Extract all genres (with duplicates) and count them
all_genres_list = []
for genre_string in df_cleaned['spotify_genre']:
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        all_genres_list.extend(genre_list)

# Count occurrences
genre_counts = Counter(all_genres_list)

# Map counts to genres dataframe
df_unique_genres['count'] = df_unique_genres['genre'].map(genre_counts)
df_unique_genres = df_unique_genres.sort_values('count', ascending=False)

In [None]:
# writing to csv for easier review of the data
df_unique_genres.to_csv('genre_counts.csv', index=False)

In [20]:
# loading list of genres with 50 or more instances in df_cleaned
df_genres_50_up = pd.read_csv('genre_counts_50+inst.csv')
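
The file `genre_counts_50+inst.csv` is loaded from disk, and how it was produced is not shown here. As an alternative, the same kind of filter could be applied directly in pandas. The sketch below uses invented counts and assumes a threshold of 50 on the `count` column, as the comment above describes.

```python
from collections import Counter

import pandas as pd

# Invented genre counts standing in for the real genre_counts Counter.
genre_counts = Counter({"pop": 120, "rap": 75, "bow pop": 3, "trap": 50})

df_unique_genres = pd.DataFrame(
    {"genre": list(genre_counts), "count": list(genre_counts.values())}
).sort_values("count", ascending=False)

# Keep only genres that appear at least `min_instances` times.
min_instances = 50
df_genres_50_up = df_unique_genres[
    df_unique_genres["count"] >= min_instances
].reset_index(drop=True)

final_genres_list = df_genres_50_up["genre"].tolist()
print(final_genres_list)
```

Filtering in code keeps the threshold visible in the notebook and avoids a manual editing step between the export and the reload.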

In [21]:
# converting df to list
final_genres_list = df_genres_50_up['genre'].tolist()

# manually one-hot encoding each genre

# creating each new genre column and initializing to 0
for genre in final_genres_list:
    df_cleaned[genre] = 0

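# iterating through rows to set values to 1 when a genre appears in the original spotify_genre column
# caution: enumerate() yields 0-based positions while .at[] addresses index labels, and df_cleaned
# keeps its original labels (5, 13, 16, ...), so the two only line up where a label equals a position;
# .at[] also creates a row or column for any label or genre that does not exist yet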
for idx, genre_string in enumerate(df_cleaned['spotify_genre']):
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        for genre in genre_list:
            df_cleaned.at[idx, genre] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0

In [22]:
pd.set_option('display.max_columns', None)
df_cleaned.head(3)

Unnamed: 0,SongID,spotify_genre,spotify_track_id,spotify_track_duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,Max_Peak_Position,Max_Rank_Change,pop,rap,pop rap,dance pop,post-teen pop,hip hop,trap,contemporary country,country road,country,southern hip hop,modern country rock,atl hip hop,r&b,canadian pop,melodic rap,urban contemporary,pop rock,hollywood,glee club,neo mellow,canadian hip hop,toronto rap,edm,gangster rap,country pop,tropical house,hip pop,modern rock,miami hip hop,latin,electropop,chicago rap,conscious hip hop,viral pop,country dawn,uk pop,dirty south rap,philly rap,detroit hip hop,alternative r&b,post-grunge,oklahoma country,talent show,canadian contemporary r&b,neo soul,boy band,reggaeton,electro house,rock,atl trap,nc hip hop,new orleans rap,emo rap,australian country,alternative metal,canadian metal,canadian rock,nu metal,alternative rock,permanent wave,dance rock,new romantic,new wave,new wave pop,soft rock,synthpop,candy pop,europop,adult standards,brill building pop,easy listening,vocal jazz,dancehall,glam metal,plugg,underground hip hop,deep pop r&b,pop punk,pittsburgh rap,acoustic pop,piano rock,art pop,canadian indie,chamber pop,indie pop,indie rock,slow core,stomp and holler,cali rap,slow game,alternative dance,dance-punk,indietronica,new rave,indie pop rap,comic,texas pop punk,bachata,latin pop,tropical,crunk,metropopolis,baton rouge rap,brooklyn drill,nyc rap,australian pop,punk,east coast hip hop,queens hip hop,west coast trap,g funk,complextro,german techno,new jack swing,escape room,indie r&b,indie soul,dmv rap,memphis hip hop,new jersey rap,british soul,danish pop,scandipop,texas country,idol,rap kreyol,dfw rap,deep southern trap,deep talent show,country rock,redneck,american folk revival,cantautor,latin arena pop,mexican pop,rock en espanol,spanish pop,country gospel,alberta country,canadian contemporary country,canadian country,emo,funk,soul,classic 
soul,disco,motown,post-disco,quiet storm,lilith,folk-pop,ny roots,ethiopian pop,trap queen,canadian pop punk,canadian punk,pop reggaeton,downtempo,electronic trap,shiver pop,latin hip hop,reggaeton flow,rap metal,socal pop punk,alaska indie,singer-songwriter,sertanejo,sertanejo pop,sertanejo universitario,pixie,pop emo,swedish electropop,swedish pop,garage rock,punk blues,big room,brostep,catstep,electra,funk metal,rap rock,australian dance,christian alternative rock,christian rock,canadian latin,bronx hip hop,hardcore hip hop,lounge,girl group,wrestling,west coast rap,show tunes,etherpop,indie poptimism,belgian dance,belgian pop,eurodance,progressive electro house,modern uplift,australian hip hop,kentucky hip hop,folk,mellow gold,heartland rock,art rock,experimental,experimental rock,melancholia,post-punk,psychedelic rock,jam band,barbadian pop,puerto rican pop,trap latino,deep contemporary country,lds youth,reggae fusion,progressive house,ohio hip hop,arkansas country,blues rock,modern blues rock,small room,bubblegum dance,deep big room,dutch house,smooth jazz,smooth saxophone,christian music,lgbtq+ hip hop,reggaeton colombiano,rap latina,houston rap,modern folk rock,uk americana,alternative hip hop,chicano rap,cartoon,children's music,old school hip hop,bounce,electro,disco house,canadian ccm,christian punk,indiecoustica,ectofolk,irish rock,anthem worship,ccm,christian pop,worship,bassline,social media pop,norwegian hip hop,outlaw country,hawaiian hip hop,vapor trap,bhangra,desi hip hop,desi pop,scottish singer-songwriter,grunge,hard rock,k-pop,k-pop boy group,electropowerpop,neon pop punk,trancecore,album rock,classic rock,glam rock,protopunk,north carolina hip hop,house,uk dance,nu-metalcore,trap soul,italian pop,italo dance,rock-and-roll,rockabilly,groove metal,rap conscient,drill,baroque pop,uk contemporary r&b,celtic rock,harlem hip hop,electronica,nu jazz,trip hop,bow pop,country rap,san diego rap,canadian trap,south african rock,christian 
indie,moombahton,neo-singer-songwriter,neo-synthpop,neo-traditional country,funk rock,aussietronica,disney,florida rap,colombian pop,a cappella,latin viral pop,antiviral pop,comedy rock,parody,viral rap,alternative pop rock,la indie,movie tunes,indie electropop,la pop,pop edm,portland hip hop,viral trap,bubble trance,hopebeat,gospel r&b,k-hop,teen pop,modern alternative rock,nu gaze,ghanaian hip hop,new americana,southern soul,pop soul,swedish synthpop,classic country pop,nashville sound,sleaze rock,kids dance party,metal,old school thrash,speed metal,thrash metal,k-rap,folk rock,meme rap,lo-fi,washington indie,brooklyn indie,shimmer pop,big beat,skate punk,nyc pop,sheffield indie,scottish rock,uk alternative pop,vocal house,contemporary vocal jazz,norwegian pop,alabama metal,yacht rock,soca,experimental pop,icelandic experimental,icelandic pop,latin rock,mexican rock,melbourne bounce international,cyberpunk,electronic rock,industrial,industrial metal,industrial rock,alternative country,indie folk,roots rock,canadian singer-songwriter,comedy,screamo,j-pop,japanese singer-songwriter,post-metal,progressive metal,progressive rock,detroit trap,battle rap,balkan brass,transpop,roots americana,hyphy,australian indie,filter house,neo r&b,lovers rock,old school dancehall,riddim,anti-folk,nz pop,world worship,cedm,minnesota hip hop,philly soul,canadian folk,bass trap,vapor twitch,san marcos tx indie,swedish garage rock,swedish hard rock,swedish indie rock,alt z,bedroom pop,chicago hardcore,chicago punk,hardcore punk,southern rock,cowboy western,traditional country,yodeling,chicago indie,alabama indie,queer country,mexican hip hop,deep vocal house,k-pop girl group,cello,power metal,chicago drill,birmingham metal,operatic pop,uk funky,modern salsa,salsa,gospel,grunge pop,french pop,minneapolis sound,bubblegum pop,classic uk pop,boston hip hop,modern southern rock,turntablism,trance,swedish alternative rock,trap boricua,modern blues,champeta,vallenato,deep 
norteno,duranguense,musica potosina,norteno,norteno-sax,cumbia,liquid funk,chicago house,bluegrass gospel,wu fam,chinese hip hop,chinese idol pop,palm desert scene,stoner metal,stoner rock,virgin islands reggae,destroy techno,tennessee hip hop,french shoegaze,miami bass,romanian pop,electroclash,easycore,milwaukee hip hop,bluegrass,traditional folk,pop r&b,deep progressive trance,australian electropop,israeli pop,rebel blues,banda,regional mexican pop,oxford indie,celtic,middle earth,panamanian pop,albuquerque indie,portland indie,australian rock,lo star,drum and bass,azonto,deep latin christian,grupera,canadian electronic,quebec indie,stomp pop,pop dance,slap house,broadway,alabama rap,finnish edm,uk garage,indie anthem-folk,charlottesville indie,progressive trance,uplifting trance,classic girl group,irish singer-songwriter,power pop,strut,ninja,indy indie,german pop,deep euro house,deep house,german dance,bedroom soul,chutney,scorecore,soundtrack,comic metal,britpop,madchester,el paso indie,grime,deep underground hip hop,vapor pop,pinoy hip hop,dutch hip hop,icelandic indie,icelandic rock,australian house,bass house,jazz rap,seattle indie,reggae,bmore,arkansas hip hop,brazilian death metal,brazilian metal,brazilian thrash metal,crossover thrash,new wave of thrash metal,fake,ann arbor indie,bahamian pop,alternative pop
5,...Ready For It?Taylor Swift,"['pop', 'post-teen pop']",2yLa0QULdQr0qAIvVwN6B5,208186.0,0.613,0.764,2.0,-6.509,1.0,0.136,0.0527,0.0,0.197,0.417,160.015,4.0,4.0,22.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
13,'Til Summer Comes AroundKeith Urban,"['australian country', 'contemporary country',...",1CKmI1IQjVEVB3F7VmJmM3,331466.0,0.57,0.629,9.0,-7.608,0.0,0.0331,0.593,0.000136,0.77,0.308,127.907,4.0,92.0,14.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,'Tis The Damn SeasonTaylor Swift,"['pop', 'post-teen pop']",7dW84mWkdWE5a6lFWxJCBG,229840.0,0.575,0.434,5.0,-8.193,1.0,0.0312,0.735,6.6e-05,0.105,0.348,145.916,4.0,39.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
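
The blank cells in the wide output above, and the columns for genres outside the 50+ list, are consistent with the encoding loop writing by position into a DataFrame that is indexed by label. A minimal sketch of a label-aligned alternative follows; the toy data, the `kept_genres` list, and the `parsed` helper Series are invented for illustration and this is not the code that produced the output above.

```python
import ast

import pandas as pd

# Toy frame with a non-contiguous index, like df_cleaned after filtering.
songs = pd.DataFrame(
    {"spotify_genre": ["['pop', 'post-teen pop']", "['country', 'bow pop']", "[]"]},
    index=[5, 13, 16],
)
kept_genres = ["pop", "country"]  # stand-in for final_genres_list

# Parse each genre string into a Python list once; the Series keeps the original index.
parsed = songs["spotify_genre"].apply(ast.literal_eval)

# One 0/1 column per kept genre. Assignment aligns on index labels,
# so no rows are added and genres outside kept_genres get no column.
for genre in kept_genres:
    songs[genre] = parsed.apply(lambda genres, g=genre: int(g in genres))

print(songs)
```

Because the new columns are built from a Series that shares the frame's index, the row count stays the same and only the listed genres become columns.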


In [23]:
# my code added columns for all genres in spotify_genre, removing unwanted columns
#last_col_to_keep = 'emo rap'
#df_cleaned = df_cleaned.loc[:, :last_col_to_keep]
#df_cleaned.head(3)

# setting up an analysis without genre
last_col_to_keep = 'Max_Rank_Change'
df_cleaned = df_cleaned.loc[:, :last_col_to_keep]
df_cleaned.head(3)



| | SongID | spotify_genre | spotify_track_id | spotify_track_duration_ms | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | Max_Peak_Position | Max_Rank_Change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | ...Ready For It?Taylor Swift | ['pop', 'post-teen pop'] | 2yLa0QULdQr0qAIvVwN6B5 | 208186.0 | 0.613 | 0.764 | 2.0 | -6.509 | 1.0 | 0.136 | 0.0527 | 0.0 | 0.197 | 0.417 | 160.015 | 4.0 | 4.0 | 22.0 |
| 13 | 'Til Summer Comes AroundKeith Urban | ['australian country', 'contemporary country',... | 1CKmI1IQjVEVB3F7VmJmM3 | 331466.0 | 0.57 | 0.629 | 9.0 | -7.608 | 0.0 | 0.0331 | 0.593 | 0.000136 | 0.77 | 0.308 | 127.907 | 4.0 | 92.0 | 14.0 |
| 16 | 'Tis The Damn SeasonTaylor Swift | ['pop', 'post-teen pop'] | 7dW84mWkdWE5a6lFWxJCBG | 229840.0 | 0.575 | 0.434 | 5.0 | -8.193 | 1.0 | 0.0312 | 0.735 | 6.6e-05 | 0.105 | 0.348 | 145.916 | 4.0 | 39.0 | 0.0 |


In [24]:
# removing fields used for prep/cleaning but not needed for analysis
df_cleaned = df_cleaned.drop(['SongID', 'spotify_genre', 'spotify_track_id'], axis=1)
df_cleaned.head(3), df_cleaned.tail(3)

(    spotify_track_duration_ms  danceability  energy  key  loudness  mode  \
 5                    208186.0         0.613   0.764  2.0    -6.509   1.0   
 13                   331466.0         0.570   0.629  9.0    -7.608   0.0   
 16                   229840.0         0.575   0.434  5.0    -8.193   1.0   
 
     speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
 5        0.1360        0.0527          0.000000     0.197    0.417  160.015   
 13       0.0331        0.5930          0.000136     0.770    0.308  127.907   
 16       0.0312        0.7350          0.000066     0.105    0.348  145.916   
 
     time_signature  Max_Peak_Position  Max_Rank_Change  
 5              4.0                4.0             22.0  
 13             4.0               92.0             14.0  
 16             4.0               39.0              0.0  ,
       spotify_track_duration_ms  danceability  energy  key  loudness  mode  \
 7792                        NaN           NaN     NaN  

In [25]:
# removing rows with missing values (such as the NaN rows at the tail of the frame)
df_cleaned = df_cleaned.dropna()
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7798 entries, 5 to 29499
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   spotify_track_duration_ms  7798 non-null   float64
 1   danceability               7798 non-null   float64
 2   energy                     7798 non-null   float64
 3   key                        7798 non-null   float64
 4   loudness                   7798 non-null   float64
 5   mode                       7798 non-null   float64
 6   speechiness                7798 non-null   float64
 7   acousticness               7798 non-null   float64
 8   instrumentalness           7798 non-null   float64
 9   liveness                   7798 non-null   float64
 10  valence                    7798 non-null   float64
 11  tempo                      7798 non-null   float64
 12  time_signature             7798 non-null   float64
 13  Max_Peak_Position          7798 non-null   float64
 

In [26]:
# Prepare features and target for XGBoost and k-NN max_peak_position analysis
X1 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y1 = df_cleaned['Max_Peak_Position']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.25, random_state=42)

# Standardize the features (fit the scaler on the training set only, then apply it to the test set)
scaler = StandardScaler()
X1_train_scaled = scaler.fit_transform(X1_train)
X1_test_scaled = scaler.transform(X1_test)

In [27]:
# Prepare features and target for XGBoost and k-NN max_rank_change analysis
X2 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y2 = df_cleaned['Max_Rank_Change']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.25, random_state=42)

# Standardize the features (fit the scaler on the training set only, then apply it to the test set)
scaler = StandardScaler()
X2_train_scaled = scaler.fit_transform(X2_train)
X2_test_scaled = scaler.transform(X2_test)

In [28]:
# Prepare features and target for simple deep learning max_peak_position analysis
X3 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y3 = df_cleaned['Max_Peak_Position']

# Splitting the data into training and testing sets 
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42)

# splitting training data into training and validation
X3_train_final, X3_val, y3_train_final, y3_val = train_test_split(X3_train, y3_train, test_size=0.2, random_state=42)

print("Training data shape:", X3_train_final.shape)
print("Validation data shape:", X3_val.shape)
print("Test data shape:", X3_test.shape)

Training data shape: (4990, 13)
Validation data shape: (1248, 13)
Test data shape: (1560, 13)
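
The shapes printed above follow from applying two 80/20 splits in sequence to the 7,798 cleaned rows. The sketch below reproduces that arithmetic, assuming scikit-learn's behaviour of rounding the held-out share up; `split_sizes` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, val, test) row counts for a two-stage split."""
    n_test = math.ceil(n_rows * test_size)       # rows held out for testing
    n_remaining = n_rows - n_test                # rows left for train + validation
    n_val = math.ceil(n_remaining * val_size)    # rows held out for validation
    n_train = n_remaining - n_val                # rows actually used for fitting
    return n_train, n_val, n_test

print(split_sizes(7798))
```

Only about 64% of the rows end up in the final training set, which is worth keeping in mind when comparing the neural networks with models trained on a single 75/25 split.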


In [29]:
# Prepare features and target for simple deep learning max_rank_change analysis
X4 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y4 = df_cleaned['Max_Rank_Change']

# Splitting the data into training and testing sets 
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=42)

# splitting training data into training and validation
X4_train_final, X4_val, y4_train_final, y4_val = train_test_split(X4_train, y4_train, test_size=0.2, random_state=42)

print("Training data shape:", X4_train_final.shape)
print("Validation data shape:", X4_val.shape)
print("Test data shape:", X4_test.shape)

Training data shape: (4990, 13)
Validation data shape: (1248, 13)
Test data shape: (1560, 13)


In [30]:
# normalizing data for simple deep learning max_peak_position

scaler.fit(X3_train_final)

X3_train_scaled = scaler.transform(X3_train_final)
X3_val_scaled = scaler.transform(X3_val)
X3_test_scaled = scaler.transform(X3_test)

In [31]:
# normalizing data for simple deep learning max_rank_change

scaler.fit(X4_train_final)

X4_train_scaled = scaler.transform(X4_train_final)
X4_val_scaled = scaler.transform(X4_val)
X4_test_scaled = scaler.transform(X4_test)

## Analysis

Text here

In [32]:
# XGBoost for max_peak_position

xgb_model1 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model1.fit(X1_train, y1_train)
y1_pred = xgb_model1.predict(X1_test)
y1_pred = np.clip(np.round(y1_pred), 1, 100)

rmse1 = np.sqrt(mean_squared_error(y1_test, y1_pred))
r2_1 = r2_score(y1_test, y1_pred)

print(f'RMSE: {rmse1:.3f}')
print(f'R²: {r2_1:.3f}')

RMSE: 26.079
R²: -0.174
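
A negative R², such as the -0.174 above, means the model's test error is larger than the error from always predicting the mean of the target. A quick way to make that reference point explicit is a mean-only baseline. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data; the `X_demo` and `y_demo` arrays are stand-ins invented for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 13 numeric features and a target between 1 and 100.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 13))
y_demo = rng.integers(1, 101, size=400).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The baseline ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline_pred))
baseline_r2 = r2_score(y_te, baseline_pred)

print(f"Baseline RMSE: {baseline_rmse:.3f}")
print(f"Baseline R²: {baseline_r2:.3f}")
```

Any model whose test RMSE is not clearly below this baseline is extracting little usable signal from the features.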


In [33]:
# hyperparameter tuning 1 for max_peak_position

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb1 = GridSearchCV(estimator=xgb_model1,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1.fit(X1_train, y1_train)

print("Best parameters:", grid_search_xgb1.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 6, 'subsample': 0.8}


In [34]:
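# hyperparameter tuning 2 for max_peak_position
# NOTE: colsample_bytree must lie in (0, 1]; the 1.1 and 1.2 candidates are invalid
# and account for the 90 failed fits reported in the output below.

param_grid2 = {
    'max_depth': [6],
    'learning_rate': [0.005, 0.01, 0.015],
    'subsample': [0.75, 0.8, 0.85],
    'colsample_bytree': [1.0, 1.1, 1.2],
}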

grid_search_xgb1 = GridSearchCV(estimator=xgb_model1,
                            param_grid=param_grid2,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1.fit(X1_train, y1_train)

print("Best parameters:", grid_search_xgb1.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.015, 'max_depth': 6, 'subsample': 0.75}


90 fits failed out of a total of 135.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\sklearn.py", line 1170, in fit
    self._Booster = train(
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs
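
The failed fits above can be anticipated before running the search. XGBoost expects `subsample` and `colsample_bytree` to lie in the interval (0, 1], so any candidate outside that range fails in every fold. The sketch below counts the valid candidates with scikit-learn's `ParameterGrid`; `is_valid` and `FRACTION_PARAMS` are hypothetical names introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

# Grid copied from the second tuning round for max_peak_position.
param_grid2 = {
    "max_depth": [6],
    "learning_rate": [0.005, 0.01, 0.015],
    "subsample": [0.75, 0.8, 0.85],
    "colsample_bytree": [1.0, 1.1, 1.2],
}

# Parameters that XGBoost expects to lie in the interval (0, 1].
FRACTION_PARAMS = ("subsample", "colsample_bytree")

def is_valid(params):
    """Return True if every fractional parameter lies in (0, 1]."""
    return all(0 < params[name] <= 1 for name in FRACTION_PARAMS if name in params)

candidates = list(ParameterGrid(param_grid2))
valid = [p for p in candidates if is_valid(p)]
n_folds = 5
expected_failures = (len(candidates) - len(valid)) * n_folds

print(f"{len(valid)} of {len(candidates)} candidates are valid")
print(f"Expected failed fits: {expected_failures}")
```

Only 9 of the 27 candidates are valid, so 18 candidates across 5 folds give the 90 failures in the log. The best parameters are therefore chosen from the 9 combinations with `colsample_bytree` equal to 1.0.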

In [35]:
# hyperparameter tuning 3 for max_peak_position

param_grid3 = {
    'max_depth': [6],
    'learning_rate': [0.015],
    'subsample': [0.73, 0.74, 0.75],
    'colsample_bytree': [1.0],
}

grid_search_xgb1 = GridSearchCV(estimator=xgb_model1,
                            param_grid=param_grid3,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb1.fit(X1_train, y1_train)

print("Best parameters:", grid_search_xgb1.best_params_)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.015, 'max_depth': 6, 'subsample': 0.73}


In [36]:
# Extract best/final model for max_peak_position
best_xgb1 = grid_search_xgb1.best_estimator_

# predictions
y1_pred_best = best_xgb1.predict(X1_test)
y1_pred_best = np.clip(np.round(y1_pred_best), 1, 100)

# Evaluate XGBoost model
rmse1_best = np.sqrt(mean_squared_error(y1_test, y1_pred_best))
r2_1_best = r2_score(y1_test, y1_pred_best)

print(f'RMSE: {rmse1_best:.3f}')
print(f'R²: {r2_1_best:.3f}')

RMSE: 23.937
R²: 0.011


In [37]:
# XGBoost for max_rank_change

xgb_model2 = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

xgb_model2.fit(X2_train, y2_train)
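y2_pred = xgb_model2.predict(X2_test)
# NOTE: the 1-100 bounds are carried over from the chart-position target;
# Max_Rank_Change can be 0 (see the sample rows), so a lower bound of 1 shifts those predictions.
y2_pred = np.clip(np.round(y2_pred), 1, 100)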

rmse2 = np.sqrt(mean_squared_error(y2_test, y2_pred))
r2_2 = r2_score(y2_test, y2_pred)

print(f'RMSE: {rmse2:.3f}')
print(f'R²: {r2_2:.3f}')

RMSE: 12.816
R²: -0.191


In [38]:
# hyperparameter tuning 1 for max_rank_change

param_grid1 = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search_xgb2_1 = GridSearchCV(estimator=xgb_model2,
                            param_grid=param_grid1,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_1.fit(X2_train, y2_train)

print("Best parameters:", grid_search_xgb2_1.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'subsample': 1.0}


In [39]:
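# hyperparameter tuning 2 for max_rank_change
# NOTE: subsample and colsample_bytree must lie in (0, 1]; candidates above 1.0 are invalid
# and account for the 315 failed fits reported in the output below.

param_grid4 = {
    'max_depth': [2, 3, 4],
    'learning_rate': [0.005, 0.01, 0.015],
    'subsample': [0.9, 1.0, 1.1],
    'colsample_bytree': [1.0, 1.1, 1.2],
}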

grid_search_xgb2_2 = GridSearchCV(estimator=xgb_model2,
                            param_grid=param_grid4,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_2.fit(X2_train, y2_train)

print("Best parameters:", grid_search_xgb2_2.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 2, 'subsample': 0.9}


315 fits failed out of a total of 405.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwargs)
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\sklearn.py", line 1170, in fit
    self._Booster = train(
  File "c:\Users\marha\.conda\envs\ai-environment\lib\site-packages\xgboost\core.py", line 726, in inner_f
    return func(**kwarg

In [40]:
# hyperparameter tuning 3 for max_rank_change

param_grid5 = {
    'max_depth': [1, 2],
    'learning_rate': [0.01],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [1.0],
}

grid_search_xgb2_3 = GridSearchCV(estimator=xgb_model2,
                            param_grid=param_grid5,
                            cv=5,
                            n_jobs=-1,
                            verbose=1)
grid_search_xgb2_3.fit(X2_train, y2_train)

print("Best parameters:", grid_search_xgb2_3.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 2, 'subsample': 0.9}


In [41]:
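# Extract best/final model for max_rank_change
best_xgb2 = grid_search_xgb2_3.best_estimator_

y2_pred_best = best_xgb2.predict(X2_test)
# NOTE: the 1-100 bounds are carried over from the chart-position target;
# Max_Rank_Change can be 0 (see the sample rows), so a lower bound of 1 shifts those predictions.
y2_pred_best = np.clip(np.round(y2_pred_best), 1, 100)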

# Evaluate XGBoost model
rmse2_best = np.sqrt(mean_squared_error(y2_test, y2_pred_best))
r2_2_best = r2_score(y2_test, y2_pred_best)

print(f'RMSE: {rmse2_best:.3f}')
print(f'R²: {r2_2_best:.3f}')

RMSE: 11.724
R²: 0.003
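
Rounding and clipping keeps predictions inside the valid range of the target, but the right bounds depend on which target is being predicted. One option, sketched below as an assumption rather than the notebook's method, is to take the bounds from the training target instead of hard-coding them. `clip_to_observed_range` is a hypothetical helper and the arrays are toy values.

```python
import numpy as np

def clip_to_observed_range(predictions, y_train):
    """Round predictions and clip them to the range seen in the training target."""
    lower, upper = np.min(y_train), np.max(y_train)
    return np.clip(np.round(predictions), lower, upper)

# Toy training target that includes 0, as Max_Rank_Change does in the sample rows.
y_train_demo = np.array([0.0, 14.0, 22.0, 60.0])
raw_pred_demo = np.array([-3.2, 0.4, 21.6, 75.0])

clipped = clip_to_observed_range(raw_pred_demo, y_train_demo)
print(clipped)
```

With this approach, the same post-processing call works for both `Max_Peak_Position` and `Max_Rank_Change` without assuming a 1 to 100 range for either.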


In [43]:
# k-NN model for max_peak_position

# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
    # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X1_train_scaled, y1_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

{'euclidean': 0.011286584340476555, 'manhattan': 0.013338451302523157, 'chebyshev': 0.01196961388578155}

Best metric: manhattan with accuracy: 0.0133


In [44]:
# k-NN model for max_rank_change

# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results2 = {}

# Loop through list of metrics
for metric in metrics:
    # Create and evaluate model with different metrics and k=5
    knn2 = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores2 = cross_val_score(knn2, X2_train_scaled, y2_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results2[metric] = cv_scores2.mean()

best_metric2 = max(results2, key=results2.get)

print(results2)
print(f"\nBest metric: {best_metric2} with accuracy: {results2[best_metric2]:.4f}")



{'euclidean': 0.1793772162634438, 'manhattan': 0.18125814305455024, 'chebyshev': 0.18143039927471066}

Best metric: chebyshev with accuracy: 0.1814
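
The accuracy scores above come from `KNeighborsClassifier`, which treats every distinct target value as its own class, so a prediction that misses by one chart position counts the same as one that misses by fifty. One alternative, sketched here under assumptions, swaps in `KNeighborsRegressor` scored by RMSE and wraps the scaler in a pipeline so it is refit inside each cross-validation fold. The data are synthetic stand-ins, not the notebook's features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 audio features and a numeric chart target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 13))
y_demo = rng.integers(1, 101, size=300).astype(float)

metrics = ["euclidean", "manhattan", "chebyshev"]
rmse_by_metric = {}

for metric in metrics:
    # Scaling inside the pipeline is refit on the training part of each fold.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsRegressor(n_neighbors=5, metric=metric),
    )
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    # The scorer returns negated RMSE, so flip the sign back.
    rmse_by_metric[metric] = -scores.mean()

# Lower RMSE is better, so pick the minimum rather than the maximum.
best_metric = min(rmse_by_metric, key=rmse_by_metric.get)
print(rmse_by_metric)
print(f"Best metric: {best_metric} with RMSE: {rmse_by_metric[best_metric]:.3f}")
```

Scoring by RMSE also puts the k-NN results on the same scale as the XGBoost results, which makes the models directly comparable.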


In [45]:
# simple deep learning model for max_peak_position, no regularization or dropout

baseline_model3 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X3_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline3 = baseline_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [46]:
# simple deep learning model for max_peak_position with batch normalization

bnorm_model3 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X3_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),

    layers.Dense(1)  # Single output for regression
])

bnorm_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model3 = bnorm_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [49]:
# simple deep learning model for max_peak_position with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model3 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X3_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model3 = reg_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [50]:
# model evaluation

print("=== Baseline Model ===")
train_scores3 = baseline_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores3   = baseline_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores3[1]:.4f}, Train MSE: {train_scores3[2]:.4f}")
print(f"Val   MAE: {val_scores3[1]:.4f}, Val   MSE: {val_scores3[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn3 = bnorm_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores_bn3   = bnorm_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores_bn3[1]:.4f}, Train MSE: {train_scores_bn3[2]:.4f}")
print(f"Val   MAE: {val_scores_bn3[1]:.4f}, Val   MSE: {val_scores_bn3[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg3 = reg_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores_reg3   = reg_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores_reg3[1]:.4f}, Train MSE: {train_scores_reg3[2]:.4f}")
print(f"Val   MAE: {val_scores_reg3[1]:.4f}, Val   MSE: {val_scores_reg3[2]:.4f}")


=== Baseline Model ===
Train MAE: 17.9517, Train MSE: 486.7548
Val   MAE: 21.8057, Val   MSE: 751.6981

=== BatchNorm Model ===
Train MAE: 14.3549, Train MSE: 350.5200
Val   MAE: 21.2928, Val   MSE: 782.5354

=== Regularized Model (L2 + Dropout) ===
Train MAE: 19.6790, Train MSE: 576.6615
Val   MAE: 20.3928, Val   MSE: 635.0504
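
The XGBoost cells report RMSE while the Keras evaluation reports MSE, so the two are not on the same scale. Taking the square root of the validation MSE values above makes them comparable. Note that the XGBoost figure of 23.937 was measured on a 25% test split and these are validation figures, so the comparison is indicative only.

```python
import math

# Validation MSE values copied from the evaluation output above.
val_mse = {
    "baseline": 751.6981,
    "batchnorm": 782.5354,
    "regularized": 635.0504,
}

# RMSE is the square root of MSE and is expressed in chart positions.
val_rmse = {name: math.sqrt(mse) for name, mse in val_mse.items()}

for name, rmse in val_rmse.items():
    print(f"{name}: validation RMSE = {rmse:.3f}")
```

The regularized network has the lowest validation RMSE of the three, at roughly 25.2 chart positions.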


In [51]:
# simple deep learning model for max_rank_change, no regularization or dropout

baseline_model4 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X4_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline4 = baseline_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [52]:
# simple deep learning model for max_rank_change with batch normalization

bnorm_model4 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X4_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(1)  # Single output for regression
])

bnorm_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model4 = bnorm_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [62]:
# simple deep learning model for max_rank_change with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.4

reg_model4 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X4_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),


    layers.Dense(1)  # Single output for regression
])

reg_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model4 = reg_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)



In [63]:
# model evaluation

print("=== Baseline Model ===")
train_scores4 = baseline_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores4   = baseline_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores4[1]:.4f}, Train MSE: {train_scores4[2]:.4f}")
print(f"Val   MAE: {val_scores4[1]:.4f}, Val   MSE: {val_scores4[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn4 = bnorm_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores_bn4   = bnorm_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores_bn4[1]:.4f}, Train MSE: {train_scores_bn4[2]:.4f}")
print(f"Val   MAE: {val_scores_bn4[1]:.4f}, Val   MSE: {val_scores_bn4[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg4 = reg_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores_reg4   = reg_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores_reg4[1]:.4f}, Train MSE: {train_scores_reg4[2]:.4f}")
print(f"Val   MAE: {val_scores_reg4[1]:.4f}, Val   MSE: {val_scores_reg4[2]:.4f}")


=== Baseline Model ===
Train MAE: 7.1557, Train MSE: 95.5070
Val   MAE: 9.6317, Val   MSE: 176.7695

=== BatchNorm Model ===
Train MAE: 7.0440, Train MSE: 89.4687
Val   MAE: 9.6184, Val   MSE: 174.5367

=== Regularized Model (L2 + Dropout) ===
Train MAE: 7.8876, Train MSE: 119.7593
Val   MAE: 8.8956, Val   MSE: 158.6776
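
For both targets, the baseline and batch-normalization networks show training error well below validation error, which suggests that a fixed 100 epochs lets them overfit. Early stopping is one common remedy. The framework-free sketch below shows the patience logic on a list of validation losses, such as `history_baseline4.history['val_loss']`. `best_epoch` is a hypothetical helper and the loss values are made up. Keras provides an `EarlyStopping` callback for the same purpose.

```python
def best_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-indexed, under patience-based early stopping."""
    best_loss = float("inf")
    best_idx = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_loss, best_idx = loss, idx
        elif idx - best_idx >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return idx + 1, best_idx + 1
    # Training ran to the end without triggering the patience rule.
    return len(val_losses), best_idx + 1

# Made-up validation losses that improve and then drift upward.
demo_losses = [200.0, 180.0, 170.0, 165.0, 168.0, 171.0, 175.0, 180.0]
stop, best = best_epoch(demo_losses, patience=3)
print(f"Stop after epoch {stop}; best epoch was {best}")
```

Applied to the recorded training histories, the same logic would show how many of the 100 epochs each model actually needed.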


## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here