![Add a relevant banner image here](path_to_image)

# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

The customer of this project is FutureProduct Advisors, a consultancy that helps their customers develop innovative and new consumer products. FutureProduct’s customers are increasingly seeking help from their consultants in go-to-market activities. 

FutureProduct’s consultants can support these go-to-market activities, but the business does not have all the infrastructure needed to support it. Their biggest ask is for a tool to help them find interesting, up-and-coming music to accompany social posts and online ads for go-to-market promotions. 

**Stakeholders**

- FutureProduct Managing Director: oversees their consulting practice and is sponsoring this project.
- FutureProduct Senior Consultants: the actual users of the prospective tool. A small subset of the consultants will pilot the prototype tool.
- My consulting leadership: sponsors of this effort; will provide oversight and technical input of the project as needed.

**Primary Goals**

1.	Build a data tool that can evaluate any song in the Billboard Hot 100 list and make predictions about:
    -	The song’s position on the Hot 100 list 4 weeks in the future
    -	The song’s highest position on the list in the next 6 months
2.	Create a rubric that lists the 3 most important factors for songs’ placement on the Hot 100 list for each hear from 2000 to 2021.


## Data Understanding

Billboard Hot 100 weekly charts (Kaggle): https://www.kaggle.com/datasets/thedevastator/billboard-hot-100-audio-features

I’ve chosen this dataset because it has a direct measurement of song popularity (the Hot 100 list) and because its long history gives significant context to a song’s positioning in a given week.
The features list gives a wide range of song attributes to explore and enables me to determine what features most significantly contribute to a song’s popularity and how that changes over time.


In [46]:
import pandas as pd
import numpy as np
import ast
from collections import Counter

from pyspark import SparkContext
from pyspark.sql import SparkSession

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error, r2_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import math
import kagglehub
from kagglehub import KaggleDatasetAdapter

np.random.seed(42)



In [2]:
df_hotlist_all = pd.read_csv('Data/Hot Stuff.csv')
df_features_all = pd.read_csv('Data/Hot 100 Audio Features.csv')

In [3]:
# exploring hotlist data
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   index                   327895 non-null  int64  
 1   url                     327895 non-null  object 
 2   WeekID                  327895 non-null  object 
 3   Week Position           327895 non-null  int64  
 4   Song                    327895 non-null  object 
 5   Performer               327895 non-null  object 
 6   SongID                  327895 non-null  object 
 7   Instance                327895 non-null  int64  
 8   Previous Week Position  295941 non-null  float64
 9   Peak Position           327895 non-null  int64  
 10  Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 27.5+ MB


In [4]:
# exploring features df
df_features_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      29503 non-null  int64  
 1   SongID                     29503 non-null  object 
 2   Performer                  29503 non-null  object 
 3   Song                       29503 non-null  object 
 4   spotify_genre              27903 non-null  object 
 5   spotify_track_id           24397 non-null  object 
 6   spotify_track_preview_url  14491 non-null  object 
 7   spotify_track_duration_ms  24397 non-null  float64
 8   spotify_track_explicit     24397 non-null  object 
 9   spotify_track_album        24391 non-null  object 
 10  danceability               24334 non-null  float64
 11  energy                     24334 non-null  float64
 12  key                        24334 non-null  float64
 13  loudness                   24334 non-null  flo

## Data Preparation
Text here

In [5]:
# removing attributes that will not be used in cleaning or analysis
df_hotlist_all = df_hotlist_all.drop(['index', 'url', 'Song', 'Performer'], axis=1)
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 7 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   WeekID                  327895 non-null  object 
 1   Week Position           327895 non-null  int64  
 2   SongID                  327895 non-null  object 
 3   Instance                327895 non-null  int64  
 4   Previous Week Position  295941 non-null  float64
 5   Peak Position           327895 non-null  int64  
 6   Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 17.5+ MB


In [6]:
# removing attributes that will not be used in cleaning or analysis
df_features_all = df_features_all.drop(['index', 'Performer', 'Song', 'spotify_track_album', 'spotify_track_preview_url', 'spotify_track_explicit', 'spotify_track_popularity'], axis=1)
df_features_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29503 entries, 0 to 29502
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     29503 non-null  object 
 1   spotify_genre              27903 non-null  object 
 2   spotify_track_id           24397 non-null  object 
 3   spotify_track_duration_ms  24397 non-null  float64
 4   danceability               24334 non-null  float64
 5   energy                     24334 non-null  float64
 6   key                        24334 non-null  float64
 7   loudness                   24334 non-null  float64
 8   mode                       24334 non-null  float64
 9   speechiness                24334 non-null  float64
 10  acousticness               24334 non-null  float64
 11  instrumentalness           24334 non-null  float64
 12  liveness                   24334 non-null  float64
 13  valence                    24334 non-null  flo

In [7]:
# converting WeekID to datetime
df_hotlist_all['WeekID'] = pd.to_datetime(df_hotlist_all['WeekID'], errors='coerce')
df_hotlist_all = df_hotlist_all.sort_values(by='WeekID')
df_hotlist_all.head(3)

Unnamed: 0,WeekID,Week Position,SongID,Instance,Previous Week Position,Peak Position,Weeks on Chart
18553,1958-08-02,63,High School ConfidentialJerry Lee Lewis And Hi...,1,,63,1
103337,1958-08-02,98,Little SerenadeThe Ames Brothers,1,,98,1
146293,1958-08-02,68,Volare (Nel Blu Dipinto Di Blu)Dean Martin,1,,68,1


In [8]:
# creating a new df with only complete year data from 2009 - 2020, the time period being studied
df_hotlist_2000s = df_hotlist_all.loc[(df_hotlist_all['WeekID'] > '2009-12-31') & (df_hotlist_all['WeekID'] < '2021-01-01')]
df_hotlist_2000s.head(2), df_hotlist_2000s.tail(2)

(           WeekID  Week Position                                    SongID  \
 270579 2010-01-02             37  Run This TownJay-Z, Rihanna & Kanye West   
 145895 2010-01-02            100    Video PhoneBeyonce Featuring Lady Gaga   
 
         Instance  Previous Week Position  Peak Position  Weeks on Chart  
 270579         1                    32.0              2              21  
 145895         1                    97.0             65               4  ,
            WeekID  Week Position                    SongID  Instance  \
 7909   2020-12-26             40     Gold RushTaylor Swift         1   
 320975 2020-12-26             65  HawaiMaluma & The Weeknd         1   
 
         Previous Week Position  Peak Position  Weeks on Chart  
 7909                       NaN             40               1  
 320975                    55.0             12              17  )

In [9]:
# adding a column to calculate the week over week change in rank
def diff(a, b):
    return a - b

df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s.apply(lambda x: diff(x['Week Position'], x['Previous Week Position']), axis=1)
df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Rank_Change'].fillna(0)
df_hotlist_2000s.head(3), df_hotlist_2000s.tail(3), df_hotlist_2000s.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57400 entries, 270579 to 320975
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   WeekID                  57400 non-null  datetime64[ns]
 1   Week Position           57400 non-null  int64         
 2   SongID                  57400 non-null  object        
 3   Instance                57400 non-null  int64         
 4   Previous Week Position  50939 non-null  float64       
 5   Peak Position           57400 non-null  int64         
 6   Weeks on Chart          57400 non-null  int64         
 7   Rank_Change             57400 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(4), object(1)
memory usage: 3.9+ MB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s.apply(lambda x: diff(x['Week Position'], x['Previous Week Position']), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Rank_Change'].fillna(0)


(           WeekID  Week Position                                    SongID  \
 270579 2010-01-02             37  Run This TownJay-Z, Rihanna & Kanye West   
 145895 2010-01-02            100    Video PhoneBeyonce Featuring Lady Gaga   
 279649 2010-01-02             96    Kings And QueensThirty Seconds To Mars   
 
         Instance  Previous Week Position  Peak Position  Weeks on Chart  \
 270579         1                    32.0              2              21   
 145895         1                    97.0             65               4   
 279649         2                    82.0             82               4   
 
         Rank_Change  
 270579          5.0  
 145895          3.0  
 279649         14.0  ,
            WeekID  Week Position                            SongID  Instance  \
 265214 2020-12-26             93  Ain't Always The CowboyJon Pardi         1   
 7909   2020-12-26             40             Gold RushTaylor Swift         1   
 320975 2020-12-26             65       

In [13]:
# new df with the max weekly rank change for each song in df_hotlist_2000s
df_max_rank_change = df_hotlist_2000s.groupby('SongID', as_index=False)['Rank_Change'].max()
df_max_rank_change.rename(columns={'Rank_Change': 'Max_Rank_Change'}, inplace=True)
df_max_rank_change.set_index('SongID', inplace=True)
df_max_rank_change.head(3), df_max_rank_change.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5277 entries, #BeautifulMariah Carey Featuring Miguel to whoa (mind in awe)XXXTENTACION
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Max_Rank_Change  5277 non-null   float64
dtypes: float64(1)
memory usage: 82.5+ KB


(                                             Max_Rank_Change
 SongID                                                      
 #BeautifulMariah Carey Featuring Miguel                 17.0
 #SELFIEThe Chainsmokers                                 18.0
 #thatPOWERwill.i.am Featuring Justin Bieber             21.0,
 None)

In [10]:
# new df with the max peak position for each song in df_hotlist_2000s
df_max_peak_pos = df_hotlist_2000s.groupby('SongID', as_index=False)['Peak Position'].max()
df_max_peak_pos.rename(columns={'Peak Position': 'Max_Peak_Position'}, inplace=True)
df_max_peak_pos.set_index('SongID', inplace=True)
df_max_peak_pos.head(3), df_max_peak_pos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5277 entries, #BeautifulMariah Carey Featuring Miguel to whoa (mind in awe)XXXTENTACION
Data columns (total 1 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Max_Peak_Position  5277 non-null   int64
dtypes: int64(1)
memory usage: 82.5+ KB


(                                             Max_Peak_Position
 SongID                                                        
 #BeautifulMariah Carey Featuring Miguel                     24
 #SELFIEThe Chainsmokers                                     55
 #thatPOWERwill.i.am Featuring Justin Bieber                 42,
 None)

In [14]:
df_max_peak_pos['Max_Peak_Position'].isna().sum(), df_max_rank_change['Max_Rank_Change'].isna().sum()

(0, 0)

In [15]:
# extracting full list of songs in the time period being studied
songid_list = df_hotlist_2000s['SongID'].unique()

# creating a features df with only songs in df_hotlist_2000s
df_features_2000s = df_features_all[df_features_all['SongID'].isin(songid_list)]

In [16]:
# checking for duplicates
print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

5389
5272


In [17]:
# removing duplicates and rechecking
df_features_2000s = df_features_2000s.drop_duplicates(subset='SongID')

print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

5272
5272


In [18]:
# adding max peak position to main df
df_2000s_data = df_features_2000s.join(df_max_peak_pos, on='SongID')
# adding max rank change to main df
df_2000s_data = df_2000s_data.join(df_max_rank_change, on='SongID')

df_2000s_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5272 entries, 5 to 29499
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     5272 non-null   object 
 1   spotify_genre              4924 non-null   object 
 2   spotify_track_id           4773 non-null   object 
 3   spotify_track_duration_ms  4773 non-null   float64
 4   danceability               4753 non-null   float64
 5   energy                     4753 non-null   float64
 6   key                        4753 non-null   float64
 7   loudness                   4753 non-null   float64
 8   mode                       4753 non-null   float64
 9   speechiness                4753 non-null   float64
 10  acousticness               4753 non-null   float64
 11  instrumentalness           4753 non-null   float64
 12  liveness                   4753 non-null   float64
 13  valence                    4753 non-null   float64
 

In [19]:
# removing entries with missing values
df_cleaned = df_2000s_data[df_2000s_data.notna().all(axis=1)]
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4713 entries, 5 to 29499
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SongID                     4713 non-null   object 
 1   spotify_genre              4713 non-null   object 
 2   spotify_track_id           4713 non-null   object 
 3   spotify_track_duration_ms  4713 non-null   float64
 4   danceability               4713 non-null   float64
 5   energy                     4713 non-null   float64
 6   key                        4713 non-null   float64
 7   loudness                   4713 non-null   float64
 8   mode                       4713 non-null   float64
 9   speechiness                4713 non-null   float64
 10  acousticness               4713 non-null   float64
 11  instrumentalness           4713 non-null   float64
 12  liveness                   4713 non-null   float64
 13  valence                    4713 non-null   float64
 

In [20]:
# generating a df with unique genre names
unique_genres = list(set(
    genre 
    for genre_string in df_cleaned['spotify_genre'] 
    if pd.notna(genre_string)
    for genre in ast.literal_eval(genre_string)
))

df_unique_genres = pd.DataFrame(unique_genres, columns=['genre'])

In [21]:
# adding counts of each unique genre name
# Extract all genres (with duplicates) and count them
all_genres_list = []
for genre_string in df_cleaned['spotify_genre']:
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        all_genres_list.extend(genre_list)

# Count occurrences
genre_counts = Counter(all_genres_list)

# Map counts to genres dataframe
df_unique_genres['count'] = df_unique_genres['genre'].map(genre_counts)
df_unique_genres = df_unique_genres.sort_values('count', ascending=False)

In [None]:
# writing to csv for easier review of the data
df_unique_genres.to_csv('genre_counts.csv', index=False)

In [22]:
# loading list of genres with 50 or more instances in df_cleaned
df_genres_50_up = pd.read_csv('genre_counts_50+inst.csv')

In [23]:
# converting df to list
final_genres_list = df_genres_50_up['genre'].tolist()

# manually one-hot encoding each genre

# creating each new genre column and initializing to 0
for genre in final_genres_list:
    df_cleaned[genre] = 0

# iterating through rows to set values to 1 when genre column appears in original spotify_genre column
for idx, genre_string in enumerate(df_cleaned['spotify_genre']):
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        for genre in genre_list:
            df_cleaned.at[idx, genre] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[genre] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats

In [24]:
pd.set_option('display.max_columns', None)
df_cleaned.head(3)

Unnamed: 0,SongID,spotify_genre,spotify_track_id,spotify_track_duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,Max_Peak_Position,Max_Rank_Change,pop,rap,pop rap,dance pop,post-teen pop,hip hop,trap,contemporary country,country road,country,southern hip hop,modern country rock,atl hip hop,r&b,canadian pop,melodic rap,urban contemporary,pop rock,hollywood,glee club,neo mellow,canadian hip hop,toronto rap,edm,gangster rap,country pop,tropical house,hip pop,modern rock,miami hip hop,latin,electropop,chicago rap,conscious hip hop,viral pop,country dawn,uk pop,dirty south rap,philly rap,detroit hip hop,alternative r&b,post-grunge,oklahoma country,talent show,canadian contemporary r&b,neo soul,boy band,reggaeton,electro house,rock,atl trap,nc hip hop,new orleans rap,emo rap,australian country,alternative rock,permanent wave,adult standards,brill building pop,easy listening,vocal jazz,plugg,underground hip hop,pittsburgh rap,cali rap,slow game,alternative dance,dance-punk,indie pop,indie rock,indietronica,new rave,indie pop rap,bachata,latin pop,tropical,deep pop r&b,metropopolis,baton rouge rap,brooklyn drill,nyc rap,australian pop,west coast trap,g funk,complextro,german techno,dmv rap,new jersey rap,danish pop,scandipop,texas country,dfw rap,acoustic pop,deep talent show,redneck,american folk revival,country gospel,folk-pop,ny roots,east coast hip hop,candy pop,memphis hip hop,trap queen,pop reggaeton,downtempo,electronic trap,shiver pop,latin hip hop,reggaeton flow,pop punk,punk,socal pop punk,sertanejo,sertanejo pop,sertanejo universitario,emo,pixie,pop emo,new jack swing,quiet storm,swedish electropop,swedish pop,big room,brostep,catstep,electra,australian dance,british soul,lounge,girl group,alternative metal,canadian metal,canadian rock,nu metal,wrestling,piano rock,west coast rap,bronx hip hop,hardcore hip hop,show tunes,etherpop,indie poptimism,progressive electro house,modern uplift,australian hip hop,kentucky hip hop,art pop,art rock,experimental,experimental rock,melancholia,post-punk,psychedelic rock,barbadian pop,puerto rican pop,trap latino,queens hip hop,progressive house,ohio hip hop,rap metal,deep big room,dutch house,lgbtq+ hip hop,reggaeton colombiano,houston rap,modern folk rock,stomp and holler,uk americana,alternative hip hop,europop,cartoon,children's music,arkansas country,funk,soul,mexican pop,crunk,electro,disco house,idol,social media pop,hawaiian hip hop,vapor trap,grunge,hard rock,k-pop,k-pop boy group,electropowerpop,neon pop punk,trancecore,album rock,classic rock,dance rock,glam rock,protopunk,north carolina hip hop,house,uk dance,nu-metalcore,trap soul,rock-and-roll,rockabilly,groove metal,rap conscient,drill,baroque pop,uk contemporary r&b,indiecoustica,lds youth,lilith,celtic rock,reggae fusion,canadian trap,outlaw country,ccm,christian alternative rock,christian indie,christian music,worship,moombahton,neo-singer-songwriter,neo-synthpop,neo-traditional country,folk,new wave pop,singer-songwriter,aussietronica,irish rock,disney,florida rap,colombian pop,a cappella,latin viral pop,rap latina,viral rap,la indie,indie electropop,la pop,pop edm,portland hip hop,viral trap,deep southern trap,hopebeat,k-hop,san diego rap,teen pop,modern alternative rock,nu gaze,motown,ghanaian hip hop,canadian indie,new americana,modern blues rock,south african rock,pop soul,swedish synthpop,escape room,indie soul,heartland rock,k-rap,funk metal,funk rock,nyc pop,garage rock,sheffield indie,uk alternative pop,mellow gold,soft rock,yacht rock,latin arena pop,latin rock,mexican rock,rock en espanol,bubblegum dance,eurodance,melbourne bounce international,country rock,comedy,comic,j-pop,japanese singer-songwriter,post-metal,progressive metal,progressive rock,blues rock,punk blues,detroit trap,bow pop,roots americana,hyphy,australian indie,filter house,ectofolk,nz pop,anthem worship,world worship,cedm,christian pop,minnesota hip hop,dancehall,canadian country,canadian singer-songwriter,canadian folk,folk rock,bass trap,vapor twitch,alt z,bedroom pop,vocal house,chicago hardcore,chicago punk,hardcore punk,cowboy western,traditional country,yodeling,alabama indie,lovers rock,country rap,queer country,movie tunes,k-pop girl group,chicago drill,uk funky,gospel,gospel r&b,chicago indie,minneapolis sound,bubblegum pop,classic country pop,classic uk pop,nashville sound,boston hip hop,modern southern rock,trance,trap boricua,champeta,vallenato,liquid funk,chicago house,chinese hip hop,chinese idol pop,alternative pop rock,glam metal,virgin islands reggae,destroy techno,electronic rock,christian rock,rap rock,tennessee hip hop,jam band,french shoegaze,romanian pop,pop r&b,australian electropop,israeli pop,rebel blues,celtic,middle earth,panamanian pop,brooklyn indie,lo star,drum and bass,canadian electronic,chamber pop,quebec indie,stomp pop,pop dance,slap house,broadway,alabama rap,shimmer pop,battle rap,alberta country,canadian contemporary country,classic girl group,irish singer-songwriter,indy indie,german pop,old school hip hop,deep euro house,deep house,german dance,bedroom soul,metal,scorecore,soundtrack,indie folk,progressive trance,grime,deep underground hip hop,vapor pop,modern salsa,salsa,pinoy hip hop,dutch hip hop,icelandic indie,icelandic rock,australian house,bass house,jazz rap,bmore,ninja,arkansas hip hop,antiviral pop,comedy rock,parody,indie r&b
5,...Ready For It?Taylor Swift,"['pop', 'post-teen pop']",2yLa0QULdQr0qAIvVwN6B5,208186.0,0.613,0.764,2.0,-6.509,1.0,0.136,0.0527,0.0,0.197,0.417,160.015,4.0,4.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
13,'Til Summer Comes AroundKeith Urban,"['australian country', 'contemporary country',...",1CKmI1IQjVEVB3F7VmJmM3,331466.0,0.57,0.629,9.0,-7.608,0.0,0.0331,0.593,0.000136,0.77,0.308,127.907,4.0,92.0,14.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,'Tis The Damn SeasonTaylor Swift,"['pop', 'post-teen pop']",7dW84mWkdWE5a6lFWxJCBG,229840.0,0.575,0.434,5.0,-8.193,1.0,0.0312,0.735,6.6e-05,0.105,0.348,145.916,4.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [25]:
# my code added columns for all genres in spotify_genre, removing unwanted columns :(
last_col_to_keep = 'emo rap'
df_cleaned = df_cleaned.loc[:, :last_col_to_keep]
df_cleaned.head(3)

Unnamed: 0,SongID,spotify_genre,spotify_track_id,spotify_track_duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,Max_Peak_Position,Max_Rank_Change,pop,rap,pop rap,dance pop,post-teen pop,hip hop,trap,contemporary country,country road,country,southern hip hop,modern country rock,atl hip hop,r&b,canadian pop,melodic rap,urban contemporary,pop rock,hollywood,glee club,neo mellow,canadian hip hop,toronto rap,edm,gangster rap,country pop,tropical house,hip pop,modern rock,miami hip hop,latin,electropop,chicago rap,conscious hip hop,viral pop,country dawn,uk pop,dirty south rap,philly rap,detroit hip hop,alternative r&b,post-grunge,oklahoma country,talent show,canadian contemporary r&b,neo soul,boy band,reggaeton,electro house,rock,atl trap,nc hip hop,new orleans rap,emo rap
5,...Ready For It?Taylor Swift,"['pop', 'post-teen pop']",2yLa0QULdQr0qAIvVwN6B5,208186.0,0.613,0.764,2.0,-6.509,1.0,0.136,0.0527,0.0,0.197,0.417,160.015,4.0,4.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,'Til Summer Comes AroundKeith Urban,"['australian country', 'contemporary country',...",1CKmI1IQjVEVB3F7VmJmM3,331466.0,0.57,0.629,9.0,-7.608,0.0,0.0331,0.593,0.000136,0.77,0.308,127.907,4.0,92.0,14.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,'Tis The Damn SeasonTaylor Swift,"['pop', 'post-teen pop']",7dW84mWkdWE5a6lFWxJCBG,229840.0,0.575,0.434,5.0,-8.193,1.0,0.0312,0.735,6.6e-05,0.105,0.348,145.916,4.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# removing fields used for prep/cleaning but not needed for analysis
df_cleaned = df_cleaned.drop(['SongID', 'spotify_genre', 'spotify_track_id'], axis=1)
df_cleaned.head(3), df_cleaned.tail(3)

(    spotify_track_duration_ms  danceability  energy  key  loudness  mode  \
 5                    208186.0         0.613   0.764  2.0    -6.509   1.0   
 13                   331466.0         0.570   0.629  9.0    -7.608   0.0   
 16                   229840.0         0.575   0.434  5.0    -8.193   1.0   
 
     speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
 5        0.1360        0.0527          0.000000     0.197    0.417  160.015   
 13       0.0331        0.5930          0.000136     0.770    0.308  127.907   
 16       0.0312        0.7350          0.000066     0.105    0.348  145.916   
 
     time_signature  Max_Peak_Position  Max_Rank_Change  pop  rap  pop rap  \
 5              4.0                4.0             22.0  0.0  0.0      0.0   
 13             4.0               92.0             14.0  0.0  1.0      1.0   
 16             4.0               39.0              0.0  0.0  0.0      0.0   
 
     dance pop  post-teen pop  hip hop  trap  contempo

In [27]:
# removing extra rows
df_cleaned = df_cleaned.dropna()
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4713 entries, 5 to 29499
Data columns (total 69 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   spotify_track_duration_ms  4713 non-null   float64
 1   danceability               4713 non-null   float64
 2   energy                     4713 non-null   float64
 3   key                        4713 non-null   float64
 4   loudness                   4713 non-null   float64
 5   mode                       4713 non-null   float64
 6   speechiness                4713 non-null   float64
 7   acousticness               4713 non-null   float64
 8   instrumentalness           4713 non-null   float64
 9   liveness                   4713 non-null   float64
 10  valence                    4713 non-null   float64
 11  tempo                      4713 non-null   float64
 12  time_signature             4713 non-null   float64
 13  Max_Peak_Position          4713 non-null   float64
 

In [None]:
# Prepare features and target for k-NN max_peak_position analysis
X1 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y1 = df_cleaned['Max_Peak_Position']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X1_train_scaled = scaler.fit_transform(X1_train)
X1_test_scaled = scaler.fit_transform(X1_test)

In [39]:
# Prepare features and target for k-NN max_rank_change analysis
X2 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y2 = df_cleaned['Max_Rank_Change']

# Splitting the data into training and testing sets (75-25 split and random_state of 42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.25, random_state=42)

# Standardize the features
scaler = StandardScaler()
X2_train_scaled = scaler.fit_transform(X2_train)
X2_test_scaled = scaler.fit_transform(X2_test)

In [36]:
# Prepare features and target for simple deep learning max_peak_position analysis
X3 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y3 = df_cleaned['Max_Peak_Position']

# Splitting the data into training and testing sets 
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X3_train_final, X3_val, y3_train_final, y3_val = train_test_split(X3_train, y3_train, test_size=0.2, random_state=42)

print("Training data shape:", X3_train_final.shape)
print("Validation data shape:", X3_val.shape)
print("Test data shape:", X3_test.shape)

Training data shape: (3016, 67)
Validation data shape: (754, 67)
Test data shape: (943, 67)


In [37]:
# Prepare features and target for simple deep learning max_rank_change analysis
X4 = df_cleaned.drop(['Max_Peak_Position', 'Max_Rank_Change'], axis=1)
y4 = df_cleaned['Max_Rank_Change']

# Splitting the data into training and testing sets 
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=42)

# splitting training data into training and validiation
X4_train_final, X4_val, y4_train_final, y4_val = train_test_split(X4_train, y4_train, test_size=0.2, random_state=42)

print("Training data shape:", X4_train_final.shape)
print("Validation data shape:", X4_val.shape)
print("Test data shape:", X4_test.shape)

Training data shape: (3016, 67)
Validation data shape: (754, 67)
Test data shape: (943, 67)


In [42]:
# normalizing data for simple deep learning max_peak_position

scaler.fit(X3_train_final)

X3_train_scaled = scaler.transform(X3_train_final)
X3_val_scaled = scaler.transform(X3_val)
X3_test_scaled = scaler.transform(X3_test)

print("Feature means:", scaler.mean_)
print("Feature variances:", scaler.var_)


Feature means: [ 2.17672555e+05  6.38025199e-01  6.62048077e-01  5.26492042e+00
 -6.14628249e+00  6.56498674e-01  1.14965385e-01  1.90596894e-01
  7.77107493e-03  1.82270491e-01  4.77007725e-01  1.23633392e+02
  3.97015915e+00  6.53183024e-02  6.06763926e-02  5.43766578e-02
  4.47612732e-02  3.87931034e-02  3.91246684e-02  3.97877984e-02
  2.88461538e-02  2.28779841e-02  2.32095491e-02  2.18832891e-02
  1.25994695e-02  1.59151194e-02  8.28912467e-03  1.32625995e-02
  1.29310345e-02  6.96286472e-03  6.29973475e-03  7.62599469e-03
  5.96816976e-03  4.64190981e-03  4.97347480e-03  4.97347480e-03
  2.98408488e-03  3.97877984e-03  4.64190981e-03  2.98408488e-03
  3.64721485e-03  4.97347480e-03  4.64190981e-03  4.31034483e-03
  4.31034483e-03  2.98408488e-03  3.64721485e-03  4.64190981e-03
  3.31564987e-03  2.98408488e-03  3.31564987e-03  2.98408488e-03
  5.30503979e-03  2.65251989e-03  2.98408488e-03  3.64721485e-03
  3.64721485e-03  3.31564987e-03  2.32095491e-03  3.31564987e-03
  9.946949

In [None]:
# normalizing data for simple deep learning max_rank_change

scaler.fit(X4_train_final)

X4_train_scaled = scaler.transform(X4_train_final)
X4_val_scaled = scaler.transform(X4_val)
X4_test_scaled = scaler.transform(X4_test)


print("Feature means:", scaler.mean_)
print("Feature variances:", scaler.var_)


Feature means: [ 2.17672555e+05  6.38025199e-01  6.62048077e-01  5.26492042e+00
 -6.14628249e+00  6.56498674e-01  1.14965385e-01  1.90596894e-01
  7.77107493e-03  1.82270491e-01  4.77007725e-01  1.23633392e+02
  3.97015915e+00  6.53183024e-02  6.06763926e-02  5.43766578e-02
  4.47612732e-02  3.87931034e-02  3.91246684e-02  3.97877984e-02
  2.88461538e-02  2.28779841e-02  2.32095491e-02  2.18832891e-02
  1.25994695e-02  1.59151194e-02  8.28912467e-03  1.32625995e-02
  1.29310345e-02  6.96286472e-03  6.29973475e-03  7.62599469e-03
  5.96816976e-03  4.64190981e-03  4.97347480e-03  4.97347480e-03
  2.98408488e-03  3.97877984e-03  4.64190981e-03  2.98408488e-03
  3.64721485e-03  4.97347480e-03  4.64190981e-03  4.31034483e-03
  4.31034483e-03  2.98408488e-03  3.64721485e-03  4.64190981e-03
  3.31564987e-03  2.98408488e-03  3.31564987e-03  2.98408488e-03
  5.30503979e-03  2.65251989e-03  2.98408488e-03  3.64721485e-03
  3.64721485e-03  3.31564987e-03  2.32095491e-03  3.31564987e-03
  9.946949

## Analysis

Text here

In [30]:
# k-NN model for max_peak_position

# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores = cross_val_score(knn, X1_train_scaled, y1_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results[metric] = cv_scores.mean()

best_metric = max(results, key=results.get)

In [31]:
print(results)
print(f"\nBest metric: {best_metric} with accuracy: {results[best_metric]:.4f}")

{'euclidean': 0.011318222069070526, 'manhattan': 0.00877185249888809, 'chebyshev': 0.012734251976391487}

Best metric: chebyshev with accuracy: 0.0127


In [32]:
# k-NN model for max_rank_change

# testing different distance metrics to find optimal approach
metrics = ['euclidean', 'manhattan', 'chebyshev']
k_value = 5

# Dictionary to store results
results2 = {}

# Loop through list of metrics
for metric in metrics:
        # Create and evaluate model with different metrics and k=5
    knn2 = KNeighborsClassifier(n_neighbors=k_value, metric=metric)
    # Get cross val scores for model
    cv_scores2 = cross_val_score(knn2, X2_train_scaled, y2_train, cv=5, scoring='accuracy')
    # Store the mean of cv scores as value and metric name as key in results dictionary
    results2[metric] = cv_scores2.mean()

best_metric2 = max(results2, key=results2.get)



In [33]:
print(results2)
print(f"\nBest metric: {best_metric2} with accuracy: {results2[best_metric2]:.4f}")

{'euclidean': 0.26598523065580537, 'manhattan': 0.2739092282356524, 'chebyshev': 0.2710819766719691}

Best metric: manhattan with accuracy: 0.2739


In [None]:
# simple deep learning model for max_peak_position, no regularization or dropout

baseline_model3 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X3_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline3 = baseline_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [None]:
# simple deep learning model for max_peak_position with batch normalization

bnorm_model3 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X3_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(1)  # Single output for regression
])

bnorm_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model3 = bnorm_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [None]:
# simple deep learning model for max_peak_position with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.3

reg_model3 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X3_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model3.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model3 = reg_model3.fit(
    X3_train_scaled, y3_train_final,
    validation_data=(X3_val_scaled, y3_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [71]:
# model evaluation

print("=== Baseline Model ===")
train_scores3 = baseline_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores3   = baseline_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores3[1]:.4f}, Train MSE: {train_scores3[2]:.4f}")
print(f"Val   MAE: {val_scores3[1]:.4f}, Val   MSE: {val_scores3[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn3 = bnorm_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores_bn3   = bnorm_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores_bn3[1]:.4f}, Train MSE: {train_scores_bn3[2]:.4f}")
print(f"Val   MAE: {val_scores_bn3[1]:.4f}, Val   MSE: {val_scores_bn3[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg3 = reg_model3.evaluate(X3_train_scaled, y3_train_final, verbose=0)
val_scores_reg3   = reg_model3.evaluate(X3_val_scaled, y3_val, verbose=0)
print(f"Train MAE: {train_scores_reg3[1]:.4f}, Train MSE: {train_scores_reg3[2]:.4f}")
print(f"Val   MAE: {val_scores_reg3[1]:.4f}, Val   MSE: {val_scores_reg3[2]:.4f}")


=== Baseline Model ===
Train MAE: 18.7836, Train MSE: 556.1683
Val   MAE: 23.6419, Val   MSE: 834.4442

=== BatchNorm Model ===
Train MAE: 17.1103, Train MSE: 483.8191
Val   MAE: 23.7769, Val   MSE: 859.0748

=== Regularized Model (L2 + Dropout) ===
Train MAE: 21.8296, Train MSE: 670.4427
Val   MAE: 23.9001, Val   MSE: 797.2174


In [None]:
# simple deep learning model for max_rank_change, no regularization or dropout

baseline_model4 = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X4_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)  # Single output for regression
])

baseline_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_baseline4 = baseline_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [57]:
# simple deep learning model for max_rank_change with batch normalization

bnorm_model4 = keras.Sequential([
    layers.Dense(64, activation='linear', input_shape=(X4_train.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(64, activation='linear'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    
    layers.Dense(1)  # Single output for regression
])

bnorm_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_bnorm_model4 = bnorm_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [58]:
# simple deep learning model for max_rank_change with regularization (L2 and dropout)

l2_reg = 1e-4
dropout_rate = 0.3

reg_model4 = keras.Sequential([
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg),
                 input_shape=(X4_train.shape[1],)),
    layers.Dropout(dropout_rate),
        
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_reg)),
    layers.Dropout(dropout_rate),

    layers.Dense(1)  # Single output for regression
])

reg_model4.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae','mse']
)

history_reg_model4 = reg_model4.fit(
    X4_train_scaled, y4_train_final,
    validation_data=(X4_val_scaled, y4_val),
    epochs=100,
    batch_size=32,
    verbose=1
)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [59]:
# model evaluation

print("=== Baseline Model ===")
train_scores4 = baseline_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores4   = baseline_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores4[1]:.4f}, Train MSE: {train_scores4[2]:.4f}")
print(f"Val   MAE: {val_scores4[1]:.4f}, Val   MSE: {val_scores4[2]:.4f}")

print("\n=== BatchNorm Model ===")
train_scores_bn4 = bnorm_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores_bn4   = bnorm_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores_bn4[1]:.4f}, Train MSE: {train_scores_bn4[2]:.4f}")
print(f"Val   MAE: {val_scores_bn4[1]:.4f}, Val   MSE: {val_scores_bn4[2]:.4f}")

print("\n=== Regularized Model (L2 + Dropout) ===")
train_scores_reg4 = reg_model4.evaluate(X4_train_scaled, y4_train_final, verbose=0)
val_scores_reg4   = reg_model4.evaluate(X4_val_scaled, y4_val, verbose=0)
print(f"Train MAE: {train_scores_reg4[1]:.4f}, Train MSE: {train_scores_reg4[2]:.4f}")
print(f"Val   MAE: {val_scores_reg4[1]:.4f}, Val   MSE: {val_scores_reg4[2]:.4f}")


=== Baseline Model ===
Train MAE: 7.5674, Train MSE: 105.6946
Val   MAE: 11.9550, Val   MSE: 245.9815

=== BatchNorm Model ===
Train MAE: 7.5174, Train MSE: 108.0350
Val   MAE: 11.0248, Val   MSE: 216.7124

=== Regularized Model (L2 + Dropout) ===
Train MAE: 9.0894, Train MSE: 148.4039
Val   MAE: 10.2212, Val   MSE: 185.2670


## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here