![Add a relevant banner image here](path_to_image)

# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

The customer for this project is FutureProduct Advisors, a consultancy that helps its customers develop innovative new consumer products. FutureProduct’s customers are increasingly seeking help from their consultants with go-to-market activities.

FutureProduct’s consultants can support these go-to-market activities, but the business does not have all the infrastructure needed to support them. Their biggest ask is for a tool to help them find interesting, up-and-coming music to accompany social posts and online ads for go-to-market promotions.

**Stakeholders**

- FutureProduct Managing Director: oversees their consulting practice and is sponsoring this project.
- FutureProduct Senior Consultants: the actual users of the prospective tool. A small subset of the consultants will pilot the prototype tool.
- My consulting leadership: sponsors of this effort; will provide oversight and technical input on the project as needed.

**Primary Goals**

1. Build a data tool that can evaluate any song on the Billboard Hot 100 list and make predictions about:
    - The song’s position on the Hot 100 list 4 weeks in the future
    - The song’s highest position on the list in the next 6 months
2. Create a rubric that lists the 3 most important factors for songs’ placement on the Hot 100 list for each year from 2000 to 2021.
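
The two prediction targets in goal 1 could be derived from the weekly chart rows along the following lines. This is only a sketch under stated assumptions: the function name `add_future_targets`, the output columns `Future_Position` and `Best_Future_Position`, and the choice of 26 weeks as a stand-in for "6 months" are all hypothetical and not part of the project code. It also assumes one row per song per chart week, with `WeekID` already parsed as a datetime.

```python
import pandas as pd

def add_future_targets(chart, horizon_weeks=4, window_weeks=26):
    """Attach two illustrative target columns to a weekly chart DataFrame."""
    chart = chart.sort_values(['SongID', 'WeekID']).reset_index(drop=True)

    # Target 1: the position exactly `horizon_weeks` later.
    # Shifting the future rows back in time lets a plain left join line them up.
    future = chart[['SongID', 'WeekID', 'Week Position']].copy()
    future['WeekID'] = future['WeekID'] - pd.Timedelta(weeks=horizon_weeks)
    future = future.rename(columns={'Week Position': 'Future_Position'})
    chart = chart.merge(future, on=['SongID', 'WeekID'], how='left')

    # Target 2: the best (lowest-numbered) position in the following window.
    best = []
    for _, row in chart.iterrows():
        window_end = row['WeekID'] + pd.Timedelta(weeks=window_weeks)
        later = chart[(chart['SongID'] == row['SongID'])
                      & (chart['WeekID'] > row['WeekID'])
                      & (chart['WeekID'] <= window_end)]
        best.append(later['Week Position'].min() if len(later) else float('nan'))
    chart['Best_Future_Position'] = best
    return chart

# Toy data: one song charting for six consecutive weeks.
demo = pd.DataFrame({
    'SongID': ['A'] * 6,
    'WeekID': pd.date_range('2020-01-04', periods=6, freq='7D'),
    'Week Position': [50, 40, 30, 20, 10, 15],
})
targets = add_future_targets(demo)
```

Both targets are NaN when the song is not on the chart in the relevant future weeks, so a real model would need a rule for those rows (for example, dropping them or treating them as "off the chart"). The row-by-row loop is written for readability and would likely need a faster approach on the full dataset.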


## Data Understanding

Billboard Hot 100 weekly charts (Kaggle): https://www.kaggle.com/datasets/thedevastator/billboard-hot-100-audio-features

I’ve chosen this dataset because it has a direct measurement of song popularity (the Hot 100 list) and because its long history gives significant context to a song’s positioning in a given week.
The features list gives a wide range of song attributes to explore and enables me to determine what features most significantly contribute to a song’s popularity and how that changes over time.


In [1]:
import pandas as pd
import numpy as np

from pyspark import SparkContext
from pyspark.sql import SparkSession

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error, r2_score

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import math
import kagglehub
from kagglehub import KaggleDatasetAdapter

np.random.seed(42)



In [2]:
df_hotlist_all = pd.read_csv('Data/Hot Stuff.csv')
df_features_all = pd.read_csv('Data/Hot 100 Audio Features.csv')

In [3]:
# exploring hotlist data
df_hotlist_all.info(), df_hotlist_all.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   index                   327895 non-null  int64  
 1   url                     327895 non-null  object 
 2   WeekID                  327895 non-null  object 
 3   Week Position           327895 non-null  int64  
 4   Song                    327895 non-null  object 
 5   Performer               327895 non-null  object 
 6   SongID                  327895 non-null  object 
 7   Instance                327895 non-null  int64  
 8   Previous Week Position  295941 non-null  float64
 9   Peak Position           327895 non-null  int64  
 10  Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 27.5+ MB


(None,
    index                                                url     WeekID  \
 0      0  http://www.billboard.com/charts/hot-100/1965-0...  7/17/1965   
 1      1  http://www.billboard.com/charts/hot-100/1965-0...  7/24/1965   
 2      2  http://www.billboard.com/charts/hot-100/1965-0...  7/31/1965   
 
    Week Position                    Song   Performer  \
 0             34  Don't Just Stand There  Patty Duke   
 1             22  Don't Just Stand There  Patty Duke   
 2             14  Don't Just Stand There  Patty Duke   
 
                              SongID  Instance  Previous Week Position  \
 0  Don't Just Stand TherePatty Duke         1                    45.0   
 1  Don't Just Stand TherePatty Duke         1                    34.0   
 2  Don't Just Stand TherePatty Duke         1                    22.0   
 
    Peak Position  Weeks on Chart  
 0             34               4  
 1             22               5  
 2             14               6  )

In [4]:
# index column is duplicative/unnecessary, url will not be used
df_hotlist_all = df_hotlist_all.drop(['index', 'url'], axis=1)
df_hotlist_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327895 entries, 0 to 327894
Data columns (total 9 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   WeekID                  327895 non-null  object 
 1   Week Position           327895 non-null  int64  
 2   Song                    327895 non-null  object 
 3   Performer               327895 non-null  object 
 4   SongID                  327895 non-null  object 
 5   Instance                327895 non-null  int64  
 6   Previous Week Position  295941 non-null  float64
 7   Peak Position           327895 non-null  int64  
 8   Weeks on Chart          327895 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 22.5+ MB
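
The `info()` output shows `Previous Week Position` populated for 295,941 of 327,895 rows, leaving 31,954 missing. A plausible explanation, which the output above does not confirm, is that these are debut or re-entry weeks with no prior rank. The sketch below shows one way to check how many missing values fall on a song's first chart week. The helper name `missing_previous_summary` and the toy data are assumptions for illustration.

```python
import pandas as pd

def missing_previous_summary(chart):
    """Count rows lacking a previous-week rank, split by first week vs. later."""
    missing = chart[chart['Previous Week Position'].isna()]
    first_week = int((missing['Weeks on Chart'] == 1).sum())
    return {
        'missing_total': len(missing),
        'missing_on_first_week': first_week,
        'missing_later': len(missing) - first_week,
    }

# Toy data: song A debuts then continues; song B debuts and later
# reappears without a previous-week rank.
toy = pd.DataFrame({
    'SongID': ['A', 'A', 'B', 'B'],
    'Weeks on Chart': [1, 2, 1, 5],
    'Previous Week Position': [None, 45.0, None, None],
})
summary = missing_previous_summary(toy)
```

On the real data this would be called as `missing_previous_summary(df_hotlist_all)`.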


## Data Preparation
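The chart data is prepared in four steps: `WeekID` is converted to a datetime and the rows are sorted chronologically; the charts are filtered to the complete years 2000 through 2020; a week-over-week `Rank_Change` column is added and aggregated to one value per song; and the result is joined to the Spotify audio features on `SongID`.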

In [5]:
# converting WeekID to datetime
df_hotlist_all['WeekID'] = pd.to_datetime(df_hotlist_all['WeekID'], errors='coerce')
df_hotlist_all = df_hotlist_all.sort_values(by='WeekID')
df_hotlist_all.head(3)

| | WeekID | Week Position | Song | Performer | SongID | Instance | Previous Week Position | Peak Position | Weeks on Chart |
|---|---|---|---|---|---|---|---|---|---|
| 18553 | 1958-08-02 | 63 | High School Confidential | Jerry Lee Lewis And His Pumping Piano | High School ConfidentialJerry Lee Lewis And Hi... | 1 | NaN | 63 | 1 |
| 103337 | 1958-08-02 | 98 | Little Serenade | The Ames Brothers | Little SerenadeThe Ames Brothers | 1 | NaN | 98 | 1 |
| 146293 | 1958-08-02 | 68 | Volare (Nel Blu Dipinto Di Blu) | Dean Martin | Volare (Nel Blu Dipinto Di Blu)Dean Martin | 1 | NaN | 68 | 1 |
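
Because `errors='coerce'` silently turns anything unparseable into `NaT`, it can be worth confirming that no dates were lost in the conversion. The sketch below passes an explicit format and returns the raw values that failed to parse. The `%m/%d/%Y` format is inferred from the sample values shown earlier (such as `7/17/1965`), and the helper name `parse_week_ids` is hypothetical.

```python
import pandas as pd

def parse_week_ids(raw, fmt='%m/%d/%Y'):
    """Parse WeekID strings and report any values that could not be parsed."""
    parsed = pd.to_datetime(raw, format=fmt, errors='coerce')
    # Values that were present in the input but became NaT after parsing.
    unparsed = raw[parsed.isna() & raw.notna()]
    return parsed, unparsed

raw_weeks = pd.Series(['7/17/1965', '7/24/1965', 'not a date'])
parsed_weeks, unparsed = parse_week_ids(raw_weeks)
```

An empty `unparsed` series on the real column would indicate that every `WeekID` converted cleanly.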


In [6]:
# creating a new df with only complete-year data from 2000 through 2020, the time period being studied
df_hotlist_2000s = df_hotlist_all.loc[(df_hotlist_all['WeekID'] > '1999-12-31') & (df_hotlist_all['WeekID'] < '2021-01-01')]
df_hotlist_2000s.head(2), df_hotlist_2000s.tail(2)

(           WeekID  Week Position             Song                 Performer  \
 72674  2000-01-01             69   Deck The Halls                  SHeDAISY   
 239827 2000-01-01             83  Guerrilla Radio  Rage Against The Machine   
 
                                          SongID  Instance  \
 72674                    Deck The HallsSHeDAISY         1   
 239827  Guerrilla RadioRage Against The Machine         1   
 
         Previous Week Position  Peak Position  Weeks on Chart  
 72674                     97.0             69               2  
 239827                    87.0             69              10  ,
            WeekID  Week Position       Song            Performer  \
 7909   2020-12-26             40  Gold Rush         Taylor Swift   
 320975 2020-12-26             65      Hawai  Maluma & The Weeknd   
 
                           SongID  Instance  Previous Week Position  \
 7909       Gold RushTaylor Swift         1                     NaN   
 320975  HawaiMaluma & 

In [7]:
# adding a column to calculate the week over week change in rank
# (negative values mean the song climbed, since lower positions are better)
# taking an explicit copy first avoids pandas' SettingWithCopyWarning
df_hotlist_2000s = df_hotlist_2000s.copy()
df_hotlist_2000s['Rank_Change'] = df_hotlist_2000s['Week Position'] - df_hotlist_2000s['Previous Week Position']
df_hotlist_2000s.head(3)

| | WeekID | Week Position | Song | Performer | SongID | Instance | Previous Week Position | Peak Position | Weeks on Chart | Rank_Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 72674 | 2000-01-01 | 69 | Deck The Halls | SHeDAISY | Deck The HallsSHeDAISY | 1 | 97.0 | 69 | 2 | -28.0 |
| 239827 | 2000-01-01 | 83 | Guerrilla Radio | Rage Against The Machine | Guerrilla RadioRage Against The Machine | 1 | 87.0 | 69 | 10 | -4.0 |
| 253976 | 2000-01-01 | 60 | Heartbreaker | Mariah Carey Featuring Jay-Z | HeartbreakerMariah Carey Featuring Jay-Z | 1 | 51.0 | 1 | 18 | 9.0 |


In [14]:
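# new df with the max weekly rank change for each song in df_hotlist_2000s
# (Rank_Change is positive when a song falls, so max() picks each song's largest fall,
# or its smallest climb if it never fell)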
df_max_rank_change = df_hotlist_2000s.groupby('SongID', as_index=False)['Rank_Change'].max()
df_max_rank_change.rename(columns={'Rank_Change': 'Max_Rank_Change'}, inplace=True)
df_max_rank_change.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8669 entries, 0 to 8668
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   SongID           8669 non-null   object 
 1   Max_Rank_Change  6691 non-null   float64
dtypes: float64(1), object(1)
memory usage: 135.6+ KB
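
Since `Rank_Change` is computed as the current position minus the previous position, a negative value is a climb and a positive value is a fall. If both directions are of interest, the per-song minimum and maximum can be taken together, as in the sketch below. The function name `weekly_move_extremes` and the output column names `Biggest_Climb` and `Biggest_Drop` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def weekly_move_extremes(chart):
    """Return each song's most negative (climb) and most positive (drop) weekly change."""
    return (chart.groupby('SongID')['Rank_Change']
                 .agg(Biggest_Climb='min', Biggest_Drop='max')
                 .reset_index())

# Toy data: song A climbs 10 places and then falls 5; song B only climbs.
toy = pd.DataFrame({
    'SongID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Week Position': [50, 40, 45, 20, 17, 16],
    'Previous Week Position': [None, 50.0, 40.0, None, 20.0, 17.0],
})
toy['Rank_Change'] = toy['Week Position'] - toy['Previous Week Position']
extremes = weekly_move_extremes(toy)
```

For song B, which never falls, `Biggest_Drop` is still negative, which is why the maximum on its own can be misleading as a measure of momentum.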


In [15]:
# extracting full list of songs in the time period being studied
songs_list = df_hotlist_2000s['SongID'].unique()

# creating a features df with only songs in df_hotlist_2000s
df_features_2000s = df_features_all[df_features_all['SongID'].isin(songs_list)]
df_features_2000s.info(), df_features_2000s.head(3)

<class 'pandas.core.frame.DataFrame'>
Index: 8781 entries, 5 to 29500
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      8781 non-null   int64  
 1   SongID                     8781 non-null   object 
 2   Performer                  8781 non-null   object 
 3   Song                       8781 non-null   object 
 4   spotify_genre              8362 non-null   object 
 5   spotify_track_id           7991 non-null   object 
 6   spotify_track_preview_url  4366 non-null   object 
 7   spotify_track_duration_ms  7991 non-null   float64
 8   spotify_track_explicit     7991 non-null   object 
 9   spotify_track_album        7991 non-null   object 
 10  danceability               7958 non-null   float64
 11  energy                     7958 non-null   float64
 12  key                        7958 non-null   float64
 13  loudness                   7958 non-null   float64
 

(None,
     index                                             SongID  \
 5       5                       ...Ready For It?Taylor Swift   
 6       6  '03 Bonnie & ClydeJay-Z Featuring Beyonce Knowles   
 13     13                'Til Summer Comes AroundKeith Urban   
 
                           Performer                      Song  \
 5                      Taylor Swift          ...Ready For It?   
 6   Jay-Z Featuring Beyonce Knowles        '03 Bonnie & Clyde   
 13                      Keith Urban  'Til Summer Comes Around   
 
                                         spotify_genre        spotify_track_id  \
 5                            ['pop', 'post-teen pop']  2yLa0QULdQr0qAIvVwN6B5   
 6   ['east coast hip hop', 'hip hop', 'pop rap', '...  5ljCWsDlSyJ41kwqym2ORw   
 13  ['australian country', 'contemporary country',...  1CKmI1IQjVEVB3F7VmJmM3   
 
    spotify_track_preview_url  spotify_track_duration_ms  \
 5                        NaN                   208186.0   
 6             

In [16]:
df_2000s_data = pd.merge(df_features_2000s, df_max_rank_change, on='SongID', how='left')
df_2000s_data.describe()
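
The filtered features table has 8,781 rows while the chart data contains 8,669 unique songs, so some `SongID` values must appear more than once in the features file. The sketch below shows one way to surface those duplicates and guard the merge with pandas' `validate` argument. Keeping the first row per `SongID` is an assumption made for illustration, and the helper name `merge_features_with_targets` is hypothetical; a different de-duplication rule may suit the data better.

```python
import pandas as pd

def merge_features_with_targets(features, per_song):
    """Left-join per-song values onto features after dropping duplicate SongIDs."""
    # Every row whose SongID occurs more than once, kept for inspection.
    dupes = features[features.duplicated('SongID', keep=False)]
    # Assumption: keep the first occurrence of each SongID.
    deduped = features.drop_duplicates(subset='SongID', keep='first')
    # validate raises pandas.errors.MergeError if either side still has repeated keys.
    merged = pd.merge(deduped, per_song, on='SongID', how='left',
                      validate='one_to_one')
    return merged, dupes

toy_features = pd.DataFrame({
    'SongID': ['A', 'A', 'B'],
    'danceability': [0.5, 0.5, 0.7],
})
toy_per_song = pd.DataFrame({
    'SongID': ['A', 'B'],
    'Max_Rank_Change': [5.0, None],
})
merged, dupes = merge_features_with_targets(toy_features, toy_per_song)
```

On the real data the equivalent call would be `merge_features_with_targets(df_features_2000s, df_max_rank_change)`.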

## Analysis

Text here

## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here