![Add a relevant banner image here](path_to_image)

# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

The customer for this project is FutureProduct Advisors, a consultancy that helps its customers develop innovative new consumer products. FutureProduct’s customers are increasingly asking its consultants for help with go-to-market activities.

FutureProduct’s consultants can support these go-to-market activities, but the business does not have all the infrastructure needed to do so. Its biggest ask is for a tool that helps consultants find interesting, up-and-coming music to accompany social posts and online ads for go-to-market promotions.

**Stakeholders**

- FutureProduct Managing Director: oversees their consulting practice and is sponsoring this project.
- FutureProduct Senior Consultants: the actual users of the prospective tool. A small subset of the consultants will pilot the prototype tool.
- My consulting leadership: sponsors of this effort; will provide oversight and technical input on the project as needed.

**Primary Goals**

1. Build a data tool that can evaluate any song on the Billboard Hot 100 list and make predictions about:
    - The song’s position on the Hot 100 list 4 weeks in the future
    - The song’s highest position on the list in the next 6 months
2. Create a rubric that lists the 3 most important factors for songs’ placement on the Hot 100 list for each year from 2000 to 2021.
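
As a rough illustration of the first goal, a 4-week-ahead target can be built by joining each chart row to the same song's row four weeks later. This is only a sketch on toy data: the column names `SongID`, `WeekID`, and `Week Position` follow the dataset used below, while `Position_In_4_Weeks` and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy weekly chart rows; values are invented for illustration
toy = pd.DataFrame({
    "SongID": ["A"] * 6 + ["B"] * 3,
    "WeekID": pd.to_datetime([
        "2020-01-04", "2020-01-11", "2020-01-18",
        "2020-01-25", "2020-02-01", "2020-02-08",
        "2020-01-04", "2020-01-11", "2020-01-18",
    ]),
    "Week Position": [50, 40, 30, 20, 10, 5, 90, 95, 99],
})
toy = toy.sort_values(["SongID", "WeekID"]).reset_index(drop=True)

# Shift a copy of the chart back by 4 weeks so that each row lines up
# with the same song's position 4 weeks in the future
future = toy[["SongID", "WeekID", "Week Position"]].copy()
future["WeekID"] = future["WeekID"] - pd.Timedelta(weeks=4)
future = future.rename(columns={"Week Position": "Position_In_4_Weeks"})

# Left merge keeps every original row; songs that are off the chart
# 4 weeks later get NaN in the target column
targets = toy.merge(future, on=["SongID", "WeekID"], how="left")
print(targets)
```

Joining on the date rather than shifting by row count keeps the target correct when a song leaves the chart and re-enters later.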


## Data Understanding

Billboard Hot 100 weekly charts (Kaggle): https://www.kaggle.com/datasets/thedevastator/billboard-hot-100-audio-features

I’ve chosen this dataset because it has a direct measurement of song popularity (the Hot 100 list) and because its long history gives significant context to a song’s positioning in a given week.
The features list gives a wide range of song attributes to explore and enables me to determine what features most significantly contribute to a song’s popularity and how that changes over time.


In [None]:
import pandas as pd
import numpy as np
import ast
from collections import Counter

from pyspark import SparkContext
from pyspark.sql import SparkSession

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error, r2_score

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import math
import kagglehub
from kagglehub import KaggleDatasetAdapter

np.random.seed(42)



In [None]:
df_hotlist_all = pd.read_csv('Data/Hot Stuff.csv')
df_features_all = pd.read_csv('Data/Hot 100 Audio Features.csv')

In [None]:
# exploring hotlist data
df_hotlist_all.info()

In [None]:
# exploring features df
df_features_all.info()

## Data Preparation
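This section trims both datasets to the columns needed, restricts the chart data to the study period, derives per-song rank-change and peak-position summaries, merges them with the audio features, and one-hot encodes the most common genres.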

In [None]:
# removing attributes that will not be used in cleaning or analysis
df_hotlist_all = df_hotlist_all.drop(['index', 'url'], axis=1)
df_hotlist_all.info()

In [None]:
# removing attributes that will not be used in cleaning or analysis
df_features_all = df_features_all.drop(['index', 'spotify_track_album', 'spotify_track_preview_url', 'spotify_track_explicit', 'spotify_track_popularity'], axis=1)
df_features_all.info()

In [None]:
# converting WeekID to datetime
df_hotlist_all['WeekID'] = pd.to_datetime(df_hotlist_all['WeekID'], errors='coerce')
df_hotlist_all = df_hotlist_all.sort_values(by='WeekID')
df_hotlist_all.head(3)

In [None]:
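# creating a new df with only complete-year data from 2010 through 2020, the period selected by the filter below
# .copy() makes this an independent df so new columns can be added without a SettingWithCopyWarning
df_hotlist_2000s = df_hotlist_all.loc[(df_hotlist_all['WeekID'] > '2009-12-31') & (df_hotlist_all['WeekID'] < '2021-01-01')].copy()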
df_hotlist_2000s.head(2), df_hotlist_2000s.tail(2)

In [None]:
# adding a column with the week-over-week change in rank
# (negative values mean the song moved up the chart; weeks with no previous position are set to 0)
df_hotlist_2000s = df_hotlist_2000s.assign(
    Rank_Change=(df_hotlist_2000s['Week Position'] - df_hotlist_2000s['Previous Week Position']).fillna(0)
)
df_hotlist_2000s.info()
df_hotlist_2000s.head(3), df_hotlist_2000s.tail(3)

In [None]:
# new df with the max weekly rank change for each song in df_hotlist_2000s
df_max_rank_change = df_hotlist_2000s.groupby('SongID', as_index=False)['Rank_Change'].max()
df_max_rank_change.rename(columns={'Rank_Change': 'Max_Rank_Change'}, inplace=True)
df_max_rank_change.info()

In [None]:
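# new df with the max (numerically largest) Peak Position value for each song in df_hotlist_2000s
# note: position 1 is the top of the chart, so a larger number is a lower chart placement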
df_max_peak_pos = df_hotlist_2000s.groupby('SongID', as_index=False)['Peak Position'].max()
df_max_peak_pos.rename(columns={'Peak Position': 'Max_Peak_Position'}, inplace=True)
df_max_peak_pos.info()
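
Because position 1 is the top of the chart, a song's best placement is its numerically smallest position. The sketch below shows both aggregates side by side on toy data; the names `Best_Position` and `Worst_Position` and the sample values are assumptions for illustration, not columns from the dataset.

```python
import pandas as pd

# Toy chart rows; values are invented for illustration
toy = pd.DataFrame({
    "SongID": ["A", "A", "A", "B", "B"],
    "Week Position": [40, 12, 25, 90, 85],
})

# Named aggregation: one output column per summary statistic
summary = (
    toy.groupby("SongID")["Week Position"]
    .agg(Best_Position="min", Worst_Position="max")  # min is closest to number 1
    .reset_index()
)
print(summary)
```

Which of the two is the right target depends on how "highest position" is defined for the prediction goal.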

In [None]:
# extracting full list of songs in the time period being studied
songs_list = df_hotlist_2000s['SongID'].unique()

# creating a features df with only songs in df_hotlist_2000s
df_features_2000s = df_features_all[df_features_all['SongID'].isin(songs_list)]

In [None]:
# checking for duplicates
print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

In [None]:
# removing duplicates
df_features_2000s = df_features_2000s.drop_duplicates(subset='SongID')

In [None]:
# re-checking for duplicates
print(len(df_features_2000s))
print(len(pd.unique(df_features_2000s['SongID'])))

In [None]:
# adding max rank change and max peak position to main df
df_2000s_data = pd.merge(df_features_2000s, df_max_rank_change, on='SongID', how='left')
# merging onto df_2000s_data (not df_features_2000s) so Max_Rank_Change is kept
df_2000s_data = pd.merge(df_2000s_data, df_max_peak_pos, on='SongID', how='left')

df_2000s_data.info()

In [None]:
# removing entries with missing values and resetting the index so row labels run from 0 again
df_cleaned = df_2000s_data.dropna().reset_index(drop=True)
df_cleaned.info()

In [None]:
# generating a df with unique genre names
unique_genres = list(set(
    genre 
    for genre_string in df_cleaned['spotify_genre'] 
    if pd.notna(genre_string)
    for genre in ast.literal_eval(genre_string)
))

df_unique_genres = pd.DataFrame(unique_genres, columns=['genre'])

In [None]:
# adding counts of each unique genre name
# Extract all genres (with duplicates) and count them
all_genres_list = []
for genre_string in df_cleaned['spotify_genre']:
    if pd.notna(genre_string):
        genre_list = ast.literal_eval(genre_string)
        all_genres_list.extend(genre_list)

# Count occurrences
genre_counts = Counter(all_genres_list)

# Map counts to genres dataframe
df_unique_genres['count'] = df_unique_genres['genre'].map(genre_counts)
df_unique_genres = df_unique_genres.sort_values('count', ascending=False)

In [None]:
# writing to csv for easier review of the data
df_unique_genres.to_csv('genre_counts.csv', index=False)

In [None]:
# loading list of genres with 50 or more instances in df_cleaned
df_genres_50_up = pd.read_csv('genre_counts_50+inst.csv')
df_genres_50_up.head(3)
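
The cutoff on genre frequency can also be applied in code rather than through a separately prepared CSV. The following is a minimal sketch on toy data, assuming genre entries are stringified lists as in `spotify_genre`; `MIN_COUNT` and `frequent_genres` are hypothetical names, and the threshold is lowered to 2 so the toy data produces a result.

```python
import ast
from collections import Counter

# Toy stand-in for the spotify_genre column: each entry is a stringified list
genre_strings = ["['pop', 'dance pop']", "['pop']", "['emo rap', 'pop']", None]

# Count every genre occurrence across all songs, skipping missing entries
genre_counts = Counter(
    genre
    for genre_string in genre_strings
    if genre_string is not None
    for genre in ast.literal_eval(genre_string)
)

MIN_COUNT = 2  # hypothetical threshold for the toy data; the notebook uses 50

# most_common() yields (genre, count) pairs sorted by descending count
frequent_genres = [g for g, n in genre_counts.most_common() if n >= MIN_COUNT]
print(frequent_genres)
```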

In [None]:
# converting df to list
final_genres_list = df_genres_50_up['genre'].tolist()
final_genres_set = set(final_genres_list)

# manually one-hot encoding each genre

# working on a copy so the new columns are not written to a view of df_2000s_data
df_cleaned = df_cleaned.copy()

# creating each new genre column and initializing to 0
for genre in final_genres_list:
    df_cleaned[genre] = 0

# iterating through rows by index label to set values to 1 for genres in final_genres_list
for idx, genre_string in df_cleaned['spotify_genre'].items():
    if pd.notna(genre_string):
        for genre in ast.literal_eval(genre_string):
            # skipping genres outside final_genres_list so no extra columns are created
            if genre in final_genres_set:
                df_cleaned.at[idx, genre] = 1

In [None]:
pd.set_option('display.max_columns', None)
df_cleaned.head(3)

In [None]:
# keeping columns up to and including 'emo rap', the last genre column to keep; this drops any extra genre columns added after it
last_col_to_keep = 'emo rap'
df_cleaned = df_cleaned.loc[:, :last_col_to_keep]
df_cleaned.head(3)
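
The genre encoding can also be written so that only the retained genres ever get columns, which makes a trimming step unnecessary. This is a sketch on toy data rather than the notebook's method: `keep` is a hypothetical stand-in for `final_genres_list`, and the non-contiguous index imitates a dataframe whose rows were filtered earlier.

```python
import ast
import pandas as pd

# Toy frame with non-contiguous index labels, as left behind by row filtering
toy = pd.DataFrame(
    {"spotify_genre": ["['pop', 'dance pop']", "['emo rap']", None]},
    index=[3, 7, 9],
)
keep = ["pop", "emo rap"]  # hypothetical stand-in for final_genres_list

# Parse each stringified list once; missing entries become empty lists
parsed = toy["spotify_genre"].apply(
    lambda s: ast.literal_eval(s) if pd.notna(s) else []
)

# One 0/1 column per retained genre; other genres never get a column
for genre in keep:
    toy[genre] = parsed.apply(lambda genres: int(genre in genres))

print(toy)
```

Because the new columns are assigned as whole Series aligned on the index, the result does not depend on the row labels running from 0.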

## Analysis

Text here

## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here