# Final Project 
#### Eunice Kim
#### DS 4003

## Data Introduction 

https://www.kaggle.com/datasets/leonardopena/top-spotify-songs-from-20102019-by-year/data

This dataset summarizes a decade of popular music, listing the top songs of each year from 2010 to 2019 worldwide. Collected from Spotify and Billboard, it reflects music trends across genres and audiences. Spotify is one of the largest and most popular music streaming applications, and artists and producers release their singles, albums, and exclusive content through the platform. Because its data captures consumer preferences and listening habits, Spotify is a valuable resource for understanding what drives viral hits in the music industry. My goal is to use this dataset to build an interactive dashboard that lets users explore different genres, styles, and the most popular songs overall. This could support music recommendations tailored to the user's preferences. Although many other datasets were available, this one stood out for its balance of depth and breadth, and it gives me a solid basis for accomplishing my goals.


## Data Cleaning

In [1]:
# import dependencies
import pandas as pd
import numpy as np
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

In [8]:
# import the data
data = pd.read_csv("data.csv", encoding='latin-1') # the file would not load with the default encoding; adding encoding='latin-1' fixed it
data.head() # show the first five rows of the dataset

Unnamed: 0.1,Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


In [43]:
# dropping unnecessary columns
data = data.drop(['Unnamed: 0'], axis=1) # removed the first column, which only held the row number and no useful data
data.head()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,"hey, soul sister",train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,love the way you lie,eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,tik tok,kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,bad romance,lady gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,just the way you are,bruno mars,pop,2010,109,84,64,-5,9,43,221,2,4,78
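
The load and drop steps above could also be combined into one helper. The sketch below is only one possible approach: it tries UTF-8 first, falls back to Latin-1, and reads the unnamed first column as the index so that no separate drop is needed. The function name `load_songs` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def load_songs(path):
    """Read the CSV, trying UTF-8 first and falling back to Latin-1 (hypothetical helper)."""
    for encoding in ("utf-8", "latin-1"):
        try:
            # index_col=0 treats the file's unnamed first column as the index,
            # so it never becomes a data column that has to be dropped later
            df = pd.read_csv(path, encoding=encoding, index_col=0)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
        # discard the row numbers stored in the file and use a fresh 0-based index
        return df.reset_index(drop=True)
    raise ValueError(f"could not decode {path}")
```

Calling `data = load_songs("data.csv")` would then replace both the `read_csv` call and the `drop` call, assuming the file has the same layout as the one used here.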


In [57]:
# Searching for NA values
clean_data = data.isna().any(axis=1) # one boolean per row: True if that row contains any NA value
print(clean_data) # every row displayed is False, so no NA values appear

# a second check: count the NA values in each column
data.isnull().sum() # every column sums to 0, confirming there are no NA values

0      False
1      False
2      False
3      False
4      False
       ...  
598    False
599    False
600    False
601    False
602    False
Length: 603, dtype: bool


title        0
artist       0
top genre    0
year         0
bpm          0
nrgy         0
dnce         0
dB           0
live         0
val          0
dur          0
acous        0
spch         0
pop          0
dtype: int64
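
Because the printed boolean Series is truncated in the middle, the two checks above can also be reduced to plain numbers. This is a minimal sketch; the helper name `count_missing` and the small demo frame are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def count_missing(df):
    """Return (number of rows with any NA, total number of NA cells)."""
    rows_with_na = int(df.isna().any(axis=1).sum())
    total_na_cells = int(df.isna().sum().sum())
    return rows_with_na, total_na_cells


# Made-up frame with one deliberately missing title, for illustration only
demo = pd.DataFrame({"title": ["a", "b", None], "pop": [83, 82, 80]})
print(count_missing(demo))  # (1, 1)
```

Running `count_missing(data)` on the Spotify data should return `(0, 0)` if the checks above are right.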

In [45]:
# Checking data types to see if I need to convert anything
data.dtypes # the columns are only object and int64, so I don't need to change any data types or the data structure

title        object
artist       object
top genre    object
year          int64
bpm           int64
nrgy          int64
dnce          int64
dB            int64
live          int64
val           int64
dur           int64
acous         int64
spch          int64
pop           int64
dtype: object

In [46]:
# Changing column values to lower case
data['title'] = data['title'].str.lower() # song titles had inconsistent capitalization, so lowercasing them prevents mismatches later
data['artist'] = data['artist'].str.lower() # artist names are lowercased too so they are easier to match and filter later
print(data)

                                                 title            artist  \
0                                     hey, soul sister             train   
1                                 love the way you lie            eminem   
2                                              tik tok             kesha   
3                                          bad romance         lady gaga   
4                                 just the way you are        bruno mars   
..                                                 ...               ...   
598                find u again (feat. camila cabello)       mark ronson   
599      cross me (feat. chance the rapper & pnb rock)        ed sheeran   
600  no brainer (feat. justin bieber, chance the ra...         dj khaled   
601    nothing breaks like a heart (feat. miley cyrus)       mark ronson   
602                                   kills you slowly  the chainsmokers   

           top genre  year  bpm  nrgy  dnce  dB  live  val  dur  acous  spch  \
0      

## Exploratory Analysis

This dataset contains 603 rows and 14 columns. It has 584 unique titles, 184 unique artists, 50 unique genres, and 10 unique years, along with 104 unique bpm values, 77 energy values, 70 danceability values, 14 dB values, 61 liveness values, 94 valence values, 144 duration values, 75 acousticness values, 39 speechiness values, and 71 popularity values. There are no missing values in any observation or variable.


In [47]:
data.head()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,"hey, soul sister",train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,love the way you lie,eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,tik tok,kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,bad romance,lady gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,just the way you are,bruno mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


In [48]:
data.tail()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
598,find u again (feat. camila cabello),mark ronson,dance pop,2019,104,66,61,-7,20,16,176,1,3,75
599,cross me (feat. chance the rapper & pnb rock),ed sheeran,pop,2019,95,79,75,-6,7,61,206,21,12,75
600,"no brainer (feat. justin bieber, chance the ra...",dj khaled,dance pop,2019,136,76,53,-5,9,65,260,7,34,70
601,nothing breaks like a heart (feat. miley cyrus),mark ronson,dance pop,2019,114,79,60,-6,42,24,217,1,7,69
602,kills you slowly,the chainsmokers,electropop,2019,150,44,70,-9,13,23,213,6,6,67


In [49]:
data.info() # Shows a quick overview of the dataset. Every column is either object or int64, and all 603 values in each column are non-null.



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      603 non-null    object
 1   artist     603 non-null    object
 2   top genre  603 non-null    object
 3   year       603 non-null    int64 
 4   bpm        603 non-null    int64 
 5   nrgy       603 non-null    int64 
 6   dnce       603 non-null    int64 
 7   dB         603 non-null    int64 
 8   live       603 non-null    int64 
 9   val        603 non-null    int64 
 10  dur        603 non-null    int64 
 11  acous      603 non-null    int64 
 12  spch       603 non-null    int64 
 13  pop        603 non-null    int64 
dtypes: int64(11), object(3)
memory usage: 66.1+ KB


In [50]:
data.shape # This dataset has 603 rows and 14 columns


(603, 14)

In [51]:
data.size # The dataset holds 8442 elements in total (603 rows x 14 columns)

8442

In [54]:
data.ndim # This dataset is two-dimensional (rows and columns)

2

In [59]:
data.nunique() # There are 584 unique song titles, 184 different artists, and 50 different genres in the dataset.

title        584
artist       184
top genre     50
year          10
bpm          104
nrgy          77
dnce          70
dB            14
live          61
val           94
dur          144
acous         75
spch          39
pop           71
dtype: int64
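
Since there are 584 unique titles across 603 rows, some titles appear more than once. The sketch below is one way to list those rows and see whether they also share an artist. The helper name `repeated_rows` is hypothetical and the demo frame is made up for illustration.

```python
import pandas as pd


def repeated_rows(df, subset):
    """Return every row whose values in `subset` occur more than once."""
    mask = df.duplicated(subset=subset, keep=False)  # keep=False flags all copies
    return df[mask].sort_values(subset)


# Made-up frame: "song a" appears twice, by two different artists
demo = pd.DataFrame({
    "title": ["song a", "song b", "song a", "song c"],
    "artist": ["artist x", "artist y", "artist z", "artist w"],
    "year": [2010, 2011, 2012, 2013],
})
print(repeated_rows(demo, ["title"]))            # both "song a" rows
print(repeated_rows(demo, ["title", "artist"]))  # empty: no title and artist pair repeats
```

On the real data, `repeated_rows(data, ["title"])` would show which titles repeat, and adding `"artist"` to the subset would separate a song listed in more than one year from different songs that share a name.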

## Data Dictionary 

   **title** - The title of the song
   
   **artist** - The artist of the song

   **top genre** - The genre of the track

   **year** - The release year of the recording

   **bpm** - Beats per minute, the tempo of the song

   **nrgy** - The energy of a song. The higher the value, the more energetic the song

   **dnce** - The higher the value, the easier it is to dance to this song

   **dB** - Loudness. The higher the value, the louder the song

   **live** - Liveness. The higher the value, the more likely the song is a live recording

   **val** - Valence. The higher the value, the more positive mood for the song

   **dur** - The duration of the song, in seconds

   **acous** - The higher the value the more acoustic the song is

   **spch** - The higher the value the more spoken word the song contains

   **pop** - The higher the value the more popular the song is
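
The abbreviated column names in the dictionary above could be mapped to more readable ones before building the dashboard. This is only a sketch: the descriptive names and the helper `with_readable_names` are my own illustrative choices, not names used by the dataset or by Spotify.

```python
import pandas as pd

# Illustrative names chosen from the data dictionary (assumptions, not official)
COLUMN_NAMES = {
    "top genre": "genre",
    "bpm": "beats_per_minute",
    "nrgy": "energy",
    "dnce": "danceability",
    "dB": "loudness_db",
    "live": "liveness",
    "val": "valence",
    "dur": "duration_seconds",
    "acous": "acousticness",
    "spch": "speechiness",
    "pop": "popularity",
}


def with_readable_names(df):
    """Return a copy of df with the abbreviated columns renamed."""
    return df.rename(columns=COLUMN_NAMES)


# First row of the dataset, copied from the head() output
first_row = pd.DataFrame([{
    "title": "hey, soul sister", "artist": "train", "top genre": "neo mellow",
    "year": 2010, "bpm": 97, "nrgy": 89, "dnce": 67, "dB": -4, "live": 8,
    "val": 80, "dur": 217, "acous": 19, "spch": 4, "pop": 83,
}])
print(list(with_readable_names(first_row).columns))
```

Columns that are not in the mapping, such as `title`, `artist`, and `year`, keep their original names.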

## Brainstorming

For this project, I think a dropdown menu that filters the data by either year or artist would help users search for top songs. I could also add a slider for the years so that the user can select a single year or a wider range of years. One possible visualization is a line graph with the artists on the y-axis and the number of top songs they have had from 2010 to 2019 on the x-axis. I could also show which genre is the most popular with a pie chart, coloring the slices by genre and displaying the percentage of songs in each specific genre, filtered by year with a multi-select dropdown. Finally, I could compare the energy of different genres with a histogram, or show the danceability of different genres.
