# Final Project 
#### Eunice Kim
#### DS 4003

## Data Introduction 

https://www.kaggle.com/datasets/leonardopena/top-spotify-songs-from-20102019-by-year/data

This dataset summarizes the music industry by listing the top songs of each year from 2010 to 2019 globally. Collected from Spotify and Billboard, it captures music trends across genres and audiences. Spotify is one of the biggest and most popular music streaming applications, and both artists and producers release their singles, albums, and exclusive content through it. With a wide collection of data on consumer preferences and listening habits, Spotify is a valuable resource for understanding what drives viral hits in the music industry. My goal is to use the insights from this dataset to create an interactive dashboard that lets users explore different genres, styles, and the most popular songs overall. The dashboard could also support music recommendations based on user preferences. Although many other datasets were available, this one stood out for its balance of depth and breadth, and it provides a solid basis for accomplishing my goals.


## Data Cleaning

In [31]:
# import dependencies
import pandas as pd
import numpy as np
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

In [32]:
# import the data
data = pd.read_csv("data.csv", encoding='latin-1') # the file would not load until encoding='latin-1' was added
data.head() # show the first rows of the dataset

# drop unnecessary columns
data = data.drop(['Unnamed: 0'], axis=1) # the first column only repeats the row number and holds no useful data
data.head()

# rename columns so they are easier to read and understand
data = data.rename(columns={
    'bpm': 'Beats Per Minute',
    'nrgy': 'Energy',
    'dnce': 'Danceability',
    'dB': 'Loudness',
    'live': 'Liveness',
    'val': 'Valence',
    'dur': 'Duration',
    'acous': 'Acousticness',
    'spch': 'Speechiness',
    'pop': 'Popularity'
})

# drop row 442, whose values are all 0
data = data.drop(442)

# Remove symbols and spaces from the artist column
# regex=False makes '.' and '+' literal characters rather than regex syntax
data['artist'] = data['artist'].str.replace('!', '', regex=False)
data['artist'] = data['artist'].str.replace('.', '', regex=False)
data['artist'] = data['artist'].str.replace('-', '', regex=False)
data['artist'] = data['artist'].str.replace('+', '', regex=False)
data['artist'] = data['artist'].str.replace('&', '', regex=False)
data['artist'] = data['artist'].str.replace(' ', '', regex=False)
print(data)

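# edit artist names so they are shown correctly in the word cloud
# note: Series.replace matches whole values, so names whose spaces or symbols were already stripped above will no longer match here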
data['artist'] = data['artist'].replace('Florence + The Machine', 'Florence and The Machine')
data['artist'] = data['artist'].replace('Beyoncé', 'Beyonce')
data['artist'] = data['artist'].replace('3OH!3', '3OH3')
data['artist'] = data['artist'].replace('T.I.', 'TI')
data['artist'] = data['artist'].replace('P!nk', 'Pink')
data['artist'] = data['artist'].replace('Selena Gomez & The Scene', 'Selena Gomez and The Scene')
data['artist'] = data['artist'].replace('fun.', 'Fun')
data['artist'] = data['artist'].replace('Ne-Yo', 'NeYo')
data['artist'] = data['artist'].replace('will.i.am', 'William')
data['artist'] = data['artist'].replace('Emeli Sandé', 'Emeli Sande')
data['artist'] = data['artist'].replace('MAGIC!', 'MAGIC')
data['artist'] = data['artist'].replace('Mr. Probz', 'Mr Probz')
data['artist'] = data['artist'].replace('G-Eazy', 'G Eazy')
data['artist'] = data['artist'].replace('BØRNS', 'BORNS')
data['artist'] = data['artist'].replace('MØ', 'MO')
data['artist'] = data['artist'].replace('Dan + Shay', 'Dan and Shay')
data['artist'] = data['artist'].replace('N.E.R.D', 'NERD')
data['artist'] = data['artist'].replace('Macklemore & Ryan Lewis', 'Macklemore and Ryan Lewis')


                                                 title           artist  \
0                                     Hey, Soul Sister            Train   
1                                 Love The Way You Lie           Eminem   
2                                              TiK ToK            Kesha   
3                                          Bad Romance         LadyGaga   
4                                 Just the Way You Are        BrunoMars   
..                                                 ...              ...   
598                Find U Again (feat. Camila Cabello)       MarkRonson   
599      Cross Me (feat. Chance the Rapper & PnB Rock)        EdSheeran   
600  No Brainer (feat. Justin Bieber, Chance the Ra...         DJKhaled   
601    Nothing Breaks Like a Heart (feat. Miley Cyrus)       MarkRonson   
602                                   Kills You Slowly  TheChainsmokers   

           top genre  year  Beats Per Minute  Energy  Danceability  Loudness  \
0         neo mello

In [33]:
# Changing column values to lower case
#data['title'] = data['title'].str.lower() # Changed the song titles all to lower case because they had random upper and lower cases, will prevent silly mistakes in the future
#data['artist'] = data['artist'].str.lower() # Changed artist names to lower case so they are easier to work with in the future
#print(data)
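
The symbol stripping and the name fixes above could also be expressed as a single pass. The sketch below is one possible approach, not the notebook's actual code: `NAME_FIXES`, `SYMBOLS`, and `clean_artist` are hypothetical names, and the small sample frame stands in for the real data. It applies the whole-name fixes first, while the names still contain their original characters, and then strips the symbols as literal characters.

```python
import pandas as pd

# Hypothetical mapping of whole artist names to word-cloud-friendly spellings
NAME_FIXES = {
    "Florence + The Machine": "Florence and The Machine",
    "Beyoncé": "Beyonce",
    "P!nk": "Pink",
}

# Characters to strip after the whole-name fixes have been applied
SYMBOLS = ["!", ".", "-", "+", "&", " "]


def clean_artist(names: pd.Series) -> pd.Series:
    """Apply whole-name fixes, then strip symbols as literal characters."""
    cleaned = names.replace(NAME_FIXES)  # exact, whole-value matches only
    for symbol in SYMBOLS:
        # regex=False treats '.' and '+' as plain characters, not regex syntax
        cleaned = cleaned.str.replace(symbol, "", regex=False)
    return cleaned


# Made-up sample rows used only to illustrate the call
sample = pd.DataFrame(
    {"artist": ["Florence + The Machine", "Beyoncé", "P!nk", "Lady Gaga"]}
)
sample["artist"] = clean_artist(sample["artist"])
print(sample["artist"].tolist())
```

Keeping the fixes in one dictionary makes the order of operations explicit, which matters because an exact-match replacement cannot find a name once its spaces and symbols are gone.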

## Exploratory Analysis

After cleaning, this dataset contains 602 rows and 14 columns with 583 unique title values, 184 unique artist values, 50 unique genre values, 103 unique bpm values, 76 unique energy values, 69 danceability values, 13 dB values, 60 liveness values, 93 valence values, 144 duration values, 75 acousticness values, 38 speechiness values, and 71 popularity values. There are no missing values in any observation or variable. 
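
The counts in this paragraph can be collected in one place rather than read off several cells. The following is a minimal sketch, assuming a pandas DataFrame as input; `summarize` is a hypothetical helper and the toy frame is made up for illustration, so the numbers it prints are not the dataset's.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the shape, total missing-value count, and per-column unique counts."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),  # total NaN cells across all columns
        "unique": df.nunique().to_dict(),
    }


# Small made-up frame used only to illustrate the call
toy = pd.DataFrame({
    "title": ["A", "B", "B"],
    "year": [2010, 2010, 2011],
})
summary = summarize(toy)
print(summary)
```

Calling `summarize(data)` after the cleaning steps would give figures that can be compared directly against the prose summary.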


In [34]:
data.head()

| | title | artist | top genre | year | Beats Per Minute | Energy | Danceability | Loudness | Liveness | Valence | Duration | Acousticness | Speechiness | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hey, Soul Sister | Train | neo mellow | 2010 | 97 | 89 | 67 | -4 | 8 | 80 | 217 | 19 | 4 | 83 |
| 1 | Love The Way You Lie | Eminem | detroit hip hop | 2010 | 87 | 93 | 75 | -5 | 52 | 64 | 263 | 24 | 23 | 82 |
| 2 | TiK ToK | Kesha | dance pop | 2010 | 120 | 84 | 76 | -3 | 29 | 71 | 200 | 10 | 14 | 80 |
| 3 | Bad Romance | LadyGaga | dance pop | 2010 | 119 | 92 | 70 | -4 | 8 | 71 | 295 | 0 | 4 | 79 |
| 4 | Just the Way You Are | BrunoMars | pop | 2010 | 109 | 84 | 64 | -5 | 9 | 43 | 221 | 2 | 4 | 78 |


In [35]:
data.tail()

| | title | artist | top genre | year | Beats Per Minute | Energy | Danceability | Loudness | Liveness | Valence | Duration | Acousticness | Speechiness | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 598 | Find U Again (feat. Camila Cabello) | MarkRonson | dance pop | 2019 | 104 | 66 | 61 | -7 | 20 | 16 | 176 | 1 | 3 | 75 |
| 599 | Cross Me (feat. Chance the Rapper & PnB Rock) | EdSheeran | pop | 2019 | 95 | 79 | 75 | -6 | 7 | 61 | 206 | 21 | 12 | 75 |
| 600 | No Brainer (feat. Justin Bieber, Chance the Ra... | DJKhaled | dance pop | 2019 | 136 | 76 | 53 | -5 | 9 | 65 | 260 | 7 | 34 | 70 |
| 601 | Nothing Breaks Like a Heart (feat. Miley Cyrus) | MarkRonson | dance pop | 2019 | 114 | 79 | 60 | -6 | 42 | 24 | 217 | 1 | 7 | 69 |
| 602 | Kills You Slowly | TheChainsmokers | electropop | 2019 | 150 | 44 | 70 | -9 | 13 | 23 | 213 | 6 | 6 | 67 |


In [36]:
data.info() # Shows a quick overview of the dataset. The column dtypes are either object or int64, and there are no null values.



<class 'pandas.core.frame.DataFrame'>
Index: 602 entries, 0 to 602
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   title             602 non-null    object
 1   artist            602 non-null    object
 2   top genre         602 non-null    object
 3   year              602 non-null    int64 
 4   Beats Per Minute  602 non-null    int64 
 5   Energy            602 non-null    int64 
 6   Danceability      602 non-null    int64 
 7   Loudness          602 non-null    int64 
 8   Liveness          602 non-null    int64 
 9   Valence           602 non-null    int64 
 10  Duration          602 non-null    int64 
 11  Acousticness      602 non-null    int64 
 12  Speechiness       602 non-null    int64 
 13  Popularity        602 non-null    int64 
dtypes: int64(11), object(3)
memory usage: 70.5+ KB


In [37]:
data.shape # After cleaning, this dataset has 602 rows and 14 columns


(602, 14)

In [38]:
data.size # Total of 8428 elements in this dataset (602 rows x 14 columns)

8428

In [39]:
data.ndim # This dataset is only two dimensional

2

In [40]:
data.nunique() # There are 583 unique song titles, 184 different artists, and 50 genres in the dataset.

title               583
artist              184
top genre            50
year                 10
Beats Per Minute    103
Energy               76
Danceability         69
Loudness             13
Liveness             60
Valence              93
Duration            144
Acousticness         75
Speechiness          38
Popularity           71
dtype: int64

In [41]:
# save the cleaned dataset without the index column
data.to_csv('data2', index=False)

## Data Dictionary 

   **title** - The title of the song
   
   **artist** - The artist of the song

   **top genre** - The genre of the track

   **year** - The release year of the recording

   **Beats Per Minute** (originally `bpm`) - The tempo of the song 

   **Energy** (originally `nrgy`) - The energy of a song. The higher the value, the more energetic the song

   **Danceability** (originally `dnce`) - The higher the value, the easier it is to dance to this song

   **Loudness** (originally `dB`) - The higher the value, the louder the song

   **Liveness** (originally `live`) - The higher the value, the more likely the song is a live recording

   **Valence** (originally `val`) - The higher the value, the more positive the mood of the song

   **Duration** (originally `dur`) - The duration of the song (in seconds)

   **Acousticness** (originally `acous`) - The higher the value, the more acoustic the song is

   **Speechiness** (originally `spch`) - The higher the value, the more spoken word the song contains

   **Popularity** (originally `pop`) - The higher the value, the more popular the song is

## Brainstorming

For this project, I think a dropdown menu that filters the data by either year or artist would help users search for top songs. I could also create a slider for the years so that the user can select a single year or view the data across a chosen range of years. One possible visualization is a line graph with the artists on the y axis and the number of top songs they had from 2010 to 2019 on the x axis. I could also show which genre is the most popular with a pie chart that colors the slices by genre and displays the percentage of songs in each specific genre, filtered by year using a multi-select dropdown. Finally, I could compare the energy or the danceability of different genres with a histogram. 
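
The year filter and genre percentages described above come down to a small amount of pandas logic that a dashboard callback could call. The sketch below is one way to do it, not a finished dashboard component: `genre_share` is a hypothetical helper, the toy frame is made up, and the column names `year` and `top genre` are taken from the cleaned data.

```python
import pandas as pd


def genre_share(df: pd.DataFrame, start_year: int, end_year: int) -> pd.Series:
    """Return each genre's percentage of songs within an inclusive year range."""
    in_range = df[df["year"].between(start_year, end_year)]
    # normalize=True gives fractions; multiply by 100 for percentages
    return in_range["top genre"].value_counts(normalize=True).mul(100).round(1)


# Small made-up frame used only to illustrate the call
toy = pd.DataFrame({
    "top genre": ["pop", "dance pop", "dance pop", "pop", "electropop"],
    "year": [2010, 2010, 2011, 2012, 2019],
})
shares = genre_share(toy, 2010, 2012)
print(shares)
```

A year slider would supply `start_year` and `end_year`, and the returned Series could then feed a pie chart, for example through the `names` and `values` arguments of `px.pie`.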
