# Spotify Data

## Index

* [Introduction](#introduction)
* [Data description](#description)
* [Questions](#questions)

## Introduction <a class="anchor" id="introduction"></a>

The dataset I will work with in this notebook is a collection of *Spotify* songs with several different attributes. I think it could be useful for answering questions about musical genres and the popularity of songs and artists. The data was found in the *Kaggle* datasets repository and can be downloaded from this [link](https://www.kaggle.com/edalrami/19000-spotify-songs#song_data.csv). Let's get started by loading the data and describing what it looks like.

## Data description <a class="anchor" id="description"></a>

As stated in the introduction, this dataset was taken from *Kaggle* and is composed of different kinds of information for 19000 *Spotify* songs. The data comes in two different *CSV* files, each with different information about the songs, so first of all let's load both files into this notebook.

In [11]:
import pandas as pd

# Load both CSV files; index_col=0 makes the first column (song_name) the index.
songs_df = pd.read_csv("./data/song_data.csv", index_col=0)
info_df = pd.read_csv("./data/song_info.csv", index_col=0)
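
To make the effect of `index_col=0` concrete, here is a minimal sketch that reads a tiny in-memory CSV instead of the downloaded files. The name `sample_csv` and the reduced set of columns are assumptions made only for illustration; the two rows are copied from the data shown in this notebook.

```python
import io

import pandas as pd

# Two rows copied from song_data.csv, held in memory so the sketch runs
# without the downloaded files (hypothetical, reduced column set).
sample_csv = io.StringIO(
    "song_name,song_popularity,song_duration_ms\n"
    "In The End,66,216933\n"
    "Seven Nation Army,76,231733\n"
)

# index_col=0 turns the first column (song_name) into the row index.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.index.name)  # name of the index column
print(sample_df.shape)       # (rows, columns), not counting the index
print(sample_df.dtypes)      # inferred type of each remaining column
```

The same three checks can be run on `songs_df` and `info_df` right after loading to confirm that the files were read as expected.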

Now let's see what these two dataframes look like so we can get an idea of the information inside them.

In [12]:
songs_df.head()

| song_name | song_popularity | song_duration_ms | acousticness | danceability | energy | instrumentalness | key | liveness | loudness | audio_mode | speechiness | tempo | time_signature | audio_valence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boulevard of Broken Dreams | 73 | 262333 | 0.00552 | 0.496 | 0.682 | 2.9e-05 | 8 | 0.0589 | -4.095 | 1 | 0.0294 | 167.06 | 4 | 0.474 |
| In The End | 66 | 216933 | 0.0103 | 0.542 | 0.853 | 0.0 | 3 | 0.108 | -6.407 | 0 | 0.0498 | 105.256 | 4 | 0.37 |
| Seven Nation Army | 76 | 231733 | 0.00817 | 0.737 | 0.463 | 0.447 | 0 | 0.255 | -7.828 | 1 | 0.0792 | 123.881 | 4 | 0.324 |
| By The Way | 74 | 216933 | 0.0264 | 0.451 | 0.97 | 0.00355 | 0 | 0.102 | -4.938 | 1 | 0.107 | 122.444 | 4 | 0.198 |
| How You Remind Me | 56 | 223826 | 0.000954 | 0.447 | 0.766 | 0.0 | 10 | 0.113 | -5.065 | 1 | 0.0313 | 172.011 | 4 | 0.574 |


In [13]:
info_df.head()

| song_name | artist_name | album_names | playlist |
| --- | --- | --- | --- |
| Boulevard of Broken Dreams | Green Day | Greatest Hits: God's Favorite Band | 00s Rock Anthems |
| In The End | Linkin Park | Hybrid Theory | 00s Rock Anthems |
| Seven Nation Army | The White Stripes | Elephant | 00s Rock Anthems |
| By The Way | Red Hot Chili Peppers | By The Way (Deluxe Version) | 00s Rock Anthems |
| How You Remind Me | Nickelback | Silver Side Up | 00s Rock Anthems |


As we can see, *songs_df* contains the attributes listed below. I found descriptions of some of them in this [article](https://towardsdatascience.com/is-my-spotify-music-boring-an-analysis-involving-music-data-and-machine-learning-47550ae931de) and in the [Spotify Web API documentation](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/):

- **Song name**: The name of the song; it serves as the index of the dataframe.

- **Song popularity**: A number indicating how popular a song is. The higher the number, the more popular the song.

- **Song duration ms**: The duration of the song in milliseconds.

- **Acousticness**: This value describes how acoustic a song is. A score of 1.0 means the song is most likely to be an acoustic one.

- **Danceability**: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

- **Energy**: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.

- **Instrumentalness**: This value estimates how likely the song is to contain no vocals. The closer it is to 1.0, the more likely the song is instrumental.

- **Key**: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

- **Liveness**: This value describes the probability that the song was recorded with a live audience. According to the official documentation “a value above 0.8 provides strong likelihood that the track is live”.

- **Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

- **Audio mode**: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

- **Speechiness**: “Speechiness detects the presence of spoken words in a track”. If the speechiness of a song is above 0.66, it is probably made up of spoken words; a score between 0.33 and 0.66 indicates a song that may contain both music and speech; and a score below 0.33 means the song most likely contains no speech.

- **Tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

- **Time signature**: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

- **Audio valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
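
Several of these attributes are encoded as numbers that are easier to read once decoded. The following is a minimal sketch, not part of the original analysis, that turns the key, mode, duration and speechiness columns into readable values. It uses three rows copied from the `songs_df.head()` output so it runs without the CSV files; the names `sample`, `decoded`, `PITCH_CLASSES` and the bucket labels are assumptions chosen for illustration.

```python
import pandas as pd

# Three rows copied from the songs_df.head() output (subset of columns).
sample = pd.DataFrame(
    {
        "song_duration_ms": [262333, 216933, 231733],
        "key": [8, 3, 0],
        "audio_mode": [1, 0, 1],
        "speechiness": [0.0294, 0.0498, 0.0792],
    },
    index=pd.Index(
        ["Boulevard of Broken Dreams", "In The End", "Seven Nation Army"],
        name="song_name",
    ),
)

# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B; -1 means no key detected.
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]
key_names = {number: name for number, name in enumerate(PITCH_CLASSES)}
key_names[-1] = "unknown"

decoded = pd.DataFrame(index=sample.index)
decoded["duration_min"] = sample["song_duration_ms"] / 60_000  # ms -> minutes
decoded["key_name"] = sample["key"].map(key_names)
decoded["mode_name"] = sample["audio_mode"].map({1: "major", 0: "minor"})

# Speechiness buckets using the 0.33 and 0.66 thresholds from the description.
# The labels are hypothetical; values exactly on a threshold go to the lower bucket.
decoded["speech_class"] = pd.cut(
    sample["speechiness"],
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["mostly music", "music and speech", "mostly speech"],
    include_lowest=True,
)

print(decoded)
```

Applying the same mappings to the full `songs_df` should work unchanged, assuming its `key` and `audio_mode` columns only contain the values described in the list.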

On the other hand, *info_df* contains the following attributes:

- **Song name**: The name of the song; it serves as the index of the dataframe. We can link this dataframe with the other one through this field.

- **Artist name**: The artist to whom the song belongs.

- **Album name**: Name of the album containing the song.

- **Playlist**: The *Spotify* playlist from which the song was retrieved.
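
Since both dataframes share the song name as their index, one way to link them is an index-on-index join. The sketch below is only an illustration under stated assumptions: it uses two rows copied from the `head()` outputs (the names `songs_sample`, `info_sample` and `merged` are hypothetical), and it assumes song names are unique. I have not verified that for the full files; if a name appears more than once, the join would produce one row per matching pair.

```python
import pandas as pd

# Rows copied from the songs_df.head() and info_df.head() outputs (subset of columns).
songs_sample = pd.DataFrame(
    {"song_popularity": [73, 66], "tempo": [167.06, 105.256]},
    index=pd.Index(["Boulevard of Broken Dreams", "In The End"], name="song_name"),
)
info_sample = pd.DataFrame(
    {
        "artist_name": ["Green Day", "Linkin Park"],
        "playlist": ["00s Rock Anthems", "00s Rock Anthems"],
    },
    index=pd.Index(["Boulevard of Broken Dreams", "In The End"], name="song_name"),
)

# Both frames are indexed by song_name, so join() aligns them on the index.
# how="inner" keeps only the songs present in both frames.
merged = songs_sample.join(info_sample, how="inner")

# Guard for this sketch: repeated song names would multiply rows in the join.
assert merged.index.is_unique

print(merged)
```

On the full data, checking `songs_df.index.is_unique` and `info_df.index.is_unique` before joining would show whether the uniqueness assumption holds.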