# Data analysis of my Spotify

## Description of the project
This project is dedicated to one of my favorite ways of spending free time: listening to music. Because my account is new, the dataset is small, but I regard this analysis as representative of my music taste in general.

The aim of the project is to obtain a statistical and graphical description of my music preferences.

To reach the goal I will:
1. retrieve data from Spotify via Spotify Web API and Spotipy library
2. conduct exploratory data analysis via Pandas library
3. make Tableau Public data viz with pithy conclusions

## Retrieving data

[Spotipy](https://spotipy.readthedocs.io/en/2.22.1/) is one of the Python wrappers for the Spotify Web API. It retrieves the needed data through the Authorization Code flow.

In [1]:
pip install spotipy --upgrade

Note: you may need to restart the kernel to use updated packages.


In [2]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import math

You need to create and register a new application on the Spotify for Developers website, in the My Dashboard section, to generate valid credentials. There you get the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`. The Authorization Code Flow also needs a `REDIRECT_URI` added to your application in My Dashboard (navigate to your application and then Edit Settings).

After that, `SpotifyOAuth` asks you to accept the conditions in the browser and redirects you to a link. Paste that link back into the prompt, and you are ready to start retrieving data.

In [3]:
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id="",
                                               client_secret="",
                                               redirect_uri="http://www.google.com/",
                                               scope="user-library-read"))
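
The credentials above are left blank on purpose. One way to keep them out of the notebook is to read them from environment variables. The sketch below assumes they are stored under `SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET` and `SPOTIPY_REDIRECT_URI` (the names Spotipy itself looks up when the arguments are omitted). The helper `load_spotify_credentials` is a hypothetical name, not part of Spotipy.

```python
import os

REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET", "SPOTIPY_REDIRECT_URI")


def load_spotify_credentials(env=None):
    """Return credentials as keyword arguments; fail early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["SPOTIPY_CLIENT_ID"],
        "client_secret": env["SPOTIPY_CLIENT_SECRET"],
        "redirect_uri": env["SPOTIPY_REDIRECT_URI"],
    }
```

The result could then be passed on as `SpotifyOAuth(scope="user-library-read", **load_spotify_credentials())`.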

As of today (27/01/2023) I have 426 favorite tracks on Spotify. Because the method returns a limited number of tracks per call, I use a `for` loop whose number of iterations equals the number of tracks divided by the limit (20 tracks), rounded up.

In [4]:
number_of_favorite_tracks = 426
num_iterations = math.ceil(number_of_favorite_tracks / 20)

The Spotify Web API provides a wide range of information about a user's music and podcasts and about all tracks and albums on Spotify.

I selected the data I needed and built a DataFrame with the pandas library.

In [5]:
tracks_ids = []
tracks_ids_artist_name = dict()
offset = 0

for i in range(0, num_iterations):
    results = sp.current_user_saved_tracks(offset=offset)
    for item in results['items']:
        tracks_ids.append(item['track']['id'])
        tracks_ids_artist_name[item['track']['id']] = [item['track']['artists'][0]['name'], 
                                                       item['track']['name'], 
                                                       item['track']['album']['name'], 
                                                       item['track']['album']['release_date'], 
                                                       item['added_at'], 
                                                       item['track']['popularity']]
    offset += 20
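
The loop above depends on the hard-coded track count of 426. As an alternative sketch, the paging can follow the `next` field that Spotify paging responses carry, so the count does not need to be known in advance. `fetch_all_items` and `fake_saved_tracks` are hypothetical names. The fake function only stands in for `sp.current_user_saved_tracks` so the example runs offline.

```python
def fetch_all_items(fetch_page, limit=20):
    """Collect items from every page until the paging object has no 'next'."""
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        items.extend(page["items"])
        # Stop when the API reports no further page (or returns an empty page).
        if page.get("next") is None or not page["items"]:
            break
        offset += limit
    return items


# Offline stand-in for sp.current_user_saved_tracks (assumed response shape).
def fake_saved_tracks(limit=20, offset=0, total=45):
    chunk = list(range(offset, min(offset + limit, total)))
    has_more = offset + limit < total
    return {"items": chunk, "next": "more" if has_more else None, "total": total}


all_items = fetch_all_items(fake_saved_tracks)
print(len(all_items))  # 45
```

With the real client, the call would presumably be `fetch_all_items(sp.current_user_saved_tracks)`.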
    

In [6]:
num_iterations_tracks = math.ceil(len(tracks_ids) / 100)
start = 0
end = 100
tracks_features = []
for i in range(0, num_iterations_tracks):
    preliminary_results = sp.audio_features(tracks=tracks_ids[start:end])
    tracks_features.extend(preliminary_results)
    start = end
    end += 100
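
The `start`/`end` bookkeeping above splits the ID list into batches of 100. The same batching can be sketched with a small generator. `chunked` is a hypothetical helper name and the IDs below are placeholders, not real Spotify IDs.

```python
def chunked(sequence, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(sequence), size):
        yield sequence[start:start + size]


placeholder_ids = [f"id{n}" for n in range(426)]
batch_sizes = [len(batch) for batch in chunked(placeholder_ids, 100)]
print(batch_sizes)  # [100, 100, 100, 100, 26]
```

Each batch could then be passed to `sp.audio_features(tracks=batch)` in turn.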

In [7]:
df = pd.DataFrame.from_dict(tracks_features)

In [8]:
df_names = pd.DataFrame.from_dict(tracks_ids_artist_name, orient='index').reset_index()
df_names.columns = ['id', 'artist_name', 'song_name', 'album_name', 'release_date', 'add_date', 'popularity']
df = df.merge(df_names, on='id')

In [9]:
df = df.drop(columns=['analysis_url', 'track_href', 'type'])

Copies of the same song by the same artist that appear on different albums/EPs were removed.

In [10]:
df = df.drop_duplicates(subset=['artist_name', 'song_name'])

After removing the columns that are unnecessary for further analysis, here is the info about the dataframe.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415 entries, 0 to 425
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      415 non-null    float64
 1   energy            415 non-null    float64
 2   key               415 non-null    int64  
 3   loudness          415 non-null    float64
 4   mode              415 non-null    int64  
 5   speechiness       415 non-null    float64
 6   acousticness      415 non-null    float64
 7   instrumentalness  415 non-null    float64
 8   liveness          415 non-null    float64
 9   valence           415 non-null    float64
 10  tempo             415 non-null    float64
 11  id                415 non-null    object 
 12  uri               415 non-null    object 
 13  duration_ms       415 non-null    int64  
 14  time_signature    415 non-null    int64  
 15  artist_name       415 non-null    object 
 16  song_name         415 non-null    object 
 1

* `danceability` - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.


* `energy` - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.


* `key` - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.


* `loudness` - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.


* `mode` - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.


* `speechiness` - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.


* `acousticness` - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.


* `instrumentalness` - Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.


* `liveness` - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.


* `valence` - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).


* `tempo` - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.


* `id` - The Spotify ID for the track.


* `uri` - The Spotify URI for the track.


* `duration_ms` - The duration of the track in milliseconds.


* `time_signature` - An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of "3/4", to "7/4".


* `artist_name` - Song artist name.


* `song_name` - The name of the track.


* `album_name` - The name of the album.


* `release_date` - The date the album was first released.


* `add_date` - The date the track was added to Saved tracks.


* `popularity` - The popularity of the track (the value is read from the track object, not the artist). The value will be between 0 and 100, with 100 being the most popular.
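
To make the `key` and `mode` encodings above easier to read, here is a minimal sketch that turns them into a label. `describe_key` is a hypothetical helper. The pitch names follow the standard Pitch Class notation mentioned in the `key` description.

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Map Spotify's integer key and mode to a label such as 'C major'."""
    if key == -1:  # -1 means no key was detected
        return "unknown"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


print(describe_key(0, 1))   # C major
print(describe_key(9, 0))   # A minor
print(describe_key(-1, 1))  # unknown
```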

The type of the `release_date` and `add_date` columns is changed from `object` to `datetime`; `add_date` is then reduced to the calendar date.

In [12]:
df['release_date'] = pd.to_datetime(df['release_date'])

In [13]:
df['add_date'] = pd.to_datetime(df['add_date'])
df['add_date'] = df['add_date'].dt.date
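
Spotify reports album release dates with year, month, or day precision, so `release_date` strings can look like `1964`, `2017-04`, or `2022-05-13`. The sketch below shows one way to normalize them explicitly by filling missing parts with 1. `parse_release_date` is a hypothetical helper, and the month-precision and day-precision strings are made-up examples.

```python
from datetime import date


def parse_release_date(value):
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD', filling missing parts with 1."""
    parts = [int(part) for part in value.split("-")]
    year, month, day = (parts + [1, 1])[:3]
    return date(year, month, day)


print(parse_release_date("1964"))        # 1964-01-01
print(parse_release_date("2017-04"))     # 2017-04-01
print(parse_release_date("2022-05-13"))  # 2022-05-13
```

This means a date such as `1964-01-01` in the dataframe may only be accurate to the year.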

The dataset for Tableau was saved in `.csv` format.

In [14]:
df.to_csv('spotify_data.csv', index=False)

## EDA (exploratory data analysis)

The largest number of Saved tracks (16 tracks) comes from Kendrick Lamar, who is, without doubt, my favorite artist.

By number of Saved tracks, my top 5 artists look like this:
1. Kendrick Lamar
2. BROCKHAMPTON
3. Duckwrth
4. Frank Ocean
5. Royal Blood

In [15]:
df['artist_name'].value_counts().to_frame().head()

| | artist_name |
|---|---|
| Kendrick Lamar | 16 |
| BROCKHAMPTON | 13 |
| Duckwrth | 11 |
| Frank Ocean | 10 |
| Royal Blood | 10 |


I'm not really a fan of acoustic tracks, tracks without words, or tracks from live concerts. All of this is confirmed by the average values of `liveness`, `acousticness` and `instrumentalness` of my Saved tracks.

In [16]:
df['liveness'].mean()

0.1873737349397591

In [17]:
df['acousticness'].mean()

0.21345098144578323

In [18]:
df['instrumentalness'].mean()

0.09874566906024093

I enjoy dancing so it's not a surprise that on average my music is danceable and energetic, according to the data.

In [19]:
df['danceability'].mean()

0.648671807228916

In [20]:
df['energy'].mean()

0.6465706024096388

I guess we all listen to music based on our mood, so the average valence of my music is almost exactly 0.5, right between positive and sad.

In [21]:
df['valence'].mean()

0.5020681927710844
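
A mean valence of 0.5 does not by itself say how the tracks split between positive and sad. The sketch below counts the share of tracks above a threshold, using two made-up lists that both average 0.5. The values are illustrative only and are not taken from the dataset. `positive_share` is a hypothetical helper.

```python
from statistics import mean


def positive_share(valences, threshold=0.5):
    """Fraction of tracks whose valence is above the threshold."""
    return sum(v > threshold for v in valences) / len(valences)


# Made-up valence lists (not from the dataset); both have a mean of about 0.5.
balanced = [0.1, 0.9, 0.2, 0.8]
skewed = [0.0, 0.6, 0.7, 0.7]

print(round(mean(balanced), 3), positive_share(balanced))  # 0.5 0.5
print(round(mean(skewed), 3), positive_share(skewed))      # 0.5 0.75
```

On the real dataframe, the equivalent check would be `(df['valence'] > 0.5).mean()`.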

Most often I add songs on Monday, least often on Saturday. After the weekend, music helps me get into the work vibe; on the weekend itself I try to use my phone and laptop as little as possible.

In [22]:
df['add_date'].apply(lambda x : x.weekday()).value_counts()
# 0 - Monday, 6 - Sunday

0    164
1    143
3     42
2     32
4     28
6      5
5      1
Name: add_date, dtype: int64
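
The weekday numbers are easier to read as names and shares. This is a minimal sketch that reuses the counts printed above rather than the dataframe itself.

```python
import calendar

# Counts copied from the value_counts() output above (0 = Monday, 6 = Sunday).
weekday_counts = {0: 164, 1: 143, 3: 42, 2: 32, 4: 28, 6: 5, 5: 1}

total = sum(weekday_counts.values())
for day in sorted(weekday_counts):
    share = weekday_counts[day] / total
    print(f"{calendar.day_name[day]:<9} {weekday_counts[day]:>3}  {share:.1%}")
```

Monday and Tuesday together account for roughly 74% of all additions (307 of 415 tracks).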

I know that I don't have a unique taste in music; I usually listen to mainstream tracks (many of them from TikTok or films and series). That's why the majority of the tracks were released in the last 10 years.

In [23]:
df['release_date'].apply(lambda x : x.year).value_counts().head(10)

2022    38
2017    34
2018    31
2019    31
2021    30
2009    22
2020    21
2016    19
2015    18
2014    17
Name: release_date, dtype: int64
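
The "majority" claim can be checked as a lower bound from the printed counts alone. The sketch assumes that "the last 10 years" means 2014 or later and uses the 415 rows reported by `df.info()`. Years outside the top 10 are not counted, so the true share can only be higher.

```python
# Counts copied from the top-10 value_counts() output above.
top_years = {2022: 38, 2017: 34, 2018: 31, 2019: 31, 2021: 30,
             2009: 22, 2020: 21, 2016: 19, 2015: 18, 2014: 17}
total_tracks = 415  # row count reported by df.info()

recent = sum(count for year, count in top_years.items() if year >= 2014)
print(recent, f"{recent / total_tracks:.1%}")  # 239 57.6%
```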

The 'oldest' song in my Saved tracks is These Arms of Mine by Otis Redding. It is a beautiful, ageless song.

In [24]:
min_date_release = min(df['release_date'])
df[df['release_date'] == min_date_release][['artist_name', 'song_name', 'release_date']]

| | artist_name | song_name | release_date |
|---|---|---|---|
| 403 | Otis Redding | These Arms of Mine | 1964-01-01 |


Hit songs are about 3 minutes long because of two major factors: the historic popularity of the 45 rpm record and the monetization methods applied by radio stations and record producers throughout the 20th century ([source](https://www.musicianwave.com/whats-the-average-length-of-a-song-year/)).

As I mentioned previously, my preferences are more mainstream than unique. However, the rap songs I usually listen to are longer than 3 minutes. Therefore, the average duration of a track in my Saved tracks is almost 4 minutes.

In [25]:
# 1 min = 60000 ms
df['duration_ms'].mean() / 60000

3.7483673493975904

For instance, the mean duration of the Kendrick Lamar tracks in my Saved tracks is more than 4 minutes.

In [26]:
df[df['artist_name'] == "Kendrick Lamar"]['duration_ms'].mean() / 60000

4.555336458333334
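
Fractional minutes are a little awkward to read, so here is a small sketch that converts the two averages above into minutes and seconds. `minutes_to_clock` is a hypothetical helper.

```python
def minutes_to_clock(minutes):
    """Convert fractional minutes to an 'M:SS' string, rounded to whole seconds."""
    total_seconds = round(minutes * 60)
    whole_minutes, seconds = divmod(total_seconds, 60)
    return f"{whole_minutes}:{seconds:02d}"


print(minutes_to_clock(3.7483673493975904))  # 3:45 (all Saved tracks)
print(minutes_to_clock(4.555336458333334))   # 4:33 (Kendrick Lamar tracks)
```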

The tempo is the speed or pace of a given piece and derives directly from the average beat duration, so the higher the tempo, the faster a song feels.

The average tempo of my Saved songs is about 118 BPM. To get a feel for what that sounds like, I slice the dataframe for songs with a tempo between 118 and 119 BPM. The resulting songs are rather danceable, and are also among my all-time favorites.

In [27]:
df['tempo'].mean()

118.56232289156618

In [28]:
df[(df['tempo'] > 118) & (df['tempo'] < 119)][['artist_name', 'song_name']]

| | artist_name | song_name |
|---|---|---|
| 48 | Beyoncé | MOVE (feat. Grace Jones & Tems) |
| 92 | NEIL FRANCES | Music Sounds Better with You |
| 121 | Lady Gaga | Just Dance |
| 361 | Muse | Time is Running Out |
| 389 | Depeche Mode | Strangelove |
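
Since tempo derives from the average beat duration, the relation can be written out directly: one beat lasts 60 / BPM seconds. A minimal sketch using the mean tempo computed above (`beat_duration_seconds` is a hypothetical helper):

```python
def beat_duration_seconds(tempo_bpm):
    """Average length of one beat in seconds for a tempo given in BPM."""
    return 60.0 / tempo_bpm


mean_tempo = 118.56232289156618  # value from df['tempo'].mean() above
print(round(beat_duration_seconds(mean_tempo), 3))  # 0.506
```

So at the average tempo of my Saved songs, a beat lasts about half a second.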


To conclude, I want to summarize the facts about my music preferences according to my Saved tracks on Spotify:
1. on average my favorite music is danceable and energetic
2. the largest number of Saved tracks (16 tracks) comes from Kendrick Lamar
3. most often I add songs on Monday, least often on Saturday
4. the average valence is about 0.5, halfway between sad and positive
5. the majority of the tracks in my Saved tracks were released in the last 10 years
6. the average duration of a track in my Saved tracks is almost 4 minutes
7. the average tempo of my Saved songs is about 118 BPM, as in, for instance, Muse - Time is Running Out

## Tableau dataviz
[Link to Tableau Public data viz](https://public.tableau.com/app/profile/polina.kopteva/viz/Musicpreferncesanalysis/Main?publish=yes)