## **Overview**

This project analyzes a user's music preferences using data from the Spotify API together with the user's streaming history. Using Python and its data analysis libraries, it examines listening habits, preferred artists and tracks, and related metrics, with the goal of better understanding the user's musical tastes and behavior.

## **Import Package**

In [117]:
# Importing all the required libraries and packages
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import warnings

In [118]:
warnings.filterwarnings("ignore")
# pio.renderers.default = "svg"

In [119]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## **Data Overview**

### **Read Data**

In [120]:
spotify_data = pd.read_csv(r'D:\Repository\spotify-analysis\Data\all_spotify_data.csv', low_memory=False)

In [121]:
spotify_data.head()

Unnamed: 0.1,Unnamed: 0,ts,platform,ms_played,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_uri,track_id,track_name,release_date,length,track_popularity,album_id,album_name,artist_name,acousticness,danceability,energy,instrumentalness,liveness,speechiness,loudness,mode,tempo,time_signature
0,0,2019-02-08T22:51:39Z,android,31267,,,,"8 - ""Taste"" by Rebecca Woodmass",A Little Poem with Rebecca Woodmass,spotify:episode:0lY0dPwAfZoPgSepp1eCrp,clickrow,endplay,False,,False,1549666235287,False,,,,,,,,,,,,,,,,,,,
1,1,2019-02-09T03:50:12Z,android,5463,,,,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False,,,,,,,,,,,,,,,,,,,
2,2,2019-02-09T03:50:33Z,android,5463,,,,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False,,,,,,,,,,,,,,,,,,,
3,3,2019-02-09T03:50:51Z,android,5463,,,,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False,,,,,,,,,,,,,,,,,,,
4,4,2019-02-09T11:05:55Z,android,77742,,,,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,appload,logout,False,,False,1549710054124,False,,,,,,,,,,,,,,,,,,,


### **Detailed Data Information**

In [122]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134325 entries, 0 to 134324
Data columns (total 36 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         134325 non-null  int64  
 1   ts                                 134325 non-null  object 
 2   platform                           134325 non-null  object 
 3   ms_played                          134325 non-null  int64  
 4   master_metadata_track_name         131845 non-null  object 
 5   master_metadata_album_artist_name  131845 non-null  object 
 6   master_metadata_album_album_name   131845 non-null  object 
 7   episode_name                       2403 non-null    object 
 8   episode_show_name                  2400 non-null    object 
 9   spotify_episode_uri                2403 non-null    object 
 10  reason_start                       134325 non-null  object 
 11  reason_end                         1343

In [123]:
spotify_data.describe()

Unnamed: 0.1,Unnamed: 0,ms_played,offline_timestamp,length,track_popularity,acousticness,danceability,energy,instrumentalness,liveness,speechiness,loudness,mode,tempo,time_signature
count,134325.0,134325.0,134325.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0,20868.0
mean,67162.0,172787.2,779241900000.0,225869.20208,67.387148,0.493633,0.546404,0.493545,0.066855,0.157076,0.05076,-8.556674,0.869034,118.845372,3.909574
std,38776.431792,140097.6,812493800000.0,54265.271242,15.440956,0.343641,0.135323,0.219124,0.216872,0.10546,0.047525,3.954602,0.337371,30.436768,0.332888
min,0.0,0.0,0.0,48000.0,2.0,2.6e-05,0.0793,0.00244,0.0,0.0334,0.0232,-44.677,0.0,52.463,1.0
25%,33581.0,111880.0,1682916000.0,194469.0,60.0,0.13,0.443,0.32,0.0,0.0978,0.0308,-10.634,1.0,95.932,4.0
50%,67162.0,187172.0,1713447000.0,219724.0,70.0,0.514,0.558,0.469,9e-06,0.113,0.0363,-7.726,1.0,115.094,4.0
75%,100743.0,226686.0,1634229000000.0,252266.0,78.0,0.831,0.643,0.663,0.000912,0.169,0.052,-5.872,1.0,139.99,4.0
max,134324.0,9645688.0,1665470000000.0,493400.0,96.0,0.996,0.925,0.988,0.982,0.626,0.463,-1.329,1.0,199.811,5.0


In [124]:
spotify_data.describe(include='object')

Unnamed: 0,ts,platform,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,skipped,track_uri,track_id,track_name,release_date,album_id,album_name,artist_name
count,134325,134325,131845,131845,131845,2403,2400,2403,134325,134325,65915,131845,20868,20868,20868,20868,20868,20868
unique,122160,3,14810,6463,11732,1582,329,1584,9,10,4,18207,582,544,355,415,409,117
top,2022-01-01T11:42:30Z,windows,Location,LANY,Menari Dengan Bayangan,YOASOBI THE BOOK 2 Spotify Edition,Mishary Rashid Alafasy,spotify:episode:0sPNGPetAt5PP9IXKD6Bko,trackdone,trackdone,False,152lZdxL1OR0ZMW6KquMif,6xGruZOHLs39ZbVccQTuPZ,Glimpse of Us,2019-11-29,1DAuVHMlBvIjzWZALSUXbn,Menari Dengan Bayangan,LANY
freq,331,118996,387,3936,854,58,371,58,94047,93480,39505,382,306,312,666,666,666,872


## **Data Preprocessing**

### **Data Type**

#### **Checking Data Type**

In [125]:
spotify_data.dtypes

Unnamed: 0                             int64
ts                                    object
platform                              object
ms_played                              int64
master_metadata_track_name            object
master_metadata_album_artist_name     object
master_metadata_album_album_name      object
episode_name                          object
episode_show_name                     object
spotify_episode_uri                   object
reason_start                          object
reason_end                            object
shuffle                                 bool
skipped                               object
offline                                 bool
offline_timestamp                      int64
incognito_mode                          bool
track_uri                             object
track_id                              object
track_name                            object
release_date                          object
length                               float64
track_popu

#### **Fixing Data Type**

In [126]:
# Convert 'ts' column to datetime format
spotify_data['ts'] = pd.to_datetime(spotify_data['ts'])

In [127]:
spotify_data['ts'].dtypes

datetime64[ns, UTC]

### **Data Duplicates**

#### **Identifying Data Duplicates**

In [128]:
spotify_data.duplicated().value_counts()

False    134325
Name: count, dtype: int64

In [129]:
# duplicate_rows = spotify_data[spotify_data.duplicated()]
# duplicate_rows.head()

In [130]:
# duplicate_rows = spotify_data[spotify_data.duplicated(keep=False)]
# duplicate_rows.head()
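
One caveat about the check above: judging by its summary statistics (134,325 values running from 0 to 134,324), `Unnamed: 0` appears to be a running row counter. If so, a comparison across all columns cannot flag any row as a duplicate. The sketch below uses a small hypothetical frame, not the real data, to show the difference when counter-like columns are left out of the comparison.

```python
import pandas as pd

# Hypothetical toy frame; 'Unnamed: 0' mimics a unique row counter.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "ts": ["2019-02-09T03:50:12Z", "2019-02-09T03:50:12Z", "2019-02-09T03:50:33Z"],
    "ms_played": [5463, 5463, 5463],
})

# With the counter included, every row is unique by construction.
dupes_all_columns = int(toy.duplicated().sum())

# Leaving the counter out compares only the actual stream attributes.
content_columns = [c for c in toy.columns if not c.startswith("Unnamed")]
dupes_on_content = int(toy.duplicated(subset=content_columns).sum())

print(dupes_all_columns, dupes_on_content)
```

Whether rows that match on every content column should count as true duplicates in the streaming history is a judgment call that this sketch does not settle.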

### **Missing Values**

#### **Identifying Missing Values**

In [131]:
# Summarize missing values per column: absolute count and share of all rows
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    return mis_val_table_ren_columns

In [132]:
missing_values = missing_values_table(spotify_data)
missing_values

Your selected dataframe has 36 columns.
There are 26 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
episode_show_name,131925,98.2
episode_name,131922,98.2
spotify_episode_uri,131922,98.2
album_id,113457,84.5
artist_name,113457,84.5
tempo,113457,84.5
mode,113457,84.5
loudness,113457,84.5
speechiness,113457,84.5
liveness,113457,84.5


`episode_show_name`, `episode_name`, and `spotify_episode_uri` have the highest share of missing values because these three columns apply only to podcasts, which make up just 1.8% of the dataset.

Several other columns are more than 80% null because they come from the Spotify API dataset, which contained only 1000 rows before merging. Streaming history rows without a match in that dataset were left with null values in these columns after the merge.

For this reason, I have not handled missing values at this stage.
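
The effect of the merge can be illustrated with a minimal sketch. The frames below are hypothetical stand-ins, and joining `track_uri` to `track_id` with a left merge is an assumption about how the two datasets were combined, since the merge itself is not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins: four streams, but API features for only two tracks.
history = pd.DataFrame({
    "track_uri": ["a", "b", "c", "a"],
    "ms_played": [1000, 2000, 3000, 4000],
})
features = pd.DataFrame({
    "track_id": ["a", "b"],
    "tempo": [120.0, 95.5],
})

# A left merge keeps every stream; indicator=True records where each row matched.
merged = history.merge(features, how="left", left_on="track_uri",
                       right_on="track_id", indicator=True)

# Streams without a match keep their history columns and get NaN features.
match_share = (merged["_merge"] == "both").mean()
print(merged)
print(f"{match_share:.0%} of streams have audio features")
```

In the real data, 20,868 of 134,325 rows carry audio features (about 15.5%), which is consistent with the 84.5% null share reported above.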

### **Features List**

In [133]:
# List numerical features in streaming history music dataframe
numerical_features = list(spotify_data.select_dtypes(include=['int64','float64']).columns)
print('List of numerical features {}'.format(numerical_features))

List of numerical features ['Unnamed: 0', 'ms_played', 'offline_timestamp', 'length', 'track_popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'loudness', 'mode', 'tempo', 'time_signature']


In [134]:
# List categorical features
categorical_features = list(spotify_data.select_dtypes(include=['object']).columns)
print('List of categorical features {}'.format(categorical_features))

List of categorical features ['platform', 'master_metadata_track_name', 'master_metadata_album_artist_name', 'master_metadata_album_album_name', 'episode_name', 'episode_show_name', 'spotify_episode_uri', 'reason_start', 'reason_end', 'skipped', 'track_uri', 'track_id', 'track_name', 'release_date', 'album_id', 'album_name', 'artist_name']


### **Separate Music and Podcast from Streaming History**

#### **Music Streaming History**

In [135]:
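# Keep only music history: drop rows where all three podcast columns are populated.
# Note: episode_show_name has 3 fewer non-null values than the other two podcast
# columns, so 3 podcast rows with a missing show name are not dropped by this filter.
indices_to_drop = spotify_data[spotify_data['episode_name'].notnull() & spotify_data['episode_show_name'].notnull() & spotify_data['spotify_episode_uri'].notnull()].index
spotify_music = spotify_data.drop(indices_to_drop)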

In [136]:
# Remove columns related to podcast
spotify_music = spotify_music.drop(columns=['episode_name', 'episode_show_name', 'spotify_episode_uri'])

In [137]:
spotify_music.info()

<class 'pandas.core.frame.DataFrame'>
Index: 131925 entries, 5 to 134324
Data columns (total 33 columns):
 #   Column                             Non-Null Count   Dtype              
---  ------                             --------------   -----              
 0   Unnamed: 0                         131925 non-null  int64              
 1   ts                                 131925 non-null  datetime64[ns, UTC]
 2   platform                           131925 non-null  object             
 3   ms_played                          131925 non-null  int64              
 4   master_metadata_track_name         131845 non-null  object             
 5   master_metadata_album_artist_name  131845 non-null  object             
 6   master_metadata_album_album_name   131845 non-null  object             
 7   reason_start                       131925 non-null  object             
 8   reason_end                         131925 non-null  object             
 9   shuffle                            131925 

In [138]:
spotify_music.head()

Unnamed: 0.1,Unnamed: 0,ts,platform,ms_played,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_uri,track_id,track_name,release_date,length,track_popularity,album_id,album_name,artist_name,acousticness,danceability,energy,instrumentalness,liveness,speechiness,loudness,mode,tempo,time_signature
5,5,2019-02-09 14:54:58+00:00,android,47292,A Sky Full of Stars - Live at the Royal Albert...,Coldplay,Ghost Stories Live 2014,appload,endplay,True,,False,1549724033458,False,2WXTF0qgKbczC2O8VymeLO,,,,,,,,,,,,,,,,,,
6,6,2019-02-09 14:55:31+00:00,android,30126,Noi siamo infinito,Alessio Bernabei,Noi siamo infinito,clickrow,endplay,True,,False,1549724096707,False,3mBROtKxM1QTRv98X1dArZ,,,,,,,,,,,,,,,,,,
7,7,2019-02-09 14:56:14+00:00,android,40875,Sediakala,Dialog Dini Hari,Sediakala,appload,endplay,True,,False,1549724131761,False,7rXlplJTbAB1gqGZtlWvR5,,,,,,,,,,,,,,,,,,
8,8,2019-02-09 14:56:58+00:00,android,42898,Garis Terdepan,Fiersa Besari,Konspirasi Alam Semesta,clickrow,endplay,True,,False,1549724173484,False,5LjtZJr2XsaeEdieJ5rrcS,,,,,,,,,,,,,,,,,,
9,9,2019-02-09 14:57:36+00:00,android,34332,Kata Hilang Makna,Yura Yunita,Merakit,appload,endplay,True,,False,1549724219564,False,1jAF90jCmaSnAkcWHTHP0t,,,,,,,,,,,,,,,,,,
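
The filter above treats a row as a podcast only when all three podcast columns are populated. As an alternative sketch, the split could key on `spotify_episode_uri` alone, which is the same column the podcast subset is later built from. The toy frame and the names `is_podcast`, `podcast_rows`, and `music_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one song and two episodes, one lacking a show name.
toy = pd.DataFrame({
    "master_metadata_track_name": ["Yellow", None, None],
    "episode_name": [None, "Episode 1", "Episode 2"],
    "episode_show_name": [None, "Some Show", None],
    "spotify_episode_uri": [None, "spotify:episode:x", "spotify:episode:y"],
})

# Requiring all three columns misses the episode without a show name.
podcast_cols = ["episode_name", "episode_show_name", "spotify_episode_uri"]
all_three = toy[podcast_cols].notna().all(axis=1)

# Keying on the URI alone gives two complementary, non-overlapping subsets.
is_podcast = toy["spotify_episode_uri"].notna()
podcast_rows = toy[is_podcast]
music_rows = toy[~is_podcast]

print(int(all_three.sum()), len(podcast_rows), len(music_rows))
```

With a single mask, the music and podcast subsets always add up to the full dataset.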


#### **Music Streaming History with Audio Features Detail**

In [139]:
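# Keep only the music rows that have audio features (a non-null track_id).
# .copy() makes this an independent DataFrame so later column assignments are safe.
spotify_music_audiofeatures = spotify_music.loc[spotify_music['track_id'].notnull()].copy()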

In [140]:
spotify_music_audiofeatures.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20868 entries, 10 to 134316
Data columns (total 33 columns):
 #   Column                             Non-Null Count  Dtype              
---  ------                             --------------  -----              
 0   Unnamed: 0                         20868 non-null  int64              
 1   ts                                 20868 non-null  datetime64[ns, UTC]
 2   platform                           20868 non-null  object             
 3   ms_played                          20868 non-null  int64              
 4   master_metadata_track_name         20868 non-null  object             
 5   master_metadata_album_artist_name  20868 non-null  object             
 6   master_metadata_album_album_name   20868 non-null  object             
 7   reason_start                       20868 non-null  object             
 8   reason_end                         20868 non-null  object             
 9   shuffle                            20868 non-null  bo

In [141]:
spotify_music_audiofeatures.head()

Unnamed: 0.1,Unnamed: 0,ts,platform,ms_played,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_uri,track_id,track_name,release_date,length,track_popularity,album_id,album_name,artist_name,acousticness,danceability,energy,instrumentalness,liveness,speechiness,loudness,mode,tempo,time_signature
10,10,2019-02-09 14:57:59+00:00,android,21886,Untuk Perempuan Yang Sedang Di Pelukan,Payung Teduh,Dunia Batas,clickrow,fwdbtn,True,,False,1549724255489,False,0urpBLpcm6DOGzs86rcKd8,0urpBLpcm6DOGzs86rcKd8,Untuk Perempuan Yang Sedang Di Pelukan,2014-11-04,342000.0,72.0,26FxxaKDiIGxEm549dRtaZ,Dunia Batas,Payung Teduh,0.552,0.477,0.493,0.00345,0.102,0.0291,-7.346,1.0,148.984,4.0
11,11,2019-02-09 14:58:09+00:00,android,1638,"Yang Patah Tumbuh, Yang Hilang Berganti",Banda Neira,"Yang Patah Tumbuh, Yang Hilang Berganti",fwdbtn,fwdbtn,True,,False,1549724278451,False,6Rd4ep779v8CjlFVhaHrNX,6Rd4ep779v8CjlFVhaHrNX,"Yang Patah Tumbuh, Yang Hilang Berganti",2016-01-29,393072.0,63.0,1e1NmOduCFHp1z29cSzyMa,"Yang Patah Tumbuh, Yang Hilang Berganti",Banda Neira,0.938,0.353,0.354,0.00407,0.122,0.0268,-10.85,1.0,96.887,4.0
19,19,2019-02-10 00:46:51+00:00,android,3856,"Zona Nyaman (From ""Filosofi Kopi 2: Ben & Jody"")",Fourtwnty,Ego & Fungsi Otak,clickrow,endplay,False,,False,1549759605916,False,4lfAvFv8roumWVKXhHF8uN,4lfAvFv8roumWVKXhHF8uN,"Zona Nyaman (From ""Filosofi Kopi 2: Ben & Jody"")",2018-04-20,243413.0,67.0,01PCAiyhBHXTj9KfSigcQz,Ego & Fungsi Otak,Fourtwnty,0.355,0.515,0.649,0.00155,0.175,0.0338,-6.278,1.0,120.003,4.0
20,20,2019-02-10 00:46:56+00:00,android,4545,Monokrom,Tulus,Monokrom,clickrow,endplay,False,,False,1549759610209,False,4GfK1qOF3uBWidbPlTCQRL,4GfK1qOF3uBWidbPlTCQRL,Monokrom,2016-08-03,214567.0,76.0,4szhn3xPmOJklFAcqNvTnQ,Monokrom,Tulus,0.573,0.534,0.462,6e-06,0.0974,0.0326,-9.383,1.0,88.046,4.0
26,26,2019-02-10 00:55:25+00:00,android,34663,Yellow,Coldplay,Parachutes,clickrow,endplay,False,,False,1549760089412,False,3AJwUDP919kvQ9QcozQPxg,3AJwUDP919kvQ9QcozQPxg,Yellow,2000-07-10,266773.0,89.0,6ZG5lRT77aJ3btmArcykra,Parachutes,Coldplay,0.00239,0.429,0.661,0.000121,0.234,0.0281,-7.227,1.0,173.372,4.0


In [142]:
missing_values_table(spotify_music_audiofeatures)

Your selected dataframe has 33 columns.
There are 1 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
skipped,10588,50.7


In [143]:
spotify_music_audiofeatures['skipped'].dtypes

dtype('O')

In [144]:
spotify_music_audiofeatures['skipped'].value_counts()

skipped
False    6542
True     2137
0.0      1173
1.0       428
Name: count, dtype: int64

In [145]:
# Normalize the mixed labels: map '0.0'/'1.0' onto 'False'/'True'
spotify_music_audiofeatures['skipped'] = spotify_music_audiofeatures['skipped'].replace({'0.0': 'False', '1.0': 'True'})

In [146]:
spotify_music_audiofeatures['skipped'].value_counts()

skipped
False    7715
True     2565
Name: count, dtype: int64

In [147]:
missing_values_table(spotify_music_audiofeatures)

Your selected dataframe has 33 columns.
There are 1 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
skipped,10588,50.7


In [148]:
# Impute missing 'skipped' values with the most frequent label
skipped_mode = spotify_music_audiofeatures['skipped'].mode()[0]
spotify_music_audiofeatures['skipped'] = spotify_music_audiofeatures['skipped'].fillna(skipped_mode)

In [149]:
missing_values_table(spotify_music_audiofeatures)

Your selected dataframe has 33 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values


In [150]:
spotify_music_audiofeatures['skipped'].value_counts()

skipped
False    18303
True      2565
Name: count, dtype: int64
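
Filling half of the column with the mode changes the apparent skip rate: it falls from about 25% of the labelled rows (2,565 of 10,280) to about 12% of all rows (2,565 of 20,868). As a hedged alternative, the sketch below keeps missing values explicit by converting the labels to pandas' nullable `boolean` dtype. The sample values and the name `label_map` are illustrative only.

```python
import pandas as pd

# Illustrative labels in the same mixed string form seen in value_counts().
raw = pd.Series(["False", "True", "0.0", "1.0", None], name="skipped")

# Map every known label to a real boolean; anything else becomes missing.
label_map = {"False": False, "True": True, "0.0": False, "1.0": True}
skipped = raw.map(label_map).astype("boolean")

# Aggregations on the nullable dtype skip the missing entries by default.
skip_rate = skipped.mean()
print(skipped.tolist(), skip_rate)
```

Which approach fits better depends on whether later analysis needs a complete column or an unbiased skip rate.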

In [151]:
spotify_music_audiofeatures.duplicated().value_counts()

False    20868
Name: count, dtype: int64

#### **Podcast Streaming History**

In [152]:
# Subset of columns included in the podcast streaming history
podcast_columns = ['ts', 'platform', 'ms_played', 'episode_name', 'episode_show_name', 'spotify_episode_uri', 'reason_start', 'reason_end', 'shuffle', 'skipped', 'offline', 'offline_timestamp', 'incognito_mode']

In [153]:
spotify_podcast = spotify_data[podcast_columns].copy()

In [154]:
spotify_podcast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134325 entries, 0 to 134324
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype              
---  ------               --------------   -----              
 0   ts                   134325 non-null  datetime64[ns, UTC]
 1   platform             134325 non-null  object             
 2   ms_played            134325 non-null  int64              
 3   episode_name         2403 non-null    object             
 4   episode_show_name    2400 non-null    object             
 5   spotify_episode_uri  2403 non-null    object             
 6   reason_start         134325 non-null  object             
 7   reason_end           134325 non-null  object             
 8   shuffle              134325 non-null  bool               
 9   skipped              65915 non-null   object             
 10  offline              134325 non-null  bool               
 11  offline_timestamp    134325 non-null  int64              
 12  in

In [155]:
# Make sure the data only includes podcast history
spotify_podcast = spotify_podcast.dropna(subset=['spotify_episode_uri'])

In [156]:
spotify_podcast.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2403 entries, 0 to 134306
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   ts                   2403 non-null   datetime64[ns, UTC]
 1   platform             2403 non-null   object             
 2   ms_played            2403 non-null   int64              
 3   episode_name         2403 non-null   object             
 4   episode_show_name    2400 non-null   object             
 5   spotify_episode_uri  2403 non-null   object             
 6   reason_start         2403 non-null   object             
 7   reason_end           2403 non-null   object             
 8   shuffle              2403 non-null   bool               
 9   skipped              1103 non-null   object             
 10  offline              2403 non-null   bool               
 11  offline_timestamp    2403 non-null   int64              
 12  incognito_mode       24

In [157]:
spotify_podcast.head()

Unnamed: 0,ts,platform,ms_played,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2019-02-08 22:51:39+00:00,android,31267,"8 - ""Taste"" by Rebecca Woodmass",A Little Poem with Rebecca Woodmass,spotify:episode:0lY0dPwAfZoPgSepp1eCrp,clickrow,endplay,False,,False,1549666235287,False
1,2019-02-09 03:50:12+00:00,android,5463,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False
2,2019-02-09 03:50:33+00:00,android,5463,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False
3,2019-02-09 03:50:51+00:00,android,5463,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,clickrow,unexpected-exit-while-paused,False,,False,1549666299325,False
4,2019-02-09 11:05:55+00:00,android,77742,What your breath could reveal about your healt...,TED Talks Daily,spotify:episode:6v0Fw2ov3RclzEFoG68roa,appload,logout,False,,False,1549710054124,False


In [158]:
missing_values = missing_values_table(spotify_podcast)
missing_values

Your selected dataframe has 13 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
skipped,1300,54.1
episode_show_name,3,0.1


## **Exploratory Data Analysis**

### **`Audio` Streaming Analysis**

#### **Monthly `audio` streaming trend**

In [159]:
# Set 'ts' column as the index
indexed = spotify_data.set_index('ts', inplace=False)

# Resample on a monthly basis, summing only the playback duration
streaming_trends = indexed[['ms_played']].resample('M').sum()

fig = px.line(streaming_trends, x=streaming_trends.index, y=(streaming_trends['ms_played'] / (1000 * 60 * 60)).round(2),
              labels={'ts': 'Month-Year', 'y': 'Hours Played'},
              title='Monthly Audio Streaming Trends Over Time',
              color_discrete_sequence=['green'], markers=True)

fig.add_vrect(x0="2022-10-01", x1="2023-02-01", 
              fillcolor="rgba(0,255,0,0.2)", 
              layer="below", 
              line_width=0)

fig.update_layout(title=dict(text='Monthly Audio Streaming Trends Over Time', x=0.5),
                  xaxis_title='Month-Year',
                  yaxis_title='Hours Played',
                  width=1000, height=500)
fig.show()
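
The aggregation behind this chart can be checked on a small example. The sketch below uses made-up timestamps and the hypothetical names `MS_PER_HOUR` and `monthly_hours`. It resamples with the month-start alias `'MS'`, an assumption made here for portability across pandas versions, so the bins are labelled by the first day of each month rather than the last.

```python
import pandas as pd

# Made-up streams: two in January 2022 and one in February 2022.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-05T10:00:00Z",
        "2022-01-20T12:00:00Z",
        "2022-02-03T09:30:00Z",
    ]),
    "ms_played": [1_800_000, 5_400_000, 3_600_000],
})

MS_PER_HOUR = 1000 * 60 * 60

# Sum playback per calendar month, then convert milliseconds to hours.
monthly_hours = (
    toy.set_index("ts")["ms_played"]
       .resample("MS")
       .sum()
    / MS_PER_HOUR
)
print(monthly_hours)
```

Selecting `ms_played` before resampling keeps the sum restricted to the one column the chart needs.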

#### **Weekly `audio` streaming trend**

In [160]:
# Analyze trends based on the day of the week
trends_by_day = spotify_data.groupby(spotify_data['ts'].dt.dayofweek).agg({
    'ms_played': lambda x: x.sum() / (1000 * 60 * 60)
    }).reset_index()

# Get the top 2 days of the week
top_days = trends_by_day.nlargest(2, 'ms_played').index

fig = px.bar(trends_by_day, x='ts', y='ms_played',
             labels={'ms_played': 'Total Streaming Time (Hours)', 'ts': 'Day of the Week'},
             title='Streaming Trends by Day of the Week')

mean_trends_by_day = trends_by_day['ms_played'].mean()
fig.add_hline(y=mean_trends_by_day, 
              line_dash="dot", line_color="blue", 
              annotation_text=f'Mean: {mean_trends_by_day:.2f} hours',
              annotation_position="top right")

colors = ['green' if i in top_days else '#16D35C' for i in trends_by_day.index]
fig.update_traces(marker_color=colors,
                  text=trends_by_day['ms_played'].round(2), 
                  textposition='inside', 
                  textfont=dict(color='white'))

fig.update_layout(xaxis=dict(tickmode='array', tickvals=list(range(7)), ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']),
                  title=dict(text='Streaming Trends by Day of the Week', x=0.5),
                  width=800, height=500)
fig.show()

#### **`Platform` Analysis**

In [161]:
# Calculate platform counts
platform_counts = spotify_data['platform'].value_counts()

fig = px.pie(platform_counts, values=platform_counts.values, names=platform_counts.index,
             title='Distribution of Streaming Platforms')

# Customize the appearance
fig.update_traces(marker=dict(colors=['green', '#16D35C']))  # Set custom colors
fig.update_layout(title={'text': 'Distribution of Streaming Platforms', 'x': 0.5})  # Center the title
fig.update_layout(width=500, height=500)

fig.show()

### **`Music` Streaming Analysis**

#### **Yearly `Music` Streaming Trend**

##### **By Play Counts**

##### **By Playback Duration**

In [162]:
# Set 'ts' column as the index
indexed = spotify_music.set_index('ts', inplace=False)

# Resample on a yearly basis, summing only the playback duration
streaming_trends_yearly = indexed[['ms_played']].resample('Y').sum()

fig = px.line(streaming_trends_yearly, 
              x=streaming_trends_yearly.index, 
              y=(streaming_trends_yearly['ms_played'] / (1000 * 60 * 60)).round(2),
              labels={'ts': 'Year', 'y': 'Hours Played'},
              title='Yearly Music Streaming Trends Over Time',
              color_discrete_sequence=['green'], markers=True)

fig.update_layout(title=dict(text='Yearly Music Streaming Trends Over Time', x=0.5),
                  xaxis_title='Year',
                  yaxis_title='Hours Played',
                  width=800, height=500)
fig.show()

##### **By Song Counts**

In [163]:
# Count streams per year; every play is counted, including repeat plays of the same track
spotify_music['year'] = spotify_music['ts'].dt.year
grouped_data = spotify_music.groupby('year').size().reset_index(name='song_count')
grouped_data = grouped_data.sort_values(by='year')

In [164]:
fig = px.line(grouped_data, x='year', y='song_count',
              labels={'year': 'Year', 'song_count': 'Total Songs'},
              color_discrete_sequence=['green'],
              markers=True)

fig.update_layout(title=dict(text='Yearly Song Count Trends', x=0.5),
                  xaxis_title='Year',
                  yaxis_title='Total Songs',
                  width=800, height=500,
                  yaxis=dict(tickformat=',.0f', range=[0, grouped_data['song_count'].max() * 1.1]))
fig.show()

##### **By Artists Counts**

In [165]:
# Create a 'year' column based on the 'ts' timestamp column
spotify_data['year'] = spotify_data['ts'].dt.year

# Calculate the number of unique artists for each year
artists_per_year = spotify_data.groupby('year')['master_metadata_album_artist_name'].nunique().reset_index()

# Create the plot using Plotly Express (px)
fig = px.line(artists_per_year, x='year', y='master_metadata_album_artist_name',
             labels={'year': 'Year', 'master_metadata_album_artist_name': 'Number of Unique Artists'},
             title='Number of Unique Artists Each Year',
             color_discrete_sequence=['green'], markers=True,
             width=800, height=500)
fig.show()

#### **Monthly `Music` Streaming Trend**

In [166]:
# Set 'ts' column as the index
indexed = spotify_music.set_index('ts', inplace=False)

# Resample on a monthly basis, summing only the playback duration
streaming_trends = indexed[['ms_played']].resample('M').sum()

# Create a line plot using Plotly Express
fig = px.line(streaming_trends, x=streaming_trends.index, y=(streaming_trends['ms_played'] / (1000 * 60 * 60)).round(2),
              labels={'ts': 'Month-Year', 'y': 'Hours Played'},
              title='Monthly Music Streaming Trends Over Time',
              color_discrete_sequence=['green'], markers=True)

fig.add_vrect(x0="2022-10-01", x1="2023-02-01", fillcolor="rgba(0,255,0,0.2)", layer="below", line_width=0)

fig.update_layout(title=dict(text='Monthly Music Streaming Trends Over Time', x=0.5),
                  xaxis_title='Month-Year',
                  yaxis_title='Hours Played',
                  width=1100, height=500)
fig.show()

#### **Weekly `Music` Streaming Trend**

In [167]:
# Analyze trends based on the day of the week (group by the music frame's own timestamps)
trends_by_day = spotify_music.groupby(spotify_music['ts'].dt.dayofweek).agg({
    'ms_played': lambda x: x.sum() / (1000 * 60 * 60)  # Sum, then convert milliseconds to hours
}).reset_index()

# Get the top 2 days of the week
top_days = trends_by_day.nlargest(2, 'ms_played').index

# Visualize the trends
fig = px.bar(trends_by_day, x='ts', y='ms_played',
             labels={'ms_played': 'Total Streaming Time (Hours)', 'ts': 'Day of the Week'},
             title='Streaming Trends by Day of the Week')

mean_trends_by_day = trends_by_day['ms_played'].mean()
fig.add_hline(y=mean_trends_by_day, 
              line_dash="dot", line_color="blue", 
              annotation_text=f'Mean: {mean_trends_by_day:.2f} hours',
              annotation_position="top right")

colors = ['green' if i in top_days else '#16D35C' for i in trends_by_day.index]
fig.update_traces(marker_color=colors,
                  text=trends_by_day['ms_played'].round(2), 
                  textposition='inside', 
                  textfont=dict(color='white'))

fig.update_layout(xaxis=dict(tickmode='array', tickvals=list(range(7)), ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']),
                  title=dict(text='Streaming Trends by Day of the Week', x=0.5),
                  width=800, height=500)
fig.show()

#### **Daily `Music` Streaming Trend**

In [168]:
daily_playback_counts = (spotify_music.groupby(spotify_music['ts'].dt.date)['ms_played'].sum() / (1000 * 60 * 60)).reset_index()

fig = px.line(daily_playback_counts, x='ts', y='ms_played',
              labels={'ts': 'Date', 'ms_played': 'Playback Time'},
              title='Daily Playback Time Over Time',
              hover_data={'ts': '|%B %d, %Y'})

fig.update_layout(xaxis_title='Date', yaxis_title='Total Playback Time (Hours)', title_x=0.5,
                  width=1100, height=500)
fig.show()

In [169]:
may_5th_data = spotify_data[spotify_data['ts'].dt.date == pd.to_datetime('2023-05-05').date()]
song_playback_duration_may_5th = (may_5th_data.groupby(['master_metadata_album_artist_name', 'master_metadata_track_name'])['ms_played'].sum() / (1000 * 60)).reset_index()
top_10_songs_may_5th = song_playback_duration_may_5th.nlargest(10, 'ms_played')

top_10_songs_may_5th = top_10_songs_may_5th.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                                            'master_metadata_track_name': 'Track Name',
                                                            'ms_played': 'Playback Duration'})

top_10_songs_may_5th

Unnamed: 0,Artist,Track Name,Playback Duration
237,Taylor Swift,august,13.0961
269,d4vd,Here With Me,11.356467
173,New West,Those Eyes,11.0375
164,Miley Cyrus,Angels Like You,9.82265
249,The Weeknd,Die For You (with Ariana Grande) - Remix,8.293283
95,Justin Bieber,Love Yourself,7.790667
21,Aziz Hedra,Somebody's Pleasure,7.465233
102,Keane,Somewhere Only We Know,7.433967
188,Raim Laode,Komang,7.423533
230,Taylor Swift,All Of The Girls You Loved Before,7.380233


#### **Hourly `Music` Streaming Trend**

#### **`Top Albums` All the Time**

In [170]:
album_playback_duration = spotify_data.groupby('master_metadata_album_album_name')['ms_played'].sum() / (1000 * 60)
album_play_counts = spotify_data['master_metadata_album_album_name'].value_counts()

In [171]:
# Create a DataFrame with album names and artist names (one row per album/artist pair)
album_metadata = spotify_data[['master_metadata_album_album_name', 'master_metadata_album_artist_name']].drop_duplicates()
album_metadata = album_metadata.set_index('master_metadata_album_album_name')

# Merge album_metadata with album_playback_duration
merged_data = album_metadata.merge(album_playback_duration, left_index=True, right_index=True)

# Merge merged_data with album_play_counts
merged_data = merged_data.merge(album_play_counts, left_index=True, right_index=True)

In [172]:
merged_data.columns = ['Artist Name', 'Playback Duration', 'Play Counts']

# Reset index to turn the index into a regular column
merged_data.reset_index(inplace=True)

# Set the desired name for the index column
merged_data.rename(columns={'master_metadata_album_album_name': 'Album Name'}, inplace=True)

In [173]:
grouped_data = merged_data.groupby('Album Name')

filtered_data = []

for album, group in grouped_data: # Check if the album has more than one artist
    if len(group['Artist Name'].unique()) > 1:
        filtered_data.append(group.head(1))
    else:
        filtered_data.append(group)

filtered_merged_data = pd.concat(filtered_data)

top_albums = filtered_merged_data.nlargest(20, 'Playback Duration')
print("\nTop 20 Albums by Playback Duration (in Minutes) and Play Counts:\n", top_albums.to_string(index=False, float_format="{:.2f}".format))


Top 20 Albums by Playback Duration (in Minutes) and Play Counts:
                              Album Name         Artist Name  Playback Duration  Play Counts
                 Menari Dengan Bayangan              Hindia            3028.26          854
                          Mantra Mantra           Kunto Aji            2556.88          751
         The Feeling of Falling Upwards 5 Seconds of Summer            2520.66          667
                               gg bb xx                LANY            2361.89          845
                          Malibu Nights                LANY            2166.24          671
                             mama's boy                LANY            2005.83          674
                             logic mess         Arash Buana            1897.79          607
                                Manusia               Tulus            1802.52          576
                                   LANY                LANY            1761.30          545
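
The merge-and-loop steps above could also be expressed as a single grouped aggregation. This is a sketch under assumptions: `summarize_albums` is a hypothetical helper, the toy rows are invented, and taking the first artist seen per album only approximates the `group.head(1)` rule used above.

```python
import pandas as pd

def summarize_albums(df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """One row per album: first artist seen, total minutes, and play count."""
    summary = (
        df.groupby('master_metadata_album_album_name')
          .agg(**{
              'Artist Name': ('master_metadata_album_artist_name', 'first'),
              'Playback Duration': ('ms_played', 'sum'),
              'Play Counts': ('ms_played', 'size'),
          })
          .rename_axis('Album Name')
          .reset_index()
    )
    # Convert milliseconds to minutes
    summary['Playback Duration'] = summary['Playback Duration'] / (1000 * 60)
    return summary.nlargest(n, 'Playback Duration')

# Toy data standing in for spotify_data (assumption, not real history)
toy = pd.DataFrame({
    'master_metadata_album_album_name': ['A', 'A', 'B', 'B'],
    'master_metadata_album_artist_name': ['X', 'X', 'Y', 'Z'],
    'ms_played': [60_000, 120_000, 600_000, 60_000],
})
top_albums_sketch = summarize_albums(toy, n=2)
print(top_albums_sketch.to_string(index=False))
```

Grouping once avoids the two index-based merges and guarantees one row per album name.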
               

In [174]:
# Create traces for playback duration (line) and play counts (bars)
trace_duration = go.Scatter(x=top_albums['Album Name'], y=top_albums['Playback Duration'],
                            mode='lines+markers', name='Playback Duration', line=dict(color='blue'),
                            yaxis='y2')  # Plot against the secondary axis defined in the layout
trace_counts = go.Bar(x=top_albums['Album Name'], y=top_albums['Play Counts'],
                      name='Play Counts', marker=dict(color='green'))

# Create layout for the chart with a secondary y-axis
layout = go.Layout(title='Top 20 Albums by Playback Duration and Play Counts', title_x=0.5,
                   xaxis=dict(title='Album Name', tickangle=-45),
                   yaxis=dict(title='Play Counts', side='left', color='green'),
                   yaxis2=dict(title='Playback Duration', side='right', overlaying='y', color='blue'),
                   legend=dict(x=0, y=1.0, bgcolor='rgba(255, 255, 255, 0)', bordercolor='rgba(255, 255, 255, 0)'),
                   barmode='group', width=1100, height=500)

fig = go.Figure(data=[trace_duration, trace_counts], layout=layout)
fig.show()

In [175]:
fig = px.bar(top_albums, x='Album Name', y='Playback Duration',
             color_discrete_sequence=['green'], labels={'Playback Duration': 'Playback Duration'})

fig.add_scatter(x=top_albums['Album Name'], y=top_albums['Play Counts'], 
                mode='lines+markers', name='Play Counts', yaxis='y2', line=dict(color='yellowgreen'))

fig.update_layout(title='Top 20 Albums by Playback Duration and Play Counts', title_x=0.5,
                  xaxis_title='Album Name', yaxis_title='Playback Duration',
                  yaxis=dict(title='Playback Duration', color='green'),
                  yaxis2=dict(title='Play Counts', color='yellowgreen', overlaying='y', side='right'),
                  showlegend=False, width=1100, height=500)

fig.update_traces(hovertemplate="<b>%{x}</b><br>Playback Duration: %{y:.2f}<extra></extra>", selector=dict(type='bar')) # Hover data for the bar plot
fig.update_traces(hovertemplate="<b>%{x}</b><br>Play Counts: %{y}<extra></extra>", selector=dict(type='scatter')) # Hover data for the line plot
fig.show()

#### **`Favourite Songs` All the Time**

#### **`Favourite Artists` All the Time**

In [176]:
# The top 20 artists with the most number of streams based on play counts
artist_play_counts = spotify_data['master_metadata_album_artist_name'].value_counts().head(20)
artist_play_counts

master_metadata_album_artist_name
LANY                   3936
5 Seconds of Summer    2245
Lofi Fruits Music      2125
Taylor Swift           1858
Kunto Aji              1667
Tulus                  1565
Hindia                 1178
Lauv                   1162
Sheila On 7            1093
Arash Buana            1072
Yiruma                 1007
Coldplay                986
keshi                   984
NIKI                    872
Justin Bieber           852
Hayd                    823
YOASOBI                 778
Ardhito Pramono         735
DEPAPEPE                709
Shawn Mendes            705
Name: count, dtype: int64

In [177]:
top_10_artist_play_counts = artist_play_counts.head(10)

fig = px.bar(top_10_artist_play_counts, x=top_10_artist_play_counts.index, y=top_10_artist_play_counts.values,
             labels={'x': 'Artists', 'y': 'Total Streams'},
             title='Top 10 Most Streamed Artists Based on Play Counts')

# Highlight the top 3 artists in dark green and use Spotify green for the rest
color_scale = ['green' if artist in top_10_artist_play_counts.head(3).index else '#16D35C' for artist in top_10_artist_play_counts.index]

fig.update_traces(marker_color=color_scale,
                  text=top_10_artist_play_counts.values, 
                  textposition='inside', 
                  textfont=dict(color='white'))

# Median play count across the top 20 artists
median_play_counts = artist_play_counts.median()
fig.add_hline(y=median_play_counts, 
              line_dash="dot", line_color="blue", 
              annotation_text=f'Median of top 20: {median_play_counts:.1f} streams',
              annotation_position="top right")

fig.update_layout(xaxis_title='Artists', 
                  yaxis_title='Total Streams', 
                  xaxis_tickangle=-45, showlegend=False, title_x=0.5,
                  width=1000, height=500)
fig.show()

In [178]:
# Calculate the total playback time for each artist in hours
artist_playback_duration = spotify_data.groupby('master_metadata_album_artist_name')['ms_played'].sum() / (1000 * 60 * 60)

# Sort artists by total playback time, from most to least, and keep the top 20
artist_playback_duration = artist_playback_duration.sort_values(ascending=False).head(20).reset_index()
artist_playback_duration

Unnamed: 0,master_metadata_album_artist_name,ms_played
0,LANY,203.074674
1,5 Seconds of Summer,119.409441
2,Taylor Swift,98.686382
3,Kunto Aji,90.270804
4,Tulus,78.815167
5,Hindia,69.891779
6,Lofi Fruits Music,63.03375
7,Sheila On 7,61.754606
8,Coldplay,59.388756
9,Yiruma,56.431022


In [179]:
top_10_artists = artist_playback_duration.head(10)

fig = px.bar(top_10_artists, x='master_metadata_album_artist_name', y='ms_played',
             labels={'master_metadata_album_artist_name': 'Artists', 'ms_played': 'Total Streams in Hours'},
             title='Top 10 Most Streamed Artists Based on Playback Duration')

# Highlight the top 3 artists in dark green and use Spotify green for the rest
top_3_artists = top_10_artists['master_metadata_album_artist_name'].head(3).tolist()
color_scale = ['green' if artist in top_3_artists else '#16D35C' for artist in top_10_artists['master_metadata_album_artist_name']]

fig.update_traces(marker_color=color_scale,
                  text=top_10_artists['ms_played'].round(2), 
                  textposition='inside', 
                  textfont=dict(color='white'))

median_playback_duration = artist_playback_duration['ms_played'].median()
fig.add_hline(y=median_playback_duration, 
              line_dash="dot", line_color="blue", 
              annotation_text=f'Median: {median_playback_duration:.2f} hours',
              annotation_position="top right")

fig.update_layout(xaxis_title='Artists', 
                  yaxis_title='Total Streams in Hours', 
                  xaxis_tickangle=-45, showlegend=False, title_x=0.5,
                  width=1000, height=500)
fig.show()
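
The bar-highlighting logic recurs in several cells. A small helper could keep the color list the same length as the plotted bars; this is a sketch only, and `highlight_top_k` is a hypothetical name, not part of Plotly or of the original notebook.

```python
def highlight_top_k(values, k=3, top_color='green', other_color='#16D35C'):
    """Return one color per bar, marking the k largest values with top_color."""
    # Positions of the values, ordered from largest to smallest
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top = set(ranked[:k])
    return [top_color if i in top else other_color for i in range(len(values))]

# Example with the first five artist totals (hours) shown above
colors = highlight_top_k([203.07, 119.41, 98.69, 90.27, 78.82], k=3)
print(colors)
```

The returned list can be passed directly as `marker_color` in `fig.update_traces`.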

#### **`Top 3 Songs` from `Top 3 Artist`**

In [180]:
# Calculate the total playback time for each song in minutes
most_played_songs = (spotify_data.groupby(['master_metadata_album_artist_name', 'master_metadata_track_name'])['ms_played'].sum() / (1000 * 60)).reset_index()

# Sort songs by total playback time, from most to least, and keep the top 20
list_most_played_songs = most_played_songs.sort_values(by='ms_played', ascending=False).round(2).head(20)
list_most_played_songs

Unnamed: 0,master_metadata_album_artist_name,master_metadata_track_name,ms_played
6749,Joji,Glimpse of Us,1097.91
2517,Cheon ji won,You Can Cry,1044.52
7655,LANY,Malibu Nights,994.74
7562,Kunto Aji,Rehat,814.86
5591,Hindia,Membasuh,811.48
4393,Feby Putri,Runtuh,746.84
2570,Choi Yu Ree,Wish,731.67
7557,Kunto Aji,Pilu Membiru,718.29
13222,TAEIL,Starlight,671.69
14317,Tulus,Diri,664.02


In [181]:
top_artists = artist_playback_duration.sort_values(by='ms_played', ascending=False).head(3)['master_metadata_album_artist_name']

top_songs = pd.concat([most_played_songs[most_played_songs['master_metadata_album_artist_name'] == artist].nlargest(3, 'ms_played') for artist in top_artists])
top_songs = top_songs.sort_values(by='ms_played', ascending=True)

fig = px.bar(top_songs, x='ms_played', y='master_metadata_track_name', color='master_metadata_album_artist_name', orientation='h',
             labels={'master_metadata_track_name': 'Track Name', 'ms_played': 'Playback Time', 'master_metadata_album_artist_name': 'Artist'},
             title='Top 3 Songs from Top 3 Most Played Artists')

fig.update_layout(xaxis_title='Total Playback Time (minutes)', yaxis_title='Track Name',
                  legend_title='Artist', title_x=0.5, width=1000, height=500)

fig.show()
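
The per-artist selection above can also be written without a list comprehension, by sorting once and taking the head of each group. A sketch, assuming the same column names; `top_n_per_artist` and the toy frame are hypothetical.

```python
import pandas as pd

def top_n_per_artist(songs: pd.DataFrame, artists, n: int = 3) -> pd.DataFrame:
    """Keep the n most-played songs for each artist in `artists`."""
    subset = songs[songs['master_metadata_album_artist_name'].isin(artists)]
    return (
        subset.sort_values('ms_played', ascending=False)
              .groupby('master_metadata_album_artist_name', sort=False)
              .head(n)
    )

# Toy data standing in for most_played_songs (assumption, not real history)
songs = pd.DataFrame({
    'master_metadata_album_artist_name': ['X', 'X', 'X', 'X', 'Y', 'Z'],
    'master_metadata_track_name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'ms_played': [4.0, 3.0, 2.0, 1.0, 5.0, 9.0],
})
top_songs_sketch = top_n_per_artist(songs, ['X', 'Y'], n=3)
print(top_songs_sketch)
```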

#### **Yearly `Top Songs` and `Top Artists`**
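
The yearly cells below repeat the same steps with a different year. One way to factor that out is a small helper; this is a sketch, not the author's code. `top_songs_for_year` is a hypothetical name, the toy rows are invented, and the sketch reports minutes (dividing by `1000 * 60`), which differs from the divisor used in the cells that follow.

```python
import pandas as pd

def top_songs_for_year(df: pd.DataFrame, year: int, n: int = 20) -> pd.DataFrame:
    """Top-n tracks of a given year by total playback time in minutes."""
    yearly = df[df['ts'].dt.year == year]
    totals = (
        yearly.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played']
              .sum()
              .div(1000 * 60)  # milliseconds -> minutes
              .reset_index()
    )
    return totals.nlargest(n, 'ms_played').rename(columns={
        'master_metadata_album_artist_name': 'Artist',
        'master_metadata_track_name': 'Track Name',
        'ms_played': 'Playback Duration',
    })

# Toy data standing in for spotify_data (assumption, not real history)
toy = pd.DataFrame({
    'ts': pd.to_datetime(['2019-01-01', '2019-06-01', '2019-07-01', '2020-01-01']),
    'master_metadata_track_name': ['a', 'a', 'b', 'c'],
    'master_metadata_album_artist_name': ['X', 'X', 'Y', 'Z'],
    'ms_played': [60_000, 60_000, 60_000, 600_000],
})
top_2019 = top_songs_for_year(toy, 2019, n=1)
print(top_2019)
```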

##### **`2019` Top Songs and Top Artists**

In [182]:
spotify_data_2019 = spotify_data[spotify_data['ts'].dt.year == 2019]

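# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2019.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()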
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2019:")
print(top_songs)

Top Songs for 2019:
                                          Track Name              Artist  Playback Duration
1301          Untuk Perempuan Yang Sedang Di Pelukan        Payung Teduh          1735.6242
1019    Sampai Jadi Debu (Menampilkan Gardika Gigih)         Banda Neira          1564.9837
546                            If You're Not The One  Daniel Bedingfield          1286.2758
732                                     Love Someone        Lukas Graham          1160.4668
1371         Yang Patah Tumbuh, Yang Hilang Berganti         Banda Neira          1116.8143
1197                              Teman Tapi Menikah       Dengarkan Dia          1001.2434
97                                             April       Fiersa Besari           994.5008
475                                      Hitam Putih           Fourtwnty           961.3434
517                                  I Love You 3000    Stephanie Poetri           926.2673
285                                           Dan...        

##### **`2020` Top Songs and Top Artists**

In [183]:
spotify_data_2020 = spotify_data[spotify_data['ts'].dt.year == 2020]

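# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2020.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()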
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2020:")
print(top_songs)

Top Songs for 2020:
                   Track Name           Artist  Playback Duration
2071                Tiga Pagi           Fletch          2717.0763
748   Growing Up (Rara Sekar)         Daramuda          2056.1518
1657                 Sadajiwa           Fletch          1456.4656
1074           Laraku, Pilumu           Fletch          1445.7452
1261                 Membasuh           Hindia          1410.8291
1073                     Lara     Dialog Senja          1392.1713
699            Forget Jakarta   Adhitia Sofyan          1376.4033
124               Angin Hujan           Fletch          1268.4740
2324                    ghost       Skinnyfabs          1228.9302
1510             Pilu Membiru        Kunto Aji          1222.5940
19       3 A.M. (Bonus Track)           Fletch          1210.6356
1539           Pura Pura Lupa            Mahen          1125.9884
1504      Pesan Di Balik Awan   Adhitia Sofyan          1109.8110
175                  Bad Liar  Imagine Dragons          

##### **`2021` Top Songs and Top Artists**

In [184]:
spotify_data_2021 = spotify_data[spotify_data['ts'].dt.year == 2021]

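# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2021.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()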
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2021:")
print(top_songs)

Top Songs for 2021:
                                             Track Name           Artist  Playback Duration
3016                                        You Can Cry     Cheon ji won          4105.4100
2090                                              Rehat        Kunto Aji          2200.9607
2067                                          Rainy Day      Sungha Jung          2063.9920
109                                          All I Need            JEMMA          1948.5355
1654                                             Middle         DJ Snake          1838.5583
2976                                               Wish      Choi Yu Ree          1817.4193
1606                                          May I Ask      Luke Chiang          1676.4227
2237                                            Sejenak        Biru Baru          1616.7461
1272                                          Je T'aime              JOY          1553.9198
229                                             Baangka     

##### **`2022` Top Songs and Top Artists**

In [185]:
spotify_data_2022 = spotify_data[spotify_data['ts'].dt.year == 2022]

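# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2022.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()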
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2022:")
print(top_songs)

Top Songs for 2022:
                     Track Name         Artist  Playback Duration
1653              Glimpse of Us           Joji          6015.8219
4417                  Starlight          TAEIL          3512.9781
2004                        Hug          suggi          3350.1882
3479         Our Beloved Summer  Kim Kyung Hee          2868.1553
1632                      Ghost  Justin Bieber          2584.6424
2344              It'll Be Okay   Shawn Mendes          2453.4550
3037                   Maybe if           BIBI          2393.6359
1241                     Drawer           10CM          2364.7271
1162                       Diri          Tulus          2314.5816
5369                       With      Kim Taeri          2297.2705
4482  Strawberries & Cigarettes    Troye Sivan          2256.1764
3928                     Runtuh     Feby Putri          2233.5816
2654                Late Regret   ONG SEONG WU          2228.6441
3609        Peter Pan Was Right   Anson Seabra          

##### **`2023` Top Songs and Top Artists**

In [186]:
spotify_data_2023 = spotify_data[spotify_data['ts'].dt.year == 2023]

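# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2023.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()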
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2023:")
print(top_songs)

Top Songs for 2023:
                               Track Name                  Artist  Playback Duration
1248                       Cosmic Railway                     EXO          2904.3849
3782                        Malibu Nights                    LANY          2545.5971
5499                  Somebody's Pleasure              Aziz Hedra          2083.7113
1829                        Falling Again                    Ridh          2015.5834
3292                                LIMBO                   keshi          1944.5400
7318                            hurt road                    DAY6          1905.5338
2438                         Here With Me                    d4vd          1901.3898
4393                     Opening Sequence     TOMORROW X TOGETHER          1865.6154
5899                    Tak Segampang Itu            Anggi Marito          1807.4089
7064                               august            Taylor Swift          1786.7549
6239                           Those Eyes    

##### **`2024` Top Songs and Top Artists (Q1)**

In [187]:
spotify_data_2024 = spotify_data[spotify_data['ts'].dt.year == 2024]

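# Note: dividing by (1000 * 10) expresses playback in units of 10 seconds; use (1000 * 60) for minutes
top_songs = (spotify_data_2024.groupby(['master_metadata_track_name', 'master_metadata_album_artist_name'])['ms_played'].sum() / (1000 * 10)).reset_index()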
top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(20)

top_songs = top_songs.rename(columns={'master_metadata_album_artist_name': 'Artist', 
                                      'master_metadata_track_name': 'Track Name',
                                      'ms_played': 'Playback Duration'})

print("Top Songs for 2024:")
print(top_songs)

Top Songs for 2024:
                                             Track Name               Artist  Playback Duration
2233                                   Oceans & Engines                 NIKI          3273.5757
2933                                      Sudden Shower              ECLIPSE          1739.2485
3588                                         Youngblood  5 Seconds of Summer          1588.3796
2305  Outer Space / Carry On (Live from The Royal Al...  5 Seconds of Summer          1507.2542
3045                                              Teeth  5 Seconds of Summer          1433.0082
216                                             Amnesia  5 Seconds of Summer          1396.1204
1799                                            Lighter             Galantis          1389.6901
2735                               She Looks So Perfect  5 Seconds of Summer          1210.1667
2569                                            Run Run              ECLIPSE          1185.3605
323                 

### **`Podcast` Streaming Analysis**

#### **Yearly podcast streaming trend**

In [188]:
grouped_data = spotify_podcast.groupby(spotify_podcast['ts'].dt.year).size().reset_index(name='podcast_count')
grouped_data.columns = ['year', 'podcast_count']
grouped_data = grouped_data.sort_values(by='year')

In [189]:
fig = px.line(grouped_data, x='year', y='podcast_count',
              labels={'year': 'Year', 'podcast_count': 'Total Episode'},
              color_discrete_sequence=['green'],
              markers=True)

fig.update_layout(title=dict(text='Yearly Podcast Listening Trends', x=0.5),
                  xaxis_title='Year',
                  yaxis_title='Total Number of Podcast Episodes',
                  width=800, height=500,
                  yaxis=dict(tickformat=',.0f', range=[0, grouped_data['podcast_count'].max() * 1.1]))
fig.show()

#### **Monthly podcast streaming trend**

In [190]:
# Set the 'ts' column as the index
indexed = spotify_podcast.set_index('ts')

# Resample on a monthly basis, summing only the numeric playback column
streaming_trends = indexed[['ms_played']].resample('M').sum()

# Create a line plot using Plotly Express
fig = px.line(streaming_trends, x=streaming_trends.index, y=(streaming_trends['ms_played'] / (1000 * 60 * 60)).round(2),
              labels={'ts': 'Month-Year', 'y': 'Hours Played'},
              title='Monthly Podcast Streaming Trends Over Time',
              color_discrete_sequence=['green'], markers=True)

fig.add_vrect(x0="2022-11-01", x1="2023-01-01", fillcolor="rgba(0,255,0,0.2)", layer="below", line_width=0)
fig.add_vrect(x0="2024-01-01", x1="2024-04-01", fillcolor="rgba(0,255,0,0.2)", layer="below", line_width=0)

fig.update_layout(title=dict(text='Monthly Podcast Streaming Trends Over Time', x=0.5),
                  xaxis_title='Month-Year',
                  yaxis_title='Hours Played',
                  width=1100, height=500)
fig.show()

#### **Weekly podcast streaming trend**

In [191]:
# Analyze trends based on the day of the week
trends_by_day = spotify_podcast.groupby(spotify_podcast['ts'].dt.dayofweek).agg({
    'ms_played': lambda x: x.sum() / (1000 * 60 * 60)  # Convert milliseconds to hours and sum
}).reset_index()

# Get the top 2 days of the week
top_days = trends_by_day.nlargest(2, 'ms_played').index

# Visualize the trends
fig = px.bar(trends_by_day, x='ts', y='ms_played',
             labels={'ms_played': 'Total Streaming Time (Hours)', 'ts': 'Day of the Week'},
             title='Podcast Streaming Trends by Day of the Week')

# Add a line to show the mean   
mean_trends_by_day = trends_by_day['ms_played'].mean()
fig.add_hline(y=mean_trends_by_day, 
              line_dash="dot", line_color="blue", 
              annotation_text=f'Mean: {mean_trends_by_day:.2f} hours',
              annotation_position="top right")

colors = ['green' if i in top_days else '#16D35C' for i in trends_by_day.index]
fig.update_traces(marker_color=colors,
                  text=trends_by_day['ms_played'].round(2), 
                  textposition='inside', 
                  textfont=dict(color='white'))

fig.update_layout(xaxis=dict(tickmode='array', tickvals=list(range(7)), ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']),
                  title=dict(text='Podcast Streaming Trends by Day of the Week', x=0.5),
                  width=800, height=500)
fig.show()

#### **Favourite podcast/show name all the time**

In [192]:
# Calculate the total playback time for each show in hours
showname_playback_duration = spotify_data.groupby('episode_show_name')['ms_played'].sum() / (1000 * 60 * 60)

# Sort shows by total playback time, from most to least, and keep the top 20
showname_playback_duration = showname_playback_duration.sort_values(ascending=False).head(20).reset_index()
showname_playback_duration

Unnamed: 0,episode_show_name,ms_played
0,Mishary Rashid Alafasy,25.836151
1,Rintik Sedu,19.88352
2,The Espresso Hour,14.40757
3,DataFramed,8.197462
4,Meditate with Tsamara,7.929386
5,PodQuest,7.425116
6,"Kuas, Kanvas dan Bulan Kesepian",6.51402
7,BukaTalks,5.761425
8,The Koe Cast,5.726045
9,Huberman Lab,5.687858


In [193]:
# Visualize top shows
top_10_shows = showname_playback_duration.head(10)

fig = px.bar(top_10_shows, x='episode_show_name', y='ms_played',
             labels={'episode_show_name': 'Shows', 'ms_played': 'Total Streams in Hours'},
             title='Top 10 Most Streamed Podcasts Based on Playback Duration')

# Highlight the top 3 shows in dark green and use Spotify green for the rest
top_3_shows = top_10_shows['episode_show_name'].head(3).tolist()
color_scale = ['green' if show in top_3_shows else '#16D35C' for show in top_10_shows['episode_show_name']]

# Update colors and text labels for each bar
fig.update_traces(marker_color=color_scale,
                  text=top_10_shows['ms_played'].round(2), 
                  textposition='inside', 
                  textfont=dict(color='white'))

fig.update_layout(xaxis_title='Shows', 
                  yaxis_title='Total Streams in Hours', 
                  xaxis_tickangle=-45, showlegend=False, title_x=0.5,
                  width=1000, height=500)
fig.show()

In [194]:
# The top 20 shows with the most streams based on play counts
showname_play_counts = spotify_podcast['episode_show_name'].value_counts().head(20)
showname_play_counts

episode_show_name
Mishary Rashid Alafasy                371
Rintik Sedu                           204
Menjadi Manusia                        68
Yaqeen Podcast                         63
YOASOBI THE BOOK 2 Spotify Edition     58
The Late Brunch with Sara Neyrhiza     50
BukaTalks                              45
quranreview                            45
Meditate with Tsamara                  43
Kuas, Kanvas dan Bulan Kesepian        42
The Espresso Hour                      35
MengAnalisa                            34
PodQuest                               31
Calm it Down                           29
WORK LIFE TRAMPOLINE                   27
Podcast Raditya Dika                   27
Endgame with Gita Wirjawan             26
Data Talks                             25
Mudacumasekali                         25
The Friday Podcast                     23
Name: count, dtype: int64

In [195]:
# Visualize top shows
top_10_show_counts = showname_play_counts.head(10)

fig = px.bar(top_10_show_counts, x=top_10_show_counts.index, y=top_10_show_counts.values,
             labels={'x': 'Shows', 'y': 'Total Streams'},
             title='Top 10 Most Streamed Podcasts Based on Play Counts')

# Highlight the top 3 shows in dark green and use Spotify green for the rest
color_scale = ['green' if show in top_10_show_counts.head(3).index else '#16D35C' for show in top_10_show_counts.index]

# Update colors and text labels for each bar
fig.update_traces(marker_color=color_scale,
                  text=top_10_show_counts.values, 
                  textposition='inside', 
                  textfont=dict(color='white'))

fig.update_layout(xaxis_title='Shows', 
                  yaxis_title='Total Streams', 
                  xaxis_tickangle=-45, showlegend=False, title_x=0.5,
                  width=1000, height=500)
fig.show()