# Processing Spotify Listening History Data

Spotify provides a [personal data archive](https://www.spotify.com/us/account/privacy/) that includes your listening history. How much of your Spotify data you get depends on which archive you request and how long you are willing to wait for it. I chose the shorter wait, since I only need my listening history from the time I spent driving for Uber. For the extended listening history, the [data key](https://support.spotify.com/us/article/understanding-my-data/) shows that each record contains 21 items:

1. Date and time of when the stream ended in UTC format.
2. Your Spotify username.
3. Platform used when streaming the track (e.g. Android OS, Google Chromecast).
4. For how many milliseconds the track was played.
5. Country code of the country where the stream was played.
6. IP address used when streaming the track.
7. User agent used when streaming the track (e.g. a browser, like Mozilla Firefox, or Safari).
8. Name of the track.
9. Name of the artist, band or podcast.
10. Name of the album of the track.
11. A Spotify Track URI, that is identifying the unique music track.
12. Name of the episode of the podcast.
13. Name of the show of the podcast.
14. A Spotify Episode URI, that is identifying the unique podcast episode.
15. Reason why the track started (e.g. previous track finished or you picked it from the playlist).
16. Reason why the track ended (e.g. the track finished playing or you hit the next button).
17. Whether shuffle mode was used when playing the track.
18. Information whether the user skipped to the next song.
19. Information whether the track was played in offline mode.
20. Timestamp of when offline mode was used, if it was used.
21. Information whether the track was played during a private session.

I'm really just interested in these items:

1. Date and time of when the stream ended in UTC format.
3. Platform used when streaming the track (e.g. Android OS, Google Chromecast).
4. For how many milliseconds the track was played.
8. Name of the track.
9. Name of the artist, band or podcast.
11. A Spotify Track URI, that is identifying the unique music track.

In [28]:
import pandas as pd
import glob
import hashlib
import json
import re

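# Collect every audio streaming-history file in the extended archive
file_pattern = "../data/spotify_extended_data/Streaming_History_Audio_*.json"
file_list = glob.glob(file_pattern)
# Sort by the first number found in each path so the files load in a stable order
sorted_files = sorted(file_list, key=lambda x: int(re.search(r'(\d+)', x).group(1)))

# Read each file into a DataFrame and stack them into one table
dataframes = [pd.read_json(file) for file in sorted_files]
songs = pd.concat(dataframes, ignore_index=True)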

print(songs.info())
print(songs['platform'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86734 entries, 0 to 86733
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   ts                                 86734 non-null  object 
 1   platform                           86734 non-null  object 
 2   ms_played                          86734 non-null  int64  
 3   conn_country                       86734 non-null  object 
 4   ip_addr                            86734 non-null  object 
 5   master_metadata_track_name         86694 non-null  object 
 6   master_metadata_album_artist_name  86694 non-null  object 
 7   master_metadata_album_album_name   86694 non-null  object 
 8   spotify_track_uri                  86694 non-null  object 
 9   episode_name                       37 non-null     object 
 10  episode_show_name                  37 non-null     object 
 11  spotify_episode_uri                37 non-null     obj

In [29]:
# Keep only the columns of interest and give them shorter names
songs = songs[['ts', 'platform', 'ms_played', 'master_metadata_track_name', 'master_metadata_album_artist_name', 'spotify_track_uri']]
songs = songs.rename(columns={
    'ts': 'time_end',
    'master_metadata_track_name': 'track',
    'master_metadata_album_artist_name': 'artist',
    'spotify_track_uri': 'uri'
})
# Keep only the streams whose platform is recorded as 'ios'
songs = songs[songs['platform'] == 'ios']
print(songs.info())

<class 'pandas.core.frame.DataFrame'>
Index: 79789 entries, 0 to 86733
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   time_end   79789 non-null  object
 1   platform   79789 non-null  object
 2   ms_played  79789 non-null  int64 
 3   track      79749 non-null  object
 4   artist     79749 non-null  object
 5   uri        79749 non-null  object
dtypes: int64(1), object(5)
memory usage: 4.3+ MB
None


# Adjusting Time

To make further analysis easier, we need datetime objects for each song's start and end times. The data only records the end time, so we get the start time by subtracting `ms_played` (the play duration in milliseconds) from the track end time. We then convert both times from UTC to Chicago local time, which is UTC-6 in winter and UTC-5 during daylight saving time.

In [30]:
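# Drop zero-length plays; copy so the assignments below act on an independent DataFrame
songs = songs[songs['ms_played'] > 0].copy()
# Parse the UTC end timestamps, then derive each start time from the play duration
songs['time_end'] = pd.to_datetime(songs['time_end'], format='ISO8601')
songs['time_start'] = songs['time_end'] - pd.to_timedelta(songs['ms_played'], unit='ms')
# Convert both columns from UTC to Chicago local time
songs['time_start'] = songs['time_start'].dt.tz_convert('America/Chicago')
songs['time_end'] = songs['time_end'].dt.tz_convert('America/Chicago')
songs.drop(columns=['ms_played'], inplace=True)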

print(songs.head())
print(songs.info())

                   time_end platform                     track  \
0 2023-09-21 09:04:06-05:00      ios                     Human   
1 2023-09-21 09:07:45-05:00      ios  Universe (feat. Kehlani)   
2 2023-09-21 09:08:05-05:00      ios                  BackBack   
3 2023-09-21 09:11:36-05:00      ios                  Take Two   
4 2023-09-21 09:15:05-05:00      ios  Kiss Me More (feat. SZA)   

            artist                                   uri  \
0        Sevdaliza  spotify:track:5h0M2GbBfvOj8GdG7sIDQT   
1    Ty Dolla $ign  spotify:track:5waFNguEkggHt2R05RxNBp   
2          SuhnDon  spotify:track:2jP6fOa57zmFUU7JxcwtSp   
3  Chong the Nomad  spotify:track:2QcF4vRgpl0ndKdC3vr2iM   
4         Doja Cat  spotify:track:748mdHapucXQri7IAO8yFK   

                        time_start  
0 2023-09-21 09:03:45.132000-05:00  
1 2023-09-21 09:04:04.862000-05:00  
2 2023-09-21 09:07:46.023000-05:00  
3 2023-09-21 09:08:04.387000-05:00  
4 2023-09-21 09:11:36.134000-05:00  
<class 'pandas.core.
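
As a sanity check on the arithmetic, the first row of the output above can be reproduced by hand. The sketch below uses only the standard library rather than pandas. The `Z`-suffixed UTC timestamp string is an assumption about the raw `ts` format, the 20,868 ms duration is back-computed from the printed start and end times, and the helper name `start_and_end_local` and the January timestamp are hypothetical, added for illustration.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

CHICAGO = ZoneInfo("America/Chicago")


def start_and_end_local(ts_utc: str, ms_played: int, tz=CHICAGO):
    """Return (start, end) in local time for a single record (hypothetical helper)."""
    # Replace the trailing "Z" so fromisoformat also works on older Python versions
    end_utc = datetime.fromisoformat(ts_utc.replace("Z", "+00:00"))
    # The record stores only the end time, so subtract the play duration
    start_utc = end_utc - timedelta(milliseconds=ms_played)
    return start_utc.astimezone(tz), end_utc.astimezone(tz)


# Row 0 of the output above, expressed in UTC (duration back-computed)
start, end = start_and_end_local("2023-09-21T14:04:06Z", 20_868)
print(start)  # 2023-09-21 09:03:45.132000-05:00
print(end)    # 2023-09-21 09:04:06-05:00

# Hypothetical winter timestamp, to show the standard-time offset
_, winter_end = start_and_end_local("2024-01-15T14:04:06Z", 20_868)
print(winter_end.utcoffset() / timedelta(hours=1))  # -6.0
```

The September row lands at -05:00 because Chicago observes daylight saving time on that date, while the January timestamp falls at -06:00. The `America/Chicago` zone name handles both cases, which a fixed UTC-6 offset would not.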

In [31]:
songs.to_csv('../data/spotify.csv', index=False)

In [32]:
print(f"Unique artists: {songs['artist'].nunique()}")
unique_song_count = songs.drop_duplicates(subset=["track", "artist"]).shape[0]
print(f"Unique songs: {unique_song_count}")

Unique artists: 6257
Unique songs: 15461
