# Processing Spotify Listening History Data

Spotify provides a [personal data archive](https://www.spotify.com/us/account/privacy/) that includes your listening history. Depending on how long you are willing to wait for the archive, you can request some or all of your Spotify data. I chose the shorter wait because I only needed my listening history from while I was driving for Uber. The [data key](https://support.spotify.com/us/article/understanding-my-data/) shows that StreamingHistory contains four items:

* Date and time of when the stream ended in Coordinated Universal Time format (UTC).
* Name of "creator" for each stream (e.g. the artist name of a music track).
* Name of items listened to or watched (e.g. title of music track or name of video).
* "msPlayed": how many milliseconds the track was listened to.
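
To make the four fields concrete, here is a minimal sketch of a single record. It assumes each export file is a JSON array of objects keyed by `endTime`, `artistName`, `trackName`, and `msPlayed` (the column names that appear in the DataFrame output further down); the values are copied from the first row of that output.

```python
import json

# One record, assuming each export file is a JSON array of objects like this.
# The values are copied from the first row of this notebook's DataFrame output.
sample = """
[
  {
    "endTime": "2023-09-12 23:59",
    "artistName": "Aphex Twin",
    "trackName": "Ageispolis",
    "msPlayed": 96352
  }
]
"""

records = json.loads(sample)
first = records[0]

# msPlayed is in milliseconds; divide by 1000 to get seconds.
seconds_played = first["msPlayed"] / 1000
print(f'{first["artistName"]} - {first["trackName"]}: {seconds_played:.1f} s')
# Aphex Twin - Ageispolis: 96.4 s
```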

In [8]:
import pandas as pd
import glob
import hashlib
import json
import re

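# Collect every streaming-history file from the export.
file_pattern = "../data/StreamingHistory_music_*.json"
file_list = glob.glob(file_pattern)
# Sort by the numeric suffix so that file 10 comes after file 9, not after file 1.
# Anchoring the regex to the suffix avoids matching digits elsewhere in the path.
sorted_files = sorted(file_list, key=lambda x: int(re.search(r'_(\d+)\.json$', x).group(1)))

# Load each file and stack the results into a single DataFrame.
dataframes = []
for file in sorted_files:
    dataframes.append(pd.read_json(file))

spotify_df = pd.concat(dataframes, ignore_index=True)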

print(spotify_df.head())
print(spotify_df.info())

            endTime    artistName  \
0  2023-09-12 23:59    Aphex Twin   
1  2023-09-13 23:01  Jamar Rogers   
2  2023-09-13 23:02   DJ MENOR ML   
3  2023-09-14 00:10      Overmono   
4  2023-09-14 00:10      Overmono   

                                          trackName  msPlayed  
0                                        Ageispolis     96352  
1                              God Bless Your Lungs     47466  
2  Troquei Meu Playstation Por Um Pentão De Robocop     39573  
3                                         If U Ever      2730  
4                                         Good Lies      5546  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61710 entries, 0 to 61709
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     61710 non-null  object
 1   artistName  61710 non-null  object
 2   trackName   61710 non-null  object
 3   msPlayed    61710 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 1.9

# Adjusting Time

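To make further analysis easier, we want timezone-aware datetime objects for each track's start and end times. Since the export only records the end time, we can estimate the start time by subtracting `msPlayed` (in milliseconds) from the track end time. Note that `endTime` is only recorded to the minute, so both times are approximate. We then need to convert both times from UTC to Chicago local time (`America/Chicago`), which is UTC-6 in standard time and UTC-5 during daylight saving time.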
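
As a quick sanity check of that arithmetic, the sketch below works through a single value by hand, using the first row of the output above (stream ended at 2023-09-12 23:59 UTC after 96352 ms). The January timestamp is a hypothetical value, not taken from the data, included only to show that the offset applied by `America/Chicago` depends on the date.

```python
import pandas as pd

# First row of the data above: the stream ended at 23:59 UTC after 96352 ms.
end_utc = pd.Timestamp("2023-09-12 23:59", tz="UTC")
start_utc = end_utc - pd.Timedelta(96352, unit="ms")

# Convert to Chicago local time; September falls in daylight saving time.
start_local = start_utc.tz_convert("America/Chicago")
print(start_local)  # 2023-09-12 18:57:23.648000-05:00

# A hypothetical winter timestamp (not from the data) falls in standard time.
winter_local = pd.Timestamp("2024-01-15 12:00", tz="UTC").tz_convert("America/Chicago")
print(winter_local)  # 2024-01-15 06:00:00-06:00
```

The September result matches the `time_start` value in the first row of the converted DataFrame, which suggests the offset shown there (-05:00) is daylight saving time rather than an error.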

In [9]:
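# Parse the minute-resolution end time (still timezone-naive at this point).
spotify_df['time_end'] = pd.to_datetime(spotify_df['endTime'], format='%Y-%m-%d %H:%M')
# Estimate the start time by subtracting the play duration.
spotify_df['time_start'] = spotify_df['time_end'] - pd.to_timedelta(spotify_df['msPlayed'], unit='ms')
# Mark both columns as UTC, then convert to Chicago local time.
spotify_df['time_start'] = spotify_df['time_start'].dt.tz_localize('UTC').dt.tz_convert('America/Chicago')
spotify_df['time_end'] = spotify_df['time_end'].dt.tz_localize('UTC').dt.tz_convert('America/Chicago')
# The original columns are now redundant.
spotify_df.drop(columns=['endTime', 'msPlayed'], inplace=True)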

print(spotify_df.head())
print(spotify_df.info())

     artistName                                         trackName  \
0    Aphex Twin                                        Ageispolis   
1  Jamar Rogers                              God Bless Your Lungs   
2   DJ MENOR ML  Troquei Meu Playstation Por Um Pentão De Robocop   
3      Overmono                                         If U Ever   
4      Overmono                                         Good Lies   

                   time_end                       time_start  
0 2023-09-12 18:59:00-05:00 2023-09-12 18:57:23.648000-05:00  
1 2023-09-13 18:01:00-05:00 2023-09-13 18:00:12.534000-05:00  
2 2023-09-13 18:02:00-05:00 2023-09-13 18:01:20.427000-05:00  
3 2023-09-13 19:10:00-05:00 2023-09-13 19:09:57.270000-05:00  
4 2023-09-13 19:10:00-05:00 2023-09-13 19:09:54.454000-05:00  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61710 entries, 0 to 61709
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype                          
---  ------      --------------  -

In [11]:
spotify_df.to_csv('../data/spotify.csv', index=False)

In [12]:
print(f"Unique artists: {spotify_df['artistName'].nunique()}")
unique_song_count = spotify_df.drop_duplicates(subset=["trackName", "artistName"]).shape[0]
print(f"Unique songs: {unique_song_count}")

Unique artists: 5502
Unique songs: 13340
