In [1]:
from os import path
import numpy as np
import pandas as pd
from IPython.display import display

if not path.exists("data/raw/StreamingHistory.csv"):
    streaming_history_0_df = pd.read_json("data/raw/StreamingHistory0.json")
    streaming_history_1_df = pd.read_json("data/raw/StreamingHistory1.json")
    streaming_history_2_df = pd.read_json("data/raw/StreamingHistory2.json")
    streaming_history_3_df = pd.read_json("data/raw/StreamingHistory3.json")

    stream_history_df = pd.concat([streaming_history_0_df, streaming_history_1_df, streaming_history_2_df, streaming_history_3_df])
    stream_history_df.to_csv("data/raw/StreamingHistory.csv")
else:
    stream_history_df = pd.read_csv("data/raw/StreamingHistory.csv")
    stream_history_df.drop(stream_history_df.columns[0], axis=1, inplace=True)

print(stream_history_df.shape)
stream_history_df.head()



(32398, 4)


Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2020-07-02 17:14,Radio Rental,Episode 06,3197470
1,2020-10-15 23:59,Dawes,From the Right Angle,105984
2,2020-10-16 02:02,The Arcadian Wild,Letters from the Atlantic,295437
3,2020-10-16 02:02,The Arcadian Wild,Roots,0
4,2020-10-16 02:02,The Arcadian Wild,Blue Eyed Girl,162


## Cleaning Up the Data

First, parse the `endTime` strings into proper datetimes and shift them to my local time zone

In [2]:
from datetime import timedelta

stream_history_df["date"] = pd.to_datetime(stream_history_df["endTime"], format="%Y-%m-%d %H:%M")
# Shift by a fixed six hours to put it in CST (this does not account for daylight saving time)
stream_history_df["date"] = stream_history_df["date"] - timedelta(hours=6)
stream_history_df.drop("endTime", axis=1, inplace=True)
stream_history_df["date"].describe(datetime_is_numeric=True)

count                            32398
mean     2021-05-04 22:04:09.046854656
min                2020-07-02 11:14:00
25%                2021-01-14 14:17:45
50%                2021-04-19 10:01:30
75%                2021-08-23 17:53:30
max                2022-01-05 17:59:00
Name: date, dtype: object
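
The fixed six-hour shift matches Central Standard Time but is an hour off during daylight saving time. Here is a minimal sketch of a timezone-aware alternative. It assumes the `endTime` values in the export are in UTC (worth checking against your own data) and uses a small made-up frame called `sample_df` rather than the real history:

```python
import pandas as pd

# Hypothetical sample; the real notebook reads these values from StreamingHistory*.json
sample_df = pd.DataFrame({"endTime": ["2021-01-15 18:00", "2021-07-15 18:00"]})

# Parse the full timestamp, including hours and minutes
utc_times = pd.to_datetime(sample_df["endTime"], format="%Y-%m-%d %H:%M")

# Assumption: endTime is UTC. Convert to US Central time, then drop the tz info
# so the column stays a plain datetime column like the one used in this notebook
sample_df["date"] = (
    utc_times.dt.tz_localize("UTC")
    .dt.tz_convert("America/Chicago")
    .dt.tz_localize(None)
)
print(sample_df)
```

With this approach the January row is shifted by six hours and the July row by five, so summer listening lands on the right local hour.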

In [3]:
stream_history_df = stream_history_df.rename({
    "artistName": "artist_name",
    "trackName": "track_name",
    "msPlayed": "ms_played"
}, axis=1)

For simplicity's sake, I'm only going to focus on what I listened to in 2021

In [4]:
stream_history_df = stream_history_df[(stream_history_df["date"] >= "2021-01-01") & (stream_history_df["date"] < "2022-01-01")].reset_index(drop=True)
stream_history_df.date.describe(datetime_is_numeric=True)

count                            25637
mean     2021-06-12 23:11:57.996645376
min                2021-01-01 00:01:00
25%                2021-03-08 13:21:00
50%                2021-05-28 12:55:00
75%                2021-09-15 09:44:00
max                2021-12-31 17:31:00
Name: date, dtype: object

Before I get rid of anything, let's take a look at how long I listened to music or podcasts this year!

In [5]:
time_streamed_ms = stream_history_df["ms_played"].sum()
time_streamed_sec = time_streamed_ms / 1000
time_streamed_min = time_streamed_sec / 60
time_streamed_hour = time_streamed_min / 60
time_streamed_day = time_streamed_hour / 24
print("Days I have listened to Spotify: {:.3f} days".format(time_streamed_day))

Days I have listened to Spotify: 62.713 days


## Consolidating Data

We're about to make a helluva lot of calls to Spotify, so we'll want to consolidate the data to make as few calls as possible. To do this, we'll gather all unique track and artist pairs, so each track gets looked up only once no matter how many times it was played

In [6]:
stream_history_consolidated_df = stream_history_df[["track_name", "artist_name"]].drop_duplicates().reset_index(drop=True)
print(stream_history_consolidated_df.shape)
stream_history_consolidated_df.head()

(6146, 2)


Unnamed: 0,track_name,artist_name
0,Auld Lang Syne,Bing Crosby
1,Perfect Symphony (Ed Sheeran & Andrea Bocelli),Ed Sheeran
2,Honey Hold Me,Morningsiders
3,Only Home I've Ever Known,The California Honeydrops
4,Mind is a Mountain,The Get Ahead
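
To make the saving concrete, here is a tiny sketch on made-up plays (the names `plays_df` and `unique_tracks_df` are just for illustration). In the real data this same step takes 25637 plays down to 6146 lookups:

```python
import pandas as pd

# Hypothetical plays: the same song streamed three times
plays_df = pd.DataFrame({
    "track_name": ["Roots", "Roots", "Honey Hold Me", "Roots"],
    "artist_name": ["The Arcadian Wild", "The Arcadian Wild", "Morningsiders", "The Arcadian Wild"],
    "ms_played": [1000, 250000, 181455, 90000],
})

# One row per (track, artist) pair means one search per pair instead of one per play
unique_tracks_df = plays_df[["track_name", "artist_name"]].drop_duplicates().reset_index(drop=True)
print(len(plays_df), "plays ->", len(unique_tracks_df), "lookups")
```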


Now, it's imperative that I get the IDs for all of these tracks, because if I have the IDs I'll be able to do all sorts of stuff including gathering additional data on the tracks. So here's my algorithm for extracting the ID from the `track_name` and `artist_name`:

1. Search for the track by calling Spotify's search API with the query `'track:"{track_name}" AND artist:"{artist_name}"'`. I'll only be searching for tracks, regardless of whether or not the thing being searched is a song or podcast episode
2. Take the list of all potential matches, and filter them. Only accept the track if the track name from the API call is exactly the same as the given `track_name`, and if the `artist_name` is included in the artist names of the track from the API call
3. If that list is nonempty, return the first result with type `track` (Note: this would be a good place to add an "accurate" option. If the list has more than one element, have the user manually select which is correct [AKA throw error `Multiple matches found`])
4. If the track list is empty, then it is probably a podcast episode
5. Search for the episode by calling Spotify's search API with the query `'"{track_name}" AND "{artist_name}"'`, and only search for episodes
6. Filter the list of potential matches, accepting an episode only if its name exactly matches `track_name`
7. If that list is empty, return `None`
8. If that list has 1 element, return that element's id with type `episode`
9. If that list has more than 1 element, we have to get the returned episodes' shows. To do that, we'll take the episode ids returned from Spotify's search endpoint and send them to Spotify's episodes endpoint
10. Take the results from that and keep only the episodes whose show name matches `artist_name` (for podcasts, the streaming history stores the show name in `artist_name`)
11. If that list is empty, return `None`
12. If that list is nonempty, return the first result with type `episode` (Note: same comment as above with the "accurate" option)
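
Steps 2 and 3 are the heart of the matching, so here is a minimal sketch of just that filter, with no network calls. The helper name `pick_track_id` and the results in `fake_results` are made up for illustration; the dictionaries only mimic the fields the notebook reads from the search response (`name`, `id`, and the `name` of each entry in `artists`):

```python
def pick_track_id(track_name, artist_name, candidates):
    """Return the id of the first candidate matching name and artist, else None."""
    for candidate in candidates:
        artist_names = [artist["name"] for artist in candidate["artists"]]
        # Exact title match, and the artist must be one of the credited artists
        if candidate["name"] == track_name and artist_name in artist_names:
            return candidate["id"]
    return None

# Made-up search results; the ids are placeholders, not real Spotify ids
fake_results = [
    {"id": "id-live", "name": "Roots - Live", "artists": [{"name": "The Arcadian Wild"}]},
    {"id": "id-cover", "name": "Roots", "artists": [{"name": "Someone Else"}]},
    {"id": "id-match", "name": "Roots", "artists": [{"name": "The Arcadian Wild"}, {"name": "Guest"}]},
]

print(pick_track_id("Roots", "The Arcadian Wild", fake_results))
print(pick_track_id("Roots", "Nobody", fake_results))
```

The exact-match rule is strict on purpose: a live version or a cover is skipped rather than accepted as a near match, which is part of why some entries end up needing a manual fix later.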

In [7]:
import requests
import re
import time
from urllib.parse import urlencode
# Read the API token, closing the file once we're done with it
with open("token.txt", "r") as token_file:
    token = token_file.read().strip()

base_url = "https://api.spotify.com/v1/"
query_obj = {"type": "track,episode", "market": "US", "limit": 5}

def get_correct_episode(name, show_name):
    # We have a separate search call here without the `track:` and `artist:` search fields. That way we can search for podcasts
    query_obj["q"] = str.format('"{}" AND "{}"', re.sub(r"['\"]", "", name), re.sub(r"['\"]", "", show_name))
    res = requests.get(base_url + "search" + "?" + urlencode(query_obj), headers={
        "Authorization": "Bearer " + token
    })
    if res.status_code != 200:
        # Return a plain error string; the caller wraps it in a Series
        return "Episode search call failed, try again"
    potential_episodes = list(filter(lambda x: x, res.json()["episodes"]["items"]))

    episodes = list(filter(lambda episode: episode["name"] == name, potential_episodes))
    if len(episodes) == 0:
        return None
    elif len(episodes) == 1:
        return ["episode", episodes[0]["id"]]

    # The episode object returned does not have show data, so let's get that show data
    episode_ids = list(map(lambda episode: episode["id"], episodes))
    res = requests.get(base_url + "episodes" + "?" + urlencode({"ids": ",".join(episode_ids)}), headers={
        "Authorization": "Bearer " + token
    })
    if res.status_code != 200:
        return "Episodes call failed, try again"
    res_json = res.json()
    episodes = list(filter(lambda episode: episode["show"]["name"] == show_name, res_json["episodes"]))
    if len(episodes) > 0:
        return ["episode", episodes[0]["id"]]

    # Still nothing? Just return None then
    return None

def get_correct_track_or_episode(name, artist, potential_tracks):
    # First check for tracks
    # Keep tracks whose name matches exactly and whose artists include the given artist
    tracks = [
        track for track in potential_tracks
        if track["name"] == name and any(track_artist["name"] == artist for track_artist in track["artists"])
    ]
    if len(tracks) > 0:
        return ["track", tracks[0]["id"]]

    # If no tracks follow the criteria, it may be a podcast episode
    return get_correct_episode(name, artist)

def get_spotify_data(track_name, artist_name):
    # The `track:` and `artist:` search fields restrict results to tracks, so this query cannot find podcasts
    query_obj["q"] = str.format('track:"{}" AND artist:"{}"', re.sub(r"['\"]", "", track_name), re.sub(r"['\"]", "", artist_name))
    res = requests.get(base_url + "search" + "?" + urlencode(query_obj), headers={
        "Authorization": "Bearer " + token
    })
    if res.status_code != 200:
        return pd.Series([None, None, "Track search call failed, try again"])
    res_json = res.json()

    data = get_correct_track_or_episode(track_name, artist_name, res_json["tracks"]["items"])
    if isinstance(data, str):
        return pd.Series([None, None, data])
    elif data is None:
        return pd.Series([None, None, "No data available"])
    else:
        return pd.Series([data[0], data[1], None])

Split the entries into buckets so that if one fails, we don't have to restart from scratch

In [8]:
import time
from functools import reduce

num_entries = stream_history_consolidated_df.shape[0]
bucket_size = 1000
num_buckets = num_entries // bucket_size + (1 if num_entries % bucket_size > 0 else 0)

bucket_times = [0] * num_buckets
stream_history_bucket_df = pd.DataFrame()
for i in range(num_buckets):
    if not path.exists("data/processed/history/StreamingHistoryWithData{}.csv".format(i)):
        time_start = time.time()
        stream_history_bucket_df = stream_history_consolidated_df.loc[i * bucket_size:((i + 1) * bucket_size) - 1].copy()
        stream_history_bucket_df[["type", "id", "error"]] = stream_history_bucket_df.apply(lambda x: get_spotify_data(x["track_name"], x["artist_name"]), axis=1)
        time_elapsed = time.time() - time_start
        bucket_times[i] = time_elapsed
        print("Time elapsed for bucket {}: {:.3f} seconds".format(i, time_elapsed))
        stream_history_bucket_df.to_csv("data/processed/history/StreamingHistoryWithData{}.csv".format(i))

if stream_history_bucket_df.shape[0] == 0:
    print("Already collected all data")
else:
    # Last time I ran this it took 1944.531 seconds (32.409 minutes)
    total_time_elapsed = reduce(lambda acc, time_elapsed: acc + time_elapsed, bucket_times, 0)
    print("Total time elapsed: {:.3f} seconds ({:.3f} minutes)".format(total_time_elapsed, total_time_elapsed / 60))
    display(stream_history_bucket_df.tail())

Already collected all data
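
The resume-on-failure idea can be sketched on its own, without Spotify in the loop. Everything here is a stand-in: `process_in_buckets`, `fake_lookup`, and the temporary directory are hypothetical, and the sketch only shows that a second run skips buckets whose files already exist:

```python
import os
import tempfile

import pandas as pd

def process_in_buckets(df, bucket_size, out_dir, lookup):
    """Process df in chunks, skipping chunks whose output file already exists."""
    num_buckets = -(-len(df) // bucket_size)  # ceiling division
    processed = []
    for i in range(num_buckets):
        out_path = os.path.join(out_dir, "bucket{}.csv".format(i))
        if os.path.exists(out_path):
            continue  # this bucket was finished on an earlier run
        bucket_df = df.iloc[i * bucket_size:(i + 1) * bucket_size].copy()
        bucket_df["id"] = bucket_df["track_name"].map(lookup)
        bucket_df.to_csv(out_path, index=False)
        processed.append(i)
    return processed

# Hypothetical stand-ins for the real track list and the Spotify lookup
tracks_df = pd.DataFrame({"track_name": ["a", "b", "c", "d", "e"]})
fake_lookup = lambda name: "id-" + name

with tempfile.TemporaryDirectory() as out_dir:
    first_run = process_in_buckets(tracks_df, 2, out_dir, fake_lookup)
    second_run = process_in_buckets(tracks_df, 2, out_dir, fake_lookup)
    print(first_run, second_run)
```

Because each bucket is written to disk as soon as it finishes, a crash halfway through only costs the bucket that was in flight.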


In [9]:
if not path.exists("data/processed/StreamingHistoryWithData.csv"):
    # Read every bucket file back in and stack them into one dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    bucket_dfs = []
    for i in range(num_buckets):
        temp_df = pd.read_csv("data/processed/history/StreamingHistoryWithData{}.csv".format(i))
        temp_df.drop(temp_df.columns[0], axis=1, inplace=True)
        bucket_dfs.append(temp_df)

    stream_history_consolidated_df = pd.concat(bucket_dfs).reset_index(drop=True)
    stream_history_consolidated_df.to_csv("data/processed/StreamingHistoryWithData.csv")
else:
    stream_history_consolidated_df = pd.read_csv("data/processed/StreamingHistoryWithData.csv")
    stream_history_consolidated_df.drop(stream_history_consolidated_df.columns[0], axis=1, inplace=True)
stream_history_consolidated_df.tail()

Unnamed: 0,track_name,artist_name,type,id,error
6141,Always Alright,Alabama Shakes,track,6ckUX8cgcqjoNGTd2A2Pvd,
6142,#34 Annie,Heavyweight,episode,6in5V1yLC4AdjRTpBx0BeC,
6143,Ai Weiwei On His Father's Exile — And Hopes Fo...,Consider This from NPR,episode,3gm88eARzA8gzVxL2mTrqO,
6144,Spider-Mane Was on Crack One Day,MonoNeon,track,62bqqXjFxMbXsldDz2ZUXJ,
6145,L'oiseau qui danse,Tennyson,track,3KKBT4UfWIP3fBMXn5Oprj,


### Cleaning the Data
My algorithm missed a few entries. We'll export those so they can be fixed manually using the program I made for this, the [Spotify Data Fix Program](https://justinro-underscore.github.io/SpotifyDataFix/index.html).

In [10]:
missing_data_df = stream_history_consolidated_df[stream_history_consolidated_df["id"].isna()].reset_index()
print(missing_data_df.shape[0])
if not path.exists("data/processed/MissingStreamingHistoryWithId.csv"):
    missing_data_df.to_csv("data/processed/MissingStreamingHistoryWithId.csv")
missing_data_df.head()

64


Unnamed: 0,index,track_name,artist_name,type,id,error
0,0,Auld Lang Syne,Bing Crosby,,,No data available
1,98,"Saturday, January 2, 2021",Up First,,,No data available
2,144,"Monday, Jan. 4, 2021",Up First,,,No data available
3,246,"Tuesday, January 5, 2021",Up First,,,No data available
4,340,"Wednesday, Jan. 6, 2021",Up First,,,No data available


Now go! Fix the data! Return when it's done!

In [11]:
if path.exists("data/processed/FixedDataForJustin.csv"):
    fixed_data_df = pd.read_csv("data/processed/FixedDataForJustin.csv")
    fixed_data_df.drop(fixed_data_df.columns[0], axis=1, inplace=True)
    fixed_data_df.index = fixed_data_df["index"]
    fixed_data_df.drop(columns=["index"], axis=1, inplace=True)
    display(fixed_data_df.head())
else:
    print("Please run the Spotify Fix")

index,track_name,artist_name,type,id,delete
0,Auld Lang Syne,Bing Crosby,,,True
98,"Saturday, January 2, 2021",Up First,,,True
144,"Monday, Jan. 4, 2021",Up First,,,True
246,"Tuesday, January 5, 2021",Up First,,,True
340,"Wednesday, Jan. 6, 2021",Up First,,,True


Let's go ahead and flag all of the tracks to delete

In [12]:
stream_history_consolidated_df["delete"] = False
to_delete_df = fixed_data_df[fixed_data_df["delete"]]
stream_history_consolidated_df.loc[to_delete_df.index, "delete"] = True
display(stream_history_consolidated_df.head())
stream_history_consolidated_df.loc[to_delete_df.index].head()

Unnamed: 0,track_name,artist_name,type,id,error,delete
0,Auld Lang Syne,Bing Crosby,,,No data available,True
1,Perfect Symphony (Ed Sheeran & Andrea Bocelli),Ed Sheeran,track,3zl7j5ua8mF4JDYuxrfo01,,False
2,Honey Hold Me,Morningsiders,track,3XBDyDl3lwihZ8taFqMsJa,,False
3,Only Home I've Ever Known,The California Honeydrops,track,49soGZl5uftZH9E7T20SDm,,False
4,Mind is a Mountain,The Get Ahead,track,4upb9RfRf0hfW1rTU8bozj,,False


Unnamed: 0,track_name,artist_name,type,id,error,delete
0,Auld Lang Syne,Bing Crosby,,,No data available,True
98,"Saturday, January 2, 2021",Up First,,,No data available,True
144,"Monday, Jan. 4, 2021",Up First,,,No data available,True
246,"Tuesday, January 5, 2021",Up First,,,No data available,True
340,"Wednesday, Jan. 6, 2021",Up First,,,No data available,True


In [13]:
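# Keep only the rows that were fixed rather than flagged for deletion
fixed_data_df = fixed_data_df[~fixed_data_df["delete"]].copy()
fixed_data_df[["true_track_name", "true_artist_name"]] = [None, None]
# Several history entries can end up fixed to the same Spotify id;
# for each such id, take the first entry's names as the canonical ones
duplicated_tracks = fixed_data_df[fixed_data_df.duplicated("id", keep=False)]
true_duplicate_names = duplicated_tracks.groupby("id").apply(lambda x: x.iloc[0][["track_name", "artist_name"]])
true_duplicate_names.index.name = None
true_duplicate_names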

Now we can set the `type` and `id` of all the missing tracks

In [14]:
fixed_data_df = fixed_data_df[~fixed_data_df["delete"]]

stream_history_consolidated_df.loc[fixed_data_df.index, "type"] = fixed_data_df["type"]
stream_history_consolidated_df.loc[fixed_data_df.index, "id"] = fixed_data_df["id"]
stream_history_consolidated_df.loc[fixed_data_df.index].head()

Unnamed: 0,track_name,artist_name,type,id,error,delete
552,America The Beautiful,Ray Charles,track,5uPakSfc8x3RpbAGSpEHeB,No data available,False
818,Short Skirt / Long Jacket,Cake,track,3OOFEF20WqtsUPcRbPY3L7,No data available,False
937,Together,Matthew Halsall,track,6sJUbzyzNOrAt9dr3pkOPk,No data available,False
1077,Midnight Lorry,Dispatch,track,0OoODSk0hu4WWmi3yKbRP3,No data available,False
1168,"My Brother, My Keeper",Mandolin Orange,track,2kfGWmg7r3dRdTgQXvUqs1,No data available,False


Not sure why I added this `error` column; I didn't end up using it for anything:

In [15]:
stream_history_consolidated_df.drop(columns=["error"], axis=1, inplace=True)

Finally we can merge the two dataframes and remove the ones that need deleting. Now we have the full streaming history data!

In [16]:
stream_history_df = stream_history_df.merge(stream_history_consolidated_df, on=["artist_name", "track_name"])
stream_history_df = stream_history_df[~stream_history_df["delete"]].drop(columns=["delete"], axis=1)
stream_history_df = stream_history_df.sort_values(by=["date"]).reset_index(drop=True)
stream_history_df.head()

Unnamed: 0,artist_name,track_name,ms_played,date,type,id
0,Ed Sheeran,Perfect Symphony (Ed Sheeran & Andrea Bocelli),16889,2021-01-01 00:04:00,track,3zl7j5ua8mF4JDYuxrfo01
1,Morningsiders,Honey Hold Me,181455,2021-01-01 00:04:00,track,3XBDyDl3lwihZ8taFqMsJa
2,The California Honeydrops,Only Home I've Ever Known,246213,2021-01-01 00:09:00,track,49soGZl5uftZH9E7T20SDm
3,Doc Robinson,Slip Away,182013,2021-01-01 00:12:00,track,2vwpOGHlOroQYiIByW7qa3
4,The Get Ahead,Mind is a Mountain,4711,2021-01-01 00:12:00,track,4upb9RfRf0hfW1rTU8bozj
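
As a small illustration of what the merge and filter do, here is a sketch on made-up frames (`plays_df` and `lookup_df` are hypothetical, and the ids are placeholders). Every play picks up the `id` and `delete` flag of its track, and plays of flagged tracks are then dropped:

```python
import pandas as pd

# Hypothetical plays: track x by artist A was played twice
plays_df = pd.DataFrame({
    "artist_name": ["A", "A", "B"],
    "track_name": ["x", "x", "y"],
    "ms_played": [100, 200, 300],
})

# Hypothetical per-track lookup table with one track flagged for deletion
lookup_df = pd.DataFrame({
    "artist_name": ["A", "B"],
    "track_name": ["x", "y"],
    "id": ["id-x", None],
    "delete": [False, True],
})

# Join on both keys so tracks with the same title by different artists stay distinct
merged_df = plays_df.merge(lookup_df, on=["artist_name", "track_name"])
kept_df = merged_df[~merged_df["delete"]].drop(columns=["delete"]).reset_index(drop=True)
print(kept_df)
```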


In [17]:
# Filter and drop in one chain to avoid modifying a slice in place
stream_history_consolidated_df = stream_history_consolidated_df[~stream_history_consolidated_df["delete"]].drop(columns=["delete"])
stream_history_consolidated_df.head()

Unnamed: 0,track_name,artist_name,type,id
1,Perfect Symphony (Ed Sheeran & Andrea Bocelli),Ed Sheeran,track,3zl7j5ua8mF4JDYuxrfo01
2,Honey Hold Me,Morningsiders,track,3XBDyDl3lwihZ8taFqMsJa
3,Only Home I've Ever Known,The California Honeydrops,track,49soGZl5uftZH9E7T20SDm
4,Mind is a Mountain,The Get Ahead,track,4upb9RfRf0hfW1rTU8bozj
5,Slip Away,Doc Robinson,track,2vwpOGHlOroQYiIByW7qa3


Let's also fix the track name and artist name of any duplicates that may have arisen while fixing the missing data:

In [18]:
stream_history_consolidated_df = stream_history_consolidated_df.set_index("id")
stream_history_consolidated_df.update(true_duplicate_names)
stream_history_consolidated_df = stream_history_consolidated_df.drop_duplicates().reset_index()

In [19]:
stream_history_df = stream_history_df.set_index("id")
stream_history_df.update(true_duplicate_names)
stream_history_df = stream_history_df.reset_index()
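
Here is a minimal sketch of how `DataFrame.update` lines rows up by index, using made-up frames (`tracks_df` and `canonical_df` are hypothetical, and the ids are placeholders). Rows sharing an id receive the same canonical names, which is what lets `drop_duplicates` collapse them afterwards:

```python
import pandas as pd

# Two entries that were fixed to the same id but kept different names
tracks_df = pd.DataFrame({
    "id": ["id-1", "id-1", "id-2"],
    "track_name": ["Song (Remastered)", "Song", "Other"],
    "artist_name": ["Band", "Band", "Solo"],
}).set_index("id")

# One canonical name per duplicated id, indexed by that id
canonical_df = pd.DataFrame(
    {"track_name": ["Song"], "artist_name": ["Band"]},
    index=["id-1"],
)

# update() modifies tracks_df in place, matching rows on the index (the id)
tracks_df.update(canonical_df)
deduped_df = tracks_df.drop_duplicates().reset_index()
print(deduped_df)
```

Ids that are not in `canonical_df` are left untouched, so only the tracks that were actually duplicated get renamed.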

And let's save this data:

In [20]:
if not path.exists("data/processed/FullTrackData.csv"):
    stream_history_consolidated_df.to_csv("data/processed/FullTrackData.csv")
if not path.exists("data/ready/CompleteStreamingHistory.csv"):
    stream_history_df.to_csv("data/ready/CompleteStreamingHistory.csv")

## Complete!

Now we can move along to the fun stuff... the ANALYSIS

For the future: I should probably finish cleaning the Podcast data as well