# Feature Analysis
Let's do a little feature analysis! Mostly I'm just going to pull data from Spotify's APIs to enrich my primitive data with even more cool fields. (Look how far we've come! We started out with just a track name and an artist name.)

In [1]:
from os import path
import numpy as np
import pandas as pd
from IPython.display import display

if path.exists("data/ready/CompleteStreamingHistory.csv"):
    streaming_history_data_df = pd.read_csv("data/ready/CompleteStreamingHistory.csv")
    streaming_history_data_df.drop(streaming_history_data_df.columns[0], axis=1, inplace=True)
    print(streaming_history_data_df.shape)
    display(streaming_history_data_df.head())
else:
    print("Clean data from other journal before continuing")



(25593, 6)


Unnamed: 0,id,artist_name,track_name,ms_played,date,type
0,3zl7j5ua8mF4JDYuxrfo01,Ed Sheeran,Perfect Symphony (Ed Sheeran & Andrea Bocelli),16889,2021-01-01 00:04:00,track
1,3XBDyDl3lwihZ8taFqMsJa,Morningsiders,Honey Hold Me,181455,2021-01-01 00:04:00,track
2,49soGZl5uftZH9E7T20SDm,The California Honeydrops,Only Home I've Ever Known,246213,2021-01-01 00:09:00,track
3,2vwpOGHlOroQYiIByW7qa3,Doc Robinson,Slip Away,182013,2021-01-01 00:12:00,track
4,4upb9RfRf0hfW1rTU8bozj,The Get Ahead,Mind is a Mountain,4711,2021-01-01 00:12:00,track


In [2]:
if path.exists("data/processed/FullTrackData.csv"):
    full_track_and_episode_data_df = pd.read_csv("data/processed/FullTrackData.csv")
    full_track_and_episode_data_df.drop(full_track_and_episode_data_df.columns[0], axis=1, inplace=True)
    print(full_track_and_episode_data_df.shape)
    display(full_track_and_episode_data_df.head())
else:
    print("Clean data from other journal before continuing")

(6114, 4)


Unnamed: 0,id,track_name,artist_name,type
0,3zl7j5ua8mF4JDYuxrfo01,Perfect Symphony (Ed Sheeran & Andrea Bocelli),Ed Sheeran,track
1,3XBDyDl3lwihZ8taFqMsJa,Honey Hold Me,Morningsiders,track
2,49soGZl5uftZH9E7T20SDm,Only Home I've Ever Known,The California Honeydrops,track
3,4upb9RfRf0hfW1rTU8bozj,Mind is a Mountain,The Get Ahead,track
4,2vwpOGHlOroQYiIByW7qa3,Slip Away,Doc Robinson,track


In [3]:
# Split the combined table by type; the type column is redundant after the split
tracks_data_df = full_track_and_episode_data_df[full_track_and_episode_data_df["type"] == "track"].drop(columns=["type"])
display(tracks_data_df.head())
episodes_data_df = full_track_and_episode_data_df[full_track_and_episode_data_df["type"] == "episode"].drop(columns=["type"])
display(episodes_data_df.head())

Unnamed: 0,id,track_name,artist_name
0,3zl7j5ua8mF4JDYuxrfo01,Perfect Symphony (Ed Sheeran & Andrea Bocelli),Ed Sheeran
1,3XBDyDl3lwihZ8taFqMsJa,Honey Hold Me,Morningsiders
2,49soGZl5uftZH9E7T20SDm,Only Home I've Ever Known,The California Honeydrops
3,4upb9RfRf0hfW1rTU8bozj,Mind is a Mountain,The Get Ahead
4,2vwpOGHlOroQYiIByW7qa3,Slip Away,Doc Robinson


Unnamed: 0,id,track_name,artist_name
7,4vqMyjMPaLZc06BBSzBEk3,Losing Relatives to Fox News,You're Wrong About
127,3L7WxMTjXmSzhGncCUIIJ6,The Stanford Prison Experiment,You're Wrong About
138,5uqoeTTKOPlXKkcNGrSMpb,100: The 100th Episode of the Podcast! - The G...,The Gus & Eddy Podcast
139,0MXLCHhLZjlA31du3sMtHw,Borders Between Us,Resistance
141,50FHng0614l7nSY1AK9se6,Killer Clowns,You're Wrong About


## Getting the Extra Data
Thankfully, Spotify is FANTASTIC with its documentation and API specs, and the endpoints are all quite consistent! (Thank you, Spotify.) That lets me write one single function to pull the data from all of the different endpoints, sending the ids in batches:

In [4]:
import requests
import re
import time
from urllib.parse import urlencode
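# Read the API token, closing the file once we're done with it
with open('token.txt', 'r') as token_file:
    token = token_file.read().strip()

base_url = "https://api.spotify.com/v1/"

def get_spotify_data(ids, endpoint, sublist_length=50):
    # The response is keyed by the endpoint name, e.g. "audio-features" -> "audio_features"
    key_name = endpoint.replace("-", "_")
    ids_data = []
    # Step through the ids one batch at a time; stepping by the batch size
    # avoids an empty final request when len(ids) is an exact multiple of it
    for start in range(0, len(ids), sublist_length):
        sub_ids = ",".join(ids[start:start + sublist_length])
        query_obj = {"ids": sub_ids}
        res = requests.get(base_url + endpoint + "?" + urlencode(query_obj), headers={
            "Authorization": "Bearer " + token
        })
        if res.status_code != 200:
            print(res.status_code)
            raise Exception("Spotify call failed, something has gone wrong")
        res_json = res.json()
        ids_data += res_json[key_name]
        # Brief pause between requests to go easy on the API
        time.sleep(1)
    return ids_data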
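
The batching is the part of this function most likely to bite, so here is a minimal sketch of just that piece, with no network calls. `chunk_ids` is a hypothetical helper name (it isn't part of the notebook), and the ids are made up. It shows that stepping through the list by the batch size never produces an empty batch, even when the number of ids is an exact multiple of the batch size.

```python
def chunk_ids(ids, sublist_length=50):
    """Split ids into comma-joined batches of at most sublist_length ids.

    Hypothetical helper for illustration; not part of the notebook itself.
    """
    batches = []
    # Stepping by sublist_length never produces an empty trailing batch
    for start in range(0, len(ids), sublist_length):
        batches.append(",".join(ids[start:start + sublist_length]))
    return batches


# Made-up ids, just to show the batch boundaries
example_ids = [f"id{n}" for n in range(5)]
print(chunk_ids(example_ids, 2))      # three batches: 2 + 2 + 1 ids
print(chunk_ids(example_ids[:4], 2))  # exactly two batches, no empty one
```

Each string in the result is what would go into the `ids` query parameter for one request.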

First, let's get the full track data. (Yeah, I know we did this in the cleaning step, but that was mostly just to get the track `id`s. This time, we're doing **feature engineering**! It's only proper to do it again here and pretend I never did it in the first place.)

We'll start off by gathering the data, then we'll clean it up later:

In [5]:
track_ids = list(tracks_data_df["id"])

if not path.exists("data/raw/FullTrackDataRaw.csv"):
    full_track_data = get_spotify_data(track_ids, "tracks")
    full_track_data_df = pd.DataFrame({"id": track_ids, "data": full_track_data})    
    full_track_data_df.to_csv("data/raw/FullTrackDataRaw.csv")
else:
    full_track_data_df = pd.read_csv("data/raw/FullTrackDataRaw.csv")
    full_track_data_df.drop(full_track_data_df.columns[0], axis=1, inplace=True)

full_track_data_df.head()

Unnamed: 0,id,data
0,3zl7j5ua8mF4JDYuxrfo01,"{'album': {'album_type': 'single', 'artists': ..."
1,3XBDyDl3lwihZ8taFqMsJa,"{'album': {'album_type': 'album', 'artists': [..."
2,49soGZl5uftZH9E7T20SDm,"{'album': {'album_type': 'album', 'artists': [..."
3,4upb9RfRf0hfW1rTU8bozj,"{'album': {'album_type': 'single', 'artists': ..."
4,2vwpOGHlOroQYiIByW7qa3,"{'album': {'album_type': 'album', 'artists': [..."


Next up is the track feature data, which gives us all sorts of neat metrics calculated by Spotify:

In [6]:
if not path.exists("data/raw/TrackFeatureDataRaw.csv"):
    track_feature_data = get_spotify_data(track_ids, "audio-features", 100)
    track_feature_data_df = pd.DataFrame({"id": track_ids, "data": track_feature_data})
    track_feature_data_df.to_csv("data/raw/TrackFeatureDataRaw.csv")
else:
    track_feature_data_df = pd.read_csv("data/raw/TrackFeatureDataRaw.csv")
    track_feature_data_df.drop(track_feature_data_df.columns[0], axis=1, inplace=True)

track_feature_data_df.head()

Unnamed: 0,id,data
0,3zl7j5ua8mF4JDYuxrfo01,"{'danceability': 0.544, 'energy': 0.417, 'key'..."
1,3XBDyDl3lwihZ8taFqMsJa,"{'danceability': 0.606, 'energy': 0.567, 'key'..."
2,49soGZl5uftZH9E7T20SDm,"{'danceability': 0.472, 'energy': 0.516, 'key'..."
3,4upb9RfRf0hfW1rTU8bozj,"{'danceability': 0.625, 'energy': 0.514, 'key'..."
4,2vwpOGHlOroQYiIByW7qa3,"{'danceability': 0.743, 'energy': 0.471, 'key'..."


Let's also get the episode data, so we can potentially analyze that as well:

In [7]:
episode_ids = list(episodes_data_df["id"])

if not path.exists("data/raw/FullEpisodeDataRaw.csv"):
    full_episode_data = get_spotify_data(episode_ids, "episodes")
    full_episode_data_df = pd.DataFrame({"id": episode_ids, "data": full_episode_data})    
    full_episode_data_df.to_csv("data/raw/FullEpisodeDataRaw.csv")
else:
    full_episode_data_df = pd.read_csv("data/raw/FullEpisodeDataRaw.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,data
0,4vqMyjMPaLZc06BBSzBEk3,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
1,3L7WxMTjXmSzhGncCUIIJ6,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
2,5uqoeTTKOPlXKkcNGrSMpb,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
3,0MXLCHhLZjlA31du3sMtHw,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
4,50FHng0614l7nSY1AK9se6,{'audio_preview_url': 'https://p.scdn.co/mp3-p...


We will also do album and artist data later on, but I need the artists from albums, and the albums from the tracks first!

## Cleaning the Extra Data

Since we are getting the raw data from the API endpoints, we should clean it up by keeping only the pieces that we care about. These are the helper functions we'll use for the cleaning:

In [8]:
import ast # Gonna use ast because json has issues with single quotes
import json

# This will be used to show what keys are returned from the endpoint
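def print_nested_keys(obj, offset=""):
    # For lists, only the first element is inspected (assumed representative)
    if isinstance(obj, list):
        if len(obj) == 0:
            return
        obj = obj[0]
    # Only dicts have keys to print; skip None, strings, numbers, etc.
    if not isinstance(obj, dict):
        return
    for key in obj.keys():
        print(offset + key)
        print_nested_keys(obj[key], offset + "  ")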

In [9]:
from datetime import date

def convert_release_date(date_str):
    if len(date_str) == 4:
        return date(int(date_str), 1, 1)
    elif len(date_str) == 7:
        return date(int(date_str[0:4]), int(date_str[5:7]), 1)
    else:
        return date(int(date_str[0:4]), int(date_str[5:7]), int(date_str[8:10]))

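def get_data_values(entry, accessor_dict):
    data_str = str(entry["data"])
    if data_str == "None":
        print(entry)
    data = None
    try:
        # The CSVs hold Python dict reprs (single quotes), so try ast first
        data = ast.literal_eval(data_str)
    except (ValueError, TypeError, SyntaxError):
        try:
            data = json.loads(data_str)
        except json.JSONDecodeError:
            print("Invalid: ")
            print(data_str)
    if data is None:
        # Nothing usable was parsed, so fail with a clear message
        raise ValueError("Could not parse data for id " + str(entry["id"]))
    for new_entry_key, data_access_key in accessor_dict.items():
        if callable(data_access_key):
            # Callables receive the whole parsed object
            entry[new_entry_key] = data_access_key(data)
        elif isinstance(data_access_key, list):
            # Lists are paths of nested keys, e.g. ["album", "id"]
            value = data
            for access in data_access_key:
                value = value[access]
            entry[new_entry_key] = value
        else:
            entry[new_entry_key] = data[data_access_key]
    return entry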
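
To make the three kinds of accessor easier to follow, here's a tiny self-contained sketch. `resolve_accessor` and the `sample` dict are made up for illustration (they are not part of the notebook), but the idea mirrors what `get_data_values` expects: a plain string is a top-level key, a list is a path of nested keys, and a callable receives the whole parsed object.

```python
def resolve_accessor(data, accessor):
    """Pull one value out of a parsed API object (illustrative helper)."""
    if callable(accessor):
        # Callables get the whole object and can compute anything from it
        return accessor(data)
    if isinstance(accessor, list):
        # Lists are walked one key at a time, e.g. ["album", "id"]
        value = data
        for key in accessor:
            value = value[key]
        return value
    # Anything else is treated as a top-level key
    return data[accessor]


# Made-up sample shaped like a heavily trimmed track object
sample = {
    "name": "Some Track",
    "album": {"id": "album123", "name": "Some Album"},
    "artists": [{"id": "a1", "name": "First"}, {"id": "a2", "name": "Second"}],
}

accessors = {
    "name": "name",
    "album_id": ["album", "id"],
    "artist_names": lambda data: [artist["name"] for artist in data["artists"]],
}

flattened = {key: resolve_accessor(sample, acc) for key, acc in accessors.items()}
print(flattened)
```

The accessor dictionaries in the cells below follow the same pattern, just with more fields.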

In [10]:
full_track_data = full_track_data_df["data"].iloc[0]
print("Full Track Data:")
print_nested_keys(ast.literal_eval(str(full_track_data)), "  ")

Full Track Data:
  album
    album_type
    artists
      external_urls
        spotify
      href
      id
      name
      type
      uri
    available_markets
    external_urls
      spotify
    href
    id
    images
      height
      url
      width
    name
    release_date
    release_date_precision
    total_tracks
    type
    uri
  artists
    external_urls
      spotify
    href
    id
    name
    type
    uri
  available_markets
  disc_number
  duration_ms
  explicit
  external_ids
    isrc
  external_urls
    spotify
  href
  id
  is_local
  name
  popularity
  preview_url
  track_number
  type
  uri


In [11]:
if not path.exists("data/ready/FullTrackData.csv"):
    full_track_data_accessor = {
        "album_id": ["album", "id"],
        "album_name": ["album", "name"],
        "album_release_date": lambda data: convert_release_date(data["album"]["release_date"]),
        "album_total_tracks": ["album", "total_tracks"],
        "artist_ids": lambda data: list(map(lambda artist_data: artist_data["id"], data["artists"])),
        "artist_names": lambda data: list(map(lambda artist_data: artist_data["name"], data["artists"])),
        "duration_ms": "duration_ms",
        "explicit": "explicit",
        "name": "name",
        "popularity": "popularity",
        "track_number": "track_number"
    }
    full_track_data_df = full_track_data_df.apply(lambda entry: get_data_values(entry, full_track_data_accessor), axis=1).drop(columns=["data"])
    full_track_data_df.to_csv("data/ready/FullTrackData.csv")
else:
    full_track_data_df = pd.read_csv("data/ready/FullTrackData.csv")
    full_track_data_df.drop(full_track_data_df.columns[0], axis=1, inplace=True)

full_track_data_df.head()

Unnamed: 0,id,album_id,album_name,album_release_date,album_total_tracks,artist_ids,artist_names,duration_ms,explicit,name,popularity,track_number
0,3zl7j5ua8mF4JDYuxrfo01,2MOs2gBy14kW9jYXbv2A3O,Perfect Symphony (Ed Sheeran & Andrea Bocelli),2017-12-15,1,"['6eUKZXaKkcviH0Ku9w2n3V', '3EA9hVIzKfFiQI0Kik...","['Ed Sheeran', 'Andrea Bocelli']",265363,False,Perfect Symphony (Ed Sheeran & Andrea Bocelli),70,1
1,3XBDyDl3lwihZ8taFqMsJa,3LjEwG1UMyjrH13rrMmdco,A Little Lift,2019-03-29,12,['5hPR4Atp3QY2ztiAcz1inl'],['Morningsiders'],181454,False,Honey Hold Me,45,3
2,49soGZl5uftZH9E7T20SDm,6Rt2NlqIHMj7xanrfhRgTl,Call It Home: Vol. 1 & 2,2018-04-06,16,['21t0aavYGSGFkYYFhu6urk'],['The California Honeydrops'],246213,False,Only Home I've Ever Known,41,1
3,4upb9RfRf0hfW1rTU8bozj,6J14ERmMo7PajcC63ACPke,Mind is a Mountain,2017-09-15,1,['4iBgPaD9hI6n7uRppTbyVO'],['The Get Ahead'],210723,False,Mind is a Mountain,45,1
4,2vwpOGHlOroQYiIByW7qa3,6H5BifGwk4ilDChOWOMmaa,Deep End,2017-07-22,11,['5O0efDEpkqEmWbXD2zpkjz'],['Doc Robinson'],182013,False,Slip Away,42,1


In [12]:
track_feature_data = track_feature_data_df["data"].iloc[0]
print("Track Feature Data:")
print_nested_keys(ast.literal_eval(str(track_feature_data)), "  ")

Track Feature Data:
  danceability
  energy
  key
  loudness
  mode
  speechiness
  acousticness
  instrumentalness
  liveness
  valence
  tempo
  type
  id
  uri
  track_href
  analysis_url
  duration_ms
  time_signature


In [13]:
if not path.exists("data/ready/TrackFeatureData.csv"):
    track_feature_data_accessor = {x: x for x in [
        "danceability",
        "energy",
        "key",
        "loudness",
        "mode",
        "speechiness",
        "acousticness",
        "instrumentalness",
        "liveness",
        "valence",
        "tempo",
        "time_signature",
    ]}
    track_feature_data_df = track_feature_data_df.apply(lambda entry: get_data_values(entry, track_feature_data_accessor), axis=1).drop(columns=["data"])
    track_feature_data_df.to_csv("data/ready/TrackFeatureData.csv")
else:
    track_feature_data_df = pd.read_csv("data/ready/TrackFeatureData.csv")
    track_feature_data_df.drop(track_feature_data_df.columns[0], axis=1, inplace=True)

track_feature_data_df.head()

Unnamed: 0,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,3zl7j5ua8mF4JDYuxrfo01,0.544,0.417,8,-4.387,1,0.0247,0.586,0.0,0.085,0.207,95.156,3
1,3XBDyDl3lwihZ8taFqMsJa,0.606,0.567,0,-6.841,1,0.0268,0.367,0.00325,0.19,0.521,82.012,4
2,49soGZl5uftZH9E7T20SDm,0.472,0.516,3,-8.945,1,0.0537,0.718,0.00175,0.14,0.481,96.407,4
3,4upb9RfRf0hfW1rTU8bozj,0.625,0.514,2,-5.314,1,0.0353,0.772,0.000254,0.109,0.606,83.99,4
4,2vwpOGHlOroQYiIByW7qa3,0.743,0.471,3,-7.779,0,0.0354,0.482,0.0,0.143,0.514,108.061,4


Now that we have track data, let's get the album data!

In [14]:
album_ids = list(full_track_data_df["album_id"].drop_duplicates())

if not path.exists("data/raw/AlbumDataRaw.csv"):
    album_data = get_spotify_data(album_ids, "albums", 20)
    album_data_df = pd.DataFrame({"id": album_ids, "data": album_data})
    album_data_df.to_csv("data/raw/AlbumDataRaw.csv")
else:
    album_data_df = pd.read_csv("data/raw/AlbumDataRaw.csv")
    album_data_df.drop(album_data_df.columns[0], axis=1, inplace=True)

album_data_df.head()

Unnamed: 0,id,data
0,2MOs2gBy14kW9jYXbv2A3O,"{'album_type': 'single', 'artists': [{'externa..."
1,3LjEwG1UMyjrH13rrMmdco,"{'album_type': 'album', 'artists': [{'external..."
2,6Rt2NlqIHMj7xanrfhRgTl,"{'album_type': 'album', 'artists': [{'external..."
3,6J14ERmMo7PajcC63ACPke,"{'album_type': 'single', 'artists': [{'externa..."
4,6H5BifGwk4ilDChOWOMmaa,"{'album_type': 'album', 'artists': [{'external..."


In [15]:
album_data = album_data_df["data"].iloc[0]
print("Album Data:")
print_nested_keys(ast.literal_eval(str(album_data)), "  ")

Album Data:
  album_type
  artists
    external_urls
      spotify
    href
    id
    name
    type
    uri
  available_markets
  copyrights
    text
    type
  external_ids
    upc
  external_urls
    spotify
  genres
  href
  id
  images
    height
    url
    width
  label
  name
  popularity
  release_date
  release_date_precision
  total_tracks
  tracks
    href
    items
      artists
        external_urls
          spotify
        href
        id
        name
        type
        uri
      available_markets
      disc_number
      duration_ms
      explicit
      external_urls
        spotify
      href
      id
      is_local
      name
      preview_url
      track_number
      type
      uri
    limit
    next
    offset
    previous
    total
  type
  uri


In [16]:
if not path.exists("data/ready/AlbumData.csv"):
    album_data_accessor = {
        "album_type": "album_type",
        "artist_ids": lambda data: list(map(lambda artist_data: artist_data["id"], data["artists"])),
        "artist_names": lambda data: list(map(lambda artist_data: artist_data["name"], data["artists"])),
        "copyright_texts": lambda data: list(map(lambda copyright: copyright["text"], data["copyrights"])),
        "genres": "genres",
        "image_url": lambda data: data["images"][0]["url"],
        "label": "label",
        "name": "name",
        "popularity": "popularity",
        "release_date": lambda data: convert_release_date(data["release_date"]),
        "total_tracks": "total_tracks"
    }
    album_data_df = album_data_df.apply(lambda entry: get_data_values(entry, album_data_accessor), axis=1).drop(columns=["data"])
    album_data_df.to_csv("data/ready/AlbumData.csv")
else:
    album_data_df = pd.read_csv("data/ready/AlbumData.csv")
    album_data_df.drop(album_data_df.columns[0], axis=1, inplace=True)

album_data_df.head()

Unnamed: 0,id,album_type,artist_ids,artist_names,copyright_texts,genres,image_url,label,name,popularity,release_date,total_tracks
0,2MOs2gBy14kW9jYXbv2A3O,single,"[6eUKZXaKkcviH0Ku9w2n3V, 3EA9hVIzKfFiQI0Kikz2wo]","[Ed Sheeran, Andrea Bocelli]","[© 2017 Asylum Records UK, a division of Atlan...",[],https://i.scdn.co/image/ab67616d0000b273baf909...,Atlantic Records UK,Perfect Symphony (Ed Sheeran & Andrea Bocelli),63,2017-12-15,1
1,3LjEwG1UMyjrH13rrMmdco,album,[5hPR4Atp3QY2ztiAcz1inl],[Morningsiders],"[© 2019 Morningsiders, LLC under exclusive lic...",[],https://i.scdn.co/image/ab67616d0000b2733cdceb...,Nettwerk Records,A Little Lift,55,2019-03-29,12
2,6Rt2NlqIHMj7xanrfhRgTl,album,[21t0aavYGSGFkYYFhu6urk],[The California Honeydrops],"[2018 The California Honeydrops, 2018 The Cali...",[],https://i.scdn.co/image/ab67616d0000b273958282...,Tubtone Records,Call It Home: Vol. 1 & 2,49,2018-04-06,16
3,6J14ERmMo7PajcC63ACPke,single,[4iBgPaD9hI6n7uRppTbyVO],[The Get Ahead],[(C) 2017 The Get Ahead under exclusive licens...,[],https://i.scdn.co/image/ab67616d0000b2736fe59c...,Jullian Records,Mind is a Mountain,38,2017-09-15,1
4,6H5BifGwk4ilDChOWOMmaa,album,[5O0efDEpkqEmWbXD2zpkjz],[Doc Robinson],"[2017 Independent, 2017 Independent]",[],https://i.scdn.co/image/ab67616d0000b273b10fe6...,Independent,Deep End,46,2017-07-22,11


Now that we have the album data, let's get that artist data!

In [17]:
album_artist_ids = [artist_id for artist_ids in album_data_df["artist_ids"].values for artist_id in ast.literal_eval(str(artist_ids))]
track_artist_ids = [artist_id for artist_ids in full_track_data_df["artist_ids"].values for artist_id in ast.literal_eval(str(artist_ids))]
artist_ids = list(dict.fromkeys(album_artist_ids + track_artist_ids))

if not path.exists("data/raw/ArtistDataRaw.csv"):
    artist_data = get_spotify_data(artist_ids, "artists", 50)
    artist_data_df = pd.DataFrame({"id": artist_ids, "data": artist_data})
    artist_data_df.to_csv("data/raw/ArtistDataRaw.csv")
else:
    artist_data_df = pd.read_csv("data/raw/ArtistDataRaw.csv")
    artist_data_df.drop(artist_data_df.columns[0], axis=1, inplace=True)

artist_data_df.head()

Unnamed: 0,id,data
0,6eUKZXaKkcviH0Ku9w2n3V,{'external_urls': {'spotify': 'https://open.sp...
1,3EA9hVIzKfFiQI0Kikz2wo,{'external_urls': {'spotify': 'https://open.sp...
2,5hPR4Atp3QY2ztiAcz1inl,{'external_urls': {'spotify': 'https://open.sp...
3,21t0aavYGSGFkYYFhu6urk,{'external_urls': {'spotify': 'https://open.sp...
4,4iBgPaD9hI6n7uRppTbyVO,{'external_urls': {'spotify': 'https://open.sp...


In [18]:
artist_data = artist_data_df["data"].iloc[0]
print("Artist Data:")
print_nested_keys(ast.literal_eval(str(artist_data)), "  ")

Artist Data:
  external_urls
    spotify
  followers
    href
    total
  genres
  href
  id
  images
    height
    url
    width
  name
  popularity
  type
  uri


In [19]:
if not path.exists("data/ready/ArtistData.csv"):
    artist_data_accessor = {x: x for x in [
        "genres",
        "name",
        "popularity"
    ]}
    artist_data_df = artist_data_df.apply(lambda entry: get_data_values(entry, artist_data_accessor), axis=1).drop(columns=["data"])
    artist_data_df.to_csv("data/ready/ArtistData.csv")
else:
    artist_data_df = pd.read_csv("data/ready/ArtistData.csv")
    artist_data_df.drop(artist_data_df.columns[0], axis=1, inplace=True)

artist_data_df.head()

Unnamed: 0,id,genres,name,popularity
0,6eUKZXaKkcviH0Ku9w2n3V,"[pop, uk pop]",Ed Sheeran,97
1,3EA9hVIzKfFiQI0Kikz2wo,"[classical tenor, italian tenor, operatic pop]",Andrea Bocelli,74
2,5hPR4Atp3QY2ztiAcz1inl,"[folk-pop, indiecoustica, stomp and holler]",Morningsiders,48
3,21t0aavYGSGFkYYFhu6urk,"[bay area indie, deep new americana, funk, mod...",The California Honeydrops,54
4,4iBgPaD9hI6n7uRppTbyVO,[retro soul],The Get Ahead,30


Let's also clean up the episode data! (The fetch cell is repeated here, this time with duplicate ids dropped.)

In [20]:
episode_ids = list(episodes_data_df["id"].drop_duplicates())

if not path.exists("data/raw/FullEpisodeDataRaw.csv"):
    full_episode_data = get_spotify_data(episode_ids, "episodes")
    full_episode_data_df = pd.DataFrame({"id": episode_ids, "data": full_episode_data})    
    full_episode_data_df.to_csv("data/raw/FullEpisodeDataRaw.csv")
else:
    full_episode_data_df = pd.read_csv("data/raw/FullEpisodeDataRaw.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,data
0,4vqMyjMPaLZc06BBSzBEk3,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
1,3L7WxMTjXmSzhGncCUIIJ6,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
2,5uqoeTTKOPlXKkcNGrSMpb,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
3,0MXLCHhLZjlA31du3sMtHw,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
4,50FHng0614l7nSY1AK9se6,{'audio_preview_url': 'https://p.scdn.co/mp3-p...


In [21]:
episode_data = full_episode_data_df["data"].iloc[0]
print("Episode Data:")
print_nested_keys(ast.literal_eval(str(episode_data)), "  ")

Episode Data:
  audio_preview_url
  description
  duration_ms
  explicit
  external_urls
    spotify
  href
  html_description
  id
  images
    height
    url
    width
  is_externally_hosted
  is_playable
  language
  languages
  name
  release_date
  release_date_precision
  show
    available_markets
    copyrights
    description
    explicit
    external_urls
      spotify
    href
    html_description
    id
    images
      height
      url
      width
    is_externally_hosted
    languages
    media_type
    name
    publisher
    total_episodes
    type
    uri
  type
  uri


In [22]:
if not path.exists("data/ready/FullEpisodeData.csv"):
    episode_data_accessor = {
        "description": "description",
        "duration_ms": "duration_ms",
        "explicit": "explicit",
        "image_url": lambda data: data["images"][0]["url"],
        "name": "name",
        "release_date": lambda data: convert_release_date(data["release_date"]),
        "show_id": lambda data: data["show"]["id"],
        "show_copyright_texts": lambda data: list(map(lambda copyright: copyright["text"], data["show"]["copyrights"])),
        "show_description": lambda data: data["show"]["description"],
        "show_explicit": lambda data: data["show"]["explicit"],
        "show_image_url": lambda data: data["show"]["images"][0]["url"],
        "show_name": lambda data: data["show"]["name"],
        "show_publisher": lambda data: data["show"]["publisher"],
        "show_total_episodes": lambda data: data["show"]["total_episodes"]
    }
    full_episode_data_df = full_episode_data_df.apply(lambda entry: get_data_values(entry, episode_data_accessor), axis=1).drop(columns=["data"])
    full_episode_data_df.to_csv("data/ready/FullEpisodeData.csv")
else:
    full_episode_data_df = pd.read_csv("data/ready/FullEpisodeData.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,description,duration_ms,explicit,image_url,name,release_date,show_id,show_copyright_texts,show_description,show_explicit,show_image_url,show_name,show_publisher,show_total_episodes
0,4vqMyjMPaLZc06BBSzBEk3,Mike tells Sarah what makes older Americans mo...,3318569,False,https://i.scdn.co/image/ab6765630000ba8a557189...,Losing Relatives to Fox News,2020-12-07,1RefFgQB4Lrl7qczcTWA3o,[],Sarah is a journalist obsessed with the past. ...,True,https://i.scdn.co/image/ab6765630000ba8a557189...,You're Wrong About,Sarah Marshall,150
1,3L7WxMTjXmSzhGncCUIIJ6,Mike tells Sarah the complicated story of an o...,4258926,False,https://i.scdn.co/image/ab6765630000ba8a557189...,The Stanford Prison Experiment,2020-12-21,1RefFgQB4Lrl7qczcTWA3o,[],Sarah is a journalist obsessed with the past. ...,True,https://i.scdn.co/image/ab6765630000ba8a557189...,You're Wrong About,Sarah Marshall,150
2,5uqoeTTKOPlXKkcNGrSMpb,Timmy the Tooth Haunts My Dreams: https://bit....,4064235,False,https://i.scdn.co/image/ab6765630000ba8a6ef102...,100: The 100th Episode of the Podcast! - The G...,2020-12-28,0itVSZMhztCXlbissOQV44,[],The Gus & Eddy Podcast is a weekly show starri...,True,https://i.scdn.co/image/ab6765630000ba8a6ef102...,The Gus & Eddy Podcast,Gus & Eddy,135
3,0MXLCHhLZjlA31du3sMtHw,Saidu Tejan-Thomas Jr.’s award winning story B...,2293760,True,https://i.scdn.co/image/ab6765630000ba8a04ebbf...,Borders Between Us,2020-12-23,02JzQLXpqTtViFUGQjRkj3,[],Resistance is a show about refusing to accept ...,True,https://i.scdn.co/image/ab6765630000ba8a04ebbf...,Resistance,Gimlet,26
4,50FHng0614l7nSY1AK9se6,"For our 100th episode, American Hysteria host ...",3681149,False,https://i.scdn.co/image/ab6765630000ba8a557189...,Killer Clowns,2020-09-21,1RefFgQB4Lrl7qczcTWA3o,[],Sarah is a journalist obsessed with the past. ...,True,https://i.scdn.co/image/ab6765630000ba8a557189...,You're Wrong About,Sarah Marshall,150


## Additional Feature Engineering

I also want to put genre data on the tracks. In practice the genre data only comes from the artist objects (the album `genres` lists shown above are all empty), so we'll take it from the artists and attach it to the tracks. Note: not all artists have genre data, so this won't be incredibly useful, but it'll be cool with what we have!

In [23]:
def get_genres_from_artists(entry):
    # artist_ids may be a list or its string form (when read back from CSV)
    artists = ast.literal_eval(str(entry["artist_ids"]))
    genres_set = set()
    for artist in artists:
        genres = artist_data_df[artist_data_df["id"] == artist].iloc[0]["genres"]
        # Same story for genres: parse the string form so we don't loop over characters
        if isinstance(genres, str):
            genres = ast.literal_eval(genres)
        for genre in genres:
            genres_set.add(genre)
    return list(genres_set)

if "genres" not in full_track_data_df:
    full_track_data_df["genres"] = full_track_data_df.apply(get_genres_from_artists, axis=1)
    full_track_data_df.to_csv("data/ready/FullTrackData.csv")
full_track_data_df.head()

Unnamed: 0,id,album_id,album_name,album_release_date,album_total_tracks,artist_ids,artist_names,duration_ms,explicit,name,popularity,track_number,genres
0,3zl7j5ua8mF4JDYuxrfo01,2MOs2gBy14kW9jYXbv2A3O,Perfect Symphony (Ed Sheeran & Andrea Bocelli),2017-12-15,1,"['6eUKZXaKkcviH0Ku9w2n3V', '3EA9hVIzKfFiQI0Kik...","['Ed Sheeran', 'Andrea Bocelli']",265363,False,Perfect Symphony (Ed Sheeran & Andrea Bocelli),70,1,"[classical tenor, italian tenor, operatic pop,..."
1,3XBDyDl3lwihZ8taFqMsJa,3LjEwG1UMyjrH13rrMmdco,A Little Lift,2019-03-29,12,['5hPR4Atp3QY2ztiAcz1inl'],['Morningsiders'],181454,False,Honey Hold Me,45,3,"[stomp and holler, folk-pop, indiecoustica]"
2,49soGZl5uftZH9E7T20SDm,6Rt2NlqIHMj7xanrfhRgTl,Call It Home: Vol. 1 & 2,2018-04-06,16,['21t0aavYGSGFkYYFhu6urk'],['The California Honeydrops'],246213,False,Only Home I've Ever Known,41,1,"[modern funk, deep new americana, bay area ind..."
3,4upb9RfRf0hfW1rTU8bozj,6J14ERmMo7PajcC63ACPke,Mind is a Mountain,2017-09-15,1,['4iBgPaD9hI6n7uRppTbyVO'],['The Get Ahead'],210723,False,Mind is a Mountain,45,1,[retro soul]
4,2vwpOGHlOroQYiIByW7qa3,6H5BifGwk4ilDChOWOMmaa,Deep End,2017-07-22,11,['5O0efDEpkqEmWbXD2zpkjz'],['Doc Robinson'],182013,False,Slip Away,42,1,"[columbus ohio indie, deep new americana, stom..."
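
One thing to watch out for: when the artist table is read back from its CSV, list columns like `genres` come back as strings rather than lists. Below is a rough sketch of a lookup that copes with both forms and builds the artist-to-genres mapping only once. The `artist_rows` sample and the `parse_genres` / `genres_for_track` helpers are hypothetical names for illustration; the ids and genre values are copied from the artist output above.

```python
import ast


def parse_genres(value):
    """Return a list of genres whether value is a list or its string repr."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return list(value)


# Sample rows: one genres value is a real list, one is a CSV-style string
artist_rows = [
    {"id": "6eUKZXaKkcviH0Ku9w2n3V", "genres": ["pop", "uk pop"]},
    {"id": "3EA9hVIzKfFiQI0Kikz2wo",
     "genres": "['classical tenor', 'italian tenor', 'operatic pop']"},
]

# Build the id -> genres mapping once instead of filtering per track
genres_by_artist = {row["id"]: parse_genres(row["genres"]) for row in artist_rows}


def genres_for_track(artist_ids):
    """Union of all genres for the given artists, sorted for stable output."""
    genres = set()
    for artist_id in artist_ids:
        # Unknown artists simply contribute no genres
        genres.update(genres_by_artist.get(artist_id, []))
    return sorted(genres)


track_genres = genres_for_track(["6eUKZXaKkcviH0Ku9w2n3V", "3EA9hVIzKfFiQI0Kikz2wo"])
print(track_genres)
```

Sorting the result is a design choice here: a plain `set` has no guaranteed order, so sorting keeps the saved CSV stable between runs.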


I also want to figure out approximately how many times I've listened to each track. I bounced a few ideas around, and decided the best way is to add up all of the milliseconds I've played a song (`ms_played`), divide that total by the length of the song, and simply round the result. Before that, the streaming history needs a little tidying: split the timestamp into separate `date` and `time` columns, drop the name columns, and reorder what's left:

In [24]:
if "time" not in streaming_history_data_df:
    def split_date(row):
        # Parse the combined timestamp, then split it into date and time parts
        timestamp = pd.to_datetime(row["date"])
        return pd.Series([timestamp.date(), timestamp.time()])
    streaming_history_data_df[["date", "time"]] = streaming_history_data_df.apply(split_date, axis=1)

    streaming_history_data_df = streaming_history_data_df.sort_values(by=["date", "time"]).reset_index(drop=True)
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

if "artist_name" in streaming_history_data_df and "track_name" in streaming_history_data_df:
    # The names are available in the track tables, so drop them from the history
    streaming_history_data_df.drop(columns=["artist_name", "track_name"], inplace=True)
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

if streaming_history_data_df.columns[0] != "date":
    streaming_history_data_df = streaming_history_data_df[["date", "time", "type", "id", "ms_played"]]
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

streaming_history_data_df.head()

Unnamed: 0,date,time,type,id,ms_played
0,2021-01-01,00:04:00,track,3zl7j5ua8mF4JDYuxrfo01,16889
1,2021-01-01,00:04:00,track,3XBDyDl3lwihZ8taFqMsJa,181455
2,2021-01-01,00:09:00,track,49soGZl5uftZH9E7T20SDm,246213
3,2021-01-01,00:12:00,track,2vwpOGHlOroQYiIByW7qa3,182013
4,2021-01-01,00:12:00,track,4upb9RfRf0hfW1rTU8bozj,4711
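
The cell above only tidies the streaming history, so here is a minimal sketch of how the play-count idea could look. It assumes a history table with `id` and `ms_played` columns and a track table with `id` and `duration_ms`, as in the tables above. The `history`, `tracks`, and `play_counts` names and the tiny sample values are made up for illustration.

```python
import pandas as pd

# Tiny made-up samples with the same column names as the real tables
history = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "ms_played": [200_000, 190_000, 20_000, 60_000, 50_000],
})
tracks = pd.DataFrame({"id": ["a", "b"], "duration_ms": [200_000, 180_000]})

# Total listening time per track, joined with each track's length
total_ms = history.groupby("id", as_index=False)["ms_played"].sum()
play_counts = total_ms.merge(tracks, on="id", how="inner")

# Approximate plays = total ms played / track length, rounded
play_counts["play_count"] = (
    (play_counts["ms_played"] / play_counts["duration_ms"]).round().astype(int)
)
print(play_counts)
```

With these sample numbers, track `a` was played for 410,000 ms against a 200,000 ms length, which rounds to 2 plays, and track `b` rounds to 1. Partial listens and skips get folded into the total, which is why this is only an approximation.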


So I found [this article](https://towardsdatascience.com/is-my-spotify-music-boring-an-analysis-involving-music-data-and-machine-learning-47550ae931de) that calculates the boringness of a song through this arbitrary equation:
```
boringness = loudness + tempo + (energy*100) + (danceability*100)
```
And I really want to find out how boring my music is, so...

In [25]:
def get_boringness_score(entry):
    return entry["loudness"] + entry["tempo"] + (entry["energy"] * 100) + (entry["danceability"] * 100)

if not "boringness" in track_feature_data_df:
    track_feature_data_df["boringness"] = track_feature_data_df.apply(get_boringness_score, axis=1)
    track_feature_data_df.to_csv("data/ready/TrackFeatureData.csv")
track_feature_data_df.head()

Unnamed: 0,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,boringness
0,3zl7j5ua8mF4JDYuxrfo01,0.544,0.417,8,-4.387,1,0.0247,0.586,0.0,0.085,0.207,95.156,3,186.869
1,3XBDyDl3lwihZ8taFqMsJa,0.606,0.567,0,-6.841,1,0.0268,0.367,0.00325,0.19,0.521,82.012,4,192.471
2,49soGZl5uftZH9E7T20SDm,0.472,0.516,3,-8.945,1,0.0537,0.718,0.00175,0.14,0.481,96.407,4,186.262
3,4upb9RfRf0hfW1rTU8bozj,0.625,0.514,2,-5.314,1,0.0353,0.772,0.000254,0.109,0.606,83.99,4,192.576
4,2vwpOGHlOroQYiIByW7qa3,0.743,0.471,3,-7.779,0,0.0354,0.482,0.0,0.143,0.514,108.061,4,221.682
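
As a quick sanity check, the same score can also be computed on whole columns at once rather than row by row. This is only a sketch: the `features` frame is a two-row sample copied from the output above, and `boringness_check` is a hypothetical column name.

```python
import pandas as pd

# First two rows of the feature table above (only the columns the score needs)
features = pd.DataFrame({
    "loudness": [-4.387, -6.841],
    "tempo": [95.156, 82.012],
    "energy": [0.417, 0.567],
    "danceability": [0.544, 0.606],
})

# Column arithmetic applies the equation to every row at once
features["boringness_check"] = (
    features["loudness"]
    + features["tempo"]
    + features["energy"] * 100
    + features["danceability"] * 100
)
print(features["boringness_check"].round(3).tolist())
```

The two values should line up with the `boringness` column in the table above (about 186.869 and 192.471).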


Okay, okay. I think it's finally time...

Time... for the analysis.