# Feature Analysis
Let's do a little feature analysis! Mostly I'm just going to be pulling data from Spotify's APIs to enrich my primitive data with even more cool data. (Look at how far we've come! We started out with just a track name and an artist name.)

In [1]:
from os import path
import numpy as np
import pandas as pd
from IPython.display import display

if path.exists("data/ready/CompleteStreamingHistory.csv"):
    streaming_history_data_df = pd.read_csv("data/ready/CompleteStreamingHistory.csv")
    streaming_history_data_df.drop(streaming_history_data_df.columns[0], axis=1, inplace=True)
    print(streaming_history_data_df.shape)
    display(streaming_history_data_df.head())
else:
    print("Clean data from other journal before continuing")



(26823, 5)


Unnamed: 0,date,time,type,id,ms_played
0,2020-01-01,00:51:00,track,4hns23kYYZg0BhDwXeDxB1,680
1,2020-01-01,00:52:00,track,3sLmks6fCY40bBSGDjU4FO,3282
2,2020-01-01,00:52:00,track,2408a07TNDga6lMlaIFLEU,1496
3,2020-01-01,00:52:00,track,4Saza06xljloZwotqXdNle,66171
4,2020-01-01,00:55:00,track,0rupt7DuLo3WGecL3cyi19,150773


In [2]:
if path.exists("data/processed/FullTrackData.csv"):
    full_track_and_episode_data_df = pd.read_csv("data/processed/FullTrackData.csv")
    full_track_and_episode_data_df.drop(full_track_and_episode_data_df.columns[0], axis=1, inplace=True)
    print(full_track_and_episode_data_df.shape)
    display(full_track_and_episode_data_df.head())
else:
    print("Clean data from other journal before continuing")

(5840, 4)


Unnamed: 0,id,track_name,artist_name,type
0,4hns23kYYZg0BhDwXeDxB1,Haitian Fight Song,Mingus Big Band,track
1,3sLmks6fCY40bBSGDjU4FO,Everything You Is,Andy Martin & Vic Lewis,track
2,2408a07TNDga6lMlaIFLEU,To You,Thad Jones,track
3,4Saza06xljloZwotqXdNle,Auld Lang Syne,Guy Lombardo & His Royal Canadians,track
4,0rupt7DuLo3WGecL3cyi19,All Of Me,Count Basie,track


In [3]:
tracks_data_df = full_track_and_episode_data_df[full_track_and_episode_data_df["type"] == "track"].drop(columns=["type"], axis=1)
display(tracks_data_df.head())
episodes_data_df = full_track_and_episode_data_df[full_track_and_episode_data_df["type"] == "episode"].drop(columns=["type"], axis=1)
display(episodes_data_df.head())

Unnamed: 0,id,track_name,artist_name
0,4hns23kYYZg0BhDwXeDxB1,Haitian Fight Song,Mingus Big Band
1,3sLmks6fCY40bBSGDjU4FO,Everything You Is,Andy Martin & Vic Lewis
2,2408a07TNDga6lMlaIFLEU,To You,Thad Jones
3,4Saza06xljloZwotqXdNle,Auld Lang Syne,Guy Lombardo & His Royal Canadians
4,0rupt7DuLo3WGecL3cyi19,All Of Me,Count Basie


Unnamed: 0,id,track_name,artist_name
110,1zC8VOx9ltEhS1DzeHOg5I,Ep 49: Elliot,Darknet Diaries
111,0RMFL7GGtxWSLvVgM1Y9oz,Ep 48: Operation Socialist,Darknet Diaries
112,0SF6oXn0z9UiT299AWAJJU,Ep 45: XBox Underground (Part 1),Darknet Diaries
117,5FO4bqRKnNt5QNkz9lVwAA,Ep 46: XBox Underground (Part 2),Darknet Diaries
118,2ojCUReqahpKxL7jUp6mCN,Ep 44: Zain,Darknet Diaries


## Getting the Extra Data
Thankfully, Spotify is FANTASTIC with its documentation and API specs, and the endpoints are all quite consistent! (Thank you, Spotify.) That lets me write one single function to pull data from all of the different endpoints!

In [4]:
import requests
import time
from urllib.parse import urlencode

# Read the bearer token from a local file (the context manager closes the file)
with open("token.txt", "r") as token_file:
    token = token_file.read().strip()

base_url = "https://api.spotify.com/v1/"

def get_spotify_data(ids, endpoint, sublist_length=50):
    # The response key mirrors the endpoint name, e.g. "audio-features" -> "audio_features"
    key_name = endpoint.replace("-", "_")
    ids_data = []
    # Step through the ids one batch at a time; stepping by sublist_length
    # avoids an empty final batch when len(ids) is an exact multiple of it
    for start in range(0, len(ids), sublist_length):
        sub_ids = ",".join(ids[start:start + sublist_length])
        query_obj = {"ids": sub_ids}
        res = requests.get(base_url + endpoint + "?" + urlencode(query_obj), headers={
            "Authorization": "Bearer " + token
        })
        if res.status_code != 200:
            print(res.status_code)
            raise Exception("Spotify call failed, something has gone wrong")
        res_json = res.json()
        ids_data += res_json[key_name]
        time.sleep(1)  # Pause between calls to go easy on the API
    return ids_data
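
The batching is the only fiddly part of that function, so here's a minimal sketch of it on its own. `chunked` is a made-up helper name (it isn't part of the notebook), and the ids are fake. The thing to notice is that stepping through the list with `range(0, len(ids), size)` never produces an empty batch, whereas counting batches as `len(ids) // size + 1` would give a third, empty batch for exactly 100 ids:

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Fake ids, purely for illustration
ids = [f"id{n}" for n in range(100)]

batches = list(chunked(ids, 50))
batch_sizes = [len(batch) for batch in batches]

# Each batch becomes one comma-separated "ids" query value
query_values = [",".join(batch) for batch in batches]

print(batch_sizes)
print(len(query_values))
```

Each entry of `query_values` is what would go into the `ids` query parameter of one request.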

First, let's get the full track data. (Yeah, I know we did this in the cleaning step, but that was mostly just to get the track `id`s. This time we're doing **feature engineering**! It's only proper to do it again here and pretend I never did it in the first place.)

We'll start off by gathering the data, then we'll clean it up later:

In [5]:
track_ids = list(tracks_data_df["id"])

if not path.exists("data/raw/FullTrackDataRaw.csv"):
    full_track_data = get_spotify_data(track_ids, "tracks")
    full_track_data_df = pd.DataFrame({"id": track_ids, "data": full_track_data})    
    full_track_data_df.to_csv("data/raw/FullTrackDataRaw.csv")
else:
    full_track_data_df = pd.read_csv("data/raw/FullTrackDataRaw.csv")
    full_track_data_df.drop(full_track_data_df.columns[0], axis=1, inplace=True)

full_track_data_df.head()

Unnamed: 0,id,data
0,4hns23kYYZg0BhDwXeDxB1,"{'album': {'album_type': 'album', 'artists': [..."
1,3sLmks6fCY40bBSGDjU4FO,"{'album': {'album_type': 'album', 'artists': [..."
2,2408a07TNDga6lMlaIFLEU,"{'album': {'album_type': 'album', 'artists': [..."
3,4Saza06xljloZwotqXdNle,"{'album': {'album_type': 'compilation', 'artis..."
4,0rupt7DuLo3WGecL3cyi19,"{'album': {'album_type': 'album', 'artists': [..."


Next up is the track feature data, which gives us all sorts of neat metrics calculated by Spotify:

In [6]:
if not path.exists("data/raw/TrackFeatureDataRaw.csv"):
    track_feature_data = get_spotify_data(track_ids, "audio-features", 100)
    track_feature_data_df = pd.DataFrame({"id": track_ids, "data": track_feature_data})
    track_feature_data_df.to_csv("data/raw/TrackFeatureDataRaw.csv")
else:
    track_feature_data_df = pd.read_csv("data/raw/TrackFeatureDataRaw.csv")
    track_feature_data_df.drop(track_feature_data_df.columns[0], axis=1, inplace=True)

track_feature_data_df.head()

Unnamed: 0,id,data
0,4hns23kYYZg0BhDwXeDxB1,"{'danceability': 0.386, 'energy': 0.392, 'key'..."
1,3sLmks6fCY40bBSGDjU4FO,"{'danceability': 0.581, 'energy': 0.517, 'key'..."
2,2408a07TNDga6lMlaIFLEU,"{'danceability': 0.138, 'energy': 0.11, 'key':..."
3,4Saza06xljloZwotqXdNle,"{'danceability': 0.24, 'energy': 0.245, 'key':..."
4,0rupt7DuLo3WGecL3cyi19,"{'danceability': 0.529, 'energy': 0.176, 'key'..."


Let's also get the episode data, so we can potentially analyze that as well:

In [7]:
episode_ids = list(episodes_data_df["id"])

if not path.exists("data/raw/FullEpisodeDataRaw.csv"):
    full_episode_data = get_spotify_data(episode_ids, "episodes")
    full_episode_data_df = pd.DataFrame({"id": episode_ids, "data": full_episode_data})    
    full_episode_data_df.to_csv("data/raw/FullEpisodeDataRaw.csv")
else:
    full_episode_data_df = pd.read_csv("data/raw/FullEpisodeDataRaw.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,data
0,1zC8VOx9ltEhS1DzeHOg5I,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
1,0RMFL7GGtxWSLvVgM1Y9oz,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
2,0SF6oXn0z9UiT299AWAJJU,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
3,5FO4bqRKnNt5QNkz9lVwAA,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
4,2ojCUReqahpKxL7jUp6mCN,{'audio_preview_url': 'https://p.scdn.co/mp3-p...


We'll also get album and artist data later on, but I need the album ids from the tracks, and the artist ids from the albums and tracks, first!

## Cleaning the Extra Data

Since we're getting the raw data from the API endpoints, we should clean it up by keeping only the pieces we care about. These are the helper functions we'll use for the cleaning:

In [8]:
import ast  # Gonna use ast because json has issues with the single-quoted dict strings
import json

# This will be used to show what keys are returned from the endpoint
def print_nested_keys(obj, offset=""):
    # For lists, look at the first element as a representative
    if isinstance(obj, list):
        if len(obj) == 0:
            return
        obj = obj[0]
    # Only dicts have keys to print; None and plain values end the recursion
    if not isinstance(obj, dict):
        return
    for key in obj:
        print(offset + key)
        print_nested_keys(obj[key], offset + "  ")

In [9]:
from datetime import date

def convert_release_date(date_str):
    # Release dates come as "YYYY", "YYYY-MM", or "YYYY-MM-DD";
    # a missing month or day defaults to 1
    if len(date_str) == 4:
        return date(int(date_str), 1, 1)
    elif len(date_str) == 7:
        return date(int(date_str[0:4]), int(date_str[5:7]), 1)
    else:
        return date(int(date_str[0:4]), int(date_str[5:7]), int(date_str[8:10]))

def get_data_values(entry, accessor_dict):
    data_str = str(entry["data"])
    if data_str == "None":
        print(entry)
    # The raw column holds a Python-style dict string, so try ast first, then JSON
    try:
        data = ast.literal_eval(data_str)
    except (ValueError, SyntaxError):
        try:
            data = json.loads(data_str)
        except json.JSONDecodeError:
            print("Invalid: ")
            print(data_str)
            raise
    for new_entry_key, data_access_key in accessor_dict.items():
        if callable(data_access_key):
            # Callable accessor: compute the value from the whole object
            entry[new_entry_key] = data_access_key(data)
        elif isinstance(data_access_key, list):
            # List accessor: walk the nested keys in order, e.g. ["album", "id"]
            value = data
            for access in data_access_key:
                value = value[access]
            entry[new_entry_key] = value
        else:
            # String accessor: a single top-level key
            entry[new_entry_key] = data[data_access_key]
    return entry
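
To make the accessor idea concrete, here's a minimal sketch of the three accessor styles (a plain key, a list of nested keys, and a callable) run against a tiny object. `resolve` and `parse_release_date` are stand-in names I'm making up for this example, and `sample` is a hand-trimmed track object that reuses a few values from the tables above, not a real API response:

```python
from datetime import date

def parse_release_date(date_str):
    """Parse "YYYY", "YYYY-MM", or "YYYY-MM-DD", defaulting missing parts to 1."""
    parts = [int(part) for part in date_str.split("-")]
    parts += [1] * (3 - len(parts))
    return date(*parts)

def resolve(data, accessor):
    """Look up one value using a string key, a list of nested keys, or a callable."""
    if callable(accessor):
        return accessor(data)
    if isinstance(accessor, list):
        value = data
        for key in accessor:
            value = value[key]
        return value
    return data[accessor]

# Hand-trimmed, illustrative track object
sample = {
    "name": "Haitian Fight Song",
    "album": {"id": "0Gwu5X7W1mrkSTk2uZ25cv", "release_date": "1999-06-29"},
    "artists": [{"name": "Mingus Big Band"}],
}

accessors = {
    "name": "name",
    "album_id": ["album", "id"],
    "album_release_date": lambda d: parse_release_date(d["album"]["release_date"]),
    "artist_names": lambda d: [artist["name"] for artist in d["artists"]],
}

row = {column: resolve(sample, accessor) for column, accessor in accessors.items()}
print(row)
```

The dict keys become the new column names, which is why the accessor dicts below read like a little schema for each cleaned table.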

In [10]:
full_track_data = full_track_data_df["data"].iloc[0]
print("Full Track Data:")
print_nested_keys(ast.literal_eval(str(full_track_data)), "  ")

Full Track Data:
  album
    album_type
    artists
      external_urls
        spotify
      href
      id
      name
      type
      uri
    available_markets
    external_urls
      spotify
    href
    id
    images
      height
      url
      width
    name
    release_date
    release_date_precision
    total_tracks
    type
    uri
  artists
    external_urls
      spotify
    href
    id
    name
    type
    uri
  available_markets
  disc_number
  duration_ms
  explicit
  external_ids
    isrc
  external_urls
    spotify
  href
  id
  is_local
  name
  popularity
  preview_url
  track_number
  type
  uri


In [11]:
if not path.exists("data/ready/FullTrackData.csv"):
    full_track_data_accessor = {
        "album_id": ["album", "id"],
        "album_name": ["album", "name"],
        "album_release_date": lambda data: convert_release_date(data["album"]["release_date"]),
        "album_total_tracks": ["album", "total_tracks"],
        "artist_ids": lambda data: list(map(lambda artist_data: artist_data["id"], data["artists"])),
        "artist_names": lambda data: list(map(lambda artist_data: artist_data["name"], data["artists"])),
        "duration_ms": "duration_ms",
        "explicit": "explicit",
        "name": "name",
        "popularity": "popularity",
        "track_number": "track_number"
    }
    full_track_data_df = full_track_data_df.apply(lambda entry: get_data_values(entry, full_track_data_accessor), axis=1).drop(columns=["data"])
    full_track_data_df.to_csv("data/ready/FullTrackData.csv")
else:
    full_track_data_df = pd.read_csv("data/ready/FullTrackData.csv")
    full_track_data_df.drop(full_track_data_df.columns[0], axis=1, inplace=True)

full_track_data_df.head()

Unnamed: 0,id,album_id,album_name,album_release_date,album_total_tracks,artist_ids,artist_names,duration_ms,explicit,name,popularity,track_number,genres
0,4hns23kYYZg0BhDwXeDxB1,0Gwu5X7W1mrkSTk2uZ25cv,Blues & Politics,1999-06-29,8,['54YNxT02JdAApvFBhD8ea0'],['Mingus Big Band'],499826,False,Haitian Fight Song,28,2,"['a', ',', 'z', 'e', ""'"", ']', 'b', 'm', 'n', ..."
1,3sLmks6fCY40bBSGDjU4FO,7By1lfK4fTIs2YsMvA0FWH,The Project,2004-01-01,10,['5zS6TsJ4lQFUGePSHAXaI9'],['Andy Martin & Vic Lewis'],459106,False,Everything You Is,0,6,"['[', ']']"
2,2408a07TNDga6lMlaIFLEU,5gfrrR8BnDgFhqGWcQaWFe,And the Danish Radio Big Band & Eclipse,2013-05-10,22,['6DbqS0X8cSFOPGsvyze2yh'],['Thad Jones'],255466,False,To You,6,2,"[""'"", 'b', 's', ' ', 'd', 'o', ',', 'c', 'u', ..."
3,4Saza06xljloZwotqXdNle,3f22Ap0VSZYWsqrGcphUnY,Christmas Classics,2004-01-01,16,['5fJ4w85NxFXyWlPU9wH6BE'],['Guy Lombardo & His Royal Canadians'],128000,False,Auld Lang Syne,33,16,"['[', ']']"
4,0rupt7DuLo3WGecL3cyi19,2kAN1sZjSQQDkusyXyngep,Frankly Basie / Count Basie Plays The Hits Of ...,1963-01-01,15,['2jFZlvIea42ZvcCw4OeEdA'],['Count Basie'],150773,False,All Of Me,21,12,"[""'"", 'w', 'b', 's', ' ', 'o', 'd', ',', 'u', ..."


In [12]:
track_feature_data = track_feature_data_df["data"].iloc[0]
print("Track Feature Data:")
print_nested_keys(ast.literal_eval(str(track_feature_data)), "  ")

Track Feature Data:
  danceability
  energy
  key
  loudness
  mode
  speechiness
  acousticness
  instrumentalness
  liveness
  valence
  tempo
  type
  id
  uri
  track_href
  analysis_url
  duration_ms
  time_signature


In [13]:
if not path.exists("data/ready/TrackFeatureData.csv"):
    track_feature_data_accessor = {x: x for x in [
        "danceability",
        "energy",
        "key",
        "loudness",
        "mode",
        "speechiness",
        "acousticness",
        "instrumentalness",
        "liveness",
        "valence",
        "tempo",
        "time_signature",
    ]}
    track_feature_data_df = track_feature_data_df.apply(lambda entry: get_data_values(entry, track_feature_data_accessor), axis=1).drop(columns=["data"])
    track_feature_data_df.to_csv("data/ready/TrackFeatureData.csv")
else:
    track_feature_data_df = pd.read_csv("data/ready/TrackFeatureData.csv")
    track_feature_data_df.drop(track_feature_data_df.columns[0], axis=1, inplace=True)

track_feature_data_df.head()

Unnamed: 0,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,boringness
0,4hns23kYYZg0BhDwXeDxB1,0.386,0.392,2,-14.123,0,0.0408,0.536,0.794,0.167,0.399,89.153,4,152.83
1,3sLmks6fCY40bBSGDjU4FO,0.581,0.517,5,-10.203,0,0.0378,0.901,0.794,0.108,0.553,138.751,4,238.348
2,2408a07TNDga6lMlaIFLEU,0.138,0.11,5,-15.402,1,0.0337,0.945,0.943,0.111,0.0534,61.73,4,71.128
3,4Saza06xljloZwotqXdNle,0.24,0.245,5,-10.777,1,0.0327,0.979,0.788,0.0947,0.112,83.922,3,121.645
4,0rupt7DuLo3WGecL3cyi19,0.529,0.176,7,-17.229,1,0.0398,0.762,0.72,0.0529,0.791,150.108,4,203.379


Now that we have track data, let's get the album data!

In [14]:
album_ids = list(full_track_data_df["album_id"].drop_duplicates())

if not path.exists("data/raw/AlbumDataRaw.csv"):
    album_data = get_spotify_data(album_ids, "albums", 20)
    album_data_df = pd.DataFrame({"id": album_ids, "data": album_data})
    album_data_df.to_csv("data/raw/AlbumDataRaw.csv")
else:
    album_data_df = pd.read_csv("data/raw/AlbumDataRaw.csv")
    album_data_df.drop(album_data_df.columns[0], axis=1, inplace=True)

album_data_df.head()

Unnamed: 0,id,data
0,0Gwu5X7W1mrkSTk2uZ25cv,"{'album_type': 'album', 'artists': [{'external..."
1,7By1lfK4fTIs2YsMvA0FWH,"{'album_type': 'album', 'artists': [{'external..."
2,5gfrrR8BnDgFhqGWcQaWFe,"{'album_type': 'album', 'artists': [{'external..."
3,3f22Ap0VSZYWsqrGcphUnY,"{'album_type': 'compilation', 'artists': [{'ex..."
4,2kAN1sZjSQQDkusyXyngep,"{'album_type': 'album', 'artists': [{'external..."


In [15]:
album_data = album_data_df["data"].iloc[0]
print("Album Data:")
print_nested_keys(ast.literal_eval(str(album_data)), "  ")

Album Data:
  album_type
  artists
    external_urls
      spotify
    href
    id
    name
    type
    uri
  available_markets
  copyrights
    text
    type
  external_ids
    upc
  external_urls
    spotify
  genres
  href
  id
  images
    height
    url
    width
  label
  name
  popularity
  release_date
  release_date_precision
  total_tracks
  tracks
    href
    items
      artists
        external_urls
          spotify
        href
        id
        name
        type
        uri
      available_markets
      disc_number
      duration_ms
      explicit
      external_urls
        spotify
      href
      id
      is_local
      name
      preview_url
      track_number
      type
      uri
    limit
    next
    offset
    previous
    total
  type
  uri


In [16]:
if not path.exists("data/ready/AlbumData.csv"):
    album_data_accessor = {
        "album_type": "album_type",
        "artist_ids": lambda data: list(map(lambda artist_data: artist_data["id"], data["artists"])),
        "artist_names": lambda data: list(map(lambda artist_data: artist_data["name"], data["artists"])),
        "copyright_texts": lambda data: list(map(lambda copyright: copyright["text"], data["copyrights"])),
        "genres": "genres",
        "image_url": lambda data: data["images"][0]["url"],
        "label": "label",
        "name": "name",
        "popularity": "popularity",
        "release_date": lambda data: convert_release_date(data["release_date"]),
        "total_tracks": "total_tracks"
    }
    album_data_df = album_data_df.apply(lambda entry: get_data_values(entry, album_data_accessor), axis=1).drop(columns=["data"])
    album_data_df.to_csv("data/ready/AlbumData.csv")
else:
    album_data_df = pd.read_csv("data/ready/AlbumData.csv")
    album_data_df.drop(album_data_df.columns[0], axis=1, inplace=True)

album_data_df.head()

Unnamed: 0,id,album_type,artist_ids,artist_names,copyright_texts,genres,image_url,label,name,popularity,release_date,total_tracks
0,0Gwu5X7W1mrkSTk2uZ25cv,album,['54YNxT02JdAApvFBhD8ea0'],['Mingus Big Band'],"['1999 Francis Dreyfus Music SARL, a BMG Compa...",[],https://i.scdn.co/image/ab67616d0000b27315338b...,Dreyfus Jazz,Blues & Politics,28,1999-06-29,8
1,7By1lfK4fTIs2YsMvA0FWH,album,['5zS6TsJ4lQFUGePSHAXaI9'],['Andy Martin & Vic Lewis'],"['2004 Drewbone Music', '2004 Drewbone Music']",[],https://i.scdn.co/image/ab67616d0000b273750e6c...,Drewbone Music,The Project,5,2004-01-01,10
2,5gfrrR8BnDgFhqGWcQaWFe,album,['6DbqS0X8cSFOPGsvyze2yh'],['Thad Jones'],['(C) 2013 Storyville Records'],[],https://i.scdn.co/image/ab67616d0000b2731a5d61...,Storyville,And the Danish Radio Big Band & Eclipse,15,2013-05-10,22
3,3f22Ap0VSZYWsqrGcphUnY,compilation,['0LyfQWJT6nXafLPZqxe9Of'],['Various Artists'],"['© 2004 Capitol Catalog', 'This Compilation ℗...",[],https://i.scdn.co/image/ab67616d0000b2733cfd1f...,Capitol Records,Christmas Classics,66,2004-01-01,16
4,2kAN1sZjSQQDkusyXyngep,album,['2jFZlvIea42ZvcCw4OeEdA'],['Count Basie'],"['© 1963 The Verve Music Group, a Division of ...",[],https://i.scdn.co/image/ab67616d0000b27364e3d6...,Verve,Frankly Basie / Count Basie Plays The Hits Of ...,21,1963-01-01,15


Now that we have the album data, let's get that artist data!

In [17]:
album_artist_ids = [artist_id for artist_ids in album_data_df["artist_ids"].values for artist_id in ast.literal_eval(str(artist_ids))]
track_artist_ids = [artist_id for artist_ids in full_track_data_df["artist_ids"].values for artist_id in ast.literal_eval(str(artist_ids))]
artist_ids = list(dict.fromkeys(album_artist_ids + track_artist_ids))

if not path.exists("data/raw/ArtistDataRaw.csv"):
    artist_data = get_spotify_data(artist_ids, "artists", 50)
    artist_data_df = pd.DataFrame({"id": artist_ids, "data": artist_data})
    artist_data_df.to_csv("data/raw/ArtistDataRaw.csv")
else:
    artist_data_df = pd.read_csv("data/raw/ArtistDataRaw.csv")
    artist_data_df.drop(artist_data_df.columns[0], axis=1, inplace=True)

artist_data_df.head()

Unnamed: 0,id,data
0,54YNxT02JdAApvFBhD8ea0,{'external_urls': {'spotify': 'https://open.sp...
1,5zS6TsJ4lQFUGePSHAXaI9,{'external_urls': {'spotify': 'https://open.sp...
2,6DbqS0X8cSFOPGsvyze2yh,{'external_urls': {'spotify': 'https://open.sp...
3,0LyfQWJT6nXafLPZqxe9Of,{'external_urls': {'spotify': 'https://open.sp...
4,2jFZlvIea42ZvcCw4OeEdA,{'external_urls': {'spotify': 'https://open.sp...


In [18]:
artist_data = artist_data_df["data"].iloc[0]
print("Artist Data:")
print_nested_keys(ast.literal_eval(str(artist_data)), "  ")

Artist Data:
  external_urls
    spotify
  followers
    href
    total
  genres
  href
  id
  images
    height
    url
    width
  name
  popularity
  type
  uri


In [19]:
if not path.exists("data/ready/ArtistData.csv"):
    artist_data_accessor = {x: x for x in [
        "genres",
        "name",
        "popularity"
    ]}
    artist_data_df = artist_data_df.apply(lambda entry: get_data_values(entry, artist_data_accessor), axis=1).drop(columns=["data"])
    artist_data_df.to_csv("data/ready/ArtistData.csv")
else:
    artist_data_df = pd.read_csv("data/ready/ArtistData.csv")
    artist_data_df.drop(artist_data_df.columns[0], axis=1, inplace=True)

artist_data_df.head()

Unnamed: 0,id,genres,name,popularity
0,54YNxT02JdAApvFBhD8ea0,"['bebop', 'big band', 'jazz', 'modern big band']",Mingus Big Band,33
1,5zS6TsJ4lQFUGePSHAXaI9,[],Andy Martin & Vic Lewis,2
2,6DbqS0X8cSFOPGsvyze2yh,"['bebop', 'big band', 'cool jazz', 'hard bop',...",Thad Jones,42
3,0LyfQWJT6nXafLPZqxe9Of,[],Various Artists,0
4,2jFZlvIea42ZvcCw4OeEdA,"['adult standards', 'bebop', 'big band', 'cool...",Count Basie,67


Let's also get all the episode data!

In [20]:
episode_ids = list(episodes_data_df["id"].drop_duplicates())

if not path.exists("data/raw/FullEpisodeDataRaw.csv"):
    full_episode_data = get_spotify_data(episode_ids, "episodes")
    full_episode_data_df = pd.DataFrame({"id": episode_ids, "data": full_episode_data})    
    full_episode_data_df.to_csv("data/raw/FullEpisodeDataRaw.csv")
else:
    full_episode_data_df = pd.read_csv("data/raw/FullEpisodeDataRaw.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,data
0,1zC8VOx9ltEhS1DzeHOg5I,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
1,0RMFL7GGtxWSLvVgM1Y9oz,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
2,0SF6oXn0z9UiT299AWAJJU,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
3,5FO4bqRKnNt5QNkz9lVwAA,{'audio_preview_url': 'https://p.scdn.co/mp3-p...
4,2ojCUReqahpKxL7jUp6mCN,{'audio_preview_url': 'https://p.scdn.co/mp3-p...


In [21]:
episode_data = full_episode_data_df["data"].iloc[0]
print("Episode Data:")
print_nested_keys(ast.literal_eval(str(episode_data)), "  ")

Episode Data:
  audio_preview_url
  description
  duration_ms
  explicit
  external_urls
    spotify
  href
  id
  images
    height
    url
    width
  is_externally_hosted
  is_playable
  language
  languages
  name
  release_date
  release_date_precision
  show
    available_markets
    copyrights
    description
    explicit
    external_urls
      spotify
    href
    id
    images
      height
      url
      width
    is_externally_hosted
    languages
    media_type
    name
    publisher
    total_episodes
    type
    uri
  type
  uri


In [22]:
if not path.exists("data/ready/FullEpisodeData.csv"):
    episode_data_accessor = {
        "description": "description",
        "duration_ms": "duration_ms",
        "explicit": "explicit",
        "image_url": lambda data: data["images"][0]["url"],
        "name": "name",
        "release_date": lambda data: convert_release_date(data["release_date"]),
        "show_id": lambda data: data["show"]["id"],
        "show_copyright_texts": lambda data: list(map(lambda copyright: copyright["text"], data["show"]["copyrights"])),
        "show_description": lambda data: data["show"]["description"],
        "show_explicit": lambda data: data["show"]["explicit"],
        "show_image_url": lambda data: data["show"]["images"][0]["url"],
        "show_name": lambda data: data["show"]["name"],
        "show_publisher": lambda data: data["show"]["publisher"],
        "show_total_episodes": lambda data: data["show"]["total_episodes"]
    }
    full_episode_data_df = full_episode_data_df.apply(lambda entry: get_data_values(entry, episode_data_accessor), axis=1).drop(columns=["data"])
    full_episode_data_df.to_csv("data/ready/FullEpisodeData.csv")
else:
    full_episode_data_df = pd.read_csv("data/ready/FullEpisodeData.csv")
    full_episode_data_df.drop(full_episode_data_df.columns[0], axis=1, inplace=True)

full_episode_data_df.head()

Unnamed: 0,id,description,duration_ms,explicit,image_url,name,release_date,show_id,show_copyright_texts,show_description,show_explicit,show_image_url,show_name,show_publisher,show_total_episodes
0,1zC8VOx9ltEhS1DzeHOg5I,In this episode we meet Elliot Alderson (@fs0c...,2873078,False,https://i.scdn.co/image/3645a43f9aed7fa7e63dbe...,Ep 49: Elliot,2019-10-15,4XPl3uEEL9hvqMkoZrzbx5,[],Explore true stories of the dark side of the I...,False,https://i.scdn.co/image/838edb072aec169c858b87...,Darknet Diaries,Jack Rhysider,84
1,0RMFL7GGtxWSLvVgM1Y9oz,This is the story about when a nation state ha...,3047654,False,https://i.scdn.co/image/a370a9e7159574b295c993...,Ep 48: Operation Socialist,2019-10-01,4XPl3uEEL9hvqMkoZrzbx5,[],Explore true stories of the dark side of the I...,False,https://i.scdn.co/image/838edb072aec169c858b87...,Darknet Diaries,Jack Rhysider,84
2,0SF6oXn0z9UiT299AWAJJU,This is the story about the XBox hacking scene...,4940983,True,https://i.scdn.co/image/d186d62ea3d8f6bc073bd4...,Ep 45: XBox Underground (Part 1),2019-08-20,4XPl3uEEL9hvqMkoZrzbx5,[],Explore true stories of the dark side of the I...,False,https://i.scdn.co/image/838edb072aec169c858b87...,Darknet Diaries,Jack Rhysider,84
3,5FO4bqRKnNt5QNkz9lVwAA,This is the story about the XBox hacking scene...,5130841,True,https://i.scdn.co/image/f56f1f615571d38dbc58a7...,Ep 46: XBox Underground (Part 2),2019-09-03,4XPl3uEEL9hvqMkoZrzbx5,[],Explore true stories of the dark side of the I...,False,https://i.scdn.co/image/838edb072aec169c858b87...,Darknet Diaries,Jack Rhysider,84
4,2ojCUReqahpKxL7jUp6mCN,Ransomware is ugly. It infects your machine an...,2231118,False,https://i.scdn.co/image/d0b20341b369b48d6b0429...,Ep 44: Zain,2019-08-06,4XPl3uEEL9hvqMkoZrzbx5,[],Explore true stories of the dark side of the I...,False,https://i.scdn.co/image/838edb072aec169c858b87...,Darknet Diaries,Jack Rhysider,84


## Additional Feature Engineering

I also want to get the genre data to put on tracks. In practice, the genre data only comes from the artist objects (the album `genres` lists shown above are all empty), so we'll take that data and put it on the tracks. Note: not all artists have genre data, so this won't be incredibly useful, but it'll be cool with what we have!

In [23]:
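def get_genres_from_artists(entry):
    artists = ast.literal_eval(str(entry["artist_ids"]))
    genres_set = set()
    for artist in artists:
        genres = artist_data_df[artist_data_df["id"] == artist].iloc[0]["genres"]
        # When loaded from CSV, "genres" is a string like "['bebop', 'jazz']";
        # iterating over it directly would add single characters such as '[' and "'",
        # so parse it into a real list first
        if isinstance(genres, str):
            genres = ast.literal_eval(genres)
        for genre in genres:
            genres_set.add(genre)
    return list(genres_set)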

if not "genres" in full_track_data_df:
    full_track_data_df["genres"] = full_track_data_df.apply(get_genres_from_artists, axis=1)
    full_track_data_df.to_csv("data/ready/FullTrackData.csv")
full_track_data_df.head()

Unnamed: 0,id,album_id,album_name,album_release_date,album_total_tracks,artist_ids,artist_names,duration_ms,explicit,name,popularity,track_number,genres
0,4hns23kYYZg0BhDwXeDxB1,0Gwu5X7W1mrkSTk2uZ25cv,Blues & Politics,1999-06-29,8,['54YNxT02JdAApvFBhD8ea0'],['Mingus Big Band'],499826,False,Haitian Fight Song,28,2,"['a', ',', 'z', 'e', ""'"", ']', 'b', 'm', 'n', ..."
1,3sLmks6fCY40bBSGDjU4FO,7By1lfK4fTIs2YsMvA0FWH,The Project,2004-01-01,10,['5zS6TsJ4lQFUGePSHAXaI9'],['Andy Martin & Vic Lewis'],459106,False,Everything You Is,0,6,"['[', ']']"
2,2408a07TNDga6lMlaIFLEU,5gfrrR8BnDgFhqGWcQaWFe,And the Danish Radio Big Band & Eclipse,2013-05-10,22,['6DbqS0X8cSFOPGsvyze2yh'],['Thad Jones'],255466,False,To You,6,2,"[""'"", 'b', 's', ' ', 'd', 'o', ',', 'c', 'u', ..."
3,4Saza06xljloZwotqXdNle,3f22Ap0VSZYWsqrGcphUnY,Christmas Classics,2004-01-01,16,['5fJ4w85NxFXyWlPU9wH6BE'],['Guy Lombardo & His Royal Canadians'],128000,False,Auld Lang Syne,33,16,"['[', ']']"
4,0rupt7DuLo3WGecL3cyi19,2kAN1sZjSQQDkusyXyngep,Frankly Basie / Count Basie Plays The Hits Of ...,1963-01-01,15,['2jFZlvIea42ZvcCw4OeEdA'],['Count Basie'],150773,False,All Of Me,21,12,"[""'"", 'w', 'b', 's', ' ', 'o', 'd', ',', 'u', ..."
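
A quick aside on that `genres` column: list-valued columns don't survive a CSV round trip as lists, they come back as strings. The single-character entries above (like `'['` and `']'`) look like exactly what you'd get from looping over such a string instead of over a list. Here's a minimal sketch of parsing the cell first; `parse_list_cell` is a made-up helper name, and the sample string is the Mingus Big Band genre list from the artist table:

```python
import ast

def parse_list_cell(cell):
    """Return a real list from a cell that may hold a stringified list."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return list(cell)

# What a list-valued cell looks like after being written to and read from CSV
raw_genres = "['bebop', 'big band', 'jazz', 'modern big band']"

# Looping over the string itself collects characters, not genres
characters = set(raw_genres)

# Parsing first gives back the actual genre names
genres = parse_list_cell(raw_genres)

print(sorted(characters))
print(genres)
```

An empty list stored as `"[]"` behaves the same way: looping over the string gives just `'['` and `']'`, which matches the rows for artists with no genres.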


I also want to figure out approximately how many times I've listened to each track. I bounced a few ideas around, and decided that the best way is to add up all of the milliseconds I've played a song (`ms_played`), divide that total by the length of the song, and round the result. First, though, the streaming history needs a little tidying:

In [24]:
if not "time" in streaming_history_data_df:
    def split_date(row):
        date = pd.to_datetime(row["date"])
        return pd.Series([date.date(), date.time()])
    streaming_history_data_df[["date", "time"]] = streaming_history_data_df.apply(split_date, axis=1)

    streaming_history_data_df = streaming_history_data_df.sort_values(by=["date", "time"]).reset_index(drop=True)
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

if "artist_name" in streaming_history_data_df and "track_name" in streaming_history_data_df:
    streaming_history_data_df.drop(columns=["artist_name", "track_name"], axis=1, inplace=True)
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

if streaming_history_data_df.columns[0] != "date":
    streaming_history_data_df = streaming_history_data_df[["date", "time", "type", "id", "ms_played"]]
    streaming_history_data_df.to_csv("data/ready/CompleteStreamingHistory.csv")

streaming_history_data_df.head()

Unnamed: 0,date,time,type,id,ms_played
0,2020-01-01,00:51:00,track,4hns23kYYZg0BhDwXeDxB1,680
1,2020-01-01,00:52:00,track,3sLmks6fCY40bBSGDjU4FO,3282
2,2020-01-01,00:52:00,track,2408a07TNDga6lMlaIFLEU,1496
3,2020-01-01,00:52:00,track,4Saza06xljloZwotqXdNle,66171
4,2020-01-01,00:55:00,track,0rupt7DuLo3WGecL3cyi19,150773
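
Here's a minimal sketch of the listen-count estimate described above, in plain Python so it runs on its own. `estimate_play_counts` is a name I'm making up for the example. The first two plays and both durations come from the tables above; the third play is a hypothetical extra listen so the sum has something to add up:

```python
from collections import defaultdict

# (track id, ms_played) pairs
plays = [
    ("4hns23kYYZg0BhDwXeDxB1", 680),
    ("0rupt7DuLo3WGecL3cyi19", 150773),
    ("0rupt7DuLo3WGecL3cyi19", 150773),  # hypothetical second listen
]

# Track lengths in milliseconds (the duration_ms column of the track data)
durations = {
    "4hns23kYYZg0BhDwXeDxB1": 499826,
    "0rupt7DuLo3WGecL3cyi19": 150773,
}

def estimate_play_counts(plays, durations):
    """Sum ms_played per track, divide by the track length, and round."""
    total_ms = defaultdict(int)
    for track_id, ms_played in plays:
        total_ms[track_id] += ms_played
    return {
        track_id: round(ms / durations[track_id])
        for track_id, ms in total_ms.items()
        if durations.get(track_id)  # skip tracks with unknown or zero length
    }

play_counts = estimate_play_counts(plays, durations)
print(play_counts)
```

A 680 ms skip of an eight-minute track rounds down to zero plays, which is the behavior I'd want. One caveat: Python's `round` sends exact halves to the nearest even number, so a track played for exactly half its length counts as zero.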


So I found [this article](https://towardsdatascience.com/is-my-spotify-music-boring-an-analysis-involving-music-data-and-machine-learning-47550ae931de) that calculates the boringness of a song through this arbitrary equation:
```
boringness = loudness + tempo + (energy*100) + (danceability*100)
```
And I really want to find out how boring my music is, so...

In [25]:
def get_boringness_score(entry):
    return entry["loudness"] + entry["tempo"] + (entry["energy"] * 100) + (entry["danceability"] * 100)

if not "boringness" in track_feature_data_df:
    track_feature_data_df["boringness"] = track_feature_data_df.apply(get_boringness_score, axis=1)
    track_feature_data_df.to_csv("data/ready/TrackFeatureData.csv")
track_feature_data_df.head()

Unnamed: 0,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,boringness
0,4hns23kYYZg0BhDwXeDxB1,0.386,0.392,2,-14.123,0,0.0408,0.536,0.794,0.167,0.399,89.153,4,152.83
1,3sLmks6fCY40bBSGDjU4FO,0.581,0.517,5,-10.203,0,0.0378,0.901,0.794,0.108,0.553,138.751,4,238.348
2,2408a07TNDga6lMlaIFLEU,0.138,0.11,5,-15.402,1,0.0337,0.945,0.943,0.111,0.0534,61.73,4,71.128
3,4Saza06xljloZwotqXdNle,0.24,0.245,5,-10.777,1,0.0327,0.979,0.788,0.0947,0.112,83.922,3,121.645
4,0rupt7DuLo3WGecL3cyi19,0.529,0.176,7,-17.229,1,0.0398,0.762,0.72,0.0529,0.791,150.108,4,203.379
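
As a sanity check, here's the formula worked through by hand for the first two rows of the table above, in plain Python. `boringness` is just a standalone copy of the equation for this example. Since every term gets bigger as a track gets louder, faster, more energetic, or more danceable, I'm assuming a lower score means a more boring song:

```python
def boringness(features):
    """boringness = loudness + tempo + (energy * 100) + (danceability * 100)"""
    return (
        features["loudness"]
        + features["tempo"]
        + features["energy"] * 100
        + features["danceability"] * 100
    )

# Feature values copied from the first two rows of the table above
rows = [
    {"danceability": 0.386, "energy": 0.392, "loudness": -14.123, "tempo": 89.153},
    {"danceability": 0.581, "energy": 0.517, "loudness": -10.203, "tempo": 138.751},
]

# Round to three decimals to hide floating-point noise
scores = [round(boringness(row), 3) for row in rows]
print(scores)
```

For the first row that's -14.123 + 89.153 + 39.2 + 38.6 = 152.83, which matches the `boringness` column.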


Okay, okay. I think it's finally time...

Time... for the analysis.