# Setup

Before we can analyze my music collection, we have to fetch it and clean it. First, we import the packages we need.

In [3]:
import os, platform
from time import sleep

import re
from pprint import pprint 

import pandas as pd
from pandas import Timestamp
import numpy as np
import itertools
from datetime import datetime
from pytz import timezone

import multiprocessing as mp

import seaborn as sns
import matplotlib.pyplot as plt

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

## Credentials

Establish my credentials for the Spotify API, and set up a client object to use for calls to the API.

In [4]:
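# read the credentials from the environment instead of hardcoding them in the notebook
# (raises KeyError if the variables are not set)
spotify_id = os.environ['SPOTIPY_CLIENT_ID']
spotify_secret = os.environ['SPOTIPY_CLIENT_SECRET']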
REQUEST_TIMEOUT = 4.0

client_credentials_manager = SpotifyClientCredentials(
    client_id = spotify_id, 
    client_secret = spotify_secret)
sp = spotipy.Spotify(
    client_credentials_manager = client_credentials_manager, 
    requests_timeout = REQUEST_TIMEOUT)

# Fetch Song Information

I have linked my Spotify account to my Last.FM account. Last.FM records each stream, and this log can be downloaded via https://benjaminbenben.com/lastfm-to-csv/.

Next, we load the raw data. Notice that some album titles and timestamps are missing. This is likely just an artifact of the script that pulls from Last.FM, so we'll have to fix that. Below the counts is a random sample of the data, just to get a feel for what is in there.

In [3]:
original_history = pd.read_csv('data/alexliebscher.csv')
original_history.count()

artist       10479
album        10466
track        10479
timestamp    10427
dtype: int64

We will just backfill the missing timestamps, so each gap takes the next valid timestamp in the file. Only a very small share is missing (52 of 10,479 rows), and I assume a missing value is very likely close to the timestamp of the neighboring song.

In [4]:
original_history['timestamp'] = original_history['timestamp'].bfill()
original_history['timestamp'].count()

10479
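
To make the backfill concrete, here is a minimal sketch on a toy series (the values are made up for illustration, not taken from my history). `bfill` copies the next valid value upward into each gap, so a gap in the very last row would stay empty.

```python
import pandas as pd

# toy timestamps with two gaps (illustrative values only)
toy = pd.Series(['2018-01-01 10:00', None, '2018-01-01 09:50', None, '2018-01-01 09:40'])

# each gap takes the next valid value below it
filled = toy.bfill()
print(filled.tolist())
```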

The timestamps are also missing timezone information, so we should add that to ensure our analysis reflects my local time. In this case, all timestamps are assumed to be UTC, and they are converted to US/Pacific, my local zone, when displayed.

In [5]:
timezoned_history = original_history.copy()  # copy, so original_history is not modified in place
timezoned_history['timestamp'] = pd.to_datetime(timezoned_history['timestamp'], utc=True)

The first song was recorded on December 18th, 2017 at roughly 7pm. This dataset covers the following 134 days after that. A random sample is available to see the corrected timestamps.

In [6]:
history_max = timezoned_history['timestamp'].max()
history_min = timezoned_history['timestamp'].min()

print(history_min.tz_convert('US/Pacific'))
print(history_max - history_min)

2017-12-18 18:55:00-08:00
133 days 20:24:00
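
As a small check of the conversion, the sketch below parses a naive timestamp string as UTC and converts it to US/Pacific. The input string is an assumption, chosen so that it lands on the first-stream time printed above.

```python
import pandas as pd

# a naive timestamp string, assumed to be recorded in UTC
raw = pd.Series(['2017-12-19 02:55:00'])

# attach UTC first, then convert to the local zone for display
utc = pd.to_datetime(raw, utc=True)
local = utc.dt.tz_convert('US/Pacific')
print(local.iloc[0])
```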


## Fetch full track data

Although we have artist, album, track, and timestamp for each stream, there's a lot more information that we can find. We choose to use the Spotify API, as it is reliable, easy to use, and offers a handful of quantitative features we otherwise wouldn't be able to assess.

In [24]:
# raw strings keep the regex escapes from being read as (invalid) string escapes
delimeter_pattern = re.compile(r"[\{\}\[\]\(\)\#\'\"]")
classical_pattern = re.compile(r"((op\.?|no\.?)\s*\d{1,3}\s?)", re.IGNORECASE)
collections_pattern = re.compile(r"(^\d{1,3}\s*)")
stylizations_pattern = re.compile(r"[,\-_&*]\s?|:\s")


def clean_query(track, artist, album=''):
    # remove (feat. some artist) for cleaner search
    track = track.lower()
    if " (feat" in track:
        track = track.split(" (feat")[0]
    elif " (with" in track:
        track = track.split(" (with")[0]
    elif " (&" in track:
        track = track.split(" (&")[0]
        
    # clean album names too
    album = album.lower()
    if "nan" == album:
        album = ""
    elif " (feat" in album:
        album = album.split(" (feat")[0]
    elif " (with" in album:
        album = album.split(" (with")[0]
    elif " (&" in album:
        album = album.split(" (&")[0]
        
    # compose a clean, simple query string
    query = str(track + ' ' + artist + ' ' + album).strip()
    
    query = delimeter_pattern.sub("", query) # remove various delimeter chars
    query, subs = classical_pattern.subn("", query) # remove common strings in classical track titles
                                                    # unfortunately modifies tracks such as Candy Shop 
                                                    # by 50 Cent to "candy shCent"
    if subs > 0:
        # classical music often starts with the number of pieces in
        # a collection ("12 Etudes, Op. 10: No.10 in C minor")
        query = collections_pattern.sub("", query)
        
    query = stylizations_pattern.sub(" ", query) # common stylizations in track/album names
    
    return query

def format_return_track(metadata, audio_features):
    # store a new track
    _track = {}
    
    _track['id'] = metadata['id']
    _track['name'] = metadata['name']
    _track['release'] = metadata['album']['release_date']
    _track['popularity'] = metadata['popularity']
    _track['explicit'] = int(metadata['explicit'])
    _track['artists'] = [a['id'] for a in metadata['artists']]
    _track['album'] = metadata['album']['name']

    _track['acousticness'] = audio_features['acousticness']
    _track['danceability'] = audio_features['danceability']
    _track['duration_ms'] = audio_features['duration_ms']
    _track['energy'] = audio_features['energy']
    _track['key'] = audio_features['key']
    _track['liveness'] = audio_features['liveness']
    _track['loudness'] = audio_features['loudness']
    _track['mode'] = audio_features['mode']
    _track['speechiness'] = audio_features['speechiness']
    _track['tempo'] = audio_features['tempo']
    _track['time_signature'] = audio_features['time_signature']
    _track['valence'] = audio_features['valence']
    
    return _track

def get_track_info(track, artist, album='', id_excl=False, verbose=0):
    '''
    With a track name and artist, and optionally an album name,
    search for a corresponding track via the Spotify API and
    build an object with possible descriptive data.
    
    Parameters
    ----------
    track : str
        The name of a track
    artist : str
        The name of the track's artist
    album : str, optional
        The name of the track's album
    id_excl : bool, optional
        Return only the track's Spotify ID
    verbose : int, optional
        Level of verbosity. 0 is no output
    
    Return
    ----------
    Descriptive track data, or just the track ID, or an empty
    dict if no data could be found for the specified track
    '''
    query = clean_query(track, artist, album)    
        
    # if the song exists in the Spotify catalog, fetch info
    try:
        if verbose >= 2: print('Query track: ' + query)
        meta = sp.search(q='track:' + query, type='track', limit=1)
        meta = meta['tracks']['items'][0]

        if not id_excl:
            features = sp.audio_features([meta['id']])[0]
            
    except Exception:
        
        # if the track could not be found, try once more without the album
        if album != "":
            
            if verbose >= 2: print('Requery {} by {} without album'.format(track, artist))
                
            # keep id_excl so the retry returns the same kind of result
            retry = get_track_info(track, artist, id_excl=id_excl)
            if retry:
                return retry
            
        # the track couldn't be found, even without the album, so give up
        if verbose >= 1:
            print('No data for {} by {}, query: {}\n'.format(track, artist, query))
            
        return {}

    if id_excl and meta['id']:
        return meta['id']
    
    # return the track information
    try:
        return format_return_track(meta, features)
    except TypeError:
        if verbose >= 2: print('Parameter missing for {} by {}'.format(track, artist))
    
    return {}
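
The comment in `clean_query` notes that the classical pattern turns "Candy Shop" by 50 Cent into "candy shCent". One possible fix, sketched here and not used for the results in this notebook, is to require a word boundary before `op` or `no`. The name `bounded_pattern` and the sample strings are assumptions for illustration.

```python
import re

# the pattern as used above: matches "op"/"no" anywhere, even inside a word
classical_pattern = re.compile(r"((op\.?|no\.?)\s*\d{1,3}\s?)", re.IGNORECASE)
# hypothetical variant: only match "op"/"no" at the start of a word
bounded_pattern = re.compile(r"\b(op\.?|no\.?)\s*\d{1,3}\s?", re.IGNORECASE)

pop_query = "candy shop 50 cent"
classical_query = "etudes op. 10 no. 4 in c minor"

print(classical_pattern.sub("", pop_query))      # the "op 50 " inside "shop 50 " is removed
print(bounded_pattern.sub("", pop_query))        # left untouched
print(bounded_pattern.sub("", classical_query))  # opus and number removed
```

A title in which `no` really starts a word and is followed by a number would still be trimmed, so the boundary only narrows the problem.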

## Baseline multiprocessor efficiency

We time and record serial processing and multiprocessor functionality to estimate performance improvements. Let this be a simple measurement of how well we can do with multiprocessing when fetching track data from the API.

Extract a random sample of 50 tracks. We will use this to compare single processor efficiency with multiprocessor efficiency.

In [8]:
sample = timezoned_history.sample(50)

In [9]:
cpus = mp.cpu_count() # we'll make use of all CPUs, we use this later too

In [10]:
# initialize a sharedctypes integer to count records
v = mp.Value('i', 0, lock=False)

def async_fetch(track, artist, album):
    '''
    Count and display track searches and timing
    
    Parameters
    ----------
    track : str
        The name of a track
    artist : str
        The name of the track's artist
    album : str
        The name of the track's album
    
    Return
    ----------
    Data about the track, if the track is found (otherwise, empty dict)
    '''
    if v.value % 10 == 0 and v.value != 0:
        # after every 10 tracks searched, print progress information
        print('record: #{} at ({})\n'.format(str(v.value), datetime.now() - s))
        
    v.value += 1
    return get_track_info(track, artist, album, verbose=2)

def serial(tracks):
    '''
    A serial processor for comparison's purpose (1 CPU, 1 process)
    
    Parameters
    ----------
    tracks : pandas.DataFrame
        The tracks to search (columns: track, artist, album)
    
    Return
    ----------
    A list of track data in dicts
    '''
    return [get_track_info(str(t['track']), str(t['artist']), str(t['album']), verbose=2) for i, t in tracks.iterrows()]

def multiprocess(processes, tracks):  
    '''
    Multiprocessing to utilize all cores for comparison's purpose
    
    Parameters
    ----------
    processes : int
        The number of processes to create in parallel
    tracks : pandas.DataFrame
        The tracks to search (columns: track, artist, album)
    
    Return
    ----------
    A list of track data in dicts
    '''
    # the context manager shuts the pool down once all results are collected
    with mp.Pool(processes=processes) as pool:
        results = [pool.apply_async(async_fetch, args=(str(t['track']), str(t['artist']), str(t['album']))) for i, t in tracks.iterrows()]
        results = [p.get() for p in results]
        
    return results

print('\n')
print('# of CPUs:\t{}'.format(cpus))
print('Python version:\t{}'.format(platform.python_version()))
print('Compiler:\t{}'.format(platform.python_compiler()))
print('System:\t\t{}'.format(platform.system()))
print('Release:\t{}'.format(platform.release()))
print('Machine:\t{}'.format(platform.machine()))
print('Processor:\t{}'.format(platform.processor()))
print('Interpreter:\t{}'.format(platform.architecture()[0]))
print('\n')


print('Testing Serial\n')
# Test and time serial()
s = datetime.now()
serial_temp = pd.DataFrame(serial(sample)).dropna()
serial_t = datetime.now() - s

print('Testing Multiprocessor\n')
# Test and time multiprocess()
s = datetime.now()
multi_temp = pd.DataFrame(multiprocess(cpus, sample)).dropna()
multi_t = datetime.now() - s



# of CPUs:	4
Python version:	3.6.3
Compiler:	GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
System:		Darwin
Release:	17.5.0
Machine:	x86_64
Processor:	i386
Interpreter:	64bit


Testing Serial

Query track: edition Rex Orange County edition
Query track: try me James Brown try me
Query track: excuse me A$AP Rocky at.long.last.a$ap
Query track: honest  lifelike remix The Chainsmokers honest remixes
Query track: primo Ark Patrol primo
Query track: new girl toms song The walters songs for dads
Query track: sleeping bag shakewell sleeping bag
Query track: closure Ab Soul these days...
Query track: gatekeeper Jessie Reyez kiddo
Query track: comin out strong Future hndrxx
Query track: loyalty. feat. rihanna. Kendrick Lamar damn.
Query track: gold BROCKHAMPTON saturation
Query track: tutti frutti Little Richard heres little richard remastered  expanded
Requery tutti frutti by Little Richard without album
Query track: slumber party Hellberg mrsuicidesheep presents  taking you higher

Process ForkPoolWorker-3:
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/Users/alex/anaconda3/lib/python3.6/multiprocessing/proces

In [11]:
print('Serial Processing')
print('search ratio (found : expected): {}'.format(len(serial_temp)/len(sample)))
print('Time: {}'.format(serial_t))
print('\nMulti Processing')
print('search ratio (found : expected): {}'.format(len(multi_temp)/len(sample)))
print('Time: {}'.format(multi_t))
print('\n{0:.2f}x faster with multiprocess'.format(serial_t / multi_t))

Serial Processing
search ratio (found : expected): 0.98
Time: 0:00:14.109964

Multi Processing
search ratio (found : expected): 0.98
Time: 0:00:06.262145

2.25x faster with multiprocess
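
The ratio follows directly from the two timings: dividing one `timedelta` by another gives a plain float. Dividing the speedup by the number of processes gives a rough parallel efficiency; that second figure is an extra step, not something the cell above prints.

```python
from datetime import timedelta

# timings copied from the output above
serial_t = timedelta(seconds=14, microseconds=109964)
multi_t = timedelta(seconds=6, microseconds=262145)
cpus = 4

speedup = serial_t / multi_t   # timedelta / timedelta -> float
efficiency = speedup / cpus    # share of the ideal 4x speedup achieved

print('{0:.2f}x faster, {1:.0%} efficiency'.format(speedup, efficiency))
```

An efficiency well below 100% is not surprising for a workload like this, where the workers presumably spend most of their time waiting on network responses.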


## Fetch all track data from Spotify

Use all 4 of my CPUs to fetch track information in parallel.

In [19]:
# initialize sharedctype integers to count records
v = mp.Value('i', 0, lock=False)
total = None

def async_fetch_real(track, artist, album, timestamp):
    '''
    Count and display track searches and timing
    
    Parameters
    ----------
    track : str
        The name of a track
    artist : str
        The name of the track's artist
    album : str
        The name of the track's album
    timestamp : str
        The timestamp of the track
    
    Return
    ----------
    Data about the track, if the track is found (otherwise, empty dict)    
    '''
    if v.value % 100 == 0 and v.value != 0:
        # after every 100 tracks searched, print progress information
        elap = datetime.now() - s
        print('record: #{} - remaining: {}\n'.format(str(v.value), ((elap/v.value) * total.value) - elap))
        
    v.value += 1
    
    _t = get_track_info(track, artist, album, verbose=1)
    # re-attach the timestamp to the track data
    _t.update({
        'timestamp': timestamp
    })
            
    return _t

def multiprocess(processes, tracks):
    '''
    Multiprocessing to utilize all cores for all listening history
    
    Parameters
    ----------
    processes : int
        The number of processes to create in parallel
    tracks : pandas.DataFrame
        The tracks to search (columns: track, artist, album, timestamp)
    
    Return
    ----------
    A list of track data in dicts    
    '''
    # the context manager shuts the pool down once all results are collected
    with mp.Pool(processes=processes) as pool:
        results = [pool.apply_async(async_fetch_real, args=(str(t['track']), str(t['artist']), str(t['album']), Timestamp(t['timestamp']))) for i, t in tracks.iterrows()]
        results = [p.get() for p in results]
    
    return results

def updateTracks(original, unique):
    '''
    Fetch track data via the Spotify API and save the compiled output to JSON
    
    Parameters
    ----------
    original : pandas.DataFrame
    unique : pandas.DataFrame
    '''
    print('Begin fetch...')
    
    temp = pd.DataFrame(multiprocess(cpus, unique)).dropna()
    multi_t = datetime.now() - s
        
    if not temp.empty:
        compiled = pd.concat([original, temp], ignore_index=True)
        compiled.to_json('data/history_comp.json')
        print('Saved complete history')
        
        print('Search ratio (found : expected): {}'.format(len(temp)/total.value))
        print('Total songs found: \t\t{}'.format(len(temp)))
        print('Total time:\t\t\t {}'.format(multi_t))
        print('Songs/sec fetched:\t\t {}'.format(total.value/multi_t.total_seconds()))
        
        return compiled
    else:
        print('No new track data found')

In [20]:
last_full_history = pd.read_json('data/history_comp.json') if os.path.isfile('data/history_comp.json') else pd.DataFrame(columns=['timestamp'])
last_full_history['timestamp'] = pd.to_datetime(last_full_history['timestamp'], utc=True)

s = datetime.now()

if last_full_history.empty or last_full_history['timestamp'].max() < timezoned_history['timestamp'].max():
    unique = timezoned_history[~timezoned_history['timestamp'].isin(last_full_history['timestamp'])]
    
    total = mp.Value('i', len(unique), lock=False)
        
    updateTracks(last_full_history, unique)
else:
    print('all records up to date')
    
# Saved complete history
# Search ratio (found : expected): 0.9925565416547381
# Total songs found: 		10401
# Total time:			 0:35:05.198987
# Songs/sec fetched:		 4.977676725435361

Begin fetch...
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #100 - remaining: 0:25:16.870071

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #200 - remaining: 0:29:10.195047

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...5secs
retrying ...1secs
retrying ...6secs
retrying ...6secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
record: #300 - remaining: 0:30:27.802320

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...5secs
retrying ...1secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
record: #400 - remaining: 0:31:02.579033

retryin


record: #2300 - remaining: 0:26:59.490758

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #2400 - remaining: 0:26:42.268415

record: #2400 - remaining: 0:26:42.284740

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #2500 - remaining: 0:26:22.539674

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for stampede - original mix by Dimitri Vegas & Like Mike, query: stampede  original mix Dimitri Vegas  Like Mike stampede

No data for stayin' alive - 2007 remastered vers

retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #4500 - remaining: 0:19:51.656997

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #4600 - remaining: 0:19:32.715569

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for what if i go? - single version by Mura Masa, query: what if i go?  single version Mura Masa what if i go?

No data for feel the fire - egzod remix by Pluto, query: feel the fire  egzod remix Pluto feel the fire remixes

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #4700 - remaining: 0:19:12.118290

record: #4700 - remaining: 0:19:12.125576

retrying ...6secs
re

retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #6500 - remaining: 0:13:18.527370

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #6600 - remaining: 0:12:58.630376

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for wachet auf ruft uns die stimme bwv 140: chorale prelude by Johann Sebastian Bach, query: wachet auf ruft uns die stimme bwv 140 chorale prelude Johann Sebastian Bach bach great organ favorites

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #6700 - remaining: 0:12:38.355746

retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs


retrying ...6secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...6secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for insanity - suyano remix by Rooverb, query: insanity  suyano remix Rooverb insanity

record: #8400 - remaining: 0:06:57.300354

No data for afterlife (dabin remix) [feat. echos] by Illenium, query: afterlife dabin remix feat. echos Illenium ashes remixes

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for california love - original version (explicit) by 2Pac, query: california love  original version explicit 2Pac 2pac greatest hits explicit version

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #8500 -

retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for the well-tempered clavierbwv 846: prelude i in c major by Johann Sebastian Bach, query: the well tempered clavierbwv 846 prelude i in c major Johann Sebastian Bach bach the well tempered clavier book 1 bwv 846 869

record: #10100 - remaining: 0:01:16.064451

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for &burn by Billie Eilish, query:  burn Billie Eilish  burn

No data for &burn by Billie Eilish, query:  burn Billie Eilish  burn

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
record: #10200 - remaining: 0:00:55.983138

retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...5secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
No data for we're not alone - original mix by Virtual Riot,
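
The progress lines for #2400 and #4700 appear twice in the log, which is what I would expect when several workers read the unlocked counter at the same moment. Below is a minimal sketch of one way around that, assuming a thread pool instead of processes (the requests are I/O-bound) and a lock around the increment-and-check. `fake_fetch` is a stand-in for `get_track_info` so the example runs without the API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
counter_lock = threading.Lock()
progress = []  # milestones reported so far

def fake_fetch(track):
    # stand-in for get_track_info(); returns a dict like the real call
    return {'name': track}

def counted_fetch(track):
    global counter
    # increment and check under one lock so each milestone is reported once
    with counter_lock:
        counter += 1
        if counter % 100 == 0:
            progress.append(counter)
    return fake_fetch(track)

tracks = ['track {}'.format(i) for i in range(1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(counted_fetch, tracks))

print(len(results), sorted(progress))
```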

In [21]:
artists = pd.read_json('data/artist_info.json')
artists.sample(5)

Unnamed: 0,artist,followers,genres,id,popularity
610,Emily Warren,13720,[],1oKdM70mJD8VvDOTKeS8t1,71
2566,Rygby,135,[],4JYWD2cPpCgnXE1T3lhKg9,23
2047,Mightyfools,17078,"[deep big room, electro house, fidget house, m...",5XJWi5Ev0zwaRiXnN8Oo5O,53
1804,Slim Jxmmi,19985,[],7EEiVZvj6RCEtVX2F2pyxu,78
3346,DJ Kay Slay,13628,[deep southern trap],1giPduZUUFru23D2icr2A8,42


In [22]:
full_history = pd.read_json('data/history_comp.json')
if full_history['timestamp'].max().tz is None:
    full_history['timestamp'] = pd.to_datetime(full_history['timestamp'], utc=True)

In [23]:
missing_artist_ids = []

for i, r in full_history.iterrows():
    for artist in r['artists']:
        if not (artists['id'] == artist).any():
            missing_artist_ids.append(artist)
            
len(missing_artist_ids)

6
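
The same lookup can be written as a set difference, which also avoids listing an ID twice when an unknown artist appears on several tracks. A minimal sketch on toy data (the IDs below are made up):

```python
from itertools import chain

# toy stand-ins for full_history['artists'] and artists['id']
track_artists = [['a1', 'a2'], ['a2'], ['a3', 'a4'], ['a4']]
known_ids = ['a1', 'a2']

# flatten the per-track lists, then keep the IDs that are not known yet
missing = sorted(set(chain.from_iterable(track_artists)) - set(known_ids))
print(missing)
```

On the real frames this should amount to `set(chain.from_iterable(full_history['artists'])) - set(artists['id'])`.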

In [24]:
def get_artist_info(id):
    try:
        result = sp.artist(id)
    except Exception:
        # any API or network failure: treat the artist as not found
        return {}
    
    return {'artist': result['name'],
            'id': result['id'], 
            'genres': np.array(result['genres']), 
            'popularity': result['popularity'], 
            'followers': result['followers']['total']}

In [25]:
# initialize a sharedctypes integer to count records
v = mp.Value('i', 0, lock=False)
total = mp.Value('i', len(missing_artist_ids), lock=False)

cpus = mp.cpu_count()

def async_fetch_real(id):
    '''
    count and display artist searches and timing
    '''
    if v.value % 100 == 0 and v.value != 0:
        print('record: #{}'.format(v.value))
        elap = datetime.now() - s
        print('time remaining: {}'.format(((elap/v.value) * total.value) - elap))
        
    v.value += 1
    
    return get_artist_info(id)

def multiprocess(processes, ids):
    '''
    multiprocessing to utilize all cores
    '''
    # the context manager shuts the pool down once all results are collected
    with mp.Pool(processes=processes) as pool:
        results = [pool.apply_async(async_fetch_real, args=(str(i),)) for i in ids]
        results = [p.get() for p in results]
    return results

s = datetime.now()
artists_temp = pd.DataFrame(multiprocess(cpus, missing_artist_ids)).dropna()
multi_t = datetime.now() - s

stitched_artists = pd.concat([artists, artists_temp], ignore_index=True)
stitched_artists = stitched_artists.drop_duplicates('id')

if not artists_temp.empty:
    stitched_artists.to_json('data/artist_info.json')
    print('Saved artist info')
    
print('{} artists added (expected {})'.format(len(stitched_artists)-len(artists), len(artists_temp)))

Saved artist info
6 artists added (expected 6)


The service that exports Last.FM data overcounts the most recently played song, so I choose to keep the first quarter of its instances and drop the remainder. This prevents an artificial skew toward a song that shouldn't be the mode of the data set. Drawback: if that song _really_ is the mode of the dataset, I unknowingly change that.

In [26]:
track_mode = full_history['id'].mode()[0]
L = list(full_history.loc[full_history['id'] == track_mode].index)
L = L[int(len(L)*0.25):]
print(full_history.loc[L[0]]['name'])
L

LUST.


[3000,
 3201,
 3402,
 3602,
 3802,
 3999,
 401,
 4199,
 4400,
 4601,
 4800,
 4998,
 5198,
 5396,
 5596,
 5796,
 5995,
 602,
 6196,
 6396,
 6597,
 6796,
 6996,
 7197,
 7395,
 7594,
 7793,
 7993,
 803,
 8190,
 8386,
 8585,
 8784,
 8984,
 9183,
 9383,
 9576,
 9775,
 9975]
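
The slicing rule is easier to see on a small list. In this sketch (toy indices, not my data) a song with 8 logged plays keeps its first `int(8 * 0.25) = 2` rows, and the other 6 are marked for dropping.

```python
# toy row indices of every play of the most common song
plays = [10, 11, 12, 13, 14, 15, 16, 17]

cutoff = int(len(plays) * 0.25)   # size of the quarter to keep
keep, drop = plays[:cutoff], plays[cutoff:]
print(keep, drop)
```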

In [27]:
full_history = full_history.drop(index=L).reset_index(drop=True)  # drop=True avoids saving the old index as a column

In [28]:
full_history.to_json('data/history_comp.json')

# Mainstream Music

In [46]:
def fetch_mainstream(playlists):
    mainstream_music = []
    for playlist in playlists:
        for item in sp.user_playlist_tracks(playlist[0], playlist[1])['items']:
            id_ = item['track']['id']

            features = sp.audio_features(id_)[0]

            mainstream_music.append(format_return_track(item['track'], features))
            
        print('Playlist complete')

    return mainstream_music

In [45]:
playlists = [('spotify', '37i9dQZF1DX0s5kDXi1oC5'), ('spotify', '37i9dQZF1DXcBWIGoYBM5M')]

mainstream_music = pd.DataFrame(fetch_mainstream(playlists)).dropna()
mainstream_music.to_json('data/mainstream_music.json')
mainstream_music

Unnamed: 0,acousticness,album,artists,danceability,duration_ms,energy,explicit,id,key,liveness,loudness,mode,name,popularity,release,speechiness,tempo,time_signature,valence
0,0.05340,Levels,[1vCWHaC5f2uS3yhpwWbIA6],0.605,199904,0.873,0,5QjJgPU8AJeickx34f7on6,1,0.3140,-5.938,0,Levels - Radio Edit,78,2011-01-01,0.0344,126.026,4,0.4760
1,0.02910,The Heist,[5BcAKTbp20cv7tC5VqPFoC],0.641,258343,0.922,0,3bidbhpOYeV4knp8AIu8Xn,2,0.0862,-4.457,1,Can't Hold Us - feat. Ray Dalton,83,2012-10-09,0.0786,146.078,4,0.8470
2,0.09910,Wild Ones,"[0jnsk9HBra6NMjO2oANoPY, 5WUlDfRSoLAfcVSX1WnrxN]",0.608,232947,0.860,0,1NpW5kyvO4XrNJ3rnfcNy3,5,0.2620,-5.324,0,Wild Ones (feat. Sia),79,2012-06-22,0.0554,127.075,4,0.4370
3,0.09210,Globalization,"[0TnOYISbd1XYRBk9myaseg, 21E3waRsmPlU7jZsS13rcj]",0.720,229360,0.802,1,2bJvI42r8EF3wxjOuDav4r,1,0.6940,-5.797,1,Time of Our Lives,79,2014-11-21,0.0582,124.043,4,0.7230
4,0.03190,The Papercut Chronicles II,"[4IJczjB0fJ04gs4uvP0Fli, 4bYPcJP5jwMhSivRcqie2n]",0.646,210960,0.795,0,0qOnSQQF0yzuPWsXrQ9paz,9,0.2670,-3.293,1,Stereo Hearts (feat. Adam Levine),77,2011-11-11,0.0976,89.990,4,0.7960
5,0.55500,Talk Dirty,[07YZf4WDAMNwqr4jfgOZ8y],0.635,217419,0.691,0,5KONnBIQ9LqCxyeSPin26k,0,0.0970,-4.862,1,Trumpets,71,2013-09-10,0.2580,82.142,4,0.6380
6,0.06900,Nothing But The Beat,"[1Cs0zKBU1kc0i8ypK3B9ai, 5WUlDfRSoLAfcVSX1WnrxN]",0.599,245041,0.803,0,77TT8Xvx637TpzV8kKGkUw,0,0.1290,-3.641,0,Titanium (feat. Sia),71,2011-08-31,0.0986,126.057,4,0.2330
7,0.00346,True,[1vCWHaC5f2uS3yhpwWbIA6],0.518,247427,0.784,0,4h8VwCb1MTGoLKueQ1WgbD,2,0.1710,-5.659,1,Wake Me Up,81,2013-01-01,0.0524,124.102,4,0.5880
8,0.06430,The Wanted (Special Edition),[2NhdGz9EDv2FeUw6udu2g1],0.755,198187,0.838,0,3AGOgQzp0YcPH41u9p7dOp,7,0.1180,-4.500,0,Glad You Came,68,2012-04-24,0.0687,126.877,4,0.4730
9,0.10900,Beauty Behind The Madness,[1Xyo4u8uXC1ZmMpatF05PJ],0.711,213520,0.783,0,22VdIZQfgXJea34mQxlt81,9,0.0947,-5.407,0,Can't Feel My Face,76,2015-08-28,0.0423,107.972,4,0.5870
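
`fetch_mainstream` asks for audio features one track at a time. The Spotify endpoint behind `sp.audio_features` accepts up to 100 IDs per request, so the lookups could be batched. The sketch below shows only the chunking; `chunked` and `fetch_features` are hypothetical names, and `fetch_features` stands in for `sp.audio_features` so the example runs offline.

```python
def chunked(items, size=100):
    # yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_features(ids):
    # hypothetical stand-in for sp.audio_features(ids): one dict per ID
    return [{'id': i} for i in ids]

track_ids = ['id{}'.format(n) for n in range(250)]

features = []
calls = 0
for batch in chunked(track_ids):
    features.extend(fetch_features(batch))
    calls += 1

print(calls, len(features))
```

With 250 IDs this makes 3 requests instead of 250, which should also mean fewer rate-limit retries.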
