# Description

**Goal**: I want to find out what makes a playlist successful, and to predict how successful a playlist will be from its known features. (A possible future extension is to generate a playlist that is likely to be successful.)

*Successful*: in this exercise, success is measured by a playlist's number of followers.

## Step 1: Data Preparation

Connect to Spotify and download all "featured" playlists.

In [4]:
# Import Packages
import spotipy
import requests
import sys
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import numpy as np
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import json
import time
import urllib
from sklearn.preprocessing import MultiLabelBinarizer

In [6]:
# ID and Password for accessing Spotify API
client_id = ""
client_secret = ""

# Setup the credentials
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

# Make the connection
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [3]:
# Get all spotify playlists
playlists = sp.user_playlists('spotify')

# Empty list to hold playlist information
spotify_playlists = []

# Loop to get data for each playlist
while playlists:
    
    for i, playlist in enumerate(playlists['items']):
        names = playlist['name']
        track_count = playlist['tracks']['total']
        ids = playlist['id']
        uri = playlist['uri']
        href = playlist['href']
        public = playlist['public']
        data_aggregation = names, track_count, ids, uri, href, public
        spotify_playlists.append(data_aggregation)
        
    if playlists['next']:
        playlists = sp.next(playlists)
    
    else:
        playlists = None

In [4]:
# Convert list into a dataframe
data = pd.DataFrame(np.array(spotify_playlists).reshape(len(spotify_playlists),6), 
                    columns=['Name', 'No. of Tracks', 'ID', 'URI', 'HREF', 'Public'])
data.head()

Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public
0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:playlist:37i9dQZF1DXcBWIGoYBM5M,https://api.spotify.com/v1/playlists/37i9dQZF1...,True
1,RapCaviar,50,37i9dQZF1DX0XUsuxWHRQd,spotify:playlist:37i9dQZF1DX0XUsuxWHRQd,https://api.spotify.com/v1/playlists/37i9dQZF1...,True
2,Hot Country,56,37i9dQZF1DX1lVhptIYRda,spotify:playlist:37i9dQZF1DX1lVhptIYRda,https://api.spotify.com/v1/playlists/37i9dQZF1...,True
3,¡Viva Latino!,50,37i9dQZF1DX10zKzsJ2jva,spotify:playlist:37i9dQZF1DX10zKzsJ2jva,https://api.spotify.com/v1/playlists/37i9dQZF1...,True
4,New Music Friday,100,37i9dQZF1DX4JAvHpjipBk,spotify:playlist:37i9dQZF1DX4JAvHpjipBk,https://api.spotify.com/v1/playlists/37i9dQZF1...,True


In [6]:
data['No. of Tracks'] = data['No. of Tracks'].apply(pd.to_numeric, errors='coerce')

In [7]:
data.dtypes

Name             object
No. of Tracks     int64
ID               object
URI              object
HREF             object
Public           object
dtype: object

In [9]:
# Pull the number of followers per playlist
playlist_follower = []

# Loop over playlists and get followers.
# NOTE: the "- 1" stops one row early, so the last playlist gets no follower
# count; use range(len(data)) to cover every row.
for i in range(0, len(data['URI'])-1): 
    
    # If the playlist has at least one track, look up its follower count
    if data['No. of Tracks'][i] > 0:
        #uri = data['URI'][i]
        username = 'user'
        playlist_id = data['ID'][i]
        #playlist_id = uri.split(':')[2]
        results = sp.user_playlist(username, playlist_id)
        followers = results['followers']['total']
        playlist_follower.append(followers)
    
    # If the playlist has no tracks, record 0 followers
    else: 
        followers = 0
        playlist_follower.append(followers)

In [17]:
# Add a new column for followers 
data['Followers'] = pd.Series(playlist_follower)
data

Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public,Followers
0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:playlist:37i9dQZF1DXcBWIGoYBM5M,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,26794529.0
1,RapCaviar,50,37i9dQZF1DX0XUsuxWHRQd,spotify:playlist:37i9dQZF1DX0XUsuxWHRQd,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,13359713.0
2,Hot Country,56,37i9dQZF1DX1lVhptIYRda,spotify:playlist:37i9dQZF1DX1lVhptIYRda,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,5924405.0
3,¡Viva Latino!,50,37i9dQZF1DX10zKzsJ2jva,spotify:playlist:37i9dQZF1DX10zKzsJ2jva,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,10649317.0
4,New Music Friday,100,37i9dQZF1DX4JAvHpjipBk,spotify:playlist:37i9dQZF1DX4JAvHpjipBk,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,3679444.0
...,...,...,...,...,...,...,...
1432,Women of Pop,70,37i9dQZF1DX3WvGXE8FqYX,spotify:playlist:37i9dQZF1DX3WvGXE8FqYX,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,1985046.0
1433,dw-c,50,5ji4GZJpll6twskFvKxiHx,spotify:playlist:5ji4GZJpll6twskFvKxiHx,https://api.spotify.com/v1/playlists/5ji4GZJpl...,True,14.0
1434,dw_g,30,40VxbK9NqccdUDUpiUXmbp,spotify:playlist:40VxbK9NqccdUDUpiUXmbp,https://api.spotify.com/v1/playlists/40VxbK9Nq...,True,7.0
1435,Top Shower Songs,100,0RTz1jFo5BXGPfI8eVf8sj,spotify:playlist:0RTz1jFo5BXGPfI8eVf8sj,https://api.spotify.com/v1/playlists/0RTz1jFo5...,True,58.0


In [5]:
# save a copy of this data as a csv file
data.to_csv("/Users/fujinhuizi/Documents/GitHub/data/spotify/spotify_pl.csv")
#data = pd.read_csv("/Users/fujinhuizi/Documents/GitHub/data/spotify/spotify_pl.csv")

Now we have our **response variable**: number of Followers. 

Some questions come to mind:

* How are follower counts distributed?
* What might contribute to a playlist being successful?

For the first question, we can look at some basic statistics.

For the second question, here are some brainstormed candidate features:

* Number of tracks
* Genre (to quantify, maybe pop%, dance%, EDM%, genre count, etc.)
* Mean value of the features of the tracks (e.g. danceability, valence, tempo, etc.)
* Mean popularity of the tracks
* Artists (to quantify, maybe most popular artist %, Justin Bieber %, etc.)

...
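As a rough sketch of the "mean value of the features of the tracks" idea, the snippet below averages per-track audio features into one row per playlist. The playlist IDs, song IDs, feature values, and the helper name `playlist_feature_means` are all made up for illustration; the real inputs would come from the Spotify API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: (playlist_id, song_id) pairs and per-song features.
# All IDs and values below are made up for illustration.
songs_playlist = [("pl_a", "s1"), ("pl_a", "s2"), ("pl_b", "s2"), ("pl_b", "s3")]
song_features = {
    "s1": {"danceability": 0.80, "valence": 0.60},
    "s2": {"danceability": 0.60, "valence": 0.40},
    "s3": {"danceability": 0.20, "valence": 0.90},
}

def playlist_feature_means(pairs, features):
    """Average each audio feature over the songs of every playlist."""
    grouped = defaultdict(lambda: defaultdict(list))
    for playlist_id, song_id in pairs:
        # Songs without fetched features contribute nothing.
        for name, value in features.get(song_id, {}).items():
            grouped[playlist_id][name].append(value)
    return {
        playlist_id: {name: mean(values) for name, values in by_feature.items()}
        for playlist_id, by_feature in grouped.items()
    }

playlist_means = playlist_feature_means(songs_playlist, song_features)
print(playlist_means)
```

Each playlist then has a fixed-length feature vector, regardless of how many tracks it contains, which is what a model predicting followers would need.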


In [7]:
# basic statistics of the numeric columns. Use non-scientific notation for easier reading
data.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

Unnamed: 0.1,Unnamed: 0,No. of Tracks,Followers
count,1437.0,1437.0,1436.0
mean,718.0,72.2853,511081.0
std,414.97,61.7223,1324120.0
min,0.0,0.0,0.0
25%,359.0,42.0,5595.75
50%,718.0,53.0,110410.0
75%,1077.0,88.0,473496.0
max,1436.0,851.0,26794500.0


Some findings from the statistics above:

* One record is missing its follower count; we may want to exclude it.
* Some playlists contain 0 tracks; we may want to remove them from the list.
* The biggest playlist has 851 songs in it!
* The most popular playlist has about 26.8 million followers; the mean is about 0.5 million.
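The statistics also show a mean (about 511,000) far above the median (about 110,000), which suggests a heavily right-skewed distribution. One option, which is an assumption here and not something the notebook does, is to look at follower counts on a log scale. The sketch below uses only the ten follower counts visible in the tables above, so it is illustrative and not representative of the full data.

```python
import math
import statistics

# Follower counts copied from the rows visible in the tables above
# (only a handful of playlists, so this is illustrative, not the full data).
followers = [26794529, 13359713, 5924405, 10649317, 3679444,
             462063, 1985046, 14, 7, 58]

mean_followers = statistics.mean(followers)
median_followers = statistics.median(followers)

# log10(1 + x) stays defined for zero-follower playlists and compresses the long tail.
log_followers = [math.log10(1 + x) for x in followers]

print(f"mean:   {mean_followers:,.0f}")
print(f"median: {median_followers:,.0f}")
print(f"log10 range: {min(log_followers):.2f} to {max(log_followers):.2f}")
```

On the log scale the values span roughly one to seven and a half orders of magnitude, which is usually easier to plot and to model than the raw counts.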

In [8]:
# clean up list: remove missing follower record
data_1 = data.dropna(subset=['Followers'])
data_1

Unnamed: 0.1,Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public,Followers
0,0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:playlist:37i9dQZF1DXcBWIGoYBM5M,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,26794529.0
1,1,RapCaviar,50,37i9dQZF1DX0XUsuxWHRQd,spotify:playlist:37i9dQZF1DX0XUsuxWHRQd,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,13359713.0
2,2,Hot Country,56,37i9dQZF1DX1lVhptIYRda,spotify:playlist:37i9dQZF1DX1lVhptIYRda,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,5924405.0
3,3,¡Viva Latino!,50,37i9dQZF1DX10zKzsJ2jva,spotify:playlist:37i9dQZF1DX10zKzsJ2jva,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,10649317.0
4,4,New Music Friday,100,37i9dQZF1DX4JAvHpjipBk,spotify:playlist:37i9dQZF1DX4JAvHpjipBk,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,3679444.0
...,...,...,...,...,...,...,...,...
1431,1431,Essential Folk,97,37i9dQZF1DWVmps5U8gHNv,spotify:playlist:37i9dQZF1DWVmps5U8gHNv,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,462063.0
1432,1432,Women of Pop,70,37i9dQZF1DX3WvGXE8FqYX,spotify:playlist:37i9dQZF1DX3WvGXE8FqYX,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,1985046.0
1433,1433,dw-c,50,5ji4GZJpll6twskFvKxiHx,spotify:playlist:5ji4GZJpll6twskFvKxiHx,https://api.spotify.com/v1/playlists/5ji4GZJpl...,True,14.0
1434,1434,dw_g,30,40VxbK9NqccdUDUpiUXmbp,spotify:playlist:40VxbK9NqccdUDUpiUXmbp,https://api.spotify.com/v1/playlists/40VxbK9Nq...,True,7.0


In [15]:
# Which playlists have no tracks in them?


Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public,Followers
0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:playlist:37i9dQZF1DXcBWIGoYBM5M,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,26794529.0
1,RapCaviar,50,37i9dQZF1DX0XUsuxWHRQd,spotify:playlist:37i9dQZF1DX0XUsuxWHRQd,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,13359713.0
2,Hot Country,56,37i9dQZF1DX1lVhptIYRda,spotify:playlist:37i9dQZF1DX1lVhptIYRda,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,5924405.0
3,¡Viva Latino!,50,37i9dQZF1DX10zKzsJ2jva,spotify:playlist:37i9dQZF1DX10zKzsJ2jva,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,10649317.0
4,New Music Friday,100,37i9dQZF1DX4JAvHpjipBk,spotify:playlist:37i9dQZF1DX4JAvHpjipBk,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,3679444.0
...,...,...,...,...,...,...,...
1431,Essential Folk,97,37i9dQZF1DWVmps5U8gHNv,spotify:playlist:37i9dQZF1DWVmps5U8gHNv,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,462063.0
1432,Women of Pop,70,37i9dQZF1DX3WvGXE8FqYX,spotify:playlist:37i9dQZF1DX3WvGXE8FqYX,https://api.spotify.com/v1/playlists/37i9dQZF1...,True,1985046.0
1433,dw-c,50,5ji4GZJpll6twskFvKxiHx,spotify:playlist:5ji4GZJpll6twskFvKxiHx,https://api.spotify.com/v1/playlists/5ji4GZJpl...,True,14.0
1434,dw_g,30,40VxbK9NqccdUDUpiUXmbp,spotify:playlist:40VxbK9NqccdUDUpiUXmbp,https://api.spotify.com/v1/playlists/40VxbK9Nq...,True,7.0
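The cell above has no code recorded. One way to answer its question, assuming a frame shaped like `data` with a `No. of Tracks` column, is a boolean mask. The toy rows below are made up so that the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame; the rows are made up.
data = pd.DataFrame({
    "Name": ["Playlist A", "Playlist B", "Playlist C"],
    "No. of Tracks": [50, 0, 30],
    "Followers": [1200.0, 0.0, 7.0],
})

# Boolean mask selecting playlists that contain no tracks.
empty_mask = data["No. of Tracks"] == 0
empty_playlists = data[empty_mask]
print(empty_playlists)

# Keep only playlists that have tracks and a known follower count.
data_clean = data[~empty_mask].dropna(subset=["Followers"])
```

The same mask can be inverted with `~` to drop the empty playlists before any modelling.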


In [9]:
# New function to get tracks in playlist
def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

Extracting features from Spotify can take a significant amount of time and tends to raise errors along the way. To avoid losing already-downloaded information when an error occurs, the results are cached in an in-memory dictionary.
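A minimal sketch of this cache-plus-retry idea is shown below. The helper `fetch_with_cache` and the `flaky_fetch` function are hypothetical; `flaky_fetch` only simulates an unreliable API call and does not contact Spotify.

```python
import time

def fetch_with_cache(keys, fetch, cache, retries=3, pause=0.0):
    """Fetch each key once, retrying on errors; return the keys that failed."""
    failed = []
    for key in keys:
        if key in cache:          # already fetched in an earlier run
            continue
        for attempt in range(retries):
            try:
                cache[key] = fetch(key)
                break
            except Exception as err:
                print(f"{key}: attempt {attempt + 1} failed ({err})")
                time.sleep(pause)
        else:
            failed.append(key)    # every attempt raised

    return failed

# Hypothetical flaky fetcher standing in for a Spotify call.
calls = {"n": 0}

def flaky_fetch(key):
    calls["n"] += 1
    if key == "bad" or calls["n"] == 1:
        raise RuntimeError("simulated API error")
    return f"tracks for {key}"

cache = {}
failed = fetch_with_cache(["p1", "p2", "bad"], flaky_fetch, cache)
print(cache, failed)
```

Because the cache lives outside the function, rerunning the call after a crash only fetches the keys that are still missing, and the returned list makes the failures visible rather than silently dropped.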

In [10]:
# Subsample of data to pull
Spotify_playlists = data.iloc[0:10]

# Create playlist cache in memory
playlist_tracks = dict()

In [11]:
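# Collect the tracks of each playlist (audio features are fetched later)
for playlist in Spotify_playlists["ID"]:
    if Spotify_playlists.loc[Spotify_playlists['ID'] == playlist, 'No. of Tracks'].item() > 0:
        try:
            playlist_tracks[playlist] = get_playlist_tracks('spotify', playlist)
            time.sleep(random.randint(1, 3))
        except Exception as err:
            # Report the failure instead of silently skipping the playlist
            print("Could not fetch playlist {}: {}".format(playlist, err))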

In [12]:
# Build a list of (playlist ID, track ID) pairs for the first 10 playlists
songs_playlist = []

for playlist, tracks in playlist_tracks.items():
    for song in tracks:
        songs_playlist.append((playlist, song['track']['id']))

print("Number of Songs in Playlists: {}".format(len(songs_playlist)))

Number of Songs in Playlists: 0


In [13]:
# Create audio feature dictionary and set sleeping time thresholds
songs = [item[1] for item in songs_playlist]

audio_feat = dict()
limit_songs_small = 10
limit_songs_medium = 200

In [14]:
# Audio feature extraction - saves information in cache
for item,song in enumerate(songs):
    if song not in audio_feat:
        try:
            audio_feat[song] = sp.audio_features(song)
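        except Exception as err:
            # Log the failure so that missing songs can be retried later
            print("\nCould not fetch features for {}: {}".format(song, err))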

        if item % limit_songs_small == 0:
            time.sleep(random.randint(0, 1))

        if item % limit_songs_medium == 0:
            time.sleep(random.randint(0, 1))

        out = np.floor(item * 1. / len(songs_playlist) * 100)
        sys.stdout.write("\r%d%%" % out)
        sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)

100%
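The loop above requests one song per call. A possible speed-up, assuming the audio-features endpoint accepts up to 100 track IDs per request as the Spotify Web API documentation describes, is to send the IDs in batches. The `chunked` helper and the track IDs below are made up, and the actual API call is left as a comment because it needs live credentials.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up IDs standing in for the notebook's `songs` list.
songs = [f"track_{i}" for i in range(250)]
unique_songs = list(dict.fromkeys(songs))  # drop duplicates, keep order

batches = list(chunked(unique_songs, size=100))
print([len(batch) for batch in batches])  # prints [100, 100, 50]

# With a live client, each batch would be sent in a single request, e.g.:
#     for batch in batches:
#         for features in sp.audio_features(batch):
#             if features is not None:
#                 # keep the one-element list shape that later cells index with [0]
#                 audio_feat[features["id"]] = [features]
```

Batching cuts the number of requests by roughly a factor of 100, which also reduces the need for sleep calls between requests.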

In [15]:
# Convert raw data into dictionaries
acousticness = dict()
danceability = dict()
duration_ms = dict()
energy = dict()
instrumentalness = dict()
key = dict()
liveness = dict()
loudness = dict()
mode = dict()
speechiness = dict()
tempo = dict()
time_signature = dict()
valence = dict()

for item,song in enumerate(audio_feat):
    try:
        acousticness[song] = audio_feat[song][0]['acousticness']
        danceability[song] = audio_feat[song][0]['danceability']
        duration_ms[song] = audio_feat[song][0]['duration_ms']
        energy[song] = audio_feat[song][0]['energy']
        instrumentalness[song] = audio_feat[song][0]['instrumentalness']
        key[song] = audio_feat[song][0]['key']
        liveness[song] = audio_feat[song][0]['liveness']
        loudness[song] = audio_feat[song][0]['loudness']
        mode[song] = audio_feat[song][0]['mode']
        speechiness[song] = audio_feat[song][0]['speechiness']
        tempo[song] = audio_feat[song][0]['tempo']
        time_signature[song] = audio_feat[song][0]['time_signature']
        valence[song] = audio_feat[song][0]['valence']
    except TypeError:
        pass

In [16]:
# Creation of audio feature dataframes from dictionaries
acc_df = pd.DataFrame(pd.Series(acousticness)).reset_index().rename(columns={'index': 'song', 0: 'acousticness'})
dan_df = pd.DataFrame(pd.Series(danceability)).reset_index().rename(columns={'index': 'song', 0: 'dance'})
dur_df = pd.DataFrame(pd.Series(duration_ms)).reset_index().rename(columns={'index': 'song', 0: 'duration'})
ene_df = pd.DataFrame(pd.Series(energy)).reset_index().rename(columns={'index': 'song', 0: 'energy'})
inst_df = pd.DataFrame(pd.Series(instrumentalness)).reset_index().rename(columns={'index': 'song', 0: 'instrumentalness'})
key_df = pd.DataFrame(pd.Series(key)).reset_index().rename(columns={'index': 'song', 0: 'key'})
live_df = pd.DataFrame(pd.Series(liveness)).reset_index().rename(columns={'index': 'song', 0: 'liveness'})
loud_df = pd.DataFrame(pd.Series(loudness)).reset_index().rename(columns={'index': 'song', 0: 'loudness'})
mode_df = pd.DataFrame(pd.Series(mode)).reset_index().rename(columns={'index': 'song', 0: 'mode'})
spee_df = pd.DataFrame(pd.Series(speechiness)).reset_index().rename(columns={'index': 'song', 0: 'speech'})
temp_df = pd.DataFrame(pd.Series(tempo)).reset_index().rename(columns={'index': 'song', 0: 'tempo'})
time_df = pd.DataFrame(pd.Series(time_signature)).reset_index().rename(columns={'index': 'song', 0: 'time'})
vale_df = pd.DataFrame(pd.Series(valence)).reset_index().rename(columns={'index': 'song', 0: 'valence'})
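The thirteen single-feature frames above could also be built as one table directly from the cache. The sketch below assumes `audio_feat` maps each song ID to a one-element list holding a feature dictionary (the shape read with `audio_feat[song][0]` earlier); the IDs and values are made up.

```python
import pandas as pd

# Made-up stand-in for `audio_feat`: song ID -> one-element list of features,
# matching the shape the earlier loop reads with `audio_feat[song][0]`.
audio_feat = {
    "s1": [{"danceability": 0.8, "energy": 0.7, "tempo": 120.0}],
    "s2": [{"danceability": 0.5, "energy": 0.4, "tempo": 95.0}],
    "s3": [None],  # features unavailable for this track
}

# Keep only songs whose features were actually returned.
rows = {song: feats[0] for song, feats in audio_feat.items() if feats and feats[0]}

# One row per song, one column per audio feature.
features_df = (
    pd.DataFrame.from_dict(rows, orient="index")
    .rename_axis("song")
    .reset_index()
)
print(features_df)
```

A single frame avoids merging thirteen frames on `song` later, and columns can still be renamed (for example `danceability` to `dance`) with `rename` if the shorter names are preferred.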

  


In [None]:
# need error handling
