# Create Spotify dataset

## Introduction

We build several music datasets starting from a list of Last.fm users, using Spotipy, a Python client for the Spotify Web API.

Our analysis needs a playlists dataset, and the easiest way to get one is to look up users' playlists. Since we don't have a list of Spotify users, we try Last.fm usernames instead, because many Last.fm users are on Spotify too.

## Datasets created

- **Users dataset**: a list of Spotify users, derived from the Last.fm users dataset;
- **Playlists dataset**: derived from the users dataset;
- **Tracks dataset**: derived from the playlists dataset;
- **Audio features dataset**: a complement to the tracks dataset, derived from it;
- **Artists dataset**: derived from the tracks dataset.

**P.S.**: The commented-out code throughout the notebook is left in intentionally.

In [None]:
from spotipy.oauth2 import SpotifyClientCredentials
from tqdm.notebook import tqdm
import glob
import numpy as np
import os
import pandas as pd
import random
import requests
import spotipy
import time

In [None]:
auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)
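
`SpotifyClientCredentials()` is called without arguments above, so Spotipy falls back to the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables. A minimal sketch of a pre-flight check follows; `missing_credentials` is a hypothetical helper written for this notebook, not part of Spotipy.

```python
import os

# Environment variables Spotipy reads when no credentials are passed in.
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credentials()` before creating the client gives a readable list of what still has to be exported, instead of an authentication error later on.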

In [None]:
# How many users do we want to search for playlists?
LEN_USERS = 1000

## Obtain users

We load the Last.fm usernames and shuffle them, so that we can then test which of them are Spotify users too.

In [None]:
with open('../data/users.txt') as f:
    # Skip blank lines (e.g. the one left by a trailing newline)
    users = [line.strip() for line in f if line.strip()]
    
random.shuffle(users)

## Test users and obtain playlists

We now test each user and, when the lookup succeeds, gather their playlists in the same pass.

In [None]:
sp_users = []
playlists = []
pbar = tqdm(total=LEN_USERS)
for user in users:
    user_exists = True
    while True:
        try:
            their_playlists = sp.user_playlists(user)
        except requests.exceptions.ReadTimeout as e:
            # Transient timeout: wait and retry the same user
            print(e)
            time.sleep(10)
            continue
        except spotipy.exceptions.SpotifyException:
            # Spotify rejected the lookup: treat this as a non-Spotify user
            user_exists = False
        break
    if not user_exists:
        continue
    playlists.extend(their_playlists['items'])
    while their_playlists['next']:
        while True:
            try:
                their_playlists = sp.next(their_playlists)
            except requests.exceptions.ReadTimeout as e:
                print(e)
                time.sleep(10)
                continue              
            break
        playlists.extend(their_playlists['items'])
    pbar.update()
    sp_users.append(user)
    if len(sp_users) >= LEN_USERS:
        break
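
The wait-and-retry pattern in the cell above recurs in every request loop of this notebook. As a rough sketch, it could be factored into one helper. `call_with_retry` and its parameters are hypothetical names, and unlike the loops in the notebook it gives up after `max_attempts` tries instead of retrying forever.

```python
import time


def call_with_retry(func, *args, retry_on=(TimeoutError,), wait_seconds=10,
                    max_attempts=5, sleep=time.sleep):
    """Call func(*args), pausing and retrying when a retryable error occurs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on as error:
            if attempt == max_attempts:
                raise  # give up rather than loop forever
            print(error)
            sleep(wait_seconds)
```

In this notebook the call would look like `call_with_retry(sp.user_playlists, user, retry_on=(requests.exceptions.ReadTimeout,))`; the `sleep` parameter exists only so the pause can be replaced in tests.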

In [None]:
print('We now have {} playlists!'.format(len(playlists)))

Example of a playlist:

In [None]:
playlists[0]

## Save Spotify users to file

In [None]:
with open('../data/sp_users.txt', 'w') as f:
    for user in sp_users:
        f.write('{}\n'.format(user))

### Optional: get number of followers info

It seems that getting the number of followers of each playlist requires a second pass over all playlists, which is a lot of requests for a single piece of information.

In [None]:
# for i, playlist in tqdm(enumerate(playlists.copy()), total=len(playlists.copy())):
#     playlists[i] = sp.playlist(playlists[i]['id'])
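
If only the follower count is needed, the response can probably be trimmed with the `fields` argument of `sp.playlist`. The sketch below assumes that `fields="followers.total"` returns a dict shaped like `{"followers": {"total": ...}}`; `fetch_follower_counts` is a hypothetical helper and `client` stands for the `sp` object.

```python
def fetch_follower_counts(client, playlist_ids):
    """Map each playlist id to its follower total, one request per playlist."""
    counts = {}
    for playlist_id in playlist_ids:
        # `fields` asks the API to return only the value we need.
        response = client.playlist(playlist_id, fields="followers.total")
        counts[playlist_id] = response["followers"]["total"]
    return counts
```

This still costs one request per playlist, so it does not remove the extra pass; it only keeps each response small.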

## Treat playlists dataset

Now we clean the playlists dataset, keeping only the columns we want, and expand the `owner` column. The cells that remove duplicates and reindex are kept commented out.

In [None]:
# Filter columns
playlists = pd.DataFrame(playlists, columns=[
    'collaborative',
    'description',
#     'external_urls',
#     'followers',
#     'href',
    'id',
#     'images',
    'name',
    'owner',
    'primary_color',
    'public',
#     'snapshot_id',
    'tracks',
#     'type',
#     'uri'
])

In [None]:
# Expand owner dict
playlists['owner_id'] = playlists['owner'].apply(pd.Series)['id']
playlists.drop(columns='owner', inplace=True)
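
To make the owner expansion and the deduplication step concrete, here is a small sketch on made-up rows shaped like the playlist items above. The `toy` frame and its values are purely illustrative.

```python
import pandas as pd

# Toy rows shaped like playlist items; the values are made up.
toy = pd.DataFrame([
    {"id": "p1", "name": "Morning", "owner": {"id": "alice", "type": "user"}},
    {"id": "p2", "name": "Evening", "owner": {"id": "bob", "type": "user"}},
    {"id": "p1", "name": "Morning", "owner": {"id": "alice", "type": "user"}},
])

# Pull a single key out of each owner dict.
toy["owner_id"] = toy["owner"].apply(lambda owner: owner["id"])
toy = toy.drop(columns="owner")

# Keep the first row for each playlist id and renumber the index.
toy = toy.drop_duplicates("id").reset_index(drop=True)
```

After these steps `toy` holds one row per playlist id, with `owner_id` as a plain string column.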

In [None]:
# # Remove duplicates
# playlists.drop_duplicates('id')#, inplace=True)

In [None]:
# # Reindex
# playlists.reset_index(drop=True, inplace=True)

## Write playlists dataset to file

In [None]:
playlists.to_pickle('../data/sp_playlists.pkl')

## Iterate through playlists to get tracks

We now iterate through the playlists dataset in order to gather information about tracks.

In [None]:
# Deduplicate the playlist ids
playlist_ids = list(set(playlists.id.to_list()))

In [None]:
# Iteration
tracks = []
for i, playlist_id in tqdm(enumerate(playlist_ids), total=len(playlist_ids)):
    while True:
        try:
            q = sp.playlist_tracks(playlist_id)
        except requests.exceptions.ReadTimeout as e:
            print(e)
            time.sleep(10)
            continue
        break
    items = q['items'].copy()
    for item in items:
        # We save the playlist id too
        item.update({'playlist_id': playlist_id})
    tracks.extend(items)
    while q['next']:
        while True:
            try:
                q = sp.next(q)
            except requests.exceptions.ReadTimeout as e:
                print(e)
                time.sleep(10)
                continue
            break
        items = q['items'].copy()
        for item in items:
            item.update({'playlist_id': playlist_id})
        tracks.extend(items)
        
    # Flush to disk every 1000 playlists: the full list does not fit in memory
    if (i + 1) % 1000 == 0:
        pd.DataFrame(tracks).to_pickle('../data/sp_tracks_temp_{}.pkl'.format(i))
        tracks = []
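
Both the playlist and the track loops follow the same paging scheme: each response carries its rows in `items` and a `next` link that is empty on the last page. A minimal sketch of that scheme, with `collect_items` as a hypothetical helper and `fetch_next` standing in for `sp.next`:

```python
def collect_items(first_page, fetch_next):
    """Gather `items` from a page and from every page reachable via `next`."""
    items = list(first_page["items"])
    page = first_page
    while page["next"]:
        page = fetch_next(page)
        items.extend(page["items"])
    return items
```

With the Spotipy client this would read `collect_items(sp.playlist_tracks(playlist_id), sp.next)`; the timeout handling from the cell above is left out to keep the sketch short.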

Example of a track:

In [None]:
# An expression inside an `if` block is not displayed, so use a conditional expression
tracks[0] if len(tracks) > 0 else None

In [None]:
# Save whatever is left after the last full batch
if len(tracks) > 0:
    pd.DataFrame(tracks).to_pickle('../data/sp_tracks_temp_{}.pkl'.format(len(playlist_ids)))

# Free memory
del tracks
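
The periodic flush plus the final save amounts to chunked checkpointing: at most one chunk is held in memory at a time. A standard-library sketch of the idea, using `pickle` directly instead of pandas; `save_in_chunks`, `_dump` and the file naming are hypothetical.

```python
import os
import pickle


def _dump(buffer, out_dir, prefix, index):
    """Write one chunk to `<out_dir>/<prefix>_<index>.pkl` and return the path."""
    path = os.path.join(out_dir, "{}_{}.pkl".format(prefix, index))
    with open(path, "wb") as f:
        pickle.dump(buffer, f)
    return path


def save_in_chunks(records, chunk_size, out_dir, prefix="chunk"):
    """Write records to numbered pickle files, holding one chunk in memory."""
    paths = []
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) == chunk_size:
            paths.append(_dump(buffer, out_dir, prefix, len(paths)))
            buffer = []
    if buffer:  # final, partially filled chunk
        paths.append(_dump(buffer, out_dir, prefix, len(paths)))
    return paths
```

Note that the notebook flushes per 1000 playlists rather than per fixed number of rows, so its files vary in size; the sketch counts records instead.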

## Treat tracks dataset

We now clean the tracks dataset, applying the same steps to every pickle file saved above.

In [None]:
all_files = glob.glob(os.path.join('../data/', '*_temp_*.pkl'))
for file in tqdm(all_files):
    
    # Filter
    df = pd.read_pickle(file)[[
        'added_at',
        'added_by',
        'is_local',
    #     'primary_color',
        'track',
    #     'video_thumbnail',
        'playlist_id',

    ]]

    # # Drop rows with NaN values
    # print('{} rows were dropped.'.format(len(df.drop(df.dropna().index))))
    # df.dropna(inplace=True)

    # Parse dates
    df.added_at = pd.to_datetime(df.added_at)

    # Expand added_by column
    df['added_by'] = df.added_by.apply(pd.Series).id

    # Expand track column
    df2 = df.track.apply(pd.Series).copy()
    df2 = df2[[
        'album',
        'artists',
        'available_markets',
        'disc_number',
        'duration_ms',
    #     'episode',
        'explicit',
    #     'external_ids',
    #     'external_urls',
    #     'href',
        'id',
    #     'is_local',
        'name',
        'popularity',
    #     'preview_url',
    #     'track',
        'track_number',
    #     'type',
    #     'uri',
    #     'linked_from'
    ]]
    df.drop(columns='track', inplace=True)
    df = df.join(df2)

    # Expand album column
    df2 = df.album.apply(pd.Series).copy()
    df2 = df2[[
        'album_type',
        'artists',
        'available_markets',
    #     'external_urls',
    #     'href',
        'id',
    #     'images',
        'name',
        'release_date',
    #     'release_date_precision',
    #     'total_tracks',
    #     'type',
    #     'uri'
    ]]
    df2.rename(columns={
        'artists': 'album_artists',
        'available_markets': 'album_available_markets',
        'id': 'album_id',
        'name': 'album_name',
        'release_date': 'album_release_date'
    }, inplace=True)
    df.drop(columns='album', inplace=True)
    df = df.join(df2)

    # Expand artists column
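    def try_id(d):
        # Collect artist ids; NaN when the cell is missing or malformed
        try:
            ids = [i['id'] for i in d if not pd.isna(i['id'])]
            if len(ids) > 0:
                return ids
        except Exception:
            pass
        return np.nan
    def try_name(d):
        # Collect artist names; NaN when the cell is missing or malformed
        try:
            names = [i['name'] for i in d if not pd.isna(i['name'])]
            if len(names) > 0:
                return names
        except Exception:
            pass
        return np.nan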
    df['artists_ids'] = df.artists.apply(try_id)
    df['artists_names'] = df.artists.apply(try_name)
    df.drop(columns='artists', inplace=True)

    # Expand album_artists column
    df['album_artists_ids'] = df.album_artists.apply(try_id)
    df['album_artists_names'] = df.album_artists.apply(try_name)
    df.drop(columns='album_artists', inplace=True)

    # # Drop rows with NaN values
    # print('{} rows were dropped.'.format(len(df.drop(df.dropna().index))))
    # df.dropna(inplace=True)
    
    path = file.split('_temp_')
    path = path[0] + '_ready_' + path[1]
    df.to_pickle(path)
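
The `try_id` and `try_name` helpers above differ only in the key they read. A single function can cover both, sketched here in plain Python. `extract_field` is a hypothetical name, and it returns `None` for missing data where the notebook uses `np.nan`.

```python
def extract_field(entries, field):
    """Return the non-missing values of `field`, or None if there are none."""
    if not isinstance(entries, list):
        return None  # e.g. NaN where the track itself was missing
    values = [
        entry[field]
        for entry in entries
        if isinstance(entry, dict) and entry.get(field) is not None
    ]
    return values or None
```

Inside the loop it could be used as `df.artists.apply(extract_field, field='id')`, since `Series.apply` forwards extra keyword arguments to the function.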

## Iterate through tracks to get their features

Tracks have audio features, such as `danceability`, which are important for later analysis. We collect the track ids and request the features for them:

In [None]:
all_files = glob.glob('../data/*_ready_*.pkl')
ids = []
for file in tqdm(all_files):
    ids.extend(list(pd.read_pickle(file).id))

In [None]:
ids = [track_id for track_id in ids if not pd.isna(track_id)]
ids = list(set(ids))

In [None]:
audio_features = []
for i in tqdm(range(0, len(ids), 100)):
    while True:
        try:
            q = sp.audio_features(ids[i:i+100])
        except requests.exceptions.ReadTimeout as e:
            print(e)
            time.sleep(10)
            continue
        break
    audio_features.extend(q)
#     if i % 10000 == 0:
#         time.sleep(6)
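
The loop above sends the ids in groups of 100, and the artists loop further down uses groups of 50. The slicing can be written once as a small generator; `batches` is a hypothetical helper name.

```python
def batches(values, size):
    """Yield consecutive slices of `values` holding at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]
```

For example, `for batch in batches(ids, 100): ...` visits the same slices as `ids[i:i+100]` in the cell above, with a shorter final batch when the length is not a multiple of 100.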

`sp.audio_features` can return `None` in place of a track's features, so we filter those entries out.

In [None]:
audio_features = [track for track in audio_features if track is not None]

In [None]:
audio_features = pd.DataFrame(audio_features, columns=[
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
#     'type',
    'id',
#     'uri',
#     'track_href',
#     'analysis_url',
    'duration_ms',
    'time_signature'
])

In [None]:
audio_features.sample(3)

## Write audio features dataset to file

In [None]:
audio_features.to_pickle('../data/sp_audio_features.pkl')

## Get artists

It's important to have artist data too, mainly because Spotify attaches genres to artists rather than to tracks.

In [None]:
all_files = glob.glob('../data/*_ready_*.pkl')
artists_ids = []
for file in tqdm(all_files):
    for item in pd.read_pickle(file).artists_ids:
        if isinstance(item, list):
            artists_ids.extend(item)

In [None]:
artists_ids = [artists_id for artists_id in artists_ids if not pd.isna(artists_id)]
artists_ids = list(set(artists_ids))

In [None]:
artists = []
for i in tqdm(range(0, len(artists_ids), 50)):
    while True:
        try:
            q = sp.artists(artists_ids[i:i+50])
        except requests.exceptions.ReadTimeout as e:
            print(e)
            time.sleep(10)
            continue
        break
    artists.extend(q['artists'])
#     if i % 10000 == 0:
#         time.sleep(6)

In [None]:
# Filter columns
artists_df = pd.DataFrame(artists, columns=[
#     'external_urls',
    'followers',
    'genres',
#     'href',
    'id',
#     'images',
    'name',
    'popularity',
#     'type',
#     'uri'
])

In [None]:
# Expand followers column
artists_df.followers = artists_df.followers.apply(lambda x: x['total'])

## Write artists dataset to file

In [None]:
artists_df.to_pickle('../data/sp_artists.pkl')
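
Since genres are the main reason for collecting artists, here is a rough sketch of how a track's genres could later be derived from its `artists_ids`. `genres_for_track` and the `genres_by_artist` lookup (artist id to list of genres) are hypothetical names, and the genre values in the usage line are made up.

```python
def genres_for_track(artist_ids, genres_by_artist):
    """Return the sorted union of the genres of a track's artists.

    `artist_ids` is expected to be a list of ids, or None when unknown.
    """
    genres = set()
    for artist_id in artist_ids or []:
        genres.update(genres_by_artist.get(artist_id, []))
    return sorted(genres)
```

A lookup such as `dict(zip(artists_df.id, artists_df.genres))` would supply `genres_by_artist`; for instance, `genres_for_track(['a1'], {'a1': ['rock']})` gives `['rock']`.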