# Estimated Spotify Plays

In this project I want to build a deep learning model that predicts how often I will listen to a song based on its audio features, which can be retrieved via the [Spotify Web API](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/).

## Data Preprocessing

Let's have a look at the dataset I have to start with.

In [1]:
!ls -alFh dataset/*.csv

-rw-r--r-- 1 moik moik 1,5M Sep 23 11:03 dataset/hoergewohnheiten.csv
-rw-r--r-- 1 moik moik  13M Sep 23 11:03 dataset/last_fm.csv


There are two CSV files that record when (UTC timestamp) I listened to a certain song (identified by the combination of song title and artist name). One file is an export from [LastFM](https://mainstream.ghan.nl/export.html); the other comes from a side project I started a while ago called [Hoergewohnheiten](https://github.com/mymindwentblvnk/hoergewohnheiten).

Let's see what the files look like.

In [2]:
!head -5 dataset/last_fm.csv

uts,utc_time,artist,artist_mbid,album,album_mbid,track,track_mbid
"1486670769","09 Feb 2017, 20:06","Shy Glizzy","21354007-a91f-4460-934d-12d389de3a2a","Homieland, Vol. 2","","Woah",""
"1486658377","09 Feb 2017, 16:39","Dre","9ccc0b15-6506-46ba-8bca-6e35d40ebfa7","Rich & Lit","","5 Rounds",""
"1486658166","09 Feb 2017, 16:36","Dre","9ccc0b15-6506-46ba-8bca-6e35d40ebfa7","Rich & Lit","","Fine Ass Girls",""
"1486507369","07 Feb 2017, 22:42","Cosima","dcf74737-7d5b-4847-acdb-9edec6a7cea1","To Build A House","","To Build A House",""


In [3]:
!head -5 dataset/hoergewohnheiten.csv

timestamp,title,artist,album
1530516395,"Ella, elle l'a - Remasterisé",France Gall,Babacar ( Remasterisé)
1530516323,Get Down,Junglepussy,Jp3
1530516172,State of the Union,Junglepussy,Jp3
1530516088,Jammin That Screw,Trae Tha Truth,48 Hours Later


It looks like the LastFM export has a unique identifier, but that does not help with the Hoergewohnheiten data. So I want to build the following data structure with the help of Paul Lamere's [spotipy](https://github.com/plamere/spotipy), where every row represents one track:
 

| tempo | valence | energy | ... | danceability | plays |
|-------|---------|--------|-----|--------------|-------|
| 98.30 | 0.523   | 0.993  | ... | 0.7350       | 12    |
| 132.4 | 0.24    | 0.451  | ... | 0.99002      | 130   |
| 78.0  | 0.9     | 0.56   | ... | 0.12502      | 2     |
| ...   | ...     | ...    | ... | ...          | ...   |

The following features are available (see [Audio Features Object](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)):

* danceability
* energy
* key
* loudness
* mode
* speechiness
* acousticness
* instrumentalness
* liveness
* valence
* tempo
* duration_ms
* time_signature

### Count plays per track

First I count the number of plays per artist name and track title.

In [4]:
import csv
from collections import defaultdict

splitter = '#*#*#*#*#*#*#'
play_data_dict = defaultdict(int)

with open('dataset/hoergewohnheiten.csv', 'r') as hoergewohnheiten_in:
    reader = csv.DictReader(hoergewohnheiten_in)
    for row in reader:
        temp_identifier = "{artist}{splitter}{title}".format(title=row['title'],
                                                             artist=row['artist'],
                                                             splitter=splitter)
        play_data_dict[temp_identifier] += 1
        
with open('dataset/last_fm.csv', 'r') as last_fm_in:
    reader = csv.DictReader(last_fm_in)
    for row in reader:
        temp_identifier = "{artist}{splitter}{title}".format(title=row['track'],
                                                             artist=row['artist'],
                                                             splitter=splitter)
        play_data_dict[temp_identifier] += 1

play_data = list(zip(
    list([k.split(splitter) for k in play_data_dict.keys()]), 
    list(play_data_dict.values())
))

print(len(play_data), "plays found.")
print(play_data[:10])

21501 plays found.
[(['France Gall', "Ella, elle l'a - Remasterisé"], 2), (['Junglepussy', 'Get Down'], 1), (['Junglepussy', 'State of the Union'], 1), (['Trae Tha Truth', 'Jammin That Screw'], 5), (['Faithless', 'Insomnia'], 9), (['Faithless', 'God Is a DJ - Radio Mix'], 2), (['DJ Bobo', 'Everybody'], 6), (['Robin S', 'Show Me Love'], 1), (['Ricky Martin', "La Copa de la Vida (La Cancion Oficial de la Copa Mundial, Francia '98) - Spanglish Radio Edit"], 1), (['Members Of Mayday', 'Sonic Empire - Short Mix'], 1)]
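
The splitter string works, but the same counting could also be sketched with tuple keys, which avoids joining and splitting identifiers. This is a minimal sketch, not the code used above: the `count_plays` helper and the in-memory sample rows (including their timestamps) are hypothetical and only mirror the column layout of *hoergewohnheiten.csv*.

```python
import csv
import io
from collections import Counter


def count_plays(csv_file, title_column):
    """Count rows per (artist, title) pair in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter((row['artist'], row[title_column]) for row in reader)


# Hypothetical sample rows in the Hoergewohnheiten column layout.
sample = io.StringIO(
    "timestamp,title,artist,album\n"
    "1530516323,Get Down,Junglepussy,Jp3\n"
    "1530516172,State of the Union,Junglepussy,Jp3\n"
    "1530516000,Get Down,Junglepussy,Jp3\n"
)

play_counts = count_plays(sample, title_column='title')
print(play_counts[('Junglepussy', 'Get Down')])
```

Counters from both files could then be combined with `+`, since `Counter` supports addition.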


### Get Spotify information from API

In [5]:
def SpotifyClient():
    from spotipy import Spotify
    import spotipy.util

    try:
        import spotify_settings
        user_name = spotify_settings.USER_NAME
        client_id = spotify_settings.CLIENT_ID
        client_secret = spotify_settings.CLIENT_SECRET
        redirect_uri = spotify_settings.REDIRECT_URI
    except ImportError:
        user_name = 'SET_THIS_YOURSELF'
        client_id = 'SET_THIS_YOURSELF'
        client_secret = 'SET_THIS_YOURSELF'
        redirect_uri = 'SET_THIS_YOURSELF'

    token = spotipy.util.prompt_for_user_token(
        user_name, redirect_uri=redirect_uri,
        client_id=client_id, client_secret=client_secret,
        scope='user-library-read'
    )
    return Spotify(auth=token)

#### Retrieve the Spotify id for every track via the search API

First I search for every _artist name track name_ combination to get the Spotify id of each track. The id is needed in the next step to fetch the audio features of every song I listened to.

I do not want to query the Spotify API every time, since this is a very time-intensive step, so I save the results in the subfolder *search_results*. I recreate the Spotify client every 500 requests so that it does not run into a timeout.
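
The caching idea can be sketched in isolation as follows. This is a simplified illustration under assumptions, not the notebook's exact logic: the helpers `cache_path`, `cached_search` and the stand-in `fake_search` are hypothetical names, and unlike the real cell this sketch also caches empty results.

```python
import hashlib
import json
import os
import tempfile


def cache_path(cache_dir, query):
    """Map a search query to a stable file name via its MD5 hex digest."""
    query_hash = hashlib.md5(query.encode()).hexdigest()
    return os.path.join(cache_dir, '{}.json'.format(query_hash))


def cached_search(cache_dir, query, search_function):
    """Return the cached result for query; call search_function only on a miss."""
    path = cache_path(cache_dir, query)
    if os.path.isfile(path):
        with open(path, 'r') as json_in:
            return json.load(json_in)
    result = search_function(query)
    with open(path, 'w') as json_out:
        json.dump(result, json_out)
    return result


# Stand-in for the real API call; it only records how often it is invoked.
calls = []


def fake_search(query):
    calls.append(query)
    return {'query': query}


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_search(cache_dir, 'Faithless Insomnia', fake_search)
    second = cached_search(cache_dir, 'Faithless Insomnia', fake_search)

print(len(calls))
```

The second lookup is served from disk, so the stand-in search function is only called once.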

In [6]:
!mkdir dataset/search_results

mkdir: cannot create directory 'dataset/search_results': File exists


In [None]:
import json
import os.path
import hashlib


spotify_client = SpotifyClient()

print("Retrieving track ids")

for index, play in enumerate(play_data, 1):
    print(index, "/", len(play_data), end='\r')
    
    if index % 500 == 0:
        spotify_client = SpotifyClient()  # Refresh client every n requests
        
    artist = play[0][0]  # identifiers were built as "{artist}{splitter}{title}"
    track = play[0][1]
    plays = play[1]
    
    query = '{} {}'.format(artist, track)
    query_hash = hashlib.md5(query.encode()).hexdigest()
    
    if not os.path.isfile('dataset/search_results/{}.json'.format(query_hash)):
        result = spotify_client.search(q=query, type='track', limit=1)

        if len(result['tracks']['items']) == 1:
            track_id = result['tracks']['items'][0]['id']
            query_result = {
                'id':  track_id,
                'track_data': result['tracks']['items'][0],
                'artist_data': result['tracks']['items'][0]['artists'],
                'plays': plays
            }

            with open('dataset/search_results/{}.json'.format(query_hash), 'w') as json_out:
                json.dump(query_result, json_out)

#### Fetch audio features per track

In the next step I can fetch data from the Spotify API in batches of 50 tracks. To do so, I created a generator that yields batches of 50 items from a given iterable.

First I have to create a list of all track ids from the JSON files in *search_results*.

In [8]:
from glob import glob


def batch_generator(iterable, size=50):
    iterable = list(iterable)
    l = len(iterable)
    for ndx in range(0, l, size):
        yield iterable[ndx:min(ndx + size, l)]
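
To illustrate what the generator yields, here is a small usage sketch. The function is restated in slightly simplified form so the snippet runs on its own; the toy input of seven numbers and the batch size of 3 are arbitrary choices for illustration.

```python
def batch_generator(iterable, size=50):
    """Yield consecutive lists of at most `size` items."""
    items = list(iterable)
    for start in range(0, len(items), size):
        # Slicing past the end of a list is safe, so no min() is needed here.
        yield items[start:start + size]


batches = list(batch_generator(range(7), size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch is simply shorter when the number of items is not a multiple of the batch size.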

Extract track ids from the search results

In [9]:
track_ids = []
for json_file in glob('dataset/search_results/*.json'):
    with open(json_file, 'r') as json_in:
        data = json.load(json_in)
        if 'id' in data:
            track_ids.append(data['id'])

and get all audio features for these tracks.

In [None]:
audio_features_per_track_id = dict()
track_id_batches = batch_generator(track_ids)

spotify_client = SpotifyClient()

for index, batch in enumerate(track_id_batches, 1):
    print("Retrieving audio features - Request", index, end='\r')
    audio_features = spotify_client.audio_features(tracks=batch)
    for feature in audio_features:
        if feature:
            track_id = feature['id']
            audio_features_per_track_id[track_id] = feature
        
import json
with open('dataset/audio_features.json', 'w') as json_out:
    json.dump(audio_features_per_track_id, json_out)

Retrieving audio features - Request 293

What have I done so far? I created data in two places:

* *dataset/search_results*, which holds multiple JSON files with information about a track (track id, track name, artist) and the number of times I played the track
* *dataset/audio_features.json*, which provides the Spotify audio features per track id.

#### Joining the data

As the last step of data preprocessing I have to join the number of plays with the audio features of each song.
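
One way such a join might look is sketched below. It is only an illustration under assumptions: `join_plays_with_features` and `FEATURE_NAMES` are hypothetical names, the feature list is shortened, and the two small dictionaries stand in for the contents of the search result files and *audio_features.json*.

```python
FEATURE_NAMES = ['tempo', 'valence', 'energy', 'danceability']  # subset for brevity


def join_plays_with_features(search_results, audio_features):
    """Build one row of feature values plus the play count per known track id."""
    rows = []
    for result in search_results:
        features = audio_features.get(result['id'])
        if features is None:
            continue  # no audio features were stored for this track id
        row = [features[name] for name in FEATURE_NAMES]
        row.append(result['plays'])
        rows.append(row)
    return rows


# Hypothetical stand-ins for the search result files and audio_features.json.
search_results = [
    {'id': 'track_a', 'plays': 12},
    {'id': 'track_b', 'plays': 2},
]
audio_features = {
    'track_a': {'tempo': 98.3, 'valence': 0.523,
                'energy': 0.993, 'danceability': 0.735},
}

rows = join_plays_with_features(search_results, audio_features)
print(rows)
```

Tracks without stored audio features are skipped here; whether to drop them or handle them differently is a design choice.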