# Project Pitch - Data Science

In [1]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from bs4 import BeautifulSoup
import requests
import string

## Spotify Million Playlist Dataset Challenge

![spotify](https://d3000t1r8yrm6n.cloudfront.net/uploads/ckeditor/pictures/56/image.png)

_Martin Bremm, Data Science #1024_

### Dataset Description

"The [dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge#dataset) contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. The evaluation task is automatic playlist continuation: given a seed playlist title and/or initial set of tracks in a playlist, to predict the subsequent tracks in that playlist. This is an open-ended challenge intended to encourage research in music recommendations, and no prizes will be awarded (other than bragging rights)."

Here’s an example of a typical playlist entry:


In [2]:
{
        "name": "musical",
        "collaborative": "false",
        "pid": 5,
        "modified_at": 1493424000,
        "num_albums": 7,
        "num_tracks": 12,
        "num_followers": 1,
        "num_edits": 2,
        "duration_ms": 2657366,
        "num_artists": 6,
        "tracks": [
            {
                "pos": 0,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:7vqa3sDmtEaVJ2gcvxtRID",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Finalement",
                "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                "duration_ms": 166264,
                "album_name": "Dancing Chords and Fireflies"
            },
            {
                "pos": 1,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:23EOmJivOZ88WJPUbIPjh6",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Betty",
                "album_uri": "spotify:album:3lUSlvjUoHNA8IkNTqURqd",
                "duration_ms": 235534,
                "album_name": "Endless Smile"
            },
            {
                "pos": 2,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:1vaffTCJxkyqeJY7zF9a55",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Some Beat in My Head",
                "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                "duration_ms": 268050,
                "album_name": "Dancing Chords and Fireflies"
            },
            # 8 tracks omitted
            {
                "pos": 11,
                "artist_name": "Mo' Horizons",
                "track_uri": "spotify:track:7iwx00eBzeSSSy6xfESyWN",
                "artist_uri": "spotify:artist:3tuX54dqgS8LsGUvNzgrpP",
                "track_name": "Fever 99\u00b0",
                "album_uri": "spotify:album:2Fg1t2tyOSGWkVYHlFfXVf",
                "duration_ms": 364320,
                "album_name": "Come Touch The Sun"
            }
        ],

    }

{'name': 'musical',
 'collaborative': 'false',
 'pid': 5,
 'modified_at': 1493424000,
 'num_albums': 7,
 'num_tracks': 12,
 'num_followers': 1,
 'num_edits': 2,
 'duration_ms': 2657366,
 'num_artists': 6,
 'tracks': [{'pos': 0,
   'artist_name': 'Degiheugi',
   'track_uri': 'spotify:track:7vqa3sDmtEaVJ2gcvxtRID',
   'artist_uri': 'spotify:artist:3V2paBXEoZIAhfZRJmo2jL',
   'track_name': 'Finalement',
   'album_uri': 'spotify:album:2KrRMJ9z7Xjoz1Az4O6UML',
   'duration_ms': 166264,
   'album_name': 'Dancing Chords and Fireflies'},
  {'pos': 1,
   'artist_name': 'Degiheugi',
   'track_uri': 'spotify:track:23EOmJivOZ88WJPUbIPjh6',
   'artist_uri': 'spotify:artist:3V2paBXEoZIAhfZRJmo2jL',
   'track_name': 'Betty',
   'album_uri': 'spotify:album:3lUSlvjUoHNA8IkNTqURqd',
   'duration_ms': 235534,
   'album_name': 'Endless Smile'},
  {'pos': 2,
   'artist_name': 'Degiheugi',
   'track_uri': 'spotify:track:1vaffTCJxkyqeJY7zF9a55',
   'artist_uri': 'spotify:artist:3V2paBXEoZIAhfZRJmo2jL',
   't

### Goal 

**Goal** = Develop a system for `automatic playlist creation` based on a text input.
Given a text prompt, the system should generate a list of recommended tracks that can be added to a playlist fitting the mood of the prompt.

### Required Format

**Input**

- A user-created playlist, represented by:
    - *Playlist metadata* (see the dataset README)
    - K *seed tracks*: a list of K tracks in the playlist, where K can equal 0, 1, 5, 10, 25, or 100.

**Output**

- A list of 500 recommended candidate tracks, ordered by relevance in decreasing order.
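
A minimal sketch of how a scored candidate pool could be turned into the required output list. The function name `top_recommendations`, the score dictionary, and the choice to drop seed tracks are assumptions made for illustration; they are not part of the challenge code.

```python
def top_recommendations(scores, seed_uris=(), n=500):
    """Return up to n track URIs sorted by descending relevance score.

    scores: dict mapping track URI -> relevance score (hypothetical model output).
    seed_uris: tracks already in the playlist; excluded here by assumption.
    """
    seeds = set(seed_uris)
    candidates = [(uri, s) for uri, s in scores.items() if uri not in seeds]
    # Sort by score descending; ties are broken by URI for a deterministic order
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [uri for uri, _ in candidates[:n]]


# Toy scores for three track URIs taken from the example playlist above
scores = {
    "spotify:track:7vqa3sDmtEaVJ2gcvxtRID": 0.2,
    "spotify:track:23EOmJivOZ88WJPUbIPjh6": 0.9,
    "spotify:track:1vaffTCJxkyqeJY7zF9a55": 0.5,
}
ranked = top_recommendations(
    scores, seed_uris=["spotify:track:7vqa3sDmtEaVJ2gcvxtRID"]
)
print(ranked)
```

Because the scores live in a dictionary keyed by URI, the returned list cannot contain duplicates.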

### Look into Dataset

In [3]:
import os, json

json_path = "spotify_playlist_continuation/spotify_million_playlist_dataset/data"
json_files = os.listdir(json_path)

# Collect the first playlist from each of the first 100 slice files
json_dicts_list = []

for js in json_files[:100]:
    with open(os.path.join(json_path, js)) as json_file:
        json_text = json.load(json_file)["playlists"][0]
        json_dicts_list.append(json_text)

In [4]:
import pandas as pd

df = pd.DataFrame.from_dict(json_dicts_list)
df.head()

| | name | collaborative | pid | modified_at | num_tracks | num_albums | num_followers | tracks | num_edits | duration_ms | num_artists | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bob Dylan | False | 549000 | 1454803200 | 75 | 65 | 1 | [{'pos': 0, 'artist_name': 'Bob Dylan', 'track... | 28 | 18425368 | 39 | |
| 1 | Ninja Sex Party | False | 613000 | 1509062400 | 48 | 9 | 1 | [{'pos': 0, 'artist_name': 'Ninja Sex Party', ... | 4 | 8935563 | 3 | |
| 2 | litty titty | False | 115000 | 1508371200 | 212 | 126 | 1 | [{'pos': 0, 'artist_name': 'Travis Scott', 'tr... | 11 | 51022342 | 75 | |
| 3 | Pt. 2 | False | 778000 | 1500422400 | 32 | 25 | 3 | [{'pos': 0, 'artist_name': 'Rae Sremmurd', 'tr... | 16 | 7184035 | 18 | |
| 4 | newness | False | 290000 | 1509408000 | 69 | 65 | 2 | [{'pos': 0, 'artist_name': 'Lorde', 'track_uri... | 24 | 15162070 | 61 | |


### Unstacking the "tracks" column

In [5]:
# first 3 tracks of first playlist
df["tracks"][0][:3]

[{'pos': 0,
  'artist_name': 'Bob Dylan',
  'track_uri': 'spotify:track:6QHYEZlm9wyfXfEM1vSu1P',
  'artist_uri': 'spotify:artist:74ASZWbe4lXaubB36ztrGX',
  'track_name': 'Boots of Spanish Leather',
  'album_uri': 'spotify:album:7DZeLXvr9eTVpyI1OlqtcS',
  'duration_ms': 277106,
  'album_name': "The Times They Are A-Changin'"},
 {'pos': 1,
  'artist_name': 'Bob Dylan',
  'track_uri': 'spotify:track:3RkQ3UwOyPqpIiIvGVewuU',
  'artist_uri': 'spotify:artist:74ASZWbe4lXaubB36ztrGX',
  'track_name': 'Mr. Tambourine Man',
  'album_uri': 'spotify:album:1lPoRKSgZHQAYXxzBsOQ7v',
  'duration_ms': 330533,
  'album_name': 'Bringing It All Back Home'},
 {'pos': 2,
  'artist_name': 'Loggins & Messina',
  'track_uri': 'spotify:track:0ju1jP0cSPJ8tmojYBEI89',
  'artist_uri': 'spotify:artist:7emRV8AluG3d4e5T0DZiK9',
  'track_name': "Danny's Song",
  'album_uri': 'spotify:album:5BWgJaesMjpJWCTU9sgUPf',
  'duration_ms': 254653,
  'album_name': "The Best: Loggins & Messina Sittin' In Again"}]
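
One way the nested `tracks` column could be unstacked into one row per track is sketched below. The helper name `flatten_tracks` and the two-playlist sample are made up for illustration; the sample only mirrors the structure of the dataset and keeps a few of the track fields.

```python
def flatten_tracks(playlists):
    """Yield one flat dict per track, tagged with its playlist's pid and name."""
    for playlist in playlists:
        for track in playlist["tracks"]:
            row = {"pid": playlist["pid"], "playlist_name": playlist["name"]}
            row.update(track)
            yield row


# Tiny sample with the same structure as the dataset (track fields abbreviated)
sample = [
    {"name": "Bob Dylan", "pid": 549000, "tracks": [
        {"pos": 0, "artist_name": "Bob Dylan",
         "track_name": "Boots of Spanish Leather"},
        {"pos": 1, "artist_name": "Bob Dylan",
         "track_name": "Mr. Tambourine Man"},
    ]},
    {"name": "Iconic", "pid": 1000102, "tracks": []},
]

rows = list(flatten_tracks(sample))
for row in rows:
    print(row["pid"], row["pos"], row["track_name"])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get a track-level table. Playlists without tracks simply contribute no rows.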

### Potential Issues with the Dataset

As part of the challenge, the organizers release a **separate challenge dataset ("test set")** that consists of 10,000 playlists with **incomplete information**. It has many of the same data fields and follows the same structure as the Million Playlist Dataset ("training set"), but the playlists may have incomplete metadata (no title) and include only K tracks. More specifically, the challenge dataset is divided into 10 scenarios, with 1000 examples of each scenario:

1. Title only (no tracks)
2. Title and first track
3. Title and first 5 tracks
4. First 5 tracks only
5. Title and first 10 tracks
6. First 10 tracks only
7. Title and first 25 tracks
8. Title and 25 random tracks
9. Title and first 100 tracks
10. Title and 100 random tracks


Some challenge playlists come without any seed tracks (scenario 1, title only). Example:

In [6]:
challenge_path = "spotify_playlist_continuation/spotify_million_playlist_dataset_challenge"
challenge_file = "challenge_set.json"


with open(os.path.join(challenge_path, challenge_file)) as json_file:
    json_text = json.load(json_file)

json_text["playlists"][45]

{'name': 'Iconic',
 'num_holdouts': 34,
 'pid': 1000102,
 'num_tracks': 34,
 'tracks': [],
 'num_samples': 0}
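
To see how the challenge playlists spread across these scenarios, they could be tallied by title presence and number of seed tracks. This is only a sketch: `scenario_key` is a hypothetical helper, the three sample entries are invented stand-ins for `json_text["playlists"]`, and it assumes that an untitled playlist has a missing or empty `name` field.

```python
from collections import Counter


def scenario_key(playlist):
    """Describe a challenge playlist by title presence and number of seed tracks."""
    has_title = bool(playlist.get("name"))
    return (has_title, playlist["num_samples"])


# Hypothetical stand-ins for entries of json_text["playlists"]
sample = [
    {"name": "Iconic", "num_samples": 0, "tracks": []},
    {"name": "Roadtrip", "num_samples": 1, "tracks": [{"pos": 0}]},
    {"num_samples": 5, "tracks": [{"pos": i} for i in range(5)]},
]

counts = Counter(scenario_key(p) for p in sample)
for (has_title, k), n in sorted(counts.items()):
    print(f"title={has_title} seed_tracks={k}: {n} playlist(s)")
```

This key does not separate "first K tracks" from "K random tracks"; telling those apart would additionally require inspecting the `pos` values of the seed tracks.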

### Pulling Song Info from the Spotify API

In [7]:
# !pip install spotipy

In [8]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

After creating a [new app](https://developer.spotify.com/dashboard/applications/be9d854c547a4350add0a9da00a84892) in the Spotify developer dashboard, you receive a client ID and a client secret:


In [9]:
from config import spotify_cid, spotify_secret_key

client_credentials_manager = SpotifyClientCredentials(client_id=spotify_cid, client_secret=spotify_secret_key)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


In [10]:
track = df["tracks"][3][29]
track

{'pos': 29,
 'artist_name': 'Playboi Carti',
 'track_uri': 'spotify:track:1e1JKLEDKP7hEQzJfNAgPl',
 'artist_uri': 'spotify:artist:699OTQXzgjhIYAHMy9RyPD',
 'track_name': 'Magnolia',
 'album_uri': 'spotify:album:4rJgzzfFHAVFhCSt2P4I3j',
 'duration_ms': 181812,
 'album_name': 'Playboi Carti'}

In [11]:
sp.audio_features(track["track_uri"])

[{'danceability': 0.791,
  'energy': 0.582,
  'key': 11,
  'loudness': -7.323,
  'mode': 0,
  'speechiness': 0.286,
  'acousticness': 0.0114,
  'instrumentalness': 0,
  'liveness': 0.35,
  'valence': 0.443,
  'tempo': 162.991,
  'type': 'audio_features',
  'id': '1e1JKLEDKP7hEQzJfNAgPl',
  'uri': 'spotify:track:1e1JKLEDKP7hEQzJfNAgPl',
  'track_href': 'https://api.spotify.com/v1/tracks/1e1JKLEDKP7hEQzJfNAgPl',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1e1JKLEDKP7hEQzJfNAgPl',
  'duration_ms': 181812,
  'time_signature': 4}]

### Missing Features?

![Spotify_API_Song_Features.png](attachment:Spotify_API_Song_Features.png)

The Spotify API exposes only **audio features**. What if we also applied an NLP model (e.g., built with NLTK) to analyse the overall **lyrical content (e.g., lyrical valence)** and improve the playlist predictions?

[Linking Spotify API with Genius.com lyrics](https://medium.com/swlh/how-to-leverage-spotify-api-genius-lyrics-for-data-science-tasks-in-python-c36cdfb55cf3)

### Research Objective

Could individual song features of the songs in a playlist (e.g., danceability...) help with playlist continuation?
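
As a rough illustration of the idea, candidate tracks could be ranked by how close their audio features lie to the average of a playlist's seed tracks. This is a sketch under stated assumptions, not the planned model: the feature subset, the function names, and all values except the Magnolia features shown above are invented.

```python
import math

# Assumed subset of the audio features returned by the Spotify API
FEATURES = ("danceability", "energy", "valence")


def centroid(tracks):
    """Average each audio feature over the seed tracks."""
    return [sum(t[f] for t in tracks) / len(tracks) for f in FEATURES]


def rank_by_distance(seed_tracks, candidates):
    """Return candidate ids ordered from closest to farthest from the centroid."""
    center = centroid(seed_tracks)

    def distance(track):
        return math.dist(center, [track[f] for f in FEATURES])

    return [t["id"] for t in sorted(candidates, key=distance)]


# First seed uses the Magnolia values shown above; everything else is invented
seeds = [
    {"id": "1e1JKLEDKP7hEQzJfNAgPl",
     "danceability": 0.791, "energy": 0.582, "valence": 0.443},
    {"id": "seed_b", "danceability": 0.700, "energy": 0.600, "valence": 0.500},
]
candidates = [
    {"id": "cand_far", "danceability": 0.200, "energy": 0.100, "valence": 0.900},
    {"id": "cand_near", "danceability": 0.750, "energy": 0.590, "valence": 0.470},
]
print(rank_by_distance(seeds, candidates))
```

Euclidean distance works here because the three chosen features all lie between 0 and 1; features on other scales, such as tempo or loudness, would need normalising first.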

### Potential Methods

- Recurrent Neural Networks (RNN)
- Ranking of feature importance

https://arxiv.org/abs/1810.01520

"In summary, most approaches ensemble the results obtained by several well-known methods, including matrix factorization models, neighborhood-based collaborative filtering models, basic information retrieval techniques, and learning to rank models. The results show that the models work best when a sufficient number of tracks per playlist is provided and they are randomly selected from the playlist (as opposed to the sequential order from the beginning of the playlist). The submitted solutions could not effectively use playlist titles for APC. This might be due to the sparseness of the titles as well as the scale of the training data. <font color='blue'>**In addition, none of the submitted solutions tried to infer the user intents from the playlist titles**</font>. The results also demonstrate that the performance of different models are close to each other when few tracks per playlist are given." 

### Scraping lyrics and other song info from Genius.com

![genius_transparent.png](spotify_pictures/genius_transparent.png)

In [14]:
from config import genius_cat
import requests

def get_song_info():
    """Search Genius for a user-supplied query and return the first hit."""
    base_url = 'https://api.genius.com'
    user_input = input('artist and song: ').replace(" ", "-")

    endpoint = 'search/'
    request_uri = '/'.join([base_url, endpoint])

    params = {'q': user_input}
    token = 'Bearer {}'.format(genius_cat)
    headers = {'Authorization': token}

    response = requests.get(request_uri, params=params, headers=headers)
    response.raise_for_status()  # fail early on HTTP errors (e.g. a bad token)
    return response.json()["response"]["hits"][0]["result"]

In [15]:
# The function has its own name so that this assignment does not overwrite it
song_info = get_song_info()
song_info

artist and song: September


{'annotation_count': 4,
 'api_path': '/songs/476552',
 'artist_names': 'Earth, Wind & Fire',
 'full_title': 'September by\xa0Earth,\xa0Wind & Fire',
 'header_image_thumbnail_url': 'https://images.genius.com/b680e9bcd6301c41196eb7f59f8326ab.300x300x1.png',
 'header_image_url': 'https://images.genius.com/b680e9bcd6301c41196eb7f59f8326ab.1000x1000x1.png',
 'id': 476552,
 'language': 'en',
 'lyrics_owner_id': 702235,
 'lyrics_state': 'complete',
 'path': '/Earth-wind-and-fire-september-lyrics',
 'pyongs_count': 111,
 'relationships_index_url': 'https://genius.com/Earth-wind-and-fire-september-sample',
 'release_date_components': {'year': 1978, 'month': 11, 'day': 18},
 'release_date_for_display': 'November 18, 1978',
 'release_date_with_abbreviated_month_for_display': 'Nov. 18, 1978',
 'song_art_image_thumbnail_url': 'https://images.genius.com/b680e9bcd6301c41196eb7f59f8326ab.300x300x1.png',
 'song_art_image_url': 'https://images.genius.com/b680e9bcd6301c41196eb7f59f8326ab.1000x1000x1.png'

In [46]:
def song_annotations(song_id=None):
    """Print and return the plain-text annotations Genius stores for a song."""
    if not song_id:
        raise ValueError("Please provide a song_id")

    request_uri = "https://api.genius.com/referents"
    params = {"song_id": song_id, "text_format": "plain"}
    headers = {'Authorization': 'Bearer {}'.format(genius_cat)}

    response = requests.get(request_uri, params=params, headers=headers)
    response.raise_for_status()  # surfaces invalid ids or tokens as HTTP errors
    json_data = response.json()

    annotations = []
    for referent in json_data["response"]["referents"]:
        # Each referent carries a list of annotations; keep the first one
        body = referent["annotations"][0]["body"]["plain"]
        annotations.append(body)
        print(body)

    return annotations

In [48]:
song_annotations(song_info["id"])

The theme of memory, prevalent in this song, is reminiscent of the poem “September” by Helen Hunt Jackson, an American poet of the 19th century. Published in 1892 in Poems, she writes in the final lines:


But none of all this beauty
Which floods the earth and air
Is unto me the secret
 Which makes September fair.

’T is a thing which I remember;
To name it thrills me yet:
One day of one September
 I never can forget.
In an interview with SongFacts, Allee Willis (one of the songwriters of “September”) gave a personal story about how she initially didn’t like this three-syllable series of classic phonetic compilation:

I absolutely could not deal with lyrics that were nonsensical, or lines that weren’t complete sentences. And I’m exceedingly happy that I lost that attitude. I went, “You cannot leave bada-ya in the chorus, that has to mean something.” [Maurice White] said, “No, that feels great. That’s what people are going to remember. We’re leaving it.” We did try other stuff, and it a

['The theme of memory, prevalent in this song, is reminiscent of the poem “September” by Helen Hunt Jackson, an American poet of the 19th century. Published in 1892 in Poems, she writes in the final lines:\n\n\nBut none of all this beauty\nWhich floods the earth and air\nIs unto me the secret\n Which makes September fair.\n\n’T is a thing which I remember;\nTo name it thrills me yet:\nOne day of one September\n I never can forget.',
 'In an interview with SongFacts, Allee Willis (one of the songwriters of “September”) gave a personal story about how she initially didn’t like this three-syllable series of classic phonetic compilation:\n\nI absolutely could not deal with lyrics that were nonsensical, or lines that weren’t complete sentences. And I’m exceedingly happy that I lost that attitude. I went, “You cannot leave bada-ya in the chorus, that has to mean something.” [Maurice White] said, “No, that feels great. That’s what people are going to remember. We’re leaving it.” We did try ot

Is it illegal to scrape these lyrics? [No](https://www.theverge.com/2020/8/11/21363692/google-genius-lyrics-lawsuit-scraping-copyright-yelp-antitrust-competition)

### What could we do with this info?

Analyse the lyrics or annotations with an advanced, pre-trained **transformer NLP model**, such as [BERT](https://huggingface.co/docs/transformers/model_doc/bert) or even [BART](https://huggingface.co/docs/transformers/model_doc/bart).

### Project Overview

![spotify_project.png](spotify_pictures/spotify_project.png)

### Tasks & Delegations

1. NLP feature extraction from song annotations
 - Data cleaning
 - Sentiment analysis/ text comprehension
 - Feature engineering
 - Putting it all in one DataFrame
2. KNN cluster analysis
 - Analysis of generated features
 - Dimensionality reduction & Data visualization
 - Checking for closest k-neighbors
3. Deployment
 - mlflow
 - Model comparison & database updates
4. Overview & Time-Management
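
For the data-cleaning step in task 1, a first pass over the annotation texts might look like the sketch below. The function name `clean_annotation` and the exact cleaning rules are assumptions; the sample string is adapted from the "September" annotation shown earlier.

```python
import re
import string


def clean_annotation(text):
    """Lowercase, strip punctuation (ASCII and curly quotes), collapse whitespace."""
    punctuation = string.punctuation + "“”‘’"
    text = text.lower().translate(str.maketrans("", "", punctuation))
    return re.sub(r"\s+", " ", text).strip()


raw = "I went, “You cannot leave bada-ya in the chorus,\n\nthat has to mean something.”"
print(clean_annotation(raw))
```

Stripping punctuation this aggressively suits bag-of-words style features. A pre-trained transformer tokenizer would likely want the original casing and punctuation kept, so the rules would need adjusting to the chosen model.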

### References

1. https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge#dataset
2. https://developer.spotify.com/documentation/web-api/
3. https://medium.com/swlh/how-to-leverage-spotify-api-genius-lyrics-for-data-science-tasks-in-python-c36cdfb55cf3
4. https://docs.genius.com/
5. https://huggingface.co/docs/transformers/model_doc/bert
6. https://huggingface.co/docs/transformers/model_doc/bart

Run this on the command line from the notebook's directory:

jupyter nbconvert index.ipynb --to slides --post serve \
--no-prompt \
--TagRemovePreprocessor.remove_input_tags=remove_input \
--TagRemovePreprocessor.remove_all_outputs_tags=remove_output \
--SlidesExporter.reveal_scroll=True \
--stdout > index.html