# Data Collection

The data used in this project is taken from the ["Spotify Million Playlist Dataset Challenge"](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), a continuation of a data science research challenge on music recommendation organized by Spotify (see [RecSys Challenge 2018](http://www.recsyschallenge.com/2018/)).

* Another music recommendation challenge that we considered as a basis for our work is <https://www.kaggle.com/c/msdchallenge/overview>. However, due to its age (2012), smaller scale and rigid data formats, we preferred the Spotify dataset.

The project's data consists of:
1. spotify_million_playlist_dataset (the challenge dataset)
2. songs_dataset
3. audio_features_dataset
4. lyrics_corpus

## spotify_million_playlist_dataset

### The raw challenge dataset, downloaded from <https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files>.

* 1 million playlists containing over 2 million unique tracks by nearly 300,000 artists, created by US Spotify users between January 2010 and November 2017. The dataset is split into multiple JSON files, each containing 1,000 playlists. **Used for both the training set and the test set.**

In [1]:
import json

# Show format of one playlist and one track
with open('data/spotify_million_playlist_dataset/mpd.slice.0-999.json') as f:
    ex_playlist = json.load(f)['playlists'][0]
    ex_playlist['tracks'] = [ex_playlist['tracks'][0]]

ex_playlist

{'name': 'Throwbacks',
 'collaborative': 'false',
 'pid': 0,
 'modified_at': 1493424000,
 'num_tracks': 52,
 'num_albums': 47,
 'num_followers': 1,
 'tracks': [{'pos': 0,
   'artist_name': 'Missy Elliott',
   'track_uri': 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI',
   'artist_uri': 'spotify:artist:2wIVse2owClT7go1WT98tk',
   'track_name': 'Lose Control (feat. Ciara & Fat Man Scoop)',
   'album_uri': 'spotify:album:6vV5UrXcfyQD1wu4Qo2I9K',
   'duration_ms': 226863,
   'album_name': 'The Cookbook'}],
 'num_edits': 6,
 'duration_ms': 11532414,
 'num_artists': 37}
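
A few of the playlist-level fields above are easier to read after conversion. The snippet below is a minimal sketch, not part of the original pipeline; it assumes that `modified_at` is a Unix timestamp in seconds (UTC) and that `collaborative` is stored as the string `'true'` or `'false'`, as the example output suggests.

```python
from datetime import datetime, timezone

# Playlist-level fields copied from the example output above
playlist = {
    'collaborative': 'false',
    'modified_at': 1493424000,
    'num_tracks': 52,
    'duration_ms': 11532414,
}

# Assumption: modified_at is seconds since the Unix epoch, in UTC
modified = datetime.fromtimestamp(playlist['modified_at'], tz=timezone.utc)

# Assumption: the flag is serialized as the string 'true' or 'false'
is_collaborative = playlist['collaborative'] == 'true'

# Total playlist duration and average track length, in minutes
total_minutes = playlist['duration_ms'] / 60000
avg_track_minutes = total_minutes / playlist['num_tracks']

print(modified.date(), is_collaborative,
      round(total_minutes, 1), round(avg_track_minutes, 2))
```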

## songs_dataset.json
### All songs from the playlists dataset, collected with the following code:

In [None]:
import json
import os

all_songs = {}
spotify_dataset_path = 'data/spotify_million_playlist_dataset/'

# Add all songs from one Spotify slice file (from the dataset) to the all_songs dict
def add_all_songs_from_file(path):
    with open(path) as f:
        data = json.load(f)
        
    for playlist in data['playlists']:
        for track in playlist['tracks']:
            track_id = track['track_uri'].partition('spotify:track:')[-1]
            artist_id = track['artist_uri'].partition('spotify:artist:')[-1]
            if track_id not in all_songs:
                all_songs[track_id] = {
                    'track_name': track['track_name'], 
                    'artist_name': track['artist_name'], 
                    'artist_id': artist_id
                }

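for slice_file in sorted(os.listdir(spotify_dataset_path)):
    # Skip anything that is not a JSON slice (e.g. README or hidden files)
    if slice_file.endswith('.json'):
        add_all_songs_from_file(os.path.join(spotify_dataset_path, slice_file))

with open('data/all_songs.json', 'w') as f:
    json.dump(all_songs, f)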

In [4]:
import pandas as pd

# Demonstration of the file's format
all_songs_df = pd.read_json('data/all_songs.json').T
all_songs_df.head()

| | track_name | artist_name | artist_id |
|---|---|---|---|
| 0UaMYEvWZi0ZqiDOoHU3YI | Lose Control (feat. Ciara & Fat Man Scoop) | Missy Elliott | 2wIVse2owClT7go1WT98tk |
| 6I9VzXrHxO9rA9A5euc8Ak | Toxic | Britney Spears | 26dSoYclwsYLMAKD3tpOr4 |
| 0WqIKmW4BTrj3eJFmnCKMv | Crazy In Love | Beyoncé | 6vWDO969PvNqNYHIOW5v0m |
| 1AWQoqb9bSvzTjaLralEkT | Rock Your Body | Justin Timberlake | 31TPClRtHm23RisEBtV3X7 |
| 1lzr43nnXAijIGYnCT8M8H | It Wasn't Me | Shaggy | 5EvFsr3kj42KNv97ZEnqij |


## audio_features_dataset.json

### A collection of audio features for the tracks in 'songs_dataset'.
### Retrieved from the Spotify public API (<https://api.spotify.com/>).

In [None]:
# TODO
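
The retrieval code for this dataset is not shown here. As a rough sketch only, the following shows how track IDs from 'songs_dataset' could be batched against the Spotify Web API's audio-features endpoint, which accepts up to 100 comma-separated IDs per request. The helper names (`chunked`, `build_request`, `fetch_audio_features`) are hypothetical, and a valid OAuth access token is assumed to be available.

```python
import json
import urllib.request

AUDIO_FEATURES_URL = 'https://api.spotify.com/v1/audio-features'
MAX_IDS_PER_REQUEST = 100  # the endpoint accepts at most 100 IDs per call


def chunked(ids, size=MAX_IDS_PER_REQUEST):
    """Yield consecutive batches of at most `size` IDs (hypothetical helper)."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def build_request(track_ids, access_token):
    """Build an authorized GET request for one batch of track IDs."""
    url = f"{AUDIO_FEATURES_URL}?ids={','.join(track_ids)}"
    return urllib.request.Request(
        url, headers={'Authorization': f'Bearer {access_token}'})


def fetch_audio_features(track_ids, access_token):
    """Return {track_id: features}; assumes the token is valid and unexpired."""
    features = {}
    for batch in chunked(track_ids):
        with urllib.request.urlopen(build_request(batch, access_token)) as resp:
            payload = json.load(resp)
        for entry in payload['audio_features']:
            if entry:  # the API returns null for IDs it cannot resolve
                features[entry['id']] = entry
    return features
```

With the keys of `all_songs` as input, the result could then be written to `audio_features_dataset.json` with `json.dump`. Rate limiting and token refresh are left out of this sketch.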

## lyrics_corpus.json
### A collection of lyrics for many of the songs in the playlists dataset, scraped from the Genius Lyrics site and public API (<https://genius.com/>).

* [Note: The full code (lyrics_list_builder.py) also searches for missing URLs to obtain as many lyrics as possible.]

In [None]:
import aiohttp
import asyncio
import time
import json
from lyrics_scraper import url, lyrics # based on code from https://github.com/johnwmillr/LyricsGenius
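import unicodedata
import re
import sys

# The selector event loop policy only exists (and is only needed) on Windows
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())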

all_urls = []
all_require_search = []
all_lyrics = {}

# Build Genius URLs dictionary

print('Building all_urls list...')
start_time = time.time()

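def parse_name(name):
    # Strip accents and any other non-ASCII characters
    s = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # Keep only the text before the first bracket or hyphen, then build a URL slug
    s = re.search(r'([^()\[\]-]*)', s).group(1).strip().replace(' ', '-').replace('&', 'and')
    return re.sub(r'[^a-zA-Z0-9_\-]', '', s)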

with open('data/songs_dataset.json', 'r') as songs_file:
    with open('data/lyrics1-100000.json', 'r') as lyrics_file:
        all_songs = json.load(songs_file)
        all_lyrics = json.load(lyrics_file)
        assert isinstance(all_lyrics, dict)
        assert isinstance(all_songs, dict)
        counter = 0
        for track_id, track_data in all_songs.items():
            # Limit number of songs
            if counter >= 100000:
                break
            counter += 1
            # Don't fetch lyrics we already have
            if all_lyrics.get(track_id):
                continue
            parsed_track_name = parse_name(track_data['track_name'])
            parsed_artist_name = parse_name(track_data['artist_name'])
            if parsed_artist_name and parsed_track_name:
                all_urls.append((track_id, track_data, 
                    f'https://genius.com/{parsed_artist_name}-{parsed_track_name}-lyrics'))
                
            else:
                all_require_search.append((track_id, track_data))
            

print(f'len(all_urls) equals {len(all_urls)}')
print("--- URLs list building took %s seconds ---" % (time.time() - start_time))

# Build lyrics list with asynchronous HTTP requests to genius.com

async def get_lyrics(session, url, track_id, track_name):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            if resp.status == 200:
                lyrics_html = await resp.text()
                return (track_id, track_name + '\n' + lyrics(lyrics_html, True))
            # A 404 only means the guessed URL does not exist; report anything else
            if resp.status != 404:
                print(f'Received status {resp.status} for {url}')
            return (track_id, None)
    except Exception:
        # Timeouts and connection errors are treated as missing lyrics
        return (track_id, None)

songs_lyrics_list = []

async def add_to_lyrics_list(urls_list, songs_offsets=(0, None)):
    """ Try to retrieve lyrics from given URLs """
    global songs_lyrics_list

    async with aiohttp.ClientSession() as session:
        tasks = []

        print(f'Retrieving lyrics of songs {songs_offsets[0]}:{songs_offsets[1]}...')
        for track_id, track_data, url in urls_list[songs_offsets[0]:songs_offsets[1]]:
            tasks.append(asyncio.ensure_future(get_lyrics(session, url, track_id, track_data['track_name'])))

        songs_lyrics_list += await asyncio.gather(*tasks)
        

total_songs_num = len(all_urls)
songs_at_each_interval = 200

start_time = time.time()
print('Retrieving lyrics from URLs found in all_urls...')
for i in range(0, total_songs_num, songs_at_each_interval):
    asyncio.run(add_to_lyrics_list(all_urls, (i, i + songs_at_each_interval)))
    time.sleep(0.2)
end_time = time.time()
print("--- Lyrics retrieval took %s seconds ---" % (end_time - start_time))

for track_id, track_lyrics in songs_lyrics_list:
    if not track_lyrics:
        all_require_search.append((track_id, all_songs[track_id]))

songs_lyrics_list = [(track_id, lyrics) for (track_id, lyrics) in songs_lyrics_list if lyrics]
print(f'Retrieved lyrics of {len(songs_lyrics_list)} songs')

start_time = time.time()
file_path = 'data/lyrics_corpus.json'
# songs_lyrics_list now holds only (track_id, lyrics) pairs with non-empty lyrics
all_lyrics.update(dict(songs_lyrics_list))
with open(file_path, 'w') as f:
    json.dump(all_lyrics, f)
print(f"Added lyrics for {len(songs_lyrics_list)} songs to {file_path}")
print("--- Lyrics file writing took %s seconds ---" % (time.time() - start_time))

* Some tracks are instrumental, and for others the lyrics were unavailable. The lyrics_corpus only contains entries for songs whose lyrics exist and could be obtained.
* Out of the 100,000 tracks that we use, we managed to obtain lyrics for about 75%.
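
To make the URL-guessing step of the scraper concrete, here is a small standalone sketch. `parse_name` follows the function in the cell above; `genius_url` is a hypothetical helper added only for illustration, and whether a guessed URL actually resolves on Genius is not checked here.

```python
import re
import unicodedata


def parse_name(name):
    # Strip accents and any other non-ASCII characters
    s = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # Keep only the text before the first bracket or hyphen, then build a slug
    s = re.search(r'([^()\[\]-]*)', s).group(1).strip().replace(' ', '-').replace('&', 'and')
    return re.sub(r'[^a-zA-Z0-9_\-]', '', s)


def genius_url(artist_name, track_name):
    """Hypothetical helper: return a guessed Genius URL, or None if a slug is empty."""
    artist, track = parse_name(artist_name), parse_name(track_name)
    if artist and track:
        return f'https://genius.com/{artist}-{track}-lyrics'
    return None


print(genius_url('Missy Elliott', 'Lose Control (feat. Ciara & Fat Man Scoop)'))
print(genius_url('Beyoncé', 'Crazy In Love'))
print(genius_url('Shaggy', '(Intro)'))
```

The first two calls produce slugs such as `Missy-Elliott-Lose-Control` and `Beyonce-Crazy-In-Love`, while a title that starts with a bracket yields an empty slug and therefore `None`, which corresponds to the tracks placed in `all_require_search`. Note that everything after the first hyphen is dropped as well, so a name like `Jay-Z` is reduced to `Jay`.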

In [1]:
import json

with open('data/lyrics_corpus.json', 'r') as lyrics_file:
    all_lyrics = json.load(lyrics_file)

Before we start analyzing the lyrics, we might need to remove outliers. For our purposes, outliers are songs whose lyrics are too short (which probably indicates that they were not scraped properly or that the track is purely instrumental) or too long (which might bias the analysis).

In [31]:
import plotly.express as px
import pandas as pd

all_lyrics_series = pd.Series(all_lyrics)

fig = px.histogram([len(x) for x in all_lyrics_series.values], labels={'value':'lyrics length'})
fig.show()

In [38]:
from scipy.stats import zscore
import numpy as np

abs_z_scores = np.abs(zscore([len(x) for x in all_lyrics_series]))
# For our purpose, an outlier is a value that is more than 3 standard deviations from the mean
filtered_z_entries = (abs_z_scores < 3)
# Also drop lyrics of 200 characters or fewer (likely failed scrapes or instrumentals)
filtered_min_entries = np.array([len(x) > 200 for x in all_lyrics_series])
all_lyrics_series = all_lyrics_series[filtered_z_entries & filtered_min_entries]
fig = px.histogram([len(x) for x in all_lyrics_series.values], labels={'value':'lyrics length'})
print(len(all_lyrics_series))
fig.show()

74603


## NRC Word-Emotion Association Lexicon - Emotion-Intensity-Lexicon-v1
### A large lexicon made up of several lexicons (one per emotion), each listing thousands of English words together with the intensity of their association with that emotion (on a scale of 0.0-1.0). Used for emotion analysis of song lyrics. The annotations were produced manually through crowdsourcing.

In [49]:
with open('data/NRC/OneFilePerEmotion/sadness-scores.txt', 'r') as lexicon_file:
    for _ in range(3):
        print(lexicon_file.readline())

heartbreaking	0.969

mourning	0.969

tragic	0.961
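
As an illustration of how such a lexicon might be applied to lyrics, the sketch below parses tab-separated `word<TAB>score` lines (the format printed above) and averages the intensity over all tokens of a text. The function names and the averaging scheme are assumptions made for this example, not necessarily the scoring used later in the project.

```python
import re


def parse_lexicon(lines):
    """Parse 'word<TAB>score' lines into a {word: intensity} dict."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, score = line.split('\t')
        lexicon[word] = float(score)
    return lexicon


def emotion_intensity(text, lexicon):
    """Mean intensity over all tokens; words missing from the lexicon count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0.0) for token in tokens) / len(tokens)


# The three entries shown above, used instead of reading the full file
sample_lines = ['heartbreaking\t0.969', 'mourning\t0.969', 'tragic\t0.961']
sadness = parse_lexicon(sample_lines)
score = emotion_intensity('A tragic, heartbreaking goodbye', sadness)
print(score)
```

Dividing by the total token count means long lyrics with a few emotional words score low; averaging only over matched words would be an alternative design.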



In [57]:
from flair.models import TextClassifier
from flair.data import Sentence

# load tagger
classifier = TextClassifier.load('sentiment')
sentence = Sentence("enormously not entertaining for moviegoers of any age.")

# call predict
classifier.predict(sentence)

# check prediction
print(sentence)

2021-09-04 22:23:12,651 loading file C:\Users\LiorB\.flair\models\sentiment-en-mix-distillbert_4.pt
Sentence: "enormously not entertaining for moviegoers of any age ."   [− Tokens: 9  − Sentence-Labels: {'label': [NEGATIVE (0.9999)]}]
