# Note: <br>This notebook will fail if you try to run it as-is, because it requires private Spotify client settings that are not included.

First, we installed the spotipy library by running `pip install spotipy`
<br>
(see https://anaconda.org/jkroes/spotipy).

#### Using the Spotify API requires agreeing to a few terms of use:<br>
* I understand that this app is not for commercial use
* I understand that I cannot migrate my app from non-commercial to commercial without permission
* I understand and agree with Spotify's Developer Terms of Service, Branding Guidelines, and Privacy Policy
    * https://developer.spotify.com/terms/
    * https://developer.spotify.com/branding-guidelines/
    * https://www.spotify.com/il-en/legal/privacy-policy/

Create an access token and fetch a specific user's liked songs (as a test).

In [2]:
from_year_d = 6
to_year_d = 9

import spotipy
import spotipy.util as util
import sys

import pandas as pd
import numpy as np
import re
import datetime

username = ''
client_id = ''
client_secret = ''
redirect_uri = 'http://localhost:8888/callback/'
scope = 'user-library-read'

if len(sys.argv) > 1:
    username = sys.argv[1]
else:
    print("Usage: %s username" % (sys.argv[0],))
    sys.exit()

token = util.prompt_for_user_token(username, scope, client_id, client_secret, redirect_uri)

if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_saved_tracks()
    for item in results['items']:
        track = item['track']
        print(track['name'] + ' - ' + track['artists'][0]['name'])
else:
    print("Can't get token for", username)

NI BIEN NI MAL - Bad Bunny
One Level Down - Original mix - Sphera
Backseat Freestyle - Kendrick Lamar
Tusa - KAROL G
Ready To Let Go - Cage The Elephant
Diggin' a Hole - Downstairs Monsters
Starry Night - Original Mix - Peggy Gou
That's Life (feat. Mac Miller & Sia) - 88-Keys
Eleven - Khalid
Lalala - Y2K
מאושרים - Doli & Penn
ROXANNE - Arizona Zervas
Be Still - Liam Gallagher
לא חסר לי כלום - Avihu Pinchasov Rhythm Club
Sunday Best - Surfaces
Right Back (feat. A Boogie Wit Da Hoodie) - Khalid
I Got A Name - Stereo Version - Jim Croce
Lover Of The Light - Live From Red Rocks, Colorado - Mumford & Sons
What's the Use? - Mac Miller
Hard Sun - Eddie Vedder
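
Since the client settings in the cell above are intentionally left blank, one way to run it without hard-coding secrets is to read them from environment variables. The sketch below is only an illustration: the helper name `load_spotify_settings` is hypothetical, and the variable names `SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET` and `SPOTIPY_REDIRECT_URI` are assumed to be set in your own environment.

```python
import os

# Hypothetical helper (not part of the original notebook): collect the Spotify
# client settings from environment variables instead of hard-coding them.
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET", "SPOTIPY_REDIRECT_URI")

def load_spotify_settings(env=None):
    """Return a dict of client settings, or raise if any variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing Spotify settings: " + ", ".join(missing))
    return {
        "client_id": env["SPOTIPY_CLIENT_ID"],
        "client_secret": env["SPOTIPY_CLIENT_SECRET"],
        "redirect_uri": env["SPOTIPY_REDIRECT_URI"],
    }
```

The returned values could then be passed to `util.prompt_for_user_token` in place of the empty strings.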


**Next step:** 
* Get audio features, and as much other data as Spotify provides, for songs that are not in the 'Billboard Year-End Hot 100 singles' lists (if a sampled song turns out to be on one of the lists, it will be removed).<br>
* Derive values during the query, such as the day of the week and the season of the release date.<br>
* Convert string values into numeric values so we can use them later for our prediction.
    * Day of week ranges from 1 to 7:<br>
        1 = Monday and 7 = Sunday.
    * Season ranges from 1 to 4:<br>
        1 = Spring, 2 = Summer, 3 = Autumn, 4 = Winter
    * Whether the release is a single (1) or not (0: album / compilation)
* Get 600 songs: 3 years, 200 per year.
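
The day-of-week and season encoding described above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the name `encode_release_date` is hypothetical, and returning `(None, None)` for dates that are not in full `YYYY-MM-DD` form is an assumption about how incomplete dates could be handled.

```python
import datetime

# Season codes used in this notebook: 1 = Spring, 2 = Summer, 3 = Autumn, 4 = Winter.
SEASON_BY_MONTH = {3: 1, 4: 1, 5: 1,
                   6: 2, 7: 2, 8: 2,
                   9: 3, 10: 3, 11: 3,
                   12: 4, 1: 4, 2: 4}

def encode_release_date(release_date):
    """Return (day_of_week, season) for a 'YYYY-MM-DD' string.

    Returns (None, None) when the string is not a full date (assumption:
    some releases may only carry a year or a year and month).
    """
    try:
        parsed = datetime.datetime.strptime(release_date, '%Y-%m-%d').date()
    except ValueError:
        return None, None
    # isoweekday() already uses 1 = Monday ... 7 = Sunday.
    return parsed.isoweekday(), SEASON_BY_MONTH[parsed.month]

print(encode_release_date('2018-10-05'))  # (5, 3)
```

Using `isoweekday()` avoids the chain of string comparisons on weekday names, since it already follows the 1 = Monday to 7 = Sunday convention.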

* offset: The index of the first result to return. For example, to get results starting at index 10, set the offset to 10.<br>
* limit: Maximum number of results to return.
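
To make the interplay of `offset` and `limit` concrete, here is a minimal sketch of the paging arithmetic only (no API call is made). The generator name `page_offsets` is hypothetical.

```python
def page_offsets(total, limit):
    """Yield the offset of each page needed to cover `total` results."""
    for offset in range(0, total, limit):
        yield offset

# 200 songs fetched 2 at a time, as in the query loop of this notebook.
offsets = list(page_offsets(200, 2))
print(len(offsets), offsets[:3], offsets[-1])  # 100 [0, 2, 4] 198
```

Each yielded value would be passed as `offset=` together with the same `limit=` in a call such as `sp.search`.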

In [3]:
songs_name_no = {"2016": [],"2017": [], "2018":[]}
artists_name_no = {"2016": [],"2017": [], "2018":[]}
track_id_no = {"2016": [],"2017": [], "2018":[]}
is_single_no = {"2016": [],"2017": [], "2018":[]}
total_tracks_no = {"2016": [],"2017": [], "2018":[]}
release_date_no = {"2016": [],"2017": [], "2018":[]}
day_of_week_no = {"2016": [],"2017": [], "2018":[]}
release_season_no = {"2016": [],"2017": [], "2018":[]}
artist_genres_no = {"2016": [],"2017": [], "2018":[]}

for i in range(from_year_d, to_year_d): 
    for j in range(0,200,2):
        # The Spotify search API, as used from Python, did not work for us with multiple concatenated terms and a NOT operator,
        # so we had to take all the songs and eliminate the Hebrew ones later
        track_results = sp.search(q='year:201{}'.format(i), type='track', limit=2,offset=j)
        
        for k in track_results['tracks']['items']:
            songs_name_no["201{}".format(i)].append(k['name'])
            
            artists_name_no["201{}".format(i)].append(k['artists'][0]['name'])
            
            track_id_no["201{}".format(i)].append(k['id']) 
            
            temp_single = k['album']['album_type']
            if temp_single == 'single':
                temp_single = 1
            else:
                temp_single = 0
            is_single_no["201{}".format(i)].append(temp_single)
            
            total_tracks_no["201{}".format(i)].append(k['album']['total_tracks'])
            
            temp_release = k['album']['release_date']
            release_date_no["201{}".format(i)].append(temp_release)
            
            temp_day = datetime.datetime.strptime(temp_release, '%Y-%m-%d').strftime('%A')
            if temp_day == 'Monday':
                temp_day = 1
            elif temp_day == 'Tuesday':
                temp_day = 2
            elif temp_day == 'Wednesday':
                temp_day = 3
            elif temp_day == 'Thursday':
                temp_day = 4
            elif temp_day == 'Friday':
                temp_day = 5
            elif temp_day == 'Saturday':
                temp_day = 6
            elif temp_day == 'Sunday':
                temp_day = 7
            day_of_week_no["201{}".format(i)].append(temp_day)
            
            month = int(temp_release.split('-')[1])
            if month in [3,4,5]:
                month = 1
            elif month in [6,7,8]:
                month = 2
            elif month in [9,10,11]:
                month = 3
            elif month in [12,1,2]:
                month = 4
            release_season_no["201{}".format(i)].append(month)
            
            temp_artist_id = k['artists'][0]['id']
            artist_genres_no["201{}".format(i)].append(sp.artist(temp_artist_id).get('genres'))

    print('Number of elements in 201{}_track_id list:'.format(i), len(track_id_no["201{}".format(i)]))

Number of elements in 2016_track_id list: 200
Number of elements in 2017_track_id list: 200
Number of elements in 2018_track_id list: 200


<br>Insert all the data we collected in the lists into dataframes.

In [4]:
songs_df_no={"2016_df_no": pd.DataFrame(),"2017_df_no": pd.DataFrame(), "2018_df_no":pd.DataFrame()}

i = from_year_d
for key, value in songs_df_no.items():
    songs_df_no[key]['Title'] = songs_name_no["201{}".format(i)]
    songs_df_no[key]['Artist'] = artists_name_no["201{}".format(i)]
    songs_df_no[key]['id'] = track_id_no["201{}".format(i)]
    songs_df_no[key]['artist_genres'] = artist_genres_no["201{}".format(i)]
    songs_df_no[key]['is_single'] = is_single_no["201{}".format(i)]
    songs_df_no[key]['total_tracks'] = total_tracks_no["201{}".format(i)]
    songs_df_no[key]['release_date'] = release_date_no["201{}".format(i)]
    songs_df_no[key]['day_of_week'] = day_of_week_no["201{}".format(i)]
    songs_df_no[key]['release_season'] = release_season_no["201{}".format(i)]
    songs_df_no[key]['Year'] = "201{}".format(i)
    songs_df_no[key]['is_top100'] = 0
    i = i + 1
songs_df_no['2018_df_no'].head()

Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,is_top100
0,Falling,Trevor Daniel,4TnjEaWOeW0eKTKIEvJyCa,"[alternative r&b, melodic rap, pop rap]",1,1,2018-10-05,5,3,2018,0
1,לשוב הביתה,Ishay Ribo,52n4gF126eIllrGuc9Zus6,[israeli pop],0,11,2018-02-23,5,4,2018,0
2,Lucid Dreams,Juice WRLD,285pBltuF7vW8TeWk8hdRR,"[chicago rap, melodic rap]",0,17,2018-12-10,1,4,2018,0
3,לבחור נכון,Amir Dadon,7n6emXIcaECmkljP1rPlvQ,"[classic israeli pop, israeli pop, israeli rock]",0,10,2018-01-01,1,4,2018,0
4,אחת ולתמיד,Ishay Ribo,3bgNXXL7TjlBDOl36wLWHk,[israeli pop],0,11,2018-02-23,5,4,2018,0


<br>**Next step:**
* Eliminate songs in Hebrew so that the sample is comparable to the top100 lists.<br>
Also, our lyrics site doesn't include Hebrew songs.
* Note that Hebrew songs whose titles are written in Latin letters may remain; we had to keep them because there is no way to determine their origin from the title.<br>
#### The reasons above forced us to take a much larger number of songs in the first query, as the elimination reduced it dramatically. 
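
The title filter can be sketched with the standard `re` module. This is a hedged illustration: the helper name `contains_hebrew` is hypothetical, and matching the whole Unicode Hebrew block (`\u0590` to `\u05FF`) is an assumption that is slightly wider than a plain letter range.

```python
import re

# Matches any character in the Unicode Hebrew block (assumption: this wider
# range is an alternative to a plain range of Hebrew letters).
HEBREW_CHARS = re.compile(r'[\u0590-\u05FF]')

def contains_hebrew(title):
    """Return True if the title contains at least one Hebrew character."""
    return HEBREW_CHARS.search(title) is not None

titles = ['Falling', 'לשוב הביתה', 'Zahav']
kept = [t for t in titles if not contains_hebrew(t)]
print(kept)  # ['Falling', 'Zahav']
```

As the example shows, a transliterated title such as 'Zahav' passes the filter, which is the limitation noted above.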

In [5]:
i = from_year_d
for key, value in songs_df_no.items():
    print("Number of elements for 201{} with Hebrew songs: {}".format(i, len(songs_df_no[key]['Title'])))
    songs_df_no[key] = songs_df_no[key][~songs_df_no[key]['Title'].str.contains('[א-ת]', regex = True)]
    print("Number of elements for 201{} without Hebrew songs: {}\n".format(i, len(songs_df_no[key]['Title'])))
    i = i + 1

Number of elements for 2016 with Hebrew songs: 200
Number of elements for 2016 without Hebrew songs: 157

Number of elements for 2017 with Hebrew songs: 200
Number of elements for 2017 without Hebrew songs: 130

Number of elements for 2018 with Hebrew songs: 200
Number of elements for 2018 without Hebrew songs: 133



After running this cell we could determine how many Hebrew songs we had, and how many songs we actually have to work with.<br><br>
Originally we queried 300 songs per year; after this step we reduced it to 200 per year.<br><br>
Assuming that some of them will also be removed once we merge with the songs that are part of top100, 130-160 songs per year is enough.

<br><br>**Next step**: Concatenate the 3 dataframes, ignoring their indexes, since all of them use indexes in the range 0-199.

In [6]:
df_spotipy_no = pd.concat([songs_df_no['2016_df_no'], songs_df_no['2017_df_no'], songs_df_no['2018_df_no']],axis=0, sort=False, ignore_index=True)
mid = len(df_spotipy_no)//2  # integer division keeps the slice bounds used with iloc as integers
print("Shape of the dataset: {}".format(df_spotipy_no.shape))
df_spotipy_no.iloc[np.r_[0:2, mid:mid+2, -2:0]]

Shape of the dataset: (420, 11)


Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,is_top100
0,goosebumps,Travis Scott,6gBFPUFcJLzWGx4lenP6h2,[rap],0,14,2016-09-16,5,3,2016,0
1,Say You Won't Let Go,James Arthur,0p6RzKrGeXzyYYd2RZPKd8,"[pop, post-teen pop, talent show, uk pop]",0,18,2016-10-28,5,3,2016,0
210,River (feat. Ed Sheeran),Eminem,1cS0TgbR263ey9jn0MwD2s,"[detroit hip hop, g funk, hip hop, rap]",0,19,2017-12-15,5,4,2017,0
211,Zahav,Static & Ben El,0vF70TcDmYyLVktrewpNgY,"[israeli pop, jewish pop]",1,1,2017-01-29,7,4,2017,0
418,Look Back at It,A Boogie Wit da Hoodie,3Ol2xnObFdKV9pmRD2t9x8,"[melodic rap, pop rap, rap, trap]",0,20,2018-12-21,5,4,2018,0
419,SLOW DANCING IN THE DARK,Joji,0rKtyWc8bvkriBthvHKY8d,"[alternative r&b, viral pop]",0,12,2018-10-26,5,3,2018,0


<br>There are scenarios where the same track appears under multiple track IDs (as a single, as part of an album, etc.).<br>
Therefore we check for this and correct it if needed.
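
The idea of treating rows with the same artist and title as one song, regardless of track ID, can be sketched in plain Python. The helper name `drop_duplicate_tracks` and the sample rows are made up for illustration; keeping the first occurrence is an assumption that matches the default behaviour of pandas `drop_duplicates`.

```python
def drop_duplicate_tracks(rows):
    """Keep the first row for each (Artist, Title) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row['Artist'], row['Title'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows: the same song listed under two different track IDs.
rows = [
    {'Artist': 'Artist A', 'Title': 'Song X', 'id': 'id-single'},
    {'Artist': 'Artist A', 'Title': 'Song X', 'id': 'id-album'},
    {'Artist': 'Artist B', 'Title': 'Song Y', 'id': 'id-other'},
]
print([r['id'] for r in drop_duplicate_tracks(rows)])  # ['id-single', 'id-other']
```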

In [7]:
group = df_spotipy_no.groupby(['Artist','Title'], as_index=True).size()
print("The amount of duplicated songs: {}".format(group[group > 1].count()))

The amount of duplicated songs: 7


In [8]:
print("Songs count BEFORE drop duplicate: {}".format(len(df_spotipy_no)))
df_spotipy_no.drop_duplicates(subset=['Artist','Title'], inplace=True)
print("Songs count AFTER drop duplicate: {}".format(len(df_spotipy_no)))

Songs count BEFORE drop duplicate: 420
Songs count AFTER drop duplicate: 413


In [9]:
group = df_spotipy_no.groupby(['Artist','Title'], as_index=True).size()
print("The amount of duplicated songs: {}".format(group[group > 1].count()))

The amount of duplicated songs: 0


<br>**Next step:** Create a function to get Audio Features data so we can call it for each set of songs.
#### The function expects a dataframe that contains a column of track IDs named 'id'.
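
The batching logic of such a function can be shown in isolation, without calling Spotify. In this sketch `fetch_in_batches` and `fake_fetch` are hypothetical names; `fake_fetch` is a stand-in for the real API call and simply returns `None` for one made-up ID to imitate a track with no audio features.

```python
def fetch_in_batches(track_ids, fetch, batch_size=100):
    """Call `fetch` on consecutive batches and separate results from misses."""
    rows, missing = [], 0
    for start in range(0, len(track_ids), batch_size):
        for features in fetch(track_ids[start:start + batch_size]):
            if features is None:
                missing += 1
            else:
                rows.append(features)
    return rows, missing

# Stand-in for the API call (hypothetical): returns None for one id.
def fake_fetch(batch):
    return [None if tid == 'id-7' else {'id': tid} for tid in batch]

ids = ['id-{}'.format(n) for n in range(250)]
rows, missing = fetch_in_batches(ids, fake_fetch)
print(len(rows), missing)  # 249 1
```

Passing the fetch function as a parameter keeps the batching testable offline; in the notebook the same role is played by `sp.audio_features`.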

In [10]:
def getAudioFeatures(df):
    # collected feature rows, API batch size, and a counter for tracks with no features
    rows = []
    batchsize = 100
    None_counter = 0

    for start in range(0, len(df['id']), batchsize):
        # positional slice, so it also works after rows were dropped from the dataframe
        batch = df['id'].iloc[start:start+batchsize].tolist()
        feature_results = sp.audio_features(batch)
        for t in feature_results:
            # a None entry means no audio features were available for that track
            if t is None:
                None_counter = None_counter + 1
            else:
                rows.append(t)
    print('Done,\nNumber of tracks where no audio features were available:',None_counter)
    return rows

In [11]:
rows = getAudioFeatures(df_spotipy_no)

Done,
Number of tracks where no audio features were available: 0


<br>**Next step:** Insert the audio features data collected into a NEW dataframe.

In [12]:
df_audio_features_no = pd.DataFrame.from_dict(rows, orient='columns')
print("Shape of the dataset: {}".format(df_audio_features_no.shape))
df_audio_features_no.head()

Shape of the dataset: (413, 18)


Unnamed: 0,acousticness,analysis_url,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,track_href,type,uri,valence
0,0.0847,https://api.spotify.com/v1/audio-analysis/6gBF...,0.841,243837,0.728,6gBFPUFcJLzWGx4lenP6h2,0.0,7,0.149,-3.37,1,0.0484,130.049,4,https://api.spotify.com/v1/tracks/6gBFPUFcJLzW...,audio_features,spotify:track:6gBFPUFcJLzWGx4lenP6h2,0.43
1,0.695,https://api.spotify.com/v1/audio-analysis/0p6R...,0.358,211467,0.557,0p6RzKrGeXzyYYd2RZPKd8,0.0,10,0.0902,-7.398,1,0.059,85.043,4,https://api.spotify.com/v1/tracks/0p6RzKrGeXzy...,audio_features,spotify:track:0p6RzKrGeXzyYYd2RZPKd8,0.494
2,0.141,https://api.spotify.com/v1/audio-analysis/7MXV...,0.678,230453,0.588,7MXVkk9YMctZqd1Srtv4MB,6e-06,7,0.137,-7.015,1,0.276,186.005,4,https://api.spotify.com/v1/tracks/7MXVkk9YMctZ...,audio_features,spotify:track:7MXVkk9YMctZqd1Srtv4MB,0.486
3,0.414,https://api.spotify.com/v1/audio-analysis/7BKL...,0.748,244960,0.524,7BKLCZ1jbUBVqRi2FVlTVw,0.0,8,0.111,-5.599,1,0.0338,95.01,4,https://api.spotify.com/v1/tracks/7BKLCZ1jbUBV...,audio_features,spotify:track:7BKLCZ1jbUBVqRi2FVlTVw,0.661
4,0.702,https://api.spotify.com/v1/audio-analysis/7MiZ...,0.391,131272,0.396,7MiZjKawmXTsTNePyTfPyL,0.405,1,0.315,-8.621,0,0.189,99.112,5,https://api.spotify.com/v1/tracks/7MiZjKawmXTs...,audio_features,spotify:track:7MiZjKawmXTsTNePyTfPyL,0.199


In [13]:
df_audio_features_no = pd.DataFrame.from_dict(rows,orient='columns')
print("Shape of the dataset: {}".format(df_audio_features_no.shape))
df_audio_features_no.info()

Shape of the dataset: (413, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 18 columns):
acousticness        413 non-null float64
analysis_url        413 non-null object
danceability        413 non-null float64
duration_ms         413 non-null int64
energy              413 non-null float64
id                  413 non-null object
instrumentalness    413 non-null float64
key                 413 non-null int64
liveness            413 non-null float64
loudness            413 non-null float64
mode                413 non-null int64
speechiness         413 non-null float64
tempo               413 non-null float64
time_signature      413 non-null int64
track_href          413 non-null object
type                413 non-null object
uri                 413 non-null object
valence             413 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 58.2+ KB


<br>**Next step:** Processing the data - drop unneeded columns.

In [14]:
df_audio_features_no.drop(['analysis_url', 'track_href', 'type', 'uri'], axis=1,inplace=True)

print("Shape of the dataset: {}".format(df_audio_features_no.shape))
df_audio_features_no.info()

Shape of the dataset: (413, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 14 columns):
acousticness        413 non-null float64
danceability        413 non-null float64
duration_ms         413 non-null int64
energy              413 non-null float64
id                  413 non-null object
instrumentalness    413 non-null float64
key                 413 non-null int64
liveness            413 non-null float64
loudness            413 non-null float64
mode                413 non-null int64
speechiness         413 non-null float64
tempo               413 non-null float64
time_signature      413 non-null int64
valence             413 non-null float64
dtypes: float64(9), int64(4), object(1)
memory usage: 45.2+ KB


<br>**Next step:** Merge the audio features dataframe with our original dataframe.

In [15]:
# the 'inner' method will make sure that we only keep track IDs present in both datasets
df_spotipy_final_no = pd.merge(df_spotipy_no, df_audio_features_no, on='id', how='inner')
print("Shape of the dataset: {}".format(df_spotipy_final_no.shape))
df_spotipy_final_no.head()

Shape of the dataset: (413, 24)


Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,goosebumps,Travis Scott,6gBFPUFcJLzWGx4lenP6h2,[rap],0,14,2016-09-16,5,3,2016,...,0.728,0.0,7,0.149,-3.37,1,0.0484,130.049,4,0.43
1,Say You Won't Let Go,James Arthur,0p6RzKrGeXzyYYd2RZPKd8,"[pop, post-teen pop, talent show, uk pop]",0,18,2016-10-28,5,3,2016,...,0.557,0.0,10,0.0902,-7.398,1,0.059,85.043,4,0.494
2,Starboy,The Weeknd,7MXVkk9YMctZqd1Srtv4MB,"[canadian contemporary r&b, canadian pop, pop]",0,18,2016-11-25,5,3,2016,...,0.588,6e-06,7,0.137,-7.015,1,0.276,186.005,4,0.486
3,Closer,The Chainsmokers,7BKLCZ1jbUBVqRi2FVlTVw,"[electropop, pop, tropical house]",1,1,2016-07-29,5,2,2016,...,0.524,0.0,8,0.111,-5.599,1,0.0338,95.01,4,0.661
4,Devil Eyes,Hippie Sabotage,7MiZjKawmXTsTNePyTfPyL,[edm],0,11,2016-02-05,5,4,2016,...,0.396,0.405,1,0.315,-8.621,0,0.189,99.112,5,0.199


Note: No songs were lost; we have the same number of entries before and after the merge.

<br>Check whether we have any duplicated tracks.

In [16]:
df_spotipy_final_no[df_spotipy_final_no.duplicated(subset=['Artist','Title'],keep=False)]

Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence


<br><br>**Next step:** Get data for songs that are part of the top100 Billboard playlists.<br><br>
**The steps are basically the same as for the songs that are not part of top100; the difference is that we get our info from known playlists that already contain the top100 songs for each year.**

##### We had the option to include a 'popularity' column for each song, but popularity is continuously updated, so an old song's current popularity isn't relevant: we would want its popularity in the year it was chosen. We therefore commented out those lines.

In [17]:
from pprint import pprint

pl_uris = ['spotify:playlist:2LWafCgWzsXGWv7wJeePjA', 
           'spotify:playlist:255aUSCuVTcdD5JTogG69d', 
           'spotify:playlist:37IRJrV9jd0LnsFTIY83ax'] # top 100 billboard singles playlists by order: 2016, 2017, 2018 

track_id_yes = {"2016": [],"2017": [], "2018":[]}
songs_name_yes = {"2016": [],"2017": [], "2018":[]}
artists_name_yes = {"2016": [],"2017": [], "2018":[]}
artist_genres_yes = {"2016": [],"2017": [], "2018":[]}
is_single_yes = {"2016": [],"2017": [], "2018":[]}
total_tracks_yes = {"2016": [],"2017": [], "2018":[]}
release_date_yes = {"2016": [],"2017": [], "2018":[]}
day_of_week_yes = {"2016": [],"2017": [], "2018":[]}
release_season_yes = {"2016": [],"2017": [], "2018":[]}
release_year_yes = {"2016": [],"2017": [], "2018":[]}


j = from_year_d
while True:
    for playlist in pl_uris:
        offset = 0
        tracks_id = sp.playlist_tracks(playlist, offset=offset,
                                      fields='items.track.id,total')
        songs_name = sp.playlist_tracks(playlist, offset=offset,
                                      fields='items.track.name.total')
        artist_name = sp.playlist_tracks(playlist, offset=offset,
                                      fields='items.track.artists.name.total')
        artists_id = sp.playlist_tracks(playlist, offset=offset,
                              fields='items.track.artists.id.total') 
        is_single = sp.playlist_tracks(playlist, offset=offset,
                              fields='items.track.album.album_type.total')
        total_tracks = sp.playlist_tracks(playlist, offset=offset,
                              fields='items.track.album.total_tracks.total')
        release_date = sp.playlist_tracks(playlist, offset=offset,
                              fields='items.track.album.release_date')
        
    #     popularity = sp.playlist_tracks(playlist, offset=offset,
    #                                   fields='items.track.popularity.total')
        
        i = 0
        offset = offset + len(tracks_id['items'])
        
        for i in range(0, offset):
            if tracks_id['items'][i].get('track').get('id') is not None:
                track_id_yes['201{}'.format(j)].append(tracks_id['items'][i].get('track').get('id'))
                
                songs_name_yes['201{}'.format(j)].append(songs_name['items'][i].get('track').get('name'))
                
                artists_name_yes['201{}'.format(j)].append(artist_name['items'][i].get('track').get('artists')[0].get('name'))
                
                temp_single = is_single['items'][i].get('track').get('album').get('album_type')
                if temp_single == 'single':
                    temp_single = 1
                else:
                    temp_single = 0
                is_single_yes["201{}".format(j)].append(temp_single)
                
                total_tracks_yes["201{}".format(j)].append(total_tracks['items'][i].get('track').get('album').get('total_tracks'))
                
                temp_release = release_date['items'][i].get('track').get('album').get('release_date')
                release_date_yes["201{}".format(j)].append(temp_release)
                
                temp_day = datetime.datetime.strptime(temp_release, '%Y-%m-%d').strftime('%A')
                if temp_day == 'Monday':
                    temp_day = 1
                elif temp_day == 'Tuesday':
                    temp_day = 2
                elif temp_day == 'Wednesday':
                    temp_day = 3
                elif temp_day == 'Thursday':
                    temp_day = 4
                elif temp_day == 'Friday':
                    temp_day = 5
                elif temp_day == 'Saturday':
                    temp_day = 6
                elif temp_day == 'Sunday':
                    temp_day = 7
                day_of_week_yes["201{}".format(j)].append(temp_day)
                
                month = int(temp_release.split('-')[1])
                if month in [3,4,5]:
                    month = 1
                elif month in [6,7,8]:
                    month = 2
                elif month in [9,10,11]:
                    month = 3
                elif month in [12,1,2]:
                    month = 4
                release_season_yes["201{}".format(j)].append(month)
                
                year = int(temp_release.split('-')[0])
                release_year_yes["201{}".format(j)].append(year)
                
                temp_artist_id = artists_id['items'][i].get('track').get('artists')[0].get('id')
                artist_genres_yes['201{}'.format(j)].append(sp.artist(temp_artist_id).get('genres'))
            else:
                continue
        if (j < to_year_d-1):
            j = j + 1
        else:
            j = j + 1
            break
    if (j >= to_year_d):
        break

retrying ...1secs


In [18]:
list_check = [track_id_yes, songs_name_yes, artists_name_yes, artist_genres_yes, 
              is_single_yes, total_tracks_yes, release_date_yes, day_of_week_yes, release_season_yes]
for i in list_check:
    print("dictionery-dataframes size {}, {}, {}".format(len(i['2016']), len(i['2017']), len(i['2018'])))

dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100
dictionery-dataframes size 98, 99, 100


We have missing data for 2 songs in the 2016 top100 Billboard list and for 1 song in the 2017 list. This is data loss that we have to accept.
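
The gaps come from playlist items whose track ID is missing, which the loop skips. A minimal sketch of flattening one playlist item into a row and skipping such items is given below. The helper name `flatten_playlist_item` and the sample items are made up; the nested keys follow the fields the notebook reads (`track`, `artists`, `album`).

```python
def flatten_playlist_item(item):
    """Return a flat row for one playlist item, or None if it has no track id."""
    track = item.get('track') or {}
    if track.get('id') is None:
        return None
    album = track.get('album') or {}
    artists = track.get('artists') or [{}]
    return {
        'id': track['id'],
        'Title': track.get('name'),
        'Artist': artists[0].get('name'),
        'is_single': 1 if album.get('album_type') == 'single' else 0,
        'total_tracks': album.get('total_tracks'),
        'release_date': album.get('release_date'),
    }

# Made-up items shaped like the fields the notebook reads.
items = [
    {'track': {'id': 'abc', 'name': 'Song X',
               'artists': [{'name': 'Artist A', 'id': 'a1'}],
               'album': {'album_type': 'single', 'total_tracks': 1,
                         'release_date': '2016-07-22'}}},
    {'track': {'id': None, 'name': 'Unavailable song',
               'artists': [], 'album': {}}},
]
rows = [row for row in map(flatten_playlist_item, items) if row is not None]
print(len(rows), rows[0]['is_single'])  # 1 1
```

Reading every field from a single item like this would also allow one playlist request per page instead of one request per field, though that is a design suggestion rather than what the notebook does.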

<br>**Next step:** Put the data collected in the lists into dataframes per year and collect their audio features.<br>
The data is split into one dataframe per year because the getAudioFeatures function has to work with a defined offset.

In [19]:
songs_df_yes = {"2016_df_yes": pd.DataFrame(),"2017_df_yes": pd.DataFrame(), "2018_df_yes":pd.DataFrame()}

i = from_year_d
for key, value in songs_df_yes.items():
    songs_df_yes[key]['Title'] = songs_name_yes["201{}".format(i)]
    songs_df_yes[key]['Artist'] = artists_name_yes["201{}".format(i)]
    songs_df_yes[key]['id'] = track_id_yes["201{}".format(i)]
    songs_df_yes[key]['artist_genres'] = artist_genres_yes["201{}".format(i)]
    songs_df_yes[key]['is_single'] = is_single_yes["201{}".format(i)]
    songs_df_yes[key]['total_tracks'] = total_tracks_yes["201{}".format(i)]
    songs_df_yes[key]['release_date'] = release_date_yes["201{}".format(i)]
    songs_df_yes[key]['day_of_week'] = day_of_week_yes["201{}".format(i)]
    songs_df_yes[key]['release_season'] = release_season_yes["201{}".format(i)]
    songs_df_yes[key]['Year'] = release_year_yes["201{}".format(i)]
    songs_df_yes[key]['is_top100'] = 1
    i = i + 1
print("Shape of the dataset: {}".format(songs_df_yes['2016_df_yes'].shape))
songs_df_yes['2016_df_yes'].tail()

Shape of the dataset: (98, 11)


Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,is_top100
93,Humble And Kind,Tim McGraw,1qosWrKxri24ZIzH4ZDFcp,"[contemporary country, country, country road]",0,14,2015-01-01,4,4,2015,1
94,Wicked,Future,6BbINUfGabVyiNFJpQXn3x,"[atl hip hop, pop rap, rap, southern hip hop, ...",0,12,2016-04-13,3,1,2016,1
95,Tiimmy Turner,Desiigner,0zMxL4BTjSqCsUtfdlcL8G,"[pop rap, rap, southern hip hop, trap, viral t...",1,1,2016-07-22,5,2,2016,1
96,See You Again (feat. Charlie Puth),Wiz Khalifa,7wqSzGeodspE3V6RBD5W8L,"[hip hop, pittsburgh rap, pop rap, rap, southe...",1,1,2015-03-10,2,1,2015,1
97,Perfect,One Direction,3NLnwwAQbbFKcEcV8hDItk,"[boy band, dance pop, pop, post-teen pop, tale...",0,17,2015-11-13,5,3,2015,1


In [20]:
# Get the audio features using the previously defined function.

audio_features_yes = {"2016": pd.DataFrame(),"2017": pd.DataFrame(), "2018":pd.DataFrame()}
for i in range(from_year_d, to_year_d):
    rows = getAudioFeatures(songs_df_yes['201{}_df_yes'.format(i)])
    audio_features_yes['201{}'.format(i)] = pd.DataFrame.from_dict(rows, orient='columns')
    print("Shape of audio features dataset 201{}: {}.".format(i, audio_features_yes['201{}'.format(i)].shape))

Done,
Number of tracks where no audio features were available: 0
Shape of audio features dataset 2016: (98, 18).
Done,
Number of tracks where no audio features were available: 0
Shape of audio features dataset 2017: (99, 18).
Done,
Number of tracks where no audio features were available: 0
Shape of audio features dataset 2018: (100, 18).


In [21]:
audio_features_yes['2017'].head()

Unnamed: 0,acousticness,analysis_url,danceability,duration_ms,energy,id,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,track_href,type,uri,valence
0,0.581,https://api.spotify.com/v1/audio-analysis/0FE9...,0.825,233713,0.652,0FE9t6xYkqWXU2ahLh6D8X,0.0,1,0.0931,-3.183,0,0.0802,95.977,4,https://api.spotify.com/v1/tracks/0FE9t6xYkqWX...,audio_features,spotify:track:0FE9t6xYkqWXU2ahLh6D8X,0.931
1,0.229,https://api.spotify.com/v1/audio-analysis/5CtI...,0.694,228827,0.815,5CtI0qwDJkDQGwXD1H1cLb,0.0,2,0.0924,-4.328,1,0.12,88.931,4,https://api.spotify.com/v1/tracks/5CtI0qwDJkDQ...,audio_features,spotify:track:5CtI0qwDJkDQGwXD1H1cLb,0.813
2,0.013,https://api.spotify.com/v1/audio-analysis/0KKk...,0.853,206693,0.56,0KKkJNfGyhkQ5aFogxQAPU,0.0,1,0.0944,-4.961,1,0.0406,134.066,4,https://api.spotify.com/v1/tracks/0KKkJNfGyhkQ...,audio_features,spotify:track:0KKkJNfGyhkQ5aFogxQAPU,0.86
3,0.000243,https://api.spotify.com/v1/audio-analysis/7ujx...,0.906,177000,0.625,7ujx3NYtwO2LkmKGz59mXp,3.2e-05,1,0.0975,-6.779,0,0.0903,150.018,4,https://api.spotify.com/v1/tracks/7ujx3NYtwO2L...,audio_features,spotify:track:7ujx3NYtwO2LkmKGz59mXp,0.423
4,0.0306,https://api.spotify.com/v1/audio-analysis/1dNI...,0.607,247627,0.649,1dNIEtp7AY3oDAKCGg2XkH,2.5e-05,11,0.174,-6.695,0,0.0362,102.996,4,https://api.spotify.com/v1/tracks/1dNIEtp7AY3o...,audio_features,spotify:track:1dNIEtp7AY3oDAKCGg2XkH,0.505


<br>**Next step:** Merge songs_df with audio_features for each year.
* Because we might have duplicates (as a single, as part of an album, etc.) and the merge must happen on the track id, we decided to first merge each year's songs_df and audio_features dataframes and only then concatenate the three results.
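
Why the order matters can be seen in a small plain-Python sketch of an inner join on `id`. The function name `inner_join_on_id` and the data are made up for illustration: when the same track ID appears in two yearly lists, joining after concatenation pairs each copy with every matching feature row.

```python
def inner_join_on_id(songs, features):
    """Pair every song row with every feature row that shares its id."""
    return [dict(song, **feat)
            for song in songs
            for feat in features
            if song['id'] == feat['id']]

# Made-up data: the same track id appears in two yearly lists.
songs_by_year = {'2016': [{'id': 'abc', 'Title': 'Song X'}],
                 '2017': [{'id': 'abc', 'Title': 'Song X'}]}
features_by_year = {'2016': [{'id': 'abc', 'tempo': 100.0}],
                    '2017': [{'id': 'abc', 'tempo': 100.0}]}

# Join each year first, then concatenate: one row per yearly entry.
per_year = []
for year in songs_by_year:
    per_year += inner_join_on_id(songs_by_year[year], features_by_year[year])

# Concatenate first, then join: the repeated id matches itself across years.
all_songs = songs_by_year['2016'] + songs_by_year['2017']
all_features = features_by_year['2016'] + features_by_year['2017']
joined_late = inner_join_on_id(all_songs, all_features)

print(len(per_year), len(joined_late))  # 2 4
```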

In [22]:
# the 'inner' method will make sure that we only keep track IDs present in both datasets

dic_spotipy_final_yes = {"2016": pd.DataFrame(),"2017": pd.DataFrame(), "2018":pd.DataFrame()}

for i in range(from_year_d, to_year_d):
    dic_spotipy_final_yes['201{}'.format(i)] = pd.merge(songs_df_yes['201{}_df_yes'.format(i)],
                                                        audio_features_yes['201{}'.format(i)], 
                                                                           on='id', how='inner')
    print("Shape of the merged 201{} dataset: {}".format(i, dic_spotipy_final_yes['201{}'.format(i)].shape))

Shape of the merged 2016 dataset: (98, 28)
Shape of the merged 2017 dataset: (99, 28)
Shape of the merged 2018 dataset: (100, 28)


<br>**Next step:** Concatenate the 3 dataframes, ignoring the indexes.

In [23]:
df_spotipy_final_yes = pd.concat([dic_spotipy_final_yes['2016'], 
                                  dic_spotipy_final_yes['2017'], 
                                  dic_spotipy_final_yes['2018']],axis=0, sort=False, ignore_index=True)
mid = len(df_spotipy_final_yes)//2  # integer division keeps the slice bounds used with iloc as integers
print("Shape of the final dataset for songs in top100: {}".format(df_spotipy_final_yes.shape))
df_spotipy_final_yes.iloc[np.r_[0:2, mid:mid+2, -2:0]]

Shape of the final dataset for songs in top100: (297, 28)


Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,...,liveness,loudness,mode,speechiness,tempo,time_signature,track_href,type,uri,valence
0,Love Yourself,Justin Bieber,3hB5DgAiMAQ4DzYbsMq1IT,"[canadian pop, pop, post-teen pop]",0,19,2015-11-13,5,3,2015,...,0.28,-9.828,1,0.438,100.418,4,https://api.spotify.com/v1/tracks/3hB5DgAiMAQ4...,audio_features,spotify:track:3hB5DgAiMAQ4DzYbsMq1IT,0.515
1,Sorry,Justin Bieber,69bp2EbF7Q2rqc5N3ylezZ,"[canadian pop, pop, post-teen pop]",0,19,2015-11-13,5,3,2015,...,0.299,-3.669,0,0.045,99.945,4,https://api.spotify.com/v1/tracks/69bp2EbF7Q2r...,audio_features,spotify:track:69bp2EbF7Q2rqc5N3ylezZ,0.41
148,Thunder,Imagine Dragons,0tKcYR2II1VCQWT79i5NrW,"[modern rock, rock]",0,11,2017-06-23,5,2,2017,...,0.155,-4.749,1,0.0479,167.88,4,https://api.spotify.com/v1/tracks/0tKcYR2II1VC...,audio_features,spotify:track:0tKcYR2II1VCQWT79i5NrW,0.298
149,T-Shirt,Migos,7KOlJ92bu51cltsD9KU5I7,"[atl hip hop, hip hop, pop rap, rap, southern ...",0,13,2017-01-27,5,4,2017,...,0.158,-3.744,0,0.217,139.023,4,https://api.spotify.com/v1/tracks/7KOlJ92bu51c...,audio_features,spotify:track:7KOlJ92bu51cltsD9KU5I7,0.486
295,Mi Gente (feat. Beyoncé),J Balvin,0GzmMQizDeA2NVMUaZksv0,"[latin, reggaeton]",1,1,2017-09-28,4,3,2017,...,0.231,-6.36,0,0.0818,105.009,4,https://api.spotify.com/v1/tracks/0GzmMQizDeA2...,audio_features,spotify:track:0GzmMQizDeA2NVMUaZksv0,0.469
296,Believer,Imagine Dragons,0pqnGHJpmpxLKifKRmU6WP,"[modern rock, rock]",0,12,2017-06-23,5,2,2017,...,0.081,-4.374,0,0.128,124.949,4,https://api.spotify.com/v1/tracks/0pqnGHJpmpxL...,audio_features,spotify:track:0pqnGHJpmpxLKifKRmU6WP,0.666


<br>**Next step:** Processing the data - drop unneeded columns.

**We check for duplicates only for our own information, since our data was taken from specific sources and not from a random search.<br>
A duplicated song means that the song made the top100 list in two different years. Its duplicated data is important: the song carries twice the weight of a regular song, so it is right to keep it as is.**

In [24]:
group = df_spotipy_final_yes.groupby(['Artist','Title'], as_index=True).size()
print("The amount of duplicated songs: {}".format(group[group > 1].count()))

The amount of duplicated songs: 21
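
To inspect which songs are behind such a count, a minimal sketch on toy data may help. The rows below are made up for illustration and are not taken from the dataset; `toy_yes` is a hypothetical stand-in for `df_spotipy_final_yes`.

```python
import pandas as pd

# Toy stand-in for df_spotipy_final_yes (hypothetical rows, illustration only).
toy_yes = pd.DataFrame({
    'Title':  ['Song A', 'Song B', 'Song A', 'Song C'],
    'Artist': ['Artist 1', 'Artist 2', 'Artist 1', 'Artist 3'],
    'Year':   [2016, 2016, 2017, 2017],
})

# keep=False marks every occurrence of a repeated (Artist, Title) pair.
dup_mask = toy_yes.duplicated(subset=['Artist', 'Title'], keep=False)
dup_rows = toy_yes[dup_mask].sort_values(['Artist', 'Title', 'Year'])
print(dup_rows)

# Number of distinct songs that appear more than once.
n_dup_songs = dup_rows.drop_duplicates(subset=['Artist', 'Title']).shape[0]
print("The amount of duplicated songs: {}".format(n_dup_songs))
```

Sorting by `Year` puts the repeated entries of a song next to each other, which makes it easy to confirm that they come from different yearly lists.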


In [25]:
list(df_spotipy_final_yes.columns)

['Title',
 'Artist',
 'id',
 'artist_genres',
 'is_single',
 'total_tracks',
 'release_date',
 'day_of_week',
 'release_season',
 'Year',
 'is_top100',
 'acousticness',
 'analysis_url',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'time_signature',
 'track_href',
 'type',
 'uri',
 'valence']

In [26]:
df_spotipy_final_yes.drop(['analysis_url', 'track_href', 'type', 'uri'], axis=1,inplace=True)

print("Shape of the final dataset for songs in top100: {}".format(df_spotipy_final_yes.shape))
list(df_spotipy_final_yes.columns)

Shape of the final dataset for songs in top100: (297, 24)


['Title',
 'Artist',
 'id',
 'artist_genres',
 'is_single',
 'total_tracks',
 'release_date',
 'day_of_week',
 'release_season',
 'Year',
 'is_top100',
 'acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'time_signature',
 'valence']

In [27]:
df_spotipy_final_yes.iloc[np.r_[0:2, mid:mid+2, -2:0]]

Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Love Yourself,Justin Bieber,3hB5DgAiMAQ4DzYbsMq1IT,"[canadian pop, pop, post-teen pop]",0,19,2015-11-13,5,3,2015,...,0.378,0.0,4,0.28,-9.828,1,0.438,100.418,4,0.515
1,Sorry,Justin Bieber,69bp2EbF7Q2rqc5N3ylezZ,"[canadian pop, pop, post-teen pop]",0,19,2015-11-13,5,3,2015,...,0.76,0.0,0,0.299,-3.669,0,0.045,99.945,4,0.41
148,Thunder,Imagine Dragons,0tKcYR2II1VCQWT79i5NrW,"[modern rock, rock]",0,11,2017-06-23,5,2,2017,...,0.81,0.21,0,0.155,-4.749,1,0.0479,167.88,4,0.298
149,T-Shirt,Migos,7KOlJ92bu51cltsD9KU5I7,"[atl hip hop, hip hop, pop rap, rap, southern ...",0,13,2017-01-27,5,4,2017,...,0.687,0.0,10,0.158,-3.744,0,0.217,139.023,4,0.486
295,Mi Gente (feat. Beyoncé),J Balvin,0GzmMQizDeA2NVMUaZksv0,"[latin, reggaeton]",1,1,2017-09-28,4,3,2017,...,0.716,0.0,11,0.231,-6.36,0,0.0818,105.009,4,0.469
296,Believer,Imagine Dragons,0pqnGHJpmpxLKifKRmU6WP,"[modern rock, rock]",0,12,2017-06-23,5,2,2017,...,0.78,0.0,10,0.081,-4.374,0,0.128,124.949,4,0.666


<br>**Next step:** Remove from df_spotipy_final_no the rows that contain songs from df_spotipy_final_yes.<br>
This situation arises because we created df_spotipy_final_no from a 'random' search, so it might contain songs that are part of the top100 list.

In [28]:
print("Shape of the final dataset for songs in top100: {}".format(df_spotipy_final_yes.shape))
print("Shape of the final dataset for songs NOT in top100: {}".format(df_spotipy_final_no.shape))

Shape of the final dataset for songs in top100: (297, 24)
Shape of the final dataset for songs NOT in top100: (413, 24)


Create Python sets of (song title, artist) pairs. This way we will have a unique entry for each song.<br>
* We don't care about 'Year' because we are about to take the intersection of the sets and then delete those rows from the NOT top100 dataframe, as they should not appear there no matter which year.

In [29]:
# Each set holds (Title, Artist) pairs; zip pairs the columns by position.

# in top100 list
list_set_yes = set(zip(df_spotipy_final_yes['Title'], df_spotipy_final_yes['Artist']))
print(len(list_set_yes))

# NOT in top100 list
for title in df_spotipy_final_no['Title']:
    # report any row with a missing title
    if pd.isnull(title):
        print("empty")
list_set_no = set(zip(df_spotipy_final_no['Title'], df_spotipy_final_no['Artist']))
print(len(list_set_no))

276
413


In [30]:
aa = set()
aa.add('a')
print(aa)
aa.add('a')
print(aa)

{'a'}
{'a'}


As we can see from the results, a Python set removes duplicates automatically, so we will use set intersection to find the matching entries.<br>
Then we will delete from df_spotipy_final_no every entry that appears in the result (inter_result).

In [31]:
inter_result = list_set_no.intersection(list_set_yes)

print("{} rows found that should not belong to df_spotipy_final_no dataframe.".format(len(inter_result)))
print("Therefore we should have {} rows at the end of the process.\n\n".format(df_spotipy_final_no.shape[0]-len(inter_result)))
print(inter_result)

123 rows found that should not belong to df_spotipy_final_no dataframe.
Therefore we should have 290 rows at the end of the process.


{('Yes Indeed', 'Lil Baby'), ('Natural', 'Imagine Dragons'), ('Starving', 'Hailee Steinfeld'), ('Dangerous Woman', 'Ariana Grande'), ("Don't Let Me Down", 'The Chainsmokers'), ('Low Life', 'Future'), ('Look At Me!', 'XXXTENTACION'), ('Ric Flair Drip (& Metro Boomin)', 'Offset'), ('Look Alive (feat. Drake)', 'BlocBoy JB'), ('Finesse - Remix; feat. Cardi B', 'Bruno Mars'), ('Cheap Thrills', 'Sia'), ('PILLOWTALK', 'ZAYN'), ('Eastside (with Halsey & Khalid)', 'benny blanco'), ('Unforgettable', 'French Montana'), ('Feel It Still', 'Portugal. The Man'), ('Sucker For Pain (with Wiz Khalifa, Imagine Dragons, Logic & Ty Dolla $ign feat. X Ambassadors)', 'Lil Wayne'), ('Wolves', 'Selena Gomez'), ('Malibu', 'Miley Cyrus'), ('Never Forget You', 'Zara Larsson'), ('Hotline Bling', 'Drake'), ('Congratulations', 'Post Malone'), ('LOVE. FEAT. ZACARI.', 'Kendrick Lamar'

In [32]:
# Drop every row whose (Title, Artist) pair also appears in the top100 set.
for title, artist in inter_result:
    df_spotipy_final_no = df_spotipy_final_no[(df_spotipy_final_no['Title'] != title) | (df_spotipy_final_no['Artist'] != artist)]
print("Shape of the final dataset for songs NOT in top100: {}".format(df_spotipy_final_no.shape))
df_spotipy_final_no.tail()

Shape of the final dataset for songs NOT in top100: (290, 24)


Unnamed: 0,Title,Artist,id,artist_genres,is_single,total_tracks,release_date,day_of_week,release_season,Year,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
408,High On Life (feat. Bonn),Martin Garrix,4ut5G4rgB1ClpMTMfjoIuy,"[big room, edm, pop, progressive house, tropic...",1,1,2018-07-29,7,2,2018,...,0.486,0.0,6,0.111,-6.431,0,0.0311,128.038,4,0.368
409,Lost In Japan,Shawn Mendes,79esEXlqqmq0GPz0xQSZTV,"[canadian pop, dance pop, pop, post-teen pop, ...",0,14,2018-05-25,5,1,2018,...,0.738,0.0,10,0.106,-6.784,1,0.374,105.027,4,0.425
410,Faucet Failure,Ski Mask The Slump God,1ThmUihH9dF8EV08ku5AXN,"[miami hip hop, rap, trap, underground hip hop...",0,13,2018-11-30,5,3,2018,...,0.552,0.0,10,0.0952,-9.373,0,0.335,99.993,4,0.615
411,Look Back at It,A Boogie Wit da Hoodie,3Ol2xnObFdKV9pmRD2t9x8,"[melodic rap, pop rap, rap, trap]",0,20,2018-12-21,5,4,2018,...,0.587,0.0,3,0.148,-5.075,0,0.0413,96.057,4,0.536
412,SLOW DANCING IN THE DARK,Joji,0rKtyWc8bvkriBthvHKY8d,"[alternative r&b, viral pop]",0,12,2018-10-26,5,3,2018,...,0.479,0.00598,3,0.191,-7.458,1,0.0261,88.964,4,0.284


As we can see, the number of rows in the dataframe changed as we expected (413 - 123 = 290).
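
The same removal can also be expressed as a single anti-join instead of one filter per pair. The following is only a sketch on toy data, not the method used in this notebook: `toy_yes`, `toy_no`, `top_pairs` and `toy_no_clean` are hypothetical names, and the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (hypothetical rows, illustration only).
toy_yes = pd.DataFrame({'Title': ['Song A', 'Song B'],
                        'Artist': ['Artist 1', 'Artist 2']})
toy_no = pd.DataFrame({'Title': ['Song A', 'Song A', 'Song C'],
                       'Artist': ['Artist 1', 'Artist 9', 'Artist 3']})

# Unique (Title, Artist) pairs of the top100 frame, so the merge adds no rows.
top_pairs = toy_yes[['Title', 'Artist']].drop_duplicates()

# indicator=True adds a '_merge' column telling whether each row of toy_no
# also exists among the top100 pairs ('both') or not ('left_only').
merged = toy_no.merge(top_pairs, on=['Title', 'Artist'], how='left', indicator=True)

# Keep only the rows that exist in toy_no alone.
toy_no_clean = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(toy_no_clean)
```

A row is dropped only when both the title and the artist match, so 'Song A' by a different artist survives. Note that `merge` renumbers the rows, whereas the filtering loop keeps the original index labels.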

<br>**Final steps:** Concatenate the 2 final dataframes, ignoring their indexes, and save them as CSV files for external use.

Let's remind ourselves one last time of the sizes of the two dataframes:

In [33]:
print("Shape of the final dataset for songs in top100: {}".format(df_spotipy_final_yes.shape))
print("Shape of the final dataset for songs NOT in top100: {}".format(df_spotipy_final_no.shape))

Shape of the final dataset for songs in top100: (297, 24)
Shape of the final dataset for songs NOT in top100: (290, 24)


In [34]:
df_spotipy_final = pd.concat([df_spotipy_final_yes, df_spotipy_final_no],
                             axis=0, sort=False, ignore_index=True)
print("Shape of the final dataset for songs: {}".format(df_spotipy_final.shape))

Shape of the final dataset for songs: (587, 24)
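
To illustrate what `ignore_index=True` does in the concatenation, here is a small sketch on toy frames. The rows are invented, and the 1/0 encoding of `is_top100` is an assumption made for this example only.

```python
import pandas as pd

# Toy frames with unrelated index labels (hypothetical rows, illustration only).
# Assumption: is_top100 is encoded as 1 for top100 songs and 0 otherwise.
toy_yes = pd.DataFrame({'Title': ['Song A', 'Song B'], 'is_top100': [1, 1]})
toy_no = pd.DataFrame({'Title': ['Song C', 'Song D'], 'is_top100': [0, 0]},
                      index=[7, 9])

# ignore_index=True discards the old labels and renumbers the rows from 0.
toy_final = pd.concat([toy_yes, toy_no], axis=0, sort=False, ignore_index=True)
print(toy_final.index.tolist())

# Quick sanity check: how many rows carry each label after concatenation.
print(toy_final['is_top100'].value_counts().to_dict())
```

Without `ignore_index=True` the result would keep the labels 0, 1, 7, 9, which matters here because the filtered NOT top100 dataframe no longer has a contiguous index.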


In [35]:
df_spotipy_final_yes.to_csv('./data/top100spotify.csv')
df_spotipy_final_no.to_csv('./data/NOTtop100spotify.csv')

In [36]:
df_spotipy_final.to_csv('./data/spotify.csv')
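
When these files are loaded again, keep in mind that `to_csv` writes the index as an unnamed first column by default. The sketch below uses an in-memory buffer and invented rows (no real file is touched) to show how that column comes back, and how `index_col=0` avoids it.

```python
import io
import pandas as pd

# Toy frame (hypothetical rows, illustration only).
toy = pd.DataFrame({'Title': ['Song A', 'Song B'], 'tempo': [100.0, 120.5]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading without index_col yields an extra 'Unnamed: 0' column ...
naive = pd.read_csv(io.StringIO(csv_text))
# ... while index_col=0 turns that column back into the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` omits the index column from the file altogether.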