# Music Popularity Prediction
DJ Khaled boldly claimed to always know when a song will be a hit. We decided to investigate further by asking three key questions:
1. Are there certain characteristics shared by hit songs?
2. What are the largest influences on a song’s success?
3. Can old songs predict the popularity of new songs?

## Creating Dataset
No relevant, up-to-date dataset exists for this analysis, so we decided to build our own through data collection.
For this project, we'll use data from Spotify, one of the largest music streaming services in the world. We'll use the Spotify Web API to collect our raw data from a custom playlist of 1300 songs.
For demonstration purposes, however, we will use a smaller playlist of 32 songs for now.

## DataFrames
Our dataset will contain three DataFrames: Album Data, Artist Data, and Song Data.
The following code demonstrates the creation of the Song DataFrame.

In [56]:
#Import Library
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spotipy
from matplotlib import style
from spotipy import util
from spotipy.oauth2 import SpotifyClientCredentials 
loc='G:\\Projects\\MIR\\DataMining'

### Step 1 : Setting Up Connection
Set up a connection to the Spotify Web API and give it access to my Spotify account using an automatically generated token. We need this to read my 32-song playlist, 'Best Mixtape Ever'.

In [57]:
#Set up Connection 
client_id = 'a90f2a20aea54d5781d4cc7a38ab6694' #Need to create developer profile
client_secret = 'b8810bc6ac1d4877b78c45a8654ebf18'
username = 'f0uthgq19mkamht0gno58wkt0' #Store username
scope = 'playlist-read-collaborative'
redirect_uri='http://localhost/'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, 
client_secret=client_secret)#Create manager for ease
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
token = util.prompt_for_user_token(username, scope, client_id, 
client_secret, redirect_uri)

if token:
 sp = spotipy.Spotify(auth=token)
else:
 print("Can't get token for", username)
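Hard-coding the client ID and secret in a notebook makes them easy to leak. The following is a minimal sketch of an alternative, assuming the credentials are exported as the environment variables `SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET` and `SPOTIPY_REDIRECT_URI` (the names spotipy itself looks for). `load_spotify_credentials` is a hypothetical helper written for this illustration, not part of spotipy.

```python
import os


def load_spotify_credentials(env=None):
    """Read Spotify API credentials from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "SPOTIPY_CLIENT_ID",
        "client_secret": "SPOTIPY_CLIENT_SECRET",
        "redirect_uri": "SPOTIPY_REDIRECT_URI",
    }
    # Fail early with a readable message instead of a KeyError deep in the auth flow.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned values could then replace the literals above, for example `SpotifyClientCredentials(client_id=creds["client_id"], client_secret=creds["client_secret"])`.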

In [62]:
#Connect to the 'best mixtape ever'
playlist_id = 'spotify:playlist:6X4iVRG9PAyNT7gRUAZEMb'
playlist = sp.user_playlist(username,playlist_id)

### Step 2 : Collecting the Data

In [63]:
def user_playlist_tracks_full(spotify_connection, user, playlist_id=None, fields=None, market=None):
    """ Get full details of the tracks of a playlist owned by a user.
        https://developer.spotify.com/documentation/web-api/reference/playlists/get-playlists-tracks/

        Parameters:
            - user - the id of the user
            - playlist_id - the id of the playlist
            - fields - which fields to return
            - market - an ISO 3166-1 alpha-2 country code.
    """

    # first run through also retrieves total no of songs in library
    response = spotify_connection.user_playlist_tracks(user, playlist_id, fields=fields, limit=100, market=market)
    results = response["items"]
    

    # subsequently runs until it hits the user-defined limit or has read all songs in the library
    while len(results) < response["total"]:
        response = spotify_connection.user_playlist_tracks(user, playlist_id, fields=fields, limit=100, offset=len(results), market=market)
        results.extend(response["items"]) 
    return(results)


In [64]:
results=user_playlist_tracks_full(sp,username,playlist_id)
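The function above pages through the playlist 100 tracks at a time. To see the offset logic in isolation, here is a small sketch that runs without network access. `FakePlaylistClient` and `collect_all_tracks` are hypothetical names used only for this illustration; the fake client mimics the `items`/`total` shape of the response that the real function relies on.

```python
class FakePlaylistClient:
    """Hypothetical stand-in for the spotipy client, serving tracks in pages."""

    def __init__(self, tracks):
        self._tracks = tracks

    def user_playlist_tracks(self, user, playlist_id, fields=None,
                             limit=100, offset=0, market=None):
        page = self._tracks[offset:offset + limit]
        return {"items": page, "total": len(self._tracks)}


def collect_all_tracks(client, user, playlist_id, page_size=100):
    """Request pages at increasing offsets until every track has been read."""
    response = client.user_playlist_tracks(user, playlist_id, limit=page_size)
    items = list(response["items"])
    while len(items) < response["total"]:
        response = client.user_playlist_tracks(
            user, playlist_id, limit=page_size, offset=len(items)
        )
        if not response["items"]:  # guard against an endless loop on an empty page
            break
        items.extend(response["items"])
    return items


fake = FakePlaylistClient([{"track": {"id": str(n)}} for n in range(250)])
tracks = collect_all_tracks(fake, "some_user", "some_playlist")
print(len(tracks))  # 250, gathered over three requests
```

The empty-page guard is an addition in this sketch: it stops the loop if the reported total is larger than the number of items the API actually returns.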


#### results is a list of dictionaries containing the metadata of each song

In [65]:
results[3]['track']['album']

{'album_type': 'album',
 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/7HCqGPJcQTyGJ2yqntbuyr'},
   'href': 'https://api.spotify.com/v1/artists/7HCqGPJcQTyGJ2yqntbuyr',
   'id': '7HCqGPJcQTyGJ2yqntbuyr',
   'name': 'Amit Trivedi',
   'type': 'artist',
   'uri': 'spotify:artist:7HCqGPJcQTyGJ2yqntbuyr'}],
 'available_markets': ['AD',
  'AE',
  'AR',
  'AT',
  'AU',
  'BE',
  'BG',
  'BH',
  'BO',
  'BR',
  'CA',
  'CH',
  'CL',
  'CO',
  'CR',
  'CY',
  'CZ',
  'DE',
  'DK',
  'DO',
  'DZ',
  'EC',
  'EE',
  'EG',
  'ES',
  'FI',
  'FR',
  'GB',
  'GR',
  'GT',
  'HK',
  'HN',
  'HU',
  'ID',
  'IE',
  'IL',
  'IN',
  'IS',
  'IT',
  'JO',
  'JP',
  'KW',
  'LB',
  'LI',
  'LT',
  'LU',
  'LV',
  'MA',
  'MC',
  'MT',
  'MX',
  'MY',
  'NI',
  'NL',
  'NO',
  'NZ',
  'OM',
  'PA',
  'PE',
  'PH',
  'PL',
  'PS',
  'PT',
  'PY',
  'QA',
  'RO',
  'SA',
  'SE',
  'SG',
  'SK',
  'SV',
  'TH',
  'TN',
  'TR',
  'TW',
  'US',
  'UY',
  'VN',
  'ZA'],
 'external_u

In [78]:
## results list contains the following 6 keys for the dictionary of each song
results[0].keys()

dict_keys(['added_at', 'added_by', 'is_local', 'primary_color', 'track', 'video_thumbnail'])

In [66]:
track_uri=[]
raw_playlist_data=[]
for i in range(0,len(results)):
    if results[i]['track']['id'] is not None: # Removes local tracks, if any
        play=results[i]
        raw_playlist_data.append(play['track'])
        track_uri.append(play['track']['uri'])
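The same filtering step can be written as a list comprehension. This is a sketch on made-up sample data: the second and third items are invented to show a local track without an ID and an item whose track is missing altogether, and `playable_tracks` is a hypothetical helper name.

```python
sample_results = [
    {"track": {"id": "3NZJlJemX3mzjf56MqC5ML",
               "uri": "spotify:track:3NZJlJemX3mzjf56MqC5ML"}},
    {"track": {"id": None, "uri": "spotify:local:example"}},  # invented local track
    {"track": None},                                          # invented missing track
]


def playable_tracks(items):
    """Return the 'track' dicts that have a Spotify ID, skipping local or missing tracks."""
    return [
        item["track"]
        for item in items
        if item.get("track") and item["track"].get("id") is not None
    ]


tracks = playable_tracks(sample_results)
uris = [track["uri"] for track in tracks]
print(uris)
```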
               

### raw_playlist_data is a list containing the value of the 'track' key from each item in results

In [67]:
raw_playlist_data[21]['album']

{'album_type': 'album',
 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3DiDSECUqqY1AuBP8qtaIa'},
   'href': 'https://api.spotify.com/v1/artists/3DiDSECUqqY1AuBP8qtaIa',
   'id': '3DiDSECUqqY1AuBP8qtaIa',
   'name': 'Alicia Keys',
   'type': 'artist',
   'uri': 'spotify:artist:3DiDSECUqqY1AuBP8qtaIa'}],
 'available_markets': ['AD',
  'AE',
  'AR',
  'AT',
  'AU',
  'BE',
  'BG',
  'BH',
  'BO',
  'BR',
  'CA',
  'CH',
  'CL',
  'CO',
  'CR',
  'CY',
  'CZ',
  'DE',
  'DK',
  'DO',
  'DZ',
  'EC',
  'EE',
  'EG',
  'ES',
  'FI',
  'FR',
  'GB',
  'GR',
  'GT',
  'HK',
  'HN',
  'HU',
  'ID',
  'IE',
  'IL',
  'IN',
  'IS',
  'IT',
  'JO',
  'JP',
  'KW',
  'LB',
  'LI',
  'LT',
  'LU',
  'LV',
  'MA',
  'MC',
  'MT',
  'MX',
  'MY',
  'NI',
  'NL',
  'NO',
  'NZ',
  'OM',
  'PA',
  'PE',
  'PH',
  'PL',
  'PS',
  'PT',
  'PY',
  'QA',
  'RO',
  'SA',
  'SE',
  'SG',
  'SK',
  'SV',
  'TH',
  'TN',
  'TR',
  'TW',
  'US',
  'UY',
  'VN',
  'ZA'],
 'external_ur

In [68]:
for i in range(len(raw_playlist_data)):
    del raw_playlist_data[i]['available_markets']
    del raw_playlist_data[i]['disc_number']
    del raw_playlist_data[i]['episode']
    del raw_playlist_data[i]['external_ids']
    del raw_playlist_data[i]['external_urls']
    del raw_playlist_data[i]['is_local']
    del raw_playlist_data[i]['preview_url']
    del raw_playlist_data[i]['type']
    del raw_playlist_data[i]['album']
    del raw_playlist_data[i]['artists']
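Deleting keys one by one raises a `KeyError` as soon as a track lacks one of them. An alternative sketch is to keep an explicit list of wanted fields instead. The field names below are the ones that remain in the output of the next cell; `keep_fields` is a hypothetical helper, and the sample track is cut down for illustration.

```python
KEEP = ("duration_ms", "explicit", "href", "id", "name",
        "popularity", "track", "track_number", "uri")


def keep_fields(track, fields=KEEP):
    """Return a new dict holding only the wanted fields that are present."""
    return {key: track[key] for key in fields if key in track}


sample_track = {
    "id": "3NZJlJemX3mzjf56MqC5ML",
    "name": "Forever",
    "popularity": 72,
    "available_markets": ["AD", "AE"],  # dropped
    "album": {"album_type": "album"},   # dropped
}

print(keep_fields(sample_track))
```

Because the helper builds a new dict, the original API response stays untouched, which helps when the album and artist data are needed later for the other two DataFrames.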

In [70]:
raw_playlist_data[0]

{'duration_ms': 278573,
 'explicit': False,
 'href': 'https://api.spotify.com/v1/tracks/3NZJlJemX3mzjf56MqC5ML',
 'id': '3NZJlJemX3mzjf56MqC5ML',
 'name': 'Forever',
 'popularity': 72,
 'track': True,
 'track_number': 1,
 'uri': 'spotify:track:3NZJlJemX3mzjf56MqC5ML'}

## Audio Feature Extraction
To extract the audio features of a song, all we need is its URI, which is available in raw_playlist_data.

In [71]:
def audio_features(element):
    #Add new key-values to store audio features
    element['acousticness'] = []
    element['danceability'] = []
    element['energy'] = []
    element['instrumentalness'] = []
    element['liveness'] = []
    element['loudness'] = []
    element['speechiness'] = []
    element['tempo'] = []
    element['valence'] = []
    #element['popularity'] = []
    #create a track counter
    
    features = sp.audio_features(element['uri'])
        
    #Append to relevant key-value
    element['acousticness'].append(features[0]['acousticness'])
    element['danceability'].append(features[0]['danceability'])
    element['energy'].append(features[0]['energy'])
    element['instrumentalness'].append(features[0]['instrumentalness'])
    element['liveness'].append(features[0]['liveness'])
    element['loudness'].append(features[0]['loudness'])
    element['speechiness'].append(features[0]['speechiness'])
    element['tempo'].append(features[0]['tempo'])
    element['valence'].append(features[0]['valence'])
    #popularity is stored elsewhere
    #pop = sp.track(i)
    #element['popularity'].append(pop['popularity'])

In [72]:
for i in raw_playlist_data:
    audio_features(i)
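The function above wraps every feature in a one-element list, which is why the values appear as `[0.0368]` rather than `0.0368` further down. A sketch of a variant that stores plain numbers instead is shown here. `with_audio_features` is a hypothetical helper, and the `features` argument is assumed to be the first element of the list returned by `sp.audio_features(track['uri'])`, as in the cell above. The sample values are copied from the output for 'Forever'.

```python
FEATURE_NAMES = ("acousticness", "danceability", "energy", "instrumentalness",
                 "liveness", "loudness", "speechiness", "tempo", "valence")


def with_audio_features(track, features):
    """Return a copy of `track` with each audio feature added as a plain number."""
    merged = dict(track)
    for name in FEATURE_NAMES:
        merged[name] = features[name]
    return merged


track = {"name": "Forever", "uri": "spotify:track:3NZJlJemX3mzjf56MqC5ML"}
features = {"acousticness": 0.0368, "danceability": 0.672, "energy": 0.82,
            "instrumentalness": 0.000188, "liveness": 0.184, "loudness": -4.456,
            "speechiness": 0.0459, "tempo": 120.005, "valence": 0.438}

enriched = with_audio_features(track, features)
print(enriched["tempo"])  # 120.005
```

Scalar columns can be averaged, plotted or correlated with popularity directly, without unwrapping each cell first.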

In [73]:
raw_playlist_data[0]

{'duration_ms': 278573,
 'explicit': False,
 'href': 'https://api.spotify.com/v1/tracks/3NZJlJemX3mzjf56MqC5ML',
 'id': '3NZJlJemX3mzjf56MqC5ML',
 'name': 'Forever',
 'popularity': 72,
 'track': True,
 'track_number': 1,
 'uri': 'spotify:track:3NZJlJemX3mzjf56MqC5ML',
 'acousticness': [0.0368],
 'danceability': [0.672],
 'energy': [0.82],
 'instrumentalness': [0.000188],
 'liveness': [0.184],
 'loudness': [-4.456],
 'speechiness': [0.0459],
 'tempo': [120.005],
 'valence': [0.438]}

In [75]:
df=pd.DataFrame(raw_playlist_data)

In [76]:
df

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,href,id,instrumentalness,liveness,loudness,name,popularity,speechiness,tempo,track,track_number,uri,valence
0,[0.0368],[0.672],278573,[0.82],False,https://api.spotify.com/v1/tracks/3NZJlJemX3mz...,3NZJlJemX3mzjf56MqC5ML,[0.000188],[0.184],[-4.456],Forever,72,[0.0459],[120.005],True,1,spotify:track:3NZJlJemX3mzjf56MqC5ML,[0.438]
1,[0.292],[0.542],303200,[0.589],False,https://api.spotify.com/v1/tracks/7w5cxTEzp1rf...,7w5cxTEzp1rfV3KCy0Bd5N,[0.000522],[0.177],[-6.623],Home,71,[0.0318],[111.665],True,6,spotify:track:7w5cxTEzp1rfV3KCy0Bd5N,[0.137]
2,[0.912],[0.671],211933,[0.153],False,https://api.spotify.com/v1/tracks/3oQomOPRNQ5N...,3oQomOPRNQ5NVFUmLJHbAV,[5.53e-05],[0.0771],[-13.569],Over the Rainbow,68,[0.0404],[85.6],True,10,spotify:track:3oQomOPRNQ5NVFUmLJHbAV,[0.658]
3,[0.827],[0.722],284626,[0.288],False,https://api.spotify.com/v1/tracks/5Q58RkKyUafm...,5Q58RkKyUafm15Syxg79DW,[1.06e-05],[0.12],[-10.767],Sham,55,[0.0444],[119.899],True,3,spotify:track:5Q58RkKyUafm15Syxg79DW,[0.502]
4,[0.501],[0.606],274040,[0.282],False,https://api.spotify.com/v1/tracks/683b4ikwa62J...,683b4ikwa62JevCjwrmfg6,[8.3e-06],[0.152],[-12.207],Moondance - 2013 Remaster,68,[0.0339],[67.409],True,2,spotify:track:683b4ikwa62JevCjwrmfg6,[0.563]
5,[0.97],[0.264],182893,[0.22],False,https://api.spotify.com/v1/tracks/59QTPCnWwfFT...,59QTPCnWwfFTSrzvOozHNL,[0.0459],[0.104],[-15.359],Remember,28,[0.034],[128.964],True,12,spotify:track:59QTPCnWwfFTSrzvOozHNL,[0.159]
6,[0.857],[0.552],191786,[0.371],False,https://api.spotify.com/v1/tracks/1T575AhHueYi...,1T575AhHueYinKSDflEsGK,[0],[0.0838],[-9.104],Heaven,72,[0.0359],[199.843],True,5,spotify:track:1T575AhHueYinKSDflEsGK,[0.245]
7,[0.638],[0.229],157866,[0.494],False,https://api.spotify.com/v1/tracks/74VR3AkGPhbY...,74VR3AkGPhbYXnxcOYa16x,[0],[0.217],[-8.22],Strangers In The Night,66,[0.0311],[80.092],True,1,spotify:track:74VR3AkGPhbYXnxcOYa16x,[0.526]
8,[0.128],[0.574],300106,[0.629],False,https://api.spotify.com/v1/tracks/4jDmJ51x1o9N...,4jDmJ51x1o9NZB5Nxxc7gY,[0],[0.271],[-8.815],Careless Whisper,75,[0.0363],[153.119],True,3,spotify:track:4jDmJ51x1o9NZB5Nxxc7gY,[0.786]
9,[0.0159],[0.433],210549,[0.701],False,https://api.spotify.com/v1/tracks/72sk4hbxsJWd...,72sk4hbxsJWdQ3JMNIUUpY,[4.99e-06],[0.0774],[-3.869],It’s Nice to Be Alive,0,[0.0445],[95.095],True,10,spotify:track:72sk4hbxsJWdQ3JMNIUUpY,[0.744]


In [80]:
## Remove duplicate songs, if any, keeping the most popular entry for each name
print(len(df))
final_df = df.sort_values('popularity', ascending=False).drop_duplicates('name').sort_index()
print(len(final_df))

32
32
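Deduplicating on `name` also merges different songs that happen to share a title. A sketch on invented sample rows, assuming the track `id` is the better key, shows the difference:

```python
import pandas as pd

songs = pd.DataFrame([
    {"id": "a1", "name": "Home", "popularity": 71},
    {"id": "a1", "name": "Home", "popularity": 71},    # exact duplicate
    {"id": "b2", "name": "Home", "popularity": 40},    # different song, same title
    {"id": "c3", "name": "Heaven", "popularity": 72},
])

# Approach used above: one row per song name, keeping the most popular.
by_name = (songs.sort_values("popularity", ascending=False)
                .drop_duplicates("name")
                .sort_index())

# Alternative: one row per Spotify track ID.
by_id = songs.drop_duplicates(subset="id").sort_index()

print(len(by_name), len(by_id))  # 2 3
```

For the 32-song playlist both approaches keep all 32 rows, so the choice only matters on larger playlists such as the 1300-song one.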


In [241]:
final_df.to_csv(loc+"\\Songs.csv")