# Music Popularity Prediction
DJ Khaled boldly claimed to always know when a song will be a hit. We decided to investigate further by asking three key questions:
1. Are there certain characteristics that hit songs share?
2. What are the largest influences on a song’s success?
3. Can old songs predict the popularity of new songs?

## Creating Dataset
No relevant, up-to-date dataset exists for this analysis, so we decided to build our own through data collection.
For this project, we use data from Spotify, one of the largest music streaming services in the world. We use the Spotify Web API to collect the raw data from a custom playlist of 1300 songs.
For demonstration purposes, however, we use a smaller playlist of 32 songs for now.

## DataFrames
Our dataset contains three data frames: Album Data, Artist Data, and Song Data.
The following code demonstrates the creation of these DataFrames, including the Song DataFrame.

In [1]:
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spotipy
from matplotlib import style
from spotipy import util
from spotipy.oauth2 import SpotifyClientCredentials 
loc='G:\\Projects\\Music-Popularity-Predicition\\DataMining'

### Step 1 : Setting Up Connection
Set up a connection to the Spotify Web API and give it access to my Spotify account using an automatically generated token. We need this to read my 32-song playlist, called 'Best Mixtape Ever'.

In [6]:
#Set up Connection 
client_id = 'a90f2a20aea54d5781d4cc7a38ab6694' #Need to create developer profile
client_secret = 'b8810bc6ac1d4877b78c45a8654ebf18'
username = 'f0uthgq19mkamht0gno58wkt0' #Store username
scope = 'playlist-read-collaborative'
redirect_uri='http://localhost/'
client_credentials_manager = SpotifyClientCredentials(
    client_id=client_id, client_secret=client_secret)  # Create manager for ease
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
token = util.prompt_for_user_token(username, scope, client_id,
                                   client_secret, redirect_uri)

# Prefer the user-authorised token when one was granted
if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)
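Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are exported as environment variables beforehand. The variable names follow the `SPOTIPY_*` convention from spotipy's documentation, but treat them as an assumption here, and `load_spotify_credentials` is a hypothetical helper, not part of spotipy:

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper for illustration; the variable names are assumptions.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
demo_client_id, demo_client_secret = load_spotify_credentials(demo_env)
```

In the notebook, the two returned values would take the place of the literal `client_id` and `client_secret` strings.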

In [7]:
#Connect to 'the ultimate playlist'
playlist_id = 'spotify:playlist:0vW19IAl4yjeg2IlfNuBFh'
playlist = sp.user_playlist(username,playlist_id)

### Step 2 : Collecting the Data

In [8]:
def user_playlist_tracks_full(spotify_connection, user, playlist_id=None, fields=None, market=None):
    """ Get full details of the tracks of a playlist owned by a user.
        https://developer.spotify.com/documentation/web-api/reference/playlists/get-playlists-tracks/

        Parameters:
            - user - the id of the user
            - playlist_id - the id of the playlist
            - fields - which fields to return
            - market - an ISO 3166-1 alpha-2 country code.
    """

    # The first request also returns the total number of songs in the playlist
    response = spotify_connection.user_playlist_tracks(user, playlist_id, fields=fields, limit=100, market=market)
    results = response["items"]

    # Keep requesting pages of up to 100 songs until the whole playlist has been read
    while len(results) < response["total"]:
        response = spotify_connection.user_playlist_tracks(user, playlist_id, fields=fields, limit=100, offset=len(results), market=market)
        results.extend(response["items"])
    return results
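The function above uses offset-based pagination: it requests one page, then keeps requesting from `offset=len(results)` until the reported `total` is reached. A minimal sketch of the same pattern that runs without network access. `fake_playlist_tracks` is a hypothetical stand-in for `user_playlist_tracks` and only mimics the `items` and `total` fields of the response:

```python
def fake_playlist_tracks(all_items, limit=100, offset=0):
    """Hypothetical stand-in for the API call: returns one page of items."""
    return {"items": all_items[offset:offset + limit], "total": len(all_items)}


def fetch_all(all_items, limit=100):
    """Collect every item by requesting successive pages."""
    response = fake_playlist_tracks(all_items, limit=limit)
    results = list(response["items"])
    while len(results) < response["total"]:
        response = fake_playlist_tracks(all_items, limit=limit, offset=len(results))
        if not response["items"]:
            break  # guard against an endless loop if a page comes back empty
        results.extend(response["items"])
    return results


tracks = [f"track-{n}" for n in range(250)]
collected = fetch_all(tracks, limit=100)
print(len(collected))
```

The empty-page guard is an addition for safety; the function in the notebook assumes every page returns at least one item.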


# Artist Table

In [10]:
results=user_playlist_tracks_full(sp,username,playlist_id)

In [11]:
raw_playlist_data=[]
for item in results:
    if item['track']['id'] is not None: # Removes local tracks, if any
        raw_playlist_data.append(item['track'])

In [12]:
# Keep only the first-listed artist of each track
artistData=[track['artists'][0] for track in raw_playlist_data]

In [13]:
#correct the format of external link
for i in range(len(artistData)):
    artistData[i]['url']=artistData[i]['external_urls']['spotify']
    del artistData[i]['href']
    del artistData[i]['external_urls']
    del artistData[i]['type']

In [14]:
artistData[0]

{'id': '5YGY8feqx7naU7z4HrwZM6',
 'name': 'Miley Cyrus',
 'uri': 'spotify:artist:5YGY8feqx7naU7z4HrwZM6',
 'url': 'https://open.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6'}
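The cleanup cell edits each artist dictionary in place, which also changes the matching entries inside `raw_playlist_data`, because both lists hold the same dictionary objects. A sketch of a non-mutating alternative, assuming the record layout shown in the output above. `flatten_artist` is a hypothetical helper name, and the `href` value in the sample record is an assumed placeholder in the usual Web API form:

```python
def flatten_artist(artist):
    """Return a new dict with the Spotify link promoted to a top-level 'url' key."""
    return {
        "id": artist["id"],
        "name": artist["name"],
        "uri": artist["uri"],
        "url": artist["external_urls"]["spotify"],
    }


# Sample record rebuilt from the output above (href is an assumed placeholder)
raw_artist = {
    "external_urls": {"spotify": "https://open.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6"},
    "href": "https://api.spotify.com/v1/artists/5YGY8feqx7naU7z4HrwZM6",
    "id": "5YGY8feqx7naU7z4HrwZM6",
    "name": "Miley Cyrus",
    "type": "artist",
    "uri": "spotify:artist:5YGY8feqx7naU7z4HrwZM6",
}
flat_artist = flatten_artist(raw_artist)
```

Because the helper builds a new dictionary, the raw record keeps its original keys and can still be reused later.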

In [15]:
df=pd.DataFrame(artistData)

In [16]:
df.shape

(1491, 4)

In [17]:
print(len(df))
final_df = df.drop_duplicates('id').sort_index()
print(len(final_df))

1491
430


In [18]:
final_df.to_csv(loc+"\\ArtistData.csv")

# Album Data

In [19]:
albumData=[]
for i in range(len(raw_playlist_data)):
    albumData.append(raw_playlist_data[i]['album'])

In [20]:
for i in range(len(albumData)):
    albumData[i]['url']=albumData[i]['external_urls']['spotify']
    albumData[i]['image']=albumData[i]['images'][0]['url']
    del albumData[i]['artists']
    del albumData[i]['available_markets']
    del albumData[i]['external_urls']
    del albumData[i]['href']
    del albumData[i]['images']
    del albumData[i]['release_date_precision']
    del albumData[i]['type']

In [21]:
albumData[0]

{'album_type': 'album',
 'id': '0IuHVgAvbNDJnJepuSZ8Oz',
 'name': 'The Time Of Our Lives (International Version)',
 'release_date': '2009-01-01',
 'total_tracks': 8,
 'uri': 'spotify:album:0IuHVgAvbNDJnJepuSZ8Oz',
 'url': 'https://open.spotify.com/album/0IuHVgAvbNDJnJepuSZ8Oz',
 'image': 'https://i.scdn.co/image/871910879d96353fd909a24307625282cd5bbf69'}

In [22]:
df=pd.DataFrame(albumData)

In [23]:
df.shape

(1491, 8)

In [24]:
final_df = df.drop_duplicates('id').sort_index()
print(len(final_df))

1196


In [25]:
final_df.to_csv(loc+"\\AlbumData.csv")

# Song Table

In [26]:
raw_playlist_data[0].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'episode', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track', 'track_number', 'type', 'uri'])

In [27]:
songData=[]
for i in range(len(raw_playlist_data)):
    d={}
    d['duration_ms']=raw_playlist_data[i]['duration_ms']
    d['explict']=raw_playlist_data[i]['explicit']
    d['external_urls']=raw_playlist_data[i]['external_urls']
    d['id']=raw_playlist_data[i]['id']
    d['name']=raw_playlist_data[i]['name']
    d['popularity']=raw_playlist_data[i]['popularity']
    d['uri']=raw_playlist_data[i]['uri']
    songData.append(d)

for i in range(len(songData)):
    songData[i]['url']=songData[i]['external_urls']['spotify']   
    del songData[i]['external_urls']

In [28]:
songData[0]

{'duration_ms': 202066,
 'explict': False,
 'id': '3E7dfMvvCLUddWissuqMwr',
 'name': 'Party In The U.S.A.',
 'popularity': 80,
 'uri': 'spotify:track:3E7dfMvvCLUddWissuqMwr',
 'url': 'https://open.spotify.com/track/3E7dfMvvCLUddWissuqMwr'}

## Audio Feature Extraction
To extract the audio features of a song, all we need is the song's URI, which is available in raw_playlist_data.

In [29]:
def audio_features(element):
    # Audio features to copy from the API response
    feature_names = ['acousticness', 'danceability', 'energy',
                     'instrumentalness', 'liveness', 'loudness',
                     'speechiness', 'tempo', 'valence']

    # One request per song; the response is a list with one entry
    features = sp.audio_features(element['uri'])

    # Store each value in a one-element list under its own key
    for name in feature_names:
        element[name] = [features[0][name]]
    # popularity is already stored in songData, so it is not fetched again

In [30]:
for i in songData:
    audio_features(i)

In [31]:
# Unwrap the one-element lists so that each feature holds a plain number
feature_names = ['acousticness', 'danceability', 'energy',
                 'instrumentalness', 'liveness', 'loudness',
                 'speechiness', 'tempo', 'valence']
for song in songData:
    for name in feature_names:
        song[name] = song[name][0]
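Fetching features one song at a time costs one request per track. spotipy's `audio_features` also accepts a list of tracks (up to 100 per call, according to its documentation), so the requests can be batched. A minimal sketch under that assumption. `chunked` and `add_audio_features` are hypothetical helpers, and `fake_audio_features` stands in for `sp.audio_features` so that the example runs offline:

```python
FEATURE_NAMES = ["acousticness", "danceability", "energy", "instrumentalness",
                 "liveness", "loudness", "speechiness", "tempo", "valence"]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_audio_features(songs, fetch, batch_size=100):
    """Copy the selected audio features onto each song dict, one request per batch."""
    for batch in chunked(songs, batch_size):
        features = fetch([song["uri"] for song in batch])
        for song, feature in zip(batch, features):
            if feature is None:  # the API may return no features for a track
                continue
            for name in FEATURE_NAMES:
                song[name] = feature[name]


def fake_audio_features(uris):
    """Hypothetical stand-in for sp.audio_features: returns dummy values."""
    return [{name: 0.5 for name in FEATURE_NAMES} for _ in uris]


batch_sizes = []  # records how many URIs each request carried


def counting_fetch(uris):
    batch_sizes.append(len(uris))
    return fake_audio_features(uris)


demo_songs = [{"uri": f"spotify:track:{n}"} for n in range(250)]
add_audio_features(demo_songs, counting_fetch)
print(batch_sizes)
```

With the real client, the call would presumably be `add_audio_features(songData, sp.audio_features)`. This version writes plain numbers directly, so no separate unwrapping step is needed afterwards.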

In [32]:
songData[0]

{'duration_ms': 202066,
 'explict': False,
 'id': '3E7dfMvvCLUddWissuqMwr',
 'name': 'Party In The U.S.A.',
 'popularity': 80,
 'uri': 'spotify:track:3E7dfMvvCLUddWissuqMwr',
 'url': 'https://open.spotify.com/track/3E7dfMvvCLUddWissuqMwr',
 'acousticness': 0.00112,
 'danceability': 0.652,
 'energy': 0.698,
 'instrumentalness': 0.000115,
 'liveness': 0.0886,
 'loudness': -4.667,
 'speechiness': 0.042,
 'tempo': 96.021,
 'valence': 0.47}

In [33]:
df=pd.DataFrame(songData)

In [34]:
df.head()

| | acousticness | danceability | duration_ms | energy | explict | id | instrumentalness | liveness | loudness | name | popularity | speechiness | tempo | uri | url | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00112 | 0.652 | 202066 | 0.698 | False | 3E7dfMvvCLUddWissuqMwr | 0.000115 | 0.0886 | -4.667 | Party In The U.S.A. | 80 | 0.042 | 96.021 | spotify:track:3E7dfMvvCLUddWissuqMwr | https://open.spotify.com/track/3E7dfMvvCLUddWi... | 0.47 |
| 1 | 0.38 | 0.866 | 199440 | 0.813 | False | 7JJmb5XwzOO8jgpou264Ml | 0.0 | 0.0779 | -4.063 | There's Nothing Holdin' Me Back | 84 | 0.0554 | 121.998 | spotify:track:7JJmb5XwzOO8jgpou264Ml | https://open.spotify.com/track/7JJmb5XwzOO8jgp... | 0.969 |
| 2 | 0.192 | 0.695 | 215280 | 0.762 | False | 51MMC5DogGZAnHil5HQAXg | 0.00244 | 0.0863 | -3.497 | Circles | 81 | 0.0395 | 120.042 | spotify:track:51MMC5DogGZAnHil5HQAXg | https://open.spotify.com/track/51MMC5DogGZAnHi... | 0.553 |
| 3 | 0.751 | 0.501 | 182160 | 0.405 | False | 2TIlqbIneP0ZY1O0EzYLlc | 0.0 | 0.105 | -5.679 | Someone You Loved | 88 | 0.0319 | 109.891 | spotify:track:2TIlqbIneP0ZY1O0EzYLlc | https://open.spotify.com/track/2TIlqbIneP0ZY1O... | 0.446 |
| 4 | 0.0906 | 0.499 | 229573 | 0.8 | False | 4Yq3XUNfWrAPWuB94qkC09 | 0.0 | 0.147 | -2.665 | Take What You Want (feat. Ozzy Osbourne & Trav... | 78 | 0.0502 | 139.919 | spotify:track:4Yq3XUNfWrAPWuB94qkC09 | https://open.spotify.com/track/4Yq3XUNfWrAPWuB... | 0.272 |


In [35]:
# Remove duplicate songs, if any, keeping the most popular entry for each id
print(len(df))
final_df = df.sort_values('popularity', ascending=False).drop_duplicates('id').sort_index()
print(len(final_df))

1491
1490
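Sorting by popularity before calling `drop_duplicates` matters because `drop_duplicates` keeps the first row it encounters for each `id` by default. A small sketch of that behaviour with made-up rows (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Two rows share id "a" but report different popularity scores
demo_df = pd.DataFrame([
    {"id": "a", "name": "Song A", "popularity": 50},
    {"id": "b", "name": "Song B", "popularity": 70},
    {"id": "a", "name": "Song A", "popularity": 65},
])

deduped = (
    demo_df.sort_values("popularity", ascending=False)  # most popular rows first
    .drop_duplicates("id")                              # keep the first row per id
    .sort_index()                                       # restore the original row order
)
print(deduped)
```

Here the row with popularity 65 survives for id "a", and the final `sort_index` puts the remaining rows back in playlist order.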


In [36]:
final_df.to_csv(loc+"\\SongsData.csv")