# Problem Statement

Spotify uses a "popularity" score to rank songs, albums, and artists. This score is based on how often users stream songs on Spotify. But how does it compare with other metrics?

What about attributes of the music itself, such as danceability, energy, and acousticness? What about the content of an artist's lyrics, or Twitter users' reactions to the same music or artist? How does each of these factors affect our ability to predict the popularity of an artist or song?

Finally, when using regression modeling and natural language processing (NLP) classification to predict a musical artist's popularity, how can we combine Spotify and non-Spotify popularity metrics to recommend which rising pop artists to fund, advertise, and support?

# Executive Summary

# Spotify Data Collection

In [1]:
# Referencing Spotipy API Tutorial by Medium Author Well Loot for following code
# https://medium.com/@RareLoot/extracting-spotify-data-on-your-favourite-artist-via-python-d58bc92a4330

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials #To access authorised Spotify data
import spotipy.util as util

In [2]:
import os  # credentials are read from the environment instead of being hard-coded in the notebook
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]

In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager) #spotify object to access API

In [4]:
# Test the search endpoint: list the artists credited on the first track returned for the query
name = "Nicki Minaj" #chosen artist
result = sp.search(name) #search query
result['tracks']['items'][0]['artists']

[{'external_urls': {'spotify': 'https://open.spotify.com/artist/5dHt1vcEm9qb8fCyLcB3HL'},
  'href': 'https://api.spotify.com/v1/artists/5dHt1vcEm9qb8fCyLcB3HL',
  'id': '5dHt1vcEm9qb8fCyLcB3HL',
  'name': 'A$AP Ferg',
  'type': 'artist',
  'uri': 'spotify:artist:5dHt1vcEm9qb8fCyLcB3HL'},
 {'external_urls': {'spotify': 'https://open.spotify.com/artist/0hCNtLu0JehylgoiP8L4Gh'},
  'href': 'https://api.spotify.com/v1/artists/0hCNtLu0JehylgoiP8L4Gh',
  'id': '0hCNtLu0JehylgoiP8L4Gh',
  'name': 'Nicki Minaj',
  'type': 'artist',
  'uri': 'spotify:artist:0hCNtLu0JehylgoiP8L4Gh'},
 {'external_urls': {'spotify': 'https://open.spotify.com/artist/5SyGEPymt1G2uto47tVWvZ'},
  'href': 'https://api.spotify.com/v1/artists/5SyGEPymt1G2uto47tVWvZ',
  'id': '5SyGEPymt1G2uto47tVWvZ',
  'name': 'MadeinTYO',
  'type': 'artist',
  'uri': 'spotify:artist:5SyGEPymt1G2uto47tVWvZ'}]
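
The first track returned for this query credits A$AP Ferg before Nicki Minaj, so indexing the artist list by position does not reliably return the searched artist. Below is a minimal sketch of one way to pick the matching entry by name. `find_artist` is a hypothetical helper (not part of Spotipy), and `sample_artists` is a trimmed copy of the output above.

```python
def find_artist(artists, name):
    """Return the first artist dict whose name matches `name` (case-insensitive), or None."""
    wanted = name.casefold()
    for artist in artists:
        if artist["name"].casefold() == wanted:
            return artist
    return None


# Trimmed copy of the search output shown above (only the fields used here)
sample_artists = [
    {"id": "5dHt1vcEm9qb8fCyLcB3HL", "name": "A$AP Ferg"},
    {"id": "0hCNtLu0JehylgoiP8L4Gh", "name": "Nicki Minaj"},
    {"id": "5SyGEPymt1G2uto47tVWvZ", "name": "MadeinTYO"},
]

match = find_artist(sample_artists, "nicki minaj")
print(match["id"])  # 0hCNtLu0JehylgoiP8L4Gh
```

With a live client, the same helper could be applied to `result['tracks']['items'][0]['artists']`.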

In [5]:
# sp.user_playlist_tracks("username", "playlist_id")
# The following code was developed with reference to this article by Max Hilsdorf (Towards Data Science):
# https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6

In [6]:
sp.user_playlist_tracks("spotify", "37i9dQZF1DWUa8ZRTfalHk");

In [7]:
import pandas as pd

In [68]:
# https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
# Function adapted from the one in the article above, using the Spotify API's field names for track metadata and audio features
def analyze_playlist(creator, playlist_id):
    
    # Create an empty DataFrame; "popularity" is not in this list, so pd.concat adds it as the last column
    playlist_features_list = ["artist","album","track_name",  "track_id", "danceability","energy","key",
                              "loudness","mode", "speechiness","instrumentalness","liveness",
                              "valence","tempo", "duration_ms","time_signature"]
    
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track in the playlist, extract its features, and append them to playlist_df.
    # Note: a single user_playlist_tracks call returns at most 100 tracks (one page of results).
    playlist = sp.user_playlist_tracks(creator, playlist_id)["items"]
    for track in playlist:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        playlist_features["popularity"] = track["track"]["popularity"]
        
        # Get audio features (the first four entries of playlist_features_list are metadata, so skip them)
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the dfs
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
    return playlist_df
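
The function above reads only the first page of playlist items and requests audio features one track at a time. As a hedged sketch of an alternative (not the approach used in this notebook), the variant below follows pagination with the client's `next` method, requests audio features in batches, and builds the DataFrame once at the end. `collect_playlist` and `FakeClient` are hypothetical names introduced here. `FakeClient` is an in-memory stand-in so the example runs without credentials, and the feature list is shortened for brevity.

```python
import pandas as pd

AUDIO_FEATURES = ["danceability", "energy", "tempo"]  # shortened list for this sketch


def collect_playlist(client, creator, playlist_id):
    """Collect every track of a playlist, following pagination, into one DataFrame."""
    # Gather all pages: each page is a dict with "items" and a "next" entry
    page = client.user_playlist_tracks(creator, playlist_id)
    items = list(page["items"])
    while page["next"]:
        page = client.next(page)
        items.extend(page["items"])

    rows = []
    # Request audio features for up to 100 track ids per call
    for start in range(0, len(items), 100):
        batch = items[start:start + 100]
        ids = [item["track"]["id"] for item in batch]
        for item, features in zip(batch, client.audio_features(ids)):
            track = item["track"]
            row = {
                "artist": track["album"]["artists"][0]["name"],
                "track_name": track["name"],
                "track_id": track["id"],
                "popularity": track["popularity"],
            }
            row.update({name: features[name] for name in AUDIO_FEATURES})
            rows.append(row)
    # Build the DataFrame once instead of concatenating inside the loop
    return pd.DataFrame(rows)


class FakeClient:
    """Hypothetical in-memory stand-in for the Spotipy client (demonstration only)."""

    def __init__(self, n_tracks, page_size):
        self.tracks = [
            {"id": f"id{i}", "name": f"song{i}", "popularity": i,
             "album": {"name": f"album{i}", "artists": [{"name": f"artist{i}"}]}}
            for i in range(n_tracks)
        ]
        self.page_size = page_size

    def _page(self, offset):
        end = offset + self.page_size
        return {"items": [{"track": t} for t in self.tracks[offset:end]],
                "next": end if end < len(self.tracks) else None}

    def user_playlist_tracks(self, creator, playlist_id):
        return self._page(0)

    def next(self, page):
        return self._page(page["next"])

    def audio_features(self, ids):
        return [{"danceability": 0.5, "energy": 0.5, "tempo": 120.0} for _ in ids]


demo_df = collect_playlist(FakeClient(n_tracks=5, page_size=2), "creator", "playlist")
print(demo_df.shape)  # (5, 7)
```

With the real API, some tracks may come back without audio features, so a production version would need to handle missing entries before reading the feature values.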


In [69]:
playlist_df_1 = analyze_playlist("Spotify", "37i9dQZF1DWUa8ZRTfalHk")
playlist_df_1.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity
0,Marshmello,OK Not To Be OK,OK Not To Be OK,0zzVTGyRrWpQu8Fr28NRAv,0.743,0.837,1,-5.025,0,0.0649,0.0,0.0743,0.263,103.072,159863,4,76.0
1,Dixie D’Amelio,Be Happy (feat. blackbear) [Remix],Be Happy (feat. blackbear) - Remix,3JwghlOgXpcxFHDEbfvaYL,0.576,0.749,9,-3.612,0,0.0522,1.12e-06,0.12,0.343,173.969,191578,4,63.0
2,Ava Max,OMG What's Happening,OMG What's Happening,6T7NPX0BWpaapcp0Jn7OK9,0.698,0.854,9,-3.84,0,0.0451,0.0,0.107,0.931,124.042,179832,4,75.0
3,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,82.0
4,24kGoldn,Mood (feat. Iann Dior),Mood (feat. Iann Dior),3tjFYV6RSFtuktYl3ZtYcq,0.7,0.722,7,-3.558,0,0.0369,0.0,0.272,0.756,90.989,140526,4,98.0


In [70]:
playlist_df_2 = analyze_playlist("Linards Zahrins", "5HRNyPYz3WO0w7gBf0HK9O")
playlist_df_2.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity
0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,98.0
1,Doja Cat,Boss Bitch,Boss Bitch,78qd8dvwea0Gosb6Fe6j3k,0.707,0.955,10,-4.593,0,0.222,0.0,0.202,0.575,125.989,134240,4,86.0
2,Linards Zarins,I Miss You,I Miss You,52g4ZRv99HEDcGNGWT9fG6,0.71,0.351,6,-10.476,1,0.0284,0.0,0.195,0.661,104.935,197903,4,6.0
3,Dua Lipa,Future Nostalgia,Hallucinate,1nYeVF5vIBxMxfPoL0SIWg,0.627,0.69,10,-5.396,0,0.139,0.0,0.0742,0.627,122.053,208505,4,83.0
4,Zaryah,Invite,Invite,75WEC68Cuu6bijnu2A6hPS,0.785,0.203,2,-18.369,0,0.0749,0.000433,0.0908,0.0881,124.981,147840,4,32.0


In [71]:
playlist_df_3 = analyze_playlist("Pop Rizing", "293s8bPv39QLRSXANkHfNa")
playlist_df_3.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity
0,Sharp Elijah,Dance All Night,Dance All Night,5LwHCXAoq2po5My5qNRAeg,0.751,0.725,0,-6.336,1,0.0384,0.000169,0.149,0.396,120.01,169042,4,23.0
1,Ghita,Real Lies,Real Lies,0eOBx65BAaEi8IaKd24aJC,0.726,0.623,1,-5.517,0,0.0304,1.2e-05,0.115,0.391,100.077,218702,4,42.0
2,MASHI,Bridges,Bridges,4daRt4KvOAdwSCvwZH51rO,0.6,0.589,9,-6.039,0,0.048,0.0,0.0871,0.415,125.011,179680,4,17.0
3,Sara Diamond,IDK,Great Together,5Tw3hxeILGUhCmgg0A2Bha,0.76,0.643,9,-5.617,1,0.0476,8.7e-05,0.106,0.326,132.895,156121,4,40.0
4,Madison Olds,3'S a Crowd,3'S a Crowd,5d7BGTlN3xLfu2Mwtc5mAS,0.618,0.698,1,-4.835,0,0.101,0.00154,0.0969,0.581,173.995,173119,4,49.0


In [72]:
new_song_df = pd.concat([playlist_df_1,playlist_df_2,playlist_df_3])
new_song_df.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,popularity
0,Marshmello,OK Not To Be OK,OK Not To Be OK,0zzVTGyRrWpQu8Fr28NRAv,0.743,0.837,1,-5.025,0,0.0649,0.0,0.0743,0.263,103.072,159863,4,76.0
1,Dixie D’Amelio,Be Happy (feat. blackbear) [Remix],Be Happy (feat. blackbear) - Remix,3JwghlOgXpcxFHDEbfvaYL,0.576,0.749,9,-3.612,0,0.0522,1.12e-06,0.12,0.343,173.969,191578,4,63.0
2,Ava Max,OMG What's Happening,OMG What's Happening,6T7NPX0BWpaapcp0Jn7OK9,0.698,0.854,9,-3.84,0,0.0451,0.0,0.107,0.931,124.042,179832,4,75.0
3,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,82.0
4,24kGoldn,Mood (feat. Iann Dior),Mood (feat. Iann Dior),3tjFYV6RSFtuktYl3ZtYcq,0.7,0.722,7,-3.558,0,0.0369,0.0,0.272,0.756,90.989,140526,4,98.0
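
By default, `pd.concat` keeps each frame's original index, so the combined frame repeats the labels 0, 1, 2, ... once per playlist. The sketch below uses small made-up frames (not the playlist data) to show the default behaviour and two common alternatives. Which one fits best depends on whether the source playlist needs to stay identifiable.

```python
import pandas as pd

# Toy frames standing in for the per-playlist DataFrames (values are made up)
df_a = pd.DataFrame({"track_name": ["a1", "a2"], "popularity": [76, 63]})
df_b = pd.DataFrame({"track_name": ["b1", "b2"], "popularity": [98, 86]})

# Default behaviour: index labels 0 and 1 each appear twice
kept = pd.concat([df_a, df_b])
print(kept.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([df_a, df_b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# keys= records which frame each row came from in an outer index level
labelled = pd.concat([df_a, df_b], keys=["playlist_1", "playlist_2"])
print(labelled.loc["playlist_2"]["track_name"].tolist())  # ['b1', 'b2']
```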


In [73]:
pd.set_option("display.max_rows", 999)
new_song_df.sort_values('popularity', ascending = False)[['artist','track_name','popularity']]

Unnamed: 0,artist,track_name,popularity
4,24kGoldn,Mood (feat. Iann Dior),98.0
0,DaBaby,ROCKSTAR (feat. Roddy Ricch),98.0
34,Jay Wheeler,La Curiosidad,93.0
46,Miley Cyrus,Midnight Sky,92.0
20,Ariana Grande,Stuck with U (with Justin Bieber),92.0
60,BTS,Dynamite,91.0
84,Jason Derulo,Take You Dancing,90.0
8,Internet Money,Lemonade,90.0
23,BLACKPINK,Ice Cream (with Selena Gomez),89.0
15,Doja Cat,Say So,89.0


In [75]:
# Keep tracks with a popularity score from 60 through 80, then sort from most to least popular
greater_than_59 = new_song_df[(new_song_df['popularity'] > 59) & (new_song_df['popularity'] < 81)][['artist','track_name','popularity']]
uncleaned_songlist = greater_than_59.sort_values('popularity', ascending=False)

In [76]:
uncleaned_songlist.shape

(66, 3)

In [77]:
uncleaned_songlist.head()

Unnamed: 0,artist,track_name,popularity
19,Sam Smith,I’m Ready (with Demi Lovato),80.0
79,Royal & the Serpent,Overwhelmed,80.0
7,Big Sean,Wolves (feat. Post Malone),80.0
28,Gabby Barrett,I Hope,80.0
44,Saweetie,"Tap In (feat. Post Malone, DaBaby & Jack Harlow)",80.0
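
The popularity band used above can also be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on made-up rows. It assumes popularity scores are whole numbers, in which case `between(60, 80)` selects the same rows as `> 59` combined with `< 81`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "artist": ["A", "B", "C", "D"],
    "track_name": ["t1", "t2", "t3", "t4"],
    "popularity": [98.0, 80.0, 60.0, 59.0],
})

# Keep scores from 60 through 80 (both ends included), highest first
mid_band = toy[toy["popularity"].between(60, 80)]
mid_band = mid_band.sort_values("popularity", ascending=False)
print(mid_band["artist"].tolist())  # ['B', 'C']
```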
