# 1. Web Scraping Spotify and Genius Lyrics

# Problem Statement (**Goal**)

1. Spotify uses a "popularity" parameter to rank songs, albums, and artists. This metric is based on how often users stream songs from Spotify. 

2. But how does this stream-based popularity metric compare with other measures of popularity? It mostly reflects how popular very recent artists are overall, not popularity within a genre or popularity by song or lyrical content. 

3. As a result, historically VERY popular classic songs (by Earth, Wind & Fire, The Beatles, and other "classic groups") are overlooked. Additionally, artists who are VERY popular within their genre are overshadowed by artists from higher-popularity genres like "pop." 

4. We need a new metric for popularity. Ideally, we need ideas for more than one new popularity metric, along with ways to collect the data for each.

So:

1. Can we predict a song's popularity by stream count accurately using Regression Modeling?

2. Can we predict whether a song is popular by stream count using Classification Modeling?

3. What can we say about a song's popularity based on aspects of the music itself: like danceability, energy, and acousticness? 

4. What can we say about a song’s popularity based on the content of an artist's lyrics--the verbal connotations and vibe of the poetry? 

5. How do each of these factors influence our ability to predict the popularity of an artist or song?

6. Finally, when using Regression modeling, Classification modeling, and NLP Clustering to predict the popularity of a musical artist, how can we evaluate whether or not to trust Spotify's ranking of popularity? 

7. What other metrics of popularity should we define and recommend that Spotify and other top streaming sites adopt? What is our reasoning?

# Executive Summary (**Overview**)

Spotify Song Attributes

1. First (for Song Attributes), I scrape eleven playlists from Spotify (ten themed playlists plus one large mixed playlist), which together contain exactly 699 songs, most of them "Rising" songs from 2020. Spotify has a built-in popularity score based on number of streams, so I sort the songs by that score and clean the data, removing NaN values and duplicate songs. This is ordered_songlist. Then, I import a dataframe of roughly 232,000 songs from 2018-2020 compiled on Kaggle by Zaheen Hamidani. I clean this data as well, dropping NaN values and duplicates. Next, I concatenate this songlist to ordered_songlist. At last, I name this large dataframe of roughly 150,000 songs giant_ordered_playlist.
2. Second, I build a wide variety of Regression Models that try to accurately predict a song's "stream-popularity" based off of the song's musical attributes (like energy, valence, modality, time signature, and other characteristics). I will also use many different Classification Models to measure whether we can predict that a song is popular (above 75% popularity on a scale of 0 to 100) based off of these same song attributes. 
3. Finally, I interpret the differences between the stream-based popularity metric and this song-attribute-based popularity metric, generating reasons for incongruities and making conclusions about the effectiveness of our popularity metric.

Genius Lyric Attributes

3. First (for Lyric Attributes), I use the shorter list of playlist songs (just 700 songs from ordered_playlist) from Spotify as a basis for which lyrics to scrape. I scrape the lyrics for each of these songs off of Genius' lyric library.
4. Second, I use sentiment analysis and NLP (CountVectorizer) to perform EDA on the most common words/sentiments for each song.
5. Finally, I try to evaluate whether there is a correlation between most common words and song sentiment with its popularity. 

Lyric Clustering Processing (Completed Stretch Goal)

6. First (for Lyrics), I use spaCy to convert the 300 most common words in the lyrics of each song in ordered_playlist into vectors. These word vectors are then positioned by their similarity to one another on a shared coordinate plane. 
7. Finally, I try to evaluate whether there is a correlation between a group of lyrics' content and their artist's stream-popularity. I conclude that yes, there IS a clear relationship between a song's stream-popularity and its lyrical content. For further research, however, I would like to pursue hypothesis testing to confirm that this relationship is statistically significant.

### Stretch Goals

- Use Word2Vec or spaCy to vectorize and cluster lyrics data for the smaller dataset.
    - I reached this goal!

- Build a Neural Network to predict popularity using information from Spotify.
    - I reached this goal!

- Use Twitter to scrape tweets relating to artists.
    - Not possible!
    - Twitter edited their API, so no more scraping is allowed
    
- Create a recommender system to recommend songs to people based on their desired mood
    - Base recommender system off of BOTH lyrics and song attributes
    - Not yet reached

- Figure out other interesting information from the data! I collected SO MUCH interesting information, I want to know what further directions I can take this project towards that would benefit a record label or music/tech firm. Any suggestions would be highly appreciated.

### Project Questions to Pursue

- Regression can be used to predict a song's popularity value to some degree of accuracy. If popularity cannot be predicted accurately enough, that suggests Spotify popularity is not strongly correlated with the actual qualities of the music. 
- Classification can be used to differentiate between unpopular (<70% popularity) and popular (>70% popularity) songs with a high degree of accuracy.

# Standard Imports

Here, I import a wide variety of different libraries and functions that will allow me to complete this project.

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import re

# bs4, nltk, and sklearn imports
from bs4 import BeautifulSoup   
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
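# Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions use ConfusionMatrixDisplay instead
from sklearn.metrics import confusion_matrix, plot_confusion_matrix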

#pd.set_option("display.max_rows", None, "display.max_columns", None)

# Spotify Data Collection

In "Spotify Data Collection," I use the Spotify Web API (through the Spotipy Python client) to import information on a wide variety of musical characteristics for songs mainly from 2017-2020. I draw on many different genres so that my data is not biased toward pop. This will allow me to compare songs regardless of their genre when looking for predictors of popularity.

In [2]:
# Referencing Spotipy API Tutorial by Medium Author Rare Loot for following code
# https://medium.com/@RareLoot/extracting-spotify-data-on-your-favourite-artist-via-python-d58bc92a4330

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials #To access authorised Spotify data
import spotipy.util as util

In [3]:
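import os
# Read the Spotify credentials from environment variables instead of hard-coding them in the notebook
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]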

In [4]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager) #spotify object to access API

In [5]:
#testing artist scraping
name = "Nicki Minaj" #chosen artist
result = sp.search(name) #search query
result['tracks']['items'][0]['artists'] # Shows the artists credited on the first (zero-th) track returned by the search

[{'external_urls': {'spotify': 'https://open.spotify.com/artist/790FomKkXshlbRYZFtlgla'},
  'href': 'https://api.spotify.com/v1/artists/790FomKkXshlbRYZFtlgla',
  'id': '790FomKkXshlbRYZFtlgla',
  'name': 'KAROL G',
  'type': 'artist',
  'uri': 'spotify:artist:790FomKkXshlbRYZFtlgla'},
 {'external_urls': {'spotify': 'https://open.spotify.com/artist/0hCNtLu0JehylgoiP8L4Gh'},
  'href': 'https://api.spotify.com/v1/artists/0hCNtLu0JehylgoiP8L4Gh',
  'id': '0hCNtLu0JehylgoiP8L4Gh',
  'name': 'Nicki Minaj',
  'type': 'artist',
  'uri': 'spotify:artist:0hCNtLu0JehylgoiP8L4Gh'}]

In [6]:
# sp.user_playlist_tracks("username", "playlist_id")
# following code developed with reference to Max Hilsdorf, medium author
# https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6

In [7]:
sp.user_playlist_tracks("spotify", "37i9dQZF1DWUa8ZRTfalHk"); # test call to check that I can import a full playlist

In [8]:
#https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
#Function based on function model from this, plus Spotify Database API tags
def analyze_playlist(creator, playlist_id):
    
    # Create empty dataframe
    playlist_features_list = ["artist","album","track_name",  "track_id", "danceability","energy","key",
                              "loudness","mode", "speechiness","instrumentalness","liveness",
                              "valence","tempo", "duration_ms","time_signature", "acousticness"]
    
    # https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
    
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track returned for the playlist (the first page of results), extract features and append them to the playlist df
    
    playlist = sp.user_playlist_tracks(creator, playlist_id)["items"]
    for track in playlist:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        playlist_features["popularity"] = track["track"]["popularity"]
        
        # Get audio features
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the dfs
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
    return playlist_df
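
Playlist tracks come back from the Spotify Web API in pages (Spotipy requests up to 100 items per call by default), so reading only `["items"]` from the first response caps each playlist at its first page. A minimal sketch of following the pages is below. It assumes a Spotipy client that exposes `playlist_items` and `next`; the helper name `fetch_all_playlist_items` is hypothetical and not part of the original notebook.

```python
def fetch_all_playlist_items(client, playlist_id):
    """Collect every item of a playlist by following the paged results.

    `client` is assumed to be a spotipy.Spotify instance (for example the
    `sp` object created above); this helper is illustrative only.
    """
    page = client.playlist_items(playlist_id)
    items = list(page["items"])
    # Each page carries a "next" URL until the last page, where it is None
    while page.get("next"):
        page = client.next(page)
        items.extend(page["items"])
    return items
```

With something like this, `analyze_playlist` could loop over `fetch_all_playlist_items(sp, playlist_id)` instead of the first page only, which would matter most for the long playlist.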

In [9]:
playlist_df_1 = analyze_playlist("Spotify", "37i9dQZF1DWUa8ZRTfalHk")
playlist_df_1.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Ruel,say it over (feat. Cautious Clay),say it over (feat. Cautious Clay),4jSE5cAaa5rwTyhDSXfwQN,0.438,0.315,2,-10.941,0,0.044,0.0,0.606,0.151,156.031,238500,4,0.566,68.0
1,Bebe Rexha,"Baby, I'm Jealous (feat. Doja Cat)","Baby, I'm Jealous (feat. Doja Cat)",2fTdRdN73RgIgcUZN33dvt,0.737,0.867,11,-2.259,0,0.0458,0.0,0.32,0.506,98.05,175873,4,0.0398,68.0
2,Dua Lipa,Levitating (feat. DaBaby),Levitating (feat. DaBaby),463CkQjx2Zk1yXoBuierM9,0.702,0.825,6,-3.787,0,0.0601,0.0,0.0674,0.915,102.977,203064,4,0.00883,78.0
3,Halsey,Manic,I'm Not Mad,6SL8U8TtdwOtGhbmGzsMfX,0.78,0.684,0,-5.758,1,0.0636,0.0,0.105,0.732,149.916,173467,4,0.17,74.0
4,Julia Michaels,Lie Like This,Lie Like This,5yCXLEi384DHGRXYMXgjBR,0.735,0.694,1,-5.721,1,0.0538,1.72e-06,0.0675,0.849,120.983,218383,4,0.136,72.0


In [10]:
playlist_df_2 = analyze_playlist("Linards Zahrins", "5HRNyPYz3WO0w7gBf0HK9O")
playlist_df_2.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0
1,Doja Cat,Boss Bitch,Boss Bitch,78qd8dvwea0Gosb6Fe6j3k,0.707,0.955,10,-4.593,0,0.222,0.0,0.202,0.575,125.989,134240,4,0.127,84.0
2,Linards Zarins,I Miss You,I Miss You,52g4ZRv99HEDcGNGWT9fG6,0.71,0.351,6,-10.476,1,0.0284,0.0,0.195,0.661,104.935,197903,4,0.0801,2.0
3,Dua Lipa,Future Nostalgia,Hallucinate,1nYeVF5vIBxMxfPoL0SIWg,0.627,0.69,10,-5.396,0,0.139,0.0,0.0742,0.627,122.053,208505,4,0.033,82.0
4,Zaryah,Invite,Invite,75WEC68Cuu6bijnu2A6hPS,0.785,0.203,2,-18.369,0,0.0749,0.000433,0.0908,0.0881,124.981,147840,4,0.0586,37.0


In [11]:
playlist_df_3 = analyze_playlist("Pop Rizing", "293s8bPv39QLRSXANkHfNa")
playlist_df_3.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Sharp Elijah,4 Life,4 Life,4ArOkJprDLRRZyy1mwnCR4,0.684,0.579,4,-8.447,0,0.205,0.0,0.243,0.582,110.121,174545,4,0.229,23.0
1,Kaitlyn Velez,FOMO,FOMO,3ANoQMolPtM6GHQ8zrGeVE,0.761,0.475,11,-6.251,0,0.0463,1.33e-06,0.0717,0.332,150.044,160000,4,0.265,48.0
2,MASHI,Bridges,Bridges,4daRt4KvOAdwSCvwZH51rO,0.6,0.589,9,-6.039,0,0.048,0.0,0.0871,0.415,125.011,179680,4,0.43,11.0
3,Ghita,Real Lies,Real Lies,0eOBx65BAaEi8IaKd24aJC,0.726,0.623,1,-5.517,0,0.0304,1.17e-05,0.115,0.391,100.077,218702,4,0.112,41.0
4,Sharp Elijah,Dance All Night,Dance All Night,5LwHCXAoq2po5My5qNRAeg,0.751,0.725,0,-6.336,1,0.0384,0.000169,0.149,0.396,120.01,169042,4,0.161,19.0


In [12]:
playlist_df_4 = analyze_playlist("Chosen Pop", "5CmQB7YW2O3CQxnYktisbA") 
playlist_df_4.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Farii,How you wanna play,How you wanna play,5MjVX3LgkIK5ilhQ5ANLvV,0.746,0.601,7,-9.693,0,0.0827,1e-06,0.261,0.412,115.02,221398,4,0.117,35.0
1,HRVY,NEVERMIND (Acoustic),NEVERMIND,4FlBROqpe3miOcUbcATnQv,0.658,0.529,5,-6.807,1,0.111,0.0,0.129,0.59,102.18,176080,4,0.15,33.0
2,Jonas Blue,Naked,Naked,2gGLpMzoo80A7jGEIr4ou8,0.856,0.622,10,-5.217,0,0.0564,0.0,0.0875,0.778,114.976,210921,4,0.36,71.0
3,Sam Feldt,Hold Me Close (feat. Ella Henderson),Hold Me Close (feat. Ella Henderson),24aN8j7dBw0FxxKUBlCtd6,0.654,0.795,0,-4.419,1,0.0536,0.0,0.19,0.465,120.019,185750,4,0.178,74.0
4,RAYE,Natalie Don’t,Natalie Don't,5CO4uJ11ZVKhsO2Lu9NUSk,0.807,0.533,9,-4.899,0,0.0443,0.0,0.284,0.853,124.006,194347,4,0.176,69.0


In [13]:
playlist_df_5 = analyze_playlist("Chosen Rap", "6uXVAe8ty2JNTueizGD4tN") 
playlist_df_5.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Rich The Kid,BOSS MAN,Racks On (feat. YoungBoy Never Broke Again),03RLJVJKfqsuCZfnqlhJVh,0.558,0.511,0,-6.404,1,0.207,0.0,0.131,0.232,105.353,177867,5,0.111,4.0
1,Roselli,Strapped,Strapped,5B8YTG5V7IjmrsStzxXuu7,0.91,0.494,11,-9.078,0,0.204,0.00731,0.116,0.223,131.916,149091,4,0.0488,39.0
2,Roselli,Limitless,Hard Body,48MNIUUrddd55lvqhw1Msp,0.738,0.699,5,-8.292,0,0.142,0.0,0.131,0.924,121.907,220328,4,0.23,0.0
3,Roselli,Limitless,Border to Border,0uPejgUXMDllbOKZw1mjPr,0.85,0.472,7,-10.584,1,0.387,0.0,0.114,0.818,117.143,155897,4,0.141,1.0
4,Biggy Boats,Ruthless,Ruthless,35cVwO0HyO8QpjjeJvw0Ga,0.901,0.803,0,-4.476,1,0.113,7.33e-05,0.212,0.585,129.984,156111,4,0.131,0.0


In [14]:
playlist_df_6 = analyze_playlist("Rising Rock", "1R6sNoFTMwJ6pFvjpzGEZH") 
playlist_df_6.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,After the Calm,Home Sweet Home,Home Sweet Home,6DvEYe2c1d5pkkIDE2EIXk,0.547,0.94,9,-3.729,1,0.0552,0.0,0.0806,0.722,130.035,175800,4,6.1e-05,30.0
1,Wildstreet,Born to Be,Born to Be,5TeZm9VbENB3JE1v3wSYk9,0.542,0.965,6,-3.257,1,0.118,0.000863,0.284,0.47,133.048,216067,4,0.00246,34.0
2,Glossii,Watching Me,Watching Me,1it6clwVMKCsJtUlgQSUxx,0.424,0.956,4,-5.277,0,0.0513,0.0,0.347,0.808,162.035,192340,4,0.000122,30.0
3,After the Calm,Greenway,Greenway,7Eu7LWwLrzNxWpQUNLHoXd,0.558,0.885,10,-5.68,1,0.0331,0.0068,0.156,0.246,107.536,198837,4,7.6e-05,14.0
4,Amongst Liars,Burn the Vision,Burn the Vision,5g15dxdGa8KhXtpjanp86r,0.552,0.893,11,-5.708,0,0.039,0.185,0.0815,0.308,93.997,248759,4,0.0012,34.0


In [15]:
playlist_df_7 = analyze_playlist("New Boots", "37i9dQZF1DX8S0uQvJ4gaa") 
playlist_df_7.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Riley Green,If It Wasn't For Trucks,If I Didn’t Wear Boots,2imQgpXrOacGLfgx9nevja,0.426,0.897,7,-4.549,1,0.0326,0,0.337,0.52,155.823,180360,4,0.00101,61.0
1,Nate Smith,Wildfire,Wildfire,21HxYsyuuXZNqB1Dme5PQN,0.511,0.674,6,-4.992,1,0.026,0,0.18,0.363,159.874,190249,4,0.0371,73.0
2,Lee Brice,More Beer,More Beer,4NmUNvMjX0LzztKePGtiC2,0.537,0.802,6,-6.88,1,0.0471,0,0.176,0.776,152.097,154387,4,0.0239,55.0
3,Andrew Jannakos,Gone Too Soon,Gone Too Soon,7mDZ2NdYOeKFcz2zGnKBwU,0.529,0.723,4,-6.022,0,0.035,0,0.364,0.459,119.942,169255,4,0.0175,68.0
4,Caitlyn Smith,Supernova (Deluxe),I Can't (feat. Old Dominion),2YoOaGlM2zGpYBanN3AxrV,0.499,0.667,0,-4.336,1,0.0325,0,0.0894,0.38,137.905,210147,4,0.2,54.0


In [16]:
playlist_df_8 = analyze_playlist("Low-Key", "37i9dQZF1DX2yvmlOdMYzV")
playlist_df_8.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,070 Shake,Guilty Conscience (Tame Impala Remix),Guilty Conscience - Tame Impala Remix,4nNkCxutxk68CulzSBy0Tq,0.416,0.878,1,-3.654,0,0.322,0.00453,0.25,0.317,191.936,214987,4,0.0583,41.0
1,Simpson,Cherry Ice Cream Sundae,Cherry Ice Cream Sundae,2kp5QEtvCuWmDmc7prlDJq,0.602,0.582,6,-7.981,1,0.0305,0.00126,0.107,0.67,89.976,192360,4,0.226,45.0
2,Alann8h,Dumb Daze,My Mind Is a Maze,2P4qoMmSqElFcI7GYaPLwf,0.85,0.223,1,-11.46,1,0.0705,3.2e-05,0.0914,0.468,95.008,146678,4,0.671,60.0
3,Joesef,I Wonder Why,I Wonder Why,2HpDcssMlgQXfmAUYhePIP,0.619,0.654,0,-8.541,0,0.0425,0.00332,0.0604,0.726,80.0,228348,4,0.209,60.0
4,OSHUN,Sango,Sango,4D7o2dM2OeEFXB1omK5aTF,0.588,0.263,1,-16.722,1,0.194,0.000329,0.11,0.232,84.01,240482,4,0.746,45.0


In [17]:
playlist_df_9 = analyze_playlist("Study Break", "37i9dQZF1DX1dvMSwf27JO")
playlist_df_9.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,HANNI,Golden Eyes,Golden Eyes,2CVWGyKDFZRYzYkd9OxQRv,0.606,0.406,8,-9.689,1,0.034,3e-05,0.162,0.136,87.025,148252,4,0.623,51.0
1,Rence,Baby Blue,Baby Blue,1eA6HGJ1qZXEL7NIFKYrXK,0.518,0.411,7,-8.909,1,0.0335,0.000716,0.112,0.0782,120.06,208000,4,0.896,62.0
2,spill tab,Cotton Candy,Cotton Candy,7K1H5Peem34cxKy40kNFw5,0.685,0.46,0,-7.809,1,0.224,1.3e-05,0.357,0.596,96.876,93871,3,0.675,58.0
3,Neeko Crowe,"noway, trusay","noway, trusay",3Q3PynExfV1zosxXQ0wXgA,0.895,0.434,0,-9.785,1,0.366,4e-06,0.0925,0.639,116.99,220000,4,0.617,40.0
4,Taylor Hill,Like YOU,Like YOU,1LU9Dqce1Ri6q5b0ajkdIT,0.755,0.683,5,-2.317,1,0.0351,0.0,0.105,0.695,122.016,206576,4,0.0191,42.0


In [18]:
playlist_df_10 = analyze_playlist("Are & Be", "37i9dQZF1DX4SBhb3fqCJd")
playlist_df_10.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Trey Songz,Back Home,Rain (feat. Swae Lee),1pZwFpiLKrSfNtXs6WQLlf,0.618,0.416,1,-9.077,1,0.0772,0.0,0.106,0.404,137.851,223800,4,0.733,55.0
1,Bryson Tiller,A N N I V E R S A R Y,Outta Time (feat. Drake),0LGtMvQJ37SsEYbkP6TcVJ,0.714,0.582,5,-7.272,0,0.0808,0.0,0.0774,0.338,92.819,198822,4,0.0129,73.0
2,SZA,Hit Different,Hit Different,7Bar1kLTmsRmH6FCKKMEyU,0.679,0.516,0,-6.371,0,0.0452,0.0,0.0965,0.716,120.074,202008,4,0.199,81.0
3,Giveon,Spotify Singles,LIKE I WANT YOU - Acoustic,0qXu4pFQSIwkfTUgkE6WzF,0.463,0.381,10,-7.66,0,0.0309,0.00139,0.109,0.256,117.762,185224,3,0.798,69.0
4,Teyana Taylor,Wake Up Love,Wake Up Love,2KkNkv6ciB6bt2hvHtOrin,0.583,0.766,11,-4.743,0,0.405,1.79e-06,0.159,0.203,131.374,215329,3,0.502,67.0


In [19]:
playlist_df_large = analyze_playlist("Longest Playlist Ever", "5fMCrRnSy4TauAmM36zrIP")
playlist_df_large.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Ween,The Mollusk,Ocean Man,6M14BiCN00nOsba4JaYsHW,0.72,0.912,4,-6.13,1,0.0363,0.00122,0.0982,0.973,122.782,126947,4,0.551,65.0
1,Daft Punk,Random Access Memories,Fragments of Time (feat. Todd Edwards),0IedgQjjJ8Ad4B3UDQ5Lyn,0.807,0.51,0,-9.729,1,0.0433,0.115,0.104,0.961,130.118,279773,4,0.041,58.0
2,Daft Punk,Random Access Memories,Get Lucky (feat. Pharrell Williams & Nile Rodg...,69kOkLUCkxIZYexIgSG8rq,0.81,0.793,6,-9.404,0,0.0403,2e-06,0.072,0.863,116.049,369627,4,0.0378,72.0
3,The Coral,Magic & Medicine,Pass It On,5uB85PzlWrnNnxL2A1TcKD,0.395,0.727,7,-6.015,1,0.0375,7e-06,0.0683,0.872,179.097,139133,4,0.282,52.0
4,Various Artists,Cheap Date,Dreaming of You,40SE4lxCPNibwbdw1zzWV5,0.447,0.729,9,-6.108,0,0.0307,0.00037,0.109,0.971,199.098,141200,4,0.431,32.0


In [20]:
new_song_df = pd.concat([playlist_df_1,playlist_df_2,playlist_df_3,playlist_df_4, playlist_df_5, playlist_df_6,playlist_df_7,playlist_df_8,playlist_df_9, playlist_df_10, playlist_df_large])
new_song_df.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,Ruel,say it over (feat. Cautious Clay),say it over (feat. Cautious Clay),4jSE5cAaa5rwTyhDSXfwQN,0.438,0.315,2,-10.941,0,0.044,0.0,0.606,0.151,156.031,238500,4,0.566,68.0
1,Bebe Rexha,"Baby, I'm Jealous (feat. Doja Cat)","Baby, I'm Jealous (feat. Doja Cat)",2fTdRdN73RgIgcUZN33dvt,0.737,0.867,11,-2.259,0,0.0458,0.0,0.32,0.506,98.05,175873,4,0.0398,68.0
2,Dua Lipa,Levitating (feat. DaBaby),Levitating (feat. DaBaby),463CkQjx2Zk1yXoBuierM9,0.702,0.825,6,-3.787,0,0.0601,0.0,0.0674,0.915,102.977,203064,4,0.00883,78.0
3,Halsey,Manic,I'm Not Mad,6SL8U8TtdwOtGhbmGzsMfX,0.78,0.684,0,-5.758,1,0.0636,0.0,0.105,0.732,149.916,173467,4,0.17,74.0
4,Julia Michaels,Lie Like This,Lie Like This,5yCXLEi384DHGRXYMXgjBR,0.735,0.694,1,-5.721,1,0.0538,1.72e-06,0.0675,0.849,120.983,218383,4,0.136,72.0


In [21]:
new_song_df.shape

(699, 18)

# Spotify Features Data Cleaning

In this section, I clean the previous dataframe of information that is inappropriate for our analysis: I remove duplicate songs and songs with NaN values. Because I will be adding a lot more data later to balance out the Regression and Classification models, I can afford to be fairly liberal about how many songs I drop at this stage.
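
As a compact illustration of these cleaning steps, here is a sketch on a toy dataframe. The rows are made up; only the column names mirror the notebook, and deduplicating on `track_id` is one possible choice rather than what the cells below do.

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated track ("b") and one row with a missing popularity
toy = pd.DataFrame({
    "track_id": ["a", "b", "b", "c"],
    "track_name": ["Song A", "Song B", "Song B", "Song C"],
    "popularity": [10.0, 55.0, 55.0, np.nan],
})

cleaned = (
    toy.dropna()                              # drop rows with any missing value
       .drop_duplicates(subset="track_id")    # keep one row per Spotify track
       .sort_values("popularity", ascending=False)
       .reset_index(drop=True)                # avoid adding an extra "index" column
)
print(cleaned)
```

Passing `drop=True` to `reset_index` keeps the column count unchanged; without it, the old index is stored as a new `index` column.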

In [22]:
desired_features = ['artist','track_name','popularity',"danceability","energy",
                              "loudness","mode", "speechiness","instrumentalness","liveness",
                              "valence","tempo", "duration_ms","time_signature","acousticness"]

In [23]:
ordered_songlist = new_song_df.sort_values('popularity', ascending = False)

In [24]:
ordered_songlist.shape

(699, 18)

In [25]:
ordered_songlist.reset_index(drop=True, inplace=True)

In [26]:
ordered_songlist.head()

Unnamed: 0,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0
1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),5u1n1kITHCxxp8twBcZxWy,0.673,0.704,6,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0
2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,1tkg4EHVoqnhR6iFEXb60y,0.709,0.548,10,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0
3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),4HBZA5flZLE435QTztThqH,0.597,0.45,8,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0
4,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0


In [27]:
ordered_songlist.shape

(699, 18)

In [28]:
ordered_songlist.drop_duplicates(inplace=True)

In [29]:
ordered_songlist.reset_index(inplace=True)

In [30]:
ordered_songlist.shape

(695, 19)

In [31]:
ordered_songlist.describe()

Unnamed: 0,index,danceability,energy,loudness,speechiness,liveness,valence,tempo,acousticness,popularity
count,695.0,695.0,695.0,695.0,695.0,695.0,695.0,695.0,695.0,695.0
mean,350.690647,0.636685,0.615591,-7.403243,0.108964,0.170022,0.483547,121.620601,0.248156,44.709353
std,201.242282,0.14129,0.18164,2.780212,0.120995,0.115498,0.21373,28.879571,0.25593,24.384979
min,0.0,0.24,0.124,-18.369,0.0236,0.0228,0.0503,61.525,7e-06,0.0
25%,177.5,0.5385,0.4855,-9.077,0.0371,0.09735,0.326,98.9645,0.03295,26.0
50%,351.0,0.644,0.616,-6.883,0.0554,0.123,0.476,120.026,0.16,48.0
75%,524.5,0.7505,0.752,-5.3205,0.1305,0.208,0.644,141.987,0.418,65.0
max,698.0,0.933,0.991,-1.507,0.912,0.725,0.973,203.783,0.977,96.0


| Statistic | Popularity | Valence | Energy | Loudness (dB) | Danceability | Liveness | Tempo (BPM) | Acousticness |
|-----------|------------|---------|--------|---------------|--------------|----------|-------------|--------------|
| Mean      | 44.71%     | 48.35%  | 61.56% | -7.40         | 63.67%       | 17.00%   | 122         | 24.82%       |
| Median    | 48.00%     | 47.60%  | 61.60% | -6.88         | 64.40%       | 12.30%   | 120         | 16.00%       |

On a scale of 0 to 100, the mean popularity for these 695 songs is around 45 and the median is 48, both slightly below the midpoint of the scale. That seems understandable: not many songs reach a high stream count. If anything, I am surprised that these values are not smaller. 

Danceability's mean value is about 64%, meaning that many of the songs are more "dance-worthy" than not. This may factor into their popularity. The same can be said for the songs' mean energy level of about 62%.

The songs' average loudness is about -7.4 dB. Spotify reports loudness in decibels, typically on a range of roughly -60 to 0 dB, and because decibels are logarithmic, values closer to 0 are much louder. This is a very high loudness score, indicating that these recent songs (mostly from 2020) are fairly loud. This makes sense: before the 2000s, music was much more heavily compressed when it was rendered into a digital format. Now that computer systems have grown so much more powerful and strip fewer of the overtones and decibels from the music, songs are less compressed and louder on average. It is not that kids' rock music is louder than 1970s music was; it is that compression has changed. This explanation comes from an analysis presented by Elena Georgieva of Stanford University.

(https://slideslive.com/38931524/hitpredict-using-spotify-data-to-predict-billboard-hits?ref=account-60259-latest)
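
To make the logarithmic scale concrete, the sketch below converts decibel values to linear amplitude ratios using the standard relation ratio = 10^(dB/20). Treating Spotify's perceptually weighted loudness value this way is an assumption made for illustration only, and `db_to_amplitude_ratio` is a hypothetical helper name.

```python
def db_to_amplitude_ratio(db):
    """Convert a level in decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)


mean_db = -7.40    # mean loudness of the 695 songs summarized above
quiet_db = -18.37  # quietest track in the same summary

# How many times larger the average amplitude is than that of the quietest track
print(db_to_amplitude_ratio(mean_db) / db_to_amplitude_ratio(quiet_db))
```

Under that assumption, the roughly 11 dB gap between the average song and the quietest song corresponds to an amplitude ratio of roughly 3.5, which is why small differences in dB should not be read as small differences in loudness.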

In [32]:
ordered_songlist.isnull().sum()

index               0
artist              0
album               0
track_name          0
track_id            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
acousticness        0
popularity          0
dtype: int64

# Importing Gigantic Dataset to Improve Song Attribute Predictions

In [33]:
#Gigantic Songlist Data Source: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
large_df = pd.read_csv('spotify_features.csv')

In [34]:
large_df.reset_index(inplace=True)

In [35]:
large_df.head()

Unnamed: 0,index,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [36]:
large_df.shape

(232725, 19)

In [37]:
# # Dataset pulled by Tgel0 on GitHub using approximately the following method (roughly transcribed here for my own understanding)

# # storing the track searching results
# artist = []
# track_name = []
# popularity = []
# track_id = []

# for offset in range(0,10_000,50):
#     track_results = sp.search(q='year:2020', type='track', limit=50, offset=offset) # offset moves to the next page of results on each pass
#     for t in track_results['tracks']['items']: # "t" stands for the track we are indexing by
#         artist.append(t['artists'][0]['name'])
#         track_name.append(t['name'])
#         track_id.append(t['id'])
#         popularity.append(t['popularity'])
        
# df_tracks = pd.DataFrame({'artist':artist,'track_name':track_name,'track_id':track_id,'popularity':popularity})
# print(df_tracks.shape)


In [38]:
# # we need to use batchsizes since there's a limit to how many tracks you can get audio features from in one query
# # empty list, batchsize and the counter for None results
# rows = []
# batchsize = 100
# None_counter = 0

# for i in range(0,len(df_tracks['track_id']),batchsize):
#     batch = df_tracks['track_id'][i:i+batchsize]
#     feature_results = sp.audio_features(batch)
#     for t in feature_results:
#         if t is None:
#             None_counter = None_counter + 1
#         else:
#             rows.append(t)

In [39]:
# df_audio_features = pd.DataFrame.from_dict(rows,orient='columns')
# df_audio_features['track_id'] = df_audio_features['id']
# df_audio_features.head()
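
The batching idea in the commented cells above can be isolated into a small helper. This is a sketch, not Tgel0's original code: `chunked` and `collect_audio_features` are hypothetical names, and `client` is assumed to be a Spotipy client whose `audio_features` method accepts a list of track IDs and returns `None` for tracks without features.

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def collect_audio_features(client, track_ids, batch_size=100):
    """Fetch audio features in batches, skipping tracks that return None."""
    rows, missing = [], 0
    for batch in chunked(list(track_ids), batch_size):
        for features in client.audio_features(batch):
            if features is None:
                missing += 1   # track has no audio features available
            else:
                rows.append(features)
    return rows, missing
```

The returned `rows` list can be passed to `pd.DataFrame` as in the commented cell, and `missing` plays the role of `None_counter`.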

# Cleaning Gigantic Dataset

I import this very large dataset of 232,725 songs. I then drop all of the duplicate songs (matching on track name) and remove NaN values, just as I did before. Deduplication alone removes a little over a third of the rows, leaving 148,615 songs.

In [40]:
print("Shape of the dataset:", large_df.shape)

Shape of the dataset: (232725, 19)


In [41]:
large_df.drop_duplicates(subset=['track_name'], inplace=True)
print("Shape of the dataset:", large_df.shape)
large_df.head()

Shape of the dataset: (148615, 19)


Unnamed: 0,index,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [42]:
large_df['mode'] = large_df['mode'].map({'Major': 1, 'Minor': 0})

In [43]:
large_df['artist'] = large_df['artist_name']
large_df.head()

Unnamed: 0,index,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist
0,0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,1,0.0525,166.969,4/4,0.814,Henri Salvador
1,1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,0,0.0868,174.003,4/4,0.816,Martin & les fées
2,2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,0,0.0362,99.488,5/4,0.368,Joseph Williams
3,3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,1,0.0395,171.758,4/4,0.227,Henri Salvador
4,4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,1,0.0456,140.576,4/4,0.39,Fabien Nataf


In [44]:
large_df.drop(columns=['artist_name'], inplace = True)

In [45]:
large_df.sort_values('popularity', ascending = False)[desired_features].head()

Unnamed: 0,artist,track_name,popularity,danceability,energy,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness
9027,Ariana Grande,7 rings,100,0.725,0.321,-10.744,0,0.323,0.0,0.0884,0.319,70.142,178640,4/4,0.578
86951,Post Malone,Wow.,99,0.833,0.539,-7.399,0,0.178,2e-06,0.101,0.385,99.947,149520,4/4,0.163
9026,Ariana Grande,"break up with your girlfriend, i'm bored",99,0.726,0.554,-5.29,0,0.0917,0.0,0.106,0.335,169.999,190440,4/4,0.0421
66643,Daddy Yankee,Con Calma,98,0.737,0.86,-2.652,0,0.0593,2e-06,0.0574,0.656,93.989,193227,4/4,0.11
9028,Halsey,Without Me,97,0.752,0.488,-7.05,1,0.0705,9e-06,0.0936,0.533,136.041,201661,4/4,0.297


In [46]:
large_df.isnull().sum()

index               0
genre               0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
artist              0
dtype: int64

In [47]:
large_df.head(20)

Unnamed: 0,index,genre,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist
0,0,Movie,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,1,0.0525,166.969,4/4,0.814,Henri Salvador
1,1,Movie,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,0,0.0868,174.003,4/4,0.816,Martin & les fées
2,2,Movie,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,0,0.0362,99.488,5/4,0.368,Joseph Williams
3,3,Movie,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,1,0.0395,171.758,4/4,0.227,Henri Salvador
4,4,Movie,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,1,0.0456,140.576,4/4,0.39,Fabien Nataf
5,5,Movie,Le petit souper aux chandelles,0Mf1jKa8eNAf1a4PwTbizj,0,0.749,0.578,160627,0.0948,0.0,C#,0.107,-14.97,1,0.143,87.479,4/4,0.358,Henri Salvador
6,6,Movie,"Premières recherches (par Paul Ventimila, Lori...",0NUiKYRd6jt1LKMYGkUdnZ,2,0.344,0.703,212293,0.27,0.0,C#,0.105,-12.675,1,0.953,82.873,4/4,0.533,Martin & les fées
7,7,Movie,Let Me Let Go,0PbIF9YVD505GutwotpB5C,15,0.939,0.416,240067,0.269,0.0,F#,0.113,-8.949,1,0.0286,96.827,4/4,0.274,Laura Mayne
8,8,Movie,Helka,0ST6uPfvaPpJLtQwhE6KfC,0,0.00104,0.734,226200,0.481,0.00086,C,0.0765,-7.725,1,0.046,125.08,4/4,0.765,Chorus
9,9,Movie,Les bisous des bisounours,0VSqZ3KStsjcfERGdcWpFO,10,0.319,0.598,152694,0.705,0.00125,G,0.349,-7.79,1,0.0281,137.496,4/4,0.718,Le Club des Juniors


# Concatenating Small DF and Large DF into Giant Ordered DF

In [48]:
giant_ordered_df = pd.concat([ordered_songlist, large_df])

In [49]:
giant_ordered_df.head()

Unnamed: 0,index,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity,genre
0,0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0,
1,1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),5u1n1kITHCxxp8twBcZxWy,0.673,0.704,6,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0,
2,2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,1tkg4EHVoqnhR6iFEXb60y,0.709,0.548,10,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0,
3,3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),4HBZA5flZLE435QTztThqH,0.597,0.45,8,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0,
4,4,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0,


In [50]:
giant_ordered_df.drop(columns=['key','index','genre','track_id'], inplace = True)

In [51]:
giant_ordered_df.head()

Unnamed: 0,artist,album,track_name,danceability,energy,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),0.746,0.69,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0
1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),0.673,0.704,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0
2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,0.709,0.548,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0
3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),0.597,0.45,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0
4,salem ilese,Mad at Disney,Mad at Disney,0.738,0.621,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0


In [52]:
giant_ordered_df.describe()

Unnamed: 0,danceability,energy,loudness,speechiness,liveness,valence,tempo,acousticness,popularity
count,149310.0,149310.0,149310.0,149310.0,149310.0,149310.0,149310.0,149310.0,149310.0
mean,0.536072,0.552618,-10.367721,0.130416,0.228544,0.448735,116.949267,0.414327,35.728036
std,0.192841,0.280585,6.608798,0.209269,0.214957,0.269593,31.369127,0.370729,17.463493
min,0.0569,2e-05,-52.457,0.0222,0.00967,0.0,30.379,0.0,0.0
25%,0.406,0.331,-13.274,0.0371,0.0977,0.216,91.87825,0.0473,24.0
50%,0.553,0.589,-8.314,0.0496,0.131,0.437,114.9185,0.303,36.0
75%,0.68,0.791,-5.657,0.103,0.283,0.666,138.3405,0.809,48.0
max,0.987,0.999,3.744,0.967,1.0,1.0,242.903,0.996,100.0


|        | Popularity | Valence | Energy | Loudness | Danceability | Liveness | Tempo | Acousticness |
|--------|------------|---------|--------|----------|--------------|----------|-------|--------------|
| Mean   | 35.73%     | 44.87%  | 55.26% | -10.37   | 53.61%       | 22.85%   | 117   | 41.43%       |
| Median | 36.00%     | 43.70%  | 58.90% | -8.31    | 55.30%       | 13.10%   | 115   | 30.30%       |

Danceability and energy both have means and medians in the 54-59% range, suggesting that danceable, energetic tracks are well represented in music streamed through Spotify from 2017-2020. This fits what we would expect from popular streaming playlists.

Loudness has a mean of about -10 dB and a median of about -8 dB. The minimum is about -52 dB and the maximum is about 3.7 dB. Most tracks sit near the loud end of this range, which is consistent with the history of heavy dynamic-range compression in the music industry.

Speechiness has a mean of 13% and a median of 5%. Given the popularity of rap, this may seem low. A likely explanation is that "sung" music is not recognized as "speechy": only spoken music like rap is considered "speechy." This distinction is important.

Liveness has a mean of 22.9% and a median of 13%. Spotify's liveness feature estimates how likely it is that a track was recorded in front of an audience, so low values are expected: most streamed songs, including those by largely acoustic artists like Adele, are studio recordings.

Spotify describes valence in terms of "positivity" versus "negativity." Its API documentation defines valence as follows: "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)." With a mean of 45% and a median of 44%, these songs lean slightly negative.

The mean tempo (117 bpm) and median tempo (115 bpm) are close together. This fits the expectation that most recent streamed music is fairly quick in pace.

Acousticness has a mean of 41% and a median of 30%, indicating that most tracks in this dataset are not predominantly acoustic. This is consistent with the rise of digital music.

The mean popularity of this music is about 36 and the median is 36 on Spotify's 0-100 scale, and 75% of songs score 48 or lower, indicating that most of the data is not in the camp of being "very popular." So, Spotify CAN afford to be selective about which artists it selects. Not EVERYONE can be a pop star; this is proof.

### All of the song data is in hand

# Genius Lyrics Data Collection

In this section, we use the Genius API (through the `lyricsgenius` package) in a Python 3 for-loop to collect lyrics for the roughly 700 songs in ordered_songlist. We then add them to ordered_songlist as a new `lyrics` column.

In [53]:
import os
import lyricsgenius
# Read the token from the environment rather than hard-coding it (GENIUS_ACCESS_TOKEN is an assumed variable name).
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])

In [54]:
#ordered_songlist['track_name']

In [55]:
ordered_songlist.head()

Unnamed: 0,index,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity
0,0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0
1,1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),5u1n1kITHCxxp8twBcZxWy,0.673,0.704,6,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0
2,2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,1tkg4EHVoqnhR6iFEXb60y,0.709,0.548,10,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0
3,3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),4HBZA5flZLE435QTztThqH,0.597,0.45,8,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0
4,4,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0


In [56]:
# #test for a song
# genius.verbose = False
# genius.search_song(ordered_songlist.loc[36,'track_name'], ordered_songlist.loc[36,'artist'])

In [57]:
# Iterate through ordered_songlist, search Genius for each artist and track name,
# and collect the lyrics. Songs with no match (e.g. instrumentals) or a failed
# request get an empty string, so the loop never stops on an error.
import time

lyrics_list = []
genius.verbose = False # keeps us from having to see every song printed out as its lyrics are searched

start_time = time.time()
for i in ordered_songlist.index:
    artist = ordered_songlist.loc[i, 'artist']
    track_name = ordered_songlist.loc[i, 'track_name']
    try:
        song = genius.search_song(track_name, artist)
        lyrics = song.lyrics  # fails if no song was found; handled below
    except Exception:
        lyrics = ''
    lyrics_list.append(lyrics)
print("--- %s seconds ---" % (time.time() - start_time))

--- 2384.0907316207886 seconds ---
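
The loop above relies on `try`/`except` to get past songs with no match. A slightly more explicit variant is sketched below: it treats a missing result separately from a raised error and retries a limited number of times. The `search` argument is a placeholder for any callable shaped like `genius.search_song(title, artist)`. `fetch_lyrics`, `fake_search`, the retry count, and the `SimpleNamespace` result are assumptions made for illustration and are not part of `lyricsgenius`.

```python
import time
from types import SimpleNamespace


def fetch_lyrics(search, track_name, artist, retries=2, pause=0.0):
    """Return lyrics text, or '' if the song is missing or every attempt fails."""
    for attempt in range(retries + 1):
        try:
            song = search(track_name, artist)
        except Exception:
            # Possibly transient failure (e.g. a timeout): wait, then try again.
            time.sleep(pause)
            continue
        # A lookup with no match yields None, so fall back to an empty string.
        return song.lyrics if song is not None else ""
    return ""


# Hypothetical stand-in for the real search, so the sketch runs offline.
calls = {"count": 0}


def fake_search(track_name, artist):
    calls["count"] += 1
    if track_name == "Flaky" and calls["count"] == 1:
        raise TimeoutError("simulated timeout")
    if track_name == "Instrumental":
        return None
    return SimpleNamespace(lyrics=f"lyrics for {track_name}")


print(fetch_lyrics(fake_search, "Flaky", "Someone"))         # succeeds on the retry
print(repr(fetch_lyrics(fake_search, "Instrumental", "X")))  # empty string
```

With a run time of roughly 40 minutes for the full loop, a bounded retry like this may recover a few songs that would otherwise end up with empty lyrics after a one-off network error.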


In [58]:
ordered_songlist['lyrics'] = lyrics_list

In [59]:
ordered_songlist.head()

Unnamed: 0,index,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity,lyrics
0,0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0,"[Intro: DaBaby]\nWoo, woo\nI pull up (pull up)..."
1,1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),5u1n1kITHCxxp8twBcZxWy,0.673,0.704,6,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0,[Verse 1: Justin Bieber]\nI hear a lot about s...
2,2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,1tkg4EHVoqnhR6iFEXb60y,0.709,0.548,10,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0,[Intro]\nUh\n\n[Verse 1]\nShawty go jogging ev...
3,3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),4HBZA5flZLE435QTztThqH,0.597,0.45,8,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0,"[Intro: Ariana Grande]\nMmm\nHey, yeah\n(That'..."
4,4,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0,"[Verse 1]\nI'm mad at Disney, Disney\nThey tri..."


# Genius Lyrics - Lyric Cleaning

We strip stop words and non-letter characters from the lyrics so that the text is ready for analysis.

In [60]:
def cleaning_lyrics(single_song_lyrics): # Developed from 5.3 lesson with Patrick Wales-Dinan, General Assembly Data Scientist and Instructor
    # Convert one song's raw lyrics to a single string of cleaned words
    
    # 1. Remove HTML.
    lyrics_text = BeautifulSoup(single_song_lyrics, "lxml").get_text()
    
    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", lyrics_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching
    # a list, so convert the stopwords to a set.
    stops = set(stopwords.words('english'))
    new_stops = ['intro','chorus','verse','bridge','outro',
                 'like', 'oh','yeah','ooh','cause',
                'know','wanna','let','say','one',
                'feel','ah','get','go','take','pre','got','la',
                'mind','think','made','back','ft']
    stops.update(new_stops)
    
    # 5. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)
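
To see what the function does to a single lyric, here is a dependency-free sketch of steps 2 through 6. It is an illustration only: `clean_text` is a hypothetical name, and `SAMPLE_STOPS` is a tiny hand-picked stand-in for the NLTK English stop-word list plus the custom additions above.

```python
import re

# Tiny stand-in for the NLTK list plus the custom stop words (assumption).
SAMPLE_STOPS = {"i", "m", "at", "the", "verse", "chorus", "oh", "yeah"}


def clean_text(raw_lyrics, stops=SAMPLE_STOPS):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_lyrics)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)


raw = "[Verse 1]\nI'm mad at Disney, Disney"
print(clean_text(raw))  # mad disney disney
```

The section tag disappears because brackets and digits become spaces and "verse" is a stop word. The apostrophe in "I'm" also becomes a space, so the contraction splits into "i" and "m", which are removed only because both are in the stop list.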

In [61]:
# #testing the function
# cleaning_lyrics(ordered_songlist['lyrics'][36])

In [62]:
# We will now apply the cleaning_lyrics function to every set of lyrics.
# Initialize an empty list to hold the clean lyrics.
clean_lyrics = []

# For every song in ordered_songlist...
for lyrics in ordered_songlist['lyrics']:
    
    # Clean the lyrics, then append them to clean_lyrics.
    clean_lyrics.append(cleaning_lyrics(lyrics))

In [63]:
# Spot-check the cleaned lyrics for one song
clean_lyrics[36]

'feelin alien baby ridin around world feeling stranger baby round around come far away home goin fightin feelin alien baby ridin around world feeling stranger baby round around round around post round around around million miles home goin fightin feelin alien baby ridin around world feeling stranger baby round around round around post round around feelin alien baby ridin around world feeling stranger baby round around around'

In [64]:
ordered_songlist['lyrics'] = clean_lyrics

In [65]:
ordered_songlist.head()

Unnamed: 0,index,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,acousticness,popularity,lyrics
0,0,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.746,0.69,11,-7.956,1,0.164,0.0,0.101,0.497,89.977,181733,4,0.247,96.0,dababy woo woo pull pull pull baby pull pull p...
1,1,Justin Bieber,Holy,Holy (feat. Chance The Rapper),5u1n1kITHCxxp8twBcZxWy,0.673,0.704,6,-8.056,1,0.36,0.0,0.0898,0.372,86.919,212093,4,0.196,94.0,justin bieber hear lot sinners saint might riv...
2,2,Pop Smoke,Shoot For The Stars Aim For The Moon,What You Know Bout Love,1tkg4EHVoqnhR6iFEXb60y,0.709,0.548,10,-8.493,1,0.353,1.59e-06,0.133,0.543,83.995,160000,4,0.65,91.0,uh shawty jogging every morning every morning ...
3,3,Ariana Grande,Stuck with U,Stuck with U (with Justin Bieber),4HBZA5flZLE435QTztThqH,0.597,0.45,8,-6.658,1,0.0418,0.0,0.382,0.537,178.765,228482,3,0.223,90.0,ariana grande mmm hey fun stuck ariana grande ...
4,4,salem ilese,Mad at Disney,Mad at Disney,7aGyRfJWtLqgJaZoG9lJhE,0.738,0.621,0,-7.313,1,0.0486,7.39e-06,0.692,0.715,113.968,136839,4,0.424,88.0,mad disney disney tricked tricked wishing shoo...


# Exporting All DFs

Finally, we export these dataframes to use in later notebooks for this project.

In [66]:
giant_ordered_df.to_csv('giant_ordered_df.csv', index=False)

In [67]:
ordered_songlist.to_csv('ordered_songlist.csv',index=False)

In [68]:
large_df.to_csv('ordered_large_df.csv',index=False)

# Twitter Data Collection - Further Research for When Twitter Becomes Available to Scrape Again

In [69]:
# import tweepy

In [70]:
# auth = tweepy.OAuthHandler('<CONSUMER_KEY>', '<CONSUMER_SECRET>')
# auth.set_access_token('<ACCESS_TOKEN>', '<ACCESS_TOKEN_SECRET>')

In [71]:
# api = tweepy.API(auth)
# #test tweet pulling
# for tweet in tweepy.Cursor(api.search, q=artist).items(10):
#     print(tweet.text)

### Go to Part 2. EDA for more!