# An Analysis Of The Current Music Critic Landscape
by Nathan Kaplan

The job of a music critic is to review notable music as it is released. Music fans tend to use critics' reviews 1) to determine what to listen to and 2) as a centerpiece for discussing and debating music. As a result, a positive review can have a big effect on a band's commercial success. Over the past ten years, bands like Death Grips, Brockhampton, Daughters, and Spellling have all been propelled to commercial success by positive reviews from critics. On the other side of the coin, negative reviews can be detrimental to a band's success. Pitchfork, a large music publication, infamously reviewed the 2006 album Shine On by Jet [by just embedding a video of a monkey peeing into its own mouth](https://pitchfork.com/reviews/albums/9464-shine-on/). The album did not sell well, and although it's impossible to measure, Jet very likely lost a significant number of fans.

Given the role of music critics and the subjective nature of music reviewing, it is in the interest of fans and artists to know how different music publications review music differently. In this project, I will be analyzing the differences between several publications' top-songs-of-the-year lists.


The music critic landscape is incredibly varied, and each publication is known for having certain biases. Here is an overview of some of the major critic publications that we will be looking at in this tutorial:
* The most popular music critic today seems to be Anthony Fantano, who has been releasing video reviews of new albums on his YouTube channel **The Needle Drop** since 2014 and has amassed over 2.5 million subscribers. Critics of Fantano say that he is biased toward experimental music and doesn't give certain well-liked albums a chance. Fantano's top-songs-of-the-year lists are also only 50 songs long, whereas the other publications list 100.
* One of the historically most popular music publications that still has a strong following today is the aforementioned **Pitchfork**. In contrast to Fantano, Pitchfork is made up of a team of reviewers who release written reviews. Pitchfork is criticized for retroactively changing its reviews to match public opinion.
* The website rateyourmusic.com (abbreviated **RYM**) is a large forum for music fans to rate, review, and discuss music. This forum may also be viewed as a critic publication representing the views of music fans. The RYM ecosystem is criticized for being overly contrarian and rating small experimental songs over bigger, well-liked ones. This may be exacerbated by the fact that people typically don't rate music that they don't listen to, so well-liked songs from extremely niche genres rise to the top. RYM also differs from the other publications in that its top-songs-of-the-year list keeps changing well after the year ends as people continue to use the website.
* We can also view the general US public as a collective critic that rates songs by listening to them more or less. In contrast to the music on RYM, the music listened to the most by the wider public is criticized for being overly homogeneous and inoffensive. The songs listened to the most by the US public are compiled by the organization **Billboard** every year.

## Data collection/curation + parsing 

The first thing we need to do is collect and parse our data. We will be using the Spotipy library, a user-friendly Python library for interacting with the Spotify API, to obtain audio features that have been pre-computed by Spotify. To get additional data about each artist, we will cross-reference the Spotify data with a music database called MusicBrainz. The first step is to import our libraries and connect to both Spotify and MusicBrainz. Spotify requires a public and a secret developer key. I store mine in a separate text file, but you can get your own by creating a developer account [here](https://developer.spotify.com/dashboard/application).


In [None]:
%%capture
import os
os.system('pip install spotipy | (grep -v \'Requirement already satisfied\' 1>&2)')
os.system('pip install musicbrainzngs | (grep -v \'Requirement already satisfied\' 1>&2)')
os.system('pip install --user mca | (grep -v \'Requirement already satisfied\' 1>&2)')
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotipy.oauth2 import SpotifyClientCredentials
import numpy as np
import pandas as pd
import musicbrainzngs
import pickle
import random
import math
from datetime import date,time,datetime,timedelta
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
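from sklearn.neighbors import KernelDensity
from statsmodels.stats.weightstats import ztest # assumed source of the ztest() used in the z-test cells below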
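# Read the Spotify developer keys (client ID on the first line, secret on the second)
with open("../Spotify_keys.txt","r") as file:
    client_id=file.readline().strip() # strip the trailing newline
    client_secret=file.readline().strip()

# SpotifyClientCredentials() reads the keys from these environment variables.
# Note that `%env NAME=client_id` would store the literal text "client_id", not the variable's value.
os.environ['SPOTIPY_CLIENT_ID']=client_id
os.environ['SPOTIPY_CLIENT_SECRET']=client_secret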
musicbrainzngs.set_useragent("Critic Analyzer", "1.0")
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

Below are three functions that together extract our data and format it.

In [None]:
#Return Spotify song information given a single playlist URI
def get_spotify_info(uri):
    
    #Get song URIs
    songs=sp.playlist_items(uri)
    track_names=[song['track']['name'] for song in songs['items']]
    uris=[songs['items'][i]['track']['uri'] for i in range(len(songs['items']))]
    
    #Get song and artist features
    popularity=[songs['items'][i]['track']['popularity'] for i in range(len(songs['items']))] 
    features=list(sp.audio_features(tracks=uris))
    num_artists=[len(songs['items'][i]['track']['artists']) for i in range(len(songs['items']))]
    artist_names=[[songs['items'][i]['track']['artists'][j]['name'] for j in range(num_artists[i])] for i in range(len(songs['items']))]
    artist_uris=[[songs['items'][i]['track']['artists'][j]['uri'] for j in range(num_artists[i])] for i in range(len(songs['items']))]
    artist_genres=[set() for i in range(len(track_names))]
    
    for i in range(len(track_names)):
        for j in range(len(artist_uris[i])):
            artist_genres[i].update(sp.artist(artist_uris[i][j])['genres'])
            
    return track_names,popularity,features,num_artists,artist_names,artist_genres

#Return musicbrainz artist information given list of artists associated with each track
def get_musicbrainz_info(artist_names,track_names):
    
    #Get artist and artist info
    mb_artist_info=[[musicbrainzngs.search_artists(query=artist, limit=1)['artist-list'][0] for artist in artist_list] for artist_list in artist_names]
    artist_genders=[[] for i in range(len(track_names))]
    artist_born=[[] for i in range(len(track_names))]
    group_started=[[] for i in range(len(track_names))]
    artist_countries=[[] for i in range(len(track_names))]
    
    #Get features from each artist
    for i in range(len(track_names)):
        for j in range(len(artist_names[i])):
            gender=mb_artist_info[i][j].get('gender');
            
            #Extract gender (or group)
            if gender is not None:
                artist_genders[i].append(gender[0])
                
            elif(mb_artist_info[i][j].get('type')=='Group'):
                artist_genders[i].append('g')
            else:
                artist_genders[i].append('missing gender') #We don't use pd.NA because it simplifies quantifying missingness

            
            #Extract start date (group formation date if group, birth date otherwise)
            start=mb_artist_info[i][j]['life-span'].get('begin')
            if start is not None:
                start=int(start[0:4])
                if gender is not None:
                    artist_born[i].append(start)
                elif(mb_artist_info[i][j].get('type')=='Group'):
                    group_started[i].append(start)
            
            #Extract country information
            country=mb_artist_info[i][j].get('country');
            if(country is not None):
                artist_countries[i].append(country)
            else:
                artist_countries[i].append('missing country')  
                    
    return artist_genders,artist_born,group_started,artist_countries

#Create a list of per-song dicts with Spotify and MusicBrainz information, given one playlist URI and its year and publication
def create_df_list(uri,year,publication):
    track_names,popularity,features,num_artists,artist_names,artist_genres=get_spotify_info(uri)
    artist_genders,artist_born,group_started,artist_countries=get_musicbrainz_info(artist_names,track_names)
    to_remove=['type','id','uri','track_href','analysis_url']
    
    #Constructs dataframe input, entity-by-entity
    for i in range(len(track_names)):
        [features[i].pop(key) for key in to_remove]
        features[i]['title']=track_names[i]
        features[i]['popularity']=popularity[i]
        features[i]['genres']=artist_genres[i]
        features[i]['num_artists']=num_artists[i]
        features[i]['gender']=artist_genders[i]
        features[i]['born']=artist_born[i]
        features[i]['group started']=group_started[i]
        features[i]['country']=artist_countries[i]
        features[i]['year']=year
        features[i]['publication']=publication #Use the function argument, not the global list of all publications
    return features

Our database will consist of top-songs-of-the-year lists from Pitchfork, Billboard, The Needle Drop, and RYM from 2018 to 2021. To collect data on these songs, I have found or made a Spotify playlist for each list and compiled the playlist URIs (identifiers for the playlists) below, along with labels for their years and publications. After passing these lists to the aforementioned data collection and parsing functions, we can create our dataframe.
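The playlist list below mixes two identifier forms: `spotify:playlist:...` URIs and `https://open.spotify.com/playlist/...` share links. Spotipy appears to accept both, so normalizing them is optional. If a single consistent form is wanted (for logging or de-duplication, say), a small helper along these lines could do it. The function name `to_playlist_uri` is hypothetical and is not part of Spotipy.

```python
from urllib.parse import urlparse

def to_playlist_uri(identifier):
    """Return a 'spotify:playlist:<id>' URI for either a URI or a share link.

    Hypothetical helper for illustration only; not part of Spotipy.
    """
    if identifier.startswith("spotify:playlist:"):
        return identifier
    parsed = urlparse(identifier)
    parts = [part for part in parsed.path.split("/") if part]
    # Expect a path like /playlist/<id>; the ?si=... query string is ignored
    if parsed.netloc == "open.spotify.com" and len(parts) == 2 and parts[0] == "playlist":
        return "spotify:playlist:" + parts[1]
    raise ValueError("Not a recognized Spotify playlist identifier: " + identifier)

# One of the share links from the list below
print(to_playlist_uri("https://open.spotify.com/playlist/1WBljFutuk7uLQtfqfmjWV?si=423d7d9630b34377"))
# prints spotify:playlist:1WBljFutuk7uLQtfqfmjWV
```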

In [18]:
collect_data=False
uris = ['spotify:playlist:56wt6zNzhzhrLP9nzX1KbE','spotify:playlist:7us0dBvTuLmDBk89LSxaeE','spotify:playlist:4nOIuiVdgx5M8ULw5A2Qtc','spotify:playlist:2VyXmtMZMdSgesnjd6hrLL','spotify:playlist:6J4j6rDdlbUD021gMr06Sl','spotify:playlist:5lO5SKSvwbLetiBt6k7wNX','https://open.spotify.com/playlist/1WBljFutuk7uLQtfqfmjWV?si=423d7d9630b34377','spotify:playlist:5Nt7KFSEfXIlsDIB8SCpNU','spotify:playlist:6rWybOnbRPy2HtHRPlDrkZ','spotify:playlist:30u7B05HgbYAz2KOGRU0F9','spotify:playlist:7E3ygsfSyeZZE9BeYcrNNN','spotify:playlist:0bzxbdJIhQSNYeB5wqm4Vx','spotify:playlist:60yLVRdN9gW4o6MZjWS8mX','https://open.spotify.com/playlist/4kjy0VYhy6tFgYVGQiRJT2?si=2f185eb6f3a84b2b','https://open.spotify.com/playlist/0fPyyFzEu3Fet52UfFQXyn?si=783ee355501a4736','spotify:playlist:6PCMVaxqovwyPBXr5fG9nz']
years=list(range(2018,2022))*4
pubs=["Pitchfork","Pitchfork","Pitchfork","Pitchfork","Billboard","Billboard","Billboard","Billboard","The Needle Drop","The Needle Drop","The Needle Drop","The Needle Drop","RYM","RYM","RYM","RYM"]

if collect_data:
    df_list=[]
    for i in range(len(uris)):
        df_list=df_list+create_df_list(uris[i],years[i],pubs[i])
    
    with open('df_list'+str(random.randint(1,9999))+'.p', 'wb') as file:
        pickle.dump(df_list, file)
else:
    with open('df_list8838.p', 'rb') as file:
        df_list = pickle.load(file)
        
df=pd.DataFrame(df_list)
display(df)

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | ... | title | popularity | genres | num_artists | gender | born | group started | country | year | publication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.472 | 0.745 | 4 | -5.297 | 1 | 0.0356 | 0.000607 | 0.000027 | 0.0957 | 0.0631 | ... | Love It If We Made It | 67 | {pop, rock, modern alternative rock, modern rock} | 1 | [g] | [] | [2002] | [GB] | 2018 | Pitchfork |
| 1 | 0.652 | 0.604 | 11 | -8.666 | 0 | 0.1010 | 0.017900 | 0.000589 | 0.1190 | 0.0680 | ... | Honey | 44 | {swedish electropop, swedish pop, electropop, ... | 1 | [f] | [1979] | [] | [SE] | 2018 | Pitchfork |
| 2 | 0.893 | 0.482 | 0 | -4.779 | 0 | 0.0873 | 0.549000 | 0.016400 | 0.0752 | 0.5300 | ... | MALAMENTE - Cap.1: Augurio | 68 | {r&b en espanol} | 1 | [f] | [1993] | [] | [ES] | 2018 | Pitchfork |
| 3 | 0.585 | 0.909 | 8 | -6.474 | 1 | 0.0707 | 0.089100 | 0.000097 | 0.1190 | 0.7580 | ... | Nice For What | 77 | {hip hop, canadian pop, canadian hip hop, toro... | 1 | [m] | [1986] | [] | [CA] | 2018 | Pitchfork |
| 4 | 0.712 | 0.601 | 5 | -8.968 | 1 | 0.0620 | 0.000011 | 0.802000 | 0.0982 | 0.5130 | ... | Pick Up | 54 | {tech house, electronica, minimal techno, deep... | 1 | [m] | [1972] | [] | [DE] | 2018 | Pitchfork |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1377 | 0.894 | 0.615 | 3 | -6.269 | 1 | 0.0826 | 0.030800 | 0.009210 | 0.3090 | 0.6610 | ... | Headshots (4r Da Locals) | 60 | {tennessee hip hop, hip hop, rap, underground ... | 1 | [m] | [1991] | [] | [US] | 2021 | RYM |
| 1378 | 0.663 | 0.881 | 5 | -3.854 | 0 | 0.3110 | 0.122000 | 0.000000 | 0.2940 | 0.6280 | ... | Life is Like a Dice Game - Spotify Singles | 54 | {hip hop, dmv rap, underground hip hop, trap, ... | 3 | [m, m, m] | [1973, 1997, 1982] | [] | [US, US, US] | 2021 | RYM |
| 1379 | 0.699 | 0.839 | 6 | -6.394 | 0 | 0.0582 | 0.006600 | 0.660000 | 0.2450 | 0.1500 | ... | 50/50 | 40 | {escape room, art pop, london indie} | 1 | [g] | [] | [2017] | [missing country] | 2021 | RYM |
| 1380 | 0.748 | 0.740 | 8 | -6.010 | 1 | 0.0484 | 0.010700 | 0.000022 | 0.1010 | 0.5180 | ... | Take My Breath - Single Version | 77 | {pop, canadian contemporary r&b, canadian pop} | 1 | [m] | [1990] | [] | [CA] | 2021 | RYM |


## Data management/representation

Note that a total of 15 songs are missing from either Spotify or MusicBrainz. The list of these missing songs can be found [here](). Looking at this list, it is clear that these songs are missing not at random. That is, the missing songs differ from the general population of songs in a way that can't be explained by the data we have. In particular, they tend to be less popular trap or experimental songs from RYM or Pitchfork. In the absence of any data on the distribution of missing songs on Spotify, one method for imputing these songs would be to hand-pick other songs that share sonic characteristics with them and impute them over several random trials. However, this would likely impart some of my own bias on the selection of songs. Therefore, for simplicity, we will accept some inaccuracy and keep an eye on categories like popularity that we would expect to be affected.

We are also missing several countries of origin and genders of artists from the MusicBrainz database. The best possible solution here would be to do separate research for each missing entry (gender and country of origin are generally not too hard to find). Due to the volume of missing data and the time span of this project, I'm going to rule that out. If we considered this data missing at random (is MusicBrainz data collection random? Not clear) and had numbers for what proportion of musical artists are from each country and of each gender, we could again impute this data over several random trials. In the absence of this data, we will again just have to accept some inaccuracy and keep an eye on where the data might be skewed.
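One way to "keep an eye on where the data might be skewed" is to tabulate, per publication, the share of songs that carry a missing marker. The sketch below does this in plain Python on made-up rows. The helper name `missing_share` and the toy rows are assumptions for illustration, while the marker strings `'missing gender'` and `'missing country'` are the ones used by the parsing functions above.

```python
from collections import defaultdict

def missing_share(rows, column, marker):
    """Fraction of songs per publication with at least one artist carrying `marker`."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in rows:
        totals[row["publication"]] += 1
        if marker in row[column]:
            flagged[row["publication"]] += 1
    return {pub: flagged[pub] / totals[pub] for pub in totals}

# Made-up rows in the same shape as the real data (one dict per song)
toy_rows = [
    {"publication": "Pitchfork", "gender": ["f"], "country": ["SE"]},
    {"publication": "Pitchfork", "gender": ["missing gender"], "country": ["US"]},
    {"publication": "RYM", "gender": ["m", "missing gender"], "country": ["missing country"]},
    {"publication": "RYM", "gender": ["g"], "country": ["missing country"]},
]

print(missing_share(toy_rows, "gender", "missing gender"))
print(missing_share(toy_rows, "country", "missing country"))
```

On the real data, `df.to_dict('records')` would supply rows in this shape. A publication with a much higher missing share than the others is one whose gender or country results deserve extra caution.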

Now that we have a dataframe, we can do some extra data parsing to make the data easier to analyze. In particular, we will extract the following features and put them into the dataframe:
1) If any of the song's artists are groups, what is the mean year the groups were started?
2) If any of the song's artists are solo artists, what is the mean year these artists were born?
3) Does the song contain an artist that is male/female/group/missing gender?
4) Does the song contain an artist that is from the US/Great Britain (GB)/Japan (JP)/Australia (AU)/Canada (CA)/Spain (ES)/missing country?

Each of these features is more useful to us than the column it was extracted from, since it can be expressed as a number. We also convert each column to its proper data type so that certain Pandas functions will work.
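As a worked example of the four features listed above, the sketch below collapses the per-artist lists of a single song into scalars. The helper name `summarize_song` is hypothetical and stands apart from the notebook's own loop. The input values are those shown for "Life is Like a Dice Game - Spotify Singles" in the table above.

```python
def summarize_song(genders, born, group_started, countries,
                   tracked_countries=("US", "GB", "JP", "AU", "CA", "ES", "missing country")):
    """Collapse the per-artist lists of one song into scalar features (illustrative helper)."""
    def mean_or_none(years):
        # Mean of the listed years, or None when the song has no such artists
        return sum(years) / len(years) if years else None

    summary = {
        "born": mean_or_none(born),
        "group started": mean_or_none(group_started),
        "female": "f" in genders,
        "male": "m" in genders,
        "group": "g" in genders,
        "missing gender": "missing gender" in genders,
    }
    for country in tracked_countries:
        summary[country] = country in countries
    return summary

# Three male solo artists from the US, born in 1973, 1997, and 1982
example = summarize_song(["m", "m", "m"], [1973, 1997, 1982], [], ["US", "US", "US"])
print(example["born"], example["group started"], example["male"], example["US"])
# prints 1984.0 None True True
```

The mean birth year is (1973 + 1997 + 1982) / 3 = 1984, and because no artist on the song is a group, the group start year stays empty.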

In [None]:
countries=['US','GB','JP','AU','CA','ES','missing country']

# Iterate through each song, extracting new attributes
for i,row in df.iterrows():
    
    # Calculate the mean year the artists who are groups started
    df.at[i,'group started']=pd.NA
    if(len(row['group started'])>0):
        df.at[i,'group started']=sum(row['group started'])/len(row['group started'])
    
    # Calculate the mean year the artists who are solo artists were born
    df.at[i,'born']=pd.NA
    if(len(row['born'])>0):
        df.at[i,'born']=sum(row['born'])/len(row['born'])
    
    # Calculate whether the song contains an artist that is male/female/group/missing gender?
    df.at[i,'female']= 'f' in row['gender']
    df.at[i,'male']='m' in row['gender']
    df.at[i,'group']='g' in row['gender']
    df.at[i,'missing gender']='missing gender' in row['gender']
    
    # Calculate whether the song contains an artist that is from the US/Great Britain (GB)/Japan (JP)/Australia (AU)/Canada (CA)/Spain (ES)/missing country? 
    for country in countries:
        df.at[i,country]=country in row['country']

# Set the dtypes of the new columns
df['born']=pd.to_numeric(df['born'])
df['group started']=pd.to_numeric(df['group started'])
df['female']= df['female'].astype(bool)
df['male']= df['male'].astype(bool)
df['group']= df['group'].astype(bool)
df['missing gender']= df['missing gender'].astype(bool)
for country in countries:
    df[country]=df[country].astype(bool)

df.drop(columns=['time_signature','key','mode'],inplace=True) #Drop unhelpful columns
display(df)

In [None]:
pubs=list(df['publication'].unique())
df_arr=np.array(df.drop(['born','group started'],axis=1).select_dtypes(include=['int64', 'float64','bool']))
df_arr=StandardScaler().fit_transform(df_arr)

fig,ax = plt.subplots(figsize=(12,8), dpi= 100)

pca = PCA(n_components=2)
pca_data = pca.fit_transform(df_arr)

pub_indices=[[] for pub in pubs]
for i,pub in enumerate(pubs):
    bool_series=df['publication']==pub
    pub_indices[i]=bool_series[bool_series].index.values
    ax.scatter(pca_data[pub_indices[i],0],pca_data[pub_indices[i],1],s=4)
    
ax.legend(labels=pubs)

In [None]:
years=list(df['year'].unique())
fig,ax = plt.subplots(figsize=(12,8), dpi= 100)
year_indices=[[] for year in years]
for i,year in enumerate(years):
    bool_series=df['year']==year
    year_indices[i]=bool_series[bool_series].index.values
    ax.scatter(pca_data[year_indices[i],0],pca_data[year_indices[i],1],s=4)

ax.legend(labels=years)

In [None]:
pub_zs=[[] for pub in pubs]
metrics=[]
numerics=['int64', 'float64','bool']

for i,(publication,group) in enumerate(df.drop(['year'],axis=1).groupby(df['publication'],sort=False)):
    print(publication)
    for name,column in group.items(): # .iteritems() was removed in pandas 2.0
        if(column.dtype in numerics):
            if(i==0):
                metrics.append(name)
            z,p=ztest(column.dropna(),df[name].dropna())
            pub_zs[i].append(z)

fig,ax = plt.subplots(figsize=(12,8), dpi= 100)


bar_x=np.arange(len(metrics))
for i,pub_z_list in enumerate(pub_zs):
    ax.bar(bar_x+i*.15,pub_z_list,width=.15)

ax.legend(labels=pubs)
plt.xticks(bar_x+.3, metrics,rotation='vertical')
plt.axhline(y=1.96, color='r', linestyle='--')
plt.axhline(y=-1.96, color='r', linestyle='--')
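The dashed lines at ±1.96 in the bar chart mark the usual two-sided 5% threshold for a z-statistic. As a rough illustration of what each bar represents, the sketch below computes a two-sample z-statistic by hand using only the standard library. It assumes the pooled-variance form, which I believe matches the default of statsmodels' `ztest`. The function name `two_sample_z` and the popularity numbers are made up for illustration.

```python
import math
from statistics import mean, variance

def two_sample_z(sample_a, sample_b):
    """Two-sample z-statistic with pooled variance, plus a two-sided p-value."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    # Two-sided p-value under a standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up scores: one small group compared against a second group
z, p = two_sample_z([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(round(z, 3), abs(z) > 1.96, p < 0.05)
```

A bar that crosses either dashed line corresponds to |z| > 1.96, that is, a difference in means that would be flagged at the 5% level under this test's assumptions.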

In [None]:
year_zs=[[] for year in years]
for i,(year,group) in enumerate(df.groupby(df['year'])):
    for name,column in group.drop('year',axis=1).items(): # .iteritems() was removed in pandas 2.0
        if(column.dtype in numerics):
            z,p=ztest(column.dropna(),df[name].dropna())
            year_zs[i].append(z)

fig,ax = plt.subplots(figsize=(12,8), dpi= 100)


bar_x=np.arange(len(metrics))
for i,year_z_list in enumerate(year_zs):
    ax.bar(bar_x+i*.15,year_z_list,width=.15)

ax.legend(labels=years)
plt.xticks(bar_x+.3, metrics,rotation='vertical')
plt.axhline(y=1.96, color='r', linestyle='--')
plt.axhline(y=-1.96, color='r', linestyle='--')

In [None]:
pop_violin=[[] for pub in pubs]
for i,(publication,group) in enumerate(df.groupby(df['publication'],sort=False)):
    pop_violin[i]=list(group['popularity'].dropna())

fig, ax = plt.subplots()
ax.violinplot(pop_violin,widths=1,showmeans=True)
plt.xticks(range(1,5), pubs)
ax.set_xlabel("Publication")
ax.set_ylabel("Popularity")
ax.set_title("Violin Plot Of Popularity Vs. Publication")

In [None]:
with pd.option_context("display.max_rows", 1000, "display.max_columns", None):
    # Fit a density over all songs and print the ten songs in the densest regions
    kde_all = KernelDensity(kernel='gaussian',bandwidth=1).fit(df_arr)
    kde_top_points=np.flip(np.argsort(kde_all.score_samples(df_arr)))
    print(df.loc[kde_top_points[0:10]])
    # Fit a density over the Billboard songs only, then print the song (from any list) that scores highest under it
    kde_billboard = KernelDensity(kernel='gaussian',bandwidth=.25).fit(df_arr[pub_indices[pubs.index('Billboard')],:])
    print(df.loc[np.argmax(kde_billboard.score_samples(df_arr))])


In [None]:
fig,ax = plt.subplots(figsize=(12,8), dpi= 100)
    
tsne=TSNE()
tsne_data = tsne.fit_transform(df_arr)

# Scatter the t-SNE embedding, one color per publication
pub_indices=[[] for pub in pubs]
for i,pub in enumerate(pubs):
    bool_series=df['publication']==pub
    pub_indices[i]=bool_series[bool_series].index.values
    ax.scatter(tsne_data[pub_indices[i],0],tsne_data[pub_indices[i],1],s=4)

ax.legend(labels=pubs)

## Future work

* There are many other critic publications that I did not have the opportunity to look at in this piece. In the future, it would be valuable to include publications like Rolling Stone, NPR Music, and Stereogum.
* Find songs that all these publications rated and analyze the scores that were given by each.
* Analyze which songs and albums were reviewed by each publication (especially those like The Needle Drop and Pitchfork, which can only review a limited number of things).
* Analyze the record label composition and the budget associated with each label.
* Include music genres in the analysis.
