# Data Wrangling - Spotify Top 50

### Introduction
In this project we look at several questions about the Top 50 songs on Spotify. We analyse the sentiment of the lyrics and combine this data with other datasets to hopefully derive some interesting conclusions. Every section in this notebook represents a research question on this topic.

Group members of this project:
<ul>
    <li>Luuk Kaandorp (2623537)</li>
    <li>Lucas de Geus</li>
    <li>Ward Pennink</li>
    <li>Matthijs Blaauw</li>
</ul>

### Import
This section imports every function and package used in the following sections. Please run it before running any of the others.

In [1]:
# import the libraries needed for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_rows = 20
np.set_printoptions(precision = 4, suppress = True)

# import the libraries needed for getting the lyrics
import requests
from bs4 import BeautifulSoup

# import the libraries needed for sentiment analysis (see the cell below for the acknowledgement)
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# import the library needed for text pre-processing
import string

# import the library needed for language detection
from langdetect import detect

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### Acknowledgement
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

### Q1: Proportion of songs with positive sentiment compared to songs with negative sentiment in the Top 50

This question asks whether the Top 50 contains more songs with a positive sentiment than songs with a negative sentiment. We first show how our analyses and functions work by looking at a single week. We then analyse 6 weeks of the year 2019 so that we can draw a conclusion. In the code field below we start by loading the weekly Top 50 for the US on Spotify for the week of 1-11-2019 to 8-11-2019.

In [2]:
# this reads in the first Spotify Top 50 of the US
first_top50_file = "q1_data/regional-us-weekly-2019-11-01--2019-11-08.csv"
first_top50 = pd.read_csv(first_top50_file, header=1)
first_top50 = pd.DataFrame(first_top50.head(50))
first_top50

Unnamed: 0,Position,Track Name,Artist,Streams,URL
0,1,HIGHEST IN THE ROOM,Travis Scott,9395208,https://open.spotify.com/track/3eekarcy7kvN4yt...
1,2,Circles,Post Malone,9248356,https://open.spotify.com/track/21jGcNKet2qwijl...
2,3,Lose You To Love Me,Selena Gomez,8918735,https://open.spotify.com/track/1HfMVBKM75vxSfs...
3,4,Bandit (with YoungBoy Never Broke Again),Juice WRLD,8522651,https://open.spotify.com/track/6Gg1gjgKi2AK4e0...
4,5,ROXANNE,Arizona Zervas,8252346,https://open.spotify.com/track/1ZPWWSwCkxKfqdp...
...,...,...,...,...,...
45,46,On God,Kanye West,3256548,https://open.spotify.com/track/2SasoXZyv82yYgH...
46,47,Bad Bad Bad (feat. Lil Baby),Young Thug,3249963,https://open.spotify.com/track/1GeNui6m825V8jP...
47,48,Graveyard,Halsey,3209662,https://open.spotify.com/track/6V9fHiv84WlVTg7...
48,49,Did It Again,Lil Tecca,3207435,https://open.spotify.com/track/4guBZjUyrGoHsTa...
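
The `header=1` argument in the cell above deserves a short illustration. The sketch below assumes that the downloaded chart file starts with one non-data line before the line holding the column names; that layout is an assumption, and the file content and song names are made up.

```python
import io

import pandas as pd

# Made-up stand-in for a chart file (assumed layout):
# line 0 is a note, line 1 holds the column names, the rest is data.
raw_csv = (
    "This first line is a note and not part of the table,,,,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Song A,Artist A,100,https://example.com/a\n"
    "2,Song B,Artist B,90,https://example.com/b\n"
    "3,Song C,Artist C,80,https://example.com/c\n"
)

# header=1 makes pandas use line 1 as the header and ignore line 0.
chart = pd.read_csv(io.StringIO(raw_csv), header=1)

# head(n) keeps the first n rows, in the same way head(50) keeps the Top 50.
top2 = chart.head(2)
print(top2)
```

If a file had no such leading line, `header=1` would silently turn the first song into the column names, so it is worth checking the first rows after loading.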


Now that we have the songs, we need a function that gets the lyrics for a given song. We use the Genius API to look up the song. We also need a function that processes the search result. The last function we need scrapes the lyrics from the HTML page we get back for the song. The code section below contains these functions and demonstrates how they work.

In [3]:
# this function requests the song info based on the artist and song title
def request_song_info(song_title, artist_name):
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + '2wN0egWOzQ-_KQDN4XkblxZoUjG2H_Zl-xq3uXNJhVFNaIkaM0QfvSWM4pm3fWhw'}
    search_url = base_url + '/search'
    data = {'q': song_title + ' ' + artist_name}
    # a GET request carries the search term in the query string, so we pass it via params
    response = requests.get(search_url, params=data, headers=headers)

    return response

# this function processes the result of the info request
def process_request(response, artist_name):
    json = response.json()
    remote_song_info = None

    for hit in json['response']['hits']:
        if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
            remote_song_info = hit
            break
            
    if remote_song_info:
        return remote_song_info['result']['url']
    else:
        return ""
    
# this function scrapes the lyrics from the html page
def scrap_lyrics(url):
    if url != "":
        page = requests.get(url)
        html = BeautifulSoup(page.text, 'html.parser')
        lyrics_div = html.find('div', class_='lyrics')
        # find() returns None when the page has no such div, so we skip the song in that case
        if lyrics_div is None:
            return ""
        lyrics = lyrics_div.get_text()
        # the sentiment function defined later in this notebook only works for English lyrics,
        # so we detect the language and skip lyrics that are not English
        if detect(lyrics) == "en":
            return process_lyrics(lyrics)
    return ""

# this function pre-processes the lyrics
def process_lyrics(lyrics):
    # replace line breaks (\n) and special space characters with a regular space
    enter_removed = lyrics.replace('\u2005', ' ').replace('\u200a', ' ').replace('\n', ' ')
    
    # remove punctuation
    punctuation_removed = enter_removed.translate(str.maketrans("","", string.punctuation)) 
    
    return punctuation_removed
            
# TEST
test_lyrics = scrap_lyrics(process_request(request_song_info('Circles', 'Post Malone'), 'Post Malone'))
test_lyrics

'  Intro Oh oh oh Oh  oh oh Oh  oh oh oh oh  Verse 1 We couldnt turn around til we were upside down Ill be the bad guy now but know I aint too proud I  couldnt be there even when I tried You  dont believe it we do this every time  Chorus Seasons change and our love went cold Feed  the flame cause we cant let go Run away but were running in circles Run away run away I dare you to do something Im  waiting on you again so I dont take the blame Run away but were running in circles Run away run away run away  Verse 2 Let go I got a feeling that its time to let go I said so I knew that this was doomed from the getgo You thought that it was special special But it was just the sex though the sex though And I still hear the echoes The echoes I got a feeling that its time to let it go let it go  Chorus Seasons change and our love went cold Feed the flame cause we cant let go Run away but were running in circles Run away run away I dare you to do something Im waiting on you again so I dont take t
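
Two of the steps above can be checked without contacting Genius: picking the right search hit and pre-processing the text. The sketch below is a minimal offline illustration. The payload is made up and only mimics the keys that `process_request` reads, and `first_matching_url` and `clean_text` are hypothetical stand-ins that mirror `process_request` and `process_lyrics`.

```python
import string

# Made-up payload that mimics the keys read by process_request:
# response -> hits -> result -> primary_artist -> name, and result -> url.
fake_payload = {
    "response": {
        "hits": [
            {"result": {"primary_artist": {"name": "Some Cover Band"},
                        "url": "https://example.com/cover"}},
            {"result": {"primary_artist": {"name": "Post Malone"},
                        "url": "https://example.com/original"}},
        ]
    }
}


def first_matching_url(payload, artist_name):
    """Return the URL of the first hit whose primary artist contains artist_name."""
    for hit in payload["response"]["hits"]:
        if artist_name.lower() in hit["result"]["primary_artist"]["name"].lower():
            return hit["result"]["url"]
    return ""  # same "nothing found" marker as in the notebook


def clean_text(text):
    """Replace line breaks and thin spaces with spaces, then drop ASCII punctuation."""
    flattened = text.replace("\u2005", " ").replace("\u200a", " ").replace("\n", " ")
    return flattened.translate(str.maketrans("", "", string.punctuation))


matched_url = first_matching_url(fake_payload, "post malone")
missing_url = first_matching_url(fake_payload, "Unknown Artist")
cleaned = clean_text("Run away, but we're running in circles\nRun away,\u2005run away!")

print(matched_url)
print(repr(missing_url))
print(cleaned)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic characters such as curly apostrophes are left in the text by this step.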

Now that we have these functions, we can get the lyrics for all the songs in the Top 50. All we need now is a function that determines the sentiment of the lyrics. The code section below contains this function and demonstrates how it works.

In [4]:
# this function calculates the sentiment of the lyrics
def sentiment_of_lyrics(lyrics):
    if lyrics != "":
        sid = SentimentIntensityAnalyzer()
        return sid.polarity_scores(lyrics)
    else:
        return False

# TEST
sentiment_of_lyrics(test_lyrics)

{'neg': 0.049, 'neu': 0.846, 'pos': 0.104, 'compound': 0.963}

We now have everything we need: the songs, the lyrics and the sentiment processor. The following code section defines a function that creates the final table containing the songs and their sentiment, and demonstrates how it works. Note that creating this table can take 2-3 minutes!

In [5]:
# calculate sentiment for all songs in a Top 50
def sentiment_for_top50(top50):
    for index, row in top50.iterrows():
        sentiment = sentiment_of_lyrics(scrap_lyrics(process_request(request_song_info(row['Track Name'], row['Artist']), row['Artist'])))
        if sentiment:
            top50.loc[index,'neg_sentiment'] = sentiment.get('neg')
            top50.loc[index,'neu_sentiment'] = sentiment.get('neu')
            top50.loc[index,'pos_sentiment'] = sentiment.get('pos')
            top50.loc[index,'compound_sentiment'] = sentiment.get('compound')
        else:
            top50.loc[index,'neg_sentiment'] = 0
            top50.loc[index,'neu_sentiment'] = 0
            top50.loc[index,'pos_sentiment'] = 0
            top50.loc[index,'compound_sentiment'] = 0
    return top50

# TEST
processed_first_top50 = sentiment_for_top50(first_top50)
processed_first_top50

Unnamed: 0,Position,Track Name,Artist,Streams,URL,neg_sentiment,neu_sentiment,pos_sentiment,compound_sentiment
0,1,HIGHEST IN THE ROOM,Travis Scott,9395208,https://open.spotify.com/track/3eekarcy7kvN4yt...,0.062,0.771,0.167,0.9897
1,2,Circles,Post Malone,9248356,https://open.spotify.com/track/21jGcNKet2qwijl...,0.049,0.846,0.104,0.9630
2,3,Lose You To Love Me,Selena Gomez,8918735,https://open.spotify.com/track/1HfMVBKM75vxSfs...,0.185,0.431,0.384,0.9993
3,4,Bandit (with YoungBoy Never Broke Again),Juice WRLD,8522651,https://open.spotify.com/track/6Gg1gjgKi2AK4e0...,0.256,0.640,0.104,-0.9992
4,5,ROXANNE,Arizona Zervas,8252346,https://open.spotify.com/track/1ZPWWSwCkxKfqdp...,0.093,0.634,0.273,0.9979
...,...,...,...,...,...,...,...,...,...
45,46,On God,Kanye West,3256548,https://open.spotify.com/track/2SasoXZyv82yYgH...,0.059,0.779,0.163,0.9796
46,47,Bad Bad Bad (feat. Lil Baby),Young Thug,3249963,https://open.spotify.com/track/1GeNui6m825V8jP...,0.308,0.648,0.043,-0.9996
47,48,Graveyard,Halsey,3209662,https://open.spotify.com/track/6V9fHiv84WlVTg7...,0.180,0.739,0.080,-0.9930
48,49,Did It Again,Lil Tecca,3207435,https://open.spotify.com/track/4guBZjUyrGoHsTa...,0.209,0.690,0.100,-0.9908


We now have the table containing the Top 50 songs on Spotify with the sentiment of their lyrics. We will also use this table in later sections. To answer our question, we now look at the proportion of negative and positive lyrics based on their compound sentiment. We count a compound score above 0.4 as positive and a compound score below -0.4 as negative; songs in between are counted as neither.

In [6]:
# the number of positive sentiment lyrics
first_top50_compound_sentiment = processed_first_top50['compound_sentiment']
first_top50_num_positive = first_top50_compound_sentiment[first_top50_compound_sentiment > 0.4 ].count()

# the number of negative sentiment lyrics
first_top50_num_negative = first_top50_compound_sentiment[first_top50_compound_sentiment < -0.4 ].count()

print("Songs with positive sentiment: " + str(first_top50_num_positive))
print("Songs with negative sentiment: " + str(first_top50_num_negative))

Songs with positive sentiment: 20
Songs with negative sentiment: 20
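
The thresholding used above can also be written as a small labelling function, which makes the third, uncounted group explicit. This is a minimal sketch: `classify_compound` is a hypothetical helper, not part of the notebook, and the scores are taken from the table shown earlier plus a 0 for a song without usable lyrics.

```python
from collections import Counter


def classify_compound(score, threshold=0.4):
    """Label a compound score using the same bounds as the notebook (0.4 and -0.4)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Compound scores from the table above, plus 0 for a song whose lyrics
# could not be retrieved (sentiment_for_top50 fills in 0 in that case).
example_scores = [0.9897, 0.9630, -0.9992, -0.9930, 0]

label_counts = Counter(classify_compound(score) for score in example_scores)
print(label_counts)
```

Because songs without lyrics receive a compound score of 0, they fall into the neutral group and are left out of both counts.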


To avoid repeating the code above every time we want to count the positive and negative songs in a week's Top 50, we wrap it in a function called compound_sentiment_function. This function takes the file for which the compound sentiment needs to be calculated as its parameter, and returns a tuple containing the number of songs with a positive sentiment and the number of songs with a negative sentiment.

In [7]:
def compound_sentiment_function(file):    
    top50 = pd.DataFrame(pd.read_csv(file, header=1).head(50))

    # creates augmented top50 table with sentiment
    processed_top50 = sentiment_for_top50(top50)
    top50_compound_sentiment = processed_top50['compound_sentiment']
    
    # the number of positive sentiment lyrics    
    top50_num_positive = top50_compound_sentiment[top50_compound_sentiment > 0.4 ].count()

    # the number of negative sentiment lyrics
    top50_num_negative = top50_compound_sentiment[top50_compound_sentiment < -0.4 ].count()
    
    return (top50_num_positive, top50_num_negative)

As the result for the first week shows, the numbers of songs with positive and negative sentiment are practically the same. We consider a difference this small to be negligible. We therefore decided to look at 5 more random weeks to see whether this holds for multiple weeks or was just a coincidence. In the code field below we look at all 6 random weeks from 2019. Note that this can take up to 20 minutes to calculate!

In [8]:
# This loop goes through all 6 random weeks, prints the number of songs with positive and negative sentiment
# for each week and adds these counts to the running totals
files = ["q1_data/regional-us-weekly-2019-01-04--2019-01-11.csv" , "q1_data/regional-us-weekly-2019-03-08--2019-03-15.csv" ,
         "q1_data/regional-us-weekly-2019-04-26--2019-05-03.csv" , "q1_data/regional-us-weekly-2019-07-05--2019-07-12.csv" , 
         "q1_data/regional-us-weekly-2019-08-30--2019-09-06.csv" , "q1_data/regional-us-weekly-2019-11-01--2019-11-08.csv"]
total_positive = 0
total_negative = 0
for file in files:
    # Get the sentiment from the file
    compound_sentiment_of_week = compound_sentiment_function(file)
    
    # Add count of positive and negative sentiment to the totals
    total_positive += compound_sentiment_of_week[0]
    total_negative += compound_sentiment_of_week[1]

    # Print the counts of the week
    print("Songs with positive sentiment: " + str(compound_sentiment_of_week[0]))
    print("Songs with negative sentiment: " + str(compound_sentiment_of_week[1]))
    print("")

Once the loop above has finished, all 6 weeks have been calculated. In the next code field we calculate the difference in percentage.

In [None]:
# we calculate the difference between positive and negative songs in percentage
difference = ((total_positive - total_negative) / total_negative) * 100
print("In the 6 sampled weeks, we see that there are " + str(round(difference, 2)) + "% more songs with positive sentiment in their lyrics in the Top 50 of the US than songs with negative sentiment.")
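
To make the percentage formula concrete, here is a small worked example. The totals of 120 and 100 are made up for illustration and are not results from this notebook; `percent_difference` is a hypothetical helper that also guards against dividing by zero when no negative songs were found.

```python
def percent_difference(total_positive, total_negative):
    """Return how many percent more positive than negative songs there are.

    Returns None when there are no negative songs, because the ratio is
    undefined in that case.
    """
    if total_negative == 0:
        return None
    return 100 * (total_positive - total_negative) / total_negative


# Made-up totals: 120 positive and 100 negative songs over all weeks.
example_difference = percent_difference(120, 100)
print(example_difference)

# Equal counts, as in the first week's result of 20 and 20, give no difference.
equal_difference = percent_difference(20, 20)
print(equal_difference)
```

A negative result means that there are fewer positive than negative songs, since the negative total is used as the reference.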

Our conclusion is that it seems like there are slightly more songs with a positive sentiment than ones with negative sentiment in the Top 50 songs on Spotify in the US. This is based on 6 random weeks from 2019.

### Q2: Influence of holidays on the sentiment listened to

This question asks whether people tend to listen more to songs with a positive sentiment than to songs with a negative sentiment in the weeks around holidays. For this question we will look at the 2 weeks around the holidays Christmas, Easter and Thanksgiving. Because we have already looked at the Christmas period for 2019, this question looks at the holiday periods of 2018.

In [None]:
christmas2018 = compound_sentiment_function("regional-us-weekly-2018-12-21--2018-12-28.csv")

print("Songs with positive sentiment: " + str(christmas2018[0]))
print("Songs with negative sentiment: " + str(christmas2018[1]))

In [None]:
easter2018 = compound_sentiment_function("regional-us-weekly-2018-03-30--2018-04-06.csv")

print("Songs with positive sentiment: " + str(easter2018[0]))
print("Songs with negative sentiment: " + str(easter2018[1]))

In [None]:
thanksgiving2018 = compound_sentiment_function("regional-us-weekly-2018-11-16--2018-11-23.csv")

print("Songs with positive sentiment: " + str(thanksgiving2018[0]))
print("Songs with negative sentiment: " + str(thanksgiving2018[1]))
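
One possible way to compare a holiday week with a regular week is to turn each `(positive, negative)` tuple into a share of positive songs. The sketch below is only an illustration of that idea: `positive_share` is a hypothetical helper, the baseline tuple reuses the 20 and 20 counted for the first week in Q1, and the holiday tuple is made up.

```python
def positive_share(counts):
    """Return the fraction of positive songs among all positive and negative songs.

    counts is a (positive, negative) tuple, in the same order as the tuple
    returned by compound_sentiment_function. Returns None when both are 0.
    """
    positive, negative = counts
    total = positive + negative
    if total == 0:
        return None
    return positive / total


baseline_week = (20, 20)   # first week from Q1: 20 positive, 20 negative
holiday_week = (24, 16)    # made-up counts, for illustration only

baseline_share = positive_share(baseline_week)
holiday_share = positive_share(holiday_week)

print("Baseline share of positive songs: " + str(round(baseline_share, 2)))
print("Holiday share of positive songs: " + str(round(holiday_share, 2)))
```

Using shares instead of raw counts keeps weeks comparable even when a different number of songs ends up in the neutral group.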