# Data Wrangling - Spotify Top 200

### Introduction
In this project we examine several questions about the songs in the Spotify Top 200. We analyse the sentiment of each song's lyrics and combine that data with other datasets to draw conclusions. Each section of this notebook addresses one research question on this topic.

Group members of this project:
<ul>
    <li>Luuk Kaandorp (2623537)</li>
    <li>Lucas de Geus</li>
    <li>Ward Pennink</li>
    <li>Matthijs Blaauw</li>
</ul>

### Import
This section imports every function and package used in the sections that follow. Run it before running any other cell.

In [1]:
# import the libraries needed for working with the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_rows = 20
np.set_printoptions(precision = 4, suppress = True)

# import the libraries needed for getting the lyrics
import requests
from bs4 import BeautifulSoup

# import the libraries needed for sentiment analysis
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\luuk1\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### Q1: Influence of sentiment on position

This question tries to answer whether songs with a positive sentiment are more likely to be in the Top 200 than songs with a negative sentiment. In the code field below we start by getting the weekly Top 200 from the US on Spotify in the week of 09-01-2019.

In [2]:
# this reads in the Spotify Top 200 of the US
top200_file = "regional-us-weekly-latest.csv"
top200 = pd.read_csv(top200_file, header=1)
top200

Unnamed: 0,Position,Track Name,Artist,Streams,URL
0,1,The Box,Roddy Ricch,18952305,https://open.spotify.com/track/0nbXyq5TXYPCO7p...
1,2,ROXANNE,Arizona Zervas,9671478,https://open.spotify.com/track/696DnlkuDOXcMAn...
2,3,Yummy,Justin Bieber,9648561,https://open.spotify.com/track/41L3O37CECZt3N7...
3,4,Circles,Post Malone,8244725,https://open.spotify.com/track/21jGcNKet2qwijl...
4,5,BOP,DaBaby,7985170,https://open.spotify.com/track/6Ozh9Ok6h4Oi1wU...
...,...,...,...,...,...
195,196,I,Lil Skies,1547924,https://open.spotify.com/track/4ZT9FnbFu1PaBfV...
196,197,Love Me,Lil Tecca,1542802,https://open.spotify.com/track/4e0FYxSROat25pH...
197,198,Young Dumb & Broke,Khalid,1542419,https://open.spotify.com/track/5Z3GHaZ6ec9bsiI...
198,199,Youngblood,5 Seconds of Summer,1541047,https://open.spotify.com/track/2iUXsYOEPhVqEBw...
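
The `header=1` argument in the cell above tells pandas to take the column names from the second line of the file and to ignore the line above it. The sketch below illustrates this on a made-up miniature of the chart file. It assumes the export starts with a single note line above the real header; the note text and the `example.com` URLs are placeholders, not the actual file contents.

```python
import io
import pandas as pd

# Hypothetical miniature of the chart export: one note line, then the real header.
sample_csv = (
    ",,,Placeholder note line that precedes the header,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,The Box,Roddy Ricch,18952305,https://example.com/track/1\n"
    "2,ROXANNE,Arizona Zervas,9671478,https://example.com/track/2\n"
)

# header=1 uses the line with index 1 as column names and skips the line before it.
sample = pd.read_csv(io.StringIO(sample_csv), header=1)
print(sample.columns.tolist())
print(len(sample))
```

With the default `header=0`, the note line would be used as the column names instead, so the real header would end up as a data row.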


Now that we have the songs, we need three functions: one that queries the Genius API for a given song, one that processes the search result, and one that scrapes the lyrics from the HTML page that the result points to. The code cell below defines all three.

In [3]:
import os

# this function requests the song info based on the artist and song title
def request_song_info(song_title, artist_name):
    base_url = 'https://api.genius.com'
    # read the Genius API token from the environment instead of hard-coding it;
    # the variable name GENIUS_ACCESS_TOKEN is our own choice
    headers = {'Authorization': 'Bearer ' + os.environ['GENIUS_ACCESS_TOKEN']}
    search_url = base_url + '/search'
    # for a GET request the search query belongs in the URL parameters
    params = {'q': song_title + ' ' + artist_name}
    response = requests.get(search_url, params=params, headers=headers)

    return response

# this function processes the result of the info request
def process_request(response, artist_name):
    result = response.json()
    remote_song_info = None

    # take the first hit whose primary artist matches the requested artist
    for hit in result['response']['hits']:
        if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
            remote_song_info = hit
            break

    # return the URL of the lyrics page, or an empty string if nothing matched
    if remote_song_info:
        return remote_song_info['result']['url']
    else:
        return ""
    
# this function scrapes the lyrics from the html page
def scrap_lyrics(url):
    if url != "":
        page = requests.get(url)
        html = BeautifulSoup(page.text, 'html.parser')
        lyrics_div = html.find('div', class_='lyrics')
        # find() returns None when the page has no such div
        if lyrics_div is None:
            return ""
        return lyrics_div.get_text()
    else:
        return ""
            
# TEST
test_lyrics = scrap_lyrics(process_request(request_song_info('The Box', 'Roddy Ricch'), 'Roddy Ricch'))
test_lyrics

"\n\n[Chorus]\nPullin' out the coupe at the lot\nTold 'em fuck 12, fuck SWAT\nBustin' all the bales\u2005out\u2005the box\nI just\u2005hit a lick with the box\nHad\u2005to put the stick in a box, mmh\nPour up the whole damn seal, I'ma get lazy\nI got the mojo deals, we been trappin' like the '80s\nShe sucked a nigga soul, got the Cash App\nTold 'em wipe a nigga nose, say slatt, slatt\nI won't never sell my soul, and I can back that\nAnd I really wanna know where you at, at\n\n[Verse 1]\nI was out back where the stash at\nCruise the city in a bulletproof Cadillac (Skrrt)\n'Cause I know these niggas after where the bag at (Yeah)\nGotta move smarter, gotta move harder\nNigga try to get me for my water\nI'll lay his ass down, on my son, on my daughter\nI had the Draco with me, Dwayne Carter\nLotta niggas out here playin' ain't ballin'\nI done put my whole arm in the rim, Vince Carter (Yeah)\nAnd I know probably get a key for the quarter\nShawty barely seen in double C's, I bought 'em\nGot 

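The raw text printed by the test contains section markers such as `[Chorus]` and Unicode space characters (`\u2005`). Whether these affect the sentiment scores noticeably is not something we measured, but they are not part of the sung lyrics, so removing them before analysis may be worthwhile. The helper below is a hypothetical sketch of such a cleaning step and is not used elsewhere in this notebook; the name `clean_lyrics` is our own.

```python
import re

def clean_lyrics(raw_lyrics):
    """Hypothetical helper: strip section markers and normalise whitespace."""
    # Drop bracketed section markers such as [Chorus] or [Verse 1].
    without_markers = re.sub(r"\[[^\]]*\]", "", raw_lyrics)
    # Replace runs of non-newline whitespace (including \u2005) with one plain space.
    single_spaced = re.sub(r"[^\S\n]+", " ", without_markers)
    # Trim every line and drop the lines that are now empty.
    lines = [line.strip() for line in single_spaced.split("\n")]
    return "\n".join(line for line in lines if line)

# A short excerpt of the test output above.
excerpt = (
    "\n\n[Chorus]\nPullin' out the coupe at the lot\n"
    "Bustin' all the bales\u2005out\u2005the box\n\n"
    "[Verse 1]\nI was out back where the stash at\n"
)
print(clean_lyrics(excerpt))
```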
Now that we have these functions, we can get the lyrics for every song in the Top 200. All we still need is a function that determines the sentiment of the lyrics.

In [4]:
# this function calculates the sentiment of the lyrics
def sentiment_of_lyrics(lyrics):
    if lyrics != "":
        sid = SentimentIntensityAnalyzer()
        return sid.polarity_scores(lyrics)
    else:
        return False

# TEST
sentiment_of_lyrics(test_lyrics)

{'neg': 0.128, 'neu': 0.813, 'pos': 0.059, 'compound': -0.9917}
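
In this output, `neg`, `neu` and `pos` are the proportions of the text that fall into each category, and `compound` is a single normalised score between -1 and 1. To turn the compound score into a label, a common convention for VADER is to call a text positive at 0.05 or higher and negative at -0.05 or lower. The sketch below applies that convention; the function name `label_compound` and the choice to use this threshold here are our own assumptions, not part of the analysis above.

```python
def label_compound(compound, threshold=0.05):
    """Hypothetical helper: map a VADER compound score to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# The compound score of the test song above.
print(label_compound(-0.9917))
```

Because the compound score is computed over the whole text, long lyrics can end up close to -1 or 1, so the label is often more stable than the exact value.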

We now have everything we need: the songs, the lyrics and the sentiment processor. In the following section we create the final table containing the songs and their sentiment.

In [7]:
# calculate sentiment for all songs
for index, row in top200.iterrows():
    url = process_request(request_song_info(row['Track Name'], row['Artist']), row['Artist'])
    sentiment = sentiment_of_lyrics(scrap_lyrics(url))
    for key in ('neg', 'neu', 'pos', 'compound'):
        # songs without lyrics get a score of 0 in every sentiment column
        top200.loc[index, key + '_sentiment'] = sentiment[key] if sentiment else 0

top200


Unnamed: 0,Position,Track Name,Artist,Streams,URL,neg_sentiment,neu_sentiment,pos_sentiment,compound_sentiment
0,1,The Box,Roddy Ricch,18952305,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,0.128,0.813,0.059,-0.9917
1,2,ROXANNE,Arizona Zervas,9671478,https://open.spotify.com/track/696DnlkuDOXcMAn...,0.082,0.680,0.238,0.9971
2,3,Yummy,Justin Bieber,9648561,https://open.spotify.com/track/41L3O37CECZt3N7...,0.015,0.752,0.232,0.9981
3,4,Circles,Post Malone,8244725,https://open.spotify.com/track/21jGcNKet2qwijl...,0.044,0.851,0.105,0.9686
4,5,BOP,DaBaby,7985170,https://open.spotify.com/track/6Ozh9Ok6h4Oi1wU...,0.110,0.789,0.101,-0.9368
...,...,...,...,...,...,...,...,...,...
195,196,I,Lil Skies,1547924,https://open.spotify.com/track/4ZT9FnbFu1PaBfV...,0.146,0.735,0.119,-0.8669
196,197,Love Me,Lil Tecca,1542802,https://open.spotify.com/track/4e0FYxSROat25pH...,0.083,0.612,0.305,0.9990
197,198,Young Dumb & Broke,Khalid,1542419,https://open.spotify.com/track/5Z3GHaZ6ec9bsiI...,0.378,0.496,0.126,-0.9985
198,199,Youngblood,5 Seconds of Summer,1541047,https://open.spotify.com/track/2iUXsYOEPhVqEBw...,0.080,0.797,0.124,-0.1027
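
One way to relate sentiment to chart position is a rank correlation between `Position` and `compound_sentiment`. The sketch below computes a Spearman correlation by ranking both columns and taking the Pearson correlation of the ranks. It uses only the nine rows displayed above, so the result says nothing about the full chart; the choice of Spearman correlation is our own assumption, not a method prescribed by this notebook.

```python
import pandas as pd

# The nine rows visible in the table above (positions 1-5 and 196-199).
rows = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5, 196, 197, 198, 199],
    "compound_sentiment": [-0.9917, 0.9971, 0.9981, 0.9686, -0.9368,
                           -0.8669, 0.9990, -0.9985, -0.1027],
})

# Spearman correlation = Pearson correlation of the ranked values.
position_ranks = rows["Position"].rank()
sentiment_ranks = rows["compound_sentiment"].rank()
spearman_rho = position_ranks.corr(sentiment_ranks)
print(round(spearman_rho, 3))
```

A value near 0 would suggest no monotonic relation between position and sentiment. Songs whose lyrics could not be found are stored with a score of 0 in the loop above, so it may be worth excluding them before running this on the full table.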
