# **Set-up:**

Install these additional packages from the terminal:

```bash
pip install spotipy
pip install nltk
```

Import necessary packages

*⚠️ Note: Do not run this more than once. Restart the kernel before running this code chunk.*

In [1]:
import pandas as pd
import numpy as np
import json
import spotipy
from base64 import *
from spotipy.oauth2 import SpotifyClientCredentials
from datetime import datetime
import os
os.chdir("..")                                      # change directory to main project directory
from dees_package.spotify_functions import *
from dees_package.data_expansion_functions import *

Set the working directory and check that it is correct

*⚠️ Note: We should be in the main project directory. The path below is specific to our machine; replace it with the location of your own project folder.*

In [2]:
os.chdir('/Users/ruka/Desktop/ds105a-project-dees-nuts')

print("Current working directory:", os.getcwd())

Current working directory: /Users/ruka/Desktop/ds105a-project-dees-nuts


Next, we expand the dataframe with additional features.

In [3]:
df = pd.read_csv('./data/raw_compiled_data.csv')

In [4]:
df.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,wikipedia_categories,duration_seconds,comments
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,"Music, Pop music",264.0,['27 January 2024 Lets see how many legends😊 a...
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,"Music, Pop music",302.0,"['Lol me', 'This is my mom‘s ringtone', 'This ..."
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,"Music, Music of Latin America, Pop music",211.0,"['🤍', 'Memories 😢 💔\nI was watching this with ..."
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,"Music, Pop music",297.0,"['2024 AND STILL IN LOVE', 'When your legs don..."
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,"Music, Pop music",280.0,"['Anyone 2024?', 'Bro ! my favourite song , an..."


We apply our function to find the "genre level" of a song.

We noticed that the Wikipedia search returns very few categories per song. Moreover, every song has a default category of "music", which is irrelevant. Hence, instead of splitting the data thinly across many individual genres, we broadly categorised the songs into "Low", "Medium" or "High". This gives a rough level of "diversity" for a song by looking at how many genres the song is in, rather than which ones.

In [5]:
df = df.rename(columns={'wikipedia_categories': 'genre_level'})
df['genre_level'] = df['genre_level'].apply(lambda x: get_category_number(x))


We apply our sentiment analysis functions to evaluate lyrics quantitatively. We want the scores "positive", "neutral", "negative" and "compound".

But we first want to ensure that our lyrics are in `str` format

In [6]:
df['lyrics'] = df['lyrics'].astype(str)

In [7]:
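from nltk.sentiment.vader import SentimentIntensityAnalyzer   # needs the VADER lexicon: nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()

def get_sentiment_score(lyric):
    """Return the scores for a lyric as [negative, neutral, positive, compound]."""
    scores = sid.polarity_scores(str(lyric))
    # avoid naming the result `list`, which would shadow the built-in
    return [scores['neg'], scores['neu'], scores['pos'], scores['compound']]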

In [8]:
scores = df['lyrics'].apply(get_sentiment_score)     # score each lyric once instead of four times
df['sentiment_positive'] = scores.apply(lambda s: s[2])
df['sentiment_neutral'] = scores.apply(lambda s: s[1])
df['sentiment_negative'] = scores.apply(lambda s: s[0])
df['sentiment_compound'] = scores.apply(lambda s: s[3])

df.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments,sentiment_positive,sentiment_neutral,sentiment_negative,sentiment_compound
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,['27 January 2024 Lets see how many legends😊 a...,0.229,0.757,0.014,0.9995
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,"['Lol me', 'This is my mom‘s ringtone', 'This ...",0.288,0.642,0.07,0.9988
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,"['🤍', 'Memories 😢 💔\nI was watching this with ...",0.015,0.958,0.027,-0.4215
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,"['2024 AND STILL IN LOVE', 'When your legs don...",0.188,0.789,0.023,0.9967
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,"['Anyone 2024?', 'Bro ! my favourite song , an...",0.251,0.7,0.049,0.9974


We also run a sentiment analysis on the comments. We noticed that comments are mostly irrelevant to the song, so we are unsure how reliable this data is, but we still want to see whether it offers any insights. Hence, we only compute the "sentiment compound" of the comments.

In [9]:
df2 = df.copy()                                      # copy, so that later changes do not also modify df

In [10]:
df2['comments'] = df2['comments'].astype(str)

In [11]:
df2 = df2.rename(columns={'comments': 'comments_sentiment'})
df2['comments_sentiment'] = df2['comments_sentiment'].apply(lambda x: get_sentiment_score(x)[3])
df2.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,sentiment_positive,sentiment_neutral,sentiment_negative,sentiment_compound
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,0.229,0.757,0.014,0.9995
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,0.288,0.642,0.07,0.9988
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,0.015,0.958,0.027,-0.4215
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,0.188,0.789,0.023,0.9967
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,0.251,0.7,0.049,0.9974


Next, we apply our function to find the lexical richness of a song.

We define lexical richness as the proportion of unique words out of the total words in a song, giving us a proxy measure of the range of vocabulary that artists use.

In [12]:
df2['lexical_richness'] = df2['lyrics'].apply(lambda x: get_lexical_richness(x))

df2.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,sentiment_positive,sentiment_neutral,sentiment_negative,sentiment_compound,lexical_richness
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,0.229,0.757,0.014,0.9995,23
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,0.288,0.642,0.07,0.9988,29
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,0.015,0.958,0.027,-0.4215,39
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,0.188,0.789,0.023,0.9967,44
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,0.251,0.7,0.049,0.9974,52
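
The definition above can be sketched as follows. This is not the `get_lexical_richness` function from `dees_package`, whose source is not shown here; `lexical_richness_sketch` is a hypothetical stand-in. Since the sample output contains whole numbers such as 23, the sketch assumes the ratio is reported as a rounded percentage, and it assumes words are split on whitespace after lowercasing.

```python
def lexical_richness_sketch(lyrics):
    """Share of unique words among all words, as a rounded percentage.

    Assumptions (not confirmed by the notebook): words are split on
    whitespace, compared case-insensitively, and the ratio is scaled to 0-100.
    """
    words = str(lyrics).lower().split()
    if not words:
        return 0                      # avoid dividing by zero for empty lyrics
    return round(100 * len(set(words)) / len(words))


print(lexical_richness_sketch("la la la hey"))   # 2 unique words out of 4
```

A highly repetitive chorus lowers the score, while lyrics that rarely repeat a word push it towards 100.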


We also compute the song length, measured as the number of words in the lyrics.

In [13]:
df2['song_length'] = df2['lyrics'].apply(lambda x: len(x.split()))

df2.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,sentiment_positive,sentiment_neutral,sentiment_negative,sentiment_compound,lexical_richness,song_length
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,0.229,0.757,0.014,0.9995,23,699
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,0.288,0.642,0.07,0.9988,29,472
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,0.015,0.958,0.027,-0.4215,39,323
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,0.188,0.789,0.023,0.9967,44,326
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,0.251,0.7,0.049,0.9974,52,297


We noticed that "sentiment compound" measures how positive or negative a song is, and that the values tend to be close to the extremes (close to ±1). Here we introduce another variable, "sentiment compound absolute", where we ignore whether the song is happy or sad and measure how "extreme" the lyrics are by taking the absolute value of "sentiment compound".

In [14]:
df2['sentiment_compound_absolute'] = df2['sentiment_compound'].abs()

df2.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,sentiment_positive,sentiment_neutral,sentiment_negative,sentiment_compound,lexical_richness,song_length,sentiment_compound_absolute
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,0.229,0.757,0.014,0.9995,23,699,0.9995
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,0.288,0.642,0.07,0.9988,29,472,0.9988
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,0.015,0.958,0.027,-0.4215,39,323,0.4215
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,0.188,0.789,0.023,0.9967,44,326,0.9967
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,0.251,0.7,0.049,0.9974,52,297,0.9974
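
One likely reason the compound values cluster near ±1 is the way the score is normalised. To our understanding, VADER sums the valence of every scored word and then squashes the total with `x / sqrt(x² + 15)`, so long texts such as full lyrics accumulate a large total and saturate. The sketch below reproduces only that normalisation step; `normalize_sketch` is a hypothetical name and the raw totals are made-up inputs, not values taken from our data.

```python
import math


def normalize_sketch(raw_total, alpha=15):
    """Squash a raw valence total into [-1, 1], mirroring VADER's normalisation step."""
    norm = raw_total / math.sqrt(raw_total * raw_total + alpha)
    return max(-1.0, min(1.0, norm))


# Illustrative raw totals only: a short comment versus a long, emotive lyric
for raw_total in (1, 4, 10, 40):
    print(raw_total, round(normalize_sketch(raw_total), 4))
```

The larger the raw total, the closer the result gets to ±1, which is consistent with most lyrics in the table scoring above 0.99 in absolute value.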


Open JSON file containing credentials

*⚠️ Note: Our credentials should be stored in a file titled `credentials.json` in the root of the project folder*

*⚠️ Caution: '03 Spotify Scraper' should be run first. The following code will not run unless the token has been added to the credentials file, saved as "token"*

In [15]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)
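
Because the later cells fail if the token is missing, it can help to check for it as soon as the file is read. The helper below is a minimal sketch; `load_token` is a hypothetical function that is not part of `dees_package`, and it only assumes what the notes above state, namely that the file holds a `"token"` entry.

```python
import json


def load_token(path="./credentials.json"):
    """Read the Spotify token from the credentials file, failing with a clear message."""
    with open(path, "r") as f:
        credentials = json.load(f)
    if "token" not in credentials:
        # The notes above say the scraper notebook is what writes this entry
        raise KeyError(f'No "token" entry in {path}; run 03 Spotify Scraper first.')
    return credentials["token"]
```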

We now integrate the `spotipy` package and its `search()` function.

The JSON response gives us data such as the release date, a popularity score, whether the song is explicit, and the number of markets that the song is in during its initial release. We navigate through the response to find the data we want.

In [16]:
sp = spotipy.Spotify(auth=credentials['token'])


We first create the functions to get the data. Note that we are not able to store these functions in our `dees_package` as the functions require the token to work.

In [17]:
def get_release_date(song):
    result = sp.search(song)
    tracks = result['tracks']['items']
    if len(tracks) > 0:
        return tracks[0]['album']['release_date']
    else:
        return None

def get_popularity(song):
    result = sp.search(song)
    tracks = result['tracks']['items']
    if len(tracks) > 0:
        return tracks[0]['popularity']
    else:
        return None

def get_explicitness(song):
    result = sp.search(song)
    tracks = result['tracks']['items']
    if len(tracks) > 0:
        return tracks[0]['explicit']
    else:
        return None

def get_market_number(song):
    result = sp.search(song)
    tracks = result['tracks']['items']
    if len(tracks) > 0:
        return len(tracks[0]['available_markets'])
    else:
        return None
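
The four functions above each send their own search request, so every song costs four API calls, and searching by title alone can match a different track with the same name (in the sample output further down, "Sugar" comes back with a 2020 release date, which may be such a mismatch). A possible alternative is sketched below: one search per song, narrowed with Spotify's `track:` and `artist:` search filters, returning all four fields at once. `get_track_features` is a hypothetical helper and not part of our package. It takes the client as an argument, so it only assumes an object with a spotipy-style `search` method.

```python
def get_track_features(client, title, artist):
    """Look up one track and return all four fields from a single search call."""
    # Field filters narrow the match to this title by this artist
    query = f"track:{title} artist:{artist}"
    result = client.search(q=query, type="track", limit=1)
    tracks = result["tracks"]["items"]
    if not tracks:
        # Same convention as the functions above: None when nothing is found
        return {"release_date": None, "popularity": None,
                "explicitness": None, "markets": None}
    track = tracks[0]
    return {
        "release_date": track["album"]["release_date"],
        "popularity": track["popularity"],
        "explicitness": track["explicit"],
        "markets": len(track["available_markets"]),
    }
```

With the `sp` client from the previous cell, `get_track_features(sp, "Perfect", "Ed Sheeran")` would return a dictionary with the four values, and a list of such dictionaries can be turned into columns with `pd.DataFrame(...)`.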

We run the code one step at a time due to the large volume of data. Each step starts from a copy of the previous dataframe, so that an interrupted step does not alter the earlier result.

In [18]:
df3 = df2.copy()

In [19]:
df3['release_date'] = df3['title'].apply(get_release_date)

In [20]:
df4 = df3.copy()

In [21]:
df4['popularity'] = df4['title'].apply(get_popularity)

In [22]:
df5 = df4.copy()

In [23]:
df5['explicitness'] = df5['title'].apply(get_explicitness)

In [24]:
df6 = df5.copy()

In [25]:
df6['markets'] = df6['title'].apply(get_market_number)

We drop rows with missing values, categorise the market availability, and convert the release date to datetime format.

In [26]:
df7 = df6.copy()

In [27]:
df7 = df7.dropna().copy()                            # explicit copy, so the assignment below does not act on a slice

In [28]:
df7['markets'] = df7['markets'].apply(market_availability_category)
df7.head()


Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,...,sentiment_neutral,sentiment_negative,sentiment_compound,lexical_richness,song_length,sentiment_compound_absolute,release_date,popularity,explicitness,markets
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,...,0.757,0.014,0.9995,23,699,0.9995,2017-03-03,89.0,False,High
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,...,0.642,0.07,0.9988,29,472,0.9988,2020-08-25,73.0,True,High
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,...,0.958,0.027,-0.4215,39,323,0.4215,2023-04-14,79.0,False,High
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,...,0.789,0.023,0.9967,44,326,0.9967,2014-06-21,85.0,False,Medium
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,...,0.7,0.049,0.9974,52,297,0.9974,2017-03-03,90.0,False,High


In [29]:
df7 = df7.assign(release_date=df7['release_date'].apply(pd.to_datetime))   # assign returns a new dataframe


In [30]:
df7.head()

Unnamed: 0,title,artist,url,lyrics,view_count,like_count,comment_count,genre_level,duration_seconds,comments_sentiment,...,sentiment_neutral,sentiment_negative,sentiment_compound,lexical_richness,song_length,sentiment_compound_absolute,release_date,popularity,explicitness,markets
0,Shape of You,Ed Sheeran,https://genius.com/Ed-sheeran-shape-of-you-lyrics,A club isn't the best place to find a lover So...,6187040943,32536757.0,1150191,Low,264.0,0.8388,...,0.757,0.014,0.9995,23,699,0.9995,2017-03-03,89.0,False,High
1,Sugar,Maroon 5,https://genius.com/Maroon-5-sugar-lyrics,"I'm hurting, baby, I'm broken down I need your...",3996606810,15982972.0,420780,Low,302.0,0.7498,...,0.642,0.07,0.9988,29,472,0.9988,2020-08-25,73.0,True,High
2,Waka Waka (This Time for Africa),Shakira,https://genius.com/Shakira-waka-waka-this-time...,"O-o-oh, e-e-e-e-e-eh Viva Africa ( Otra, otra ...",3840577985,22122088.0,1322022,Medium,211.0,0.4329,...,0.958,0.027,-0.4215,39,323,0.4215,2023-04-14,79.0,False,High
3,Thinking Out Loud,Ed Sheeran,https://genius.com/Ed-sheeran-thinking-out-lou...,When your legs don't work like they used to be...,3723229528,14997487.0,373089,Low,297.0,0.9995,...,0.789,0.023,0.9967,44,326,0.9967,2014-06-21,85.0,False,Medium
4,Perfect,Ed Sheeran,https://genius.com/Ed-sheeran-perfect-lyrics,"I found a love for me Oh, darlin', just dive r...",3654497221,20738347.0,516002,Low,280.0,0.8522,...,0.7,0.049,0.9974,52,297,0.9974,2017-03-03,90.0,False,High


In [31]:
df_final = df7

In [32]:
df_final.to_csv('./data/final_compiled.csv', index=False)

### For the number of markets of song release, we found some interesting facts:

On its initial release, a song is available in one of the following:
* all 184 markets in the world
* slightly fewer than 184 markets (a sign of possible censorship in some countries, hinting that the song may be culturally inappropriate or politically sensitive)
* very few markets (<50) (a sign that the song was deliberately released only in some markets, targeting niche categories)

This justifies the `market_availability_category` function, which categorises market availability into "Low", "Medium" and "High" levels of outreach. A "High" category indicates less censorship, or a song that is global rather than local in nature.

We also noticed that some `NaN` values could affect our functions later on, which is why rows with missing values are dropped before the categorisation.
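
The three groups described above translate into a simple threshold rule. The sketch below is not the `market_availability_category` function from `dees_package`, whose source is not shown here; `market_availability_sketch` is a hypothetical stand-in, and the cutoffs are assumptions taken directly from the figures in the text (184 markets in total, fewer than 50 for restricted releases).

```python
TOTAL_MARKETS = 184   # total number of markets quoted in the text
LOW_CUTOFF = 50       # "very few markets" threshold quoted in the text


def market_availability_sketch(n_markets):
    """Map a market count to "Low", "Medium" or "High" outreach."""
    if n_markets >= TOTAL_MARKETS:
        return "High"      # released everywhere
    if n_markets < LOW_CUTOFF:
        return "Low"       # deliberately restricted release
    return "Medium"        # missing from some markets


print([market_availability_sketch(n) for n in (184, 170, 12)])
```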

### Similarly, for song categories:

We initially attempted to obtain song genres via YouTube, Genius or Spotify. However, we faced significant difficulties because:
* The data is not explicitly available: these platforms offer limited data to the public due to privacy reasons
* It is very difficult to get the genre via the API itself

Therefore, we turned to Wikipedia, an open source, to find the song genre/category. However, because Wikipedia offers only a limited number of categorisations, we focus instead on the number of categories, i.e. the number of Wikipedia pages the songs occur in.
* Most songs do not belong to any specific category on Wikipedia and are simply categorised as "music"
* Most of the other songs belong to two Wikipedia categories, "music" and something else, such as "electro"
* The remaining songs are a small minority that belong to three or more Wikipedia categories

This justifies our rationale for having broad categories. Songs that do not have more than one genre are categorised as "Low" in terms of category popularity, songs with two as "Medium", and songs with three or more as "High". This is the rule behind the `get_category_number` function applied earlier.
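
A rough sketch of this rule is given here. It is not the actual `get_category_number` from `dees_package`, whose source is not shown in this notebook; `genre_level_sketch` is a hypothetical stand-in. In the sample output earlier, "Music, Pop music" is labelled "Low" and "Music, Music of Latin America, Pop music" is labelled "Medium", so the sketch assumes that the default "Music" entry is not counted and that categories are separated by commas.

```python
def genre_level_sketch(categories):
    """Turn a comma-separated category string into "Low", "Medium" or "High"."""
    names = [c.strip() for c in str(categories).split(",") if c.strip()]
    # Assumption: the default "Music" category carries no information, so skip it
    genres = [c for c in names if c.lower() != "music"]
    if len(genres) >= 3:
        return "High"
    if len(genres) == 2:
        return "Medium"
    return "Low"


print(genre_level_sketch("Music, Pop music"))
print(genre_level_sketch("Music, Music of Latin America, Pop music"))
```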