# **04 Data Expansion**


### Expanding our dataframe through various methods to get richer data

---

### **Set-up ⚙️**

Import necessary packages

In [None]:
import pandas as pd
import numpy as np
import json
import spotipy
from base64 import *
from spotipy.oauth2 import SpotifyClientCredentials
from datetime import datetime
import os
os.chdir("..")                                      # move up to the main project directory

from functions.data_expansion_functions import *

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [None]:
print("Current working directory:", os.getcwd())

Next, we load the raw compiled data, which we will expand in the steps below

In [None]:
df = pd.read_csv('./data/raw_compiled_data.csv')

We apply our function to find the "genre level" of a song.

We noticed that the Wikipedia search returns very few categories per song. Moreover, every song carries a default category of "music", which is uninformative. Hence, instead of splitting a small dataset across many genre categories, we broadly categorise each song as "Low", "Medium" or "High". This gives a rough measure of a song's "diversity" by looking at how many genres the song is in, rather than which ones.

In [None]:
df = df.rename(columns={'wikipedia_categories': 'genre_level'})
df['genre_level'] = df['genre_level'].apply(lambda x: get_category_number(x))
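
The implementation of `get_category_number` lives in the functions folder and is not shown here. As a rough illustration of the idea only, the sketch below buckets a count of categories into the three levels. The function name `genre_level_from_count`, the cut-offs `low_max` and `medium_max`, and the toy category lists are all hypothetical and may differ from what the project uses.

```python
def genre_level_from_count(n_categories, low_max=2, medium_max=4):
    """Bucket a count of genre categories into 'Low', 'Medium' or 'High'.

    The cut-offs are hypothetical; the project's own thresholds may differ.
    """
    if n_categories <= low_max:
        return "Low"
    if n_categories <= medium_max:
        return "Medium"
    return "High"


# Toy category lists; the default "music" category is dropped before counting
songs = {
    "song_a": ["music", "pop"],
    "song_b": ["music", "pop", "dance", "electronic"],
    "song_c": ["music", "rock", "blues", "folk", "country", "soul"],
}
levels = {
    title: genre_level_from_count(len([c for c in cats if c != "music"]))
    for title, cats in songs.items()
}
print(levels)
```

Counting categories rather than naming them keeps each bucket reasonably populated even when individual genres appear only a handful of times.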

We apply our sentiment analysis functions to evaluate lyrics quantitatively. We want the scores "positive", "neutral", "negative" and "compound".

First, we ensure that the lyrics are in `str` format

In [None]:
df['lyrics'] = df['lyrics'].astype(str)

In [None]:
sid = SentimentIntensityAnalyzer()

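def get_sentiment_score(lyric):
    """Return the [negative, neutral, positive, compound] scores for one lyric."""
    scores = sid.polarity_scores(str(lyric))
    # Avoid naming the result `list`, which would shadow the built-in type
    return [scores['neg'], scores['neu'], scores['pos'], scores['compound']]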

In [None]:
# Score each lyric once, then unpack [neg, neu, pos, compound] into columns
sentiment_scores = df['lyrics'].apply(get_sentiment_score)
df['sentiment_positive'] = sentiment_scores.apply(lambda s: s[2])
df['sentiment_neutral'] = sentiment_scores.apply(lambda s: s[1])
df['sentiment_negative'] = sentiment_scores.apply(lambda s: s[0])
df['sentiment_compound'] = sentiment_scores.apply(lambda s: s[3])

df.head()

We want to run sentiment analysis on the comments as well. We noticed that comments are mostly irrelevant to the song, so we are unsure how reliable this data is. We still want to see whether the comments offer any insight, so we only compute their "sentiment compound" score.

In [None]:
df2 = df.copy()  # independent copy, so df remains available as a checkpoint

In [None]:
df2['comments'] = df2['comments'].astype(str)

In [None]:
df2 = df2.rename(columns={'comments': 'comments_sentiment'})
df2['comments_sentiment'] = df2['comments_sentiment'].apply(lambda x: get_sentiment_score(x)[3])
df2.head()

Next, we apply our function to find the lexical richness of a song.

We define lexical richness as the ratio of unique words to total words in a song, giving us a proxy measure of the range of vocabulary that artists use.

In [None]:
df2['lexical_richness'] = df2['lyrics'].apply(lambda x: get_lexical_richness(x))

df2.head()
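
The definition above can be sketched as follows. This is a minimal illustration, not the project's `get_lexical_richness`, whose implementation is not shown here. The name `lexical_richness_sketch` is hypothetical, and the sketch assumes simple whitespace tokenisation with lowercasing and no punctuation handling.

```python
def lexical_richness_sketch(lyrics):
    """Ratio of unique words to total words; 0.0 for empty text."""
    # Assumption: lowercase and split on whitespace, punctuation left as-is
    words = str(lyrics).lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Six words in total, four of them unique
example_score = lexical_richness_sketch("la la la love me do")
print(round(example_score, 3))
```

A highly repetitive chorus pushes the ratio towards 0, while lyrics that never repeat a word score 1.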

We also compute the song length, measured as the number of words in the lyrics

In [None]:
df2['song_length'] = df2['lyrics'].apply(lambda x: len(x.split()))

df2.head()

"Sentiment compound" measures how positive or negative a song is, and we noticed that its values tend to sit near the extremes (close to ±1). Here we add another variable, "sentiment compound absolute", which ignores whether the song is happy or sad and measures how "extreme" the lyrics are by taking the absolute value of "sentiment compound".

In [None]:
df2['sentiment_compound_absolute'] = df2['sentiment_compound'].abs()

df2.head(2)

Open JSON file containing credentials

*⚠️ Note: Our credentials should be stored in a file titled `credentials.json` in the root of the project folder*

*⚠️ Caution: Run the notebook 03 Spotify Token Generator first. The code below will not run unless the token has been added to the credentials file, saved as `token`*

In [None]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)

We now use the `spotipy` package and its `search()` function.

The search returns a JSON response containing data such as the release date, a popularity score, whether the song is explicit, and the number of markets that the song is in during its initial release. We navigate through this response to find the data we want.

In [None]:
sp = spotipy.Spotify(auth=credentials['token'])

We first create the functions to get the data.

We were not able to store these functions separately in our functions folder as they require the token to work. Hence, it is easier to define and run them within this notebook instead.

In [None]:
def get_top_track(song):
    """Search Spotify for `song` and return the first matching track, or None."""
    tracks = sp.search(song)['tracks']['items']
    return tracks[0] if tracks else None

def get_release_date(song):
    """Release date of the album that the top search result belongs to."""
    track = get_top_track(song)
    return track['album']['release_date'] if track else None

def get_popularity(song):
    """Spotify popularity score of the top search result."""
    track = get_top_track(song)
    return track['popularity'] if track else None

def get_explicitness(song):
    """Whether the top search result is flagged as explicit."""
    track = get_top_track(song)
    return track['explicit'] if track else None

def get_market_number(song):
    """Number of markets in which the top search result is available."""
    track = get_top_track(song)
    return len(track['available_markets']) if track else None
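
Each function above issues its own search for the same title, so every song is searched four times. As a possible alternative, the sketch below reads all four fields from a single response. The helper name `extract_track_fields` is hypothetical, and `fake_result` is a hand-written stand-in shaped like the response navigated above, not real Spotify data.

```python
def extract_track_fields(search_result):
    """Pull the four fields used in this notebook from one search response.

    Returns None when the search found no tracks.
    """
    tracks = search_result.get("tracks", {}).get("items", [])
    if not tracks:
        return None
    top = tracks[0]  # first (best-ranked) match
    return {
        "release_date": top["album"]["release_date"],
        "popularity": top["popularity"],
        "explicitness": top["explicit"],
        "markets": len(top["available_markets"]),
    }


# Hand-written stand-in for a search response (illustrative values only)
fake_result = {
    "tracks": {
        "items": [
            {
                "album": {"release_date": "2001-05-14"},
                "popularity": 63,
                "explicit": False,
                "available_markets": ["SG", "US", "GB"],
            }
        ]
    }
}

fields = extract_track_fields(fake_result)
empty = extract_track_fields({"tracks": {"items": []}})
print(fields)
print(empty)
```

In the notebook, the argument would be the value returned by `sp.search(title)`, which could cut the number of requests per song from four to one.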

We run the cells one at a time because of the large volume of data. Note that the strength of the internet connection may significantly affect the run time. We have broken this down into multiple dataframes in case there is a need to go back to an earlier step.

In [None]:
df3 = df2.copy()  # checkpoint: df2 is left untouched by the next step

In [None]:
df3['release_date'] = df3['title'].apply(lambda x: get_release_date(x))

In [None]:
df4 = df3.copy()  # checkpoint: df3 is left untouched by the next step

In [None]:
df4['popularity'] = df4['title'].apply(lambda x: get_popularity(x))

In [None]:
df5 = df4.copy()  # checkpoint: df4 is left untouched by the next step

In [None]:
df5['explicitness'] = df5['title'].apply(lambda x: get_explicitness(x))

In [None]:
df6 = df5.copy()  # checkpoint: df5 is left untouched by the next step

In [None]:
df6['markets'] = df6['title'].apply(lambda x: get_market_number(x))

In [None]:
df7 = df6.copy()  # checkpoint: df6 is left untouched by the cleaning steps below
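
The intermediate dataframes are meant to act as checkpoints. A minimal sketch on a toy dataframe (the names `toy`, `alias` and `checkpoint` are illustrative) shows why a checkpoint needs `.copy()`: plain assignment only creates a second name for the same object.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"]})

alias = toy               # same object: edits through `alias` also change `toy`
checkpoint = toy.copy()   # independent copy of the data

alias["popularity"] = [10, 20]

# The new column shows up in `toy` but not in the copied checkpoint
print("popularity" in toy.columns)
print("popularity" in checkpoint.columns)
```

With a copy in place, a failed or interrupted step can be rerun from the previous dataframe without repeating the earlier API calls.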

Having `NaN` values could affect our code later on, so we drop the rows that contain them:

In [None]:
df7 = df7.dropna()

We categorise the market availability into "Low", "Medium" and "High".

We noticed that, at its initial release, a song is either in (1) all 184 markets, (2) slightly fewer than 184 markets, or (3) a very restricted set of markets (<50). Hence, it is reasonable for us to categorise the songs along these lines, where the "High" category indicates less censorship, or songs that are more global than local in nature.

In [None]:
df7['markets'] = df7['markets'].apply(lambda x: market_availability_category(x))
df7.head()
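
The project's `market_availability_category` is defined in the functions folder and is not shown here. A hedged sketch of the three-way split described above might look like this. The name `market_category_sketch` and the exact boundaries are assumptions: fewer than 50 markets is "Low", all 184 is "High", and anything in between is treated as "Medium".

```python
def market_category_sketch(n_markets, restricted_below=50, all_markets=184):
    """Map a market count to 'Low', 'Medium' or 'High'.

    Boundaries are assumptions based on the description in this notebook.
    """
    if n_markets < restricted_below:
        return "Low"
    if n_markets < all_markets:
        return "Medium"
    return "High"


categories = [market_category_sketch(n) for n in (12, 180, 184)]
print(categories)
```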

We convert the release date to datetime format

In [None]:
df7['release_date'] = df7['release_date'].apply(lambda x: pd.to_datetime(x))

In [None]:
df7.head()

In [None]:
df_final = df7

Finally, we save our final data to a CSV file, and we are ready for visualisation

In [None]:
df_final.to_csv('./data/final_compiled.csv', index=False)