In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import kpss

In [None]:
sentiment = pd.read_csv('../lyrics_sentiment.csv')
sentiment.head()

track_info = pd.read_csv('../track_info.csv')
track_info = track_info.drop(['Unnamed: 0', 'level_0', 'artist', 'title'], axis='columns')
audio_features = pd.read_csv('../audio_features.csv')
audio_features = audio_features.drop(['Unnamed: 0'], axis='columns')

track_df = pd.merge(track_info, audio_features, how='inner', on='id')

billboard_df = pd.read_csv('../Billboard_Lists_1960-01-01_2024-02-23.csv')
billboard_df['Week'] = pd.to_datetime(billboard_df['Week'])
filtered_billboard_df = billboard_df[(billboard_df['Week'].dt.year >= 1990)]

In [None]:
track_df['Artist_Name'] = track_df['query'].apply(lambda x: x.split('|')[0])
track_df['Song'] = track_df['query'].apply(lambda x: x.split('|')[1])

In [None]:
# Join chart entries to track features, then to lyric sentiment, on artist and song
billboard_analysis = pd.merge(filtered_billboard_df, track_df, how='inner', on=['Artist_Name', 'Song'])
billboard_analysis = pd.merge(billboard_analysis, sentiment[['Artist_Name', 'Song', 'tweetnlp_label', 'tweetnlp_neg', 'tweetnlp_neu', 'tweetnlp_pos']], how='inner', on=['Artist_Name', 'Song'])

In [None]:
mean_results = billboard_analysis[['Week','valence', 'duration_ms', 'danceability', 'tweetnlp_pos', 'tweetnlp_neg']].groupby('Week').mean().reset_index()
mean_results['label'] = 'mean'
median_results = billboard_analysis[['Week','valence', 'duration_ms', 'danceability', 'tweetnlp_pos', 'tweetnlp_neg']].groupby('Week').median().reset_index()
median_results['label'] = 'median'
results = pd.concat([mean_results, median_results], axis='rows')

# Trend Analysis

## Statistical Tests

In [None]:
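def adf_test(timeseries):
    """Run the Augmented Dickey-Fuller test and print a summary.

    Null hypothesis: the series has a unit root (is non-stationary).
    """
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    # dftest[4] maps significance levels to critical values
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

def kpss_test(timeseries):
    """Run the KPSS test and print a summary.

    Null hypothesis: the series is stationary around a constant (regression='c').
    """
    print('Results of KPSS Test:')
    kpsstest = kpss(timeseries, regression='c', nlags="auto")
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic', 'p-value', '#Lags Used'])
    # kpsstest[3] maps significance levels to critical values
    for key, value in kpsstest[3].items():
        kpss_output['Critical Value (%s)' % key] = value
    print(kpss_output)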

We can analyze the trends in our data by testing whether each time series is stationary, which gives us a more definitive understanding of the characteristics of the time series than visual inspection alone.

Augmented Dickey-Fuller Test
- Null Hypothesis: A unit root is present in the time series, meaning the time series is non-stationary
- Alternate Hypothesis: A unit root is not present in the time series, meaning the time series is stationary

KPSS
- Null Hypothesis: A unit root is not present in the time series, meaning the time series is stationary
- Alternate Hypothesis: A unit root is present in the time series, meaning the time series is non-stationary
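Because the two tests have opposite null hypotheses, reading them together gives four possible outcomes. The decision table can be sketched as below. The helper name `interpret_stationarity` and the 0.05 significance level are assumptions made for illustration and are not part of the original analysis.

```python
def interpret_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict.

    ADF null: unit root present (non-stationary).
    KPSS null: no unit root (stationary).
    """
    adf_stationary = adf_pvalue < alpha      # ADF rejects the unit root
    kpss_stationary = kpss_pvalue >= alpha   # KPSS fails to reject stationarity

    if adf_stationary and kpss_stationary:
        return 'stationary'
    if not adf_stationary and not kpss_stationary:
        return 'non-stationary'
    if kpss_stationary:
        # KPSS indicates stationarity, ADF indicates a unit root
        return 'trend stationary'
    # ADF indicates stationarity, KPSS indicates non-stationarity
    return 'difference stationary'
```

The p-values printed by `adf_test` and `kpss_test` can be passed to this helper by hand to label each series consistently.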

## Sentiment

In [None]:
px.line(results, x='Week', y='valence', color='label')

Visually, valence shows a slight downward trend over time.

In [None]:
adf_test(billboard_analysis['valence'])

In [None]:
kpss_test(billboard_analysis['valence'])

From the statistical tests we can conclude that the series is trend stationary, since the ADF test indicates that the time series is non-stationary while the KPSS test indicates that it is stationary. Trend stationary means that the mean and variance of the time series are stable over time once the effects of the trend and seasonality have been removed.

These tests are normally run as a precursor to cleaning and processing the data so that a linear regression can model the time series once it has been made stationary. The results are still insightful on their own, because they suggest that there is a trend and some seasonality in our data, although the size and direction of those effects are not clear from this analysis.

Due to time constraints we were not able to clean and preprocess the time-series for us to model the trend.
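For reference, one common way to remove a linear trend is to fit a straight line by least squares and keep the residuals. The sketch below is an illustration only: the helper `detrend_linear` is hypothetical, and the series is synthetic data standing in for a weekly mean, not our actual results.

```python
import numpy as np
import pandas as pd

def detrend_linear(series):
    """Subtract a least-squares straight line from a series.

    Returns the residuals and the fitted slope (change per time step).
    """
    y = series.to_numpy(dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, deg=1)  # highest degree first
    fitted = slope * t + intercept
    return pd.Series(y - fitted, index=series.index), slope

# Synthetic stand-in for a weekly mean series with a slight downward drift
rng = np.random.default_rng(0)
weeks = pd.date_range('1990-01-06', periods=200, freq='W-SAT')
demo = pd.Series(0.6 - 0.0005 * np.arange(200) + rng.normal(0, 0.01, 200), index=weeks)

residuals, slope = detrend_linear(demo)
```

The residuals could then be passed to `adf_test` and `kpss_test` to check whether the detrended series is stationary, and the sign of the slope gives the direction of the trend.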

In [None]:
month_mean_results = billboard_analysis[['valence', 'duration_ms', 'danceability']].groupby(billboard_analysis['Week'].dt.month).mean().reset_index()
month_mean_results['label'] = 'mean'
month_median_results = billboard_analysis[['valence', 'duration_ms', 'danceability']].groupby(billboard_analysis['Week'].dt.month).median().reset_index()
month_median_results['label'] = 'median'
month_results = pd.concat([month_mean_results, month_median_results], axis='rows').rename(columns={'Week': 'Month'})

In [None]:
px.line(month_results, x='Month', y='valence', color='label')

Visually inspecting the monthly averages, happier-sounding music appears to become more popular towards the end of spring and the middle of autumn. This is loosely consistent with the common association of summer with happier moods and winter with more somber ones.

In [None]:
adf_test(month_results['valence'])
kpss_test(month_results['valence'])

Analyzing the results of the statistical tests, the ADF test points towards the time series being stationary while the KPSS test points towards it being non-stationary. This means the time series is possibly difference stationary: if we were to difference the series with itself at a specific lag, the result would be stationary. This suggests a seasonal pattern in the data that could be removed by differencing at the seasonal lag.
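Differencing at a lag can be sketched with `Series.diff`. The values below are hypothetical monthly averages repeated over three years, chosen only to show the mechanics. Note that a lag of 12 needs a series spanning several years, which the 12-point monthly aggregate above does not provide.

```python
import pandas as pd

# Hypothetical monthly averages with an exactly repeating 12-month pattern
pattern = [0.50, 0.51, 0.53, 0.55, 0.56, 0.54, 0.52, 0.51, 0.53, 0.55, 0.52, 0.50]
monthly = pd.Series(pattern * 3)

# Seasonal differencing: subtract the value observed 12 periods earlier.
# The first 12 entries have no predecessor and are dropped.
seasonal_diff = monthly.diff(periods=12).dropna()
```

With a purely seasonal series like this one, the differenced values are all zero. On real data the differenced series would be what we pass to the stationarity tests.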

In [None]:
results['tweetnlp_norm'] = results['tweetnlp_pos'] - results['tweetnlp_neg']

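We combine the positive and negative probabilities into a single score by subtracting the negative probability from the positive one. The resulting score ranges from -1 (strongly negative) to 1 (strongly positive), which spreads the data out and gives better resolution in both the negative and positive directions.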

In [None]:
px.line(results, x='Week', y='tweetnlp_norm', color='label')

In [None]:
adf_test(results['tweetnlp_norm'])
kpss_test(results['tweetnlp_norm'])

The results of the statistical tests again indicate that the time series is trend stationary. Due to time constraints we were not able to detrend the series and examine the characteristics of the trend. Visual inspection does not show a clear trend in the graph, so without detrending and looking into the regression we cannot confidently identify the direction of the trend.

In [None]:
px.scatter(billboard_analysis, x='valence', y='tweetnlp_norm', color='tweetnlp_label', hover_name='Song', hover_data=['Artist_Name', 'valence', 'tweetnlp_pos'], color_discrete_map={'positive': '#097969', 'neutral': '#808080', 'negative': '#800020'})

Judging purely from visual inspection, this plot suggests that there is little or no correlation between valence (musical mood) and lyrical sentiment. This makes sense because music can sound positive and happy while having sad lyrics, and sad-sounding music can have happy lyrics. Hovering over points in each quadrant of the graph shows examples of each combination.
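The visual impression could be checked with a correlation coefficient. The following is a minimal sketch, assuming the column names used in this notebook. The helper `valence_sentiment_correlation` and the demo rows are hypothetical, and dropping duplicate songs (so that long-charting songs are not counted once per week) is our assumption about how one might want to weight the data.

```python
import pandas as pd

def valence_sentiment_correlation(df, method='pearson'):
    """Correlation between valence and the lyric sentiment score, one row per song."""
    songs = df.drop_duplicates(subset=['Artist_Name', 'Song'])
    score = songs['tweetnlp_pos'] - songs['tweetnlp_neg']
    return songs['valence'].corr(score, method=method)

# Hypothetical rows; the first song appears twice, as it would if it charted for two weeks
demo = pd.DataFrame({
    'Artist_Name': ['A', 'A', 'B', 'C', 'D', 'E'],
    'Song': ['s1', 's1', 's2', 's3', 's4', 's5'],
    'valence': [0.1, 0.1, 0.4, 0.5, 0.8, 0.9],
    'tweetnlp_pos': [0.2, 0.2, 0.3, 0.5, 0.6, 0.9],
    'tweetnlp_neg': [0.7, 0.7, 0.5, 0.3, 0.2, 0.05],
})
r = valence_sentiment_correlation(demo)
```

Applied to `billboard_analysis`, a coefficient near zero would support the conclusion drawn from the scatter plot.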

## Danceability

In [None]:
px.line(results, x='Week', y='danceability', color='label')

We visualized the danceability trend over time, but visual inspection does not reveal a clear insight. We wanted to examine specific points in the trend, but again due to time constraints we were unable to perform this analysis.
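One way such a follow-up could start is by smoothing the weekly series with a rolling mean, so that slower movements stand out from week-to-week noise. This is a sketch on synthetic data. The 52-week window is an assumption, not a choice made in our analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weekly mean danceability series
rng = np.random.default_rng(1)
weeks = pd.date_range('1990-01-06', periods=104, freq='W-SAT')
weekly = pd.Series(0.6 + rng.normal(0, 0.05, size=104), index=weeks)

# Centered 52-week rolling mean; min_periods=1 keeps values at the edges
smoothed = weekly.rolling(window=52, center=True, min_periods=1).mean()
```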

## Duration

In [None]:
px.line(results, x='Week', y='duration_ms', color='label', labels={'duration_ms': 'Duration (ms)'})

In [None]:
adf_test(billboard_analysis['duration_ms'])
kpss_test(billboard_analysis['duration_ms'])

Through visual inspection we can see a clear and strong negative trend in the duration of music over the years. We can infer from this graph that music is getting shorter over time. The statistical tests are consistent with this, indicating that the time series is trend stationary.

In [None]:
billboard_analysis['Week'] = billboard_analysis['Week'].astype(str)
px.histogram(billboard_analysis, x='duration_ms')

In [None]:
def duration_label(duration):
    """Bucket a track duration in milliseconds into a length category."""
    if duration >= 330000:
        return 'Long (> 5.5 minutes)'
    elif duration <= 150000:
        return 'Short (< 2.5 minutes)'
    else:
        return 'Medium (2.5 - 5.5 minutes)'

billboard_analysis['duration_category'] = billboard_analysis['duration_ms'].apply(duration_label)

In [None]:
px.scatter(billboard_analysis, x='Week', y='duration_ms', color='duration_category', color_discrete_map={'Long (> 5.5 minutes)': '#097969', 'Medium (2.5 - 5.5 minutes)': '#808080', 'Short (< 2.5 minutes)': '#800020'}, labels={'duration_ms': 'Duration (ms)'})

The scatter plot shows a higher concentration of long music in the 1990s and 2000s and a higher concentration of short music in recent years (2015-2024). This further supports the observation that music is trending shorter over time.
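The concentration visible in the scatter plot could be quantified as the share of chart entries in each duration category per year. The sketch below assumes the `Week` and `duration_category` columns created above. The helper `category_share_by_year` and the demo rows are hypothetical.

```python
import pandas as pd

def category_share_by_year(df):
    """Fraction of chart entries in each duration category, per year."""
    # Week was converted to str earlier, so parse it back to datetime
    year = pd.to_datetime(df['Week']).dt.year.rename('Year')
    counts = df.groupby([year, 'duration_category']).size().unstack(fill_value=0)
    # Divide each row by its total so the shares in a year sum to 1
    return counts.div(counts.sum(axis=1), axis=0)

# Hypothetical chart entries for two years
demo = pd.DataFrame({
    'Week': ['1995-03-04', '1995-03-04', '2020-03-07', '2020-03-07'],
    'duration_category': ['Long (> 5.5 minutes)', 'Medium (2.5 - 5.5 minutes)',
                          'Short (< 2.5 minutes)', 'Medium (2.5 - 5.5 minutes)'],
})
shares = category_share_by_year(demo)
```

Plotting the resulting table with `px.line` would show whether the share of short songs rises over the years.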