## Structure of the Data
* Artists -> Songs -> Attributes
    * Artists
        * Nominal Attributes:
            * artist: name
            * artist_id: spotify_id
            * one-to-one relationships
        * Popularity-Related Attributes:
            * followers: Number of Spotify users following the artist's page
            * popularity: Aggregate popularity of all of the artist's songs
            * one-to-one relationships
        * Categorical Attributes:
            * genre: list of subgenres that Spotify has classified the artist as
            * one-to-many relationship
    * Songs
        * Nominal Attributes:
            * Song
            * Performer
            * id
            * id_y
            * id_fk
            * release_date
            * one-to-one relationships
        * Interval Attributes:
            * 0 to 1 Intervals:
                * danceability:
                * energy:
                * liveness:
                * speechiness:
                * acousticness
                * instrumentalness:
                * valence:
            * Area-Specific Intervals:
                * tempo
                * duration
                * popularity
                * loudness
        * Categorical Attributes:
            * explicit
            * timesignature
            * key
            * mode
            * genre
            * genre_super
            * chart
    * Billboard
        * Nominal Attributes:
            * Song
            * Performer
            * SongID
            * WeekID
        * Ordinal Attributes:
            * Week Position
            * bill_popularity
        * Derivatives Based on Week Position:
            * Previous Week Position (Week Position @ t-1)
            * Peak Position (Best Position Reached, Inclusive of the Song's Current Week Position)
            * Weeks on Chart (Cumulative Sum of All Weeks on Chart)
            * Instance (Number of Times the Song Has Entered the Chart)
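
The four Billboard derivatives listed above can all be rebuilt from `SongID`, `WeekID` and `Week Position`. The following is a minimal sketch on made-up toy data, not the way the `billboard` table was actually produced. Two assumptions are baked in: a new `Instance` starts whenever more than seven days pass between two charted weeks, and `Previous Week Position` is left empty on (re-)entry.

```python
import pandas as pd

# Toy chart history for one hypothetical song (values are invented).
toy = pd.DataFrame({
    "SongID": ["Song AArtist A"] * 5,
    "WeekID": pd.to_datetime(
        ["1999-01-02", "1999-01-09", "1999-01-16", "1999-03-06", "1999-03-13"]
    ),
    "Week Position": [40, 25, 30, 50, 45],
}).sort_values(["SongID", "WeekID"])

# Assumption: a gap of more than 7 days between charted weeks starts a new instance.
gap = toy.groupby("SongID")["WeekID"].diff()
new_run = gap.isna() | (gap > pd.Timedelta(days=7))
toy["Instance"] = new_run.astype(int).groupby(toy["SongID"]).cumsum()

# Previous week's position within the same instance (NaN on entry or re-entry).
toy["Previous Week Position"] = (
    toy.groupby(["SongID", "Instance"])["Week Position"].shift(1)
)

# Peak = best (numerically lowest) position so far, including the current week.
toy["Peak Position"] = toy.groupby("SongID")["Week Position"].cummin()

# Weeks on chart = running count of charted weeks across all instances.
toy["Weeks on Chart"] = toy.groupby("SongID").cumcount() + 1

print(toy)
```

Comparing these recomputed columns against the stored ones is a cheap way to check how the source data defines a re-entry.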

In [3]:
# Stories to Tell with Billboard Data Based on Time.
    # How Long Does a Song Last on the Billboard Chart? What is the Average Longevity of A Popular Song?
    # What Was the Biggest Jump Between Weeks? Previous Week Position - Week Position
    # What Was the Longest Gap Between a Song's Release and its Charting? Is This Occurring More or Less Frequently With the Advent of Streaming?
    # Which Song Lasted the Longest on the Chart Within a Single Instance?
    # Was There a Meaningful Difference in Attributes Between #1 Songs and the Other 99 Songs in the Field?
    # Which Song Was the Most Average, given the averages on the billboard songs' attributes for the year?
    # Which Song Deviated the Most From These Attributes? 
    # How Many Songs Enter and Leave the Chart in a Given Week, Month, or Year?
    # How Has the Genre Make-Up of the Hot 100 Changed Over Time? Is Pop More Popular in the Spring? Does Billboard Valence Decrease in the Winter?
    # Questions that Ground Our Artists, Songs and their Spotify Attributes in Time and Explain the Changes That Have Occurred In Popular Music.
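
Several of the questions above reduce to short groupby operations. Below is a rough sketch on invented toy rows that reuses the `Week Position` and `Previous Week Position` column names from the outline. It covers the biggest week-over-week jump, average longevity, and the number of songs entering the chart each week. The song labels and positions are hypothetical.

```python
import pandas as pd

nan = float("nan")

# Toy chart rows for three hypothetical songs (values are invented).
toy = pd.DataFrame({
    "SongID": ["a", "a", "a", "b", "b", "c"],
    "WeekID": pd.to_datetime([
        "1999-01-02", "1999-01-09", "1999-01-16",
        "1999-01-09", "1999-01-16", "1999-01-16",
    ]),
    "Week Position": [60, 20, 15, 90, 95, 70],
    "Previous Week Position": [nan, 60, 20, nan, 90, nan],
})

# Biggest jump: Previous Week Position - Week Position (positive = the song climbed).
toy["jump"] = toy["Previous Week Position"] - toy["Week Position"]
biggest_jump = toy.loc[toy["jump"].idxmax()]

# Longevity: number of distinct charted weeks per song, then the average.
weeks_per_song = toy.groupby("SongID")["WeekID"].nunique()
avg_longevity = weeks_per_song.mean()

# Entries per week: count songs by the first week in which they appear.
first_week = toy.groupby("SongID")["WeekID"].min()
entries_per_week = first_week.value_counts().sort_index()

print(biggest_jump[["SongID", "WeekID", "jump"]])
print(avg_longevity)
print(entries_per_week)
```

Exits per week follow the same pattern using each song's last charted week instead of its first.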

In [1]:
import pandas as pd, sqlalchemy as sql, numpy as np, datetime as dt, os, re, time
import operator, requests, string as s, random
import plotly.graph_objects as go
import plotly.express as px

In [2]:
engine = sql.create_engine("sqlite:///../src/data/music.db")
with engine.connect() as con:
    billboard = pd.read_sql("SELECT * FROM billboard", con=con)
    attributes = pd.read_sql("SELECT * FROM attributes", con=con)
    artists = pd.read_sql("SELECT * FROM artists", con=con)
    recommend_join = pd.read_sql("SELECT * FROM recommendation_join", con=con)
    artists_join = pd.read_sql("SELECT * FROM artists_join", con=con)

In [3]:
billboard.shape

(109600, 10)

In [5]:
billboard.WeekID = pd.to_datetime(billboard.WeekID)

In [7]:
attributes.release_date = pd.to_datetime(attributes.release_date)
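
With both `WeekID` and `release_date` parsed as datetimes, the release-to-chart gap from the questions above becomes a simple subtraction. This is a sketch on invented toy frames, assuming `SongID` is the join key between the two tables (as in the merges later in this notebook). The names `toy_billboard`, `toy_attributes`, `first_week` and `days_to_chart` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the billboard and attributes tables (values are invented).
toy_billboard = pd.DataFrame({
    "SongID": ["a", "a", "b"],
    "WeekID": pd.to_datetime(["1999-02-06", "1999-02-13", "1999-06-05"]),
})
toy_attributes = pd.DataFrame({
    "SongID": ["a", "b"],
    "release_date": pd.to_datetime(["1999-01-01", "1994-06-05"]),
})

# First week each song appears on the chart.
first_charted = (
    toy_billboard.groupby("SongID", as_index=False)["WeekID"]
    .min()
    .rename(columns={"WeekID": "first_week"})
)

# Attach the release date and compute the gap in days.
gaps = first_charted.merge(toy_attributes, how="left", on="SongID")
gaps["days_to_chart"] = (gaps["first_week"] - gaps["release_date"]).dt.days

# Song with the longest wait between release and first chart appearance.
longest_gap = gaps.loc[gaps["days_to_chart"].idxmax()]
print(gaps)
```

Negative gaps, if any appear in the real data, would point to release dates that reflect a later reissue rather than the original release.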

In [15]:
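# Chart rows before 2000-01-01. This is only a single-year (1999) sample if the table starts in 1999; otherwise add a lower bound.
sample_year = billboard[lambda x: x.WeekID < dt.datetime(2000, 1, 1)]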

In [50]:
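# Comparison group: songs released during 1999 with chart == 0 (assumed here to mean the song did not chart).
samples_for_year = attributes[lambda x: ((x.release_date < dt.datetime(2000, 1, 1)) & (x.release_date > dt.datetime(1998, 12, 31)) & (x.chart == 0))]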

In [8]:
# Summary of Year Via Interval Attributes

In [40]:
df_99 = sample_year.merge(attributes.loc[:, ["SongID", "energy", "popularity"]], how="left", on="SongID")

In [41]:
df_99 = df_99.drop_duplicates(subset=["SongID", "Instance"])

In [42]:
fig = px.histogram(df_99, x="energy")
fig.show()

In [54]:
fig = px.histogram(samples_for_year, x="energy")
fig.show()

In [43]:
fig = px.box(df_99, x="energy")
fig.show()

In [55]:
fig = px.box(samples_for_year, x="energy")
fig.show()

In [44]:
fig = px.violin(df_99, x="energy", box=True, points="all")
fig.show()

In [56]:
fig = px.violin(samples_for_year, x="energy", box=True, points="all")
fig.show()

In [45]:
fig = px.scatter(df_99, x='energy', y='bill_popularity', marginal_x='histogram', marginal_y='rug')
fig.show()

In [47]:
fig = px.scatter(df_99, x='energy', y='popularity', marginal_x='histogram', marginal_y='violin')
fig.show()

In [57]:
fig = px.scatter(samples_for_year, x='energy', y='popularity', marginal_x='histogram', marginal_y='violin')
fig.show()

In [58]:
fig = px.scatter(df_99, x='energy', y='bill_popularity', marginal_x='histogram', marginal_y='violin')
fig.show()

In [60]:
df_99_full = sample_year.merge(attributes, how="left", on="SongID")

In [71]:
df_interval_attr = df_99_full.loc[:, ["SongID", "danceability", "energy", "liveness", "acousticness", "instrumentalness", "loudness", "explicit", "tempo", "duration","valence", "popularity", "bill_popularity"]].drop_duplicates(subset=["SongID"])

In [72]:
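# numeric_only=True (pandas >= 1.5) skips the string SongID column, which otherwise breaks .corr() in pandas 2.x.
fig = px.imshow(df_interval_attr.corr(numeric_only=True))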
fig.show()

In [73]:
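# numeric_only=True (pandas >= 1.5) skips the string SongID column, which otherwise breaks .corr() in pandas 2.x.
fig = px.imshow(samples_for_year.loc[:, ["SongID", "danceability", "energy", "liveness", "acousticness", "instrumentalness", "loudness", "explicit", "tempo", "duration", "valence", "popularity"]].corr(numeric_only=True))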
fig.show()
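
The two heatmaps above are compared by eye. One possible way to make the comparison explicit is to subtract one correlation matrix from the other, so that only the pairs that behave differently between charting and non-charting songs stand out. The sketch below uses invented toy values, and the `charted` and `uncharted` frames are hypothetical stand-ins for `df_interval_attr` and `samples_for_year`.

```python
import pandas as pd

# Toy attribute tables (values are invented for illustration only).
charted = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-12, -9, -7, -4],
    "valence": [0.9, 0.1, 0.8, 0.2],
})
uncharted = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-4, -7, -9, -12],
    "valence": [0.9, 0.1, 0.8, 0.2],
})

# Element-wise difference of the two Pearson correlation matrices.
corr_diff = charted.corr() - uncharted.corr()

# Flatten to attribute pairs, drop self-pairs, and rank by absolute difference.
pairs = corr_diff.stack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
ranked = pairs.abs().sort_values(ascending=False)

print(corr_diff.round(2))
print(ranked.head())
```

`corr_diff` can be passed to `px.imshow` in the same way as the original matrices.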

In [86]:
fig = px.bar(df_99_full.drop_duplicates("SongID").groupby("genre_super", as_index=False).count().sort_values("SongID", ascending=False), x="genre_super", y="SongID", title="# of Unique Songs")
fig.show()

In [96]:
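# Sum only "Week Position": a bare .sum() also tries to add up the string and datetime columns, which can fail on WeekID in newer pandas versions.
pie_df = df_99_full.assign(genre_super=lambda x: x.genre_super.replace({"missing": "other", "empty": "other"})).groupby("genre_super", as_index=False)["Week Position"].sum()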

fig = px.pie(pie_df, values="Week Position", names="genre_super", title="Genre % by Positions")
fig.show()
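
The pie above weights each genre by its summed `Week Position`, so a row at position 100 contributes a hundred times more than a row at position 1. As a hedged alternative, genre share can also be computed from row counts (charted weeks) or from unique songs. The sketch below uses invented toy rows and keeps the notebook's convention of folding "missing" and "empty" into "other".

```python
import pandas as pd

# Toy merged chart rows (values are invented).
toy = pd.DataFrame({
    "SongID": ["a", "a", "b", "c", "d"],
    "genre_super": ["pop", "pop", "rock", "missing", "empty"],
})

# Fold placeholder labels into a single "other" bucket.
genre = toy["genre_super"].replace({"missing": "other", "empty": "other"})

# Share of charted weeks per genre (one row = one song-week).
share_by_weeks = genre.value_counts(normalize=True)

# Share of unique songs per genre.
share_by_songs = (
    toy.assign(genre_super=genre)
    .drop_duplicates("SongID")["genre_super"]
    .value_counts(normalize=True)
)

print(share_by_weeks)
print(share_by_songs)
```

Either series can be plotted with `px.pie` by resetting its index into a two-column frame.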