# Final Project

Analysis of songs and artists from Spotify. The data can be obtained from [Spotify's API](https://developer.spotify.com/documentation/web-api/), or you can use the **Archive.zip** file in our repository.

In [None]:
import pandas as pd
import numpy as np

tracks = pd.read_csv("../intro-to-programming-data-manipulation-in-python/data/tracks.csv")
artists = pd.read_csv("../intro-to-programming-data-manipulation-in-python/data/artists.csv")

## Familiarize yourself with the datasets

Some information about selected variables from the `tracks` dataset:
- acousticness: A relative measure of how acoustic the track is (ranges from 0 to 1)
- danceability: A relative measure of how danceable the track is (ranges from 0 to 1)
- energy: The energy of the track (ranges from 0 to 1)
- duration_ms: The length of the track in milliseconds (integer, typically ranging from 200k to 300k)
- instrumentalness: The relative ratio of the track being instrumental (ranges from 0 to 1)
- valence: The positiveness of the track (ranges from 0 to 1)
- popularity: The recent popularity of the song, default country = US (ranges from 0 to 100)
- tempo: The tempo of the track in beats per minute (BPM) (float, typically ranging from 50 to 150)
- liveness: The relative duration of the track sounding like a live performance (ranges from 0 to 1)
- loudness: The relative loudness of the track in decibels (dB) (float, typically ranging from -60 to 0)
- speechiness: The relative length of the track containing any kind of human voice (ranges from 0 to 1)
- key: The primary key of the track, encoded as an integer between 0 and 11 (starting with C as 0, C# as 1, and so on…)
- artists: The list of artists credited for production of the track
- release_date: The date of release, mostly in yyyy-mm-dd format; however, the precision of the date may vary
- name: The title of the track
- mode: A binary value representing whether the track starts with a major (1) or a minor (0) chord progression
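
To make the `key` and `mode` encodings easier to read in tables and plots, they can be turned into labels. The sketch below is only an illustration: the helper `describe_key` and the list `PITCH_CLASSES` are hypothetical names, and the note names beyond C and C# are assumed to follow the standard chromatic scale.

```python
# Pitch-class names in the order implied by the description above (0 = C, 1 = C#, ...).
# Assumption: the remaining values follow the standard chromatic scale.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a label such as 'C# minor' for a (key, mode) pair (hypothetical helper)."""
    if not 0 <= key <= 11:
        raise ValueError(f"key must be between 0 and 11, got {key}")
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


print(describe_key(0, 1))
print(describe_key(1, 0))
```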

In [None]:
# play around with the data sets here:

In [None]:
# import plotly graph objects and plotly express (pandas is already imported above)
import plotly.graph_objects as go
import plotly.express as px

**Task**: Plot a line chart to see how different song metrics evolve over time:
- per year
- per decade

Time should be on the **x-axis** and the metric on the **y-axis**.

In [None]:
tracks.release_date = pd.to_datetime(tracks.release_date)
tracks['year'] = tracks.release_date.dt.year
tracks['decade'] = tracks['year'] // 10 * 10
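
Because the precision of `release_date` may vary, parsing every value as a full date can be fragile. As a hedged alternative, the year can be read directly from the raw string column before any conversion, assuming every value starts with a four-digit year. The `sample` frame below is made-up illustration data, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample covering day, month and year precision
sample = pd.DataFrame({"release_date": ["1999-05-17", "1985-07", "1962"]})

# Assumption: the year is always the first four characters, whatever the precision
sample["year"] = sample["release_date"].str[:4].astype(int)

# Same decade rule as in the cell above: integer-divide by 10, then multiply back
sample["decade"] = sample["year"] // 10 * 10

print(sample)
```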

#### Per Year

In [None]:
to_viz = tracks[['year','danceability','energy','loudness','speechiness','acousticness','instrumentalness',
             'liveness','valence','tempo']].groupby('year').mean()

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['danceability'],
                    mode='lines',
                    name='danceability'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['energy'],
                    mode='lines',
                    name='energy'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['acousticness'],
                    mode='lines',
                    name='acousticness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['speechiness'],
                    mode='lines',
                    name='speechiness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['liveness'],
                    mode='lines',
                    name='liveness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['valence'],
                    mode='lines',
                    name='valence'))
fig.show()

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['tempo'],
                    mode='lines',
                    name='tempo'))
fig.show()

#### Per Decade

In [None]:
to_viz = tracks[['decade','danceability','energy','loudness','speechiness','acousticness','instrumentalness',
             'liveness','valence','tempo']].groupby('decade').mean()

In [None]:
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['danceability'],
                    mode='lines',
                    name='danceability'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['energy'],
                    mode='lines',
                    name='energy'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['acousticness'],
                    mode='lines',
                    name='acousticness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['speechiness'],
                    mode='lines',
                    name='speechiness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['liveness'],
                    mode='lines',
                    name='liveness'))
fig.add_trace(go.Scatter(x=to_viz.index, y=to_viz['valence'],
                    mode='lines',
                    name='valence'))
fig.show()
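
The repeated `add_trace` calls above all follow the same pattern, one line per metric. One possible way to avoid the repetition is to reshape the aggregated table into long format, with one row per (decade, metric) pair. The frame `to_viz_demo` below is a hypothetical stand-in for `to_viz` with made-up numbers.

```python
import pandas as pd

# Hypothetical stand-in for the per-decade `to_viz` table (values are made up)
to_viz_demo = pd.DataFrame(
    {"danceability": [0.50, 0.55], "energy": [0.40, 0.60]},
    index=pd.Index([1980, 1990], name="decade"),
)

# Wide -> long: one row per (decade, metric) pair
long = to_viz_demo.reset_index().melt(
    id_vars="decade", var_name="metric", value_name="value"
)

print(long)
```

A long table like this can then be passed to plotly express, for example `px.line(long, x="decade", y="value", color="metric")`, which draws one line per metric.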

**Task**: Visualize how music genres change over time. [(visualization from lecture)](https://plotly.com/python/filled-area-plots/#stacked-area-chart-with-normalized-values)
- per year



In [None]:
# clean the artists column in the tracks dataset: one artist per row
tracks_one_artist_per_row = tracks.copy(deep=True)
# strip the list brackets, split on commas, then strip quotes and spaces from each name
tracks_one_artist_per_row['artists'] = tracks_one_artist_per_row['artists'].apply(
    lambda v: [a.strip(" '\"") for a in v.strip('[]').split(',')])
tracks_one_artist_per_row = tracks_one_artist_per_row.explode('artists')
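
An alternative to string replacement is to parse each value as a Python list literal with `ast.literal_eval`, which also copes with names that contain commas or apostrophes. This is only a sketch: it assumes every value in the column is a valid list literal, and the `demo` frame uses made-up song and artist names.

```python
import ast

import pandas as pd

# Hypothetical rows in the same stringified-list format as the artists column
demo = pd.DataFrame({
    "name": ["Song A", "Song B"],
    "artists": ["['Artist One']", "['Artist Two', 'Artist Three']"],
})

# Parse the string into a real Python list, then emit one row per artist
demo["artists"] = demo["artists"].apply(ast.literal_eval)
one_per_row = demo.explode("artists")

print(one_per_row)
```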

In [None]:
tracks_one_artist_per_row.head()

In [None]:
tracks_with_genres = (tracks_one_artist_per_row[['artists','name','year']]
                      .merge(artists[["name",'genres']],
                            left_on = 'artists',
                            right_on = 'name'))

In [None]:
# clean the genres column: unpack it from the list, one genre per row
tracks_with_genres['genres'] = tracks_with_genres['genres'].apply(lambda v: v.replace("['",'').replace("']",'').split(','))
tracks_with_genres = tracks_with_genres.explode('genres')
# remove leftover quotes and the spaces that follow the commas
tracks_with_genres['clean_genres'] = tracks_with_genres['genres'].str.replace("'",'').str.strip()

In [None]:
tracks_with_genres.tail()

In [None]:
# drop rows with an empty genre list, then use the cleaned genre as the default group
tracks_with_genres = tracks_with_genres[tracks_with_genres.clean_genres != '[]'].copy()
tracks_with_genres['grouped_genres'] = tracks_with_genres['clean_genres']
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('rock'), 'grouped_genres'] = 'rock'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('pop'), 'grouped_genres'] = 'pop'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('hip hop'), 'grouped_genres'] = 'hip hop'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('jazz'), 'grouped_genres'] = 'jazz'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('children'), 'grouped_genres'] = 'children'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('punk'), 'grouped_genres'] = 'punk'
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('metal'), 'grouped_genres'] = 'metal'

In [None]:
tracks_with_genres.loc[tracks_with_genres['clean_genres'].str.contains('soul'), 'grouped_genres'] = 'soul'
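
The assignments above run in sequence, so when a genre matches several keywords the last matching one wins (for example, a genre containing both "rock" and "pop" ends up as "pop"). The sketch below reproduces that precedence with a loop; `KEYWORDS` and `group_genre` are hypothetical names and the sample genres are made up.

```python
import pandas as pd

# Keywords in the same order as the assignments above; later entries override earlier ones
KEYWORDS = ["rock", "pop", "hip hop", "jazz", "children", "punk", "metal", "soul"]


def group_genre(genre: str) -> str:
    """Map a detailed genre to a coarse group; unmatched genres are kept as-is."""
    grouped = genre
    for keyword in KEYWORDS:
        if keyword in genre:
            grouped = keyword
    return grouped


genres = pd.Series(["pop rock", "punk rock", "death metal", "classical"])
grouped = genres.map(group_genre).tolist()
print(grouped)
```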

In [None]:
# take top 10 genres
tracks_with_genres.grouped_genres.value_counts().head(10)

In [None]:
to_viz = tracks_with_genres[['year','grouped_genres']].groupby(['year','grouped_genres']).size().unstack().fillna(0)

In [None]:
to_viz.head()

In [None]:
import plotly.graph_objects as go

x=to_viz.index.tolist()

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=x, y=to_viz['rock'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one', # define stack group
    groupnorm='percent',
    name='rock'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['jazz'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='jazz'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['pop'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='pop'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['metal'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='metal'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['soul'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='soul'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['punk'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='punk'
))
fig.add_trace(go.Scatter(
    x=x, y=to_viz['hip hop'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5),
    stackgroup='one',
    name='hip hop'
))

fig.update_layout(yaxis_range=(0, 100))
fig.show()
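
With `groupnorm='percent'`, the traces in the stack group are normalized so that each year sums to 100. The same shares can be computed in pandas, which is useful for checking the chart against numbers. The `counts` frame below is a hypothetical stand-in for the genre columns of `to_viz`, with made-up values.

```python
import pandas as pd

# Hypothetical track counts per year and genre (values are made up)
counts = pd.DataFrame(
    {"rock": [30, 10], "pop": [10, 30], "jazz": [60, 10]},
    index=pd.Index([1960, 2000], name="year"),
)

# Divide each row by its total so that every year sums to 100 percent
shares = counts.div(counts.sum(axis=1), axis=0) * 100

print(shares.round(1))
```

Note that only the columns that are actually plotted should be included in the row total, otherwise the shares will not match the chart.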

## Use pandas and data visualization libraries to extract the following information from the data

**Question 1:** In order to write a popular song, is the key and mode of the song important?

- mode: The binary value representing whether the track starts with a major (1) chord progression or a minor (0)
- key: The primary key of the track encoded as integers in between 0 and 11 (starting on C as 0, C# as 1 and so on…)

In [None]:
# select the popularity column before aggregating so non-numeric columns are not averaged
mode_popularity = tracks.groupby('mode')['popularity'].mean()

In [None]:
# there doesn't appear to be a meaningful difference in average popularity between mode 0 and mode 1
mode_popularity.plot(kind='bar')

In [None]:
# select the popularity column before aggregating so non-numeric columns are not averaged
key_popularity = tracks.groupby('key')['popularity'].mean()

In [None]:
# key=3 looks to have the worst popularity
key_popularity.plot(kind='bar')
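
Since the question asks about key and mode together, it may also help to look at both at once. A minimal sketch, assuming the same `key`, `mode` and `popularity` columns; the `demo` frame holds made-up values.

```python
import pandas as pd

# Hypothetical tracks with made-up popularity values
demo = pd.DataFrame({
    "key": [0, 0, 1, 1],
    "mode": [1, 0, 1, 0],
    "popularity": [40, 20, 30, 50],
})

# Mean popularity for every key (rows) and mode (columns) combination
by_key_mode = demo.pivot_table(
    index="key", columns="mode", values="popularity", aggfunc="mean"
)

print(by_key_mode)
```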

**Question 2:** Popularity climbs between 1955 and 1960. Who were the most famous artists of that period?

In [None]:
# artists behind the most popular tracks between 1955 and 1960
data_late50s_topArtists = (tracks[(tracks['year'] >= 1955) & (tracks['year'] <= 1960)]
                           .sort_values(by='popularity', ascending=False))
# strip the list syntax (brackets and quotes) from the artists column
data_late50s_topArtists.loc[:,'artist'] = data_late50s_topArtists['artists'].map(lambda x: x.replace('[', '').replace(']', '').replace('\'', ''))
data_late50s_topArtists[['artist','popularity']].head(10)

**Question 3:** Which current top artists have the most popular songs?

In [None]:
# As a secondary popularity condition, artists must have more than 200 songs on Spotify
data_topArtists = tracks.sort_values(by='popularity', ascending=False)

# extract the artists who have more than 200 songs
artist_counts = data_topArtists['artists'].value_counts()
top_artists = artist_counts[artist_counts > 200].index.tolist()

# keep only the tracks by artists who have more than 200 songs
data_topArtists = data_topArtists[data_topArtists['artists'].isin(top_artists)].copy()

# extract artists
data_topArtists.loc[:,'artist'] = data_topArtists['artists'].map(lambda x: x.replace('[', '').replace(']', '').replace('\'', ''))

data_topArtists[['artist','popularity']].groupby('artist').mean().sort_values('popularity',ascending=False).head(10)
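
The same filter-then-rank logic can be written as a single aggregation that computes the song count and the mean popularity per artist in one pass. This is a sketch on made-up data; `MIN_SONGS` is a hypothetical name and is set to 2 here only so the tiny example has something to filter (the cell above uses 200).

```python
import pandas as pd

# Hypothetical tracks with made-up artists and popularity values
demo = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "C", "C"],
    "popularity": [50, 60, 70, 90, 10, 30],
})

MIN_SONGS = 2  # the notebook uses 200; lowered for this toy example

# Song count and mean popularity per artist, computed together
summary = demo.groupby("artist")["popularity"].agg(
    songs="count", mean_popularity="mean"
)

# Keep artists with more than MIN_SONGS songs, most popular first
top = summary[summary["songs"] > MIN_SONGS].sort_values(
    "mean_popularity", ascending=False
)

print(top)
```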

**Question 4:** How do you become a famous songwriter?
- what genres do you sing?
- is your music fast/energetic?
- ....

In [None]:
# famous = lot of followers

In [None]:
most_famous_artists = artists.sort_values('followers', ascending=False).head(10)
most_famous_artists

In [None]:
# clean the genres column of the most famous artists: one genre per row
most_famous_artists['genres'] = most_famous_artists['genres'].apply(lambda v: v.replace("['",'').replace("']",'').split(','))
most_famous_artists = most_famous_artists.explode('genres')
most_famous_artists['clean_genres'] = most_famous_artists['genres'].str.replace("'",'')

In [None]:
# the most famous artists sing post-teen pop!! 
most_famous_artists.clean_genres.value_counts()

In [None]:
# names of the most popular artists
most_famous_artists_list = most_famous_artists['name'].tolist()

In [None]:
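# clean the artists column in the tracks dataset: one artist per row
# strip the list brackets, split on commas, then strip quotes and spaces from each name
tracks['artists'] = tracks['artists'].apply(
    lambda v: [a.strip(" '\"") for a in v.strip('[]').split(',')])
tracks = tracks.explode('artists')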

In [None]:
tracks.head()

In [None]:
tracks_famous_artists = tracks[tracks['artists'].isin(most_famous_artists_list)]

In [None]:
tracks_famous_artists[['artists','danceability','energy','loudness','speechiness','acousticness','instrumentalness',
             'liveness','valence','tempo']].groupby('artists').mean()

**Question 5:** What are those songs with high popularity but zero tempo, zero danceability and zero speechiness?

In [None]:
# Find some examples (the most popular ones) of zero-tempo songs
# note: tracks now has one row per artist, so a multi-artist song is counted once per artist
data_zero_tempo = tracks[tracks['tempo'] == 0].sort_values(by='popularity', ascending=False)
print(f'# of songs with 0 bpm: {data_zero_tempo.shape[0]}')
data_zero_tempo[['name', 'artists', 'year', 'tempo', 'speechiness', 'danceability', 'popularity']].head(10)
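
The question asks about tracks where tempo, danceability and speechiness are all zero, while the cell above filters on tempo only. A minimal sketch of a combined filter follows; the `demo` frame and its track names are made up for illustration.

```python
import pandas as pd

# Hypothetical tracks with made-up names and values
demo = pd.DataFrame({
    "name": ["White Noise", "Rain Sounds", "Dance Hit"],
    "tempo": [0.0, 0.0, 120.0],
    "danceability": [0.0, 0.3, 0.8],
    "speechiness": [0.0, 0.0, 0.05],
    "popularity": [70, 65, 90],
})

features = ["tempo", "danceability", "speechiness"]

# True only for rows where every listed feature is exactly zero
all_zero = demo[features].eq(0).all(axis=1)

result = demo[all_zero].sort_values("popularity", ascending=False)
print(result)
```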