# SPOTIFY Pipeline


* DATA SOURCE: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
* VARIABLES: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
* SPOTIPY: https://spotipy.readthedocs.io/en/latest/

APIs to enrich data with MUSIC AWARENESS: 
* **Google Trends**: https://www.npmjs.com/package/google-trends-api (https://pypi.org/project/pytrends/)
* **YouTube**: https://www.youtube.com/intl/es/yt/dev/api-resources/ (https://developers.google.com/youtube/v3/quickstart/python)

## Audio Features Object

###  KEY : VALUE TYPE : VALUE DESCRIPTION

* **duration_ms	(int)**	The duration of the track in milliseconds.
* **key	(int)**	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
* **mode	(int)**	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
* **time_signature	(int)**	An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
* **acousticness	(float)**	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. 
* **danceability	(float)**	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
* **energy	(float)**	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. 
* **instrumentalness	(float)**	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
* **liveness	(float)**	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. 
* **loudness	(float)**	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
* **speechiness	(float)**	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. 
* **valence	(float)**	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). 
* **tempo	(float)**	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. 
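
The integer encodings and thresholds described above can be sketched in a few lines. This is a minimal illustration, not part of the pipeline: the names `PITCH_CLASSES`, `decode_key_mode` and `speechiness_band` are hypothetical, and only sharp spellings are used for the pitch classes. The Kaggle CSV loaded further down already stores `key` and `mode` as strings (e.g. `C#`, `Major`), so decoding like this would only be needed when pulling features straight from the API.

```python
# Pitch Class notation: index 0..11 maps to the twelve pitch classes (sharp spellings only).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def decode_key_mode(key, mode):
    """Turn an integer key/mode pair into a label such as 'C# Major' (hypothetical helper)."""
    if not 0 <= key <= 11:  # -1 means no key was detected
        return None
    mode_name = "Major" if mode == 1 else "Minor"
    return f"{PITCH_CLASSES[key]} {mode_name}"


def speechiness_band(value):
    """Apply the 0.33 / 0.66 thresholds quoted in the speechiness description."""
    if value > 0.66:
        return "speech"  # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"   # may contain both music and speech
    return "music"       # most likely music or other non-speech audio


print(decode_key_mode(1, 1))
print(speechiness_band(0.0547))
```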

## Preliminaries - Importing Python packages

In [1]:
# importing all needed packages/libraries to work with on Spotify pipeline

# to work with data: dataframes, statistics & regular expressions
import pandas as pd
import numpy as np
import re

# to import and connect external data via API
import json
import requests
#import argparse

# for data viz
%matplotlib inline
import matplotlib 
import matplotlib.pyplot as plt
import seaborn as sns

## REF: Associated Functions

In [28]:
def importing_csv(csv_path):
    df = pd.read_csv(csv_path)
    return df

def raw(df):
    print('shape:',df.shape)
    print('columns:',df.columns)
    print('variables info:')    
    return df.info()

def conversion_ms_to_min(x):
    return x/60000

def concatenate2columns(df,a,b):
    name = a+'_'+b
    df[name]=df[[a,b]].apply(lambda x: ' '.join(x),axis=1)
    print(df[name].head())
    
def tempo_classification(origin_var,new_var):
    # relies on the global df; pd.cut bins are right-closed, e.g. (66, 76] -> 'Adagio'
    bins_labels = ['Larghissimo','Grave','Lento','Larghetto','Adagio','Andante','Moderato','Allegro','Vivace','Presto','Prestissimo']
    cutoffs = [0,20,40,60,66,76,108,120,140,168,200,400] 
    df[new_var] = pd.cut(df[origin_var],cutoffs, labels=bins_labels)
    return df[[origin_var,new_var]].head()

def datasubset(df_origin,columns_selection,df_subset_name):
    # df_subset_name is only a label: the subset is returned for inspection, not stored under that name
    subset = df_origin[df_origin.columns.intersection(columns_selection)]
    return subset.shape,subset.head()

def removing_duplicates(df, columns=None):
    # columns=None compares whole rows; otherwise only the listed columns are compared
    before_removing = len(df)
    df = df.drop_duplicates(subset=columns, keep='last')
    after_removing = len(df)
    removed = before_removing - after_removing
    print('# duplicates removed from df: {}'.format(removed))
    return df

def bins(var):
    # relies on the global df and cutoffs_table (= df.describe())
    bins_labels = ['Low','Mid','High']
    stats = cutoffs_table[var]
    # if min == 25% the bin edges would not be unique, so the median is used as the first cutoff instead
    lower = stats['25%'] if stats['min'] != stats['25%'] else stats['50%']
    cutoffs = [stats['min'],lower,stats['75%'],stats['max']]
    # include_lowest=True keeps rows equal to the minimum in 'Low' instead of turning them into NaN
    df[str(var)+'_labels']= pd.cut(df[var],cutoffs, labels=bins_labels, include_lowest=True)
    return df.head(5)

def valcount(data, var):
    return data[var].value_counts()

def topN(data,var,n):
    return data[var].value_counts().head(n)

## 1. Raw data

In [4]:
# importing dataset:
csv_path = '../db/SpotifyFeatures.csv'
df = importing_csv(csv_path)

In [5]:
# first, a general exploration of the dataset: shape, columns, and variable types
raw(df)

shape: (228159, 18)
columns: Index(['genre', 'artist_name', 'track_name', 'track_id', 'popularity',
       'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence'],
      dtype='object')
variables info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228159 entries, 0 to 228158
Data columns (total 18 columns):
genre               228159 non-null object
artist_name         228159 non-null object
track_name          228159 non-null object
track_id            228159 non-null object
popularity          228159 non-null int64
acousticness        228159 non-null float64
danceability        228159 non-null float64
duration_ms         228159 non-null int64
energy              228159 non-null float64
instrumentalness    228159 non-null float64
key                 228159 non-null object
liveness            228159 non-null float64
loudness            228159 non-null float

In [6]:
# then a quick look at the first rows to see what each variable contains:
df.head()

| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Opera | Giuseppe Verdi | Stiffelio, Act III: Ei fugge! … Lina, pensai c... | 7EsKYeHtTc4H4xWiTqSVZA | 21 | 0.986 | 0.313 | 490867 | 0.231 | 0.000431 | C# | 0.0964 | -14.287 | Major | 0.0547 | 86.001 | 4/4 | 0.0886 |
| 1 | Opera | Giacomo Puccini | Madama Butterfly / Act 1: ... E soffitto e pareti | 7MfmRBvqaW0I6UTxXnad8p | 18 | 0.972 | 0.36 | 176797 | 0.201 | 0.028 | D# | 0.133 | -19.794 | Major | 0.0581 | 131.798 | 4/4 | 0.369 |
| 2 | Opera | Giacomo Puccini | Turandot / Act 2: Gloria, gloria, o vincitore | 7pBo1GDhIysyUMFXiDVoON | 10 | 0.935 | 0.168 | 266184 | 0.47 | 0.0204 | C | 0.363 | -8.415 | Major | 0.0383 | 75.126 | 3/4 | 0.0696 |
| 3 | Opera | Giuseppe Verdi | Rigoletto, Act IV: Venti scudi hai tu detto? | 02mvYZX5aKNzdqEo6jF20m | 17 | 0.961 | 0.25 | 288573 | 0.00605 | 0.0 | D | 0.12 | -33.44 | Major | 0.048 | 76.493 | 4/4 | 0.038 |
| 4 | Opera | Giuseppe Verdi | Don Carlo / Act 4: "Ella giammai m'amò!" | 03TW0jwGMGhUabAjOpB1T9 | 19 | 0.985 | 0.142 | 629760 | 0.058 | 0.146 | D | 0.0969 | -23.625 | Major | 0.0493 | 172.935 | 4/4 | 0.0382 |


In [7]:
# for numerical variables, we look at summary statistics of the data
df.describe()

| | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 | 228159.0 |
| mean | 44.20913 | 0.3512 | 0.554198 | 236609.2 | 0.580967 | 0.13731 | 0.214638 | -9.354658 | 0.122442 | 117.423062 | 0.444795 |
| std | 17.276599 | 0.351385 | 0.183949 | 116678.7 | 0.260577 | 0.292447 | 0.196977 | 5.940994 | 0.186264 | 30.712458 | 0.255397 |
| min | 0.0 | 1e-06 | 0.0569 | 15509.0 | 2e-05 | 0.0 | 0.00967 | -52.457 | 0.0222 | 30.379 | 0.0 |
| 25% | 33.0 | 0.0309 | 0.437 | 186253.0 | 0.405 | 0.0 | 0.0977 | -11.287 | 0.0368 | 92.734 | 0.232 |
| 50% | 47.0 | 0.205 | 0.57 | 221173.0 | 0.618 | 3.7e-05 | 0.128 | -7.515 | 0.0506 | 115.347 | 0.43 |
| 75% | 57.0 | 0.689 | 0.69 | 264840.0 | 0.793 | 0.0234 | 0.263 | -5.415 | 0.109 | 138.887 | 0.643 |
| max | 100.0 | 0.996 | 0.987 | 5552917.0 | 0.999 | 0.999 | 1.0 | 1.585 | 0.967 | 239.848 | 1.0 |


## 2. Preparing data: internal (Database) & external (APIs) - Data integration

In [None]:
# in progress

In [None]:
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=360)
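
Since this step is still in progress, the following is only a sketch of how a Google Trends lookup could feed an "awareness" score per artist. The function name `mean_search_interest` and the `"today 12-m"` timeframe are assumptions; the only pytrends calls used are `build_payload` and `interest_over_time`. The client is passed in as an argument so the helper can be tried without network access.

```python
def mean_search_interest(client, artist, timeframe="today 12-m"):
    """Average Google Trends interest (0-100 scale) for one artist name.

    `client` is expected to behave like pytrends' TrendReq: it needs
    build_payload(kw_list=..., timeframe=...) and interest_over_time().
    """
    client.build_payload(kw_list=[artist], timeframe=timeframe)
    trends = client.interest_over_time()
    # an empty result has no column for the keyword, so there is nothing to average
    if artist not in trends:
        return None
    values = list(trends[artist])
    return sum(values) / len(values) if values else None


# Hypothetical usage with the client created in the cell above (requires network access):
# mean_search_interest(pytrends, 'Giuseppe Verdi')
```

Averaging is just one possible summary; a peak value or the most recent value may suit the analysis better.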

## 3. Data cleaning & manipulation

In [8]:
# we check if there are some null values in each variable
null_cols = df.isnull().sum()
null_cols

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

In [9]:
# to simplify, we convert duration in ms to min
df['duration_min']= df.duration_ms.apply(conversion_ms_to_min)
df.drop(columns='duration_ms',inplace=True)

In [15]:
# we concatenate key & mode to identify major and minor keys
concatenate2columns(df,'key','mode')

0    C# Major
1    D# Major
2     C Major
3     D Major
4     D Major
Name: key_mode, dtype: object


In [29]:
# for tempo, we will associate the bins to the standard classification
tempo_classification('tempo','tempo_clas')

| | tempo | tempo_clas |
|---|---|---|
| 0 | 86.001 | Andante |
| 1 | 131.798 | Allegro |
| 2 | 75.126 | Adagio |
| 3 | 76.493 | Andante |
| 4 | 172.935 | Presto |
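
To make the binning rule explicit, here is a pandas-free sketch of the same lookup using `bisect` from the standard library. It is not what `tempo_classification` runs (that uses `pd.cut`); it only mirrors `pd.cut`'s default right-closed intervals, so 76.0 BPM lands in `Adagio` (66, 76] while 76.493 BPM lands in `Andante` (76, 108]. The names `TEMPO_LABELS`, `TEMPO_CUTOFFS` and `tempo_label` are hypothetical.

```python
from bisect import bisect_left

TEMPO_LABELS = ['Larghissimo', 'Grave', 'Lento', 'Larghetto', 'Adagio', 'Andante',
                'Moderato', 'Allegro', 'Vivace', 'Presto', 'Prestissimo']
TEMPO_CUTOFFS = [0, 20, 40, 60, 66, 76, 108, 120, 140, 168, 200, 400]


def tempo_label(bpm):
    """Return the tempo marking for bpm using right-closed bins (lo, hi]."""
    # bisect_left finds the first cutoff >= bpm; the bin is the one ending at that cutoff
    idx = bisect_left(TEMPO_CUTOFFS, bpm) - 1
    if 0 <= idx < len(TEMPO_LABELS):
        return TEMPO_LABELS[idx]
    return None  # outside (0, 400]; pd.cut would produce NaN here


for bpm in (86.001, 131.798, 75.126, 76.493, 172.935):
    print(bpm, tempo_label(bpm))
```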


In [26]:
# to classify low-mid-high bins for music KPIs variables, we will use the general distribution statistics to set the cutoffs parameters:
cutoffs_table = df.describe()
music_KPIs_columns = cutoffs_table.columns
datasubset(df,music_KPIs_columns,'music_KPIs')

((228159, 11),
    popularity  acousticness  danceability   energy  instrumentalness  \
 0          21         0.986         0.313  0.23100          0.000431   
 1          18         0.972         0.360  0.20100          0.028000   
 2          10         0.935         0.168  0.47000          0.020400   
 3          17         0.961         0.250  0.00605          0.000000   
 4          19         0.985         0.142  0.05800          0.146000   
 
    liveness  loudness  speechiness    tempo  valence  duration_min  
 0    0.0964   -14.287       0.0547   86.001   0.0886      8.181117  
 1    0.1330   -19.794       0.0581  131.798   0.3690      2.946617  
 2    0.3630    -8.415       0.0383   75.126   0.0696      4.436400  
 3    0.1200   -33.440       0.0480   76.493   0.0380      4.809550  
 4    0.0969   -23.625       0.0493  172.935   0.0382     10.496000  )

In [36]:
# we apply our bins function to all music KPI variables, one column at a time
# (music_KPIs.apply(bins, axis=1) would pass each row to bins and raise a KeyError, since bins expects a column name)
for col in music_KPIs_columns:
    bins(col)

In [None]:
bins('popularity')
bins('acousticness')
bins('danceability')
bins('duration_min')
bins('energy')
bins('instrumentalness')
bins('liveness')
bins('loudness')
bins('speechiness')
bins('valence')
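
One detail worth checking with quartile-based cutoffs: `pd.cut` builds right-closed intervals by default, so a value equal to the first edge (the minimum) falls outside every bin and becomes NaN unless `include_lowest=True` is passed. The sketch below shows the difference on made-up numbers; the series and variable names are illustrative only.

```python
import pandas as pd

# toy values standing in for one of the 0-1 audio features
values = pd.Series([0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0], name='energy')
stats = values.describe()
cutoffs = [stats['min'], stats['25%'], stats['75%'], stats['max']]
labels = ['Low', 'Mid', 'High']

# default: intervals are (lo, hi], so the minimum itself is left unlabelled
default_bins = pd.cut(values, cutoffs, labels=labels)
# include_lowest=True closes the first interval on the left as well
closed_bins = pd.cut(values, cutoffs, labels=labels, include_lowest=True)

print('NaN labels by default:', default_bins.isnull().sum())
print('NaN labels with include_lowest:', closed_bins.isnull().sum())
```

This may explain null values showing up in the `*_labels` columns when they are checked afterwards.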

In [None]:
null_cols = df.isnull().sum()
null_cols

In [None]:
df['instrumentalness_labels'].value_counts()

In [None]:
# df.where(df['instrumentalness_labels'].isnull())

In [None]:
# top 10 artists
topN(df,'artist_name',10)

In [None]:
# top 10 tracks
topN(df,'track_name',10)

In [None]:
df[['genre','artist_name','track_name','popularity','energy_labels','danceability_labels','valence_labels']].sort_values(by='popularity',ascending=False).head(10)

In [None]:
df.time_signature.value_counts()

In [None]:
df['mode'].value_counts()

In [None]:
def bars(df, var):
    sns.set_style(style='darkgrid')
    # value_counts() is already sorted by frequency, so it doubles as the ranking
    table=df[var].value_counts()
    plt.title(var+' ranking & distribution')
    # explicit x/y keywords: counts on the x axis, categories on the y axis
    return sns.barplot(x=table.values, y=table.index, palette="viridis")

bars(df,'key')

In [None]:
# checking null data
null_cols = df.isnull().sum()
null_cols

In [None]:
df.genre.nunique()

In [None]:
df.tempo_clas.value_counts()

In [None]:


df = removing_duplicates(df, columns = ['track_name','artist_name'])
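
A small sketch of what this deduplication does, on invented rows: with `keep='last'`, only the last occurrence of each `(track_name, artist_name)` pair survives. If the same track is listed under several genres (an assumption about the dataset, not something shown above), only the last genre listed is kept.

```python
import pandas as pd

# invented rows: 'Song 1' by 'Artist A' appears under two genres
tracks = pd.DataFrame({
    'genre': ['Pop', 'Dance', 'Rock'],
    'artist_name': ['Artist A', 'Artist A', 'Artist B'],
    'track_name': ['Song 1', 'Song 1', 'Song 2'],
})

# same call that removing_duplicates makes, without the counting and printing
deduped = tracks.drop_duplicates(subset=['track_name', 'artist_name'], keep='last')
print(deduped)
```

Any later per-genre analysis therefore sees each track under one genre only.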

## 4. Analysis & Insights

In [None]:
def histo(df, var):
    return df[var].hist()
histo(df,'popularity')

In [None]:
df[['track_name','artist_name','energy','valence','tempo_clas']].sort_values(by='valence',ascending=False).head(10)

In [None]:
pop_temp_genre = pd.pivot_table(df, values='popularity', index=['tempo_clas'],
                  columns=["genre"], aggfunc=np.mean)
pop_temp_genre

In [None]:
pivot = pd.pivot_table(df, values='danceability', index=["genre"],
                  columns=['popularity_labels'], aggfunc=np.mean)
pivot
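
For readers less familiar with `pd.pivot_table`, here is a toy version of the same call on invented rows: each cell of the result is the mean `danceability` of the rows sharing that genre and popularity label. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# invented rows; the two Opera/Low rows are averaged into one cell
toy = pd.DataFrame({
    'genre': ['Opera', 'Opera', 'Opera', 'Pop', 'Pop'],
    'popularity_labels': ['Low', 'Low', 'High', 'Low', 'High'],
    'danceability': [0.2, 0.4, 0.5, 0.6, 0.8],
})

toy_pivot = pd.pivot_table(toy, values='danceability', index='genre',
                           columns='popularity_labels', aggfunc='mean')
print(toy_pivot)
```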

In [None]:
pivot[:]['High'].plot.bar()

In [None]:
pivot2 = pd.pivot_table(df, values='valence', index=["genre"],
                  columns=['energy_labels'], aggfunc=np.mean)
pivot2

In [None]:
pivot2[:]['High'].plot.bar()

In [None]:
pivot3 = pd.pivot_table(df, values='energy', index=["mode","key"],
                  columns=['genre'], aggfunc=np.mean)
pivot3

In [None]:
pivot4 = pd.pivot_table(df, values='duration_min', index=["energy_labels","valence_labels"],
                  columns=['popularity_labels'], aggfunc=np.mean)
pivot4

## 5. Reporting & Data Viz

In [None]:
def boxplotting(table):
    sns.set_style("whitegrid")
    fig = plt.figure(figsize=(15,5))
    return sns.boxplot(data=table)

In [None]:
boxplotting(pivot3)

In [None]:
def violinplotting(data):
    f, ax = plt.subplots()
    # draw first, then trim the spines to the plotted data range
    sns.violinplot(data=data, ax=ax)
    sns.despine(offset=10, trim=True)
    return ax

violinplotting(pivot4)

In [None]:
pivot5 = pd.pivot_table(df, values='tempo', index=["energy_labels","valence_labels"],
                  columns=['popularity_labels'], aggfunc=np.mean)
pivot5

In [None]:
f, ax = plt.subplots()
sns.violinplot(data=pivot5)
sns.despine(offset=10, trim=True);

In [None]:
sns.set_style(style='darkgrid')
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['popularity'],kde=True,stat='density')

In [None]:
sns.set_style(style='darkgrid')
sns.histplot(df['energy'],kde=True,stat='density')

In [None]:
fig = plt.figure(figsize=(15,15))
plt.suptitle('Audio Features', fontsize=18)
ax = plt.subplot(211)
# keep only the 10 genres with the highest total popularity
top_genres = df.groupby('genre')['popularity'].sum().sort_values(ascending=False).head(10).index.tolist()
data = df[df.genre.isin(top_genres)]
# plot the filtered data (not the full df), ordered by that ranking
sns.boxplot(x='popularity', y='genre', data=data, order=top_genres, palette='Pastel2', ax=ax)
ax.set_title('Sorted by Popularity')
ax.set_xlabel('')
ax.set_ylabel('')

In [None]:
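# Heatmap # for color selection: https://matplotlib.org/examples/color/colormaps_reference.html
# correlations are only defined for numeric columns, so we select those first
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(15,5))
sns.heatmap(corr,cmap="PiYG")
# same heatmap with the coefficient printed in each cell
plt.figure(figsize=(15,5))
sns.heatmap(corr,cmap="PiYG",annot=True)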

In [None]:
# rows whose genre is Rock, Pop or Indie (adding DataFrames with + aligns on the index and adds values instead of stacking rows)
pop_rock_indie = df[df['genre'].isin(['Rock','Pop','Indie'])]
pop_rock_indie.head()

In [None]:
'''#sns.jointplot(data=pop_rock_indie,y='energy',x='liveness',kind='reg')
sns.jointplot(data=rock,y='energy',x='liveness',kind='reg')

sns.set(style="white", color_codes=True)
    tips = sns.load_dataset("tips")
    g = sns.jointplot("total_bill", "tip", data=tips, kind="reg")'''

'''from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(20,20), diagonal='kde')
df.corr(method='pearson', min_periods=1)'''