# Linear Regression using the Spotify Charts Dataset

Can you build a linear model to predict a song's popularity using the metrics provided as features?

The relevant metrics are 'popularity' (the response), 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', and 'tempo'.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('data/spotify_daily_charts_tracks.csv')
df.head()

In [None]:
df.info()

In [None]:
# remove null
df = df[~df['track_name'].isnull()]
len(df)

In [None]:
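# keep only the columns needed for modeling
# .copy() makes an independent frame so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df = df[['popularity', 'danceability', 'energy',
         'loudness', 'speechiness', 'acousticness', 'instrumentalness',
         'liveness', 'valence', 'tempo']].copy()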
df.head()

### 1. Explore the dataset

In [None]:
#Make a table of distribution stats of song metrics using df.describe

df[['popularity', 'danceability', 'energy',
       'loudness','speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo']].describe()

In [None]:
# Visualize histograms of each song metric
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
for col in ['popularity', 'danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']:
    sns.histplot(df[col], kde=True)
    plt.title(col)
    plt.ylabel('Frequency')
    plt.show()


### 2. Feature Engineering

Normalize loudness and tempo.
>Q: What's the best norm to use for each?
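
To make the question concrete, here is a minimal sketch comparing two common choices: min-max scaling and standardization. The loudness values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical loudness values in dB (illustrative only, not from the dataset)
loudness = np.array([[-12.0], [-8.5], [-6.0], [-5.2], [-3.1]])

# Min-max scaling maps the column onto the interval [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(loudness)

# Standardization centers the column at 0 with unit variance
standard_scaled = StandardScaler().fit_transform(loudness)

print('min-max :', minmax_scaled.ravel())
print('standard:', standard_scaled.ravel())
```

Min-max scaling keeps the result on a bounded 0 to 1 range, but the minimum and maximum it relies on are sensitive to extreme values. Standardization is less tied to the extremes but does not bound the output.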


In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df['loudness'] = scaler.fit_transform(df[['loudness']])
df.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df['tempo'] = scaler.fit_transform(df[['tempo']])
df.head()

Visualize the new loudness and tempo distributions

In [None]:
# Visualize histograms of the rescaled metrics
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
for col in ['loudness', 'tempo']:
    sns.histplot(df[col], kde=True)
    plt.title(col)
    plt.ylabel('Frequency')
    plt.show()


### 3. Examine Features

To reduce variability, we could limit our analysis to only those songs that are sufficiently popular

In [None]:
# keep songs at or above the median popularity (hard-coded here as 61)
df = df[df['popularity'] >= 61]
df = df.reset_index(drop=True)
len(df)

Pick 3 features that you think would give you a good fit.
>Q: *Hypothesis*: Why do you think these 3 could be good predictors of popularity?

Visualize the relationship between each feature and the response using scatterplots.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(16, 8), sharey=True)
df.plot(kind='scatter', x='danceability', y='popularity', ax=axs[0], grid=True)
df.plot(kind='scatter', x='energy', y='popularity', ax=axs[1], grid=True)
df.plot(kind='scatter', x='loudness', y='popularity', ax=axs[2], grid=True)

### 4. Fit the Model

Do the following steps for each of your selected features:

a. Determine best fit line coefficients

In [None]:
from sklearn.linear_model import LinearRegression

feature_cols = ['danceability']
X = df[feature_cols]
y = df['popularity']

model = LinearRegression(fit_intercept=True)
model.fit(X,y)

print('Model slope: %0.4f' % model.coef_[0])
print('Model intercept: %0.4f' % model.intercept_)


b. Obtain the R2 for the fit

In [None]:
print('Model R2: %0.4f' % model.score(X,y))

>Q: Interpret the model coefficients. What does the R2 value tell you about the fitted model?

An increase of 0.1 in danceability is associated with an increase in popularity of about 0.7 points.
But based on the R2, this is a very poor fit, so we hold back from relying on this interpretation.
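
As a reminder of what R2 measures, the sketch below computes it by hand as 1 - SS_res / SS_tot on synthetic data. The slope, intercept, and noise level are assumptions chosen only to resemble a weak relationship; they are not estimates from the Spotify data.

```python
import numpy as np

# Synthetic data: a weak linear signal buried in noise (assumed values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 70 + 7 * x + rng.normal(0, 8, 200)

# Least-squares line through the points
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

# R2 = share of the variance in y that the line accounts for
ss_res = np.sum((y - pred) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print('slope: %0.4f, R2: %0.4f' % (slope, r2))
```

An R2 near 0 means the line predicts barely better than always guessing the mean popularity, which is why a fitted slope alone says little when R2 is low.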

c. Compute the RMSE and MAE

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

#define RMSE function
def RMSE(model, X, y):
    predicted = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, predicted))
    return rmse
  
#define MAE function
def MAE(model, X, y):
    predicted = model.predict(X)
    mae = mean_absolute_error(y, predicted)
    return mae

In [None]:
print('Model RMSE: %0.4f' % RMSE(model,X,y))
print('Model MAE: %0.4f' % MAE(model,X,y))

>Q: What do the RMSE and MAE tell you about the model performance?

On average, the model's predicted popularity is off by about 8.22 points according to the RMSE (which penalizes large errors more heavily) or 6.9 points according to the MAE (which weights all errors equally), across all songs considered.
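
The gap between the two metrics can be illustrated with a small sketch. The two error vectors below are made up: both have the same MAE, but one concentrates its error in a single large miss.

```python
import numpy as np

def rmse(errors):
    # root of the mean squared error: large misses dominate
    return np.sqrt(np.mean(errors ** 2))

def mae(errors):
    # mean absolute error: every miss counts in proportion to its size
    return np.mean(np.abs(errors))

# Hypothetical prediction errors (illustrative only)
errors_even = np.array([5.0, -5.0, 5.0, -5.0])    # all misses the same size
errors_spiky = np.array([1.0, -1.0, 1.0, -17.0])  # one large miss

print('even : MAE=%0.2f, RMSE=%0.2f' % (mae(errors_even), rmse(errors_even)))
print('spiky: MAE=%0.2f, RMSE=%0.2f' % (mae(errors_spiky), rmse(errors_spiky)))
# even : MAE=5.00, RMSE=5.00
# spiky: MAE=5.00, RMSE=8.54
```

RMSE is never smaller than MAE, and a wide gap between them suggests that a few songs are predicted much worse than the rest.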

d. Check for outliers and determine if removing them could result in a better fit

In [None]:
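q1 = df['popularity'].quantile(0.25)
q3 = df['popularity'].quantile(0.75)
IQR = q3 - q1

# a value is an outlier if it falls below the lower fence OR above the upper fence
# (combining the two conditions with & can never match any row)
outliers = df[(df['popularity'] < (q1 - 1.5*IQR)) | (df['popularity'] > (q3 + 1.5*IQR))]['popularity']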
outliers
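
To follow through on the removal step, here is a minimal sketch of the 1.5 x IQR rule on a toy series. The values are invented for illustration; on the real data you would apply the same mask to `df` and refit the model on the remaining rows.

```python
import pandas as pd

# Hypothetical popularity scores with one extreme value (illustrative only)
popularity = pd.Series([61, 63, 64, 66, 68, 70, 71, 73, 75, 99])

q1 = popularity.quantile(0.25)
q3 = popularity.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# flag values outside either fence, then keep the rest
is_outlier = (popularity < lower_fence) | (popularity > upper_fence)
outliers = popularity[is_outlier]
kept = popularity[~is_outlier]

print('fences:', lower_fence, upper_fence)
print('outliers:', outliers.tolist())
print('rows kept:', len(kept))
```

If this check returns no outliers on the real data, removing them cannot be the remedy for a poor fit.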

### 5. Using multiple features

Will using all 3 of your chosen features result in a better fit? Repeat the procedure in 4 and see if the metrics improve.
If they do improve, do you think it's enough to make the model more credible?

In [None]:
# create X and y
feature_cols = ['danceability', 'energy', 'loudness']
X = df[feature_cols]
y = df['popularity']

lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients
print(lm.intercept_)
# pair the feature names with the coefficients
print(list(zip(feature_cols, lm.coef_)))

In [None]:
# calculate the R-squared
lm.score(X, y)
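
One caveat when comparing the two fits: R2 measured on the training data never decreases when a feature is added, even a useless one. The sketch below illustrates this on synthetic data, and also computes the adjusted R2, a common correction that penalizes the number of features. All values here are assumptions for illustration, and the helper names `r2_of_fit` and `adjusted_r2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.uniform(0, 1, n)
noise_feature = rng.normal(size=n)      # unrelated to y by construction
y = 70 + 7 * x1 + rng.normal(0, 8, n)   # synthetic response

def r2_of_fit(features, y):
    # ordinary least squares with an intercept column, then R2 on the same data
    A = np.column_stack([np.ones(len(y))] + features)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    # n = number of rows, p = number of features (intercept excluded)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_one = r2_of_fit([x1], y)
r2_two = r2_of_fit([x1, noise_feature], y)

print('1 feature : R2=%0.4f, adjusted=%0.4f' % (r2_one, adjusted_r2(r2_one, n, 1)))
print('2 features: R2=%0.4f, adjusted=%0.4f' % (r2_two, adjusted_r2(r2_two, n, 2)))
```

A small bump in training R2 after adding features is therefore weak evidence on its own; evaluating on held-out data is a more direct check.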

### 6. Using k-fold cross validation
We can further investigate the model's predictive performance using k-fold cross-validation.
What does folding reveal about the linear model you built?

- For the model you built in (5), try the validation procedure for k=5 and k=10



In [None]:
X = X.to_numpy()
Y = y.to_numpy()

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, random_state=None, shuffle=False)


for train_index, test_index in kf.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    #print(np.shape(X_test), np.shape(Y_test))
    
    #fit using training data
    lin_model = LinearRegression()
    lin_model.fit(X_train, Y_train)
    
    #evaluate fit of train data
    print('train: R2=%0.2f '% lin_model.score(X_train, Y_train))

    #evaluate using test data
    print('test: RMSE=%0.2f, R2=%0.2f' % (RMSE(lin_model, X_test, Y_test), lin_model.score(X_test,Y_test)))

> All training sets show poor fit. Some test sets produced relatively better fits, but this is only local to the fold since RMSE remains high.
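
For comparing k=5 and k=10 without rewriting the loop, scikit-learn's `cross_val_score` can run the folds for you. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`, with assumed coefficients), so substitute your own `X` and `Y`. It also sets `shuffle=True`, which is a choice made here and differs from the loop above; shuffling may matter if the rows are stored in a meaningful order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the three-feature design matrix (assumed values)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 70 + X_demo @ np.array([7.0, 2.0, -1.0]) + rng.normal(0, 8, 200)

results = {}
for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # one test-fold R2 per split
    r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring='r2')
    # scikit-learn reports errors as negated scores, so flip the sign
    rmse = -cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                            scoring='neg_root_mean_squared_error')
    results[k] = (r2, rmse)
    print('k=%d: mean test R2=%0.2f, mean test RMSE=%0.2f' % (k, r2.mean(), rmse.mean()))
```

Looking at the spread of the per-fold scores, not only their mean, shows how much the result depends on which songs land in the test fold.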