## Lab

The data in “songs.csv” contains descriptive information on various songs from Spotify and one person’s opinion of each song. We want you to build a model that can predict “target”: target is 1 when they liked the song and 0 when they did not.

You need to decide:

* Which variables to use
* Do you want to do any feature engineering?
* How to explore and clean the data
* Which model to use
* Do you want to change any of the parameters of the model?
* How to measure the accuracy of the model

You will want to build several models and compare their accuracy. Try to get the best model possible, but be careful about overfitting.

For this lab, please use Python.

In [1]:
import pandas as pd
import numpy as np

In [19]:
songs = pd.read_csv('data/songs.csv')

In [3]:
# EDA

In [20]:
songs.head()

,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1,Mask Off,Future
1,1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1,Redbone,Childish Gambino
2,2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1,Xanny Family,Future
3,3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1,Master Of None,Beach House
4,4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1,Parallel Lines,Junior Boys


In [6]:
songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2017 entries, 0 to 2016
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        2017 non-null   int64  
 1   acousticness      2017 non-null   float64
 2   danceability      2017 non-null   float64
 3   duration_ms       2017 non-null   int64  
 4   energy            2017 non-null   float64
 5   instrumentalness  2017 non-null   float64
 6   key               2017 non-null   int64  
 7   liveness          2017 non-null   float64
 8   loudness          2017 non-null   float64
 9   mode              2017 non-null   int64  
 10  speechiness       2017 non-null   float64
 11  tempo             2017 non-null   float64
 12  time_signature    2017 non-null   float64
 13  valence           2017 non-null   float64
 14  target            2017 non-null   int64  
 15  song_title        2017 non-null   object 
 16  artist            2017 non-null   object 


In [12]:
songs.isna().sum()

# no na values

Unnamed: 0          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
target              0
song_title          0
artist              0
dtype: int64
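
Beyond missing values, it may help to check how balanced `target` is, since an accuracy figure only means something relative to the majority-class rate. A minimal sketch on a small made-up frame (`toy_songs` and its values are hypothetical stand-ins for `songs`):

```python
import pandas as pd

# Hypothetical stand-in for `songs`; the real frame comes from data/songs.csv.
toy_songs = pd.DataFrame({
    "danceability": [0.83, 0.74, 0.49, 0.68, 0.58, 0.89],
    "target": [1, 1, 1, 0, 0, 0],
})

# Share of each class; the larger share is the accuracy of always
# guessing the majority class, which any model should beat.
class_share = toy_songs["target"].value_counts(normalize=True)
baseline_accuracy = class_share.max()

# Per-class feature means give a first hint of which variables differ
# between liked and disliked songs.
means_by_class = toy_songs.groupby("target")["danceability"].mean()
```

Running the same two calls on `songs` would show whether 77% accuracy is far above or close to the trivial baseline.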

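Which type of model? Not sure yet, because I can't run pandas-profiling to explore the data first.

I'll start with a linear model as a baseline. The cells below fit `LinearRegression` rather than a logistic regression, so the score they report is R², not classification accuracy.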

In [21]:
songs.head()

,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1,Mask Off,Future
1,1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1,Redbone,Childish Gambino
2,2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1,Xanny Family,Future
3,3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1,Master Of None,Beach House
4,4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1,Parallel Lines,Junior Boys


In [25]:
# selecting some variables

songs_subset = songs.loc[:,['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence', 'target']].copy()

In [27]:
songs_subset.info()

# all columns are numbers so that is fine

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2017 entries, 0 to 2016
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      2017 non-null   float64
 1   danceability      2017 non-null   float64
 2   duration_ms       2017 non-null   int64  
 3   energy            2017 non-null   float64
 4   instrumentalness  2017 non-null   float64
 5   key               2017 non-null   int64  
 6   liveness          2017 non-null   float64
 7   loudness          2017 non-null   float64
 8   mode              2017 non-null   int64  
 9   speechiness       2017 non-null   float64
 10  tempo             2017 non-null   float64
 11  time_signature    2017 non-null   float64
 12  valence           2017 non-null   float64
 13  target            2017 non-null   int64  
dtypes: float64(10), int64(4)
memory usage: 220.7 KB
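
All columns being numeric does not mean all of them are quantitative: `key` looks like a category code stored as an integer (an assumption based on the values shown above, not something the data dictionary confirms here). One feature-engineering option for a linear model is to one-hot encode such columns and to rescale `duration_ms` into minutes. A sketch on a few hypothetical rows:

```python
import pandas as pd

# Hypothetical rows shaped like two columns of `songs_subset`.
toy = pd.DataFrame({
    "key": [2, 1, 2, 5],
    "duration_ms": [204600, 326933, 185707, 199413],
})

# Derived feature: duration in minutes is easier to read than milliseconds.
toy["duration_min"] = toy["duration_ms"] / 60_000

# One-hot encode `key` so a linear model does not treat key 5 as
# "more" than key 2; untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["key"], prefix="key", dtype=int)
```

Tree-based models such as random forests are usually less sensitive to this choice, so the encoding matters mostly for the linear models.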


In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
# get response

response_var = songs_subset.loc[:,'target']

# get predictors
predictors_vars = songs_subset.drop(columns='target')

In [30]:
predictors_vars

,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0.01020,0.833,204600,0.434,0.021900,2,0.1650,-8.795,1,0.4310,150.062,4.0,0.286
1,0.19900,0.743,326933,0.359,0.006110,1,0.1370,-10.401,1,0.0794,160.083,4.0,0.588
2,0.03440,0.838,185707,0.412,0.000234,2,0.1590,-7.148,1,0.2890,75.044,4.0,0.173
3,0.60400,0.494,199413,0.338,0.510000,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.230
4,0.18000,0.678,392893,0.561,0.512000,5,0.4390,-11.648,0,0.0694,174.004,4.0,0.904
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012,0.00106,0.584,274404,0.932,0.002690,1,0.1290,-3.501,1,0.3330,74.976,4.0,0.211
2013,0.08770,0.894,182182,0.892,0.001670,1,0.0528,-2.663,1,0.1310,110.041,4.0,0.867
2014,0.00857,0.637,207200,0.935,0.003990,0,0.2140,-2.467,1,0.1070,150.082,4.0,0.470
2015,0.00164,0.557,185600,0.992,0.677000,1,0.0913,-2.735,1,0.1330,150.011,4.0,0.623


In [31]:
model = LinearRegression()
model.fit(predictors_vars, response_var)

LinearRegression()

In [32]:
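# for LinearRegression, .score() returns R² on the data passed in (here the training data), not classification accuracy
model.score(predictors_vars, response_var)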

0.1317097558476329

In [33]:
# Reading intercept

model.intercept_

-0.31348116636664203
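
Because `target` is binary, a classifier is a more natural baseline than `LinearRegression`. A logistic regression version might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for `predictors_vars` and `response_var`, so the numbers it produces say nothing about the songs data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 13 numeric predictors and a 0/1 response.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

# Scaling first helps the solver converge when features have very
# different ranges (as duration_ms and acousticness do).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# For a classifier, .score() is accuracy (share of correct 0/1 predictions).
train_accuracy = clf.score(X, y)

# Estimated probability of class 1 ("liked") for the first five rows.
liked_probability = clf.predict_proba(X[:5])[:, 1]
```

Swapping `X, y` for `predictors_vars, response_var` would give a training accuracy that can be compared directly with the random forest further down.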

In [35]:
songs_subset.head()

,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1
3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1
4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1


In [36]:
# separate predictors and response columns
songs_predictors = songs_subset.drop(columns='target')
songs_response = songs_subset.target

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
songs_pred_train, songs_pred_test, songs_resp_train, songs_resp_test = (
    train_test_split(
        songs_predictors,
        songs_response,
        test_size = 0.1,
        random_state = 7
    )
)
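
With only 10% of rows held out, the share of liked songs in the test set can drift from the share in the training set. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. A small sketch on made-up data (`X` and `y` here are hypothetical, not the songs columns):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 10 liked (1) and 10 not liked (0).
X = pd.DataFrame({"energy": [i / 20 for i in range(20)]})
y = pd.Series([1] * 10 + [0] * 10, name="target")

# stratify=y makes the split preserve the 50/50 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)
```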

In [44]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [41]:
model = RandomForestClassifier(n_estimators=100)
model.fit(songs_pred_train, songs_resp_train)

RandomForestClassifier()

In [46]:
scores = cross_val_score(
    model, songs_pred_train, songs_resp_train, scoring = 'accuracy', cv =5)

In [47]:
scores.mean()

0.7713498622589532

The random forest does reasonably well: mean 5-fold cross-validated accuracy on the training split is about 77%.
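
The cross-validated score uses only the training split, and the forest runs with default parameters apart from `n_estimators`. One way to tune parameters and then check the held-out rows is `GridSearchCV`. The sketch below runs on synthetic data, and the parameter grid is an arbitrary illustration rather than a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for songs_predictors / songs_response.
X, y = make_classification(n_samples=200, n_features=13, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=7
)

# Small illustrative grid; each combination is scored with 5-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# Mean CV accuracy of the best combination, measured on training rows only.
cv_accuracy = search.best_score_

# Accuracy on rows the search never saw; a large gap between the two
# numbers is a sign of overfitting.
test_accuracy = search.score(X_test, y_test)
```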

In [None]:
# trying with pipeline and linear regression

In [48]:
songs_subset.head(1)

,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1


In [49]:
# splitting dataset into test and train

a = songs_subset.drop('target', axis = 1)
b = songs_subset.target

from sklearn.model_selection import train_test_split
a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.3
)


In [52]:
# imputing in sklearn (songs has no missing values, so this step leaves the data unchanged)

from sklearn.impute import SimpleImputer

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

a_train = imp_median.fit_transform(a_train)
a_train = pd.DataFrame(a_train, columns=a.columns)

In [54]:
a_train.isnull().sum()

acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64
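
When a dataset does have gaps, the imputer is normally fitted on the training rows and then applied unchanged to the test rows. `fit_transform` returns a NumPy array, so rebuilding the DataFrame without `index=` resets the row labels and they no longer match the response series. A sketch with hypothetical values that keeps the labels:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames with shuffled row labels, as produced
# by train_test_split, and one missing tempo in each.
train = pd.DataFrame({"tempo": [100.0, 120.0, np.nan, 140.0]}, index=[7, 3, 9, 1])
test = pd.DataFrame({"tempo": [np.nan, 90.0]}, index=[4, 8])

imputer = SimpleImputer(strategy="median")

# Learn the median from the training rows only.
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training median for the test rows (no refitting).
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
```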

In [55]:
# checking feature scales before standardizing (means and standard deviations differ by orders of magnitude)

a_train.mean()

acousticness             0.188373
danceability             0.617970
duration_ms         246660.538625
energy                   0.683434
instrumentalness         0.132042
key                      5.344437
liveness                 0.190192
loudness                -7.092711
mode                     0.613040
speechiness              0.096295
tempo                  121.609284
time_signature           3.970234
valence                  0.495463
dtype: float64

In [56]:
a_train.std()

acousticness            0.258732
danceability            0.161730
duration_ms         82522.947822
energy                  0.210877
instrumentalness        0.272526
key                     3.610132
liveness                0.155088
loudness                3.823641
mode                    0.487227
speechiness             0.094024
tempo                  26.535063
time_signature          0.250885
valence                 0.248248
dtype: float64
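
The means and standard deviations above span several orders of magnitude, which is why standardizing is useful for linear models. Standardizing subtracts each column's mean and divides by its standard deviation. Note that pandas `.std()` uses the sample formula (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0). A pure-Python sketch of the calculation, using the five tempo values from the `head()` output:

```python
import statistics

# Tempo values from the first five rows shown earlier.
tempo = [150.062, 160.083, 75.044, 86.468, 174.004]

mean = statistics.fmean(tempo)
# Population standard deviation (ddof=0), matching what StandardScaler uses.
std = statistics.pstdev(tempo)

# z-score: how many standard deviations each value lies from the mean.
tempo_scaled = [(t - mean) / std for t in tempo]
```

After scaling, the column has mean 0 and population standard deviation 1, so it is on the same footing as every other scaled column.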

In [57]:
# Building pipelines

In [61]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [66]:
pipe_linreg = Pipeline(steps = [
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

pipe_linreg.fit(a_train, b_train)

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('regressor', LinearRegression())])

In [68]:
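# MAE treats the 0/1 target as a continuous value, so it measures average prediction error, not classification accuracy
from sklearn.metrics import mean_absolute_error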

b_train_predict = pipe_linreg.predict(a_train)
b_test_predict = pipe_linreg.predict(a_test)

print('MAE train', mean_absolute_error(b_train, b_train_predict))
print('MAE test', mean_absolute_error(b_test, b_test_predict))

MAE train 0.42747934258337533
MAE test 0.43998678858673235
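
An MAE of about 0.43 on a 0/1 target is hard to compare with the random forest's accuracy. One way to put the two on the same scale is to turn the regression outputs into class predictions with a cutoff and count the matches. The helper below is a hypothetical illustration (`accuracy_from_scores` is not part of scikit-learn), the 0.5 threshold is an assumption, and the labels and scores are made up:

```python
def accuracy_from_scores(y_true, y_score, threshold=0.5):
    """Share of rows where the thresholded score equals the 0/1 label."""
    if len(y_true) != len(y_score) or len(y_true) == 0:
        raise ValueError("y_true and y_score must be non-empty and equal in length")
    # Scores at or above the threshold count as class 1 ("liked").
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    hits = sum(int(pred == true) for pred, true in zip(y_pred, y_true))
    return hits / len(y_true)


# Hypothetical labels and linear-regression outputs.
labels = [1, 0, 1, 1, 0]
scores = [0.81, 0.35, 0.42, 0.66, 0.58]

accuracy = accuracy_from_scores(labels, scores)
```

Applied to `b_test` and `b_test_predict`, the same idea would give a test accuracy for the linear pipeline that can be set beside the cross-validated accuracy of the random forest.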
