In [1]:
import pandas as pd

# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

📝 **Load the `spotify_popularity_train.csv` dataset from the provided URL**
- Display the first few rows
- Perform the basic cleaning operations (remove duplicate rows, as well as rows with missing values)
- Store the result in a `DataFrame` named `data`
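
The two cleaning steps listed above can be sketched on a small toy frame. The column values here are hypothetical stand-ins, not rows from the real CSV, and whether the real dataset actually contains duplicates is not verified here.

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical values): one duplicated row
# and one row with a missing artist.
raw = pd.DataFrame(
    {
        "name": ["Song A", "Song A", "Song B", "Song C"],
        "popularity": [40, 40, 55, 12],
        "artist": ["Artist 1", "Artist 1", None, "Artist 2"],
    }
)

# Drop exact duplicate rows first, then any row with a missing value.
cleaned = raw.drop_duplicates().dropna().reset_index(drop=True)

print(cleaned.shape)
```

Resetting the index is optional; it only keeps the row labels contiguous after rows are removed.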

In [2]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [3]:
data = pd.read_csv(url)

In [4]:
data.head(1)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52317 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52317 non-null  float64
 1   danceability      52317 non-null  float64
 2   duration_ms       52317 non-null  int64  
 3   energy            52317 non-null  float64
 4   explicit          52317 non-null  int64  
 5   id                52317 non-null  object 
 6   instrumentalness  52317 non-null  float64
 7   key               52317 non-null  int64  
 8   liveness          52317 non-null  float64
 9   loudness          52317 non-null  float64
 10  mode              52317 non-null  int64  
 11  name              52317 non-null  object 
 12  popularity        52317 non-null  int64  
 13  release_date      52317 non-null  object 
 14  speechiness       52317 non-null  float64
 15  tempo             52317 non-null  float64
 16  valence           52317 non-null  float64

In [6]:
data.dropna(inplace=True)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52313 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52313 non-null  float64
 1   danceability      52313 non-null  float64
 2   duration_ms       52313 non-null  int64  
 3   energy            52313 non-null  float64
 4   explicit          52313 non-null  int64  
 5   id                52313 non-null  object 
 6   instrumentalness  52313 non-null  float64
 7   key               52313 non-null  int64  
 8   liveness          52313 non-null  float64
 9   loudness          52313 non-null  float64
 10  mode              52313 non-null  int64  
 11  name              52313 non-null  object 
 12  popularity        52313 non-null  int64  
 13  release_date      52313 non-null  object 
 14  speechiness       52313 non-null  float64
 15  tempo             52313 non-null  float64
 16  valence           52313 non-null  float64

🧪 **Run the following cell to save your results**

In [8]:
from nbresult import ChallengeResult

ChallengeResult(
    "data_cleaning",
    shape=data.shape).write()

In [9]:
data.keys()

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'name',
       'popularity', 'release_date', 'speechiness', 'tempo', 'valence',
       'artist'],
      dtype='object')

## Simple model

📝 **Which sklearn scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want it to:**
- **Strongly penalize** the largest errors
- Measure errors **in the same unit** as `popularity`
- Be better when greater (metric_good_model > metric_bad_model)

👉 Store its exact name as `string` in the variable `scoring` below

🚨 You must use this metric for the rest of the challenge
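
To make the three criteria concrete, here is a minimal sketch in plain Python comparing the root mean squared error (RMSE) with the mean absolute error (MAE) on made-up predictions. The helper names `rmse` and `mae` are illustrative only. sklearn's `neg_root_mean_squared_error` scorer reports the negated RMSE, which is why a greater value means a better model.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the unit of the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [50, 50, 50, 50]
many_small_errors = [45, 55, 45, 55]  # four errors of 5 points
one_large_error = [50, 50, 50, 70]    # a single error of 20 points

# Both prediction sets have the same MAE, but RMSE is larger when the
# error is concentrated in one big miss.
print(mae(y_true, many_small_errors), mae(y_true, one_large_error))
print(rmse(y_true, many_small_errors), rmse(y_true, one_large_error))

# Negating the RMSE turns "lower is better" into "greater is better".
print(-rmse(y_true, many_small_errors) > -rmse(y_true, one_large_error))
```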

In [24]:
scoring = 'neg_root_mean_squared_error'

**📝 Let's build a first simple linear model using only the numerical features in our dataset to start with**
- Build `X_simple` keeping only numerical features
- Build `y` your target containing the `popularity`
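
As an alternative to listing the numerical columns by hand, they can be selected by dtype. This is a sketch on a toy frame with hypothetical values; `X_numeric` and `y_toy` are illustrative names, not variables the challenge expects.

```python
import pandas as pd

# Toy frame mixing numerical and text columns (hypothetical values).
toy = pd.DataFrame(
    {
        "energy": [0.19, 0.80],
        "explicit": [0, 1],
        "name": ["Song A", "Song B"],
        "popularity": [40, 65],
    }
)

# Keep every numerical column, then remove the target from the features.
X_numeric = toy.select_dtypes(include="number").drop(columns=["popularity"])
y_toy = toy["popularity"]

print(list(X_numeric.columns))
```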

In [10]:
X_simple = data[['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
                 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
                 'speechiness', 'tempo', 'valence']]
y = data[['popularity']]

In [11]:
X_simple.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52313 entries, 0 to 52316
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52313 non-null  float64
 1   danceability      52313 non-null  float64
 2   duration_ms       52313 non-null  int64  
 3   energy            52313 non-null  float64
 4   explicit          52313 non-null  int64  
 5   instrumentalness  52313 non-null  float64
 6   key               52313 non-null  int64  
 7   liveness          52313 non-null  float64
 8   loudness          52313 non-null  float64
 9   mode              52313 non-null  int64  
 10  speechiness       52313 non-null  float64
 11  tempo             52313 non-null  float64
 12  valence           52313 non-null  float64
dtypes: float64(9), int64(4)
memory usage: 5.6 MB


### Holdout evaluation

**📝 Create the 4 variables `X_train_simple`, `y_train`, `X_test_simple`, `y_test` with a 50% split with random sampling**

In [12]:
from sklearn.model_selection import train_test_split

X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple,y,test_size=0.5)

**📝 Fit and evaluate a basic linear model (do not fine-tune it) with this holdout method**
- Store your model's true performance in a float variable `score_simple_holdout`

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import math

preproc = StandardScaler()
lin_model = LinearRegression()

basic_model = Pipeline([('preproc',preproc),
                        ('model',lin_model)
                       ])

basic_model.fit(X_train_simple,y_train)

Pipeline(steps=[('preproc', StandardScaler()), ('model', LinearRegression())])

In [26]:
def root_mse(mse):
    return math.sqrt(mse)

In [27]:
y_pred = basic_model.predict(X_test_simple)
score_simple_holdout = root_mse(mean_squared_error(y_test,y_pred))
score_simple_holdout

18.33817747072249

### Cross-validation evaluation

📝 **Let's be sure our score is representative**: 
- Run a 5-fold cross-validation of a basic linear model on the whole numeric dataset (`X_simple`, `y`)
- Do not fine tune your model
- Store your mean performance in a variable `score_simple_cv_mean` as a `float`
- Store the standard deviation of your performances in a float variable `score_simple_cv_std`

In [22]:
from sklearn.model_selection import cross_validate

cv = cross_validate(basic_model, X_simple, y, cv=5, scoring='neg_root_mean_squared_error')


In [23]:
cv['test_score']

array([-18.43649885, -18.46079317, -18.37189603, -18.47472207,
       -18.24571041])

In [28]:
score_simple_cv_mean = - cv['test_score'].mean()
score_simple_cv_std =  cv['test_score'].std()

In [29]:
score_simple_cv_mean

18.397924106389315

In [30]:
score_simple_cv_std

0.08388674341016651

🧪 **Run the following cell to save your results**

In [None]:
from nbresult import ChallengeResult

ChallengeResult(
    "simple_model",
    scoring=scoring,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
).write()

## Feature engineering

(From now on, we will stop using a train/test split and use cross-validation on the whole dataset instead)  

Let's try to improve performance using the feature `release_date`

**📝 Create `X_engineered` by adding a new column `year` to `X`, containing the release year of the track as `integer`**
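
One possible way to derive `year` is sketched below on toy data. It assumes that every `release_date` string starts with a four-digit year (the sample row shown earlier is just `1971`; the longer formats here are hypothetical), so the first four characters can be cast to an integer.

```python
import pandas as pd

# Toy data: release dates with different levels of precision
# (the longer formats are an assumption, not taken from the dataset).
toy = pd.DataFrame(
    {
        "energy": [0.19, 0.80, 0.55],
        "release_date": ["1971", "1998-05-12", "2015-11"],
    }
)

# Start from the numerical features, then add the year as an integer column.
X_toy = toy[["energy"]].copy()
X_toy["year"] = toy["release_date"].str[:4].astype(int)

print(X_toy["year"].tolist())
```

Copying the feature frame before adding the column avoids modifying a slice of the original data.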

📝 **Let's see how this impacts the performance of our model.**
- Retrain the same simple linear model on numerical values only, adding the new feature `year`
- Save the mean cross-validated performance metric in a variable named `score_engineered` as a `float`
- Do not fine tune the model yet

🧪 **Run the following cell to save your results**

In [None]:
from nbresult import ChallengeResult

ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score_engineered=score_engineered
).write()

## Pipelining

Let's now look for maximum performance by creating a solid preprocessing pipeline.

**📝 Create a sklearn preprocessing [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and store it as `preproc`**

- Feel free to add any preprocessing steps you think of
- You may want to integrate your feature engineering for `year`
- You may also further improve it using the `ArtistPopularityTransformer` class given to you below
- Don't add any model to it yet

🚨 Advice: It is better for you to have a working pipeline (even a simple one) rather than NO pipeline at all
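
A minimal sketch of what such a preprocessing pipeline could look like, on toy data with hypothetical values. The names `extract_year`, `year_pipe` and `preproc_sketch` are illustrative, and the year extraction again assumes that `release_date` strings start with a four-digit year. This is one simple option, not the expected solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def extract_year(df):
    """Return a one-column frame holding the release year as an integer."""
    return df["release_date"].str[:4].astype(int).to_frame("year")


# Toy data (hypothetical values).
toy = pd.DataFrame(
    {
        "energy": [0.19, 0.80, 0.55],
        "tempo": [149.46, 120.0, 98.5],
        "release_date": ["1971", "1998-05-12", "2015-11"],
    }
)

# Year branch: extract the year, then scale it like the other features.
year_pipe = Pipeline(
    [
        ("extract", FunctionTransformer(extract_year)),
        ("scale", StandardScaler()),
    ]
)

# Each branch receives only the columns listed for it.
preproc_sketch = ColumnTransformer(
    [
        ("numeric", StandardScaler(), ["energy", "tempo"]),
        ("year", year_pipe, ["release_date"]),
    ]
)

transformed = preproc_sketch.fit_transform(toy)
print(transformed.shape)
```

Columns not listed in any branch are dropped by default, so the number of output columns is the sum of what each branch produces.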

In [None]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
# X = 
# y = 

In [None]:
# We are giving you below a custom transformer that you may want to use in your pipeline (make sure you understand it)

import numpy as np  # needed by `transform`, which replaces np.nan values
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature of the test set, the mean popularity of 
    all songs made by the artist on the train set.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
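
The idea behind this transformer can be illustrated with plain pandas on toy data. This is a simplified sketch of the same logic, not the class itself: the mean popularity per artist is learned on the training rows, mapped onto new rows, and artists that were never seen fall back to the global mean. Note that the class sets `.name` on the grouped result and merges it as a column, so it appears to expect `y` as a pandas `Series` rather than a one-column `DataFrame`.

```python
import pandas as pd

# Toy training and test sets (hypothetical values).
train = pd.DataFrame(
    {"artist": ["A", "A", "B"], "popularity": [40, 60, 10]}
)
test = pd.DataFrame({"artist": ["A", "C"]})

# "fit": mean popularity per artist and global mean, from training rows only.
artist_mean = train.groupby("artist")["popularity"].mean()
global_mean = train["popularity"].mean()

# "transform": look up each artist, fall back to the global mean if unknown.
test["artist_popularity"] = test["artist"].map(artist_mean).fillna(global_mean)

print(test)
```

Because the means are computed from the target, they should only ever be learned from training folds, which is what fitting the transformer inside a cross-validated pipeline ensures.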

**📝 Store the number of columns/features after preprocessing your inputs in a variable `col_number`**

🧪 **Run the following cells to save your results**

In [None]:
# Visually print your preproc
from sklearn import set_config; set_config(display='diagram')
preproc

In [None]:
# Save your preproc
from nbresult import ChallengeResult

ChallengeResult(
    "preprocessing",
    col_number=col_number,
    first_observation = preproc.fit_transform(X, y)[0]
).write()

## Training

📝 **Time to fine tune your models**

- Add an **estimator** to your pipeline (only from Scikit-learn) 

- Train your pipeline and **fine-tune** (optimize) your estimator to maximize prediction score

- You must try to fine tune at least 2 different models: 
    - create one pipeline with a **linear model** of your choice
    - create one pipeline with an **ensemble model** of your choice

Then, 

- Save your two best 5-fold cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`
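
The tuning workflow described above can be sketched with `GridSearchCV` on synthetic data. The estimators (`Ridge`, `RandomForestRegressor`), the parameter grids and the `_toy` variable names are arbitrary choices for illustration, and the synthetic data stands in for the preprocessed Spotify features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Linear pipeline: tune the regularization strength of a Ridge regression.
linear_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("model", Ridge())]),
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
linear_search.fit(X_toy, y_toy)

# Ensemble pipeline: tune the tree depth of a random forest.
ensemble_search = GridSearchCV(
    Pipeline(
        [
            ("scale", StandardScaler()),
            ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
        ]
    ),
    param_grid={"model__max_depth": [3, None]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
ensemble_search.fit(X_toy, y_toy)

# Best mean cross-validated score and the refitted best pipeline for each.
score_linear_toy = float(linear_search.best_score_)
pipe_linear_toy = linear_search.best_estimator_
score_ensemble_toy = float(ensemble_search.best_score_)
pipe_ensemble_toy = ensemble_search.best_estimator_

print(score_linear_toy, score_ensemble_toy)
```

With the default `refit=True`, `best_estimator_` is the whole pipeline retrained on all the data using the best parameters found.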

### Linear

### Ensemble

🧪 **Run the following cells to save your results**

In [None]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')
pipe_linear

In [None]:
# Print below your best pipe for correction purpose
pipe_ensemble

In [None]:
from nbresult import ChallengeResult

ChallengeResult("model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**