## Project Description 


🎯 **The objective is to create a model that predicts the popularity of a song based on its characteristics**

### Details on dataset

The dataset is a list of songs, each described by the following fields:

**acousticness**: confidence measure of whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track (major or minor)  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

## Data Cleaning 

🎯 **Load and clean the data**


In [3]:
import pandas as pd
import numpy as np

In [None]:
url = 'https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv'
df = pd.read_csv(url)
print(f'df shape : {df.shape}')
df.head()

📝 Clean the data: make sure that neither duplicates nor missing values remain in `df`

In [None]:
# Counting duplicates
print(f'duplicates : {df.duplicated().sum()}')

# Counting the number of NaN for each column
df.isnull().sum().sort_values(ascending=False)

In [None]:
# Drop duplicates and NaN from df 
df = df.drop_duplicates().dropna()

# Check that duplicates and NaN were dropped
print(f'duplicates : {df.duplicated().sum()}')
print(f'missing values : {df.isnull().sum().sum()}')

#Check new shape 
print(f'df new shape : {df.shape}')

## Supervised Learning

🎯 **Baseline and evaluation of a basic model**

📝 Scoring metric : Negative RMSE

- penalizes large errors more heavily than small ones  
- measures errors in the same unit as the target `popularity`  
- the greater (closer to 0), the better: metric_good_model > metric_bad_model
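
As a minimal sketch of the penalty on large errors and of the "greater is better" convention, the example below uses made-up numbers (not taken from the dataset). Both prediction sets have the same total absolute error, but the one that concentrates it in a single large miss gets a worse (more negative) score.

```python
import numpy as np

def neg_rmse(y_true, y_pred):
    """Negative root mean squared error: closer to 0 is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative values only
y_true = np.array([50, 50, 50, 50])
pred_spread = np.array([55, 45, 55, 45])  # four errors of 5
pred_single = np.array([50, 50, 50, 70])  # one error of 20

score_spread = neg_rmse(y_true, pred_spread)
score_single = neg_rmse(y_true, pred_single)

print(f"spread errors : {score_spread}")
print(f"single error  : {score_single}")
```

The spread-out errors score -5.0 and the single large error scores -10.0, so the first set of predictions is ranked higher.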

In [None]:
scoring = 'neg_root_mean_squared_error'

📝 Features and target

In [None]:
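# Numeric features only; drop the target so it does not leak into X
X_simple = df.select_dtypes(include=['int64', 'float64']).drop(columns='popularity', errors='ignore')
y = df['popularity']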

📝 Baseline score

In [None]:
# Compute mean squared error
mse = np.mean((y - y.mean())**2)

# Compute the negative RMSE 
baseline_score = -np.sqrt(mse)
print(f'Baseline score is {baseline_score}')

📝 Split data and Evaluate on basic Linear Regression

In [None]:
# Train-Test split
from sklearn.model_selection import train_test_split

X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5, random_state=42)

print(f'X_train_simple shape : {X_train_simple.shape}')
print(f'y_train : {y_train.shape}')

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

# Negative RMSE: square root of the MSE, with the sign flipped
def neg_rmse(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return -rmse

# Create scorer 
score = make_scorer(neg_rmse, greater_is_better=True)

# Instantiate model
model = LinearRegression()

# Train model on train set
model.fit(X_train_simple, y_train)

# Predict on test set
y_pred = model.predict(X_test_simple)

# Evaluate with neg rmse
score_simple_holdout = score(model, X_test_simple, y_test)

print(f'Score on basic Linear Regression : {score_simple_holdout}')

📝 Cross-Validation with 5 folds

In [None]:
from sklearn.model_selection import cross_val_score

scores_result = cross_val_score(model, X_train_simple, y_train, cv=5, scoring=scoring)

score_simple_cv_mean = scores_result.mean()
score_simple_cv_std = scores_result.std()

print(f"Mean score : {score_simple_cv_mean}")
print(f"Standard deviation : {score_simple_cv_std}")

## Feature Engineering

🎯 **Improving performance using the `release_date` feature**

In [None]:
# Convert release_date to datetime
year_col = pd.to_datetime(df['release_date'])

# Extract the year and give the Series its own name
year_col = year_col.dt.year.rename('release_year')

# Join year_col to X_simple into a new DataFrame X_engineered
X_engineered = X_simple.join(year_col)

print(f'X_engineered shape : {X_engineered.shape}')

📝 Retrain a basic Linear Regression to check the impact of the new feature on the performance of the model

In [None]:
# Train model on train set
model.fit(X_engineered, y)

# Cross Validate
scores_result = cross_val_score(model, X_engineered, y, cv=5, scoring=scoring)
score_engineered = scores_result.mean()

print(f'New score : {score_engineered}')

## Unsupervised Learning

📝 Using a KMeans to assign each track to a cluster

In [None]:
from sklearn.cluster import KMeans

# Number of clusters
n = 5

# Instantiate model & fit
km = KMeans(n_clusters=n)
kmeans = km.fit(X_simple)

# Get cluster predictions on X_simple
cluster_pred = kmeans.predict(X_simple)
cluster_pred

# New column of X_engineered with clusters
X_engineered['clusters'] = cluster_pred

# Check
X_engineered.head()

📝 Check the impact of the new clusters feature on the performance of the model

In [None]:
# Re-train model on X_engineered
model.fit(X_engineered, y)

# Cross Validate
scores_result_c = cross_val_score(model, X_engineered, y, cv=5, scoring=scoring)
score_clusters = scores_result_c.mean()

print(f'New score : {score_clusters}')

## Preprocessing

🎯 **Constructing a preprocessing pipeline for the data**

In [None]:
# Display pipelines as diagrams
from sklearn import set_config
set_config(display='diagram')

In [None]:
# Reloading a clean new dataset
X = df.drop(columns='popularity')
y = df['popularity']

In [None]:
# Create new df with only object-type columns
object_columns = df.select_dtypes(include=['object'])

# Print the object-type columns
print(f'Object-type columns : {object_columns.columns}')

# Check their number of unique values
for col in ['id', 'name', 'release_date', 'artist']:
    print(f'Unique values in {col} column : {object_columns[col].nunique()}')

print('No need to One-Hot-Encode those columns')

### Custom Transformer

📝 Creating a custom transformer to extract the year from release_date

In [None]:
from sklearn.preprocessing import FunctionTransformer

def extract_year(x):
    # Expects a pandas Series of date strings (the .dt accessor is Series-only)
    x = pd.to_datetime(x)
    return x.dt.year

transformer_year_2 = FunctionTransformer(extract_year, validate=False)

In [None]:
def release_date(data):
    # The year is the first four characters of the release_date string
    years = data['release_date'].str[:4].astype(int)
    return years.to_frame()

transformer_year = FunctionTransformer(release_date)

📝 Creating a pipeline_year

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline_year = Pipeline([
    ('transformer_year', transformer_year),
    ('scaler', MinMaxScaler(feature_range=(0, 1)))
])

📝 Creating a pipeline_cluster

In [None]:
# Provided : Custom transformer to extract a cluster id for each observation
def process_clusters(clusters):
    return np.argmin(clusters, axis=1).reshape((-1, 1))

transformer_clusters = FunctionTransformer(process_clusters)
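
As a quick sanity check of what `process_clusters` receives, here is a sketch on synthetic points (the data, `n_init` and `random_state` are assumptions for illustration). When `KMeans` is used as a pipeline step, its `transform` method returns each sample's distance to every centroid, so taking `argmin` along each row recovers the nearest cluster, which should agree with `KMeans.predict`.

```python
import numpy as np
from sklearn.cluster import KMeans

def process_clusters(clusters):
    # Index of the smallest distance = id of the nearest centroid
    return np.argmin(clusters, axis=1).reshape((-1, 1))

# Two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0, 0.1, size=(20, 2)),
    rng.normal(5, 0.1, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# transform() gives one distance per (sample, centroid) pair
distances = km.transform(points)
cluster_ids = process_clusters(distances)

print(distances.shape)
print(cluster_ids.shape)
print(np.array_equal(cluster_ids.ravel(), km.predict(points)))
```

The column vector returned by `process_clusters` is the shape `OneHotEncoder` expects for a single categorical feature.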

In [None]:
from sklearn.preprocessing import OneHotEncoder

pipeline_clusters = Pipeline([
    ('kmeans', KMeans(n_clusters=n)),
    ('transformer_cluster', transformer_clusters),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

pipeline_clusters

📝 Creating a pipeline_artist

In [None]:
# Provided : custom transformer class (defined below)

from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Do so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
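
The same idea can be sketched with plain pandas on a toy frame (artist names and values are made up). It uses `Series.map` rather than the `merge` call of the class above: learn each artist's mean popularity on the training rows, map it onto new rows, and fall back to the global training mean for artists never seen during fit.

```python
import pandas as pd

# Toy training data (illustrative values only)
train = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "popularity": [10.0, 30.0, 60.0],
})
# New rows to encode; artist "C" does not appear in train
new = pd.DataFrame({"artist": ["A", "B", "C"]})

# "fit": per-artist mean and global mean, both from the training rows only
artist_mean = train.groupby("artist")["popularity"].mean()
global_mean = train["popularity"].mean()

# "transform": look up each artist, fill unknown artists with the global mean
encoded = new["artist"].map(artist_mean).fillna(global_mean)
print(encoded.tolist())
```

Artist "A" is encoded as 20.0, "B" as 60.0, and the unseen "C" gets the global training mean (100 / 3).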

In [None]:
# Instantiate and fit the class (standalone check; the pipeline below fits its own instance)
artist_popularity = ArtistPopularityTransformer()
artist_popularity = artist_popularity.fit(X,y)

# Make pipeline
pipeline_artist = Pipeline([
    ('transformer', ArtistPopularityTransformer()),
    ('scaler', MinMaxScaler(feature_range=(0, 1)))
])

### Preprocessing Pipeline with all Column Transformers

In [None]:
from sklearn.compose import ColumnTransformer

# Select only numeric features
numeric_features = ['acousticness','danceability',
 'duration_ms',
 'energy',
 'explicit',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'valence']

# Create the preprocessor: cluster ids, scaled numeric features, release year, artist popularity
preprocessor = ColumnTransformer([
    ('clusters', pipeline_clusters, numeric_features),
    ('scaler_num', MinMaxScaler(), numeric_features),
    ('release_date', pipeline_year, ['release_date']),
    ('artist', pipeline_artist, ['artist'])], remainder='drop')

preprocessor

📝 Use pipeline to transform X to X_transformed

In [None]:
# Fit Preprocessor 
preprocessor.fit(X,y)

# Transform X
X_transformed = preprocessor.transform(X)

## Model Selection

🎯 **Select the model that yields the best performance**

### Linear Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Construct Pipeline with preprocessor + model
pipe_linear = make_pipeline(preprocessor, Ridge())

# Cross-validate Pipeline
score_linear = cross_val_score(pipe_linear, X, y, cv=5, scoring=scoring).mean()

print(f'Ridge model score : {score_linear}')

### Ensemble Methods

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
model = RandomForestRegressor(max_depth=5,min_samples_leaf=5)

# Construct Pipeline with preprocessor + model
pipe_ensemble = make_pipeline(preprocessor, model)

# Cross-validate 
score_ensemble = cross_val_score(pipe_ensemble, X, y, cv=5, scoring=scoring).mean()

print(f'RandomForestRegressor model score : {score_ensemble}')

In [None]:
# Trying another model

from sklearn.ensemble import GradientBoostingRegressor

# Instantiate model
model_gbr = GradientBoostingRegressor(
    n_estimators=100, 
    learning_rate=0.1,
    max_depth=3
)

# Construct Pipeline with preprocessor + model
pipe_gb = make_pipeline(preprocessor, model_gbr)

# Cross-validate (separate name so the `score` scorer defined earlier is not overwritten)
scores_gb = cross_val_score(pipe_gb, X, y, cv=5, scoring=scoring)

print(f'GBR model score : {scores_gb.mean()}')

## Fine-Tuning 

🎯 **Fine-tuning the best model to achieve the highest possible score**

📝 Cross-validating grid search 

In [None]:
from sklearn.model_selection import GridSearchCV

# Create a dictionary with the hyperparameters to search
# make_pipeline names each step after its lowercased class name
grid = {
    'randomforestregressor__n_estimators': [10, 50, 100],
    'randomforestregressor__max_depth': [5, 10, 15]
}

# Create the cross-validated grid search
search = GridSearchCV(pipe_ensemble, grid, scoring=scoring, cv=5)

# Fit GridSearchCV
search.fit(X,y)
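
Parameter names in a grid follow the `<step name>__<parameter>` convention, and `make_pipeline` names each step after its lowercased class name. The sketch below shows one way to check which names are available; the `StandardScaler` step is only a stand-in for the real preprocessor, chosen to keep the example short.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: StandardScaler replaces the real preprocessor here
pipe = make_pipeline(StandardScaler(), RandomForestRegressor())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Every tunable parameter is exposed as '<step name>__<parameter>'
params = pipe.get_params()
print("randomforestregressor__n_estimators" in params)
print("model__n_estimators" in params)
```

A key such as `model__n_estimators` is only valid when the pipeline is built with an explicit step called `model`, for example `Pipeline([('preprocessor', ...), ('model', ...)])`.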

In [None]:
# Print the best score and best parameters
print(f'Best score: {search.best_score_}')
print(f'Best parameters: {search.best_params_}')

## Recommendations and Continuous Improvement

🎯 **Transform a regression task into a classification task**

📝 Creating a new target y_cat

In [None]:
# Calculate the median of y
median = y.median()

# Create y_cat using the median of y
y_cat = y.apply(lambda x: 1 if x >= median else 0)
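
On a toy series (values made up), the effect of the median threshold can be checked by hand. The sketch uses a vectorized comparison, which should be equivalent to the `apply` call, and shows that ties at the median all go to class 1 because of `>=`, so the two classes are not necessarily balanced.

```python
import pandas as pd

# Illustrative popularity values, with a tie at the median
y_toy = pd.Series([0, 10, 20, 20, 40, 90])

median_toy = y_toy.median()

# 1 if popularity >= median, else 0
y_cat_toy = (y_toy >= median_toy).astype(int)

print(median_toy)
print(y_cat_toy.tolist())
```

Here the median is 20.0 and four of the six songs land in class 1.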

📝 Cross validating a classification with accuracy metric

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the pipeline with preprocessor and model
pipe_cat = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

# Cross validate the pipeline with 5 folds
scores = cross_val_score(pipe_cat, X, y_cat, cv=5, scoring='accuracy')

# Calculate the mean of the scores
score_cat = scores.mean()

print(f'Accuracy score for LogReg : {score_cat}')