## Project Description 


🎯 **The objective is to create a model that predicts the popularity of a song based on its characteristics**

The dataset contains a list of songs, each described by the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of the track (major or minor)  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

## Data Cleaning 

🎯 **Load and clean the data**


In [3]:
import pandas as pd
import numpy as np

In [None]:
url = 'https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv'
df = pd.read_csv(url)
print(f'df shape : {df.shape}')
df.head()

📝 Clean the data and make sure that neither duplicates nor missing values remain in `df`

In [None]:
# Counting duplicates
print(f'duplicates : {df.duplicated().sum()}')

# Counting the number of NaN for each column
df.isnull().sum().sort_values(ascending=False)

In [None]:
# Drop duplicates and NaN from df 
df = df.drop_duplicates().dropna()

# Check that duplicates and NaN were dropped
print(f'duplicates : {df.duplicated().sum()}')
print(f'missing values : {df.isnull().sum().sum()}')

# Check new shape
print(f'df new shape : {df.shape}')
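
To make the effect of the cleaning step concrete, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` data and its values are hypothetical and only stand in for the real dataset; the cleaning call is the same `drop_duplicates().dropna()` chain used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 duplicates row 0, row 2 has a missing value
toy = pd.DataFrame({
    'name': ['song_a', 'song_a', 'song_b', 'song_c'],
    'tempo': [120.0, 120.0, np.nan, 95.5],
    'popularity': [40, 40, 55, 10],
})

# Same cleaning chain as in the notebook
toy_clean = toy.drop_duplicates().dropna()

# Both counters should be 0 after cleaning
n_duplicates = int(toy_clean.duplicated().sum())
n_missing = int(toy_clean.isnull().sum().sum())
print(f'duplicates : {n_duplicates}, missing values : {n_missing}')
print(f'shape : {toy.shape} -> {toy_clean.shape}')
```

Note that `isnull().sum()` returns one count per column, so a second `.sum()` is needed to get a single total.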

## Supervised Learning

🎯 **Baseline and evaluation of a basic model**

📝 Scoring metric : Negative RMSE

- penalizes large errors more heavily than small ones
- measures errors in the same unit as the target `popularity`
- the greater, the better: a good model scores higher (closer to 0) than a bad one
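
The properties listed above can be checked on a small worked example. This is only a sketch: `neg_rmse` is a hypothetical helper written for illustration, and the popularity values are made up. Both sets of predictions have the same total absolute error (20 points), but the one that concentrates it in a single large error gets the worse score.

```python
import numpy as np

def neg_rmse(y_true, y_pred):
    """Negative root mean squared error: values closer to 0 are better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([50, 50, 50, 50])      # hypothetical popularity values
many_small = np.array([55, 45, 55, 45])  # four errors of 5 points each
one_large = np.array([50, 50, 50, 70])   # a single error of 20 points

score_small = neg_rmse(y_true, many_small)  # -5.0
score_large = neg_rmse(y_true, one_large)   # -10.0
print(score_small, score_large)
```

The scores are expressed in popularity points, and the higher score (`-5.0`) belongs to the better set of predictions.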

In [None]:
scoring = 'neg_root_mean_squared_error'

📝 Features and target

In [None]:
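# Numerical features only; drop the target so it does not leak into the features
X_simple = df.select_dtypes(include=['int64', 'float64']).drop(columns=['popularity'], errors='ignore')
y = df['popularity']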

📝 Baseline score

In [None]:
# Baseline: always predict the mean popularity and compute its mean squared error
mse = np.mean((y - y.mean())**2)

# Compute the negative RMSE 
baseline_score = -np.sqrt(mse)
print(f'Baseline score is {baseline_score}')
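
The baseline above is computed in-sample. One way to get a comparable cross-validated figure is scikit-learn's `DummyRegressor`, which always predicts the training mean, scored with the same `neg_root_mean_squared_error` string. The sketch below assumes scikit-learn is installed and uses randomly generated stand-in data (`X_toy`, `y_toy` are hypothetical); in the notebook, `X_simple` and `y` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; in the notebook, use X_simple and y instead
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.integers(0, 100, size=200)

# DummyRegressor(strategy='mean') ignores the features and predicts the training mean
dummy = DummyRegressor(strategy='mean')
cv_scores = cross_val_score(dummy, X_toy, y_toy, cv=5,
                            scoring='neg_root_mean_squared_error')
cv_baseline = cv_scores.mean()

# Same computation as the in-sample baseline, applied to the toy target
in_sample_baseline = -np.sqrt(np.mean((y_toy - y_toy.mean()) ** 2))
print(f'cross-validated baseline : {cv_baseline:.2f}')
print(f'in-sample baseline       : {in_sample_baseline:.2f}')
```

The two figures should be close. Any model evaluated with `cross_val_score` and the same `scoring` can then be compared directly against the cross-validated baseline.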