<h1><center>Machine Learning Project</center></h1>

<b>Python</b> notebook containing the code used for the final report.<br>
Authors: Juan AYALA, Jeong Hwan KO, Alice LALOUE, Aldo MELLADO AGUILAR.<br>
4A MA - Groups A and B<br>
2020 - 2021


<p><a href="https://github.com/jayalabanda/projet-ML">GitHub</a> link</p>

## Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

from functions import *

%matplotlib inline
sns.set(style="darkgrid")

PROJECT_ROOT_DIR = "."
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")
os.makedirs(IMAGES_PATH, exist_ok=True)

# Loading the data

In [None]:
spotify_data = pd.read_csv("data/spotify-extr.txt", sep=" ")

# Description of the full dataset

In [None]:
spotify_data.head()

In [None]:
spotify_data.info()

There are no missing values, so no imputation is needed.
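
As a minimal sketch of how the absence of missing values can be checked explicitly, the snippet below counts `NaN` entries per column. The frame `toy` and its values are hypothetical stand-ins for `spotify_data`; on a frame without missing values, every count is 0.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for spotify_data (hypothetical values)
toy = pd.DataFrame({
    "tempo": [120.0, 98.5, np.nan],
    "energy": [0.8, 0.4, 0.6],
})

# Number of missing values per column; all zeros means nothing to impute
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single boolean answer for the whole frame
has_missing = bool(toy.isna().any().any())
print("Any missing value:", has_missing)
```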

The explanatory variables are:
* `valence`: positivity of the song, close to 1 if the song is very cheerful, close to 0 otherwise;
* `year`: release year;
* `acousticness`: measures how acoustic the song is;
* `danceability`: measures how danceable the song is;
* `duration`: length of the song in milliseconds;
* `energy`: energy of the song, close to 1 if the song is very energetic, close to 0 otherwise;
* `instrumentalness`: degree of instrumentalness, close to 1 if the song contains no vocals at all, close to 0 otherwise;
* `key`: musical key of the song (e.g. A = la), without the major/minor distinction;
* `liveness`: live-performance rate, close to 1 if the song contains only music (no non-musical sounds), close to 0 otherwise;
* `loudness`: sound intensity of the song;
* `mode`: binary variable indicating whether the song starts with a major chord progression (1) or not (0);
* `speechiness`: amount of speech in the song, close to 1 if the song contains voice throughout, close to 0 otherwise;
* `tempo`: tempo of the song in beats per minute (bpm).

Our goal is to predict `popularity` and `pop.class`, i.e. the popularity of a song, either as an integer between 0 and 100 or as a class $A$, $B$, $C$ or $D$.

In [None]:
spotify_data.describe()

In our dataset, the qualitative variables are:
* `pop.class`,
* `key`,
* `mode`.

The remaining variables are quantitative.

In [None]:
print("Values of 'pop.class':", sorted(set(spotify_data["pop.class"].values)), "\n")
print("Values of 'key':", sorted(set(spotify_data["key"].values)), "\n")
print("Values of 'mode':", set(spotify_data["mode"].values))

We convert the qualitative variables to the categorical dtype to make them easier to handle.

In [None]:
spotify_data["key"] = pd.Categorical(spotify_data["key"], ordered=False)
spotify_data["mode"] = pd.Categorical(spotify_data["mode"], ordered=False)
spotify_data["pop.class"] = pd.Categorical(spotify_data["pop.class"],
                                           ordered=True)

In [None]:
spotify_data.dtypes

# Univariate and multivariate analyses

## Qualitative variables

We start by analysing the qualitative variables `pop.class`, `key` and `mode`.

In [None]:
data_qual = spotify_data[["pop.class", "mode", "key"]]
data_qual.head()

<b>Popularity class</b> (target variable)

This variable was created before we obtained the data.

In [None]:
# Multiply by 100 so the values match the "%" axis label
pop_class_count = data_qual["pop.class"].value_counts(normalize=True) * 100

sns.barplot(x=pop_class_count.index, y=pop_class_count.values)
# plt.title("Frequency of popularity classes", fontsize=14)
plt.ylabel("% of occurrences")
plt.xlabel("Class")
save_fig("pop_class_frequencies")
plt.show()

The songs are spread fairly evenly across the classes, except for class `A`, which contains fewer than 10% of the songs. This imbalance may make class `A` harder to predict later on.
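
One common way to limit the effect of a rare class is to stratify the train/test split so that each subset keeps the same class proportions. The sketch below illustrates this on synthetic labels; `X_toy`, `y_toy` and the class sizes are assumptions made for the example, and this is not necessarily the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical labels: class "A" is rare (10%), the others share the rest
y_toy = np.array(["A"] * 100 + ["B"] * 300 + ["C"] * 300 + ["D"] * 300)
X_toy = rng.normal(size=(y_toy.size, 3))

# stratify=y_toy keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

share_a_train = np.mean(y_tr == "A")
share_a_test = np.mean(y_te == "A")
print(f"Share of A: train={share_a_train:.3f}, test={share_a_test:.3f}")
```

Without `stratify`, the share of class `A` in the test set would fluctuate from one random split to another, which makes per-class metrics harder to compare.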

<b>Key</b>

In [None]:
fig, ax = plt.subplots()
key_count = spotify_data['key'].value_counts(
    normalize=True, sort=True, ascending=True) * 100
# Take the labels from the sorted series itself so each label matches its bar
y_ticks = key_count.index.astype(str)

sns.barplot(x=key_count.values, y=y_ticks, orient='h', ax=ax)
plt.xlabel("% of occurrences")
plt.ylabel('Key')
ax.set_xticks(ticks=range(0, 16, 1))
ax.tick_params(axis='y', labelsize=12)

rects = ax.patches
for rect in rects:
    x_value = rect.get_width()
    y_value = rect.get_y() + rect.get_height() / 2
    label = f'{x_value:.1f}%'

    plt.annotate(label, (x_value, y_value),
                 xytext=(5, 0), textcoords="offset points",
                 va='center', ha='left')

#plt.title("Distribution of 'key'", fontsize=14)
save_fig('keys_frequencies')
plt.show()

In [None]:
sns.barplot(x='key', y='popularity', data=spotify_data)
#plt.title("Popularity by key", fontsize=14)
plt.ylabel("Mean popularity")
plt.xlabel("Key")
save_fig("popularity_by_key")
plt.show()

The variance of popularity within each value of `key` is small, so we do not need to transform these data.
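
A per-group summary makes this kind of statement easy to check numerically. The sketch below computes the mean and standard deviation of popularity for each key on a tiny hypothetical frame (`toy_keys` is not the real data).

```python
import pandas as pd

# Hypothetical data: two keys, three songs each
toy_keys = pd.DataFrame({
    "key": [0, 0, 0, 1, 1, 1],
    "popularity": [30, 32, 34, 28, 33, 35],
})

# Mean and sample standard deviation of popularity within each key
spread = toy_keys.groupby("key")["popularity"].agg(["mean", "std"])
print(spread)
```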

In [None]:
sns.boxplot(x='key', y='popularity', data=spotify_data)
#plt.title("Popularity by key", fontsize=14)
plt.ylabel("Popularity")
plt.xlabel("Key")
save_fig("boxplot_of_popularity_by_key")
plt.show()

Likewise, the distribution of popularity is fairly similar across keys: the boxes have similar sizes and the medians are at about the same level.

<b>Mode</b>

In [None]:
# Multiply by 100 so the values match the "%" axis label
mode_count = spotify_data["mode"].value_counts(normalize=True) * 100

sns.barplot(x=mode_count.index, y=mode_count.values)
#plt.title("Frequency of modes", fontsize=14)
plt.ylabel("% of occurrences", fontsize=13)
plt.xlabel("Mode", fontsize=13)
save_fig("mode_frequencies")
plt.show()

The distribution of the variable `mode` is unbalanced: 30% of the songs have `mode` = 0 and 70% have `mode` = 1.

In [None]:
sns.barplot(x='mode', y='popularity', data=spotify_data)
#plt.title("Mean popularity by mode", fontsize=14)
plt.ylabel("Mean popularity", fontsize=13)
plt.xlabel("Mode", fontsize=13)
save_fig("popularity_by_mode")
plt.show()

However, popularity is similar for both modes.

In [None]:
sns.boxplot(x='mode', y='popularity', data=spotify_data)
#plt.title("Popularity by mode", fontsize=14)
plt.ylabel("Popularity")
plt.xlabel("Mode")
save_fig("boxplot_of_popularity_by_mode")
plt.show()

We combine all the qualitative variables in a single bar plot:

In [None]:
plt.figure(figsize=(12, 5))
sns.barplot(x='mode', y='popularity', hue='key', data=spotify_data)
#plt.title("Popularity by key and mode", fontsize=14)
plt.ylabel("Mean popularity")
save_fig("popularity_by_key_and_mode")
plt.show()

In [None]:
plt.figure(figsize=(12, 5))
sns.boxplot(x='mode', y='popularity', hue='key', data=spotify_data)
#plt.title("Popularity by key and mode", fontsize=14)
plt.ylabel("Popularity")
save_fig("boxplot_popularity_by_key_and_mode")
plt.show()

## Quantitative variables

We start by visualising the correlations between the quantitative variables:

In [None]:
data_quant = spotify_data[spotify_data.columns.difference(
    ['key', 'mode', 'pop.class'], sort=False)]
data_quant.keys()

In [None]:
corr_matrix = data_quant.corr()
cmap = sns.diverging_palette(240, 10, as_cmap=True)

plt.figure(figsize=(8, 6))
hm = sns.heatmap(corr_matrix, cmap=cmap, center=0, linewidths=1, linecolor='gray')
hm.set_yticklabels(hm.get_yticklabels(), fontsize=11)
hm.set_xticklabels(hm.get_xticklabels(), rotation=45, fontsize=11, ha='right')
#plt.title("Correlation matrix")
save_fig("correlation_square_matrix")
plt.show()

This plot shows that some variables are strongly correlated. For example, there is a strong negative correlation between `energy` and `acousticness`. This makes sense, since acoustic songs tend to be calmer (less energetic) than non-acoustic ones. Similarly, `energy` and `loudness` are positively correlated, which is expected since loud songs often have more energy.
<br>
We also see that the more acoustic a song is, the less popular it tends to be, since `acousticness` and `popularity` have a strong negative correlation.

In [None]:
series = np.abs(corr_matrix['popularity']).sort_values(ascending=False)
print("The variables most correlated with 'popularity' are: ")
for i, row in enumerate(series):
    if 0.2 <= row < 1:
        print(f'{series.index[i]:17} --> {row: .2f} (abs)')

Here are the distributions of the quantitative variables, with a boxplot above each histogram:

In [None]:
plt.style.use('seaborn-poster')

fig = plt.figure(figsize=(22, 28))
outer = fig.add_gridspec(6, 2, wspace=0.1, hspace=0.5, left=0.03,
                         right=0.98, bottom=0.03, top=0.98)

a = 0
for i in range(6):
    for j in range(2):
        feature = data_quant.columns[a]
        inner = outer[i, j].subgridspec(2, 1, wspace=0.2, hspace=0,
                                        height_ratios=[0.15, 0.85])
        axs = inner.subplots(sharex=True)

        sns.boxplot(data=data_quant, x=feature, orient='h', ax=axs[0])
        sns.histplot(data=data_quant, x=feature, bins=50 if a != 1 else 100,
                     ax=axs[1], kde=True)

        axs[0].spines['top'].set_color('black')
        axs[0].spines['right'].set_color('black')
        axs[0].spines['left'].set_color('black')

        axs[1].set_title("Distribution of '" + feature + "'", y=1.2, fontsize=14)
        axs[1].spines['bottom'].set_color('black')
        axs[1].spines['right'].set_color('black')
        axs[1].spines['left'].set_color('black')

        a += 1

    #fig.suptitle('Distribution of the quantitative variables', y=1.01, fontsize=20)
save_fig('hist_boxplot_of_data', tight_layout=False)
plt.show()

Here is a more detailed study of each quantitative variable:

<b>Acousticness</b>

In [None]:
ax_data = spotify_data.groupby('acousticness')['popularity'].mean().to_frame().reset_index()

plt.figure(figsize=(10, 5))
sc = sns.scatterplot(x=ax_data['acousticness'], y=ax_data['popularity'], color='blue')
sc.tick_params(labelsize=12)
#plt.title("Acousticness")
plt.ylabel('Mean popularity')
save_fig('mean_popularity_by_acousticness')
plt.show()

<b>Danceability</b>

In [None]:
ax_data = spotify_data.groupby('danceability')['popularity'].mean().to_frame().reset_index()

plt.figure(figsize=(10, 5))
sc = sns.scatterplot(x='danceability', y='popularity', data=ax_data, color='blue')
sc.tick_params(labelsize=12)
#plt.title('Danceability')
plt.ylabel('Mean popularity')
save_fig('mean_popularity_by_danceability')
plt.show()

<b>Duration</b>

We convert the song durations to minutes to make them easier to interpret.

In [None]:
spotify_data['duration'] = spotify_data['duration'] / 60000
spotify_data['duration'].describe()

In [None]:
plt.figure(figsize=(12, 6))
hp = sns.histplot(spotify_data['duration'], bins=60, kde=False)
hp.tick_params(labelsize=12)
plt.xlabel('duration (mins)')
plt.show()

The longest song in the dataset lasts 45 minutes, so we split the songs into long and short ones at a threshold of 7 minutes to see the durations more clearly.

In [None]:
long_songs = spotify_data.loc[spotify_data['duration'] > 7]
short_songs = spotify_data.loc[spotify_data['duration'] <= 7]

In [None]:
plt.figure(figsize=(10, 5))
hp = sns.histplot(short_songs['duration'], kde=False, bins=80)
hp.tick_params(labelsize=12)
#plt.title(f'Short songs (<=7 mins): {short_songs.shape[0]} songs')
plt.xlabel('duration (mins)')
save_fig('hist_of_short_songs')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
hp = sns.histplot(long_songs['duration'], kde=False, bins=50)
hp.tick_params(labelsize=12)
#plt.title(f'Long songs (>7 mins): {long_songs.shape[0]} songs')
plt.xlabel('duration (mins)')
save_fig('hist_of_long_songs')
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1 = sns.histplot(short_songs['duration'], kde=False, bins=60, ax=ax1)
ax1.set_xlabel('')
ax1.set_xticks(range(0, 8, 1))
ax1.tick_params(labelsize=12)

ax2 = sns.histplot(long_songs['duration'], kde=False, bins=50, ax=ax2)
ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.set_xticks(range(10, 46, 5))
ax2.tick_params(labelsize=12)

plt.text(0.5, -20, 'duration (mins)', ha='center', fontsize=16)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12, 6))

ax1_data = short_songs.groupby('duration')['popularity'].mean().to_frame().reset_index()
ax1 = sns.scatterplot(x='duration', y='popularity', data=ax1_data, color='blue', ax=ax1)
ax1.set_xlim(-0.2, 7.2)
ax1.set_ylabel('Mean popularity')
ax1.set_xlabel('')
# ax1.set_title('Short songs')

ax2_data = long_songs.groupby('duration')['popularity'].mean().to_frame().reset_index()
ax2 = sns.scatterplot(x=ax2_data['duration'], y=ax2_data['popularity'], color='orange', ax=ax2)
ax2.set_xticks(range(7, 46, 4))
ax2.set_xlabel('')
# ax2.set_title('Long songs')

plt.suptitle('duration (mins)', x=0.53, y=-0.01, ha='center', fontsize=16)
save_fig('popularity_by_long_short_songs')
plt.show()

<b>Energy</b>

In [None]:
ax_data = spotify_data.groupby('energy')['popularity'].mean().to_frame().reset_index()

plt.figure(figsize=(10, 5))
sc = sns.scatterplot(x='energy', y='popularity', data=ax_data, color='blue')
sc.tick_params(labelsize=12)
#plt.title('Energy')
plt.ylabel('Mean popularity')
save_fig('mean_popularity_by_energy')
plt.show()

<b>Instrumentalness</b>

In [None]:
spotify_data['instrumentalness'].describe()

In [None]:
spotify_data.loc[spotify_data['instrumentalness'] == 0].shape

The variable `instrumentalness` is very unevenly distributed: almost 30% of the songs have an instrumentalness of 0.
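
The share of exact zeros can be computed directly as the mean of a boolean mask. The sketch below uses a short hypothetical series in place of `spotify_data['instrumentalness']`.

```python
import pandas as pd

# Hypothetical values standing in for spotify_data['instrumentalness']
toy_instr = pd.Series([0.0, 0.0, 0.0, 0.12, 0.9, 0.0005, 0.45, 0.0, 0.03, 0.7])

# The mean of a boolean mask is the fraction of rows where the condition holds
share_zero = (toy_instr == 0).mean()
print(f"{share_zero:.0%} of the songs have instrumentalness == 0")
```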

In [None]:
plt.figure(figsize=(15, 5))
vp = sns.violinplot(x="instrumentalness", data=spotify_data)
vp.tick_params(labelsize=13)
plt.show()

In [None]:
ax_data = spotify_data.groupby('instrumentalness')['popularity'].mean().to_frame().reset_index()

plt.figure(figsize=(10, 5))
sc = sns.scatterplot(x='instrumentalness', y='popularity', data=ax_data, color='blue')
sc.tick_params(labelsize=12)
#plt.title('Instrumentalness')
plt.ylabel('Mean popularity')
save_fig('mean_popularity_by_instrumentalness')
plt.show()

<b>Liveness</b>

In [None]:
ax_data = spotify_data.groupby('liveness')['popularity'].mean().to_frame().reset_index()

plt.figure(figsize=(10, 5))
sc = sns.scatterplot(x='liveness', y='popularity', data=ax_data, color='blue')
sc.tick_params(labelsize=12)
#plt.title('liveness')
plt.ylabel('Mean popularity')
save_fig('mean_popularity_by_liveness')
plt.show()

<b>Popularity</b> (target variable)

In [None]:
spotify_data.loc[spotify_data["popularity"] == 0].shape

In [None]:
spotify_data["popularity"].describe()

A large number of songs have a popularity of 0. These songs were released shortly before the data were extracted, so their popularity had not yet been determined.

Moreover, half of the songs have a popularity between 11 and 48. This will also cause problems during training.
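
The interval containing the central half of the songs is bounded by the first and third quartiles, which `describe()` reports as `25%` and `75%`. As a small sketch on hypothetical values (`toy_pop` is not the real column):

```python
import pandas as pd

# Hypothetical popularity values: every integer from 0 to 100 once
toy_pop = pd.Series(range(0, 101))

# First and third quartiles, as reported by describe() under 25% and 75%
q1, q3 = toy_pop.quantile([0.25, 0.75])
iqr = q3 - q1

# Fraction of values lying between the two quartiles (bounds included)
inside = toy_pop.between(q1, q3).mean()
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, share inside={inside:.3f}")
```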

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 11))
ax1 = sns.histplot(spotify_data['popularity'], ax=ax1, bins=50)
ax2 = sns.histplot(spotify_data.loc[spotify_data['popularity'] > 0, 'popularity'],
                   ax=ax2, bins=50)
ax1.set_xlabel('')

plt.suptitle('Top: all data\nBottom: popularity > 0', fontsize=16)
save_fig('hist_of_popularity')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
ax = spotify_data.groupby('year')['popularity'].mean().plot()
#ax.set_title('Mean popularity over the years')
ax.set_ylabel('Mean popularity', fontsize=13)
ax.set_xlabel('Year')
ax.tick_params(labelsize=12)
ax.set_xticks(range(1920, 2021, 5))
save_fig('mean_popularity_by_year')
plt.show()

<b>Tempo</b>

In [None]:
sns.jointplot(x='tempo', y='popularity', data=spotify_data, height=8)
save_fig('jointplot_of_tempo_popularity')
plt.show()

In [None]:
spotify_data.loc[spotify_data['tempo'] == 0].shape

There are 13 songs for which `tempo` is 0, which is not possible.

In [None]:
corrected_tempo = spotify_data.loc[spotify_data['tempo'] > 0, 'tempo']
corrected_tempo.describe()

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
ax = sns.histplot(spotify_data['tempo'], bins=200, kde=False)
ax.set_ylabel('Frequency', fontsize=12)

ax.text(s='13\nOutliers', x=5, y=40, fontdict={'size': 12, 'c': 'darkred'})
ax.text(s='Values without 0', x=125, y=160,
        fontdict={'size': 12, 'c': 'darkred'})
ax.text(s='Corrected\nmedian\n114.55', x=116, y=40,
        fontdict={'size': 10, 'c': 'darkgreen', 'weight': 'bold'})

ax.axvline(x=114.55, ymin=0, ymax=0.7, color='green',
           linestyle='dashed', linewidth=2)
ax.axvline(x=35.37, ymin=0, ymax=1, color='orange',
           linestyle='dashed', linewidth=3)
ax.axvline(x=214.42, ymin=0, ymax=1, color='orange',
           linestyle='dashed', linewidth=3)

ax.annotate("", xy=(35.37, 150), xytext=(214.42, 150),
            arrowprops=dict(arrowstyle="<->",
                            color='r',
                            linestyle='dashed',
                            linewidth=2))
ax.annotate("", xy=(0, 30), xytext=(0, 50),
            arrowprops=dict(arrowstyle="->",
                            color='r',
                            linestyle='dashed',
                            linewidth=3))

save_fig('distribution_of_tempo')
plt.show()

We replace the values where `tempo` = 0 with the median of the non-zero values in the column.

In [None]:
median = spotify_data.loc[spotify_data["tempo"] > 0, "tempo"].median()
print(median)

spotify_data.replace(to_replace={"tempo": 0}, value=median, inplace=True)
spotify_data["tempo"].describe()

<b>Year</b>

In [None]:
sns.jointplot(x='year', y='popularity', data=spotify_data, height=10)
#plt.suptitle("Joint plot of popularity by release year", y=1.02)
save_fig("jointplot_of_popularity_by_year")
plt.show()

<b>Variable preprocessing:</b>

We standardise the variable `danceability`, since its distribution resembles a Gaussian:

In [None]:
spotify_data["dance_norm"] = (spotify_data["danceability"] - spotify_data["danceability"].mean())\
    / spotify_data["danceability"].std()

plt.figure(figsize=(12, 6))
hp = sns.histplot(spotify_data["dance_norm"], bins=50, kde=True)
hp.tick_params(labelsize=12)
#plt.title("Standardised 'danceability' variable")
save_fig("scaled_danceability")
plt.show()

Then we delete the original column:

In [None]:
del spotify_data["danceability"]
spotify_data.head()

In [None]:
spotify_data.keys()

The next two cells take quite a long time to run.

In [None]:
# for i in ['key', 'mode']:
#     sns.pairplot(spotify_data, hue=i)
#     t = 'pairplot_of_data_by_' + i
#     save_fig(t)
#     plt.show()

In [None]:
# sns.pairplot(spotify_data)
# plt.suptitle("Pair plot of the data", fontsize=20, y=1.02)
# save_fig("pairplot_of_dataset")
# plt.show()

# PCA

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

attributs = [
    feature for feature in spotify_data.keys()
    if feature not in data_qual.keys()
    and feature != 'popularity'
]
print(attributs)

In [None]:
X_new = spotify_data[attributs]
X_scaled = scale(X_new)
pca = PCA(random_state=42)
spotify_pca = pca.fit_transform(X_scaled)

In [None]:
x = np.arange(pca.explained_variance_.size)
cumsum = np.cumsum(pca.explained_variance_ratio_)
var_ratio = pca.explained_variance_ratio_
labels = ['Dim ' + str(i + 1) for i in x]

In [None]:
x+1

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 10), sharex=True)

ax[0].bar(x, var_ratio)
ax[0].plot(var_ratio, color='black')
ax[0].set_xticks(x)
ax[0].set_xticklabels(labels, fontsize=12)
ax[0].set_ylabel("Share of explained variance", fontsize=16)
#ax[0].set_title("Share of explained variance", fontsize=15)

for p in ax[0].patches:
    # Format the bar height as a percentage with one decimal
    text = f'{p.get_height() * 100:.1f}%'
    ax[0].annotate(text=text,
                   xy=(p.get_x() + p.get_width() / 2., p.get_height() + 0.01),
                   fontsize='large', ha='center', va='center')

ax[1].bar(x, cumsum, width=.7)
ax[1].plot(x, cumsum)
ax[1].set_ylabel("Cumulative explained variance", fontsize=16)
ax[1].set_xticklabels(labels, fontsize=12)
#ax[1].set_title("Cumulative sum of the explained variance shares", fontsize=15)

for p in ax[1].patches:
    # Format the bar height as a percentage with one decimal
    text = f'{p.get_height() * 100:.1f}%'
    ax[1].annotate(text=text,
                   xy=(p.get_x() + p.get_width() / 2., p.get_height() + 0.01),
                   fontsize='large', ha='center', va='center')

fig.text(0.5, -0.03, "Principal components", ha='center', fontsize=20)
#plt.suptitle("Variance analysis of the principal components", fontsize=22)
save_fig("explained_var_ratio_and_cumulative")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
plt.boxplot(spotify_pca)
plt.axhline(color='grey', linewidth=1, linestyle='--')
#plt.title("Boxplot of the PCA variables")
save_fig("boxplot_of_variances")
plt.show()

1. Variable selection:
We keep the first 6 principal components.
Explained variance: the first 6 PCs explain 80% of the variance.
There is an elbow in the explained-variance plot at the 6th PC.
Boxplots: the spread of the boxplots is fairly stable from the 5th or 6th PC onwards, and the medians become nearly identical.
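
The 80% rule mentioned above can be turned into a small computation: take the cumulative sum of the explained-variance ratios and find the first component at which it reaches the threshold. The ratios in `toy_var_ratio` are hypothetical and only illustrate the mechanics; they are not the values obtained in this notebook.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
toy_var_ratio = np.array([0.30, 0.15, 0.12, 0.10, 0.08,
                          0.07, 0.06, 0.05, 0.04, 0.03])
toy_cumsum = np.cumsum(toy_var_ratio)

threshold = 0.80
# argmax returns the index of the first True; add 1 to get a component count
n_components = int(np.argmax(toy_cumsum >= threshold)) + 1
print(f"{n_components} components reach {threshold:.0%} of the variance")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance exceeds that fraction.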

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x=spotify_pca[:, 0], y=spotify_pca[:, 1],
                hue='pop.class', data=spotify_data, alpha=.7)
plt.legend(title='Popularity class',
           title_fontsize=13, fontsize=12)
plt.axvline(color="grey", linewidth=1)
plt.axhline(color="grey", linewidth=1)
plt.xlabel('Dim 1')
plt.ylabel('Dim 2')
#plt.title("Scatter plot of the PCA individuals")
save_fig("scatterplot_of_individuals")
plt.show()

2. Scatter plot of the individuals:
We observe 2 distinct groups: a large one and a smaller one.

In [None]:
plot_corr_circle(X_new, pca, 1, 2)
save_fig("pca_components_1_2")
plt.show()

In [None]:
plot_corr_circle(X_new, pca, 1, 3)
save_fig("pca_components_1_3")
plt.show()

3. Correlation circle (dim 1 and dim 2):

The variables are represented by the arrows.

Speechiness: entirely explained by dimension 2.
Log_duration and speechiness are very close to the vertical axis: these variables are mostly explained by dimension 2.
Instrumentalness, acousticness and loudness: mainly explained by dimension 1.

Acousticness and loudness: the arrows lie on the same axis and the variables are negatively correlated. This agrees with the correlation plot.

Axis 2: "splits" the arrows into 2 groups?
Right-hand side of the plot: among the positive values, we find the calmer / acoustic / instrumental songs.
Left-hand side of the plot: among the negative values, we find the louder, more danceable songs.
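
To make the reading of the arrows concrete, the sketch below computes arrow coordinates as eigenvector entries multiplied by the square root of the explained variance, and compares them with the correlation between each variable and each component. It runs on synthetic data: the three features, the names `toy_scaled` and `toy_pca`, and the sample size are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(42)

# Synthetic data: the second feature is strongly (negatively) tied to the first
n = 2000
a = rng.normal(size=n)
b = -0.9 * a + 0.3 * rng.normal(size=n)
c = rng.normal(size=n)
toy_scaled = scale(np.column_stack([a, b, c]))

toy_pca = PCA().fit(toy_scaled)
scores = toy_pca.transform(toy_scaled)

# Arrow coordinates: eigenvector entries times the component's standard deviation
# loadings[j, k] is the coordinate of feature j on component k
loadings = toy_pca.components_.T * np.sqrt(toy_pca.explained_variance_)

# Direct computation: correlation between each feature and each component score
direct = np.array([[np.corrcoef(toy_scaled[:, j], scores[:, k])[0, 1]
                    for k in range(3)] for j in range(3)])

print(np.round(loadings, 2))
```

For standardised data the two matrices agree up to a factor close to 1 for large samples, which is why an arrow near the unit circle indicates a variable well represented on that plane, and two arrows pointing in opposite directions indicate negatively correlated variables.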

In [None]:
label_mode = {1: 'Mode 0', 2: 'Mode 1'}

def plot_pca(l_pca, fig, ax, nbc, nbc2, pca):
    cmaps = plt.get_cmap("Set2")
    for i in range(2):
        xs = l_pca[spotify_data["mode"] == i, nbc - 1]
        ys = l_pca[spotify_data["mode"] == i, nbc2 - 1]
        label = label_mode[i + 1]
        color = cmaps(i)
        ax.scatter(x=xs, y=ys, color=color, alpha=.5, s=1, label=label)
        ax.set_xlabel("PC %d: %.2f%%" %
                      (nbc, pca.explained_variance_ratio_[nbc - 1] * 100), fontsize=10)
        ax.set_ylabel("PC %d: %.2f%%" %
                      (nbc2, pca.explained_variance_ratio_[nbc2 - 1] * 100), fontsize=10)

In [None]:
fig = plt.figure(figsize=(12, 8))
for nbc, nbc2, count in [(1, 2, 1), (1, 3, 2), (1, 4, 3),
                         (2, 3, 5), (2, 4, 6), (3, 4, 9)]:
    ax = fig.add_subplot(3, 3, count)
    plot_pca(spotify_pca, fig, ax, nbc, nbc2, pca)
    plt.subplots_adjust(wspace=0.3, hspace=0.3)

plt.legend(loc='best', bbox_to_anchor=(1.8, 2), markerscale=8, fontsize='large')
#plt.suptitle('Principal Components from 1 to 4', fontsize=14)
save_fig('scatter_of_1_4_pc', tight_layout=False)
plt.show()

In [None]:
def plot_corr_circle_bis(data, pca, comp1, comp2, fig, ax):
    '''Plots correlation circle from results of PCA'''
    coord1 = pca.components_[comp1 - 1] * np.sqrt(pca.explained_variance_[comp1 - 1])
    coord2 = pca.components_[comp2 - 1] * np.sqrt(pca.explained_variance_[comp2 - 1])

    cmap = sns.color_palette("flare", as_cmap=True)
    for i, j, nom in zip(coord1, coord2, data.columns):
        plt.text(i, j, nom, rotation=45)
        rainbowarrow(ax, (0, 0), (i, j), cmap=cmap, lw=2)
    plt.axis((-1.2, 1.2, -1.2, 1.2))

    # unit circle
    c = plt.Circle((0, 0), radius=1, color='gray', fill=False)
    ax.add_patch(c)
    
    xlab = 'Dim ' + str(comp1) + ': ' + str(pca.explained_variance_ratio_[comp1 - 1] * 100)[:4] + '%'
    ylab = 'Dim ' + str(comp2) + ': ' + str(pca.explained_variance_ratio_[comp2 - 1] * 100)[:4] + '%'
    ax.set_xlabel(xlab, fontsize=12)
    ax.set_ylabel(ylab, fontsize=12)

In [None]:
fig = plt.figure(figsize=(16, 16))
for nbc, nbc2, count in [(1, 2, 1), (1, 3, 2), (1, 4, 3), (2, 3, 5), (2, 4, 6), (3, 4, 9)]:
    ax = fig.add_subplot(3, 3, count)
    plot_corr_circle_bis(X_new, pca, nbc, nbc2, fig, ax)
    plt.subplots_adjust(wspace=0.3, hspace=0.4)

#plt.suptitle('Principal Components from 1 to 4', fontsize=14)
#save_fig('scatter_of_1_4_corr_circle', tight_layout=False)
plt.show()

# Data preparation

In [None]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

spotify_pop_class = spotify_data[["pop.class"]]
spotify_key = spotify_data[["key"]]

In [None]:
label_encoder = LabelEncoder()

spotify_pop_class_encoded = label_encoder.fit_transform(spotify_pop_class.values.ravel())
print(spotify_pop_class_encoded[:15])

In [None]:
ordinal_encoder = OrdinalEncoder(dtype=np.int32)

spotify_key_encoded = ordinal_encoder.fit_transform(spotify_key)
spotify_key_encoded = np.squeeze(spotify_key_encoded)
print(spotify_key_encoded[:15])

In [None]:
spotify_data["key"] = spotify_key_encoded
spotify_data["pop.class"] = spotify_pop_class_encoded

In [None]:
features = [
    feature for feature in spotify_data.keys()
    if feature not in ['popularity', 'pop.class']
]
print(features)

In [None]:
X = spotify_data[features]

y_class = spotify_data[["pop.class"]]
y_reg = spotify_data[["popularity"]]
y_class = y_class.values.ravel()
y_reg = y_reg.values.ravel()

In [None]:
X.head()

In [None]:
print(y_reg[:15])
print(y_class[:15])

# Model training

In [None]:
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import tensorflow as tf
from tensorflow import keras

In [None]:
def get_NN_model(n_inputs, n_outputs, problem=None):
    '''Build a small neural network for regression or classification'''
    model = keras.models.Sequential()
    
    model.add(keras.layers.Dense(30,
                                 input_dim=n_inputs,
                                 activation='relu'))
    
#     model.add(keras.layers.Dense(150, activation='relu'))
#     model.add(keras.layers.Dense(100, activation='relu'))
    
    if problem == 'regression':
        model.add(keras.layers.Dense(n_outputs,
                                     activation='linear'))

        # Accuracy is not meaningful for regression: track the mean absolute error
        model.compile(loss='mean_squared_error',
                      optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                      metrics=['mean_absolute_error'])
    
    elif problem == 'classification':
        model.add(keras.layers.Dense(n_outputs,
                                     activation='softmax'))

        model.compile(loss='sparse_categorical_crossentropy',
                      optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                      metrics=['accuracy'])
    
    return model

## Classification

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.metrics import roc_auc_score, classification_report, accuracy_score

model_accuracy_score = []

In [None]:
X_train, X_test, y_train_class, y_test_class = train_test_split(
    X, y_class, test_size=0.25, random_state=42
)

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
mm_scaler = MinMaxScaler()

X_train_scaled_mm = mm_scaler.fit_transform(X_train)
X_test_scaled_mm = mm_scaler.transform(X_test)

### Logistic regression

#### Without penalization

In [None]:
# Note: scikit-learn >= 1.2 expects penalty=None instead of the string 'none'
Log_Reg_Model = LogisticRegression(penalty='none', solver='saga',
                                   multi_class='multinomial', max_iter=4000)
Log_Reg_Model.fit(X_train_scaled, y_train_class)

In [None]:
LogR_Predict = Log_Reg_Model.predict(X_test_scaled)
LogR_Accuracy = accuracy_score(y_test_class, LogR_Predict)
model_accuracy_score.append(LogR_Accuracy)

print("Accuracy: " + str(LogR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, LogR_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Logistic Regression")
save_fig("confusion_matrix_of_Log_Reg_class")
plt.show()

In [None]:
print(classification_report(y_test_class, LogR_Predict))

#### Lasso penalty

In [None]:
param = [{
    "C": [0.1, 1., 5., 10., 15.],
}]

Lasso_Model = GridSearchCV(
    LogisticRegression(penalty='l1', solver='saga', multi_class='multinomial',
                       max_iter=4000, random_state=100),
    param, cv=10)

Lasso_Model.fit(X_train_scaled, y_train_class)

print("Best score = %f, Best parameter = %s" %
      (Lasso_Model.best_score_, Lasso_Model.best_params_))

In [None]:
Lasso_Predict = Lasso_Model.predict(X_test_scaled)
Lasso_Accuracy = accuracy_score(y_test_class, Lasso_Predict)
model_accuracy_score.append(Lasso_Accuracy)

print("Accuracy: " + str(Lasso_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, Lasso_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Lasso Classification")
save_fig("confusion_matrix_of_Lasso_class")
plt.show()

In [None]:
print(classification_report(y_test_class, Lasso_Predict))

#### Ridge penalty

In [None]:
param = [{
    "C": [0.1, 1., 5., 10., 15.],
    "solver": ['saga', 'lbfgs']
}]

Ridge_Model = GridSearchCV(
    LogisticRegression(penalty='l2', multi_class='multinomial',
                       max_iter=4000, random_state=100),
    param, cv=10)

Ridge_Model.fit(X_train_scaled, y_train_class)

print("Best score = %f, Best parameters = %s" %
      (Ridge_Model.best_score_, Ridge_Model.best_params_))

In [None]:
Ridge_Predict = Ridge_Model.predict(X_test_scaled)
Ridge_Accuracy = accuracy_score(y_test_class, Ridge_Predict)
model_accuracy_score.append(Ridge_Accuracy)

print("Accuracy: " + str(Ridge_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, Ridge_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Ridge Classification")
save_fig("confusion_matrix_of_Ridge_class")
plt.show()

In [None]:
print(classification_report(y_test_class, Ridge_Predict))

#### Elastic Net penalty

In [None]:
param = [{
    "C": [0.1, 1., 5., 10., 15.],
    "l1_ratio": [0.25, 0.5, 0.75]
}]

EN_Model = GridSearchCV(
    LogisticRegression(penalty='elasticnet', solver='saga', multi_class='multinomial',
                       max_iter=4000, random_state=42),
    param, cv=10)

EN_Model.fit(X_train_scaled, y_train_class)

print("Best score = %f, Best parameters = %s" %
      (EN_Model.best_score_, EN_Model.best_params_))

In [None]:
EN_Predict = EN_Model.predict(X_test_scaled)
EN_Accuracy = accuracy_score(y_test_class, EN_Predict)
model_accuracy_score.append(EN_Accuracy)

print("Accuracy: " + str(EN_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, EN_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Elastic Net Classification")
save_fig("confusion_matrix_of_ENet_class")
plt.show()

In [None]:
print(classification_report(y_test_class, EN_Predict))

### Random Forest

In [None]:
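param = [{
    # 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn
    # releases no longer accept 'auto'
    "max_features": [*range(2, 10), 'sqrt', 'log2']
}]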

RFC_Model = GridSearchCV(
    RandomForestClassifier(n_estimators=500, n_jobs=-1),
    param, cv=5)

RFC_Model.fit(X_train, y_train_class)

print("Best score = %f, best parameter = %s" %
      (RFC_Model.best_score_, RFC_Model.best_params_))

In [None]:
RFC_Predict = RFC_Model.predict(X_test)
RFC_Accuracy = accuracy_score(y_test_class, RFC_Predict)
model_accuracy_score.append(RFC_Accuracy)

print("Accuracy: " + str(RFC_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, RFC_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Random Forest Classification")
save_fig("confusion_matrix_of_RF_class")
plt.show()

In [None]:
print(classification_report(y_test_class, RFC_Predict))

### Decision Trees

In [None]:
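param = [{
    "min_samples_split": range(2, 203, 10),
    # 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn
    # releases no longer accept 'auto'
    "max_features": [None, 'sqrt', 'log2']
}]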

DT_Model = GridSearchCV(DecisionTreeClassifier(random_state=42),
                        param, cv=5)

DT_Model.fit(X_train, y_train_class)

print("Best score = %f, best parameters = %s" %
      (DT_Model.best_score_, DT_Model.best_params_))

In [None]:
DT_Predict = DT_Model.predict(X_test)
DT_Accuracy = accuracy_score(y_test_class, DT_Predict)
model_accuracy_score.append(DT_Accuracy)

print("Accuracy: " + str(DT_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, DT_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Decision Trees Classification")
save_fig("confusion_matrix_of_DT_class")
plt.show()

In [None]:
print(classification_report(y_test_class, DT_Predict))

### SVC

#### Linear Kernel

In [None]:
param = [{
    "C": [0.01, 0.1, 0.5, 1., 2., 5., 10.]
}]

Lin_SVC_Model = GridSearchCV(
    SVC(kernel='linear', decision_function_shape='ovo', max_iter=10_000, random_state=100),
    param, cv=5
)

Lin_SVC_Model.fit(X_train_scaled_mm, y_train_class)

print("Best score = %f, best parameter = %s" %
      (Lin_SVC_Model.best_score_, Lin_SVC_Model.best_params_))

In [None]:
Lin_SVC_Predict = Lin_SVC_Model.predict(X_test_scaled_mm)
Lin_SVC_Accuracy = accuracy_score(y_test_class, Lin_SVC_Predict)
model_accuracy_score.append(Lin_SVC_Accuracy)

print("Accuracy: " + str(Lin_SVC_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, Lin_SVC_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Linear SVC Classification")
save_fig("confusion_matrix_of_Lin_SVC_class")
plt.show()

In [None]:
print(classification_report(y_test_class, Lin_SVC_Predict, zero_division=0))

#### Polynomial Kernel

In [None]:
param = [{
    "C": [1, 10, 100, 1000],
    "degree": [2, 3, 4]
}]

Poly_SVC_Model = GridSearchCV(
    SVC(kernel='poly', gamma='auto', coef0=1., 
        decision_function_shape='ovo', random_state=2021),
    param, cv=5
)

Poly_SVC_Model.fit(X_train_scaled_mm, y_train_class)

print("Best score = %f, best parameters = %s" %
      (Poly_SVC_Model.best_score_, Poly_SVC_Model.best_params_))

In [None]:
Poly_SVC_Predict = Poly_SVC_Model.predict(X_test_scaled_mm)
Poly_SVC_Accuracy = accuracy_score(y_test_class, Poly_SVC_Predict)
model_accuracy_score.append(Poly_SVC_Accuracy)

print("Accuracy: " + str(Poly_SVC_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, Poly_SVC_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Polynomial SVC Classification")
save_fig("confusion_matrix_of_Poly_SVC_class")
plt.show()

In [None]:
print(classification_report(y_test_class, Poly_SVC_Predict))

#### Radial Kernel

In [None]:
param = [{
    "C": [1, 10, 100, 1000],
}]

Rad_SVC_Model = GridSearchCV(
    SVC(kernel='rbf', gamma='auto', decision_function_shape='ovo', random_state=2021),
    param, cv=5
)

Rad_SVC_Model.fit(X_train_scaled_mm, y_train_class)

print("Best score = %f, best parameter = %s" %
      (Rad_SVC_Model.best_score_, Rad_SVC_Model.best_params_))

In [None]:
Rad_SVC_Predict = Rad_SVC_Model.predict(X_test_scaled_mm)
Rad_SVC_Accuracy = accuracy_score(y_test_class, Rad_SVC_Predict)
model_accuracy_score.append(Rad_SVC_Accuracy)

print("Accuracy: " + str(Rad_SVC_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, Rad_SVC_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Radial SVC Classification")
save_fig("confusion_matrix_of_Rad_SVC_class")
plt.show()

In [None]:
print(classification_report(y_test_class, Rad_SVC_Predict))

### Neural Networks

In [None]:
X_train_scaled_nn, X_valid, y_train_class_nn, y_valid = train_test_split(
    X_train_scaled, y_train_class, train_size=0.8, random_state=42
)

In [None]:
keras.backend.clear_session()

In [None]:
n_inputs, n_outputs = X_train.shape[1], 4
NN_Model = get_NN_model(n_inputs, n_outputs, 'classification')
NN_Model.summary()

In [None]:
history = NN_Model.fit(X_train_scaled_nn, y_train_class_nn, epochs=1000,
                       batch_size=30, validation_data=(X_valid, y_valid),
                       verbose=0)

In [None]:
def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    
    plt.figure(figsize=(12, 6))
    plt.grid(True)
    plt.ylim(0, 2.5)
    plt.plot(hist.epoch, hist.loss, label='Loss')
    plt.plot(hist.epoch, hist.val_loss, label='Validation loss')
    plt.legend(loc='upper right')
    plt.show()
    
plot_history(history)

In [None]:
NN_Predict = np.argmax(NN_Model.predict(X_test_scaled), axis=-1)
NN_Accuracy = accuracy_score(y_test_class, NN_Predict)
model_accuracy_score.append(NN_Accuracy)

print("Accuracy: " + str(NN_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, NN_Predict, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Neural Network Classification")
save_fig("confusion_matrix_of_NN_class")
plt.show()

In [None]:
print(classification_report(y_test_class, NN_Predict))

## Summary of Classification Results

In [None]:
class_models = [
    'Logistic Regression', 'Lasso', 'Ridge', 'Elastic Net', 'Random Forest',
    'Decision Trees', 'Linear SVC', 'Polynomial SVC', 'Radial SVC',
    'Neural Network'
]

model_performance_accuracy = pd.DataFrame({
    'Model': class_models,
    'Accuracy Score': model_accuracy_score
})

model_performance_accuracy

In [None]:
model_performance_accuracy.sort_values(by='Accuracy Score', ascending=False)

## Regression

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

reg_metrics = (mean_squared_error, r2_score, explained_variance_score)
threshold_accuracy_score = []

In [None]:
X_train, X_test, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.25, random_state=42
)

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
mse_scores = []
r2_scores = []
evs_scores = []

### Linear Regression

#### Without Penalty

In [None]:
LR_Model = LinearRegression()
LR_Model.fit(X_train, y_train_reg)
LR_Predict = LR_Model.predict(X_test)

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, LR_Predict))
r2_scores.append(r2_score(y_test_reg, LR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, LR_Predict))
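
The three metrics collected above do not measure the same thing. As a minimal illustration on toy values (not the Spotify data), a prediction that is off by a constant keeps a perfect explained variance score, because the residuals have zero variance, while the MSE and R² both degrade:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

# Toy targets (illustrative values only) and predictions shifted by a constant +1.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0

mse = mean_squared_error(y_true, y_pred)        # mean of the squared residuals
r2 = r2_score(y_true, y_pred)                   # 1 - SS_res / SS_tot
evs = explained_variance_score(y_true, y_pred)  # 1 - Var(residuals) / Var(y_true)

print("MSE:", mse, "R2:", r2, "Explained variance:", evs)
```

A large gap between R² and the explained variance score in the summary table would therefore hint at a systematically biased model.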

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, LR_Predict)
#plt.title("Results of Linear Regression")
save_fig("results_of_LR")
plt.show()

In [None]:
y_rtc_LR = reg_to_class(LR_Predict)
LR_Accuracy = accuracy_score(y_test_class, y_rtc_LR)
threshold_accuracy_score.append(LR_Accuracy)

print("Accuracy: " + str(LR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_LR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Linear Regression")
save_fig("confusion_matrix_of_LR")
plt.show()
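
`reg_to_class` appears to come from the project's `functions` module, and its definition is not reproduced here. The sketch below shows one plausible way to threshold continuous predictions into ordered classes. The function name `reg_to_class_sketch`, the cut points `(25, 50, 75)` and the integer labels `0..3` are all assumptions made for illustration, not the project's actual values.

```python
import numpy as np

def reg_to_class_sketch(y_pred, thresholds=(25.0, 50.0, 75.0)):
    """Map continuous predictions to ordinal class indices 0..len(thresholds).

    Hypothetical stand-in for `reg_to_class`; the cut points are placeholders.
    """
    # ravel() also flattens (n, 1) outputs such as those of a Keras model
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # np.digitize returns i such that thresholds[i-1] <= x < thresholds[i]
    return np.digitize(y_pred, bins=np.asarray(thresholds))

classes = reg_to_class_sketch([3.2, 25.0, 49.9, 80.1])
print(classes)
```

With this convention a prediction that falls exactly on a cut point is assigned to the upper class; the real helper may choose differently.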

#### Lasso Penalty

In [None]:
param = [{
    "alpha": [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 1.5, 2.]
}]

LassoR_Model = GridSearchCV(Lasso(), param, cv=10)
LassoR_Model.fit(X_train, y_train_reg)
LassoR_Predict = LassoR_Model.predict(X_test)

# Optimal parameter
print("Best score = %f, best parameter = %s" %
      (LassoR_Model.best_score_, LassoR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, LassoR_Predict))
r2_scores.append(r2_score(y_test_reg, LassoR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, LassoR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, LassoR_Predict)
#plt.title("Results of Lasso Regression")
save_fig("results_of_Lasso_reg")
plt.show()

In [None]:
y_rtc_LassoR = reg_to_class(LassoR_Predict)
LassoR_Accuracy = accuracy_score(y_test_class, y_rtc_LassoR)
threshold_accuracy_score.append(LassoR_Accuracy)

print("Accuracy: " + str(LassoR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_LassoR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Lasso Regression")
save_fig("confusion_matrix_of_Lasso_Reg")
plt.show()

#### Ridge Penalty

In [None]:
param = [{
    "alpha": [0.1, 0.5, 1., 1.5, 2., 3., 5., 10.]
}]

RidgeR_Model = GridSearchCV(Ridge(), param, cv=10)
RidgeR_Model.fit(X_train, y_train_reg)
RidgeR_Predict = RidgeR_Model.predict(X_test)

# Optimal parameter
print("Best score = %f, best parameter = %s" %
      (RidgeR_Model.best_score_, RidgeR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, RidgeR_Predict))
r2_scores.append(r2_score(y_test_reg, RidgeR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, RidgeR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, RidgeR_Predict)
#plt.title("Results of Ridge Regression")
save_fig("results_of_Ridge_reg")
plt.show()

In [None]:
y_rtc_RidgeR = reg_to_class(RidgeR_Predict)
RidgeR_Accuracy = accuracy_score(y_test_class, y_rtc_RidgeR)
threshold_accuracy_score.append(RidgeR_Accuracy)

print("Accuracy: " + str(RidgeR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_RidgeR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Ridge Regression")
save_fig("confusion_matrix_of_Ridge_reg")
plt.show()

#### Elastic Net Penalty

In [None]:
param = [{
    "alpha": [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 1.5, 2.]
}]

ENetR_Model = GridSearchCV(ElasticNet(), param, cv=10)
ENetR_Model.fit(X_train, y_train_reg)
ENetR_Predict = ENetR_Model.predict(X_test)

# Optimal parameter
print("Best score = %f, best parameter = %s" %
      (ENetR_Model.best_score_, ENetR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, ENetR_Predict))
r2_scores.append(r2_score(y_test_reg, ENetR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, ENetR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, ENetR_Predict)
#plt.title("Results of Elastic Net Regression")
save_fig("results_of_ENet_reg")
plt.show()

In [None]:
y_rtc_ENetR = reg_to_class(ENetR_Predict)
ENetR_Accuracy = accuracy_score(y_test_class, y_rtc_ENetR)
threshold_accuracy_score.append(ENetR_Accuracy)

print("Accuracy: " + str(ENetR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_ENetR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Elastic Net Regression")
save_fig("confusion_matrix_of_ENet_reg")
plt.show()

### Random Forest

Cross-validated tuning of *max_features* and *min_samples_split*.
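
Beyond the single best combination, the full grid of cross-validated scores can be inspected through `cv_results_`. The sketch below runs on synthetic data from `make_regression` (an assumption, standing in for the Spotify features) with a deliberately small grid and forest so that it executes quickly.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (not the Spotify features).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

grid = {"max_features": [2, 4, 8], "min_samples_split": [2, 6, 12]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    grid, cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination; pivot it into a
# max_features x min_samples_split table of mean validation scores.
results = pd.DataFrame(search.cv_results_)
score_table = results.pivot(
    index="param_max_features",
    columns="param_min_samples_split",
    values="mean_test_score",
)
print(score_table.round(3))
print(search.best_params_)
```

A flat table, where every cell is close to the best score, suggests that the chosen hyperparameters matter little and that the reported optimum is partly noise.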

In [None]:
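param = [{
    # None uses all features, which is what 'auto' meant for regressors;
    # recent scikit-learn releases no longer accept 'auto'
    "max_features": [*range(2, 10), None, 'log2'],
    "min_samples_split": list(range(2, 14))
}]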

RF_Model = GridSearchCV(RandomForestRegressor(), param, cv=5, n_jobs=-1)
RF_Model.fit(X_train, y_train_reg)
RF_Predict = RF_Model.predict(X_test)

# Optimal parameters
print("Best score = %f, best parameters = %s" %
      (RF_Model.best_score_, RF_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, RF_Predict))
r2_scores.append(r2_score(y_test_reg, RF_Predict))
evs_scores.append(explained_variance_score(y_test_reg, RF_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, RF_Predict)
#plt.title("Results of Random Forest Regression")
save_fig("results_of_RF_reg")
plt.show()

In [None]:
y_rtc_RF = reg_to_class(RF_Predict)
RF_Accuracy = accuracy_score(y_test_class, y_rtc_RF)
threshold_accuracy_score.append(RF_Accuracy)

print("Accuracy: " + str(RF_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_RF, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Random Forest Regression")
save_fig("confusion_matrix_of_RF_reg")
plt.show()

### Decision Trees

Cross-validated tuning of *max_depth* and *min_samples_split*.
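
When no `scoring` argument is given, `GridSearchCV` ranks regressors by their own `score` method, which is R². The "best score" it reports is therefore a mean cross-validated R², not an MSE. The sketch below, on synthetic data (an assumption, not the Spotify set), contrasts the default with an explicit `neg_mean_squared_error` criterion.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 noise=15.0, random_state=1)
grid = {"max_depth": [2, 4, 6], "min_samples_split": [2, 10]}

# Default scoring: the estimator's own .score(), i.e. R^2 for regressors.
search_r2 = GridSearchCV(DecisionTreeRegressor(random_state=0), grid, cv=5)
search_r2.fit(X_demo, y_demo)

# Explicit scoring: the MSE is negated so that "greater is better" still holds.
search_mse = GridSearchCV(DecisionTreeRegressor(random_state=0), grid, cv=5,
                          scoring="neg_mean_squared_error")
search_mse.fit(X_demo, y_demo)

print("Best mean CV R^2:", search_r2.best_score_, search_r2.best_params_)
print("Best mean CV MSE:", -search_mse.best_score_, search_mse.best_params_)
```

Selecting on negated MSE keeps the tuning criterion aligned with a summary table sorted by mean squared error.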

In [None]:
param = [{
    "max_depth": list(range(2, 10)),
    "min_samples_split": list(range(2, 10))
}]

DT_Model = GridSearchCV(DecisionTreeRegressor(), param, cv=10, n_jobs=-1)
DT_Model.fit(X_train, y_train_reg)
DT_Predict = DT_Model.predict(X_test)

# Optimal parameters
print("Best score = %f, best parameters = %s" %
      (DT_Model.best_score_, DT_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, DT_Predict))
r2_scores.append(r2_score(y_test_reg, DT_Predict))
evs_scores.append(explained_variance_score(y_test_reg, DT_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, DT_Predict)
#plt.title("Results of Decision Trees Regression")
save_fig("results_of_DT_reg")
plt.show()

In [None]:
y_rtc_DT = reg_to_class(DT_Predict)
DT_Accuracy = accuracy_score(y_test_class, y_rtc_DT)
threshold_accuracy_score.append(DT_Accuracy)

print("Accuracy: " + str(DT_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_DT, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Decision Trees Regression")
save_fig("confusion_matrix_of_DT_reg")
plt.show()

### SVR

#### Linear Kernel

Cross-validated tuning of the penalty (parameter $C$).
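
Since SVR is sensitive to feature scale, the scaler is part of the model being tuned. One common alternative to scaling the whole training set beforehand is to put the scaler inside a `Pipeline`, so that it is re-fitted on the training part of each fold. The sketch below uses synthetic data (an assumption) and is not the notebook's own setup.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=5.0, random_state=2)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # re-fitted on the training part of each fold
    ("svr", SVR(kernel="linear")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
grid = {"svr__C": [0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With standardization the difference from pre-scaling is usually small, but the pipeline keeps each validation fold strictly unseen during preprocessing.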

In [None]:
param = [{
    "C": [0.1, 0.5, 0.7, 0.8, 1., 1.5, 2.]
}]

Lin_SVR_Model = GridSearchCV(SVR(kernel='linear'), param, cv=5)
Lin_SVR_Model.fit(X_train_scaled, y_train_reg)
Lin_SVR_Predict = Lin_SVR_Model.predict(X_test_scaled)

# Optimal parameter
print("Best score = %f, best parameter = %s" %
      (Lin_SVR_Model.best_score_, Lin_SVR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, Lin_SVR_Predict))
r2_scores.append(r2_score(y_test_reg, Lin_SVR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, Lin_SVR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, Lin_SVR_Predict)
#plt.title("Results of Linear SVR Regression")
save_fig("results_of_Lin_SVR_reg")
plt.show()

In [None]:
y_rtc_Lin_SVR = reg_to_class(Lin_SVR_Predict)
Lin_SVR_Accuracy = accuracy_score(y_test_class, y_rtc_Lin_SVR)
threshold_accuracy_score.append(Lin_SVR_Accuracy)

print("Accuracy: " + str(Lin_SVR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_Lin_SVR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Linear SVR Regression")
save_fig("confusion_matrix_of_Lin_SVR_reg")
plt.show()

#### Polynomial Kernel

Cross-validated tuning of the penalty (parameter $C$) and of the polynomial degree.
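
For reference, the polynomial kernel as implemented in scikit-learn is

$$
K(x, x') = \left(\gamma \,\langle x, x' \rangle + r\right)^{d},
$$

where $d$ is `degree`, $r$ is `coef0` and $\gamma$ is `gamma`. With the settings used in the cell below, $r = 1$ and `gamma='auto'` gives $\gamma = 1/p$, with $p$ the number of features, so only $C$ and $d$ are left to the grid search.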

In [None]:
param = [{
    "C": [0.1, 0.5, 1., 2., 5.],
    "degree": [2, 3, 4]
}]

Poly_SVR_Model = GridSearchCV(SVR(kernel='poly', gamma='auto', coef0=1.), param, cv=5)
Poly_SVR_Model.fit(X_train_scaled, y_train_reg)
Poly_SVR_Predict = Poly_SVR_Model.predict(X_test_scaled)

# Optimal parameters
print("Best score = %f, best parameters = %s" %
      (Poly_SVR_Model.best_score_, Poly_SVR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, Poly_SVR_Predict))
r2_scores.append(r2_score(y_test_reg, Poly_SVR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, Poly_SVR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, Poly_SVR_Predict)
#plt.title("Results of Polynomial SVR Regression")
save_fig("results_of_Poly_SVR_reg")
plt.show()

In [None]:
y_rtc_Poly_SVR = reg_to_class(Poly_SVR_Predict)
Poly_SVR_Accuracy = accuracy_score(y_test_class, y_rtc_Poly_SVR)
threshold_accuracy_score.append(Poly_SVR_Accuracy)

print("Accuracy: " + str(Poly_SVR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_Poly_SVR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Polynomial SVR Regression")
save_fig("confusion_matrix_of_Poly_SVR_reg")
plt.show()

#### Radial Kernel

In [None]:
param = [{
    "C": [0.1, 0.5, 1., 2., 5.],
    "gamma": ['auto', 'scale', 1e-3, 1e-4]
}]

Rad_SVR_Model = GridSearchCV(SVR(kernel='rbf'), param, cv=5)
Rad_SVR_Model.fit(X_train_scaled, y_train_reg)
Rad_SVR_Predict = Rad_SVR_Model.predict(X_test_scaled)

# Optimal parameters
print("Best score = %f, best parameters = %s" %
      (Rad_SVR_Model.best_score_, Rad_SVR_Model.best_params_))

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, Rad_SVR_Predict))
r2_scores.append(r2_score(y_test_reg, Rad_SVR_Predict))
evs_scores.append(explained_variance_score(y_test_reg, Rad_SVR_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, Rad_SVR_Predict)
#plt.title("Results of Radial SVR Regression")
save_fig("results_of_Rad_SVR_reg")
plt.show()

In [None]:
y_rtc_Rad_SVR = reg_to_class(Rad_SVR_Predict)
Rad_SVR_Accuracy = accuracy_score(y_test_class, y_rtc_Rad_SVR)
threshold_accuracy_score.append(Rad_SVR_Accuracy)

print("Accuracy: " + str(Rad_SVR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_Rad_SVR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Radial SVR Regression")
save_fig("confusion_matrix_of_Rad_SVR_reg")
plt.show()

### Neural Networks

In [None]:
n_inputs, n_outputs = X_train.shape[1], 1
NN_Model = get_NN_model(n_inputs, n_outputs, 'regression')
NN_Model.summary()

In [None]:
history = NN_Model.fit(X_train_scaled_mm, y_train_reg, epochs=200, batch_size=30,
                       validation_data=(X_test_scaled_mm, y_test_reg), verbose=0)
NN_Predict = NN_Model.predict(X_test_scaled_mm)

In [None]:
mse_scores.append(mean_squared_error(y_test_reg, NN_Predict))
r2_scores.append(r2_score(y_test_reg, NN_Predict))
evs_scores.append(explained_variance_score(y_test_reg, NN_Predict))

In [None]:
plot_results(reg_metrics, y_test_reg, y_test_class, NN_Predict)
#plt.title("Results of Neural Network Regression")
save_fig("results_of_NN_reg")
plt.show()

In [None]:
y_rtc_NNR = reg_to_class(NN_Predict)
NNR_Accuracy = accuracy_score(y_test_class, y_rtc_NNR)
threshold_accuracy_score.append(NNR_Accuracy)

print("Accuracy: " + str(NNR_Accuracy))

In [None]:
plot_cf_matrix(y_test_class, y_rtc_NNR, cmap='Blues', draw_mosaic=False)
#plt.title("Confusion Matrix of Neural Network Regression")
save_fig("confusion_matrix_of_NN_reg")
plt.show()

## Summary of Regression Results

In [None]:
reg_models = [
    'Linear Regression', 'Lasso', 'Ridge', 'Elastic Net', 'Random Forest',
    'Decision Trees', 'Linear SVR', 'Polynomial SVR', 'Radial SVR',
    'Neural Network'
]

regression_scores = pd.DataFrame({
    'Model': reg_models,
    'Mean Squared Error': mse_scores,
    'R2 Score': r2_scores,
    'Explained Variance Score': evs_scores,
    'Thresholding Accuracy Score': threshold_accuracy_score
})

regression_scores

In [None]:
regression_scores.sort_values(by='Mean Squared Error', ascending=True)

# Comparison Between Classification and Regression

In [None]:
classifs = model_performance_accuracy["Accuracy Score"]
regs = regression_scores["Thresholding Accuracy Score"]

compare_table = pd.DataFrame({
    'Model': class_models,
    'Classification': classifs,
    'Thresholding': regs
})

compare_table