# Fundamentals of Artificial Intelligence - Project, Kacper Marzol

The goal of this project is to classify songs based on several of their audio features. The dataset consists of roughly 278 thousand songs from Spotify.

In [2]:
import pandas as pd

data = pd.read_csv("278k_labelled_uri.csv")
print(data.shape)
data.head()

(277938, 15)


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,duration (ms),danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,spec_rate,labels,uri
0,0,0,195000.0,0.611,0.614,-8.815,0.0672,0.0169,0.000794,0.753,0.52,128.05,3.446154e-07,2,spotify:track:3v6sBj3swihU8pXQQHhDZo
1,1,1,194641.0,0.638,0.781,-6.848,0.0285,0.0118,0.00953,0.349,0.25,122.985,1.464234e-07,1,spotify:track:7KCWmFdw0TzoJbKtqRRzJO
2,2,2,217573.0,0.56,0.81,-8.029,0.0872,0.0071,8e-06,0.241,0.247,170.044,4.00785e-07,1,spotify:track:2CY92qejUrhyPUASawNVRr
3,3,3,443478.0,0.525,0.699,-4.571,0.0353,0.0178,8.8e-05,0.0888,0.199,92.011,7.959809e-08,0,spotify:track:11BPfwVbB7vok7KfjBeW4k
4,4,4,225862.0,0.367,0.771,-5.863,0.106,0.365,1e-06,0.0965,0.163,115.917,4.693131e-07,1,spotify:track:3yUJKPsjvThlcQWTS9ttYx


In [3]:
data.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'duration (ms)', 'danceability', 'energy',
       'loudness', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'spec_rate', 'labels', 'uri'],
      dtype='object')

Acousticness: measure of how acoustic the track is, from 0 to 1
Danceability: measure of how danceable the track is, from 0 to 1
Energy: measure of how energetic the track is, from 0 to 1
Instrumentalness: measure of how instrumental the track is, from 0 to 1
Liveness: measure of audience presence in the recording, from 0 to 1
Loudness: loudness in decibels (-60 to 0 dB)
Speechiness: measure of spoken words in the track, from 0 to 1
Valence: measure from 0 to 1 describing the positivity conveyed by the track
Tempo: tempo of the track in BPM (beats per minute)
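The ranges listed above can be checked programmatically. Below is a minimal sketch that uses only the first two rows printed by `data.head()`; the names `sample`, `expected_ranges`, and `in_range` are illustrative and not part of the original notebook. The full dataset may contain values outside these nominal ranges, so on real data such a check is a diagnostic rather than a guarantee.

```python
import pandas as pd

# First two rows of the dataset, copied from the data.head() output above.
sample = pd.DataFrame({
    "danceability": [0.611, 0.638],
    "energy": [0.614, 0.781],
    "loudness": [-8.815, -6.848],
    "valence": [0.52, 0.25],
})

# Nominal ranges taken from the feature descriptions (hypothetical dict name).
expected_ranges = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "loudness": (-60.0, 0.0),
    "valence": (0.0, 1.0),
}

# For each column, check that every value lies inside its range (inclusive).
in_range = {
    col: bool(sample[col].between(lo, hi).all())
    for col, (lo, hi) in expected_ranges.items()
}
print(in_range)
```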

In [4]:
data.labels.value_counts()

1    106429
0     82058
2     47065
3     42386
Name: labels, dtype: int64

0 - sad song
1 - happy song
2 - energetic song
3 - calm song
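To make class counts easier to read, the integer codes can be mapped to the names listed above. This is a minimal sketch on a tiny hand-made series; `LABEL_NAMES` and `toy_labels` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Mapping taken from the label description above; the dict name is hypothetical.
LABEL_NAMES = {0: "sad", 1: "happy", 2: "energetic", 3: "calm"}

# Toy stand-in for data.labels (assumption: integer codes 0-3).
toy_labels = pd.Series([2, 1, 1, 0, 1, 3], name="labels")

# Replace each integer code with its readable name, then count the classes.
named = toy_labels.map(LABEL_NAMES)
counts = named.value_counts()
print(counts)
```

On the real data the same idea would be `data.labels.map(LABEL_NAMES).value_counts()`.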

We drop the unnecessary columns: "uri", because the track's URI is not needed as a feature, and the first two columns, which only repeat the row index.

In [5]:
cols = ["Unnamed: 0.1", "Unnamed: 0", "uri"]
data = data.drop(cols, axis=1)

In [6]:
data.head()

Unnamed: 0,duration (ms),danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,spec_rate,labels
0,195000.0,0.611,0.614,-8.815,0.0672,0.0169,0.000794,0.753,0.52,128.05,3.446154e-07,2
1,194641.0,0.638,0.781,-6.848,0.0285,0.0118,0.00953,0.349,0.25,122.985,1.464234e-07,1
2,217573.0,0.56,0.81,-8.029,0.0872,0.0071,8e-06,0.241,0.247,170.044,4.00785e-07,1
3,443478.0,0.525,0.699,-4.571,0.0353,0.0178,8.8e-05,0.0888,0.199,92.011,7.959809e-08,0
4,225862.0,0.367,0.771,-5.863,0.106,0.365,1e-06,0.0965,0.163,115.917,4.693131e-07,1


We add some NaN values to the dataset: a random 10% of the feature cells are set to NaN.

In [7]:
from sklearn.model_selection import train_test_split
import numpy as np
import random

# Separate the features from the target column.
X = data.drop('labels', axis=1)
y = data.labels

# Set a random 10% of all feature cells to NaN.
ix = [(row, col) for row in range(X.shape[0]) for col in range(X.shape[1])]
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    X.iat[row, col] = np.nan

# Count the missing values per column.
X.isna().sum()

duration (ms)       27555
danceability        28093
energy              27824
loudness            27713
speechiness         27787
acousticness        27793
instrumentalness    27861
liveness            27950
valence             27607
tempo               27731
spec_rate           27818
dtype: int64
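The cell-by-cell loop above can be slow on a frame of this size. A vectorized alternative is sketched below on a small synthetic frame; `X_toy`, the fixed seed, and the use of `numpy.random.default_rng` are assumptions made for illustration, not the notebook's own method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# Toy feature matrix standing in for X (1000 rows, 4 columns).
X_toy = pd.DataFrame(rng.random((1000, 4)), columns=["a", "b", "c", "d"])

# Pick exactly 10% of all cells without replacement, as in the loop above.
n_cells = X_toy.size
n_missing = int(round(0.1 * n_cells))
flat_idx = rng.choice(n_cells, size=n_missing, replace=False)

# Build a boolean mask with True at the chosen cells.
mask = np.zeros(n_cells, dtype=bool)
mask[flat_idx] = True
mask = mask.reshape(X_toy.shape)

# DataFrame.mask replaces the cells where the condition is True with NaN.
X_missing = X_toy.mask(mask)
print(X_missing.isna().sum())
```

Like the original loop, this removes exactly 10% of the cells overall, so the per-column counts still vary slightly around 10%.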

We split the dataset into training, validation, and test sets (80% / 10% / 10%):

In [8]:
# 80% training; the remaining 20% is halved into validation and test sets.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.8)
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(X_val.shape)

(222350, 11)
(222350,)
(27794, 11)
(27794, 11)
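The classes are imbalanced (106429 songs in the largest class versus 42386 in the smallest), and the split above is not seeded. One possible refinement, sketched here on synthetic data, is to pass `stratify` and `random_state` to `train_test_split`; the arrays `X_toy` and `y_toy` and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 samples, 3 features, 4 imbalanced classes.
X_toy = rng.random((1000, 3))
y_toy = np.repeat([0, 1, 2, 3], [300, 400, 150, 150])

# 80% train, then split the remaining 20% in half: 10% validation, 10% test.
# stratify keeps the class proportions roughly equal in every subset.
X_tr, X_rem, y_tr, y_rem = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_rem, y_rem, test_size=0.5, stratify=y_rem, random_state=42
)

print(X_tr.shape, X_va.shape, X_te.shape)
print(np.bincount(y_tr) / len(y_tr))  # class proportions in the training set
```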


In [8]:
import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot of the first 50 training rows only, to keep plotting fast.
xplot = X_train.iloc[:50, :]
sns.pairplot(xplot)
plt.show()





We will train several models.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipeline = Pipeline([("imputer", SimpleImputer(strategy="median"))])
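To illustrate what the median strategy does, here is a small sketch on a hand-made matrix (`X_toy` is a hypothetical name). Each NaN is replaced by the median of the observed values in its column, here 2 for the first column and 20 for the second, which makes the fill value robust to outliers such as the 100 below.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [100.0, 20.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_filled)
```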

# Logistic Regression

In [10]:
# Fitted models and their validation accuracies, collected for later comparison.
models = []
scores = []

In [91]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def pipeline_maker(model):
    # Impute missing values with the median, standardize, then apply the model.
    return make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)

def test(model, param_grid_):
    # Grid-search the pipeline on the training set (5-fold CV by default)
    # and return the best hyperparameters.
    pipe = pipeline_maker(model)
    search = GridSearchCV(pipe, param_grid=param_grid_, scoring='accuracy')
    search.fit(X_train, y_train)
    return search.best_params_
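Because the pipeline is built with `make_pipeline`, each step is named automatically after its lowercased class name, and grid keys take the form `<step name>__<parameter name>`. The short sketch below shows how to look these names up; the variables `step_names` and `c_keys` are illustrative.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression()
)

# make_pipeline names each step after its lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# Grid keys follow the pattern "<step name>__<parameter name>".
c_keys = [k for k in pipe.get_params() if k.endswith("__C")]
print(c_keys)
```

Calling `pipe.get_params()` is a convenient way to find the valid keys before writing a parameter grid.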


In [100]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()

param_grid = {
    "logisticregression__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

params = test(logistic, param_grid)

print(params)
# X_train still contains NaN, so the final model must be fit inside the same
# imputing and scaling pipeline; a bare LogisticRegression raises
# "ValueError: Input X contains NaN."
logistic = pipeline_maker(LogisticRegression(C=params['logisticregression__C']))
logistic.fit(X_train, y_train)

models.append(logistic)
scores.append(accuracy_score(y_val, logistic.predict(X_val)))

{'logisticregression__C': 0.01}
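Since the features contain NaN, any estimator used for prediction has to sit behind the imputer. One way to guarantee this, sketched below on synthetic data, is to reuse the pipeline that `GridSearchCV` already refits with the best parameters (`refit=True` is the default) via `best_estimator_`. The data from `make_classification`, the reduced grid, and `max_iter=1000` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the song features (assumption).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# Knock out about 10% of the cells, mimicking the NaN injection above.
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
search = GridSearchCV(
    pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best pipeline on all of X_tr,
# so it can predict directly on validation data that still contains NaN.
best_pipe = search.best_estimator_
val_acc = accuracy_score(y_va, best_pipe.predict(X_va))
print(search.best_params_, val_acc)
```

With this pattern the helper could return `search.best_estimator_` instead of only the parameters, which avoids fitting the model a second time.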