# Music Genre Classification 2024

artist: Name of the Artist.

song: Name of the Track.

popularity: The higher the value the more popular the song is.

danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements, including tempo and rhythm.

energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
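
A minimal sketch of the pitch-class mapping described above. The helper name `key_to_pitch` is hypothetical, and returning "unknown" for missing or out-of-range keys is an assumption, not part of the dataset description.

```python
import math

# Standard pitch-class notation: 0 = C, 1 = C♯/D♭, ..., 11 = B.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_to_pitch(key):
    """Return the pitch name for an integer key (hypothetical helper).

    Missing (None/NaN) or out-of-range keys map to "unknown" (assumption).
    """
    if key is None or (isinstance(key, float) and math.isnan(key)):
        return "unknown"
    key = int(key)
    if 0 <= key <= 11:
        return PITCH_CLASSES[key]
    return "unknown"


print(key_to_pitch(0), key_to_pitch(9.0), key_to_pitch(float("nan")))
```

On a DataFrame, the same helper could be applied with `df["key"].map(key_to_pitch)`.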

loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks.

mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
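
The three speechiness ranges above can be sketched as a small bucketing function. The function name and labels are hypothetical, and assigning the exact boundary values 0.33 and 0.66 to the middle bucket is an assumption, since the description does not say which side they belong to.

```python
def speechiness_bucket(value):
    """Map a speechiness score in [0, 1] to one of three rough categories.

    Boundary handling (0.33 and 0.66 fall into "mixed") is an assumption.
    """
    if value > 0.66:
        return "speech"   # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"    # may contain both music and speech, e.g. rap
    return "music"        # most likely music or other non-speech audio


for score in (0.0381, 0.5, 0.955):
    print(score, speechiness_bucket(score))
```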

acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
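
As a worked version of that relationship: tempo in BPM is 60 divided by the average beat duration in seconds. The function name below is hypothetical and only illustrates the arithmetic.

```python
def bpm_from_beat_duration(seconds_per_beat):
    """Convert an average beat duration (seconds) to beats per minute."""
    if seconds_per_beat <= 0:
        raise ValueError("beat duration must be positive")
    return 60.0 / seconds_per_beat


# A beat every half second corresponds to 120 BPM.
print(bpm_from_beat_duration(0.5))
```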

duration_in min/ms: Duration of the track. As the column name suggests, values may be given in either minutes or milliseconds.

time_signature: A notational convention used in Western musical notation to specify how many beats (pulses) are contained in each measure (bar), and which note value is equivalent to a beat.

Class: Genre of the track.

## Importing the libraries

In [4]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams["figure.figsize"] = (7,6)
import warnings

# ignore warning
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

In [5]:
# pip install tensorflow

In [6]:
# Importing libraries for DL and ML
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import recall_score,\
                            f1_score, precision_recall_curve,\
                            average_precision_score
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import GridSearchCV,\
                                    HalvingGridSearchCV,\
                                    RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,\
                                  OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline

In [7]:
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    recall_score,
    precision_score,
    make_scorer,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    average_precision_score,
    roc_curve,
    auc,
)

In [8]:
# This function will be used frequently
from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay

def eval_metric(model, X_train, y_train, X_test, y_test):
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

## Reading the data

In [10]:
#Reading the train data
df = pd.read_csv('data/train.csv')
#Seeing the head of it
df.head()

Unnamed: 0,Id,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
0,1,Marina Maximilian,Not Afraid,37.0,0.334,0.536,9.0,-6.649,0,0.0381,0.378,,0.106,0.235,152.429,204947.0,4,9
1,2,The Black Keys,Howlin' for You,67.0,0.725,0.747,11.0,-5.545,1,0.0876,0.0272,0.0468,0.104,0.38,132.921,191956.0,4,6
2,3,Royal & the Serpent,phuck u,,0.584,0.804,7.0,-6.094,1,0.0619,0.000968,0.635,0.284,0.635,159.953,161037.0,4,10
3,4,Detroit Blues Band,Missing You,12.0,0.515,0.308,,-14.711,1,0.0312,0.907,0.0213,0.3,0.501,172.472,298093.0,3,2
4,5,Coast Contra,My Lady,48.0,0.565,0.777,6.0,-5.096,0,0.249,0.183,,0.211,0.619,88.311,254145.0,4,5


In [11]:
#Reading the test data
df_test = pd.read_csv('data/test.csv')
#Seeing the head of it
df_test.head()

Unnamed: 0,Id,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature
0,14397,Juan Pablo Vega,Matando (feat. Vic Mirallas),,0.691,0.67,2.0,-7.093,0,0.0941,0.0757,0.0352,0.197,0.635,89.965,200000.0,4
1,14398,Kappi Kat,Baarish,14.0,0.461,0.777,2.0,-7.469,1,0.0306,0.388,0.923,0.291,0.525,163.043,283909.0,4
2,14399,Plain White T's,Hey There Delilah,80.0,0.656,0.291,2.0,-10.572,1,0.0293,0.872,,0.114,0.298,103.971,232533.0,4
3,14400,WALK THE MOON,Different Colors,52.0,0.48,0.826,,-4.602,1,0.0397,0.000797,1e-06,0.125,0.687,96.0,222053.0,4
4,14401,Peled,קריז,23.0,0.734,0.729,1.0,-6.381,0,0.283,0.147,,0.0672,0.805,76.03,118439.0,4


### Checking the columns

In [13]:
df.columns

Index(['Id', 'Artist Name', 'Track Name', 'Popularity', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_in min/ms', 'time_signature', 'Class'],
      dtype='object')

In [14]:
### Checking the shape of the data
print('The shape of the train data is: ', df.shape)
print('The shape of the test data is: ', df_test.shape)

The shape of the train data is:  (14396, 18)
The shape of the test data is:  (3600, 17)


In [15]:
#Checking the description of the data
df.describe()

Unnamed: 0,Id,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
count,14396.0,14063.0,14396.0,14396.0,12787.0,14396.0,14396.0,14396.0,14396.0,10855.0,14396.0,14396.0,14396.0,14396.0,14396.0,14396.0
mean,7198.5,44.525208,0.543105,0.662422,5.953781,-7.900852,0.640247,0.080181,0.246746,0.178129,0.195782,0.486379,122.695372,200094.2,3.924354,6.695679
std,4155.911573,17.41894,0.165517,0.235967,3.200013,4.057362,0.479944,0.085157,0.310922,0.304266,0.159258,0.239476,29.53849,111689.1,0.35952,3.20617
min,1.0,1.0,0.0596,0.00121,1.0,-39.952,0.0,0.0225,0.0,1e-06,0.0119,0.0215,30.557,0.50165,1.0,0.0
25%,3599.75,33.0,0.432,0.508,3.0,-9.538,0.0,0.0348,0.00428,8.8e-05,0.097275,0.299,99.799,165445.8,4.0,5.0
50%,7198.5,44.0,0.545,0.699,6.0,-7.0135,1.0,0.0471,0.08145,0.00392,0.129,0.4805,120.06,208941.0,4.0,8.0
75%,10797.25,56.0,0.658,0.861,9.0,-5.162,1.0,0.0831,0.43225,0.201,0.256,0.672,141.98825,252247.0,4.0,10.0
max,14396.0,100.0,0.989,1.0,11.0,1.342,1.0,0.955,0.996,0.996,0.992,0.986,217.416,1477187.0,5.0,10.0
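
The summary above shows `duration_in min/ms` ranging from 0.50165 to 1477187.0, which suggests that some rows are in minutes and others in milliseconds. A minimal sketch of one way to put them on a single scale follows; the cutoff of 30 and the helper name `to_milliseconds` are assumptions for illustration, not something the dataset documents.

```python
import pandas as pd

# Assumption: values below this cutoff are minutes, everything else is milliseconds.
MINUTES_CUTOFF = 30.0


def to_milliseconds(duration: pd.Series, cutoff: float = MINUTES_CUTOFF) -> pd.Series:
    """Return durations in milliseconds, converting suspected minute values."""
    in_minutes = duration < cutoff
    # Keep millisecond values as they are; scale suspected minute values by 60,000.
    return duration.where(~in_minutes, duration * 60_000)


toy = pd.Series([204947.0, 0.50165, 3.5, 1477187.0])
converted = to_milliseconds(toy)
print(converted.tolist())
```

If this assumption holds for the real data, the same call could be applied to `df["duration_in min/ms"]` and `df_test["duration_in min/ms"]` before modelling.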


In [16]:
#Checking the info of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14396 entries, 0 to 14395
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  14396 non-null  int64  
 1   Artist Name         14396 non-null  object 
 2   Track Name          14396 non-null  object 
 3   Popularity          14063 non-null  float64
 4   danceability        14396 non-null  float64
 5   energy              14396 non-null  float64
 6   key                 12787 non-null  float64
 7   loudness            14396 non-null  float64
 8   mode                14396 non-null  int64  
 9   speechiness         14396 non-null  float64
 10  acousticness        14396 non-null  float64
 11  instrumentalness    10855 non-null  float64
 12  liveness            14396 non-null  float64
 13  valence             14396 non-null  float64
 14  tempo               14396 non-null  float64
 15  duration_in min/ms  14396 non-null  float64
 16  time

### Checking the null data

In [18]:
#Checking the null data in the train data
df.isnull().sum()
#As we can see, the train data contains null values
#They are in Popularity, key, and instrumentalness

Id                       0
Artist Name              0
Track Name               0
Popularity             333
danceability             0
energy                   0
key                   1609
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness      3541
liveness                 0
valence                  0
tempo                    0
duration_in min/ms       0
time_signature           0
Class                    0
dtype: int64

In [19]:
#Checking the null data in the test data
df_test.isnull().sum()
#We have null data in Popularity,key, and instrumentalness

Id                      0
Artist Name             0
Track Name              0
Popularity             95
danceability            0
energy                  0
key                   405
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness      836
liveness                0
valence                 0
tempo                   0
duration_in min/ms      0
time_signature          0
dtype: int64
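
To put those counts in perspective, the sketch below converts them to percentages of each set. The counts and row totals are copied from the outputs above; on the real frames, `df.isnull().mean() * 100` would give the same figures directly.

```python
import pandas as pd

# Null counts and row totals copied from the outputs above.
null_counts = pd.DataFrame(
    {"train": [333, 1609, 3541], "test": [95, 405, 836]},
    index=["Popularity", "key", "instrumentalness"],
)
n_rows = pd.Series({"train": 14396, "test": 3600})

# Dividing a DataFrame by a Series aligns the Series index with the columns.
missing_pct = (null_counts / n_rows * 100).round(2)
print(missing_pct)
```

Roughly a quarter of the `instrumentalness` values are missing in both sets, so the choice of imputation strategy for that column matters more than for `Popularity`.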

### Checking the duplicated data

In [21]:
# Checking the duplicated data in train data
df.duplicated().sum()
#As we can see, we do not have duplicated data

0

In [22]:
# Checking the duplicated data in test data
df_test.duplicated().sum()
#As we can see, we do not have duplicated data

0

In [23]:
# music_data.columns= music_data.columns.str.replace(" ","_").str.lower()
# print("Column names after conversion = ", music_data.columns)

### Filling null data

In [25]:
#Filling the null values with the column median
df["Popularity"] = df["Popularity"].fillna(df["Popularity"].median())
df_test["Popularity"] = df_test["Popularity"].fillna(df_test["Popularity"].median())

In [26]:
#Filling the null values with the column median
df["key"] = df["key"].fillna(df["key"].median())
df_test["key"] = df_test["key"].fillna(df_test["key"].median())

In [27]:
#Filling the null values with the column median
df["instrumentalness"] = df["instrumentalness"].fillna(df["instrumentalness"].median())
df_test["instrumentalness"] = df_test["instrumentalness"].fillna(df_test["instrumentalness"].median())
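
The cells above fill each set with its own median. A common alternative, sketched here on toy data, is to learn the medians from the training set only and reuse them for the test set, so that no test-set statistics enter preprocessing. This is a swapped-in technique using scikit-learn's `SimpleImputer`, not what the notebook does; the toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ["Popularity", "key", "instrumentalness"]

# Toy stand-ins for df and df_test (values are illustrative only).
train = pd.DataFrame({
    "Popularity": [10.0, np.nan, 30.0],
    "key": [1.0, 5.0, np.nan],
    "instrumentalness": [np.nan, 0.2, 0.4],
})
test = pd.DataFrame({
    "Popularity": [np.nan, 50.0],
    "key": [np.nan, 2.0],
    "instrumentalness": [0.9, np.nan],
})

imputer = SimpleImputer(strategy="median")
train[cols] = imputer.fit_transform(train[cols])  # medians learned from train only
test[cols] = imputer.transform(test[cols])        # the same medians reused on test

print(imputer.statistics_)
print(test)
```

The imputer could also be placed as the first step of a `Pipeline`, in which case cross-validation would refit the medians on each training fold.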

In [28]:
#Checking the null data after filling in the train data
df.isnull().sum()

Id                    0
Artist Name           0
Track Name            0
Popularity            0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_in min/ms    0
time_signature        0
Class                 0
dtype: int64

In [29]:
#Checking the null data after filling in the test data
df_test.isnull().sum()

Id                    0
Artist Name           0
Track Name            0
Popularity            0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_in min/ms    0
time_signature        0
dtype: int64

### Artist Name

In [31]:
#Let's see how many distinct artist names we have in our dataset
df['Artist Name'].value_counts()

Artist Name
Backstreet Boys    58
Westlife           53
Britney Spears     47
Omer Adam          39
Eyal Golan         38
                   ..
Snowy Dunes         1
WhoMadeWho          1
Tom Lewis Band      1
Dozer               1
Freddy Fender       1
Name: count, Length: 7913, dtype: int64

In [32]:
df['Artist Name'].unique()[:60]

array(['Marina Maximilian', 'The Black Keys', 'Royal & the Serpent',
       'Detroit Blues Band', 'Coast Contra', 'Beck', 'Shadow and Light',
       'Within The Ruins', 'Crazy Cavan', 'Day Sulan', 'Dierks Bentley',
       'Elderbrook', 'Alter Bridge', 'Thieves Like Us',
       'Kelsea Ballerini, Kenny Chesney', 'DJ Enimoney', 'Attawalpa',
       'Foo Fighters', 'The Collective Projekt', 'Mohammed Rafi',
       'Die Apokalyptischen Reiter', 'Volbeat', 'Danny Weissfeld',
       'Metric', 'Tyla Yaweh', 'Underoath', 'Adi-Keshet Cohen',
       'Human Impact', 'a crowd of rebellion', 'Planet of Zeus',
       'Arlo Guthrie', 'Give a Little Love (feat. Sufjan Stevens)',
       'Derek & The Dominos', 'Our Lady Peace', 'Reeshabh Purohit',
       'Johnny Cosmic', 'Sonna', 'Urban Dance Squad', 'Betcha', 'ED.',
       'Corey Harper', 'James Taylor', 'Bickram Ghosh, Kala Ramnath',
       'Greta Van Fleet', 'Thornhill', 'J.I the Prince of N.Y',
       'Cody Jinks', 'R. D. Burman', 'R.B James', 'Swiss

### Track Name

In [34]:
#Let's see how many distinct track names we have in our dataset
df['Track Name'].value_counts()

Track Name
Fire                              8
Ghost                             7
Runaway                           7
Forever                           6
Dreams                            6
                                 ..
Tangerine                         1
Three Alley Cats                  1
תבואי היום                        1
In My Room                        1
Before the Next Teardrop Falls    1
Name: count, Length: 12455, dtype: int64

In [35]:
df['Track Name'].unique()[40:80]

array(['On the Run', 'Mexico', 'Rasiya',
       'Rolling In The Deep - Recorded At Spotify Studios NYC', 'Reptile',
       'R&B Shit (feat. A Boogie Wit da Hoodie)', 'What Else Is New',
       'Mehbooba Mehbooba - From “Sholay Songs And Dialogues, Vol. 2” Soundtrack',
       'Let Me Reach That Mountain', 'Cassette', 'One Headlight',
       'אלוף העולם', 'Wucan', 'Nookie', '#1 Crush',
       'My Favourite Game', 'RAPSTAR', 'say goodbye', 'The Beholder',
       'Leave The Door Open', 'Dutch Courage', "Nobody's Favorite",
       'Emily', "Eatin' Dust", 'Gentle Tuesday', 'Awakening The Soul 2',
       'This Armistice', 'Anything But Time', 'Sometimes Salvation',
       'Simple Case of the Blues', 'Blashyrkh - Mighty Ravendark',
       'Koi Mar Jaye - Deewaar / Soundtrack Version', 'Uncharted',
       '7 Canciones populares españolas - Arranged by Mischa Maisky: No.5: Nana',
       'Steal Away', 'I Made up My Mind', 'Baile Amargo',
       'Dancing In the Dark', 'Constance', 'A Tric

### Popularity

In [37]:
df['Popularity'].value_counts()

Popularity
44.0     666
42.0     371
41.0     357
34.0     348
43.0     343
        ... 
97.0       3
100.0      2
96.0       2
98.0       1
99.0       1
Name: count, Length: 100, dtype: int64

### danceability

In [39]:
df['danceability'].value_counts()

danceability
0.5520    54
0.5320    48
0.5270    47
0.5290    47
0.6010    47
          ..
0.0650     1
0.1000     1
0.9050     1
0.1650     1
0.0896     1
Name: count, Length: 887, dtype: int64

In [40]:
df.drop('Id', inplace=True, axis=1)

In [41]:
# df.drop('instrumentalness', inplace=True, axis=1)
# df_test.drop('instrumentalness', inplace=True, axis=1)

### Class

In [43]:
df.Class.value_counts()

Class
10    3959
6     2069
9     2019
8     1483
5     1157
1     1098
2     1018
0      500
7      461
3      322
4      310
Name: count, dtype: int64
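
The classes are clearly imbalanced, which is why the random forest cells in this notebook pass `class_weight='balanced'`. The sketch below shows what that setting works out to for the counts above, using scikit-learn's `compute_class_weight`: each weight is `n_samples / (n_classes * class_count)`. The label array is rebuilt from the printed counts rather than read from the data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output above.
counts = {10: 3959, 6: 2069, 9: 2019, 8: 1483, 5: 1157, 1: 1098,
          2: 1018, 0: 500, 7: 461, 3: 322, 4: 310}

# Rebuild a label array with the same class frequencies.
y_labels = np.repeat(list(counts.keys()), list(counts.values()))
classes = np.array(sorted(counts))

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_labels)
balanced = dict(zip(classes.tolist(), weights.tolist()))

for label, weight in balanced.items():
    print(f"class {label:>2}: weight {weight:.3f}")
```

The largest class (10) ends up with a weight of about 0.33 and the smallest (4) with about 4.22, so errors on rare genres count for more during training.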

In [44]:
df.columns

Index(['Artist Name', 'Track Name', 'Popularity', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_in min/ms', 'time_signature', 'Class'],
      dtype='object')

In [45]:
df_test.columns

Index(['Id', 'Artist Name', 'Track Name', 'Popularity', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_in min/ms', 'time_signature'],
      dtype='object')

In [117]:
X = df.drop(["Class", 'Artist Name', 'Track Name'], axis=1)
y = df.Class
X

Unnamed: 0,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature
0,37.0,0.334,0.536,9.0,-6.649,0,0.0381,0.378000,0.003920,0.1060,0.235,152.429,204947.000000,4
1,67.0,0.725,0.747,11.0,-5.545,1,0.0876,0.027200,0.046800,0.1040,0.380,132.921,191956.000000,4
2,44.0,0.584,0.804,7.0,-6.094,1,0.0619,0.000968,0.635000,0.2840,0.635,159.953,161037.000000,4
3,12.0,0.515,0.308,6.0,-14.711,1,0.0312,0.907000,0.021300,0.3000,0.501,172.472,298093.000000,3
4,48.0,0.565,0.777,6.0,-5.096,0,0.2490,0.183000,0.003920,0.2110,0.619,88.311,254145.000000,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14391,47.0,0.607,0.946,1.0,-2.965,1,0.1500,0.005480,0.000390,0.2780,0.653,120.011,195181.000000,4
14392,27.0,0.435,0.951,8.0,-7.475,1,0.0576,0.000005,0.550000,0.0952,0.203,135.034,282043.000000,4
14393,22.0,0.415,0.941,11.0,-4.300,1,0.0524,0.001810,0.000004,0.3370,0.572,167.978,176529.000000,4
14394,37.0,0.493,0.986,1.0,-2.279,1,0.0917,0.000967,0.006620,0.1230,0.567,122.036,186307.000000,4


In [119]:
# define a seed for the models and the split for a fair comparison

seed = 42

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)
# stratify because the classes are imbalanced

In [121]:
# Check the size of each one
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(11516, 14)
(2880, 14)
(11516,)
(2880,)
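
A small sketch of what `stratify=y` buys in the split above: class proportions are kept (almost) identical in the train and test parts. The toy labels and features here are synthetic and only illustrate the behaviour.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic imbalanced labels: 80% / 15% / 5%.
y_toy = np.repeat([0, 1, 2], [800, 150, 50])
X_toy = rng.normal(size=(len(y_toy), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share.round(3))
print("test: ", test_share.round(3))
```

Without stratification, a rare class could by chance be under-represented in the test part, which makes per-class metrics noisier.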


In [124]:
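# One-hot encode the text columns and min-max scale everything else.
# Note: X as defined above drops "Artist Name" and "Track Name", so this
# transformer only applies if those columns are kept in X.
cat_onehot = ["Artist Name", "Track Name"]
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_onehot),
    remainder=MinMaxScaler(),
    verbose_feature_names_out=False,
)

column_trans = column_trans.set_output(transform="pandas")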

In [126]:
column_trans

In [128]:
scaler = MinMaxScaler()
# le = LabelEncoder()

### RF Model

In [130]:
from sklearn.ensemble import RandomForestClassifier

operations = [
    ('scaler', scaler),
    # ('OneHot', column_trans),
    ('rf', RandomForestClassifier(random_state=seed, class_weight='balanced')),
]

rf_m = Pipeline(steps=operations).set_output(transform="pandas")

rf_m.fit(X_train, y_train)

In [131]:
eval_metric(rf_m, X_train, y_train, X_test, y_test)

Test_Set
[[ 79   0   0   6   7   0   0   1   0   7   0]
 [  0   3   6   0   0   8  93   0   4  19  87]
 [  0   1  77   0   0   4  14   0   2  13  93]
 [  9   0   0  48   2   0   0   3   0   2   0]
 [ 11   0   0   1  45   0   0   1   0   4   0]
 [  0   0   5   0   0 162   8   0   0  40  16]
 [  0  49   6   0   0  12 105   0  10  50 182]
 [  5   0   0   1   0   0   0  86   0   0   0]
 [  0   1   0   0   0   0  10   0 146   1 139]
 [  8   8  16   2   3  33  32   0   0 216  86]
 [  2  48  37   2   7   5  67   0  68  62 494]]
              precision    recall  f1-score   support

           0       0.69      0.79      0.74       100
           1       0.03      0.01      0.02       220
           2       0.52      0.38      0.44       204
           3       0.80      0.75      0.77        64
           4       0.70      0.73      0.71        62
           5       0.72      0.70      0.71       231
           6       0.32      0.25      0.28       414
           7       0.95      0.93      0

##### Cross Validation RF

In [98]:
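# Note: f1_score defaults to average='binary', which is not defined for a
# multiclass target such as Class; this is most likely why the F1 columns in
# the output below are NaN. Passing average='macro' (or 'weighted') to
# make_scorer avoids it.
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score)
}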

In [99]:
operations = [('OneHot', column_trans), ('rf', RandomForestClassifier(random_state=seed, class_weight='balanced'))]

CV_RF_b  = Pipeline(steps=operations).set_output(transform="pandas")


scores = cross_validate(
    CV_RF_b, X_train, y_train, scoring=scoring, cv=10, return_train_score=True
)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()

fit_time          139.962808
score_time          0.634158
test_accuracy       0.491317
train_accuracy      0.944676
test_f1_score            NaN
train_f1_score           NaN
dtype: float64
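
The NaN values for `test_f1_score` and `train_f1_score` above are consistent with `f1_score` being called with its default `average='binary'` on a multiclass target. A minimal sketch of a multiclass-friendly scorer follows; the data is synthetic, and the choice of macro averaging (rather than weighted) is an assumption about what is wanted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic 3-class problem standing in for the genre data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

scoring = {
    "accuracy": make_scorer(accuracy_score),
    # Macro averaging: unweighted mean of the per-class F1 scores.
    "f1_macro": make_scorer(f1_score, average="macro"),
}

scores = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=42),
    X_toy, y_toy, scoring=scoring, cv=3,
)
print(scores["test_f1_macro"])
```

Macro F1 weights every genre equally, which suits an imbalanced target; `average="weighted"` would instead weight each class by its support.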

##### Grid/Random Search RF

In [109]:
scoring = 'accuracy'

In [113]:
param_grid = {'rf__n_estimators':[100, 128, 200],
             'rf__max_features':[2, 4, 6,'sqrt'],
             'rf__max_depth':[3, 5, 7, 8],
             # 'rf__min_samples_split':[2, 3, 4],
             # 'rf__min_samples_leaf': [2, 3, 4],
             # 'rf__class_weight': [None, 'balanced', {0: 1, 1: 2}]
              
             #'RF_model__max_samples':[0.4, 0.8 , 1]
             }

operations = [('OneHot', column_trans), ('rf', RandomForestClassifier(random_state=seed, class_weight='balanced'))]

GS_RF_b = Pipeline(steps=operations)

grid_model_RF_b = GridSearchCV(estimator=GS_RF_b, param_grid= param_grid, scoring=scoring, cv=10, n_jobs=-1, return_train_score=True)

grid_model_RF_b.fit(X_train, y_train)

KeyboardInterrupt: 
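
The grid above has 3 × 4 × 4 = 48 candidates, and with `cv=10` that means 480 fits plus a final refit, which helps explain why the run was interrupted. One cheaper option, sketched below on synthetic data, is `RandomizedSearchCV` (already imported at the top of the notebook), which samples a fixed number of candidates. The values `n_iter=5` and `cv=3` are arbitrary assumptions for illustration, and a `MinMaxScaler` stands in for the notebook's preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data standing in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("rf", RandomForestClassifier(random_state=42, class_weight="balanced")),
])

# Same search space as the grid above.
param_distributions = {
    "rf__n_estimators": [100, 128, 200],
    "rf__max_features": [2, 4, 6, "sqrt"],
    "rf__max_depth": [3, 5, 7, 8],
}
n_candidates = 3 * 4 * 4  # size of the exhaustive grid

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,            # sample 5 of the 48 candidates (assumption)
    scoring="accuracy",
    cv=3,                # fewer folds than the original cv=10 (assumption)
    random_state=42,
)
search.fit(X_toy, y_toy)
print(n_candidates, search.best_params_)
```

This trades completeness for speed: the best sampled candidate is not guaranteed to be the best in the full grid, so `n_iter` can be raised as time allows.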

In [None]:
eval_metric(grid_model_RF_b, X_train, y_train, X_test, y_test)

In [None]:
grid_model_RF_b.best_params_

In [None]:
# Build the model with the best parameters from the grid search, then plot the precision-recall curve
# Note: trans_ord and plot_precision_recall are not defined in the cells above and must be defined before running this cell
operations = [('Ordinal', trans_ord), ('rf', RandomForestClassifier(random_state=seed, max_depth=3, max_features='sqrt', min_samples_leaf=4, min_samples_split=2, n_estimators=50, class_weight='balanced'))]


rf_pipline_ba = Pipeline(steps=operations).set_output(transform="pandas")

rf_pipline_ba.fit(X_train, y_train)

# Get the predicted probabilities for the test data
y_proba = rf_pipline_ba.predict_proba(X_test)
plot_precision_recall(y_test, y_proba)
plt.show()

In [134]:
ID = df_test['Id']
ID

0       14397
1       14398
2       14399
3       14400
4       14401
        ...  
3595    17992
3596    17993
3597    17994
3598    17995
3599    17996
Name: Id, Length: 3600, dtype: int64

In [138]:
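# Predict on the test set using the same feature columns the model was trained on
Class = rf_m.predict(df_test.drop(['Id', 'Artist Name', 'Track Name'], axis=1))
# Build the submission frame and write it to disk
data = {'Id': ID, 'Class': Class}
sub = pd.DataFrame(data)
sub.to_csv('rf.csv', index=False)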