# Binary Classification Based on Genres

Let's import the necessary libraries and load the dataset.

In [1]:
import pandas as pd
import numpy as np

In [9]:
dataset = pd.read_csv("features_msd_lda_sp.csv")
dataset = dataset.drop(columns=["Unnamed: 0"])
dataset

Unnamed: 0,genre,track_id,artist_name,title,loudness_x,tempo_x,time_signature,key_x,mode_x,duration,...,key_y,loudness_y,mode_y,speechiness,acousticness,instrumentalness,liveness,valence,tempo_y,id
0,classic pop and rock,TRNJTPB128F427AE9F,Blue Oyster Cult,Screams,-10.659,148.462,1,4,0,189.80526,...,4,-9.009,0,0.0806,0.32600,0.014300,0.2240,0.4930,149.356,2KsnSzoJlmfT44jFOqRP1E
1,classic pop and rock,TRLFJHA128F427AEEA,Blue Oyster Cult,Dance The Night Away,-13.494,112.909,1,10,0,158.19710,...,10,-11.838,0,0.0359,0.90300,0.000000,0.1810,0.1930,80.510,4qV4ErzHMz4XYGWxMkOFBw
2,classic pop and rock,TRCQZAG128F427DB97,Blue Oyster Cult,Debbie Denise,-12.786,117.429,4,7,1,250.22649,...,7,-7.034,1,0.0303,0.36100,0.002490,0.1170,0.4670,117.720,2o2jnahWDsvyo8v0vV3dif
3,classic pop and rock,TRSIZRN128F427DB95,Blue Oyster Cult,Morning Final,-11.952,100.901,1,0,1,254.17098,...,9,-6.756,0,0.0659,0.05170,0.033700,0.1520,0.5050,100.563,4YsdlcE3GqCHpZD4jNDJgD
4,classic pop and rock,TRDYTEO128F427DB90,Blue Oyster Cult,The Revenge Of Vera Gemini,-11.839,132.361,4,2,1,230.32118,...,9,-7.110,1,0.0659,0.00182,0.000002,0.0814,0.6570,132.659,4aumrXau5uVKj8xXcQELP3
5,classic pop and rock,TRKSICM128F427DB8B,Blue Oyster Cult,True Confessions,-13.760,121.001,4,10,1,177.55383,...,10,-7.493,1,0.0433,0.10200,0.000009,0.0539,0.8140,121.065,6sH5tkQoXL3YKV0hpXZoWj
6,classic pop and rock,TRJPXIV128F426697A,Blue Oyster Cult,Redeemed,-10.799,185.836,4,2,1,231.39220,...,11,-9.794,0,0.0374,0.08460,0.001370,0.3210,0.5170,186.627,1P37l3UBUo49cqATq6Qccz
7,classic pop and rock,TRXWSIN128F4265A40,Blue Oyster Cult,Workshop Of The Telescopes,-11.413,120.171,4,9,1,241.16200,...,2,-8.954,1,0.1140,0.25700,0.011700,0.1110,0.3540,120.413,55f36XW2w1moYF70wsaHLB
8,classic pop and rock,TRNVQPE128F426BBD2,Blue Oyster Cult,Godzilla,-12.083,88.548,4,6,0,469.89016,...,4,-7.719,1,0.0554,0.16400,0.000008,0.6050,0.6960,184.024,6N0AnkdDFZUetw8KAGHV7e
9,classic pop and rock,TRUUZXH128F426C1AD,Blue Oyster Cult,E.T.I. (Extra Terrestrial Intelligence),-7.264,97.298,4,9,1,226.03710,...,9,-5.875,1,0.0486,0.24500,0.002490,0.4350,0.6860,97.335,5KBdHzTROSlD3dACh91sZx


Let's look at which genres are in the dataset.

In [17]:
genres = dataset.genre.unique()
genres

array(['classic pop and rock', 'punk', 'folk', 'pop',
       'dance and electronica', 'metal', 'jazz and blues', 'classical',
       'hip-hop', 'soul and reggae'], dtype=object)
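Each genre later becomes the positive class of its own binary problem, so it can help to know how many tracks each genre has. A minimal sketch on a made-up `genre` column (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical genre column; the real one would be dataset["genre"].
toy_genres = pd.Series(
    ["punk", "folk", "punk", "pop", "punk", "folk"], name="genre"
)

# Absolute number of tracks per genre, most frequent first.
counts = toy_genres.value_counts()
# Same information as a fraction of all tracks.
share = toy_genres.value_counts(normalize=True)

print(counts)
print(share.round(3))
```

A genre with a small share yields a heavily imbalanced binary label, which matters when reading accuracy scores.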

In [53]:
# Identifier and text columns that cannot be used as numeric features.
string_features = ["track_id", "id", "artist_name", "title"]

In [68]:
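def create_input_and_label_for_genre(dataset, genre):
    # Label is 1 for tracks of the given genre and 0 for every other genre.
    label = np.where(dataset['genre'] == genre, 1, 0)
    # Drop the target and the string columns; drop() returns a new frame.
    feature = dataset.drop(columns=['genre'] + string_features)
    # Return the features as a NumPy array.
    return feature.to_numpy(), label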
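The one-vs-rest split performed by `create_input_and_label_for_genre` can be checked on a tiny hand-made frame. The sketch below is only an illustration: the three rows are invented, and `split_features_and_label` is a hypothetical stand-in for the function above. It uses a non-mutating `drop`, which sidesteps the `SettingWithCopyWarning` that an in-place drop on a slice can trigger.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "genre": ["punk", "folk", "punk"],
    "track_id": ["a", "b", "c"],
    "id": ["x", "y", "z"],
    "artist_name": ["A", "B", "C"],
    "title": ["t1", "t2", "t3"],
    "tempo_x": [120.0, 90.0, 150.0],
    "loudness_x": [-5.0, -12.0, -4.0],
})
string_columns = ["track_id", "id", "artist_name", "title"]


def split_features_and_label(frame, genre):
    """Return (numeric feature matrix, 0/1 label) for one genre."""
    label = np.where(frame["genre"] == genre, 1, 0)
    features = frame.drop(columns=["genre"] + string_columns)
    return features.to_numpy(), label


X_toy, y_toy = split_features_and_label(toy, "punk")
print(X_toy.shape, y_toy)
```

Because `drop` returns a new frame, the input DataFrame keeps all of its columns and can be reused for the next genre without copying it first.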

In [67]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def get_random_forest_classifier(n_features):
    # Tree depth is capped at the number of input features.
    return RandomForestClassifier(max_depth=n_features, random_state=0)

In [66]:
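cv = 10  # number of cross-validation folds
for genre in genres:
    print("Binary classification based on genre: " + genre)
    X, Y = create_input_and_label_for_genre(dataset.copy(), genre)
    clf = get_random_forest_classifier(X.shape[1])
    # Fit on the full data only to inspect the feature importances.
    clf.fit(X, Y)
    print("Feature importance", clf.feature_importances_)
    # cross_val_score clones the estimator, so each fold is trained from scratch.
    print("Cross validation scores", cross_val_score(clf, X, Y, cv=cv))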
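The importance arrays printed by this loop are hard to read without the column names. A minimal sketch of pairing the two, using synthetic data from `make_classification` and placeholder feature names (`f0`, `f1`, ... are hypothetical, not the dataset's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Index the importances by name and sort so the strongest features come first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

With the real data, the names would come from the DataFrame columns that remain after dropping `genre` and the string columns.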


Binary classification based on genre: classic pop and rock


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Feature importance [ 0.02609409  0.01592925  0.00343933  0.00829981  0.00220663  0.0237888
  0.01828232  0.02484954  0.02555532  0.02310877  0.02029507  0.01832082
  0.01707452  0.01506753  0.01814829  0.03229379  0.01893684  0.01905377
  0.01828286  0.01591284  0.02000526  0.01854523  0.0181194   0.01592058
  0.02351423  0.01839225  0.01873563  0.01703552  0.02275838  0.02279514
  0.03670778  0.01720989  0.01745931  0.01816287  0.01917076  0.04055212
  0.01620707  0.01694954  0.01824538  0.01654733  0.01808314  0.02563516
  0.00819655  0.03192746  0.00204533  0.03283788  0.03416856  0.01488198
  0.01614647  0.02010387  0.01799978]
Cross validation scores [ 0.69434932  0.5625      0.57926307  0.6092545   0.60154242  0.63379074
  0.62264151  0.72641509  0.55060034  0.55660377]
Binary classification based on genre: punk




Feature importance [ 0.0214146   0.02015329  0.00323032  0.00579147  0.00031789  0.06196919
  0.03492917  0.01151626  0.00998783  0.01357622  0.0265058   0.01668819
  0.01796331  0.02085987  0.01163311  0.02309672  0.01331249  0.01364796
  0.0184548   0.03417477  0.01904512  0.01335866  0.0107724   0.01409371
  0.02207239  0.01790408  0.01230635  0.01457582  0.013089    0.02852978
  0.01597083  0.01303869  0.01357738  0.03432951  0.01474629  0.09243832
  0.01458367  0.01134302  0.01436677  0.00828903  0.01463364  0.05435528
  0.00563149  0.02350692  0.0019122   0.02292351  0.03634875  0.01369582
  0.01520749  0.02138036  0.01275047]
Cross validation scores [ 0.94944302  0.96572408  0.96486718  0.95115681  0.95115681  0.95372751
  0.95372751  0.90651801  0.69725557  0.96397942]
Binary classification based on genre: folk




Feature importance [ 0.01628226  0.01723896  0.00485198  0.00684417  0.00181193  0.01734463
  0.02075172  0.04548033  0.01672834  0.03329179  0.01573231  0.01834581
  0.01811123  0.019026    0.01656705  0.01590275  0.01760004  0.02249908
  0.01694776  0.01799039  0.01609225  0.01515513  0.02191505  0.01798354
  0.01462363  0.01343148  0.01654997  0.01240268  0.01780282  0.01509217
  0.02963389  0.01775735  0.01677081  0.03350406  0.01953698  0.01642302
  0.01524426  0.02282538  0.02204185  0.01375361  0.01874169  0.04960511
  0.00862223  0.03071302  0.00234734  0.03158366  0.06059483  0.01169889
  0.019057    0.02146236  0.01768544]
Cross validation scores [ 0.77549272  0.80891174  0.78834619  0.80719794  0.80119966  0.80034276
  0.82947729  0.83619211  0.81475129  0.82504288]
Binary classification based on genre: pop




Feature importance [ 0.02182772  0.01621887  0.00291871  0.00816116  0.00319046  0.00980762
  0.02928041  0.01463646  0.01076608  0.01045035  0.02266376  0.00944432
  0.00936498  0.0168388   0.00910248  0.0180546   0.0111491   0.01487265
  0.0124207   0.01445312  0.01182155  0.01291704  0.01092923  0.01374258
  0.0154395   0.01539292  0.01219516  0.01114772  0.01103847  0.01599873
  0.04153168  0.06874217  0.01845297  0.06768393  0.03491069  0.13299451
  0.01616231  0.02682112  0.03253268  0.01789947  0.01544618  0.01347133
  0.00386036  0.01243293  0.00130287  0.01452906  0.01519022  0.03397544
  0.0115622   0.0096625   0.01459011]
Cross validation scores [ 0.95034247  0.97174658  0.96232877  0.9554413   0.97512864  0.94682676
  0.97598628  0.97598628  0.96912521  0.96312178]
Binary classification based on genre: dance and electronica




Feature importance [ 0.02422477  0.02424967  0.00423265  0.00731872  0.00240293  0.02029486
  0.0250195   0.02380913  0.03350685  0.03566189  0.02280759  0.03093433
  0.0171148   0.02771985  0.01926233  0.01699782  0.03210534  0.01974538
  0.02201672  0.02818699  0.02832913  0.02238024  0.01489074  0.01730226
  0.02186073  0.01652617  0.0179029   0.0148505   0.01655927  0.01455163
  0.01660929  0.02351774  0.01660691  0.01388086  0.0185022   0.01658036
  0.01453962  0.01492737  0.01619945  0.01330053  0.01677625  0.01914206
  0.00753701  0.03492596  0.00450842  0.02031698  0.02841426  0.02752886
  0.01251364  0.02487136  0.01603521]
Cross validation scores [ 0.96401028  0.96401028  0.96829477  0.96743787  0.96315338  0.96486718
  0.96486718  0.96655232  0.96397942  0.96140652]
Binary classification based on genre: metal




Feature importance [ 0.01867926  0.00574613  0.00179622  0.00282543  0.00068604  0.0398497
  0.02894397  0.01505792  0.01473474  0.01139526  0.04462949  0.06606809
  0.00952374  0.00692094  0.01096831  0.0414523   0.00973542  0.00821228
  0.0090734   0.03813387  0.00982938  0.01098586  0.00722503  0.00991967
  0.01387522  0.01490648  0.00622894  0.02499898  0.01070765  0.01173635
  0.01074909  0.00876715  0.00560393  0.04905549  0.00697546  0.00526748
  0.01839596  0.00458702  0.01718326  0.00635268  0.0344292   0.05427429
  0.00669463  0.02219085  0.00113203  0.02341293  0.12768458  0.01964126
  0.00533996  0.06024245  0.00717424]
Cross validation scores [ 0.97857755  0.95201371  0.96401028  0.96915167  0.96829477  0.9151671
  0.97686375  0.95626072  0.96483705  0.95711835]
Binary classification based on genre: jazz and blues




Feature importance [ 0.01944602  0.02457279  0.00693436  0.00867774  0.00197301  0.03291356
  0.01390587  0.02182461  0.01950478  0.01342518  0.02157982  0.01806858
  0.02148671  0.02197383  0.02431175  0.01734923  0.02016146  0.03079342
  0.02722401  0.02548934  0.03693307  0.01099846  0.0198716   0.02634177
  0.02946246  0.01542394  0.03213199  0.01887458  0.0177773   0.0215891
  0.02195242  0.03026552  0.02491261  0.01950868  0.01952308  0.01820919
  0.01636275  0.02087291  0.01371987  0.0142415   0.01655003  0.01821312
  0.00548691  0.01763391  0.00095324  0.01931215  0.02278269  0.01539137
  0.0243682   0.01712949  0.02159004]
Cross validation scores [ 0.97772065  0.97772065  0.97772065  0.97772065  0.97772065  0.97772065
  0.97772065  0.97772065  0.97855918  0.97854077]
Binary classification based on genre: classical




Feature importance [  2.35224680e-02   6.60459024e-03   6.57444831e-05   6.86681632e-03
   5.98472002e-03   1.18943556e-02   1.12500345e-02   3.61665534e-02
   1.39905604e-02   1.30057547e-02   1.91990377e-02   6.50962609e-02
   2.01268370e-02   7.37644614e-03   2.75068068e-02   4.64676601e-02
   2.25384951e-02   1.54595877e-02   1.12360996e-02   2.80325640e-02
   1.61014425e-02   3.04936066e-02   5.54211323e-03   2.51525866e-02
   1.12731072e-02   1.15621021e-02   4.49345649e-02   4.24146863e-03
   1.87592905e-02   1.04019136e-02   1.87290578e-02   5.40198003e-03
   1.06382288e-02   8.23440972e-03   2.01865543e-02   2.95159401e-02
   2.79999210e-02   1.92938803e-02   1.60761297e-02   2.50976560e-02
   3.45767128e-02   2.96898719e-02   4.21390156e-03   4.36433267e-02
   0.00000000e+00   3.15516449e-02   2.73101209e-02   9.00629004e-03
   1.31346134e-02   3.85096525e-02   1.63365192e-02]
Cross validation scores [ 0.99742931  0.99742931  0.99742931  0.99742931  0.99742931  0.99742931
  0

Binary classification based on genre: hip-hop


Feature importance [ 0.04031328  0.00893404  0.          0.01558569  0.00223602  0.01146715
  0.02454278  0.02251938  0.01657521  0.03005332  0.00653556  0.00705229
  0.01351605  0.02529489  0.00959791  0.03446075  0.01559978  0.00560399
  0.0188912   0.00572008  0.00768972  0.02002858  0.          0.02312075
  0.01160437  0.01169695  0.01626724  0.01670982  0.0210705   0.01054201
  0.02646164  0.02531849  0.02447502  0.01604143  0.03811855  0.13085768
  0.01632318  0.00707422  0.03092645  0.02174561  0.00869874  0.02228736
  0.00468217  0.01488098  0.00554889  0.06992131  0.01142484  0.01733738
  0.00519438  0.01558419  0.03386819]
Cross validation scores [ 0.99400685  0.99400685  0.99400685  0.99400171  0.9948542   0.99399657
  0.99656947  0.99571184  0.99571184  0.9948542 ]
Binary classification based on genre: soul and reggae




Feature importance [ 0.01762691  0.01483982  0.00538987  0.01012638  0.00448668  0.01588301
  0.01339045  0.01708484  0.01913509  0.01726347  0.03180758  0.01533336
  0.01981561  0.01710037  0.02113562  0.01184037  0.02176504  0.02309429
  0.01543914  0.01123516  0.01402021  0.02488612  0.01423302  0.02667493
  0.0360636   0.02045579  0.01520828  0.01914084  0.02375998  0.01830837
  0.02108771  0.01971807  0.01840928  0.02907724  0.0194152   0.0276273
  0.02856864  0.02670327  0.01942977  0.02109147  0.04286648  0.01201135
  0.00981482  0.01502465  0.00204277  0.05882904  0.01733527  0.01605879
  0.01472428  0.02074852  0.02287188]
Cross validation scores [ 0.92722603  0.92887746  0.92202228  0.93916024  0.93744644  0.93658955
  0.9373928   0.9373928   0.91337907  0.91680961]
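Several genres show nearly identical accuracy in every fold (for example 0.97772065 for jazz and blues). This pattern can arise when positives are rare and the classifier mostly predicts the negative class, so it is worth comparing against a majority-class baseline. The sketch below does this on synthetic imbalanced data; the class ratio, sample size, and fold count are assumptions chosen for illustration, not properties of the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 5% positives (an assumed ratio).
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
forest = RandomForestClassifier(n_estimators=50, random_state=0)

baseline_acc = cross_val_score(baseline, X_imb, y_imb, cv=5).mean()
forest_acc = cross_val_score(forest, X_imb, y_imb, cv=5).mean()
# Balanced accuracy averages recall over both classes, so a constant
# prediction scores 0.5 regardless of the class ratio.
forest_bal = cross_val_score(
    forest, X_imb, y_imb, cv=5, scoring="balanced_accuracy"
).mean()

print(f"baseline accuracy:        {baseline_acc:.3f}")
print(f"forest accuracy:          {forest_acc:.3f}")
print(f"forest balanced accuracy: {forest_bal:.3f}")
```

If the forest's plain accuracy is close to the baseline, a metric such as balanced accuracy or F1 gives a clearer picture of how well the minority genre is actually recognized.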
