Comprehensive EDA, class consolidation and SVM

In [1]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import zscore
import fasttext
import os
%matplotlib inline



In [3]:
data.info()

NameError: name 'data' is not defined

In [4]:
data[data.isnull().any(axis=1)]

In [5]:


data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)
data.info()



In [6]:
data.describe()

In [7]:
data.describe(include=['O'])

We can see that there are 17 features and one label column (music_genre). Of the features, 12 are numerical (one of which, tempo, is misclassified as categorical and will be dealt with later), and 5 are categorical.
We can also already see hints of hidden missing values in 3 features ('tempo', 'artist_name' and 'duration_ms'). Those will be dealt with shortly, one by one.
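
As a rough illustration of how hidden missing values can be counted, here is a minimal sketch on a toy frame. The placeholder values ('empty_field', '?' and -1) are the ones this notebook identifies further down; the toy data and the names `toy`, `sentinels` and `hidden_missing` are made up for the example.

```python
import pandas as pd

# Toy frame (hypothetical data) using the placeholder values found in this data set.
toy = pd.DataFrame({
    "artist_name": ["empty_field", "A", "B", "empty_field"],
    "tempo": ["120.0", "?", "98.5", "?"],
    "duration_ms": [-1.0, 210000.0, 180000.0, 240000.0],
})

# Map each column to the placeholder that stands in for a missing value.
sentinels = {"artist_name": "empty_field", "tempo": "?", "duration_ms": -1}

# Share of rows per column that hold the placeholder.
hidden_missing = pd.Series(
    {col: (toy[col] == marker).mean() for col, marker in sentinels.items()}
)
print(hidden_missing)
```

Note that `isnull()` would report none of these rows as missing, which is why they do not show up in `data.info()`.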

In [8]:
# check if the data is balanced
data['music_genre'].value_counts()


There are 10 different genres with equal distribution (balanced data). This means the accuracy score will be a good metric to use.
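
To make the link between balance and accuracy concrete, here is a small sketch on toy labels (not the real data). For balanced classes, always predicting the most frequent class does no better than random guessing, so accuracy is not inflated by a dominant class. The variable names are illustrative only.

```python
import pandas as pd

# Toy labels: 3 genres, 4 tracks each (perfectly balanced).
labels = pd.Series(["Rock"] * 4 + ["Jazz"] * 4 + ["Rap"] * 4)

shares = labels.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class.
majority_baseline = shares.max()

# Accuracy expected from uniform random guessing over k classes.
chance_level = 1 / shares.size

print(f"majority baseline: {majority_baseline:.3f}, chance: {chance_level:.3f}")
```

With 10 balanced genres, both baselines would be 10%.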


Exploring the features one by one:



Instance_id:


In [9]:
data = data.drop(columns=['instance_id'])


Artist's Names:


In [10]:


print(f"There are {data['artist_name'].nunique()} unique artists in the set")



In [11]:
data['artist_name'].describe()

In [12]:


missing_artist = data[data['artist_name'] == 'empty_field']
missing_artist.head()



In [13]:


print(f"Percent of missing artist names: {(missing_artist.shape[0]/data.shape[0])*100:2.4}%")




5% of the observations are missing the artist's name (marked as 'empty_field'), but these entries are otherwise valid. We will not drop these observations.

In [14]:


data[data['artist_name'] != 'empty_field'].groupby('artist_name')['music_genre'].nunique().value_counts(normalize=False)



For the entries that do contain an artist's name, it seems that a song that comes from a particular artist has an ~80% chance of belonging to one specific genre.
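
A minimal sketch of how such a share can be computed, using a toy frame rather than the real data: count the distinct genres per artist, then take the fraction of artists with exactly one. The toy values and the name `single_genre_share` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical data): artist B appears under two genres.
toy = pd.DataFrame({
    "artist_name": ["A", "A", "B", "B", "C", "D", "D"],
    "music_genre": ["Rock", "Rock", "Rap", "Hip-Hop", "Jazz", "Blues", "Blues"],
})

# Number of distinct genres per artist.
genres_per_artist = toy.groupby("artist_name")["music_genre"].nunique()

# Fraction of artists that appear under exactly one genre.
single_genre_share = (genres_per_artist == 1).mean()
print(f"{single_genre_share:.0%} of artists map to a single genre")
```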

However, in its current form it's not helpful for classifying songs from artists outside the training set. We'll need to extract more general features, starting with the simplest: name length.

In [15]:


# find the length of the artists names
data['length_name'] = data['artist_name'].str.len()



In [16]:


data[data['artist_name'] != 'empty_field'].groupby('music_genre')['length_name'].describe().T


In [20]:
genre_counts = data['music_genre'].value_counts()

# Plot the pie chart
plt.figure(figsize=(10, 6))
genre_counts.plot.pie(autopct='%.1f%%', startangle=90)
plt.title("Music Genre Distribution")
plt.ylabel('')  # Remove the y-axis label ('genre')

# Save the pie chart to a file
plt.savefig('music_genre_distribution.png', bbox_inches='tight')

# Show the pie chart (optional, since it's already saved)
plt.show()

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data[data['artist_name'] != 'empty_field'], x='music_genre', y='length_name')
plt.show()




From the above statistics it seems that classical music tends to have noticeably longer artist names. This could potentially be a useful feature. We'll keep it in for now.

In [None]:
data['track_name'].describe()

In [None]:


# generate track name length
data['length_track_name'] = data['track_name'].str.len()



In [None]:


data.groupby('music_genre')['length_track_name'].describe()



In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='length_track_name')
plt.show()

The classical genre also has longer track names. The difference is much more pronounced than for the artist's name. This is a better feature for us to use, since it doesn't have the missing values problem. We'll use this and not the artist name length feature.

In [None]:


# drop 'length_name' feature
data = data.drop(columns = ['length_name'])




Next we'll make a feature out of the sample's language. Specifically, whether it's written in Japanese. This may help us identify the Anime genre, which is likely to contain track/artist names written in Japanese script, as shown below:


In [None]:
# download pretrained language identification model
os.system("wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz")

In [None]:
data[data['music_genre'] == 'Anime'].head()

In [None]:
PRETRAINED_MODEL_PATH = 'lid.176.ftz'
model = fasttext.load_model(PRETRAINED_MODEL_PATH)

In [None]:
def find_japanese(df):
    '''
    Returns a list of 0's and 1's (one per row), as well as an array holding the
    confidence of each positive (Japanese) prediction.
    1 - if either the artist or track name is detected as Japanese.
    0 - otherwise
    '''
    jap = []
    confidence = []
    
    for _, row in df.iterrows():      
        pred_track, confidence_track = model.predict(row['track_name'])
        pred_track = pred_track[0].split('__')[-1]
        pred_artist, confidence_artist = model.predict(row['artist_name'])
        pred_artist = pred_artist[0].split('__')[-1]

        # check the confidence of the language detection
        if (pred_track == 'ja') or (pred_artist == 'ja'):
            jap.append(1)
            confidence.append(np.max([confidence_track[0], confidence_artist[0]]))
        else:
            jap.append(0)
    
    return jap, np.array(confidence)

In [None]:
data['Japanese'], confidence = find_japanese(data[['artist_name', 'track_name']])

In [None]:
print(f"The average confidence level for the Japanese predictions is {confidence.mean():1.2} +/- {confidence.std():1.2}")

In [None]:


data.groupby('music_genre')['Japanese'].value_counts(normalize=True)



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

jap = data.groupby('music_genre')['Japanese'].value_counts(normalize=True)
jap = jap.unstack()
jap.plot(kind='bar', stacked=True)

plt.xlabel('Music Genre')
plt.ylabel('Proportion of tracks')
plt.title('Japanese Songs by Music Genre')

plt.show()

More than 20% of the Anime tracks are indeed written in Japanese, a much higher percentage than all the other music genres combined. This could indeed help us identify the Anime genre.

In [None]:


data = data.drop(columns=['artist_name', 'track_name'])



Popularity

In [None]:
data.groupby('music_genre')['popularity'].describe()

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='popularity')
plt.show()



This feature shows a nice spread of distributions for the different genres. Could definitely be useful for classification.

Rap, Hip-Hop and Rock seem to be the most popular genres, while Anime, Blues and Classical are the least popular. The other 4 genres are somewhere in between.

Acousticness:

In [None]:
data.groupby('music_genre')['acousticness'].describe()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='acousticness')
plt.show()

Danceability

In [None]:
data.groupby('music_genre')['danceability'].describe()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='danceability')
plt.show()


Classical music sticks out again, but Rap and Hip-Hop can also be distinguished from the rest (they seem to go together often).


Duration

In [None]:
data.groupby('music_genre')['duration_ms'].describe()


-1.0 is obviously not a valid time measurement. These are missing values.


In [None]:
miss_duration = data[data['duration_ms'] == -1].shape[0]
num_obs_tot = data.shape[0]
print(f"There are {miss_duration} missing values, which accounts for {(miss_duration/num_obs_tot)*100:2.4}% of the data points.")

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data[data['duration_ms'] != -1], x='music_genre', y='duration_ms')
plt.show()



Almost 10% of the entries are missing a duration. We don't want to remove such a large number of observations, so we'll fill in the missing values with the median, but consider removing the feature entirely in the future.
Also of note is the fact that this feature contains extreme outliers. They could be important for classification, but we'll consider removing them at a later stage.

In [None]:


# fill in median for missing values
mask_duration = data['duration_ms'] != -1
median_duration = data.loc[mask_duration, 'duration_ms'].median()
data.loc[~mask_duration, 'duration_ms'] = median_duration



Energy

In [None]:
data.groupby('music_genre')['energy'].describe()

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='energy')
plt.show()




As usual, classical music stands out (and Jazz to a much lesser degree). Rap and Hip-Hop still match each other.


Instrumentalness

In [None]:


data.groupby('music_genre')['instrumentalness'].describe()



In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='instrumentalness')
plt.show()



In [None]:
sns.histplot(x='instrumentalness',data=data);

In [None]:
inst_0 = data[data['instrumentalness'] == 0].shape[0]
num_obs = data.shape[0]
print(f"There are {inst_0} observations with 0.0 instrumentalness, which accounts for {(inst_0/num_obs)*100:2.4}% of the data points")

Such a large number of 0.0 entries likely indicates missing values rather than real data points. Since this is about a third of our observations, we won't fill in missing values. Instead, we'll discard this feature entirely.

In [None]:
data = data.drop(columns=['instrumentalness'])

Key

In [None]:
data['key'].unique()

In [None]:
sns.catplot(x="music_genre", hue="key",data=data, kind="count",height=5, aspect=3.0, palette = 'Set1');

In [None]:
data.groupby('music_genre')['key'].describe()

In [None]:


# One Hot Encoding
data = pd.get_dummies(data, drop_first=True, prefix='key', columns=['key'])




Different genres have noticeably different spreads. We'll keep this feature, but use One Hot Encoding to make it useful.
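
As a small illustration of what `drop_first=True` does in `pd.get_dummies`, here is a sketch on a toy column (the three key values are chosen only for the example). With k categories, k - 1 indicator columns are produced, and a row of all zeros stands for the dropped category.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "C#", "G", "A"]})

# drop_first=True drops the first category in sorted order ("A" here), so
# k categories become k - 1 indicator columns; an all-zero row then means "A".
encoded = pd.get_dummies(toy, columns=["key"], prefix="key", drop_first=True)
print(list(encoded.columns))
```

Dropping one level avoids perfectly collinear indicator columns, at the cost of making the dropped category implicit.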


Liveness

In [None]:
data.groupby('music_genre')['liveness'].describe()

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='liveness')
plt.show()



In [None]:
data.groupby('music_genre')['liveness'].plot.kde()
plt.legend()
plt.xlim([0,1])
plt.show()


The distributions seem similarly skewed for all genres, so this feature will likely not contribute much to the model. We'll try both with and without this feature.


Loudness

In [None]:
data.groupby('music_genre')['loudness'].describe()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='loudness')
plt.show()


As usual, classical music is far from the rest, with Jazz (and Blues) also differing from the rest somewhat.


Mode

In [None]:
data['mode'].unique()

In [None]:
data.groupby('music_genre')['mode'].describe()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x="music_genre", hue="mode",data=data)
plt.legend(loc=0)
plt.show()


All genres seem to have a preference for the "Major" mode, but to different degrees. It is the most pronounced in the Country genre. We'll use this feature after one hot encoding.

In [None]:


# One Hot Encoding
data = pd.get_dummies(data, drop_first=True, columns=['mode'])



Speechiness

In [None]:
data.groupby('music_genre')['speechiness'].describe()

In [None]:


plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='speechiness')
plt.show()




This feature should contribute especially to identifying Hip-Hop and Rap.


Tempo

In [None]:
data.groupby('music_genre')['tempo'].describe()


This feature should be numeric. The "?" is a missing value.
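
One possible way to do this kind of conversion, sketched on a toy series: `pd.to_numeric` with `errors="coerce"` turns any unparseable entry, such as "?", into NaN in a single step. This is an alternative shown for illustration, not necessarily the approach taken in this notebook; the toy values are made up.

```python
import numpy as np
import pandas as pd

tempo_raw = pd.Series(["120.5", "?", "98.0", "?"])

# errors="coerce" turns anything unparseable (here "?") into NaN.
tempo = pd.to_numeric(tempo_raw, errors="coerce")
missing_share = tempo.isna().mean()

# Fill the gaps with the median of the observed values.
tempo_filled = tempo.fillna(tempo.median())
print(tempo_filled.tolist())
```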


In [None]:


print(f"This feature contains {(data[data['tempo'] == '?'].shape[0]/data.shape[0])*100:2.4}% missing values")



In [None]:


# replace "?" with np.nan and correctly classify the feature:
data.loc[data['tempo'] == '?', 'tempo'] = np.nan
data = data.astype({'tempo': np.float64})



In [None]:
data.groupby('music_genre')['tempo'].describe()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='tempo')
plt.show()



The variation between genres is not great. We'll fill missing values with the median, but consider dropping the feature altogether in the future.


In [None]:


median_tempo = data['tempo'].median()
data['tempo'] = data['tempo'].fillna(median_tempo)




Obtained date:


In [None]:
data['obtained_date'].unique()

In [None]:
data.groupby('music_genre')['obtained_date'].describe()


This column only gives the 4 dates on which the data was obtained. It is not useful to us, so we'll drop it.


In [None]:
data = data.drop(columns=['obtained_date'])


Valence:

In [None]:
data.groupby('music_genre')['valence'].describe()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=data, x='music_genre', y='valence')
plt.show()


Again, only classical music truly stands out from the rest.


EDA Summary:

Labels:

There are 10 equally likely musical genres (balanced dataset):

    Alternative
    Anime
    Blues
    Classical
    Country
    Electronic
    Hip-Hop
    Jazz
    Rap
    Rock


Features:

Useful features:

    popularity - left as is.
    acousticness - left as is.
    danceability - left as is.
    duration_ms - 10% of the entries had missing values, they were filled in with the median.
    energy - left as is.
    key - a categorical column containing 12 unique categories. One hot encoding was used.
    liveness - left as is with a caveat: may be removed later on due to a lack of variance between genres.
    loudness - left as is.
    mode - a categorical column containing only 2 unique categories. One hot encoding was used.
    speechiness - left as is.
    tempo - contained 10% missing values and was misclassified as categorical. The missing values were filled in with the median and the feature was converted to numerical. Caveat: the distributions are very similar between genres, so it might be removed later on.
    valence - left as is.


Unhelpful features that were removed:

    instance_id - only an index.
    obtained_date - only contains the 4 consecutive dates of data acquisition.
    instrumentalness - contains 30% missing values.
    artist_name and track_name - were used to obtain new features (see below) and then discarded.



New features:

    length_track_name - the track_name feature has essentially been converted to the length of the name. This feature helps identify the Classical genre.
    Japanese - this feature indicates whether the track/artist name is written in Japanese. This helps identify the Anime genre.


General observations:

It would appear that most genres tend to have very similar distributions in most features, making it hard for any model to distinguish between them. The obvious exception is classical music, which has very different distributions in many features. The Anime genre also shows some distinguishing characteristics, as does Jazz, to a much lesser extent. Hip-Hop and Rap are extremely similar to one another in all features, but are separate from the rest of the genres in some features. They might be easier to identify as one joint genre. All in all, the current features are unlikely to give great predictions for all genres. We'll find out if this is true soon enough.


2. Final Preprocessing



Separate features from labels and encode labels:


In [None]:
data['music_genre'] = data['music_genre'].astype('category')
y = data['music_genre'].cat.codes
y_names = list(data['music_genre'].cat.categories)

X = data.drop(columns=['music_genre'])

In [None]:
scaler = RobustScaler()
scaler.fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

In [None]:
X_scaled.describe()


Outlier Removal:


In [None]:


numerical_feats = ['popularity', 'acousticness', 'danceability', 'duration_ms', 'energy',
                   'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'length_track_name']

plt.figure(figsize=(16,8))
sns.boxplot(data=X_scaled[numerical_feats])
plt.show()



As can be clearly seen, some features, especially the duration feature, contain extreme outliers. These outliers can hinder the success of all models, so we'll remove them.

In [None]:


# keep only rows whose numerical features all lie within 4 standard deviations of the mean
X_no_outliers = X_scaled[(np.abs(zscore(X_scaled[numerical_feats])) < 4).all(axis=1)]
y_no_outliers = y.loc[X_no_outliers.index]
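
The z-score filter can be illustrated on a toy frame with one planted extreme value. This is only a sketch: the data, the column names `a` and `b`, and the variable names are made up, while the threshold of 4 matches the one used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame (hypothetical data): 50 rows, with one extreme value planted in column "a".
toy = pd.DataFrame({
    "a": np.linspace(-1.0, 1.0, 50),
    "b": np.linspace(0.0, 1.0, 50),
})
toy.loc[0, "a"] = 100.0

# Column-wise z-scores; keep rows where every column is within 4 standard deviations.
keep = (np.abs(zscore(toy.to_numpy())) < 4).all(axis=1)
filtered = toy[keep]

# Labels are aligned by index so they stay matched to the surviving rows.
labels = pd.Series(np.arange(len(toy)), index=toy.index)
labels_filtered = labels.loc[filtered.index]
print(len(toy), "->", len(filtered))
```

Because the mean and standard deviation are themselves pulled by extreme values, this filter is fairly conservative.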



In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(data=X_no_outliers[numerical_feats])
plt.show()



(Note: Some of the features are clearly skewed. However, using log/boxcox on them did not improve the final results, and so it is omitted here.)


Check correlation and reduce dimensionality with PCA:

In [None]:


plt.figure(figsize=(10,8))
sns.heatmap(X_no_outliers.corr(), annot=False)
plt.show()




Let's zoom in on the upper left corner where there are some noticeable correlations:


In [None]:


plt.figure(figsize=(10,8))
sns.heatmap(X_no_outliers.iloc[:,:11].corr(), annot=True)
plt.show()




Loudness, acousticness and energy are highly correlated. PCA will address this while also reducing the dimensionality of our data set.


In [None]:
pca = PCA().fit(X_no_outliers)

# find the smallest n for which the first n components account for at least 95% of the variance
cum_exp_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_exp_var >= 0.95)) + 1

X_pca = pca.transform(X_no_outliers)[:,:n_components]
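
For reference, here is a minimal sketch of choosing the number of components for a 95% variance threshold on toy data, done two ways: manually from the cumulative explained variance, and by passing a float `n_components` to scikit-learn's `PCA`, which selects the count itself. The toy matrix and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data (hypothetical): 6 columns with decreasing spread.
X_toy = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# Manual route: smallest n whose cumulative explained variance reaches 95%.
cum = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)
n_manual = int(np.argmax(cum >= 0.95)) + 1

# Built-in route: a float n_components asks scikit-learn for the same threshold.
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X_toy)
X_reduced = pca_auto.transform(X_toy)

print(n_manual, pca_auto.n_components_, X_reduced.shape)
```

Both routes should agree on the number of components retained.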

In [None]:
X_no_outliers.info()

In [None]:


# hold out 30% of the outlier-free data as a test set (note: this splits X_no_outliers, not X_pca)
x_train, x_test, y_train, y_test = train_test_split(X_no_outliers, y_no_outliers, test_size=0.3)



In [None]:
from sklearn.model_selection import cross_val_score

def get_metrics(model, X, y, y_names):
    clf = model

    # Split data into train and validation sets
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

    # Fit the model on the training set
    clf.fit(X_train, y_train)

    # Evaluate the model on the training set using cross-validation
    cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"Cross-validation scores: {cv_scores}")
    print(f"Mean CV score: {cv_scores.mean()*100:2.4}%")

    # Make predictions on the training and validation sets
    predict_train = clf.predict(X_train)
    predict_valid = clf.predict(X_valid)

    # Print accuracy scores and classification report for the validation set
    print(f"Train accuracy score: {accuracy_score(y_train, predict_train)*100:2.4}%")
    print(f"Validation accuracy score: {accuracy_score(y_valid, predict_valid)*100:2.4}%\n")
    print('Classification Report for the validation set:\n')
    print(classification_report(y_valid, predict_valid, target_names=y_names))
    
    # Plot the confusion matrix for the validation set
    print('Confusion Matrix:\n')
    plt.figure(figsize=(8,6))
    sns.heatmap(confusion_matrix(y_valid, predict_valid), annot = True, fmt = ".0f", 
                cmap = "coolwarm", linewidths = 1, linecolor = "white",
                xticklabels = y_names, yticklabels = y_names)
    
    plt.show()

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

def get_metrics_importance(model, X, y, y_names):
    clf = model

    # Split data into train and validation sets
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

    # Fit the model on the training set
    clf.fit(X_train, y_train)

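    # Get feature importances using permutation importance
    result = permutation_importance(clf, X_train, y_train, n_repeats=10)
    # Permutation importances can be negative; clip them to 0 because a pie chart requires non-negative values
    importances = np.clip(result.importances_mean, 0, None)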

    # Get the feature names
    feature_names = X.columns

    # Plot feature importances on a pie graph
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(importances, labels=feature_names, autopct='%1.1f%%')
    ax.set_title('Feature Importances')
    plt.show()

    # Evaluate the model on the training and validation sets
    predict_train = clf.predict(X_train)
    predict_valid = clf.predict(X_valid)

    # Print accuracy scores and classification report for the validation set
    print(f"Train accuracy score: {accuracy_score(y_train, predict_train)*100:2.4}%")
    print(f"Validation accuracy score: {accuracy_score(y_valid, predict_valid)*100:2.4}%\n")
    print('Classification Report for the validation set:\n')
    print(classification_report(y_valid, predict_valid, target_names=y_names))
    
    # Plot the confusion matrix for the validation set
    print('Confusion Matrix:\n')
    plt.figure(figsize=(8,6))
    sns.heatmap(confusion_matrix(y_valid, predict_valid), annot = True, fmt = ".0f", 
                cmap = "coolwarm", linewidths = 1, linecolor = "white",
                xticklabels = y_names, yticklabels = y_names)
    
    plt.show()


SVC

In [None]:
get_metrics(SVC(), x_train, y_train, y_names)

The precision score for each class represents the proportion of true positives among all predicted positives, while the recall score represents the proportion of true positives among all actual positives. The F1-score is the harmonic mean of precision and recall, and provides a balanced measure of the two metrics.
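
These definitions can be checked by hand on a toy binary problem. The sketch below uses made-up labels (1 standing for one genre, 0 for everything else) and compares the manual counts with scikit-learn's `precision_score`, `recall_score` and `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary problem (hypothetical labels): 1 = "Classical", 0 = anything else.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count true positives, false positives and false negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)   # true positives among predicted positives
recall = tp / (tp + fn)      # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```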

From the report, we can see that the model achieved the highest precision, recall, and F1-score for the Classical music category, followed by Hip-Hop and Rap. The model performed relatively well on the other three categories (Anime; Jazz, Blues and Electronic; and Rock, Alternative and Country), all with F1-scores of 0.75.

The overall accuracy of the model on the validation set was 0.77, indicating that the model correctly predicted the genre of the song in 77% of cases. The weighted average of the evaluation metrics was also 0.77, indicating that the model performed similarly well across all categories.



KNN

In [None]:
get_metrics(KNeighborsClassifier(), x_train, y_train, y_names)


Random Forest:


In [None]:
get_metrics(RandomForestClassifier(n_estimators=100,
    
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=10,
    min_weight_fraction_leaf=0.0,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
    ccp_alpha=0.0,
    max_samples=None,), x_train, y_train, y_names)



*Note the obvious overfitting apparent in this case (and in KNN to a lesser degree).


It would appear that all of the models we tried were almost equally ineffective (~40-50% accuracy), but SVC does better than the rest and avoids overfitting.
Before analyzing further, we'll do a short sanity check to see whether this is related to the use of PCA, to outlier removal, or to the features that we engineered or filled missing values in.

Try with fewer features, without PCA, and keeping the outliers in:

In [None]:


X_reduced = X_scaled.drop(['duration_ms', 'liveness', 'tempo', 'length_track_name', 'Japanese'], axis=1)
x_train_check, x_test_check, y_train_check, y_test_check = train_test_split(X_reduced, y, test_size=0.3)

get_metrics(SVC(), x_train_check, y_train_check, y_names)



This had no effect on the poor performance.

Modeling Summary:

SVM, KNN, and Random Forest models were used on the data set, but all yielded bad results (40-50% accuracy). SVM was the best out of a bad bunch.

Possible effects of outliers, dimensionality reduction and missing values were tested and discarded.

The next section will analyze the results in more depth and propose a solution.

Consolidating classes

An important thing to note is that as we hypothesized from the EDA, one genre, classical music, has much higher precision/recall scores than all the rest. This is also true to a lesser extent regarding the Anime genre.

Meanwhile, Hip-Hop and Rap seem to be interchangeable and can't be distinguished from one another, but they are relatively well separated from the rest of the genres (as evidenced by the confusion matrix).

Rock, Alternative and maybe Country also seem to have some similarity, resulting in misclassification. The same goes for Blues and Jazz, and maybe Electronic. These combinations are much more tentative, though.
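
One way such a consolidation could be expressed is a mapping from each original genre to a merged group, followed by re-encoding the labels. The grouping below is a hypothetical sketch that follows the tentative combinations described above; the group names, the `genre_groups` dictionary and the toy series are assumptions, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical grouping based on the confusions described above.
genre_groups = {
    "Hip-Hop": "Hip-Hop/Rap",
    "Rap": "Hip-Hop/Rap",
    "Rock": "Rock/Alternative/Country",
    "Alternative": "Rock/Alternative/Country",
    "Country": "Rock/Alternative/Country",
    "Blues": "Blues/Jazz/Electronic",
    "Jazz": "Blues/Jazz/Electronic",
    "Electronic": "Blues/Jazz/Electronic",
    "Classical": "Classical",
    "Anime": "Anime",
}

# Toy labels standing in for the music_genre column.
genres = pd.Series(["Rap", "Hip-Hop", "Jazz", "Classical", "Country", "Anime"])

# Map to the merged groups, then encode them as integer codes.
merged = genres.map(genre_groups).astype("category")
y_merged = merged.cat.codes
merged_names = list(merged.cat.categories)
print(merged_names)
```

With this grouping the 10 original genres collapse into 5 classes, so the chance level for a balanced guess is no longer 10%, and the merged classes are no longer equally sized.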

In [None]:

from sklearn.metrics import precision_score, recall_score, f1_score

def get_metrics(model, X, y, feature_names):
    # Split data into train and validation sets
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

    # Fit the model on the training set
    model.fit(X_train, y_train)

    # Calculate predictions
    y_pred = model.predict(X_valid)

    # Calculate and print the accuracy, precision, recall, and F1 scores
    accuracy = accuracy_score(y_valid, y_pred)
    precision = precision_score(y_valid, y_pred, average='weighted')
    recall = recall_score(y_valid, y_pred, average='weighted')
    f1 = f1_score(y_valid, y_pred, average='weighted')

    print(f'Accuracy: {accuracy:.2f}')
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'F1 score: {f1:.2f}')

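    # Get feature importances (only available on models that expose feature_importances_,
    # such as RandomForestClassifier; SVC and KNeighborsClassifier do not)
    importances = model.feature_importances_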

    # Set negative importances to 0
    importances = np.maximum(importances, 0)

    # Plot feature importances on a pie graph
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(importances, labels=feature_names, autopct='%1.1f%%')
    ax.set_title('Feature Importances')
    plt.show()
