# Executive Summary

We trained a K-Nearest Neighbors classifier and summarized its predictions in a confusion matrix. At first glance, the confusion matrix suggested that our classifier did a reasonably good job of identifying genres from the features. However, when we calculated the accuracy score, we found that it was only around 49%. Although our classifier was not entirely successful, we still wanted to take a look at feature importance through a Random Forest Classifier. Our feature importance chart showed that the ``'chroma_stft'`` feature held the most weight when it came to predicting genres, and the ``'tempo'`` feature held the least. However, because our accuracy score was low, these results must be interpreted with caution.

# Introduction 

In the music industry today, it is no secret that a lot of songs sound similar to each other. This phenomenon is commonly referred to as the ['pop music formula'](https://www.englishclub.com/vocabulary/music-pop.php#:~:text=Songs%20that%20become%20hits%20almost,and%20two%20or%20more%20verses.) or the 'top hit formula'. From a musical standpoint, it is fairly easy to understand why this phenomenon is so common in today's pop culture (popular chord progressions and simpler, catchier songs), but what about from a more technical viewpoint? What characteristics, on an audio-signal level, define a 'popular' song? What characteristics define specific genres?

For this project, we want to predict the genre of a sampled audio file (a song) based on the audio-signal characteristics given in a dataset. We do this by training a K-Nearest Neighbors classifier on that dataset. We also look at feature importance through a Random Forest classifier so we can determine which audio-signal characteristics are critical for correctly identifying a genre.

# Data Description

In [1]:
import pandas as pd

# Load the Kaggle CSV of audio features
df = pd.read_csv('music_feats.csv')
# Clean up the data: keep only the features that we feel are important and that we can explain
df = df[['tempo', 'beats', 'chroma_stft', 'rmse', 'spectral_centroid',
         'spectral_bandwidth', 'rolloff', 'zero_crossing_rate']]
df.head()

|   | tempo | beats | chroma_stft | rmse | spectral_centroid | spectral_bandwidth | rolloff | zero_crossing_rate |
|---|-------|-------|-------------|------|-------------------|--------------------|---------|--------------------|
| 0 | 103.359375 | 50 | 0.38026 | 0.248262 | 2116.942959 | 1956.611056 | 4196.10796 | 0.127272 |
| 1 | 95.703125 | 44 | 0.306451 | 0.113475 | 1156.070496 | 1497.668176 | 2170.053545 | 0.058613 |
| 2 | 151.999081 | 75 | 0.253487 | 0.151571 | 1331.07397 | 1973.643437 | 2900.17413 | 0.042967 |
| 3 | 184.570312 | 91 | 0.26932 | 0.119072 | 1361.045467 | 1567.804596 | 2739.625101 | 0.069124 |
| 4 | 161.499023 | 74 | 0.391059 | 0.137728 | 1811.076084 | 2052.332563 | 3927.809582 | 0.07548 |


In [2]:
import plotly.graph_objects as go

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=df['tempo'],
    y=df['beats'],
    z=df['chroma_stft'],
    mode='markers',
    marker=dict(
        size=5,
        color=df['rmse'],  # marker color encodes the 'rmse' feature
        colorscale='Viridis',
        opacity=0.8
    ),
    text=df['spectral_centroid']  # hover text shows the 'spectral_centroid' value
)])

# Update the layout of the plot
fig.update_layout(
    scene=dict(
        xaxis_title='Tempo',
        yaxis_title='Beats',
        zaxis_title='Chroma STFT'
    ),
    margin=dict(l=0, r=0, b=0, t=0)
)

# Show the plot
fig.show()

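We wanted a single graph that shows as many features as possible, so we constructed this interactive 3D visualization. The x, y, and z axes are labeled with their features (``'tempo'``, ``'beats'``, and ``'chroma_stft'``), and marker color is determined by the ``'rmse'`` feature. Hovering over a point shows its ``'spectral_centroid'`` value.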

You can look at our graph [here](https://ibb.co/gzS6S9M) for context.

In addition to combining as many features as possible into one graph, we found it helpful to plot them separately as well. The image in the link above shows each feature plotted on its own subplot, along with its units.

# Method

For our method, we used a K-Nearest Neighbors classifier, as mentioned above.
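To make the steps concrete, here is a minimal from-scratch sketch of K-Nearest Neighbors classification followed by a confusion matrix and an accuracy score. It uses only the Python standard library so that each step is visible; in practice the same steps map onto scikit-learn's `KNeighborsClassifier`, `confusion_matrix`, and `accuracy_score`. Everything specific here is an assumption for illustration: the synthetic data, the genre names `'classical'` and `'metal'`, the choice of `k=5`, the 75/25 split, and the standardization step. Standardization is included because features such as `'spectral_centroid'` (thousands) and `'zero_crossing_rate'` (fractions) live on very different scales, and unscaled distances would be dominated by the larger ones.

```python
import math
import random
from collections import Counter

def standardize(train_X, test_X):
    """Scale features to zero mean and unit variance using training statistics only."""
    cols = list(zip(*train_X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scale(train_X), scale(test_X)

def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Synthetic stand-in data (hypothetical): each sample is
# [tempo, spectral_centroid, zero_crossing_rate], drawn around a per-genre center.
rng = random.Random(0)
def make_samples(center, label, n=40):
    return [([rng.gauss(mu, sigma) for mu, sigma in center], label) for _ in range(n)]

data = (make_samples([(100, 10), (1200, 150), (0.05, 0.01)], 'classical')
        + make_samples([(150, 10), (2500, 150), (0.12, 0.01)], 'metal'))
rng.shuffle(data)

split = int(0.75 * len(data))  # 75% train, 25% test
train_X, train_y = [x for x, _ in data[:split]], [y for _, y in data[:split]]
test_X, test_y = [x for x, _ in data[split:]], [y for _, y in data[split:]]

train_X, test_X = standardize(train_X, test_X)
pred_y = [knn_predict(train_X, train_y, q, k=5) for q in test_X]

labels = ['classical', 'metal']
matrix = confusion_matrix(test_y, pred_y, labels)
acc = accuracy(matrix)
print(matrix)
print(f"accuracy: {acc:.2f}")
```

The synthetic genres are deliberately well separated, so the accuracy printed here says nothing about the real dataset. The point of the sketch is the pipeline: split, scale, vote among the nearest neighbors, then read accuracy off the diagonal of the confusion matrix.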