# Audio data classification

## 01 Prepare dataset

First, we prepare the dataset by importing the raw data as pandas DataFrames and then merging them into a single DataFrame.

In [2]:
# Import necessary package
import pandas as pd

# Read in track metadata with genre labels
tracks = pd.read_csv('tracks.csv')

# Read in the per-track audio features
audio_features = pd.read_json('audio_features.json', precise_float=True)

# Merge the genre labels from tracks into audio_features on track_id
spotify = audio_features.merge(tracks[['track_id', 'genre_top']], on='track_id')
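
Note that `merge` performs an inner join by default, so any `track_id` missing from either table is dropped without warning. The sketch below illustrates this on two toy frames. The names `toy_tracks` and `toy_features` and all values in them are made up for illustration and do not come from the dataset. The `validate` argument is an optional guard that raises an error if `track_id` is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only)
toy_tracks = pd.DataFrame({
    'track_id': [1, 2, 3],
    'genre_top': ['Rock', 'Hip-Hop', 'Rock'],
    'title': ['a', 'b', 'c'],
})
toy_features = pd.DataFrame({
    'track_id': [2, 3, 4],
    'tempo': [120.0, 95.5, 140.2],
})

# Inner join (the default): only track_ids present in both frames survive.
# validate='one_to_one' raises a MergeError if track_id repeats on either side.
toy_merged = toy_features.merge(
    toy_tracks[['track_id', 'genre_top']],
    on='track_id',
    how='inner',
    validate='one_to_one',
)
print(toy_merged)
```

Here tracks 1 and 4 disappear from the result because each exists in only one table. Comparing `len(audio_features)` with `len(spotify)` is a quick way to see whether the real merge lost any rows.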

We then inspect the data.

In [3]:
print('First few rows')
print(spotify.head())

First few rows
   track_id  acousticness  danceability    energy  instrumentalness  liveness  \
0         2      0.416675      0.675894  0.634476          0.010628  0.177647   
1         3      0.374408      0.528643  0.817461          0.001851  0.105880   
2       341      0.977282      0.468808  0.134975          0.687700  0.105381   
3     46204      0.953349      0.498525  0.552503          0.924391  0.684914   
4     46205      0.613229      0.500320  0.487992          0.936811  0.637750   

   speechiness    tempo   valence genre_top  
0     0.159310  165.922  0.576661   Hip-Hop  
1     0.461818  126.957  0.269240   Hip-Hop  
2     0.073124  119.646  0.430707      Rock  
3     0.028885   78.958  0.430448      Rock  
4     0.030327  112.667  0.824749      Rock  
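
Since `genre_top` is the label column, it can also be useful to look at how the classes are distributed. The following is a minimal sketch that reuses only the `track_id` and `genre_top` values from the five rows printed above. The name `sample` is hypothetical, and on the full data the same calls would be applied to `spotify` instead.

```python
import pandas as pd

# The five rows shown by spotify.head(), reduced to id and label
sample = pd.DataFrame({
    'track_id': [2, 3, 341, 46204, 46205],
    'genre_top': ['Hip-Hop', 'Hip-Hop', 'Rock', 'Rock', 'Rock'],
})

# Absolute number of tracks per genre
genre_counts = sample['genre_top'].value_counts()

# Fraction of tracks per genre; a strong imbalance here may
# need to be accounted for when training a classifier
genre_share = sample['genre_top'].value_counts(normalize=True)

print(genre_counts)
print(genre_share)
```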


In [4]:
print('Features summary statistics')
print(spotify.describe())

Features summary statistics
            track_id  acousticness  danceability       energy  \
count    4802.000000  4.802000e+03   4802.000000  4802.000000   
mean    30164.871720  4.870600e-01      0.436556     0.625126   
std     28592.013796  3.681396e-01      0.183502     0.244051   
min         2.000000  9.491000e-07      0.051307     0.000279   
25%      7494.250000  8.351236e-02      0.296047     0.450757   
50%     20723.500000  5.156888e-01      0.419447     0.648374   
75%     44240.750000  8.555765e-01      0.565339     0.837016   
max    124722.000000  9.957965e-01      0.961871     0.999768   

       instrumentalness     liveness  speechiness        tempo      valence  
count       4802.000000  4802.000000  4802.000000  4802.000000  4802.000000  
mean           0.604096     0.187997     0.104877   126.687944     0.453413  
std            0.376487     0.150562     0.145934    34.002473     0.266632  
min            0.000000     0.025297     0.023234    29.093000     0.01439

Next, we check for null or missing values.

In [5]:
print('Null or missing values')
print(spotify.isnull().sum())

Null or missing values
track_id            0
acousticness        0
danceability        0
energy              0
instrumentalness    0
liveness            0
speechiness         0
tempo               0
valence             0
genre_top           0
dtype: int64
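
Every column reports zero, so no rows need to be dropped or imputed here. For contrast, the sketch below shows what the same check reports when values are missing, and one simple way to handle it. The frame `toy` and its values are invented for illustration, and dropping rows is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (illustrative only)
toy = pd.DataFrame({
    'tempo': [120.0, np.nan, 98.3],
    'genre_top': ['Rock', 'Hip-Hop', None],
})

# Count missing entries per column; both NaN and None are treated as null
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
print(len(cleaned))
```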
