## Audio features and downloading audio files

First, a few imports:

In [1]:
import sys
sys.path.append('..')
import pandas as pd
from src.utils.audio_utils import get_audio_features
import warnings
from tqdm import tqdm
warnings.filterwarnings('ignore')

In [2]:
csv_path = "../data/chartex_final.csv"

Let's look at one sample from the dataset:

In [3]:
# Load the CSV file into a DataFrame
df = pd.read_csv(csv_path)

# Get the first row of data
example_track_data = df.iloc[0]

# Print the resulting row
print(example_track_data)

track_name                        Love You So
track_pop                                   0
artist               The King Khan & BBQ Show
artist_pop                                 37
album                The King Khan & BBQ Show
danceability                            0.389
energy                                  0.896
key                                       5.0
loudness                               -2.622
mode                                      1.0
speechiness                            0.0599
acousticness                             0.79
instrumentalness                      0.00436
liveness                                0.501
valence                                 0.653
tempo                                 115.143
id                     4msYRkezQgynuZNubvVbHk
duration_ms                          225240.0
time_signature                            4.0
artist_name          The King Khan & BBQ Show
total_likes_count                  4187998484
number_of_videos                  

Next, we will extract several well-known features from the audio file:

In [4]:
audio_dir = "../data/track_downloads"

# Get the additional audio features for the track
features = get_audio_features(example_track_data, audio_dir)

Found youtube url: https://www.youtube.com/watch?v=UY3sneP51iM&pp=ygUrVGhlIEtpbmcgS2hhbiAmIEJCUSBTaG93IExvdmUgWW91IFNvIGx5cmljcw%3D%3D
Downloading track 4msYRkezQgynuZNubvVbHk.wav
Downloaded audio to ../data/track_downloads\4msYRkezQgynuZNubvVbHk.wav
chroma_stft: (12, 2584)
rmse: (1, 2584)
spec_cent: (1, 2584)
spec_bw: (1, 2584)
rolloff: (1, 2584)
zcr: (1, 2584)
mfcc: (20, 2584)
Extracted features: {'chroma_stft': 0.38724506, 'rmse': 0.3687946, 'spec_cent': 1762.3390761758187, 'spec_bw': 2108.894739332126, 'rolloff': 3458.9457508949304, 'zcr': 0.07204302619485294, 'mfcc': 4.0506997}
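
The log above lists each feature as a matrix with one column per analysis frame (2584 frames here), while the final dictionary holds a single number per feature. The internals of `get_audio_features` are not shown in this notebook, so the sketch below only assumes that each matrix is collapsed by averaging over all of its entries. The function `summarize_features` and the toy matrices are hypothetical and serve only to illustrate that reduction.

```python
from statistics import fmean


def summarize_features(frame_features):
    """Collapse each (n_rows x n_frames) matrix into one scalar by averaging.

    Assumption: a plain mean over every entry; the real helper may differ.
    """
    summary = {}
    for name, matrix in frame_features.items():
        # Flatten the rows so that every row and every frame contributes equally
        values = [value for row in matrix for value in row]
        summary[name] = fmean(values)
    return summary


# Toy input (made-up numbers): 'zcr' is 1 x 4 frames, 'chroma_stft' is 2 x 4 frames
toy = {
    "zcr": [[0.05, 0.07, 0.09, 0.07]],
    "chroma_stft": [[0.2, 0.4, 0.6, 0.8], [0.0, 0.2, 0.4, 0.6]],
}
summary = summarize_features(toy)
print(summary)
```

Under this assumption, a feature such as `mfcc` with 20 rows still ends up as one number, which matches the shape of the dictionary printed above.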


Now that we have done this for one song, we will do the same for all songs in the dataset:

In [None]:
def batch_extract(df, start_index, end_index):
    """Extract audio features for positional rows start_index..end_index-1."""
    for i in range(start_index, end_index):
        print(f'Processing track {i+1} of {len(df)}')
        track_data = df.iloc[i]
        # Pass audio_dir explicitly, as in the single-track call above
        features = get_audio_features(track_data, audio_dir)
        if not features:
            print(f'No features found for track {i+1} of {len(df)}, skipping...')
            continue
        # Add the features to the DataFrame; df.index[i] is the label of positional row i
        for feature_name, value in features.items():
            df.loc[df.index[i], feature_name] = value
        # Save after every track so that progress survives an interrupted run
        df.to_csv("../data/audio_features.csv", index=False)
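
Saving the full table after every track keeps progress safe, but it rewrites the whole file each time. One possible variation, sketched here with only hypothetical names (`run_in_batches`, `fake_extract`, `fake_save`, `checkpoint_every`), is to save every few tracks and once more at the end. The extractor and the save function are stand-ins so that the loop can run without downloading anything.

```python
def run_in_batches(track_ids, extract, save, checkpoint_every=50):
    """Run extract() on each id, saving a checkpoint every few tracks."""
    results = {}
    skipped = []
    for position, track_id in enumerate(track_ids, start=1):
        features = extract(track_id)
        if features:
            results[track_id] = features
        else:
            # Mirror the notebook's behavior: report nothing and move on
            skipped.append(track_id)
        if position % checkpoint_every == 0:
            save(results)
    save(results)  # final save so the last partial batch is not lost
    return results, skipped


# Stand-ins for the real extractor and for DataFrame.to_csv
def fake_extract(track_id):
    return {} if track_id == "c" else {"zcr": 0.07}


snapshots = []


def fake_save(results):
    snapshots.append(len(results))


results, skipped = run_in_batches(["a", "b", "c", "d", "e"], fake_extract, fake_save,
                                  checkpoint_every=2)
print(skipped, snapshots)
```

The trade-off is that a crash can lose up to `checkpoint_every - 1` tracks of work, so a small value is the safer choice when downloads are slow.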

We will now check if all songs were downloaded:

In [4]:
import os
downloaded_songs = os.listdir('../data/track_downloads/')
num_of_songs = len(df.index)
print("All songs downloaded" if len(downloaded_songs)==num_of_songs else "There are missing songs")

There are missing songs
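
Comparing counts only tells us that something is missing, not which tracks. A small sketch of how the missing ids could be listed with a set lookup follows; `find_missing_ids` and every id except the one from the sample row are hypothetical.

```python
def find_missing_ids(track_ids, filenames, extension=".wav"):
    """Return the track ids that have no '<id><extension>' file, in input order."""
    present = {
        name[: -len(extension)] for name in filenames if name.endswith(extension)
    }
    return [track_id for track_id in track_ids if track_id not in present]


track_ids = ["4msYRkezQgynuZNubvVbHk", "hypothetical_id_1", "hypothetical_id_2"]
filenames = ["4msYRkezQgynuZNubvVbHk.wav", "hypothetical_id_2.wav", "notes.txt"]

missing = find_missing_ids(track_ids, filenames)
print(missing)
```

Unlike the length comparison, this check is not affected by stray files in the download directory or by duplicate ids in the table.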


In [None]:
for i in range(num_of_songs):
    curr_song = df.iloc[i]
    if curr_song['id'] + '.wav' not in downloaded_songs:
        batch_extract(df,i,i+1)

We couldn't download most of those missing songs. We will ignore them.

Next, we will convert the audio files from MP4 to MP3 so that we can use them with the torchaudio library:

In [None]:
import os

# Create the output directory; exist_ok avoids an error when the cell is re-run
os.makedirs('../data/audio', exist_ok=True)

In [None]:
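import subprocess

for song in tqdm(downloaded_songs):
    song_path = '../data/track_downloads/' + song
    # Replace the original extension with .mp3
    conv_song_path = '../data/audio/' + os.path.splitext(song)[0] + ".mp3"
    # -vn: drop any video stream; encode with libmp3lame at VBR quality 4, resampled to 22050 Hz
    ffmpeg_command = ["ffmpeg", "-i", song_path, "-vn", "-acodec", "libmp3lame",
                      "-q:a", "4", "-ar", "22050", conv_song_path]

    # Pass an argument list (no shell) so that unusual file names cannot break the command
    subprocess.run(ffmpeg_command)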
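
To make the conversion step easier to inspect, the command can be built by a small helper and checked before anything is run. The sketch below is one way to do that; `build_ffmpeg_command` is a hypothetical name, and the `-n` flag (which tells ffmpeg not to overwrite an existing output file) is an addition that the cell above does not use.

```python
import os


def build_ffmpeg_command(src_path, dst_dir, sample_rate=22050, quality=4):
    """Return (command, dst_path) for converting src_path to an MP3 in dst_dir."""
    stem = os.path.splitext(os.path.basename(src_path))[0]
    dst_path = os.path.join(dst_dir, stem + ".mp3")
    command = [
        "ffmpeg",
        "-n",                     # do not overwrite an existing output (added here)
        "-i", src_path,           # input file
        "-vn",                    # ignore any video stream
        "-acodec", "libmp3lame",  # MP3 encoder
        "-q:a", str(quality),     # variable-bitrate quality level
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst_path,
    ]
    return command, dst_path


command, dst_path = build_ffmpeg_command(
    "../data/track_downloads/4msYRkezQgynuZNubvVbHk.wav", "../data/audio"
)
print(" ".join(command))
```

The returned list can be passed directly to `subprocess.run(command)`. Because no shell is involved, paths with spaces or special characters need no quoting.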

### Creating the dataset:

Now let's finish by creating the dataset's .csv file. A track is labeled viral if its `number_of_videos` exceeds the threshold of $5 \times 10^5$. We will then drop all features that the model cannot infer from the audio, along with the new features we created:

In [5]:
converted_songs = [song[:-4] for song in os.listdir('../data/audio/')]


df['viral'] = (df['number_of_videos'] > 5e5).astype('int32')
df.drop(['track_name', 'track_pop', 'artist', 'artist_pop', 'album',
        'time_signature', 'artist_name', 'total_likes_count', 'number_of_videos',
        'chroma_stft', 'rmse', 'spec_cent', 'spec_bw', 'rolloff', 'zcr', 'mfcc'], axis=1, errors='ignore', inplace=True)
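
The labeling rule uses a strict inequality, so a track with exactly 500,000 videos is not counted as viral. The following minimal sketch shows the same rule on plain Python numbers; `label_viral` is a hypothetical helper and the counts are made up.

```python
def label_viral(video_counts, threshold=5e5):
    """Return 1 for counts strictly above the threshold, else 0."""
    return [int(count > threshold) for count in video_counts]


labels = label_viral([120_000, 500_000, 500_001, 2_300_000])
print(labels)
```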

Then we remove from the DataFrame all songs that we couldn't download:

In [6]:
# .copy() gives an independent DataFrame, so later in-place updates are unambiguous
df = df[df['id'].isin(converted_songs)].copy()

We notice that the `duration_ms` values are not correct, so we will recompute them from the converted audio files:

In [7]:
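import ffmpeg  # ffmpeg.probe is provided by the ffmpeg-python package

def get_duration_ffmpeg(file_path):
    """Return the duration of the first audio stream, in seconds."""
    probe = ffmpeg.probe(file_path)
    stream = next((stream for stream in probe['streams'] if stream['codec_type'] == 'audio'), None)
    if stream is None:
        raise ValueError(f'No audio stream found in {file_path}')
    duration = float(stream['duration'])
    return duration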
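
The probe result is a nested dictionary, and a file may lack an audio stream or a per-stream duration. The sketch below shows one defensive way to read it, falling back to the container-level duration. `duration_from_probe` is a hypothetical helper, and the dictionaries are hand-written to imitate the general shape of ffprobe's JSON output rather than taken from a real file.

```python
def duration_from_probe(probe):
    """Return the audio duration in seconds, or None if it cannot be determined."""
    # Prefer the duration reported by the first audio stream
    for stream in probe.get("streams", []):
        if stream.get("codec_type") == "audio" and "duration" in stream:
            return float(stream["duration"])
    # Otherwise fall back to the container-level duration, if present
    container_duration = probe.get("format", {}).get("duration")
    return float(container_duration) if container_duration is not None else None


# Hand-written examples; ffprobe reports durations as strings
with_stream = {"streams": [{"codec_type": "audio", "duration": "225.240000"}]}
container_only = {"streams": [{"codec_type": "video"}], "format": {"duration": "12.5"}}
empty = {"streams": []}

print(duration_from_probe(with_stream) * 1000)  # milliseconds, as stored in duration_ms
```

Returning `None` instead of raising lets the caller decide whether to drop the track or keep its original value.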

In [8]:
for i in tqdm(df.index):
    df.loc[i,'duration_ms'] = get_duration_ffmpeg('../data/audio/' + df.loc[i,'id'] + '.mp3') * 1000

100%|██████████| 3915/3915 [01:45<00:00, 36.97it/s]


We will also remove all songs that are shorter than 30 seconds or that last 5 minutes or longer:

In [9]:
df = df[(df['duration_ms']< 5*60*1000) & (df['duration_ms']>= 30*1000)]

In [13]:
df.to_csv("../data/metadata.csv")