#### Preparing Audio Files for CNN


For the CNN, the input will be mel spectrograms of the audio files, each cropped to the first minute.

Spectrograms are a way to visually represent a signal's loudness, or amplitude, as it varies over time at different frequencies. The horizontal axis is time, the vertical axis is frequency, and the color is amplitude. A spectrogram is calculated by applying the fast Fourier transform to short time windows of the signal, then converting the vertical axis (frequency) to a log scale and the color axis (amplitude) to decibels.

Now, what about the "mel" part? Humans are better at detecting differences in lower frequencies than in higher frequencies. The mel scale warps the frequency axis so that pitches an equal number of mels apart also sound equally far apart to a listener. A mel spectrogram is a spectrogram whose frequencies have been converted to the mel scale.
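To get a feel for that warping, here is a small sketch using one common closed-form version of the mel scale (the HTK formula). This is an illustration only: librosa's default mel conversion uses a slightly different variant (Slaney), so its numbers will not match exactly, and the helper names `hz_to_mel_htk` and `mel_to_hz_htk` are made up for this example.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert a frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Points that are equally spaced on the mel scale...
mel_points = [0, 500, 1000, 1500, 2000]
hz_points = [mel_to_hz_htk(m) for m in mel_points]

# ...are spaced further and further apart in Hz
hz_gaps = [high - low for low, high in zip(hz_points, hz_points[1:])]

for m, f in zip(mel_points, hz_points):
    print(f"{m:5d} mel -> {f:8.1f} Hz")
print("Hz gaps between neighbours:", [round(g, 1) for g in hz_gaps])
```

The growing gaps reflect the point above: at high frequencies, a much larger change in Hz is needed to produce the same perceived change in pitch.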


### Toy Example


In [None]:
import warnings
import IPython.display as ipd
import matplotlib.pyplot as plt
import librosa.display
import numpy as np
import pandas as pd
import os
import librosa
import sys
sys.path.append('..')
warnings.filterwarnings('ignore')

In [None]:
audio_directory = "../data/track_downloads/"
example_audio_path = audio_directory + "7ya7Jv4hJ9W0Baz7h9nL7E.wav"
sampling_rate = 22050

### Extracting an Audio Signal

A signal is a variation in a quantity over time. For audio, the quantity that varies is air pressure. We can represent a signal digitally by taking samples of the air pressure at regular intervals, which leaves us with a waveform for the signal. Librosa is a Python library that allows us to extract waveforms from audio files, along with several other features. It is the primary package used in this project.
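As a rough illustration of what sampling means, the sketch below samples a synthetic 440 Hz tone at this notebook's sampling rate and works out how many samples a sixty-second clip contains. The tone and the `toy_` names are assumptions made for the example; no audio file is read.

```python
import math

toy_sampling_rate = 22050  # samples per second, the rate used in this notebook
clip_seconds = 60          # we only keep the first minute of each track

# Number of samples in a sixty-second clip
clip_samples = toy_sampling_rate * clip_seconds
print(f"samples in {clip_seconds} s: {clip_samples}")

# Sampling a 440 Hz sine wave for 10 ms: one value every 1 / toy_sampling_rate seconds
toy_tone_hz = 440
toy_n_samples = toy_sampling_rate // 100
toy_waveform = [
    math.sin(2 * math.pi * toy_tone_hz * n / toy_sampling_rate)
    for n in range(toy_n_samples)
]
print(f"first five samples: {[round(v, 3) for v in toy_waveform[:5]]}")
```

The waveform returned by `librosa.load` is the same kind of object: a one-dimensional array of amplitude values, one per sample.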


In [None]:
# Extracting the wave, "y", and sampling rate, "sr", of the audio file (first sixty seconds)
y, sr = librosa.load(example_audio_path, sr=sampling_rate,
                     mono=True, duration=60)
print(f"sampling rate: {sr}, wave shape: {y.shape}")

In [None]:
# Plotting the wave
plt.plot(y)
plt.title('Signal')
plt.xlabel('Time (samples)')
plt.ylabel('Amplitude')

### Mel Spectrograms

As described at the top of this notebook, a spectrogram shows how a signal's amplitude varies over time at different frequencies. It is calculated using the fast Fourier transform on short time windows of the signal, and a mel spectrogram is a spectrogram whose frequency axis has been converted to the mel scale. The cell below plots an ordinary spectrogram of the example track, with a log frequency axis and amplitude in decibels.

Mel spectrograms are what will be the input to the CNN!
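To make the "Fourier transform on short time windows" idea concrete, here is a minimal pure-Python sketch. It is not how librosa computes a spectrogram (librosa uses an FFT, larger windows, and pads the signal so frames are centered); the function names, the toy 1024 Hz sampling rate, and the 128 Hz test tone are all assumptions chosen so the result is easy to check by hand.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the discrete Fourier transform of one frame (non-negative frequencies only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def toy_spectrogram(signal, frame_length=64, hop_length=32):
    """Hann-windowed DFT magnitudes for every frame that fits entirely inside the signal."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_length) for t in range(frame_length)]
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = [signal[start + t] * window[t] for t in range(frame_length)]
        frames.append(dft_magnitudes(frame))
    return frames  # indexed as frames[time][frequency]


# One second of a pure 128 Hz tone, sampled at a toy rate of 1024 Hz
toy_rate = 1024
toy_signal = [math.sin(2 * math.pi * 128 * t / toy_rate) for t in range(toy_rate)]
toy_spec = toy_spectrogram(toy_signal)

# Each frequency bin is toy_rate / frame_length = 16 Hz wide
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in toy_spec]
peak_hz = [b * toy_rate / 64 for b in peak_bins]
print(f"{len(toy_spec)} frames x {len(toy_spec[0])} frequency bins")
print(f"loudest frequency in the first frame: {peak_hz[0]} Hz")
```

Every frame peaks at the bin corresponding to the tone's frequency, which is what a horizontal line in a spectrogram plot represents.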


In [None]:
# Computing the spectrogram
spec = np.abs(librosa.stft(y, hop_length=512))
spec = librosa.amplitude_to_db(spec, ref=np.max)  # converting to decibels

# Plotting the spectrogram
plt.figure(figsize=(8, 5))
librosa.display.specshow(spec, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')

### Function to Read and Extract Mel Spectrograms from Audio Files

#### Checking the Size of the Mel Spectrograms

In order to feed the mel spectrograms into a neural network, they must all be the same size, so I check that here.
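The size to expect can be estimated before loading anything. The sketch below assumes librosa's defaults (128 mel bands, and frames centered on the signal so that the frame count is one plus the number of whole hops) together with the sampling rate, clip length, and hop length used in this notebook. The `expected_` names are illustrative.

```python
expected_sampling_rate = 22050  # Hz, as set earlier in this notebook
expected_clip_seconds = 60      # only the first minute of each track is used
expected_hop_length = 1024      # hop length passed to melspectrogram
expected_n_mels = 128           # assumed librosa default number of mel bands

# Total samples in a full sixty-second clip
expected_n_samples = expected_sampling_rate * expected_clip_seconds

# With centered frames, the frame count is 1 + floor(n_samples / hop_length)
expected_n_frames = 1 + expected_n_samples // expected_hop_length

expected_shape = (expected_n_mels, expected_n_frames)
print(expected_shape)  # (128, 1292)
```

Tracks shorter than sixty seconds would produce fewer frames than this, which is why the sizes are checked against the real files.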


In [None]:
from tqdm import tqdm

# Creating an empty list to store sizes in
sizes = []

# Looping through each audio file
for file in tqdm(os.scandir(audio_directory), total=len(os.listdir(audio_directory))):

    # Loading in the audio file
    # print(f"Loading file: {file.path}")
    y, sr = librosa.load(file.path, sr=sampling_rate, mono=True, duration=60)

    # Computing the mel spectrograms
    spect = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=1024)
    spect = librosa.power_to_db(spect, ref=np.max)

    # Adding the size to the list
    sizes.append(spect.shape)

    # print(f'size: {spect.shape}')

In [None]:
# Checking if all sizes are the same
print(
    f'The sizes of all the mel spectrograms in our data set are equal: {len(set(sizes)) == 1}')

# Listing the distinct sizes that occur
print(f'Sizes: {set(sizes)}')

In [None]:
def make_mel_spectrogram_df(directory):
    '''
    This function takes in a directory of audio files in .wav format, computes the
    mel spectrogram for each audio file, reshapes them so that they are all the
    same size, flattens them, and stores them in a dataframe.

    Binary virality labels (1 if a track's number_of_videos exceeds the
    threshold, else 0) are also computed and added to the dataframe.

    Parameters:
    directory (str): path to a directory of audio files in .wav format

    Returns:
    df (DataFrame): a dataframe of flattened mel spectrograms and their
    corresponding virality labels
    '''

    # loading dataframe
    spotify_df = pd.read_csv('../data/chartex_final.csv')

    # removing unnecessary columns
    spotify_df.drop(
        columns=['artist_pop', 'track_pop', 'number_of_videos_last_14days',
                 'total_likes_count', 'key'],
        errors='ignore', inplace=True)

    # Creating empty lists for mel spectrograms and labels
    labels = []
    mel_specs = []

    # setting minimum number of videos to be considered viral
    threshold = 500000
    
    # Setting the size for all mel spectrograms
    spectrogram_size = (128, 1292)

    # Looping through each row in the df
    for index, row in tqdm(spotify_df.iterrows(), total=len(spotify_df)):

        # Loading the first sixty seconds of the audio file at the shared sampling rate
        spotify_id = row['id']
        audio_path = os.path.join(directory, f'{spotify_id}.wav')
        y, sr = librosa.load(audio_path, sr=sampling_rate, mono=True, duration=60)

        # Extracting the label and adding it to the list
        number_of_videos = row['number_of_videos']
        label = 1 if number_of_videos > threshold else 0
        labels.append(label)

        # Computing the mel spectrograms
        spect = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=2048, hop_length=1024)
        spect = librosa.power_to_db(spect, ref=np.max)

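        # Zero-padding or cropping along the time axis so the size is (128, 1292);
        # ndarray.resize would refill the flat buffer and shift values across rows
        if spect.shape != spectrogram_size:
            spect = librosa.util.fix_length(
                spect, size=spectrogram_size[1], axis=1)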

        # Flattening to fit into dataframe and adding to the list
        spect = spect.flatten()
        mel_specs.append(spect)

    # Converting the lists to arrays so we can stack them
    mel_specs = np.array(mel_specs)
    labels = np.array(labels).reshape(-1, 1)

    # Create dataframe
    spectrogram_df = pd.DataFrame(np.hstack((mel_specs, labels)))

    # Returning the mel spectrograms and labels
    return spectrogram_df