# Instrument Classification

## Dataset

### Training

The training dataset consists of 6705 audio files in 16 bit stereo wav format at 44.1 kHz samplerate. Being excerpts of 3 seconds from more than 2000 distinct recordings.

The table below shows the instrument, it's corresponding key annotation and the number of files assosiated to that instrument. These are then organised into folders.

| Instrument           | Key | Files |
|----------------------|-----|-------|
| Cello                | cel | 388   |
| Clarinet             | cla | 505   |
| Flute                | flu | 451   |
| Acoustic Guitar      | gac | 637   |
| Electric Guitar      | gel | 760   |
| Organ                | org | 682   |
| Piano                | pia | 721   |
| Saxophone            | sax | 626   |
| Trumpet              | tru | 577   |
| Violin               | vio | 580   |
| Human Singing Voice  | voi | 778   |

Extra information is sometimes provided on the filename which corresponds to the genre and wether drums are present or not.

| Description  | Key     |
|--------------|---------|
| Drums        | dru     |
| No Drums     | nod     |
| Country-Folk | cou_fol |
| Classical    | cla     |
| Pop-Rock     | pop_roc |
| Latin-Soul   | lat_sou |
| Jazz-Blues   | jaz_blu |

The filename structure is then formatted as `004__[instrument][drumspresent][genre]0001__1.wav` with the drums present key sometimes ommitted.

### Testing

The testing dataset consists of 2874 excerpts in 16 bit stereo wav format at 44.1kHz samplerate.

This is built up of exceprts with the following properties:
1. The annotated instruments are the same in the whole excerpt.
2. The length is between 5 and 20 seconds.
3. The excerpts are stereo.

These properties eliminate the need to segment the recognition evaluation based on changing instrumentation within a piece. The duration requirement ensures there is enough information for confident instrument labeling. The focus is on professionally produced stereo music recordings, and annotations exclude instruments like bass, percussion, or sections like brass or strings.

## Downloading the python packages

In [2]:
# Download python packages required using pip
!python -m pip install --upgrade pip
!pip3 install torch torchaudio
!pip install torch-summary
!pip install pandas pyarrow
!pip install librosa
!pip install tqdm

Collecting pip
  Downloading pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.0-py3-none-any.whl (2.1 MB)
   ---------------------------------------- 0.0/2.1 MB ? eta -:--:--
   --- ------------------------------------ 0.2/2.1 MB 5.3 MB/s eta 0:00:01
   ----------- ---------------------------- 0.6/2.1 MB 7.9 MB/s eta 0:00:01
   ------------------ --------------------- 1.0/2.1 MB 8.9 MB/s eta 0:00:01
   -------------------------- ------------- 1.4/2.1 MB 8.0 MB/s eta 0:00:01
   -------------------------------------- - 2.0/2.1 MB 9.3 MB/s eta 0:00:01
   ---------------------------------------  2.1/2.1 MB 9.0 MB/s eta 0:00:01
   ---------------------------------------- 2.1/2.1 MB 6.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.2
    Uninstalling pip-23.3.2:
      Successfully uninstalled pip-23.3.2
Successfully installed pip-24.0


In [None]:
RUN_ON_COLAB = False

In [None]:
# Import python packages required
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchaudio
from torchsummary import summary
import pandas as pd
import os
from tqdm import tqdm

if RUN_ON_COLAB: from google.colab import drive

## Downloading the dataset

The IRMAS dataset is downloaded to Google Drive and then unzipped into the local colab session.

In [None]:
# Defining the path constants used
LOCAL_AUDIO = '/wavfiles'

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1133__1.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1133__2.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1134__1.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1134__2.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1134__3.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1135__1.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1135__2.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1135__3.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1136__1.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1136__2.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1136__3.wav  
  inflating: /content/Audio/IRMAS-TrainingData/org/[org][pop_roc]1

## Annotation File

The code below creates an annotations file which provides the information for the dataset including the class labels.

In [None]:
def save_file_annotations_to_csv():
  data = []

  # Walking through the directories in the dataset
  for (dirpath, dirnames, filenames) in os.walk(LOCAL_IRMAS):
    for dir in dirnames:
      for (dirpath, dirnames, filenames) in os.walk(os.path.join(LOCAL_IRMAS, dir)):
        for filename in filenames:
          # Split filenames at delimiters __ and [] and remove empty strings
          names = re.split(r'__|\[|\]', filename)
          names = list(filter(None, names))

          # If the filename doesn't start with the three numbers, insert empty
          # entry at column 0 and 2
          if not re.compile("^\d\d\d__").match(filename):
            names.insert(0, None)
            names.insert(2, None)

          # Include the filename at the start and remove .wav on last column
          names.insert(0, filename)
          names[-1] = names[-1].replace('.wav', '')
          data.append(names)
    break

  # Creating a pandas dataframe to hold this data
  df = pd.DataFrame(data, columns =['FileName', 'number1', 'Instrument', 'Drums',
                                    'Genre', 'number2', 'number3'])

  # Create class ID numbers from the class labels
  df['ClassID'] = df.groupby(['Instrument']).ngroup()

  from IPython.display import display, HTML

  # Assuming that dataframes df1 and df2 are already defined:
  print("Dataframe 1:")
  display(df)
  # print "Dataframe 2:"
  # display(HTML(df2.to_html()))

  # Save the dataframe to a CSV file
  df.to_csv(path_or_buf=f"{LOCAL_IRMAS}.csv", sep=',', encoding='utf-8', index=False)

save_file_annotations_to_csv()

Dataframe 1:


Unnamed: 0,FileName,number1,Instrument,Drums,Genre,number2,number3,ClassID
0,[flu][pop_roc]0477__1.wav,,flu,,pop_roc,0477,1,2
1,[flu][jaz_blu]0498__1.wav,,flu,,jaz_blu,0498,1,2
2,[flu][jaz_blu]0403__1.wav,,flu,,jaz_blu,0403,1,2
3,[flu][cla]0355__2.wav,,flu,,cla,0355,2,2
4,[flu][pop_roc]0395__1.wav,,flu,,pop_roc,0395,1,2
...,...,...,...,...,...,...,...,...
6700,018__[vio][nod][cla]2160__3.wav,018,vio,nod,cla,2160,3,9
6701,[vio][cla]2111__1.wav,,vio,,cla,2111,1,9
6702,180__[vio][nod][cla]2211__3.wav,180,vio,nod,cla,2211,3,9
6703,[vio][jaz_blu]2104__2.wav,,vio,,jaz_blu,2104,2,9


## Custom Dataset

The custom dataset imports the audio dataset's annotation file into python and uses it to load the correct audio files with the corresponding class labels.

In [None]:
class AudioDataset(Dataset):
  ''' Custom Dataset class for audio instrument classification '''

  def __init__(self, annotation_file, audio_dir, device, transformation, transform_fs, num_samples):
    '''
    Parameters -
        annotation_file: a path to a csv file with the annotations for the dataset
                            these annotations should have the filename at index 0 and class ID at -1
        audio_dir:       the path to the audio dataset directory
        device:          the device in use (cude or cpu)
        transformation:  provides the function for performing preprocessing on the data
        transform_fs:    the samplerate to resample at
        num_samples:     the length n of samples to cut / pad the data to
    '''
    # Read in the annotation file for the dataset
    self.annotations = pd.read_csv(annotation_file)

    # Defining attributes
    self.audio_dir = audio_dir
    self.device = device
    self.transformation = transformation.to(self.device) # putting the data onto a cuda device is available
    self.transform_fs = transform_fs
    self.num_samples = num_samples

    # Creating a unique ID for the classes
    self.annotations.assign(id=self.annotations.groupby(['Instrument']).ngroup())

  def __len__(self):
    ''' Magic method to provide the length of the object '''
    return len(self.annotations)

  def __getitem__(self, index):
    ''' Magic method to provide indexing for the object '''

    # Gets the audio path and labelID for the index
    audio_sample_path, labelID = self._get_audio_sample_path_and_label(index)

    # Loads the audio input to the device
    signal, fs = torchaudio.load(audio_sample_path, normalize=True)
    signal = signal.to(self.device)

    # Resamples and reshapes the audio
    signal = self._resample_audio(signal, fs)
    signal = self._reshape_audio(signal)

    # Performs transformation on the device
    signal = self.transformation(signal)

    return signal, labelID

  def get_class_labels(self):
    ''' Public method to provide a list of the class labels '''
    return self.annotations['Instrument'].unique()

  def _get_audio_sample_path_and_label(self, index):
      ''' Private get method for the sample path location
                        and prediction labelID            '''
      label = self.annotations.iloc[index, 2]
      labelID = self.annotations.iloc[index, 7]
      path = os.path.join(self.audio_dir, label, self.annotations.iloc[index, 0])
      return path, labelID

  def _resample_audio(self, signal, fs):
      # Resample the audio signal if needed
      if fs != self.transform_fs:
          resampler = torchaudio.transforms.Resample(fs, self.transform_fs).to(self.device)
          signal = resampler(signal)
      return signal

  def _reshape_audio(self, signal):
      # Convert the signal to mono if needed
      if signal.shape[0] > 1:
          signal = torch.mean(signal, dim=0, keepdim=True)

      # Cut the signal if needed
      if signal.shape[1] > self.num_samples:
          signal = signal[:, :self.num_samples]

      # Pad the signal if needed
      if signal.shape[1] < self.num_samples:
          signal = torch.nn.functional.pad(signal, (0, self.num_samples - signal.shape[1]))

      return signal

# Defining the dataset and dataloader constants
SAMPLE_RATE = 22050
NUM_SAMPLES = 22050
BATCH_SIZE = 128

# Creating a mel spectogram for the feature transformation
mel_spectogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=512,
    n_mels=64
)

# Defining the device being used
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Device available: {device.upper()}")

# Creating the dataset and dataloader
audio_dataset = AudioDataset(f"{LOCAL_IRMAS}.csv", LOCAL_IRMAS, device, mel_spectogram, SAMPLE_RATE, NUM_SAMPLES)
data_loader = DataLoader(audio_dataset, batch_size=BATCH_SIZE)

# Printing all the class labels
print(audio_dataset.get_class_labels())
# 0 cel
# 1 cla
# 2 flu
# 3 gac
# 4 gel
# 5 org
# 6 pia
# 7 sax
# 8 tru
# 9 vio
# 10 voi

Device available: CUDA
['flu' 'cel' 'org' 'cla' 'pia' 'sax' 'gac' 'gel' 'voi' 'tru' 'vio']


## Convolutional Neural Network

Defining the convolutional neural network architecture with the layers and activation functions.

In [None]:
class CNNNetwork(nn.Module):
  ''' Custom Convolution Neural Network for audio classification'''

  def __init__(self, num_classes):
    '''
    Parameters -
        num_classes: the number of classes used for classification
    '''
    super().__init__()

    # Define convolutional blocks
    self.conv1 = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )

    self.conv2 = nn.Sequential(
        nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )

    self.conv3 = nn.Sequential(
        nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )

    self.conv4 = nn.Sequential(
        nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )

    # Create a flatten layer to reshape the data into a 1D vector
    self.flatten = nn.Flatten()

    # Create a linear / dense layer to classify the data from the convolutional network
    self.linear = nn.Linear(in_features=128*5*4, out_features=num_classes)

    # Create a softmax layer to provide decimal probabilities to the class predictions
    self.softmax = nn.Softmax(dim=1)

  def forward(self, input_data):
    ''' Passing the data through the network '''
    x = self.conv1(input_data)
    x = self.conv2(x)
    x = self.conv3(x)
    x = self.conv4(x)
    x = self.flatten(x)
    logits = self.linear(x)
    predictions = self.softmax(logits)
    return predictions

# Creating the neural network
cnn = CNNNetwork(audio_dataset.get_class_labels().size).to(device)

# Printing a summary of the network
summary(cnn.cuda(), (1, 64, 44))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 66, 46]             160
              ReLU-2           [-1, 16, 66, 46]               0
         MaxPool2d-3           [-1, 16, 33, 23]               0
            Conv2d-4           [-1, 32, 35, 25]           4,640
              ReLU-5           [-1, 32, 35, 25]               0
         MaxPool2d-6           [-1, 32, 17, 12]               0
            Conv2d-7           [-1, 64, 19, 14]          18,496
              ReLU-8           [-1, 64, 19, 14]               0
         MaxPool2d-9             [-1, 64, 9, 7]               0
           Conv2d-10           [-1, 128, 11, 9]          73,856
             ReLU-11           [-1, 128, 11, 9]               0
        MaxPool2d-12            [-1, 128, 5, 4]               0
          Flatten-13                 [-1, 2560]               0
           Linear-14                   

## Training

Below demonstrates the function to train the neural network. This then saves the model state to google drive.

In [None]:
def train(model, data_loader, loss_func, optimiser, device, epochs):
  '''
  Parameters -
    model: the model to train
    data_loader: the dataloader to optimise memory storage with batch downloads
    loss_func: the loss function to use to update the bias weightings
    optimiser: the optimiser to use for backwards propogation
    device: the device in use (cuda or cpu)
    epochs: the number of epochs to run
  '''
  model.train()

  # Using a loading bar to show progress
  with tqdm(data_loader, unit="batch", total=len(data_loader)) as tepoch:

    # Increament through each epoch
    for epoch in range(epochs):
      for input, target in data_loader:
        tepoch.set_description(f"Epoch {epoch}")

        # Load the batch input into the device memory
        input, target = input.to(device), target.to(device)

        # Calculate the loss
        output = model(input) # pass the inputs into the model
        loss = loss_func(output, target) # compare the predictions to the actual data

        # Calculate the accuracy of the model to display
        predictions = output.argmax(dim=1, keepdim=True).squeeze()
        correct = (predictions == target).sum().item()
        accuracy = correct / BATCH_SIZE

        # Backpropogate the loss and update the NN weights
        optimiser.zero_grad() # resets the gradients
        loss.backward() # performs back propogation
        optimiser.step() # updates the weights

        tepoch.set_postfix(loss=loss.item(), accuracy=100. * accuracy)

  print("\nTraining Completed")

# Defining network constants
EPOCHS = 40 # [estimated 1.5 mins / per epoch]
LEARNING_RATE = .001

# Defining the loss function
loss_func = nn.CrossEntropyLoss()

# Defining the optimiser
optimiser = torch.optim.Adam(cnn.parameters(), lr=LEARNING_RATE)

# Training the model
train(cnn, data_loader, loss_func, optimiser, device, EPOCHS)

# Saving the model state
model_state_path = os.path.join(GDRIVE_AUDIO, "cnn.pth")
torch.save(cnn.state_dict(), model_state_path)
print(f"Model trained and stored  at '{model_state_path}'")

Epoch 39:   0%|          | 0/53 [1:03:31<?, ?batch/s, accuracy=38.3, loss=1.54]


Training Completed
Model trained and stored  at '/content/drive/MyDrive/Audio/cnn.pth'





## References

<Bosch2012>Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. (2012). "A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals." In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (pp. 559-564).