# Step 1: Data Processing

## 1.1 Feature Extraction

The '**extract_features**' function extracts the pitch, velocity and duration of every note played by every instrument in the MIDI file and stores them in a list.

**Parameter**: a string containing the path to the MIDI file

**Output**: a list of three lists holding the pitches, velocities and durations, in that order

In [2]:
import os 
import pretty_midi
import numpy as np

def extract_features(midi_file):
    # Initializing the feature lists
    pitches = []
    velocities = []
    durations = []
    
    # Loading the MIDI file in a variable
    midi_data = pretty_midi.PrettyMIDI(midi_file)
    
    # Extracting pitch, velocity and note duration features
    for instrument in midi_data.instruments:
        
        # Loop through each note and add to the lists 
        for note in instrument.notes:
            pitches.append(note.pitch)
            velocities.append(note.velocity)
            durations.append(note.end - note.start)
    
    return [pitches, velocities, durations]
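
The layout of the returned list can be seen without a MIDI file. The sketch below is only an illustration: `Note` is a hypothetical stand-in for the note objects that `pretty_midi` provides, and `features_from_notes` is an assumed helper name that repeats the same per-note logic on hand-written notes.

```python
from collections import namedtuple

# Hypothetical stand-in for a pretty_midi note, with only the attributes used above
Note = namedtuple("Note", ["pitch", "velocity", "start", "end"])

def features_from_notes(notes):
    """Collect pitch, velocity and duration in the same layout as extract_features."""
    pitches = [n.pitch for n in notes]
    velocities = [n.velocity for n in notes]
    durations = [n.end - n.start for n in notes]
    return [pitches, velocities, durations]

toy_notes = [Note(60, 100, 0.0, 0.5), Note(64, 90, 0.5, 1.5)]
toy_features = features_from_notes(toy_notes)
print(toy_features)  # [[60, 64], [100, 90], [0.5, 1.0]]
```

Each inner list has one entry per note, so all three lists of a file always share the same length.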

## 1.2 Loading the Dataset Features into a Data Structure

The '**load_dataset**' function walks through the directory of each composer, collects the .mid files it contains and labels them with that composer. It then calls the '**extract_features**' function on each .mid file and saves the result in a data structure which we will use for model training and testing.

**Parameters**: a string containing the path to the *PS1* folder.

**Output**: Two lists, <br/>
                'X', which holds the extracted features of every file for all the composers <br/>
                'y', which holds the label identifying the composer of each entry in 'X'

The following image shows the algorithmic flow of this function:

<img src="Doc_images/loadmidi.png" alt="Alternative text" width="80%" height="80%" />

In [3]:
def load_dataset(folder_path):
    X = []
    y = []
    
    # Creating a dictionary to label the composers
    composer_label = {'Bach':0, 'Beethoven':1, 'Brahms':2, 'Schubert':3}
    
    # Loop through each composer's folder
    for composer_folder in os.listdir(folder_path):
        
        # We ignore all the hidden files 
        if not composer_folder.startswith('.'):
            composer_path = os.path.join(folder_path, composer_folder)
            
            # Label is set to -1 if the composer is not one of our labels
            label = composer_label.get(composer_folder, -1)
            
            if label != -1:
                # We will loop over every '.mid' file in the composer's folder
                for file in os.listdir(composer_path):
                    if file.endswith('.mid'):
                        midi_file = os.path.join(composer_path, file)
                        
                        # We wrap the call in a try/except so that a file which fails to parse is reported and skipped
                        try:
                            midi_features = extract_features(midi_file)
                            X.append(midi_features)
                            y.append(label)
                        except Exception as e:
                            print(f"Error Processing {midi_file}: {str(e)}")
                            continue
    
    return X, y 
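
The folder walk and the labeling rule can be tried on a throwaway directory tree. The sketch below is an illustration under assumptions: `list_labelled_files` is a hypothetical helper, the folder and file names are invented, and no MIDI parsing takes place. It only shows which files would be picked up and with which label.

```python
import os
import tempfile

composer_label = {'Bach': 0, 'Beethoven': 1, 'Brahms': 2, 'Schubert': 3}

def list_labelled_files(folder_path):
    """Return (path, label) pairs for '.mid' files inside known composer folders."""
    pairs = []
    for composer_folder in sorted(os.listdir(folder_path)):
        composer_path = os.path.join(folder_path, composer_folder)
        # Unknown names (including hidden files) get -1 and are skipped
        label = composer_label.get(composer_folder, -1)
        if label == -1 or not os.path.isdir(composer_path):
            continue
        for file in sorted(os.listdir(composer_path)):
            if file.endswith('.mid'):
                pairs.append((os.path.join(composer_path, file), label))
    return pairs

with tempfile.TemporaryDirectory() as root:
    # Invented layout: two known composers, one unknown composer, one hidden file
    layout = {'Bach': ['a.mid', 'notes.txt'], 'Brahms': ['b.mid'], 'Mozart': ['c.mid']}
    for folder, files in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in files:
            open(os.path.join(root, folder, name), 'w').close()
    open(os.path.join(root, '.DS_Store'), 'w').close()
    found = [(os.path.basename(p), label) for p, label in list_labelled_files(root)]

print(found)  # [('a.mid', 0), ('b.mid', 2)]
```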


## 1.3 Feature Engineering 

The '**feature_padding**' function makes every feature the same length by padding 0's at the end of the shorter ones. Each feature is padded to the length of the longest feature of its kind in the dataset.

**Parameters**: the data structure containing all of the extracted features

**Output**: a list with the same structure in which every feature has been padded. The image below summarizes the structure of the data.

<img src="Doc_images/midiData.png" alt="Alternative text" width="35%" height="35%" />

In [4]:
def feature_padding(feature_data):
    X_padded = []
    
    # We will find the maximum length of each feature across the dataset
    max_pitch_length = max(len(x[0]) for x in feature_data)
    max_velocity_length = max(len(x[1]) for x in feature_data)
    max_duration_length = max(len(x[2]) for x in feature_data)
    
    # Pad to the maximum length by looping through the pitch, velocity and duration of each file
    for pitch, velocity, duration in feature_data:
        
        # Padding 0's so that each feature length equals the maximum length of that feature
        padded_pitch = np.pad(pitch, (0, max_pitch_length - len(pitch)), mode = 'constant')
        padded_velocity = np.pad(velocity, (0, max_velocity_length - len(velocity)), mode = 'constant')
        padded_duration = np.pad(duration, (0, max_duration_length - len(duration)), mode = 'constant')
        X_padded.append([padded_pitch, padded_velocity, padded_duration])
    
    return X_padded
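
The effect of the padding is easiest to see on a toy dataset. The values below are invented for illustration. The sketch uses a single `max_len`, which is enough here because the three features of a file always have the same number of entries.

```python
import numpy as np

# Two invented "files": one with three notes, one with a single note
toy = [
    [[60, 62, 64], [100, 90, 80], [0.5, 0.5, 1.0]],
    [[67], [110], [2.0]],
]

max_len = max(len(sample[0]) for sample in toy)

# Append zeros to the end of every feature until it reaches max_len
padded = [
    [np.pad(feature, (0, max_len - len(feature)), mode='constant') for feature in sample]
    for sample in toy
]

print(padded[1][0].tolist())  # [67, 0, 0]
print(padded[1][2].tolist())  # [2.0, 0.0, 0.0]
```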

## 1.4 Splitting the Data into Train and Test Sets

In this part we will split the padded feature data into training and testing portions. The training portion is used to fit our machine learning model, and the testing portion is held back to evaluate it.

In [5]:
from sklearn.model_selection import train_test_split

# Path to the folder containing the directory of each composer's MIDI files
PS1_FOLDER_PATH = "Datasets/PS1"

# Load the data
X_raw, y = load_dataset(PS1_FOLDER_PATH)

# Pad the raw Data
X = feature_padding(X_raw)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)



Error Processing Datasets/PS1/Bach/WTK I, No. 14: Prelude and Fugue in F-sharp minor_BWV859_2305_prelude14.mid: no MTrk header at start of track
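
To see what `test_size=0.3` does, the sketch below splits an invented dataset of 20 samples. The `stratify` argument is an optional extra that the notebook does not use. It is included here only as an assumption about how one might keep the class proportions equal in both portions.

```python
from sklearn.model_selection import train_test_split

# Invented data: 20 samples, two equally sized classes
X_toy = [[i] for i in range(20)]
y_toy = [0] * 10 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))  # 14 6
print(sorted(y_te))          # [0, 0, 0, 1, 1, 1]
```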


# Step 2: Model Training

## 2.1 Training our Model

For classifying the composers we will use a Random Forest Classifier.

The '**train_model**' function trains the Random Forest Classifier on a flattened version of the data, in which the padded pitch, velocity and duration arrays of each file are concatenated into a single feature vector.

**Parameters**: X_train, a list of the padded features we will use for training <br/>
                y_train, a list of the corresponding labels for the classifier

**Output**: a trained Random Forest Classifier model

In [6]:
from sklearn.ensemble import RandomForestClassifier

def train_model(X_train, y_train):

    # Flatten the padded features for training
    X_train_flat = np.array([np.concatenate(x) for x in X_train])

    # Train a random forest classifier
    model = RandomForestClassifier()
    model.fit(X_train_flat, y_train)

    return model
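
The flattening step turns each file's three padded arrays into one row of a 2-D matrix, which is the input shape scikit-learn estimators expect. A small sketch with invented values:

```python
import numpy as np

# Two invented files, each with three features already padded to length 3
X_toy = [
    [np.array([60, 62, 0]), np.array([100, 90, 0]), np.array([0.5, 1.0, 0.0])],
    [np.array([67, 0, 0]), np.array([110, 0, 0]), np.array([2.0, 0.0, 0.0])],
]

# Concatenate pitch, velocity and duration end to end for every file
X_flat = np.array([np.concatenate(x) for x in X_toy])

print(X_flat.shape)        # (2, 9)
print(X_flat[0].tolist())  # [60.0, 62.0, 0.0, 100.0, 90.0, 0.0, 0.5, 1.0, 0.0]
```

Every row has three times the padded length, which is why the padded length can later be recovered as `n_features_in_ // 3`.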

## 2.2 Evaluating our Model

The '**evaluate_model**' function uses the held-out '*X_test*' data to make predictions with our model and then compares those predictions with the original labels. It returns the result of that evaluation.

**Parameters**: *model*, the trained random forest model<br/>
                *X_test*, the data we split off for the test portion<br/>
                *y_test*, the original labels of the X_test data<br/>

**Output**: *accuracy*, a numerical value between 0 and 1 giving the fraction of correct predictions

In [7]:
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):

    # Flatten the padded features for testing
    X_test_flat = np.array([np.concatenate(x) for x in X_test])

    # Make predictions on the test set
    y_pred = model.predict(X_test_flat)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    return accuracy
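
For single-label classification, the accuracy returned by `accuracy_score` with its default settings is the fraction of predictions that match the true labels. The sketch below computes the same quantity by hand on invented labels.

```python
import numpy as np

# Invented labels: four of the five predictions are correct
y_true = np.array([0, 1, 2, 3, 1])
y_pred = np.array([0, 1, 1, 3, 1])

# Fraction of positions where the prediction equals the true label
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)  # 0.8
```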

## 2.3 Putting it all together

In [8]:
# Train the model
model = train_model(X_train, y_train)

# Evaluate the model
print(evaluate_model(model, X_test, y_test))

0.7846153846153846


# Step 3: Classify using Model

In [16]:
def classify_audio(model, midi_file):
    # Extracting features from the MIDI file
    pitches, velocities, durations = extract_features(midi_file)

    # Ensuring that the features match the expected dimensions:
    # each of the three features fills one third of the model's input vector
    feature_len = model.n_features_in_ // 3

    # Truncating features longer than the training length, as np.pad rejects negative widths
    pitches = pitches[:feature_len]
    velocities = velocities[:feature_len]
    durations = durations[:feature_len]

    padded_pitches = np.pad(pitches, (0, feature_len - len(pitches)), mode='constant')
    padded_velocities = np.pad(velocities, (0, feature_len - len(velocities)), mode='constant')
    padded_durations = np.pad(durations, (0, feature_len - len(durations)), mode='constant')

    features = np.concatenate([padded_pitches, padded_velocities, padded_durations]).reshape(1, -1)

    # Making a prediction using the trained model
    composer_index = model.predict(features)[0]

    composer_label = {0:'Bach', 1:'Beethoven', 2:'Brahms', 3:'Schubert'}    
    composer = composer_label.get(composer_index, 'Unknown')

    return composer
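
A new file can contain fewer or more notes than the longest training file, so its features have to be brought to the exact length the model was trained on. The sketch below shows one possible way to do this. `fit_length` is a hypothetical helper and the choice to cut off the extra notes is an assumption, not something the notebook prescribes.

```python
import numpy as np

def fit_length(values, target_len):
    """Truncate or zero-pad a 1-D sequence to exactly target_len entries."""
    # Slicing is a no-op for short inputs and drops the tail of long ones
    values = np.asarray(values)[:target_len]
    return np.pad(values, (0, target_len - len(values)), mode='constant')

short = fit_length([60, 62], 4)
long = fit_length([60, 62, 64, 65, 67], 4)

print(short.tolist())  # [60, 62, 0, 0]
print(long.tolist())   # [60, 62, 64, 65]
```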


In [17]:
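PS2_FOLDER_PATH = 'Datasets/PS2'
entriesps2 = os.listdir(PS2_FOLDER_PATH)

trained_model = model

for ent in entriesps2:
    # We ignore all the hidden files
    if ent.startswith('.'):
        continue
    file_name_ps2 = os.path.join(PS2_FOLDER_PATH, ent)
    composer = classify_audio(trained_model, file_name_ps2)
    print(f"Composer is {composer}")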

Composer is Beethoven
Composer is Beethoven
Composer is Schubert




Composer is Beethoven
Composer is Beethoven
Composer is Bach
Composer is Beethoven
Composer is Bach
Composer is Beethoven
Composer is Bach
Composer is Beethoven
Composer is Schubert
Composer is Bach
Composer is Bach
Composer is Bach
Composer is Bach
Composer is Beethoven
Composer is Beethoven
Composer is Bach




Composer is Schubert
Composer is Bach
Composer is Beethoven
Composer is Bach
Composer is Beethoven
Composer is Brahms
Composer is Beethoven




Composer is Beethoven
Composer is Beethoven
Composer is Beethoven
Composer is Schubert
Composer is Beethoven
Composer is Beethoven
Composer is Beethoven
Composer is Beethoven
Composer is Beethoven


