# evaluate_llms.ipynb
# Copyright (c) 2025, Joshua J Hamilton

In this notebook, we will evaluate the ability of different LLMs to tag classical music audio files.

I will evaluate some of the following LLMs:
* Claude 3.5 Sonnet
* DeepSeek R1
* Gemini 2.0 Flash
* OpenAI o1
* Llama 3.3-70B

To evaluate the LLMs, I will:
* Engineer prompts for each LLM and validate them against the works I have tagged thus far.
* Compare the prompts and LLMs and select one combination for testing.
* Test the optimized prompt on a test set containing the complete works of Beethoven, who is not in the validation set.

Validation set:
|updated_composer        |updated_album                                            |updated_orchestra                 |updated_conductor|updated_soloists |
|------------------------|---------------------------------------------------------|----------------------------------|-----------------|-----------------|
|Bach, Johann Sebastian  |Brandenburg Concertos · Orchestral Suites · Chamber Music|Musica Antiqua Köln               |Goebel, Reinhard |                 |
|Bach, Johann Sebastian  |Harpsichord Concertos                                    |Leonhardt-Consort                 |Leonhardt, Gustav|Leonhardt, Gustav|
|Bach, Johann Sebastian  |Harpsichord Concertos                                    |Leonhardt-Consort                 |Leonhardt, Gustav|Curtis, Alan     |
|Bach, Johann Sebastian  |Organ Works                                              |                                  |                 |Walcha, Helmut   |
|Bach, Johann Sebastian  |Complete Bach Collection                                 |                                  |                 |Gould, Glenn     |
|Handel, George Frideric |Concerti Grossi Op 3 & Op 6                              |Münchener Bach-Orchester          |Richter, Karl    |                 |
|Handel, George Frideric |Organ & Harpsichord Music                                |Amsterdam Baroque Orchestra       |                 |Koopman, Ton     |
|Handel, George Frideric |Organ & Harpsichord Music                                |                                  |                 |Ross, Scott      |
|Handel, George Frideric |Organ & Harpsichord Music                                |                                  |                 |Baumont, Olivier |
|Handel, George Frideric |Music for the Royal Fireworks · Water Music              |Academy of St Martin in the Fields|Marriner, Neville|                 |
|Haydn, Joseph           |The Six Organ Concertos                                  |Amsterdam Baroque Orchestra       |Koopman, Ton     |Koopman, Ton     |
|Haydn, Joseph           |Complete Harpsichord Concertos                           |Musica Antiqua Amsterdam          |Koopman, Ton     |Koopman, Ton     |
|Haydn, Joseph           |Complete Music for Solo Keyboard                         |                                  |                 |Brautigam, Ronald|
|Haydn, Joseph           |Complete String Quartets                                 |Angeles String Quartet            |                 |                 |
|Haydn, Joseph           |Complete Symphonies                                      |Austro-Hungarian Haydn Orchestra  |Fischer, Adam    |                 |
|Mozart, Wolfgang Amadeus|Overtures                                                |Staatskapelle Dresden             |Davis, Colin     |                 |
|Mozart, Wolfgang Amadeus|Complete Piano Sonatas                                   |                                  |                 |Brautigam, Ronald|
|Mozart, Wolfgang Amadeus|Serenaden & Divertimenti                                 |Camerata Salzburg                 |Vegh, Sandor     |                 |
|Mozart, Wolfgang Amadeus|The String Quartets                                      |Amadeus Quartet                   |                 |                 |
|Mozart, Wolfgang Amadeus|The String Quintets                                      |Amadeus Quartet                   |                 |Aronowitz, Cecil |
|Mozart, Wolfgang Amadeus|46 Symphonies                                            |Berlin Philharmonic               |Böhm, Karl       |                 |
|Mozart, Wolfgang Amadeus|Piano Concertos                                          |Camerata Salzburg                 |Vegh, Sandor     |Schiff, András   |
|Vivaldi, Antonio        |Vivaldi Edition, Volume 1                                |I Musici                          |                 |                 |
|Vivaldi, Antonio        |Vivaldi Edition, Volume 2                                |I Musici                          |                 |                 |

Test set:
| Genre    | Composer  | Type of Work    | Recording                                    |
| -------- | --------- | --------------- | -------------------------------------------- |
| Romantic | Beethoven | Overtures       | Leipzig Gewandhaus Orchestra with Kurt Masur |
| Romantic | Beethoven | Piano Concertos | Wilhelm Kempff                               |
| Romantic | Beethoven | Piano Sonatas   | Alfred Brendel                               |
| Romantic | Beethoven | String Quartets | Emerson String Quartet                       |
| Romantic | Beethoven | Symphonies      | Berlin Philharmonic with Herbert von Karajan |

I will evaluate the LLMs based on the following tags:
* Composer - last name goes first. Use Wikipedia for reference
* Album
* Year Recorded - a four-digit year
* Orchestra - may be an orchestra, quartet, etc. Translate the name of the ensemble into English
* Conductor - last name goes first
* Soloists - last name goes first. Separate multiple soloists with semicolons
* Genre - allowed values are: Renaissance, Baroque, Classical, Romantic, 20th Century, 21st Century
* Work
* Work Number - should be padded to the length of the largest work number in the specific work
* InitialKey - single upper-case letter for major keys, upper-case letter plus minor for minor keys
* Catalog # - should be padded to the length of the largest catalog number in the composer's oeuvre
* Opus - should be padded to the length of the largest catalog number in the composer's oeuvre
* Opus Number - should be padded to the length of the largest catalog number in the composer's oeuvre
* Epithet
* Movement - use Roman numerals
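
Several of these rules are mechanical, so they can be expressed as small helper functions. The sketch below is only an illustration: the helper names (`pad_number`, `to_roman`, `last_name_first`, `join_soloists`) are hypothetical and not part of this notebook, `last_name_first` naively treats the final word as the surname (so it would mishandle a name such as "Herbert von Karajan"), and the space after the semicolon is an assumption.

```python
def pad_number(number: int, largest: int) -> str:
    """Zero-pad `number` to the width of the largest number in the set."""
    return str(number).zfill(len(str(largest)))


def to_roman(number: int) -> str:
    """Convert a positive integer (1-3999) to a Roman numeral."""
    numerals = [
        (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
        (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
        (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
    ]
    result = ''
    for value, symbol in numerals:
        count, number = divmod(number, value)
        result += symbol * count
    return result


def last_name_first(name: str) -> str:
    """Naive conversion: treat the final word as the surname."""
    first, _, last = name.strip().rpartition(' ')
    return f'{last}, {first}' if first else last


def join_soloists(names) -> str:
    """Put each name last-name-first and separate names with semicolons."""
    return '; '.join(last_name_first(name) for name in names)


print(pad_number(5, 104))                               # work 5 of 104
print(to_roman(4))                                      # fourth movement
print(join_soloists(['Gustav Leonhardt', 'Alan Curtis']))
```

Names with particles or multi-word surnames would need a lookup table or manual review rather than a string split.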

Evaluation will be performed by plotting the fraction of true positives for each tag on a spider plot.
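
One way to compute and plot these fractions is sketched below. It assumes that the predictions and the reference tags are lists of dicts keyed by tag name, and it counts a tag as correct only on an exact string match, with a missing value treated as an empty string. Both choices are assumptions, as are the function names `fraction_correct` and `spider_plot` and the toy data. The plot uses matplotlib's polar axes, imported inside the plotting function so that the scoring code runs without matplotlib installed.

```python
import math


def fraction_correct(predicted, reference, tags):
    """Return {tag: fraction of tracks whose predicted value equals the reference}."""
    scores = {}
    for tag in tags:
        matches = sum(
            (pred.get(tag) or '') == (ref.get(tag) or '')
            for pred, ref in zip(predicted, reference)
        )
        scores[tag] = matches / len(reference)
    return scores


def spider_plot(scores, label):
    """Draw one closed polygon with a spoke per tag (requires matplotlib)."""
    import matplotlib.pyplot as plt

    tags = list(scores)
    # One evenly spaced angle per tag; repeat the first point to close the polygon
    angles = [2 * math.pi * i / len(tags) for i in range(len(tags))]
    values = [scores[tag] for tag in tags]
    angles.append(angles[0])
    values.append(values[0])

    fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(tags)
    ax.set_ylim(0, 1)
    ax.legend()
    return fig


# Hypothetical toy data: two tracks, three tags
reference = [
    {'Composer': 'Bach, Johann Sebastian', 'Genre': 'Baroque', 'Movement': 'I'},
    {'Composer': 'Haydn, Joseph', 'Genre': 'Classical', 'Movement': None},
]
predicted = [
    {'Composer': 'Bach, Johann Sebastian', 'Genre': 'Baroque', 'Movement': '1'},
    {'Composer': 'Haydn, Franz Joseph', 'Genre': 'Classical', 'Movement': None},
]
scores = fraction_correct(predicted, reference, ['Composer', 'Genre', 'Movement'])
print(scores)
```

Calling `spider_plot(scores, 'some model')` would then draw one polygon; overlaying several models on the same axes would require passing the axes in, which this sketch omits.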



# Validation

## Import Packages

In [1]:
import numpy as np
import pandas as pd

## Create validation dataset: 1st Attempt

In [2]:
# List of albums to be used in the validation set
# Each album is specified as a list of five tags:
# updated_composer	updated_album	updated_orchestra	updated_conductor	updated_soloists
validation_albums = [
    ['Bach, Johann Sebastian', 'Brandenburg Concertos · Orchestral Suites · Chamber Music', 'Musica Antiqua Köln', 'Goebel, Reinhard', ''],
    ['Bach, Johann Sebastian', 'Harpsichord Concertos', 'Leonhardt-Consort', 'Leonhardt, Gustav', 'Leonhardt, Gustav'],
    ['Bach, Johann Sebastian', 'Harpsichord Concertos', 'Leonhardt-Consort', 'Leonhardt, Gustav', 'Curtis, Alan'],
    ['Bach, Johann Sebastian', 'Organ Works', '', '', 'Walcha, Helmut'],
    ['Bach, Johann Sebastian', 'Complete Bach Collection', '', '', 'Gould, Glenn'],
    ['Handel, George Frideric', 'Concerti Grossi Op 3 & Op 6', 'Münchener Bach-Orchester', 'Richter, Karl', ''],
    ['Handel, George Frideric', 'Organ & Harpsichord Music', 'Amsterdam Baroque Orchestra', '', 'Koopman, Ton'],
    ['Handel, George Frideric', 'Organ & Harpsichord Music', '', '', 'Ross, Scott'],
    ['Handel, George Frideric', 'Organ & Harpsichord Music', '', '', 'Baumont, Olivier'],
    ['Handel, George Frideric', 'Music for the Royal Fireworks · Water Music', 'Academy of St Martin in the Fields', 'Marriner, Neville', ''],
    ['Haydn, Joseph', 'The Six Organ Concertos', 'Amsterdam Baroque Orchestra', 'Koopman, Ton', 'Koopman, Ton'],
    ['Haydn, Joseph', 'Complete Harpsichord Concertos', 'Musica Antiqua Amsterdam', 'Koopman, Ton', 'Koopman, Ton'],
    ['Haydn, Joseph', 'Complete Music for Solo Keyboard', '', '', 'Brautigam, Ronald'],
    ['Haydn, Joseph', 'Complete String Quartets', 'Angeles String Quartet', '', ''],
    ['Haydn, Joseph', 'Complete Symphonies', 'Austro-Hungarian Haydn Orchestra', 'Fischer, Adam', ''],
    ['Mozart, Wolfgang Amadeus', 'Overtures', 'Staatskapelle Dresden', 'Davis, Colin', ''],
    ['Mozart, Wolfgang Amadeus', 'Complete Piano Sonatas', '', '', 'Brautigam, Ronald'],
    ['Mozart, Wolfgang Amadeus', 'Serenaden & Divertimenti', 'Camerata Salzburg', 'Vegh, Sandor', ''],
    ['Mozart, Wolfgang Amadeus', 'The String Quartets', 'Amadeus Quartet', '', ''],
    ['Mozart, Wolfgang Amadeus', 'The String Quintets', 'Amadeus Quartet', '', 'Aronowitz, Cecil'],
    ['Mozart, Wolfgang Amadeus', '46 Symphonies', 'Berlin Philharmonic', 'Böhm, Karl', ''],
    ['Mozart, Wolfgang Amadeus', 'Piano Concertos', 'Camerata Salzburg', 'Vegh, Sandor', 'Schiff, András'],
    ['Vivaldi, Antonio', 'Vivaldi Edition, Volume 1', 'I Musici', '', ''],
    ['Vivaldi, Antonio', 'Vivaldi Edition, Volume 2', 'I Musici', '', '']
]

# Read in the database of all tags
tags_df = pd.read_excel('../tags.xlsx')

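# Create a boolean mask for filtering, aligned to the dataframe's index
mask = pd.Series(False, index=tags_df.index)

# Iterate over each album in the validation set and update the mask.
# An empty string acts as a wildcard: it matches any value in that column.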
for album in validation_albums:
    composer, album_name, orchestra, conductor, soloists = album
    mask |= (
        (tags_df['updated_composer'] == composer) &
        (tags_df['updated_album'] == album_name) &
        ((tags_df['updated_orchestra'] == orchestra) | (orchestra == '')) &
        ((tags_df['updated_conductor'] == conductor) | (conductor == '')) &
        ((tags_df['updated_soloists'] == soloists) | (soloists == ''))
    )

# Subset the dataframe using the mask
subset_df = tags_df[mask]

print('The total number of tracks in the validation set is', len(subset_df))

The total number of tracks in the validation set is 3129


That is far too many tracks to evaluate. Manual review reveals there are 145 distinct types of works in the validation set. Let's randomly select one track of each.
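
One detail of the filter in the first attempt is worth keeping in mind: an empty string in `validation_albums` acts as a wildcard, because a condition such as `(conductor == '')` is true for every row, so that column is not filtered at all. This may contribute to the track count. The sketch below uses hypothetical toy data (the rows and the `match_album` helper are illustrative and do not come from `tags.xlsx`) to contrast the wildcard behavior with a stricter reading in which an empty string matches only missing values.

```python
import pandas as pd

# Hypothetical toy data, not taken from tags.xlsx
toy = pd.DataFrame({
    'updated_album': ['Organ Works', 'Organ Works', 'Organ Works'],
    'updated_conductor': [None, 'Richter, Karl', None],
    'updated_soloists': ['Walcha, Helmut', 'Walcha, Helmut', 'Koopman, Ton'],
})


def match_album(df, album, conductor, soloists, wildcard=True):
    """Boolean mask for one album; '' matches anything or only missing values."""
    def column_matches(column, value):
        if value == '':
            if wildcard:
                return pd.Series(True, index=df.index)
            return df[column].isna()
        return df[column] == value

    return (
        (df['updated_album'] == album)
        & column_matches('updated_conductor', conductor)
        & column_matches('updated_soloists', soloists)
    )


wildcard_hits = match_album(toy, 'Organ Works', '', 'Walcha, Helmut').sum()
strict_hits = match_album(toy, 'Organ Works', '', 'Walcha, Helmut', wildcard=False).sum()
print(wildcard_hits, strict_hits)
```

Whether the wildcard or the strict reading is the right one depends on how the albums are tagged, so this is a point to check rather than a correction.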

## Create validation dataset: 2nd Attempt

In [3]:
# Determine the number of distinct types of works
distinct_works = subset_df.drop_duplicates(subset=['updated_composer', 'updated_album', 'updated_orchestra', 
                                                   'updated_conductor', 'updated_soloists', 'updated_work'])
print('The total number of distinct works in the validation set is', len(distinct_works), '\n')

# Randomly select one track for each work
subset_df = subset_df.fillna('') # Fill missing values with an empty string for grouping
seed = 0
validation_set = subset_df.groupby(
    ['updated_composer', 'updated_album', 'updated_orchestra',
     'updated_conductor', 'updated_soloists', 'updated_work']
).apply(lambda x: x.sample(1, random_state=seed)).reset_index(drop=True)
validation_set.replace('', np.nan, inplace=True) # Replace empty strings with NaN

# Write to file
validation_set.to_excel('validation_set.xlsx', index=False)

# Report the number of updated_tag fields that have values
updated_tag_fields = [col for col in validation_set.columns if col.startswith('updated')]
non_null_counts = validation_set[updated_tag_fields].notnull().sum()

print("Number of updated_tag fields that have values:")
print(non_null_counts)


The total number of distinct works in the validation set is 144 




Number of updated_tag fields that have values:
updated_composer         144
updated_album            144
updated_year recorded    144
updated_orchestra         71
updated_conductor         54
updated_soloists          86
updated_arranger           0
updated_genre            144
updated_discnumber       130
updated_tracknumber      144
updated_title            144
updated_tracktitle         0
updated_work             144
updated_work number       39
updated_initialkey       111
updated_catalog #        143
updated_opus              19
updated_opus number       25
updated_epithet           10
updated_movement          73
dtype: int64
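
The `fillna('')` call before grouping matters because pandas `groupby` drops rows whose group key is NaN by default, and many tracks have no orchestra, conductor, or soloists. The sketch below uses hypothetical toy data to show the effect, along with `dropna=False` as an alternative that keeps those groups without filling. It is an illustration of the pandas behavior, not a change to the method used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the sonata has no orchestra
toy = pd.DataFrame({
    'updated_work': ['Sonata', 'Sonata', 'Symphony', 'Symphony'],
    'updated_orchestra': [np.nan, np.nan, 'Berlin Philharmonic', 'Berlin Philharmonic'],
    'updated_tracknumber': [1, 2, 3, 4],
})
keys = ['updated_work', 'updated_orchestra']

# Default behavior: rows with a NaN key are dropped, so the sonata disappears
default_groups = len(toy.groupby(keys).size())

# Filling NaN first (as done above) keeps the sonata as its own group
filled_groups = len(toy.fillna('').groupby(keys).size())

# dropna=False keeps NaN keys without modifying the data
kept_groups = len(toy.groupby(keys, dropna=False).size())

print(default_groups, filled_groups, kept_groups)
```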
