# Folk $n$-gram aNalysis (FONN)

*Fonn* (pronounced "fun") is an Irish (*Gaeilge*) word for "tune".

In this strand of research we have created three Polifonia components:

1. Folk $n$-gram aNalysis (FONN)
2. Ceol Rince na hÉireann (CRE) MIDI corpus
3. Root note detection

This demo notebook shows how to process a corpus such as the CRE corpus using FONN.

### Prerequisites

* In `<basepath>/MIDI` we should have a corpus of folk tunes in MIDI format. By default `basepath` is `./corpus/`. If the corpus is elsewhere, change `basepath` below. We will be writing outputs to subdirectories of `basepath`. The corpus should include a `roots.csv` file containing a root note (integer between 0 and 11) for each MIDI file.

* Install the following libraries:

    `pip install fastDamerauLevenshtein music21 numpy pandas tqdm`
    
    or just:
    
    `pip install -r requirements.txt`


In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None

basepath = "corpus"

### MIDI-to-feature sequence conversion

By running the `setup_corpus.py` script, we can read monophonic MIDI files and convert them to a feature sequence representation.

NOTE: The feature sequence representation encodes each piece of music in the corpus as a sequence of note events. Each note event has _primary features_, such as pitch, duration, and onset, which can be extracted directly from the MIDI file. From these, further _secondary features_ such as interval, key-invariant pitch, and pitch class can be calculated.

Running the script will create subfolders under `<basepath>/feat_seq_corpus` and populate them with `.csv` feature sequence data files:

In [3]:
!python 'setup_corpus.py'


Setting up lookup table for root assignment:
  note names  midi num  root num
0          C        60         0
1   C# or D-        61         1
2          D        62         2
3   D# or E-        63         3
4          E        64         4 



Setting up Music21 root detection lookup table:
  note name  pitch class
0         C          0.0
1        C#          1.0
2        D-          1.0
3         D          2.0
4        D#          3.0 


Reading corpus MIDI files to Music21 streams: 100%|█| 1225/1225 [00:20<00:00, 60
Calculating feature sequences from music21 scores: 100%|█| 1224/1224 [00:05<00:0
Calculating pitch class sequences: 100%|██| 1224/1224 [00:00<00:00, 1232.46it/s]
Calculating interval sequences: 100%|█████| 1224/1224 [00:01<00:00, 1131.71it/s]

Reading roots data from: ./corpus/roots.csv

                             root
title                            
Tureengarbh Jig, The            2
Young And Stylish               9
Fun at the Fai

Sample output for `Lord McDonald's (reel).csv`:

In [4]:
sample = pd.read_csv(basepath + "/feat_seq_corpus/feat_seq/Lord McDonald's (reel).csv", index_col=0)
sample.head()

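| Unnamed: 0 | midi_note | onset | duration | velocity | pitch_class | interval | root | relative_pitch | relative_pitch_class | parsons_code | parsons_cumsum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 0 | 1 | 105 | 7 | 0 | 7 | 0 | 0 | 0 | 0 |
| 1 | 62 | 1 | 1 | 105 | 2 | -5 | 7 | -5 | 7 | -1 | -1 |
| 2 | 67 | 2 | 1 | 80 | 7 | 5 | 7 | 0 | 0 | 1 | 0 |
| 3 | 71 | 3 | 1 | 80 | 11 | 4 | 7 | 4 | 4 | 1 | 1 |
| 4 | 67 | 4 | 1 | 80 | 7 | -4 | 7 | 0 | 0 | -1 | 0 |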


Here we see the primary feature sequences generated by Music21 (i.e. sequences derived directly from the MIDI file):

- *midi_note* -- MIDI note number, chromatic integer scale
- *onset* -- note onset, eighth notes
- *duration* -- note duration, eighth notes
- *velocity* -- MIDI velocity

And the secondary feature sequences (derived from the primary sequences):

- *pitch_class* -- octave-invariant chromatic pitch class (MIDI note number modulo 12)
- *interval* -- chromatic interval
- *root* -- (scalar) chromatic pitch class representing the root / tonal centre of the tune
- *relative_pitch* -- key-invariant chromatic pitch, relative to root
- *relative_pitch_class* -- key-invariant chromatic pitch class, relative to root
- *parsons_code* -- simple melodic contour: up = 1; down = -1; repeat = 0
- *parsons_cumsum* -- cumulative Parsons code
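
The relationship between the primary and secondary features can be illustrated with a minimal sketch. This is not the code in `setup_corpus.py`; the function name `secondary_features` and the `root_reference_midi` parameter are hypothetical. The sketch assumes that relative pitch is measured from the root placed in the octave starting at MIDI note 60, and that the first interval is set to 0, both of which are consistent with the sample rows above.

```python
from itertools import accumulate


def secondary_features(midi_notes, root, root_reference_midi=60):
    """Derive secondary feature sequences from MIDI note numbers and a root.

    Hypothetical helper for illustration only. `root` is a pitch class (0-11);
    `root_reference_midi` is an assumed reference octave for relative pitch.
    """
    notes = list(midi_notes)
    # Octave-invariant pitch class.
    pitch_class = [m % 12 for m in notes]
    # Chromatic interval from the previous note; the first note gets 0.
    previous = notes[:1] + notes[:-1]
    interval = [cur - prev for prev, cur in zip(previous, notes)]
    # Key-invariant pitch and pitch class, relative to the root.
    relative_pitch = [m - (root_reference_midi + root) for m in notes]
    relative_pitch_class = [(pc - root) % 12 for pc in pitch_class]
    # Parsons code is the sign of the interval: up = 1, down = -1, repeat = 0.
    parsons_code = [(i > 0) - (i < 0) for i in interval]
    parsons_cumsum = list(accumulate(parsons_code))
    return {
        "pitch_class": pitch_class,
        "interval": interval,
        "relative_pitch": relative_pitch,
        "relative_pitch_class": relative_pitch_class,
        "parsons_code": parsons_code,
        "parsons_cumsum": parsons_cumsum,
    }


# First five notes of the sample tune, with root 7 (G).
features = secondary_features([67, 62, 67, 71, 67], root=7)
for name, values in features.items():
    print(name, values)
```

Under these assumptions, the printed sequences agree with the five sample rows shown above.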


### Pattern extraction with $n$-grams

Next, using $n$-grams, we extract all *relative_pitch_class* patterns of 3-7 notes in length which occur at least once in the corpus, and count their occurrences in each tune.

Results are saved to the `<basepath>/pattern_corpus` subdirectory as two sparse pandas DataFrames in `.pkl` format: one containing pattern frequency counts, the other containing pattern TF-IDF values.

Default pattern extraction parameters, such as the target feature sequence, pattern length(s), and level (accent-level or note-level), can be accessed and edited via `pattern_extraction.main()` in the `<basepath>/pattern_extraction.py` file.

NOTE: The work below targets accent-level feature sequence data, which is obtained by filtering the note-level feature sequences and retaining only notes which occur on rhythmically accented beats.
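
As an illustration of the $n$-gram counting step, here is a minimal sketch in plain Python. It is not the code in `pattern_extraction.py`; the function name `count_ngrams` and the toy sequences are hypothetical, chosen only to show how per-tune counts, corpus frequency, and document frequency relate to each other.

```python
from collections import Counter


def count_ngrams(sequence, n_values=range(3, 8)):
    """Count every contiguous pattern of 3-7 items in one feature sequence."""
    counts = Counter()
    for n in n_values:
        for start in range(len(sequence) - n + 1):
            counts[tuple(sequence[start:start + n])] += 1
    return counts


# Toy relative_pitch_class sequences (hypothetical data, not from the corpus).
toy_corpus = {
    "Tune A": [0, 7, 0, 7, 0],
    "Tune B": [0, 7, 0, 2],
}

# Per-tune pattern counts.
per_tune = {title: count_ngrams(seq) for title, seq in toy_corpus.items()}
# Corpus-level frequency: total occurrences of each pattern.
corpus_freq = sum(per_tune.values(), Counter())
# Document frequency: number of tunes in which each pattern occurs.
doc_freq = Counter(ngram for counts in per_tune.values() for ngram in counts)

print(corpus_freq[(0, 7, 0)], doc_freq[(0, 7, 0)])
```

In this toy corpus the pattern `(0, 7, 0)` occurs twice in "Tune A" and once in "Tune B", so its corpus frequency is 3 and its document frequency is 2.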

In [5]:
!python 'pattern_extraction.py'


Initial n-gram corpus data:
       ngram  freq  doc_freq      idf
0  (0, 7, 0)   930       319  5.46672
1  (0, 0, 0)  1104       318  5.46985
2  (0, 0, 7)   775       295  5.54462
3  (2, 0, 0)   736       285  5.57897
4  (7, 0, 7)   742       281  5.59306
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75183 entries, 0 to 75182
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ngram     75183 non-null  object 
 1   freq      75183 non-null  int64  
 2   doc_freq  75183 non-null  int64  
 3   idf       75183 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 2.3+ MB
None

Extracting n-gram patterns...: 100%|████████| 1224/1224 [00:23<00:00, 51.56it/s]

Populated n-gram corpus dataframe:
       ngram  freq  ...  Sheahan's Fancy  Sligo Jig, The
0  (0, 7, 0)   930  ...                2               0
1  (0, 0, 0)  1104  ...                0               0
2  (0, 0, 7)

Now we have frequency and TF-IDF values for all $n$-gram patterns in the corpus.
This data is stored in two sparse pandas DataFrames, which are written to `<basepath>/pattern_corpus/freq.pkl` and `<basepath>/pattern_corpus/tfidf.pkl` respectively.

See the output of the cell above for summary info and the head of both tables.

The above table contains all *relative_pitch_class* patterns of 3-7 items in length which occur at least once in the corpus.

These patterns are held in the *ngram* column.

To the right of *ngram* are simple corpus-level statistics (frequency, document frequency, IDF), followed by a column for every tune in the corpus.
In `<basepath>/pattern_corpus/freq.pkl` these columns hold frequency values for each pattern;
in `<basepath>/pattern_corpus/tfidf.pkl` they hold TF-IDF values.
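
To make the relationship between the two tables concrete, the sketch below computes TF-IDF values from per-tune frequency counts. It uses the textbook weighting (raw count multiplied by the natural log of the number of tunes divided by document frequency); this is an assumption, and `pattern_extraction.py` may use a different variant, so these values need not match the *idf* column printed above. The function name `tfidf_from_counts` and the toy data are hypothetical.

```python
import math


def tfidf_from_counts(freq_by_tune):
    """Convert {tune: {ngram: count}} into {tune: {ngram: tf-idf}}.

    Assumed weighting: tf-idf = count * ln(n_tunes / doc_freq).
    """
    n_tunes = len(freq_by_tune)
    # Document frequency: number of tunes containing each pattern.
    doc_freq = {}
    for counts in freq_by_tune.values():
        for ngram, count in counts.items():
            if count > 0:
                doc_freq[ngram] = doc_freq.get(ngram, 0) + 1
    idf = {ngram: math.log(n_tunes / df) for ngram, df in doc_freq.items()}
    return {
        title: {ng: count * idf[ng] for ng, count in counts.items() if count > 0}
        for title, counts in freq_by_tune.items()
    }


# Toy frequency table (hypothetical data, not from the corpus).
toy_freq = {
    "Tune A": {(0, 7, 0): 2, (2, 0, 0): 1},
    "Tune B": {(0, 7, 0): 1},
}
toy_tfidf = tfidf_from_counts(toy_freq)
print(toy_tfidf["Tune A"])
```

With this weighting, a pattern that occurs in every tune scores 0, while a pattern confined to a few tunes scores highly in those tunes, which is why TF-IDF ranking favours patterns that are distinctive for a given tune.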

Finally, we will run the work-in-progress script for calculating similarity between tunes, based on similarity between frequent local $n$-gram patterns. Pattern similarity is measured using Damerau-Levenshtein edit distance.

In [6]:
from similarity_search import PatternSimilarity

# read TF-IDF pattern corpus:
basepath = "./corpus"
f_in = basepath + "/pattern_corpus/tfidf.pkl"
pattern_search = PatternSimilarity(f_in)
# set up out path for results:
res_path = basepath + "/results"
pattern_search.results_path = res_path
# set up search candidate tune and extract search term pattern:
# First arg: 'title' parameter: tune title as in the original MIDI filename.
# Second arg: 'n' parameter: number of items in the search term pattern(s) to be extracted.
# In this notebook 'n' can be any int value from 3 to 7.
# Third arg: 'mode' parameter: can be 'max' (extracts the pattern(s) with the max TF-IDF value in the candidate tune)
# or 'idx' (extracts the pattern(s) at specific indices, as ranked by TF-IDF).
# Note: if using 'idx' mode, an additional 'indices' arg must be passed, containing a list of indices as in the example below.
pattern_search.extract_candidate_patterns("Lord McDonald's (reel)", n=6, mode='idx', indices=[0])
# run pattern similarity search:
# 'edit_dist_threshold' arg sets the number of differences above which a pattern is considered dissimilar.
pattern_search.find_similar_patterns(edit_dist_threshold=1)
# run local pattern-based tune similarity search:
pattern_search.find_similar_tunes()
# display and save results table:
pattern_search.compile_results_table()

Locating candidate tune in pattern corpus...: 100%|██████████| 1226/1226 [00:00<00:00, 1064861.61it/s]


'idx' mode selected -- extracting pattern(s) as ranked by TF-IDF according to their indices...


Calculating pattern similarity...: 100%|██████████| 75183/75183 [00:00<00:00, 424588.17it/s]


Frequent n-gram pattern(s) extracted from Lord McDonald's (reel):
(2, 7, 2, 7, 4, 7)






34 Similar patterns detected:
                   ngram  (2, 7, 2, 7, 4, 7)
0     (2, 4, 2, 7, 4, 7)                 1.0
1  (0, 2, 7, 2, 7, 4, 7)                 1.0
2     (7, 7, 2, 7, 4, 7)                 1.0
3    (2, 11, 2, 7, 4, 7)                 1.0
4     (2, 7, 2, 7, 9, 7)                 1.0

Searching corpus for similar tunes...

Similarity results for Lord McDonald's (reel):
                             title  count
0           Lord McDonald's (reel)      8
1            Tim Mulloney's (reel)      5
2  Maid Of Mount Kisco (reel), The      3
3                    Ballykeal Jig      3
4        Biddy the Darling (slide)      2


The above code cell performs the following operations:
- Read the $n$-gram pattern corpus ranked by TF-IDF at `<basepath>/pattern_corpus/tfidf.pkl`.
- Extract the first-indexed 6-gram pattern from the tune *Lord McDonald's (reel)*, as ranked by TF-IDF.
- Find all patterns in the corpus within a Damerau-Levenshtein edit distance of 1 from the search term pattern.
- Filter the pattern corpus, retaining only tunes in which similar local patterns occur.
- Count the number of similar patterns per retained tune and print results table.
- Save a CSV table of pattern counts per tune to the `<basepath>/results` directory.
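
The pattern-matching steps listed above can be sketched in plain Python. This is not the implementation in `similarity_search.py`, which relies on the `fastDamerauLevenshtein` library. The sketch implements the optimal string alignment (restricted) variant of Damerau-Levenshtein distance, and it assumes that the per-tune count is the number of distinct similar patterns present in a tune. The names `osa_distance`, `candidates`, and `toy_tunes` are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two sequences.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent items (restricted Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent items.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


query = (2, 7, 2, 7, 4, 7)
# Candidate patterns (the first two appear in the output above; the rest are toy data).
candidates = [(2, 4, 2, 7, 4, 7), (0, 2, 7, 2, 7, 4, 7), (0, 0, 0), (7, 2, 2, 7, 4, 7)]
similar = [p for p in candidates if osa_distance(query, p) <= 1]

# Toy mapping of tunes to the patterns they contain (hypothetical data).
toy_tunes = {"Tune A": {(2, 4, 2, 7, 4, 7), (0, 0, 0)}, "Tune B": {(0, 0, 0)}}
# Count similar patterns per tune and retain only tunes with at least one.
counts = {t: sum(p in pats for p in similar) for t, pats in toy_tunes.items()}
results = {t: c for t, c in counts.items() if c > 0}
print(similar)
print(results)
```

Here a substitution, an insertion, and an adjacent transposition each give a distance of 1 from the query, so all three are retained at a threshold of 1, while `(0, 0, 0)` is rejected.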

As can be seen above, the top result is the candidate tune itself, which contains 8 patterns similar to our search term.
Next is the tune *Tim Mulloney's (reel)*, which contains 5 similar patterns, and so on down the table.

Change the argument values in the calls above (target tune title, `n` value, edit distance threshold) to explore the similarity search tool.
Tune titles must be formatted as in the MIDI tune filenames in `<basepath>/MIDI/`.
For information on the arguments, please see the comments in the code cell above.

This approach to measuring similarity gives musically plausible results in informal testing but has not yet been
formally evaluated. The methodology is currently undergoing testing on the corpus of 40,000+ Irish tunes held on
[thesession.org](https://thesession.org). We are also currently compiling a test subset of ground-truth annotated tunes from *The Session* to allow quantitative testing and tuning of the similarity search tool.