# Folk $n$-gram aNalysis (FONN)

*Fonn* (pronounced "fun") is an Irish (*Gaeilge*) word for "tune".

In this strand of research we have created three Polifonia components:

1. Folk $n$-gram aNalysis (FONN)
2. Ceol Rince na hÉireann (CRE) MIDI corpus
3. Root note detection

This demo notebook shows how to process a corpus such as the CRE corpus using FONN.

### Prerequisites

* In `<basepath>/MIDI` we should have a corpus of folk tunes in MIDI format. By default `basepath` is `./corpus/`. If the corpus is elsewhere, change `basepath` below. We will be writing outputs to subdirectories of `basepath`. The corpus should include a `roots.csv` file containing a root note (integer between 0 and 11) for each MIDI file.

* Install the following libraries:

    `pip install feather music21 pyarrow fastDamerauLevenshtein`
    
    or just:
    
    `pip install -r requirements.txt`
    
### TODO

Some TODO items are mentioned throughout the notebook.

In [1]:
import os.path
import glob
import sys
sys.path.append("setup_corpus") # TODO we should be able to remove this by making setup_corpus a proper module

import pandas as pd
pd.options.mode.chained_assignment = None

In [2]:
basepath = "corpus"
inpath = basepath + "/MIDI"
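midi_paths = glob.glob(inpath + "/*.mid")
if not midi_paths:
    # Fail early with a clear message instead of an IndexError on [0]
    raise FileNotFoundError(f"No .mid files found in {inpath}")
typical_midi_path = midi_paths[0]
typical_midi_filename = os.path.basename(typical_midi_path)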
roots_path = basepath + "/roots.csv"
feat_seq_path = basepath + "/feat_seq_data/note"
accents_path = basepath + "/feat_seq_data/accent"
typical_feat_filename = feat_seq_path + "/" + typical_midi_filename[:-4] + "_note.csv"
duration_weighted_path = basepath + "/feat_seq_data/duration_weighted"
pitch_class_accents_ngrams_freq_path = basepath + "/ngrams/cre_pitch_class_accents_ngrams_freq.csv"
ngram_inpath = basepath + "/feat_seq_data/accent"
ngram_outpath = basepath + "/ngrams"
ngram_sim_inpath = basepath + "/ngrams/cre_pitch_class_accents_ngrams_tfidf.ftr" # please check
if os.path.exists(inpath):
    for path in [feat_seq_path, accents_path, duration_weighted_path, 
                 ngram_inpath, ngram_outpath]:
        os.makedirs(path, exist_ok=True)
else:
    print(f"Input path for MIDI corpus {inpath} does not exist")

In [3]:
typical_feat_filename

'corpus/feat_seq_data/note/Tureengarbh Jig, The_note.csv'

The two Python modules imported below, `setup_corpus` and `corpus_processing_tools`, contain tools for reading the MIDI data and processing it into primary and secondary feature sequences, key-invariant sequences, and duration-weighted sequences.

**TODO** define "primary and secondary feature sequences" very briefly in the notebook (already described in other deliverable of course).
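
As a rough illustration of one of the sequence types named above, the sketch below shows one possible reading of "duration-weighted": each feature value is repeated in proportion to the note's duration, so that longer notes contribute more elements. This is an assumption made for illustration, not a description of the FONN implementation; the helper `duration_weight`, the `unit` parameter, and the toy values are all hypothetical.

```python
def duration_weight(values, durations, unit=0.5):
    """Repeat each value once per `unit` of duration (hypothetical helper)."""
    weighted = []
    for value, duration in zip(values, durations):
        # Number of repetitions is the duration expressed in multiples of `unit`
        weighted.extend([value] * round(duration / unit))
    return weighted


# Toy data (not taken from the corpus): three notes of different lengths
pitch_classes = [10, 7, 3]
durations = [1.0, 0.5, 1.5]

weighted = duration_weight(pitch_classes, durations)
print(weighted)
```

With these toy values, the first note appears twice, the second once, and the third three times in the weighted sequence.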

In [4]:
from setup_corpus import setup_corpus
from corpus_processing_tools import Music21Corpus, MusicDataCorpus


Setting up lookup table for root assignment:
  note names  midi num  root num
0          C        60         0
1   C# or D-        61         1
2          D        62         2
3   D# or E-        63         3
4          E        64         4 



Setting up Music21 root detection lookup table:
  note name  pitch class
0         C            0
1        C#            1
2        D-            1
3         D            2
4        D#            3 




Running the following cell takes about 15 minutes. It produces many `csv` files under `<basepath>/feat_seq_data/note`, `<basepath>/feat_seq_data/accent`, and `<basepath>/feat_seq_data/duration_weighted`. To save time, we first check whether a typical output file already exists and skip the computation if it does.

In [5]:
if not os.path.exists(typical_feat_filename):
    m21_corpus = Music21Corpus(inpath)
    corpus = setup_corpus.SetupCorpus(m21_corpus)
    corpus.generate_primary_feat_seqs()
    corpus.setup_music_data_corpus()
    corpus.run_simple_secondary_feature_sequence_calculations()
    corpus.run_key_invariant_sequence_calulations(roots_path)
    corpus.run_duration_weighted_sequence_calculations(['pitch', 'pitch_class'])
    corpus.save_corpus(
        feat_seq_path=feat_seq_path,
        accents_path=accents_path,
        duration_weighted_path=duration_weighted_path
    )



For example, in `A Trip To Galway_note.csv`, the first few lines will be:

In [6]:
df = pd.read_csv(basepath + "/feat_seq_data/note/A Trip To Galway_note.csv")
df.head()

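| Unnamed: 0.1 | Unnamed: 0 | MIDI_note | onset | duration | velocity | interval | parsons_code | Parsons_cumsum | chromatic_root | pitch | pitch_class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 74 | 0.0 | 1.0 | 105 | 0 | 0 | 0 | 4 | 10 | 10 |
| 1 | 1 | 71 | 1.0 | 1.0 | 105 | -3 | -1 | -1 | 4 | 7 | 7 |
| 2 | 2 | 67 | 2.0 | 1.0 | 80 | -4 | -1 | -2 | 4 | 3 | 3 |
| 3 | 3 | 64 | 3.0 | 1.0 | 80 | -3 | -1 | -3 | 4 | 0 | 0 |
| 4 | 4 | 64 | 4.0 | 1.0 | 95 | 0 | 0 | -3 | 4 | 0 | 0 |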


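Here, we see that each row describes one note. In the rows shown, `interval` is the change in `MIDI_note` from the previous note in semitones, `parsons_code` is the sign of that interval, `Parsons_cumsum` is its running total, and `pitch_class` equals `MIDI_note` minus `chromatic_root`, modulo 12. **TODO** describe the remaining columns.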
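
The relationships between some of these columns can be checked by hand. The following sketch recomputes `interval`, `parsons_code`, `Parsons_cumsum` and `pitch_class` from the `MIDI_note` and `chromatic_root` values printed above. The formulas are inferred from the five rows shown, so they are an assumption rather than the FONN implementation itself.

```python
import pandas as pd

# MIDI_note and chromatic_root values copied from the five rows shown above
notes = pd.DataFrame({
    "MIDI_note": [74, 71, 67, 64, 64],
    "chromatic_root": [4, 4, 4, 4, 4],
})

# Interval in semitones from the previous note (0 for the first note)
notes["interval"] = notes["MIDI_note"].diff().fillna(0).astype(int)

# Parsons code: the sign of the interval (clipping integers to [-1, 1])
notes["parsons_code"] = notes["interval"].clip(-1, 1)

# Running total of the Parsons code, tracing the melodic contour
notes["Parsons_cumsum"] = notes["parsons_code"].cumsum()

# Key-invariant pitch class: distance above the root, modulo 12
notes["pitch_class"] = (notes["MIDI_note"] - notes["chromatic_root"]) % 12

print(notes)
```

For these rows, the recomputed columns agree with the values in the table.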

Next, we will calculate the most important $n$-grams in each tune, calculating importance using TF-IDF.

In [7]:
from ngram_tfidf_tools import NgramCorpus
from setup_ngrams_tfidf import SetupNgramsTfidf

Again, the following cell will take about 25 minutes to run, so we check whether the output files already exist before running.

In [8]:
if not os.path.exists(pitch_class_accents_ngrams_freq_path):
    n_vals = list(range(5, 10))
    feat_seq_corpus = NgramCorpus(ngram_inpath)
    ngram_corpus = SetupNgramsTfidf(feat_seq_corpus, "pitch_class", n_vals)
    ngram_corpus.extract_ngrams()
    ngram_corpus.calculate_tfidf()
    ngram_corpus.save_results(outpath=ngram_outpath,
                              corpus_name='cre_pitch_class_accents')
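
To make the idea behind this step concrete, here is a minimal sketch of $n$-gram extraction and TF-IDF weighting on a toy corpus. It uses the textbook formulation (raw count times the natural log of the inverse document frequency), which may differ from the exact weighting used by `SetupNgramsTfidf`. The tune names, sequences and the helper `extract_ngrams` are hypothetical.

```python
import math
from collections import Counter


def extract_ngrams(seq, n):
    """Return all length-n sliding windows of seq as tuples (hypothetical helper)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Toy pitch-class sequences (not taken from the corpus)
tunes = {
    "tune_a": [0, 4, 7, 4, 0, 4, 7, 4],
    "tune_b": [0, 4, 7, 0, 2, 4, 7, 0],
    "tune_c": [7, 5, 4, 2, 0, 2, 4, 5],
}
n = 3

# Term frequency: how often each n-gram occurs in each tune
counts = {name: Counter(extract_ngrams(seq, n)) for name, seq in tunes.items()}

# Document frequency: in how many tunes each n-gram occurs at least once
doc_freq = Counter(ngram for tune_counts in counts.values() for ngram in tune_counts)

# TF-IDF: n-grams frequent in one tune but rare across the corpus score highest
n_docs = len(tunes)
tfidf = {
    name: {ngram: tf * math.log(n_docs / doc_freq[ngram])
           for ngram, tf in tune_counts.items()}
    for name, tune_counts in counts.items()
}

top_ngram = max(tfidf["tune_a"], key=tfidf["tune_a"].get)
print(top_ngram, round(tfidf["tune_a"][top_ngram], 3))
```

In this toy corpus, `(0, 4, 7)` and `(4, 7, 4)` each occur twice in `tune_a`, but `(0, 4, 7)` also occurs in `tune_b`, so it receives a lower score.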

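Now we have per-tune counts and TF-IDF scores for each $n$-gram. For example, the frequency table holds a count column for each tune, plus overall `freq` and `idf` columns: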

In [9]:
df = pd.read_csv(pitch_class_accents_ngrams_freq_path)
df.head()


Unnamed: 0.1,Unnamed: 0,ngram,"Primrose Girl (reel), The_accent_pitch_class_freq","Soldier's Joy, The_accent_pitch_class_freq","Sligo Jig, The_accent_pitch_class_freq","Rambling Connachtman (reel), The_accent_pitch_class_freq",Orange and Green_accent_pitch_class_freq,"London Lasses (reel), The_accent_pitch_class_freq",Dusty Miller_accent_pitch_class_freq,"Rainy Day Jig, The_accent_pitch_class_freq",...,Tear the Calico (reel)_accent_pitch_class_freq,Jack's Alive (reel)_accent_pitch_class_freq,"Tirnaskea Lasses (reel), The_accent_pitch_class_freq",Roaring Mary (reel)_accent_pitch_class_freq,"Master's Sporting Paddy (reel), The_accent_pitch_class_freq","Chattering Magpie (reel), The_accent_pitch_class_freq",Scully Casey's ( ) (hornpipe)_accent_pitch_class_freq,"Ranger (h'pipe), The_accent_pitch_class_freq",freq,idf
0,0,"(0, 0, 0, 0, 0)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,595,6.225
1,1,"(0, 0, 0, 0, 0, 0)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,494,6.41
2,2,"(0, 0, 0, 0, 0, 0, 0)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,456,6.49
3,3,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,435,6.537
4,4,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,425,6.561


**TODO** Explain a little of the above, or else print out something else to show what we have got from the ngrams.

In [10]:
df[["ngram", "A Trip To Galway_accent_pitch_class_freq", "freq", "idf"]]

| Unnamed: 0 | ngram | A Trip To Galway_accent_pitch_class_freq | freq | idf |
|---|---|---|---|---|
| 0 | (0, 0, 0, 0, 0) | 0 | 595 | 6.225 |
| 1 | (0, 0, 0, 0, 0, 0) | 0 | 494 | 6.410 |
| 2 | (0, 0, 0, 0, 0, 0, 0) | 0 | 456 | 6.490 |
| 3 | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) | 0 | 435 | 6.537 |
| 4 | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) | 0 | 425 | 6.561 |
| ... | ... | ... | ... | ... |
| 495 | (7, 7, 0, 0, 4) | 0 | 19 | 9.667 |
| 496 | (10, 10, 0, 2, 10) | 0 | 19 | 9.667 |
| 497 | (7, 10, 7, 10, 7, 10) | 0 | 19 | 9.667 |
| 498 | (0, 5, 0, 0, 4) | 0 | 19 | 9.667 |


Finally, we will demonstrate some work-in-progress for calculating similarity between tunes, based on similarity between their $n$-grams. This uses the Damerau-Levenshtein edit distance.
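
For readers unfamiliar with this measure, the sketch below implements the restricted variant of Damerau-Levenshtein distance (also called optimal string alignment) in pure Python. It counts insertions, deletions, substitutions and transpositions of adjacent elements. It is for illustration only: the notebook itself relies on the `fastDamerauLevenshtein` package, whose results may differ from this restricted variant in some cases, and the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i items of a and first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i items
    for j in range(cols):
        d[0][j] = j  # insert all j items
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent items
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Two pitch-class n-grams that differ by a single substitution
print(osa_distance((2, 7, 2, 7, 2, 7), (2, 7, 2, 7, 4, 7)))
# Two n-grams that differ by a single insertion
print(osa_distance((4, 4, 7, 7, 4), (4, 4, 7, 7, 2, 4)))
```

Because the function only indexes and compares elements, it works on tuples of pitch classes as well as on strings. An edit distance threshold of 1 then admits patterns that differ from the query by a single edit.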

In [11]:
from ngram_pattern_search import NgramSimilarity
from fastDamerauLevenshtein import damerauLevenshtein

In [12]:
pattern_search = NgramSimilarity(ngram_sim_inpath)
pattern_search.extract_candidate_ngrams("Lord McDonald's (reel)", n=6, mode='idx', indices=[0, 1])
pattern_search.setup_test_corpus()
pattern_search.find_similar_patterns(edit_dist_threshold=1)
pattern_search.find_similar_tunes()
print(pattern_search.results)

                            ngram  \
0  (2.0, 7.0, 2.0, 7.0, 4.0, 7.0)   
1  (4.0, 4.0, 7.0, 7.0, 2.0, 4.0)   
2  (7.0, 2.0, 4.0, 0.0, 7.0, 4.0)   
3  (7.0, 4.0, 7.0, 2.0, 7.0, 2.0)   
4  (7.0, 2.0, 4.0, 7.0, 2.0, 4.0)   

   Primrose Girl (reel), The_accent_pitch_class_tfidf  \
0                                                0.0    
1                                                0.0    
2                                                0.0    
3                                                0.0    
4                                                0.0    

   Soldier's Joy, The_accent_pitch_class_tfidf  \
0                                          0.0   
1                                          0.0   
2                                          0.0   
3                                          0.0   
4                                          0.0   

   Sligo Jig, The_accent_pitch_class_tfidf  \
0                                      0.0   
1                                      0.

Corpus n-gram filtering complete.

Searching corpus for similar n-gram patterns...
57 Similar patterns detected:
                            ngram  (2.0, 7.0, 2.0, 7.0, 4.0, 7.0)  \
0  (2.0, 7.0, 2.0, 7.0, 2.0, 7.0)                             1.0   
1       (4.0, 4.0, 7.0, 7.0, 4.0)                             4.0   
2       (7.0, 2.0, 7.0, 4.0, 7.0)                             1.0   
3  (2.0, 7.0, 4.0, 7.0, 4.0, 7.0)                             1.0   
4       (2.0, 7.0, 2.0, 7.0, 4.0)                             1.0   

   (4.0, 4.0, 7.0, 7.0, 2.0, 4.0)  
0                             4.0  
1                             1.0  
2                             4.0  
3                             4.0  
4                             3.0  
Searching corpus for similar tunes...
Similarity results for Lord McDonald's (reel):
                                           title  count
33      Lord McDonald's (reel)_accent_pitch_clas     15
41       Tim Mulloney's (reel)_accent_pitch_clas      5
53 

**TODO** explain something about the above results, or show some more interesting results.