## Demo notebook for FoNN pattern extraction tools

In [1]:
# imports
from FoNN.pattern_extraction import NgramPatternCorpus

Step 1: Set up an NgramPatternCorpus class instance to extract and store patterns from feature sequence data.

In [2]:
# define paths
# 'inpath' (passed below as 'in_dir') must point to a directory of feature sequence data CSV files extracted from the corpus via feature_extraction_demo.ipynb.
inpath = '../mtc_ann_corpus/feature_sequence_data/duration_weighted'
outpath = '../mtc_ann_corpus/pattern_corpus/duration_weighted'
# set n_vals to a tuple holding the min and max pattern lengths for which patterns will be extracted
n_vals = (3, 12)
# Note: the maximum supported range is 3-12 pattern elements, as set above. To extract a single pattern length, the tuple still requires two elements, e.g. (4, 4) for 4-element patterns.

# set up NgramPatternCorpus instance for MTC-ANN corpus:

# Args:
# 'in_dir' and 'out_dir' -- map to in and out paths defined above.
# 'feature' -- the target musical feature for which patterns will be extracted.
# 'n_vals' -- the (min, max) pattern length tuple defined above.
# 16 features are available; their names are accessible via NgramPatternCorpus.FEATURES and are listed in ./README.md and in the feature_extraction_tools.py docstring.

# In this example we are extracting duration-weighted note-level diatonic scale degree patterns from the MTC-ANN corpus.
mtc_ann_pattern_corpus = NgramPatternCorpus(in_dir=inpath, out_dir=outpath, feature='diatonic_scale_degree', n_vals=n_vals)

Reading input data: 100%|██████████| 360/360 [00:00<00:00, 1489.36it/s]
Formatting data: 100%|██████████| 360/360 [00:00<00:00, 467766.25it/s]

Process completed.
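
To illustrate how the `n_vals` bounds translate into pattern lengths, here is a minimal sketch of sliding-window n-gram extraction over a single feature sequence. The helper `extract_ngrams` and the toy sequence are hypothetical and written for illustration only; they are not part of the FoNN API, and the library's own implementation may differ.

```python
def extract_ngrams(sequence, n_vals):
    """Return every contiguous n-gram of `sequence` for n_min <= n <= n_max.

    Hypothetical helper for illustration; not a FoNN function.
    """
    n_min, n_max = n_vals
    patterns = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n along the sequence
        for start in range(len(sequence) - n + 1):
            patterns.append(tuple(sequence[start:start + n]))
    return patterns


# toy scale-degree sequence (made-up values)
toy_sequence = [1, 2, 3, 4, 5]

# (3, 4) yields all 3-element and all 4-element patterns
print(extract_ngrams(toy_sequence, (3, 4)))
# (4, 4) yields 4-element patterns only
print(extract_ngrams(toy_sequence, (4, 4)))
```

A sequence shorter than the minimum pattern length simply yields no patterns in this sketch.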





Extract all tune titles from the corpus, store them as the NgramPatternCorpus.titles attr, and write them to file.

In [3]:
# save tune titles to 'titles' attr and write to file
mtc_ann_pattern_corpus.save_tune_titles_to_file()

One-step call to perform two related tasks:
1. Extract all n-gram patterns of 3-12 elements in length that occur at least once in the corpus. Store them as the NgramPatternCorpus.patterns attr and write to file.
2. Count occurrences of these patterns in every tune in the corpus, store the counts in a sparse matrix as the NgramPatternCorpus.pattern_freq_matrix attr, and write to file.

In [4]:
# extract all patterns via n-grams; populate pattern occurrences matrix, save both to file
mtc_ann_pattern_corpus.create_pattern_frequency_matrix(write_output=True)
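
The two tasks above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the FoNN implementation: the function `count_patterns`, the toy tune data, the dict-based sparse storage, and the patterns-as-rows, tunes-as-columns orientation are all hypothetical choices made for the example.

```python
from collections import Counter


def count_patterns(tunes, n_vals):
    """Count n-gram occurrences per tune (hypothetical helper, not a FoNN function).

    Returns (patterns, counts), where `patterns` is a sorted list of the
    distinct n-grams found and `counts` maps (pattern_index, tune_index)
    to the number of occurrences. Only non-zero cells are stored.
    """
    n_min, n_max = n_vals
    per_tune = []
    for seq in tunes.values():
        per_tune.append(Counter(
            tuple(seq[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(seq) - n + 1)
        ))
    # vocabulary: every pattern that occurs at least once in the corpus
    patterns = sorted(set().union(*per_tune))
    index = {p: i for i, p in enumerate(patterns)}
    # sparse storage: keep only the non-zero (pattern, tune) cells
    counts = {}
    for col, tune_counts in enumerate(per_tune):
        for pattern, k in tune_counts.items():
            counts[(index[pattern], col)] = k
    return patterns, counts


# toy corpus of two tunes (made-up values)
toy_tunes = {"tune_a": [1, 2, 3, 1, 2, 3], "tune_b": [1, 2, 3, 4]}
patterns, counts = count_patterns(toy_tunes, (3, 3))
print(patterns)
print(counts)
```

Storing only non-zero cells matters at corpus scale: most patterns occur in only a few tunes, so a dense pattern-by-tune table would be mostly zeros.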

Print corpus info (via custom NgramPatternCorpus.__repr__):

In [5]:
# print corpus info
print(mtc_ann_pattern_corpus)


Corpus name: mtc_ann_corpus
Level: note-level (duration-weighted)
Input directory: ../mtc_ann_corpus/feature_sequence_data/duration_weighted
Corpus contains 360 tunes.
Number of patterns extracted: 82447



Transform pattern occurrence counts in NgramPatternCorpus.pattern_freq_matrix into TF-IDF values. Store the output matrix as NgramPatternCorpus.pattern_tfidf_matrix attr and write to file.

In [6]:

# convert pattern occurrence counts to TF-IDF values and write the output matrix to file
mtc_ann_pattern_corpus.calculate_tfidf_vals(write_output=True)
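
For readers unfamiliar with TF-IDF, the sketch below applies one common textbook formulation, tf * log(N / df), to toy pattern counts. The exact weighting used by `calculate_tfidf_vals` (for example, any smoothing or normalisation) is not documented in this notebook, so treat the formula, the function name `tfidf`, and the data as assumptions.

```python
import math


def tfidf(counts_per_tune):
    """Weight raw pattern counts by TF-IDF (hypothetical helper, not a FoNN function).

    `counts_per_tune` is a list of dicts mapping pattern -> raw count,
    one dict per tune. Assumed formulation: (count / total) * log(N / df).
    """
    n_tunes = len(counts_per_tune)
    # document frequency: number of tunes in which each pattern occurs
    df = {}
    for tune_counts in counts_per_tune:
        for pattern in tune_counts:
            df[pattern] = df.get(pattern, 0) + 1
    weighted = []
    for tune_counts in counts_per_tune:
        total = sum(tune_counts.values())
        weighted.append({
            pattern: (k / total) * math.log(n_tunes / df[pattern])
            for pattern, k in tune_counts.items()
        })
    return weighted


# toy counts for two tunes (made-up values)
toy_counts = [{"a": 2, "b": 2}, {"a": 1, "c": 3}]
weights = tfidf(toy_counts)
print(weights)
```

Under this formulation, pattern "a" occurs in every tune and so receives a weight of zero, while patterns confined to a single tune are weighted up.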

Final step: precalculate the "TFIDF" similarity metric by creating a pairwise cosine similarity matrix between the pattern TF-IDF vectors of all tunes in the corpus, and write the output to file.

In [7]:
mtc_ann_pattern_corpus.calculate_tfidf_vector_cos_similarity()
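
As a rough illustration of what a pairwise cosine similarity matrix contains, the sketch below compares toy sparse vectors stored as dicts. The functions `cosine_similarity` and `pairwise_similarity` and the vectors are hypothetical; the FoNN method presumably operates on its stored TF-IDF matrix and may be implemented quite differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (illustrative helper)."""
    # only keys present in both vectors contribute to the dot product
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # an all-zero vector has no direction; define the similarity as 0
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors):
    """Return the full tune-by-tune similarity matrix as a list of lists."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# toy TF-IDF vectors for three tunes (made-up values)
toy_vectors = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"c": 2.0}]
similarity = pairwise_similarity(toy_vectors)
for row in similarity:
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetric, with values near 1 on the diagonal for non-zero vectors; tunes that share no patterns score 0.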

NOTE: All files output by the pattern extraction pipeline are stored in the automatically generated `./pattern_corpus` dir under the corpus root dir. They are required inputs for the similarity search tools in the FoNN.similarity_search.PatternSimilarity class, which are illustrated in the similarity_search_demo.ipynb notebook.