## Demo notebook using FoNN pattern extraction tools to calculate a pairwise TF-IDF vector Cosine similarity matrix between all tunes in the corpus.

Note: Running this notebook in full is a prerequisite for the 'TF-IDF' similarity method, as is ```setup_pattern_corpus_demo.py```.

In [2]:
# imports
from FoNN.pattern_extraction import NgramPatternCorpus

Setup NgramPatternCorpus class instance to extract and store patterns from feature sequence data.

In [3]:
# define paths

# 'inpath' must point to a directory of feature sequence data csv files extracted from the corpus via feature_extraction_demo.ipynb.
# The pattern extraction class (FoNN.pattern_extraction.NgramPatternCorpus) will auto-detect the level of input data representation (i.e. 'note', 'accent', or 'duration_weighted') from the filepath provided to the 'inpath' var.
inpath = '../mtc_ann_corpus/feature_sequence_data/duration_weighted'
# the 'outpath' location may be changed if desired, but please keep the convention of locating these files in a dir per /[corpus name]/pattern_corpus/[level] to remain compatible with ../FoNN/similarity_search.py, which is hard-coded to look in this location for corpus pattern data.
outpath = '../mtc_ann_corpus/pattern_corpus/duration_weighted'
# define n_vals variable as tuple containing min and max pattern lengths for which patterns will be extracted. Allowable values range from 3 to 16.
# In our experimental inputs for the 'TF-IDF' similarity method, we standardise on (6, 12), extracting all patterns between 6 and 12 elements in length, but other ranges are permissible if desired, up to a maximum range of (3, 16).
n_vals = (6, 12)
# setup NgramPatternCorpus instance for MTC-ANN corpus:
# Args:
# 'in_dir' and 'out_dir' -- map to in and out paths defined above.
# 'feature' -- the target musical feature for which patterns will be extracted.
# 16 features are available; the names of all features are accessible by reading NgramPatternCorpus.FEATURES, and are listed in ./README.md and in the feature_extraction_tools.py docstring.

# In this example we are extracting duration-weighted note-level diatonic scale degree patterns of 6-12 elements in length from the MTC-ANN corpus.
mtc_ann_pattern_corpus = NgramPatternCorpus(in_dir=inpath, out_dir=outpath, feature='diatonic_scale_degree', n_vals=n_vals)

Reading input data: 100%|██████████| 360/360 [00:00<00:00, 1330.52it/s]
Formatting data: 100%|██████████| 360/360 [00:00<00:00, 206785.74it/s]

Process completed.
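
To make the role of 'n_vals' concrete, the following is a minimal sketch of n-gram extraction over a single feature sequence. It is an illustration only, not FoNN's implementation: the helper name `extract_ngrams` and the toy sequence are hypothetical, and it assumes both bounds of 'n_vals' are inclusive, as the "6-12 elements" wording above suggests.

```python
def extract_ngrams(sequence, n_vals=(6, 12)):
    """Return every contiguous n-gram of length n_min..n_max (inclusive) as a tuple."""
    n_min, n_max = n_vals
    ngrams = []
    for n in range(n_min, n_max + 1):
        # A sequence shorter than n yields an empty range, so no n-grams of that length.
        for start in range(len(sequence) - n + 1):
            ngrams.append(tuple(sequence[start:start + n]))
    return ngrams


# Hypothetical scale-degree sequence for one tune (not taken from MTC-ANN).
toy_sequence = [1, 2, 3, 4, 5, 4, 3, 2, 1]
toy_ngrams = extract_ngrams(toy_sequence, n_vals=(6, 8))
print(len(toy_ngrams), toy_ngrams[0])
```

A sequence of length m contains m - n + 1 n-grams of length n, so widening 'n_vals' increases the number of extracted patterns quickly.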





Extract all tune titles from the corpus and store them as the NgramPatternCorpus.titles attr.

In [4]:
# save tune titles to 'titles' attr and write to file
mtc_ann_pattern_corpus.save_tune_titles_to_file()

One-step call to perform two related tasks:
1. Extract all n-gram patterns of user-selected length which occur at least once in the corpus. Store as NgramPatternCorpus.patterns attr and write to file.
2. Count occurrences of these patterns in all tunes in the corpus, store in a sparse matrix as NgramPatternCorpus.pattern_freq_matrix attr, and write to file.

In [5]:
# extract all patterns via n-grams; populate pattern frequency matrix
mtc_ann_pattern_corpus.create_pattern_frequency_matrix(write_output=False)
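
The two tasks above can be sketched on a toy corpus as follows. This is an illustration under stated assumptions rather than FoNN's code: the helper `count_patterns`, the tune titles, and the sequences are hypothetical, a single pattern length is used for brevity, the matrix is dense where FoNN stores a sparse one, and the orientation (patterns as rows, tunes as columns) is an assumption.

```python
from collections import Counter


def count_patterns(tunes, n):
    """Map each tune title to a Counter of its contiguous n-grams of length n."""
    counts = {}
    for title, seq in tunes.items():
        counts[title] = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return counts


# Hypothetical toy corpus: titles and sequences are made up for illustration.
toy_tunes = {
    'tune_a': [1, 2, 3, 1, 2, 3],
    'tune_b': [1, 2, 3, 4],
}
toy_counts = count_patterns(toy_tunes, n=3)

# Task 1: the pattern vocabulary is every n-gram occurring at least once in the corpus.
patterns = sorted(set().union(*toy_counts.values()))

# Task 2: count occurrences of each pattern in each tune (rows = patterns, columns = tunes).
titles = sorted(toy_tunes)
freq_matrix = [[toy_counts[title][pattern] for title in titles] for pattern in patterns]

for pattern, row in zip(patterns, freq_matrix):
    print(pattern, row)
```

Most patterns occur in only a few tunes, so on a real corpus the matrix is mostly zeros, which is why a sparse representation is the natural choice.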

Print corpus info (via custom NgramPatternCorpus.__repr__):

In [6]:
# print corpus info
print(mtc_ann_pattern_corpus)


Corpus name: mtc_ann_corpus
Level: note-level (duration-weighted)
Input directory: ../mtc_ann_corpus/feature_sequence_data/duration_weighted
Corpus contains 360 tunes.
Number of patterns extracted: 10154



Transform pattern occurrence counts in NgramPatternCorpus.pattern_freq_matrix into TF-IDF values. Store the output matrix as NgramPatternCorpus.pattern_tfidf_matrix attr.

In [7]:
# convert values in pattern frequency matrix from raw pattern occurrence counts to TF-IDF values.
mtc_ann_pattern_corpus.calculate_tfidf_vals(write_output=False)
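
For readers unfamiliar with TF-IDF, here is a minimal sketch of the transformation using the textbook weighting, count × log(N / df), where N is the number of tunes and df is the number of tunes containing the pattern. The exact variant used by `calculate_tfidf_vals` is not stated in this notebook, and many libraries apply smoothing or normalisation, so treat the formula, the helper name `tfidf`, and the toy matrix as assumptions.

```python
import math


def tfidf(freq_matrix):
    """Weight raw counts by inverse document frequency.

    freq_matrix: list of rows, one row per pattern, one column per tune.
    """
    n_tunes = len(freq_matrix[0])
    weighted = []
    for row in freq_matrix:
        # Document frequency: number of tunes in which this pattern occurs.
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_tunes / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


# Hypothetical counts for 4 patterns across 2 tunes.
toy_freq_matrix = [[2, 1], [1, 0], [0, 1], [1, 0]]
toy_tfidf = tfidf(toy_freq_matrix)
for row in toy_tfidf:
    print(row)
```

Under this weighting, a pattern that occurs in every tune receives a weight of zero, while patterns confined to a few tunes are emphasised.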

Precalculate the 'TF-IDF' similarity results: create a pairwise Cosine similarity matrix between the pattern TF-IDF vectors of all tunes in the corpus, and write the output to file.


In [7]:
# generate TF-IDF vector Cosine similarity matrix
mtc_ann_pattern_corpus.calculate_tfidf_vector_cos_similarity()
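
As a reference for what a pairwise Cosine similarity matrix contains, the sketch below computes one for three toy vectors in plain Python. It is not FoNN's implementation, which operates on the corpus-wide TF-IDF matrix; the helper names and the vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Square matrix whose entry [i][j] is the similarity of vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical per-tune TF-IDF vectors (one vector per tune).
toy_vectors = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
similarity = pairwise_cosine(toy_vectors)
for row in similarity:
    print([round(value, 3) for value in row])
```

The resulting matrix is symmetric, and each non-zero vector has a similarity of 1 with itself. Vectors that point in the same direction score 1 regardless of magnitude, and vectors with no patterns in common score 0.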

NOTE: All files output by the pattern extraction pipeline are stored in the automatically generated ```./pattern_corpus``` dir under the corpus root dir. They are required inputs for the similarity search tools in the FoNN.similarity_search.PatternSimilarity class, which are illustrated in the ./similarity_search_demo.ipynb notebook.