## Polifonia [Patterns Knowledge Graph](https://github.com/polifonia-project/patterns-knowledge-graph) (KG) ingest pipeline. Step 1: Data extraction.

This notebook uses FoNN to extract patterns, pattern occurrences and pattern locations from an input music corpus in feature sequence format, as output by ```../demo_notebooks/feature_extraction_demo.ipynb```.
Any corpus for which a KG is being generated must first be processed via this notebook.

This is the first of two FoNN KG preprocessing steps. Step two can be found in ```./patterns_kg_data_processing.ipynb```.

In [40]:
# imports

from FoNN.pattern_extraction import NgramPatternCorpus


In [41]:
# Set n_vals to a tuple containing the minimum and maximum pattern lengths (inclusive) for which patterns will be extracted.
n_vals = (4, 6)
# Note: as above, the maximum range is 4 to 6 pattern elements. If only a single pattern length is under investigation, the tuple still requires two elements, e.g. (4, 4) for 4-element patterns.
# Set the musical feature under investigation. The default is 'diatonic_scale_degree'. A full list of feature names and explanations is available in ./README.md and in the top docstring of ../feature_sequence_extraction_tools.py.
feature = 'diatonic_scale_degree'
# Set the input path corresponding to the level of granularity of the input corpus data under investigation. The level can be
# 'note', 'accent' or 'duration_weighted', as discussed in the FoNN README.md.
in_path = '../mtc_ann_corpus/feature_sequence_data/duration_weighted'
out_path = '../mtc_ann_corpus/kg_pipeline_input_data'

# For each pattern length, create an NgramPatternCorpus object
# Note: this differs from the standard FoNN ingest pipeline, which extracts patterns at all lengths via a single NgramPatternCorpus object.

_pattern_lengths = range(n_vals[0], n_vals[1] + 1)
data = []
for n in _pattern_lengths:
    pattern_corpus = NgramPatternCorpus(in_dir=in_path, out_dir=out_path, feature=feature, n_vals=(n, n))
    data.append(pattern_corpus)

Reading input data: 100%|██████████| 360/360 [00:00<00:00, 970.94it/s]
Formatting data: 100%|██████████| 360/360 [00:00<00:00, 208096.67it/s]

Process completed.


Reading input data: 100%|██████████| 360/360 [00:00<00:00, 1298.29it/s]
Formatting data: 100%|██████████| 360/360 [00:00<00:00, 127035.96it/s]

Process completed.


Reading input data: 100%|██████████| 360/360 [00:00<00:00, 1297.78it/s]
Formatting data: 100%|██████████| 360/360 [00:00<00:00, 378528.31it/s]

Process completed.
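The loop above produces three blocks of progress output because the inclusive `n_vals` range is expanded into one `(n, n)` pair per pattern length. A minimal sketch of that expansion follows; the helper name `per_length_ranges` is hypothetical and is not part of FoNN.

```python
def per_length_ranges(n_vals):
    """Expand an inclusive (min, max) pair into one (n, n) pair per pattern length.

    Hypothetical helper for illustration only; not part of FoNN.
    """
    n_min, n_max = n_vals
    if n_min > n_max:
        raise ValueError("n_vals must be (min, max) with min <= max")
    return [(n, n) for n in range(n_min, n_max + 1)]


print(per_length_ranges((4, 6)))  # [(4, 4), (5, 5), (6, 6)]
print(per_length_ranges((4, 4)))  # [(4, 4)]
```

Each resulting pair corresponds to the `n_vals=(n, n)` argument passed to one `NgramPatternCorpus` object in the cell above.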





In [42]:
# Create a corpus-level pattern occurrences matrix for each n value (i.e. for each pattern length)
for pattern_corpus in data:
    pattern_corpus.create_pattern_frequency_matrix(write_output=False)
    print(pattern_corpus)


Corpus name: mtc_ann_corpus
Level: note-level (duration-weighted)
Input directory: ../mtc_ann_corpus/feature_sequence_data/duration_weighted
Corpus contains 360 tunes.
Number of patterns extracted: 1026


Corpus name: mtc_ann_corpus
Level: note-level (duration-weighted)
Input directory: ../mtc_ann_corpus/feature_sequence_data/duration_weighted
Corpus contains 360 tunes.
Number of patterns extracted: 2580


Corpus name: mtc_ann_corpus
Level: note-level (duration-weighted)
Input directory: ../mtc_ann_corpus/feature_sequence_data/duration_weighted
Corpus contains 360 tunes.
Number of patterns extracted: 4977
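To illustrate what a corpus-level pattern occurrences matrix holds, the sketch below counts n-gram occurrences per tune for a toy corpus using only the standard library. It is an assumption-laden illustration, not FoNN's implementation: the function names, the toy tunes and the use of 0 for absent patterns are all hypothetical choices.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return every contiguous n-element window of the sequence as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def pattern_frequency_table(corpus, n):
    """Count n-gram occurrences per tune.

    corpus: dict mapping tune title -> feature sequence (toy data here).
    Returns (titles, rows), where rows maps each pattern to a list of counts,
    one per title, in the same order as titles.
    """
    counts = {title: Counter(ngrams(seq, n)) for title, seq in corpus.items()}
    titles = list(corpus)
    patterns = sorted(set().union(*counts.values()))
    rows = {p: [counts[t].get(p, 0) for t in titles] for p in patterns}
    return titles, rows


# Hypothetical toy corpus; real input comes from the feature sequence files.
toy_corpus = {
    "tune_a": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "tune_b": [1, 2, 3, 4, 3, 2, 1],
}
titles, rows = pattern_frequency_table(toy_corpus, 4)
print(titles)
print(rows[(1, 2, 3, 4)])
```

Rows correspond to patterns and columns to tunes, which is the same orientation as the matrices created above.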



In [43]:
# convert pattern occurrences matrices to pandas DataFrames and write to file
for idx, pattern_corpus in enumerate(data):
    n = _pattern_lengths[idx]
    pattern_corpus.convert_matrix_to_df(pattern_corpus.pattern_freq_matrix, write_output=True, filename=f"{n}gram_patterns")

              NLB072355_01  NLB072255_01  NLB076303_01  NLB073150_01  \
patterns                                                               
[1, 1, 1, 1]           3.0           NaN           NaN           6.0   
[1, 1, 1, 2]           1.0           NaN           NaN           1.0   
[1, 1, 1, 3]           1.0           NaN           1.0           NaN   
[1, 1, 1, 4]           NaN           NaN           NaN           1.0   
[1, 1, 1, 5]           NaN           NaN           NaN           NaN   

              NLB072567_01  NLB073296_01  NLB073269_02  NLB076211_01  \
patterns                                                               
[1, 1, 1, 1]           NaN           2.0           6.0           3.0   
[1, 1, 1, 2]           1.0           2.0           NaN           NaN   
[1, 1, 1, 3]           NaN           NaN           1.0           NaN   
[1, 1, 1, 4]           NaN           NaN           NaN           NaN   
[1, 1, 1, 5]           NaN           NaN           1.0         

In [44]:
# Run functions from pattern_locations.py to extract pattern location data.
# A 'location' is the offset (index) of a pattern occurrence within the feature sequence
# representing a tune in the corpus. For example, pattern [1 2 3 4] occurring in tune [1 2 3 4 5 1 2 3 4 5] has
# locations 0 and 5, the two indices at which the pattern's first element occurs in the tune sequence.

# Note: the loop below extracts locations for all patterns of 4 to 6 elements in length, corresponding to
# the range of pattern lengths defined above in 'n_vals' for which patterns were extracted.

import pickle

from FoNN.pattern_locations import (
    extract_patterns,
    find_pattern_locations,
    read_file_paths,
    read_tune_data,
    read_tune_title,
)

for n in _pattern_lengths:
    results = {}
    # call functions from FoNN.pattern_locations and run them:
    in_files = read_file_paths(in_path)
    for path in in_files:   # for all files in corpus
        title = read_tune_title(path)                     # read titles
        # named tune_data so the 'data' list of NgramPatternCorpus objects above is not overwritten
        tune_data = read_tune_data(path, feature)         # read feature sequence data
        patterns = list(extract_patterns(tune_data, n))   # extract n-gram patterns
        locations = find_pattern_locations(patterns)      # calculate pattern locations
        results[title] = dict(locations)                  # store in nested dict per: {tune title: {pattern: locations}}

    # store output as pickle file in out_path directory
    f_name = f'{n}gram_locations.pkl'
    locations_path = f"{out_path}/{f_name}"
    with open(locations_path, 'wb') as f_out:
        pickle.dump(results, f_out)
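The notion of a pattern location described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the code in `FoNN.pattern_locations`: the function `find_locations` and the tune title `toy_tune` are hypothetical, and patterns are represented as tuples by assumption.

```python
import pickle
from collections import defaultdict


def find_locations(sequence, n):
    """Map each n-gram in the sequence to the indices where its first element occurs."""
    locations = defaultdict(list)
    for i in range(len(sequence) - n + 1):
        locations[tuple(sequence[i:i + n])].append(i)
    return dict(locations)


tune = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
locs = find_locations(tune, 4)
print(locs[(1, 2, 3, 4)])  # [0, 5]

# Nested dict in the same shape as the pipeline output: {tune title: {pattern: locations}}
results = {"toy_tune": locs}

# Round-trip through pickle in memory to show the structure survives serialisation.
restored = pickle.loads(pickle.dumps(results))
print(restored == results)  # True
```

The printed locations for `(1, 2, 3, 4)` match the worked example in the comments above: the pattern starts at indices 0 and 5 of the tune sequence.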