# FeatureLoader for Meertens Tune Collections

The [Meertens Tune Collections](http://www.liederenbank.nl/mtc/) provide various data sets with melodic data. The melodies are provided in Humdrum **kern encoding and as MIDI sequences. In many cases, a representation of the melodies as sequences of feature values is needed. `MTCFeatureLoader` is a Python module that provides such feature sequences, together with functionality for filtering melodies, selecting features, and extracting new features.

In [None]:
from MTCFeatures.MTCFeatureLoader import MTCFeatureLoader

An `MTCFeatureLoader` object takes a `.jsonl` file (optionally gzipped) as its source: a text file in which each line is a JSON object representing one melody. A melody object contains metadata fields and several sequences of feature values. For example:
```
{'id': 'NLB178968_01',
 'type': 'vocal',
 'year': 1866,
 'freemeter': False,
 'tunefamily': '1302_0',
 'tunefamily_full': 'Contre les chagrins de la vie',
 'ann_bgcorpus': True,
 'features': {'pitch40': [135, 141, 147, 152, 158,    [...] 158, 135],
              'scaledegree': [1, 2, 3, 4, 5, 1, 6,    [...] 2, 5, 1],
              'scaledegreespecifier': ['P', 'M', 'M', [...] 'M', 'P', 'P'],
              
              [...]
              
              'phrasepos': [0.0, 0.071429, 0.142857,  [...] 0.833333, 1.0],
              'songpos': [0.0, 0.007142857142857143,  [...] 1.0]}
}
```
In this example the metadata fields are `id`, `type`, `year`, `tunefamily`, `tunefamily_full`, `freemeter`, and `ann_bgcorpus`. The named object `features` contains several sequences of feature values.
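
To make the file format concrete, here is a minimal sketch of reading such a file with only the standard library. The helper `iter_melodies` is hypothetical and not part of `MTCFeatureLoader`; it assumes one JSON object per line and chooses gzip based on the file extension. On disk the objects are plain JSON (double-quoted strings, lowercase `true`/`false`); the example above shows the parsed Python form.

```python
import gzip
import json


def iter_melodies(path):
    """Yield one melody object (a dict) per line of a .jsonl or .jsonl.gz file."""
    # Hypothetical helper for illustration; not the module's own reader.
    opener = gzip.open if str(path).endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, melodies are parsed lazily, one line at a time.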

Several `.jsonl` files are provided with the module:
* `MTC-ANN-2.0.1`
* `MTC-FS-INST-2.0`

The `MTCFeatureLoader` can be initialized either with one of these, or with a user provided `.jsonl` or `.jsonl.gz` file:
* `fl = MTCFeatureLoader('MTC-ANN-2.0.1')`
* `fl = MTCFeatureLoader('MTC-FS-INST-2.0')`
* `fl = MTCFeatureLoader('../path/to/my/file.jsonl.gz')`
* `fl = MTCFeatureLoader('/path/to/my/file.jsonl')`

The `MTCFeatureLoader` class provides various functionalities:
* Melody filtering : select melodies according to given criteria
* Feature selection : keep a subset of the features
* Feature extraction : compute a new feature from existing features and add it to the melody object
* Train/test splitting : split the data into train and test sets while respecting groupings
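
The idea of a split that respects groupings can be sketched as follows. This is an illustration only, not the module's implementation: the function `group_split`, its parameters, and the default test fraction are assumptions. It assigns whole groups (for example tune families) to either side, so that no group is divided between train and test.

```python
import random


def group_split(melodies, group_key='tunefamily', test_fraction=0.2, seed=0):
    """Split melodies so that all members of a group land on the same side."""
    # Hypothetical helper; names and defaults are assumptions for illustration.
    melodies = list(melodies)
    groups = sorted({m[group_key] for m in melodies})
    random.Random(seed).shuffle(groups)  # reproducible shuffle of group labels
    n_test = round(len(groups) * test_fraction)
    test_groups = set(groups[:n_test])
    train = [m for m in melodies if m[group_key] not in test_groups]
    test = [m for m in melodies if m[group_key] in test_groups]
    return train, test
```

Note that the fraction applies to the number of groups, so the number of melodies on each side depends on the group sizes.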

Operations can be chained. All feature extractors, feature selectors and object filters return an iterator over the sequences. Each has an argument `seq_iter`. If `seq_iter` is `None` (the default), the `.jsonl` file is taken as data source and a new iterator is created. Otherwise the provided iterator is taken as data source. In addition, the method `applyFilters` takes a list of filters and applies them in order.
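
The `seq_iter` convention can be illustrated with a toy class. `TinyLoader` and its method names are made up for this sketch and do not mirror the real class; the sketch only shows how an optional iterator argument lets one operation consume the output of another.

```python
class TinyLoader:
    """Toy stand-in illustrating the seq_iter convention (not the real class)."""

    def __init__(self, melodies):
        self.melodies = melodies  # stands in for the .jsonl source

    def _source(self, seq_iter):
        # No iterator given: start a fresh pass over the underlying data.
        return iter(self.melodies) if seq_iter is None else seq_iter

    def keep_vocal(self, seq_iter=None):
        for m in self._source(seq_iter):
            if m['type'] == 'vocal':
                yield m

    def keep_after(self, year, seq_iter=None):
        for m in self._source(seq_iter):
            if m['year'] > year:
                yield m


loader = TinyLoader([
    {'id': 'a', 'type': 'vocal', 'year': 1866},
    {'id': 'b', 'type': 'instrumental', 'year': 1900},
    {'id': 'c', 'type': 'vocal', 'year': 1750},
])
# Chain: first keep vocal melodies, then those dated after 1800.
chained = loader.keep_after(1800, seq_iter=loader.keep_vocal())
ids = [m['id'] for m in chained]
```

In this sketch the iterators are generators, so each can be consumed only once; materialize the result with `list(...)` if it is needed more than once.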

The method `MTCFeatureLoader.writeJSON(self, json_out_path, seq_iter=None)` can be used to write the filtered set `seq_iter` to a `.jsonl` or `.jsonl.gz` file. If the final extension of the filename `json_out_path` is `.gz`, a gzipped file will be written.
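
The extension-based choice between plain and gzipped output can be sketched like this. The function `write_jsonl` is a hypothetical stand-in, written under the assumption that each melody is serialized as one JSON object per line; it is not the code behind `writeJSON`.

```python
import gzip
import json


def write_jsonl(json_out_path, seq_iter):
    """Write each melody in seq_iter as one JSON line; gzip if the path ends in .gz."""
    # Hypothetical helper for illustration only.
    opener = gzip.open if str(json_out_path).endswith('.gz') else open
    with opener(json_out_path, 'wt', encoding='utf-8') as f:
        for melody in seq_iter:
            f.write(json.dumps(melody) + '\n')
```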

## Melody Filters

### Available filters

The following filters are registered in class `MTCFeatureLoader`:

* `vocal` : Only keep vocal melodies
* `instrumental` : Only keep instrumental melodies
* `firstvoice` : Only keep first voices/stanzas (i.e. identifier ending with `_01`)
* `ann_bgcorpus` : Only keep melodies unrelated to MTC-ANN (only applicable to MTC-FS-INST)
* `labeled` : Only keep melodies with a tune family label
* `unlabeled`: Only keep melodies without a tune family label
* `afteryear(year)` : Only keep melodies in sources dated later than `year` (`year` not included)
* `beforeyear(year)` : Only keep melodies in sources dated before `year` (`year` not included)
* `betweenyears(year1, year2)` : Only keep melodies in sources dated between `year1` and `year2` (both not included)
* `inOGL` : Only keep melodies that are part of Onder de Groene Linde
* `inNLBIDs(id_list)` : Only keep melodies with given identifiers in `id_list`
* `inTuneFamilies(tf_list)` : Only keep melodies in given tune families in `tf_list`
* `inInstTest` : Only keep melodies that are in cINST.

Available as separate functions:

* `MTCFeatureLoader.minClassSizeFilter(self, classfeature, mininum=0, seq_iter=None)` : Keeps only melodies in classes with >= `minimum` members.<br>
`classfeature` (string) : name of the feature to use for counting.
* `MTCFeatureLoader.maxClassSizeFilter(self, classfeature, maximum=100, seq_iter=None)` : Keeps only melodies in classes with <= `maximum` members.<br>
`classfeature` (string) : name of the feature to use for counting.
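
A class-size filter differs from the per-melody filters because it needs the class counts before it can decide about any single melody. A minimal sketch, assuming the class label is a top-level field of the melody object (the function name `min_class_size` is hypothetical):

```python
from collections import Counter


def min_class_size(melodies, classfeature, minimum=0):
    """Keep only melodies whose class has at least `minimum` members."""
    # Hypothetical helper; counting requires a full pass over the data first.
    melodies = list(melodies)
    counts = Counter(m[classfeature] for m in melodies)
    return (m for m in melodies if counts[m[classfeature]] >= minimum)
```

The maximum variant is the same with `<=` in place of `>=`.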

### How to: apply filter

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
seq_iter = fl.applyFilter('vocal')

In [None]:
len(list(seq_iter))

If a filter has arguments, these should be provided together with the filter name as a tuple.

In [None]:
seq_iter = fl.applyFilter( ('afteryear', 1950) )
seq_iter = fl.applyFilter( ('betweenyears', 1850, 1900) )

Keep only songs in tune families with at least 10 members:

In [None]:
seq_iter = fl.minClassSizeFilter('tunefamily', 10)

A filter can be inverted by setting the argument `invert` to `True`:

In [None]:
seq_iter = fl.applyFilter( ('afteryear', 1950), invert=True )

A chain of filters can be applied with the `applyFilters` method. The filters will be applied in the order provided.

In [None]:
seq_iter = fl.applyFilters(
    [
        {'mfilter':'vocal'},
        {'mfilter':'freemeter', 'invert':True},
        {'mfilter':('afteryear',1850)}
    ]
)
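
How such a list of filter specifications might be processed can be sketched as follows. The registry `filters` and the functions `resolve`, `keep` and `apply_filters` are assumptions made for illustration; the sketch only mirrors the behavior described above (name or tuple in `mfilter`, optional `invert`, filters applied in order).

```python
# Toy registry: plain predicates, or factories for filters with arguments.
filters = {
    'vocal': lambda m: m['type'] == 'vocal',
    'freemeter': lambda m: m['freemeter'],
    'afteryear': lambda y: lambda m: m['year'] > y,
}


def resolve(mfilter):
    """Turn a filter name, or a (name, arg, ...) tuple, into a predicate."""
    if isinstance(mfilter, tuple):
        name, *args = mfilter
        return filters[name](*args)
    return filters[mfilter]


def keep(melodies, pred, invert=False):
    """Yield melodies for which pred holds (or does not hold, if inverted)."""
    for m in melodies:
        if bool(pred(m)) != invert:
            yield m


def apply_filters(melodies, specs):
    """Apply the filter specifications one after another, in the given order."""
    for spec in specs:
        melodies = keep(melodies, resolve(spec['mfilter']), spec.get('invert', False))
    return melodies


data = [
    {'id': 'a', 'type': 'vocal', 'freemeter': False, 'year': 1866},
    {'id': 'b', 'type': 'vocal', 'freemeter': True, 'year': 1900},
    {'id': 'c', 'type': 'instrumental', 'freemeter': False, 'year': 1900},
    {'id': 'd', 'type': 'vocal', 'freemeter': False, 'year': 1800},
]
result = list(apply_filters(data, [
    {'mfilter': 'vocal'},
    {'mfilter': 'freemeter', 'invert': True},
    {'mfilter': ('afteryear', 1850)},
]))
```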

### How to: register a filter

Use method `MTCFeatureLoader.registerFilter(self, name, mfilter)`
<br>
`mfilter` : function that takes a melody object and returns `True` if the melody should be kept.

In [None]:
fl.registerFilter('vocal', lambda x: x['type'] == 'vocal')

Register a filter with arguments:

In [None]:
fl.registerFilter('afteryear', lambda y: lambda x: x['year'] > y )
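
The nested lambda above is a function that returns a function: the outer call binds the argument, and the result is an ordinary one-argument predicate on a melody. Written out with `def`, purely as an illustration of the same pattern:

```python
def afteryear(y):
    """Return a predicate that keeps melodies dated strictly after year y."""
    def keep(melody):
        return melody['year'] > y
    return keep


pred = afteryear(1950)            # binds y = 1950
kept = pred({'year': 1951})       # True: 1951 is later than 1950
dropped = pred({'year': 1950})    # False: the year itself is not included
```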

## Feature Extractors

### Available Feature Extractors

In class `MTCFeatureLoader`:
* `full_beat_str` : concatenates `beat` and `beat_fraction`

The following Feature Extractor is available as separate function:
<br>
`MTCFeatureLoader.concatAllFeatures(self, name='concat', seq_iter=None)`<br>
`name` : name of the new feature<br>

### How to: apply a Feature Extractor

Use method `MTCFeatureLoader.applyFeatureExtractor(self, name, seq_iter=None)`
<br>
`name` : name (string) of the extractor 

In [None]:
seq_iter = fl.applyFeatureExtractor('full_beat_str')

## Feature Selector

E.g. only retain features `midipitch` and `IOR`:

In [None]:
seq_iter = fl.selectFeatures(['midipitch', 'IOR'])
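
Conceptually, feature selection reduces the `features` object of each melody to the requested keys. A minimal sketch for a single melody object, assuming the metadata fields are left untouched (the function `select_features` is hypothetical and not the module's code):

```python
def select_features(melody, names):
    """Return a copy of the melody that keeps only the named features."""
    # Hypothetical helper; assumes metadata fields are retained as they are.
    selected = dict(melody)  # shallow copy, so the input object is not modified
    selected['features'] = {
        name: values
        for name, values in melody['features'].items()
        if name in names
    }
    return selected
```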

# Example Configurations

### pitch

objects: all songs in MTC-ANN-2.0.1.
<br>
features: midipitch

In [None]:
fl = MTCFeatureLoader('MTC-ANN-2.0.1')
seq_iter=fl.selectFeatures(['midipitch'])

### pitch and duration

objects: all songs in MTC-ANN-2.0.1.
<br>
features: midipitch and duration

In [None]:
fl = MTCFeatureLoader('MTC-ANN-2.0.1')
seq_iter=fl.selectFeatures(['midipitch', 'duration'])

### intervals and inter onset interval ratios

objects: all songs in MTC-ANN-2.0.1.<br>
features: chromaticinterval and IOR

In [None]:
fl = MTCFeatureLoader('MTC-ANN-2.0.1')
seq_iter=fl.selectFeatures(['chromaticinterval', 'IOR'])

### scale degree, metric contour and beat position

objects: all songs in MTC-ANN
<br>
features: scale degree, metric contour and beat position

In [None]:
fl = MTCFeatureLoader('MTC-ANN-2.0.1')
# Compute full_beat_str first, so that it exists when the features are selected
seq_iter = fl.applyFeatureExtractor('full_beat_str')
seq_iter = fl.selectFeatures(['scaledegree','metriccontour','full_beat_str'], seq_iter=seq_iter)

### Get background corpus for MTC-ANN from MTC-FS-INST

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
seq_iter = fl.applyFilter('ann_bgcorpus')

### Get labeled songs in *Onder de Groene Linde*

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
seq_iter = fl.applyFilters(
    [
        {'mfilter':'inOGL'},
        {'mfilter':'labeled'}
    ]
)

Keep only those in tune families with at least 2 melodies:

In [None]:
seq_iter = fl.minClassSizeFilter('tunefamily', 2, seq_iter=seq_iter)

### Use labeled 17th and 18th century fiddle music only

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')

sel_instr = fl.applyFilter('instrumental')
sel_17th18th_c = fl.applyFilter( ('betweenyears', 1600, 1800), seq_iter=sel_instr )
sel_labeled = fl.applyFilter('labeled', seq_iter=sel_17th18th_c)

or:

In [None]:
seq_iter = fl.applyFilters(
    [
        {'mfilter':'instrumental'},
        {'mfilter':'labeled'},
        {'mfilter':('betweenyears', 1600, 1800)}
    ]
)

### Use big tune families (>=20 melodies)

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
sel_big = fl.minClassSizeFilter('tunefamily', 20)

### Use small tune families (<=5 melodies) only

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
sel_small = fl.maxClassSizeFilter('tunefamily', 5)

### Use only melodies with given identifiers

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
id_list = ['NLB125814_01','NLB125815_01','NLB125817_01','NLB125818_01','NLB125822_01','NLB125823_01']
sel_list = fl.applyFilter( ('inNLBIDs', id_list) )

### Use only instrumental melodies from tune family 2805_0

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
tf_list = ['2805_0']
sel_list = fl.applyFilter( ('inTuneFamilies', tf_list), seq_iter=fl.applyFilter('instrumental'))

Write the result to a gzipped `.jsonl` file.

In [None]:
fl.writeJSON('2805_0.jsonl.gz', seq_iter=sel_list)

### Get vocal melodies that have a meter

In [None]:
fl = MTCFeatureLoader('MTC-FS-INST-2.0')
seq_iter = fl.applyFilters(
    [
        {'mfilter':'vocal'},
        {'mfilter':'freemeter', 'invert':True}
    ]
)