Grouped Stratified Split

The Problem

Video classification is challenging not only because of the neural model itself, but also because of how the dataset must be split into train, validation and test sets. Videos are generally cut into smaller slices (segments) of a few seconds, and all the segments cut from the same video form a group. A group must not be spread across split sets: if a group is present in the train set, none of its segments may appear in the validation or test sets. This work addresses the following question: how can groups of different sizes be distributed among the split sets while keeping each set balanced by class?

More specifically, this work deals with splitting the NES-MVDB dataset, which contains gameplay videos of all the Nintendo Entertainment System (NES) games. We want to classify these videos by genre. The dataset contains the following two tables:

Since we have 95,899 segments, an 80:10:10 split would ideally put 9,590 segments in validation, the same number in test, and 76,719 in train. Another way of doing this is to perform the 80:10:10 split separately for each game genre, that is, a stratified split, which guarantees that the sets are balanced by genre. The smaller sets are dealt with first, and the remaining segments are left for train. The difficulty, again, is that the segments of a gameplay group cannot be divided between split sets, and each group contains an arbitrary number of segments.
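The ideal set sizes quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch: `split_sizes` is a hypothetical helper, not part of this package, and letting the largest split absorb any rounding remainder is an assumption made here.

```python
def split_sizes(total, ratios):
    """Return the target number of samples for each split.

    Each split gets round(total * ratio); any rounding remainder is
    added to the largest split so the sizes always sum to `total`.
    """
    sizes = [round(total * r) for r in ratios]
    largest = max(range(len(ratios)), key=lambda i: ratios[i])
    sizes[largest] += total - sum(sizes)
    return sizes


train, val, test = split_sizes(95_899, [0.8, 0.1, 0.1])
print(train, val, test)  # 76719 9590 9590
```

These numbers are only targets. Because whole groups have to be assigned, the sizes actually reached will usually differ slightly from them.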

The scikit-learn library already provides StratifiedGroupKFold, a cross-validator that splits non-overlapping groups into K folds of approximately equal size in a stratified manner, but it has no equivalent for a classic train, validation, test split.

The Solution

This problem can be seen as one subset sum problem for each split set and class. In our case, for each split set, a subset sum is solved per class to choose the gameplays that enter it, so that the number of slices taken is proportional to the size of each class. The sums are computed over the number of slices/segments each gameplay contains, and the set of groups whose total is closest to the target is selected.
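One way to find the subset whose sum is closest to a target is a dynamic program over the reachable sums. The sketch below is not taken from this package: `closest_subset` is a hypothetical name, and breaking ties in favour of the smaller sum is an assumption.

```python
def closest_subset(sizes, target):
    """Return the indices of a subset of `sizes` whose sum is closest to `target`."""
    # Map every reachable sum to one list of indices that produces it.
    reachable = {0: []}
    for i, size in enumerate(sizes):
        # Iterate over a snapshot so each item is used at most once.
        for total, chosen in list(reachable.items()):
            new_total = total + size
            if new_total not in reachable:
                reachable[new_total] = chosen + [i]
    # Pick the reachable sum nearest to the target; ties go to the smaller sum.
    best = min(reachable, key=lambda s: (abs(s - target), s))
    return reachable[best]


print(closest_subset([40, 2, 4], 5))  # [2]
```

The table holds at most one entry per distinct sum, so its size is bounded by the total number of slices rather than by the number of possible subsets.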

Example:

If our dataset contains the following groups:

Group(0, 'A', 40), Group(1, 'B', 40), Group(2, 'A', 2), Group(3, 'B', 8), Group(4, 'A', 4), Group(5, 'B', 6)

Here the first field is the group id, the second is the group label, and the last is the group size (the number of video slices, in our case).

The total number of samples amounts to 100. In an 80:10:10 train, eval, test split we would have 80 samples for train, 10 for eval and 10 for test. We use "priority" in the name of our algorithm because we prioritize the smaller split sets, solving them first and leaving the remaining groups to the bigger ones. So we might begin with the eval split.

For class A we have the following groups available:

Group(0, 'A', 40), Group(2, 'A', 2), Group(4, 'A', 4)

The eval set as a whole should amount to 10 samples. Since we want it stratified, that is, proportional to classes A and B, the target value for class A is 10% of the size of class A, which is round(0.1 * 46) = 5. We use the subset sum algorithm to return the set whose sum is closest to that value: Group(4, 'A', 4).

For class B we have the following groups available:

Group(1, 'B', 40), Group(3, 'B', 8), Group(5, 'B', 6)

Our target value will be round(0.1 * 54) = 5, which yields Group(5, 'B', 6).

So, for the eval set, our algorithm returned Group(4, 'A', 4) and Group(5, 'B', 6). Since 4 + 6 amounts to 10 and the slices are divided 40%/60% between classes A and B, close to the 46%/54% proportion of the full dataset, this is an optimal solution.
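The eval-set walkthrough above can be replayed with a short, self-contained script. It is a sketch under stated assumptions: the `Group` namedtuple is only a stand-in for the package's class, `pick_for_class` is a hypothetical helper, the search is exhaustive (practical only for a handful of groups), and ties are broken toward the smaller total.

```python
from collections import namedtuple
from itertools import combinations

# Stand-in for the package's Group class (assumption for this sketch).
Group = namedtuple("Group", ["id", "label", "size"])

groups = [Group(0, "A", 40), Group(1, "B", 40), Group(2, "A", 2),
          Group(3, "B", 8), Group(4, "A", 4), Group(5, "B", 6)]


def pick_for_class(candidates, ratio):
    """Pick the subset whose total size is closest to ratio * class size."""
    target = round(ratio * sum(g.size for g in candidates))
    # Enumerate every subset, including the empty one.
    subsets = (c for r in range(len(candidates) + 1)
               for c in combinations(candidates, r))
    # Ties are broken toward the smaller total (an assumption of this sketch).
    return min(subsets, key=lambda c: (abs(sum(g.size for g in c) - target),
                                       sum(g.size for g in c)))


eval_set = []
for label in ("A", "B"):
    candidates = [g for g in groups if g.label == label]
    eval_set.extend(pick_for_class(candidates, 0.1))

print(eval_set)
# [Group(id=4, label='A', size=4), Group(id=5, label='B', size=6)]
```

For class A, the subsets {4} and {2, 4} are equally far from the target of 5, so the tie-breaking rule decides which one is returned.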

Usage

As demonstrated in the example, a group in our implementation is defined as Group(ID, CLASS, SIZE). A set of groups is represented by the class GroupSet, which expects a list of Groups. The function get_split receives the GroupSet that represents your dataset and the split you want to create, which would be [0.8, 0.1, 0.1] in our example. get_split returns a list of GroupSets that correspond to the specified split. Following our example we would have:

groups = [Group(0, 'A', 40), Group(1, 'B', 40), Group(2, 'A', 2), Group(3, 'B', 8), Group(4, 'A', 4), Group(5, 'B', 6)]
groups = GroupSet(groups)

priority_split = PrioritySplit()
sets = priority_split.get_split(groups, [0.8, 0.1, 0.1])

print(sets)
# Displays: [GroupSet({Group(0, A, 40), Group(1, B, 40)}), GroupSet({Group(4, A, 4), Group(5, B, 6)}), GroupSet({Group(2, A, 2), Group(3, B, 8)})]
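A quick sanity check on a returned split is to confirm that no group appears in more than one set and to look at the size and class shares of each set. The sketch below does not use the package's API: it copies the displayed output into plain (id, label, size) tuples, and `summarize` is a hypothetical helper.

```python
from collections import Counter

# The output displayed above, rewritten as plain (id, label, size) tuples.
train = [(0, "A", 40), (1, "B", 40)]
eval_ = [(4, "A", 4), (5, "B", 6)]
test = [(2, "A", 2), (3, "B", 8)]


def summarize(split):
    """Return (total size, {label: share of samples}) for one split."""
    per_label = Counter()
    for _, label, size in split:
        per_label[label] += size
    total = sum(per_label.values())
    return total, {label: n / total for label, n in per_label.items()}


# No group id may appear in more than one split.
ids = [gid for split in (train, eval_, test) for gid, _, _ in split]
assert len(ids) == len(set(ids)), "a group leaked into more than one split"

for name, split in [("train", train), ("eval", eval_), ("test", test)]:
    print(name, summarize(split))
```

On this toy dataset the three sets reach the requested sizes of 80, 10 and 10, but the test set ends up 20% A and 80% B. With so few groups, a perfectly stratified split is not always attainable.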
