Code to perform stratified split of grouped datasets into train and validation sets using optimization

joaofig/strat-group-split

strat-group-split

This repository contains code to perform stratified splitting of grouped datasets into train/validation sets or K-folds using optimization.

Summary

Given a labeled and grouped dataset, we want to split it into training and validation sets (or into K equally sized folds) while preserving group integrity and keeping the label distribution as similar as possible across the resulting sets. Group integrity means that each group is assigned whole to exactly one set and is never split between them. The split should also respect the requested splitting proportion and label stratification as closely as possible.

The expected result is, for a given input dataset, the list of groups assigned to each set, with both the train/validation proportion and the stratification as close as possible to the specified values.
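To make the objective concrete, here is a minimal sketch of how such a split could be scored. It is not the code from this repository: the function names `split_cost` and `best_split`, the squared-error cost, and the layout of `counts` (one row per group, one column per label) are all assumptions made for illustration. The exhaustive search only works for a handful of groups; it is there to show what "as close as possible" means, not to replace an optimization algorithm.

```python
from itertools import product


def split_cost(counts, in_train, train_fraction):
    """Hypothetical cost: squared error of the overall split proportion
    plus, for each label, the squared error of that label's train share."""
    n_labels = len(counts[0])
    label_totals = [sum(row[j] for row in counts) for j in range(n_labels)]
    total = sum(label_totals)

    # Samples of each label that end up in the training set.
    train_by_label = [
        sum(row[j] for row, chosen in zip(counts, in_train) if chosen)
        for j in range(n_labels)
    ]

    # Split-proportion term.
    cost = (sum(train_by_label) / total - train_fraction) ** 2

    # Stratification term: if every label sends the same share of its
    # samples to train, both sets keep the overall label distribution.
    for j in range(n_labels):
        if label_totals[j] > 0:
            cost += (train_by_label[j] / label_totals[j] - train_fraction) ** 2
    return cost


def best_split(counts, train_fraction):
    """Brute force over all whole-group assignments (tiny inputs only)."""
    best = None
    for assignment in product([False, True], repeat=len(counts)):
        cost = split_cost(counts, assignment, train_fraction)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Four groups, two labels; rows are groups, columns are label counts.
counts = [[10, 0], [0, 10], [5, 5], [5, 5]]
cost, assignment = best_split(counts, train_fraction=0.5)
print(cost, assignment)
```

Because groups are assigned whole, a zero-cost split does not always exist; in general the best achievable cost depends on how the labels are distributed inside each group.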

Using the Code

Train/Validation Split

All the code is contained in the group_split.py file. The main function benchmarks the two optimization algorithms: it generates a problem matrix using the generate_counts function and submits it to both algorithms, reporting the time taken, the final cost value, and how closely each result approximates the desired split and stratification.

Please note that the code is at a proof-of-concept stage. In the future I plan to create an independent Python package with these ideas.

K-Fold Split

All the code is contained in the k_fold_split.py file. You can alternatively use the k-fold.ipynb Jupyter notebook.
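For intuition about the K-fold variant, the sketch below assigns whole groups to folds with a simple greedy heuristic. This is an illustrative baseline and not the optimization used in k_fold_split.py: the function name `greedy_k_fold`, the squared-error criterion, and the `counts` layout (one row per group, one column per label) are assumptions.

```python
def greedy_k_fold(counts, k):
    """Assign each group to the fold where it reduces the squared distance
    to the per-fold label targets the most (hypothetical heuristic)."""
    n_labels = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_labels)]
    # Ideal number of samples of each label in every fold.
    target = [t / k for t in totals]

    folds = [[] for _ in range(k)]
    fold_counts = [[0] * n_labels for _ in range(k)]

    # Place the largest groups first; small ones fill the gaps later.
    order = sorted(range(len(counts)), key=lambda g: sum(counts[g]), reverse=True)

    for g in order:
        def cost_change(f):
            # Change in squared error if group g were added to fold f.
            return sum(
                (fold_counts[f][j] + counts[g][j] - target[j]) ** 2
                - (fold_counts[f][j] - target[j]) ** 2
                for j in range(n_labels)
            )

        best = min(range(k), key=cost_change)
        folds[best].append(g)
        for j in range(n_labels):
            fold_counts[best][j] += counts[g][j]

    return folds, fold_counts


# Four groups, two labels, two folds.
folds, fold_counts = greedy_k_fold([[4, 0], [0, 4], [4, 0], [0, 4]], k=2)
print(folds)        # group indices per fold
print(fold_counts)  # label counts per fold
```

A greedy pass like this is fast but can get stuck with poorly balanced folds, which is the kind of situation where a search over assignments may do better.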

Medium Articles

Stratified Splitting of Grouped Datasets Using Optimization

Stratified K-Fold Cross-Validation on Grouped Datasets
