Implemented the method train_test_split to prepare a DataFrame for machine learning algorithms by dividing it into 2 or 3 smaller DataFrames, which represent the train, test and (optionally) validation sets, plus an additional test. The idea is that users are working with a DataFrame that has the columns text, pca, ..., X, Y, where X and Y hold the input and the labels/output (e.g. X might be a BERT embedding of the text and Y the class targets, and users want to train an ML model to predict the labels). Most users currently use sklearn.train_test_split, which the new function also uses internally, but with this function we
Overview
Here the method takes:
stratify: we pass a Series with the labels here (see example below)
The method returns 2 or 3 DataFrames in a Tuple: the train_set_DataFrame in first 🥇 position, the test_set_DataFrame in second 🥈 position and, if validation was set to a value other than 0.0, the validation_set_DataFrame in third 🥉 position.
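To make the return contract concrete, here is a minimal sketch of how such a 2-or-3-way split could be built on top of sklearn.train_test_split. It is not the implementation from this PR: the helper name split_frame, the parameter names test_size and validation_size, and their defaults are assumptions made for illustration, and the stratify Series is assumed to share a unique index with the DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_frame(df, test_size=0.3, validation_size=0.0, stratify=None, random_state=None):
    """Hypothetical helper: return (train, test) or (train, test, validation)."""
    holdout = test_size + validation_size
    # First split: separate the training rows from everything that is held out.
    train, rest = train_test_split(
        df, test_size=holdout, stratify=stratify, random_state=random_state
    )
    if validation_size == 0.0:
        return train, rest
    # Second split: divide the held-out rows into test and validation.
    # The labels are re-aligned to the held-out rows via their index.
    rest_labels = stratify.loc[rest.index] if stratify is not None else None
    test, validation = train_test_split(
        rest,
        test_size=validation_size / holdout,
        stratify=rest_labels,
        random_state=random_state,
    )
    return train, test, validation


# Toy DataFrame with the column layout described above (values are made up).
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "X": [[float(i), float(i) ** 2] for i in range(20)],
    "Y": [i % 2 for i in range(20)],
})

train, test, validation = split_frame(
    df, test_size=0.2, validation_size=0.2, stratify=df["Y"], random_state=0
)
```

Splitting twice keeps all columns of a row together, so text, X and Y never get out of sync, which is the main convenience over calling sklearn.train_test_split on X and Y separately.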
Example