Implement train_test_split #167

mk2510 · 2020-08-26T11:43:57Z

Implemented the method train_test_split, plus an additional test, to prepare a DataFrame for machine learning algorithms by dividing it into 2-3 smaller DataFrames that represent the train, test, and optionally validation sets. The idea is that users work with a DataFrame that has columns text, pca, ..., X, Y, where X and Y hold the inputs and the labels/outputs (e.g. X might be a BERT embedding of the text and Y the class targets, and users want to train an ML model to predict the labels). Most users currently use sklearn.model_selection.train_test_split, which the new function also uses internally, but with this function we

provide a unified interface for users (don't have to import sklearn for their ML preprocessing pipeline)
extend the sklearn function to split into train-test-validation (sklearn can only split into train-test)

Overview

def train_test_split(
    df, train=0.0, test=0.0, val=0.0, class_balance=None, shuffle=True, random_state=None
) -> Union[
    Tuple[pd.DataFrame, pd.DataFrame], Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
]:

Here the method takes:

df: the DataFrame that we want to split up into training, testing, and optionally validation sets.
train=0.0, test=0.0, val=0.0: the sizes of the three sets. When all are left at 0.0, the split is 70% training set and 30% test set. If only val is set, an error is returned, as no training and test sizes are defined. Ints instead of floats are interpreted as absolute sizes (e.g. train=200 -> the train DataFrame will have 200 documents).
class_balance: if the data is already labeled and we want to keep the label proportions in the subsets (same as sklearn's stratify), we pass a Series with the labels here (see example below).
shuffle: whether to shuffle the dataset before splitting.
random_state: a seed that makes the results deterministic.

The method returns 2 or 3 DataFrames in a Tuple: the train_set_DataFrame in first 🥇 position, the test_set_DataFrame in second 🥈 position and, if val was set to a value other than 0.0, the validation_set_DataFrame in third 🥉 position.
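The size rules described above can be sketched in plain Python. This is only an illustration of the described behaviour, not the code from this PR: the helper names `resolve_split_sizes` and `_to_count` are hypothetical, and the choice to give all remaining rows to the unset side (train or test) is an assumption based on the example below, where train=0.8 on 737 rows yields 589 and 148 rows.

```python
from typing import Tuple, Union

Number = Union[int, float]


def _to_count(size: Number, n_rows: int) -> int:
    """Treat an int as an absolute row count and a float as a fraction of n_rows."""
    if isinstance(size, bool) or size < 0:
        raise ValueError(f"invalid split size: {size!r}")
    if isinstance(size, int):
        return size
    return int(size * n_rows)  # fractions are rounded down (assumption)


def resolve_split_sizes(
    n_rows: int, train: Number = 0.0, test: Number = 0.0, val: Number = 0.0
) -> Tuple[int, int, int]:
    """Return (n_train, n_test, n_val) row counts for a DataFrame with n_rows rows."""
    if not train and not test:
        if val:
            # Only a validation size was given: train and test sizes are undefined.
            raise ValueError("val is set but neither train nor test is defined")
        train = 0.7  # documented default: 70% train, 30% test

    n_val = _to_count(val, n_rows)
    if train and test:
        n_train = _to_count(train, n_rows)
        n_test = _to_count(test, n_rows)
    elif train:
        n_train = _to_count(train, n_rows)
        n_test = n_rows - n_train - n_val  # remaining rows go to test (assumption)
    else:
        n_test = _to_count(test, n_rows)
        n_train = n_rows - n_test - n_val  # remaining rows go to train (assumption)

    if min(n_train, n_test) <= 0 or n_train + n_test + n_val > n_rows:
        raise ValueError("split sizes do not fit the number of rows")
    return n_train, n_test, n_val
```

For instance, `resolve_split_sizes(737, train=0.8)` returns `(589, 148, 0)`, and `resolve_split_sizes(1000, train=200, test=100)` returns `(200, 100, 0)`.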

Example

>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv(
...    "https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv"
... ) # doctest: +SKIP
>>> train_df, test_df = hero.train_test_split(
...                  df, class_balance = df["topic"], random_state = 42, train=0.8
... ) # doctest: +SKIP
>>> len(train_df) # doctest: +SKIP
589
>>> len(test_df) # doctest: +SKIP
148
>>> # to check whether we have an equal class distribution (in this case a class is a topic)
>>> # we will look at the percentage of each topic in the DataFrame
>>> train_df["topic"].value_counts() / train_df["topic"].value_counts().sum() # doctest: +SKIP
football     0.359932
rugby        0.198642
cricket      0.168081
athletics    0.137521
tennis       0.135823
Name: topic, dtype: float64
>>> # and now the same for the test dataset
>>> test_df["topic"].value_counts() / test_df["topic"].value_counts().sum() # doctest: +SKIP
football     0.358108
rugby        0.202703
cricket      0.168919
athletics    0.135135
tennis       0.135135
Name: topic, dtype: float64

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>

jbesomi · 2020-09-14T15:56:36Z

As Texthero cannot really offer ML capabilities (see #155 for instance), I don't really see where users can make use of this function, as they would need to import and use scikit-learn anyway ...

Also, split into train-test-dev is quite straightforward:

X_train, X_test, y_train, y_test = train_test_split(df[['columns', ...,]], df['target'], ...)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, ...)

Not sure users would want to read the docstring of a new function from a library not designed for these kinds of ML operations ...

Sorry for the negative review! :( 👍

Would be happy to see your motivations though!

mk2510 · 2020-09-14T16:16:47Z

Would be happy to see your motivations though!

We saw three main advantages in this implementation:

Users don't have to import sklearn when they do machine learning. They can preprocess the text with Texthero, split it with Texthero, and then pass it to TensorFlow. This saves the user from importing one additional library.
It makes the user's code shorter, as it saves one line.
We have the option to keep the class balance even in the case with a validation set, so users don't have to build (or google) their own solution.
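To make the third point concrete, here is a rough sketch of a class-balanced train/test/validation split built from two stratified calls to scikit-learn's `train_test_split`. It is not the implementation from this PR; the function name `stratified_three_way_split`, its parameters, and the toy DataFrame are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, label_column, test=0.2, val=0.1, random_state=None):
    """Split df into train, test and validation sets, keeping label proportions."""
    # First cut: training set versus everything else, stratified on the labels.
    train_df, rest_df = train_test_split(
        df,
        test_size=test + val,
        stratify=df[label_column],
        random_state=random_state,
    )
    # Second cut: divide the remainder into test and validation, stratified again.
    test_df, val_df = train_test_split(
        rest_df,
        test_size=val / (test + val),
        stratify=rest_df[label_column],
        random_state=random_state,
    )
    return train_df, test_df, val_df


# Toy data (hypothetical): 100 documents, 60% "football" and 40% "tennis".
toy_df = pd.DataFrame(
    {
        "text": [f"doc {i}" for i in range(100)],
        "topic": ["football"] * 60 + ["tennis"] * 40,
    }
)
train_df, test_df, val_df = stratified_three_way_split(
    toy_df, "topic", test=0.2, val=0.2, random_state=42
)
```

Calling `value_counts(normalize=True)` on the "topic" column of each returned DataFrame is a quick way to compare the label proportions across the three subsets.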

But no worries if it's not gonna be merged, as we do this project for fun 🪂 🥇 ⛽

jbesomi · 2020-10-09T18:16:44Z

Closing this as this is out-of-scope, at least for now. Thank you for the contribution nonetheless! 👌

henrifroese and others added 3 commits August 26, 2020 12:22

add function to split test train validate, missing example and test

422e69f

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>

added example

a7b57e9

added test for function

dca6205

vercel bot deployed to Preview August 26, 2020 11:44

henrifroese changed the title ~~Implementation of the Function train_test_split~~ Add train_test_split Aug 26, 2020

henrifroese changed the title ~~Add train_test_split~~ Implement train_test_split Aug 26, 2020

henrifroese added the enhancement New feature or request label Aug 26, 2020

henrifroese mentioned this pull request Aug 28, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

jbesomi marked this pull request as draft September 14, 2020 15:56

jbesomi closed this Oct 9, 2020

mk2510 commented Aug 26, 2020 • edited by henrifroese
