Implement train_test_split #167

Closed
wants to merge 3 commits into from

mk2510
Collaborator

@mk2510 mk2510 commented Aug 26, 2020

Implemented the method train_test_split to prepare a DataFrame for machine learning algorithms by dividing it into 2-3 smaller DataFrames, which represent the train, test, and (optionally) validation sets. The idea is that users are working with a DataFrame that has columns text, pca, ..., X, Y, where X and Y hold the inputs and the labels/outputs (e.g. X might be a BERT embedding of the text, Y might be class targets, and users want to train an ML model to predict the labels). Most users currently use sklearn.model_selection.train_test_split, which we also use internally in the new function, but with this function we

  • provide a unified interface for users (don't have to import sklearn for their ML preprocessing pipeline)
  • extend the sklearn function to split into train-test-validation (sklearn can only split into train-test)

Overview

def train_test_split(
    df, train=0.0, test=0.0, val=0.0, class_balance=None, shuffle=True, random_state=None
) -> Union[
    Tuple[pd.DataFrame, pd.DataFrame], Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
]:

Here the method takes:

  • a DataFrame that we want to split up into training, testing, and optionally validation.
  • train=0.0, test=0.0, val=0.0: when all three are left at 0.0, the split defaults to a 70% training set and a 30% test set. If only val is set, we raise an error, as no training and test sizes are defined. If ints are given instead of floats, they are interpreted as absolute sizes (e.g. train=200 -> the train DataFrame will have 200 documents)
  • class_balance: if the data is already labeled and we want to keep the label proportions in the subsets (same as sklearn stratify), we pass a Series with the labels here (see example below)
  • shuffle, which shuffles the dataset
  • random_state, to create deterministic results

The method returns a Tuple of 2 or 3 DataFrames: the train set in the first 🥇 position, the test set in the second 🥈, and, if val was set to a value other than 0.0, the validation set in the third 🥉.
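To make the size rules above concrete, here is a minimal sketch of how train, test and val could be resolved into row counts. It is not the code from this PR: the names resolve_split_sizes and _to_count are hypothetical, fractional sizes are rounded down, and the rule that an unspecified train or test size takes the remaining rows is an assumption.

```python
from typing import Tuple, Union

Number = Union[int, float]


def _to_count(size: Number, n_rows: int) -> int:
    """Read a float as a fraction of n_rows and an int as an absolute count."""
    if isinstance(size, float):
        if not 0.0 <= size <= 1.0:
            raise ValueError("fractional sizes must lie between 0.0 and 1.0")
        return int(size * n_rows)  # assumption: round down
    return int(size)


def resolve_split_sizes(
    n_rows: int, train: Number = 0.0, test: Number = 0.0, val: Number = 0.0
) -> Tuple[int, int, int]:
    """Return (n_train, n_test, n_val) for a DataFrame with n_rows rows."""
    if not train and not test:
        if val:
            raise ValueError("val was set, but neither train nor test was")
        train = 0.7  # default from the description: 70% train, 30% test

    n_train = _to_count(train, n_rows)
    n_test = _to_count(test, n_rows)
    n_val = _to_count(val, n_rows)

    # Assumption: an unspecified train or test size takes the rows that remain.
    if not train:
        n_train = n_rows - n_test - n_val
    elif not test:
        n_test = n_rows - n_train - n_val

    if min(n_train, n_test, n_val) < 0 or n_train + n_test + n_val > n_rows:
        raise ValueError("requested sizes do not fit into the DataFrame")
    return n_train, n_test, n_val
```

Under these assumptions, resolve_split_sizes(737, train=0.8) would yield 589 training and 148 test rows, which matches the lengths in the example below.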

Example

>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv(
...    "https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv"
... ) # doctest: +SKIP
>>> train_df, test_df = hero.train_test_split(
...                  df, class_balance = df["topic"], random_state = 42, train=0.8
... ) # doctest: +SKIP
>>> len(train_df) # doctest: +SKIP
589
>>> len(test_df) # doctest: +SKIP
148
>>> # to check whether we have an equal class distribution (in this case a class is a topic)
>>> # we will look at the percentage of each topic in the DataFrame
>>> train_df["topic"].value_counts() / train_df["topic"].value_counts().sum() # doctest: +SKIP
football     0.359932
rugby        0.198642
cricket      0.168081
athletics    0.137521
tennis       0.135823
Name: topic, dtype: float64
>>> # and now the same for the test dataset
>>> test_df["topic"].value_counts() / test_df["topic"].value_counts().sum() # doctest: +SKIP
football     0.358108
rugby        0.202703
cricket      0.168919
athletics    0.135135
tennis       0.135135
Name: topic, dtype: float64

@henrifroese henrifroese changed the title Implementation of the Function train_test_split Add train_test_split Aug 26, 2020
@henrifroese henrifroese changed the title Add train_test_split Implement train_test_split Aug 26, 2020
@henrifroese henrifroese added the enhancement New feature or request label Aug 26, 2020
@jbesomi
Owner

jbesomi commented Sep 14, 2020

As Texthero cannot really offer ML capabilities (see #155 for instance), I don't really see where users can make use of this function as they would need anyway to import and use scikit-learn ...

Also, split into train-test-dev is quite straightforward:

X_train, X_test, y_train, y_test = train_test_split(df[['columns', ...,]], df['target'], ...)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, ...)

Not sure users would want to read the docstring of a new function from a library not designed for these kinds of ML operations ...

Sorry for the negative review! :( 👍

Would be happy to see your motivations though!

@jbesomi jbesomi marked this pull request as draft September 14, 2020 15:56
@mk2510
Collaborator Author

mk2510 commented Sep 14, 2020

Would be happy to see your motivations though!

We saw three main advantages in this implementation:

  1. Users don't have to import sklearn when they do machine learning. They can preprocess the text with Texthero, split it with Texthero, and then feed it into TensorFlow. This saves the user from importing one additional library.
  2. It makes the user's code shorter, as it saves one line.
  3. We have the option to keep the class balance even in the case with a validation set, so users don't have to build (or google) their own solution.
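For reference, a class-balanced three-way split can be sketched with plain scikit-learn by calling sklearn.model_selection.train_test_split twice with stratify. This is only an illustration of the idea in point 3, not the code from this PR; the function name stratified_three_way_split and its parameters are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, label_col, test=0.2, val=0.2, random_state=None):
    """Split df into train/test/val parts that keep the label proportions."""
    # First cut: hold out the test and val rows together, stratified on the labels.
    train_df, rest_df = train_test_split(
        df,
        test_size=test + val,
        stratify=df[label_col],
        random_state=random_state,
    )
    # Second cut: divide the held-out rows into test and val, again stratified.
    test_df, val_df = train_test_split(
        rest_df,
        test_size=val / (test + val),
        stratify=rest_df[label_col],
        random_state=random_state,
    )
    return train_df, test_df, val_df


# Toy data standing in for a labeled corpus (hypothetical values).
toy_df = pd.DataFrame(
    {
        "text": [f"doc {i}" for i in range(100)],
        "topic": ["football"] * 60 + ["rugby"] * 40,
    }
)
train_df, test_df, val_df = stratified_three_way_split(
    toy_df, "topic", test=0.2, val=0.2, random_state=42
)
```

Each class needs enough rows to appear in every part; otherwise scikit-learn raises a ValueError during the stratified split.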

But no worries if it's not gonna be merged, as we do this project for fun 🪂 🥇 ⛽

@jbesomi
Owner

jbesomi commented Oct 9, 2020

Closing this as this is out-of-scope, at least for now. Thank you for the contribution nonetheless! 👌

@jbesomi jbesomi closed this Oct 9, 2020