Preprocessing

Nicolas Leenaerts edited this page Nov 12, 2023 · 2 revisions

A function is provided that preprocesses data for machine learning analyses. It splits the data using either a train-test split or a k-fold split and returns the split data as a list. Optionally, the function also standardizes the continuous predictors (using the mean and standard deviation of the training data) and removes predictors with zero variance (that is, predictors with only one unique value).
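The two optional steps can be sketched in base R as follows. This is only an illustration of the idea, not the function's actual code: the toy data frames and all variable names are hypothetical, and it assumes that zero variance is judged on the training data.

```r
# Hypothetical toy data; the names are illustrative, not part of the package.
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(5, 5, 5, 5))
test  <- data.frame(x1 = c(5, 6),       x2 = c(5, 5))

# Keep only predictors with more than one unique value in the training data
# (assumption: zero variance is checked on the training set).
keep  <- vapply(train, function(col) length(unique(col)) > 1, logical(1))
train <- train[, keep, drop = FALSE]
test  <- test[, keep, drop = FALSE]

# Standardize both sets with the mean and standard deviation of the
# training data, so no information from the test set is used.
train_mean <- vapply(train, mean, numeric(1))
train_sd   <- vapply(train, sd, numeric(1))
train_scaled <- as.data.frame(scale(train, center = train_mean, scale = train_sd))
test_scaled  <- as.data.frame(scale(test,  center = train_mean, scale = train_sd))
```

After scaling, the training predictors have mean 0 and standard deviation 1, while the test predictors generally do not, because they are transformed with the training statistics.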

Arguments

  • data

Dataframe which includes the outcome and the predictor variables.

  • outcome

Outcome variable.

  • predictors_con

A list of continuous predictors that will be used to predict the outcome.

  • predictors_cat

A list of categorical predictors that will be used to predict the outcome.

  • split

A number indicating the percentage of the data that will be used as training data. The rest of the data is used as test data. This argument is ignored when nested cross-validation is used (that is, when outer_cv is specified). The default value is 80.

  • outer_cv

A number defining how many folds will be used in the outer cross-validation loop. Specifying a number here will make the wrapper use nested cross-validation.

  • stratified

A logical indicating whether the train-test split or cross-validation needs to be stratified. The default is TRUE.

  • scaling

A logical defining whether the continuous predictors need to be scaled. The default is TRUE.

  • shuffle

A logical indicating whether the data need to be shuffled before the split.

  • seed

A number defining the seed which is used for the steps of the wrapper that are random.

  • clean_columns

A logical indicating whether predictors with zero variance need to be removed. The default is TRUE.
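To make the split, outer_cv, stratified and seed arguments more concrete, here is a minimal base R sketch of a stratified train-test split and a stratified k-fold assignment. It is not the function's implementation: the variables `split_pct`, `train_idx`, `test_idx` and `fold_id` are hypothetical, and the sketch assumes a categorical outcome, for which stratifying means sampling within each class.

```r
set.seed(42)                      # plays the role of the seed argument

# Hypothetical binary outcome with 80 zeros and 20 ones.
outcome   <- rep(c(0, 1), times = c(80, 20))
split_pct <- 80                   # plays the role of the split argument
outer_cv  <- 5                    # plays the role of the outer_cv argument

# Row indices grouped by outcome class.
idx_by_class <- split(seq_along(outcome), outcome)

# Stratified train-test split: draw split_pct percent within each class.
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * split_pct / 100))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(outcome), train_idx)

# Stratified k-fold: shuffle within each class, then deal rows out to folds.
fold_id <- integer(length(outcome))
for (idx in idx_by_class) {
  shuffled <- idx[sample.int(length(idx))]
  fold_id[shuffled] <- rep_len(seq_len(outer_cv), length(shuffled))
}
```

With this toy outcome, the training set, the test set and every fold keep the 20% share of ones found in the full data, which is what stratification is meant to achieve.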

Output

The function returns a nested list. The first level contains the split data sets, whose number depends on the type of split that was requested (one data set for a simple train-test split, and k data sets for a k-fold split). The second level contains the training and test sets of each split.

For example:

  • data_preprocessed[[2]] is the second split data set from a k-fold split

  • data_preprocessed[[2]][1] is the outcome of the training set

  • data_preprocessed[[2]][2] is the outcome of the test set

  • data_preprocessed[[2]][3] are the predictors of the training set

  • data_preprocessed[[2]][4] are the predictors of the test set
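The indexing above can be tried out on a mock object with the same layout. The object below is built by hand with made-up values and is not real output of the function; whether the real elements are vectors or data frames is an assumption here. Note that in R, single brackets return a list of length one, whereas double brackets return the element itself.

```r
# Mock split with the documented order of elements; values are made up.
fold <- list(
  c(1, 0, 1),                         # outcome of the training set
  c(0, 1),                            # outcome of the test set
  data.frame(x1 = c(0.1, 0.2, 0.3)),  # predictors of the training set
  data.frame(x1 = c(0.4, 0.5))        # predictors of the test set
)
data_preprocessed <- list(fold, fold)  # mimics a split with two folds

second_split <- data_preprocessed[[2]]

# Single brackets keep the list wrapper around the element.
train_outcome_list <- second_split[1]

# Double brackets extract the element itself.
train_outcome <- second_split[[1]]
test_outcome  <- second_split[[2]]
train_x       <- second_split[[3]]
test_x        <- second_split[[4]]
```

Double brackets are usually the more convenient form when the extracted object is passed directly to a model-fitting function.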