Running Python Models in R

We use the reticulate package to call Python from R.

library(reticulate)

Prerequisites

For these methods to work, you will need to point to a Python executable in a Conda environment or Virtualenv that contains all the Python packages you need. You can do this with a .Rprofile file in your project directory; the .Rprofile file in this project shows how I have done it.
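As an illustration (not necessarily the exact contents of this project's .Rprofile), a minimal setup might set the RETICULATE_PYTHON environment variable before reticulate initializes; the path below is an assumption and should point at your own environment's interpreter:

# Hypothetical .Rprofile contents: tell reticulate which Python to use
# before it initializes. The path is an assumption -- substitute the
# interpreter from your own Conda environment or Virtualenv.
Sys.setenv(RETICULATE_PYTHON = "~/miniconda3/envs/r_and_py_models/bin/python")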

Write Python functions to run on a data set in R

In the file python_functions.py I have written the required functions in Python to fit an XGBoost model on an arbitrary data set. We expect all the parameters for these functions to be in a single dict called parameters. I am now going to source these functions into R so that they become R functions expecting equivalent data structures.

source_python("python_functions.py")
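As an optional check (not part of the original workflow), you can confirm which interpreter reticulate has bound to:

reticulate::py_config()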

Example: Using XGBoost in R

We now use these Python functions on a wine dataset in R to try to learn to predict high-quality wines.

First we download data sets for white wines and red wines.

white_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                        sep = ";")
red_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", 
                      sep = ";")

We will create ‘white versus red’ as a new feature, and we will define ‘High Quality’ to be a quality score of seven or more.

library(dplyr)

white_wines$red <- 0
red_wines$red <- 1

wine_data <- white_wines %>% 
  bind_rows(red_wines) %>% 
  mutate(high_quality = ifelse(quality >= 7, 1, 0)) %>% 
  select(-quality)

knitr::kable(head(wine_data))
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | red | high_quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 0 | 0 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |

Now we set our list of parameters (a named list in R converts to a dict in Python):

params <- list(
  input_cols = colnames(wine_data)[colnames(wine_data) != 'high_quality'],
  target_col = 'high_quality',
  test_size = 0.3,
  random_state = 123,
  subsample = (3:9)/10, 
  xgb_max_depth = 3:9,
  colsample_bytree = (3:9)/10,
  xgb_min_child_weight = 1:4,
  k = 3,
  k_shuffle = TRUE,
  n_iter = 10,
  scoring = 'f1',
  error_score = 0,
  verbose = 1,
  n_jobs = -1
)
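As an optional aside, you can see the Python object a named list becomes by converting it manually with reticulate's r_to_py():

# optional check: a named R list converts to a Python dict
reticulate::r_to_py(list(test_size = 0.3, k = 3))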

Now we are ready to run our XGBoost model with 3-fold cross validation. First we split the data:

split <- split_data(df = wine_data,  parameters = params)
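The element names of the resulting list depend on what split_data returns; judging from how they are used below, it holds X_train, X_test, y_train and y_test, which you can confirm with:

# element names inferred from their use later in this document
names(split)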

This produces a list, which we can feed into our scaling function:

scaled <- scale_data(split$X_train, split$X_test)

Now we can run the XGBoost algorithm with the defined parameters on our training set:

trained <- train_xgb_crossvalidated(
  scaled$X_train_scaled,
  split$y_train,
  parameters = params
)
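The parameter ranges together with n_iter and scoring suggest a randomized hyperparameter search under the hood. If the object returned by train_xgb_crossvalidated is a fitted scikit-learn search object (an assumption; the function may instead return the best estimator), reticulate exposes its Python attributes via $:

# assumption: `trained` wraps a fitted search object such as RandomizedSearchCV;
# if the function returns the best estimator instead, these attributes won't exist
trained$best_params_   # best hyperparameter combination found
trained$best_score_    # best mean cross-validated score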

Finally we can generate a classification report on our test set:

report <- generate_classification_report(trained, scaled$X_test_scaled, split$y_test)

knitr::kable(report)
|              | precision | recall    | f1-score  |
|--------------|-----------|-----------|-----------|
| 0.0          | 0.8859915 | 0.9377407 | 0.9111319 |
| 1.0          | 0.6777409 | 0.5204082 | 0.5887446 |
| accuracy     | 0.8538462 | 0.8538462 | 0.8538462 |
| macro avg    | 0.7818662 | 0.7290744 | 0.7499382 |
| weighted avg | 0.8441278 | 0.8538462 | 0.8463238 |
