Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
146 lines (121 sloc) 4.37 KB
---
title: "Working with SQL data bases"
author: "Michel Lang"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
set.seed(1)
```
# Setup
First, make sure that you have the following packages installed on your system:
```{r}
suppressPackageStartupMessages(library(mlr3))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(mlr3db))
# Abstraction for various data bases
if (!requireNamespace("dbplyr", quietly = TRUE))
install.packages("dbplyr")
# SQLite DBI interface
if (!requireNamespace("RSQLite", quietly = TRUE))
install.packages("RSQLite")
# Example data set
if (!requireNamespace("nycflights13", quietly = TRUE))
install.packages("nycflights13")
```
# NY City Flights Example
In this example, we will
- Create a temporary data base
1. Load the `nycflights13::flights` data set.
2. Add an ID column to the data set.
3. Create a SQLite data base and copy the flights data to SQLite.
- Create a `mlr3::DataBackend` based on a subset of the SQLite data base.
- Create a `mlr3::Task` which uses the `mlr3::DataBackend` and fit regression trees.
## Create a temporary data base
### Load the data
```{r}
data("flights", package = "nycflights13")
str(flights)
```
### Add a unique ID column
Here, we add a simple integer row id column.
And as we do not know (and don't care) if there is already any order in the data, we assign randomly permutated ids:
```{r}
flights$row_id = sample.int(nrow(flights))
```
### Create a SQLite data base with the flights data set
We store the SQLite data base in the temporary directory and then copy over the flight data set:
```{r}
path = tempfile("flights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = copy_to(con, flights)
```
We now continue to work on the data set stored on the file system, referenced by `tbl`, and can remove the in-memory copy.
```{r}
rm(flights) # remove the in-memory data set
```
## Create a DataBackend
We reduce the data set to only keep a subset of features where the regression tree will have no problems:
```{r}
keep = c("row_id", "year", "month", "day", "hour", "minute", "dep_time",
"arr_time", "carrier", "flight", "air_time", "distance", "arr_delay")
tbl = select(tbl, keep)
colnames(tbl)
```
Additionally, we remove those observations where the arrival delay (`arr_delay`) has a missing value:
```{r}
tbl = filter(tbl, !is.na(arr_delay))
```
And to keep the runtimes reasonable for this toy example, we filter the data to only use every second row:
```{r}
tbl = filter(tbl, row_id %% 2 == 0)
```
Finally, we merge the factor levels of feature `carrier` so that infrequent carriers are replaced by level "other":
```{r}
tbl = mutate(tbl, carrier = case_when(
carrier %in% c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~ "other",
TRUE ~ carrier)
)
```
This subsetted table is now used to create a `mlr3db::DataBackendDplyr`.
```{r}
b = as_data_backend(tbl, primary_key = "row_id")
```
We can now use the interface of `mlr3::DataBackend` to query the data:
```{r}
b$nrow
b$ncol
b$head()
```
Note that the `mlr3::DataBackend` just does not know about any rows or columns we have filtered out with `dplyr` before creating the `mlr3::DataBackend`.
## Create a regression task and fit regression trees
We create the following mlr3 objects:
* A regression task, based on the previously created `mlr3db::DataBackendDplyr`
* A regression learner (using `rpart::rpart()`)
* A resampling strategy: 3 times repeated subsampling using 10\% of the observations for training (see `mlr3::Resampling`)
* A list of `mlr3::Measure` to measure MSE, time to train and time to predict
```{r}
task = TaskRegr$new("flights_sqlite", b, target = "arr_delay")
learner = mlr_learners$get("regr.rpart")
resampling = mlr_resamplings$get("subsampling")
resampling$param_set$values = list(repeats = 3, ratio = 0.1)
```
We pass all these objects to `mlr3::resample()` to perform a simple resampling with three iterations.
In each iteration, only the required subset of the data is queried from the SQLite data base and passed to `rpart::rpart()`:
```{r}
rr = resample(task, learner, resampling)
print(rr)
```
## Cleanup
Finally, we remove the `tbl` object and close the connection.
```{r}
rm(tbl)
DBI::dbDisconnect(con)
```