Add functional data section #101

Closed
wants to merge 23 commits into from
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -41,6 +41,7 @@ pages:
- 'Classifier Calibration Plots': 'classifier_calibration.md'
- 'Hyperparameter Tuning Effects': 'hyperpar_tuning_effects.md'
- 'Out-of-Bag Predictions': 'out_of_bag_predictions.md'
- 'Functional Data': 'functional_data.md'
- Extend:
- 'Create Custom Learners': 'create_learner.md'
- 'Create Custom Measures': 'create_measure.md'
237 changes: 237 additions & 0 deletions src/functional_data.Rmd
@@ -0,0 +1,237 @@
# Functional Data

Functional data provides information about curves varying over a continuum, such as time.
Collaborator

How well known is functional data analysis? I would give a bit more of a primer on how, when, and why functional analysis is used. Like a paragraph or so

Collaborator

Also, for all of my comments keep in mind I am coming from the perspective of someone who knows little about FDA. I may not be your target audience

Collaborator

Agree that there should be just a paragraph of explanation, especially on the how and why of FDA

Collaborator Author

@pfistfl pfistfl Mar 20, 2017

Done, see next commit.

"I may not be your target audience" -> You are exactly the target audience!

This type of data is often present when analyzing measurements at various time points.
Such curves are usually interdependent: the measurement at a point $t_{i + 1}$ usually depends on some of the measurements at the earlier points $t_1, \ldots, t_i$, $i \in \mathbb{N}$.
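
To make this interdependence concrete, here is a minimal base R sketch. It simulates smooth curves with a random amplitude plus a little noise and computes the correlation between two neighbouring measurement points across curves. The simulated data and all variable names are illustrative assumptions, not part of the tutorial data.

```{r}
set.seed(1)
t.grid = seq(0, 1, length.out = 200)
# 50 curves: a smooth signal with random amplitude plus small noise
curves = t(sapply(1:50, function(i) {
  rnorm(1, mean = 1, sd = 0.3) * sin(2 * pi * t.grid) + rnorm(200, sd = 0.05)
}))
# Correlation across curves between a measurement and its immediate successor
cor.neighbour = cor(curves[, 50], curves[, 51])
cor.neighbour
```

A correlation close to 1 means that neighbouring columns carry largely the same information, which is exactly the structure that functional methods try to exploit.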

As traditional machine learning techniques usually do not take the interdependence between features into account,
they are often not _well suited_ for such tasks, which can lead to poor performance.
Functional data analysis, on the other hand, tries to address this either by using algorithms specifically tailored to functional data, or by transforming the functional covariates into a non-time-dependent feature space.
For a more in-depth introduction to functional data analysis see e.g. [When the data are functions](http://rd.springer.com/article/10.1007/BF02293704), Ramsay, J.O., 1982.

Each observation of a functional covariate consists of evaluations of a function, i.e. measurements of a scalar value at various time points.
A single observation might then look like this:
```{r}
# Plot the NIR curve of the first observation
library(FDboost)
library(ggplot2)
data(fuelSubset)
# NIR_Obs1 holds the NIR measurements of the first observation.
# lambda holds the points at which the curve was measured.
df = data.frame("NIR_Obs1" = fuelSubset$NIR[1, ],
  "lambda" = fuelSubset$nir.lambda)
ggplot(df) +
  geom_line(aes(y = NIR_Obs1, x = lambda))
```

## How to model functional data

There are two commonly used approaches for analysing functional data.

* Directly analyze the functional data using a [learner](&Learner.md) that is suitable for functional data on a [Task](&makeTask). Those learners have the prefixes __classif.fda__ and __regr.fda__.
For more info on learners see [fda learners](functional_data.Rmd#constructing-a-learner).
For this purpose, the functional data has to be saved as a matrix column in the data.frame used
for constructing the [Task](&makeTask). For more info on functional tasks, see the following section.

* Transform the task into a format suitable for standard __classification__ or __regression__ [learners](&Learner.md).
This is done by [extracting](functional_data.Rmd#feature-extraction) non-temporal/non-functional features from the curves. Non-temporal features do not depend on each other, similar to features in traditional machine learning. This is explained in more detail [below](functional_data.Rmd#feature-extraction).
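
As a rough illustration of the second approach, the sketch below reduces each curve to a handful of scalar summaries in base R. The toy curves and the chosen summaries (mean, minimum, maximum, linear slope) are assumptions made for illustration only; they are not the extraction methods provided by mlr.

```{r}
set.seed(1)
# Hypothetical curve matrix: 5 curves (rows) measured at 20 points (columns)
grid = seq(0, 1, length.out = 20)
curves = t(sapply(1:5, function(i) i * grid + rnorm(20, sd = 0.01)))
# One row of scalar, non-functional features per curve
feats = data.frame(
  mean = rowMeans(curves),
  min = apply(curves, 1, min),
  max = apply(curves, 1, max),
  slope = apply(curves, 1, function(y) coef(lm(y ~ grid))[2])
)
feats
```

The resulting data.frame has one row per curve and no matrix columns, so any standard learner could be trained on it.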


## Creating a Task that contains functional features

The first step is to get the data into the right format. [%mlr] expects a [data.frame](&base::data.frame) that contains the functional features and the target variable as input. In contrast to __numeric__ data, functional data has to be stored as a matrix column in the data.frame.
After that a [Task](&makeTask) that contains the data in a well-defined format is created. [Tasks](&makeTask) come in different flavours, such as [ClassifTask](&makeClassifTask) and [RegrTask](&makeRegrTask), which can be used according to the class of the target variable.
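
For readers unfamiliar with matrix columns, the following base R sketch shows what such a data.frame looks like. The toy values and the names `toy` and `curve.mat` are hypothetical; in the tutorial itself this step is handled by the helper function introduced below.

```{r}
# Hypothetical toy data: 4 observations, one functional feature with 3 measurement points
y = c(1.2, 0.7, 3.1, 2.4)
curve.mat = matrix(seq_len(12), nrow = 4, ncol = 3)
toy = data.frame(y = y)
# Assigning a matrix via `$<-` keeps it as a single matrix column
toy$curve = curve.mat
str(toy)
ncol(toy)       # two columns: y and the matrix column
dim(toy$curve)  # 4 rows, 3 measurement points
```

Note that `data.frame(y, curve.mat)` would instead split the matrix into three separate numeric columns, which is the situation the tutorial starts from in the example below.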

In the following example, the data is first stored as matrix columns using the
helper function [makeFunctionalData](&makeFunctionalData) for the [fuelSubset](fuelSubset.task)
data from package [%FDboost].

The data is provided in the following structure:

```{r}
str(fuelSubset)
```

* __heatan__ is the target variable, in this case a numeric value.
* __h2o__ is an additional scalar variable.
* __NIR__ and __UVVIS__ are matrices containing the curve data. Each column corresponds to a single time point the data was sampled at. Each row indicates a single curve. __NIR__ was measured at $231$ time points, while __UVVIS__ was measured at $134$ time points.
* __nir.lambda__ and __uvvis.lambda__ are vectors of length $231$ and $134$ that indicate the time points the data was measured at. Each entry corresponds to one column of __NIR__ and __UVVIS__ respectively. For now we ignore this additional information in mlr.

Our data already contains functional features as matrices in a list.
In order to showcase how such a matrix can
be created from arbitrary numeric columns, we transform the list into a data.frame with a set of numeric columns for each matrix. These columns refer to the matrix columns in the list, i.e.
__UVVIS.1__ is the first column of the UVVIS matrix.

```{r}
## Put all values into a data.frame
df = data.frame(fuelSubset[c("heatan", "h2o", "UVVIS", "NIR")])
str(df[, 1:5])
```

Before constructing the [Task](&makeTask), the data is reformatted again so that it contains matrix columns.
This is done by providing a list __fd.features__ that identifies the functional covariates.
All columns not mentioned in the list are kept as-is. In our case the column indices 3:136 correspond to the columns of the UVVIS matrix and the indices 137:367 to the columns of the NIR matrix. Alternatively we could also specify the respective column names.

```{r}
# fd.features is a named list, where each name corresponds to the name of the
# functional feature and the values to the respective column indices or column names.
fd.features = list("UVVIS" = 3:136, "NIR" = 137:367)
fdf = makeFunctionalData(df, fd.features = fd.features)
```
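
Hard-coded index ranges are easy to get wrong, so it can help to derive them from the column names. The base R sketch below does this on a hypothetical toy list that mimics the structure of the data above; the names `toy`, `toy.df` and `toy.features` are illustrative assumptions.

```{r}
# Hypothetical toy list with a scalar target and two matrices
toy = list(y = rnorm(3), A = matrix(0, nrow = 3, ncol = 4), B = matrix(0, nrow = 3, ncol = 2))
# data.frame() spreads each matrix into columns named <name>.<index>
toy.df = data.frame(toy)
names(toy.df)
# Look up the columns belonging to each matrix by their name prefix
toy.features = list(
  A = grep("^A\\.", names(toy.df)),
  B = grep("^B\\.", names(toy.df))
)
toy.features
```

A list built this way has the same shape as the index list used in the chunk above and stays correct if columns are added or reordered.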

[makeFunctionalData](&makeFunctionalData) returns a data.frame, where the functional features are
matrices.

```{r}
str(fdf)
```

Now, with a data.frame containing the functionals as matrices, a [Task](&makeTask) can be created:

```{r}
# Create a regression task; classification tasks behave analogously.
# The functional features are picked up from the matrix columns of fdf.
tsk1 = makeRegrTask("fuelsubset", data = fdf, target = "heatan")
tsk1
```


## Constructing a learner

For functional data, [learners](&Learner.md) are constructed using
`makeLearner("classif.<R_method_name>")` or
`makeLearner("regr.<R_method_name>")`, depending on the target variable.

Applying learners to a [Task](&makeTask) works in three ways:

* Use a [learner](&Learner.md) that supports functional features:

+ For regression:

```{r}
# The following learners can be used for the task.
listLearners(tsk1, properties = "functionals")
# Create a FDboost learner
fdalrn = makeLearner("regr.FDboost")
```

+ Or alternatively for classification:

```{r}
# knn learner
knn.lrn = makeLearner("classif.fdausc.knn")
```

* Use a _standard_ [learner](&Learner.md).
In this case the temporal structure is disregarded.

```{r}
## Decision Tree learner
rpartlrn = makeLearner("classif.rpart")
```

* Alternatively, transform the functional data into a non-temporal/non-functional space by [extracting](functional_data.Rmd#feature-extraction) features before training.
In this case, a normal regression- or classification-[learner](&Learner.md)
can be applied.

This is explained in more detail in the [feature extraction](functional_data.Rmd#feature-extraction)
section below.


## Train the learner

The resulting learner can now be trained on the task created in the section [Creating a Task](functional_data.Rmd#creating-a-task-that-contains-functional-features) above.

```{r}
# Train the fdalrn on the constructed task
m = train(learner = fdalrn, task = tsk1)
m
p = predict(m, tsk1)
performance(p, rmse)

# Or simply resample (3-fold Cross-Validation)
resample(fdalrn, tsk1, resampling = cv3, measures = mse)
```
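
The chunk above reports the root mean squared error for the in-sample prediction and the mean squared error for the resampling, so the two numbers are not on the same scale. The following base R sketch, using hypothetical values, shows how the two measures relate.

```{r}
# Hypothetical true values and predictions
truth = c(2.0, 3.5, 1.0, 4.2)
response = c(2.5, 3.0, 1.0, 4.0)
# Mean squared error and its square root
mse.val = mean((truth - response)^2)
rmse.val = sqrt(mse.val)
c(mse = mse.val, rmse = rmse.val)
```

When comparing the two results, either take the square root of the resampled value or use the same measure in both calls.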

Alternatively, learners that do not specifically treat functional covariates can
be applied. In this case the temporal structure is completely disregarded, and all
columns are treated as independent.

```{r}
# Train a normal learner on the constructed task.
# Note that we get a message, that functionals have been converted to numerics.
rpart.lrn = makeLearner("regr.rpart")
m = train(learner = rpart.lrn, task = tsk1)
m
```

## Feature Extraction

Instead of applying a learner that works on a [Task](&makeTask) containing functional features,
the [Task](&makeTask) can be converted to a normal [&Task.md].
This works by transforming the functional features into a
non-functional domain, e.g. by extracting wavelet coefficients.

The currently supported preprocessing functions are:
* discrete wavelet transform
Collaborator

note: unrelated to tutorial
It would be really nice if these preprocessing methods could be used in forecasting as well. Maybe we could construct a sub preprocessing method like createWaveletFeatures() that can be used on TimeTasks?

Collaborator Author

We kinda have that.
My proposal for the API is contained in the fda_pull1_task_featExtract branch.
I am not entirely sure, this is how it is going to be, but we can build upon that.

Collaborator Author

@pfistfl pfistfl Mar 20, 2017

I think it makes sense to keep
The currently supported preprocessing functions are:
as we can extend this list along with an example whenever a new method is added.
Or do we want a getFDAFeatureExtractors() or getFDAFeaturePreprocessingMethods()

* fast Fourier transform
* functional principal component analysis
* multi-resolution feature extraction

In order to do this, we specify a method for each functional feature in the task in a __list__.
In this case we extract the Fourier-transformed features from each __UVVIS__ functional and the FPCA scores from each __NIR__ functional. Additional arguments can be passed on to the respective extraction functions.

```{r}
# feat.methods specifies what to extract from which functional:
# Fourier-transformed features from UVVIS and FPCA scores from NIR
feat.methods = list("UVVIS" = extractFDAFourier(), "NIR" = extractFDAFPCA())

# Create a new task from the existing task
extracted = extractFDAFeatures(tsk1, feat.methods = feat.methods)
extracted
```
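
To give an intuition for the FPCA scores extracted above, the sketch below runs an ordinary principal component analysis on discretized toy curves with base R's `prcomp`. This is only an approximation of the idea under simplifying assumptions (equally spaced grid, simulated curves); it is not claimed to match the internals of the extraction function used above.

```{r}
set.seed(1)
grid = seq(0, 1, length.out = 30)
# Hypothetical curves: random mixtures of two smooth shapes plus small noise
toy.curves = t(sapply(1:40, function(i) {
  rnorm(1) * sin(2 * pi * grid) + rnorm(1, sd = 0.5) * cos(2 * pi * grid) +
    rnorm(30, sd = 0.01)
}))
pca = prcomp(toy.curves, center = TRUE, scale. = FALSE)
# Keep the scores on the first two components as new, non-functional features
scores = pca$x[, 1:2]
dim(scores)
# Share of the total variance captured by the two retained components
sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
```

Each curve with 30 measurements is replaced by two scores, which here retain nearly all of the variation because the toy curves were built from two shapes.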


### Wavelets

In this case, discrete wavelet feature transformation is applied.
We select this extraction method via _extractFDAWavelets()_ and pass additional parameters (e.g. the filter and the boundary) as arguments.
This function returns a standard regression task: the raw data contained temporal structure, but the transformed data no longer does.
For more information on wavelets, see the documentation of [wavelets](dwt).

```{r, eval = FALSE}
## Specify the feature extraction method and generate new task.
## Here, we use the Haar filter:
feat.methods = list("UVVIS" = extractFDAWavelets(filter = "haar"))
task.w = extractFDAFeatures(tsk1, feat.methods = feat.methods)

# Use the Daubechies wavelet with filter length 4.
feat.methods = list("NIR" = extractFDAWavelets(filter = "d4"))
task.wd4 = extractFDAFeatures(tsk1, feat.methods = feat.methods)
```
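
As a sketch of what a wavelet transform computes, the following base R code performs a single level of the Haar transform by hand. The helper `haar.step` is hypothetical and written for illustration only; wavelet packages may use different sign and ordering conventions and apply several levels.

```{r}
# One level of the Haar transform for a vector of even length
haar.step = function(x) {
  stopifnot(length(x) %% 2 == 0)
  odd = x[seq(1, length(x), by = 2)]
  even = x[seq(2, length(x), by = 2)]
  list(
    approx = (odd + even) / sqrt(2),  # scaled local averages (coarse shape)
    detail = (even - odd) / sqrt(2)   # scaled local differences (fine structure)
  )
}
x = c(4, 6, 10, 12, 8, 6, 5, 5)
w = haar.step(x)
w
# The transform is orthonormal, so the sum of squares is preserved
all.equal(sum(x^2), sum(w$approx^2) + sum(w$detail^2))
```

The approximation and detail coefficients together carry the same information as the original curve, but no longer need to be interpreted as neighbouring points on a time axis.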


### Fourier transformation

Now, we use the Fourier feature transformation. Either the amplitude or the phase of the complex Fourier coefficients can be used for analysis. This can be specified in the additional _trafo.coeff_ argument:

```{r, eval = FALSE}
# Specify the feature extraction method and generate new task.
# We use the Fourier features with the amplitude for NIR and the phase for UVVIS
feat.methods = list("NIR" = extractFDAFourier(trafo.coeff = "amplitude"),
  "UVVIS" = extractFDAFourier(trafo.coeff = "phase"))
task.fourier = extractFDAFeatures(tsk1, feat.methods = feat.methods)
task.fourier
```
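
To illustrate what amplitude and phase mean here, the sketch below applies base R's `fft` to a hypothetical cosine curve and recovers both quantities with `Mod` and `Arg`. The toy signal and variable names are assumptions for illustration; the sketch does not reproduce the exact features returned by the extraction function above.

```{r}
# Hypothetical curve: a cosine with 3 cycles, amplitude 2 and phase shift pi / 4
n = 64
idx = 0:(n - 1)
x = 2 * cos(2 * pi * 3 * idx / n + pi / 4)
coefs = fft(x)
amplitude = Mod(coefs)  # magnitude of each complex coefficient
phase = Arg(coefs)      # angle of each complex coefficient
# The 3-cycle component sits at position 4 (position 1 is the constant term)
which.max(amplitude[1:(n / 2)])
# Rescaling the coefficient recovers the amplitude of the cosine
amplitude[4] * 2 / n
# The angle of the coefficient recovers the phase shift
phase[4]
```

Amplitudes describe how strongly each frequency is present in a curve, while phases describe where along the axis each oscillation starts.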

### Wrappers
Additionally, we can wrap the preprocessing around a standard learner such as __classif.rpart__.
For additional information, please see the __Wrappers__ section.

```{r}
# Use a FDAFeatExtractWrapper
feat.methods = list("UVVIS" = extractFDAMultiResFeatures(), "NIR" = extractFDAFourier())
wrapped.lrn = makeExtractFDAFeatsWrapper("classif.rpart", feat.methods = feat.methods)
wrapped.lrn
```