Skip to content

Commit

Permalink
refine README
Browse files Browse the repository at this point in the history
  • Loading branch information
hoxo-m committed Jun 30, 2019
1 parent 18f3259 commit 0c2c14f
Show file tree
Hide file tree
Showing 6 changed files with 285 additions and 67 deletions.
147 changes: 117 additions & 30 deletions README.Rmd
Expand Up @@ -20,7 +20,7 @@ library(mvtnorm)

# An R Package for Density Ratio Estimation

#### *Koji MAKIYAMA (@hoxo_m)*
#### *Koji MAKIYAMA (@hoxo-m)*

<!-- badges: start -->
[![Travis-CI Build Status](https://travis-ci.org/hoxo-m/densratio.svg?branch=master)](https://travis-ci.org/hoxo-m/densratio)
Expand All @@ -32,43 +32,41 @@ library(mvtnorm)

## 1. Overview

**Density ratio estimation** is described as follows: for given two data samples `x` and `y` from unknown distributions `p(x)` and `q(y)` respectively, estimate `w(x) = p(x) / q(x)`, where `x` and `y` are d-dimensional real numbers.
**Density ratio estimation** is described as follows: for given two data samples `x1` and `x2` from unknown distributions `p(x)` and `q(x)` respectively, estimate `w(x) = p(x) / q(x)`, where `x1` and `x2` are d-dimensional real numbers.

The estimated density ratio function `w(x)` can be used in many applications such as **anomaly detection** [Hido et al. 2011], **changepoint detection** [Liu et al. 2013], and **covariate shift adaptation** [Sugiyama et al. 2007].
Other useful applications about density ratio estimation were summarized by [Sugiyama et al. 2012].

The package **densratio** provides a function `densratio()`.
The function outputs an object that has a function to estimate density ratio.
The package **densratio** provides a function `densratio()` that returns an object with a method to estimate density ratio as `compute_density_ratio()`.

For example,

```{r result, cache=TRUE}
set.seed(3)
x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)
x1 <- rnorm(200, mean = 1, sd = 1/8)
x2 <- rnorm(200, mean = 1, sd = 1/2)
library(densratio)
result <- densratio(x, y)
densratio_obj <- densratio(x1, x2)
```

The function `densratio()` estimates the density ratio of `p(x)` to `q(y)`, `w(x) = p(x)/q(y) = Norm(1, 1/8) / Norm(1, 1/2)`, and provides a function to compute estimated density ratio.
The result object has a function `compute_density_ratio()` that can compute the estimated density ratio `w_hat(x)` for any d-dimensional input `x` (now d=1).
The densratio object has a function `compute_density_ratio()` that can compute the estimated density ratio `w_hat(x)` for any d-dimensional input `x` (now d=1).

```{r compute-estimated-density-ratio, fig.width=5, fig.height=4}
new_x <- seq(0, 2, by = 0.06)
w_hat <- result$compute_density_ratio(new_x)
new_x <- seq(0, 2, by = 0.03)
w_hat <- densratio_obj$compute_density_ratio(new_x)
plot(new_x, w_hat, pch=19)
```

In this case, the true density ratio `w(x) = p(x)/q(y) = Norm(1, 1/8) / Norm(1, 1/2)` can be computed precisely.
So we can compare `w(x)` with the estimated density ratio `w_hat(x)`.
In this case, the true density ratio `w(x) = p(x)/q(y) = Norm(1, 1/8) / Norm(1, 1/2)` is known.
So we can compare `w(x)` with the estimated density ratio `w-hat(x)`.

```{r compare-true-estimate, fig.width=5, fig.height=4}
true_density_ratio <- function(x) dnorm(x, 1, 1/8) / dnorm(x, 1, 1/2)
plot(true_density_ratio, xlim=c(-1, 3), lwd=2, col="red", xlab = "x", ylab = "Density Ratio")
plot(result$compute_density_ratio, xlim=c(-1, 3), lwd=2, col="green", add=TRUE)
plot(true_density_ratio, xlim=c(0, 2), lwd=2, col="red", xlab = "x", ylab = "Density Ratio")
plot(densratio_obj$compute_density_ratio, xlim=c(0, 2), lwd=2, col="green", add=TRUE)
legend("topright", legend=c(expression(w(x)), expression(hat(w)(x))), col=2:3, lty=1, lwd=2, pch=NA)
```

Expand All @@ -83,8 +81,8 @@ install.packages("densratio")
You can also install the package from [GitHub](https://github.com/hoxo-m/densratio).

```{r eval=FALSE}
install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("hoxo-m/densratio")
install.packages("remotes") # if you have not installed "remotes" package
remotes::install_github("hoxo-m/densratio")
```

The source code for **densratio** package is available on GitHub at
Expand All @@ -93,41 +91,130 @@ The source code for **densratio** package is available on GitHub at

## 3. Details

### 3.1 Basics

The package provides `densratio()`.
The function returns an object that has a function to compute estimated density ratio.

For data samples `x1` and `x2`,

```{r eval=FALSE}
set.seed(3)
x1 <- rnorm(200, mean = 1, sd = 1/8)
x2 <- rnorm(200, mean = 1, sd = 1/2)
library(densratio)
densratio_obj <- densratio(x1, x2)
```

In this case, `densratio_obj$compute_density_ratio()` can compute estimated density ratio.

```{r basics-compute-estimated-density-ratio, fig.width=5, fig.height=4}
new_x <- seq(0, 2, by = 0.03)
w_hat <- densratio_obj$compute_density_ratio(new_x)
plot(new_x, w_hat, pch=19)
```

### 3.2 Methods

`densratio()` has `method` argument that you can pass `"uLSIF"`, `"RuSLIF"`, or `"KLIEP"`.

- **uLSIF** (unconstrained Least-Squares Importance Fitting) is the default method.
This algorithm estimates density ratio by minimizing the squared loss.
You can find more information in [Kanamori et al. 2009] and [Hido et al. 2011].

- **RuLSIF** (Relative unconstrained Least-Squares Importance Fitting).
This algorithm estimates relative density ratio by minimizing the squared loss.
You can find more information in [Yamada et al. 2011] and [Liu et al. 2013].

- **KLIEP** (Kullback-Leibler Importance Estimation Procedure).
This algorithm estimates density ratio by minimizing Kullback-Leibler divergence.
You can find more information in [Sugiyama et al. 2007].

There is a vignette for the package. For more detail, read it.
The methods assume that density ratio are represented by linear model:

- `w(x) = theta_1 * K(x, c_1) + theta_2 * K(x, c_2) + ... + theta_b * K(x, c_b)`

where

- `K(x, c) = exp(-||x - c||^2 / 2 * sigma^2)`

is the Gaussian (RBF) kernel.

`densratio()` performs the following:

- Decides kernel parameter `sigma` by cross-validation,
- Optimizes the kernel weights `theta` (in other words, find the optimal coefficients of the linear model), and
- The parameters `sigma` and `theta` are saved into `densratio` object, and are used when to compute density ratio in the call `compute_density_ratio()`.

### 3.3 Result and Arguments

```{r open-vignette, eval=FALSE}
vignette("densratio")
`densratio()` outputs the result like as follows:

```{r echo=FALSE}
densratio_obj
```

You can also find it on CRAN.
- **Kernel type** is fixed as Gaussian.
- **Number of kernels** is the number of kernels in the linear model.
You can change by setting `kernel_num` argument.
In default, `kernel_num = 100`.
- **Bandwidth (sigma)** is the Gaussian kernel bandwidth.
In default, `sigma = "auto"`, the algorithm automatically select an optimal value by cross validation.
If you set `sigma` a number, that will be used.
If you set `sigma` a numeric vector, the algorithm select an optimal value in them by cross validation.
- **Centers** are centers of Gaussian kernels in the linear model.
These are selected at random from the data sample `x1` underlying a numerator distribution `p(x)`.
You can find the whole values in `result$kernel_info$centers`.
- **Kernel Weights** are `theta` parameters in the linear kernel model.
You can find these values in `result$kernel_weights`.
- **Function to Estimate Density Ratio** is named `compute_density_ratio()`.

## 4. Multi Dimensional Data Samples

So far, the input data samples `x1` and `x2` were one dimensional.
`densratio()` allows to input multidimensional data samples as `matrix`, as long as their dimensions are the same.

For example,

```{r cache=TRUE}
library(densratio)
library(mvtnorm)
set.seed(3)
x1 <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/8, 2))
x2 <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/2, 2))
densratio_obj_2d <- densratio(x1, x2)
densratio_obj_2d
```

- [An R Package for Density Ratio Estimation](https://CRAN.R-project.org/package=densratio/vignettes/densratio.html)
In this case, as well, we can compare the true density ratio with the estimated density ratio.

## 4. Related work
```{r compare-2d, fig.width=7, fig.height=4}
true_density_ratio <- function(x) {
dmvnorm(x, mean = c(1, 1), sigma = diag(1/8, 2)) /
dmvnorm(x, mean = c(1, 1), sigma = diag(1/2, 2))
}
We have also developed a Python package for density ratio estimation.
N <- 20
range <- seq(0, 2, length.out = N)
input <- expand.grid(range, range)
w_true <- matrix(true_density_ratio(input), nrow = N)
w_hat <- matrix(densratio_obj_2d$compute_density_ratio(input), nrow = N)
- https://github.com/hoxo-m/densratio_py
par(mfrow = c(1, 2))
contour(range, range, w_true, main = "True Density Ratio")
contour(range, range, w_hat, main = "Estimated Density Ratio")
```

The package is available on PyPI (Python Package Index).
## 5. Related work

- https://pypi.python.org/pypi/densratio
- A Python Package for Density Ratio Estimation
- https://pypi.org/project/densratio/
- APPEstimation: Adjusted Prediction Model Performance Estimation
- https://cran.r-project.org/package=APPEstimation

## 5. References
## References

- Hido, S., Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori.
**Statistical outlier detection using direct density ratio estimation.**
Expand Down

0 comments on commit 0c2c14f

Please sign in to comment.