Releases: jlmelville/rnndescent
rnndescent 0.1.6
rnndescent 0.1.6
- This is a patch release to fix an issue that occurred with the recent release of a new version of 'dqrng' on CRAN. Unfortunately, version 0.1.5 didn't quite work out. A fix for incorrect behavior for some metrics is also included (see below).
Bug fixes and minor improvements
- Building an index with the
"cosine"
,"jaccard"
or"hellinger"
metrics could give incorrect results if maximally-dissimilar items were in the nearest neighbors. Thank you to Maciej Beręsewicz for the report (#14).
CRAN release 0.1.5
This is a minor release to change an internal API to support an upcoming release for dqrng. See daqana/dqrng#80. Thank you to Ralf Stubner for the report and a patch.
v0.1.4
rnndescent 0.1.4
New features
- New parameter: for
nnd_knn
andrnnd_build
:weight_by_degree
. If set toTRUE
, then the candidate list in nearest neighbor descent is weighted in favor of low-degree items, which should make for a more diverse local join. There is a minor increase in computation but also a minor increase in accuracy.
Bug fixes and minor improvements
rnnd_build
generated an error when preparing the search graph for some metrics (notablycosine
andjaccard
).- Fix a factor of 2 error in the TS-SS metric. This does not affect the returned neighbors, just the distances. Thank you to reporter
Henry Linder (#8). - New parameter for
prepare_search_graph
:use_alt_metric
. This behaves like the existinguse_alt_metric
parameters in other functions and may speed up the index preparation step in e.g.rnnd_build
. - New parameter for
rnnd_build
andprepare_search_graph
:prune_reverse
. If set toTRUE
the reverse graph will be pruned using the
pruning_degree_multiplier
parameter before any diversification. This can help prevent an excessive amount of time being spent in the diversification step in the case where an item has a large number of neighbors (in the reverse graph this can be as large as the number of items in the dataset). Pruning of the merged graph still occurs, so this is an additional pruning step. This should have little effect on search results, but for backwards compatibility, the default isFALSE
.
CRAN release 0.1.3
This is a patch release to fix some ASAN/UBSAN errors that occurred due to the sentinel values not being correctly cast to unsigned integral types when missing values were in the k-nearest neighbor graph (and also one case of insufficient input validation).
CRAN release 0.1.1
This is the first CRAN release. There are no significant changes compared to the last pre-release version, just minor documentation changes.
rnndescent 0.0.16
rnndescent 0.0.16
Breaking changes
rnnd_build
now always prepares the search graph.- The
rnnd_prepare
function has been removed. The option to not prepare the search graph during index building only made sense if you were only interested in the k-nearest neighbor graph. Now thatrnnd_knn
exists for that purpose (see below), the logic of index building has been substantially simplified. - The
nn_to_sparse
function has been removed. - The
merge_knn
function has been removed, andmerge_knnl
has been renamed tomerge_knn
. If you were running e.g.merge_knn(nn1, nn2)
, you must now usemerge_knn(list(nn1, nn2))
. Also the parameternn_graphs
has been renamedgraphs
.
New features
- New function:
rnnd_knn
. Behaves a lot likernnd_build
, but only returns the knn graph with no index built. The index can be very large in size for high dimensional or large datasets, so this function is useful if you only care about the knn graph and won't ever want to query new data. - New function:
neighbor_overlap
. Measures the overlap of two knn graphs via their shared indices. A similar function was used extensively in some vignettes so it may have sufficient utility to be useful to others. - New parameter for
rnnd_query
andgraph_knn_query
:max_search_fraction
. This parameter controls the maximum number of nodes that can be searched during querying. If the number of nodes searched exceeds this fraction of the total number of nodes in the graph, the search will be terminated. This can be used in combination withepsilon
to avoid excessive search times.
Bug fixes and minor improvements
- The sparse
spearmanr
distance has been fixed. - During tree-building with
n_threads = 0
, progress/interrupt monitoring was not occurring. - You can provide a user-defined graph to the
init
parameter ofrnnd_query
. rnnd_query
: ifverbose = TRUE
, a summary of the min, max and average number of distance queries will be logged. This can help tuneepsilon
andmax_search_fraction
.
rnndescent 0.0.15
rnndescent 0.0.15
Breaking Changes
- Standalone distance functions have been removed. They hadn't expanded to match all the distances available in the nearest neighbor functions, nor was sparse support added. Doing so would increase the size of this package's API even further. They may show up in another package.
- The
local_scale_nn
has been removed, for similar reasons to the removal of the standalone distance functions. It remains in thelocalscale
branch of the github repo. - The search graph returned from
prepare_search_graph
is now transposed. This prevents having to repeatedly transpose inside every call tograph_knn_query
if multiple queries are being made. You will need to either regenerate any saved search graphs or transpose them withMatrix::t(search_graph)
.
New features
- New functions:
rnnd_build
,rnnd_query
andrnnd_prepare
. These functions streamline the process of building a k-nearest neighbor graph, preparing a search graph and querying it.
rnndescent 0.0.14
rnndescent 0.0.14
Breaking Changes
-
The
bhamming
metric no longer exists as a specialized metric. Instead, if you pass alogical
matrix todata
,reference
orquery
parameter (depending on the function) and specifymetric = "hamming"
you will automatically get the binary-specific version of the hamming metric. -
The
hamming
andbhamming
metrics are now normalized with respect to the number of features, to be consistent with the other binary-style metrics (and PyNNDescent). If you need the old distances, multiply the distance matrix by the number of columns, e.g. do something like:res <- brute_force_knn(X, metric = "hamming") res$dist <- res$dist * ncol(X)
-
The metric
l2sqr
has been renamedsqeuclidean
to be consistent with PyNNDescent.
New features
- Metrics? We got 'em! The
metric
parameter now accepts a much larger number of metrics. See the rdoc for the full list of supported metrics. Currently, most of the metrics from PyNNDescent which don't require extra parameters are supported. The number of specialized binary metrics has also been expanded. - New parameter for
rpf_knn
andrpf_build
:max_tree_depth
this controls the depth of the tree and was set to 100 internally. This default has been doubled to 200 and can now be user-controlled. Ifverbose = TRUE
and the largest leaf in the forest exceeds theleaf_size
parameter, a message warning you about this will be logged and indicates that the maximum tree depth has been exceeded. Increasingmax_tree_depth
may not be the answer: it's more likely there is something unusual about the distribution of the distances in your dataset and a random initialization might be a better use of your time.
rnndescent 0.0.13
rnndescent 0.0.13
New features
- Sparse data is now supported. Pass a
dgCMatrix
to thedata
,reference
orquery
parameters where you would usually use a dense matrix or data frame.cosine
,euclidean
,manhattan
,hamming
andcorrelation
are all available, but alternative versions in the dense case, e.g.cosine-preprocess
or the binary-specificbhamming
for dense data is not. - A new
init
option forgraph_knn_query
: you can now pass an RP forest and initialize with that, e.g. fromrpf_build
, or by settingret_forest = TRUE
onnnd_knn
orrpf_knn
. You may want to cut down the size of the forest used for initialization withrpf_filter
first, though (a single tree may be enough). This will also use the metric data in the forest, so settingmetric
(oruse_alt_metric
) in the function itself will be ignored.
Bug fixes and minor improvements
- If the knn graph you pass to
prepare_search_graph
or tograph_knn_query
contains missing data, this will no longer cause an error (it still might not be the best idea though).
rnndescent 0.0.12
New features
- New function:
rpf_knn
. Calculates the approximate k-nearest neighbors using a random partition forest. - New function:
rpf_build
. Builds a random partition forest. - New function:
rpf_knn_query
. Queries a random partition forest (built withrpf_build
to find the approximate nearest neighbors for the query points. - New function:
rpf_filter
. Retains only the best "scoring" trees in a forest, where each tree is scored based on how well it reproduces a given knn. - New initialization method for
nnd_knn
:init = "tree"
. Uses the RP Forest initialization method. - New parameter for
nnd_knn
:ret_forest
. Returns the search forest used ifinit = "tree"
so it can be used for future searching or filtering. - New parameter for
nnd_knn
:init_opts
. Options that can be passed to the RP forest initialization (same as inrpf_knn
).