The Turi Create nearest neighbors
toolkit
is used to find the rows in a data table that are most similar to a
query row. This is a two-stage process, analogous to many other Turi
Create toolkits. First we create a
NearestNeighborsModel,
using a reference dataset contained in an
SFrame.
Next we query the model, using either the query
or the
similarity_graph
method. Each of these methods is explained further
below.
For this chapter we use an example dataset of house attributes and prices:
import turicreate as tc
sf = tc.SFrame.read_csv('houses.csv')
sf.head(5)
+------+---------+------+--------+------+-------+
| tax | bedroom | bath | price | size | lot |
+------+---------+------+--------+------+-------+
| 590 | 2 | 1.0 | 50000 | 770 | 22100 |
| 1050 | 3 | 2.0 | 85000 | 1410 | 12000 |
| 20 | 3 | 1.0 | 22500 | 1060 | 3500 |
| 870 | 2 | 2.0 | 90000 | 1300 | 17500 |
| 1320 | 3 | 2.0 | 133000 | 1500 | 30000 |
+------+---------+------+--------+------+-------+
[5 rows x 6 columns]
Because the features in this dataset have very different scales (e.g. price is in the hundreds of thousands while the number of bedrooms is in the single digits), it is important to normalize the features. In this example we standardize so that each feature is measured in terms of standard deviations from the mean (see Wikipedia for more detail). In addition, both reference and query datasets may have a column with row labels, but for this example we let the model default to using row indices as labels.
for c in sf.column_names():
sf[c] = (sf[c] - sf[c].mean()) / sf[c].std()
First, we create a nearest neighbors model. We can list specific features to use in our distance computations, or default to using all features in the reference SFrame. In the model summary below the following code snippet, note that there are three features, because our second command specifies three numeric SFrame columns as features for the model. There are also three unpacked features, because each feature is in its own column.
model = tc.nearest_neighbors.create(sf)
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'])
model.summary()
Class : NearestNeighborsModel
Attributes
----------
Distance : euclidean
Method : ball tree
Number of examples : 15
Number of feature columns : 3
Number of unpacked features : 3
Total training time (seconds) : 0.0091
Ball Tree Attributes
--------------------
Tree depth : 1
Leaf size : 1000
To retrieve the five closest neighbors for new data points or a subset of
the original reference data, we query the model with the query
method.
Query points must also be contained in an SFrame, and must have columns with the
same names as those used to construct the model (additional columns are allowed,
but ignored). The result of the query
method is an SFrame with four columns:
query label, reference label, distance, and rank of the reference point among
the query point's nearest neighbors.
knn = model.query(sf[:5], k=5)
knn.head()
+-------------+-----------------+----------------+------+
| query_label | reference_label | distance | rank |
+-------------+-----------------+----------------+------+
| 0 | 0 | 0.0 | 1 |
| 0 | 5 | 0.100742954001 | 2 |
| 0 | 7 | 0.805943632008 | 3 |
| 0 | 10 | 1.82070683014 | 4 |
| 0 | 2 | 1.83900997922 | 5 |
| 1 | 1 | 0.0 | 1 |
| 1 | 8 | 0.181337317202 | 2 |
| 1 | 4 | 0.181337317202 | 3 |
| 1 | 11 | 0.322377452803 | 4 |
| 1 | 12 | 0.705200678007 | 5 |
+-------------+-----------------+----------------+------+
[10 rows x 4 columns]
In some cases the query dataset is the reference dataset. For this task of
constructing the similarity_graph on the reference data, the model's
similarity_graph
can be used. For brute force models it can be almost twice as
fast, depending on the data sparsity and chosen distance function. By default,
the similarity_graph
method returns an
SGraph
whose vertices are the rows of the reference dataset and whose edges indicate a
nearest neighbor match. Specifically, the destination vertex of an edge is a
nearest neighbor of the source vertex. similarity_graph
can also return
results in the same form as the query
method if so desired.
sim_graph = model.similarity_graph(k=3)
The most critical choice in computing nearest neighbors is the distance function that measures the dissimilarity between any pair of observations.
For numeric data, the options are euclidean
, manhattan
,
cosine
, and transformed_dot_product.
For data in dictionary
format (i.e. sparse data), jaccard
and weighted_jaccard
are also
options, in addition to the numeric distances. For string features, use
levenshtein
distance, or use the text analytics toolkit's
count_ngrams
feature to convert strings to dictionaries of words or
character shingles, then use Jaccard or weighted Jaccard distance.
Leaving the distance parameter set to its default value of auto
tells the model to choose the most reasonable distance based on the type
of features in the reference data. In the following output cell, the
second line of the model summary confirms our choice of Manhattan
distance.
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],
distance='manhattan')
model.summary()
Class : NearestNeighborsModel
Attributes
----------
Distance : manhattan
Method : ball tree
Number of examples : 15
Number of feature columns : 3
Number of unpacked features : 3
Total training time (seconds) : 0.013
Ball Tree Attributes
--------------------
Tree depth : 1
Leaf size : 1000
Distance functions are also exposed in the turicreate.distances
module. This allows us not only to specify the distance argument for a
nearest neighbors model as a distance function (rather than a string),
but also to use that function for any other purpose.
In the following snippet we use a nearest neighbors model to find the closest reference points to the first three rows of our dataset, then confirm the results by computing a couple of the distances manually with the Manhattan distance function.
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],
distance=tc.distances.manhattan)
knn = model.query(sf[:3], k=3)
knn.print_rows()
sf_check = sf[['bedroom', 'bath', 'size']]
print "distance check 1:", tc.distances.manhattan(sf_check[2], sf_check[10])
print "distance check 2:", tc.distances.manhattan(sf_check[2], sf_check[14])
+-------------+-----------------+-----------------+------+
| query_label | reference_label | distance | rank |
+-------------+-----------------+-----------------+------+
| 0 | 0 | 0.0 | 1 |
| 0 | 5 | 0.100742954001 | 2 |
| 0 | 7 | 0.805943632008 | 3 |
| 1 | 1 | 0.0 | 1 |
| 1 | 8 | 0.181337317202 | 2 |
| 1 | 4 | 0.181337317202 | 3 |
| 2 | 2 | 0.0 | 1 |
| 2 | 10 | 0.0604457724006 | 2 |
| 2 | 14 | 1.61656820464 | 3 |
+-------------+-----------------+-----------------+------+
[9 rows x 4 columns
distance check 1: 0.0604457724006
distance check 2: 1.61656820464
Turi Create also allows composite distances, which allow the nearest neighbors tool (and other distance-based tools) to work with features that have different types. Specifically, a composite distance is a weighted sum of standard distances applied to subsets of features, specified in the form of a Python list. Each element of a composite distance list contains three things:
- a list or tuple of feature names
- a standard distance name
- a multiplier for the standard distance.
In our house price dataset, for example, suppose we want to measure the difference between numbers of bedrooms and baths with Manhattan distance and the difference between house and lot sizes with Euclidean distance. In addition, we want the Euclidean component to carry twice as much weight. The composite distance for this would be:
my_dist = [[['bedroom', 'bath'], 'manhattan', 1],
[['size', 'lot'], 'euclidean', 2]]
This list can be passed to the distance
parameter just like a standard
distance function name or handle.
model = tc.nearest_neighbors.create(sf, distance=my_dist)
model.summary()
Class : NearestNeighborsModel
Attributes
----------
Method : brute_force
Number of distance components : 2
Number of examples : 15
Number of feature columns : 4
Number of unpacked features : 4
Total training time (seconds) : 0.0017
If we specify the distance parameter as auto
, a composite distance
is created where each type of feature is paired with the most
appropriate distance function. Please see the documentation for the
Turi Create distances
module
for more on composite distances.
Another important choice in model creation is the method. The
brute_force
method computes the distance between a query point and each of
the reference points, with a run time linear in the number of reference points.
Creating a model with the ball_tree
method takes more time, but leads to
much faster queries by partitioning the reference data into successively smaller
balls and searching only those that are relatively close to the query. The
default method is auto
which chooses a reasonable method based on both the
feature types and the selected distance function. The method parameter is also
specified when the model is created. The third row of the model summary confirms
our choice to use the ball tree in the next example.
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],
method='ball_tree', leaf_size=5)
model.summary()
Class : NearestNeighborsModel
Attributes
----------
Method : ball_tree
Number of distance components : 1
Number of examples : 15
Number of feature columns : 3
Number of unpacked features : 3
Total training time (seconds) : 0.0253
Ball Tree Attributes
--------------------
Tree depth : 3
Leaf size : 5
If the ball tree is used, it's important to choose an appropriate value for the
'leaf_size' parameter, which controls how many observations are stored in each
leaf of the ball tree. By default, this is set so that the tree is no more than
12 levels deep, but larger or smaller values may lead to quicker queries
depending on the shape and dimension of the data. Our houses example only has 15
rows, so the leaf_size
parameter (and the ball_tree
method for that
matter) are not too useful, but for illustration purposes we set the leaf size
to 5 above.
The reference dataset that is used to create the nearest neighbors model cannot have missing data. Please use the SFrame.fillna and SFrame.dropna utilities to preprocess missing values before creating a nearest neighbors model.
On the other hand, data passed to the query
method can have missing
data. For numeric columns, missing values are imputed to be the mean of
the corresponding column in the reference dataset used to create the
model. Missing values in string columns are imputed as empty strings.