
L1 versus L2

chrisjmccormick edited this page Mar 16, 2018 · 1 revision

The most fundamental operation performed by the Nearist vector search acceleration (VSX) hardware is the large-scale comparison of feature vectors. Vectors are compared using distance calculations—the distance between two vectors is taken as a measure of their similarity.

There are many vector distance metrics that are commonly used depending on the data type. One of the most common is the Euclidean distance or “L2” distance metric. L2 is the straight line distance between two points—it’s what we most naturally think of as “distance”.

Nearist hardware uses the Manhattan or “L1” distance to approximate L2. Here we'll explain the difference between the two metrics, explain why Nearist hardware uses L1, and show some experimental results on how L1 affects accuracy in a few classification tasks.

Definitions

The Euclidean or L2 distance is the straight line distance between two points. It’s calculated as the “square root of the sum of squared distances”. For two vectors a and b of length n, the formula is:

L2(a, b) = sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2 )

The Manhattan or L1 distance is a little different. It’s called the Manhattan distance (or, even more directly, the “taxicab” or “city block” distance) because it can be used to calculate the distance a taxi must travel in Manhattan to arrive at its destination given the grid-like structure of the city blocks.

2D illustration of L1 distance versus L2

L1 is calculated as the “sum of absolute differences” (leading to yet another name— “SAD”).

L1(a, b) = |a_1 - b_1| + |a_2 - b_2| + ... + |a_n - b_n|
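As a concrete illustration of the two formulas, here is a minimal NumPy sketch (not the Nearist hardware implementation) computing both distances for a pair of 2D points:

```python
import numpy as np

def l2_distance(a, b):
    # Square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def l1_distance(a, b):
    # Sum of absolute differences ("SAD").
    return np.sum(np.abs(a - b))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(l2_distance(a, b))  # 5.0 -- the familiar 3-4-5 right triangle
print(l1_distance(a, b))  # 7.0 -- 3 blocks east plus 4 blocks north
```

Note how the "taxicab" intuition shows up directly: L1 travels along the axes, so it is always greater than or equal to the straight-line L2 distance.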

Implementation in Hardware

It’s pretty clear from the formulas which of these two metrics is computationally simpler. The L1 distance only requires addition, subtraction, and the absolute value, whereas the L2 also requires multiplication.

Side Note: What about the square root operation?

In similarity search applications it’s not necessary to perform the square root operation because it does not change the ordering of the results. Most implementations of L2 similarity search therefore use the “squared L2 distance”, meaning they omit the square root step.
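The square root can be dropped because it is a monotonically increasing function: if one squared distance is smaller than another, it stays smaller after the square root. A quick sanity check on some randomly generated vectors (hypothetical data, for illustration only):

```python
import numpy as np

np.random.seed(0)
query = np.random.rand(16)        # one query vector
database = np.random.rand(100, 16)  # 100 database vectors

squared_l2 = np.sum((database - query) ** 2, axis=1)
true_l2 = np.sqrt(squared_l2)

# The nearest-neighbor ranking is identical with or without the sqrt.
assert np.array_equal(np.argsort(squared_l2), np.argsort(true_l2))
print("Orderings match.")
```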

This means an L1 distance calculator requires significantly fewer transistors to implement than an L2 calculator. Since the L1 calculators are smaller, we can include more of them in our hardware, further increasing the available parallelism and performance for similarity search applications.

Experimental Results

So how good an approximation is L1 to L2? We have applied both distance metrics to a couple of classification benchmarks to measure their accuracy. We use k-Nearest Neighbor classification for these experiments since k-NN relies entirely on distance calculations to perform the classifications.

We have results on two common benchmarks:

  • MNIST – The popular hand-written digit dataset with 10 categories (one for each digit 0 - 9). We are using feature vector representations of the images, extracted with a convolutional neural network from a TensorFlow tutorial.
    • 55,000 training vectors and 10,000 test vectors
    • 1,024 features per vector
  • Reuters – A news article classification task, based partly on a scikit-learn example. This is a two-category classification problem: articles labeled as relating to "mergers and acquisitions" form the positive class, and all other articles form the negative class. The Reuters dataset contains 19,621 articles with category information. We took an 80/20 split of the dataset for training and test, respectively. For feature extraction, we applied Latent Semantic Indexing (LSI) with 300 topics.
    • 15,696 training vectors and 3,925 test vectors
    • 300 features per vector

Classification of the test vectors is performed with a k-NN search against the training vectors. We show results with both k=1 (assign the class of the nearest training sample) and k=10 (assign the label of the class most represented in the top 10 results).
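The experimental setup can be sketched in plain NumPy. The code below is a toy illustration on made-up, well-separated data (not the actual MNIST or Reuters features): it runs k-NN classification with both L1 and squared L2 and compares the predictions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1, metric="l2"):
    # Pairwise differences: shape (n_test, n_train, n_features).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    if metric == "l1":
        dists = np.abs(diffs).sum(axis=2)   # sum of absolute differences
    else:
        dists = (diffs ** 2).sum(axis=2)    # squared L2 (sqrt omitted)
    # Labels of the k nearest training vectors, then a majority vote.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical toy data: two well-separated classes in 8 dimensions.
rng = np.random.RandomState(0)
X_train = np.vstack([rng.rand(50, 8), rng.rand(50, 8) + 2.0])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.rand(10, 8), rng.rand(10, 8) + 2.0])
y_true = np.array([0] * 10 + [1] * 10)

for metric in ("l1", "l2"):
    preds = knn_predict(X_train, y_train, X_test, k=10, metric=metric)
    print(metric, (preds == y_true).mean())  # both metrics score 1.0 on this easy set
```

On real, harder data the two metrics can disagree on borderline samples, which is exactly what the benchmark results below measure.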

Side Note: Floating point vs. integers

Nearist hardware uses integer components in place of floating point to further simplify the calculators and increase the number we can implement in the available gates. However, for these experiments we use floating point values for both L2 and L1, so that any difference in accuracy is due solely to the difference in distance metric. The impact of the conversion to integer data is explored in a later post.

Accuracy of L1 versus L2 on the MNIST dataset

The difference in accuracy is negligible for the MNIST dataset, with L1 distance actually getting a few more correct classifications out of the total 10,000 test samples.

Accuracy of L1 versus L2 on the Reuters dataset

For the Reuters dataset, the difference appears to depend more on the value of 'k'. For 1-NN classification, the L1 distance has 22 more incorrect classifications (out of ~4,000) versus L2. But for 10-NN, the L1 distance gets 7 more correct than L2.

Conclusions

The L1 distance actually proved to be more accurate than L2 in 3 out of 4 of the above experiments (counting both datasets and both values of k). This is a relatively small sample of tasks and, of course, it's probably wise to perform experiments on your own data to measure the impact. But based on results like these, we at Nearist feel that the additional acceleration achieved by using the L1 distance is well worth the potential small variation in accuracy caused by the different distance metric.
