SparseVector #3

Goutte · 2016-05-13T00:21:52Z

Your yummy test-suite has spurred me to try my hand at making a quick'n dirty implementation of a sparse list using not a dictionary of keys, but linked lists : two numpy.ndarrays internally, one of integers for the keys, and one for the values, whose dtype is left to the choice of the user.

If this works, I'll probably make a sparse_vector lib, unless you think this lib should hold multiple implementations. (my inner sloth hopes that you do)

I called it vector because it's specialized for numerical values.

The idea is to make the init with initial values and the setting via list slicing (see PR #2) both faster.

I tried scipy.sparse and pandas.Series but they don't fit my immediate needs.

My constraints are :
1. Python only
2. Performance
3. Available now :)

I just want a simple sparse 1D numerical vector that works well with numpy.

I guess scipy.sparse should make separate classes for 1D, 2D, 3D, and ND.
They only support 2D right now (and therefore 1D), with classes such as scipy.sparse.lil_matrix, which is also a linked list implementation.

But the API does not feel right for vectors, look at the instantiation :

# data = [6.28, 1.62, 2.72, 2.675]
# indices = [42, 1e9, 6e6, 1]  # not sorted (constraint)
vector = lil_matrix(
    (values, (np.zeros(len(indices)), indices)),  # data, (rowinds, colsinds)
    shape=(1, 1e16)
)

Besides having to np.zeros(len(indices)) to fulfill the 2D signature, there's no settable default value,
which is zero. (and I need another)

And there's a bunch of 2D-only "features" such as :

vector[0] # same object as vector, shape (1, 1e16)
# it somewhat makes sense in 2D, but not in 1D

I have not tried sympy.matrices.sparse.SparseMatrix yet.

Anyhow, I'm running benchmarks with sparse_vector :

import benchmark
import numpy as np

from sparse_list import SparseList
from sparse_vector import SparseVector


class BenchmarkSparse(benchmark.Benchmark):

    each = 100          # number of runs
    full_size = 100000  # total size of sparse lists and vectors
    data_size = 1000    # actual data size of sparse lists and vectors

    def setUp(self):
        # Can also specify tearDown, eachSetUp, and eachTearDown
        self.a = np.random.rand(self.data_size)
        self.k = np.arange(self.data_size)

    def test_sparse_list_init(self):
        sl = SparseList(self.full_size, default_value=0)

    def test_sparse_vector_init(self):
        sv = SparseVector(self.full_size, default_value=0)

    def test_sparse_list_set_with_iterables_in_slice(self):
        sl = SparseList(self.full_size, default_value=0)
        sl[self.k] = self.a

    def test_sparse_list_set_with_iterables_on_init(self):
        sl = SparseList(dict(zip(self.k, self.a)), default_value=0)
        sl.size = self.full_size

    def test_sparse_vector_set_with_iterables_in_slice(self):
        sv = SparseVector(self.full_size, default_value=0)
        sv[self.k] = self.a

    def test_sparse_vector_set_with_iterables_on_init(self):
        sv = SparseVector((self.k, self.a), default_value=0)
        sv.size = self.full_size


if __name__ == '__main__':
    benchmark.main(format="markdown", numberFormat="%.4g")

which right now yields :

Benchmark Report
================

BenchmarkSparse
---------------

                                     name | rank | runs |      mean |        sd | timesBaseline
------------------------------------------|------|------|-----------|-----------|--------------
                         sparse list init |    1 |  100 | 1.771e-06 | 1.344e-06 |           1.0
                       sparse vector init |    2 |  100 | 4.199e-06 | 1.877e-06 | 2.37012113055
 sparse vector set with iterables on init |    3 |  100 | 1.531e-05 |  7.01e-06 | 8.64468371467
   sparse list set with iterables on init |    4 |  100 | 0.0004169 | 8.095e-05 | 235.347240915
  sparse list set with iterables in slice |    5 |  100 | 0.0004247 | 7.585e-05 | 239.769851952
sparse vector set with iterables in slice |    6 |  100 |  0.008322 | 0.0005945 | 4698.04441454

Not optimal yet, but it's still quick'n dirty, I can clean that up some and maybe start comparing with lil_matrix too.

The text was updated successfully, but these errors were encountered:

Goutte · 2016-05-13T17:48:23Z

I managed to make setting with iterable slices twice as fast when the indices are already in the sparse vector (which is my real-world use-case):

                                             name | rank | runs |     mean |        sd | timesBaseline
--------------------------------------------------|------|------|----------|-----------|--------------
sparse vector set with iterables in slice present |    1 | 1000 | 0.002095 | 0.0002696 |           1.0
          sparse list set with iterables in slice |    2 | 1000 | 0.004443 | 0.0004396 | 2.12089296511
 sparse vector set with iterables in slice absent |    3 | 1000 | 0.007088 | 0.0005914 | 3.38362551308

But it's still slower when the indices are not yet present in the vector.

Goutte · 2016-05-13T23:50:38Z

Optimized some more by leveraging numpy's internals for things like densification :

BenchmarkDensify
----------------

          name | rank | runs |    mean |       sd | timesBaseline
---------------|------|------|---------|----------|--------------
vector densify |    1 |   10 | 0.07599 | 0.002629 |           1.0
  list densify |    2 |   10 |   1.918 |  0.01043 | 25.2395356579

BenchmarkGet
------------

                             name | rank | runs |   mean |       sd | timesBaseline
----------------------------------|------|------|--------|----------|--------------
  list get with iterable in slice |    1 |   10 | 0.0178 | 0.001255 |           1.0
vector get with iterable in slice |    2 |   10 | 0.0961 | 0.001629 | 5.39947356436

BenchmarkInit
-------------

                   name | rank | runs |      mean |        sd | timesBaseline
------------------------|------|------|-----------|-----------|--------------
              list init |    1 |   10 | 1.717e-06 | 1.047e-06 |           1.0
            vector init |    2 |   10 | 5.603e-06 | 4.342e-06 | 3.26388888889
vector init with values |    3 |   10 | 3.209e-05 | 1.347e-05 | 18.6944444444
  list init with values |    4 |   10 |  0.004974 |  0.001387 | 2897.43055556

BenchmarkSet
------------

                                      name | rank | runs |     mean |        sd | timesBaseline
-------------------------------------------|------|------|----------|-----------|--------------
vector set with iterables in slice present |    1 |   10 | 0.002003 | 3.217e-05 |           1.0
          list set with iterables in slice |    2 |   10 | 0.004546 | 0.0007072 | 2.26944226237
 vector set with iterables in slice absent |    3 |   10 | 0.006572 | 9.501e-05 | 3.28099930967

I also found out the hard way that isinstance(1e6, int) is False. (it's a float)
I tweaked SparseVector accordingly, but i guess there should be a fix for SparseList too.

Goutte · 2016-05-30T11:29:26Z

I went ahead and published sparse_vector on pypi.
It's overall faster than sparse_list, but it's not overly optimized either.

johnsyweb · 2016-06-01T10:28:06Z

Hi @Goutte! Sorry I missed this for so long. Great work on creating sparse_vector!

Goutte mentioned this issue May 14, 2016

PARTIAL files and RAM usage healpy/healpy#339

Closed

johnsyweb closed this as completed Nov 11, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SparseVector #3

SparseVector #3

Goutte commented May 13, 2016 •

edited

Loading

Goutte commented May 13, 2016 •

edited

Loading

Goutte commented May 13, 2016

Goutte commented May 30, 2016

johnsyweb commented Jun 1, 2016

SparseVector #3

SparseVector #3

Comments

Goutte commented May 13, 2016 • edited Loading

Goutte commented May 13, 2016 • edited Loading

Goutte commented May 13, 2016

Goutte commented May 30, 2016

johnsyweb commented Jun 1, 2016

Goutte commented May 13, 2016 •

edited

Loading

Goutte commented May 13, 2016 •

edited

Loading