Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SparseVector #3

Closed
Goutte opened this issue May 13, 2016 · 4 comments
Closed

SparseVector #3

Goutte opened this issue May 13, 2016 · 4 comments

Comments

@Goutte
Copy link

Goutte commented May 13, 2016

Your yummy test-suite has spurred me to try my hand at making a quick'n dirty implementation of a sparse list using not a dictionary of keys, but linked lists : two numpy.ndarrays internally, one of integers for the keys, and one for the values, whose dtype is left to the choice of the user.

If this works, I'll probably make a sparse_vector lib, unless you think this lib should hold multiple implementations. (my inner sloth hopes that you do)

I called it vector because it's specialized for numerical values.

The idea is to make the init with initial values and the setting via list slicing (see PR #2) both faster.

I tried scipy.sparse and pandas.Series but they don't fit my immediate needs.

My constraints are :
1. Python only
2. Performance
3. Available now :)

I just want a simple sparse 1D numerical vector that works well with numpy.

I guess scipy.sparse should make separate classes for 1D, 2D, 3D, and ND.
They only support 2D right now (and therefore 1D), with classes such as scipy.sparse.lil_matrix, which is also a linked list implementation.

But the API does not feel right for vectors, look at the instantiation :

# data = [6.28, 1.62, 2.72, 2.675]
# indices = [42, 1e9, 6e6, 1]  # not sorted (constraint)
vector = lil_matrix(
    (values, (np.zeros(len(indices)), indices)),  # data, (rowinds, colsinds)
    shape=(1, 1e16)
)

Besides having to np.zeros(len(indices)) to fulfill the 2D signature, there's no settable default value,
which is zero. (and I need another)

And there's a bunch of 2D-only "features" such as :

vector[0] # same object as vector, shape (1, 1e16)
# it somewhat makes sense in 2D, but not in 1D

I have not tried sympy.matrices.sparse.SparseMatrix yet.


Anyhow, I'm running benchmarks with sparse_vector :

import benchmark
import numpy as np

from sparse_list import SparseList
from sparse_vector import SparseVector


class BenchmarkSparse(benchmark.Benchmark):

    each = 100          # number of runs
    full_size = 100000  # total size of sparse lists and vectors
    data_size = 1000    # actual data size of sparse lists and vectors

    def setUp(self):
        # Can also specify tearDown, eachSetUp, and eachTearDown
        self.a = np.random.rand(self.data_size)
        self.k = np.arange(self.data_size)

    def test_sparse_list_init(self):
        sl = SparseList(self.full_size, default_value=0)

    def test_sparse_vector_init(self):
        sv = SparseVector(self.full_size, default_value=0)

    def test_sparse_list_set_with_iterables_in_slice(self):
        sl = SparseList(self.full_size, default_value=0)
        sl[self.k] = self.a

    def test_sparse_list_set_with_iterables_on_init(self):
        sl = SparseList(dict(zip(self.k, self.a)), default_value=0)
        sl.size = self.full_size

    def test_sparse_vector_set_with_iterables_in_slice(self):
        sv = SparseVector(self.full_size, default_value=0)
        sv[self.k] = self.a

    def test_sparse_vector_set_with_iterables_on_init(self):
        sv = SparseVector((self.k, self.a), default_value=0)
        sv.size = self.full_size


if __name__ == '__main__':
    benchmark.main(format="markdown", numberFormat="%.4g")

which right now yields :

Benchmark Report
================

BenchmarkSparse
---------------

                                     name | rank | runs |      mean |        sd | timesBaseline
------------------------------------------|------|------|-----------|-----------|--------------
                         sparse list init |    1 |  100 | 1.771e-06 | 1.344e-06 |           1.0
                       sparse vector init |    2 |  100 | 4.199e-06 | 1.877e-06 | 2.37012113055
 sparse vector set with iterables on init |    3 |  100 | 1.531e-05 |  7.01e-06 | 8.64468371467
   sparse list set with iterables on init |    4 |  100 | 0.0004169 | 8.095e-05 | 235.347240915
  sparse list set with iterables in slice |    5 |  100 | 0.0004247 | 7.585e-05 | 239.769851952
sparse vector set with iterables in slice |    6 |  100 |  0.008322 | 0.0005945 | 4698.04441454

Not optimal yet, but it's still quick'n dirty, I can clean that up some and maybe start comparing with lil_matrix too.

@Goutte
Copy link
Author

Goutte commented May 13, 2016

I managed to make setting with iterable slices twice as fast when the indices are already in the sparse vector (which is my real-world use-case):

                                             name | rank | runs |     mean |        sd | timesBaseline
--------------------------------------------------|------|------|----------|-----------|--------------
sparse vector set with iterables in slice present |    1 | 1000 | 0.002095 | 0.0002696 |           1.0
          sparse list set with iterables in slice |    2 | 1000 | 0.004443 | 0.0004396 | 2.12089296511
 sparse vector set with iterables in slice absent |    3 | 1000 | 0.007088 | 0.0005914 | 3.38362551308

But it's still slower when the indices are not yet present in the vector.

@Goutte
Copy link
Author

Goutte commented May 13, 2016

Optimized some more by leveraging numpy's internals for things like densification :

BenchmarkDensify
----------------

          name | rank | runs |    mean |       sd | timesBaseline
---------------|------|------|---------|----------|--------------
vector densify |    1 |   10 | 0.07599 | 0.002629 |           1.0
  list densify |    2 |   10 |   1.918 |  0.01043 | 25.2395356579

BenchmarkGet
------------

                             name | rank | runs |   mean |       sd | timesBaseline
----------------------------------|------|------|--------|----------|--------------
  list get with iterable in slice |    1 |   10 | 0.0178 | 0.001255 |           1.0
vector get with iterable in slice |    2 |   10 | 0.0961 | 0.001629 | 5.39947356436

BenchmarkInit
-------------

                   name | rank | runs |      mean |        sd | timesBaseline
------------------------|------|------|-----------|-----------|--------------
              list init |    1 |   10 | 1.717e-06 | 1.047e-06 |           1.0
            vector init |    2 |   10 | 5.603e-06 | 4.342e-06 | 3.26388888889
vector init with values |    3 |   10 | 3.209e-05 | 1.347e-05 | 18.6944444444
  list init with values |    4 |   10 |  0.004974 |  0.001387 | 2897.43055556

BenchmarkSet
------------

                                      name | rank | runs |     mean |        sd | timesBaseline
-------------------------------------------|------|------|----------|-----------|--------------
vector set with iterables in slice present |    1 |   10 | 0.002003 | 3.217e-05 |           1.0
          list set with iterables in slice |    2 |   10 | 0.004546 | 0.0007072 | 2.26944226237
 vector set with iterables in slice absent |    3 |   10 | 0.006572 | 9.501e-05 | 3.28099930967

I also found out the hard way that isinstance(1e6, int) is False. (it's a float)
I tweaked SparseVector accordingly, but i guess there should be a fix for SparseList too.

@Goutte
Copy link
Author

Goutte commented May 30, 2016

I went ahead and published sparse_vector on pypi.
It's overall faster than sparse_list, but it's not overly optimized either.

@johnsyweb
Copy link
Owner

Hi @Goutte! Sorry I missed this for so long. Great work on creating sparse_vector!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants