In [1]:
import numpy as np
from scipy import sparse

Sparse matrices offer a middle ground between a comprehensive data warehouse solution with extensive test coverage and a directory of text files and database dumps. They do not work for all data types, but where they are an appropriate technology you can rely on them even under load in production. Let's use an example to see how this process might play out.

A sparse matrix is one in which most of the values are zero. If the number of zero-valued elements divided by the size of the matrix (the sparsity) is greater than 0.5, the matrix is considered sparse.

In [6]:
A = np.random.randint(0,2,100000).reshape(100,1000)
sparsity = 1.0 - (np.count_nonzero(A) / A.size)
print(round(sparsity,4))

0.5031
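
To make the motivation concrete, the sketch below compares the memory footprint of a dense array with its CSR counterpart. The identity matrix and the names `dense`, `compact`, `dense_bytes` and `sparse_bytes` are illustrative choices, not part of the original example, and the exact byte counts depend on the index dtype SciPy selects.

```python
import numpy as np
from scipy import sparse

# Illustrative matrix: 1000 x 1000 with non-zeros only on the diagonal
dense = np.eye(1000)
compact = sparse.csr_matrix(dense)

sparsity = 1.0 - (np.count_nonzero(dense) / dense.size)

# A CSR matrix keeps three arrays: the values, their column indices,
# and one pointer per row (plus one) into those arrays
dense_bytes = dense.nbytes
sparse_bytes = (compact.data.nbytes
                + compact.indices.nbytes
                + compact.indptr.nbytes)

print(round(sparsity, 4))
print(dense_bytes, sparse_bytes)
```

At a sparsity close to 0.5, as in the random example above, the index arrays can cancel out most of the savings, so the benefit grows as the matrix gets sparser.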


### [coo_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy-sparse-coo-matrix)

A sparse matrix built from the COOrdinates and values of the non-zero entries.

In [7]:
A = np.random.poisson(0.3, (10,100))
B = sparse.coo_matrix(A)
C = B.todense()

print("A",type(A),A.shape,"\n"
      "B",type(B),B.shape,"\n"
      "C",type(C),C.shape,"\n")

A <class 'numpy.ndarray'> (10, 100) 
B <class 'scipy.sparse._coo.coo_matrix'> (10, 100) 
C <class 'numpy.matrix'> (10, 100) 
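
To see what the coordinate format actually stores, it can help to inspect the `row`, `col` and `data` attributes of a small matrix. This is a minimal sketch; the hand-written `dense` array is an illustrative assumption chosen so the triplets are easy to read.

```python
import numpy as np
from scipy import sparse

# Small hand-written matrix (illustrative, not from the original example)
dense = np.array([[0, 3, 0],
                  [0, 0, 0],
                  [5, 0, 7]])
coo = sparse.coo_matrix(dense)

# Each non-zero entry is stored as one (row, col, value) triplet
print(coo.row)   # -> [0 2 2]
print(coo.col)   # -> [1 0 2]
print(coo.data)  # -> [3 5 7]
print(coo.nnz)   # -> 3
```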



### [csc_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy-sparse-csc-matrix)

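In the coordinate format every non-zero entry carries its full (row, column) pair, so entries that share a row or column repeat the same index many times. Compressed formats remove this redundancy: the entries are grouped by column (or row), and instead of full coordinates we store only the offset at which each group starts. When the entries are grouped by columns we use the CSC (Compressed Sparse Column) format.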

In [8]:
A = np.random.poisson(0.3, (10,100))
B = sparse.csc_matrix(A)
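
The compression can be inspected through the `data`, `indices` and `indptr` attributes of a CSC matrix. The sketch below uses a small hand-written array (an illustrative assumption) in which the middle column holds two entries and the last column is empty.

```python
import numpy as np
from scipy import sparse

# Illustrative matrix: column 1 has two entries, column 2 has none
dense = np.array([[0, 3, 0],
                  [0, 0, 0],
                  [5, 4, 0]])
csc = sparse.csc_matrix(dense)

print(csc.data)     # values read column by column -> [5 3 4]
print(csc.indices)  # row index of each value      -> [2 0 2]
print(csc.indptr)   # start offset of each column  -> [0 1 3 3]

# The entries of column j live in data[indptr[j]:indptr[j + 1]]
col_1_values = csc.data[csc.indptr[1]:csc.indptr[2]]
print(col_1_values)  # -> [3 4]
```

Note that the column index is never stored per entry: four pointers describe three columns, however many entries each one holds.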

### [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy-sparse-csr-matrix)

Like the CSC format, the CSR (Compressed Sparse Row) format compresses the indices, but it groups the entries along the rows.

In [9]:
A = np.random.poisson(0.3, (10,100))
B = sparse.csr_matrix(A)
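
As a minimal sketch of the row-wise layout, the same three attributes can be read from a CSR matrix. The `dense` array here is again a hand-written illustration; its second row is empty, which shows up as a repeated pointer.

```python
import numpy as np
from scipy import sparse

# Illustrative matrix: row 1 is empty, row 2 has two entries
dense = np.array([[0, 3, 0],
                  [0, 0, 0],
                  [5, 4, 0]])
csr = sparse.csr_matrix(dense)

print(csr.data)     # values read row by row         -> [3 5 4]
print(csr.indices)  # column index of each value     -> [1 0 1]
print(csr.indptr)   # start offset of each row       -> [0 1 1 3]

# Row slicing only needs the two pointers that bracket the row
row_2 = csr[2].toarray()
print(row_2)  # -> [[5 4 0]]
```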

Because the coordinate format is the easiest to build, it is common to create a COO matrix first and then convert it to a more efficient format. Let us first show how to create a matrix from coordinates:

In [10]:
rows = [0,1,2,8]
cols = [1,0,4,8]
vals = [1,2,1,4]

A = sparse.coo_matrix((vals, (rows, cols)))
print(A.todense())

[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]]


Then convert it to a CSR matrix:

In [11]:
B = A.tocsr()
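
One detail worth knowing when ingesting data this way: a COO matrix may hold several entries for the same coordinate, and converting to CSR sums them. The sketch below illustrates this with made-up values; the explicit `shape` argument is added here for clarity and is not used in the original example.

```python
from scipy import sparse

# Coordinate (0, 1) appears twice on purpose (illustrative data)
rows = [0, 0, 1]
cols = [1, 1, 2]
vals = [1, 2, 5]

coo = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 3))
print(coo.nnz)  # duplicates are kept as separate stored entries -> 3

csr = coo.tocsr()
print(csr.nnz)  # duplicates are summed during the conversion -> 2
print(csr.toarray())
```

This is convenient for counting events (e.g., repeated user/item interactions), but it can hide duplicated records if each coordinate is expected to appear only once.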

Because this introduction to sparse matrices is applied to data ingestion, we need to be able to:

1. concatenate matrices (e.g., add a new user to a recommender matrix)

2. read and write the matrices to and from disk

In [12]:
## matrix merge example
C = sparse.csr_matrix(np.array([0,1,0,0,2,0,0,0,1]).reshape(1,9))
print(B.shape,C.shape)
D = sparse.vstack([B,C])
print(D.todense())

(9, 9) (1, 9)
[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]
 [0 1 0 0 2 0 0 0 1]]
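
The same idea applies along the other axis. As a hedged sketch, `sparse.hstack` can append a column, for instance when a new item is added to a recommender matrix. The names `ratings` and `new_item` and their values are hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical user-by-item matrix: 3 users, 2 items
ratings = sparse.csr_matrix(np.array([[0, 1],
                                      [2, 0],
                                      [0, 0]]))

# One new item, stored as a column with one value per user
new_item = sparse.csr_matrix(np.array([[1], [0], [3]]))

# Requesting the output format explicitly avoids depending on the default
wider = sparse.hstack([ratings, new_item], format="csr")
print(wider.shape)      # -> (3, 3)
print(wider.toarray())
```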


In [13]:
## read and write
file_name = "sparse_matrix.npz"
sparse.save_npz(file_name, D)
E = sparse.load_npz(file_name)
print(E.shape)

(10, 9)
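
Checking the shape alone does not confirm that the values survived the round trip. A minimal sketch of a stricter check follows; it writes to a temporary directory, and the names `original`, `restored` and `round_trip.npz` are illustrative.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small illustrative matrix to write and read back
original = sparse.csr_matrix(np.array([[0, 1, 0],
                                       [2, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "round_trip.npz")
    sparse.save_npz(path, original)
    restored = sparse.load_npz(path)

# The storage format is preserved, and no entry differs
same_format = restored.format == original.format
same_values = (restored != original).nnz == 0
print(same_format, same_values)  # -> True True
```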
