# Demo: Shuffling sparse matrices
Natalia Vélez, June 2021

(Feel free to move this to a `scratch/` folder or delete once done.) 

In this notebook, we'll load a sparse matrix and shuffle each row independently.

In [1]:
import pymongo, gridfs, pickle
import numpy as np
from scipy import sparse

import sys
sys.path.append('..')
from utils import shuffle_csr

## Load real data

Connect to database

In [2]:
keyfile = '../6_database/credentials.key'
# Credentials file: username on the first line, password on the second
with open(keyfile, 'r') as f:
    creds = f.read().splitlines()
myclient = pymongo.MongoClient('134.76.24.75', username=creds[0], password=creds[1], authSource='ohol')
db = myclient.ohol

print(db)
print(db.list_collection_names())

Database(MongoClient(host=['134.76.24.75:27017'], document_class=dict, tz_aware=False, connect=True, authsource='ohol'), 'ohol')
['tfidf_matrix.files', 'maplogs', 'item_embeddings', 'tech_tree', 'lifelogs', 'item_links_demo', 'objects', 'expanded_transitions', 'avatar_embeddings', 'tfidf_matrix.chunks', 'transitions', 'activity_matrix.files', 'activity_matrix.chunks', 'cleaned_job_matrix.chunks', 'cleaned_job_matrix.files', 'item_interactions', 'nmf_validation', 'activity_labels', 'categories']


Activity matrix pointer

In [3]:
# Get pointer from activity_matrix.files
activity_file = list(db.activity_matrix.files.find())
activity_id = activity_file[0]['_id']
print('File metadata:')
print(activity_file)

File metadata:
[{'_id': ObjectId('6085c6c1affb2a7f0bf57a44'), 'uploadDate': datetime.datetime(2021, 4, 25, 19, 48, 18, 502000), 'length': 412799872, 'chunkSize': 261120, 'md5': 'bdcad93c8ef343607e2019d4f188b200'}]


Load activity matrix and check format

In [4]:
# Load the pickled sparse matrix from GridFS (takes about 30 seconds)
fs = gridfs.GridFS(db, collection='activity_matrix')
activity_bin = fs.get(activity_id)
activity_mtx = pickle.load(activity_bin, encoding='latin1')
print('Loaded activity matrix:')
print(activity_mtx.shape)

Loaded activity matrix:
(763682, 3044)


In [5]:
activity_mtx

<763682x3044 sparse matrix of type '<class 'numpy.float64'>'
	with 34145387 stored elements in Compressed Sparse Row format>

The activity matrix is in CSR format. From the [scipy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html):

> column indices for row i are stored in `indices[indptr[i]:indptr[i+1]]` and their corresponding values are stored in `data[indptr[i]:indptr[i+1]]`. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.
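
To make the quoted layout concrete, here is a minimal sketch on a tiny matrix. The matrix `example` and the helper `row_entries` are made up for illustration and are not part of this project's `utils`.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x4 matrix; the middle row is empty on purpose
example = sparse.csr_matrix(np.array([[0, 1, 0, 2],
                                      [0, 0, 0, 0],
                                      [5, 0, 6, 7]]))

print(example.indptr)   # [0 2 2 5]: row i spans indptr[i]:indptr[i+1]
print(example.indices)  # [1 3 0 2 3]: column index of each stored value
print(example.data)     # [1 2 5 6 7]: the stored values, row by row

def row_entries(mtx, i):
    """Return (column indices, values) of the stored entries in row i."""
    start, end = mtx.indptr[i], mtx.indptr[i + 1]
    return mtx.indices[start:end], mtx.data[start:end]

cols, vals = row_entries(example, 2)
print(cols, vals)       # [0 2 3] [5 6 7]
```

An empty row shows up as two equal consecutive entries in `indptr`, so its slice is empty.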

Example: below, we print (a) the column indices and (b) the corresponding values for a single row (row 2).

In [6]:
# Indices
print(activity_mtx.indices[activity_mtx.indptr[2]:activity_mtx.indptr[3]])

# Data
print(activity_mtx.data[activity_mtx.indptr[2]:activity_mtx.indptr[3]])

[   0    1    3    4    8    9   15   16   20   22   23   28   40   50
   51   68   70   71  103  106  113  118  120  149  150  152  160  161
  162  167  175  192  199  202  203  240  248  250  252  253  254  256
  262  264  293  302  304  305  307  308  310  318  347  349  352  353
  404  412  417  427  490  516  550  562  668  673  735  737  746  747
  750  751  755  759  763  814  832  843  846  885  887  940  948  953
  955  956 1003 1004 1005 1006 1007 1016 1341 1344 1345 1346 1347 1348
 1353 1354 1355 1356 1357 1358 1359 1361 1363 1364 1365 1366 1368 1375
 1382 1400 1408 1456 1481 1560 1562 1570 1639 1658 1694 1822 1832 1856
 1886 1906 1918 1919 1927 1955 1967 2046 2048 2049 2110 2112 2156]
[ 3.  1.  4.  3.  4.  1.  1.  3.  2.  7.  8.  8.  1.  4.  1.  2.  3.  3.
  2.  2.  1.  7.  1. 12.  7.  1.  1.  1.  8.  5.  1.  2.  2.  2. 16.  4.
  3.  4.  3.  1.  3. 10.  1.  2.  1.  1.  2.  1.  2. 37.  1.  1.  2.  1.
  2.  2.  2.  1.  1.  1.  1.  6.  1.  4. 12.  1.  7.  8.  3.  1.  3.  2.
  

We'll test `shuffle_csr` in two ways. First, we'll run it on a small toy matrix to check the output. Then, we'll run it on the full activity matrix to see how long the shuffle takes.

## Demo on toy matrix

In [7]:
toy_matrix = np.array([[0,1,0,2],
                       [0,3,0,4],
                       [5,0,6,7],
                       [8,0,9,0]])
toy_sparse = sparse.csr_matrix(toy_matrix)
print('Original matrix:')
print(toy_matrix)

print('\nShuffling...')
%time toy_shuffled = shuffle_csr(toy_sparse)
print('\nAfter shuffling:')
print(toy_shuffled.toarray())

Original matrix:
[[0 1 0 2]
 [0 3 0 4]
 [5 0 6 7]
 [8 0 9 0]]

Shuffling...
CPU times: user 592 µs, sys: 0 ns, total: 592 µs
Wall time: 602 µs

After shuffling:
[[2 0 0 1]
 [0 3 0 4]
 [5 6 0 7]
 [9 8 0 0]]
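
The source of `shuffle_csr` is not shown in this notebook. The toy output above is consistent with keeping each row's stored values and moving them to randomly chosen columns. The following is a minimal sketch of one way to do that, not the actual `utils.shuffle_csr`; the name `shuffle_rows_sketch` and its `seed` parameter are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse

def shuffle_rows_sketch(mtx, seed=None):
    """Hypothetical per-row shuffle: keep each row's values, redraw its columns."""
    rng = np.random.default_rng(seed)
    mtx = sparse.csr_matrix(mtx)
    n_rows, n_cols = mtx.shape
    new_indices = np.empty_like(mtx.indices)
    for i in range(n_rows):
        start, end = mtx.indptr[i], mtx.indptr[i + 1]
        # Draw distinct columns in random order for this row's stored values
        new_indices[start:end] = rng.choice(n_cols, size=end - start, replace=False)
    shuffled = sparse.csr_matrix(
        (mtx.data.copy(), new_indices, mtx.indptr.copy()), shape=mtx.shape)
    shuffled.sort_indices()  # restore ascending column order within each row
    return shuffled

toy = sparse.csr_matrix(np.array([[0, 1, 0, 2],
                                  [0, 3, 0, 4],
                                  [5, 0, 6, 7],
                                  [8, 0, 9, 0]]))
toy_shuffled_sketch = shuffle_rows_sketch(toy, seed=0)
print(toy_shuffled_sketch.toarray())
```

Because only `indices` changes, `data` and `indptr` keep their row boundaries, so every row retains the same values and the same number of nonzeros. The Python-level loop over rows is simple but may be slow on a matrix with hundreds of thousands of rows.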


## Shuffle full activity matrix

In [8]:
%time activity_shuffled = shuffle_csr(activity_mtx)

CPU times: user 41.8 s, sys: 128 ms, total: 41.9 s
Wall time: 42 s


Check outputs:

In [11]:
print('Are these identical arrays?')
print((activity_mtx != activity_shuffled).nnz == 0)

print('\nRow sums:')
print(np.array_equal(activity_mtx.sum(axis=1), activity_shuffled.sum(axis=1)))

print('\nCol sums:')
print(np.array_equal(activity_mtx.sum(axis=0), activity_shuffled.sum(axis=0)))

Are these identical arrays?
False

Row sums:
True

Col sums:
False
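
These results are what a within-row shuffle should produce: the matrices differ, row sums are unchanged, and column sums are not. Equal row sums are a fairly weak check, though, since two rows can share a sum while holding different values. A stricter check, sketched below, compares the sorted stored values of every row. The helper `rows_have_same_values` is hypothetical, and it assumes neither matrix stores explicit zeros.

```python
import numpy as np
from scipy import sparse

def rows_have_same_values(a, b):
    """True if every row of a and b holds the same multiset of stored values."""
    a, b = sparse.csr_matrix(a), sparse.csr_matrix(b)
    if a.shape != b.shape or not np.array_equal(a.indptr, b.indptr):
        return False  # different shape or different nonzero count in some row
    # Row id of every stored value, then sort values within each row
    row_ids = np.repeat(np.arange(a.shape[0]), np.diff(a.indptr))
    a_sorted = a.data[np.lexsort((a.data, row_ids))]
    b_sorted = b.data[np.lexsort((b.data, row_ids))]
    return np.array_equal(a_sorted, b_sorted)

original = sparse.csr_matrix(np.array([[0, 1, 0, 2], [5, 0, 6, 7]]))
permuted = sparse.csr_matrix(np.array([[2, 0, 0, 1], [5, 6, 0, 7]]))
altered = sparse.csr_matrix(np.array([[0, 1, 0, 2], [4, 7, 0, 7]]))  # same row sums

print(rows_have_same_values(original, permuted))  # True
print(rows_have_same_values(original, altered))   # False
```

On the real data, the equivalent call would be `rows_have_same_values(activity_mtx, activity_shuffled)`.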
