# Sparse Array Basics
- Sparse matrix formats, Storage efficiency, When to use sparse
- Real examples: Text data, Network graphs, Large-scale data

In [1]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt
print('Sparse arrays module loaded')

Sparse arrays module loaded


## What are Sparse Arrays?

**Definition**: Arrays where most elements are zero

**Why use sparse?**
- **Memory**: Store only non-zero elements
- **Speed**: Skip zero computations
- **Scale**: Handle massive matrices

**Rule of thumb**: Consider sparse storage when more than ~90% of entries are zero

**Common in**:
- Text processing (TF-IDF, word counts)
- Graphs/networks (adjacency matrices)
- Scientific computing (FEM, CFD)
- Machine learning (one-hot and other high-dimensional categorical features)

In [2]:
# Dense vs sparse comparison
n = 10000
density = 0.01  # 1% non-zero

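# Start from a dense random array, then zero every entry above the
# threshold so that only ~1% of the values remain non-zero
np.random.seed(42)
data = np.random.rand(n, n)
data[data > density] = 0  # Still stored densely: each zero takes 8 bytes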

print(f'Array: {n}×{n}')
print(f'Density: {density*100}%\n')

# Memory comparison
dense_size = data.nbytes / (1024**2)  # MB
sparse_csr = sparse.csr_array(data)
sparse_size = (sparse_csr.data.nbytes + sparse_csr.indices.nbytes + 
               sparse_csr.indptr.nbytes) / (1024**2)

print(f'Dense array:')
print(f'  Memory: {dense_size:.2f} MB')
print(f'  Non-zeros: {np.count_nonzero(data):,}\n')

print(f'Sparse array (CSR):')
print(f'  Memory: {sparse_size:.2f} MB')
print(f'  Compression: {dense_size/sparse_size:.1f}×')
print(f'  Memory saved: {(1-sparse_size/dense_size)*100:.1f}%')

Array: 10000×10000
Density: 1.0%

Dense array:
  Memory: 762.94 MB
  Non-zeros: 999,771

Sparse array (CSR):
  Memory: 11.48 MB
  Compression: 66.5×
  Memory saved: 98.5%
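The CSR figure above can be reproduced by hand: CSR stores one value and one column index per non-zero, plus `n_rows + 1` row pointers. The sketch below assumes float64 values and int32 indices (the dtypes that appear in this notebook's output); `csr_nbytes` is a hypothetical helper written for illustration, not a SciPy function.

```python
import numpy as np
from scipy import sparse


def csr_nbytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Estimate CSR storage: data + indices + indptr (hypothetical helper)."""
    return nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes


# Figures from the 10000x10000 example above: 999,771 non-zeros
estimate_mb = csr_nbytes(10_000, 999_771) / 1024**2
print(f'Estimated CSR size: {estimate_mb:.2f} MB')

# Cross-check the formula against SciPy on a small random matrix
rng = np.random.default_rng(0)
dense = rng.random((200, 300))
dense[dense > 0.05] = 0
small = sparse.csr_array(dense)

actual = small.data.nbytes + small.indices.nbytes + small.indptr.nbytes
predicted = csr_nbytes(
    small.shape[0], small.nnz,
    value_bytes=small.data.dtype.itemsize,
    index_bytes=small.indices.dtype.itemsize,
)
print(f'Small matrix: actual {actual} bytes, predicted {predicted} bytes')
```

Under these assumptions each non-zero costs 12 bytes in CSR against 8 bytes per entry in the dense array, so CSR only saves memory below roughly two-thirds density. The 90% rule of thumb leaves a wide margin for the speed benefits as well.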


## Sparse Matrix Formats

**scipy.sparse** supports multiple formats:

1. **CSR (Compressed Sparse Row)**: Fast row slicing, arithmetic
2. **CSC (Compressed Sparse Column)**: Fast column slicing
3. **COO (Coordinate)**: Fast construction, easy to build
4. **DOK (Dictionary of Keys)**: Incremental construction
5. **LIL (List of Lists)**: Flexible construction
6. **DIA (Diagonal)**: Matrices whose non-zeros lie on a few diagonals
7. **BSR (Block Sparse Row)**: Dense sub-blocks

**Most used**: CSR (general), CSC (column ops), COO (construction)
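The usual workflow behind the "most used" trio can be sketched as follows: collect `(row, col, value)` triplets, build a COO array from them, then convert to CSR or CSC depending on the access pattern. The triplets here are made-up illustration data.

```python
import numpy as np
from scipy import sparse

# (row, col, value) triplets for a small 4x4 matrix (illustrative data)
rows = np.array([0, 0, 1, 3, 3, 3])
cols = np.array([0, 3, 1, 0, 2, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

coo = sparse.coo_array((vals, (rows, cols)), shape=(4, 4))
csr = coo.tocsr()   # row-oriented: row slicing, matrix-vector products
csc = coo.tocsc()   # column-oriented: column slicing

last_row = csr[3:4, :].toarray()
first_col = csc[:, 0:1].toarray()
print(last_row)
print(first_col)
```

Note that duplicate `(row, col)` entries in the triplets are summed when a COO array is converted to CSR or CSC, which is convenient when accumulating counts.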

In [3]:
# Create small sparse matrix
data_dense = np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
    [4, 0, 5, 6]
], dtype=float)

print('Dense matrix:')
print(data_dense)
print(f'\nNon-zeros: {np.count_nonzero(data_dense)}/{data_dense.size}')
print(f'Sparsity: {(1 - np.count_nonzero(data_dense)/data_dense.size)*100:.1f}%\n')

# Convert to different formats
constructors = {
    'csr': sparse.csr_array,
    'csc': sparse.csc_array,
    'coo': sparse.coo_array,
    'lil': sparse.lil_array,
    'dok': sparse.dok_array,
}

for fmt, make_sparse in constructors.items():
    sp = make_sparse(data_dense)
    
    print(f'{fmt.upper()}: {type(sp).__name__}')
    print(f'  Shape: {sp.shape}')
    print(f'  Non-zeros: {sp.nnz}')
    print(f'  Data type: {sp.dtype}')

Dense matrix:
[[1. 0. 0. 2.]
 [0. 3. 0. 0.]
 [0. 0. 0. 0.]
 [4. 0. 5. 6.]]

Non-zeros: 6/16
Sparsity: 62.5%

CSR: csr_array
  Shape: (4, 4)
  Non-zeros: 6
  Data type: float64
CSC: csc_array
  Shape: (4, 4)
  Non-zeros: 6
  Data type: float64
COO: coo_array
  Shape: (4, 4)
  Non-zeros: 6
  Data type: float64
LIL: lil_array
  Shape: (4, 4)
  Non-zeros: 6
  Data type: float64
DOK: dok_array
  Shape: (4, 4)
  Non-zeros: 6
  Data type: float64


## CSR Format Deep Dive

**CSR (Compressed Sparse Row)**: Most common format

**Storage**: Three arrays
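- **data**: Non-zero values, stored row by row
- **indices**: Column index of each value in `data`
- **indptr**: Row pointers (length `n_rows + 1`)

**Row i spans**: `data[indptr[i]:indptr[i+1]]` (values) and `indices[indptr[i]:indptr[i+1]]` (their columns)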

**Advantages**:
- Fast row access
- Fast matrix-vector multiplication
- Efficient arithmetic operations

In [4]:
# CSR internals
csr = sparse.csr_array(data_dense)

print('CSR internal structure:\n')
print(f'data (values):    {csr.data}')
print(f'indices (cols):   {csr.indices}')
print(f'indptr (rows):    {csr.indptr}\n')

print('Decoding:')
for i in range(csr.shape[0]):
    start, end = csr.indptr[i], csr.indptr[i+1]
    if start < end:
        values = csr.data[start:end]
        cols = csr.indices[start:end]
        print(f'  Row {i}: {list(zip(cols, values))}')
    else:
        print(f'  Row {i}: empty')

CSR internal structure:

data (values):    [1. 2. 3. 4. 5. 6.]
indices (cols):   [0 3 1 0 2 3]
indptr (rows):    [0 2 3 3 6]

Decoding:
  Row 0: [(np.int32(0), np.float64(1.0)), (np.int32(3), np.float64(2.0))]
  Row 1: [(np.int32(1), np.float64(3.0))]
  Row 2: empty
  Row 3: [(np.int32(0), np.float64(4.0)), (np.int32(2), np.float64(5.0)), (np.int32(3), np.float64(6.0))]
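The same row-by-row decoding explains why matrix-vector products are fast in CSR: each output entry only touches the stored values of one row. The function below is an illustrative pure-Python sketch (the name `csr_matvec` is made up here, and SciPy's own implementation is compiled and much faster); it is checked against `A @ x`.

```python
import numpy as np
from scipy import sparse


def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x from the three CSR arrays (illustrative, not optimized)."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        # Only the stored entries of row i contribute to y[i];
        # an empty row gives an empty dot product, i.e. 0
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y


A = sparse.csr_array(np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
    [4, 0, 5, 6],
], dtype=float))
x = np.array([1.0, 2.0, 3.0, 4.0])

y_manual = csr_matvec(A.data, A.indices, A.indptr, x)
y_scipy = A @ x
print(y_manual)
print(np.allclose(y_manual, y_scipy))
```

The work is proportional to the number of stored non-zeros rather than to `rows * cols`.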


## Real Example: Text Document Matrix
**Problem**: Represent documents as word-count vectors

**Why sparse**: Most words don't appear in most documents

**TF-IDF, bag-of-words**: Naturally sparse

In [5]:
# Simulate document-term matrix
n_docs = 1000
n_words = 10000
avg_words_per_doc = 100  # Sparsity ~99%

print('Text Document Matrix')
print(f'  Documents: {n_docs:,}')
print(f'  Vocabulary: {n_words:,}')
print(f'  Avg words/doc: {avg_words_per_doc}\n')

# Build sparse matrix (DOK for construction)
np.random.seed(42)
doc_term = sparse.dok_array((n_docs, n_words), dtype=np.int32)

for doc_id in range(n_docs):
    # Random words in this document
    n_words_doc = np.random.poisson(avg_words_per_doc)
    word_ids = np.random.choice(n_words, size=min(n_words_doc, n_words), replace=False)
    counts = np.random.randint(1, 10, size=len(word_ids))
    
    for word_id, count in zip(word_ids, counts):
        doc_term[doc_id, word_id] = count

# Convert to CSR for operations
doc_term_csr = doc_term.tocsr()

print(f'Matrix stats:')
print(f'  Shape: {doc_term_csr.shape}')
print(f'  Non-zeros: {doc_term_csr.nnz:,}')
print(f'  Density: {doc_term_csr.nnz / (n_docs * n_words) * 100:.3f}%\n')

# Memory comparison
dense_mem = n_docs * n_words * 4 / (1024**2)  # int32 = 4 bytes
sparse_mem = (doc_term_csr.data.nbytes + doc_term_csr.indices.nbytes + 
              doc_term_csr.indptr.nbytes) / (1024**2)

print(f'Memory usage:')
print(f'  Dense would be: {dense_mem:.1f} MB')
print(f'  Sparse (CSR): {sparse_mem:.1f} MB')
print(f'  Savings: {(1 - sparse_mem/dense_mem)*100:.1f}%')

Text Document Matrix
  Documents: 1,000
  Vocabulary: 10,000
  Avg words/doc: 100

Matrix stats:
  Shape: (1000, 10000)
  Non-zeros: 100,139
  Density: 1.001%

Memory usage:
  Dense would be: 38.1 MB
  Sparse (CSR): 0.8 MB
  Savings: 98.0%
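Once the counts are in CSR, typical text statistics can be computed without ever densifying the matrix. The sketch below uses a tiny made-up count matrix and one common smoothed IDF variant; the IDF formula is an assumption chosen for illustration, and other definitions exist.

```python
import numpy as np
from scipy import sparse

# Tiny document-term count matrix: 3 documents, 4 words (made-up counts)
counts = sparse.csr_array(np.array([
    [2, 0, 1, 0],
    [0, 0, 3, 0],
    [1, 1, 0, 0],
], dtype=np.int32))

n_docs, n_words = counts.shape

# Words per document: sum along each row
doc_lengths = np.asarray(counts.sum(axis=1)).ravel()

# Document frequency: in CSR, `indices` holds the column (word) id of every
# stored entry, so counting ids gives the number of documents per word
doc_freq = np.bincount(counts.indices, minlength=n_words)

# Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Scale each stored count by the IDF of its column, touching only non-zeros
tfidf = counts.astype(float)
tfidf.data *= idf[tfidf.indices]

print('Doc lengths:', doc_lengths)
print('Doc frequency:', doc_freq)
print('TF-IDF non-zeros:', tfidf.nnz)
```

Scaling `data` in place keeps the sparsity pattern unchanged, so the weighted matrix needs no more index storage than the raw counts.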


## Real Example: Social Network Graph
**Problem**: Store friendships in a social network

**Adjacency matrix**: A[i,j]=1 if i and j are friends

**Why sparse**: Most people are not friends with most others

In [6]:
# Social network
n_users = 10000
avg_friends = 50  # Friendships each user initiates; symmetrizing roughly doubles the average degree

print('Social Network Graph')
print(f'  Users: {n_users:,}')
print(f'  Avg friends/user: {avg_friends}\n')

# Build adjacency matrix (symmetric)
np.random.seed(42)
# Use LIL for construction
adj = sparse.lil_array((n_users, n_users), dtype=np.int8)

for user in range(n_users):
    n_friends = np.random.poisson(avg_friends)
    friends = np.random.choice(n_users, size=min(n_friends, n_users-1), replace=False)
    friends = friends[friends != user]  # Remove self
    
    for friend in friends:
        adj[user, friend] = 1
        adj[friend, user] = 1  # Symmetric

# Convert to CSR
adj_csr = adj.tocsr()

print(f'Adjacency matrix:')
print(f'  Shape: {adj_csr.shape}')
print(f'  Edges: {adj_csr.nnz // 2:,}')  # Divide by 2 for undirected
print(f'  Density: {adj_csr.nnz / (n_users**2) * 100:.3f}%\n')

# Find user with most friends
friend_counts = np.array(adj_csr.sum(axis=1)).flatten()
max_friends_user = friend_counts.argmax()

print(f'User statistics:')
print(f'  Most connected: User {max_friends_user} ({friend_counts[max_friends_user]} friends)')
print(f'  Avg friends: {friend_counts.mean():.1f}')
print(f'  Median friends: {np.median(friend_counts):.0f}')

Social Network Graph
  Users: 10,000
  Avg friends/user: 50

Adjacency matrix:
  Shape: (10000, 10000)
  Edges: 498,130
  Density: 0.996%

User statistics:
  Most connected: User 7070 (140 friends)
  Avg friends: 99.6
  Median friends: 99
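The measured average of 99.6 friends is about twice `avg_friends = 50` because every sampled friendship is written in both directions. Building the matrix from an explicit edge list makes that bookkeeping visible. The sketch below uses a made-up four-user graph and also shows one thing sparse products are handy for: counting common friends with `A @ A`.

```python
import numpy as np
from scipy import sparse

# Undirected friendships as (i, j) pairs, each pair listed once (toy data)
edges = np.array([[0, 1], [0, 2], [1, 2], [2, 3]])
n_users = 4

i, j = edges[:, 0], edges[:, 1]
ones = np.ones(len(edges), dtype=np.int32)

# Store every edge in both directions so that A is symmetric
A = sparse.coo_array(
    (np.concatenate([ones, ones]),
     (np.concatenate([i, j]), np.concatenate([j, i]))),
    shape=(n_users, n_users),
).tocsr()

degrees = np.asarray(A.sum(axis=1)).ravel()  # friends per user
mutual = (A @ A).toarray()                   # mutual[u, v] = common friends of u and v

print('Edges:', A.nnz // 2)
print('Degrees:', degrees)
print('Common friends of users 0 and 1:', mutual[0, 1])
```

With `E` undirected edges the matrix holds `2 * E` non-zeros, so the mean degree is `2 * E / n_users`.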


## Summary

### When to Use Sparse:
✓ **>90% zeros**: Large memory/speed gains  
✓ **Large scale**: Can't fit dense in memory  
✓ **Natural sparsity**: Text, graphs, sensor data  

### Format Selection:

| Format | Construction | Row ops | Col ops | Arithmetic | Best for |
|--------|-------------|---------|---------|------------|----------|
| **CSR** | Slow | Fast | Slow | Fast | General, matrix-vector |
| **CSC** | Slow | Slow | Fast | Fast | Column slicing |
| **COO** | Fast | Slow | Slow | Medium | Construction, I/O |
| **LIL** | Fast | Fast | Slow | Slow | Incremental build |
| **DOK** | Fast | Medium | Medium | Slow | Random access |

### Best Practices:
✓ **Build in COO/LIL/DOK**: Fast incremental construction  
✓ **Convert to CSR/CSC**: For computation  
✓ **Check density**: `nnz / (rows * cols)`  
✓ **Use appropriate format**: Match access pattern  
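These practices fit together as in the following minimal sketch: build incrementally in LIL, check the density, then convert once to the format that matches the access pattern. `density` is a hypothetical helper defined here, not part of SciPy.

```python
import numpy as np
from scipy import sparse


def density(A):
    """Fraction of stored entries: nnz / (rows * cols) (hypothetical helper)."""
    rows, cols = A.shape
    return A.nnz / (rows * cols)


# Build incrementally in LIL
A = sparse.lil_array((1000, 1000), dtype=float)
A[0, 10] = 1.5
for col, val in enumerate([1.0, 2.0, 3.0]):
    A[5, col] = val
A[999, 999] = -4.0

# Convert once for computation
A_csr = A.tocsr()      # row slicing, matrix-vector products
A_csc = A_csr.tocsc()  # column slicing

d = density(A_csr)
row5_sum = A_csr[5:6, :].sum()
col0_sum = A_csc[:, 0:1].sum()

print(f'Density: {d:.6f}')
print(f'Row 5 sum: {row5_sum}')
print(f'Column 0 sum: {col0_sum}')
```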

### Common Applications:
- **NLP**: TF-IDF, bag-of-words, co-occurrence
- **Graphs**: Social networks, web graphs, molecules
- **Images**: Feature descriptors, SIFT, HOG
- **Finance**: Correlation matrices, time series
- **Physics**: FEM meshes, Laplacian matrices
- **ML**: Feature matrices, kernel matrices