# Sparse Array Construction and Conversion
- Creating sparse arrays, format conversion, efficient construction
- Real examples: building sparse matrices from data

In [1]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt
print('Sparse array construction module loaded')

Sparse array construction module loaded


## Construction Methods

**From dense**:
- `csr_array(dense_array)`
- Zeros are dropped automatically; only non-zero entries are stored

**From COO format**:
- `coo_array((data, (rows, cols)), shape)`
- Most flexible: takes explicit (value, row, column) triplets in any order

**Special constructors**:
- `eye()` - Identity
- `diags()` - Diagonal
- `random()` - Random sparse
- `block_diag()` - Block diagonal
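The two constructors that the next cell does not exercise, `random()` and `block_diag()`, can be sketched as follows. The sizes, the density, and the names `R`, `A1`, `A2`, `B` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import sparse

# Random sparse array: about 10% of the 20 x 30 positions are stored.
R = sparse.random(20, 30, density=0.1, format='csr')

# Block-diagonal layout: each block sits on the diagonal, zeros elsewhere.
A1 = sparse.csr_array(np.ones((2, 2)))
A2 = sparse.csr_array(np.array([[5.0]]))
B = sparse.block_diag([A1, A2], format='csr')

print(R.shape, R.nnz)
print(B.toarray())
```

Depending on the SciPy version, some of these helpers may return the older sparse matrix type rather than a sparse array, so checking `type(...)` on the result is a reasonable precaution.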

In [2]:
# Method 1: From dense array
dense = np.array([[1, 0, 2], [0, 0, 3], [4, 0, 0]])
sparse_from_dense = sparse.csr_array(dense)

print('Method 1: From dense array')
print(f'  Created: {sparse_from_dense.shape} CSR matrix')
print(f'  Non-zeros: {sparse_from_dense.nnz}\n')

# Method 2: COO format (best for construction)
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 2, 0])
data = np.array([1, 2, 3, 4])

sparse_coo = sparse.coo_array((data, (rows, cols)), shape=(3, 3))
print('Method 2: COO construction')
print(f'  Rows: {rows}')
print(f'  Cols: {cols}')
print(f'  Data: {data}')
print(f'  Result:\n{sparse_coo.toarray()}\n')

# Method 3: Special constructors
identity = sparse.eye(4, format='csr')
print('Method 3: Identity matrix')
print(f'  Shape: {identity.shape}')
print(f'  Format: {identity.format}')

Method 1: From dense array
  Created: (3, 3) CSR matrix
  Non-zeros: 4

Method 2: COO construction
  Rows: [0 0 1 2]
  Cols: [0 2 2 0]
  Data: [1 2 3 4]
  Result:
[[1 0 2]
 [0 0 3]
 [4 0 0]]

Method 3: Identity matrix
  Shape: (4, 4)
  Format: csr
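One behaviour of COO construction worth knowing: when the same (row, col) position appears more than once in the triplets, the values are summed on conversion. The sketch below uses made-up triplets to show this.

```python
import numpy as np
from scipy import sparse

# Position (0, 0) appears twice in the triplets below.
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 2])
data = np.array([1, 5, 3])

coo = sparse.coo_array((data, (rows, cols)), shape=(2, 3))
dense = coo.toarray()   # duplicates are summed when densifying
csr = coo.tocsr()       # and merged into one stored entry in CSR

print(coo.nnz, csr.nnz)
print(dense)
```

Here the COO array still stores both duplicate entries separately, while the CSR result stores one merged entry.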


## Diagonal Matrices
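`sparse.diags()` builds diagonal and banded matrices directly from their diagonals, so the zeros are never materialized.

**Parameters**:
- `diagonals`: List of sequences, one per diagonal
- `offsets`: Position of each diagonal (0 = main, k > 0 = k-th superdiagonal, k < 0 = k-th subdiagonal)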

In [3]:
# Tridiagonal matrix
main_diag = [4, 4, 4, 4, 4]
off_diag = [-1, -1, -1, -1]

tridiag = sparse.diags([off_diag, main_diag, off_diag], 
                        offsets=[-1, 0, 1], 
                        format='csr')

print('Tridiagonal matrix:')
print(tridiag.toarray())
print(f'\nStorage: {tridiag.nnz} values instead of {tridiag.shape[0]**2}')

Tridiagonal matrix:
[[ 4. -1.  0.  0.  0.]
 [-1.  4. -1.  0.  0.]
 [ 0. -1.  4. -1.  0.]
 [ 0.  0. -1.  4. -1.]
 [ 0.  0.  0. -1.  4.]]

Storage: 13 values instead of 25
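When every diagonal holds a constant value, `diags()` also accepts scalars, as long as `shape` is given explicitly. A minimal sketch that should reproduce the tridiagonal matrix above (the name `tridiag_scalar` is illustrative):

```python
import numpy as np
from scipy import sparse

# Scalars are broadcast along each diagonal; shape is required in this form.
tridiag_scalar = sparse.diags([-1, 4, -1], offsets=[-1, 0, 1],
                              shape=(5, 5), format='csr')

print(tridiag_scalar.toarray())
```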


## Format Conversion

**Convert between formats**:
- `.tocsr()` → CSR
- `.tocsc()` → CSC
- `.tocoo()` → COO
- `.tolil()` → LIL
- `.todok()` → DOK
- `.toarray()` → Dense numpy array

**Strategy**: Build in COO or LIL, then convert to CSR/CSC for arithmetic and slicing

In [4]:
# Start with COO (easy to build)
rows = [0, 1, 2, 0, 1, 2]
cols = [0, 1, 2, 1, 2, 0]
data = [10, 20, 30, 5, 15, 25]

coo = sparse.coo_array((data, (rows, cols)), shape=(3, 3))
print('Original: COO format')
print(coo.toarray())
print()

# Convert to different formats
formats = {
    'CSR': coo.tocsr(),
    'CSC': coo.tocsc(),
    'LIL': coo.tolil(),
    'DOK': coo.todok()
}

for name, sp in formats.items():
    print(f'{name}: {type(sp).__name__}, nnz={sp.nnz}')

Original: COO format
[[10  5  0]
 [ 0 20 15]
 [25  0 30]]

CSR: csr_array, nnz=6
CSC: csc_array, nnz=6
LIL: lil_array, nnz=6
DOK: dok_array, nnz=6
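To see what a conversion actually produces, it can help to look at the three arrays behind CSR. The sketch below rebuilds the same 3×3 example and reads one row straight from `indptr`, `indices`, and `data`; the variable names `row1_cols` and `row1_vals` are illustrative.

```python
import numpy as np
from scipy import sparse

rows = [0, 1, 2, 0, 1, 2]
cols = [0, 1, 2, 1, 2, 0]
data = [10, 20, 30, 5, 15, 25]

csr = sparse.coo_array((data, (rows, cols)), shape=(3, 3)).tocsr()
csr.sort_indices()  # make the column order within each row deterministic

print(csr.indptr)   # row boundaries into indices/data
print(csr.indices)  # column index of each stored value
print(csr.data)     # stored values, row by row

# Row i's entries live in the slice indptr[i]:indptr[i + 1].
start, stop = csr.indptr[1], csr.indptr[2]
row1_cols = csr.indices[start:stop]
row1_vals = csr.data[start:stop]
```

This layout is why row slicing is cheap in CSR and column slicing is cheap in CSC, which stores the transposed arrangement.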


## Real Example: Rating Matrix Construction
- **Problem**: User-item ratings (Netflix, Amazon)
- **Challenge**: Millions of users and items, but only a few ratings per user
- **Approach**: Collect (user, item, rating) triplets and build with COO in one step

In [5]:
# Simulate user-item ratings
n_users = 100000
n_items = 50000
n_ratings = 1000000  # ~0.02% density

print('Collaborative Filtering: Rating Matrix')
print(f'  Users: {n_users:,}')
print(f'  Items: {n_items:,}')
print(f'  Ratings: {n_ratings:,}\n')

# Generate random ratings
np.random.seed(42)
users = np.random.randint(0, n_users, n_ratings)
items = np.random.randint(0, n_items, n_ratings)
ratings = np.random.randint(1, 6, n_ratings)  # 1-5 stars

print('Building sparse matrix...')
# COO is perfect for this
rating_matrix = sparse.coo_array((ratings, (users, items)), 
                                 shape=(n_users, n_items))

print('Matrix constructed!')
print(f'  Shape: {rating_matrix.shape}')
print(f'  Non-zeros: {rating_matrix.nnz:,}')
print(f'  Density: {rating_matrix.nnz/(n_users*n_items)*100:.4f}%\n')

# Convert to CSR for operations
rating_csr = rating_matrix.tocsr()

# Find the item with the highest rating total
# (sum(axis=0) adds up the star values in each column; it does not count ratings)
item_totals = np.asarray(rating_csr.sum(axis=0)).ravel()
top_item = item_totals.argmax()
print(f'Highest rating total: Item {top_item} (sum of ratings: {int(item_totals[top_item])})')

# Memory efficiency
dense_size = n_users * n_items * 4 / (1024**3)  # GB, assuming 4-byte values
sparse_size = (rating_csr.data.nbytes + rating_csr.indices.nbytes +
               rating_csr.indptr.nbytes) / (1024**2)  # MB
print('\nMemory:')
print(f'  Dense would need: {dense_size:.1f} GB')
print(f'  Sparse only needs: {sparse_size:.1f} MB')
print(f'  Compression: {dense_size*1024/sparse_size:.0f}×')

Collaborative Filtering: Rating Matrix
  Users: 100,000
  Items: 50,000
  Ratings: 1,000,000

Building sparse matrix...
Matrix constructed!
  Shape: (100000, 50000)
  Non-zeros: 1,000,000
  Density: 0.0200%

Highest rating total: Item 15800 (sum of ratings: 139)

Memory:
  Dense would need: 18.6 GB
  Sparse only needs: 16.0 MB
  Compression: 1191×
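Summing a column adds up star values, which is different from counting how many ratings an item received. A small sketch with made-up ratings shows both quantities side by side; counting stored entries per column via the CSC `indptr` array is one possible approach, not the only one.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 users, 3 items, 4 ratings.
users = np.array([0, 1, 2, 0])
items = np.array([1, 1, 0, 2])
ratings = np.array([5, 4, 1, 3])

R = sparse.coo_array((ratings, (users, items)), shape=(3, 3))

# Total stars per item (column sums).
rating_totals = np.asarray(R.tocsr().sum(axis=0)).ravel()

# Number of stored ratings per item: consecutive differences of the CSC column pointer.
ratings_per_item = np.diff(R.tocsc().indptr)

print(rating_totals)
print(ratings_per_item)
```

Note that repeated (user, item) pairs are merged during conversion, so the count reflects distinct pairs rather than raw input rows.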


## Real Example: Feature Matrix from Transactions
- **Problem**: Convert transaction data to a sparse feature matrix
- **Use case**: Fraud detection, recommendation
- **Why sparse**: One-hot encoding activates only a handful of features per transaction

In [6]:
# Transaction data
n_transactions = 50000
n_features = 10000  # Categories, merchants, etc.

print('Transaction Feature Matrix')
print(f'  Transactions: {n_transactions:,}')
print(f'  Feature categories: {n_features:,}\n')

# Simulate: each transaction has 5-20 active features
np.random.seed(42)
rows = []
cols = []
vals = []

for txn_id in range(n_transactions):
    n_active = np.random.randint(5, 21)
    features = np.random.choice(n_features, size=n_active, replace=False)
    
    for feat_id in features:
        rows.append(txn_id)
        cols.append(feat_id)
        vals.append(1)  # Binary feature

# Build sparse matrix
feature_matrix = sparse.coo_array((vals, (rows, cols)), 
                                  shape=(n_transactions, n_features))
feature_csr = feature_matrix.tocsr()

print('Feature matrix:')
print(f'  Shape: {feature_csr.shape}')
print(f'  Non-zeros: {feature_csr.nnz:,}')
print(f'  Avg features/txn: {feature_csr.nnz/n_transactions:.1f}')
print(f'  Density: {feature_csr.nnz/(n_transactions*n_features)*100:.3f}%\n')

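# Example: Get features for transaction 0
# Indexing a csr_array with a single integer can return a 1-D result, whose
# nonzero() tuple has only one element, so [1] fails. Read the column indices
# of row 0 straight from the CSR structure instead.
txn_features = feature_csr.indices[feature_csr.indptr[0]:feature_csr.indptr[1]]
print(f'Transaction 0 active features: {txn_features[:10]}...')

Transaction Feature Matrix
  Transactions: 50,000
  Feature categories: 10,000

Feature matrix:
  Shape: (50000, 10000)
  Non-zeros: 624,498
  Avg features/txn: 12.5
  Density: 0.125%
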
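With real data, the feature columns usually come from categorical labels rather than random integers. The sketch below one-hot encodes a hypothetical `merchant` column; the labels and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Hypothetical categorical column, one label per transaction.
merchant = np.array(['grocery', 'fuel', 'grocery', 'travel'])

# Map each label to an integer code; categories come back sorted.
categories, codes = np.unique(merchant, return_inverse=True)

n_rows = merchant.shape[0]
onehot = sparse.csr_array(
    (np.ones(n_rows, dtype=np.int8), (np.arange(n_rows), codes)),
    shape=(n_rows, categories.shape[0]),
)

print(categories)
print(onehot.toarray())
```

Each row holds exactly one non-zero, so storage grows with the number of transactions rather than with the number of categories.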

## Efficient Incremental Construction

**Problem**: Building a matrix element by element

- **Bad**: Assigning each element into LIL/DOK in a Python loop (slow)
- **Good**: Collect all (row, col, data) triplets, then build COO once

**Best practice**: Batch construction

In [None]:
import time

n = 5000
nnz = 25000

np.random.seed(42)
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
data = np.random.rand(nnz)

print(f'Building {n}×{n} matrix with {nnz:,} elements\n')

# Method 1: LIL incremental (SLOW)
n_lil = min(1000, nnz)  # time only a sample of the elements for the demo
start = time.perf_counter()
lil = sparse.lil_array((n, n))
for i in range(n_lil):
    lil[rows[i], cols[i]] = data[i]
time_lil = time.perf_counter() - start
print(f'Method 1 (LIL incremental): {time_lil:.3f}s for {n_lil:,} elements')

# Method 2: COO batch (FAST)
start = time.perf_counter()
coo = sparse.coo_array((data, (rows, cols)), shape=(n, n))
csr = coo.tocsr()
time_coo = time.perf_counter() - start
print(f'Method 2 (COO batch): {time_coo:.3f}s for all {nnz:,} elements')

# Extrapolate the LIL sample linearly to all nnz elements for a rough comparison
print(f'\nEstimated speedup: {time_lil / time_coo * nnz / n_lil:.0f}× faster')
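When the triplets arrive in pieces (for example, one file or one database page at a time), batch construction still applies: keep the pieces in lists and concatenate once at the end. A minimal sketch, where `build_from_chunks` and the chunk layout are hypothetical:

```python
import numpy as np
from scipy import sparse

def build_from_chunks(chunks, shape):
    """Build a CSR array from an iterable of (rows, cols, vals) chunks."""
    row_parts, col_parts, val_parts = [], [], []
    for r, c, v in chunks:
        row_parts.append(np.asarray(r))
        col_parts.append(np.asarray(c))
        val_parts.append(np.asarray(v))
    # One concatenation and one COO build, instead of per-element updates.
    rows = np.concatenate(row_parts)
    cols = np.concatenate(col_parts)
    vals = np.concatenate(val_parts)
    return sparse.coo_array((vals, (rows, cols)), shape=shape).tocsr()

# Two made-up chunks; position (0, 0) occurs in both and is summed.
chunks = [([0, 1], [0, 2], [1.0, 2.0]),
          ([2, 0], [1, 0], [3.0, 4.0])]
M = build_from_chunks(chunks, shape=(3, 3))
print(M.toarray())
```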

## Summary

### Construction Strategy:
```python
# 1. Collect data
rows = [...]
cols = [...]
data = [...]

# 2. Build COO
coo = sparse.coo_array((data, (rows, cols)), shape=(m, n))

# 3. Convert for operations
csr = coo.tocsr()  # Row operations
csc = coo.tocsc()  # Column operations
```

### Format Selection for Construction:
- **COO**: Best for batch construction from (row, col, data)
- **LIL**: Workable for incremental, row-by-row edits when triplets cannot be collected up front
- **DOK**: Workable for random access during construction (dict-like lookups and updates)
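For the cases where entries really must be set one at a time, a small sketch of LIL and DOK assignment followed by conversion to CSR might look like this. The shapes and values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# LIL: assign a few entries, typically working row by row.
lil = sparse.lil_array((3, 4))
lil[0, 1] = 2.0
lil[0, 3] = 7.0
lil[2, 0] = 1.0

# DOK: dictionary-like access, convenient for read-modify-write updates.
dok = sparse.dok_array((3, 4))
dok[1, 2] = 5.0
dok[1, 2] += 1.0

# Convert both to CSR before doing arithmetic.
combined = lil.tocsr() + dok.tocsr()
print(combined.toarray())
```

For more than a few thousand updates, collecting triplets and building through COO is usually the faster route.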

### Special Constructors:
```python
# Identity
I = sparse.eye(n, format='csr')

# Diagonal
D = sparse.diags([diag_values], [offset], format='csr')

# Random
R = sparse.random(m, n, density=0.01, format='csr')

# Block diagonal
B = sparse.block_diag([A1, A2, A3], format='csr')
```

### Best Practices:
✓ **Batch construction**: Collect data, then build  
✓ **Use COO**: Fastest for construction  
✓ **Convert before operations**: COO → CSR/CSC  
✓ **Fix the shape up front**: Pass `shape=` explicitly rather than letting it be inferred from the largest index  
✓ **Handle duplicates deliberately**: Duplicate (row, col) entries are summed on conversion  