# Loading Matrix Market single-cell RNA data into TileDB

The data files are assumed to be in the same directory as this repository. The three files needed (`matrix.mtx.gz`, `features.tsv.gz`, and `barcodes.tsv.gz`) are available on Synapse: https://www.synapse.org/#!Synapse:syn22150189

[Matrix Market](https://math.nist.gov/MatrixMarket/formats.html) is a text format (usually with the extension `.mtx`) that describes the sparse cell by gene expression matrix and has entries only for the nonzero values. The accompanying barcodes and features text files list the cell barcodes and gene IDs, respectively.

In [18]:
import gzip
import itertools
from pathlib import Path

import numpy as np
import tiledb

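The rows of the `.mtx` file contain the coordinates and values for the nonzero entries in the cell by gene matrix. After the `%` comment lines, the first line is the size line: it gives the number of rows (features), the number of columns (cell barcodes), and the number of nonzero entries. The three numbers on that line are therefore not an expression count, even though the last one is very large. In every following line, the first column is the feature (gene) index, the second column is the cell barcode index, and the last column is the count.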

In [9]:
with gzip.open("matrix.mtx.gz", "rt") as f:
    for i, line in enumerate(f):
        print(line)
        if i > 5:
            break

%%MatrixMarket matrix coordinate integer general

%metadata_json: {"format_version": 2, "software_version": "3.1.0"}

55633 7202 29247318

53809 1 21

53264 1 1

53259 1 2

52877 1 1
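
As a rough sketch of how the header shown above can be read: in the Matrix Market coordinate format, the first line that does not start with `%` holds the matrix dimensions and the number of nonzero entries, and the data entries follow it. The helper name `read_mtx_header` is made up for this illustration, and the sample text is copied from the output above rather than read from the real file.

```python
import io

def read_mtx_header(lines):
    """Consume the banner, comment lines, and size line from an iterator of lines.

    Returns (banner, comments, (n_rows, n_cols, n_nonzero)).
    """
    banner = next(lines).strip()
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    comments = []
    for line in lines:
        if line.startswith("%"):
            comments.append(line.strip())
            continue
        # First non-comment line: rows, columns, number of nonzero entries.
        n_rows, n_cols, n_nonzero = (int(v) for v in line.split())
        return banner, comments, (n_rows, n_cols, n_nonzero)
    raise ValueError("missing size line")

# Sample lines copied from the notebook output (stand-in for matrix.mtx.gz).
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    '%metadata_json: {"format_version": 2, "software_version": "3.1.0"}\n'
    "55633 7202 29247318\n"
    "53809 1 21\n"
)
lines = iter(sample)
banner, comments, shape = read_mtx_header(lines)
first_entry = [int(v) for v in next(lines).split()]

print(shape)        # (features, barcodes, nonzero entries)
print(first_entry)  # first real data entry: gene index, cell index, count
```

With the real file, the same function could be handed the file object returned by `gzip.open("matrix.mtx.gz", "rt")`; the iterator is then positioned at the first data entry.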



The features file carries extra metadata about each gene (an identifier, a gene name, and a feature type) that would be unsuitable to store within the matrix itself.

In [7]:
with gzip.open("features.tsv.gz", "rt") as f:
    for i, line in enumerate(f):
        print(line)
        if i > 5:
            break

ENSMUSG00000102693.1	4933401J01Rik	Gene Expression

ENSMUSG00000064842.1	Gm26206	Gene Expression

ENSMUSG00000051951.5	Xkr4	Gene Expression

ENSMUSG00000102851.1	Gm18956	Gene Expression

ENSMUSG00000103377.1	Gm37180	Gene Expression

ENSMUSG00000104017.1	Gm37363	Gene Expression

ENSMUSG00000103025.1	Gm37686	Gene Expression
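
One way to keep this metadata alongside the matrix is to parse each tab-separated row into a small record whose position in the list matches the gene index. The sketch below uses rows copied from the output above. The field names `gene_id`, `gene_name`, and `feature_type` are assumptions about what the three columns mean, and `parse_features` is a hypothetical helper.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    gene_id: str       # assumed: first column, e.g. an Ensembl-style identifier
    gene_name: str     # assumed: second column, a gene symbol
    feature_type: str  # assumed: third column, e.g. "Gene Expression"

def parse_features(lines):
    """Split each non-empty tab-separated line into a Feature record."""
    return [
        Feature(*line.rstrip("\n").split("\t"))
        for line in lines
        if line.strip()
    ]

# Rows copied from the notebook output (stand-in for features.tsv.gz).
sample_lines = [
    "ENSMUSG00000102693.1\t4933401J01Rik\tGene Expression\n",
    "ENSMUSG00000064842.1\tGm26206\tGene Expression\n",
    "ENSMUSG00000051951.5\tXkr4\tGene Expression\n",
]
features_table = parse_features(sample_lines)

# List position i corresponds to the 0-based gene index i.
print(features_table[2].gene_name)
```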



The barcodes file lists all of the cell barcodes, one per line.

In [6]:
with gzip.open("barcodes.tsv.gz", "rt") as f:
    for i, line in enumerate(f):
        print(line)
        if i > 5:
            break

AAACCTGAGAGTCGGT-1

AAACCTGAGTCGTACT-1

AAACCTGAGTGGTCCC-1

AAACCTGCAATGGACG-1

AAACCTGCACAGTCGC-1

AAACCTGCAGCTGTTA-1

AAACCTGCAGTCAGCC-1



Now we can begin to look at loading the data into TileDB. The features and barcodes files are small enough to load into memory. In this case that would also be feasible for the `.mtx` file, but we will use a streaming approach instead.

In [15]:
with gzip.open("barcodes.tsv.gz", "rt") as f:
    barcodes = f.readlines()

with gzip.open("features.tsv.gz", "rt") as f:
    features = f.readlines()

num_barcodes = len(barcodes)
num_features = len(features)

print(barcodes[-1])
print(features[-1])

TTTGTCATCGTTACGA-1

gSpikein_phiX174	gSpikein_phiX174	Gene Expression



The approach here is very similar to the other notebook. We create a 2-D sparse array to hold the integer expression values, where the rows are cells and the columns are genes. The tile extents are laid out to favor selecting expression values for all cells across a small number of genes. This can be tuned by changing the `tile` parameter of `tiledb.Dim`. We use `np.uint32` for the counts so that very large values, as can occur for the spike-in control, do not overflow a 16-bit unsigned integer (maximum 65535).
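
To make the effect of the tile extents a little more concrete, here is a minimal sketch of the arithmetic only. It computes which tile of the coordinate grid a (cell, gene) pair falls into for a given pair of extents, and it does not model how TileDB actually stores sparse data on disk. The function `space_tile` and the 100 by 100 alternative are hypothetical; `num_barcodes = 7202` is the value reported in this notebook.

```python
def space_tile(cell, gene, cell_tile, gene_tile):
    """Grid index of the tile containing coordinate (cell, gene)."""
    return (cell // cell_tile, gene // gene_tile)

num_barcodes = 7202  # value reported in this notebook

# Layout used here: one tile spans every cell and two genes, so all
# cells for a single gene land in the same tile.
tiles_column_layout = {
    space_tile(c, 10, num_barcodes, 2) for c in range(num_barcodes)
}

# Hypothetical alternative: square 100 x 100 tiles spread the same
# gene's values across many tiles.
tiles_square_layout = {
    space_tile(c, 10, 100, 100) for c in range(num_barcodes)
}

print(len(tiles_column_layout), len(tiles_square_layout))

# Integer ranges behind the dtype choice for the counts attribute.
UINT16_MAX = 2**16 - 1
UINT32_MAX = 2**32 - 1
print(UINT16_MAX, UINT32_MAX)
```

Under these assumptions a query for one gene across all cells touches a single tile with the column-oriented extents, versus 73 tiles with the square ones.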

One thing we would like to do is batch the writes to the array for performance. We can read many rows of the input `.mtx` at a time, generate the lists for the x and y coordinates and the counts we need to set, and then write them. Note that the `.mtx` uses 1-based indexing while TileDB expects 0-based indices, so we will need to convert.
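
The batching and index conversion described above can be sketched in plain Python before involving TileDB. The names `batches` and `to_zero_based` are hypothetical helpers for this illustration, the batch size of 3 is only there to show the split, and the entry lines are copied from the `.mtx` output earlier in the notebook.

```python
import itertools

def batches(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

def to_zero_based(batch):
    """Parse 'gene cell count' lines (1-based) into 0-based coordinate lists."""
    cells, genes, counts = [], [], []
    for line in batch:
        gene_idx, cell_idx, count = (int(v) for v in line.split())
        cells.append(cell_idx - 1)  # second column is the cell barcode index
        genes.append(gene_idx - 1)  # first column is the gene index
        counts.append(count)
    return cells, genes, counts

# Data entries copied from the notebook output (stand-in for matrix.mtx.gz).
entry_lines = ["53809 1 21\n", "53264 1 1\n", "53259 1 2\n", "52877 1 1\n"]
converted = [to_zero_based(b) for b in batches(entry_lines, 3)]

for cells, genes, counts in converted:
    print(cells, genes, counts)
```

Each `(cells, genes, counts)` triple is the shape of data a single batched write to the sparse array would need.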

In [None]:
def chunks(iterable, chunk_size):
    """
    See https://stackoverflow.com/a/8998040
    """
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, chunk_size)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

group_name = "sc-matrix"
group_path = Path(group_name)

if not group_path.exists():
    tiledb.group_create(group_name)

counts_array_path = group_path / "sc-matrix-counts"
if not counts_array_path.exists():
    dom = tiledb.Domain(
        tiledb.Dim(name="cells", domain=(0, num_barcodes - 1), tile=num_barcodes, dtype=np.uint32),
        tiledb.Dim(name="genes", domain=(0, num_features - 1), tile=2, dtype=np.uint32),
    )
    schema = tiledb.ArraySchema(
        domain=dom,
        sparse=True,
        attrs=(tiledb.Attr(name="counts", dtype=np.uint32,),)
    )
    tiledb.SparseArray.create(str(counts_array_path), schema)
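    with tiledb.SparseArray(str(counts_array_path), mode='w') as A:
        with gzip.open("matrix.mtx.gz", "rt") as f:
            # Skip the leading '%' comment lines, then the size line
            # (number of features, number of barcodes, number of nonzero entries).
            entries = itertools.dropwhile(lambda line: line.startswith("%"), f)
            next(entries)
            # The chunk size is arbitrary; tune it to the available memory.
            for chunk in chunks(entries, 100_000):
                # Each line is "gene_index cell_index count" with 1-based indices.
                rows = np.array([[int(v) for v in line.split()] for line in chunk])
                cells = (rows[:, 1] - 1).astype(np.uint32)
                genes = (rows[:, 0] - 1).astype(np.uint32)
                A[cells, genes] = {"counts": rows[:, 2].astype(np.uint32)}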

In [16]:
num_barcodes

7202