In [25]:
import genvarloader as gvl
import numba as nb
import numpy as np
import polars as pl
import seqpro as sp
import pooch

from tqdm.auto import tqdm

# Tutorial: Geuvadis

In this tutorial we'll see how to use GenVarLoader (GVL) to:

1. Write a GVL dataset to disk
2. Inspect the dataset
3. Optional: write transformed versions of the tracks to disk
4. Add on-the-fly transformations
5. Obtain splits from the dataset
6. Get a PyTorch DataLoader

## Downloading the data

The Geuvadis dataset comprises 451 individuals from the 1000 Genomes Project who have both whole-genome sequencing and RNA-seq from blood samples. We'll see how to use GVL to get a high-performance dataloader that yields haplotypes and tracks for training or running inference with sequence models. For this tutorial, we'll only work with chromosome 22 so everything can run in a few minutes.

Downloading this data should take ~5-10 minutes and is the slowest step in this notebook.

In [2]:
# GRCh38 chromosome 22 sequence
reference = pooch.retrieve(
    url="https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz",
    known_hash="sha256:974f97ac8ef7ffae971b63b47608feda327403be40c27e391ee4a1a78b800df5",
    progressbar=True,
)
!gzip -dc {reference} | bgzip > {reference[:-3]}.bgz
reference = reference[:-3] + ".bgz"

# PLINK 2 files
variants = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.pgen",
    known_hash="md5:31aba970e35f816701b2b99118dfc2aa",
    progressbar=True,
    fname="1kGP.chr22.pgen",
)
pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.psam",
    known_hash="md5:eefa7aad5acffe62bf41df0a4600129c",
    progressbar=True,
    fname="1kGP.chr22.psam",
)
pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/1kGP.chr22.pvar",
    known_hash="md5:5f922af91c1a2f6822e2f1bb4469d12b",
    progressbar=True,
    fname="1kGP.chr22.pvar",
)

# BigWigs and sample ID mapping
bw_paths = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/bw_chr22.tar.gz",
    known_hash="md5:14bf72e9e9d3e2318d07315c4a2675fb",
    progressbar=True,
    processor=pooch.Untar(),
)
bw_table_path = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/bigwig_table.csv",
    known_hash="md5:7fe7c55b61c7dfa66cfd0a49336f3b08",
    progressbar=True,
)

# BED
bed_path = pooch.retrieve(
    url="doi:10.5281/zenodo.13656224/chr22_egenes.bed",
    known_hash="md5:ccb55548e4ddd416d50dbe6638459421",
    progressbar=True,
)

Downloading data from 'https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz' to file '/carter/users/dlaub/.cache/pooch/edfb24b9fee5f1060c26e092a696e447-Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz'.
100%|█████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 14.8GB/s]
Downloading data from 'doi:10.5281/zenodo.13656224/1kGP.chr22.pgen' to file '/carter/users/dlaub/.cache/pooch/1kGP.chr22.pgen'.
100%|█████████████████████████████████████| 20.4M/20.4M [00:00<00:00, 17.7GB/s]
Downloading data from 'doi:10.5281/zenodo.13656224/1kGP.chr22.psam' to file '/carter/users/dlaub/.cache/pooch/1kGP.chr22.psam'.
100%|█████████████████████████████████████| 6.38k/6.38k [00:00<00:00, 8.98MB/s]
Downloading data from 'doi:10.5281/zenodo.13656224/1kGP.chr22.pvar' to file '/carter/users/dlaub/.cache/pooch/1kGP.chr22.pvar'.
100%|█████████████████████████████████████| 1.17G/1.17G [00:00<00:00, 1.09TB/s]
Downloading data from 'doi:10.5281/zenodo.13

## Writing the GVL dataset

We'll specify a path to store the dataset, which is a directory (like Zarr stores if you're familiar with those).

In [3]:
ds_path = "geuvadis.chr22.gvl"

We'll also need a table or dictionary specifying the sample names for each BigWig. We'll use a table here, which must have at least the columns `sample` and `path`, as seen below. The join is added here to replace the file names in the table with the actual download paths.

In [4]:
bigwig_table = (
    pl.read_csv(bw_table_path)
    .join(
        pl.Series(bw_paths).to_frame("realpath"),
        left_on="path",
        right_on=pl.col("realpath").str.split("/").list.get(-1),
    )
    .drop("path")
    .rename({"realpath": "path"})
)
bigwig_table.head()

sample,read_count,path
str,i64,str
"""HG00236""",34548283,"""/carter/users/dlaub/.cache/poo…"
"""HG00259""",53041143,"""/carter/users/dlaub/.cache/poo…"
"""NA20519""",36620358,"""/carter/users/dlaub/.cache/poo…"
"""NA20811""",24398971,"""/carter/users/dlaub/.cache/poo…"
"""NA20768""",30019566,"""/carter/users/dlaub/.cache/poo…"


Finally, we'll need a BED file specifying what regions to include in the dataset. We can either specify a path or a polars DataFrame. We'll use [gvl.read_bedlike](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.read_bedlike) to conveniently read the BED file into memory and subset it to just the first 5 regions for this tutorial. The BED file provided corresponds to eGenes, sorted in descending order by their absolute sum of coefficients.

In [5]:
bed = gvl.read_bedlike(bed_path)[:5]
bed.head()

chrom,chromStart,chromEnd,name,score,strand
str,i64,i64,str,f64,str
"""chr22""",41699499,41699499,"""ENSG00000167077""",,"""+"""
"""chr22""",42835412,42835412,"""ENSG00000100266""",,"""-"""
"""chr22""",20858983,20858983,"""ENSG00000099940""",,"""+"""
"""chr22""",20707691,20707691,"""ENSG00000241973""",,"""-"""
"""chr22""",49918167,49918167,"""ENSG00000184164""",,"""+"""


Now, we're ready to write the dataset.

We'll instantiate a [gvl.BigWigs](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.BigWigs) object here, which has alternative constructors in case we don't want to use a table. We also name this track "depth" (as in read depth) so we can manage different transformations of the track data or provide multiple tracks for the same samples. Later, we'll add a transformed track for $\log(\text{CPM}+1)$ (natural log, computed with `np.log1p`) to see this in action.

We'll also pass `max_jitter=128`. This allows random jittering of the sequences and tracks by up to 128 bp in either direction. When we open the dataset later, it will use the maximum amount of jitter by default.

In [9]:
gvl.write(
    path=ds_path,
    bed=bed,
    variants=variants,
    bigwigs=gvl.BigWigs.from_table(name="depth", table=bigwig_table),
    length=2**15,
    max_jitter=128,
    overwrite=True,
)

2024-09-03 16:05:37.755 | INFO     | genvarloader._dataset._write:write:74 - Writing dataset to geuvadis.chr22.gvl
2024-09-03 16:05:37.897 | INFO     | genvarloader._variants._records:read_pvar:432 - Reading .pvar file...
2024-09-03 16:05:39.368 | INFO     | genvarloader._variants._records:read_pvar:440 - Finished reading .pvar file.
2024-09-03 16:05:40.821 | INFO     | genvarloader._dataset._write:write:137 - Using 451 samples.
2024-09-03 16:05:40.822 | INFO     | genvarloader._dataset._write:write:143 - Writing genotypes.


  0%|          | 0/1 [00:00<?, ?it/s]

2024-09-03 16:05:41.080 | DEBUG    | genvarloader._dataset._write:_read_variants_chunk:381 - region length 34024
2024-09-03 16:05:41.081 | DEBUG    | genvarloader._dataset._write:_read_variants_chunk:387 - read genotypes
2024-09-03 16:05:49.475 | DEBUG    | genvarloader._dataset._write:_read_variants_chunk:398 - get haplotype region ilens
2024-09-03 16:05:49.507 | DEBUG    | genvarloader._dataset._write:_read_variants_chunk:404 - average haplotype length 34010.851219512195
2024-09-03 16:05:49.509 | DEBUG    | genvarloader._dataset._write:_read_variants_chunk:407 - max missing length -726
2024-09-03 16:05:49.510 | DEBUG    | genvarloader._dataset._write

  0%|          | 0/1 [00:00<?, ?it/s]

2024-09-03 16:05:50.483 | INFO     | genvarloader._dataset._write:write:170 - Finished writing.


Note that [gvl.write](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.write) will also automatically use the intersection of samples from source files. In this case, they are perfectly matched to each other. If we had instead used PLINK files for the full 3,202 samples from the 1000 Genomes Project, it would have identified and used the 451 intersecting samples.
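
To illustrate what "intersection of samples" means, here is a minimal sketch in plain Python. The sample IDs are hypothetical, and the sorted ordering is an assumption made for this example; it is not a description of how GVL orders samples internally.

```python
# Hypothetical sample IDs; real ones would come from the .psam file and
# the BigWig table.
genotype_samples = ["HG00096", "HG00097", "HG00099", "NA12878"]
track_samples = ["NA12878", "HG00097", "NA99999"]

# Keep only samples present in every source, in a stable (sorted) order.
shared = sorted(set(genotype_samples) & set(track_samples))
print(shared)  # ['HG00097', 'NA12878']
```

Samples that appear in only one source (here `HG00096`, `HG00099`, and `NA99999`) are simply left out.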

## Inspecting the dataset

In [10]:
ds = gvl.Dataset.open(ds_path)

2024-09-03 16:05:55.644 | INFO     | genvarloader._dataset:_open:175 -
GVL store geuvadis.chr22.gvl
Is subset: False
# of regions: 5
# of samples: 451
Original region length: 32,768
Max jitter: 128
Has genotypes: True
Has tracks: ['depth']


If we don't provide a reference genome to a dataset that has genotypes, we will get an informative warning and the dataset will never provide haplotypes. Let's go ahead and specify a reference genome.

In [11]:
ds = gvl.Dataset.open(ds_path, reference=reference)

2024-09-03 16:05:56.621 | INFO     | genvarloader._dataset:_open:122 - Loading reference genome into memory. This typically has a modest memory footprint (a few GB) and greatly improves performance.
2024-09-03 16:05:58.380 | INFO     | genvarloader._dataset:_open:175 -
GVL store geuvadis.chr22.gvl
Is subset: False
# of regions: 5
# of samples: 451
Original region length: 32,768
Max jitter: 128
Has genotypes: True
Has tracks: ['depth']


Now that a reference genome is provided, haplotypes can be returned. We are also given some summary information about this dataset. Let's use the dataset to inspect a few sequences and tracks, and see how we can adjust what is returned.

In [12]:
ds[0]

(array([[b'G', b'T', b'G', ..., b'T', b'G', b'T'],
        [b'C', b'C', b'A', ..., b'A', b'C', b'T']], dtype='|S1'),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32))

Indexing into a GVL dataset corresponds to the raveled indices, so the 0-th index is the data for the first region and sample.

Since this dataset has jitter enabled (the maximum amount by default), we will get different data each time we access it. We can disable jittering, but we will still get randomly shifted data for haplotypes that are longer than the output length due to indels. We can also provide a seed to the dataset for determinism, see [gvl.Dataset.with_settings](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset.with_settings).

We also are receiving both haplotypes and tracks from the dataset, and they have an additional dimension for ploidy.

In [13]:
[a.shape for a in ds[0]]

[(2, 32768), (2, 32768)]

We can disable returning haplotypes and return reference sequences instead, and now the ploidy dimension will be gone. We can also see that disabling jitter increases the sequence length to the maximum available (32,768 + 2 × 128 = 33,024). We could disable jittering without altering the sequence length by slicing the arrays on the fly with a transform.

In [14]:
ref_ds = ds.with_settings(jitter=0, return_sequences='reference')
[a.shape for a in ref_ds[0]]

[(33024,), (33024,)]
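
As a rough sketch of such a slicing transform, the function below center-crops both arrays along their last axis. It assumes a transform receives the sequences and tracks as NumPy arrays and returns them as a tuple; `make_center_crop` is a hypothetical helper, not part of GVL, and the arrays here are stand-ins rather than real dataset output.

```python
import numpy as np


def make_center_crop(length: int):
    """Return a transform that crops the last axis of each array to `length`."""

    def center_crop(seqs: np.ndarray, tracks: np.ndarray):
        # Drop an equal number of positions from both ends.
        start = (seqs.shape[-1] - length) // 2
        window = slice(start, start + length)
        return seqs[..., window], tracks[..., window]

    return center_crop


# Stand-in arrays with the un-jittered length seen above (32,768 + 2 * 128).
seqs = np.full(33024, b"N", dtype="S1")
tracks = np.zeros(33024, dtype=np.float32)

cropped_seqs, cropped_tracks = make_center_crop(2**15)(seqs, tracks)
print(cropped_seqs.shape, cropped_tracks.shape)  # (32768,) (32768,)
```

Under those assumptions, something like `ds.with_settings(jitter=0, transform=make_center_crop(2**15))` would return fixed, centered windows of the original length.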

We can also slice the dataset or use lists/arrays of indices to get batches of data.

In [15]:
[a.shape for a in ds[:10]]

[(10, 2, 32768), (10, 2, 32768)]

In [16]:
[a.shape for a in ds[[0, 3, 999]]]

[(3, 2, 32768), (3, 2, 32768)]

## Optional: pre-computing transformed tracks and saving them to disk

Suppose we would like to normalize the read depth across the dataset to account for library size. We could compute this on-the-fly, but GVL also offers a way to write this data back to disk to cache this computation and potentially improve performance. Note that this is the most technical part of this tutorial, so feel free to skip this and come back later.

In [37]:
sample_library_sizes = (
    pl.Series(ds.samples)
    .to_frame("sample")
    .join(bigwig_table, on="sample", how="left")["read_count"]
    .to_numpy()
)
sample_library_sizes[:5]

array([27256165, 43941108, 39687917, 22341838, 23258231])

For this step, we'll use [gvl.Dataset.write_transformed_track](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset.write_transformed_track) which expects a transform function to be given. From the docs:

> The arguments given to the transform will be the dataset indices, region indices, and sample indices as numpy arrays and the tracks themselves as a [Ragged](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Ragged) array with shape (regions, samples). The tracks must be a [Ragged](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Ragged) array since regions may be different lengths to accommodate indels. This function should then return the transformed tracks as a [Ragged](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Ragged) array with the same shape and lengths.

Below, you can see an example of a transform of ragged data that uses Numba to accelerate the computation. Note that working with [Ragged](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Ragged) arrays is generally not necessary with on-the-fly transformations, since the data is processed to be uniform length before any transformation.

In [18]:
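# Compiled with Numba; `sample_library_sizes` is read as a global constant.
@nb.njit(parallel=True, nogil=True, fastmath=True)
def inner_transform(s_idx, data, offsets):
    log_cpm = np.empty_like(data)
    # Segment i of the flattened tracks is data[offsets[i]:offsets[i + 1]].
    for i in nb.prange(len(offsets) - 1):
        start = offsets[i]
        end = offsets[i + 1]
        sample = s_idx[i]
        # log(CPM + 1): scale depth by library size in millions, then log1p.
        log_cpm[start:end] = np.log1p(
            data[start:end] / sample_library_sizes[sample] * 1e6
        )
    return log_cpm


def log_cpm(ds_idx, r_idx, s_idx, tracks: gvl.Ragged[np.float32]):
    data = inner_transform(s_idx, tracks.data, tracks.offsets)
    # Re-wrap the flat result with the original shape and offsets.
    return gvl.Ragged.from_offsets(data, tracks.shape, tracks.offsets)


# Write the transformed track "lcpb" (derived from "depth") to disk.
ds = ds.write_transformed_track(
    "lcpb", "depth", log_cpm, overwrite=True, max_mem=4 * 2**30
)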

  0%|          | 0/1 [00:00<?, ?it/s]
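
To make the ragged layout more concrete, here is a small NumPy-only sketch of the same normalization on toy data. It assumes the flat `data` plus `offsets` convention used by the transform above (segment `i` is `data[offsets[i]:offsets[i + 1]]`); the values and library sizes are made up, and this is an illustration rather than GVL's implementation.

```python
import numpy as np

# Three variable-length track segments stored back to back.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=np.float32)
offsets = np.array([0, 2, 5, 6])  # segment i is data[offsets[i]:offsets[i + 1]]
sample_idx = np.array([0, 1, 0])  # which sample each segment belongs to
library_sizes = np.array([1_000_000, 2_000_000])  # hypothetical read counts

# Repeat each segment's library size across its elements, then normalize.
lengths = np.diff(offsets)
per_element_size = np.repeat(library_sizes[sample_idx], lengths)
log_cpm_values = np.log1p(data / per_element_size * 1e6)

# Split the flat result back into per-segment views using the same offsets.
segments = [log_cpm_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])]
print([seg.shape for seg in segments])  # [(2,), (3,), (1,)]
```

Because the output reuses the input offsets, the transformed data keeps the same shape and segment lengths, which is what the quoted requirement asks for.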

## On-the-fly transformations

One thing you may have noticed is that the sequences are output as ASCII characters. We'll often need to either tokenize or one-hot encode them for machine learning models. We can do this on-the-fly with, for example, fast implementations from [SeqPro](https://github.com/ML4GLand/SeqPro), but in general arbitrary transformations can be used.

In [19]:
def tokenize_transform(haplotypes, tracks):
    return sp.tokenize(haplotypes, dict(zip(sp.DNA.alphabet, range(4))), 4), tracks


def ohe_transform(haplotypes, tracks):
    return sp.DNA.ohe(haplotypes), tracks


token_ds = ds.with_settings(transform=tokenize_transform)
ohe_ds = ds.with_settings(transform=ohe_transform)

In [20]:
token_ds[0][0], ohe_ds[0][0]

(array([[0, 0, 3, ..., 3, 1, 3],
        [3, 3, 0, ..., 2, 2, 1]], dtype=int32),
 array([[[1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 1],
         ...,
         [0, 0, 1, 0],
         [0, 0, 1, 0],
         [1, 0, 0, 0]],
 
        [[0, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 1, 0, 0],
         ...,
         [0, 0, 0, 1],
         [0, 0, 1, 0],
         [0, 0, 1, 0]]], dtype=uint8))
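
For intuition about what these two encodings produce, here is a NumPy-only sketch. It is not SeqPro's implementation: the alphabet order (A, C, G, T) and the unknown token (4) mirror the call above, while encoding unknown bases as all-zero one-hot rows is an assumption made for this example.

```python
import numpy as np

# ASCII codes for the alphabet, in the order A, C, G, T.
ALPHABET = np.frombuffer(b"ACGT", dtype=np.uint8)


def tokenize(seqs: np.ndarray, unknown: int = 4) -> np.ndarray:
    """Map each base to its index in ALPHABET; anything else to `unknown`."""
    matches = seqs.view(np.uint8)[..., None] == ALPHABET  # (..., length, 4)
    tokens = matches.argmax(-1).astype(np.int32)
    tokens[~matches.any(-1)] = unknown
    return tokens


def one_hot(seqs: np.ndarray) -> np.ndarray:
    """One-hot encode bases; bases outside ALPHABET become all-zero rows."""
    matches = seqs.view(np.uint8)[..., None] == ALPHABET
    return matches.astype(np.uint8)


seqs = np.array([[b"A", b"C", b"N"], [b"T", b"G", b"A"]], dtype="S1")
print(tokenize(seqs))       # [[0 1 4], [3 2 0]]
print(one_hot(seqs).shape)  # (2, 3, 4)
```

Either function could be wrapped in a `(haplotypes, tracks)` transform in the same way as `tokenize_transform` and `ohe_transform`.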

## Splitting datasets

Suppose we're training a model and thus need to split our dataset. Let's subset the dataset to the first 400 samples for training.

In [21]:
train_ds = ds.subset_to(samples=ds.samples[:400])
train_ds

GVL store geuvadis.chr22.gvl
Is subset: True
# of regions: 5
# of samples: 400
Original region length: 32,768
Max jitter: 128
Has genotypes: True
Has tracks: ['depth', 'lcpb']

We can see that the dataset is now marked as a subset and the number of samples has dropped from 451 to 400. Some other properties reflect these changes as well:

In [22]:
len(ds), len(train_ds), ds.shape, train_ds.shape

(2255, 2000, (5, 451), (5, 400))
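
Taking the first 400 samples is convenient for a tutorial, but in practice one might shuffle the sample names before splitting. The sketch below shows one way to do that with NumPy; the sample names are placeholders standing in for `ds.samples`, and the 400/51 split sizes simply mirror the numbers above.

```python
import numpy as np

# Hypothetical sample names; with a real dataset these would be `ds.samples`.
samples = np.array([f"sample_{i:03d}" for i in range(451)])

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
shuffled = rng.permutation(samples)

train_samples, val_samples = shuffled[:400], shuffled[400:]

# With 5 regions, dataset length is regions * samples for each split.
n_regions = 5
print(len(train_samples) * n_regions, len(val_samples) * n_regions)  # 2000 255
```

The resulting name arrays (converted to lists if needed) could then be passed to `subset_to(samples=...)` to build the training and validation datasets.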

After splitting a dataset, it can be very useful to have indices mapping each instance to its region and sample in the full dataset. GVL datasets can return these by enabling `return_indices`. When this is enabled, three arrays are appended to each instance returned: the full-dataset, region, and sample indices, respectively. For example, we can see that the instance at index 400 of the train dataset (its 401st instance) maps to index 451 in the full dataset, reported below as region index 0 and sample index 1.

In [50]:
train_ds.with_settings(return_indices=True)[400]

(array([[b'G', b'A', b'C', ..., b'T', b'T', b'C'],
        [b'C', b'T', b'G', ..., b'T', b'C', b'C']], dtype='|S1'),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
 array([451]),
 array([0]),
 array([1]))

These indices can be used to index into additional data that has no sequence length. For example, if we wanted to predict RNA-seq counts instead of read depth, we could use it to index into a gene expression table of counts. Or if we were working with chromatin accessibility data, we could do the same with a table of peak counts.
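
As a minimal sketch of that lookup, suppose we had a (regions, samples) matrix of expression counts aligned with the full dataset. The matrix below is randomly generated and the index arrays are made up; only the indexing pattern is the point.

```python
import numpy as np

n_regions, n_samples = 5, 451
rng = np.random.default_rng(0)

# Hypothetical expression matrix with one count per (region, sample) pair.
expression = rng.poisson(lam=20, size=(n_regions, n_samples))

# Region and sample indices as they might arrive alongside a batch of 3 instances.
region_idx = np.array([0, 0, 3])
sample_idx = np.array([1, 450, 17])

# Paired fancy indexing picks one target value per instance.
targets = expression[region_idx, sample_idx]
print(targets.shape)  # (3,)
```

Because the indices refer to the full dataset, the same table works unchanged for any subset, such as the training split above.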

## Getting a PyTorch DataLoader

In [35]:
train_dl = train_ds.to_dataloader(batch_size=64, shuffle=True)

for batch in tqdm(train_dl):
    pass

batch[0].shape, batch[1].shape

  0%|          | 0/32 [00:00<?, ?it/s]

(torch.Size([16, 2, 32768]), torch.Size([16, 2, 32768]))
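
The batch dimension of 16 above comes from the final, partial batch. A quick check of the arithmetic, assuming the last incomplete batch is kept rather than dropped (which matches the 32 iterations shown):

```python
n_instances = 5 * 400  # regions * samples in the training subset
batch_size = 64

full_batches, remainder = divmod(n_instances, batch_size)
n_batches = full_batches + (1 if remainder else 0)
last_batch_size = remainder or batch_size

print(n_batches, last_batch_size)  # 32 16
```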

In addition, since GVL provides a map-style PyTorch Dataset, it is compatible with distributed data parallel (DDP) training across multiple GPUs or nodes.