# Managing cloud-hosted `momics` repositories

`momics` natively supports cloud-hosted repositories: you can create a repository directly on cloud storage services such as S3, GCP or Azure buckets. This notebook shows how to create, manage and query cloud-hosted repositories using `momics`.

## Prerequisites

To use cloud-hosted repositories, you need to have the following:

1. A cloud storage service account with the necessary permissions to create and manage buckets.
2. The credentials required to access the cloud storage service, for example a service account key or a token.
3. The `momics` package installed in your Python environment.

We will use an S3 bucket for demonstration purposes. You can use the same steps for other cloud storage services as well.

## Creating a `momics` repository on an S3 bucket

The first step in creating an S3-hosted repository is to set up the configuration that `momics` will use. For S3, you need to provide the following information:

1. The access key and secret key for the S3 account.
2. The region where the bucket is located.
3. The name of a bucket where the repository will be created. You will need authorization to read from and write to this bucket.
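
Before building the configuration, it can help to confirm that these settings are actually available. The following is a minimal sketch, not part of `momics`: `collect_s3_settings` is a hypothetical helper, and it only assumes that the credentials live in the two environment variables read in the cell below.

```python
# Hypothetical helper (not part of momics): gather S3 settings from a mapping
# such as os.environ and fail early if any of them is missing or empty.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def collect_s3_settings(env, region):
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing S3 credentials: " + ", ".join(missing))
    if not region:
        raise ValueError("An S3 region such as 'eu-west-3' is required")
    # The keys mirror the keyword arguments passed to S3Config in this notebook.
    return {
        "region": region,
        "access_key_id": env["AWS_ACCESS_KEY_ID"],
        "secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

Calling `collect_s3_settings(os.environ, "eu-west-3")` raises a clear error when a variable is unset, instead of silently passing `None` on to the configuration.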

In [2]:
from momics import config as mconfig
import os

s3_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
s3_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY")

s3_cfg = mconfig.S3Config(region="eu-west-3", access_key_id=s3_access_key_id, secret_access_key=s3_secret_access_key)
momics_cfg = mconfig.MomicsConfig(s3=s3_cfg)
momics_cfg.cfg

In [3]:
"The registered S3 secret access key starts with: " + momics_cfg.cfg.get("vfs.s3.aws_secret_access_key")[0:10] + "..."

'The registered S3 secret access key starts with: 2FgOW+IT5e...'
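
Displaying part of a secret key is handy for a quick check, but notebook outputs are easy to share by accident. A minimal sketch of a masking helper that reveals fewer characters is shown here; `mask_secret` is a hypothetical name and is not part of `momics`.

```python
def mask_secret(value, visible=4):
    """Return a display-safe version of a secret: a short prefix, then asterisks."""
    if not value:
        return "<not set>"
    shown = value[:visible]
    # Replace every remaining character with an asterisk.
    return shown + "*" * (len(value) - len(shown))
```

The helper also makes a missing key obvious, since `None` or an empty string is reported as `<not set>` rather than raising an error.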

Now that the S3 configuration is set up, you can create a `momics` repository as you would locally. The only two differences are that you need to 1) use an `s3://<your_bucket>` path and 2) pass the S3 configuration to the `Momics` constructor.
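
Because the repository path must start with `s3://`, a quick check of the URI can catch typos before any request is sent. This is a small standard-library sketch; `split_s3_uri` is a hypothetical helper, not a `momics` function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/repo.mom' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_uri("s3://momics/my_repo.mom")` separates the bucket name from the repository key, while a local path raises a `ValueError`.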

In [14]:
from momics.momics import Momics

mom = Momics("s3://momics/my_repo.mom", config=momics_cfg)

momics :: INFO :: 2025-01-29 17:30:57,759 :: Created s3://momics-test/my_repo2.mom


If the repository does not exist, it will be automatically created, and a log message will be displayed. If the repository already exists, it will be opened and you can directly interact with it. 

Now that we created a repository, let's try and fetch chromosome sizes from it!

In [15]:
mom.chroms()
"Number of chromosomes: " + str(len(mom.chroms()))

'Number of chromosomes: 0'

We can see that chromosomes have not been added to the repository yet. This is because we have just created the repository and it is empty. In the next section, we will see how to add data to the repository.

**Note:** we do not pass the `config` argument to the `chroms()` method. The configuration is stored in the `Momics` object itself.

In [6]:
mom.cfg

<momics.config.MomicsConfig at 0x7c900830ab00>

On top of storing the secrets required to access the cloud storage service, the `MomicsConfig` object extends the `TileDB` configuration object. This means that you can use additional TileDB configuration features to interact with your repository. A notable example is the TileDB virtual file system (VFS), which lets you work with files and directories located on cloud storage services in much the same way as local ones.

In [7]:
mom.cfg.vfs.ls_recursive(mom.path)

['s3://momics-test/my_repo2.mom/__tiledb_group.tdb',
 's3://momics-test/my_repo2.mom/annotations/__tiledb_group.tdb',
 's3://momics-test/my_repo2.mom/coverage/__tiledb_group.tdb',
 's3://momics-test/my_repo2.mom/genome/__tiledb_group.tdb',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__commits/__1738168070623_1738168070623_708cec572083e220509251572f7a09d5_22.wrt',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__fragments/__1738168070623_1738168070623_708cec572083e220509251572f7a09d5_22/__fragment_metadata.tdb',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__fragments/__1738168070623_1738168070623_708cec572083e220509251572f7a09d5_22/a0.tdb',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__fragments/__1738168070623_1738168070623_708cec572083e220509251572f7a09d5_22/a0_var.tdb',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__fragments/__1738168070623_1738168070623_708cec572083e220509251572f7a09d5_22/a1.tdb',
 's3://momics-test/my_repo2.mom/genome/chroms.tdb/__meta/__

So far, the repository does not contain any data. 
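
A recursive listing like the one above quickly becomes long. As a rough sketch, the paths can be summarised by their top-level entry under the repository; `count_by_group` is a hypothetical helper that works on any list of path strings and does not call `momics` or TileDB.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_by_group(paths, repo_uri):
    """Count listed objects per top-level entry under repo_uri."""
    prefix = repo_uri.rstrip("/") + "/"
    counts = Counter()
    for path in paths:
        relative = path[len(prefix):] if path.startswith(prefix) else ""
        if not relative:
            continue  # skip paths outside the repository, or the repository itself
        counts[PurePosixPath(relative).parts[0]] += 1
    return dict(counts)
```

Passing the result of the listing call together with `mom.path` would give one count per group (for instance `genome` or `coverage`), which is easier to scan than the full list.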

## Adding data to the repository

Just like local repositories, populating a cloud-hosted repository has to start with registering chromosomes. 

In [None]:
## We will get chromosome sizes from a local fasta file.
from pyfaidx import Fasta

f = Fasta("/data/momics/data/S288c.fa")
chrom_lengths = {chrom: len(seq) for chrom, seq in zip(f.keys(), f.values())}

mom.ingest_chroms(chrom_lengths, genome_version="S288c")
mom.chroms()

Unnamed: 0,chrom_index,chrom,length
0,0,I,230218
1,1,II,813184
2,2,III,316620
3,3,IV,1531933
4,4,V,576874
5,5,VI,270161
6,6,VII,1090940
7,7,VIII,562643
8,8,IX,439888
9,9,X,745751
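
Chromosome sizes can also be read from a samtools-style `.fai` index, which stores one tab-separated line per sequence with the name in the first column and the length in the second. The sketch below assumes such an index exists for the FASTA file (`pyfaidx` typically writes one next to it); `chrom_lengths_from_fai` is a hypothetical helper name.

```python
def chrom_lengths_from_fai(lines):
    """Build {chromosome: length} from the lines of a .fai index."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        # Column 1 is the sequence name, column 2 its length in bases.
        lengths[fields[0]] = int(fields[1])
    return lengths
```

The function accepts any iterable of lines, so an open file handle on the index can be passed directly, and the resulting dictionary has the same shape as `chrom_lengths` above.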


Once the chromosomes are registered, you can ingest data into the repository, e.g. the genome reference sequence or genome-wide tracks.

In [9]:
## Ingesting genome reference sequence
mom.ingest_sequence("/data/momics/data/S288c.fa")
mom.seq()

momics :: INFO :: 2024-10-21 09:08:11,417 :: Genome sequence ingested in 47.2602s.


Unnamed: 0,chrom_index,chrom,length,seq
0,0,I,230218,CCACACCACA...TGTGTGTGGG
1,1,II,813184,AAATAGCCCT...GTGGGTGTGT
2,2,III,316620,CCCACACACC...GGTGTGTGTG
3,3,IV,1531933,ACACCACACC...TAGCTTTTGG
4,4,V,576874,CGTCTCCTCC...TTTTTTTTTT
5,5,VI,270161,GATCTCGCAA...TGGTGTGTGG
6,6,VII,1090940,CCACACCCAC...TTTTTTTTTT
7,7,VIII,562643,CCCACACACA...GTGTGTGTGG
8,8,IX,439888,CACACACACC...GTGTGTGTGT
9,9,X,745751,CCCACACACA...GTGTGGGTGT


In [10]:
## Ingesting genome-wide tracks
mom.ingest_tracks(
    {
        "atac": "/data/momics/data/S288c_atac.bw",
        "rna": "/data/momics/data/S288c_rna.bw",
        "scc1": "/data/momics/data/S288c_scc1.bw",
        "mnase": "/data/momics/data/S288c_mnase.bw",
    }
)
mom.tracks()

momics :: INFO :: 2024-10-21 09:10:27,371 :: 4 tracks ingested in 132.0506s.


Unnamed: 0,idx,label,path
0,0,atac,/data/momics/S288c_atac.bw
1,1,rna,/data/momics/S288c_rna.bw
2,2,scc1,/data/momics/S288c_scc1.bw
3,3,mnase,/data/momics/S288c_mnase.bw


## Querying data from the repository

Now that we have added data to the repository, we can query specific genomic ranges using `MomicsQuery` objects. 

In [11]:
## We define non-overlapping windows of 1 kb over the entire S288c genome
windows = mom.bins(1000, stride=1000, cut_last_bin_out=True)
windows

Unnamed: 0,Chromosome,Start,End
0,I,0,1000
1,I,1000,2000
2,I,2000,3000
3,I,3000,4000
4,I,4000,5000
...,...,...,...
12143,XVI,943000,944000
12144,XVI,944000,945000
12145,XVI,945000,946000
12146,XVI,946000,947000
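
To make the idea of non-overlapping windows concrete, here is a plain-Python sketch of tiling chromosomes with a fixed width and stride. It is not the `momics` implementation: `tile_chromosomes` is a hypothetical function, and it simply drops any window that would run past the chromosome end, which may differ from how `mom.bins` treats the final window.

```python
def tile_chromosomes(chrom_lengths, width, stride):
    """Yield (chrom, start, end) windows that fit entirely inside each chromosome."""
    for chrom, length in chrom_lengths.items():
        # The last valid start is length - width; later starts would overshoot.
        for start in range(0, length - width + 1, stride):
            yield chrom, start, start + width
```

When `stride` equals `width`, the windows are adjacent and non-overlapping; a smaller stride produces overlapping windows.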


In [24]:
## Next, we build a query object to query specific tracks from the momics object
from momics.query import MomicsQuery

q = MomicsQuery(mom, windows)
q.query_tracks(tracks=["atac", "rna"])
"ATAC coverage over the first range queried: " + str(q.coverage["atac"]["I:0-1000"][0:5]) + "..."

momics :: INFO :: 2024-10-21 09:14:01,919 :: Query completed in 38.6298s.


'ATAC coverage over the first range queried: [2.56415 7.25287 7.25287 7.25287 7.25287]...'
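
In the output above, the coverage for each window is looked up with a key of the form `chrom:start-end`, such as `"I:0-1000"`. A small sketch for building and parsing such keys follows; both helper names are hypothetical and only the key format is taken from the example.

```python
def range_key(chrom, start, end):
    """Format a genomic range as 'chrom:start-end'."""
    return f"{chrom}:{start}-{end}"


def parse_range_key(key):
    """Split 'chrom:start-end' back into (chrom, start, end)."""
    # Split on the last colon so that chromosome names containing ':' still work.
    chrom, _, span = key.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)
```

Building keys from the rows of `windows` this way avoids typing range strings by hand when looping over many query results.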

In [25]:
## We can also query sequences over the windows
q.query_sequence()
"Genome sequence over the first range queried: " + str(q.seq["nucleotide"]["I:0-1000"][0:10]) + "..."

momics :: INFO :: 2024-10-21 09:16:24,760 :: Query completed in 36.2156s.


'Genome sequence over the first range queried: CCACACCACA...'

## Extracting data from the repository

Data stored in a `momics` repository can also be extracted and saved to a local file. Here, we retrieve the `atac` track and write it to a bigWig file.

In [26]:
atac = mom.tracks(label="atac")

In [28]:
from momics import utils as mutils

path = mutils.dict_to_bigwig(atac, "extracted_atac_track.bw")
"File saved to: " + path.name
"File exists: " + str(os.path.exists(path))

'File exists: True'
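
Before writing an extracted track to disk, it can be worth checking that it covers every chromosome. The sketch below assumes the track is a dictionary mapping chromosome names to per-base coverage vectors, which is suggested by the `dict_to_bigwig` name but not confirmed here; `check_track_lengths` is a hypothetical helper.

```python
def check_track_lengths(track, chrom_lengths):
    """Return chromosomes whose coverage vector is absent or has the wrong length."""
    problems = []
    for chrom, expected in chrom_lengths.items():
        values = track.get(chrom)
        if values is None or len(values) != expected:
            problems.append(chrom)
    return problems
```

An empty list means every registered chromosome has a coverage vector of the expected length.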

## Deleting a repository 

To delete a repository, you can use the `remove()` method on the repository object. This will delete the repository and all its contents. Now that this notebook is complete, we can delete the repository :)

In [29]:
mom.remove()

momics :: INFO :: 2024-10-21 09:16:50,603 :: Purged s3://momics/my_repo.mom


True
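
Since removal deletes everything stored in the bucket for this repository, a small guard can reduce the risk of purging the wrong path. This is only a sketch: `remove_if_confirmed` is a hypothetical wrapper that takes the deletion function as an argument rather than calling `momics` itself.

```python
def remove_if_confirmed(repo_path, typed_path, remove):
    """Call remove() only when the caller re-typed the exact repository path."""
    if typed_path != repo_path:
        return False  # paths differ: do nothing
    remove()
    return True
```

In this notebook it could be used as `remove_if_confirmed(mom.path, "s3://momics/my_repo.mom", mom.remove)`, so that deletion only happens when the typed path matches the repository that is actually open.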