# Data Manifest Tutorial

The data manifest is the system we use to store, track, and locally cache large files.

This tutorial assumes that you have already configured the data manifest. See README.md for setup instructions and details about the command line interface.

## Create a test data manifest

In [9]:
!ls test_data
!rm -rf ./test_checkout/ test.data_manifest.tsv*

data  genome  README


In [10]:
!dm --quiet create test.data_manifest.tsv --checkout-prefix ./test_checkout/ --remote-datastore-uri s3://test-data-manifest-2-2024/test1  ./test_data/*

Add files: 100%|██████████████████████████████████| 4/4 [00:00<00:00,  4.04it/s]


# Using the DataManifest

Note: some of the code examples are written in markdown rather than as executable Python cells. Writing to the data manifest in automated tests would either cause file locking problems or require managing many temporary files, which would make the documentation harder to follow.

In [11]:
import pandas
pandas.read_table('test.data_manifest.tsv', comment='#', na_filter=False, dtype={'notes': str})

Unnamed: 0,key,md5sum,size,notes
0,data/small.chr6.bam.bai,69ef0af03399b9cfe7037aaaa5cdff7b,97152,
1,data/small.chr6.bam,100d7d094d19c7eaa2b93a4bb67ecda7,198736,
2,genome/GRCh38.p12.genome.chr6_99110000_9913000...,f02b28cef526d5ee3d657f010bfbc2bb,283499,
3,README,ca1ea02c10b7c37f425b9b7dd86d5e11,9,
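
Each row pairs a key with an `md5sum` and a `size`, which is enough to check a local copy of a file by hand. The sketch below is a minimal illustration that uses only the standard library. The helper names `file_md5` and `matches_record` are hypothetical and are not part of `datamanifest`, and this tutorial does not show whether the library runs a similar check itself.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, md5sum, size):
    """Check a local file against the md5sum and size columns of a manifest row.

    Hypothetical helper for illustration; not part of the datamanifest API.
    """
    # Compare sizes first so a mismatch is caught without hashing the file.
    return os.path.getsize(path) == size and file_md5(path) == md5sum
```

For example, `matches_record(record.path, record.md5sum, record.size)` would check a checked-out file against its record, using the field names shown in the `DataManifestRecord` output later in this tutorial.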


In [13]:
from datamanifest import DataManifest, DataManifestWriter
import tempfile

## Getting a file from the DataManifest

In [31]:
with DataManifest("test.data_manifest.tsv") as dm:
    # get a data manifest record
    print("Record:\n", dm.get('README'), sep="")
    # get an s3 uri
    print("\nRemote URI:", dm.get('README').remote_uri.uri)
    # run sync_and_get to make sure the file has been synced locally (useful, for example, when running as a job in AWS)
    file_path = dm.sync_and_get('README').path
    
with open(file_path) as fp:
    print('\nFile contents: """\n' + fp.read() + '\n"""')

Record:
DataManifestRecord(key='README', md5sum='ca1ea02c10b7c37f425b9b7dd86d5e11', size=9, notes='', path='/home/nboley/src/datamanifest/notebooks/test_checkout/README', remote_uri=RemotePath(scheme='s3', bucket='test-data-manifest-2-2024', path='test1/ca1ea02c10b7c37f425b9b7dd86d5e11-README'))

Remote URI: s3://test-data-manifest-2-2024/test1/ca1ea02c10b7c37f425b9b7dd86d5e11-README

File contents: """
Test data
"""


## Adding a file to an existing DataManifest

The cells below use `DataManifestWriter` to add a file to the data manifest, then update it, and finally delete it.

In [34]:
!echo "AAAAAAAAAA" > TENAs.txt
!ls

TENAs.txt      test_data	       test.data_manifest.tsv.local_config
test_checkout  test.data_manifest.tsv  tutorial.ipynb


In [39]:
dm = DataManifestWriter("test.data_manifest.tsv")

In [42]:
dm.add('test_key', 'TENAs.txt')
print(open(dm.get('test_key').path).read())

'test1/86d48f739677a6bc11751a9a3fd4a0d1-test_key' already exists in s3 -- using existing version.


AAAAAAAAAA
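
The remote paths printed in this tutorial all follow one visible pattern: the storage prefix, then the key's directory, then the md5sum joined to the key's file name with a hyphen. The sketch below reproduces that pattern from the outputs alone. The function names are hypothetical, and the real naming logic inside `datamanifest` may differ in cases this tutorial does not exercise.

```python
import posixpath


def remote_object_path(prefix, key, md5sum):
    """Build '<prefix>/<key directory>/<md5sum>-<key file name>'.

    Inferred from the outputs shown in this tutorial; not the library's code.
    """
    directory, filename = posixpath.split(key)
    return posixpath.join(prefix, directory, f"{md5sum}-{filename}")


def remote_uri(bucket, prefix, key, md5sum):
    """Format the full s3 URI for a key under the assumed naming pattern."""
    return f"s3://{bucket}/{remote_object_path(prefix, key, md5sum)}"


if __name__ == "__main__":
    print(remote_uri(
        "test-data-manifest-2-2024",
        "test1",
        "README",
        "ca1ea02c10b7c37f425b9b7dd86d5e11",
    ))
```

Because the md5sum is part of the object name, two files with the same key and identical contents map to the same remote object, which is consistent with the "already exists in s3" message above.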



In [43]:
!echo "BBBBBBBBBB" > TENAs.txt

In [44]:
dm.update('test_key', 'TENAs.txt')
print(open(dm.get('test_key').path).read())

BBBBBBBBBB



In [45]:
dm.delete('test_key')

In [46]:
dm.get('test_key')

KeyError: 'test_key'

## Concurrency is handled through file system level locks

In [48]:
# Locking prevents a data manifest from being opened while it is already open for writing.
# It's usually safest to open data manifest writers in a context manager.
dm2 = DataManifestWriter("test.data_manifest.tsv")

RuntimeError: 'test.data_manifest.tsv' has an exclusive lock from another process and so it can't be opened for reading

In [49]:
dm2 = DataManifest("test.data_manifest.tsv")

RuntimeError: 'test.data_manifest.tsv' has an exclusive lock from another process and so it can't be opened for reading

In [50]:
dm.close()
dm2 = DataManifest("test.data_manifest.tsv")

In [63]:
# we can open multiple concurrent readers
dm1 = DataManifest("test.data_manifest.tsv")
dm2 = DataManifest("test.data_manifest.tsv")
dm3 = DataManifest("test.data_manifest.tsv")

In [64]:
# opening a writer fails while any readers are still open
dmw = DataManifestWriter("test.data_manifest.tsv")
# dmw.close()

RuntimeError: 'test.data_manifest.tsv' has been opened by another process, and so it can't be opened for writing

In [65]:
dm1.close()
dm2.close()
dm3.close()

In [66]:
dmw = DataManifestWriter("test.data_manifest.tsv")
dmw.close()
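
The behaviour above (many readers, or exactly one writer) is the usual shared/exclusive lock pattern. The sketch below illustrates that pattern with `fcntl.flock` from the standard library. It is an assumption that `datamanifest` uses a comparable mechanism, since the tutorial does not show its implementation. `try_lock` and `demo` are hypothetical names, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
import os
import tempfile


def try_lock(path, exclusive):
    """Open `path` and try to take a non-blocking flock on it.

    Returns the open file object on success, or None if a conflicting
    lock is already held through another open file.
    """
    fp = open(path, "r+")
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fp, mode | fcntl.LOCK_NB)
    except OSError:
        fp.close()
        return None
    return fp


def demo():
    """Replay the reader/writer sequence from the cells above on a temp file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        reader1 = try_lock(path, exclusive=False)
        reader2 = try_lock(path, exclusive=False)
        # An exclusive lock is refused while shared locks are held.
        writer_while_reading = try_lock(path, exclusive=True)
        reader1.close()  # closing the file releases its lock
        reader2.close()
        writer = try_lock(path, exclusive=True)
        # A shared lock is refused while the exclusive lock is held.
        reader_while_writing = try_lock(path, exclusive=False)
        result = {
            "second_reader": reader2 is not None,
            "writer_while_reading": writer_while_reading is not None,
            "writer_after_close": writer is not None,
            "reader_while_writing": reader_while_writing is not None,
        }
        writer.close()
        return result
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The non-blocking flag makes a conflicting open fail immediately instead of waiting, which mirrors the `RuntimeError` raised in the cells above.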

## Using glob to find multiple files

In [58]:
with DataManifest("test.data_manifest.tsv") as dm:
    print(dm.glob('*bam*'))
    print()
    for rec in dm.glob_records('*bam*'):
        print(rec)

['data/small.chr6.bam.bai', 'data/small.chr6.bam']

DataManifestRecord(key='data/small.chr6.bam.bai', md5sum='69ef0af03399b9cfe7037aaaa5cdff7b', size=97152, notes='', path='/home/nboley/src/datamanifest/notebooks/test_checkout/data/small.chr6.bam.bai', remote_uri=RemotePath(scheme='s3', bucket='test-data-manifest-2-2024', path='test1/data/69ef0af03399b9cfe7037aaaa5cdff7b-small.chr6.bam.bai'))
DataManifestRecord(key='data/small.chr6.bam', md5sum='100d7d094d19c7eaa2b93a4bb67ecda7', size=198736, notes='', path='/home/nboley/src/datamanifest/notebooks/test_checkout/data/small.chr6.bam', remote_uri=RemotePath(scheme='s3', bucket='test-data-manifest-2-2024', path='test1/data/100d7d094d19c7eaa2b93a4bb67ecda7-small.chr6.bam'))
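
In the output above, the pattern `*bam*` matches keys that contain a `/`, which is how shell-style matching in Python's `fnmatch` module behaves. The sketch below shows that matching on a plain list of keys. It assumes `fnmatch`-style semantics and is not the library's implementation. `glob_keys` is a hypothetical helper, and the genome key is left out because it is truncated in the table earlier.

```python
from fnmatch import fnmatchcase


def glob_keys(keys, pattern):
    """Return the keys that match a shell-style pattern, in their original order.

    fnmatchcase treats '/' as an ordinary character, so '*' can span
    directory separators inside a key.
    """
    return [key for key in keys if fnmatchcase(key, pattern)]


# Keys copied from the manifest table shown earlier in this tutorial.
keys = ["data/small.chr6.bam.bai", "data/small.chr6.bam", "README"]

if __name__ == "__main__":
    print(glob_keys(keys, "*bam*"))
```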


## Cleanup Tutorial Files

In [67]:
!rm TENAs.txt
!rm -rf ./test_checkout/ test.data_manifest.tsv*