# Data Manifest Tutorial

The data manifest is the system we use to store, track, and locally cache large files.

This tutorial assumes that you've already configured data manifest. See README.md for instructions.

## Create a test data manifest

In [11]:
!ls test_data
!rm -rf ./test_checkout/ test.data_manifest.tsv*

data  genome  README


In [13]:
!dm --quiet create test.data_manifest.tsv --checkout-prefix ./test_checkout/ --remote-datastore-uri s3://test-data-manifest-2-2024/test1  ./test_data/*

  __import__('pkg_resources').require('datamanifest==1.0.0')
Add files: 100%|██████████████████████████████████| 4/4 [00:00<00:00,  4.44it/s]


### Example Data Manifest:

| key | md5sum | size | notes |
|---|---|---|---|
| data/small.chr6.bam.bai | 69ef0af03399b9cfe7037aaaa5cdff7b | 97152 | |
| data/small.chr6.bam | 100d7d094d19c7eaa2b93a4bb67ecda7 | 198736 | |
| genome/GRCh38.p12.genome.chr6_99110000_9913000... | f02b28cef526d5ee3d657f010bfbc2bb | 283499 | |
| README | ca1ea02c10b7c37f425b9b7dd86d5e11 | 9 | |
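
The `md5sum` and `size` columns make it possible to check a local copy of a file against its manifest entry. The sketch below assumes that `md5sum` is the hex MD5 digest of the file contents and that `size` is the file length in bytes. `file_md5_and_size` is a hypothetical helper written for this tutorial, not part of the data manifest API.

```python
import hashlib
import os


def file_md5_and_size(path, chunk_size=1 << 20):
    """Return (hex MD5 digest, size in bytes) for the file at `path`.

    Hypothetical helper for illustration; not part of the data manifest API.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so large files never need to fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

If those assumptions hold, comparing the returned pair with the values recorded for a key would show whether the local file matches the manifest.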


# Using the DataManifest

Note: some of the code examples are shown as markdown rather than as executable Python cells. Writing to the data manifest in automated tests would either cause file-locking problems or require managing many temporary files, which would make the documentation harder to follow.

In [10]:
from ravel.data_manifest import DataManifest, DataManifestWriter
from ravel.constants import REPO_DIR
import tempfile

## Getting a file from the DataManifest

If no path is passed to DataManifest(), it defaults to Ravel/repo_data_manifest.tsv.

In [20]:
with DataManifest() as dm:
    file_path = dm.get('reference/GRCh38/chrom.sizes').path
    # To make sure the file has been synced locally (useful, for example, when running as a job in AWS):
    file_path = dm.sync_and_get('reference/GRCh38/chrom.sizes').path
    
with open(file_path) as fp:
    print(fp.read(44))

chr1	248956422
chr2	242193529
chr3	198295559
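
Once the path is in hand, the file can be read like any other local file. As a minimal sketch, assuming the two-column, tab-separated layout shown in the output above, `parse_chrom_sizes` (a hypothetical helper, not part of the data manifest API) turns those lines into a dictionary:

```python
def parse_chrom_sizes(lines):
    """Map chromosome name -> length from 'name<TAB>length' lines.

    Hypothetical helper for illustration; assumes the two-column,
    tab-separated layout shown in the output above.
    """
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# The three lines printed above, used here as sample input.
sample = ["chr1\t248956422", "chr2\t242193529", "chr3\t198295559"]
chrom_sizes = parse_chrom_sizes(sample)
```

Because the function only iterates over lines, an open file handle for `file_path` can be passed in place of the sample list.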


## Adding a file to an existing DataManifest

This adds a file to the default data manifest (a new line will appear in your Ravel/repo_data_manifest.tsv file).

Example using Python:
```
with DataManifestWriter() as dm:
    # To add a file, you must set its key, which is essentially its path within the Ravel/data directory.
    dm.add('a/data_manifest/key', '/path/to/local/file')
```

Example using CLI:
```
dm add a/data_manifest/key /path/to/local/file
```

CLI Usage:
```
> dm add --help
usage: dm add [-h] [--manifest-fname MANIFEST_FNAME] [--notes NOTES] key path

positional arguments:
  key                   key name of the file
  path                  path of the file to add

optional arguments:
  -h, --help            show this help message and exit
  --manifest-fname MANIFEST_FNAME
  --notes NOTES         Notes to add to the manifest
```
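
When many files need to be added from a script, it can help to assemble the `dm add` command line programmatically. The sketch below uses only the options listed in the help text above. `build_dm_add_command` is a hypothetical helper, not part of the data manifest package, and the `notes` value is an illustrative placeholder.

```python
import shlex


def build_dm_add_command(key, path, notes=None, manifest_fname=None):
    """Build the argument list for `dm add` from the options in its help text.

    Hypothetical helper for illustration; not part of the data manifest package.
    """
    cmd = ["dm", "add"]
    if manifest_fname is not None:
        cmd += ["--manifest-fname", manifest_fname]
    if notes is not None:
        cmd += ["--notes", notes]
    # Positional arguments go last: the manifest key, then the local path.
    cmd += [key, path]
    return cmd


cmd = build_dm_add_command(
    "a/data_manifest/key", "/path/to/local/file", notes="first upload"
)
# shlex.join quotes arguments containing spaces so the line is shell-safe.
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to run the command without going through a shell.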




## Creating a New Empty DataManifest
    
    use .new() to create a new data manifest

    with DataManifestWriter.new(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.add('a/data_manifest/key', '/path/to/local/file')
        
You can then use this data manifest like this:

    with DataManifest(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.get('a/data_manifest/key').path
        
or

    with DataManifestWriter(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.add('a/data_manifest/key', '/some/file').path


## Creating a DataManifest from a Directory
    
    
CLI Usage:
```
> dm create --help
usage: dm create [-h] [--manifest-fname MANIFEST_FNAME] [--dry-run] [--resume]
                 [--checkout-prefix CHECKOUT_PREFIX]
                 directory_to_add

positional arguments:
  directory_to_add

optional arguments:
  -h, --help            show this help message and exit
  --manifest-fname MANIFEST_FNAME
  --dry-run
  --resume
  --checkout-prefix CHECKOUT_PREFIX
                        Location of checkout directory. Default:
                        /home/nboley/src/Ravel/data
```

Example:
```
(ravel) nboley@chimata:/ssd/nboley$ ls test
TSSs.discriminative.tsv  TSSs.healthy.chr11_5226554_5227578.png  TSSs.healthy.chr1_247171498_247172522.png
```

```
(ravel) nboley@chimata:/ssd/nboley$ dm create ./test/ --manifest-fname test.tsv
```

```
(ravel) nboley@chimata:/ssd/nboley$ cat test.tsv
key     md5sum  size    permissions     notes
TSSs.discriminative.tsv 2962707a37ea8d186d8a58ec66479177        23167   0664
TSSs.healthy.chr11_5226554_5227578.png  9d94ea39b07d1f97f690f66e2c37eded        688932  0664
TSSs.healthy.chr1_247171498_247172522.png       c1c7106110646c4e676ff46c2a4b6ae2        793133  0664
```
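
A manifest written this way is a plain text table, so it can be inspected without the library. The sketch below assumes the columns are tab-separated (as the `.tsv` extension suggests) and that the first line is the header shown above. `read_manifest_rows` is a hypothetical helper, and the library's own reader may behave differently.

```python
import csv
import io


def read_manifest_rows(fp):
    """Return {key: row dict} for a tab-separated manifest with a header row.

    Hypothetical helper for illustration; the library's own reader may differ.
    """
    reader = csv.DictReader(fp, delimiter="\t")
    return {row["key"]: row for row in reader}


# Header and two rows copied from the listing above (empty notes column).
sample = "\n".join(
    "\t".join(fields)
    for fields in [
        ("key", "md5sum", "size", "permissions", "notes"),
        ("TSSs.discriminative.tsv",
         "2962707a37ea8d186d8a58ec66479177", "23167", "0664", ""),
        ("TSSs.healthy.chr11_5226554_5227578.png",
         "9d94ea39b07d1f97f690f66e2c37eded", "688932", "0664", ""),
    ]
)
rows = read_manifest_rows(io.StringIO(sample))
# Sizes are read as strings, so convert before summing.
total_bytes = sum(int(row["size"]) for row in rows.values())
```

For a file on disk, opening it with `open('test.tsv', newline='')` and passing the handle to `read_manifest_rows` would follow the same pattern.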

## Deleting

    with DataManifestWriter() as dm:
        dm.delete('a/data_manifest/key')

Note that this will not delete the file from S3. If you'd like to delete the file from S3 as well (which you should only do if you're sure it isn't being used anywhere, including other branches or old versions of MASTER), you can run:

    with DataManifestWriter() as dm:
        dm.delete('a/data_manifest/key', delete_from_datastore=True)
