# Data Manifest Tutorial

The data manifest is the system we use to store, track, and locally cache large files.
Files are tracked in TSV files, for example:
* **Ravel/repo_data_manifest.tsv** - Our **default** data manifest. It contains a variety of files for analysis and testing of code.
* **Ravel/data_manifests/frag_beds.tsv** - fragment beds
* **Ravel/data_manifests/*.tsv** - other sets of files specific to a particular analysis or dataset

Technical details:
* When you add a file, it is stored in an S3 bucket, so it can take some time to upload it.
* There is a local mirror of the S3 bucket in /scratch.
* When you run **dm checkout --fast**, you create symlinks inside Ravel/data that point to the local mirror. There is one symlink for every file listed in the data manifests of your current Ravel git checkout.
* Make sure you use the context managers (**with DataManifest...**). The DataManifest takes out many file locks, and if they are not released properly you might end up in a deadlock.
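
The symlink-based checkout described in the list above can be pictured with a small sketch. This is not the actual `dm checkout` implementation: the function name `checkout_symlinks` and the temporary directories standing in for /scratch and Ravel/data are hypothetical, chosen only to illustrate the idea.

```python
import os
import tempfile


def checkout_symlinks(keys, mirror_dir, data_dir):
    """Hypothetical helper: link each manifest key in data_dir to mirror_dir."""
    linked = []
    for key in keys:
        target = os.path.join(mirror_dir, key)
        link = os.path.join(data_dir, key)
        # Keys may contain subdirectories, so create the parents first.
        os.makedirs(os.path.dirname(link), exist_ok=True)
        # Replace an existing link so that repeated checkouts are idempotent.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)
        linked.append(link)
    return linked


# Build a throwaway "mirror" holding one file, then check it out.
root = tempfile.mkdtemp()
mirror_dir = os.path.join(root, "mirror")  # stands in for the /scratch mirror
data_dir = os.path.join(root, "data")      # stands in for Ravel/data
os.makedirs(os.path.join(mirror_dir, "reference", "GRCh38"))
with open(os.path.join(mirror_dir, "reference", "GRCh38", "chrom.sizes"), "w") as fp:
    fp.write("chr1\t248956422\n")

links = checkout_symlinks(["reference/GRCh38/chrom.sizes"], mirror_dir, data_dir)
with open(links[0]) as fp:
    first_line = fp.read()
```

Because only links are created, a checkout of this kind stays cheap even when the mirrored files are large.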

## Checking out and syncing the data directory

After cloning the Ravel repository, you should check out the data_manifest, which will populate Ravel/data.
**--fast** validates integrity using file sizes, rather than md5sums.

From a shell:

    conda activate ravel
    dm checkout --fast

If you've checked out a new branch, you can sync your Ravel/data directory to the data manifest TSVs in that branch with `dm sync`:

    dm sync --fast
    
`--fast` disables the MD5 check, which can be slow.
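
To make the trade-off behind `--fast` concrete, here is a minimal sketch of the two validation modes. The helper names `md5_of_file` and `is_valid` are hypothetical and this is not the code `dm` runs; it only assumes, as stated above, that the fast mode compares file sizes and the full mode also compares MD5 sums.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files are never fully held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid(path, expected_size, expected_md5, fast=False):
    """Hypothetical check: size only when fast=True, size and MD5 otherwise."""
    if os.path.getsize(path) != expected_size:
        return False
    if fast:
        return True
    return md5_of_file(path) == expected_md5


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"chr1\t248956422\n")
    path = tmp.name

size = os.path.getsize(path)
good_md5 = md5_of_file(path)
wrong_md5 = "0" * 32

fast_ok = is_valid(path, size, wrong_md5, fast=True)   # size matches, MD5 ignored
full_ok = is_valid(path, size, wrong_md5, fast=False)  # MD5 mismatch is caught
```

The size check needs only a `stat` call, while the MD5 check must read every byte, which is why the full check is slow on large files. The cost is that a corrupted file of the same size passes the fast check.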

### Example Data Manifest:

    import pandas
    from ravel.constants import REPO_DIR
    pandas.read_table(f'{REPO_DIR}/repo_data_manifest.tsv', na_filter=False, dtype={'notes': str}).head(3)

|   | key | md5sum | size | permissions | notes |
|---|-----|--------|------|-------------|-------|
| 0 | snyder/sample_info.tsv | 8313f773d0392fc81a123fe778233787 | 7098 | 664 | |
| 1 | immune_cell_atlas/myeloid.hg38.narrowPeak.gz.tbi | a6ba73d88430c111ba47ad300aa91046 | 76 | 664 | |
| 2 | immune_cell_atlas/b_cell.hg38.narrowPeak.gz | 9afa8ce4cf9cb386d804be787460f706 | 1362 | 664 | |
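
Since a manifest is a plain TSV, it can also be read without pandas. The sketch below assumes the column layout `key`, `md5sum`, `size`, `permissions`, `notes` and uses an in-memory copy of two of the rows above; `load_manifest` is a hypothetical helper, not part of `ravel.data_manifest`.

```python
import csv
import io

# In-memory stand-in for a manifest file (assumed column layout).
MANIFEST_TEXT = (
    "key\tmd5sum\tsize\tpermissions\tnotes\n"
    "snyder/sample_info.tsv\t8313f773d0392fc81a123fe778233787\t7098\t664\t\n"
    "immune_cell_atlas/b_cell.hg38.narrowPeak.gz\t9afa8ce4cf9cb386d804be787460f706\t1362\t664\t\n"
)


def load_manifest(handle):
    """Hypothetical helper: index manifest rows by their key."""
    records = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        row["size"] = int(row["size"])  # sizes are byte counts
        records[row["key"]] = row
    return records


records = load_manifest(io.StringIO(MANIFEST_TEXT))
record = records["snyder/sample_info.tsv"]
```

Indexing by `key` mirrors how files are requested from a manifest: by key rather than by row position.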


# Using the DataManifest

Note: some of the code examples are shown as markdown rather than executable Python. Writing to the data manifest in automated tests would either cause file-locking problems or require managing many temporary files, which would make the documentation harder to follow.

    from ravel.data_manifest import DataManifest, DataManifestWriter
    from ravel.constants import REPO_DIR
    import tempfile

## Getting a file from the DataManifest

If no path is passed to `DataManifest()`, it defaults to Ravel/repo_data_manifest.tsv.

    with DataManifest() as dm:
        file_path = dm.get('reference/GRCh38/chrom.sizes').path
        # To make sure the file has been synced locally first (useful if,
        # for example, you're running a job in AWS), use sync_and_get instead:
        file_path = dm.sync_and_get('reference/GRCh38/chrom.sizes').path

    with open(file_path) as fp:
        print(fp.read(44))

Output:

    chr1	248956422
    chr2	242193529
    chr3	198295559


## Adding a file to an existing DataManifest

This adds a file to the default data manifest (a new line appears in your Ravel/repo_data_manifest.tsv file).

Example using Python:
```
with DataManifestWriter() as dm:
    # To add a file, you must set the key. This is basically the path in the Ravel/data directory.
    dm.add('a/data_manifest/key', '/path/to/local/file')
```

Example using CLI:
```
dm add a/data_manifest/key /path/to/local/file
```

CLI Usage:
```
> dm add --help
usage: dm add [-h] [--manifest-fname MANIFEST_FNAME] [--notes NOTES] key path

positional arguments:
  key                   key name of the file
  path                  path of the file to add

optional arguments:
  -h, --help            show this help message and exit
  --manifest-fname MANIFEST_FNAME
  --notes NOTES         Notes to add to the manifest
```
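
To illustrate what a manifest row records about an added file, the following sketch computes the `md5sum`, `size`, and `permissions` fields for a local file. `build_record` is a hypothetical helper written for this tutorial. It is not what `dm add` runs internally, and it does not upload anything to S3.

```python
import hashlib
import os
import stat
import tempfile


def build_record(key, path, notes=""):
    """Hypothetical helper: compute the manifest columns for one local file."""
    # Reads the whole file at once, which is fine for a sketch but not for huge files.
    with open(path, "rb") as fp:
        md5sum = hashlib.md5(fp.read()).hexdigest()
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    return {
        "key": key,
        "md5sum": md5sum,
        "size": os.path.getsize(path),
        "permissions": format(mode, "04o"),  # zero-padded octal string
        "notes": notes,
    }


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    path = tmp.name

record = build_record("a/data_manifest/key", path, notes="example")
row = "\t".join(str(record[col]) for col in ("key", "md5sum", "size", "permissions", "notes"))
```

The `notes` value plays the role of the `--notes` option, and `row` has the same tab-separated shape as a line in a manifest TSV.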




## Creating a New Empty DataManifest
    
Use `.new()` to create a new data manifest:

    with DataManifestWriter.new(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.add('a/data_manifest/key', '/path/to/local/file')
        
You can then read from this data manifest like this:

    with DataManifest(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.get('a/data_manifest/key').path
        
or open it for writing to add more files:

    with DataManifestWriter(f'{REPO_DIR}/data_manifests/my_datamanifest.tsv') as dm:
        dm.add('a/data_manifest/key', '/some/file').path


## Creating a DataManifest from a Directory
    
    
`dm create` builds a data manifest from the files in a directory.

CLI Usage:
```
> dm create --help
usage: dm create [-h] [--manifest-fname MANIFEST_FNAME] [--dry-run] [--resume]
                 [--checkout-prefix CHECKOUT_PREFIX]
                 directory_to_add

positional arguments:
  directory_to_add

optional arguments:
  -h, --help            show this help message and exit
  --manifest-fname MANIFEST_FNAME
  --dry-run
  --resume
  --checkout-prefix CHECKOUT_PREFIX
                        Location of checkout directory. Default:
                        /home/nboley/src/Ravel/data
```

Example:
```
(ravel) nboley@chimata:/ssd/nboley$ ls test
TSSs.discriminative.tsv  TSSs.healthy.chr11_5226554_5227578.png  TSSs.healthy.chr1_247171498_247172522.png
```

```
(ravel) nboley@chimata:/ssd/nboley$ dm create ./test/ --manifest-fname test.tsv
```

```
(ravel) nboley@chimata:/ssd/nboley$ cat test.tsv
key     md5sum  size    permissions     notes
TSSs.discriminative.tsv 2962707a37ea8d186d8a58ec66479177        23167   0664
TSSs.healthy.chr11_5226554_5227578.png  9d94ea39b07d1f97f690f66e2c37eded        688932  0664
TSSs.healthy.chr1_247171498_247172522.png       c1c7106110646c4e676ff46c2a4b6ae2        793133  0664
```
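
The mapping from a directory to manifest rows can be sketched as below. `manifest_rows` is a hypothetical function, and recursing into subdirectories is an assumption, since the example above only shows a flat directory. The sketch takes each key to be the file's path relative to the directory that was added.

```python
import hashlib
import os
import tempfile


def manifest_rows(directory):
    """Hypothetical helper: one (key, md5sum, size) tuple per file under directory."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumed: the key is the path relative to the added directory.
            key = os.path.relpath(path, directory).replace(os.sep, "/")
            with open(path, "rb") as fp:
                md5sum = hashlib.md5(fp.read()).hexdigest()
            rows.append((key, md5sum, os.path.getsize(path)))
    return sorted(rows)


# Build a small throwaway directory to scan.
directory = tempfile.mkdtemp()
os.makedirs(os.path.join(directory, "plots"))
for rel, content in [("TSSs.tsv", b"a\tb\n"), (os.path.join("plots", "one.png"), b"\x89PNG")]:
    with open(os.path.join(directory, rel), "wb") as fp:
        fp.write(content)

rows = manifest_rows(directory)
keys = [key for key, _md5, _size in rows]
```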

## Deleting

    with DataManifestWriter() as dm:
        dm.delete('a/data_manifest/key')

Note that this will not delete the file from S3. To also delete the file from S3 (do this only if you're sure it's not being used anywhere, including other branches or old versions of master), run:

    with DataManifestWriter() as dm:
        dm.delete('a/data_manifest/key', delete_from_datastore=True)
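
The reason for the warning can be shown with a toy model in which plain dictionaries stand in for the manifest and for the S3 datastore. Nothing here is the real `DataManifestWriter` API: `delete_key` and the dictionaries are hypothetical, and only the `delete_from_datastore` flag name is taken from the example above.

```python
def delete_key(manifest, datastore, key, delete_from_datastore=False):
    """Toy model: drop key from the manifest, optionally from the datastore too."""
    del manifest[key]
    if delete_from_datastore:
        datastore.pop(key, None)


manifest = {"a/key": "md5-a", "b/key": "md5-b"}
datastore = {"a/key": b"contents of a", "b/key": b"contents of b"}

# An older branch still carries the manifest as it was before the deletions.
old_branch_manifest = dict(manifest)

delete_key(manifest, datastore, "a/key")                              # manifest only
delete_key(manifest, datastore, "b/key", delete_from_datastore=True)  # both

# Keys the old branch still lists but can no longer fetch.
dangling = [key for key in old_branch_manifest if key not in datastore]
```

After both calls, `a/key` is still retrievable by the old branch, while `b/key` is dangling. That is the situation the warning above is meant to prevent.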
