# Finding a volume's Extracted Features data from a HathiTrust volume ID

The Extracted Features Dataset is organized using a pairtree structure, which allows us to find an exact volume from its volume id.

## `utils.download_file` - Downloading files by ID

For small jobs, you can download a file from within Python using its HathiTrust id, without worrying about the pairtree structure. `utils.download_file` makes a subprocess call to `rsync`, so it only works on systems where `rsync` is installed.

**Usage**: 

Download one file to the current directory:

```
utils.download_file(htids='nyp.33433042068894')
```

Download multiple files to the current directory:

```
ids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids=ids)
```

Download file to `/tmp`:
```
utils.download_file(htids='nyp.33433042068894', outdir='/tmp')
```

Download a file to the current directory, keeping the pairtree directory structure, i.e. `./nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/nyp.33433042068894.json.bz2`:

```
utils.download_file(htids='nyp.33433042068894', keep_dirs=True)
```
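
Because `utils.download_file` shells out to `rsync`, it can help to confirm that the executable is on the `PATH` before starting a job. The following is a minimal sketch using only the standard library; the helper name `rsync_available` is hypothetical and not part of the HTRC Feature Reader:

```python
import shutil

def rsync_available():
    """Return True if an `rsync` executable can be found on the PATH."""
    return shutil.which("rsync") is not None

if not rsync_available():
    print("rsync not found; utils.download_file will not work here")
```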

## Converting HathiTrust IDs to the Pairtree path
### `htid2rsync` - on the command line

When you install the HTRC Feature Reader, a command line utility, `htid2rsync`, is also installed. It converts one or more volume ids to paths in the rsync pairtree. For help, run `htid2rsync --help`.

Here is a basic example of using `htid2rsync` with two volume ids:

```bash
$ htid2rsync miun.adx6300.0001.001 hvd.32044010273894
miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2
hvd/pairtree_root/32/04/40/10/27/38/94/32044010273894/hvd.32044010273894.json.bz2
```

This should work on all operating systems. If not, leave a bug report at https://github.com/htrc/htrc-feature-reader.

On Unix or Linux command lines, you can send these paths directly to rsync by specifying `--files-from=-`, which tells rsync to read its file list from the output of the previous command, passed in through a pipe (`|`):

```bash
$ htid2rsync miun.adx6300.0001.001 hvd.32044010273894 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ my-local-folder/
```
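
The same pipe can be reproduced from Python by handing the path list to rsync on standard input. This is a sketch, not the library's own implementation: the helper name `build_rsync_cmd` is hypothetical, the host and module are copied from the command above, and the `subprocess.run` call is left commented out so that nothing is downloaded by accident.

```python
import subprocess

def build_rsync_cmd(dest, host="data.sharc.hathitrust.org::features/"):
    """Build an rsync command that reads its file list from STDIN."""
    return ["rsync", "-azv", "--files-from=-", host, dest]

# Paths as produced by htid2rsync, one per volume
paths = [
    "hvd/pairtree_root/32/04/40/10/27/38/94/32044010273894/hvd.32044010273894.json.bz2",
]
cmd = build_rsync_cmd("my-local-folder/")
stdin_text = "\n".join(paths) + "\n"

# Uncomment to actually run the transfer (requires rsync and network access):
# subprocess.run(cmd, input=stdin_text, text=True, check=True)
```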

#### Loading volume ids from a file

If you have a file of volume ids, one per line, use `--from-file filename`, or just `-f filename`.

```bash
$ htid2rsync --from-file volumeids.txt

$ htid2rsync -f volumeids.txt
```


#### Piping in volume ids from STDIN

If you are supplying volume ids from STDIN rather than a file, use `--from-file -`. For example:

```bash
$ some_cmd_that_outputs_volume_ids | htid2rsync --from-file -
```
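
For reference, reading one volume id per line from a file or from STDIN looks roughly like this in Python. The sketch reflects an assumption about typical handling (whitespace stripped, blank lines skipped), not a description of how `htid2rsync` is implemented; `read_ids` is a hypothetical helper.

```python
import io

def read_ids(stream):
    """Yield non-empty, whitespace-stripped volume ids from a text stream."""
    for line in stream:
        htid = line.strip()
        if htid:
            yield htid

# Works the same for an open file, sys.stdin, or an in-memory stream.
demo = io.StringIO("miun.adx6300.0001.001\n\nhvd.32044010273894\n")
ids = list(read_ids(demo))
print(ids)
```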

#### Saving to an output file

Supply `--outfile` or `-o` to save the output to a text file. For example:

```bash
$ htid2rsync --outfile paths.txt miun.adx6300.0001.001

$ htid2rsync -o paths.txt miun.adx6300.0001.001
```

### `id_to_rsync` - in Python

In [None]:
from htrc_features import utils
utils.id_to_rsync('miun.adx6300.0001.001')

'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

### Without the HTRC Feature Reader Library

The file path used to sync Extracted Features files through rsync follows the [pairtree format](https://wiki.ucop.edu/display/Curation/PairTree?preview=/14254128/16973838/PairtreeSpec.pdf).

Because the official pairtree library is only compatible with Python 2.X, we recommend just using the following functions for encoding a volume id to a filename-friendly format and converting the safe id to a file path: 

In [None]:
def id_encode(id):
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def id2path(id):
    clean_id = id_encode(id)
    path = []
    while len(clean_id) > 0:
        val, clean_id = clean_id[:2], clean_id[2:]
        path.append(val)
    return '/'.join(path)

This function brings it all together, generating the pairtree path, which can then be downloaded through rsync:

In [None]:
def id_to_rsync(htid):
    '''
    Take an HTRC id and convert it to an Rsync location for syncing Extracted
    Features
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = id_encode(volid)
    filename = '.'.join([libid, volid_clean, 'json.bz2'])
    path = '/'.join([libid, 'pairtree_root', id2path(volid).replace('\\', '/'),
                     volid_clean, filename])
    return path

For example,

In [None]:
id_to_rsync('miun.adx6300.0001.001')

'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'
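
Going the other way, a downloaded filename can be mapped back to its HathiTrust id by reversing the three substitutions. This is a minimal sketch, assuming the original id contained none of `+`, `=`, or `,` (otherwise the mapping is ambiguous); `filename_to_htid` is a hypothetical helper, not part of the library.

```python
def filename_to_htid(filename, suffix=".json.bz2"):
    """Recover a HathiTrust id from an Extracted Features filename."""
    if filename.endswith(suffix):
        filename = filename[:-len(suffix)]
    # The library id is everything before the first period
    libid, volid_clean = filename.split(".", 1)
    # Undo the encoding: "+" -> ":", "=" -> "/", "," -> "."
    volid = volid_clean.replace("+", ":").replace("=", "/").replace(",", ".")
    return libid + "." + volid

print(filename_to_htid("miun.adx6300,0001,001.json.bz2"))
```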

The Extracted Features for this volume can then be downloaded using rsync, replacing `{{URL}}` with the path generated above:

```
rsync -azv data.sharc.hathitrust.org::pd-features/{{URL}} .
```

## Downloading a list of volumes

`select.txt` contains a set of ids for 10k HathiTrust Digital Library volumes in the PZ class (_Fiction and juvenile belles lettres_), collected from the HTRC through its Solr Proxy:

http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=callnumber:PZ*&wt=csv&fl=id&rows=10000

Here is what the ids look like:

In [None]:
# Read the ids, skipping the CSV header row
with open("select.txt", "r") as f:
    idlist = [line.strip() for line in f.readlines()[1:]]
print(idlist[:3])

['mdp.39015030727963', 'uc2.ark:/13960/t5cc0v81s', 'miun.adx6300.0001.001']


In [None]:
rsynclist = [id_to_rsync(v) for v in idlist]
rsynclist[:2]

['mdp/pairtree_root/39/01/50/30/72/79/63/39015030727963/mdp.39015030727963.json.bz2',
 'uc2/pairtree_root/ar/k+/=1/39/60/=t/5c/c0/v8/1s/ark+=13960=t5cc0v81s/uc2.ark+=13960=t5cc0v81s.json.bz2']

We can also write the full list of our desired volume urls to a file and tell rsync to download from that list.

In [None]:
# Write the rsync paths to a file, one per line
with open('rsync-urls.txt', 'w') as rsyncf:
    rsyncf.write("\n".join(rsynclist))

Syncing from a file of URLs can be done as follows:

```
rsync -azv --files-from=rsync-urls.txt data.sharc.hathitrust.org::pd-features/ files/
```

If you don't need the full pairtree directory structure, add the `--no-relative` argument.
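
To illustrate the difference: with `--files-from`, rsync keeps each listed path relative to the destination, while `--no-relative` keeps only the final filename. The sketch below only computes the expected local locations and transfers nothing; `local_path` is a hypothetical helper, and `files` is the destination folder from the command above.

```python
import posixpath

def local_path(rsync_path, dest="files", relative=True):
    """Return where a listed file should land locally, with or without --no-relative."""
    name = rsync_path if relative else posixpath.basename(rsync_path)
    return posixpath.join(dest, name)

p = "mdp/pairtree_root/39/01/50/30/72/79/63/39015030727963/mdp.39015030727963.json.bz2"
print(local_path(p))                  # full pairtree structure under files/
print(local_path(p, relative=False))  # flat: files/mdp.39015030727963.json.bz2
```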

## Explanation of ID-to-URL encoding

In [None]:
htid = 'miun.adx6300.0001.001'
libid, volid = htid.split('.', 1)
print("Institution:\t%s\nId:\t\t%s" % (libid, volid))

Institution:	miun
Id:		adx6300.0001.001


A HathiTrust ID consists of a library id and an identifier within that library. In this case, _miun_ identifies the University of Michigan as the origin of the file.

In [None]:
volid_clean = id_encode(volid)
filename = ".".join([libid, volid_clean, 'json.bz2'])
print("Filename:\t%s" % filename)

Filename:	miun.adx6300,0001,001.json.bz2


Some IDs don't play nice with filesystems and need to be encoded in a cleaner format: ":" becomes "+", "/" becomes "=", and "." becomes "," (as above). The institution identifier and the encoded version of the institution's local id become the filename.

In [None]:
pairtree_root = [libid, 'pairtree_root']
path = '/'.join(pairtree_root + [id2path(volid).replace('\\', '/'),
                                 volid_clean, filename])
print("Pairtree Root:\t%s" % pairtree_root)
print("Full Path:\t%s" % path)

Pairtree Root:	['miun', 'pairtree_root']
Full Path:	miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2


The volume's local id is encoded and split into two-character segments, which are then recombined with the library id, the encoded id, and the filename into a path.
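
The splitting step can be checked on its own. The following is a small self-contained sketch that uses a list comprehension in place of the `while` loop in `id2path` (the name `split_pairs` is hypothetical):

```python
def split_pairs(encoded_id):
    """Split an encoded id into consecutive two-character segments."""
    return [encoded_id[i:i + 2] for i in range(0, len(encoded_id), 2)]

segments = split_pairs("adx6300,0001,001")
print("/".join(segments))  # ad/x6/30/0,/00/01/,0/01
```

An id with an odd number of characters simply ends in a one-character segment.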