# Finding a volume's Extracted Features data from a HathiTrust volume ID

The file path used to sync Extracted Features files through rsync follows the [pairtree format](https://wiki.ucop.edu/display/Curation/PairTree?preview=/14254128/16973838/PairtreeSpec.pdf), with the institutional shortcode (e.g. `mdp`, `uc2`) kept intact. If you don't already have the pairtree library, install it with `pip install pairtree` (only compatible with Python 2.x).

In [None]:
import sys
import pairtree.pairtree_path as pp

This function converts the ID to the pairtree path, which can then be downloaded through rsync:

In [None]:
def id_to_rsync(htid, kind='basic'):
    '''
    Take an HTRC id and convert it to an Rsync location for syncing Extracted
    Features
    
    kind: [basic|advanced]
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = pp.id_encode(volid)
    filename = ".".join([libid, volid_clean, kind, 'json.bz2'])
    pairtree_root = [kind, libid, 'pairtree_root']
    path = pairtree_root + pp.id_to_dir_list(volid) + [volid_clean, filename]
    return '/'.join(path)

For example,

In [None]:
id_to_rsync('miun.adx6300.0001.001')

The Extracted Features for this volume can be downloaded using rsync, with `{{URL}}` replaced by the path returned above:

```
rsync -azv data.sharc.hathitrust.org::pd-features/{{URL}} .
```
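As a minimal sketch, the command above can also be assembled in Python. The helper name `rsync_command` is hypothetical, and the example path is written out by hand for the `miun.adx6300.0001.001` volume rather than taken from a live run; the host and module come from the command above.

```python
def rsync_command(path, dest='.'):
    """Build the rsync argument list for one Extracted Features path.

    `rsync_command` is an illustrative helper, not part of any library.
    """
    # Host and module name are copied from the shell command above
    source = 'data.sharc.hathitrust.org::pd-features/' + path
    return ['rsync', '-azv', source, dest]

# Hand-written example path (assumed output of id_to_rsync for this volume)
example_path = ('basic/miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/'
                'adx6300,0001,001/miun.adx6300,0001,001.basic.json.bz2')
cmd = rsync_command(example_path)
print(' '.join(cmd))
# To actually run it: import subprocess; subprocess.call(cmd)
```

Returning a list rather than a single string means the path is passed to rsync as one argument, so characters such as `,` or `+` in encoded IDs need no shell quoting.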

## Compiling and downloading a list of volumes

`select.txt` contains a set of IDs for 10k HathiTrust Digital Library volumes in the PZ class (_Fiction and juvenile belles lettres_), collected from the HTRC through its Solr Proxy:

http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=callnumber:PZ*&wt=csv&fl=id&rows=10000

Here is what the IDs look like:

In [None]:
# Read the IDs, skipping the CSV header line and stripping newlines
with open("select.txt") as f:
    idlist = [line.strip() for line in f.readlines()[1:]]
print(idlist[:3])

In [None]:
rsynclist = [id_to_rsync(v) for v in idlist]
rsynclist[:2]

We can also write the full list of desired volume paths to a file and tell rsync to download from that list.

In [None]:
# Write one path per line; the with-block closes the file for us
with open('rsync-urls.txt', 'w') as rsyncf:
    rsyncf.write("\n".join(rsynclist))

Syncing from a file of URLs can be done as follows:

```
rsync -azv --files-from=rsync-urls.txt data.sharc.hathitrust.org::pd-features/ files/
```
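After a sync like the one above, it can be useful to check which listed files did not arrive. The following is a rough sketch; `missing_files` is a hypothetical helper, and it assumes the list file holds one relative path per line and that rsync recreated those paths under the destination folder.

```python
import os

def missing_files(url_list_path, dest_dir):
    """Return the listed paths that do not exist under dest_dir.

    Illustrative helper: assumes one '/'-separated relative path per line.
    """
    with open(url_list_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    missing = []
    for p in paths:
        # Rebuild the path with the local OS separator before checking
        local = os.path.join(dest_dir, *p.split('/'))
        if not os.path.exists(local):
            missing.append(p)
    return missing

# Example call, matching the file and folder names used above:
# missing_files('rsync-urls.txt', 'files')
```

An empty list means every path in the file was found locally.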

If you don't need the full pair tree directory structure, it can be flattened to a single folder. This example uses [GNU Parallel](http://www.gnu.org/software/parallel/), available for Linux or Mac OS, or installed on Cygwin in Windows.

```
find analysis/sample-files/advanced -type f | parallel --eta mv {} analysis/sample-files
rm -rf analysis/sample-files/advanced
```
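Where GNU Parallel is not available, the same flattening can be sketched in Python with the standard library. The function name `flatten_tree` is hypothetical, and skipping files whose names already exist in the target folder is a choice made for this sketch, not behavior of the shell commands above.

```python
import os
import shutil

def flatten_tree(src_root, dest_dir):
    """Move every file under src_root directly into dest_dir.

    Illustrative helper; returns the names of the files that were moved.
    """
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    moved = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            target = os.path.join(dest_dir, name)
            if os.path.exists(target):
                # Skip rather than overwrite an existing file
                continue
            shutil.move(os.path.join(dirpath, name), target)
            moved.append(name)
    return moved

# Example call, using the folder names from the shell example:
# flatten_tree('analysis/sample-files/advanced', 'analysis/sample-files')
# shutil.rmtree('analysis/sample-files/advanced')  # remove the emptied tree
```

Because each file name already includes the institution and encoded volume ID, name collisions in the flattened folder should be rare.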

## Explanation of ID-to-URL encoding

In [None]:
htid = 'miun.adx6300.0001.001'
kind = 'basic'
libid, volid = htid.split('.', 1)
print("Institution:\t%s\nId:\t\t%s" % (libid, volid))

The Extracted Features dataset has _advanced_ and _basic_ files. For most uses, you'll want the information in _basic_; _advanced_ may be removed in future releases.

A HathiTrust ID begins with the institution's identifier, separated from the volume ID by the first period: _miun_ (University of Michigan) in this case.

In [None]:
volid_clean = pp.id_encode(volid)
filename = ".".join([libid, volid_clean, kind, 'json.bz2'])
print("Filename:\t%s" % filename)

In [None]:
pairtree_root = [kind, libid, 'pairtree_root']
path = pairtree_root + pp.id_to_dir_list(volid) + [volid_clean, filename]
print("Pairtree Root:\t%s" % pairtree_root)
print("Full Path:\t%s" % ('/'.join(path)))

The volume ID is encoded and split into two-character directory names, then recombined with the file type and institutional ID into a path.
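For readers who cannot install the Python 2 pairtree library, the encoding can be approximated in plain Python. This is a sketch written from the Pairtree specification's character rules, not the library's own code; the `sketch_*` names are hypothetical, and results should be checked against `pairtree_path` before relying on them for unusual IDs.

```python
# Characters that the Pairtree spec hex-encodes as ^hh, in addition to
# anything outside visible ASCII (0x21-0x7e)
_SPECIAL = '"*+,<=>?\\^|'

def sketch_id_encode(volid):
    """Approximate pairtree ID cleaning: hex-encode, then swap / : ."""
    out = []
    for byte in bytearray(volid.encode('utf-8')):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7e or ch in _SPECIAL:
            out.append('^%02x' % byte)
        else:
            out.append(ch)
    # Second pass: single-character substitutions for common characters
    return ''.join(out).replace('/', '=').replace(':', '+').replace('.', ',')

def sketch_id_to_dir_list(volid, shorty_length=2):
    """Split the encoded ID into two-character directory names."""
    enc = sketch_id_encode(volid)
    return [enc[i:i + shorty_length] for i in range(0, len(enc), shorty_length)]

def sketch_id_to_path(htid, kind='basic'):
    """Rebuild the rsync path from a HathiTrust ID without the library."""
    libid, volid = htid.split('.', 1)
    clean = sketch_id_encode(volid)
    filename = '.'.join([libid, clean, kind, 'json.bz2'])
    parts = [kind, libid, 'pairtree_root'] + sketch_id_to_dir_list(volid)
    return '/'.join(parts + [clean, filename])

print(sketch_id_encode('adx6300.0001.001'))       # adx6300,0001,001
print(sketch_id_to_dir_list('adx6300.0001.001'))
print(sketch_id_to_path('miun.adx6300.0001.001'))
```

The hex-encoding pass runs before the substitutions so that a literal comma in an ID (encoded as `^2c`) cannot be confused with a period that was converted to a comma.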