# Attachments and configurable blobs
This notebook demonstrates DataJoint's support for storing complex data types (blobs) and file attachments.

A **blob**, in the context of DataJoint, is an attribute that can store complex data structures such as numeric arrays.

An **attachment** is an attribute that can store an entire file with its filename, etc.

These features are currently in pre-release. To enable them, use the following upgrade command:

```shell
$ pip3 install --upgrade --pre datajoint
```

As always, start by importing DataJoint:

In [1]:
import datajoint as dj
import numpy as np

In [2]:
dj.conn()

Connecting root@db:3306


DataJoint connection (connected) root@db:3306

In [3]:
schema = dj.schema('test')

In [4]:
@schema
class Names(dj.Manual):
    definition = """
    name: varchar(128)  
    ---
    age: int 
    email: varchar(64)
    height: float
    dob: date
    picture: blob@stores
    """

In [5]:
dj.config

{   'connection.charset': '',
    'connection.init_function': None,
    'database.host': 'db',
    'database.password': 'simple',
    'database.port': 3306,
    'database.reconnect': True,
    'database.user': 'root',
    'display.limit': 12,
    'display.show_tuple_count': True,
    'display.width': 14,
    'fetch_format': 'array',
    'loglevel': 'INFO',
    'safemode': True}

## Configuring stores

Now extend the configuration by defining `stores`. Here I'm calling the new store `local` and using the `file` protocol. You can also use the `s3` protocol to connect to AWS S3 or S3-compatible services such as [minio](https://min.io/).

In [6]:
dj.config['stores'] = {
    'local': {
        'protocol': 'file',
        'location': '/notebooks/data'
    }
}
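
For comparison, here is a rough sketch of how an `s3` store could sit next to the `local` one. The key names (`endpoint`, `bucket`, `access_key`, `secret_key`) reflect my understanding of what the `s3` protocol expects and should be checked against the DataJoint documentation for your version. The store name `remote` and every value in it are hypothetical placeholders.

```python
# Hypothetical S3 store definition: all values are placeholders, and the
# key names are an assumption to verify against the DataJoint docs.
s3_store = {
    'protocol': 's3',
    'endpoint': 's3.amazonaws.com',
    'bucket': 'my-bucket',
    'location': 'datajoint/demo',
    'access_key': 'YOUR_ACCESS_KEY',
    'secret_key': 'YOUR_SECRET_KEY',
}

# Several stores can be defined side by side, each under its own name.
stores = {
    'local': {'protocol': 'file', 'location': '/notebooks/data'},
    'remote': s3_store,
}

# In the notebook this would then be assigned as:
# dj.config['stores'] = stores
print(sorted(stores))
```

A table attribute would then pick a store by name, for example `blob@remote` instead of `blob@local`.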

## Using external blobs

To use your new store, simply use `blob@store-name` in place of `blob` or `longblob`:

In [13]:
@schema
class DemoBlob(dj.Manual):
    definition = """
    id:  int   # some id
    ---
    mydata: blob@local
    """

Now the attribute `mydata` is connected to the `local` storage. You can interact with the "external" blob field just like you would with a regular blob or longblob field.

Insert data into it:

In [14]:
DemoBlob.insert1((1, np.random.randn(1, 10)))

And get it back out:

In [15]:
DemoBlob().fetch()

array([(1, array([[ 0.01581526,  1.83804671, -0.28828823,  1.56576029, -0.0382735 ,
        -1.41737824,  0.74457659,  0.02570973, -0.5727615 ,  0.09865645]]))],
      dtype=[('id', '<i8'), ('mydata', 'O')])
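
The result above is a NumPy structured array whose `mydata` field has object dtype, so each entry holds the stored array itself. The sketch below builds a stand-in with the same dtype (no database connection is used; `rows` is a hypothetical name) to show how the fields can be pulled apart.

```python
import numpy as np

# Stand-in for the output of DemoBlob().fetch(): same dtype as shown above.
rows = np.empty(1, dtype=[('id', '<i8'), ('mydata', 'O')])
rows['id'][0] = 1
rows['mydata'][0] = np.random.randn(1, 10)

ids = rows['id']                 # plain integer array of ids
first_blob = rows['mydata'][0]   # the stored array for the first row

print(ids, first_blob.shape)
```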

## Using `attach` feature

A new data type called `attach` allows you to insert and retrieve data files directly. Using the `attach` data type by itself stores the file directly in the database.

We recommend using an "external" attach instead, so that the file is stored in one of the stores you configured.

Here we define a new table with an external attach linked to the `local` store.

In [19]:
@schema
class DemoAttach(dj.Manual):
    definition = """
    id: int # some id
    ---
    image: attach@local
    """

To insert data into `attach`, you pass it a valid path to the file you want to insert. Here, I'm passing in a path to the image.

In [20]:
DemoAttach.insert1((1, './images/random_dog.jpg'))

When you fetch an `attach` attribute, two things happen:

1. DataJoint **downloads** the file to `download_path`, which defaults to your current directory. You can specify this location as an argument to `fetch`.
2. You are returned the full path to the downloaded file.

In [26]:
mkdir downloads

mkdir: cannot create directory ‘downloads’: File exists


In [27]:
d = DemoAttach.fetch1(download_path='downloads')

In [28]:
d

{'id': 1, 'image': '/notebooks/downloads/random_dog.jpg'}

## Caching
By default, the data from blobs and attachments are retrieved from remote stores with every fetch command. 
For repeated queries, a cache folder may be specified to improve performance and reduce cost of operations.
After the first fetch of a given blob or attachment, it will be read from the cache. 
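
The idea behind such a cache can be sketched in a few lines. This is only an illustration of a read-through file cache, not DataJoint's implementation; `fetch_with_cache` and `slow_remote` are hypothetical names introduced for the example.

```python
import os
import tempfile

def fetch_with_cache(key, cache_dir, load_remote):
    """Return the bytes for `key`, reading the cache when a copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key)
    if os.path.isfile(path):
        # Cache hit: no trip to the remote store.
        with open(path, 'rb') as f:
            return f.read()
    # Cache miss: load from the remote store and keep a local copy.
    data = load_remote(key)
    with open(path, 'wb') as f:
        f.write(data)
    return data

calls = []  # records every trip to the (simulated) remote store

def slow_remote(key):
    calls.append(key)
    return b'payload for ' + key.encode()

with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_with_cache('abc123', cache_dir, slow_remote)
    second = fetch_with_cache('abc123', cache_dir, slow_remote)

print(len(calls))
```

Only the first call reaches the simulated remote store; the second is served from the cache folder.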

In [29]:
# configure the cache
dj.config['cache'] = './dj-cache'

In [43]:
import os
# clear the cache for the timing test
import shutil
if os.path.isdir(dj.config['cache']):
    shutil.rmtree(dj.config['cache'])

The first fetch creates the cache.

In [44]:
%%time
DemoAttach.fetch1(download_path='downloads')

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 25.5 ms


{'id': 1, 'image': '/notebooks/downloads/random_dog_0009.jpg'}

The second time, the data is retrieved from the cache and the fetch should be faster.

In [45]:
%%time
DemoAttach.fetch1(download_path='downloads')

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 16.4 ms


{'id': 1, 'image': '/notebooks/downloads/random_dog_000a.jpg'}

In [27]:
%%timeit -n1 -r1

# first time no cache
files = OriginalFile.fetch('image_file')

2.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Caution

In the current `dev` release of DataJoint, fetching an `attach`ed file always places a new file in your `download_path`, even if the same file already exists there. In an upcoming `dev` release, we will perform a content check so that duplicate files are not created. In the meantime, we recommend that you delete retrieved files as soon as you no longer need them.
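
Until then, one way to tidy a download folder is to remove files whose content duplicates a file already seen. The helper below is a hypothetical sketch (`remove_duplicate_files` is not part of DataJoint). It compares SHA-256 digests and reads each file fully into memory, so it suits small folders only.

```python
import hashlib
import os
import tempfile

def remove_duplicate_files(folder):
    """Delete files whose content matches an earlier file (in sorted order)."""
    seen = {}      # content digest -> name of the file that is kept
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed

# Demonstration on a temporary folder that mimics repeated fetches.
with tempfile.TemporaryDirectory() as folder:
    for name in ('random_dog.jpg', 'random_dog_0009.jpg', 'random_dog_000a.jpg'):
        with open(os.path.join(folder, name), 'wb') as f:
            f.write(b'same image bytes')
    removed = remove_duplicate_files(folder)
    remaining = sorted(os.listdir(folder))

print(removed, remaining)
```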

## Deleting
Deleting from tables using external storage is just as simple and transaction-safe as with all other kinds of attributes. Simply use the `delete` method:

In [46]:
DemoBlob.delete()

About to delete:
`test`.`demo_blob`: 1 items


Proceed? [yes, No]:  yes


Committed.


However, deleting the entry doesn't immediately delete the data tracked in the external table, so your store will still contain the file corresponding to that data.

In [53]:
list(schema.external)

['local']

In [54]:
schema.external['local']

hash,size  size of object in bytes,timestamp  automatic timestamp
293c284a-e17f-5195-860b-99a01737973c,117,2019-05-02 11:19:30
3ca8ee87-9083-7f4f-7441-516624d9e06c,117,2019-04-03 11:30:01
438aa4ed-6db7-942c-2557-146ba4c385f1,136901,2019-05-02 11:22:20


## Cleanup

For the sake of performance, deleting from tables does not immediately remove the data from external storage.
The data must be cleared periodically at non-critical times.

The current contents of external storage can be inspected by querying `schema.external`.

You may clean up the external table using its `delete` method. It is a transaction-safe operation and can be performed at any time.

In [55]:
schema.external['local'].delete()

Deleted 2 items


After the external table has been updated, the remote stores may be cleaned up too. This operation is **not** transaction-safe and may result in race conditions under heavy concurrent read-write use of the same data.

In [56]:
schema.external['local'].clean()

Deleting...
/notebooks/data/test/29/3c/293c284ae17f5195860b99a01737973c
/notebooks/data/test/3c/a8/3ca8ee8790837f4f7441516624d9e06c
Deleted 2 objects
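
The two-step cleanup can be pictured with a small file-system sketch. The folder layout (two levels of two-character subfolders under the schema name) is inferred from the paths printed above. The functions `object_path` and `clean_store` are hypothetical and only mimic the idea: remove every stored object whose hash is no longer tracked.

```python
import os
import tempfile

def object_path(location, schema_name, hex_hash):
    """Build a path following the layout seen in the output above."""
    return os.path.join(location, schema_name,
                        hex_hash[:2], hex_hash[2:4], hex_hash)

def clean_store(location, schema_name, tracked_hashes):
    """Remove stored objects whose hash is absent from the tracking set."""
    deleted = []
    for dirpath, _, filenames in os.walk(os.path.join(location, schema_name)):
        for name in filenames:
            if name not in tracked_hashes:
                os.remove(os.path.join(dirpath, name))
                deleted.append(name)
    return sorted(deleted)

hashes = [
    '293c284ae17f5195860b99a01737973c',
    '3ca8ee8790837f4f7441516624d9e06c',
    '438aa4ed6db7942c2557146ba4c385f1',
]

with tempfile.TemporaryDirectory() as location:
    # Populate a fake store with three objects.
    for h in hashes:
        path = object_path(location, 'test', h)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(b'data')
    # Only the last hash is still tracked, so the other two are removed.
    deleted = clean_store(location, 'test', {hashes[2]})
    still_there = os.path.isfile(object_path(location, 'test', hashes[2]))

print(deleted, still_there)
```

Because the file removal is a separate step from updating the tracking set, a concurrent writer could add a reference to an object between the two steps, which is the kind of race the warning above refers to.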
