# Using Ceph from Jupyter notebooks

This notebook shows how to interact with Ceph, provided by the DataHub team, directly from Jupyter notebooks.

To use Ceph, you need the `thoth-storages` package installed. It provides an adapter for interacting with Ceph. The package also implements adapters for other persistent stores, but this notebook focuses strictly on Ceph.

In [1]:
from thoth.storages import CephStore

**Warning:** If you want to use Thoth directly, please use the adapters that encapsulate Ceph handling and ensure data consistency, such as `SolverResultsStore`, `BuildLogsStore` or `AnalysisResultsStore`. This notebook presents the low-level adapter API.

To see which methods the Ceph adapter provides, we can check its Python documentation:

In [2]:
help(CephStore)

Help on class CephStore in module thoth.storages.ceph:

class CephStore(thoth.storages.base.StorageBase)
 |  Adapter for storing and retrieving data from Ceph - low level API.
 |  
 |  Method resolution order:
 |      CephStore
 |      thoth.storages.base.StorageBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, result_type, *, host:str=None, key_id:str=None, secret_key:str=None, bucket:str=None, region:str=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  connect(self) -> None
 |      Create a connection to the remote Ceph.
 |  
 |  document_exists(self, document_id:str) -> bool
 |      Check if the there is an object with the given key in bucket, does only HEAD request.
 |  
 |  get_document_listing(self) -> Generator[str, NoneType, NoneType]
 |      Get listing of documents stored on the Ceph.
 |  
 |  is_connected(self) -> bool
 |      Check whether adapter is already connected to the remote Ceph storage.
 |  
 | 

The constructor's keyword parameters can be supplied either explicitly when instantiating the adapter or through environment variables (preferred). Values passed to the constructor take priority. Let's look at the constructor's source to see which environment variables apply:

In [3]:
import inspect

lines = inspect.getsourcelines(CephStore.__init__)
print("".join(lines[0]))

    def __init__(self, result_type, *,
                 host: str=None, key_id: str=None, secret_key: str=None, bucket: str=None, region: str=None):
        super().__init__()
        self.deployment_name = os.environ['THOTH_DEPLOYMENT_NAME']
        self.host = host or os.environ['THOTH_S3_ENDPOINT_URL']
        self.key_id = key_id or os.environ['THOTH_CEPH_KEY_ID']
        self.secret_key = secret_key or os.environ['THOTH_CEPH_SECRET_KEY']
        self.bucket = bucket or os.environ['THOTH_CEPH_BUCKET']
        self.region = region or os.getenv('THOTH_CEPH_REGION', None)
        self.result_type = result_type
        self._s3 = None

        assert self.result_type, "Result type cannot be empty: {}".format(self.result_type)
        assert self.deployment_name, "Deployment name has to be set, got {}".format(self.deployment_name)

        self.prefix = "{}/{}/".format(self.deployment_name, self.result_type)
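
The source above reads five of these variables with `os.environ[...]`, which raises `KeyError` when a variable is unset and no matching keyword argument is passed (`THOTH_DEPLOYMENT_NAME` has no keyword equivalent). A minimal sketch for checking them up front follows. The helper name `missing_ceph_variables` is hypothetical and not part of `thoth-storages`; the variable names are the ones from the constructor source:

```python
import os

# Variable names taken from the constructor source shown above.
# THOTH_CEPH_REGION is left out because the constructor treats it as optional.
REQUIRED_VARS = (
    "THOTH_DEPLOYMENT_NAME",
    "THOTH_S3_ENDPOINT_URL",
    "THOTH_CEPH_KEY_ID",
    "THOTH_CEPH_SECRET_KEY",
    "THOTH_CEPH_BUCKET",
)


def missing_ceph_variables(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration, not part of thoth-storages.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_ceph_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

The check only reports variable names, so no credential values end up in the notebook output.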



As we don't want to expose credentials in this publicly available notebook, we assume that the environment variables are already set in the running Jupyter notebook, so we can simply instantiate the adapter and connect to Ceph:

In [4]:
import os

adapter = CephStore(result_type='testing')
adapter.connect()

Let's check the connection status:

In [5]:
adapter.is_connected()

True
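
When cells are re-run out of order, it can be handy to connect only when needed. The sketch below relies solely on `is_connected()` and `connect()` from the help output above. The helper name `ensure_connected` is hypothetical:

```python
def ensure_connected(adapter):
    """Connect the adapter only if it is not connected yet, then return it.

    Hypothetical helper; works with any object that offers
    is_connected() and connect(), such as the adapter used here.
    """
    if not adapter.is_connected():
        adapter.connect()
    return adapter
```

With the adapter from the cell above, this would be called as `ensure_connected(adapter)`.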

Let's check whether our document `foo` exists on Ceph:

In [6]:
adapter.document_exists('foo')

False

As it is not present yet, let's create it with some content:

In [7]:
adapter.store_document({'some': 'document'}, 'foo')

{'ETag': '"b7d144531216255307a634d8fe75361e"',
 'ResponseMetadata': {'HTTPHeaders': {'accept-ranges': 'bytes',
   'content-length': '0',
   'date': 'Thu, 15 Mar 2018 09:45:01 GMT',
   'etag': '"b7d144531216255307a634d8fe75361e"',
   'x-amz-request-id': 'tx0000000000000000019c0-005aaa409d-93c8ce-default'},
  'HTTPStatusCode': 200,
  'HostId': '',
  'RequestId': 'tx0000000000000000019c0-005aaa409d-93c8ce-default',
  'RetryAttempts': 0}}

In [8]:
adapter.document_exists('foo')

True
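
The check-then-store sequence from the cells above can be wrapped in a small helper. This is only a sketch: the name `store_if_absent` is hypothetical, and the check and the write are two separate requests, so another client could still create the same key in between.

```python
def store_if_absent(adapter, document, document_id):
    """Store `document` under `document_id` unless that id already exists.

    Returns True when the document was written and False when it was skipped.
    Hypothetical helper; not atomic, because the existence check and the
    write are separate calls.
    """
    if adapter.document_exists(document_id):
        return False
    adapter.store_document(document, document_id)
    return True
```

With the adapter from this notebook, `store_if_absent(adapter, {'some': 'document'}, 'foo')` would skip the write once `foo` exists.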

Now we can try to retrieve it:

In [9]:
adapter.retrieve_document('foo')

{'some': 'document'}

As Ceph is an object store, the Ceph adapter also provides low-level operations that work directly on bytes, so you can easily store objects that are not dictionaries, such as text files, images or anything else:

In [10]:
adapter.store_blob('This is some text'.encode(), 'bar')

{'ETag': '"97214f63224bc1e9cc4da377aadce7c7"',
 'ResponseMetadata': {'HTTPHeaders': {'accept-ranges': 'bytes',
   'content-length': '0',
   'date': 'Thu, 15 Mar 2018 09:45:08 GMT',
   'etag': '"97214f63224bc1e9cc4da377aadce7c7"',
   'x-amz-request-id': 'tx0000000000000000019c3-005aaa40a4-93c8ce-default'},
  'HTTPStatusCode': 200,
  'HostId': '',
  'RequestId': 'tx0000000000000000019c3-005aaa40a4-93c8ce-default',
  'RetryAttempts': 0}}

In [11]:
adapter.retrieve_blob('bar').decode()

'This is some text'
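
The notebook does not show how `store_document` relates to `store_blob`. One plausible reading, and only an assumption here, is that dictionaries are serialized to JSON and stored as bytes. The sketch below illustrates that idea with a hypothetical in-memory class `InMemoryStore`. It does not talk to Ceph and is not the `thoth-storages` implementation:

```python
import json


class InMemoryStore:
    """Hypothetical stand-in that mimics the four calls used in this notebook."""

    def __init__(self):
        self._objects = {}

    def store_blob(self, blob, object_key):
        # Accept str or bytes, but always keep bytes, as an object store would.
        if isinstance(blob, str):
            blob = blob.encode()
        self._objects[object_key] = blob

    def retrieve_blob(self, object_key):
        return self._objects[object_key]

    def store_document(self, document, document_id):
        # Assumption: documents are serialized as JSON before being stored.
        self.store_blob(json.dumps(document, sort_keys=True), document_id)

    def retrieve_document(self, document_id):
        return json.loads(self.retrieve_blob(document_id).decode())


store = InMemoryStore()
store.store_document({'some': 'document'}, 'foo')
store.store_blob('This is some text'.encode(), 'bar')
print(store.retrieve_document('foo'))
print(store.retrieve_blob('bar').decode())
```

Under this assumption a document is just a blob with a known encoding, which is why the same object could be read back either as a dictionary or as raw bytes.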

You can also get a listing of all available objects stored on Ceph, identified by their keys:

In [12]:
list(adapter.get_document_listing())

['bar', 'foo']
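
The listing returns `bar` and `foo` rather than longer keys, while the constructor source shown earlier builds `self.prefix` from the deployment name and the result type. A plausible reading, labeled here as an assumption, is that the adapter prepends this prefix to every document id and strips it again when listing. The sketch below shows only that string handling. The function names and the deployment name `my-deployment` are hypothetical:

```python
def object_key(deployment_name, result_type, document_id):
    """Build a full object key using the prefix format from the constructor."""
    prefix = "{}/{}/".format(deployment_name, result_type)
    return prefix + document_id


def document_id_from_key(deployment_name, result_type, key):
    """Strip the prefix again, as a listing of document ids would need to."""
    prefix = "{}/{}/".format(deployment_name, result_type)
    if not key.startswith(prefix):
        raise ValueError("key {!r} does not start with {!r}".format(key, prefix))
    return key[len(prefix):]


# "my-deployment" is a made-up deployment name used only for illustration.
key = object_key("my-deployment", "testing", "foo")
print(key)
print(document_id_from_key("my-deployment", "testing", key))
```

If this reading holds, adapters created with different `result_type` values would see disjoint sets of documents in the same bucket.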