Add guide to use own S3 bucket with platform (#612)
* Add guide to use own S3 bucket with platform

* Implement feedback and fix typos

* Add feedback Gerhard. Add S3 fuse reference
IgorSusmelj committed Dec 3, 2021
1 parent c01ac6c commit cc551df
Showing 3 changed files with 167 additions and 11 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ docs/source/tutorials/platform/*
docs/source/tutorials_source/platform/data
docs/source/tutorials_source/platform/pizzas
docs/source/docker/resources
docs/source/getting_started/resources
!docs/source/tutorials/package.rst
!docs/source/tutorials/platform.rst
!docs/source/tutorials/package/structure_your_input.rst
22 changes: 14 additions & 8 deletions docs/Makefile
@@ -3,14 +3,15 @@

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
DATADIR = _data
PACKAGESOURCE = source/tutorials_source/package
PLATFORMSOURCE = source/tutorials_source/platform
DOCKERSOURCE = source/docker
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
DATADIR = _data
PACKAGESOURCE = source/tutorials_source/package
PLATFORMSOURCE = source/tutorials_source/platform
DOCKERSOURCE = source/docker
GETTING_STARTED_IMAGES = source/getting_started/resources

ZIPOPTS ?= -qo

@@ -45,6 +46,11 @@ download:
# sunflowers dataset
wget -N https://storage.googleapis.com/datasets_boris/Sunflowers.zip -P $(DATADIR)

# download resources for s3 integration
mkdir -p $(GETTING_STARTED_IMAGES)
wget -N https://storage.googleapis.com/datasets_boris/resources_s3_integration.zip -P $(DATADIR)
unzip $(ZIPOPTS) $(DATADIR)/resources_s3_integration.zip -d $(GETTING_STARTED_IMAGES)

# download images and report for docker
wget -N https://storage.googleapis.com/datasets_boris/resources.zip -P $(DATADIR)
unzip $(ZIPOPTS) $(DATADIR)/resources.zip -d $(DOCKERSOURCE)
155 changes: 152 additions & 3 deletions docs/source/getting_started/platform.rst
@@ -141,7 +141,7 @@ As with images and embeddings before, it's also possible to upload custom metadata
Configuration
^^^^^^^^^^^^^^^

In order to use the custom metadata on the Lightly Platform, it must be configured first. For this,
To use the custom metadata on the Lightly Platform, it must be configured first. For this,
follow these steps:

1. Go to your dataset and click on "Configurator" on the left side.
Expand All @@ -162,7 +162,7 @@ Done! You can now use the custom metadata in the "Explore" and "Analyze & Filter
Format
^^^^^^^^^^^

In order to upload the custom metadata, you need to save it to a `.json` file in a COCO-like format.
To upload the custom metadata, you need to save it to a `.json` file in a COCO-like format.
The following things are important:

- Information about the images is stored under the key `images`.
@@ -272,7 +272,7 @@ section by clicking on it.
Dataset Identifier
-------------------------

Every dataset has a unique identifier called 'Dataset ID'. You find it in the dataset overview page.
Every dataset has a unique identifier called 'Dataset ID'. You find it on the dataset overview page.

.. figure:: images/webapp_dataset_id.jpg
:align: center
@@ -302,3 +302,152 @@ account (top right)-> preferences on the
.. warning:: Keep the token for yourself and don't share it. Anyone with the
token could access your datasets!


How to use S3 with Lightly
------------------------------


Lightly allows you to configure a remote datasource like Amazon S3 (Amazon Simple Storage Service) so that you don't need to upload your data to Lightly and can preserve its privacy.


**What you will learn**


In this guide, we will show you how to set up your S3 bucket, configure your dataset to use that bucket, and upload only metadata to Lightly while preserving the privacy of your data.


Setting up Amazon S3
^^^^^^^^^^^^^^^^^^^^^^
For Lightly to be able to create so-called `presigned URLs/read URLs <https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html>`_ for displaying your data in your browser, Lightly needs at least read and list permissions on your bucket. If you want Lightly to create optimal thumbnails for you while uploading the metadata of your images, write permissions are also needed.
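
If you are curious what such a presigned URL looks like, you can generate one yourself with the AWS CLI. A minimal sketch, assuming your own credentials are already configured and the object key below is replaced with one that exists in your bucket:

.. code-block:: console

    # Generate a presigned GET URL that is valid for one hour.
    # The object key is only an example - use any object in your bucket.
    aws s3 presign s3://datalake/projects/farm-animals/cow-01.jpg --expires-in 3600

Anyone holding this URL can read the object until the URL expires, which is how Lightly can display your images without storing them.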

Let us assume your bucket is called `datalake` and the folder you want to use with Lightly is located at `projects/farm-animals/`.

**Setting up IAM**

1. Go to the `Identity and Access Management IAM page <https://console.aws.amazon.com/iamv2/home?#/users>`_ and create a new user for Lightly.
2. Choose a name of your choice and select "Programmatic access" as "Access type". Click "Next".

.. figure:: resources/AWSCreateUser2.png
:align: center
:alt: Create AWS User

Create AWS User

3. We want to create very restrictive permissions for this new user so that it can't access other resources of your company. Click on "Attach existing policies directly" and then on "Create policy". This will bring you to a new page.

.. figure:: resources/AWSCreateUser3.png
:align: center
:alt: Setting user permission in AWS

Setting user permission in AWS

4. As our policy is very simple, we will use the JSON option and enter the following, substituting `datalake` with your bucket name and `projects/farm-animals/` with the folder you want to share.

.. code-block:: json

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": [
                    "arn:aws:s3:::datalake",
                    "arn:aws:s3:::datalake/projects/farm-animals/*"
                ]
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::datalake/projects/farm-animals/*"
                ]
            }
        ]
    }

.. figure:: resources/AWSCreateUser4.png
:align: center
:alt: Permission policy in AWS

Permission policy in AWS

5. Go to the next page, create tags as you see fit (e.g. `external` or `lightly`), and give your new policy a name before creating it.

.. figure:: resources/AWSCreateUser5.png
:align: center
:alt: Review and name permission policy in AWS

Review and name permission policy in AWS
6. Return to the previous page as shown in the screenshot below and reload it. When you now filter the policies, your newly created policy will show up. Select it and continue setting up your new user.

.. figure:: resources/AWSCreateUser6.png
:align: center
:alt: Attach permission policy to user in AWS

Attach permission policy to user in AWS
7. Write down the `Access key ID` and the `Secret access key` in a secure location (such as a password manager), as you will not be able to access this information again. (You can generate new keys and revoke old keys under `Security credentials` on a user's detail page.)

.. figure:: resources/AWSCreateUser7.png
:align: center
:alt: Get security credentials (access key id, secret access key) from AWS

Get security credentials (access key id, secret access key) from AWS
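
With the credentials at hand, you can verify that the policy behaves as intended. A minimal sketch using the AWS CLI, assuming a profile name of `lightly` (any name works) and a hypothetical object key outside the shared folder:

.. code-block:: console

    # Store the new user's credentials under a separate profile.
    aws configure --profile lightly

    # Listing the shared folder should succeed.
    aws s3 ls s3://datalake/projects/farm-animals/ --profile lightly

    # Downloading an object outside the shared folder should be denied
    # (other-project/some-file.jpg is a hypothetical key).
    aws s3 cp s3://datalake/other-project/some-file.jpg . --profile lightly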

**Preparing your data**


For Lightly to be able to create embeddings and extract metadata from your data, `lightly-magic` needs to be able to access your data. You can either download/sync your data from S3 or you can mount S3 as a drive. We recommend downloading your data from S3 as it makes the overall process faster.

**Downloading from S3 (recommended)**

1. Install the AWS CLI by following the `guide by Amazon <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_
2. Run `aws configure` and set the credentials
3. Download/synchronize the folder from S3 to your current directory with `aws s3 sync s3://datalake/projects/farm-animals ./farm` (see the sketch below)
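
Putting the steps together, assuming the example bucket and folder from above:

.. code-block:: console

    # Set the access key ID and secret access key of the Lightly user.
    aws configure

    # Mirror the shared S3 folder into a local directory called "farm".
    aws s3 sync s3://datalake/projects/farm-animals ./farm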

**Mount S3 as a drive**

For Linux and macOS we recommend using `s3fs-fuse <https://github.com/s3fs-fuse/s3fs-fuse>`_ to mount S3 buckets to local file storage.
You can have a look at our step-by-step guide: :ref:`ref-docker-integration-s3fs-fuse`.
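
A minimal sketch of such a mount, assuming the credentials from above; the password file location and the mount point `/mnt/farm-animals` are arbitrary choices:

.. code-block:: console

    # Store the credentials in the format s3fs expects and restrict access.
    echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > ${HOME}/.passwd-s3fs
    chmod 600 ${HOME}/.passwd-s3fs

    # Mount the shared folder of the bucket to /mnt/farm-animals.
    mkdir -p /mnt/farm-animals
    s3fs datalake:/projects/farm-animals /mnt/farm-animals \
        -o passwd_file=${HOME}/.passwd-s3fs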


Uploading your data
^^^^^^^^^^^^^^^^^^^^^^

**Create and configure a dataset**

1. `Create a new dataset <https://app.lightly.ai/dataset/create>`_ in Lightly
2. Edit your dataset and select S3 as your datasource

.. figure:: resources/LightlyEdit1.png
:align: center
:alt: Edit your dataset and select S3 as your datasource

Edit your dataset and select S3 as your datasource

3. As the resource path, enter the full S3 URI to your resource, e.g. `s3://datalake/projects/farm-animals/`
4. Enter the `access key` and `secret access key` we obtained when creating the new user in the previous step and select the AWS region in which you created your bucket
5. The thumbnail suffix allows you to configure where your thumbnails live:

- If you have already generated thumbnails in your S3 bucket, it tells Lightly where to find them.
- If you want Lightly to create thumbnails for you, it tells Lightly where to store them. For this to work, the user policy you created must include write permissions.
- If the thumbnail suffix is not defined/empty, the full image is loaded even when a thumbnail is requested.

.. figure:: resources/LightlyEdit2.png
:align: center
:alt: Lightly S3 connection config
:width: 60%

Lightly S3 connection config

6. Press save and ensure that at least the indicators for "List" and "Read" turn green.


**Use Lightly**

Use `lightly-magic` and `lightly-upload` just as you always would, with the following considerations (a full example command follows the list):

- If you have already generated thumbnails, don't want thumbnails at all, or simply want to use the full image as the thumbnail (by leaving the thumbnail suffix empty), add `upload=metadata` to the `lightly-magic` command.
- If you want Lightly to create thumbnails for you, add `upload=thumbnails` to the `lightly-magic` command.
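
A sketch of such a command, assuming the data was synced to `./farm` as above; `MY_LIGHTLY_TOKEN` and `MY_DATASET_ID` are placeholders for your own token and dataset ID, and `trainer.max_epochs=0` skips self-supervised training in favor of the pretrained model:

.. code-block:: console

    # Embed the local copy of the data and upload only metadata to Lightly.
    lightly-magic input_dir=./farm trainer.max_epochs=0 \
        token=MY_LIGHTLY_TOKEN dataset_id=MY_DATASET_ID upload=metadata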


