
Add functionality that allows pre-loading data into the storage bucket(s) #79

Open
MiltiadisKoutsokeras opened this issue Sep 2, 2021 · 5 comments

Comments

@MiltiadisKoutsokeras

It would be really useful if the Docker container could start with pre-loaded data. This would make the tool easier to use for unit and integration testing. It could be done either by attaching a volume with data to pre-load or by providing a hook script that is called before the server starts. Similar functionality is implemented in other GCP Storage emulators and in other common Docker images such as databases (the postgres Docker image has the /docker-entrypoint-initdb.d directory where the user can place SQL scripts for database initialization and data import).
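
For illustration, a hook similar to the postgres init directory could look something like this from the user's side (the /docker-entrypoint-init-storage path is only a suggestion here, not an existing option of the image):

docker run -d \
    -p 9023:9023 \
    -v "$(pwd)/seed-data":/docker-entrypoint-init-storage:ro \
    oittaa/gcp-storage-emulator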

@oittaa
Owner

oittaa commented Sep 2, 2021

https://github.com/oittaa/gcp-storage-emulator#docker

The directory used for the emulated storage is located under /storage in the container. In the following example the host's directory $(pwd)/cloudstorage will be bound to the emulated storage.
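
The corresponding docker run from the README is along these lines (see the link above for the exact invocation):

docker run -d \
    -e PORT=9023 \
    -p 9023:9023 \
    -v "$(pwd)/cloudstorage":/storage \
    oittaa/gcp-storage-emulator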

@MiltiadisKoutsokeras
Author

This is a directory controlled by the service and is readable/writable only by root by default, as the Docker service also runs as root. Additionally, this approach does not apply to memory-backed storage. What I would like is a user directory, with user permissions, mounted into the container so that at launch all the data in it is imported. Ideally the top-level directories of the import directory should be used as bucket names. For example, the following directory:

import-dir
|_bucket_a
  |_directory_a
  |_directory_b
    |_file_a
    |_file_b
|_bucket_b
  |_directory_c
  |_directory_d
    |_file_e
    |_file_f

should be loaded on startup: the server should create or use the buckets bucket_a and bucket_b (in memory or on disk) and upload the corresponding files into the proper bucket.
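
In client terms, the expected end state would be roughly this (a minimal sketch using the google-cloud-storage client; the localhost endpoint and project name are just placeholders):

import os
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

# Point the client at the emulator instead of the real GCS endpoint
os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:9023'
client = storage.Client(credentials=AnonymousCredentials(), project='localtesting')

# One bucket per top-level directory of import-dir...
for bucket_name in ('bucket_a', 'bucket_b'):
    if client.lookup_bucket(bucket_name) is None:
        client.create_bucket(bucket_name)

# ...and one object per file, named by its path relative to the bucket directory
client.bucket('bucket_a').blob('directory_b/file_a').upload_from_filename(
    'import-dir/bucket_a/directory_b/file_a')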

@oittaa
Owner

oittaa commented Sep 3, 2021

Yeah, that sounds like a good idea. I don't have much time at the moment, but pull requests are welcome.

@mike-marcacci

@MiltiadisKoutsokeras just FYI https://github.com/fsouza/fake-gcs-server has the behavior you're after.

For our use case we actually don't want that behavior and are trying to move to gcp-storage-emulator instead. But I figured I would drop a note in case you're still in need of it.
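
If it helps, fake-gcs-server preloads from a mounted directory whose top-level folders become buckets, roughly like this (check its README for the exact layout and flags):

docker run -d \
    -p 4443:4443 \
    -v "${PWD}/seed-data":/data \
    fsouza/fake-gcs-server -scheme http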

@MiltiadisKoutsokeras
Author

I have come up with a solution to the problem. Here it is.

First I use Docker Compose to launch the container with these directives:

google_storage:
        image: oittaa/gcp-storage-emulator
        restart: unless-stopped
        ports:
            # Exposed in port 9023 of localhost
            - "127.0.0.1:9023:9023/tcp"
        environment:
            ####################################################################
            # Application environment variables
            PROJECT_ID: ${PROJECT_ID:-localtesting}
        entrypoint: /entrypoint.sh
        command: ["gcp-storage-emulator", "start",
            "--host=google_storage", "--port=9023", "--in-memory",
            "--default-bucket=${BUCKET_NAME:-localtesting_bucket}" ]
        volumes:
            - ./tests/storage/entrypoint.sh:/entrypoint.sh:ro
            - ./tests/storage/docker_entrypoint_init.py:/docker_entrypoint_init.py:ro
            - ./tests/storage/buckets:/docker-entrypoint-init-storage:ro

As you can see, I pass the desired project name and bucket name via the environment variables PROJECT_ID and BUCKET_NAME.
I override the entrypoint of the container with my own Bash/Python script combination, entrypoint.sh and docker_entrypoint_init.py. Here are their contents:

entrypoint.sh

#!/usr/bin/env bash

# Exit on any error
set -e

[ "${PROJECT_ID}" = "" ] && { echo "PROJECT_ID Environment Variable is not Set!"; exit 1; }

# Install Python requirements
pip install google-cloud-storage==1.31.2

# Execute command line arguments in background and save process ID
"${@}" & PROCESSID=$!

# Wait for the background process to start
while ! kill -0 "${PROCESSID}" >/dev/null 2>&1
do
    echo "Waiting for process to start..."
    sleep 1
done
echo "Process started, ID = ${PROCESSID}"
sleep 2

# Cloud Emulators
export STORAGE_EMULATOR_HOST=http://google_storage:9023

# Import data to bucket
echo "Importing data..."
python3 /docker_entrypoint_init.py
echo "DONE"

# Wait for the emulator process to exit
wait "${PROCESSID}"

docker_entrypoint_init.py

"""Initialize Google Storage data
"""

import logging
from os import scandir, environ
import sys
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

# Emit log records to stderr so the INFO messages below are actually visible
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def upload_contents(client, directory, bucket_name=None):
    """Upload recursively contents of specified directory.

    Args:
        client (google.cloud.storage.Client): Google Storage Client.
        directory (str): upload directory path.
        bucket_name (str, optional): Bucket name to use for upload. Defaults to
        None.
    """
    for entry in scandir(directory):
        print(entry.path)
        if entry.is_dir():
            if bucket_name is not None:
                # This is a normal directory inside a bucket
                upload_contents(client, entry.path, bucket_name)
            else:
                # This is a top-level directory: its name becomes the bucket
                # name. Create the bucket if it does not exist yet (e.g. when
                # it is not the --default-bucket).
                if client.lookup_bucket(entry.name) is None:
                    client.create_bucket(entry.name)
                upload_contents(client, entry.path, entry.name)
        elif entry.is_file():
            if bucket_name is not None:
                # The object name is the file path relative to the bucket
                # directory
                tokens = entry.path.split(bucket_name + '/')
                bucket_obj = client.bucket(bucket_name)
                if len(tokens) > 1:
                    gs_path = tokens[1]
                    blob_obj = bucket_obj.blob(gs_path)
                    blob_obj.upload_from_filename(entry.path)

PROJECT_ID = environ.get('PROJECT_ID')
if PROJECT_ID is None:
    logger.error('Missing required environment variable PROJECT_ID!')
    sys.exit(1)

storage_client = storage.Client(credentials=AnonymousCredentials(),
                                project=PROJECT_ID)

# Scan import data directory
upload_contents(storage_client, '/docker-entrypoint-init-storage')

logger.info('Successfully imported bucket data!')
logger.info('List:')
for bucket in storage_client.list_buckets():
    print(f'Bucket: {bucket}')
    for blob in bucket.list_blobs():
        print(f'|_Blob: {blob}')

# All OK
sys.exit(0)
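
To double-check the seeded data from the host after docker compose up, something like this should work (assuming the port mapping and defaults from the compose file above):

import os
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

os.environ['STORAGE_EMULATOR_HOST'] = 'http://127.0.0.1:9023'
client = storage.Client(credentials=AnonymousCredentials(), project='localtesting')
for bucket in client.list_buckets():
    for blob in bucket.list_blobs():
        print(f'{bucket.name}/{blob.name}')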

I hope this is helpful.
