Keboola AWS S3 Extractor


Download files from S3 to /data/out/files.

Features

  • Use * for wildcards
  • Subfolders
  • Can process only new files
  • Skips files stored in Glacier

Configuration options

  • accessKeyId (required) -- AWS Access Key ID
  • #secretAccessKey (required) -- AWS Secret Access Key
  • bucket (required) -- AWS S3 bucket name; its region will be autodetected
  • key (required) -- Search key prefix, optionally ending with a * wildcard. All files downloaded with a wildcard are stored in the /data/out/files/wildcard folder.
  • saveAs (optional) -- Store all downloaded file(s) in a folder.
  • includeSubfolders (optional) -- Also download all subfolders; only available with a wildcard in the search key prefix. The subfolder structure will be flattened: / in the path is replaced with a - character, e.g. folder1/file1.csv => folder1-file1.csv. Existing - characters are escaped with an extra - character to avoid collisions, e.g. collision-file.csv => collision--file.csv.
  • newFilesOnly (optional) -- Download only new files. The timestamp of the last downloaded file is stored in the lastDownloadedFileTimestamp property of the state file. If more files share the same timestamp, the processedFilesInLastTimestampSecond state property records all files already processed within that second.
  • limit (optional, default 0) -- Maximum number of files downloaded. If the key matches more files than the limit, the oldest files are downloaded first. When used together with newFilesOnly, the extractor processes at most limit files that have not yet been processed.
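The flattening and escaping rules for includeSubfolders can be sketched as a small helper. This is an illustration, not the extractor's actual code; note that the escape must run before the separator replacement:

```php
<?php

/**
 * Flatten an S3 key into a single file name, as described above:
 * existing "-" characters are escaped first ("-" => "--"),
 * then "/" path separators are replaced with "-".
 */
function flattenKey(string $key): string
{
    $escaped = str_replace('-', '--', $key);
    return str_replace('/', '-', $escaped);
}

// flattenKey('folder1/file1.csv')  => 'folder1-file1.csv'
// flattenKey('collision-file.csv') => 'collision--file.csv'
```

The escape-first ordering is what prevents collisions: a/b-c.csv flattens to a-b--c.csv while a-b/c.csv flattens to a--b-c.csv, so the two remain distinct.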

Sample configurations

Single file

{
    "parameters": {
        "accessKeyId": "AKIA****",
        "#secretAccessKey": "****",
        "bucket": "myBucket",
        "key": "myfile.csv",
        "includeSubfolders": false,
        "newFilesOnly": false
    }
}

Wildcard

{
    "parameters": {
        "accessKeyId": "AKIA****",
        "#secretAccessKey": "****",
        "bucket": "myBucket",
        "key": "myfolder/*",
        "saveAs": "myfolder",
        "includeSubfolders": false,
        "newFilesOnly": false
    }
}

Wildcard, subfolders and new files only

{
    "parameters": {
        "accessKeyId": "AKIA****",
        "#secretAccessKey":  "****",
        "bucket": "myBucket",
        "key": "myfolder/*",
        "includeSubfolders": true,
        "newFilesOnly": true
    }
}

Note: a state.json file has to be provided in this case.
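A sample state file is not included in this README; a minimal sketch of its shape, inferred from the two state properties described above (the values are illustrative only):

```json
{
    "lastDownloadedFileTimestamp": 1509580800,
    "processedFilesInLastTimestampSecond": [
        "myfolder/file1.csv"
    ]
}
```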

Small increments, suitable for frequent jobs

{
    "parameters": {
        "accessKeyId": "AKIA****",
        "#secretAccessKey":  "****",
        "bucket": "myBucket",
        "key": "myfolder/*",
        "includeSubfolders": true,
        "newFilesOnly": true,
        "limit": 100
    }
}

Note: a state.json file has to be provided in this case.

Development

Preparation

  • Create an AWS S3 bucket and IAM user using the aws-services.json CloudFormation template.
  • Create a .env file. Use the outputs of the aws-services CloudFormation stack to fill in the variables.
AWS_S3_BUCKET=
AWS_REGION=
UPLOAD_USER_AWS_ACCESS_KEY=
UPLOAD_USER_AWS_SECRET_KEY=
DOWNLOAD_USER_AWS_ACCESS_KEY=
DOWNLOAD_USER_AWS_SECRET_KEY=
  • Build Docker images
docker-compose build
  • Install Composer packages
docker-compose run --rm dev composer install --prefer-dist --no-interaction

Tests Execution

Run the tests with the following command:

docker-compose run --rm dev ./vendor/bin/phpunit