Merge branch 'master' into logging_reduction
marcin-kolda committed Aug 17, 2018
2 parents 485b8ec + 5d47117 commit c77789e
Showing 148 changed files with 2,005 additions and 776 deletions.
17 changes: 12 additions & 5 deletions .travis.yml
@@ -9,10 +9,17 @@ before_install:
- pip install -r requirements.txt
- pip install -t lib -r requirements.txt
- pip install -r requirements_tests.txt
-- pip install coveralls
-- pip install pylint
-script:
-# - find . -name "*.py" -path "./*" -not \( -path "./lib/*" -o -path "./google-cloud-sdk/*" \) | xargs pylint --reports=yes --rcfile=.pylintrc
-- coverage run --source=src test_runner.py --test-path tests/ --test-pattern 'test*.py' -v ./google-cloud-sdk
+- pip install coveralls==1.3.0
+- pip install pylint==1.9.2
+- pip install click==6.7
+- PYTHONPATH=$PYTHONPATH:./lib:./google-cloud-sdk/bin
+
+jobs:
+  include:
+    - stage: unit-tests
+      script: coverage run --source=src test_runner.py --test-path tests/ --test-pattern 'test*.py' -v ./google-cloud-sdk
+    - stage: isolation-tests
+      script: python isolation_test.py --test_runner "test_runner.py" --test_path "tests" --google_cloud_sdk ./google-cloud-sdk
+
after_success:
  coveralls
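For reference, the two CI stages map onto commands you can run locally. This is a sketch that assumes a repository checkout with the Google Cloud SDK unpacked at `./google-cloud-sdk`, as the Travis config above implies:

```sh
# Install dependencies at the versions pinned in the Travis config
pip install -r requirements.txt
pip install -t lib -r requirements.txt
pip install -r requirements_tests.txt
pip install coveralls==1.3.0 pylint==1.9.2 click==6.7
export PYTHONPATH=$PYTHONPATH:./lib:./google-cloud-sdk/bin

# Stage "unit-tests": run the test suite under coverage
coverage run --source=src test_runner.py --test-path tests/ --test-pattern 'test*.py' -v ./google-cloud-sdk

# Stage "isolation-tests": run the isolation test harness
python isolation_test.py --test_runner "test_runner.py" --test_path "tests" --google_cloud_sdk ./google-cloud-sdk
```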
85 changes: 49 additions & 36 deletions README.md
@@ -6,16 +6,24 @@
BBQ (read: barbecue) is a Python app that runs on Google App Engine (GAE) and creates daily backups of BigQuery tables.

# Table of contents

+* [Setup](#setup)
* [Motivation](#motivation)
* [Features](#features)
* [High level architecture](#high-level-architecture)
* [Backup process](#backup-process)
* [Restore process](#restore-process)
* [Retention process](#retention-process)
-* [Setup](#setup)
* [Usage](#usage)

+# Setup
+To install BBQ in GCP, follow the installation steps in the [SETUP.md](./SETUP.md) doc.
+
+The button below opens [SETUP.md](./SETUP.md) in Google Cloud Shell, where you can follow the described steps directly.
+
+[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/open?git_repo=https%3A%2F%2Fgithub.com%2Focadotechnology%2Fbbq&page=shell&tutorial=SETUP.md)


# Motivation

[Google BigQuery](https://cloud.google.com/bigquery/) is a fast, highly scalable, cost-effective and fully managed enterprise data warehouse for analytics at any scale. BigQuery automatically replicates data and keeps a 7-day history of changes.
@@ -51,9 +59,10 @@ In such a scenario we're not able to restore data using BigQuery built-in features
* Dataset/table labels, as they are not copied by the BigQuery copy job (again, you can use [GCP Census](https://github.com/ocadotechnology/gcp-census) for that)

### Known caveats
-* Modification of table metadata (including table description) qualifies table to be backed up at the next cycle. It can be a problem for partitioned tables, where such change updates last modified time in every partition. Then BBQ will backup all partitions again, even though there was no actually change in partition data
-* There's 10,000 [copy jobs per project per day limit](https://cloud.google.com/bigquery/quotas#copy_jobs), which you may hit on the first day. This limit can be increased by Google Support
-* Data in table streaming buffer will be backed up on the next run, once the buffer is flushed. BBQ uses [copy-job](https://cloud.google.com/bigquery/docs/managing-tables#copy-table) for creating backups and *"Records in the streaming buffer are not considered when a copy or extract job runs"* (check [Life of a BigQuery streaming insert](https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert) for more details).
+* Modification of table metadata (including the table description) qualifies a table to be backed up at the next cycle. This can be a problem for partitioned tables, where such a change updates the last-modified time in every partition. BBQ will then back up all partitions again, even though there was no actual change in the partition data,
+* There's a 10,000 [copy jobs per project per day limit](https://cloud.google.com/bigquery/quotas#copy_jobs), which you may hit on the first day. This limit can be increased by Google Support,
+* Data in a table's streaming buffer will be backed up on the next run, once the buffer is flushed. BBQ uses a [copy job](https://cloud.google.com/bigquery/docs/managing-tables#copy-table) for creating backups, and *"Records in the streaming buffer are not considered when a copy or extract job runs"* (see [Life of a BigQuery streaming insert](https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert) for more details, and the sketch after this list),
+* When a table name is longer than 400 characters, in rare cases BBQ may back up the table more than once. Such backup duplicates are automatically removed by the retention process.
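To illustrate the streaming-buffer caveat, here is a minimal sketch that checks whether a table still holds buffered rows before you rely on a copy-job backup being complete. It uses the modern `google-cloud-bigquery` client rather than anything BBQ itself ships, and the project, dataset and table names are placeholders:

```python
from google.cloud import bigquery

# Placeholders: substitute your own project, dataset and table names
client = bigquery.Client(project="my-source-project")
table = client.get_table("my-source-project.my_dataset.my_table")

# streaming_buffer is None once the buffer has been flushed; until then a
# copy job (and therefore a BBQ backup) will not see the buffered records.
if table.streaming_buffer is not None:
    print("~%d rows still buffered; the next backup cycle will pick them up"
          % table.streaming_buffer.estimated_rows)
else:
    print("No streaming buffer; a copy job captures the full table")
```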

# High level architecture

@@ -121,25 +130,21 @@ Every day the retention process scans all backups to find and delete backups matching
### Example of a 7-month-old backup deletion and the source-deletion grace period
![Retention process](docs/images/bbq_retention_process_7_months.gif)

-# Setup
-To install BBQ in GCP, click button below or follow [Setup.md](./SETUP.md) doc.
-
-<a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/ocadotechnology/bbq&page=editor&open_in_editor=SETUP.md">
-<img alt="Open in Cloud Shell" src ="http://gstatic.com/cloudssh/images/open-btn.png"></a>

# Usage

## How to run backups?
The backup process is scheduled periodically for all specified projects (check [config.yaml](./config/config.yaml) to specify which projects to back up and [config/cron.yaml](./config/cron.yaml) to configure the schedule).
Note that cron uses UTC.
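For illustration, a GAE `cron.yaml` entry has the following general shape. The URL and schedule here are assumptions made for the sketch, so check [config/cron.yaml](./config/cron.yaml) for the real values:

```yaml
cron:
- description: schedule daily BBQ backups
  url: /cron/backup            # hypothetical handler path
  schedule: every day 03:00    # interpreted in UTC unless a timezone is set
```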

You may also invoke the backup process manually from [cron jobs](https://console.cloud.google.com/appengine/taskqueues/cron).

It's worth underlining that:
* Backups for partitions are scheduled randomly within the time range specified in [config.yaml](./config/config.yaml),
* It is possible to check the progress via [Task Queues](https://console.cloud.google.com/appengine/taskqueues).

-## How to list already created backups?
-In order to find where is stored backup __Y__ for table __X__:
+## How to find a backup for a given table?
+### Option 1
+To find backup __Y__ for table __X__:
1. In Cloud Console visit [Datastore](https://console.cloud.google.com/datastore),
1. Find __Key literal__ for table _X_:
* Select __Table__ kind,
@@ -150,33 +155,41 @@ In order to find where is stored backup __Y__ for table __X__:
* Filter entities by _Key_ that __has ancestor__ _X.Key literal_ (or use the GQL sketch after these steps).
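Alternatively, the ancestor lookup can be typed into the Datastore console's GQL query bar. A sketch, assuming the property names used by BBQ's Datastore export (compare the SQL query in Option 2 below); the first query locates table __X__'s entity and the second lists its backups, with the numeric key ID a placeholder copied from the first result:

```sql
SELECT * FROM Table WHERE project_id = 'X.project_id' AND dataset_id = 'X.dataset_id' AND table_id = 'X.table_id'

SELECT * FROM Backup WHERE __key__ HAS ANCESTOR KEY(Table, 5629499534213120)
```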

To check the content of a given backup __Y__ in BigQuery:
-1. Open [Big Query](https://console.cloud.google.com/bigquery),
+1. Open [BigQuery](https://console.cloud.google.com/bigquery) in the BBQ storage project,
1. Filter tables by _Y.dataset_id_ or _Y.table_id_ in the search bar,
1. Select the table and check the _Schema_, _Details_ or _Preview_ tab.
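The same check can be done from the command line with the Cloud SDK's `bq` tool. A sketch, where the storage project and the backup's dataset/table ids are placeholders:

```sh
# Schema and details of the backup table (Z = BBQ storage project)
bq show Z:Y_dataset_id.Y_table_id

# First rows of the backup, the command-line equivalent of the Preview tab
bq head -n 10 Z:Y_dataset_id.Y_table_id
```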

### Option 2
It is possible to export Datastore kinds and query them in BigQuery; this method is recommended for more frequent usage.
* To enable the export, check the [Cloud Datastore export](./SETUP.md#cloud-datastore-export) section.
* Exports are scheduled periodically; however, to get the latest data you can invoke them manually from [cron jobs](https://console.cloud.google.com/appengine/taskqueues/cron).
* To find backup __Y__ for table __X__, open [BigQuery](https://console.cloud.google.com/bigquery) in BBQ storage project __Z__, replace __X__ and __Z__ in the query below and execute it (the returned rows describe backup __Y__):
```sql
#standardSQL
WITH last_tables AS (
  SELECT *
  FROM `Z.datastore_export.Table_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Z.datastore_export.Table_*`
  )
), last_backups AS (
  SELECT *, CAST(SPLIT(__key__.path, ',')[OFFSET(1)] AS INT64) AS PARENT_ID
  FROM `Z.datastore_export.Backup_*`
  WHERE _TABLE_SUFFIX IN (
    SELECT MAX(_TABLE_SUFFIX) FROM `Z.datastore_export.Backup_*`
  )
)
SELECT * FROM last_backups WHERE PARENT_ID IN (
  SELECT __key__.id FROM last_tables
  WHERE project_id = 'X.project_id' AND dataset_id = 'X.dataset_id' AND table_id = 'X.table_id'
)
```
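For instance, if table __X__ is `example-prj.logs.events` and the storage project __Z__ is `bbq-storage` (both names hypothetical), the FROM clauses become `bbq-storage.datastore_export.Table_*` and `bbq-storage.datastore_export.Backup_*`, and the final filter reads:

```sql
WHERE project_id = 'example-prj' AND dataset_id = 'logs' AND table_id = 'events'
```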
## How to restore data from backups?
-There are several options to restore data, available from _\<your-project-id>_.__appspot.com__ (dropdown tab _Actions_)
-* __Restore whole dataset__ (_\<your-project-id>.appspot.com_/__ui/restoreDataset__). Parameters:
-  * Source project id: id of project where dataset is placed originally,
-  * Source dataset id: original dataset id,
-  * Target dataset id (optional): id of temporary dataset that will be used (and created if does not exist) as container for restored table. Remember that this will be a temporary dataset with expiration time set to 7 days. __Note that passed dataset could already exists - it should be in the same location as backup__.
-    If _target dataset id_ is not passed, then _source dataset id_ value will be used as a target dataset id in restoration project
-  * Max partition days (optional): number of days from partitioned tables will be restored (eg. 30 means that partitions from last 30 days will be restored),
-* __Restore tables from list of backups__ (_\<your-project-id>.appspot.com_/__ui/restoreList__). Parameters:
-  * Target dataset id (optional): id of temporary dataset that will be used (and created if does not exist) as container for restored table. Remember that this will be a temporary dataset with expiration time set to 7 days. __Note that passed dataset could already exists - it should be in the same location as backup__.
-    If _target dataset id_ is not passed, then source dataset id value of each backup will be used as a target dataset id in restoration project.
-    In case of restoring backups from different datasets multiple target datasets will be created.
-  * Backup list: set of backups in __JSON__ format, each of them is designated by the url safe key of backup entity available from [Datastore](https://console.cloud.google.com/datastore). Example:
-```json
-[
-  {
-    "backupUrlSafeKey" : "ahFlfmRldi1wcm9qZWN0LPPicXIlCxIFVGFibGUYgICAkOaLgAgMCxIGQmFja3VwGICAgICAgJJJJA"
-  },
-  {
-    "backupUrlSafeKey" : "ahFlfmRldi1wcm9qZWN0LWJicXIlCxIFVGFibGUYgICAkJOlgAgMCxIGQmFja3VwGICAgICAgIAKDA"
-  }
-]
-```
+There are several options to restore data, available from _\<your-project-id>_.__appspot.com__:
+* __Restore whole dataset__
+* __Restore single table__
+* __Restore tables from custom list of backups__

#### Checking the status of the restoration process
The restore process is asynchronous. To check its status, follow the links returned in the response:

