This repository has been archived by the owner on Aug 25, 2023. It is now read-only.

Export Datastore to BigQuery #14

Merged
przemyslaw-jasinski merged 29 commits into master from export_ds_to_bq on Jul 12, 2018

Conversation

przemyslaw-jasinski
Contributor

No description provided.

@coveralls

coveralls commented Jul 10, 2018

Pull Request Test Coverage Report for Build 330

  • 0 of 155 (0.0%) changed or added relevant lines in 5 files are covered.
  • 37 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-5.4%) to 82.396%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/appinfo.py | 0 | 15 | 0.0% |
| src/datastore_export/export_datastore_to_big_query_service.py | 0 | 18 | 0.0% |
| src/datastore_export/export_datastore_to_big_query_handler.py | 0 | 31 | 0.0% |
| src/datastore_export/export_datastore_backups_to_gcs_service.py | 0 | 43 | 0.0% |
| src/datastore_export/load_datastore_backups_to_big_query_service.py | 0 | 48 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/restore/test/table_randomizer.py | 1 | 97.83% |
| src/main.py | 5 | 0.0% |
| src/big_query/big_query.py | 31 | 76.02% |

Totals (Coverage Status):
  • Change from base Build 266: -5.4%
  • Covered Lines: 2008
  • Relevant Lines: 2437

💛 - Coveralls

```python
request = {
    'project_id': app_id,
    'output_url_prefix': output_url_prefix,
    'entity_filter': entity_filter
```
Contributor:

entity_filter could be inlined
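
A minimal sketch of that inlining, assuming `kinds` is in scope where the request body is built (the filter shape here is an assumption):

```python
# Sketch only: 'kinds' is assumed to be available in the enclosing method.
request = {
    'project_id': app_id,
    'output_url_prefix': output_url_prefix,
    'entity_filter': {'kinds': kinds}  # inlined, no separate local needed
}
```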

```python
dataset_id, date = source_gcs_bucket.split("//")[1].split("/")
return {
    "projectId": configuration.backup_project_id,
    "location": "EU",
```
@marcin-kolda (Contributor) commented Jul 10, 2018:

As discussed, the dataset and job should be created in the same region as the GAE app.
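
One way to honor that, sketched assuming google-api-python-client is available: resolve the app's region from the App Engine Admin API instead of hardcoding "EU".

```python
from googleapiclient import discovery

def get_gae_location(project_id):
    # Ask the App Engine Admin API where the app actually runs,
    # e.g. 'europe-west' or 'us-central'.
    appengine = discovery.build('appengine', 'v1')
    return appengine.apps().get(appsId=project_id).execute()['locationId']
```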

```python
def get(self):
    logging.info("Scheduling export of Datastore entities to GCS ...")
    output_url = ExportDatastoreToGCSService\
        .invoke(self.request, self.response)\
```
Contributor:

handler shouldn't pass request/response objects
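
One possible shape for that refactoring; the parameter name and the plain-argument invoke() signature are hypothetical:

```python
import json
import webapp2

class ExportDatastoreHandler(webapp2.RequestHandler):
    def get(self):
        # The handler unpacks the request itself and hands the service
        # plain values, keeping webapp2 types out of the service layer.
        gcs_folder = self.request.get('gcs_folder')  # hypothetical parameter
        output_url = ExportDatastoreToGCSService.invoke(gcs_folder)
        self.response.out.write(json.dumps({'output_url': output_url}))
```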

```python
logging.info("Scheduling export of Datastore entities to GCS ...")
output_url = ExportDatastoreToGCSService\
    .invoke(self.request, self.response)\
    .wait_till_done(timeout=600)
```
Contributor:

Could we simply raise an alert via Error Reporting if the whole request took more than 10 minutes?
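
A sketch of that alert, assuming wait_till_done() returns None on timeout; raising an unhandled exception from a GAE handler makes the failure visible in Error Reporting:

```python
class ExportNotFinishedException(Exception):
    """Hypothetical exception type for exports that exceed the deadline."""

output_url = ExportDatastoreToGCSService \
    .invoke(self.request, self.response) \
    .wait_till_done(timeout=600)
if output_url is None:
    # Surfaces as an error group in Stackdriver Error Reporting.
    raise ExportNotFinishedException(
        "Datastore export did not finish within 10 minutes")
```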

```python
app_id = configuration.backup_project_id
url = 'https://datastore.googleapis.com/v1/projects/%s:export' % app_id

output_url_prefix = cls.get_output_url_prefix(request)
```
Contributor:

We could use the staging.project_id.appspot.com bucket convention, where lifecycle management is enabled. No need for a configuration entry then.
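
In code, a sketch of that convention (helper name reused from the diff, signature simplified):

```python
from google.appengine.api import app_identity

def get_output_url_prefix(gcs_folder_name):
    # Every GAE app gets staging.<app-id>.appspot.com with lifecycle
    # management enabled, so no configuration entry is required.
    bucket = 'staging.%s.appspot.com' % app_identity.get_application_id()
    return 'gs://%s/%s' % (bucket, gcs_folder_name)
```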

```python
class ExportDatastoreToGCSService(webapp2.RequestHandler):
    @classmethod
    def invoke(cls, request, response):
        access_token, _ = app_identity.get_access_token(
```
Contributor:

Consider using google-api-python-client for consistency with other APIs
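
A hedged sketch of the discovery-based variant; the Datastore Admin export call is exposed as projects().export() in the v1 surface:

```python
from googleapiclient import discovery

def export_entities(project_id, output_url_prefix, entity_filter):
    # discovery.build() gives the same client style used for the
    # other Google APIs in the project.
    datastore = discovery.build('datastore', 'v1')
    body = {
        'outputUrlPrefix': output_url_prefix,
        'entityFilter': entity_filter,
    }
    return datastore.projects().export(
        projectId=project_id, body=body).execute()
```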

```python
finish_time = time.time() + timeout
while time.time() < finish_time:
    logging.info(
        "Export from GCS to BQ - "
```
Contributor:

I would log load_job_id here as well

src/appinfo.py Outdated
```python
    return httplib2.Http(timeout=60)

@staticmethod
def __map(location_id):
```
Contributor:

Could we just check prefixes, not every zone? They change quite often now
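
A prefix-based version of __map, sketched against the BigQuery locations listed at https://cloud.google.com/bigquery/docs/dataset-locations at the time (US, EU, asia-northeast1, australia-southeast1); the exact return values are assumptions about what callers expect:

```python
@staticmethod
def __map(location_id):
    # Match the region prefix instead of enumerating every zone, so new
    # zones within a region need no code change.
    if location_id.startswith('europe'):
        return 'EU'
    if location_id.startswith('asia'):
        return 'asia-northeast1'
    if location_id.startswith('australia'):
        return 'australia-southeast1'
    return 'US'
```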

src/appinfo.py Outdated
```python
        return "EU"
    if location_id in ASIA_LOCATIONS:
        return "Asia"
    return "UNKNOWN"
```
Contributor:

What about Australia?

```python
ExportDatastoreBackupsToGCSService().export(gcs_output_uri, kinds)

logging.info("Loading Datastore backups from GCS to Big Query")
LoadDatastoreBackupsToBigQueryService(now_date)\
```
Contributor:

What happens here in case of timeout after error reporting is sent?


```python
@staticmethod
def __create_gcs_output_url(gcs_folder_name):
    app_id = configuration.backup_project_id
```
Contributor:

This should be taken from the current GAE app; the backup project id could be a different project we don't have access to.
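
A minimal sketch; the staging-bucket naming follows the convention suggested earlier in this review:

```python
from google.appengine.api import app_identity

@staticmethod
def __create_gcs_output_url(gcs_folder_name):
    # Use the project this code runs in, not the backup project, which
    # may be one we cannot write to.
    app_id = app_identity.get_application_id()
    return 'gs://staging.%s.appspot.com/%s' % (app_id, gcs_folder_name)
```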

src/appinfo.py Outdated
```python
if location_id in EU_LOCATIONS:
    return "EU"
if location_id in ASIA_LOCATIONS:
    return "Asia"
```
Contributor:

In Asia there is only: asia-northeast1
https://cloud.google.com/bigquery/docs/dataset-locations

```python
def __create_job_body(self, source_uri, kind):
    return {
        "projectId": configuration.backup_project_id,
        "location": "EU",
```
Contributor:

EU?
Should be the same as the dataset's location, I think.
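
One way to keep them aligned, assuming `big_query` is a discovery-based BigQuery v2 client handle:

```python
def __get_dataset_location(self, big_query, dataset_id):
    # Read the location off the target dataset instead of hardcoding "EU".
    dataset = big_query.datasets().get(
        projectId=configuration.backup_project_id,
        datasetId=dataset_id).execute()
    return dataset['location']
```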


```python
self.response.headers['Content-Type'] = 'application/json'
self.response.set_status(200)
self.response.out.write(json.dumps({'status': 'success'}))
```
Contributor:

It's not a success in case of timeout.
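
A sketch that distinguishes the two outcomes; `load_finished` is a hypothetical flag carrying the wait loop's result:

```python
self.response.headers['Content-Type'] = 'application/json'
if load_finished:
    self.response.set_status(200)
    self.response.out.write(json.dumps({'status': 'success'}))
else:
    # Timeout: report a server-side failure instead of claiming success.
    self.response.set_status(500)
    self.response.out.write(json.dumps({'status': 'timeout'}))
```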

"Loading Datastore backups from GCS to BQ (jobId: %s) - "
"waiting %d seconds for request to end...", load_job_id, PERIOD
)
time.sleep(PERIOD)
Contributor:

Please move the sleep after the result-check logic, to avoid waiting 1 minute before checking for the first time, as discussed.
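
The suggested ordering, with a hypothetical __is_job_done() helper: check the job state first and sleep only if another iteration is needed.

```python
finish_time = time.time() + timeout
while time.time() < finish_time:
    # Check immediately, so a fast job returns without paying the
    # first PERIOD of sleep.
    if self.__is_job_done(load_job_id):  # hypothetical helper
        return True
    logging.info(
        "Loading Datastore backups from GCS to BQ (jobId: %s) - "
        "waiting %d seconds for request to end...", load_job_id, PERIOD)
    time.sleep(PERIOD)
return False
```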

radkomateusz previously approved these changes Jul 12, 2018
@przemyslaw-jasinski przemyslaw-jasinski merged commit cebbdef into master Jul 12, 2018
@przemyslaw-jasinski przemyslaw-jasinski deleted the export_ds_to_bq branch July 12, 2018 10:06
Labels: None yet
Projects: None yet

4 participants