-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need way of updating db rather than creating from scratch #89
Comments
How do we feel about the following API? from etl_manager.meta import get_existing_database_from_glue_catalogue
# Note I am not going to attempt to read current tables from Glue and create table objects
db = get_existing_database_from_glue_catalogue('my_database')
t = TableMeta(name="table1", location="somewhere")
t.add_column(name= "employee_id2", type= "character", description= "a new description")
db.add_table(t)
# Will not replace existing tables unless overwrite is set to true
db.append_tables_to_glue_database(overwrite=False) |
Yeah fine by me. On top of that this should be used to fix #117. I'd imagine that you could have something like: def create_glue_database(self, delete_if_exists=False):
"""
Creates a database in Glue based on the database object calling the method function.
By default, will error out if database exists - unless delete_if_exists is set to True (default is False).
"""
if delete_if_exists:
self.delete_glue_database()
db = get_existing_database_from_glue_catalogue(self.name)
if db:
existing_tables = db._tables
else:
db = {"DatabaseInput": {"Description": self.description, "Name": self.name}}
_glue_client.create_database(**db)
existing_tables = []
for tab in [t for t in self._tables if t not in existing_tables]:
glue_table_def = tab.glue_table_definition(self.s3_database_path)
_glue_client.create_table(DatabaseName=self.name, TableInput=glue_table_def) There are some issues with the above (indenting probably for one). But we would probably want to parameterise the function to only update new tables, set a list of tables to update or do all of them. Anyway thought I'd add this as it will define what is returned from |
If we have two repos that contribute to the same database, then it's hard to include tables from both repos.
For instance, if we have two repositories that both create tables in the open_data database (e.g. one repo that ETLs ONS data to the platform, and another than ETLs travel time data), then you can add either the ONS data or the travel time data to the glue catalogue, but you can't easily have both.
The text was updated successfully, but these errors were encountered: