Need way of updating db rather than creating from scratch #89

Closed
RobinL opened this issue Mar 11, 2019 · 2 comments · Fixed by #122

Comments

@RobinL
Member

RobinL commented Mar 11, 2019

If we have two repos that contribute to the same database, then it's hard to include tables from both repos.

For instance, if we have two repositories that both create tables in the open_data database (e.g. one repo that ETLs ONS data to the platform, and another that ETLs travel time data), then you can add either the ONS data or the travel time data to the Glue catalogue, but you can't easily have both.

@RobinL
Member Author

RobinL commented Apr 2, 2020

How do we feel about the following API?

    from etl_manager.meta import TableMeta, get_existing_database_from_glue_catalogue

    # Note I am not going to attempt to read current tables from Glue and create table objects
    db = get_existing_database_from_glue_catalogue("my_database")

    t = TableMeta(name="table1", location="somewhere")
    t.add_column(name="employee_id2", type="character", description="a new description")

    db.add_table(t)

    # Will not replace existing tables unless overwrite is set to True
    db.append_tables_to_glue_database(overwrite=False)
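
To make the intended `overwrite` behaviour concrete, here is a minimal sketch that uses an in-memory dict in place of the Glue catalogue. `FakeCatalogue` and `append_tables` are hypothetical names used only for illustration; they are not part of etl_manager or boto3, and the table names and S3 paths are made up.

```python
from typing import Dict, List, Optional


class FakeCatalogue:
    """In-memory stand-in for one Glue database: table name -> definition."""

    def __init__(self, tables: Optional[Dict[str, dict]] = None):
        self.tables: Dict[str, dict] = dict(tables or {})


def append_tables(
    catalogue: FakeCatalogue, new_tables: Dict[str, dict], overwrite: bool = False
) -> List[str]:
    """Add tables to the catalogue and return the names actually written.

    Tables that already exist are left untouched unless overwrite is True.
    """
    written = []
    for name, definition in new_tables.items():
        if name in catalogue.tables and not overwrite:
            continue  # keep the definition contributed by the other repo
        catalogue.tables[name] = definition
        written.append(name)
    return written


# One repo has already registered the ONS table; a second repo appends its own.
catalogue = FakeCatalogue({"ons_population": {"location": "s3://example/ons/"}})
incoming = {
    "ons_population": {"location": "s3://example/other/"},
    "travel_times": {"location": "s3://example/travel/"},
}
written = append_tables(catalogue, incoming, overwrite=False)
```

With `overwrite=False` only `travel_times` is written and the existing `ons_population` definition is preserved, which is the behaviour needed when two repos contribute to the same database.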

@isichei
Contributor

isichei commented Apr 2, 2020

Yeah fine by me. On top of that this should be used to fix #117. I'd imagine that you could have something like:

    def create_glue_database(self, delete_if_exists=False):
        """
        Creates a database in Glue based on the database object calling the method.
        By default, will error out if database exists - unless delete_if_exists is set to True (default is False).
        """

        if delete_if_exists:
            self.delete_glue_database()

        # Assumed to return a falsy value if the database is not in the catalogue
        db = get_existing_database_from_glue_catalogue(self.name)
        if db:
            existing_tables = db._tables
        else:
            db_input = {"DatabaseInput": {"Description": self.description, "Name": self.name}}
            _glue_client.create_database(**db_input)
            existing_tables = []

        # Only create tables that are not already in the Glue database
        for tab in [t for t in self._tables if t not in existing_tables]:
            glue_table_def = tab.glue_table_definition(self.s3_database_path)
            _glue_client.create_table(DatabaseName=self.name, TableInput=glue_table_def)

The above is only a rough sketch and has some issues. We would probably also want to parameterise the function so that it can write only the new tables, update a given list of tables, or update all of them. Anyway, I thought I'd add this, as it will define what get_existing_database_from_glue_catalogue returns.
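
One possible shape for that parameterisation, sketched as a pure function over table names so it can be reasoned about without touching Glue. The name `select_tables_to_write` and the `update` parameter are assumptions made for illustration, not an agreed API.

```python
from typing import Iterable, List, Optional, Union


def select_tables_to_write(
    local_tables: Iterable[str],
    existing_tables: Iterable[str],
    update: Optional[Union[str, Iterable[str]]] = None,
) -> List[str]:
    """Decide which locally defined tables should be written to the catalogue.

    update=None   -> only tables not already in the catalogue
    update="all"  -> every locally defined table
    update=[...]  -> new tables plus the named existing tables
    """
    local = list(local_tables)
    existing = set(existing_tables)

    if isinstance(update, str):
        if update == "all":
            return local
        raise ValueError("update must be None, 'all', or a list of table names")

    requested = set(update or [])
    unknown = requested - set(local)
    if unknown:
        raise ValueError(f"Tables not defined locally: {sorted(unknown)}")

    # Preserve the local ordering of tables in the result
    return [t for t in local if t not in existing or t in requested]


local = ["table1", "table2", "table3"]
existing = ["table1", "table2"]
new_only = select_tables_to_write(local, existing)
some = select_tables_to_write(local, existing, update=["table1"])
everything = select_tables_to_write(local, existing, update="all")
```

Keeping the selection separate from the Glue calls means the default (new tables only) stays non-destructive, and replacing an existing table is always an explicit choice.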

@RobinL RobinL mentioned this issue Apr 2, 2020