
augur db with import/export of metadata #1094

Draft · victorlin wants to merge 6 commits into master from victorlin/augur-db

Conversation

victorlin (Member) commented Nov 8, 2022

Description of proposed changes

This adds a basic interface to enable working with metadata in a world of SQLite3.

See functional test files for example usage.

Implementation was pulled out of #854 and repurposed for the command line interface. I suspect this PR will be easier to review and require less testing. If this is merged, #854 can be rebased to utilize the new changes.

I wrote this with the thought of adding more features in the future. Examples: generating year/month/day columns, loading a sequences FASTA, appending to existing tables, and supporting other databases.

Everything is up for debate, including naming!

I also suspect that I might be re-inventing the wheel with some class interfaces here. References to other tools/packages that others are aware of are welcome.

Related issue(s)

Testing

  • Added functional tests

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

Summary of comment threads

These are comment threads with a decent amount of discussion.

  • Support other database implementations (e.g. PostgreSQL)? (ref)
  • Attribute docstrings (ref1, ref2)
  • Merge metadata and sequences tables? (ref)
  • Standardize database schema? (ref)
  • Use pandas to export query results to a file? (ref)
  • Use TEXT type for all columns? (ref)
  • Connection and transaction management (ref1, ref2)

victorlin (Member, Author) left a comment:

Some comments to start discussion.

@@ -23,6 +23,7 @@
    "parse",
    "curate",
    "index",
    "db",
victorlin (Member, Author):

I placed this here thinking that eventually, it would make sense to load files from previous commands on this list into the database.

Resolved threads: augur/io/sqlite3.py · augur/db/__init__.py (outdated)
victorlin mentioned this pull request Nov 8, 2022
victorlin marked this pull request as ready for review November 8, 2022 01:59
victorlin requested a review from a team November 8, 2022 01:59
codecov bot commented Nov 8, 2022

Codecov Report

Base: 67.98% // Head: 66.91% // Decreases project coverage by -1.07% ⚠️

Coverage data is based on head (bc3f5c5) compared to base (c8900c5).
Patch coverage: 40.22% of modified lines in pull request are covered.

❗ Current head bc3f5c5 differs from pull request most recent head c5574fa. Consider uploading reports for the commit c5574fa to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1094      +/-   ##
==========================================
- Coverage   67.98%   66.91%   -1.08%     
==========================================
  Files          57       66       +9     
  Lines        6656     6909     +253     
  Branches     1637     1673      +36     
==========================================
+ Hits         4525     4623      +98     
- Misses       1825     1982     +157     
+ Partials      306      304       -2     
Impacted Files Coverage Δ
augur/__init__.py 66.66% <ø> (ø)
augur/io/tabular_file.py 21.62% <21.62%> (ø)
augur/db/import_/sqlite3.py 22.95% <22.95%> (ø)
augur/io/sqlite3.py 31.25% <31.25%> (ø)
augur/db/export/sqlite3.py 35.71% <35.71%> (ø)
augur/io/sequences.py 76.47% <38.46%> (-23.53%) ⬇️
augur/io/file.py 80.00% <50.00%> (-20.00%) ⬇️
augur/io/metadata.py 92.07% <50.00%> (-3.96%) ⬇️
augur/db/export/__init__.py 70.00% <70.00%> (ø)
augur/db/import_/__init__.py 71.42% <71.42%> (ø)
... and 5 more


huddlej self-requested a review November 22, 2022 22:33
huddlej (Contributor) left a comment:

Thank you for putting together a smaller PR just for the database interface, @victorlin! I agree that it is generally easier to understand this smaller bit of work (even though this work reflects a substantial amount of effort on your part!). I also like the idea of exposing a database subcommand sooner that allows us to create a database to pass as input to other commands like augur filter. I did some simple testing of the commands from the terminal, but nothing more sophisticated than what you've already tested with functional tests here. I have a bunch of implementation questions/comments below that I'm happy to discuss with you synchronously or in threads here.

Comment on lines 438 to 440
self.potential_id_columns = potential_id_columns or ("strain", "name")
"""Metadata columns that may contain identifier information."""
Contributor:

I was surprised to see this style of commenting throughout this PR where I would have expected something like this instead:

# Metadata columns that may contain identifier information.
self.potential_id_columns = potential_id_columns or ("strain", "name")

I'm used to limiting use of multi-line comments to docstrings or at least having the comment precede the code it refers to. Is there a technical benefit to this style or is it an aesthetic preference?

victorlin (Member, Author):

From PEP 257 – Docstring Conventions:

String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called “attribute docstrings”.

These are used by VSCode to show the docstring upon mouse hover elsewhere in the code. I suspect other editors that use docstrings to enrich the experience might also support them.

[screenshot: VSCode showing the attribute docstring on mouse hover]

Member:

sphinx.ext.autodoc also supports this kind of less-frequently-seen docstring.

Contributor:

Got it. It feels a bit like catering to a computer's ability to read the code rather than a human's, but if the dev experience for most people benefits from this approach it makes sense. I would not generally want these types of comments in sphinx docs, though...

Member:

I would not generally want these types of comments in sphinx docs, though...

Not sure I follow. You wouldn't want the attributes of a class to be documented with this sort of comment on what they are?

victorlin (Member, Author):

I think @huddlej may have been referring to other more verbose comments that are used around the codebase, which are only useful when you're trying to understand the underlying logic. The numpy-style docstrings also contain descriptions that are equivalent to the docstrings I'm using here, which are already rendered on the Developer API doc pages.

Contributor:

@tsibley I definitely want class and instance attributes to be documented. My initial concern was that other variable assignments that use attribute docstrings would get pulled into API docs, too, but PEP 258 is clearer that this shouldn't happen.

I was also expecting class and instance attributes to be documented by class- and init-level docstrings based on the more traditional guidance from PEP 257:

The docstring for a class should summarize its behavior and list the public methods and instance variables.

It's cool that the attribute docstrings appear in VSCode and other docutils processing systems, but they don't show up on the Python or ipython REPLs where I would look for them during usage and debugging. For example, if I define AClass from the PEP 258 example code, I can't access the attribute docstrings with AClass.c.__doc__. A class- or init-level docstring would be accessible from the REPLs and provide the same information via AClass.__doc__ or AClass().__doc__. For example, the following version of AClass provides information about class attributes on the REPL and API docs:

class AClass:
    """A class!

    Attributes
    ----------
    c : str
       This is AClass.c's docstring. 
    i : str
        This is self.i's docstring.

    """
    c = 'class attribute'

    def __init__(self):
        """Method __init__'s docstring."""

        self.i = 'instance attribute'

From the REPL, I can get the docstring with:

>>> print(AClass.__doc__)
A class!

    Attributes
    ----------
    c : str
       This is AClass.c's docstring.
    i : str
        This is self.i's docstring.


>>>


def register_parser(parent_subparsers):
    parser = parent_subparsers.add_parser("export", help=first_line(__doc__))
    parser.add_argument('--sqlite3', required=True, help="SQLite3 database file to export from.")
Contributor:

Would you consider using --database here instead of specifying the database implementation? We've talked about eventually supporting database servers along with database files, so separating the CLI argument from the specific implementation details might prepare us for that future. Or is the idea that we would add other arguments to support other implementations like --postgres-server?

victorlin (Member, Author):

Or is the idea that we would add other arguments to support other implementations like --postgres-server?

Yes, this was my intention in naming it --sqlite3 specifically. There are potential use cases for augur db with other types of databases, and the code here has been organized with that in mind.

However, I suspect much of our initial use will be with SQLite3 – passing the database file around as currently done with sequences and metadata files. Even though users might be unfamiliar with SQLite3 and don't necessarily need to know how it works, knowing what's under the hood can be beneficial for debugging and general awareness.

For the --database idea specifically: I suppose we could use pyodbc and accept an ODBC connection string, but that's not so user-friendly (at least for SQLite3 which requires downloading an ODBC driver separately):

  • --database "DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=example.db;Trusted_connection=yes"
  • --database "DRIVER={PostgreSQL ODBC Driver(UNICODE)};SERVER=some.postgresql.endpoint;PORT=5432;DATABASE=example;UID=user;PWD=pass"

My intention is a more user-oriented CLI that allows for custom "connection strings" per database type:

  • --sqlite3 example.db
  • --postgresql "postgresql://user:pass@some.postgresql.endpoint:5432/example"

victorlin (Member, Author):

Actually, we could support custom "connection values" and determine database type internally.

  • --database example.db
  • --database "postgresql://user:pass@some.postgresql.endpoint:5432/example"

Under the hood:

if is_sqlite_file(args.database):
    load_sqlite3(…)
elif is_postgresql_connection_string(args.database):
    load_postgres(…)
…
else:
    raise AugurError("Unsupported connection value. Run augur db load --help to for supported options.")

I'd be open to this alternative. Does it sound better to you?

Member:

Sounds better to me. It's common convention to have a single argument/option/parameter/etc. (here, --database) that takes some sort of DSN (data source name) which identifies the implementation. The postgresql:// URLs and ODBC connection strings are both examples of DSNs.

Member:

Let's not reinvent the wheel though, so we should pick an existing DSN convention and stick with it. Some prior art:

The first link also includes other references at the bottom of its README.

Member:

Relatedly, let's make --output-sqlite3 → --output-database and so on for similar options.

(Should we use db instead of database, e.g. --db and --output-db, so it matches with augur db? Or inversely, should we rename augur db to augur database and leave the options as --database? Maybe we should just alias 'em all.)

victorlin (Member, Author):

Good suggestion on aliasing. Renamed --sqlite3 → --db and --output-sqlite3 → --output-db in force-push.

victorlin (Member, Author):

Using DSN conventions sounds like a good idea if/when we want to support other database implementations. However, I'd like to limit the scope of this PR. How about only supporting SQLite3 database filepaths for --db/--output-db for now?

If/when we consider adding support for a remote database server (e.g. PostgreSQL), we should consider using DSNs to overload the db arguments.

If/when we consider adding support for another file-based database implementation (e.g. DuckDB), we should consider DSNs as well as adding the ability to decide which database implementation to use for filepaths.

from augur.io.sqlite3 import Sqlite3Database


def export(db_file:str, metadata_table:str, to_metadata:str=''):
Contributor:

I'm probably missing context here from the other original PR, but when to_metadata isn't provided, this function quietly does nothing. Is that expected behavior or is the idea that this function isn't part of a public API, but it will always be called from another internal context where we know how to call it? Will this function eventually include to_sequences?

victorlin (Member, Author):

Is that expected behavior or is the idea that this function isn't part of a public API, but it will always be called from another internal context where we know how to call it?

Yes, this is what I was thinking. In this case, the user requirement is defined on the argparse level.

We could add something like this, but it seems unnecessary:

if not all((db_file, metadata_table, to_metadata)):
    raise AugurError("At least one input not defined")

victorlin (Member, Author):

Will this function eventually include to_sequences?

Also what I was thinking.

Member:

We could add something like this, but it seems unnecessary:

I'd advocate for this sort of internal consistency check, even if it seems a bit much now. As the code changes over time and callers become more distant, it's more likely that we'll trip over this "silently does nothing" scenario.

Member:

I don't think it needs to be an AugurError necessarily; an assert statement would do. The goal is to catch programming bugs, not user errors (because as you've said, this is an internal function).
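
For example, a minimal sketch using the parameter names from the export() signature above:

# Internal consistency check: catch programming bugs, not user errors.
assert db_file and metadata_table and to_metadata, \
    "export() requires db_file, metadata_table, and to_metadata"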

Resolved threads: augur/db/export/sqlite3.py (outdated) · augur/db/load/__init__.py (outdated) · augur/io/sqlite3.py · augur/io/sqlite3.py (outdated, ×2) · augur/db/export/__init__.py (outdated) · tests/functional/db/cram/load-query-export.t (outdated)
tsibley self-requested a review December 8, 2022 00:08
victorlin force-pushed the victorlin/augur-db branch 6 times, most recently from 9f4034a to a7f322a on December 8, 2022
Resolved threads: augur/io/sqlite3.py (outdated) · augur/db/load/sqlite3.py (outdated)
tsibley (Member) left a comment:

Checkpointing a first quick-ish pass of comments, largely driven by what I noticed while reading/digesting the discussion between you and John.

I'll 👀 more tomorrow morning.

victorlin force-pushed the victorlin/augur-db branch 2 times, most recently from 26d2f3e to 07a7f04 on December 12, 2022
victorlin changed the title from "augur db with load/export of metadata" to "augur db with import/export of metadata" on December 12, 2022
Resolved threads: augur/io/metadata.py (outdated) · augur/io/tabular_file.py · augur/io/tabular_file.py (outdated)
def rows(self):
    """Yield rows from a tabular file."""
    with open_file(self.file) as f:
        reader = csv.reader(f, delimiter=self.delimiter)
Member:

Do we want a DictReader instead here? Or should the reader class used be customizable by subclasses? (I'd think a NamedTupleReader would be most ideal...)

victorlin (Member, Author):

Updated to be a DictReader in c22611e...7c8b304; see that diff for related discussion.

I didn't try yet, but I don't think namedtuple is supported here since execute/executemany requires either a sequence or dict. What's the benefit of namedtuple over dict here?

I think customizability of the reader class can be considered in the future if/when there's a use case.

Member:

I didn't try yet, but I don't think namedtuple is supported here since execute/executemany requires either a sequence or dict. What's the benefit of namedtuple over dict here?

namedtuples are already a sequence and also easily turned into dicts with ._asdict() (which, yes, is a public method). I don't think there's any issue using them with the DB-API provided by the sqlite3 module.

Named tuples are nice for a few reasons:

  • They allow accessing values as attributes (row.accession) rather than items (row["accession"]) and are thus a bit more ergonomic.
  • They are lighter weight in memory than dicts, which is nice if you have a bunch of them (e.g. a set of rows held in memory at once).
  • They can be easily typed for specific kinds of records, if desired.

Anecdotally, I used namedtuples in the Python data access layer for ID3C (e.g.) and that was quite nice to use.
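
A rough sketch of what such a reader could look like (namedtuple_rows is a hypothetical name, not something in this PR):

import csv
from collections import namedtuple

def namedtuple_rows(f, delimiter="\t"):
    """Yield each row of a delimited file as a namedtuple keyed by the header."""
    reader = csv.reader(f, delimiter=delimiter)
    # rename=True replaces column names that aren't valid Python identifiers
    Row = namedtuple("Row", next(reader), rename=True)
    for fields in reader:
        yield Row(*fields)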

Comment on lines 21 to 50
def execute(self, *args):
    """Execute under a connection context."""
    with self.connect() as con:
        con.execute(*args)

def executemany(self, *args):
    """Executemany under a connection context."""
    # TODO: benchmark performance with cursor.executemany / con.commit at the end
    with self.connect() as con:
        con.executemany(*args)
Member:

I think we'd want to hold the connection open across several execute statements, rather than connecting each time.

But stepping back, why make these wrappers for execute() and executemany() in the first place?

victorlin (Member, Author):

I think we'd want to hold the connection open across several execute statements, rather than connecting each time.

This is a good point. I'm thinking of making the Sqlite3Database a context manager so it can be used like this, where metadata and sequences each get their own connection:

with Sqlite3Database(db_file) as database:
    database.create_table("metadata")
    database.executemany(insert_statement, metadata.rows)
    database.create_unique_index("metadata", "strain")
with Sqlite3Database(db_file) as database:
    database.create_table("sequences")
    database.executemany(insert_statement, sequences.records)
    database.create_unique_index("sequences", "strain")

victorlin (Member, Author) commented Dec 16, 2022:

But stepping back, why make these wrappers for execute() and executemany() in the first place?

I forget why I made it like this. There was probably a good reason in the other PR, but not for here.

However, with the context manager proposed in the above comment, these wrappers could do more, e.g.

def execute(self, *args):
    if self.connection:
        self.connection.execute(*args)
    else:
        with self.connect() as connection:
            connection.execute(*args)
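
For reference, a sketch of the context-manager plumbing these wrappers assume (hypothetical; only the connection attribute comes from the comments above):

import sqlite3

class Sqlite3Database:
    def __init__(self, db_file):
        self.db_file = db_file
        self.connection = None

    def __enter__(self):
        # Hold a single connection open across several execute statements.
        self.connection = sqlite3.connect(self.db_file)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Commit on success, roll back on error, then close.
        if exc_type is None:
            self.connection.commit()
        else:
            self.connection.rollback()
        self.connection.close()
        self.connection = None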

victorlin (Member, Author):

Implemented in b3c93f3...aba56fe.

Resolved thread: augur/db/import_/sqlite3.py (outdated)
Comment on lines 25 to 54
insert_statement = f"""
INSERT INTO {table_name}
VALUES ({','.join(['?' for _ in metadata.names])})
"""
Member:

I think we should be inserting explicitly by name instead of implicitly by position, to make this code more robust to corruption issues introduced by future changes (e.g. in table creation). So, for example:
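
(A sketch of the idea, assuming metadata.names holds the column names; the follow-up below notes a caveat about reusing column names as placeholder names.)

columns = ", ".join(f'"{name}"' for name in metadata.names)
placeholders = ", ".join(":" + name for name in metadata.names)
insert_statement = f"""
    INSERT INTO {table_name} ({columns})
    VALUES ({placeholders})
"""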

This would also let us handle rows with missing columns earlier and generate good error messages (including the names of the missing columns) more nicely than the SQLite error message matching below.

victorlin (Member, Author):

Implemented as c22611e...7c8b304.

Notes:

  1. Behavior is changed – rows that have a mismatched number of columns will be added instead of producing an error. You can see from the diff of functional tests linked above. This changes the motivation of having named placeholders – there isn't a way to generate good error messages like you described. It's worth noting that pandas.read_csv (which most of Augur uses) has the same accepting behavior for these mismatches, so accepting them here is probably fine.
  2. Column names cannot be used directly since placeholder names cannot be quote-sanitized in the insert statement. I created a function to generate valid placeholder names as a workaround (also, not re-using column names as placeholder names helped me better understand the SQLite behavior).

I still have to do this for import_sequences() below, but it'll be about the same. I'm thinking it might even be worth creating a class method Sqlite3Database.insert(table_name, column_names, rows) which would take care of the placeholder name generation under the hood.
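
A sketch of what that hypothetical Sqlite3Database.insert() might look like, assuming rows are dicts keyed by column name and that table/column names are quote-sanitized upstream:

def insert(self, table_name, column_names, rows):
    # Generated placeholder names (c0, c1, …) sidestep the fact that
    # placeholder names, unlike identifiers, cannot be quote-sanitized.
    placeholder_names = [f"c{i}" for i in range(len(column_names))]
    statement = f"""
        INSERT INTO {table_name} ({', '.join(column_names)})
        VALUES ({', '.join(':' + p for p in placeholder_names)})
    """
    # Missing columns become NULL via row.get(), matching the accepting
    # behavior described above.
    params = ({p: row.get(name) for p, name in zip(placeholder_names, column_names)} for row in rows)
    self.connection.executemany(statement, params)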

victorlin (Member, Author):

Addressed the paragraph above among other things in 7c8b304...f1b2d0b.

raise AugurError(f'Failed to create table: {e}')

insert_statement = f"""
INSERT INTO {table_name}
Member:

More quoting/escaping of table_name is needed here.

In general, we should not be doing ~any normal string interpolation into the SQL we generate without corresponding quoting/escaping applied. (Static interpolations, like of '?' are fine, as they're not dynamic input.)
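
For SQLite identifiers, such sanitization can be as small as doubling embedded double quotes; a minimal sketch:

def sanitize_identifier(identifier):
    """Quote an SQLite identifier, escaping embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'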

victorlin (Member, Author):

Added: 30c7bcc

And yes, agreed. We might have to rely on code review to catch these things... I can't think of a good way to automatically "lint" for it. Maybe look for quoting in Sqlite3Database.execute(), but that check would happen at run time, which is not so useful.

Resolved threads: docs/api/developer/augur.db.load.rst (outdated) · tests/functional/db/cram/export-metadata.t (outdated)
victorlin (Member, Author):

Force-pushing basic sequence import/export.

Comment on lines +1 to +3
# Table names
DEFAULT_IMPORTED_METADATA_TABLE = 'metadata'
DEFAULT_IMPORTED_SEQUENCES_TABLE = 'sequences'
Member:

This is maybe a bigger picture design decision, but: Should we mirror the traditional TSV/FASTA split in the database like this with separate tables… or should we unify into a single table?

Seems worth doing a bit of critical thinking thru each choice.

Contributor:

It would be really cool if we can pipe augur curate outputs directly to augur db import then all downstream commands can just work from the database! This would be easiest if augur db import imports an NDJSON input into a single table.

Contributor:

+1 to unifying in a single table. This is how GenBank records, GISAID NDJSON records, etc. already work and, like @joverlee521 mentions, how our curate logic thinks about metadata and sequences, where a sequence is another attribute of a strain record. The choice to store metadata as TSV and sequences as FASTA was based on the desire to keep intermediate and final outputs from Augur in standard bioinformatics formats so other tools could parse them. Inside the database, that split doesn't make as much sense.

How would this single table implementation relate to prior work with SFS's sequence database?

Contributor:

How would this single table implementation relate to prior work with SFS's sequence database?

This would be much simpler than the SFS ID3C schema since ID3C had to support one-to-many relationships. I think we only expect one-to-one between metadata and sequences here?

j23414 (Contributor) commented Dec 17, 2022:

I would expect one-to-many relationships if we expect augur db to work with multisegmented viruses:

  • e.g. One influenza A isolate links out to PB2, PB1, PA, HA, NA, NP, M, NS sequences.

However, if we're using Accession IDs (and not strain names) then ignore my comment above. There should be a one-to-one match of accession to sequence.

I would expect many-to-one in the case of:

  • Multiple isolates from the same barn/hospital/home have the same consensus sequence.
  • Convergent evolution

victorlin (Member, Author):

I think we only expect one-to-one between metadata and sequences here?

Setting the expectation here is key. I'm not sure what to expect as I don't have much experience using these file formats in practice.

If we intend to support situations where there are duplicates or mismatches, even for the curate step (e.g. metadata de-duplication by taking the most recent metadata entry), the behavior should be customizable by the user.

We can easily store raw data as two separate tables and create a view that combines both for easy querying. It can be customized based on how duplicates or mismatches should be handled.
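
For example, a sketch of such a view, assuming both tables share a strain id column and the sequence column is named seq (names hypothetical):

database.execute("""
    CREATE VIEW IF NOT EXISTS records AS
    SELECT metadata.*, sequences.seq
    FROM metadata
    LEFT JOIN sequences USING (strain)
""")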


@joverlee521 separately, we were discussing that maybe even augur curate could use the database internally. Things like column indexing could give a performance boost.

Resolved threads: augur/db/export/__init__.py (outdated) · augur/db/export/sqlite3.py (outdated) · augur/db/import_/sqlite3.py (outdated, ×3)
Comment on lines 63 to 74
def import_sequences(sequences:Sequences, table_name:str, database:Sqlite3Database):
    columns = [sequences.id_name, DEFAULT_SEQUENCES_SEQ_NAME]
    try:
        database.create_table(table_name, columns)
Member:

Hmm. Another bigger picture design decision: I think we should be lightly standardizing the database schema. In particular, the names of the tables and their id columns. So the metadata id column should be static and the same as the sequences id column.

Without any standardization of schema, the SQLite database acts only as a container (albeit a useful one). With some light standardization, it can also be an interface that eases semantics-preserving interchange.

Contributor:

How would you think to further standardize the schema? augur db import provides default values of metadata and sequences for the table name arguments and we require metadata to have a strain or name column through the read_metadata function. If we go with the single-table implementation you mention above, that would resolve the potential id column name mismatch.

Resolved thread: augur/io/sequences.py (outdated)
Comment on lines 100 to 103
# Get results in chunks so they are not loaded into memory all at once.
df_chunks = pd.read_sql_query(query, con, chunksize=chunksize)
for df_chunk in df_chunks:
    sequences = df_chunk.apply(lambda row: SeqRecord(id=row[id_name], seq=Seq(row[sequence_name]), description=""), axis=1)
Member:

Why use Pandas here?

victorlin (Member, Author):

Good question. I took a stab at doing it without Pandas, and this is what it would be: 39997df

In terms of development, there's pros and cons to both approaches.

  • Pandas pros: I suspect this type of query→csv operation will be used frequently with various usages, and any extra features we may want down the road should be customizable through to_csv_kwargs. This would be easier to build on in comparison to managing connections/cursors to get the equivalent. Also, developers should be familiar with Pandas due to usage across the codebase.
  • Pandas cons: need to read in chunks to avoid loading a large amount of data into memory.

Without Pandas, the pros/cons are inverted.

I did not try for sequences yet (query_to_fasta()). That would probably require a csv.DictWriter to access id and seq columns.

Note that I haven't tested performance differences, which should be an important consideration here.
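
For reference, a minimal sketch of the cursor-based approach (not necessarily what 39997df does):

import csv

def query_to_csv(connection, query, out_file, delimiter="\t"):
    cursor = connection.execute(query)
    # Header comes from the cursor's column metadata.
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    writer.writerow(column[0] for column in cursor.description)
    # Iterating the cursor streams rows without loading them all at once.
    writer.writerows(cursor)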

Member:

I think in Pandas there's a lot of underlying, hidden complexity waiting to bite us, and that it's really unnecessary for dumping a database query cursor to a file. Some of this is from experience being bitten.

I think overall the non-Pandas approach you drafted in 39997df is simpler and more understandable.

(Passing thru keyword args from a first-party function to a third-party function, as in to_csv_kwargs, is an anti-pattern, in my book, so it's not something I'd want to build on.)

Comment on lines +55 to +60
# Infer delimiter from the first line.
dialect = csv.Sniffer().sniff(f.readline(), delimiters=possible_delimiters)
Contributor:

Relevant conversation from another PR for augur curate:

@tsibley said:

Dare I ask what happens if some of your TSV column names contain commas?

In the end I dared… and it goes about as well as you expect (not well). So yeah, we'll need to either make the sniffer smarter (look at proportions of one delimiter to another) or allow manual override (probably best).
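
The proportion idea could be as simple as the following sketch (detect_delimiter is hypothetical, not part of this PR):

def detect_delimiter(header_line, possible_delimiters=("\t", ",")):
    # Pick the candidate that appears most often in the header line.
    return max(possible_delimiters, key=header_line.count)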

augur/io/file.py Outdated
@@ -27,3 +27,20 @@ def open_file(path_or_buffer, mode="r", **kwargs):
        yield handle
    except TypeError:
        yield path_or_buffer


POTENTIAL_STRAIN_ID_NAMES = ("strain", "id")
Contributor:

Should this include "name" since it is supported elsewhere in augur?

victorlin (Member, Author):

I've moved this to a new file in 28001ac, along with some more context. This is independent of db work and could even be a separate PR.

database = Sqlite3Database(db_file)

if bool(metadata_file and metadata_table_name):
    metadata = Metadata(metadata_file)
Contributor:

This will raise a StopIteration error if an empty file is provided. We should probably provide a clearer error message that we failed to import an empty file.

victorlin (Member, Author):

Fixed in f2de9e8 – under the assumption that empty FASTA files are valid, just not metadata which requires at least a header row.

Comment on lines +11 to +15
if output_metadata:
    export_metadata(database, metadata_table, output_metadata)

if output_sequences:
    export_sequences(database, sequences_table, output_sequences)
Contributor:

Both export_* can raise a sqlite3.OperationalError if the table does not exist in the database.
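
One way to fail with a clearer message is to check for the table up front; a sketch, assuming a raw sqlite3 connection:

def table_exists(connection, table_name):
    cursor = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,))
    return cursor.fetchone() is not None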

victorlin (Member, Author):

Fixed in force-push.

    export_metadata(database, metadata_table, output_metadata)

if output_sequences:
    export_sequences(database, sequences_table, output_sequences)
Contributor:

I ran into a cryptic error when I tried to export sequences from an empty table:

Traceback (most recent call last):
  File "./augur/__init__.py", line 68, in run
    return args.__command__.run(args)
  File "./augur/db/export/__init__.py", line 26, in run
    args.output_metadata, args.output_sequences)
  File "./augur/db/export/sqlite3.py", line 15, in export
    export_sequences(database, sequences_table, output_sequences)
  File "./augur/db/export/sqlite3.py", line 41, in export_sequences
    sequence_name=DEFAULT_SEQUENCES_SEQ_NAME,
  File "./augur/io/sqlite3.py", line 106, in query_to_fasta
    SeqIO.write(sequences, out_file, "fasta")
  File "/Users/jlee2346/opt/miniconda3/envs/augur/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 518, in write
    fp.write(format_function(record))
  File "/Users/jlee2346/opt/miniconda3/envs/augur/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 416, in as_fasta
    id = _clean(record.id)
AttributeError: 'str' object has no attribute 'id'

victorlin force-pushed the victorlin/augur-db branch 5 times, most recently from f1b2d0b to e330db7 on January 7, 2023
Commit message: "This avoids re-defining these values at each use case and prevents them from getting out of sync."
Comment on lines +24 to +27
$ ${AUGUR} db import \
> --metadata in.tsv \
> --sequences in.fasta \
> --output-db data.sqlite3
victorlin (Member, Author):

Hmm, I'm having second thoughts here. I think these subcommands are trying to do too much at once. Having --metadata and --sequences at the same time makes sense for other subcommands where both are needed, but that's not the case here.

What about a modified CLI where each subcommand acts on file types rather than Augur-specific "metadata"/"sequences", and only one at a time?

augur db import csv \
    --file metadata.tsv \
    --table metadata \
    --output-db data.sqlite3

augur db import fasta \
    --file sequences.fasta \
    --table sequences \
    --output-db data.sqlite3

augur db export csv \
    --db data.sqlite3 \
    --table metadata \
    --output-file metadata.tsv

augur db export fasta \
    --db data.sqlite3 \
    --table sequences \
    --output-file sequences.fasta

This way, the path to support NDJSON and chaining with augur curate as mentioned in #1094 (comment) is much clearer.

augur curate passthru \
    --metadata metadata.tsv \
| augur db import ndjson \
    --file - \
    --table metadata_and_sequences \
    --output-db data.sqlite3

augur db export ndjson \
    --db data.sqlite3 \
    --table metadata_and_sequences \
    --output-file - \
| augur curate normalize-strings > records.ndjson

Anyways, this might be premature thinking since we haven't even defined a use case for the augur db CLI. I should be focusing more on integrating the underlying class interfaces into augur filter which is a well-defined use case...

Member:

I'm leery of --table. I don't think that sort of general interface is what we want here. I think we want a specific "Augur database" format, the same way we have a specific "Augur node data" format or "Nextstrain dataset" format.

Anyways, this might be premature thinking since we haven't even defined a use case for the augur db CLI. I should be focusing more on integrating the underlying class interfaces into augur filter which is a well-defined use case...

Maybe let's get a small group of interested folks to discuss directions/use case/intentions here?

victorlin (Member, Author):

Maybe let's get a small group of interested folks to discuss directions/use case/intentions here?

Sure! I want to put the newer work with augur filter out for review first (as mentioned in #854 (comment)); with that it should be easier to have this discussion.

Commit messages from the pushed commits:

  • These are in anticipation of usage by the augur db subcommand, but can be used by other subcommands such as augur filter.
  • This adds a basic interface to enable working in a world of SQLite3. See functional test files for example usage.
  • This serves to detect duplicates and speed up strain-based queries.
  • This is unused but can be handy.
Labels: none yet · Projects: Status: In Progress · 5 participants