
augur db with import/export of metadata #1094

Draft · victorlin wants to merge 6 commits into master from victorlin/augur-db

Conversation

victorlin (Member) commented Nov 8, 2022

Description of proposed changes

This adds a basic interface to enable working with metadata in a world of SQLite3.

See functional test files for example usage.

Implementation was pulled out of #854 and repurposed for the command line interface. I suspect this PR will be easier to review and require less testing. If this is merged, #854 can be rebased to utilize the new changes.

I wrote this with the thought of adding more features in the future. Examples: generating year/month/day columns, loading a sequences FASTA, appending to existing tables, and supporting other databases.

Everything is up for debate, including naming!

I also suspect that I might be re-inventing the wheel with some class interfaces here. References to other tools/packages that others are aware of are welcome.

Related issue(s)

Testing

  • Added functional tests

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

Summary of comment threads

These are comment threads with a decent amount of discussion.

  • Support other database implementations (e.g. PostgreSQL)? (ref)
  • Attribute docstrings (ref1, ref2)
  • Merge metadata and sequences tables? (ref)
  • Standardize database schema? (ref)
  • Use pandas to export query results to a file? (ref)
  • Use TEXT type for all columns? (ref)
  • Connection and transaction management (ref1, ref2)

victorlin (Member, Author) left a comment:

Some comments to start discussion.

@@ -23,6 +23,7 @@
    "parse",
    "curate",
    "index",
    "db",
victorlin (Member, Author):

I placed this here thinking that eventually, it would make sense to load files from previous commands on this list into the database.

Resolved threads: augur/io/sqlite3.py · augur/db/__init__.py (outdated)
victorlin mentioned this pull request Nov 8, 2022
victorlin marked this pull request as ready for review November 8, 2022 01:59
victorlin requested a review from a team November 8, 2022 01:59
codecov bot commented Nov 8, 2022

Codecov Report

Base: 67.98% // Head: 66.91% // Decreases project coverage by -1.07% ⚠️

Coverage data is based on head (bc3f5c5) compared to base (c8900c5).
Patch coverage: 40.22% of modified lines in pull request are covered.

❗ Current head bc3f5c5 differs from pull request most recent head c5574fa. Consider uploading reports for the commit c5574fa to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1094      +/-   ##
==========================================
- Coverage   67.98%   66.91%   -1.08%     
==========================================
  Files          57       66       +9     
  Lines        6656     6909     +253     
  Branches     1637     1673      +36     
==========================================
+ Hits         4525     4623      +98     
- Misses       1825     1982     +157     
+ Partials      306      304       -2     
Impacted Files Coverage Δ
augur/__init__.py 66.66% <ø> (ø)
augur/io/tabular_file.py 21.62% <21.62%> (ø)
augur/db/import_/sqlite3.py 22.95% <22.95%> (ø)
augur/io/sqlite3.py 31.25% <31.25%> (ø)
augur/db/export/sqlite3.py 35.71% <35.71%> (ø)
augur/io/sequences.py 76.47% <38.46%> (-23.53%) ⬇️
augur/io/file.py 80.00% <50.00%> (-20.00%) ⬇️
augur/io/metadata.py 92.07% <50.00%> (-3.96%) ⬇️
augur/db/export/__init__.py 70.00% <70.00%> (ø)
augur/db/import_/__init__.py 71.42% <71.42%> (ø)
... and 5 more


huddlej self-requested a review November 22, 2022 22:33
huddlej (Contributor) left a comment:

Thank you for putting together a smaller PR just for the database interface, @victorlin! I agree that it is generally easier to understand this smaller bit of work (even though this work reflects a substantial amount of effort on your part!). I also like the idea of exposing a database subcommand sooner that allows us to create a database to pass as input to other commands like augur filter. I did some simple testing of the commands from the terminal, but nothing more sophisticated than what you've already tested with functional tests here. I have a bunch of implementation questions/comments below that I'm happy to discuss with you synchronously or in threads here.

Comment on lines 438 to 440
self.potential_id_columns = potential_id_columns or ("strain", "name")
"""Metadata columns that may contain identifier information."""
Contributor:

I was surprised to see this style of commenting throughout this PR where I would have expected something like this instead:

# Metadata columns that may contain identifier information.
self.potential_id_columns = potential_id_columns or ("strain", "name")

I'm used to limiting use of multi-line comments to docstrings or at least having the comment precede the code it refers to. Is there a technical benefit to this style or is it an aesthetic preference?

victorlin (Member, Author):

From PEP 257 – Docstring Conventions:

String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called “attribute docstrings”.

These are used by VSCode to show the docstring upon mouse hover elsewhere in the code. I suspect other editors that use docstrings to enrich the experience might also support them.

[screenshot: VSCode showing the attribute docstring on mouse hover]

Member:

sphinx.ext.autodoc also supports this kind of less-frequently-seen docstring.

Contributor:

Got it. It feels a bit like catering to a computer's ability to read the code rather than a human's, but if the dev experience for most people benefits from this approach it makes sense. I would not generally want these types of comments in sphinx docs, though...

Member:

I would not generally want these types of comments in sphinx docs, though...

Not sure I follow. You wouldn't want the attributes of a class to be documented with this sort of comment on what they are?

victorlin (Member, Author):

I think @huddlej may have been referring to other more verbose comments that are used around the codebase, which are only useful when you're trying to understand the underlying logic. The numpy-style docstrings also contain descriptions that are equivalent to the docstrings I'm using here, which are already rendered on the Developer API doc pages.

Contributor:

@tsibley I definitely want class and instance attributes to be documented. My initial concern was that other variable assignments that use attribute docstrings would get pulled into API docs, too, but PEP 258 is clearer that this shouldn't happen.

I was also expecting class and instance attributes to be documented by class- and init-level docstrings based on the more traditional guidance from PEP 257:

The docstring for a class should summarize its behavior and list the public methods and instance variables.

It's cool that the attribute docstrings appear in VSCode and other docutils processing systems, but they don't show up on the Python or ipython REPLs where I would look for them during usage and debugging. For example, if I define AClass from the PEP 258 example code, I can't access the attribute docstrings with AClass.c.__doc__. A class- or init-level docstring would be accessible from the REPLs and provide the same information via AClass.__doc__ or AClass().__doc__. For example, the following version of AClass provides information about class attributes on the REPL and API docs:

class AClass:
    """A class!

    Attributes
    ----------
    c : str
       This is AClass.c's docstring. 
    i : str
        This is self.i's docstring.

    """
    c = 'class attribute'

    def __init__(self):
        """Method __init__'s docstring."""

        self.i = 'instance attribute'

From the REPL, I can get the docstring with:

>>> print(AClass.__doc__)
A class!

    Attributes
    ----------
    c : str
       This is AClass.c's docstring.
    i : str
        This is self.i's docstring.


>>>


def register_parser(parent_subparsers):
    parser = parent_subparsers.add_parser("export", help=first_line(__doc__))
    parser.add_argument('--sqlite3', required=True, help="SQLite3 database file to export from.")
Contributor:

Would you consider using --database here instead of specifying the database implementation? We've talked about eventually supporting database servers along with database files, so separating the CLI argument from the specific implementation details might prepare us for that future. Or is the idea that we would add other arguments to support other implementations like --postgres-server?

victorlin (Member, Author):

Or is the idea that we would add other arguments to support other implementations like --postgres-server?

Yes, this was my intention in naming it --sqlite3 specifically. There are potential use cases for augur db with other types of databases, and the code here has been organized with that in mind.

However, I suspect much of our initial use will be with SQLite3 – passing the database file around as currently done with sequences and metadata files. Even though users might be unfamiliar with SQLite3 and don't necessarily need to know how it works, knowing what's under the hood can be beneficial for debugging and general awareness.

For the --database idea specifically: I suppose we could use pyodbc and accept an ODBC connection string, but that's not so user-friendly (at least for SQLite3 which requires downloading an ODBC driver separately):

  • --database "DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=example.db;Trusted_connection=yes"
  • --database "DRIVER={PostgreSQL ODBC Driver(UNICODE)};SERVER=some.postgresql.endpoint;PORT=5432;DATABASE=example;UID=user;PWD=pass"

My intention is a more user-oriented CLI that allows for custom "connection strings" per database type:

  • --sqlite3 example.db
  • --postgresql "postgresql://user:pass@some.postgresql.endpoint:5432/example"

victorlin (Member, Author):

Actually, we could support custom "connection values" and determine database type internally.

  • --database example.db
  • --database "postgresql://user:pass@some.postgresql.endpoint:5432/example"

Under the hood:

if is_sqlite_file(args.database):
    load_sqlite3(…)
elif is_postgresql_connection_string(args.database):
    load_postgres(…)
…
else:
    raise AugurError("Unsupported connection value. Run augur db load --help to for supported options.")

I'd be open to this alternative. Does it sound better to you?

Member:

Sounds better to me. It's common convention to have a single argument/option/parameter/etc. (here, --database) that takes some sort of DSN (data source name) which identifies the implementation. The postgresql:// URLs and ODBC connection strings are both examples of DSNs.

Member:

Let's not reinvent the wheel though, so we should pick an existing DSN convention and stick with it. Some prior art:

The first link also includes other references at the bottom of its README.

Member:

Relatedly, let's make --output-sqlite3 → --output-database and so on for similar options.

(Should we use db instead of database, e.g. --db and --output-db, so it matches with augur db? Or inversely, should we rename augur db to augur database and leave the options as --database? Maybe we should just alias 'em all.)

victorlin (Member, Author):

Good suggestion on aliasing. Renamed --sqlite3 → --db and --output-sqlite3 → --output-db in force-push.

victorlin (Member, Author):

Using DSN conventions sounds like a good idea if/when we want to support other database implementations. However, I'd like to limit the scope of this PR. How about only supporting SQLite3 database filepaths for --db/--output-db for now?

If/when we consider adding support for a remote database server (e.g. PostgreSQL), we should consider using DSNs to overload the db arguments.

If/when we consider adding support for another file-based database implementation (e.g. DuckDB), we should consider DSNs as well as adding the ability to decide which database implementation to use for filepaths.

from augur.io.sqlite3 import Sqlite3Database


def export(db_file:str, metadata_table:str, to_metadata:str=''):
Contributor:

I'm probably missing context here from the other original PR, but when to_metadata isn't provided, this function quietly does nothing. Is that expected behavior or is the idea that this function isn't part of a public API, but it will always be called from another internal context where we know how to call it? Will this function eventually include to_sequences?

victorlin (Member, Author):

Is that expected behavior or is the idea that this function isn't part of a public API, but it will always be called from another internal context where we know how to call it?

Yes, this is what I was thinking. In this case, the user requirement is defined on the argparse level.

We could add something like this, but it seems unnecessary:

if not all((db_file, metadata_table, to_metadata)):
    raise AugurError("At least one input not defined")

victorlin (Member, Author):

Will this function eventually include to_sequences?

Also what I was thinking.

Member:

We could add something like this, but it seems unnecessary:

I'd advocate for this sort of internal consistency check, even if it seems a bit much now. As the code changes over time and callers become more distant, it's more likely that we'll trip over this "silently does nothing" scenario.

Member:

I don't think it needs to be an AugurError necessarily; an assert statement would do. The goal is to catch programming bugs, not user errors (because as you've said, this is an internal function).
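
For example, a minimal sketch using the parameter names from the export() signature above:

# Internal consistency check: catch programming bugs, not user errors.
assert db_file and metadata_table and to_metadata, \
    "export() requires db_file, metadata_table, and to_metadata"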

Resolved threads: augur/db/export/sqlite3.py (outdated) · augur/db/load/__init__.py (outdated) · augur/io/sqlite3.py · augur/io/sqlite3.py (outdated, ×2) · augur/db/export/__init__.py (outdated) · tests/functional/db/cram/load-query-export.t (outdated)
tsibley self-requested a review December 8, 2022 00:08
victorlin force-pushed the victorlin/augur-db branch 6 times, most recently from 9f4034a to a7f322a on December 8, 2022
Resolved threads: augur/io/sqlite3.py (outdated) · augur/db/load/sqlite3.py (outdated)
tsibley (Member) left a comment:

Checkpointing a first quick-ish pass of comments, largely driven by what I noticed while reading/digesting the discussion between you and John.

I'll 👀 more tomorrow morning.

victorlin force-pushed the victorlin/augur-db branch 2 times, most recently from 26d2f3e to 07a7f04 on December 12, 2022
victorlin changed the title from "augur db with load/export of metadata" to "augur db with import/export of metadata" on December 12, 2022
Resolved threads: augur/io/metadata.py (outdated) · augur/io/tabular_file.py · augur/io/tabular_file.py (outdated)
def rows(self):
    """Yield rows from a tabular file."""
    with open_file(self.file) as f:
        reader = csv.reader(f, delimiter=self.delimiter)
Member:

Do we want a DictReader instead here? Or should the reader class used be customizable by subclasses? (I'd think a NamedTupleReader would be most ideal...)

victorlin (Member, Author):

Updated to be a DictReader in c22611e...7c8b304; see that diff for related discussion.

I didn't try yet, but I don't think namedtuple is supported here since execute/executemany requires either a sequence or dict. What's the benefit of namedtuple over dict here?

I think customizability of the reader class can be considered in the future if/when there's a use case.

Member:

I didn't try yet, but I don't think namedtuple is supported here since execute/executemany requires either a sequence or dict. What's the benefit of namedtuple over dict here?

namedtuples are already a sequence and also easily turned into dicts with ._asdict() (which, yes, is a public method). I don't think there's any issue using them with the DB-API provided by the sqlite3 module.

Named tuples are nice for a few reasons:

  • They allow accessing values as attributes (row.accession) rather than items (row["accession"]) and are thus a bit more ergonomic.
  • They are lighter weight in memory than dicts, which is nice if you have a bunch of them (e.g. a set of rows held in memory at once).
  • They can be easily typed for specific kinds of records, if desired.

Anecdotally, I used namedtuples in the Python data access layer for ID3C (e.g.) and that was quite nice to use.
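
A rough sketch of what such a reader could look like (namedtuple_rows is a hypothetical name, not something in this PR):

import csv
from collections import namedtuple

def namedtuple_rows(f, delimiter="\t"):
    """Yield each row of a delimited file as a namedtuple keyed by the header."""
    reader = csv.reader(f, delimiter=delimiter)
    # rename=True replaces column names that aren't valid Python identifiers
    Row = namedtuple("Row", next(reader), rename=True)
    for fields in reader:
        yield Row(*fields)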

Comment on lines 21 to 50
def execute(self, *args):
    """Execute under a connection context."""
    with self.connect() as con:
        con.execute(*args)

def executemany(self, *args):
    """Executemany under a connection context."""
    # TODO: benchmark performance with cursor.executemany / con.commit at the end
    with self.connect() as con:
        con.executemany(*args)
Member:

I think we'd want to hold the connection open across several execute statements, rather than connecting each time.

But stepping back, why make these wrappers for execute() and executemany() in the first place?

victorlin (Member, Author):

I think we'd want to hold the connection open across several execute statements, rather than connecting each time.

This is a good point. I'm thinking of making the Sqlite3Database a context manager so it can be used like this, where metadata and sequences each get their own connection:

with Sqlite3Database(db_file) as database:
    database.create_table("metadata")
    database.executemany(insert_statement, metadata.rows)
    database.create_unique_index("metadata", "strain")
with Sqlite3Database(db_file) as database:
    database.create_table("sequences")
    database.executemany(insert_statement, sequences.records)
    database.create_unique_index("sequences", "strain")

victorlin (Member, Author) commented Dec 16, 2022:

But stepping back, why make these wrappers for execute() and executemany() in the first place?

I forget why I made it like this. There was probably a good reason in the other PR, but not for here.

However, with the context manager proposed in the above comment, these wrappers could do more, e.g.

def execute(self, *args):
    if self.connection:
        self.connection.execute(*args)
    else:
        with self.connect() as connection:
            connection.execute(*args)
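
For reference, a sketch of the context-manager plumbing these wrappers assume (hypothetical; only the connection attribute comes from the comments above):

import sqlite3

class Sqlite3Database:
    def __init__(self, db_file):
        self.db_file = db_file
        self.connection = None

    def __enter__(self):
        # Hold a single connection open across several execute statements.
        self.connection = sqlite3.connect(self.db_file)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Commit on success, roll back on error, then close.
        if exc_type is None:
            self.connection.commit()
        else:
            self.connection.rollback()
        self.connection.close()
        self.connection = None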

victorlin (Member, Author):

Implemented in b3c93f3...aba56fe.

Resolved thread: augur/db/import_/sqlite3.py (outdated)
Comment on lines 25 to 54
insert_statement = f"""
INSERT INTO {table_name}
VALUES ({','.join(['?' for _ in metadata.names])})
"""
Member:

I think we should be inserting explicitly by name instead of implicitly by position, to make this code more robust to corruption issues introduced by future changes (e.g. in table creation). So, for example:
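
(A sketch of the idea, assuming metadata.names holds the column names; the follow-up below notes a caveat about reusing column names as placeholder names.)

columns = ", ".join(f'"{name}"' for name in metadata.names)
placeholders = ", ".join(":" + name for name in metadata.names)
insert_statement = f"""
    INSERT INTO {table_name} ({columns})
    VALUES ({placeholders})
"""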

This would also let us handle rows with missing columns earlier and generate good error messages (including the names of the missing columns) more nicely than the SQLite error message matching below.

victorlin (Member, Author):

Implemented as c22611e...7c8b304.

Notes:

  1. Behavior is changed – rows that have a mismatched number of columns will be added instead of producing an error. You can see from the diff of functional tests linked above. This changes the motivation of having named placeholders – there isn't a way to generate good error messages like you described. It's worth noting that pandas.read_csv (which most of Augur uses) has the same accepting behavior for these mismatches, so accepting them here is probably fine.
  2. Column names cannot be used directly since placeholder names cannot be quote-sanitized in the insert statement. I created a function to generate valid placeholder names as a workaround (also, not re-using column names as placeholder names helped me better understand the SQLite behavior).

I still have to do this for import_sequences() below, but it'll be about the same. I'm thinking it might even be worth creating a class method Sqlite3Database.insert(table_name, column_names, rows) which would take care of the placeholder name generation under the hood.
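
A sketch of what that hypothetical Sqlite3Database.insert() might look like, assuming rows are dicts keyed by column name and that table/column names are quote-sanitized upstream:

def insert(self, table_name, column_names, rows):
    # Generated placeholder names (c0, c1, …) sidestep the fact that
    # placeholder names, unlike identifiers, cannot be quote-sanitized.
    placeholder_names = [f"c{i}" for i in range(len(column_names))]
    statement = f"""
        INSERT INTO {table_name} ({', '.join(column_names)})
        VALUES ({', '.join(':' + p for p in placeholder_names)})
    """
    # Missing columns become NULL via row.get(), matching the accepting
    # behavior described above.
    params = ({p: row.get(name) for p, name in zip(placeholder_names, column_names)} for row in rows)
    self.connection.executemany(statement, params)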

victorlin (Member, Author):

Addressed the paragraph above among other things in 7c8b304...f1b2d0b.

raise AugurError(f'Failed to create table: {e}')

insert_statement = f"""
INSERT INTO {table_name}
Member:

More quoting/escaping of table_name is needed here.

In general, we should not be doing ~any normal string interpolation into the SQL we generate without corresponding quoting/escaping applied. (Static interpolations, like of '?' are fine, as they're not dynamic input.)
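
For SQLite identifiers, such sanitization can be as small as doubling embedded double quotes; a minimal sketch:

def sanitize_identifier(identifier):
    """Quote an SQLite identifier, escaping embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'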

victorlin (Member, Author):

Added: 30c7bcc

And yes, agreed. We might have to rely on code review to catch these things... I can't think of a good way to automatically "lint" for it. Maybe look for quoting in Sqlite3Database.execute(), but that check would happen at run time, which is not so useful.

Resolved threads: docs/api/developer/augur.db.load.rst (outdated) · tests/functional/db/cram/export-metadata.t (outdated)
victorlin (Member, Author):

Force-pushing basic sequence import/export.

Comment on lines +1 to +3
# Table names
DEFAULT_IMPORTED_METADATA_TABLE = 'metadata'
DEFAULT_IMPORTED_SEQUENCES_TABLE = 'sequences'
Member:

This is maybe a bigger picture design decision, but: Should we mirror the traditional TSV/FASTA split in the database like this with separate tables… or should we unify into a single table?

Seems worth doing a bit of critical thinking thru each choice.

Contributor:

It would be really cool if we can pipe augur curate outputs directly to augur db import then all downstream commands can just work from the database! This would be easiest if augur db import imports an NDJSON input into a single table.

Contributor:

+1 to unifying in a single table. This is how GenBank records, GISAID NDJSON records, etc. already work and, like @joverlee521 mentions, how our curate logic thinks about metadata and sequences, where a sequence is another attribute of a strain record. The choice to store metadata as TSV and sequences as FASTA was based on the desire to keep intermediate and final outputs from Augur in standard bioinformatics formats so other tools could parse them. Inside the database, that split doesn't make as much sense.

How would this single table implementation relate to prior work with SFS's sequence database?

Contributor:

How would this single table implementation relate to prior work with SFS's sequence database?

This would be much simpler than the SFS ID3C schema since ID3C had to support one-to-many relationships. I think we only expect one-to-one between metadata and sequences here?

j23414 (Contributor) commented Dec 17, 2022:

I would expect one-to-many relationships if we expect augur db to work with multisegmented viruses:

  • e.g. One influenza A isolate links out to PB2, PB1, PA, HA, NA, NP, M, NS sequences.

However, if we're using Accession IDs (and not strain names) then ignore my comment above. There should be a one-to-one match of accession to sequence.

I would expect many-to-one in the case of:

  • Multiple isolates from the same barn/hospital/home have the same consensus sequence.
  • Convergent evolution

victorlin (Member, Author):

I think we only expect one-to-one between metadata and sequences here?

Setting the expectation here is key. I'm not sure what to expect as I don't have much experience using these file formats in practice.

If we intend to support situations where there are duplicates or mismatches, even for the curate step (e.g. metadata de-duplication by taking the most recent metadata entry), the behavior should be customizable by the user.

We can easily store raw data as two separate tables and create a view that combines both for easy querying. It can be customized based on how duplicates or mismatches should be handled.
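
For example, a sketch of such a view, assuming both tables share a strain id column and the sequence column is named seq (names hypothetical):

database.execute("""
    CREATE VIEW IF NOT EXISTS records AS
    SELECT metadata.*, sequences.seq
    FROM metadata
    LEFT JOIN sequences USING (strain)
""")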


@joverlee521 separately, we were discussing that maybe even augur curate could use the database internally. Things like column indexing could give a performance boost.

Resolved threads: augur/db/export/__init__.py (outdated) · augur/db/export/sqlite3.py (outdated) · augur/db/import_/sqlite3.py (outdated, ×3)
Comment on lines 63 to 74
def import_sequences(sequences:Sequences, table_name:str, database:Sqlite3Database):
    columns = [sequences.id_name, DEFAULT_SEQUENCES_SEQ_NAME]
    try:
        database.create_table(table_name, columns)
Member:

Hmm. Another bigger picture design decision: I think we should be lightly standardizing the database schema. In particular, the names of the tables and their id columns. So the metadata id column should be static and the same as the sequences id column.

Without any standardization of schema, the SQLite database acts only as a container (albeit a useful one). With some light standardization, it can also be an interface that eases semantics-preserving interchange.

Contributor:

How would you think to further standardize the schema? augur db import provides default values of metadata and sequences for the table name arguments and we require metadata to have a strain or name column through the read_metadata function. If we go with the single-table implementation you mention above, that would resolve the potential id column name mismatch.

Resolved thread: augur/io/sequences.py (outdated)
Comment on lines 100 to 103
# Get results in chunks so they are not loaded into memory all at once.
df_chunks = pd.read_sql_query(query, con, chunksize=chunksize)
for df_chunk in df_chunks:
    sequences = df_chunk.apply(lambda row: SeqRecord(id=row[id_name], seq=Seq(row[sequence_name]), description=""), axis=1)
Member:

Why use Pandas here?

victorlin (Member, Author):

Good question. I took a stab at doing it without Pandas, and this is what it would be: 39997df

In terms of development, there's pros and cons to both approaches.

  • Pandas pros: I suspect this type of query→csv operation will be used frequently with various usages, and any extra features we may want down the road should be customizable through to_csv_kwargs. This would be easier to build on in comparison to managing connections/cursors to get the equivalent. Also, developers should be familiar with Pandas due to usage across the codebase.
  • Pandas cons: need to read in chunks to avoid loading a large amount of data into memory.

Without Pandas, the pros/cons are inverted.

I did not try for sequences yet (query_to_fasta()). That would probably require a csv.DictWriter to access id and seq columns.

Note that I haven't tested performance differences, which should be an important consideration here.
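
For reference, a minimal sketch of the cursor-based approach (not necessarily what 39997df does):

import csv

def query_to_csv(connection, query, out_file, delimiter="\t"):
    cursor = connection.execute(query)
    # Header comes from the cursor's column metadata.
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    writer.writerow(column[0] for column in cursor.description)
    # Iterating the cursor streams rows without loading them all at once.
    writer.writerows(cursor)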

Member:

I think in Pandas there's a lot of underlying, hidden complexity waiting to bite us, and that it's really unnecessary for dumping a database query cursor to a file. Some of this is from experience being bitten.

I think overall the non-Pandas approach you drafted in 39997df is simpler and more understandable.

(Passing thru keyword args from a first-party function to a third-party function, as in to_csv_kwargs, is an anti-pattern, in my book, so it's not something I'd want to build on.)

Comment on lines +55 to +60
# Infer delimiter from the first line.
dialect = csv.Sniffer().sniff(f.readline(), delimiters=possible_delimiters)
Contributor:

Relevant conversation from another PR for augur curate:

@tsibley said:

Dare I ask what happens if some of your TSV column names contain commas?

In the end I dared… and it goes about as well as you expect (not well). So yeah, we'll need to either make the sniffer smarter (look at proportions of one delimiter to another) or allow manual override (probably best).
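
The proportion idea could be as simple as the following sketch (detect_delimiter is hypothetical, not part of this PR):

def detect_delimiter(header_line, possible_delimiters=("\t", ",")):
    # Pick the candidate that appears most often in the header line.
    return max(possible_delimiters, key=header_line.count)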

augur/io/file.py Outdated
@@ -27,3 +27,20 @@ def open_file(path_or_buffer, mode="r", **kwargs):
        yield handle
    except TypeError:
        yield path_or_buffer


POTENTIAL_STRAIN_ID_NAMES = ("strain", "id")
Contributor:

Should this include "name" since it is supported elsewhere in augur?

victorlin (Member, Author):

I've moved this to a new file in 28001ac, along with some more context. This is independent of db work and could even be a separate PR.

database = Sqlite3Database(db_file)

if bool(metadata_file and metadata_table_name):
    metadata = Metadata(metadata_file)
Contributor:

This will raise a StopIteration error if an empty file is provided. We should probably provide a clearer error message that we failed to import an empty file.

victorlin (Member, Author):

Fixed in f2de9e8 – under the assumption that empty FASTA files are valid, just not metadata which requires at least a header row.

Comment on lines +11 to +15
if output_metadata:
    export_metadata(database, metadata_table, output_metadata)

if output_sequences:
    export_sequences(database, sequences_table, output_sequences)
Contributor:

Both export_* can raise a sqlite3.OperationalError if the table does not exist in the database.
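
One way to fail with a clearer message is to check for the table up front; a sketch, assuming a raw sqlite3 connection:

def table_exists(connection, table_name):
    cursor = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,))
    return cursor.fetchone() is not None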

victorlin (Member, Author):

Fixed in force-push.

    export_metadata(database, metadata_table, output_metadata)

if output_sequences:
    export_sequences(database, sequences_table, output_sequences)
Contributor:

I ran into a cryptic error when I tried to export sequences from an empty table:

Traceback (most recent call last):
  File "./augur/__init__.py", line 68, in run
    return args.__command__.run(args)
  File "./augur/db/export/__init__.py", line 26, in run
    args.output_metadata, args.output_sequences)
  File "./augur/db/export/sqlite3.py", line 15, in export
    export_sequences(database, sequences_table, output_sequences)
  File "./augur/db/export/sqlite3.py", line 41, in export_sequences
    sequence_name=DEFAULT_SEQUENCES_SEQ_NAME,
  File "./augur/io/sqlite3.py", line 106, in query_to_fasta
    SeqIO.write(sequences, out_file, "fasta")
  File "/Users/jlee2346/opt/miniconda3/envs/augur/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 518, in write
    fp.write(format_function(record))
  File "/Users/jlee2346/opt/miniconda3/envs/augur/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 416, in as_fasta
    id = _clean(record.id)
AttributeError: 'str' object has no attribute 'id'

victorlin force-pushed the victorlin/augur-db branch 5 times, most recently from f1b2d0b to e330db7 on January 7, 2023
Commit message: "This avoids re-defining these values at each use case and prevents them from getting out of sync."
Comment on lines +24 to +27
$ ${AUGUR} db import \
> --metadata in.tsv \
> --sequences in.fasta \
> --output-db data.sqlite3
victorlin (Member, Author):

Hmm, I'm having second thoughts here. I think these subcommands are trying to do too much at once. Having --metadata and --sequences at the same time makes sense for other subcommands where both are needed, but that's not the case here.

What about a modified CLI where each subcommand acts on file types rather than Augur-specific "metadata"/"sequences", and only one at a time?

augur db import csv \
    --file metadata.tsv \
    --table metadata \
    --output-db data.sqlite3

augur db import fasta \
    --file sequences.fasta \
    --table sequences \
    --output-db data.sqlite3

augur db export csv \
    --db data.sqlite3 \
    --table metadata \
    --output-file metadata.tsv

augur db export fasta \
    --db data.sqlite3 \
    --table sequences \
    --output-file sequences.fasta

This way, the path to support NDJSON and chaining with augur curate as mentioned in #1094 (comment) is much clearer.

augur curate passthru \
    --metadata metadata.tsv \
| augur db import ndjson \
    --file - \
    --table metadata_and_sequences \
    --output-db data.sqlite3

augur db export ndjson \
    --db data.sqlite3 \
    --table metadata_and_sequences \
    --output-file - \
| augur curate normalize-strings > records.ndjson

Anyways, this might be premature thinking since we haven't even defined a use case for the augur db CLI. I should be focusing more on integrating the underlying class interfaces into augur filter which is a well-defined use case...

Member:

I'm leery of --table. I don't think that sort of general interface is what we want here. I think we want a specific "Augur database" format, the same way we have a specific "Augur node data" format or "Nextstrain dataset" format.

Anyways, this might be premature thinking since we haven't even defined a use case for the augur db CLI. I should be focusing more on integrating the underlying class interfaces into augur filter which is a well-defined use case...

Maybe let's get a small group of interested folks to discuss directions/use case/intentions here?

victorlin (Member, Author):

Maybe let's get a small group of interested folks to discuss directions/use case/intentions here?

Sure! I want to put the newer work with augur filter out for review first (as mentioned in #854 (comment)); with that it should be easier to have this discussion.

Commit messages from the pushed commits:

  • These are in anticipation of usage by the augur db subcommand, but can be used by other subcommands such as augur filter.
  • This adds a basic interface to enable working in a world of SQLite3. See functional test files for example usage.
  • This serves to detect duplicates and speed up strain-based queries.
  • This is unused but can be handy.
Labels: none yet · Projects: Status: In Progress · 5 participants