Standardizing functionality API #103

nsheff · 2023-10-24T21:40:05Z

Today we discussed retrieve_multiple, and we're kind of getting a bit lost in the complexity of the API. Here's my attempt to figure out what we need:

list all record identifiers, paged (list_record_identifiers)
list all records, paged (list_records)
get one record (retrieve_one ?)
get multiple records (retrieve_multiple ?) -- paged? probably not.
select column filtering?

API interface ideas

In #99 I proposed using the __getitem__ interface to retrieve records. Given the above distinction between retrieve_one and retrieve_multiple we're proposing, one possibility is to allow getitem to be the one that can handle either, like this:

def retrieve_one(record_identifier: str, ...) -> dict
def retrieve_multiple(record_identifier_list: list[str], ...) -> list[dict]
def __getitem__(record_identifiers: Union[str, list[str]], ...) -> Union[dict, list[dict])

So the actual functions have strict schema for parameters and results, but getitem allows for convenience, whereby you can do either psm["xyz123md5"] or psm[["xyz123md5", "abc456md5"]].

Cursor-based pagination ideas

If we implement cursor-based paging following some google standards, we could have an interface like this:

def list_record_identifiers(page_token: str, page_size: int)
def list_records(page_token: str, page_size: int)

return value would be:

{
  total_size: 100,
  page_size: 50,
  page_token: "xyz123md5",  # maybe?
  next_page_token: "gyx456md5",
  prev_page_token: "",  # maybe?
  records: [ ... ]
}

The text was updated successfully, but these errors were encountered:

donaldcampbelljr · 2023-10-24T21:57:26Z

Currently, for retrieving or selecting records, we have 10 functions:

get_one_record
get_records
list_recent_results
list_results
retrieve
retrieve_distinct
retrieve_multiple
select
select_txt
select_distinct

donaldcampbelljr · 2023-10-25T19:11:04Z

Some idea's from today's discussion:

select_records

select_records -> will replace select, select_txt

can take a list of tuples such that they can be used to filter e.g.
[(k,v), (k,v), (k,v)] --> distinct columns
can check type and construct appropriate where statements

if type(k) == str:
    where(k=v)

or

if type(k) == tuple
     where( k = ("genome", "alias", "aa") ==v  )

Can potentially implement OR logic as well:

if type(v) == tuple
   .or(k=v1)
   .or(k=v2)
######
#or:
if type(v) == tuple:
 .in_([v1, v2,...])

retrieve_one and retrieve_many

Two convenient retrieval functions which will actually use select_records:
retrieve_one -> retrieving one record
retrieve_many -> retrieving multiple records

select_distinct

select_distinct_values -> will be select_distinct, are we able to simply add .distinct to our query from select_records to get the distinct records

All of these are two be implemented on the abstract class for BOTH backends.

donaldcampbelljr · 2023-10-26T21:19:05Z

Ok, first pass at this, implemented:
select_records
retrieve_one
retrieve_many

All three currently return:

        records_dict = {
            "total_size": total_count,
            "page_size": limit,
            "next_page_token": next_cursor,
            "records": results,
        }

select_records is very similar to the original select function but uses cursor pagination.

Next steps:

Need to change the creation of the dynamic filter as it is not what we want based on yesterday's discussion, i.e. attempt to implement select_txt functionality as well as select_distinct
retrieve_many loops over a list and performs a select_records call for each r_id, but we should be able to do this in one session call

Also, I left all the original functions in place for now. Will remove them once we are happy with the newest changes.

nsheff · 2023-10-27T11:27:37Z

this sounds like great progress!

donaldcampbelljr · 2023-11-06T16:14:03Z

PR merged, this work is now complete.

* Add get_records to pipestat_manager and add related test #75 * clean up signatures #72 * clean up docstrings * remove superfluous get_table_name function * Reduce get_orm and get_model into one function. #71 * Fix CLI to use record-identifier instead of sample-name * Fix CLI to use pipeline_type #37 * modify get_records to return new structure #75 * fix html report generation using get_records * add simple CLI test for reporting and retrieving #81 * Pipestat reader (#80) * update version in prep for pipestat table * add table function * add initial stats function from looper, remove counter * fixing _create_stats_summary and get_file_for_project * fix sample_level stats reporting * add object reporting * remove redundancies * lint * add assertion for table generation to pre-existing test * use pipeline_type when retrieving samples * Fix stats table generation output to look better * fix return types * fix list vs List * fix typo for stats table * fix doc strings * update func names for disambiguation * fix key error issue in html_reports * clean up * adjust docstrings * update docstrings Co-authored-by: Nathan Sheffield <nsheff@users.noreply.github.com> * adjust docstrings, rename html_reports_pipestat.py to reports.py * adjust LOGGER.info string * simplify objs and stats generation * remove old todos * remove sample_name, add pipeline_name to project field definitions * work towards using project_name instead of sample_name from project-level pipelines * fix check record bug, add better error message for pipeline type. * add checks for pipeline types, lint * fix get_samples with pipeline type * fix fetch_pipeline_results * add basic tests for project_level, fix associated bugs * allow table creation to use project_name if applicable * add passing project name to file_backend * refactor to use r_id as input * update doc strings * more refactoring of records vs samples to be more general * move table generation to reports.py * allow status file dir path to be associated with config path OR the file path to ensure looper compatibility. * add output_dir as a parameter for placing reports and stat files. Otherwise defaults to results_file dir or config_file dir * add obtaining schema_path from pipestat config for Looper compatibility * added select function to main pipestat class * added select_txt and select_distinct function to main pipestat class * added get_one_record function to main pipestat class #70 * Removing CLI req that one must supply config and schema because the user can supply schema within the config. For Looper compatibility. * Revert "added get_one_record function to main pipestat class #70" This reverts commit b477060. * Revert "added select_txt and select_distinct function to main pipestat class" This reverts commit 4588f4f. * Revert "added select function to main pipestat class" This reverts commit ee98145. * Add basic glossary page in the form of a table. Addresses: pepkit/looper#290 * remove redundant tag * add RecordNotFoundError across both backends #74 * Add PipelineTypeNotSuppliedError to hide dbbackend information #74 * Initial work towards a fastapi implementation of pipestat reader #22 * Initial work to separate classes #78 * working implementation of child classes for reporting using DBBACKEND. Manual testing confirmed. pytest Tests broken. #78 * implement report for filebackend * fix inheritance issue with pipeline_type * typo * major refactoring, many tests still broken #78 * Fix signatures for report and explicitly state input arguments. * remove pipeline type check * add projectpipestamanager and refactor project_name to record_identifier * fix get_status * fix tests * fix all tests #78 * add todo * code cleanup and remove getting_table_name * clean up * partial removal of pipeline_type logic from DB backend * remove pipeline_type input arguments from File Backend * clean up * add simple class to wrap SamplePipestatManager and ProjectPipestatManager * change pipestatmanager to mutable mapping #78 * lowercase attributes #78 * polish PipestatBoss #78 * env_vars for pipeline_type * update docs * working proof of concept to retrieve results #22 * update readme and add getting output schema * add else catch for outptu schema and fix typo * add ability to run with uvicorn if calling reader.py directly. * add __init__.py and main func * add more endpoints #22 * update endpoints, add Query * allow for more complex filtering using post and pydantic models #22 * Add image and file endpoints #22 * lint * Add basic arg parser to pass absolute path to config file * Add endpoint for get_records() * clean up * modify get_records to return new structure #75 * fix html report generation using get_records * attempt global creation of psm, does not work * fix creation of global psm * change fetching by filetype * add cli option for pipestat serve * simplify and fix pipestat serve #22 * add host and port arguments #22 * add config error * update readme --------- Co-authored-by: nsheff <nsheff@users.noreply.github.com> * list to List * fixed typing in python3.8 * update changelog * fix table_name_bug * "0.5.4a1" for alpha release * "0.6.0a1" for alpha release * Dev json schema (#87) * first pass to allow for output_schema as a true JSON schema #85 * remove unused import * simplify logic, extend _safe_pop_one_mapping to handles multiple keys * disambiguation key vs keys in parsed schema * make samples and project objects instead of arrays, add more test assertions, clean up docstrings. * add status data to string representation of ParsedSchema * add retrieve_distinct function for #70 and databio/bedhost#75 * add `pipestat link` functionality to create symlinks (#90) * begin work on file inking via pipestat * continue implementing link for an output_dir OR the results_file.yaml #89 * implemented symlinks for filebackend given output_dir #89 * implemented symlinks for filebackend using results.yaml, polish tests #89 * fix typo, clean up * more lcean up * begin rework to be backend agnostic * refactor to place in folders specific to a result_identifier * add more complex types, add temp directory for better testing * fix recursive finding of paths * clean up, move to abstract class, confirm works for both backends * remove unused test files * add jinja2 to requirements-all.txt #91 * complex example with collision * add warning * add cli option * remove unused functions * update changelog * clean up docstrings, remove unused variables, refactor link_dir * clean up more docstrings * remove unnecessary print statement to reduce verbosity * Begin work on adding created and modified datetime, DateTime objects are not parsing correctly, tests broken #93 * fix datetime objects #93 * fix pipeline_name field beginning with underscore #94 * timestamp continuation, list_recent_results does NOT work * fix doc string for schema * add filtering by created and modified time #93 * update version for alpha release v0.6.0a2 * fix list reversing #93 * polish tests #93 * add filebackend list_recent_results, remove records broken #93 * fix remove records bug #93 * update doc strings #93 * update change log for list_recent_results * begin work on retrieve_multiple #96 * add db backend, test broken #96 * fix db backend logic #96 * implement file backend #96 * fix docstrings * some minor polishing * update readme * move pipestat reader per #22 * fix pipestat reader setup and manifest per #22 * Add optional dependencies via setup #22 * pipestat can now run without DB specific dependencies #22 * Some refactoring #64 * add other requirements to test * refactor to have ParsedSchemaDB inherit from ParsedSchema for code reduction #64 * Clean up * return total record count #75 * make initialization of file and DB backend into separate functions * add decorator for checking dependencies * polishing decorator for checking dependencies * consolidate decorator and ancillary function into one * first pass #99, some tests broken * fix all tests #99 * refactor store to cfg #99 * fix bug with table path * setitem is now a wrapper for report * fix getitem implementation, add test #99 * Begin work on select_records, retrieve_one, retrieve_many #103 * Implement select_records, and cursor pagination #103 * clarify init message * Update dbbackend.py * add loading JSON and column.contains for filtering #103 * testing differences * add or_ and and_ sqlmodel operators #103 * some comments * steps toward smarter pytesting * #103 -- added multi key filter * typo * polish changes to select_records and add tests. #103 * fix typo * remove Union for python 3.8 compat * move get_nested_column and define_sqlalchemy_type outside of select_records * update doc strings * Modify to be true JSONs chema with using $ref and $def for custom objects (image, file) #85 * Update pipestat/backends/abstract.py Co-authored-by: Oleksandr <sasha99250@gmail.com> * Update pipestat/backends/db_backend/dbbackend.py Co-authored-by: Oleksandr <sasha99250@gmail.com> * Update pipestat/backends/db_backend/dbbackend.py Co-authored-by: Oleksandr <sasha99250@gmail.com> * Update pipestat/pipestat.py Co-authored-by: Oleksandr <sasha99250@gmail.com> * Update pipestat/pipestat.py Co-authored-by: Oleksandr <sasha99250@gmail.com> * Update tests/data/sample_output_schema.yaml Co-authored-by: Oleksandr <sasha99250@gmail.com> * Polish PR, fix docstrings, add wrapper for remove * remove unnecessary mutablemapping inheritance * lint * cleaning * fixed failing test * remove old, unused validate_schema function * change ALL output_schemas to reflect JSON schema * lint * per request, remove select function as well as dynamic_filter * add to do * revert PipestatManager to a MutableMapping * remove select_txt and select_distinct * revert select_distinct for now * remove unused backend funcs, tests broken * implement list_recent_results wrapper (DBBackend only) * implement list_recent_results wrapper (DBBackend only) * some polish, begin implementing select_records * more work towards select_records * change operator func, add list comprehension for sets * implement AND OR Logic and fix tests, add sorting * fix time stamp filtering if NOT nested * lint and fix db test * attempt to pare down results based on columns, does not work. * add column filtering for filebackend, add tests, add no filter condition retrieving all records on filebackend * fix complex object filtering * implement select_distinct for filebackend * select_dsitinct now works on record identifiers for filebackend, polish tests * fix retrieve_one and retrieve_many for filebackend * fix timestamp test for filebackend * use isinstance for type checking * refactored for efficiency * fix recursive schema after merge * remove get_records, fix return type for select_records if given columns * begin remove of get_one_record * deleted get_one_record function * fix get_status on dbbackend using select_records * remove retrieve from dbbackend and reports * replace retrieve with retrieve_one and begin replacing retrieve with select_records * lint * fix select_records filebackend bug, remove retrieve from tests * finish removing retrieve function * attempt re-implementing record not found error for retrieve_one and retrieve_many * fix bugs with select_records and RecordNotFound, implement ColumnNotFound exception * lint * update changelog and version for 0.6.0a3 release * refactoring based on ruff * update cli and usage and api docs, lint * update documentation focused on schemas * add result_identifier as parameter to retrieve_one * add result_identifier as parameter to retrieve_many * __version__ = "0.6.0a4" * Finish adding pytest.mark.skipif * lint * move general tests to test_pipestat, remove some redundancy * implement clear_status pytest for filebackend #110 * polish test set_item, report_overwrite * add pytest fixture val_dict * add pytest fixture values_sample, values_project * add pytest fixture values_complex_linking, move fixtures to conftest.py * deleted unnecessary print * add test_select_no_filter_limit * black formatting * add pytest fixture for ranges * lint * fix record_identifier being returned twice if it is an input column * fix typo * change retrieve_one output to no longer we wrapped by select_records output * fix schema propert to return psm.cfg["_schema"].to_dict() * fix schema propert to return psm.cfg["_schema"] * add psm.schema.original_schema to parsed schema * add r_id = record_identifier or self.record_identifier to retrieve_one * fix majority of tests due to retrieve_one * lint * fix table creation bug with retrieve_one * remove RecordNotFoundError because it breaks pypiper * fix parsed_schema json schema parsing to be correct hierarchy * update jupyter api docs * update cli docs * fix property access #113 * add retrieving multiple results using retrieve_one * list to List * add quickstart guide * add psm.schema.resolved_schema * remove unused function * version bump to v0.6.0a5 * fix config_path property * re-implement RecordNotFoundError * be more selective about which exceptions to pass * potential solution for #117 * version bump 0.6.0a6 * remove redundant pydantic * fix pipestat reader dependency checking * fix cli bug regarding env variables * retrieve_one record_identifier defaults to None to allow for use with env variables, clean README.md * further documentation polishing * fix obtaining record_identifier from cfg file * correct config.md * correct pipestat_specification.md * 0.6.0a7 prerelease * Clarify return on select_records * ensure requirements list yacman>= 0.9.2 * potential fix for #119 * change warning to debug for creating status table * fix pipestat creating subdirectories before creating results file. * First proof of concept for #120 * simply do replacement at beginning * Fixed #106 * now functional, create aggregate_results.yaml which replaces self._data before reporting #120 * replace spaces in names with %20 during html link generation * fix reported_objs bug * expandpath for link_dir during pipestat link * add tests for new aggregate_results functionality * refactor db-backend to dbbackend * fix stats and objects buttons for summary page #130 * fix stats table on summary page #130 * fix broken status if multi_results_files #130 * fix log and profile files #130 * fix tsv columns #130 * Add check for pipeline type during report generation #130 * refactor Sample to Record #130 * fix refactor for sample page * fix stats.tsv and objects.yaml generation to incorporate pipeline types * add project_objects page #130 * minor polishing * remove temp test * lint * add passing looper's samples to pipestat summarize to populate table * remove unused code * version bump for pre-release v0.6.0a9 * use "object_type" for image and file types or fall back on using "type", add try except * ensure project level objects are accurate on summary/index page * remove project name from stats and object file names * update changelog * 0.6.0a10 bump for pre-release * allow project-level figures to display as columns instead of rows * 0.6.0a11 pre-release * v0.6.0 release prep * lint --------- Co-authored-by: nsheff <nsheff@users.noreply.github.com> Co-authored-by: Khoroshevskyi <sasha99250@gmail.com>

nsheff added the brainstorming label Oct 24, 2023

donaldcampbelljr added a commit that referenced this issue Oct 26, 2023

Begin work on select_records, retrieve_one, retrieve_many #103

18ef632

donaldcampbelljr added a commit that referenced this issue Oct 26, 2023

Implement select_records, and cursor pagination #103

a73a30a

nsheff added this to the v0.6.0 milestone Oct 27, 2023

donaldcampbelljr added a commit that referenced this issue Oct 27, 2023

add loading JSON and column.contains for filtering #103

c5cb162

donaldcampbelljr added a commit that referenced this issue Oct 27, 2023

add or_ and and_ sqlmodel operators #103

2fc3bfb

khoroshevskyi added a commit that referenced this issue Oct 27, 2023

#103 -- added multi key filter

616def4

donaldcampbelljr added a commit that referenced this issue Oct 30, 2023

polish changes to select_records and add tests. #103

795d3c2

donaldcampbelljr mentioned this issue Oct 30, 2023

Dev 99 103 api refactor #108

Merged

1 task

nsheff mentioned this issue Nov 3, 2023

bedhost select and retrieve on backend databio/bedhost#75

Closed

donaldcampbelljr added the likely-solved label Nov 6, 2023

donaldcampbelljr closed this as completed Dec 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardizing functionality API #103

Standardizing functionality API #103

nsheff commented Oct 24, 2023

donaldcampbelljr commented Oct 24, 2023

donaldcampbelljr commented Oct 25, 2023 •

edited by khoroshevskyi

Loading

donaldcampbelljr commented Oct 26, 2023 •

edited

Loading

nsheff commented Oct 27, 2023

donaldcampbelljr commented Nov 6, 2023

Standardizing functionality API #103

Standardizing functionality API #103

Comments

nsheff commented Oct 24, 2023

API interface ideas

Cursor-based pagination ideas

donaldcampbelljr commented Oct 24, 2023

donaldcampbelljr commented Oct 25, 2023 • edited by khoroshevskyi Loading

Some idea's from today's discussion:

select_records

retrieve_one and retrieve_many

select_distinct

donaldcampbelljr commented Oct 26, 2023 • edited Loading

nsheff commented Oct 27, 2023

donaldcampbelljr commented Nov 6, 2023

donaldcampbelljr commented Oct 25, 2023 •

edited by khoroshevskyi

Loading

donaldcampbelljr commented Oct 26, 2023 •

edited

Loading