Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardizing functionality API #103

Closed
nsheff opened this issue Oct 24, 2023 · 5 comments
Closed

Standardizing functionality API #103

nsheff opened this issue Oct 24, 2023 · 5 comments

Comments

@nsheff
Copy link
Contributor

nsheff commented Oct 24, 2023

Today we discussed retrieve_multiple, and we're kind of getting a bit lost in the complexity of the API. Here's my attempt to figure out what we need:

  • list all record identifiers, paged (list_record_identifiers)
  • list all records, paged (list_records)
  • get one record (retrieve_one ?)
  • get multiple records (retrieve_multiple ?) -- paged? probably not.
  • select column filtering?

API interface ideas

In #99 I proposed using the __getitem__ interface to retrieve records. Given the above distinction between retrieve_one and retrieve_multiple we're proposing, one possibility is to allow getitem to be the one that can handle either, like this:

def retrieve_one(record_identifier: str, ...) -> dict
def retrieve_multiple(record_identifier_list: list[str], ...) -> list[dict]
def __getitem__(record_identifiers: Union[str, list[str]], ...) -> Union[dict, list[dict])

So the actual functions have strict schema for parameters and results, but getitem allows for convenience, whereby you can do either psm["xyz123md5"] or psm[["xyz123md5", "abc456md5"]].

Cursor-based pagination ideas

If we implement cursor-based paging following some google standards, we could have an interface like this:

def list_record_identifiers(page_token: str, page_size: int)
def list_records(page_token: str, page_size: int)

return value would be:

{
  total_size: 100,
  page_size: 50,
  page_token: "xyz123md5",  # maybe?
  next_page_token: "gyx456md5",
  prev_page_token: "",  # maybe?
  records: [ ... ]
}

@donaldcampbelljr
Copy link
Contributor

Currently, for retrieving or selecting records, we have 10 functions:

get_one_record
get_records
list_recent_results
list_results
retrieve
retrieve_distinct
retrieve_multiple
select
select_txt
select_distinct

@donaldcampbelljr
Copy link
Contributor

donaldcampbelljr commented Oct 25, 2023

Some idea's from today's discussion:


select_records

select_records -> will replace select, select_txt

  • can take a list of tuples such that they can be used to filter e.g.
    [(k,v), (k,v), (k,v)] --> distinct columns
    can check type and construct appropriate where statements
if type(k) == str:
    where(k=v)

or

if type(k) == tuple
     where( k = ("genome", "alias", "aa") ==v  )

Can potentially implement OR logic as well:

if type(v) == tuple
   .or(k=v1)
   .or(k=v2)
######
#or:
if type(v) == tuple:
 .in_([v1, v2,...])

retrieve_one and retrieve_many

Two convenient retrieval functions which will actually use select_records:
retrieve_one -> retrieving one record
retrieve_many -> retrieving multiple records


select_distinct

select_distinct_values -> will be select_distinct, are we able to simply add .distinct to our query from select_records to get the distinct records


All of these are two be implemented on the abstract class for BOTH backends.

@donaldcampbelljr
Copy link
Contributor

donaldcampbelljr commented Oct 26, 2023

Ok, first pass at this, implemented:
select_records
retrieve_one
retrieve_many

All three currently return:

        records_dict = {
            "total_size": total_count,
            "page_size": limit,
            "next_page_token": next_cursor,
            "records": results,
        }

select_records is very similar to the original select function but uses cursor pagination.

Next steps:

  • Need to change the creation of the dynamic filter as it is not what we want based on yesterday's discussion, i.e. attempt to implement select_txt functionality as well as select_distinct
  • retrieve_many loops over a list and performs a select_records call for each r_id, but we should be able to do this in one session call

Also, I left all the original functions in place for now. Will remove them once we are happy with the newest changes.

@nsheff nsheff added this to the v0.6.0 milestone Oct 27, 2023
@nsheff
Copy link
Contributor Author

nsheff commented Oct 27, 2023

this sounds like great progress!

@donaldcampbelljr
Copy link
Contributor

PR merged, this work is now complete.

donaldcampbelljr added a commit that referenced this issue Dec 22, 2023
* Add get_records to pipestat_manager and add related test #75

* clean up signatures #72

* clean up docstrings

* remove superfluous get_table_name function

* Reduce get_orm and get_model into one function. #71

* Fix CLI to use record-identifier instead of sample-name

* Fix CLI to use pipeline_type #37

* modify get_records to return new structure #75

* fix html report generation using get_records

* add simple CLI test for reporting and retrieving #81

* Pipestat reader (#80)

* update version in prep for pipestat table

* add table function

* add initial stats function from looper, remove counter

* fixing _create_stats_summary and get_file_for_project

* fix sample_level stats reporting

* add object reporting

* remove redundancies

* lint

* add assertion for table generation to pre-existing test

* use pipeline_type when retrieving samples

* Fix stats table generation output to look better

* fix return types

* fix list vs List

* fix typo for stats table

* fix doc strings

* update func names for disambiguation

* fix key error issue in html_reports

* clean up

* adjust docstrings

* update docstrings

Co-authored-by: Nathan Sheffield <nsheff@users.noreply.github.com>

* adjust docstrings, rename html_reports_pipestat.py to reports.py

* adjust LOGGER.info string

* simplify objs and stats generation

* remove old todos

* remove sample_name, add pipeline_name to project field definitions

* work towards using project_name instead of sample_name from project-level pipelines

* fix check record bug, add better error message for pipeline type.

* add checks for pipeline types, lint

* fix get_samples with pipeline type

* fix fetch_pipeline_results

* add basic tests for project_level, fix associated bugs

* allow table creation to use project_name if applicable

* add passing project name to file_backend

* refactor to use r_id as input

* update doc strings

* more refactoring of records vs samples to be more general

* move table generation to reports.py

* allow status file dir path to be associated with config path OR the file path to ensure looper compatibility.

* add output_dir as a parameter for placing reports and stat files. Otherwise defaults to results_file dir or config_file dir

* add obtaining schema_path from pipestat config for Looper compatibility

* added select function to main pipestat class

* added select_txt and select_distinct function to main pipestat class

* added get_one_record function to main pipestat class #70

* Removing CLI req that one must supply config and schema because the user can supply schema within the config. For Looper compatibility.

* Revert "added get_one_record function to main pipestat class #70"

This reverts commit b477060.

* Revert "added select_txt and select_distinct function to main pipestat class"

This reverts commit 4588f4f.

* Revert "added select function to main pipestat class"

This reverts commit ee98145.

* Add basic glossary page in the form of a table. Addresses: pepkit/looper#290

* remove redundant tag

* add RecordNotFoundError across both backends #74

* Add PipelineTypeNotSuppliedError to hide dbbackend information #74

* Initial work towards a fastapi implementation of pipestat reader
#22

* Initial work to separate classes #78

* working implementation of child classes for reporting using DBBACKEND. Manual testing confirmed. pytest Tests broken. #78

* implement report for filebackend

* fix inheritance issue with pipeline_type

* typo

* major refactoring, many tests still broken #78

* Fix signatures for report and explicitly state input arguments.

* remove pipeline type check

* add projectpipestamanager and refactor project_name to record_identifier

* fix get_status

* fix tests

* fix all tests #78

* add todo

* code cleanup and remove getting_table_name

* clean up

* partial removal of pipeline_type logic from DB backend

* remove pipeline_type input arguments from File Backend

* clean up

* add simple class to wrap SamplePipestatManager and ProjectPipestatManager

* change pipestatmanager to mutable mapping #78

* lowercase attributes #78

* polish PipestatBoss #78

* env_vars for pipeline_type

* update docs

* working proof of concept to retrieve results #22

* update readme and add getting output schema

* add else catch for outptu schema and fix typo

* add ability to run with uvicorn if calling reader.py directly.

* add __init__.py and main func

* add more endpoints #22

* update endpoints, add Query

* allow for more complex filtering using post and pydantic models #22

* Add image and file endpoints #22

* lint

* Add basic arg parser to pass absolute path to config file

* Add endpoint for get_records()

* clean up

* modify get_records to return new structure #75

* fix html report generation using get_records

* attempt global creation of psm, does not work

* fix creation of global psm

* change fetching by filetype

* add cli option for pipestat serve

* simplify and fix pipestat serve #22

* add host and port arguments #22

* add config error

* update readme

---------

Co-authored-by: nsheff <nsheff@users.noreply.github.com>

* list to List

* fixed typing in python3.8

* update changelog

* fix table_name_bug

* "0.5.4a1" for alpha release

* "0.6.0a1" for alpha release

* Dev json schema (#87)

* first pass to allow for output_schema as a true JSON schema #85

* remove unused import

* simplify logic, extend _safe_pop_one_mapping to handles multiple keys

* disambiguation key vs keys in parsed schema

* make samples and project objects instead of arrays, add more test assertions, clean up docstrings.

* add status data to string representation of ParsedSchema

* add retrieve_distinct function for #70 and databio/bedhost#75

* add `pipestat link` functionality to create symlinks (#90)

* begin work on file inking via pipestat

* continue implementing link for an output_dir OR the results_file.yaml
#89

* implemented symlinks for filebackend given output_dir
#89

* implemented symlinks for filebackend using results.yaml, polish tests
#89

* fix typo, clean up

* more lcean up

* begin rework to be backend agnostic

* refactor to place in folders specific to a result_identifier

* add more complex types, add temp directory for better testing

* fix recursive finding of paths

* clean up, move to abstract class, confirm works for both backends

* remove unused test files

* add jinja2 to requirements-all.txt #91

* complex example with collision

* add warning

* add cli option

* remove unused functions

* update changelog

* clean up docstrings, remove unused variables, refactor link_dir

* clean up more docstrings

* remove unnecessary print statement to reduce verbosity

* Begin work on adding created and modified datetime, DateTime objects are not parsing correctly, tests broken #93

* fix datetime objects #93

* fix pipeline_name field beginning with underscore #94

* timestamp continuation, list_recent_results does NOT work

* fix doc string for schema

* add filtering by created and modified time #93

* update version for alpha release v0.6.0a2

* fix list reversing #93

* polish tests #93

* add filebackend list_recent_results, remove records broken #93

* fix remove records bug #93

* update doc strings #93

* update change log for list_recent_results

* begin work on retrieve_multiple #96

* add db backend, test broken #96

* fix db backend logic #96

* implement file backend #96

* fix docstrings

* some minor polishing

* update readme

* move pipestat reader per #22

* fix pipestat reader setup and manifest per #22

* Add optional dependencies via setup #22

* pipestat can now run without DB specific dependencies #22

* Some refactoring #64

* add other requirements to test

* refactor to have ParsedSchemaDB inherit from ParsedSchema for code reduction #64

* Clean up

* return total record count #75

* make initialization of file and DB backend into separate functions

* add decorator for checking dependencies

* polishing decorator for checking dependencies

* consolidate decorator and ancillary function into one

* first pass #99, some tests broken

* fix all tests #99

* refactor store to cfg #99

* fix bug with table path

* setitem is now a wrapper for report

* fix getitem implementation, add test #99

* Begin work on select_records, retrieve_one, retrieve_many #103

* Implement select_records, and cursor pagination #103

* clarify init message

* Update dbbackend.py

* add loading JSON and column.contains for filtering #103

* testing differences

* add or_ and and_ sqlmodel operators #103

* some comments

* steps toward smarter pytesting

* #103 -- added multi key filter

* typo

* polish changes to select_records and add tests. #103

* fix typo

* remove Union for python 3.8 compat

* move get_nested_column and define_sqlalchemy_type outside of select_records

* update doc strings

* Modify to be true JSONs chema with using $ref and $def for custom objects (image, file) #85

* Update pipestat/backends/abstract.py

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Update pipestat/backends/db_backend/dbbackend.py

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Update pipestat/backends/db_backend/dbbackend.py

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Update pipestat/pipestat.py

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Update pipestat/pipestat.py

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Update tests/data/sample_output_schema.yaml

Co-authored-by: Oleksandr <sasha99250@gmail.com>

* Polish PR, fix docstrings, add wrapper for remove

* remove unnecessary mutablemapping inheritance

* lint

* cleaning

* fixed failing test

* remove old, unused validate_schema function

* change ALL output_schemas to reflect JSON schema

* lint

* per request, remove select function as well as dynamic_filter

* add to do

* revert PipestatManager to a MutableMapping

* remove select_txt and select_distinct

* revert select_distinct for now

* remove unused backend funcs, tests broken

* implement list_recent_results wrapper (DBBackend only)

* implement list_recent_results wrapper (DBBackend only)

* some polish, begin implementing select_records

* more work towards select_records

* change operator func, add list comprehension for sets

* implement AND OR Logic and fix tests, add sorting

* fix time stamp filtering if NOT nested

* lint and fix db test

* attempt to pare down results based on columns, does not work.

* add column filtering for filebackend, add tests, add no filter condition retrieving all records on filebackend

* fix complex object filtering

* implement select_distinct for filebackend

* select_dsitinct now works on record identifiers for filebackend, polish tests

* fix retrieve_one and retrieve_many for filebackend

* fix timestamp test for filebackend

* use isinstance for type checking

* refactored for efficiency

* fix recursive schema after merge

* remove get_records, fix return type for select_records if given columns

* begin remove of get_one_record

* deleted get_one_record function

* fix get_status on dbbackend using select_records

* remove retrieve from dbbackend and reports

* replace retrieve with retrieve_one and begin replacing retrieve with select_records

* lint

* fix select_records filebackend bug, remove retrieve from tests

* finish removing retrieve function

* attempt re-implementing record not found error for retrieve_one and retrieve_many

* fix bugs with select_records and RecordNotFound, implement ColumnNotFound exception

* lint

* update changelog and version for 0.6.0a3 release

* refactoring based on ruff

* update cli and usage and api docs, lint

* update documentation focused on schemas

* add result_identifier as parameter to retrieve_one

* add result_identifier as parameter to retrieve_many

* __version__ = "0.6.0a4"

* Finish adding pytest.mark.skipif

* lint

* move general tests to test_pipestat, remove some redundancy

* implement clear_status pytest for filebackend #110

* polish test set_item, report_overwrite

* add pytest fixture val_dict

* add pytest fixture values_sample, values_project

* add pytest fixture values_complex_linking, move fixtures to conftest.py

* deleted unnecessary print

* add test_select_no_filter_limit

* black formatting

* add pytest fixture for ranges

* lint

* fix record_identifier being returned twice if it is an input column

* fix typo

* change retrieve_one output to no longer we wrapped by select_records output

* fix schema propert to return psm.cfg["_schema"].to_dict()

* fix schema propert to return psm.cfg["_schema"]

* add psm.schema.original_schema to parsed schema

* add r_id = record_identifier or self.record_identifier to retrieve_one

* fix majority of tests due to retrieve_one

* lint

* fix table creation bug with retrieve_one

* remove RecordNotFoundError because it breaks pypiper

* fix parsed_schema json schema parsing to be correct hierarchy

* update jupyter api docs

* update cli docs

* fix property access #113

* add retrieving multiple results using retrieve_one

* list to List

* add quickstart guide

* add psm.schema.resolved_schema

* remove unused function

* version bump to v0.6.0a5

* fix config_path property

* re-implement RecordNotFoundError

* be more selective about which exceptions to pass

* potential solution for #117

* version bump 0.6.0a6

* remove redundant pydantic

* fix pipestat reader dependency checking

* fix cli bug regarding env variables

* retrieve_one record_identifier defaults to None to allow for use with env variables, clean README.md

* further documentation polishing

* fix obtaining record_identifier from cfg file

* correct config.md

* correct pipestat_specification.md

* 0.6.0a7 prerelease

* Clarify return on select_records

* ensure requirements list yacman>= 0.9.2

* potential fix for #119

* change warning to debug for creating status table

* fix pipestat creating subdirectories before creating results file.

* First proof of concept for #120

* simply do replacement at beginning

* Fixed #106

* now functional, create aggregate_results.yaml which replaces self._data before reporting #120

* replace spaces in names with %20 during html link generation

* fix reported_objs bug

* expandpath for link_dir during pipestat link

* add tests for new aggregate_results functionality

* refactor db-backend to dbbackend

* fix stats and objects buttons for summary page #130

* fix stats table on summary page #130

* fix broken status if multi_results_files #130

* fix log and profile files #130

* fix tsv columns #130

* Add check for pipeline type during report generation #130

* refactor Sample to Record #130

* fix refactor for sample page

* fix stats.tsv and objects.yaml generation to incorporate pipeline types

* add project_objects page #130

* minor polishing

* remove temp test

* lint

* add passing looper's samples to pipestat summarize to populate table

* remove unused code

* version bump for pre-release v0.6.0a9

* use "object_type" for image and file types or fall back on using "type", add try except

* ensure project level objects are accurate on summary/index page

* remove project name from stats and object file names

* update changelog

* 0.6.0a10 bump for pre-release

* allow project-level figures to display as columns instead of rows

* 0.6.0a11 pre-release

* v0.6.0 release prep

* lint

---------

Co-authored-by: nsheff <nsheff@users.noreply.github.com>
Co-authored-by: Khoroshevskyi <sasha99250@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants