Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Naming and architecture suggestions #8

Open
Guillawme opened this issue Apr 28, 2021 · 3 comments
Open

Naming and architecture suggestions #8

Guillawme opened this issue Apr 28, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@Guillawme
Copy link

Guillawme commented Apr 28, 2021

Hello,

Since the README says

if you see ways to improve the naming or architecture, now is the time to raise an issue.

I suggest checking how much of the STAR format specification is supported by the package, and maybe renaming it to STAR.jl if enough of the specification is supported already and/or if complete (or "complete enough for most use cases") support is within the scope of this package. It seems to me that it is already compatible with STAR files following these conventions from RELION (I tried reading small files, it worked; trying to read a ~200 MB file with the native parser hanged for more than one hour until I cancelled the command; I can open a dedicated issue for that and provide the large file, if you want).

One other nice thing to have, which may not even need changing the architecture, would be a high-level function with which one could do something like read_star("/path/to/file.star") and get out a list of DataFrames (one per table present in the file) with properly typed columns (most of the time String, Int64 or Float64; not sure anything else is ever present in these files). And also the complementary write_star(df::DataFrame, "/path/to/file.star") (single table) and write_star(dfs::Vector{DataFrame}, "/path/to/file.star") (multiple tables).

@jamesrhester
Copy link
Owner

I've deliberately focused on CIF as that is widely used and recognised in both chemical crystallography and macromolecular crystallography. The package also includes a lot of CIF dictionary infrastructure. The RELION conventions look like they would be completely compatible with CIF, so a read_star method would be easy enough to implement - it would just be the same as reading a CIF.

Can you provide a link to the 200MB file. Currently native speed is about 1MB/2s so a 200Mb file should be 400s, or 7 minutes. There may be a bottleneck in parsing that I can fix quickly. It may also be a memory consumption issue.

Columns can only be typed if there is some knowledge of what their contents should be. CIF uses CIF dictionaries, so if you do associate a CIF dictionary to a CIF data block, correctly-typed columns are returned. In the case of STAR, there is no foolproof way of determining column contents (strings don't have to be delimited, I think) so you probably as a user need to indicate what types you want for columns and so it's probably just as simple to use DataFrame methods for this.

@Guillawme
Copy link
Author

Guillawme commented Apr 29, 2021

I opened a dedicated issue and posted the large file with more info there. See #9.

I mentioned typed columns because this is something starfile.py does automatically, and it is very convenient, but it's possible that it relies on pandas for this (just like reading a CSV file into a pandas dataframe also gives typed columns; but I indeed don't know how it can determine these types automatically). So, it's possible that DataFrames.jl can also guess column types (Base.parse() can do exactly this, it seems).

@jamesrhester jamesrhester added the enhancement New feature or request label May 4, 2021
@jamesrhester
Copy link
Owner

Suggested enhancements:

  • Document degree of STAR compatibility
  • Provide a command to read STAR files that essentially aliases CIF reading commands

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants