Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

End repair #63

Merged
merged 31 commits into from
Oct 24, 2023
Merged

End repair #63

merged 31 commits into from
Oct 24, 2023

Conversation

jmtsuji
Copy link
Collaborator

@jmtsuji jmtsuji commented Oct 20, 2023

Addresses #58 so that repair.py is installed as part of rotary and is integrated into the snakemake workflow. Also includes usage of TIGRFAM HMMs.

@LeeBergstrand would you mind taking a look before this PR is merged? (Thanks so much for your support for the install of repair.py!!)

LeeBergstrand and others added 19 commits October 4, 2023 18:52
* develop: (61 commits)
  add biopython as a install requirement.
  move end repair to proper directory.
  remove touch and move ref_genome_path_list to output file.
  specify 95% of the machines memory in the config rather than 100%
  check for fq and fastq
  convert to using fstrings
  Cleaned up comments
  Basic working version
  Separated argparse into its own function, as discussed in #47
  add update to readme instructions.
  CLI debugging and misc code fixes
  make sure it selects the correct config for rotary run.
  add swap sh for subprocess check call. add command line flag for forcing ovewrites with rotary init.
  more edits to capture entire std err output
  more code for running snakemake.
  add call snakemake.
  modularize Python code.
  fix command line error when no command is specified.
  update git ignore
  recreate scripts folder
  ...

# Conflicts:
#	.gitignore
#	rotary/rules/rotary.smk
@LeeBergstrand
Copy link
Collaborator

@jmtsuji I added e4dceff to add the dependencies to the other requirements files. So, three files require dependency updates.

  • environment.yaml - used for creating conda enviroments.
  • meta.yaml - used by conda build when building conda packges.
  • pyproject.toml - used by pip install when installing via ```pip``. It can only install pure Python packages offPypii (so no bioinformatics tools).

You could technically remove environment.yaml once the package is uploaded to Anaconda as conda build would specify these dependencies. You can programmatically get the dependencies out of the environment.yaml and have them templated during the build process (see https://stackoverflow.com/questions/60125363/conda-meta-yaml-requirements-run-section-dependencies-from-file)

The conda build process works like this:

  1. Conda build takes the meta.yaml file and uses it to start building the Conda package.
  2. Conda build installs the dependencies listed in the meta.yaml file.
  3. Conda build calls pip install . to install the rotary python package.
  4. pip install uses the the dependencies in pyproject.toml and attempts to install them. However, these were already installed by ```conda build,` so it skips this step.
  5. pip install uses the the scripts list in pyproject.toml to install the CLI scripts.
  6. All the resulting dependencies and files are baked by conda build into a conda bundle specific to the operating system the package was built on.
  7. You then run conda convert to convert the package you built for your OS to versions that work on other operating systems (this is why, on anaconda.org, you see multiple package builds for different OSes). If the conversion fails, you must run conda build on a different OS. For example, if the Windows conversion build fails, you must run conda build on a Windows computer and upload it separately.
  8. You then use the anaconda CLI tool to upload these packages to anaconda.org.

The big projects run these steps using GitHub Actions (https://docs.github.com/en/actions) to do the build and upload steps.

…in sample creation and during rotary init. Changes so it checks the entire file extension even if there are multiple.
rotary/rules/rotary.smk Show resolved Hide resolved
rotary/repair.py Outdated Show resolved Hide resolved
rotary/repair.py Outdated Show resolved Hide resolved
rotary/repair.py Outdated Show resolved Hide resolved
rotary/repair.py Outdated Show resolved Hide resolved
@LeeBergstrand
Copy link
Collaborator

LeeBergstrand commented Oct 21, 2023

Great job putting this together! Looks like it's coming together splendidly. I left a lot of comments, so keep going.

Okay, right now, repair.py has this design pattern where you pass down the dependency_dict to all the child functions that call subprocesses. This method could be more straightforward to a degree.

Right now, your code is like the following:

samtools_sort_args = [dependency_dict['samtools'], 'sort', '-@', str(threads), '-m', f'{threads_mem_mb}M']
subprocess.run(samtools_sort_args, check=True, input=samtools_view.stdout, stdout=bam_handle, stderr=logfile_handle)

I want to move your code to using shell=true. This runs the sub-command in a shell and thus can call the underlying programs like samtools without a path. There are security issues with shell=true, but they don't apply to us because we already run this tool from CLI.

So you can just run:

samtools_sort_args = ['samtools', 'sort', '-@', str(threads), '-m', f'{threads_mem_mb}M']

@LeeBergstrand
Copy link
Collaborator

LeeBergstrand commented Oct 21, 2023

What are your thoughts on removing the CLI dependency check and the dependancy_dict from the code? If the dependency isn't there, the code would fail, and it would dump that the shell can't find it. Installing through conda should guarantee that the dependencies are installed 98% of the time.

rotary/repair.py Outdated Show resolved Hide resolved
rotary/repair.py Outdated Show resolved Hide resolved
@LeeBergstrand
Copy link
Collaborator

LeeBergstrand commented Oct 21, 2023

The following code should be moved to a function as it is repeated multiple times.

 if append_log is True:
        write_mode = 'a'
    elif append_log is False:
        write_mode = 'w'
    else:
        raise ValueError

I would also add an ValueError(fstring) / error.log to the explain why the value error is being raised.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 23, 2023

@jmtsuji I added e4dceff to add the dependencies to the other requirements files. So, three files require dependency updates.

@LeeBergstrand Thanks for the explanation and for updating the other requirement files! I'll keep this in mind when updating dependencies in the future. Your overview of how the pip and conda installers interact is very helpful.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 23, 2023

The following code should be moved to a function as it is repeated multiple times...

Moved to a function

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 23, 2023

Okay, right now, repair.py has this design pattern where you pass down the dependency_dict to all the child functions that call subprocesses. This method could be more straightforward to a degree...
I want to move your code to using shell=true. This runs the sub-command in a shell and thus can call the underlying programs like samtools without a path. There are security issues with shell=true, but they don't apply to us because we already run this tool from CLI.

@LeeBergstrand Actually, based on my previous testing, I am pretty sure that I can already call the underlying programs like samtools without a path, even if shell = False. However, I decided to explicitly set the path to the programs to avoid any weird PATH conflict issues that might happen. Do you think that such path conflicts are basically impossible? If so, I can simplify the code.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 23, 2023

What are your thoughts on removing the CLI dependency check and the dependancy_dict from the code? If the dependency isn't there, the code would fail, and it would dump that the shell can't find it. Installing through conda should guarantee that the dependencies are installed 98% of the time.

@LeeBergstrand I like the CLI dependency check because it tells the user upfront that a dependency is missing. Otherwise, they might have to run 50% of the pipeline before they run into the missing dependency issue. In future, I am thinking of using the dependency check function to also grab the version of the dependency tool to add to the log. So I would be happy to leave the function in the code... what are your thoughts?

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 23, 2023

Great job putting this together! Looks like it's coming together splendidly. I left a lot of comments, so keep going.

@LeeBergstrand Thanks! Appreciate the comments. I left several responses and some unresolved threads, so please have a look and let me know your thoughts... we can move forward toward merging this PR based on your feedback.

@LeeBergstrand
Copy link
Collaborator

@LeeBergstrand Actually, based on my previous testing, I am pretty sure that I can already call the underlying programs like samtools without a path, even if shell = False. However, I decided to explicitly set the path to the programs to avoid any weird PATH conflict issues that might happen. Do you think that such path conflicts are basically impossible? If so, I can simplify the code.

Given the installation, these pathing issues are generally unlikely. Because check_dependency() uses shutil.which(dependency_name), you will likely pass any pathing issues onto the Python code. I believe the shutil.which would be affected by shell variables, symlinks, etc., including incorrect ones. In Unix, executables are files with some file system flags changed, and globally accessible executables are those added to the $PATH. The default $PATH for your user should include /home/YOUR-USER-NAME/bin (i.e. ~/bin). Most executables are symlinked to ~/bin, which lets them show up in your shell when you call them.

Given the above I think we can simplify the code.

@LeeBergstrand
Copy link
Collaborator

What are your thoughts on removing the CLI dependency check and the dependancy_dict from the code? If the dependency isn't there, the code would fail, and it would dump that the shell can't find it. Installing through conda should guarantee that the dependencies are installed 98% of the time.

@LeeBergstrand I like the CLI dependency check because it tells the user upfront that a dependency is missing. Otherwise, they might have to run 50% of the pipeline before they run into the missing dependency issue. In future, I am thinking of using the dependency check function to also grab the version of the dependency tool to add to the log. So I would be happy to leave the function in the code... what are your thoughts?

If the end repair pipeline takes a while, I would keep the dependency check-in. I think logging the dependency versions would be helpful.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 24, 2023

Given the installation, these pathing issues are generally unlikely. Because check_dependency() uses shutil.which(dependency_name), you will likely pass any pathing issues onto the Python code. I believe the shutil.which would be affected by shell variables, symlinks, etc., including incorrect ones. In Unix, executables are files with some file system flags changed, and globally accessible executables are those added to the $PATH. The default $PATH for your user should include /home/YOUR-USER-NAME/bin (i.e. ~/bin). Most executables are symlinked to ~/bin, which lets them show up in your shell when you call them.
Given the above I think we can simplify the code.

OK, got it. I have simplified the code!

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 24, 2023

If the end repair pipeline takes a while, I would keep the dependency check-in. I think logging the dependency versions would be helpful.

End repair might take several minutes (or longer if a large assembly is input), so let's leave dependency_check() in the code. I'll continue to aim to add version checks in the future.

* refactor-subcommands:
  move subprocess logging to function, invert check being default, move stdin.stdout into function.

# Conflicts:
#	rotary/repair.py
@LeeBergstrand
Copy link
Collaborator

@jmtsuji We are pretty close to being able to merge this in—one last issue other than removing the dependency_dict,

https://github.com/jmtsuji/rotary/blob/0b984a5dd01d80f999d0f688c9aac3fb4c228ffe/rotary/repair.py#L463

Is this the correct function argument mapping? In the other runs commands, you mapped stderr to the log file. These arguments are mapped correctly?

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 24, 2023

Is this the correct function argument mapping? In the other runs commands, you mapped stderr to the log file. These arguments are mapped correctly?

@LeeBergstrand Yes, this is correct. This particular tool writes the log info to STDOUT, so I had to direct STDOUT to the logfile. I wanted to retain STDERR info as well in case anything got printed there. (By the way, this is the tool that I plan to replace with in-house code).

@LeeBergstrand
Copy link
Collaborator

Is this the correct function argument mapping? In the other runs commands, you mapped stderr to the log file. These arguments are mapped correctly?

@LeeBergstrand Yes, this is correct. This particular tool writes the log info to STDOUT, so I had to direct STDOUT to the logfile. I wanted to retain STDERR info as well in case anything got printed there. (By the way, this is the tool that I plan to replace with in-house code).

Great I wanted to confirm.

@jmtsuji
Copy link
Collaborator Author

jmtsuji commented Oct 24, 2023

@LeeBergstrand Dependency dict and logger changes are addressed. End-to-end tests are passing. OK to merge?

@LeeBergstrand
Copy link
Collaborator

@jmtsuji Great. Feel free to merge.

@jmtsuji jmtsuji merged commit 05313b3 into develop Oct 24, 2023
@jmtsuji jmtsuji deleted the end_repair branch October 24, 2023 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants