Beginners Help with Nextclade CLI #1416

Closed
ryhisner opened this issue Feb 14, 2024 · 6 comments


@ryhisner

I'm trying to figure out how to use Nextclade CLI. I follow all the directions, but nothing ever seems to work. For example, after trying to run a fasta multiple times, the error messages indicated I needed an input dataset. So I've followed the directions to get one, but nothing works. Basically, it says to use:

nextclade dataset get [OPTIONS] --name <--output-dir <OUTPUT_DIR>|--output-zip <OUTPUT_ZIP>>

I've tried both

nextclade dataset get --nextstrain/sars-cov-2/wuhan-hu-1/orfs --output-dir .

nextclade dataset get --SARS-CoV-2 --output-dir .

Neither works. It says that both nextstrain/sars-cov-2/wuhan-hu-1/orfs and SARS-CoV-2 are "unexpected arguments", even though these are the exact names listed in the input dataset list.

I have no idea what I'm doing wrong.

@ryhisner ryhisner added good first issue Good for newcomers help wanted Extra attention is needed needs triage Mark for review and label assignment t:feat Type: request of a new feature, functionality, enchancement labels Feb 14, 2024
@ryhisner ryhisner changed the title Beginners Help Beginners Help with Nextclade CLI Feb 14, 2024
@ivan-aksamentov
Member

ivan-aksamentov commented Feb 14, 2024

Make sure you read the CLI docs ("Usage" and "Reference" pages):
https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli/usage.html


Both of your invocations are invalid because you did not provide the --name argument. The correct way to request the SARS-CoV-2 dataset by name is:

nextclade dataset get --name="nextstrain/sars-cov-2/wuhan-hu-1/orfs" --output-dir="dataset/"

You can also use the shortcut name of this particular dataset:

nextclade dataset get --name="sars-cov-2" --output-dir="dataset/"

These two invocations do the same thing.


Think of arguments as key-value pairs separated from other arguments with spaces:

--key1=value1 --key2=value2 --key3=value3

Each argument has a specific meaning in the context of the program you are using. In most cases you need both the key and the value. The key is the pre-agreed name of the argument. By looking at the key, the program understands what kind of information you want to provide. In the case of the dataset get command, the argument to request a dataset by name happens to be called name, so you must write --name before giving the actual name of the dataset. The value is the piece of information you want to give to the program. In this case it's the actual name of the dataset you request: "sars-cov-2". The equals sign (=) between key and value is optional; a space works as well.

It is usually better to wrap values in quotation marks, especially if they contain spaces:

--some-arg="my value with spaces"
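
To see why the quotes matter, here is a minimal sketch you can paste into a terminal. count_args is a hypothetical helper written just for this illustration (it is not part of Nextclade); it prints how many separate arguments the shell hands over to a program.

```shell
# Hypothetical helper: prints the number of arguments it received
count_args() { echo "$#"; }

# Without quotes the shell splits the value at every space
unquoted=$(count_args --some-arg=my value with spaces)

# With quotes the whole value stays attached to its key
quoted=$(count_args --some-arg="my value with spaces")

echo "unquoted: $unquoted argument(s)"   # unquoted: 4 argument(s)
echo "quoted: $quoted argument(s)"       # quoted: 1 argument(s)
```

In the unquoted case the program would see "value", "with" and "spaces" as three extra arguments, which is a typical source of "unexpected argument" errors.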

Some arguments that turn something on or off don't need a value, only the key (this kind of argument is sometimes called a "flag"). A good example is --only-names of the dataset list command:

nextclade dataset list --only-names

As you can see, it does not have any value after it. It just switches on printing of only the dataset names, instead of the big table that is printed by default.

There are also so-called "positional" arguments, which have no key, only a value. For example, when you pass a fasta file to nextclade run:

nextclade run --input-dataset="dataset/" --output-dir="results/" "my_input_1.fasta" "my_input_2.fasta"

In this case there are two positional arguments: "my_input_1.fasta" and "my_input_2.fasta". As you can see, positional arguments are handy when you need to pass multiple things into the program.
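
The three kinds of arguments can be put side by side with a small shell sketch. classify_args is a hypothetical helper, not something Nextclade provides, and it only understands the --key=value spelling (real parsers, including Nextclade's, also accept --key value).

```shell
# Hypothetical helper: label each argument as key-value, flag or positional
classify_args() {
  for arg in "$@"; do
    case "$arg" in
      --*=*) echo "key-value: ${arg%%=*} -> ${arg#*=}" ;;  # split at the first "="
      --*)   echo "flag: $arg" ;;                          # key without a value
      *)     echo "positional: $arg" ;;                    # value without a key
    esac
  done
}

classify_args --input-dataset="dataset/" --output-dir="results/" "my_input_1.fasta" "my_input_2.fasta"
# key-value: --input-dataset -> dataset/
# key-value: --output-dir -> results/
# positional: my_input_1.fasta
# positional: my_input_2.fasta
```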

You can find the available arguments and their meaning in the built-in help screen, by running the program with only the --help flag:

nextclade --help

In the case of Nextclade, each subcommand also has its own dedicated help screen:

nextclade run --help
nextclade dataset list --help
nextclade dataset get --help
nextclade sort --help

None of this is specific to Nextclade (nextclade is used as a relevant example). These are the basics of using command-line programs (aka console or terminal programs, or CLI). There should be plenty of learning materials on this topic on the internet.


Regarding the specifics of Nextclade, I would not download the dataset into the current directory (the . passed to --output-dir in your example means "the output directory is the current working directory"). It might make it difficult to separate input and output files later, because they will all end up mixed together in the same directory.

There is also another, simpler way to run a Nextclade analysis:

nextclade run --dataset-name="sars-cov-2" --output-dir="results/" "my_input.fasta"

This does not need a separate dataset get step. When using the --dataset-name argument, the dataset is downloaded each time you run the program, and the dataset files are not written to your computer (which may or may not be what you want).

@ivan-aksamentov ivan-aksamentov removed good first issue Good for newcomers help wanted Extra attention is needed needs triage Mark for review and label assignment t:feat Type: request of a new feature, functionality, enchancement labels Feb 14, 2024
@ryhisner
Author

ryhisner commented Mar 9, 2024

Thank you for this assistance. I've gotten caught up in several other projects in the past couple of weeks, but I'm setting out this weekend to learn this in earnest.

I printed out the Nextclade documentation and spent the last week reading it all. That, along with your advice here has been helpful. I think I've figured out the basics of how to run Nextclade and get the sort of file I want (an ndjson at the moment).

I do have a few questions about parts of the documentation.

  1. "Add multiple occurrences to increase verbosity further." I need all the help I can get, so I'd like to make Nextclade as verbose as possible. But I'm not sure what "multiple occurrences" means. Does it mean that you type in multiple v's, as in:
    -vvvvv or maybe -v -v -v -v ? How many do you have to enter for maximum verbosity?

  2. Are the brackets and other symbols in documentation real or not? For example, in the section below from the Nextclade documentation, am I supposed to include the brackets, the < and > symbols, and the "..."? Or are those not real? Based on what you said in the post above, I'm guessing they're not real, but I want to be 100% sure. I don't know how to tell things that are required to be in the code from things that are there but aren't supposed to be part of the code, and there doesn't seem to me to be any possible way to tell the difference. Is there an easy way to know this?

[image: screenshot of a section of the Nextclade CLI documentation]

@ivan-aksamentov
Member

ivan-aksamentov commented Mar 10, 2024

Does it mean that you type in multiple v's, as in:
-vvvvv or maybe -v -v -v -v ? How many do you have to enter for maximum verbosity?

I think either should work, but I usually use the -vvv form. Note that verbosity levels higher than info (one -v), namely debug and trace, are mostly useful only to developers: they print far too much technical detail, which will just confuse you more (e.g. do you really want to know which crypto algorithm is negotiated during the SSL handshake of the HTTP connection when a dataset file is downloaded? Probably not). Also note that verbosity levels only affect what is printed to the console. Output files are always the same.

All verbosity levels are listed under the --verbosity argument. It lets you set the level you want directly, without counting how many -v or -q flags you need. The default level is warn. Each -v moves verbosity one level up, and each -q moves it one level down.
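
The counting convention can be sketched with the shell's built-in getopts. This is only an illustration of the general convention: Nextclade uses its own argument parser, and verbosity_offset is a hypothetical name. It shows that the combined and the separate spellings are normally counted the same way, and that -q cancels out a -v.

```shell
# Hypothetical helper: net number of levels above (+) or below (-) the default
verbosity_offset() {
  level=0
  OPTIND=1                      # reset getopts state so the function can be reused
  while getopts "vq" opt; do
    case "$opt" in
      v) level=$((level + 1)) ;;  # each -v: one level more verbose
      q) level=$((level - 1)) ;;  # each -q: one level quieter
      *) return 1 ;;              # anything else is an error
    esac
  done
  echo "$level"
}

verbosity_offset -vvv        # 3
verbosity_offset -v -v -v    # 3
verbosity_offset -vv -q      # 1
```

Starting from the default warn, an offset of 1 corresponds to info, and higher offsets to debug and then trace.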

Are the brackets and other symbols in documentation real or not?

This is a convention for denoting variables (placeholders). <thing> means a required value, that is, you need to put a thing there. [thing] means an optional value, and [thing]... means a repeatable value, i.e. you can put one or more things there. You don't type the brackets or the dots, only the value itself.
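
As a concrete illustration, the sketch below takes a synopsis fragment of the form --name <NAME> --output-dir <OUTPUT_DIR> and fills in the placeholders. It only builds and prints the resulting command line and does not run Nextclade; the shell variable names are chosen just for this example.

```shell
# Synopsis fragment: nextclade dataset get --name <NAME> --output-dir <OUTPUT_DIR>
# <NAME> and <OUTPUT_DIR> are placeholders: replace them entirely, angle brackets included.
NAME="sars-cov-2"
OUTPUT_DIR="dataset/"

# Assemble the command you would actually type
cmd="nextclade dataset get --name=\"$NAME\" --output-dir=\"$OUTPUT_DIR\""
echo "$cmd"
# nextclade dataset get --name="sars-cov-2" --output-dir="dataset/"
```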

I think the convention originally comes from man, a tool for reading manual pages on Unix-like systems. The page you screenshotted is autogenerated by a docs generation utility though, and I am not sure how closely it follows the convention. But that's the general idea.

I think I've figured out the basics of how to run Nextclade and get the sort of file I want (an ndjson at the moment)

I would not recommend the JSON and NDJSON outputs, because they are unstable, meaning the format can change without notice. This is mentioned in the docs. You probably want the TSV output (--output-tsv). It's stable and easy to open in Excel, Google Sheets or any other spreadsheet software. Use "tab" (\t) as the column delimiter if your software cannot detect it automatically (it will typically ask you). That's what we recommend for most users. By the way, the output files in the CLI are exactly the same as those on the "Export" page of the web app. So if you are accustomed to using Nextclade Web export files, you will find the CLI outputs familiar as well.
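
Since the TSV is plain tab-separated text, it can also be searched directly from the shell. A minimal sketch, assuming a file with seqName and clade columns: the two data rows are made up for illustration, and a real Nextclade TSV has many more columns.

```shell
# Made-up excerpt standing in for a real Nextclade TSV output
tsv=$(mktemp)
printf 'seqName\tclade\nsample_A\t23I\nsample_B\t24A\n' > "$tsv"

# Find columns by header name, then print seqName where clade matches
matches=$(awk -F '\t' -v want="24A" '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header -> column index
  $(col["clade"]) == want { print $(col["seqName"]) }
' "$tsv")

echo "$matches"   # sample_B
rm -f "$tsv"
```

Looking columns up by header name rather than by position keeps the command working if the column order differs between files.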

@ryhisner
Author

Thank you. I have used the Nextclade Web export files a lot, so I'm familiar with those. I'm trying to get ndjson files because I want to be able to search them using Julia, which I (half) learned and was ready to start using before I realized there was something called bash that I really should've learned before I ever even tried Julia because it's impossible to do anything without bash.

@ryhisner
Author

Is there a way to get the GISAID accession numbers from Nextclade? I'm doing a search and the only results I can get are the sequence names, which I then have to paste one at a time into the GISAID text search in order to find and download the fastas. I'd like to be able to paste all the EPI_ISL numbers at once so I can download them easily, but I don't see them anywhere in the TSV file or the ndjson file and I'm not sure where else they would be.

@ivan-aksamentov
Member

Is there a way to get the GISAID accession numbers from Nextclade?

Not sure what you mean here. Nextclade does not deal with GISAID and does not know what an accession is, or even that GISAID exists. We don't rely on any database. The only source of data is the input files users provide: input fasta files and dataset files.

Sequence names are taken from your input fasta file and presented in the output files as is. If your fasta file does not contain accessions, you will not get them from Nextclade. So it's your responsibility to set the names in your input fasta such that you get the desired names in the output TSV.
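
If the sequence names in your fasta already carry the accession, it can be pulled out with standard shell tools. A sketch under a loud assumption: the header layout below (name|EPI_ISL_...|date) and the accession numbers are invented for illustration, so check what your own headers actually look like.

```shell
# Invented example fasta; the header format is an assumption, not a Nextclade feature
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>hCoV-19/USA/EX-0001/2024|EPI_ISL_00000001|2024-01-05
ACGT
>hCoV-19/USA/EX-0002/2024|EPI_ISL_00000002|2024-01-06
ACGT
>sample_without_accession
ACGT
EOF

# Keep header lines, extract the accession pattern, join with commas
accessions=$(grep '^>' "$fasta" | grep -oE 'EPI_ISL_[0-9]+' | paste -sd, -)

echo "$accessions"   # EPI_ISL_00000001,EPI_ISL_00000002
rm -f "$fasta"
```

Headers without a matching pattern are silently skipped, so the resulting list may be shorter than the number of sequences.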

Or do you mean something else?

By the way, sequence names are not guaranteed to be unique: scientists often don't pay much attention to naming the sequences they produce, and it's a bit chaotic. So it's not always possible to identify the exact sequence from the name alone.
