How can the search parameters be modified? #33

Closed
hansmaelke opened this issue May 25, 2020 · 4 comments

@hansmaelke

Hello,
I am analyzing some bacterial strains that I am sure contain RGPs, and so far PPanGGOLiN has worked wonders. However, it predicts a large number of RGPs. Is there a way to make the search more stringent, for example by changing a threshold, so that fewer RGPs are reported?

Thanks in advance

@axbazin
Member

axbazin commented May 25, 2020

Yes, absolutely.
Your previous analysis should have generated a pangenome.h5 file, which you can reuse to rerun parts of the workflow.
The simplest option is to change the minimum length required for an RGP to be predicted. The default is 3000 bp. If you want the predicted RGPs to be at least 5000 bp long, for example, you can use the '--min_length' option along with this pangenome.h5 file, like so:

ppanggolin rgp -p pangenome.h5 --min_length 5000

Another possibility is to modify the minimum score threshold. The default threshold is 4, which roughly means that you need at least 4 shell or cloud genes close together to get an RGP when the other parameters are left at their defaults.

If you feel this is not strict enough and only want regions with many more genes, you can raise this threshold, like so:

ppanggolin rgp -p pangenome.h5 --min_score 8

This will set the threshold to 8 instead of the default 4.

There are other parameters, but they are less straightforward to explain. You can see them all by running ppanggolin rgp -h.

Afterward, you can regenerate the 'plastic_regions.tsv' file by running

ppanggolin write -p pangenome.h5 --regions --output MyNewRegionsOutputDir

If you do start tweaking the parameters, you might find the following command useful:

ppanggolin info -p pangenome.h5 --parameters
This lists the parameters used to compute the results currently stored in the .h5 file, for every step of the analysis.
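If you want to compare several thresholds, the commands above can be chained in a small script. The following is only a sketch: it uses just the flags quoted in this thread, the helper names (`build_commands`, `run_sweep`) and the output directory naming are hypothetical, and it assumes that each `ppanggolin rgp` run replaces the regions stored in pangenome.h5, so the regions are written out right after each run. By default it only prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List


def build_commands(pangenome: str, min_score: int, output_dir: str) -> List[List[str]]:
    """Build the rgp + write commands for one threshold (flags as quoted in this thread)."""
    return [
        ["ppanggolin", "rgp", "-p", pangenome, "--min_score", str(min_score)],
        ["ppanggolin", "write", "-p", pangenome, "--regions", "--output", output_dir],
    ]


def run_sweep(pangenome: str, min_scores: List[int], dry_run: bool = True) -> List[str]:
    """Return the command lines for every threshold; execute them only if dry_run is False."""
    lines = []
    for score in min_scores:
        # Hypothetical naming scheme: one output directory per threshold.
        output_dir = f"regions_min_score_{score}"
        for cmd in build_commands(pangenome, score, output_dir):
            lines.append(shlex.join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)
    return lines


# Dry run: print what would be executed for three thresholds.
for line in run_sweep("pangenome.h5", [4, 6, 8]):
    print(line)
```

Check the printed commands against `ppanggolin rgp -h` and `ppanggolin write -h` for your installed version before passing `dry_run=False`.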

@mikkelbregovic

Hello!
Could you briefly explain the --persistent_penalty and --variable_gain options?

If I increase or decrease these values, what should I expect?

Thanks for taking the time to answer these basic things.
Really appreciated

@axbazin
Member

axbazin commented May 26, 2020

Hello

Taken on their own, these two parameters work against each other.

The persistent penalty defaults to 3. Decreasing it might fuse two RGPs that are close together along the genome but separated by a few persistent genes. Increasing it might split an RGP into several pieces if it contains persistent genes.

The variable gain defaults to 1. Increasing it might fuse two RGPs that are close together along the genome, while decreasing it might split an RGP into several pieces if it contains persistent genes.
Both parameters also affect the score of the predicted RGPs.

In any case, persistent genes in the middle of RGPs are relatively rare, so modifying these parameters slightly should not have much impact. Changing them greatly, however, may no longer give biologically meaningful results, since you may end up grouping RGPs together across long stretches of persistent genes.
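To make the interplay between the penalty, the gain and the score threshold more concrete, here is a toy sketch. It is not PPanGGOLiN's actual code. It assumes a running score along the contig that increases by the variable gain on each shell or cloud gene, decreases by the persistent penalty on each persistent gene, and never goes below 0. A region is reported from the first gene with a positive score up to the gene where the score peaks, provided that peak reaches the minimum score. The function name `predict_regions` and the one-letter gene encoding are hypothetical, and the minimum length filter and any special handling of particular gene families are ignored.

```python
from typing import List, Tuple


def predict_regions(
    partitions: str,
    persistent_penalty: int = 3,
    variable_gain: int = 1,
    min_score: int = 4,
) -> List[Tuple[int, int]]:
    """Toy region finder over a string of genes: 'P' = persistent, anything else = shell/cloud.

    Returns (start, end) gene indices, inclusive, sorted by position.
    """
    if min_score <= 0 or variable_gain <= 0:
        raise ValueError("min_score and variable_gain must be positive")

    taken = [False] * len(partitions)  # genes already assigned to a region
    regions = []
    while True:
        # Recompute the running score, resetting it on genes already taken.
        scores = []
        current = 0
        for gene, used in zip(partitions, taken):
            if used:
                current = 0
            elif gene == "P":
                current = max(0, current - persistent_penalty)
            else:
                current += variable_gain
            scores.append(current)

        best = max(scores, default=0)
        if best < min_score:
            break

        # The region ends at the score peak and starts after the previous zero.
        end = scores.index(best)
        start = end
        while start > 0 and scores[start - 1] > 0:
            start -= 1
        for i in range(start, end + 1):
            taken[i] = True
        regions.append((start, end))
    return sorted(regions)


genes = "PPSSSSPSSSSPP"  # two runs of variable genes split by one persistent gene
print(predict_regions(genes))                        # [(2, 10)]: one region spanning both runs
print(predict_regions(genes, persistent_penalty=5))  # [(2, 5), (7, 10)]: split in two
print(predict_regions(genes, min_score=8))           # []: nothing reaches the threshold
```

In this toy example, raising the penalty from 3 to 5 splits the single region at the persistent gene, and raising the variable gain to 2 alongside it fuses the two pieces again, which is the opposing behaviour described above.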

If you want to understand in more detail how all of these parameters interact, the full method is described in this preprint: https://www.biorxiv.org/content/10.1101/2020.03.26.007484v1.full
See section "2.1 - panRGP method".

In section 2.1.1, parameter p in the formula corresponds to the persistent penalty and parameter v to the variable gain.
In section 2.1.2, parameter s_min is the "min_score" and l_min is the "min_length" I mentioned previously.

Only sections 2.1.1 and 2.1.2 are relevant for understanding how the RGPs are predicted.

If something is unclear, do not hesitate to ask more questions :)

@axbazin
Member

axbazin commented Sep 28, 2020

Since this issue dates from May and there have been no further questions, I will close it. If you have any other questions, please do not hesitate to reopen it.

@axbazin axbazin closed this as completed Sep 28, 2020