LTR_retriever reports redundant/duplicated intact LTR-RT from inputs of both LTR_finder and LTR_harvest #28

Closed
b524198065 opened this issue Nov 19, 2018 · 4 comments

@b524198065

Hi Shujun,

When I gave LTR_retriever the results from both LTR_finder and LTR_harvest, I found a few apparently duplicated intact LTR-RT entries in the *pass.list and *pass.list.gff3 files:

For example:

tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC 257831..257835 262567..262571 IN:258538..261865 0.9672 - unknown NA 1289956
tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT 257831..257835 262567..262571 IN:258538..261865 0.9673 ? unknown NA 1285934

and this:

tig00000209:486380..491904 pass motif:TGCA TSD:TTG 486377..486379 491905..491907 IN:486636..491647 0.9804 - unknown LTR 763871
tig00000209:486385..491904 pass motif:TGCA TSD:NA .. .. IN:486636..491652 0.9802 - unknown LTR 771771

tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT 1060510..1060514 1070651..1070655 IN:1060827..1070338 0.9904 - unknown LTR 371614
tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC 1061540..1061544 1066431..1066435 IN:1061687..1066288 0.993 - unknown LTR 270495

It seems that only some features (e.g. the TSD) differ between the two redundant entries, while their locations on the genome are almost the same.

Although the number of apparently duplicated intact LTR-RTs is low (5 of 497 candidates), I think it is still worth making sure the results are reliable. How can I tell which prediction is the better one, and remove the duplicate?

Many thanks,

Hongbo

@oushujun
Owner

Hi Hongbo,

You can randomly pick one of the apparent duplicates. They are equally reliable. The redundancy will be removed during library construction.
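A minimal sketch of picking one entry per group of apparent duplicates, assuming the first whitespace-separated field of each *pass.list line has the form `contig:start..end` as in the examples above. The function names and the 0.95 reciprocal-overlap threshold are illustrative assumptions, not part of LTR_retriever. Entries on the same contig whose coordinates overlap reciprocally by at least the threshold are treated as duplicates, and only the first one seen is kept.

```python
import re

# First field of a pass.list line, e.g. "tig00000209:486380..491904"
LOC = re.compile(r"^(\S+):(\d+)\.\.(\d+)")


def parse_loc(line):
    """Return (contig, start, end) from the first field of a pass.list line."""
    m = LOC.match(line)
    if m is None:
        raise ValueError(f"unrecognized location field: {line!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their intersection."""
    if a[0] != b[0]:
        return 0.0
    inter = min(a[2], b[2]) - max(a[1], b[1]) + 1  # inclusive coordinates
    if inter <= 0:
        return 0.0
    return min(inter / (a[2] - a[1] + 1), inter / (b[2] - b[1] + 1))


def drop_near_duplicates(lines, min_overlap=0.95):
    """Keep the first entry of every group of near-identical locations."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        loc = parse_loc(line)
        if all(reciprocal_overlap(loc, parse_loc(k)) < min_overlap for k in kept):
            kept.append(line)
    return kept
```

The pairwise comparison is quadratic in the number of entries, which should be acceptable for a few hundred candidates. A nested pair such as the tig00000241 example overlaps by less than half of the outer element, so both entries are kept.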

This is caused by the differing predictions from LTRharvest and LTR_finder: LTR_retriever has to work out the true structure starting from different candidates, and it can end up with features that are not exactly the same but that both fit the current definition of an LTR-RT.

Can you provide the sequences of these candidates with 100 bp of flanking sequence on each end? I can look into them further and see if there is a way to improve the algorithm.
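One way to cut out such a region is sketched below, assuming the contig sequence is already loaded as a Python string and that the pass.list coordinates are 1-based and inclusive (an assumption; it is consistent with the TSD coordinates above, where 257831..257835 spans 5 bp). `extract_with_flanks` is a hypothetical helper name. The flanks are clamped so they never run past the ends of the contig.

```python
def extract_with_flanks(contig_seq, start, end, flank=100):
    """Return (lo, hi, subsequence) for start..end extended by `flank` bp.

    start and end are assumed to be 1-based and inclusive; lo and hi are
    the 1-based inclusive coordinates actually extracted after clamping
    to the contig boundaries.
    """
    if start < 1 or end > len(contig_seq) or start > end:
        raise ValueError("coordinates fall outside the contig")
    lo = max(1, start - flank)
    hi = min(len(contig_seq), end + flank)
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return lo, hi, contig_seq[lo - 1:hi]
```

Returning the clamped coordinates alongside the sequence makes it possible to label the FASTA record with the region that was really extracted.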

Thanks,
Shujun

@b524198065
Author

@oushujun
Here are the FASTA-format sequence files for the three examples above.

example_1.txt
example_2.txt
example_3.txt

@oushujun
Owner

Thanks! I will look into them when time allows.

Shujun

oushujun added a commit that referenced this issue Dec 4, 2018
… locates at the boundary of a contig. 2. fix the bug #28 #29 for sometimes producing slightly different results when using both LTRharvest and LTR_FINDER inputs. 3. fix the bug #28 for biased recognition of TGCA motif over non-TGCA motifs. This fix produced similar prediction quality in terms of sensitivity, specificity, accuracy, and precision that were tested in genomes of sacred lotus and rice.
oushujun added a commit that referenced this issue Dec 4, 2018
…s when using both LTRharvest and LTR_FINDER inputs.
@oushujun
Owner

oushujun commented Dec 9, 2018

Hi Hongbo,

Thank you for providing the sequences.

For the first one:

tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC 257831..257835 262567..262571 IN:258538..261865 0.9672 - unknown NA 1289956
tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT 257831..257835 262567..262571 IN:258538..261865 0.9673 ? unknown NA 1285934

This is a true LTR, but because of a minor bug the direction information of the first entry was inherited from LTR_FINDER, so its TSD was converted into the reverse complement (ACTAC versus GTAGT). To be consistent, direction information from LTR_FINDER will be discarded and inferred de novo by the LTR_retriever algorithm.

For the second one:
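The two TSDs reported for this pair, ACTAC and GTAGT, are reverse complements of each other, which is what a strand flip produces. A small check, with `reverse_complement` as an illustrative helper rather than anything taken from LTR_retriever:

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACTAC"))  # GTAGT
```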

tig00000209:486380..491904 pass motif:TGCA TSD:TTG 486377..486379 491905..491907 IN:486636..491647 0.9804 - unknown LTR 763871
tig00000209:486385..491904 pass motif:TGCA TSD:NA .. .. IN:486636..491652 0.9802 - unknown LTR 771771

Using the sequence you provided, I only got the first prediction. This is a true LTR; however, the prediction is biased toward the canonical TG-CA motif over non-canonical motifs. In this case, the TSD-motif sequence should be 5'-CATTG[TG...TA]CATTG-3', but LTR_retriever picked 5'-TTG[TG...CA]TTG-3' instead. I have corrected the bias and pushed the update to LTR_retriever v2.0.
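To see why both readings are self-consistent, here is a toy sketch (not the LTR_retriever algorithm). The internal sequence `ACGTACGT` is made up; only the `CATTG`, `TG`, `TA` and `CA` parts come from the description above. Shifting the 3' boundary by 2 bp turns a 5 bp TSD with a TG..TA motif into a 3 bp TSD with a TG..CA motif, so a scorer that prefers TG..CA would choose the shorter TSD. Coordinates in this sketch are 0-based with an exclusive end, and the helper names are hypothetical.

```python
def shared_tsd(seq, start, end, k):
    """Return the k-bp TSD flanking seq[start:end], or None if the flanks differ."""
    if start - k < 0 or end + k > len(seq):
        return None
    left = seq[start - k:start]
    right = seq[end:end + k]
    return left if left == right else None


def terminal_motif(seq, start, end):
    """First two plus last two bases of the element seq[start:end]."""
    return seq[start:start + 2] + seq[end - 2:end]


# flank + TSD + element (TG...TA) + TSD + flank; the interior is invented
toy = "GG" + "CATTG" + "TG" + "ACGTACGT" + "TA" + "CATTG" + "CC"

# Reading 1: element ends at ...TA, 5-bp TSD
print(shared_tsd(toy, 7, 19, 5), terminal_motif(toy, 7, 19))  # CATTG TGTA
# Reading 2: element extended by 2 bp to end at ...CA, 3-bp TSD
print(shared_tsd(toy, 7, 21, 3), terminal_motif(toy, 7, 21))  # TTG TGCA
```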

For the last one:

tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT 1060510..1060514 1070651..1070655 IN:1060827..1070338 0.9904 - unknown LTR 371614
tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC 1061540..1061544 1066431..1066435 IN:1061687..1066288 0.993 - unknown LTR 270495

This is a case of nested LTR elements, with the second LTR element inserted into the first. LTR_retriever takes care of such cases. Nested sequences will be removed from the final library if intact versions of them are found.

I have fixed the bugs behind the first two cases, as well as other minor bugs, and pushed a new version of LTR_retriever to the repository. The new version has similar annotation performance compared to previous versions, but with better detail in TSD and motif identification. Hope this helps!
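A nested pair like this one can be told apart from a near-duplicate by a containment test on the coordinates. A minimal sketch, assuming both entries are on the same contig and given as inclusive `(start, end)` tuples; `is_nested` is an illustrative name and not how LTR_retriever handles nesting internally:

```python
def is_nested(outer, inner):
    """True if `inner` lies within `outer` and the two intervals differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


first = (1060515, 1070650)   # tig00000241, first entry
second = (1061545, 1066430)  # tig00000241, second entry
print(is_nested(first, second))  # True
print(is_nested(second, first))  # False
```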

Best,
Shujun

@oushujun oushujun closed this as completed Dec 9, 2018