File *nmtf.LTRlib.fa not made #81

Closed
lcoombe opened this issue Aug 28, 2020 · 11 comments

@lcoombe

lcoombe commented Aug 28, 2020

Hello,

I'm running LTR_retriever v2.9.0 (installed via conda), and based on the logs I expect to see these output files in my working directory:

LTR-RT library
        RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
        RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
        RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

However, I'm only seeing two of those files:

[lcoombe]$ ls *LTRlib*fa
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa  RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa

Specified parameters:

Parameters: -genome RC-genome-V4.500plus.seqtk_5.fa -infinder RC-genome-V4.500plus.seqtk_5.fa.finder.scn -inharvest RC-genome-V4.500plus.seqtk_5.fa.harvest.scn -nonTGCA RC-genome-V4.500plus.seqtk_5.fa.harvest.nonTGCA.scn -threads 4 -noanno

Any idea why that fasta file isn't being generated? Or am I looking in the wrong place?
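
A quick way to see which of the three library files from the log summary are actually present is sketched below. The function name `check_ltrlib_outputs` is hypothetical (it is not part of LTR_retriever); the three suffixes are taken from the "LTR-RT library" section of the log.

```shell
# Hypothetical helper, not part of LTR_retriever.
# Reports whether each expected library file exists and is non-empty.
# Usage: check_ltrlib_outputs <genome_file_name> [directory]
check_ltrlib_outputs() (
    prefix="$1"
    dir="${2:-.}"
    problems=0
    for suffix in LTRlib.redundant.fa LTRlib.fa nmtf.LTRlib.fa; do
        path="$dir/$prefix.$suffix"
        if [ -s "$path" ]; then
            echo "OK       $path"
        elif [ -e "$path" ]; then
            echo "EMPTY    $path"
            problems=$((problems + 1))
        else
            echo "MISSING  $path"
            problems=$((problems + 1))
        fi
    done
    # Exit status is non-zero if any file is missing or empty.
    [ "$problems" -eq 0 ]
)
```

For the run above this would be called as `check_ltrlib_outputs RC-genome-V4.500plus.seqtk_5.fa`. Distinguishing "missing" from "empty" helps separate a file that was never written from one that was written with no sequences.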

Thanks so much!
Lauren

@oushujun
Owner

Hi Lauren,

It's likely that the program did not identify any non-TGCA LTR elements in the genome. Please check if there are any entries in the *nmtf.pass.list file.

Best,
Shujun
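
The check suggested above can be done by counting data rows in the pass list. This is a minimal sketch, assuming the list uses a `#`-prefixed header line and one element per row; `count_pass_entries` is a hypothetical helper name.

```shell
# Hypothetical helper: count data rows in a *.pass.list file,
# skipping '#' header lines and blank lines.
# Usage: count_pass_entries <pass_list_file>
count_pass_entries() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_pass_entries RC-genome-V4.500plus.seqtk_5.fa.nmtf.pass.list` prints `0` when only the header is present.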

@lcoombe
Author

lcoombe commented Aug 28, 2020

Hi Shujun,

Thanks for the prompt response!

I took a look at that file, and it does contain entries:

[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ ls *LTRlib*fa
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa  RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa
[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ cat *nmtf.pass.list
#LTR_loc	Category	Motif	TSD	5_TSD 3_TSD	Internal	Identity	Strand	SuperFamily	TE_type	Insertion_Time
s00002979:17199..23609	pass	motif:TGTA	TSD:CTCGT	17194..17198	23610..23614	IN:17699..23109	0.9620	?	unknown	NA	1499864
s00003068:6229694..6231445	pass	motif:TGCT	TSD:ATAAT	6229689..6229693	6231446..6231450	IN:6230017..6231122	0.969unknown	NA	1212196
s00003156:318552..323859	pass	motif:TGAC	TSD:ACAAC	318547..318551	323862..323866	IN:319136..323281	0.9465	+	GypsyLTR	2136454
s00003321:283843..285140	pass	motif:TATA	TSD:ATAGC	283838..283842	285141..285145	IN:284023..284970	0.9649	?	unknown	NA	1382119
s00003397:107320..114677	pass	motif:TGTA	TSD:TATAT	107315..107319	114678..114682	IN:107742..114256	0.9378	-	GypsyLTR	2497401
s00003422:254140..258317	pass	motif:TGTG	TSD:CTCTG	254135..254139	258318..258322	IN:254302..258155	0.9693	-	GypsyLTR	1204601
s00003576:682825..684549	pass	motif:TGGT	TSD:ATGTA	682820..682824	684550..684554	IN:683010..684364	0.9462	?	unknown	NA	2145677
s00003590:124843..130198	pass	motif:TACA	TSD:CTCAT	124838..124842	130199..130203	IN:125121..129921	0.9065	-	GypsyLTR	3841969
s00003717:232257..237592	pass	motif:TTTT	TSD:TTGTT	232252..232256	237593..237597	IN:232443..237408	0.9514	+	GypsyLTR	1934538
s00003947:179251..184709	pass	motif:TGTA	TSD:CTGGG	179246..179250	184710..184714	IN:179955..184005	0.9445	-	GypsyLTR	2216757
s00004212:169574..175432	pass	motif:TGTA	TSD:TGATC	169569..169573	175433..175437	IN:170291..174715	0.9064	-	GypsyLTR	3844193
s00004313:532336..536866	pass	motif:TATA	TSD:AAACA	532331..532335	536867..536871	IN:532786..536417	0.9443	?	unknown	NA	2225177

Are there cases where it is expected to have entries in the file, but they don't end up in the *nmtf.LTRlib.fa file?

Thanks for your help!
Lauren
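
To confirm at a glance that the listed elements really carry non-TGCA terminal motifs, the third column of the pass list can be tallied. This is a sketch that assumes tab-separated rows with the motif in column 3 written as `motif:XXXX`, as in the listing above; `tally_motifs` is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each terminal motif occurs
# in column 3 of a tab-separated *.pass.list file.
# Output: one "<motif><TAB><count>" line per motif, sorted by motif.
# Usage: tally_motifs <pass_list_file>
tally_motifs() {
    awk -F '\t' '
        !/^#/ && NF >= 3 {
            sub(/^motif:/, "", $3)   # strip the "motif:" label
            count[$3]++
        }
        END { for (m in count) print m "\t" count[m] }
    ' "$1" | LC_ALL=C sort
}
```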

@oushujun
Owner

Hi Lauren,

Can you paste the program screen output here? And if rerunning the program is not too slow, please rerun it with the -v parameter so that we can check the intermediate files to further track down the cause.

Best,
Shujun

@lcoombe
Author

lcoombe commented Aug 29, 2020

Hi Shujun,

For sure -- here's the full log:

Parameters: -genome /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa -infinder /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.finder.scn -inharvest /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.scn -nonTGCA /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.nonTGCA.scn -threads 4 -noanno


Thu Aug 27 21:14:56 PDT 2020	Dependency checking: All passed!
Thu Aug 27 21:15:13 PDT 2020	LTR_retriever is starting from the Init step.
Thu Aug 27 21:15:25 PDT 2020	Start to convert inputs...
				Total candidates: 2629
				Total uniq candidates: 2557

Thu Aug 27 21:15:34 PDT 2020	Module 1: Start to clean up candidates...
				Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
				Sequences containing tandem repeats will be discarded.

Thu Aug 27 21:17:29 PDT 2020	1040 clean candidates remained

Thu Aug 27 21:17:29 PDT 2020	Modules 2-5: Start to analyze the structure of candidates...
				The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Aug 27 21:19:15 PDT 2020	Intact LTR-RT found: 161

Thu Aug 27 21:19:20 PDT 2020	Module 6: Start to analyze truncated LTR-RTs...
				Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
				Use -notrunc if you don't want to keep them.

Thu Aug 27 21:19:20 PDT 2020	54 truncated LTR-RTs found
Thu Aug 27 21:19:48 PDT 2020	21 truncated LTR sequences have added to the library

Thu Aug 27 21:19:48 PDT 2020	Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
				Total library sequences: 296
Thu Aug 27 21:23:10 PDT 2020	Retained clean sequence: 296

Thu Aug 27 21:23:10 PDT 2020	Sequence clustering for RC-genome-V4.500plus.seqtk_5.fa.ltrTE ...
Thu Aug 27 21:23:10 PDT 2020	Unique lib sequence: 296

Thu Aug 27 21:23:12 PDT 2020	Module 7: Start to analyze non-TGCA LTR-RT candidates...
				Total non-TGCA candidates: 5823
Thu Aug 27 21:23:12 PDT 2020	Start to remove non-TGCA candidates that are >=60% identical to TGCA LTRs...
Thu Aug 27 21:25:08 PDT 2020	Total uniq non-TGCA candidates: 3880

Thu Aug 27 21:25:08 PDT 2020	Module 1: Start to clean up candidates...
				Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
				Sequences containing tandem repeats will be discarded.

Thu Aug 27 21:25:12 PDT 2020	3719 clean non-TGCA candidates remained

Thu Aug 27 21:25:12 PDT 2020	Modules 2-5: Start to analyze the structure of candidates...
				The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Aug 27 21:31:47 PDT 2020	Intact non-TGCA LTR-RT found: 13

Thu Aug 27 21:31:51 PDT 2020	Module 6: Start to analyze truncated LTR-RTs...
				Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
				Use -notrunc if you don't want to keep them.

Thu Aug 27 21:31:52 PDT 2020	37 truncated LTR-RTs found
Thu Aug 27 21:32:15 PDT 2020	58 truncated LTR sequences have added to the library

Thu Aug 27 21:32:15 PDT 2020	Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
				Total library sequences: 87
Thu Aug 27 21:33:18 PDT 2020	Retained clean sequence: 87

Thu Aug 27 21:33:19 PDT 2020	Module 6: Start to remove nested insertions in internal regions...
Thu Aug 27 21:34:26 PDT 2020	Raw internal region size (bit): 779494
				Clean internal region size (bit): 692038

Thu Aug 27 21:34:26 PDT 2020	Sequence number of the redundant LTR-RT library: 600
				The redundant LTR-RT library size (bit): 1066159

Thu Aug 27 21:34:26 PDT 2020	Module 8: Start to make non-redundant library...

Thu Aug 27 21:34:27 PDT 2020	Final LTR-RT library entries: 363
				Final LTR-RT library size (bit): 808187

Thu Aug 27 21:34:27 PDT 2020	Total intact LTR-RTs found: 173
				Total intact non-TGCA LTR-RTs found: 12

Thu Aug 27 21:34:31 PDT 2020	All analyses were finished!

##############################
####### Result files #########
##############################

Table output for intact LTR-RTs (detailed info)
	RC-genome-V4.500plus.seqtk_5.fa.pass.list (All LTR-RTs)
	RC-genome-V4.500plus.seqtk_5.fa.nmtf.pass.list (Non-TGCA LTR-RTs)
	RC-genome-V4.500plus.seqtk_5.fa.pass.list.gff3 (GFF3 format for intact LTR-RTs)

LTR-RT library
	RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
	RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
	RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

I'll also launch another run with the -v!

Thanks,
Lauren
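
The log reports 13 intact non-TGCA elements after Modules 2-5 and 12 in the final summary, and the pass list pasted earlier has 12 rows, so the list agrees with the final total. A sketch for pulling that final number out of a saved log is below; it assumes the screen output was saved to a file and that the summary line is worded exactly as above. `logged_nontgca_count` is a hypothetical helper name.

```shell
# Hypothetical helper: print the final "Total intact non-TGCA LTR-RTs found"
# value from a saved LTR_retriever screen log (last match wins).
# Usage: logged_nontgca_count <log_file>
logged_nontgca_count() {
    sed -n 's/.*Total intact non-TGCA LTR-RTs found: *\([0-9][0-9]*\).*/\1/p' "$1" |
        tail -n 1
}
```

The result can then be compared with the number of data rows in the `*nmtf.pass.list` file, for example the output of `grep -vc '^#'` on that file.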

@oushujun
Owner

At a glance, the log file looks fine to me. I will take a closer look at each step. The number of intact LTR elements seems a little low to me. Did you use the hard-masked genome or the soft-masked/unmasked one for LTRharvest and LTR_FINDER?

@lcoombe
Author

lcoombe commented Aug 29, 2020

It could be partly because the assembly isn't very contiguous? The N50 is ~150 kb, and I split the file into partitions so that it would run faster.
The input genome is unmasked. This is one of my steps to create a custom repeat library for my genome assembly, so that I can mask it before gene annotation.

@oushujun
Owner

Splitting the genome is suboptimal because the filtering step needs a bigger sample size to be effective. You can run it with more threads instead; the parallelization is quite efficient.

@lcoombe
Author

lcoombe commented Aug 30, 2020

I split it because I'm running other tools as well (LTR_FINDER, RepeatModeler), and the genome I'm working with is quite large (~6GB). Do you think the parallelism would scale to a genome of that size?

@oushujun
Owner

oushujun commented Aug 30, 2020 via email

@lcoombe
Author

lcoombe commented Aug 30, 2020

Ok cool - I'll give it a try without partitioning (it was too slow previously but I see there have been significant improvements since I last tried!).
And thanks for the suggestion about EDTA -- I think another member of our group tried it but found that one of the components (I think TIR-learner) was quite slow, so that's why I haven't tried it myself yet.
Thanks for your suggestions!

@oushujun
Owner

oushujun commented Aug 31, 2020 via email
