
TALON seems to be stuck, or at least I have no idea what it is doing. #131

Closed
callumparr opened this issue Jun 10, 2023 · 11 comments

@callumparr

Admittedly I am running this on a very large data set. In all, the merged BAM contains something like 180M primary alignments. The log output seems to have stopped at this point, and I cannot see any additions to the temp files created in talon_tmp, nor any writing to the TALON .db file, in over 24 hr. Has some timeout occurred?

Running on a node that has 1 TB of memory, which seems to be fine. I compared the line count of the QC log file with the flagstat output, and it seems TALON still hasn't gone through all alignments from the merged BAM file.

(base) callum@dgt-gpu2:/tmp$ wc -l F6_interactome_neurogenesis_QC.log 
**174762972** F6_interactome_neurogenesis_QC.log
**184445290** + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
184445290 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

[ 2023-06-06 16:50:13 ] Annotating reads in interval chr1_KI270712v1_random:6437-172247...
[ 2023-06-06 16:50:13 ] Annotating reads in interval chr22_KI270733v1_random:10973-179722...
[ 2023-06-06 16:50:17 ] Annotating reads in interval chr22_KI270734v1_random:19497-164966...
[ 2023-06-06 16:50:17 ] Annotating reads in interval chr3_GL000221v1_random:22885-31291...
[ 2023-06-06 16:50:21 ] Annotating reads in interval chr4:27109-190178486...
[ 2023-06-06 16:50:32 ] Annotating reads in interval chr3:10001-198233546...
[ 2023-06-06 17:02:42 ] Annotating reads in interval chr1_KI270714v1_random:306-38347...
[ 2023-06-06 17:02:47 ] Annotating reads in interval chr4_GL000008v2_random:21898-170436...
[ 2023-06-06 17:02:51 ] Annotating reads in interval chr5:11436-181474525...
[ 2023-06-06 17:06:40 ] Annotating reads in interval chr6:105419-170741818...
[ 2023-06-06 18:22:14 ] Annotating reads in interval chr22:11249806-50807135...
[ 2023-06-06 19:34:18 ] Annotating reads in interval chr8:62561-145066713...
[ 2023-06-06 23:57:18 ] Annotating reads in interval chr9_KI270717v1_random:13135-37432...
[ 2023-06-06 23:57:21 ] Annotating reads in interval chr9_KI270718v1_random:14-13424...
[ 2023-06-06 23:57:25 ] Annotating reads in interval chr9_KI270719v1_random:11943-171140...
[ 2023-06-06 23:57:29 ] Annotating reads in interval chrM:1-16569...
[ 2023-06-06 23:58:23 ] Annotating reads in interval chrUn_GL000195v1:1725-179378...
[ 2023-06-06 23:58:34 ] Annotating reads in interval chrUn_GL000213v1:84666-85257...
[ 2023-06-06 23:58:37 ] Annotating reads in interval chrUn_GL000214v1:32227-60204...
[ 2023-06-06 23:58:41 ] Annotating reads in interval chrUn_GL000216v2:149288-174691...
[ 2023-06-06 23:58:45 ] Annotating reads in interval chrUn_GL000218v1:22808-97453...
[ 2023-06-06 23:58:49 ] Annotating reads in interval chrUn_GL000219v1:77559-99661...
[ 2023-06-06 23:58:53 ] Annotating reads in interval chrUn_GL000220v1:2839-161752...
[ 2023-06-07 00:02:20 ] Annotating reads in interval chrUn_GL000226v1:5354-6916...
[ 2023-06-07 00:02:24 ] Annotating reads in interval chrUn_KI270435v1:33904-58136...
[ 2023-06-07 00:02:28 ] Annotating reads in interval chrUn_KI270438v1:112108-112412...
[ 2023-06-07 00:02:32 ] Annotating reads in interval chrUn_KI270442v1:217247-391534...
[ 2023-06-07 00:02:35 ] Annotating reads in interval chrUn_KI270466v1:975-1128...
[ 2023-06-07 00:02:39 ] Annotating reads in interval chrUn_KI270742v1:53330-74847...
[ 2023-06-07 00:02:43 ] Annotating reads in interval chrUn_KI270743v1:136039-148853...
[ 2023-06-07 00:02:47 ] Annotating reads in interval chrUn_KI270744v1:423-145137...
[ 2023-06-07 00:02:51 ] Annotating reads in interval chrUn_KI270745v1:15951-37181...
[ 2023-06-07 00:02:55 ] Annotating reads in interval chrUn_KI270747v1:1530-3683...
[ 2023-06-07 00:02:58 ] Annotating reads in interval chrUn_KI270748v1:380-91432...
[ 2023-06-07 00:03:02 ] Annotating reads in interval chrUn_KI270751v1:43807-115120...
[ 2023-06-07 00:03:07 ] Annotating reads in interval chrUn_KI270754v1:7375-8623...
[ 2023-06-07 00:03:11 ] Annotating reads in interval chrX:13511-156030800...
[ 2023-06-07 03:48:54 ] Annotating reads in interval chr14_GL000009v2_random:43297-186420...
[ 2023-06-07 13:18:04 ] Annotating reads in interval chr9:10001-138325916...
[ 2023-06-07 13:24:22 ] Annotating reads in interval chr16_KI270728v1_random:159477-1543972...
[ 2023-06-07 18:02:14 ] Annotating reads in interval chr15_KI270727v1_random:56454-444540...
[ 2023-06-07 19:53:51 ] Annotating reads in interval chrY:2784477-56738229...
[ 2023-06-08 15:14:35 ] Annotating reads in interval chr11_KI270721v1_random:237-93434...
[ 2023-06-08 16:21:11 ] Annotating reads in interval chr7:13426-159218022...
[ 2023-06-08 20:58:44 ] Annotating reads in interval chr17_GL000205v2_random:14811-169354...
[ 2023-06-09 03:01:34 ] Annotating reads in interval chr13:16509321-114351695...
[ 2023-06-09 11:49:01 ] Annotating reads in interval chr20:128702-64329822...
[ 2023-06-09 16:50:09 ] Shutting down message queue...
drwxrwxrwt 17 root     root 4.0K Jun 10 15:55 .
drwxr-xr-x  2 callum   sgx     6 Jun 10 15:35 hsperfdata_callum
-rw-r--r--  1 callum   sgx   13G Jun 10 01:50 F6_interactome_neurogenesis_QC.log
drwxr-xr-x  3 callum   sgx  4.0K Jun  9 11:09 talon_tmp
drwx------  2 kitakura sgx    89 Jun  9 10:03 rootlesskit2530122802
drwx------  2 callum   sgx    89 Jun  8 01:05 rootlesskit3564735061
drwx------  3 root     root   17 Jun  7 12:46 systemd-private-391258cc985348f19c9061b06dd85295-systemd-resolved.service-BiriuT
drwx------  3 root     root   17 Jun  7 12:46 systemd-private-391258cc985348f19c9061b06dd85295-ntp.service-7FIpxw
drwxr-xr-x  2 root     root    6 Jun  7 12:45 hsperfdata_root
drwx------  2 callum   sgx    31 Jun  7 01:50 pymp-b49m1npo
-rw-------  1 callum   sgx  1.9K Jun  6 23:36 pybedtools.hmajrws2.tmp
-rw-------  1 callum   sgx  9.0G Jun  6 23:35 pybedtools.el9onqe6.tmp
-rw-------  1 callum   sgx  9.0G Jun  6 23:19 pybedtools.kx8ubdxz.tmp
-rw-r--r--  1 callum   sgx  781M Jun  6 14:35 F6_interactome.db
drwx------  3 root     root   22 Jun  6 13:34 snap-private-tmp
drwx------  3 root     root   17 Jun  6 13:34 systemd-private-391258cc985348f19c9061b06dd85295-systemd-logind.service-m0eMgz
drwxr-xr-x 25 root     root 4.0K Jun  6 13:34 ..
drwxrwxrwt  2 root     root    6 Jun  6 13:34 .Test-unix
drwxrwxrwt  2 root     root    6 Jun  6 13:34 .ICE-unix
drwxrwxrwt  2 root     root    6 Jun  6 13:34 .X11-unix
drwxrwxrwt  2 root     root    6 Jun  6 13:34 .XIM-unix
drwxrwxrwt  2 root     root    6 Jun  6 13:34 .font-unix

drwxrwxrwt 17 root   root 4.0K Jun 10 16:05 ..
-rw-r--r--  1 callum sgx   20G Jun 10 01:50 observed_transcript_tuples.tsv
-rw-r--r--  1 callum sgx  388M Jun 10 01:50 transcript_annot_tuples.tsv
-rw-r--r--  1 callum sgx   50M Jun 10 01:50 exon_annot_tuples.tsv
-rw-r--r--  1 callum sgx   19M Jun 10 01:49 gene_annot_tuples.tsv
-rw-r--r--  1 callum sgx   28M Jun 10 01:34 vertex_2_gene_tuples.tsv
-rw-r--r--  1 callum sgx   29M Jun 10 01:34 location_tuples.tsv
-rw-r--r--  1 callum sgx   59M Jun 10 01:34 edge_tuples.tsv
-rw-r--r--  1 callum sgx  154M Jun 10 01:34 transcript_tuples.tsv
-rw-r--r--  1 callum sgx  1.2M Jun 10 01:34 gene_tuples.tsv
-rw-r--r--  1 callum sgx   391 Jun  9 11:14 merged.bam.flagstat
drwxr-xr-x  3 callum sgx  4.0K Jun  9 11:09 .
-rw-r--r--  1 callum sgx     0 Jun  7 01:50 abundance_tuples.tsv
drwxr-xr-x  2 callum sgx  4.0K Jun  7 01:49 interval_files
-rw-r--r--  1 callum sgx   16M Jun  6 22:33 merged.bam.bai
-rw-r--r--  1 callum sgx  112G Jun  6 22:18 merged.bam
-rw-r--r--  1 callum sgx   17G Jun  6 20:18 Neuron_rep2.bam
-rw-r--r--  1 callum sgx   17G Jun  6 19:47 Neuron_rep2_unsorted.bam
-rw-r--r--  1 callum sgx   17G Jun  6 19:26 Neuron_rep1.bam
-rw-r--r--  1 callum sgx   17G Jun  6 18:54 Neuron_rep1_unsorted.bam
-rw-r--r--  1 callum sgx   12G Jun  6 18:31 NSC_rep2.bam
-rw-r--r--  1 callum sgx   13G Jun  6 18:10 NSC_rep2_unsorted.bam
-rw-r--r--  1 callum sgx   20G Jun  6 17:55 NSC_rep1.bam
-rw-r--r--  1 callum sgx   20G Jun  6 17:20 NSC_rep1_unsorted.bam
-rw-r--r--  1 callum sgx   21G Jun  6 16:56 iPSC_rep2.bam
-rw-r--r--  1 callum sgx   21G Jun  6 16:20 iPSC_rep2_unsorted.bam
-rw-r--r--  1 callum sgx   28G Jun  6 15:55 iPSC_rep1.bam
-rw-r--r--  1 callum sgx   28G Jun  6 15:07 iPSC_rep1_unsorted.bam
@callumparr
Author

Ah OK, so it was doing something, but then when it started to update the database it hit many errors.

[ 2023-06-11 16:51:28 ] All jobs complete. Starting database update.
[ 2023-06-11 17:20:18 ] Validating database........
Database counter for 'genes' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 153594
counter_value: 172923
Database counter for 'transcripts' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 1644012
counter_value: 2021938
Database counter for 'location' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2116074
counter_value: 2397222
Database counter for 'edge' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 3020322
counter_value: 3559795
Database counter for 'observed' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 164632280
counter_value: 173941741
Traceback (most recent call last):
  File "/home/callum/miniconda3/bin/talon", line 33, in <module>
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2464, in main
    end_support = parse_custom_SAM_tags(sam_record)
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 1781, in update_database
    # get overlap and compare
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2095, in check_database_integrity
    except Exception as e:
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
/analysisdata/fantom6/Interactome/ONT-CAGE_TALON_dorado/scripts/talon.sh: line 22: rep1: command not found
/tmp/F6_interactome_neurogenesis_QC.log:         74.5% -- replaced with /tmp/F6_interactome_neurogenesis_QC.log.gz
gzip: /tmp/*talon_read_annot.tsv: No such file or directory
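
For anyone hitting the same integrity error, the mismatched counts can be checked directly against the SQLite database. A minimal sketch, assuming the table names reported in the error and that TALON keeps its running totals in a `counters` table (verify with `.schema` first):

```bash
# Compare one of the tables named in the error against TALON's internal
# running totals. The "counters" table name is an assumption here; check
# the schema of your own database before relying on it.
sqlite3 F6_interactome.db "SELECT COUNT(*) FROM genes;"
sqlite3 F6_interactome.db ".schema counters"
sqlite3 F6_interactome.db "SELECT * FROM counters;"
```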

@fairliereese
Member

Hey, my suggestion when dealing with this much data is to run TALON sequentially. I have had luck with running it on 100s of millions of reads if I run ~40 million reads at a time.
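
A minimal sketch of that sequential approach, running TALON once per replicate BAM (each well under ~45M reads) and appending every run to the same database. The config-file columns and `talon` flags shown are assumptions based on the standard TALON CLI; confirm with `talon --help`, and treat the paths as placeholders.

```bash
#!/usr/bin/env bash
# Annotate roughly 40M reads at a time by running TALON per replicate against
# one shared database, instead of a single 180M-read merged BAM.
set -euo pipefail

DB=F6_interactome.db
BUILD=hg38

for rep in iPSC_rep1 iPSC_rep2 NSC_rep1 NSC_rep2 Neuron_rep1 Neuron_rep2; do
    # TALON config line: dataset name, sample description, platform, BAM path
    echo "${rep},${rep},ONT,/tmp/talon_tmp/${rep}.bam" > config_${rep}.csv

    talon --f config_${rep}.csv \
          --db "${DB}" \
          --build "${BUILD}" \
          --threads 16 \
          --o F6_interactome_${rep}
done
```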

@callumparr
Author

Hi @fairliereese thanks for the reply!

I am trying to get all samples (contexts) in at once, so instead I now load the data in per chromosome to reduce the size of what TALON has to handle, basically running TALON 25 times to cover the major chromosome contigs. I hope this doesn't break the logic of how TALON works.
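
A sketch of that per-chromosome split, assuming the indexed merged BAM from the listing above; chromosome names and paths are placeholders.

```bash
# Extract each major contig from the indexed merged BAM so TALON can be run
# on one chromosome's reads at a time.
set -euo pipefail

for chr in chr{1..22} chrX chrY chrM; do
    samtools view -@ 8 -b /tmp/talon_tmp/merged.bam "${chr}" > merged.${chr}.bam
    samtools index merged.${chr}.bam
done
```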

@callumparr
Author

@fairliereese

To speed up the database generation I took two tacks, both involving splitting all samples' alignments by chromosome and running them either (a) sequentially into the same database, one chromosome contig at a time, or (b) in parallel, creating a database for each chromosome and adding a prefix to the TALON IDs. The latter is obviously faster for generating all the annotations, but it then means a lot of downstream work handling the different talon.db files. Given that each was initialized with the same hg38 build and GENCODE v39 annotation, is it possible to merge these into one database? The only overlap would be the initial GENCODE annotations from initializing a database for each chromosome.
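
A sketch of approach (b), initializing one database per chromosome from the same GENCODE v39 GTF with a chromosome-specific prefix for novel IDs. The `--idprefix` flag and file names are assumptions; confirm with `talon_initialize_database --help`.

```bash
# One database per chromosome, all seeded from the same annotation, each with
# its own prefix so novel gene/transcript IDs stay distinguishable later.
for chr in chr{1..22} chrX chrY chrM; do
    talon_initialize_database \
        --f gencode.v39.annotation.gtf \
        --g hg38 \
        --a gencode_v39 \
        --idprefix TALON_${chr} \
        --o F6_interactome_${chr} &
done
wait
```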

@fairliereese
Member

Actually, splitting by chromosome will not really help with speed, because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments, which often just end up being whole chromosomes. So by splitting the data up this way you won't really get a speed benefit.

Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution, transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't rely on exact matching across transcripts the way we can for things like splice sites. If you're interested in using Cerberus I can try to work with you on that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.

@callumparr
Author

> Actually splitting by chromosome will not really help with speeding up because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments which often just end up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.
>
> Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't really rely on exact matching across transcripts as we can for things like splice sites. If you're interested in using Cerberus I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.

Yes, I'm realizing that separating by chromosome and running sequentially doesn't make sense, as that is exactly where the parallelization comes from.

Running and outputting a .db for each chromosome is very quick, and at the moment we are thinking of creating filter whitelists and GTFs from them and then just merging the per-chromosome annotation files into one. As each annotation we merge comes from a separate chromosome, this shouldn't cause any headaches. Or am I missing something?
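
A sketch of that per-chromosome filter/GTF step, using the `talon_filter_transcripts` and `talon_create_GTF` tools from the TALON package; the exact flags and output filenames are assumptions to be checked against each tool's `--help`.

```bash
# Build a whitelist and a GTF from each per-chromosome database, then
# concatenate the GTFs into one annotation.
set -euo pipefail

for chr in chr{1..22} chrX chrY chrM; do
    talon_filter_transcripts \
        --db F6_interactome_${chr}.db \
        --annot gencode_v39 \
        --o whitelist_${chr}.csv

    talon_create_GTF \
        --db F6_interactome_${chr}.db \
        --annot gencode_v39 \
        --build hg38 \
        --whitelist whitelist_${chr}.csv \
        --o F6_interactome_${chr}
done

# Merge the per-chromosome GTFs (the <prefix>_talon.gtf naming is assumed)
cat F6_interactome_chr*_talon.gtf > F6_interactome_all_chr_talon.gtf
```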

@fairliereese
Member

fairliereese commented Jun 17, 2023 via email

@fairliereese
Member

Actually, now that I'm thinking about it, the only thing you'll need to be careful about is not merging abundances of transcripts from the separate chromosomes together, even if they have the same transcript ID.

@callumparr
Author

> Actually, now that I'm thinking about it the only thing you'll need to be careful for is not merging abundance of transcripts from the separate chromosomes together even if they have the same transcript ID.

I extracted an abundance file from each per-chromosome .db. Can I not simply rbind the resulting .tsv files so that it is effectively a chromosome-sorted abundance file? Counts for each isoform should only appear once, as each isoform is located on one chromosome only.

Sorry, I probably misunderstood your point.

Every time I run a new database from the same GENCODE annotation, TALON will assign the same index to these known annotations, right?
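
A sketch of that rbind as a shell concatenation, keeping the header from the first per-chromosome abundance file only; the filenames are placeholders.

```bash
# "rbind" the per-chromosome abundance matrices: one header, then all data rows.
files=(F6_interactome_chr*_talon_abundance.tsv)
{ head -n 1 "${files[0]}"; tail -q -n +2 "${files[@]}"; } > F6_interactome_all_chr_abundance.tsv
```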

@fairliereese
Member

Yes, but you will run the risk of having duplicated transcript IDs. For instance, novel transcript number 1 from chromosome 1 will not be the same as novel transcript number 1 from chromosome 2. This is perhaps an obvious point, and there are easy ways to make your novel transcript IDs unique, but I wanted to point it out nonetheless.
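
A quick sanity check for that, assuming the abundance matrices contain `annot_transcript_id` and `transcript_novelty` columns (column names are assumptions): list any novel transcript ID that shows up in more than one per-chromosome file, which should print nothing if novel IDs were prefixed per chromosome.

```bash
# Locate the relevant columns by name in each file's header, print the
# transcript ID of every non-Known row, and report IDs seen more than once
# across the per-chromosome files.
awk -F'\t' '
    FNR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "annot_transcript_id") id = i
            if ($i == "transcript_novelty")  nov = i
        }
        next
    }
    $nov != "Known" { print $id }
' F6_interactome_chr*_talon_abundance.tsv | sort | uniq -d | head
```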

@callumparr
Author

Ah, I see. Yes, I added a prefix for novel annotations when initializing each database.
