Tracks in Custom genome #1306

YeHW · 2023-03-28T09:00:27Z

Hi, igv team.

I have a human genome FASTA file (human 1kg_v37) and want to set up a custom genome.json so I can use it with IGV. I've checked the wiki and the b37_1kg.json (https://s3.amazonaws.com/igv.org.genomes/1kg_v37/b37_1kg.json) shipped with IGV, and I have some questions about the fields in the JSON file.

1. In b37_1kg.json, "cytobandURL": "https://s3.amazonaws.com/igv.org.genomes/1kg_v37/b37_cytoband.txt", what is the origin of this cytoband file? I found that the cytoband for GRCh38 is from UCSC. Is the same true for 1kg_v37?
2. In b37_1kg.json, how was "url": "https://s3.amazonaws.com/igv.org.genomes/hg19/ncbiRefSeq.sorted.txt.gz" in the Refseq Genes track sorted and tabix-indexed? I found only an unsorted ncbiRefSeq.txt.gz on UCSC's FTP site (https://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/).
3. In the above Refseq Genes track, there is only one transcript (NM_002944.2) for the ROS1 gene, but in the NCBI GDV (https://www.ncbi.nlm.nih.gov/genome/gdv/browser/genome/?id=GCF_000001405.25) there are three transcripts (NM_002944.3, NM_001378902.1, NM_001378891.1) for ROS1 (because it's using the latest Annotation Release 105.20220307). How can I build a new Refseq Genes track for IGV from the latest RefSeq Annotation Release (105.20220307 as of writing)?

Thank you!

maximilianh · 2023-03-28T14:23:15Z

On the origin of the cytoband file: I am not with IGV, but I bet the cytoband file also comes from UCSC.

On the ROS1 transcripts: UCSC also uses 105.20220307 on hg19, and there is only one location for NM_002944.3 (chr6 - 117608515 - 117747105). The GDV also has only one location for NM_002944.3. On hg38 and T2T, there is only one location in GDV as well. In the UCSC search for hg38 there are three locations, but they are on three different tracks (UCSC's mapping, NCBI's mapping, and Gencode).

jrobinso · 2023-03-28T14:47:08Z

Thanks @maximilianh, you are correct, the cytoband files all come from UCSC.

Tabix indexing is optional and not really necessary for tracks under 20 MB in total size. I would suggest skipping it, but if you need it, the documentation is here: https://www.htslib.org/doc/tabix.html.
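
If an index is wanted anyway, the sort-and-index step might look roughly like the sketch below. It assumes the UCSC ncbiRefSeq.txt.gz table layout with a leading bin column, so that chromosome, txStart, and txEnd are columns 3, 5, and 6. Two made-up rows stand in for the real download, the file names are placeholders, and bgzip/tabix are only run if they are installed. This is not necessarily how the hosted ncbiRefSeq.sorted.txt.gz was produced.

```shell
set -eu

# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Two synthetic rows standing in for ncbiRefSeq.txt.gz (illustrative values only).
# Columns used here: bin, name, chrom, strand, txStart, txEnd (real files have more).
printf '1\tNM_TOY2\tchr6\t-\t200\t300\n1\tNM_TOY1\tchr1\t+\t100\t150\n' \
  | gzip > ncbiRefSeq.txt.gz

# Sort by chromosome (column 3), then numerically by start position (column 5).
gzip -dc ncbiRefSeq.txt.gz \
  | LC_ALL=C sort -t "$(printf '\t')" -k3,3 -k5,5n > ncbiRefSeq.sorted.txt

# tabix requires BGZF compression, so plain gzip output cannot be indexed.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c ncbiRefSeq.sorted.txt > ncbiRefSeq.sorted.txt.gz
  # -s/-b/-e name the sequence, start, and end columns; -0 marks zero-based starts.
  tabix -s 3 -b 5 -e 6 -0 ncbiRefSeq.sorted.txt.gz
fi

# Show which transcript now comes first.
first_name=$(head -n 1 ncbiRefSeq.sorted.txt | cut -f 2)
echo "$first_name"
```

Sorting under LC_ALL=C keeps the chromosome order byte-wise and reproducible across machines; tabix only needs records for each sequence to be contiguous and ordered by start.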

I assume you have reviewed the documentation (https://github.com/igvteam/igv.js/wiki) that describes genomes.json and other files.

YeHW · 2023-03-30T09:59:46Z

Thanks @maximilianh and @jrobinso.

I just want to confirm one thing. I see three NM_ transcripts for the ROS1 gene in GDV (screenshot: https://user-images.githubusercontent.com/43214065/228794780-46125249-ca93-49c4-bae3-b2a0f8f46ac4.png).

I guess this is because there are three NM_ transcripts in RefSeq release 105.20220307. I took GCF_000001405.25_GRCh37.p13_genomic.gtf.gz from the RefSeq FTP site for that release and checked:

# Stream the GTF, decompress it from stdin, and list the distinct ROS1 NM_ transcripts.
curl -s 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gtf.gz' \
  | zcat \
  | rg '(gene_id "ROS1"; transcript_id "NM_.*?")' -or '$1' \
  | sort | uniq

# output
## gene_id "ROS1"; transcript_id "NM_001378891.1"
## gene_id "ROS1"; transcript_id "NM_001378902.1"
## gene_id "ROS1"; transcript_id "NM_002944.3"

If the above speculation is correct, I want to build a new IGV track based on RefSeq release 105.20220307. To achieve that, I need to build a file like ncbiRefSeq.sorted.txt.gz. Could you please help me with this?

Thank you!

maximilianh · 2023-03-30T10:33:19Z

"I spot three NM_ transcripts for ROS1 gene in GDV" - OK, now this is entirely different. You originally said that there were three locations for the transcript NM_002944.3. Yes, I think your analysis is correct. The updated ncbiRefSeq file for 105.20220307 is here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz. As Jim mentioned, you can sort it, but that's not required. You should be able to load the file as-is, if I understood him correctly (but he may be able to confirm that).

Jim: we're now updating these files automatically on a regular schedule. Maybe we can figure out a way to let you know when to update them? We can send an email or ping a URL when we update. We also keep the previous versions, so you could in theory tag them with the release name and offer a version history.

jrobinso · 2023-03-30T15:31:48Z

@maximilianh Thanks for the info. Actually I was considering just referencing those URLs directly (e.g. https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz). The only thing I do additionally is tabix index them, but really they aren't that large and the benefit is marginal.

jrobinso · 2023-03-30T15:37:13Z

@YeHW You do not need to build a file like ncbiRefSeq.sorted.txt.gz. See the user documentation; all common annotation formats are supported, including gff3.
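
As a rough illustration, a custom genome JSON with a GFF3 annotation track could be written as in the sketch below. The field names (id, name, fastaURL, indexURL, cytobandURL, tracks) follow the hosted b37_1kg.json as I understand it and should be checked against the documentation. The id, name, output file name, and every example.org URL are hypothetical placeholders; only the cytoband URL comes from this thread. One thing to verify separately: the sequence names in the annotation must match those in the FASTA (or be mapped through an alias file), and NCBI RefSeq annotation files use NC_ accessions rather than names like "6".

```shell
set -eu

# Write into a scratch directory; the output file name is a placeholder.
workdir=$(mktemp -d)
out="$workdir/custom_genome.json"

# Minimal genome description. All example.org URLs are placeholders, not real files.
cat > "$out" <<'EOF'
{
  "id": "my_1kg_v37",
  "name": "Human (1kg_v37, custom annotation)",
  "fastaURL": "https://example.org/genomes/human_g1k_v37.fasta",
  "indexURL": "https://example.org/genomes/human_g1k_v37.fasta.fai",
  "cytobandURL": "https://s3.amazonaws.com/igv.org.genomes/1kg_v37/b37_cytoband.txt",
  "tracks": [
    {
      "name": "RefSeq 105.20220307",
      "format": "gff3",
      "url": "https://example.org/annotations/refseq_105.20220307.gff3.gz"
    }
  ]
}
EOF

# Sanity check: count the track entries that declare the gff3 format.
gff3_tracks=$(grep -c '"format": "gff3"' "$out")
echo "gff3 tracks: $gff3_tracks"
```

Pointing the track at a GFF3 file directly avoids converting the RefSeq release into the UCSC table format first.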

maximilianh · 2023-03-30T15:41:53Z

Yes, that would be perfect. We have never, not once, changed these URLs so far.

jrobinso · 2023-03-30T18:22:35Z

This is done. @YeHW if you update your genome by selecting "Genomes > Select Hosted Genome" from the menu you will get updated annotation. The assembly you are asking about is in the updated menu as follows

YeHW · 2023-04-01T07:24:27Z

@maximilianh @jrobinso Thank you! I can see the updated annotation.

YeHW closed this as completed Apr 1, 2023

yonghaoy mentioned this issue Apr 17, 2023

Hosted file moved? igvteam/igv-notebook#19

Closed
