A small number of rows where gene name, symbol and feature type have "no value" #47

Closed
ValWood opened this issue May 18, 2022 · 15 comments

Comments

@ValWood
Collaborator

ValWood commented May 18, 2022

Screenshot 2022-05-18 at 09 09 57

@ValWood

This comment was marked as outdated.

@ValWood
Collaborator Author

ValWood commented May 18, 2022

@kimrutherford
@manulera

@kimrutherford
Collaborator

I get the same list with this query:

<query model="genomic" view="Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol Gene.name Gene.length Gene.organism.shortName" constraintLogic="(A and B)" sortOrder="">
   <constraint path="Gene.length" op="IS NULL" code="B"/>
   <constraint path="Gene.organism.shortName" value="S. pombe" op="=" code="A"/>
</query>
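For anyone who wants to rerun this check for another organism, here is a minimal sketch that rebuilds the same query XML with Python's standard library. The function name `build_null_length_query` is made up for illustration; the paths, operators, and constraint codes are copied from the query above. It only produces the XML string and does not contact PombeMine.

```python
import xml.etree.ElementTree as ET


def build_null_length_query(organism_short_name: str) -> str:
    """Return query XML selecting genes of one organism that have no length.

    Hypothetical helper: it mirrors the query posted above and changes
    only the organism short name.
    """
    views = [
        "Gene.primaryIdentifier",
        "Gene.secondaryIdentifier",
        "Gene.symbol",
        "Gene.name",
        "Gene.length",
        "Gene.organism.shortName",
    ]
    query = ET.Element(
        "query",
        {
            "model": "genomic",
            "view": " ".join(views),
            "constraintLogic": "(A and B)",
            "sortOrder": "",
        },
    )
    # Constraint B: the gene has no length value.
    ET.SubElement(
        query, "constraint", {"path": "Gene.length", "op": "IS NULL", "code": "B"}
    )
    # Constraint A: restrict to the requested organism.
    ET.SubElement(
        query,
        "constraint",
        {
            "path": "Gene.organism.shortName",
            "value": organism_short_name,
            "op": "=",
            "code": "A",
        },
    )
    return ET.tostring(query, encoding="unicode")


xml_text = build_null_length_query("S. pombe")
print(xml_text)
```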

PombeMine-zero-length-genes-1

@kimrutherford
Collaborator

I'm investigating (pombase/pombase-chado#967) why we export the transcript ID "SPAC1556.06.1" as an exact synonym for "SPAC1556.06" in the JSON file for PombeMine. But it might just be a coincidence that it's in this list.

  • SPAC1F12.03c and SPAC4H3.12c aren't current PomBase identifiers. Those genes were removed sometime in the past. (Details: https://www.pombase.org/status/new-and-removed-genes)

  • SPBC28F2.11 is a current PomBase gene. There are two genes with that DB identifier in PombeMine. I'm not sure why they haven't merged.

  • SPBC8E4.02c is a synonym of SPNCRNA.9001 in PomBase because two genes were merged in the past. In PombeMine there is a gene object for SPBC8E4.02c and one for SPNCRNA.9001.

    • Ensembl Genomes has SPBC8E4.02c but not SPNCRNA.9001
  • SPCC548.03c.1 and SPCC548.03c.2 are transcript IDs.
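A rough way to flag transcript IDs that end up loaded as genes is sketched below. It assumes, based only on the identifiers in this thread, that a transcript ID is a systematic gene ID with an extra numeric `.N` suffix, so it has three dot-separated parts. The function name is hypothetical and the rule is a heuristic, not PomBase's definition.

```python
def looks_like_transcript_id(identifier: str) -> bool:
    """Heuristic: three dot-separated parts, the last one numeric.

    Assumption drawn from the examples in this thread only:
    "SPAC1556.06" is a gene ID, "SPAC1556.06.1" is a transcript ID.
    """
    parts = identifier.split(".")
    return len(parts) == 3 and parts[2].isdigit()


# Identifiers mentioned in this thread.
identifiers = [
    "SPAC1556.06.1",
    "SPAC1F12.03c",
    "SPAC4H3.12c",
    "SPBC28F2.11",
    "SPBC8E4.02c",
    "SPNCRNA.9001",
    "SPCC548.03c.1",
    "SPCC548.03c.2",
]
transcript_like = [i for i in identifiers if looks_like_transcript_id(i)]
print(transcript_like)
```

Counting the dot-separated parts, rather than just testing for a trailing number, keeps gene IDs such as SPNCRNA.9001 from being flagged.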

@ValWood

This comment was marked as outdated.

@kimrutherford
Collaborator

I thought we only load genes from PomBase?

They will be loaded from any source that has gene data.

I should have done this earlier. Here is the result of querying PombeMine for the gene identifier and the DataSet that the identifier came from:

| identifier | DataSet |
| --- | --- |
| Q9H9V9 | GO Annotation data set |
| SPAC1556.06.1 | BioGRID interaction data set |
| SPAC1F12.03c | BioGRID interaction data set |
| SPAC4H3.12c | BioGRID interaction data set |
| SPBC28F2.11 | cerevisiae-orthologs data set |
| SPBC8E4.02c | BioGRID interaction data set |
| SPCC548.03c.1 | GO Annotation data set |
| SPCC548.03c.2 | GO Annotation data set |
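To see which upstream source needs contacting for which identifiers, the rows can be grouped by data set. This is a small sketch using the eight rows from the result above; the names `ROWS` and `group_by_data_set` are illustrative only.

```python
from collections import defaultdict

# (identifier, data set) pairs copied from the query result above.
ROWS = [
    ("Q9H9V9", "GO Annotation data set"),
    ("SPAC1556.06.1", "BioGRID interaction data set"),
    ("SPAC1F12.03c", "BioGRID interaction data set"),
    ("SPAC4H3.12c", "BioGRID interaction data set"),
    ("SPBC28F2.11", "cerevisiae-orthologs data set"),
    ("SPBC8E4.02c", "BioGRID interaction data set"),
    ("SPCC548.03c.1", "GO Annotation data set"),
    ("SPCC548.03c.2", "GO Annotation data set"),
]


def group_by_data_set(rows):
    """Map each data set name to the identifiers it contributed, in input order."""
    grouped = defaultdict(list)
    for identifier, data_set in rows:
        grouped[data_set].append(identifier)
    return dict(grouped)


by_source = group_by_data_set(ROWS)
for data_set, ids in by_source.items():
    print(f"{data_set}: {', '.join(ids)}")
```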

@ValWood
Collaborator Author

ValWood commented May 23, 2022

@ValWood
Collaborator Author

ValWood commented May 23, 2022

  • Contact BioGRID about:

    • SPBC8E4.02c is now a synonym of SPNCRNA.9001 (there is no longer a protein-coding ORF for this ID)

    • SPAC1F12.03c: removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature

    • SPAC4H3.12c: not protein-coding (in the upstream region of snr62). No corresponding gene feature (but might be part of the snr62 transcript)

    • SPAC1556.06.1 is a transcript ID for an alternative transcript of SPAC1556.06

Also asked @kimrutherford not to load into PomBase
#51

@ValWood
Collaborator Author

ValWood commented May 23, 2022

  • SPBC28F2.11 | cerevisiae-orthologs data set

I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID?

See query
#50

@ValWood
Collaborator Author

ValWood commented May 23, 2022

  • When I search UniProt for these isoforms I only get one entry:
    Q9P3V0

Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the column "gene product form ID" (column 17).)

Added to #51
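A quick way to check column 17 once the GAF is available is sketched below. It assumes the standard 17-column, tab-separated GAF layout with `!` comment lines, where column 2 is the DB object ID and column 17 the gene product form ID. The function name, the accession `ACC1`, and the sample rows are invented for illustration and are not taken from the real file.

```python
def gene_product_forms(gaf_lines, db_object_id):
    """Collect the distinct column-17 values for one DB object ID (column 2)."""
    forms = set()
    for line in gaf_lines:
        # Skip GAF header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 17:
            continue  # malformed or truncated row
        if cols[1] == db_object_id and cols[16]:
            forms.add(cols[16])
    return sorted(forms)


def make_row(db_object_id, form_id):
    """Build a made-up 17-column row; only columns 2 and 17 are meaningful."""
    cols = ["UniProtKB", db_object_id] + ["-"] * 14 + [form_id]
    return "\t".join(cols) + "\n"


# Invented sample data, not taken from the GOA file.
sample = [
    "!gaf-version: 2.2\n",
    make_row("ACC1", "UniProtKB:ACC1-1"),
    make_row("ACC1", "UniProtKB:ACC1-2"),
    make_row("ACC1", ""),
    make_row("ACC2", "UniProtKB:ACC2-1"),
]
forms = gene_product_forms(sample, "ACC1")
print(forms)
```

For a gzipped GAF, a file object from `gzip.open(path, "rt")` can be passed as `gaf_lines`, since the function only iterates over lines.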

@kimrutherford
Collaborator

kimrutherford commented May 23, 2022

> SPBC28F2.11 | cerevisiae-orthologs data set
> I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID?

Yep, I think that's one for InterMine to investigate.

> Can you send the GOA GAF so that I can investigate further? (the alternative forms would be in the column "gene product form ID" (column 17))

Here's the pombe and japonicus lines from the GOA GAF we load:
https://curation.pombase.org/kmr44/gene_association.goa_uniprot.pombe+japonicus-2022-04-01.tsv.gz

That's what PomBase uses, but PombeMine might be reading the XML file.

@ValWood

This comment was marked as outdated.

@ValWood
Collaborator Author

ValWood commented May 23, 2022

| identifier | DataSet |
| --- | --- |
| SPCC548.03c.1 | GO Annotation data set |
| SPCC548.03c.2 | GO Annotation data set |

#51

@ValWood
Collaborator Author

ValWood commented May 23, 2022

Sorry @danielabutano! I thought this ticket was on our tracker whilst we tracked down the sources. I can close this issue now; more informative tickets have been opened for the individual issues requiring action.

@ValWood
Collaborator Author

ValWood commented May 23, 2022

BioGRID have mailed back. They have fixed the 4 issues at their end, so these will disappear soon.
