expand gget search to include synonym hits in addition to name and description hits #90

Closed
KleinSamuel opened this issue Jun 29, 2023 · 4 comments


@KleinSamuel

What happened?

Hi,
I am searching for the gene "WISP2" using gget with the following command:
gget search -s homo_sapiens "WISP2"
which returns the following result:
Thu Jun 29 14:53:00 2023 INFO Fetching results from database: homo_sapiens_core_109_38
Thu Jun 29 14:53:02 2023 INFO Total matches found: 1.
Thu Jun 29 14:53:02 2023 INFO Query time: 5.15 seconds.
[
    {
        "ensembl_id": "ENSG00000244558",
        "gene_name": "KCNK15-AS1",
        "ensembl_description": "KCNK15 and WISP2 antisense RNA 1 [Source:HGNC Symbol;Acc:HGNC:49901]",
        "ext_ref_description": "KCNK15 and WISP2 antisense RNA 1",
        "biotype": "lncRNA",
        "url": "https://useast.ensembl.org/homo_sapiens/Gene/Summary?g=ENSG00000244558"
    }
]

But I would expect the gene with ensembl ID "ENSG00000064205" which has the symbol "CCN5" but lists "WISP2" as a synonym.

Apparently gget matches the search term in the "description" field but one can argue that a match in the "Gene Synonyms" field should be weighted higher.
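
The weighting idea above can be illustrated with a minimal sketch. This is not how gget ranks results; the function names `match_rank` and `rank_hits` and the record layout are hypothetical, and the CCN5 description is left empty because it is not shown in this thread. The sketch assumes that an exact name match ranks first, an exact synonym match second, and a substring match in the description last.

```python
def match_rank(record, term):
    """Return 0 for a name match, 1 for a synonym match, 2 for a
    description match, or None if the term does not match at all."""
    t = term.lower()
    if record["gene_name"].lower() == t:
        return 0
    if t in (s.lower() for s in record.get("synonyms", [])):
        return 1
    if t in record.get("description", "").lower():
        return 2
    return None


def rank_hits(records, term):
    """Return the Ensembl IDs of matching records, best match first."""
    scored = [(match_rank(r, term), r) for r in records]
    hits = [(rank, r) for rank, r in scored if rank is not None]
    hits.sort(key=lambda pair: pair[0])  # stable sort keeps input order on ties
    return [r["ensembl_id"] for _, r in hits]


# Toy records: IDs, names and the KCNK15-AS1 description come from this issue.
records = [
    {
        "ensembl_id": "ENSG00000244558",
        "gene_name": "KCNK15-AS1",
        "synonyms": [],
        "description": "KCNK15 and WISP2 antisense RNA 1",
    },
    {
        "ensembl_id": "ENSG00000064205",
        "gene_name": "CCN5",
        "synonyms": ["WISP2"],
        "description": "",  # not shown in the thread, so left empty
    },
]

print(rank_hits(records, "WISP2"))
```

With this ordering, the CCN5 record (synonym match) is listed before KCNK15-AS1 (description match).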

gget version

0.27.7

Operating System (OS)

Linux

User interface

Command-line

Are you using a computer with an Apple M1 chip?

Not M1

What is the exact command that was run?

gget search -s homo_sapiens "WISP2"

Which output/error did you get?

Output for the gene "ENSG00000244558" with symbol "KCNK15-AS1" but I would expect the gene "ENSG00000064205" with symbol "CCN5".
The reason is that the gene symbol "WISP2" (search term) is a gene synonym for "CCN5".
@lauraluebbert changed the title from "gget search returns possibly wrong result" to "expand gget search to include synonym hits in addition to name and description hits" on Jun 30, 2023
@lauraluebbert
Member

Hi Samuel,

I agree that it would be great to expand gget search to also include results based on synonyms. Unfortunately, the Ensembl SQL database does not include a synonyms field (which is why I am getting the synonyms from UniProt for gget info). Their website search does not use the publicly accessible SQL database, so it is extremely difficult to reproduce all of the results that a website search returns (hence the disclaimer about searching names and descriptions only). I agree, though, that this is not optimal, and I will look for a workaround when I have some time.

@KleinSamuel
Author

I am not sure whether this approach is stable enough, but I figured out that we can look up the search term in the synonym column of the external_synonym table, which also contains an xref_id column.
This xref_id can then be matched against the display_xref_id column of the gene table.

See the example below for the aforementioned WISP2 gene. This results in the "correct" gene result, the CCN5 gene for which WISP2 is a synonym.

MySQL [homo_sapiens_core_109_38]> select * from external_synonym where synonym="WISP2";
+---------+---------+
| xref_id | synonym |
+---------+---------+
| 2781993 | WISP2   |
+---------+---------+
1 row in set (0.083 sec)
MySQL [homo_sapiens_core_109_38]> select gene_id,stable_id from gene where display_xref_id=2781993;
+---------+-----------------+
| gene_id | stable_id       |
+---------+-----------------+
|  106801 | ENSG00000064205 |
+---------+-----------------+
1 row in set (0.109 sec)
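
The two lookups above can also be written as a single join. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the Ensembl MySQL server; it recreates only the columns and rows shown in the queries above, and the function name `genes_by_synonym` is hypothetical, not part of gget.

```python
import sqlite3


def build_toy_db():
    """Create a tiny stand-in for the two Ensembl core tables used above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE external_synonym (xref_id INTEGER, synonym TEXT);
        CREATE TABLE gene (gene_id INTEGER, stable_id TEXT, display_xref_id INTEGER);
        INSERT INTO external_synonym VALUES (2781993, 'WISP2');
        INSERT INTO gene VALUES (106801, 'ENSG00000064205', 2781993);
        """
    )
    return conn


def genes_by_synonym(conn, synonym):
    """Resolve a synonym to genes via external_synonym.xref_id = gene.display_xref_id."""
    return conn.execute(
        """
        SELECT g.gene_id, g.stable_id
        FROM external_synonym AS s
        JOIN gene AS g ON g.display_xref_id = s.xref_id
        WHERE s.synonym = ?
        """,
        (synonym,),
    ).fetchall()


conn = build_toy_db()
print(genes_by_synonym(conn, "WISP2"))  # [(106801, 'ENSG00000064205')]
```

The same SELECT statement should work against MySQL apart from the placeholder style, which depends on the database driver.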

I hope this helps.

@KleinSamuel
Author

I played around with this approach a bit and now I am unsure if it should be used.

Searching the external_synonym table for CCR9 (the gene ENSG00000173585) returns the xref_id 2766033.
But this xref_id links to the gene ACKR2 (ENSG00000144648), which is not CCR9 itself but a different gene that lists CCR9 as one of its synonyms...

MySQL [homo_sapiens_core_109_38]> select * from external_synonym where synonym="CCR9";
+---------+---------+
| xref_id | synonym |
+---------+---------+
| 2766033 | CCR9    |
+---------+---------+
1 row in set (0.029 sec)
MySQL [homo_sapiens_core_109_38]> select gene_id,stable_id from gene where display_xref_id=2766033;
+---------+-----------------+
| gene_id | stable_id       |
+---------+-----------------+
|  124089 | ENSG00000144648 |
+---------+-----------------+
1 row in set (0.058 sec)
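
One way to keep such hits interpretable is to combine the name lookup and the synonym lookup and to label each result with the field that matched. The sketch below is an assumption-laden illustration, not gget's implementation: it uses an in-memory SQLite stand-in, it assumes that gene names are stored in the display_label column of the xref table, and the gene_id and xref_id used for CCR9 are placeholders because the real values are not shown in this thread. The function name `search_name_and_synonym` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE xref (xref_id INTEGER, display_label TEXT);
    CREATE TABLE external_synonym (xref_id INTEGER, synonym TEXT);
    CREATE TABLE gene (gene_id INTEGER, stable_id TEXT, display_xref_id INTEGER);
    -- Rows taken from the queries in this thread.
    INSERT INTO xref VALUES (2766033, 'ACKR2');
    INSERT INTO external_synonym VALUES (2766033, 'CCR9');
    INSERT INTO gene VALUES (124089, 'ENSG00000144648', 2766033);
    -- Placeholder ids (1 and 1): the real CCR9 values are not shown in the thread.
    INSERT INTO xref VALUES (1, 'CCR9');
    INSERT INTO gene VALUES (1, 'ENSG00000173585', 1);
    """
)


def search_name_and_synonym(conn, term):
    """Return (stable_id, gene_name, matched_on) rows, name matches first."""
    query = """
        SELECT g.stable_id, x.display_label, 'name' AS matched_on
        FROM gene AS g
        JOIN xref AS x ON x.xref_id = g.display_xref_id
        WHERE x.display_label = ?
        UNION
        SELECT g.stable_id, x.display_label, 'synonym' AS matched_on
        FROM gene AS g
        JOIN xref AS x ON x.xref_id = g.display_xref_id
        JOIN external_synonym AS s ON s.xref_id = g.display_xref_id
        WHERE s.synonym = ?
        ORDER BY matched_on, stable_id
    """
    return conn.execute(query, (term, term)).fetchall()


for row in search_name_and_synonym(conn, "CCR9"):
    print(row)
```

In this toy data, a search for CCR9 returns ENSG00000173585 as a name match and ENSG00000144648 (ACKR2) as a synonym match, so the caller can see why each gene was returned.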

Another possible approach could be to search in the gene_attrib table as follows (for the WISP2 gene):
(This allows for the retrieval of the gene_id directly instead of the xref_id)

MySQL [homo_sapiens_core_109_38]> select * from gene_attrib where value="WISP2";
+---------+----------------+-------+
| gene_id | attrib_type_id | value |
+---------+----------------+-------+
|  106801 |              4 | WISP2 |
+---------+----------------+-------+
1 row in set (0.075 sec)
MySQL [homo_sapiens_core_109_38]> select gene_id,stable_id from gene where gene_id=106801;
+---------+-----------------+
| gene_id | stable_id       |
+---------+-----------------+
|  106801 | ENSG00000064205 |
+---------+-----------------+
1 row in set (0.028 sec)
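
The gene_attrib lookup can likewise be expressed as one join. This is a minimal sketch against an in-memory SQLite stand-in that holds only the rows shown above; the function name `genes_by_attrib_value` is hypothetical. The thread does not say what attrib_type_id 4 stands for, so the optional filter on that column is an assumption: restricting the search to a known attribute type may help avoid matching unrelated attribute values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_attrib (gene_id INTEGER, attrib_type_id INTEGER, value TEXT);
    CREATE TABLE gene (gene_id INTEGER, stable_id TEXT);
    INSERT INTO gene_attrib VALUES (106801, 4, 'WISP2');
    INSERT INTO gene VALUES (106801, 'ENSG00000064205');
    """
)


def genes_by_attrib_value(conn, value, attrib_type_id=None):
    """Find genes whose gene_attrib.value equals `value`.

    If attrib_type_id is given, only attributes of that type are considered.
    """
    sql = """
        SELECT DISTINCT g.gene_id, g.stable_id
        FROM gene_attrib AS a
        JOIN gene AS g ON g.gene_id = a.gene_id
        WHERE a.value = ?
    """
    params = [value]
    if attrib_type_id is not None:
        sql += " AND a.attrib_type_id = ?"
        params.append(attrib_type_id)
    return conn.execute(sql, params).fetchall()


print(genes_by_attrib_value(conn, "WISP2"))     # [(106801, 'ENSG00000064205')]
print(genes_by_attrib_value(conn, "WISP2", 4))  # [(106801, 'ENSG00000064205')]
```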

@anhchi172
Collaborator

Hi Samuel,

Thank you for your suggestion. The new release of gget, v0.27.9, now also searches Ensembl synonyms (in addition to gene names and descriptions) to return more comprehensive search results. If gget is already installed, you can upgrade to the new version with the command

pip install --upgrade gget

Let us know if there are any issues.
