How to download all the chemical compounds and their related data for an organism from LOTUS? #27

Closed
ap1438 opened this issue Jun 10, 2022 · 8 comments


ap1438 commented Jun 10, 2022

I have an organism and I want to download all the chemical compounds related to that organism, together with their SMILES and the species that produce them.

I searched on the web page and found all the entries for chemical compounds related to that organism. I then downloaded the SDF file, which was the only download option available, and later converted it to Excel format.

But I realized that the file was missing the compound names.

What I want is: compound name, SMILES, and the species in which the compound is present.

Is it possible to get this from the LOTUS database by any means?

Contributor

Adafede commented Jun 12, 2022

Hi!

Thank you for your issue.

Actually, we support a lot of custom searches (see https://lotus.naturalproducts.net/documentation), but not the specific one you requested.

We might provide a SPARQL endpoint in the future to handle such requests, but in the meantime, querying Wikidata directly seems a good option.

I prepared a query that you can easily adapt to your needs: https://w.wiki/5GSw. You can download the results directly as a tabular file there.

Another option is to search directly at https://pubchem.ncbi.nlm.nih.gov/classification/#hid=115; PubChem also offers CSV download.

More generally, the compound names are automatically generated, so we advise being very cautious with them.

Best

Author

ap1438 commented Jun 13, 2022

Thank you for your quick response and valuable suggestion.
Looking at the query and the downloaded data, I see that the molecular formula field is missing.
So I tried to modify the query to download the molecular formula as well,
but I don't know why it reports that the query time limit was reached.
This is the query I tried:

https://w.wiki/5GgJ

Can you check it and tell me where I went wrong?

Contributor

Adafede commented Jun 13, 2022

You were almost there!

I think the query you want is: https://w.wiki/5Ggd

Yours was querying the whole of Wikidata for molecules again.
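
For readers who cannot open the shortened links, the pattern behind this fix can be sketched as follows. This is not the linked query itself; it is a minimal sketch assuming the query shape quoted later in this thread (taxon wd:Q21754) and assuming the molecular formula is read from Wikidata property P274 (chemical formula). The variable name ?structure_formula is illustrative. The point is that the formula is attached to the same ?structure variable that is already restricted to the taxon; a pattern on a new, unconstrained variable would presumably be matched against every molecule in Wikidata and could hit the time limit.

```sparql
SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_formula ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # Example taxon from this thread; replace with your own
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
            wdt:P225 ?organism_name.              # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             wdt:P274 ?structure_formula;         # Assumption: formula via P274, on the SAME ?structure
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000
```

Because wdt:P274 is a required pattern here, structures without a formula on Wikidata are left out of the results.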

Author

ap1438 commented Jun 13, 2022

Thanks for the correction and insights.

ap1438 closed this as completed Jun 13, 2022
Author

ap1438 commented Jun 16, 2022

A search for "Gentiana" on the LOTUS web page returned 483 natural products,
but the Wikidata query returns 768.
Why is there such a large difference?

Can you please let me know the reason behind the difference?

ap1438 reopened this Jun 16, 2022
Contributor

Adafede commented Jun 16, 2022

Hi,

Not exactly: the query I wrote for you returns structure-organism pairs, so the same structure can appear multiple times. If you want to reduce the results to distinct structures, use this query: https://w.wiki/5J73.

Hope this clarifies
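
The linked query is not reproduced here. As a rough sketch of the same idea, assuming the query shape quoted later in this thread, dropping the organism columns from the SELECT clause lets DISTINCT collapse the structure-organism pairs into one row per structure (a structure with several SMILES values on Wikidata would still appear once per value):

```sparql
SELECT DISTINCT ?structure ?structureLabel ?structure_smiles WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # Example taxon from this thread; replace with your own
  }
  ?organism (wdt:P171*) ?taxon.                   # Include children taxa
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000
```

The organism is still used to filter the structures; it is simply no longer part of each result row.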

Author

ap1438 commented Jun 16, 2022

Thank you

ap1438 closed this as completed Jun 16, 2022

alrichardbollans commented Jul 4, 2023

I'm trying to do something similar. Following your examples, when I run:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I get 20968 results. However, when I try to include the CAS number and InChIKey with the following:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism;          # Found in given taxon/taxa
             wdt:P231 ?structureCAS;              # Get the CAS number
             wdt:P235 ?structureINCHIKEY.         # Get the InChIKey

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I only get 7967 results. I imagine this is because the latter query doesn't return structures that lack a CAS number or InChIKey. Is it possible to return all metabolites found in the taxa and leave missing property values as NaN?
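
One way to do this in SPARQL, offered as an untested sketch rather than a confirmed answer from the maintainers, is to move the CAS and InChIKey patterns into OPTIONAL blocks. A required triple pattern drops every structure that lacks the property, whereas OPTIONAL keeps the row and leaves the variable unbound. The query below is the second query above with only that change:

```sparql
SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # Same taxon as in the queries above
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
            wdt:P225 ?organism_name.              # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES (still required)
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  OPTIONAL { ?structure wdt:P231 ?structureCAS. }       # CAS number, left unbound if missing
  OPTIONAL { ?structure wdt:P235 ?structureINCHIKEY. }  # InChIKey, left unbound if missing

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000
```

Each property sits in its own OPTIONAL block so that a missing CAS number does not also suppress an existing InChIKey, and vice versa. Unbound values come out as empty cells in the CSV download, which tools such as pandas typically read as NaN.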
