# LOTUS - dataset overview

The [LOTUS initiative](https://lotus.nprod.net/) is a database of natural products linked to the organisms in which they are found. Before expanding the database, we want an overview of the latest LOTUS dataset.

[latest LOTUS dataset (v10)](https://zenodo.org/records/7534071)

More information is available in the [paper (DOI: 10.7554/eLife.70780)](https://elifesciences.org/articles/70780).


## Load dataset

In [2]:
# Example of loading LOTUS datasets with polars (python module)
import polars as pl
import numpy as np


df_lotus = pl.read_csv(
    "../data/230106_frozen_metadata.csv.gz",
    infer_schema_length=50000,
    null_values=["", "NA"],
    # force numeric types for columns that would otherwise be inferred differently
    schema_overrides={
        "structure_xlogp": pl.Float32,
        "structure_cid": pl.UInt32,
        "organism_taxonomy_ncbiid": pl.UInt32,
        "organism_taxonomy_ottid": pl.UInt32,
        "structure_stereocenters_total": pl.UInt32,
        "structure_stereocenters_unspecified": pl.UInt32,
    },
)

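# entries starting with "c(" hold no single id: set them to null, then cast to integer
df_lotus = df_lotus.with_columns(
    pl.when(pl.col("organism_taxonomy_gbifid").str.starts_with("c("))
    .then(None)
    .otherwise(pl.col("organism_taxonomy_gbifid"))
    .cast(pl.UInt32)
    .alias("organism_taxonomy_gbifid")
)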

# change the display settings to see the full tables
pl.Config.set_tbl_rows(100)
pl.Config.set_fmt_str_lengths(550)


print(f"all columns of LOTUS (total: {df_lotus.shape[1]}): \n{df_lotus.columns}")

all columns of LOTUS (total: 39): 
['structure_wikidata', 'structure_inchikey', 'structure_inchi', 'structure_smiles', 'structure_molecular_formula', 'structure_exact_mass', 'structure_xlogp', 'structure_smiles_2D', 'structure_cid', 'structure_nameIupac', 'structure_nameTraditional', 'structure_stereocenters_total', 'structure_stereocenters_unspecified', 'structure_taxonomy_npclassifier_01pathway', 'structure_taxonomy_npclassifier_02superclass', 'structure_taxonomy_npclassifier_03class', 'structure_taxonomy_classyfire_chemontid', 'structure_taxonomy_classyfire_01kingdom', 'structure_taxonomy_classyfire_02superclass', 'structure_taxonomy_classyfire_03class', 'structure_taxonomy_classyfire_04directparent', 'organism_wikidata', 'organism_name', 'organism_taxonomy_gbifid', 'organism_taxonomy_ncbiid', 'organism_taxonomy_ottid', 'organism_taxonomy_01domain', 'organism_taxonomy_02kingdom', 'organism_taxonomy_03phylum', 'organism_taxonomy_04class', 'organism_taxonomy_05order', 'organism_taxono



In [7]:
# example of a "Limonene" entry

df = df_lotus.filter((pl.col("structure_nameTraditional") == "Limonene") & (pl.col("organism_taxonomy_09species") == "Cannabis sativa"))[0, :]

df.write_csv("Limonene.csv", separator=",")

The 39 columns fall into four groups: **natural products (NP)**, **chemical structure**, **organism** and **source**.

In [2]:
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

for columnname in df_lotus.columns:
    # count the appearance of each "word" in the given column
    df_lotus_plot = df_lotus.select(pl.col(columnname).value_counts(sort=True, name="n")).unnest(columnname)

    # print the 5 most common names
    print(columnname, ":", df_lotus_plot[0:5, :])

structure_wikidata : shape: (5, 2)
┌─────────────────────────────────┬──────┐
│ structure_wikidata              ┆ n    │
│ ---                             ┆ ---  │
│ str                             ┆ u32  │
╞═════════════════════════════════╪══════╡
│ http://www.wikidata.org/entity… ┆ 4156 │
│ http://www.wikidata.org/entity… ┆ 3142 │
│ http://www.wikidata.org/entity… ┆ 3006 │
│ http://www.wikidata.org/entity… ┆ 2390 │
│ http://www.wikidata.org/entity… ┆ 2294 │
└─────────────────────────────────┴──────┘
structure_inchikey : shape: (5, 2)
┌─────────────────────────────┬──────┐
│ structure_inchikey          ┆ n    │
│ ---                         ┆ ---  │
│ str                         ┆ u32  │
╞═════════════════════════════╪══════╡
│ IPCSVZSSVZVIGE-UHFFFAOYSA-N ┆ 4156 │
│ KZJWDPNRJALLNS-VJSFXXLFSA-N ┆ 3142 │
│ HCXVJBMSMIARIN-PHZDYDNGSA-N ┆ 3006 │
│ REFJWTPEDVJJIY-UHFFFAOYSA-N ┆ 2390 │
│ XMGQYMWWDOXHJM-UHFFFAOYSA-N ┆ 2294 │
└─────────────────────────────┴──────┘
structure_inchi : shape: (5,

Pickaxe (MINEs) requires a SMILES string and an ID for every compound.
Here, the columns **structure_smiles** (the SMILES with stereochemistry, as opposed to **structure_smiles_2D**) and **structure_wikidata** are used for this purpose.

In [3]:
# write in bold the specific title
print("\033[1m\nuniqueness over categories:\033[0m")

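# number of unique rows over the "chemical structure" columns
# note: the slice 0:20 stops at index 19, so the column at index 20
# ("structure_taxonomy_classyfire_04directparent") is not included
unique_counts = df_lotus.select(
    df_lotus.columns[0:20]
    ).n_unique()
print("all structure columns:", str(unique_counts))

# number of unique rows over the "organism" columns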
unique_counts = df_lotus.select(
    df_lotus.columns[21:35]
    ).n_unique()
print("all organism columns:", str(unique_counts))

# unique counts for the full dataset
unique_counts = df_lotus.select(
    df_lotus.columns[:]
    ).n_unique()
print("all columns:", str(unique_counts))

# unique counts for the pickaxe/MINEs input
print("\033[1m\nImportant Info for pickaxe (MINEs):\033[0m")
print(f'rows for LOTUS dataset: {df_lotus.shape[0]}')
print(f'\nunique "structure_smiles": {df_lotus.select(["structure_smiles"]).n_unique()}')
print(f'unique "structure_wikidata": {df_lotus.select(["structure_wikidata"]).n_unique()}')
print(f'unique "structure_smiles" and "structure_wikidata": {df_lotus.select(["structure_smiles", "structure_wikidata"]).n_unique()}')
print(f'\nunique "structure_smiles": {df_lotus.select(["structure_smiles"]).n_unique()}')
print(f'unique "structure_inchi": {df_lotus.select(["structure_inchi"]).n_unique()}')
print(f'unique "structure_smiles" and "structure_inchi": {df_lotus.select(["structure_smiles", "structure_inchi"]).n_unique()}')

uniqueness over categories:
all structure columns: 257226
all organism columns: 36803
all columns: 792364

Important Info for pickaxe (MINEs):
rows for LOTUS dataset: 792364

unique "structure_smiles": 220820
unique "structure_wikidata": 220783
unique "structure_smiles" and "structure_wikidata": 220834

unique "structure_smiles": 220820
unique "structure_inchi": 220823
unique "structure_smiles" and "structure_inchi": 220823


Restricting the full dataset (792'364 rows) to the unique pairs of "structure_smiles" and "structure_wikidata" needed for the pickaxe input file leaves 220'834 rows, which is **27.9%** of the full dataset.

This is good news: the pickaxe input is drastically smaller than the full dataset.
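
The percentage can be re-derived from the two row counts printed above. This is only a small arithmetic check; the variable names are chosen here for illustration.

```python
# Row counts taken from the output above.
rows_full = 792_364
rows_pickaxe = 220_834

share = rows_pickaxe / rows_full
print(f"pickaxe input: {share:.1%} of the full dataset")  # 27.9%
print(f"rows removed:  {rows_full - rows_pickaxe}")       # 571530
```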

It is a little unusual, however, that there are more unique SMILES (220'820) than unique wikidata links (220'783).
One explanation would be that not all SMILES have a wikidata link yet, but this cannot be the case, because LOTUS is 'complete' and has no empty fields.

The InChIs are also slightly surprising: there are 3 more unique InChIs (220'823) than unique SMILES (220'820), so some SMILES must map to more than one InChI.

In [4]:
# make a dataframe with only the two columns: "structure_smiles" and "structure_wikidata"
df_lotus_for_pickaxe_with_wikidata = df_lotus.select(["structure_smiles", "structure_wikidata"]).unique()

# search for the duplicates and print them
df_lotus_for_pickaxe_with_wikidata_duplicates = df_lotus_for_pickaxe_with_wikidata.filter(df_lotus_for_pickaxe_with_wikidata.select(['structure_smiles']).is_duplicated()).sort('structure_smiles')
print(f'{df_lotus_for_pickaxe_with_wikidata_duplicates[0:2, :]}')


# make a dataframe with only the two columns: "structure_smiles" and "structure_inchi"
df_lotus_for_pickaxe_with_inchi = df_lotus.select(["structure_smiles", "structure_inchi"]).unique()

# search for the duplicates and print them
df_lotus_for_pickaxe_with_inchi_duplicates = df_lotus_for_pickaxe_with_inchi.filter(df_lotus_for_pickaxe_with_inchi.select(['structure_smiles']).is_duplicated()).sort('structure_smiles')
print(f'{df_lotus_for_pickaxe_with_inchi_duplicates[0:2, :]}')


shape: (2, 2)
┌───────────────────┬──────────────────────────────────────────┐
│ structure_smiles  ┆ structure_wikidata                       │
│ ---               ┆ ---                                      │
│ str               ┆ str                                      │
╞═══════════════════╪══════════════════════════════════════════╡
│ C1CCC2(CCCCO2)OC1 ┆ http://www.wikidata.org/entity/Q55620521 │
│ C1CCC2(CCCCO2)OC1 ┆ http://www.wikidata.org/entity/Q804105   │
└───────────────────┴──────────────────────────────────────────┘
shape: (2, 2)
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────┐
│ structure_smiles                                ┆ structure_inchi                                │
│ ---                                             ┆ ---                                            │
│ str                                             ┆ str                                            │
╞═══════════════════════════════════════════════

The real reason is that some SMILES have two wikidata entries, which is confusing. Following the links shows that both entries describe the "same" chemical compound.
The numbers are consistent with this: the 28 duplicated rows correspond to 14 SMILES with two wikidata entries each, and the 220820 unique SMILES plus these 14 additional entries give the 220834 unique pairs.

The same problem appears with the InChIs: here, several InChIs share the same SMILES. This is possible because an InChI can describe the molecule more precisely than the SMILES.

## Why do we have fewer wikidata links than SMILES? Is this a loss of information?
As shown below, some wikidata links point to multiple SMILES. This can happen because a wikidata entry sometimes represents a group of chemical compounds.

In [5]:
# make a dataframe with only the two columns: "structure_smiles" and "structure_wikidata"
df_lotus_for_pickaxe_with_wikidata = df_lotus.select(["structure_smiles", "structure_wikidata"]).unique()

# search for the duplicates and print them
df_lotus_for_pickaxe_with_wikidata_duplicates = df_lotus_for_pickaxe_with_wikidata.filter(df_lotus_for_pickaxe_with_wikidata.select(['structure_wikidata']).is_duplicated()).sort('structure_wikidata')
print(f'{df_lotus_for_pickaxe_with_wikidata_duplicates[0:2, :]}')

shape: (2, 2)
┌──────────────────────────────────────────────────────┬───────────────────────────────────────────┐
│ structure_smiles                                     ┆ structure_wikidata                        │
│ ---                                                  ┆ ---                                       │
│ str                                                  ┆ str                                       │
╞══════════════════════════════════════════════════════╪═══════════════════════════════════════════╡
│ C/C=C1/CN2CC[C@@]34c5ccccc5N5[C@H](C(=O)OC)[C@H]1C[C ┆ http://www.wikidata.org/entity/Q105144092 │
│ @H]2[C@]53Oc1c4cc2c(c1O)C(=O)O[C@]13[C@@H]4C[C@H]5/C ┆                                           │
│ (=C\C)CN4CC[C@@]21c1ccccc1N3[C@@H]5C(=O)OC           ┆                                           │
│ C/C=C1/CN2CC[C@]34c5ccccc5N5[C@H](C(=O)OC)[C@H]1C[C@ ┆ http://www.wikidata.org/entity/Q105144092 │
│ H]2[C@@]53Oc1c4cc2c(c1O)C(=O)O[C@@]13[C@@H]4C[C@H]5/ ┆                     