# Quickstart

Using species, genes, proteins, and cell types as examples, you'll learn to:

- look up identifiers of an entity based on underlying ontologies
- convert between identifiers
- standardize columns in a `DataFrame` against an ontology-based identifier

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import bionty as bt
import pandas as pd

## Species

Let's start by exploring `Species`! You can instantiate the `Species` class by providing the common name of a species.

```{Note}
Don't worry if you are not sure which common name to use; just go with the default for now.
```

In [None]:
species = bt.Species(common_name="human")

Now you have access to a couple of attributes. Let's see what they are.

```{Important}

These are attributes of the base class `bt.Ontology`, so you'll find the same API for any other entity (such as `Gene`, `Protein`, etc.), not just for `Species`.

```

In [None]:
species.fields

This gives you a list of column names that annotate each species.

In [None]:
species.std_id

This prints the field that is used as the standardized id.

Note that Bionty focuses on interpretability, so we use the unique `common_name` of [Ensembl Species](https://asia.ensembl.org/info/about/species.html).

In [None]:
species.std_name

This is the value of `std_id` in the current instance (e.g. `human`, `mouse`, `dog` ...).

For entities that require name conversions, the `.search` method converts between different ids.

```{admonition} Alex
I think a good name for the method would be "lookup". I just don't fully understand why `scientific_name` is not an attribute of `species`. 🤔
```

In [None]:
species.search("scientific_name")

In [None]:
species.search("taxon_id")
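
As a rough mental model (not Bionty's actual implementation), `.search` on a single instance can be pictured as a row lookup in a reference table keyed by the standardized id. The sketch below uses a small hand-written table; `SPECIES_TABLE` and `search_field` are hypothetical names introduced only for illustration, and the values Bionty returns may be formatted differently.

```python
# Hypothetical reference table keyed by the standardized id (common_name).
SPECIES_TABLE = {
    "human": {"scientific_name": "Homo sapiens", "taxon_id": 9606},
    "mouse": {"scientific_name": "Mus musculus", "taxon_id": 10090},
}


def search_field(std_name: str, field: str):
    """Return the value of `field` for the entry whose standardized id is `std_name`."""
    if std_name not in SPECIES_TABLE:
        raise KeyError(f"unknown species: {std_name!r}")
    entry = SPECIES_TABLE[std_name]
    if field not in entry:
        raise KeyError(f"unknown field: {field!r}")
    return entry[field]


print(search_field("human", "scientific_name"))
print(search_field("human", "taxon_id"))
```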

A key feature of `Bionty` is that, for each entity class, we implement a static `.dataclass` via `pydantic` models, which lets users access each entry via tab completion. This allows you to quickly search for an entry of interest and retrieve its associated attributes.

```{admonition} Alex

Below, I find it highly non-intuitive that calling `Species(common_name="human").dataclass` yields a static class containing _all_ species.

Instead, I'd have expected `Species(common_name="human").dataclass` to return what you'd refer to as `dc.human`.

Let's simply discuss the whole logic & UX here! 😅

```

In [None]:
dc = species.dataclass

In [None]:
dc.cat

In [None]:
dc.cat.scientific_name
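
To make the idea of a static class holding all entries more concrete, here is a minimal sketch that swaps in the standard library's `dataclasses` module for `pydantic`. It is not how Bionty builds `.dataclass`; `SpeciesEntry` and `SpeciesLookup` are hypothetical names, and the two entries are illustrative.

```python
from dataclasses import dataclass, make_dataclass


@dataclass(frozen=True)
class SpeciesEntry:
    common_name: str
    scientific_name: str
    taxon_id: int


entries = [
    SpeciesEntry("human", "Homo sapiens", 9606),
    SpeciesEntry("cat", "Felis catus", 9685),
]

# Build a frozen dataclass with one attribute per entry, so that
# interactive shells can offer `dc.<TAB>` completion over all entries.
SpeciesLookup = make_dataclass(
    "SpeciesLookup",
    [(e.common_name, SpeciesEntry) for e in entries],
    frozen=True,
)
dc = SpeciesLookup(**{e.common_name: e for e in entries})

print(dc.cat.scientific_name)
```

Because every entry becomes an attribute of one object, the lookup object exposes all species at once, regardless of which species was used to create it.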

## Gene

Next, let's take a look at genes, which essentially follow the same design choices as `Species`.

The main differences are:
1. The `.dataclass` of `Gene` is species-specific, so you will only retrieve gene entries of the specified species.
2. We implement a `.standardize` function, which standardizes gene names in place in a `DataFrame`.

In [None]:
gene = bt.Gene(species="human")

In [None]:
from bionty._settings import settings

In [None]:
settings.datasetdir

In [None]:
dc = gene.dataclass

In [None]:
dc.PDCD1

In [None]:
dc.PDCD1.ensembl_gene_id

As with species, you can check out the gene-related fields:

In [None]:
print(f"`.fields`: {gene.fields}\n`.std_id`: {gene.std_id}")

Now let's check out a few examples of gene name conversions.

By default, the `.search` function converts to the standardized ids. You can also specify `id_type_from` and `id_type_to` to control the behavior.


In [None]:
hgnc_ids = ["HGNC:1100", "HGNC:1101"]
ensembl_ids = ["ENSG00000012048", "ENSG00000139618"]

In [None]:
# default is to convert into .std_id

gene.search(ensembl_ids, id_type_from="ensembl.gene_id")

In [None]:
# OR you can convert between any two of the attributes

gene.search(["BRCA1", "BRCA2"], id_type_from="hgnc_symbol", id_type_to="entrez.gene_id")
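
Conceptually, converting between two id types amounts to building a mapping from one column of a gene table to another. The following is a minimal sketch under that assumption, not Bionty's implementation: `GENE_ROWS`, `STD_ID`, and `convert` are hypothetical, the field names simply mirror the ones used in the calls above, and treating `hgnc_symbol` as the standardized id is an assumption.

```python
# Hypothetical gene table; each row describes one gene.
GENE_ROWS = [
    {"hgnc_symbol": "BRCA1", "ensembl.gene_id": "ENSG00000012048", "entrez.gene_id": 672},
    {"hgnc_symbol": "BRCA2", "ensembl.gene_id": "ENSG00000139618", "entrez.gene_id": 675},
]
STD_ID = "hgnc_symbol"  # assumed standardized id field


def convert(ids, id_type_from, id_type_to=STD_ID):
    """Map each id from one field to another; unmapped ids become None."""
    mapping = {row[id_type_from]: row[id_type_to] for row in GENE_ROWS}
    return {i: mapping.get(i) for i in ids}


# Default target is the standardized id.
print(convert(["ENSG00000012048", "ENSG00000139618"], id_type_from="ensembl.gene_id"))
# Any two fields can be paired; unknown ids map to None.
print(convert(["BRCA1", "FakeGene"], id_type_from="hgnc_symbol", id_type_to="entrez.gene_id"))
```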

Now let's try to provide a dataframe.

`.standardize` modifies the dataframe in place:
- The standardized values are written directly to the index.
- A `std_id` column is added containing the standardized ids. If no standardized id is found for a gene, the value is NaN.
- An `index_orig` column is added containing the original index as provided before standardization.

In [None]:
# default is to standardize gene symbols

df = pd.DataFrame(index=["RNF53", "BRCA2", "FakeGene"])
gene.standardize(df)

df
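
The in-place behavior described above can be sketched with plain pandas. This is an illustration under stated assumptions, not Bionty's code: `ALIAS_TO_STD` and `standardize_index` are hypothetical, the alias table is hand-written, and keeping the original label in the index when no match is found is an assumption.

```python
import pandas as pd

# Hypothetical alias table: maps known symbols and aliases to a standard symbol.
ALIAS_TO_STD = {"BRCA1": "BRCA1", "RNF53": "BRCA1", "BRCA2": "BRCA2"}


def standardize_index(df: pd.DataFrame) -> None:
    """Standardize `df.index` in place, keeping the original index in a column."""
    orig = df.index.to_series()
    std = orig.map(ALIAS_TO_STD)  # NaN where no mapping exists
    df["index_orig"] = orig
    df["std_id"] = std
    # Assumption: fall back to the original label where no standardized id was found.
    df.index = std.fillna(orig).to_numpy()


df = pd.DataFrame(index=["RNF53", "BRCA2", "FakeGene"])
standardize_index(df)
print(df)
```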

Of course you can also just provide a list of genes without a dataframe!

In [None]:
gene.standardize(["RNF53", "BRCA2", "FakeGene"])

Let's now try standardizing some Ensembl ids.

In [None]:
gene.standardize(["ENSG00000012048", "ENSG00000139618"], id_type="ensembl.gene_id")

## Protein

`Protein` is very similar to `Gene`.

In [None]:
protein = bt.Protein(species="human")

In [None]:
print(f"`.fields`: {protein.fields}\n`.std_id`: {protein.std_id}")

In [None]:
uniprot_ids = ["P40925", "P40926", "O43175", "Q9UM73"]

protein.search(uniprot_ids, id_type_from="UNIPROT_ID", id_type_to="CHEMBL_ID")

## Cell Type

(More features to come!)

Here we provide an interface to the Cell Ontology via the ontology manager Owlready2.

Other ontology-based entities, such as `Disease` and `Tissue`, are implemented the same way.

- `.onto` is the ontology instance of Owlready2, which allows you to load, query, modify, save ontologies. See more [here](https://owlready2.readthedocs.io/en/latest/intro.html).
- `.onto_dict` is the dictionary of {'name': 'label'}.
- `.classes` is the access point to each ontology object.
- `.search` allows you to look for ontology terms based on keywords.
- `.standardize` checks the correctness of ontology ids.
- `.dataclass` is a static dataclass for accessing entities with tab completion.


In [None]:
ct = bt.CellType()

In [None]:
ct.onto

In [None]:
ct.onto_dict["CL_0002000"]

Each ontology object exposes some information, for example its parents and ancestors:

In [None]:
obj = ct.classes["CL_0002000"]

In [None]:
obj.is_a

In [None]:
obj.ancestors()
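
For intuition, `is_a` gives the direct parents of a term, while the ancestors are everything reachable by following `is_a` links repeatedly. The sketch below walks a toy hierarchy in plain Python; the `IS_A` dict is a simplified, hand-written stand-in and not the real Cell Ontology graph, and including the starting term in the result is a choice made for this example.

```python
# Hypothetical toy hierarchy: child term -> list of direct parents (`is_a`).
IS_A = {
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
    "cell": [],
}


def ancestors(term: str) -> set:
    """Return `term` together with every term reachable via `is_a` links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


print(sorted(ancestors("T cell")))
```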

Let's search for some CL terms.

In [None]:
ct.search(["T cell", "B cell", "hepatocyte"])

Standardization currently only prints warnings for obsolete or not-found ontology terms.

In [None]:
terms = ["CL:0000084", "CL:0000243", "CL:0002000"]

ct.standardize(terms)
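
A check of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than Bionty's implementation: `TERMS` and `check_terms` are hypothetical, the obsolete flags are invented for the example, and normalizing `CL:` to `CL_` is assumed from the two id spellings used in this section.

```python
import warnings

# Hypothetical term table: id -> whether the term is obsolete.
# The flags are invented for this example.
TERMS = {"CL_0000084": False, "CL_0000236": False, "CL_9999998": True}


def check_terms(term_ids):
    """Warn about obsolete or unknown ids and return the ids that passed."""
    valid = []
    for term_id in term_ids:
        key = term_id.replace(":", "_")  # accept both CL:0000084 and CL_0000084
        if key not in TERMS:
            warnings.warn(f"{term_id} not found")
        elif TERMS[key]:
            warnings.warn(f"{term_id} is obsolete")
        else:
            valid.append(term_id)
    return valid


print(check_terms(["CL:0000084", "CL:9999999", "CL:9999998"]))
```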

Similarly, `CellType` has a static dataclass for accessing each entity:

In [None]:
dc = ct.dataclass

In [None]:
dc.CL_0000738