# Quickstart

Using species, genes, proteins, and cell types as examples, you'll learn to:

- look up identifiers of an entity based on underlying ontologies
- convert between identifiers
- standardize columns in a `DataFrame` against an ontology-based identifier

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import bionty as bt
import pandas as pd

## Species

Let's start by exploring `Species`! You can instantiate the `Species` class by providing the common name of a species; the cell below relies on the default.


In [None]:
species = bt.Species()

Via `.dataclass` you can quickly look up an entry of interest and retrieve its associated attributes, all via tab completion.

In [None]:
dc = species.dataclass

In [None]:
dc.human

In [None]:
dc.human.scientific_name

In [None]:
dc.human.taxon_id

In [None]:
dc.cat

In [None]:
dc.cat.scientific_name
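
To make the idea of a static, tab-completable lookup more concrete, here is a minimal sketch built with the standard library only. It is not Bionty's implementation: `build_lookup` and the two-row table are hypothetical, and the real table may format its values differently.

```python
from types import SimpleNamespace

# Hypothetical two-row table; the column names mirror the attributes used above.
records = [
    {"common_name": "human", "scientific_name": "Homo sapiens", "taxon_id": 9606},
    {"common_name": "cat", "scientific_name": "Felis catus", "taxon_id": 9685},
]


def build_lookup(rows, key):
    """Return a namespace with one attribute per row, named after row[key]."""
    return SimpleNamespace(**{row[key]: SimpleNamespace(**row) for row in rows})


lookup = build_lookup(records, key="common_name")
print(lookup.human.scientific_name)
print(lookup.cat.taxon_id)
```

Because every entry becomes a plain attribute, an interactive shell can offer `lookup.<TAB>` completion without any extra machinery.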

All attributes inherit from the base class `bt.Ontology`.

So, you'll find the same API for any other entity (such as Gene, Protein, etc.), not just for Species.

In [None]:
species.fields

This gives you a list of column names that annotate each species.

In [None]:
species.std_id

This prints the field that is used as the standardized id.

Note that Bionty focuses on interpretability, so we use the unique `common_name` of [Ensembl Species](https://asia.ensembl.org/info/about/species.html).

In [None]:
species.std_name

This is the value of `std_id` in the current instance (e.g. `human`, `mouse`, `dog` ...).

## Gene

Next, let's take a look at genes, which essentially follow the same design choices as `Species`.

The main differences are:
1. The `.dataclass` of `Gene` is species-specific, so you will only retrieve gene entries of the specified species.
2. We implement a `.standardize` function that standardizes gene names in place in a dataframe.

In [None]:
gene = bt.Gene(species="human")

In [None]:
dc = gene.dataclass

In [None]:
dc.PDCD1

In [None]:
dc.PDCD1.ensembl_gene_id

As with species, you can check out the gene-related fields:

In [None]:
gene.fields

In [None]:
gene.std_id

Now let's check out a few examples of gene name conversions.

By default, the `.search` function converts to the standardized ids. You can also specify `id_type_from` and `id_type_to` to control the behavior.


In [None]:
hgnc_ids = ["HGNC:1100", "HGNC:1101"]
ensembl_ids = ["ENSG00000012048", "ENSG00000139618"]

The default is to convert into `.std_id`.

In [None]:
gene.search(ensembl_ids, id_type_from="ensembl.gene_id")

Or you can convert between any two of the attributes.

In [None]:
gene.search(["BRCA1", "BRCA2"], id_type_from="hgnc_symbol", id_type_to="entrez.gene_id")
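
Conceptually, converting between identifier types is a lookup in a table that holds one row per gene. The following is only a sketch of that idea, not the library's code: `convert` and `GENE_TABLE` are hypothetical names, the default target `hgnc_symbol` is an assumption about the standardized id, and the real `.search` may return a different structure.

```python
# Hypothetical reference table with one row per gene.
GENE_TABLE = [
    {
        "hgnc_id": "HGNC:1100",
        "hgnc_symbol": "BRCA1",
        "ensembl.gene_id": "ENSG00000012048",
        "entrez.gene_id": 672,
    },
    {
        "hgnc_id": "HGNC:1101",
        "hgnc_symbol": "BRCA2",
        "ensembl.gene_id": "ENSG00000139618",
        "entrez.gene_id": 675,
    },
]


def convert(ids, table, id_type_from, id_type_to="hgnc_symbol"):
    """Map each input id to the target id type; unknown ids map to None."""
    mapping = {row[id_type_from]: row[id_type_to] for row in table}
    return {identifier: mapping.get(identifier) for identifier in ids}


print(convert(["ENSG00000012048", "ENSG00000139618"], GENE_TABLE, "ensembl.gene_id"))
print(convert(["BRCA1", "BRCA2"], GENE_TABLE, "hgnc_symbol", "entrez.gene_id"))
```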

Now let's try to provide a dataframe.

`.standardize` modifies the dataframe as follows:
- The standardized names are written directly to the index.
- A `std_id` column contains the standardized ids; if no standardized id is found for a gene, the value is NaN.
- An `index_orig` column contains the original index provided before standardization.

The default is to standardize gene symbols:

In [None]:
df = pd.DataFrame(index=["RNF53", "BRCA2", "FakeGene"])
gene.standardize(df)

df
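
The three outputs listed above can be sketched without pandas, using one dictionary per row. This is an illustration under assumptions, not what `.standardize` does internally: `standardize_rows`, `ALIASES`, and `STANDARD` are hypothetical, and keeping the original name in the index for unmatched genes is a guess.

```python
import math

# Hypothetical reference data: RNF53 is an alias of BRCA1.
ALIASES = {"RNF53": "BRCA1"}
STANDARD = {"BRCA1", "BRCA2"}


def standardize_rows(index):
    """Return one row per input name with the new index, std_id and index_orig."""
    rows = []
    for name in index:
        # Already-standard names pass through; aliases are resolved; others are None.
        std = name if name in STANDARD else ALIASES.get(name)
        rows.append(
            {
                # Assumption: unmatched names keep their original value in the index.
                "index": std if std is not None else name,
                "std_id": std if std is not None else math.nan,
                "index_orig": name,
            }
        )
    return rows


rows = standardize_rows(["RNF53", "BRCA2", "FakeGene"])
for row in rows:
    print(row)
```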

Of course you can also just provide a list of genes without a dataframe!

In [None]:
gene.standardize(["RNF53", "BRCA2", "FakeGene"])

Let's now try standardizing some Ensembl ids:

In [None]:
gene.standardize(["ENSG00000012048", "ENSG00000139618"], id_type="ensembl.gene_id")

## Protein

Protein is very similar to Gene.

In [None]:
protein = bt.Protein(species="human")

In [None]:
protein.fields

In [None]:
protein.std_id

In [None]:
uniprot_ids = ["P40925", "P40926", "O43175", "Q9UM73"]

protein.search(uniprot_ids, id_type_from="UNIPROT_ID", id_type_to="CHEMBL_ID")

## Cell Type

(More features to come!)

Here we provide an interface to the Cell Ontology via the ontology manager Owlready2.

Other ontology-based entities, such as `Disease` and `Tissue`, are implemented the same way.

- `.onto` is the ontology instance of Owlready2, which allows you to load, query, modify, save ontologies. See more [here](https://owlready2.readthedocs.io/en/latest/intro.html).
- `.onto_dict` is a dictionary mapping each term name to its label (`{'name': 'label'}`).
- `.classes` is the access point to each ontology object.
- `.search` allows you to look for ontology terms based on keywords.
- `.standardize` checks the correctness of ontology ids.
- `.dataclass` is a static dataclass for accessing entities with tab completion.
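
As a rough mental model of `.onto_dict` and keyword search, consider a plain dictionary from term name to label. The sketch below is hypothetical (`ONTO_DICT` holds only four Cell Ontology terms, and `search_labels` uses simple substring matching, which may differ from what `.search` really does).

```python
# A tiny excerpt in the {'name': 'label'} shape described above.
ONTO_DICT = {
    "CL_0000084": "T cell",
    "CL_0000236": "B cell",
    "CL_0000182": "hepatocyte",
    "CL_0000738": "leukocyte",
}


def search_labels(keywords, onto_dict):
    """Return, per keyword, the term names whose label contains it (case-insensitive)."""
    hits = {}
    for keyword in keywords:
        needle = keyword.lower()
        hits[keyword] = [
            name for name, label in onto_dict.items() if needle in label.lower()
        ]
    return hits


print(search_labels(["T cell", "hepatocyte", "cell"], ONTO_DICT))
```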


In [None]:
ct = bt.CellType(reload=True)

In [None]:
ct.onto_dict["CL_0002000"]

An ontology object exposes some useful information:

In [None]:
obj = ct.classes["CL_0002000"]

In [None]:
obj.is_a

In [None]:
obj.ancestors()
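
Ancestors are what you reach by following `is_a` links repeatedly. The sketch below shows that traversal on a deliberately simplified, hypothetical hierarchy (the real Cell Ontology has more intermediate terms); like Owlready2's default behavior, it includes the starting term itself.

```python
# Simplified, hypothetical is_a hierarchy: child label -> list of parent labels.
IS_A = {
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
    "cell": [],
}


def ancestors(term, is_a, include_self=True):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(is_a.get(current, []))
    if not include_self:
        seen.discard(term)
    return seen


print(sorted(ancestors("T cell", IS_A)))
```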

Let's try to search for some CL terms:

In [None]:
ct.search(["T cell", "B cell", "hepatocyte"])

Standardization currently only prints warnings about obsolete or not-found ontology terms.

In [None]:
terms = ["CL:0000084", "CL:0000243", "CL:0002000"]

ct.standardize(terms)
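
A check of this kind can be sketched as a membership test that emits a warning for anything obsolete or unknown. Everything here is hypothetical: `check_terms`, the `KNOWN` excerpt, and in particular the ids `CL:9999999` and `XX:0000000`, which are made up for illustration and do not describe the real ontology.

```python
import warnings

# Hypothetical reference sets; the obsolete id is invented for this example.
KNOWN = {"CL_0000084": "T cell", "CL_0000738": "leukocyte"}
OBSOLETE = {"CL_9999999"}


def check_terms(terms, known, obsolete):
    """Classify each term as 'ok', 'obsolete' or 'not found', warning on the latter two."""
    report = {}
    for term in terms:
        key = term.replace(":", "_")  # accept both CL:0000084 and CL_0000084
        if key in obsolete:
            status = "obsolete"
        elif key in known:
            status = "ok"
        else:
            status = "not found"
        if status != "ok":
            warnings.warn(f"{term}: {status}")
        report[term] = status
    return report


report = check_terms(["CL:0000084", "CL:9999999", "XX:0000000"], KNOWN, OBSOLETE)
print(report)
```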

Similarly, `CellType` has a static dataclass for accessing each entity:

In [None]:
dc = ct.dataclass

In [None]:
dc.CL_0000738

In [None]:
dc.__dict__

## Disease

In [None]:
disease = bt.Disease()

In [None]:
dc = disease.dataclass

In [None]:
dc.MONDO_0000492

## Tissue

In [None]:
tissue = bt.Tissue()

In [None]:
dc = tissue.dataclass

In [None]:
dc.CL_0000101