# Quickstart

We'll walk through Bionty's table model for species, genes, proteins, and cell types.

You'll see how to

- configure a table
- look up identifiers of an entity
- convert between identifiers
- validate data against identifiers

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import bionty as bt

## Species


In [3]:
species = bt.Species()

In [4]:
species.df.head()

| common_name | scientific_name | taxon_id | assembly | accession | release | short_name |
| --- | --- | --- | --- | --- | --- | --- |
| abingdon_island_giant_tortoise | chelonoidis_abingdonii | 106734 | ASM359739v1 | GCA_003597395.1 | 106 | cabingdonii |
| african_green_monkey | chlorocebus_sabaeus | 60711 | ChlSab1.1 | GCA_000409795.2 | 106 | csabaeus |
| african_ostrich | struthio_camelus_australis | 441894 | ASM69896v1 | GCA_000698965.1 | 106 | saustralis |
| african_savanna_elephant | loxodonta_africana | 9785 | loxAfr3 | GCA_000001905.1 | 106 | lafricana |
| agassiz's_desert_tortoise | gopherus_agassizii | 38772 | ASM289641v1 | GCA_002896415.1 | 106 | gagassizii |


In [5]:
species.df.loc["pig"]

scientific_name    sus_scrofa, sus_scrofa_usmarc, sus_scrofa_bame...
taxon_id           9823, 9823, 9823, 9823, 9823, 9823, 9823, 9823...
assembly           Sscrofa11.1, USMARCv1.0, Bamei_pig_v1, Rongcha...
accession          GCA_000003025.6, GCA_002844635.1, GCA_00170023...
release            106, 106, 106, 106, 106, 106, 106, 106, 106, 1...
short_name         sscrofa, susmarc, sbamei, srongchang, slargewh...
Name: pig, dtype: object

We can auto-complete on the index by transposing the table.
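
As a minimal sketch of the idea in plain pandas (the two-row table is a hand-copied excerpt of the output above, not a call into Bionty): transposing moves the index labels into the columns, and pandas exposes columns whose names are valid Python identifiers as attributes.

```python
import pandas as pd

# Hand-copied excerpt of the species table shown above (illustrative only).
df = pd.DataFrame(
    {
        "scientific_name": ["chlorocebus_sabaeus", "loxodonta_africana"],
        "taxon_id": [60711, 9785],
    },
    index=pd.Index(
        ["african_green_monkey", "african_savanna_elephant"], name="common_name"
    ),
)

# After transposing, each common name is a column label, so it can be
# reached (and tab-completed in an interactive session) as an attribute.
transposed = df.T
elephant = transposed.african_savanna_elephant

print(elephant["scientific_name"])
```

Labels that are not valid identifiers, such as `agassiz's_desert_tortoise`, still need bracket access (`transposed["agassiz's_desert_tortoise"]`).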

## Gene

Next, let's take a look at genes. `Gene` essentially follows the same design choices as `Species`.

The main differences are:
1. The `.dataclass` of `Gene` is species-specific, so you only retrieve gene entries for the specified species.
2. `Gene` implements a `.standardize` function, which standardizes gene names in place in a dataframe.

In [None]:
gene = bt.Gene(species="human")

In [None]:
dc = gene.dataclass

In [None]:
dc.PDCD1

In [None]:
dc.PDCD1.ensembl_gene_id
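
To illustrate what a static, tab-completable lookup object can look like, here is a minimal sketch built with the standard library. It is not Bionty's implementation: `GeneEntry`, `build_lookup`, and the field names are assumptions made for this example, and the two records reuse the BRCA1/BRCA2 identifiers that appear later in this quickstart.

```python
from dataclasses import dataclass, make_dataclass


@dataclass(frozen=True)
class GeneEntry:
    """One gene record (hypothetical structure for this sketch)."""

    hgnc_symbol: str
    hgnc_id: str
    ensembl_gene_id: str


def build_lookup(entries):
    """Create a frozen dataclass instance with one attribute per gene symbol."""
    fields = [(entry.hgnc_symbol, GeneEntry) for entry in entries]
    lookup_cls = make_dataclass("GeneLookup", fields, frozen=True)
    return lookup_cls(**{entry.hgnc_symbol: entry for entry in entries})


dc_sketch = build_lookup(
    [
        GeneEntry("BRCA1", "HGNC:1100", "ENSG00000012048"),
        GeneEntry("BRCA2", "HGNC:1101", "ENSG00000139618"),
    ]
)

print(dc_sketch.BRCA1.ensembl_gene_id)
```

Because the attributes are generated from the records, an interactive session can offer `dc_sketch.BR<TAB>` completions. Symbols that are not valid Python identifiers would need renaming before they could become attributes.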

As with species, you can inspect the gene-related fields:

In [None]:
gene.fields

In [None]:
gene.std_id

Now let's check out a few examples of gene name conversions.

By default, the `.search` function converts to the standardized IDs; you can also specify `id_type_from` and `id_type_to` to control the behavior.


In [None]:
hgnc_ids = ["HGNC:1100", "HGNC:1101"]
ensembl_ids = ["ENSG00000012048", "ENSG00000139618"]

The default is to convert into `.std_id`.

In [None]:
gene.search(ensembl_ids, id_type_from="ensembl.gene_id")

Or you can convert between any two of the attributes.

In [None]:
gene.search(["BRCA1", "BRCA2"], id_type_from="hgnc_symbol", id_type_to="entrez.gene_id")
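
Conceptually, converting between two identifier types is a lookup in a reference table. The sketch below shows that idea in plain Python. It makes no claim about how `.search` is implemented: `convert_ids` and the two-record `RECORDS` table are hypothetical, and only the key names are taken from the calls above.

```python
# Toy reference table; a real table would hold one record per gene.
RECORDS = [
    {
        "hgnc_symbol": "BRCA1",
        "ensembl.gene_id": "ENSG00000012048",
        "entrez.gene_id": "672",
    },
    {
        "hgnc_symbol": "BRCA2",
        "ensembl.gene_id": "ENSG00000139618",
        "entrez.gene_id": "675",
    },
]


def convert_ids(values, id_type_from, id_type_to, records=RECORDS):
    """Map each input identifier to its counterpart; unmatched inputs map to None."""
    mapping = {rec[id_type_from]: rec[id_type_to] for rec in records}
    return {value: mapping.get(value) for value in values}


converted = convert_ids(
    ["BRCA1", "BRCA2", "FakeGene"],
    id_type_from="hgnc_symbol",
    id_type_to="entrez.gene_id",
)
print(converted)
```

Returning `None` for unmatched inputs is a choice made for this sketch; a library may report misses differently.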

Now let's try to provide a dataframe.

`.standardize` produces:
- An index holding the outcome of the standardization, written in place.
- A `std_id` column containing the standardized IDs; if no standardized ID is found for a gene, the value is NaN.
- An `index_orig` column containing the original index provided before standardization.

The default is to standardize gene symbols:

In [None]:
import pandas as pd

df = pd.DataFrame(index=["RNF53", "BRCA2", "FakeGene"])
gene.standardize(df)

df
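
The three outputs described above can be pictured with a small plain-Python sketch. It is only an illustration under stated assumptions: the alias table, the use of HGNC IDs as the standardized ID, and leaving unknown symbols unchanged in the index are all choices made for this example, and `standardize_symbols` is a hypothetical helper, not part of Bionty.

```python
import math

# Toy alias table (assumption for this sketch): alias -> standard symbol.
ALIASES = {"RNF53": "BRCA1"}
# Toy standardized IDs; the choice of HGNC IDs here is arbitrary.
STD_IDS = {"BRCA1": "HGNC:1100", "BRCA2": "HGNC:1101"}


def standardize_symbols(symbols):
    """Return one row per input with the new index, std_id, and index_orig."""
    rows = []
    for original in symbols:
        symbol = ALIASES.get(original, original)  # resolve aliases first
        rows.append(
            {
                "index": symbol,
                "std_id": STD_IDS.get(symbol, math.nan),  # NaN when unknown
                "index_orig": original,
            }
        )
    return rows


rows = standardize_symbols(["RNF53", "BRCA2", "FakeGene"])
for row in rows:
    print(row)
```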

Of course you can also just provide a list of genes without a dataframe!

In [None]:
gene.standardize(["RNF53", "BRCA2", "FakeGene"])

Let's now try standardizing some Ensembl IDs:

In [None]:
gene.standardize(["ENSG00000012048", "ENSG00000139618"], id_type="ensembl.gene_id")

## Protein

`Protein` is very similar to `Gene`.

In [None]:
protein = bt.Protein(species="human")

In [None]:
protein.fields

In [None]:
protein.std_id

In [None]:
uniprot_ids = ["P40925", "P40926", "O43175", "Q9UM73"]

protein.search(uniprot_ids, id_type_from="UNIPROT_ID", id_type_to="CHEMBL_ID")

## Cell Type

(More features to come!)

Here we provide an interface to the Cell Ontology via the ontology manager Owlready2.

Other ontology-based entities, such as `Disease` and `Tissue`, are implemented the same way.

- `.onto` is the ontology instance of Owlready2, which allows you to load, query, modify, and save ontologies. See the [Owlready2 introduction](https://owlready2.readthedocs.io/en/latest/intro.html) for more.
- `.onto_dict` is a dictionary of the form {'name': 'label'}.
- `.classes` is the access point to each ontology object.
- `.search` allows you to look for ontology terms based on keywords.
- `.standardize` checks the correctness of ontology IDs.
- `.dataclass` is a static dataclass for accessing entities with tab completion.


In [None]:
ct = bt.CellType(reload=True)

In [None]:
ct.onto_dict["CL_0002000"]

An ontology object exposes some information, for example:

In [None]:
obj = ct.classes["CL_0002000"]

In [None]:
obj.is_a

In [None]:
obj.ancestors()
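
The relationship between direct parents (`is_a`) and ancestors can be sketched without an ontology library: the ancestors are everything reachable by repeatedly following `is_a` links. The hierarchy below is a simplified toy, not real Cell Ontology data, and including the starting term in the result is a choice made for this sketch.

```python
# Toy is_a hierarchy (child -> direct parents); labels stand in for CL terms.
IS_A = {
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
    "cell": [],
}


def ancestors(term, is_a=IS_A):
    """Collect the term and everything reachable by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return seen


print(sorted(ancestors("T cell")))
```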

Let's search for some CL terms:

In [None]:
ct.search(["T cell", "B cell", "hepatocyte"])
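
A keyword search over a {'name': 'label'} dictionary can be sketched as a case-insensitive substring match. This is an assumption about the general approach, not a description of what `.search` does internally; `search_labels` is a hypothetical helper and `ONTO_DICT` is a three-term sample.

```python
# Sample {'name': 'label'} dictionary with three Cell Ontology terms.
ONTO_DICT = {
    "CL_0000084": "T cell",
    "CL_0000236": "B cell",
    "CL_0000182": "hepatocyte",
}


def search_labels(keywords, onto_dict=ONTO_DICT):
    """Return, per keyword, the names whose label contains it (case-insensitive)."""
    results = {}
    for keyword in keywords:
        needle = keyword.lower()
        results[keyword] = [
            name for name, label in onto_dict.items() if needle in label.lower()
        ]
    return results


hits = search_labels(["T cell", "hepatocyte", "neuron"])
print(hits)
```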

Standardization currently only prints warnings about obsolete or not-found ontology terms.

In [None]:
terms = ["CL:0000084", "CL:0000243", "CL:0002000"]

ct.standardize(terms)
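
The kind of check described above can be sketched with the standard `warnings` module. Everything here is illustrative: `check_terms` is a hypothetical helper, and the `KNOWN` and `OBSOLETE` sets are toy assumptions rather than the real state of the Cell Ontology. The sketch also shows the `CL:0000084` to `CL_0000084` conversion between the two ID spellings used in this section.

```python
import warnings

# Toy ontology state (assumptions for this sketch, not real Cell Ontology data).
KNOWN = {"CL_0000084", "CL_0002000"}
OBSOLETE = {"CL_0000243"}


def check_terms(terms, known=KNOWN, obsolete=OBSOLETE):
    """Warn about obsolete or unknown terms and return the accepted ones."""
    accepted = []
    for term in terms:
        name = term.replace(":", "_")  # "CL:0000084" -> "CL_0000084"
        if name in obsolete:
            warnings.warn(f"{term} is obsolete")
        elif name not in known:
            warnings.warn(f"{term} was not found")
        else:
            accepted.append(term)
    return accepted


# Capture the warnings so they can be inspected instead of only printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    accepted = check_terms(["CL:0000084", "CL:0000243", "CL:9999999"])

print(accepted)
print([str(w.message) for w in caught])
```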

Similarly, `CellType` has a static dataclass for accessing each entity:

In [None]:
dc = ct.dataclass

In [None]:
dc.CL_0000738