# Quickstart

We'll walk through Bionty's table model for species, genes, proteins, and cell types.

You'll see how to

- configure a table
- look up identifiers of an entity
- convert between identifiers
- validate data against identifiers

In [5]:
%load_ext autoreload
%autoreload 2

In [23]:
import bionty as bt
import pandas as pd

## Species


In [3]:
species = bt.Species()

In [6]:
species.df.head()

| common_name | scientific_name | taxon_id | assembly | accession | release | short_name |
| --- | --- | --- | --- | --- | --- | --- |
| abingdon_island_giant_tortoise | chelonoidis_abingdonii | 106734 | ASM359739v1 | GCA_003597395.1 | 106 | cabingdonii |
| african_green_monkey | chlorocebus_sabaeus | 60711 | ChlSab1.1 | GCA_000409795.2 | 106 | csabaeus |
| african_ostrich | struthio_camelus_australis | 441894 | ASM69896v1 | GCA_000698965.1 | 106 | saustralis |
| african_savanna_elephant | loxodonta_africana | 9785 | loxAfr3 | GCA_000001905.1 | 106 | lafricana |
| agassizs_desert_tortoise | gopherus_agassizii | 38772 | ASM289641v1 | GCA_002896415.1 | 106 | gagassizii |


In [7]:
species.df.shape[0]

261

You can search for terms with auto-completion using a lookup object:

In [8]:
species.lookup.white_tufted_ear_marmoset;
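
To make the idea of an auto-completable lookup concrete, here is a minimal sketch built on `collections.namedtuple`. It is not Bionty's implementation: the helper names `to_identifier` and `make_lookup` and the raw display names are assumptions for illustration, and the normalization rules are only modeled on the lowercase, underscore-separated names visible in the species table.

```python
from collections import namedtuple


def to_identifier(name: str) -> str:
    """Normalize a display name into a lowercase, attribute-friendly key (hypothetical helper)."""
    key = name.lower()
    # Literal replacements: separators become underscores, punctuation is dropped.
    for old, new in ((" ", "_"), ("-", "_"), (".", "_"), ("'", ""), ("(", ""), (")", "")):
        key = key.replace(old, new)
    return key


def make_lookup(names):
    """Build an immutable object whose attributes are the normalized names (hypothetical helper)."""
    keys = [to_identifier(name) for name in names]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*keys)


# Assumed raw names; only their normalized forms appear in this guide.
lookup = make_lookup(["Human", "White-tufted-ear marmoset", "Agassiz's desert tortoise"])
print(lookup.white_tufted_ear_marmoset)
```

Because every term becomes an attribute, an interactive shell can offer tab completion after typing `lookup.`, which is what makes this pattern convenient for browsing large vocabularies.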

In [9]:
species.df.loc["human"]

scientific_name        homo_sapiens
taxon_id                       9606
assembly                     GRCh38
accession          GCA_000001405.28
release                         106
short_name                 hsapiens
Name: human, dtype: object

In [10]:
species.df.loc[["pig", "human", "mouse"]]

| common_name | scientific_name | taxon_id | assembly | accession | release | short_name |
| --- | --- | --- | --- | --- | --- | --- |
| pig | sus_scrofa, sus_scrofa_usmarc, sus_scrofa_bame... | 9823, 9823, 9823, 9823, 9823, 9823, 9823, 9823... | Sscrofa11.1, USMARCv1.0, Bamei_pig_v1, Rongcha... | GCA_000003025.6, GCA_002844635.1, GCA_00170023... | 106, 106, 106, 106, 106, 106, 106, 106, 106, 1... | sscrofa, susmarc, sbamei, srongchang, slargewh... |
| human | homo_sapiens | 9606 | GRCh38 | GCA_000001405.28 | 106 | hsapiens |
| mouse | mus_musculus_c57bl6nj, mus_musculus_nzohlltj, ... | 10090, 10090, 10090, 10090, 10091, 10090, 1009... | C57BL_6NJ_v1, NZO_HlLtJ_v1, 129S1_SvImJ_v1, BA... | GCA_001632555.1, GCA_001624745.1, GCA_00162418... | 106, 106, 106, 106, 106, 106, 106, 106, 106, 1... | mc57bl6nj, mnzohlltj, m129s1svimj, mbalbcj, mc... |
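
In the output above, species with several assemblies (pig, mouse) carry comma-separated values in each column. The sketch below shows one way to split such cells into one row per assembly with plain pandas. The small frame is an abridged stand-in containing only the first two pig entries visible above, not the real `species.df`, and multi-column `explode` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Abridged stand-in for species.df: only the first two pig entries shown above.
species_df = pd.DataFrame(
    {
        "scientific_name": ["sus_scrofa, sus_scrofa_usmarc", "homo_sapiens"],
        "assembly": ["Sscrofa11.1, USMARCv1.0", "GRCh38"],
        "short_name": ["sscrofa, susmarc", "hsapiens"],
    },
    index=pd.Index(["pig", "human"], name="common_name"),
)

columns = list(species_df.columns)

# Turn every "a, b" string into a list, column by column.
split = species_df.apply(lambda col: col.str.split(", "))

# Expand the lists in parallel so each assembly gets its own row.
long_df = split.explode(columns)
print(long_df)
```

This only works when every column of a row holds the same number of entries, which appears to be the case in the table above but is an assumption here.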


## Gene

Next, let's look at genes. `Gene` essentially follows the same design choices as `Species`.

The main differences are:
1. The `.dataclass` of `Gene` is species-specific, so you only retrieve gene entries for the specified species.
2. We implement a `.standardize` function that standardizes gene names in place in a DataFrame.
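
To illustrate what standardizing gene names can mean, here is a minimal pandas sketch that maps aliases to approved symbols. It is not Bionty's `.standardize` implementation, whose signature is not shown in this guide. The reference rows are abridged from the `alias_symbol` column of the human gene table shown later in this section, and the `data` frame is hypothetical user input.

```python
import pandas as pd

# Abridged reference rows taken from the human gene table in this guide.
reference = pd.DataFrame(
    {"alias_symbol": ["ACF|ASP|ACF64|ACF65|APOBEC1CF", "FWP007|S863-7|CPAMD5"]},
    index=pd.Index(["A1CF", "A2M"], name="hgnc_symbol"),
)

# Map every pipe-separated alias to its approved symbol.
alias_to_symbol = {
    alias: symbol
    for symbol, aliases in reference["alias_symbol"].items()
    for alias in aliases.split("|")
}

# Hypothetical user data mixing aliases and approved symbols.
data = pd.DataFrame({"gene": ["ACF", "A2M", "CPAMD5"], "count": [3, 5, 2]})

# Overwrite the column: known aliases are replaced, everything else is kept.
data["gene"] = data["gene"].replace(alias_to_symbol)
print(data)
```

A real alias table can contain ambiguous aliases that point to more than one symbol; this sketch silently keeps the last one it sees, so a production version would need an explicit policy for such cases.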

In [2]:
gene = bt.Gene(species="human")

In [3]:
gene.df

| hgnc_symbol | hgnc_id | name | locus_group | locus_type | status | location | location_sortable | alias_symbol | alias_name | prev_symbol | ... | cd | lncrnadb | enzyme_id | intermediate_filament_db | rna_central_ids | lncipedia | gtrnadb | agr | mane_select | gencc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A1BG | HGNC:5 | alpha-1-B glycoprotein | protein-coding gene | gene with protein product | Approved | 19q13.43 | 19q13.43 |  |  |  | ... |  |  |  |  |  |  |  | HGNC:5 | ENST00000263100.8\|NM_130786.4 |  |
| A1BG-AS1 | HGNC:37133 | A1BG antisense RNA 1 | non-coding RNA | RNA, long non-coding | Approved | 19q13.43 | 19q13.43 | FLJ23569 |  | NCRNA00181\|A1BGAS\|A1BG-AS | ... |  |  |  |  |  | A1BG-AS1 |  | HGNC:37133 |  |  |
| A1CF | HGNC:24086 | APOBEC1 complementation factor | protein-coding gene | gene with protein product | Approved | 10q11.23 | 10q11.23 | ACF\|ASP\|ACF64\|ACF65\|APOBEC1CF |  |  | ... |  |  |  |  |  |  |  | HGNC:24086 | ENST00000373997.8\|NM_014576.4 |  |
| A2M | HGNC:7 | alpha-2-macroglobulin | protein-coding gene | gene with protein product | Approved | 12p13.31 | 12p13.31 | FWP007\|S863-7\|CPAMD5 |  |  | ... |  |  |  |  |  |  |  | HGNC:7 | ENST00000318602.12\|NM_000014.6 | HGNC:7 |
| A2M-AS1 | HGNC:27057 | A2M antisense RNA 1 | non-coding RNA | RNA, long non-coding | Approved | 12p13.31 | 12p13.31 |  |  |  | ... |  |  |  |  |  | A2M-AS1 |  | HGNC:27057 |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZYG11B | HGNC:25820 | zyg-11 family member B, cell cycle regulator | protein-coding gene | gene with protein product | Approved | 1p32.3 | 01p32.3 | FLJ13456 |  | ZYG11 | ... |  |  |  |  |  |  |  | HGNC:25820 | ENST00000294353.7\|NM_024646.3 | HGNC:25820 |
| ZYX | HGNC:13200 | zyxin | protein-coding gene | gene with protein product | Approved | 7q34 | 07q34 |  |  |  | ... |  |  |  |  |  |  |  | HGNC:13200 | ENST00000322764.10\|NM_003461.5 |  |
| ZYXP1 | HGNC:51695 | zyxin pseudogene 1 | pseudogene | pseudogene | Approved | 8q24.23 | 08q24.23 |  |  |  | ... |  |  |  |  |  |  |  | HGNC:51695 |  |  |
| ZZEF1 | HGNC:29027 | zinc finger ZZ-type and EF-hand domain contain... | protein-coding gene | gene with protein product | Approved | 17p13.2 | 17p13.2 | KIAA0399\|ZZZ4\|FLJ10821 |  |  | ... |  |  |  |  |  |  |  | HGNC:29027 | ENST00000381638.7\|NM_015113.4 |  |


In [5]:
lookup = gene.lookup 

In [7]:
lookup.PDCD1;

In [10]:
gene.df.loc["PDCD1"].head()

hgnc_id                        HGNC:8760
name             programmed cell death 1
locus_group          protein-coding gene
locus_type     gene with protein product
status                          Approved
Name: PDCD1, dtype: object

In [19]:
gene.df.index

Index(['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1',
       'A2ML1-AS2', 'A2MP1', 'A3GALT2',
       ...
       'ZXDA', 'ZXDB', 'ZXDC', 'ZYG11A', 'ZYG11AP1', 'ZYG11B', 'ZYX', 'ZYXP1',
       'ZZEF1', 'ZZZ3'],
      dtype='object', name='hgnc_symbol', length=43156)
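
Since the approved symbols form the index, validating a list of gene names reduces to a membership test against that index. The sketch below uses a small stand-in index made of symbols that appear in this guide, plus hypothetical user input, so it runs without downloading the reference table.

```python
import pandas as pd

# Stand-in for gene.df.index, using symbols that appear in this guide.
reference_symbols = pd.Index(
    ["A1BG", "A1CF", "A2M", "PDCD1", "BRCA1", "BRCA2"], name="hgnc_symbol"
)

# Hypothetical user input: one wrong case and one unknown name.
user_symbols = pd.Series(["PDCD1", "brca1", "A2M", "NOT_A_GENE"])

# Boolean mask: True where the value is an approved symbol (case-sensitive).
is_valid = user_symbols.isin(reference_symbols)

# Collect the entries that need fixing.
invalid = user_symbols[~is_valid].tolist()
print(invalid)  # ['brca1', 'NOT_A_GENE']
```

With the real table, the same check would presumably read `user_symbols.isin(gene.df.index)`.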

Converting between fields is currently done through the pandas API:

In [21]:
gene.df.loc[gene.df.index.isin(["BRCA1", "BRCA2"])]

| hgnc_symbol | hgnc_id | name | locus_group | locus_type | status | location | location_sortable | alias_symbol | alias_name | prev_symbol | ... | cd | lncrnadb | enzyme_id | intermediate_filament_db | rna_central_ids | lncipedia | gtrnadb | agr | mane_select | gencc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BRCA1 | HGNC:1100 | BRCA1 DNA repair associated | protein-coding gene | gene with protein product | Approved | 17q21.31 | 17q21.31 | RNF53\|BRCC1\|PPP1R53\|FANCS | BRCA1/BRCA2-containing complex, subunit 1\|prot... |  | ... |  |  |  |  |  |  |  | HGNC:1100 | ENST00000357654.9\|NM_007294.4 | HGNC:1100 |
| BRCA2 | HGNC:1101 | BRCA2 DNA repair associated | protein-coding gene | gene with protein product | Approved | 13q13.1 | 13q13.1 | FAD\|FAD1\|BRCC2\|XRCC11 | BRCA1/BRCA2-containing complex, subunit 2 | FANCD1\|FACD\|FANCD | ... |  |  |  |  |  |  |  | HGNC:1101 | ENST00000380152.8\|NM_000059.4 | HGNC:1101 |
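
The row filter above returns whole records. To convert one identifier type into another, you can select a single column by label, or invert the table to go the other way. The sketch below assumes a small stand-in frame with the same layout as `gene.df` (index `hgnc_symbol`, column `hgnc_id`), filled with values shown in this guide.

```python
import pandas as pd

# Stand-in for gene.df with values copied from the outputs in this guide.
gene_df = pd.DataFrame(
    {"hgnc_id": ["HGNC:1100", "HGNC:1101", "HGNC:8760"]},
    index=pd.Index(["BRCA1", "BRCA2", "PDCD1"], name="hgnc_symbol"),
)

# Symbol -> HGNC ID: label-based selection keeps the order of the query.
symbols = ["BRCA2", "PDCD1"]
ids = gene_df.loc[symbols, "hgnc_id"].tolist()
print(ids)  # ['HGNC:1101', 'HGNC:8760']

# HGNC ID -> symbol: re-index the table by the other identifier.
id_to_symbol = gene_df.reset_index().set_index("hgnc_id")["hgnc_symbol"]
back = id_to_symbol.loc[ids].tolist()
print(back)  # ['BRCA2', 'PDCD1']
```

Note that `.loc` raises a `KeyError` for labels that are missing from the index; `Series.reindex` is a common alternative when unknown identifiers should map to `NaN` instead.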


## Protein

`Protein` is very similar to `Gene`.

In [None]:
protein = bt.Protein(species="human")

In [None]:
protein.fields

In [None]:
protein.std_id

In [None]:
uniprot_ids = ["P40925", "P40926", "O43175", "Q9UM73"]

protein.search(uniprot_ids, id_type_from="UNIPROT_ID", id_type_to="CHEMBL_ID")

## Cell Type

In [None]:
ct = bt.CellType()

In [None]:
dc = ct.dataclass

In [None]:
dc.CL_0000738