# Find-A-Bug API Demo

In this demo, I showcase some of the features of the Find-A-Bug API. Before using the API, install it (see the installation guide in the `documentation.ipynb` file) and make sure the package is located in the working directory. To use the API, you need to be on campus wifi or connected through a VPN.

In [1]:
# Import the API.
import fabapi
# Import some other utilities...
import pandas as pd
import numpy as np

To display information about the tables in the database (and to confirm that everything is working as intended), we can use the `info` method.

In [2]:
fabapi.info()

NAME: gtdb_r207_metadata
PRIMARY KEY: genome_id
COLUMNS: genome_id ambiguous_bases checkm_completeness checkm_contamination checkm_marker_count checkm_marker_lineage checkm_marker_set_count checkm_strain_heterogeneity coding_bases coding_density contig_count gc_count gc_percentage genome_size gtdb_genome_representative gtdb_representative gtdb_type_designation gtdb_type_designation_sources gtdb_type_species_of_genus l50_contigs l50_scaffolds longest_contig longest_scaffold lsu_23s_contig_len lsu_23s_count lsu_23s_length lsu_23s_query_id lsu_5s_contig_len lsu_5s_count lsu_5s_length lsu_5s_query_id lsu_silva_23s_blast_align_len lsu_silva_23s_blast_bitscore lsu_silva_23s_blast_evalue lsu_silva_23s_blast_perc_identity lsu_silva_23s_blast_subject_id mean_contig_length mean_scaffold_length mimag_high_quality mimag_low_quality mimag_medium_quality n50_contigs n50_scaffolds ncbi_assembly_level ncbi_assembly_name ncbi_assembly_type ncbi_bioproject ncbi_biosample ncbi_contig_count ncbi_contig_n5

Suppose we want to get the information for a particular KO group. For Avi's sake, I will query the Rubisco KO group, which (I think) has the code K01601. Note that the query below took a couple of minutes to run, because `ko` is not yet defined as an index in the `gtdb_r207_annotations_kegg` table (I can add one if that would be helpful).

In [7]:
ko_info = fabapi.get_by_ko('K01601', gene_name=True, genome_id=True, as_df=True)
ko_info.head()

SELECT gtdb_r207_annotations_kegg.ko, gtdb_r207_annotations_kegg.gene_name, gtdb_r207_annotations_kegg.genome_id 
FROM gtdb_r207_annotations_kegg 
WHERE gtdb_r207_annotations_kegg.ko = :ko_1



Unnamed: 0,ko,gene_name,genome_id
0,K01601,NZ_FNIA01000003.1_120,RS_GCF_900103505.1
1,K01601,NZ_MUNZ01000119.1_27,RS_GCF_002001205.1
2,K01601,NCJL01000004.1_68,GB_GCA_002278935.1
3,K01601,NCJL01000004.1_73,GB_GCA_002278935.1
4,K01601,NZ_CP030051.1_5598,RS_GCF_004114975.1
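
The slow query noted above is what one would expect from a filter on an unindexed column. The sketch below uses Python's built-in `sqlite3` module with an in-memory toy table to show how adding an index changes the query plan. It is an illustration only: the backend of the Find-A-Bug database is not stated in this demo and may not be SQLite, the index name `idx_ko` and the helper `query_plan` are hypothetical, and the rows are just the five shown in the output above.

```python
import sqlite3

# Toy in-memory stand-in for gtdb_r207_annotations_kegg (assumption: the real
# backend and schema may differ; only the three queried columns are modeled).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gtdb_r207_annotations_kegg "
    "(ko TEXT, gene_name TEXT, genome_id TEXT)"
)
rows = [
    ("K01601", "NZ_FNIA01000003.1_120", "RS_GCF_900103505.1"),
    ("K01601", "NZ_MUNZ01000119.1_27", "RS_GCF_002001205.1"),
    ("K01601", "NCJL01000004.1_68", "GB_GCA_002278935.1"),
    ("K01601", "NCJL01000004.1_73", "GB_GCA_002278935.1"),
    ("K01601", "NZ_CP030051.1_5598", "RS_GCF_004114975.1"),
]
conn.executemany("INSERT INTO gtdb_r207_annotations_kegg VALUES (?, ?, ?)", rows)

query = (
    "SELECT ko, gene_name, genome_id "
    "FROM gtdb_r207_annotations_kegg WHERE ko = ?"
)


def query_plan(connection, sql, params):
    """Return SQLite's human-readable query-plan lines for a statement."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the descriptive detail string.
    return [row[-1] for row in plan_rows]


# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, query, ("K01601",))

# Hypothetical index on the ko column.
conn.execute("CREATE INDEX idx_ko ON gtdb_r207_annotations_kegg (ko)")

# With the index, the plan becomes an index search on ko.
plan_after = query_plan(conn, query, ("K01601",))

print(plan_before)
print(plan_after)
```

On a table with only five rows the difference in run time is negligible; the point is the change in plan from a full scan to an index search, which is where the savings would come from on a large annotations table.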


In [7]:
print(f'The query above returned {len(ko_info)} results.')

The query above returned 6541 results.


Now, suppose we want to look into the organism that has the highest number of genes in the KO group K01601. We can grab its `genome_id` from the table by taking the mode of the `genome_id` column.

In [10]:
ko_info_mode = ko_info['genome_id'].mode()
print(f'Highest-represented organism is {ko_info_mode[0]}')

Highest-represented organism is RS_GCF_016583835.1
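
Note that `mode()` returns a Series, which can hold more than one value when several genomes tie for the top count, and it does not report the count itself. As a minimal sketch, `value_counts()` exposes both. The DataFrame `ko_info_sample` below is a hypothetical stand-in built from only the five rows shown in the `head()` output earlier, so its top genome differs from the one found in the full result.

```python
import pandas as pd

# Stand-in for ko_info, using only the five rows displayed by ko_info.head().
ko_info_sample = pd.DataFrame(
    {
        "ko": ["K01601"] * 5,
        "gene_name": [
            "NZ_FNIA01000003.1_120",
            "NZ_MUNZ01000119.1_27",
            "NCJL01000004.1_68",
            "NCJL01000004.1_73",
            "NZ_CP030051.1_5598",
        ],
        "genome_id": [
            "RS_GCF_900103505.1",
            "RS_GCF_002001205.1",
            "GB_GCA_002278935.1",
            "GB_GCA_002278935.1",
            "RS_GCF_004114975.1",
        ],
    }
)

# Number of K01601 genes per genome, sorted from most to fewest.
counts = ko_info_sample["genome_id"].value_counts()

# The highest count, and every genome that reaches it (handles ties).
top_count = int(counts.iloc[0])
top_genomes = sorted(counts[counts == top_count].index)

print(f"Top count: {top_count}; genome(s): {top_genomes}")
```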


What else can we learn about this organism? We can get the associated metadata from the database using the `get_by_genome_id` method. **NOTE:** There seems to be an unnecessary join here; I will look into it.

In [3]:
genome_id_info = fabapi.get_by_genome_id('RS_GCF_016583835.1', as_df=True)
genome_id_info.head()

SELECT gtdb_r207_metadata.genome_id, gtdb_r207_metadata.gtdb_domain, gtdb_r207_metadata.gtdb_phylum, gtdb_r207_metadata.gtdb_class, gtdb_r207_metadata.gtdb_order, gtdb_r207_metadata.gtdb_genus, gtdb_r207_metadata.gtdb_species 
FROM gtdb_r207_metadata 
WHERE gtdb_r207_metadata.genome_id = :genome_id_1



Unnamed: 0,genome_id,gtdb_domain,gtdb_phylum,gtdb_class,gtdb_order,gtdb_genus,gtdb_species
0,RS_GCF_016583835.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Chromatiales,g__Thiococcus,s__Thiococcus pfennigii
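
The taxonomy values above carry GTDB rank prefixes (`d__`, `p__`, and so on). If plain names are easier to work with downstream, the prefixes can be split off. The helper `strip_rank_prefix` below is a hypothetical convenience function, not part of `fabapi`, and the input dictionary simply copies the row shown above.

```python
# Single-letter GTDB rank prefixes and the ranks they stand for.
GTDB_RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def strip_rank_prefix(taxon):
    """Split a value like 'g__Thiococcus' into ('genus', 'Thiococcus').

    Values without a recognized prefix are returned unchanged with rank None.
    """
    prefix, separator, name = taxon.partition("__")
    if separator and prefix in GTDB_RANKS:
        return GTDB_RANKS[prefix], name
    return None, taxon


# The metadata row returned for RS_GCF_016583835.1, copied from the output above.
row = {
    "gtdb_domain": "d__Bacteria",
    "gtdb_phylum": "p__Proteobacteria",
    "gtdb_class": "c__Gammaproteobacteria",
    "gtdb_order": "o__Chromatiales",
    "gtdb_genus": "g__Thiococcus",
    "gtdb_species": "s__Thiococcus pfennigii",
}

# Map each rank name to its taxon name without the prefix.
taxonomy = dict(strip_rank_prefix(value) for value in row.values())
print(taxonomy)
```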


Suppose we want to grab the sequence information for all genes in the Rubisco KO group. This g

In [9]:
gene_names = []

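In [15]:
# Fetch the records for the first 10 gene names only, to keep the query small.
results = fabapi.get_by_gene_name(list(ko_info['gene_name'])[:10], as_df=False)

In [18]:
# Convert the raw query results into a DataFrame for display.
fabapi.to_df(results)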

Unnamed: 0,gene_name,genome_id,sequence
0,NCJL01000004.1_68,GB_GCA_002278935.1,MDQSARYSNLDLKEADLIKGEKHILVAYKMKPKAGYGYLEAAAHFA...
1,NCJL01000004.1_73,GB_GCA_002278935.1,MAVKSYNAGVKEYRQTYWEPEYKVQDTDILACFKITPQAGVSREEI...
2,NZ_AP020335.1_194,RS_GCF_013340845.1,MASKTFDAGVQDYQLTYWTPDYTPLDTDLLACFKVVPQEGVPREEA...
3,NZ_AP020335.1_1953,RS_GCF_013340845.1,MDQSNRYADLTLTEEKLVADGNHLLVAYRLKPAAGYGFLEVAAHVA...
4,NZ_AP020335.1_2033,RS_GCF_013340845.1,MAKTYNAGVKEYRETYWMPEYEPKDSDFLACFKVVPQPGVPREEIA...
5,NZ_BHVV01000001.1_230,RS_GCF_003865015.1,MAVKTYNAGVKEYRQTYWTPEYTPKDTDILACFKVTPQPGVAREEV...
6,NZ_CP030051.1_5598,RS_GCF_004114975.1,MNAHAGTVRGKERYRSGVMEYKRMGYWEPDYTPKDTDVIALFRVTP...
7,NZ_CP030052.1_571,RS_GCF_004114975.1,MSLRVAINGFGRIGRNVLRAIAESRRNDIEVVAINDLGPVETNAHL...
8,NZ_FNIA01000003.1_120,RS_GCF_900103505.1,MTGIEYDDFLDLDYEPTGTDLVCEFSIAPASDMSMEAAASRVASES...
9,NZ_MUNZ01000119.1_27,RS_GCF_002001205.1,MATKTYSAGVKEYRSTYWEPHYTPKDTDILACFKITPQAGVDREEV...
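
The query above only covers the first 10 gene names. To retrieve sequences for the whole KO group, one option is to send the names in batches rather than in a single large request. The sketch below is a generic pattern, not a feature of `fabapi`: `batches`, `fetch_in_batches`, and `fake_fetch` are hypothetical helpers, and it assumes the fetch function returns an iterable of records, which has not been checked against what `get_by_gene_name` returns with `as_df=False`.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fetch_in_batches(names, fetch, batch_size=1000):
    """Call fetch once per batch of names and collect all returned records."""
    records = []
    for batch in batches(names, batch_size):
        records.extend(fetch(batch))
    return records


# Stand-in for the real query so the example runs without database access.
# With the API, one might instead pass something like:
#   lambda batch: fabapi.get_by_gene_name(batch, as_df=False)
# (assumption: that call returns an iterable of records).
def fake_fetch(batch):
    return [{"gene_name": name} for name in batch]


names = [f"gene_{i}" for i in range(25)]
records = fetch_in_batches(names, fake_fetch, batch_size=10)
print(len(records))
```

A suitable batch size depends on how the server handles long parameter lists, so it is worth starting small and increasing it while watching the query time.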
