<img src="./mascot.png"  width="300" height="300">
 
# Find-A-Bug API

## Installation

First, clone this repository into your working directory using the following
`git` command. 

`git clone pipparichter/find-a-bug`

Then, to install all necessary dependencies, run the following command from your
working directory. 

`pip install -e find-a-bug`

## Usage

First, you need to import the Find-A-Bug utilities using the line below.

```python
import fabapi
```

### `fabapi.info`

This function does not take any arguments. It simply returns information about tables in the Find-A-Bug database, including the names of the tables and the fields in the tables. 

```python
result = fabapi.info()
```

```
AttributeError: module 'fabapi' has no attribute 'info'
```

```python
print(result)
```

### `fabapi.get`

There is currently one all-purpose function for querying the Find-A-Bug database (`fabapi.get`), which will handle any query your heart desires. Eventually, I will add some canned functions that give easier access to frequently made queries. **I would love to get some feedback as to which queries are most helpful!**

#### Arguments
1. `fields`: A string or list of strings, specifying the fields for which information from the database should be returned. 
2. `where`: A dictionary. Each key is a string naming a field, and the corresponding value specifies the filter to apply to that field. In the example below, the value is a list of accepted values, and a row is returned if the field equals any of them.
3. `verbose`: `True` by default. Indicates whether or not to print the URL that gets sent to the `microbes.gps.caltech.edu` HTTP server.  

#### Returns
A string, containing the results of the query in CSV format, as well as how long the query took to run and the raw SQL code used to make the query. 

#### Examples

**Q:** What are the genome IDs for all genes annotated as either K03738 or K11389, with an HMM score above the predefined threshold? *NOTE: To keep upload times manageable, only the annotations which met the predefined threshold are included in the database.*

```python
# 'ko' does not need to be in the first argument to pull out the relevant
# genome IDs. Including it just means that both the genome ID and the KO group are returned.
result = fabapi.get(['genome_id', 'ko'], {'ko':['K03738', 'K11389']})
```

```
GET http://microbes.gps.caltech.edu:8000/genome_id+ko/ko=eq;K03738+ko=eq;K11389
```
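
The URL printed above suggests how the `fields` and `where` arguments map onto the request. The sketch below rebuilds that one URL from the same arguments. It is inferred from this single example only: `build_url` is a hypothetical helper, not part of `fabapi`, and the `eq` operator and `+` separators are assumptions read off the printed URL.

```python
# Host and port as printed in the example output in this README.
BASE_URL = "http://microbes.gps.caltech.edu:8000"


def build_url(fields, where):
    """Hypothetical helper: rebuild the URL pattern seen in the example."""
    if isinstance(fields, str):
        fields = [fields]
    filters = []
    for field, values in where.items():
        if isinstance(values, str):
            values = [values]
        # Assumption: each accepted value becomes its own "field=eq;value" term.
        filters.extend(f"{field}=eq;{value}" for value in values)
    return f"{BASE_URL}/{'+'.join(fields)}/{'+'.join(filters)}"


url = build_url(["genome_id", "ko"], {"ko": ["K03738", "K11389"]})
print(url)
```

For the arguments used in the example, this reproduces the printed URL exactly. Other filter types may be encoded differently by the real client.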


```python
print(result)
```

```
23865 results in 138.4808603869751 seconds

SELECT gtdb_r207_annotations_kegg.genome_id, gtdb_r207_annotations_kegg.ko 
FROM gtdb_r207_annotations_kegg 
WHERE gtdb_r207_annotations_kegg.ko = ? OR gtdb_r207_annotations_kegg.ko = ?

**********
genome_id,ko
RS_GCF_002001205.1,K03738
GB_GCA_003670425.1,K03738
GB_GCA_003670425.1,K03738
GB_GCA_003670425.1,K03738
GB_GCA_012800525.1,K03738
GB_GCA_002868865.1,K03738
GB_GCA_002868865.1,K03738
GB_GCA_003712145.1,K03738
GB_GCA_003712145.1,K11389
GB_GCA_003712145.1,K03738
GB_GCA_003712145.1,K03738
RS_GCF_003865015.1,K11389
RS_GCF_003865015.1,K03738
GB_GCA_002402635.1,K03738
GB_GCA_008974985.1,K03738
GB_GCA_013178225.1,K03738
RS_GCF_001742305.1,K03738
RS_GCF_001742305.1,K03738
GB_GCA_012728915.1,K03738
GB_GCA_012728915.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_003662765.1,K03738
GB_GCA_016783885.1,K03738
GB_GCA_016783885.
```
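
Because the result is a single string, it can help to separate the timing and SQL header from the CSV rows. The sketch below does this with the standard library only, assuming the layout shown in the output above (a line of asterisks between the header and the CSV). `split_result` is a hypothetical helper, not part of `fabapi`, and the sample string is a shortened copy of that output.

```python
import csv
import io

# Line of asterisks seen between the SQL and the CSV rows in the example output.
SEPARATOR = "**********"


def split_result(result):
    """Hypothetical helper: split a result string into (header, rows)."""
    header, _, csv_text = result.partition(SEPARATOR)
    # Each row becomes a dict keyed by the CSV column names.
    rows = list(csv.DictReader(io.StringIO(csv_text.strip())))
    return header.strip(), rows


# Shortened stand-in for a real result string.
sample = """23865 results in 138.4808603869751 seconds

SELECT gtdb_r207_annotations_kegg.genome_id, gtdb_r207_annotations_kegg.ko
FROM gtdb_r207_annotations_kegg
WHERE gtdb_r207_annotations_kegg.ko = ? OR gtdb_r207_annotations_kegg.ko = ?

**********
genome_id,ko
RS_GCF_002001205.1,K03738
GB_GCA_003712145.1,K11389
"""

header, rows = split_result(sample)
print(len(rows))
print(rows[0]["genome_id"], rows[0]["ko"])
```

The CSV portion could equally be handed to `pandas.read_csv` through `io.StringIO` if a DataFrame is preferred.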

#### A note on query structure

The SQL database manager accesses information using a "primary key." In the `gtdb_r207_metadata` table, for example, the primary key is `genome_id`, which means the database manager can look up rows by genome ID very efficiently (a query like this takes about 2 seconds). However, if you instead wanted all genome IDs that meet some other criterion, e.g. all genome IDs belonging to members of the genus `Rickettsia`, the manager would have to iterate over every single one of the millions of rows in the database, checking each one against the condition (belonging to the genus `Rickettsia`). This is very slow (about 2 hours).

I can make changes to the database so that these "inverted" queries run much faster. However, this can take up a lot of space on the `microbes.gps.caltech.edu` drive, so I don't want to do it for every possible scenario. If there are any particular cases which would be useful, or frequently used, please let me know!
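
The difference between a primary-key lookup and an "inverted" query can be seen on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in, which is an assumption: the actual Find-A-Bug database engine, schema, and indexing strategy may differ. The `metadata` table, its three rows, and the `idx_genus` index are made up for illustration.

```python
import sqlite3

# Toy in-memory table; genome_id is the primary key, as in the text above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (genome_id TEXT PRIMARY KEY, genus TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("G1", "Rickettsia"), ("G2", "Escherichia"), ("G3", "Rickettsia")],
)


def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Lookup by primary key: the plan is an index search.
by_key = plan("SELECT * FROM metadata WHERE genome_id = 'G1'")

# Filter on a non-key column: the plan is a scan over every row.
by_genus = plan("SELECT genome_id FROM metadata WHERE genus = 'Rickettsia'")

# Adding a secondary index turns the same filter into an index search,
# at the cost of extra storage for the index.
conn.execute("CREATE INDEX idx_genus ON metadata (genus)")
by_genus_indexed = plan("SELECT genome_id FROM metadata WHERE genus = 'Rickettsia'")

print(by_key)
print(by_genus)
print(by_genus_indexed)
```

The exact plan wording varies between SQLite versions, but the second plan should report a scan and the third should mention `idx_genus`.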

### `fabapi.to_df`

In order to more easily work with the data

## Feedback

The Find-A-Bug API is still in development. For any suggestions, or to report any errors or bugs, please email me at `prichter at caltech dot edu`.

