<img src="./mascot.png"  width="400" height="400">

# Find-A-Bug


## Installation

There is no packaged installer yet. The only file needed to interact with the database is `find_a_bug_api.py`, which can be downloaded from this repository into your working directory and used as shown in the sections below.

In [1]:
!wget https://raw.githubusercontent.com/pipparichter/find-a-bug/master/find_a_bug_api.py

--2023-03-03 14:43:01--  https://raw.githubusercontent.com/pipparichter/find-a-bug/master/find_a_bug_api.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2629 (2.6K) [text/plain]
Saving to: ‘find_a_bug_api.py.1’


2023-03-03 14:43:01 (13.9 MB/s) - ‘find_a_bug_api.py.1’ saved [2629/2629]



## About the database

### Where is the data hosted?

The data is hosted on a Caltech machine running a Red Hat Linux distro, `microbes.gps.caltech.edu`. This remote host is only accessible when on Caltech campus wifi or using a VPN.
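Because the host is only visible from the campus network or the VPN, it can help to check connectivity before running a query. The snippet below is a minimal sketch using only the Python standard library. The helper name `can_reach` is hypothetical (it is not part of `find_a_bug_api.py`), and port 8000 is taken from the connection error traceback shown later in this README.

```python
import socket

# Host name from the text above; port 8000 is an assumption based on the
# traceback that appears later in this README.
HOST = "microbes.gps.caltech.edu"
PORT = 8000


def can_reach(host=HOST, port=PORT, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS lookup failures.
        return False

# Example: call can_reach() with no arguments to test the database host.
```

If `can_reach()` returns `False`, the most likely cause is that you are not on campus wifi or connected to the VPN.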

### How is the data stored?

The data is stored in multiple tables in a MariaDB SQL database. More information on the tables and what they contain is given in the following section.

### What information does the database contain?

Currently, the data is organized into three tables (although we will be adding more soon!). All data is from the Genome Taxonomy Database (GTDB) r207 release. **NOTE:** It is not necessary to know which information is found in which table to use the API; it's all handled!

1. `gtdb_r207_metadata`
2. `gtdb_r207_amino_acid_seqs`
3. `gtdb_r207_annotations_kegg`

## Querying the database

There is currently one all-purpose function for querying the Find-A-Bug database (`find_a_bug_api.get`), which will handle any query your heart desires. Eventually, I will add some canned functions that make frequently used queries easier to run.

```
def get(fields, where={}, verbose=True):
    '''

    args:
        : fields (str or list): Either a single field or a list of fields for
            which to retrieve information. 
    kwargs:
        : where (dict): Specifies search options. Some format options for the
            key, value pairs are as follows:
            (1) 'ko':'KO123'
            (2) 'ko':['KO123', 'KO456'] retrieves all fields where the KO group
              matches EITHER of the specified groups. 
            (3) 'threshold':('>', 500)
            (4) 'threshold':[('>', 500), ('<', 1000)]
    '''
```

In [2]:
# Import the get function from the api module. 
from find_a_bug_api import get

I'll show a few example queries below. Query time varies depending on which table is being queried. The `gtdb_r207_metadata` table, for example, is slow to query (for reasons which I am currently trying to sort out).

**Q:** What are all gene IDs that were annotated as either K03738 or K11389, with an HMM score above the predefined threshold? *NOTE: For reasons relating to upload time, only the annotations which met the predefined threshold are included in the database.*

In [4]:
# The 'ko' in the first argument does not need to be specified to pull out the 
# relevant gene IDs. This just specifies that both the KO group and gene_ids should be returned.
info = get(['gene_id', 'ko'], {'ko':['K03738', 'K11389']})

ConnectionError: HTTPConnectionPool(host='microbes.gps.caltech.edu', port=8000): Max retries exceeded with url: /gene_id+ko/ko=eq;K03738+ko=eq;K11389 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faeb6906100>: Failed to establish a new connection: [Errno 110] Connection timed out'))
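The URL in the traceback above hints at how the `fields` and `where` arguments are encoded into a request path. The snippet below is only a sketch reconstructed from that single URL, not the actual code in `find_a_bug_api.py`. The function name `sketch_url_path` is hypothetical, and only equality filters are handled, because the URL tokens for comparisons such as `('>', 500)` do not appear anywhere in this README.

```python
def sketch_url_path(fields, where=None):
    """Build a request path in the shape seen in the traceback (equality filters only)."""
    if isinstance(fields, str):
        fields = [fields]
    filters = []
    for field, value in (where or {}).items():
        # A single value and a list of values are treated the same way.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, tuple):
                # Comparison filters such as ('>', 500) are listed in the
                # docstring, but their URL encoding is not shown, so this
                # sketch does not guess at it.
                raise NotImplementedError("operator encoding is unknown")
            filters.append(f"{field}=eq;{v}")
    path = "/" + "+".join(fields)
    if filters:
        path += "/" + "+".join(filters)
    return path


print(sketch_url_path(["gene_id", "ko"], {"ko": ["K03738", "K11389"]}))
# prints /gene_id+ko/ko=eq;K03738+ko=eq;K11389
```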

In [11]:
print(info)

2 results in 2.1039409656077623 seconds

SELECT gtdb_r207_metadata.gtdb_species 
FROM gtdb_r207_metadata 
WHERE gtdb_r207_metadata.genome_id = ?

gtdb_species
s__JAANXF01 sp018263395



**Q:** What is the total number of genomes from the Enterobacteriaceae taxonomic family, according to the GTDB genome annotation? (TODO: It might be useful to add some kind of `count` function.)


#### A note on query structure

The SQL database manager accesses information using a "primary key." In the case of the `gtdb_r207_metadata` table, for example, the primary key is `genome_id`, which means the database manager can very efficiently access the table using genome IDs (a query like this takes about 2 seconds). However, if you instead wanted to look for all genome IDs which met a certain criterion, e.g. all genome IDs belonging to members of the genus `Rickettsia`, the manager would have to iterate over every one of the millions of rows in the table, checking each one to see if it meets the specified condition (belonging to the genus `Rickettsia`). This is very slow (about 2 hours).

I can make changes to the database so that these "inverted" queries are much faster. However, this can take up a lot of space on the `microbes.gps.caltech.edu` drive, so I don't want to do it for every possible scenario. If there are any particular cases which would be useful, or frequently used, please let me know!
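The difference between a primary-key lookup and an "inverted" query can be seen on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MariaDB, with a hypothetical `gtdb_genus` column and made-up genome IDs. It asks the database how it would run each query, then adds a secondary index, which is one common way to speed up this kind of query (whether that is the change planned for Find-A-Bug is an assumption).

```python
import sqlite3

# In-memory toy table; the column name gtdb_genus and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (genome_id TEXT PRIMARY KEY, gtdb_genus TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("genome_a", "g__Rickettsia"),
        ("genome_b", "g__Escherichia"),
        ("genome_c", "g__Rickettsia"),
    ],
)


def query_plan(sql, params):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


genus_sql = "SELECT genome_id FROM metadata WHERE gtdb_genus = ?"

# Lookup by primary key: the plan is an index SEARCH.
by_key = query_plan("SELECT gtdb_genus FROM metadata WHERE genome_id = ?", ("genome_a",))

# Filter on a non-key column: the plan is a full-table SCAN.
by_genus = query_plan(genus_sql, ("g__Rickettsia",))

# After adding a secondary index, the same filter becomes a SEARCH.
conn.execute("CREATE INDEX idx_genus ON metadata (gtdb_genus)")
by_genus_indexed = query_plan(genus_sql, ("g__Rickettsia",))

print("primary key lookup:", by_key)
print("genus filter, no index:", by_genus)
print("genus filter, with index:", by_genus_indexed)
```

The exact wording of the plans depends on the SQLite version, but the second query should be reported as a scan and the other two as searches. The index is what costs extra storage, which is the trade-off described above.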

## Feedback

### Error reports

### Feature requests
