DatabaseTools is a package of three modules, AntigenicDatabase, TiterTable and CityList, designed to read and process the serum, antigen and results databases in the acorg repository. In short:

AntigenicDatabase reads the sera and antigens json datasets into AntigenDataset and SerumDataset objects.
Both classes derive from the parent class Dataset. Dataset provides deep searches and aliased searches, which find entries by partial names, names with typos, or properties such as passage type. The search functions support regular expressions.
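
To make the idea of searching by partial names and names with typos concrete, here is a minimal sketch. It is not the Dataset implementation: the function names `partial_name_search` and `typo_tolerant_search`, the `cutoff` value and the sample list are assumptions made for illustration, and only the standard library (`re`, `difflib`) is used.

```python
import difflib
import re

# Small sample of long names, copied from the output shown in this notebook.
LONG_NAMES = [
    "A/VIETNAM/1194/2004-NIBRG-14",
    "A/HONG-KONG/213/2003",
    "A/DUCK/HUNAN/795/2002",
]


def partial_name_search(query, names):
    """Return the names that contain every '/'-separated part of query, in order."""
    parts = [re.escape(part) for part in query.split("/") if part]
    # Allow arbitrary text between the parts, ignoring case.
    pattern = re.compile(".*".join(parts), flags=re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


def typo_tolerant_search(query, names, cutoff=0.8):
    """Return the names whose similarity to query is at least cutoff (hypothetical default)."""
    return difflib.get_close_matches(query, names, n=max(1, len(names)), cutoff=cutoff)


print(partial_name_search("VIETNAM/2004", LONG_NAMES))
print(typo_tolerant_search("A/HONG-KONG/213/2030", LONG_NAMES))
```

The regular expression handles partial names, while the similarity ratio catches small typos such as swapped digits. How the package combines such strategies internally is not shown here.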

TiterTable loads results.json files from the database. It also requires the antigens.json and sera.json
files, because it checks that the antigens and sera referenced in results.json are consistent with the entries stored in antigens.json and sera.json.
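
The kind of consistency check described here can be sketched as follows. This is only an illustration under assumptions: the key names `"id"`, `"antigen_ids"` and `"serum_ids"`, the helper names `missing_ids` and `check_consistency`, and the toy data are hypothetical and may not match the real json layout or the checks TiterTable actually runs.

```python
def missing_ids(referenced_ids, entries, id_key="id"):
    """Return the referenced ids that have no entry in entries, keeping their order."""
    known_ids = {entry[id_key] for entry in entries}
    return [ref for ref in referenced_ids if ref not in known_ids]


def check_consistency(table, antigens, sera):
    """Raise ValueError if the table references antigens or sera that are not stored."""
    problems = {
        "antigens": missing_ids(table["antigen_ids"], antigens),
        "sera": missing_ids(table["serum_ids"], sera),
    }
    problems = {kind: ids for kind, ids in problems.items() if ids}
    if problems:
        raise ValueError(f"ids missing from the datasets: {problems}")


# Toy data: the key names are assumptions, the ids are borrowed from this notebook.
antigens = [{"id": "14846I"}, {"id": "2U7GA8"}]
sera = [{"id": "8VCWN7"}]
table = {"antigen_ids": ["14846I"], "serum_ids": ["8VCWN7", "S8NG60"]}

try:
    check_consistency(table, antigens, sera)
except ValueError as error:
    print(error)
```

Extra entries in the antigen and serum datasets are harmless in this sketch. Only ids that the table references but the datasets lack are reported.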

CityList is a small module that keeps track of city names and abbreviations and also allows some aliased searching for city names.
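
As a rough picture of what looking up a city by name or abbreviation could involve, consider the sketch below. It does not use the CityList API: the table `CITY_ABBREVIATIONS`, its made-up entries and the functions `normalise` and `find_city` are all assumptions for illustration.

```python
import re

# Made-up entries for illustration; the real CityList contents may differ.
CITY_ABBREVIATIONS = {
    "HONG-KONG": "HK",
    "HO-CHI-MINH-CITY": "HCMC",
}


def normalise(text):
    """Upper-case text and collapse runs of spaces, underscores and hyphens into '-'."""
    return re.sub(r"[\s_-]+", "-", text.strip()).upper()


def find_city(query):
    """Return the canonical city name for a name or abbreviation, or None if unknown."""
    key = normalise(query)
    for name, abbreviation in CITY_ABBREVIATIONS.items():
        if key in (name, abbreviation):
            return name
    return None


print(find_city("hong kong"))
print(find_city("hcmc"))
```

Normalising the query before comparing lets differently spelled variants of the same city resolve to one canonical name.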

In [1]:
import sys
import os
# this will be used for suppressing some warning messages
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
        
with HiddenPrints():
    from AntigenicDatabase import AntigenDataset, SerumDataset
    from TiterTable import TiterTable  

#AntigenicDatabase
#To load a dataset, open the json file, parse it and pass the parsed json to the dataset class:
import json

with open('./tests/test_datasets/test_antigens.json', 'r') as fileobj:
    antigens_json = json.load(fileobj)
        
antigen_dataset = AntigenDataset(antigens_json)  

#let's get the entry with id ARTLF7
entries_found = antigen_dataset.get_entry("ARTLF7")
#the result is a list of antigens, each represented as (id, long name)
print('Entry with id ARTLF7 is')
print(entries_found)
print('')

#let's find the antigen named VIETNAM/2004
#its long name is actually A/VIETNAM/1194/2004-NIBRG-14,
#but aliased search takes care of that
entries_found = antigen_dataset.aliased_search("VIETNAM/2004")
print('Entry with name VIETNAM/2004 is')
print(entries_found)
print('')

#let's find all antigens that have been passaged in SIAT
entries_found = antigen_dataset.deep_search("SIAT")
print('Entries with passage SIAT are')
print(entries_found)



Entry with id ARTLF7 is
[(ARTLF7, A/CHICKEN/INDIA/NIV33487/2006-RG-7)]

Entry with name VIETNAM/2004 is
[(14846I, A/VIETNAM/1194/2004-NIBRG-14)]

Entries with passage SIAT are
[(14846I, A/VIETNAM/1194/2004-NIBRG-14), (2U7GA8, A/HONG-KONG/213/2003), (NK4KU7, A/CAMBODIA/R0405050/2007-NIBRG-88), (CXSYDG, A/DUCK/HUNAN/795/2002), (SL0DT0, A/INDONESIA/05/2005-IBCDCRG-2), (I4Z8K3, A/INDONESIA/CDC357/2006), (ARTLF7, A/CHICKEN/INDIA/NIV33487/2006-RG-7), (9FJOVM, A/COMMON-MAGPIE/HONG-KONG/5052/2007), (7N90YJ, A/BHG/MONGOLIA/X53/2009), (7CJ09K, A/HUBEI/1/2010-IBCDCRG-30)]


In [7]:
#Now let's load a titer table. For this we need to load results, antigens and sera.
#TiterTable accepts dictionaries, so if the results file is a list, you have to pass
#the element of the list you want to analyse. The results file can have a structure
#where it is a list of dictionaries, each of which contains a results key that
#in turn maps to a list of dictionaries. It is one of these innermost dictionaries
#that should be loaded into the titer table.

with open('./tests/test_datasets/test_antigens.json', 'r') as fileobj:    
    antigens_json = json.load(fileobj)
    
with open('./tests/test_datasets/test_antisera.json', 'r') as fileobj:    
    sera_json = json.load(fileobj)   
    
with open('./tests/test_datasets/test_results.json', 'r') as fileobj:    
    titer_table_json = json.load(fileobj)   

#If there are sera from the same strain, it gives a warning indicating that the table
#could do with averaging (not implemented yet). It also automatically performs a host of
#other health checks on all the datasets involved. If the sera_json and antigens_json
#datasets contain more antigens and sera than the titer table, only the corresponding
#subsets are taken into account. If any antigen or serum in the titer table does not
#occur in sera_json or antigens_json, it will throw an error.

titer_table = TiterTable(titer_table_json[0]['results'][0], sera_json, antigens_json)
antigen_ids = titer_table.antigen_ids
serum_ids = titer_table.serum_ids

#One can get the titer table as a dataframe, with some options, using the
#to_df function. Extra rows and columns can be added to this table as
#dictionaries, as follows. For instance, we add the antigen and serum ids below.
#We also leave thresholded titers as they are. The returned values
#are the dataframe and the numerical raw data in the dataframe.

df, raw_data = titer_table.to_df(thresholded=True,
                                 extra_rows={'Id': serum_ids},
                                 extra_columns={'Id': antigen_ids})


print(df.iloc[0:5,0:2])
print('')
#This example demonstrates how to add extra rows and columns; however,
#ids and serum strain ids can be added automatically by setting add_ids=True and
#add_serum_strain_ids=True. Moreover, the displayed serum and antigen names can also
#be specified with the options antigen_names=list and serum_names=list, as demonstrated
#below:
antigen_names = [str(x) for x in range(10)]
serum_names = [str(x) for x in 'ABCDEFGHIJ']

df, raw_data = titer_table.to_df(thresholded=True, antigen_names=antigen_names,
                                 serum_names=serum_names, add_ids=True)

print(df.iloc[0:5,0:5])


There may be repeated measurements in this dataset (repeated serum strain ids).
                                       Id A/ANHUI/1/2005
Id                                    NaN         8VCWN7
A/VIETNAM/1194/2004-NIBRG-14       14846I            <10
A/HONG-KONG/213/2003               2U7GA8            <10
A/CAMBODIA/R0405050/2007-NIBRG-88  NK4KU7            <10
A/DUCK/HUNAN/795/2002              CXSYDG            <10

        id       A       B       C       D
id     NaN  8VCWN7  S8NG60  IJ4HOU  77POTS
0   14846I     <10     <10     <10     <10
1   2U7GA8     <10     <10     <10     <10
2   NK4KU7     <10     <10     <10     <10
3   CXSYDG     <10     <10     <10     <10
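
The nested layout of the results file described in the comments above (a list of dictionaries, each with a results key that maps to a list of table dictionaries) can be walked with a small helper. This is a sketch on toy data: the helper `iter_tables` and the `"name"` key are hypothetical, and only the `"results"` key is taken from the description above.

```python
def iter_tables(results_json):
    """Yield (outer index, inner index, table dict) for every table in the parsed file."""
    for outer_index, group in enumerate(results_json):
        for inner_index, table in enumerate(group["results"]):
            yield outer_index, inner_index, table


# Toy stand-in for the parsed results file; the "name" key is hypothetical.
results_json = [
    {"results": [{"name": "table-0-0"}, {"name": "table-0-1"}]},
    {"results": [{"name": "table-1-0"}]},
]

for outer_index, inner_index, table in iter_tables(results_json):
    print(outer_index, inner_index, table["name"])
```

The pair of indices identifies the dictionary to pass on, in the same way that `titer_table_json[0]['results'][0]` selects the first table in the cell above.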
