amino_acid_additional_data and nucleotides_additional_data in contigs database #1419
self.db_path will not be None if it does not raise Error
variables could get hairy if people do something like "from anvio.miscdata import MiscDataTableFactory"
…nalDataBaseClass and utilize it in AdditionalDataBaseClass.add. This is necessary for the huge amount of info required to store this information.
Here is the dataset I'm testing on. It was generated with this script, which reads the exported gene_calls.txt:

#!/usr/bin/env python
import pandas as pd
import numpy as np

# Gene calls table; must contain gene_callers_id, source, start and stop columns
df = pd.read_csv('gene_calls.txt', sep='\t')

colors = [
    'red',
    'green',
    'blue',
    'cyan',
    'yellow',
    'orange',
    'black',
    'indigo',
    'violet',
]

# One list per output column; 'amino_acids' holds the item names
annotation = {
    'amino_acids': [],
    'interaction_potential': [],
    'probability!exposed': [],
    'probability!buried': [],
    'color': [],
}


def get_color():
    return np.random.choice(colors)


def get_interaction_pot():
    return np.random.rand()


def get_probability():
    # Two complementary probabilities that sum to 1
    x = np.random.rand()
    return x, 1 - x


for _, row in df.iterrows():
    g = row['gene_callers_id']
    s = row['source']

    # Number of codon positions in the gene
    M = int(np.floor((row['stop'] - row['start']) / 3))

    for m in range(M):
        # Item name has the form source::gene_callers_id::codon_index
        aa_id = '::'.join([str(x) for x in [s, g, m]])
        annotation['amino_acids'].append(aa_id)
        annotation['interaction_potential'].append(get_interaction_pot())
        x, y = get_probability()
        annotation['probability!exposed'].append(x)
        annotation['probability!buried'].append(y)
        annotation['color'].append(get_color())

annotation = pd.DataFrame(annotation)
annotation.to_csv('misc_aa.txt', sep='\t', index=False)
print(annotation)

This is a very large text file: 800MB. It contains 4 annotations for every amino acid position in a contigs database that is initially 58MB. After loading in this data via …

I think this is not exactly the most ideal way to store this data, but it uses the existing framework and it works well. I have additionally created a storage buffer so that the data is not piled into the DB all at once and memory is not an issue, even for imports many times larger than this one. I think it's ready to merge. What do you think, @meren?
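For readers unfamiliar with the storage-buffer idea, here is a minimal sketch of how buffered inserts into SQLite can keep memory bounded. It is not the code in this PR: the `BufferedInserter` class, the `aa_additional_data` table name, and its three columns are all hypothetical, chosen only for illustration.

```python
import sqlite3


class BufferedInserter:
    """Hypothetical helper: collect rows in memory and write them in batches."""

    def __init__(self, conn, table, num_columns, buffer_size=50_000):
        self.conn = conn
        placeholders = ','.join('?' * num_columns)
        self.sql = f'INSERT INTO {table} VALUES ({placeholders})'
        self.buffer_size = buffer_size
        self.buffer = []
        self.rows_written = 0

    def add(self, row):
        # Hold the row in memory; write to the DB only once the buffer is full
        self.buffer.append(row)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Write all buffered rows in one executemany call, then empty the buffer
        if self.buffer:
            self.conn.executemany(self.sql, self.buffer)
            self.conn.commit()
            self.rows_written += len(self.buffer)
            self.buffer = []


# Toy usage with an in-memory database and an assumed table layout
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE aa_additional_data (item_name TEXT, data_key TEXT, data_value TEXT)'
)

writer = BufferedInserter(conn, 'aa_additional_data', num_columns=3, buffer_size=4)
for m in range(10):
    writer.add((f'prodigal::0::{m}', 'color', 'red'))
writer.flush()  # write whatever is still left in the buffer

num_rows = conn.execute('SELECT COUNT(*) FROM aa_additional_data').fetchone()[0]
```

With this pattern, at most `buffer_size` rows are held in memory at any time, so peak memory use depends on the buffer size rather than on the size of the input file.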
I think this is good to go, Evan! Thanks for using a storage buffer to avoid memory issues later.

I wanted to post this now for review. So far I have only tested on toy data sets. Tomorrow I want to test on a life-sized contigs DB with a fake dataset. Any criticism is welcome.