amino_acid_additional_data and nucleotides_additional_data in contigs database#1419

Merged
ekiefl merged 26 commits into master from aa-data-table on Apr 30, 2020
Conversation

@ekiefl
Contributor

@ekiefl ekiefl commented Apr 30, 2020

Wanted to post this now for review. I have so far only done toy data sets. Tomorrow I want to test on a life-sized contigs DB with a fake dataset. Any criticism invited.

@ekiefl ekiefl added the "contigs database" and "misc data" (Issues related to miscdata.py: items, layers, amino_acids, nucleotides, layer_orders, etc.) labels Apr 30, 2020
@ekiefl ekiefl requested a review from meren April 30, 2020 00:23
ekiefl added 3 commits April 30, 2020 12:02
variables could get hairy if people do something like "from
anvio.miscdata import MiscDataTableFactory"
…nalDataBaseClass and utilize it in AdditionalDataBaseClass.add.

This is necessary for the huge amount of info required to store this
information.
@ekiefl
Contributor Author

ekiefl commented Apr 30, 2020

Here is the dataset I'm testing on:

[image]

It was generated with this script after exporting gene_calls.txt from the infant gut dataset:

#! /usr/bin/env python

import pandas as pd
import numpy as np

df = pd.read_csv('gene_calls.txt', sep='\t')

colors = [
    'red',
    'green',
    'blue',
    'cyan',
    'yellow',
    'orange',
    'black',
    'indigo',
    'violet',
]

annotation = {
    'amino_acids': [],
    'interaction_potential': [],
    'probability!exposed': [],
    'probability!buried': [],
    'color': [],
}

def get_color():
    return np.random.choice(colors)

def get_interaction_pot():
    return np.random.rand()

def get_probability():
    x = np.random.rand()
    return x, 1-x

for _, row in df.iterrows():
    g = row['gene_callers_id']
    s = row['source']
    # Number of whole codons spanned by the gene call
    M = int(np.floor((row['stop'] - row['start']) / 3))

    for m in range(M):
        # Amino acid ID has the form source::gene_callers_id::codon_index
        aa_id = '::'.join([str(x) for x in [s, g, m]])
        annotation['amino_acids'].append(aa_id)
        annotation['interaction_potential'].append(get_interaction_pot())
        # The two probabilities are complementary and sum to 1
        x, y = get_probability()
        annotation['probability!exposed'].append(x)
        annotation['probability!buried'].append(y)
        annotation['color'].append(get_color())

annotation = pd.DataFrame(annotation)
annotation.to_csv('misc_aa.txt', sep='\t', index=False)
print(annotation)

This is a very large text file: 800MB. It contains 4 annotations for every amino acid position in a contigs database that is initially 58MB. After loading this data via anvi-import-misc-data, the contigs database climbs to 3.1GB. Much of this added weight is due to the long-form nature (37 million rows) of misc data storage.
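
To illustrate why long-form storage multiplies the row count, here is a minimal sketch using pandas. It is only an illustration: the column names `data_key` and `data_value` and the three example rows are assumptions made for this sketch, not necessarily the schema the contigs database uses.

```python
import pandas as pd

# A tiny wide-form table shaped like misc_aa.txt (values are made up)
wide = pd.DataFrame({
    'amino_acids': ['src::0::0', 'src::0::1', 'src::0::2'],
    'interaction_potential': [0.1, 0.5, 0.9],
    'probability!exposed': [0.2, 0.7, 0.4],
    'probability!buried': [0.8, 0.3, 0.6],
    'color': ['red', 'green', 'blue'],
})

# Long form: one row per (amino acid, annotation key) pair.
# 'data_key' and 'data_value' are hypothetical column names.
long_df = wide.melt(id_vars='amino_acids', var_name='data_key', value_name='data_value')

print(len(wide), len(long_df))
```

With 4 annotation columns, every amino acid position becomes 4 rows in long form, so the row count grows in proportion to the number of keys.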

I don't think this is the ideal way to store this data, but it uses the existing framework and it works well. I have additionally created a storage buffer so that the data is not piled into the DB all at once and memory is not an issue, even for imports many times larger than this one.
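
The general idea of a storage buffer can be sketched roughly as follows. This is not the code in this PR: the table name `misc_data`, its columns, and the `buffered_insert` helper are all hypothetical, and the sketch uses Python's built-in sqlite3 module directly rather than anvi'o's own database layer.

```python
import sqlite3

def buffered_insert(conn, rows, buffer_size=1000):
    """Write rows in chunks so that at most buffer_size rows sit in memory.

    Hypothetical helper for illustration only.
    """
    buffer = []
    total = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Flush the full buffer to disk and start a new one
            conn.executemany('INSERT INTO misc_data VALUES (?, ?, ?)', buffer)
            conn.commit()
            total += len(buffer)
            buffer = []
    if buffer:
        # Flush whatever is left over after the last full chunk
        conn.executemany('INSERT INTO misc_data VALUES (?, ?, ?)', buffer)
        conn.commit()
        total += len(buffer)
    return total

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE misc_data (item_name TEXT, data_key TEXT, data_value TEXT)')

# A generator stands in for a large input file: rows are produced lazily
rows = ((f'src::0::{i}', 'color', 'red') for i in range(2500))
n_written = buffered_insert(conn, rows, buffer_size=1000)
row_count = conn.execute('SELECT COUNT(*) FROM misc_data').fetchone()[0]
print(n_written, row_count)
```

Because the input is consumed lazily and flushed in fixed-size chunks, peak memory depends on the buffer size rather than on the size of the import.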

I think it's ready to merge. What do you think @meren?

@meren
Member

meren commented Apr 30, 2020

I think this is good to go, Evan!

Thanks for using a storage buffer to avoid memory issues later.

@ekiefl ekiefl merged commit b42bb03 into master Apr 30, 2020
@ekiefl ekiefl deleted the aa-data-table branch April 30, 2020 19:00
Labels

contigs database, misc data (Issues related to miscdata.py: items, layers, amino_acids, nucleotides, layer_orders, etc.)

2 participants