amino_acid_additional_data and nucleotides_additional_data in contigs database #1419

Merged
merged 26 commits into from Apr 30, 2020


ekiefl
Contributor

@ekiefl ekiefl commented Apr 30, 2020

Wanted to post this now for review. I have so far only done toy data sets. Tomorrow I want to test on a life-sized contigs DB with a fake dataset. Any criticism invited.

@ekiefl ekiefl added contigs database misc data labels Apr 30, 2020
@ekiefl ekiefl requested a review from meren Apr 30, 2020
ekiefl added 3 commits Apr 30, 2020
variables could get hairy if people do something like "from
anvio.miscdata import MiscDataTableFactory"
…nalDataBaseClass and utilize it in AdditionalDataBaseClass.add.

This is necessary for the huge amount of info required to store this
information.
@ekiefl
Contributor Author

ekiefl commented Apr 30, 2020

Here is the dataset I'm testing on:

[image: the test dataset]

It was generated with this script after exporting gene_calls.txt from the infant gut dataset:

#! /usr/bin/env python

import pandas as pd
import numpy as np

df = pd.read_csv('gene_calls.txt', sep='\t')

colors = [
    'red',
    'green',
    'blue',
    'cyan',
    'yellow',
    'orange',
    'black',
    'indigo',
    'violet',
]

annotation = {
    'amino_acids': [],
    'interaction_potential': [],
    'probability!exposed': [],
    'probability!buried': [],
    'color': [],
}

def get_color():
    return np.random.choice(colors)

def get_interaction_pot():
    return np.random.rand()

def get_probability():
    x = np.random.rand()
    return x, 1-x

for _, row in df.iterrows():
    g = row['gene_callers_id']
    s = row['source']
    M = int(np.floor((row['stop'] - row['start']) / 3))

    for m in range(M):
        aa_id = '::'.join([str(x) for x in [s, g, m]])
        annotation['amino_acids'].append(aa_id)
        annotation['interaction_potential'].append(get_interaction_pot())
        x, y = get_probability()
        annotation['probability!exposed'].append(x)
        annotation['probability!buried'].append(y)
        annotation['color'].append(get_color())

annotation = pd.DataFrame(annotation)
annotation.to_csv('misc_aa.txt', sep='\t', index=False)
print(annotation)

This is a very large text file: 800MB. It contains 4 annotations for every amino acid position in a contigs database that is initially 58MB. After loading this data via anvi-import-misc-data, the contigs database climbs to 3.1GB. Much of this added weight is due to the long-form nature (37 million rows) of misc data storage.
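
To make the long-form point concrete, here is a minimal sketch of how a wide table shaped like misc_aa.txt expands when each annotation is stored as its own row. The item names, the values, and the `to_long` helper are made up for illustration and are not anvi'o code; the actual table layout in the contigs database may differ.

```python
# Toy wide-form rows shaped like misc_aa.txt (all names and values are made up).
wide_rows = [
    {'amino_acids': 'src::0::0', 'interaction_potential': 0.12,
     'probability!exposed': 0.3, 'probability!buried': 0.7, 'color': 'red'},
    {'amino_acids': 'src::0::1', 'interaction_potential': 0.87,
     'probability!exposed': 0.6, 'probability!buried': 0.4, 'color': 'blue'},
]

ID_COLUMN = 'amino_acids'


def to_long(rows, id_column):
    """Hypothetical helper: one (item, key, value) tuple per annotation per item."""
    return [
        (row[id_column], key, value)
        for row in rows
        for key, value in row.items()
        if key != id_column
    ]


long_rows = to_long(wide_rows, ID_COLUMN)

# Each wide row becomes as many long rows as there are annotation columns.
print(len(wide_rows), 'wide rows ->', len(long_rows), 'long rows')
```

With four annotation columns, the long form holds four rows per amino acid position, and each of those rows repeats the item name and the key.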

I think this is not the ideal way to store this data, but it uses the existing framework and it works well. I have additionally created a storage buffer so that the data is not all piled into the DB at once and memory is not an issue, even for imports many times larger than this one.
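
The buffering idea can be sketched roughly as below. This is not the code from this PR: the `BufferedWriter` class, the `misc_data` table and its columns, and the buffer size are all hypothetical, chosen only to show rows being flushed to SQLite in fixed-size batches rather than held in memory all at once.

```python
import sqlite3


class BufferedWriter:
    """Hypothetical sketch: collect rows and flush them to SQLite in batches."""

    def __init__(self, conn, table, buffer_size=10000):
        self.conn = conn
        self.table = table
        self.buffer_size = buffer_size
        self.buffer = []

    def add(self, row):
        # Hold the row in memory; write to disk only once the buffer is full.
        self.buffer.append(row)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Write any buffered rows in one batch, then empty the buffer.
        if not self.buffer:
            return
        self.conn.executemany(
            f'INSERT INTO {self.table} VALUES (?, ?, ?)', self.buffer)
        self.conn.commit()
        self.buffer = []


conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE misc_data (item_name TEXT, data_key TEXT, data_value TEXT)')

# A tiny buffer size so the batching is exercised on toy data.
writer = BufferedWriter(conn, 'misc_data', buffer_size=4)
for i in range(10):
    writer.add((f'src::0::{i}', 'color', 'red'))
writer.flush()  # write whatever is left after the loop

n_stored = conn.execute('SELECT COUNT(*) FROM misc_data').fetchone()[0]
print(n_stored)
```

In this sketch, peak memory is bounded by the buffer size rather than by the size of the import, which is the property the buffer is meant to provide.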

I think it's ready to merge. What do you think @meren?

@meren
Member

meren commented Apr 30, 2020

I think this is good to go, Evan!

Thanks for using a storage buffer to avoid memory issues later.

@ekiefl ekiefl merged commit b42bb03 into master Apr 30, 2020
@ekiefl ekiefl deleted the aa-data-table branch Apr 30, 2020