Clarify fasta header naming with uc to biom #3

Open
dridk opened this issue Aug 22, 2016 · 2 comments

dridk commented Aug 22, 2016

I'm trying to run a simple test, but I don't understand how FASTA headers are processed.
For example, I have one sample file, test.fa, with the following reads:

>A_sample1
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample2
ATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATATATGGTCGTATATAT
>A_sample3
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample4
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample5
ATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATAT
>A_sample6
ATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATATGCGGTGTAATACGTGTATGATAT
>A_sample7
AGAACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample8
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample9
AGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACAAGATACA
>A_sample10
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample11
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGTGTCGTATAT
>A_sample12
ATGGTCGTGTCGTGTCGTGTCGTATATATATCGGTCGTGTCGTGTCGTGTCGTGTCGTATGTCGTGTCGTGTCGT

I cluster them using:

vsearch --cluster_fast test.fa --id 0.97 --centroids centroids.fa --sizeout --uc test.uc --relabel_sha1 --relabel_keep

Now I want to convert the result to BIOM using your script:

python create_otu_table_from_uc_file.py -i test.uc -o test.biom

I get the following error:

Error in uc file formating. Check for spaces in sample IDs and to make sure there is a semicolon after sample IDs.
First line with issue:
S       0       84      *       *       *       *       *       A1      *
100.0%
Writing table...

I think the FASTA headers need to follow some rule, but I don't know which one. Could you give me a simple example to help me understand?
Thanks

Owner

leffj commented Aug 22, 2016

Hi, good question. The FASTA header needs to include a string of the form
';barcodelabel=SAMPLEID;'. For example:

>M01918:213:000000000-AFC1C:1:1101:15775:1331 1:N:0:0;barcode=TAAATATACCCT;barcodelabel=cp83;
TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCCTTTGGGACTGCATGGCTAGAGTGTGTCAGAGGGGGGTAGAATTCCAAGTGTAGCAGTGTAATGCGTAGATATGTGGGGGAATACCGATGGCGGAGGCAGCCCCCTGGGCAGATACTGACGCTCAGGCACGAAAGCCTGGGGAGCAAACA

where ‘cp83’ is the sample ID.
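To make the convention concrete, here is a minimal sketch of how a sample ID could be read out of such a label. The regular expression and the function name `sample_id_from_label` are assumptions for illustration; this is not taken from create_otu_table_from_uc_file.py, whose actual parsing may differ.

```python
import re

# Assumed pattern: the sample ID is the text between 'barcodelabel=' and the
# next ';', and it contains no spaces or semicolons.
BARCODELABEL_RE = re.compile(r";barcodelabel=([^;\s]+);")


def sample_id_from_label(label):
    """Return the sample ID embedded in a read label, or None if it is absent."""
    match = BARCODELABEL_RE.search(label)
    return match.group(1) if match else None


good = ("M01918:213:000000000-AFC1C:1:1101:15775:1331 1:N:0:0;"
        "barcode=TAAATATACCCT;barcodelabel=cp83;")
print(sample_id_from_label(good))         # cp83
print(sample_id_from_label("A_sample1"))  # None
```

A label such as A_sample1 carries no barcodelabel field, so under this assumed pattern there is nothing to extract from it.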

This formatting comes from the prep_fastq_for_uparse_paired.py script, fyi.
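If the reads were not produced by that script, the headers can be relabelled before clustering. The sketch below is one possible way to do it, assuming every read in the file belongs to a single sample; the function name `add_barcodelabel` and the sample ID "A" are hypothetical.

```python
def add_barcodelabel(fasta_lines, sample_id):
    """Yield FASTA lines with ';barcodelabel=<sample_id>;' appended to each header.

    Assumes all reads in the input belong to the same sample and that
    sample_id contains no spaces or semicolons.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: append the sample ID field.
            yield f"{line};barcodelabel={sample_id};"
        else:
            # Sequence line: pass through unchanged.
            yield line


reads = [">A_sample1", "AGATACAAGATACA", ">A_sample2", "ATGGTCGTATATAT"]
for out_line in add_barcodelabel(reads, "A"):
    print(out_line)
```

With real data the same generator can be fed an open file handle and its output written to a new FASTA file, which is then passed to the clustering step.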

Jon

bioinfo17 commented

Hi,

I have a .uc file in the format below:

H 205 339 98.5 + 0 0 339M B3::M02542:85:000000000-BWJ73:1:1102:22965:2274 OTU_206
H 547 339 98.5 + 0 0 339M B13::M02542:85:000000000-BWJ73:1:2116:22473:4007 OTU_548
H 436 339 97.6 + 0 0 D338M B14::M02542:85:000000000-BWJ73:1:1116:19896:20825 OTU_437
H 127 339 98.8 + 0 0 339M B9::M02542:85:000000000-BWJ73:1:1118:22070:17406 OTU_128
H 200 337 99.1 + 0 0 I337M B3::M02542:85:000000000-BWJ73:1:1116:13763:3215 OTU_201
H 174 339 98.8 + 0 0 339M B15::M02542:85:000000000-BWJ73:1:1115:12758:8719 OTU_175
N * * * . * * * B6::M02542:85:000000000-BWJ73:1:1117:9645:18835 *
H 137 328 99.1 + 0 0 328M11I B12::M02542:85:000000000-BWJ73:1:2103:20919:8080 OTU_138
H 443 335 100.0 + 0 0 335M4I B12::M02542:85:000000000-BWJ73:1:1103:27262:12348 OTU_444

I get the following error:

Error in uc file formating. Check for spaces in sample IDs and to make sure there is a semicolon after sample IDs.
First line with issue:
H 349 338 99.4 + 0 0 261MI77M B1::M02542:85:000000000-BWJ73:1:1OTU_35022:9749 1:N:0:TAGCTT

I'm finding it hard to convert the .uc file to an OTU table text file. Would you please be able to modify the script create_otu_table_from_uc_file.py to accommodate user-specific needs?

Any help will be much appreciated, thanks in advance.
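As an alternative to changing the script, the query labels in the .uc file could be rewritten into the ';barcodelabel=SAMPLEID;' form described earlier in this thread. The sketch below assumes the usual tab-separated, 10-column .uc layout with the query label in column 9, and assumes the sample ID is the text before the first '::'. The function name `relabel_uc_line` is hypothetical, and whether create_otu_table_from_uc_file.py accepts the rewritten file has not been verified here.

```python
def relabel_uc_line(line, sep="::"):
    """Append ';barcodelabel=<sample>;' to the query label of one .uc record.

    Assumes tab-separated fields with the query label in column 9 and the
    sample ID stored before the first occurrence of `sep`.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        # Not a standard 10-column record: leave it untouched.
        return line.rstrip("\n")
    query = fields[8]
    if sep in query and "barcodelabel=" not in query:
        sample_id = query.split(sep, 1)[0]
        fields[8] = f"{query};barcodelabel={sample_id};"
    return "\t".join(fields)


record = "\t".join([
    "H", "205", "339", "98.5", "+", "0", "0", "339M",
    "B3::M02542:85:000000000-BWJ73:1:1102:22965:2274", "OTU_206",
])
print(relabel_uc_line(record))
```

Applying the function to every line of the file and writing the results to a new .uc file leaves the original untouched, so the conversion can be retried if the assumed label layout turns out to be wrong.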
