Does not appear to work on FASTA files #1

Closed
alexpreynolds opened this Issue Sep 5, 2014 · 9 comments

alexpreynolds commented Sep 5, 2014

I ran into warnings compiling the bioinfo pattern file:

$ file -C -m bioinfo
bioinfo, 7: Warning: no need to escape `#'
bioinfo, 7: Warning: no need to escape `#'
bioinfo, 7: Warning: no need to escape `.'

Here's the version I'm using:

$ file --version
file-5.04

I tried running file on the compiled patterns with a few different FASTA inputs:

test1.fa

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

test2.fa

>Foo
ACGTTGTCcggctT
>Bar
TTTTGACCATTCCC

test3.fa (same as your repo's test input)

>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
>ref2
aggttttataaaacaattaagtctacagagcaactacgcg

Here are results:

$ file -z -m bioinfo.mgc test1.fa
test1.fa: ASCII text
$ file -z -m bioinfo.mgc test2.fa
test2.fa: ASCII text
$ file -z -m bioinfo.mgc test3.fa
test3.fa: ASCII text

Owner

lindenb commented Sep 5, 2014

Thank you, Alex. I'm currently running a newer version of file:

$ file -v
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

I wonder whether this only comes down to some newer features. How could I test this?

Owner

lindenb commented Sep 5, 2014

I've added your two examples in my repo: 382f34a

$ file -z -m bioinfo.mgc test/*.fa
test/fasta01.fa: Fasta sequence, 
test/fasta02.fa: Fasta sequence, 
test/fasta03.fa: Fasta sequence, 

alexpreynolds commented Sep 5, 2014

I compiled the latest version of file (version 5.19) from the file GitHub mirror:

$ git clone https://github.com/file/file.git
$ cd file
$ libtoolize
$ aclocal
$ autoheader
$ autoreconf -f -i
$ ./configure --prefix=/home/areynolds/opt; make; make install

This gives me the following binary:

$ /home/areynolds/opt/bin/file -v
file-5.19
magic file from /home/areynolds/opt/share/misc/magic

I recompiled the patterns:

$ /home/areynolds/opt/bin/file -C -m bioinfo
bioinfo, 7: Warning: no need to escape `#'
bioinfo, 7: Warning: no need to escape `#'
bioinfo, 7: Warning: no need to escape `.'

Unfortunately, no luck using file on my test inputs. For example, on your original repo input, I get the same result as before:

$ /home/areynolds/opt/bin/file -z -m bioinfo.mgc test3.fa
test3.fa: ASCII text

I love the idea of using file for this though! I'll try to dig some more and see what I can find out.

alexpreynolds commented Sep 5, 2014

I'm running this under RHEL6:

$ uname -a
Linux foo 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

Tools were built with fairly modern kit:

$ gcc --version
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
...

Owner

lindenb commented Sep 5, 2014

Thanks for testing. I removed the '#' (got the same warnings too).

$ uname -a
Linux okazaki 3.2.0-65-generic #99-Ubuntu SMP Fri Jul 4 21:04:27 UTC 2014 i686 i686 i386 GNU/Linux

Does it work with the SAM file?

alexpreynolds commented Sep 5, 2014

It does work on SAM:

$ cat > foo.sam
@HD     VN:1.0 SO:coordinate
@SQ     SN:seq1 LN:5000
@SQ     SN:seq2 LN:5000
@CO     Example of SAM/BAM file format.
B7_591:4:96:693:509     73      seq1    1       99      36M     *       0       0       CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG    <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7    MF:i:18 Aq:i:73 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS54_65:7:152:368:113  73      seq1    3       99      35M     *       0       0       CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT     <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):     MF:i:18 Aq:i:66 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS51_64:8:5:734:57     137     seq1    5       99      35M     *       0       0       AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC     <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5     MF:i:18 Aq:i:66 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
B7_591:1:289:587:906    137     seq1    6       63      36M     *       0       0       GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT    (-&----,----)-)-),'--)---',+-,),''*,    MF:i:130        Aq:i:63 NM:i:5  UQ:i:38 H0:i:0  H1:i:0
EAS56_59:8:38:671:758   137     seq1    9       99      35M     *       0       0       GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG     <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%     MF:i:18 Aq:i:72 NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS56_61:6:18:467:281   73      seq1    13      99      35M     *       0       0       ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA     <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666     MF:i:18 Aq:i:39 NM:i:1  UQ:i:5  H0:i:0  H1:i:1

$ /home/areynolds/opt/bin/file -z -m bioinfo.mgc foo.sam
foo.sam: SAM file v1.0 sorted on coordinates

Owner

lindenb commented Sep 5, 2014

Cool. So it could be a regex problem. Did I use non-standard regex syntax?

I'll check this later.

:-)

alexpreynolds commented Sep 5, 2014

Whoops, I'm an idiot! I didn't have trailing newlines on my test FASTA inputs:

[areynolds@foo ~]$ cat -A test3.fa
>ref$
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT$
>ref2$
aggttttataaaacaattaagtctacagagcaactacgcg[areynolds@foo ~]

Once I added the trailing newline, things work correctly:

[areynolds@foo ~]$ /home/areynolds/opt/bin/file -z -m bioinfo.mgc test3.fa
test3.fa: Fasta DNA sequence, ASCII text

Sorry for the false alarm.
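
For anyone who runs into the same thing, here is a minimal sketch of how the missing trailing newline could be detected and fixed from the shell. The function names `missing_final_newline` and `ensure_final_newline` are made up for illustration and are not part of file or of this repo. It assumes plain text input and a POSIX-like `sh` with `tail -c`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of file(1) or of the bioinfo magic repo.

# Succeeds (exit status 0) when the file is non-empty and its last byte is
# not a newline. Command substitution strips trailing newlines, so the
# captured string is empty exactly when the last byte is "\n".
missing_final_newline() {
    [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]
}

# Appends a single newline only if one is missing; otherwise the file is
# left untouched, so calling it repeatedly is safe.
ensure_final_newline() {
    if missing_final_newline "$1"; then
        printf '\n' >> "$1"
    fi
}

# Report, without modifying anything, every file passed on the command
# line that lacks a final newline.
for f in "$@"; do
    if missing_final_newline "$f"; then
        printf '%s: no trailing newline\n' "$f"
    fi
done
```

Saved under a name such as `check_newline.sh` (again just an example name), running `sh check_newline.sh test*.fa` would list the inputs worth fixing before calling `file -z -m bioinfo.mgc` on them. Reporting first and fixing only on request avoids silently rewriting someone's sequence files.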

Owner

lindenb commented Sep 5, 2014

Cool. Thanks for testing anyway! :-)
