Heavily commented, quick script for processing sequence files for length and basepair composition

naturepoker/sequence-counter

sequence-counter

This is a pure bash script that takes a sequence file and reports its base composition and its total length, both gapped and ungapped. It is meant to be a portable educational tool for people just learning bash scripting (such as myself).

It's not the best-optimized script by any stretch of the imagination, but it's simple enough that all of its components should be useful to any amateur researcher looking for simple, practical code examples.
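As a rough illustration of the kind of counting described above, here is a minimal pure-bash sketch. It is not the code in seq_counter.sh: the function name `count_bases` is hypothetical, and it assumes that "gapped" length means every character on the sequence lines while "ungapped" length means only A, T, C and G.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the repository's seq_counter.sh.
# Assumption: "gapped" length counts every sequence character,
# "ungapped" length counts only A, T, C and G (either case).

count_bases() {
    local file=$1 line only
    local a=0 t=0 c=0 g=0 gapped=0

    # Read line by line; the || clause keeps a last line with no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}               # drop a Windows carriage return, if any
        [[ $line == '>'* ]] && continue  # skip FASTA header lines

        gapped=$(( gapped + ${#line} ))

        # Delete everything except one base, then measure what is left.
        only=${line//[^Aa]/}; a=$(( a + ${#only} ))
        only=${line//[^Tt]/}; t=$(( t + ${#only} ))
        only=${line//[^Cc]/}; c=$(( c + ${#only} ))
        only=${line//[^Gg]/}; g=$(( g + ${#only} ))
    done < "$file"

    printf 'A=%d T=%d C=%d G=%d gapped=%d ungapped=%d\n' \
        "$a" "$t" "$c" "$g" "$gapped" "$(( a + t + c + g ))"
}
```

After sourcing the function it can be called as `count_bases control_100bp.fasta`. Parameter expansion keeps everything inside the shell, with no calls to grep, tr or wc, which suits a learning exercise even though it is slow on large genomes.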

For reference, on a netbook with a Celeron N3060 processor and 4 GB of RAM running Lubuntu 20.04, processing human chromosome 1 (GRCh38.p13, NC_000001.11, a file of about 240.8 MB) takes the following time from start to finish:

real 3m46.775s user 3m23.530s sys 0m16.992s

Agrobacterium tumefaciens strain GCF_900045375.1 takes the following time from start to finish:

real 0m5.423s user 0m4.901s sys 0m0.467s

The repo contains a 100 bp positive-control FASTA file, control_100bp.fasta, generated by a DNA synthesis script from https://github.com/naturepoker/dna-synth. Running the command below should produce the following output.

./seq_counter.sh control_100bp.fasta

##################################################
Processing control_100bp.fasta
##################################################
                                                  
                                                  
                                                  
##################################################
     Total sequence composition is as follows     
--------------------------------------------------
     18 A
     27 T
     29 C
     26 G
--------------------------------------------------
Total gapped sequence length is: 100
--------------------------------------------------
Total ungapped sequence length is: 100
--------------------------------------------------
GC content in control_100bp.fasta is 55.00 %                 
##################################################
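The GC figure in the output above follows from the four base counts: (G + C) divided by the ungapped total, which is (26 + 29) / 100. Bash only does integer arithmetic, so one way to get two decimal places is to scale the value up before dividing. The sketch below shows that approach; the function name `gc_percent` is hypothetical, and the actual script may compute the value differently (for example with bc or awk).

```bash
#!/usr/bin/env bash
# Hypothetical sketch: GC percentage from four base counts, integer math only.
# The result is truncated, not rounded, to two decimal places.

gc_percent() {
    local a=$1 t=$2 c=$3 g=$4
    local total=$(( a + t + c + g ))

    # Guard against dividing by zero on an empty sequence.
    (( total > 0 )) || { echo "no bases counted" >&2; return 1; }

    # Scale by 10000 so the last two digits hold the decimal part.
    local scaled=$(( (g + c) * 10000 / total ))
    printf '%d.%02d\n' "$(( scaled / 100 ))" "$(( scaled % 100 ))"
}

gc_percent 18 27 29 26   # counts from the control output above; prints 55.00
```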
