Note: This guide provides an overview of useful one-line UNIX commands and commonly used analyses in bioinformatics. It is intended for a beginner audience and is continually growing. Suggestions for improving this resource are more than welcome. Please drop me a line if you would like to contribute (GitHub, webpage). Thank you!
- Unix under MacOSX
- Unix under Windows
- Dummy test files
- Basic Awk & Sed
- DNA, words
- FASTA/Q handling
- More resources
- Mac OS X comes stocked with an application named Terminal. You can find it under Finder → Applications → Utilities → Terminal. If you expect to use the Terminal a lot, drag the Terminal icon from the Finder window onto the Dock. You can then launch Terminal with a single click.
- For convenience, you should then set up a package manager to install new software within the Terminal.
- You can access the power of the Unix shell under Microsoft Windows by installing Cygwin. Most of the things described in this document will work out of the box.
- On Windows 10, you can use Windows Subsystem for Linux (WSL), which provides a familiar Bash environment with Unix command line utilities.
- If you mainly want to use GNU developer tools (such as GCC) on Windows, consider MinGW and its MSYS package, which provides utilities such as bash, gawk, make and grep. MSYS has fewer features than Cygwin. MinGW is particularly useful for creating native Windows ports of Unix tools.
- You can also install Homebrew on Windows Subsystem for Linux (WSL).
For convenience, I am providing test files for all the commands below. These comprise a FASTA file in.fa, a numeric CSV file in.csv, an ID file with FASTA headers ID.txt, a FASTQ file in.fq, and a motifs FASTA file motifs.fa. Open Terminal (on Mac OS X) or, for example, Cygwin (on Windows) and have fun!
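Print fields 1, 3, and 7 of every line (awk splits on whitespace by default; add -F',' for comma-separated input):
awk '{print $1,$3,$7}' in.csv
Print lines 1, 3, and 7:
awk 'NR == 1 || NR == 3 || NR == 7' in.csv
Count the number of unique lines:
sort -u in.csv | wc -l
Print every 5th line (lines 5, 10, 15, ...):
awk 'NR % 5 == 0' in.csv
Print every 5th line, starting from line 3 (lines 3, 8, 13, ...):
awk 'NR % 5 == 3' in.csv
Skip the first line (for example, a header):
awk 'NR > 1' in.csv
Delete the last line:
sed '$d' in.csv
Print lines 10 to 20:
awk 'NR>=10&&NR<=20' in.csv
Delete empty lines:
sed '/^$/d' in.csv
Delete every line that contains 99:
sed '/99/d' in.csv
Remove rows with a repeated value in column 3, keeping the first occurrence:
awk '!arr[$3]++' in.csv
Replace every occurrence of 23 with 99:
sed 's/23/99/g' in.csv
Compute the mean of column 3:
awk '{x+=$3}END{print x/NR}' in.csv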
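The mean command divides by NR, so a header row or an empty line would be counted as data. As a minimal sketch of a more defensive variant, assuming whitespace-separated columns, the function below counts only rows whose chosen column looks like a number. The helper name `col_mean` and the sample data are made up for illustration.

```shell
# col_mean FILE COL: mean of column COL, skipping rows where it is not a number.
# Helper name and sample data are illustrative assumptions, not part of the test files.
col_mean() {
  awk -v c="$2" '$c ~ /^-?[0-9]+(\.[0-9]+)?$/ { sum += $c; n++ }
                 END { if (n) print sum / n }' "$1"
}

tmp=$(mktemp)
printf 'id name value\n1 a 10\n2 b 20\n\n3 c 60\n' > "$tmp"
col_mean "$tmp" 3   # prints 30 (header and empty line are ignored)
rm -f "$tmp"
```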
Compute the median of column 3:
sort -nk3,3 in.csv | awk 'NF{a[NR]=$3;p++} END {print (p%2==0)?(a[int(p/2)+1]+a[int(p/2)])/2:a[int(p/2)+1]}'
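The median one-liner is dense, so here is a sketch of the same idea spread over several lines. It assumes whitespace-separated columns, and the helper name `col_median` is hypothetical. Unlike the one-liner, it indexes the array with its own counter, so empty lines do not leave holes in the sorted values.

```shell
# col_median FILE COL: median of column COL (illustrative helper, not from the original guide)
col_median() {
  sort -n -k"$2","$2" "$1" | awk -v c="$2" '
    NF { v[++n] = $c }                            # keep non-empty rows, already sorted
    END {
      if (n == 0) exit
      mid = int(n / 2)
      if (n % 2) print v[mid + 1]                 # odd count: the middle value
      else       print (v[mid] + v[mid + 1]) / 2  # even count: mean of the two middle values
    }'
}

tmp=$(mktemp)
printf '1 a 5\n2 b 1\n\n3 c 9\n4 d 3\n' > "$tmp"
col_median "$tmp" 3   # prints 4
rm -f "$tmp"
```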
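Transpose the file (rows become columns):
awk '{for (i=1; i<=NF; i++) {a[NR,i] = $i}} NF>p { p = NF } END {for(k=1; k<=p; k++) {z=a[1,k];for(i=2; i<=NR; i++){z=z" "a[i,k];} print z}}' in.csv
Print the lines of in2.csv whose first field does not occur as a first field in in1.csv:
awk 'FNR==NR{a[$1]++;next}!a[$1]' in1.csv in2.csv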
Compare 1st, 2nd, and 3rd fields of in1.csv with 1st, 4th and 5th fields of in2.csv. Only matching rows in in2.csv will be printed.
awk 'NR==FNR{a[$1,$2,$3]++;next} (a[$1,$4,$5])' in1.csv in2.csv
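Both two-file commands rely on the same idiom: while awk reads the first file, NR equals FNR, so the first block stores the keys and skips to the next line; the second pattern is therefore evaluated only on the second file. A small sketch with made-up data follows. The file contents and the helper names `rows_missing_from` and `rows_shared_with` are assumptions for illustration.

```shell
# Rows of FILE2 whose first field is absent from FILE1.
rows_missing_from() { awk 'FNR == NR { seen[$1]++; next } !seen[$1]' "$1" "$2"; }
# Rows of FILE2 whose first field is present in FILE1.
rows_shared_with()  { awk 'FNR == NR { seen[$1]++; next } seen[$1]' "$1" "$2"; }

dir=$(mktemp -d)
printf 'geneA 1\ngeneB 2\n' > "$dir/in1.csv"
printf 'geneB 7\ngeneC 8\n' > "$dir/in2.csv"
rows_missing_from "$dir/in1.csv" "$dir/in2.csv"   # prints: geneC 8
rows_shared_with  "$dir/in1.csv" "$dir/in2.csv"   # prints: geneB 7
rm -rf "$dir"
```

Note that the idiom misbehaves when the first file is empty, because NR then still equals FNR while the second file is being read.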
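Print every line that contains 99, together with the 2 lines before it and the 1 line after it:
grep -A1 -B2 '99' in.csv
Generate all 256 possible DNA 4-mers (requires a shell with brace expansion, such as bash or zsh):
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}{A,C,T,G}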
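Brace expansion fixes the word length at the number of brace groups typed. As a sketch of a variant that takes the length as a parameter and does not depend on the shell, the awk function below counts from 0 to 4^k - 1 and writes each number in base 4 using the letters A, C, G, T. The helper name `kmers` is hypothetical, and the output order (A, C, G, T) differs from the brace-expansion order.

```shell
# kmers K: print all 4^K DNA words of length K, one per line (illustrative helper)
kmers() {
  awk -v k="$1" 'BEGIN {
    split("A C G T", base, " ")
    n = 4 ^ k
    for (i = 0; i < n; i++) {
      word = ""; x = i
      # peel off base-4 digits, least significant first, and prepend the letter
      for (j = 0; j < k; j++) { word = base[x % 4 + 1] word; x = int(x / 4) }
      print word
    }
  }'
}

kmers 2 | head -n 3   # prints AA, AC, AG on separate lines
```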
Generate 10 random integers between 100 and 200 (inclusive):
awk -v min=100 -v max=200 -v freq=10 'BEGIN{srand(); for(i=0;i<freq;i++) print int(min+rand()*(max-min+1))}'
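Called without an argument, srand() seeds from the time of day, so the numbers change between runs and, in most awk implementations, two runs within the same second produce the same numbers. When a reproducible draw is wanted, a seed can be passed explicitly. This is a minimal sketch; the helper name `rand_ints` and the seed value are arbitrary choices.

```shell
# rand_ints MIN MAX COUNT SEED: COUNT integers in [MIN, MAX], repeatable for a given SEED
rand_ints() {
  awk -v min="$1" -v max="$2" -v n="$3" -v seed="$4" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) print int(min + rand() * (max - min + 1))
  }'
}

rand_ints 100 200 10 42
```

The actual values depend on the awk implementation, so the same seed is only guaranteed to repeat on the same system.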
Reverse-complement a DNA sequence:
echo ATGCA | perl -nle 'print map{$_ =~ tr/ACGT/TGCA/; $_} reverse split("",$_)'
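If Perl is not at hand, the same result can be sketched with two standard utilities: `rev` reverses each line and `tr` swaps each base for its complement. The wrapper name `revcomp` is hypothetical, and handling lower-case bases is an addition beyond the Perl version.

```shell
# revcomp: reverse-complement DNA read from standard input, one sequence per line.
# Upper- and lower-case bases are handled; any other character passes through unchanged.
revcomp() { rev | tr 'ACGTacgt' 'TGCAtgca'; }

echo ATGCA | revcomp   # prints TGCAT
```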
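Compute MD5 checksums for all FASTQ files in parallel (requires GNU parallel):
ls *.fq | parallel md5sum {} > Checksums.txt
Convert FASTQ to FASTA (assumes 4-line FASTQ records):
awk 'NR%4==1{print ">"substr($0,2)}NR%4==2{print $0}' in.fq > out.fa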
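The conversion picks lines by their position within each 4-line record rather than by a leading @, which matters because a quality string can itself begin with @. A worked sketch on a tiny, made-up FASTQ file follows; the wrapper name `fq2fa` and the reads are illustrative.

```shell
# fq2fa FILE: convert 4-line FASTQ records to FASTA (illustrative wrapper)
fq2fa() { awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' "$1"; }

tmp=$(mktemp)
# The second record has a quality line starting with "@" on purpose.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@III\n' > "$tmp"
fq2fa "$tmp"
# >read1
# ACGT
# >read2
# GGCC
rm -f "$tmp"
```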
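Split a multi-FASTA file into one file per sequence, named after the full header line:
awk '/^>/ {tmp=substr($0,2) ".fa"}; {print >> tmp; close(tmp)}' in.fa
Extract the sequences whose IDs are listed in ID.txt:
perl -ne 'if(/^>(\S+)/){$p=$i{$1}}$p?print:chomp;$i{$_}=1 if @ARGV' ID.txt in.fa
Linearize a multi-line FASTA file (header and sequence on a single tab-separated line):
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa
Print the length of each sequence on the line after its header:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next;}{seqlen = seqlen + length($0)} END {print seqlen}' in.fa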
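For downstream sorting or filtering, a two-column table of ID and length is often easier to work with than alternating header and length lines. The following is a sketch of that variant, assuming the ID is the first whitespace-delimited word of the header; the helper name `fa_lengths` and the sample sequences are made up.

```shell
# fa_lengths FILE: print "ID<TAB>length" for every record of a (multi-line) FASTA file
fa_lengths() {
  awk '/^>/ { if (id != "") print id "\t" len   # flush the previous record
              id = substr($1, 2); len = 0; next }
       { len += length($0) }                    # sequence may span several lines
       END { if (id != "") print id "\t" len }' "$1"
}

tmp=$(mktemp)
printf '>seq1 some description\nACGT\nAC\n>seq2\nGGG\n' > "$tmp"
fa_lengths "$tmp"   # prints seq1<TAB>6 and seq2<TAB>3
rm -f "$tmp"
```

The output can then be piped into, for example, `sort -k2,2n` to order sequences by length.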
Convert a two-column table (ID, sequence) to FASTA:
awk '{print ">" $1 "\n" $2}' in.txt
- One great option is to install SeqKit and run it as:
seqkit locate --ignore-case --degenerate --pattern TCTAWA in.fa
In the case above, we are asking seqkit to locate all the positions of the degenerate motif TCTAWA in a FASTA file (in.fa). If we have multiple motifs to test, we can collect them in a FASTA file (motifs.fa):
seqkit locate --ignore-case --degenerate -f motifs.fa in.fa
- First install FastQC (for example, using brew):
brew install fastqc
Then run it on all FASTQ files, here with up to 12 jobs in parallel:
find *.fq | parallel -j 12 "fastqc {} --outdir ."
For each input file, this will produce a FastQC HTML report and a ZIP file containing all associated files.
- awesome-shell: A curated list of shell tools and resources.
- awesome-osx-command-line: A more in-depth guide for the macOS command line.
- Strict mode for writing better shell scripts.
- shellcheck: A shell script static analysis tool. Essentially, lint for bash/sh/zsh.
- Filenames and Pathnames in Shell: The sadly complex minutiae on how to handle filenames correctly in shell scripts.
- Data Science at the Command Line: More commands and tools helpful for doing data science, from the book of the same name.