## We're going to learn some useful Linux command-line tools by creating a FASTA file of unique loci from a tags file created using Stacks.

## Commands
* grep
* pipe (|)
* awk
* sed
* redirect (>)






# 1. grep - search for patterns in text file

grep prints every line of a text file that matches a pattern.

Common usage is
>`grep <pattern> input_file`

See the manual page for more details
>`man grep`

Example: Extract consensus sequences from a tags file created by Stacks.

In [1]:
!grep consensus example.tags.tsv

0	38	1	Locus_10000	0	+	consensus			TGCAGGACAGGCCCACTTTACTCTCATCACCTGTACCGATCTGAGCGCAGGACCAGCTCTCAGGGGGAACTCAGCTACTATGAACCGGATGTGAATCAATGAAATAATGG	0	0	0	-12.2811
0	38	2	Locus_10002	0	+	consensus			TGCAGGCAATTAAAAGCCTGCTCAATGTGATTGGTCGAACTAGAAACAGAAAACAGTTTCACATAATAATGGAGATTATGTTTGAACAAAGGCATAACAAATTGTTATTT	0	0	0	-6.07
0	38	3	Locus_10003	0	+	consensus			TGCAGGCGTCTACCTCTCGCCACAATGTTCCAGACACTTTCCTGTTCTTGTAAACACGTCTGACCCAGAAAGTCCCCTGCTGTGAAAGACAAACTCTGCAAAATGTCTGG	0	0	0	0.00
0	38	4	Locus_10004	0	+	consensus			TGCAGGACAGCAGCAAAAGAGCATGAGGCCAATCCAGAGAGCTTTTGTCGGCTGAGCTACCGACTGCCAAGCACCTGAGGTAGATAAATGCACCAAACATTTCACTTGAG	0	0	0	-2.05
0	38	6	Locus_10006	0	+	consensus			TGCAGGTGTGTATCTCCATGTGCGTGTGAGTGATGATGTTGCAGGAAAGTGAGCCACAGCTGGAAGCGCATTTGTGATTCGAAACTCTCCGCAAAAGAAATCCCCACACG	0	0	0	0.00
0	38	7	Locus_10007	0	+	consensus			TGCAGGAAGGACATCCCAGAATAAATGAGGCTTTAATTTAGTTGTTATTGTTGCTGCCCAAAGGAAGGAGGTGATGCAAAAAATGATGAGTCATGTGTAATGAGTCAGAT	0	0	0	0.00
0	38	8	Locus_10008	0	+	consensus			TGCAGGTTCTA
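
Two grep options are handy for checking a result before building on it. The sketch below assumes a tiny, made-up file named `toy.tags.tsv` (hypothetical; its layout only imitates the tags file above) and uses `-c` to count matching lines and `-v` to invert the match.

```shell
# Build a tiny, made-up tab-delimited file (toy data, not real Stacks output).
printf '0\t38\t1\tLocus_1\t0\t+\tconsensus\t\t\tACGT\t0\t0\t0\t0.00\n' > toy.tags.tsv
printf '0\t38\t1\tLocus_1\t0\t+\tprimary\t0\tread_1\tACGT\n' >> toy.tags.tsv
printf '0\t38\t2\tLocus_2\t0\t+\tconsensus\t\t\tTTGA\t0\t0\t0\t0.00\n' >> toy.tags.tsv

# -c prints the number of matching lines instead of the lines themselves.
n_consensus=$(grep -c consensus toy.tags.tsv)
echo "$n_consensus"

# -v inverts the match, so this counts lines that do NOT contain the pattern.
n_other=$(grep -vc consensus toy.tags.tsv)
echo "$n_other"
```

Counting first is a cheap way to confirm that the number of consensus lines matches the number of loci you expect.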



# 2. | - pipe to send output from one command to another

Common usage is
>`<command1> | <command2>`

Example: Pipe the output of grep into head to show only the first five consensus lines. In the following sections we'll chain grep, |, awk, and sed to create a fasta file with all of the sequences.

In [3]:
!grep consensus example.tags.tsv | head -n5

0	38	1	Locus_10000	0	+	consensus			TGCAGGACAGGCCCACTTTACTCTCATCACCTGTACCGATCTGAGCGCAGGACCAGCTCTCAGGGGGAACTCAGCTACTATGAACCGGATGTGAATCAATGAAATAATGG	0	0	0	-12.2811
0	38	2	Locus_10002	0	+	consensus			TGCAGGCAATTAAAAGCCTGCTCAATGTGATTGGTCGAACTAGAAACAGAAAACAGTTTCACATAATAATGGAGATTATGTTTGAACAAAGGCATAACAAATTGTTATTT	0	0	0	-6.07
0	38	3	Locus_10003	0	+	consensus			TGCAGGCGTCTACCTCTCGCCACAATGTTCCAGACACTTTCCTGTTCTTGTAAACACGTCTGACCCAGAAAGTCCCCTGCTGTGAAAGACAAACTCTGCAAAATGTCTGG	0	0	0	0.00
0	38	4	Locus_10004	0	+	consensus			TGCAGGACAGCAGCAAAAGAGCATGAGGCCAATCCAGAGAGCTTTTGTCGGCTGAGCTACCGACTGCCAAGCACCTGAGGTAGATAAATGCACCAAACATTTCACTTGAG	0	0	0	-2.05
0	38	6	Locus_10006	0	+	consensus			TGCAGGTGTGTATCTCCATGTGCGTGTGAGTGATGATGTTGCAGGAAAGTGAGCCACAGCTGGAAGCGCATTTGTGATTCGAAACTCTCCGCAAAAGAAATCCCCACACG	0	0	0	0.00
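
Pipes can be chained as many times as needed. As a minimal sketch, assuming the same kind of made-up `toy.tags.tsv` file (hypothetical toy data), the chain below filters with grep, keeps the fourth tab-delimited column with `cut`, and takes the first line with `head`.

```shell
# Tiny, made-up tab-delimited file (toy data, not real Stacks output).
printf '0\t38\t1\tLocus_1\t0\t+\tconsensus\t\t\tACGT\t0\t0\t0\t0.00\n' > toy.tags.tsv
printf '0\t38\t2\tLocus_2\t0\t+\tconsensus\t\t\tTTGA\t0\t0\t0\t0.00\n' >> toy.tags.tsv

# Each | hands the output of the command on its left to the command on its right.
first_locus=$(grep consensus toy.tags.tsv | cut -f 4 | head -n 1)
echo "$first_locus"
```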




# 3. awk - text manipulation tool

awk is used to parse text files. I commonly use it to extract columns of data from a delimited file.

Common usage is
>`awk '{print $column_number}' input_file`

See the manual page for more details (including how to parse files delimited by things other than whitespace).
>`man awk`

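Example: Extract the locus name and consensus sequence from a tags file created by Stacks. By default awk splits on runs of whitespace, so the two empty columns after `consensus` are skipped and the sequence is column 8.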

In [4]:
!grep consensus example.tags.tsv | awk '{print ">" $4 "\t" $8}' # print > then column4 then a tab then column8

>Locus_10000	TGCAGGACAGGCCCACTTTACTCTCATCACCTGTACCGATCTGAGCGCAGGACCAGCTCTCAGGGGGAACTCAGCTACTATGAACCGGATGTGAATCAATGAAATAATGG
>Locus_10002	TGCAGGCAATTAAAAGCCTGCTCAATGTGATTGGTCGAACTAGAAACAGAAAACAGTTTCACATAATAATGGAGATTATGTTTGAACAAAGGCATAACAAATTGTTATTT
>Locus_10003	TGCAGGCGTCTACCTCTCGCCACAATGTTCCAGACACTTTCCTGTTCTTGTAAACACGTCTGACCCAGAAAGTCCCCTGCTGTGAAAGACAAACTCTGCAAAATGTCTGG
>Locus_10004	TGCAGGACAGCAGCAAAAGAGCATGAGGCCAATCCAGAGAGCTTTTGTCGGCTGAGCTACCGACTGCCAAGCACCTGAGGTAGATAAATGCACCAAACATTTCACTTGAG
>Locus_10006	TGCAGGTGTGTATCTCCATGTGCGTGTGAGTGATGATGTTGCAGGAAAGTGAGCCACAGCTGGAAGCGCATTTGTGATTCGAAACTCTCCGCAAAAGAAATCCCCACACG
>Locus_10007	TGCAGGAAGGACATCCCAGAATAAATGAGGCTTTAATTTAGTTGTTATTGTTGCTGCCCAAAGGAAGGAGGTGATGCAAAAAATGATGAGTCATGTGTAATGAGTCAGAT
>Locus_10008	TGCAGGTTCTACTCCAAGGGTAGAACTGTGGGAATGACTGTGGGAATTTTCCATCCCATGAAGCACCCGGTGGCACATTTGTACTGACAAATAACAGAATAGATTTAATA
>Locus_10010	TGCAGGTGAAGCCGGCGCTTGGTGTCGATGATGTGTTGGCCAAGCTCCAGCATGTGTTGTACATATAAAACTATGATGATGACAATAATCATTTGGGTTTGTCCACCAGG
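
The column number depends on how awk splits each line. The sketch below, again on a made-up one-line `toy.tags.tsv` (hypothetical toy data), compares the default whitespace splitting with explicit tab splitting via `-F'\t'`. With tabs as the separator the empty columns are kept, so the same sequence moves from column 8 to column 10.

```shell
# One made-up line with two empty columns after "consensus" (toy data).
printf '0\t38\t1\tLocus_1\t0\t+\tconsensus\t\t\tACGT\t0\t0\t0\t0.00\n' > toy.tags.tsv

# Default: runs of whitespace count as one separator, so empty columns vanish.
default_seq=$(awk '{print $8}' toy.tags.tsv)

# -F'\t': every tab is a separator, so empty columns still count.
tab_seq=$(awk -F'\t' '{print $10}' toy.tags.tsv)

echo "$default_seq $tab_seq"
```

Splitting on tabs is the safer choice when a column may itself contain spaces.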



# 4. sed - text manipulation tool

sed is also used to parse text files. I commonly use it to search and replace elements in a file.

Common usage is 
>`sed 's/<search pattern>/<replace pattern>/' input_file`

See the manual page for more details
>`man sed`

In [5]:
!grep consensus example.tags.tsv | awk '{print ">" $4 "\t" $8}' | sed 's/\t/\n/' # replace the tab with a newline (the \t and \n escapes assume GNU sed)

>Locus_10000
TGCAGGACAGGCCCACTTTACTCTCATCACCTGTACCGATCTGAGCGCAGGACCAGCTCTCAGGGGGAACTCAGCTACTATGAACCGGATGTGAATCAATGAAATAATGG
>Locus_10002
TGCAGGCAATTAAAAGCCTGCTCAATGTGATTGGTCGAACTAGAAACAGAAAACAGTTTCACATAATAATGGAGATTATGTTTGAACAAAGGCATAACAAATTGTTATTT
>Locus_10003
TGCAGGCGTCTACCTCTCGCCACAATGTTCCAGACACTTTCCTGTTCTTGTAAACACGTCTGACCCAGAAAGTCCCCTGCTGTGAAAGACAAACTCTGCAAAATGTCTGG
>Locus_10004
TGCAGGACAGCAGCAAAAGAGCATGAGGCCAATCCAGAGAGCTTTTGTCGGCTGAGCTACCGACTGCCAAGCACCTGAGGTAGATAAATGCACCAAACATTTCACTTGAG
>Locus_10006
TGCAGGTGTGTATCTCCATGTGCGTGTGAGTGATGATGTTGCAGGAAAGTGAGCCACAGCTGGAAGCGCATTTGTGATTCGAAACTCTCCGCAAAAGAAATCCCCACACG
>Locus_10007
TGCAGGAAGGACATCCCAGAATAAATGAGGCTTTAATTTAGTTGTTATTGTTGCTGCCCAAAGGAAGGAGGTGATGCAAAAAATGATGAGTCATGTGTAATGAGTCAGAT
>Locus_10008
TGCAGGTTCTACTCCAAGGGTAGAACTGTGGGAATGACTGTGGGAATTTTCCATCCCATGAAGCACCCGGTGGCACATTTGTACTGACAAATAACAGAATAGATTTAATA
>Locus_10010
TGCAGGTGAAGCCGGCGCTTGGTGTCGATGATGTGTTGGCCAAGCTCCAGCATGTGTTGTACATATAAAACTATGATGATGACAATAATCATTTGGGTTTGTCC
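
The `\t` and `\n` escapes are understood by GNU sed, but some other sed implementations (for example the BSD sed shipped with macOS) may treat them differently. As a hedged alternative, not the method used above, `tr` can do the same tab-to-newline swap. This sketch assumes a made-up two-record file named `toy.tab` (hypothetical) with exactly one tab per line.

```shell
# Made-up name<TAB>sequence lines, like the awk output above (toy data).
printf '>Locus_1\tACGT\n>Locus_2\tTTGA\n' > toy.tab

# tr swaps every tab for a newline, which is safe here because each line has one tab.
tr '\t' '\n' < toy.tab > toy.fa

# Check the result: count the header lines and look at the first one.
n_headers=$(grep -c '^>' toy.fa)
first_line=$(head -n 1 toy.fa)
echo "$n_headers $first_line"
```

Unlike the sed command, which replaces only the first tab on each line, `tr` replaces all of them, so it fits only when each line has a single tab.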



# 5. redirect - send output to a file

Common usage is 
>`<command1> > <output_file>`

Example: Send output from the last command to a fasta file.

In [6]:
!grep consensus example.tags.tsv | awk '{print ">" $4 "\t" $8}' | sed 's/\t/\n/' > example.fa

In [7]:
!head -n10 example.fa

>Locus_10000
TGCAGGACAGGCCCACTTTACTCTCATCACCTGTACCGATCTGAGCGCAGGACCAGCTCTCAGGGGGAACTCAGCTACTATGAACCGGATGTGAATCAATGAAATAATGG
>Locus_10002
TGCAGGCAATTAAAAGCCTGCTCAATGTGATTGGTCGAACTAGAAACAGAAAACAGTTTCACATAATAATGGAGATTATGTTTGAACAAAGGCATAACAAATTGTTATTT
>Locus_10003
TGCAGGCGTCTACCTCTCGCCACAATGTTCCAGACACTTTCCTGTTCTTGTAAACACGTCTGACCCAGAAAGTCCCCTGCTGTGAAAGACAAACTCTGCAAAATGTCTGG
>Locus_10004
TGCAGGACAGCAGCAAAAGAGCATGAGGCCAATCCAGAGAGCTTTTGTCGGCTGAGCTACCGACTGCCAAGCACCTGAGGTAGATAAATGCACCAAACATTTCACTTGAG
>Locus_10006
TGCAGGTGTGTATCTCCATGTGCGTGTGAGTGATGATGTTGCAGGAAAGTGAGCCACAGCTGGAAGCGCATTTGTGATTCGAAACTCTCCGCAAAAGAAATCCCCACACG
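
Note that `>` replaces the contents of the output file each time, while `>>` appends to it. A minimal sketch of the difference, using a made-up file named `toy.fa` (hypothetical):

```shell
# > creates the file, or overwrites it if it already exists.
printf '>Locus_1\nACGT\n' > toy.fa

# >> adds to the end of the existing file.
printf '>Locus_2\nTTGA\n' >> toy.fa
n_after_append=$(grep -c '^>' toy.fa)

# Using > again discards the two records written so far.
printf '>Locus_3\nGGCC\n' > toy.fa
n_after_overwrite=$(grep -c '^>' toy.fa)

echo "$n_after_append $n_after_overwrite"
```

Rerunning a pipeline that ends in `>` is therefore safe to repeat, but it will silently replace an existing file of the same name.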
