## Shell script in Bash

### 12/1/2016

This Bash script counts the number of sequences in gzipped fastq or fasta files (either raw RADseq data or RADseq data that has been demultiplexed and filtered by `process_radtags`). For my data analysis, I need to count the reads produced per individual after `process_radtags` so that I can use the individuals with the greatest read depths to build the `cstacks` catalog.

The script must be run from the top-level project directory (`DataAnalysis`), not from the filesystem root. When prompted, enter the path to the subdirectory that holds the fastq/fasta files.


Here is a link to the [Bash Script](https://github.com/mfisher5/mf-fish546-2016/blob/master/FISH546_BASHscript.sh)

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-2016/notebooks'

In [2]:
cd ../../

/mnt/hgfs/Pacific cod/DataAnalysis


To give an idea of what the shell script looks like: 

In [8]:
!head -n 17 FISH546_BASHscript.sh

#!/bin/bash

##### This shell script will count the number of sequences in gzipped fastq OR fasta files (raw data + post process_radtags) #####
## M. Fisher 12/1/2016 ##
## Starting within the file directory `DataAnalysis` ##

#Set Variables
POST_DATE=$(date '+%Y-%m-%d') #find the date and save as POST_DATE
OUTPUT_FILENAME="SequenceCounts.$POST_DATE.txt" #create output file that contains the date


#Ask for input from user
echo "Enter file directory. Note - must be subdirectory within current location ::"
read DIRECTORY 		#reads entry typed at command line as new variable "DIRECTORY"
echo "Is file type '.fa.gz' or '.fq.gz'?"
read FILE_TYPE 		#reads entry typed at command line as new variable "FILE_TYPE"
echo "--"
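
The excerpt above stops before the counting step. As a rough sketch of how such a count can be done (this is not the code in `FISH546_BASHscript.sh`; the function name `count_seqs` is hypothetical, and the FASTQ branch assumes every record spans exactly four lines):

```shell
#!/bin/bash
# Hypothetical helper, not taken from FISH546_BASHscript.sh.
# Prints the number of sequences in one gzipped FASTA or FASTQ file.
count_seqs() {
    local file="$1"
    case "$file" in
        *.fa.gz)
            # Each FASTA record starts with a ">" header line.
            gzip -dc -- "$file" | grep -c '^>'
            ;;
        *.fq.gz)
            # Assumption: unwrapped FASTQ, so one record = four lines.
            echo $(( $(gzip -dc -- "$file" | wc -l) / 4 ))
            ;;
        *)
            echo "unsupported file type: $file" >&2
            return 1
            ;;
    esac
}
```

Calling `count_seqs samplesT142/GE011215_01.1.fa.gz` from `DataAnalysis` would print a single number for that file.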



Now to run the shell script... 

**(1) On gzipped Fasta files**

In [13]:
!chmod +x FISH546_BASHscript.sh

In [17]:
!bash FISH546_BASHscript.sh

FISH546_BASHscript.sh: line 2: $'\r': command not found
FISH546_BASHscript.sh: line 6: $'\r': command not found
FISH546_BASHscript.sh: line 10: $'\r': command not found
FISH546_BASHscript.sh: line 11: $'\r': command not found
Enter file directory. Note - must be subdirectory within current location ::
^C
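
The `$'\r': command not found` messages are reported for lines 2, 6, 10 and 11, which are blank lines in the script. That pattern usually means the file was saved with Windows (CRLF) line endings, so each "blank" line still contains a carriage return. A minimal sketch of one way to strip them, assuming it is acceptable to write a cleaned copy (the function name `strip_cr` and the output file name are hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: write a copy of a script with Windows (CRLF)
# line endings converted to Unix (LF) line endings.
strip_cr() {
    local src="$1" dest="$2"
    # Delete every carriage-return character; newlines are left untouched.
    tr -d '\r' < "$src" > "$dest"
}
```

For example, `strip_cr FISH546_BASHscript.sh FISH546_BASHscript.unix.sh` would leave the original untouched and produce a copy that bash can read without the `\r` errors.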


Unfortunately, I cannot seem to run an interactive shell script from within Jupyter, so I ran it in a terminal instead.
The script asks two questions:

`Enter file directory. Note - must be subdirectory within current location :: `

`Is file type '.fa.gz' or '.fq.gz'?`

Here is a screenshot of the output in terminal: 

![fasta](https://github.com/mfisher5/mf-fish546-2016/blob/master/Diagrams/BashScriptFASTA.png?raw=true)

And a link to the original photo: 

[Fasta Terminal Output](https://github.com/mfisher5/mf-fish546-2016/blob/master/Diagrams/BashScriptFASTA.png)
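
Because `read` takes its input from standard input, one possible way to run a prompting script without a terminal is to pipe the answers in, one per line. The wrapper below is only a sketch: the name `run_with_answers` is hypothetical, and it assumes nothing else in the script consumes standard input before the prompts are answered.

```shell
#!/bin/bash
# Hypothetical wrapper: answer a script's `read` prompts from arguments,
# so it can run where no interactive terminal is available.
run_with_answers() {
    local script="$1"
    shift
    # Each remaining argument becomes one line on the script's stdin,
    # which successive `read` calls consume in order.
    printf '%s\n' "$@" | bash "$script"
}
```

From a notebook cell this might look like `!printf 'samplesT142\n.fa.gz\n' | bash FISH546_BASHscript.sh`, assuming those are the answers the two prompts expect.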

Once the program is done running through all of the files in that directory, I can also view the output in a text file that was saved in the specified folder. The name of the text file is printed out to the terminal. 

In [2]:
cd ../../samplesT142

/mnt/hgfs/Pacific cod/DataAnalysis/samplesT142


In [3]:
!head SequenceCounts.2016-12-01.txt

./GE011215_01.1.fa.gz
3807454
./GE011215_07.1.fa.gz
3631183
./GE011215_08.1.fa.gz
2076442
./GE011215_09.1.fa.gz
3758901
./GE011215_10.1.fa.gz
3440184
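
Since the goal is to pick the individuals with the greatest read depth, it helps to sort this file by count. A small sketch, assuming the layout shown above (a file-name line followed by a count line); the function name `rank_by_count` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: list samples from a SequenceCounts file,
# highest count first.
# Assumes alternating lines: file name, then its sequence count.
rank_by_count() {
    local counts_file="$1"
    # paste joins each name/count pair onto one tab-separated line;
    # sort then orders the pairs by the numeric second column, descending.
    paste - - < "$counts_file" | sort -t $'\t' -k2,2nr
}
```

Running `rank_by_count SequenceCounts.2016-12-01.txt | head` would then show the deepest-sequenced samples first.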


**(2) On gzipped Fastq files**

Note that the output for fastq files looks different: each file gets a line of summary statistics rather than a single count, because I am using code from Stephen Turner's GitHub repo, [oneliners](https://github.com/stephenturner/oneliners/blob/40943e6cd4695f3c4d8a1b8d5e940b957e060c26/README.md#awk--sed-for-bioinformatics).

The first number in the line under the file name is the total number of sequences. 


In [None]:
!bash FISH546_BASHscript.sh

Here is a screenshot of the output in terminal:

![fastq](https://github.com/mfisher5/mf-fish546-2016/blob/master/Diagrams/BashScriptFASTQ.png?raw=true)

And a link to the original picture: 

[Fastq terminal output](https://github.com/mfisher5/mf-fish546-2016/blob/master/Diagrams/BashScriptFASTQ.png)

In [3]:
cd ../L2samplesT142

/mnt/hgfs/Pacific cod/DataAnalysis/L2samplesT142


In [4]:
!head SequenceCounts.2016-12-01.txt

./BOR07_01.fq.gz
  -nan   -nan
./BOR07_03.fq.gz
2968140 276502 9.31567 TGCAGGACTCATAGTGCTTGCTCACTTTAAGAAATACTAGAACACCACAGCCCCTCATTAGTACAGTGGTCAGTATCCATTGGCCATGCATGAAACCAGGAATGTGAAGATTCTTGGTTTAACTCAGTCCGAACGAAAAGAG 20471 0.689691
./BOR07_09.fq.gz
5769259 531199 9.2074 TGCAGGACTCATAGTGCTTGCTCACTTTAAGAAATACTAGAACACCACAGCCCCTCATTAGTACAGTGGTCAGTATCCATTGGCCATGCATGAAACCAGGAATGTGAAGATTCTTGGTTTAACTCAGTCCGAACGAAAAGAG 32744 0.56756
./GE011215_18.fq.gz
6852956 602505 8.7919 TGCAGGACTCATAGTGCTTGCTCACTTTAAGAAATACTAGAACACCACAGCCCCTCATTAGTACAGTGGTCAGTATCCATTGGCCATGCATGAAACCAGGAATGTGAAGATTCTTGGTTTAACTCAGTCCGAACGAAAAGAG 44328 0.646845
./GE011215_19.fq.gz
4019476 378982 9.42864 TGCAGGACTCATAGTGCTTGCTCACTTTAAGAAATACTAGAACACCACAGCCCCTCATTAGTACAGTGGTCAGTATCCATTGGCCATGCATGAAACCAGGAATGTGAAGATTCTTGGTTTAACTCAGTCCGAACGAAAAGAG 35823 0.891236


*!! Ignore the first file, BOR07_01.fq.gz. This is an empty file, which is why its counts show up as `-nan`.*
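
To compare fastq samples, only the first field of each summary line is needed. In the output above, the remaining fields appear to be the number of unique reads, the percent unique, the most abundant sequence, its count, and its percent of the total (the third field matches the second divided by the first, times 100). The sketch below keeps just the file name and the total, and marks non-numeric entries such as `-nan` as `NA`. The function name `fastq_totals` is hypothetical, and the two-line layout is assumed from the output shown here.

```shell
#!/bin/bash
# Hypothetical helper: reduce a fastq SequenceCounts file to
# "file<TAB>total reads".
# Assumes alternating lines: file name, then a summary line whose
# first field is the total number of sequences.
fastq_totals() {
    local counts_file="$1"
    awk '
        NR % 2 == 1     { name = $0; next }            # odd lines hold the file name
        $1 ~ /^[0-9]+$/ { print name "\t" $1; next }   # numeric total: keep it
                        { print name "\tNA" }          # anything else, e.g. -nan
    ' "$counts_file"
}
```

The result has one line per sample and can be sorted by its second column to rank the individuals by read depth.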