## Raw data
We have discussed the raw data we will use before. Now we will use bash commands to actually look inside these files.

### File location
The sequencing reads are already available in the folder `data/reads`. To keep the practical feasible, we have made a subset of the original data. <br>

In this reads folder you will find 12 files. <br>
L and P stand for the leaf and plant samples. <br>
The numbers 1, 2 and 3 (for example L1, L2, L3) denote the biological replicates. R1 and R2 represent the forward and reverse Illumina reads. So 2 samples * 3 replicates * 2 directions = 12 files. <br>
Files are stored in fastq format (.fastq) and compressed with gzip (.gz). The gzip compression format is widely used in the genomics field: fastq files often compress down to only a quarter of their original size with gzip! You do not need to extract these files. <br>
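
The arithmetic above can be checked with a short sketch that enumerates every sample, replicate and direction combination. The labels it builds (such as `L1_R1`) are illustrative assumptions only; the actual file names in `data/reads` may follow a different pattern.

```python
from itertools import product

samples = ["L", "P"]        # leaf and plant
replicates = [1, 2, 3]      # biological replicates
directions = ["R1", "R2"]   # forward and reverse reads

# Hypothetical labels: the real file names may be formatted differently.
labels = [f"{s}{r}_{d}" for s, r, d in product(samples, replicates, directions)]

print(len(labels))
print(labels)
```

Every label stands for one expected file, so the list should have as many entries as the `ls` output has files.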

Let's first double-check that all files are available.
1. Add a cell below.
2. Use the `ls` command to see which files are present in the `data/reads/` folder.



### View a gzipped file
To view the contents of a gzipped file without extracting it on disk, you can use the `zcat` command, which writes the decompressed text to the screen:
> zcat path/to/file.gz

Note that the content of gzipped files is often quite big, so big that printing all of it may crash this webpage. To prevent this from happening, 'pipe' the output of `zcat` into `head` to display only the first 10 lines, like so:

> zcat path/to/file.gz | head

Now create a new cell below, and check the first 10 lines of one of the fastq files. Do these look as we would expect?
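
For comparison, the same "decompress and show only the first lines" idea can be sketched in Python. This is a minimal illustration, not part of the exercise: the function name `head_gz` and the small demo file are made up for this example, and the demo file is not one of the practical's read files.

```python
import gzip
import os
import tempfile
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # Mode "rt" decompresses on the fly and yields text lines; islice stops
    # after n lines, so the rest of the file is never read.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demo on a small made-up file with 25 lines.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("".join(f"line {i}\n" for i in range(25)))

first_lines = head_gz(demo_path)
print(len(first_lines), first_lines[0], first_lines[-1])
```

Like `zcat path/to/file.gz | head`, this reads only as much of the file as is needed for the requested lines.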


### Counting reads
Now I want you to find out how many reads there are in a single fastq file. You will achieve this in two steps.

First, `grep` selects the lines in a file that contain the word or pattern given in the command. As you can see in the output above, the header of each read always starts with an "@". Therefore, if we only want to see the headers, we can simply grep for the "@". Keep in mind that "@" is also a valid quality-score character, so a quality line can occasionally match as well.

Create a new cell below and build a command that uses:
1. `zcat`  to open the fastq.gz file
2. `grep '@'`   to keep only the headers
3. `head`  to show only the first 10 lines and thus not crash the webpage.

Again, use pipes to feed the output from one programme to the next.

In the output you just created in step one, every line contains a header representing one sequence. The second step is to count these lines. To do so, we first read in the gzip file, then grep for the headers, and finally use the word count command `wc` to count the number of lines. First read the 'help page' of the `wc` command by typing in a new cell:

> wc --help

Which option do we need to give `wc` to count lines? Deduce this from the help page.

Finally, make a new 'pipeline' using:
1. `zcat`  to open the fastq.gz file
2. `grep '@'`   to filter out only headers 
3. `wc` + options to count the lines in the fastq file.
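
As a cross-check on the `grep '@'` count, the sketch below counts reads a second way: by dividing the total number of lines by four. It assumes every read occupies exactly four lines (header, sequence, `+`, quality), which is the usual layout for Illumina fastq files but is an assumption here. The function names and the two demo reads are made up for illustration.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a gzipped fastq file, assuming 4 lines per read."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_lines_containing(path, text):
    """Count the lines that contain text, similar to grep piped into wc."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if text in line)


# Two made-up reads; the quality line of the second read contains "@".
records = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\n@III\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(records)

n_reads = count_fastq_reads(demo_path)
n_at_lines = count_lines_containing(demo_path, "@")
print(n_reads, n_at_lines)
```

In this demo the two numbers differ because one quality line contains an "@". If the two approaches disagree on a real file, the line count divided by four is the more reliable figure, provided the four-lines-per-read assumption holds.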


### Wrap up
You have now executed some bash commands and investigated their output. Well done!

## Assembled data

Calculating a de novo metagenome assembly is beyond the scope of this practical, so we have done this for you. You will, however, analyse this assembly yourself! :)

Let's start by looking at the assembly files in the folder `data/assembly`. Do this in a new cell below.

Now, look at the first 10 lines of the assembly with the `head` command.

Will you use `zcat` to open the assembly? Only if the assembly is a compressed `assembly.fasta.gz` file. If you are dealing with a 'regular', uncompressed fasta file, use `cat` (short for concatenate) instead.

Now we will assess the number of sequences in the assembly. Remember how we did this for the FastQ file. 

1. `cat` the assembly file
2. `grep` the headers
3. `wc` to count the lines.

Add a new cell below and assess the number of scaffolds in the metagenome assembly.
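
The choice between `zcat` and `cat`, followed by counting header lines, can also be sketched in Python. This is an illustration under assumptions: the helper names `open_text` and `count_fasta_records` are made up, compression is detected only from the `.gz` file extension, and the demo sequences are invented.

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    # Assumption: a ".gz" extension means the file is gzip-compressed.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fasta_records(path):
    """Count sequences by counting header lines, which start with '>'."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Made-up fasta content, written once as plain text and once gzipped.
fasta = ">seq1\nACGTACGT\n>seq2\nTTGACA\n>seq3\nGG\n"
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "demo.fasta")
gz_path = os.path.join(tmp_dir, "demo.fasta.gz")
with open(plain_path, "w") as handle:
    handle.write(fasta)
with gzip.open(gz_path, "wt") as handle:
    handle.write(fasta)

n_plain = count_fasta_records(plain_path)
n_gz = count_fasta_records(gz_path)
print(n_plain, n_gz)
```

Unlike "@" in fastq files, the ">" character does not occur in fasta sequence lines, so counting header lines gives the number of sequences directly.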

## Length distributions of the scaffolds
Now that we have the assembly, we will do some quick analyses to get an idea of its quality. First we will plot the length distribution of the scaffolds in the assembly. Luckily for us, the length of each sequence in the fasta file is already embedded in its fasta header. We can easily extract these numbers and plot them in Python.
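
To make the extraction step concrete, here is a minimal sketch of pulling the length out of a single header. The example header is hypothetical: it assumes the length is the fourth field when the header is split on underscores, which is what the plotting code in this section relies on. Check the headers in your own assembly before trusting that assumption.

```python
def length_from_header(header):
    """Return the scaffold length stored in the 4th '_'-separated field."""
    fields = header.strip().split("_")
    return int(fields[3])


# Hypothetical header; the real headers in the assembly may differ.
example_header = ">NODE_1_length_5000_cov_12.5"
example_length = length_from_header(example_header)
print(example_length)
```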

Since this is a bash practical, I have already written the Python code for you. All you need to do is add the path to your assembly file in the line

> f = open("path/to/assembly.file","r")

To plot the length distribution run the python code below.

In [None]:
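import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

# Add the path to your assembly file between the quotes.
f = open("", "r")
lines = f.readlines()
f.close()

# Collect the scaffold length from every fasta header (lines starting with ">").
data = []

for line in lines:
    if line.startswith(">"):
        # Split the header on "_"; the fourth field holds the length.
        fields = line.strip().split('_')
        data.append(float(fields[3]))

# Histogram of the scaffold lengths; the counts are shown on a log scale.
fig = plt.figure(figsize=(10,10))
plt.hist(data, bins=100, log=True);
plt.title("length distribution scaffolds");
plt.xlabel("length");
plt.ylabel("count");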

Did you expect this distribution?<br>
Why would there be so many short scaffolds?<br>