### Reading high-throughput sequencing files

```
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII
```

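+ `@` line: read identifier, followed by an optional description
+ sequence line: the nucleotides that were read
+ `+` line: separator, optionally repeating the identifier
+ quality score line: one character per nucleotide in the sequence line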
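
Because each record in this layout spans four lines, a file can be walked four lines at a time. The following is a minimal sketch, assuming unwrapped four-line records; `read_fastq` and the in-memory example record are hypothetical names made up for illustration, not part of any library.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open text handle.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        header, sequence, _plus, quality = record
        yield header[1:], sequence, quality  # drop the leading "@"

# io.StringIO stands in for an open FASTQ file
example = io.StringIO(
    "@read1 demo\n"
    "GGGTGATGGC\n"
    "+\n"
    "IIIIIIII9I\n"
)
records = list(read_fastq(example))
print(records)
```

Writing the reader as a generator means only one record is held in memory at a time.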

### Big files

In [4]:
f = open("files/simple-file.txt")
for l in f.readlines():
    print(l,end="")
f.close()

line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10


### Problem

`f.readlines()` is convenient, **but** it loads the entire contents of the file into memory as a list of lines.

If the file is 10GB, you may have just crashed your computer!

### Solution: turn the reader into an `iterator`

In [5]:
with open("files/simple-file.txt") as f:
    for l in f:
        print(l.strip()) 

line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10


```python
with open("files/simple-file.txt") as f:
    for l in f:
        do_something(l)
```

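Looping over the file object itself goes through the file line by line. Each pass through the loop rebinds `l` to the next line, so the previous line can be discarded and only a small part of the file is in memory at any time.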
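
To see that an open file hands out lines one at a time, you can pull lines from it by hand with `next`. This is a small sketch; `io.StringIO` stands in for an open text file so that the example runs without any file on disk.

```python
import io

# io.StringIO stands in for an open text file so the example is self-contained
f = io.StringIO("line 1\nline 2\nline 3\n")

first = next(f)        # reads only the first line
second = next(f)       # reads only the second line
remaining = list(f)    # whatever has not been read yet

print(repr(first), repr(second), remaining)
```

A `for` loop over a file does the same thing: it calls `next` repeatedly until the file is exhausted.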

### What if you have a *compressed* input file?

In [6]:
with open("files/simple-file.txt.gz") as f:
    for l in f:
        print(l.strip())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

### Use the `gzip` module

In [7]:
import gzip

with gzip.open("files/simple-file.txt.gz") as f:
    for l in f:
        l_ascii = l.decode("ascii")
        print(l_ascii.strip())

line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10


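+ `gzip.open` works like `open`, except that it decompresses a .gz file as it reads. By default it opens the file in binary mode, so each line comes back as `bytes` rather than `str`.
+ `l.decode("ascii")` specifies that the bytes that just came out of the file should be interpreted as ASCII text, turning them into a normal string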
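
As an alternative to decoding every line by hand, `gzip.open` also accepts a text mode (`"rt"`) and an `encoding` argument. The sketch below writes a small throwaway file first so that it is self-contained; the file name `demo.txt.gz` is made up for the example.

```python
import gzip
import os
import tempfile

# Write a small throwaway .gz file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(path, "wt", encoding="ascii") as f:
    f.write("line 1\nline 2\n")

# "rt" = read in text mode: lines arrive as str, so no .decode() is needed
with gzip.open(path, "rt", encoding="ascii") as f:
    lines = [l.strip() for l in f]

print(lines)
```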

 + What sequence occurs most often in the file `files/example.fastq.gz`?
 + **Bonus**: create a histogram of read quality. The last line of each record encodes a PHRED score for each nucleotide, which corresponds to a [probability of error](https://en.wikipedia.org/wiki/FASTQ_format#Quality). With the Phred+33 encoding, the probability ranges from `1.00000` (score 0, encoded as the character `!`) to `0.00006` (score 42, encoded as the letter `K`). Sum up the scores for all bases in each read. A lookup table is [here](http://www.drive5.com/usearch/manual/qscores.gif).
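
The lookup table can also be computed rather than typed in. The sketch below assumes the Phred+33 (Sanger) encoding, where the score is the character's ASCII code minus 33; older Illumina files use a different offset. The function names `phred_scores` and `error_probability` are hypothetical helpers written for this example.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(c) - offset for c in quality_line]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

quality = "II9IG9IC"
scores = phred_scores(quality)
print(scores)
print([round(error_probability(q), 5) for q in scores])
```

Summing `scores` gives a single quality number per read, which is the quantity the histogram is built from.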


```
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII
```

In [None]:
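# Quality-score table: http://www.drive5.com/usearch/manual/qscores.gif

import gzip

seqs = {}
with gzip.open("files/example.fastq.gz") as f:
    for i, l in enumerate(f):
        # Each record is four lines long; the sequence is the second line.
        # (Testing for a leading "@" is unreliable, because a quality line
        # can also start with "@".)
        if i % 4 != 1:
            continue
        seq = l.decode("ascii").strip()
        seqs[seq] = seqs.get(seq, 0) + 1

# Report the most common sequence and how many times it occurs
most_common = max(seqs, key=seqs.get)
print(most_common, seqs[most_common])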
