# Day 2, Exercise 1 - file reading and writing
### There are 2 parts to this exercise, with the answers written under each section
1. Count the length of 10 DNA sequences
2. Calculate the GC content (optional)

There are various ways to write the code for these tasks. Here, we present one solution in the answers, but if you have written a different one, that's perfectly fine. Just ensure that you test your code to confirm that it performs as expected.

<hr style="border: 2px solid #000080;">

##  1. Count the length of 10 DNA sequences 
```
Ecoli-10seq_0.fna
Ecoli-10seq_1.fna
Ecoli-10seq_2.fna
Ecoli-10seq_3.fna
Ecoli-10seq_4.fna
Ecoli-10seq_5.fna
Ecoli-10seq_6.fna
Ecoli-10seq_7.fna
Ecoli-10seq_8.fna
Ecoli-10seq_9.fna
```
The 10 files listed above contain DNA sequences from `E.coli`. Count the length of each sequence and output the results to a file in the following tab-separated format, with one result per line:
```
SeqID Length
```
#### SeqID can be extracted as the first word of the annotation line in the sequence file, i.e. the line starting with `>`
For example, for the annotation line 
```
>lcl|NC_000913.3_cds_YP_025308.1_1724 [gene=katE] [locus_tag=b1732] [db_xref=UniProtKB/Swiss-Prot:P21179] [protein=catalase HPII] [protein_id=YP_025308.1] [location=1813867..1816128] [gbkey=CDS]
```
The SeqID should be `lcl|NC_000913.3_cds_YP_025308.1_1724`
#### Tips
- Sequences can be found in the folder `downloads/DNAseqs`. Download the zip file from <a href="https://python-bioinfo.bioshu.se/downloads.zip">here</a> if you haven't done it yet.
- You may need to use the following `string` methods
    - `string.lstrip()`       Remove leading characters from the left side of the string (whitespace by default, or any of the characters given as argument)
    - `string.strip()`        Remove whitespace from both ends of the string
    - `string.startswith()`   Check whether the string starts with a given substring
    - `string.split()`        Split the string into a list of substrings (on whitespace by default)

- Use loops to iterate over files when reading and processing files
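
As a small illustration of how the string methods listed above fit together, the sketch below extracts the SeqID from a shortened copy of the example annotation line. The variable names `header` and `seqid` are chosen here for illustration only.

```python
# A shortened copy of the example annotation line, including the trailing newline
header = ">lcl|NC_000913.3_cds_YP_025308.1_1724 [gene=katE] [locus_tag=b1732]\n"

seqid = ""
if header.startswith(">"):        # only annotation lines start with ">"
    stripped = header.lstrip(">") # drop the leading ">" character
    seqid = stripped.split()[0]   # split on whitespace and take the first word

print(seqid)  # lcl|NC_000913.3_cds_YP_025308.1_1724
```
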
---

### The answer

In [15]:
results = []  # Create an empty list to store the results

# Loop through 10 sequence files
for i in range(10):
    seqfile = "../../downloads/DNAseqs/Ecoli-10seq_" + str(i) + ".fna"
    with open(seqfile, "r") as fpin:
        seqlength = 0
        seqid = ""
        for line in fpin:
            if line.startswith(">"):  # Identify the sequence ID line
                line = line.lstrip(">")
                seqid = line.split()[0]
            else:
                seqlength += len(line.strip())  # Accumulate the length of the sequence
            
        results.append((seqid, seqlength))  # Append the sequence ID and its length to results
                                            # Note: (seqid, seqlength) is a tuple
                                            # so results is a list of tuples

# Make sure the output directory exists; if not, create it with `mkdir output`
output_dir = "output"

outfile = output_dir + "/" + "length_of_dna_seqs.txt"

# Write the results to the output file
with open(outfile, "w", encoding="utf-8") as fpout:
    for seqid, seqlength in results:
        fpout.write(seqid + "\t" + str(seqlength) + "\n") # unlike print(), write() does not add a newline,
                                                          # so '\n' must be added explicitly
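
The answer assumes that the `output` directory already exists. As an alternative to running `mkdir output` by hand, the directory could also be created from Python. The following is a minimal sketch using the standard `os` module, not part of the original answer:

```python
import os

output_dir = "output"

# Create the directory if it is missing; exist_ok=True avoids an error
# when the directory is already there
os.makedirs(output_dir, exist_ok=True)

# os.path.join builds the path with the separator of the current platform
outfile = os.path.join(output_dir, "length_of_dna_seqs.txt")
print(outfile)
```

With this in place, the `with open(outfile, "w", ...)` block can run without a missing-directory error.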

In [16]:
# Check the content of the output file
! cat ./output/length_of_dna_seqs.txt

lcl|NC_000913.3_cds_YP_025308.1_1724	2262
lcl|NC_000913.3_cds_NP_416194.1_1672	417
lcl|NC_000913.3_cds_NP_418217.1_3710	264
lcl|NC_000913.3_cds_NP_418490.1_3983	1293
lcl|NC_000913.3_cds_NP_416901.1_2381	1257
lcl|NC_000913.3_cds_NP_417887.1_3363	1434
lcl|NC_000913.3_cds_NP_418643.1_4139	342
lcl|NC_000913.3_cds_NP_414712.1_172	852
lcl|NC_000913.3_cds_NP_416467.1_1950	918
lcl|NC_000913.3_cds_NP_416274.1_1752	273
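
One way to confirm that the code performs as expected is to run the same logic on a tiny hand-written file whose answer is known in advance. The sketch below wraps the loop from the answer in a helper called `fasta_length` (a name introduced here for illustration) and, like the answer, assumes one sequence per file.

```python
import os
import tempfile


def fasta_length(path):
    """Return (seqid, length) for a FASTA file holding a single sequence."""
    seqid = ""
    seqlength = 0
    with open(path, "r") as fpin:
        for line in fpin:
            if line.startswith(">"):            # annotation line
                seqid = line.lstrip(">").split()[0]
            else:                               # sequence line
                seqlength += len(line.strip())
    return seqid, seqlength


# Write a small test file: 4 + 2 = 6 bases spread over two lines
with tempfile.TemporaryDirectory() as tmpdir:
    test_file = os.path.join(tmpdir, "test.fna")
    with open(test_file, "w") as fpout:
        fpout.write(">seq1 some description\nATGC\nGG\n")
    result = fasta_length(test_file)

print(result)  # ('seq1', 6)
```
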


<hr style="border: 2px solid #000080;">

## 2. Calculate the GC content (optional)

### Background:

In bioinformatics, the GC content is an important measure of the composition of DNA sequences. GC content is the percentage of nucleotides in a DNA sequence that are either guanine (G) or cytosine (C). This metric can provide insights into the properties of the DNA, such as its stability and melting temperature, and can differ between different organisms or genomic regions.

### Objective:

Write a script to calculate the GC content of the 10 DNA sequences used in the first exercise. Output the results to a file in the following tab-separated format, with one result per line:
```
SeqID GC_content
```


### Tips

- The GC content is calculated using the following formula:
    - `GC_content = (count_of_G + count_of_C) / sequence_length * 100`
- Use an if/else statement and loops
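
To make the formula concrete, here is a small worked example on a made-up six-base sequence (not taken from the exercise data): `ATGCGC` contains 2 G and 2 C, so its GC content is 4 / 6 * 100.

```python
seq = "ATGCGC"  # made-up example sequence

gc_count = seq.count("G") + seq.count("C")  # 2 + 2 = 4
gc_content = gc_count / len(seq) * 100      # 4 / 6 * 100

print(round(gc_content, 2))  # 66.67
```

Note that `count` is case-sensitive, so lowercase bases would only be counted after converting the sequence with `seq.upper()`.
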
___

###  The answer

In [18]:
import os  # needed for os.path.join below

results = []  # List to store results (SeqID, GC_content)

# Loop through 10 DNA sequence files
for i in range(10):
    seqfile = f"../../downloads/DNAseqs/Ecoli-10seq_{i}.fna"
    
    with open(seqfile, "r") as fpin:
        seqlength = 0
        gc_count = 0
        seqid = ""

        for line in fpin:
            if line.startswith(">"):  # Identify the sequence ID line
                line = line.lstrip(">")
                seqid = line.split()[0]
            else:
                line = line.strip()
                seqlength += len(line)
                gc_count += line.count('G') + line.count('C') # use the "count" method to count the number of 
                                                              # G and C
        
        if seqlength > 0:
            gc_content = (gc_count / seqlength) * 100  # Calculate GC content
        else:
            gc_content = 0

        results.append((seqid, gc_content))  # Append the SeqID and GC_content to results

# Make sure the output directory exists; if not, create it with `mkdir output`
output_dir = "output"

# Output file for GC content
outfile = os.path.join(output_dir, "gc_content_of_dna_seqs.txt")

# Write the GC contents to the output file
with open(outfile, "w", encoding="utf-8") as fpout:
    for seqid, gc_content in results:
        fpout.write(seqid + "\t" + str(gc_content) + "\n")

In [19]:
# Check the content of the output file
! cat ./output/gc_content_of_dna_seqs.txt

lcl|NC_000913.3_cds_YP_025308.1_1724	52.07780725022104
lcl|NC_000913.3_cds_NP_416194.1_1672	48.6810551558753
lcl|NC_000913.3_cds_NP_418217.1_3710	49.24242424242424
lcl|NC_000913.3_cds_NP_418490.1_3983	32.94663573085847
lcl|NC_000913.3_cds_NP_416901.1_2381	48.050914876690534
lcl|NC_000913.3_cds_NP_417887.1_3363	54.95118549511855
lcl|NC_000913.3_cds_NP_418643.1_4139	51.461988304093566
lcl|NC_000913.3_cds_NP_414712.1_172	49.88262910798122
lcl|NC_000913.3_cds_NP_416467.1_1950	53.05010893246187
lcl|NC_000913.3_cds_NP_416274.1_1752	50.54945054945055
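
The values above are written with full floating-point precision. If a fixed number of decimals is preferred, an f-string format specifier is one option. The sketch below is an optional variation on the answer and reuses the first result shown above:

```python
# First (SeqID, GC_content) pair from the output above
results = [("lcl|NC_000913.3_cds_YP_025308.1_1724", 52.07780725022104)]

# ":.2f" formats the number with two digits after the decimal point
lines = [f"{seqid}\t{gc_content:.2f}\n" for seqid, gc_content in results]

print(lines[0], end="")  # the SeqID, a tab, then 52.08
```
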
