# A03 Key

## Problem 1

What is one of the primary challenges in de novo gene assembly?

**A.** The uniformity of genetic sequences across different organisms, making it difficult to distinguish between species.

**B.** The presence of repetitive sequences within the genome that can complicate the assembly process.

**C.** The overabundance of RNA sequences in DNA sequencing data, interfering with genome assembly.

**D.** The rapid degradation of DNA samples during the sequencing process, resulting in incomplete data.

In [1]:
problem_1_answer = "B"

### Explanation

The uniformity of genetic sequences across different organisms makes it challenging to distinguish between species.

> This option is incorrect because de novo gene assembly primarily involves assembling sequences from a single organism's genome rather than distinguishing between species.
> The challenge in de novo assembly is not about the uniformity across species but rather about accurately piecing together the genome of a single organism.

**The presence of repetitive sequences within the genome can complicate the assembly process.**

> This option is correct.
> One of the primary challenges in de novo gene assembly is the presence of repetitive sequences within the genome.
> These repeats can confuse assembly algorithms, making it difficult to accurately reconstruct the original genome from short sequencing reads.
> Repetitive sequences can lead to assembly errors, such as collapses or expansions of repeat regions.

The overabundance of RNA sequences in DNA sequencing data, interfering with genome assembly.

> This option is incorrect because DNA sequencing data, especially in the context of de novo genome assembly, focuses on DNA rather than RNA. While RNA-seq data can be used for transcriptome assembly, the primary challenge in de novo DNA assembly is not the presence of RNA sequences but dealing with the DNA sequences themselves.
> Contamination and quality control are concerns in sequencing data.
> Still, the overabundance of RNA sequences is not a direct challenge for de novo DNA assembly.

The rapid degradation of DNA samples during the sequencing process results in incomplete data.

> This option is incorrect as a primary challenge. While DNA degradation can be a problem in sequencing, modern sequencing technologies and protocols are designed to minimize the impact of degradation.
> The primary challenge in de novo assembly is not the degradation of DNA samples but rather computational and algorithmic challenges related to accurately reconstructing genomes from short, error-prone reads.
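
The repeat problem described for option B can be illustrated with a minimal sketch. The sequences below are hypothetical toy data, not taken from any real genome: two different "genomes" are built from the same repeat and the same unique segments, with only the order of the middle segments swapped. When the k-mers are too short to span a full repeat copy, both genomes produce exactly the same k-mers, so short reads alone cannot tell the two arrangements apart.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy data: one 8-base repeat and four short unique segments.
repeat = "ACGTACGT"
seg_a, seg_b, seg_c, seg_d = "TTA", "GGA", "CCT", "AAG"

# Same building blocks, but the two middle segments are swapped.
genome_1 = seg_a + repeat + seg_b + repeat + seg_c + repeat + seg_d
genome_2 = seg_a + repeat + seg_c + repeat + seg_b + repeat + seg_d

# 5-mers are too short to span a whole repeat copy plus both flanks.
same_short_kmers = kmer_counts(genome_1, 5) == kmer_counts(genome_2, 5)

# 10-mers can cover a repeat copy plus one flanking base on each side.
same_long_kmers = kmer_counts(genome_1, 10) == kmer_counts(genome_2, 10)

print(same_short_kmers, same_long_kmers)
```

In this toy case the 5-mer content is identical while the 10-mer content differs, which is one way to see why reads longer than the repeat help resolve it.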

## Problem 2

Which of the following statements accurately reflects a challenge associated with de novo assembly, especially when compared to reference assembly?

**A.** De novo assembly is less effective at detecting single nucleotide polymorphisms (SNPs).

**B.** De novo assembly cannot be used to study organisms with large or complex genomes, such as humans or plants.

**C.** De novo assembly typically requires higher sequencing depth to achieve comparable accuracy and completeness to reference assembly.

**D.** Reference assembly is preferred for metagenomic studies due to its superior ability to reconstruct genomes from mixed samples.


In [2]:
problem_2_answer = "C"

### Explanation

De novo assembly is less effective at detecting single nucleotide polymorphisms (SNPs).

> This statement is somewhat misleading.
> While de novo assembly does not rely on a reference genome for SNP identification, it can still be used to detect SNPs once the assembly is complete.
> The challenge here is not the effectiveness in detecting SNPs per se but rather the complexity of assembling a genome without a reference.
> Once assembled, the novel genome can serve as a reference for SNP detection through comparison with other sequences.
> Thus, the main challenge with de novo assembly in the context of SNPs is not an inherent inability to detect them but rather the initial difficulty of assembling the genome to a high enough quality that can then be used for such analyses.

De novo assembly cannot be used to study organisms with large or complex genomes, such as humans or plants.

> This statement is incorrect.
> De novo assembly can and has been used to study organisms with large and complex genomes.
> The challenge is not that it cannot be used but that it requires substantial computational resources and often a higher sequencing depth to manage the complexity and size of these genomes.
> Advances in sequencing technologies and assembly algorithms have increasingly made it possible to tackle large genomes de novo, although with varying degrees of difficulty and resource requirements.

**De novo assembly typically requires higher sequencing depth to achieve comparable accuracy and completeness to reference assembly.**

> This statement is correct.
> Without a reference genome to guide the alignment of sequencing reads, de novo assembly relies on overlapping sequences to assemble the genome or transcriptome from scratch.
> This process is more susceptible to errors, especially in repetitive regions or areas with low complexity, so higher sequencing coverage is needed to ensure that all regions are accurately assembled.
> The increased sequencing depth helps compensate for these challenges by providing more data for the assembly algorithm, improving the chances of correctly piecing together the genome.

Reference assembly is preferred for metagenomic studies due to its superior ability to reconstruct genomes from mixed samples.

> This statement is overgeneralized and inaccurate.
> The preference for reference assembly or de novo assembly in metagenomic studies depends on the specific goals of the study and the availability of reference genomes.
> De novo assembly can be particularly valuable in metagenomics for discovering novel organisms that lack a reference genome.
> However, reference assembly can be more straightforward and less computationally intensive when references are available and the goal is to identify known organisms within a sample.
> Both approaches have their place in metagenomics, with de novo assembly playing a critical role in exploring the genetic material of uncultured or unknown microbes.
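
The sequencing-depth point made for option C can be made concrete with a rough calculation. The sketch below assumes the idealized Lander-Waterman model, in which reads land uniformly at random, so the expected fraction of bases with no read at all is about `exp(-coverage)`. The genome size, read length, and read counts are hypothetical values chosen only for illustration; real de novo projects usually need more depth than this model suggests because of repeats and sequencing errors.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


# Hypothetical numbers, for illustration only.
genome_size = 5_000_000  # bases
read_length = 150        # bases per read

for num_reads in (100_000, 500_000, 1_000_000):
    c = mean_coverage(num_reads, read_length, genome_size)
    print(f"{num_reads:>9} reads -> {c:4.1f}x coverage, "
          f"~{expected_uncovered_fraction(c):.2e} of bases uncovered")
```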


## Problem 3

**Points:** 5

Imagine you are working on a simplified de novo gene assembly project where you are given a set of short DNA reads.
Your task is to determine the shortest possible DNA sequence that contains all the given reads as substrings.
Each read is represented by a string consisting of the characters A, T, C, and G, which stand for the nucleotides adenine, thymine, cytosine, and guanine, respectively.

Given the following reads:

-   Read 1: ATCG
-   Read 2: CGTT
-   Read 3: GATC
-   Read 4: TCGA

Determine the overlap between each pair of reads and combine them all into one contig.
Start with read 1, merge with the highest overlap, then repeat until all reads are used.

Put your final sequence in `problem_3_answer`; for example, `problem_3_answer = "AAAAATTTTTCCGGC"`.

In [3]:
problem_3_answer = "TCGATCGTT"

### Explanation

In the normal greedy algorithm, you have to try every possible merge to find the largest overlap.
To make it easier, I gave you the instruction to start with a merge involving read 1.

**Merge 1**

Try to merge `ATCG` with `CGTT`.

```text
ATCG--
--CGTT
```

or 

```text
----ATCG
CGTT----
```

Maximum overlap of 2.

**Merge 2**

Try to merge `ATCG` with `GATC`.

```text
ATCG---
---GATC
```

or 

```text
-ATCG
GATC-
```

Maximum overlap of 3.

**Merge 3**

Try to merge `ATCG` with `TCGA`.

```text
ATCG-
-TCGA
```

or

```text
---ATCG
TCGA---
```

Maximum overlap of 3.

**(Bonus) Merge 4**

Try to merge `CGTT` with `GATC`.

```text
CGTT----
----GATC
```

or

```text
---CGTT
GATC---
```

Maximum overlap of 1.

**(Bonus) Merge 5**

Try to merge `CGTT` with `TCGA`.

```text
CGTT---
---TCGA
```

or

```text
----CGTT
TCGA----
```

Maximum overlap of 1.

**(Bonus) Merge 6**

Try to merge `GATC` with `TCGA`.

```text
GATC--
--TCGA
```

or

```text
--GATC
TCGA--
```

Maximum overlap of 2.
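
The pairwise overlaps worked out by hand above can be cross-checked with a short sketch. The helper name `overlap` is an illustrative choice, not part of the assignment; it returns the length of the longest suffix of the left read that matches a prefix of the right read, so each unordered pair is checked in both directions, just like the two alignments shown for every merge.

```python
from itertools import permutations


def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


reads = ["ATCG", "CGTT", "GATC", "TCGA"]

# Directed overlaps: (left, right) means `right` is appended after `left`.
overlaps = {(a, b): overlap(a, b) for a, b in permutations(reads, 2)}

for (a, b), size in sorted(overlaps.items(), key=lambda item: -item[1]):
    print(f"{a} -> {b}: {size}")
```

The maximum reported for each merge above is the larger of the two directed values for that pair.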

Since we have two equally good merges (both with an overlap of 3), `GATC` + `ATCG` and `ATCG` + `TCGA`, we have to consider both options.

**Merge 2.1**

Try to merge `GATCG` with `CGTT`.

```text
GATCG--
---CGTT
```

or

```text
----GATCG
CGTT-----
```

Maximum overlap of 2.

**Merge 2.2**

Try to merge `GATCG` with `TCGA`.

```text
GATCG-
--TCGA
```

or

```text
--GATCG
TCGA---
```

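Maximum overlap of 3, from the first alignment, which gives `GATCGA`.
However, `GATCGA` has no overlap with the remaining read `CGTT`, so that merge is a dead end.
The second alignment (overlap of 2) gives `TCGATCG`, which can still be extended, so we proceed with `TCGATCG`.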

**Merge 2.2.1**

There is only one fragment left, so let's try to merge it.

```text
TCGATCG--
-----CGTT
```

or

```text
---TCGATCG
CGTT------
```

We have a maximum overlap of 2, which would result in **`TCGATCGTT`**.

What happens if we make the other move (`ATCG` + `TCGA`) first?

**Merge 3.1**

Try to merge `ATCGA` with `CGTT`.

```text
ATCGA----
-----CGTT
```

or

```text
----ATCGA
CGTT-----
```

Maximum overlap of 0.

**Merge 3.2**

Try to merge `ATCGA` with `GATC`.

```text
ATCGA--
---GATC
```

or

```text
-ATCGA
GATC--
```

Maximum overlap of 3.
We proceed with this merge which would result in `GATCGA`.

**Merge 3.2.1**

Try to merge `GATCGA` with `CGTT`.

```text
GATCGA----
------CGTT
```

or

```text
----GATCGA
CGTT------
```

There is no overlap, so we do not have a valid merge; thus, the answer is the contig from Merge 2.2.1: `TCGATCGTT`.
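
As a cross-check on the final contig, here is a brute-force sketch. It is not the greedy procedure used in the walkthrough: it simply tries every ordering of the four reads, merges neighbors with their largest suffix-prefix overlap, and keeps the shortest result. The function names are illustrative, and exhaustive search like this is only practical for a handful of reads.

```python
from itertools import permutations


def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(ordered_reads) -> str:
    """Merge reads left to right, overlapping each read with the contig so far."""
    contig = ordered_reads[0]
    for read in ordered_reads[1:]:
        contig += read[overlap(contig, read):]
    return contig


reads = ["ATCG", "CGTT", "GATC", "TCGA"]

# Try all 24 orderings and keep the shortest merged sequence.
candidates = [merge_in_order(order) for order in permutations(reads)]
shortest = min(candidates, key=len)

print(shortest, len(shortest))
```

For these four reads the shortest candidate is the 9-base contig from Merge 2.2.1.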

## Problem 4

You are given the following sequencing error probabilities from an Illumina run:

-   0.0398%
-   0.0398%
-   0.0398%
-   0.0398%
-   0.0398%
-   0.0158%
-   0.0316%
-   0.0158%
-   0.0251%
-   0.02%

What would the ASCII characters be in the FASTQ file?

In [4]:
problem_4_answer = "CCCCCGDGEF"

### Explanation

The sequencing error probabilities provided correspond to the error rates observed in an Illumina sequencing run.
In FASTQ format, quality scores are encoded as ASCII characters, with each character representing a Phred quality score.
The Phred quality score, Q, is related to the error probability, P, by the formula:

$$
Q = -10 \log_{10}(P)
$$
where P is the error probability (e.g., 0.0398% or 0.000398 in decimal form).

To answer the question, calculate the Phred quality score for each error probability, round it to the nearest integer, and convert it to its ASCII character.
In the FASTQ format, a Phred score of Q is encoded by the ASCII character with code Q + 33 (the Sanger, or Phred+33, encoding).
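
As a worked example of the two steps (the formula, then the +33 offset), take the first and last probabilities from the list; intermediate values are rounded here for readability:

$$
Q = -10 \log_{10}(0.000398) \approx 34.0, \qquad 34 + 33 = 67 \;\rightarrow\; \texttt{C}
$$

$$
Q = -10 \log_{10}(0.0002) \approx 36.99 \approx 37, \qquad 37 + 33 = 70 \;\rightarrow\; \texttt{F}
$$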

The ASCII characters corresponding to the given sequencing error probabilities in the FASTQ file would be:

-   C for the error probabilities of 0.0398%
-   G for the error probabilities of 0.0158%
-   D for the error probability of 0.0316%
-   E for the error probability of 0.0251%
-   F for the error probability of 0.02%

These characters represent the encoded Phred quality scores, which reflect the reliability of each base call in the sequence data.

In [5]:
import numpy as np

# Given error probabilities in percentage
error_percentages = [
    0.0398,
    0.0398,
    0.0398,
    0.0398,
    0.0398,
    0.0158,
    0.0316,
    0.0158,
    0.0251,
    0.0200,
]

# Convert percentages to decimal probabilities
error_probabilities = np.array(error_percentages) / 100

# Calculate Phred quality scores
Q_scores = -10 * np.log10(error_probabilities)

# Convert Phred scores to ASCII characters
ascii_characters = "".join([chr(int(round(q)) + 33) for q in Q_scores])

print(ascii_characters)

CCCCCGDGEF
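
As a sanity check, the encoding can be reversed. The sketch below is an illustrative helper (the name `decode_quality_string` is not from the assignment) that assumes the Phred+33 offset described above and maps each character back to an error probability; rounding the results to four decimal places should recover the percentages given in the problem.

```python
from typing import List


def decode_quality_string(quality: str, offset: int = 33) -> List[float]:
    """Convert FASTQ quality characters back to error probabilities."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


# Decode the answer string and express each probability as a percentage.
probabilities = decode_quality_string("CCCCCGDGEF")
percentages = [round(p * 100, 4) for p in probabilities]

print(percentages)
```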
