# What are N50 and L50?
N50 and L50 are terms commonly used in genomics and bioinformatics, particularly in the context of genome assembly.

N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length, where contigs are contiguous sequences of DNA. It is a length-weighted statistic, not the median contig length. It measures assembly contiguity: a higher N50 means that more of the assembly sits in long contigs.

L50, on the other hand, is the number of contigs in that set. In other words, it is the smallest number of contigs, taken from longest to shortest, whose lengths add up to at least 50% of the total assembly length.

Together, N50 and L50 provide a way to evaluate the contiguity of a genome assembly. A higher N50 value and a lower L50 value generally indicate a more contiguous assembly, although neither metric measures correctness or completeness on its own.
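
The two definitions can also be written compactly. This is only a notational sketch: the symbols $\ell_i$, $n$ and $T$ are introduced here for illustration and do not appear elsewhere in this notebook. Let $\ell_1 \ge \ell_2 \ge \dots \ge \ell_n$ be the contig lengths sorted from longest to shortest, and let $T = \sum_{i=1}^{n} \ell_i$ be the total assembly length. Then

$$
\mathrm{L50} = \min\Big\{\, k : \sum_{i=1}^{k} \ell_i \ge \tfrac{T}{2} \Big\},
\qquad
\mathrm{N50} = \ell_{\mathrm{L50}}.
$$

So L50 is a count of contigs, and N50 is the length of the last contig included in that count.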

# Comparing N50 and L50
N50 and L50 are related but distinct metrics used to evaluate the quality of a genome assembly.

N50 is a length. With the contigs sorted from longest to shortest, it is the length of the contig at which the running total of contig lengths first reaches 50% of the total assembly length. Equivalently, at least half of the assembly is contained in contigs of that length or longer. It is not the median contig length.

L50, on the other hand, is a count. It is the number of contigs, taken from longest to shortest, needed to reach that 50% mark.

To illustrate the difference:

- A high N50 value indicates that half of the assembly is contained in long contigs, but it says nothing about how many contigs that takes.
- A low L50 value indicates that only a few contigs are needed to cover half of the assembly, which is generally desirable.

For example, consider two assemblies:

Assembly A: N50 = 100,000 bp, L50 = 10
Assembly B: N50 = 50,000 bp, L50 = 20

__In this case, Assembly A has longer contigs (higher N50) and needs fewer of them to cover half of its length (lower L50), indicating a more contiguous assembly.__
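
N50 is easy to confuse with the median contig length, so a small side-by-side check may help. The sketch below uses hypothetical contig lengths chosen only for illustration and computes both values for the same list.

```python
from statistics import median

# Hypothetical contig lengths in bp, chosen only for illustration.
contig_lengths = [400, 300, 150, 100, 50]

ordered = sorted(contig_lengths, reverse=True)
half = sum(ordered) / 2  # half of the total assembly length (500.0)

# Walk from the longest contig down until half of the assembly is covered.
running = 0
for rank, length in enumerate(ordered, start=1):
    running += length
    if running >= half:
        n50, l50 = length, rank
        break

print(f"median = {median(ordered)}")  # median = 150
print(f"N50 = {n50}, L50 = {l50}")    # N50 = 300, L50 = 2
```

The median ignores how much sequence each contig contributes, whereas N50 weights contigs by their length, which is why the two numbers differ here.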

# Example
Here's an example to illustrate the difference between N50 and L50:

Suppose we have a genome assembly with the following contig lengths:

    Contig 1: 150,000 bp
    Contig 2: 120,000 bp
    Contig 3: 100,000 bp
    Contig 4: 80,000 bp
    Contig 5: 60,000 bp
    Contig 6: 40,000 bp
    Contig 7: 20,000 bp
    Contig 8: 10,000 bp

The total assembly length is 580,000 bp, so half of the assembly is 290,000 bp.

To calculate N50, arrange the contigs in order from longest to shortest and keep a running total of their lengths:

    150,000 bp (running total: 150,000 bp)
    120,000 bp (running total: 270,000 bp)
    100,000 bp (running total: 370,000 bp)
    80,000 bp (running total: 450,000 bp)
    60,000 bp (running total: 510,000 bp)
    40,000 bp (running total: 550,000 bp)
    20,000 bp (running total: 570,000 bp)
    10,000 bp (running total: 580,000 bp)

The running total first reaches 290,000 bp at the 3rd contig, so the N50 is 100,000 bp. Note that this is not the median contig length, which would be the average of the 4th and 5th contigs: (80,000 + 60,000) / 2 = 70,000 bp.

__To calculate L50, we count the contigs needed to reach half of the total assembly length. In this case, 3 contigs are needed (Contigs 1-3). Therefore, the L50 is 3.__

So, in this example:

    N50 = 100,000 bp
    L50 = 3
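
The walkthrough for these eight contigs can be checked mechanically. The following is a minimal sketch that uses `itertools.accumulate` for the running totals; the variable names are illustrative.

```python
from itertools import accumulate

# Contig lengths (bp) from the example above.
contig_lengths = [150_000, 120_000, 100_000, 80_000,
                  60_000, 40_000, 20_000, 10_000]

ordered = sorted(contig_lengths, reverse=True)
total = sum(ordered)
half = total / 2

print(f"total = {total:,} bp, half = {half:,.0f} bp")

# Pair each contig with the running total up to and including it.
for rank, (length, running) in enumerate(zip(ordered, accumulate(ordered)), start=1):
    print(f"contig {rank}: {length:>7,} bp, running total {running:>7,} bp")
    if running >= half:
        # First contig at which half of the assembly is covered.
        n50, l50 = length, rank
        break

print(f"N50 = {n50:,} bp, L50 = {l50}")  # N50 = 100,000 bp, L50 = 3
```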


In [44]:
def calculate_n50_l50(contig_lengths: list) -> tuple:
    """
    Calculate N50 and L50 from a list of contig lengths.

    Args:
        contig_lengths (list): A list of contig lengths.

    Returns:
        tuple: A tuple containing the N50 and L50 values.
    """
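    # An empty assembly has no N50 or L50
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")

    # Sort a copy in descending order so the caller's list is not modified
    contig_lengths = sorted(contig_lengths, reverse=True)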

    # Calculate the total length of all contigs
    total_length = sum(contig_lengths)

    # Find the midpoint of the total length
    midpoint = total_length / 2

    # Initialize variables to track the cumulative length and the number of contigs
    cumulative_length = 0
    num_contigs = 0

    # Iterate over the contig lengths to find the N50 and L50
    for length in contig_lengths:
        cumulative_length += length
        num_contigs += 1

        # Check if the cumulative length has reached the midpoint
        if cumulative_length >= midpoint:
            # The N50 is the current contig length
            n50 = length
            # The L50 is the number of contigs needed to reach the midpoint
            l50 = num_contigs
            break

    return n50, l50

In [45]:
# Example usage:
contig_lengths = [1000, 800, 700, 600, 500, 400, 300, 200, 100]
n50, l50 = calculate_n50_l50(contig_lengths)
print(f"N50: {n50}")
print(f"L50: #{l50}")

N50: 700
L50: #3


In [56]:
import numpy as np

mean = 100
stddev = 10
n = 20
bp_numbers = np.random.normal(mean, stddev, n).astype(int)

print(f"original contig lengths: {bp_numbers}")
print(f"sorted contig lengths: {np.sort(bp_numbers)[::-1]}")
print(f"N50, L50 = {calculate_n50_l50(bp_numbers.tolist())}")

original contig lengths: [105 105 108 110 107  85 117 110  87 109 104 112  83  82  82 109 104  92
 112 103]
sorted contig lengths: [117 112 112 110 110 109 109 108 107 105 105 104 104 103  92  87  85  83
  82  82]
N50, L50 = (105, 10)
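
Since the contig lengths are already in a NumPy array here, the same calculation can also be done without an explicit loop. The sketch below is one possible vectorised variant, not a replacement for `calculate_n50_l50`. The function name `n50_l50_numpy` is hypothetical, and it assumes all lengths are positive so that the running totals are non-decreasing.

```python
import numpy as np

def n50_l50_numpy(contig_lengths):
    """Vectorised N50/L50 (illustrative sketch, assumes positive lengths)."""
    lengths = np.sort(np.asarray(contig_lengths, dtype=np.int64))[::-1]
    if lengths.size == 0:
        raise ValueError("contig_lengths must not be empty")

    # Running total of contig lengths, longest first
    running = np.cumsum(lengths)

    # Index of the first running total that is >= half of the assembly length
    idx = int(np.searchsorted(running, running[-1] / 2, side="left"))

    return int(lengths[idx]), idx + 1

# Same values as the sorted array printed above
sample = [117, 112, 112, 110, 110, 109, 109, 108, 107, 105,
          105, 104, 104, 103, 92, 87, 85, 83, 82, 82]
print(n50_l50_numpy(sample))  # (105, 10)
```

With `side="left"`, `np.searchsorted` returns the first position whose running total is greater than or equal to the midpoint, which matches the `>=` test in the loop version.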


# Why N50 and L50?

N50 and L50 are important metrics in genomics and bioinformatics because they summarize the contiguity of a genome assembly in two numbers. Here are some reasons why N50 and L50 are important:
- `Assembly quality`: N50 and L50 are used to assess how contiguous a genome assembly is. A higher N50 value indicates that half of the assembly is contained in longer contigs, which is generally desirable. A lower L50 value indicates that fewer contigs are needed to cover half of the assembly, which is also desirable.
- `Genome completeness`: N50 and L50 are sometimes read as a rough indication of how complete an assembly is, because a less fragmented assembly tends to have a higher N50 and a lower L50. They do not measure completeness directly, so they are best read alongside the total assembly length and the expected genome size.
- ...

# Real example

Human genome: NCBI reports N50 and L50 among the summary statistics for the human reference genome assembly. The record linked below lists 999 contigs, a contig N50 of 57.9 Mb, and a contig L50 of 18, meaning that the 18 longest contigs cover at least half of the assembly.
```
                        GenBank
Genome size             3.1 Gb
Total ungapped length   2.9 Gb
Number of contigs       999
Contig N50              57.9 Mb
Contig L50              18
```
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/