# Converting Multi-line FASTA to Single-line FASTA (Memory-Efficient)

FASTA files store DNA, RNA, or protein sequences. For readability, sequences are often split across multiple lines — typically 60-80 characters wide.

However, **this formatting breaks simple string searches** whenever the query spans a line break, because a newline character then sits in the middle of the would-be match.
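
A minimal illustration of the failure, using a made-up 8-base record wrapped at 4 characters (the record, the wrap width, and the variable names are illustrative assumptions, not data from this notebook):

```python
# Hypothetical record: the 8-base sequence ACTGTGCA wrapped at 4 characters.
multiline_record = ">seq1\nACTG\nTGCA\n"
query = "TGTG"  # spans the line break between "ACTG" and "TGCA"

# A plain substring search sees "ACTG\nTGCA", so the newline hides the match.
found_raw = query in multiline_record

# Joining the sequence lines removes the newline and exposes the match.
header, *sequence_lines = multiline_record.splitlines()
found_joined = query in "".join(sequence_lines)

print(found_raw, found_joined)  # False True
```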

This notebook converts multi-line FASTA files into a single-line FASTA format, preserving headers and merging each sequence into a continuous string.

## Goals:
- Handle very large sequences (e.g., entire chromosomes) **efficiently**.
- Avoid loading entire sequences into memory.
- Maintain fast and clean output generation.


In [1]:
def multiline_to_singleline_fasta(input_path, output_path):
    """
    Convert a multi-line FASTA file to a single-line FASTA file with minimal memory usage.

    FASTA format often contains sequences broken into multiple lines for readability.
    This function rewrites such files so that each sequence appears as a single continuous line,
    immediately following its header line. This is important for tools and pipelines that expect
    FASTA files in single-line format for consistency or performance.

    Parameters:
    ----------
    input_path : str
        Path to the input FASTA file in multi-line format.
    output_path : str
        Path to the output FASTA file where each sequence is written as a single line.

    Memory Consideration:
    ---------------------
    This implementation avoids using in-memory lists or string concatenation, which are inefficient
    for very large sequences (e.g., human chromosomes can exceed 200 million base pairs).
    Instead, sequence lines are written directly to the output file as they are read,
    keeping memory usage constant regardless of sequence length.

    Behavior:
    ---------
    - Header lines (starting with '>') are preserved.
    - Sequence lines are merged into a single continuous line following each header.
    - Final newline is added at the end of the last sequence.
    - Empty lines are ignored silently (can be handled if needed).

    Example:
    -------
    Input:
        >seq1
        ACTG
        TGCA
        >seq2
        GATTACA

    Output:
        >seq1
        ACTGTGCA
        >seq2
        GATTACA
    """
    with open(input_path, 'r') as infile, open(output_path, 'w') as outfile:
        sequence_open = False            # True while the current sequence line is unterminated

        for line in infile:
            line = line.rstrip()         # Remove trailing newline and whitespace
            if not line:
                continue                 # Skip empty lines
            if line.startswith('>'):
                if sequence_open:
                    # End of previous record; write newline to finish its sequence
                    outfile.write('\n')
                    sequence_open = False
                outfile.write(line + '\n')
            else:
                # Sequence line: write it straight through instead of accumulating it
                outfile.write(line)
                sequence_open = True

        # Terminate the last sequence, if any, so the output ends with a newline
        if sequence_open:
            outfile.write('\n')


In [2]:
# Example usage on given sample

input_file = 'multiline_input.fasta'
output_file = 'singleline_output.fasta'

multiline_to_singleline_fasta(input_file, output_file)

# Display only first few lines of output file for the sake of readability
print("Preview of Converted FASTA File:")
with open(output_file) as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 5:  # stop after 6 lines (3 header/sequence pairs)
            print("...")
            break



Preview of Converted FASTA File:
>1
TCCAATTGAGCTATACCGTCAGAAGTAGATGAGATCTAACGGCGCGCCCGGTTGTCGCAGCCACGCTAGGCACAGTCAAAGCAGTGACCCAGCCCTTCAGACGTTACTGTGAAAATCTGTCCACCGAGAACGCTTTCAAAATATATTAGCGGTTTAAGCTAGGCAAGTCACGGCATGTCCTTACAAAAGCCTACGAGCCGGGCCACATTATGTCGCCCATAAGATGCCAATACAATCTCAAGTAGGATAAGCATGTCTTACCACTAGCGTTCTGTGTTTGTGGGGTTACGCGTTGTGTTACCAACGGTCCCCTATAGCTCAGCAGTAATAAGTTTTATTGTGCTCAAGGCC
>2
GATAGCGACGACGCTAGTTCCTAGGGACCCGGGGGGGATCAGTGGTGACGGTATGCTCTTGCTGGTGCGACCCGTAACGAAGTCACATGGGTAGTGGCGGCACAGGATATCGTAGTGACAGGAACCTGATATTCCGTTGCAGAACACACGCGACAATCTACAAGTTATTTTCGGCTGAGTTACTTACTCACAAGATCCACATGGAAATTTAATATACGACGCTGAGGAGGGTTGCACAACGGCTCTCTTTCCTAAGAACAGAATTGCCTAAGAATTCAGAGCTCAAAGGCTCCATTTCTTGAGGTATGGTCGTCGGTATATTAACGGTGCTA
>3
GCCTATCTCACATGATAACCAGGTCTAGTTATGCTATTCATATACCTACTGTGTCCACTTTCTGTACTCCACGCGCGTACATAAGTACTGGAGTTAAATACAATGAAACGATGCAAGATGGGCTAAAGTTCAAGGAGGAGCTCTGCCATATCATTCTCTACAGCAAGACTGGGGCTCTGGGAATGGAGATAAAAAGGAGTTACATGCACGTAAAGTCGCCATCCATGCTACTTTGTTTGGCGCGGGGAACGCCCCGCCCTACTATTCATATGT

# Progress So Far

We successfully converted a multi-line FASTA file to a single-line FASTA format in a **memory-efficient** manner suitable for huge genome sequences.

- The solution scales well without exhausting memory.
- Sequences are stored contiguously, simplifying downstream searches.
- This approach balances performance, simplicity, and scalability.

You can now reliably search or analyze sequences without worrying about line breaks interfering with your queries.
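
As a sketch of what such a search could look like, the hypothetical helper below (`find_in_singleline_fasta` is not part of this notebook) scans a single-line FASTA one record at a time and reports the first match per record. It assumes each header is followed by at most one sequence line, and it still holds one full sequence line in memory at a time, so a whole chromosome would need correspondingly more memory. The small temporary file is made-up test data.

```python
import os
import tempfile


def find_in_singleline_fasta(fasta_path, query):
    """Yield (header, offset) for the first match of `query` in each record.

    Assumes single-line FASTA: every header is followed by at most one
    sequence line. Offsets are 0-based positions within that sequence.
    """
    header = None
    with open(fasta_path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                header = line
            elif line:
                offset = line.find(query)
                if offset != -1:
                    yield header, offset


# Illustrative data only: two tiny records written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(">seq1\nACTGTGCA\n>seq2\nGATTACA\n")

hits = list(find_in_singleline_fasta(tmp.name, "TGTG"))
os.remove(tmp.name)
print(hits)  # [('>seq1', 2)]
```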

In [3]:
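def fasta_stats(fasta_path):
    """Return (number of records, total sequence length) for a FASTA file."""
    count = 0
    total_bases = 0
    with open(fasta_path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                count += 1                # Each header line starts a new record
            else:
                total_bases += len(line)  # Blank lines contribute zero
    return count, total_bases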

seq_count, base_count = fasta_stats(output_file)
print(f"Total Sequences: {seq_count:,}")
print(f"Total Bases: {base_count:,}")




Total Sequences: 50
Total Bases: 12,398
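
Counts alone do not prove that the conversion preserved every base. One possible sanity check, sketched here under the assumption that a SHA-256 digest over headers and unwrapped sequence data is an acceptable fingerprint, is to compare the same summary for the input and output files. The helper names `sequence_digest` and `_write_temp` are hypothetical, and the two temporary files are made-up test data.

```python
import hashlib
import os
import tempfile


def sequence_digest(fasta_path):
    """Return (record_count, base_count, sha256_hex), ignoring line wrapping."""
    digest = hashlib.sha256()
    records = 0
    bases = 0
    with open(fasta_path) as f:
        for line in f:
            line = line.rstrip()
            if not line:
                continue
            if line.startswith('>'):
                records += 1
                # Delimit headers so bases cannot silently move between records.
                digest.update(b'\n' + line.encode() + b'\n')
            else:
                bases += len(line)
                digest.update(line.encode())
    return records, bases, digest.hexdigest()


def _write_temp(text):
    """Write illustrative FASTA text to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
        tmp.write(text)
    return tmp.name


# Same records, wrapped at 4 characters versus unwrapped.
wrapped = _write_temp(">seq1\nACTG\nTGCA\n>seq2\nGATTACA\n")
unwrapped = _write_temp(">seq1\nACTGTGCA\n>seq2\nGATTACA\n")
wrapped_summary = sequence_digest(wrapped)
unwrapped_summary = sequence_digest(unwrapped)
os.remove(wrapped)
os.remove(unwrapped)

print(wrapped_summary == unwrapped_summary)  # True
print(wrapped_summary[:2])                   # (2, 15)
```

In this notebook the same comparison would be `sequence_digest(input_file) == sequence_digest(output_file)`.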



# Demonstrating Search Failure & Success Due to Line Breaks

Let's pick a sequence that spans a **fixed-width boundary** (e.g., position 60-61) in the multiline FASTA file. We will try to search for this sequence in both:

- the original multiline FASTA (where the boundary splits the line), and  
- the converted single-line FASTA (where no breaks exist).

We'll demonstrate how such a search fails in the former and succeeds in the latter.

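First, select a real breakpoint from the input file. For simplicity, the code uses the first 60-character sequence line it finds together with the line that follows it.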


In [4]:
# Pick a sequence that actually spans a 60-character boundary
breakpoint_string = ""
with open(input_file) as f:
    prev_line = ""
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            prev_line = ""
            continue
        if len(prev_line) == 60 and len(line) >= 5:
            # Take last 5 bases from prev_line and first 5 from current
            breakpoint_string = prev_line[-5:] + line[:5]
            break
        prev_line = line

print(f"Searching for substring across line break: {breakpoint_string}")

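def search_in_file(file_path, query):
    # Reads the whole file into memory: fine for this small sample,
    # but not suitable for genome-sized inputs.
    with open(file_path) as f:
        content = f.read()
        return query in content, content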

found_in_multiline, _ = search_in_file(input_file, breakpoint_string)
found_in_singleline, singleline_content = search_in_file(output_file, breakpoint_string)

print(f"Found in multiline FASTA? {found_in_multiline}")
print(f"Found in single-line FASTA? {found_in_singleline}")

def preview_match_context(text, query, radius=20):
    idx = text.find(query)
    if idx == -1:
        return "Query not found"
    start = max(0, idx - radius)
    end = min(len(text), idx + len(query) + radius)
    return text[start:end].replace(query, f"[{query}]")

if found_in_singleline:
    context = preview_match_context(singleline_content, breakpoint_string)
    print("\nMatch context in single-line FASTA:\n", context)


Searching for substring across line break: CGCAGCCACG
Found in multiline FASTA? False
Found in single-line FASTA? True

Match context in single-line FASTA:
 CTAACGGCGCGCCCGGTTGT[CGCAGCCACG]CTAGGCACAGTCAAAGCAGT


## Final Conclusion

A plain text search of a multi-line FASTA file fails for any substring that spans a fixed-width boundary (for example, every 60 bases), because a newline separates the two halves of the match.  
Our demonstration showed that the string `CGCAGC...` across positions 60-61 could **not be found** in the multi-line format, but **was found** after converting to single-line format.

**Key takeaway**: Converting FASTA to a single-line format is essential for accurate and seamless sequence querying — especially when dealing with motifs, primers, or gene annotations that may lie near line breaks.

This approach resolves the line-break search problem, and because the conversion streams the input line by line, it also scales to genome-sized data.
