# HOWTO: Removing rows from the alignment

In [1]:
# Generate random FASTA alignment of 3 samples, 9 characters long to use as an example
import random

def generate_random_alignment(num_samples, seq_len, bases='ATCG'):
    sequences = ['' for i in range(num_samples)]
    for i in range(seq_len):
        char1 = random.choice(bases)
        if random.random() > 0.5:  # >0.5 means variable
            # Pick another character
            char2 = random.choice(bases)
            freq1 = random.randint(1, num_samples-1)
            freq2 = num_samples - freq1
        else:
            char2 = ''
            freq1, freq2 = num_samples, 0
        for j1 in range(freq1):
            sequences[j1] += char1
        for j2 in range(freq1, freq1+freq2):
            sequences[j2] += char2
    return sequences

def generate_random_fasta(path, num_samples, seq_len, bases='ATCG'):
    sequences = generate_random_alignment(num_samples, seq_len, bases=bases)
    with open(path, 'w') as f:
        for i, seq in enumerate(sequences):
            if i == 1:
                print('>seq{i}\n{s}'.format(i=i+1, s=seq), file=f)
            else:
                print('>seq{i} description{i}\n{s}'.format(i=i+1, s=seq), file=f)

path = 'test.aln'
generate_random_fasta(path, 3, 9)

In [2]:
# Prints the contents of test.aln
with open('test.aln', 'r') as f:
    for line in f:
        print(line.rstrip())

>seq1 description1
TTATGAGTT
>seq2
TTATGAGTT
>seq3 description3
TTGTGCATT


## Editing alignment rows

The `Alignment` object provides two methods for removing rows from the multiple sequence alignment: `.remove` and `.retain`. Both are called through the `row` attribute, as in `aln.row.remove(0)`.



### Removing rows using `.remove`

The `.remove` method takes an integer position index and removes the row at that position. It also accepts a list of indices to remove several rows at once.

By default, the `.remove` method removes the row in place, changing the existing data. However, if `copy=True`, the method instead returns a copy of the data without the specified row(s), keeping the original data intact.
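The difference between the two modes can be illustrated without the library. The sketch below is not how `alignmentrs` is implemented; `remove_rows` is a hypothetical stand-in that works on a plain Python list, and returning `None` in the in-place mode is an assumption made for the illustration.

```python
def remove_rows(rows, positions, copy=False):
    """Hypothetical stand-in for `.remove`, operating on a plain list of rows."""
    # Accept either a single integer or a list of integers
    if isinstance(positions, int):
        positions = [positions]
    drop = set(positions)
    kept = [row for i, row in enumerate(rows) if i not in drop]
    if copy:
        # Leave `rows` untouched and hand back a new list
        return kept
    # Replace the contents of the existing list object
    rows[:] = kept
    return None  # assumption: the in-place mode returns nothing


rows = ['seq1', 'seq2', 'seq3']

# copy=True: the original list keeps all three rows
edited = remove_rows(rows, [0, 2], copy=True)

# Default: the original list itself loses the first row
remove_rows(rows, 0)
```

After running this, `edited` holds only `'seq2'`, while `rows` has been reduced in place to `'seq2'` and `'seq3'`.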

### Removing rows in place (default)

In the following example, the row with position index 0, which is the first row, will be removed from the sequence alignment. Note how this affects both the number of rows in the sequence alignment and the number of entries in the row metadata.

Because `.remove` modifies the underlying data by default, the change cannot be undone.

In [3]:
# Import the Alignment class and read the data from a file into an Alignment object
# See 01_Reading_alignments.ipynb for details about importing data.
from alignmentrs import Alignment
aln = Alignment.from_fasta(path)

In [4]:
# Row and row metadata before editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [5]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [6]:
# Remove row with position index 0 inplace
aln.row.remove(0)

In [7]:
# Resulting number of rows and columns in the alignment after editing
# Note that the latter (the number of columns) remains the same
aln.nrows, aln.ncols

(2, 9)

In [8]:
# Resulting row and row metadata after editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


### Removing rows via a copy

The following code also removes the first row of the sequence alignment (row 0). However, instead of editing the underlying data, it returns a new copy (`new_aln`) that reflects the edit, leaving the original data (`aln`) intact. Removing rows this way is useful when the original and edited states of the alignment need to be compared.

However, this approach is not always recommended, especially for large alignments, because holding both the original and the copy can nearly double the memory used for the analysis.

In [9]:
# Reimport the data from the file into an Alignment object so that
# the starting data is the same as in the first example.
# See 01_Reading_alignments.ipynb for details about importing data.
aln = Alignment.from_fasta(path)

In [10]:
# Row and row metadata before editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [11]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [12]:
# Return a copy of the data, removing row with position index 0
# The edited copy is named `new_aln`
new_aln = aln.row.remove(0, copy=True)

In [13]:
# Number of rows and columns in the original alignment after editing
aln.nrows, aln.ncols

(3, 9)

In [14]:
# Number of rows and columns in the NEW alignment
new_aln.nrows, new_aln.ncols

(2, 9)

In [15]:
# Row and row metadata in the original alignment after editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [16]:
# Row and row metadata in the NEW alignment
new_aln.row_and_metadata

Unnamed: 0,description,sequence
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


### Removing rows using `.retain`

The `.retain` method takes an integer position index and removes every row except the one at that position. To retain more than one row, pass a list of indices.

The functionality of `.retain` can be considered the inverse of `.remove`. Whereas `.remove` removes specified rows, `.retain` keeps specified rows and removes all other rows.

By default, the `.retain` method removes rows in place, changing the existing data. However, if `copy=True`, the method instead returns a copy of the data containing only the specified row(s), keeping the original data intact.
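The inverse relationship between the two methods can be checked on a plain Python list. This is only a sketch of the semantics described above, not the library's code: `retain_rows` is a hypothetical helper, and keeping the rows in their original order (rather than in the order the indices are given) is an assumption.

```python
def retain_rows(rows, positions, copy=False):
    """Hypothetical stand-in for `.retain`, operating on a plain list of rows."""
    # Accept either a single integer or a list of integers
    if isinstance(positions, int):
        positions = [positions]
    keep = set(positions)
    # Assumption: surviving rows stay in their original order
    kept = [row for i, row in enumerate(rows) if i in keep]
    if copy:
        return kept
    rows[:] = kept
    return None


rows = ['seq1', 'seq2', 'seq3']

# Keep only row 0, leaving `rows` untouched
retained = retain_rows(rows, 0, copy=True)

# Removing every position that was NOT specified gives the same rows
complement = {i for i in range(len(rows)) if i != 0}
via_remove = [row for i, row in enumerate(rows) if i not in complement]
```

Here `retained` and `via_remove` contain the same single row, and the number of rows left equals the number of indices passed to `retain_rows`.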

### Retaining rows in place (default)

In the following example, the row with position index 0, which is the first row, will be kept while all the other rows in the sequence alignment will be deleted. This produces the opposite effect of `.remove`.

This means the resulting number of rows and entries in the row metadata after editing will be equal to the number of specified indices.

By default, `.retain` will modify the underlying data and the original state before editing cannot be recovered.

In [17]:
# Reimport the data from the file into an Alignment object so that
# the starting data is the same as in the first example.
# See 01_Reading_alignments.ipynb for details about importing data.
aln = Alignment.from_fasta(path)

In [18]:
# Row and row metadata before editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [19]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [20]:
# Retain row with position index 0 inplace
aln.row.retain(0)

In [21]:
# Resulting number of rows and columns in the alignment after editing
# Note that the latter (the number of columns) remains the same
aln.nrows, aln.ncols

(1, 9)

In [22]:
# Resulting row and row metadata after editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT


### Retaining rows via a copy

The following code also retains the first row of the sequence alignment (row 0). However, in contrast to the default behavior, setting the `copy` parameter to `True` returns a new copy (`new_aln`) that reflects the changes, leaving the original data (`aln`) intact. Retaining rows this way is useful when the original and edited states of the alignment need to be compared.

However, returning a new copy is not always recommended, especially for large alignments, because keeping both the original and the new copy increases the memory needed for the analysis.

In [23]:
# Reimport the data from the file into an Alignment object so that
# the starting data is the same as in the first example.
# See 01_Reading_alignments.ipynb for details about importing data.
aln = Alignment.from_fasta(path)

In [24]:
# Row and row metadata before editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [25]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [26]:
# Return a copy of the data, retaining the row with position index 0, removing all others
# The edited copy is named `new_aln`
new_aln = aln.row.retain(0, copy=True)

In [27]:
# Number of rows and columns in the original alignment after editing
aln.nrows, aln.ncols

(3, 9)

In [28]:
# Number of rows and columns in the NEW alignment
new_aln.nrows, new_aln.ncols

(1, 9)

In [29]:
# Row and row metadata in the original alignment after editing
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT
seq2,,TTATGAGTT
seq3,description3,TTGTGCATT


In [30]:
# Row and row metadata in the NEW alignment
new_aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,TTATGAGTT


## Next

See `02b_Removing_columns.ipynb` for more information about removing and retaining alignment columns.

Proceed to `03a_Filtering_rows.ipynb` or `03b_Filtering_columns.ipynb` to learn more about using a function to select rows or columns, respectively.