# HOWTO: Removing columns from the alignment

In [1]:
# Generate a random FASTA alignment of 3 samples, each 9 characters long, to use as an example
import random

def generate_random_alignment(num_samples, seq_len, bases='ATCG'):
    sequences = ['' for i in range(num_samples)]
    for i in range(seq_len):
        char1 = random.choice(bases)
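        # About half of the columns receive a second character
        if random.random() > 0.5:
            # Pick another character (it may happen to equal char1,
            # in which case the column is still invariant)
            char2 = random.choice(bases)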
            freq1 = random.randint(1, num_samples-1)
            freq2 = num_samples - freq1
        else:
            char2 = ''
            freq1, freq2 = num_samples, 0
        for j1 in range(freq1):
            sequences[j1] += char1
        for j2 in range(freq1, freq1+freq2):
            sequences[j2] += char2
    return sequences

def generate_random_fasta(path, num_samples, seq_len, bases='ATCG'):
    sequences = generate_random_alignment(num_samples, seq_len, bases=bases)
    with open(path, 'w') as f:
        for i, seq in enumerate(sequences):
            if i == 1:
                print('>seq{i}\n{s}'.format(i=i+1, s=seq), file=f)
            else:
                print('>seq{i} description{i}\n{s}'.format(i=i+1, s=seq), file=f)

path = 'test.aln'
generate_random_fasta(path, 3, 9)

In [2]:
# Prints the contents of test.aln
with open('test.aln', 'r') as f:
    for line in f:
        print(line.rstrip())

>seq1 description1
TCATAAGGA
>seq2
TCAGAAGGT
>seq3 description3
CCAGATGGT


## Editing alignment columns

An `Alignment` object provides two methods for removing columns from a multiple sequence alignment: `.remove` and `.retain`. Both are called through the `.col` accessor, for example `aln.col.remove(0)`.

### Removing columns using `.remove`

The `.remove` method takes an integer position index and removes the character at that position from every sequence in the alignment, that is, the entire alignment column. To remove several columns at once, pass a list of indices.

By default, `.remove` edits the alignment in place, modifying the underlying data and metadata. If `copy=True`, the method instead returns a copy containing the edits and leaves the original data intact.
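The behavior described above can be sketched on plain Python strings. The helper `remove_columns` below is a hypothetical illustration and not part of `alignmentrs`; it only mimics the documented semantics (a single index or a list of indices, in-place editing by default, a new object when `copy=True`).

```python
def remove_columns(sequences, positions, copy=False):
    """Drop the characters at `positions` from every sequence.

    Hypothetical stand-in for the behavior described in the text;
    this is not the alignmentrs implementation.
    """
    # Accept a single integer or a list of integers
    if isinstance(positions, int):
        positions = [positions]
    drop = set(positions)
    edited = [
        ''.join(ch for i, ch in enumerate(seq) if i not in drop)
        for seq in sequences
    ]
    if copy:
        # Leave the caller's list untouched and hand back a new one
        return edited
    # Overwrite the contents of the caller's list; nothing is returned
    sequences[:] = edited
    return None


seqs = ['TCATAAGGA', 'TCAGAAGGT', 'CCAGATGGT']

# copy=True: columns 0 and 3 are dropped in the copy only
new_seqs = remove_columns(seqs, [0, 3], copy=True)
print(new_seqs)  # ['CAAAGGA', 'CAAAGGT', 'CAATGGT']
print(seqs)      # ['TCATAAGGA', 'TCAGAAGGT', 'CCAGATGGT']

# Default: the list itself is modified
remove_columns(seqs, 0)
print(seqs)      # ['CATAAGGA', 'CAGAAGGT', 'CAGATGGT']
```

The cells that follow show the same two modes using the actual `Alignment` object.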

### Removing columns in place

In this example, the alignment column at position index 0 (the first character of each sequence) is removed. Note how this affects both the number of columns in the sequence alignment and the number of entries in the column metadata.

Because `.remove` modifies the underlying data by default, the change cannot be undone.

In [3]:
# Import the Alignment class and load the data from a file into an Alignment object
# See 01_Reading_alignments.ipynb for details about importing data.
from alignmentrs import Alignment
aln = Alignment.from_fasta(path)

In [4]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [5]:
# Columns and column metadata before editing
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT
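Each row of the table above is one alignment column read from top to bottom across the three sequences. As a rough illustration (plain Python, independent of `alignmentrs`), the same column strings can be reproduced by transposing the sequences printed from `test.aln`:

```python
# Sequences as printed from test.aln earlier in this notebook
seqs = ['TCATAAGGA', 'TCAGAAGGT', 'CCAGATGGT']

# zip(*seqs) walks the sequences in parallel, yielding one tuple per column
columns = [''.join(chars) for chars in zip(*seqs)]

for position, column in enumerate(columns):
    print(position, column)
```

Position 0 gives `TTC`, matching the first row of the table.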


In [6]:
# Remove column with position index 0 inplace
aln.col.remove(0)

In [7]:
# Number of columns in the alignment after editing
# Note that the number of rows remains unchanged
aln.ncols

8

In [8]:
# Resulting columns and column metadata after editing
aln.column_and_metadata

Unnamed: 0,sequence
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT
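In the output above, the remaining columns are still labelled 1 to 8 rather than being renumbered from 0. A minimal sketch of this label-preserving behavior, using a plain list of `(label, column)` pairs as a stand-in for the table (an illustration only, not how `alignmentrs` stores its data):

```python
# (label, column) pairs matching the table shown before editing
table = list(enumerate(
    ['TTC', 'CCC', 'AAA', 'TGG', 'AAA', 'AAT', 'GGG', 'GGG', 'ATT']
))

# Drop the first entry but leave the labels of the others untouched
edited = table[1:]

labels = [label for label, _ in edited]
print(labels)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

This notebook does not show whether a later index refers to these labels or to the new positions, so it is worth checking before chaining several removals.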


### Removing columns via a copy

The following example also removes the first alignment column (column 0). However, unlike the first example, which modified the existing data, a new copy (`new_aln`) that reflects the edits is returned instead. In this way, the original data (`aln`) is kept intact. Removing columns in this manner is useful when it is necessary to compare the original and edited states of the alignment.

However, removing columns this way is not always recommended, especially for large alignments, because it creates a copy of the data and roughly doubles the memory used for the analysis.
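To get a feel for the memory cost, here is a back-of-the-envelope estimate. It assumes one byte per character and counts only the sequence data; both the helper `approx_bytes` and the alignment dimensions are hypothetical, and the real footprint depends on how the library stores its data.

```python
def approx_bytes(nrows, ncols, bytes_per_char=1):
    """Rough size of the sequence data alone, assuming a fixed
    number of bytes per character (an assumption for illustration)."""
    return nrows * ncols * bytes_per_char


# Hypothetical large alignment: 10,000 sequences of 1,000,000 columns
original = approx_bytes(10_000, 1_000_000)
# A copy with one column removed is almost as large as the original
edited_copy = approx_bytes(10_000, 999_999)

print(original / 1e9)                  # original alone, in GB
print((original + edited_copy) / 1e9)  # original plus copy, in GB
```

Under these assumptions the original takes about 10 GB, and holding both the original and the copy takes just under 20 GB.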

In [9]:
# Reimport the data from a file into an Alignment object so that
# the starting data will be the same as in the first example.
# See 01_Reading_alignments.ipynb for details about importing data.
aln = Alignment.from_fasta(path)

In [10]:
# Columns and column metadata before editing
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


In [11]:
# Number of rows and columns in the alignment prior to editing
aln.nrows, aln.ncols

(3, 9)

In [12]:
# Return a copy of the data, removing the column with position index 0
# The edited copy is named `new_aln`
new_aln = aln.col.remove(0, copy=True)

In [13]:
# Number of rows and columns in the original alignment after editing
aln.nrows, aln.ncols

(3, 9)

In [14]:
# Number of rows and columns in the NEW alignment
new_aln.nrows, new_aln.ncols

(3, 8)

In [15]:
# Column and column metadata in the original alignment after editing
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


In [16]:
# Column and column metadata in the NEW alignment
new_aln.column_and_metadata

Unnamed: 0,sequence
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


### Removing columns using `.retain`

The `.retain` method is the inverse of `.remove`: it takes an integer position index and removes every column except the one at that position. To keep more than one column, pass a list of indices.

By default, `.retain` edits the alignment in place, changing the underlying data. If `copy=True`, the method instead returns a copy that retains only the specified columns and leaves the original data intact.
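The inverse relationship between retaining and removing can be sketched on plain Python strings. The helper `retain_columns` is hypothetical and not part of `alignmentrs`; it always returns new strings and keeps the retained columns in their original order.

```python
def retain_columns(sequences, positions):
    """Return new strings holding only the characters at `positions`.

    Hypothetical helper for illustration; not part of alignmentrs.
    """
    # Accept a single integer or a list of integers
    if isinstance(positions, int):
        positions = [positions]
    keep = set(positions)
    return [
        ''.join(ch for i, ch in enumerate(seq) if i in keep)
        for seq in sequences
    ]


seqs = ['TCATAAGGA', 'TCAGAAGGT', 'CCAGATGGT']

# Keep only the first and last columns
kept = retain_columns(seqs, [0, 8])
print(kept)  # ['TA', 'TT', 'CT']

# Retaining positions 0 and 8 equals removing every other position
others = [i for i in range(len(seqs[0])) if i not in (0, 8)]
removed = [
    ''.join(ch for i, ch in enumerate(seq) if i not in others)
    for seq in seqs
]
print(kept == removed)  # True
```

Each resulting sequence has as many characters as there were distinct indices in the list.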

### Retaining columns in place

In the following example, the column at position index 0 (the first character of each sequence) is kept and all other characters in each sequence are deleted. This is the opposite of what `.remove` does.

As a result, the number of columns and the number of entries in the column metadata after editing both equal the number of indices specified.

By default, `.retain` modifies the data in place, and the original state of the alignment cannot be recovered afterwards.

In [17]:
# Import the data from a file into an Alignment object
# See 01_Reading_alignments.ipynb for details
aln = Alignment.from_fasta(path)

In [18]:
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


In [19]:
aln.ncols

9

In [20]:
aln.col.retain(0)
aln.ncols

1

In [21]:
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC


### Retaining columns via a copy

The code below also retains the first column of the sequence alignment (column 0). However, setting the `copy` parameter to `True` creates a new copy (`new_aln`) that reflects the changes, while the original data (`aln`) is kept intact. This way of editing the alignment is useful when it is necessary to compare the original and edited states.

However, returning a new copy instead of editing in place is not always recommended, especially for large alignments. Keeping both the original and the copy increases the memory needed for the analysis, up to roughly double when most columns are retained.

In [22]:
# Import the data from a file into an Alignment object
# See 01_Reading_alignments.ipynb for details
aln = Alignment.from_fasta(path)

In [23]:
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


In [24]:
aln.ncols

9

In [25]:
new_aln = aln.col.retain(0, copy=True)

In [26]:
new_aln.column_and_metadata

Unnamed: 0,sequence
0,TTC


In [27]:
aln.column_and_metadata

Unnamed: 0,sequence
0,TTC
1,CCC
2,AAA
3,TGG
4,AAA
5,AAT
6,GGG
7,GGG
8,ATT


## Next

See `02a_Removing_rows.ipynb` for more information about removing and retaining alignment rows.

Proceed to `03a_Filtering_rows.ipynb` or `03b_Filtering_columns.ipynb` to learn more about using a function to select rows or columns, respectively.