# HOWTO: Reading alignments

In [1]:
# Generate random FASTA alignment of 3 samples, 9 characters long to use as an example
import random

def generate_random_alignment(num_samples, seq_len, bases='ATCG'):
    sequences = ['' for i in range(num_samples)]
    for i in range(seq_len):
        char1 = random.choice(bases)
        if random.random() > 0.5:  # >0.5 means variable
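            # Pick a second character that differs from the first,
            # so that a "variable" site is never accidentally invariant
            char2 = random.choice([b for b in bases if b != char1])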
            freq1 = random.randint(1, num_samples-1)
            freq2 = num_samples - freq1
        else:
            char2 = ''
            freq1, freq2 = num_samples, 0
        for j1 in range(freq1):
            sequences[j1] += char1
        for j2 in range(freq1, freq1+freq2):
            sequences[j2] += char2
    return sequences

def generate_random_fasta(path, num_samples, seq_len, bases='ATCG'):
    sequences = generate_random_alignment(num_samples, seq_len, bases=bases)
    with open(path, 'w') as f:
        for i, seq in enumerate(sequences):
            if i == 1:
                print('>seq{i}\n{s}'.format(i=i+1, s=seq), file=f)
            else:
                print('>seq{i} description{i}\n{s}'.format(i=i+1, s=seq), file=f)

path = 'test.aln'
generate_random_fasta(path, 3, 9)

In [2]:
# Prints the contents of test.aln
with open('test.aln', 'r') as f:
    for line in f:
        print(line.rstrip())

>seq1 description1
AACAATCGG
>seq2
TACAATCGG
>seq3 description3
TACAATGGG


## Import the `Alignment` object from `alignmentrs`

`alignmentrs` is the package containing all the classes and methods for reading and manipulating a multiple sequence alignment.

The `Alignment` class is used to create an alignment object that contains the information about a multiple sequence alignment, from its sequences to sample names and related metadata.

In [3]:
from alignmentrs import Alignment

## Open FASTA file using `Alignment.from_fasta`

The `Alignment` class provides methods for importing data encoded in various formats. The names of these import methods begin with `from_`:
- `from_dict` imports data formatted as a dictionary
- `from_fasta` imports a FASTA-formatted string or a FASTA file
- `from_json` imports a JSON-formatted string or a JSON file
- `from_pickle` imports a pickled alignment object

In this example, `from_fasta` is used to import an existing FASTA file as an `Alignment` object.

In [4]:
aln = Alignment.from_fasta(path)
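
To make the link between the FASTA text and the row metadata shown later more concrete, here is a minimal plain-Python sketch of reading FASTA records. It is not how `alignmentrs` implements `from_fasta`. The helper name `parse_fasta` is hypothetical, and the rule that the identifier is the first whitespace-delimited token of the header is an assumption chosen to match the identifiers and descriptions in this notebook.

```python
# Illustrative only: a plain-Python FASTA reader, not the alignmentrs implementation.
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    ident, desc, chunks = None, '', []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there is one.
            if ident is not None:
                records.append((ident, desc, ''.join(chunks)))
            # Assumption: the identifier is the first whitespace-delimited
            # token and the rest of the header line is the description.
            header = line[1:].split(None, 1)
            ident = header[0] if header else ''
            desc = header[1] if len(header) > 1 else ''
            chunks = []
        else:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if ident is not None:
        records.append((ident, desc, ''.join(chunks)))
    return records


example = """>seq1 description1
AACAATCGG
>seq2
TACAATCGG
>seq3 description3
TACAATGGG
"""
records = parse_fasta(example)
for ident, desc, seq in records:
    print(ident, repr(desc), seq)
```

A record without a description, such as `seq2`, ends up with an empty description string in this sketch.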

## Inspecting the contents of the alignment

The `Alignment` object encapsulates the alignment sequence matrix, sequence identifiers and descriptions (known as row metadata), site annotations (known as column metadata), and comments and descriptions for the entire alignment (known as alignment metadata).

### Attributes

These can be readily accessed as attributes of the `Alignment` object:
- `.data` shows the underlying sequence matrix as a `SeqMatrix` object
- `.row_metadata` shows sequence identifiers and other row-related metadata as a pandas `DataFrame`
- `.column_metadata` shows site annotations and other column-related metadata as a pandas `DataFrame`
- `.alignment_metadata` shows alignment comments and other alignment-related information as a `dict`

In [5]:
# Sequence matrix
aln.data

SeqMatrix(nrows=3, ncols=9)

In [6]:
# Row metadata
aln.row_metadata

Unnamed: 0,description
seq1,description1
seq2,
seq3,description3


In [7]:
# Column metadata
aln.column_metadata

0
1
2
3
4
5
6
7
8


### Properties

The `Alignment` object has several properties used to describe its contents:
- `.nrows` shows the number of samples present in the alignment.
- `.ncols` shows the number of aligned characters in the alignment.
- `.sequences` returns the sequences in the alignment as a list of strings.
- `.row_and_metadata` returns a new pandas `DataFrame` that joins each row's metadata with its corresponding sequence.
- `.column_and_metadata` returns a new pandas `DataFrame` that joins each column's metadata with the characters found at that site.
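
As a rough mental model of what these properties report, the sketch below uses plain Python lists in place of the `Alignment` object. The variable names (`ids`, `descriptions`, `rows_with_metadata`) are made up for illustration, and the library itself returns pandas `DataFrame` objects rather than lists of dictionaries.

```python
# Illustrative only: plain-Python stand-ins for the values in this notebook.
ids = ['seq1', 'seq2', 'seq3']
descriptions = ['description1', '', 'description3']
sequences = ['AACAATCGG', 'TACAATCGG', 'TACAATGGG']

# The number of rows is the number of sequences;
# the number of columns is their shared length.
nrows = len(sequences)
ncols = len(sequences[0]) if sequences else 0
if any(len(s) != ncols for s in sequences):
    raise ValueError('aligned sequences must all have the same length')

# Pair each row's metadata with its sequence by position.
rows_with_metadata = [
    {'id': i, 'description': d, 'sequence': s}
    for i, d, s in zip(ids, descriptions, sequences)
]

print(nrows, ncols)
print(rows_with_metadata[1])
```

The join is purely positional here: the first metadata entry goes with the first sequence, and so on.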

In [8]:
# Number of samples
aln.nrows

3

In [9]:
# Number of aligned columns
aln.ncols

9

In [10]:
# Sequences
aln.sequences

['AACAATCGG', 'TACAATCGG', 'TACAATGGG']

In [11]:
# Combined row metadata and sequence
aln.row_and_metadata

Unnamed: 0,description,sequence
seq1,description1,AACAATCGG
seq2,,TACAATCGG
seq3,description3,TACAATGGG


In [12]:
# Combined column metadata and sequence
aln.column_and_metadata

Unnamed: 0,sequence
0,ATT
1,AAA
2,CCC
3,AAA
4,AAA
5,TTT
6,CCG
7,GGG
8,GGG


## Accessing rows and columns

A row refers to the individual sequence of a sample or record in the sequence alignment. These records may be different gene sequences or sequences from different organisms that were aligned using multiple sequence alignment software (ClustalX, MAFFT, MUSCLE, etc.).

A column refers to the vertical list of characters found at a single position in the alignment. Concretely, the nth column is the list of the nth character of each sample in the sequence alignment.
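
The definition of a column can be sketched in plain Python as a transposition of the rows. This is only an illustration of the concept, using the example sequences from this notebook. The helper `column` is hypothetical and is not part of `alignmentrs`.

```python
# Illustrative only: deriving columns from rows with plain Python.
sequences = ['AACAATCGG', 'TACAATCGG', 'TACAATGGG']


def column(sequences, n):
    """Return the nth character of every sequence, joined into one string."""
    return ''.join(seq[n] for seq in sequences)


# zip(*sequences) transposes the rows, yielding one tuple of characters per site.
columns = [''.join(chars) for chars in zip(*sequences)]

print(column(sequences, 0))
print(columns[6])
```

With three samples of nine characters each, the transposition yields nine columns of three characters.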

To access rows and columns, `Alignment` has `.row` and `.col` properties for rows and columns respectively. Rows and columns are indexed by integers starting from 0 and use Python indexing conventions.
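
For readers less familiar with Python indexing conventions, the following sketch applies them to ordinary lists that stand in for the rows and columns of the example alignment. It does not exercise `.row` or `.col` themselves. The names `rows` and `cols` are placeholders.

```python
# Illustrative only: Python indexing conventions on plain lists.
rows = ['AACAATCGG', 'TACAATCGG', 'TACAATGGG']
cols = [''.join(chars) for chars in zip(*rows)]

first_row = rows[0]          # indices start at 0
last_row = rows[-1]          # negative indices count from the end
first_two_rows = rows[0:2]   # slices include the start and exclude the stop

last_col = cols[-1]
first_two_cols = cols[0:2]

print(first_row, last_row)
print(first_two_rows)
print(last_col, first_two_cols)
```

Because the stop index of a slice is excluded, `0:2` selects the items at positions 0 and 1 only.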

### Retrieving rows

In [13]:
# Retrieving the sequence of the first sample in the alignment
aln.row[0]

'AACAATCGG'

In [14]:
# Retrieving the sequence of the last sample in the alignment containing 3 samples
aln.row[2]

'TACAATGGG'

In [15]:
# Retrieving the sequence of the last sample in the alignment containing 3 samples
# using negative indexing
aln.row[-1]

'TACAATGGG'

In [16]:
# Retrieving the first 2 sequences in the alignment using a slice
aln.row[0:2]

['AACAATCGG', 'TACAATCGG']

### Retrieving columns

In [17]:
# Retrieving the first character of each of the 3 samples in the alignment
aln.col[0]

'ATT'

In [18]:
# Retrieving the last character of each of the 3 samples in the alignment
# where all samples are 9 characters long
aln.col[8]

'GGG'

In [19]:
# Retrieving the last character of all 3 samples in the alignment
# where all samples are 9 characters long using negative indexing
aln.col[-1]

'GGG'

In [20]:
# Retrieving the first 2 columns (the first 2 characters of each of the 3 samples) using a slice
aln.col[0:2]

['ATT', 'AAA']

## Next

Proceed to `02a_Removing_rows.ipynb` or `02b_Removing_columns.ipynb` to learn more about removing and retaining alignment rows or columns, respectively.