Problem

For two strings s_1 and s_2 of equal length, the p-distance between them, denoted d_p(s_1, s_2), is the proportion of corresponding symbols that differ between s_1 and s_2.
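
As a small worked example of this definition, the sketch below counts mismatching positions in two short illustrative strings and divides by their common length. The helper name `p_distance` and the two strings are assumptions made for illustration only; they are not part of the problem statement.

```python
def p_distance(s1, s2):
    """Proportion of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Each comparison yields True/False, which sum() counts as 1/0.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)

# Illustrative strings: they differ at positions 0, 1, 3 and 9 (0-based).
s1 = "TTTCCATTTA"
s2 = "GATTCATTTC"
example = p_distance(s1, s2)
print(f"{example:.5f}")  # 4 mismatches / 10 positions -> 0.40000
```

Identical strings give a p-distance of 0, and strings that differ at every position give 1.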

For a general distance function d on n taxa s_1, s_2, …, s_n (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a distance matrix D in which D_{i,j} = d(s_i, s_j).
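
One way to build such a matrix is sketched below. It assumes the distance function is symmetric and gives 0 for identical inputs (both hold for the p-distance), so only the entries above the diagonal are computed and then mirrored. The names `distance_matrix`, `p_distance`, and the `taxa` list are hypothetical and chosen for illustration.

```python
def p_distance(s1, s2):
    """Proportion of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def distance_matrix(seqs, dist):
    """Return D as a list of lists with D[i][j] = dist(seqs[i], seqs[j]).

    Assumes dist is symmetric and dist(s, s) == 0.
    """
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each unordered pair once and fill both mirrored cells.
            D[i][j] = D[j][i] = dist(seqs[i], seqs[j])
    return D

# Illustrative taxa (not taken from the dataset used in this notebook).
taxa = ["TTTCCATTTA", "GATTCATTTC", "TTTCCATTTT", "GTTCCATTTA"]
D = distance_matrix(taxa, p_distance)
for row in D:
    print(" ".join(f"{x:.5f}" for x in row))
```

Computing only the upper triangle roughly halves the number of comparisons, which makes little difference for n ≤ 10 but matters for larger collections.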

Given: A collection of n (n ≤ 10) DNA strings s_1, …, s_n of equal length (at most 1 kbp). Strings are given in FASTA format.

Return: The matrix D corresponding to the p-distance dp on the given strings. As always, note that your answer is allowed an absolute error of 0.001.

See more info here: https://rosalind.info/problems/pdst/

In [9]:
from Bio import SeqIO

dna_strings = []
names = []
for seq_record in SeqIO.parse('./Text_Files/rosalind_pdst.txt', "fasta"):
    names.append(seq_record.id)
    dna_strings.append(seq_record.seq)

In [15]:
for n, s in zip(names, dna_strings):
    print(n, "\n", s)

Rosalind_4473 
 CCCCCTGCATCGGGATTTGGTGGAAGATAATGCTTCAGCGAGGCGGCGACATCGACTCCGCGTAAATTCTTCACCGCTGACATTCGGCAGAAATTATTACCCGGTGTGCTCCTACTTCCGTCAGCATGTGCGTCGTCGACAGAGTGGAGAATGCACTTGGGTATTCTCACATGCCGACACTTGCAAGGTGAAGAAGTTCCGTAGGTTTGCATGTGAATACCTCTAGGAAATCTCCATCTCCTCTAAATAACAGCTCGGTCAGCGCGGTACTCTGCGACACCCAACAGAATCAACTTATTTCGATGTACACAATATTAATCTAAGGCCAACCTATTCTAAGGGAAGTGTACCTACAATTATGTTTCGTGTGTTGTGACTAGCATTGAGTTTCCGTTTGACAGCGTGCAACCCAAACGAGGGTATTGCCTATACGACTCTGTGCTTATTTTCATTCGCCGTCCTCTTGTAGGCACCTAACCGTCAAGTAGTAAAACTCAGCCGTGTCCATGTTGATCACACGTAAAGAATTCTATCGTGGGTTCTGCTCGGATCTGCTCTAATCCAGTTAGACGTCCGTCAATGTAAGAGCCGGTGTCTGATAAGAGTCAGGCTCAGAACAACCTGCAGTTCACGAATGGCCGATAGCTCCAGGCATTCACGCCCAAAGACTCTCGCCTACGTGGGACGTGTTACTCTTTTGGTTTCGTCGGTTCAAGGCGTTCGTATGGACTTCCTCCCGGAAGGGTTTCGGTAAGCGGTCATATCGTACTACGTAAAATAACGAAACCCCCCCATAAGTCCCACATACGATAGACCTTGTAGATCAACCGGATAGCGGTCGCTTAGCATGCGGGATCCACCACAGATGCGGGGAGACTAATGAACAACGGGTTGGATAACGGACACCTCAGGCTTAGACGATA
Rosalind_5498 
 TTGTCTACCTTAAGGTTTGATAGGGCGTCGTTCCCAGACGAATG

In [2]:
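def pdst(seq1, seq2):
    """Return the p-distance: the proportion of positions where seq1 and seq2 differ."""
    if len(seq1) != len(seq2):
        raise ValueError("p-distance is only defined for strings of equal length")

    tot_len = len(seq1)
    # Count the positions at which the two sequences carry different symbols.
    dif_count = sum(a != b for a, b in zip(seq1, seq2))

    return dif_count / tot_len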

In [3]:
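# Print the distance matrix, one row per sequence, with entries separated by spaces.
for seq_i in dna_strings:
    pdst_list = []
    for seq_j in dna_strings:
        # The space in the format spec pads non-negative values with a leading blank.
        pdst_list.append(f"{pdst(seq_i, seq_j): .5f}")

    print(' '.join(pdst_list))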

 0.00000  0.57855  0.56988  0.49404  0.61755  0.28277  0.31744  0.33261
 0.57855  0.00000  0.46804  0.31419  0.32286  0.61105  0.63814  0.46804
 0.56988  0.46804  0.00000  0.30336  0.56880  0.60563  0.65222  0.44529
 0.49404  0.31419  0.30336  0.00000  0.47887  0.54171  0.60997  0.30119
 0.61755  0.32286  0.56880  0.47887  0.00000  0.63705  0.65005  0.53738
 0.28277  0.61105  0.60563  0.54171  0.63705  0.00000  0.45070  0.46154
 0.31744  0.63814  0.65222  0.60997  0.65005  0.45070  0.00000  0.50921
 0.33261  0.46804  0.44529  0.30119  0.53738  0.46154  0.50921  0.00000
