# Grouping Algorithms Tests - DNAClust

## Description

Simple tests of the DNAClust sequence-clustering program, run on small synthetic FASTA files.



## Setup
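1. Download the DNAClust binaries or compile them from source
2. Place the binaries in `grouping-algorithms/dnaclust`, so that the executable is available as `grouping-algorithms/dnaclust/dnaclust` (the path the code below invokes)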
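
Before running the tests, it can help to confirm that the executable is where the notebook expects it. The following is a minimal sketch; the helper name `check_binary` is hypothetical and not part of the original notebook, and the path is the one used by the code in the Utility section.

```python
import os
from pathlib import Path


def check_binary(path: str) -> bool:
    """Return True if `path` is an existing file that the current user may execute."""
    candidate = Path(path)
    return candidate.is_file() and os.access(candidate, os.X_OK)


# Path taken from the notebook; adjust it if the binaries live elsewhere.
DNACLUST_PATH = "grouping-algorithms/dnaclust/dnaclust"

if check_binary(DNACLUST_PATH):
    print(f"Found executable: {DNACLUST_PATH}")
else:
    print(f"Missing or not executable: {DNACLUST_PATH}")
```

If the check fails on a Unix-like system, the file may simply lack the executable bit.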

## Utility


In [113]:
import subprocess

type Cluster = list[str]
type Clusters = list[Cluster]


def run(program: str, *args: str) -> subprocess.CompletedProcess[str]:
    """Run `program` with `args`, capturing stdout and stderr as text."""
    return subprocess.run([program, *args], capture_output=True, text=True)


def display_fasta(filename: str):
    print("=" * 10 + " BEGIN: FASTA " + "=" * 10)

    with open(f"grouping-algorithms/data/{filename}") as handle:
        print(handle.read())

    print("=" * 10 + " END: FASTA " + "=" * 10)


def display_clusters(clusters: Clusters):
    print("=" * 10 + " BEGIN: CLUSTERS " + "=" * 10)

    for idx, cluster in enumerate(clusters, 1):
        print(f"Cluster No. {idx}: ")

        for seq in cluster:
            print(f"\t- {seq}")

    print("=" * 10 + " END: CLUSTERS " + "=" * 10)

In [114]:
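def run_dna_clust(filename: str, similarity: float, k: int) -> Clusters:
    """Cluster a FASTA file from the data directory and parse DNAClust's output."""
    result = run(
        "grouping-algorithms/dnaclust/dnaclust",
        "grouping-algorithms/data/" + filename,
        "-l",
        "-s",
        f"{similarity:.2f}",
        "-k",
        str(k),
    )

    # Fail loudly instead of silently parsing empty output.
    if result.returncode != 0:
        raise RuntimeError(f"dnaclust failed: {result.stderr.strip()}")

    # Each non-empty stdout line is one cluster of tab-separated sequence IDs.
    clusters = [
        [element.strip() for element in line.strip().split("\t")]
        for line in result.stdout.splitlines()
        if line.strip()
    ]

    return clusters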
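
The parsing step in `run_dna_clust` can be tried without the binary. The sketch below isolates it in a hypothetical helper, `parse_clusters`, and feeds it a made-up `sample_stdout` string that mimics the one-cluster-per-line, tab-separated layout the function assumes. The sample is illustrative only and was not captured from DNAClust.

```python
def parse_clusters(stdout: str) -> list[list[str]]:
    """Split program output into clusters: one cluster per line, IDs separated by tabs."""
    return [
        [name.strip() for name in line.split("\t") if name.strip()]
        for line in stdout.splitlines()
        if line.strip()  # skip blank lines so empty output yields no clusters
    ]


# Hypothetical output, modeled on the layout parsed by run_dna_clust.
sample_stdout = "Group_T\tGroup_TDirty\nGroup_ADirty\tGroup_A\n"

for number, cluster in enumerate(parse_clusters(sample_stdout), 1):
    print(number, cluster)
```

Skipping blank lines means that empty output produces an empty list of clusters rather than a single cluster holding an empty name.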

## Tests

### Group simple
- cluster two very simple sequences

In [115]:
display_fasta("simple.fasta")

>Group_A

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_T

uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu




In [116]:
display_clusters(run_dna_clust("simple.fasta", 0.95, 3))

Cluster No. 1: 
	- Group_T
Cluster No. 2: 
	- Group_A


### Group simple with small differences
- cluster four very simple sequences
- two sequences (`Group_ADirty`, `Group_TDirty`) contain a few substituted bases but remain very similar to their reference sequences

In [117]:
display_fasta("simple_small_diff.fasta")

>Group_A
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_ADirty
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatt

>Group_T
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

>Group_TDirty
ttttttttttttaatttttttttttttttttttttttttttttttttttttttttttttttttt


In [118]:
display_clusters(run_dna_clust("simple_small_diff.fasta", 0.98, 3))

Cluster No. 1: 
	- Group_T
Cluster No. 2: 
	- Group_TDirty
Cluster No. 3: 
	- Group_ADirty
Cluster No. 4: 
	- Group_A


In [119]:
display_clusters(run_dna_clust("simple_small_diff.fasta", 0.95, 3))

Cluster No. 1: 
	- Group_T
	- Group_TDirty
Cluster No. 2: 
	- Group_ADirty
	- Group_A
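
One way to read the two results above is through simple position-wise identity. The sketch below is only an illustration under stated assumptions: it uses stand-in sequences of an assumed length of 64 characters with two substituted bases, and the hypothetical helper `positional_identity` counts matching positions. DNAClust's own similarity measure is alignment-based and may differ in detail.

```python
def positional_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Stand-ins for the FASTA records (assumed length: 64 characters).
group_a = "a" * 64
group_a_dirty = "a" * 62 + "tt"            # two substitutions at the end
group_t = "t" * 64
group_t_dirty = "t" * 12 + "aa" + "t" * 50  # two substitutions in the middle

identity_a = positional_identity(group_a, group_a_dirty)
identity_t = positional_identity(group_t, group_t_dirty)
print(f"A vs ADirty: {identity_a:.5f}")
print(f"T vs TDirty: {identity_t:.5f}")

# Compare against the two thresholds used in the tests above.
for threshold in (0.98, 0.95):
    print(f"threshold {threshold}: A pair merges = {identity_a >= threshold}")
```

Under these assumptions each dirty sequence matches its reference in 62 of 64 positions (about 0.969), which lies below 0.98 and above 0.95. That is consistent with four singleton clusters at 0.98 and two pairs at 0.95.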


### Group with huge difference
- cluster three very simple sequences
- one sequence is a 50/50 mix of the other two

In [120]:
display_fasta("simple_mixed.fasta")

>Group_A

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_T

tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

>Group_Mix

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttttttttttttttttttttttttttttttt



In [121]:
display_clusters(run_dna_clust("simple_mixed.fasta", 0.95, 3))

Cluster No. 1: 
	- Group_T
Cluster No. 2: 
	- Group_Mix
Cluster No. 3: 
	- Group_A


In [122]:
display_clusters(run_dna_clust("simple_mixed.fasta", 0.9, 3))

Cluster No. 1: 
	- Group_T
Cluster No. 2: 
	- Group_Mix
Cluster No. 3: 
	- Group_A


In [123]:
display_clusters(run_dna_clust("simple_mixed.fasta", 0.7, 3))

Cluster No. 1: 
	- Group_T
Cluster No. 2: 
	- Group_Mix
Cluster No. 3: 
	- Group_A


In [126]:
display_clusters(run_dna_clust("simple_mixed.fasta", 0.5, 3))

Cluster No. 1: 
	- Group_T
	- Group_Mix
Cluster No. 2: 
	- Group_A


## Conclusions

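- clustering depends on the similarity threshold
    - high values create many small clusters whose members are very similar to each other (0.98 split `simple_small_diff.fasta` into four single-sequence clusters)
    - low values create few large clusters (0.95 merged the same file into two clusters)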
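
The effect of the threshold can be reproduced with a toy model. The sketch below is a deliberately simplified greedy, center-based clustering over position-wise identity. It is not DNAClust's implementation, the names `identity` and `greedy_cluster` are hypothetical, and the sequences are stand-ins with an assumed length of 64 characters.

```python
def identity(a: str, b: str) -> float:
    """Position-wise identity, normalized by the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> list[list[str]]:
    """Assign each sequence to the first center it matches, else open a new cluster."""
    centers: list[tuple[str, list[str]]] = []
    for name, seq in seqs.items():
        for center_seq, members in centers:
            if identity(center_seq, seq) >= threshold:
                members.append(name)
                break
        else:
            # No existing center is similar enough: this sequence becomes a center.
            centers.append((seq, [name]))
    return [members for _, members in centers]


# Stand-in sequences (assumed length: 64 characters).
seqs = {
    "Group_A": "a" * 64,
    "Group_ADirty": "a" * 62 + "tt",
    "Group_T": "t" * 64,
    "Group_TDirty": "t" * 12 + "aa" + "t" * 50,
}

for threshold in (0.98, 0.95):
    clusters = greedy_cluster(seqs, threshold)
    print(threshold, len(clusters), clusters)
```

In this toy model the cluster order follows the input order, so only the number and membership of clusters should be compared with the DNAClust results, not their ordering.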