# Grouping Algorithms Tests - Mothur

## Description

Simple tests of Mothur's sequence clustering on a few small, hand-made FASTA files.



## Setup
1. Download the Mothur binaries or compile them from source (https://github.com/mothur/mothur)
2. Put the binaries into `grouping-algorithms/mothur`

## Utility

In [1]:
import subprocess

type Cluster = list[str]
type Clusters = list[Cluster]


def run(program: str, *args: str):
    result = subprocess.run([program, *args], capture_output=True,
                            text=True)
    return result


def display_fasta(filename: str):
    print("=" * 10 + " BEGIN: FASTA " + "=" * 10)

    with open(f"grouping-algorithms/data/{filename}") as handle:
        print(handle.read())

    print("=" * 10 + " END: FASTA " + "=" * 10)


def display_clusters(clusters: Clusters):
    print("=" * 10 + " BEGIN: CLUSTERS " + "=" * 10)

    for idx, cluster in enumerate(clusters, 1):
        print(f"Cluster No. {idx}:")

        for seq in cluster:
            print(f"\t- {seq}")

    print("=" * 10 + " END: CLUSTERS " + "=" * 10)
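
The `run` helper above returns the result even when the program fails, so a broken run can look like an empty one. A minimal sketch of a stricter variant follows; `run_checked` is a hypothetical name, and whether mothur reports failures through its exit code is an assumption that has not been verified here.

```python
import subprocess
import sys


def run_checked(program: str, *args: str) -> subprocess.CompletedProcess:
    """Run a program and raise if it exits with a non-zero status."""
    result = subprocess.run([program, *args], capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr so a failed run is not mistaken for an empty result.
        raise RuntimeError(
            f"{program} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result


# Demonstration with the Python interpreter itself, so the sketch runs anywhere.
ok = run_checked(sys.executable, "-c", "print('ok')")
print(ok.stdout.strip())
```

The demonstration calls the Python interpreter rather than mothur so that it does not depend on the binaries from the setup step.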

In [31]:
def run_mothur(filename: str, cutoff: float) -> Clusters:
    base = f"grouping-algorithms/data/{filename}"

    # Batch script: compute distances, deduplicate sequences, then cluster.
    commands = f"""
    dist.seqs(fasta={base}.fasta)
    unique.seqs(fasta={base}.fasta)
    cluster(column={base}.dist, count={base}.count_table, cutoff={cutoff:.2f})
    """

    with open("commands.txt", "w") as handle:
        handle.write(commands)

    result = run(
        "grouping-algorithms/mothur/mothur",
        "commands.txt"
    )

    print(result.stdout)

    with open(f"{base}.opti_mcc.list", "r") as handle:
        lines = handle.read().split("\n")

    # lines[0] is the header; lines[1] holds the label, the number of OTUs,
    # and then one tab-separated column per OTU.
    results = lines[1].split("\t")
    num_otus = int(results[1])

    # Each OTU column is a comma-separated list of sequence names.
    # Append the names directly: wrapping them in another list would not match Clusters.
    return [results[2 + i].split(",") for i in range(num_otus)]
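
The parsing step in `run_mothur` can be tried in isolation, without running mothur. The sketch below assumes the layout that the notebook's own parsing code implies: a header line, then a line with a label, the number of OTUs, and one tab-separated column per OTU. The `sample` string is hand-written for illustration and is not actual mothur output; `parse_list_text` is a hypothetical helper name.

```python
def parse_list_text(text: str) -> list[list[str]]:
    """Turn the second line of a list file into a list of clusters."""
    lines = text.splitlines()
    fields = lines[1].split("\t")
    num_otus = int(fields[1])
    # Each OTU column is a comma-separated list of sequence names.
    return [fields[2 + i].split(",") for i in range(num_otus)]


# Hand-written sample in the assumed layout (not real mothur output).
sample = (
    "label\tnumOtus\tOtu1\tOtu2\n"
    "0.20\t2\tGroup_A,Group_ADirty\tGroup_T,Group_TDirty\n"
)

clusters = parse_list_text(sample)
print(clusters)
```

Each cluster comes back as a flat list of names, which is the shape `display_clusters` iterates over.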

## Tests

### Group simple
- cluster two very simple sequences

In [32]:
display_fasta("simple.fasta")

>Group_A

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_T

tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt




In [33]:
display_clusters(run_mothur("simple", 0.2))

Linux version

Using ReadLine,Boost,GSL
mothur v.1.48.0
Last updated: 5/20/22
by
Patrick D. Schloss

Department of Microbiology & Immunology

University of Michigan
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

For questions and analysis support, please visit our forum at https://forum.mothur.org

Type 'quit()' to exit program

[NOTE]: Setting random seed to 19760620.

Batch Mode



mothur > dist.seqs(fasta=grouping-algorithms/data/simple.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
0	0	0
1	0	1

It took 0 secs to find distances for 2 sequences. 1 distances below cutoff 1.


Output File Names: 
grouping-algorithms/data/simple.dist


mothur > uniqu

### Group simple with small differences
- cluster four very simple sequences
- two sequences are impure, but very similar to reference sequences
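
To get a feel for how close a "dirty" sequence is to its reference, here is a rough sketch that computes a plain mismatch fraction. It assumes that, for gap-free sequences of equal length, the distance behaves like the share of differing positions; this notebook does not verify that `dist.seqs` computes exactly this. The two strings are illustrative 64-base sequences built in code, not read from the data file, and `mismatch_fraction` is a hypothetical helper.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Share of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Illustrative sequences: a 64-base reference and a copy with two changed bases.
reference = "a" * 64
dirty = "a" * 62 + "tt"

distance = mismatch_fraction(reference, dirty)
print(f"{distance:.5f}")

# Compare the distance with the cutoffs used in the tests.
for cutoff in (0.03, 0.2, 0.5):
    print(cutoff, distance <= cutoff)
```

Under this assumption, two changed bases out of 64 give a distance of 2/64, which is slightly above 0.03 and well below 0.2, so the choice between those cutoffs can plausibly change the grouping.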

In [34]:
display_fasta("simple_small_diff.fasta")

>Group_A
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_ADirty
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatt

>Group_T
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

>Group_TDirty
ttttttttttttaatttttttttttttttttttttttttttttttttttttttttttttttttt


In [37]:
display_clusters(run_mothur("simple_small_diff", 0.03))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_small_diff.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
1	0	1
1	0	0
2	0	2
2	0	0
0	0	0
3	0	3

It took 0 secs to find distances for 4 sequences. 6 distances below cutoff 1.


Output File Names: 
grouping-algorithm

In [38]:
display_clusters(run_mothur("simple_small_diff", 0.2))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_small_diff.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
1	0	0
2	0	2
1	0	1
0	0	0
2	0	0
3	0	3

It took 0 secs to find distances for 4 sequences. 6 distances below cutoff 1.


Output File Names: 
grouping-algorithm

In [39]:
display_clusters(run_mothur("simple_small_diff", 0.5))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_small_diff.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
1	0	1
2	0	0
1	0	0
2	0	2
3	0	3
0	0	0

It took 0 secs to find distances for 4 sequences. 6 distances below cutoff 1.


Output File Names: 
grouping-algorithm

### Group with huge difference
- cluster three very simple sequences
- one sequence is a 50/50 mix of the other two

In [40]:
display_clusters(run_mothur("simple_mixed", 0.03))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_mixed.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
0	0	0
1	0	1
2	0	2

It took 0 secs to find distances for 3 sequences. 3 distances below cutoff 1.


Output File Names: 
grouping-algorithms/data/simple_mixed.dis

In [41]:
display_clusters(run_mothur("simple_mixed", 0.2))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_mixed.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
0	0	0
1	0	1
2	0	2

It took 0 secs to find distances for 3 sequences. 3 distances below cutoff 1.


Output File Names: 
grouping-algorithms/data/simple_mixed.dis

In [42]:
display_clusters(run_mothur("simple_mixed", 0.5))

mothur > dist.seqs(fasta=grouping-algorithms/data/simple_mixed.fasta)

Using 16 processors.

Sequence	Time	Num_Dists_Below_Cutoff
1	0	1
2	0	2
0	0	0

It took 0 secs to find distances for 3 sequences. 3 distances below cutoff 1.


Output File Names: 
grouping-algorithms/data/simple_mixed.dis

## Conclusions

- rich library with many utility functions
- batch files can chain commands into a processing pipeline, so no external tools are needed
- implements several different algorithms
- multithreaded
- the clustering algorithm requires a precomputed distance matrix
- the distance matrix can be computed with one of a few predefined metrics
- thanks to its rich functionality (e.g. extracting unique sequences), it can serve as a preprocessing step in tandem with a dedicated clustering tool
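
As an illustration of the batch-file point, the sketch below builds the same three-command pipeline used in this notebook as plain text and writes it to a temporary file. `build_batch` is a hypothetical helper, and the sketch only generates the file; it does not run mothur.

```python
from pathlib import Path
import tempfile


def build_batch(base: str, cutoff: float) -> str:
    """Return batch-file text with one mothur command per line."""
    commands = [
        f"dist.seqs(fasta={base}.fasta)",
        f"unique.seqs(fasta={base}.fasta)",
        f"cluster(column={base}.dist, count={base}.count_table, cutoff={cutoff:.2f})",
    ]
    return "\n".join(commands) + "\n"


batch_text = build_batch("grouping-algorithms/data/simple", 0.2)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    batch_path = Path(tmp) / "commands.txt"
    batch_path.write_text(batch_text)
    print(batch_path.read_text(), end="")
```

Keeping the commands in a list makes it easy to add or reorder steps before the file is handed to the mothur binary.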
