# Grouping Algorithms Tests - CD-HIT

## Description

Simple clustering tests for CD-HIT on small synthetic FASTA files.



## Setup
1. Download the CD-HIT source files (https://github.com/weizhongli/cdhit)
2. Put the source files into `grouping-algorithms/cdhit`
3. Compile CD-HIT with `make`
4. Compile CD-HIT-AUXTOOLS with `make`
5. The full installation guide is in the CD-HIT user's guide: https://github.com/weizhongli/cdhit/blob/master/doc/cdhit-user-guide.wiki#user-content-Installation
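
Before running the tests, it can help to confirm that the build left an executable where the cells below expect it. The following is a minimal sketch, assuming the path from step 2; the helper name `is_executable` is hypothetical and not part of CD-HIT.

```python
import os
from pathlib import Path

# Path used by the notebook cells (assumes the layout from step 2 of the setup).
CD_HIT_BINARY = Path("grouping-algorithms/cdhit/cd-hit")


def is_executable(path: Path) -> bool:
    """Return True if `path` is an existing file the current user may execute."""
    return path.is_file() and os.access(path, os.X_OK)


if __name__ == "__main__":
    if is_executable(CD_HIT_BINARY):
        print(f"Found CD-HIT at {CD_HIT_BINARY}")
    else:
        print(f"CD-HIT not found at {CD_HIT_BINARY}; check that `make` succeeded")
```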

## Utility

In [1]:
import subprocess

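# The `type` statement requires Python 3.12 or newer.
type Cluster = list[str]
type Clusters = list[Cluster]


def run(program: str, *args: str) -> subprocess.CompletedProcess[str]:
    # Run the program and capture stdout/stderr as text.
    result = subprocess.run([program, *args], capture_output=True,
                            text=True)
    return result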


def display_fasta(filename: str):
    print("=" * 10 + " BEGIN: FASTA " + "=" * 10)

    with open(f"grouping-algorithms/data/{filename}") as handle:
        print(handle.read())

    print("=" * 10 + " END: FASTA " + "=" * 10)


def display_clusters(clusters: Clusters):
    print("=" * 10 + " BEGIN: CLUSTERS " + "=" * 10)

    for idx, cluster in enumerate(clusters, 1):
        print(f"Cluster No. {idx}:")
        
        for seq in cluster:
            print(f"\t- {seq}")

    print("=" * 10 + " END: CLUSTERS " + "=" * 10)

In [18]:
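def run_cd_hit(filename: str, similarity: float, word_size: int = 5) -> Clusters:
    # Run CD-HIT: -i input, -o output, -c identity threshold, -n word size.
    r = run(
        "grouping-algorithms/cdhit/cd-hit",
        "-i",
        "grouping-algorithms/data/" + filename,
        "-o",
        "result.tmp",
        "-c",
        f"{similarity:.2f}",
        "-n",
        f"{word_size}"
    )
    print(r.stdout)

    # CD-HIT writes the cluster listing to <output>.clstr.
    clusters = []
    with open("result.tmp.clstr") as handle:
        data = handle.read()

    cluster = None
    for line in data.split("\n"):
        if line.startswith(">"):
            # A ">Cluster N" header starts a new cluster; store the previous one.
            if cluster is not None:
                clusters.append(cluster)
            cluster = []
        elif ">" in line and "..." in line:
            # Member line: the name sits between ">" and "...", followed by
            # "*" for the representative or by the identity to it.
            name = line[line.index(">") + 1 : line.index("...")]
            # Strip before comparing; the marker is preceded by a space.
            is_repr = line[line.index("...") + 3:].strip()
            is_repr = "Representative" if is_repr == "*" else is_repr
            name = f"{name.strip()} ({is_repr})"
            cluster.append(name)
        else:
            # The trailing empty line of the file also ends up here.
            print("Unknown line", line)

    if cluster is not None:
        clusters.append(cluster)

    return clusters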
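
The `word_size` argument is passed to CD-HIT's `-n` option, and a sensible value depends on the identity threshold given to `-c`. For protein clustering, the CD-HIT user's guide suggests word size 5 for thresholds 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6, and 2 for 0.4 to 0.5. The sketch below encodes that guidance. The helper name `suggest_word_size` is hypothetical, and assigning each boundary value to the larger word size is an assumption, so check the guide for the exact ranges.

```python
def suggest_word_size(similarity: float) -> int:
    """Suggest a value for -n from the identity threshold passed to -c.

    Boundary values (0.7, 0.6, 0.5) are assigned to the larger word size;
    that choice is an assumption of this sketch.
    """
    if not 0.4 <= similarity <= 1.0:
        raise ValueError("this guidance only covers thresholds from 0.4 to 1.0")
    if similarity >= 0.7:
        return 5
    if similarity >= 0.6:
        return 4
    if similarity >= 0.5:
        return 3
    return 2


print(suggest_word_size(0.95))  # 5
```

For the thresholds 0.95 and 0.98 this returns 5, which agrees with the hint CD-HIT prints when it is run with `-n 3`.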

## Tests

### Group simple
- cluster two very simple sequences

In [19]:
display_fasta("simple.fasta")

>Group_A

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_T

uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu




In [21]:
display_clusters(run_cd_hit("simple.fasta", 0.95, 5))

Program: CD-HIT, V4.8.1 (+OpenMP), May 05 2024, 13:34:07
Command: grouping-algorithms/cdhit/cd-hit -i
         grouping-algorithms/data/simple.fasta -o result.tmp -c
         0.95 -n 5

Started: Sun May  5 17:37:50 2024
                            Output                              
----------------------------------------------------------------
total seq: 2
longest and shortest : 64 and 64
Total letters: 128
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90518873


comparing sequences from          0  to          2

        2  finished          2  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.08

Unknown line 
Cluster No. 

### Group simple with small differences
- cluster four very simple sequences
- two sequences are "dirty" copies of the reference sequences, each with two substituted residues
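
As a rough guide to which thresholds should merge a dirty sequence with its reference, the identity can be estimated by hand. The sketch below is a position-by-position comparison of equal-length sequences, which is a simplification of the alignment-based identity CD-HIT computes. The function name `positional_identity` is hypothetical, and the sequences are rebuilt from the pattern of `Group_A` and `Group_ADirty` (64 residues, the last two substituted).

```python
def positional_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if not a or len(a) != len(b):
        raise ValueError("this sketch only handles non-empty sequences of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


group_a = "a" * 64
group_a_dirty = "a" * 62 + "tt"  # same pattern as Group_ADirty

identity = positional_identity(group_a, group_a_dirty)
print(f"{identity:.4f}")  # 62 / 64 = 0.9688
```

An identity of about 0.969 lies between the two thresholds tested below: it passes `-c 0.95` but not `-c 0.98`.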

In [22]:
display_fasta("simple_small_diff.fasta")

>Group_A
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_ADirty
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatt

>Group_T
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

>Group_TDirty
ttttttttttttaatttttttttttttttttttttttttttttttttttttttttttttttttt


In [24]:
display_clusters(run_cd_hit("simple_small_diff.fasta", 0.98, 3))

Program: CD-HIT, V4.8.1 (+OpenMP), May 05 2024, 13:34:07
Command: grouping-algorithms/cdhit/cd-hit -i
         grouping-algorithms/data/simple_small_diff.fasta -o
         result.tmp -c 0.98 -n 3

Started: Sun May  5 17:38:32 2024
                            Output                              
----------------------------------------------------------------
Your word length is 3, using 5 may be faster!
total seq: 4
longest and shortest : 64 and 64
Total letters: 256
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 10M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 98668495


comparing sequences from          0  to          4

        4  finished          4  clusters

Approximated maximum memory consumption: 10M
writing new database
writing clustering information
program compl

In [26]:
display_clusters(run_cd_hit("simple_small_diff.fasta", 0.95, 3))

Program: CD-HIT, V4.8.1 (+OpenMP), May 05 2024, 13:34:07
Command: grouping-algorithms/cdhit/cd-hit -i
         grouping-algorithms/data/simple_small_diff.fasta -o
         result.tmp -c 0.95 -n 3

Started: Sun May  5 17:38:44 2024
                            Output                              
----------------------------------------------------------------
Your word length is 3, using 5 may be faster!
total seq: 4
longest and shortest : 64 and 64
Total letters: 256
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 10M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 98668495


comparing sequences from          0  to          4

        4  finished          2  clusters

Approximated maximum memory consumption: 10M
writing new database
writing clustering information
program compl

### Group with a huge difference
- cluster three very simple sequences
- one sequence is a 50/50 mix of the other two
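
To see why a very low threshold is needed before the mixed sequence joins a cluster, it helps to compare it with both references. This is a hedged sketch using a plain position-by-position comparison rather than CD-HIT's own alignment. The sequences are rebuilt from the description (64 residues, first half `a`, second half `t`), and the helper name `matching_fraction` is hypothetical.

```python
def matching_fraction(a: str, b: str) -> float:
    """Share of aligned positions that hold the same residue (equal lengths assumed)."""
    if not a or len(a) != len(b):
        raise ValueError("expected two non-empty sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


references = {"Group_A": "a" * 64, "Group_T": "t" * 64}
group_mix = "a" * 32 + "t" * 32  # assumed 50/50 layout of Group_Mix

identities = {name: matching_fraction(seq, group_mix) for name, seq in references.items()}
print(identities)  # {'Group_A': 0.5, 'Group_T': 0.5}
```

Both comparisons give exactly 0.5, so the mixed sequence is equally close to each reference and far below a 0.95 threshold.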

In [27]:
display_fasta("simple_mixed.fasta")

>Group_A

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

>Group_T

tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

>Group_Mix

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttttttttttttttttttttttttttttttt



In [30]:
display_clusters(run_cd_hit("simple_mixed.fasta", 0.95, 5))

Program: CD-HIT, V4.8.1 (+OpenMP), May 05 2024, 13:34:07
Command: grouping-algorithms/cdhit/cd-hit -i
         grouping-algorithms/data/simple_mixed.fasta -o
         result.tmp -c 0.95 -n 5

Started: Sun May  5 17:39:30 2024
                            Output                              
----------------------------------------------------------------
total seq: 3
longest and shortest : 64 and 64
Total letters: 192
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90518845


comparing sequences from          0  to          3

        3  finished          3  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.06

Unknown line 
Cluste

In [31]:
display_clusters(run_cd_hit("simple_mixed.fasta", 0.5, 2))

Program: CD-HIT, V4.8.1 (+OpenMP), May 05 2024, 13:34:07
Command: grouping-algorithms/cdhit/cd-hit -i
         grouping-algorithms/data/simple_mixed.fasta -o
         result.tmp -c 0.50 -n 2

Started: Sun May  5 17:39:31 2024
                            Output                              
----------------------------------------------------------------
total seq: 3
longest and shortest : 64 and 64
Total letters: 192
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 0M = 0M
Miscellaneous   : 0M
Total           : 10M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 98686165


comparing sequences from          0  to          3

        3  finished          2  clusters

Approximated maximum memory consumption: 10M
writing new database
writing clustering information
program completed !

Total CPU time 0.01

Unknown line 
Cluster 

## Conclusions

- CD-HIT has very nice output files that contain the representative sequence and similarity information (useful for later visualisation)
- CD-HIT has many options and can be fine-tuned to specific tasks
- runtime information is helpful and includes details such as memory usage and CPU time
- CD-HIT is multithreaded

## Conclusions about all 3 algorithms

**DNAClust**
- DNAClust is the simplest program in this comparison; it offers few options and is no longer maintained
- error handling is unfriendly and gives no feedback: the program just exits with an exit code
- output is simple, with no information about the representative

**CD-HIT**
- more complex, with more options and slightly better error handling
- contains the representative for each cluster (nice to have)
- has very nice usage statistics (memory, CPU)
- multithreaded

**Mothur**
- powerful toolkit that offers more than the other programs
- useful as a preprocessing tool
- clustering works well, but requires a distance matrix (can be problematic for large datasets)
- multithreaded

## Next Steps

- comparing all programs in real scenarios with a few hundred short reads
    - measuring time, memory and CPU usage
    - measuring the number and quality of clusters
    - experiments with fine-tuning
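
For the time, memory and CPU measurements, one option is a small wrapper around `subprocess.run` that reads the resource usage of child processes. This is a sketch under stated assumptions: it works only on Unix-like systems (the `resource` module), the names `Measurement` and `measure` are hypothetical, and `ru_maxrss` reports the peak over all children waited for so far, in KiB on Linux and bytes on macOS. For a per-program peak, each program would have to be measured in a fresh Python process.

```python
import resource
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Measurement:
    returncode: int
    wall_seconds: float
    cpu_seconds: float  # user + system CPU time consumed by the child
    max_rss: int        # peak resident set size over all children so far


def measure(command: list[str]) -> Measurement:
    """Run `command` and report wall time, CPU time and peak memory."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return Measurement(completed.returncode, wall, cpu, after.ru_maxrss)


if __name__ == "__main__":
    # Stand-in command; replace with the clustering program and its arguments.
    print(measure([sys.executable, "-c", "print(sum(range(10**6)))"]))
```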