# How to run the package to build sequence similarity networks and clusterings of a protein family

All that is required is a fasta file of protein sequences.  

The entire pipeline is run from the command line using Python. After the cluster definitions are generated, the package also provides various helper functions to visualize and annotate the data.
  
**Note:**
Sequence headers in the fasta will be trimmed at the first space and converted to numeric IDs for the process.  
This is the easiest way to ensure any arbitrarily named set of sequences can be processed without issue. Later notebooks will show how to map the IDs back given a particular downstream process.
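
As a rough illustration of the header handling described in the note, the sketch below trims each header at the first space and swaps it for a numeric ID while recording the mapping. The function name `clean_fasta`, the zero-based numbering, and the shape of the returned map are assumptions made for this example; the package's own cleaning step may differ in these details.

```python
def clean_fasta(lines):
    """Trim headers at the first space and replace them with numeric IDs.

    Returns the cleaned lines and a map of numeric ID -> trimmed header.
    Hypothetical helper: numbering from 0 is an assumption.
    """
    cleaned = []
    id_map = {}
    next_id = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the text before the first space in the header.
            trimmed = line[1:].split(" ", 1)[0]
            id_map[str(next_id)] = trimmed
            cleaned.append(f">{next_id}")
            next_id += 1
        elif line:
            # Sequence lines are passed through unchanged.
            cleaned.append(line)
    return cleaned, id_map


example = [
    ">sp|P12345|ABC_HUMAN some description",
    "MKT",
    "LLV",
    ">seqB",
    "MAA",
]
cleaned, id_map = clean_fasta(example)
print(cleaned)
print(id_map)
```

Keeping the map alongside the cleaned file is what makes it possible to translate numeric IDs back to the original names later.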

**Another note:**
The examples in this notebook are executed through the bash shell. You can also run the same commands directly from the command line, provided the correct Python environment is activated. Every cell starts with '%%bash', which is only needed inside this notebook and should not be copied when running the commands anywhere else.

## 1. Pipeline options

Assuming the 'proteinclustertools' package has been installed, the main processing pipeline can be accessed with the following command.  
  
Keep in mind that for the package to be found you need to be using the environment (conda, venv, Jupyter kernel, whichever is relevant for where the code is being run) that has all the requirements installed.

The following is an example of the options available.

In [1]:
%%bash
python -m proteinclustertools.pipeline -h

usage: pipeline.py [-h] [-d D] [-p P] [-fa FA] [-A] [-MC] [-P]
                   [-cc CC [CC ...]] [-cluster_lines CLUSTER_LINES] [-f F]
                   [-RMC] [-cluster_jobs CLUSTER_JOBS] [-E] [-CE] [-t T]
                   [-KF KF] [-KM] [-K K [K ...]] [-FKM] [-max_k MAX_K] [-HC]
                   [-CT] [-FHC] [-U]

Pipeline for analyzing protein families using unsupervised clustering. Uses
either homology or vector embeddings to cluster sequences.

options:
  -h, --help            show this help message and exit

General arguments:
  -d D, -directory D    Output directory for files (default: out/)
  -p P, -prefix P       Prefix for output files (default: ssn)
  -fa FA, -fasta FA     Fasta file to analyze. Only needs to be given once.
                        (default: None)

Homology based method:
  -A, -all_by_all       Run mmseqs all-by-all search (default: False)
  -MC, -mmseqs_cluster  Cluster mmseqs results (default: False)
  -P, -cluster_percentiles
                      

## 2. Setting up the pipeline

The pipeline always starts with cleaning an input fasta file. This can be run as a separate first step, or it will be handled automatically when running any analysis options while providing the '-fa' argument.

**Note:** A large part of the pipeline is just keeping track of consecutive inputs and outputs between steps, so the fasta file cannot be changed once it has been supplied: every subsequent step assumes it is working from this file.  
There are two ways to work with a changed dataset:  
1. If the desired data is a subset of the original, the '-f' or '-filter' option can be used with the clustering methods to isolate just the subset of interest.
2. (Easiest in most cases) Just start the analysis in a new folder. For most datasets (~50-100k sequences), the time needed to rerun is not that long.

**Note:** All commands require the '-d' ('-directory', where to put output) and '-p' ('-prefix', how to label files) arguments. The folder containing the analysis can be moved and renamed, as the file tracking is completely internal to that directory. However, this also means all analyses have to reuse that same directory.

Running the following will produce a cleaned fasta file, and the conversion header map for future reference.

Note that in this case we use the 100% representatives of the IPR001761 family of proteins. Such representative clusterings can be easily generated using CD-HIT or mmseqs.

In [2]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -fa ../data/IPR001761_rep100.fasta 

cleaned fasta is in: ../output/IPR001761_cleaned.fasta


Subsequent commands no longer need the '-fa' option.

## 3. Workflows

There are 3 main workflows grouped into 2 methods:

Homology method - Build a sequence similarity network using MMSeqs pairwise sequence alignments, and cluster the network by detecting connected components above a given cut-off.

Representation method - Convert sequences into vectors using the ESM1b protein language model, then conduct either 1) Kmeans clustering or 2) hierarchical clustering (RAM intensive; memory use grows roughly quadratically with sequence count)

### 3a. Homology method

All-by-all pairwise alignments can be run easily using the '-A' option.  

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -A

This creates a 'mmseqs_search.tsv' file with all the pairwise alignment bitscores.  
  
Alternatively, you can load your own table of pairwise metrics for clustering and skip the *-A* option. In the following steps, add the *-edge_table* option with a path to the table. The custom table must be tab-separated with 3 columns: target/query in the first 2 columns and the score in the 3rd. A header row must be present, but its values do not matter. Tables with extra columns are accepted, but only the first 3 columns are used.
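
For reference, a table in the layout described above could be written with a few lines of Python. This is only a sketch: the helper name `write_edge_table`, the header labels, the column order of the two IDs, and the example scores are placeholders chosen for illustration, and which ID scheme the pipeline expects in a custom table is not covered here.

```python
import csv
import os
import tempfile


def write_edge_table(path, edges):
    """Write (id_a, id_b, score) tuples as a 3-column, tab-separated table."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # A header row is required, but the labels themselves are arbitrary.
        writer.writerow(["query", "target", "score"])
        for id_a, id_b, score in edges:
            writer.writerow([id_a, id_b, score])


# Made-up pairwise scores for three sequences.
edges = [("0", "1", 250.0), ("0", "2", 120.5), ("1", "2", 98.0)]
table_path = os.path.join(tempfile.mkdtemp(), "custom_edges.tsv")
write_edge_table(table_path, edges)

# Read the file back to confirm the layout.
with open(table_path) as handle:
    rows = [line.rstrip("\n").split("\t") for line in handle]
print(rows)
```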

  
To identify clusters, use the '-MC' option. You must also specify the cut-offs at which clusters should be detected. After the all-by-all alignments, the pipeline automatically sets bitscore cut-offs at the 10th, 25th, 50th, 75th, and 90th percentiles of the full distribution (sampled up to 1M pairwise bitscores).

So one can use the percentiles:

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -MC -P
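
To make the idea of percentile cut-offs concrete, the sketch below computes the same five percentiles from a list of bitscores, subsampling when the list is larger than a given limit. It uses only the standard library; the function name `percentile_cutoffs`, the seed, and the linear-interpolation convention are assumptions for this example and may not match how the pipeline computes its values.

```python
import random
import statistics


def percentile_cutoffs(scores, percentiles=(10, 25, 50, 75, 90),
                       max_sample=1_000_000, seed=0):
    """Return {percentile: score} for a (possibly subsampled) score list."""
    scores = list(scores)
    if len(scores) > max_sample:
        # Subsample so very large result files stay manageable.
        scores = random.Random(seed).sample(scores, max_sample)
    # 99 cut points; index p - 1 holds the p-th percentile.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return {p: cut_points[p - 1] for p in percentiles}


# Toy distribution: the integers 1..101 stand in for bitscores.
bitscores = list(range(1, 102))
cutoffs = percentile_cutoffs(bitscores)
print(cutoffs)
```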

Or specify a list of cut-offs:

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -MC -cc 100 150 200 250 300
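
The effect of a cut-off on the resulting clusters can be illustrated with a small union-find sketch: edges scoring at or above the cut-off are kept, and each connected component of the remaining network becomes one cluster. This is a minimal stand-alone illustration, not the package's implementation; the function name, the use of `>=` rather than `>`, and the toy scores are assumptions.

```python
def connected_components(edges, cutoff):
    """Group nodes into clusters using only edges with score >= cutoff."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for id_a, id_b, score in edges:
        # Register both nodes even if the edge is below the cut-off,
        # so weakly connected sequences still appear as singletons.
        root_a, root_b = find(id_a), find(id_b)
        if score >= cutoff and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


edges = [("0", "1", 250.0), ("1", "2", 120.0), ("3", "4", 300.0), ("2", "3", 90.0)]
loose = connected_components(edges, cutoff=100)
strict = connected_components(edges, cutoff=200)
print(loose)
print(strict)
```

Raising the cut-off removes weaker edges, so clusters can only split, never merge; this is why clusterings at increasing cut-offs nest inside one another.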

The following example uses a custom table (here the existing search table is fed back in for illustration). The '-redo_mmseqs_clustering' option is needed to force the pipeline to redo cut-offs that were already processed.

Note: custom tables do not support the percentiles option.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -edge_table ../output/IPR001761_mmseqs_search.tsv -MC -cc 100 150 -redo_mmseqs_clustering

The clustering tries to use multiprocessing, and may consume too much RAM on very large MMSeqs results files. If this is the case, the user can try reducing the number of lines being read by each job, or limit the maximum number of jobs.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -MC -P -cluster_lines 1000000 -cluster_jobs 100

This will generate clustering files under a 'mmseqs_clustering' folder in the target directory, which can now be used for downstream steps.  
  
See the [**hierarchical cluster plot**](./hierarchical_cluster_plot.ipynb) notebook for an example of how to visualize the clusters using the tools in the package.  
See the [**representative selection**](./representative_selection.ipynb) notebook for an example of how to select representatives given clustering definitions.
  
Both examples are applicable regardless of how the clustering was generated (any of the methods).

### 3b. Vector representation methods
  
Both representation methods require the vector embeddings to be generated first ('-E' option).

**Note:** This was intended to be run on a computer with a dedicated GPU, and requires pytorch and cuda to have been properly installed.  
Without a GPU, this code could in theory still work by falling back to the CPU, but it will likely run much slower.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -E

Alternatively, the embedding generation can be skipped, and the user can supply their own embeddings.

Use the '-embeddings_dict' option and specify the path to a pickled dictionary with the sequence IDs as keys and the embedding vectors as values (1D, already pooled).  
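
A dictionary of this shape could be prepared roughly as follows. The sketch assumes that "pooled" means averaging per-residue vectors into a single vector per sequence, and it stores plain Python lists to stay dependency-free; the pipeline may well expect NumPy arrays and a particular ID scheme, so treat the helper `mean_pool`, the keys, and the values as placeholders.

```python
import os
import pickle
import tempfile


def mean_pool(per_residue):
    """Average a list of per-residue vectors into one 1D vector."""
    length = len(per_residue)
    dims = len(per_residue[0])
    return [sum(row[d] for row in per_residue) / length for d in range(dims)]


# Toy per-residue embeddings: 2 residues and 1 residue, 3 dimensions each.
per_residue_embeddings = {
    "0": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "1": [[0.0, 0.0, 6.0]],
}
embeddings = {
    seq_id: mean_pool(matrix)
    for seq_id, matrix in per_residue_embeddings.items()
}

# Pickle the dictionary, then reload it to check the round trip.
pickle_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
with open(pickle_path, "wb") as handle:
    pickle.dump(embeddings, handle)
with open(pickle_path, "rb") as handle:
    reloaded = pickle.load(handle)
print(reloaded)
```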

### 3c. Kmeans clustering

The implementation used is bisecting K-means, which generates a bisecting cluster tree. This tree can then be subdivided to generate lower values of K so that the hierarchical relationship is preserved. This is necessary as K-means (including bisecting K-means) does not produce consistent hierarchical relationships across different K.   

**Note:** With Kmeans, there is stochasticity to the clustering. The pipeline uses a fixed random number seed to get the same results. 

First, to produce clusters and trees with Kmeans, use the '-KM' option. Note that the clustering parameter for Kmeans is the number of final clusters; the user may need to experiment to identify the best value.  
  
Supply the list of cluster counts with '-K'. In this case, we make up to 10,000 clusters.  

In [2]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -KM -K 10000 -KF 30

cleaned fasta is in: ../output/IPR001761_cleaned.fasta
Loading embeddings for kmeans
Loading embeddings for kmeans
Reducing feature dimensions with PCA
Running kmeans


Note the '-KF' option. When using vectors, it is possible to reduce the vector complexity using PCA first. By default, all vector analyses reduce the vectors to 30 dimensions. Set '-KF' to 0 to use the full vectors.

The clustering definitions are output to the 'kmeans/max_k/' folder in the target directory.

We take this tree structure and further divide it into lower cluster counts.

In this case, we use the same cluster counts as those produced by the homology method with the percentiles option.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -FKM -max_k 10000 -K 166 505 1403 3939 9094

The output is stored in 'kmeans/flattened_10000/' (separate folder for each max_k)
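
One way to picture how a bisecting tree yields nested clusterings at lower K is to replay only the first K-1 splits and assign every final cluster to whichever ancestor exists at that point. The sketch below does exactly that on a toy tree. It is an assumption about the general idea, not a description of the package's code: the function name, the `(parent, left, right)` split representation, and the use of split order to define each K are all hypothetical.

```python
def flatten_bisecting_tree(splits, k):
    """Map each final (leaf) cluster to its ancestor after the first k-1 splits.

    `splits` is a list of (parent, left_child, right_child) node IDs in the
    order the splits were made.
    """
    if not 1 <= k <= len(splits) + 1:
        raise ValueError("k must be between 1 and the number of leaf clusters")
    root = splits[0][0] if splits else 0
    # Replay the first k-1 splits to find the k clusters present at that stage.
    active = {root}
    for parent, left, right in splits[:k - 1]:
        active.remove(parent)
        active.update((left, right))
    parent_of = {}
    for parent, left, right in splits:
        parent_of[left] = parent
        parent_of[right] = parent
    # Leaves are nodes that were never split further.
    leaves = set(parent_of) - {parent for parent, _, _ in splits}
    assignment = {}
    for leaf in leaves:
        node = leaf
        while node not in active:
            node = parent_of[node]
        assignment[leaf] = node
    return assignment


# Toy tree: 0 -> (1, 2), then 1 -> (3, 4), then 2 -> (5, 6).
splits = [(0, 1, 2), (1, 3, 4), (2, 5, 6)]
coarse = flatten_bisecting_tree(splits, 2)
medium = flatten_bisecting_tree(splits, 3)
print(coarse)
print(medium)
```

Because every coarser cluster is an ancestor of the finer ones, the clusterings at different K nest inside each other, which independent Kmeans runs at each K would not guarantee.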

### 3d. Hierarchical clustering
  
With hierarchical clustering, we first generate a linkage matrix, that can then be used to generate clusterings at different levels, or to build a tree structure for visualization. Start with the '-HC' method to create the linkage matrix. Again, '-KF' can be used to control feature complexity. 

This pipeline generates linkage matrices using 'cosine' as the distance metric and 'weighted' as the linkage method.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -HC -KF 30

To produce specific clusters, use the '-FHC' option to 'flatten' the clusters. The pipeline uses the simple 'max clust' method where, similar to Kmeans, the user specifies the desired number of final clusters.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -FHC -K 166 505 1403 3939 9094

The resulting cluster definitions are in the 'hierarchical_clustering' folder in the target directory.

Parallel to the cluster generation, we can also make a tree structure from the linkage matrix. The function produces the full tree with all sequences, which can be further manipulated (example in the [**tree structure**](./tree_structure.ipynb) notebook).  
Simply use the '-CT' option.

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -CT

The tree is created directly in the target directory. Look for the Newick (.nwk) file.
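
To show how a linkage matrix relates to a Newick tree, the sketch below converts a small linkage matrix into a Newick string. It assumes the SciPy row convention (`[left, right, height, count]`, with merged nodes numbered from the number of leaves upward) and computes branch lengths as height differences; the function name and the toy matrix are illustrative, and the tree written by the pipeline may be formatted differently.

```python
def linkage_to_newick(linkage_rows, leaf_names):
    """Build a Newick string from SciPy-style linkage rows."""
    n = len(leaf_names)
    # node id -> (newick subtree text, height of that node)
    nodes = {i: (str(name), 0.0) for i, name in enumerate(leaf_names)}
    for offset, (left, right, height, _count) in enumerate(linkage_rows):
        parts = []
        for child in (int(left), int(right)):
            subtree, child_height = nodes.pop(child)
            # Branch length is the height gained between child and parent.
            parts.append(f"{subtree}:{height - child_height:.3f}")
        nodes[n + offset] = ("(" + ",".join(parts) + ")", height)
    # After all merges, only the root remains.
    root_text, _root_height = next(iter(nodes.values()))
    return root_text + ";"


# Toy linkage: leaves 0 and 1 merge at height 0.2, then join leaf 2 at 0.5.
toy_linkage = [[0, 1, 0.2, 2], [2, 3, 0.5, 3]]
newick = linkage_to_newick(toy_linkage, ["0", "1", "2"])
print(newick)
```

The merges are processed iteratively rather than recursively, which avoids hitting Python's recursion limit on trees with many thousands of sequences.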

### 3e. Conducting UMAP
  
The vectors can be processed by UMAP to map each sequence directly to a 2D coordinate for visualization. It has the strength of being naturally self-organising and preserving some structure from the higher dimensions, but has various pitfalls for interpretation. 
  
**Note:** Visualizations using UMAP can be sensitive to hyperparameters. However, for simplicity, the pipeline uses default options.  
The user is encouraged to test UMAP (or another dimensionality reduction technique like t-SNE) to see if different settings change their interpretations.  
In this case, they'll need to write their own code (usually just 1-2 lines) and can manually provide the embeddings from this pipeline.    

The pipeline provides the '-U' option to run UMAP on all sequences that have been converted into vectors. Again, '-KF' can be used to control vector complexity.    

In [None]:
%%bash
python -m proteinclustertools.pipeline -d ../output -p IPR001761 -U -KF 30

The resulting 'umap.csv' is created in the target directory, and is a dataframe with an x and y coordinate for each sequence ID (in numerical format). This can be conveniently plotted using any scatter plot function or graphing software. See the [**UMAP plot**](./UMAP_plot.ipynb) notebook for an example of interactive visualization.