# Data processing

## Overview

1.   Import dependencies
2.   Mount drive and read in data
3.   Create a FASTA file of all human paired data for processing by MMseqs2
4.   Install Conda and MMseqs2
5.   Run Linclust at 66% identity to obtain ~100K sequence clusters
6.   Transfer everything to the mmseqs2_output folder on drive

The data at this point will be converted into a CSV file of heavy and light chains with unique identifiers, which Seb will process with MAPT to obtain developability parameters.

In [None]:
import os
from google.colab import drive
import polars as pl
import pandas as pd

In [None]:
#mount drive
drive.mount('/content/drive')

path = '/content/drive/MyDrive/msc-project-mbalmf01/'

Mounted at /content/drive


In [None]:
os.chdir(path)
# exist_ok avoids an error if the folder already exists from an earlier run
os.makedirs('mmseqs2_output', exist_ok=True)

Read in all paired data

In [None]:
df = pl.read_csv('all_paired/opig_data/230618_human_paired_seqs.csv', dtypes={'Run': pl.Utf8})

In [None]:
print(df.columns)
print(df.head())

['', 'sequence_id_heavy', 'ANARCI_status_heavy', 'sequence_heavy', 'sequence_alignment_aa_heavy', 'sequence_id_light', 'ANARCI_status_light', 'sequence_light', 'sequence_alignment_aa_light', 'Run', 'seq_id']
shape: (5, 11)
┌─────┬────────────┬────────────┬────────────┬───┬────────────┬────────────┬─────────┬─────────────┐
│     ┆ sequence_i ┆ ANARCI_sta ┆ sequence_h ┆ … ┆ sequence_l ┆ sequence_a ┆ Run     ┆ seq_id      │
│ --- ┆ d_heavy    ┆ tus_heavy  ┆ eavy       ┆   ┆ ight       ┆ lignment_a ┆ ---     ┆ ---         │
│ i64 ┆ ---        ┆ ---        ┆ ---        ┆   ┆ ---        ┆ a_light    ┆ str     ┆ str         │
│     ┆ str        ┆ str        ┆ str        ┆   ┆ str        ┆ ---        ┆         ┆             │
│     ┆            ┆            ┆            ┆   ┆            ┆ str        ┆         ┆             │
╞═════╪════════════╪════════════╪════════════╪═══╪════════════╪════════════╪═════════╪═════════════╡
│ 0   ┆ AAACCTGAGA ┆ |Deletions ┆ AGCTCTCAGA ┆ … ┆ GCTGTGCTGT ┆ SYELTQ

In [None]:
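def df_to_fasta(df: pd.DataFrame, cols: list, f: str) -> None:
    """Write one FASTA record per row of df.

    cols[0] names the column used for the header line,
    cols[1] names the column holding the sequence.
    """
    id_col, seq_col = cols[0], cols[1]
    with open(f, 'w') as out:
        # iterate over both columns together instead of indexing row by row
        for seq_id, seq in zip(df[id_col], df[seq_col]):
            out.write(f'>{seq_id}\n{seq}\n')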
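
A quick way to check the writer is to run it on a tiny frame and read the file back. The sketch below restates the function so that it runs on its own; the IDs, the sequences, and the temporary file path are made up for illustration and are not part of the project data.

```python
import os
import tempfile

import pandas as pd


def df_to_fasta(df: pd.DataFrame, cols: list, f: str) -> None:
    """Write one FASTA record per row: cols[0] is the header, cols[1] the sequence."""
    id_col, seq_col = cols[0], cols[1]
    with open(f, "w") as out:
        for seq_id, seq in zip(df[id_col], df[seq_col]):
            out.write(f">{seq_id}\n{seq}\n")


# Toy data (hypothetical IDs and sequence fragments).
toy = pd.DataFrame(
    {"seq_id": ["ab_1", "ab_2"], "heavy_light": ["EVQLVESGG", "QVQLQESGP"]}
)

# Write to a throwaway directory rather than the drive.
fasta_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
df_to_fasta(toy, ["seq_id", "heavy_light"], fasta_path)

with open(fasta_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```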

In [None]:
interspacing_string = pl.lit('SGGSTITSYNVYYTKLSSSGT')
df = df.with_columns(
    pl.concat_str([pl.col('sequence_alignment_aa_heavy'), interspacing_string, pl.col('sequence_alignment_aa_light')]).alias('heavy_light')
)

df_sub = df.to_pandas()[['seq_id', 'heavy_light']]


In [None]:
df_to_fasta(df_sub, ['seq_id', 'heavy_light'], 'all_paired/paired_human.fasta')
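
Because the heavy and light chains are joined with a fixed spacer string, the two chains can be recovered from a FASTA record later on. The following is a minimal sketch, assuming the spacer does not also occur inside a heavy chain (the split is made at its first occurrence). The helper names `read_fasta` and `split_heavy_light` are hypothetical, and the chain fragments in the check are made up.

```python
SPACER = "SGGSTITSYNVYYTKLSSSGT"  # same spacer string used to build heavy_light


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_heavy_light(seq, spacer=SPACER):
    """Split a concatenated sequence into (heavy, light) at the first spacer."""
    heavy, found, light = seq.partition(spacer)
    if not found:
        raise ValueError("spacer not found in sequence")
    return heavy, light


# Toy check with made-up chain fragments (not real data).
heavy, light = split_heavy_light("EVQLVESGG" + SPACER + "SYELTQPPS")
print(heavy, light)
```

On the real file this would be used as `for seq_id, seq in read_fasta('all_paired/paired_human.fasta'): ...`, which also gives a cheap way to confirm that the number of records matches the number of rows in the dataframe.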

Install conda and MMseqs2

In [None]:
#append site-packages to path to run conda/mmseqs2/cd-hit
import sys
sys.path.append('/usr/local/lib/python3.9/site-packages/')

In [None]:
!wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
!chmod +x Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
!bash ./Miniconda3-py39_23.3.1-0-Linux-x86_64.sh -b -f -p /usr/local

--2023-06-20 20:57:59--  https://repo.anaconda.com/miniconda/Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 70605094 (67M) [application/x-sh]
Saving to: ‘Miniconda3-py39_23.3.1-0-Linux-x86_64.sh.1’


2023-06-20 20:58:01 (30.6 MB/s) - ‘Miniconda3-py39_23.3.1-0-Linux-x86_64.sh.1’ saved [70605094/70605094]

PREFIX=/usr/local
Unpacking payload ...
                                                                               
Installing base environment...


Downloading and Extracting Packages


Preparing transaction: done
Executing transaction: done
installation finished.
    You currently have a 

In [None]:
# install MMseqs2 via conda; -y answers the confirmation prompt automatically
!conda install -y -c conda-forge -c bioconda mmseqs2

Collecting package metadata (current_repodata.json):

Change the working directory to /tmp and create a new temporary directory for the Linclust run

In [None]:
os.chdir('/tmp')
!mkdir /tmp/new_tmp
!cp /content/drive/MyDrive/msc-project-mbalmf01/all_paired/paired_human.fasta /tmp

Run the MMseqs2 easy-linclust program on the antibody sequences to reduce the set to ~100,000 representative sequences. I tried several `--min-seq-id` values here: 0.8, 0.75, 0.7, 0.6, 0.65 and finally 0.67 (the cell below is set to 0.66, as in the overview; the saved output is from a 0.67 run).

This may take a while (~45 minutes). The larger the cutoff, the longer it takes.
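
The threshold search described above was done by hand; it could also be scripted. The sketch below only builds the command lines, reusing the flags from the cell that follows. The helper `linclust_command` and the per-threshold output prefix (`clusterRes_0.80` and so on) are hypothetical naming choices, not something the original runs used.

```python
import shlex


def linclust_command(fasta, min_seq_id, tmp_dir="new_tmp"):
    """Build an `mmseqs easy-linclust` argument list for one identity threshold."""
    # Hypothetical naming: one output prefix per threshold so runs do not overwrite each other.
    prefix = f"clusterRes_{min_seq_id:.2f}"
    return [
        "mmseqs", "easy-linclust", fasta, prefix, tmp_dir,
        "--min-seq-id", f"{min_seq_id:.2f}",
        "-c", "0.8",
        "--cov-mode", "1",
    ]


# Thresholds listed in the text above.
thresholds = [0.8, 0.75, 0.7, 0.6, 0.65, 0.67]
commands = [linclust_command("paired_human.fasta", t) for t in thresholds]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once `mmseqs` is installed and on the PATH. Given the run time noted above, a full sweep would take several hours.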

In [None]:
!sudo mmseqs easy-linclust paired_human.fasta clusterRes new_tmp --min-seq-id 0.66 -c 0.8 --cov-mode 1

easy-linclust paired_human.fasta clusterRes new_tmp --min-seq-id 0.67 -c 0.8 --cov-mode 1 

MMseqs Version:                     	13.45111
Cluster mode                        	0
Max connected component depth       	1000
Similarity type                     	2
Threads                             	2
Compressed                          	0
Verbosity                           	3
Substitution matrix                 	nucl:nucleotide.out,aa:blosum62.out
Add backtrace                       	false
Alignment mode                      	0
Alignment mode                      	0
Allow wrapped scoring               	false
E-value threshold                   	0.001
Seq. id. threshold                  	0.67
Min alignment length                	0
Seq. id. mode                       	0
Alternative alignments              	0
Coverage threshold                  	0.8
Coverage mode                       	1
Max sequence length                 	65535
Compositional bias                  	1
Max reject              

Move the result files from /tmp to the mmseqs2_output folder on drive, and copy the new_tmp working directory alongside them

In [None]:
!mv /tmp/clusterRes_all_seqs.fasta /content/drive/MyDrive/msc-project-mbalmf01/mmseqs2_output
!mv /tmp/clusterRes_cluster.tsv /content/drive/MyDrive/msc-project-mbalmf01/mmseqs2_output
!mv /tmp/clusterRes_rep_seq.fasta /content/drive/MyDrive/msc-project-mbalmf01/mmseqs2_output

!cp -R /tmp/new_tmp /content/drive/MyDrive/msc-project-mbalmf01/mmseqs2_output
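
To check whether a run landed near the ~100K target, the cluster table can be summarised. The sketch below assumes `clusterRes_cluster.tsv` has two tab-separated columns, the representative ID followed by a member ID, with one line per member. The helper `cluster_sizes` is hypothetical and the toy table is made up.

```python
import csv
import io
from collections import Counter


def cluster_sizes(tsv_handle):
    """Count members per representative in a two-column (representative, member) TSV."""
    sizes = Counter()
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        representative, _member = row
        sizes[representative] += 1
    return sizes


# Toy table (made-up IDs): ab_1 represents itself and ab_3, ab_2 is a singleton.
toy_tsv = "ab_1\tab_1\nab_1\tab_3\nab_2\tab_2\n"
sizes = cluster_sizes(io.StringIO(toy_tsv))
print(len(sizes), "clusters; largest has", max(sizes.values()), "members")
```

On the real output the handle would come from `open('mmseqs2_output/clusterRes_cluster.tsv')`, and `len(sizes)` gives the number of clusters for that threshold.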