# Sequence Analysis

Cluster global mass spectrometry data by both the phosphorylation levels of the phosphopeptides and their amino acid sequences.

In [1]:
import pandas as pd
import numpy as np
from msresist.pre_processing import preprocessing
from msresist.sequence_analysis import MassSpecClustering, preprocess_seqs
from msresist.FileExporter import create_download_link
from msresist.parameter_tuning import GridSearch_CV
from sklearn.mixture import GaussianMixture
from sklearn.utils.estimator_checks import check_estimator
import warnings
warnings.simplefilter("ignore")

Import and pre-process data:

In [2]:
pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_rows', 300)

ABC = preprocessing(motifs=True, Vfilter=True, FCfilter=True, log2T=True)
ABC = preprocess_seqs(ABC, "Y")

Define parameters/arguments:

In [3]:
ncl = 4
pYTS = "Y"
GMMweight = 1
covariance_type = "tied" 
max_n_iter = 20

param_grid = {"ncl": list(range(2,11)),  "GMMweight": list(np.linspace(0,5,11))}

Fit data to model and display results:

In [4]:
MSC = MassSpecClustering(ncl, GMMweight, pYTS, covariance_type, max_n_iter).fit(ABC)

In [5]:
clusters = pd.DataFrame(MSC.Cl_seqs_).T
# Name the columns from ncl so the labels stay correct if the cluster count changes
clusters.columns = [f"Cluster {i + 1}" for i in range(ncl)]

In [9]:
clusters.iloc[:20, :]

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| 0 | KTRDQYLMWLT | KKETRYGEVFE | AENPEYLGLDV | ARLGEYEDVSR |
| 1 | PCEEVYVKHMG | ASQKDYSSGFG | DAETLYKAMKG | DARDLYDAGVK |
| 2 | YNGDYYRQGRI | DGENIYIRHSN | DVDAAYMNKVE | GGQGSYVPLLR |
| 3 | LLSSDYRIING | ISKQEYDESGP | LGEGTYATVYK | KKPHRYRPGTV |
| 4 | SATLLYDQPLQ | LLSVAYKNVVG | LKGQVYILGRE | LGSEVYRMLRE |
| 5 | DVAEKYLDIPK | NRGPAYGLSRE | PPYTDYVSTRW | NVKGEYDVTMP |
| 6 | GSRADYDTLSL | DNGGYYITTRA | PQGREYGMIYL | TYTREYFTFPA |
| 7 | KGGRGYDRDHV | DQGEKYIDLRH | TSKVIYDFIEK | VVRKDYDTLSK |
| 8 | KNYGSYSTQAS | FHPEPYGLEDD | ACRAAYNLVRD | AAEPEYPKGIR |
| 9 | LDKIRYESLTD | GCFDPYSDDPR | APHVHYARLKT | AKFINYVKNCF |


## Hyperparameter Search

My estimator appears to be scikit-learn compatible, but it is too slow for an exhaustive hyperparameter search with GridSearchCV: each iteration of the algorithm takes around 7 minutes, which over 149 iterations comes to roughly 17 hours. I also need to check that my scoring method is reliable.
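The runtime estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the 7-minute and 149-iteration figures are taken from the text as given (the notebook does not state how 149 was obtained), and the grid is rebuilt from the param_grid definition using the standard library rather than NumPy. The helper `n_total_fits` is a hypothetical name, included to show how an exhaustive search typically scales with the number of folds.

```python
from itertools import product

# Figures quoted in the text above.
minutes_per_fit = 7
n_fits = 149

total_minutes = minutes_per_fit * n_fits
total_hours = total_minutes / 60
print(f"{n_fits} fits x {minutes_per_fit} min = {total_minutes} min (~{total_hours:.1f} h)")

# Size of the grid defined in param_grid: every (ncl, GMMweight) pair.
ncl_values = list(range(2, 11))               # 2..10, i.e. 9 values
gmm_weights = [0.5 * i for i in range(11)]    # same values as np.linspace(0, 5, 11)
grid = list(product(ncl_values, gmm_weights))
print(f"{len(grid)} parameter combinations in param_grid")


def n_total_fits(n_combinations, n_folds):
    """Hypothetical helper: an exhaustive search fits one model per combination per fold."""
    return n_combinations * n_folds
```

Under these assumptions the cost grows linearly with both the grid size and the fold count, so a coarser grid or fewer folds is the most direct way to bring the search down to a practical runtime.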

The number of cross-validation folds can't be as high as ABC.shape[0] (LOOCV), since the GMM estimator needs at least two peptides in the test set. I'm using the closest alternative, ABC.shape[0]/2, which leaves at least two peptides in each test fold.
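To see why halving the number of peptides gives a workable fold count, the sketch below computes the test-set size of every fold. It assumes the folds are formed the way scikit-learn's KFold forms them (the first n_samples % n_folds folds get one extra sample); whether GridSearch_CV splits the data this way is an assumption, and `kfold_test_sizes` and the peptide counts are hypothetical, chosen only for illustration.

```python
def kfold_test_sizes(n_samples, n_folds):
    """Test-set size of each fold under KFold-style splitting (assumed here)."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds absorb the remainder, one sample each.
    return [base + 1 if i < extra else base for i in range(n_folds)]


# Hypothetical peptide counts, one even and one odd.
for n_peptides in (10, 11):
    cv = int(n_peptides / 2)
    sizes = kfold_test_sizes(n_peptides, cv)
    print(f"n={n_peptides}, cv={cv}, test sizes={sizes}")

# LOOCV for comparison: every test set holds a single peptide,
# which is below the two-peptide minimum mentioned above.
loocv_sizes = kfold_test_sizes(10, 10)
print(f"LOOCV test sizes={loocv_sizes}")
```

With an even count every test fold holds exactly two peptides; with an odd count one fold holds three, so the two-peptide minimum is still met.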

In [6]:
# cv = int(ABC.shape[0]/2)

In [7]:
# CVresults_max = GridSearch_CV(Combined, param_grid, cv, ABC)
# CVresults_max
# CVresults_max.nlargest(20, "mean_test_scores")