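## Reimplementing FragPipeAnalystR in Python
Installing the R package requires downloading so many dependencies that it takes too long, even on an HPC cluster, so this notebook tries to reproduce the analysis in Python.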

FragPipe-Analyst and FragPipeAnalystR share a common modular design with the following modules: \
(1) I/O module that handles the input and output files. \
(2) Data manipulation module, which has functions for operating the fundamental data structure, including removing/merging samples and feature selections. \
(3) Data filtering module, which provides methods for filtering data based on missing values. \
(4) Normalization module that provides several normalization methods. \
(5) Imputation module that provides several imputation methods. \
(6) Quality control (QC) module, which provides functions to generate various visualizations, including principal component analysis (PCA) plots and heatmaps. \
(7) Differential expression (DE) analysis module, which provides functions for performing statistical procedures and result visualization, such as a volcano plot. \
(8) Enrichment analysis module, which provides statistical procedures specifically for inferring biological insights, such as overrepresentation tests.
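As a rough orientation for modules (3) to (5), the following is a minimal pandas sketch on a toy matrix. It is not FragPipe-Analyst's implementation: the threshold of two observed values, the median-centering normalization, and the per-sample-minimum imputation are all assumptions chosen only to show where each step sits.

```python
import numpy as np
import pandas as pd

# Toy log2 intensity matrix: rows are proteins, columns are samples (NaN = missing).
log2_int = pd.DataFrame(
    {
        "s1": [20.0, 18.0, np.nan, 22.0],
        "s2": [21.0, np.nan, np.nan, 23.0],
        "s3": [20.5, 18.5, 15.0, 22.5],
    },
    index=["P1", "P2", "P3", "P4"],
)

# (3) Filtering: keep proteins observed in at least 2 of the 3 samples (assumed threshold).
filtered = log2_int[log2_int.notna().sum(axis=1) >= 2]

# (4) Normalization: subtract each sample's median so all samples share a median of 0.
normalized = filtered - filtered.median(axis=0)

# (5) Imputation: fill the remaining gaps with each sample's minimum (one simple choice).
imputed = normalized.fillna(normalized.min(axis=0))
```

Here `P3` is dropped by the filter, and the single remaining gap (`P2` in `s2`) is filled with the smallest normalized value of `s2`.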


1. Import the TSV files from the FragPipe (FP) output

In [18]:
import pandas as pd
import numpy as np

In [19]:
combined_protein = pd.read_csv(r"data_WTvs3PA/combined_protein.tsv", header=0, sep="\t")
annotation = pd.read_csv(r"data_WTvs3PA/experiment_annotation.tsv", header=0, sep="\t")


### Obtain only the required columns
Protein information and hits \
Protein intensity in each sample, as MaxLFQ intensity (we decided to use MaxLFQ)

In [20]:
# filter(regex=...) searches the column names and returns only the matching columns;
# (?i) makes the pattern case-insensitive.
maxlfq_cols = combined_protein.filter(regex='(?i)MaxLFQ').columns
# Take the Entry Name column (third column) as a one-column DataFrame so that its
# column name can be combined with the MaxLFQ column names below.
protein_name = combined_protein.iloc[:, 2].to_frame()
protein_info = combined_protein.iloc[:, :5]  # first five columns (protein information)
req_cols = list(protein_name.columns) + list(maxlfq_cols)
protein_maxquant = combined_protein.loc[:, req_cols]
# Always set the index (row names) before transposing, so that the row names
# are preserved as column names.
protein_maxquant = protein_maxquant.set_index('Entry Name')
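The same selection can be written by column name rather than by position, which is a little more robust if the column order changes. The sketch below runs on a toy table; apart from `Entry Name`, its column names are made up for illustration and are not taken from the real `combined_protein.tsv`.

```python
import pandas as pd

# Toy stand-in for combined_protein.tsv; column names are assumed for illustration.
toy = pd.DataFrame(
    {
        "Protein": ["sp|P1", "sp|P2"],
        "Entry Name": ["AAA_HUMAN", "BBB_HUMAN"],
        "WT_1 Intensity": [10.0, 20.0],
        "WT_1 MaxLFQ Intensity": [11.0, 21.0],
        "KO_1 MaxLFQ Intensity": [12.0, 0.0],
    }
)

# Select the identifier column by name and the intensity columns by pattern.
maxlfq = toy.filter(regex=r"(?i)MaxLFQ")
by_protein = pd.concat([toy[["Entry Name"]], maxlfq], axis=1).set_index("Entry Name")

# After transposing, samples are the rows and Entry Name values are the column labels.
by_sample = by_protein.T
```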

#### Transpose the dataset, then add a column from the experiment annotation to specify experimental groups for easier comparison and grouping

In [21]:
protein_maxquant_T = protein_maxquant.transpose()
# After transposing, the samples are the rows and the Entry Names are the columns


### Generate log2 of MaxLFQ intensity
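A minimal sketch of the log2 step on a toy matrix. It assumes that a MaxLFQ value of 0 means "not quantified", which should be checked against the actual export; the sample and protein names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy MaxLFQ matrix; 0 stands for "not quantified" (an assumption about the export).
maxlfq = pd.DataFrame(
    {"s1": [1024.0, 0.0], "s2": [2048.0, 512.0]},
    index=["AAA_HUMAN", "BBB_HUMAN"],
)

# Mask zeros first so that log2 yields NaN for missing values instead of -inf.
log2_maxlfq = np.log2(maxlfq.replace(0, np.nan))
```

Keeping missing values as `NaN` leaves the choice of filtering and imputation to a later step.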

In [22]:
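# One label per row of protein_maxquant_T: four groups of 4 samples plus 'NEG' (17 labels).
# NOTE: 'GLP1R_3PA_GLP1' appears twice here, and 'GLP1R_WT_VEH', which the PCA plot
# further down expects, never appears; check the order against experiment_annotation.tsv.
conditions_arr = ['GLP1R_3PA_GLP1'] * 4 + ['GLP1R_3PA_VEH'] * 4 + ['GLP1R_WT_GLP1'] * 4 + ['GLP1R_3PA_GLP1'] * 4 + ['NEG']
protein_maxquant_T.insert(0, "Conditions", conditions_arr)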
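Hard-coding the labels depends on the row order being exactly as expected. One alternative is to look each sample up in the annotation table. The sketch below uses toy data and assumes that the annotation has `sample` and `condition` columns and that the intensity rows are named `<sample> MaxLFQ Intensity`; both are assumptions to verify against the real files.

```python
import pandas as pd

# Toy stand-in for experiment_annotation.tsv; the 'sample' and 'condition'
# column names are assumptions about the file layout.
annotation = pd.DataFrame(
    {
        "sample": ["WT_VEH_1", "WT_VEH_2", "WT_GLP1_1", "NEG_1"],
        "condition": ["WT_VEH", "WT_VEH", "WT_GLP1", "NEG"],
    }
)

# Toy transposed intensity table whose row labels are '<sample> MaxLFQ Intensity'.
row_order = ["WT_GLP1_1", "NEG_1", "WT_VEH_1", "WT_VEH_2"]
table_T = pd.DataFrame(
    {"AAA_HUMAN": [1.0, 2.0, 3.0, 4.0]},
    index=[f"{s} MaxLFQ Intensity" for s in row_order],
)

# Strip the suffix to recover sample names, then look each one up in the annotation.
sample_names = table_T.index.str.replace(" MaxLFQ Intensity", "", regex=False)
condition_by_sample = annotation.set_index("sample")["condition"]
conditions = condition_by_sample.reindex(sample_names)
if conditions.isna().any():
    raise ValueError("some samples are missing from the annotation table")

table_T.insert(0, "Conditions", conditions.to_numpy())
```

Because the lookup is keyed on the sample name, the labels stay correct even when the rows are not in annotation order.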



In [23]:
# Save the protein quantification table and its transpose
protein_maxquant.to_csv(r"data_WTvs3PA/protein_maxquant.csv")
protein_maxquant_T.to_csv(r"data_WTvs3PA/protein_maxquant_T.csv")

### Try plotting a PCA and a heatmap

#### PCA step 1: split the dataset into two components, X and Y


In [30]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv(r"data_WTvs3PA/protein_maxquant_T.csv", header=0, sep=",")
df.set_index(['Conditions'], inplace=True)
df = df.transpose()  # rows are now proteins; columns are samples labelled by condition
df.shape  # PCA needs a flat, two-dimensional table
# Drop the row of sample names left over from the unnamed index column of the CSV
df.drop('Unnamed: 0', axis=0, inplace=True)


In [None]:
# df.reset_index(inplace=True)  # run once: this creates the 'index' column used in the next cell

In [55]:
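# df.drop('level_0', axis=1, inplace=True)  # drop the unwanted column if it exists
df['indexkeep'] = df.loc[:, 'index']
df.index = df['indexkeep']
X = df.iloc[:, 1:17]  # columns 1-16: intensity values
Y = df.iloc[:, 0]     # column 0: the 'index' column
from sklearn.preprocessing import StandardScaler
# Rows are proteins at this point, so StandardScaler standardizes each sample column
scaler = StandardScaler()
X = scaler.fit_transform(X)
X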

array([[ 0.12541803,  0.15989102,  0.15726783, ...,  0.11068162,
         0.10609429,  0.14819676],
       [-0.17664646, -0.16943403, -0.16836794, ..., -0.18053587,
        -0.17429433, -0.18849836],
       [-0.15817491, -0.15494996, -0.15284529, ..., -0.15636731,
        -0.1595809 , -0.16239626],
       ...,
       [-0.19172349, -0.15296487, -0.14842097, ..., -0.16762058,
        -0.16379686, -0.17662418],
       [-0.17223597, -0.16327986, -0.16589356, ..., -0.17784616,
        -0.16869913, -0.18145311],
       [-0.14131627, -0.13613398, -0.1318695 , ..., -0.15148375,
        -0.15046515, -0.16116339]])
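`StandardScaler` always works column by column, so which axis holds the samples decides what gets standardized. A small sketch on a made-up matrix shows this, along with the transpose trick for standardizing rows instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: 3 rows, 2 columns with very different scales.
data = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaled = StandardScaler().fit_transform(data)

# Every column now has mean 0 and unit (population) standard deviation.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

# To standardize rows instead, transpose before scaling and transpose back.
scaled_rows = StandardScaler().fit_transform(data.T).T
```

With proteins on the rows, as in the cell above, each sample column is scaled across all proteins; with samples on the rows, each protein would be scaled across samples.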

In [56]:
# Fit a PCA with all components and inspect the estimated covariance matrix
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
pca.get_covariance()

array([[1.00035881, 0.99491531, 0.97838022, 0.97881933, 0.9912392 ,
        0.97865417, 0.98671331, 0.99121175, 0.99603791, 0.98737824,
        0.97482428, 0.96958323, 0.98308627, 0.97833228, 0.98458034,
        0.9835183 ],
       [0.99491531, 1.00035881, 0.98236046, 0.98237529, 0.99457755,
        0.98477567, 0.99040935, 0.99331359, 0.98999398, 0.99350439,
        0.97935099, 0.97489842, 0.99076534, 0.98469726, 0.98841911,
        0.98799575],
       [0.97838022, 0.98236046, 1.00035881, 0.99800925, 0.97749122,
        0.9645932 , 0.97130404, 0.97557739, 0.97444238, 0.98180387,
        0.9903138 , 0.99309979, 0.97114815, 0.98855126, 0.99134623,
        0.96854397],
       [0.97881933, 0.98237529, 0.99800925, 1.00035881, 0.97916217,
        0.96849996, 0.97484751, 0.97790518, 0.97426971, 0.98085135,
        0.99005853, 0.99187658, 0.97490175, 0.99154499, 0.99302511,
        0.97139731],
       [0.9912392 , 0.99457755, 0.97749122, 0.97916217, 1.00035881,
        0.99145098, 0.99193673, 
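The diagonal value 1.00035881 rather than exactly 1 is expected. `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the PCA covariance uses the sample estimate (`ddof=1`), so each diagonal entry equals n/(n-1). With n = 2788 rows that gives 2788/2787 ≈ 1.00035881. The sketch below reproduces the effect on random data; the sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 50
data = rng.normal(size=(n_rows, 4))

# StandardScaler divides by the population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(data)

# With all components kept, get_covariance() matches the sample covariance (ddof=1).
pca = PCA().fit(scaled)
cov = pca.get_covariance()

# Each standardized column therefore has a sample variance of n / (n - 1).
expected_diag = n_rows / (n_rows - 1)
```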

In [57]:
from sklearn.decomposition import PCA
pca_wt3pa = PCA(n_components=2)
principalComponents = pca_wt3pa.fit_transform(X)
principal_Df = pd.DataFrame(data=principalComponents,
                            columns=['principal component 1', 'principal component 2'])


In [58]:
principal_Df.tail()


| | principal component 1 | principal component 2 |
|---|---|---|
| 2783 | 2.343817 | 0.00892 |
| 2784 | -0.696496 | 0.008585 |
| 2785 | -0.652711 | 0.018496 |
| 2786 | -0.68472 | 0.003995 |
| 2787 | -0.579087 | 0.010116 |
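The row labels run up to 2787, so this PCA produced one point per protein rather than one per sample. For a sample-level PCA plot, the samples would normally be the rows and the proteins the columns. The following is a minimal sketch of that orientation on synthetic data; the sample names, sizes, and the injected group shift are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic log2 intensities: 6 samples (rows) x 40 proteins (columns).
conditions = ["A", "A", "A", "B", "B", "B"]
samples = [f"{c}_{i}" for i, c in enumerate(conditions, start=1)]
values = rng.normal(loc=20.0, scale=1.0, size=(6, 40))
values[3:, :10] += 3.0  # shift 10 proteins in condition B
table = pd.DataFrame(values, index=samples, columns=[f"P{j}" for j in range(40)])

# Samples stay on the rows, so each protein is standardized across samples.
scaled = StandardScaler().fit_transform(table)
scores = PCA(n_components=2).fit_transform(scaled)

# One row of scores per sample, labelled so it can be grouped by condition later.
scores_df = pd.DataFrame(scores, index=samples, columns=["PC1", "PC2"])
scores_df.insert(0, "Conditions", conditions)
```

Real data would also need missing values handled before scaling, since neither `StandardScaler` output nor `PCA` can be used with `NaN` entries left in place.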


In [42]:
explained_variance=pca.explained_variance_ratio_
explained_variance

array([9.83964308e-01, 6.77860926e-03, 2.82443082e-03, 1.55248463e-03,
       1.41332780e-03, 1.07757355e-03, 8.37825207e-04, 4.11629325e-04,
       3.19293643e-04, 2.63296992e-04, 1.64216050e-04, 1.31215622e-04,
       1.19837340e-04, 5.48861362e-05, 5.26300693e-05, 3.44352041e-05])
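The first component carries about 98% of the variance. Since the rows given to the PCA are proteins, this likely reflects overall abundance differences between proteins more than differences between conditions. A short sketch using the ratios printed above gives the cumulative share and the number of components needed to reach 99%.

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([
    9.83964308e-01, 6.77860926e-03, 2.82443082e-03, 1.55248463e-03,
    1.41332780e-03, 1.07757355e-03, 8.37825207e-04, 4.11629325e-04,
    3.19293643e-04, 2.63296992e-04, 1.64216050e-04, 1.31215622e-04,
    1.19837340e-04, 5.48861362e-05, 5.26300693e-05, 3.44352041e-05,
])

cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 99%.
n_for_99 = int(np.searchsorted(cumulative, 0.99) + 1)
```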

In [61]:
plt.figure(figsize=(10,10))  # a single figure; a bare plt.figure() before this would open an extra empty one
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of WT vs 3PA, VEH/GLP1",fontsize=20)
targets = ['GLP1R_3PA_VEH','GLP1R_3PA_GLP1','GLP1R_WT_VEH','GLP1R_WT_GLP1','NEG']
colors = ['r', 'g','b','c','y']
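# NOTE: df['index'] appears to hold protein names and is indexed by protein, while
# principal_Df has a default integer index, so this mask neither matches any target
# nor aligns with principal_Df.
for target, color in zip(targets, colors):
    indicesToKeep = df['index'] == target
    plt.scatter(principal_Df.loc[indicesToKeep, 'principal component 1'],
                principal_Df.loc[indicesToKeep, 'principal component 2'],
                c=color,
                s=50)
plt.legend(targets, prop={'size': 15})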

plt.show()

AssertionError: 
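The traceback is cut off, so the exact cause is not visible. One plausible reading is that the boolean mask built from `df` cannot be matched to the rows of `principal_Df`. A way to avoid that whole class of problem is to keep the condition labels in the same table as the scores and plot group by group. The sketch below uses hypothetical per-sample scores, not values from this dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-sample scores; in practice these would come from a PCA on samples.
scores_df = pd.DataFrame(
    {
        "Conditions": ["WT_VEH", "WT_VEH", "WT_GLP1", "WT_GLP1", "NEG"],
        "PC1": [-1.0, -0.8, 1.1, 0.9, 0.0],
        "PC2": [0.2, -0.1, 0.3, -0.2, 1.5],
    }
)

fig, ax = plt.subplots(figsize=(6, 6))
for condition, group in scores_df.groupby("Conditions", sort=False):
    # Each group comes from the same table as the scores, so nothing can misalign.
    ax.scatter(group["PC1"], group["PC2"], s=50, label=condition)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.legend()
```

Passing `label=` to each `scatter` call also ties every legend entry to the points it describes, instead of relying on the order of a separate list.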