## Trying to reimplement FragPipeAnalystR in Python
Motivation: the R package requires intensive package downloading, which takes too long even on an HPC cluster.

FragPipe-Analyst and FragPipeAnalystR share a common modular design with the following modules: \
(1) I/O module that handles the input and output files. \
(2) Data manipulation module, which has functions for operating the fundamental data structure, including removing/merging samples and feature selections. \
(3) Data filtering module, which provides methods for filtering data based on missing values. \
(4) Normalization module that provides several normalization methods. \
(5) Imputation module that provides several imputation methods. \
(6) Quality control (QC) module, which provides functions to generate various visualizations, including principal component analysis (PCA) plots and heatmaps. \
(7) Differential expression (DE) analysis module, which provides functions for performing statistical procedures and result visualization, such as a volcano plot. \
(8) Enrichment analysis module, which provides statistical procedures specifically for inferring biological insights, such as overrepresentation tests.


1. Import the TSV files from the FragPipe (FP) output

In [78]:
import pandas as pd
import numpy as np

In [81]:
combined_protein = pd.read_csv(r"data_WTvs3PA/combined_protein.tsv", header=0, sep="\t")
annotation = pd.read_csv(r"data_WTvs3PA/experiment_annotation.tsv", header=0, sep="\t")


### Obtain only the required columns:
Protein information and hits \
Protein intensity in each sample as MaxLFQ (we decided to use MaxLFQ)

In [None]:
# filter(regex=...) matches against column names
# (?i) makes the match case-insensitive
# Only the matching columns are returned
maxlfq_cols = combined_protein.filter(regex='(?i)MaxLFQ').columns
# Take the third column (Entry Name) as a one-column DataFrame so its name can be added to the required columns
protein_name = combined_protein.iloc[:, 2].to_frame()  # the Entry Name column
protein_info = combined_protein.iloc[:, :5]
req_cols = list(protein_name.columns) + list(maxlfq_cols)
protein_maxquant = combined_protein.loc[:, req_cols]
# ALWAYS set the index (row names) before transposing, so the row names are preserved as column names
protein_maxquant = protein_maxquant.set_index('Entry Name')
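
Selecting Entry Name by position (`iloc[:, 2]`) depends on the column order of `combined_protein.tsv`. Below is a minimal sketch of a name-based alternative on a small made-up table. The helper `select_maxlfq`, the toy column names, and the values are illustrative assumptions, not real FragPipe output.

```python
import pandas as pd

# Toy stand-in for combined_protein.tsv; names and values are made up
toy = pd.DataFrame({
    "Protein ID": ["P1", "P2"],
    "Entry Name": ["PROT1_HUMAN", "PROT2_HUMAN"],
    "S1 Intensity": [10.0, 20.0],
    "S1 MaxLFQ Intensity": [100.0, 0.0],
    "S2 MaxLFQ Intensity": [120.0, 300.0],
})

def select_maxlfq(df, id_col="Entry Name"):
    """Return the MaxLFQ columns of df, indexed by id_col (hypothetical helper)."""
    # Case-insensitive match on the column name, like regex '(?i)MaxLFQ'
    maxlfq_cols = [c for c in df.columns if "maxlfq" in c.lower()]
    return df.set_index(id_col)[maxlfq_cols]

toy_maxlfq = select_maxlfq(toy)
print(toy_maxlfq)
```

Looking the identifier column up by name means the selection fails loudly with a `KeyError` if the column is missing, instead of silently picking a different column.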

#### Transform the dataset? Add columns from the experiment annotation to specify experimental groups, for easier comparison and grouping?

In [92]:
protein_maxquant_T = protein_maxquant.transpose()
# After transposing, rows are samples (the MaxLFQ columns) and columns are proteins (Entry Name)
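
One possible way to attach group labels is to look them up from the annotation table. The sketch below uses toy data and assumes that the annotation has `sample` and `condition` columns and that each row label of the transposed table is `<sample> MaxLFQ Intensity`. Both are assumptions to check against the actual files, and `add_conditions` is a hypothetical helper.

```python
import pandas as pd

# Toy transposed table: rows are samples, columns are proteins (made-up values)
toy_T = pd.DataFrame(
    {"PROT1_HUMAN": [100.0, 120.0, 5.0], "PROT2_HUMAN": [0.0, 300.0, 0.0]},
    index=["WT_1 MaxLFQ Intensity", "3PA_1 MaxLFQ Intensity", "NEG_1 MaxLFQ Intensity"],
)

# Toy annotation; the column names "sample" and "condition" are assumptions
toy_annotation = pd.DataFrame({
    "sample": ["WT_1", "3PA_1", "NEG_1"],
    "condition": ["GLP1R_WT", "GLP1R_3PA", "NEG"],
})

def add_conditions(df_T, annotation, suffix=" MaxLFQ Intensity"):
    """Return a copy of df_T with a leading 'Conditions' column looked up per sample."""
    sample_to_condition = annotation.set_index("sample")["condition"]
    # Strip the intensity suffix from each row label to recover the sample name
    samples = df_T.index.str.replace(suffix, "", regex=False)
    out = df_T.copy()
    # Samples absent from the annotation get NaN rather than raising
    out.insert(0, "Conditions", sample_to_condition.reindex(samples).to_numpy())
    return out

toy_labeled = add_conditions(toy_T, toy_annotation)
print(toy_labeled)
```

Because the labels are matched by sample name, they stay correct even if the sample order in the table changes.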


### Generate log2 of MaxLFQ intensity

In [None]:
conditions_arr = ['GLP1R_WT'] * 8 + ['GLP1R_3PA'] * 8 + ['NEG']
#protein_maxquant_T.insert(0,"Conditions",conditions_arr)

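# Assumption: an intensity of 0 marks a missing value, so convert it to NaN (log2(0) would give -inf)
protein_mq_log2 = protein_maxquant_T.replace(0, np.nan).apply(np.log2)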
print(protein_mq_log2.head())  # head() must be called; without parentheses the bound method is printed instead of rows
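
The data filtering module listed at the top filters on missing values. Below is a minimal sketch of one such rule on toy log2 data: keep a protein if it is observed in at least `min_fraction` of the samples of at least one condition. The rule, the threshold, and the helper name `filter_by_missing` are assumptions for illustration, not necessarily what FragPipe-Analyst does.

```python
import numpy as np
import pandas as pd

# Toy log2 table: rows are samples, columns are proteins; NaN marks a missing value
toy_log2 = pd.DataFrame(
    {
        "PROT_A": [10.0, 11.0, 10.5, 9.8],
        "PROT_B": [np.nan, np.nan, 12.0, 12.5],
        "PROT_C": [np.nan, 8.0, np.nan, 7.5],
    },
    index=["WT_1", "WT_2", "3PA_1", "3PA_2"],
)
toy_conditions = ["WT", "WT", "3PA", "3PA"]

def filter_by_missing(df, conditions, min_fraction=1.0):
    """Keep proteins observed in >= min_fraction of samples in at least one condition."""
    groups = pd.Series(conditions, index=df.index)
    # Fraction of non-missing values per condition (rows) and protein (columns)
    observed = df.notna().groupby(groups).mean()
    keep = (observed >= min_fraction).any(axis=0)
    return df.loc[:, keep]

toy_filtered = filter_by_missing(toy_log2, toy_conditions)
print(list(toy_filtered.columns))
```

With the default threshold, `PROT_C` is dropped because it is only half observed in both conditions, while `PROT_B` is kept because it is fully observed in one of them.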

### Try plotting a PCA and a heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
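# Pairwise Pearson correlation between samples, computed on the numeric (MaxLFQ) columns
num_corr = protein_maxquant.select_dtypes(include=[np.number]).corr()
# annot=True writes each correlation coefficient into its cell
sns.heatmap(num_corr, annot=True, cmap='coolwarm')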
plt.show()
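
The heading mentions PCA, but the cell above only draws the correlation heatmap. Below is a minimal sketch of a sample-level PCA using NumPy's SVD on a randomly generated toy log2 matrix. Keeping only proteins without missing values is a simplifying assumption, and `pca_scores` is a hypothetical helper; FragPipe-Analyst's own PCA may handle the data differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy log2 matrix: 6 samples x 50 proteins, with group B shifted in half the proteins
values = rng.normal(loc=20.0, scale=1.0, size=(6, 50))
values[3:, :25] += 3.0
toy_log2 = pd.DataFrame(values, index=["A1", "A2", "A3", "B1", "B2", "B3"])

def pca_scores(df, n_components=2):
    """Return sample scores and explained-variance ratios from an SVD-based PCA."""
    complete = df.dropna(axis=1)                 # keep proteins with no missing values
    centered = complete - complete.mean(axis=0)  # center each protein across samples
    u, s, _ = np.linalg.svd(centered.to_numpy(), full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    cols = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=cols), explained[:n_components]

scores, explained = pca_scores(toy_log2)
print(scores)
print(explained)
```

The returned `scores` table has one row per sample, so it can be passed to a scatter plot of `PC1` against `PC2` and colored by condition.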