## Data processing

### 0 Introduction
Before starting data analysis, it is important to ask questions such as:
- What is the outcome to be achieved as a result of data analysis?
- Which parameters are necessary to keep, and why, to achieve the desired outcome?
- How can a standard structure be achieved across different datasets?

For this project we have three datasets:
- **Proteomics:** Signal peptide abundances measured by analyzing the supernatant of fungal samples cultivated under two different media compositions (minimal media and minimal media with additional nitrogen)
- **RNA sequence:** Gene expression levels obtained from transcriptome analysis of fungal samples cultivated under different conditions
- **SignalP:** Potential signal peptide sequences determined by using the SignalP algorithm to analyze the _A. oryzae_ genome

The Proteomics and SignalP datasets are needed to compare the observed and predicted signal peptides. The RNA sequence dataset is needed to normalize the signal peptide abundance observed in the proteomics analysis with respect to gene expression levels, which should improve the data for the later Automated Machine Learning (AutoML) step.

However, for now, the datasets have different structures and contain excess information that is not necessary for data analysis, which makes it difficult to connect them during analysis. Hence, the datasets need to be processed into a common structure that contains only the information needed to carry out the analysis and to link the datasets.

### Agenda:
As a rule of thumb, the datasets should have a generic structure that satisfies:
- A common name for columns that hold the same data content
- A representative name that clearly identifies the data stored in each column
- A dataframe structure that contains only the information needed for the analysis
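
These three rules can be folded into one small helper. The sketch below is only an illustration and not part of the original notebook: the function name `standardize_columns` and the toy dataframe are assumptions, and it relies on nothing beyond pandas' `rename` and plain column selection.

```python
import pandas as pd


def standardize_columns(df, rename_map, keep):
    """Rename columns to shared names and keep only the columns needed.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    renamed = df.rename(columns=rename_map)
    missing = [col for col in keep if col not in renamed.columns]
    if missing:
        raise KeyError(f"Columns not found after renaming: {missing}")
    # Return a copy so later edits do not touch the input dataframe
    return renamed[keep].copy()


# Toy dataframe shaped like a prediction table (illustrative values)
toy = pd.DataFrame({
    "gene": ["AO090005000016-T-p1", "AO090005000029-T-p1"],
    "start_pos": [0, 0],
    "end_pos": [25, 25],
})

tidy = standardize_columns(toy, {"gene": "Accession"}, ["Accession", "end_pos"])
print(tidy.columns.tolist())
```

Raising an error on a missing column is a design choice made here so that a misspelled column name fails early instead of silently producing an incomplete table.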

In [2]:
import pandas as pd

## 1 Data

### 1.1 Proteomics data
Proteomics data consist of peptide sequence analysis of supernatants obtained from _Aspergillus_ samples grown under two different minimal media compositions (minimal media and minimal media with additional nitrogen). Each media composition was tested with three samples, and the supernatant of each sample was analyzed by mass spectrometry, yielding the proteomics dataframe.

In [3]:
# Read the Excel file into a pandas dataframe 
df_proteomics = pd.read_excel('/Users/lucaslevassor/projects/Signal_peptide_project/data/03_proteomics_data/20221124_FJ_E1200_MWN_15cm_140min_500ng_#1572_proteins.xlsx')
df_proteomics

Unnamed: 0,Checked,Protein FDR Confidence: Combined,Master,Accession,Description,Exp. q-value: Combined,Sum PEP Score,Coverage [%],# Peptides,# PSMs,...,"Found in Sample: [S23] F23: Sample, 8","Found in Sample: [S24] F24: Sample, 9","Found in Sample: [S25] F25: Sample, 10","Found in Sample: [S26] F26: Sample, 11","Found in Sample: [S27] F27: Sample, 12","Found in Sample: [S28] F28: Sample, 13","Found in Sample: [S29] F29: Sample, 14","Found in Sample: [S30] F30: Sample, 15",# Protein Groups,Modifications
0,False,High,Master Protein,AO090003000935-T-p1,transcript=AO090003000935-T | gene=AO090003000...,0.000,599.687,89,31,7127,...,High,High,High,High,High,High,High,High,1,
1,False,High,Master Protein,AO090023000944-T-p1,transcript=AO090023000944-T | gene=AO090023000...,0.000,523.784,68,32,3237,...,High,High,High,Peak Found,Peak Found,High,High,High,1,
2,False,High,Master Protein,AO090003001591-T-p1,transcript=AO090003001591-T | gene=AO090003001...,0.000,519.742,68,33,2987,...,High,High,Not Found,Not Found,Not Found,High,High,High,1,
3,False,High,Master Protein,RFP_Fusion,RFP_Fusion,0.000,450.464,64,35,398,...,High,Peak Found,High,Peak Found,High,High,High,High,1,
4,False,High,Master Protein,AO090005001300-T-p1,transcript=AO090005001300-T | gene=AO090005001...,0.000,384.472,90,45,554,...,High,High,High,High,High,High,High,High,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
861,False,Medium,Master Protein,AO090001000075-T-p1,transcript=AO090001000075-T | gene=AO090001000...,0.044,1.795,4,1,1,...,Not Found,Peak Found,Peak Found,Peak Found,Peak Found,Peak Found,Peak Found,Peak Found,1,
862,False,Medium,Master Protein,AO090005001355-T-p1,transcript=AO090005001355-T | gene=AO090005001...,0.046,1.790,14,1,1,...,Peak Found,Peak Found,Peak Found,Not Found,Peak Found,Not Found,Peak Found,Not Found,1,
863,False,Medium,Master Protein,AO090003000247-T-p1,transcript=AO090003000247-T | gene=AO090003000...,0.047,1.779,4,1,1,...,Not Found,Peak Found,Peak Found,Peak Found,Peak Found,Not Found,Not Found,Not Found,1,Met-loss [N-Term]
864,False,Medium,Master Protein,AO090001000680-T-p1,transcript=AO090001000680-T | gene=AO090001000...,0.048,1.774,1,1,1,...,Not Found,Not Found,Peak Found,Peak Found,Not Found,Peak Found,High,Peak Found,1,


### 1.3 SignalP data

SignalP is an algorithm developed at DTU that predicts the presence and location of signal peptide cleavage sites in amino acid sequences by analyzing the annotated protein profile of a specific organism. Hence, SignalP can be used to generate a list of potential signal peptide sequences. I would like to credit and express my gratitude to my supervisor Lucas Levassor, who put in the effort to generate the SignalP predictions, which are described in notebook 00.

In [6]:
# Read the Excel file into a pandas dataframe
df_signalP = pd.read_excel('/Users/lucaslevassor/projects/Signal_peptide_project/data/02_all_signal_peptides/sigpep_predict.xlsx')
df_signalP

Unnamed: 0.1,Unnamed: 0,gene,start_pos,end_pos,signal_peptide_likelyhood,sequence
0,0,AO090005000016-T-p1,0,25,0.999803,MAPSHSFMLFLSVICTHLCSLVVAV
1,3,AO090005000029-T-p1,0,25,0.999835,MHLRNIVIALAATAVASPVDLQDRQ
2,6,AO090005000042-T-p1,0,25,0.999843,MKASFISRLLSLTAFAISSNLSYGL
3,9,AO090005000053-T-p1,0,43,0.854809,MGLFLTALGALSSVNVLYSRGRMPLKHLATLLCALSPTVALSQ
4,12,AO090005000059-T-p1,0,20,0.999821,MHLQATLAVGLSLLGLTLAD
...,...,...,...,...,...,...
1056,3168,AO090103000483-T-p1,0,21,0.833106,MKTSFLLAAIGFLYRLPCSAA
1057,3171,AO090103000487-T-p1,0,21,0.999710,MTRYLSFLFLLILFGNSVFTA
1058,3174,AO090103000493-T-p1,0,19,0.999791,MRGIVALSFLSVALGVTAD
1059,3177,AO090701000994-T-p1,0,20,0.999845,MRLLLIAPLFSAVSYGAQAT


## 2 Data processing

Which parameters should be isolated? What should be kept, and why?

### 2.1 Common changes in all datasets

Across all the datasets, accession names serve as a common feature that can be used to connect them. All the datasets should therefore be processed to have:
- A common and representative name for the accession column, such as "Accession"
- A common, recognizable data format, such as accession numbers without the suffix "-T-p1"

In [8]:
# Rename the accession column of the SignalP dataframe to "Accession" (the proteomics dataframe already uses this name)
df_signalP = df_signalP.rename(columns={'gene': 'Accession'})

# Remove the "-T-p1" suffix from the values in the "Accession" columns (regex=False treats the suffix as a literal string)
df_proteomics['Accession'] = df_proteomics['Accession'].str.replace('-T-p1', '', regex=False)
df_signalP['Accession'] = df_signalP['Accession'].str.replace('-T-p1', '', regex=False)
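
Once the suffix is removed, the accession can act as a join key between the tables. The following is a minimal sketch on toy data, assuming an inner `merge` is a suitable way to link them (the linking step itself is not shown in this notebook). The toy frames and their `Abundance` values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the proteomics and SignalP tables (values are illustrative)
toy_proteomics = pd.DataFrame({
    "Accession": ["AO090005000016-T-p1", "AO090003000935-T-p1"],
    "Abundance": [1.0, 2.0],
})
toy_signalp = pd.DataFrame({
    "Accession": ["AO090005000016-T-p1", "AO090005000029-T-p1"],
    "length": [25, 25],
})

for frame in (toy_proteomics, toy_signalp):
    # regex=False treats the suffix as a literal string
    frame["Accession"] = frame["Accession"].str.replace("-T-p1", "", regex=False)

# Inner join keeps only accessions present in both tables
linked = toy_proteomics.merge(toy_signalp, on="Accession", how="inner")
print(linked)
```

An inner join keeps only the proteins that were both observed and predicted. A left or outer join would instead preserve the unmatched rows, which may be preferable when the unmatched proteins are themselves of interest.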

### 2.2 Proteomics data processing

In this dataset, the important processing steps are:

- Removal of RFP_Fusion (Red Fluorescent Protein), as it is not part of the native protein pool of _A. oryzae_ RIB40
- Isolation of the columns needed for analysis:
    - **Accession:** Unique identifier that is assigned to the signal peptide sequence
    - **Abundance:** Raw signal peptide counts that are observed and recorded directly from mass spectrometry analysis
    - **Abundances (Scaled):** Scaled version of the raw signal peptide counts that are observed from mass spectrometry analysis

There are multiple columns with "Abundance" in their name; hence, a distinctive pattern needs to be determined to isolate the specific columns. The data for the minimal media (MM) and minimal media with additional nitrogen (MM_N) compositions are stored under column names that follow a specific naming pattern:

- **Raw signal peptide abundance**
    - Abundance: F16, Abundance: F17, Abundance: F18 - Minimal media (MM)
    - Abundance: F28, Abundance: F29, Abundance: F30 - Minimal media with additional nitrogen (MM_N)
- **Scaled signal peptide abundance**
    - Abundances (Scaled): F16, Abundances (Scaled): F17, Abundances (Scaled): F18 - Minimal media (MM)
    - Abundances (Scaled): F28, Abundances (Scaled): F29, Abundances (Scaled): F30 - Minimal media with additional nitrogen (MM_N)

Hence, the "Scaled" and "Raw" abundance columns can be selected by combining two column name patterns:
- **Pattern 1:** "Abundance:" or "Scaled"
- **Pattern 2:** "F16", "F17", "F18", "F28", "F29" or "F30"

Let's remove "RFP_Fusion" and create a boolean mask that selects dataframe columns according to the specific strings their names contain: "Scaled" selects scaled abundances and "Abundance:" selects raw abundances.
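
The way the two patterns combine can be seen on a toy set of column names. This is a sketch for illustration only: the column names are modelled on the proteomics export, and the dataframe is empty because only the names matter here.

```python
import pandas as pd

# Column names modelled on the proteomics export; the dataframe holds no rows
toy = pd.DataFrame(columns=[
    "Accession",
    "Abundance: F16: Sample, 1",
    "Abundance: F19: Sample, 4",
    "Abundances (Scaled): F28: Sample, 13",
    "Found in Sample: [S16] F16: Sample, 1",
])

# Pattern 1: the column holds raw or scaled abundances
is_abundance = toy.columns.str.contains("Abundance:|Scaled")
# Pattern 2: the column belongs to one of the six samples of interest
is_selected_sample = toy.columns.str.contains("F16|F17|F18|F28|F29|F30")

# A column is kept only if both masks are True
selected = toy.columns[is_abundance & is_selected_sample].tolist()
print(selected)
```

Neither pattern is sufficient alone: pattern 2 by itself would also keep the "Found in Sample" column for F16, and pattern 1 by itself would keep the abundance column for F19.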

In [9]:
# Remove RFP_Fusion from the dataframe as it is not part of the native protein pool
df_proteomics = df_proteomics[df_proteomics.Accession != 'RFP_Fusion']

# Create a boolean mask that is True for columns matching both pattern1 ("Abundance:" or "Scaled") and pattern2 ("F16", "F17", "F18", "F28", "F29" or "F30")
pattern1 = 'Abundance:|Scaled'
pattern2 = 'F16|F17|F18|F28|F29|F30'
df_raw_scaled_abundances = df_proteomics.loc[:, df_proteomics.columns.str.contains(pattern1) & df_proteomics.columns.str.contains(pattern2, regex=True)]

# Combine the accession column with the isolated peptide abundance columns
df_proteomics_processed = df_proteomics[['Accession']+ list(df_raw_scaled_abundances.columns)]
df_proteomics_processed

Unnamed: 0,Accession,"Abundances (Scaled): F16: Sample, 1","Abundances (Scaled): F17: Sample, 2","Abundances (Scaled): F18: Sample, 3","Abundances (Scaled): F28: Sample, 13","Abundances (Scaled): F29: Sample, 14","Abundances (Scaled): F30: Sample, 15","Abundance: F16: Sample, 1","Abundance: F17: Sample, 2","Abundance: F18: Sample, 3","Abundance: F28: Sample, 13","Abundance: F29: Sample, 14","Abundance: F30: Sample, 15"
0,AO090003000935,5.4,4.3,7.4,81.0,35.7,77.8,9.597858e+07,9.416572e+07,1.297344e+08,1.888527e+09,1.011602e+09,1.584488e+09
1,AO090023000944,242.7,250.1,258.5,42.9,72.6,45.7,2.378266e+08,3.016177e+08,2.504219e+08,5.514791e+07,1.134059e+08,5.128523e+07
2,AO090003001591,231.8,214.1,244.3,62.8,84.6,60.1,9.888853e+09,1.123887e+10,1.030518e+10,3.511919e+09,5.755309e+09,2.932830e+09
4,AO090005001300,96.3,38.4,17.0,89.7,111.3,82.3,7.204276e+07,3.532077e+07,1.257048e+07,8.793645e+07,1.328252e+08,7.046364e+07
5,AO090010000746,98.6,105.3,113.0,260.9,284.5,247.8,2.367148e+08,3.107989e+08,2.681958e+08,8.206836e+08,1.088923e+09,6.807444e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...
861,AO090001000075,201.7,18.7,,239.4,431.0,261.4,1.819305e+05,2.070978e+04,,2.830392e+05,6.199164e+05,2.698529e+05
862,AO090005001355,64.6,146.2,,,184.2,,2.104682e+04,5.864526e+04,,,9.575187e+04,
863,AO090003000247,,,,,,,,,,,,
864,AO090001000680,,,,250.9,673.0,244.1,,,,5.761538e+04,1.879930e+05,4.894831e+04


In [13]:
# Save the processed proteomics dataframe as a CSV file
df_proteomics_processed.to_csv('/Users/lucaslevassor/projects/Signal_peptide_project/data/03_proteomics_data/proteomics_processed.csv', index=False)

### 2.4 SignalP data processing

In this dataset, the important processing steps are:
- Renaming the column "end_pos" to a representative name
- Isolation of the columns needed for analysis:
    - **Accession:** Unique identifier that is assigned to the signal peptide sequence
    - **end_pos:** Length of the predicted signal peptide (renamed to "length")
    - **sequence:** Amino acid sequence of the predicted signal peptide

In [14]:
# Rename the column "end_pos" to a representative name
df_signalP = df_signalP.rename(columns={'end_pos': 'length'})

# Isolate the important parameters
df_signalP_processed = df_signalP[['Accession', 'length', 'sequence']]
df_signalP_processed

Unnamed: 0,Accession,length,sequence
0,AO090005000016,25,MAPSHSFMLFLSVICTHLCSLVVAV
1,AO090005000029,25,MHLRNIVIALAATAVASPVDLQDRQ
2,AO090005000042,25,MKASFISRLLSLTAFAISSNLSYGL
3,AO090005000053,43,MGLFLTALGALSSVNVLYSRGRMPLKHLATLLCALSPTVALSQ
4,AO090005000059,20,MHLQATLAVGLSLLGLTLAD
...,...,...,...
1056,AO090103000483,21,MKTSFLLAAIGFLYRLPCSAA
1057,AO090103000487,21,MTRYLSFLFLLILFGNSVFTA
1058,AO090103000493,19,MRGIVALSFLSVALGVTAD
1059,AO090701000994,20,MRLLLIAPLFSAVSYGAQAT
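
Treating `end_pos` as a length is reasonable because `start_pos` is 0 in every row shown, and this can be verified by comparing the column with the actual sequence lengths. A minimal sketch, using three rows copied from the output above. On the full table the same comparison could be run on `df_signalP_processed`, assuming every row follows this convention.

```python
import pandas as pd

# Three rows copied from the processed SignalP output
toy = pd.DataFrame({
    "Accession": ["AO090005000016", "AO090005000059", "AO090103000493"],
    "length": [25, 20, 19],
    "sequence": [
        "MAPSHSFMLFLSVICTHLCSLVVAV",
        "MHLQATLAVGLSLLGLTLAD",
        "MRGIVALSFLSVALGVTAD",
    ],
})

# True where the stored length matches the number of residues in the sequence
length_matches = toy["length"] == toy["sequence"].str.len()
print(length_matches.all())
```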
