# Training Wheel Exercise: Renal Cancer Data

Objective: 
Identifying relevant features is a common procedure in biological research (e.g. gene expression and protein expression studies), so a solid understanding of the workflow is essential. 

The purpose of this exercise is to introduce the conventional pipeline for pathway enrichment studies, using protein expression data. We employ commonly used feature selection methods, e.g. the t-test and recursive feature elimination, and select the top proteins ranked by p-value. 
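As a rough illustration of the ranking step, the sketch below computes a pooled-variance (Student's) t statistic per protein and ranks proteins by its absolute value. The protein names and intensities are made up for illustration, and `pooled_t` is a hypothetical helper, not part of this notebook. When every protein has the same group sizes, ranking by |t| gives the same order as ranking by two-sided p-value, so no p-value is computed here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(case, control):
    """Student's two-sample t statistic with pooled variance."""
    n1, n2 = len(case), len(control)
    pooled_var = ((n1 - 1) * variance(case) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(case) - mean(control)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical log2 intensities (case, control) for three made-up proteins.
toy = {
    "protA": ([10.1, 10.3, 9.9, 10.2], [8.0, 8.2, 7.9, 8.1]),  # large shift
    "protB": ([9.0, 9.4, 8.8, 9.1], [9.1, 8.9, 9.3, 9.0]),     # no shift
    "protC": ([7.5, 7.9, 7.2, 7.8], [7.0, 7.3, 6.8, 7.1]),     # moderate shift
}

t_stats = {name: pooled_t(case, ctrl) for name, (case, ctrl) in toy.items()}

# Rank proteins from the strongest to the weakest group difference.
ranked = sorted(t_stats, key=lambda name: abs(t_stats[name]), reverse=True)
top_k = ranked[:2]  # keep the two top-ranked proteins
print(ranked, top_k)
```

In a real analysis a library routine that also returns p-values would normally be used, followed by a multiple-testing correction.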

Subsequently, the frequently used hypergeometric enrichment tool will be used to evaluate whether these protein features are enriched in pathways. 
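The idea behind hypergeometric enrichment can be sketched as follows: given a background of N proteins, of which K belong to a pathway, what is the probability of seeing at least k pathway members among n selected proteins? The function name `hypergeom_enrichment_p` and the example counts for K, n and k are assumptions for illustration only; the background size of 3123 is the number of proteins in the RC data.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: proteins in the background, K: background proteins in the pathway,
    n: selected proteins, k: selected proteins that are in the pathway.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical example: 100 selected proteins, 8 of which fall in a 50-protein pathway.
p_value = hypergeom_enrichment_p(N=3123, K=50, n=100, k=8)
print(p_value)
```

A small p-value suggests that the pathway is over-represented among the selected proteins. When many pathways are tested, the p-values would usually be adjusted for multiple testing.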

The results from this exercise can also be used as a benchmark when we perform feature selection with "fuzzy-logic" tools in future studies. 

Footnote: 
This notebook also serves as a logbook for BS9001.

### Installation of packages 

Let's first install the packages required for the analyses. Ensure that pip is already installed; if it is not, install it from the command line first. 

Packages that are already installed do not need to be installed again. 

In [4]:
#install the libraries and packages required 
#pip must already be available before running this cell 
!pip install pandas
!pip install scikit-learn #the PyPI package is named scikit-learn; it is imported as sklearn
!pip install matplotlib
!pip install numpy
!pip install bioinfokit
!pip install combat
!pip install seaborn

You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.
Collecting combat
  Using cached combat-0.3.0-py3-none-any.whl (36 kB)
Collecting mpmath==1.1.0
  Using cached mpmath-1.1.0.tar.gz (512 kB)
Collecting numpy==1.18.5
  Using cached numpy-1.18.5.zip (5.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
    ERROR: Command err


### Data preparation 
In this exercise, the well-studied renal cancer (RC) data will be used for analyses.

Some of the characteristics of the RC data include the following: 
- Protein expression data 
- "Cleaned" data 
- Two groups of patients: normal (control) and cancer (case) group
- Six patients in each group, each measured in duplicate 
- 3123 proteins 
- Dimensions: 3123 rows x 24 columns  
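The 24 columns follow from 2 groups x 6 patients x 2 replicates. As a small sketch, the sample names can be parsed into group, patient and replicate. The pattern is based on names such as `cancer_cc_patient1_rep1` from the printed feature table; the naming of the normal samples and their patient numbers are assumptions, and `parse_sample` is a hypothetical helper.

```python
import re
from collections import Counter

# Pattern modelled on names like "cancer_cc_patient1_rep1".
SAMPLE_PATTERN = re.compile(
    r"^(?P<group>[a-z]+)_.*patient(?P<patient>\d+)_rep(?P<rep>\d+)$"
)


def parse_sample(name):
    """Split a sample name into (group, patient number, replicate number)."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected sample name: {name!r}")
    return match["group"], int(match["patient"]), int(match["rep"])


# Assumed layout: the same patient numbers and naming scheme in both groups.
patients = [1, 2, 3, 6, 7, 8]
names = [
    f"{group}_cc_patient{p}_rep{r}"
    for group in ("cancer", "normal")
    for p in patients
    for r in (1, 2)
]

labels = [parse_sample(n) for n in names]
group_counts = Counter(group for group, _, _ in labels)
print(len(names), dict(group_counts))
```

Labels parsed this way can later serve as the class vector for the t-test and as a check that no sample is missing.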

First, the RC data must be transposed into a samples-by-proteins feature table so that it is suitable for the subsequent analyses. 

In [25]:
#Data preparation 
%matplotlib inline
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

#load the data and sort the columns by name
#(the code below assumes "Unnamed: 0", the protein ID column, sorts first)
geneaggregate = pd.read_csv('RC_data.csv')
df = geneaggregate[sorted(geneaggregate.columns)]

#transpose so that rows are samples, then use the first row (protein IDs) as header
dfT = df.T
new_header = dfT.iloc[0] #grab the first row for the header
featuretable = dfT[1:] #take the data less the header row
featuretable.columns = new_header #set the header row as the df header

#store the protein expression values and the protein IDs (the column labels)
samples = featuretable.values
sample_names = featuretable.columns

print(featuretable)

Unnamed: 0              Q9UBE0  Q9BSJ8  P02656 O95741  P09651  P55809  Q15631  \
cancer_cc_patient1_rep1  40914   41185  274731   4969  103836  101089  102971   
cancer_cc_patient1_rep2  45120   42150  284693   5472  118185   97593  110008   
cancer_cc_patient2_rep1  44113  113386  141656   7872  162475  137794  112840   
cancer_cc_patient2_rep2  47835  139305  155864   7957  170174   99304  129621   
cancer_cc_patient3_rep1  26957   35891  161075   6111  101960   88960   92826   
cancer_cc_patient3_rep2  26005   30788  124223   3722   82866   68415   75856   
cancer_cc_patient6_rep1  35712   52750  194500   4094  118675  117222  142954   
cancer_cc_patient6_rep2  30517   33753  166061   3580   83659   85197   92312   
cancer_cc_patient7_rep1  38094   62466  177344   5878   95034   96055   97456   
cancer_cc_patient7_rep2  26654   45894  133566   4066   71776   54583   58171   
cancer_cc_patient8_rep1  48576  113950  121262   5700  175963  122041  114483   
cancer_cc_patient8_rep2  290

### Batch Correction 
Biological data are often obtained in batches. Technical sources of variation (e.g. different experimental personnel or different instruments) can lead to heterogeneity across batches of data, also known as batch effects.

Batch effects can confound data and distort statistical testing. Such distortions can obscure important explanatory variables in a dataset (e.g. subpopulations). 

Hence, batch correction is often an essential step to ensure that discoveries made from the data in question are truly meaningful and relevant.
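To make the idea concrete, here is a deliberately simplified sketch that removes a constant shift between two batches by centering each batch on the overall mean. This is not the ComBat method used in the next cell (ComBat adjusts both mean and variance with an empirical Bayes model); the values and the helper `center_by_batch` are made up for illustration.

```python
from statistics import mean


def center_by_batch(values, batch):
    """Subtract each batch's mean and add back the overall mean."""
    overall = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batch) if vb == b) for b in set(batch)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batch)]


# Hypothetical log2 intensities of one protein; batch 2 reads about 2 units higher.
values = [10.0, 12.0, 10.4, 12.4, 9.6, 11.6]
batch = [1, 2, 1, 2, 1, 2]

corrected = center_by_batch(values, batch)
mean_b1 = mean(v for v, b in zip(corrected, batch) if b == 1)
mean_b2 = mean(v for v, b in zip(corrected, batch) if b == 2)
print(corrected, mean_b1, mean_b2)
```

After centering, both batches share the same mean while the within-batch differences are preserved. Plain centering ignores the biological groups, so it can also remove real signal when groups and batches are unbalanced.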

In [1]:
from combat.pycombat import pycombat
#batch correction 
#this cell uses df and np from the data preparation cell, so run that cell first
#transform data into suitable format for batch correction (proteins as rows, samples as columns)
df2 = df.set_index("Unnamed: 0")
df2_logtransformed = np.log2(df2)

#perform batch correction: the sorted columns alternate rep1, rep2,
#and each replicate run is treated as one batch
batch = [1, 2] * 12
df_batchcorrected = pycombat(df2_logtransformed, batch)

#plt.boxplot(df_batchcorrected.transpose())
#plt.show()
df_batchcorrected_T = df_batchcorrected.T
df_batchcorrected_T

NameError: name 'df' is not defined