# DKFZ - Interview Coding Task

### Description:

##### In this folder, you will find a TSV file (“Protein_abundance.tsv”) that contains protein abundance measured by label-free mass spectrometry for chronic lymphocytic leukemia (CLL) patient samples. You will also find an Excel table (“sampleAnnotation.xls”) that contains some basic annotations for those samples. Your task is to process the protein abundance dataset, assess its quality, and identify protein markers for prognosis. You also need to write a report, preferably in R Markdown or Jupyter Notebook format, to present your analysis results to your potential future dry-lab or wet-lab collaborators.

##### List packages

In [4]:
import pandas as pd

#### 1 - Data processing: 

##### The protein abundance in the TSV file is not normalized and has missing values, which is very common in data tables received from a proteomics facility. You need to normalize the protein abundance and handle the missing values in an appropriate way.
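
One common way to approach the normalization step is to log2-transform the raw intensities and then align the sample medians. The sketch below illustrates that idea on toy data. The function name `log2_median_normalize` and the sample names are assumptions made for illustration, and median centering is only one of several reasonable choices (the task does not prescribe a method).

```python
import numpy as np
import pandas as pd


def log2_median_normalize(abundance: pd.DataFrame) -> pd.DataFrame:
    """Log2-transform raw intensities and align the per-sample medians."""
    # Treat zero or negative intensities as missing so log2 never yields -inf.
    log_abundance = np.log2(abundance.where(abundance > 0))
    # Median of each sample (column), ignoring missing values.
    sample_medians = log_abundance.median(axis=0, skipna=True)
    # Shift every sample so that its median equals the median of all sample medians.
    return log_abundance - sample_medians + sample_medians.median()


# Toy data with hypothetical sample names (not the real dataset).
toy = pd.DataFrame(
    {"S1": [1024.0, 2048.0, np.nan], "S2": [4096.0, 8192.0, 16384.0]},
    index=["P1", "P2", "P3"],
)
normalized = log2_median_normalize(toy)
```

Missing values are left untouched here, so imputation can be handled as a separate step afterwards.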

##### Data Loader

In [7]:
prot_abundance = pd.read_csv("Protein_abundance.tsv", sep='\t')

# set the column of protein ID and gene names as index


# reorder the column names numerically

prot_abundance

Unnamed: 0,X1,A_1_1,A_1_10,A_1_11,A_1_12,A_1_13,A_1_14,A_1_15,A_1_16,A_1_17,...,A_1_45,A_1_46,A_1_47,A_1_48,A_1_49,A_1_5,A_1_6,A_1_7,A_1_8,A_1_9
0,sp|A0A0B4J2F0|PIOS1_HUMAN,52102.095,125050.9000,127950.6700,53948.9740,107912.760,77528.140,108104.0900,52705.0000,131094.0300,...,63600.9530,94610.980,128579.1700,148041.1400,98653.940,114634.030,80361.080,65810.9400,42605.0060,99264.020
1,sp|A0A0U1RRE5|NBDY_HUMAN,360139.220,753984.2000,447030.4900,731422.2300,684069.270,648198.630,438458.6800,660127.2400,687005.1200,...,305968.0600,697089.000,579907.9000,759759.9000,686380.400,196684.860,589261.750,661849.8000,692631.3000,360189.130
2,sp|A0A0U1RRL7|MMPOS_HUMAN,,63778.9800,14527.0000,19672.9600,14687.050,14174.010,15628.9900,13392.9900,39137.9600,...,6454.0020,112448.940,32630.0300,41294.0700,39025.000,22286.996,55373.000,46616.0000,49873.9900,8911.020
3,sp|A0AV96|RBM47_HUMAN,50372.995,,,15318.0700,,30909.950,,,,...,28616.0500,59622.030,,73474.8900,77743.830,,,64175.0400,,37837.990
4,sp|A0AVT1|UBA6_HUMAN,151737.190,190831.2700,118342.0400,181234.0600,124849.950,112345.860,134640.9000,136432.9500,125748.6700,...,93573.8000,224532.940,143076.2900,190958.5400,216224.260,41043.950,175683.830,150981.2200,166945.8400,62551.901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3937,sp|Q8NHL6|LIRB1_HUMAN,,12429.9815,7635.9804,6106.0060,1963.015,2884.028,4716.9780,7672.9984,5668.0048,...,2105.9945,8866.032,16215.0013,3075.0079,10001.974,,6728.997,9001.0738,10192.9934,2910.010
3938,sp|Q8NHM5|KDM2B_HUMAN,12543.996,161836.0900,,20208.0363,,23227.042,32185.9828,104636.0230,17688.9650,...,24475.1019,25891.956,251586.2370,32601.1410,25847.033,,6158.996,61293.0430,46818.0050,5913.012
3939,sp|Q8NHP6|MSPD2_HUMAN,953.005,,,,,,,,,...,,4652.991,,2405.0130,2104.989,,1547.991,,,
3940,sp|Q8NHP8|PLBL2_HUMAN,,,1668.9980,5641.0431,4955.991,,,,526.9987,...,,,3368.0150,,,,,,,
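
The two steps named in the loader cell's comments (setting the identifier column as the index and reordering the sample columns numerically) could be sketched as follows. This is a minimal illustration on toy data: the helper names are hypothetical, the sample names are assumed to follow the `A_<batch>_<number>` pattern seen in the preview, and which column actually holds the protein identifier should be checked against the real file before passing it as `id_column`.

```python
import re

import pandas as pd

SAMPLE_PATTERN = re.compile(r"A_(\d+)_(\d+)")


def sample_sort_key(column_name: str) -> tuple:
    """Turn a name such as 'A_1_10' into (1, 10) so sorting is numeric."""
    match = SAMPLE_PATTERN.fullmatch(column_name)
    return int(match.group(1)), int(match.group(2))


def tidy_abundance(table: pd.DataFrame, id_column: str) -> pd.DataFrame:
    """Index the table by the identifier column and sort sample columns numerically."""
    indexed = table.set_index(id_column)
    # Keep only columns that look like sample names; anything else is dropped.
    sample_columns = [c for c in indexed.columns if SAMPLE_PATTERN.fullmatch(c)]
    return indexed[sorted(sample_columns, key=sample_sort_key)]


# Toy data (values are made up); "X1" as identifier column is an assumption.
toy = pd.DataFrame(
    {
        "X1": ["sp|P00001|AAA_HUMAN", "sp|P00002|BBB_HUMAN"],
        "A_1_10": [1.0, 2.0],
        "A_1_2": [3.0, 4.0],
        "A_1_1": [5.0, 6.0],
    }
)
tidy = tidy_abundance(toy, id_column="X1")
```

Sorting on the parsed integers avoids the lexicographic order visible in the preview, where `A_1_10` comes before `A_1_5`.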


As no additional information is provided about the experimental design or about differences in processing steps among samples, the missing values in the MS data are assumed to be missing completely at random. Under this assumption, the k-nearest neighbors (kNN) algorithm is used to impute the protein data, as it generally gives more accurate estimates for this type of missingness. {cite:p}`zhang2018proteome`
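
As a rough illustration of kNN imputation, the sketch below uses `KNNImputer` from scikit-learn on log-scale toy data. This is an assumed implementation choice, not necessarily the one intended in the notebook, and it does not implement the reference-sample idea mentioned in the next cell. Proteins are treated as observations, so each gap is filled from the proteins with the most similar profiles across samples.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(log_abundance: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Fill missing values using the nearest proteins (rows) as donors."""
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="uniform")
    # Distances ignore missing coordinates (nan-aware Euclidean distance).
    filled = imputer.fit_transform(log_abundance.to_numpy())
    return pd.DataFrame(
        filled, index=log_abundance.index, columns=log_abundance.columns
    )


# Toy log2 abundances with hypothetical protein and sample names.
toy = pd.DataFrame(
    {
        "S1": [10.0, 10.2, 15.0, 15.1],
        "S2": [11.0, np.nan, 16.0, 16.2],
        "S3": [12.0, 12.1, 17.0, 17.3],
    },
    index=["P1", "P2", "P3", "P4"],
)
imputed = knn_impute(toy, n_neighbors=1)
```

In practice it is usually sensible to drop proteins that are missing in most samples before imputing, since there is little information to borrow from for them.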

In [8]:
# use the sample with the most proteins identified as the reference for the kNN imputation method


#### 2 - Quality assessment:
   
##### The protein abundance measurement can often be influenced by technical factors, such as batch effects, operators, total protein concentrations, and freeze-thaw cycles of the cells. Those technical factors could potentially act as confounders for downstream analysis. In the sample annotation table, you will find the technical factors, and you need to evaluate whether they will confound downstream analysis.
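
One possible way to check for technical confounders is to compute the principal components of the normalized, imputed data and ask how much of each component is explained by each technical factor. The sketch below does this with a plain SVD on simulated data. The column names `batch` and `protein_conc` are hypothetical stand-ins for whatever the annotation table actually contains, and the helper names are made up for illustration.

```python
import numpy as np
import pandas as pd


def principal_components(log_abundance: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """PCA scores per sample, computed by SVD on protein-centered data."""
    x = log_abundance.T.to_numpy()          # samples as rows
    x = x - x.mean(axis=0)                  # center each protein
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    names = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=log_abundance.columns, columns=names)


def variance_explained_by_factor(scores: pd.Series, factor: pd.Series) -> float:
    """Share of a PC's variance explained by one factor (r^2 or eta^2)."""
    if pd.api.types.is_numeric_dtype(factor):
        return float(scores.corr(factor) ** 2)
    # Categorical factor: between-group sum of squares over total sum of squares.
    group_means = scores.groupby(factor).transform("mean")
    total = ((scores - scores.mean()) ** 2).sum()
    between = ((group_means - scores.mean()) ** 2).sum()
    return float(between / total)


# Simulated data: 50 proteins, 6 samples, with a strong shift in batch "B".
rng = np.random.default_rng(0)
samples = [f"S{i}" for i in range(1, 7)]
values = rng.normal(size=(50, 6))
values[:, 3:] += 3.0
toy = pd.DataFrame(values, index=[f"P{i}" for i in range(50)], columns=samples)
annotation = pd.DataFrame(
    {"batch": list("AAABBB"), "protein_conc": [1.0, 2.0, 1.5, 1.2, 2.1, 1.4]},
    index=samples,
)

pcs = principal_components(toy)
batch_r2 = variance_explained_by_factor(pcs["PC1"], annotation["batch"])
conc_r2 = variance_explained_by_factor(pcs["PC1"], annotation["protein_conc"])
```

A factor that explains a large share of a leading component is a candidate confounder and would typically be included as a covariate in the downstream model or corrected for beforehand.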


#### 3 - Identify protein markers for prognosis: 

##### In the sample annotation file, you will find three columns that contain the clinical information, which can be used to estimate the overall survival of the CLL patients. You need to select proteins whose expression can be used to predict the overall survival of those patients using a proper statistical model. You may also do an enrichment analysis to see which pathways are potentially related to clinical outcome.
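
A typical model for this kind of question is a Cox proportional hazards regression fitted per protein, followed by multiple-testing correction. The sketch below is a bare-bones NumPy version of a univariate Cox fit (Newton-Raphson on the partial likelihood with Breslow handling of ties, Wald test) plus a Benjamini-Hochberg adjustment, run on simulated data. The function names are hypothetical, and a dedicated survival analysis library would normally be preferable for real analyses.

```python
import math

import numpy as np


def cox_univariate(x, time, event, max_iter=50, tol=1e-8):
    """Fit a one-covariate Cox model; return (log hazard ratio per SD, Wald p-value)."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    x = (x - x.mean()) / x.std()                  # hazard ratio per standard deviation
    order = np.argsort(-time, kind="stable")      # longest follow-up first
    x, time, event = x[order], time[order], event[order]
    # Index of the last sample sharing each time, so tied samples share a risk set.
    last_tied = np.searchsorted(-time, -time, side="right") - 1
    beta = 0.0
    for _ in range(max_iter):
        w = np.exp(beta * x)
        s0 = np.cumsum(w)[last_tied]
        s1 = np.cumsum(w * x)[last_tied]
        s2 = np.cumsum(w * x * x)[last_tied]
        mean_x = s1 / s0
        gradient = np.sum(event * (x - mean_x))
        information = np.sum(event * (s2 / s0 - mean_x ** 2))
        step = gradient / information
        beta += step
        if abs(step) < tol:
            break
    z = beta * math.sqrt(information)             # Wald statistic, beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value
    return float(beta), float(p_value)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    result = np.empty(n)
    result[order] = np.clip(adjusted, 0.0, 1.0)
    return result


# Simulated cohort: higher abundance shortens survival; all events observed.
rng = np.random.default_rng(1)
abundance = rng.normal(size=200)
survival_time = rng.exponential(scale=np.exp(-abundance))
died = np.ones(200, dtype=bool)

beta, p_value = cox_univariate(abundance, survival_time, died)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

For the real data, the same fit would be repeated for every protein, the resulting p-values passed through `benjamini_hochberg`, and any technical factors found to be confounders added as covariates, which requires a multivariable model rather than this univariate sketch.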

### Observation and Conclusion:

#### 1. 

```{bibliography}
:style: unsrt
```