## About

In light of the success of the ProTargetMiner study (Saei et al., ProTargetMiner as a proteome signature library of anticancer molecules for functional discovery), this project will conduct further data mining on the same dataset, extended with the proteome of dying cells. The overall goal is to analyze the proteomes of living and dying cells to understand cell death: specifically, to find proteome signatures of cell death that hold regardless of treatment or cell line, and to identify different apoptotic processes in the cell. The dataset consists of living and dying cells from three cancer cell lines (A549, RKO and MCF-7). Apoptosis was induced by nine different cancer treatments (8-azaguanine, Raltitrexed, Topotecan, Floxuridine, Nutlin, Dasatinib, Gefitinib, Vincristine and Bortezomib). Together with an untreated control, this gives 60 different conditions (3 cell lines × 2 states × 10 treatments), each in 3 replicates.

For this project there are a couple of things we would like to investigate:

1) The number of apoptotic pathways. It has been suggested in [Geen et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4665079/) that at least two major signalling pathways trigger apoptotic cell death, but we would like to investigate exactly how many there are.
2) Proteome signature for surviving and dying cells regardless of treatment or cell line. To do this we would need to investigate target regulation in dying vs. surviving cells for each treatment and cell line as well.   

## Goals
- Count the number of up- and downregulated proteins/peptides for the life and death states.
- Check if there are proteins which are always up- or downregulated regardless of drug treatment.
- Check the correlations between protein abundances.
- Measure the variance and covariance of proteins.
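
The goals above could be approached roughly as follows. This is a minimal sketch on made-up numbers, assuming the melted layout used later in this log (`Leading_razor_protein`, `Treatment`, `State`, `Reporter_intensity_corrected` holding log2 fold changes). The cutoff of 1.0 and all function names are assumptions for illustration, not part of the project code.

```python
import pandas as pd

# Toy data in the melted layout; the log2 fold changes are made up.
toy = pd.DataFrame({
    "Leading_razor_protein": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
    "Treatment": [1, 2, 1, 2, 1, 2, 1, 2],
    "State": ["S", "S", "D", "D", "S", "S", "D", "D"],
    "Reporter_intensity_corrected": [1.5, 2.0, -1.2, -1.8, 0.1, -0.2, 1.1, 1.4],
})

CUTOFF = 1.0  # assumed |log2FC| cutoff for calling a protein regulated


def regulation_counts(df, cutoff=CUTOFF):
    """Count up- and downregulated measurements per state."""
    value = df["Reporter_intensity_corrected"]
    labelled = df.assign(up=value > cutoff, down=value < -cutoff)
    return labelled.groupby("State")[["up", "down"]].sum()


def always_regulated(df, cutoff=CUTOFF):
    """Flag proteins beyond the cutoff in every treatment of a state."""
    grouped = df.groupby(["State", "Leading_razor_protein"])[
        "Reporter_intensity_corrected"
    ]
    return pd.DataFrame({
        "always_up": grouped.min() > cutoff,
        "always_down": grouped.max() < -cutoff,
    })


def protein_matrix(df):
    """Wide matrix: one row per (State, Treatment), one column per protein."""
    return df.pivot_table(
        index=["State", "Treatment"],
        columns="Leading_razor_protein",
        values="Reporter_intensity_corrected",
    )


counts = regulation_counts(toy)
flags = always_regulated(toy)
wide = protein_matrix(toy)
correlation = wide.corr()  # pairwise Pearson correlation between proteins
covariance = wide.cov()    # diagonal holds the per-protein variance
```

On real data the wide matrix will contain missing values; `corr` and `cov` then use pairwise-complete observations, which may or may not be what the analysis needs.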



The project is related to:

[ProTargetMiner as a proteome signature library of anticancer molecules for functional discovery](https://www.nature.com/articles/s41467-019-13582-8)

[Comparative Proteomics of Dying and Surviving Cancer Cells Improves the Identification of Drug Targets and Sheds Light on Cell Life / Death Decisions](https://pubmed.ncbi.nlm.nih.gov/29572246/)




## Problem

The apoptotic process is related to many human diseases, which may arise when cells that should live die, or when cells that should die survive. Modulation of apoptotic processes may therefore offer valuable methods of treatment ([Renehan et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1120576/)), which makes understanding the apoptotic process very important. A dataset used in [Saei et al.](https://www.nature.com/articles/s41467-019-13582-8) proved valuable for uncovering protein signatures relating to drug compound targets and mechanisms of action. Therefore an extension of this dataset with



## 2020-10-21 Wednesday
### 17:34 Some thoughts when processing proteins with my GO-id function.
I am using the Leading_razor_protein as my search protein. The GO search fails for entries with the following formatting:

- P51608-2
- Q9Y5Q8-3
- CON__Q2KJC7
- A0A0C4DGS5
- CON__ENSEMBL:ENSBTAP00000007350
- D3DTX6

But when I test P51608-2 --> P51608, I find that I can get a GO match. I guess this has to do with the suffix "-2". I also notice that often when P51608-2 is the Leading_razor_protein, the protein column also contains P51608; P51608-2; P51608-3, etc. I will investigate what the suffixes and errors are later.
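
One possible workaround is to normalize the identifier before the GO search. The sketch below assumes that a trailing `-<number>` is an isoform suffix that can be dropped and that the `CON__` prefix marks contaminant entries; both are assumptions that still need to be checked against the data. The function name is hypothetical.

```python
import re

ISOFORM_SUFFIX = re.compile(r"-\d+$")  # assumed isoform suffix, e.g. "-2"
CONTAMINANT_PREFIX = "CON__"           # assumed contaminant marker


def to_canonical_accession(protein_id):
    """Return (accession, is_contaminant) for a Leading_razor_protein entry."""
    is_contaminant = protein_id.startswith(CONTAMINANT_PREFIX)
    if is_contaminant:
        # Drop the prefix but remember that the entry was flagged.
        protein_id = protein_id[len(CONTAMINANT_PREFIX):]
    # Strip a trailing "-<digits>" so "P51608-2" becomes "P51608".
    return ISOFORM_SUFFIX.sub("", protein_id), is_contaminant


for entry in ["P51608-2", "Q9Y5Q8-3", "CON__Q2KJC7", "A0A0C4DGS5"]:
    print(entry, "->", to_canonical_accession(entry))
```

Entries such as A0A0C4DGS5 and D3DTX6 pass through unchanged, so if they still give no GO match the cause lies elsewhere.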



### 15:48 Data description
Just finished coding a function to infer GO from protein names with UniProt, but there are some search problems. But first, let's describe the data.

There are four files:

- peptides tryptic.txt
- peptides.txt
- proteinGroups tryptic.txt
- proteinGroups.txt

Two searches were performed: one tryptic (I guess the tryptic files are the tryptic searches) and one semi-tryptic. I've been told this is done because proteases become active during cell death and degrade some proteins.

The data code is:


Reporter intensity corrected 0 A549_S_Rep1

0 is the TMT tag code. It runs from 0 to 9: tag 0 is the control and tags 1-9 are the nine drugs, in the following order:

- 0: Control
- 1: 8-azaguanine
- 2: Raltitrexed
- 3: Topotecan
- 4: Floxuridine
- 5: Nutlin
- 6: Dasatinib
- 7: Gefitinib
- 8: Vincristine
- 9: Bortezomib

A549, RKO and MCF-7 are the cell lines. 

S stands for "Surviving" meaning attached cells after treatment, and D stands for "Dying", those cells that detached from the plate after treatment with the anticancer drugs.
Rep1-3 are the replicate numbers, all the experiments are in 3 replicates. 
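
The column naming scheme described above can be decoded with a regular expression. This is a minimal sketch that assumes every intensity column follows exactly the pattern of the example `Reporter intensity corrected 0 A549_S_Rep1`; the function name and the returned keys are chosen for illustration.

```python
import re

# Treatment names indexed by TMT tag code (0 = control).
TREATMENTS = [
    "Control", "8-azaguanine", "Raltitrexed", "Topotecan", "Floxuridine",
    "Nutlin", "Dasatinib", "Gefitinib", "Vincristine", "Bortezomib",
]

COLUMN_PATTERN = re.compile(
    r"^Reporter intensity corrected (?P<tag>\d) "
    r"(?P<cell_line>[^_]+)_(?P<state>[SD])_Rep(?P<replicate>[1-3])$"
)


def parse_column(name):
    """Split a reporter column name into its parts, or return None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None  # not a reporter intensity column
    tag = int(match["tag"])
    return {
        "Tag": tag,
        "Treatment": TREATMENTS[tag],
        "Cell_line": match["cell_line"],
        "State": match["state"],
        "Replicate": "Rep" + match["replicate"],
    }


print(parse_column("Reporter intensity corrected 0 A549_S_Rep1"))
```

Returning `None` for non-matching names makes it easy to skip the many non-intensity columns in peptides.txt.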

The surviving cell data comes from this paper ([ProTargetMiner as a proteome signature library of anticancer molecules for functional discovery](https://www.nature.com/articles/s41467-019-13582-8)). The dying cell data is new. 

The head of the unprocessed data is shown in the table below:

In [2]:
import pandas as pd
import numpy as np

data_loc = "~/git/lifeAndDeath/data/knitr/"

peptides = pd.read_csv(data_loc + "peptides_knitr.txt", sep = "\t")

In [3]:
peptides

| Unnamed: 0 | Sequence | N-term cleavage window | C-term cleavage window | Amino acid before | First amino acid | Second amino acid | Second last amino acid | Last amino acid | Amino acid after | A Count | ... | Reverse | Potential contaminant | id | Protein group IDs | Mod. peptide IDs | Evidence IDs | MS/MS IDs | Best MS/MS | Oxidation (M) site IDs | MS/MS Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAAAAAAAAA | LVSENAGRAAAAAAAA | AAAAAAAAAAAAAGAG | R | A | A | A | A | A | 10 | ... |  |  | 0 | 4743;2687;1346;7153;8131 | 0 | 0 | 0 | 0 |  | 1 |
| 1 | AAAAAAAAAAA | LVSENAGRAAAAAAAA | AAAAAAAAAAAAGAGA | R | A | A | A | A | A | 11 | ... |  |  | 1 | 4743;2687;1346;8131 | 1 | 1;2;3 | 1 | 1 |  | 1 |
| 2 | AAAAAAAAAAAAAAAGAGAGAK | LVSENAGRAAAAAAAA | AGAGAGAKQTPADGEA | R | A | A | A | K | Q | 18 | ... |  |  | 2 | 4743 | 2 | 4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;2... | 2;3;4;5;6;7;8;9;10;11;12;13;14;15;16 | 13 |  | 15 |
| 3 | AAAAAAAAAAAAATATAGPR | RLLQGAAAAAAAAAAA | ATATAGPRGEAPPPPP | A | A | A | P | R | G | 15 | ... |  |  | 3 | 8875 | 3 | 29;30;31;32 | 17;18;19 | 19 |  | 3 |
| 4 | AAAAAAAAAAAK | LFAPASAAAAAAAAAA | AAAAAAAKGALEGAAG | A | A | A | A | K | G | 11 | ... |  |  | 4 | 712;11411 | 4 | 33;34 | 20 | 20 |  | 1 |
| 5 | AAAAAAAAAAGAAGGR | `________________` | AAGAAGGRGSGPGRRR | M | A | A | G | R | G | 12 | ... |  |  | 5 | 7153 | 5 | 35;36;37;38;39;40;41;42;43;44;45;46;47;48;49;5... | 21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;3... | 26 |  | 40 |
| 6 | AAAAAAAAAAGAAGGRGSGPGR | `________________` | GRGSGPGRRRHLVPGA | M | A | A | G | R | R | 12 | ... |  |  | 6 | 7153 | 6 | 82;83;84 | 61;62;63 | 63 |  | 3 |
| 7 | AAAAAAAAAAGEGAR | `________________` | AAAGEGARSPSPAAVS | A | A | A | A | R | S | 11 | ... |  |  | 7 | 7012 | 7 | 85 | 64 | 64 |  | 1 |
| 8 | AAAAAAAAAGEGAR | `________________` | AAAGEGARSPSPAAVS | A | A | A | A | R | S | 10 | ... |  |  | 8 | 7012 | 8 | 86;87 | 65 | 65 |  | 1 |


This data is converted to seaborn-melted (long) format, and the values are log2FC-transformed using log2FC(A, B), where A is the untreated sample and B is the treated sample for each {cell_line}_{state}_{replicate}. The data is thresholded with the following parameters:

- PEP < 0.01
- Missed_cleaveges < 2
- Reporter_intensity_count > 0
- Reporter_intensity_corrected not in [-inf, inf],

where I suspect that the -inf and inf values are caused by log(0) values.
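
To make the transformation and the thresholds concrete, here is a minimal sketch on made-up numbers. It is not the code in /scripts/preprocessor.py: the direction log2(B / A), the helper names, and the toy column names `tag0` and `tag1` are assumptions. It also shows how zero intensities produce -inf and inf.

```python
import numpy as np
import pandas as pd


def log2_fold_change(intensities, control_column):
    """log2(column / control) per row, for every column in `intensities`."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = intensities.div(intensities[control_column], axis=0)
        return np.log2(ratios)


def passes_threshold(df):
    """Keep rows that satisfy the four thresholds listed above."""
    # np.isfinite rejects -inf and inf (and also NaN).
    finite = np.isfinite(df["Reporter_intensity_corrected"])
    keep = (
        (df["PEP"] < 0.01)
        & (df["Missed_cleaveges"] < 2)
        & (df["Reporter_intensity_count"] > 0)
        & finite
    )
    return df[keep]


# Raw intensities: tag0 is the untreated control, tag1 a treated sample.
raw = pd.DataFrame({"tag0": [100.0, 50.0, 0.0], "tag1": [200.0, 0.0, 10.0]})
fc = log2_fold_change(raw, "tag0")
# Row 1: treated intensity 0 gives log2(0) = -inf.
# Row 2: control intensity 0 gives 10 / 0 = inf, and log2(inf) = inf.

melted_toy = pd.DataFrame({
    "PEP": [0.001, 0.5, 0.001],
    "Missed_cleaveges": [0, 0, 1],
    "Reporter_intensity_count": [2, 2, 1],
    "Reporter_intensity_corrected": [1.0, 0.3, -np.inf],
})
kept = passes_threshold(melted_toy)  # only the first row survives
```

Under this assumed direction the control column always maps to 0.0, which matches the `Treatment` 0 rows in the converted data.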

The function "log2FC_data_peptide" in /scripts/preprocessor.py (script needs cleanup) computes the log2FC. /scripts/preprocessor.py also contains the seaborn-melted format converter code (not yet wrapped in a function).

The converted data is shown below:

In [5]:
melted = pd.read_csv(data_loc + "melted_treshold_knitr.csv", sep = "\t")


In [6]:
melted

| Unnamed: 0 | Leading_razor_protein | Proteins | Reporter_intensity_count | Cell_line | Treatment | State | Replicate | Reporter_intensity_corrected |
|---|---|---|---|---|---|---|---|---|
| 0 | Q86U42 | Q86U42;Q86U42-2 | 2 | A549 | 0 | D | Rep1 | 0.0 |
| 1 | Q9Y4H2 | Q9Y4H2 | 1 | A549 | 0 | D | Rep1 | 0.0 |
| 2 | O75822 | O75822;O75822-2;O75822-3 | 1 | A549 | 0 | D | Rep1 | 0.0 |
| 3 | O75822 | O75822;O75822-2;O75822-3 | 3 | A549 | 0 | D | Rep1 | 0.0 |
| 4 | P36578 | P36578;H3BM89;H3BU31 | 7 | A549 | 0 | D | Rep1 | 0.0 |
| 5 | P51608-2 | P51608-2;B5MCB4 | 1 | A549 | 0 | D | Rep1 | 0.0 |
| 6 | Q96P70 | Q96P70 | 2 | A549 | 0 | D | Rep1 | 0.0 |
| 7 | P28482 | P28482;P28482-2 | 2 | A549 | 0 | D | Rep1 | 0.0 |
| 8 | P09417 | P09417;P09417-2;B7Z415;D6RGG7;D6RHJ7 | 1 | A549 | 0 | D | Rep1 | 0.0 |
| 9 | Q01167 | Q01167;Q01167-2;Q01167-3 | 1 | A549 | 0 | D | Rep1 | 0.0 |



### 14:57 Protein Classification. 
Checked out Reactome, but honestly I have no idea how to use it. Web scraping their uniprot2reactome.txt URLs will take too much time. What did my PI mean by looking at Reactome for protein classification?

I got some information from my collaborator, A, that I should check out StringDB, Cytoscape and CORUM for classifying proteins by their functions.

The previous entry on curl on UniProt was unnecessary, since the UniProt tab-separated files can include GO in the columns.
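
Reading GO terms from such a tab-separated file could look like the sketch below. The column names `Entry` and `Gene Ontology IDs`, the `;` separator inside the GO column, and the accessions and GO IDs in the stand-in file are all assumptions or placeholders; they should be replaced by whatever the downloaded file actually contains.

```python
import io

import pandas as pd

# Stand-in for a downloaded UniProt tab-separated file (placeholder values).
EXAMPLE_TSV = (
    "Entry\tGene Ontology IDs\n"
    "ACC1\tGO:0000001; GO:0000002\n"
    "ACC2\t\n"
)


def go_lookup(tsv_handle, id_column="Entry", go_column="Gene Ontology IDs"):
    """Map each accession to its list of GO IDs (empty list if none)."""
    table = pd.read_csv(tsv_handle, sep="\t", dtype=str)
    go_strings = table[go_column].fillna("")
    return {
        accession: [term.strip() for term in terms.split(";") if term.strip()]
        for accession, terms in zip(table[id_column], go_strings)
    }


lookup = go_lookup(io.StringIO(EXAMPLE_TSV))
print(lookup)
```

A dictionary lookup like this replaces one web request per protein with a single file read.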

ToDo:
- Check Reactome more thoroughly,
- Check StringDB, Cytoscape and CORUM

## 2020-10-20 Tuesday
### 12:02 - curl on Uniprot.

I might be able to speed up the collection by using bash curl and multiprocessing to collect data from UniProt GO. This should be much quicker than Python requests.

This means I will have to learn:
- multiprocessing on bash.
- curl to get uniprot info.
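
For comparison, the same idea can be sketched in Python instead of bash and curl, using a thread pool from the standard library. The function names are hypothetical, and `fake_fetch` is an offline stand-in for the real per-protein request, so nothing here contacts UniProt.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(accessions, fetch_one, max_workers=8):
    """Run `fetch_one` on every unique accession using parallel threads."""
    # Deduplicate first: the same protein appears in many rows of the data.
    unique = sorted(set(accessions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_one, unique)  # keeps the input order
        return dict(zip(unique, results))


def fake_fetch(accession):
    """Offline stand-in for a real UniProt request (hypothetical)."""
    return f"record for {accession}"


records = fetch_all(["P51608", "Q86U42", "P51608"], fake_fetch)
print(records)
```

Threads are usually enough here because the work is waiting on the network rather than computing, and deduplicating the accessions may save more time than the parallelism itself.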

## 2020-10-18 Sunday
I've applied for resource allocation at the Swedish National Infrastructure for Computing (SNIC), and I'm waiting for the SNIC proposal to go through so I can run my gene ontology (GO) protein classification for the lifeAndDeath project at SNIC with multicore processing. Using my 8 local cores, it would theoretically take 10 days to get the GO protein classifications.

Setting up a Rmarkdown log for this project. 