# Prepare data for the detailed analysis of lysine deserts in the human proteome



Apart from the cells in sections 4 and 6, this Jupyter Notebook may be skipped; the authors' preprocessed data will then be used in the subsequent analyses.

# 1. Import libraries

In [1]:
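import requests
import pandas as pd
from ipywidgets import HBox, Label, Text  # widgets for the TMHMM path field in section 5.3
from IPython.display import display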

# 2. Download data


In [3]:
%%bash

mkdir -p data
mkdir -p results
mkdir -p results/TMH_predictions

## 2.1. Download UniProt data

On the UniProt website, download all human sequences with the following columns (**keep the indicated order!**):

- Entry
- Reviewed
- Entry Name
- Protein names
- Gene Names
- Organism
- Length
- Proteomes
- Subcellular location [CC]
- Gene Ontology (biological process)
- Gene Ontology (cellular component)
- Gene Ontology (molecular function)
- Sequence
- Function [CC]
- Tissue specificity
- Pathway
- Involvement in disease
- Natural variant
- Protein families
- Interacts with
- Compositional bias
- 3D
- Protein existence
- Domain [FT]

and save the file in `data` as `Uniprot_raw_human_proteome.tsv`.
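Because the later cells rely on these column names, it may help to check the header of the exported file before going further. The sketch below is only an illustration: `EXPECTED_COLUMNS` restates the list above, and `check_header` is a hypothetical helper that is not part of the original notebook.

```python
import csv
import io

# Column names restated from the list above, in the requested order.
EXPECTED_COLUMNS = [
    "Entry", "Reviewed", "Entry Name", "Protein names", "Gene Names",
    "Organism", "Length", "Proteomes", "Subcellular location [CC]",
    "Gene Ontology (biological process)",
    "Gene Ontology (cellular component)",
    "Gene Ontology (molecular function)", "Sequence", "Function [CC]",
    "Tissue specificity", "Pathway", "Involvement in disease",
    "Natural variant", "Protein families", "Interacts with",
    "Compositional bias", "3D", "Protein existence", "Domain [FT]",
]


def check_header(handle, expected=EXPECTED_COLUMNS):
    """Compare the first line of a TSV file with the expected column list.

    Returns a tuple (missing, unexpected, same_order).
    """
    header = next(csv.reader(handle, delimiter="\t"))
    header = [name.strip() for name in header]
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected, header == list(expected)


# Toy header with two columns swapped and most columns absent.
toy = io.StringIO("Reviewed\tEntry\tSequence\n")
missing, unexpected, same_order = check_header(toy)
print(len(missing), unexpected, same_order)
```

For the real file, the same function could be called on a handle opened with `open('data/Uniprot_raw_human_proteome.tsv')`.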

## 2.2. Add data on disordered regions from MobiDB

Download the dataset of all human intrinsically disordered regions (IDRs) from [MobiDB](https://mobidb.bio.unipd.it) and save it as `data/MobiDB_human_IDRs.tsv`.

In [4]:
response = requests.get("https://mobidb.bio.unipd.it/api/download?ncbi_taxon_id=9606&format=tsv")
response.raise_for_status()  # stop here if the download failed

with open('data/MobiDB_human_IDRs.tsv', 'w') as fd:
    fd.write(response.text)

In [14]:
mdb = pd.read_csv('data/MobiDB_human_IDRs.tsv', sep='\t')

# from the MobiDB dataset, select the pLDDT-based predictions from AlphaFold2
mdb = mdb[mdb['feature'] == 'prediction-plddt-alphafold']
mdb.drop(['feature', 'content_count'], axis=1, inplace=True)

mdb.rename(columns={'acc':'Entry', 
                    'start..end': 'Struct_regions_start..end',
                    'content_fraction': 'Struct_regions_fraction'}, inplace=True)

# Get ranges of IDRs
IDRs = []
IDRs_fraction = []

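for key, value in mdb.iterrows():
    length = value['length']
    IDRs_fraction.append(round((1.0 - float(value['Struct_regions_fraction']))*100, 3))
    struct_ranges = value['Struct_regions_start..end'].split(',')
    # scanning starts at position 0, so an N-terminal IDR is reported from 0 (e.g. '0..3')
    far_left = 0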
    prot_IDRs = []
    
    for i in range(len(struct_ranges)):
        start = int(struct_ranges[i].split('..')[0])
        end = int(struct_ranges[i].split('..')[1])
        min_r = 999999999
        max_r = 0
        for el in range(far_left, end+1):
            if el not in range(start, end+1):
                if el < min_r:
                    min_r = el
                if el > max_r:
                    max_r = el
        
        to_append = f'{min_r}..{max_r}'
        if to_append != '0..0':
            prot_IDRs.append(to_append)
        far_left = end + 1
        
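    # add the C-terminal IDR that follows the last structured region
    end_to_append = f'{end+1}..{length}'
    if end_to_append != f'{length}..{length}' and end+1 <= length:
        prot_IDRs.append(end_to_append)
    else:
        # single trailing position; note that this branch also runs when the last
        # structured region reaches the C-terminus, which appends length+1
        # (e.g. '516' for a 515-residue protein)
        prot_IDRs.append(f'{end+1}')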
        
    IDRs.append(','.join(prot_IDRs))

mdb['IDR_region_start..end'] = IDRs
mdb['IDR_region_fraction'] = IDRs_fraction
mdb['Struct_regions_fraction'] = mdb['Struct_regions_fraction'] *100
mdb.drop('length', axis=1, inplace=True)
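
The IDR ranges computed above are the complement of the structured ranges within each protein. As a compact illustration of that idea, the sketch below derives the gaps arithmetically, using 1-based inclusive coordinates. `complement_ranges` is a hypothetical helper, not the notebook's code: unlike the loop above, it starts at residue 1 and never reports a position beyond the protein length, so its output differs at the termini.

```python
def complement_ranges(struct_ranges, length):
    """Return the regions of 1..length not covered by `struct_ranges`.

    `struct_ranges` is a string such as '4..142,148..227' (1-based, inclusive,
    sorted and non-overlapping); the result uses the same notation.
    """
    gaps = []
    next_free = 1  # first residue not yet assigned to any region
    for item in struct_ranges.split(','):
        start, end = (int(x) for x in item.split('..'))
        if start > next_free:
            gaps.append(f'{next_free}..{start - 1}')
        next_free = end + 1
    if next_free <= length:
        gaps.append(f'{next_free}..{length}')
    return ','.join(gaps)


print(complement_ranges('4..142,148..227,229..329,348..515', 515))
# 1..3,143..147,228..228,330..347
print(complement_ranges('21..280,303..352', 783))
# 1..20,281..302,353..783
```

Single-residue gaps are written as `228..228` here so that every item has the same `start..end` shape, which keeps later parsing uniform.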

## 2.3. Merge the UniProt data with the MobiDB information

In [15]:
human_df = pd.read_csv('data/Uniprot_raw_human_proteome.tsv', sep='\t')
merged = pd.merge(human_df, mdb, on='Entry',how='outer')
merged.dropna(subset = ["Sequence"], inplace=True)

merged.to_csv('data/Uniprot_human_proteome.tsv.gz', sep='\t', compression='gzip', index=False)
merged.head()



Unnamed: 0,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Length,Proteomes,Subcellular location [CC],Gene Ontology (biological process),...,Protein families,Interacts with,Compositional bias,3D,Protein existence,Domain [FT],Struct_regions_start..end,Struct_regions_fraction,IDR_region_start..end,IDR_region_fraction
0,A0A087X1C5,reviewed,CP2D7_HUMAN,Putative cytochrome P450 2D7 (EC 1.14.14.1),CYP2D7,Homo sapiens (Human),515.0,UP000005640: Chromosome 22,SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ...,arachidonic acid metabolic process [GO:0019369...,...,Cytochrome P450 family,,,,Uncertain,,"4..142,148..227,229..329,348..515",94.8,"0..3,143..147,228..228,330..347,516",5.2
1,A0A0B4J2F0,reviewed,PIOS1_HUMAN,Protein PIGBOS1 (PIGB opposite strand protein 1),PIGBOS1,Homo sapiens (Human),54.0,UP000005640: Chromosome 15,SUBCELLULAR LOCATION: Mitochondrion outer memb...,regulation of endoplasmic reticulum unfolded p...,...,,Q96S66,,,Evidence at protein level,,6..53,90.7,"0..5,54",9.3
2,A0A0B4J2F2,reviewed,SIK1B_HUMAN,Putative serine/threonine-protein kinase SIK1B...,SIK1B,Homo sapiens (Human),783.0,UP000005640: Chromosome 21,,cellular response to glucose starvation [GO:00...,...,"Protein kinase superfamily, CAMK Ser/Thr prote...",Q9BRT9; Q68G74; Q14696; P78337; Q9NRG1; P62072,"COMPBIAS 457..477; /note=""Polar residues""; /ev...",,Uncertain,"DOMAIN 27..278; /note=""Protein kinase""; /evide...","21..280,303..352",39.6,"0..20,281..302,353..783",60.4
3,A0A0C5B5G6,reviewed,MOTSC_HUMAN,Mitochondrial-derived peptide MOTS-c (Mitochon...,MT-RNR1,Homo sapiens (Human),16.0,UP000005640: Mitochondrion,SUBCELLULAR LOCATION: Secreted {ECO:0000269|Pu...,activation of protein kinase activity [GO:0032...,...,,,,,Evidence at protein level,,6..15,62.5,"0..5,16",37.5
4,A0A0K2S4Q6,reviewed,CD3CH_HUMAN,Protein CD300H (CD300 antigen-like family memb...,CD300H,Homo sapiens (Human),201.0,UP000005640: Chromosome 17,SUBCELLULAR LOCATION: [Isoform 1]: Membrane {E...,neutrophil chemotaxis [GO:0030593],...,CD300 family,,,,Evidence at protein level,"DOMAIN 25..123; /note=""Ig-like V-type""; /evide...","24..62,81..115,127..139,174..193",55.2,"0..23,63..80,116..126,140..173,194..201",44.8
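
The merge above keeps every `Entry` from either table and then drops the rows that have no `Sequence`. The toy example below is a minimal sketch of that behaviour, with made-up accessions and values; it additionally uses the `indicator=True` option of `pd.merge` to count how many proteins received a MobiDB annotation, which the original cell does not do.

```python
import pandas as pd

# Made-up stand-ins for the UniProt and MobiDB tables.
uniprot = pd.DataFrame({'Entry': ['P1', 'P2'], 'Sequence': ['MKT', 'MAA']})
mobidb = pd.DataFrame({'Entry': ['P1', 'P3'], 'IDR_region_fraction': [5.2, 60.4]})

# Outer merge: P1 is in both tables, P2 only in UniProt, P3 only in MobiDB.
merged = pd.merge(uniprot, mobidb, on='Entry', how='outer', indicator=True)

# Rows that exist only in MobiDB have no sequence and are removed.
merged = merged.dropna(subset=['Sequence'])

counts = merged['_merge'].value_counts()
print(counts['both'], counts['left_only'])
```

When every UniProt row has a sequence, the result contains the same entries as a left join on the UniProt table.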


# 3. Prepare list of human E3 ligases

A list of human E3 ligases with assigned classes (e.g. RING or U-box) was manually prepared based on:
* [Medvar et al.](https://hpcwebapps.cit.nih.gov/ESBL/Database/E3-ligases/)
* [UbiNet 2.0](https://awi.cuhk.edu.cn/~ubinet/index.php)
* several other publications

It can be found in `data/E3_list.tsv`.

# 4. Prepare list of human housekeeping genes

The list of human housekeeping genes was obtained from [tau.ac.il](https://www.tau.ac.il/~elieis/HKG/HK_genes.txt).


In [None]:
%%bash

mkdir -p data/
wget https://www.tau.ac.il/~elieis/HKG/HK_genes.txt -P data/

# 5. Predict number of transmembrane helices (TMH)


Predict the number of transmembrane helices (TMH) for each human protein sequence using the standalone version of the [TMHMM-2.0](https://services.healthtech.dtu.dk/service.php?TMHMM-2.0) software.

### 5.1. Prepare fasta file input for the TMHMM-2.0 software

Create a FASTA file with all sequences so that TMHMM-2.0 can predict the TMH for each of them.

Save all human sequences to one FASTA file, `results/TMH_predictions/human_seq_for_TMH.fasta`.

In [7]:
df = pd.read_csv('data/Uniprot_human_proteome.tsv.gz', sep='\t')
uniprots = df['Entry'].tolist()
sequences = df['Sequence'].tolist()

# write one FASTA record (header line + sequence line) per protein
with open('results/TMH_predictions/human_seq_for_TMH.fasta', 'w') as f:
    for uniprot, sequence in zip(uniprots, sequences):
        f.write(f'>{uniprot}\n{sequence}\n')
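
As a small, optional safeguard, the FASTA step can also skip entries without a sequence and report how many records were written. The sketch below shows one way to do this with hypothetical helpers (`write_fasta`, `count_fasta_records`) and toy data; it writes to an in-memory buffer rather than to the file used above.

```python
import io


def write_fasta(records, handle):
    """Write (identifier, sequence) pairs to an open text handle in FASTA format.

    Entries whose sequence is empty or not a string (e.g. NaN) are skipped.
    Returns the number of records written.
    """
    written = 0
    for identifier, sequence in records:
        if not isinstance(sequence, str) or not sequence:
            continue
        handle.write(f'>{identifier}\n{sequence}\n')
        written += 1
    return written


def count_fasta_records(text):
    """Count header lines (those starting with '>') in FASTA-formatted text."""
    return sum(1 for line in text.splitlines() if line.startswith('>'))


# Toy records; the second one has a missing sequence.
buffer = io.StringIO()
n_written = write_fasta([('P1', 'MKTAY'), ('P2', float('nan')), ('P3', 'MAA')], buffer)
print(n_written, count_fasta_records(buffer.getvalue()))
```

Comparing the number of written records with the number of rows in the table is a quick way to notice missing sequences before running TMHMM-2.0.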

### 5.2. Download TMHMM-2.0

Download the TMHMM-2.0 standalone version [here](https://services.healthtech.dtu.dk/software.php).


### 5.3. Run TMHMM-2.0

#### Provide path to TMHMM

Enter the full path in the field below, e.g. `/Users/ns/Install/tmhmm-2.0c.Linux/tmhmm-2.0c/bin/tmhmm`.

**Do not hit ENTER after writing it in the field below.**

In [None]:
db = HBox([Label('Full path to TMHMM-2.0 software:'), Text()])
display(db)

In [None]:
tmhmm_path = db.children[1].value  # read the TMHMM-2.0 path from the Text widget
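
Before running the tool, it can be useful to confirm that the entered path really points to an executable file. The following sketch uses only the standard library; `is_executable_file` is a hypothetical helper that is not part of the original notebook, and the paths in the example are placeholders.

```python
import os
import sys


def is_executable_file(path):
    """Return True if `path` points to an existing file that can be executed."""
    path = os.path.expanduser(path.strip())
    return os.path.isfile(path) and os.access(path, os.X_OK)


# The running Python interpreter serves as an example of an executable file.
print(is_executable_file(sys.executable))
# A path that does not exist is rejected.
print(is_executable_file('/no/such/dir/tmhmm'))
```

In the notebook, the same check could be applied to `tmhmm_path` right after it is read from the widget.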

#### Run TMHMM-2.0

Run TMHMM-2.0 on `results/TMH_predictions/human_seq_for_TMH.fasta`. Use the short output format and save the result to `results/TMH_predictions/human_seq_for_TMH.tsv`, which is then compressed with gzip.

In [18]:
%%bash -s "$tmhmm_path"

# -short requests the one-line-per-protein output that section 5.4 parses
"$1" -short results/TMH_predictions/human_seq_for_TMH.fasta > results/TMH_predictions/human_seq_for_TMH.tsv
gzip -9 results/TMH_predictions/human_seq_for_TMH.tsv

#### Delete the unnecessary file

Delete the FASTA file with the concatenated human sequences.

In [17]:
%%bash
rm results/TMH_predictions/human_seq_for_TMH.fasta

### 5.4. Assign predicted number of TMH for each human protein

In [17]:
tmh = pd.read_csv('results/TMH_predictions/human_seq_for_TMH.tsv.gz', sep='\t', usecols=[0,4], names=['Entry', 'tmp'])

tmh['NumTMH'] = [el.split('=')[1] for el in tmh['tmp'].tolist()]
tmh = tmh.drop('tmp', axis=1)
tmh.head()

Unnamed: 0,Entry,NumTMH
0,A0A087X1C5,2
1,A0A0B4J2F0,1
2,A0A0B4J2F2,0
3,A0A0C5B5G6,0
4,A0A0K2S4Q6,1
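
The cell above reads the first and fifth tab-separated fields and keeps the part after `=`. The sketch below shows the same idea without pandas, one line at a time. The field names in the example line (`len`, `ExpAA`, `First60`, `PredHel`, `Topology`) are an assumption about the short output layout, the values are made up, and `parse_tmhmm_short_line` is a hypothetical helper.

```python
def parse_tmhmm_short_line(line):
    """Parse one tab-separated line of key=value fields into a dictionary.

    The first field is taken as the sequence identifier; every following
    field is expected to look like 'key=value'.
    """
    fields = line.rstrip('\n').split('\t')
    record = {'Entry': fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition('=')
        record[key] = value
    return record


# Made-up example line; the values are illustrative, not real predictions.
example = 'P12345\tlen=120\tExpAA=22.5\tFirst60=0.1\tPredHel=1\tTopology=i5-27o'
record = parse_tmhmm_short_line(example)
print(record['Entry'], int(record['PredHel']))
```

Parsing by key rather than by column position makes the step less sensitive to changes in the field order.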


In [18]:
df = pd.read_csv('data/Uniprot_human_proteome.tsv.gz', sep='\t')
merged = pd.merge(df, tmh, on='Entry',how='outer')
merged.to_csv('data/Uniprot_human_proteome.tsv.gz', sep='\t', compression='gzip', index=False)
merged.head()



Unnamed: 0,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Length,Proteomes,Subcellular location [CC],Gene Ontology (biological process),...,Interacts with,Compositional bias,3D,Protein existence,Domain [FT],Struct_regions_start..end,Struct_regions_fraction,IDR_region_start..end,IDR_region_fraction,NumTMH
0,A0A087X1C5,reviewed,CP2D7_HUMAN,Putative cytochrome P450 2D7 (EC 1.14.14.1),CYP2D7,Homo sapiens (Human),515.0,UP000005640: Chromosome 22,SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ...,arachidonic acid metabolic process [GO:0019369...,...,,,,Uncertain,,"4..142,148..227,229..329,348..515",94.8,"0..3,143..147,228..228,330..347,516",5.2,2
1,A0A0B4J2F0,reviewed,PIOS1_HUMAN,Protein PIGBOS1 (PIGB opposite strand protein 1),PIGBOS1,Homo sapiens (Human),54.0,UP000005640: Chromosome 15,SUBCELLULAR LOCATION: Mitochondrion outer memb...,regulation of endoplasmic reticulum unfolded p...,...,Q96S66,,,Evidence at protein level,,6..53,90.7,"0..5,54",9.3,1
2,A0A0B4J2F2,reviewed,SIK1B_HUMAN,Putative serine/threonine-protein kinase SIK1B...,SIK1B,Homo sapiens (Human),783.0,UP000005640: Chromosome 21,,cellular response to glucose starvation [GO:00...,...,Q9BRT9; Q68G74; Q14696; P78337; Q9NRG1; P62072,"COMPBIAS 457..477; /note=""Polar residues""; /ev...",,Uncertain,"DOMAIN 27..278; /note=""Protein kinase""; /evide...","21..280,303..352",39.6,"0..20,281..302,353..783",60.4,0
3,A0A0C5B5G6,reviewed,MOTSC_HUMAN,Mitochondrial-derived peptide MOTS-c (Mitochon...,MT-RNR1,Homo sapiens (Human),16.0,UP000005640: Mitochondrion,SUBCELLULAR LOCATION: Secreted {ECO:0000269|Pu...,activation of protein kinase activity [GO:0032...,...,,,,Evidence at protein level,,6..15,62.5,"0..5,16",37.5,0
4,A0A0K2S4Q6,reviewed,CD3CH_HUMAN,Protein CD300H (CD300 antigen-like family memb...,CD300H,Homo sapiens (Human),201.0,UP000005640: Chromosome 17,SUBCELLULAR LOCATION: [Isoform 1]: Membrane {E...,neutrophil chemotaxis [GO:0030593],...,,,,Evidence at protein level,"DOMAIN 25..123; /note=""Ig-like V-type""; /evide...","24..62,81..115,127..139,174..193",55.2,"0..23,63..80,116..126,140..173,194..201",44.8,1


# 6. Download half-life datasets

Download the protein half-life dataset published by Mathieson et al. (2018): [Supplementary Data 2](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-03106-1/MediaObjects/41467_2018_3106_MOESM5_ESM.xlsx)

Download the protein half-life dataset published by Li et al. (2021): [Supplementary Table 2](https://ars.els-cdn.com/content/image/1-s2.0-S1097276521007498-mmc3.xlsx)

In [None]:
%%bash

mkdir -p data/half_life
wget -O data/half_life/Mathieson_half_life_dataset.xlsx https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-03106-1/MediaObjects/41467_2018_3106_MOESM5_ESM.xlsx
wget -O data/half_life/Li_half_life_dataset.xlsx https://ars.els-cdn.com/content/image/1-s2.0-S1097276521007498-mmc3.xlsx