# Analysis of protein abundance in _E. coli_

This notebook demonstrates the `pybenford` module on protein abundance data. The data come from [PRIDE](https://www.ebi.ac.uk/pride/), project PXD024151: https://www.ebi.ac.uk/pride/archive/projects/PXD024151  
The project analyzes the proteome of _Escherichia coli_ during sudden carbon starvation.
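
For context, Benford's law predicts that the leading digit d of values spanning several orders of magnitude occurs with probability log10(1 + 1/d). The sketch below computes those expected frequencies with the standard library only. The function name `benford_first_digit_probs` is made up for this illustration and is not part of the `pybenford` API.

```python
import math


def benford_first_digit_probs():
    """Expected probability of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.3f}")
```

These are the reference frequencies against which an observed first-digit distribution, such as that of protein abundances, would typically be compared.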

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os, sys

import pybenford as benford

## Loading the dataset

In [3]:
if not os.path.exists("e033784_Proteins.xlsx"):
    !wget https://www.ebi.ac.uk/pride/data/archive/2021/03/PXD024151/e033784_Proteins.xlsx

--2021-06-21 09:38:46--  https://www.ebi.ac.uk/pride/data/archive/2021/03/PXD024151/e033784_Proteins.xlsx
Resolving www.ebi.ac.uk (www.ebi.ac.uk)... 193.62.193.80
Connecting to www.ebi.ac.uk (www.ebi.ac.uk)|193.62.193.80|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://ftp.pride.ebi.ac.uk/pride/data/archive/2021/03/PXD024151/e033784_Proteins.xlsx [following]
--2021-06-21 09:38:46--  http://ftp.pride.ebi.ac.uk/pride/data/archive/2021/03/PXD024151/e033784_Proteins.xlsx
Resolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 511702 (500K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: 'e033784_Proteins.xlsx'


2021-06-21 09:38:47 (1.55 MB/s) - 'e033784_Proteins.xlsx' saved [511702/511702]
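
The `!wget` call above relies on IPython shell syntax and on `wget` being installed. Where that is not the case, the same download-if-missing step can be sketched in plain Python with `urllib.request`. The helper name `download_if_missing` is an assumption introduced here, and the actual download call is left commented out because it needs network access.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download `url` to `path` unless `path` already exists; return `path`."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# To fetch the dataset used in this notebook (requires network access):
# download_if_missing(
#     "https://www.ebi.ac.uk/pride/data/archive/2021/03/PXD024151/e033784_Proteins.xlsx",
#     "e033784_Proteins.xlsx",
# )
```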



## Exploring the dataset

In [4]:
data = pd.read_excel('e033784_Proteins.xlsx')
data.shape

(2259, 31)

In [5]:
data.head()

,Checked,Protein FDR Confidence Combined,Master,Accession,Description,Exp q-value Combined,Sum PEP Score,Coverage in Percent,Number of Peptides,Number of PSMs,...,Abundances Normalized F4 Sample,Abundance F1 Sample,Abundance F2 Sample,Abundance F3 Sample,Abundance F4 Sample,Found in Sample in S1 F1 Sample,Found in Sample in S2 F2 Sample,Found in Sample in S3 F3 Sample,Found in Sample in S4 F4 Sample,Modifications
0,False,High,IsMasterProtein,P0A8V2,DNA-directed RNA polymerase subunit beta OS=Es...,0.0,669.114,72,95,794,...,2233296000.0,2107560000.0,1126121000.0,854211000.0,276510000.0,High,High,High,High,
1,False,High,IsMasterProtein,P0A8T7,DNA-directed RNA polymerase subunit beta' OS=E...,0.0,596.904,68,91,841,...,1950036000.0,2005228000.0,1126856000.0,878170600.0,241438900.0,High,High,High,High,
2,False,High,IsMasterProtein,P0A6F5,60 kDa chaperonin OS=Escherichia coli (strain ...,0.0,503.063,93,53,1435,...,5664362000.0,4516081000.0,2140958000.0,1842317000.0,701319000.0,High,High,High,High,
3,False,High,IsMasterProtein,P0AFG8,Pyruvate dehydrogenase E1 component OS=Escheri...,0.0,450.292,72,75,758,...,3326486000.0,3315792000.0,1710860000.0,1433563000.0,411860700.0,High,High,High,High,
4,False,High,IsMasterProtein,P0A705,Translation initiation factor IF-2 OS=Escheric...,0.0,424.883,73,74,429,...,968516400.0,1348475000.0,627879800.0,403777900.0,119914500.0,High,High,High,High,


In [6]:
data.columns

Index(['Checked', 'Protein FDR Confidence Combined', 'Master', 'Accession',
       'Description', 'Exp q-value Combined', 'Sum PEP Score',
       'Coverage in Percent', 'Number of Peptides', 'Number of PSMs',
       'Number of Unique Peptides', 'Number of Protein Groups',
       'Number of AAs', 'MW in kDa', 'calc pI', 'Score Sequest HT Sequest HT',
       'Number of Peptides by Search Engine Sequest HT',
       'Number of Razor Peptides', 'Abundances Normalized F1 Sample',
       'Abundances Normalized F2 Sample', 'Abundances Normalized F3 Sample',
       'Abundances Normalized F4 Sample', 'Abundance F1 Sample',
       'Abundance F2 Sample', 'Abundance F3 Sample', 'Abundance F4 Sample',
       'Found in Sample in S1 F1 Sample', 'Found in Sample in S2 F2 Sample',
       'Found in Sample in S3 F3 Sample', 'Found in Sample in S4 F4 Sample',
       'Modifications'],
      dtype='object')
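
The column listing shows two groups of abundance columns: raw (`Abundance F1 Sample` to `Abundance F4 Sample`) and normalized (`Abundances Normalized F…`). As a rough sketch of how leading digits could be pulled from one of them before any Benford comparison, the code below picks the raw columns by their name prefix and extracts the first significant digit of each value. The helper `leading_digits` is hypothetical and not a `pybenford` function, and the small `sample` frame holds only the five `Abundance F1 Sample` values visible in the `data.head()` output, standing in for the full `data` frame.

```python
import pandas as pd


def leading_digits(values):
    """Return the first significant digit of each positive, non-missing value."""
    values = pd.Series(values, dtype="float64").dropna()
    values = values[values > 0]
    # In scientific notation the first character is the first significant digit,
    # e.g. 2107560000.0 -> "2.1075600000000000e+09".
    return values.map(lambda x: int(f"{x:.16e}"[0]))


# Stand-in for `data`: the five 'Abundance F1 Sample' values shown by data.head().
sample = pd.DataFrame(
    {
        "Abundance F1 Sample": [
            2107560000.0,
            2005228000.0,
            4516081000.0,
            3315792000.0,
            1348475000.0,
        ]
    }
)

# Raw abundance columns start with 'Abundance F'; normalized ones start with 'Abundances'.
abundance_cols = [c for c in sample.columns if c.startswith("Abundance F")]
digits = leading_digits(sample[abundance_cols[0]])
print(digits.tolist())  # [2, 2, 4, 3, 1]
```

Formatting with 17 significant digits avoids the rounding problem of a shorter format, where a value just below a power of ten could be printed with a leading 1.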