**6/1/20**

I'm going to use this notebook to answer some basic questions about the results of the various database searches, so that they can be compared against other proteomic publications. These questions include: How many human and bacterial proteins did my search identify? Is the distribution of spectral hits for peptides normal, or do I need to log-transform them before performing statistical tests? Does the search identify about the same number of PSMs in each sample, or does it vary? What percentage of the PSMs are human in BV+ samples compared with BV- samples?
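
For the normality question, one quick check is to compare the skewness of the raw spectral counts with the skewness after a log transformation. The sketch below uses only the standard library; `sample_skewness` and `toy_counts` are hypothetical names and made-up values for illustration, not data from my searches, and `log2(count + 1)` is just one common choice of transformation.

```python
import math
import statistics

def sample_skewness(values):
    """Population skewness: the mean cubed z-score of the values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # all values identical, so there is no asymmetry
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical spectral counts for a handful of peptides (NOT real results).
toy_counts = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

raw_skew = sample_skewness(toy_counts)
# Adding 1 before taking the log keeps zero counts defined.
log_skew = sample_skewness([math.log2(c + 1) for c in toy_counts])

print(f'skewness raw: {raw_skew:.2f}, log2(count + 1): {log_skew:.2f}')
```

A skewness near 0 suggests a roughly symmetric distribution; a large positive value on the raw counts that shrinks after the transformation is a hint that the log scale is the better one for statistical tests. A formal normality test would still be needed to confirm this.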

In [1]:
from elliot_utils import *

In [2]:
results = getOrderedFiles(TAILORED_RESULTS, '.tsv')
comResults = getOrderedFiles(COMMUNITY_RESULTS, '.tsv')
combResults = getOrderedFiles(POOLED_RESULTS, '.tsv')
indResults = getOrderedFiles(SINGLE_RESULTS, '.tsv')
hybResults = getOrderedFiles(HYBRID_RESULTS, '.tsv')
analysisPath = Path.cwd().joinpath('analysis_files/basic_search_results/')
figPath = Path.cwd().joinpath('figures/basic_search_results/')

Determine the number of unique human and bacterial proteins identified by my searches

In [3]:
# Prints the number of unique human and bacterial proteins identified in the set of results.
# Requires the set of filtered human peptides, the set of filtered bacterial peptides,
# and the list of result files to scan.
def reportUniqueProteins(humanPeps, bacteriaPeps, results):
    peptidePool = set()
    peptidePool.update(humanPeps)
    peptidePool.update(bacteriaPeps)
    protsIDd = {} # key=protein ID, value=set of peptides that match this protein
    pep2Prots = {} # key=peptide sequence, value=set of proteins that match this peptide
    pepCounts = {} # key=peptide sequence, value=number of times identified across all samples
    pepProbabilities = {} # key=peptide sequence, value=lowest spectral probability
    for pep in peptidePool:
        pep2Prots[pep] = set()
        pepCounts[pep] = 0
        pepProbabilities[pep] = 1000000000
    for res in results:
        with res.open(mode='r') as infile:
            reader = csv.reader(infile, delimiter='\t')
            for row in reader:
                protType = determineIDType(row)
                if protType == 'first':
                    continue
                # Stop reading at the first non-significant row; this assumes rows are sorted by significance.
                if not isSignificant(row):
                    break
                if (protType == 'human' and row[PEPTIDE] in humanPeps) or (protType == 'bacteria' and row[PEPTIDE] in bacteriaPeps):
                    pepCounts[row[PEPTIDE]] += 1
                    if pepProbabilities[row[PEPTIDE]] > float(row[SPEC_PROBABILITY]):
                        pepProbabilities[row[PEPTIDE]] = float(row[SPEC_PROBABILITY])
                    hits = getProteinHitList(row, protType)
                    pep2Prots[row[PEPTIDE]].update(set(hits))
                    for hit in hits:
                        if hit not in protsIDd:
                            protsIDd[hit] = set()
                        protsIDd[hit].add(row[PEPTIDE])

    humanRealProts = set()
    bacteriaRealProts = set()
    peptidesCopy = peptidePool.copy()
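    # Criterion 1: accept proteins supported by more than one peptide that has not already been claimed.
    for prot, pepSet in protsIDd.items():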
        validPeps = set()
        for pep in pepSet:
            if pep in peptidesCopy:
                validPeps.add(pep)
        if len(validPeps) > 1:
            if prot.find('HUMAN') != -1:
                humanRealProts.add(prot)
            else:
                bacteriaRealProts.add(prot)
            for p in validPeps:
                peptidesCopy.remove(p)
    # Criterion 2: accept a protein for any unclaimed peptide identified more than once.
    for pep, count in pepCounts.items():
        if pep in peptidesCopy and count > 1:
            realProt = pep2Prots[pep].pop()
            if realProt.find('HUMAN') != -1:
                humanRealProts.add(realProt)
            else:
                bacteriaRealProts.add(realProt)
            peptidesCopy.remove(pep)
    # Criterion 3: accept a protein for any unclaimed peptide with spectral probability below 1e-15.
    for pep, probability in pepProbabilities.items():
        if pep in peptidesCopy and probability < 1e-15:
            peptidesCopy.remove(pep)
            realProt = pep2Prots[pep].pop()
            if realProt.find('HUMAN') != -1:
                humanRealProts.add(realProt)
            else:
                bacteriaRealProts.add(realProt)
    print(f'Unique Human Proteins: {len(humanRealProts)}')
    print(f'Unique Bacterial Proteins {len(bacteriaRealProts)}')
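
To make the three acceptance criteria easier to follow, here is a minimal, self-contained sketch of the same idea on toy data. It is a simplification, not a replacement for `reportUniqueProteins`: the function name `accept_proteins`, the peptide sequences, and the protein IDs are all hypothetical, and where the function above picks an arbitrary matching protein with `pop()`, this sketch picks the alphabetically first one so the result is reproducible.

```python
def accept_proteins(pep_to_prots, pep_counts, pep_best_prob, prob_cutoff=1e-15):
    """Return the set of proteins accepted under any of the three criteria."""
    # Invert the peptide -> proteins mapping into protein -> peptides.
    prot_to_peps = {}
    for pep, prots in pep_to_prots.items():
        for prot in prots:
            prot_to_peps.setdefault(prot, set()).add(pep)

    unclaimed = set(pep_to_prots)
    accepted = set()

    # Criterion 1: more than one distinct, still-unclaimed peptide.
    for prot in sorted(prot_to_peps):
        valid = prot_to_peps[prot] & unclaimed
        if len(valid) > 1:
            accepted.add(prot)
            unclaimed -= valid

    # Criteria 2 and 3: a leftover peptide seen more than once,
    # or with a spectral probability below the cutoff.
    for pep in sorted(unclaimed):
        strong = pep_counts[pep] > 1 or pep_best_prob[pep] < prob_cutoff
        if strong and pep_to_prots[pep]:
            accepted.add(min(pep_to_prots[pep]))  # deterministic stand-in for pop()
    return accepted

# Toy inputs (hypothetical peptides and protein IDs).
pep_to_prots = {
    'PEPA': {'P1_HUMAN'}, 'PEPB': {'P1_HUMAN'},  # two unique peptides
    'PEPC': {'P2_HUMAN'},                        # one peptide, seen three times
    'PEPD': {'B1_LACTO'},                        # one peptide, very low probability
    'PEPE': {'B2_LACTO'},                        # one weak hit: rejected
}
pep_counts = {'PEPA': 1, 'PEPB': 1, 'PEPC': 3, 'PEPD': 1, 'PEPE': 1}
pep_best_prob = {'PEPA': 1e-10, 'PEPB': 1e-11, 'PEPC': 1e-10, 'PEPD': 1e-20, 'PEPE': 1e-9}

accepted = accept_proteins(pep_to_prots, pep_counts, pep_best_prob)
print(sorted(accepted))
```

In this toy case `B2_LACTO` is the only protein left out, because its single peptide was seen once and its probability is above the cutoff.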

In [5]:
humanPeps = getFilteredPeptides(results, 'human')
bacteriaPeps = getFilteredPeptides(results, 'bacteria')

In [6]:
# Determine how many human and bacterial proteins have at least 2 unique peptides identified, one peptide identified at least twice, or one peptide with spectral probability < 1e-15
print('16S_Sample-Matched')
reportUniqueProteins(humanPeps, bacteriaPeps, results)

16S_Sample-Matched
Unique Human Proteins: 1074
Unique Bacterial Proteins 1257


In [7]:
# Unique proteins for 16S_Pooled
comHumanPeps = getFilteredPeptides(comResults, 'human')
comBacteriaPeps = getFilteredPeptides(comResults, 'bacteria')

In [8]:
print('16S_Pooled')
reportUniqueProteins(comHumanPeps, comBacteriaPeps, comResults)

16S_Pooled
Unique Human Proteins: 798
Unique Bacterial Proteins 1022


In [9]:
# Unique proteins for Shotgun_Pooled
combHumanPeps = getFilteredPeptides(combResults, 'human')
combBacteriaPeps = getFilteredPeptides(combResults, 'bacteria')

In [10]:
print('Shotgun_Pooled')
reportUniqueProteins(combHumanPeps, combBacteriaPeps, combResults)

Shotgun_Pooled
Unique Human Proteins: 820
Unique Bacterial Proteins 1037


In [11]:
# Unique proteins for Shotgun_Sample-Matched
indHumanPeps = getFilteredPeptides(indResults, 'human')
indBacteriaPeps = getFilteredPeptides(indResults, 'bacteria')

In [12]:
print('Shotgun_Sample-Matched')
reportUniqueProteins(indHumanPeps, indBacteriaPeps, indResults)

Shotgun_Sample-Matched
Unique Human Proteins: 1182
Unique Bacterial Proteins 942


In [14]:
# Unique proteins for Hybrid_Sample-Matched
hybHumanPeps = getFilteredPeptides(hybResults, 'human')
hybBacteriaPeps = getFilteredPeptides(hybResults, 'bacteria')

In [15]:
print('Hybrid_Sample-Matched')
reportUniqueProteins(hybHumanPeps, hybBacteriaPeps, hybResults)

Hybrid_Sample-Matched
Unique Human Proteins: 1068
Unique Bacterial Proteins 1418
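
To compare the five databases side by side, the counts printed above can be collected into one small table. This is only a convenience sketch: the numbers are copied by hand from the outputs in this notebook, and `reported` and `summarize` are hypothetical names that do not exist in `elliot_utils`.

```python
# Counts copied from the printed outputs above: (human, bacterial).
reported = {
    '16S_Sample-Matched': (1074, 1257),
    '16S_Pooled': (798, 1022),
    'Shotgun_Pooled': (820, 1037),
    'Shotgun_Sample-Matched': (1182, 942),
    'Hybrid_Sample-Matched': (1068, 1418),
}

def summarize(counts):
    """Return (name, human, bacterial, total, percent bacterial) per database."""
    rows = []
    for name, (human, bacteria) in counts.items():
        total = human + bacteria
        rows.append((name, human, bacteria, total, 100 * bacteria / total))
    return rows

summary = summarize(reported)

print(f"{'Database':<24}{'Human':>7}{'Bacterial':>11}{'Total':>7}{'% Bacterial':>13}")
for name, human, bacteria, total, pct in summary:
    print(f'{name:<24}{human:>7}{bacteria:>11}{total:>7}{pct:>13.1f}')
```

Laid out this way, it is easier to see which database gives the most proteins overall and how the human/bacterial split shifts between databases.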
