**12/21/21**

This notebook documents the construction and refinement of the Community database.

In [1]:
from elliot_utils import *

In [2]:
bacteriaDir = Path.cwd().joinpath('../PublicSequences/Combined_AllNCBI_12-21/')
eukaryotesDir = Path.cwd().joinpath('../PublicSequences/Combined_AllNCBI_Eukaryotes_12-21/')
humanFile = Path.cwd().joinpath('../PublicSequences/Human9606_2-6-2019_TrypPigBov.fasta')

In [3]:
# Get human and contaminant data from the file
humanData = ''
with humanFile.open(mode='r') as infile:
    humanData = infile.read()

In [6]:
# Deduplicate bacterial proteins
bacteriaProteins = {} # key=protein sequence, value=Protein object
for fastafile in bacteriaDir.iterdir():
    with fastafile.open(mode='r') as infile:
        data = infile.read()
    # One list item per FASTA entry; strip the leading '>' from the first entry
    dataList = data.split('\n>')
    dataList[0] = dataList[0][1:]
    del data
    for sequence in dataList:
        newProt = Protein(sequence)
        if newProt.sequence in bacteriaProteins:
            # Identical sequence already seen: record the additional taxon
            bacteriaProteins[newProt.sequence].addTaxa(newProt.taxa[0])
        else:
            bacteriaProteins[newProt.sequence] = newProt

In [7]:
# Deduplicate nonhuman eukaryotic proteins
eukProteins = {} # key=protein sequence, value=Protein object
for fastafile in eukaryotesDir.iterdir():
    with fastafile.open(mode='r') as infile:
        data = infile.read()
    # One list item per FASTA entry; strip the leading '>' from the first entry
    dataList = data.split('\n>')
    dataList[0] = dataList[0][1:]
    del data
    for sequence in dataList:
        newProt = Protein(sequence)
        if newProt.sequence in eukProteins:
            # Identical sequence already seen: record the additional taxon
            eukProteins[newProt.sequence].addTaxa(newProt.taxa[0])
        else:
            eukProteins[newProt.sequence] = newProt
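
The two deduplication cells above follow the same pattern: key each protein by its sequence and merge taxa when a sequence repeats. A minimal, self-contained sketch of that pattern follows. `SimpleProtein` and `dedupe_entries` are hypothetical stand-ins, not the `Protein` class from `elliot_utils`; in particular, reading the taxon from a trailing `[Species name]` in the header is an assumption about the FASTA headers.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleProtein:
    """Hypothetical stand-in for elliot_utils.Protein (illustration only)."""
    header: str
    sequence: str
    taxa: list = field(default_factory=list)

    @classmethod
    def from_entry(cls, entry):
        # entry is one FASTA record without the leading '>'
        lines = entry.strip().split('\n')
        header = lines[0]
        sequence = ''.join(line.strip() for line in lines[1:])
        # Assumption: the taxon is the trailing "[Species name]" in the header
        if '[' in header:
            taxon = header[header.rfind('[') + 1:header.rfind(']')]
        else:
            taxon = 'unknown'
        return cls(header, sequence, [taxon])

    def add_taxon(self, taxon):
        if taxon not in self.taxa:
            self.taxa.append(taxon)


def dedupe_entries(fasta_text, proteins=None):
    """Merge FASTA records into a dict keyed by sequence."""
    proteins = {} if proteins is None else proteins
    for entry in fasta_text.lstrip('>').split('\n>'):
        if not entry.strip():
            continue  # skip empty records, e.g. from an empty file
        prot = SimpleProtein.from_entry(entry)
        if prot.sequence in proteins:
            proteins[prot.sequence].add_taxon(prot.taxa[0])
        else:
            proteins[prot.sequence] = prot
    return proteins
```

Passing the same `proteins` dict to repeated calls would accumulate entries across several files, which mirrors how the loops above reuse one dictionary per kingdom.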

In [4]:
dbDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/databases/')

In [10]:
# Start by writing out human and eukaryotic proteins to a new fasta file
dbIndex = 1
dbPath = dbDir.joinpath(f'Community5_{dbIndex}.fasta')

In [14]:
humanEukToWrite = [humanData]
for prot in eukProteins.values():
    humanEukToWrite.append(prot.getEntry())
with dbPath.open(mode='w', newline='') as dbfile:
    dbfile.write(''.join(humanEukToWrite))

In [15]:
# Add the bacterial proteins, keeping each file approximately 300 MB in size.
# The first pass appends to the file that already holds the human and eukaryotic proteins.
bactToWrite = [prot.getEntry() for prot in bacteriaProteins.values()]
_iter = 0
while _iter < len(bactToWrite):
    dbPath = dbDir.joinpath(f'Community5_{dbIndex}.fasta')
    with dbPath.open(mode='a', newline='') as outfile:
        for i in range(_iter, len(bactToWrite)):
            outfile.write(bactToWrite[i])
            _iter += 1
            # st_size may lag slightly behind buffered writes, which is fine for an approximate cap
            if dbPath.stat().st_size > 300000000:
                dbIndex += 1
                break
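
The size check above calls `stat()` after every write. An alternative sketch tracks the number of bytes written in memory and rolls over to a new file once the cap is passed. `write_chunked` and its parameters are hypothetical names, and the sketch assumes UTF-8 output; it is not the code that produced the Community5 files.

```python
from pathlib import Path


def write_chunked(entries, out_dir, prefix, max_bytes, start_index=1, start_size=0):
    """Write entries to numbered FASTA files, rolling over once a file exceeds max_bytes.

    entries is a list of FASTA records, each ending in a newline.
    start_size is the size of an existing first file that should be appended to.
    Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    index, size = start_index, start_size
    path = out_dir / f'{prefix}_{index}.fasta'
    outfile = path.open(mode='a', newline='', encoding='utf-8')
    paths = [path]
    try:
        for n, entry in enumerate(entries):
            outfile.write(entry)
            size += len(entry.encode('utf-8'))
            # Roll over only if more entries remain, so no empty file is created
            if size > max_bytes and n < len(entries) - 1:
                outfile.close()
                index, size = index + 1, 0
                path = out_dir / f'{prefix}_{index}.fasta'
                outfile = path.open(mode='a', newline='', encoding='utf-8')
                paths.append(path)
    finally:
        outfile.close()
    return paths
```

As in the cell above, a file can exceed the cap by at most one entry, because the check runs after each write.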

In [6]:
# Condense the results down into a single file for each sample
resultsDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output/')

In [7]:
condensedResults = condenseHugeDBResults(resultsDir)

Done sorting files
Done condensing file 1284
Done condensing file 1289
Done condensing file 1290
Done condensing file 1292
Done condensing file 1294
Done condensing file 1296
Done condensing file 1299
Done condensing file 1303
Done condensing file 1304
Done condensing file 1306
Done condensing file 1310
Done condensing file 1314
Done condensing file 1316
Done condensing file 1318
Done condensing file 1320
Done condensing file 1322
Done condensing file 1324
Done condensing file 1326
Done condensing file 1328
Done condensing file 1334
Done condensing file 1336
Done condensing file 1338
Done condensing file 1340
Done condensing file 1342
Done condensing file 1346
Done condensing file 1348
Done condensing file 1350
Done condensing file 1356
Done condensing file 1358


In [8]:
condensedResultsDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_processed/')

In [12]:
for idNumber, tsvFile in condensedResults.items():
    newfilePath = condensedResultsDir.joinpath(f'{idNumber}-Combined_Fredricks_CVL_23Dec14_Pippin_14-08-21_dta_Community5.tsv')
    tsvFile.writeToFile(newfilePath)

In [13]:
# Refine the database
hitProteins = getHitsInResults(condensedResultsDir)
refinedDBDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/databases_refined/')

In [14]:
refineHugeDatabase(hitProteins, dbDir, refinedDBDir, 'Community5_Refined')

Done reading Community5_1.fasta
Done reading Community5_2.fasta
Done reading Community5_3.fasta
846002 sequences written.


In [18]:
outputRefinedDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_refined/')
condensedRefinedDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_refined_processed/')

In [16]:
condensedRefinedResults = condenseHugeDBResults(outputRefinedDir)

Done sorting files
Done condensing file 1284
Done condensing file 1289
Done condensing file 1290
Done condensing file 1292
Done condensing file 1294
Done condensing file 1296
Done condensing file 1299
Done condensing file 1303
Done condensing file 1304
Done condensing file 1306
Done condensing file 1310
Done condensing file 1314
Done condensing file 1316
Done condensing file 1318
Done condensing file 1320
Done condensing file 1322
Done condensing file 1324
Done condensing file 1326
Done condensing file 1328
Done condensing file 1334
Done condensing file 1336
Done condensing file 1338
Done condensing file 1340
Done condensing file 1342
Done condensing file 1346
Done condensing file 1348
Done condensing file 1350
Done condensing file 1356
Done condensing file 1358


In [19]:
for idNumber, tsvFile in condensedRefinedResults.items():
    newfilePath = condensedRefinedDir.joinpath(f'{idNumber}-Combined_Fredricks_CVL_23Dec14_Pippin_14-08-21_dta_Community5_Refined.tsv')
    tsvFile.writeToFile(newfilePath)

Because a single refined database incorporates hits from all searched samples, I need to build and search a second refined Community database for the subset of samples so that it can be compared against global. The process is the same as above, except that only the subset's search results determine which proteins go into the refined database.
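
The idea behind refinement, keeping only the proteins that were hit in a first-pass search, can be sketched as follows. This is an illustration under stated assumptions, not the implementation of `getHitsInResults` or `refineHugeDatabase`: `filter_fasta_by_hits` is a hypothetical helper, the toy records are made up, and it assumes each hit is identified by the first whitespace-delimited token of the FASTA header.

```python
def filter_fasta_by_hits(fasta_text, hit_ids):
    """Return FASTA text containing only records whose ID is in hit_ids."""
    kept = []
    for entry in fasta_text.lstrip('>').split('\n>'):
        if not entry.strip():
            continue
        # Assumption: the protein ID is the first token of the header line
        protein_id = entry.split(None, 1)[0]
        if protein_id in hit_ids:
            kept.append('>' + entry.rstrip('\n') + '\n')
    return ''.join(kept)


# Hits pooled from all samples versus hits from a subset give different refined databases
database = '>P1 desc\nMKV\n>P2 desc\nMQQ\n>P3 desc\nMAA\n'
all_hits = {'P1', 'P2', 'P3'}
subset_hits = {'P1', 'P3'}
refined_all = filter_fasta_by_hits(database, all_hits)
refined_subset = filter_fasta_by_hits(database, subset_hits)
```

In this toy case `refined_subset` omits `P2`, which is why a database refined on all samples cannot stand in for one refined on the subset alone.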

In [20]:
outputSubsetDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_subset/')

In [22]:
subsetHitProteins = getHitsInResults(outputSubsetDir)
refinedSubsetDBDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/databases_subset_refined/')

In [23]:
refineHugeDatabase(subsetHitProteins, dbDir, refinedSubsetDBDir, 'Community5_Subset_Refined')

Done reading Community5_1.fasta
Done reading Community5_2.fasta
Done reading Community5_3.fasta
360252 sequences written.


In [40]:
# The refined subset database was only a single file, so its search results are processed by collapsing repeat rows
subsetRefinedResults = getOrderedFiles(Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_subset_refined/'), '.tsv')
subsetRefinedProcessedDir = Path.cwd().joinpath('../12-21-21_NextflowMSGF_Community5_Combined/output_subset_refined_processed/')

In [41]:
collapseRepeatRows(subsetRefinedResults, subsetRefinedProcessedDir)