## Description

In this notebook, we discard proteins that are:  
1. Homologous to proteins from more than one fungal taxonomic family  
2. Encoded by genes on scaffolds where no gene has detectable homologs in at least two fungal species (likely contaminants)  

## Modules

In [None]:
from collections import Counter

## Data

Accessions of the target proteomes:

In [None]:
# Read one accession per line, skipping blank lines; the file is closed on exit
with open('basal_accessions.txt') as h:
    target_accessions = set(l.strip() for l in h if l.strip())
len(target_accessions)

Mapping of organism taxids to their family taxids:

In [None]:
org2taxid_file = 'org2taxid.tsv'
org2taxid = [l.strip().split('\t') for l in open(org2taxid_file) if l.strip()]
speciestx2familytx = {l[2]: l[3] for l in org2taxid}

Initial BLAST results (all-vs-all BLAST within the target organisms), used for step 1:

In [None]:
blast_result_file = 'initial_blast_results'
blast_results = [l.strip().split('\t') for l in open(blast_result_file) if l.strip()]

Accession to scaffold mapping, used for step 2:

In [None]:
acc2scaffold_file = 'acc2scaffold.tsv'
acc2scaffold = {}
with open(acc2scaffold_file) as h:
    for l in h:
        l = l.strip().split('\t')
        assert l[0] not in acc2scaffold
        acc2scaffold[l[0]] = l[1]
len(acc2scaffold) 

## Data processing

Generate dictionaries that map protein accessions to the number of species and number of taxonomic families in which the protein has homologs.

In [None]:
accession_species_pairs = set((l[0], l[2]) for l in blast_results)
species_counter = Counter([x[0] for x in accession_species_pairs])

In [None]:
accession_family_pairs = set((l[0], speciestx2familytx[l[2]]) for l in blast_results)
family_counter = Counter([x[0] for x in accession_family_pairs])
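
The two cells above count distinct species and distinct families per query accession by first deduplicating (accession, taxon) pairs and then counting accessions. A minimal sketch on toy data follows; the hit tuples, taxids, and all `toy_` names are hypothetical and only mirror the column positions used above (query accession in column 0, species taxid in column 2).

```python
from collections import Counter

# Hypothetical hits: (query accession, subject accession, subject species taxid)
toy_hits = [
    ("P1", "S1", "100"),
    ("P1", "S2", "100"),  # second hit in the same species, must count once
    ("P1", "S3", "200"),
    ("P2", "S4", "300"),
]
# Hypothetical species taxid -> family taxid mapping
toy_species2family = {"100": "F1", "200": "F1", "300": "F2"}

# Deduplicate (accession, taxon) pairs before counting accessions
toy_species_pairs = {(q, sp) for q, _, sp in toy_hits}
toy_family_pairs = {(q, toy_species2family[sp]) for q, _, sp in toy_hits}

toy_species_counter = Counter(q for q, _ in toy_species_pairs)
toy_family_counter = Counter(q for q, _ in toy_family_pairs)
```

Here `P1` has hits in two species that belong to a single family. An accession without any hit is absent from both counters, and a `Counter` returns 0 for it rather than raising `KeyError`.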

In [None]:
edf_occurence_table = [(acc, species_counter[acc], family_counter[acc]) for acc in target_accessions]
with open('edf_occurence_table.tsv', 'w') as h:
    for l in edf_occurence_table:
        # The counts are integers, so convert every field to str before joining
        h.write('\t'.join(map(str, l)) + '\n')

Generate a mapping of each scaffold to the number of its protein-encoding genes with homologs in at least 2 fungal species:

In [None]:
scaffold2anc_nb = Counter()  # per scaffold: number of proteins with homologs in at least 2 fungal species
for acc in target_accessions:  # accessions without BLAST hits have a species count of 0
    if species_counter[acc] >= 2:
        scaffold = acc2scaffold[acc]
        scaffold2anc_nb[scaffold] += 1
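
The per-scaffold count can be illustrated on a small hypothetical example. All names and values below are made up for illustration: three proteins sit on `scaf_A`, two of which have homologs in at least 2 species, and one poorly supported protein sits on `scaf_B`.

```python
from collections import Counter

# Hypothetical number of species with homologs, per accession
toy_species_counts = {"P1": 3, "P2": 1, "P3": 2, "P4": 1}
# Hypothetical accession -> scaffold mapping
toy_acc2scaffold = {"P1": "scaf_A", "P2": "scaf_A", "P3": "scaf_A", "P4": "scaf_B"}

toy_scaffold_support = Counter()
for acc, n_species in toy_species_counts.items():
    if n_species >= 2:  # protein has homologs in at least 2 species
        toy_scaffold_support[toy_acc2scaffold[acc]] += 1
```

In this sketch `scaf_A` ends up with a count of 2, while `scaf_B` never enters the counter and therefore reads as 0, which is what marks it as a candidate contaminant scaffold in step 2.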

Select the single-family accessions (step 1) and remove contaminants (step 2), save the resulting list of accessions:

In [None]:
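# scaffold2anc_nb is keyed by scaffold, so look up the scaffold of each accession first
single_family_accessions = [acc for acc in target_accessions if family_counter[acc] == 1 and scaffold2anc_nb[acc2scaffold[acc]] >= 1]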
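
As a sanity check of the combined selection, here is a minimal sketch on hypothetical data. It assumes that the scaffold count is looked up through the scaffold of each accession; the helper `passes_both_steps` and all `toy_` names are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical inputs
toy_family_counter = Counter({"P1": 1, "P2": 2, "P4": 1})
toy_acc2scaffold = {"P1": "scaf_A", "P2": "scaf_A", "P4": "scaf_B"}
toy_scaffold_support = Counter({"scaf_A": 2})  # scaf_B has no supported protein

def passes_both_steps(acc):
    """Step 1: homologs in exactly one family. Step 2: scaffold has fungal support."""
    scaffold = toy_acc2scaffold[acc]
    return toy_family_counter[acc] == 1 and toy_scaffold_support[scaffold] >= 1

toy_kept = sorted(acc for acc in toy_acc2scaffold if passes_both_steps(acc))
```

Only `P1` survives here: `P2` has homologs in two families (step 1), and `P4` lies on a scaffold without any protein supported by at least 2 species (step 2).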

In [None]:
with open('single_family_accessions.txt', 'w') as h:
    h.write('\n'.join(single_family_accessions) + '\n')