## Notebook 2: Homology matrix generation from genome sequences

Based on the paper from Norsigian et al. (2020) , using the 55 GEMS from Monk et al. (2013). 

This notebook generates the data used for the draft model generation in notebook 3.1. To start, a list of reference strains with NCBI identifiers is compiled. Annotated genomes of all reference strains and EcN are downloaded from NCBI as genbank files and parsed to generate FASTA files for the protein sequences. A bidirectional protein blast (BLASTp) is run between EcN and the reference strains. Protein sequences with a percentage identity (PID) above an 80% similarity threshold and alignment length of at least 25% are deemed homologous. The best bidirectional hit (BBH) is identified for each EcN gene and a binary orthology matrix is created based on the gene homology. Additionally bi-directional blast best hits are identified for all E. coli K-12 MG1655 genes in comparison to EcN. 

---
#### Original
In this tutorial, we will be working on generationg multi-strain genome-scale models for 5 E.coli strains. The reference model we used here is the iML1515 model published in Nature Biotechnology (PMID: 29020004), and the reference strain is E. coli K12 MG1655. We will be generating strain-specific models for 5 other E.coli strains: ATCC 8739, LF82, UM146, UMN026 and IAI39.

This is the the first notebook in the tutorial to create homology matrix from genome sequences.There are four major steps in this notebook
1. Download the genome annotation (GenBank files) from NCBI, and generate fasta files (protein &nucleotide) from them
2. Perform BLASTp to find homologous proteins in strains of interest
3. Use best bidirectional hits to create gene presence/absence matrix
4. Supplementary for best practice: use BLASTn to check if we have missed any unannotated open reading frames and retain these genes in orthology matrix as well as guide future manual curation

In [1]:
#import packages needed
import pandas as pd
from glob import glob
from Bio import Entrez, SeqIO

In [3]:
# Load the information on the 55 reference strains
ReferenceStrains=pd.read_csv('../tables/strain_info.csv', usecols=[1,3])
ReferenceStrains.head()

Unnamed: 0,Strain,NCBI ID
0,Escherichia coli LF82,CU651637
1,Escherichia coli O83:H1 str. NRG 857C,CP001855
2,Escherichia coli UM146,CP002167
3,Escherichia coli APEC O1,CP000468
4,Escherichia coli ATCC 8739,CP000946


In [4]:
# The Reference Genome of E. coli Nissle, Assembly: ASM71459v1, Bioproject: PRJNA248167
EcN_ID ='CP022686.1'
ReferenceStrainIDs=list(ReferenceStrains['NCBI ID'])

## 1. Download genome annotations (GenBank files) to generate fasta files 

### 1.1 Dowload genomes from NCBI
Download the genome annotations (GenBank files) from NCBI for all reference strains and EcN. 

In [4]:
# Define path for storage of files
genomes_path = '../data/genomes_gb'
prot_path = '../data/proteins'
nucl_path = '../data/nucleotides'

In [5]:
# define a function to download the annotated genebank files from NCBI
def dl_genome(id, folder=genomes_path): # be sure get CORRECT ID
    files=glob('%s/*.gb'%folder)
    out_file = '%s/%s.gb'%(folder, id)

    if out_file in files:
        print (out_file, 'already downloaded')
        return
    else:
        print ('downloading %s from NCBI'%id)
        
    from Bio import Entrez
    Entrez.email = ""     #Insert email here for NCBI
    handle = Entrez.efetch(db="nucleotide", id=id, rettype="gb", retmode="text")
    fout = open(out_file,'w')
    fout.write(handle.read())
    fout.close()

In [6]:
# execute the above function, and download the GenBank files for 55 E. coli/Shigella strains
for strain in ReferenceStrainIDs:
    dl_genome(strain, folder=genomes_path)

downloading CU651637 from NCBI
downloading CP001855 from NCBI
downloading CP002167 from NCBI
downloading CP000468 from NCBI
downloading CP000946 from NCBI
downloading CP000819 from NCBI
downloading CP001665 from NCBI
downloading AM946981 from NCBI
downloading CP001509 from NCBI
downloading CP001396 from NCBI
downloading CP001637 from NCBI
downloading AP012030 from NCBI
downloading CU928162 from NCBI
downloading CP000802 from NCBI
downloading CU928160 from NCBI
downloading CP002516 from NCBI
downloading AP009240 from NCBI
downloading AP009378 from NCBI
downloading CP000948 from NCBI
downloading U00096 from NCBI
downloading AP009048 from NCBI
downloading CP002185 from NCBI
downloading CP002967 from NCBI
downloading FN554766 from NCBI
downloading CU928145 from NCBI
downloading AP010958 from NCBI
downloading AP010960 from NCBI
downloading CP001164 from NCBI
downloading AE005174 from NCBI
downloading BA000007 from NCBI
downloading CP001368 from NCBI
downloading AP010953 from NCBI
downloadin

In [13]:
# Additionally download the GenBank file of EcN into the "genomes" folder
dl_genome(EcN_ID, folder=genomes_path)

downloading CP022686.1 from NCBI


### 1.2 Examine the Downloaded Strains

In [None]:
# define a function to gather information of the downloaded strains from the GenBank files
def get_strain_info(folder=genomes_path):
    files = glob('%s/*.gb'%folder)
    strain_info = []
    
    for file in files:
        handle = open(file)
        record = SeqIO.read(handle, "genbank")
        
        for f in record.features:
            if f.type=='source':
                info = {}
                info['file'] = file
                info['id'] = file.split('\\')[-1].split('.')[0]
                for q in f.qualifiers.keys():
                    info[q] = '|'.join(f.qualifiers[q])
                strain_info.append(info)
    return pd.DataFrame(strain_info)

In [None]:
# information on the downloaded strains
get_strain_info(folder=genomes_path)

### 1.3 Generate FASTA files for both Protein and Nucleotide Pipelines
From the GenBank file, sequence and annoation information is used to generate fasta files for the protein and nucleotide analyses. The resulting fasta files will then be used in step 2 as input for BLAST

In [None]:
# define a function to parse the Genbank file to generate fasta files for both protein and nucleotide sequences
def parse_genome(id, type='prot', in_folder=genomes_path, out_folder=prot_path, overwrite=0):

    in_file = '%s/%s.gb'%(in_folder, id)
    out_file='%s/%s.fa'%(out_folder, id)
    files =glob('%s/*.fa'%out_folder)
    
    if out_file in files and overwrite==0:
        print (out_file, 'already parsed')
        return
    else:
        print ('parsing %s'%id)
        
    gene_synonym_dict = {}
    
    handle = open(in_file)
    
    fout = open(out_file,'w')
    x = 0

    records = SeqIO.parse(handle, "genbank")
    for record in records:
        for f in record.features:
            if f.type=='CDS':
                seq=f.extract(record.seq)
                
                if type=='nucl':
                    seq=str(seq)
                else:
                    seq=str(seq.translate())
                    
                if 'locus_tag' in f.qualifiers.keys():
                    locus = f.qualifiers['locus_tag'][0]
                elif 'gene' in f.qualifiers.keys():
                    locus = f.qualifiers['gene'][0]
                else:
                    locus = 'gene_%i'%x
                    x+=1
                fout.write('>%s\n%s\n'%(locus, seq))
                
    fout.close()
    gene_synonym_dict.to_csv('../tables/gene_synonym.csv')

In [None]:
# Generate fasta files for 55 reference strains and EcN
for strain in ReferenceStrainIDs:
    parse_genome(strain, type='prot', in_folder=genomes_path, out_folder=prot_path)
    parse_genome(strain, type='nucl', in_folder=genomes_path, out_folder=nucl_path)
    
parse_genome(EcN_ID, type='prots', in_folder=genomes_path, out_folder=prot_path)
parse_genome(EcN_ID, type='nucl', in_folder=genomes_path, out_folder=nucl_path) 

## 2. Perform BLAST to find homologous proteins in strains of interest

### 2.1 Make BLAST DB for each of the target strains for both Protein and Nucleotide Pipelines

Both BLASTp for proteins and BLASTn for nucleotides is run. BLASTp will be used as the main approach to identify homologous proteins in reference strain and other strains of interest, while BLASTn will be used as a supplementary method to check for any unannotated genes

In [None]:
# Define a function to make blast database for either protein of nucleotide
def make_blast_db(id,folder=prot_path,db_type='prot'):
    import os
    
    out_file ='%s/%s.fa.pin'%(folder, id)
    files =glob('%s/*.fa.pin'%folder)
    
    if out_file in files:
        print (id, 'already has a blast db')
        return
    if db_type=='nucl':
        ext='fna'
    else:
        ext='fa'

    cmd_line='makeblastdb -in %s/%s.%s -dbtype %s' %(folder, id, ext, db_type)
    
    print ('making blast db with following command line...')
    print (cmd_line)
    os.system(cmd_line)

In [None]:
# make protein sequence databases 
# Because we are performing bi-directional blast, we make databases from both reference strain and strains of interest
for strain in ReferenceStrainIDs:
    make_blast_db(strain,folder=prot_path,db_type='prot')
    
make_blast_db(EcN_ID,folder=prot_path,db_type='prot')

### 2.2 Define functions to run protein BLAST and get sequence lengths
- BLASTp will be the main approach used here to identify homologous proteins between strains 
- Aside from sequence similarity, we also want to ensure the coverage of sequence mapping is sufficient. Therefore, we need to identiy the sequence length for each protein and compare it with the alignment length.

In [14]:
# define a function to run BLASTp
def run_blastp(seq,db,in_folder=prot_path, out_folder='../data/bbh', out=None,outfmt=6,evalue=0.001,threads=1):
    import os
    if out==None:
        out='%s/%s_vs_%s.txt'%(out_folder, seq, db)
        print(out)

    files =glob('%s/*.txt'%out_folder)
    if out in files:
        print (seq, 'already blasted')
        return
    
    print ('blasting %s vs %s'%(seq, db))
    
    db = '%s/%s.fa'%(in_folder, db)
    seq = '%s/%s.fa'%(in_folder, seq)
    cmd_line='blastp -db %s -query %s -out %s -evalue %s -outfmt %s -num_threads %i' \
    %(db, seq, out, evalue, outfmt, threads)
    
    print ('running blastp with following command line...')
    print (cmd_line)
    os.system(cmd_line)
    return out

In [15]:
# define a function to get sequence length 
def get_gene_lens(query, in_folder=prot_path):

    file = '%s/%s.fa'%(in_folder, query)
    handle = open(file)
    records = SeqIO.parse(handle, "fasta")
    out = []
    
    for record in records:
        out.append({'gene':record.name, 'gene_length':len(record.seq)})
    
    out = pd.DataFrame(out)
    return out

## 3. Use Bi-Directional BLASTp Best Hits to create gene presence/absence matrix

### 3.1 Obtain Bi-Directional BLASTp Best Hits

From the above BLASTp results, we can obtain Bi-Directional BLASTp Best Hits to identify homologous proteins (>80%, see section 3.3). Besides gene similarity score, the coverage of alignment is also used to filter mapping results (>25%). 

In [16]:
# define a function to get Bi-Directional BLASTp Best Hits
def get_bbh(query, subject, in_folder='../data/bbh'):    
    
    #Utilize the defined protein BLAST function
    run_blastp(query, subject)
    run_blastp(subject, query)
    
    query_lengths = get_gene_lens(query, in_folder=prot_path)
    subject_lengths = get_gene_lens(subject, in_folder=prot_path)
    
    #Define the output file of this BLAST
    out_file = '%s/%s_vs_%s_parsed.csv'%(in_folder, query, subject)
    files=glob('%s/*_parsed.csv'%in_folder)
    
    #Combine the results of the protein BLAST into a dataframe
    print ('parsing BBHs for', query, subject)
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    bbh=pd.read_csv('%s/%s_vs_%s.txt'%(in_folder,query, subject), sep='\t', names=cols)
    bbh = pd.merge(bbh, query_lengths) 
    bbh['COV'] = bbh['alnLength']/bbh['gene_length']
    
    bbh2=pd.read_csv('%s/%s_vs_%s.txt'%(in_folder,subject, query), sep='\t', names=cols)
    bbh2 = pd.merge(bbh2, subject_lengths) 
    bbh2['COV'] = bbh2['alnLength']/bbh2['gene_length']
    out = pd.DataFrame()
    
    # Filter the genes based on coverage (>25%)
    bbh = bbh[bbh.COV>=0.25]
    bbh2 = bbh2[bbh2.COV>=0.25]
    
    #Delineate the best hits from the BLAST
    for g in bbh.gene.unique():
        res = bbh[bbh.gene==g]
        if len(res)==0:
            continue
#         best_hit = res.loc[res.PID.idxmax()]
#         best_gene = best_hit.subject
#         res2 = bbh2[bbh2.gene==best_gene]
        
        # add section to catch cases where highest hits are not the same
        res.sort_values('PID', ascending=False, inplace=True)
        for hit in res['subject']: #check for the all hits whether they are present in the other direction
            res2 = bbh2[bbh2.gene==hit]
            if len(res2)!=0: # as soon as there is a hit, continue
                best_hit = res[res['subject'] == hit] #Set the gene that is the highest hit as best_hit
                best_gene = best_hit.subject
                break
        
        if len(res2)==0:
            continue
        best_hit2 = res2.loc[res2.PID.idxmax()]
        best_gene2 = best_hit2.subject
        if g==best_gene2:
            best_hit['BBH'] = '<=>'
        else:
            best_hit['BBH'] = '->'
        out=pd.concat([out, pd.DataFrame(best_hit)])#.transpose()])
    
    #Save the final file to a designated CSV file    
    out.to_csv(out_file)

In [7]:
# Execute the BLAST for EcN against all reference strains, save results to 'bbh' i.e. "bidirectional best
# hits" folder to create the homology matrix

for strain in ReferenceStrainIDs:
    get_bbh(EcN_ID,strain, in_folder='../data/bbh')

In [19]:
# Additionally get the bhh for the comparison between MG1655 and EcN
get_bbh('U00096', EcN_ID, in_folder='../data/bbh')

../data/bbh/U00096_vs_CP022686.1.txt
blasting U00096 vs CP022686.1
running blastp with following command line...
blastp -db ../data/proteins/CP022686.1.fa -query ../data/proteins/U00096.fa -out ../data/bbh/U00096_vs_CP022686.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
../data/bbh/CP022686.1_vs_U00096.txt
blasting CP022686.1 vs U00096
running blastp with following command line...
blastp -db ../data/proteins/U00096.fa -query ../data/proteins/CP022686.1.fa -out ../data/bbh/CP022686.1_vs_U00096.txt -evalue 0.001 -outfmt 6 -num_threads 1
parsing BBHs for U00096 CP022686.1


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


### 3.2 Parse the BLAST Results into one Homology Matrix of the Reconstruction Genes

For the homology matrix, we only focus on genes that are present in the reference model

In [15]:
# Load all the BLAST files between the reference strain and target strains
# Exclude the reverse blast with the MG1655 genome, which is needed for the first step in notebook 3.1

blast_files = [file for file in glob('%s/*_parsed.csv'%'../data/bbh/') if 'U00096_vs_CP022686.1_parsed.csv' not in file]

for blast in blast_files:
    bbh=pd.read_csv(blast)
    print (blast,bbh.shape)

../data/bbh\CP022686.1_vs_AE005174_parsed.csv (4184, 16)
../data/bbh\CP022686.1_vs_AE005674_parsed.csv (3979, 16)
../data/bbh\CP022686.1_vs_AE014073_parsed.csv (3874, 16)
../data/bbh\CP022686.1_vs_AE014075_parsed.csv (4543, 16)
../data/bbh\CP022686.1_vs_AM946981_parsed.csv (4110, 16)
../data/bbh\CP022686.1_vs_AP009048_parsed.csv (4136, 16)
../data/bbh\CP022686.1_vs_AP009240_parsed.csv (4201, 16)
../data/bbh\CP022686.1_vs_AP009378_parsed.csv (4319, 16)
../data/bbh\CP022686.1_vs_AP010953_parsed.csv (4294, 16)
../data/bbh\CP022686.1_vs_AP010958_parsed.csv (4272, 16)
../data/bbh\CP022686.1_vs_AP010960_parsed.csv (4247, 16)
../data/bbh\CP022686.1_vs_AP012030_parsed.csv (4129, 16)
../data/bbh\CP022686.1_vs_BA000007_parsed.csv (4189, 16)
../data/bbh\CP022686.1_vs_CP000034_parsed.csv (3762, 16)
../data/bbh\CP022686.1_vs_CP000036_parsed.csv (3871, 16)
../data/bbh\CP022686.1_vs_CP000038_parsed.csv (3956, 16)
../data/bbh\CP022686.1_vs_CP000243_parsed.csv (4466, 16)
../data/bbh\CP022686.1_vs_CP000

In [16]:
# Import genes and their information on annotations
# Create ortho matrix for all strains
# define a function to get gene information
def get_gene_info(query, in_folder='../data/genomes_gb'):

    file = '%s/%s.gb'%(in_folder, query)
    handle = open(file)
    records = SeqIO.parse(handle, "genbank")
    out = []
    
    records = SeqIO.parse(handle, "genbank")
    for record in records:
        for f in record.features:
            if f.type=='CDS':
                if 'locus_tag' in f.qualifiers.keys():
                    locus = f.qualifiers['locus_tag'][0]
                    product = f.qualifiers['product'][0]
                    start_loc = f.location.start
                    end_loc = f.location.end
                elif 'gene' in f.qualifiers.keys():
                    locus = f.qualifiers['gene'][0]
                    print(locus)
                else:
                    locus = 'gene_%i'%x
                    x+=1
                    print(locus)
    
                out.append({'gene':locus, 'product':product, 'start':start_loc, 'end':end_loc})
    
    out = pd.DataFrame(out)
    out.set_index('gene',inplace=True)
    return out

gene_info = get_gene_info(EcN_ID, in_folder='../data/genomes_gb')
listGeneIDs = gene_info.index.tolist()
gene_info.head()

Unnamed: 0_level_0,product,start,end
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CIW80_00005,ATP-dependent RNA helicase HrpA,332,4235
CIW80_00010,DUF218 domain-containing protein,4507,5308
CIW80_00015,aldehyde dehydrogenase,5504,6944
CIW80_00020,type I glyceraldehyde-3-phosphate dehydrogenase,6985,7987
CIW80_00025,cytochrome B,8175,8706


In [17]:
# Create 2 matrices of N, rows where N is the number of EcN genes and M columns where M is the number of reference strains
# One matrix will be populated with the PID results from the blasts and another with the mapping of gene locus tags
ortho_matrix=pd.DataFrame(index=listGeneIDs,columns=ReferenceStrainIDs)
geneIDs_matrix=pd.DataFrame(index=listGeneIDs,columns=ReferenceStrainIDs)

In [18]:
#Parse through each blast file and acquire pertinent information for each matrix for each of the base reconstruction genes
for blast in blast_files:
    bbh=pd.read_csv(blast)
    listIDs=[]
    listPID=[]
    for r,row in ortho_matrix.iterrows():
        try:
            currentOrtholog=bbh[bbh['gene']==r].reset_index()
            listIDs.append(currentOrtholog.iloc[0]['subject'])
            listPID.append(currentOrtholog.iloc[0]['PID'])
        except:
            listIDs.append('None')
            listPID.append(0)
    for col in ortho_matrix.columns:
        if col in blast:
            ortho_matrix[col]=listPID
            geneIDs_matrix[col]=listIDs

### 3.3 Apply Similarity Threshold to Binarize  Homology Matrix to Presence/Absence Matrix
The homology matrix generated above is turned into a binarized presence/absence matrix by setting a similarity threshold for the PID.

In [19]:
# Genes with a similarity greater than 80% PID are considered present in the EcN genome 
# and consequently less than 80% are considered absent from the target strain genome
ortho_matrix_pid = ortho_matrix.copy()
for column in ortho_matrix:
    ortho_matrix.loc[ortho_matrix[column]<=80.0,column]=0
    ortho_matrix.loc[ortho_matrix[column]>80.0,column]=1

In [20]:
# Save as .csv files
ortho_matrix.to_csv('../tables/ortho_matrix.csv')
geneIDs_matrix.to_csv('../tables/geneIDs_matrix.csv')
ortho_matrix_pid.to_csv('../tables/ortho_matrix_pid.csv')
ortho_matrix

Unnamed: 0,CU651637,CP001855,CP002167,CP000468,CP000946,CP000819,CP001665,AM946981,CP001509,CP001396,...,CU928163,CP000243,CP001063,CP000036,CP000034,CP001383,AE014073,AE005674,CP000266,CP000038
CIW80_00005,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
CIW80_00010,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0
CIW80_00015,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0
CIW80_00020,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0
CIW80_00025,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CIW80_25775,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
CIW80_25780,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0
CIW80_25785,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
CIW80_25790,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0


### 3.4 Homology matrix based on MG1655 genes
In the first part of notebook 3.1 all genes of MG1655 that are not present in EcN are removed from the MG1655 model (iML1515) to generate the basis for the draft model construction. To make this comparison, the best bidirectional hit (bbh) for each MG1655 gene present in the iML1515 model in comparison to EcN genes was identified.

In [5]:
# Load the BLAST file of EcN/MG1655 strains
bbh=pd.read_csv('../data/bbh/U00096_vs_%s_parsed.csv'%EcN_ID)

In [6]:
#Load the base reconstruction to designate the list of genes within the model
import cobra
model = cobra.io.load_json_model('../data/models/iML1515.json')

MG_listGeneIDs=[]
for gene in model.genes:
    MG_listGeneIDs.append(gene.id)
    
# The ID of b4583 was changed to b4104 in the iML1515 model
MG_listGeneIDs.append('b4583')

In [7]:
# Create 1 matrix with the iML1515 model genes as rows and EcN as column.
# The matrix will be populated with the PID results from the blasts
EcN_ID='CP022686.1'
MG_ortho_matrix=pd.DataFrame(index=MG_listGeneIDs,columns=[EcN_ID])
MG_geneIDs_matrix=pd.DataFrame(index=MG_listGeneIDs,columns=[EcN_ID])

In [8]:
#Parse through the blast file and acquire pertinent information for each of the base reconstruction genes
listPID=[]
listIDs=[]

for r,row in MG_ortho_matrix.iterrows():
    try:
        currentOrtholog=bbh[bbh['gene']==r].reset_index() #switched gene and subject
        listIDs.append(currentOrtholog.iloc[0]['subject'])
        listPID.append(currentOrtholog.iloc[0]['PID'])
    except:
        listIDs.append('None')
        listPID.append(0)

MG_ortho_matrix[EcN_ID]=listPID
MG_ortho_matrix.head()
MG_geneIDs_matrix[EcN_ID]=listIDs

In [9]:
# Genes with a greater than 80% PID are considered present in the target strain genome 
# and consequently less than 80% are considered absent from the target strain genome
MG_ortho_matrix.loc[MG_ortho_matrix[EcN_ID]<=80.0,EcN_ID]=0
MG_ortho_matrix.loc[MG_ortho_matrix[EcN_ID]>80.0,EcN_ID]=1
MG_ortho_matrix.head()

Unnamed: 0,CP022686.1
b2551,1.0
b0870,1.0
b3368,1.0
b2436,1.0
b3500,1.0


In [10]:
# As mentioned above, the ID of b4583 was changed to b4104 in the iML1515 model
# remove the row of b4104 and replace with row of b4583
MG_ortho_matrix = MG_ortho_matrix.drop('b4104')
MG_ortho_matrix = MG_ortho_matrix.rename(index={'b4583': 'b4104'})
    
MG_geneIDs_matrix = MG_geneIDs_matrix.drop('b4104')
MG_geneIDs_matrix = MG_geneIDs_matrix.rename(index={'b4583': 'b4104'})

In [11]:
MG_ortho_matrix

Unnamed: 0,CP022686.1
b2551,1.0
b0870,1.0
b3368,1.0
b2436,1.0
b3500,1.0
...,...
b2796,1.0
b4093,1.0
b3878,0.0
b1206,1.0


In [15]:
MG_geneIDs_matrix

Unnamed: 0,CP022686.1
b2551,CIW80_06560
b0870,CIW80_22360
b3368,CIW80_11380
b2436,CIW80_06055
b3500,CIW80_12105
...,...
b2796,CIW80_07660
b4093,CIW80_15760
b3878,CIW80_13025
b1206,CIW80_24925


In [16]:
# Save as .csv files
MG_ortho_matrix.to_csv('../tables/MG_ortho_matrix.csv')
MG_geneIDs_matrix.to_csv('../tables/MG_geneIDs_matrix.csv')

## 4. Perform BLASTn to check unannotated open reading frames to guide manual curation 
At this juncture it may be useful to execute a supplementary nucleotide BLAST to check for unannotated genes, results here become candidates for manual curation. We retain unannotated genes that pass the threshold for
similarity and contain no premature stop codons



### 4.1 BLASTn

In [None]:
#Define a function to generate FNA from the GBK files
def gbk2fasta(gbk_filename):
    faa_filename = '.'.join(gbk_filename.split('.')[:-1])+'.fna'
    input_handle  = open(gbk_filename, "r")
    output_handle = open(faa_filename, "w")

    for seq_record in SeqIO.parse(input_handle, "genbank") :
        print ("Converting GenBank record %s" % seq_record.id)
        output_handle.write(">%s %s\n%s\n" % (
               seq_record.id,
               seq_record.description,
               seq_record.seq))

    output_handle.close()
    input_handle.close()

In [None]:
#Define function to run the BLASTn
def run_blastn(seq, db,outfmt=6,evalue=0.001,threads=1):
    import os
    out = '../data/nucleotides/'+seq+'_vs_'+db+'.txt'
    seq = '../data/nucleotides/'+seq+'.fa'
    db = '../data/genomes_gb/'+db+'.fna'
    
    cmd_line='blastn -db %s -query %s -out %s -evalue %s -outfmt %s -num_threads %i' \
    %(db, seq, out, evalue, outfmt, threads)
    
    print ('running blastn with following command line...')
    print (cmd_line)
    os.system(cmd_line)
    return out

In [None]:
# make nucleotide sequence databases 
for strain in ReferenceStrainIDs:
    make_blast_db(strain,folder=genomes_path,db_type='nucl')

In [None]:
# convert genbank files to fna files for strains of interest
for strain in ReferenceStrainIDs:
    gbk2fasta('../data/genomes_gb/'+strain+'.gb')

In [None]:
# perform uni-directional BLASTn hit
genome_blast_res=[]
for strain in ReferenceStrainIDs:
    res = run_blastn(EcN_ID,strain)
    genome_blast_res.append(res)

In [None]:
#define a function to parse through the nucleotide BLAST results and form one matrix of all the results
def parse_nucl_blast(infile):
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    data = pd.read_csv(infile, sep='\t', names=cols)
    data = data[(data['PID']>80) & (data['alnLength']>0.8*data['queryEnd'])]
    data2=data.groupby('gene').first()
    return data2.reset_index()


In [None]:
# parse the nucleotide blast matrix 
na_matrix=pd.DataFrame()
for file in genome_blast_res:
    genes =parse_nucl_blast(file)
    name ='.'.join(file.split('_')[-1].split('.')[:-1])
    na_matrix = na_matrix.append(genes[['gene','subject','PID']])
na_matrix = pd.pivot_table(na_matrix, index='gene', columns='subject',values='PID')

In [None]:
na_matrix.head()
na_matrix.to_csv('../tables/na_matrix.csv')

### 4.2 Examine unnannotated open reading frames
We compare the results from BLASTp and BLASTn and record any inconsistencies between the two matrices due to missing annotation. This result is then saved to guide future manual curation. 

In [4]:
# define a function to extract the sequence from fna file 
def extract_seq(g, contig, start, end):
    from Bio import SeqIO
    handle = open(g)
    records = SeqIO.parse(handle, "fasta")
    
    for record in records:
        if record.name==contig:
            if end>start:
                section = record[start:end]
            else:
                section = record[end-1:start+1].reverse_complement()
                
            seq = str(section.seq)
    return seq

In [5]:
# Load variables when starting at this point
ortho_matrix = pd.read_csv('../tables/ortho_matrix.csv')
ortho_matrix.set_index('Unnamed: 0', inplace=True)

geneIDs_matrix = pd.read_csv('../tables/geneIDs_matrix.csv')
geneIDs_matrix.set_index('Unnamed: 0', inplace=True)
                         
na_matrix = pd.read_csv('../tables/na_matrix.csv')
na_matrix.set_index('gene', inplace=True)
col_names = na_matrix.columns.values.tolist()
col_names = [x[:-2] for x in col_names]
na_matrix.columns = col_names

listGeneIDs = ortho_matrix.index.values.tolist()

In [6]:
#Define updated matrices that will include genes based on sequence evidence that were missing due to lack of annotation
ortho_matrix_w_unannotated = ortho_matrix.copy()
geneIDs_matrix_w_unannotated = geneIDs_matrix.copy()

In [7]:
#Define matrix of the BLASTn results for all the pertinent model genes
nonModelGenes=[]
for g in na_matrix.index:
    if g not in listGeneIDs:
        nonModelGenes.append(g)

na_model_genes=na_matrix.drop(nonModelGenes)

In [8]:
#For each strain in the ortho_matrix, identify genes that meet threshold of SEQ similarity, but missing from
#annotated ORFS. Additionally, look at the sequence to ensure that these cases do not have early stop codons indicating
#nonfunctional even if the NA seqs meet the threshold

pseudogenes = {}

for c in ortho_matrix.columns:
    
    orfs = ortho_matrix[c]
    genes = na_model_genes[c]
    # All the Model Genes that met the BLASTp Requirements
    orfs2 = orfs[orfs==1].index.tolist()
    # All the Model Genes based off of BLASTn similarity above threshold of 80
    genes2 = genes[genes>=80].index.tolist()
    # By Definition find the genes that pass sequence threshold but were NOT in annotated ORFs:
    unannotated = set(genes2) -set(orfs2)
    
    # Obtain sequences of this list to check for premature stop codons:
    data = '../data/nucleotides/CP022686.1_vs_%s.txt'%c
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    data = pd.read_csv(data, sep='\t', names=cols)
    #
    pseudogenes[c] = {}
    unannotated_data = data[data['gene'].isin(list(unannotated))]
    for i in unannotated_data.index:
        gene = data.loc[i,'gene']
        contig = data.loc[i,'subject'] 
        start = data.loc[i,'subjectStart']
        end = data.loc[i,'subjectEnd']
        seq = extract_seq('../data/genomes_gb/%s.fna'%c,contig, start, end)
        # check for early stop codons - these are likely nonfunctional and shouldn't be included
        if '*' in seq:
            print (seq)
            pseudogenes[c][gene]=seq
            # Remove the gene from list of unannotated genes
            unannotated-set([gene])
            
    
    print (c, unannotated)
    
    # For pertinent genes, retain those based off of nucleotide similarity within the orthology matrix and geneIDs matrix
    ortho_matrix_w_unannotated.loc[unannotated,c]=1
    for g in unannotated:
        geneIDs_matrix_w_unannotated.loc[g,c] = '%s_ortholog'%g
    

CU651637 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_24460', 'CIW80_12380', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_11190', 'CIW80_13695', 'CIW80_19755', 'CIW80_13245', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_25220', 'CIW80_24670', 'CIW80_19975', 'CIW80_24610', 'CIW80_22910', 'CIW80_25020', 'CIW80_25710', 'CIW80_05590', 'CIW80_11525', 'CIW80_25400', 'CIW80_12320', 'CIW80_11575', 'CIW80_18415', 'CIW80_06025', 'CIW80_00550', 'CIW80_12115', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_10710', 'CIW80_09230', 'CIW80_08165', 'CIW80_15195', 'CIW80_00290', 'CIW80_08180', 'CIW80_14460', 'CIW80_20505', 'CIW80_13915', 'CIW80_10530', 

CP000946 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_05650', 'CIW80_24680', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_13695', 'CIW80_19755', 'CIW80_19455', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_17790', 'CIW80_02595', 'CIW80_15595', 'CIW80_08460', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_25220', 'CIW80_24670', 'CIW80_19975', 'CIW80_01990', 'CIW80_25020', 'CIW80_02830', 'CIW80_11525', 'CIW80_17670', 'CIW80_25400', 'CIW80_12320', 'CIW80_18415', 'CIW80_06025', 'CIW80_01845', 'CIW80_00550', 'CIW80_00915', 'CIW80_12115', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_10710', 'CIW80_09230', 'CIW80_08165', 'CIW80_08180', 'CIW80_20505', 'CIW80_09400', 'CIW80_10530', 'CIW80_10590', 'CIW80_21745', 

CP001509 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_06735', 'CIW80_09805', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_02890', 'CIW80_08015', 'CIW80_05650', 'CIW80_24680', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_13695', 'CIW80_19755', 'CIW80_08720', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_10730', 'CIW80_16470', 'CIW80_15525', 'CIW80_25220', 'CIW80_24670', 'CIW80_22910', 'CIW80_25020', 'CIW80_03320', 'CIW80_11525', 'CIW80_18910', 'CIW80_25400', 'CIW80_12320', 'CIW80_18415', 'CIW80_06025', 'CIW80_00550', 'CIW80_12115', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_10710', 'CIW80_09230', 'CIW80_00310', 'CIW80_08180', 'CIW80_20505', 'CIW80_14885', 'CIW80_13915', 'CIW80_10530', 'CIW80_10590', 'CIW80_21745', 'CIW80_01355', 'CIW80_15895', 'CIW80_21580', 'CIW80_13215', 'CIW80_24665', 'CIW80_15310', 'CIW80_17415', 'CIW80_00055', 'CIW80_06600', 

CU928162 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_12380', 'CIW80_02890', 'CIW80_23630', 'CIW80_24325', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_13695', 'CIW80_08720', 'CIW80_13245', 'CIW80_00385', 'CIW80_22470', 'CIW80_10845', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_10730', 'CIW80_16470', 'CIW80_15525', 'CIW80_25220', 'CIW80_22910', 'CIW80_08855', 'CIW80_25710', 'CIW80_12375', 'CIW80_05590', 'CIW80_18910', 'CIW80_17670', 'CIW80_25400', 'CIW80_12320', 'CIW80_21645', 'CIW80_11010', 'CIW80_18415', 'CIW80_06025', 'CIW80_16060', 'CIW80_20075', 'CIW80_17750', 'CIW80_00210', 'CIW80_04210', 'CIW80_09230', 'CIW80_08165', 'CIW80_15195', 'CIW80_08180', 'CIW80_14460', 'CIW80_20505', 'CIW80_14885', 'CIW80_13915', 'CIW80_10530', 'CIW80_10590', 

AP009378 {'CIW80_03090', 'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_12380', 'CIW80_05650', 'CIW80_16405', 'CIW80_23165', 'CIW80_12135', 'CIW80_13695', 'CIW80_19755', 'CIW80_13245', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_08460', 'CIW80_10730', 'CIW80_16470', 'CIW80_15525', 'CIW80_03375', 'CIW80_25220', 'CIW80_24670', 'CIW80_23750', 'CIW80_22910', 'CIW80_25710', 'CIW80_11525', 'CIW80_18910', 'CIW80_18425', 'CIW80_25400', 'CIW80_07240', 'CIW80_12320', 'CIW80_21645', 'CIW80_18415', 'CIW80_03670', 'CIW80_06025', 'CIW80_00550', 'CIW80_12115', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_09230', 'CIW80_08165', 'CIW80_15195', 'CIW80_19045', 'CIW80_08180', 'CIW80_14460', 'CIW80_20505', 'CIW80_13915', 

CP002967 {'CIW80_12425', 'CIW80_22445', 'CIW80_19535', 'CIW80_09745', 'CIW80_21060', 'CIW80_06625', 'CIW80_06330', 'CIW80_24705', 'CIW80_17275', 'CIW80_12110', 'CIW80_25320', 'CIW80_18100', 'CIW80_14550', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_03040', 'CIW80_01885', 'CIW80_17170', 'CIW80_06440', 'CIW80_02730', 'CIW80_18605', 'CIW80_09860', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_10135', 'CIW80_06735', 'CIW80_09805', 'CIW80_06290', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_09875', 'CIW80_21265', 'CIW80_11290', 'CIW80_19580', 'CIW80_11810', 'CIW80_01570', 'CIW80_22430', 'CIW80_12380', 'CIW80_18005', 'CIW80_12390', 'CIW80_01150', 'CIW80_02890', 'CIW80_09780', 'CIW80_02920', 'CIW80_00205', 'CIW80_20540', 'CIW80_19990', 'CIW80_02865', 'CIW80_14985', 'CIW80_18110', 'CIW80_12435', 'CIW80_19600', 'CIW80_03890', 'CIW80_05650', 'CIW80_13175', 'CIW80_20905', 'CIW80_16405', 'CIW80_20555', 'CIW80_12135', 'CIW80_06155', 'CIW80_22985', 

CP001164 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_08630', 'CIW80_10835', 'CIW80_02905', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_12380', 'CIW80_07180', 'CIW80_24325', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_23165', 'CIW80_03525', 'CIW80_12135', 'CIW80_15680', 'CIW80_13695', 'CIW80_19755', 'CIW80_08720', 'CIW80_22470', 'CIW80_00300', 'CIW80_00715', 'CIW80_23170', 'CIW80_17790', 'CIW80_23115', 'CIW80_15595', 'CIW80_07515', 'CIW80_08460', 'CIW80_10730', 'CIW80_02835', 'CIW80_00920', 'CIW80_15525', 'CIW80_19145', 'CIW80_25220', 'CIW80_25510', 'CIW80_19975', 'CIW80_24610', 'CIW80_20310', 'CIW80_08855', 'CIW80_24975', 'CIW80_02830', 'CIW80_19150', 'CIW80_11760', 'CIW80_11525', 'CIW80_18910', 'CIW80_17670', 'CIW80_25400', 'CIW80_04825', 'CIW80_06025', 'CIW80_00550', 'CIW80_12115', 

AP010953 {'CIW80_03090', 'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_23695', 'CIW80_13695', 'CIW80_19755', 'CIW80_22470', 'CIW80_10845', 'CIW80_23170', 'CIW80_23115', 'CIW80_02595', 'CIW80_15595', 'CIW80_08310', 'CIW80_07255', 'CIW80_10730', 'CIW80_16470', 'CIW80_15525', 'CIW80_19145', 'CIW80_25220', 'CIW80_24670', 'CIW80_22910', 'CIW80_25020', 'CIW80_08855', 'CIW80_11525', 'CIW80_18910', 'CIW80_25400', 'CIW80_12320', 'CIW80_18415', 'CIW80_06025', 'CIW80_00550', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_10710', 'CIW80_16900', 'CIW80_15195', 'CIW80_03245', 'CIW80_08180', 'CIW80_20505', 'CIW80_10530', 'CIW80_10590', 'CIW80_21745', 'CIW80_21785', 'CIW80_01355', 'CIW80_15895', 'CIW80_21580', 'CIW80_13215', 

FM180568 {'CIW80_03090', 'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_12380', 'CIW80_23630', 'CIW80_24325', 'CIW80_24680', 'CIW80_13175', 'CIW80_16405', 'CIW80_05650', 'CIW80_12135', 'CIW80_13695', 'CIW80_19755', 'CIW80_08720', 'CIW80_13245', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_08310', 'CIW80_07255', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_25220', 'CIW80_24670', 'CIW80_19975', 'CIW80_12710', 'CIW80_22910', 'CIW80_25020', 'CIW80_25710', 'CIW80_02830', 'CIW80_11525', 'CIW80_18910', 'CIW80_25400', 'CIW80_07240', 'CIW80_12320', 'CIW80_18415', 'CIW80_17360', 'CIW80_06025', 'CIW80_00550', 'CIW80_12115', 'CIW80_20075', 'CIW80_00210', 'CIW80_04210', 'CIW80_22515', 'CIW80_10710', 'CIW80_09230', 'CIW80_17355', 

CP002729 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_08630', 'CIW80_02905', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_16235', 'CIW80_04425', 'CIW80_19530', 'CIW80_04490', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_02890', 'CIW80_09010', 'CIW80_05650', 'CIW80_24680', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_23695', 'CIW80_13475', 'CIW80_13695', 'CIW80_19755', 'CIW80_08720', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_17790', 'CIW80_23115', 'CIW80_02595', 'CIW80_15595', 'CIW80_08460', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_25220', 'CIW80_19975', 'CIW80_24610', 'CIW80_22910', 'CIW80_08855', 'CIW80_12375', 'CIW80_11525', 'CIW80_18910', 'CIW80_17670', 'CIW80_25400', 'CIW80_18415', 'CIW80_03670', 'CIW80_06025', 'CIW80_00550', 'CIW80_20075', 'CIW80_17750', 'CIW80_00210', 'CIW80_04210', 'CIW80_13030', 'CIW80_10710', 

CP000247 {'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_23630', 'CIW80_03475', 'CIW80_19205', 'CIW80_05650', 'CIW80_23420', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_23325', 'CIW80_23555', 'CIW80_13695', 'CIW80_13245', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_17790', 'CIW80_23360', 'CIW80_15595', 'CIW80_08460', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_19145', 'CIW80_25220', 'CIW80_25710', 'CIW80_18945', 'CIW80_02830', 'CIW80_05590', 'CIW80_11525', 'CIW80_18910', 'CIW80_17670', 'CIW80_25400', 'CIW80_12320', 'CIW80_18415', 'CIW80_06025', 'CIW80_01845', 'CIW80_00550', 'CIW80_23625', 'CIW80_12115', 'CIW80_08665', 'CIW80_20075', 'CIW80_17750', 'CIW80_00210', 'CIW80_04210', 'CIW80_02815', 'CIW80_08165', 'CIW80_16900', 

CU928164 {'CIW80_03090', 'CIW80_22445', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_04745', 'CIW80_11570', 'CIW80_16065', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_12160', 'CIW80_01570', 'CIW80_12380', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_23165', 'CIW80_12135', 'CIW80_13695', 'CIW80_19755', 'CIW80_08720', 'CIW80_00385', 'CIW80_22470', 'CIW80_19765', 'CIW80_10845', 'CIW80_23170', 'CIW80_17790', 'CIW80_02595', 'CIW80_15595', 'CIW80_08460', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_15525', 'CIW80_25220', 'CIW80_24610', 'CIW80_22910', 'CIW80_08790', 'CIW80_18945', 'CIW80_12375', 'CIW80_02830', 'CIW80_10915', 'CIW80_11525', 'CIW80_17670', 'CIW80_18425', 'CIW80_25400', 'CIW80_07240', 'CIW80_12320', 'CIW80_21645', 'CIW80_11010', 'CIW80_18415', 'CIW80_08675', 'CIW80_06025', 'CIW80_00550', 'CIW80_16060', 'CIW80_12115', 'CIW80_08665', 'CIW80_20075', 'CIW80_17750', 

CP001063 {'CIW80_22445', 'CIW80_12600', 'CIW80_13615', 'CIW80_08295', 'CIW80_15510', 'CIW80_05090', 'CIW80_04745', 'CIW80_11570', 'CIW80_11350', 'CIW80_09205', 'CIW80_10835', 'CIW80_09165', 'CIW80_07705', 'CIW80_09195', 'CIW80_10280', 'CIW80_09630', 'CIW80_22935', 'CIW80_05030', 'CIW80_05800', 'CIW80_14390', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_25340', 'CIW80_25760', 'CIW80_01570', 'CIW80_24000', 'CIW80_12380', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_12135', 'CIW80_04750', 'CIW80_13695', 'CIW80_08720', 'CIW80_22470', 'CIW80_16290', 'CIW80_06185', 'CIW80_02595', 'CIW80_15595', 'CIW80_11395', 'CIW80_22170', 'CIW80_01280', 'CIW80_08460', 'CIW80_18375', 'CIW80_10730', 'CIW80_02835', 'CIW80_15525', 'CIW80_12540', 'CIW80_25220', 'CIW80_25510', 'CIW80_17330', 'CIW80_22910', 'CIW80_04445', 'CIW80_22945', 'CIW80_08855', 'CIW80_09895', 'CIW80_24975', 'CIW80_02830', 'CIW80_04555', 'CIW80_02715', 'CIW80_20670', 'CIW80_01900', 'CIW80_11525', 'CIW80_18910', 

CP001383 {'CIW80_03090', 'CIW80_00445', 'CIW80_22445', 'CIW80_01660', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_05090', 'CIW80_04745', 'CIW80_11570', 'CIW80_07510', 'CIW80_02095', 'CIW80_20975', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_06735', 'CIW80_09805', 'CIW80_04425', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_01570', 'CIW80_12380', 'CIW80_07715', 'CIW80_07180', 'CIW80_09010', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_23165', 'CIW80_12135', 'CIW80_04405', 'CIW80_14770', 'CIW80_15680', 'CIW80_13695', 'CIW80_22470', 'CIW80_23170', 'CIW80_02595', 'CIW80_15595', 'CIW80_08525', 'CIW80_22170', 'CIW80_10730', 'CIW80_16470', 'CIW80_02835', 'CIW80_03375', 'CIW80_00180', 'CIW80_12345', 'CIW80_25220', 'CIW80_24670', 'CIW80_19975', 'CIW80_06545', 'CIW80_15520', 'CIW80_24975', 'CIW80_18945', 'CIW80_02830', 'CIW80_16500', 'CIW80_12495', 'CIW80_18910', 'CIW80_25400', 'CIW80_12320', 'CIW80_14555', 'CIW80_09250', 'CIW80_00330', 'CIW80_25625', 

CP000038 {'CIW80_11210', 'CIW80_22445', 'CIW80_01660', 'CIW80_06330', 'CIW80_08295', 'CIW80_15510', 'CIW80_15855', 'CIW80_14785', 'CIW80_04745', 'CIW80_11570', 'CIW80_10835', 'CIW80_10280', 'CIW80_09630', 'CIW80_05030', 'CIW80_05800', 'CIW80_06735', 'CIW80_09805', 'CIW80_01880', 'CIW80_04490', 'CIW80_25340', 'CIW80_25760', 'CIW80_01570', 'CIW80_24000', 'CIW80_12380', 'CIW80_07195', 'CIW80_05650', 'CIW80_13175', 'CIW80_16405', 'CIW80_08610', 'CIW80_12135', 'CIW80_04750', 'CIW80_15680', 'CIW80_13695', 'CIW80_08720', 'CIW80_22470', 'CIW80_10845', 'CIW80_16290', 'CIW80_11520', 'CIW80_02595', 'CIW80_15980', 'CIW80_15595', 'CIW80_08310', 'CIW80_22170', 'CIW80_07030', 'CIW80_10730', 'CIW80_16470', 'CIW80_15525', 'CIW80_12345', 'CIW80_25220', 'CIW80_25510', 'CIW80_19975', 'CIW80_04685', 'CIW80_06080', 'CIW80_22910', 'CIW80_15520', 'CIW80_08790', 'CIW80_12375', 'CIW80_11525', 'CIW80_25400', 'CIW80_12320', 'CIW80_21645', 'CIW80_18415', 'CIW80_17425', 'CIW80_00550', 'CIW80_06555', 'CIW80_12115', 

In [18]:
#Save the Presence/Absence Matrix and geneIDs Matrix for future use
ortho_matrix_w_unannotated.to_csv('../tables/ortho_matrix_unan.csv')
geneIDs_matrix_w_unannotated.to_csv('../tables/geneIDs_matrix_unan.csv')