# NifH Gene-Finder Analysis

This notebook provides a standard workflow for downstream analysis of the outputs from the gene-finder Snakemake workflow. That workflow builds a profile from our reference gene alignment and uses HMMER searches to compare the profile against the Tara Oceans metagenome assemblies for the larger size fractions.

Current functions:
1. evalueglob: concatenates all the e-value tables output by hmmsearch for a given gene into one data frame with a sample_id column, saves the data frame to CSV, and returns it.
2. threshold_cutoff: keeps only the hits with e-values at or below a threshold and returns them as a data frame. The threshold may need to be chosen by manually checking sequences near the cutoff against the NCBI protein database with BLAST.
3. getfasta2: extracts the FASTA sequences for the hits that meet the threshold above and writes them to a single output FASTA file.
4. filter_length: keeps the sequences at or above a minimum length and writes them to a new FASTA file.

In [1]:
import pandas as pd

In [2]:
import glob

In [3]:
import matplotlib.pyplot as plt
import os


In [4]:
from Bio import SeqIO


In [5]:
# IMPORTANT OBJECT DEFINITIONS (can be changed as desired)
# Note: the paths below hard-code the nifH directory, so update them along with gene.

# gene name; edit for each gene of interest
gene = "nifH"

# glob pattern for the gene-specific e-value tables written by the Snakemake hmmsearch runs
evalue_tables_path = "/vortexfs1/omics/alexander/lblum/tara_gene_finder/results/hmm_results/nifH/*.tab"

# folder holding the prodigal_proteins data
proteinpath = "/vortexfs1/omics/alexander/lblum/tara_gene_finder/data/prodigal_proteins"

# gene-specific folder for the outputs of this notebook's analysis of the Snakemake hmmsearch results
output_folder = "/vortexfs1/omics/alexander/lblum/tara_gene_finder/jupyter_notebooks/outputs/nifH/"

# FASTA file that receives the extracted gene sequences for the iterative alignment/gene-finder search
extracted_hits_path = os.path.join(output_folder, gene + "_extracted_hits.fasta")

# glob pattern for the gene-specific hit sequence files in hmm_results, in this case nifH
hits_results_path = "/vortexfs1/omics/alexander/lblum/tara_gene_finder/results/hmm_results/nifH/*.faa"

In [6]:
# read in the sample info file, which contains the assigned size fraction for each sample
sample_info=pd.read_csv('/vortexfs1/omics/alexander/lblum/tara_gene_finder/data/metadata/SampleList_ForAssembly_metaG_python.txt', sep="\t")

In [7]:
def evalueglob(globfolder):
    """Combine all e-value tables matching a glob pattern into one data frame.

    globfolder: glob pattern (with a wildcard) that selects the .eval.tab files
    in the results folder for the gene of interest.

    Each table gets a sample_id column taken from its file name. The combined
    table is written to <gene>_eval_table.csv in output_folder and returned.
    Relies on gene and output_folder being defined at the top of the notebook.
    """
    data = []  # one data frame per table; combined with pd.concat below
    eval_files = glob.glob(globfolder)
    for tab in eval_files:
        frame = pd.read_csv(tab, sep='\t')  # the tables are tab-separated
        # new column holding the file's base name without the .eval.tab suffix
        frame['sample_id'] = os.path.basename(tab).replace(".eval.tab", "")
        data.append(frame)

    eval_table = pd.concat(data, ignore_index=True)
    eval_table.to_csv(os.path.join(output_folder, gene + "_eval_table.csv"))
    return eval_table
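
The concatenation pattern can be checked on a couple of throwaway tables before pointing it at the real results. The sketch below is an illustration only: the helper name `combine_tables`, the file names, and the table contents are made up, and it assumes each table is tab-separated with a header row and a `.eval.tab` suffix, as `evalueglob` does.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_tables(pattern, suffix=".eval.tab"):
    """Concatenate every table matching `pattern`, tagging rows with a sample id."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path, sep="\t")
        # Sample id = file name without the (assumed) ".eval.tab" suffix.
        frame["sample_id"] = os.path.basename(path).replace(suffix, "")
        frames.append(frame)
    if not frames:
        # pd.concat fails on an empty list, so report the empty match explicitly.
        raise FileNotFoundError(f"no files match {pattern!r}")
    return pd.concat(frames, ignore_index=True)


# Two hypothetical tables written to a temporary directory.
tmpdir = tempfile.mkdtemp()
toy_tables = {
    "SAMPLE_A.eval.tab": "contig_id\te-value\nc1\t1e-50\nc2\t1e-10\n",
    "SAMPLE_B.eval.tab": "contig_id\te-value\nc3\t1e-45\n",
}
for name, text in toy_tables.items():
    with open(os.path.join(tmpdir, name), "w") as handle:
        handle.write(text)

combined = combine_tables(os.path.join(tmpdir, "*.eval.tab"))
print(combined)
```

Raising an explicit error when the pattern matches nothing is a design choice of this sketch; it makes a mistyped path easier to spot than the generic error from `pd.concat`.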
        

In [8]:
def threshold_cutoff(df, column='e-value', evalue=10 ** -40):
    """
    Filter the concatenated e-value table, keeping the rows whose value in
    column is at or below evalue. Both column and evalue have defaults;
    the e-value threshold can be changed as needed.
    """
    outdf = df[df[column] <= evalue]
    return outdf
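
To see what the filter keeps, it can be applied to a small made-up table. In this sketch only the column names mirror the notebook; the contig ids and e-values are hypothetical. One row has an e-value of 0.0, since very strong hits can be reported as zero, and such rows pass any positive cutoff.

```python
import pandas as pd

# Hypothetical hits; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "contig_id": ["c0", "c1", "c2", "c3", "c4"],
        "e-value": [0.0, 1e-60, 1e-45, 3e-38, 1e-5],
    }
)

cutoff = 1e-40
# Boolean mask: True where the e-value is at or below the cutoff.
kept = toy[toy["e-value"] <= cutoff]
print(kept["contig_id"].tolist())
```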

In [9]:
def getfasta2(hitsfolder, dataframe):
    """
    Extract sequences from the hits.faa files produced by the Snakemake hmmsearch runs.

    hitsfolder: glob pattern for the hits files (hits_results_path, defined at the top)
    dataframe: table of hits passing the cutoff (threshold_hits), with a contig_id column

    Writes every matching sequence to one FASTA file, <gene>_extracted_hits.fasta, in output_folder.
    Uses glob and SeqIO, which are imported at the top of the notebook.
    """
    files_list = glob.glob(hitsfolder)  # all hits files matching the pattern
    wanted_ids = dataframe["contig_id"].unique()  # look up each id once, even if it has several rows

    # "w" overwrites any existing file; the with block closes it so every record reaches disk
    with open(os.path.join(output_folder, gene + "_extracted_hits.fasta"), "w") as output_handle:
        for file_name in files_list:
            records = SeqIO.index(file_name, "fasta")  # index the file by record id
            for i in wanted_ids:
                if i in records:  # the id is present in this file, so write its sequence
                    SeqIO.write(records[i], output_handle, 'fasta')
            records.close()
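
The extraction logic can also be tried without the results files. The sketch below is not the notebook's method: it swaps Biopython for a small hand-written FASTA reader so it runs on an in-memory string, and the names `read_fasta` and `extract_hits` and the toy sequences are hypothetical. It streams each record once and tests membership in a set of wanted ids.

```python
import io


def read_fasta(handle):
    """Yield (record_id, header, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header.split(" ", 1)[0], header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        # Emit the final record, which has no following ">" line.
        yield header.split(" ", 1)[0], header, "".join(chunks)


def extract_hits(handle, wanted_ids, out_handle):
    """Copy records whose id is in `wanted_ids`; return how many were written."""
    wanted = set(wanted_ids)
    written = 0
    for record_id, header, sequence in read_fasta(handle):
        if record_id in wanted:
            out_handle.write(f">{header}\n{sequence}\n")
            written += 1
    return written


# Hypothetical input: three records, one of them wrapped over two lines.
toy_fasta = io.StringIO(">c1 nifH-like\nMAMRQ\nCAIYG\n>c2\nMKK\n>c3 other\nMTT\n")
out = io.StringIO()
n_written = extract_hits(toy_fasta, ["c1", "c3", "c9"], out)
print(n_written)
print(out.getvalue())
```

Ids that appear in the table but not in the file (here `c9`) are skipped silently, which matches the behaviour of the `if i in records` check.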

In [10]:
def filter_length(filename, min_length=160):
    """
    Filter a FASTA file by sequence length using SeqIO from Biopython (imported at the top of the notebook).

    filename: FASTA file to filter, normally the extracted_hits.fasta file written by getfasta2
    min_length: minimum sequence length to keep (default 160)

    Sequences at least min_length long are saved to <gene>_length_filtered_hits.fasta in output_folder.
    """
    out_path = os.path.join(output_folder, gene + "_length_filtered_hits.fasta")
    # the with block closes both files even if parsing fails
    with open(filename) as input_handle, open(out_path, "w") as output_handle:
        # keep only the records that meet the length requirement
        long_sequences = [record for record in SeqIO.parse(input_handle, "fasta")
                          if len(record.seq) >= min_length]
        print("Found long sequences")
        # write the retained records to the new FASTA file in the outputs folder
        SeqIO.write(long_sequences, output_handle, "fasta")
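
The cutoff used above is 160 residues. One way to judge a cutoff before committing to it is to count how many sequences each candidate value would retain. This is a minimal sketch with made-up lengths; `count_retained` is a hypothetical helper, not part of the notebook.

```python
def count_retained(lengths, cutoffs):
    """Return {cutoff: number of sequences with length >= cutoff}."""
    return {
        cutoff: sum(length >= cutoff for length in lengths)
        for cutoff in cutoffs
    }


# Hypothetical sequence lengths, e.g. collected with len(record.seq).
toy_lengths = [95, 158, 160, 212, 287, 296]
retained = count_retained(toy_lengths, cutoffs=[100, 160, 250])
print(retained)
```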

In [11]:
# evalue_tables_path is defined at the top of the notebook.
# Call evalueglob with that pattern to write a CSV of all the concatenated e-value tables with sample ids.
# Assign the return value to eval_table so the next function can use it (otherwise the table is only written to CSV).
eval_table = evalueglob(evalue_tables_path)

In [12]:
# threshold_hits = full_eval_table[full_eval_table['e-value'] <= 1 * 10 ** -40]
# The line above was the model for threshold_cutoff, which builds a data frame of the hits passing the cutoff.
# A column can be referenced as df.name or df['name']; 'e-value' needs the bracket form because of the hyphen.


In [13]:
threshold_hits = threshold_cutoff(eval_table, evalue=10 ** -40)
# Call threshold_cutoff on a data frame (eval_table, as returned by evalueglob).
# A different e-value than the default can be passed here.
# The result, threshold_hits, is used for the summary table and for getfasta2 later on.

In [14]:
threshold_hits.to_csv(os.path.join(output_folder, gene + "_threshold_hits_table.csv"), index=False)
# This table will be needed later for searching within MAGs, as it links sequence id to sample id.

In [15]:
near_cutoff = eval_table[(eval_table['e-value'] >= 1 * 10 ** -43) & (eval_table['e-value'] <= 1 * 10 ** -30)]
# Hits with e-values between 1e-43 and 1e-30, on either side of the 1e-40 threshold.
# Check these sequences manually with BLAST to confirm that the threshold filters reliably.

In [16]:
# number of hits passing the threshold in each sample
summary_table = pd.DataFrame(data=threshold_hits['sample_id'].value_counts())

In [17]:
#sample_info read in at top
#merge the sample info to the threshold hits table
threshold_with_info= threshold_hits.merge(sample_info, left_on='sample_id', right_on='Assembly_group')
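
Because `merge` performs an inner join by default, hits whose `sample_id` has no matching `Assembly_group` are dropped without warning. The sketch below shows one way to check for that using a left join with `indicator=True`. The data frames and the `size_fraction` column name are hypothetical; only `sample_id`, `contig_id`, and `Assembly_group` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for threshold_hits and sample_info.
hits = pd.DataFrame(
    {"contig_id": ["c1", "c2", "c3"], "sample_id": ["S1", "S1", "S2"]}
)
info = pd.DataFrame(
    {"Assembly_group": ["S1", "S3"], "size_fraction": ["180-2000", "20-180"]}
)

# Default (inner) merge: the S2 hit disappears because info has no S2 row.
inner = hits.merge(info, left_on="sample_id", right_on="Assembly_group")

# Left merge with an indicator column keeps every hit and flags the unmatched ones.
checked = hits.merge(
    info,
    how="left",
    left_on="sample_id",
    right_on="Assembly_group",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "sample_id"].unique().tolist()
print(len(inner), unmatched)
```

If `unmatched` is empty, the inner merge used in the notebook loses no hits.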

In [18]:
getfasta2(hits_results_path, threshold_hits)

In [19]:
# Run filter_length on the file of extracted hits that met the e-value cutoff.
# Sequences meeting the length requirement are saved to a new length_filtered_hits.fasta file.
# extracted_hits_path is defined at the top of the notebook and points to the extracted_hits.fasta file that getfasta2 wrote to this gene's outputs folder.
filter_length(extracted_hits_path)

Found long sequences
