A Python script for retrieving the sequences of a given gene from a set of annotated genomes. The output is a fasta file (named after the gene) whose headers give the assembly accession, the species name and strain, and the start and end base pair of the gene.

How it works:
A. Create a dataframe of metadata regarding the genes 
1. iterate through all gff files and find the start and stop base pair number of the desired gene 
2. input start and stop in different columns in a dataframe, on the corresponding row (with species name, strain, accession number) 

B. Gather the sequences and input into dataframe
1. unzip all fasta files in the directory
2. for each fasta file, take the substring between the start and stop index (don't include the header)
3. input that string into corresponding row of dataframe, using the accession number as a key  

C. Output dataframe into a formatted fasta file 
1. create a header based on row in dataframe 
2. append the gene sequence 
3. create new line for new row in df
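
To make step C concrete, here is a minimal sketch of how one header line could be assembled. The field order and the `|` separator follow the `to_fasta` function further down in this notebook, and the row values are copied from one of the output tables below; `make_header` and `example_row` are hypothetical names used only for illustration.

```python
def make_header(row):
    """Build a fasta header from one metadata row (a plain dict here)."""
    fields = ["assembly_accession", "organism_name", "strain",
              "index", "start", "stop"]
    # Spaces are replaced with "-" so that each field stays a single token
    return ">" + "|".join(str(row[f]).replace(" ", "-") for f in fields)


example_row = {
    "assembly_accession": "GCA_900095815.1",
    "organism_name": "Methanothermobacter wolfeii",
    "strain": "SIV6",
    "index": 1,
    "start": 1411589,
    "stop": 1413253,
}

print(make_header(example_row))
# prints >GCA_900095815.1|Methanothermobacter-wolfeii|SIV6|1|1411589|1413253
```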

In [1]:
#Check that we are in the correct working directory
import os 
print(os.getcwd())
import sys

/Users/gracedai/Documents/MethanogeneDB


In [3]:
pip install sh

Collecting sh
  Downloading sh-2.0.7-py3-none-any.whl (38 kB)
Installing collected packages: sh
Successfully installed sh-2.0.7

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
os.chdir("/Users/gracedai/Documents/MethanogeneDB/prokka_out")

In [24]:
# list to store paths to gff files
gff_files = []
# iterate through current working directory and for each file 
# append the gff path to a list 
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".gff"):
            gff_files.append(os.path.join(root, file))
print(gff_files)
len(gff_files)

['/Users/gracedai/Documents/MethanogeneDB/GCA_000970225.1/GCF_000970225.1_ASM97022v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000092305.1/GCF_000092305.1_ASM9230v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000970165.1/GCF_000970165.1_ASM97016v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_024170265.1/GCF_024170265.1_ASM2417026v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_002502925.1/GCA_002502925.1.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000091665.1/GCF_000091665.1_ASM9166v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_020885915.1/GCF_020885915.1_ASM2088591v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_017535165.1/GCA_017535165.1_ASM1753516v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000166095.1/GCF_000166095.1_ASM16609v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_017412935.1/GCA_017412935.1_ASM1741293v1_genomic.gff', '/Users/gracedai/Documents/Metha

275

In [25]:
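# store each gff path relative to the working directory that was walked above,
# instead of cutting a fixed number of characters off the absolute path
gff_paths = []
for file in gff_files:
    gff_paths.append(os.path.relpath(file, os.getcwd()))
print(gff_paths)
len(gff_paths)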

['GCA_000970225.1/GCF_000970225.1_ASM97022v1_genomic.gff', 'GCA_000092305.1/GCF_000092305.1_ASM9230v1_genomic.gff', 'GCA_000970165.1/GCF_000970165.1_ASM97016v1_genomic.gff', 'GCA_024170265.1/GCF_024170265.1_ASM2417026v1_genomic.gff', 'GCA_002502925.1/GCA_002502925.1.gff', 'GCA_000091665.1/GCF_000091665.1_ASM9166v1_genomic.gff', 'GCA_020885915.1/GCF_020885915.1_ASM2088591v1_genomic.gff', 'GCA_017535165.1/GCA_017535165.1_ASM1753516v1_genomic.gff', 'GCA_000166095.1/GCF_000166095.1_ASM16609v1_genomic.gff', 'GCA_017412935.1/GCA_017412935.1_ASM1741293v1_genomic.gff', 'GCA_002508885.1/GCA_002508885.1.gff', 'GCA_029633965.1/GCF_029633965.1_ASM2963396v1_genomic.gff', 'GCA_900114585.1/GCF_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic.gff', 'GCA_000744315.1/GCF_000744315.1_ASM74431v1_genomic.gff', 'GCA_028743435.1/GCF_028743435.1_ASM2874343v1_genomic.gff', 'GCA_001477655.1/GCF_001477655.1_ASM147765v1_genomic.gff', 'GCA_002356395.1/GCA_002356395.1_ASM235639v1_genomic.gff', 'GCA_03016

275

In [6]:
import pandas as pd
from sh import gunzip

# Create a database of metadata regarding the genes
GFF File Format: 
- Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'
[0] seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
[1] source - name of the program that generated this feature, or the data source (database or project name)
[2] feature - feature type name, e.g. Gene, Variation, Similarity
[3] start - Start position of the feature, with sequence numbering starting at 1.
[4] end - End position of the feature (inclusive), with sequence numbering starting at 1.
[5] score - A floating point value.
[6] strand - defined as + (forward) or - (reverse).
[7] frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
[8] attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
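
The column layout above can be sketched as a small parser that turns one feature line into named fields. This is only an illustration: `GffFeature` and `parse_gff_line` are hypothetical names, and the example line is made up rather than taken from one of the gff files used here.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int      # 1-based
    end: int        # 1-based, inclusive
    score: str      # kept as text because "." means "no value"
    strand: str
    frame: str
    attribute: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Return a GffFeature, or None for comments, blank lines and malformed rows."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    return GffFeature(fields[0], fields[1], fields[2],
                      int(fields[3]), int(fields[4]),
                      fields[5], fields[6], fields[7], fields[8])


# Made-up example line (not from the real data)
example = "contig_1\tprokka\trRNA\t100\t1574\t.\t+\t.\tID=x_1;product=16S ribosomal RNA"
feat = parse_gff_line(example)
print(feat.feature, feat.start, feat.end, feat.strand)
```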

In [22]:
gff_paths = ['GCA_902775675.1/GCA_902775675.1.gff', 'GCA_902789165.1/GCA_902789165.1.gff', 'GCA_902757365.1/GCA_902757365.1.gff', 'GCA_002509015.1/GCA_002509015.1.gff', 'GCA_001548675.1/GCF_001548675.1_ASM154867v1_genomic.gff', 'GCA_001477655.1/GCF_001477655.1_ASM147765v1_genomic.gff', 'GCA_003814835.1/GCF_003814835.1_ASM381483v1_genomic.gff', 'GCA_000024185.1/GCF_000024185.1_ASM2418v1_genomic.gff', 'GCA_001563245.1/GCF_001563245.1_ASM156324v1_genomic.gff', 'GCA_902775725.1/GCA_902775725.1.gff']

In [29]:
import pandas as pd

# Get the start and stop of the given gene in every gff file.
# Rows are collected in a list and turned into a dataframe once at the end,
# which avoids copying the dataframe with pd.concat on every hit.
rows = []

# iterate through the GFF files
for path_name in gff_paths:
    accession = path_name[:15]  # assembly accession = first 15 characters of the path, e.g. GCA_001548675.1
    gene_count = 0              # number of times the gene has shown up in this assembly

    with open(path_name, "r") as file:
        for line in file:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            if len(fields) < 9:   # not a 9-column feature line
                continue
            # Specify the desired gene. This is a plain substring test on the whole
            # line, so any line that mentions both "16S" and "gene" is accepted.
            if "16S" in line and "gene" in line:
                gene_count += 1
                rows.append({"assembly_accession": accession,
                             "index": gene_count,
                             "start": int(fields[3]),
                             "stop": int(fields[4])})

# create dataframe to store the output
df = pd.DataFrame(rows, columns=["assembly_accession", "index", "start", "stop"])
df

Unnamed: 0,assembly_accession,index,start,stop
0,GCA_001548675.1,1,169708,170580
1,GCA_001548675.1,2,492626,494101
2,GCA_001548675.1,3,710846,712321
3,GCA_001477655.1,1,539355,540829
4,GCA_001477655.1,2,785166,786640
5,GCA_001477655.1,3,2409376,2410248
6,GCA_003814835.1,1,482634,483506
7,GCA_003814835.1,2,728669,728892
8,GCA_003814835.1,3,84110,84196
9,GCA_003814835.1,4,2709,4059
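
Rows 7 and 8 of the table above span only 224 and 87 bp, far shorter than the other hits, which may mean the substring test also picks up features that are not a full-length 16S rRNA gene. A stricter filter could look at the feature type and the `product` attribute instead. The sketch below assumes the annotation marks these genes as `rRNA` features with `product=16S ribosomal RNA`; that convention is an assumption and should be checked against the actual gff files. `parse_attributes` and `is_16s_rrna` are hypothetical helper names, and both example lines are made up.

```python
def parse_attributes(attribute_field):
    """Split a 'key=value;key=value' attribute column into a dict."""
    attrs = {}
    for pair in attribute_field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def is_16s_rrna(fields):
    """True for an rRNA feature whose product is a 16S ribosomal RNA.

    Assumes `fields` is one gff line split on tabs.
    """
    if len(fields) < 9 or fields[2] != "rRNA":
        return False
    product = parse_attributes(fields[8]).get("product", "")
    return "16S ribosomal RNA" in product


# Made-up example lines: the second one passes the substring test
# ("16S" and "gene" both occur in it) but is not an rRNA feature.
rrna_line = "contig_1\tprokka\trRNA\t100\t1574\t.\t+\t.\tID=x_1;product=16S ribosomal RNA"
cds_line = "contig_1\tprokka\tCDS\t200\t900\t.\t-\t0\tID=x_2;gene=rsmA;product=16S rRNA methyltransferase"

for line in (rrna_line, cds_line):
    loose = "16S" in line and "gene" in line
    strict = is_16s_rrna(line.split("\t"))
    print(loose, strict)
```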


In [62]:
#insert columns with organism and strain name, matched on the assembly accession
species = pd.read_csv('/Users/gracedai/Downloads/both.csv')
species = species[['assembly_accession', 'organism_name', 'strain']]
df = pd.merge(df, species, on='assembly_accession', how='left')
df

Unnamed: 0,assembly_accession,index,start,stop,organism_name,strain
0,GCA_902762095.1,1,13119,14774,uncultured Methanobrevibacter,RUG10736
1,GCA_000953115.1,1,1073810,1075462,Methanobacterium formicicum,DSM1535
2,GCA_900095815.1,1,1411589,1413253,Methanothermobacter wolfeii,SIV6
3,GCA_902795935.1,1,2874,4529,uncultured Methanobrevibacter,RUG14110
4,GCA_000499765.1,1,1786063,1787715,Methanobacterium,MB1


# Gather gene sequences and put into database

In [6]:
#os.chdir("/Users/gracedai/Documents/asmg labs files/fna sequences")

In [50]:
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".gz"):
            #os.remove(os.path.join(root, file))
            gunzip(os.path.join(root, file))
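
If installing `sh` or relying on a system `gunzip` is not an option, the same step can be sketched with the standard library only. `gunzip_tree` is a hypothetical helper name; unlike `gunzip`, this version leaves the original `.gz` file in place and skips files that have already been decompressed.

```python
import gzip
import os
import shutil


def gunzip_tree(top):
    """Decompress every *.gz file under `top` next to the original.

    Returns the list of files that were written.
    """
    written = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            if not name.endswith(".gz"):
                continue
            src = os.path.join(root, name)
            dst = src[:-3]              # drop the ".gz" suffix
            if os.path.exists(dst):
                continue                # already decompressed
            with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            written.append(dst)
    return written
```

It would be called as `gunzip_tree(os.getcwd())` to mirror the cell above.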

In [12]:
os.getcwd()

'/Users/gracedai/Documents/MethanogeneDB/MAGS'

In [63]:
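# Function to find fasta file in a folder and extract the gene sequence
def get_sequence(folder_name, start, stop):
    """Return bases start..stop (1-based, inclusive) from the first .fna file in folder_name.

    All records in the file are joined into one string, so the coordinates are
    only reliable when the gene lies on the first (or only) sequence in the file.
    The strand is not taken into account. Returns None if there is no .fna file.
    """
    fasta_file = [file for file in os.listdir(folder_name) if file.endswith(".fna")]
    if not fasta_file:
        return None
    fasta_path = os.path.join(folder_name, fasta_file[0])
    with open(fasta_path, "r") as file:
        # skip header lines (starting with ">") and join the rest into one string
        sequence = "".join(line.strip() for line in file if not line.startswith(">"))
    return sequence[start-1:stop]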
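
GFF coordinates are relative to the sequence named in column [0], so for assemblies with several contigs a per-record lookup may be safer than slicing one joined string. The sketch below is one possible approach, not the method used in this notebook: it would require keeping the seqname and strand columns from the gff file, which the dataframe above does not store. `read_fasta`, `reverse_complement` and `extract_feature` are hypothetical names.

```python
def read_fasta(path):
    """Return {record_id: sequence} for a (multi-)fasta file."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                current = line[1:].split()[0]   # ID = text up to the first space
                records[current] = []
            elif current is not None:
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Only A/C/G/T are complemented; other symbols (e.g. N) are left unchanged
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(records, seqname, start, stop, strand="+"):
    """Slice start..stop (1-based, inclusive) from the named record."""
    seq = records.get(seqname)
    if seq is None:
        return None
    sub = seq[start - 1:stop]
    return reverse_complement(sub) if strand == "-" else sub
```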

In [42]:
#test function
# out = get_sequence("GCA_001563305.1", 10, 20)
# print(out)

In [64]:
#get the fasta files and sequences, put them in a column and merge with previous df
# store the gene sequences
sequences = []

for _, row in df.iterrows():
    folder_name = row["assembly_accession"]
    #print(folder_name)
    start, stop = row["start"], row["stop"]
    sequence = get_sequence(folder_name, start, stop) #extract the gene sequence using function
    sequences.append(sequence)

# add column of sequences to existing dataframe
df["Sequence"] = sequences
df

Unnamed: 0,assembly_accession,index,start,stop,organism_name,strain,Sequence
0,GCA_902762095.1,1,13119,14774,uncultured Methanobrevibacter,RUG10736,
1,GCA_000953115.1,1,1073810,1075462,Methanobacterium formicicum,DSM1535,
2,GCA_900095815.1,1,1411589,1413253,Methanothermobacter wolfeii,SIV6,TTATTTTCCAGCAGGTATGATGAGGTCCCTTTCTCCTTCCGGGATG...
3,GCA_902795935.1,1,2874,4529,uncultured Methanobrevibacter,RUG14110,TCTTAAAGTAACACCATTACCAATCAATCTTCTACCATCAGAACAT...
4,GCA_000499765.1,1,1786063,1787715,Methanobacterium,MB1,ATGAATAACGAAAAAAAGCTCTTCTTAAAAGCTTTACAAAACAAAT...


In [65]:
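# make sure organism names are strings so that rows with a missing name can be sorted too
df['organism_name'] = df['organism_name'].astype(str)
# rename an 'Isolate' column to 'strain' if the metadata has one (no effect otherwise)
df = df.rename(columns={'Isolate': 'strain'})
df_sorted = df.sort_values('organism_name', ascending=True)
df_sorted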

Unnamed: 0,assembly_accession,index,start,stop,organism_name,strain,Sequence
4,GCA_000499765.1,1,1786063,1787715,Methanobacterium,MB1,ATGAATAACGAAAAAAAGCTCTTCTTAAAAGCTTTACAAAACAAAT...
1,GCA_000953115.1,1,1073810,1075462,Methanobacterium formicicum,DSM1535,
2,GCA_900095815.1,1,1411589,1413253,Methanothermobacter wolfeii,SIV6,TTATTTTCCAGCAGGTATGATGAGGTCCCTTTCTCCTTCCGGGATG...
0,GCA_902762095.1,1,13119,14774,uncultured Methanobrevibacter,RUG10736,
3,GCA_902795935.1,1,2874,4529,uncultured Methanobrevibacter,RUG14110,TCTTAAAGTAACACCATTACCAATCAATCTTCTACCATCAGAACAT...


In [67]:
# export locally to a csv
df_sorted.to_csv('mrtA_seq2.csv')

# Output dataframes to a formatted fasta file 

In [47]:
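#out_fasta should be "name.fasta" to ensure the output file is in fasta format
def to_fasta(df, out_fasta):
    df = df.copy()  # work on a copy so the caller's dataframe is not modified
    # replace spaces so that each header field is a single token
    df["organism_name"] = df["organism_name"].astype(str).str.replace(" ", "-", regex=False)
    df["strain"] = df["strain"].astype(str).str.replace(" ", "-", regex=False)
    with open(out_fasta, 'w') as file:
        for _, row in df.iterrows():
            sequence = row["Sequence"]
            if not isinstance(sequence, str) or not sequence:
                continue  # no sequence was extracted for this row, so skip it
            header = f'>{row["assembly_accession"]}|{row["organism_name"]}|{row["strain"]}|{row["index"]}|{row["start"]}|{row["stop"]}'
            file.write(f'{header}\n{sequence}\n')
    # the with block closes the file, so no explicit close() is needed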

In [58]:
#export to local fasta file
to_fasta(df_sorted, "mrtG_seqs2.fasta")
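
`to_fasta` writes each sequence on a single line, which is valid fasta. Some tools and viewers are easier to use with wrapped sequence lines, so a record could also be formatted as sketched below. The width of 60 is just a common convention, and `wrap_sequence` and `format_record` are hypothetical helper names.

```python
def wrap_sequence(sequence, width=60):
    """Break a sequence string into lines of at most `width` characters."""
    return "\n".join(sequence[i:i + width]
                     for i in range(0, len(sequence), width))


def format_record(header, sequence, width=60):
    """Return one fasta record; `header` is given without the leading '>'."""
    return f">{header}\n{wrap_sequence(sequence, width)}\n"


print(format_record("example|record", "ACGTACGTAC", width=4), end="")
```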