A Python script that searches through all GFF files in a directory. For each GFF file, the script counts the lines that mention each gene in a pathway, then writes that count into the cell for that gene and species in the spreadsheet (saved as a CSV).

How it works:
A. Transpose the CSV so that the genes, originally listed vertically, become the column names.
B. Write all species IDs to the given CSV (these become the row names).
C. Create a list of all gene names to search for, called to_search.txt.

1. Search through all files in a directory and find all GFF files (even if they are in a sub-folder):
- unzip the file in the same folder
- write the path of the file to a list
- trim the path name to the part that is needed for the later steps

2. For each file in the list, perform the following:
- for each line in to_search.txt, find the number of lines in the GFF file in which that word shows up at least once (counts of lines, not words!)
- create a dataframe with the columns as gene names from to_search.txt
- add a row to the dataframe for each species searched; that row holds all the counts
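
The line-counting rule in step 2 can be sketched as follows. This is a minimal illustration and not the code used later in the notebook: the helper name count_gene_lines and the two-line example are made up for the purpose. A line that mentions a gene twice still adds only one to that gene's count.

```python
import io

def count_gene_lines(handle, genes):
    """Count, for each gene, the lines in which its name appears at least once."""
    counts = {gene: 0 for gene in genes}
    for line in handle:
        if line.startswith("#"):   # skip GFF header/comment lines
            continue
        for gene in genes:
            if gene in line:       # one hit per line, however often the name repeats
                counts[gene] += 1
    return counts

# Made-up example: the last line mentions mcrA twice but is counted once.
example = io.StringIO(
    "##gff-version 3\n"
    "seq1\tsrc\tgene\t1\t10\t.\t+\t.\tName=mcrB\n"
    "seq1\tsrc\tgene\t20\t30\t.\t+\t.\tName=mcrA;gene=mcrA\n"
)
print(count_gene_lines(example, ["mcrA", "mcrB"]))  # {'mcrA': 1, 'mcrB': 1}
```

Note that this is a plain substring test, so a short gene name would also match inside a longer one; the cells below narrow the match with prefixes such as "Name=".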

In [1]:
#Check that we are in the correct working directory
import os 
print(os.getcwd())

/Users/gracedai/Documents/MethanogeneDB


In [3]:
# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} sh

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 23.3.1
  latest version: 23.5.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.5.0



## Package Plan ##

  environment location: /Users/gracedai/anaconda3

  added / updated specs:
    - sh


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2023.5.7           |  py310hca03da5_0         153 KB
    openssl-1.1.1u             |       h1a28f6b_0         2.6 MB
    sh-1.14.2                  |  py310hca03da5_0          67 KB
    ------------------------------------------------------------
                                           Total:         2.8 MB

The following NEW packages will be INSTALLED:

  sh           

In [4]:
os.chdir("/Users/gracedai/Documents/MethanogeneDB") 

In [2]:
#unzips a file name, given the path, and deletes the zipped version
from sh import gunzip
#this line is for testing purposes
#gunzip('GCF_000430905.1/GCF_000430905.1_ASM43090v1_genomic.gff.gz')

In [3]:
#unzip whatever files needed
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".gz"):
            #os.remove(os.path.join(root, file))
            gunzip(os.path.join(root, file))
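
If installing the sh package is not an option, the same unzip step can be approximated with the standard library. The sketch below is an assumption-laden alternative, not the method used in this notebook: gunzip_file is a hypothetical helper that mimics the command-line gunzip by writing the decompressed file next to the archive and, by default, deleting the archive.

```python
import gzip
import os
import shutil

def gunzip_file(gz_path, remove_original=True):
    """Decompress gz_path into the same folder and return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz: " + gz_path)
    out_path = gz_path[:-3]  # drop the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data, no full read into memory
    if remove_original:
        os.remove(gz_path)            # mirror gunzip, which removes the archive
    return out_path
```

Because the helper returns the decompressed path, a caller can append that path to a list straight away instead of walking the directory a second time.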

In [8]:
# list to store paths to gff files
gff_files = []
# walk the current working directory; unzip any remaining .gff.gz file
# and append the path of every (unzipped) gff file to the list
# os.walk lists a folder's files before we unzip them, so the unzipped
# path is added here directly and the cell does not need to be run twice
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        path = os.path.join(root, file)
        if file.endswith(".gff.gz"):
            gunzip(path)
            path = path[:-3]  # gunzip drops the ".gz" suffix
        if path.endswith(".gff"):
            gff_files.append(path)
print(gff_files)
len(gff_files)

['/Users/gracedai/Documents/MethanogeneDB/GCA_000970225.1/GCF_000970225.1_ASM97022v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000092305.1/GCF_000092305.1_ASM9230v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000970165.1/GCF_000970165.1_ASM97016v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_024170265.1/GCF_024170265.1_ASM2417026v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_002502925.1/GCA_002502925.1.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000091665.1/GCF_000091665.1_ASM9166v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_020885915.1/GCF_020885915.1_ASM2088591v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_017535165.1/GCA_017535165.1_ASM1753516v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_000166095.1/GCF_000166095.1_ASM16609v1_genomic.gff', '/Users/gracedai/Documents/MethanogeneDB/GCA_017412935.1/GCA_017412935.1_ASM1741293v1_genomic.gff', '/Users/gracedai/Documents/Metha

275

In [10]:
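# keep only the part of each path relative to the working directory
# (same result as file[40:] when the working directory is
# /Users/gracedai/Documents/MethanogeneDB, without the hard-coded offset)
gff_paths = []
for file in gff_files:
    gff_paths.append(os.path.relpath(file, os.getcwd()))
print(gff_paths)
len(gff_paths)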

['GCA_000970225.1/GCF_000970225.1_ASM97022v1_genomic.gff', 'GCA_000092305.1/GCF_000092305.1_ASM9230v1_genomic.gff', 'GCA_000970165.1/GCF_000970165.1_ASM97016v1_genomic.gff', 'GCA_024170265.1/GCF_024170265.1_ASM2417026v1_genomic.gff', 'GCA_002502925.1/GCA_002502925.1.gff', 'GCA_000091665.1/GCF_000091665.1_ASM9166v1_genomic.gff', 'GCA_020885915.1/GCF_020885915.1_ASM2088591v1_genomic.gff', 'GCA_017535165.1/GCA_017535165.1_ASM1753516v1_genomic.gff', 'GCA_000166095.1/GCF_000166095.1_ASM16609v1_genomic.gff', 'GCA_017412935.1/GCA_017412935.1_ASM1741293v1_genomic.gff', 'GCA_002508885.1/GCA_002508885.1.gff', 'GCA_029633965.1/GCF_029633965.1_ASM2963396v1_genomic.gff', 'GCA_900114585.1/GCF_900114585.1_IMG-taxon_2593339150_annotated_assembly_genomic.gff', 'GCA_000744315.1/GCF_000744315.1_ASM74431v1_genomic.gff', 'GCA_028743435.1/GCF_028743435.1_ASM2874343v1_genomic.gff', 'GCA_001477655.1/GCF_001477655.1_ASM147765v1_genomic.gff', 'GCA_002356395.1/GCA_002356395.1_ASM235639v1_genomic.gff', 'GCA_03016

275
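
Later cells take the first 15 characters of each relative path as the assembly accession. A pattern match is a possible alternative that does not depend on the accession length. The sketch below assumes the usual accession layout (GCA_ or GCF_, nine digits, a dot, and a version number); accession_from_path is a hypothetical helper name.

```python
import re

# Assumed accession layout: "GCA_" or "GCF_", nine digits, ".", version number.
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")

def accession_from_path(path):
    """Return the first assembly accession found in path, or None."""
    match = ACCESSION_RE.search(path)
    return match.group(0) if match else None

print(accession_from_path("GCA_000970225.1/GCF_000970225.1_ASM97022v1_genomic.gff"))
# GCA_000970225.1
```

For the paths listed above, the first match is the folder name, so the result agrees with the 15-character slice; a two-digit version such as ".10" would also be kept whole.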

In [11]:
import pandas as pd
#read the text file of gene names into a list
with open("/Users/gracedai/Documents/asmg labs files/optimize/to_search_co2.txt", "r") as gene_file:
    data = gene_file.read()
to_search = data.split("\n")
print(to_search)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/gracedai/Documents/asmg labs files/optimize/to_search_co2.txt'

In [12]:
len(to_search)

43

GFF File Format:
- Fields must be tab-separated. All but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'
[0] seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
[1] source - name of the program that generated this feature, or the data source (database or project name)
[2] feature - feature type name, e.g. Gene, Variation, Similarity
[3] start - start position of the feature, with sequence numbering starting at 1.
[4] end - end position of the feature, with sequence numbering starting at 1.
[5] score - a floating point value.
[6] strand - defined as + (forward) or - (reverse).
[7] frame - one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
[8] attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature.
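
To make the nine fields concrete, the sketch below splits one feature line into a dictionary keyed by the field names listed above. It assumes GFF3-style attributes written as tag=value pairs; the function name parse_gff_line and the example line are made up for illustration.

```python
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_gff_line(line):
    """Split one tab-separated GFF feature line into a dict of its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])   # positions are 1-based integers
    record["end"] = int(record["end"])
    # Assumption: attributes look like "ID=gene0;Name=mcrA" (GFF3 style).
    attributes = {}
    for pair in record["attribute"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    record["attribute"] = attributes
    return record

# Made-up feature line for illustration.
rec = parse_gff_line("seq1\tRefSeq\tgene\t100\t400\t.\t+\t.\tID=gene0;Name=mcrA\n")
print(rec["feature"], rec["start"], rec["attribute"]["Name"])  # gene 100 mcrA
```

Comparing rec["attribute"]["Name"] with a gene name is stricter than a substring test on the whole line, which may matter when one gene name is a prefix of another.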

In [19]:
#create an empty dictionary to store the gene counts for each species
gene_counts = {}

#for gene pathways: iterate through all species gff files
# for path_name in gff_paths:
#     species = path_name[:15]  #keep species name from the path_name
#     gene_counts[species] = {}
    
#     with open(path_name, 'r') as file:
#         for line in file:
#             if not line.startswith('#'):
#                 for gene in to_search:
#                     if "gene=" + gene in line:
#                         if gene not in gene_counts[species]:
#                             gene_counts[species][gene] = 1
#                         else:
#                             gene_counts[species][gene] += 1
    #read the GFF file
#     with open(path_name, 'r') as file:
#         for line in file:
#             if not line.startswith('#'):
#                 fields = line.strip().split('\t')
#                 if (fields[2] == 'gene'):
#                     attributes = fields[8] #field number 8 contains attributes such as gene name
#                     for gene in to_search:
#                         if gene in attributes:
#                             if gene not in gene_counts[species]:
#                                 gene_counts[species][gene] = 1
#                             else:
#                                 gene_counts[species][gene] += 1

# for 16s and tRNA
# for path_name in gff_paths:
#     species = path_name[:15]  #keep species name from the path_name
#     gene_counts[species] = {}

#     #read the GFF file
#     with open(path_name, 'r') as file:
#         for line in file:
#             if not line.startswith('#'):
#                 fields = line.strip().split('\t')
#                 attributes = fields[8] #field number 8 contains attributes such as gene name
#                 if fields[2] == "gene":
#                     if 'tRNA' in attributes:
#                         if 'tRNA' not in gene_counts[species]:
#                             gene_counts[species]['tRNA'] = 1
#                         else:
#                             gene_counts[species]['tRNA'] += 1
#                 if fields[2] == "rRNA":
#                     if '16S' in attributes:
#                         if '16S' not in gene_counts[species]:
#                             gene_counts[species]['16S'] = 1
#                         else:
#                             gene_counts[species]['16S'] += 1

for path_name in gff_paths:
    species = path_name[:15]  #keep species name (assembly accession) from the path_name
    gene_counts[species] = {}

    #read the GFF file and count the lines that mention each mrt gene
    with open(path_name, 'r') as file:
        for line in file:
            if line.startswith('#'):
                continue  #skip header and comment lines
            for gene in ('mrtA', 'mrtB', 'mrtG'):
                #accept both spellings, e.g. "Name=mrtA" and "Name=MrtA"
                capitalized = gene[0].upper() + gene[1:]
                if "Name=" + gene in line or "Name=" + capitalized in line:
                    gene_counts[species][gene] = gene_counts[species].get(gene, 0) + 1

# Create a dataframe from the gene counts; genes never seen in a file become 0
df = pd.DataFrame.from_dict(gene_counts, orient='index')
df.fillna(0, inplace=True)

In [20]:
df = df.rename_axis('assembly_accession')
df

assembly_accession,mrtG,mrtB,mrtA
GCA_002502965.1,1.0,0.0,0.0
GCA_902762095.1,1.0,1.0,1.0
GCA_000953115.1,1.0,1.0,1.0
GCA_900095815.1,1.0,1.0,1.0
GCA_902763685.1,1.0,0.0,0.0
GCA_902795935.1,1.0,1.0,1.0
GCA_002506325.1,1.0,0.0,0.0
GCA_002506905.1,1.0,0.0,0.0
GCA_002509405.1,1.0,0.0,0.0
GCA_002496945.1,1.0,0.0,0.0


In [17]:
df = df[to_search]
df

KeyError: "['fwdA', 'fmdA', 'fwdB', 'fmdB', 'fwdD', 'fmdD', 'fwdE', 'fmdE', 'fwdF', 'fmdF', 'fwdH'] not in index"

In [134]:
#add columns that are missing, fill in with 0
missing = ['mtmC','hdrB1', 'hdrC1', 'hdrB2', 'hdrC2']
for i in missing: 
    df[i] = 0.0
df

assembly_accession,mvhA,mcrA,mcrG,mcrB,mvhG,hdrA2,mtbC,hdrE,hdrD,mtbA,...,mtmB,mtbB,mttC,mttB,hdrA1,mtmC,hdrB1,hdrC1,hdrB2,hdrC2
GCF_001458655.1,1.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_000008645.1,1.0,2.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_000145295.1,1.0,2.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_008033705.1,1.0,2.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_002813655.1,1.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_001477655.1,0.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_028743435.1,0.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_000970325.1,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GCF_001027005.1,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
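
The KeyError from df[to_search] and the manual loop over missing columns can probably be avoided in one step with DataFrame.reindex, which adds any requested column that is absent and fills it with a chosen value. The sketch below uses a small made-up frame and made-up row labels rather than the real counts.

```python
import pandas as pd

# Made-up counts for two genomes; "mtmC" was never found in any file.
counts = pd.DataFrame(
    {"mcrA": [2.0, 1.0], "mcrB": [2.0, 1.0]},
    index=["genome_1", "genome_2"],
)
wanted = ["mcrA", "mcrB", "mtmC"]  # stands in for to_search

# reindex keeps the requested column order and creates missing columns as 0.0
aligned = counts.reindex(columns=wanted, fill_value=0.0)
print(aligned)
```

Unlike df[to_search], this does not raise when a gene has no hits in any genome, and the columns come out in the order of the search list.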


In [21]:
#match species to the assembly accession numbers
#first import the csv with species and their accession numbers
species = pd.read_csv('/Users/gracedai/Documents/MAGS.csv')
species = species[['Accession', 'organism_name', 'Isolate']]
species.head()

Unnamed: 0,Accession,organism_name,Isolate
0,GCA_902788255.1,uncultured Methanobrevibacter,RUG13354
1,GCA_002494785.1,Methanobacteriaceae archaeon,UBA117
2,GCA_902784195.1,uncultured Methanobrevibacter,RUG12946
3,GCA_002495685.1,Methanobrevibacter,UBA412
4,GCA_902771225.1,uncultured Methanobrevibacter,RUG11657


In [22]:
df = pd.merge(df, species, left_on='assembly_accession', right_on='Accession', how='left')
df

Unnamed: 0,mrtG,mrtB,mrtA,Accession,organism_name,Isolate
0,1.0,0.0,0.0,GCA_002502965.1,Methanomassiliicoccaceae archaeon,UBA381
1,1.0,1.0,1.0,GCA_902762095.1,uncultured Methanobrevibacter,RUG10736
2,1.0,1.0,1.0,,,
3,1.0,1.0,1.0,,,
4,1.0,0.0,0.0,GCA_902763685.1,methanogenic archaeon,RUG10898
5,1.0,1.0,1.0,GCA_902795935.1,uncultured Methanobrevibacter,RUG14110
6,1.0,0.0,0.0,GCA_002506325.1,Methanomassiliicoccaceae archaeon,UBA314
7,1.0,0.0,0.0,GCA_002506905.1,Methanomassiliicoccaceae archaeon,UBA328
8,1.0,0.0,0.0,GCA_002509405.1,Methanomassiliicoccaceae archaeon,UBA72
9,1.0,0.0,0.0,GCA_002496945.1,Methanomassiliicoccaceae archaeon,UBA408
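
Rows 2 and 3 above have no Accession, organism_name, or Isolate because a left merge keeps count rows even when the metadata table has no matching accession. One way to list such rows is the indicator option of pd.merge. The sketch below uses made-up accessions and a made-up organism name, not the real tables.

```python
import pandas as pd

# Made-up stand-ins for the counts dataframe and the species metadata.
counts = pd.DataFrame(
    {"mrtA": [1.0, 1.0]},
    index=pd.Index(["GCA_1", "GCA_2"], name="assembly_accession"),
)
meta = pd.DataFrame({"Accession": ["GCA_1"], "organism_name": ["example organism"]})

merged = pd.merge(
    counts.reset_index(), meta,
    left_on="assembly_accession", right_on="Accession",
    how="left", indicator=True,   # adds a "_merge" column describing each row
)
# "left_only" marks count rows that found no metadata match
unmatched = merged.loc[merged["_merge"] == "left_only", "assembly_accession"].tolist()
print(unmatched)  # ['GCA_2']
```

Checking this list before writing the CSV shows which accessions still need an entry in the metadata file.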


In [24]:
df.to_csv('mrt1.csv')

In [50]:
#resort column order
cols = ['Accession', 'organism_name', 'Isolate', 
    '16S', 'tRNA']
df = df[cols]

#add row with kegg numbers
# kegg_row = [0,0,0,
#     'K14082',
#     'K16176',
#     'K16177',
#     'K16178',
#     'K16179',
#     'K14083',
#     'K14084',
#     'K00399',
#     'K00401',
#     'K00402',
#     'K22480',
#     'K22481',
#     'K22482',
#     'K03388',
#     'K03389',
#     'K03390',
#     'K08264',
#     'K08265',
#     'K14126',
#     'K14128',
#     'K14127',
#     'K22516',
#     'K00125']

# kegg_row_df = pd.DataFrame([kegg_row], columns=df.columns)
#add row with gene description
# descrip = [0, 0, 0, '[methyl-Co(III) methylamine-specific corrinoid protein]:coenzyme M methyltransferase [EC:2.1.1.247]',
#     'methylamine---corrinoid protein Co-methyltransferase [EC:2.1.1.248]',
#     'monomethylamine corrinoid protein',
#     'dimethylamine---corrinoid protein Co-methyltransferase [EC:2.1.1.249]',
#     'dimethylamine corrinoid protein',
#     'trimethylamine---corrinoid protein Co-methyltransferase [EC:2.1.1.250]',
#     'trimethylamine corrinoid protein',
#     'methyl-coenzyme M reductase alpha subunit [EC:2.8.4.1]',
#     'methyl-coenzyme M reductase beta subunit [EC:2.8.4.1]',
#     'methyl-coenzyme M reductase gamma subunit [EC:2.8.4.1]',
#     'heterodisulfide reductase subunit A1 [EC:1.8.7.3]',
#     'heterodisulfide reductase subunit B1 [EC:1.8.7.3]',
#     'heterodisulfide reductase subunit C1 [EC:1.8.7.3]',
#     'heterodisulfide reductase subunit A2 [EC:1.8.7.3 1.8.98.4 1.8.98.5 1.8.98.6]',
#     'heterodisulfide reductase subunit B2 [EC:1.8.7.3 1.8.98.4 1.8.98.5 1.8.98.6]',
#     'heterodisulfide reductase subunit C2 [EC:1.8.7.3 1.8.98.4 1.8.98.5 1.8.98.6]',
#     'heterodisulfide reductase subunit D [EC:1.8.98.1]',
#     'heterodisulfide reductase subunit E [EC:1.8.98.1]',
#     'F420-non-reducing hydrogenase large subunit [EC:1.12.99.- 1.8.98.5]',
#     'F420-non-reducing hydrogenase small subunit [EC:1.12.99.- 1.8.98.5]',
#     'F420-non-reducing hydrogenase iron-sulfur subunit [EC:1.12.99.- 1.8.98.5 1.8.98.6]',
#     'formate dehydrogenase (coenzyme F420) alpha subunit [EC:1.17.98.3 1.8.98.6]',
#     'formate dehydrogenase (coenzyme F420) beta subunit [EC:1.17.98.3 1.8.98.6]']
# descrip_df = pd.DataFrame([descrip], columns=df.columns)
# df = pd.concat([kegg_row_df, descrip_df, df], ignore_index=True)

In [51]:
df

Unnamed: 0,Accession,organism_name,Isolate,16S,tRNA
0,GCA_902788045.1,uncultured Methanobrevibacter,RUG13330,0.0,32
1,GCA_902774685.1,uncultured Methanobrevibacter,RUG11990,0.0,28
2,GCA_002497785.1,Methanolinea,UBA144,1.0,37
3,GCA_002502925.1,Methanomassiliicoccaceae archaeon,UBA414,2.0,41
4,GCA_002506795.1,Methanobacterium,UBA397,0.0,38
...,...,...,...,...,...
218,GCA_902764455.1,uncultured Methanobrevibacter,RUG10962,0.0,35
219,GCA_902802065.1,uncultured Methanobrevibacter,RUG14729,2.0,35
220,GCA_902787415.1,methanogenic archaeon,RUG13270,0.0,35
221,GCA_902768415.1,uncultured Methanobrevibacter,RUG11363,0.0,27


In [53]:
df['organism_name'] = df['organism_name'].astype(str)
df_sorted = df.sort_values('organism_name', ascending=True)
df_sorted.tail(n=20)

Unnamed: 0,Accession,organism_name,Isolate,16S,tRNA
38,GCA_902770715.1,uncultured Methanobrevibacter,RUG11588,0.0,30
37,GCA_902801725.1,uncultured Methanobrevibacter,RUG14697,1.0,30
166,GCA_902798335.1,uncultured Methanobrevibacter,RUG14349,0.0,31
167,GCA_902757275.1,uncultured Methanobrevibacter,RUG10244,0.0,36
75,GCA_902758945.1,uncultured Methanobrevibacter,RUG10414,0.0,33
34,GCA_902769175.1,uncultured Methanobrevibacter,RUG11428,0.0,41
125,GCA_902777475.1,uncultured Methanobrevibacter,RUG12275,0.0,34
32,GCA_902759195.1,uncultured Methanobrevibacter,RUG10437,0.0,33
174,GCA_902784295.1,uncultured Methanobrevibacter,RUG12954,0.0,31
31,GCA_902788255.1,uncultured Methanobrevibacter,RUG13354,0.0,31


In [54]:
df_sorted.to_csv('prokka_16s.csv')

In [62]:
wgs = pd.read_csv('/Users/gracedai/Downloads/co2.csv')
prokka = pd.read_csv('/Users/gracedai/Downloads/prokkas.csv')
rumen_only = pd.read_csv('/Users/gracedai/Downloads/rumen.csv')
wgs.head(n=10)

Unnamed: 0,organism_name,assembly_accession,strain,fwdA,fmdA,fwdB,fmdB,fwdC,fmdC,fwdD,...,hdrA2,hdrB2,hdrC2,hdrD,hdrE,mvhA,mvhG,mvhD,fdhA,fdhB
0,#CO2 -> methane,# KEGG M00567,#,K00200,K00200,K00201,K00201,K00202,K00202,K00203,...,K03388,K03389,K03390,K08264,K08265,K14126,K14128,K14127,K22516,K00125
1,#,#,#,formylmethanofuran dehydrogenase subunit A [EC...,formylmethanofuran dehydrogenase subunit A [EC...,formylmethanofuran dehydrogenase subunit B [EC...,formylmethanofuran dehydrogenase subunit B [EC...,formylmethanofuran dehydrogenase subunit C [EC...,formylmethanofuran dehydrogenase subunit C [EC...,formylmethanofuran dehydrogenase subunit D [EC...,...,heterodisulfide reductase subunit A2 [EC:1.8.7...,heterodisulfide reductase subunit B2 [EC:1.8.7...,heterodisulfide reductase subunit C2 [EC:1.8.7...,heterodisulfide reductase subunit D [EC:1.8.98.1],heterodisulfide reductase subunit E [EC:1.8.98.1],F420-non-reducing hydrogenase large subunit [E...,F420-non-reducing hydrogenase small subunit [E...,F420-non-reducing hydrogenase iron-sulfur subu...,formate dehydrogenase (coenzyme F420) alpha su...,formate dehydrogenase (coenzyme F420) beta sub...
2,Candidatus Methanogranum sp. U3.2.1,GCA_019262145.1,na,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Candidatus Methanoliparum sp. LAM-1,GCF_020885915.1,LAM-1,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
4,Candidatus Methanomassiliicoccus intestinalis,GCF_902383905.1,na,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Candidatus Methanomassiliicoccus intestinalis ...,GCF_000404225.1,Issoire-Mx1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Candidatus Methanomethylophilus alvus,GCF_902387285.1,Mx1201,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Candidatus Methanomethylophilus alvus,GCF_003711245.1,Mx-05,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Candidatus Methanomethylophilus alvus Mx1201,GCF_000300255.2,Mx1201,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Candidatus Methanomethylophilus sp. 1R26 (arch...,GCF_001481295.1,1R26,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [64]:
out = pd.merge(rumen_only, wgs, on='assembly_accession', how='left')

In [74]:
out['mcrA'] = out['mcrA'].fillna(0).astype(int)
out_mcrA = out[out['mcrA'] >= 1]
out_mcrA = out_mcrA[['assembly_accession', 'organism_name_x', 'strain_x','host', 'location', 'mcrA']]
out_mcrA

Unnamed: 0,assembly_accession,organism_name_x,strain_x,host,location,mcrA
0,GCF_000024185.1,Methanobrevibacter ruminantium,M1,Bovine,Germany,1
1,GCF_001027005.1,Methanosarcina barkeri,CM1,Cow,New Zealand,1
2,GCF_001563245.1,Methanobrevibacter olleyae,YLM1,Lamb,New Zealand,2
4,GCF_000711215.1,Methanomicrobium mobile,BP,Cow,Australia,1
5,GCF_001481295.1,Candidatus Methanomethylophilus,1R26,Cow,Denmark,1
6,GCF_003111625.1,Methanobrevibacter thaueri,DSM 11995,Cow,,1
9,GCF_000320505.1,Methanobrevibacter boviskoreani,JH1,Cattle,Korea,1
10,GCF_001729965.1,Methanosphaera,WGK6,Kangaroo,Australia,1
11,GCF_003268005.1,Methanosphaera,BMS,Steer,Australia,1
12,GCF_001548675.1,Methanobrevibacter,YE315,Cattle,Australia,1


In [75]:
out_mcrA.to_csv('out_mcrA.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/Documents/asmg_labs_files/fasta_sequences/mcrA/out_mcrA.csv'