
Running the **entire subtractive genomic analysis pipeline** as described (from protein sequence retrieval, through essentiality and metabolic pathway analysis, to subcellular localization) **is generally not feasible *entirely* within a Google Colab notebook if you rely on the *exact* external web servers mentioned (UniProt, CD-HIT, BLASTp via the NCBI web interface, Geptop, KAAS)**.

However, **it is absolutely possible to replicate the *steps* and perform *equivalent analyses* using Python libraries and command-line tools that can be installed or run within the Colab environment**, though this requires significant coding and setup.

Here is a guide outlining the feasibility and detailing the steps for a **Colab-adapted implementation**.

-----

## 1\. Feasibility of Colab Implementation

| Step | Original Tool | Feasibility in Colab | Notes on Colab Implementation |
| :--- | :--- | :--- | :--- |
| **Protein Retrieval** | UniProt Database | **High** | Download the proteome FASTA from UniProt (manually or through the UniProt REST API) and parse it with **BioPython** (`Bio.SeqIO`). |
| **Paralog Discarding** | CD-HIT Server | **High** | Install the command-line version of **CD-HIT** with `apt-get` and run it from a Colab cell with the `!` prefix. **MMseqs2** is a command-line alternative for sequence clustering. |
| **Non-Homologous Identification** | BLASTp (NCBI Web) | **Medium/High** | Use **standalone BLAST+** (easily installed in Colab) against a locally built human proteome database. BioPython's `Bio.Blast.NCBIWWW` module can query NCBI remotely, but that is slow for a whole proteome. Requires downloading a human proteome. **This is the most computationally intensive step.** |
| **Essentiality Assessment** | Geptop Server | **Low** | Geptop is used here as a web server and **is not run directly.** This notebook substitutes a BLASTp search against the **DEG10** essential-gene protein set, which approximates Geptop rather than reproducing it. This step is the **hardest to replicate precisely.** |
| **Metabolic Pathway Analysis** | KAAS Server | **Low** | KAAS is a specialized web server and **cannot be run directly.** Submit the sequences to KAAS (or a similar service such as **GhostKOALA**) outside Colab, upload the resulting protein-to-KO table, and map the KOs to pathways with the **KEGG REST API**. This requires careful parsing of results. |
| **Subcellular Localization** | (Tool not specified) | **Medium** | Use a publicly available localization predictor such as **PSORTb** (if a standalone install is practical) or a web service such as **DeepLoc** (if it offers programmatic access). **DeepTMHMM** predicts transmembrane topology rather than localization as such, but can help flag membrane proteins. |

-----

## 2\. Step-by-Step Guide for Colab-Adapted Subtractive Genomic Analysis

### A. Setup and Dependencies

The first cell in your Colab notebook will be for installation.

In [1]:
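!pip install -q biopython pandas requests
# BLAST+ provides makeblastdb and blastp, which later cells call
!apt-get install -y -qq ncbi-blast+ > /dev/null
!mkdir -p /content/{proteome,non_paralogous,blast_results}
import os
import requests
import pandas as pd
from Bio import SeqIO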


### B. Protein Sequence Retrieval (UniProt)

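This cell expects a proteome FASTA that was downloaded from UniProt and uploaded to Colab. It moves the file into the working directory and uses **BioPython** to count the sequences; repeat it for each of your four organisms.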

In [17]:
# --- STEP 1: Register the uploaded proteome ---
species_name = "Bifidobacterium_animalis"
uploaded_proteome = "/content/uniprotkb_proteome_UP000037239_2025_10_29.fasta"
proteome_path = f"/content/proteome/{species_name}.fasta"

if os.path.exists(uploaded_proteome):
    # Move the upload into the working directory
    os.rename(uploaded_proteome, proteome_path)
    print(f"✅ Proteome uploaded: {proteome_path}")
elif os.path.exists(proteome_path):
    # Already moved by an earlier run of this cell
    print(f"✅ Proteome already in place: {proteome_path}")
else:
    raise FileNotFoundError("❌ Please upload your FASTA file manually in Colab first!")

# --- Count total proteins ---
total_proteins = sum(1 for _ in SeqIO.parse(proteome_path, "fasta"))
print(f"🧩 Total proteins in proteome: {total_proteins}")


✅ Proteome uploaded: /content/proteome/Bifidobacterium_animalis.fasta
🧩 Total proteins in proteome: 1750
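
As an alternative to a manual upload, the proteome can probably be fetched programmatically. The sketch below reuses the UniProt stream endpoint that this notebook already calls for the human proteome. The proteome ID `UP000037239` is an assumption read off the uploaded file name, and `uniprot_proteome_url` and `download_proteome` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"

def uniprot_proteome_url(proteome_id: str) -> str:
    """Build the UniProtKB stream URL that returns a proteome as FASTA."""
    params = {"query": f"proteome:{proteome_id}", "format": "fasta"}
    return f"{UNIPROT_STREAM}?{urlencode(params)}"

def download_proteome(proteome_id: str, dest: str) -> str:
    """Download the FASTA for `proteome_id` to `dest` (needs network access)."""
    with urlopen(uniprot_proteome_url(proteome_id), timeout=120) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest

# Assumed proteome ID, taken from the uploaded file name
url = uniprot_proteome_url("UP000037239")
print(url)
```

In Colab, `download_proteome("UP000037239", proteome_path)` would then take the place of the upload step.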


### C. Paralog Discarding (CD-HIT)

Run the **CD-HIT** command-line tool within Colab using the `!` prefix.

In [18]:
# --- STEP 2: Remove paralogs using CD-HIT (60% identity) ---
!apt-get install -y cd-hit

non_paralog_path = f"/content/non_paralogous/{species_name}_nonparalog.fasta"
os.makedirs("/content/non_paralogous", exist_ok=True)
# -c 0.6: cluster at 60% identity; -n 4: word size suited to that threshold;
# -d 0: keep sequence names up to the first space in the .clstr file
!cd-hit -i "$proteome_path" -o "$non_paralog_path" -c 0.6 -n 4 -d 0 > /dev/null

# --- Count after CD-HIT ---
non_paralog_count = sum(1 for _ in SeqIO.parse(non_paralog_path, "fasta"))
print(f"🧬 Non-paralogous proteins retained: {non_paralog_count} ({(non_paralog_count/total_proteins)*100:.1f}% retained)")


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cd-hit is already the newest version (4.8.1-4).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
🧬 Non-paralogous proteins retained: 1730 (98.9% retained)
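
CD-HIT also writes a cluster report next to its output FASTA (same path with a `.clstr` suffix), which shows which sequences were merged into each representative. The following is a minimal sketch of reading that report; `parse_clstr` is a hypothetical helper and the sample IDs are made up.

```python
from collections import defaultdict

def parse_clstr(text: str) -> dict:
    """Map each cluster's representative ID to the list of all member IDs."""
    members = defaultdict(list)   # cluster number -> member IDs
    reps = {}                     # cluster number -> representative ID
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line.split()[1]
            continue
        # Member line, e.g. "0  350aa, >protA... *" (the representative ends with "*")
        seq_id = line.split(">", 1)[1].split("...", 1)[0]
        members[current].append(seq_id)
        if line.endswith("*"):
            reps[current] = seq_id
    return {reps[c]: ids for c, ids in members.items()}

# Hypothetical report with one two-member cluster and one singleton
sample = """>Cluster 0
0\t350aa, >protA... *
1\t348aa, >protB... at 92.53%
>Cluster 1
0\t120aa, >protC... *
"""
clusters = parse_clstr(sample)
collapsed = {rep: ids for rep, ids in clusters.items() if len(ids) > 1}
print(collapsed)
```

On the real run, `parse_clstr(open(non_paralog_path + ".clstr").read())` would list the 20 sequences that were folded into other clusters.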


In [19]:
# --- STEP 3: Remove human homologs ---
# 3a. Download human reference proteome (UniProt)
!wget -q -O /content/human.fasta "https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000005640&format=fasta"

# 3b. Build human BLAST database (requires BLAST+ to be installed)
!makeblastdb -in /content/human.fasta -dbtype prot -out /content/human_db > /dev/null

# 3c. Run BLASTp vs human
blast_out = f"/content/blast_results/{species_name}_vs_human.tsv"
os.makedirs("/content/blast_results", exist_ok=True)
!blastp -query "$non_paralog_path" -db /content/human_db -outfmt "6 qseqid sseqid pident evalue qcovs" -evalue 1e-5 -num_threads 2 -out "$blast_out"

print("✅ BLASTp vs Human completed.")

# --- 3d. Flag human homologs: any hit with >30% identity and ≥70% query coverage ---
# Proteins without such a hit (including proteins with no hit at all) are kept as non-homologous.
df_human = pd.read_csv(blast_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs"])
human_homologs = set(df_human[(df_human["pident"] > 30) & (df_human["qcovs"] >= 70)]["qseqid"])

# --- Collect the non-homologous records in a single pass over the FASTA ---
non_homologous_records = [
    record for record in SeqIO.parse(non_paralog_path, "fasta")
    if record.id not in human_homologs
]
non_homologous_ids = [record.id for record in non_homologous_records]

print(f"🚫 Human-homologous proteins removed: {len(human_homologs)}")
print(f"✅ Non-homologous proteins retained: {len(non_homologous_ids)} ({(len(non_homologous_ids)/non_paralog_count)*100:.1f}% retained)")

# --- Save non-homologous FASTA ---
non_hom_fasta = f"/content/{species_name}_nonhomolog.fasta"
SeqIO.write(non_homologous_records, non_hom_fasta, "fasta")

✅ BLASTp vs Human completed.
🚫 Human-homologous proteins removed: 217
✅ Non-homologous proteins retained: 1513 (87.5% retained)
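
The filter treats a protein as a human homolog as soon as any one of its BLAST rows clears both thresholds, and it keeps proteins that never appear in the BLAST table at all. The toy example below illustrates that rule with the standard library only; the rows and IDs are made up, and `human_homolog_ids` is a hypothetical helper.

```python
import csv
import io

def human_homolog_ids(blast_tsv: str, min_identity: float = 30.0, min_coverage: float = 70.0) -> set:
    """Return query IDs with at least one hit above both thresholds."""
    homologs = set()
    reader = csv.reader(io.StringIO(blast_tsv), delimiter="\t")
    for qseqid, sseqid, pident, evalue, qcovs in reader:
        if float(pident) > min_identity and float(qcovs) >= min_coverage:
            homologs.add(qseqid)
    return homologs

# Hypothetical rows in the column order qseqid, sseqid, pident, evalue, qcovs
rows = "\n".join([
    "protA\thumanX\t45.2\t1e-40\t88",   # passes both thresholds
    "protA\thumanY\t28.0\t1e-06\t75",   # weaker second hit, does not change the outcome
    "protB\thumanZ\t52.0\t1e-12\t35",   # high identity but low coverage
    "protC\thumanX\t29.5\t1e-07\t90",   # good coverage but low identity
])
all_ids = ["protA", "protB", "protC", "protD"]  # protD has no BLAST hit at all
homologs = human_homolog_ids(rows)
retained = [pid for pid in all_ids if pid not in homologs]
print(homologs, retained)
```

Only `protA` is removed here; `protB`, `protC`, and the hit-less `protD` are all retained.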


In [20]:
# Unzip the DEG10 protein set (assumes DEG10.aa.gz has already been uploaded to /content)
!gunzip -c /content/DEG10.aa.gz > /content/DEG10.aa.fasta


In [21]:
# --- STEP 4: Predict essential proteins by homology to DEG10 ---
# DEG10.aa.fasta is produced by the previous cell; build the BLAST database from it
!makeblastdb -in /content/DEG10.aa.fasta -dbtype prot -out /content/deg10_db > /dev/null

blast_deg_out = f"/content/{species_name}_vs_deg10.tsv"
!blastp -query "$non_hom_fasta" -db /content/deg10_db -outfmt "6 qseqid sseqid pident evalue qcovs bitscore" -evalue 1e-5 -num_threads 2 -out "$blast_deg_out"

print("✅ BLASTp vs DEG10 completed.")


FASTA-Reader: Ignoring invalid residues at position(s): On line 91713: 44
FASTA-Reader: Ignoring invalid residues at position(s): On line 102730: 48
FASTA-Reader: Ignoring invalid residues at position(s): On line 110967: 18
...

In [22]:
# --- STEP 5: Filter essential-like hits (homology-based stand-in for Geptop) ---
df_deg = pd.read_csv(blast_deg_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs","bitscore"])
# Keep hits with ≥30% identity and ≥70% query coverage
filtered = df_deg[(df_deg["pident"] >= 30) & (df_deg["qcovs"] >= 70)]

# Keep only the best hit per query (lowest e-value)
best_hits = filtered.sort_values("evalue").drop_duplicates("qseqid", keep="first")

print(f"⭐ Total DEG10 hits passing threshold: {len(filtered)}")
print(f"🎯 Unique predicted essential proteins: {len(best_hits)} ({(len(best_hits)/len(non_homologous_ids))*100:.1f}% of non-homologous proteins)")



⭐ Total DEG10 hits passing threshold: 6208
🎯 Unique predicted essential proteins: 443 (29.3% of non-homologous proteins)


In [23]:
# --- STEP 6: Extract FASTA for essential proteins ---
ids_to_keep = set(best_hits["qseqid"])
output_fasta = f"/content/{species_name}_predicted_essential.fasta"

with open(output_fasta, "w") as out:
    for record in SeqIO.parse(non_hom_fasta, "fasta"):
        if record.id in ids_to_keep:
            SeqIO.write(record, out, "fasta")

print(f"💾 FASTA saved: {output_fasta}")


💾 FASTA saved: /content/Bifidobacterium_animalis_predicted_essential.fasta


In [5]:
import pandas as pd
import requests
from tqdm import tqdm

# 1️⃣ Load KAAS mapping (protein → KO)
kaas_file = "/content/Bifidobacterium_animalis.csv"  # your CSV file

# Read as standard CSV (comma-separated)
df = pd.read_csv(kaas_file)

# Check column names
print("Columns detected:", df.columns.tolist())
if not {"protein", "KO"}.issubset(df.columns):
    df.columns = ["protein", "KO"]  # enforce standard naming if not present

# 2️⃣ Count assigned and unassigned
assigned = df["KO"].notna().sum()
unassigned = df["KO"].isna().sum()
print(f"Assigned KO IDs: {assigned}")
print(f"Unassigned proteins: {unassigned}")

# 3️⃣ Remove NA and get unique KO IDs
ko_list = df["KO"].dropna().unique().tolist()

# 4️⃣ Map each KO to KEGG pathways via KEGG REST API
def get_pathways_for_ko(ko):
    """Return the pathway IDs linked to one KO, or [] if the lookup fails."""
    url = f"https://rest.kegg.jp/link/pathway/ko:{ko}"
    res = requests.get(url)
    if res.status_code != 200:
        return []
    pathways = []
    for line in res.text.strip().split("\n"):
        parts = line.split("\t")
        if len(parts) > 1:  # only if both columns exist
            pathways.append(parts[1].replace("path:", ""))
    return pathways


ko_to_path = {}
for ko in tqdm(ko_list, desc="Mapping KO → Pathway"):
    ko_to_path[ko] = get_pathways_for_ko(ko)

# 5️⃣ Create DataFrame of KO → Pathway
path_df = (
    pd.DataFrame([(ko, p) for ko, plist in ko_to_path.items() for p in plist],
                 columns=["KO", "Pathway"])
)

# 6️⃣ Identify KO IDs with no pathway mapping (this counts KO IDs, not proteins)
mapped_kos = set(path_df["KO"])
unmapped_kos = [ko for ko in ko_list if ko not in mapped_kos]
print(f"\nKO-assigned proteins with NO pathway mapping: {len(unmapped_kos)}")

# 7️⃣ Download human pathway list
human_pathways = requests.get("https://rest.kegg.jp/list/pathway/hsa").text
human_path_list = [line.split("\t")[0].replace("path:", "") for line in human_pathways.strip().split("\n")]

# 8️⃣ Identify shared vs unique bacterial pathways
# CAUTION: the pathway IDs collected in step 4 carry map/ko prefixes (e.g. map00010), while the
# human list uses hsa prefixes (e.g. hsa00010). This direct comparison is therefore unlikely to
# match anything unless both sides are first reduced to their numeric part.
path_df["Shared_with_Human"] = path_df["Pathway"].isin(human_path_list)

shared = path_df[path_df["Shared_with_Human"]].Pathway.nunique()
unique = path_df[~path_df["Shared_with_Human"]].Pathway.nunique()

print(f"\n🧭 Pathway summary:")
print(f"Total distinct pathways: {path_df.Pathway.nunique()}")
print(f"Shared with Human: {shared}")
print(f"Unique bacterial: {unique}")

# 9️⃣ Save results
path_df.to_csv("/content/KAAS_pathway_analysis.csv", index=False)
print("\n✅ Results saved to: /content/KAAS_pathway_analysis.csv")

# 🔟 Save unique bacterial pathways only
unique_df = path_df[~path_df["Shared_with_Human"]]
unique_df.to_csv("/content/unique_bacterial_pathways.csv", index=False)
print("🧬 Unique bacterial pathways saved: /content/unique_bacterial_pathways.csv")



Columns detected: ['protein', 'KO']
Assigned KO IDs: 192
Unassigned proteins: 251


Mapping KO → Pathway: 100%|██████████| 190/190 [02:29<00:00,  1.27it/s]



KO-assigned proteins with NO pathway mapping: 60

🧭 Pathway summary:
Total distinct pathways: 208
Shared with Human: 0
Unique bacterial: 208

✅ Results saved to: /content/KAAS_pathway_analysis.csv
🧬 Unique bacterial pathways saved: /content/unique_bacterial_pathways.csv
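
The run above reports 0 pathways shared with human, which is most likely an identifier mismatch rather than a biological result: the KO links return reference pathway IDs with `map` or `ko` prefixes, whereas the human list uses `hsa` prefixes. Counting both `map` and `ko` forms would also inflate the distinct-pathway total. The sketch below shows one way to compare on the numeric part only; `pathway_number` and `split_shared_unique` are hypothetical helpers, and the input lists are made-up examples shaped like the two KEGG responses.

```python
import re

def pathway_number(pathway_id: str) -> str:
    """Reduce a KEGG pathway ID such as 'path:map00010' or 'hsa00010' to '00010'."""
    match = re.search(r"(\d{5})$", pathway_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized KEGG pathway ID: {pathway_id!r}")
    return match.group(1)

def split_shared_unique(bacterial_ids, human_ids):
    """Return (shared, unique) sets of pathway numbers relative to the human list."""
    bacterial = {pathway_number(p) for p in bacterial_ids}
    human = {pathway_number(p) for p in human_ids}
    return bacterial & human, bacterial - human

# Hypothetical inputs: prefixed IDs as returned for KOs, and hsa IDs for human
bacterial_ids = ["map00010", "ko00010", "map00550", "ko00550"]
human_ids = ["hsa00010", "hsa00020"]
shared, unique = split_shared_unique(bacterial_ids, human_ids)
print(sorted(shared), sorted(unique))
```

Applied to `path_df`, the same idea would mean adding a numeric-ID column and running `isin` against the numeric human IDs; the counts reported above would then need to be regenerated.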


In [13]:
import pandas as pd

# --- Input files ---
kaas_file = "/content/Bifidobacterium_animalis.csv"      # Protein → KO mapping (comma-delimited)
pathway_file = "/content/KAAS_pathway_analysis.csv"      # KO → Pathway mapping (comma-delimited)

# --- Load data ---
mapping_df = pd.read_csv(kaas_file)  # uses header from file
path_df = pd.read_csv(pathway_file)

# --- Summary: Assigned vs Unassigned KOs ---
assigned = mapping_df["KO"].notna().sum()
unassigned = mapping_df["KO"].isna().sum()

# --- KO → pathway mapping (counts KO IDs that link to no pathway) ---
ko_list = mapping_df["KO"].dropna().unique().tolist()
kos_with_pathway = set(path_df["KO"])
ko_with_no_pathway = [ko for ko in ko_list if ko not in kos_with_pathway]
num_no_pathway = len(ko_with_no_pathway)

# --- Filter only pathways flagged as not shared with human ---
# (relies on the Shared_with_Human column saved by the pathway-mapping cell)
unique_pathways_df = path_df[~path_df["Shared_with_Human"]]

# --- Merge protein → KO with KO → pathway ---
merged_df = pd.merge(mapping_df.dropna(subset=["KO"]), unique_pathways_df, on="KO", how="inner")
merged_df = merged_df.drop_duplicates(subset=["protein", "KO", "Pathway"])

# --- Save merged protein → KO → pathway CSV ---
output_file = "/content/Bifidobacterium_animalis_merged_information.csv"
merged_df.to_csv(output_file, index=False)

# --- Save summary info ---
summary_file = "/content/Bifidobacterium_animalis_KO_summary.csv"
summary_df = pd.DataFrame({
    "Metric": ["Assigned KO IDs", "Unassigned proteins", "KO-assigned proteins with NO pathway mapping",
               "Total distinct pathways", "Shared with Human", "Unique bacterial pathways"],
    "Count": [assigned, unassigned, num_no_pathway,
              path_df["Pathway"].nunique(), path_df[path_df["Shared_with_Human"]].Pathway.nunique(),
              unique_pathways_df.Pathway.nunique()]
})
summary_df.to_csv(summary_file, index=False)

# --- Print info ---
print(f"✅ Merged protein → KO → pathway file saved: {output_file}")
print(f"✅ Summary file saved: {summary_file}")
print(summary_df)


✅ Merged protein → KO → pathway file saved: /content/Bifidobacterium_animalis_merged_information.csv
✅ Summary file saved: /content/Bifidobacterium_animalis_KO_summary.csv
                                         Metric  Count
0                               Assigned KO IDs    192
1                           Unassigned proteins    251
2  KO-assigned proteins with NO pathway mapping     60
3                       Total distinct pathways    208
4                             Shared with Human      0
5                     Unique bacterial pathways    208


In [12]:
import pandas as pd

# --- Load KAAS results (protein ↔ KO) ---
kaas_df = pd.read_csv("/content/Bifidobacterium_animalis.csv")

# --- Load KO ↔ Pathway data (from your previous analysis) ---
path_df = pd.read_csv("/content/KAAS_pathway_analysis.csv")

# --- Filter for unique bacterial pathways ---
unique_df = path_df[~path_df["Shared_with_Human"]]

# --- Get list of unique KO IDs ---
unique_kos = unique_df["KO"].unique()

# --- Subset proteins belonging to those KOs ---
unique_proteins = kaas_df[kaas_df["KO"].isin(unique_kos)]

# --- Save list of unique proteins ---
unique_proteins.to_csv("/content/Bifidobacterium_animalis_unique_pathway_proteins.csv", index=False)
print(f"✅ Unique proteins saved: {unique_proteins.shape[0]}")


✅ Unique proteins saved: 130


In [8]:
from Bio import SeqIO
import pandas as pd

# --- INPUT FILES ---
fasta_file = "/content/Bifidobacterium_animalis_predicted_essential.fasta"
pathway_file = "/content/KAAS_pathway_analysis.csv"
mapping_file = "/content/Bifidobacterium_animalis.csv"

# --- LOAD DATA ---
path_df = pd.read_csv(pathway_file)
mapping_df = pd.read_csv(mapping_file, sep=",")  # the mapping file is comma-separated, not tab-separated

# Ensure column names are correct
mapping_df.columns = ["protein", "KO"]

# Get KOs that are NOT shared with human (unique bacterial)
unique_kos = path_df.loc[~path_df["Shared_with_Human"], "KO"].unique().tolist()

# Get protein IDs associated with those unique KOs
unique_proteins = mapping_df[mapping_df["KO"].isin(unique_kos)]["protein"].unique().tolist()

print(f"✅ Unique bacterial KOs: {len(unique_kos)}")
print(f"✅ Corresponding protein IDs: {len(unique_proteins)}")

# --- FILTER FASTA ---
output_fasta = "/content/Bifidobacterium_animalis_unique_pathway_proteins.fasta"
unique_protein_set = set(unique_proteins)
count = 0

with open(output_fasta, "w") as out_f:
    for record in SeqIO.parse(fasta_file, "fasta"):
        # Compare the full ID and each "|"-separated field (e.g. the accession) exactly,
        # which avoids the false positives a substring test can produce
        candidate_ids = {record.id, *record.id.split("|")}
        if candidate_ids & unique_protein_set:
            SeqIO.write(record, out_f, "fasta")
            count += 1

print(f"🎯 Unique-pathway protein sequences saved: {output_fasta}")
print(f"Total sequences written: {count}")





✅ Unique bacterial KOs: 130
✅ Corresponding protein IDs: 130
🎯 Unique-pathway protein sequences saved: /content/Bifidobacterium_animalis_unique_pathway_proteins.fasta
Total sequences written: 130
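
To see how strongly each filter narrows the candidate set, the counts printed by the cells above can be put into a small funnel table. This is only a bookkeeping sketch: the numbers are copied from the recorded outputs, and the stage labels and the `funnel` helper are my own naming.

```python
# Counts reported by the cells above
stages = [
    ("Full proteome", 1750),
    ("After CD-HIT (non-paralogous)", 1730),
    ("After human BLASTp (non-homologous)", 1513),
    ("DEG10 hits (predicted essential)", 443),
    ("In pathways flagged as bacterial-only", 130),
]

def funnel(stages):
    """Return (label, count, % of previous stage, % of first stage) rows."""
    rows = []
    first = stages[0][1]
    previous = first
    for label, count in stages:
        rows.append((label, count, round(100 * count / previous, 1), round(100 * count / first, 1)))
        previous = count
    return rows

for label, count, pct_prev, pct_first in funnel(stages):
    print(f"{label:<40}{count:>6}{pct_prev:>8.1f}%{pct_first:>8.1f}%")
```

The last column shows that about 7.4% of the original proteome reaches the final list.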


In [15]:
from Bio import SeqIO

input_fasta = "/content/Bifidobacterium_animalis_unique_pathway_proteins.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))

chunk_size = 70
for i in range(0, len(records), chunk_size):
    chunk = records[i:i+chunk_size]
    output_file = f"/content/unique_proteins_chunk_{i//chunk_size + 1}.fasta"
    SeqIO.write(chunk, output_file, "fasta")
    print(f"✅ Chunk saved: {output_file} ({len(chunk)} sequences)")


✅ Chunk saved: /content/unique_proteins_chunk_1.fasta (70 sequences)
✅ Chunk saved: /content/unique_proteins_chunk_2.fasta (60 sequences)


In [5]:
# NOTE: the layout of the NetGenes database is not yet clear to me, because it provides no protein sequences, only the essential genes and their scores.
from google.colab import files

uploaded = files.upload()  # Upload your NetGenes zip file





Saving NetGenes.zip to NetGenes.zip


In [6]:
!mkdir -p /content/netgenes
!unzip /content/NetGenes.zip -d /content/netgenes


Archive:  /content/NetGenes.zip
  inflating: /content/netgenes/Acaricomes phytoseiuli.csv  
  inflating: /content/netgenes/Acaryochloris marina.csv  
  inflating: /content/netgenes/Accumulibacter phosphatis.csv  
  inflating: /content/netgenes/Accumulibacter sp. BA93.csv  
  inflating: /content/netgenes/Acetobacter aceti 1023.csv  
  inflating: /content/netgenes/Acetobacter aceti ATCC23746.csv  
  inflating: /content/netgenes/Acetobacter malorum.csv  
  inflating: /content/netgenes/Acetobacter nitrogenifigens.csv  
  inflating: /content/netgenes/Acetobacter okinawensis.csv  
  inflating: /content/netgenes/Acetobacter pasteurianus 3P3.csv  
  inflating: /content/netgenes/Acetobacter pasteurianus IFO328301.csv  
  inflating: /content/netgenes/Acetobacteraceae bacterium AT5844.csv  
  inflating: /content/netgenes/Achromobacter arsenitoxydans.csv  
  inflating: /content/netgenes/Achromobacter insuavis.csv  
  inflating: /content/netgenes/Achromobacter piechaudii ATCC43553.csv  
  inflating
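
Since the archive unpacks into per-organism CSV files, a reasonable first step is to look at their header rows before deciding how to use them. The sketch below does that with the standard library; `csv_headers` is a hypothetical helper, and the demonstration file and its column names are invented, not the actual NetGenes schema.

```python
import csv
import tempfile
from pathlib import Path

def csv_headers(directory: str, limit: int = 5) -> dict:
    """Return {file name: header row} for the first `limit` CSV files in `directory`."""
    headers = {}
    for path in sorted(Path(directory).glob("*.csv"))[:limit]:
        with open(path, newline="", encoding="utf-8") as handle:
            headers[path.name] = next(csv.reader(handle), [])
    return headers

# Self-contained demonstration on a throwaway file with made-up column names
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Example organism.csv").write_text("gene,score\ngeneA,0.91\n", encoding="utf-8")
    demo = csv_headers(tmp)
print(demo)
```

In the notebook, `csv_headers("/content/netgenes")` would show the real columns of the first few organism files.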

In [None]:
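!apt-get install -y ncbi-blast+ > /dev/null

# Make BLAST database
# NOTE: this assumes a protein FASTA named NetGenes_bacteria.fasta exists. The files listed in
# the unzip output above are per-organism CSVs, so this step may fail as written.
!makeblastdb -in /content/netgenes/NetGenes_bacteria.fasta -dbtype prot -out /content/netgenes/netgenes_db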

