<a href="https://colab.research.google.com/github/nibaskumar93n-debug/Morphoinformatics/blob/main/Subtractive_genomic_analysis_was_applied_to_the_f_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Running the **entire subtractive genomic analysis pipeline** as described, from protein sequence retrieval through essentiality and metabolic pathway analysis to subcellular localization, **is generally not feasible *entirely* within a Google Colab notebook using the *exact* external web servers mentioned (UniProt, CD-HIT, BLASTp via the NCBI web interface, Geptop, KAAS)**.

However, **it is absolutely possible to replicate the *steps* and perform *equivalent analyses* using Python libraries and command-line tools that can be installed or run within the Colab environment**, though this requires significant coding and setup.

Here is a guide outlining the feasibility and detailing the steps for a **Colab-adapted implementation**.

-----

## 1\. Feasibility of Colab Implementation

| Step | Original Tool | Feasibility in Colab | Notes on Colab Implementation |
| :--- | :--- | :--- | :--- |
| **Protein Retrieval** | UniProt Database | **High** | Use **BioPython** to fetch sequences using accession IDs or use UniProt's API. |
| **Paralog Discarding** | CD-HIT Server | **High** | Install and run the command-line version of **CD-HIT** in Colab with the `!` shell prefix, or cluster with an alternative tool such as **MMseqs2**. |
| **Non-Homologous Identification** | BLASTp (NCBI Web) | **Medium/High** | Use **standalone BLAST+** (installable in Colab with `apt-get`) against a locally built database. BioPython's `Bio.Blast.NCBIWWW` module queries the remote NCBI service instead, which is slow for whole proteomes. Requires downloading a human proteome database. **This is the most computationally intensive step.** |
| **Essentiality Assessment** | Geptop Server | **Low** | Geptop is used through its web server in the original workflow and **is not run directly** in this notebook. You'd need a comparable **essential gene prediction tool** (e.g., machine learning models or comparative genomics data) or a dataset of known essential genes, if available. This step is the **hardest to replicate precisely.** |
| **Metabolic Pathway Analysis** | KAAS Server | **Low** | KAAS is a specialized web server and **cannot be run directly** in Colab. The **KEGG REST API** (for example through BioPython's `Bio.KEGG.REST`) can map KO identifiers to pathways, but assigning KOs to sequences still needs an annotation tool (e.g., **GhostKOALA**, if it offers an API/standalone version). This requires careful parsing of results. |
| **Subcellular Localization** | (Tool not specified) | **Medium** | Use publicly available **standalone localization prediction tools** such as **PSORTb** (if available for install), **DeepTMHMM** for transmembrane topology, or an **API** from a service like **DeepLoc** (if one exists). |
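
For the metabolic pathway row, the mapping from KO identifiers to pathways comes back from the KEGG REST `link` operation as plain text. The sketch below only parses such text and makes no network call. It assumes a two-column, tab-separated format (`ko:Kxxxxx<TAB>path:mapNNNNN`) and that KO identifiers were already assigned by some annotation step. The function name `parse_kegg_link` and the sample string are hypothetical illustrations, not output fetched from KEGG.

```python
from collections import defaultdict


def parse_kegg_link(text):
    """Parse two-column, tab-separated link output into {source_id: [target_ids]}."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip anything that is not a source/target pair
        source_id, target_id = parts
        mapping[source_id].append(target_id)
    return dict(mapping)


# Hypothetical sample text in the assumed format (not fetched from KEGG).
sample = (
    "ko:K00001\tpath:map00010\n"
    "ko:K00001\tpath:map00071\n"
    "ko:K00002\tpath:map00010\n"
)
ko_to_pathways = parse_kegg_link(sample)
print(ko_to_pathways)
```

Inverting the dictionary (pathway to KOs) would be the natural next step when comparing pathogen pathways against the host's.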

-----

## 2\. Step-by-Step Guide for Colab-Adapted Subtractive Genomic Analysis

### A. Setup and Dependencies

The first cell in your Colab notebook will be for installation.

In [1]:
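!pip install -q biopython pandas requests
import requests, os, pandas as pd
from Bio import SeqIO

# Create the working directories from Python (avoids `{...}` expansion quirks in `!` commands).
for sub in ("proteome", "non_paralogous", "blast_results"):
    os.makedirs(f"/content/{sub}", exist_ok=True)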


### B. Protein Sequence Retrieval (UniProt)

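The cell below expects a proteome FASTA that was downloaded from UniProt and uploaded manually to `/content/proteome/`; it renames the file after the species. Repeat it for each of your four organisms.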

In [2]:

# --- STEP 1. Upload your proteome manually from your computer
# Go to the left panel → Files tab → upload your FASTA file into /content/proteome/
# Example filename: uniprotkb_proteome_UP000037239_2025_10_29.fasta
# The cell then renames it to <species_name>.fasta:

uploaded_proteome = "/content/proteome/uniprotkb_proteome_UP000037239_2025_10_29.fasta"
species_name = "Bifidobacterium_animalis"
proteome_path = f"/content/proteome/{species_name}.fasta"

if os.path.exists(uploaded_proteome):
    os.rename(uploaded_proteome, proteome_path)
    print(f"✅ Proteome uploaded: {proteome_path}")
elif os.path.exists(proteome_path):
    # The file was already renamed on an earlier run of this cell.
    print(f"✅ Proteome already in place: {proteome_path}")
else:
    raise FileNotFoundError("❌ Please upload your FASTA file manually in Colab first!")


✅ Proteome uploaded: /content/proteome/Bifidobacterium_animalis.fasta
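
As an alternative to manual upload, the proteome could be fetched programmatically. The sketch below builds a download URL following the same UniProt streaming pattern used later in this notebook for the human proteome. The helper name `uniprot_proteome_url` is hypothetical, and the ID `UP000037239` is taken from the uploaded file name above on the assumption that it identifies the proteome of interest. Only the URL is built here; nothing is downloaded.

```python
from urllib.parse import urlencode

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_proteome_url(proteome_id):
    """Build a FASTA streaming URL for one UniProt proteome ID."""
    query = urlencode({"query": f"proteome:{proteome_id}", "format": "fasta"})
    return f"{UNIPROT_STREAM}?{query}"


# Proteome ID assumed from the uploaded file name in the cell above.
url = uniprot_proteome_url("UP000037239")
print(url)

# To actually download (not executed in this sketch), something like
#   requests.get(url, timeout=300).text
# could be written to /content/proteome/<species_name>.fasta.
```

`urlencode` percent-encodes the colon in the query, which is equivalent to the literal `proteome:...` form used in the `wget` call later on.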


### C. Paralog Discarding (CD-HIT)

Run the **CD-HIT** command-line tool within Colab using the `!` prefix.

In [3]:
# --- STEP 2. Remove paralogous sequences using CD-HIT (60% identity)
!apt-get install -y cd-hit

non_paralog_path = f"/content/non_paralogous/{species_name}_nonparalog.fasta"
!cd-hit -i "$proteome_path" -o "$non_paralog_path" -c 0.6 -n 4 -d 0
print(f"✅ CD-HIT completed: {non_paralog_path}")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  cd-hit
0 upgraded, 1 newly installed, 0 to remove and 38 not upgraded.
Need to get 521 kB of archives.
After this operation, 1,082 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 cd-hit amd64 4.8.1-4 [521 kB]
Fetched 521 kB in 1s (737 kB/s)
Selecting previously unselected package cd-hit.
(Reading database ... 126455 files and directories currently installed.)
Preparing to unpack .../cd-hit_4.8.1-4_amd64.deb ...
Unpacking cd-hit (4.8.1-4) ...
Setting up cd-hit (4.8.1-4) ...
Processing triggers for man-db (2.10.2-1) ...
Program: CD-HIT, V4.8.1 (+OpenMP), Aug 20 2021, 08:39:56
Command: cd-hit -i
         /content/proteome/Bifidobacterium_animalis.fasta -o
         /content/non_paralogous/Bifidobacterium_animalis_nonparalog.fasta
         -c 0.6 -n 4 -d 0

Started: Thu Oct 30 08:46:43 2025
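
Alongside the output FASTA, CD-HIT writes a cluster report with the same name plus a `.clstr` suffix. A minimal sketch of reading that report to see which proteins were grouped as paralogs is shown below. It assumes the usual `.clstr` layout (a `>Cluster N` header followed by member lines, with `*` marking the representative); the function name `parse_clstr` and the sample text are hypothetical.

```python
def parse_clstr(text):
    """Return {representative_id: [member_ids]} from CD-HIT .clstr text."""
    clusters = []
    current = None
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            current = {"rep": None, "members": []}
            clusters.append(current)
        elif line.strip() and current is not None:
            # Member line, e.g. "0\t350aa, >protA... *"
            seq_id, _, status = line.split(">", 1)[1].partition("...")
            current["members"].append(seq_id)
            if status.strip() == "*":
                current["rep"] = seq_id
    return {c["rep"]: c["members"] for c in clusters}


# Hypothetical sample in the assumed .clstr layout.
sample = (
    ">Cluster 0\n"
    "0\t350aa, >protA... *\n"
    "1\t348aa, >protB... at 92.50%\n"
    ">Cluster 1\n"
    "0\t120aa, >protC... *\n"
)
clusters = parse_clstr(sample)

# Clusters with more than one member are the groups collapsed by CD-HIT.
paralog_groups = {rep: ids for rep, ids in clusters.items() if len(ids) > 1}
print(paralog_groups)
```

On the real run, the text would come from reading the `.clstr` file next to `non_paralog_path`.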
    

### D. Non-Homologous Protein Identification (BLASTp vs. Human)

In [5]:
# --- STEP 3. Identify non-homologous proteins (BLASTp vs Human)

# 3a. Download human reference proteome (UniProt)
!wget -O /content/human.fasta "https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000005640&format=fasta"

# 3b. Install BLAST+
!apt-get install -y ncbi-blast+

# 3c. Build human BLAST database
!makeblastdb -in /content/human.fasta -dbtype prot -out /content/human_db

# 3d. Run BLASTp
blast_out = f"/content/blast_results/{species_name}_vs_human.tsv"
!blastp -query "$non_paralog_path" -db /content/human_db -outfmt "6 qseqid sseqid pident evalue qcovs" -evalue 1e-5 -num_threads 2 -out "$blast_out"
print("✅ BLASTp completed.")

# --- STEP 3e. Filter for non-homologous proteins
# A query is treated as human-homologous if ANY hit has >30% identity and ≥70% query coverage.
# All other queries are kept, including those with no BLAST hit (they never appear in the TSV).
df = pd.read_csv(blast_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs"])
homolog_ids = set(df.loc[(df["pident"] > 30) & (df["qcovs"] >= 70), "qseqid"])
records = list(SeqIO.parse(non_paralog_path, "fasta"))
non_hom_records = [rec for rec in records if rec.id not in homolog_ids]
ids_file = f"/content/blast_results/{species_name}_nonhomolog_ids.tsv"
pd.Series([rec.id for rec in non_hom_records], name="qseqid").to_csv(ids_file, sep="\t", index=False)
print(f"✅ Non-homologous proteins: {len(non_hom_records)} of {len(records)}")

# --- STEP 3f. Write the corresponding FASTA sequences for GEPTOP input
output_fasta = f"/content/{species_name}_nonhomolog.fasta"
SeqIO.write(non_hom_records, output_fasta, "fasta")

print(f"🎯 Saved non-homologous FASTA for {species_name}: {output_fasta}")
print("Next: Upload this FASTA file to GEPTOP for essential gene prediction.")

--2025-10-30 09:17:14--  https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000005640&format=fasta
Resolving rest.uniprot.org (rest.uniprot.org)... 193.62.193.81
Connecting to rest.uniprot.org (rest.uniprot.org)|193.62.193.81|:443... connected.
HTTP request sent, awaiting response... 200 
Length: unspecified [text/plain]
Saving to: ‘/content/human.fasta’

/content/human.fast     [          <=>       ]  38.77M  18.5MB/s    in 2.1s    

2025-10-30 09:17:16 (18.5 MB/s) - ‘/content/human.fasta’ saved [40649282]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ncbi-blast+ is already the newest version (2.12.0+ds-3build1).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


Building a new DB, current time: 10/30/2025 09:17:19
New DB name:   /content/human_db
New DB title:  /content/human.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /content/human_db
Keep MBits: T
Maximum file size: 1000000000
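
One point that is easy to miss with tabular BLAST output: queries without any hit never appear in the TSV, yet they are the clearest non-homologs. The toy sketch below illustrates one reading of the selection criterion, stated here as an assumption: a query counts as human-homologous when any of its hits exceeds 30% identity at 70% or more query coverage, and everything else is retained. The function name `non_homolog_ids` and the toy data are hypothetical.

```python
import csv
import io


def non_homolog_ids(all_query_ids, blast_tsv_text, max_identity=30.0, min_coverage=70.0):
    """Return query IDs with no hit above both thresholds (no-hit queries included)."""
    homologous = set()
    reader = csv.reader(io.StringIO(blast_tsv_text), delimiter="\t")
    for row in reader:
        if not row:
            continue
        # Column order follows: qseqid sseqid pident evalue qcovs
        qseqid, _sseqid, pident, _evalue, qcovs = row
        if float(pident) > max_identity and float(qcovs) >= min_coverage:
            homologous.add(qseqid)
    return [q for q in all_query_ids if q not in homologous]


# Toy data: p1 has a strong human hit, p2 only weak hits, p3 no hit at all.
toy_hits = (
    "p1\th1\t55.0\t1e-40\t90\n"
    "p2\th2\t25.0\t1e-6\t80\n"
    "p2\th3\t28.0\t1e-7\t40\n"
)
kept = non_homolog_ids(["p1", "p2", "p3"], toy_hits)
print(kept)  # ['p2', 'p3']
```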

### E. Essentiality Assessment (BLASTp vs. Essential Protein Dataset)

In [12]:
# ============================================================
# 🧬 STEP 4a: Download essential bacterial protein dataset
# ============================================================

!mkdir -p /content/deg

# NOTE: in the recorded run below this URL returned "404 Not Found", leaving an empty file.
# If that happens, upload a DEG-style FASTA of essential bacterial proteins to
# /content/deg/Essential_Bacteria.fasta manually before building the database.
!wget -O /content/deg/Essential_Bacteria.fasta "https://raw.githubusercontent.com/Akash19091997/Bioinfo_datasets/main/Essential_Bacteria_DEG.fasta"

# ============================================================
# 🧩 STEP 4b: Create BLAST database
# ============================================================
!apt-get install -y ncbi-blast+ > /dev/null
!makeblastdb -in /content/deg/Essential_Bacteria.fasta -dbtype prot -out /content/deg/ess_db








--2025-10-30 12:20:52--  https://raw.githubusercontent.com/Akash19091997/Bioinfo_datasets/main/Essential_Bacteria_DEG.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-10-30 12:20:52 ERROR 404: Not Found.



Building a new DB, current time: 10/30/2025 12:20:55
New DB name:   /content/deg/ess_db
New DB title:  /content/deg/Essential_Bacteria.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File /content/deg/Essential_Bacteria.fasta is empty
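
The failure above comes from building a database on an empty download. A small pre-check, sketched below, would catch that before `makeblastdb` is called. The helper name `looks_like_fasta` is hypothetical, and the check is deliberately shallow: it only confirms that the file exists, is non-empty, and that its first non-blank line starts with `>`.

```python
import os
import tempfile


def looks_like_fasta(path):
    """True if the file exists, is non-empty, and its first non-blank line starts with '>'."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False


# Demonstration on temporary files: one empty (like a failed download), one valid.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.fasta")
    good_path = os.path.join(tmp, "good.fasta")
    open(empty_path, "w").close()
    with open(good_path, "w") as fh:
        fh.write(">seq1\nMKT\n")
    empty_ok = looks_like_fasta(empty_path)
    good_ok = looks_like_fasta(good_path)

print(empty_ok, good_ok)  # False True
```

In the notebook, calling this on `/content/deg/Essential_Bacteria.fasta` and raising an error when it returns `False` would stop the pipeline at the download step rather than at the BLAST step.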


In [None]:
# --- STEP 4c: Run BLASTp of your non-homologous proteins vs the essential-protein database
species_name = "Bifidobacterium_animalis"
nonhomolog_fasta = f"/content/{species_name}_nonhomolog.fasta"
blast_deg_out = f"/content/blast_results/{species_name}_vs_DEG.tsv"

# The database was built above as /content/deg/ess_db, so query that name.
!blastp -query "$nonhomolog_fasta" -db /content/deg/ess_db -outfmt "6 qseqid sseqid pident evalue qcovs" -evalue 1e-5 -num_threads 2 -out "$blast_deg_out"


In [None]:
# --- STEP 4d: Filter BLAST hits for essential genes (≥35% identity, ≥70% query coverage)
import pandas as pd

df = pd.read_csv(blast_deg_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs"])
essential_hits = df[(df["pident"] >= 35) & (df["qcovs"] >= 70)]

# Save the qualifying hits. One protein can match several database entries,
# so count unique query IDs rather than rows.
essential_hits_file = f"/content/blast_results/{species_name}_essential_proteins.tsv"
essential_hits.to_csv(essential_hits_file, sep="\t", index=False)
print(f"✅ Predicted essential proteins saved: {essential_hits_file}")
print(f"Total predicted essential proteins: {essential_hits['qseqid'].nunique()}")
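
Because a single query can match several entries in the essential-protein database, it can help to reduce the table to one row per query before moving on. The sketch below keeps the hit with the lowest e-value for each query, breaking ties by higher identity. That ranking rule is an assumption made for illustration, not something prescribed by the pipeline, and the function name `best_hit_per_query` and the toy rows are hypothetical.

```python
def best_hit_per_query(rows):
    """Keep one hit per query: lowest e-value, ties broken by higher identity.

    rows: iterable of (qseqid, sseqid, pident, evalue, qcovs).
    """
    best = {}
    for qseqid, sseqid, pident, evalue, qcovs in rows:
        rank = (float(evalue), -float(pident))  # smaller rank is better
        if qseqid not in best or rank < best[qseqid][0]:
            hit = (qseqid, sseqid, float(pident), float(evalue), float(qcovs))
            best[qseqid] = (rank, hit)
    return {q: hit for q, (_, hit) in best.items()}


# Toy rows in the same column order as the BLAST output above.
toy_rows = [
    ("p2", "DEG1", 40.0, 1e-20, 85),
    ("p2", "DEG2", 62.0, 1e-50, 95),
    ("p3", "DEG9", 36.0, 1e-8, 72),
]
best = best_hit_per_query(toy_rows)
for qseqid, hit in best.items():
    print(qseqid, "->", hit[1])
```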
