<a href="https://colab.research.google.com/github/nibaskumar93n-debug/Morphoinformatics/blob/main/Subtractive_genomic_analysis_was_applied_to_the_f_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Running the **entire subtractive genomic analysis pipeline** as described, from protein sequence retrieval through essentiality and metabolic pathway analysis to subcellular localization, **is generally not feasible *entirely* within a Google Colab notebook if it must use the *exact* external web servers mentioned (UniProt, CD-HIT, BLASTp via the NCBI web interface, Geptop, KAAS)**.

However, **it is absolutely possible to replicate the *steps* and perform *equivalent analyses* using Python libraries and command-line tools that can be installed or run within the Colab environment**, though this requires significant coding and setup.

Here is a guide outlining the feasibility and detailing the steps for a **Colab-adapted implementation**.

-----

## 1\. Feasibility of Colab Implementation

| Step | Original Tool | Feasibility in Colab | Notes on Colab Implementation |
| :--- | :--- | :--- | :--- |
| **Protein Retrieval** | UniProt Database | **High** | Use **BioPython** to fetch sequences using accession IDs or use UniProt's API. |
| **Paralog Discarding** | CD-HIT Server | **High** | Install and run the command-line version of **CD-HIT** in Colab with the `!` prefix. Alternatives are a custom clustering script (for example with `scikit-learn`) or another command-line clustering tool such as **MMseqs2**. |
| **Non-Homologous Identification** | BLASTp (NCBI Web) | **Medium/High** | Use **standalone BLAST+** (easily installed in Colab) against a downloaded human proteome database, or submit remote searches through BioPython's `Bio.Blast.NCBIWWW` module. **This is the most computationally intensive step.** |
| **Essentiality Assessment** | Geptop Server | **Low** | Geptop is a web server and **cannot be run directly** from this notebook. You need a comparable **essential gene prediction approach** (e.g., machine learning models or comparative genomics data) or a dataset of known essential genes. This notebook substitutes a BLASTp search against the DEG10 protein set. This step is the **hardest to replicate precisely.** |
| **Metabolic Pathway Analysis** | KAAS Server | **Low** | KAAS is a specialized web server. **Cannot be run directly.** You would use **BioPython** and the **KEGG REST API** (or similar tools like **GhostKOALA** if they offer an API/standalone version) to assign KOs and map to pathways. This requires careful parsing of results. |
| **Subcellular Localization** | (Tool not specified) | **Medium** | Use publicly available **standalone localization prediction tools** like **PSORTb** or **DeepTMHMM** (if available for install) or use an **API** from a service like **DeepLoc** (if one exists). |

-----

## 2\. Step-by-Step Guide for Colab-Adapted Subtractive Genomic Analysis

### A. Setup and Dependencies

The first cell in your Colab notebook will be for installation.

In [5]:
!pip install -q biopython pandas requests
import os
import requests
import pandas as pd
from Bio import SeqIO

# Create the working directories used by the later steps
for subdir in ("proteome", "non_paralogous", "blast_results"):
    os.makedirs(f"/content/{subdir}", exist_ok=True)

### B. Protein Sequence Retrieval (UniProt)

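Download the proteome of each organism from **UniProt** as a FASTA file and upload it to `/content/proteome/`. The cell below checks that the file is present and uses **BioPython** to count its sequences.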

In [16]:
# --- STEP 1: Locate the uploaded proteome ---
species_name = "Bacteroides_fragilis"
proteome_path = f"/content/proteome/{species_name}.fasta"

# The FASTA file must already be uploaded under this exact name
if not os.path.exists(proteome_path):
    raise FileNotFoundError("❌ Please upload your FASTA file manually in Colab first!")
print(f"✅ Proteome uploaded: {proteome_path}")

# --- Count total proteins ---
total_proteins = sum(1 for _ in SeqIO.parse(proteome_path, "fasta"))
print(f"🧩 Total proteins in proteome: {total_proteins}")


✅ Proteome uploaded: /content/proteome/Bacteroides_fragilis.fasta
🧩 Total proteins in proteome: 4234

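As an alternative to a manual upload, the proteome could be fetched from UniProt's REST API. The sketch below reuses the streaming endpoint that this notebook calls later for the human proteome. The helper names `proteome_fasta_url` and `download_proteome` are hypothetical, and you would need to look up the proteome identifier of your own organism. Only the human identifier `UP000005640` appears in this notebook.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def proteome_fasta_url(proteome_id: str) -> str:
    """Build the streaming FASTA URL for one UniProt proteome identifier."""
    query = urlencode({"query": f"proteome:{proteome_id}", "format": "fasta"})
    return f"{UNIPROT_STREAM}?{query}"


def download_proteome(proteome_id: str, dest: str) -> Path:
    """Download the proteome FASTA to `dest` (requires network access)."""
    out = Path(dest)
    out.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(proteome_fasta_url(proteome_id), timeout=120) as resp:
        out.write_bytes(resp.read())
    return out


# Only the URL is built here; call download_proteome(...) to fetch the file.
print(proteome_fasta_url("UP000005640"))
```

Keeping URL construction separate from the download makes the request easy to inspect before any data is transferred.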

### C. Paralog Discarding (CD-HIT)

Run the **CD-HIT** command-line tool within Colab using the `!` prefix.

In [17]:
# --- STEP 2: Remove paralogous sequences using CD-HIT (60% identity) ---
!apt-get install -y cd-hit

non_paralog_path = f"/content/non_paralogous/{species_name}_nonparalog.fasta"
os.makedirs("/content/non_paralogous", exist_ok=True)
# -c 0.6: 60% identity threshold; -n 4: word size suited to that threshold;
# -d 0: keep sequence names up to the first space in the .clstr file
!cd-hit -i "$proteome_path" -o "$non_paralog_path" -c 0.6 -n 4 -d 0 > /dev/null

# --- Count after CD-HIT ---
non_paralog_count = sum(1 for _ in SeqIO.parse(non_paralog_path, "fasta"))
print(f"🧬 Non-paralogous proteins retained: {non_paralog_count} ({(non_paralog_count/total_proteins)*100:.1f}% retained)")


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cd-hit is already the newest version (4.8.1-4).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
🧬 Non-paralogous proteins retained: 4085 (96.5% retained)

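CD-HIT also writes a `.clstr` file next to its FASTA output, which shows which sequences were merged into each cluster. The following is a minimal sketch of a parser for that file, assuming the usual layout of `>Cluster N` headers followed by member lines where the representative ends in `*`. The function name `parse_clstr` and the sample text are illustrative only.

```python
import re
from collections import defaultdict

# Member lines look like: "0<TAB>350aa, >protA... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text: str) -> dict:
    """Map cluster number -> list of (sequence_id, is_representative)."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                is_rep = line.rstrip().endswith("*")
                clusters[current].append((match.group(1), is_rep))
    return dict(clusters)


# Hypothetical two-cluster example
SAMPLE = """>Cluster 0
0\t350aa, >protA... *
1\t348aa, >protB... at 92.50%
>Cluster 1
0\t120aa, >protC... *
"""

clusters = parse_clstr(SAMPLE)
# Clusters with more than one member contain the sequences dropped as paralogs
paralog_groups = {cid: members for cid, members in clusters.items() if len(members) > 1}
print(paralog_groups)
```

On the real run you would pass the contents of `non_paralog_path + ".clstr"` to the parser.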

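### D. Non-Homologous Protein Identification (BLASTp vs. Human)

Search the non-paralogous proteins against the human reference proteome with standalone **BLAST+** and discard those that resemble a human protein.

In [18]: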
# --- STEP 3: Remove human homologs ---

# 3a. Install BLAST+
!apt-get install -y ncbi-blast+ > /dev/null
# 3b. Download human reference proteome (UniProt)
!wget -q -O /content/human.fasta "https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000005640&format=fasta"

# 3c. Build human BLAST database
!makeblastdb -in /content/human.fasta -dbtype prot -out /content/human_db > /dev/null

# 3d. Run BLASTp vs human
blast_out = f"/content/blast_results/{species_name}_vs_human.tsv"
os.makedirs("/content/blast_results", exist_ok=True)
!blastp -query "$non_paralog_path" -db /content/human_db -outfmt "6 qseqid sseqid pident evalue qcovs" -evalue 1e-5 -num_threads 2 -out "$blast_out"

print("✅ BLASTp vs Human completed.")

# --- 3e. Flag human homologs (>30% identity AND ≥70% query coverage); all other proteins are kept ---
df_human = pd.read_csv(blast_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs"])
human_homologs = set(df_human[(df_human["pident"] > 30) & (df_human["qcovs"] >= 70)]["qseqid"])

# Single pass over the FASTA: keep every record that is not flagged as a homolog
non_homologous_records = [
    record for record in SeqIO.parse(non_paralog_path, "fasta")
    if record.id not in human_homologs
]
non_homologous_ids = [record.id for record in non_homologous_records]

print(f"🚫 Human-homologous proteins removed: {len(human_homologs)}")
print(f"✅ Non-homologous proteins retained: {len(non_homologous_ids)} ({(len(non_homologous_ids)/non_paralog_count)*100:.1f}% retained)")

# --- Save non-homologous FASTA ---
non_hom_fasta = f"/content/{species_name}_nonhomolog.fasta"
SeqIO.write(non_homologous_records, non_hom_fasta, "fasta")

✅ BLASTp vs Human completed.
🚫 Human-homologous proteins removed: 348
✅ Non-homologous proteins retained: 3737 (91.5% retained)
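The filtering rule can be checked on a toy table before trusting it on the full BLAST output. This sketch uses only the standard library and the same five `outfmt 6` columns as the cell above. The function name `human_homolog_ids` and the sample rows are made up for illustration.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "qcovs"]


def human_homolog_ids(tsv_text, min_identity=30.0, min_coverage=70.0):
    """Return query IDs with at least one hit above both cut-offs."""
    homologs = set()
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["pident"]) > min_identity and float(row["qcovs"]) >= min_coverage:
            homologs.add(row["qseqid"])
    return homologs


SAMPLE = (
    "q1\thsa1\t45.0\t1e-40\t90\n"  # identity and coverage both high: homolog
    "q2\thsa2\t28.0\t1e-06\t85\n"  # identity too low: retained
    "q3\thsa3\t55.0\t1e-20\t40\n"  # coverage too low: retained
)
all_queries = ["q1", "q2", "q3", "q4"]  # q4 produced no BLAST hit at all

homologs = human_homolog_ids(SAMPLE)
retained = [q for q in all_queries if q not in homologs]
print(retained)
```

Note that proteins without any hit never appear in the BLAST table, which is why the retained set is built from the FASTA IDs rather than from the table.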


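### E. Essentiality Assessment (BLASTp vs. DEG)

In place of Geptop, search the non-homologous proteins against the DEG10 protein set, which must first be uploaded to `/content/DEG10.aa.gz`.

In [9]: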
# Unzip DEG10
!gunzip -c /content/DEG10.aa.gz > /content/DEG10.aa.fasta


In [19]:
# --- STEP 4: Predict essential proteins using DEG10 ---
!makeblastdb -in /content/DEG10.aa.fasta -dbtype prot -out /content/deg10_db > /dev/null

blast_deg_out = f"/content/{species_name}_vs_deg10.tsv"
!blastp -query "$non_hom_fasta" -db /content/deg10_db -outfmt "6 qseqid sseqid pident evalue qcovs bitscore" -evalue 1e-5 -num_threads 2 -out "$blast_deg_out"

print("✅ BLASTp vs DEG10 completed.")


FASTA-Reader: Ignoring invalid residues at position(s): On line 91713: 44
FASTA-Reader: Ignoring invalid residues at position(s): On line 102730: 48
FASTA-Reader: Ignoring invalid residues at position(s): On line 110967: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 112557: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 112604: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 112775: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113161: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113389: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113405: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113418: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113681: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 113850: 18
FASTA-Reader: Ignoring invalid residues at position(s): On line 114182: 18
FASTA-Reader: Ignoring inv
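The warnings above indicate that some DEG10 entries contain characters BLAST does not treat as residues. A small scan like the one below can list the affected sequences before the database is built. It is a sketch: the `ALLOWED` alphabet is an assumption (20 standard residues plus common ambiguity and special codes), not BLAST's own definition, and the sample FASTA is invented.

```python
# Assumed protein alphabet; adjust it to match your tool's expectations
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBZXUOJ*")


def iter_fasta(text):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = (line[1:].split() or [""])[0]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def invalid_residues(text):
    """Map sequence ID -> sorted list of characters outside ALLOWED."""
    report = {}
    for seq_id, seq in iter_fasta(text):
        bad = sorted(set(seq.upper()) - ALLOWED)
        if bad:
            report[seq_id] = bad
    return report


SAMPLE = ">DEG_ok\nMKTAYIAKQR\n>DEG_bad\nMKT-AY1AK\n"
report = invalid_residues(SAMPLE)
print(report)
```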

In [20]:
# --- STEP 5: Filter essential-like hits (Geptop-like logic) ---
df_deg = pd.read_csv(blast_deg_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs","bitscore"])
filtered = df_deg[(df_deg["pident"] >= 30) & (df_deg["qcovs"] >= 70)]

# Keep only the best hit per query (lowest E-value, ties broken by highest bit score)
best_hits = (
    filtered.sort_values(["evalue", "bitscore"], ascending=[True, False])
    .drop_duplicates("qseqid", keep="first")
)

print(f"⭐ Total DEG10 hits passing threshold: {len(filtered)}")
print(f"🎯 Unique predicted essential proteins: {len(best_hits)} ({(len(best_hits)/len(non_homologous_ids))*100:.1f}% of non-homologous proteins)")



⭐ Total DEG10 hits passing threshold: 5745
🎯 Unique predicted essential proteins: 853 (22.8% of non-homologous proteins)


In [21]:
# --- STEP 5 (revised): Filter essential-like hits with stricter Geptop-like thresholds ---
# The first pass predicted too many essential proteins, so the cut-offs are tightened here.
df_deg = pd.read_csv(blast_deg_out, sep="\t", names=["qseqid","sseqid","pident","evalue","qcovs","bitscore"])

# ✅ Tightened thresholds for higher confidence
filtered = df_deg[
    (df_deg["pident"] >= 40) &
    (df_deg["qcovs"] >= 80) &
    (df_deg["bitscore"] >= 100) &
    (df_deg["evalue"] <= 1e-10)
]

# Keep only the best hit per protein (lowest E-value, ties broken by highest bit score)
best_hits = (
    filtered.sort_values(["evalue", "bitscore"], ascending=[True, False])
    .drop_duplicates("qseqid", keep="first")
)

print(f"⭐ Total DEG10 hits passing strict threshold: {len(filtered)}")
print(f"🎯 Unique predicted essential proteins: {len(best_hits)} ({(len(best_hits)/len(non_homologous_ids))*100:.1f}% of non-homologous proteins)")


⭐ Total DEG10 hits passing strict threshold: 3389
🎯 Unique predicted essential proteins: 665 (17.8% of non-homologous proteins)

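The threshold-then-best-hit logic can also be written without pandas, which makes the tie-breaking explicit. The sketch below applies the strict cut-offs from the cell above and, among hits with equal E-values (BLAST often reports `0.0` for several strong hits), prefers the higher bit score. The `Hit` type, the function name `best_hits_per_query`, and the sample hits are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    evalue: float
    qcovs: float
    bitscore: float


def best_hits_per_query(hits, min_ident=40.0, min_cov=80.0, min_bits=100.0, max_evalue=1e-10):
    """Keep one hit per query: lowest E-value, ties broken by highest bit score."""
    best = {}
    for hit in hits:
        if (hit.pident < min_ident or hit.qcovs < min_cov
                or hit.bitscore < min_bits or hit.evalue > max_evalue):
            continue  # fails at least one cut-off
        current = best.get(hit.qseqid)
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[hit.qseqid] = hit
    return best


hits = [
    Hit("q1", "DEG_a", 62.0, 0.0, 95, 410.0),
    Hit("q1", "DEG_b", 58.0, 0.0, 93, 455.0),    # same E-value, higher bit score
    Hit("q2", "DEG_c", 35.0, 1e-30, 90, 150.0),  # fails the identity cut-off
    Hit("q3", "DEG_d", 47.0, 1e-12, 82, 120.0),
]
selected = best_hits_per_query(hits)
print(sorted(selected))
```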

In [22]:
# --- STEP 6: Extract FASTA for essential proteins (strict thresholds) ---
ids_to_keep = set(best_hits["qseqid"])
output_fasta = f"/content/{species_name}_predicted_essential_revised_threshold.fasta"

with open(output_fasta, "w") as out:
    for record in SeqIO.parse(non_hom_fasta, "fasta"):
        if record.id in ids_to_keep:
            SeqIO.write(record, out, "fasta")

print(f"💾 FASTA saved: {output_fasta}")


💾 FASTA saved: /content/Bacteroides_fragilis_predicted_essential_revised_threshold.fasta

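### F. Metabolic Pathway Analysis (KAAS + KEGG REST API)

Export the KAAS protein-to-KO assignments as a CSV, upload it, then map each KO to its pathways through the KEGG REST API.

In [1]: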
import pandas as pd
import requests
from tqdm import tqdm

# 1️⃣ Load KAAS mapping (protein → KO)
kaas_file = "/content/Kaas_bacteroides_fragilis.csv"  # your CSV file

# Read as standard CSV (comma-separated)
df = pd.read_csv(kaas_file)

# Check column names
print("Columns detected:", df.columns.tolist())
if not {"protein", "KO"}.issubset(df.columns):
    df.columns = ["protein", "KO"]  # enforce standard naming if not present

# 2️⃣ Count assigned and unassigned
assigned = df["KO"].notna().sum()
unassigned = df["KO"].isna().sum()
print(f"Assigned KO IDs: {assigned}")
print(f"Unassigned proteins: {unassigned}")

# 3️⃣ Remove NA and get unique KO IDs
ko_list = df["KO"].dropna().unique().tolist()

# 4️⃣ Map each KO to KEGG pathways via KEGG REST API
def get_pathways_for_ko(ko):
    """Return the pathway IDs linked to one KO (empty list if none or on HTTP error)."""
    url = f"https://rest.kegg.jp/link/pathway/ko:{ko}"
    res = requests.get(url, timeout=30)
    if res.status_code != 200:
        return []
    pathways = []
    for line in res.text.strip().split("\n"):
        parts = line.split("\t")
        if len(parts) > 1:  # only if both columns exist
            pathways.append(parts[1].replace("path:", ""))
    return pathways


ko_to_path = {}
for ko in tqdm(ko_list, desc="Mapping KO → Pathway"):
    ko_to_path[ko] = get_pathways_for_ko(ko)

# 5️⃣ Create DataFrame of KO → Pathway
path_df = (
    pd.DataFrame([(ko, p) for ko, plist in ko_to_path.items() for p in plist],
                 columns=["KO", "Pathway"])
)

# 6️⃣ Identify KO IDs with no pathway mapping (this counts KO IDs, not proteins)
mapped_kos = set(path_df["KO"])
unmapped_kos = [ko for ko in ko_list if ko not in mapped_kos]
print(f"\nKO-assigned proteins with NO pathway mapping: {len(unmapped_kos)}")

# 7️⃣ Download human pathway list
human_pathways = requests.get("https://rest.kegg.jp/list/pathway/hsa", timeout=30).text
human_path_list = [line.split("\t")[0].replace("path:", "") for line in human_pathways.strip().split("\n")]

# 8️⃣ Identify shared vs unique bacterial pathways
# NOTE: pathway IDs linked from KOs carry map/ko prefixes (e.g. map00010), while the
# human list uses the hsa prefix (e.g. hsa00010). This raw-ID comparison therefore flags
# a pathway as shared only when the two strings are identical.
path_df["Shared_with_Human"] = path_df["Pathway"].isin(human_path_list)

shared = path_df[path_df["Shared_with_Human"]].Pathway.nunique()
unique = path_df[~path_df["Shared_with_Human"]].Pathway.nunique()

print(f"\n🧭 Pathway summary:")
print(f"Total distinct pathways: {path_df.Pathway.nunique()}")
print(f"Shared with Human: {shared}")
print(f"Unique bacterial: {unique}")

# 9️⃣ Save results
path_df.to_csv("/content/KAAS_pathway_analysis.csv", index=False)
print("\n✅ Results saved to: /content/KAAS_pathway_analysis.csv")

# 🔟 Save unique bacterial pathways only
unique_df = path_df[~path_df["Shared_with_Human"]]
unique_df.to_csv("/content/unique_bacterial_pathways.csv", index=False)
print("🧬 Unique bacterial pathways saved: /content/unique_bacterial_pathways.csv")



Columns detected: ['Protein', 'KO']
Assigned KO IDs: 423
Unassigned proteins: 242


Mapping KO → Pathway: 100%|██████████| 399/399 [04:03<00:00,  1.64it/s]



KO-assigned proteins with NO pathway mapping: 116

🧭 Pathway summary:
Total distinct pathways: 230
Shared with Human: 0
Unique bacterial: 230

✅ Results saved to: /content/KAAS_pathway_analysis.csv
🧬 Unique bacterial pathways saved: /content/unique_bacterial_pathways.csv

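A result of zero shared pathways is worth double-checking, because KEGG pathway identifiers differ in their prefix depending on where they come from: as far as I know, KO links return `map`- and `ko`-prefixed IDs while the human list uses `hsa`. One way to compare them is by their five-digit pathway number, as in the sketch below. The function names and the short ID lists are illustrative assumptions, not part of the recorded run.

```python
import re


def pathway_number(pathway_id):
    """Return the trailing 5-digit pathway number, or None if absent."""
    match = re.search(r"(\d{5})$", pathway_id.strip())
    return match.group(1) if match else None


def split_shared(bacterial_ids, human_ids):
    """Compare pathways by number, ignoring 'path:', map/ko, and organism prefixes."""
    human_numbers = {pathway_number(p) for p in human_ids} - {None}
    bacterial_numbers = {pathway_number(p) for p in bacterial_ids} - {None}
    shared = sorted(bacterial_numbers & human_numbers)
    unique = sorted(bacterial_numbers - human_numbers)
    return shared, unique


# Toy input: each bacterial pathway appears under both the map and ko prefix
bacterial = ["map00010", "ko00010", "map00550", "ko00550"]
human = ["hsa00010", "path:hsa00020"]

shared, unique = split_shared(bacterial, human)
print("shared:", shared)
print("unique:", unique)
```

Comparing by number also collapses the `map`/`ko` duplicates, so each pathway is counted once.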

In [3]:
import pandas as pd

# --- Input files ---
kaas_file = "/content/Kaas_bacteroides_fragilis.csv"      # Protein → KO mapping (comma-delimited)
pathway_file = "/content/KAAS_pathway_analysis.csv"      # KO → Pathway mapping (comma-delimited)

# --- Load data ---
mapping_df = pd.read_csv(kaas_file)  # uses header from file
path_df = pd.read_csv(pathway_file)

# --- Summary: Assigned vs Unassigned KOs ---
assigned = mapping_df["KO"].notna().sum()
unassigned = mapping_df["KO"].isna().sum()

# --- KO → pathway mapping ---
ko_list = mapping_df["KO"].dropna().unique().tolist()
kos_with_pathway = set(path_df["KO"].unique())
ko_with_no_pathway = [ko for ko in ko_list if ko not in kos_with_pathway]
num_no_pathway = len(ko_with_no_pathway)

# --- Filter only unique bacterial pathways ---
unique_pathways_df = path_df[~path_df["Shared_with_Human"]]

# --- Merge protein → KO with KO → pathway ---
merged_df = pd.merge(mapping_df.dropna(subset=["KO"]), unique_pathways_df, on="KO", how="inner")
merged_df = merged_df.drop_duplicates(subset=["Protein", "KO", "Pathway"])

# --- Save merged protein → KO → pathway CSV ---
output_file = "/content/Bacteroides_fragilis_merged_information.csv"
merged_df.to_csv(output_file, index=False)

# --- Save summary info ---
summary_file = "/content/Bacteroides_fragilis_KO_summary.csv"
summary_df = pd.DataFrame({
    "Metric": ["Assigned KO IDs", "Unassigned proteins", "KO-assigned proteins with NO pathway mapping",
               "Total distinct pathways", "Shared with Human", "Unique bacterial pathways"],
    "Count": [assigned, unassigned, num_no_pathway,
              path_df["Pathway"].nunique(), path_df[path_df["Shared_with_Human"]].Pathway.nunique(),
              unique_pathways_df.Pathway.nunique()]
})
summary_df.to_csv(summary_file, index=False)

# --- Print info ---
print(f"✅ Merged protein → KO → pathway file saved: {output_file}")
print(f"✅ Summary file saved: {summary_file}")
print(summary_df)


✅ Merged protein → KO → pathway file saved: /content/Bacteroides_fragilis_merged_information.csv
✅ Summary file saved: /content/Bacteroides_fragilis_KO_summary.csv
                                         Metric  Count
0                               Assigned KO IDs    423
1                           Unassigned proteins    242
2  KO-assigned proteins with NO pathway mapping    116
3                       Total distinct pathways    230
4                             Shared with Human      0
5                     Unique bacterial pathways    230


In [4]:
import pandas as pd

# --- Load KAAS results (protein ↔ KO) ---
kaas_df = pd.read_csv("/content/Kaas_bacteroides_fragilis.csv")

# --- Load KO ↔ Pathway data (from your previous analysis) ---
path_df = pd.read_csv("/content/KAAS_pathway_analysis.csv")

# --- Filter for unique bacterial pathways ---
unique_df = path_df[~path_df["Shared_with_Human"]]

# --- Get list of unique KO IDs ---
unique_kos = unique_df["KO"].unique()

# --- Subset proteins belonging to those KOs ---
unique_proteins = kaas_df[kaas_df["KO"].isin(unique_kos)]

# --- Save list of unique proteins ---
unique_proteins.to_csv("/content/Bacteroides_fragilis_unique_pathway_proteins.csv", index=False)
print(f"✅ Unique proteins saved: {unique_proteins.shape[0]}")


✅ Unique proteins saved: 296
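The counts printed so far form a filtering funnel, and tabulating them makes it easy to see how much each step removes. The numbers below are copied from the outputs recorded in this notebook; the step labels and the helper `funnel_rows` are my own naming.

```python
# (step label, proteins remaining) taken from the recorded cell outputs
funnel = [
    ("Total proteome", 4234),
    ("Non-paralogous (CD-HIT)", 4085),
    ("Non-homologous to human", 3737),
    ("Predicted essential (strict)", 665),
    ("In unique-pathway KOs", 296),
]


def funnel_rows(steps):
    """Return (label, count, % of previous step, % of first step) per step."""
    rows = []
    first = steps[0][1]
    previous = first
    for label, count in steps:
        rows.append((label, count,
                     round(100 * count / previous, 1),
                     round(100 * count / first, 1)))
        previous = count
    return rows


rows = funnel_rows(funnel)
for label, count, pct_prev, pct_total in rows:
    print(f"{label:<30}{count:>6}{pct_prev:>8.1f}{pct_total:>8.1f}")
```

The per-step percentages for the first three filters reproduce the values printed by the corresponding cells (96.5%, 91.5%, 17.8%).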


In [8]:
from Bio import SeqIO
import pandas as pd

# --- INPUT FILES ---
fasta_file = "/content/Bacteroides_fragilis_predicted_essential_revised_threshold.fasta"
pathway_file = "/content/KAAS_pathway_analysis.csv"
mapping_file = "/content/Kaas_bacteroides_fragilis.csv"

# --- LOAD DATA ---
path_df = pd.read_csv(pathway_file)
mapping_df = pd.read_csv(mapping_file, sep=",")  # comma-delimited, not tab-delimited

# Ensure column names are correct
mapping_df.columns = ["protein", "KO"]

# Get KOs that are NOT shared with human (unique bacterial)
unique_kos = path_df.loc[~path_df["Shared_with_Human"], "KO"].unique().tolist()

# Get protein IDs associated with those unique KOs
unique_proteins = mapping_df[mapping_df["KO"].isin(unique_kos)]["protein"].unique().tolist()

print(f"✅ Unique bacterial KOs: {len(unique_kos)}")
print(f"✅ Corresponding protein IDs: {len(unique_proteins)}")

# --- FILTER FASTA ---
output_fasta = "/content/Bacteroides_fragilis_unique_pathway_proteins.fasta"
count = 0

with open(output_fasta, "w") as out_f:
    for record in SeqIO.parse(fasta_file, "fasta"):
        # Substring match: a KAAS protein ID only needs to occur somewhere in the FASTA ID
        if any(pid in record.id for pid in unique_proteins):
            SeqIO.write(record, out_f, "fasta")
            count += 1

print(f"🎯 Unique-pathway protein sequences saved: {output_fasta}")
print(f"Total sequences written: {count}")





✅ Unique bacterial KOs: 283
✅ Corresponding protein IDs: 296
🎯 Unique-pathway protein sequences saved: /content/Bacteroides_fragilis_unique_pathway_proteins.fasta
Total sequences written: 296
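The FASTA filter above matches IDs by substring, which can select unintended records when one identifier is a prefix of another. A stricter alternative is to compare UniProt accessions exactly. The sketch below assumes UniProt-style IDs of the form `sp|ACCESSION|NAME` or `tr|ACCESSION|NAME`, as seen in this notebook's outputs; the helper names and the wanted list are hypothetical.

```python
def uniprot_accession(seq_id):
    """Return the accession from 'sp|ACC|NAME' / 'tr|ACC|NAME', else the ID unchanged."""
    parts = seq_id.split("|")
    if len(parts) >= 3 and parts[0] in {"sp", "tr"}:
        return parts[1]
    return seq_id


def select_exact(record_ids, wanted):
    """Keep records whose accession exactly equals a wanted accession."""
    wanted_accessions = {uniprot_accession(w) for w in wanted}
    return [rid for rid in record_ids if uniprot_accession(rid) in wanted_accessions]


def select_substring(record_ids, wanted):
    """Substring matching, shown only for comparison."""
    return [rid for rid in record_ids if any(w in rid for w in wanted)]


records = [
    "sp|Q5L9Q6|DAPDH_BACFN",
    "tr|Q5LB89|Q5LB89_BACFN",
    "tr|A0A149NMK2|A0A149NMK2_BACFN",
]
wanted = ["Q5L9Q6", "Q5LB8"]  # "Q5LB8" is a made-up ID that is a prefix of Q5LB89

exact = select_exact(records, wanted)
loose = select_substring(records, wanted)
print(exact)
print(loose)
```

Set-based exact matching also scales better than testing every wanted ID against every record.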


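### G. Subcellular Localization (PSORTb)

Run PSORTb on the unique-pathway proteins outside Colab, upload the result file, and parse the final prediction for each protein.

In [9]: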
import pandas as pd
import re

# Path to your PSORTb CSV
psortb_csv = "/content/PSORTb_results.csv"

# Read as plain text (since all info is in one column)
df_raw = pd.read_csv(psortb_csv, header=None, names=["Text"], dtype=str)
print(f"✅ Loaded {len(df_raw)} rows from PSORTb result")



✅ Loaded 6808 rows from PSORTb result


In [23]:
import pandas as pd
import re

# Load the raw PSORTb results, one text line per row (the file has no header row)
df_raw = pd.read_csv("/content/PSORTb_results.csv", header=None, names=["Text"], dtype=str)

# Join all rows into one large text block
content = "\n".join(df_raw["Text"].dropna().tolist())

# Split by "SeqID:"
entries = re.split(r"SeqID:", content)
records = []

for entry in entries:
    entry = entry.strip()
    if not entry:
        continue

    # Extract the sequence ID (first token of the entry)
    seq_match = re.search(r"^\s*(\S+)", entry)

    # Extract the final predicted localization
    loc_match = re.search(r"Final Prediction:\s*(\w+)", entry)

    if seq_match and loc_match:
        seqid = seq_match.group(1).strip()
        loc = loc_match.group(1).strip()

        # ✅ Skip if SeqID is "Analysis" or other header text
        if seqid.lower() not in ['analysis', 'seqid', 'results', 'psortb']:
            records.append((seqid, loc))

# Create a dataframe
df = pd.DataFrame(records, columns=["SeqID", "Localization"])

# ✅ Additional cleanup: remove any rows where SeqID doesn't look like a protein ID
# Protein IDs typically contain | or are alphanumeric
df = df[df['SeqID'].str.contains(r'[A-Z0-9]', case=True)]

# Normalize localization to lowercase
df['Localization'] = df['Localization'].str.lower()

# Save the cleaned file
df.to_csv("/content/psortb_cleaned.csv", index=False)

print(f"✅ Extracted {len(df)} protein predictions")
print("💾 Saved as: /content/psortb_cleaned.csv")

# Show distribution
print("\n📊 Localization Distribution:")
print(df['Localization'].value_counts())

# Show first few rows
print("\n🔬 Sample data:")
print(df.head(10))



✅ Extracted 296 protein predictions
💾 Saved as: /content/psortb_cleaned.csv

📊 Localization Distribution:
Localization
cytoplasmic            234
cytoplasmicmembrane     33
unknown                 24
periplasmic              4
outermembrane            1
Name: count, dtype: int64

🔬 Sample data:
                            SeqID         Localization
0                          CMSVM-          cytoplasmic
1           sp|Q5L9Q6|DAPDH_BACFN          cytoplasmic
2          tr|Q5LB89|Q5LB89_BACFN          cytoplasmic
3          tr|Q5LBH1|Q5LBH1_BACFN          cytoplasmic
4          tr|Q5LH15|Q5LH15_BACFN          cytoplasmic
5           sp|Q5LHT1|RIBBA_BACFN          cytoplasmic
6            sp|Q5LIJ3|MURE_BACFN          cytoplasmic
7  tr|A0A149NMK2|A0A149NMK2_BACFN          cytoplasmic
8  tr|A0A380YQQ1|A0A380YQQ1_BACFN          cytoplasmic
9  tr|A0A380YTM3|A0A380YTM3_BACFN  cytoplasmicmembrane
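Because the PSORTb report is free text rather than a true CSV, it can also be parsed directly as a string. The sketch below anchors each entry on a `SeqID:` that starts a line, which avoids picking up stray tokens (such as the `CMSVM-` row in the sample above) as sequence IDs. The sample report is a simplified, assumed layout built only from the `SeqID:` and `Final Prediction:` markers that the cell above relies on.

```python
import re

ENTRY_RE = re.compile(r"^SeqID:\s*(\S+)", re.MULTILINE)
PREDICTION_RE = re.compile(r"Final Prediction:\s*(\w+)")


def parse_psortb(text):
    """Return a list of (sequence_id, localization) from a PSORTb-style text report."""
    records = []
    matches = list(ENTRY_RE.finditer(text))
    for i, match in enumerate(matches):
        # An entry runs from this SeqID line to the next one (or end of text)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        block = text[match.end():end]
        prediction = PREDICTION_RE.search(block)
        localization = prediction.group(1).lower() if prediction else "unknown"
        records.append((match.group(1), localization))
    return records


# Simplified, hypothetical report excerpt
SAMPLE = """SeqID: sp|Q5L9Q6|DAPDH_BACFN example protein
  Analysis Report:
    CMSVM-            Unknown
  Final Prediction:
    Cytoplasmic            9.97
SeqID: tr|A0A380YTM3|A0A380YTM3_BACFN example protein
  Final Prediction:
    CytoplasmicMembrane    10.00
"""

parsed = parse_psortb(SAMPLE)
print(parsed)
```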


In [24]:
import pandas as pd

# Load your PSORTb results
df = pd.read_csv("/content/psortb_cleaned.csv")

# Normalize localization text to lowercase and strip whitespace
df["Localization"] = df["Localization"].str.lower().str.strip()

# 🧪 Cytoplasmic ONLY (exclude cytoplasmic membrane)
cytoplasmic_only = df[df["Localization"] == "cytoplasmic"]

# 🧬 Cytoplasmic Membrane
cytoplasmic_membrane = df[
    df["Localization"].str.contains(
        "cytoplasmicmembrane|cytoplasmic membrane",
        case=False,
        na=False
    )
]

# 💉 Vaccine candidates (outer membrane, extracellular, periplasmic)
vaccine_candidates = df[
    df["Localization"].str.contains(
        "outermembrane|outer membrane|extracellular|periplasm",
        case=False,
        na=False
    )
]

# ❓ Unknown localization
unknown_localization = df[df["Localization"] == "unknown"]

# 📊 Print comprehensive summary
print(f"\n{'='*60}")
print(f"📊 SUBCELLULAR LOCALIZATION ANALYSIS")
print(f"{'='*60}")
print(f"Total proteins analyzed: {len(df)}")
print(f"\n🔬 Localization Distribution:")
print(f"  🧪 Cytoplasmic only: {len(cytoplasmic_only)} ({len(cytoplasmic_only)/len(df)*100:.1f}%)")
print(f"  🧬 Cytoplasmic membrane: {len(cytoplasmic_membrane)} ({len(cytoplasmic_membrane)/len(df)*100:.1f}%)")
print(f"  💉 Outer membrane/Extracellular/Periplasmic: {len(vaccine_candidates)} ({len(vaccine_candidates)/len(df)*100:.1f}%)")
print(f"  ❓ Unknown: {len(unknown_localization)} ({len(unknown_localization)/len(df)*100:.1f}%)")
print(f"{'='*60}\n")

# 🎯 Drug target prioritization (following the paper's approach: keep all but categorize)
print(f"🎯 DRUG TARGET PRIORITIZATION:")
print(f"  ⭐ High priority (membrane/secreted): {len(vaccine_candidates)} proteins")
print(f"     → Accessible, good for antibodies/small molecules")
print(f"  ⭐ Medium priority (cytoplasmic membrane): {len(cytoplasmic_membrane)} proteins")
print(f"     → Targetable with membrane-permeable drugs")
print(f"  ⭐ Lower priority (cytoplasmic): {len(cytoplasmic_only)} proteins")
print(f"     → Requires cell penetration, but still valid targets")
print(f"  ⚠️  Unknown localization: {len(unknown_localization)} proteins")
print(f"     → Needs further analysis\n")

# Save all categories
df.to_csv("/content/psortb_cleaned.csv", index=False)
cytoplasmic_only.to_csv("/content/cytoplasmic_only.csv", index=False)
cytoplasmic_membrane.to_csv("/content/cytoplasmic_membrane.csv", index=False)
vaccine_candidates.to_csv("/content/vaccine_candidates.csv", index=False)
unknown_localization.to_csv("/content/unknown_localization.csv", index=False)

# Combined drug targets (all except pure cytoplasmic if you want to be strict)
# But following the paper - keep ALL
all_targets = df.copy()
all_targets.to_csv("/content/all_localized_targets.csv", index=False)

print("✅ Files saved:")
print("  - psortb_cleaned.csv (all proteins)")
print("  - cytoplasmic_only.csv")
print("  - cytoplasmic_membrane.csv")
print("  - vaccine_candidates.csv")
print("  - unknown_localization.csv")
print("  - all_localized_targets.csv")

# 📈 Detailed breakdown
print(f"\n📈 Detailed Localization Breakdown:")
print(df["Localization"].value_counts())


📊 SUBCELLULAR LOCALIZATION ANALYSIS
Total proteins analyzed: 296

🔬 Localization Distribution:
  🧪 Cytoplasmic only: 234 (79.1%)
  🧬 Cytoplasmic membrane: 33 (11.1%)
  💉 Outer membrane/Extracellular/Periplasmic: 5 (1.7%)
  ❓ Unknown: 24 (8.1%)

🎯 DRUG TARGET PRIORITIZATION:
  ⭐ High priority (membrane/secreted): 5 proteins
     → Accessible, good for antibodies/small molecules
  ⭐ Medium priority (cytoplasmic membrane): 33 proteins
     → Targetable with membrane-permeable drugs
  ⭐ Lower priority (cytoplasmic): 234 proteins
     → Requires cell penetration, but still valid targets
  ⚠️  Unknown localization: 24 proteins
     → Needs further analysis

✅ Files saved:
  - psortb_cleaned.csv (all proteins)
  - cytoplasmic_only.csv
  - cytoplasmic_membrane.csv
  - vaccine_candidates.csv
  - unknown_localization.csv
  - all_localized_targets.csv

📈 Detailed Localization Breakdown:
Localization
cytoplasmic            234
cytoplasmicmembrane     33
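The prioritization tiers can be expressed as a plain lookup table, which keeps the category boundaries in one place. The sketch below applies the tiers printed above to the localization counts recorded in this notebook. The `TIERS` mapping and the function name `priority` are illustrative names, not part of the original cell.

```python
from collections import Counter

# Tier assignment mirroring the categories printed above
TIERS = {
    "outermembrane": "high",
    "extracellular": "high",
    "periplasmic": "high",
    "cytoplasmicmembrane": "medium",
    "cytoplasmic": "lower",
}


def priority(localization):
    """Map a PSORTb localization label to a priority tier ('unknown' if unlisted)."""
    key = localization.lower().replace(" ", "").strip()
    return TIERS.get(key, "unknown")


# Localization counts as reported by the parsing cell
counts = {
    "cytoplasmic": 234,
    "cytoplasmicmembrane": 33,
    "unknown": 24,
    "periplasmic": 4,
    "outermembrane": 1,
}

tier_totals = Counter()
for localization, n in counts.items():
    tier_totals[priority(localization)] += n

print(dict(tier_totals))
```

An exact lookup avoids any overlap between the `cytoplasmic` and `cytoplasmicmembrane` labels that pattern matching would have to handle.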


In [26]:
# Collect the unique protein IDs for gene-symbol lookup in UniProt

ids = df["SeqID"].dropna().unique().tolist()
print(f"Found {len(ids)} unique protein IDs")


Found 296 unique protein IDs


In [28]:
import requests
import time
import pandas as pd

def get_gene_name_from_uniprot(seq_id):
    """Fetch the primary gene name for one protein from the UniProt REST API."""
    # FASTA IDs look like "sp|Q5L9Q6|DAPDH_BACFN": the accession is the middle field.
    # Otherwise fall back to the bare ID with any version suffix (.1, .2) removed.
    parts = seq_id.split('|')
    accession = parts[1] if len(parts) >= 3 else parts[0].split('.')[0]

    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            genes = response.json().get('genes', [])
            if genes and 'geneName' in genes[0]:
                return genes[0]['geneName']['value']
    except (requests.RequestException, ValueError):
        pass  # network error or malformed JSON: treat as "no gene name"
    return None

# Load cytoplasmic proteins
df_cytoplasmic = pd.read_csv("/content/cytoplasmic_only.csv")

# Get gene names (this will take time: ~234 API calls)
gene_names = []
print("Fetching gene names from UniProt...")

for i, seq_id in enumerate(df_cytoplasmic['SeqID']):
    gene_name = get_gene_name_from_uniprot(seq_id)
    gene_names.append(gene_name if gene_name else seq_id)  # fall back to the ID itself

    if (i + 1) % 10 == 0:
        print(f"  Processed {i + 1}/{len(df_cytoplasmic)} proteins...")
        time.sleep(0.5)  # Be nice to UniProt API

df_cytoplasmic['Gene_Name'] = gene_names
df_cytoplasmic.to_csv("/content/cytoplasmic_with_genes.csv", index=False)
print("✅ Gene names retrieved and saved!")


Fetching gene names from UniProt...
  Processed 10/234 proteins...
  Processed 20/234 proteins...
  Processed 30/234 proteins...
  Processed 40/234 proteins...
  Processed 50/234 proteins...
  Processed 60/234 proteins...
  Processed 70/234 proteins...
  Processed 80/234 proteins...
  Processed 90/234 proteins...
  Processed 100/234 proteins...
  Processed 110/234 proteins...
  Processed 120/234 proteins...
  Processed 130/234 proteins...
  Processed 140/234 proteins...
  Processed 150/234 proteins...
  Processed 160/234 proteins...
  Processed 170/234 proteins...
  Processed 180/234 proteins...
  Processed 190/234 proteins...
  Processed 200/234 proteins...
  Processed 210/234 proteins...
  Processed 220/234 proteins...
  Processed 230/234 proteins...
✅ Gene names retrieved and saved!


In [33]:
# Batch alternative for finding gene symbols (not very handy in practice)
import requests
import pandas as pd
import time

def batch_uniprot_mapping(seq_ids):
    """Map multiple protein IDs to gene names in one request"""

    # Prepare the mapping request
    url = "https://rest.uniprot.org/idmapping/run"

    params = {
        'ids': ','.join(seq_ids[:500]),  # Max 500 at a time
        'from': 'UniProtKB_AC-ID',
        'to': 'Gene_Name'
    }

    response = requests.post(url, data=params, timeout=30)

    if response.status_code == 200:
        job_id = response.json()['jobId']

        # Poll for results
        results_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"

        for _ in range(30):  # Try for 30 seconds
            time.sleep(1)
            status = requests.get(results_url, timeout=30).json()

            if 'results' in status:
                return status['results']

    return []

# Use this for your 234 proteins
df_cytoplasmic = pd.read_csv("/content/cytoplasmic_only.csv")
seq_ids = df_cytoplasmic['SeqID'].tolist()

# Clean IDs: keep the last "|" field (the UniProt entry name)
clean_ids = [pid.split('.')[0].split('|')[-1] for pid in seq_ids]

results = batch_uniprot_mapping(clean_ids)  # first 500 IDs only

# Parse results
id_to_gene = {r['from']: r['to'] for r in results}
df_cytoplasmic['Gene_Name'] = df_cytoplasmic['SeqID'].apply(
    lambda x: id_to_gene.get(x.split('.')[0].split('|')[-1], x)
)

# 💾 Save output file
output_path = "/content/cytoplasmic_with_genes_batch.csv"
df_cytoplasmic.to_csv(output_path, index=False)
print(f"✅ Output file saved to: {output_path}")



✅ Output file saved to: /content/cytoplasmic_with_genes_batch.csv
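The batch cell above submits only the first 500 IDs. For longer lists, the IDs could be split into batches and the per-batch results merged afterwards. The sketch below shows just that bookkeeping, with no network calls: the 500-ID batch size is the limit stated in the cell's comment, the `from`/`to` keys follow the result rows used there, and the helper names are hypothetical.

```python
def chunked(items, size=500):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_mapping_results(result_batches):
    """Combine lists of {'from': id, 'to': gene} rows; the first mapping per ID wins."""
    id_to_gene = {}
    for batch in result_batches:
        for row in batch:
            id_to_gene.setdefault(row["from"], row["to"])
    return id_to_gene


# Hypothetical IDs to demonstrate the batch sizes
ids = [f"ID{i}" for i in range(1, 1202)]
batches = list(chunked(ids, size=500))
print([len(batch) for batch in batches])

# Hypothetical result rows from two batches
merged = merge_mapping_results([
    [{"from": "A", "to": "geneA"}],
    [{"from": "B", "to": "geneB"}, {"from": "A", "to": "other"}],
])
print(merged)
```

Each batch would be passed to `batch_uniprot_mapping`, and the returned lists collected before merging.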


In [None]:
from Bio import SeqIO

input_fasta = "/content/Bifidobacterium_animalis_unique_pathway_proteins.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))

chunk_size = 70
for i in range(0, len(records), chunk_size):
    chunk = records[i:i+chunk_size]
    output_file = f"/content/unique_proteins_chunk_{i//chunk_size + 1}.fasta"
    SeqIO.write(chunk, output_file, "fasta")
    print(f"✅ Chunk saved: {output_file} ({len(chunk)} sequences)")


✅ Chunk saved: /content/unique_proteins_chunk_1.fasta (70 sequences)
✅ Chunk saved: /content/unique_proteins_chunk_2.fasta (60 sequences)


In [None]:
# NOTE: The NetGenes database layout is not yet clear to me: it provides only the essential genes and their scores, with no protein sequences.
from google.colab import files

uploaded = files.upload()  # Upload your NetGenes zip file





Saving NetGenes.zip to NetGenes.zip
