<a href="https://colab.research.google.com/github/jpdemuth/beginning-bioinformatics/blob/main/notebooks/OrthoFinder_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# OrthoFinder on Google Colab (Teaching-Friendly Notebook)

This Colab notebook installs and runs **OrthoFinder** on small to medium test datasets. It includes:
- One-click **installation** of OrthoFinder and common dependencies (DIAMOND, MMseqs2, MAFFT, MUSCLE, FastTree).
- **Data upload** or **Google Drive mount**.
- A **tiny toy dataset** you can generate in seconds for demonstration.
- A configurable **run cell** (choose sequence search, MSA, and tree methods).
- **Result inspection** helpers and **export** to ZIP/Drive.

> ⚠️ **Colab caveats:** Free sessions can disconnect, RAM is limited (~12–25 GB), and CPUs are modest. Use this notebook for teaching, prototyping, or small examples such as bacterial proteomes. For larger analyses, use an HPC cluster.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1) Check the environment

In [2]:

!nproc
!python --version
!free -h
!df -h


2
Python 3.12.11
               total        used        free      shared  buff/cache   available
Mem:            12Gi       777Mi       8.9Gi       1.0Mi       3.0Gi        11Gi
Swap:             0B          0B          0B
Filesystem      Size  Used Avail Use% Mounted on
overlay         226G   40G  187G  18% /
tmpfs            64M     0   64M   0% /dev
shm             5.8G     0  5.8G   0% /dev/shm
/dev/root       2.0G  1.2G  764M  62% /usr/sbin/docker-init
/dev/sda1       233G   40G  193G  18% /kaggle/input
tmpfs           6.4G   68K  6.4G   1% /var/colab
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware
drive            15G   10G  5.1G  67% /content/drive


## 2) Install OrthoFinder & dependencies

In [3]:

# System packages
!sudo apt-get update -qq
!sudo apt-get install -y -qq git build-essential python3-dev
# Sequence search & alignment tools
!sudo apt-get install -y -qq diamond-aligner mmseqs2 mafft muscle fasttree
# Utilities
!sudo apt-get install -y -qq pigz unzip

# Python deps
!pip -q install biopython pandas

# OrthoFinder install from source
!rm -rf OrthoFinder
!git clone -q https://github.com/davidemms/OrthoFinder.git
!cd OrthoFinder && python setup.py install --user

# Ensure the user-local bin is on PATH for this session
import os
user_bin = os.path.expanduser("~/.local/bin")
if user_bin not in os.environ.get("PATH",""):
    os.environ["PATH"] = user_bin + ":" + os.environ.get("PATH","")
print("PATH=", os.environ["PATH"])

# Sanity checks
!orthofinder -h | head -n 20 || true
!diamond version || true
!mmseqs --version || true
!mafft --version || true
!muscle -version || true
!fasttree -help | head -n 5 || true


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 51.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package fonts-lato.
(Reading database ... 126435 files and directories currently installed.)
Preparing to unpack .../00-fonts-lato_2.0-2.1_all.deb ...
Unpacking fonts-lato (2.0-2.1) ...
Selecting previously unselected package netbase.
Preparing to unpack .../01-netbase_6.3_all.deb ...
Unpacking netbase (6.3) ...
Selecting previ

In [10]:
# MCL was not installed by the install cell above; OrthoFinder checks for it at startup
!sudo apt-get update -qq
!sudo apt-get install -y -qq mcl
!mcl -h | head -n 5

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package mcl.
(Reading database ... 130541 files and directories currently installed.)
Preparing to unpack .../mcl_1%3a14-137+ds-9build2_amd64.deb ...
Unpacking mcl (1:14-137+ds-9build2) ...
Setting up mcl (1:14-137+ds-9build2) ...
Processing triggers for man-db (2.10.2-1) ...
________ mcl verbosity modes
--show ....... print MCL

In [11]:
# verify dependencies
!orthofinder -h | head -n 20 || true
!diamond version || true
!mmseqs --version || true
!mafft --version || true
!muscle -version || true
!fasttree -help | head -n 5 || true
!mcl -h | head -n 5 || true


OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms

SIMPLE USAGE:
Run full OrthoFinder analysis on FASTA format proteomes in <dir>
  orthofinder [options] -f <dir>

To assign species from <dir1> to existing OrthoFinder orthogroups in <dir2>
  orthofinder [options] --assign <dir1> --core <dir2>

OPTIONS:
 -t <int>        Number of parallel sequence search threads [Default = 2]
 -a <int>        Number of parallel analysis threads
 -d              Input is DNA sequences
 -M <txt>        Method for gene tree inference. Options 'dendroblast' & 'msa'
                 [Default = msa]
 -S <txt>        Sequence search program [Default = diamond]
                 Options: blast, diamond, diamond_ultra_sens, blast_gz, mmseqs, blast_nucl
 -A <txt>        MSA program, requires '-M msa' [Default = mafft]
                 Options: mafft, muscle, mafft_memsave
diamond version 2.0.14
MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast, 
parallelized pro


## 3) Provide your protein FASTA files
You have two options:
- **Upload** protein FASTA files (one per species) — recommended for small demos.
- **Mount Google Drive** and point to a folder containing your FASTA files.

Each file should be **amino acid sequences** (not DNA). Extensions like `.fa`, `.faa`, or `.fasta` are fine.
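
A quick way to catch DNA files before a long run is to measure how much of each sequence consists of nucleotide letters. The sketch below is a heuristic only, not part of OrthoFinder; the function names `fraction_nucleotide` and `looks_like_dna` and the 0.9 threshold are assumptions chosen for illustration.

```python
def fraction_nucleotide(seq: str) -> float:
    """Fraction of residues that are A, C, G, T, U or N (case-insensitive)."""
    seq = seq.upper().replace("*", "").replace("-", "")
    if not seq:
        return 0.0
    return sum(ch in "ACGTUN" for ch in seq) / len(seq)


def looks_like_dna(fasta_text: str, threshold: float = 0.9, max_records: int = 50) -> bool:
    """Heuristic: True if the first records are almost entirely nucleotide letters."""
    seqs, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if current:
                seqs.append("".join(current))
                current = []
            if len(seqs) >= max_records:
                break
        elif line:
            current.append(line)
    if current and len(seqs) < max_records:
        seqs.append("".join(current))
    joined = "".join(seqs)
    return bool(joined) and fraction_nucleotide(joined) >= threshold
```

To apply it to a file, pass the file contents, for example `looks_like_dna(open(path).read())`. Short or low-complexity proteins can still fool a threshold like this, so treat a positive result as a prompt to inspect the file rather than as proof.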


In [5]:

# Option A: Upload files from your computer (select multiple)
from google.colab import files
import os, shutil, glob

INPUT_DIR = "/content/drive/MyDrive/Data"
os.makedirs(INPUT_DIR, exist_ok=True)

print("Select one or more protein FASTA files (.fa/.faa/.fasta):")
uploaded = files.upload()

# Move uploaded files into INPUT_DIR
for fn in uploaded.keys():
    dst = os.path.join(INPUT_DIR, fn)
    shutil.move(fn, dst)

print("\nFiles now in", INPUT_DIR)
!ls -lah "$INPUT_DIR" || true


total 42M
-rw------- 1 root root 14M Sep 24 15:10 Agla_2.0_protein.faa
-rw------- 1 root root 12M Sep 24 15:10 Ldec_2.0_protein.faa
-rw------- 1 root root 17M Sep 24 15:10 Tcas5.2_protein.faa


In [None]:

# Option B: Mount Google Drive and set an existing folder path containing protein FASTAs
from google.colab import drive
drive.mount('/content/drive')

# After mounting, set this to your Drive folder path (edit as needed):
DRIVE_INPUT_DIR = "/content/drive/MyDrive/orthofinder_input"

# If you want to *copy* from Drive into the Colab runtime (faster local access), do:
# (Leave commented out if you want OrthoFinder to read directly from Drive.)
# !mkdir -p /content/orthofinder_input
# !cp -v "$DRIVE_INPUT_DIR"/* /content/orthofinder_input/
# print("Copied files into /content/orthofinder_input")



### (Optional) 3a. Create a tiny **toy dataset** for quick demos
This generates two fake protein FASTA files (`speciesA.faa`, `speciesB.faa`) with very small sequences — useful for testing the pipeline fast.


In [None]:

from pathlib import Path

toy_dir = Path("/content/orthofinder_input")
toy_dir.mkdir(parents=True, exist_ok=True)

toyA = ">spA_gene1\nMTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVE\n>spA_gene2\nMVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRF\n"
toyB = ">spB_gene1\nMTEYKLVVVGAGGIGKSALTIQLIQNHFVDEYDPTIEDSYRKQ\n>spB_gene2\nMVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT\n"

(toy_dir / "speciesA.faa").write_text(toyA)
(toy_dir / "speciesB.faa").write_text(toyB)

!ls -lah /content/orthofinder_input


## 4) Validate input files

In [7]:

import glob, os

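INPUT_DIR = "/content/drive/MyDrive/Data"  # folder with one protein FASTA per species; edit as needed
# Keep regular files only: a previous run leaves an OrthoFinder/ subdirectory in this folder
files_found = [f for f in glob.glob(os.path.join(INPUT_DIR, "*")) if os.path.isfile(f)]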
if not files_found:
    raise SystemExit("No files found in " + INPUT_DIR + ". Upload or copy your FASTA files first.")

print("Found files:")
for f in files_found:
    print(" -", os.path.basename(f))

# Quick content check: ensure these look like FASTA (first char '>')
bad = []
for f in files_found:
    try:
        with open(f, 'r') as fh:
            first = fh.read(1)
        if first != '>':
            bad.append(f)
    except Exception as e:
        print("Error reading", f, e)

if bad:
    print("\nWARNING: These files do not look like FASTA (no leading '>'):", "\n".join(bad))
else:
    print("\nAll files appear to be FASTA-like.")


Found files:
 - Ldec_2.0_protein.faa
 - Agla_2.0_protein.faa
 - Tcas5.2_protein.faa

All files appear to be FASTA-like.
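
The first-character check above only confirms that a file starts with a header. A slightly deeper check, sketched below, counts sequences and residues and reports repeated sequence IDs. The helper name `summarize_fasta` is hypothetical, and the sketch assumes the ID is the first whitespace-delimited token of each header line.

```python
from collections import Counter


def summarize_fasta(fasta_text: str) -> dict:
    """Count records and residues in FASTA text and list any repeated IDs."""
    ids = []
    n_residues = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # ID = first token after '>'; an empty header yields an empty ID
            header = line[1:].split()
            ids.append(header[0] if header else "")
        elif line:
            n_residues += len(line)
    dupes = sorted(i for i, c in Counter(ids).items() if c > 1)
    return {"n_sequences": len(ids), "n_residues": n_residues, "duplicate_ids": dupes}
```

Running `summarize_fasta(open(f).read())` for each entry in `files_found` gives a per-species sequence count, which is a useful sanity check against the counts you expect from the source annotation.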


## 5) Configure run options

In [8]:

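# Choose methods. Valid values (see the `orthofinder -h` output above):
#   SEARCH (-S): 'diamond', 'diamond_ultra_sens', 'mmseqs', or 'blast' (BLAST is not installed by this notebook)
#   MSA (-M):    gene tree inference, 'msa' (alignments, MAFFT by default) or 'dendroblast' (no alignments)
#   TREE (-T):   'fasttree' (recommended in Colab). IQ-TREE is not installed by default here.
SEARCH_METHOD = 'diamond'   # 'diamond' | 'diamond_ultra_sens' | 'mmseqs' | 'blast'
MSA_METHOD    = 'msa'       # 'msa' | 'dendroblast'
TREE_METHOD   = 'fasttree'  # 'fasttree'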

# Threads: set based on available CPUs
import os, subprocess
try:
    nproc = int(subprocess.check_output(['nproc']).decode().strip())
except Exception:
    nproc = 2
THREADS = max(1, nproc)  # use all detected cores
print(f"SEARCH={SEARCH_METHOD}, MSA={MSA_METHOD}, TREE={TREE_METHOD}, THREADS={THREADS}")


SEARCH=diamond, MSA=msa, TREE=fasttree, THREADS=2
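
Because a typo in a method name only surfaces once OrthoFinder starts, it can help to validate the settings before launching. The sketch below checks `-S` and `-M` against the values shown in the `orthofinder -h` output earlier in this notebook. The function name `build_orthofinder_cmd` is hypothetical, and the `-T` value is not validated because its option list is cut off in the help output above. Passing `-T` only when `-M` is `msa` is an assumption about how the two options interact.

```python
# Values copied from the `orthofinder -h` output shown earlier in this notebook
VALID_SEARCH = {"blast", "diamond", "diamond_ultra_sens", "blast_gz", "mmseqs", "blast_nucl"}
VALID_TREE_INFERENCE = {"dendroblast", "msa"}  # accepted by -M


def build_orthofinder_cmd(input_dir, search="diamond", inference="msa", tree=None, threads=2):
    """Return the orthofinder argument list, raising ValueError on unknown methods."""
    if search not in VALID_SEARCH:
        raise ValueError(f"Unknown -S value {search!r}; expected one of {sorted(VALID_SEARCH)}")
    if inference not in VALID_TREE_INFERENCE:
        raise ValueError(f"Unknown -M value {inference!r}; expected one of {sorted(VALID_TREE_INFERENCE)}")
    cmd = ["orthofinder", "-f", input_dir, "-S", search, "-M", inference]
    # Assumption: the tree program is only relevant for MSA-based inference
    if inference == "msa" and tree:
        cmd += ["-T", tree]
    cmd += ["-t", str(int(threads))]
    return cmd
```

For example, `build_orthofinder_cmd(INPUT_DIR, SEARCH_METHOD, MSA_METHOD, TREE_METHOD, THREADS)` would reproduce the command used in the run cell, while a value such as `'none'` for `-M` is rejected immediately.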


## 6) Run OrthoFinder

In [12]:

import os, time, glob, subprocess, shlex

INPUT_DIR = "/content/drive/MyDrive/Data"  # adjust if needed

# Build command
cmd = [
    "orthofinder",
    "-f", INPUT_DIR,
    "-S", SEARCH_METHOD,
    "-M", MSA_METHOD,
    "-T", TREE_METHOD,
    "-t", str(THREADS)
]

print("Running:", " ".join(shlex.quote(x) for x in cmd))
start = time.time()
ret = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
end = time.time()

# Save log
log_path = "/content/orthofinder_run.log"
with open(log_path, "w") as fh:
    fh.write(ret.stdout)

print(f"Runtime: {end - start:.1f} sec")
print("\n--- Log head ---")
print("\n".join(ret.stdout.splitlines()[:40]))
print("\n... (full log saved to", log_path, ")")

# Find newest Results_* directory (by modification time; names do not sort chronologically)
results_dirs = glob.glob(os.path.join(INPUT_DIR, "OrthoFinder", "Results_*"))
if results_dirs:
    RESULTS_DIR = max(results_dirs, key=os.path.getmtime)
    print("\nLatest results directory:", RESULTS_DIR)
else:
    RESULTS_DIR = None
    print("\nNo Results_* directory found. Check the log for errors.")


Running: orthofinder -f /content/drive/MyDrive/Data -S diamond -M msa -T fasttree -t 2
Runtime: 20115.9 sec

--- Log head ---

OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms

2025-09-24 15:21:16 : Starting OrthoFinder 3.0.1b1
2 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

Results directory: /content/drive/MyDrive/Data/OrthoFinder/Results_Sep24_1/

Checking required programs are installed
----------------------------------------
Test can run "mcl -h" - ok
Test can run "mafft" - ok
Test can run "fasttree" - ok

Dividing up work for BLAST for parallel processing
--------------------------------------------------
2025-09-24 15:21:19 : Creating diamond database 1 of 3
2025-09-24 15:21:19 : Creating diamond database 2 of 3
2025-09-24 15:21:20 : Creating diamond database 3 of 3

Running diamond all-versus-all
------------------------------
Using 2 thread(s)
2025-09-24 15:21:21 : This may take some time....
2025-09-24 15:21:21 : Do
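
The run cell above collects all output with `subprocess.run` and prints it only after OrthoFinder exits, so a multi-hour run shows nothing until the end. One alternative is to stream the output line by line while also writing it to the log. The sketch below uses only the standard library; the helper name `run_and_stream` is hypothetical.

```python
import subprocess


def run_and_stream(cmd, log_path):
    """Run cmd, echoing each output line as it arrives and saving it to log_path."""
    with open(log_path, "w") as log, subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,                 # line-buffered in text mode
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            log.write(line)
            log.flush()            # keep the log current if the session drops
    # Popen's context manager waits for the process, so returncode is set here
    return proc.returncode
```

In the run cell, `ret = subprocess.run(...)` could be swapped for `rc = run_and_stream(cmd, log_path)`. Flushing the log after each line means a partial log survives if the Colab session disconnects mid-run.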

## 7) Inspect key outputs

In [13]:

import os, glob, pandas as pd

if RESULTS_DIR and os.path.isdir(RESULTS_DIR):
    print("Top-level files:")
    !ls -lah "$RESULTS_DIR" | head -n 50

    # Common useful tables (only show if present)
    tables = {
        "Orthogroups.tsv": os.path.join(RESULTS_DIR, "Orthogroups", "Orthogroups.tsv"),
        "Orthogroups_SingleCopyOrthologues.txt": os.path.join(RESULTS_DIR, "Orthogroups", "Orthogroups_SingleCopyOrthologues.txt"),
        "Statistics_Overall.tsv": os.path.join(RESULTS_DIR, "Comparative_Genomics_Statistics", "Statistics_Overall.tsv"),
    }

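    for label, path in tables.items():
        if not os.path.exists(path):
            print(f"{label} not found.")
            continue
        print(f"\n### {label}")
        try:
            if label == "Orthogroups.tsv":
                # Regular tab-separated table with a header row
                display(pd.read_csv(path, sep='\t').head(10))
            elif label.endswith(".txt"):
                # Plain list of IDs, one per line, with no header row
                display(pd.read_csv(path, header=None, names=["Orthogroup"]).head(10))
            else:
                # Statistics_Overall.tsv stacks tables with different column counts,
                # so print it as text instead of parsing it as a single table
                with open(path, 'r') as fh:
                    for i, line in enumerate(fh):
                        if i >= 20: break
                        print(line.rstrip())
        except Exception as e:
            print("Could not display", label, ":", e)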
else:
    print("No results directory to inspect.")


Top-level files:
total 60K
-rw------- 1 root root 2.5K Sep 24 20:56 Citation.txt
drwx------ 2 root root 4.0K Sep 24 20:56 Comparative_Genomics_Statistics
drwx------ 2 root root 4.0K Sep 24 20:55 Gene_Duplication_Events
drwx------ 2 root root 4.0K Sep 24 20:55 Gene_Trees
-rw------- 1 root root  631 Sep 24 15:21 Log.txt
drwx------ 2 root root 4.0K Sep 24 20:49 MultipleSequenceAlignments
drwx------ 2 root root 4.0K Sep 24 20:55 Orthogroups
drwx------ 2 root root 4.0K Sep 24 16:17 Orthogroup_Sequences
drwx------ 5 root root 4.0K Sep 24 20:56 Orthologues
drwx------ 2 root root 4.0K Sep 24 20:52 Phylogenetically_Misplaced_Genes
drwx------ 2 root root 4.0K Sep 24 20:55 Phylogenetic_Hierarchical_Orthogroups
drwx------ 2 root root 4.0K Sep 24 20:52 Putative_Xenologs
drwx------ 2 root root 4.0K Sep 24 20:55 Resolved_Gene_Trees
drwx------ 2 root root 4.0K Sep 24 20:56 Single_Copy_Orthologue_Sequences
drwx------ 2 root root 4.0K Sep 24 20:52 Species_Tree
drwx------ 7 root root 4.0K Sep 24 20:55 Wo

Unnamed: 0,Orthogroup,Agla_2.0_protein,Ldec_2.0_protein,Tcas5.2_protein
0,OG0000000,"XP_018561228.1, XP_018561230.1, XP_018561245.1...",XP_023015561.1,
1,OG0000001,"XP_018573580.1, XP_018573581.1, XP_018573582.1...","XP_023014941.1, XP_023014942.1, XP_023014943.1...","XP_008198840.1, XP_008198841.1, XP_008198842.1..."
2,OG0000002,XP_018569802.1,"XP_023012866.1, XP_023012867.1, XP_023012868.1...",
3,OG0000003,"XP_018561139.1, XP_018561140.1, XP_018561141.1...","XP_023023687.1, XP_023023688.1, XP_023023689.1...","XP_008196506.1, XP_008196508.1, XP_008196509.1..."
4,OG0000004,"XP_018566011.1, XP_018566013.1, XP_018566014.1...",,"XP_008199740.1, XP_008199741.1, XP_008199742.1..."
5,OG0000005,"XP_018575723.1, XP_018575724.1, XP_018575731.1...","XP_023019533.1, XP_023019534.1, XP_023019535.1...","XP_015834114.1, XP_015834115.1, XP_015834116.1..."
6,OG0000006,"XP_018562778.1, XP_018569175.1, XP_018569785.1...","XP_023012020.1, XP_023012847.1, XP_023013716.1...","XP_008192426.1, XP_008192457.2, XP_015835126.1..."
7,OG0000007,"XP_018577212.1, XP_018577213.1, XP_018577214.1...","XP_023016833.1, XP_023016835.1, XP_023016837.1...","NP_001107841.1, XP_015838873.1, XP_015838875.1..."
8,OG0000008,,,"XP_008195491.1, XP_008198276.2, XP_015833775.1..."
9,OG0000009,"XP_018562144.2, XP_018564512.1, XP_018568077.1...","XP_023013508.1, XP_023014002.1, XP_023014014.1...","XP_015839902.1, XP_015840141.1, XP_015840305.1..."
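
Each species cell in `Orthogroups.tsv` holds a comma-separated list of gene IDs, and empty cells mean the species has no gene in that orthogroup. A minimal sketch for turning those lists into per-species counts, using only the standard library, is shown below. The function name `gene_counts` is hypothetical, and the example in the usage note uses made-up IDs rather than values from this run.

```python
import csv
import io


def gene_counts(tsv_text: str) -> dict:
    """Map each orthogroup to {species: number of genes} from Orthogroups.tsv text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {}
    for row in reader:
        og = row["Orthogroup"]
        counts[og] = {
            species: len([g for g in (cell or "").split(",") if g.strip()])
            for species, cell in row.items()
            if species != "Orthogroup"
        }
    return counts
```

For instance, a row `OG0000000<TAB>g1, g2, g3<TAB>g9` under the header `Orthogroup<TAB>spA<TAB>spB` yields `{"spA": 3, "spB": 1}`. Counts like these make it easy to filter for orthogroups present in every species.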



### Orthogroups_SingleCopyOrthologues.txt


Unnamed: 0,N0.HOG0000005
0,N0.HOG0000007
1,N0.HOG0000058
2,N0.HOG0000112
3,N0.HOG0000184
4,N0.HOG0000186
5,N0.HOG0000191
6,N0.HOG0000241
7,N0.HOG0000327
8,N0.HOG0000424
9,N0.HOG0000447



### Statistics_Overall.tsv
Could not display Statistics_Overall.tsv : Error tokenizing data. C error: Expected 2 fields in line 26, saw 5
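
The tokenizing error above arises because the file contains more than one table, and the later ones have more columns than the first. One way to handle such a file, sketched below, is to split it into sections and parse each one separately. This assumes the sections are separated by blank lines, which is not confirmed by the output above; the function name `split_sections` is hypothetical and the sample labels in the usage note are made up.

```python
def split_sections(text: str) -> list:
    """Split tab-separated text into sections (lists of rows) at blank lines."""
    sections, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            # A blank line closes the current section
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections
```

Given text with a two-column block, a blank line, and then a five-column block, `split_sections` returns two lists of rows. Each can then be passed to `pd.DataFrame(rows[1:], columns=rows[0])` when its first row is a header.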



## 8) Export results (ZIP download and/or save to Drive)

In [14]:

import os, shutil, glob, zipfile

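if RESULTS_DIR:
    zip_path = "/content/orthofinder_results.zip"
    # Google Docs/Sheets/Slides entries on a mounted Drive are pointer files that
    # cannot be read (Errno 95: Operation not supported), so skip them.
    SKIP_EXT = (".gsheet", ".gdoc", ".gslides")
    # Create zip
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, filenames in os.walk(RESULTS_DIR):
            for fn in filenames:
                if fn.endswith(SKIP_EXT):
                    print("Skipping Drive pointer file:", fn)
                    continue
                p = os.path.join(root, fn)
                zf.write(p, arcname=os.path.relpath(p, os.path.dirname(RESULTS_DIR)))
    print("Created:", zip_path)

    # Offer direct download
    from google.colab import files
    files.download(zip_path)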

    # (Optional) Save to Drive. Uncomment to copy.
    # DRIVE_SAVE_DIR = "/content/drive/MyDrive/orthofinder_runs"
    # !mkdir -p "$DRIVE_SAVE_DIR"
    # !cp -v "$zip_path" "$DRIVE_SAVE_DIR/"
else:
    print("No results to export.")


OSError: [Errno 95] Operation not supported: '/content/drive/MyDrive/Data/OrthoFinder/Results_Sep24_1/Orthogroups/Orthogroups.GeneCount.gsheet'


## 9) Troubleshooting tips

- **No `Results_*` folder?** Read `/content/orthofinder_run.log` (written by the run cell above); look for missing tools or malformed FASTA.
- **Large datasets crash or hang:** Try fewer species or smaller proteomes, or switch to `SEARCH_METHOD='mmseqs'` and `MSA_METHOD='dendroblast'` to reduce runtime. For reference, the three-proteome run above took about 5.6 hours (20115.9 sec) on 2 threads.
- **Trees fail:** Keep `TREE_METHOD='fasttree'` in Colab; IQ-TREE often isn't available by default.
- **Protein vs DNA:** OrthoFinder expects **protein** FASTA input. Translate coding sequences first if needed.
- **Reproducibility:** Pin tool versions or export this notebook with an exact log of installed versions for your records.
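
For the reproducibility tip, one lightweight option is to capture the first line of each tool's version output and store it alongside the results. The sketch below uses only the standard library. The helper name `record_versions` is hypothetical, and the version commands in the usage note are taken from the sanity-check cell above on the assumption that they print a version string.

```python
import subprocess


def record_versions(commands: dict) -> dict:
    """Run each command and keep the first line of its output, or a failure note."""
    versions = {}
    for name, cmd in commands.items():
        try:
            out = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # some tools print versions to stderr
                text=True,
                timeout=60,
            ).stdout
            lines = out.strip().splitlines()
            versions[name] = lines[0] if lines else "(no output)"
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing executables raise FileNotFoundError, a subclass of OSError
            versions[name] = f"unavailable: {type(exc).__name__}"
    return versions
```

A call such as `record_versions({"diamond": ["diamond", "version"], "mafft": ["mafft", "--version"]})` returns a dictionary that can be written to a text or JSON file next to the `Results_*` folder.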
