# Studying evolution of HOGs with `pyham`

#### See full tutorial at https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html


## Requirements

PyHam needs hierarchical orthogroups in xml format and a species tree (newick or phyloxml). Those files can be created with [OMA standalone](https://omabrowser.org/standalone/).


### Very quick OMA tutorial

Go to the previous link to check the documentation.

#### Potential problems and solutions

- When installing OMA, it may fails because isn't able to create a link to some executables, but it stills create the `oma`/`OMA` links which work. Also, It probably isn't needed to run the `install.sh` script if you run oma from a conda environment which has `numpy`, `biopython` and `lxml` installed.
- When editing the parameter file (`OMA -p`), the outgroup definition should include only the string before the first `.`. So if you have as outgroup `GCF_000172155.1_ASM17215v1_genomic.fa`, you should include in the parameter file:

```
OutgroupSpecies := ['GCF_000172155']
```

- The amount of information printed to `stdout` is really big (after running 15 minutes, it created an out file of about 15 GBs), so in the sbatch file use:

```
#SBATCH --output=/dev/null
```

OMA is used in https://omabrowser.org/oma/home/

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import re
from collections import defaultdict
import matplotlib.pyplot as plt
import pickle

In [2]:
import ete3
from ete3 import Tree

In [3]:
pd.set_option('max_colwidth', 300)
pd.set_option('display.max_columns', 50)
# pd.set_option?

In [4]:
root = Path("/run/media/mibu/LR-orico_mini/Recovery_ADATA/Umayor/Work/Akker/OMA/Output")

Internal nodes don't have proper names

In [5]:
import pyham

In [6]:
# pyham.Ham?

In [7]:
# the orthoxml tree fails
xml_tree = "EstimatedSpeciesTree.phyloxml"
nwk_path = root.joinpath("EstimatedSpeciesTree.nwk").as_posix()
phyloxml_path = root.joinpath("EstimatedSpeciesTree.phyloxml").as_posix()
orthoxml_path = root.joinpath("HierarchicalGroups.orthoxml").as_posix()

In [8]:
pyham_analysis = pyham.Ham(nwk_path, orthoxml_path, tree_format='newick')

In [9]:
# EstimatedSpeciesTree.phyloxml with tree_format='phyloxml' fails
pyham_analysis = pyham.Ham(phyloxml_path, orthoxml_path, use_internal_name=True, tree_format='phyloxml')

IndexError: list index out of range

## `pyham_analysis.taxonomy.tree` points to an `ete3` tree

In [10]:
# This is a ete3 tree object
print(pyham_analysis.taxonomy.tree), type(pyham_analysis.taxonomy.tree)


         /-GCF_001683795
        |
      /-|   /-MGYG-HGUT-04532
     |  |  |
     |  |  |         /-bin3c
     |   \-|      /-|
     |     |   /-|   \-MGYG-HGUT-03447
     |     |  |  |
     |      \-|   \-R28__metabat2_low_PE
     |        |
     |         \-MGYG-HGUT-03413
   /-|
  |  |         /-MGYG-HGUT-02378
  |  |      /-|
  |  |     |   \-MGYG-HGUT-00870
  |  |     |
  |  |   /-|      /-MGYG-HGUT-02453
  |  |  |  |   /-|
  |  |  |  |  |   \-MGYG-HGUT-02452
  |  |  |   \-|
--|  |  |     |      /-MGYG-HGUT-02454
  |   \-|     |   /-|
  |     |      \-|   \-MGYG-HGUT-01921
  |     |        |
  |     |         \-MGYG-HGUT-01881
  |     |
  |     |   /-MGYG-HGUT-02584
  |      \-|
  |         \-F157a_European_Toad__metabat2_high_PE
  |
   \-GCF_000172155


(None, ete3.coretype.tree.TreeNode)

**If we use `use_internal_name=True` when creating the pyham object, we get the node support as node names, which won't allow to build a tree profile `pyham_analysis.create_tree_profile` because repeated keys. If we use `use_internal_name=False` (default), nodes will have long names (see below), so better to parse the tree beforehand**

In [11]:
for node in pyham_analysis.taxonomy.tree.traverse():
    print(node.name, node.dist, node.depth)
node.features

GCF_001683795/MGYG-HGUT-04532/bin3c/MGYG-HGUT-03447/R28__metabat2_low_PE/MGYG-HGUT-03413/MGYG-HGUT-02378/MGYG-HGUT-00870/MGYG-HGUT-02453/MGYG-HGUT-02452/MGYG-HGUT-02454/MGYG-HGUT-01921/MGYG-HGUT-01881/MGYG-HGUT-02584/F157a_European_Toad__metabat2_high_PE/GCF_000172155 0.0 0
GCF_001683795/MGYG-HGUT-04532/bin3c/MGYG-HGUT-03447/R28__metabat2_low_PE/MGYG-HGUT-03413/MGYG-HGUT-02378/MGYG-HGUT-00870/MGYG-HGUT-02453/MGYG-HGUT-02452/MGYG-HGUT-02454/MGYG-HGUT-01921/MGYG-HGUT-01881/MGYG-HGUT-02584/F157a_European_Toad__metabat2_high_PE 10.809459 1
GCF_000172155 29.766209 1
GCF_001683795/MGYG-HGUT-04532/bin3c/MGYG-HGUT-03447/R28__metabat2_low_PE/MGYG-HGUT-03413 2.3454421 2
MGYG-HGUT-02378/MGYG-HGUT-00870/MGYG-HGUT-02453/MGYG-HGUT-02452/MGYG-HGUT-02454/MGYG-HGUT-01921/MGYG-HGUT-01881/MGYG-HGUT-02584/F157a_European_Toad__metabat2_high_PE 3.7313062 2
GCF_001683795 13.075742 3
MGYG-HGUT-04532/bin3c/MGYG-HGUT-03447/R28__metabat2_low_PE/MGYG-HGUT-03413 9.6068781 3
MGYG-HGUT-02378/MGYG-HGUT-00870/MGYG-HGU

{'depth', 'dist', 'genome', 'name', 'support'}

## Parsing the node names of the `newick` file

In [12]:
tree = Tree(nwk_path, format=1)
# print(tree)

In [13]:
for i, node in enumerate(tree.traverse()):
    if not node.is_leaf():
        node.name = 'N' + str(i)
        print(node.name)
tree_outf = root.joinpath('EstimatedSpeciesTree_edited.nwk')
tree.write(format=1, outfile=tree_outf.as_posix())

N0
N1
N3
N4
N6
N7
N8
N10
N11
N12
N15
N19
N20
N21
N25


## Create the pyham object again from the new edited tree

**This time the tree should have proper node names**

In [14]:
# the orthoxml tree fails
nwk_path = root.joinpath("EstimatedSpeciesTree_edited.nwk").as_posix()
orthoxml_path = root.joinpath("HierarchicalGroups.orthoxml").as_posix()

In [15]:
# EstimatedSpeciesTree.phyloxml with tree_format='phyloxml' fails
pyham_analysis = pyham.Ham(nwk_path, orthoxml_path, use_internal_name=True, tree_format='newick')

In [16]:
for node in pyham_analysis.taxonomy.tree.traverse():
    print(node.name, node.dist, node.depth)
node.features

 0.0 0
N1 10.8095 1
GCF_000172155 29.7662 1
N3 2.34544 2
N4 3.73131 2
GCF_001683795 13.0757 3
N6 9.60688 3
N7 9.62598 3
N8 1.72572 3
MGYG-HGUT-04532 11.2044 4
N10 0.593046 4
N11 2.72772 4
N12 1.58804 4
MGYG-HGUT-02584 12.0407 4
F157a_European_Toad__metabat2_high_PE 13.0797 4
N15 0.656902 5
MGYG-HGUT-03413 9.84846 5
MGYG-HGUT-02378 0.735412 5
MGYG-HGUT-00870 0.797408 5
N19 0.402709 5
N20 1.57516 5
N21 0.465328 6
R28__metabat2_low_PE 10.3227 6
MGYG-HGUT-02453 1.22448 6
MGYG-HGUT-02452 1.72259 6
N25 0.0860082 6
MGYG-HGUT-01881 0.755767 6
bin3c 8.53252 7
MGYG-HGUT-03447 8.87125 7
MGYG-HGUT-02454 0.625057 7
MGYG-HGUT-01921 0.620852 7


{'depth', 'dist', 'genome', 'name', 'support'}

In [37]:
# pyham_analysis.get_list_ancestral_genomes()

In [18]:
hogs_dict = pyham_analysis.get_dict_top_level_hogs()
# pyham_analysis.get_list_top_level_hogs()
len(hogs_dict)

3785

In [19]:
hogs_dict['10']

<HOG(10)>

In [20]:
pyham_analysis.get_list_top_level_hogs()[:10]

[<HOG(1)>,
 <HOG(2)>,
 <HOG(3)>,
 <HOG(4)>,
 <HOG(5)>,
 <HOG(6)>,
 <HOG(7)>,
 <HOG(8)>,
 <HOG(9)>,
 <HOG(10)>]

In [21]:
pyham_analysis.hog_file_type, pyham_analysis.tree_file

('orthoxml',
 '/run/media/mibu/LR-orico_mini/Recovery_ADATA/Umayor/Work/Akker/OMA/Output/EstimatedSpeciesTree_edited.nwk')

In [22]:
node.genome.name, node.genome.get_number_genes(), node.genome.taxon,  node.name#, node.genome.genes

('MGYG-HGUT-01921',
 2346,
 Tree node 'MGYG-HGUT-01921' (0x7f878d734e8),
 'MGYG-HGUT-01921')

In [23]:
len(pyham_analysis.extant_gene_map), pyham_analysis.extant_gene_map['1703']

(41064, Gene(1703))

In [24]:
list_genome = pyham_analysis.get_list_extant_genomes()
for g in list_genome:
    print(g)

MGYG-HGUT-01881
MGYG-HGUT-02454
MGYG-HGUT-03413
MGYG-HGUT-02453
MGYG-HGUT-00870
F157a_European_Toad__metabat2_high_PE
MGYG-HGUT-01921
MGYG-HGUT-02584
MGYG-HGUT-04532
MGYG-HGUT-02452
bin3c
MGYG-HGUT-03447
GCF_001683795
R28__metabat2_low_PE
GCF_000172155
MGYG-HGUT-02378


In [25]:
ag_ls = pyham_analysis.get_list_ancestral_genomes()
for ag in ag_ls:
    print(ag)

N21
N6
N11
N4
N1
N15
N3
N7

N25
N20
N12
N8
N19
N10


In [26]:
gene3  = pyham_analysis.get_gene_by_id(3)
gene12 = pyham_analysis.get_gene_by_id(12)
gene3, gene12

(Gene(3), Gene(12))

In [27]:
gene3.get_top_level_hog(), gene12.get_top_level_hog()

(<HOG(3080)>, Gene(12))

In [44]:
gene = pyham_analysis.get_gene_by_id(12)
hog = pyham_analysis.get_hog_by_id(1703)

In [45]:
hog.get_all_descendant_genes()

[Gene(10279),
 Gene(36295),
 Gene(31508),
 Gene(38292),
 Gene(39539),
 Gene(27977),
 Gene(1528),
 Gene(19314),
 Gene(12470),
 Gene(25579),
 Gene(21231),
 Gene(14295),
 Gene(26660),
 Gene(17529)]

**Get the relevant information about genes**

In [47]:
gene.get_dict_xref()

{'id': '12', 'protId': 'GPIGBLJI_00012 hypothetical protein'}

In [31]:
from IPython.display import IFrame

## Built-in interactive visualizations

In [32]:
pyham_analysis.create_iHam(hog=hog,outfile="HOG{}.html".format(hog.hog_id));
IFrame("HOG{}.html".format(hog.hog_id), width=1200, height=800)

In [48]:
treeprofile = pyham_analysis.create_tree_profile(outfile="tp.html")
IFrame("tp.html", width=800, height=600)

## Evolutionary history of genomes

### Lateral comparison between two genomes

In [34]:
# MGYG-HGUT-01921
# MGYG-HGUT-02454
HGUT_01921 = pyham_analysis.get_extant_genome_by_name("MGYG-HGUT-01921")
HGUT_02454 = pyham_analysis.get_extant_genome_by_name("MGYG-HGUT-02454")

# Instanciate the gene mapping !
lateral_01921_02454 = pyham_analysis.compare_genomes_lateral(HGUT_01921, HGUT_02454) # The order doesn't matter!

In [35]:
# The identical genes (that stay single copies) 
print("IDENTICAL GENES")
for hogs, dict_genome_gene in lateral_01921_02454.get_retained().items():
    print(f"\nHOG at Euarchontoglires {hogs} is the ancestor of: ")
    for g, gene in dict_genome_gene.items():
        print(f"{gene} in {g}")

IDENTICAL GENES

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15835) in MGYG-HGUT-01921
Gene(27234) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15836) in MGYG-HGUT-01921
Gene(27233) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15837) in MGYG-HGUT-01921
Gene(27232) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15838) in MGYG-HGUT-01921
Gene(27231) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15839) in MGYG-HGUT-01921
Gene(27230) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15840) in MGYG-HGUT-01921
Gene(27229) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15841) in MGYG-HGUT-01921
Gene(27228) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(15842) in MGYG-HGUT-01921
Gene(27227) in MGYG-HGUT-02454

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Ge

In [36]:
# The identical genes (that stay single copies) 
print("GAINED GENES")
for genome, gains in lateral_01921_02454.get_gained().items():
    print(f"\nHOG at Euarchontoglires {hogs} is the ancestor of: ")
    for g in gains:
        print(f"{gene} in {g}")

GAINED GENES

HOG at Euarchontoglires <HOG()> is the ancestor of: 
Gene(27813) in Gene(15865)
Gene(27813) in Gene(15901)
Gene(27813) in Gene(15902)
Gene(27813) in Gene(15903)
Gene(27813) in Gene(15907)
Gene(27813) in Gene(15908)
Gene(27813) in Gene(15909)
Gene(27813) in Gene(15915)
Gene(27813) in Gene(15916)
Gene(27813) in Gene(15917)
Gene(27813) in Gene(15919)
Gene(27813) in Gene(15921)
Gene(27813) in Gene(15923)
Gene(27813) in Gene(15925)
Gene(27813) in Gene(15926)
Gene(27813) in Gene(15930)
Gene(27813) in Gene(15931)
Gene(27813) in Gene(15932)
Gene(27813) in Gene(15934)
Gene(27813) in Gene(15935)
Gene(27813) in Gene(15936)
Gene(27813) in Gene(15940)
Gene(27813) in Gene(15941)
Gene(27813) in Gene(15942)
Gene(27813) in Gene(15943)
Gene(27813) in Gene(15952)
Gene(27813) in Gene(15954)
Gene(27813) in Gene(15955)
Gene(27813) in Gene(15989)
Gene(27813) in Gene(16001)
Gene(27813) in Gene(16021)
Gene(27813) in Gene(16028)
Gene(27813) in Gene(16049)
Gene(27813) in Gene(16050)
Gene(27813) in 