# Checking for HGT with ete3

This notebook was used to check for possible horizontal gene transfer (HGT) of single-copy orthologues (SCOs).

Here, we compare single-gene trees to the species tree.

**Concept**

We have a species tree inferred from 136 SCO.

To check whether HGT is possible, we check how each SCOG is distributed across the species tree.

**Set Up**

In [1]:
from ete3 import Tree
from pathlib import Path
from collections import defaultdict
import seaborn as sns
import pandas as pd
from Bio import AlignIO
import hashlib

In [2]:
species_tree = Tree('((((H,K)D,(F,I)G)B,E)A,((L,(N,Q)O)J,(P,S)M)C);', format=1)

In [3]:
print(f'This is species tree: {species_tree}')

This is species tree: 
            /-H
         /-|
        |   \-K
      /-|
     |  |   /-F
   /-|   \-|
  |  |      \-I
  |  |
  |   \-E
--|
  |      /-L
  |   /-|
  |  |  |   /-N
  |  |   \-|
   \-|      \-Q
     |
     |   /-P
      \-|
         \-S


In [4]:
gene_tree = Tree('(HK, ((FIE,L), (NQ,PS)));')
print(f'This is gene tree:{gene_tree}')

This is gene tree:
   /-HK
  |
--|      /-FIE
  |   /-|
  |  |   \-L
   \-|
     |   /-NQ
      \-|
         \-PS


1. First, consider the leaf/clade `HK` in the gene tree, and find the most recent common ancestor of `H` and `K` in the species tree.

If the clade below that ancestor contains only `H` and `K`, we conclude that there is no evidence of HGT.

If, however, other species are found in that clade, HGT may have occurred.

In [5]:
ancestor = species_tree.get_common_ancestor("H", "K")
print(f"Most recent common ancestor for H and K in species tree is: {[leaf.name for leaf in ancestor]}")

Most recent common ancestor for H and K in species tree is: ['H', 'K']


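2. We can now investigate `F`, `I` and `E`, which are collapsed into the single leaf `FIE` in the gene tree.

Here, HGT has likely occurred: the most recent common ancestor of `F`, `I` and `E` in the species tree also includes `H` and `K`.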

In [6]:
ancestor = species_tree.get_common_ancestor("F", "I", "E")
print(f"Most recent common ancestor for F, I and E in species tree is: {[leaf.name for leaf in ancestor]}")

Most recent common ancestor for F, I and E in species tree is: ['H', 'K', 'F', 'I', 'E']
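
The two checks above follow the same pattern: find the smallest clade that contains a group of leaves, then compare its leaf set with the group. As a minimal sketch of that logic, the block below uses plain nested tuples as a stand-in for the toy species tree rather than ete3, and the helper names `leaves`, `mrca_leaves` and `is_monophyletic` are hypothetical, introduced only for illustration.

```python
# Toy species tree from above, written as nested tuples (internal node names dropped).
# This is a pure-Python stand-in for the ete3 Tree, used only to illustrate the check.
TOY_TREE = (((("H", "K"), ("F", "I")), "E"), (("L", ("N", "Q")), ("P", "S")))


def leaves(node):
    """Return the set of leaf names below a node (a plain string is a leaf)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def mrca_leaves(tree, targets):
    """Return the leaves of the smallest clade that contains every target."""
    targets = set(targets)
    node = tree
    while not isinstance(node, str):
        for child in node:
            # Descend while a single child still holds all targets
            if targets <= leaves(child):
                node = child
                break
        else:
            # No single child holds all targets: this node is the common ancestor
            break
    return leaves(node)


def is_monophyletic(tree, targets):
    """True if the clade below the common ancestor contains only the targets."""
    return mrca_leaves(tree, targets) == set(targets)


print(sorted(mrca_leaves(TOY_TREE, ["H", "K"])))       # ['H', 'K']
print(sorted(mrca_leaves(TOY_TREE, ["F", "I", "E"])))  # ['E', 'F', 'H', 'I', 'K']
print(is_monophyletic(TOY_TREE, ["F", "I", "E"]))      # False
```

A group that fails this test is only a candidate for HGT; the comparison says nothing about the direction or cause of the mismatch.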


**Loading our species tree and rooting it at the midpoint**

In [7]:
species_tree = Tree("../../supplementary_file_5/output/tree/04_tbe.raxml.support")
R = species_tree.get_midpoint_outgroup()
# and set it as tree outgroup
species_tree.set_outgroup(R)

**Step 1**
When calculating the species tree, we concatenated the SCO alignments and removed gaps with trimAl. 
We also retained information about the alignment partitions, i.e. which columns in the MSA represent each SCO. We can use this information to extract the columns of interest (for each individual SCO) and see how identical sequences are distributed across the tree.

So, here we first extract the individual alignment for each SCO from the concatenated and trimmed SCO alignment. 
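
As a rough sketch of how a partition entry maps onto alignment columns, the block below parses a single line and converts its 1-based, inclusive range into a 0-based Python slice. The example line (including the `GTR+G` prefix and the `_aln` suffix) and the `parse_partition` helper are assumptions made for illustration; the real partition file may be laid out differently.

```python
import re

# Matches "<name> = <start>-<end>" at the end of a partition line
PARTITION_RE = re.compile(r"(?P<name>[^,=\s]+)\s*=\s*(?P<start>\d+)-(?P<end>\d+)\s*$")


def parse_partition(line):
    """Return (SCO id, 0-based slice start, exclusive slice end) for one line."""
    match = PARTITION_RE.search(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised partition line: {line!r}")
    sco = match.group("name").split("_")[0]  # keep the part before the first underscore
    start = int(match.group("start")) - 1    # 1-based inclusive -> 0-based
    end = int(match.group("end"))            # inclusive end doubles as exclusive slice end
    return sco, start, end


# Hypothetical partition line, for illustration only
example_line = "GTR+G, OG0002083_aln = 5-8"
sco, start, end = parse_partition(example_line)

# A plain string stands in for one row of the concatenated alignment
row = "AAAACCCCGGGG"
print(sco, start, end)  # OG0002083 4 8
print(row[start:end])   # CCCC
```

The same `start:end` pair can be used to slice alignment columns, since only the start needs shifting when moving from 1-based inclusive coordinates to Python slices.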

In [8]:
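def get_aligments(SCO, start, end):
    """Write alignment columns start:end (0-based, end-exclusive) to the FASTA file for this SCO."""
    # Note: the concatenated alignment is re-read on every call
    alignment = AlignIO.read(Path("../../supplementary_file_5/output/alignments/concatenated/no_gaps_concatenated_sco.fasta").expanduser(), "fasta")

    AlignIO.write(alignment[:, start:end], Path(f"../output/SCOGs_sequences/{SCO}.fasta").expanduser(), "fasta")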
    

In [9]:
file_path = Path("../../supplementary_file_5/output/alignments/concatenated/concatenated_modeltest_fixed_positions.part").expanduser()

# Open the partition file
with open(file_path, 'r') as file:
    # Each line names one SCO and gives its 1-based, inclusive column range
    for line in file:
        partition = line.strip().split(', ')[-1]
        SCO = partition.split('_')[0]
        start, end = partition.split('= ')[1].split('-')
        # Shift the start to 0-based; the inclusive end doubles as the exclusive slice end
        get_aligments(SCO, int(start) - 1, int(end))

**Step 2** Using a hashing method, we can now check which species share the same nucleotide sequence, and whether they form monophyletic clades on the species tree.
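
As a minimal, standalone sketch of the hashing idea, the block below groups toy sequences by the MD5 digest of their sequence string, so that genomes with identical sequences end up in the same cluster. The sequences and the `cluster_identical` helper are made up for illustration; only the grouping logic is the point.

```python
import hashlib
from collections import defaultdict


def cluster_identical(sequences):
    """Group names by the MD5 digest of their sequence string."""
    clusters = defaultdict(list)
    for name, seq in sequences.items():
        digest = hashlib.md5(seq.encode("utf-8")).hexdigest()
        clusters[digest].append(name)
    return dict(clusters)


# Toy sequences (hypothetical): H/K share one variant, F/I/E share another
toy_sequences = {
    "H": "ATGC",
    "K": "ATGC",
    "F": "ATGA",
    "I": "ATGA",
    "E": "ATGA",
    "L": "TTGC",
}

clusters = cluster_identical(toy_sequences)

# Keep only variants shared by at least two genomes
shared = sorted(sorted(names) for names in clusters.values() if len(names) > 1)
print(shared)  # [['E', 'F', 'I'], ['H', 'K']]
```

The digest serves only as a compact dictionary key here; each shared cluster is then a candidate group for the monophyly check against the species tree.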

In [10]:
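def check_HGT_and_assign_colour_for_viz(species_tree, alignment):
    """Return colours, HGT flags and variant numbers for genomes with possible HGT.

    Genomes sharing an identical nucleotide sequence are grouped; groups that do not
    form a monophyletic clade on the species tree are flagged as possible HGT, and each
    such variant gets its own colour. Genomes where HGT is unlikely get no colour.

    :param species_tree: species phylogenetic tree
    :param alignment: path to the FASTA alignment of a single SCO
    """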
    
    clusters = defaultdict(list)
    
    alignment = AlignIO.read(alignment, "fasta")
    
    # Get a dictionary of genomes sharing the same nucleotide sequence, keyed by the sequence hash
    for _ in alignment:
        clusters[hashlib.md5(str(_.seq).encode('utf-8')).hexdigest()].append(_.name)

    HGT_variants = {}
    colours = {}
    HGT_predictions = {}

    # Check for HGT
    current = 0
    for cluster, genomes in clusters.items():
        if len(genomes) != 1:  # Only consider genomes that share the same nt sequence with at least one other genome
            ancestor = species_tree.get_common_ancestor(genomes)  # Most recent common ancestor
            if sorted(genomes) != sorted([_.name for _ in ancestor]):  # Group is not monophyletic: possible HGT
                current += 1
                for _ in genomes:
                    HGT_predictions[_] = 'True'
                    HGT_variants[_] = str(current)

    palette = sns.color_palette("Spectral", len(set(HGT_variants.values())))
    palette=palette.as_hex()

    x = dict(zip(list(set(HGT_variants.values())), palette))
    
    for HGT_variant, colour in x.items():
        for genome, variant in HGT_variants.items():

            if HGT_variant == variant:
                colours[genome] = colour
    
    return colours, HGT_predictions, HGT_variants



    
    

In [11]:
df_viz = pd.DataFrame({'Name': [_.name for _ in species_tree]})
df_HGT = pd.DataFrame({'Name': [_.name for _ in species_tree]})
df_variants = pd.DataFrame({'Name': [_.name for _ in species_tree]})

In [12]:
datadir = Path("../output/SCOGs_sequences").expanduser()
filenames = sorted(datadir.glob("*"))

In [13]:
for filename in filenames:
    SCO = filename.name.split('.')[0]  # file name without its extension
    viz_data, HGT_data, HGT_variants = check_HGT_and_assign_colour_for_viz(species_tree, filename)
    df_viz[SCO] = df_viz.Name.map(viz_data).fillna('#F0F0F0')
    df_HGT[SCO] = df_HGT.Name.map(HGT_data).fillna('False')
    df_variants[SCO] = df_variants.Name.map(HGT_variants).fillna('0')



In [14]:
df_viz.to_csv(Path("../output/SCOGs_distribution_vizualisation_data.csv").expanduser(), index=False)
df_HGT.to_csv(Path("../output/HGT_predictions_data.csv").expanduser(), index=False)

**Step 3**
For visualisation, reduce the amount of data by removing columns (SCOs) where no HGT is predicted, and sort the remaining columns from least to most diverse.

In [15]:
df_viz = df_viz.set_index('Name')

In [16]:
df_viz_reduced = df_viz.loc[:, (df_viz != '#F0F0F0').any(axis=0)]

In [17]:

# Step 1: Calculate the number of unique values in each column
unique_counts = df_viz_reduced.nunique()

# Step 2: Create a dictionary with column names and unique value counts
column_dict = dict(unique_counts)

# Step 3: Sort the dictionary by values in ascending order
sorted_columns = sorted(column_dict, key=column_dict.get)

# Step 4: Extract the sorted column names
sorted_column_names = list(sorted_columns)

# Step 5: Reorder the DataFrame columns based on the sorted column names
df_viz_reduced = df_viz_reduced[sorted_column_names]

# Print the reordered DataFrame
df_viz_reduced

Name,OG0002083,OG0002171,OG0002192,OG0002202,OG0002242,OG0002247,OG0002077,OG0002081,OG0002116,OG0002125,...,OG0002193,OG0002194,OG0002120,OG0002282,OG0002167,OG0002191,OG0002080,OG0002076,OG0002086,OG0002082
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#a7dba4,#F0F0F0,#466eb1,#F0F0F0,#f3faac,#eff9a6,#fbfdb8,#69c3a5,#e3534a,#fff3ac
GCF_016901035.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#a7dba4,#fcaa5f,#F0F0F0,#fff7b2,#f3faac,#F0F0F0,#fbfdb8,#ffffbe,#e3534a,#d63f4f
GCF_016906245.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#f3faac,#eff9a6,#F0F0F0,#69c3a5,#F0F0F0,#F0F0F0
GCF_001905905.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#466eb1,#F0F0F0,#f3faac,#eff9a6,#F0F0F0,#69c3a5,#e3534a,#F0F0F0


In [18]:
df_viz_reduced.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_reduced.csv").expanduser())

**Checking whether the SCO variants that do not form monophyletic clades are scattered across multiple candidate genera.**

This will be done in the following steps:
- Step 1: Get the columns of interest, i.e. those where HGT was suspected. 
- Step 2: Write a function that returns a dictionary with the list of genomes sharing the same variant, keyed by the variant's assigned number.
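
The idea behind these steps can be sketched with plain dictionaries: group genomes by variant number, then keep only the variants whose genomes belong to more than one candidate genus. The genome names, genus IDs and the `variants_spanning_genera` helper below are hypothetical and serve only to illustrate the logic.

```python
from collections import defaultdict


def variants_spanning_genera(variant_by_genome, genus_by_genome):
    """Return {variant: genomes} for variants found in more than one genus."""
    genomes_per_variant = defaultdict(list)
    for genome, variant in variant_by_genome.items():
        if variant != "0":  # "0" marks genomes where no HGT was suspected
            genomes_per_variant[variant].append(genome)

    return {
        variant: genomes
        for variant, genomes in genomes_per_variant.items()
        if len({genus_by_genome[genome] for genome in genomes}) > 1
    }


# Toy data (hypothetical): variant "1" spans genera 1 and 2, variant "2" stays in genus 3
variant_by_genome = {"g1": "1", "g2": "1", "g3": "2", "g4": "2", "g5": "0"}
genus_by_genome = {"g1": 1, "g2": 2, "g3": 3, "g4": 3, "g5": 1}

scattered = variants_spanning_genera(variant_by_genome, genus_by_genome)
print(scattered)  # {'1': ['g1', 'g2']}
```

Only the genomes carrying such scattered variants would keep their colour in the final visualisation table.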

*Step 1*

In [19]:
df_variants_reduced = df_variants.loc[:, (df_variants != '0').any(axis=0)]

*Step 2*

In [20]:
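def get_variant_counts(data, column1, column2):
    """Return a dictionary of column1 values (genomes) sharing a variant, keyed by the column2 value (variant number)."""
    result_dict = defaultdict(list)

    # Pair each variant number with its genome, skipping '0' (no HGT suspected)
    for key, value in zip(data[column2], data[column1]):
        if key != '0':
            result_dict[key].append(value)

    return result_dict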

    
    

In [21]:
genus_data = pd.read_csv(Path("../../supplementary_file_10/output/pyANI_genus_IDs.csv").expanduser()).set_index('accession').to_dict()['genus_ID_pc_3']

In [22]:
def get_viz_data(genus_dictionary, variants_dictionary, SCOG):
    
    genus_representation_per_variant = defaultdict(list)
    suspected_genomes = []
    
    colours = {}
    
    # Get a dictionary listing the genus of every genome that carries a given SCOG variant
    for variant, genomes in variants_dictionary.items():
        for genome in genomes:
            genus_representation_per_variant[variant].append(genus_dictionary[genome])

    # Get the list of variants that are present across multiple genera
    variants_of_interest = [variant for variant, genus in genus_representation_per_variant.items() if len(set(genus)) != 1]

    # Get the list of genomes that share the variants of interest
    for variant, genomes in variants_dictionary.items():
        if variant in variants_of_interest:
            suspected_genomes.extend(genomes)
            
    #Extracting colours
    current_colours = pd.read_csv(Path("../output/SCOGs_distribution_vizualisation_data.csv").expanduser()).set_index('Name').to_dict()[SCOG]
    
    for genome, colour in current_colours.items():
        if genome in suspected_genomes:
            colours[genome] = colour
            
            
    return colours

In [23]:
df_viz = pd.DataFrame({'Name': [_.name for _ in species_tree]})

In [24]:
for SCOG in df_variants_reduced:
    if SCOG != 'Name':
        genomes_per_variants = get_variant_counts(df_variants_reduced, 'Name', SCOG)
        x = get_viz_data(genus_data, genomes_per_variants, SCOG)
        if len(x) != 0:
            df_viz[SCOG]= df_viz.Name.map(x).fillna('#F0F0F0')

In [25]:
df_viz.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_genus_split.csv"), index=False)