# Checking for HGT with ete3

This notebook was used to check for possible horizontal gene transfer (HGT) among single-copy orthologues (SCOs). 

Here, we compare single gene trees to the species tree. 

**Concept**

We have a species tree inferred from 136 SCOs.

To check whether HGT is possible, we examine how each single-copy orthologue group (SCOG) is distributed across the species tree.

**Set Up**

In [339]:
from ete3 import Tree
from pathlib import Path
from collections import defaultdict
import seaborn as sns
import pandas as pd
from Bio import AlignIO
import hashlib

In [340]:
species_tree = Tree('((((H,K)D,(F,I)G)B,E)A,((L,(N,Q)O)J,(P,S)M)C);', format=1)

In [341]:
print(f'This is species tree: {species_tree}')

This is species tree: 
            /-H
         /-|
        |   \-K
      /-|
     |  |   /-F
   /-|   \-|
  |  |      \-I
  |  |
  |   \-E
--|
  |      /-L
  |   /-|
  |  |  |   /-N
  |  |   \-|
   \-|      \-Q
     |
     |   /-P
      \-|
         \-S


In [342]:
gene_tree = Tree('(HK, ((FIE,L), (NQ,PS)));')
print(f'This is gene tree:{gene_tree}')

This is gene tree:
   /-HK
  |
--|      /-FIE
  |   /-|
  |  |   \-L
   \-|
     |   /-NQ
      \-|
         \-PS


1. First, consider the leaf/clade `HK` in the gene tree and find the most recent common ancestor of H and K in the species tree.

If the clade below that ancestor contains only `H` and `K`, the group is monophyletic in the species tree and there is no evidence of HGT.

If other species are also found there, HGT may have occurred.

In [343]:
ancestor = species_tree.get_common_ancestor("H", "K")
print(f"Leaves under the most recent common ancestor of H and K in the species tree: {[_.name for _ in ancestor]}")

Leaves under the most recent common ancestor of H and K in the species tree: ['H', 'K']


2. We can now investigate `F`, `I` and `E`. 

Here the common ancestor also covers `H` and `K`, so `F`, `I` and `E` are not monophyletic in the species tree and HGT is possible. 

In [344]:
ancestor = species_tree.get_common_ancestor("F", "I", "E")
print(f"Leaves under the most recent common ancestor of F, I and E in the species tree: {[_.name for _ in ancestor]}")

Leaves under the most recent common ancestor of F, I and E in the species tree: ['H', 'K', 'F', 'I', 'E']
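
The two checks above boil down to one question: are the leaves below the most recent common ancestor exactly the group we started with? The following is a minimal sketch of that logic without ete3, using nested tuples for the toy species tree. The names `SPECIES_TREE`, `leaves`, `mrca_leaves` and `is_monophyletic` are made up for this illustration and are not part of ete3.

```python
# Toy species tree from the example above, written as nested tuples.
# Internal node labels (A, B, C, ...) are dropped because only leaf sets matter here.
SPECIES_TREE = (
    ((("H", "K"), ("F", "I")), "E"),
    (("L", ("N", "Q")), ("P", "S")),
)


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def mrca_leaves(node, targets):
    """Return the leaf set of the smallest clade containing all targets."""
    targets = set(targets)
    if not isinstance(node, str):
        for child in node:
            # Descend while a single child still contains every target.
            if targets <= leaves(child):
                return mrca_leaves(child, targets)
    return leaves(node)


def is_monophyletic(tree, targets):
    """True when the targets are exactly the leaves below their common ancestor."""
    return mrca_leaves(tree, targets) == set(targets)


print(is_monophyletic(SPECIES_TREE, ["H", "K"]))       # H and K form a clade
print(is_monophyletic(SPECIES_TREE, ["F", "I", "E"]))  # clade also holds H and K
```

The sketch recomputes leaf sets at every level, which is fine for a ten-leaf toy tree but would be slow on a large phylogeny.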


**Loading our species tree and rooting it at the midpoint**

In [345]:
species_tree = Tree("../../supplementary_file_5/output/tree/04_tbe.raxml.support")
R = species_tree.get_midpoint_outgroup()
# and set it as tree outgroup
species_tree.set_outgroup(R)

**Step 1**
When calculating the species tree, we concatenated the SCO alignments and removed gaps with trimAl. 
We also retained the alignment partition information, i.e. which columns of the MSA belong to each SCO. We can use this information to extract the columns for each individual SCO and see how identical sequences are distributed across the tree.

So, we first extract an individual alignment for each SCO from the concatenated and trimmed alignment. 

In [346]:
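def get_aligments(SCO, start, end):
    """Write columns start:end of the concatenated alignment to a per-SCO FASTA file."""
    alignment = AlignIO.read(Path("../../supplementary_file_5/output/alignments/concatenated/no_gaps_concatenated_sco.fasta").expanduser(), "fasta")    
    
    # Slice the columns (0-based start, exclusive end) and write them out
    AlignIO.write(alignment[:, start:end], Path(f"../output/SCOGs_sequences/{SCO}.fasta").expanduser(), "fasta")
    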

In [347]:
file_path = Path("../../supplementary_file_5/output/alignments/concatenated/concatenated_modeltest_fixed_positions.part").expanduser()

# Read the partition file line by line
with open(file_path, 'r') as file:
    for line in file:
        # The last comma-separated field holds the SCO name and its column range
        partition = line.strip().split(', ')[-1]
        SCO = partition.split('_')[0]
        start, end = partition.split('= ')[1].split('-')
        # Partition coordinates are 1-based and inclusive; slices are 0-based and end-exclusive
        get_aligments(SCO, int(start) - 1, int(end))
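
To make the coordinate conversion explicit, here is a small sketch of parsing a single partition line. The example line, the regular expression and the function name `parse_partition` are assumptions made for illustration; the model name and the suffix after the underscore in the real partition file may look different.

```python
import re

# Hypothetical partition line; the real file's model name and suffix may differ.
EXAMPLE_LINE = "GTR+G, OG0002076_trimmed = 1-300"

PARTITION_RE = re.compile(
    r",\s*(?P<name>[^_\s=]+)\S*\s*=\s*(?P<start>\d+)-(?P<end>\d+)\s*$"
)


def parse_partition(line):
    """Return (SCO name, zero-based start, exclusive end) for one partition line."""
    match = PARTITION_RE.search(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised partition line: {line!r}")
    # Partition coordinates are 1-based and inclusive; Python slices are
    # 0-based and end-exclusive, so only the start needs shifting.
    return match["name"], int(match["start"]) - 1, int(match["end"])


name, start, end = parse_partition(EXAMPLE_LINE)
print(name, start, end)
```

With these values, `alignment[:, start:end]` selects columns 1 to 300 of the concatenated alignment, which is the same slice the loop above passes to `get_aligments`.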

**Step 2** Using a hashing method, we can now check which species share the same nucleotide sequence and whether they form monophyletic clades on the species tree.

In [348]:
def check_HGT_and_assign_colour_for_viz(species_tree, alignment):
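    """Return colours, HGT flags and variant IDs for genomes with possible HGT.

    Genomes that share an identical nucleotide sequence but do not form a
    monophyletic group on the species tree get one colour per variant.
    Genomes where HGT is unlikely are not assigned a colour.
    
    :param species_tree: species phylogenetic tree
    :param alignment: path to the single-SCO alignment (FASTA)
    """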
    
    clusters = defaultdict(list)
    
    alignment = AlignIO.read(alignment, "fasta")
    
    #Getting dictionary with genomes sharing the same nucleotide sequence keyed by the sequence hash
    for _ in alignment:
        clusters[hashlib.md5(str(_.seq).encode('utf-8')).hexdigest()].append(_.name)
        

    
    
    
    HGT_variants = {}
    colours = {}
    HGT_predictions = {}
    
    #Check for HGT
    current = 0
    for cluster, genomes in clusters.items():
        if len(genomes) !=1: #Only consider sequences shared by at least two genomes
            ancestor = species_tree.get_common_ancestor(genomes) #Get most recent common ancestor
            if sorted(genomes) != sorted([_.name for _ in ancestor]): #Group is not monophyletic: possible HGT
                current +=1
                for _ in genomes:
                    HGT_predictions[_] = 'True'
                    HGT_variants[_] = str(current)
    
    
    palette = sns.color_palette("Spectral", len(set(HGT_variants.values())))
    palette=palette.as_hex()

    x = dict(zip(list(set(HGT_variants.values())), palette))
    
    for HGT_variant, colour in x.items():
        for genome, variant in HGT_variants.items():

            if HGT_variant == variant:
                colours[genome] = colour
    return colours, HGT_predictions, HGT_variants
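
The grouping step at the start of the function above can be tried in isolation. This is a minimal sketch on made-up sequences; `cluster_identical` and `toy` are hypothetical names, and the input is a plain dictionary rather than a Biopython alignment.

```python
import hashlib
from collections import defaultdict


def cluster_identical(sequences):
    """Group genome names by the MD5 digest of their aligned sequence.

    :param sequences: mapping of genome name -> aligned sequence string
    """
    clusters = defaultdict(list)
    for name, seq in sequences.items():
        digest = hashlib.md5(seq.encode("utf-8")).hexdigest()
        clusters[digest].append(name)
    # Keep only variants shared by at least two genomes.
    return [sorted(names) for names in clusters.values() if len(names) > 1]


# Made-up sequences for illustration only.
toy = {"H": "ACGT", "K": "ACGT", "F": "ACGA", "I": "ACGC", "E": "ACGA"}
shared = cluster_identical(toy)
print(shared)
```

Hashing compares the strings exactly, so sequences that differ only in letter case or in gap characters end up in different groups.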



    
    

In [349]:
df_viz = pd.DataFrame({'Name': [_.name for _ in species_tree]})
df_HGT = pd.DataFrame({'Name': [_.name for _ in species_tree]})
df_variants = pd.DataFrame({'Name': [_.name for _ in species_tree]})

In [350]:
datadir = Path("../output/SCOGs_sequences").expanduser()
filenames = sorted(datadir.glob("*"))

In [351]:
for _ in filenames:
    SCO = str(_).split('/')[-1].split('.')[0]
    viz_data, HGT_data, HGT_variants = check_HGT_and_assign_colour_for_viz(species_tree, _)
    df_viz[SCO]= df_viz.Name.map(viz_data).fillna('#F0F0F0')
    df_HGT[SCO]= df_HGT.Name.map(HGT_data).fillna('False')
    df_variants[SCO]= df_variants.Name.map(HGT_variants).fillna('0')



In [352]:
df_viz.to_csv(Path("../output/SCOGs_distribution_vizualisation_data.csv").expanduser(), index=False)
df_HGT.to_csv(Path("../output/HGT_predictions_data.csv").expanduser(), index=False)

**Step 3**
For visualization, reduce the amount of data by removing columns (SCOs) where no HGT is predicted, and sort the remaining columns from least to most diverse.

In [353]:
df_viz = df_viz.set_index('Name')

In [322]:
df_viz_reduced = df_viz.loc[:, (df_viz != '#F0F0F0').any(axis=0)]

In [354]:
# Step 1: Calculate the number of unique values in each column
unique_counts = df_viz_reduced.nunique()

# Step 2: Create a dictionary with column names and unique value counts
column_dict = dict(unique_counts)

# Step 3: Sort the dictionary by values in ascending order
sorted_columns = sorted(column_dict, key=column_dict.get)

# Step 4: Extract the sorted column names
sorted_column_names = list(sorted_columns)

# Step 5: Reorder the DataFrame columns based on the sorted column names
df_viz_reduced = df_viz_reduced[sorted_column_names]

# Print the reordered DataFrame
df_viz_reduced

Name,OG0002083,OG0002171,OG0002192,OG0002202,OG0002242,OG0002247,OG0002077,OG0002081,OG0002116,OG0002125,...,OG0002193,OG0002194,OG0002120,OG0002282,OG0002167,OG0002191,OG0002080,OG0002076,OG0002086,OG0002082
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#be254a,#F0F0F0,#466eb1,#F0F0F0,#fff0a6,#feeb9d,#fee491,#525fa9,#f0f9a7,#e55749
GCF_016901035.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#be254a,#e9f69d,#F0F0F0,#d63f4f,#fff0a6,#F0F0F0,#fee491,#feec9f,#f0f9a7,#fcfeba
GCF_016906245.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#fff0a6,#feeb9d,#F0F0F0,#525fa9,#F0F0F0,#F0F0F0
GCF_001905905.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#466eb1,#F0F0F0,#fff0a6,#feeb9d,#F0F0F0,#525fa9,#f0f9a7,#F0F0F0


In [355]:
df_viz_reduced.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_reduced.csv").expanduser())
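
The filtering and five-step reordering in Step 3 can also be written more compactly. The sketch below runs on a made-up colour table; the column names `OG_a`, `OG_b`, `OG_c` and the genome names are placeholders. It assumes pandas is available, as elsewhere in this notebook.

```python
import pandas as pd

BACKGROUND = "#F0F0F0"

# Made-up colour table: rows are genomes, columns are SCOGs.
toy = pd.DataFrame(
    {
        "OG_a": [BACKGROUND, BACKGROUND, BACKGROUND],
        "OG_b": ["#be254a", "#466eb1", BACKGROUND],
        "OG_c": ["#be254a", BACKGROUND, BACKGROUND],
    },
    index=pd.Index(["g1", "g2", "g3"], name="Name"),
)

# Keep columns with at least one non-background cell ...
reduced = toy.loc[:, (toy != BACKGROUND).any(axis=0)]
# ... then order them from fewest to most distinct values.
# A stable sort keeps the original column order for ties, like sorted() does.
order = reduced.nunique().sort_values(kind="mergesort").index
reduced = reduced[order]
print(list(reduced.columns))
```

Note that `nunique` counts the background colour as one of the distinct values, which is also true of the cell above.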

**Checking whether the SCO variants that do not form monophyletic groups are scattered across multiple candidate genera.**

This will be done in the following steps:
- Step 1: Get the columns of interest, i.e. those where HGT was suspected. 
- Step 2: Write a function that returns a dictionary of genome lists sharing the same variant, keyed by the variant's assigned number.

*Step 1*

In [356]:
df_variants_reduced = df_variants.loc[:, (df_variants != '0').any(axis=0)]

*Step 2*

In [357]:
def get_variant_counts(data, column1, column2):
    
    result_dict = defaultdict(list)
    
    # Iterate over the columns
    for key, value in zip(data[column2], data[column1]):
        if key != '0':
            result_dict[key].append(value)

    
    return result_dict

    
    

In [358]:
genus_data = pd.read_csv(Path("../../supplementary_file_9/output/pyANI_genus_IDs.csv").expanduser()).set_index('accession').to_dict()['genus_ID_pc_4_with_ID']

In [359]:
def get_viz_data(genus_dictionary, variants_dictionary, SCOG):
    
    genus_representation_per_variant = defaultdict(list)
    suspected_genomes = []
    
    colours = {}
    
    #Getting dictionary with the genus of every genome carrying a given SCOG variant
    for variant, genomes in variants_dictionary.items():
        for genome in genomes:
            genus_representation_per_variant[variant].append(genus_dictionary[genome])
            
    #Getting list of variants that are present across multiple genera
    variants_of_interest = [variant for variant, genus in genus_representation_per_variant.items() if len(list(set(genus))) != 1]

    #Getting list of genomes that share the variants of interest
    for variant, genomes in variants_dictionary.items():
        if variant in variants_of_interest:
            suspected_genomes.extend(genomes)
            
    #Extracting colours
    current_colours = pd.read_csv(Path("../output/SCOGs_distribution_vizualisation_data.csv").expanduser()).set_index('Name').to_dict()[SCOG]
    
    for genome, colour in current_colours.items():
        if genome in suspected_genomes:
            colours[genome] = colour            
    return colours
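
The core decision in the function above is whether the genomes sharing a variant belong to one genus or to several. A minimal sketch of that split follows; `split_by_genus`, the accessions and the genus labels are hypothetical and only serve the illustration.

```python
def split_by_genus(variant_genomes, genome_genus):
    """Split variants into those spanning several genera and those confined to one.

    :param variant_genomes: mapping of variant ID -> list of genome accessions
    :param genome_genus: mapping of genome accession -> genus label
    """
    multi, single = {}, {}
    for variant, genomes in variant_genomes.items():
        genera = {genome_genus[g] for g in genomes}
        # More than one distinct genus label means the variant crosses genera.
        (multi if len(genera) > 1 else single)[variant] = genomes
    return multi, single


# Hypothetical accessions and genus labels, for illustration only.
variants = {"1": ["g1", "g2"], "2": ["g3", "g4"]}
genus = {"g1": "genus_1", "g2": "genus_2", "g3": "genus_1", "g4": "genus_1"}
multi, single = split_by_genus(variants, genus)
print(multi)
print(single)
```

Returning both groups from one pass avoids maintaining two nearly identical functions that differ only in the comparison operator.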

In [360]:
df_viz = pd.DataFrame({'Name': [_.name for _ in species_tree]})

In [361]:
for SCOG in df_variants_reduced:
    if SCOG != 'Name':
        genomes_per_variants = get_variant_counts(df_variants_reduced, 'Name', SCOG)
        x = get_viz_data(genus_data, genomes_per_variants, SCOG)
        if len(x) != 0:
            df_viz[SCOG]= df_viz.Name.map(x).fillna('#F0F0F0')

In [363]:
df_viz = df_viz.set_index('Name')
# Step 1: Calculate the number of unique values in each column
unique_counts = df_viz_reduced.nunique()

# Step 2: Create a dictionary with column names and unique value counts
column_dict = dict(unique_counts)

# Step 3: Sort the dictionary by values in ascending order
sorted_columns = sorted(column_dict, key=column_dict.get)

# Step 4: Extract the sorted column names
sorted_column_names = list(sorted_columns)

# Step 5: Reorder the DataFrame columns based on the sorted column names
df_viz_reduced = df_viz_reduced[sorted_column_names]

# Print the reordered DataFrame
df_viz_reduced

Name,OG0002083,OG0002171,OG0002192,OG0002202,OG0002242,OG0002247,OG0002077,OG0002081,OG0002116,OG0002125,...,OG0002193,OG0002194,OG0002120,OG0002282,OG0002167,OG0002191,OG0002080,OG0002076,OG0002086,OG0002082
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#be254a,#F0F0F0,#466eb1,#F0F0F0,#fff0a6,#feeb9d,#fee491,#525fa9,#f0f9a7,#e55749
GCF_016901035.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#be254a,#e9f69d,#F0F0F0,#d63f4f,#fff0a6,#F0F0F0,#fee491,#feec9f,#f0f9a7,#fcfeba
GCF_016906245.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#fff0a6,#feeb9d,#F0F0F0,#525fa9,#F0F0F0,#F0F0F0
GCF_001905905.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#466eb1,#F0F0F0,#fff0a6,#feeb9d,#F0F0F0,#525fa9,#f0f9a7,#F0F0F0


In [366]:
df_viz.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_genus_split.csv"))

In [367]:
df_viz

Name,OG0002076,OG0002078,OG0002079,OG0002080,OG0002082,OG0002086,OG0002120,OG0002167,OG0002186,OG0002190,OG0002191,OG0002193,OG0002194,OG0002195,OG0002197,OG0002282
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#86cfa5,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#5eb9a9,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#525fa9,#c1274a,#F0F0F0,#F0F0F0,#e55749,#F0F0F0,#F0F0F0,#fff0a6,#F0F0F0,#F0F0F0,#feeb9d,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_016901035.1,#F0F0F0,#5eb9a9,#3d79b6,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#fff0a6,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_016906245.1,#525fa9,#fdb567,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#fff0a6,#F0F0F0,#F0F0F0,#feeb9d,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_001905905.1,#525fa9,#fdb567,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#fff0a6,#F0F0F0,#F0F0F0,#feeb9d,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0


**Getting viz data for non-monophyletic variants within the same genus**

In [333]:
def get_viz_data_same_genus(genus_dictionary, variants_dictionary, SCOG):
    
    genus_representation_per_variant = defaultdict(list)
    suspected_genomes = []
    
    colours = {}
    
    #Getting dictionary with the genus of every genome carrying a given SCOG variant
    for variant, genomes in variants_dictionary.items():
        for genome in genomes:
            genus_representation_per_variant[variant].append(genus_dictionary[genome])
            
    #Getting list of variants that are confined to a single genus
    variants_of_interest = [variant for variant, genus in genus_representation_per_variant.items() if len(list(set(genus))) == 1]

    #Getting list of genomes that share the variants of interest
    for variant, genomes in variants_dictionary.items():
        if variant in variants_of_interest:
            suspected_genomes.extend(genomes)
            
    #Extracting colours
    current_colours = pd.read_csv(Path("../output/SCOGs_distribution_vizualisation_data.csv").expanduser()).set_index('Name').to_dict()[SCOG]
    
    for genome, colour in current_colours.items():
        if genome in suspected_genomes:
            colours[genome] = colour             
    return colours

In [334]:
df_viz_same_genus = pd.DataFrame({'Name': [_.name for _ in species_tree]})

In [335]:
for SCOG in df_variants_reduced:
    if SCOG != 'Name':
        genomes_per_variants = get_variant_counts(df_variants_reduced, 'Name', SCOG)
        x = get_viz_data_same_genus(genus_data, genomes_per_variants, SCOG)
        if len(x) != 0:
            df_viz_same_genus[SCOG]= df_viz_same_genus.Name.map(x).fillna('#F0F0F0')

In [336]:
df_viz_same_genus = df_viz_same_genus.set_index('Name')
# Step 1: Calculate the number of unique values in each column
unique_counts = df_viz_same_genus.nunique()

# Step 2: Create a dictionary with column names and unique value counts
column_dict = dict(unique_counts)

# Step 3: Sort the dictionary by values in ascending order
sorted_columns = sorted(column_dict, key=column_dict.get)

# Step 4: Extract the sorted column names
sorted_column_names = list(sorted_columns)

# Step 5: Reorder the DataFrame columns based on the sorted column names
df_viz_same_genus = df_viz_same_genus[sorted_column_names]

# Print the reordered DataFrame
df_viz_same_genus

Name,OG0002083,OG0002171,OG0002192,OG0002202,OG0002242,OG0002247,OG0002077,OG0002078,OG0002081,OG0002116,...,OG0002086,OG0002120,OG0002167,OG0002197,OG0002076,OG0002191,OG0002193,OG0002282,OG0002080,OG0002082
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#f0f9a7,#466eb1,#F0F0F0,#439bb5,#F0F0F0,#F0F0F0,#be254a,#F0F0F0,#fee491,#F0F0F0
GCF_016901035.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#f0f9a7,#F0F0F0,#F0F0F0,#F0F0F0,#feec9f,#F0F0F0,#be254a,#d63f4f,#fee491,#fcfeba
GCF_016906245.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_001905905.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#f0f9a7,#466eb1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0


In [338]:
df_viz_same_genus.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_same_genus.csv"))

**Get data for repeated SCOG variants that are monophyletic**

In [235]:
def check_for_monophyly_and_assign_colour_for_viz(species_tree, alignment):
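    """Return colours, flags and variant IDs for genomes sharing a monophyletic variant.

    Genomes that share an identical nucleotide sequence and form a
    monophyletic group on the species tree get one colour per variant.
    
    :param species_tree: species phylogenetic tree
    :param alignment: path to the single-SCO alignment (FASTA)
    """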
    
    clusters = defaultdict(list)
    
    alignment = AlignIO.read(alignment, "fasta")
    
    #Getting dictionary with genomes sharing the same nucleotide sequence keyed by the sequence hash
    for _ in alignment:
        clusters[hashlib.md5(str(_.seq).encode('utf-8')).hexdigest()].append(_.name)
         
    
    
    monophyletic_variants = {}
    colours = {}
    monophyletic_predictions = {}

    #Check for monophyly
    current = 0
    for cluster, genomes in clusters.items():
        if len(genomes) !=1: #Only consider sequences shared by at least two genomes
            ancestor = species_tree.get_common_ancestor(genomes) #Get most recent common ancestor
            if sorted(genomes) == sorted([_.name for _ in ancestor]): #Group is monophyletic on the species tree
                print(genomes)
        
                current +=1
                for _ in genomes:
                    monophyletic_predictions[_] = 'True'
                    monophyletic_variants[_] = str(current)
    
    
    palette = sns.color_palette("Spectral", len(set(monophyletic_variants.values())))
    palette=palette.as_hex()

    x = dict(zip(list(set(monophyletic_variants.values())), palette))
    
    for monophyletic_variant, colour in x.items():
        for genome, variant in monophyletic_variants.items():

            if monophyletic_variant == variant:
                colours[genome] = colour
    
    return colours, monophyletic_predictions, monophyletic_variants



    
    

In [236]:
df_viz_monophyly = pd.DataFrame({'Name': [_.name for _ in species_tree]})
for _ in filenames:
    SCO = str(_).split('/')[-1].split('.')[0]
    viz_data, monophyly_data, monophyly_variants = check_for_monophyly_and_assign_colour_for_viz(species_tree, _)
    df_viz_monophyly[SCO]= df_viz_monophyly.Name.map(viz_data).fillna('#F0F0F0')


['GCF_000342125.1', 'GCF_014253015.1']
['GCF_000700005.2', 'GCF_003947265.2']
['GCF_000718095.1', 'GCF_004151105.1']
['GCF_000718525.1', 'GCF_008634015.1']
['GCF_000719365.1', 'GCF_000720175.1']
['GCF_000720215.1', 'GCF_014207295.1', 'GCF_014701095.1', 'GCF_900215615.1']
['GCF_000805335.1', 'GCF_009834125.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_000931445.1', 'GCF_011008945.1']
['GCF_000966975.1', 'GCF_002028425.1']
['GCF_001278075.1', 'GCF_009832925.1']
['GCF_001509795.1', 'GCF_002156055.1', 'GCF_003363195.1', 'GCF_008973465.1']
['GCF_001723075.1', 'GCF_004563805.1']
['GCF_006539185.1', 'GCF_008642375.1']
['GCF_007828695.1', 'GCF_018638305.1']
['GCF_008642275.1', 'GCF_017639205.1']
['GCF_016612625.1', 'GCF_900105265.1']
['GCF_000717595.1', 'GCF_001418135.1']
['GCF_000721235.1', 'GCF_015244315.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_000931445.1', 'GCF_011008945.1']
['GCF_001513985.1', 'GCF_006516935.1']
['GCF_001723075.1', 'GCF_004563805.1']
['GCF_003994375.1', 'GCF_01

['GCF_000721235.1', 'GCF_015244315.1']
['GCF_000721235.1', 'GCF_015244315.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_001418645.1', 'GCF_002761895.1']
['GCF_000719365.1', 'GCF_000720175.1']
['GCF_001509795.1', 'GCF_002156055.1']
['GCF_001723075.1', 'GCF_004563805.1']
['GCF_014650115.1', 'GCF_014650735.1']
['GCF_014650115.1', 'GCF_014650735.1']
['GCF_000092385.1', 'GCF_002224125.2']
['GCF_000700005.2', 'GCF_003947265.2']
['GCF_000709915.1', 'GCF_018966745.1']
['GCF_000716625.1', 'GCF_014197485.1']
['GCF_000717595.1', 'GCF_001418135.1']
['GCF_000719185.1', 'GCF_008704655.1']
['GCF_000719265.1', 'GCF_015475835.1']
['GCF_000719365.1', 'GCF_000720175.1']
['GCF_000719505.1', 'GCF_001513985.1', 'GCF_006516935.1']
['GCF_000719895.1', 'GCF_004217355.1', 'GCF_004328625.1']
['GCF_000721235.1', 'GCF_015244315.1']
['GCF_000725555.1', 'GCF_004795115.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_000829715.2', 'GCF_009811575.1']
['GCF_000931445.1', 'GCF_011008945.1']
['GCF_001278075.1', 'GCF_0



['GCF_000342125.1', 'GCF_014253015.1']
['GCF_000721235.1', 'GCF_015244315.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_000931445.1', 'GCF_011008945.1']
['GCF_001509795.1', 'GCF_002156055.1']
['GCF_003955715.1', 'GCF_900206255.1']
['GCF_014650115.1', 'GCF_014650735.1']
['GCF_001723075.1', 'GCF_004563805.1']
['GCF_000010605.1', 'GCF_001434355.1', 'GCF_002910905.1', 'GCF_018619185.1']
['GCF_000147815.2', 'GCF_003595545.1']
['GCF_000342125.1', 'GCF_014253015.1']
['GCF_000383595.1', 'GCF_900215595.1']
['GCF_000700005.2', 'GCF_003947265.2']
['GCF_000715845.1', 'GCF_003865155.1']
['GCF_000716625.1', 'GCF_014197485.1']
['GCF_000717595.1', 'GCF_001418135.1']
['GCF_000717745.1', 'GCF_000720485.1']
['GCF_000718095.1', 'GCF_004151105.1']
['GCF_000718625.1', 'GCF_016860525.1']
['GCF_000719265.1', 'GCF_013394065.1', 'GCF_015475835.1']
['GCF_000720675.1', 'GCF_001905385.1']
['GCF_000805335.1', 'GCF_009834125.1', 'GCF_014648695.1']
['GCF_000816025.1', 'GCF_002891315.1']
['GCF_000931445.1', 'GCF_01

In [237]:
df_viz_monophyly = df_viz_monophyly.loc[:, (df_viz_monophyly != '#F0F0F0').any(axis=0)]
df_viz_monophyly = df_viz_monophyly.set_index('Name')

# Step 1: Calculate the number of unique values in each column
unique_counts = df_viz_monophyly.nunique()

# Step 2: Create a dictionary with column names and unique value counts
column_dict = dict(unique_counts)

# Step 3: Sort the dictionary by values in ascending order
sorted_columns = sorted(column_dict, key=column_dict.get)

# Step 4: Extract the sorted column names
sorted_column_names = list(sorted_columns)

# Step 5: Reorder the DataFrame columns based on the sorted column names
df_viz_monophyly = df_viz_monophyly[sorted_column_names]

# Print the reordered DataFrame
df_viz_monophyly





In [238]:
df_viz_monophyly.to_csv(Path("../output/SCOGs_distribution_vizualisation_data_monophyletic.csv"))

In [239]:
df_viz_monophyly

Name,OG0002089,OG0002104,OG0002110,OG0002130,OG0002134,OG0002211,OG0002214,OG0002231,OG0002242,OG0002250,...,OG0002190,OG0002080,OG0002167,OG0002076,OG0002195,OG0002079,OG0002191,OG0002197,OG0002232,OG0002282
GCF_000717725.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#e2514a,#F0F0F0,#F0F0F0,#F0F0F0
GCF_900105395.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#e2514a,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000380165.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000745345.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_000813365.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_000718455.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_016901035.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0
GCF_016906245.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#fec877,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#aedea3
GCF_001905905.1,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,...,#F0F0F0,#fec877,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#F0F0F0,#aedea3


In [240]:
monophyletic = [_ for _ in df_viz_monophyly if _ != "Name"]

In [241]:
all_non_monophyletic = [_ for _ in df_viz_reduced if _ != "Name"]

In [242]:
diffrent_genus = [_ for _ in df_viz if _ != "Name"]

In [243]:
same_genus = [_ for _ in df_viz_same_genus if _ != "Name"]

In [250]:
len(same_genus)

38

In [245]:
data = pd.read_csv("../../supplementary_file_7/output/SCOGs_location.csv")
names_labels = data.set_index('Orthogroup')['label'].to_dict()

In [246]:
from Bio import SeqIO
from collections import Counter 
counter = 0
for _ in filenames:
    scog_name = _.stem
    product = names_labels[scog_name].split(': ')[-1].split('[')[0]
    records = list(SeqIO.parse(_, "fasta"))
    unique = list(set([_.seq for _ in records]))
    if len(unique) != 295: # 295 unique sequences would mean no two genomes share a variant
        counter += 1
        unique_variants = len(unique)
        common_to_two = [_ for _ in Counter([_.seq for _ in records]).values() if _ !=1]
        if scog_name != "OG0002247":
            monophyletic = len(Counter([_ for _ in df_viz_monophyly[scog_name] if _ !="#F0F0F0"]).values())
        else:
            monophyletic = 0
        try:
            non_monophyletic = len(Counter([_ for _ in df_viz_reduced[scog_name] if _ !="#F0F0F0"]).values())
        except KeyError:
            non_monophyletic = 0
    
    print(scog_name, "&", product, "&", unique_variants,"&", len(common_to_two),"&", monophyletic,"&", non_monophyletic, "\\","\\")
    print(r"\hline")
print(counter)

OG0002076 & 30S ribosomal protein S18 & 166 & 50 & 17 & 33 \ \
\hline
OG0002077 & arD family transcriptional regulator & 285 & 9 & 7 & 2 \ \
\hline
OG0002078 & 50S ribosomal protein L36 & 29 & 18 & 3 & 15 \ \
\hline
OG0002079 & 50S ribosomal protein L22 & 256 & 30 & 18 & 12 \ \
\hline
OG0002080 & 50S ribosomal protein L29 & 214 & 44 & 16 & 28 \ \
\hline
OG0002081 & RNA-binding protein & 281 & 12 & 10 & 2 \ \
\hline
OG0002082 & 50S ribosomal protein L30 & 190 & 48 & 10 & 38 \ \
\hline
OG0002083 & bifunctional nuclease family protein & 292 & 3 & 2 & 1 \ \
\hline
OG0002086 & 50S ribosomal protein L32 & 107 & 40 & 7 & 33 \ \
\hline
OG0002089 & SDR family NAD(P)-dependent oxidoreductase & 294 & 1 & 1 & 0 \ \
\hline
OG0002090 & hypothetical protein & 294 & 1 & 1 & 0 \ \
\hline
OG0002097 & DUF3071 domain-containing protein & 294 & 1 & 1 & 0 \ \
\hline
OG0002099 & insulinase family protein & 294 & 1 & 1 & 0 \ \
\hline
OG0002100 & anscriptional repressor NrdR & 294 & 1 & 1 & 0 \ \
\hline
OG0002

OG0002291 & FadR family transcriptional regulator & 280 & 14 & 10 & 4 \ \
\hline
OG0002292 & A pyrophosphatase & 280 & 14 & 10 & 4 \ \
\hline
52
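
The rows printed above end in `\ \` because `print` puts a space between its two backslash arguments, so they need editing before LaTeX reads them as row breaks. A small helper could emit the terminator directly; `latex_row` is a hypothetical name, and the values are taken from the first row of the output above.

```python
def latex_row(*cells):
    """Join cells into one LaTeX table row terminated by a row break."""
    return " & ".join(str(cell) for cell in cells) + r" \\"


row = latex_row("OG0002076", "30S ribosomal protein S18", 166, 50, 17, 33)
print(row)
print(r"\hline")
```

Product names containing characters such as `&`, `_` or `%` would still need escaping before the rows are pasted into a LaTeX table.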


In [247]:
len(Counter([_ for _ in df_viz_monophyly["OG0002231"] if _ !="#F0F0F0"]).values())

1

In [368]:
for _ in df_viz:
    if _ != "Name":
        SCOG = _
        product = names_labels[_].split(': ')[-1].split('[')[0]
        monophyletic = len(Counter([_ for _ in df_viz[SCOG] if _ !="#F0F0F0"]).values())
        print(SCOG, "&", product,"&", monophyletic, "\\", "\\")
        print(r"\hline")
        

OG0002076 & 30S ribosomal protein S18 & 18 \ \
\hline
OG0002078 & 50S ribosomal protein L36 & 13 \ \
\hline
OG0002079 & 50S ribosomal protein L22 & 6 \ \
\hline
OG0002080 & 50S ribosomal protein L29 & 9 \ \
\hline
OG0002082 & 50S ribosomal protein L30 & 15 \ \
\hline
OG0002086 & 50S ribosomal protein L32 & 20 \ \
\hline
OG0002120 & 50S ribosomal protein L31 & 4 \ \
\hline
OG0002167 & DUF4177 domain-containing protein & 6 \ \
\hline
OG0002186 & 50S ribosomal protein L10 & 1 \ \
\hline
OG0002190 & 50S ribosomal protein L23 & 2 \ \
\hline
OG0002191 & 30S ribosomal protein S19 & 7 \ \
\hline
OG0002193 & 50S ribosomal protein L16 & 1 \ \
\hline
OG0002194 & 30S ribosomal protein S17 & 4 \ \
\hline
OG0002195 & 50S ribosomal protein L24 & 2 \ \
\hline
OG0002197 & 30S ribosomal protein S8 & 1 \ \
\hline
OG0002282 & hypothetical protein & 1 \ \
\hline


In [186]:
df_viz_monophyly['OG0002292']

KeyError: 'OG0002292'