Clustering Analysis: Use clustering algorithms such as K-means or DBSCAN to identify groups or clusters of plants based on their chemical compositions. This can help identify similarities and differences among different plant species. Find hidden structures is the data that are not immidiately availible. K-means - groups of simmilar instances - the closest centroid to each data sample.

In [2]:
import pandas as pd

# Load the data and set the first column as the index
data = pd.read_csv('./initial_data/chemicals_data_clean.csv', index_col=0)

# Extract row indices (species)
species_list = data.index.tolist()

# Count the total number of species
species_count = len(species_list)

# Create a new DataFrame from the species list
species_df = pd.DataFrame(species_list, columns=['Plant_Species'])

# Save the new DataFrame to a CSV file
species_df.to_csv('./initial_data/plant_species_list.csv', index=False)

# Print the total number of species
print(f'Total number of species: {species_count}')


Total number of species: 4087


Part 2: Clustering

This part will perform the clustering using Euclidean distances

This script will save a separate CSV file for each cluster. Each file will contain a similarity matrix for the plants within that cluster. The filename will include the cluster number for identification. The similarity scores are calculated by normalizing the distances (dividing by the maximum distance) and then subtracting from 1, so that a higher score corresponds to greater similarity. 



In [81]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_score
import numpy as np
import os
import logging
from tqdm import tqdm

# Set up logging to console
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load the data
logging.info('Loading data...')
df = pd.read_csv('./initial_data/chemicals_data_clean.csv', index_col=0)
logging.info('Data loaded successfully.')

# Set the range of number of clusters to explore
min_clusters = 50
max_clusters = min(len(df), 300)

# Initialize variables
best_score = -1
best_num_clusters = -1

# Search for the optimal number of clusters using Silhouette Score
logging.info('Searching for optimal number of clusters...')
for num_clusters in tqdm(range(min_clusters, max_clusters + 1), desc='Optimizing clusters', unit='cluster'):
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(df.values)
    cluster_labels = kmeans.labels_
    silhouette_avg = silhouette_score(df.values, cluster_labels)
    
    if silhouette_avg > best_score:
        best_score = silhouette_avg
        best_num_clusters = num_clusters

logging.info(f'Optimal number of clusters found: {best_num_clusters} with Silhouette Score: {best_score}')

# Perform K-Means Clustering with the best number of clusters
logging.info('Performing K-Means clustering with optimal number of clusters...')
kmeans = KMeans(n_clusters=best_num_clusters, random_state=0).fit(df.values)
cluster_labels = kmeans.labels_
logging.info('K-Means clustering completed.')

# Assign cluster labels to plants
df['cluster'] = cluster_labels

# For each cluster, extract the subset of the DataFrame, calculate similarity scores, and save to CSV files
for cluster in tqdm(set(df['cluster']), desc='Processing clusters', unit='cluster'):
    logging.info(f'Processing cluster {cluster}...')
    
    # Define the directory path
    dir_path = f'./initial_data/clusters_euclidean'
    
    # Check if the directory exists, if not, create it
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
        logging.info(f'Created directory: {dir_path}')
    
    # Proceed with saving the data
    cluster_df = df[df['cluster'] == cluster].drop('cluster', axis=1)
    cluster_df.to_csv(f'{dir_path}/cluster_{cluster}_chemical_compositions.csv')
    logging.info(f'Saved chemical compositions for cluster {cluster}.')
    
    # Calculate Euclidean distances
    distances = pairwise_distances(cluster_df.values, metric='euclidean')
    
    # Convert distances to similarities (1 - normalized distance)
    similarities = 1 - distances / distances.max()
    
    # Create a DataFrame of similarity scores
    similarities_df = pd.DataFrame(similarities, index=cluster_df.index, columns=cluster_df.index)
    
    # Save similarity scores to CSV
    similarities_df.to_csv(f'{dir_path}/cluster_{cluster}_plant_similarities.csv')
    logging.info(f'Saved plant similarities for cluster {cluster}.')

logging.info('Number of clusters: {}'.format(len(set(df['cluster']))))
























































































































































































































































































































































































































































































































Optimizing clusters: 100%|█| 251/251 [1:54:30<00:00, 27.37s/cluste



invalid value encountered in divide


invalid value encountered in divide


invalid value encountered in divide


invalid value encountered in divide


invalid value encountered in divide


invalid value encountered in divide


invalid value encountered in divide

Processing clusters: 100%|█| 160/160 [00:05<00:00, 29.07cluster/s]


Validation. We can then compare the initial count of unique plant species with the sum of unique plant species in all clusters. Ideally, the two counts should be the same.

In [106]:
import os
import pandas as pd

# Load the initial data
initial_data = pd.read_csv('./initial_data/chemicals_data_clean.csv', index_col=0)

# Count the number of unique plant species in the initial dataset
initial_species_count = initial_data.index.nunique()
print(f"Number of unique plant species in initial dataset: {initial_species_count}")

# Define the directory where cluster files are located
cluster_directory = './initial_data/clusters_euclidean'

# Initialize variable to store total count of unique species in all clusters
total_clustered_species = 0

# Iterate over each cluster file
for cluster_file in os.listdir(cluster_directory):
    if cluster_file.endswith('_chemical_compositions.csv'):
        # Load cluster data
        cluster_data = pd.read_csv(os.path.join(cluster_directory, cluster_file), index_col=0)
        
        # Count the number of unique plant species in the cluster
        cluster_species_count = cluster_data.index.nunique()
        
        # Add the count to the total
        total_clustered_species += cluster_species_count
        
        # Extract cluster number from file name
        cluster_number = cluster_file.split('_')[1]
        
        # Print the result
        print(f"Number of unique plant species in cluster {cluster_number}: {cluster_species_count}")

# Print total number of unique plant species clustered
print(f"Total number of unique plant species clustered: {total_clustered_species}")


Number of unique plant species in initial dataset: 4087
Number of unique plant species in cluster 71: 27
Number of unique plant species in cluster 105: 9
Number of unique plant species in cluster 28: 19
Number of unique plant species in cluster 79: 3
Number of unique plant species in cluster 123: 7
Number of unique plant species in cluster 97: 43
Number of unique plant species in cluster 20: 42
Number of unique plant species in cluster 57: 5
Number of unique plant species in cluster 154: 1
Number of unique plant species in cluster 118: 3
Number of unique plant species in cluster 136: 8
Number of unique plant species in cluster 82: 7
Number of unique plant species in cluster 35: 21
Number of unique plant species in cluster 42: 14
Number of unique plant species in cluster 6: 60
Number of unique plant species in cluster 141: 2
Number of unique plant species in cluster 64: 6
Number of unique plant species in cluster 13: 17
Number of unique plant species in cluster 110: 4
Number of unique p

To validate even further the clustering results, we will randomly select some plants from the initial dataset and identify the clusters they have been assigned to. Subsequently, a heatmap will be generated to visually assess the similarity among plants within each identified cluster. The script provided below will facilitate this process. It will iterate through all CSV files located in the 'clusters_euclidean' directory. For each file, the script will list out plants whose names contain a specified target substring, accompanied by their respective cluster numbers.

In [104]:
import os
import pandas as pd

# Define the directory path where the cluster files are located
dir_path = './initial_data/clusters_euclidean'

# Define the substring of the plant name you are searching for
target_plant_substring = 'Cannabis sativa'  # replace with the substring of the actual plant name

# Iterate over all files in the directory
for filename in os.listdir(dir_path):
    # Check if the file is a CSV file
    if filename.endswith('_chemical_compositions.csv'):
        # Construct the full file path
        file_path = os.path.join(dir_path, filename)
        
        # Load the CSV file into a DataFrame
        df = pd.read_csv(file_path, index_col=0)
        
        # Check if the target plant substring exists in any plant name in the DataFrame's index
        matching_plants = [plant for plant in df.index if target_plant_substring in plant]
        
        if matching_plants:
            # Extract the cluster number from the filename
            cluster_number = filename.split('_')[1]
            print(f'Plants containing the substring "{target_plant_substring}" are in cluster {cluster_number}:')
            print(matching_plants)
            print()  # Print a newline for readability


Plants containing the substring "Cannabis sativa" are in cluster 31:
['Marihuana (India) - Cannabis sativa L., fam. Cannabinaceae']

Plants containing the substring "Cannabis sativa" are in cluster 54:
['Marihuana (Austria) - Cannabis sativa L., fam. Cannabinaceae']



Part 3 high similarity

This script will generate a PDF file for each high similarity pairs CSV file that contains a heatmap. The heatmap shows the similarity scores between pairs of plants.To cluster plants that are most similar to each other together in the heatmap, we can use hierarchical clustering which creates a tree of clusters, which we can then cut at a certain level to create a heat map where similar items are placed next to each other.

To create meaningful titles for our heatmaps, we will extract the family names from the plant names, identify the unique families within each cluster, and then use this information to create a title that reflects the commonality among the plants in the cluster.

In [83]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
import textwrap
!pip install plotly


# Function to extract family names
def extract_family(plant_name):
    match = re.search(r'fam\. ?(\w+)', plant_name)
    if match:
        return match.group(1)
    return None

# Set paths
path = './initial_data/clusters_euclidean/'
output_dir = './initial_data/clusters_euclidean/heatmaps/'

# Create output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Load cluster files
cluster_files = [file for file in os.listdir(path) if 'plant_similarities' in file]
print(f"Found {len(cluster_files)} cluster files.")

for cluster_file in cluster_files:
    print(f"Processing {cluster_file}...")
    
    # Load similarity data
    similarity_df = pd.read_csv(path + cluster_file, index_col=0)

    # Extract family names from plant names and identify unique families
    families = set(extract_family(name) for name in similarity_df.index if extract_family(name) is not None)
    common_families = ', '.join(families)

    # Wrap the family names in the title without breaking words
    max_chars_per_line = 50
    wrapper = textwrap.TextWrapper(width=max_chars_per_line, break_long_words=False, break_on_hyphens=False)
    wrapped_families = wrapper.fill(text=common_families)

    # Create a title for the heatmap with line breaks for readability
    cluster_number = re.search(r'\d+', cluster_file).group()  # Extract cluster number from file name
    title = f"Cluster {cluster_number} Heatmap:\nCommon Families - {wrapped_families}"



    # Calculate the font size based on cluster size
    num_plants = len(similarity_df)
    max_cluster_size = 100
    font_ratio = num_plants / max_cluster_size
    font_size = max(8, int(8 * font_ratio))
    annot_font_size = max(6, int(6 * font_ratio))  # Adjust font size for annotations
    
    # Create a mask for the mirror part of the heatmap
    mask = np.triu(np.ones_like(similarity_df, dtype=bool))
    
    # Create a figure and axes
    fig, ax = plt.subplots(figsize=(40, 30))
    
    # Visualize the similarity scores using a masked heatmap
    heatmap = sns.heatmap(similarity_df, cmap="YlGnBu", mask=mask, ax=ax,
                          cbar_kws={'label': 'Similarity Score', 'format': '%.1f'},
                          annot=True, fmt=".1f", annot_kws={'fontsize': annot_font_size})  # Adjust font size
    
    # Set the title for the heatmap
    ax.set_title(title, fontsize=font_size)
    
    # Add a legend explaining the heatmap colors
    cbar = heatmap.collections[0].colorbar
    cbar.set_label('Similarity Score Legend')
    
    # Save the heatmap as a vector-based PDF file
    output_file = cluster_file.replace('_plant_similarities.csv', '.pdf')
    output_path = os.path.join(output_dir, output_file)
    fig.savefig(output_path, format='pdf', bbox_inches='tight')
    plt.close()
    
print("All done.")



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Found 160 cluster files.
Processing cluster_105_plant_similarities.csv...
Processing cluster_34_plant_similarities.csv...
Processing cluster_41_plant_similarities.csv...
Processing cluster_80_plant_similarities.csv...
Processing cluster_141_plant_similarities.csv...
Processing cluster_70_plant_similarities.csv...
Processing cluster_5_plant_similarities.csv...
Processing cluster_134_plant_similarities.csv...
Processing cluster_102_plant_similarities.csv...
Processing cluster_33_plant_similarities.csv...
Processing cluster_46_plant_similarities.csv...
Processing cluster_87_plant_similarities.csv...
Processing cluster_39_plant_similarities.csv...
Processing cluster_108_plant_similarities.csv...
Processing cluster_139_plant_similarities.csv...



All-NaN slice encountered


All-NaN slice encountered



Processing cluster_48_plant_similarities.csv...
Processing cluster_89_plant_similarities.csv...
Processing cluster_42_plant_similarities.csv...
Processing cluster_83_plant_similarities.csv...
Processing cluster_37_plant_similarities.csv...
Processing cluster_106_plant_similarities.csv...
Processing cluster_137_plant_similarities.csv...
Processing cluster_73_plant_similarities.csv...
Processing cluster_142_plant_similarities.csv...
Processing cluster_6_plant_similarities.csv...
Processing cluster_148_plant_similarities.csv...
Processing cluster_79_plant_similarities.csv...
Processing cluster_149_plant_similarities.csv...



All-NaN slice encountered


All-NaN slice encountered



Processing cluster_78_plant_similarities.csv...
Processing cluster_136_plant_similarities.csv...
Processing cluster_7_plant_similarities.csv...
Processing cluster_72_plant_similarities.csv...
Processing cluster_143_plant_similarities.csv...
Processing cluster_82_plant_similarities.csv...
Processing cluster_43_plant_similarities.csv...
Processing cluster_36_plant_similarities.csv...
Processing cluster_107_plant_similarities.csv...
Processing cluster_88_plant_similarities.csv...
Processing cluster_49_plant_similarities.csv...
Processing cluster_131_plant_similarities.csv...
Processing cluster_0_plant_similarities.csv...
Processing cluster_75_plant_similarities.csv...
Processing cluster_144_plant_similarities.csv...
Processing cluster_85_plant_similarities.csv...
Processing cluster_44_plant_similarities.csv...
Processing cluster_31_plant_similarities.csv...
Processing cluster_100_plant_similarities.csv...
Processing cluster_3_plant_similarities.csv...
Processing cluster_147_plant_similari


All-NaN slice encountered


All-NaN slice encountered



Processing cluster_116_plant_similarities.csv...
Processing cluster_52_plant_similarities.csv...
Processing cluster_93_plant_similarities.csv...
Processing cluster_58_plant_similarities.csv...
Processing cluster_99_plant_similarities.csv...
Processing cluster_64_plant_similarities.csv...
Processing cluster_155_plant_similarities.csv...
Processing cluster_11_plant_similarities.csv...
Processing cluster_120_plant_similarities.csv...
Processing cluster_20_plant_similarities.csv...
Processing cluster_111_plant_similarities.csv...
Processing cluster_55_plant_similarities.csv...
Processing cluster_94_plant_similarities.csv...
Processing cluster_123_plant_similarities.csv...
Processing cluster_12_plant_similarities.csv...
Processing cluster_156_plant_similarities.csv...
Processing cluster_67_plant_similarities.csv...
Processing cluster_18_plant_similarities.csv...
Processing cluster_129_plant_similarities.csv...
Processing cluster_29_plant_similarities.csv...
Processing cluster_118_plant_simi


All-NaN slice encountered


All-NaN slice encountered



Processing cluster_124_plant_similarities.csv...
Processing cluster_15_plant_similarities.csv...
Processing cluster_151_plant_similarities.csv...
Processing cluster_60_plant_similarities.csv...
Processing cluster_51_plant_similarities.csv...
Processing cluster_90_plant_similarities.csv...
Processing cluster_115_plant_similarities.csv...
Processing cluster_24_plant_similarities.csv...
Processing cluster_91_plant_similarities.csv...
Processing cluster_50_plant_similarities.csv...
Processing cluster_114_plant_similarities.csv...
Processing cluster_25_plant_similarities.csv...
Processing cluster_125_plant_similarities.csv...
Processing cluster_14_plant_similarities.csv...
Processing cluster_150_plant_similarities.csv...
Processing cluster_61_plant_similarities.csv...



All-NaN slice encountered


All-NaN slice encountered



Processing cluster_96_plant_similarities.csv...
Processing cluster_57_plant_similarities.csv...



All-NaN slice encountered


All-NaN slice encountered



Processing cluster_113_plant_similarities.csv...
Processing cluster_22_plant_similarities.csv...
Processing cluster_28_plant_similarities.csv...
Processing cluster_119_plant_similarities.csv...
Processing cluster_19_plant_similarities.csv...
Processing cluster_128_plant_similarities.csv...
Processing cluster_122_plant_similarities.csv...
Processing cluster_13_plant_similarities.csv...
Processing cluster_157_plant_similarities.csv...
Processing cluster_66_plant_similarities.csv...
Processing cluster_21_plant_similarities.csv...
Processing cluster_110_plant_similarities.csv...
Processing cluster_95_plant_similarities.csv...
Processing cluster_54_plant_similarities.csv...
Processing cluster_65_plant_similarities.csv...
Processing cluster_154_plant_similarities.csv...
Processing cluster_10_plant_similarities.csv...



All-NaN slice encountered


All-NaN slice encountered



Processing cluster_121_plant_similarities.csv...
Processing cluster_98_plant_similarities.csv...
Processing cluster_59_plant_similarities.csv...
Processing cluster_26_plant_similarities.csv...
Processing cluster_117_plant_similarities.csv...
Processing cluster_92_plant_similarities.csv...
Processing cluster_53_plant_similarities.csv...
Processing cluster_62_plant_similarities.csv...
Processing cluster_153_plant_similarities.csv...
Processing cluster_17_plant_similarities.csv...
Processing cluster_126_plant_similarities.csv...
Processing cluster_159_plant_similarities.csv...
Processing cluster_68_plant_similarities.csv...
All done.


Now we compare the chemical compositions of plants with high similarity scores 90% and identify the most present chemicals that contributes to the similarity between them. After creating the results_df DataFrame, we remove duplicates based on the chemical compositions by rounding the columns 'First Most Present Amount', 'Second ...' and so on.. to two decimal places using .round() method. Then, we drop duplicates based on these rounded columns.

here's the code modified to display the top 5 present chemicals in order to later create a barplot

In [91]:
import pandas as pd
from tqdm import tqdm
import os

# Define the path to the cluster data
data_path = './initial_data/clusters_euclidean/'

# Get a list of cluster files
cluster_files = [file for file in os.listdir(data_path) if file.startswith('cluster_')]

# Iterate over the cluster files
for cluster_file in cluster_files:
    # Extract the cluster number from the file name
    cluster_num = cluster_file.split('_')[1]
    
    # Load the chemical compositions data
    composition_file = os.path.join(data_path, f'cluster_{cluster_num}_chemical_compositions.csv')
    chemical_compositions = pd.read_csv(composition_file, index_col=0)

    # Load the plant similarities data
    similarities_file = os.path.join(data_path, f'cluster_{cluster_num}_plant_similarities.csv')
    plant_similarities = pd.read_csv(similarities_file, index_col=0)

    # Create an empty DataFrame to store the results
    results_df = pd.DataFrame(columns=['Plant 1', 'Plant 2', 'Similarity Score', 'First Most Present Chemical',
                                       'First Most Present Amount', 'Second Most Present Chemical',
                                       'Second Most Present Amount', 'Third Most Present Chemical',
                                       'Third Most Present Amount', 'Fourth Most Present Chemical',
                                       'Fourth Most Present Amount', 'Fifth Most Present Chemical',
                                       'Fifth Most Present Amount'])

    # Iterate over the rows of the plant similarities data
    for index, row in tqdm(plant_similarities.iterrows(), total=len(plant_similarities), desc=f'Processing Cluster {cluster_num}'):
        plant_1 = index  # Plant 1 name
        for column in row.index:
            if index != column:  # Exclude diagonal comparisons
                similarity_score = row[column]  # Similarity score
                if similarity_score >= 0.9:  # Filter similarity score
                    plant_2 = column  # Plant 2 name

                    # Get the chemical composition of the two plants
                    composition_1 = chemical_compositions.loc[plant_1]
                    composition_2 = chemical_compositions.loc[plant_2]

                    # Find the chemicals present in both plants
                    common_chemicals = set(composition_1.index) & set(composition_2.index)

                    # Find the most present chemical and its amount in each plant
                    most_present_chemicals = [None]*5
                    max_amounts = [0]*5
                    for chemical in common_chemicals:
                        amount_1 = composition_1[chemical]
                        amount_2 = composition_2[chemical]
                        for i in range(5):
                            if amount_1 > max_amounts[i]:
                                for j in range(4, i, -1):
                                    max_amounts[j] = max_amounts[j-1]
                                    most_present_chemicals[j] = most_present_chemicals[j-1]
                                max_amounts[i] = amount_1
                                most_present_chemicals[i] = chemical
                                break

                    # Create a temporary DataFrame to store the current result
                    temp_df = pd.DataFrame({
                        'Plant 1': [plant_1],
                        'Plant 2': [plant_2],
                        'Similarity Score': [similarity_score],
                        'First Most Present Chemical': [most_present_chemicals[0]],
                        'First Most Present Amount': [max_amounts[0]],
                        'Second Most Present Chemical': [most_present_chemicals[1]],
                        'Second Most Present Amount': [max_amounts[1]],
                        'Third Most Present Chemical': [most_present_chemicals[2]],
                        'Third Most Present Amount': [max_amounts[2]],
                        'Fourth Most Present Chemical': [most_present_chemicals[3]],
                        'Fourth Most Present Amount': [max_amounts[3]],
                        'Fifth Most Present Chemical': [most_present_chemicals[4]],
                        'Fifth Most Present Amount': [max_amounts[4]]
                    })

                    # Concatenate the temporary DataFrame with the results DataFrame
                    results_df = pd.concat([results_df, temp_df], ignore_index=True)

    # Remove duplicates based on the chemical compositions
    results_df_deduplicated = results_df.round({'First Most Present Amount': 2,
                                                'Second Most Present Amount': 2,
                                                'Third Most Present Amount': 2,
                                                'Fourth Most Present Amount': 2,
                                                'Fifth Most Present Amount': 2}).drop_duplicates(subset=['First Most Present Amount',
                                                                                                      'Second Most Present Chemical',
                                                                                                      'Second Most Present Amount',
                                                                                                      'Third Most Present Chemical',
                                                                                                      'Third Most Present Amount',
                                                                                                      'Fourth Most Present Chemical',
                                                                                                      'Fourth Most Present Amount',
                                                                                                      'Fifth Most Present Chemical',
                                                                                                      'Fifth Most Present Amount'])

    # Save the results in a folder with the cluster name
    results_path = os.path.join(data_path, f'cluster_{cluster_num}')
    
    # Ensure the directory exists
    if not os.path.exists(results_path):
        os.makedirs(results_path)

    # Save the results dataframe to a csv file with designated name
    filename = f'high_similarity_plants_top5chemicals.csv'
    results_df_deduplicated.to_csv(os.path.join(results_path, filename), index=False)


Processing Cluster 71: 100%|███| 27/27 [00:00<00:00, 20867.18it/s]
Processing Cluster 105: 100%|████| 9/9 [00:00<00:00, 30815.29it/s]
Processing Cluster 152: 100%|██| 15/15 [00:00<00:00, 25731.93it/s]
Processing Cluster 34: 100%|███| 37/37 [00:00<00:00, 17796.93it/s]
Processing Cluster 41: 100%|█████| 16/16 [00:00<00:00, 377.06it/s]
Processing Cluster 155: 100%|████| 2/2 [00:00<00:00, 20213.51it/s]
Processing Cluster 105: 100%|████| 9/9 [00:00<00:00, 31909.33it/s]
Processing Cluster 80: 100%|█████| 3/3 [00:00<00:00, 23431.87it/s]
Processing Cluster 28: 100%|███| 19/19 [00:00<00:00, 27170.74it/s]
Processing Cluster 130: 100%|███| 44/44 [00:00<00:00, 1001.32it/s]
Processing Cluster 137: 100%|████| 59/59 [00:00<00:00, 895.28it/s]
Processing Cluster 108: 100%|██| 35/35 [00:00<00:00, 19114.67it/s]
Processing Cluster 79: 100%|█████| 3/3 [00:00<00:00, 25890.77it/s]
Processing Cluster 123: 100%|██████| 7/7 [00:00<00:00, 327.65it/s]
Processing Cluster 141: 100%|████| 2/2 [00:00<00:00, 20610.83i

Processing Cluster 46: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1708.94it/s]
Processing Cluster 84: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 145.67it/s]
Processing Cluster 28: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:00<00:00, 167.72it/s]
Processing Cluster 48: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1135.80it/s]
Processing Cluster 23: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Next we represent the most similar plants and their simillarity scores in one dataframe and their chemical composion in another dataframe so that we can later visualize it as a heatmap

In [90]:
import pandas as pd
from tqdm import tqdm
import os

# Define the path to the cluster data
data_path = './initial_data/clusters_euclidean/'

# Get a list of cluster files
cluster_files = [file for file in os.listdir(data_path) if file.startswith('cluster_')]

# Iterate over the cluster files
for cluster_file in cluster_files:
    # Extract the cluster number from the file name
    cluster_num = cluster_file.split('_')[1]
    
    # Load the plant similarities data
    similarities_file = os.path.join(data_path, f'cluster_{cluster_num}_plant_similarities.csv')
    plant_similarities = pd.read_csv(similarities_file, index_col=0)

    # Create an empty DataFrame to store the results
    results_df = pd.DataFrame(columns=['Plant 1', 'Plant 2', 'Similarity Score'])

    # Iterate over the rows of the plant similarities data
    for plant_1 in tqdm(plant_similarities.index, total=len(plant_similarities), desc=f'Processing Cluster {cluster_num}'):
        for plant_2 in plant_similarities.columns:
            if plant_1 != plant_2:  # Exclude diagonal comparisons
                similarity_score = plant_similarities.loc[plant_1, plant_2]  # Similarity score
                if similarity_score >= 0.9:  # Filter similarity score
                    
                    # Create a temporary DataFrame to store the current result
                    temp_df = pd.DataFrame({
                        'Plant 1': [plant_1],
                        'Plant 2': [plant_2],
                        'Similarity Score': [similarity_score]
                    })

                    # Append the temporary DataFrame to the results DataFrame
                    results_df = pd.concat([results_df, temp_df], ignore_index=True)

    # Save the results to a CSV file
    results_path = os.path.join(data_path, f'cluster_{cluster_num}')
    
    # Ensure the directory exists
    if not os.path.exists(results_path):
        os.makedirs(results_path)

    # Save the results dataframe to a csv file with designated name
    filename = f'high_similarity_plants.csv'
    results_df.to_csv(os.path.join(results_path, filename), index=False)


Processing Cluster 71: 100%|████| 27/27 [00:00<00:00, 5867.98it/s]
Processing Cluster 105: 100%|████| 9/9 [00:00<00:00, 15301.47it/s]
Processing Cluster 152: 100%|██| 15/15 [00:00<00:00, 10218.38it/s]
Processing Cluster 34: 100%|████| 37/37 [00:00<00:00, 4601.20it/s]
Processing Cluster 41: 100%|████| 16/16 [00:00<00:00, 8465.86it/s]
Processing Cluster 155: 100%|████| 2/2 [00:00<00:00, 14074.85it/s]
Processing Cluster 105: 100%|████| 9/9 [00:00<00:00, 21124.08it/s]
Processing Cluster 80: 100%|█████| 3/3 [00:00<00:00, 14631.29it/s]
Processing Cluster 28: 100%|███| 19/19 [00:00<00:00, 12506.56it/s]
Processing Cluster 130: 100%|███| 44/44 [00:00<00:00, 5099.74it/s]
Processing Cluster 137: 100%|███| 59/59 [00:00<00:00, 3988.39it/s]
Processing Cluster 108: 100%|███| 35/35 [00:00<00:00, 6951.45it/s]
Processing Cluster 79: 100%|█████| 3/3 [00:00<00:00, 23258.62it/s]
Processing Cluster 123: 100%|█████| 7/7 [00:00<00:00, 9535.61it/s]
Processing Cluster 141: 100%|████| 2/2 [00:00<00:00, 20164.92i

Here we create a data frame with the full chemical compositions of the most similar plants

In [110]:
import pandas as pd
import glob
import os

# Path to the chemicals data
chemicals_data_path = './initial_data/chemicals_data_clean.csv'

# Read the full chemical composition data
chemicals_df = pd.read_csv(chemicals_data_path, index_col=0)  # Assume the plant name is the index

# Get the list of high similarity plant files
high_similarity_files = glob.glob('./initial_data/clusters_euclidean/cluster_*/high_similarity_plants.csv')

# Iterate over each high similarity file
for hs_file in high_similarity_files:
    # Read the high similarity file
    hs_df = pd.read_csv(hs_file)
    
    # Get the plant names from high similarity file
    plant_names = hs_df.iloc[:, 0].unique()  # Assuming plant names are in the first column
    
    # Get the chemical compositions for the plants in high similarity file
    compositions = chemicals_df.loc[plant_names]
    
    # Save the compositions DataFrame as a CSV file in the corresponding cluster folder
    compositions.to_csv(os.path.join(os.path.dirname(hs_file), 'high_similarity_plants_compositions.csv'))

# Pattern to match your CSV files
pattern = './initial_data/clusters_euclidean/cluster_*/high_similarity_plants_compositions.csv'

# Get a list of all matching file paths
files = glob.glob(pattern)

for file in files:
    # Load the data
    df = pd.read_csv(file, index_col=0)  # Assuming the first column is the index

    # Remove columns that only contain zeros
    df = df.loc[:, (df != 0).any(axis=0)]

    # Save the dataframe back to CSV
    df.to_csv(file, index=True)


This code will create a bar plot with the x-axis representing main chemicals and the y-axis representing the amount. 

In [111]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os

files = glob.glob('./initial_data/clusters_euclidean/cluster_*/high_similarity_plants_top5chemicals.csv')

barWidth = 0.15

# Use a more colorful palette
colors = sns.color_palette('Set3')

# Set the dark grid style
sns.set_style('darkgrid')

for file in files:
    df = pd.read_csv(file)
    unique_plants = []
    plants = []
    
    for i in range(len(df)):
        plant = df.iloc[i, 0]
        if plant in unique_plants:
            continue
        unique_plants.append(plant)
        
        chemicals = [df.iloc[i, j] for j in range(3, 13, 2)]
        amounts = [df.iloc[i, j]*100 for j in range(4, 14, 2)]
        
        plants.append(pd.DataFrame({"Plant": plant, "Chemical": chemicals, "Amount": amounts}))
    
    if not plants:
        continue

    fig, axs = plt.subplots(len(plants), 1, figsize=(12, 8*len(plants)))
    
    if isinstance(axs, plt.Axes):
        axs = [axs]
        
    for i, ax in enumerate(axs):
        print(plants[i]["Chemical"])
        plants[i]["Chemical"] = plants[i]["Chemical"].astype(str)

        bars = ax.bar(plants[i]["Chemical"], plants[i]["Amount"], color=colors[i % len(colors)], 
                width=barWidth, edgecolor='grey', 
                label='{}'.format(plants[i]['Plant'][0]))
        ax.set_xlabel('Chemical', fontweight='bold')
        ax.set_ylabel('Percentage (%)', fontweight='bold')
        ax.set_ylim([0, 100])

        # Add gridlines
        ax.grid(True)

        # Rotate the x-axis labels if they are overlapping
        plt.xticks(rotation=45, ha='right')

        for bar in bars:
            yval = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2, yval + 1, round(yval,2), ha='center', va='bottom')

        ax.text(0.75, 0.5, 'Sum of main chemicals: {:.2f}%'.format(sum(plants[i]["Amount"])), 
                transform=ax.transAxes, fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        ax.legend()
    
    dir_name = os.path.dirname(file)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)

    fig.savefig(os.path.splitext(file)[0] + '.pdf', format='pdf', bbox_inches='tight')
    plt.close(fig)


0               Geranial
1                  Neral
2        Geranyl acetate
3               Geraniol
4    Caryophyllene oxide
Name: Chemical, dtype: object
0                   Geranial
1                      Neral
2                   Geraniol
3            Geranyl acetate
4    6-Methyl-5-hepten-2-one
Name: Chemical, dtype: object
0               Geranial
1                  Neral
2    Caryophyllene oxide
3    Citronellyl acetate
4     beta-Caryophyllene
Name: Chemical, dtype: object
0               Geranial
1                  Neral
2    Caryophyllene oxide
3    Citronellyl acetate
4               Limonene
Name: Chemical, dtype: object
0           Linalool
1            Camphor
2        para-Cymene
3    gamma-Terpinene
4           Geraniol
Name: Chemical, dtype: object
0           Linalool
1            Camphor
2        para-Cymene
3    gamma-Terpinene
4       alpha-Pinene
Name: Chemical, dtype: object
0           Linalool
1       alpha-Pinene
2    gamma-Terpinene
3            Camphor
4    G

0              Sabinene
1              Limonene
2         alpha-Thujene
3    alpha-Phellandrene
4          alpha-Pinene
Name: Chemical, dtype: object
0    alpha-Phellandrene
1       alpha-Santalene
2           para-Cymene
3        beta-Santalene
4        alpha-Humulene
Name: Chemical, dtype: object
0              Linalool
1         alpha-Ocimene
2               Nonanal
3    Phenylacetaldehyde
4              Limonene
Name: Chemical, dtype: object
0           Benzyl alcohol
1            (Z)-3-Hexenol
2              Citronellol
3                 Geraniol
4    (Z)-3-Hexenyl acetate
Name: Chemical, dtype: object
0    Phenylacetaldehyde
1               Nonanal
2         (E)-2-Hexenal
3               Hexanal
4              Heptanal
Name: Chemical, dtype: object
0        para-Cymene
1          Carvacrol
2             Thymol
3      Terpinen-4-ol
4    gamma-Terpinene
Name: Chemical, dtype: object
0        Linoleic acid
1    Hexadecanoic acid
2              Carvone
3         beta-Guaiene
4       

0             Carvacrol
1       gamma-Terpinene
2           para-Cymene
3    beta-Caryophyllene
4        alpha-Humulene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3            Myrcene
4       alpha-Pinene
Name: Chemical, dtype: object
0             Carvacrol
1           para-Cymene
2                Thymol
3       gamma-Terpinene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0             Carvacrol
1           para-Cymene
2       gamma-Terpinene
3    beta-Caryophyllene
4               Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1    gamma-Terpinene
2        para-Cymene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3             Thymol
4            Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0         

0           Camphor
1      beta-Thujone
2     alpha-Thujone
3       1,8-Cineole
4    Chrysanthenone
Name: Chemical, dtype: object
0    alpha-Thujone
1          Camphor
2      1,8-Cineole
3     beta-Thujone
4         Camphene
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2      1,8-Cineole
3          Camphor
4          Borneol
Name: Chemical, dtype: object
0      alpha-Thujone
1       beta-Thujone
2        1,8-Cineole
3            Camphor
4    alpha-Terpineol
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3    Bornyl acetate
4           Camphor
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3      alpha-Pinene
4    Bornyl acetate
Name: Chemical, dtype: object
0    alpha-Thujone
1         Fenchone
2     beta-Thujone
3     alpha-Pinene
4         Sabinene
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3      alpha-Pinene
4    Bornyl acetate


0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2               alpha-Terpineol
3               gamma-Terpinene
4                   1,8-Cineole
Name: Chemical, dtype: object
0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2               alpha-Terpineol
3                   1,8-Cineole
4               gamma-Terpinene
Name: Chemical, dtype: object
0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2                   para-Cymene
3               alpha-Terpineol
4               gamma-Terpinene
Name: Chemical, dtype: object
0      Artemisia ketone
1    beta-Caryophyllene
2           1,8-Cineole
3           beta-Pinene
4    (E)-beta-Farnesene
Name: Chemical, dtype: object
0     Artemisia ketone
1    Artemisia alcohol
2              Myrcene
3        alpha-Guaiene
4              Camphor
Name: Chemical, dtype: object
0     Artemisia ketone
1    Artemisia alcohol
2              Camphor
3      Arteannuic acid
4        alpha-Guaiene
Name: Chemical, dtyp

0     alpha-Thujone
1           Camphor
2      beta-Thujone
3    Chrysanthenone
4       1,8-Cineole
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2         Sabinene
3    Terpinen-4-ol
4    Viridiflorene
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2      beta-Pinene
3    Terpinen-4-ol
4    Cuminaldehyde
Name: Chemical, dtype: object
0      alpha-Thujone
1       beta-Thujone
2        para-Cymene
3            Camphor
4    alpha-Terpineol
Name: Chemical, dtype: object
0        alpha-Thujone
1         beta-Thujone
2          Spathulenol
3    trans-Pinocarveol
4        Cuminaldehyde
Name: Chemical, dtype: object
0            alpha-Thujone
1             beta-Thujone
2              1,8-Cineole
3                 Sabinene
4    trans-Sabinyl acetate
Name: Chemical, dtype: object
0         alpha-Thujone
1           1,8-Cineole
2          beta-Thujone
3    beta-Caryophyllene
4             Verbenone
Name: Chemical, dtype: object
0     alpha-Thujone
1    

0             Limonene
1              Myrcene
2     (E)-beta-Ocimene
3         alpha-Pinene
4    beta-Phellandrene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3        Sabinene
4    alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Sabinene
3    alpha-Pinene
4        Linalool
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2     beta-Pinene
3    alpha-Pinene
4         Decanal
Name: Chemical, dtype: object
0              Limonene
1               Myrcene
2    alpha-Phellandrene
3          alpha-Pinene
4              Sabinene
Name: Chemical, dtype: object
0            Limonene
1             Myrcene
2        alpha-Pinene
3            Sabinene
4    (E)-beta-Ocimene
Name: Chemical, dtype: object
0      Limonene
1       Myrcene
2      Sabinene
3    Nootkatone
4      Linalool
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2    alpha-Pinene
3     para-Cymene
4      Nootkatone
Name:

0      2-Phenylethanol
1          Citronellol
2    Alkanes & alkenes
3             Geraniol
4                Nerol
Name: Chemical, dtype: object
0              2-Phenylethanol
1               beta-Farnesene
2                  beta-Pinene
3    Farnesol (unknown isomer)
4                Neryl acetate
Name: Chemical, dtype: object
0      2-Phenylethanol
1          Citronellol
2             Geraniol
3    Alkanes & alkenes
4                Nerol
Name: Chemical, dtype: object
0    2-Phenylethanol
1        Citronellol
2           Linalool
3           Geraniol
4              Nerol
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3         Pentylbenzene
4           beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        para-Cymene
3        beta-Pinene
4    beta-Bisabolene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        beta-Pinene
3      alpha-Thujene
4            Myrcene


0      Terpinen-4-ol
1        1,8-Cineole
2           Sabinene
3    gamma-Terpinene
4    Fenchyl acetate
Name: Chemical, dtype: object
0      Terpinen-4-ol
1        1,8-Cineole
2    gamma-Terpinene
3           Sabinene
4        para-Cymene
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2           Linalyl acetate
3                  Sabinene
4                  Linalool
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2                  Linalool
3           alpha-Terpineol
4                  Limonene
Name: Chemical, dtype: object
0               1,8-Cineole
1    alpha-Terpinyl acetate
2                  Limonene
3           alpha-Terpineol
4               beta-Pinene
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2           alpha-Terpineol
3                  Sabinene
4                  Linalool
Name: Chemical, dtype: object
0               1,8-Cineole
1    alpha-Terpinyl ac

0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1           Linalool
2    Linalyl acetate
3           Geraniol
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0    Linalyl acetate
1           Limonene
2           L

0         Isomenthone
1            Pulegone
2            Limonene
3            Menthone
4    Perilla aldehyde
Name: Chemical, dtype: object
0    Isomenthone
1       Pulegone
2       Limonene
3       Menthone
4    1-Octenol-3
Name: Chemical, dtype: object
0       Diosphenol
1      Isomenthone
2    Isodiosphenol
3         Limonene
4         Menthone
Name: Chemical, dtype: object
0      Isomenthone
1         Limonene
2       Diosphenol
3         Menthone
4    Isodiosphenol
Name: Chemical, dtype: object
0      Menthone
1      Pulegone
2    Piperitone
3    Neomenthol
4       Menthol
Name: Chemical, dtype: object
0        Pulegone
1     Isomenthone
2        Menthone
3    Piperitenone
4        Limonene
Name: Chemical, dtype: object
0        Pulegone
1        Menthone
2      Piperitone
3    Piperitenone
4     Isomenthone
Name: Chemical, dtype: object
0        Pulegone
1        Menthone
2    Piperitenone
3      Piperitone
4        Limonene
Name: Chemical, dtype: object
0        Pulegone
1    Pi

0           Limonene
1           Geranial
2              Neral
3    Geranyl acetate
4           Geraniol
Name: Chemical, dtype: object
0           Limonene
1           Geranial
2              Neral
3      Neryl acetate
4    Geranyl acetate
Name: Chemical, dtype: object
0     beta-Thujone
1         Sabinene
2      1,8-Cineole
3    alpha-Thujone
4     Germacrene D
Name: Chemical, dtype: object
0                 beta-Thujone
1                     Sabinene
2    Cadinene (unknown isomer)
3                  1,8-Cineole
4                Terpinen-4-ol
Name: Chemical, dtype: object
0    beta-Thujone
1         Camphor
2    Germacrene D
3        Sabinene
4        Camphene
Name: Chemical, dtype: object
0      beta-Thujone
1           Camphor
2          Sabinene
3          Camphene
4    Bornyl acetate
Name: Chemical, dtype: object
0           beta-Thujone
1    cis-Ocimene epoxide
2       (Z)-beta-Ocimene
3            Chamazulene
4           Germacrene D
Name: Chemical, dtype: object
0           bet

0    Bornyl acetate
1      alpha-Pinene
2          Camphene
3    delta-3-Carene
4      p-Cymen-8-ol
Name: Chemical, dtype: object
0       Bornyl acetate
1         (-)-Camphene
2    beta-Phellandrene
3     (-)-alpha-Pinene
4      (-)-beta-Pinene
Name: Chemical, dtype: object
0           Bornyl acetate
1             (-)-Camphene
2         (-)-alpha-Pinene
3    (-)-beta-Phellandrene
4          (-)-beta-Pinene
Name: Chemical, dtype: object
0    Bornyl acetate
1          Camphene
2      alpha-Pinene
3          Limonene
4        Tricyclene
Name: Chemical, dtype: object
0    Bornyl acetate
1      alpha-Pinene
2    delta-3-Carene
3          Limonene
4       beta-Pinene
Name: Chemical, dtype: object
0          Camphene
1    Bornyl acetate
2      alpha-Pinene
3    delta-3-Carene
4          Limonene
Name: Chemical, dtype: object
0           Camphor
1    Bornyl acetate
2          Camphene
3          Limonene
4           Myrcene
Name: Chemical, dtype: object
0    Bornyl acetate
1           Camphor


0           Linalool
1        1,8-Cineole
2         Piperitone
3    alpha-Terpineol
4      Terpinen-4-ol
Name: Chemical, dtype: object
0            Linalool
1     Linalyl acetate
2    (E)-beta-Ocimene
3             Myrcene
4       Neryl acetate
Name: Chemical, dtype: object
0                               Linalool
1      cis-Linalool oxide (5) (furanoid)
2    trans-Linalool oxide (5) (furanoid)
3                           Eremophilene
4                          Cyclosativene
Name: Chemical, dtype: object
0                                 Linalool
1                          alpha-Terpineol
2      cis-Linalool oxide (unknown isomer)
3    trans-Linalool oxide (unknown isomer)
4                              1,8-Cineole
Name: Chemical, dtype: object
0                               Linalool
1                        alpha-Terpineol
2                               Geraniol
3                               Limonene
4    cis-Linalool oxide (unknown isomer)
Name: Chemical, dtype: object
0         

0        Limonene
1         Myrcene
2        Linalool
3        Sabinene
4    alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3    alpha-Pinene
4        Sabinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3    alpha-Pinene
4         Octanal
Name: Chemical, dtype: object
0                     Limonene
1                      Myrcene
2                     Linalool
3                     Sabinene
4    Sinensal (unknown isomer)
Name: Chemical, dtype: object
0           Limonene
1            Myrcene
2    gamma-Terpinene
3           Sabinene
4       alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Sabinene
3        Linalool
4    alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1            Myrcene
2           Sabinene
3       alpha-Pinene
4    beta-Bisabolene
Name: Chemical, dtype: object
0              Limonene
1               Myrcene
2          alpha-

0                                 Limonene
1                            beta-Selinene
2                       3-n-Butylphthalide
3    cis-3-Butylidene-4,5-dihydrophthalide
4                           alpha-Selinene
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3    Sedanoic anhydride
4         Pentylbenzene
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3           Ligustilide
4        alpha-Selinene
Name: Chemical, dtype: object
0            Limonene
1     gamma-Terpinene
2        alpha-Pinene
3    (E)-beta-Ocimene
4         beta-Pinene
Name: Chemical, dtype: object
0            Limonene
1     gamma-Terpinene
2    (E)-beta-Ocimene
3             Myrcene
4    (Z)-beta-Ocimene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3        para-Cymene
4             Thymol
Name: Chemical, dtype: object
0             Limonene
1      gamma-Terp

0           Limonene
1    gamma-Terpinene
2        beta-Pinene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2       alpha-Pinene
3        para-Cymene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3        beta-Pinene
4       alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3           Linalool
4       alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        para-Cymene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3       alpha-Pinene
4        beta-Pinene
Name: Chemical, dtype: object
0        Limonene
1      Isopulegol
2         Myrcene
3     Citronellal
4    beta-Ocimene
Name: Chemical, dtype: object
0           Limonene
1    alpha-Terpinene
2            Myrcene
3      

0         Sabinene
1    Terpinen-4-ol
2      beta-Pinene
3      para-Cymene
4         Limonene
Name: Chemical, dtype: object
0       Sabinene
1         Cedrol
2       Limonene
3       Camphene
4    beta-Pinene
Name: Chemical, dtype: object
0             alpha-Pinene
1            Terpinen-4-ol
2                 Sabinene
3             beta-Thujene
4    beta-Terpinyl acetate
Name: Chemical, dtype: object
0    Terpinen-4-ol
1     alpha-Pinene
2         Sabinene
3      para-Cymene
4          Myrcene
Name: Chemical, dtype: object
0                  Sabinene
1              alpha-Pinene
2                  Camphene
3    alpha-Terpinyl acetate
4               beta-Pinene
Name: Chemical, dtype: object
0            Sabinene
1    (Z,Z)-Germacrone
2       alpha-Thujene
3        Germacrene B
4    (E,E)-Germacrone
Name: Chemical, dtype: object
0        Sabinene
1    beta-Ocimene
2     beta-Pinene
3        Limonene
4         Myrcene
Name: Chemical, dtype: object
0               Sabinene
1            be

0                         Methyl chavicol
1    Sesquiterpene hydrocarbons (unknown)
2                        (Z)-beta-Ocimene
3                            (E)-Anethole
4                                Limonene
Name: Chemical, dtype: object
0    Methyl chavicol
1            Menthol
2     alpha-Humulene
3            Carvone
4           Menthone
Name: Chemical, dtype: object
0            Methyl chavicol
1                   Limonene
2    2-Methoxycinnamaldehyde
3                    Menthol
4                    Carvone
Name: Chemical, dtype: object
0    Methyl chavicol
1        beta-Pinene
2       alpha-Pinene
3       Anisaldehyde
4       (E)-Anethole
Name: Chemical, dtype: object
0       Methyl chavicol
1           1,8-Cineole
2              Linalool
3               Camphor
4    beta-Caryophyllene
Name: Chemical, dtype: object
0     Methyl chavicol
1            Linalool
2         1,8-Cineole
3    (Z)-beta-Ocimene
4             Camphor
Name: Chemical, dtype: object
0          Methyl chavico

0        delta-3-Carene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0        delta-3-Carene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0      Benzyl benzoate
1        Terpinen-4-ol
2    Benzyl salicylate
3        beta-Selinene
4       alpha-Humulene
Name: Chemical, dtype: object
0      Benzyl benzoate
1    Benzyl salicylate
2       cis-Calamenene
3        beta-Selinene
4       alpha-Humulene
Name: Chemical, dtype: object
0       Benzyl benzoate
1              Linalool
2              Limonene
3    beta-Caryophyllene
4           beta-Pinene
Name: Chemical, dtype: object
0     Benzyl benzoate
1    Methyl cinnamate
2              Indole
3     Methyl benzoate
4     Prenyl benzoate
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4          Carvacrol
Name: Chemical,

0       Methyl eugenol
1    Bicyclogermacrene
2          Spathulenol
3        alpha-Cadinol
4             Globulol
Name: Chemical, dtype: object
0           Methyl eugenol
1                 Elemicin
2                 Linalool
3                  Safrole
4    Caryophyllene alcohol
Name: Chemical, dtype: object
0        Methyl eugenol
1           1,8-Cineole
2     Methyl isoeugenol
3    beta-Caryophyllene
4          alpha-Pinene
Name: Chemical, dtype: object
0        Methyl eugenol
1       Methyl chavicol
2              Elemicin
3         alpha-Elemene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0           Methyl eugenol
1            Terpinen-4-ol
2         (E)-beta-Ocimene
3               Calamenene
4    (E)-Methyl isoeugenol
Name: Chemical, dtype: object
0        Methyl eugenol
1               Myrcene
2               Eugenol
3           1,8-Cineole
4    beta-Caryophyllene
Name: Chemical, dtype: object
0        Methyl eugenol
1               Eugenol
2               Myrcene
3  

0        1,8-Cineole
1            Camphor
2            Borneol
3    alpha-Terpineol
4     Bornyl acetate
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2           beta-Pinene
3               Myrcene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0     1,8-Cineole
1         Camphor
2     beta-Pinene
3        Camphene
4    alpha-Pinene
Name: Chemical, dtype: object
0      1,8-Cineole
1          Camphor
2    alpha-Thujene
3         Camphene
4          Borneol
Name: Chemical, dtype: object
0     1,8-Cineole
1     beta-Pinene
2    alpha-Pinene
3        Camphene
4         Camphor
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2    beta-Caryophyllene
3               Borneol
4           beta-Pinene
Name: Chemical, dtype: object
0           1,8-Cineole
1              Linalool
2    beta-Caryophyllene
3          alpha-Pinene
4           beta-Pinene
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2 

0    Linalyl acetate
1           Linalool
2    alpha-Terpineol
3        Citronellol
4           Geranial
Name: Chemical, dtype: object
0     Linalyl acetate
1            Linalool
2         beta-Pinene
3    (E)-beta-Ocimene
4     alpha-Terpineol
Name: Chemical, dtype: object
0    Linalyl acetate
1           Linalool
2           Limonene
3    alpha-Terpineol
4    Geranyl acetate
Name: Chemical, dtype: object
0       Linalyl acetate
1              Linalool
2       alpha-Terpineol
3               Camphor
4    beta-Caryophyllene
Name: Chemical, dtype: object
0       Linalyl acetate
1              Linalool
2       alpha-Terpineol
3    beta-Caryophyllene
4          Germacrene D
Name: Chemical, dtype: object
0        Linalyl acetate
1               Linalool
2        alpha-Terpineol
3    Caryophyllene oxide
4     beta-Caryophyllene
Name: Chemical, dtype: object
0           Linalool
1    Linalyl acetate
2            Camphor
3        1,8-Cineole
4      Terpinen-4-ol
Name: Chemical, dtype: object


0        beta-Caryophyllene
1                  Linalool
2       Caryophyllene oxide
3            alpha-Humulene
4    alpha-Terpinyl acetate
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2          alpha-Pinene
3        delta-3-Carene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2              Sabinene
3           beta-Pinene
4          alpha-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2              Sabinene
3               Myrcene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Sabinene
2        delta-3-Carene
3              Limonene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1        delta-3-Carene
2              Limonene
3       

0        Menthone
1         Menthol
2        Limonene
3     Isomenthone
4    alpha-Pinene
Name: Chemical, dtype: object
0                               Menthol
1                              Menthone
2    Monoterpene hydrocarbons (unknown)
3                           Isomenthone
4                            Neomenthol
Name: Chemical, dtype: object
0        Menthol
1       Menthone
2    Isomenthone
3       Limonene
4    beta-Pinene
Name: Chemical, dtype: object
0                               Menthol
1                              Menthone
2    Monoterpene hydrocarbons (unknown)
3                           Isomenthone
4                            Neomenthol
Name: Chemical, dtype: object
0               Menthol
1              Menthone
2       Menthyl acetate
3            Piperitone
4    beta-Caryophyllene
Name: Chemical, dtype: object
0            Menthol
1           Menthone
2        Isomenthone
3         Neomenthol
4    Menthyl acetate
Name: Chemical, dtype: object
0            Menthol

0       Nepetalactone 1
1          Germacrene D
2    beta-Caryophyllene
3       Nepetalactone 3
4       Nepetalactone 2
Name: Chemical, dtype: object
0            Nepetalactone 1
1         beta-Caryophyllene
2    dihydro-Nepetalactone 7
3             beta-Farnesene
4               Germacrene D
Name: Chemical, dtype: object
0       Nepetalactone 1
1          Germacrene D
2        beta-Farnesene
3       Nepetalactone 2
4    beta-Caryophyllene
Name: Chemical, dtype: object
0              Linalool
1       Linalyl acetate
2    (E)-beta-Farnesene
3       alpha-Terpineol
4      (E)-beta-Ocimene
Name: Chemical, dtype: object
0                             Linalool
1                      alpha-Terpineol
2    cis-Linalool oxide (5) (furanoid)
3                2-Phenylethyl acetate
4                        Terpinen-4-ol
Name: Chemical, dtype: object
0                                 Linalool
1                          alpha-Terpineol
2    trans-Linalool oxide (unknown isomer)
3      cis-Linalool o

0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4            Myrcene
Name: Chemical, dtype: object
0             Thymol
1    gamma-Terpinene
2        para-Cymene
3            Myrcene
4           Limonene
Name: Chemical, dtype: object
0             Thymol
1    gamma-Terpinene
2        para-Cymene
3            Myrcene
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3          Carvacrol
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2          Carvacrol
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4           Limonene
Name: Chemical, dtype: object
0             Thymol
1          Carvacrol
2        para-Cymene
3    gamma-Terpinene
4            Myrcene
Name: Chemical, dtype: object
0                Thymol
1           para-Cymene
2      

0        Eugenol
1       Chavicol
2        Myrcene
3       Linalool
4    3-Octenol-1
Name: Chemical, dtype: object
0     Eugenol
1    Chavicol
2     Myrcene
3    Linalool
4    Limonene
Name: Chemical, dtype: object
0            Eugenol
1    Eugenyl acetate
2         Isoeugenol
3            Safrole
4           Chavicol
Name: Chemical, dtype: object
0            Eugenol
1           Limonene
2     delta-3-Carene
3    Eugenyl acetate
4        para-Cymene
Name: Chemical, dtype: object
0                  Eugenol
1            Terpinen-4-ol
2                 Linalool
3    (E)-Cinnamyl aldehyde
4          Benzyl benzoate
Name: Chemical, dtype: object
0                  Eugenol
1    (E)-Cinnamyl aldehyde
2                 Linalool
3       beta-Caryophyllene
4       alpha-Phellandrene
Name: Chemical, dtype: object
0               Eugenol
1       Benzyl benzoate
2    beta-Caryophyllene
3    alpha-Phellandrene
4       Eugenyl acetate
Name: Chemical, dtype: object
0               Eugenol
1       Ben

0            Carotol
1       alpha-Pinene
2    Geranyl acetate
3           Linalool
4        beta-Pinene
Name: Chemical, dtype: object
0            Carotol
1       alpha-Pinene
2    Geranyl acetate
3        beta-Pinene
4           Sabinene
Name: Chemical, dtype: object
0                Carotol
1               Sabinene
2        Geranyl acetate
3               Geraniol
4    Caryophyllene oxide
Name: Chemical, dtype: object
0         alpha-Humulene
1     beta-Caryophyllene
2                Myrcene
3     Methyl 4-decenoate
4    Geranyl isobutyrate
Name: Chemical, dtype: object
0        alpha-Humulene
1               Myrcene
2    beta-Caryophyllene
3        gamma-Cadinene
4        delta-Cadinene
Name: Chemical, dtype: object
0    Isopinocamphone
1       Pinocamphone
2       Germacrene D
3        beta-Pinene
4        Pinocarvone
Name: Chemical, dtype: object
0    Isopinocamphone
1        Pinocarvone
2       Pinocamphone
3       Germacrene D
4        beta-Pinene
Name: Chemical, dtype: object


The next code creates a heatmap for each CSV file, with each row in the heatmap corresponding to a plant and each column corresponding to a chemical. The color of each cell in the heatmap corresponds to the value in the CSV file for that plant and chemical. The heatmaps are then saved as PDF files.

In [112]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from tqdm import tqdm
from math import sqrt

base_dir = './initial_data/clusters_euclidean/'
cluster_dirs = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d)) and d.startswith('cluster_')]

sns.set_style("white")
cmap = sns.diverging_palette(220, 10, as_cmap=True)

for cluster_dir in tqdm(cluster_dirs, desc='Processing clusters'):
    cluster_number = cluster_dir.split('_')[-1]
    file_path = os.path.join(base_dir, cluster_dir, 'high_similarity_plants_compositions.csv')

    if os.path.isfile(file_path):
        df = pd.read_csv(file_path, index_col=0)

        if df.empty:
            continue

        top_5_chemicals = df.apply(lambda row: row.nlargest(5).index.tolist(), axis=1)
        top_5_chemicals = top_5_chemicals.apply(lambda x: pd.Index(x).drop_duplicates().tolist())
        df_top_5 = df[top_5_chemicals.explode()]
        df_top_5 *= 100

        df_transposed = df_top_5.T.drop_duplicates()

        # Sort by total abundance
        df_transposed['total_abundance'] = df_transposed.sum(axis=1)
        df_transposed.sort_values('total_abundance', ascending=False, inplace=True)
        df_transposed.drop('total_abundance', axis=1, inplace=True)

        if df_transposed.empty:
            continue

        # Dynamic sizing of the plot
        num_rows, num_cols = df_transposed.shape
        fig_width = min(max(num_cols // 2, 10), 50)  # Set a minimum width of 10 and maximum width of 50
        fig_height = min(max(num_rows // 2, 10), 50)  # Set a minimum height of 10 and maximum height of 50

        fig, ax = plt.subplots(figsize=(fig_width, fig_height))

        cell_width = fig_width / num_cols
        cell_height = fig_height / num_rows
        font_size = max(min(min(cell_width, cell_height) * 2.5, 20), 8)  # Adjust the multiplier to control the font size and set a range (8-20)

        sns.heatmap(df_transposed, cmap=cmap, annot=True, fmt=".1f", linewidths=.5,
                    cbar_kws={'label': 'Concentration (%)'},
                    vmin=0, vmax=100, ax=ax, annot_kws={"size": font_size})  # Use dynamic font size

        for t in ax.collections[0].axes.texts:  # iterate over the text elements of the heatmap
            if t.get_text() == '0.0':
                t.set_text('')  # remove the text for zero cells

        ax.set_title(f'Top 5 Abundant Chemicals in Plants with min. 90% Similarity within {cluster_dir}', fontsize=min(font_size+6, 24), pad=20)  # Use dynamic font size and set a maximum limit of 24
        ax.set_xlabel('Plants', fontsize=min(font_size+2, 22), labelpad=10)  # Use dynamic font size and set a maximum limit of 22
        ax.set_ylabel('Chemicals', fontsize=min(font_size+2, 22), labelpad=10)  # Use dynamic font size and set a maximum limit of 22
        ax.tick_params(labelsize=min(font_size, 20))  # Use dynamic font size and set a maximum limit of 20

        sns.despine()  # Remove the top and right spines from plot

        plt.xticks(rotation=90, fontsize=font_size)  # Use dynamic font size
        plt.yticks(rotation=0, fontsize=font_size)  # Use dynamic font size

        plt.tight_layout()

        save_path = os.path.join(base_dir, cluster_dir, f'heatmap_Top_5_Abundant_Chemicals_in_Plants_with_90_Similarity.pdf')
        plt.savefig(save_path, format="pdf", bbox_inches='tight')

        plt.close()

    else:
        print(f"The file {file_path} does not exist.")


Processing clusters: 100%|██████| 160/160 [00:23<00:00,  6.76it/s]
