Clustering Analysis: Use clustering algorithms such as K-means or DBSCAN to identify groups (clusters) of plants based on their chemical compositions. This helps reveal similarities and differences among plant species by finding hidden structures in the data that are not immediately apparent. K-means groups similar instances by assigning each data sample to its closest centroid.
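
As a rough illustration of the K-means idea described above, the sketch below runs one assignment step and one update step by hand with NumPy. The sample values and the two starting centroids are made up for the example and are not taken from the plant data.

```python
import numpy as np

# Toy data: four samples with two features each (hypothetical values)
samples = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Two hypothetical starting centroids
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])

# Euclidean distance from every sample to every centroid, shape (4, 2)
distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)

# Assignment step: each sample takes the label of its closest centroid
labels = distances.argmin(axis=1)

# Update step: each centroid moves to the mean of the samples assigned to it
new_centroids = np.array(
    [samples[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)
print(new_centroids)
```

K-means repeats these two steps until the labels stop changing. In practice a library implementation such as `sklearn.cluster.KMeans` would normally be used instead of hand-written loops.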

In [2]:
import pandas as pd

# Load the data and set the first column as the index
data = pd.read_csv('./initial_data/chemicals_data_clean.csv', index_col=0)

# Extract row indices (species)
species_list = data.index.tolist()

# Count the total number of species
species_count = len(species_list)

# Create a new DataFrame from the species list
species_df = pd.DataFrame(species_list, columns=['Plant_Species'])

# Save the new DataFrame to a CSV file
species_df.to_csv('./initial_data/plant_species_list.csv', index=False)

# Print the total number of species
print(f'Total number of species: {species_count}')


Total number of species: 4087


Part 2: Clustering

This part performs the clustering using Euclidean distances.

This script will save a separate CSV file for each cluster. Each file will contain a similarity matrix for the plants within that cluster. The filename will include the cluster number for identification. The similarity scores are calculated by normalizing the distances (dividing by the maximum distance) and then subtracting from 1, so that a higher score corresponds to greater similarity. 
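
The normalization described above can be checked on a small distance matrix. The sketch below is illustrative only: the helper name `distances_to_similarities` and the toy distances are assumptions. It also guards against a maximum distance of zero, which occurs when a cluster holds a single plant or only identical rows and would otherwise produce 0/0 (NaN).

```python
import numpy as np

def distances_to_similarities(distances):
    """Map a pairwise distance matrix to similarity scores in [0, 1]."""
    distances = np.asarray(distances, dtype=float)
    max_distance = distances.max()
    if max_distance == 0:
        # Assumption: if every distance is zero, treat all pairs as fully similar
        return np.ones_like(distances)
    # Divide by the largest distance, then subtract from 1
    return 1 - distances / max_distance

# Hypothetical distances between three plants
toy_distances = np.array([[0.0, 2.0, 4.0],
                          [2.0, 0.0, 3.0],
                          [4.0, 3.0, 0.0]])

similarities = distances_to_similarities(toy_distances)
print(similarities)
```

Because each cluster is normalized by its own maximum distance, scores are comparable only within a cluster, not across clusters.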

Part 3: This script will create an additional CSV file for each cluster, which contains only the plants that have similarity scores greater than 90% to at least one other plant within the same cluster. The filename will include the cluster number for identification.

In this script, for each cluster I've created a list of tuples, where each tuple consists of a high-similarity plant and the list of plants to which it is more than 80% similar. This list of tuples is then converted to a DataFrame and saved to a CSV file. The filename includes the cluster number for identification.
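
A minimal sketch of the list-of-tuples step described above, assuming a small similarity matrix with made-up plant names and scores. The column names `Plant` and `Similar_Plants` are hypothetical and are not taken from the original script.

```python
import pandas as pd

# Hypothetical similarity matrix for one cluster
plants = ['Plant A', 'Plant B', 'Plant C']
similarity_df = pd.DataFrame(
    [[1.00, 0.85, 0.40],
     [0.85, 1.00, 0.30],
     [0.40, 0.30, 1.00]],
    index=plants, columns=plants)

threshold = 0.8
high_similarity = []
for plant in similarity_df.index:
    # Drop the plant's similarity to itself before applying the threshold
    row = similarity_df.loc[plant].drop(plant)
    similar_to = row[row > threshold].index.tolist()
    if similar_to:
        high_similarity.append((plant, similar_to))

# One row per high-similarity plant, with its list of similar plants
high_similarity_df = pd.DataFrame(high_similarity, columns=['Plant', 'Similar_Plants'])
print(high_similarity_df)
```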

In [3]:
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, pairwise_distances
from tqdm import tqdm
import numpy as np
import os

# Load the data
df = pd.read_csv('./initial_data/chemicals_data_clean.csv', index_col=0)

# Set the range of number of clusters to explore
min_clusters = 30
max_clusters = min(len(df), 150)

# Initialize variables
best_score = -1
best_num_clusters = -1
silhouette_scores = []

# Create directory for saving outputs
output_directory = "./initial_data/clusters_euclidean"
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Perform Agglomerative Clustering for different numbers of clusters
for num_clusters in tqdm(range(min_clusters, max_clusters + 1), desc='Clustering'):
    clustering = AgglomerativeClustering(n_clusters=num_clusters, metric='euclidean', linkage='ward')
    cluster_labels = clustering.fit_predict(df.values)
    
    silhouette_avg = silhouette_score(df.values, cluster_labels)
    silhouette_scores.append(silhouette_avg)

    # Check if the current silhouette score is the best so far
    if silhouette_avg > best_score:
        best_score = silhouette_avg
        best_num_clusters = num_clusters

# Perform Agglomerative Clustering with the best number of clusters
clustering = AgglomerativeClustering(n_clusters=best_num_clusters, metric='euclidean', linkage='ward')
cluster_labels = clustering.fit_predict(df.values)

# Assign cluster labels to plants
df['cluster'] = cluster_labels

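# Cap each original cluster at 50 plants: move a random sample of the excess plants
# from every oversized cluster into its own new cluster.
# Note: the new cluster can itself exceed 50 plants when the excess is larger than 50.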
cluster_sizes = df.groupby('cluster').size()
for i in cluster_sizes[cluster_sizes > 50].index:
    excess = df[df['cluster'] == i].sample(cluster_sizes[i] - 50, replace=False)
    df.loc[excess.index, 'cluster'] = max(df['cluster']) + 1

# For each cluster, extract the subset of the DataFrame, calculate similarity scores, and save to CSV files
for cluster in set(df['cluster']):
    cluster_df = df[df['cluster'] == cluster].drop('cluster', axis=1)
    cluster_df.to_csv(f'{output_directory}/cluster_{cluster}_chemical_compositions.csv')
    
    distances = pairwise_distances(cluster_df.values)
    similarities = 1 - distances / distances.max()
    similarities_df = pd.DataFrame(similarities, index=cluster_df.index, columns=cluster_df.index)
    similarities_df.to_csv(f'{output_directory}/cluster_{cluster}_plant_similarities.csv')

print('Number of clusters:', len(set(df['cluster'])))


Clustering: 100%|███████████████| 121/121 [18:57<00:00,  9.40s/it]
  similarities = 1 - distances / distances.max()


Number of clusters: 175


Now we delete all columns that only contain zeroes in every CSV cluster file.

In [4]:
import pandas as pd
import glob

# Get list of all CSV files
csv_files = glob.glob("./initial_data/clusters_euclidean/cluster_*_chemical_compositions.csv")

for file in csv_files:
    # Load file into DataFrame
    df = pd.read_csv(file)
    
    # Get columns that only contain zero
    zero_cols = [col for col in df.columns if (df[col] == 0).all()]
    
    # Drop these columns
    df = df.drop(zero_cols, axis=1)
    
    # Write cleaned DataFrame back to CSV
    df.to_csv(file, index=False)

print("All files have been cleaned.")


All files have been cleaned.


Part 3: High similarity

This script will generate a PDF file containing a heatmap for each cluster's plant similarity CSV file. The heatmap shows the similarity scores between pairs of plants. To place the plants that are most similar to each other together in the heatmap, we can use hierarchical clustering, which creates a tree of clusters that we can then cut at a certain level, so that similar items end up next to each other.

To create meaningful titles for our heatmaps, we will extract the family names from the plant names, identify the unique families within each cluster, and then use this information to create a title that reflects the commonality among the plants in the cluster.
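
The hierarchical ordering mentioned above could be sketched as follows. This is one possible approach, not necessarily the one used for the saved heatmaps: it assumes SciPy is available, uses average linkage as an arbitrary choice, and works on a made-up 4x4 similarity matrix.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Hypothetical similarity matrix: P1/P3 and P2/P4 are the similar pairs
names = ['P1', 'P2', 'P3', 'P4']
similarity_df = pd.DataFrame(
    [[1.0, 0.2, 0.9, 0.1],
     [0.2, 1.0, 0.3, 0.8],
     [0.9, 0.3, 1.0, 0.2],
     [0.1, 0.8, 0.2, 1.0]],
    index=names, columns=names)

# Turn similarities back into distances (the diagonal becomes 0)
distance_matrix = 1 - similarity_df.values

# linkage() expects the condensed (upper-triangle) form of the distance matrix
condensed = squareform(distance_matrix, checks=False)
tree = linkage(condensed, method='average')

# Leaf order of the tree places similar plants next to each other
order = leaves_list(tree)
ordered_df = similarity_df.iloc[order, order]
print(ordered_df)
```

Passing `ordered_df` instead of the unordered matrix to `sns.heatmap` would keep the masked-triangle layout while grouping similar plants visually.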

In [16]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re

# Function to extract family names
def extract_family(plant_name):
    match = re.search(r'fam\. (\w+)', plant_name)
    if match:
        return match.group(1)
    
    match = re.search(r'\(([^)]+)\)', plant_name)
    if match and len(match.group(1)) >= 4:
        return match.group(1)
    
    words = plant_name.split()
    for word in reversed(words):
        if len(word) >= 4:
            return word
    return plant_name

# Set paths
path = './initial_data/clusters_euclidean/'
output_dir = './initial_data/clusters_euclidean/heatmaps/'

# Create output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Load cluster files
cluster_files = [file for file in os.listdir(path) if 'plant_similarities' in file]
print(f"Found {len(cluster_files)} cluster files.")

for cluster_file in cluster_files:
    print(f"Processing {cluster_file}...")
    
    # Load similarity data
    similarity_df = pd.read_csv(path + cluster_file, index_col=0)
    
    # Remove rows and columns that are all NaN
    similarity_df = similarity_df.dropna(axis=0, how='all')
    similarity_df = similarity_df.dropna(axis=1, how='all')

    # Check if DataFrame is empty after removing NaN values
    if similarity_df.empty:
        print(f"Skipping {cluster_file} because it's empty after removing NaN values.")
        continue

    # Extract family names from plant names and identify unique families
    families = set(extract_family(name) for name in similarity_df.index)
    common_families = ', '.join(sorted(families))  # sorted so the title is the same on every run

    # Create a title for the heatmap
    cluster_number = re.search(r'\d+', cluster_file).group()  # Extract cluster number from file name
    title = f"Cluster {cluster_number} Heatmap: Common Families - {common_families}"


    # Calculate the font size based on cluster size
    num_plants = len(similarity_df)
    max_cluster_size = 100
    font_ratio = num_plants / max_cluster_size
    font_size = max(8, int(8 * font_ratio))
    annot_font_size = max(6, int(6 * font_ratio))  # Adjust font size for annotations
    
    # Create a mask for the mirror part of the heatmap
    mask = np.triu(np.ones_like(similarity_df, dtype=bool))
    
    # Create a figure and axes
    fig, ax = plt.subplots(figsize=(20, 15))
    
    # Visualize the similarity scores using a masked heatmap
    heatmap = sns.heatmap(similarity_df, cmap="YlGnBu", mask=mask, ax=ax,
                          cbar_kws={'label': 'Similarity Score', 'format': '%.1f'},
                          annot=True, fmt=".1f", annot_kws={'fontsize': annot_font_size})  # Adjust font size
    
    # Set the title for the heatmap
    ax.set_title(title, fontsize=font_size)
    
    # Add a legend explaining the heatmap colors
    cbar = heatmap.collections[0].colorbar
    cbar.set_label('Similarity Score Legend')
    
    # Save the heatmap as a vector-based PDF file
    output_file = cluster_file.replace('_plant_similarities.csv', '.pdf')
    output_path = os.path.join(output_dir, output_file)
    fig.savefig(output_path, format='pdf', bbox_inches='tight')
    plt.close(fig)  # close this figure explicitly to free memory
    
print("All done.")


Found 175 cluster files.
Processing cluster_105_plant_similarities.csv...
Processing cluster_34_plant_similarities.csv...
Processing cluster_170_plant_similarities.csv...
Processing cluster_41_plant_similarities.csv...
Processing cluster_80_plant_similarities.csv...
Processing cluster_141_plant_similarities.csv...
Processing cluster_70_plant_similarities.csv...
Processing cluster_5_plant_similarities.csv...
Processing cluster_134_plant_similarities.csv...
Processing cluster_102_plant_similarities.csv...
Processing cluster_33_plant_similarities.csv...
Processing cluster_46_plant_similarities.csv...
Processing cluster_87_plant_similarities.csv...
Processing cluster_39_plant_similarities.csv...
Processing cluster_108_plant_similarities.csv...
Processing cluster_139_plant_similarities.csv...
Processing cluster_8_plant_similarities.csv...
Processing cluster_146_plant_similarities.csv...
Processing cluster_77_plant_similarities.csv...
Processing cluster_2_plant_similarities.csv...
Processing

Now we compare the chemical compositions of plants with high similarity scores (at least 80%) and identify the most abundant chemicals that contribute to the similarity between them. After creating the results_df DataFrame, we remove duplicates based on the chemical compositions: the columns 'First Most Present Amount', 'Second Most Present Amount', and so on are rounded to two decimal places using the .round() method, and duplicates are then dropped based on these rounded columns.
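
To make the rounding-based deduplication concrete, here is a small sketch on a made-up results table with a single ranked chemical. The plant names and amounts are hypothetical.

```python
import pandas as pd

# Hypothetical results: plants A and B differ only beyond the second decimal place
results_df = pd.DataFrame({
    'Plant 1': ['A', 'B', 'C'],
    'First Most Present Chemical': ['Limonene', 'Limonene', 'Myrcene'],
    'First Most Present Amount': [0.4512, 0.4549, 0.3000],
})

# Round the amount column to two decimals so near-identical rows compare equal
rounded = results_df.round({'First Most Present Amount': 2})

# Keep the first row of each (chemical, rounded amount) combination
deduplicated = rounded.drop_duplicates(
    subset=['First Most Present Chemical', 'First Most Present Amount'])
print(deduplicated)
```

Rows A and B collapse into one because both amounts round to 0.45. Note that two genuinely different plants with near-identical profiles are merged in the same way.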

Here is the code, modified to record the top 5 most abundant chemicals so that a bar plot can be created later.

In [12]:
import pandas as pd
from tqdm import tqdm
import os

# Define the path to the cluster data
data_path = '/Users/mariiakokina/Documents/eo_database/'

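# Get one similarity file per cluster, so each cluster is processed only once
# (matching every 'cluster_*' entry also picks up composition files and result folders)
cluster_files = [file for file in os.listdir(data_path)
                 if file.startswith('cluster_') and file.endswith('_plant_similarities.csv')]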

# Iterate over the cluster files
for cluster_file in cluster_files:
    # Extract the cluster number from the file name
    cluster_num = cluster_file.split('_')[1]
    
    # Load the chemical compositions data
    composition_file = os.path.join(data_path, f'cluster_{cluster_num}_chemical_compositions.csv')
    chemical_compositions = pd.read_csv(composition_file, index_col=0)

    # Load the plant similarities data
    similarities_file = os.path.join(data_path, f'cluster_{cluster_num}_plant_similarities.csv')
    plant_similarities = pd.read_csv(similarities_file, index_col=0)

    # Create an empty DataFrame to store the results
    results_df = pd.DataFrame(columns=['Plant 1', 'Plant 2', 'Similarity Score', 'First Most Present Chemical',
                                       'First Most Present Amount', 'Second Most Present Chemical',
                                       'Second Most Present Amount', 'Third Most Present Chemical',
                                       'Third Most Present Amount', 'Fourth Most Present Chemical',
                                       'Fourth Most Present Amount', 'Fifth Most Present Chemical',
                                       'Fifth Most Present Amount'])

    # Iterate over the rows of the plant similarities data
    for index, row in tqdm(plant_similarities.iterrows(), total=len(plant_similarities), desc=f'Processing Cluster {cluster_num}'):
        plant_1 = index  # Plant 1 name
        for column in row.index:
            if index != column:  # Exclude diagonal comparisons
                similarity_score = row[column]  # Similarity score
                if similarity_score >= 0.8:  # Filter similarity score
                    plant_2 = column  # Plant 2 name

                    # Get the chemical composition of the two plants
                    composition_1 = chemical_compositions.loc[plant_1]
                    composition_2 = chemical_compositions.loc[plant_2]

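                    # Collect the chemical names shared by both rows. Both rows come from the same
                    # DataFrame, so this is every chemical column, including those with zero amounts
                    common_chemicals = set(composition_1.index) & set(composition_2.index)

                    # Rank the five most abundant chemicals by their amount in plant 1
                    # (amount_2 is read below but not used in the ranking)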
                    most_present_chemicals = [None]*5
                    max_amounts = [0]*5
                    for chemical in common_chemicals:
                        amount_1 = composition_1[chemical]
                        amount_2 = composition_2[chemical]
                        for i in range(5):
                            if amount_1 > max_amounts[i]:
                                for j in range(4, i, -1):
                                    max_amounts[j] = max_amounts[j-1]
                                    most_present_chemicals[j] = most_present_chemicals[j-1]
                                max_amounts[i] = amount_1
                                most_present_chemicals[i] = chemical
                                break

                    # Create a temporary DataFrame to store the current result
                    temp_df = pd.DataFrame({
                        'Plant 1': [plant_1],
                        'Plant 2': [plant_2],
                        'Similarity Score': [similarity_score],
                        'First Most Present Chemical': [most_present_chemicals[0]],
                        'First Most Present Amount': [max_amounts[0]],
                        'Second Most Present Chemical': [most_present_chemicals[1]],
                        'Second Most Present Amount': [max_amounts[1]],
                        'Third Most Present Chemical': [most_present_chemicals[2]],
                        'Third Most Present Amount': [max_amounts[2]],
                        'Fourth Most Present Chemical': [most_present_chemicals[3]],
                        'Fourth Most Present Amount': [max_amounts[3]],
                        'Fifth Most Present Chemical': [most_present_chemicals[4]],
                        'Fifth Most Present Amount': [max_amounts[4]]
                    })

                    # Concatenate the temporary DataFrame with the results DataFrame
                    results_df = pd.concat([results_df, temp_df], ignore_index=True)

    # Remove duplicates based on the chemical compositions
    results_df_deduplicated = results_df.round({'First Most Present Amount': 2,
                                                'Second Most Present Amount': 2,
                                                'Third Most Present Amount': 2,
                                                'Fourth Most Present Amount': 2,
                                                'Fifth Most Present Amount': 2}).drop_duplicates(subset=['First Most Present Amount',
                                                                                                      'Second Most Present Chemical',
                                                                                                      'Second Most Present Amount',
                                                                                                      'Third Most Present Chemical',
                                                                                                      'Third Most Present Amount',
                                                                                                      'Fourth Most Present Chemical',
                                                                                                      'Fourth Most Present Amount',
                                                                                                      'Fifth Most Present Chemical',
                                                                                                      'Fifth Most Present Amount'])

    # Save the results in a folder with the cluster name
    results_path = os.path.join(data_path, f'cluster_{cluster_num}')
    
    # Ensure the directory exists
    if not os.path.exists(results_path):
        os.makedirs(results_path)

    # Save the results dataframe to a csv file with designated name
    filename = 'high_similarity_plants_top5chemicals.csv'
    results_df_deduplicated.to_csv(os.path.join(results_path, filename), index=False)


Processing Cluster 71: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 390.63it/s]
Processing Cluster 34: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 448.51it/s]
Processing Cluster 41: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 25266.89it/s]
Processing Cluster 80: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 187.33it/s]
Processing Cluster 28: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 7: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 92.75it/s]
Processing Cluster 20: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 306.49it/s]
Processing Cluster 19: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 43/43 [00:00<00:00, 630.07it/s]
Processing Cluster 72: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1255.22it/s]
Processing Cluster 82: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 58: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 822.35it/s]
Processing Cluster 31: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 1383.65it/s]
Processing Cluster 86: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 904.53it/s]
Processing Cluster 2: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 97.67it/s]
Processing Cluster 46: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 46: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1708.94it/s]
Processing Cluster 84: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 145.67it/s]
Processing Cluster 28: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:00<00:00, 167.72it/s]
Processing Cluster 48: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 1135.80it/s]
Processing Cluster 23: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Next, we store the most similar plants and their similarity scores in one DataFrame and their chemical compositions in another, so that we can later visualize them as a heatmap.
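
As a sketch of the pair table described above, the square similarity matrix can be reshaped into one row per pair and filtered in a vectorized way. This is an alternative illustration using `melt`, not the loop used in the notebook, and the plant names and scores are made up.

```python
import pandas as pd

# Hypothetical similarity matrix for one cluster
plants = ['Plant A', 'Plant B', 'Plant C']
plant_similarities = pd.DataFrame(
    [[1.00, 0.85, 0.40],
     [0.85, 1.00, 0.30],
     [0.40, 0.30, 1.00]],
    index=plants, columns=plants)

# Reshape the matrix into long format: one row per (Plant 1, Plant 2) pair
pairs = (plant_similarities
         .rename_axis('Plant 1')
         .reset_index()
         .melt(id_vars='Plant 1', var_name='Plant 2', value_name='Similarity Score'))

# Keep pairs of different plants with a similarity of at least 0.8
is_other_plant = pairs['Plant 1'] != pairs['Plant 2']
is_similar = pairs['Similarity Score'] >= 0.8
high_similarity_pairs = pairs[is_other_plant & is_similar].reset_index(drop=True)
print(high_similarity_pairs)
```

Each pair appears twice (A-B and B-A) because the matrix is symmetric, which matches the behaviour of iterating over every row and column.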

In [38]:
import pandas as pd
from tqdm import tqdm
import os

# Define the path to the cluster data
data_path = '/Users/mariiakokina/Documents/eo_database/'

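# Get one similarity file per cluster, so each cluster is processed only once
# (matching every 'cluster_*' entry also picks up composition files and result folders)
cluster_files = [file for file in os.listdir(data_path)
                 if file.startswith('cluster_') and file.endswith('_plant_similarities.csv')]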

# Iterate over the cluster files
for cluster_file in cluster_files:
    # Extract the cluster number from the file name
    cluster_num = cluster_file.split('_')[1]
    
    # Load the plant similarities data
    similarities_file = os.path.join(data_path, f'cluster_{cluster_num}_plant_similarities.csv')
    plant_similarities = pd.read_csv(similarities_file, index_col=0)

    # Create an empty DataFrame to store the results
    results_df = pd.DataFrame(columns=['Plant 1', 'Plant 2', 'Similarity Score'])

    # Iterate over the rows of the plant similarities data
    for plant_1 in tqdm(plant_similarities.index, total=len(plant_similarities), desc=f'Processing Cluster {cluster_num}'):
        for plant_2 in plant_similarities.columns:
            if plant_1 != plant_2:  # Exclude diagonal comparisons
                similarity_score = plant_similarities.loc[plant_1, plant_2]  # Similarity score
                if similarity_score >= 0.8:  # Filter similarity score
                    
                    # Create a temporary DataFrame to store the current result
                    temp_df = pd.DataFrame({
                        'Plant 1': [plant_1],
                        'Plant 2': [plant_2],
                        'Similarity Score': [similarity_score]
                    })

                    # Append the temporary DataFrame to the results DataFrame
                    results_df = pd.concat([results_df, temp_df], ignore_index=True)

    # Save the results to a CSV file
    results_path = os.path.join(data_path, f'cluster_{cluster_num}')
    
    # Ensure the directory exists
    if not os.path.exists(results_path):
        os.makedirs(results_path)

    # Save the results dataframe to a csv file with designated name
    filename = 'high_similarity_plants.csv'
    results_df.to_csv(os.path.join(results_path, filename), index=False)


Processing Cluster 71: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 618.32it/s]
Processing Cluster 34: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 2192.99it/s]
Processing Cluster 41: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 25145.71it/s]
Processing Cluster 80: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 694.74it/s]
Processing Cluster 28: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 7: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 525.16it/s]
Processing Cluster 19: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 43/43 [00:00<00:00, 2812.29it/s]
Processing Cluster 72: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6006.16it/s]
Processing Cluster 82: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:00<00:00, 891.43it/s]
Processing Cluster 72: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 31: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 6068.93it/s]
Processing Cluster 86: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 2385.73it/s]
Processing Cluster 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 451.53it/s]
Processing Cluster 46: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6459.40it/s]
Processing Cluster 64: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Processing Cluster 22: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 596.17it/s]
Processing Cluster 79: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164/164 [00:00<00:00, 896.99it/s]
Processing Cluster 33: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 3994.17it/s]
Processing Cluster 46: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6271.08it/s]
Processing Cluster 84: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Here we create a data frame with the full chemical compositions of the most similar plants
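
The selection step can be illustrated on toy data. In the sketch below the chemical amounts and plant names are invented: it looks up the full composition rows for the plants listed in a high-similarity table and then drops chemicals that are zero for all of them.

```python
import pandas as pd

# Hypothetical full composition table (plants as index, chemicals as columns)
chemicals_df = pd.DataFrame(
    {'Limonene': [0.5, 0.0, 0.4],
     'Myrcene': [0.0, 0.7, 0.0],
     'Camphene': [0.1, 0.0, 0.2]},
    index=['Plant A', 'Plant B', 'Plant C'])

# Hypothetical high-similarity pairs for one cluster
hs_df = pd.DataFrame({'Plant 1': ['Plant A', 'Plant C'],
                      'Plant 2': ['Plant C', 'Plant A'],
                      'Similarity Score': [0.9, 0.9]})

# Unique plant names from the first column, then their full compositions
plant_names = hs_df['Plant 1'].unique()
compositions = chemicals_df.loc[plant_names]

# Keep only chemicals that are non-zero for at least one selected plant
compositions = compositions.loc[:, (compositions != 0).any(axis=0)]
print(compositions)
```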

In [40]:
import pandas as pd
import glob
import os

# Path to the chemicals data
chemicals_data_path = '/Users/mariiakokina/Documents/eo_database/chemicals_data_clean.csv'

# Read the full chemical composition data
chemicals_df = pd.read_csv(chemicals_data_path, index_col=0)  # Assume the plant name is the index

# Get the list of high similarity plant files
high_similarity_files = glob.glob('/Users/mariiakokina/Documents/eo_database/cluster_*/high_similarity_plants.csv')

# Iterate over each high similarity file
for hs_file in high_similarity_files:
    # Read the high similarity file
    hs_df = pd.read_csv(hs_file)
    
    # Get the plant names from high similarity file
    plant_names = hs_df.iloc[:, 0].unique()  # Assuming plant names are in the first column
    
    # Get the chemical compositions for the plants in high similarity file
    compositions = chemicals_df.loc[plant_names]
    
    # Save the compositions DataFrame as a CSV file in the corresponding cluster folder
    compositions.to_csv(os.path.join(os.path.dirname(hs_file), 'high_similarity_plants_compositions.csv'))

# Pattern to match your CSV files
pattern = '/Users/mariiakokina/Documents/eo_database/cluster_*/high_similarity_plants_compositions.csv'

# Get a list of all matching file paths
files = glob.glob(pattern)

for file in files:
    # Load the data
    df = pd.read_csv(file, index_col=0)  # Assuming the first column is the index

    # Remove columns that only contain zeros
    df = df.loc[:, (df != 0).any(axis=0)]

    # Save the dataframe back to CSV
    df.to_csv(file, index=True)


This code will create a bar plot with the x-axis representing main chemicals and the y-axis representing the amount. 
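
Before plotting, each wide row of the top-5 table has to be turned into a chemical/amount pair per bar. The sketch below shows that reshaping by column name rather than by position. It uses a made-up row with only two ranks for brevity, whereas the real file has five.

```python
import pandas as pd

# Hypothetical row from a top-chemicals file (only two ranks shown)
row = pd.Series({
    'Plant 1': 'Plant A',
    'Plant 2': 'Plant B',
    'Similarity Score': 0.9,
    'First Most Present Chemical': 'Limonene',
    'First Most Present Amount': 0.45,
    'Second Most Present Chemical': 'Myrcene',
    'Second Most Present Amount': 0.20,
})

ranks = ['First', 'Second']  # the real file continues with Third, Fourth, Fifth

# One record per bar: chemical name on the x-axis, percentage on the y-axis
bar_data = pd.DataFrame({
    'Chemical': [row[f'{rank} Most Present Chemical'] for rank in ranks],
    'Amount': [row[f'{rank} Most Present Amount'] * 100 for rank in ranks],
})
print(bar_data)
```

A call such as `ax.bar(bar_data['Chemical'], bar_data['Amount'])` would then draw one bar per chemical.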

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os

files = glob.glob('/Users/mariiakokina/Documents/eo_database/cluster_*/high_similarity_plants_top5chemicals.csv')

barWidth = 0.15

# Use a more colorful palette
colors = sns.color_palette('Set3')

# Set the dark grid style
sns.set_style('darkgrid')

for file in files:
    df = pd.read_csv(file)
    unique_plants = []
    plants = []
    
    for i in range(len(df)):
        plant = df.iloc[i, 0]
        if plant in unique_plants:
            continue
        unique_plants.append(plant)
        
        chemicals = [df.iloc[i, j] for j in range(3, 13, 2)]
        amounts = [df.iloc[i, j]*100 for j in range(4, 14, 2)]
        
        plants.append(pd.DataFrame({"Plant": plant, "Chemical": chemicals, "Amount": amounts}))
    
    if not plants:
        continue

    fig, axs = plt.subplots(len(plants), 1, figsize=(12, 8*len(plants)))
    
    if isinstance(axs, plt.Axes):
        axs = [axs]
        
    for i, ax in enumerate(axs):
        print(plants[i]["Chemical"])
        plants[i]["Chemical"] = plants[i]["Chemical"].astype(str)

        bars = ax.bar(plants[i]["Chemical"], plants[i]["Amount"], color=colors[i % len(colors)], 
                width=barWidth, edgecolor='grey', 
                label='{}'.format(plants[i]['Plant'][0]))
        ax.set_xlabel('Chemical', fontweight='bold')
        ax.set_ylabel('Percentage (%)', fontweight='bold')
        ax.set_ylim([0, 100])

        # Add gridlines
        ax.grid(True)

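        # Rotate the x-axis labels of this subplot so they do not overlap
        # (plt.xticks would only affect the current axes, not necessarily this one)
        plt.setp(ax.get_xticklabels(), rotation=45, ha='right')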

        for bar in bars:
            yval = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2, yval + 1, round(yval,2), ha='center', va='bottom')

        ax.text(0.75, 0.5, 'Sum of main chemicals: {:.2f}%'.format(sum(plants[i]["Amount"])), 
                transform=ax.transAxes, fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        ax.legend()
    
    # The PDF is saved next to the source CSV; make sure that directory exists
    os.makedirs(os.path.dirname(file), exist_ok=True)

    fig.savefig(os.path.splitext(file)[0] + '.pdf', format='pdf', bbox_inches='tight')
    plt.close(fig)


0          Limonene
1          Camphene
2      alpha-Pinene
3    Bornyl acetate
4       beta-Pinene
Name: Chemical, dtype: object
0          Limonene
1          Camphene
2    Bornyl acetate
3      alpha-Pinene
4       beta-Pinene
Name: Chemical, dtype: object
0              Limonene
1    beta-Caryophyllene
2          Germacrene D
3               Myrcene
4          alpha-Pinene
Name: Chemical, dtype: object
0              Limonene
1    beta-Caryophyllene
2          Germacrene D
3               Myrcene
4          alpha-Pinene
Name: Chemical, dtype: object
0                       Limonene
1               (Z)-beta-Ocimene
2                        Myrcene
3                  beta-Selinene
4    n-Pentyl-1,3-cyclohexadiene
Name: Chemical, dtype: object
0              Limonene
1      (Z)-beta-Ocimene
2    3-n-Butylphthalide
3               Myrcene
4       gamma-Terpinene
Name: Chemical, dtype: object
0         Limonene
1      beta-Pinene
2         Geranial
3    Neryl acetate
4            Neral


0    (Z)-Methyl isoeugenol
1    1(10)-Aristolen-2-one
2       beta-Caryophyllene
3          Acoragermacrone
4             Isoacolamone
Name: Chemical, dtype: object
0    (Z)-Methyl isoeugenol
1               Calamenene
2          Isocalamenediol
3          Acoragermacrone
4               Shyobunone
Name: Chemical, dtype: object
0           epi-Shyobunone
1               Calamenone
2           beta-Gurjunene
3               Calamenene
4    (Z)-Methyl isoeugenol
Name: Chemical, dtype: object
0      beta-Asarone (Z)
1            Shyobunone
2             Acorenone
3    Preisocalamenediol
4     alpha-Asarone (E)
Name: Chemical, dtype: object
0           Acorenone
1             Acorone
2    beta-Asarone (Z)
3          Shyobunone
4     Isocalamenediol
Name: Chemical, dtype: object
0    beta-Asarone (Z)
1            Limonene
2     alpha-Terpineol
3         Longifolene
4     alpha-Bulnesene
Name: Chemical, dtype: object
0               alpha-Asarone (E)
1    2,4,5-Trimethoxybenzaldehyde
2      

0                              Limonene
1    Sesquiterpenes, oxygen-containing-
2                    (E)-beta-Farnesene
3             Eudesmol (unknown isomer)
4                                Elemol
Name: Chemical, dtype: object
0     Cyperotudone
1         Cyperene
2    Patchoulenone
3        Rotundone
4        Rotundene
Name: Chemical, dtype: object
0         alpha-Copaene
1        alpha-Humulene
2    beta-Caryophyllene
3          Germacrene D
4       alpha-Terpineol
Name: Chemical, dtype: object
0     delta-Cadinene
1       Germacrene D
2      alpha-Copaene
3    gamma-Muurolene
4            Cubenol
Name: Chemical, dtype: object
0       alpha-Santalene
1        Methyl eugenol
2              Elemicin
3        delta-Cadinene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0            Benzyl alcohol
1           Benzyl benzoate
2          Methyl linoleate
3    2-Phenylethyl benzoate
4         Benzyl salicylate
Name: Chemical, dtype: object
0              Davanone
1               

0              Sabinene
1              Limonene
2         alpha-Thujene
3    alpha-Phellandrene
4          alpha-Pinene
Name: Chemical, dtype: object
0    alpha-Phellandrene
1       alpha-Santalene
2           para-Cymene
3        beta-Santalene
4        alpha-Humulene
Name: Chemical, dtype: object
0              Linalool
1         alpha-Ocimene
2               Nonanal
3    Phenylacetaldehyde
4              Limonene
Name: Chemical, dtype: object
0           Benzyl alcohol
1            (Z)-3-Hexenol
2              Citronellol
3                 Geraniol
4    (Z)-3-Hexenyl acetate
Name: Chemical, dtype: object
0    Phenylacetaldehyde
1               Nonanal
2         (E)-2-Hexenal
3               Hexanal
4              Heptanal
Name: Chemical, dtype: object
0        para-Cymene
1          Carvacrol
2             Thymol
3      Terpinen-4-ol
4    gamma-Terpinene
Name: Chemical, dtype: object
0        Linoleic acid
1    Hexadecanoic acid
2              Carvone
3         beta-Guaiene
4       

0             Carvacrol
1       gamma-Terpinene
2           para-Cymene
3    beta-Caryophyllene
4        alpha-Humulene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3            Myrcene
4       alpha-Pinene
Name: Chemical, dtype: object
0             Carvacrol
1           para-Cymene
2                Thymol
3       gamma-Terpinene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0             Carvacrol
1           para-Cymene
2       gamma-Terpinene
3    beta-Caryophyllene
4               Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1    gamma-Terpinene
2        para-Cymene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3             Thymol
4            Myrcene
Name: Chemical, dtype: object
0          Carvacrol
1        para-Cymene
2    gamma-Terpinene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0         

0           Camphor
1      beta-Thujone
2     alpha-Thujone
3       1,8-Cineole
4    Chrysanthenone
Name: Chemical, dtype: object
0    alpha-Thujone
1          Camphor
2      1,8-Cineole
3     beta-Thujone
4         Camphene
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2      1,8-Cineole
3          Camphor
4          Borneol
Name: Chemical, dtype: object
0      alpha-Thujone
1       beta-Thujone
2        1,8-Cineole
3            Camphor
4    alpha-Terpineol
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3    Bornyl acetate
4           Camphor
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3      alpha-Pinene
4    Bornyl acetate
Name: Chemical, dtype: object
0    alpha-Thujone
1         Fenchone
2     beta-Thujone
3     alpha-Pinene
4         Sabinene
Name: Chemical, dtype: object
0     alpha-Thujone
1          Fenchone
2      beta-Thujone
3      alpha-Pinene
4    Bornyl acetate


0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2               alpha-Terpineol
3               gamma-Terpinene
4                   1,8-Cineole
Name: Chemical, dtype: object
0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2               alpha-Terpineol
3                   1,8-Cineole
4               gamma-Terpinene
Name: Chemical, dtype: object
0    2-Phenylethyl methyl ether
1                 Terpinen-4-ol
2                   para-Cymene
3               alpha-Terpineol
4               gamma-Terpinene
Name: Chemical, dtype: object
0      Artemisia ketone
1    beta-Caryophyllene
2           1,8-Cineole
3           beta-Pinene
4    (E)-beta-Farnesene
Name: Chemical, dtype: object
0     Artemisia ketone
1    Artemisia alcohol
2              Myrcene
3        alpha-Guaiene
4              Camphor
Name: Chemical, dtype: object
0     Artemisia ketone
1    Artemisia alcohol
2              Camphor
3      Arteannuic acid
4        alpha-Guaiene
Name: Chemical, dtyp

0     alpha-Thujone
1           Camphor
2      beta-Thujone
3    Chrysanthenone
4       1,8-Cineole
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2         Sabinene
3    Terpinen-4-ol
4    Viridiflorene
Name: Chemical, dtype: object
0    alpha-Thujone
1     beta-Thujone
2      beta-Pinene
3    Terpinen-4-ol
4    Cuminaldehyde
Name: Chemical, dtype: object
0      alpha-Thujone
1       beta-Thujone
2        para-Cymene
3            Camphor
4    alpha-Terpineol
Name: Chemical, dtype: object
0        alpha-Thujone
1         beta-Thujone
2          Spathulenol
3    trans-Pinocarveol
4        Cuminaldehyde
Name: Chemical, dtype: object
0            alpha-Thujone
1             beta-Thujone
2              1,8-Cineole
3                 Sabinene
4    trans-Sabinyl acetate
Name: Chemical, dtype: object
0         alpha-Thujone
1           1,8-Cineole
2          beta-Thujone
3    beta-Caryophyllene
4             Verbenone
Name: Chemical, dtype: object
0     alpha-Thujone
1    

0             Limonene
1              Myrcene
2     (E)-beta-Ocimene
3         alpha-Pinene
4    beta-Phellandrene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3        Sabinene
4    alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Sabinene
3    alpha-Pinene
4        Linalool
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2     beta-Pinene
3    alpha-Pinene
4         Decanal
Name: Chemical, dtype: object
0              Limonene
1               Myrcene
2    alpha-Phellandrene
3          alpha-Pinene
4              Sabinene
Name: Chemical, dtype: object
0            Limonene
1             Myrcene
2        alpha-Pinene
3            Sabinene
4    (E)-beta-Ocimene
Name: Chemical, dtype: object
0      Limonene
1       Myrcene
2      Sabinene
3    Nootkatone
4      Linalool
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2    alpha-Pinene
3     para-Cymene
4      Nootkatone
Name:

0      2-Phenylethanol
1          Citronellol
2    Alkanes & alkenes
3             Geraniol
4                Nerol
Name: Chemical, dtype: object
0              2-Phenylethanol
1               beta-Farnesene
2                  beta-Pinene
3    Farnesol (unknown isomer)
4                Neryl acetate
Name: Chemical, dtype: object
0      2-Phenylethanol
1          Citronellol
2             Geraniol
3    Alkanes & alkenes
4                Nerol
Name: Chemical, dtype: object
0    2-Phenylethanol
1        Citronellol
2           Linalool
3           Geraniol
4              Nerol
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3         Pentylbenzene
4           beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        para-Cymene
3        beta-Pinene
4    beta-Bisabolene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        beta-Pinene
3      alpha-Thujene
4            Myrcene


0      Terpinen-4-ol
1        1,8-Cineole
2           Sabinene
3    gamma-Terpinene
4    Fenchyl acetate
Name: Chemical, dtype: object
0      Terpinen-4-ol
1        1,8-Cineole
2    gamma-Terpinene
3           Sabinene
4        para-Cymene
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2           Linalyl acetate
3                  Sabinene
4                  Linalool
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2                  Linalool
3           alpha-Terpineol
4                  Limonene
Name: Chemical, dtype: object
0               1,8-Cineole
1    alpha-Terpinyl acetate
2                  Limonene
3           alpha-Terpineol
4               beta-Pinene
Name: Chemical, dtype: object
0    alpha-Terpinyl acetate
1               1,8-Cineole
2           alpha-Terpineol
3                  Sabinene
4                  Linalool
Name: Chemical, dtype: object
0               1,8-Cineole
1    alpha-Terpinyl ac

0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1           Linalool
2    Linalyl acetate
3           Geraniol
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3        beta-Pinene
4    gamma-Terpinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0           Limonene
1    Linalyl acetate
2           Linalool
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0    Linalyl acetate
1           Limonene
2           L

0         Isomenthone
1            Pulegone
2            Limonene
3            Menthone
4    Perilla aldehyde
Name: Chemical, dtype: object
0    Isomenthone
1       Pulegone
2       Limonene
3       Menthone
4    1-Octenol-3
Name: Chemical, dtype: object
0       Diosphenol
1      Isomenthone
2    Isodiosphenol
3         Limonene
4         Menthone
Name: Chemical, dtype: object
0      Isomenthone
1         Limonene
2       Diosphenol
3         Menthone
4    Isodiosphenol
Name: Chemical, dtype: object
0      Menthone
1      Pulegone
2    Piperitone
3    Neomenthol
4       Menthol
Name: Chemical, dtype: object
0        Pulegone
1     Isomenthone
2        Menthone
3    Piperitenone
4        Limonene
Name: Chemical, dtype: object
0        Pulegone
1        Menthone
2      Piperitone
3    Piperitenone
4     Isomenthone
Name: Chemical, dtype: object
0        Pulegone
1        Menthone
2    Piperitenone
3      Piperitone
4        Limonene
Name: Chemical, dtype: object
0        Pulegone
1    Pi

0           Limonene
1           Geranial
2              Neral
3    Geranyl acetate
4           Geraniol
Name: Chemical, dtype: object
0           Limonene
1           Geranial
2              Neral
3      Neryl acetate
4    Geranyl acetate
Name: Chemical, dtype: object
0     beta-Thujone
1         Sabinene
2      1,8-Cineole
3    alpha-Thujone
4     Germacrene D
Name: Chemical, dtype: object
0                 beta-Thujone
1                     Sabinene
2    Cadinene (unknown isomer)
3                  1,8-Cineole
4                Terpinen-4-ol
Name: Chemical, dtype: object
0    beta-Thujone
1         Camphor
2    Germacrene D
3        Sabinene
4        Camphene
Name: Chemical, dtype: object
0      beta-Thujone
1           Camphor
2          Sabinene
3          Camphene
4    Bornyl acetate
Name: Chemical, dtype: object
0           beta-Thujone
1    cis-Ocimene epoxide
2       (Z)-beta-Ocimene
3            Chamazulene
4           Germacrene D
Name: Chemical, dtype: object
0           bet

0    Bornyl acetate
1      alpha-Pinene
2          Camphene
3    delta-3-Carene
4      p-Cymen-8-ol
Name: Chemical, dtype: object
0       Bornyl acetate
1         (-)-Camphene
2    beta-Phellandrene
3     (-)-alpha-Pinene
4      (-)-beta-Pinene
Name: Chemical, dtype: object
0           Bornyl acetate
1             (-)-Camphene
2         (-)-alpha-Pinene
3    (-)-beta-Phellandrene
4          (-)-beta-Pinene
Name: Chemical, dtype: object
0    Bornyl acetate
1          Camphene
2      alpha-Pinene
3          Limonene
4        Tricyclene
Name: Chemical, dtype: object
0    Bornyl acetate
1      alpha-Pinene
2    delta-3-Carene
3          Limonene
4       beta-Pinene
Name: Chemical, dtype: object
0          Camphene
1    Bornyl acetate
2      alpha-Pinene
3    delta-3-Carene
4          Limonene
Name: Chemical, dtype: object
0           Camphor
1    Bornyl acetate
2          Camphene
3          Limonene
4           Myrcene
Name: Chemical, dtype: object
0    Bornyl acetate
1           Camphor


0           Linalool
1        1,8-Cineole
2         Piperitone
3    alpha-Terpineol
4      Terpinen-4-ol
Name: Chemical, dtype: object
0            Linalool
1     Linalyl acetate
2    (E)-beta-Ocimene
3             Myrcene
4       Neryl acetate
Name: Chemical, dtype: object
0                               Linalool
1      cis-Linalool oxide (5) (furanoid)
2    trans-Linalool oxide (5) (furanoid)
3                           Eremophilene
4                          Cyclosativene
Name: Chemical, dtype: object
0                                 Linalool
1                          alpha-Terpineol
2      cis-Linalool oxide (unknown isomer)
3    trans-Linalool oxide (unknown isomer)
4                              1,8-Cineole
Name: Chemical, dtype: object
0                               Linalool
1                        alpha-Terpineol
2                               Geraniol
3                               Limonene
4    cis-Linalool oxide (unknown isomer)
Name: Chemical, dtype: object
0         

0        Limonene
1         Myrcene
2        Linalool
3        Sabinene
4    alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3    alpha-Pinene
4        Sabinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Linalool
3    alpha-Pinene
4         Octanal
Name: Chemical, dtype: object
0                     Limonene
1                      Myrcene
2                     Linalool
3                     Sabinene
4    Sinensal (unknown isomer)
Name: Chemical, dtype: object
0           Limonene
1            Myrcene
2    gamma-Terpinene
3           Sabinene
4       alpha-Pinene
Name: Chemical, dtype: object
0        Limonene
1         Myrcene
2        Sabinene
3        Linalool
4    alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1            Myrcene
2           Sabinene
3       alpha-Pinene
4    beta-Bisabolene
Name: Chemical, dtype: object
0              Limonene
1               Myrcene
2          alpha-

0                                 Limonene
1                            beta-Selinene
2                       3-n-Butylphthalide
3    cis-3-Butylidene-4,5-dihydrophthalide
4                           alpha-Selinene
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3    Sedanoic anhydride
4         Pentylbenzene
Name: Chemical, dtype: object
0              Limonene
1         beta-Selinene
2    3-n-Butylphthalide
3           Ligustilide
4        alpha-Selinene
Name: Chemical, dtype: object
0            Limonene
1     gamma-Terpinene
2        alpha-Pinene
3    (E)-beta-Ocimene
4         beta-Pinene
Name: Chemical, dtype: object
0            Limonene
1     gamma-Terpinene
2    (E)-beta-Ocimene
3             Myrcene
4    (Z)-beta-Ocimene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3        para-Cymene
4             Thymol
Name: Chemical, dtype: object
0             Limonene
1      gamma-Terp

0           Limonene
1    gamma-Terpinene
2        beta-Pinene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2       alpha-Pinene
3        para-Cymene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3        beta-Pinene
4       alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3           Linalool
4       alpha-Pinene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2        para-Cymene
3       alpha-Pinene
4            Myrcene
Name: Chemical, dtype: object
0           Limonene
1    gamma-Terpinene
2            Myrcene
3       alpha-Pinene
4        beta-Pinene
Name: Chemical, dtype: object
0        Limonene
1      Isopulegol
2         Myrcene
3     Citronellal
4    beta-Ocimene
Name: Chemical, dtype: object
0           Limonene
1    alpha-Terpinene
2            Myrcene
3      

0         Sabinene
1    Terpinen-4-ol
2      beta-Pinene
3      para-Cymene
4         Limonene
Name: Chemical, dtype: object
0       Sabinene
1         Cedrol
2       Limonene
3       Camphene
4    beta-Pinene
Name: Chemical, dtype: object
0             alpha-Pinene
1            Terpinen-4-ol
2                 Sabinene
3             beta-Thujene
4    beta-Terpinyl acetate
Name: Chemical, dtype: object
0    Terpinen-4-ol
1     alpha-Pinene
2         Sabinene
3      para-Cymene
4          Myrcene
Name: Chemical, dtype: object
0                  Sabinene
1              alpha-Pinene
2                  Camphene
3    alpha-Terpinyl acetate
4               beta-Pinene
Name: Chemical, dtype: object
0            Sabinene
1    (Z,Z)-Germacrone
2       alpha-Thujene
3        Germacrene B
4    (E,E)-Germacrone
Name: Chemical, dtype: object
0        Sabinene
1    beta-Ocimene
2     beta-Pinene
3        Limonene
4         Myrcene
Name: Chemical, dtype: object
0               Sabinene
1            be

0                         Methyl chavicol
1    Sesquiterpene hydrocarbons (unknown)
2                        (Z)-beta-Ocimene
3                            (E)-Anethole
4                                Limonene
Name: Chemical, dtype: object
0    Methyl chavicol
1            Menthol
2     alpha-Humulene
3            Carvone
4           Menthone
Name: Chemical, dtype: object
0            Methyl chavicol
1                   Limonene
2    2-Methoxycinnamaldehyde
3                    Menthol
4                    Carvone
Name: Chemical, dtype: object
0    Methyl chavicol
1        beta-Pinene
2       alpha-Pinene
3       Anisaldehyde
4       (E)-Anethole
Name: Chemical, dtype: object
0       Methyl chavicol
1           1,8-Cineole
2              Linalool
3               Camphor
4    beta-Caryophyllene
Name: Chemical, dtype: object
0     Methyl chavicol
1            Linalool
2         1,8-Cineole
3    (Z)-beta-Ocimene
4             Camphor
Name: Chemical, dtype: object
0          Methyl chavico

0        delta-3-Carene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0        delta-3-Carene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0      Benzyl benzoate
1        Terpinen-4-ol
2    Benzyl salicylate
3        beta-Selinene
4       alpha-Humulene
Name: Chemical, dtype: object
0      Benzyl benzoate
1    Benzyl salicylate
2       cis-Calamenene
3        beta-Selinene
4       alpha-Humulene
Name: Chemical, dtype: object
0       Benzyl benzoate
1              Linalool
2              Limonene
3    beta-Caryophyllene
4           beta-Pinene
Name: Chemical, dtype: object
0     Benzyl benzoate
1    Methyl cinnamate
2              Indole
3     Methyl benzoate
4     Prenyl benzoate
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4          Carvacrol
Name: Chemical,

0       Methyl eugenol
1    Bicyclogermacrene
2          Spathulenol
3        alpha-Cadinol
4             Globulol
Name: Chemical, dtype: object
0           Methyl eugenol
1                 Elemicin
2                 Linalool
3                  Safrole
4    Caryophyllene alcohol
Name: Chemical, dtype: object
0        Methyl eugenol
1           1,8-Cineole
2     Methyl isoeugenol
3    beta-Caryophyllene
4          alpha-Pinene
Name: Chemical, dtype: object
0        Methyl eugenol
1       Methyl chavicol
2              Elemicin
3         alpha-Elemene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0           Methyl eugenol
1            Terpinen-4-ol
2         (E)-beta-Ocimene
3               Calamenene
4    (E)-Methyl isoeugenol
Name: Chemical, dtype: object
0        Methyl eugenol
1               Myrcene
2               Eugenol
3           1,8-Cineole
4    beta-Caryophyllene
Name: Chemical, dtype: object
0        Methyl eugenol
1               Eugenol
2               Myrcene
3  

0        1,8-Cineole
1            Camphor
2            Borneol
3    alpha-Terpineol
4     Bornyl acetate
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2           beta-Pinene
3               Myrcene
4    beta-Caryophyllene
Name: Chemical, dtype: object
0     1,8-Cineole
1         Camphor
2     beta-Pinene
3        Camphene
4    alpha-Pinene
Name: Chemical, dtype: object
0      1,8-Cineole
1          Camphor
2    alpha-Thujene
3         Camphene
4          Borneol
Name: Chemical, dtype: object
0     1,8-Cineole
1     beta-Pinene
2    alpha-Pinene
3        Camphene
4         Camphor
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2    beta-Caryophyllene
3               Borneol
4           beta-Pinene
Name: Chemical, dtype: object
0           1,8-Cineole
1              Linalool
2    beta-Caryophyllene
3          alpha-Pinene
4           beta-Pinene
Name: Chemical, dtype: object
0           1,8-Cineole
1               Camphor
2 

0    Linalyl acetate
1           Linalool
2    alpha-Terpineol
3        Citronellol
4           Geranial
Name: Chemical, dtype: object
0     Linalyl acetate
1            Linalool
2         beta-Pinene
3    (E)-beta-Ocimene
4     alpha-Terpineol
Name: Chemical, dtype: object
0    Linalyl acetate
1           Linalool
2           Limonene
3    alpha-Terpineol
4    Geranyl acetate
Name: Chemical, dtype: object
0       Linalyl acetate
1              Linalool
2       alpha-Terpineol
3               Camphor
4    beta-Caryophyllene
Name: Chemical, dtype: object
0       Linalyl acetate
1              Linalool
2       alpha-Terpineol
3    beta-Caryophyllene
4          Germacrene D
Name: Chemical, dtype: object
0        Linalyl acetate
1               Linalool
2        alpha-Terpineol
3    Caryophyllene oxide
4     beta-Caryophyllene
Name: Chemical, dtype: object
0           Linalool
1    Linalyl acetate
2            Camphor
3        1,8-Cineole
4      Terpinen-4-ol
Name: Chemical, dtype: object


0        beta-Caryophyllene
1                  Linalool
2       Caryophyllene oxide
3            alpha-Humulene
4    alpha-Terpinyl acetate
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2           beta-Pinene
3          alpha-Pinene
4    alpha-Phellandrene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2          alpha-Pinene
3        delta-3-Carene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2              Sabinene
3           beta-Pinene
4          alpha-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Limonene
2              Sabinene
3               Myrcene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1              Sabinene
2        delta-3-Carene
3              Limonene
4           beta-Pinene
Name: Chemical, dtype: object
0    beta-Caryophyllene
1        delta-3-Carene
2              Limonene
3       

0        Menthone
1         Menthol
2        Limonene
3     Isomenthone
4    alpha-Pinene
Name: Chemical, dtype: object
0                               Menthol
1                              Menthone
2    Monoterpene hydrocarbons (unknown)
3                           Isomenthone
4                            Neomenthol
Name: Chemical, dtype: object
0        Menthol
1       Menthone
2    Isomenthone
3       Limonene
4    beta-Pinene
Name: Chemical, dtype: object
0                               Menthol
1                              Menthone
2    Monoterpene hydrocarbons (unknown)
3                           Isomenthone
4                            Neomenthol
Name: Chemical, dtype: object
0               Menthol
1              Menthone
2       Menthyl acetate
3            Piperitone
4    beta-Caryophyllene
Name: Chemical, dtype: object
0            Menthol
1           Menthone
2        Isomenthone
3         Neomenthol
4    Menthyl acetate
Name: Chemical, dtype: object
0            Menthol

0       Nepetalactone 1
1          Germacrene D
2    beta-Caryophyllene
3       Nepetalactone 3
4       Nepetalactone 2
Name: Chemical, dtype: object
0            Nepetalactone 1
1         beta-Caryophyllene
2    dihydro-Nepetalactone 7
3             beta-Farnesene
4               Germacrene D
Name: Chemical, dtype: object
0       Nepetalactone 1
1          Germacrene D
2        beta-Farnesene
3       Nepetalactone 2
4    beta-Caryophyllene
Name: Chemical, dtype: object
0              Linalool
1       Linalyl acetate
2    (E)-beta-Farnesene
3       alpha-Terpineol
4      (E)-beta-Ocimene
Name: Chemical, dtype: object
0                             Linalool
1                      alpha-Terpineol
2    cis-Linalool oxide (5) (furanoid)
3                2-Phenylethyl acetate
4                        Terpinen-4-ol
Name: Chemical, dtype: object
0                                 Linalool
1                          alpha-Terpineol
2    trans-Linalool oxide (unknown isomer)
3      cis-Linalool o

0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4            Myrcene
Name: Chemical, dtype: object
0             Thymol
1    gamma-Terpinene
2        para-Cymene
3            Myrcene
4           Limonene
Name: Chemical, dtype: object
0             Thymol
1    gamma-Terpinene
2        para-Cymene
3            Myrcene
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3          Carvacrol
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2          Carvacrol
3    gamma-Terpinene
4        beta-Pinene
Name: Chemical, dtype: object
0             Thymol
1        para-Cymene
2    gamma-Terpinene
3        beta-Pinene
4           Limonene
Name: Chemical, dtype: object
0             Thymol
1          Carvacrol
2        para-Cymene
3    gamma-Terpinene
4            Myrcene
Name: Chemical, dtype: object
0                Thymol
1           para-Cymene
2      

0        Eugenol
1       Chavicol
2        Myrcene
3       Linalool
4    3-Octenol-1
Name: Chemical, dtype: object
0     Eugenol
1    Chavicol
2     Myrcene
3    Linalool
4    Limonene
Name: Chemical, dtype: object
0            Eugenol
1    Eugenyl acetate
2         Isoeugenol
3            Safrole
4           Chavicol
Name: Chemical, dtype: object
0            Eugenol
1           Limonene
2     delta-3-Carene
3    Eugenyl acetate
4        para-Cymene
Name: Chemical, dtype: object
0                  Eugenol
1            Terpinen-4-ol
2                 Linalool
3    (E)-Cinnamyl aldehyde
4          Benzyl benzoate
Name: Chemical, dtype: object
0                  Eugenol
1    (E)-Cinnamyl aldehyde
2                 Linalool
3       beta-Caryophyllene
4       alpha-Phellandrene
Name: Chemical, dtype: object
0               Eugenol
1       Benzyl benzoate
2    beta-Caryophyllene
3    alpha-Phellandrene
4       Eugenyl acetate
Name: Chemical, dtype: object
0               Eugenol
1       Ben

0            Carotol
1       alpha-Pinene
2    Geranyl acetate
3           Linalool
4        beta-Pinene
Name: Chemical, dtype: object
0            Carotol
1       alpha-Pinene
2    Geranyl acetate
3        beta-Pinene
4           Sabinene
Name: Chemical, dtype: object
0                Carotol
1               Sabinene
2        Geranyl acetate
3               Geraniol
4    Caryophyllene oxide
Name: Chemical, dtype: object
0         alpha-Humulene
1     beta-Caryophyllene
2                Myrcene
3     Methyl 4-decenoate
4    Geranyl isobutyrate
Name: Chemical, dtype: object
0        alpha-Humulene
1               Myrcene
2    beta-Caryophyllene
3        gamma-Cadinene
4        delta-Cadinene
Name: Chemical, dtype: object
0    Isopinocamphone
1       Pinocamphone
2       Germacrene D
3        beta-Pinene
4        Pinocarvone
Name: Chemical, dtype: object
0    Isopinocamphone
1        Pinocarvone
2       Pinocamphone
3       Germacrene D
4        beta-Pinene
Name: Chemical, dtype: object
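
Each printed Series above is the "Chemical" column built for one plant by the loop's `range(3, 13, 2)` and `range(4, 14, 2)` indexing. A minimal sketch of that unpacking step on a single made-up row, assuming the CSV has three leading columns followed by five alternating name/amount pairs (the contents of columns 1 and 2 and all values below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical row: three leading columns, then five (chemical, amount) pairs.
# The real file's column names and values are not shown here; these are invented.
row = pd.DataFrame([[
    "Plant A", "Plant B", 0.93,
    "Limonene", 0.40, "Myrcene", 0.20, "Sabinene", 0.10,
    "Linalool", 0.05, "alpha-Pinene", 0.02,
]])

# Positions 3, 5, ..., 11 hold names; positions 4, 6, ..., 12 hold fractional amounts
chemicals = [row.iloc[0, j] for j in range(3, 13, 2)]
amounts = [row.iloc[0, j] * 100 for j in range(4, 14, 2)]

# One long-format table per plant: the scalar plant name is broadcast to all five rows
long_df = pd.DataFrame({"Plant": row.iloc[0, 0], "Chemical": chemicals, "Amount": amounts})
print(long_df)
```

The long format (one row per chemical) is what `ax.bar` consumes directly, which is presumably why the loop reshapes each CSV row this way before plotting.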


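The next cell creates one heatmap per cluster from that cluster's high_similarity_plants_compositions.csv file. For each plant it selects the five most abundant chemicals, and the heatmap shows the union of those chemicals: each row corresponds to a chemical and each column to a plant. The color of each cell encodes the value from the CSV file for that plant and chemical, multiplied by 100 and shown as a concentration in percent. The heatmaps are then saved as PDF files.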
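
To make the selection step concrete, here is a small sketch on an invented composition table. It assumes, as the cell below does, that rows are plants and columns are chemicals stored as fractions; the plant names, values, and the `n_top` variable are hypothetical, and `n_top` is set to 2 instead of 5 only to keep the toy table readable.

```python
import pandas as pd

# Hypothetical composition table: rows are plants, columns are chemicals (fractions)
df = pd.DataFrame(
    {
        "Limonene": [0.60, 0.05, 0.00],
        "Myrcene": [0.20, 0.00, 0.10],
        "Thymol": [0.00, 0.70, 0.50],
        "Carvacrol": [0.01, 0.10, 0.30],
        "Linalool": [0.02, 0.01, 0.03],
    },
    index=["Plant A", "Plant B", "Plant C"],
)

n_top = 2  # the notebook uses 5; 2 keeps this toy example small

# Names of the n_top most abundant chemicals for each plant (one list per row)
top_per_plant = df.apply(lambda row: row.nlargest(n_top).index.tolist(), axis=1)

# Union across plants, each chemical kept once, in order of first appearance
selected = top_per_plant.explode().drop_duplicates().tolist()

# Percentages, transposed so rows are chemicals and columns are plants
heatmap_data = (df[selected] * 100).T

# Order rows by total abundance across plants, highest first
order = heatmap_data.sum(axis=1).sort_values(ascending=False).index
heatmap_data = heatmap_data.loc[order]
print(heatmap_data)
```

Linalool never ranks among any plant's top two here, so it is left out of `heatmap_data`; the same logic is what keeps the real heatmaps limited to chemicals that matter for at least one plant in the cluster.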

In [8]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from tqdm import tqdm
from math import sqrt

base_dir = '/Users/mariiakokina/Documents/eo_database/'
cluster_dirs = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d)) and d.startswith('cluster_')]

sns.set_style("white")
cmap = sns.diverging_palette(220, 10, as_cmap=True)

for cluster_dir in tqdm(cluster_dirs, desc='Processing clusters'):
    cluster_number = cluster_dir.split('_')[-1]
    file_path = os.path.join(base_dir, cluster_dir, 'high_similarity_plants_compositions.csv')

    if os.path.isfile(file_path):
        df = pd.read_csv(file_path, index_col=0)

        if df.empty:
            continue

        # Names of the five most abundant chemicals for each plant (row)
        top_5_chemicals = df.apply(lambda row: row.nlargest(5).index.tolist(), axis=1)
        # Flatten to one list of unique chemical names so no column is selected twice
        unique_chemicals = top_5_chemicals.explode().dropna().drop_duplicates().tolist()
        # Scale to percentages to match the colorbar (assumes the CSV stores fractions)
        df_top_5 = df[unique_chemicals] * 100

        # Transpose so that rows are chemicals and columns are plants
        df_transposed = df_top_5.T

        # Sort chemicals by total abundance across all plants in the cluster
        order = df_transposed.sum(axis=1).sort_values(ascending=False).index
        df_transposed = df_transposed.loc[order]

        if df_transposed.empty:
            continue

        # Dynamic sizing of the plot
        num_rows, num_cols = df_transposed.shape
        fig_width = min(max(num_cols // 2, 10), 50)  # Set a minimum width of 10 and maximum width of 50
        fig_height = min(max(num_rows // 2, 10), 50)  # Set a minimum height of 10 and maximum height of 50

        fig, ax = plt.subplots(figsize=(fig_width, fig_height))

        cell_width = fig_width / num_cols
        cell_height = fig_height / num_rows
        font_size = max(min(min(cell_width, cell_height) * 2.5, 20), 8)  # Adjust the multiplier to control the font size and set a range (8-20)

        sns.heatmap(df_transposed, cmap=cmap, annot=True, fmt=".1f", linewidths=.5,
                    cbar_kws={'label': 'Concentration (%)'},
                    vmin=0, vmax=100, ax=ax, annot_kws={"size": font_size})  # Use dynamic font size

        for t in ax.texts:  # iterate over the annotation texts of the heatmap
            if t.get_text() == '0.0':
                t.set_text('')  # remove the text for zero cells

        ax.set_title(f'Top 5 Abundant Chemicals in Plants with min. 80% Similarity within {cluster_dir}', fontsize=min(font_size+6, 24), pad=20)  # Use dynamic font size and set a maximum limit of 24
        ax.set_xlabel('Plants', fontsize=min(font_size+2, 22), labelpad=10)  # Use dynamic font size and set a maximum limit of 22
        ax.set_ylabel('Chemicals', fontsize=min(font_size+2, 22), labelpad=10)  # Use dynamic font size and set a maximum limit of 22
        ax.tick_params(labelsize=min(font_size, 20))  # Use dynamic font size and set a maximum limit of 20

        sns.despine()  # Remove the top and right spines from plot

        plt.xticks(rotation=90, fontsize=font_size)  # Use dynamic font size
        plt.yticks(rotation=0, fontsize=font_size)  # Use dynamic font size

        plt.tight_layout()

        save_path = os.path.join(base_dir, cluster_dir, 'heatmap_Top_5_Abundant_Chemicals_in_Plants_with_80_Similarity.pdf')
        plt.savefig(save_path, format="pdf", bbox_inches='tight')

        plt.close()

    else:
        print(f"The file {file_path} does not exist.")


Processing clusters: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90/90 [02:27<00:00,  1.63s/it]
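
To make the selection step in the cell above easier to follow, here is a minimal sketch on a made-up table. The plant names, the values, and the helper name top_n_matrix are hypothetical and only for illustration; n=2 is used instead of 5 to keep the table small, and the values are assumed to be fractions between 0 and 1.

```python
import pandas as pd

# Hypothetical toy composition table: rows are plants, columns are chemicals.
# The values are invented fractions (0-1), not taken from the real data set.
toy = pd.DataFrame(
    {
        "Eugenol": [0.70, 0.65, 0.00],
        "Linalool": [0.10, 0.00, 0.50],
        "Myrcene": [0.05, 0.20, 0.30],
        "Sabinene": [0.15, 0.15, 0.20],
    },
    index=["plant_a", "plant_b", "plant_c"],
)


def top_n_matrix(df, n=2):
    """Return a chemicals x plants table (in percent) limited to each plant's top-n chemicals."""
    # Top-n chemical names for every plant, flattened into one Series
    top_names = df.apply(lambda row: row.nlargest(n).index.tolist(), axis=1).explode()
    # Keep each chemical name once, in first-seen order
    selected = top_names.dropna().drop_duplicates().tolist()
    # Scale to percent and transpose: rows become chemicals, columns become plants
    out = (df[selected] * 100).T
    # Order chemicals by their summed abundance across plants
    return out.loc[out.sum(axis=1).sort_values(ascending=False).index]


result = top_n_matrix(toy, n=2)
print(result)
```

The resulting table has the same orientation as the one passed to sns.heatmap: chemicals on the rows, plants on the columns. A chemical that is in the top n for one plant still shows its values for every other plant, which is why a heatmap can contain small non-zero cells outside a plant's own top five.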
