# ESG Theme Classification Heatmap Demo

This notebook demonstrates how to classify ESG (Environmental, Social, Governance) themes in text and visualize the results as a heatmap.

# Required Libraries

This notebook requires the following libraries (installed by the first code cell, or via requirements.txt):
- spacy (with en_core_web_sm)
- transformers
- torch (backend required by transformers)
- pandas
- matplotlib
- seaborn
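
Before running the remaining cells, it can help to confirm which of the packages this notebook uses are already installed. The following is a minimal sketch that relies only on the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names as published on PyPI for the libraries this notebook uses
required = ["spacy", "transformers", "torch", "pandas", "matplotlib", "seaborn"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with the `%pip install` lines in the next cell.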

In [None]:
# Install required dependencies

# Jupyter Notebook interface for interactive computing
%pip install notebook  

# Natural language processing library for sentence segmentation
%pip install spacy  

# Hugging Face library for loading the ESG-BERT model
%pip install transformers  

# Deep learning backend required by transformers
%pip install torch  

# Data manipulation and analysis library
%pip install pandas  

# Core plotting library for creating visualizations
%pip install matplotlib  

# Advanced visualization library for heatmaps
%pip install seaborn  

# Input Document

Paste your CSR (Corporate Social Responsibility) report, or a similar text document, into the cell below.
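
For longer reports, pasting text into a cell can be impractical. As a sketch of one alternative, the text could be read from a plain-text file, in which case the `input_text` assignment in the next cell would be skipped. The helper `load_report` and the file name `report.txt` are hypothetical, and UTF-8 encoding with blank-line paragraph breaks is assumed.

```python
from pathlib import Path

def load_report(path, encoding="utf-8"):
    """Read a plain-text report and collapse runs of whitespace within each paragraph."""
    raw = Path(path).read_text(encoding=encoding)
    # Assumption: paragraphs are separated by blank lines
    paragraphs = [" ".join(block.split()) for block in raw.split("\n\n")]
    # Drop empty paragraphs and rejoin with a single blank line
    return "\n\n".join(p for p in paragraphs if p)

# Hypothetical usage: only runs if the file actually exists
if Path("report.txt").exists():
    input_text = load_report("report.txt")
```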

In [None]:
input_text = """Example CSR report text. 
At GreenTech Solutions, we are committed to reducing our carbon emissions by 40% by 2030. As part of our environmental strategy, we have transitioned 60% of our fleet to electric vehicles and implemented solar panels across all office locations.

We also believe in creating an inclusive and equitable workplace. Our diversity and inclusion committee has launched mentorship programs for underrepresented groups and revised hiring practices to reduce unconscious bias.

To support local communities, we donated over $1 million to education and housing initiatives in 2024. Our employees volunteered more than 5,000 hours in various community outreach programs.

In terms of governance, our board of directors has been expanded to include more independent members, with a focus on enhancing transparency and oversight. We've implemented a new whistleblower policy to ensure accountability at all levels of the organization.

We continue to audit our supply chain to ensure ethical labour practices and environmental compliance. Regular risk assessments and stakeholder engagement ensure that we are aligned with best practices in corporate governance.

Finally, as part of our sustainability goals, we've reduced water usage in our manufacturing processes by 25%, and we are targeting zero waste-to-landfill by 2026.

"""

# Sentence Splitting with spaCy

In [None]:
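# Download the spaCy English language model (required for sentence segmentation)
# Using sys.executable installs the model into the same environment as the running kernel
import sys
!{sys.executable} -m spacy download en_core_web_sm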

# Import the spaCy library
import spacy

# Load the English language model (small version)
nlp = spacy.load("en_core_web_sm")

# Process the input text with spaCy
# This creates a Doc object with linguistic annotations
doc = nlp(input_text)

# Extract sentences using spaCy's sentence segmentation
# spaCy identifies sentence boundaries based on punctuation and other linguistic features
sentences = [sent.text.strip() for sent in doc.sents]

# Print the list of sentences for verification
print(f"Found {len(sentences)} sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")
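
Sentence segmentation can produce fragments that carry no ESG content, such as the placeholder line at the top of the sample text. As a possible refinement, a small filter could drop very short fragments before classification. The function `filter_sentences` and its `min_words` threshold are illustrative assumptions, not part of the original notebook.

```python
def filter_sentences(sentences, min_words=5):
    """Keep sentences with at least `min_words` whitespace-separated words."""
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if len(text.split()) >= min_words:
            kept.append(text)
    return kept

example = [
    "Example CSR report text.",
    "",
    "We have transitioned 60% of our fleet to electric vehicles.",
]
print(filter_sentences(example))
```

The threshold is a trade-off: a higher value removes more noise but may also discard short, meaningful statements.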

# ESG Classification with Transformers
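
The classification cell in this section turns the model's raw scores (logits) into probabilities with softmax and then picks the class with the highest probability. To make that step concrete, here is a small pure-Python illustration using made-up logits for three classes; the real model produces one logit for each of its 26 categories.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits, for illustration only
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)

# The predicted class is the index of the largest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print([round(p, 3) for p in probabilities])
print(predicted_class)
```

The probability of the predicted class is what the notebook reports as the confidence score.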

In [None]:
# Import necessary modules from transformers and torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load the pretrained ESG-BERT model and tokenizer from Hugging Face
# This model is specifically trained to classify text into ESG categories
model_name = "nbroad/ESG-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define a function to classify sentences into detailed ESG categories
def classify_sentence(sentence):
    """
    Classify a sentence into one of the 26 detailed ESG categories and map it to its main ESG pillar.
    
    Args:
        sentence (str): The input sentence to classify
        
    Returns:
        tuple: (detailed_category_name, main_esg_pillar, confidence_score) where:
               - detailed_category_name is the specific ESG category (e.g., "GHG_Emissions")
               - main_esg_pillar is the high-level ESG category ("Environmental", "Social", or "Governance")
               - confidence_score is the softmax probability of the predicted category (between 0 and 1)
    """
    # Step 1: Tokenize the input sentence
    # Convert the text into tokens that the model can understand
    # return_tensors="pt" returns PyTorch tensors
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Step 2: Feed the tokenized sentence into the model to obtain raw logits
    # Set model to evaluation mode and disable gradient calculation for inference
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    
    # Step 3: Apply softmax function to convert logits into probability distribution
    # Softmax normalizes the logits so they sum to 1, representing probabilities
    probabilities = F.softmax(logits, dim=1)
    
    # Get the predicted class (highest probability)
    predicted_class = torch.argmax(probabilities, dim=1).item()
    probability_value = probabilities[0][predicted_class].item()
    
    # Step 4: Map the predicted numerical label to detailed ESG categories
    # The category names are based on the SASB framework; the indices are expected to
    # match the model's own label order (see model.config.id2label)
    detailed_categories = {
        # Environmental Categories
        13: "Physical_Impacts_Of_Climate_Change",
        19: "Waste_And_Hazardous_Materials_Management",
        20: "Water_And_Wastewater_Management",
        21: "Air_Quality",
        23: "Ecological_Impacts",
        24: "Energy_Management",
        25: "GHG_Emissions",
        # Social Categories
        1: "Data_Security",
        2: "Access_And_Affordability",
        6: "Customer_Welfare",
        8: "Employee_Engagement_Inclusion_And_Diversity",
        9: "Employee_Health_And_Safety",
        10: "Human_Rights_And_Community_Relations",
        11: "Labor_Practices",
        14: "Product_Quality_And_Safety",
        16: "Selling_Practices_And_Product_Labeling",
        22: "Customer_Privacy",
        # Governance Categories
        0: "Business_Ethics",
        3: "Business_Model_Resilience",
        4: "Competitive_Behavior",
        5: "Critical_Incident_Risk_Management",
        7: "Director_Removal",
        12: "Management_Of_Legal_And_Regulatory_Framework",
        15: "Product_Design_And_Lifecycle_Management",
        17: "Supply_Chain_Management",
        18: "Systemic_Risk_Management"
    }
    
    # Step 5: Map each detailed category to its main ESG pillar
    # This grouping is based on the SASB framework
    esg_pillars = {
        # Environmental Categories
        "Physical_Impacts_Of_Climate_Change": "Environmental",
        "Waste_And_Hazardous_Materials_Management": "Environmental",
        "Water_And_Wastewater_Management": "Environmental",
        "Air_Quality": "Environmental",
        "Ecological_Impacts": "Environmental",
        "Energy_Management": "Environmental",
        "GHG_Emissions": "Environmental",
        # Social Categories
        "Data_Security": "Social",
        "Access_And_Affordability": "Social",
        "Customer_Welfare": "Social",
        "Employee_Engagement_Inclusion_And_Diversity": "Social",
        "Employee_Health_And_Safety": "Social",
        "Human_Rights_And_Community_Relations": "Social",
        "Labor_Practices": "Social",
        "Product_Quality_And_Safety": "Social",
        "Selling_Practices_And_Product_Labeling": "Social",
        "Customer_Privacy": "Social",
        # Governance Categories
        "Business_Ethics": "Governance",
        "Business_Model_Resilience": "Governance",
        "Competitive_Behavior": "Governance",
        "Critical_Incident_Risk_Management": "Governance",
        "Director_Removal": "Governance",
        "Management_Of_Legal_And_Regulatory_Framework": "Governance",
        "Product_Design_And_Lifecycle_Management": "Governance",
        "Supply_Chain_Management": "Governance",
        "Systemic_Risk_Management": "Governance"
    }
    
    # Get the detailed category name
    detailed_category = detailed_categories.get(predicted_class, f"Unknown_Category_{predicted_class}")
    
    # Get the main ESG pillar for this detailed category
    main_pillar = esg_pillars.get(detailed_category, "Unknown")
    
    # Return the tuple with detailed category, main pillar, and confidence score
    return detailed_category, main_pillar, probability_value

# Test the function on a sample sentence
sample_sentence = "The company reduced carbon emissions by 15% this year."
detailed_category, main_pillar, confidence = classify_sentence(sample_sentence)
print(f"Sample: '{sample_sentence}'")
print(f"Detailed ESG Category: {detailed_category}")
print(f"Main ESG Pillar: {main_pillar}")
print(f"Confidence: {confidence:.4f} ({confidence*100:.2f}%)")
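
The two dictionaries inside `classify_sentence` list each category name twice and are rebuilt on every call. One possible refactoring, sketched here with only a few categories per pillar for brevity, is to declare the categories once per pillar at module level and derive the category-to-pillar lookup by inverting that structure. The names `PILLAR_CATEGORIES`, `CATEGORY_TO_PILLAR`, and `pillar_for` are hypothetical.

```python
# Abbreviated for illustration: the notebook's full mapping has 26 categories
PILLAR_CATEGORIES = {
    "Environmental": ["GHG_Emissions", "Energy_Management", "Air_Quality"],
    "Social": ["Data_Security", "Labor_Practices", "Customer_Privacy"],
    "Governance": ["Business_Ethics", "Supply_Chain_Management"],
}

# Invert pillar -> categories into category -> pillar
CATEGORY_TO_PILLAR = {
    category: pillar
    for pillar, categories in PILLAR_CATEGORIES.items()
    for category in categories
}

def pillar_for(category):
    """Return the main ESG pillar for a category, or "Unknown" if unmapped."""
    return CATEGORY_TO_PILLAR.get(category, "Unknown")

print(pillar_for("GHG_Emissions"))
print(pillar_for("Not_A_Category"))
```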

# Aggregation of Classification Results

In [None]:
# Import pandas for data manipulation and analysis
import pandas as pd

# Initialize counters for each ESG category
# These will track how many sentences fall into each category
environmental_count = 0
social_count = 0
governance_count = 0

# Loop through each sentence in our list of sentences
for sentence in sentences:
    # Call the classify_sentence function to get the predicted ESG category
    # This returns the detailed category, main pillar, and confidence score
    _, main_pillar, _ = classify_sentence(sentence)
    
    # Increment the appropriate counter based on the predicted main pillar
    if main_pillar == "Environmental":
        environmental_count += 1
    elif main_pillar == "Social":
        social_count += 1
    elif main_pillar == "Governance":
        governance_count += 1

# Create a dictionary with the counts for each category
# This will be used to create our DataFrame
esg_counts = {
    "Environmental": [environmental_count],
    "Social": [social_count],
    "Governance": [governance_count]
}

# Create a pandas DataFrame with the counts
# The index is set to "Report" to indicate these counts are for the entire document
esg_df = pd.DataFrame(esg_counts, index=["Report"])

# Print the resulting DataFrame showing the distribution of ESG categories
print("ESG Category Distribution:")
print(esg_df)

# Calculate and print the total number of sentences classified
total_sentences = environmental_count + social_count + governance_count
print(f"\nTotal sentences classified: {total_sentences}")
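
The three explicit counters can also be expressed with `collections.Counter`. A side benefit is that sentences whose pillar is `"Unknown"` can be reported separately instead of being silently left out of the total. This is a sketch with hard-coded pillar labels standing in for the output of `classify_sentence`; `count_pillars` and the `"Other"` bucket are hypothetical.

```python
from collections import Counter

PILLARS = ["Environmental", "Social", "Governance"]

def count_pillars(pillar_labels):
    """Count labels per pillar, always including all three pillars (zero if absent)."""
    counts = Counter(pillar_labels)
    result = {pillar: counts.get(pillar, 0) for pillar in PILLARS}
    # Anything outside the three pillars (e.g. "Unknown") is reported separately
    result["Other"] = sum(n for label, n in counts.items() if label not in PILLARS)
    return result

# Stand-in labels for illustration; in the notebook these would come from classify_sentence
labels = ["Environmental", "Social", "Environmental", "Unknown", "Governance"]
print(count_pillars(labels))
```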

# Heatmap Visualization of ESG Classification Results

In [None]:
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size for better visualization
plt.figure(figsize=(10, 4))

# Create a heatmap using Seaborn
# - data: The DataFrame containing our ESG category counts
# - annot=True: Display the numerical values in each cell
# - cmap="YlGnBu": Use the Yellow-Green-Blue color palette
# - fmt="d": Format annotations as integers (d = decimal integer)
# - linewidths=.5: Add thin lines between cells for better separation
# - cbar_kws: Label the color bar (the bar itself is shown by default)
ax = sns.heatmap(esg_df, 
                annot=True,        # Show the count values in each cell
                cmap="YlGnBu",     # Use the Yellow-Green-Blue color palette
                fmt="d",           # Format annotations as integers
                linewidths=.5,     # Add thin lines between cells
                cbar_kws={'label': 'Sentence Count'})  # Label the color bar

# Set the title for the heatmap
plt.title("ESG Theme Distribution in Report", fontsize=14)

# Customize the axis labels
plt.xlabel("ESG Categories", fontsize=12)
plt.ylabel("Document", fontsize=12)

# Adjust layout to prevent clipping of labels
plt.tight_layout()

# Display the heatmap
plt.show()
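
Raw counts depend on document length, which makes reports of different sizes hard to compare. One option is to plot each pillar's share of sentences instead. The helper below is a sketch under that assumption, and `to_percentages` is a hypothetical name. Because percentages are floats, a heatmap of them would need a float format such as `fmt=".1f"` rather than `fmt="d"`.

```python
def to_percentages(counts):
    """Convert {pillar: count} into {pillar: percentage of the total}."""
    total = sum(counts.values())
    if total == 0:
        # No classified sentences: report 0% everywhere instead of dividing by zero
        return {pillar: 0.0 for pillar in counts}
    return {pillar: 100.0 * n / total for pillar, n in counts.items()}

# Made-up counts, for illustration only
shares = to_percentages({"Environmental": 6, "Social": 3, "Governance": 3})
print(shares)
```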

# Complete Workflow: End-to-End ESG Classification and Visualization

This cell integrates all the previous components into a complete workflow:
1. Process the input text to extract individual sentences
2. Classify each sentence into an ESG category (Environmental, Social, Governance)
3. Aggregate the classification results into a structured DataFrame
4. Visualize the distribution of ESG themes using a heatmap

The result is a quick, sentence-level view of how a corporate report (or any other text document)
distributes its attention across the three ESG pillars.
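
The first three steps can be sketched as a single function. To keep the example runnable without downloading any model, the splitter and classifier are passed in as arguments and replaced here by deliberately naive stand-ins (`naive_split` and `keyword_classifier` are hypothetical); in this notebook those roles are played by spaCy and `classify_sentence`.

```python
from collections import Counter

def analyze(text, split, classify):
    """Split text, classify each sentence, and return per-pillar counts."""
    sentences = [s for s in split(text) if s]
    pillars = [classify(s) for s in sentences]
    counts = Counter(pillars)
    return {p: counts.get(p, 0) for p in ("Environmental", "Social", "Governance")}

def naive_split(text):
    # Stand-in for spaCy: split on periods only
    return [part.strip() for part in text.split(".")]

def keyword_classifier(sentence):
    # Stand-in for ESG-BERT: a crude keyword lookup, for illustration only
    lowered = sentence.lower()
    if "emission" in lowered or "water" in lowered:
        return "Environmental"
    if "board" in lowered or "audit" in lowered:
        return "Governance"
    return "Social"

text = "We cut emissions. Our board added independent members. We fund local schools."
print(analyze(text, naive_split, keyword_classifier))
```

Passing the splitter and classifier as arguments also makes the aggregation logic easy to test in isolation from the model.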

In [None]:
# Step 1: Process the input text to split it into sentences
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(input_text)
sentences = [sent.text.strip() for sent in doc.sents]
print(f"Processing {len(sentences)} sentences...")

# Step 2: Classify each sentence using the classify_sentence function
# Initialize a list to store detailed results
classification_results = []

# Process each sentence
for sentence in sentences:
    detailed_category, main_pillar, confidence = classify_sentence(sentence)
    classification_results.append({
        'sentence': sentence,
        'detailed_category': detailed_category,
        'main_pillar': main_pillar,
        'confidence': confidence
    })
    
# Step 3: Aggregate the classification results into a Pandas DataFrame
import pandas as pd

# Create a detailed DataFrame with all classification results
results_df = pd.DataFrame(classification_results)

# Count occurrences of each main pillar
pillar_counts = results_df['main_pillar'].value_counts().to_dict()

# Ensure all categories are represented (even if count is 0)
esg_counts = {
    "Environmental": pillar_counts.get("Environmental", 0),
    "Social": pillar_counts.get("Social", 0),
    "Governance": pillar_counts.get("Governance", 0)
}

# Create the aggregated DataFrame for visualization
esg_df = pd.DataFrame([esg_counts], index=["Report"])

# Display the aggregated results
print("\nESG Category Distribution:")
print(esg_df)

# Also display the detailed category distribution
print("\nDetailed ESG Category Distribution:")
detailed_counts = results_df['detailed_category'].value_counts()
print(detailed_counts)

# Step 4: Generate and display the heatmap visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Create the visualization
plt.figure(figsize=(10, 4))
ax = sns.heatmap(esg_df, 
                annot=True,
                cmap="YlGnBu",
                fmt="d",
                linewidths=.5,
                cbar_kws={'label': 'Sentence Count'})

plt.title("ESG Theme Distribution in Report", fontsize=14)
plt.xlabel("ESG Categories", fontsize=12)
plt.ylabel("Document", fontsize=12)
plt.tight_layout()

# Display the final visualization
plt.show()

# Print a summary of the analysis
print("\nAnalysis Summary:")
print(f"Total sentences analyzed: {len(sentences)}")
# Use at least 1 as the denominator so an empty input does not raise ZeroDivisionError
denominator = max(len(sentences), 1)
for pillar, count in esg_counts.items():
    print(f"{pillar} themes: {count} sentences ({count / denominator * 100:.1f}%)")
# Only report a dominant theme when at least one sentence was assigned to a pillar
dominant = max(esg_counts, key=esg_counts.get) if any(esg_counts.values()) else "None"
print(f"Dominant theme: {dominant}")