# Machine Learning Approaches to Measuring the Impact of Mergers and Acquisitions on Stock Returns

## Project Description

### Overview

The impact of mergers and acquisitions (M&A) on the stock returns of both acquiring and target firms has been extensively studied. However, most of this research is limited to specific sectors or regions[^1].

In this project, we aim to leverage **unsupervised machine learning models** to cluster M&A transactions based on their key determinants and fundamental characteristics. These clusters can then be used to identify potential **short-term trading opportunities**.

### Methodology

We will use the **CRISP-DM**[^2] framework, originally published in 1999, which remains the *de facto* standard for data mining[^3] and is widely recognized for its strong alignment with business objectives.

The CRISP-DM process consists of six main steps:

1. **Business Understanding** – Define the project objectives and requirements from a business perspective.  
2. **Data Understanding** – Collect initial data and become familiar with its structure and quality.  
3. **Data Preparation** – Build the final dataset ready for modeling through cleaning, selection, and transformation.  
4. **Modeling** – Apply machine learning or statistical models and tune their parameters.  
5. **Evaluation** – Ensure that the model meets both technical and business objectives.  
6. **Deployment** – Deliver the final model or insights into production or decision-making environments.

Please note that **CRISP-DM is an iterative process** — insights from one phase may lead to revisiting and refining earlier phases.

In this project, we will focus on the **first five steps**.

## Bibliography
[^1]: Campa, J. M., & Hernando, I. (2006). M&As performance in the European financial industry. Journal of Banking & Finance.
[^2]: Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM Consortium.
[^3]: Schröer, C., Kruse, F., & Gómez, J. M. (2021). A systematic literature review on applying CRISP-DM process model. Procedia Computer Science.

## 1. Business Understanding

### 1.1 Overview

The objective of this project is to classify M&A transactions based on their determinants and fundamental characteristics. This classification aims to help investors identify potential trading opportunities involving the target firms, acquiring firms, or their comparable companies.

To align with academic requirements, we will limit the analysis to companies incorporated in the Euro area (Eurozone). If this subset still contains more than 500 transactions, we will randomly select 500 instances for the study.
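The sampling rule above can be sketched as follows. This is a minimal illustration, assuming the Eurozone transactions are represented by a list of identifiers; the function name `sample_transactions` and the seed value are hypothetical and not part of the project code. Fixing the seed keeps the selection reproducible across runs.

```python
import random

def sample_transactions(transaction_ids, max_size=500, seed=42):
    """Return all ids if there are at most max_size, else a reproducible random subset."""
    ids = list(transaction_ids)
    if len(ids) <= max_size:
        return ids
    rng = random.Random(seed)  # local generator: leaves the global random state untouched
    return rng.sample(ids, max_size)

# Hypothetical ids standing in for Eurozone transactions
selected = sample_transactions(range(1200))
```

When the transactions are held in a pandas DataFrame, `df.sample(n=500, random_state=42)` achieves the same effect.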

After completing the Data Understanding step, the following information should be collected for each transaction:

- **Business description** of the target and the acquiring company, obtained from databases or the internet.  
- **Transaction description**, including the rationale of the deal, based on filings, analyses, or news sources.  
- **Financial data** for both the target and the acquirer.  
- **Financial metrics** specific to the transaction itself.

> **Note:** All information should be consistent with the transaction announcement date to prevent data leakage.
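One way to enforce this point-in-time rule is to keep, for each company, only the most recent statement published on or before the announcement date. The sketch below assumes each statement carries a publication date; the helper name `latest_statement_before` and all figures are illustrative.

```python
from datetime import date

def latest_statement_before(statements, announcement_date):
    """Pick the most recent statement published on or before the announcement date.

    `statements` is a list of (publication_date, payload) tuples. Returns None
    when nothing had been published yet, so the caller can treat it as missing.
    """
    eligible = [s for s in statements if s[0] <= announcement_date]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s[0])

# Hypothetical filings for one company (figures are made up)
filings = [
    (date(2021, 3, 15), {"fiscal_year": 2020, "revenue": 410.0}),
    (date(2022, 3, 14), {"fiscal_year": 2021, "revenue": 455.0}),
    (date(2023, 3, 13), {"fiscal_year": 2022, "revenue": 502.0}),
]
chosen = latest_statement_before(filings, date(2022, 9, 1))
```

Filtering on the publication date rather than the fiscal year end matters: annual figures are typically released months after the period closes, so a fiscal-year filter alone can still leak information.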

---

### 1.2 Business Description

In addition to administrative data, such as business name or year of incorporation, a business description should include the following items:

- **Mission statement**  
- **Products and services**  
- **Targeted markets**  
- **Targeted customers**  
- **Primary technologies** (if applicable)  
- **Competitive advantages**  
- **Business objectives**

---

### 1.3 Transaction Description

Since the **Transaction Comments** field already summarizes financial details, the description for each M&A transaction should focus on the **rationale** behind the deal. Research on M&A suggests that the rationale may include one or more of the following categories:

- **Operational** – economies of scale, scope, or efficiency gains.  
- **Financial** – diversification, access to capital markets, lowering financial costs, deleveraging, or leveraging.  
- **Regulatory** – obtaining licenses, patents, or meeting regulatory requirements.  
- **Restructuring** – restructuring shareholding structure, voting rights, legal structure.  
- **Technology** – acquiring R&D pipelines, innovation capabilities, or fostering an innovative culture.
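As a first pass, the categories above could be turned into binary features with a simple keyword match on the transaction text. The keyword stems below are assumptions chosen for illustration and would need refinement against real deal descriptions; the `rationale_` prefix mirrors the column naming used later in this notebook.

```python
import re

# Illustrative keyword stems per category (assumptions, not a validated lexicon)
RATIONALE_KEYWORDS = {
    "operational": ["economies of scale", "synerg", "efficienc", "cost saving"],
    "financial": ["diversif", "deleverag", "refinanc", "capital market"],
    "regulatory": ["licen", "patent", "regulatory"],
    "restructuring": ["restructur", "voting right", "spin-off", "squeeze-out"],
    "technology": ["r&d", "innovation", "technology", "pipeline"],
}

def tag_rationale(text):
    """Return a dict of 0/1 flags, one per rationale category."""
    lowered = text.lower() if isinstance(text, str) else ""
    flags = {}
    for category, stems in RATIONALE_KEYWORDS.items():
        pattern = "|".join(re.escape(stem) for stem in stems)
        flags[f"rationale_{category}"] = int(bool(re.search(pattern, lowered)))
    return flags

example = tag_rationale(
    "The acquisition is expected to deliver cost synergies and strengthen the R&D pipeline."
)
```

A deal can carry several flags at once, which matches the observation that the rationale may span more than one category. Keyword matching is crude; it is meant as a baseline to compare against embedding-based approaches.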

---

### 1.4 Financial Data

For each M&A transaction, the following financial data should be collected for both the target and acquiring companies:

- **Balance Sheet**: total assets, total liabilities, shareholders’ equity, cash and cash equivalents, short-term and long-term debt.  
- **Income Statement**: revenue/sales, cost of goods sold (COGS), operating income (EBIT), net income.  
- **Profitability Metrics**: gross margin, operating margin, net margin, return on assets (ROA), return on equity (ROE).  
- **Liquidity and Leverage Metrics**: current ratio, quick ratio, debt-to-equity ratio, debt-to-assets ratio.  
- **Market Capitalization** at the announcement date.

> These data can be collected through APIs such as Yahoo Finance or other financial databases.
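The profitability and leverage metrics listed above can be derived from the raw statement items. The sketch below assumes the items are available in a dictionary with the hypothetical keys shown, and treats debt as short-term plus long-term debt (some sources use total liabilities instead). Current and quick ratios are omitted because they need current assets and liabilities, which follow the same pattern.

```python
def safe_div(numerator, denominator):
    """Divide, returning None when an input is missing or the denominator is zero."""
    if numerator is None or denominator in (None, 0):
        return None
    return numerator / denominator

def financial_ratios(fin):
    """Compute margin, return, and leverage ratios from raw statement items."""
    gross_profit = None
    if fin.get("revenue") is not None and fin.get("cogs") is not None:
        gross_profit = fin["revenue"] - fin["cogs"]
    return {
        "gross_margin": safe_div(gross_profit, fin.get("revenue")),
        "operating_margin": safe_div(fin.get("ebit"), fin.get("revenue")),
        "net_margin": safe_div(fin.get("net_income"), fin.get("revenue")),
        "roa": safe_div(fin.get("net_income"), fin.get("total_assets")),
        "roe": safe_div(fin.get("net_income"), fin.get("equity")),
        "debt_to_equity": safe_div(fin.get("total_debt"), fin.get("equity")),
        "debt_to_assets": safe_div(fin.get("total_debt"), fin.get("total_assets")),
    }

# Made-up figures in millions for a single company
ratios = financial_ratios({
    "revenue": 200.0, "cogs": 120.0, "ebit": 30.0, "net_income": 20.0,
    "total_assets": 400.0, "equity": 100.0, "total_debt": 150.0,
})
```

Returning `None` instead of raising keeps incomplete records in the dataset, so missing values can be handled uniformly during data preparation.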

---

### 1.5 Financial Metrics of Transactions

- **Transaction Value** (in USD millions, using historical exchange rates if applicable) — already available in the dataset.  
- **Deal Premium** — difference between offer price and pre-announcement market price, if available.  
- **Valuation Ratios** — Price-to-Earnings (PER), Price-to-EBITDA, Price-to-Revenue, Price-to-Book Value.

> **Note:** All financial data should correspond to the period prior to or at the announcement date to prevent data leakage. Features should be normalized or scaled for clustering to account for differences in company size.
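The transaction metrics and the scaling note above can be illustrated together. This sketch assumes the premium is expressed as a fraction of the pre-announcement price (a common convention, though the raw difference could also be used) and applies a log transform before standardizing deal values, since deal sizes tend to be heavily skewed. All function names and figures are hypothetical.

```python
import math

def deal_premium(offer_price, pre_announcement_price):
    """Premium as a fraction of the pre-announcement share price."""
    return (offer_price - pre_announcement_price) / pre_announcement_price

def valuation_multiple(price, fundamental):
    """Generic multiple; returns None when the denominator is missing or not positive."""
    if fundamental is None or fundamental <= 0:
        return None
    return price / fundamental

def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

premium = deal_premium(offer_price=46.0, pre_announcement_price=40.0)

# Made-up deal values in USD millions, log-transformed and then standardized
deal_values_musd = [60.0, 250.0, 1200.0, 9800.0]
scaled = standardize([math.log1p(v) for v in deal_values_musd])
```

Multiples with a negative or zero denominator (for example, P/E for a loss-making target) are returned as `None` because they are not meaningful as distances in a clustering space.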

## 2. Data Understanding
5 points

In this step, the goal is to **explore and understand the dataset** to guide preprocessing and modeling. Key activities include:

1. **Import and Format:** Import the dataset, parse dates, and rename columns.
2. **Explore Feature Types:** Identify numeric, categorical, and textual variables, and understand their meaning and relevance.
3. **Statistical Summaries:** Compute basic statistics for numeric features (mean, median, variance) and counts/frequencies for categorical variables.
4. **Text Analysis:** Count tokens, analyze text lengths, and inspect sample entries to understand language characteristics, which will inform embedding and modeling choices.
5. **Initial Quality Checks:** Detect missing values, outliers, or inconsistencies.
6. **Insights for Modeling:** Identify which features will be used for clustering or dimensionality reduction, ensuring they are ready for preprocessing in the next step.

Column Labels:

- **All Transactions Announced Date** – Date of the official transaction announcement.
- **All Transactions Announced Date (Including Bids and Letters of Intent)** – Date of unofficial or preliminary announcement, including bids and letters of intent.
- **Target/Issuer** – Name of the target company.
- **Exchange Ticker** – Exchange ticker of the target/issuer, if listed.
- **Transaction Types** – Type of transaction; should be limited to *Merger/Acquisition*.
- **Transaction Status** – Current status of the transaction.
- **Total Transaction Value ($USDmm, Historical rate)** – Value of the transaction in USD millions, based on historical exchange rates.
- **Buyers/Investors** – Name of the acquiring company.
- **Sellers** – Name of the sellers or shareholders of the target.
- **CIQ Transaction ID** – Internal Capital IQ identifier.
- **Transaction Comments** – Main financial considerations or notes related to the transaction.
- **Country/Region of Incorporation [Target/Issuer]** – Country where the target company is incorporated.
- **Country/Region of Incorporation [Sellers]** – Country where the sellers are incorporated.
- **Target Security Types** – Type of the security of the target.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Load the dataset
data_file = "MA Transactions Over 50M.xlsx"

# Check if file exists
if Path(data_file).exists():
    df = pd.read_excel(data_file, header=1)
    print(f"✓ Dataset loaded successfully!")
    print(f"  Shape: {df.shape}")
    print(f"  Columns: {df.shape[1]}")
    print(f"  Transactions: {df.shape[0]}")
else:
    print(f"✗ File not found: {data_file}")
    print("Please ensure 'MA Transactions Over 50M.xlsx' is in the same directory")

In [None]:
# Display first few rows
print("Sample Transactions:")
print("="*100)
df.head()

In [None]:
# Explore data types and missing values
print("Dataset Information:")
print("="*70)
print(f"\nData Types:")
print(df.dtypes)

print(f"\n\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Missing %': missing_pct.values
})
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %', ascending=False)

In [None]:
# Analyze transaction values
print("Transaction Value Analysis:")
print("="*70)

if 'Total Transaction Value ($USDmm, Historical rate)' in df.columns:
    values = df['Total Transaction Value ($USDmm, Historical rate)'].dropna()

    print(f"\nDescriptive Statistics:")
    print(f"  Mean: ${values.mean():.2f}M")
    print(f"  Median: ${values.median():.2f}M")
    print(f"  Std Dev: ${values.std():.2f}M")
    print(f"  Min: ${values.min():.2f}M")
    print(f"  Max: ${values.max():.2f}M")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Histogram
    axes[0].hist(values, bins=30, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Transaction Value ($M USD)')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Transaction Values')
    axes[0].grid(alpha=0.3)

    # Box plot
    axes[1].boxplot(values, vert=True)
    axes[1].set_ylabel('Transaction Value ($M USD)')
    axes[1].set_title('Transaction Value Box Plot')
    axes[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

In [None]:
# Analyze categorical features
print("Categorical Feature Analysis:")
print("="*70)

# Country distribution
if 'Country/Region of Incorporation [Target/Issuer]' in df.columns:
    print("\nTop 10 Target Countries:")
    country_counts = df['Country/Region of Incorporation [Target/Issuer]'].value_counts().head(10)
    print(country_counts)

    # Visualization
    plt.figure(figsize=(12, 6))
    country_counts.plot(kind='barh', color='skyblue')
    plt.xlabel('Number of Transactions')
    plt.title('Top 10 Target Countries in Eurozone M&A')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()

# Transaction status
if 'Transaction Status' in df.columns:
    print("\n\nTransaction Status Distribution:")
    status_counts = df['Transaction Status'].value_counts()
    print(status_counts)

    # Pie chart
    plt.figure(figsize=(8, 8))
    status_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90)
    plt.title('Transaction Status Distribution')
    plt.ylabel('')
    plt.show()

In [None]:
# Analyze transaction comments (text data)
print("Text Data Analysis:")
print("="*70)

if 'Transaction Comments' in df.columns:
    # Text length analysis
    df['comment_length'] = df['Transaction Comments'].fillna('').astype(str).apply(len)
    df['comment_word_count'] = df['Transaction Comments'].fillna('').astype(str).apply(lambda x: len(x.split()))

    print(f"\nComment Length Statistics:")
    print(f"  Average characters: {df['comment_length'].mean():.0f}")
    print(f"  Average words: {df['comment_word_count'].mean():.0f}")
    print(f"  Max length: {df['comment_length'].max():.0f} characters")

    # Sample a few transaction comments
    print(f"\n\nSample Transaction Descriptions:")
    print("="*70)
    for idx, row in df.head(3).iterrows():
        print(f"\nTransaction #{idx + 1}:")
        print(f"Target: {row['Target/Issuer']}")
        print(f"Buyer: {row['Buyers/Investors']}")
        comment = row['Transaction Comments']
        if pd.notna(comment):
            print(f"Description: {str(comment)[:300]}...")
        print("-"*70)
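Among the quality checks listed for this step, outlier detection on transaction values can be sketched with the interquartile range (IQR) rule. The example uses only the standard library and made-up values; the function names and the conventional multiplier `k=1.5` are illustrative choices, not a prescribed method.

```python
import statistics

def iqr_outlier_bounds(values, k=1.5):
    """Return (lower, upper) fences using the interquartile range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values lying outside the IQR fences, ignoring missing entries."""
    present = [v for v in values if v is not None]
    lower, upper = iqr_outlier_bounds(present, k)
    return [v for v in present if v < lower or v > upper]

# Made-up transaction values in USD millions, with one missing entry
values_musd = [55.0, 60.0, 72.0, 80.0, 95.0, 110.0, 130.0, None, 150.0, 4200.0]
outliers = flag_outliers(values_musd)
```

Because deal values are usually right-skewed, it may be more informative to apply the same check to log-transformed values, so that only truly extreme deals are flagged.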

## 3. Data Preparation
5 points

In this step, the goal is to **clean, transform, collect, and structure the data** to make it suitable for modeling. Key activities include:

1. **Handling Missing Values:** Impute or remove missing data to ensure consistency across features.
2. **Encoding Categorical Variables:** Convert categorical data into numeric representations (one-hot encoding, ordinal encoding, or embeddings).
3. **Scaling Numeric Features:** Normalize or standardize numeric variables to make them comparable for clustering or distance-based methods.
4. **Text Processing:** Clean textual data (remove punctuation, lowercase, etc.) and convert into embeddings (e.g., SBERT) for use in downstream analysis.
5. **Feature Engineering:** Create new features, combine or transform existing ones, and reduce dimensionality if needed (PCA, UMAP).
6. **Final Dataset Assembly:** Concatenate numeric, categorical, and textual features into a single matrix ready for unsupervised learning algorithms.

> **Note:** This step ensures that all features are in a consistent format and scale, allowing clustering or other unsupervised methods to effectively detect patterns.
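The numeric and categorical parts of these activities can be sketched end to end on toy data. This is a simplified illustration using only the standard library: it imputes with the median, log-transforms and standardizes one numeric column, one-hot encodes one categorical column, and concatenates the results. Text embeddings are left out here; they would be appended as additional columns in the same way. All names and values are made up.

```python
import math
import statistics

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in column if v is not None)
    return [median if v is None else v for v in column]

def zscore(column):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def one_hot(column):
    """Encode categories as 0/1 indicator lists, in sorted category order."""
    categories = sorted(set(column))
    return [[int(v == c) for c in categories] for v in column], categories

# Toy inputs: deal value in USD millions and target country
deal_values = [60.0, None, 1200.0, 250.0]
countries = ["France", "Germany", "France", "Italy"]

numeric = zscore([math.log1p(v) for v in impute_median(deal_values)])
encoded, categories = one_hot(countries)

# Final matrix: one row per transaction, numeric column followed by indicators
X = [[num] + row for num, row in zip(numeric, encoded)]
```

In practice, scikit-learn's `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` provide the same operations and can be fitted once and reused.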

To reduce the workload, you can limit the analysis to a sample of 300 transactions with the target in the Eurozone.

In this particular project, you will have to collect additional information:
* Collect a business description for each target and acquirer.
* Extract information from the Transaction Comments.
* Build a transaction description from news and analysis.
* Find financial data for both the target and the acquirer.



### 3.1 Data Enrichment with Business Insights

We'll extract deal rationale, financial metrics, and key entities from transaction descriptions to enrich our feature set.

In [None]:
# Load enriched data if available, otherwise run enrichment
import re
from collections import Counter

enriched_file = "../eurozone_transactions_enriched.csv"

if Path(enriched_file).exists():
    df_enriched = pd.read_csv(enriched_file)
    print(f"✓ Loaded enriched dataset: {df_enriched.shape}")
else:
    print("✗ Enriched data not found. Please run data_loader.py first.")
    print("  Command: cd .. && python data_loader.py")

# Display new columns created during enrichment
if Path(enriched_file).exists():
    new_cols = [c for c in df_enriched.columns if c not in df.columns]
    print(f"\nNew enrichment features ({len(new_cols)}):")
    for col in new_cols[:15]:
        print(f"  - {col}")
    if len(new_cols) > 15:
        print(f"  ... and {len(new_cols) - 15} more")

In [None]:
# Analyze deal rationale distribution
if Path(enriched_file).exists():
    rationale_cols = [c for c in df_enriched.columns if c.startswith('rationale_')]

    if rationale_cols:
        print("Deal Rationale Analysis:")
        print("="*70)

        rationale_data = df_enriched[rationale_cols].sum().sort_values(ascending=False)

        # Clean names for display
        rationale_data.index = [c.replace('rationale_', '').replace('_', ' ').title()
                               for c in rationale_data.index]

        print("\nTotal mentions by category:")
        print(rationale_data)

        # Visualization
        plt.figure(figsize=(10, 6))
        rationale_data.plot(kind='bar', color='coral')
        plt.xlabel('Deal Rationale Category')
        plt.ylabel('Total Mentions Across All Deals')
        plt.title('Distribution of M&A Deal Rationales')
        plt.xticks(rotation=45, ha='right')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()

### 3.2 Feature Engineering and Preprocessing

In [None]:
# Run preprocessing pipeline
print("Running preprocessing pipeline...")
print("="*70)
print("This step will:")
print("  1. Extract keywords from transaction descriptions")
print("  2. Handle missing values")
print("  3. Scale numeric features")
print("  4. Encode categorical features")
print("  5. Generate text embeddings using Sentence-BERT")
print("\nThis may take a few minutes...")
print("="*70)

# Check if preprocessing already done
if Path("../X_final.npy").exists():
    print("\n✓ Preprocessed data already exists!")
    X_final = np.load("../X_final.npy")
    print(f"  Feature matrix shape: {X_final.shape}")
    print(f"  Total features: {X_final.shape[1]}")
else:
    print("\n⚠ Run preprocessing.py first:")
    print("  Command: python ../preprocessing.py")

## 4. Modeling

After dimensionality reduction and clustering, you can optionally analyze the cumulative abnormal return (CAR) of each cluster. A detailed walkthrough of the CAR computation is available [here](https://eventstudy.de/blog/cumulative-abnormal-return/).
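As a rough sketch of the CAR computation, the example below uses the market model, one common specification: the stock's expected return is `alpha + beta * market_return`, with `alpha` and `beta` estimated by least squares over a pre-event estimation window. The abnormal return is the actual return minus the expected one, and the CAR is the sum of abnormal returns over the event window. All returns are made up, and the window lengths are far shorter than would be used in practice.

```python
def fit_market_model(stock_returns, market_returns):
    """Ordinary least squares fit of stock = alpha + beta * market."""
    n = len(stock_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(market_returns, stock_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta

def cumulative_abnormal_return(stock_event, market_event, alpha, beta):
    """Sum of abnormal returns (actual minus model-expected) over the event window."""
    return sum(s - (alpha + beta * m) for s, m in zip(stock_event, market_event))

# Made-up daily returns: estimation window, then a 3-day event window
est_market = [0.010, -0.005, 0.002, 0.007, -0.003]
est_stock = [0.001 + 1.2 * m for m in est_market]  # constructed with alpha=0.001, beta=1.2
alpha, beta = fit_market_model(est_stock, est_market)

car = cumulative_abnormal_return(
    stock_event=[0.012, 0.150, 0.004],
    market_event=[0.003, 0.001, -0.002],
    alpha=alpha,
    beta=beta,
)
```

Once a CAR is available for each transaction, averaging it within each cluster gives a simple way to compare how the market reacted to different deal profiles.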


### 4.1 Dimensionality Reduction with PCA

In [None]:
# Load processed data and apply PCA
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE

if Path("../X_final.npy").exists():
    X = np.load("../X_final.npy")
    print(f"Loaded feature matrix: {X.shape}")

    # Apply PCA
    print("\nApplying PCA (preserving 95% variance)...")
    pca = PCA(n_components=0.95, random_state=42)
    X_pca = pca.fit_transform(X)

    print(f"✓ PCA complete!")
    print(f"  Original dimensions: {X.shape[1]}")
    print(f"  Reduced dimensions: {X_pca.shape[1]}")
    print(f"  Variance explained: {pca.explained_variance_ratio_.sum():.2%}")

    # Plot variance explained
    plt.figure(figsize=(10, 5))
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(cumsum, linewidth=2)
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title('PCA: Cumulative Variance Explained')
    plt.grid(alpha=0.3)
    plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print("✗ Preprocessed data not found. Run preprocessing.py first.")

### 4.2 K-Means Clustering: Finding Optimal K

In [None]:
# Find optimal number of clusters
if 'X_pca' in locals():
    print("Finding optimal K using Silhouette Score and Elbow Method...")
    print("="*70)

    K_range = range(2, 11)
    silhouette_scores = []
    inertias = []

    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_pca)

        score = silhouette_score(X_pca, labels)
        inertia = kmeans.inertia_

        silhouette_scores.append(score)
        inertias.append(inertia)

        print(f"K={k}: Silhouette={score:.4f}, Inertia={inertia:.2f}")

    # Visualize metrics
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Silhouette scores
    axes[0].plot(K_range, silhouette_scores, marker='o', linewidth=2, markersize=8)
    axes[0].set_xlabel('Number of Clusters (K)')
    axes[0].set_ylabel('Silhouette Score')
    axes[0].set_title('Silhouette Score vs. K')
    axes[0].grid(alpha=0.3)
    best_k = K_range[np.argmax(silhouette_scores)]
    axes[0].axvline(x=best_k, color='r', linestyle='--',
                   label=f'Best K={best_k}')
    axes[0].legend()

    # Elbow plot
    axes[1].plot(K_range, inertias, marker='o', linewidth=2, markersize=8)
    axes[1].set_xlabel('Number of Clusters (K)')
    axes[1].set_ylabel('Inertia (Within-cluster sum of squares)')
    axes[1].set_title('Elbow Method')
    axes[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

    print(f"\n✓ Optimal K based on Silhouette Score: {best_k}")
else:
    print("✗ PCA data not available. Run previous cell first.")

### 4.3 Apply Clustering and Visualize Results

In [None]:
# Apply K-Means with K=3 (recommended based on business analysis)
if 'X_pca' in locals():
    K = 3  # Can be changed to 4 or other values
    print(f"Applying K-Means clustering with K={K}...")

    kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_pca)

    print(f"✓ Clustering complete!")
    print(f"\nCluster distribution:")
    unique, counts = np.unique(cluster_labels, return_counts=True)
    for cluster_id, count in zip(unique, counts):
        print(f"  Cluster {cluster_id}: {count} transactions ({count/len(cluster_labels)*100:.1f}%)")

    # t-SNE for 2D visualization
    print(f"\nGenerating t-SNE visualization...")
    tsne = TSNE(n_components=2, random_state=42, perplexity=30)
    X_tsne = tsne.fit_transform(X_pca)

    # Create visualization
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
                         c=cluster_labels, cmap='viridis',
                         s=50, alpha=0.6, edgecolors='black', linewidth=0.5)
    plt.colorbar(scatter, label='Cluster')
    plt.xlabel('t-SNE Dimension 1')
    plt.ylabel('t-SNE Dimension 2')
    plt.title(f'M&A Transaction Clusters (K={K}) - t-SNE Visualization')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    print("✓ Visualization complete!")
else:
    print("✗ PCA data not available. Run previous cells first.")

## 5. Evaluation

### 5.1 Cluster Business Insights

Now we'll analyze each cluster to understand the business characteristics, deal rationales, and company relationships.

In [None]:
# Load clustered data and perform detailed analysis
clustered_file = "../eurozone_transactions_clustered_k3.csv"

if Path(clustered_file).exists():
    df_clustered = pd.read_csv(clustered_file)
    print(f"✓ Loaded clustered data: {df_clustered.shape}")

    # Add cluster labels from our analysis
    if 'cluster_labels' in locals():
        df_enriched_clustered = df_enriched.copy()
        df_enriched_clustered['Cluster'] = cluster_labels
    else:
        df_enriched_clustered = df_clustered

    print(f"\nCluster sizes:")
    print(df_enriched_clustered['Cluster'].value_counts().sort_index())
else:
    print("✗ Clustered data not found. Run modeling.py first.")
    print("  Command: python ../modeling.py")

In [None]:
# Analyze cluster characteristics
if 'df_enriched_clustered' in locals():
    print("CLUSTER BUSINESS ANALYSIS")
    print("="*70)

    for cluster_id in sorted(df_enriched_clustered['Cluster'].unique()):
        cluster_data = df_enriched_clustered[df_enriched_clustered['Cluster'] == cluster_id]

        print(f"\n{'#'*70}")
        print(f"# CLUSTER {cluster_id}")
        print(f"{'#'*70}")
        print(f"Size: {len(cluster_data)} transactions ({len(cluster_data)/len(df_enriched_clustered)*100:.1f}%)")

        # Average deal value
        # NOTE: expm1 assumes this column was log1p-transformed upstream;
        # drop the expm1 call if the column holds raw USD millions.
        if 'Total Transaction Value ($USDmm, Historical rate)' in cluster_data.columns:
            avg_value = np.expm1(cluster_data['Total Transaction Value ($USDmm, Historical rate)'].mean())
            print(f"Average Deal Value: ${avg_value:.2f}M USD")

        # Dominant deal rationale
        rationale_cols = [c for c in cluster_data.columns if c.startswith('rationale_')]
        if rationale_cols:
            rationale_sums = cluster_data[rationale_cols].sum()
            dominant = rationale_sums.idxmax().replace('rationale_', '').replace('_', ' ').title()
            print(f"Dominant Rationale: {dominant}")

        # Top countries
        if 'Country/Region of Incorporation [Target/Issuer]' in cluster_data.columns:
            print(f"\nTop 3 Countries:")
            for country, count in cluster_data['Country/Region of Incorporation [Target/Issuer]'].value_counts().head(3).items():
                pct = (count / len(cluster_data)) * 100
                print(f"  → {country}: {count} ({pct:.1f}%)")

        # Top sectors
        if 'Sector' in cluster_data.columns:
            print(f"\nTop 3 Sectors:")
            for sector, count in cluster_data['Sector'].value_counts().head(3).items():
                if pd.notna(sector):
                    pct = (count / len(cluster_data)) * 100
                    print(f"  → {sector}: {count} ({pct:.1f}%)")

        # Active acquirers
        if 'Buyers/Investors' in cluster_data.columns:
            buyers = cluster_data['Buyers/Investors'].value_counts()
            active_buyers = buyers[buyers > 1]
            if len(active_buyers) > 0:
                print(f"\nActive Acquirers (multiple deals):")
                for buyer, count in active_buyers.head(3).items():
                    print(f"  → {buyer}: {count} acquisitions")

### 5.2 Cluster Comparison and Visualization

In [None]:
# Create comprehensive cluster comparison visualizations
if 'df_enriched_clustered' in locals():
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # 1. Deal value comparison
    # NOTE: expm1 assumes the value column was log1p-transformed upstream;
    # drop the expm1 call if the column holds raw USD millions.
    if 'Total Transaction Value ($USDmm, Historical rate)' in df_enriched_clustered.columns:
        avg_values = df_enriched_clustered.groupby('Cluster')['Total Transaction Value ($USDmm, Historical rate)'].mean()
        avg_values_orig = np.expm1(avg_values)
        axes[0, 0].bar(avg_values_orig.index, avg_values_orig.values, color='steelblue')
        axes[0, 0].set_xlabel('Cluster')
        axes[0, 0].set_ylabel('Average Deal Value ($M USD)')
        axes[0, 0].set_title('Average Transaction Value by Cluster')
        axes[0, 0].grid(axis='y', alpha=0.3)

    # 2. Deal rationale heatmap
    rationale_cols = [c for c in df_enriched_clustered.columns if c.startswith('rationale_')]
    if rationale_cols:
        rationale_means = df_enriched_clustered.groupby('Cluster')[rationale_cols].mean()
        rationale_means.columns = [c.replace('rationale_', '').replace('_', ' ').title()
                                   for c in rationale_means.columns]
        im = axes[0, 1].imshow(rationale_means.T, cmap='YlOrRd', aspect='auto')
        axes[0, 1].set_xticks(range(len(rationale_means)))
        axes[0, 1].set_xticklabels(rationale_means.index)
        axes[0, 1].set_yticks(range(len(rationale_means.columns)))
        axes[0, 1].set_yticklabels(rationale_means.columns)
        axes[0, 1].set_xlabel('Cluster')
        axes[0, 1].set_title('Deal Rationale Intensity Heatmap')
        plt.colorbar(im, ax=axes[0, 1])

    # 3. Geographic distribution
    if 'Country/Region of Incorporation [Target/Issuer]' in df_enriched_clustered.columns:
        top_countries = df_enriched_clustered['Country/Region of Incorporation [Target/Issuer]'].value_counts().head(8)
        axes[1, 0].barh(range(len(top_countries)), top_countries.values, color='coral')
        axes[1, 0].set_yticks(range(len(top_countries)))
        axes[1, 0].set_yticklabels(top_countries.index)
        axes[1, 0].set_xlabel('Number of Transactions')
        axes[1, 0].set_title('Top 8 Target Countries (All Clusters)')
        axes[1, 0].grid(axis='x', alpha=0.3)
        axes[1, 0].invert_yaxis()

    # 4. Cluster size pie chart
    cluster_sizes = df_enriched_clustered['Cluster'].value_counts().sort_index()
    axes[1, 1].pie(cluster_sizes, labels=[f'Cluster {i}' for i in cluster_sizes.index],
                   autopct='%1.1f%%', startangle=90, colors=plt.cm.viridis(np.linspace(0, 1, len(cluster_sizes))))
    axes[1, 1].set_title('Cluster Distribution')

    plt.tight_layout()
    plt.show()

    print("✓ Cluster comparison visualizations complete!")

### 5.3 Key Findings and Business Implications

**Summary of Results:**

Based on the clustering analysis, we can identify distinct patterns in M&A transactions:

1. **Cluster Differentiation**: Each cluster represents transactions with similar characteristics in terms of deal size, rationale, geographic focus, and sector concentration.

2. **Deal Rationale Patterns**: Different clusters show different dominant motivations (operational synergies, technology acquisition, market expansion, etc.).

3. **Company Relationships**: Clusters can reveal serial acquirers and potential strategic partnerships.

4. **Trading Opportunities**: Clusters can help identify similar companies and potential takeover targets based on historical patterns.

**Next Steps:**
- Deep dive into specific transactions within interesting clusters
- Analyze stock performance patterns within clusters
- Identify emerging M&A trends by cluster
- Use clusters for predictive modeling of future deals