# Deeptech M&A Momentum: Time-Series Aggregation

## Phase 3, Step 3.1: sector-level deal flow time-series

This notebook takes the deeptech M&A deals classified in Phase 2 and aggregates the total dollar value of transactions into a continuous time series for each sector. The aggregation is run at three frequencies (monthly, quarterly, and semi-annual) so they can be compared downstream. The aggregated data forms the base for our momentum signal.
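
As a rough illustration of what this aggregation means, the sketch below buckets a few made-up deals into windows and sums them per sector. It is plain Python rather than the Polars code used later in the notebook, and it assumes windows aligned to calendar boundaries (quarters starting in January, April, July, October), labeled by their start date. The helper names `period_start` and `aggregate`, and the deal values, are hypothetical.

```python
from collections import defaultdict
from datetime import date


def period_start(d: date, months: int) -> date:
    """First day of the calendar-aligned window of `months` months containing `d`."""
    # Count months on a continuous axis, then floor to a multiple of the window length.
    month_index = d.year * 12 + (d.month - 1)
    floored = month_index - (month_index % months)
    return date(floored // 12, floored % 12 + 1, 1)


def aggregate(deals, months):
    """Sum deal value and count deals per (sector, window start)."""
    totals = defaultdict(lambda: [0.0, 0])
    for announced, sector, value_usd in deals:
        key = (sector, period_start(announced, months))
        totals[key][0] += value_usd   # total deal volume in the window
        totals[key][1] += 1           # number of transactions in the window
    return {key: tuple(val) for key, val in totals.items()}


# Hypothetical deals: (announced_date, sector, deal value in USD millions)
deals = [
    (date(2023, 1, 15), "Robotics", 10.0),
    (date(2023, 3, 31), "Robotics", 5.0),
    (date(2023, 4, 1), "Robotics", 7.0),   # first day of the next quarter
    (date(2023, 2, 10), "Semiconductors", 20.0),
]

quarterly = aggregate(deals, months=3)
for (sector, start), (volume, count) in sorted(quarterly.items()):
    print(sector, start, volume, count)
```

A deal dated on the first day of a window belongs to that window, and a deal on the last day of the previous window does not spill over, which is the left-closed behavior the notebook requests from its grouping step.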

---

### Setup and configuration

In [8]:
# Imports
from pathlib import Path
import sys

import polars as pl

In [9]:
# File paths
CLASSIFIED_DATA_PATH = Path("../../data/processed/2.2_classified_deals.parquet")

# List of frequencies to test
TEST_FREQUENCIES = ["1mo", "3mo", "6mo"]
print(f"Frequencies to Aggregate: {TEST_FREQUENCIES}")

# --- Utility Function: Ensure Continuity ---
def ensure_continuity(df_volume: pl.DataFrame, freq: str) -> pl.DataFrame:
    """
    Transforms the aggregated Polars DataFrame into a continuous panel dataset,
    filling missing (zero-volume) periods for each sector.
    
    This function also prints a wide-format sample for visual inspection.
    """
    
    # Identify all unique sectors
    all_sectors = df_volume.get_column("deeptech_sector").unique().to_list()
    
    # 1. Pivot to WIDE format (dates as rows, sectors as columns)
    # `on=` replaces the deprecated `columns=` argument (Polars >= 1.0).
    df_wide = df_volume.pivot(
        on="deeptech_sector",
        index="announced_date",
        values="total_deal_volume_usd"
    ).fill_null(0).sort("announced_date")
    
    print(f"\n  -- Inspection: Wide Format Volume for {freq} (Head) --")
    # Display the requested wide format for inspection
    print(df_wide.head(3))
    print("  --------------------------------------------------------")


    # 2. Unpivot back to LONG format to prepare for rolling window operations
    # `unpivot` replaces the deprecated `melt` (Polars >= 1.0).
    df_continuous = df_wide.unpivot(
        index="announced_date",
        on=all_sectors,
        variable_name="deeptech_sector",
        value_name="total_deal_volume_usd"
    ).sort("deeptech_sector", "announced_date")


    # 3. Join the transaction counts back; periods without deals get a count of 0
    df_final_series = df_continuous.join(
        df_volume.select("announced_date", "deeptech_sector", "transaction_count"),
        on=["announced_date", "deeptech_sector"],
        how="left"
    ).fill_null(0) 

    # 4. Final type casting
    df_final_series = df_final_series.with_columns(
        pl.col("transaction_count").cast(pl.Int64)
    )
    
    print(f"  ✓ Continuity ensured for {freq}. Total time-series points: {len(df_final_series):,}")
    return df_final_series

Frequencies to Aggregate: ['1mo', '3mo', '6mo']
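
One thing worth checking: the pivot in `ensure_continuity` only produces rows for periods in which at least one sector recorded a deal, so a period with no deals in any sector would still be absent from the panel. Whether that happens in this dataset is not shown here. The sketch below, in plain Python and independent of Polars, shows the idea of building the full calendar grid first and then filling gaps with zeros. The helpers `period_grid` and `fill_panel` and the sample values are hypothetical.

```python
from datetime import date


def period_grid(first: date, last: date, step_months: int) -> list:
    """Period start dates from `first` to `last` inclusive, stepping by `step_months`."""
    grid = []
    idx = first.year * 12 + (first.month - 1)
    end = last.year * 12 + (last.month - 1)
    while idx <= end:
        grid.append(date(idx // 12, idx % 12 + 1, 1))
        idx += step_months
    return grid


def fill_panel(observed: dict, step_months: int) -> dict:
    """Expand {(sector, period_start): volume} to every sector x period, zero-filling gaps."""
    sectors = sorted({sector for sector, _ in observed})
    periods = [period for _, period in observed]
    grid = period_grid(min(periods), max(periods), step_months)
    return {(s, p): observed.get((s, p), 0.0) for s in sectors for p in grid}


# Hypothetical monthly volumes: no sector has a deal in February or March.
observed = {
    ("Robotics", date(2023, 1, 1)): 15.0,
    ("Robotics", date(2023, 4, 1)): 7.0,
    ("Semiconductors", date(2023, 1, 1)): 20.0,
}

dense = fill_panel(observed, step_months=1)
print(len(dense))  # 2 sectors x 4 months
```

A pivot over `observed` alone would yield only the January and April rows. The explicit grid adds February and March with zero volume for both sectors, which matters for any rolling calculation that counts rows rather than calendar time.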


### 3.1: Load data and filter actionable sectors

In [10]:
# Read the Parquet file and filter for only *actionable* sectors
df_classified = pl.read_parquet(CLASSIFIED_DATA_PATH)
initial_count = len(df_classified)

# CRITICAL FILTER: drop noise, non-deeptech, and failed-classification rows.
# The condition relies only on the sector tags emitted by the LLM classifier.
EXCLUDED_TAGS = [
    "NOISE",
    "NON_DEEPTECH",
    "CLASSIFICATION_FAILED_API_ERROR",
    "CLASSIFICATION_FAILED_NO_API_CLIENT",
]
df_actionable = df_classified.filter(
    ~pl.col("deeptech_sector").is_in(EXCLUDED_TAGS)
).select(
    "announced_date", 
    "deal_value_usd", 
    "deeptech_sector"
).with_columns(
    # Parse announced_date strings into pl.Date, as required by group_by_dynamic
    pl.col("announced_date").str.strptime(pl.Date, "%Y-%m-%d").alias("announced_date")
).sort("deeptech_sector", "announced_date")

final_count = len(df_actionable)
print(f"✓ Total deals loaded: {initial_count:,}")
print(f"✓ Actionable deals retained: {final_count:,} ({(final_count/initial_count)*100:.1f}%)")
print(f"Sectors identified: {df_actionable.get_column('deeptech_sector').n_unique()}")


✓ Total deals loaded: 38,655
✓ Actionable deals retained: 7,278 (18.8%)
Sectors identified: 17


### 3.2 & 3.3: Aggregate, ensure continuity and save (looping frequencies)

In [12]:
# Ensure output directory exists before the loop
OUTPUT_DATA_DIR = Path("../../data/processed")
OUTPUT_DATA_DIR.mkdir(parents=True, exist_ok=True)

for freq in TEST_FREQUENCIES:
    print("\n" + "="*50)
    print(f"Aggregating data for frequency: {freq}")
    print("="*50)
    
    # 1. Resample (group) by sector and time window, summing the deal value.
    df_volume = df_actionable.group_by_dynamic(
        "announced_date",
        every=freq,                  # Current frequency (e.g., "3mo")
        group_by="deeptech_sector",  # `group_by=` replaces the deprecated `by=`
        closed="left",               # Window includes its start and excludes its end
        label="left"                 # Each window is labeled by its start date
    ).agg(
        pl.sum("deal_value_usd").alias("total_deal_volume_usd"),
        pl.len().alias("transaction_count") # Use pl.len() for count in dynamic groupby
    ).sort("deeptech_sector", "announced_date")
    
    print(f"Intermediate aggregated rows (before continuity): {len(df_volume):,}")

    # 2. Ensure Continuity (This step will print the wide sample)
    df_final_series = ensure_continuity(df_volume, freq)
    
    # 3. Export the dataset with the frequency in the filename
    # We save the LONG (stacked) panel data, as required for efficient rolling calculations in Phase 3.3
    OUTPUT_PATH = OUTPUT_DATA_DIR / f"3.0_sector_volume_{freq}.csv" 
    df_final_series.write_csv(OUTPUT_PATH)
    
    print(f"✓ Final series (Long Format) for {freq} saved to: {OUTPUT_PATH}")
    print(df_final_series.head(3))

print("\n" + "="*60)
print("PHASE 3, STEP 3.1 COMPLETE: All time-series frequencies generated.")
print("Ready for Phase 3, Step 3.3: Sentiment Normalization.")


Aggregating data for frequency: 1mo
Intermediate aggregated rows (before continuity): 1,379

  -- Inspection: Wide Format Volume for 1mo (Head) --
shape: (3, 18)
[Wide-format preview truncated in the export. Columns: announced_date (date), followed by one f64 column per sector, including "Additive Manufacturing / 3D Pr…", "Advanced Battery Chemistry / S…", "Aerospace Defense Systems Inte…", …, "Precision Machining and Metrol…", "Robotics", "Semiconductors (microchips)", "Solar Grid Optimization / Smar…".]

