# 03 - Feature Engineering for Causal Inference

This notebook creates causally-interpretable features for counterfactual demand analysis.

## Feature Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| **Outcome Variables** | What we want to predict | funding_ratio, success_binary |
| **Treatment Variables** | Main causal effect of interest | price_positioning, price_ambition |
| **Endogenous Variables** | Chosen strategically (confounded) | goal_ambition, campaign_length |
| **Instrumental Variables** | Exogenous shocks for causal identification | launch_day, holiday_proximity |
| **Confounders** | Must control for these | competition, creator_experience |
| **Censoring Indicators** | Data quality flags | demand_censored, early_success |

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')
sns.set_theme(style='darkgrid', palette='husl')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

%matplotlib inline

In [None]:
# Load enriched data
df = pd.read_csv('../data/processed/kickstarter_enriched.csv')
print(f"Loaded {len(df)} campaigns with {len(df.columns)} features")
df.head()

---
## 1. Outcome Variables

**Why these are outcomes**: These are the results we observe AFTER the campaign runs. They're what we want to predict/explain causally.

In [None]:
# Standardize column names
goal_col = 'funding_goal' if 'funding_goal' in df.columns else 'goal'
pledged_col = 'pledged_amount' if 'pledged_amount' in df.columns else 'pledged'
duration_col = 'campaign_duration_days' if 'campaign_duration_days' in df.columns else 'duration_days'

# 1. Funding Ratio
df['funding_ratio'] = df[pledged_col] / df[goal_col].replace(0, 1)
df['funding_ratio'] = df['funding_ratio'].clip(0, 20)  # Cap extreme values

# 2. Success Binary
if 'status' in df.columns:
    df['success_binary'] = (df['status'] == 'successful').astype(int)
elif 'is_successful' in df.columns:
    df['success_binary'] = df['is_successful']
else:
    df['success_binary'] = (df['funding_ratio'] >= 1.0).astype(int)

# 3. Backers per Day
df['backers_per_day'] = df['backers_count'] / df[duration_col].replace(0, 1)

# 4. Overfunding Ratio (how much over 100%)
df['overfunding_ratio'] = np.maximum(0, df['funding_ratio'] - 1.0)

print("Outcome Variables Created:")
print(df[['funding_ratio', 'success_binary', 'backers_per_day', 'overfunding_ratio']].describe())

---
## 2. Treatment Variables

**Why these are treatments**: Price is our main variable of interest. We want to estimate the CAUSAL effect of pricing on outcomes. However, pricing is endogenous - creators choose prices based on expectations.

In [None]:
# Calculate category medians for normalization
category_price_median = df.groupby('category')['avg_reward_price'].transform('median')
category_goal_median = df.groupby('category')[goal_col].transform('median')

# 1. Price Positioning (relative to category)
df['price_positioning'] = df['avg_reward_price'] / category_price_median.replace(0, 1)

# 2. Price Ambition (average reward price as a multiple of 1% of the goal)
df['price_ambition'] = df['avg_reward_price'] / (df[goal_col] / 100).replace(0, 1)

# 3. Log Average Price (handles skewness)
df['log_avg_price'] = np.log1p(df['avg_reward_price'])

print("Treatment Variables Created:")
print(df[['avg_reward_price', 'price_positioning', 'price_ambition', 'log_avg_price']].describe())

---
## 3. Endogenous Variables

**Why these are endogenous**: These are strategic choices made by creators based on their expectations and private information. They're correlated with unobserved quality, making naive regression biased.
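
The bias described above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not on the Kickstarter dataset: the variables `quality`, `goal`, and `funding`, and all coefficients, are assumptions chosen for illustration. Unobserved quality raises both the goal and the outcome, so the naive slope is pulled away from the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (illustrative coefficients only):
# unobserved quality drives both the creator's goal and the funding outcome.
quality = rng.normal(size=n)
goal = 0.8 * quality + rng.normal(size=n)
true_effect = -0.5
funding = true_effect * goal + 1.0 * quality + rng.normal(size=n)

def ols_coefs(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression: quality is omitted, so the goal coefficient absorbs its effect
naive = ols_coefs([goal], funding)[1]

# Oracle regression: controls for quality (infeasible with real data)
oracle = ols_coefs([goal, quality], funding)[1]

print(f"True effect: {true_effect:.2f}")
print(f"Naive OLS:   {naive:.2f}")
print(f"Oracle OLS:  {oracle:.2f}")
```

In this setup the naive estimate lands near zero while the oracle regression recovers the true effect, which is why the later sections rely on instruments rather than plain regression.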

In [None]:
# 1. Goal Ambition (relative to category)
df['goal_ambition'] = df[goal_col] / category_goal_median.replace(0, 1)

# 2. Campaign Length Category
def categorize_length(days):
    # Missing durations fall back to 'medium' rather than raising or propagating NaN
    if pd.isna(days):
        return 'medium'
    if days < 30:
        return 'short'
    elif days <= 45:
        return 'medium'
    else:
        return 'long'

df['campaign_length_category'] = df[duration_col].apply(categorize_length)

print("Endogenous Variables:")
print(f"\nGoal Ambition: Mean={df['goal_ambition'].mean():.2f}, Std={df['goal_ambition'].std():.2f}")
print(f"\nCampaign Length Distribution:")
print(df['campaign_length_category'].value_counts())

---
## 4. Instrumental Variables (IVs)

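**Why these are instruments**: A valid instrument shifts the treatment (price) but affects campaign outcomes ONLY through that treatment, and it is NOT chosen as part of the pricing strategy. Such variables provide exogenous variation we can use to identify causal effects. Whether the candidates below actually satisfy this is an assumption to be argued, not something the data can confirm.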

- **Launch day**: When you launch is often determined by readiness, not price strategy
- **Holiday proximity**: External calendar events, not a conscious pricing choice
- **Trend spike**: Market conditions outside creator's control
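
To show how an instrument is used once it is accepted, here is a minimal two-stage least squares (2SLS) sketch on synthetic data. The variables `z`, `u`, `price`, and `outcome` and all coefficients are hypothetical and are not derived from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical setup: z is a binary instrument (e.g. a launch-timing flag),
# u is an unobserved confounder that moves both price and the outcome.
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
price = 1.0 * z + 0.8 * u + rng.normal(size=n)
true_effect = -0.4
outcome = true_effect * price + 1.0 * u + rng.normal(size=n)

def fit(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Stage 1: project the endogenous treatment onto the instrument
Z = np.column_stack([ones, z])
price_hat = Z @ fit(Z, price)

# Stage 2: regress the outcome on the fitted treatment
beta_2sls = fit(np.column_stack([ones, price_hat]), outcome)[1]

# Naive OLS for comparison (biased by u)
beta_ols = fit(np.column_stack([ones, price]), outcome)[1]

print(f"True effect: {true_effect:.2f}")
print(f"Naive OLS:   {beta_ols:.2f}")
print(f"2SLS:        {beta_2sls:.2f}")
```

The point estimate from a manual second stage is correct, but its standard errors are not; a dedicated IV routine should be used for inference.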

In [None]:
# 1. Weekend Launch
df['is_weekend_launch'] = (df['day_of_week'] >= 5).astype(int)

# 2. Holiday Proximity (already have is_holiday_week)
df['holiday_proximity'] = df['is_holiday_week']

# 3. Trend Spike (above 120% of category median)
category_trend_median = df.groupby('category')['trend_index'].transform('median')
df['trend_spike'] = (df['trend_index'] > category_trend_median * 1.2).astype(int)

print("Instrumental Variables:")
print(f"Weekend launches: {df['is_weekend_launch'].sum()} ({df['is_weekend_launch'].mean()*100:.1f}%)")
print(f"Holiday proximity: {df['holiday_proximity'].sum()} ({df['holiday_proximity'].mean()*100:.1f}%)")
print(f"Trend spikes: {df['trend_spike'].sum()} ({df['trend_spike'].mean()*100:.1f}%)")

---
## 5. Confounders

**Why these are confounders**: These variables affect BOTH pricing decisions AND outcomes. If we don't control for them, our causal estimates will be biased.

In [None]:
# 1. Competition Intensity (normalized)
historical_avg_concurrent = df['concurrent_campaigns'].mean()
df['competition_intensity'] = df['concurrent_campaigns'] / historical_avg_concurrent

# 2. Creator Experience Proxy (based on description effort + update frequency)
# Normalize components before combining
desc_norm = (df['description_length'] - df['description_length'].mean()) / df['description_length'].std()
update_norm = (df['update_frequency'] - df['update_frequency'].mean()) / df['update_frequency'].std()
df['creator_experience_proxy'] = (desc_norm + update_norm) / 2

# 3. Category Saturation (campaigns in category that year / total)
if 'year' not in df.columns:
    df['launch_date'] = pd.to_datetime(df['launch_date'])
    df['year'] = df['launch_date'].dt.year

category_year_counts = df.groupby(['category', 'year']).size().reset_index(name='cat_year_count')
year_totals = df.groupby('year').size().reset_index(name='year_total')
category_year_counts = category_year_counts.merge(year_totals, on='year')
category_year_counts['category_saturation'] = category_year_counts['cat_year_count'] / category_year_counts['year_total']

df = df.merge(
    category_year_counts[['category', 'year', 'category_saturation']], 
    on=['category', 'year'], 
    how='left'
)

print("Confounder Variables:")
print(f"Competition Intensity: Mean={df['competition_intensity'].mean():.2f}")
print(f"Creator Experience Proxy: Mean={df['creator_experience_proxy'].mean():.2f}")
print(f"Category Saturation: Mean={df['category_saturation'].mean():.3f}")

---
## 6. Censoring Indicators

**Why these matter**: Some observations give us incomplete information about true demand. Campaigns that hit 300%+ funding may have had even higher potential demand - the observed outcome is *censored*.
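
A small simulation can show why censoring matters for estimation. This sketch uses synthetic data with a hard cap, which is a simplification of the situation described above; `latent`, `cap`, and the coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical latent demand that increases with a driver x
x = rng.normal(size=n)
latent = 1.5 + 1.0 * x + rng.normal(scale=0.5, size=n)

# We only observe demand up to a cap; anything above is recorded at the cap
cap = 3.0
observed = np.minimum(latent, cap)
censored = latent >= cap

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

slope_latent = ols_slope(x, latent)
slope_observed = ols_slope(x, observed)

print(f"Share censored:           {censored.mean():.1%}")
print(f"Slope on latent demand:   {slope_latent:.3f}")
print(f"Slope on observed demand: {slope_observed:.3f}")
```

The slope estimated from the capped outcome is attenuated relative to the slope on latent demand, which is the reason to carry a censoring flag into later models.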

In [None]:
# 1. Demand Censored (very successful campaigns - true demand unknown)
df['demand_censored'] = (df['funding_ratio'] > 3.0).astype(int)

# 2. Early Success (proxied by overfunding, as temporal data not available)
# If funding ratio is high AND backers_per_day is high, likely early success
bpd_median = df['backers_per_day'].median()
df['early_success'] = (
    (df['funding_ratio'] >= 1.0) & 
    (df['backers_per_day'] > bpd_median * 1.5)
).astype(int)

print("Censoring Indicators:")
print(f"Demand Censored (>300%): {df['demand_censored'].sum()} ({df['demand_censored'].mean()*100:.1f}%)")
print(f"Early Success: {df['early_success'].sum()} ({df['early_success'].mean()*100:.1f}%)")

---
# Section 1: Data Quality

Check for outliers and suspicious data points.

In [None]:
# Flag outliers
df['is_outlier_funding'] = df['funding_ratio'] > 10
df['is_outlier_price'] = df['avg_reward_price'] > 1000

print("OUTLIER DETECTION")
print("="*50)
print(f"Extreme funding ratio (>10): {df['is_outlier_funding'].sum()}")
print(f"Extreme price (>$1000): {df['is_outlier_price'].sum()}")

# Distribution histograms
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Funding Ratio
axes[0,0].hist(df['funding_ratio'].clip(upper=5), bins=50, color='steelblue', edgecolor='black')
axes[0,0].axvline(x=1, color='red', linestyle='--', linewidth=2, label='100% Funded')
axes[0,0].set_title('Funding Ratio Distribution (capped at 5)')
axes[0,0].set_xlabel('Funding Ratio')
axes[0,0].legend()

# Average Price
axes[0,1].hist(df['avg_reward_price'].clip(upper=500), bins=50, color='coral', edgecolor='black')
axes[0,1].set_title('Average Reward Price (capped at $500)')
axes[0,1].set_xlabel('Price ($)')

# Log Price
axes[1,0].hist(df['log_avg_price'], bins=50, color='teal', edgecolor='black')
axes[1,0].set_title('Log Average Price')
axes[1,0].set_xlabel('Log(Price)')

# Goal Ambition
axes[1,1].hist(df['goal_ambition'].clip(upper=5), bins=50, color='purple', edgecolor='black')
axes[1,1].axvline(x=1, color='red', linestyle='--', linewidth=2, label='Category Median')
axes[1,1].set_title('Goal Ambition (capped at 5x median)')
axes[1,1].set_xlabel('Goal / Category Median')
axes[1,1].legend()

plt.tight_layout()
plt.show()

---
# Section 2: Descriptive Statistics

Summary table and correlation analysis.

In [None]:
# Summary table of key features
key_features = [
    # Outcomes
    'funding_ratio', 'success_binary', 'backers_per_day', 'overfunding_ratio',
    # Treatment
    'avg_reward_price', 'price_positioning', 'price_ambition', 'log_avg_price',
    # Instruments
    'day_of_week', 'is_weekend_launch', 'holiday_proximity', 'trend_spike',
    # Confounders
    'competition_intensity', 'creator_experience_proxy', 'goal_ambition'
]

available_features = [f for f in key_features if f in df.columns]
summary = df[available_features].describe().T
summary['median'] = df[available_features].median()
summary = summary[['count', 'mean', 'median', 'std', 'min', 'max']]
print("SUMMARY STATISTICS")
print("="*80)
print(summary.round(3))

In [None]:
# Correlation matrix: Treatment, Outcomes, Instruments
corr_features = [
    'funding_ratio', 'success_binary',  # Outcomes
    'avg_reward_price', 'price_positioning',  # Treatment
    'day_of_week', 'is_weekend_launch', 'holiday_proximity', 'trend_spike',  # IVs
    'goal_ambition', 'competition_intensity'  # Confounders
]

available_corr = [f for f in corr_features if f in df.columns]

fig, ax = plt.subplots(figsize=(12, 10))
corr_matrix = df[available_corr].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, 
            fmt='.2f', square=True, ax=ax, mask=mask)
ax.set_title('Correlation Matrix: Outcomes, Treatment, Instruments, Confounders', fontsize=12)
plt.tight_layout()
plt.show()

---
# Section 3: Preliminary Causal Check

### Instrument Validity Tests
For a valid instrument (Z), we need:
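1. **Relevance**: Z must be correlated with Treatment; a weak first stage makes IV estimates unreliable
2. **Exclusion**: Z affects Outcome ONLY through Treatment

We check:
- Corr(Instrument, Treatment) should be clearly **NON-ZERO**; a value near zero signals a weak instrument
- Corr(Instrument, Outcome) can be present, but it does not prove exclusion; the exclusion restriction cannot be verified from correlations and must be argued from domain knowledge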
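
Relevance is commonly assessed with the first-stage F-statistic, with F > 10 as a widely used rule of thumb for a single endogenous regressor. The sketch below computes it for one instrument on synthetic data; the helper `first_stage_f` and the simulated variables are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

def first_stage_f(instrument, treatment):
    """F-statistic for H0: slope = 0 in the regression treatment ~ 1 + instrument."""
    z = np.asarray(instrument, dtype=float)
    x = np.asarray(treatment, dtype=float)
    n = len(x)
    Z = np.column_stack([np.ones(n), z])
    beta = np.linalg.lstsq(Z, x, rcond=None)[0]
    resid = x - Z @ beta
    rss = resid @ resid                      # residual sum of squares
    tss = ((x - x.mean()) ** 2).sum()        # total sum of squares
    # One restriction tested, n - 2 residual degrees of freedom
    return (tss - rss) / (rss / (n - 2))

rng = np.random.default_rng(2)
n = 2000
z = rng.integers(0, 2, size=n)

# A strong instrument shifts the treatment a lot; a weak one barely moves it
strong_treatment = 1.0 * z + rng.normal(size=n)
weak_treatment = 0.02 * z + rng.normal(size=n)

f_strong = first_stage_f(z, strong_treatment)
f_weak = first_stage_f(z, weak_treatment)

print(f"First-stage F (strong instrument): {f_strong:.1f}")
print(f"First-stage F (weak instrument):   {f_weak:.1f}")
```

With a single instrument this F equals the squared t-statistic of the first-stage slope, so it can also be recovered from the correlation and the sample size.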

In [None]:
# Plot 1: Price vs Funding Ratio, colored by success
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Price vs Funding Ratio
colors = df['success_binary'].map({0: '#EF4444', 1: '#00D4AA'})
axes[0,0].scatter(df['avg_reward_price'].clip(upper=500), 
                  df['funding_ratio'].clip(upper=5), 
                  c=colors, alpha=0.5, s=30)
axes[0,0].set_xlabel('Average Reward Price ($)')
axes[0,0].set_ylabel('Funding Ratio')
axes[0,0].set_title('Price vs Funding Ratio (Green=Success, Red=Failed)')
axes[0,0].axhline(y=1, color='black', linestyle='--', alpha=0.5)

# Day of Week vs Price (first stage: should VARY for a relevant IV)
day_price = df.groupby('day_of_week')['avg_reward_price'].mean()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_price.index = [day_names[int(i)] for i in day_price.index]
day_price.plot(kind='bar', ax=axes[0,1], color='coral')
axes[0,1].set_title('Launch Day vs Avg Price (Should VARY for a relevant IV)')
axes[0,1].set_ylabel('Average Price ($)')
axes[0,1].axhline(y=df['avg_reward_price'].mean(), color='red', linestyle='--')

# Day of Week vs Funding Ratio (can show variation)
day_funding = df.groupby('day_of_week')['funding_ratio'].mean()
day_funding.index = [day_names[i] for i in day_funding.index]
day_funding.plot(kind='bar', ax=axes[1,0], color='steelblue')
axes[1,0].set_title('Launch Day vs Funding Ratio (Can show variation)')
axes[1,0].set_ylabel('Funding Ratio')
axes[1,0].axhline(y=df['funding_ratio'].mean(), color='red', linestyle='--')

# Holiday vs Outcomes
holiday_stats = df.groupby('holiday_proximity').agg({
    'funding_ratio': 'mean',
    'avg_reward_price': 'mean'
})
x = np.arange(2)
width = 0.35
ax2 = axes[1,1].twinx()
bars1 = axes[1,1].bar(x - width/2, holiday_stats['funding_ratio'], width, 
                      color='steelblue', label='Funding Ratio')
bars2 = ax2.bar(x + width/2, holiday_stats['avg_reward_price'], width, 
                color='coral', label='Avg Price')
axes[1,1].set_xticks(x)
axes[1,1].set_xticklabels(['Not Holiday', 'Holiday Week'])
axes[1,1].set_ylabel('Funding Ratio', color='steelblue')
ax2.set_ylabel('Avg Price ($)', color='coral')
axes[1,1].set_title('Holiday Proximity: Funding Ratio vs Price')

plt.tight_layout()
plt.show()

In [None]:
# Statistical tests for instrument relevance (exclusion cannot be tested here)
print("INSTRUMENT VALIDITY TESTS")
print("="*60)
print("\nWe want: a strong first stage, i.e. Corr(IV, Treatment) clearly non-zero")
print("Rule of thumb: first-stage F > 10")
print("Note: Corr(IV, Outcome) does not prove exclusion; that must be argued")
print("\n" + "-"*60)

instruments = ['day_of_week', 'is_weekend_launch', 'holiday_proximity', 'trend_spike']

for iv in instruments:
    if iv in df.columns:
        pair = df[[iv, 'avg_reward_price']].dropna()
        corr_treatment = pair[iv].corr(pair['avg_reward_price'])
        corr_outcome = df[iv].corr(df['funding_ratio'])
        
        # With one instrument, the first-stage F equals the squared t-statistic
        first_stage_f = corr_treatment**2 * (len(pair) - 2) / (1 - corr_treatment**2)
        validity = "✓ RELEVANT" if first_stage_f > 10 else "✗ WEAK IV"
        
        print(f"\n{iv}:")
        print(f"  Corr with Treatment (price): {corr_treatment:.4f}")
        print(f"  First-stage F: {first_stage_f:.1f}")
        print(f"  Corr with Outcome (funding): {corr_outcome:.4f}")
        print(f"  Status: {validity}")

---
# Section 4: Selection Bias Visualization

**Key Question**: Do successful campaigns systematically choose different prices/goals than failed ones?
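
One way to quantify such a difference is to pair a Welch t-test with a standardized mean difference, since a p-value alone says little about magnitude in large samples. The helper `compare_groups` and the toy prices below are hypothetical and serve only as a sketch.

```python
import numpy as np
from scipy import stats

def compare_groups(a, b):
    """Welch t-test plus standardized mean difference between two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    # equal_var=False gives Welch's test, which does not assume equal variances
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    # Standardize by the root of the average of the two sample variances
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    smd = (a.mean() - b.mean()) / pooled_sd
    return {"t": float(t_stat), "p": float(p_value), "smd": float(smd)}

# Toy prices for two groups (illustrative values only)
successful_prices = [10, 20, 30, 40, 50]
failed_prices = [20, 30, 40, 50, 60]

result = compare_groups(successful_prices, failed_prices)
print(f"t = {result['t']:.3f}, p = {result['p']:.3f}, SMD = {result['smd']:.3f}")
```

A significant difference between groups is evidence of association between pricing and success, which is consistent with selection but does not by itself establish it.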

In [None]:
# 2x2 grid: Price and Goal distributions by success
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

successful = df[df['success_binary'] == 1]
failed = df[df['success_binary'] == 0]

# Top-left: Price distribution for successful
axes[0,0].hist(successful['avg_reward_price'].clip(upper=500), bins=40, 
               color='#00D4AA', edgecolor='black', alpha=0.7)
axes[0,0].axvline(x=successful['avg_reward_price'].median(), color='red', 
                  linestyle='--', linewidth=2, label=f"Median: ${successful['avg_reward_price'].median():.0f}")
axes[0,0].set_title('✓ SUCCESSFUL: Price Distribution')
axes[0,0].set_xlabel('Average Reward Price ($)')
axes[0,0].legend()

# Top-right: Price distribution for failed
axes[0,1].hist(failed['avg_reward_price'].clip(upper=500), bins=40, 
               color='#EF4444', edgecolor='black', alpha=0.7)
axes[0,1].axvline(x=failed['avg_reward_price'].median(), color='red', 
                  linestyle='--', linewidth=2, label=f"Median: ${failed['avg_reward_price'].median():.0f}")
axes[0,1].set_title('✗ FAILED: Price Distribution')
axes[0,1].set_xlabel('Average Reward Price ($)')
axes[0,1].legend()

# Bottom-left: Goal distribution for successful
axes[1,0].hist(np.log10(successful[goal_col].clip(lower=100)), bins=40, 
               color='#00D4AA', edgecolor='black', alpha=0.7)
axes[1,0].axvline(x=np.log10(successful[goal_col].median()), color='red', 
                  linestyle='--', linewidth=2, label=f"Median: ${successful[goal_col].median():,.0f}")
axes[1,0].set_title('✓ SUCCESSFUL: Goal Distribution (log scale)')
axes[1,0].set_xlabel('Log10(Funding Goal)')
axes[1,0].legend()

# Bottom-right: Goal distribution for failed
axes[1,1].hist(np.log10(failed[goal_col].clip(lower=100)), bins=40, 
               color='#EF4444', edgecolor='black', alpha=0.7)
axes[1,1].axvline(x=np.log10(failed[goal_col].median()), color='red', 
                  linestyle='--', linewidth=2, label=f"Median: ${failed[goal_col].median():,.0f}")
axes[1,1].set_title('✗ FAILED: Goal Distribution (log scale)')
axes[1,1].set_xlabel('Log10(Funding Goal)')
axes[1,1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Statistical test for selection bias
print("SELECTION BIAS ANALYSIS")
print("="*60)
print("\nQuestion: Do successful campaigns systematically choose different prices?")
print("\n" + "-"*60)

# Welch's t-test for price difference (no equal-variance assumption; NaNs ignored)
t_stat, p_value = stats.ttest_ind(
    successful['avg_reward_price'],
    failed['avg_reward_price'],
    equal_var=False,
    nan_policy='omit'
)

print(f"\nAverage Price:")
print(f"  Successful campaigns: ${successful['avg_reward_price'].mean():.2f}")
print(f"  Failed campaigns: ${failed['avg_reward_price'].mean():.2f}")
print(f"  Difference: ${successful['avg_reward_price'].mean() - failed['avg_reward_price'].mean():.2f}")
print(f"  T-statistic: {t_stat:.3f}")
print(f"  P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n⚠️  SIGNIFICANT PRICE DIFFERENCE BETWEEN GROUPS")
    print("   Successful and failed campaigns have systematically different prices.")
    print("   This is consistent with selection on pricing; naive regression is likely biased.")
else:
    print("\n✓ No significant price difference between successful and failed campaigns.")

---
## Save Final Dataset

In [None]:
# Drop intermediate/flag columns before saving
cols_to_drop = ['is_outlier_funding', 'is_outlier_price', 'launch_month', 
                'reward_tiers_list', 'cat_year_count', 'year_total']
final_df = df.drop(columns=[c for c in cols_to_drop if c in df.columns], errors='ignore')

# Save
output_path = '../data/processed/kickstarter_causal_features.csv'
final_df.to_csv(output_path, index=False)

print(f"Saved {len(final_df)} campaigns with {len(final_df.columns)} features")
print(f"Output: {output_path}")

# List new causal features
print("\n" + "="*60)
print("NEW CAUSAL FEATURES CREATED:")
print("="*60)

new_features = [
    ('OUTCOMES', ['funding_ratio', 'success_binary', 'backers_per_day', 'overfunding_ratio']),
    ('TREATMENT', ['price_positioning', 'price_ambition', 'log_avg_price']),
    ('ENDOGENOUS', ['goal_ambition', 'campaign_length_category']),
    ('INSTRUMENTS', ['is_weekend_launch', 'holiday_proximity', 'trend_spike']),
    ('CONFOUNDERS', ['competition_intensity', 'creator_experience_proxy', 'category_saturation']),
    ('CENSORING', ['demand_censored', 'early_success'])
]

for category, features in new_features:
    print(f"\n{category}:")
    for f in features:
        if f in final_df.columns:
            print(f"  ✓ {f}")