# RQ2: Do Land Deals Predict Climate Litigation and ISDS Cases? (MINIMAL)

**Research Question:** Do countries with large-scale land acquisitions experience more climate litigation and investment disputes?

**Hypothesis:** Land grabbing → environmental conflict → both climate cases AND investor disputes

**Analysis:**
- Model A: Land Deals → Climate Cases
- Model B: Land Deals → ISDS Cases (ALL SECTORS - revised from Ag/Mining only)

**Unit of Analysis:** Country-level (aggregated data)

---

## VERSION: MINIMAL PREDICTOR SET (5 VARIABLES)

**This is the MINIMAL version maximizing sample size.**

For the 7-predictor version, see: `RQ2_Land_Deals_to_Litigation_Analysis_REVISED.ipynb`

---

## MINIMAL PREDICTOR SET:

**Variables Removed** (poor coverage or multicollinearity):
- ❌ Rule_of_Law_Score_Pct (46.6% coverage)
- ❌ Avg_Deal_Size (multicollinear)
- ❌ Prop_Agriculture (52% missing in Model A, 34% in Model B)
- ❌ Literacy_Rate_Pct (26% missing in Model A, 10% in Model B)

**Variables Kept** (5 core predictors):
1. ✅ Total_Deal_Size (hectares)
2. ✅ Num_Deals (count)
3. ✅ Corruption_Score (governance)
4. ✅ Press_Freedom_Score (governance)
5. ✅ log_Population (control for country size)

**Rationale:**
- Maximizes statistical power (sample size)
- Keeps core theoretical variables (land deals + governance)
- Controls for country size
- Reduces multicollinearity

**Expected Sample Sizes:**
- Model A: ~50 countries (was 23 with 7 predictors)
- Model B: ~113 countries (was 72 with 7 predictors)
- Observations per predictor: 10.0 and 22.6 respectively ✓
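
The observations-per-predictor figures above are simply the expected sample size divided by the number of predictors. A quick check of that arithmetic, using the approximate counts quoted in this section (the helper name `obs_per_predictor` is illustrative and not part of the notebook):

```python
def obs_per_predictor(n_countries: int, n_predictors: int) -> float:
    """Return the number of observations available per predictor."""
    return n_countries / n_predictors

# Approximate sample sizes quoted above for the 5-predictor (minimal) version
expected = {"Model A": 50, "Model B": 113}
ratios = {name: round(obs_per_predictor(n, 5), 1) for name, n in expected.items()}

# Same calculation for the earlier 7-predictor version
previous = {"Model A": 23, "Model B": 72}
ratios_prev = {name: round(obs_per_predictor(n, 7), 1) for name, n in previous.items()}

print(ratios)
print(ratios_prev)
```

The actual ratios are recomputed later from `df_a_clean` and `df_b_clean`, so these numbers are only a planning estimate.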

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries loaded successfully!")

## STEP 0: DATA AGGREGATION TO COUNTRY LEVEL

Aggregate all three datasets to country level and merge with governance indicators.
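
The aggregate-then-merge pattern used in this step can be sketched on toy data. The values, the column names `iso3`, `size_ha` and `score`, and the `validate` check are assumptions made for illustration; they are not taken from the Land Matrix or governance files.

```python
import pandas as pd

# Toy deal-level data (made-up values)
deals = pd.DataFrame({
    "iso3": ["KEN", "KEN", "BRA"],
    "size_ha": [100.0, 300.0, 50.0],
})

# Toy country-level table standing in for the governance data
countries = pd.DataFrame({"ISO3": ["BRA", "KEN", "FRA"], "score": [38, 31, 71]})

# One row per country: total deal size and number of deals
by_country = (
    deals.groupby("iso3")
    .agg(Total_Deal_Size=("size_ha", "sum"), Num_Deals=("size_ha", "count"))
    .reset_index()
    .rename(columns={"iso3": "ISO3"})
)

# Left merge keeps every country; validate raises if ISO3 is duplicated on either side
merged_toy = countries.merge(by_country, on="ISO3", how="left", validate="one_to_one")

# Countries without any deals get 0 instead of NaN
fill_cols = ["Total_Deal_Size", "Num_Deals"]
merged_toy[fill_cols] = merged_toy[fill_cols].fillna(0)

print(merged_toy)
```

Using `validate="one_to_one"` is one way to catch duplicated country codes, which would otherwise silently duplicate rows in the merged table.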

In [None]:
# Load datasets
print("Loading datasets...")
landmatrix = pd.read_csv('Processed_Datasets/landmatrix_final_analytical_dataset.csv')
isds = pd.read_csv('Processed_Datasets/ISDS_processed_dataset_final.csv')
climate = pd.read_csv('Processed_Datasets/climate_litigation_final_unified.csv')
governance = pd.read_csv('indipendent_variables/combined_independent_variables.csv')
population = pd.read_csv('indipendent_variables/WorldPopulationByCountry.csv',
                         encoding='utf-8-sig', index_col=False)

print(f"Land Matrix: {len(landmatrix)} deals")
print(f"ISDS: {len(isds)} cases")
print(f"Climate: {len(climate)} cases")
print(f"Governance: {len(governance)} countries")
print(f"Population: {len(population)} countries")

In [None]:
# Aggregate Land Matrix to country level
print("\nAggregating Land Matrix by country...")

# Filter valid deals (size > 0)
lm_valid = landmatrix[landmatrix['Deal size'] > 0].copy()

# Get primary sector
def get_primary_sector(nace_str):
    if pd.isna(nace_str):
        return 'Unknown'
    return nace_str.split(' ')[0]

lm_valid['primary_sector'] = lm_valid['nace_sector'].apply(get_primary_sector)

# Aggregate by country (named aggregation keeps every column aligned on the group key,
# instead of attaching Prop_Mining by position with .values)
lm_country = lm_valid.groupby('Target country_iso3').agg(
    Total_Deal_Size=('Deal size', 'sum'),
    Avg_Deal_Size=('Deal size', 'mean'),
    Num_Deals=('Deal size', 'count'),
    Prop_Agriculture=('primary_sector', lambda x: (x == 'A').mean()),  # Proportion Agriculture
    Prop_Mining=('primary_sector', lambda x: (x == 'B').mean()),  # Proportion Mining
).reset_index().rename(columns={'Target country_iso3': 'ISO3'})

print(f"Countries with land deals: {len(lm_country)}")
print(lm_country.head())

In [None]:
# Aggregate Climate Litigation to country level
print("\nAggregating Climate Litigation by country...")

# Extract main country (first in Geographies)
climate['main_country'] = climate['Geographies'].str.split(';').str[0].str.strip()
climate['iso3_clean'] = climate['main_country'].str[:3]  # Extract ISO3

# Get primary sector for climate
def get_primary_sector_climate(row):
    for sector in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'O']:
        if row.get(f'sector_{sector}', 0) == 1:
            return sector
    return 'Unknown'

climate['primary_sector'] = climate.apply(get_primary_sector_climate, axis=1)

# Aggregate by country
climate_country = climate.groupby('iso3_clean').agg({
    'Case URL': 'count',  # Total cases
    'Status': lambda x: (x.isin(['Decided', 'Decided '])).sum() / len(x),  # Proportion decided
    'primary_sector': lambda x: (x == 'A').sum() / len(x),  # Proportion Agriculture
}).reset_index()

climate_country.columns = ['ISO3', 'Climate_Cases', 'Prop_Climate_Decided', 'Prop_Climate_Agriculture']

print(f"Countries with climate cases: {len(climate_country)}")
print(climate_country.head())

In [None]:
# Aggregate ISDS to country level
print("\nAggregating ISDS by country...")

# Get primary sector for ISDS
def get_primary_sector_isds(row):
    for sector in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N']:
        if row.get(f'sector_{sector}', 0) == 1:
            return sector
    return 'Unknown'

isds['primary_sector'] = isds.apply(get_primary_sector_isds, axis=1)

# Filter Agriculture & Mining cases
isds['is_ag_mining'] = isds['primary_sector'].isin(['A', 'B'])

# Aggregate by respondent country
# ('size' counts every case; 'count' on amount_claimed_musd would skip cases with no claimed amount)
isds_country = isds.groupby('respondent_state_iso3').agg(
    ISDS_Cases=('is_ag_mining', 'size'),
    Avg_Amount_Claimed=('amount_claimed_musd', 'mean'),
    Total_Amount_Claimed=('amount_claimed_musd', 'sum'),
    ISDS_Cases_Ag_Mining=('is_ag_mining', 'sum'),  # Count of Ag/Mining cases
).reset_index().rename(columns={'respondent_state_iso3': 'ISO3'})

print(f"Countries with ISDS cases: {len(isds_country)}")
print(isds_country.head())

In [None]:
# Merge all datasets
print("\nMerging all datasets...")

# Start with governance (has all countries)
merged = governance.copy()

# Merge land matrix
merged = merged.merge(lm_country, on='ISO3', how='left')

# Merge climate
merged = merged.merge(climate_country, on='ISO3', how='left')

# Merge ISDS
merged = merged.merge(isds_country, on='ISO3', how='left')

# Clean and merge population data
population_clean = population.copy()
population_clean.columns = population_clean.columns.str.strip()  # Remove whitespace
population_clean['ISO3'] = population_clean['Iso code'].astype(str).str.strip()  # Convert to string first
population_clean = population_clean[['ISO3', 'pop2025']].rename(columns={'pop2025': 'Population'})

merged = merged.merge(population_clean, on='ISO3', how='left')

# Create log-transformed population (handle NaN and zeros)
merged['log_Population'] = np.log(merged['Population'].replace(0, np.nan))

# Fill NaN in count variables with 0 (no cases = 0)
count_cols = ['Num_Deals', 'Climate_Cases', 'ISDS_Cases', 'ISDS_Cases_Ag_Mining']
merged[count_cols] = merged[count_cols].fillna(0)

# Fill NaN in size variables with 0
size_cols = ['Total_Deal_Size', 'Avg_Deal_Size', 'Total_Amount_Claimed', 'Avg_Amount_Claimed']
merged[size_cols] = merged[size_cols].fillna(0)

print(f"\nMerged dataset: {len(merged)} countries")
print(f"Countries with Population data: {merged['Population'].notna().sum()}")
print(f"Countries with log_Population data: {merged['log_Population'].notna().sum()}")
print(f"\n--- Sample Sizes by Dataset ---")
print(f"Countries with land deals: {(merged['Num_Deals'] > 0).sum()}")
print(f"Countries with climate cases: {(merged['Climate_Cases'] > 0).sum()}")
print(f"Countries with ISDS cases (all sectors): {(merged['ISDS_Cases'] > 0).sum()}")
print(f"Countries with ISDS Ag/Mining cases: {(merged['ISDS_Cases_Ag_Mining'] > 0).sum()}")
print(f"\nCountries with ALL THREE datasets: {((merged['Num_Deals'] > 0) & (merged['Climate_Cases'] > 0) & (merged['ISDS_Cases'] > 0)).sum()}")

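# Save merged dataset (create the output folder first so this and later saves do not fail)
import os
os.makedirs('results/minimal', exist_ok=True)
merged.to_csv('results/minimal/merged_country_level_data_minimal.csv', index=False)
print("\nSaved: results/minimal/merged_country_level_data_minimal.csv")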

---
# MODEL A: Land Deals → Climate Cases
---

## STEP 1: DATA INSPECTION (Model A)

In [None]:
# Filter countries with climate cases
df_a = merged[merged['Climate_Cases'] > 0].copy()

print(f"Countries with climate cases: {len(df_a)}")
print(f"\nShape: {df_a.shape}")
print(f"\nColumn names:")
print(df_a.columns.tolist())

# Check data types
print("\n--- Data Types ---")
print(df_a.dtypes)

# Missing values
print("\n--- Missing Values ---")
print(df_a.isnull().sum()[df_a.isnull().sum() > 0])

# Descriptive statistics
print("\n--- Descriptive Statistics ---")
cols_of_interest = ['Climate_Cases', 'Total_Deal_Size', 'Num_Deals', 'Corruption_Score', 'Press_Freedom_Score']
print(df_a[cols_of_interest].describe())

# Check DV distribution
print("\n--- DV Distribution (Climate_Cases) ---")
print(f"Skewness: {df_a['Climate_Cases'].skew():.2f}")
print(f"Mean: {df_a['Climate_Cases'].mean():.2f}, Median: {df_a['Climate_Cases'].median():.0f}")

In [None]:
# Plot DV distribution (before and after log transformation)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original
axes[0].hist(df_a['Climate_Cases'], bins=30, edgecolor='black')
axes[0].set_title('Original: Climate Cases')
axes[0].set_xlabel('Climate Cases')
axes[0].set_ylabel('Frequency')

# Log-transformed
df_a['Climate_Cases_Log'] = np.log(df_a['Climate_Cases'] + 1)
axes[1].hist(df_a['Climate_Cases_Log'], bins=30, edgecolor='black', color='orange')
axes[1].set_title('Log-Transformed: log(Climate Cases + 1)')
axes[1].set_xlabel('log(Climate Cases + 1)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('results/minimal/modelA_dv_distribution_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

print("DV will be log-transformed for modeling.")

## STEP 2: CORRELATION CHECK (Model A)

In [None]:
# Select numeric predictors (MINIMAL: 5 core variables)
predictors_a = ['Total_Deal_Size', 'Num_Deals',
                'Corruption_Score', 'Press_Freedom_Score',
                'log_Population']

# Check missing data BEFORE dropping
print("--- Missing Data Analysis (Model A - MINIMAL) ---")
print(f"Countries with Climate cases: {len(df_a)}")
for pred in predictors_a:
    missing = df_a[pred].isna().sum()
    print(f"  Missing {pred}: {missing} ({missing/len(df_a)*100:.1f}%)")

# Drop rows with missing predictors
df_a_clean = df_a.dropna(subset=predictors_a)
print(f"\nAfter dropping missing predictors: {len(df_a_clean)} countries")
print(f"Lost {len(df_a) - len(df_a_clean)} countries due to missing data")
print(f"Observations per predictor: {len(df_a_clean) / len(predictors_a):.1f}\n")

# Correlation matrix
corr_matrix = df_a_clean[predictors_a].corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Model A Predictors (MINIMAL)', fontsize=14)
plt.tight_layout()
plt.savefig('results/minimal/modelA_correlation_matrix_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

# Flag high correlations
print("\n--- High Correlations (|r| > 0.7) ---")
high_corr = np.where(np.abs(corr_matrix) > 0.7)
high_corr_pairs = [(corr_matrix.index[x], corr_matrix.columns[y], corr_matrix.iloc[x, y]) 
                   for x, y in zip(*high_corr) if x != y and x < y]
for var1, var2, corr in high_corr_pairs:
    print(f"{var1} <-> {var2}: {corr:.3f}")

if not high_corr_pairs:
    print("No high correlations found.")

## STEP 3: PREPARE DATA (Model A)
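
With roughly 50 countries, a single 20% hold-out leaves about 10 test rows, so test metrics from one split are noisy. One possible complement, sketched here on synthetic data rather than the notebook's data, is repeated k-fold cross-validation with the scaler inside a pipeline, so that standardization is learned from the training folds only. The arrays `X_demo` and `y_demo` and the fold settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 "countries", 5 predictors, known linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([0.8, 0.0, -0.5, 0.0, 0.3]) + rng.normal(scale=0.5, size=50)

# The scaler is refit inside every training fold, so no test-fold information leaks in
pipe = make_pipeline(StandardScaler(), LinearRegression())

# 5 folds repeated 10 times gives 50 out-of-sample R² values
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")

print(f"CV R²: mean = {scores.mean():.3f}, sd = {scores.std():.3f}")
```

The spread of the fold scores gives a rough sense of how much a single train/test split could vary.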

In [None]:
# Define X and y
y_a = df_a_clean['Climate_Cases_Log']
X_a = df_a_clean[predictors_a]

print(f"X shape: {X_a.shape}")
print(f"y shape: {y_a.shape}")

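# Train/test split on row labels first, so the scaler only sees training rows
train_idx_a, test_idx_a = train_test_split(
    X_a.index.to_numpy(), test_size=0.2, random_state=42
)

# Standardize predictors using training-set mean and SD (avoids test-set leakage)
scaler_a = StandardScaler().fit(X_a.loc[train_idx_a])
X_a_scaled = pd.DataFrame(scaler_a.transform(X_a), columns=X_a.columns, index=X_a.index)

X_train_a, X_test_a = X_a_scaled.loc[train_idx_a], X_a_scaled.loc[test_idx_a]
y_train_a, y_test_a = y_a.loc[train_idx_a], y_a.loc[test_idx_a]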

print(f"\nTraining set: {len(X_train_a)} countries")
print(f"Test set: {len(X_test_a)} countries")

## STEP 4: FIT BASELINE MODEL (Model A)

In [None]:
# Fit OLS regression
model_a = LinearRegression()
model_a.fit(X_train_a, y_train_a)

# Get coefficients
coef_df_a = pd.DataFrame({
    'Feature': X_train_a.columns,
    'Coefficient': model_a.coef_
}).sort_values('Coefficient', ascending=False)

print("\n--- MODEL A: Land Deals → Climate Cases ---")
print(f"Intercept: {model_a.intercept_:.4f}")
print("\nCoefficients:")
print(coef_df_a.to_string(index=False))

# Train R²
train_r2_a = model_a.score(X_train_a, y_train_a)
print(f"\nTrain R²: {train_r2_a:.4f}")

# Calculate VIF for each predictor: VIF_j = 1 / (1 - R²_j), where R²_j comes
# from regressing predictor j on the remaining predictors
print("\n--- VIF Check (auxiliary regressions) ---")
print("Note: VIF > 5 indicates potential multicollinearity")
for col in X_a_scaled.columns:
    # Regress this predictor on all the others
    X_temp = X_a_scaled.drop(columns=[col])
    y_temp = X_a_scaled[col]
    lr_temp = LinearRegression()
    lr_temp.fit(X_temp, y_temp)
    r2_temp = lr_temp.score(X_temp, y_temp)
    vif = 1 / (1 - r2_temp) if r2_temp < 0.999 else 999
    flag = " ⚠️" if vif > 5 else ""
    print(f"{col}: {vif:.2f}{flag}")

## STEP 5: TEST SET EVALUATION (Model A)

In [None]:
# Predict on test set
y_pred_a = model_a.predict(X_test_a)

# Calculate metrics
test_r2_a = r2_score(y_test_a, y_pred_a)
test_rmse_a = np.sqrt(mean_squared_error(y_test_a, y_pred_a))
test_mae_a = mean_absolute_error(y_test_a, y_pred_a)

print("\n--- TEST SET PERFORMANCE (Model A) ---")
print(f"Test R²: {test_r2_a:.4f}")
print(f"Test RMSE: {test_rmse_a:.4f}")
print(f"Test MAE: {test_mae_a:.4f}")

# Compare train vs test
print(f"\nTrain R²: {train_r2_a:.4f}")
print(f"Test R²: {test_r2_a:.4f}")
print(f"Difference: {train_r2_a - test_r2_a:.4f}")
if train_r2_a - test_r2_a > 0.1:
    print("⚠️ Large gap suggests possible overfitting")
else:
    print("✓ No large train/test gap (the test set is small, so this estimate is noisy)")

## STEP 6: FEATURE IMPORTANCE (Model A)
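
Because the outcome is `log(cases + 1)` and the predictors are standardized, a coefficient β can be read as follows: a one-standard-deviation increase in the predictor multiplies `cases + 1` by `exp(β)`, holding the other predictors fixed. A small worked example, using a hypothetical coefficient value rather than a fitted one:

```python
import numpy as np

def multiplicative_effect(beta: float) -> float:
    """Factor by which (cases + 1) changes for a one-SD increase in the predictor."""
    return float(np.exp(beta))

beta_example = 0.25  # hypothetical standardized coefficient, not a result from this notebook
factor = multiplicative_effect(beta_example)

print(f"(cases + 1) is multiplied by {factor:.3f}, "
      f"i.e. a {100 * (factor - 1):.1f}% change per one-SD increase")
```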

In [None]:
# Plot feature importance
coef_df_a_sorted = coef_df_a.sort_values('Coefficient')
colors = ['green' if x > 0 else 'red' for x in coef_df_a_sorted['Coefficient']]

plt.figure(figsize=(10, 6))
plt.barh(coef_df_a_sorted['Feature'], coef_df_a_sorted['Coefficient'], color=colors, edgecolor='black')
plt.xlabel('Standardized Coefficient', fontsize=12)
plt.title('Feature Importance: Model A (Land Deals → Climate Cases) - MINIMAL', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.savefig('results/minimal/modelA_feature_importance_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

## STEP 7: INTERACTION TERM (Model A)
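
In-sample R² can only stay the same or increase when a term is added, so a change in train R² is a rough effect-size indicator rather than a test. A partial F-test on nested OLS models is one standard way to assess the added term. The sketch below uses synthetic data with a built-in interaction; the helper `partial_f_test` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np
from scipy import stats

def partial_f_test(X_reduced, X_full, y):
    """F-test for the extra columns of X_full relative to the nested X_reduced (OLS with intercept)."""
    def rss(X):
        # Add an intercept column and solve the least-squares problem
        design = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid), design.shape[1]

    rss_r, p_r = rss(X_reduced)
    rss_f, p_f = rss(X_full)
    df_num = p_f - p_r          # number of added terms
    df_den = len(y) - p_f       # residual degrees of freedom of the full model
    f_stat = ((rss_r - rss_f) / df_num) / (rss_f / df_den)
    p_value = stats.f.sf(f_stat, df_num, df_den)
    return f_stat, p_value

# Synthetic data with a genuine x1 * x2 interaction
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y_demo = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rng.normal(scale=0.5, size=200)

X_main = np.column_stack([x1, x2])
X_int = np.column_stack([x1, x2, x1 * x2])
f_stat, p_value = partial_f_test(X_main, X_int, y_demo)

print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

The test should be run on the training rows only and relies on the usual OLS assumptions, which Step 8 examines.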

In [None]:
# Create interaction: Total_Deal_Size × Corruption_Score
X_a_scaled['Deal_x_Corruption'] = X_a_scaled['Total_Deal_Size'] * X_a_scaled['Corruption_Score']

# Re-split with interaction
X_train_a_int, X_test_a_int, y_train_a_int, y_test_a_int = train_test_split(
    X_a_scaled, y_a, test_size=0.2, random_state=42
)

# Fit model with interaction
model_a_int = LinearRegression()
model_a_int.fit(X_train_a_int, y_train_a_int)

# Get interaction coefficient
interaction_coef = model_a_int.coef_[-1]
train_r2_a_int = model_a_int.score(X_train_a_int, y_train_a_int)

print("\n--- MODEL A WITH INTERACTION ---")
print(f"Interaction coefficient (Deal_x_Corruption): {interaction_coef:.4f}")
print(f"\nTrain R² (without interaction): {train_r2_a:.4f}")
print(f"Train R² (with interaction): {train_r2_a_int:.4f}")
print(f"Change in R²: {train_r2_a_int - train_r2_a:.4f}")

# Heuristic check only, not a significance test: in-sample R² cannot decrease
# when a term is added, so ΔR² > 0.01 is just a rough effect-size threshold
if train_r2_a_int - train_r2_a > 0.01:
    print("\n✓ Interaction adds more than 0.01 to train R² (heuristic threshold)")
else:
    print("\n✗ Interaction does not improve model substantially")

## STEP 8: RESIDUAL DIAGNOSTICS (Model A)
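
Visual checks can be complemented by simple numerical ones. The sketch below, on synthetic residuals, applies a Shapiro-Wilk test for normality and a Spearman correlation between fitted values and absolute residuals as a rough check for non-constant variance. Both choices are assumptions made for illustration, and with about 40 training rows their power is limited.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for fitted values and residuals from a training set of 40 rows
rng = np.random.default_rng(2)
fitted_demo = rng.uniform(0, 5, size=40)
resid_demo = rng.normal(scale=1.0, size=40)

# Normality: a small p-value suggests the residuals depart from a normal distribution
w_stat, p_normal = stats.shapiro(resid_demo)

# Spread: a strong monotone relation between |residual| and fitted value
# would point towards heteroscedasticity
rho, p_spread = stats.spearmanr(fitted_demo, np.abs(resid_demo))

print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {p_normal:.3f}")
print(f"Spearman(|resid|, fitted): rho = {rho:.3f}, p = {p_spread:.3f}")
```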

In [None]:
# Calculate residuals
y_pred_train_a = model_a.predict(X_train_a)
residuals_a = y_train_a - y_pred_train_a

# Create diagnostic plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Residuals vs Fitted
axes[0].scatter(y_pred_train_a, residuals_a, alpha=0.6, edgecolors='k')
axes[0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted Values')

# Plot 2: Histogram of residuals
axes[1].hist(residuals_a, bins=20, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Histogram of Residuals')

# Plot 3: Q-Q plot
stats.probplot(residuals_a, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot')

plt.tight_layout()
plt.savefig('results/minimal/modelA_residual_diagnostics_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n--- Residual Diagnostics ---")
print("Check plots for:")
print("- Residuals vs Fitted: Should show no pattern (random scatter around 0)")
print("- Histogram: Should be roughly normal")
print("- Q-Q Plot: Points should follow diagonal line")

## STEP 9: kNN COMPARISON (Model A)
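
Choosing k by its test-set RMSE makes the reported kNN error optimistic, because the test set has then been used for model selection. One alternative, sketched on synthetic data, is to select k by cross-validation on the training rows and keep the test set for the final comparison only. The data and the pipeline step names `scale` and `knn` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set of 40 countries and 5 predictors
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 5))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=40)

# Scaling happens inside each fold, then kNN is fit on the scaled training fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
cv_rmse = -search.best_score_  # scores are negated RMSE, so flip the sign

print(f"Best k by cross-validation: {best_k} (CV RMSE = {cv_rmse:.4f})")
```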

In [None]:
# Fit kNN models with different k values
k_values = [3, 5, 7]
knn_results_a = {}

print("\n--- kNN Comparison (Model A) ---")
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_a, y_train_a)
    y_pred_knn = knn.predict(X_test_a)
    rmse_knn = np.sqrt(mean_squared_error(y_test_a, y_pred_knn))
    r2_knn = r2_score(y_test_a, y_pred_knn)
    knn_results_a[k] = {'RMSE': rmse_knn, 'R2': r2_knn}
    print(f"k={k}: Test RMSE = {rmse_knn:.4f}, Test R² = {r2_knn:.4f}")

# Find best k (selected on the test set, so its RMSE is an optimistic estimate)
best_k_a = min(knn_results_a, key=lambda k: knn_results_a[k]['RMSE'])
print(f"\nBest k: {best_k_a} (RMSE = {knn_results_a[best_k_a]['RMSE']:.4f})")

# Compare to linear regression
print(f"\nLinear Regression Test RMSE: {test_rmse_a:.4f}")
print(f"Best kNN (k={best_k_a}) Test RMSE: {knn_results_a[best_k_a]['RMSE']:.4f}")

if test_rmse_a < knn_results_a[best_k_a]['RMSE']:
    print("\n✓ Linear Regression performs better")
else:
    print(f"\n✓ kNN (k={best_k_a}) performs better")

## Save Model A Summary

In [None]:
# Save model summary
with open('results/minimal/modelA_summary_minimal.txt', 'w') as f:
    f.write("MODEL A: Land Deals → Climate Cases (MINIMAL - 5 predictors)\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Sample Size: {len(df_a_clean)} countries\n")
    f.write(f"Train/Test Split: {len(X_train_a)}/{len(X_test_a)}\n")
    f.write(f"Observations per predictor: {len(df_a_clean) / len(predictors_a):.1f}\n\n")
    f.write("COEFFICIENTS:\n")
    f.write(coef_df_a.to_string(index=False))
    f.write(f"\n\nIntercept: {model_a.intercept_:.4f}\n\n")
    f.write("PERFORMANCE:\n")
    f.write(f"Train R²: {train_r2_a:.4f}\n")
    f.write(f"Test R²: {test_r2_a:.4f}\n")
    f.write(f"Test RMSE: {test_rmse_a:.4f}\n")
    f.write(f"Test MAE: {test_mae_a:.4f}\n\n")
    f.write("INTERACTION TERM:\n")
    f.write(f"Deal_x_Corruption Coefficient: {interaction_coef:.4f}\n")
    f.write(f"R² with interaction: {train_r2_a_int:.4f}\n")
    f.write(f"Change in R²: {train_r2_a_int - train_r2_a:.4f}\n")

print("\nSaved: results/minimal/modelA_summary_minimal.txt")

---
# MODEL B: Land Deals → ISDS Cases (All Sectors - REVISED)
---

## STEP 1: DATA INSPECTION (Model B)

In [None]:
# Filter countries with ISDS cases (REVISED: using ALL sectors, not just Ag/Mining)
df_b = merged[merged['ISDS_Cases'] > 0].copy()

print(f"Countries with ISDS cases (all sectors): {len(df_b)}")
print(f"\nShape: {df_b.shape}")

# Missing values
print("\n--- Missing Values ---")
print(df_b.isnull().sum()[df_b.isnull().sum() > 0])

# Descriptive statistics
print("\n--- Descriptive Statistics ---")
cols_of_interest_b = ['ISDS_Cases', 'Total_Deal_Size', 'Num_Deals', 'Corruption_Score', 'Press_Freedom_Score']
print(df_b[cols_of_interest_b].describe())

# Check DV distribution
print("\n--- DV Distribution (ISDS_Cases - All Sectors) ---")
print(f"Skewness: {df_b['ISDS_Cases'].skew():.2f}")
print(f"Mean: {df_b['ISDS_Cases'].mean():.2f}, Median: {df_b['ISDS_Cases'].median():.0f}")

In [None]:
# Plot DV distribution (MINIMAL: using all ISDS cases)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original
axes[0].hist(df_b['ISDS_Cases'], bins=20, edgecolor='black')
axes[0].set_title('Original: ISDS Cases (All Sectors)')
axes[0].set_xlabel('ISDS Cases')
axes[0].set_ylabel('Frequency')

# Log-transformed
df_b['ISDS_Cases_Log'] = np.log(df_b['ISDS_Cases'] + 1)
axes[1].hist(df_b['ISDS_Cases_Log'], bins=20, edgecolor='black', color='orange')
axes[1].set_title('Log-Transformed: log(ISDS Cases + 1)')
axes[1].set_xlabel('log(ISDS Cases + 1)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('results/minimal/modelB_dv_distribution_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

## STEP 2-9: Full Pipeline (Model B - REVISED)

Running the same pipeline as Model A, but with ALL ISDS cases as the outcome (not just Ag/Mining sectors).
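
Since both models share the same steps, the pipeline could be wrapped in one function and called once per outcome. The following is a minimal sketch on toy data; the function name `fit_log_count_model`, its return keys, and the toy columns are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def fit_log_count_model(df, dv, predictors, test_size=0.2, random_state=42):
    """Drop rows with missing predictors, log-transform the DV, split, scale, and fit OLS."""
    clean = df.dropna(subset=predictors)
    X, y = clean[predictors], np.log1p(clean[dv])  # log1p(x) equals log(x + 1)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler().fit(X_tr)  # fit on training rows only
    model = LinearRegression().fit(scaler.transform(X_tr), y_tr)
    pred = model.predict(scaler.transform(X_te))
    return {
        "n": len(clean),
        "coef": pd.Series(model.coef_, index=predictors),
        "train_r2": model.score(scaler.transform(X_tr), y_tr),
        "test_r2": r2_score(y_te, pred),
        "test_rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    }

# Toy data: a count outcome driven by x1, with one missing predictor value
rng = np.random.default_rng(4)
toy = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
toy["cases"] = rng.poisson(np.exp(1 + 0.5 * toy["x1"]))
toy.loc[0, "x2"] = np.nan

result = fit_log_count_model(toy, "cases", ["x1", "x2"])
print(result["n"], round(result["test_rmse"], 3))
print(result["coef"])
```

Calling such a helper with `'Climate_Cases'` and `'ISDS_Cases'` would keep the two models on identical preprocessing, which makes the later coefficient comparison easier to defend.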

In [None]:
# STEP 2: CORRELATION CHECK (MINIMAL: 5 predictors)
predictors_b = ['Total_Deal_Size', 'Num_Deals',
                'Corruption_Score', 'Press_Freedom_Score',
                'log_Population']

df_b_clean = df_b.dropna(subset=predictors_b)
print(f"After dropping missing: {len(df_b_clean)} countries")
print(f"Observations per predictor: {len(df_b_clean) / len(predictors_b):.1f}\n")

corr_matrix_b = df_b_clean[predictors_b].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix_b, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Model B Predictors (MINIMAL)', fontsize=14)
plt.tight_layout()
plt.savefig('results/minimal/modelB_correlation_matrix_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

# STEP 3: PREPARE DATA
y_b = df_b_clean['ISDS_Cases_Log']
X_b = df_b_clean[predictors_b]

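# Split row labels first so the scaler is fit on training rows only (avoids leakage)
train_idx_b, test_idx_b = train_test_split(
    X_b.index.to_numpy(), test_size=0.2, random_state=42
)

scaler_b = StandardScaler().fit(X_b.loc[train_idx_b])
X_b_scaled = pd.DataFrame(scaler_b.transform(X_b), columns=X_b.columns, index=X_b.index)

X_train_b, X_test_b = X_b_scaled.loc[train_idx_b], X_b_scaled.loc[test_idx_b]
y_train_b, y_test_b = y_b.loc[train_idx_b], y_b.loc[test_idx_b]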

print(f"Training set: {len(X_train_b)}, Test set: {len(X_test_b)}")

# STEP 4: FIT MODEL
model_b = LinearRegression()
model_b.fit(X_train_b, y_train_b)

coef_df_b = pd.DataFrame({
    'Feature': X_train_b.columns,
    'Coefficient': model_b.coef_
}).sort_values('Coefficient', ascending=False)

print("\n--- MODEL B: Land Deals → ISDS Cases (All Sectors - MINIMAL) ---")
print(coef_df_b.to_string(index=False))

train_r2_b = model_b.score(X_train_b, y_train_b)
print(f"\nTrain R²: {train_r2_b:.4f}")

# STEP 5: TEST EVALUATION
y_pred_b = model_b.predict(X_test_b)
test_r2_b = r2_score(y_test_b, y_pred_b)
test_rmse_b = np.sqrt(mean_squared_error(y_test_b, y_pred_b))
test_mae_b = mean_absolute_error(y_test_b, y_pred_b)

print(f"\nTest R²: {test_r2_b:.4f}")
print(f"Test RMSE: {test_rmse_b:.4f}")
print(f"Test MAE: {test_mae_b:.4f}")

# STEP 6: FEATURE IMPORTANCE
coef_df_b_sorted = coef_df_b.sort_values('Coefficient')
colors_b = ['green' if x > 0 else 'red' for x in coef_df_b_sorted['Coefficient']]

plt.figure(figsize=(10, 6))
plt.barh(coef_df_b_sorted['Feature'], coef_df_b_sorted['Coefficient'], color=colors_b, edgecolor='black')
plt.xlabel('Standardized Coefficient', fontsize=12)
plt.title('Feature Importance: Model B (ISDS Cases - All Sectors) - MINIMAL', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.savefig('results/minimal/modelB_feature_importance_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

# STEP 7: INTERACTION
X_b_scaled['Deal_x_Corruption'] = X_b_scaled['Total_Deal_Size'] * X_b_scaled['Corruption_Score']
X_train_b_int, X_test_b_int, y_train_b_int, y_test_b_int = train_test_split(
    X_b_scaled, y_b, test_size=0.2, random_state=42
)

model_b_int = LinearRegression()
model_b_int.fit(X_train_b_int, y_train_b_int)
train_r2_b_int = model_b_int.score(X_train_b_int, y_train_b_int)

print(f"\nInteraction coefficient: {model_b_int.coef_[-1]:.4f}")
print(f"R² change: {train_r2_b_int - train_r2_b:.4f}")

# STEP 8: RESIDUALS
y_pred_train_b = model_b.predict(X_train_b)
residuals_b = y_train_b - y_pred_train_b

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes[0].scatter(y_pred_train_b, residuals_b, alpha=0.6, edgecolors='k')
axes[0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted Values')

axes[1].hist(residuals_b, bins=15, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Histogram of Residuals')

stats.probplot(residuals_b, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot')

plt.tight_layout()
plt.savefig('results/minimal/modelB_residual_diagnostics_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

# STEP 9: kNN
knn_results_b = {}
print("\n--- kNN Comparison (Model B) ---")
for k in [3, 5, 7]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_b, y_train_b)
    y_pred_knn = knn.predict(X_test_b)
    rmse_knn = np.sqrt(mean_squared_error(y_test_b, y_pred_knn))
    knn_results_b[k] = rmse_knn
    print(f"k={k}: RMSE = {rmse_knn:.4f}")

best_k_b = min(knn_results_b, key=knn_results_b.get)
print(f"\nBest kNN: k={best_k_b} (RMSE={knn_results_b[best_k_b]:.4f})")
print(f"Linear Regression RMSE: {test_rmse_b:.4f}")

# Save summary
with open('results/minimal/modelB_summary_minimal.txt', 'w') as f:
    f.write("MODEL B: Land Deals → ISDS Cases (All Sectors - MINIMAL - 5 predictors)\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Sample Size: {len(df_b_clean)} countries\n")
    f.write(f"Observations per predictor: {len(df_b_clean) / len(predictors_b):.1f}\n\n")
    f.write(coef_df_b.to_string(index=False))
    f.write(f"\n\nTrain R²: {train_r2_b:.4f}\n")
    f.write(f"Test R²: {test_r2_b:.4f}\n")
    f.write(f"Test RMSE: {test_rmse_b:.4f}\n")

print("\nSaved: results/minimal/modelB_summary_minimal.txt")

---
# COMPARATIVE ANALYSIS: Model A vs Model B
---

In [None]:
# Create comparison table (MINIMAL)
comparison = pd.DataFrame({
    'Dataset': ['Model A: Climate Cases', 'Model B: ISDS Cases (All Sectors)'],
    'N': [len(df_a_clean), len(df_b_clean)],
    'Predictors': [len(predictors_a), len(predictors_b)],
    'Obs/Predictor': [len(df_a_clean)/len(predictors_a), len(df_b_clean)/len(predictors_b)],
    'Train R²': [train_r2_a, train_r2_b],
    'Test R²': [test_r2_a, test_r2_b],
    'Test RMSE': [test_rmse_a, test_rmse_b],
    'Best kNN k': [best_k_a, best_k_b],
    'Best kNN RMSE': [knn_results_a[best_k_a]['RMSE'], knn_results_b[best_k_b]],
    'Best Model': ['Linear Reg' if test_rmse_a < knn_results_a[best_k_a]['RMSE'] else f'kNN (k={best_k_a})',
                   'Linear Reg' if test_rmse_b < knn_results_b[best_k_b] else f'kNN (k={best_k_b})']
})

print("\n" + "=" * 80)
print("FINAL COMPARISON: Model A vs Model B (MINIMAL - 5 PREDICTORS)")
print("=" * 80)
print(comparison.to_string(index=False))

# Save comparison
comparison.to_csv('results/minimal/model_comparison_minimal.csv', index=False)
print("\nSaved: results/minimal/model_comparison_minimal.csv")

In [None]:
# Coefficient comparison plot (MINIMAL)
fig, ax = plt.subplots(figsize=(12, 8))

# Combine coefficients
coef_comparison = pd.merge(
    coef_df_a[['Feature', 'Coefficient']].rename(columns={'Coefficient': 'Climate Cases'}),
    coef_df_b[['Feature', 'Coefficient']].rename(columns={'Coefficient': 'ISDS Cases'}),
    on='Feature'
)

x = np.arange(len(coef_comparison))
width = 0.35

ax.barh(x - width/2, coef_comparison['Climate Cases'], width, label='Model A: Climate', color='steelblue', edgecolor='black')
ax.barh(x + width/2, coef_comparison['ISDS Cases'], width, label='Model B: ISDS (All Sectors)', color='coral', edgecolor='black')

ax.set_yticks(x)
ax.set_yticklabels(coef_comparison['Feature'])
ax.set_xlabel('Standardized Coefficient', fontsize=12)
ax.set_title('Coefficient Comparison: Climate vs ISDS (MINIMAL - 5 Predictors)', fontsize=14, fontweight='bold')
ax.axvline(0, color='black', linestyle='--', linewidth=0.8)
ax.legend()
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('results/minimal/coefficient_comparison_minimal.png', dpi=300, bbox_inches='tight')
plt.show()

## Key Findings Summary

In [None]:
print("\n" + "=" * 80)
print("KEY FINDINGS (MINIMAL - 5 PREDICTORS)")
print("=" * 80)

print("\n1. MODEL PERFORMANCE:")
print(f"   - Climate Cases: Test R² = {test_r2_a:.3f}, RMSE = {test_rmse_a:.3f}")
print(f"   - ISDS Cases: Test R² = {test_r2_b:.3f}, RMSE = {test_rmse_b:.3f}")

if test_r2_a > test_r2_b:
    print("   → The 5-predictor model explains more test-set variance for climate litigation than for ISDS cases")
else:
    print("   → The 5-predictor model explains more test-set variance for ISDS cases than for climate litigation")

print("\n2. TOP PREDICTORS:")
print("   Model A (Climate):")
print(f"   - Largest coefficient: {coef_df_a.iloc[0]['Feature']} ({coef_df_a.iloc[0]['Coefficient']:.3f})")
print(f"   - Smallest coefficient: {coef_df_a.iloc[-1]['Feature']} ({coef_df_a.iloc[-1]['Coefficient']:.3f})")

print("\n   Model B (ISDS - All Sectors):")
print(f"   - Largest coefficient: {coef_df_b.iloc[0]['Feature']} ({coef_df_b.iloc[0]['Coefficient']:.3f})")
print(f"   - Smallest coefficient: {coef_df_b.iloc[-1]['Feature']} ({coef_df_b.iloc[-1]['Coefficient']:.3f})")

print("\n3. INTERACTION EFFECTS:")
print(f"   - Model A (Deal × Corruption): ΔR² = {train_r2_a_int - train_r2_a:.4f}")
print(f"   - Model B (Deal × Corruption): ΔR² = {train_r2_b_int - train_r2_b:.4f}")

print("\n4. MINIMAL PREDICTOR SET (5 variables):")
print("   Removed (poor coverage or multicollinearity):")
print("     ✗ Rule_of_Law_Score_Pct (46.6% coverage)")
print("     ✗ Avg_Deal_Size (multicollinear)")
print("     ✗ Prop_Agriculture (52% missing in Model A)")
print("     ✗ Literacy_Rate_Pct (26% missing in Model A)")
print("\n   Kept (core theory + control):")
print("     ✓ Total_Deal_Size")
print("     ✓ Num_Deals")
print("     ✓ Corruption_Score")
print("     ✓ Press_Freedom_Score")
print("     ✓ log_Population")

print("\n5. SAMPLE SIZE IMPROVEMENTS:")
print(f"   Model A: {len(df_a_clean)} countries (was 23 with 7 predictors)")
print(f"   Model B: {len(df_b_clean)} countries (was 72 with 7 predictors)")
print(f"   Observations per predictor:")
print(f"     - Model A: {len(df_a_clean)/len(predictors_a):.1f} (was 3.3)")
print(f"     - Model B: {len(df_b_clean)/len(predictors_b):.1f} (was 10.3)")

print("\n" + "=" * 80)
print("\nMinimal analysis complete! Check results/minimal/ folder for all outputs.")
print("For the 7-predictor version, see: RQ2_Land_Deals_to_Litigation_Analysis_REVISED.ipynb")
print("=" * 80)