# **Bank Ensemble: V-Blend**

https://www.kaggle.com/code/stpeteishii/bank-ensemble-h-blend

https://www.kaggle.com/code/stpeteishii/bank-ensemble-v-blend

https://www.kaggle.com/code/nina2025/ps-s5e8-v-blend

https://www.kaggle.com/code/nina2025/ps-s5e8-binary-classification-hv-blend-bokeh-copy

---
## **Blending Pipeline Overview**

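This pipeline implements an **adaptive weighted blending** method. For each sample it picks a weight set from the spread between the models' predictions (`mx-m`), then adjusts each model's weight according to where its prediction ranks within that sample.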

### Pipeline Stages:

1. **Data Ingestion**
   - Load multiple submission files (CSV format)
   - Each file contains predictions from different models
   - Files are identified by their performance scores (e.g., 0.97772, 0.97771)

2. **Data Merging**
   - Combine all predictions into a single DataFrame
   - Use a common ID column to align predictions
   - Each model's predictions become separate columns

3. **Confidence Assessment**
   - Calculate `mx-m` (maximum minus minimum) for each row
   - This represents the spread/disagreement between models
   - Categorize each row into a confidence tier:
     - **Tier 1**: High confidence (0.00 < mx-m ≤ 0.10)
     - **Tier 2**: Medium confidence (0.10 < mx-m ≤ 0.20)
     - **Tier 3**: Low confidence (mx-m > 0.20); in the code this is the fallback branch, so rows with mx-m = 0 also use the Tier 3 weights

4. **Adaptive Weighting**
   - **Base Weights**: Static weights assigned to each model
   - **Position Weights**: Dynamic weights based on prediction ranking
   - **Three Weight Sets**: Different weight combinations for each confidence tier

5. **Blending Execution**
   - **Descending Sort**: Rank each row's model predictions from highest to lowest, then blend
   - **Ascending Sort**: Rank each row's model predictions from lowest to highest, then blend
   - **Final Blend**: 70% descending + 30% ascending blend
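
Stages 2 and 3 can be sketched in vectorized pandas as follows. This is a minimal illustration rather than the notebook's implementation: the three toy submissions and the column names `a`, `b`, `c` are hypothetical, and the tier boundaries mirror the notebook's `if/elif/else` chain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy submissions; the real ones are CSV files with `id` and `y`.
sub_a = pd.DataFrame({"id": [1, 2, 3], "y": [0.90, 0.50, 0.30]})
sub_b = pd.DataFrame({"id": [1, 2, 3], "y": [0.85, 0.35, 0.30]})
sub_c = pd.DataFrame({"id": [1, 2, 3], "y": [0.88, 0.38, 0.30]})

# Stage 2: one column per model, aligned on the shared id column.
merged = sub_a.rename(columns={"y": "a"})
for name, sub in [("b", sub_b), ("c", sub_c)]:
    merged = merged.merge(sub.rename(columns={"y": name}), on="id")

# Stage 3: per-row spread between the highest and lowest prediction.
model_cols = ["a", "b", "c"]
merged["mx-m"] = merged[model_cols].max(axis=1) - merged[model_cols].min(axis=1)

# Tier selection: (0, 0.10] -> 1, (0.10, 0.20] -> 2,
# everything else (including a spread of exactly 0) -> 3.
spread = merged["mx-m"]
merged["tier"] = np.select(
    [(spread > 0.0) & (spread <= 0.10), (spread > 0.10) & (spread <= 0.20)],
    [1, 2],
    default=3,
)
print(merged)
```

The third toy row has identical predictions from all models, which shows how a zero spread falls through to the Tier 3 branch.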

### The Blending Method: Adaptive Position-Based Weighting

**Core Concept**: The blend doesn't use fixed weights. Instead, it considers both the model's inherent quality AND its relative position in the prediction ranking for each specific sample.

**Weight Components:**
1. **Model Quality Weight**: Fixed weight reflecting overall model performance
2. **Position Bonus/Penalty**: Adjustments based on where the model ranks for this specific prediction

**Mathematical Formulation:**
```
Final_Weight = Base_Weight + Position_Adjustment

Final_Prediction = Σ (Model_Prediction × Final_Weight)
```
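
As a worked example of the formula, the sketch below applies it to a single row. The predictions are the rounded values printed for id 750000 in the notebook output, the weights are the high-confidence set from `params_v16`, and the helper name `position_blend` is hypothetical. It is an illustration of the arithmetic, not the notebook's code.

```python
import numpy as np

def position_blend(preds, base, adj, descending=True):
    """Weighted sum where model j gets base[j] + adj[position of model j]."""
    preds = np.asarray(preds, dtype=float)
    # Stable sort so that tied predictions keep their original model order.
    order = np.argsort(-preds if descending else preds, kind="stable")
    rank = np.empty(len(preds), dtype=int)
    rank[order] = np.arange(len(preds))  # rank[j] = position of model j
    weights = np.asarray(base) + np.asarray(adj)[rank]
    return float(np.dot(preds, weights))

preds = [0.0072, 0.0048, 0.0006]   # models 0.97772, 0.97771, 0.97770
base = [0.85, 0.14, 0.01]          # `subm` base weights
adj = [+0.03, -0.02, -0.01]        # `subwts` position adjustments

desc = position_blend(preds, base, adj, descending=True)
asc = position_blend(preds, base, adj, descending=False)
final = 0.70 * desc + 0.30 * asc   # the `desc` and `asc` parameters
print(round(desc, 6), round(asc, 6), round(final, 6))
```

In the descending pass the effective weights are 0.88, 0.12, and 0.00, giving about 0.0069, which matches the `ensemble` value printed for that row. In the ascending pass the positions are reversed, so the same adjustments land on different models.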

**Three Confidence-Based Strategies:**
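- **High Confidence**: Base weights concentrated on the best-scoring model, with small position adjustments
- **Medium Confidence**: Base weights spread more evenly across models, with moderate position adjustments
- **Low Confidence**: Intended for more conservative blending; in the example parameters it reuses the high-confidence base weights with a slightly different adjustment vector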

### Key Advantages:

1. **Context-Aware**: Adapts to prediction uncertainty
2. **Robust**: Handles cases where models strongly disagree
3. **Performance-Tuned**: Different strategies for different confidence levels
4. **Explainable**: You can see which models contributed most for each prediction

### Parameter Structure:

The method uses three weight sets:
- `subwts`: Position adjustments for high confidence
- `subwts2`: Position adjustments for medium confidence  
- `subwts3`: Position adjustments for low confidence

Each model entry in `subm`, `subm2`, and `subm3` (one list per confidence tier) has:
- `name`: File identifier
- `weight`: Base model quality weight
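
One property worth checking when tuning these parameters is that the weights stay normalized. Each position adjustment is used exactly once per row, so the total weight applied to a row is the sum of the base weights plus the sum of the adjustments, regardless of the ranking. The sketch below checks this for the values in `params_v16`; the helper name `total_weight` is hypothetical.

```python
import math

# Base weights and position adjustments copied from the notebook's params_v16.
tiers = {
    "high":   ([0.85, 0.14, 0.01], [+0.03, -0.02, -0.01]),  # subm,  subwts
    "medium": ([0.79, 0.19, 0.02], [+0.04, -0.01, -0.03]),  # subm2, subwts2
    "low":    ([0.85, 0.14, 0.01], [+0.03, -0.01, -0.02]),  # subm3, subwts3
}

def total_weight(base, adj):
    """Per-row weight total: every position is used exactly once per row."""
    return sum(base) + sum(adj)

totals = {name: total_weight(base, adj) for name, (base, adj) in tiers.items()}
all_normalized = all(math.isclose(t, 1.0, abs_tol=1e-9) for t in totals.values())
print(totals, all_normalized)
```

If a tier's total drifted away from 1, blended probabilities in that tier would be systematically scaled up or down relative to the other tiers.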

### Output Features:

1. **Detailed Analysis**: Shows predictions, confidence scores, and ranking
2. **Visualization**: Position distribution charts for model rankings
3. **Final Submission**: Blended predictions ready for competition submission
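
The position distribution can also be tabulated without plotting. The sketch below counts how often each model appears at each position; the `rankings` list is hypothetical toy data standing in for the per-row `alls` lists.

```python
from collections import Counter

# Hypothetical per-row rankings, in the same shape as the `alls` column.
rankings = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]

n_positions = len(rankings[0])
# counts[p][model] = number of rows where `model` sits at position p.
counts = [Counter(row[p] for row in rankings) for p in range(n_positions)]

for p, counter in enumerate(counts, start=1):
    print(f"Position {p}: {dict(counter)}")
```

A model that almost always occupies the same position receives nearly the same adjustment on every row, so its position weight behaves like a fixed offset to its base weight.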

This method is intended for competitions where different models excel in different regions of the prediction space, since the per-sample weights let each ensemble member contribute more where it ranks favorably.

---

```python
import ast
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def v_blend(path_to_ds, file_short_names, dk):
    """Blend submissions with tiered, position-adjusted weights.

    `path_to_ds` and `file_short_names` are kept for call compatibility but are
    unused; the file path and names are read from `dk`.
    """

    def read(dk, i):
        # Load submission i and rename its prediction column to the file name.
        tnm = dk["subm"][i]["name"]
        FiN = dk["path"] + tnm + ".csv"
        df = pd.read_csv(FiN)
        # Handle column renaming more robustly
        if 'target' in df.columns:
            df = df.rename(columns={'target': tnm})
        if dk["target"] in df.columns and dk["target"] != tnm:
            df = df.rename(columns={dk["target"]: tnm})
        return df
        
    def merge(dfs_subm):
        # Inner-join all submissions on the id column.
        if not dfs_subm:
            return pd.DataFrame()
        df_subms = dfs_subm[0]
        for i in range(1, len(dfs_subm)):
            df_subms = pd.merge(df_subms, dfs_subm[i], on=[dk['id']])
        return df_subms
        
    def da(dk, sorting_direction):
        # Run one blending pass with the given sorting direction ('desc' or 'asc').
        
        dfs_subm = [read(dk, i) for i in range(len(dk["subm"]))]
        df_subms = merge(dfs_subm)
        
        if df_subms.empty:
            print("No data found. Check file paths and names.")
            return pd.DataFrame()
            
        cols = [col for col in df_subms.columns if col != dk['id']]
        short_name_cols = cols.copy()  # Use actual column names
        
        def alls(x, sd=sorting_direction, cs=cols):
            # Model names ordered by their prediction for this row.
            reverse = sd == 'desc'
            tes = {c: x[c] for c in cs}.items()
            subms_sorted = [t[0] for t in sorted(tes, key=lambda k: k[1], reverse=reverse)]
            return subms_sorted
            
        def summa(x, cs, wts, ic_alls): 
            # Weighted sum: base weight of model j plus the adjustment for its position.
            return sum([x[cs[j]] * (wts[0][j] + wts[1][ic_alls[j]]) for j in range(len(cs))])
            
        # One [base weights, position adjustments] pair per confidence tier.
        wts = [
            [[e['weight'] for e in dk["subm"]], dk["subwts"]],
            [[e['weight'] for e in dk["subm2"]], dk["subwts2"]],
            [[e['weight'] for e in dk["subm3"]], dk["subwts3"]],
        ]
        
        def correct(x, cs=cols, wts=wts):
            # Position of each model in this row's ranking, then tier selection by spread.
            i = [x['alls'].index(c) for c in short_name_cols]
            if 0.00 < x['mx-m'] <= 0.10: 
                return summa(x, cs, wts[0], i)
            elif 0.10 < x['mx-m'] <= 0.20: 
                return summa(x, cs, wts[1], i)
            else:                          
                # Fallback: spread above 0.20, or a spread of exactly 0.
                return summa(x, cs, wts[2], i)
                
        def amxm(x, cs=cols):
            # Spread between the highest and lowest prediction in this row.
            list_values = [x[c] for c in cs]
            return max(list_values) - min(list_values)
            
        df_subms['mx-m'] = df_subms.apply(lambda x: amxm(x), axis=1)
        df_subms['alls'] = df_subms.apply(lambda x: alls(x), axis=1)
        df_subms[dk["target"]] = df_subms.apply(lambda x: correct(x), axis=1)
        
        # Rename columns for display
        schema_rename = {old_nc: new_shnc for old_nc, new_shnc in zip(cols, short_name_cols)}
        df_subms = df_subms.rename(columns=schema_rename)
        df_subms = df_subms.rename(columns={dk["target"]: "ensemble"})
        
        # Add separator column
        df_subms.insert(loc=1, column=' _ ', value=['   '] * len(df_subms))
        df_subms[' _ '] = df_subms[' _ '].astype(str)
        
        # Configure display options
        pd.set_option('display.max_rows', 100)
        pd.set_option('display.float_format', '{:.4f}'.format)
        
        # Select and order columns for display
        vcols = [dk['id']] + [' _ '] + short_name_cols + [' _ '] + ['mx-m'] + [' _ '] + ['alls'] + [' _ '] + ['ensemble']
        df_subms = df_subms[vcols]
        
        # Display first 5 rows
        print(df_subms.head(5))
        
        # Reset float format and prepare for export
        pd.set_option('display.float_format', '{:.5f}'.format)
        df_subms = df_subms.rename(columns={"ensemble": dk["target"]})
        df_subms.to_csv(f'tida_{sorting_direction}.csv', index=False)
        
        return df_subms[[dk['id'], dk['target']]]
   
    def ensemble_da(dk): 
        dfD = da(dk, 'desc')
        dfA = da(dk, 'asc')
        
        if dfD.empty or dfA.empty:
            print("Error: One or both DataFrames are empty")
            return pd.DataFrame()
            
        # Combine desc and asc results
        dfA[dk['target']] = (dk['desc'] * dfD[dk['target']] + 
                             dk['asc'] * dfA[dk['target']])
        return dfA
    
    return ensemble_da(dk)


def data_in_col(i, matrix):
    return [row[i] for row in matrix]


def info_mx_m(df):
    # Plot, for each ranking position, how often each model occupies it.
    # Expects an `alls` column holding list literals as strings, as found in
    # the exported tida_desc.csv / tida_asc.csv files.
    if df.empty:
        print("DataFrame is empty")
        return
        
    try:
        matrix = [ast.literal_eval(row.alls) for row in df.itertuples()]
        subms = sorted(matrix[0])
        
        df_subms = pd.DataFrame({f'col_{i}': data_in_col(i, matrix) for i in range(len(subms))})
        
        fig, axes = plt.subplots(ncols=len(subms), figsize=(12, 3))
        if len(subms) == 1:
            axes = [axes]
            
        for i in range(len(subms)):
            sns.countplot(x=df_subms[f"col_{i}"], ax=axes[i])
            axes[i].set_title(f'Position {i+1}')
            
        plt.tight_layout()
        plt.show()
        
    except Exception as e:
        print(f"Error in info_mx_m: {e}")


# Main execution
if __name__ == "__main__":
    path = '/kaggle/input/30-august-2025-ps-s5e8/' + 'submission '
    fins = ['0.97772', '0.97771', '0.97770']

    params_v16 = {
        'path': path,
        'id': 'id',
        'target': "y",
        'desc': 0.70,
        'asc': 0.30,
        'subwts': [+0.03, -0.02, -0.01],
        'subm': [
            {'name': fins[0], 'weight': 0.85},
            {'name': fins[1], 'weight': 0.14},
            {'name': fins[2], 'weight': 0.01},
        ],
        'subm2': [
            {'name': fins[0], 'weight': 0.79},
            {'name': fins[1], 'weight': 0.19},
            {'name': fins[2], 'weight': 0.02},
        ],
        'subwts2': [+0.04, -0.01, -0.03],
        'subm3': [
            {'name': fins[0], 'weight': 0.85},
            {'name': fins[1], 'weight': 0.14},
            {'name': fins[2], 'weight': 0.01},
        ],
        'subwts3': [+0.03, -0.01, -0.02]
    }

    params = params_v16

    df = v_blend(path, fins, params)

    if not df.empty:
        df.to_csv('submission.csv', index=False)
        print(df.head())
        
        # Optional: analyze the ranking positions. The returned df only has the
        # id and target columns, so read an exported file that keeps `alls`:
        # info_mx_m(pd.read_csv('tida_desc.csv'))
    else:
        print("No output generated. Check your input files and parameters.")
```

Output of the descending pass (first five rows):

```
       id   _   0.97772  0.97771  0.97770   _    mx-m   _   \
0  750000        0.0072   0.0048   0.0006      0.0066        
1  750001        0.0765   0.0441   0.0120      0.0645        
2  750002        0.0007   0.0005   0.0001      0.0006        
3  750003        0.0006   0.0005   0.0001      0.0005        
4  750004        0.0091   0.0056   0.0012      0.0079        

                          alls   _   ensemble  
0  [0.97772, 0.97771, 0.97770]         0.0069  
1  [0.97772, 0.97771, 0.97770]         0.0727  
2  [0.97772, 0.97771, 0.97770]         0.0007  
3  [0.97772, 0.97771, 0.97770]         0.0006  
4  [0.97772, 0.97771, 0.97770]         0.0087  
```

---

## Comparison Table: v_blend vs h_blend

| Aspect | v_blend (Vertical/Adaptive Blending) | h_blend (Horizontal/Static Blending) |
|--------|--------------------------------------|--------------------------------------|
| **Core Approach** | Adaptive, sample-level weighting | Static, fixed weighting across all samples |
| **Weight Application** | Dynamic weights per prediction based on confidence | Fixed weights for all predictions |
| **Confidence Sensitivity** | ✅ Uses prediction spread (mx-m) to adjust weights | ❌ Ignores prediction confidence levels |
| **Weight Components** | Base weight + position adjustment | Single fixed weight per model |
| **Weight Sets** | Multiple sets (3 tiers for different confidence levels) | Single weight set |
| **Sorting Direction** | Both ascending and descending blends combined | Typically single direction or no sorting |
| **Complexity** | High - adaptive logic per sample | Low - simple weighted average |
| **Computational Cost** | Higher due to per-sample calculations | Lower - simple arithmetic operations |
| **Flexibility** | Adapts to different prediction scenarios | Rigid - same blending for all cases |
| **Parameter Tuning** | Complex - multiple weight sets to optimize | Simple - fewer parameters to tune |
| **Use Case** | When models have varying performance across data regions | When models perform consistently across dataset |
| **Error Handling** | Better handles model disagreements | Can be skewed by outlier predictions |
| **Implementation** | Requires ranking and confidence calculations | Simple weighted average formula |
| **Output Analysis** | Provides detailed confidence metrics | Basic prediction output only |
| **Visualization** | Includes ranking distribution charts | Limited or no visualization |
| **Best For** | Competitions with diverse prediction patterns | Stable, well-behaved model ensembles |
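
For contrast, a static blend of the kind the table describes can be sketched in a few lines. The h_blend notebook's code is not shown here, so this assumes it reduces to a fixed-weight average; the function name `static_blend` is hypothetical, and the weights are borrowed from the v_blend high-confidence base weights purely for illustration.

```python
import numpy as np

def static_blend(pred_matrix, weights):
    """Fixed-weight average: the same weights are applied to every row."""
    pred_matrix = np.asarray(pred_matrix, dtype=float)  # shape (n_samples, n_models)
    weights = np.asarray(weights, dtype=float)
    return pred_matrix @ (weights / weights.sum())      # normalize, then average

# Rounded predictions for id 750000 from the v_blend output above.
preds = [[0.0072, 0.0048, 0.0006]]
blended = static_blend(preds, [0.85, 0.14, 0.01])
print(blended)
```

With no position adjustment and no tiers, the only tunable parameters are the per-model weights, which is what makes this approach simpler to tune and harder to overfit.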

---

## Performance Scenarios

| Scenario | v_blend Performance | h_blend Performance |
|----------|---------------------|---------------------|
| **Models agree strongly** | ⭐⭐⭐⭐⭐ (Optimal weighting) | ⭐⭐⭐⭐⭐ (Works well) |
| **Models disagree moderately** | ⭐⭐⭐⭐⭐ (Adapts weights) | ⭐⭐ (Static weights may fail) |
| **Models disagree strongly** | ⭐⭐⭐⭐ (Conservative blending) | ⭐ (Poor performance) |
| **Mixed confidence levels** | ⭐⭐⭐⭐⭐ (Handles variety) | ⭐⭐ (One-size-fits-all fails) |
| **Simple ensemble** | ⭐⭐⭐ (Overkill) | ⭐⭐⭐⭐⭐ (Perfect fit) |

---

## When to Use Each

### Choose v_blend when:
- ✅ Models have specialized strengths in different data regions
- ✅ Prediction confidence varies significantly across samples
- ✅ You need robust handling of model disagreements
- ✅ Competition performance demands sophisticated blending
- ✅ You have time for parameter optimization

### Choose h_blend when:
- ✅ Models perform consistently across all data
- ✅ You need a simple, fast solution
- ✅ Model predictions are generally aligned
- ✅ You're working with well-correlated models
- ✅ Quick implementation is priority

---

## Practical Considerations

**v_blend Advantages:**
- Better handling of edge cases
- More explainable results (you see which models "win" per sample)
- Higher potential peak performance

**v_blend Disadvantages:**
- More complex to implement and debug
- Higher risk of overfitting to validation set
- Requires more careful parameter tuning

**h_blend Advantages:**
- Simple to implement and maintain
- Less prone to overfitting
- Faster computation
- Easier to explain to others

**h_blend Disadvantages:**
- Can't adapt to varying prediction scenarios
- May underutilize model specialties
- Limited performance ceiling

The choice depends on your specific competition context, model characteristics, and available optimization time.